CommonLanguage Dataset [download]
This dataset contains speech recordings in 45 languages, carefully selected from the CommonVoice database. The audio recordings total 45.1 hours. The data is already split into train, dev (validation), and test sets.
Statistics of CommonLanguage:
| Statistic | Train | Dev | Test |
|---|---|---|---|
| # of utterances | 22194 | 5888 | 5963 |
| # unique speakers | 11189 | 1297 | 1322 |
| Total duration, hr | 30.04 | 7.53 | 7.53 |
| Min duration, sec | 0.86 | 0.98 | 0.89 |
| Mean duration, sec | 4.87 | 4.61 | 4.55 |
| Max duration, sec | 21.72 | 105.67 | 29.83 |
| Duration per language, min | ~40 | ~10 | ~10 |
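The duration figures in the table can be cross-checked with a little arithmetic. The sketch below is only an illustration: it copies the per-split totals from the table, sums them, and derives the average amount of audio per language, assuming the 45 languages are balanced within each split. The variable names are made up for this example.

```python
# Per-split total durations in hours, copied from the table above.
split_hours = {"train": 30.04, "dev": 7.53, "test": 7.53}
NUM_LANGUAGES = 45

# Overall duration of the dataset in hours.
total_hours = sum(split_hours.values())

# Average minutes of audio per language in each split,
# assuming the languages are evenly balanced.
minutes_per_language = {
    split: hours * 60 / NUM_LANGUAGES for split, hours in split_hours.items()
}

print(f"total: {total_hours:.1f} h")
for split, minutes in minutes_per_language.items():
    print(f"{split}: about {minutes:.0f} min per language")
```

The sum agrees with the stated 45.1 hours, and the per-language averages land close to the ~40 and ~10 minutes listed in the last row.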
List of languages:
- Arabic
- Basque
- Breton
- Catalan
- Chinese_China
- Chinese_Hongkong
- Chinese_Taiwan
- Chuvash
- Czech
- Dhivehi
- Dutch
- English
- Esperanto
- Estonian
- French
- Frisian
- Georgian
- German
- Greek
- Hakha_Chin
- Indonesian
- Interlingua
- Italian
- Japanese
- Kabyle
- Kinyarwanda
- Kyrgyz
- Latvian
- Maltese
- Mongolian (spelled "Mangolian" in the dataset labels)
- Persian
- Polish
- Portuguese
- Romanian
- Romansh_Sursilvan
- Russian
- Sakha
- Slovenian
- Spanish
- Swedish
- Tamil
- Tatar
- Turkish
- Ukrainian
- Welsh
Other information
In addition to the language label, each utterance is annotated with the speaker's age and gender and with a transcription of the utterance.
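As a rough sketch of how such a datapoint might be represented in code, the example below defines a small record type and groups a few records by language. The field names (`audio_path`, `language`, `age`, `gender`, `transcription`) and the sample values are hypothetical and chosen for illustration; they are not the dataset's actual schema or contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Utterance:
    """One datapoint; field names are hypothetical, not the dataset's schema."""
    audio_path: str
    language: str
    age: str
    gender: str
    transcription: str


# Made-up records for illustration only; they are not taken from the dataset.
samples = [
    Utterance("clip_0001.wav", "Basque", "twenties", "female", "placeholder text"),
    Utterance("clip_0002.wav", "Welsh", "thirties", "male", "placeholder text"),
    Utterance("clip_0003.wav", "Basque", "forties", "male", "placeholder text"),
]

# Count utterances per language label.
per_language = Counter(u.language for u in samples)

# Select utterances by a metadata field, here the gender annotation.
female_only = [u for u in samples if u.gender == "female"]

print(per_language)
print(len(female_only))
```

Keeping the metadata alongside the language label in one record makes it straightforward to filter or stratify the data by speaker attributes.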