Common Voice Dataset Analyzer
Part of CV-ToolBox
This tool is aimed at language communities on Common Voice and those who train models on these datasets. Its main purpose is to show the general and detailed statistical characteristics of the datasets, with special emphasis on their health and diversity, so that communities can direct their efforts toward correcting problem areas. The data presented here are the results of lengthy offline calculations. The tool currently covers the most important measures; new ones will be added over time.
Because the data is rather large, it is partitioned, and the relevant portion is loaded whenever you click a language-version pair. The table on the browse page lists all languages Common Voice supports, along with the versions this application covers. To shorten the list, use the filter at the bottom to select one or more locales you are interested in.
If you are only interested in the general status across versions, you can use the sister app Common Voice Metadata Viewer.
In this application, under each dataset, you may see results for multiple splitting algorithms, listed below (some datasets may lack some of them). These splitting algorithms are:
- s1: The default splits in the Common Voice dataset distributions, equivalent to running CorporaCreator with the -s 1 option, i.e. only one recording per sentence is taken. This algorithm creates the most diverse test split and is scientifically the most correct one. On the other hand, it usually uses only a small portion of the dataset, especially for low-resource languages. The resulting training set is generally the smallest and least diverse, which introduces bias (voice, gender, etc.).
- s5/s99: CorporaCreator output with the -s 5 and -s 99 options respectively, i.e. up to 5/99 recordings per sentence are taken. s99 usually includes the whole dataset, although for some languages individual sentences have been recorded more than 99 times. Even when it uses the whole dataset, it does not guarantee diversity in the train set either, but it gives better test results in models. The s5 option is usually good for capturing most of the recordings, except for languages with small text corpora, where people exhaust the sentences by recording them up to 15 times.
- v1: An alternative adaptive algorithm proposed for Common Voice. It uses the whole dataset, ensures voice diversity in all splits, and yields better training results than the s99 algorithm. It is still under development: it has been run on all languages but only tested on a few, using Coqui STT and OpenAI Whisper.
- vw: A variant of v1 tuned for OpenAI Whisper fine-tuning, with 90-5-5% splits, keeping 25-25-50% voice diversity.
- vx: A variant of v1 with 95-5-0% splits and 50-50-0% diversity, i.e. no test split, so you can test your model against other datasets such as Fleurs or Voxpopuli.
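The per-sentence capping that the -s option performs can be sketched as follows. This is a simplified illustration, not CorporaCreator's actual code; the `sentence` and `client_id` field names merely mirror Common Voice TSV conventions:

```python
from collections import defaultdict

def cap_recordings_per_sentence(recordings, max_per_sentence):
    """Keep at most `max_per_sentence` recordings per distinct sentence.

    `recordings` is a list of dicts with at least a "sentence" key,
    loosely mirroring rows of a Common Voice validated.tsv file.
    """
    taken = defaultdict(int)  # sentence -> recordings kept so far
    result = []
    for rec in recordings:
        if taken[rec["sentence"]] < max_per_sentence:
            taken[rec["sentence"]] += 1
            result.append(rec)
    return result

# Example: -s 1 keeps one recording per sentence; -s 5 keeps up to five.
rows = [
    {"client_id": "a", "sentence": "hello world"},
    {"client_id": "b", "sentence": "hello world"},
    {"client_id": "c", "sentence": "good morning"},
]
print(len(cap_recordings_per_sentence(rows, 1)))  # 2 (one per sentence)
print(len(cap_recordings_per_sentence(rows, 5)))  # 3 (all rows fit)
```

With max_per_sentence=1 you get the most diverse but smallest selection (s1); raising the cap (s5, s99) takes in more of the dataset at the cost of repeated sentences.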
Now that you know what this is about, you can start browsing the provided datasets by clicking the button below or using the left-hand menu.