Building voice datasets

Building a quality dataset for your STT purposes

You can build datasets for several STT purposes:

  • Wake word training
  • Fine-tuning a pretrained STT model
  • Complete STT model training

Three tips

Each of these dataset types has a separate end goal. For all three, there seem to be several clear tips.

Diversity

Diversity is the key to improving general usability. The more variance you have in the data (pitch, speed, phonemes, tones, accents, speakers, and so on), the better the end models typically are.

Dataset size

Larger datasets tend to result in better usability, with the caveat from the first tip: the added data needs to be diverse too.

Recording quality

Recording quality can play a big role in how the end model turns out. Transcription quality matters as well as audio quality. It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result. Some wake word tools promise usability after recording as few as one to three utterances. Models built this way may be usable for a single user in matching audio settings, but would likely not be useful as a general-purpose wake word model.

Before you gather a single clip, read the docs for the tool you're using. Coqui STT[1] specifies 10-20 second clips, so a dataset of five-minute clips would not be a good choice. Make sure you can get data in the formats that your tool needs. For audio this is usually not a difficult problem: tools like SoX[2] or FFmpeg[3] can convert pretty much any source to any type these days.
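As a rough illustration of checking clip length before training, the sketch below uses Python's standard `wave` module to flag clips outside a target duration range. The default bounds reuse the 10-20 second figure quoted above. The function names, and the assumption that clips have already been converted to uncompressed WAV, are illustrative only, so adjust them to whatever your tool's docs actually require.

```python
import wave
from pathlib import Path


def clip_duration_seconds(path):
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_outside_range(folder, min_seconds=10.0, max_seconds=20.0):
    """List (filename, duration) pairs for WAV clips outside the given range.

    The default bounds are an assumption based on the 10-20 second figure
    mentioned above; check your tool's docs for the real limits.
    """
    flagged = []
    for path in sorted(Path(folder).glob("*.wav")):
        duration = clip_duration_seconds(path)
        if not (min_seconds <= duration <= max_seconds):
            flagged.append((path.name, round(duration, 2)))
    return flagged
```

Running `clips_outside_range("my_clips")` on a folder of recordings (the folder name is hypothetical) gives a list of files to trim, split, or re-record before you go any further.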

Get a diverse set of data

For wake words, this is mostly about the users who will be using the end models. They should provide a good chunk of data to start with. Expand by finding the personal variables that you're missing, e.g., accents and pitches. For this type of diversification, ask your friends, colleagues, enemies, the internet. Pay a service to gather data for you (e.g., Mechanical Turk, Craigslist, etc.). Write a paper at college and collect samples from as many different students as you can. Find users all over the world who have some level of interest and simply ask them to contribute. It is possible to use TTS services to provide data; however, this has limitations and in some cases may reduce usability.
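One way to spot the variables you're missing is to tally your clips by whatever metadata you have recorded. The sketch below is a minimal example. It assumes each clip is described by a small dictionary, and the field names (`speaker`, `accent`) and the list of expected values are hypothetical.

```python
from collections import Counter


def coverage_by_attribute(clips, attribute, expected_values):
    """Count clips per value of one metadata attribute and list missing values.

    `clips` is a list of dicts such as {"speaker": "anna", "accent": "scottish"}.
    The field names are hypothetical; use whatever your metadata records.
    """
    counts = Counter(clip.get(attribute, "unknown") for clip in clips)
    missing = [value for value in expected_values if counts[value] == 0]
    return counts, missing


clips = [
    {"speaker": "a", "accent": "us"},
    {"speaker": "b", "accent": "us"},
    {"speaker": "c", "accent": "indian"},
]
counts, missing = coverage_by_attribute(
    clips, "accent", ["us", "indian", "scottish"]
)
```

Here `missing` would list the accents you have no clips for yet, and `counts` shows where the set is lopsided.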

For STT, you want both personal diversity (lots of different voices) and sentence, word, and phoneme diversity (lots of different sentences and words). Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence. If that's your end goal, then this might be a strategy to pursue.

For more general-purpose models, gather as wide a set of phonemes as possible, typically in a ratio mimicking the general language's usage. Word diversity can be as simple as getting a dictionary, starting at aardvark, and working through to zymurgy (for English speakers). Another tactic would be to use the top 1000 most commonly used words. Putting both of those together, you should then find a source of sentences, or build a list of your own to record. There are also several open datasets you can copy from (Common Voice[4]). For domain-specific modeling, you will certainly want to focus on word and sentence choices that are relevant to the domain you're trying to target.

Mozilla's Common Voice English set has over 40,000 speakers and more than 200,000 different words. It's a massive sampling of the language with a huge range of voice types, words, and sentences. Within those users there's also a massive range of recording quality, from studio-level to buzzing, hissy computer mic clips.
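To check the word diversity of a prompt list against a target vocabulary (for example, a top-1000 word list), something like the following sketch could be a starting point. The tokenizer is deliberately naive and only handles unaccented English text, and the function names are illustrative assumptions, not part of any particular tool.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def word_coverage(sentences, target_words):
    """Return (fraction of target words covered, sorted list of missing words)."""
    counts = Counter(word for s in sentences for word in tokenize(s))
    targets = {w.lower() for w in target_words}
    missing = sorted(w for w in targets if counts[w] == 0)
    covered = len(targets) - len(missing)
    fraction = covered / len(targets) if targets else 0.0
    return fraction, missing


sentences = ["The aardvark sleeps.", "Turn the lights on"]
fraction, missing = word_coverage(
    sentences, ["the", "aardvark", "zymurgy", "lights"]
)
```

The `missing` list tells you which words still need a sentence written or found for them. Phoneme coverage would need a pronunciation dictionary on top of this, which is outside the scope of the sketch.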

Getting more data

Expanding the diversity of your dataset is a good start for this. Still need more data? Domain-specific modelers will want to search industry resources for conference or work-area recordings that can be used. General-purpose users can incorporate existing datasets when they can. Common Voice, TED-LIUM 3, and LibriVox all have decent-sized sets that can be used to supplement or build general datasets. For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you can will still leave you in a better position than having too little to tune effectively.

Quality, or can you hear me now?

Common Voice has a massive range of samples within it. In addition to recording sentences, users can also verify samples to confirm they match the expected transcript. This has a two-fold benefit: sentences that don't match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.
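If your metadata records these verification votes, excluding poorly rated clips can be a simple filter. The sketch below assumes a tab-separated metadata file with `path`, `sentence`, `up_votes`, and `down_votes` columns, which is how Common Voice releases appear to be laid out. Check the header of your own copy, and treat the vote threshold as an illustrative choice, not an official rule.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def keep_validated(rows, min_up_votes=2):
    """Keep rows whose up votes meet a threshold and outnumber down votes.

    Column names and the default threshold are assumptions; adjust to your data.
    """
    kept = []
    for row in rows:
        up = int(row["up_votes"])
        down = int(row["down_votes"])
        if up >= min_up_votes and up > down:
            kept.append(row)
    return kept
```

For your own collection, the same idea works with any review signal you record, such as a reviewer's pass or fail mark per clip.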

References