<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://openvoice-tech.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Eltocino</id>
	<title>Open Voice Technology Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://openvoice-tech.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Eltocino"/>
	<link rel="alternate" type="text/html" href="https://openvoice-tech.net/wiki/Special:Contributions/Eltocino"/>
	<updated>2026-04-18T02:28:43Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Research_papers&amp;diff=1908</id>
		<title>Research papers</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Research_papers&amp;diff=1908"/>
		<updated>2021-11-17T06:51:29Z</updated>

		<summary type="html">&lt;p&gt;Eltocino: finding research papers for speech&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Most current research papers on speech can be found online.  &lt;br /&gt;
&lt;br /&gt;
[https://scholar.google.com/scholar Google Scholar] is a useful starting point that links out to many other sites.  &lt;br /&gt;
&lt;br /&gt;
The [https://scholar.archive.org/ Internet Archive]&#039;s Scholar section can also be used to search for papers.  &lt;br /&gt;
&lt;br /&gt;
[https://arxiv.org/corr arXiv] has an entire subsection for computing research (CoRR), and papers are linked from their abstract pages in the available formats.  &lt;br /&gt;
&lt;br /&gt;
Conference pages are also a good place to find new and interesting papers you might otherwise miss; finding a title in Interspeech&#039;s presentations and searching for it on arXiv usually nets you a copy to read.  &lt;br /&gt;
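&lt;br /&gt;
If you want to script that lookup, a minimal sketch against arXiv&#039;s public query API might look like this (Python standard library only; treat the exact query syntax as an assumption to check against the arXiv API docs, and note the example title is the Interspeech 2020 Conformer paper):&lt;br /&gt;
&lt;pre&gt;
# Search arXiv for a paper title and print matching entries.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def search_arxiv(title, max_results=5):
    # The ti: field prefix restricts the search to paper titles.
    query = urllib.parse.urlencode({
        "search_query": "ti:\"" + title + "\"",
        "max_results": max_results,
    })
    url = "http://export.arxiv.org/api/query?" + query
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    # The API answers with an Atom feed.
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    for entry in tree.getroot().findall("atom:entry", ns):
        print(entry.findtext("atom:title", namespaces=ns).strip())
        print(entry.findtext("atom:id", namespaces=ns))

search_arxiv("Conformer: Convolution-augmented Transformer for Speech Recognition")
&lt;/pre&gt;&lt;/div&gt;</summary>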
		<author><name>Eltocino</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1907</id>
		<title>Building voice datasets</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1907"/>
		<updated>2021-11-17T06:38:58Z</updated>

		<summary type="html">&lt;p&gt;Eltocino: first submit&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Building a quality dataset for your STT purposes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There are several kinds of datasets you can build for STT purposes: wake word, fine-tuning, and complete STT.  Each has a separate end goal, but for all three there are several clear tips.  First, diversity is the key to improving general usability: the more variance you have in the data (pitch, speed, phonemes, tones, accents, speakers, and so on), the better the end models typically are.  Second, larger datasets tend to result in better usability, with the caveat of the first tip.  Third, recording quality can play a big role in how the end model turns out; transcription quality matters as much as audio quality.  It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result.  Some wake-word tools promise usability after recording as few as one to three utterances; these may be usable for a single user in matching audio settings, but would likely not be useful as a general-purpose wake word model.  &lt;br /&gt;
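&lt;br /&gt;
If you do experiment with layering noise over clean clips, a minimal sketch of mixing at a chosen signal-to-noise ratio might look like the following (assuming mono WAV files and the third-party numpy and soundfile packages; the file names are placeholders):&lt;br /&gt;
&lt;pre&gt;
import numpy as np
import soundfile as sf

def mix_at_snr(clean_path, noise_path, out_path, snr_db=20.0):
    # Assumes mono clips sharing one sample rate.
    clean, rate = sf.read(clean_path)
    noise, noise_rate = sf.read(noise_path)
    assert rate == noise_rate, "resample the noise clip first"
    # Loop the noise so it covers the whole clip, then trim it.
    repeats = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, repeats)[: len(clean)]
    # Scale noise so 10*log10(signal_power / noise_power) equals snr_db.
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    sf.write(out_path, clean + gain * noise, rate)

mix_at_snr("studio_clip.wav", "cafe_noise.wav", "augmented_clip.wav")
&lt;/pre&gt;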
&lt;br /&gt;
Before you start gathering a single clip, read the docs for the tool you&#039;re using.  [https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html Coqui STT] specifies 10-20 second clips, so a dataset of five-minute-long clips would not be a good choice.  Make sure you can get data in the formats your tool needs.  For audio this is usually not a difficult problem; tools like [https://en.wikipedia.org/wiki/SoX SoX] or [https://www.ffmpeg.org/ FFmpeg] can convert pretty much any source to any type these days. &lt;br /&gt;
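&lt;br /&gt;
Here is a minimal sketch of that normalization step, assuming ffmpeg is installed and on your PATH, the Python soundfile package is available, and that your tool wants 16 kHz 16-bit mono WAV (the format DeepSpeech-lineage tools such as Coqui STT commonly expect; verify against your tool&#039;s docs):&lt;br /&gt;
&lt;pre&gt;
import subprocess
import soundfile as sf

def normalize(src_path, dst_path):
    # ffmpeg must be on PATH; pcm_s16le is 16-bit little-endian PCM.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path,
         "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le", dst_path],
        check=True,
    )

def clip_length_ok(path, min_seconds=10.0, max_seconds=20.0):
    # Enforce the 10-20 second window from the Coqui STT playbook.
    info = sf.info(path)
    seconds = info.frames / info.samplerate
    return min_seconds &amp;lt;= seconds &amp;lt;= max_seconds

normalize("interview_raw.mp3", "interview.wav")
print(clip_length_ok("interview.wav"))
&lt;/pre&gt;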
&lt;br /&gt;
* How to get a more diverse set of data?  &lt;br /&gt;
&lt;br /&gt;
For wake words, this is mostly about the users who will be using the end models; they should provide a good chunk of data to start with.  Expand by finding the voice variables you&#039;re missing (e.g. accents, pitches).  For this type of diversification, ask your friends, colleagues, enemies, the internet.  Pay a service to gather data for you (e.g. Mechanical Turk, Craigslist).  Write a paper at college and collect samples from as many different students as you can.  Find users all over the world who have some level of interest and simply ask them to contribute.  It is possible to use TTS services to provide data; however, this has limitations and in some cases may reduce usability. &lt;br /&gt;
&lt;br /&gt;
For STT, you want both personal diversity (lots of different voices) and sentence, word, and phoneme diversity (lots of different sentences and words).  Having 1000 people read the same sentence might make a great model for that particular sentence; if that&#039;s your end goal, this could be a strategy to pursue.  For more general-purpose models, gather as wide a set of phonemes as possible, typically in a ratio mimicking the language&#039;s general usage.  Word diversity can be as simple as taking a dictionary and working from aardvark through to zymurgy (for English speakers); another tactic is to utilize the top 1000 most commonly used words.  Putting both of those together, you should then find a source of sentences, or build a list of your own to record.  There are several open datasets you can also copy from (e.g. Common Voice).  For domain-specific modeling, you will certainly want to focus your word and sentence choices on the domain you&#039;re trying to target.  Mozilla&#039;s Common Voice English set has over 40,000 speakers and more than 200,000 different words; it&#039;s a massive sampling of the language with a huge range of voice types, words, and sentences.  Within those users there is also a massive range of recording quality, from studio-level to buzzing, hissy computer-mic clips.  &lt;br /&gt;
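&lt;br /&gt;
To check how well a candidate sentence list covers a target vocabulary, a rough sketch like this can help (the file names and the one-word-per-line word list format are assumptions):&lt;br /&gt;
&lt;pre&gt;
from collections import Counter

def coverage(sentences_path, wordlist_path):
    # Target vocabulary: one word per line, e.g. a top-1000 list.
    with open(wordlist_path) as f:
        target = set(line.strip().lower() for line in f)
    counts = Counter()
    with open(sentences_path) as f:
        for line in f:
            for word in line.lower().split():
                word = word.strip(".,!?;:\"")
                if word in target:
                    counts[word] += 1
    missing = target - set(counts)
    print("covered %d of %d target words" % (len(counts), len(target)))
    print("sample of missing words:", sorted(missing)[:20])

coverage("candidate_sentences.txt", "top_1000_words.txt")
&lt;/pre&gt;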
&lt;br /&gt;
* Getting more data&lt;br /&gt;
&lt;br /&gt;
Expanding the diversity of your dataset is a good start.  Still need more data?  Domain-specific modelers will want to search industry resources for conference or work-area recordings that can be used.  General-purpose users can incorporate existing datasets where they can: Common Voice, TED-LIUM 3, and LibriVox all have decent-sized sets that can be used to supplement or build general datasets.  For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you can will still leave you in a better position than having too little to tune effectively.  &lt;br /&gt;
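&lt;br /&gt;
When you do combine sources, a sketch like the following merges several manifests into one training CSV (the wav_filename/wav_filesize/transcript column layout follows the CSV format used by DeepSpeech and Coqui STT; adjust for your own tool, and the file names are placeholders):&lt;br /&gt;
&lt;pre&gt;
import csv

def merge_manifests(manifest_paths, out_path):
    # Concatenate rows from several manifests under one header.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for path in manifest_paths:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    writer.writerow([row["wav_filename"],
                                     row["wav_filesize"],
                                     row["transcript"]])

merge_manifests(["common_voice.csv", "my_recordings.csv"], "train.csv")
&lt;/pre&gt;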
&lt;br /&gt;
* Quality, or can you hear me now?&lt;br /&gt;
&lt;br /&gt;
Common Voice has a massive range of samples within it.  In addition to simply collecting sentences, users can also verify samples to confirm they match the expected transcript.  This has a two-fold benefit: clips that don&#039;t match the transcript can be flagged for exclusion, as can the poorest-quality samples that are unintelligible or have other audio problems.&lt;br /&gt;
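&lt;br /&gt;
This kind of verification can also be partially automated.  Assuming you already have a hypothesis transcript for each clip (from a pass with any existing STT model), a sketch like this flags clips whose word error rate against the expected transcript is too high:&lt;br /&gt;
&lt;pre&gt;
def word_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over words.
    r = ref.lower().split()
    h = hyp.lower().split()
    prev = list(range(len(h) + 1))
    for i, ref_word in enumerate(r, 1):
        cur = [i]
        for j, hyp_word in enumerate(h, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def flag_for_review(expected, hypothesis, max_wer=0.5):
    errors = word_distance(expected, hypothesis)
    return errors / max(len(expected.split()), 1) &amp;gt; max_wer

# prints True: three of five words differ, so the WER is 0.6
print(flag_for_review("turn on the kitchen light", "turn off the fishing night"))
&lt;/pre&gt;&lt;/div&gt;</summary>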
		<author><name>Eltocino</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Category:Wake_words&amp;diff=1903</id>
		<title>Category:Wake words</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Category:Wake_words&amp;diff=1903"/>
		<updated>2021-11-17T05:58:56Z</updated>

		<summary type="html">&lt;p&gt;Eltocino: Updated dataset line to be more generic, since stt datasets will have similar info.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:STT]]&lt;br /&gt;
[[Category:Open Voice Assistants]]&lt;br /&gt;
&lt;br /&gt;
Wake words, sometimes called keywords, are a special category of Speech-To-Text. Wake words are used to &amp;quot;wake&amp;quot; a listening device and start its functions. In most cases these wake words are detected locally on the device, while the actual speech recognition is mostly done by internet cloud services. Mycroft defaults to &amp;quot;Hey, Mycroft&amp;quot; for its wake word, for instance. Some platforms allow multiple wake words to be used. The Coqui STT engine can even be configured as a wake word listener.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Wake word listeners&#039;&#039;&#039;: &lt;br /&gt;
&lt;br /&gt;
* [[Mycroft Precise]]&lt;br /&gt;
* [[Porcupine]]&lt;br /&gt;
* [[Snowboy]] &lt;br /&gt;
* [[Howl]]&lt;br /&gt;
* [[Coqui]] STT&lt;br /&gt;
* Google TensorFlow Lite speech recognition&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Customizing wake words&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* What makes [[a good wake word]]?&lt;br /&gt;
* [[Building voice datasets|Building a quality dataset]]&lt;/div&gt;</summary>
		<author><name>Eltocino</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Recording_tipps&amp;diff=1902</id>
		<title>Recording tipps</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Recording_tipps&amp;diff=1902"/>
		<updated>2021-11-17T05:57:03Z</updated>

		<summary type="html">&lt;p&gt;Eltocino: added link to librivox recording tips.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Recording tipps]]&lt;br /&gt;
[[Category:Lessons learned]]&lt;br /&gt;
&lt;br /&gt;
When you plan to record a voice dataset to be used for TTS model training, you should check these tips and tricks:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Use a good microphone and a quiet recording room setup&#039;&#039;&#039; (no computer fans, air conditioning, ...)&lt;br /&gt;
* Use a text corpus with cleaned numbers/abbreviations and good phoneme coverage&lt;br /&gt;
* Read in a neutral style, but with a natural speech flow, and do not swallow letters&lt;br /&gt;
* Adjust tone and pitch with punctuation&lt;br /&gt;
* Use a constant recording speed&lt;br /&gt;
* Check your recordings regularly at high volume for background noise (see the sketch after this list)&lt;br /&gt;
* Take breaks regularly and do not record more than four hours a day&lt;br /&gt;
* Record error-free&lt;br /&gt;
* Investing in a quality interface and mic can make a big difference. A 24-bit/96 kHz interface with a large-diaphragm condenser can be had for about $200 USD.&lt;br /&gt;
* Record at the highest quality level practical.  You can convert down to lesser formats later, but you cannot up-convert cleanly&lt;br /&gt;
* Review your work at regular intervals and compare with previous recordings to ensure consistent quality&lt;br /&gt;
* Do not be afraid to ask for help! Getting feedback on your data early on can help prevent wasted effort.&lt;br /&gt;
* There&#039;s a wealth of information on the internet about recording.  For instance, https://wiki.librivox.org/index.php/Newbie_Guide_to_Recording from LibriVox is a useful guide with numerous subpages of information.  Some of it is audiobook-specific, but the majority is useful for anyone recording voice.&lt;br /&gt;
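&lt;br /&gt;
Some of these checks can be scripted.  Here is a minimal sketch that flags clipping and estimates the background-noise floor of a take (assuming the numpy and soundfile packages, and assuming the first half second of each take is room tone; the file name is a placeholder):&lt;br /&gt;
&lt;pre&gt;
import numpy as np
import soundfile as sf

def check_clip(path, noise_window_seconds=0.5):
    audio, rate = sf.read(path)  # float samples in [-1.0, 1.0]
    # Fraction of samples at (or essentially at) full scale.
    clipped = np.mean(np.abs(audio) &amp;gt;= 0.999)
    # Noise floor estimated from the leading room tone.
    head = audio[: int(noise_window_seconds * rate)]
    noise_db = 20 * np.log10(np.sqrt(np.mean(head ** 2)) + 1e-12)
    print("clipped samples: %.3f%%" % (100 * clipped))
    print("leading noise floor: %.1f dBFS" % noise_db)

check_clip("take_042.wav")
&lt;/pre&gt;&lt;/div&gt;</summary>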
		<author><name>Eltocino</name></author>
	</entry>
</feed>