Building voice datasets - Revision history

Thorsten: Added lessons learned category.

2022-01-23T12:12:23Z

← Older revision		Revision as of 14:12, 23 January 2022
Line 1:		Line 1:
	~~[[Category:Dataset]]~~

	== Building a quality dataset for your STT purposes ==		== Building a quality dataset for your STT purposes ==
	There are several datasets you can build for STT purposes.		There are several datasets you can build for STT purposes.
Line 34:		Line 32:

	== References ==		== References ==

			[[Category:Dataset]]
			[[Category:Lessons learned]]

Thorsten: Thorsten moved page Quality datasets to Building voice datasets: Rename title after discussion with Nmammedov

2022-01-23T12:07:02Z

← Older revision	Revision as of 14:07, 23 January 2022
(No difference)

Thorsten: Added references headline

2021-11-17T21:21:36Z

← Older revision		Revision as of 23:21, 17 November 2021
Line 32:		Line 32:
	=== Quality, or can you hear me now? ===		=== Quality, or can you hear me now? ===
	Common Voice has a massive range of samples within it. In addition to simply collecting sentences, users can also verify samples to confirm they're a match to the expected transcript. This has a two-fold benefit: sentences that don't match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.		Common Voice has a massive range of samples within it. In addition to simply collecting sentences, users can also verify samples to confirm they're a match to the expected transcript. This has a two-fold benefit: sentences that don't match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.

			== References ==

Thorsten: Style adjustments and added dataset category.

2021-11-17T21:20:53Z

← Older revision		Revision as of 23:20, 17 November 2021
Line 1:		Line 1:
	~~'''Building a quality dataset for your STT purposes'''~~		[[Category:Dataset]]

	There are several datasets you can build for STT purposes. Wake word, fine-tuning, complete STT. Each has a separate end goal. For all three, there seems to be several clear tips. First, diversity is the key to improving general usability. The more variance you have in the data, pitch, speed, phonemes, tones, accents, speakers, and such, the better the end models typically are. Second, larger datasets tend to result in better usability, with the caveat of the first tip. Third, recording quality can play a big role in how the end model turns out. Transcription quality as well as audio quality matter. It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result. Some wakeword tools promise usability after recording as few as one to three utterances. These may be usable for a single user, in matching audio settings, but would likely not be useful as a general purpose wake word model.		== Building a quality dataset for your STT purposes ==
			There are several datasets you can build for STT purposes.

	~~Before you start gathering a single clip, read the docs for the tool you're using.~~ ~~[https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html Coqui STT] specifies 10~~-~~20 second clips, so~~ a ~~dataset with five minute long clips would not be a good choice.~~ ~~Make sure you can get data in the formats that your tool needs.~~ ~~For audio this is usually not a difficult problem, tools like [https://en.wikipedia.org/wiki/SoX SoX] or [https://www.ffmpeg.org/ FFMpeg] can convert pretty much any source to any type these days.~~		* Wake word training
			* Fine-tuning a pretrained STT model
			* Complete STT model training

	* How to ~~get a more diverse set of data?~~		== Three tips ==
			Each has a separate end goal. For all three, there seem to be several clear tips.

	~~For wake words, this~~ is ~~mostly about~~ the ~~users who will be using the end models. They should provide a good chunk of data~~ to ~~start with~~. ~~Expand by finding personalized variables that~~ you~~'re missing~~, ie, ~~accents~~, ~~pitches. For this type of diversification~~, ~~ask your friends~~, ~~colleagues~~, ~~enemies~~, the internet. Pay a service to garner data for you (ie, mechanical turk, craig's list, etc). Write a paper at college and collect samples from as many different students as you can. Find users all over the ~~world that have some level of interest and simply ask them to contribute. It is possible to use TTS services to provide data, however, this has limitations and in some cases may reduce usability~~.		=== Diversity ===
			Diversity is the key to improving general usability. The more variance you have in the data, pitch, speed, phonemes, tones, accents, speakers, and such, the better the end models typically are.

	For STT, you want both personal diversity (lots of different voices), as well as sentence, word, and phoneme diversity (lots of different sentences and words). Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence. If that's your end goal, then this might be a strategy to ~~pursue. For more general purpose models, gather as wide a set of phonemes as possible, typically~~ in a ration mimicking the general language's usage. Word diversity can be simply getting a dictionary and starting at aardvark and working through to zymurgy (for English speakers). Another tactic would be to utilize the top 1000 most commonly used words. Putting both of those together, you should then find a source of sentences, or build a list of your own to record. There are several open data sets you can also copy from (Common voice). For domain-specific modeling, you will certainly want to focus on both word- and sentence-choice to be relevant to the ~~domain you're trying to target. Mozilla's Common Voice English set has over 40,000 speakers and more than 200,000 different words. It's a massive sampling~~ of the ~~language with a huge range of voice types, and word and sentences. Within those users there's also a massive range of recording quality from studio-level to buzzing, hissy computer mic clips~~.		=== Dataset size ===
			Larger datasets tend to result in better usability, with the caveat of the first tip.

	* Getting more data		=== Recording quality ===
			Recording quality can play a big role in how the end model turns out. Transcription quality matters as well as audio quality. It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result. Some [[:Category:Wake words|wakeword]] tools promise usability after recording as few as one to three utterances. These may be usable for a single user, in matching audio settings, but would likely not be useful as a general-purpose wake word model.

	~~Expanding~~ the ~~diversity of your dataset is a good start~~ for ~~this~~. ~~Still need more data? Domain specific modelers will want to search industry resources for conference or work~~-~~area recordings that can~~ be ~~used~~. ~~General purpose users can incorporate existing datasets when they~~ can. ~~Common Voice~~, ~~Tedlium3, Librivox all have decent sized sets that can be used to supplement~~ or ~~build general datasets~~. ~~For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you~~ can ~~will still leave you in a better position than having too little~~ to ~~effectively tune~~.		Before you start gathering a single clip, read the docs for the tool you're using. Coqui STT<ref>https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html</ref> specifies 10-20 second clips, so a [[:Category:Dataset|dataset]] with five-minute-long clips would not be a good choice. Make sure you can get data in the formats that your tool needs. For audio this is usually not a difficult problem: tools like SoX<ref>https://en.wikipedia.org/wiki/SoX</ref> or FFmpeg<ref>https://www.ffmpeg.org/</ref> can convert pretty much any source to any type these days.

	* Quality, or can ~~you hear me now?~~		== Get diverse set of data ==
			For [[:Category:Wake words|wake words]], this is mostly about the users who will be using the end models. They should provide a good chunk of data to start with. Expand by finding personalized variables that you're missing, e.g., accents or pitches. For this type of diversification, ask your friends, colleagues, enemies, the internet. Pay a service to garner data for you (e.g., Mechanical Turk or Craigslist). Write a paper at college and collect samples from as many different students as you can. Find users all over the world that have some level of interest and simply ask them to contribute. It is possible to use [[:Category:TTS|TTS]] services to provide data; however, this has limitations and in some cases may reduce usability.

			For [[:Category:STT|STT]], you want both personal diversity (lots of different voices), as well as sentence, word, and phoneme diversity (lots of different sentences and words). Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence. If that's your end goal, then this might be a strategy to pursue. For more general-purpose models, gather as wide a set of phonemes as possible, typically in a ratio mimicking the general language's usage. Word diversity can be as simple as getting a dictionary and starting at aardvark and working through to zymurgy (for English speakers). Another tactic would be to utilize the top 1000 most commonly used words. Putting both of those together, you should then find a source of sentences, or build a list of your own to record. There are several open datasets you can also copy from (Common Voice<ref>https://commonvoice.mozilla.org/</ref>). For domain-specific modeling, you will certainly want to focus on both word and sentence choice to be relevant to the domain you're trying to target. Mozilla's Common Voice English set has over 40,000 speakers and more than 200,000 different words. It's a massive sampling of the language with a huge range of voice types, words, and sentences. Within those users there's also a massive range of recording quality, from studio-level to buzzing, hissy computer mic clips.

			=== Getting more data ===
			Expanding the diversity of your dataset is a good start for this. Still need more data? Domain-specific modelers will want to search industry resources for conference or work-area recordings that can be used. General-purpose users can incorporate existing datasets when they can. Common Voice, TED-LIUM 3, and LibriVox all have decent-sized sets that can be used to supplement or build general datasets. For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you can will still leave you in a better position than having too little to effectively tune.

			=== Quality, or can you hear me now? ===
	Common Voice has a massive range of samples within it. In addition to simply collecting sentences, users can also verify samples to confirm they're a match to the expected transcript. This has a two-fold benefit: sentences that don't match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.		Common Voice has a massive range of samples within it. In addition to simply collecting sentences, users can also verify samples to confirm they're a match to the expected transcript. This has a two-fold benefit: sentences that don't match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.

Eltocino: first submit

2021-11-17T06:38:58Z

New page

'''Building a quality dataset for your STT purposes'''

There are several datasets you can build for STT purposes: wake word, fine-tuning, and complete STT. Each has a separate end goal. For all three, there seem to be several clear tips. First, diversity is the key to improving general usability. The more variance you have in the data (pitch, speed, phonemes, tones, accents, speakers, and such), the better the end models typically are. Second, larger datasets tend to result in better usability, with the caveat of the first tip. Third, recording quality can play a big role in how the end model turns out. Transcription quality matters as well as audio quality. It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result. Some wake word tools promise usability after recording as few as one to three utterances. These may be usable for a single user, in matching audio settings, but would likely not be useful as a general-purpose wake word model.

Before you start gathering a single clip, read the docs for the tool you're using. [https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html Coqui STT] specifies 10-20 second clips, so a dataset with five-minute-long clips would not be a good choice. Make sure you can get data in the formats that your tool needs. For audio this is usually not a difficult problem: tools like [https://en.wikipedia.org/wiki/SoX SoX] or [https://www.ffmpeg.org/ FFmpeg] can convert pretty much any source to any type these days.
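
As a rough sketch of checking clips against a tool's length requirement, the snippet below reads the duration of PCM WAV files with Python's standard <code>wave</code> module and flags any clip outside a target range. The function names are made up for illustration, and the 10-20 second defaults simply mirror the range mentioned above; adjust them to whatever your tool's docs actually specify.

```python
import wave
from pathlib import Path


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_outside_range(paths, min_seconds=10.0, max_seconds=20.0):
    """Return (path, duration) pairs for clips outside the target range.

    The default range is an assumption taken from the text above,
    not a value read from any tool's configuration.
    """
    flagged = []
    for path in paths:
        duration = clip_duration_seconds(path)
        if not (min_seconds <= duration <= max_seconds):
            flagged.append((Path(path), duration))
    return flagged
```

For example, <code>clips_outside_range(Path("clips").glob("*.wav"))</code> would list the files in a hypothetical <code>clips</code> directory that need to be split or dropped. Files in other formats would have to be converted to WAV first, for instance with SoX or FFmpeg.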

* How to get a more diverse set of data?

For wake words, this is mostly about the users who will be using the end models. They should provide a good chunk of data to start with. Expand by finding personalized variables that you're missing, e.g., accents or pitches. For this type of diversification, ask your friends, colleagues, enemies, the internet. Pay a service to garner data for you (e.g., Mechanical Turk or Craigslist). Write a paper at college and collect samples from as many different students as you can. Find users all over the world that have some level of interest and simply ask them to contribute. It is possible to use TTS services to provide data; however, this has limitations and in some cases may reduce usability.

For STT, you want both personal diversity (lots of different voices), as well as sentence, word, and phoneme diversity (lots of different sentences and words). Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence. If that's your end goal, then this might be a strategy to pursue. For more general-purpose models, gather as wide a set of phonemes as possible, typically in a ratio mimicking the general language's usage. Word diversity can be as simple as getting a dictionary and starting at aardvark and working through to zymurgy (for English speakers). Another tactic would be to utilize the top 1000 most commonly used words. Putting both of those together, you should then find a source of sentences, or build a list of your own to record. There are several open datasets you can also copy from (Common Voice). For domain-specific modeling, you will certainly want to focus on both word and sentence choice to be relevant to the domain you're trying to target. Mozilla's Common Voice English set has over 40,000 speakers and more than 200,000 different words. It's a massive sampling of the language with a huge range of voice types, words, and sentences. Within those users there's also a massive range of recording quality, from studio-level to buzzing, hissy computer mic clips.
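
One way to check word diversity before recording is to compare your prompt sentences against a target word list, such as a list of the most common words. The following is a minimal sketch under simple assumptions: the function names are hypothetical, and the tokenizer only handles lowercase ASCII letters and apostrophes, so it would need adapting for other languages.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def word_counts(sentences):
    """Count how often each word appears across all prompt sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def missing_words(sentences, target_words):
    """Return the target words that never appear in the sentences, sorted."""
    counts = word_counts(sentences)
    targets = {word.lower() for word in target_words}
    return sorted(word for word in targets if counts[word] == 0)
```

Any words returned by <code>missing_words</code> are candidates for new prompt sentences. The counts from <code>word_counts</code> can also show whether a few words dominate the set.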

* Getting more data

Expanding the diversity of your dataset is a good start for this. Still need more data? Domain-specific modelers will want to search industry resources for conference or work-area recordings that can be used. General-purpose users can incorporate existing datasets when they can. Common Voice, TED-LIUM 3, and LibriVox all have decent-sized sets that can be used to supplement or build general datasets. For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you can will still leave you in a better position than having too little to effectively tune.

* Quality, or can you hear me now?

Common Voice has a massive range of samples within it. In addition to simply recording sentences, users can also verify samples to confirm they match the expected transcript. This has a two-fold benefit: samples that don't match the transcript can be noted for exclusion, and so can the poorest quality samples that are unintelligible or have other audio quality problems.
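
The verification results can be used to filter a dataset before training. The sketch below assumes a tab-separated metadata file with <code>up_votes</code> and <code>down_votes</code> columns, which is how Common Voice releases are commonly laid out; check the column names in your own download. The function name and the voting threshold are illustrative assumptions, not an official rule.

```python
import csv


def filter_verified(tsv_file, min_up_votes=2):
    """Keep rows with enough up-votes and more up-votes than down-votes.

    tsv_file is an open text file (or any iterable of lines) with a header
    row. The up_votes/down_votes column names and the default threshold
    are assumptions for this sketch.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        up = int(row["up_votes"])
        down = int(row["down_votes"])
        if up >= min_up_votes and up > down:
            kept.append(row)
    return kept
```

Rows that fail the check are the ones noted for exclusion above. Raising <code>min_up_votes</code> trades dataset size for confidence in the transcripts.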