<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://openvoice-tech.net/index.php?action=history&amp;feed=atom&amp;title=Building_voice_datasets</id>
	<title>Building voice datasets - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://openvoice-tech.net/index.php?action=history&amp;feed=atom&amp;title=Building_voice_datasets"/>
	<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;action=history"/>
	<updated>2026-04-30T00:26:47Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=2244&amp;oldid=prev</id>
		<title>Thorsten: Added lessons learned category.</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=2244&amp;oldid=prev"/>
		<updated>2022-01-23T12:12:23Z</updated>

		<summary type="html">&lt;p&gt;Added lessons learned category.&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 14:12, 23 January 2022&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[Category:Dataset]]&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Building a quality dataset for your STT purposes ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Building a quality dataset for your STT purposes ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There are several datasets you can build for STT purposes.   &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There are several datasets you can build for STT purposes.   &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l34&quot;&gt;Line 34:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 32:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== References ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== References ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[Category:Dataset]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[Category:Lessons learned]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Thorsten</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=2238&amp;oldid=prev</id>
		<title>Thorsten: Thorsten moved page Quality datasets to Building voice datasets: Rename title after discussion with Nmammedov</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=2238&amp;oldid=prev"/>
		<updated>2022-01-23T12:07:02Z</updated>

		<summary type="html">&lt;p&gt;Thorsten moved page &lt;a href=&quot;/wiki/Quality_datasets&quot; class=&quot;mw-redirect&quot; title=&quot;Quality datasets&quot;&gt;Quality datasets&lt;/a&gt; to &lt;a href=&quot;/wiki/Building_voice_datasets&quot; title=&quot;Building voice datasets&quot;&gt;Building voice datasets&lt;/a&gt;: Rename title after discussion with Nmammedov&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 14:07, 23 January 2022&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-notice&quot; lang=&quot;en&quot;&gt;&lt;div class=&quot;mw-diff-empty&quot;&gt;(No difference)&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</summary>
		<author><name>Thorsten</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1920&amp;oldid=prev</id>
		<title>Thorsten: Added references headline</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1920&amp;oldid=prev"/>
		<updated>2021-11-17T21:21:36Z</updated>

		<summary type="html">&lt;p&gt;Added references headline&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 23:21, 17 November 2021&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l32&quot;&gt;Line 32:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 32:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Quality, or can you hear me now? ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Quality, or can you hear me now? ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Common Voice has a massive range of samples within it.  In addition to simply collecting sentences, users can also verify samples to confirm they&amp;#039;re a match to the expected transcript. This has a two-fold benefit: sentences that don&amp;#039;t match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Common Voice has a massive range of samples within it.  In addition to simply collecting sentences, users can also verify samples to confirm they&amp;#039;re a match to the expected transcript. This has a two-fold benefit: sentences that don&amp;#039;t match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== References ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Thorsten</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1919&amp;oldid=prev</id>
		<title>Thorsten: Style adjustments and added dataset category.</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1919&amp;oldid=prev"/>
		<updated>2021-11-17T21:20:53Z</updated>

		<summary type="html">&lt;p&gt;Style adjustments and added dataset category.&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 23:20, 17 November 2021&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;&#039;&#039;Building a quality dataset for your STT purposes&#039;&#039;&#039;&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[[Category:Dataset]]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There are several datasets you can build for STT purposes&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;.  Wake word, fine-tuning, complete STT. Each has a separate end goal.  For all three, there seems to be several clear tips.  First, diversity is the key to improving general usability.  The more variance you have in the data, pitch, speed, phonemes, tones, accents, speakers, and such, the better the end models typically are.  Second, larger datasets tend to result in better usability, with the caveat of the first tip.  Third, recording quality can play a big role in how the end model turns out.  Transcription quality as well as audio quality matter.  It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result.  Some wakeword tools promise usability after recording as few as one to three utterances.  These may be usable for a single user, in matching audio settings, but would likely not be useful as a general purpose wake word model&lt;/del&gt;.   &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Building a quality dataset for your STT purposes ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There are several datasets you can build for STT purposes.   &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Before you start gathering a single clip, read the docs for the tool you&#039;re using. &lt;/del&gt; &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;[https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html Coqui STT] specifies 10&lt;/del&gt;-&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;20 second clips, so &lt;/del&gt;a &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;dataset with five minute long clips would not be a good choice. &lt;/del&gt; &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Make sure you can get data in the formats that your tool needs. &lt;/del&gt; &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;For audio this is usually not a difficult problem, tools like [https://en.wikipedia.org/wiki/SoX SoX] or [https://www.ffmpeg.org/ FFMpeg] can convert pretty much any source to any type these days. &lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* Wake word training &lt;/ins&gt;  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* Fine&lt;/ins&gt;-&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;tuning &lt;/ins&gt;a &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;pretrained STT model &lt;/ins&gt;  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* Complete STT model training &lt;/ins&gt;  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt; * How &lt;/del&gt;to &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;get a more diverse set of data? &lt;/del&gt;  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Three tips ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Each has a separate end goal. For all three, there seems &lt;/ins&gt;to &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;be several clear tips. &lt;/ins&gt;  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;For wake words, this &lt;/del&gt;is &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;mostly about &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;users who will be using the end models.  They should provide a good chunk of data &lt;/del&gt;to &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;start with&lt;/del&gt;.  &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Expand by finding personalized variables that &lt;/del&gt;you&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&#039;re missing&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;ie&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;accents&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;pitches.  For this type of diversification&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;ask your friends&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;colleagues&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;enemies&lt;/del&gt;, the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;internet.  Pay a service to garner data for you (ie, mechanical turk, craig&#039;s list, etc).  Write a paper at college and collect samples from as many different students as you can.  Find users all over &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;world that have some level of interest and simply ask them to contribute.  It is possible to use TTS services to provide data, however, this has limitations and in some cases may reduce usability&lt;/del&gt;.  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=== Diversity ===&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Diversity &lt;/ins&gt;is the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;key &lt;/ins&gt;to &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;improving general usability&lt;/ins&gt;.  &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;The more variance &lt;/ins&gt;you &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;have in the data, pitch&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;speed&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;phonemes&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;tones&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;accents&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;speakers&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;and such&lt;/ins&gt;, the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;better &lt;/ins&gt;the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;end models typically are&lt;/ins&gt;. &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;  &lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;For STT, you want both personal diversity (lots of different voices), as well as sentence, word, and phoneme diversity (lots of different sentences and words).  Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence.  If that&#039;s your end goal, then this might be a strategy &lt;/del&gt;to &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;pursue.  For more general purpose models, gather as wide a set of phonemes as possible, typically &lt;/del&gt;in &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;a ration mimicking the general language&#039;s usage.  Word diversity can be simply getting a dictionary and starting at aardvark and working through to zymurgy (for English speakers). Another tactic would be to utilize the top 1000 most commonly used words.  Putting both of those together&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;you should then find a source of sentences, or build a list of your own to record.  There are several open data sets you can also copy from (Common voice).  For domain-specific modeling, you will certainly want to focus on both word- and sentence-choice to be relevant to &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;domain you&#039;re trying to target.   Mozilla&#039;s Common Voice English set has over 40,000 speakers and more than 200,000 different words.  It&#039;s a massive sampling &lt;/del&gt;of the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;language with a huge range of voice types, and word and sentences.  Within those users there&#039;s also a massive range of recording quality from studio-level to buzzing, hissy computer mic clips&lt;/del&gt;.   &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=== Dataset size ===&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Larger datasets tend &lt;/ins&gt;to &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;result &lt;/ins&gt;in &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;better usability&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;with &lt;/ins&gt;the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;caveat &lt;/ins&gt;of the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;first tip&lt;/ins&gt;.   &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* Getting more data&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=== Recording quality ===&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Recording quality can play a big role in how the end model turns out. Transcription quality as well as audio quality matter.  It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result. Some [[:Category:Wake words|wakeword]] tools promise usability after recording as few as one to three utterances. These may be usable for a single user, in matching audio settings, but would likely not be useful as a general purpose wake word model.  &lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Expanding &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;diversity of your dataset is a good start &lt;/del&gt;for &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;this&lt;/del&gt;.  &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Still need more data?  Domain specific modelers will want to search industry resources for conference or work&lt;/del&gt;-&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;area recordings that can &lt;/del&gt;be &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;used&lt;/del&gt;.  &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;General purpose users can incorporate existing datasets when they &lt;/del&gt;can.  &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Common Voice&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Tedlium3, Librivox all have decent sized sets that can be used to supplement &lt;/del&gt;or &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;build general datasets&lt;/del&gt;. &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt; For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you &lt;/del&gt;can &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;will still leave you in a better position than having too little &lt;/del&gt;to &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;effectively tune&lt;/del&gt;. &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt; &lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Before you start gathering a single clip, read &lt;/ins&gt;the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;docs &lt;/ins&gt;for &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;the tool you&#039;re using&lt;/ins&gt;.  &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Coqui STT&amp;lt;ref&amp;gt;https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html&amp;lt;/ref&amp;gt; specifies 10&lt;/ins&gt;-&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;20 second clips, so a [[:Category:Dataset|dataset]] with five minute long clips would not &lt;/ins&gt;be &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;a good choice&lt;/ins&gt;.  &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Make sure you &lt;/ins&gt;can &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;get data in the formats that your tool needs&lt;/ins&gt;.  &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;For audio this is usually not a difficult problem&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;tools like SoX&amp;lt;ref&amp;gt;https://en.wikipedia.org/wiki/SoX&amp;lt;/ref&amp;gt; &lt;/ins&gt;or &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;FFMpeg&amp;lt;ref&amp;gt;https://www.ffmpeg&lt;/ins&gt;.&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;org/&amp;lt;/ref&amp;gt; &lt;/ins&gt;can &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;convert pretty much any source &lt;/ins&gt;to &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;any type these days&lt;/ins&gt;.  &lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* Quality&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;or &lt;/del&gt;can &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;you hear me now?&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;== Get diverse set of data ==&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;For [[:Category:Wake words|wake words]]&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;this is mostly about the users who will be using the end models.  They should provide a good chunk of data to start with. Expand by finding personalized variables that you&#039;re missing, ie, accents, pitches.  For this type of diversification, ask your friends, colleagues, enemies, the internet.  Pay a service to garner data for you (ie, mechanical turk, craig&#039;s list, etc).  Write a paper at college and collect samples from as many different students as you &lt;/ins&gt;can&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;.  Find users all over the world that have some level of interest and simply ask them to contribute.  It is possible to use [[:Category:TTS|TTS]] services to provide data, however, this has limitations and in some cases may reduce usability. &lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;For [[:Category:STT|STT]], you want both personal diversity (lots of different voices), as well as sentence, word, and phoneme diversity (lots of different sentences and words).  Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence.  If that&#039;s your end goal, then this might be a strategy to pursue. For more general purpose models, gather as wide a set of phonemes as possible, typically in a ration mimicking the general language&#039;s usage.  Word diversity can be simply getting a dictionary and starting at aardvark and working through to zymurgy (for English speakers). Another tactic would be to utilize the top 1000 most commonly used words.  Putting both of those together, you should then find a source of sentences, or build a list of your own to record.  There are several open data sets you can also copy from (Common Voice&amp;lt;ref&amp;gt;https://commonvoice.mozilla.org/&amp;lt;/ref&amp;gt;).  For domain-specific modeling, you will certainly want to focus on both word- and sentence-choice to be relevant to the domain you&#039;re trying to target.   Mozilla&#039;s Common Voice English set has over 40,000 speakers and more than 200,000 different words.  It&#039;s a massive sampling of the language with a huge range of voice types, and word and sentences.  Within those users there&#039;s also a massive range of recording quality from studio-level to buzzing, hissy computer mic clips.  &lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=== Getting more data ===&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Expanding the diversity of your dataset is a good start for this.  Still need more data?  Domain specific modelers will want to search industry resources for conference or work-area recordings that can be used.  General purpose users can incorporate existing datasets when they can. Common Voice, Tedlium3, Librivox all have decent sized sets that can be used to supplement or build general datasets.  For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you can will still leave you in a better position than having too little to effectively tune.  &lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=== Quality, or can you hear me now? ===&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Common Voice has a massive range of samples within it.  In addition to simply collecting sentences, users can also verify samples to confirm they&amp;#039;re a match to the expected transcript. This has a two-fold benefit: sentences that don&amp;#039;t match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Common Voice has a massive range of samples within it.  In addition to simply collecting sentences, users can also verify samples to confirm they&amp;#039;re a match to the expected transcript. This has a two-fold benefit: sentences that don&amp;#039;t match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Thorsten</name></author>
	</entry>
	<entry>
		<id>https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1907&amp;oldid=prev</id>
		<title>Eltocino: first submit</title>
		<link rel="alternate" type="text/html" href="https://openvoice-tech.net/index.php?title=Building_voice_datasets&amp;diff=1907&amp;oldid=prev"/>
		<updated>2021-11-17T06:38:58Z</updated>

		<summary type="html">&lt;p&gt;first submit&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Building a quality dataset for your STT purposes&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
There are several datasets you can build for STT purposes.  Wake word, fine-tuning, complete STT. Each has a separate end goal.  For all three, there seems to be several clear tips.  First, diversity is the key to improving general usability.  The more variance you have in the data, pitch, speed, phonemes, tones, accents, speakers, and such, the better the end models typically are.  Second, larger datasets tend to result in better usability, with the caveat of the first tip.  Third, recording quality can play a big role in how the end model turns out.  Transcription quality as well as audio quality matter.  It may be tempting to record noise and layer it over all your studio-quality clips, but this may not improve your end result.  Some wakeword tools promise usability after recording as few as one to three utterances.  These may be usable for a single user, in matching audio settings, but would likely not be useful as a general purpose wake word model.  &lt;br /&gt;
&lt;br /&gt;
Before you start gathering a single clip, read the docs for the tool you&amp;#039;re using.  [https://stt.readthedocs.io/en/stable/playbook/DATA_FORMATTING.html Coqui STT] specifies 10-20 second clips, so a dataset with five minute long clips would not be a good choice.  Make sure you can get data in the formats that your tool needs.  For audio this is usually not a difficult problem, tools like [https://en.wikipedia.org/wiki/SoX SoX] or [https://www.ffmpeg.org/ FFMpeg] can convert pretty much any source to any type these days. &lt;br /&gt;
&lt;br /&gt;
 * How to get a more diverse set of data?  &lt;br /&gt;
&lt;br /&gt;
For wake words, this is mostly about the users who will be using the end models.  They should provide a good chunk of data to start with.  Expand by finding personalized variables that you&amp;#039;re missing, ie, accents, pitches.  For this type of diversification, ask your friends, colleagues, enemies, the internet.  Pay a service to garner data for you (ie, mechanical turk, craig&amp;#039;s list, etc).  Write a paper at college and collect samples from as many different students as you can.  Find users all over the world that have some level of interest and simply ask them to contribute.  It is possible to use TTS services to provide data, however, this has limitations and in some cases may reduce usability. &lt;br /&gt;
&lt;br /&gt;
For STT, you want both personal diversity (lots of different voices), as well as sentence, word, and phoneme diversity (lots of different sentences and words).  Having 1000 people read the same sentence to build a dataset might make a great model for that particular sentence.  If that&amp;#039;s your end goal, then this might be a strategy to pursue.  For more general purpose models, gather as wide a set of phonemes as possible, typically in a ration mimicking the general language&amp;#039;s usage.  Word diversity can be simply getting a dictionary and starting at aardvark and working through to zymurgy (for English speakers). Another tactic would be to utilize the top 1000 most commonly used words.  Putting both of those together, you should then find a source of sentences, or build a list of your own to record.  There are several open data sets you can also copy from (Common voice).  For domain-specific modeling, you will certainly want to focus on both word- and sentence-choice to be relevant to the domain you&amp;#039;re trying to target.   Mozilla&amp;#039;s Common Voice English set has over 40,000 speakers and more than 200,000 different words.  It&amp;#039;s a massive sampling of the language with a huge range of voice types, and word and sentences.  Within those users there&amp;#039;s also a massive range of recording quality from studio-level to buzzing, hissy computer mic clips.  &lt;br /&gt;
&lt;br /&gt;
* Getting more data&lt;br /&gt;
&lt;br /&gt;
Expanding the diversity of your dataset is a good start for this.  Still need more data?  Domain specific modelers will want to search industry resources for conference or work-area recordings that can be used.  General purpose users can incorporate existing datasets when they can.  Common Voice, Tedlium3, Librivox all have decent sized sets that can be used to supplement or build general datasets.  For fine-tuning, your needs will be more specific, but collecting the broadest, largest set of data you can will still leave you in a better position than having too little to effectively tune.  &lt;br /&gt;
&lt;br /&gt;
* Quality, or can you hear me now?&lt;br /&gt;
&lt;br /&gt;
Common Voice has a massive range of samples within it.  In addition to simply collecting sentences, users can also verify samples to confirm they&amp;#039;re a match to the expected transcript. This has a two-fold benefit: sentences that don&amp;#039;t match the transcript can be noted for exclusion, and the poorest quality samples that are unintelligible or have other audio quality problems can be noted for exclusion.&lt;/div&gt;</summary>
		<author><name>Eltocino</name></author>
	</entry>
</feed>