General TTS tutorial: Difference between revisions

402 bytes added , 27 February 2022

m

Added phoneme coverage and question marks.

@@ Line 36: / Line 36: @@
 The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. The [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] is a useful reference for estimating how much text is needed. It is about 24 hours long and contains 13,100 sound clips with an average of 17 words per clip. It has 225,715 words in total, of which 13,821 are distinct. Put simply, one needs about 13,000 sentences of various lengths, averaging 17 words each. Emulating the LJ-Speech-Dataset can be useful for beginners because it has been tested by many TTS experts with good results.
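The LJ-Speech figures above can be used as rough targets when checking a candidate text corpus. The following is a minimal sketch, not a definitive tool: the function name <code>corpus_stats</code>, the word-splitting rule and the sample sentences are illustrative assumptions, and the tokenization may need adjusting for your language.

```python
import re

# Assumption: a "word" is a run of letters, optionally with one inner apostrophe.
# This is a simplification and may not suit every language or script.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def corpus_stats(sentences):
    """Return sentence count, total words, distinct words and mean words per sentence."""
    total_words = 0
    distinct = set()
    for sentence in sentences:
        words = [w.lower() for w in WORD_RE.findall(sentence)]
        total_words += len(words)
        distinct.update(words)
    count = len(sentences)
    return {
        "sentences": count,
        "total_words": total_words,
        "distinct_words": len(distinct),
        "mean_words": total_words / count if count else 0.0,
    }


# Reference values taken from the LJ-Speech description above.
LJ_SENTENCES = 13100
LJ_TOTAL_WORDS = 225715
LJ_MEAN_WORDS = LJ_TOTAL_WORDS / LJ_SENTENCES  # a little over 17 words per clip

if __name__ == "__main__":
    sample = ["The cat sat.", "The dog didn't sit!"]  # illustrative sentences only
    stats = corpus_stats(sample)
    print(stats)
    print("Sentences still needed:", max(0, LJ_SENTENCES - stats["sentences"]))
```

Comparing <code>mean_words</code> and <code>sentences</code> against the LJ-Speech values gives a quick indication of how far the corpus is from the reference size.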
+=== Phoneme coverage ===
+Besides diversity, check the phoneme coverage of your text for your language. The text should cover all phonemes of the language well.
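A coverage check of this kind could be sketched as follows. This is only an illustration under stated assumptions: <code>phoneme_coverage</code> is a hypothetical helper, and the pronunciation lexicon (a word-to-phoneme-list dictionary) and the phoneme inventory must be supplied by you for your language; the tiny ones shown here are made up for the example.

```python
from collections import Counter

# Punctuation stripped from word edges before lexicon lookup (an assumption;
# extend this for the punctuation used in your language).
EDGE_PUNCTUATION = ".,!?;:\"'()"


def phoneme_coverage(sentences, lexicon, inventory):
    """Count phoneme occurrences and report missing phonemes and unknown words.

    lexicon:   dict mapping a lowercase word to a list of phoneme symbols
    inventory: iterable of all phoneme symbols of the language
    """
    counts = Counter()
    unknown_words = set()
    for sentence in sentences:
        for raw in sentence.lower().split():
            word = raw.strip(EDGE_PUNCTUATION)
            if not word:
                continue
            phonemes = lexicon.get(word)
            if phonemes is None:
                unknown_words.add(word)  # cannot be checked without a pronunciation
            else:
                counts.update(phonemes)
    missing = set(inventory) - set(counts)
    return counts, missing, unknown_words


if __name__ == "__main__":
    # Hypothetical toy lexicon and inventory, for illustration only.
    toy_lexicon = {"cat": ["k", "a", "t"], "sat": ["s", "a", "t"]}
    toy_inventory = {"k", "a", "t", "s", "d"}
    counts, missing, unknown = phoneme_coverage(["Cat sat.", "Dog!"], toy_lexicon, toy_inventory)
    print("Counts:", dict(counts))
    print("Missing phonemes:", missing)
    print("Words without pronunciation:", unknown)
```

Phonemes reported as missing, or with very low counts, indicate where additional sentences are worth adding.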
+=== <samp>Punctuation marks</samp> ===
+Questions and exclamations are pronounced differently from plain statements. If you want your trained model to synthesize them, the text you record should include sentences that end with question marks and exclamation marks.
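One way to verify this is to count sentences by their final punctuation mark. The sketch below assumes that questions end with <code>?</code> and exclamations with <code>!</code>, which does not hold for every language; <code>sentence_type_counts</code> is a hypothetical helper name.

```python
from collections import Counter

# Closing quotes and brackets that may follow the final punctuation mark.
TRAILING_CLOSERS = "\"')]»”’"


def sentence_type_counts(sentences):
    """Classify each sentence as question, exclamation or other by its last mark."""
    counts = Counter()
    for sentence in sentences:
        stripped = sentence.rstrip().rstrip(TRAILING_CLOSERS)
        last = stripped[-1:]  # empty string if the sentence is empty
        if last == "?":
            counts["question"] += 1
        elif last == "!":
            counts["exclamation"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    sample = ["Is it late?", "Stop!", "It is late.", 'He asked, "Why?"']
    print(dict(sentence_type_counts(sample)))
```

If the <code>question</code> or <code>exclamation</code> count is zero or very small, the recordings will give the model little to learn those intonations from.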
 ''The page is being developed (sentence by sentence, every day).''