General TTS tutorial: Difference between revisions

m
no edit summary
mNo edit summary
mNo edit summary
Line 31: Line 31:


=== Text license ===
=== Text license ===
It is important that text dataset has license that correspond to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights.  
It is important that text dataset has license that corresponds to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights. If the target use is proprietary (commercial), then one can agree with the authors of the texts included in the dataset to share benefits earned from the product.  


=== Text amount ===
The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can learn from the example of [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] dataset to determine the amount of text we need to build our dataset. LJ-Speech dataset is 24 hours long, contains 13,100 sound clips with an average of 17 words in each clip. The total amount of words in the dataset is 225,715 and 13,821 of them are distinct words. To put it simply one needs about 13000 sentences of various lengths but with average number of 17 words in them. Emulating the example of LJ-Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.






''The page is being developed. (sentence by sentence everyday)''
''The page is being developed. (sentence by sentence everyday)''
12

edits