General TTS tutorial: Difference between revisions
m
no edit summary
mNo edit summary |
mNo edit summary |
||
Line 22: | Line 22: | ||
Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech. | Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech. | ||
When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then use the model to read everyday speech, then it may not meet your expectations. | === Text domain === | ||
When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then try to use the model to read everyday speech, then it may not meet your expectations. It is like training for competitions if one wants to compete in Olympic games then it is necessary to train at Olympic-standard sport facilities. The TTS learning program itself does not know anything about the language so it will be able to output only what was taught to it. | |||
Therefore, if one's goal is to synthesize speech from text in diverse domains it is necesary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal-normative-formal texts, 25% literature, fiction, poetry, 25% everyday conversations, 25% science-technology, art. The proportions and types can be chosen arbitrarily depending on the target use. | |||
''The page is being developed.'' | === Word-level diversity of text === | ||
While choosing text from diverse domains is the first step of ensuring diversity, in fact, in some domains, especially in legal-formal texts, the words can be quite repetitive. The diversity of the text dataset must be at word level not at domain-category level. For example, if we want our TTS model to read all words with good quality then those words need to be fed to the system in the training phase. | |||
=== Text license === | |||
It is important that text dataset has license that correspond to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights. | |||
''The page is being developed. (sentence by sentence everyday)'' |