General TTS tutorial

Training a TTS model requires a paired text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and to record that learned knowledge in a model file. At synthesis time, this stored knowledge is used to convert new text to speech.


=== Type of text to choose ===
When choosing text for the dataset, it is important to consider the context in which the trained TTS model will be used. For example, if you train the model on legal texts and then ask it to read everyday speech, it may not meet your expectations. It is like athletic training: someone who wants to compete in the Olympic Games must train at Olympic-standard facilities. The TTS training program itself knows nothing about the language, so the model can only reproduce what it was taught.


Therefore, if the goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from several domains: 25% legal, normative, and formal texts; 25% literature, fiction, and poetry; 25% everyday conversation; and 25% science, technology, and art. The proportions and categories can be chosen freely depending on the target use; a sketch of assembling such a mix is shown below.
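
As an illustration, here is a minimal sketch of building such a mix, assuming one plain-text sentence pool per domain with one sentence per line. The file names, proportions, and total count are hypothetical placeholders, not part of any standard tool:

<syntaxhighlight lang="python">
import random

# Hypothetical sentence pools, one plain-text file per domain (one sentence per line).
domain_files = {
    "legal": "legal_sentences.txt",
    "literature": "literature_sentences.txt",
    "conversation": "conversation_sentences.txt",
    "science": "science_sentences.txt",
}
proportions = {"legal": 0.25, "literature": 0.25, "conversation": 0.25, "science": 0.25}
target_total = 13000  # total number of sentences wanted in the dataset

random.seed(42)  # make the sampling reproducible
mixed = []
for domain, path in domain_files.items():
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    k = int(target_total * proportions[domain])
    # Sample without replacement; assumes each pool has at least k sentences.
    mixed.extend(random.sample(sentences, k))

random.shuffle(mixed)  # interleave domains rather than concatenating them
with open("mixed_corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(mixed))
</syntaxhighlight>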


=== Ensuring word-level diversity of text ===
Choosing text from diverse domains is only the first step toward diversity: in some domains, especially legal and formal texts, the vocabulary can be quite repetitive. The diversity of the text dataset must therefore be measured at the word level, not at the domain-category level. For example, if we want the TTS model to read all words with good quality, those words need to be present in the training data. A quick check of word-level diversity is sketched below.
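
One simple way to gauge word-level diversity is to count total and distinct words in the candidate corpus. This is only a rough check (the file name is the hypothetical one from the sketch above, and a more thorough analysis would also examine phoneme coverage):

<syntaxhighlight lang="python">
import re
from collections import Counter

# Count total and distinct words in a candidate corpus to gauge
# word-level diversity.
with open("mixed_corpus.txt", encoding="utf-8") as f:
    # Extract runs of letters only (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", f.read().lower())

counts = Counter(words)
total, distinct = len(words), len(counts)
print(f"total words:      {total}")
print(f"distinct words:   {distinct}")
print(f"type-token ratio: {distinct / total:.3f}")
# Heavily repeated words hint at a repetitive domain dominating the mix.
print("most repeated:", counts.most_common(10))
</syntaxhighlight>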


=== Licensing issues ===
It is important that the text dataset carries a license that matches its target use. Ideally, to ensure unlimited free use of the final output, the text should be chosen from the public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find such texts, one needs to consult the copyright legislation of the country in which the text dataset is being built. Constitutions, laws, regulations, forms, and directives are usually in the public domain by definition. For other kinds of text, copyright typically expires after a certain period, for instance, 50 years after the death of the author. By studying local legislation, it is usually possible to find a sufficient amount of public domain text. If the target use is proprietary (commercial), one can instead agree with the authors of the texts included in the dataset to share the revenue earned from the product.


=== Amount of text required ===
The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. The [https://keithito.com/LJ-Speech-Dataset/ LJ Speech dataset] is a useful reference for determining how much text is needed. It is about 24 hours long and contains 13,100 sound clips with an average of 17 words per clip; in total it has 225,715 words, of which 13,821 are distinct. To put it simply, one needs roughly 13,000 sentences of varying length with an average of about 17 words each. Emulating LJ Speech is a good starting point for beginners because it has been tested by many TTS practitioners with good results. Its figures also give a simple way to estimate dataset size for other recording targets, as sketched below.
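
As a back-of-envelope illustration, the LJ Speech figures imply a speaking rate of roughly 157 words per minute, which can be used to estimate how many sentences a different recording target requires. The 30-hour target below is an arbitrary example, not a recommendation:

<syntaxhighlight lang="python">
# Back-of-envelope estimate of dataset size, using LJ Speech as a reference:
# 24 hours, 13,100 clips, 225,715 words total (~17 words per clip).
ref_hours, ref_clips, ref_words = 24, 13_100, 225_715
words_per_minute = ref_words / (ref_hours * 60)  # ~157 words/minute
words_per_clip = ref_words / ref_clips           # ~17 words/clip

target_hours = 30  # arbitrary example within the suggested 25-40 hour range
target_words = target_hours * 60 * words_per_minute
target_sentences = target_words / words_per_clip

print(f"speaking rate:    {words_per_minute:.0f} words/minute")
print(f"words needed:     {target_words:,.0f}")
print(f"sentences needed: {target_sentences:,.0f} (~{words_per_clip:.0f} words each)")
</syntaxhighlight>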

