General TTS tutorial
This tutorial contains non-technical information on TTS, or Text-to-Speech. It is meant for people who are new to the field of TTS and are trying to figure out which aspects to consider and which approaches to take.
Introduction
TTS is an acronym for Text-To-Speech, a technology that converts text into audio in the form of human speech. There are many different methods for converting textual input to speech output. Some people use the term speech synthesis to refer to TTS; the two mean the same thing. Currently, machine-learning methods of text-to-speech conversion have become dominant.
To use TTS you have two options, depending on the language you are targeting and on whether a tool or service for that language already exists. If you want to generate speech in a widely used language such as English, German, French, or Russian, TTS services have already been developed for it; major platforms such as Google and Microsoft Windows support these languages by default, and you can use those existing services to generate output. For example, you can try Google Translate to see whether it speaks your language when you provide text input. This does not mean that all TTS technologies are available to you for free; it just means that a free version likely exists, or can be built without too much effort using open-source technologies.
If you are targeting a language that is not widely used, then you will likely need to build a TTS service for it yourself. Usually this means using open-source technologies and training the software to convert text to speech in your language. There are well-developed open-source TTS platforms (for example, Coqui TTS, Festival, and eSpeak NG) that can be used if you are ready to invest your time and energy. This tutorial aims to provide general information for people who want to develop a TTS application for their language of choice. It can be useful for both widely used and under-represented languages.
To be more precise, this tutorial covers how to develop a TTS voice. Understanding TTS technology deeply requires a significant amount of knowledge from computer science, linguistics, and statistics. The technology has already been developed in various forms with diverse functionalities and is in active use. The purpose of this tutorial is to show how to take the existing technology and apply it to a particular language.
Stages of training a TTS model
There are several stages to building a TTS voice model. The order of these stages cannot be changed, unless one already has the output of a particular stage and can proceed directly to the next.
- Building a text dataset
- Preparing a voice dataset using the text dataset
- Training a TTS model using the voice dataset
Building a text dataset
Training a TTS model requires a text dataset and a voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and to record that learned information in a file (the model). When speech is later synthesized, the recorded information is used to convert text to speech.
Type of text to choose
When choosing text for the dataset, it is important to think about the context in which the trained TTS model will be used. For example, if you train the model on legal texts and then use it to read everyday speech, it may not meet your expectations. It is like training for competitions: if one wants to compete in the Olympic Games, it is necessary to train at Olympic-standard sports facilities. The TTS training program itself does not know anything about the language, so it will only be able to output what it was taught.
Therefore, if one's goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal, normative, and formal texts; 25% literature, fiction, and poetry; 25% everyday conversations; and 25% science, technology, and art. The proportions and types can be chosen as needed for the target use.
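The mixing step above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the domain names, the in-memory sentence lists, and the function name `mix_corpora` are all assumptions made for the example; in practice the corpora would be read from files.

```python
import random

def mix_corpora(corpora, proportions, total_sentences, seed=0):
    """Sample sentences from several domain corpora in the given proportions."""
    rng = random.Random(seed)
    dataset = []
    for name, share in proportions.items():
        n = round(total_sentences * share)
        # Take at most as many sentences as the domain corpus actually has.
        dataset.extend(rng.sample(corpora[name], min(n, len(corpora[name]))))
    rng.shuffle(dataset)  # avoid long runs from a single domain
    return dataset

# Toy corpora standing in for real domain text collections.
corpora = {
    "legal": [f"Legal sentence {i}." for i in range(100)],
    "fiction": [f"Fiction sentence {i}." for i in range(100)],
    "dialogue": [f"Dialogue sentence {i}." for i in range(100)],
    "science": [f"Science sentence {i}." for i in range(100)],
}
shares = {name: 0.25 for name in corpora}  # the 25/25/25/25 split from the text
dataset = mix_corpora(corpora, shares, total_sentences=40)
print(len(dataset))  # 40 sentences, 10 per domain
```

Fixing the random seed makes the sampled dataset reproducible, which matters when the same sentence list must later be matched to recordings.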
Language accents
It is also important to keep in mind that written texts can reflect different accents or dialects: differences in vocabulary and grammar that are still intelligible to speakers of other varieties. For example, people from regions A and B of a country may use different words for the same object, and they may also have slightly different written forms of verbs. If the target language of the TTS model has such variation, it is important to address it, either by using a standardized version of the language accepted by speakers of all accents, or by making sure that all accents are fairly represented in the text dataset.
Ensuring word-level diversity of text
While choosing text from diverse domains is the first step toward diversity, in some domains, especially legal and formal texts, the vocabulary can be quite repetitive. The diversity of the text dataset must therefore be ensured at the word level, not only at the domain level. For example, if we want our TTS model to read all words with good quality, then those words need to be fed to the system in the training phase.
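A quick word-level check can reveal how repetitive a candidate corpus is. The sketch below, with an assumed function name and toy input, counts distinct words and measures how much of the text the ten most frequent words account for; a very high share suggests the corpus needs more varied material.

```python
from collections import Counter

def lexical_diversity(sentences):
    """Report distinct-word count and the share of the 10 most frequent words."""
    words = [w.lower().strip(".,!?") for s in sentences for w in s.split()]
    counts = Counter(words)
    total = len(words)
    top10_share = sum(c for _, c in counts.most_common(10)) / total
    return {"total_words": total,
            "distinct_words": len(counts),
            "top10_share": round(top10_share, 2)}

# A deliberately repetitive toy corpus: only 4 distinct words in 10 tokens.
report = lexical_diversity(["the law applies the law", "the law is the law"])
print(report)
```

For a real corpus one would compare the distinct-word count against a reference such as LJ Speech's 13,821 distinct words.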
Licensing issues
It is important that the text dataset has a license that corresponds to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from the public domain. One widespread way of marking such texts is the CC0 “No Rights Reserved” public-domain dedication. To find public-domain texts, one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, and directives are in the public domain by definition. In other domains, copyright typically expires after a certain period, for instance, 50 years after the death of the author. By examining local legislation, it is usually possible to find a sufficient amount of public-domain text. If the target use is proprietary (commercial), then one can instead agree with the authors of the texts included in the dataset to share the benefits earned from the product.
Amount of text required
The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25–40 hours of voice recordings are enough to build a good model. We can use the LJ Speech dataset as a reference to determine how much text we need. LJ Speech is 24 hours long and contains 13,100 sound clips with an average of 17 words per clip. The dataset contains 225,715 words in total, of which 13,821 are distinct. To put it simply, one needs about 13,000 sentences of various lengths, averaging 17 words each. Emulating LJ Speech is useful for beginners because it has been tested by many TTS practitioners with good results.
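The sizing arithmetic above can be turned into a small estimator. This is a rough linear extrapolation from the LJ Speech figures quoted in the text; the function name and the assumption that clip length scales linearly are illustrative, not a published formula.

```python
# Reference figures for the LJ Speech dataset, as quoted in this tutorial.
LJ_HOURS = 24
LJ_CLIPS = 13_100
LJ_AVG_WORDS = 17

def estimate_dataset_size(target_hours):
    """Scale LJ Speech's clip count linearly to a target recording length."""
    clips = round(target_hours / LJ_HOURS * LJ_CLIPS)
    words = clips * LJ_AVG_WORDS
    return clips, words

# For a 30-hour voice dataset (within the suggested 25-40 hour range):
clips, words = estimate_dataset_size(30)
print(clips, words)  # 16375 clips, 278375 words
```

This gives a ballpark sentence budget before any text is collected, so the text-gathering effort can be planned up front.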
Phoneme coverage
Besides domain diversity, check your text for phoneme coverage in your language. Good coverage of all phonemes is recommended.
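A coverage check can be sketched as below. Note the simplification: the example compares graphemes (letters) against a target inventory, because a true phoneme check requires a grapheme-to-phoneme tool for the specific language; the function name and the five-vowel toy inventory are assumptions for illustration.

```python
def coverage(sentences, inventory):
    """Report which units of a target inventory appear in the text.

    Graphemes are used as a crude proxy; a real check would first run the
    text through a grapheme-to-phoneme converter for the target language.
    """
    seen = set("".join(sentences).lower())
    missing = sorted(set(inventory) - seen)
    covered = 1 - len(missing) / len(inventory)
    return covered, missing

# Hypothetical five-vowel inventory; 'e' never occurs in the sample text.
covered, missing = coverage(["A quick brown fox"], "aeiou")
print(covered, missing)  # 0.8 ['e']
```

Sentences containing the missing units can then be added to the dataset until coverage reaches 100%.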
Punctuation marks
If you want your trained model to synthesize questions and exclamations, which are pronounced differently from plain statements, your recording text should include sentences ending in question marks and exclamation marks.
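This is easy to verify with a tally of sentence-final punctuation. The function name and sample sentences below are illustrative assumptions:

```python
from collections import Counter

def final_punctuation_counts(sentences):
    """Tally sentence-final characters to verify questions and exclamations exist."""
    return Counter(s.rstrip()[-1] for s in sentences if s.rstrip())

counts = final_punctuation_counts(
    ["Is it ready?", "Yes.", "Great!", "It works."]
)
print(counts["?"], counts["!"], counts["."])  # 1 1 2
```

If the `?` or `!` count is zero or very small, add interrogative and exclamatory sentences to the dataset before recording begins.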
Preparing a voice dataset
Selecting a speaker
Selecting a speaker to record the text dataset is the next step. Ideally, the final TTS model will speak exactly the way the original speaker speaks. Make sure that the speaker's voice is "pleasant" to listeners. Obviously it is impossible to get feedback from all future users, but there should be a reasonable effort to ensure that many listeners will be comfortable hearing the voice frequently and for long periods. One way of achieving this is to record several speaker candidates and ask listeners for their opinions using an easy-to-understand scoring system. From a technical perspective, the voice itself does not matter for training; if it is necessary for some reason to have a high-pitched voice, that is perfectly fine.
One important aspect of selecting a speaker is making sure that the speaker will be available for the entire recording phase of the voice dataset. For instance, working with a person who has a good voice but a very busy schedule will not be good for the recording project. It is essential that the entire dataset is recorded by one person (one voice).
Ethical issues
There are ethical issues that one must be aware of when recording a voice. Voice is a distinct characteristic of a person's identity, and it is usually possible to identify a known person from a recording. The ethical issues relate to the potential use of TTS voices once they are ready: when a voice is made available to the public, users can make it speak not only benign texts but also offensive, obscene, racist, or otherwise illegal ones. Will the original speaker be comfortable with such use? Will there be a risk for the speaker? These questions must have clear answers. There are several ways to mitigate the risks. First, inform the speaker fully of the potential risks and secure his or her informed consent. Second, consider selecting a relatively unknown person with a good voice and keeping the person's identity anonymous. Third, make sure that legal disclaimers are in place against any malicious use of the voice. Fourth, work with and mobilize the community to monitor that the model is used only for legal purposes.
Recording setup
This page is being developed (sentence by sentence, every week).