General TTS tutorial: Difference between revisions

From Voice Technology Wiki
m (Added phoneme coverage and question marks.)

Revision as of 16:31, 27 February 2022

This tutorial contains non-technical information on TTS (Text-to-Speech). It is meant for people who are new to the field and are trying to figure out which aspects to check and which approaches are available.

Introduction

TTS stands for Text-To-Speech, a technology that converts text into audio in the form of human speech. There are many different methods of converting textual input to speech output. Some people use the term speech synthesis to refer to TTS; both mean the same thing. Currently, machine learning has become the dominant method of text-to-speech conversion.

To use TTS you have two options, depending on the language you are targeting and on whether a tool or service for that language already exists. If you want to generate speech in a widely used language such as English, German, French or Russian, TTS services have already been developed, and major platforms such as Google and Windows support these languages by default. You can use these existing services to generate output. For example, you can give Google Translate some text and check whether it speaks your language. This does not mean that all TTS technologies are available to you for free. It just means that a free version likely exists or can be built without too much effort using open-source technologies.

If you are targeting a language that is not widely used, you will likely need to build a service that supports TTS in your language. Usually, building a service means using open-source technologies and training the software to convert text to speech in your language. There are well-developed open-source TTS platforms (examples?) that can be used if you are ready to invest your time and energy. This tutorial aims to provide general information for people who want to develop a TTS application for their language of choice. It can be useful for both widely used and under-represented languages.

To be more precise, this tutorial covers how to develop a TTS voice. Understanding TTS technology deeply requires a significant amount of knowledge from computer science, linguistics and statistics. The technology has already been developed and is in use in various forms and with diverse functionality. The purpose of this tutorial is to explain how to use the existing technology and apply it to a particular language.

Stages of training a TTS model

Building a TTS voice model involves several stages. The order of these stages cannot be changed, although a stage can be skipped if its output is already available and you can proceed directly to the next one.

  1. Building a text dataset
  2. Preparing a voice dataset using the text dataset
  3. Training a TTS model using the voice dataset

Building a text dataset

Training a TTS model requires a text dataset and a voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and to store that learned information in a file. When speech is synthesized, the stored information is used to convert text to speech.

Type of text to choose

When choosing text for the dataset, it is important to think about the context in which the trained TTS model will be used. For example, if you train the model on legal texts and then use it to read everyday speech, it may not meet your expectations. It is like training for competitions: if one wants to compete in the Olympic Games, it is necessary to train at Olympic-standard sports facilities. The TTS learning program itself does not know anything about the language, so it will be able to output only what it was taught.

Therefore, if one's goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal, normative and formal texts; 25% literature, fiction and poetry; 25% everyday conversations; 25% science, technology and art. The proportions and types can be chosen freely depending on the target use.
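The split described above can be turned into a sentence budget per domain. The following is only a minimal sketch: the function name `sentences_per_domain`, the domain labels and the total of 13,000 sentences are illustrative assumptions, not part of any TTS toolkit.

```python
def sentences_per_domain(total_sentences, proportions):
    """Split a sentence budget across domains according to their shares."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    # Take the whole-number part of each share first
    raw = {d: total_sentences * p for d, p in proportions.items()}
    counts = {d: int(v) for d, v in raw.items()}
    # Hand any leftover sentences to the domains with the largest remainders
    leftover = total_sentences - sum(counts.values())
    by_remainder = sorted(raw, key=lambda d: raw[d] - counts[d], reverse=True)
    for d in by_remainder[:leftover]:
        counts[d] += 1
    return counts


# Hypothetical domain labels matching the 25% example in the text
mix = {"legal": 0.25, "literature": 0.25, "conversation": 0.25, "science": 0.25}
plan = sentences_per_domain(13000, mix)
print(plan)
```

With four equal shares, each domain receives 3,250 of the 13,000 sentences.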

Ensuring word-level diversity of text

While choosing text from diverse domains is the first step toward ensuring diversity, words can be quite repetitive within some domains, especially in legal and formal texts. The diversity of the text dataset must therefore be ensured at the word level, not only at the domain level. For example, if we want our TTS model to read all words with good quality, then those words need to be fed to the system in the training phase.
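One simple way to check word-level diversity is to count total and distinct words in the candidate text. The sketch below is an illustration only: the helper name `word_stats` and the sample sentences are made up, and the `\w+` tokenization is a simplification that will not suit every language.

```python
import re
from collections import Counter


def word_stats(sentences):
    """Return (total words, distinct words, three most common words)."""
    words = []
    for s in sentences:
        # \w+ matches runs of letters and digits; a simplification for this sketch
        words.extend(re.findall(r"\w+", s.lower()))
    counts = Counter(words)
    return len(words), len(counts), counts.most_common(3)


# Made-up sample: two repetitive legal-style sentences and one unrelated sentence
sample = [
    "The court shall hear the case.",
    "The court shall decide the case.",
    "Birds sing at dawn.",
]
total, distinct, top = word_stats(sample)
print(total, distinct, top)
```

A small number of distinct words relative to the total, or a few words dominating the counts, suggests that the text is repetitive and that more varied sentences should be added.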

Licensing issues

It is important that the text dataset has a license that corresponds to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from the public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts, one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms and directives are public domain by definition. In other domains, copyright usually expires after a certain period of time, for instance 50 years after the death of the author. By examining local legislation and doing some research, it is usually possible to find a sufficient amount of public domain text. If the target use is proprietary (commercial), one can make an agreement with the authors of the texts included in the dataset to share the benefits earned from the product.

Amount of text required

The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can use the LJ Speech dataset (https://keithito.com/LJ-Speech-Dataset/) as an example to determine the amount of text we need. The LJ Speech dataset is about 24 hours long and contains 13,100 sound clips with an average of 17 words per clip. It contains 225,715 words in total, 13,821 of which are distinct. To put it simply, one needs about 13,000 sentences of various lengths with an average of 17 words each. Emulating the LJ Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.
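The LJ Speech figures above also allow a rough estimate for other recording lengths. The sketch below simply scales those ratios linearly; the function name `estimate_text_size` is hypothetical, and the result is only a rough guide because speaking rate and sentence length differ between languages and speakers.

```python
# Figures quoted above for the LJ Speech dataset
LJ_HOURS = 24
LJ_CLIPS = 13_100
LJ_WORDS = 225_715


def estimate_text_size(target_hours):
    """Scale the LJ Speech ratios to a target recording length (rough estimate)."""
    words_per_hour = LJ_WORDS / LJ_HOURS
    words_per_clip = LJ_WORDS / LJ_CLIPS
    words_needed = round(target_hours * words_per_hour)
    sentences_needed = round(words_needed / words_per_clip)
    return sentences_needed, words_needed


# Average clip length implied by the same figures, in seconds
seconds_per_clip = LJ_HOURS * 3600 / LJ_CLIPS

print(estimate_text_size(24))
print(round(seconds_per_clip, 1))
```

For 24 hours this reproduces the LJ Speech numbers (13,100 sentences and 225,715 words), and the implied average clip length is about 6.6 seconds.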

Phoneme coverage

Besides word diversity, check the phoneme coverage of your text for your language. Good coverage of all phonemes is recommended.
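A coverage check can be sketched as follows, assuming you already have a pronunciation lexicon (a mapping from words to phoneme lists) and a phoneme inventory for your language. Both are assumptions here: the toy lexicon, the phoneme symbols and the function name `phoneme_coverage` are invented for illustration, and a real project would take them from a pronunciation dictionary or a grapheme-to-phoneme tool.

```python
def phoneme_coverage(sentences, lexicon, inventory):
    """Count phoneme occurrences and list inventory phonemes that never appear."""
    counts = {p: 0 for p in inventory}
    unknown_words = set()
    for s in sentences:
        for word in s.lower().split():
            # Strip common punctuation attached to the word
            word = word.strip(".,!?;:\"'")
            if not word:
                continue
            if word not in lexicon:
                # Words without a known pronunciation are reported, not guessed
                unknown_words.add(word)
                continue
            for p in lexicon[word]:
                if p in counts:
                    counts[p] += 1
    missing = sorted(p for p, n in counts.items() if n == 0)
    return counts, missing, unknown_words


# Toy, hypothetical lexicon and inventory
toy_lexicon = {"cat": ["k", "ae", "t"], "sat": ["s", "ae", "t"], "the": ["dh", "ah"]}
toy_inventory = ["k", "ae", "t", "s", "dh", "ah", "m"]
counts, missing, unknown = phoneme_coverage(
    ["The cat sat.", "The dog sat!"], toy_lexicon, toy_inventory
)
print(missing, unknown)
```

Phonemes listed as missing, or with very low counts, point to the kind of sentences that should be added to the text dataset.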

Punctuation marks

Questions and exclamations are pronounced differently from ordinary statements. If you want your trained model to synthesize them, the text you prepare for recording should include sentences that end with question marks and exclamation marks.
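To see how many questions and exclamations a text dataset contains, the sentences can be counted by their final punctuation mark. This is a minimal sketch that assumes `?` and `!` are the relevant marks; some languages use other characters, and the function name `sentence_type_counts` is made up for this example.

```python
def sentence_type_counts(sentences):
    """Count sentences by final punctuation: question, exclamation or statement."""
    counts = {"question": 0, "exclamation": 0, "statement": 0}
    for s in sentences:
        # Ignore surrounding whitespace and trailing quotes or brackets
        s = s.strip().rstrip("\"')")
        if not s:
            continue
        if s.endswith("?"):
            counts["question"] += 1
        elif s.endswith("!"):
            counts["exclamation"] += 1
        else:
            counts["statement"] += 1
    return counts


result = sentence_type_counts(
    ["Where are you going?", "Stop!", "It is raining.", "Really?"]
)
print(result)
```

If one of the counts is zero or very small, the recordings will contain few or no examples of that intonation.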

The page is being developed (sentence by sentence, every day).