General TTS tutorial: Difference between revisions

m (Added phoneme coverage and question marks.)
 


Therefore, if the goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal, normative, and formal texts; 25% literature, fiction, and poetry; 25% everyday conversations; and 25% science, technology, and art. The proportions and types can be chosen freely depending on the target use.
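A domain mix like the one described above can be assembled with a few lines of Python. The following is only a minimal sketch: the function name <code>mix_domains</code>, the domain labels, and the idea of keeping each domain as a list of sentences are assumptions made for illustration, not part of any particular TTS toolkit.

```python
import random


def mix_domains(corpora, proportions, total, seed=0):
    """Draw about `total` sentences from per-domain lists.

    corpora:     dict mapping a domain name to a list of sentences
    proportions: dict mapping the same domain names to shares that sum to 1
    Note: each share is rounded separately, so the result can differ
    from `total` by a few sentences when the shares do not divide evenly.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    mixed = []
    for domain, share in proportions.items():
        count = round(total * share)
        pool = corpora[domain]
        if count > len(pool):
            raise ValueError(f"not enough sentences in domain {domain!r}")
        # Sample without replacement so no sentence is recorded twice.
        mixed.extend(rng.sample(pool, count))
    rng.shuffle(mixed)  # avoid long runs of a single domain in the script
    return mixed
```

Shuffling at the end is a design choice: it spreads the domains across recording sessions, so a change in the speaker's voice on one day does not affect a single domain only.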
==== Language accents ====
It is also important to keep in mind that written texts can reflect accents: differences in vocabulary and grammar that remain intelligible to people who speak other accents. For example, people from regions A and B of a country may use different words for the same object, and they may also have slightly different written forms of verbs. Therefore, if the target language of the TTS model has such variation, it is important to address it, either by using a standardized version of the language accepted by speakers of all accents or by making sure that all accents are fairly represented in the text dataset.


=== Ensuring word-level diversity of text ===
If you want your trained model to synthesize questions and exclamations, which are pronounced differently from ordinary statements, the text prepared for recording should include sentences that end with question marks and exclamation marks.
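To check whether questions and exclamations are actually present in the text dataset, one can count how sentences end. The sketch below is illustrative only: the function name <code>end_punctuation_shares</code> is hypothetical, and it assumes one sentence per string and the ASCII characters <code>?</code> and <code>!</code>, which may need adjusting for languages that use other marks.

```python
from collections import Counter


def end_punctuation_shares(sentences):
    """Return the share of sentences ending in '?', '!' or anything else."""
    counts = Counter()
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # ignore empty lines
        last = sentence[-1]
        counts[last if last in "?!" else "other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"?": 0.0, "!": 0.0, "other": 0.0}
    return {key: counts[key] / total for key in ("?", "!", "other")}
```

A share of zero for <code>?</code> or <code>!</code> indicates that the corresponding sentence type is missing from the text and should be added before recording starts.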


== Preparing a voice dataset ==
 
=== Selecting a speaker ===
Selecting a speaker to record the text dataset is the next step. Ideally, the final TTS model will speak exactly the way the original speaker speaks, so make sure that the speaker's voice is pleasant to listeners. It is impossible to get feedback from all future users, but a reasonable effort should be made to ensure that many listeners will be comfortable listening to the voice frequently and for long periods. One way to achieve this is to record the voices of candidate speakers and ask listeners for their opinions using an easy-to-understand scoring system. From a technical perspective, the type of voice does not matter for training. For example, if a high-pitched voice is needed for some reason, that is perfectly fine.
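The listener scores mentioned above can be summarized with a simple average per candidate. This is a minimal sketch under stated assumptions: the 1-to-5 scale, the candidate labels, and the function name <code>rank_candidates</code> are illustrative choices, not a prescribed procedure.

```python
from statistics import mean


def rank_candidates(ratings):
    """Rank speaker candidates by their average listener score.

    ratings: dict mapping a candidate label to a list of scores,
             for example integers from 1 (unpleasant) to 5 (very pleasant).
    Returns a list of (candidate, average) pairs, best first.
    Candidates without any scores are left out.
    """
    averages = {name: mean(scores) for name, scores in ratings.items() if scores}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

An average hides disagreement between listeners, so it can be worth looking at the individual scores as well before making a final choice.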
 
One important aspect of selecting a speaker is making sure that the speaker will be available for the entire recording phase of the voice dataset. For instance, working with a person who has a good voice but a very busy schedule puts the recording project at risk. It is essential that the entire dataset is recorded by one person (one voice).
 
=== Ethical issues ===
There are ethical issues that one must be aware of when recording a voice. A voice is a distinctive part of a person's identity, and it is usually possible to identify a known person from a recording. The ethical issues relate to how a TTS voice may be used once it is ready. When the voice is made available for public use, users can make it speak not only harmless texts but also harmful ones, including offensive, obscene, racist, or illegal content. Will the original speaker be comfortable with such use? Will it put the speaker at risk? These questions must have clear answers. There are several ways to mitigate these issues. First, fully inform the speaker of the potential risks and obtain his or her informed consent. Second, select a relatively unknown person with a good voice and keep that person's identity anonymous. Third, make sure that legal disclaimers covering malicious use of the voice are in place. Fourth, work with and mobilize the community to monitor that the model is used only for legal purposes.
 
=== Recording setup ===
forthcoming
 
 
''The page is being developed (sentence by sentence, every week).''