General TTS tutorial - Revision history

Nmammedov: /* Type of text to choose */

2022-03-07T14:06:56Z

Type of text to choose

@@ Line 26: / Line 26: @@
 Therefore, if one's goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal-normative-formal texts, 25% literature, fiction, poetry, 25% everyday conversations, 25% science-technology, art. The proportions and types can be chosen arbitrarily depending on the target use.
 === Ensuring word-level diversity of text ===
@@ Line 42: / Line 45: @@
 If you want your trained model to synthesize questions and exclamations, which are pronounced differently from ordinary sentences, the text you record should contain sentences ending in question marks and exclamation marks.
-''The page is being developed. (sentence by sentence everyday)''
+== Preparing a voice dataset ==

Thorsten: Added phoneme coverage and question marks.

2022-02-27T14:31:25Z

← Older revision		Revision as of 16:31, 27 February 2022
Line 36:		Line 36:
	The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can learn from the example of [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] dataset to determine the amount of text we need to build our dataset. LJ-Speech dataset is 24 hours long, contains 13,100 sound clips with an average of 17 words in each clip. The total amount of words in the dataset is 225,715 and 13,821 of them are distinct words. To put it simply one needs about 13000 sentences of various lengths but with average number of 17 words in them. Emulating the example of LJ-Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.		The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can learn from the example of [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] dataset to determine the amount of text we need to build our dataset. LJ-Speech dataset is 24 hours long, contains 13,100 sound clips with an average of 17 words in each clip. The total amount of words in the dataset is 225,715 and 13,821 of them are distinct words. To put it simply one needs about 13000 sentences of various lengths but with average number of 17 words in them. Emulating the example of LJ-Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.

			=== Phoneme coverage ===
			Besides diversity, check your text for phoneme coverage in your language. Good coverage of all phonemes is recommended.

			=== Punctuation mark ===
			If you want your trained model to synthesize questions and exclamations, which are pronounced differently from ordinary sentences, the text you record should contain sentences ending in question marks and exclamation marks.

	''The page is being developed. (sentence by sentence everyday)''		''The page is being developed. (sentence by sentence everyday)''
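The two checks added in this revision (phoneme coverage and punctuation marks) can be roughed out in a few lines. This is only a sketch under stated assumptions: the function names, the tiny `lexicon` (a word-to-phoneme mapping) and the `inventory` (the phoneme set of your language) are hypothetical placeholders that you would have to supply from a real pronunciation dictionary.

```python
from collections import Counter


def sentence_type_counts(sentences):
    """Count sentences by their final punctuation mark."""
    counts = Counter()
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        if sentence.endswith("?"):
            counts["question"] += 1
        elif sentence.endswith("!"):
            counts["exclamation"] += 1
        else:
            counts["statement"] += 1
    return counts


def missing_phonemes(sentences, lexicon, inventory):
    """Return the phonemes in `inventory` that never occur in `sentences`.

    `lexicon` is a hypothetical dict mapping a lowercase word to its list of
    phonemes; words that are not in the lexicon are silently skipped.
    """
    seen = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            seen.update(lexicon.get(word.strip(".,!?;:"), []))
    return set(inventory) - seen
```

If `missing_phonemes` returns a non-empty set, or the question and exclamation counts are close to zero, more sentences of the missing kind can be added before recording starts.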

Nmammedov: /* Text domain */

2022-02-27T05:57:56Z

Text domain

← Older revision		Revision as of 07:57, 27 February 2022
Line 22:		Line 22:
	Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech.		Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech.

	=== ~~Text domain~~ ===		=== Type of text to choose ===
	When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then try to use the model to read everyday speech, then it may not meet your expectations. It is like training for a competition: if one wants to compete in the Olympic Games, one needs to train at Olympic-standard sports facilities. The TTS training program itself knows nothing about the language, so it can output only what it was taught.		When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then try to use the model to read everyday speech, then it may not meet your expectations. It is like training for a competition: if one wants to compete in the Olympic Games, one needs to train at Olympic-standard sports facilities. The TTS training program itself knows nothing about the language, so it can output only what it was taught.

	Therefore, if one's goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal-normative-formal texts, 25% literature, fiction, poetry, 25% everyday conversations, 25% science-technology, art. The proportions and types can be chosen arbitrarily depending on the target use.		Therefore, if one's goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal-normative-formal texts, 25% literature, fiction, poetry, 25% everyday conversations, 25% science-technology, art. The proportions and types can be chosen arbitrarily depending on the target use.

	=== ~~Word~~-level diversity of text ===		=== Ensuring word-level diversity of text ===
	While choosing text from diverse domains is the first step of ensuring diversity, in fact, in some domains, especially in legal-formal texts, the words can be quite repetitive. The diversity of the text dataset must be at word level not at domain-category level. For example, if we want our TTS model to read all words with good quality then those words need to be fed to the system in the training phase.		While choosing text from diverse domains is the first step of ensuring diversity, in fact, in some domains, especially in legal-formal texts, the words can be quite repetitive. The diversity of the text dataset must be at word level not at domain-category level. For example, if we want our TTS model to read all words with good quality then those words need to be fed to the system in the training phase.

	=== ~~Text license~~ ===		=== Licensing issues ===
	It is important that text dataset has license that corresponds to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights. If the target use is proprietary (commercial), then one can agree with the authors of the texts included in the dataset to share benefits earned from the product.		It is important that text dataset has license that corresponds to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights. If the target use is proprietary (commercial), then one can agree with the authors of the texts included in the dataset to share benefits earned from the product.

	=== ~~Text amount~~ ===		=== Amount of text required ===
	The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can learn from the example of [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] dataset to determine the amount of text we need to build our dataset. LJ-Speech dataset is 24 hours long, contains 13,100 sound clips with an average of 17 words in each clip. The total amount of words in the dataset is 225,715 and 13,821 of them are distinct words. To put it simply one needs about 13000 sentences of various lengths but with average number of 17 words in them. Emulating the example of LJ-Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.		The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can learn from the example of [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] dataset to determine the amount of text we need to build our dataset. LJ-Speech dataset is 24 hours long, contains 13,100 sound clips with an average of 17 words in each clip. The total amount of words in the dataset is 225,715 and 13,821 of them are distinct words. To put it simply one needs about 13000 sentences of various lengths but with average number of 17 words in them. Emulating the example of LJ-Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.
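The sizing rule in the "Amount of text required" paragraph can be checked with a short calculation. The sketch below is an illustration rather than part of the tutorial: the function names are hypothetical, the constants are the LJ-Speech figures quoted above, and `sentences_needed` assumes that your speaker reads at roughly the same rate as the LJ-Speech recordings.

```python
import re

# LJ-Speech figures quoted in the paragraph above.
LJ_HOURS = 24
LJ_CLIPS = 13_100
LJ_WORDS = 225_715


def corpus_stats(sentences):
    """Return sentence count, total words, distinct words and mean words per sentence."""
    tokens = [w.lower() for s in sentences for w in re.findall(r"[\w']+", s)]
    n = len(sentences)
    return {
        "sentences": n,
        "total_words": len(tokens),
        "distinct_words": len(set(tokens)),
        "mean_words_per_sentence": len(tokens) / n if n else 0.0,
    }


def sentences_needed(target_hours, mean_words_per_sentence):
    """Estimate how many sentences fill `target_hours` of recordings.

    Assumption: the speaking rate matches LJ-Speech (words per recorded hour).
    """
    words_per_hour = LJ_WORDS / LJ_HOURS
    return round(target_hours * words_per_hour / mean_words_per_sentence)
```

Running `corpus_stats` on a draft text dataset gives the same four numbers that the paragraph reports for LJ-Speech, so the two can be compared directly. With the LJ-Speech averages as input, `sentences_needed(24, LJ_WORDS / LJ_CLIPS)` reproduces the 13,100 clips mentioned above.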

Nmammedov at 14:42, 23 February 2022

2022-02-23T14:42:22Z

← Older revision		Revision as of 16:42, 23 February 2022
Line 31:		Line 31:

	=== Text license ===		=== Text license ===
	It is important that text dataset has license that ~~correspond~~ to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights.		It is important that text dataset has license that corresponds to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights. If the target use is proprietary (commercial), then one can agree with the authors of the texts included in the dataset to share benefits earned from the product.

			=== Text amount ===
			The size of the text dataset depends on the size of the target voice dataset. Experts suggest that 25-40 hours of voice recordings are enough to build a good model. We can learn from the example of [https://keithito.com/LJ-Speech-Dataset/ LJ-Speech-Dataset] dataset to determine the amount of text we need to build our dataset. LJ-Speech dataset is 24 hours long, contains 13,100 sound clips with an average of 17 words in each clip. The total amount of words in the dataset is 225,715 and 13,821 of them are distinct words. To put it simply one needs about 13000 sentences of various lengths but with average number of 17 words in them. Emulating the example of LJ-Speech dataset can be useful for beginners because it has been tested by many TTS experts with good results.



	''The page is being developed. (sentence by sentence everyday)''		''The page is being developed. (sentence by sentence everyday)''

Nmammedov at 14:28, 23 February 2022

2022-02-23T14:28:05Z

← Older revision		Revision as of 16:28, 23 February 2022
Line 22:		Line 22:
	Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech.		Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech.

	When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then use the model to read everyday speech, then it may not meet your expectations.		=== Text domain ===
			When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then try to use the model to read everyday speech, then it may not meet your expectations. It is like training for a competition: if one wants to compete in the Olympic Games, one needs to train at Olympic-standard sports facilities. The TTS training program itself knows nothing about the language, so it can output only what it was taught.

			Therefore, if one's goal is to synthesize speech from text in diverse domains, it is necessary to include such texts in the dataset. For example, one can combine texts from different domains: 25% legal-normative-formal texts, 25% literature, fiction, poetry, 25% everyday conversations, 25% science-technology, art. The proportions and types can be chosen arbitrarily depending on the target use.

	''The page is being developed.''		=== Word-level diversity of text ===
			While choosing text from diverse domains is the first step of ensuring diversity, in fact, in some domains, especially in legal-formal texts, the words can be quite repetitive. The diversity of the text dataset must be at word level not at domain-category level. For example, if we want our TTS model to read all words with good quality then those words need to be fed to the system in the training phase.

			=== Text license ===
			It is important that text dataset has license that correspond to its target use. Ideally, to ensure unlimited free use of the final output, text should be chosen from public domain. One widespread example of such a license is CC0 “No Rights Reserved”. To find "no rights reserved" texts one needs to look at the copyright legislation of the country in which the text dataset is being built. Usually, constitutions, laws, regulations, forms, directives are public domain texts. They are public domain by definition. In other domains usually the copyright expires after passage of a certain period of time, for instance, 50 years after the death of the author. By examining local legislation and researching it is usually possible to find sufficient amount of text with public domain rights.




			''The page is being developed. (sentence by sentence everyday)''
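The domain mix described in the "Text domain" section added in this revision (for example, four domains at 25% each) could be assembled along the following lines. This is a minimal sketch, not a prescribed procedure: `mix_domains`, the `pools` mapping (domain name to candidate sentences) and the `shares` mapping (domain name to fraction) are hypothetical names chosen for illustration.

```python
import random


def mix_domains(pools, shares, total, seed=0):
    """Draw about `total` sentences from `pools` according to `shares`.

    `shares` maps each domain to a fraction; the fractions must sum to 1.
    Each quota is rounded separately, so the result can differ from `total`
    by a few sentences when the shares do not divide it evenly.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    selected = []
    for domain, share in shares.items():
        quota = round(total * share)
        pool = pools[domain]
        if quota > len(pool):
            raise ValueError(f"not enough sentences in domain {domain!r}")
        selected.extend(rng.sample(pool, quota))
    rng.shuffle(selected)
    return selected
```

Because the quotas are explicit, changing the proportions for a different target use only means editing the `shares` mapping.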

Nmammedov at 08:54, 22 February 2022

2022-02-22T08:54:59Z

← Older revision		Revision as of 10:54, 22 February 2022
Line 18:		Line 18:
	# Preparing a voice dataset using the text dataset		# Preparing a voice dataset using the text dataset
	# Training a TTS model using the voice dataset		# Training a TTS model using the voice dataset

			== Building a text dataset ==
			Training a TTS model requires text and voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and record that learned information in a file. Then when speech is synthesized the recorded information can be used to convert text to speech.

			When choosing text for the dataset it is important to think about the context in which the trained TTS model will be used. For example, if you use legal texts to train the model and then use the model to read everyday speech, then it may not meet your expectations.


			''The page is being developed.''

Nmammedov at 10:03, 21 February 2022

2022-02-21T10:03:57Z

← Older revision		Revision as of 12:03, 21 February 2022
Line 17:		Line 17:
	# Building a text dataset		# Building a text dataset
	# Preparing a voice dataset using the text dataset		# Preparing a voice dataset using the text dataset
	# Training a TTS model ~~used~~		# Training a TTS model using the voice dataset

Nmammedov: /* Introduction */

2022-02-21T10:03:18Z

Introduction

@@ Line 3: / Line 3: @@
 This tutorial contains '''non-technical''' information on TTS, or Text-to-Speech. It is meant for people who are new to the field of TTS and are trying to figure out which aspects to consider and which paths they could take.
-== '''Introduction''' ==
+== Introduction ==
-TTS stands for text-to-speech technology. There are many different methods of converting textual input to audio output. Currently, the machine-learning method of text-to-speech conversion is becoming the dominant method of TTS. To use TTS you have two options: (1) use an existing application/service or (2) build a new one. There are TTS services developed for widely-used languages such as English, German, French, Russian etc. Major platforms such as Google and Windows support these languages by default. If you want to synthesize
+TTS is an acronym that stands for '''T'''ext-'''T'''o-'''S'''peech technology, which allows you to convert text to audio in the form of human speech. There are many different methods of converting textual input to speech output. Some people use the term speech synthesis to refer to TTS; both mean the same thing. Machine learning has now become the dominant method of text-to-speech conversion.

193.220.55.203: I wrote introduction

2022-02-21T09:27:37Z

← Older revision		Revision as of 11:27, 21 February 2022
Line 2:		Line 2:
	[[Category:HelloWorld]]		[[Category:HelloWorld]]
	This tutorial contains '''non-technical''' information on TTS, or Text-to-Speech. It is meant for people who are new to the field of TTS and are trying to figure out which aspects to consider and which paths they could take.		This tutorial contains '''non-technical''' information on TTS, or Text-to-Speech. It is meant for people who are new to the field of TTS and are trying to figure out which aspects to consider and which paths they could take.

			== '''Introduction''' ==
			TTS stands for text-to-speech technology. There are many different methods of converting textual input to audio output. Currently, the machine-learning method of text-to-speech conversion is becoming the dominant method of TTS. To use TTS you have two options: (1) use an existing application/service or (2) build a new one. There are TTS services developed for widely-used languages such as English, German, French, Russian etc. Major platforms such as Google and Windows support these languages by default. If you want to synthesize

Thorsten at 08:06, 19 February 2022

2022-02-19T08:06:30Z

← Older revision		Revision as of 10:06, 19 February 2022
Line 1:		Line 1:
	[[Category:TTS]]		[[Category:TTS]]
	[[Category:HelloWorld]]		[[Category:HelloWorld]]
			This tutorial contains '''non-technical''' information on TTS, or Text-to-Speech. It is meant for people who are new to the field of TTS and are trying to figure out which aspects to consider and which paths they could take.