General TTS tutorial

Revision as of 09:54, 22 February 2022 by Nmammedov

This tutorial contains non-technical information on TTS, or Text-to-Speech. It is meant for people who are new to the field of TTS and are trying to figure out which aspects to check and which approaches are possible.

Introduction

TTS stands for Text-To-Speech, a technology that converts text to audio in the form of human speech. There are many different methods of converting textual input to speech output. Some people use the term speech synthesis to refer to TTS; both mean the same thing. Currently, machine learning has become the dominant approach to text-to-speech conversion.

To use TTS, you have two options, depending on your target language and on whether a tool or service for that language already exists. If you want to generate speech in a widely used language, TTS services have already been developed for languages such as English, German, French and Russian, and major platforms such as Google and Windows support these languages by default. You can use these existing services to generate output. For example, you can enter text in Google Translate and check whether it reads the text aloud in your language. This does not mean that all TTS technologies are available to you for free. It just means that a free version likely exists or can be built without too much effort using open-source technologies.

If you are targeting a language that is not widely used, you will likely need to build a service that supports TTS in your language. Usually, building a service means taking open-source technologies and training the software to convert text to speech in your language. There are well-developed open-source TTS platforms (examples?) that can be used if you are ready to invest your time and energy. This tutorial aims to provide general information for people who want to develop a TTS application for their language of choice. It can be useful for both widely used and under-represented languages.

To be more precise, this tutorial covers how to develop a TTS voice. Understanding TTS technology in depth requires a significant amount of knowledge from computer science, linguistics and statistics. The technology has already been developed and is in use in various forms and with diverse functionality. The purpose of this tutorial is to explain how to take the existing technology and apply it to a particular language.

Stages of training a TTS model

Building a TTS voice model involves several stages. The order of these stages cannot be changed, but a stage can be skipped if its output is already available, in which case you can proceed directly to the next stage.

  1. Building a text dataset
  2. Preparing a voice dataset using the text dataset
  3. Training a TTS model using the voice dataset

Building a text dataset

Training a TTS model requires a text dataset and a voice dataset. To put it simply, TTS training aims to learn the relationship between text and sound and to record that learned information in a file. When speech is synthesized later, the recorded information is used to convert text to speech.
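To make the link between the text dataset and the voice dataset more concrete, here is a minimal sketch in Python of one way the two could be paired. It assumes each sentence will later be recorded into its own audio file. The file naming scheme (`utt_0001.wav`), the pipe-separated layout and the function names are illustrative assumptions, not a format required by any particular TTS toolkit.

```python
def build_metadata(sentences, prefix="utt"):
    """Pair each non-empty sentence with a hypothetical audio file name."""
    # Collapse repeated whitespace and drop sentences that end up empty.
    cleaned = [" ".join(sentence.split()) for sentence in sentences]
    cleaned = [text for text in cleaned if text]
    # Number the remaining sentences so each one maps to one recording.
    return [
        (f"{prefix}_{index:04d}.wav", text)
        for index, text in enumerate(cleaned, start=1)
    ]


def format_metadata(rows):
    """Render the pairs as pipe-separated lines, one recording per line.

    Assumes the sentences themselves do not contain the "|" character.
    """
    return "".join(f"{file_name}|{text}\n" for file_name, text in rows)


if __name__ == "__main__":
    rows = build_metadata(["Hello  world.", "  ", "Second sentence."])
    print(format_metadata(rows), end="")
```

A list like this can serve as the recording script for the voice dataset: the speaker reads each line, and the recording is saved under the file name on that line.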

When choosing text for the dataset, it is important to think about the context in which the trained TTS model will be used. For example, if you train the model on legal texts and then use it to read everyday speech, the results may not meet your expectations.
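One rough way to spot such a mismatch before recording anything is to measure how many words of a sample from the intended context never appear in the candidate training text. The sketch below is only an illustration under simple assumptions: words are split on non-word characters, case is ignored, and the function names are made up for this example. Word overlap is a coarse proxy; it does not measure coverage of sounds or pronunciation.

```python
import re


def tokenize(text):
    """Split text into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def unseen_word_rate(training_text, target_text):
    """Return the fraction of target words that never occur in the training text."""
    known_words = set(tokenize(training_text))
    target_words = tokenize(target_text)
    if not target_words:
        return 0.0
    unseen = sum(1 for word in target_words if word not in known_words)
    return unseen / len(target_words)


if __name__ == "__main__":
    legal_sample = "The court finds the defendant liable"
    everyday_sample = "The dog finds the ball"
    print(unseen_word_rate(legal_sample, everyday_sample))
```

A high rate suggests that the text dataset should be extended with material closer to the intended use.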


The page is being developed.