Text cleaning

From Voice Technology Wiki
Revision as of 17:33, 6 December 2021 by Alex42 (sanitizing)


Text cleaning is the process of removing or replacing characters in a text corpus that cannot be read aloud directly, before the sentences are recorded. The sections below give examples of text that needs cleaning. When you use a TTS model, you also have to clean the text before passing it to the synthesizer, unless text cleaning is already part of the TTS process itself.

Numbers

Numbers should be replaced with their written-out form.

You have 3 timers set. ==> You have three timers set.
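
As an illustration, the replacement above could be automated along these lines. This is a minimal sketch, not the wiki's own tooling: the names `number_to_words` and `spell_out_numbers` are hypothetical, only standalone integers from 0 to 999 are handled, and English is assumed.

```python
import re

# Word tables for English; index = numeric value.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def spell_out_numbers(text: str) -> str:
    """Replace standalone 1-3 digit integers with their written form.

    Longer digit runs (e.g. years) and digits glued to letters
    (e.g. "5kg") are deliberately left for other cleaning steps.
    """
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)


print(spell_out_numbers("You have 3 timers set."))
```

For real corpora a maintained library is usually a better choice than a hand-written table, since decimals, negative numbers and other languages each need their own rules.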

Time and date

Today is Monday, November 3rd. ==> Today is Monday, November the third.
It is 2021. ==> It is twenty twenty-one.
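
A rough sketch of how ordinal days and years might be spelled out is shown here. The function names (`day_to_ordinal`, `year_to_words`, `clean_dates`) are hypothetical, and two simplifying assumptions are made: every four-digit number from 1000 to 2099 is treated as a year, and years are read in pairs of digits, which is one common English convention among several.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Ordinal forms of 1..19; index 0 is unused.
ORDINAL_ONES = ["", "first", "second", "third", "fourth", "fifth", "sixth",
                "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
                "thirteenth", "fourteenth", "fifteenth", "sixteenth",
                "seventeenth", "eighteenth", "nineteenth"]


def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def day_to_ordinal(day: int) -> str:
    """Spell out a day of the month (1..31) as an ordinal."""
    if not 1 <= day <= 31:
        raise ValueError("day must be in 1..31")
    if day < 20:
        return ORDINAL_ONES[day]
    tens, ones = divmod(day, 10)
    if ones == 0:
        return "twentieth" if tens == 2 else "thirtieth"
    return TENS[tens] + "-" + ORDINAL_ONES[ones]


def year_to_words(year: int) -> str:
    """Read a year in 1000..2099 aloud, mostly as two digit pairs."""
    if not 1000 <= year <= 2099:
        raise ValueError("only 1000..2099 is supported in this sketch")
    high, low = divmod(year, 100)
    if high % 10 == 0 and low < 10:          # 1000, 2000, 2005
        words = ONES[high // 10] + " thousand"
        return words + (" " + ONES[low] if low else "")
    if low == 0:                             # 1900 -> nineteen hundred
        return two_digits(high) + " hundred"
    if low < 10:                             # 1905 -> nineteen oh five
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def _ordinal_repl(match: re.Match) -> str:
    day = int(match.group(1))
    # Leave out-of-range values such as "45th" untouched.
    return "the " + day_to_ordinal(day) if 1 <= day <= 31 else match.group(0)


def clean_dates(text: str) -> str:
    """Expand day ordinals ("3rd") and four-digit years ("2021")."""
    text = re.sub(r"\b(\d{1,2})(?:st|nd|rd|th)\b", _ordinal_repl, text)
    return re.sub(r"\b(?:1\d{3}|20\d{2})\b",
                  lambda m: year_to_words(int(m.group())),
                  text)


print(clean_dates("Today is Monday, November 3rd."))
print(clean_dates("It is 2021."))
```

Because a bare four-digit number is not always a year, a production cleaner would likely need context (for example a preceding month name) before applying the year rule.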

Abbreviations

Let's go to Dr. John Doe. ==> Let's go to doctor John Doe.
Weight is 5kg. ==> Weight is five kilograms.
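
Abbreviations and units are typically handled with lookup tables. The sketch below assumes small, purely illustrative tables (`ABBREVIATIONS`, `UNITS`) and hypothetical function names. It only separates the number from the unit and pluralizes the unit name; spelling out the digit itself is left to a number-cleaning step.

```python
import re

# Illustrative tables only; a real corpus needs much larger ones.
ABBREVIATIONS = {"Dr": "doctor", "Mr": "mister", "Prof": "professor"}
UNITS = {"kg": "kilogram", "km": "kilometer", "g": "gram"}

# Match an abbreviation followed by its period, e.g. "Dr."
_ABBR_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\.")

# Match digits followed by a unit, with or without a space: "5kg", "5 kg".
# Longer unit symbols come first so "kg" is tried before "g".
_UNIT_RE = re.compile(
    r"\b(\d+)\s*("
    + "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))
    + r")\b")


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations (including the period) with full words."""
    return _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def expand_units(text: str) -> str:
    """Turn "5kg" into "5 kilograms"; the digit is left for a later step."""
    def repl(match: re.Match) -> str:
        amount, unit = match.group(1), UNITS[match.group(2)]
        plural = "" if int(amount) == 1 else "s"
        return f"{amount} {unit}{plural}"
    return _UNIT_RE.sub(repl, text)


print(expand_abbreviations("Let's go to Dr. John Doe."))
print(expand_units("Weight is 5kg."))
```

One caveat of this approach: an abbreviation at the very end of a sentence loses its period along with the sentence-final full stop, so the order of cleaning steps and sentence splitting matters.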