Text cleaning

From Voice Technology Wiki
Latest revision as of 16:33, 6 December 2021


Text cleaning is the process of removing or replacing characters in a text corpus that cannot be read aloud directly, such as digits and abbreviations, before the text is recorded. When you use a TTS model, you also have to clean the text before passing it to the synthesizer, unless text cleaning is already part of the TTS process itself. Here are some examples of text to be cleaned.

Numbers

Numbers should be replaced with their written form.

You have 3 timers set. ==> You have three timers set.
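
The replacement above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `number_to_words` and `spell_out_numbers` are hypothetical, and it assumes English text and standalone integers from 0 to 999. Larger numbers, decimals, and digits attached to letters (such as "5kg" or "3rd") are left untouched.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def spell_out_numbers(text: str) -> str:
    """Replace standalone numbers of up to three digits with words."""
    # \b on both sides skips digits glued to letters or to longer numbers.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)


print(spell_out_numbers("You have 3 timers set."))
```

A production cleaner would also need rules for large numbers, decimals, and currency, which this sketch deliberately omits.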

Time and date

Today is Monday, November 3rd. ==> Today is Monday, November the third.
It is 2021. ==> It is twenty twenty-one.
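
One possible way to handle such dates and years is sketched below. The names `year_to_words` and `clean_dates` are hypothetical. The sketch assumes English text, keeps the month name, reads day ordinals as "the third", and reads four-digit numbers from 1000 to 2099 as years in pairs of two digits. It cannot tell a year from a quantity (for example "1500 people"); that decision needs context.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Ordinals for days of the month, index 0 is "first".
ORDINALS = (
    "first second third fourth fifth sixth seventh eighth ninth tenth "
    "eleventh twelfth thirteenth fourteenth fifteenth sixteenth seventeenth "
    "eighteenth nineteenth twentieth twenty-first twenty-second twenty-third "
    "twenty-fourth twenty-fifth twenty-sixth twenty-seventh twenty-eighth "
    "twenty-ninth thirtieth thirty-first"
).split()


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def year_to_words(year: int) -> str:
    """Read a year from 1000 to 2099 the way it is usually spoken."""
    if not 1000 <= year <= 2099:
        raise ValueError("only 1000..2099 is supported in this sketch")
    high, low = divmod(year, 100)
    if low == 0:
        if high % 10 == 0:                      # 1000, 2000
            return ONES[high // 10] + " thousand"
        return two_digits(high) + " hundred"    # 1900
    if high % 10 == 0 and low < 10:             # 2005
        return ONES[high // 10] + " thousand " + ONES[low]
    if low < 10:                                # 1905
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


# Day ordinals such as "3rd"; the lookbehind avoids producing "the the".
ORDINAL_RE = re.compile(r"(?<!the )\b([1-9]|[12]\d|3[01])(?:st|nd|rd|th)\b",
                        re.IGNORECASE)
ORDINAL_AFTER_THE_RE = re.compile(r"\b([1-9]|[12]\d|3[01])(?:st|nd|rd|th)\b",
                                  re.IGNORECASE)
YEAR_RE = re.compile(r"\b(1\d{3}|20\d{2})\b")


def clean_dates(text: str) -> str:
    """Spell out day ordinals and four-digit years."""
    text = ORDINAL_RE.sub(
        lambda m: "the " + ORDINALS[int(m.group(1)) - 1], text)
    # Any ordinal still left was already preceded by "the".
    text = ORDINAL_AFTER_THE_RE.sub(
        lambda m: ORDINALS[int(m.group(1)) - 1], text)
    return YEAR_RE.sub(lambda m: year_to_words(int(m.group(1))), text)


print(clean_dates("Today is Monday, November 3rd."))
print(clean_dates("It is 2021."))
```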

Abbreviations

Let's go to Dr. John Doe. ==> Let's go to doctor John Doe.
Weight is 5kg. ==> Weight is five kilograms.
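
Abbreviation expansion is typically table-driven. The sketch below is a minimal illustration, not a complete rule set: the `TITLES` and `UNITS` tables and the function `expand_abbreviations` are hypothetical, the entries are examples only, and the unit rule assumes a one- or two-digit integer directly before the unit. Pluralizing the unit name for quantities other than one is also an assumption of this sketch.

```python
import re

# Example lookup tables; a real cleaner would need far more entries.
TITLES = {"Dr.": "doctor", "Mr.": "mister", "Prof.": "professor"}
UNITS = {"kg": "kilogram", "g": "gram", "km": "kilometer",
         "cm": "centimeter", "m": "meter"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


# A title counts only when a capitalized name follows it.
TITLE_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in TITLES) + r")(?=\s+[A-Z])")
# Longest units first so "kg" is tried before "g".
UNIT_RE = re.compile(
    r"\b(\d{1,2})\s?("
    + "|".join(sorted(UNITS, key=len, reverse=True)) + r")\b")


def _expand_unit(match: re.Match) -> str:
    amount = int(match.group(1))
    name = UNITS[match.group(2)]
    # Use the plural form for every amount except exactly one.
    return two_digits(amount) + " " + name + ("" if amount == 1 else "s")


def expand_abbreviations(text: str) -> str:
    """Expand known titles and number-plus-unit abbreviations."""
    text = TITLE_RE.sub(lambda m: TITLES[m.group(1)], text)
    return UNIT_RE.sub(_expand_unit, text)


print(expand_abbreviations("Let's go to Dr. John Doe."))
print(expand_abbreviations("Weight is 5kg."))
```

Keeping the tables separate from the matching logic makes it easy to add entries without touching the regular expressions.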