From Voice Technology Wiki

What is MaryTTS?

Mary (Modular Architecture for Research in sYnthesis) Text-to-Speech is an open-source (GNU LGPL license[1]), multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of DFKI’s Language Technology Lab and the Institute of Phonetics at Saarland University, Germany. It is now maintained by the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI.[2]

MaryTTS has been around for a very long time. Version 3.0 dates back to 2006, long before Deep Learning was a broadly known term, and the last official release was version 5.2 in 2016[3]. The system uses unit selection and HMM-based techniques to build voices (today probably called AI, back then called statistics). If you want to learn more about the architecture, check out the official documentation.

There is still activity on the GitHub page, and internally there has been some major code refactoring, but it is currently unclear whether there will ever be another official release. An unofficial snapshot release was built for the SEPIA Framework; it runs stably on Java 11 but should be considered experimental: MaryTTS 6.0 snapshot (Docker container).

Advantages of MaryTTS

MaryTTS has certain advantages compared to modern Deep Learning systems or classical, synthetic engines like eSpeak:

  • Voice quality depends strongly on the model but can be surprisingly good: not state-of-the-art, but much better than a synthetic voice.
  • Audio generation is very fast, with a real-time factor (RTF) of 0.2 to 0.5 on a Raspberry Pi 4 depending on the selected voice[4], meaning it can actually be used on edge devices.
  • RAM consumption is moderate, but you should probably reserve around 256-512 MB.
  • Installation is super easy and it runs on Windows, Mac and Linux (any system that can run Java 8).
  • Language support is very good: German, British and American English, French, Italian, Luxembourgish, Russian, Swedish, Telugu, Turkish and more.
  • Pronunciation of times, dates, temperatures etc. can be very good. MaryTTS uses an extensive, handcrafted set of rules (and statistics) to handle this.[5]
  • MaryTTS is server-based, meaning it offers an HTTP REST API out of the box to synthesize text.

Models

Voice models can be downloaded via script from inside the release version or using these links:


Installation

Installation is super easy:

  • Install Java 8 or 11 (Debian: 'sudo apt-get install openjdk-11-jdk-headless')
  • Download the release ZIP file: v5.2 official or v6.0 snapshot
  • Extract the ZIP and start the server (run scripts are in 'marytts\bin')

By default you can access the server in your browser via: http://localhost:59125/
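
Once the server is running, the HTTP API can be called from a script. The following is a minimal Python sketch, assuming a MaryTTS 5.x server on the default address and its `/process` endpoint with the `INPUT_TEXT`, `INPUT_TYPE`, `OUTPUT_TYPE`, `AUDIO`, `LOCALE` and `VOICE` query parameters; the 6.0 snapshot may behave differently. The helper names `build_process_url` and `synthesize` are made up for this example, and the voice name is the one used in the performance test on this page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Default server address from this page; change it if your setup differs.
MARY_BASE_URL = "http://localhost:59125"


def build_process_url(text, locale="en_GB", voice=None, base_url=MARY_BASE_URL):
    """Build a GET URL for the /process endpoint (assumed MaryTTS 5.x parameters)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    if voice:
        # Only send VOICE when given, so the server can fall back to its default.
        params["VOICE"] = voice
    return f"{base_url}/process?{urlencode(params)}"


def synthesize(text, out_path="out.wav", **kwargs):
    """Request audio for `text` and write the returned WAV bytes to `out_path`."""
    with urlopen(build_process_url(text, **kwargs), timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as wav_file:
        wav_file.write(audio)
    return out_path
```

With the server running and the voice installed, `synthesize("Hello this is a test", voice="dfki-spike-hsmm")` should write the audio to `out.wav`. Building the URL in a separate function makes it easy to inspect the request before sending it.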

In a production system you might want to run MaryTTS behind a reverse proxy (like Nginx or Apache) to avoid CORS issues.

Performance

To make performance results comparable, we should use identical phrases for testing.

Voice: dfki-spike-hsmm en_GB

  • Test system: Raspberry Pi 4 (4 GB)
  • Sentence: "Hello this is a test"
  • Run-time: 0.33 s
  • Real-time-factor: 0.19
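
The page does not spell out how the real-time factor is computed. The sketch below assumes the common definition, run-time divided by the duration of the generated audio, so values below 1 mean faster than real time. The function names are made up for this example; the audio duration is read from a WAV file with Python's standard `wave` module.

```python
import wave


def wav_duration_seconds(path):
    """Return the playback duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(run_time_s, audio_duration_s):
    """RTF = synthesis run-time / audio duration (assumed definition)."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return run_time_s / audio_duration_s


# Figures from the measurement above: 0.33 s run-time at RTF 0.19
# imply roughly this much generated audio.
implied_duration = 0.33 / 0.19
print(f"implied audio duration: {implied_duration:.2f} s")  # 1.74 s
```

To reproduce a measurement, time the synthesis request (for example with `time.perf_counter()`), save the result as a WAV file, and pass both values to `real_time_factor`.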