Open Voice Technology Wiki - User contributions [en]

Italien phoneme list (it)

2022-03-27T17:30:00Z

Solyarisoftware: table filled with some test words

{{Supported by eSpeak}}
{{Phoneme list introduction}}
{| class="wikitable"
|+
!Italian written version
!Spoken like
!Phonemes
|-
|caffè
|caffè
|
|-
|thè
|tè
|
|-
|te
|te
|
|-
|solyaris
|soliaris
|
|-
|software
|softuer
|
|-
|trade-off
|treid off
|
|}

[[Category:Phoneme list]]

Talk:Python

2022-01-12T08:26:55Z

Solyarisoftware:

I propose to add, at the end, a statement like:

Python is the de facto most commonly used programming language in the natural language processing (NLP) vertical, partly because of its huge ecosystem of open-source packages (libraries), e.g. for developing machine learning / deep learning algorithms ([[Tensorflow]], [[Pythorch]], etc.), speech recognition ([[Vosk]], [[Coqui]], etc.), sound processing, etc.--[[User:Solyarisoftware|Solyarisoftware]] ([[User talk:Solyarisoftware|talk]]) 11:25, 6 January 2022 (CET)
:Sounding good to me [[User:Solyarisoftware|Solyarisoftware]]. Just pinging [[User:Digitalica|Digitalica]] as author. Could be added in my personal opinion.--[[User:Thorsten|Thorsten]] ([[User talk:Thorsten|talk]]) 23:07, 6 January 2022 (CET)

great idea, go ahead ;-)
:I've added text as suggested by Giorgio. --[[User:Thorsten|Thorsten]] ([[User talk:Thorsten|talk]]) 17:52, 7 January 2022 (CET)

:thanks--[[User:Solyarisoftware|Solyarisoftware]] ([[User talk:Solyarisoftware|talk]]) 09:26, 12 January 2022 (CET)

Talk:Python

2022-01-06T10:27:08Z

Solyarisoftware:

2021-12-12T09:36:43Z

Solyarisoftware:

Talk:OVT:What is OpenVoice-Tech Wiki

2021-12-12T09:25:31Z

Solyarisoftware: /* about contributes on the wiki */

== Our principles ==

Thanks [[User:Solyarisoftware|Solyarisoftware]] for bringing this topic up. I've created this page as a collection of the principles of OpenVoice-Tech Wiki. Maybe we can discuss and develop our principles here. --[[User:Thorsten|Thorsten]] ([[User talk:Thorsten|talk]]) 12:59, 10 December 2021 (CET)

== about contributes on the wiki ==

Hi Thorsten, a minor duplication of the same subject here: https://openvoice-tech.net/index.php?title=OpenVoice-Tech_Wiki_talk:About

Following your sound suggestions, I guess the fair process is: ''If you disagree with written content, do not simply change it; use the "Discussion" page to discuss it with the original writer.''

thanks
--[[User:Solyarisoftware|Solyarisoftware]] ([[User talk:Solyarisoftware|talk]]) 13:34, 10 December 2021 (CET)
: If something is obviously wrong, it can be corrected, but if it's a general disagreement, it should be discussed before being edited. Do you have additional ideas for "contribution guidelines"?--[[User:Thorsten|Thorsten]] ([[User talk:Thorsten|talk]]) 18:20, 10 December 2021 (CET)

It's clear and I agree. Maybe the entire process (discussion on disagreement / major correction) could be highlighted on the main page or a guidelines page, because at first glance it wasn't so clear to me.--[[User:Solyarisoftware|Solyarisoftware]] ([[User talk:Solyarisoftware|talk]]) 10:25, 12 December 2021 (CET)

Talk:OVT:What is OpenVoice-Tech Wiki

2021-12-10T12:34:11Z

Solyarisoftware: /* about contributes on the wiki */ new section

Conversation Design

2021-12-10T08:52:58Z

Solyarisoftware: Conversation design definition

"''Conversation design'' (''CxD'') is about defining the interactions between the user and a conversational agent, based on how people communicate in real life." [https://uxdesign.cc/intro-to-conversation-design-ce3bd30e4385 cit.]

Designing a (human-to-machine) conversation draws mainly on linguistics (pragmatics, psycholinguistics, sociolinguistics) and on authoring/screenwriting. [https://developers.google.com/assistant/conversation-design/what-is-conversation-design Google], with its well-known CxD department led by [https://developers.google.com/assistant/conversation-design/learn-about-conversation James Giangola] et al., the people who conceived the Google Assistant UX, helped a few years ago to popularize concepts that have since become "common sense", such as Voice User Interface (VUI) best practices, Grice's Maxims, botpersona, persona, and multimodal conversations.

The ''conversation designer'' has a fundamental role in any enterprise team that builds professional conversational agents / virtual agents.

Open Voice Technology Wiki talk:About

2021-12-10T08:28:46Z

Solyarisoftware: /* about contributes on the wiki */ new section

== about contributes on the wiki ==

Hi Thorsten!

Thanks for your initiative here and your work on the Open German Voice Dataset (even though I don't know the German language)!
I'll try to contribute to the wiki, and I'll certainly share it on Twitter and LinkedIn.

My main personal concern when writing, for example, a definition of a concept or of a company/project on this wiki is that I'm naturally biased/opinionated about a technology or any tech solution. Any definition I could add is also certainly debatable. Even though I'm aware of the usual wiki/Wikipedia-like way of evolving and refining content (with the continuous-delivery :) contributions of many people over time), my question is:

In general, is it OK if I submit a definition that inevitably contains a personal bias/comment?

Thanks again

respect

giorgio --[[User:Solyarisoftware|Solyarisoftware]] ([[User talk:Solyarisoftware|talk]]) 09:28, 10 December 2021 (CET)

Open Voice Technology Wiki talk:About

2021-12-10T08:28:26Z

Solyarisoftware: Blanked the page

Open Voice Technology Wiki talk:About

2021-12-10T08:26:47Z

Solyarisoftware: about contributes on the wiki


Real-time-factor

2021-12-10T08:05:46Z

Solyarisoftware: RTF definition

The ''real time factor'' (''RTF'') is a common metric for measuring the speed of an automatic speech recognition (ASR) system in the decoding phase ("at run-time"). It can also be used in other contexts where an audio or video signal is processed (usually automatically) at a nearly constant rate. All in all, RTF is a measure of the latency of any (audio) processing system: not only a speech recognition engine, but also a text-to-speech engine, a [[transcoding]] engine, etc.

If it takes time f(d) to process an input of duration d, the real time factor is defined as: RTF = f(d)/d

If, for example, it takes 8 hours of computation time to process a recording of duration 2 hours, the real time factor is 4. When the real time factor is 1, the processing is done "in real time"; values below 1 mean the system is faster than real time. The value depends on the hardware and, if processing is done as a cloud-based service, also on the network bandwidth.

Usually, a state-of-the-art cloud-based speech-to-text service supplied by Google, Azure, AWS, etc. has values between 0.2 and 0.6. Note that this depends heavily on many factors, such as the network/internet bandwidth and the speech content. In the case of an on-prem ASR, the major factors are the algorithm and the hardware resources (CPU/RAM).

<syntaxhighlight lang="python">
def real_time_factor(processing_time, audio_length, decimals=2):
    """Real-Time Factor (RTF) is defined as processing-time / length-of-audio.

    Both arguments must be expressed in the same time unit (e.g. seconds).
    """
    rtf = processing_time / audio_length
    return round(rtf, decimals)
</syntaxhighlight>
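
As a rough illustration of how the RTF defined above might be measured in practice, the sketch below times an arbitrary processing function with <code>time.perf_counter</code>. The names <code>measure_rtf</code> and <code>fake_decoder</code> are hypothetical and not part of any ASR library; the fake decoder just sleeps to stand in for a real engine.

```python
import time


def measure_rtf(process, audio, audio_duration_s):
    """Run process(audio) once and return (result, rtf).

    audio_duration_s is the duration of the input audio in seconds.
    """
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    start = time.perf_counter()
    result = process(audio)
    elapsed_s = time.perf_counter() - start
    # RTF = processing time / audio duration, both in seconds.
    return result, elapsed_s / audio_duration_s


def fake_decoder(audio):
    """Hypothetical stand-in for an ASR engine: sleeps briefly, returns text."""
    time.sleep(0.05)
    return "hello world"
```

For example, <code>measure_rtf(fake_decoder, b"", 1.0)</code> returns the decoded text together with an RTF a little above 0.05 for one second of (pretend) audio. In a real measurement one would probably average over several recordings, since a single run is noisy.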

Conversational AI

2021-12-10T07:41:11Z

Solyarisoftware:

The term ''Conversational AI'', short for ''Conversational Artificial Intelligence'', is an umbrella term that has become widespread in recent years. It covers all technologies around speech recognition (ASR), synthetic voice generation (TTS), natural language generation (NLG), dialog management (DM), chatbots, [[voicebots]], and multimodal assistants in general.

The term was probably coined at [[IBM Watson]] (to be verified) and used as a synonym of ''Conversational Computing'', another term used at IBM at the time that did not catch on (to be verified).

RASA

2021-12-10T07:38:15Z

Solyarisoftware:

"''Open source machine learning tools for developers to build, improve, and deploy text-and voice-based chatbots and assistants''". (cit. [https://github.com/RasaHQ/ RASA github home page]).

RASA is probably the most important open-source tool for developing "task-oriented" conversational applications. Despite RASA's official statement, the original project was not conceived to manage voice interactions, but only [[chatbots]] with some support for GUI elements such as buttons.

The RASA architecture consists of two main components:

* ''RASA NLU'' is based on the [https://rasa.com/blog/introducing-dual-intent-and-entity-transformer-diet-state-of-the-art-performance-on-a-lightweight-architecture/ DIET algorithm], a refined, state-of-the-art intent/entities classifier.
* ''RASA Core'' (now called ''RASA Dialog Manager'') is based on the [https://rasa.com/blog/unpacking-the-ted-policy-in-rasa-open-source/ TED policy], a machine learning algorithm for managing multi-turn dialogs. It moves away from the traditional state-machine approach and instead lets conversation developers supply "''stories''": sets of intent-action sequences (conversation examples). With [https://rasa.com/blog/were-a-step-closer-to-getting-rid-of-intents/ end-to-end training], developers program the conversational agent's [[dialog manager]] by giving end-to-end turn-taking examples (the stories).

In just a few years, RASA has built a huge open community of developers and researchers. It is probably the biggest open-source project for developing on-premise, "production-ready", complex dialog systems. The whole development ecosystem revolves around the [[Python]] programming language.

== References ==

* Home page: https://rasa.com/

* Github: https://github.com/RasaHQ/

* Community forum: https://forum.rasa.com/

Glossary of voice tech

2021-12-09T16:09:12Z

Solyarisoftware: /* Voice assistant terms */

[[Category:Open Voice Tech]]

In the field of voice technology there are lots of buzzwords. Some are self-explanatory; others regularly lead to confusion. This list is intended as a glossary.

==General terms==

*[[:Category:Dataset|Dataset]]
*[[Research papers|Papers]] (''research papers'')
*[[Phonemes]]
*[[Model]]
*[[Checkpoint]]
*[[Repository]]

==STT terms==

*[[:Category:Wake words|Wake word]]
*[[Hotword]]
*[[Voice print]]
*[[Word error rate]] (''WER'')
*[[Diarization]]
*[[Barge-in]]

==TTS terms==

*

==Voice assistant terms==

*[[Conversational AI]]
*[[Natural language understanding]] (''NLU'')
*[[Utterance]]
*[[Voiceonly]]

==Machine learning==

*[[Epoch]]
*[[Step]]
*[[Batch size]]
*[[Learning rate]]
*[[Inference]]
*[[Alignment]]

Conversational AI

2021-12-09T16:07:34Z

Solyarisoftware: definition of Conversational AI

The term ''Conversational AI'', short for ''Conversational Artificial Intelligence'', is an umbrella term that has become widespread in recent years. It covers all technologies around speech recognition (ASR), synthetic voice generation (TTS), natural language generation (NLG), dialog management (DM), chatbots, voicebots, and multimodal assistants in general.

The term was probably coined at IBM (Watson) and has at times been used as a synonym of ''Conversational Computing'' (to be verified).

Glossary of voice tech

2021-12-09T16:00:31Z

Solyarisoftware: update of some keywords

[[Category:Open Voice Tech]]

In the field of voice technology there are lots of buzzwords. Some are self-explanatory; others regularly lead to confusion. This list is intended as a glossary.

==General terms==

*[[:Category:Dataset|Dataset]]
*[[Research papers|Papers]] (''research papers'')
*[[Phonemes]]
*[[Model]]
*[[Checkpoint]]
*[[Repository]]

==STT terms==

*[[:Category:Wake words|Wake word]]
*[[Hotword]]
*[[Voice print]]
*[[Word error rate]] (''WER'')
*[[Diarization]]
*[[Barge-in]]

==TTS terms==

*

==Voice assistant terms==

*[[Utterance]]
*[[Natural language understanding]] (''NLU'')
*[[Voiceonly]]

==Machine learning==

*[[Epoch]]
*[[Step]]
*[[Batch size]]
*[[Learning rate]]
*[[Inference]]
*[[Alignment]]

Natural language understanding

2021-12-09T15:55:49Z

Solyarisoftware: DIET classifier link added

Natural Language Understanding (NLU) is a misleading term, much debated in the Conversational AI / scientific community.

In recent years, especially in the chatbot engineering industry, we tend to use NLU to mean an intent/entities classifier, based on machine learning techniques (transformers, etc.). The main open source project / state of the art of this approach is probably the [https://rasa.com/blog/introducing-dual-intent-and-entity-transformer-diet-state-of-the-art-performance-on-a-lightweight-architecture/ RASA DIET classifier].
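
To make "NLU as an intent/entities classifier" concrete, here is a deliberately naive, keyword-based sketch. It is not the DIET algorithm and uses no machine learning at all; the intent names, keywords, entity pattern, and output layout are assumptions chosen purely for illustration.

```python
import re

# Hypothetical intents with their trigger keywords (illustration only).
INTENT_KEYWORDS = {
    "greet": {"hello", "hi", "hey"},
    "book_table": {"book", "reserve", "table"},
}

# Hypothetical entity: a party size such as "4 people".
PARTY_SIZE = re.compile(r"\b(\d+)\s+(?:people|persons|guests)\b")


def parse(utterance):
    """Map an utterance to an intent name plus a list of entities."""
    text = utterance.lower()
    tokens = set(re.findall(r"[a-z']+", text))
    # Score each intent by how many of its keywords occur in the utterance.
    scores = {name: len(tokens & kws) for name, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    intent = best if scores[best] > 0 else "fallback"
    entities = [
        {"entity": "party_size", "value": int(m.group(1))}
        for m in PARTY_SIZE.finditer(text)
    ]
    return {"text": utterance, "intent": intent, "entities": entities}
```

Calling <code>parse("Please book a table for 4 people")</code> yields the intent <code>book_table</code> and one <code>party_size</code> entity. A real classifier replaces the keyword counting with a trained model, but the shape of the result (one intent label plus extracted entities) is what the industry usage of "NLU" usually refers to.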

On the other hand, in linguistics and in the psycholinguistic/cognitive sciences there is great skepticism about calling an ML-based classifier of intents (and entities) "language understanding". A growing number of research linguists state that it is impossible to understand language with machine learning techniques at all (the most famous and currently debated example is probably GPT-3). One of the scientists most active in this battle is [https://ontologik.medium.com/ Walid Saba].

Voiceonly

2021-12-09T15:53:41Z

Solyarisoftware: voiceonly definition

A ''voiceonly'' (or ''voice-only'') application is, as the name suggests, a software application (such as a voice-interfaced ''chatbot'', also called a ''voicebot'' in this case) whose interface channel is exclusively voice-based, without any graphical user interface (GUI).

In recent years, the voiceonly channel has become synonymous with ''smartspeakers'', ever since the Amazon Echo device, the terminal of the cloud-based Amazon Alexa system, went into production in 2014. Before that, the traditional voice-only channel was of course the telephone with IVR voice automation, which is still alive today.

To be precise, popular smartspeakers by Amazon and Google are examples of so-called ''voicefirst'' devices (not just voiceonly): users interact with virtual assistants primarily via the voice channel (through the smartspeaker), but they can also interact with the same assistants through the chat interface of a mobile phone app.

Fun fact: the term voicefirst, and especially the hashtag ''#voicefirst'', became popular a few years ago on Twitter, perhaps used initially by [https://twitter.com/BrianRoemmele Brian Roemmele]. The term soon went viral in the voice / conversational AI community.

Digression: the step beyond the voicefirst conversational user experience (UX) is a ''multimodal'' conversational experience. Here the channel is not just speech recognition or texting (as input) and synthetic voice playback or a written prompt (as output): a conversation is truly multimodal when it is really multi-sensory, for example when the input is not just text or voice but also gaze detection, gesture detection, geolocation info, ambient sensors, etc.

User:Solyarisoftware

2021-12-09T11:14:43Z

Solyarisoftware: personal page

My name is Giorgio Robino. I'm from Genova, Italy. My nickname, solyarisoftware, is a tribute to Andrej Tarkovskij's movie SOLYARIS.

I'm an engineer and researcher in "Conversational AI" verticals. I'm generally interested in voice/voiceonly multimodal interfaces (ASR/TTS), but I'm especially focused on dialog management and task-oriented realtime "assistants" that help human operators complete real-world working tasks. I call this kind of assistant "[https://docs.google.com/presentation/d/1ieZnAdREzEGXkcO4C_XPIbS9YAnE76mB0wpP2k-yOlQ/edit#slide=id.gc0058244ce_0_10 Enterprise Voice Cobots]".

As an Italian, I'm especially focused on the Italian academic and industrial ecosystem (in the natural language processing realm), and in 2016 I organized a first Italian open conference, #convcomp2016, about "conversational computing". Afterwards I maintained the related blog, [https://www.convcomp.it www.convcomp.it], where I try to share articles about topics related to chatbots/voicebots/virtual assistants. I'm also pretty active on [https://www.twitter.com/solyarisoftware twitter], where I share and chat/rant only about conversational AI / voice-related stuff.

My current day job is "Conversational AI Tech Lead" at the Italian company [https://www.almawave.it www.almawave.it], where I'm now working on R&D projects, especially in the ehealth realm.

Of course I'm a supporter of open source and open data in the voice realm too. As free-time side projects, I have published some small open source projects:

* https://github.com/solyarisoftware/naifjs, simple state-machine based dialog manager, in nodejs
* https://github.com/solyarisoftware/voskJs, Vosk ASR offline engine API for NodeJs developers. With a simple HTTP ASR server
* https://github.com/solyarisoftware/CoquiSTTJs, Coqui STT offline engine API for NodeJs developers. With a simple HTTP ASR server
* https://github.com/solyarisoftware/jointts, a brainless concatenative text to speech
* https://github.com/solyarisoftware/webad, Web Browser Audio Detection/Speech Recording Events API
* https://github.com/solyarisoftware/Highlight.vim, Highlight vim plugin colorizes pattern of texts, with a random or specified background colors

Last but not least, I have been an ambient music maker as [http://solyaris.altervista.org/ SOLYARIS MUSIC].

Natural language understanding

2021-12-05T18:06:34Z

Solyarisoftware: NLU definition

Natural Language Understanding (NLU) is a misleading term, much debated in the Conversational AI / scientific community.

In recent years, especially in the chatbot engineering industry, we tend to use NLU to mean an intent/entities classifier, based on machine learning techniques (transformers, etc.). The main open source project / state of the art of this approach is probably the RASA DIET classifier.

On the other hand, in linguistics and in the psycholinguistic/cognitive sciences there is great skepticism about calling an ML-based classifier of intents (and entities) "language understanding". A growing number of research linguists state that it is impossible to understand language with machine learning techniques at all (the most famous and currently debated example is probably GPT-3). One of the scientists most active in this battle is [https://ontologik.medium.com/ Walid Saba].

RASA

2021-12-05T17:50:34Z

Solyarisoftware: RASA platform description

Open source machine learning tools for developers to build, improve, and deploy text-and voice-based chatbots and assistants. (cit. RASA github home).

RASA is probably the most important tool for developing "task-oriented" conversational applications. Despite RASA's official statement, it was not originally developed to manage voice interaction, but chatbots. RASA consists of two main components:

* ''RASA NLU'' is based on DIET, a refined, state-of-the-art intent/entities classifier
* ''RASA Core'' (now called ''RASA Dialog Manager'') is based on TED, a machine learning algorithm for managing multi-turn dialogs. It moves away from the traditional state-machine approach and instead lets conversation developers supply "''stories''": sets of example intent-action sequences. In a sense, developers define the conversational agent's dialog manager by giving examples (the stories).

RASA has a huge open community of developers and researchers. It is probably the biggest open-source project for developing on-premise, "production-ready", complex dialog systems. The whole development ecosystem revolves around the Python programming language.

== References ==
home page: https://rasa.com/

github: https://github.com/RasaHQ/

community forum: https://forum.rasa.com/

Diarization

2021-12-05T09:43:01Z

Solyarisoftware: diarization definition minor correction

''Speaker diarisation'' (or ''diarization''), or ''speaker separation'', is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity.

Source: https://en.wikipedia.org/wiki/Speaker_diarisation
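
The sketch below illustrates only the output side of diarization: how labelled segments might be grouped into speaker turns. It does not perform diarization itself. The <code>Segment</code> type, the <code>merge_turns</code> helper, and the 0.5 second gap threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # anonymous label such as "spk0", not a real identity
    start: float  # seconds
    end: float    # seconds


def merge_turns(segments, max_gap=0.5):
    """Merge consecutive segments of the same speaker into speaker turns."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Same speaker continues after a short pause: extend the turn.
            last.end = max(last.end, seg.end)
        else:
            # New speaker (or a long pause): start a new turn.
            turns.append(Segment(seg.speaker, seg.start, seg.end))
    return turns
```

The labels stay anonymous ("spk0", "spk1"); mapping them to real identities is the job of a separate speaker recognition system, as the definition above notes.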

Barge-in

2021-12-05T09:41:57Z

Solyarisoftware: barge-in definition

''"Barge-in is a feature that allows callers to interrupt a prompt and provide their response before the prompt has finished playing"'' (p. 24 of the book "Voice User Interface Design" by James Giangola et al.).

In other words, in any voice interface system / voice assistant, barge-in is the user's ability to interrupt the assistant's speech (the playback of a text-to-speech synthetic voice) and impose a new, overriding voice request to be processed as soon as possible.
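
A minimal sketch of the control logic behind barge-in, assuming an event-driven design: the class <code>BargeInController</code>, its method names, and the <code>stop_playback</code> callback are all hypothetical, and the speech-detection event is assumed to come from some voice activity detector that is not shown.

```python
class BargeInController:
    """Tracks whether the assistant is speaking and handles interruptions."""

    def __init__(self, stop_playback):
        # stop_playback: callback that halts the text-to-speech audio.
        self._stop_playback = stop_playback
        self.speaking = False

    def on_prompt_started(self):
        self.speaking = True

    def on_prompt_finished(self):
        self.speaking = False

    def on_user_speech_detected(self):
        """Return True if the user barged in on a playing prompt."""
        if not self.speaking:
            return False
        # The user talks over the prompt: cut playback immediately so the
        # new request can be captured and processed.
        self._stop_playback()
        self.speaking = False
        return True
```

In a real system the detector also has to cope with the assistant's own voice leaking into the microphone (echo), which this sketch ignores.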

Diarization

2021-12-04T16:54:54Z

Solyarisoftware: diarization new page

Speaker diarisation (or diarization), or Speaker separation is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity.

Source: https://en.wikipedia.org/wiki/Speaker_diarisation

Coqui

2021-12-03T17:27:57Z

Solyarisoftware: introduced coaqui.ai description

[[Category:Coqui]]
[[Category:Project]]

https://coqui.ai/ Coqui is dedicated to open speech technology and to serving as the hub where speech researchers, developers, and practitioners congregate.

https://github.com/coqui-ai/STT The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.

https://github.com/coqui-ai/TTS 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

https://github.com/coqui-ai/snakepit 🐍 Coqui's machine learning job scheduler

---

related projects:

- https://github.com/solyarisoftware/CoquiSTTJs Coqui STT offline engine API for NodeJs developers. With a simple HTTP ASR server.

== TTS ==
Let's collect some questions related to Coqui TTS.

* [[Continue Coqui TTS training based on checkpoint]]
* [[Finetune existing Coqui TTS model]]

Vosk

2021-12-03T17:08:45Z

Solyarisoftware:

[https://github.com/alphacep/vosk-api Vosk] is an open-source speech recognition toolkit by Alphacephei<ref>https://alphacephei.com/vosk/</ref>. Key features are:

# Supports 20+ languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish. More to come.
# Works offline, even on lightweight devices - Raspberry Pi, Android, iOS
# Installs with simple <code>pip3 install vosk</code>
# Portable per-language models are only 50Mb each, but there are much bigger server models available.
# Provides streaming API for the best user experience (unlike popular speech-recognition python packages)
# There are bindings for different programming languages, too - java/csharp/javascript etc.
# Allows quick reconfiguration of vocabulary for best accuracy.
# Supports speaker identification beside simple speech recognition.
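
As a hedged usage sketch of the streaming API mentioned above, the function below feeds a WAV file to Vosk chunk by chunk. It assumes the <code>vosk</code> package is installed, that a model has been downloaded to <code>model_path</code>, and that the file is 16-bit mono PCM; the function names, paths, and chunk size are placeholders chosen for this example.

```python
import json
import wave


def result_text(result_json):
    """Extract the "text" field from a Vosk JSON result string."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(wav_path, model_path):
    """Transcribe a 16-bit mono PCM WAV file with Vosk and return the text."""
    # Imported here so this module still loads where vosk is not installed.
    from vosk import Model, KaldiRecognizer

    pieces = []
    with wave.open(wav_path, "rb") as wf:
        recognizer = KaldiRecognizer(Model(model_path), wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns true once an utterance is finalized.
            if recognizer.AcceptWaveform(data):
                pieces.append(result_text(recognizer.Result()))
        # Flush whatever audio is still buffered in the recognizer.
        pieces.append(result_text(recognizer.FinalResult()))
    return " ".join(p for p in pieces if p)
```

A call would look like <code>transcribe_wav("test.wav", "model")</code>, where both arguments are placeholder paths.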

[[Category:STT]]
<references />

Vosk related projects

- https://github.com/solyarisoftware/voskjs Vosk ASR offline engine API for NodeJs developers. With a simple HTTP ASR server