SEPIA Speech-To-Text Server

SEPIA Speech-To-Text (STT) Server is a WebSocket based, full-duplex Python server for real-time automatic speech recognition (ASR) supporting multiple open-source ASR engines. It can receive a stream of audio chunks via the secure WebSocket connection and return transcribed text almost immediately as partial and final results.

One goal of this project is to offer a standardized, secure, real-time interface for all the great open-source ASR tools out there. The server works on all major platforms including single-board devices like Raspberry Pi (4).^[1]

Features

WebSocket server (Python Fast-API) that can receive audio streams and send transcribed text at the same time
Modular architecture to support multiple ASR engines like Vosk (reference implementation), Coqui, Deepspeech, Scribosermo and more (under construction)
Optional post processing of result (e.g. via text2num and custom modules)
Standardized API for all engines and support for individual engine features (speaker identification, grammar, confidence score, word timestamps, alternative results, etc.)
On-the-fly server and engine configuration via HTTP REST API and WebSocket 'welcome' event (including custom grammar, if supported by engine and model)
User authentication via simple common token or individual tokens for multiple users
Docker containers with support for all major platform architectures: x86 64Bit (amd64), ARM 32Bit (armv7l) and ARM 64Bit (aarch64)
Fast enough to run even on Raspberry Pi 4 (2GB) in real-time (depending on engine and model configuration)
Compatible to SEPIA Framework client (v0.24+)

↑ https://github.com/SEPIA-Framework/sepia-stt-server

[1] ttps://github.com/SEPIA-Framework/sepia-stt-server

[1]