From OpenVoice-Tech Wiki

The real time factor (RTF) is a common metric for measuring the speed of an automatic speech recognition (ASR) system in the decoding phase ("at run-time"). It can also be used in other contexts where an audio or video signal is processed (usually automatically) at a nearly constant rate. All in all, RTF expresses the processing latency of any (audio) processing system relative to the duration of its input: not only a speech recognition engine, but also a text-to-speech engine, a transcoding engine, etc.

If it takes time f(d) to process an input of duration d, the real time factor is defined as: RTF = f(d) / d

If, for example, it takes 8 hours of computation time to process a recording of duration 2 hours, the real time factor is 4. When the real time factor is 1, the processing is done "in real time"; values below 1 mean the input is processed faster than real time. The RTF depends on the hardware and, if processing is done as a cloud-based service, also on the network bandwidth.
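The arithmetic of the example above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `rtf` and `describe` are made up for illustration, and both durations are assumed to be given in seconds.

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = f(d) / d, with both arguments in the same time unit."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def describe(value: float) -> str:
    """Put an RTF value into words (hypothetical helper)."""
    if value < 1:
        return "faster than real time"
    if value == 1:
        return "real time"
    return "slower than real time"


HOUR = 3600  # seconds
example = rtf(8 * HOUR, 2 * HOUR)  # 8 hours of computation, 2 hours of audio
print(example, describe(example))  # 4.0 slower than real time
```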

A state-of-the-art cloud-based speech-to-text service, such as those supplied by Google, Azure, AWS, etc., usually has values between 0.2 and 0.6. Note that this depends heavily on many factors: the network/internet bandwidth, the speech content, etc. For an on-premises ASR system, the main factors are the algorithm and the hardware resources (CPU/RAM).
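Because the value depends on where the clock is started and stopped, one possible way to measure it is to time the complete call from the client side. The sketch below assumes wall-clock timing with `time.perf_counter`; `measure_rtf` and `fake_recognizer` are hypothetical names, and the recognizer is a stand-in that only sleeps. For a cloud-based service, a measurement like this would also include upload time and network waiting, which is why bandwidth affects the result.

```python
import time
from typing import Callable


def measure_rtf(process: Callable[[], object], audio_seconds: float) -> float:
    """Time one call to `process` and divide by the audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    start = time.perf_counter()
    process()  # for a cloud service this includes upload and network wait
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def fake_recognizer() -> str:
    """Hypothetical stand-in for a real ASR call; it only sleeps."""
    time.sleep(0.05)
    return "hello world"


# Pretend the recognizer just handled a 1-second clip.
measured = measure_rtf(fake_recognizer, audio_seconds=1.0)
print(f"RTF: {measured:.2f}")
```

The printed value varies from run to run and from machine to machine, so repeated measurements on the same input are usually more informative than a single one.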

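def real_time_factor(processing_time, audio_length, decimals=2):
    ''' Real-Time Factor (RTF) is defined as processing-time / length-of-audio (same unit for both). '''
    if audio_length <= 0:
        raise ValueError("audio_length must be positive")
    # An RTF below 1 means the audio was processed faster than real time.
    rtf = processing_time / audio_length
    return round(rtf, decimals)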
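The function above expects the audio length as its second argument, in the same unit as the processing time. As a rough sketch of where that number could come from, the snippet below reads the duration of an uncompressed WAV file with Python's standard `wave` module. The helper name `wav_duration_seconds`, the generated silent test file, and the processing time of 0.5 seconds are assumptions made for illustration only.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file: number of frames / frames per second."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Demo input: 2 seconds of 16 kHz, mono, 16-bit silence in a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "silence.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)        # 2 bytes per sample = 16 bit
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 32000)  # 32000 frames = 2 seconds
    duration = wav_duration_seconds(path)

processing_seconds = 0.5  # hypothetical measured processing time
print(round(processing_seconds / duration, 2))  # 0.25
```

Compressed formats such as MP3 are not handled by the `wave` module and would need a different way of reading the duration.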