Real-time-factor: Difference between revisions

From Voice Technology Wiki


[[Category:STT]]
[[Category:TTS]]

Latest revision as of 19:57, 18 January 2024

The real-time factor (RTF) is a common metric for measuring the speed of an automatic speech recognition (ASR) system in the decoding phase ("at run-time"). It can also be used in other contexts where an audio or video signal is processed (usually automatically) at a nearly constant rate. All in all, RTF measures how fast any (audio) processing system runs relative to the duration of its input: not only a speech recognition engine, but also a text-to-speech engine, a transcoding engine, etc.

If it takes time f(d) to process an input of duration d, the real-time factor is defined as: RTF = f(d)/d. Both f(d) and d must be expressed in the same time unit, so RTF is dimensionless.

If, for example, it takes 8 hours of computation time to process a recording of duration 2 hours, the real-time factor is 4. When the real-time factor is 1, the processing is done "in real time"; values below 1 mean the processing is faster than real time. RTF depends on the hardware and, if the processing is done as a cloud-based service, also on the network bandwidth.
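The arithmetic of this example can be sketched in a few lines of Python. The helper names `rtf` and `describe` are hypothetical and chosen only for illustration; the only thing taken from the text is the definition RTF = f(d)/d and the 8-hour/2-hour example.

```python
def rtf(processing_time, duration):
    """Return RTF = f(d) / d; both arguments must use the same time unit."""
    return processing_time / duration


def describe(value):
    """Classify an RTF value relative to real time (RTF = 1)."""
    if value < 1:
        return "faster than real time"
    if value == 1:
        return "real time"
    return "slower than real time"


# The example from the text: 8 hours of computation for a 2-hour recording.
example = rtf(8, 2)
print(example, describe(example))  # 4.0 slower than real time
```

Because RTF is a ratio, it does not matter whether the two durations are given in hours or seconds, as long as the unit is the same for both.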

Usually a state-of-the-art cloud-based speech-to-text service supplied by Google, Azure, AWS, etc. has values between 0.2 and 0.6. Note that this depends heavily on many factors, such as the network/internet bandwidth, the speech content, etc. In the case of an on-prem ASR, the major factors are the algorithm and the hardware resources (CPU/RAM).
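One way to obtain such a value in practice is to time the processing call with a wall clock and divide by the audio duration. The following is a minimal sketch, not a reference implementation: it assumes the input is a WAV file readable by Python's standard `wave` module, and `process` is a hypothetical callable standing in for whatever engine is being measured (a cloud API client, an on-prem ASR, a transcoder). For a cloud service, the measured wall-clock time includes upload and network latency, which is why bandwidth affects the result.

```python
import time
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def measure_rtf(process, path):
    """Time process(path) with a wall clock and return elapsed / audio duration.

    `process` is a hypothetical stand-in for the engine under test.
    """
    duration = wav_duration_seconds(path)
    start = time.perf_counter()
    process(path)
    elapsed = time.perf_counter() - start
    return elapsed / duration
```

A single measurement can be noisy, so it may be worth repeating the call over several files and runs before comparing systems.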

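def real_time_factor(processingTime, audioLength, decimals=2):
    ''' Real-Time Factor (RTF) is defined as processing-time / length-of-audio. '''
    # Both arguments must use the same time unit (e.g. seconds).
    if audioLength <= 0:
        raise ValueError("audioLength must be positive")
    rtf = processingTime / audioLength
    return round(rtf, decimals)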