🚀 T-one is a high-performance streaming ASR pipeline for Russian, specialized for the telephony domain.
T-one provides a complete low-latency solution for real-time transcription: a pretrained streaming Conformer-based acoustic model, a custom phrase boundary detector, and a decoder. Alongside the pretrained model, it ships a full suite of tools for inference, fine-tuning, and deployment, making it ready to use in production environments.
Developed by T-Software DC, this project is a practical low-latency, high-throughput ASR solution with modular components.
For more details, see the GitHub Repository.
Quality Benchmarks:
Word Error Rate (WER) is the standard metric for evaluating the quality of automatic speech recognition systems. It can be interpreted as the percentage of incorrectly recognized words relative to a reference transcript, so a lower value indicates higher accuracy. T-one demonstrates state-of-the-art performance on its target telephony domain while remaining competitive on general-purpose benchmarks. All figures below are WER, %.
| Category | T-one (71M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
|---|---|---|---|---|---|---|
| Call-center | 8.63 | 10.22 | 10.57 | 11.28 | 15.53 | 19.39 |
| Other telephony | 6.20 | 7.88 | 8.15 | 8.69 | 13.49 | 17.29 |
| Named entities | 5.83 | 9.55 | 9.81 | 12.12 | 17.65 | 17.87 |
| CommonVoice 19 (test split) | 5.32 | 2.68 | 3.14 | 6.22 | 11.30 | 5.78 |
| OpenSTT asr_calls_2_val original | 20.27 | 20.07 | 21.24 | 22.64 | 29.45 | 29.02 |
| OpenSTT asr_calls_2_val re-labeled | 7.94 | 11.14 | 12.43 | 13.22 | 21.03 | 20.82 |
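To make the metric concrete, WER can be computed with the open-source `jiwer` package. This is an illustration only: `jiwer` is a third-party tool, not a T-one dependency, and the sentences below are made up.

```python
# Illustration only: jiwer is a third-party package (pip install jiwer),
# not part of T-one. WER = (substitutions + deletions + insertions)
# divided by the number of words in the reference.
import jiwer

reference = "привет это я"        # ground-truth transcript
hypothesis = "привет это и я"     # ASR output with one inserted word
print(jiwer.wer(reference, hypothesis))  # 1 insertion / 3 words ≈ 0.33
```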
Offline recognition example:

```python
from tone import StreamingCTCPipeline, read_audio, read_example_audio

audio = read_example_audio()  # or read_audio("your_audio.flac")

pipeline = StreamingCTCPipeline.from_hugging_face()

print(pipeline.forward_offline(audio))  # run offline recognition
```
Output:
```
[TextPhrase(text='привет', start_time=1.79, end_time=2.04), TextPhrase(text='это я', start_time=3.72, end_time=4.26), TextPhrase(text='я подумала не хочешь ли ты встретиться спустя все эти годы', start_time=5.88, end_time=10.59)]
```
Streaming recognition example:

```python
from tone import StreamingCTCPipeline, read_stream_example_audio

pipeline = StreamingCTCPipeline.from_hugging_face()

state = None  # current state of the ASR pipeline (None means initial)
for audio_chunk in read_stream_example_audio():  # use any source of audio chunks
    new_phrases, state = pipeline.forward(audio_chunk, state)
    print(new_phrases)

# Finalize the pipeline and get the remaining phrases
new_phrases, _ = pipeline.finalize(state)
print(new_phrases)
```
Output:
```
TextPhrase(text='привет', start_time=1.79, end_time=2.04)
TextPhrase(text='это я', start_time=3.72, end_time=4.26)
TextPhrase(text='я подумала не хочешь ли ты встретиться спустя все эти годы', start_time=5.88, end_time=10.59)
```
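The `read_stream_example_audio()` helper above is only a demo source; any iterator of raw audio chunks works. Below is a hedged sketch that streams a local file with the third-party `soundfile` package. The 8 kHz sample rate and 300 ms chunk size are assumptions inferred from the model description, so check the repository for the exact input contract.

```python
# Hypothetical chunk source: stream a local WAV file in ~300 ms blocks.
# Assumes the pipeline accepts raw float32 PCM at the model's sample rate
# (8 kHz telephony is assumed here); verify against the repository.
import soundfile as sf

from tone import StreamingCTCPipeline

pipeline = StreamingCTCPipeline.from_hugging_face()
state = None

SAMPLE_RATE = 8000              # assumed model sample rate
CHUNK = int(0.3 * SAMPLE_RATE)  # 300 ms of audio per chunk

for audio_chunk in sf.blocks("your_audio.wav", blocksize=CHUNK, dtype="float32"):
    new_phrases, state = pipeline.forward(audio_chunk, state)
    print(new_phrases)

new_phrases, _ = pipeline.finalize(state)
print(new_phrases)
```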
To fine-tune T-one from a pretrained checkpoint, prepare the training dataset and load the tokenizer and feature extractor from the t-tech/T-one 🤗 repo.
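How exactly the tokenizer and feature extractor are loaded depends on the artifacts shipped in the repo. The sketch below assumes they are exposed through the standard 🤗 `transformers` Auto classes; if that assumption does not hold, follow the fine-tuning notebook referenced below. The model itself is loaded next.

```python
# Assumption: the t-tech/T-one repo exposes standard 🤗 tokenizer and
# feature-extractor configs; if it does not, use the fine-tuning notebook.
from transformers import AutoFeatureExtractor, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t-tech/T-one")
feature_extractor = AutoFeatureExtractor.from_pretrained("t-tech/T-one")
```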
```python
import torch

from tone.training.model_wrapper import ToneForCTC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToneForCTC.from_pretrained("t-tech/T-one").to(device)
```
Set up the data collator, evaluation metric, training arguments, and 🤗 Trainer, as sketched below.
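As a rough orientation, the block below sketches this step in the style of a generic 🤗 CTC fine-tuning recipe. It is not T-one's actual training code: the `tokenizer`, `model`, `train_dataset`, `eval_dataset`, and `data_collator` objects are assumed from the preceding steps, the hyperparameter values are illustrative, and the metric comes from the third-party `evaluate` package.

```python
# Hedged sketch of a generic 🤗 CTC fine-tuning setup; see the fine-tuning
# notebook for T-one's actual collator and argument names.
import evaluate
import numpy as np
from transformers import Trainer, TrainingArguments

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    # Greedy argmax over logits; assumes a CTC-aware tokenizer whose decode
    # collapses repeats and removes blanks.
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # Restore the pad token where labels were masked with -100 (common recipe).
    label_ids = np.where(pred.label_ids == -100, tokenizer.pad_token_id, pred.label_ids)
    pred_str = tokenizer.batch_decode(pred_ids)
    label_str = tokenizer.batch_decode(label_ids)
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}

training_args = TrainingArguments(
    output_dir="tone-finetuned",
    per_device_train_batch_size=8,   # illustrative values only
    learning_rate=1e-4,
    num_train_epochs=3,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,                     # loaded above
    args=training_args,
    train_dataset=train_dataset,     # assumed prepared earlier
    eval_dataset=eval_dataset,
    data_collator=data_collator,     # assumed CTC padding collator
    compute_metrics=compute_metrics,
)
trainer.train()
```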
For a complete guide, refer to the fine-tuning example notebook.
T-one is a 71M-parameter acoustic model based on the Conformer architecture, with several key innovations to improve performance and efficiency.
It processes audio in 300 ms chunks and generates transcriptions using either greedy decoding or a KenLM-based CTC beam search decoder.
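To make the decoding step concrete, here is a minimal, self-contained sketch of greedy CTC decoding (argmax per frame, collapse repeats, drop blanks). The toy vocabulary and blank id are illustrative, not T-one's actual ones.

```python
# Minimal greedy CTC decoder: argmax per frame, collapse repeated tokens,
# then drop blanks. Toy vocabulary; T-one's real vocabulary differs.
import numpy as np

VOCAB = ["<blank>", "п", "р", "и", "в", "е", "т"]
BLANK_ID = 0

def greedy_ctc_decode(log_probs: np.ndarray) -> str:
    """log_probs: (time, vocab) matrix of per-frame log-probabilities."""
    best = log_probs.argmax(axis=-1)  # best token per frame
    collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
    return "".join(VOCAB[t] for t in collapsed if t != BLANK_ID)

# Frames emitting "п", "п", <blank>, "р", "и" decode to "при".
frames = np.full((5, len(VOCAB)), -10.0)
for i, tok in enumerate([1, 1, 0, 2, 3]):
    frames[i, tok] = 0.0
print(greedy_ctc_decode(frames))  # -> при
```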
The model was trained with CTC loss. T-one is primarily intended for telephone-channel audio, but because it was trained on heterogeneous data it remains robust across domains and is not limited to telephony. The model supports streaming inference, so it can process long audio files out of the box in real time. The primary use case is streaming speech recognition of calls: the user sends small audio chunks to the model, which processes each segment incrementally and returns finalized text with word-level timestamps in real time. T-one can also be easily fine-tuned for specific domains.
For a detailed exploration of the architecture, design choices, and implementation, see our accompanying article, as well as our technical deep dive on YouTube about improving the quality and training speed of a streaming ASR model.
The acoustic model was trained on over 80,000 hours of Russian speech. A significant portion (up to 64%) was pseudo-labeled using a robust ROVER model ensemble.
| Domain | Hours | Source |
|---|---|---|
| Telephony | 57.9k | internal |
| Far-field | 2.2k | internal |
| Mix | 18.4k | internal |
| Mix | 2.3k | open-source |
The model was trained from scratch (random initialization) for 7 days on 8 A100 GPUs using the NVIDIA NeMo framework.
This project, including the code and pretrained models, is released under the Apache 2.0 License.