kingabzpro/wav2vec2-large-xls-r-300m-Urdu

Tags: automatic-speech-recognition, transformers, ur, safetensors, wav2vec2, generated_from_trainer, hf-asr-leaderboard
License: apache-2.0

wav2vec2-large-xls-r-300m-Urdu

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the common_voice dataset. It achieves the following results on the evaluation set:

Loss: 0.9889
Wer: 0.5607
Cer: 0.2370
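
As a reminder of what the two error rates measure, here is a minimal sketch that computes WER and CER for a single utterance with a plain Levenshtein distance. The function names and the toy English strings are illustrative assumptions; the evaluation script may normalize text and aggregate over the whole dataset differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Toy strings (hypothetical, not taken from the dataset)
ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(f"WER={wer(ref, hyp):.4f} CER={cer(ref, hyp):.4f}")
```

A WER of 1.0 therefore means the number of word-level edits equals the number of reference words, which is what the early rows of the training results table show.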

Evaluation Commands

To evaluate on the test split of mozilla-foundation/common_voice_8_0:

python eval.py --model_id kingabzpro/wav2vec2-large-xls-r-300m-Urdu --dataset mozilla-foundation/common_voice_8_0 --config ur --split test

Inference With LM

# pip install transformers datasets pyctcdecode kenlm huggingface_hub torch

import json, torch
from datasets import load_dataset, Audio
from transformers import AutoProcessor, AutoModelForCTC
from pyctcdecode import build_ctcdecoder
from huggingface_hub import hf_hub_download

mid = "kingabzpro/wav2vec2-large-xls-r-300m-Urdu"
proc = AutoProcessor.from_pretrained(mid)
model = AutoModelForCTC.from_pretrained(mid).eval().to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

kenlm = hf_hub_download(mid, "language_model/5gram.bin")
uni  = hf_hub_download(mid, "language_model/unigrams.txt")
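# attrs.json is optional; fall back to defaults if it cannot be downloaded or parsed
try:
    with open(hf_hub_download(mid, "language_model/attrs.json"), encoding="utf-8") as f:
        attrs = json.load(f)
except Exception:
    attrs = {}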

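# Rebuild the decoder alphabet from the tokenizer: keep only the CTC blank, the word
# delimiter, and single-character tokens (drops multi-character special tokens)
v = proc.tokenizer.get_vocab()
id2tok = [t for t, i in sorted(v.items(), key=lambda x: x[1])]
blank = proc.tokenizer.pad_token_id
wdt = proc.tokenizer.word_delimiter_token
# pyctcdecode represents the blank as "" and the word delimiter as " "
keep, labels = zip(*[
    (i, "" if i == blank else " " if t == wdt else t)
    for i, t in enumerate(id2tok) if (i == blank or t == wdt or len(t) == 1)
])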

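# Pass alpha/beta when building the decoder so they reach the language-model scorer
with open(uni, encoding="utf-8") as f:
    unigrams = f.read().splitlines()
dec = build_ctcdecoder(list(labels), kenlm_model_path=kenlm, unigrams=unigrams,
                       alpha=attrs.get("alpha", 0.5), beta=attrs.get("beta", 1.0))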

ds = load_dataset("mozilla-foundation/common_voice_22_0", "ur", split="test", streaming=True)
ex = next(iter(ds.cast_column("audio", Audio(sampling_rate=16_000))))
x = proc(ex["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_values.to(model.device)

with torch.no_grad():
    logits = model(x).logits[0].cpu().numpy()[:, keep]
print(dec.decode(logits))
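
For comparison with the LM-based beam search above, decoding without a language model is usually done greedily: take the argmax label per frame, collapse consecutive repeats, then drop blanks. The following is a minimal sketch of that idea in plain Python. The function name `ctc_greedy_decode` and the toy alphabet are assumptions for illustration and are not part of the model repository.

```python
def ctc_greedy_decode(logits, labels, blank_id):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    `logits` is a sequence of per-frame score rows (lists or a 2-D array),
    `labels` maps a column index to its output string.
    """
    out, prev = [], None
    for row in logits:
        idx = max(range(len(row)), key=lambda k: row[k])  # argmax over the vocabulary
        if idx != prev and idx != blank_id:
            out.append(labels[idx])
        prev = idx
    return "".join(out)


# Toy alphabet (hypothetical): index 0 is the blank, index 3 the word delimiter
toy_labels = ["", "a", "b", " "]
frame_ids = [1, 1, 0, 1, 2, 2, 0, 3, 2]
# One-hot rows stand in for real model logits
toy_logits = [[1.0 if k == i else 0.0 for k in range(len(toy_labels))] for i in frame_ids]
print(repr(ctc_greedy_decode(toy_logits, toy_labels, blank_id=0)))  # 'aab b'
```

Note how the blank between the two `a` frames is what allows a doubled letter to survive the collapse step. Under the same assumptions, the function could be applied to the filtered `logits` and `labels` from the snippet above, with `blank_id` set to the position of `""` in `labels`.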

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 32
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 200
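
To make the schedule concrete, here is a rough sketch of how the effective batch size and a linear warmup/decay learning rate follow from these values. It assumes a single training device and a total of 2600 optimizer steps; the latter is inferred from the training results table (about 13 steps per epoch over 200 epochs) and is not stated by the model card. The function name is illustrative.

```python
def linear_warmup_lr(step, base_lr=1e-4, warmup_steps=1000, total_steps=2600):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps.

    total_steps=2600 is an assumption (200 epochs x ~13 steps per epoch).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# Effective batch size, assuming one device
train_batch_size = 32
gradient_accumulation_steps = 2
num_devices = 1  # assumption
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 64

for step in (0, 500, 1000, 1800, 2600):
    print(step, linear_warmup_lr(step))
```

With 1000 warmup steps out of roughly 2600, the learning rate spends a large share of training ramping up, which may partly explain why WER stays at 1.0 through step 800.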

Training results

Training Loss	Epoch	Step	Validation Loss	Wer	Cer
3.6398	30.77	400	3.3517	1.0	1.0
2.9225	61.54	800	2.5123	1.0	0.8310
1.2568	92.31	1200	0.9699	0.6273	0.2575
0.8974	123.08	1600	0.9715	0.5888	0.2457
0.7151	153.85	2000	0.9984	0.5588	0.2353
0.6416	184.62	2400	0.9889	0.5607	0.2370
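
A few quantities can be read off the table with simple arithmetic. The sketch below derives the number of optimizer steps per epoch, an upper bound on the training set size (assuming the total train batch size of 64 listed under the hyperparameters), and which logged checkpoint is best by WER and by validation loss. The variable names are illustrative.

```python
# (training_loss, epoch, step, validation_loss, wer, cer), copied from the table
rows = [
    (3.6398, 30.77, 400, 3.3517, 1.0, 1.0),
    (2.9225, 61.54, 800, 2.5123, 1.0, 0.8310),
    (1.2568, 92.31, 1200, 0.9699, 0.6273, 0.2575),
    (0.8974, 123.08, 1600, 0.9715, 0.5888, 0.2457),
    (0.7151, 153.85, 2000, 0.9984, 0.5588, 0.2353),
    (0.6416, 184.62, 2400, 0.9889, 0.5607, 0.2370),
]

# Steps per epoch: step / epoch is about 13 on every row
steps_per_epoch = round(sum(r[2] / r[1] for r in rows) / len(rows))

# Upper bound on training examples; the last batch of an epoch may be partial
total_train_batch_size = 64
max_train_examples = steps_per_epoch * total_train_batch_size

best_by_wer = min(rows, key=lambda r: r[4])
best_by_loss = min(rows, key=lambda r: r[3])

print(steps_per_epoch, max_train_examples)   # 13 832
print(best_by_wer[2], best_by_loss[2])       # 2000 1200
```

The lowest logged WER (0.5588) occurs at step 2000, slightly better than the final step 2400 row that the summary at the top reports, while validation loss bottoms out much earlier, at step 1200.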

Framework versions

Transformers 4.17.0.dev0
Pytorch 1.10.2+cu102
Datasets 1.18.2.dev0
Tokenizers 0.11.0

Eval results on Common Voice 8 "test" (WER, %):

Without LM	With LM (run `./eval.py`)
52.03	39.89
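
The gain from the language model can be expressed as an absolute and a relative WER reduction. A short calculation, using only the two figures above:

```python
wer_without_lm = 52.03  # percent
wer_with_lm = 39.89     # percent

absolute_reduction = wer_without_lm - wer_with_lm          # percentage points
relative_reduction = absolute_reduction / wer_without_lm   # fraction of the baseline

print(f"absolute: {absolute_reduction:.2f} points")  # absolute: 12.14 points
print(f"relative: {relative_reduction:.1%}")         # relative: 23.3%
```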
