Describe the bug
Two issues with nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0:
- No diacritics produced: Despite the model card claiming diacritical marks support and the tokenizer vocabulary containing all Arabic diacritics, the model never outputs them. Tested with both RNNT and CTC decoders.
- (less serious) Stereo audio crashes: Transcribing multi-channel audio fails with a shape mismatch: expected `(batch, time)`, found `torch.Size([1, 2, 240000])`. The lhotse dataloader (default) doesn't downmix to mono. `channel_selector='average'` raises `ValueError: Channel selector average not found in cut.custom` because `_select_channel()` in `nemo/collections/common/data/lhotse/dataloader.py` only handles `int` and custom-field string lookup, not the `'average'` mode. Only `channel_selector=0` works.
Steps/Code to reproduce bug
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0")
# Stereo issue: any stereo 16kHz WAV crashes
asr_model.transcribe(["stereo.wav"])
# TypeError: Input shape expected = (batch, time) | found : torch.Size([1, 2, 240000])
asr_model.transcribe(["stereo.wav"], channel_selector='average')
# ValueError: Channel selector average not found in cut.custom
asr_model.transcribe(["stereo.wav"], channel_selector=0) # workaround
# Diacritics issue: no diacritics with either decoder
output = asr_model.transcribe(["arabic_mono.mp3"])
print(output[0].text) # "ما أجمل هذه الحديقة" — no tashkeel
asr_model.change_decoding_strategy(decoder_type='ctc')
output = asr_model.transcribe(["arabic_mono.mp3"])
print(output[0].text) # still no diacritics
Expected behavior
- Stereo audio should be auto-downmixed to mono, or `channel_selector='average'` should work as documented.
- Model should produce diacritical marks as stated in the model card.
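To make the diacritics expectation checkable, a transcript can be scanned for tashkeel code points. This is a minimal sketch and not part of NeMo: `TASHKEEL`, `count_tashkeel` and `has_tashkeel` are hypothetical helper names, and treating U+064B through U+0652 plus U+0670 as "diacritics" is an assumption that may not match the tokenizer's own definition.

```python
# Assumed set of Arabic diacritics (tashkeel):
# fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_tashkeel(text: str) -> int:
    """Return the number of tashkeel code points found in text."""
    return sum(1 for ch in text if ch in TASHKEEL)


def has_tashkeel(text: str) -> bool:
    """Return True if text contains at least one tashkeel code point."""
    return count_tashkeel(text) > 0
```

Applied to the repro, `has_tashkeel(output[0].text)` would be expected to return `True` if the model emitted diacritics as the model card describes.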
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install: git clone ... pip install '.[all]'
Environment details
- OS version: Linux 6.12.31-gentoo-x86_64
- PyTorch version: 2.10.0+cu128
- Python version: 3.12.12
- NeMo version: 2.8.0rc0
Additional context
Stereo crash root cause: lhotse collate_audio() returns (batch, channels, time) for multi-channel audio. _select_channel() (dataloader.py ~L947) treats a string channel_selector as a cut.custom field key, not a processing mode. Since use_lhotse=True is the default in NeMo 2.x, this affects all stereo transcriptions.
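Until an averaging mode is handled, one possible interim workaround is to downmix the file before calling `transcribe()`. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input; `downmix_to_mono` is a hypothetical helper name, not a NeMo or lhotse function.

```python
import array
import sys
import wave


def downmix_to_mono(src_path: str, dst_path: str) -> None:
    """Average all channels of a 16-bit PCM WAV file into a mono WAV file."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    # Samples are interleaved: ch0, ch1, ..., ch0, ch1, ...
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Average each frame's channels into a single sample.
    mono = array.array("h", (
        sum(samples[i:i + n_channels]) // n_channels
        for i in range(0, len(samples), n_channels)
    ))
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

For example, `downmix_to_mono("stereo.wav", "stereo_mono.wav")` followed by `asr_model.transcribe(["stereo_mono.wav"])` should avoid the shape mismatch. Unlike `channel_selector=0`, this keeps the content of every channel rather than discarding all but the first.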