Issues with Arabic transcription #15428

@ayghri

Describe the bug

Two issues with nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0:

  1. No diacritics produced: Despite the model card claiming diacritical marks support and the tokenizer vocabulary containing all Arabic diacritics, the model never outputs them. Tested with both RNNT and CTC decoders.

  2. (less serious) Stereo audio crashes: Transcribing multi-channel audio fails with a shape mismatch: the model expects (batch, time) but receives torch.Size([1, 2, 240000]). The lhotse dataloader (the default) does not downmix to mono. Passing channel_selector='average' raises ValueError: Channel selector average not found in cut.custom, because _select_channel() in nemo/collections/common/data/lhotse/dataloader.py handles only an int index or a string looked up as a custom field, not the 'average' mode. Only channel_selector=0 works.

Steps/Code to reproduce bug

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0")

# Issue 2 (stereo crash): any stereo 16 kHz WAV crashes
asr_model.transcribe(["stereo.wav"])
# TypeError: Input shape expected = (batch, time) | found : torch.Size([1, 2, 240000])

asr_model.transcribe(["stereo.wav"], channel_selector='average')
# ValueError: Channel selector average not found in cut.custom

asr_model.transcribe(["stereo.wav"], channel_selector=0)  # workaround

# Issue 1 (no diacritics): no diacritics with either decoder
output = asr_model.transcribe(["arabic_mono.mp3"])
print(output[0].text)  # "ما أجمل هذه الحديقة" — no tashkeel

asr_model.change_decoding_strategy(decoder_type='ctc')
output = asr_model.transcribe(["arabic_mono.mp3"])
print(output[0].text)  # still no diacritics
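To make the "no diacritics" observation checkable rather than judged by eye, here is a minimal sketch of a helper that reports whether a transcription string contains any Arabic diacritical marks. The function name has_tashkeel is hypothetical and not part of NeMo; the check covers the standard harakat range U+064B to U+0652 plus the superscript alef U+0670.

```python
# Hypothetical helper (not part of NeMo): detect Arabic diacritics in a string.
# Covers fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def has_tashkeel(text: str) -> bool:
    """Return True if the text contains at least one Arabic diacritical mark."""
    return any(ch in TASHKEEL for ch in text)


def count_tashkeel(text: str) -> int:
    """Return the number of diacritical marks found in the text."""
    return sum(1 for ch in text if ch in TASHKEEL)
```

Applied to the outputs above, has_tashkeel(output[0].text) would be expected to return False for both the RNNT and CTC decoders if the reported behavior holds.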

Expected behavior

  1. The model should produce diacritical marks, as stated in the model card.
  2. Stereo audio should be auto-downmixed to mono, or channel_selector='average' should work as documented.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: git clone ... pip install '.[all]'

Environment details

  • OS version: Linux 6.12.31-gentoo-x86_64
  • PyTorch version: 2.10.0+cu128
  • Python version: 3.12.12
  • NeMo version: 2.8.0rc0

Additional context

Root cause of the stereo crash (issue 2): lhotse's collate_audio() returns a tensor of shape (batch, channels, time) for multi-channel audio. _select_channel() (dataloader.py ~L947) treats a string channel_selector as a cut.custom field key, not as a processing mode. Because use_lhotse=True is the default in NeMo 2.x, this affects all stereo transcriptions.
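Until channel averaging works in the lhotse path, one possible workaround is to downmix the file to mono before calling transcribe(). The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input; the function name downmix_wav_to_mono and the file names are hypothetical, and this is not how NeMo itself implements averaging.

```python
import array
import sys
import wave


def downmix_wav_to_mono(src_path: str, dst_path: str) -> None:
    """Average all channels of a 16-bit PCM WAV file and write a mono WAV."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    samples = array.array("h")
    samples.frombytes(raw)
    # WAV data is little-endian; array("h") uses the native byte order.
    if sys.byteorder == "big":
        samples.byteswap()

    # Samples are interleaved per frame: average each group of n_channels.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

After downmix_wav_to_mono("stereo.wav", "mono.wav"), the mono file can be passed to asr_model.transcribe(["mono.wav"]) without a channel_selector. Unlike channel_selector=0, this keeps information from both channels.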
