Issues with Arabic transcription #15428

**Describe the bug**

Two issues with `nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0`:


1. **No diacritics produced**: Despite the model card claiming diacritical marks support and the tokenizer vocabulary containing all Arabic diacritics, the model never outputs them. Tested with both RNNT and CTC decoders.
 
2. **(less serious) Stereo audio crashes**: Transcribing multi-channel audio fails with a shape mismatch: the model expects `(batch, time)` but receives `torch.Size([1, 2, 240000])`. The lhotse dataloader (the default) does not downmix to mono. Passing `channel_selector='average'` raises `ValueError: Channel selector average not found in cut.custom`, because `_select_channel()` in `nemo/collections/common/data/lhotse/dataloader.py` only handles an `int` selector or a string looked up as a custom field, not the `'average'` mode. Only `channel_selector=0` works.

**Steps/Code to reproduce bug**

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0")

# Stereo issue (item 2 above): any stereo 16 kHz WAV crashes
asr_model.transcribe(["stereo.wav"])
# TypeError: Input shape expected = (batch, time) | found : torch.Size([1, 2, 240000])

asr_model.transcribe(["stereo.wav"], channel_selector='average')
# ValueError: Channel selector average not found in cut.custom

asr_model.transcribe(["stereo.wav"], channel_selector=0)  # workaround

# Diacritics issue (item 1 above): no diacritics with either decoder
output = asr_model.transcribe(["arabic_mono.mp3"])
print(output[0].text)  # "ما أجمل هذه الحديقة" — no tashkeel

asr_model.change_decoding_strategy(decoder_type='ctc')
output = asr_model.transcribe(["arabic_mono.mp3"])
print(output[0].text)  # still no diacritics
```
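
To check transcripts for tashkeel programmatically rather than by eye, a small helper along the following lines could be used. This is only a minimal sketch: `TASHKEEL` and `count_tashkeel` are hypothetical names (not NeMo APIs), and the set covers just the common harakat range U+064B to U+0652 plus the superscript alef U+0670.

```python
# Hypothetical helper, not part of NeMo: counts Arabic diacritical marks in a string.
# Assumption: "diacritics" here means the harakat U+064B..U+0652 and U+0670.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_tashkeel(text: str) -> int:
    """Return the number of tashkeel code points found in `text`."""
    return sum(1 for ch in text if ch in TASHKEEL)


def has_tashkeel(text: str) -> bool:
    """Return True if `text` contains at least one tashkeel code point."""
    return count_tashkeel(text) > 0
```

Calling `has_tashkeel(output[0].text)` on the transcripts from the reproduction above would presumably return `False` for both decoders, given that no diacritics were observed.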

**Expected behavior**

1. The model should produce diacritical marks, as stated in the model card.
2. Stereo audio should be auto-downmixed to mono, or `channel_selector='average'` should work as documented.

**Environment overview (please complete the following information)**

 - Environment location: Bare-metal
 - Method of NeMo install: git clone ... pip install '.[all]'

**Environment details**

- OS version: Linux 6.12.31-gentoo-x86_64
- PyTorch version: 2.10.0+cu128
- Python version: 3.12.12
- NeMo version: 2.8.0rc0

**Additional context**

Root cause of the stereo crash (item 2 above): lhotse `collate_audio()` returns `(batch, channels, time)` for multi-channel audio. `_select_channel()` (dataloader.py ~L947) treats a string `channel_selector` as a `cut.custom` field key, not as a processing mode. Since `use_lhotse=True` is the default in NeMo 2.x, this affects all stereo transcriptions.
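
Until the dataloader handles this, one possible workaround besides `channel_selector=0` is to downmix the file before transcription. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input; `downmix_wav_to_mono` and the output path are hypothetical names chosen for illustration, and other sample formats would need a different reader.

```python
import array
import sys
import wave


def downmix_wav_to_mono(src_path: str, dst_path: str) -> None:
    """Average all channels of a 16-bit PCM WAV file into a single mono channel."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM WAV files")

    # WAV samples are little-endian and interleaved: L0, R0, L1, R1, ...
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()

    # Integer average of each frame's channels (stays within the int16 range).
    mono = array.array(
        "h",
        (
            sum(samples[i : i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

The resulting mono file can then be passed to `asr_model.transcribe([...])` in place of `stereo.wav`. Unlike `channel_selector=0`, this keeps the content of both channels, which is roughly what the `'average'` mode would be expected to do.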
