
Question about cross-attention decoder_input #15419

@azziko

Description

Hello @mgaido91,

Thank you for your work on #15229.

I would like to use this together with alignatt, but I'm not sure how to pick the right subset of the [H, U, T] scores. I'm also a bit thrown off by an occasional mismatch in the xatt dimensions:

Consider the following decoder_input_ids of length 14 (mapped back to text for better illustration):

<|startofcontext|> A dream.<|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|notimestamp|><|nodiarize|> 

And an output of length 3.

When I examine hypothesis[0].xatt_scores.shape, in most cases I get [H, len(decoder_input_ids) + len(output) + 1, T] = [H, 18, T]. Sometimes I get more than that. My question is: am I right to assume that the positions in xatt_scores that correspond to the new output can be expected in the range

[:, len(decoder_input_ids):len(decoder_input_ids)+len(output), :]

And if I also wanted the context, could I find it starting from the same index at which it starts in decoder_input_ids (i.e. index 2)?

I'm running inference with beam search (beam size 5).
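To make the question concrete, here is a minimal sketch of the slicing I have in mind. It only assumes that xatt_scores is a [H, U, T] array where the first len(decoder_input_ids) rows cover the prompt/context tokens and the next len(output) rows cover the newly generated tokens; the numbers and the random tensor are placeholders, not the model's actual scores.

```python
import numpy as np

# Placeholder dimensions mirroring the example above:
# H attention heads, T encoder (time) frames,
# U = len(decoder_input_ids) + len(output) + 1 target positions.
H, T = 8, 50
decoder_input_len = 14  # the 14 prompt/context tokens from the example
output_len = 3          # the 3 newly generated tokens
U = decoder_input_len + output_len + 1
xatt_scores = np.random.rand(H, U, T)

# Assumed layout: rows [decoder_input_len : decoder_input_len + output_len]
# correspond to the newly generated output tokens.
output_xatt = xatt_scores[:, decoder_input_len:decoder_input_len + output_len, :]
print(output_xatt.shape)  # (8, 3, 50)

# The context (" A dream.") starts at index 2 of decoder_input_ids in the
# example, so under the same assumption its rows would be:
context_xatt = xatt_scores[:, 2:decoder_input_len, :]
print(context_xatt.shape)  # (8, 12, 50)
```

Is this row layout the intended one, and if so, what do the extra rows beyond [H, 18, T] correspond to when they appear?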
