
Question about cross-attention decoder_input #15419

@azziko

Description

Hello @mgaido91,

Thank you for your work on #15229.

I would like to use this together with alignatt, but I'm not sure how to pick the right subset of the [H, U, T] scores. I'm also a bit thrown off by an occasional mismatch in the xatt dimensions:

Consider the following decoder_input_ids of length 14 (mapped back to text for better illustration):

<|startofcontext|> A dream.<|startoftranscript|><|emo:undefined|><|en|><|en|><|pnc|><|noitn|><|notimestamp|><|nodiarize|> 

And an output of length 3.

When I examine hypothesis[0].xatt_scores.shape, in most cases I get [H, len(decoder_input_ids) + len(output) + 1, T] = [H, 18, T]. Sometimes I get more than that. My question is: am I right to assume that the positions in xatt_scores that correspond to the new output can be expected in the range

[:, len(decoder_input_ids):len(decoder_input_ids)+len(output), :]

And if I also wanted the context, could I find it starting from the same index at which it starts in decoder_input_ids (i.e. index 2)?

I'm running inference with beam search (beam size 5).
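To make the question concrete, here is a minimal sketch of the slicing I have in mind. It only assumes that xatt_scores is a [H, U, T] array where the first len(decoder_input_ids) rows cover the prompt/context tokens and the next len(output) rows cover the newly generated tokens; the numbers and the random tensor are placeholders, not the model's actual scores.

```python
import numpy as np

# Placeholder dimensions mirroring the example above:
# H attention heads, T encoder (time) frames,
# U = len(decoder_input_ids) + len(output) + 1 target positions.
H, T = 8, 50
decoder_input_len = 14  # the 14 prompt/context tokens from the example
output_len = 3          # the 3 newly generated tokens
U = decoder_input_len + output_len + 1
xatt_scores = np.random.rand(H, U, T)

# Assumed layout: rows [decoder_input_len : decoder_input_len + output_len]
# correspond to the newly generated output tokens.
output_xatt = xatt_scores[:, decoder_input_len:decoder_input_len + output_len, :]
print(output_xatt.shape)  # (8, 3, 50)

# The context (" A dream.") starts at index 2 of decoder_input_ids in the
# example, so under the same assumption its rows would be:
context_xatt = xatt_scores[:, 2:decoder_input_len, :]
print(context_xatt.shape)  # (8, 12, 50)
```

Is this row layout the intended one, and if so, what do the extra rows beyond [H, 18, T] correspond to when they appear?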
