Thank you for this exciting work.
However, I have a question about the potential instability of directly expanding the LLM's decoding space with visual reference prototypes. Since the semantics and number of these prototypes vary with the input, could a single unified cross-entropy training objective cause performance fluctuations?
Did you investigate an alternative, such as separating the classification of visual patches from the LLM's original vocabulary and optimizing it with a distinct auxiliary loss?
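To make the question concrete, here is a minimal PyTorch sketch contrasting the two objectives I have in mind. All names (`text_head`, the prototype count, the auxiliary weight of 0.5) are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, text_vocab = 32, 100

# Hypothetical decoder states for 4 positions and a text-vocabulary head.
h = torch.randn(4, hidden)
text_head = torch.nn.Linear(hidden, text_vocab)

# Input-dependent visual reference prototypes (7 for this particular input).
prototypes = torch.randn(7, hidden)

# (a) Unified objective: append prototype logits to the text logits and
# train one cross-entropy over the expanded decoding space. Target indices
# >= text_vocab select prototypes; the space's size changes per input.
unified_logits = torch.cat([text_head(h), h @ prototypes.T], dim=-1)
unified_targets = torch.tensor([5, 102, 3, 104])
unified_loss = F.cross_entropy(unified_logits, unified_targets)

# (b) Separated objective: classify visual positions against prototypes
# only, via an auxiliary loss, keeping the text vocabulary fixed.
is_visual = torch.tensor([False, True, False, True])
text_loss = F.cross_entropy(text_head(h[~is_visual]), torch.tensor([5, 3]))
aux_loss = F.cross_entropy(h[is_visual] @ prototypes.T, torch.tensor([2, 4]))
separated_loss = text_loss + 0.5 * aux_loss  # auxiliary weight 0.5 (assumed)
```

In (a) the softmax normalizer spans a vocabulary whose tail portion changes size and meaning per input, which is the source of my stability concern; in (b) the text head's output distribution is untouched and the prototype matching is optimized separately.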