Thank you for this exciting work.
However, I have a question about the potential instability of directly expanding the LLM's decoding space with visual reference prototypes. Since the semantics and number of these prototypes vary with the input, could a single unified cross-entropy training objective cause performance fluctuations?
Did you investigate an alternative, such as separating the classification of visual patches from the LLM's original vocabulary and optimizing it with a distinct auxiliary loss?
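To make the question concrete, here is a minimal PyTorch sketch contrasting the two objectives I have in mind. All names (`text_head`, the prototype count, the auxiliary weight of 0.5) are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, text_vocab = 32, 100

# Hypothetical decoder states for 4 positions and a text-vocabulary head.
h = torch.randn(4, hidden)
text_head = torch.nn.Linear(hidden, text_vocab)

# Input-dependent visual reference prototypes (7 for this particular input).
prototypes = torch.randn(7, hidden)

# (a) Unified objective: append prototype logits to the text logits and
# train one cross-entropy over the expanded decoding space. Target indices
# >= text_vocab select prototypes; the space's size changes per input.
unified_logits = torch.cat([text_head(h), h @ prototypes.T], dim=-1)
unified_targets = torch.tensor([5, 102, 3, 104])
unified_loss = F.cross_entropy(unified_logits, unified_targets)

# (b) Separated objective: classify visual positions against prototypes
# only, via an auxiliary loss, keeping the text vocabulary fixed.
is_visual = torch.tensor([False, True, False, True])
text_loss = F.cross_entropy(text_head(h[~is_visual]), torch.tensor([5, 3]))
aux_loss = F.cross_entropy(h[is_visual] @ prototypes.T, torch.tensor([2, 4]))
separated_loss = text_loss + 0.5 * aux_loss  # auxiliary weight 0.5 (assumed)
```

In (a) the softmax normalizer spans a vocabulary whose tail portion changes size and meaning per input, which is the source of my stability concern; in (b) the text head's output distribution is untouched and the prototype matching is optimized separately.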