Question about pre-training the multi-modal LLM

Hello! Thanks for your excellent work! I'd like to know more details about stage 2 in your paper: how did you train the LLama model so that it can generate motion tokens or image tokens given text tokens or the output tokens of the previous round?
Many thanks!