Question about training #79

@ninghaolu

Description

Great work! Thanks for releasing this codebase!

I noticed that the current implementation mainly adopts a distillation-style setup (distilling from a 14B model to a smaller model within the Self-Forcing framework). I'm wondering whether this framework can also be used to post-train a smaller model that has already been pre-trained with Diffusion Forcing, i.e., small → small self-forcing post-training rather than large → small distillation. Did you choose distillation because it makes better use of the pretrained knowledge?
