I don't think it is currently possible to use quantized checkpoints (i.e. checkpoints containing Megatron fakequant layers inserted via modelopt) to learn parameters in the quantized space during GRPO.
This could be done by consuming a PTQ checkpoint exported from modelopt or similar, a quantized HF model (e.g. GPT-OSS), or even a modelopt model that is then exported to the HF format. The key part is post-training in the quantized space.
I imagine loading the Megatron workers is not particularly hard; the more likely challenge is any refit required to pass parameters to vLLM for rollout, since vLLM presumably does not expect the fakequant layers.
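To make the refit concern concrete, here is a minimal sketch of what such a step might look like. This is not modelopt's or vLLM's actual API; `fake_quantize` and `materialize_for_rollout` are hypothetical names, and the quantization scheme (symmetric per-tensor, simulated with plain Python lists) is an illustrative assumption. The idea is to bake the quantize-dequantize round trip into ordinary float tensors so the rollout engine never sees a fakequant wrapper:

```python
def fake_quantize(weights, num_bits=8):
    """Simulated symmetric per-tensor fakequant (illustrative, not modelopt's
    implementation): values stay floats but are snapped onto the quantized
    grid, matching what the fakequant layer computes in the forward pass."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    # No explicit clamp needed: scale is chosen so the max maps to qmax.
    return [round(w / scale) * scale for w in weights]


def materialize_for_rollout(state_dict, num_bits=8):
    """Hypothetical refit step: replace each fakequant parameter with its
    effective float tensor before handing weights to the rollout engine."""
    return {name: fake_quantize(w, num_bits) for name, w in state_dict.items()}
```

The training side would keep the fakequant layers (so gradients flow through the straight-through estimator as usual), while the refit path hands vLLM only the collapsed tensors.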
The key motivation here would be to evaluate the effectiveness of learning the quantized weights while training on the actual task, rather than relying on a separate PTQ, QAT SFT, or similar step.