Commit #1115 broke this nightly test:
tests/test_suites/llm/dpo-mistral-nemo-instruct-2407-1n8g-fsdp2tp8-actckpt-long.sh
The issue appeared as an OOM, and I narrowed it down to the transformers version. I believe this regression is the same one identified here, where the KV cache was suddenly treated as trainable:
huggingface/transformers#39795
The memory pressure on either dtensor path (v1, v2) is exacerbated by higher TP (this test used TP=8) and long sequence lengths. In some settings I saw 4x more memory being used.
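For reference, a minimal repro sketch (not part of the original test; the sequence length and single-GPU setup are illustrative stand-ins) that measures peak memory for one training-mode forward/backward pass. Running it under the currently pinned transformers version versus 4.56 should expose the memory gap described above:

```python
import torch
from transformers import AutoModelForCausalLM

# Model from the failing test; a smaller causal LM can stand in, since the
# regression lives in transformers itself, not in the model weights.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    torch_dtype=torch.bfloat16,
).cuda()
model.train()

seq_len = 8192  # stand-in for the test's long-sequence setting
input_ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device="cuda")

torch.cuda.reset_peak_memory_stats()
loss = model(input_ids=input_ids, labels=input_ids).loss  # causal-LM loss
loss.backward()
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```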
Manually upgrading to transformers 4.56 brings memory back to normal, but Automodel is not ready to upgrade yet, so RL has to disable this test for now.
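Until the upgrade lands, a guard like the following (a hypothetical sketch, not something in the test suite) could fail fast on affected transformers versions instead of OOMing mid-run:

```python
import transformers
from packaging.version import Version

# Affected window per this report: versions before 4.56 show the KV-cache
# memory regression. Raise early rather than OOM partway through the test.
if Version(transformers.__version__) < Version("4.56"):
    raise RuntimeError(
        f"transformers {transformers.__version__} hits the KV-cache regression "
        "(huggingface/transformers#39795); upgrade to >=4.56 or skip this test"
    )
```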