[do not review] Add SFT experiment#2556

Draft
joecummings wants to merge 2 commits intopytorch:mainfrom
joecummings:sft-experiment

Conversation

@joecummings
Member

No description provided.

Adds an SFT dataloader and config under torchtitan/experiments/sft/ that
reuses the existing Trainer without modification. Key features:

- Incremental prefix re-tokenization for correct label masking at BPE
  boundaries (matches torchtune's approach)
- Greedy sequence packing with EOS-based document boundaries for flex/varlen
  attention backends
- Config validation for attention backend compatibility and validation hang
  prevention
- Epoch shuffling with deterministic seeds for checkpoint reproducibility
- GSM8K dataset config with Qwen3 reasoning trace support

Includes 13 unit tests and a 2-GPU integration test.
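A minimal sketch of the incremental prefix re-tokenization idea from the bullet list above (the function name, the single-string tokenizer interface, and the turn format are all illustrative assumptions, not the PR's actual code):

```python
def mask_labels(tokenizer, turns, ignore_idx=-100):
    """Tokenize a conversation turn by turn, masking non-assistant tokens.

    Re-tokenizing the growing prefix (rather than each turn in isolation)
    keeps label boundaries aligned even when a BPE merge spans the join
    between two turns.
    """
    tokens, labels = [], []
    prefix = ""
    for role, text in turns:
        prefix += text
        # Tokenize the full prefix so far; the newly produced tokens are
        # whatever extends the previous tokenization.
        full = tokenizer(prefix)
        new = full[len(tokens):]
        tokens = full
        if role == "assistant":
            labels.extend(new)                       # train on these
        else:
            labels.extend([ignore_idx] * len(new))   # mask prompt tokens
    return tokens, labels
```

With a real BPE tokenizer, `full[len(tokens):]` can differ from tokenizing the new turn alone, which is exactly why the prefix is re-tokenized each time.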
meta-cla bot added the CLA Signed label Mar 11, 2026
Contributor

does it make sense to put in torchtitan/hf_datasets



@dataclass(kw_only=True, slots=True)
class SFTTrainerConfig(Trainer.Config):
Contributor

I'm interested in exploring the feasibility of landing this in core (outside experiments/), possibly by consolidating the trainer / dataloader and their configs.

…idation

When packing multiple documents into one sequence, RoPE positions were
not reset per document, causing later documents to receive wrong
positional embeddings (positions continued from the previous document
instead of restarting at 0). This fixes pytorch#2559.

Changes:
- Yield per-document position tensors from packed sequences that reset
  to 0 at each document boundary, flowing through extra_inputs to
  Decoder.forward(positions=...)
- Validate attn_mask_type='block_causal' when pack_sequences=True to
  prevent cross-document attention leakage
- Simplify _tokenize_sample to single-turn with explicit validation
- Extract _flush_pack_buffer helper and use slice assignment for masking
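The position-reset behavior described in this commit message can be sketched roughly as follows (a toy helper assuming packed documents are described by their lengths; not the PR's implementation):

```python
import torch

def packed_positions(doc_lengths: list[int]) -> torch.Tensor:
    """Per-document RoPE positions for a packed sequence.

    Positions restart at 0 at each document boundary, so a document
    packed after others sees positions 0..len-1 instead of continuing
    from the previous document's last position.
    """
    return torch.cat([torch.arange(n) for n in doc_lengths])

# Packing documents of lengths 3 and 2 into one sequence:
# packed_positions([3, 2]) -> tensor([0, 1, 2, 0, 1])
```

A tensor like this would then be passed alongside the packed tokens (here, per the commit message, via `extra_inputs` into `Decoder.forward(positions=...)`) so rotary embeddings are computed per document rather than per packed sequence.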

Labels

ciflow/8gpu, CLA Signed

2 participants