Yield per-document RoPE position ids from dataset #2560
joecummings wants to merge 1 commit into pytorch:main
Conversation
Fixes pytorch#2559. The dataloader now tracks a position buffer alongside the token buffer, resetting positions to 0 at each document boundary. This ensures RoPE encodes within-document positions correctly when block_causal attention is used.
cc @tianyu-l @francesco-bertolotti: I did the fix that was discussed in #2559, but the "longer term fix" is also pretty simple. I might suggest we just do that in this PR, unless you have objections, b/c that would technically change the behavior of the attention mask construction. Could be a follow-up.
@joecummings
So you are suggesting putting it in dataloading. But then for more complicated, model-specific mask generation (e.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama4/model.py#L209), there still needs to be this post_dataloading_processing https://github.com/pytorch/torchtitan/blob/main/torchtitan/trainer.py#L608, right?
I think this is expected for RoPE.
Fixes #2559
`HuggingFaceTextDataset` now tracks a `_position_buffer` alongside the existing `_token_buffer`. Each document's tokens get positions `[0, 1, ..., doc_len - 1]`, resetting at every document boundary. Positions are yielded as `{"input": input, "positions": positions}` and flow through the trainer's `extra_inputs` into `Decoder.forward(positions=...)` automatically. Checkpoint `state_dict`/`load_state_dict` are updated to persist the position buffer (BC via `.get()`).
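The buffering logic described above can be sketched roughly as follows. This is an illustrative standalone version, not the actual torchtitan code; the names `pack_with_positions`, `token_buffer`, and `position_buffer` are made up to mirror the description.

```python
def pack_with_positions(documents, seq_len):
    """Pack documents into fixed-length samples, tracking per-document
    RoPE positions that reset to 0 at every document boundary."""
    token_buffer, position_buffer = [], []
    for doc in documents:
        token_buffer.extend(doc)
        # Positions restart at 0 for each document, so RoPE only ever
        # encodes within-document offsets under block-causal attention.
        position_buffer.extend(range(len(doc)))
        while len(token_buffer) >= seq_len:
            yield {
                "input": token_buffer[:seq_len],
                "positions": position_buffer[:seq_len],
            }
            token_buffer = token_buffer[seq_len:]
            position_buffer = position_buffer[seq_len:]
```

For example, packing docs of length 3 and 4 with `seq_len=4` yields one sample whose positions are `[0, 1, 2, 0]`: the last slot starts a new document, so its position resets to 0 instead of continuing to 3.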
Longer-term consideration
Right now there are two pieces of per-document state for packed datasets: attention masks and position IDs. Attention masks are computed in the `post_dataloading_process` step and, with this PR, position IDs are built in the dataset. Constructing masks purely from the EOS token id is fragile, especially with post-training multi-turn sequences, where models could co-opt that token for end of sequence versus end of document.
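To make the fragility concrete, here is a minimal sketch of EOS-based document-id inference (a stand-in for what an EOS-driven mask derivation has to do; `doc_ids_from_eos` and `eos_id` are hypothetical names, not torchtitan's actual code):

```python
def doc_ids_from_eos(tokens, eos_id):
    """Assign a document id to each token, starting a new document
    immediately after every EOS token."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == eos_id:
            current += 1  # next token is treated as a new document
    return ids
```

If a multi-turn sample reuses the same token id to mark end of turn, this logic splits one conversation into multiple "documents", which is exactly the failure mode described above.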
The right long-term approach for torchtitan is for datasets to yield `seq_lens` metadata alongside tokens (rather than `position_ids` directly), and for both positions and attention masks to be derived from that single source of truth in post-processing. This would retire the EOS-based `get_document_mask_mod` path entirely and co-locate both computations in one place. It doesn't change how `Decoder` works.
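A sketch of what that single-source-of-truth derivation could look like. This is a plain-Python illustration (a real implementation would build tensors or FlexAttention mask_mods); the function names are invented:

```python
def positions_from_seq_lens(seq_lens):
    # e.g. [3, 2] -> [0, 1, 2, 0, 1]
    return [p for n in seq_lens for p in range(n)]

def block_causal_mask(seq_lens):
    """Dense boolean block-causal mask derived from the same seq_lens."""
    # Document ids: [3, 2] -> [0, 0, 0, 1, 1]; equal ids = same document.
    doc = [i for i, n in enumerate(seq_lens) for _ in range(n)]
    total = len(doc)
    # Query q may attend to key kv iff kv <= q (causal) and both tokens
    # belong to the same document (block-diagonal restriction).
    return [
        [kv <= q and doc[q] == doc[kv] for kv in range(total)]
        for q in range(total)
    ]
```

Because both outputs come from one `seq_lens` list, positions and masks can never disagree about where documents start, which is the point of the proposal.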
Resources: https://github.com/NVIDIA/NeMo/blob/v2.7.0/nemo/collections/llm/gpt/data/core.py, https://github.com/pytorch/torchtune/blob/d0f63bb33d00b8bd3905a010b71d8c6324c2e980/torchtune/datasets/_packed.py#L108-L143
Test plan
Unit tests pass
Also, for fun, a comparison between runs WITH position ids and WITHOUT. The loss is definitely different, but not by a ton:
