Skip to content

feat: length-aware DP partitioning for variable seq lengths (#13)#109

Open
ppraneth wants to merge 2 commits into
lightseekorg:mainfrom
ppraneth:dp
Open

feat: length-aware DP partitioning for variable seq lengths (#13)#109
ppraneth wants to merge 2 commits into
lightseekorg:mainfrom
ppraneth:dp

Conversation

@ppraneth
Copy link
Copy Markdown
Contributor

Closes #13

Replaces round-robin in _partition_results with longest-first greedy bin-packing, with a per-rank capacity cap so each rank still gets exactly per_dp_rank_batch_size samples per dispatch.

Sequence length is read from tensor_shapes["input_ids"][-1] same field the data fetcher already uses. No new field on InferenceOutput.

Falls back to round-robin when there's at most one sample per rank(eval dispatch, USP, or per_dp_rank_batch_size=1) since there's nothing to balance in that case.

Why approach A and not B/C

Went with A from the issue because it's the smallest safe change. C (pre-sorting at preprocessing) doesn't really survive async inference generation length is variable and dispatch order tracks inference completion. B (larger buffer) is the bigger win but changes dispatch timing and needs real measurement before landing.

Honest caveat

Step time per dispatch is bounded by longest_sample * mbs whichever rank holds the longest sample pads to that length. So this PR mostly reduces wasted FLOPs on the other ranks rather than dropping step time. The real lever for step time is Approach B (larger candidate buffer so we can defer outliers). Leaving that as a follow-up once we can measure.

Tests

Added tests/test_partition_results.py — pure Python, no GPU, no Ray cluster (uses the same ray.remote patch trick as
test_capacity_handling.py). Covers:

  • Round-robin fallback paths (dp=1, 1 sample/rank, empty input)
  • Bin-packing balance vs. round-robin
  • Per-rank capacity holds under skew (one 2000-len + many 100s)
  • Determinism for fixed input
  • Defensive _seq_len (missing input_ids key, tensor_shapes=None)

I haven't run an end-to-end training job. Someone with a cluster should validate by logging max_seq_in_batch per DP rank before/after to confirm imbalance actually drops on real workloads and that tells us whether B is worth doing.

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@ppraneth
Copy link
Copy Markdown
Contributor Author

@yubofredwang Could you take a look at this PR and share your feedback?

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d894e2f4d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread torchspec/controller/training_controller.py
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: reduce DP rank imbalance from variable sequence lengths

1 participant