feat: length-aware DP partitioning for variable seq lengths (#13) by ppraneth · Pull Request #109 · lightseekorg/TorchSpec

ppraneth · 2026-05-24T07:06:52Z

Closes #13

Replaces round-robin in _partition_results with longest-first greedy bin-packing, with a per-rank capacity cap so each rank still gets exactly per_dp_rank_batch_size samples per dispatch.

Sequence length is read from tensor_shapes["input_ids"][-1] same field the data fetcher already uses. No new field on InferenceOutput.

Falls back to round-robin when there's at most one sample per rank(eval dispatch, USP, or per_dp_rank_batch_size=1) since there's nothing to balance in that case.

Why approach A and not B/C

Went with A from the issue because it's the smallest safe change. C (pre-sorting at preprocessing) doesn't really survive async inference generation length is variable and dispatch order tracks inference completion. B (larger buffer) is the bigger win but changes dispatch timing and needs real measurement before landing.

Honest caveat

Step time per dispatch is bounded by longest_sample * mbs whichever rank holds the longest sample pads to that length. So this PR mostly reduces wasted FLOPs on the other ranks rather than dropping step time. The real lever for step time is Approach B (larger candidate buffer so we can defer outliers). Leaving that as a follow-up once we can measure.

Tests

Added tests/test_partition_results.py — pure Python, no GPU, no Ray cluster (uses the same ray.remote patch trick as
test_capacity_handling.py). Covers:

Round-robin fallback paths (dp=1, 1 sample/rank, empty input)
Bin-packing balance vs. round-robin
Per-rank capacity holds under skew (one 2000-len + many 100s)
Determinism for fixed input
Defensive _seq_len (missing input_ids key, tensor_shapes=None)

I haven't run an end-to-end training job. Someone with a cluster should validate by logging max_seq_in_batch per DP rank before/after to confirm imbalance actually drops on real workloads and that tells us whether B is worth doing.

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

ppraneth · 2026-05-24T07:08:14Z

@yubofredwang Could you take a look at this PR and share your feedback?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d894e2f4d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

test

4d894e2

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

chatgpt-codex-connector Bot reviewed May 24, 2026

View reviewed changes

Comment thread torchspec/controller/training_controller.py

fix non-divisible batches

7dfda81

Signed-off-by: ppraneth <pranethparuchuri@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: length-aware DP partitioning for variable seq lengths (#13)#109

feat: length-aware DP partitioning for variable seq lengths (#13)#109
ppraneth wants to merge 2 commits into
lightseekorg:mainfrom
ppraneth:dp

ppraneth commented May 24, 2026

Uh oh!

ppraneth commented May 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ppraneth commented May 24, 2026

Why approach A and not B/C

Honest caveat

Tests

Uh oh!

ppraneth commented May 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant