[WIP] MAGIC query padding #176

Open
luciaquirke wants to merge 4 commits into magic-dtensor-patch from magic-query-padding

luciaquirke (Collaborator) commented Mar 7, 2026

  • Trial a fix for the annoying errors raised when the batch size is not a multiple of the world size
  • Add resume functionality for crashes (we've already had one when a pod went down)
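
For illustration, resume-on-crash can be sketched as a "load if the file exists, otherwise compute and save" helper. The commit notes in this PR mention saving scores and the baseline to .npy files; everything else here (the name `load_or_compute`, the temp-file rename, the returned flag) is a hypothetical sketch and not necessarily how the PR implements it.

```python
from pathlib import Path

import numpy as np


def load_or_compute(path: Path, compute):
    """Return (array, was_cached). `compute` is only called if `path` is missing.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if path.exists():
        # Resume path: an earlier run already finished this stage.
        return np.load(path), True

    result = np.asarray(compute())

    # Write to a temporary file and rename it, so a crash mid-write
    # cannot leave a truncated file that looks like a finished result.
    tmp = path.with_name(path.name + ".tmp.npy")
    np.save(tmp, result)
    tmp.replace(path)
    return result, False
```

On a rerun after a crash, the expensive stage (for example the backward pass) is skipped whenever its output file is already on disk.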

luciaquirke and others added 4 commits March 7, 2026 07:55
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Save scores and baseline to .npy files after backward pass; skip backward
  on resume if files exist
- Save validation progress incrementally; resume from last completed subset
- DataStream: auto-pad batch_size to world_size multiple, expose
  real_examples_per_rank and padding metadata
- Query stream: mask padded labels with ignore_index and apply correction
  factor after all-reduce for batch-invariant gradients
- Training stream: keep divisibility requirement (padding changes loss via
  .mean() denominator)
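
To make the correction factor concrete, here is a minimal NumPy simulation of the idea, not the PR's code. It assumes each rank averages over its local slots (padded slots contribute zero) and that the all-reduce averages across ranks; under those assumptions the reduced gradient is too small by a factor of `n_real / n_padded`, so multiplying by `n_padded / n_real` recovers the true mean. The function name and the layout of examples across ranks are hypothetical.

```python
import math

import numpy as np


def padded_allreduce_mean(per_example_grads: np.ndarray, world_size: int):
    """Simulate padding + per-rank mean + all-reduce mean on per-example gradients.

    Returns (corrected, uncorrected). Assumed setup, for illustration only.
    """
    n_real, dim = per_example_grads.shape
    n_padded = math.ceil(n_real / world_size) * world_size

    # Padded examples are masked out, so their gradient contribution is zero.
    padded = np.zeros((n_padded, dim))
    padded[:n_real] = per_example_grads

    # Each rank takes an equal contiguous shard and averages over its local slots.
    shards = padded.reshape(world_size, n_padded // world_size, dim)
    local_means = shards.mean(axis=1)

    # All-reduce with averaging across ranks.
    uncorrected = local_means.mean(axis=0)

    # The denominators counted padded slots; rescale to the real example count.
    correction = n_padded / n_real
    return uncorrected * correction, uncorrected
```

The same arithmetic suggests why the training stream keeps the divisibility requirement: without a correction like this one, padding changes the `.mean()` denominator and therefore the loss.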

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
batch_size now means per-device; global batch = batch_size * world_size,
which is always divisible by construction. This eliminates the need for
padding, weight correction factors, label masking, and associated
complexity in both DataStream and double_backward.
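
A small sketch of the per-device convention: because the global batch is defined as `batch_size * world_size`, it splits evenly across ranks with no remainder. The contiguous sharding layout and the name `rank_slice` are assumptions for illustration; the PR may assign examples to ranks differently.

```python
def rank_slice(step: int, rank: int, batch_size: int, world_size: int) -> range:
    """Dataset indices that `rank` would process at `step` (assumed contiguous layout)."""
    global_batch = batch_size * world_size  # divisible by world_size by construction
    start = step * global_batch + rank * batch_size
    return range(start, start + batch_size)
```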

Adds test_magic.py with:
- Padding gradient invariance test (21 parametrized cases using real
  pythia-14m model, verified < 1e-5 in f64)
- E2E magic CLI test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the dataset size isn't a multiple of global_batch_size, the last
batch is padded: weight=0, labels=-100 for padded positions, and real
positions in that batch are correction-scaled. DataStream.reset_weights()
restores this state for the validation loop.
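
The final-batch padding described above could look roughly like the following NumPy sketch. The -100 label value matches the default `ignore_index` of PyTorch's cross-entropy loss. The correction formula (`global_batch_size / n_real` on real rows) is an assumption chosen so the weights average to 1; the commit message does not spell out the exact scaling, and `pad_final_batch` is a hypothetical name.

```python
import numpy as np

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_final_batch(labels: np.ndarray, global_batch_size: int):
    """Pad a short final batch to `global_batch_size` rows.

    Returns (padded_labels, weights). Illustrative sketch only.
    """
    n_real, seq_len = labels.shape
    if not 0 < n_real <= global_batch_size:
        raise ValueError("final batch must have between 1 and global_batch_size rows")

    # Padded rows carry only ignored labels, so they add nothing to the loss.
    padded_labels = np.full((global_batch_size, seq_len), IGNORE_INDEX, dtype=labels.dtype)
    padded_labels[:n_real] = labels

    # Padded rows get weight 0; real rows are scaled up (assumed correction)
    # so that the mean weight over the padded batch is 1.
    weights = np.zeros(global_batch_size)
    weights[:n_real] = global_batch_size / n_real
    return padded_labels, weights
```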

Also adds test_final_batch_padding to verify the padding behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>