
[WIP] Add mixed precision gradient collection #189

Open
luciaquirke wants to merge 5 commits into main from mixed-prec

Conversation

@luciaquirke
Collaborator

@luciaquirke luciaquirke commented Mar 11, 2026

  • Always leave optimizers and preconditioners in fp32
  • Use mixed precision fwd-bwd
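The two bullets above can be sketched in a few lines of PyTorch (a minimal illustration, not this PR's actual code): parameters and optimizer state stay in fp32, while the forward pass runs under a bf16 autocast, so parameter gradients still land in fp32.

```python
import torch

# Hypothetical sketch: mixed precision fwd-bwd with fp32 optimizer state.
model = torch.nn.Linear(8, 4)                          # parameters are fp32
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)   # optimizer state fp32

x = torch.randn(2, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()   # matmul runs in bf16 under autocast
loss.backward()                     # grads come back in the param dtype (fp32)
opt.step()
```

The same pattern applies on GPU with `device_type="cuda"`; autocast only downcasts activations, so the fp32 master weights and optimizer moments are untouched.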

My conclusions from the test below are basically:

  • In practice, retraining effects from fp32 and bf16 attribution are essentially indistinguishable
  • TrackStar in its current form is lame for coreset selection
  • Eval selection results TODO
  • Single item-based selection results TODO

Coreset selection (last regression: #178):

I tried to repeat the analysis from the last PR, but I became suspicious of the results: the training loss drop on this dataset is poor because the model has already undergone IFT. So I did a second round using Qwen + LoRA + a dataset Qwen is bad at (sheet music) to get a smooth loss drop:

```shell
torchrun --nproc_per_node 8 -m examples.filter_data \
    --model Qwen/Qwen2.5-1.5B \
    --dataset sander-wood/irishman \
    --prompt_column "abc notation" \
    --max_samples 10000 \
    --num_examples 1000 \
    --num_epochs 1 \
    --precision fp32 \
    --learning_rate 5e-5 \
    --subset default \
    --filter random
```

I used an effective batch size of 128, LoRA rank 64, and projection dim 16, and took the query over the entire dataset (coreset selection should not use an eval set).
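For context, "projection dim 16" means each per-example gradient is compressed by a random projection before scoring. A toy sketch of the idea (illustrative names, not the repo's API):

```python
import torch

torch.manual_seed(42)
proj_dim = 16
flat_grad = torch.randn(4096)                       # flattened per-example grad
proj = torch.randn(4096, proj_dim) / proj_dim**0.5  # fixed random projection
g_small = flat_grad @ proj                          # 16-dim sketch of the grad
```

The scaled Gaussian projection approximately preserves inner products (Johnson–Lindenstrauss), which is what cosine-similarity-based scoring needs.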

ordered=False Leave-90%-out coreset selection (train loss), seed=42, 10k full dataset (works, narrowly)

"Attribution": 2.77 (raw grad cosine sim)
Random: 2.52
TrackStar BF16: 2.38
TrackStar FP32: 2.43
Full dataset: 1.77
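The "raw grad cosine sim" baseline above can be sketched as follows (a toy, assuming projected per-example gradients are already collected): score each training example by the cosine similarity of its gradient to the mean query gradient, then keep the top 10% as the coreset.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
train_grads = torch.randn(1000, 16)      # projected per-example gradients
query_grad = torch.randn(16)             # mean gradient over the query set

scores = F.cosine_similarity(train_grads, query_grad.unsqueeze(0), dim=1)
coreset_idx = scores.topk(100).indices   # leave-90%-out: keep the top 10%
```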

ordered=True Leave-90%-out coreset selection (train loss), seed=42, 50k full dataset (works, narrowly)

Full dataset: 1.287
Random: 1.878
TrackStar BF16: 1.763
TrackStar FP32: 1.763

ordered=True Leave-80%-out retrain train loss, seed=42, 50k full dataset (actually kind of sucks: the loss spikes hard at the end for TrackStar fp32 and bf16, which are almost exactly the same. They would probably do better if stability could be maintained, but the gains here aren't large enough for me to keep experimenting).

Full dataset: 1.287
Random: 1.52
TrackStar BF16: 1.58
TrackStar FP32: 1.57

I tried the eval set query filtering, one step closer to the core thing TrackStar is supposed to be good at:

ordered=True Leave-90%-out retrain eval loss, seed=42, 50k full dataset

Full dataset:
Random:
TrackStar BF16:
TrackStar FP32:


Note that mixed precision attribution matches our mixed precision training so it may have some predictive advantage over single precision attribution.

TODO

  • check we maintain correct dtype outside trackstar, in normalizers
  • make coreset selection clean for future regression testing

@luciaquirke luciaquirke requested a review from LouisYRYJ March 11, 2026 04:23
@luciaquirke luciaquirke changed the title Add mixed precision gradient collection [WIP] Add mixed precision gradient collection Mar 11, 2026
@LouisYRYJ
Contributor

Generally, will mixed precision be adjustable in the config or not?
Are there downsides to using it?

@luciaquirke
Collaborator Author

luciaquirke commented Mar 13, 2026

Generally, will mixed precision be adjustable in the config or not? Are there downsides to using it?

I think we will not let it be adjustable for now. The attribution accuracy should be all upside because we are more closely matching bf16/fp16 training (pure bf16 or fp16 training is a thing but it's vanishingly rare). The downside is that fitting normalizers and preconditioners will presumably use more VRAM/wall clock time. I think I'm comfortable with this because we can get a good fit for these values in ~10k data points.
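The fp32 preconditioner point can be illustrated with a small sketch (hypothetical code, not the repo's implementation): second-moment statistics are accumulated in an fp32 buffer even when the incoming gradients are bf16, so the fit doesn't lose precision to repeated low-precision summation.

```python
import torch

torch.manual_seed(0)
dim, n = 16, 1000
precond = torch.zeros(dim, dim)              # fp32 accumulator
for _ in range(n):
    g = torch.randn(dim).to(torch.bfloat16)  # simulated bf16 gradient
    gf = g.float()                           # upcast before accumulating
    precond += torch.outer(gf, gf)
precond /= n                                 # running second-moment estimate
```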

