[WIP] Add mixed precision gradient collection #189
Open
luciaquirke wants to merge 5 commits into main from
Conversation
luciaquirke
commented
Mar 11, 2026
luciaquirke
commented
Mar 11, 2026
Contributor
Generally, will mixed precision be adjustable in the config or not?
luciaquirke
commented
Mar 11, 2026
Collaborator
Author
I think we won't make it adjustable for now. The attribution accuracy should be all upside, because we more closely match bf16/fp16 training (pure bf16 or fp16 training is a thing, but it's vanishingly rare). The downside is that fitting normalizers and preconditioners will presumably use more VRAM/wall-clock time. I'm comfortable with this because we can get a good fit for these values in ~10k data points.
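As a rough sketch of what mixed precision gradient collection can look like (an assumption about the approach, not code from this PR: bf16 autocast for the forward/backward math over fp32 master weights, with a hypothetical `collect_grad` helper):

```python
import torch

# Hypothetical sketch: run forward/backward under bf16 autocast while the
# parameters stay fp32, then flatten all parameter grads into one vector.
def collect_grad(model, inputs, targets, dtype=torch.bfloat16):
    model.zero_grad()
    with torch.autocast(device_type="cpu", dtype=dtype):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # Grads land in the parameters' dtype (fp32 here), even though the
    # matmuls inside autocast ran in bf16.
    return torch.cat([p.grad.flatten() for p in model.parameters()])

model = torch.nn.Linear(4, 2)
g = collect_grad(model, torch.randn(8, 4), torch.randn(8, 2))
```

This mirrors mixed precision training: the compute-heavy ops run in bf16, but the collected gradient vector is still fp32.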
My conclusions from the tests below are basically:
Coreset selection (last regression: #178):
I tried to repeat the analysis from the last PR, but I became suspicious of the results: the training loss drop on this dataset sucks because the model has already undergone IFT. So I did a second round using Qwen + LoRA + a dataset Qwen is bad at (sheet music) to get a smooth loss drop.
I used an effective batch size of 128, LoRA rank 64, and projection dim 16, and took the query over the entire dataset (coreset selection should not use an eval set).
ordered=False Leave-90%-out coreset selection (train loss), seed=42, 10k full dataset (works, narrowly)
"Attribution": 2.77 (raw grad cosine sim)
Random: 2.52
TrackStar BF16: 2.38
TrackStar FP32: 2.43
Full dataset: 1.77
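The "raw grad cosine sim" attribution baseline above can be sketched roughly as follows (a hypothetical reconstruction, not the PR's code: `select_coreset`, the mean-gradient query, and the 16-dim gradients are all illustrative):

```python
import numpy as np

# Score each training example by cosine similarity between its (projected)
# gradient and a query gradient, then keep the top 10% (leave-90%-out).
def select_coreset(example_grads, query_grad, keep_frac=0.10):
    g = example_grads / np.linalg.norm(example_grads, axis=1, keepdims=True)
    q = query_grad / np.linalg.norm(query_grad)
    scores = g @ q                      # cosine similarity per example
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(-scores)[:k]      # indices of the retained examples

rng = np.random.default_rng(42)
grads = rng.normal(size=(100, 16))      # e.g. projection dim 16
idx = select_coreset(grads, grads.mean(axis=0))
```

With the query taken over the entire training set (no eval set), the query gradient is just the mean training gradient, as in the setup described above.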
ordered=True Leave-90%-out coreset selection (train loss), seed=42, 50k full dataset (works, narrowly)
Full dataset: 1.287
Random: 1.878
TrackStar BF16: 1.763
TrackStar FP32: 1.763
ordered=True Leave-80%-out retrain train loss, seed=42, 50k full dataset (actually kind of sucks because the loss spikes at the end for TrackStar FP32 and BF16, which are almost exactly the same; they'd probably work better if stability could be maintained, but the gains here aren't large enough for me to keep experimenting).
Full dataset: 1.287
Random: 1.52
TrackStar BF16: 1.58
TrackStar FP32: 1.57
I tried the eval set query filtering, one step closer to the core thing TrackStar is supposed to be good at:
ordered=True Leave-90%-out retrain eval loss, seed=42, 50k full dataset
Full dataset:
Random:
TrackStar BF16:
TrackStar FP32:
Note that mixed precision attribution matches our mixed precision training, so it may have some predictive advantage over single precision attribution.
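A toy illustration of why the precision choice can matter at all (not from this PR): rounding an fp32 gradient through bf16 perturbs it only slightly, so bf16 and fp32 attribution scores stay close but are not identical.

```python
import torch

# Round an fp32 gradient vector through bf16 and compare directions.
g32 = torch.randn(1000)
g16 = g32.to(torch.bfloat16).to(torch.float32)
cos = torch.nn.functional.cosine_similarity(g32, g16, dim=0).item()
# cos is very close to, but generally not exactly, 1.0
```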
TODO