Fix NaN validation loss in SFT training by correcting label masking logic#335
Open
sfc-gh-aponnusamy wants to merge 8 commits into main from
Conversation
Force-pushed from 0871e7a to c2d3fb2
sfc-gh-sbekman approved these changes on Jan 5, 2026
Summary
This PR fixes a bug where NaN validation loss was occurring during SQL autocompletion SFT training. The root cause was incorrect label masking that caused all labels to be set to `-100` (ignored), resulting in NaN loss during training.

Changes
1. Fixed `get_assistant_start_end_indices()` in `sft_factory.py`

Problem: The previous implementation searched for assistant content from the beginning of the conversation text each time. This could incorrectly match content that appeared earlier in the conversation (e.g., in user context/history).

Solution: Now tracks a `search_start` position and processes ALL messages in order, ensuring assistant content is found AFTER the preceding user message rather than at its first occurrence anywhere in the text.
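The corrected search strategy for fix 1 can be sketched as below. The function name comes from the PR, but the signature, the `messages` format, and the `text` argument are assumptions for illustration; the real code in `sft_factory.py` may differ.

```python
def get_assistant_start_end_indices(messages, text):
    """Locate each assistant message's character span within the rendered
    conversation text, scanning forward so earlier occurrences of the same
    string (e.g. inside user context) are never matched."""
    ranges = []
    search_start = 0  # advance past each matched message
    for message in messages:
        content = message["content"]
        start = text.find(content, search_start)
        if start == -1:
            # Content not found: record an invalid range instead of guessing.
            if message["role"] == "assistant":
                ranges.append((-1, -1))
            continue
        end = start + len(content)
        if message["role"] == "assistant":
            ranges.append((start, end))
        search_start = end  # next search begins after this message
    return ranges
```

With `search_start`, an assistant reply like `"hi"` that also appears inside the preceding user turn is found at its position after that turn, not at its first occurrence in the text.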
2. Fixed `get_masked_labels()` in `sft_factory.py`

Problem: The token inclusion condition required tokens to be fully contained within assistant ranges (`id_s >= s and id_e <= e`). This was too strict for short assistant content, where tokenizer offsets can span wider than the actual content.

Solution: Changed to an overlap condition (`id_s < e and id_e > s`) that includes tokens if they overlap with any assistant range. Also added handling for invalid ranges (`s == -1` means the content was not found).
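The overlap condition can be illustrated with a minimal sketch. The argument names (`offsets`, `assistant_ranges`) and the character-offset representation are assumptions for illustration, not the file's actual signature:

```python
def get_masked_labels(input_ids, offsets, assistant_ranges, ignore_index=-100):
    """Keep labels only for tokens that overlap an assistant span.

    offsets[i] is the (id_s, id_e) character span of token i; assistant_ranges
    holds (s, e) character spans of assistant content, or (-1, -1) when the
    content was not found in the text.
    """
    labels = [ignore_index] * len(input_ids)
    for i, (id_s, id_e) in enumerate(offsets):
        for s, e in assistant_ranges:
            if s == -1:  # invalid range: content was never located
                continue
            if id_s < e and id_e > s:  # overlap, not full containment
                labels[i] = input_ids[i]
                break
    return labels
```

For an assistant span of (5, 11) and tokens covering (0, 4), (4, 8), (8, 12), (12, 16), the overlap test keeps the two middle tokens; the old containment test (`id_s >= s and id_e <= e`) would have rejected both, masking everything.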
3. Added Debug Logging

A `DEBUG_LABEL_MASKING=1` environment variable enables detailed logging when labels are unexpectedly all masked or very few are unmasked.
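The opt-in debug check might look roughly like this; the helper name, the return-a-message design, and the `few_threshold` value are hypothetical, with only the `DEBUG_LABEL_MASKING=1` gate taken from the PR:

```python
import os


def masking_debug_message(labels, ignore_index=-100, few_threshold=5):
    """Return a warning string when label masking looks wrong, else None.

    Only active when DEBUG_LABEL_MASKING=1 is set, matching the PR's
    opt-in debug flag.
    """
    if os.environ.get("DEBUG_LABEL_MASKING") != "1":
        return None
    unmasked = sum(1 for label in labels if label != ignore_index)
    if unmasked == 0:
        return f"all {len(labels)} labels are masked"
    if unmasked < few_threshold:
        return f"only {unmasked}/{len(labels)} labels are unmasked"
    return None
```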
Files Changed

- `arctic_training/data/sft_factory.py` - label masking fixes and debug logging
- `arctic_training/trainer/trainer.py` - NaN/Inf evaluation loss detection

Root Cause
The NaN loss was caused by all labels being masked (set to -100), which happens when the assistant content is never located in the conversation text (so every range is invalid) or when the token inclusion condition excludes every token. When all labels are -100, the cross-entropy loss computation has no valid targets, resulting in NaN.
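Why zero valid targets yields NaN can be shown with a pure-Python sketch of the mean reduction over non-ignored labels; frameworks like PyTorch behave analogously here, returning NaN (the 0/0 case) rather than raising:

```python
def mean_masked_loss(token_losses, labels, ignore_index=-100):
    """Average per-token losses over non-ignored labels, mimicking
    cross-entropy's mean reduction with an ignore_index."""
    valid = [loss for loss, label in zip(token_losses, labels)
             if label != ignore_index]
    if not valid:
        return float("nan")  # 0 valid targets: sum 0 / count 0
    return sum(valid) / len(valid)
```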