
Fix NaN validation loss in SFT training by correcting label masking logic#335

Open
sfc-gh-aponnusamy wants to merge 8 commits into main from ac-sft-fix

Conversation


@sfc-gh-aponnusamy sfc-gh-aponnusamy commented Jan 5, 2026

Summary

This PR fixes a bug where NaN validation loss was occurring during SQL autocompletion SFT training. The root cause was incorrect label masking that caused all labels to be set to -100 (ignored), resulting in NaN loss during training.

Changes

1. Fixed get_assistant_start_end_indices() in sft_factory.py

Problem: The previous implementation searched for assistant content from the beginning of the conversation text each time. This could incorrectly match content that appeared earlier in the conversation (e.g., in user context/history).

Solution: Now tracks a search_start position and processes all messages in order, ensuring assistant content is matched after the preceding user message rather than at its first occurrence anywhere in the text.
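A minimal sketch of the corrected search logic described above (function signature, message format, and variable names are assumptions based on this description, not the actual sft_factory.py source):

```python
def get_assistant_start_end_indices(messages, conversation_text):
    """Return (start, end) character ranges for each assistant message,
    searching forward through the text so that duplicate content earlier
    in the conversation is never matched by mistake."""
    ranges = []
    search_start = 0  # advances past every message already located
    for message in messages:
        content = message["content"]
        idx = conversation_text.find(content, search_start)
        if idx == -1:
            if message["role"] == "assistant":
                ranges.append((-1, -1))  # not found; caller must handle
            continue
        if message["role"] == "assistant":
            ranges.append((idx, idx + len(content)))
        search_start = idx + len(content)  # never re-match earlier text
    return ranges
```

For example, with a user turn "say hi" followed by an assistant turn "hi", a naive search from position 0 would match the "hi" inside "say hi"; advancing search_start past the user message finds the real assistant occurrence instead.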

2. Fixed get_masked_labels() in sft_factory.py

Problem: The token inclusion condition required tokens to be fully contained within assistant ranges (id_s >= s and id_e <= e). This was too strict for short assistant content where tokenizer offsets can span wider than the actual content.

Solution: Changed to an overlap condition (id_s < e and id_e > s) that includes a token if its character span overlaps any assistant range. Also added handling for invalid ranges (s == -1, meaning the content was not found).
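The overlap check described above can be sketched as follows (names and the exact call shape are assumptions from this description; in practice the offsets would come from a tokenizer's offset_mapping):

```python
IGNORE_INDEX = -100  # tokens with this label are ignored by cross-entropy

def get_masked_labels(input_ids, offsets, assistant_ranges):
    """Keep a token's label only if its character span overlaps any
    assistant range; all other tokens are masked with -100."""
    labels = []
    for token_id, (id_s, id_e) in zip(input_ids, offsets):
        keep = False
        for (s, e) in assistant_ranges:
            if s == -1:  # content was not found; skip invalid range
                continue
            if id_s < e and id_e > s:  # overlap, not full containment
                keep = True
                break
        labels.append(token_id if keep else IGNORE_INDEX)
    return labels
```

With the old containment test, a token whose offsets extend slightly past a short assistant span (e.g. span (5, 9) and token offsets (4, 8)) would be masked; the overlap test keeps it.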

3. Added Debug Logging

  • Label masking debug: Set the DEBUG_LABEL_MASKING=1 environment variable to enable detailed logging when labels are unexpectedly all masked or very few remain unmasked
  • NaN loss detection: Added logging in the evaluation loop to detect and report when NaN/Inf losses occur, including batch and label statistics
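A rough sketch of what these two debug hooks could look like (only the DEBUG_LABEL_MASKING flag name comes from this PR; the helper names and surrounding trainer code are assumptions):

```python
import logging
import math
import os

logger = logging.getLogger(__name__)

# Opt-in flag, as described in the PR: export DEBUG_LABEL_MASKING=1
DEBUG_LABEL_MASKING = os.environ.get("DEBUG_LABEL_MASKING") == "1"

def check_labels(labels, ignore_index=-100):
    """Warn when a sample's labels are entirely masked and return the
    count of non-masked labels."""
    non_masked = sum(1 for label in labels if label != ignore_index)
    if DEBUG_LABEL_MASKING and non_masked == 0:
        logger.warning("All %d labels are masked; loss will be NaN", len(labels))
    return non_masked

def check_eval_loss(loss, step):
    """Report non-finite losses seen during the evaluation loop."""
    if math.isnan(loss) or math.isinf(loss):
        logger.warning("Non-finite eval loss %r at step %d", loss, step)
        return False
    return True
```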

Files Changed

  • arctic_training/data/sft_factory.py - Label masking fixes and debug logging
  • arctic_training/trainer/trainer.py - NaN/Inf evaluation loss detection

Root Cause

The NaN loss was caused by all labels being masked (set to -100), which happens when:

  1. The assistant content search found the wrong occurrence of text (earlier in conversation)
  2. The token overlap logic was too strict (requiring full containment vs overlap)

When all labels are -100, the cross-entropy loss computation has no valid targets, resulting in NaN.
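The NaN mechanism can be illustrated in isolation (a pure-Python stand-in, not the actual training code): a mean-reduced cross-entropy averages over only the non-ignored targets, so zero valid targets yields a 0/0 division.

```python
import math

def mean_ce(per_token_losses, labels, ignore_index=-100):
    """Mean cross-entropy over non-ignored targets; NaN when every
    target is ignored, mirroring a 0/0 mean reduction."""
    valid = [loss for loss, label in zip(per_token_losses, labels)
             if label != ignore_index]
    if not valid:
        return float("nan")  # no valid targets: the 0/0 case
    return sum(valid) / len(valid)
```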

@sfc-gh-aponnusamy sfc-gh-aponnusamy changed the title Ac sft fix Fix NaN validation loss in SFT training by correcting label masking logic Jan 5, 2026
@sfc-gh-aponnusamy sfc-gh-aponnusamy marked this pull request as ready for review January 5, 2026 19:28

@sfc-gh-sbekman sfc-gh-sbekman left a comment


The trainer part is perfect, the data part I'd ask for someone who works a lot with instruct data to validate.

