fix(curate): fail fast on list-typed label_column (regression) #7
Open
mdressman wants to merge 1 commit into
Conversation
Strategy assessor previously inferred label_column=positive for HarmBench-style datasets where `positive` is a list<string> column of completion examples, not a string label. curate then silently wrote 0 rows.

Adds per-component-skip with a clear `label_column_type_mismatch` failure (surfaced in lockfile + report.md, mirroring the existing failure-classification pattern) and a regression test mirroring the failure mode.

Bug reproduces on main: a recipe with a list-typed label_column materialises 0 rows silently with only a generic 'usually a recipe column mismatch' note in report.md. After this change, the component is skipped with category=label_column_type_mismatch citing the observed type and a sample value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What & why
Catches a regression where the strategy assessor sometimes infers `label_column` from the column name alone, e.g. HarmBench's `positive` column, which is actually a `list<string>` of completion examples, not a scalar label. Before this change, curate silently wrote 0 rows with only a generic `recipe column mismatch` note in `report.md`.

After this change, the component is skipped with a clear `label_column_type_mismatch` failure category surfaced in both `lockfile.json` and `report.md`, with the observed type, a sample value, and an actionable hint.

How it works
- […] `text_column` or `label_column`.
- […] `label_column` value is missing, `None`, a container type (list/dict/tuple/set), or a scalar whose `str(...)` form isn't a key in `label_value_map`.
- […] `LabelColumnTypeMismatch` (subclass of `DatasetScoutError`) with the column name, observed type, sample value, and bad fraction.
- `_classify_component_failure` routes it to the new `label_column_type_mismatch` category with a recipe-fixing hint.

Relationship to #3
Complementary, not redundant. #3 catches column-name hallucinations (the column doesn't exist). This catches value-type hallucinations (the column exists but its values are the wrong type). Both validations now run off the same probe buffer: `probe[0]` (from FM1+FM5 mitigations: column verification, schema validation, label distribution warnings; Performance Optimizations #3).

Rebased the original commit cleanly onto the post-#3 `main` and merged the two validations into a single probe pass.

Behaviour change?
Yes: the user-visible change is `0 silent rows` → skipped component with explicit failure category and recipe-fix hint.

How tested
- `uv run pytest -m unit`: 570 passed locally
- `uv run ruff check .` clean
- `uv run mypy` clean
- Regression test (`test_curate_skips_component_when_label_column_is_list_typed`)

Honest-limits impact
None: this narrows a known honest-limit (assessor mis-inferring schema). No README change needed beyond the natural `fewer silent failures` improvement.
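
For reviewers who want the gist without opening the diff, here is a rough sketch of the value-type check described under "How it works". It is not the code in this PR. `LabelColumnTypeMismatch`, `DatasetScoutError`, `label_column` and `label_value_map` are names from the description; everything else is an assumption made for illustration: the helper names `is_bad_label` and `check_label_column`, the exception's attribute names, the probe being a list of dict rows, and the `max_bad_fraction` threshold.

```python
from typing import Any, Mapping, Sequence


class DatasetScoutError(Exception):
    """Stand-in for the project's base error; the real class is not shown here."""


class LabelColumnTypeMismatch(DatasetScoutError):
    """Assumed shape: carries the four fields listed in the PR description."""

    def __init__(self, column: str, observed_type: str, sample: Any, bad_fraction: float) -> None:
        self.column = column
        self.observed_type = observed_type
        self.sample = sample
        self.bad_fraction = bad_fraction
        super().__init__(
            f"label_column {column!r}: observed type {observed_type}, "
            f"sample {sample!r}, bad fraction {bad_fraction:.2f}"
        )


def is_bad_label(value: Any, label_value_map: Mapping[str, Any]) -> bool:
    """True if the value cannot be used as a scalar label."""
    if value is None:  # covers both a missing key (via .get) and an explicit None
        return True
    if isinstance(value, (list, dict, tuple, set)):  # container, not a scalar
        return True
    return str(value) not in label_value_map  # scalar the recipe does not map


def check_label_column(
    probe: Sequence[Mapping[str, Any]],
    label_column: str,
    label_value_map: Mapping[str, Any],
    max_bad_fraction: float = 0.0,  # hypothetical threshold, not from the PR
) -> None:
    """Raise LabelColumnTypeMismatch if too many probed rows have unusable labels."""
    if not probe:
        return
    bad = [
        row.get(label_column)
        for row in probe
        if is_bad_label(row.get(label_column), label_value_map)
    ]
    bad_fraction = len(bad) / len(probe)
    if bad_fraction > max_bad_fraction:
        sample = bad[0]
        raise LabelColumnTypeMismatch(
            label_column, type(sample).__name__, sample, bad_fraction
        )


if __name__ == "__main__":
    # HarmBench-style row: `positive` holds completion examples, not a label.
    rows = [{"prompt": "example prompt", "positive": ["completion a", "completion b"]}]
    try:
        check_label_column(rows, "positive", {"0": 0, "1": 1})
    except LabelColumnTypeMismatch as exc:
        print(f"skipped component: {exc}")
```

Raising a dedicated exception subclass, rather than returning a flag, is what would let a classifier such as `_classify_component_failure` map the failure to its own category by type instead of by message text.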