What we want
When two tabular snapshots contain the same rows in a different order, and no manual row ID is supplied, binoc's default (positional) row matching should not report them as wholesale changes.
Some ideas:
- Auto-detect a row key: cheaply check whether any column is unique on both sides and has high value overlap between them, which would make it a likely match ID
- Order-independent fallback: when no key is found, a transformer could lexically sort the rows on both sides before positional matching
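
A minimal sketch of both ideas, in Python for illustration only. The function names `detect_row_key` and `sort_rows`, the `min_overlap` threshold of 0.9, and the list-of-rows representation are all assumptions made for this example, not part of binoc.

```python
from typing import Optional, Sequence

Row = Sequence[str]


def detect_row_key(old: list[Row], new: list[Row], header: list[str],
                   min_overlap: float = 0.9) -> Optional[str]:
    """Return the first column that is unique on both sides and whose
    values mostly overlap, or None if no column qualifies.

    min_overlap is a hypothetical threshold, not a tuned value.
    """
    for i, name in enumerate(header):
        old_vals = [row[i] for row in old]
        new_vals = [row[i] for row in new]
        old_set, new_set = set(old_vals), set(new_vals)
        # Unique on both sides: no value repeats within either snapshot.
        if len(old_set) != len(old_vals) or len(new_set) != len(new_vals):
            continue
        smaller = min(len(old_set), len(new_set))
        if smaller == 0:
            continue
        # High overlap: most values of the smaller side appear on the other.
        if len(old_set & new_set) / smaller >= min_overlap:
            return name
    return None


def sort_rows(rows: list[Row]) -> list[Row]:
    """Order-independent fallback: lexically sort whole rows."""
    return sorted(rows, key=tuple)
```

Note that the sort fallback only realigns rows cleanly when the edits do not change where a row sorts, so it is weaker than a detected key. Returning the first qualifying column is also a simplification; a real implementation would probably need to rank candidates.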
Examples
- CDC BRFSS (a frozen historical series that should never change): positional
matching reported 37,140,180 cells changed (~77% of all cells), almost entirely
a re-sort artifact. Keyed matching: 195,143 rows / 196,835 cells, which are the
real edits (e.g. Break_Out_Category: 'Gender' -> 'Sex'). A 99.5% reduction.
- USDA FoodData Central food_nutrient.csv: positional 133,964,297 cells
changed → keyed 14 cells (the table is re-sorted and appended, not
rewritten). Whole-bundle total: 165,194,392 → 856 cells.
Without auto-detection, every such dataset needs hand-tuned config, and a user
running plain binoc diff gets a confident multi-million-cell changelog that is
mostly fiction.
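
The gap between the two counts can be reproduced on a toy table. The following is an illustrative Python sketch, not binoc's implementation: the helper names and the three-row data are made up, and rows added or removed between snapshots are ignored for simplicity.

```python
def positional_changed_cells(old, new):
    """Count differing cells when row i is compared with row i."""
    return sum(a != b
               for old_row, new_row in zip(old, new)
               for a, b in zip(old_row, new_row))


def keyed_changed_cells(old, new, key_index):
    """Count differing cells when rows are paired by the key column."""
    new_by_key = {row[key_index]: row for row in new}
    changed = 0
    for old_row in old:
        new_row = new_by_key.get(old_row[key_index])
        if new_row is not None:  # unmatched rows are skipped in this sketch
            changed += sum(a != b for a, b in zip(old_row, new_row))
    return changed


# Columns: id, Break_Out_Category, value (toy data).
old = [["1", "Gender", "10"], ["2", "Gender", "20"], ["3", "Age", "30"]]
# Same rows, re-sorted, with two real edits ('Gender' -> 'Sex').
new = [["3", "Age", "30"], ["1", "Sex", "10"], ["2", "Sex", "20"]]

print(positional_changed_cells(old, new))  # 9: every cell looks changed
print(keyed_changed_cells(old, new, 0))    # 2: only the real edits
```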
Implementation notes
- Consider gating behind a heuristic rather than always-on (see the companion
"high-churn guardrail" issue). The order in which the two should run isn't clear yet.