Match reordered rows by default (auto-detect a key, or order-independent fallback) #92

@jcushman

What we want

When two tabular snapshots contain the same rows in a different order and no manual row ID is supplied, binoc should not report the reordering as wholesale changes, which is what its default (positional) row matching does today.

Some ideas:

  • Auto-detect a row key: perhaps we could cheaply detect a column whose values are unique on both sides and overlap heavily between the two, which would be a likely match ID
  • Order-independent fallback: perhaps a transformer could just lexically sort the rows on both sides before positional matching
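
To make the two ideas concrete, here is a rough sketch in Python. It is not binoc code: the function names (`detect_row_key`, `sort_rows`), the `min_overlap` threshold of 0.9, and the choice to measure overlap against the smaller side (so appended rows don't disqualify a key) are all assumptions for illustration.

```python
from typing import Optional

Row = dict[str, str]


def detect_row_key(
    left: list[Row], right: list[Row], min_overlap: float = 0.9
) -> Optional[str]:
    """Return a column that is unique on both sides with high overlap, else None.

    Hypothetical heuristic: min_overlap is an assumed threshold, not a binoc setting.
    """
    if not left or not right:
        return None
    # Only columns present on both sides can serve as a key.
    shared_columns = [c for c in left[0] if c in right[0]]
    best, best_overlap = None, 0.0
    for col in shared_columns:
        left_values = [row[col] for row in left]
        right_values = [row[col] for row in right]
        left_set, right_set = set(left_values), set(right_values)
        # Skip columns with duplicate values on either side.
        if len(left_set) != len(left_values) or len(right_set) != len(right_values):
            continue
        # Measure overlap against the smaller side so appended rows are tolerated.
        overlap = len(left_set & right_set) / min(len(left_set), len(right_set))
        if overlap >= min_overlap and overlap > best_overlap:
            best, best_overlap = col, overlap
    return best


def sort_rows(rows: list[Row], columns: list[str]) -> list[Row]:
    """Fallback: lexically sort rows by all their cell values, in column order."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))
```

The sort fallback only removes pure reordering noise. Once a row is edited or inserted, the rows after it can still misalign under positional matching, so a detected key is the stronger option when one exists.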

Examples

  • CDC BRFSS (a frozen historical series that should never change): positional
    reported 37,140,180 cells changed — ~77% of all cells, almost entirely a
    re-sort artifact. Keyed: 195,143 rows / 196,835 cells, which are the real
    edits (e.g. Break_Out_Category: 'Gender' -> 'Sex'). A 99.5% reduction.
  • USDA FoodData Central food_nutrient.csv: positional 133,964,297 cells
    changed → keyed 14 cells (the table is re-sorted and appended, not
    rewritten). Whole-bundle total: 165,194,392 → 856 cells.
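
A toy illustration of why the positional numbers blow up. The three-row table below is made up (only the 'Gender' -> 'Sex' edit mirrors the BRFSS example), the `id` and `value` columns are hypothetical, and counting every cell of an unmatched row as changed is an assumption, not necessarily how binoc counts.

```python
Row = dict[str, str]


def changed_cells_positional(left: list[Row], right: list[Row], columns: list[str]) -> int:
    """Compare row i on the left with row i on the right, cell by cell."""
    changed = sum(
        1
        for left_row, right_row in zip(left, right)
        for col in columns
        if left_row[col] != right_row[col]
    )
    # Assumption: rows present on only one side count every cell as changed.
    return changed + abs(len(left) - len(right)) * len(columns)


def changed_cells_keyed(left: list[Row], right: list[Row], columns: list[str], key: str) -> int:
    """Pair rows by the value of the key column, then compare cell by cell."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    changed = 0
    for k in left_by_key.keys() & right_by_key.keys():
        changed += sum(
            1 for col in columns if left_by_key[k][col] != right_by_key[k][col]
        )
    # Assumption: added or removed rows count every cell as changed.
    unmatched = len(left_by_key.keys() ^ right_by_key.keys())
    return changed + unmatched * len(columns)


columns = ["id", "Break_Out_Category", "value"]
old = [
    {"id": "1", "Break_Out_Category": "Gender", "value": "10"},
    {"id": "2", "Break_Out_Category": "Gender", "value": "20"},
    {"id": "3", "Break_Out_Category": "Gender", "value": "30"},
]
# Same rows in reverse order, with one real edit on id 2.
new = [
    {"id": "3", "Break_Out_Category": "Gender", "value": "30"},
    {"id": "2", "Break_Out_Category": "Sex", "value": "20"},
    {"id": "1", "Break_Out_Category": "Gender", "value": "10"},
]

print(changed_cells_positional(old, new, columns))  # 5
print(changed_cells_keyed(old, new, columns, key="id"))  # 1
```

Positional matching reports five changed cells where only one cell was actually edited; the other four come purely from the re-sort.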

Without auto-detection, every such dataset needs hand-tuned config, and a user
running plain binoc diff gets a confident multi-million-cell changelog that is
mostly fiction.

Implementation notes

  • Consider gating behind a heuristic rather than always-on (see the companion
    "high-churn guardrail" issue). Order for these isn't clear yet.
