Add embeddings support, modify Jaccard, enrich one-to-one filtering #97
Merged
…eric tversky index, add more one-to-one filtering methods
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #97 +/- ##
==========================================
- Coverage 95.30% 95.13% -0.18%
==========================================
Files 53 53
Lines 2621 2794 +173
Branches 399 440 +41
==========================================
+ Hits 2498 2658 +160
- Misses 75 83 +8
- Partials 48 53 +5
Pull Request Summary
Overview
This branch adds three independent capabilities to the instance-matching path, the post-retrieval selection API, and the metric evaluation API. They can be reviewed (or, in the worst case, partially reverted) in isolation.
- … `JaccardDistanceMatcher`.
- … `MatcherResults` with Hungarian as the new default, plumbed through the metrics API as a per-call algorithm choice.

1. Embedding-based Jaccard / Tversky distance [Closes #65]
Files: `valentine/algorithms/jaccard_distance/__init__.py`, `valentine/algorithms/jaccard_distance/jaccard_distance.py`, `pyproject.toml`, `experiments/bench.py`

What. New `StringDistanceFunction.Embedding` value: cosine similarity on sentence-transformer embeddings replaces the per-string distance for deciding which values "match" between two columns. The rest of the matcher (Tversky reduction, threshold, Match output) is unchanged.

Design choices, each exposed as a constructor knob:
- `embedding_model: str = "all-MiniLM-L6-v2"`: small (23 MB / 384-dim), CPU-friendly default.
- `embedding_device: str | None = None`: auto-picks `cuda` → `mps` → `cpu`. Pass `"cpu"` / `"cuda"` / `"cuda:1"` / `"mps"` to force.
- `embedding_batch_size: int | None = None`: when unset, sentence-transformers' library default (32) is used; pass an explicit value for capable hardware.

Performance shape.
- `sentence_transformers` is lazy-imported inside `_load_sentence_transformer`, which is `lru_cache`-d on `(model, device)`. Module-level import works without the optional extra installed; calling the embedding branch without it raises a clean `ImportError` pointing at `pip install 'valentine[embeddings]'`.
- … `get_matches` / `get_matches_batch` call: every unique string across every column of every table is encoded once via a single batched `model.encode(...)`, then per-column embeddings are sliced out of the shared array. This avoids re-encoding shared columns when matching multiple table pairs. `get_matches_batch` is overridden so multi-table calls share one encoding pass.
- … (`(sims >= threshold).any(axis=...)`) carries over unchanged.

Dependency. New `embeddings` extra in `pyproject.toml` (`sentence-transformers>=2.0,<6.0`). Not added to base deps.

What we tried and dropped: fp16 (`embedding_dtype`). On both CPU and MPS the bench was 8.3× slower than fp32 because the per-pair similarity step lives in NumPy, which has no hardware-accelerated fp16 matmul; the fp16 win on encode is dwarfed by the matmul loss. Properly using fp16 would require keeping embeddings as torch tensors throughout, a bigger architectural change deferred as future work. The flag and code path were removed cleanly.

Bench impact (NYU full, default model, batch=128, threshold=0.7, MPS auto-detected):
Real F1 / MRR uplift over Levenshtein at ~4× the wall time. Conditionally added to the bench (`JaccardDistanceMatcher_emb`) only when `sentence-transformers` is importable.

2. Tversky as the unified set-similarity reduction
Files: `valentine/algorithms/jaccard_distance/__init__.py`, `valentine/algorithms/jaccard_distance/jaccard_distance.py`

What. The directional-match-counts → pair-similarity reduction is now an asymmetric Tversky index, max-symmetrised across the two directions:
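For orientation, here is a minimal sketch of the textbook Tversky index with max-symmetrisation. It is an illustration under stated assumptions, not the matcher's code: it works on exact set membership, whereas the matcher derives its directional counts from thresholded value-level scores, and the function names `tversky` and `symmetric_tversky` are hypothetical.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Textbook Tversky index: |A∩B| / (|A∩B| + alpha*|A\\B| + beta*|B\\A|)."""
    a, b = set(a), set(b)
    inter = len(a & b)
    denom = inter + alpha * len(a - b) + beta * len(b - a)
    # Two empty sets (or a zero denominator) are treated as similarity 0.0 here;
    # that convention is an assumption of this sketch.
    return inter / denom if denom else 0.0


def symmetric_tversky(a, b, alpha=1.0, beta=1.0):
    """Max-symmetrise: score both directions and keep the larger one."""
    return max(tversky(a, b, alpha, beta), tversky(b, a, alpha, beta))


A = {1, 2, 3, 4}
B = {3, 4, 5}

jaccard = symmetric_tversky(A, B, 1.0, 1.0)      # 2 / 5
dice = symmetric_tversky(A, B, 0.5, 0.5)         # 4 / 7
containment = symmetric_tversky(A, B, 1.0, 0.0)  # max(2/4, 2/3) = 2/3
print(jaccard, dice, containment)
```

On these two toy sets the intersection has 2 elements, so the three parameter settings give 2/5, 4/7 and 2/3 respectively. The max over both directions only changes the result when `alpha` and `beta` differ.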
Constructor now exposes `tversky_alpha: float = 1.0` and `tversky_beta: float = 1.0`. Defaults give exactly Jaccard and reduce to the previous code's behaviour. Other operating points:
- `α = β = 1.0` → Jaccard (default).
- `α = β = 0.5` → Sørensen-Dice.
- `α = 1.0, β = 0.0` (or vice versa) → set containment, `max(|∩|/|A|, |∩|/|B|)`.

Why. The previous code had a binary `set_similarity` flag (`Jaccard` / `Containment`) that we briefly experimented with and removed. Tversky generalises both options under a single principled API and gives the user fine-grained control (e.g. asymmetric Tversky for subset/superset workloads) without adding new enums.

Implementation note. Both directional match counts (`a_match`, `b_match`) are read off the same value-level score matrix in one pass, so there is no extra compute compared to the prior single-direction reduction. The rapidfuzz path keeps its `score_cutoff=threshold` optimisation; the embedding path uses the same matmul.

What we tried and dropped: A `MatchWeighting` enum with `Margin` weighting (each value's contribution = `top1 − top2` margin instead of binary). It regressed F1 / R@|GT| / MRR across the bench because it double-penalised real matches with near-tied alternatives. Removed cleanly; only the `Binary` path remains, inlined, with no flag.

Bench impact (NYU full):
- Default (α=β=1.0) is bit-identical to prior Jaccard behaviour.
- Symmetric Tversky variants (α=β between 0.5 and 1.0) are rank-equivalent under one-to-one selection, so they produce identical F1 / R@|GT| / MRR.
- Containment (`α=1, β=0`) regresses F1 by ~12pp on NYU because asymmetric scoring inflates similarity for size-asymmetric pairs. It is included for users with subset/superset workloads (dataset discovery, etc.) where it's the right tool.

3. One-to-one selection: three named methods, pluggable per call
Files: `valentine/algorithms/matcher_results.py`, `valentine/metrics/base_metric.py`, `valentine/metrics/metric_helpers.py`, `valentine/metrics/metrics.py`, `tests/*`, `examples/*`, `docs/*`

What. Three explicitly-named selectors on `MatcherResults`:
- `one_to_one_hungarian(threshold=None)`: globally optimal 1:1 assignment via `scipy.optimize.linear_sum_assignment`. Each source and target appears in at most one returned pair; the assignment maximises total similarity over all valid 1:1 assignments. Threshold semantics match the prior `one_to_one` (median by default). Result is cached on `_cached_hungarian`. This is the new default 1:1 selector for the metrics API.
- `one_to_one_greedy(threshold=None)`: the prior greedy implementation, kept under an explicit name for tests that pin specific outputs and for users who need the legacy behaviour. Not cached.
- `one_to_one_mutual_top(n=1)`: keeps pair `(s, t)` only if `t` is among `s`'s top-`n` targets AND `s` is among `t`'s top-`n` sources. With `n=1` this is the classic mutual nearest-neighbour filter.

The previous `one_to_one()` method is removed (renamed). The previous `_cached_one_to_one` attribute is renamed to `_cached_hungarian`.

Pluggable algorithm choice in the metrics API.
The choice of 1:1 algorithm is now a per-call argument rather than hardcoded inside `apply()`. Specifically:
- A new `OneToOneMethod` literal (`"greedy" | "hungarian" | "mutual_top"`) lives in `valentine/metrics/base_metric.py`.
- Every concrete metric's `apply(matches, ground_truth, one_to_one_method="hungarian")` now takes the algorithm as a parameter and dispatches via the helper `_apply_one_to_one` (`valentine/metrics/metric_helpers.py`); unknown values raise `ValueError`.
- `MatcherResults.get_metrics(...)` accepts the same `one_to_one_method` parameter and threads it into every metric in the set, so users can pick the algorithm in one place.

The default is `"hungarian"` end-to-end, so existing callers see no behavioural change. Metrics with `one_to_one=False` (e.g. `MeanReciprocalRank`, `RecallAtSizeofGroundTruth`) ignore the argument.

Why. Greedy bipartite matching can lock in a locally-best pair that blocks a globally-better assignment; Hungarian cannot, and `scipy` is already a transitive dependency, so Hungarian adds no new dependency. The mutual-top-`n` filter fills a third operating point on the precision/recall curve. Surfacing the algorithm as a per-call argument lets users explore precision/recall tradeoffs without instantiating per-config metric variants, which is useful both for downstream code and for the bench harness.

Bench impact (per-selector F1 on three matchers, NYU full):
Hungarian is consistently +0.010 F1 over greedy with the same output shape. Mutual top-1 is the F1 winner on instance-level matchers (+0.04–0.06 over Hungarian) at the cost of recall. R@|GT| and MRR are unchanged across selectors (they are retrieval-quality metrics on the full ranked output, so selector-independent by design).
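The difference between the three selectors can be seen on a toy 2×2 score table. The sketch below is illustrative only and uses made-up scores: the optimal assignment is found by brute force over permutations (a stand-in for `scipy.optimize.linear_sum_assignment`, chosen so the example needs only the standard library), thresholds are ignored, and the helper names `greedy`, `optimal` and `mutual_top1` are hypothetical rather than the `MatcherResults` methods.

```python
from itertools import permutations

# Toy similarity scores between two sources and two targets (made-up numbers).
SCORES = {
    ("s0", "t0"): 0.90, ("s0", "t1"): 0.80,
    ("s1", "t0"): 0.85, ("s1", "t1"): 0.10,
}
SOURCES = ["s0", "s1"]
TARGETS = ["t0", "t1"]


def greedy(scores):
    """Take pairs in descending score order, skipping any used source or target."""
    used_sources, used_targets, picked = set(), set(), {}
    for (s, t), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s not in used_sources and t not in used_targets:
            picked[(s, t)] = score
            used_sources.add(s)
            used_targets.add(t)
    return picked


def optimal(scores, sources, targets):
    """Brute-force the 1:1 assignment with the highest total score.

    Assumes len(sources) <= len(targets); only viable for tiny inputs.
    """
    best, best_total = {}, float("-inf")
    for perm in permutations(targets, len(sources)):
        pairs = {(s, t): scores[(s, t)] for s, t in zip(sources, perm)}
        total = sum(pairs.values())
        if total > best_total:
            best, best_total = pairs, total
    return best


def mutual_top1(scores, sources, targets):
    """Keep (s, t) only if t is s's best target and s is t's best source."""
    best_target = {s: max(targets, key=lambda t: scores[(s, t)]) for s in sources}
    best_source = {t: max(sources, key=lambda s: scores[(s, t)]) for t in targets}
    return {(s, t): scores[(s, t)] for s, t in best_target.items() if best_source[t] == s}


greedy_pairs = greedy(SCORES)
optimal_pairs = optimal(SCORES, SOURCES, TARGETS)
mutual_pairs = mutual_top1(SCORES, SOURCES, TARGETS)
print(sorted(greedy_pairs), sorted(optimal_pairs), sorted(mutual_pairs))
```

Greedy commits to `(s0, t0)` first and is left with the weak `(s1, t1)` pair, while the optimal assignment pairs `s0` with `t1` and `s1` with `t0` for a higher total. Mutual top-1 returns only `(s0, t0)`, trading coverage for confidence.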
API breaks.
- `MatcherResults.one_to_one()` is gone. Every caller in this repo has been migrated.
- `Metric.apply()` and `MatcherResults.get_metrics()` gain a new keyword-only argument with a default. Existing positional callers are unaffected. Custom `Metric` subclasses that override `apply` will need to accept the new `one_to_one_method` keyword (or `**kwargs`).

Migration footprint inside the repo:
- `valentine/metrics/metrics.py`: six `apply()` methods now take and use `one_to_one_method`; hardcoded `matches.one_to_one_hungarian()` calls are gone.
- `valentine/metrics/metric_helpers.py`: new `_apply_one_to_one` dispatcher.
- `tests/test_matcher_results.py`, `tests/test_coverage_gaps.py`, `tests/test_distribution_based_benchmark.py`, `tests/test_docs_smoke.py`: renamed test methods, updated calls, updated cache-attribute references.
- `examples/valentine_example_pandas.py`, `valentine_example_polars.py`, `valentine_example_mixed.py`: updated to the new default.
- `README.md`, `docs/api.md`, `docs/example.md`, `docs/faq.md`, `docs/results.md`, `docs/metrics.md`, `docs/changelog.md`: prose, code samples, link anchors all updated. `docs/api.md` has full sections for all three new selectors.

Downstream users will need to migrate to `one_to_one_hungarian()` (recommended, as the better default) or `one_to_one_greedy()` (preserves previous behaviour). The rename is justified by the better default and by the project being on `1.0.0.dev0`.

Tests
All 248 existing tests pass; 24 polars tests skip (extra not installed in CI); 6 doctests pass. The Tversky default (α=β=1.0) is bit-identical to the previous Jaccard implementation, and the metric API default (`"hungarian"`) is bit-identical to the prior hardcoded path, which is why the regression suite stayed green throughout. The embedding branch is exercised by the bench harness (when `sentence-transformers` is available); we did not add unit tests for it, which is a reasonable follow-up.

Suggested review focus
- `one_to_one()` rename and the `Metric.apply` signature change are both public-API breaks. Justified by the better default and by the major version on the horizon, but worth a deliberate sign-off.
- `one_to_one_method` argument: purely additive, default preserves behaviour.
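To make the shape of the `one_to_one_method` argument concrete for reviewers, here is a rough sketch of a literal-typed, per-call dispatcher with a `"hungarian"` default and a `ValueError` on unknown values. It is not the code in `valentine/metrics/metric_helpers.py`: `FakeResults` is a hypothetical stand-in for `MatcherResults` whose selectors just return labels, and `apply_one_to_one` is an illustrative name.

```python
from typing import Literal

OneToOneMethod = Literal["greedy", "hungarian", "mutual_top"]


class FakeResults:
    """Hypothetical stand-in for MatcherResults; each selector reports its own name."""

    def one_to_one_greedy(self):
        return "greedy"

    def one_to_one_hungarian(self):
        return "hungarian"

    def one_to_one_mutual_top(self, n=1):
        return f"mutual_top(n={n})"


def apply_one_to_one(matches, *, method: OneToOneMethod = "hungarian"):
    """Pick a selector by name; the argument is keyword-only and has a default."""
    selectors = {
        "greedy": matches.one_to_one_greedy,
        "hungarian": matches.one_to_one_hungarian,
        "mutual_top": matches.one_to_one_mutual_top,
    }
    if method not in selectors:
        raise ValueError(
            f"Unknown one_to_one_method {method!r}; expected one of {sorted(selectors)}"
        )
    return selectors[method]()


results = FakeResults()
default_choice = apply_one_to_one(results)
greedy_choice = apply_one_to_one(results, method="greedy")
print(default_choice, greedy_choice)
```

Because the argument is keyword-only with a default, a call that omits it keeps working, while an override that does not accept the keyword would fail as soon as a caller passes it. That is the reason custom subclasses need to accept the new keyword or `**kwargs`.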