Associo


From Italian "associo" — I associate. A library for discovering associations and connections in data.

High-performance association analysis library built on Polars. Computes 60+ association metrics, performs graph clustering and community detection, and provides t-SNE embedding for visualization.

```
pip install git+https://github.com/avpv/associo.git
```

Requirements: Python >= 3.10  |  Dependencies: polars>=0.20.0, networkx>=3.0, scikit-learn>=1.3.0


How It Works

Associo Pipeline

All inputs and outputs are Polars DataFrames — you can compose steps freely in any order.


Table of Contents

  1. The Contingency Table — Foundation of All Metrics
  2. Core Metrics Explained Visually
  3. All 60+ Metrics — Complete Reference
  4. Quick Start
  5. API Reference
  6. Graph Algorithms
  7. Embedding & Visualization
  8. Algorithm Parameters

1. The Contingency Table

Every metric in Associo is derived from a 2×2 contingency table. This table counts how often two items (X and Y) appear together and apart across all transactions.

2×2 Contingency Table

How to read it:

| Cell | Meaning | Example |
| --- | --- | --- |
| XY = 30 | Both X and Y present | 30 orders have both bread and butter |
| X¬Y = 20 | X present, Y absent | 20 orders have bread but not butter |
| ¬XY = 10 | X absent, Y present | 10 orders have butter but not bread |
| ¬X¬Y = 40 | Neither present | 40 orders have neither |
| X = 50 | Total transactions with X | bread appears in 50 orders |
| Y = 40 | Total transactions with Y | butter appears in 40 orders |
| n = 100 | Grand total | 100 orders total |

Throughout this guide, we use these example values: XY=30, X=50, Y=40, n=100.
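The derived cells follow from the marginals by simple subtraction; a quick sanity check in plain Python (no library needed):

```python
# Running example: XY = 30, X = 50, Y = 40, n = 100.
xy, x, y, n = 30, 50, 40, 100

x_not_y     = x - xy            # X present, Y absent  -> 20
not_x_y     = y - xy            # X absent, Y present  -> 10
not_x_not_y = n - x - y + xy    # neither present      -> 40

# The four cells partition all transactions.
assert xy + x_not_y + not_x_y + not_x_not_y == n
```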


2. Core Metrics Explained Visually

Support = P(X ∧ Y)

"How often do X and Y appear together?"

Support — Venn Diagram

Support is the most basic metric — the probability that both X and Y occur in the same transaction. It measures absolute frequency of co-occurrence.

  • Range: 0 to 1
  • High support → the pair appears frequently
  • Low support → rare pair (but may still be interesting!)

Confidence = P(Y|X)

"When X is present, how often is Y also present?"

Confidence — Venn Diagram

Confidence is a conditional probability. It restricts the universe to only transactions containing X (the dashed circle), then asks: what fraction of those also contain Y?

  • Range: 0 to 1
  • confidence = 0.60 → 60% of transactions with X also contain Y
  • Asymmetric: conf(X→Y) ≠ conf(Y→X) in general

Pitfall: High confidence alone doesn't mean X and Y are associated. If Y appears in 95% of all transactions, then conf(X→Y) ≈ 0.95 for any X. That's why we need lift.
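A tiny numeric illustration of this pitfall, using hypothetical counts (not from the library):

```python
# Y appears in 95% of 1000 transactions; X is unrelated to Y.
xy, x, y, n = 95, 100, 950, 1000

confidence = xy / x             # P(Y|X) = 0.95 -- looks like a strong rule
prevalence = y / n              # P(Y)   = 0.95
lift = confidence / prevalence  # 1.0 -- no association at all
```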


Lift = P(Y|X) / P(Y)

"How much more likely is Y when X is present, compared to random chance?"

Lift — Bar Chart

Lift compares the observed co-occurrence rate to the expected rate under independence. It corrects for popularity bias.

| Lift | Interpretation |
| --- | --- |
| = 1.0 | X and Y are independent (no association) |
| > 1.0 | Positive association — X makes Y more likely |
| < 1.0 | Negative association — X makes Y less likely |
| = 1.5 | Y is 50% more likely when X is present |
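With the running example (XY=30, X=50, Y=40, n=100), the arithmetic works out like this:

```python
xy, x, y, n = 30, 50, 40, 100

confidence = xy / x             # P(Y|X) = 0.6
prevalence = y / n              # P(Y)   = 0.4
lift = confidence / prevalence  # ~1.5: Y is 50% more likely when X is present
```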

Jaccard Similarity

"What fraction of transactions containing X or Y contain both?"

Jaccard — Venn Diagram

Jaccard focuses only on the union of X and Y — it completely ignores transactions where neither appears (¬X¬Y). This makes it useful when "absence" is not meaningful (e.g., in large catalogs).

  • Range: 0 to 1
  • = 0 → no overlap
  • = 1 → identical sets
  • Symmetric: jaccard(X,Y) = jaccard(Y,X)

Related similarity metrics: cosine, kulczynski, sokal_sneath, sokal_michener, rogers_tanimoto, hamming, lerman_similarity
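For the running example, Jaccard is straightforward to verify by hand:

```python
xy, x, y = 30, 50, 40

# Union = X + Y - XY = 60 transactions; intersection = 30.
jaccard = xy / (x + y - xy)  # 30 / 60 = 0.5: half of the union is the overlap
```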


Leverage & Conviction

Leverage and Conviction

Leverage = P(X ∧ Y) − P(X)·P(Y) — the difference between observed and expected support. Ranges from −0.25 to +0.25. Zero means independence.

Conviction = (1 − P(Y)) / (1 − confidence) — measures the ratio of the expected error rate to the observed error rate. Values above 1 indicate a positive association; ∞ means the rule never fails.
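Both definitions above can be checked against the running example with plain arithmetic:

```python
xy, x, y, n = 30, 50, 40, 100

support    = xy / n   # 0.3
coverage   = x / n    # 0.5  (P(X))
prevalence = y / n    # 0.4  (P(Y))
confidence = xy / x   # 0.6  (P(Y|X))

leverage   = support - coverage * prevalence      # ~0.1 above independence
conviction = (1 - prevalence) / (1 - confidence)  # ~1.5: fails 1.5x less often than chance
```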


Statistical Tests — χ², Phi, Odds Ratio

"Is the association statistically significant? How strong is it?"

Statistical Tests — Chi-Squared, Phi, Odds Ratio

These metrics answer the question: could this association have appeared by chance?

| Metric | What it tells you | Range | Key value |
| --- | --- | --- | --- |
| χ² (chi-squared) | Strength of departure from independence | 0 to ∞ | Higher = stronger association |
| Phi coefficient (φ) | Normalized χ² — a correlation for binary data | −1 to +1 | 0 = independent |
| Odds ratio | How much the odds of Y change with X | 0 to ∞ | 1 = independent |
| Yule's Q | Normalized odds ratio | −1 to +1 | 0 = independent |
| Cohen's kappa | Agreement beyond chance | −1 to +1 | 0 = chance-level agreement |
| p-value approx. | Rough significance estimate from χ² | 0 to 1 | <0.05 conventionally "significant" |

When to use: Chi-squared and phi are best for ranking pairs by association strength. Odds ratio is standard in epidemiology and A/B testing. Kappa is useful when comparing binary classifiers or annotators.
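As a sanity check, χ² and φ for the running example can be computed by hand; this is a plain-Python sketch of the textbook formulas, not the library's implementation:

```python
xy, x, y, n = 30, 50, 40, 100

observed = {
    ("X", "Y"):   xy,            # 30
    ("X", "nY"):  x - xy,        # 20
    ("nX", "Y"):  y - xy,        # 10
    ("nX", "nY"): n - x - y + xy # 40
}
row = {"X": x, "nX": n - x}
col = {"Y": y, "nY": n - y}

# Pearson chi-squared: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (obs - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
    for (r, c), obs in observed.items()
)                        # ~16.67 for this table

phi = (chi2 / n) ** 0.5  # ~0.41 (take the sign from leverage; positive here)
```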


Information-Theoretic Metrics

"How much do X and Y tell us about each other?"

Information-Theoretic Metrics

Information-theoretic metrics measure association in bits of information — how much knowing X reduces uncertainty about Y.

| Metric | What it measures | Key property |
| --- | --- | --- |
| Mutual information | Shared information between X and Y | Symmetric, always ≥ 0, 0 = independent |
| J-measure | Rule quality via cross-entropy | Accounts for both confirmation and disconfirmation |
| Weight of evidence | Log-odds that Y is evidence for X | Positive = supportive, negative = contradicts |
| Importance | Log-odds of confidence | 0 means 50% confidence, unbounded |

When to use: Mutual information is ideal for feature selection and comparing pairs on a common scale. J-measure is considered one of the best single measures for rule quality. Weight of evidence is widely used in credit scoring and fraud detection.
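Plugging the running example into the mutual-information formula from the reference table gives a concrete feel for the scale:

```python
import math

xy, x, y, n = 30, 50, 40, 100
support, coverage, prevalence = xy / n, x / n, y / n

# support * log2(support / (coverage * prevalence))
mi = support * math.log2(support / (coverage * prevalence))
# 0.3 * log2(1.5) ~ 0.175 bits shared between X and Y
```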


Interestingness & Rule Quality

"Is this rule actually useful — not just frequent?"

Interestingness and Rule Quality Metrics

High support and confidence aren't enough — a rule can be frequent but trivial. These metrics separate genuinely interesting rules from noise.

| Metric | Formula intuition | Best for |
| --- | --- | --- |
| Added value | confidence − prevalence | How much X boosts P(Y) above baseline |
| Certainty factor | Added value normalized to [−1, +1] | Comparing rules with different base rates |
| Zhang's metric | Directional, bounded [−1, +1] | Robust ranking — insensitive to marginal changes |
| Improvement | conf(X→Y) − conf(¬X→Y) | How much better X is than "not X" at predicting Y |
| Klösgen | √support × added_value | Balancing frequency and strength |
| Rule power factor | support × confidence | Simple composite of frequency and reliability |
| Sebag-Schoenauer | conf / (1 − conf) | Ratio of positive to negative evidence |
| Gini index | Reduction in uncertainty about Y | Variable importance / feature selection |

When to use: Zhang's metric is recommended as a general-purpose ranking metric — it's bounded, symmetric around zero, and robust. Added value and certainty factor are intuitive for business reporting. Klösgen is good when you want to penalize rare rules.
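Improvement and Zhang's metric for the running example, worked through by hand:

```python
xy, x, y, n = 30, 50, 40, 100

confidence = xy / x              # P(Y|X)  = 0.6
conf_not_x = (y - xy) / (n - x)  # P(Y|~X) = 10/50 = 0.2

improvement = confidence - conf_not_x                            # ~0.4
zhang = (confidence - conf_not_x) / max(confidence, conf_not_x)  # ~0.67, strong positive
```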


3. All 60+ Metrics — Taxonomy

Metrics Taxonomy

Full list of all metrics with formulas

Contingency Table Cells

| Metric | Formula | Description |
| --- | --- | --- |
| lhs_not_rhs_count | X − XY | Transactions with X but not Y |
| not_lhs_rhs_count | Y − XY | Transactions with Y but not X |
| not_lhs_not_rhs_count | n − X − Y + XY | Transactions with neither X nor Y |

Probabilistic

| Metric | Formula | Description |
| --- | --- | --- |
| support | XY / n | Joint probability P(X ∧ Y) — how often X and Y co-occur |
| coverage | X / n | Marginal probability P(X) — how often X appears |
| prevalence | Y / n | Marginal probability P(Y) — how often Y appears |
| confidence | XY / X | Conditional probability P(Y\|X) — when X is present, how often is Y? |
| reverse_confidence | XY / Y | Conditional probability P(X\|Y) — when Y is present, how often is X? |
| lift | confidence / prevalence | How much more likely Y is when X is present vs chance. =1 independent, >1 positive, <1 negative |
| leverage | support − coverage × prevalence | Difference between observed and expected support. 0 = independence |
| conviction | (1 − prevalence) / (1 − confidence) | How often the rule would be wrong if X and Y were independent. ∞ = never fails |

Similarity & Distance

| Metric | Formula | Description |
| --- | --- | --- |
| jaccard | XY / (X + Y − XY) | Fraction of the union that is the intersection. Ignores ¬X¬Y |
| cosine | support / √(coverage × prevalence) | Normalized dot product similarity. Like lift but dampened |
| kulczynski | 0.5 × (confidence + reverse_confidence) | Average of both conditional probabilities. Symmetric |
| sokal_sneath | 2·XY / (X + Y) | Double-weighted overlap similarity |
| sokal_michener | (XY + ¬X¬Y) / n | Simple matching coefficient — counts all agreements |
| rogers_tanimoto | (XY + ¬X¬Y) / (n + X + Y − 2·XY) | Adjusted matching — penalizes disagreements more heavily |
| hamming | (X + Y − 2·XY) / n | Fraction of transactions where X and Y disagree (distance metric) |
| lerman_similarity | (P(X∪Y) − P(X)·P(Y)) / √(P(X)·P(Y)) | Normalized deviation of union from expected under independence |

Statistical Tests

| Metric | Formula | Description |
| --- | --- | --- |
| chi_squared | Σ (observed − expected)² / expected | Pearson's χ² test for independence. Higher = stronger association |
| local_chi_squared | (support·n − expected)² / expected | Contribution of the (X,Y) cell to χ². Isolates this pair's effect |
| phi_coefficient | leverage / √(coverage·(1−coverage)·prevalence·(1−prevalence)) | Normalized χ² for 2×2 tables. Range −1 to +1, like a correlation |
| odds_ratio | (XY · ¬X¬Y) / (X¬Y · ¬XY) | Odds of Y with X vs without. Uses Haldane +0.5 correction. Range 0 to ∞ |
| yules_q | (OR − 1) / (OR + 1) | Normalized odds ratio. Range −1 to +1 |
| yules_y | (√OR − 1) / (√OR + 1) | Alternative normalized OR, more conservative than Yule's Q |
| kappa | (observed agreement − expected) / (1 − expected) | Cohen's kappa — agreement beyond chance. Range −1 to +1 |
| p_value_approximation | exp(−χ² / 2) | Quick approximate p-value from χ². Useful for ranking, not exact |

Information-Theoretic

| Metric | Formula | Description |
| --- | --- | --- |
| mutual_information | support × log₂(support / (coverage × prevalence)) | Shared information in bits. 0 = independent |
| j_measure | support·log₂(conf/prev) + (coverage−support)·log₂((1−conf)/(1−prev)) | Cross-entropy based rule quality — accounts for confirmation and disconfirmation |
| weight_of_evidence | log(P(X\|Y) / P(X\|¬Y)) with Laplace smoothing | Log-odds that Y is evidence for X. Positive = supportive, negative = contradicts |
| importance | log₁₀(confidence / (1 − confidence)) | Log-odds of confidence. Range −∞ to +∞, 0 means 50% confidence |

Interestingness & Rule Quality

| Metric | Formula | Description |
| --- | --- | --- |
| added_value | confidence − prevalence | How much X increases the probability of Y above the baseline |
| improvement | confidence − P(Y\|¬X) | How much better the rule X→Y is compared to ¬X→Y |
| certainty_factor | (confidence − prevalence) / (1 − prevalence) | Normalized confidence gain. Range −1 to +1, 0 = no gain |
| zhangs_metric | (confidence − P(Y\|¬X)) / max(confidence, P(Y\|¬X)) | Directional association. Range −1 to +1, robust to marginal changes |
| sebag_schoenauer | confidence / (1 − confidence) | Ratio of positive to negative examples. Higher = stronger rule |
| relative_risk | confidence / P(Y\|¬X) | Risk ratio — how much X increases the risk of Y vs absence of X |
| relative_difference | (confidence − prevalence) / prevalence | Relative gain in confidence over the base rate |
| difference_of_confidence | confidence − P(Y\|¬X) | Absolute difference between conditional probabilities |
| klosgen | √support × added_value | Support-weighted added value. Balances frequency and strength |
| rule_power_factor | support × confidence | Combined measure of rule frequency and reliability |
| gini_index | support × (1 − conf² − (1−conf)²) | How much knowing X reduces uncertainty about Y |
| collective_strength | support·(1−support) / ((cov·prev−sup)·(cov+prev−sup)) | Ratio of observed to expected deviations from independence |
| interestingness | confidence × reverse_confidence × (1 − confidence) | High when rule is strong in both directions but not trivial |
| comprehensibility | log(1 + prevalence) / log(1 + support) | Ratio of RHS complexity to rule complexity. Lower = simpler rule |

Directional & Specialized

| Metric | Formula | Description |
| --- | --- | --- |
| casual_confidence | 0.5 × (confidence + P(¬Y\|¬X)) | Average of forward rule and contrapositive |
| casual_support | P(X∪Y) + (1 − support) | Broad co-occurrence including complementary pairs |
| confirmed_confidence | confidence − P(¬Y\|X) | Difference between positive and negative confidence |
| counter_example_rate | (XY + ¬XY) / n | Fraction of transactions where Y appears (with or without X) |
| implication_index | (support − coverage·prevalence) / √(coverage·prevalence) | Standardized deviation from expected support, like a z-score |
| lambda | Goodman-Kruskal's λ | Proportional reduction in prediction error. 0 = no improvement |
| least_contradiction | (P(X∪Y) − (1−prevalence) − support) / prevalence | Penalizes contradictory evidence |
| confidence_boost | confidence / (confidence − improvement) | Ratio of confidence to the ¬X baseline |
| lift_increase | (lift − 1) / prevalence | Lift deviation normalized by prevalence |
| standardized_lift | (lift − 1) / (lift + 1) | Bounded version of lift. Range −1 to +1 |
| varying_rates_liaison | lift − 1 | Simple deviation from independence. 0 = independent |
| support_vrl | support × (lift − 1) | Support-weighted liaison — balances frequency and deviation |
| imbalance_ratio | (coverage − prevalence) / (coverage + prevalence − support) | Asymmetry between X and Y frequency. 0 = balanced |
| hyper_confidence | confidence / P(¬Y\|X) | Ratio of correct to incorrect predictions |
| hyper_lift | confidence / P(Y\|¬X) | How much better X→Y is vs ¬X→Y (same as relative_risk) |
| relative_linkage_disequilibrium | D / (D + min), piecewise | Normalized linkage disequilibrium. Range −1 to +1 |

Fisher Transforms

| Metric | Formula | Description |
| --- | --- | --- |
| fisher_transformation_confidence | 2·arcsin(√confidence) | Variance-stabilizing transform. Useful for statistical comparisons |
| fisher_transformation_reverse_confidence | 2·arcsin(√reverse_confidence) | Variance-stabilizing transform of reverse confidence |

Smoothed & Confidence Intervals

| Metric | Formula | Description |
| --- | --- | --- |
| confidence_laplace | (XY + 2) / (X + 4) | Laplace-smoothed confidence. Handles zero counts gracefully |
| importance_laplace | log₁₀(conf_laplace / (1 − conf_laplace)) | Laplace-smoothed importance. Stable for small samples |
| confidence_lower | confidence − 1.96 × SE | 95% CI lower bound for confidence |
| confidence_upper | confidence + 1.96 × SE | 95% CI upper bound for confidence |
| confidence_lower_laplace | conf_laplace − 1.96 × SE_laplace | 95% CI lower bound (Laplace-smoothed) |
| confidence_upper_laplace | conf_laplace + 1.96 × SE_laplace | 95% CI upper bound (Laplace-smoothed) |
| odds_ratio_lower | exp(ln(OR) − 1.96 × SE_log_OR) | 95% CI lower bound for odds ratio |
| odds_ratio_upper | exp(ln(OR) + 1.96 × SE_log_OR) | 95% CI upper bound for odds ratio |

4. Quick Start

Direct Associations

When your data already has LHS/RHS columns (e.g., category→product, query→click):

```python
import polars as pl
from associo import direct_associations

df = pl.DataFrame({
    "product_a": ["bread", "bread", "milk", "eggs"],
    "product_b": ["butter", "milk", "butter", "bread"],
    "order_id":  [1, 2, 3, 4],
})

result = direct_associations(
    df,
    column_lhs="product_a",
    column_rhs="product_b",
    column_tid="order_id",
)
```

Combinatorial Associations

When each row holds a single item belonging to a transaction, all item pairs are generated automatically:

```python
import polars as pl
from associo import combinatorial_associations

basket = pl.DataFrame({
    "product":  ["bread", "butter", "milk", "bread", "milk", "eggs"],
    "order_id": [1, 1, 1, 2, 2, 2],
})

result = combinatorial_associations(
    basket,
    column_items="product",
    column_tid="order_id",
)
```

From Pre-Aggregated Counts

When you already have the contingency table counts:

```python
import polars as pl
from associo import association_measures

counts = pl.DataFrame({
    "lhs": ["bread"],
    "rhs": ["butter"],
    "lhs_rhs_count": [150],
    "lhs_total_count": [500],
    "rhs_total_count": [300],
    "total_count": [1000],
})

result = association_measures(counts)

# Or compute only specific metrics:
result = association_measures(counts, measures=["support", "confidence", "lift"])
```

5. API Reference

| Function | Description |
| --- | --- |
| direct_associations | Association metrics from pre-paired lhs/rhs data |
| combinatorial_associations | All item-pair combinations within transactions |
| association_measures | 60+ metrics from pre-aggregated counts |
| clusters | Affinity Propagation clustering |
| communities | Louvain community detection |
| label_propagation | Fast label propagation |
| connected_components | Connected components |
| maximal_cliques | All maximal cliques |
| k_clique_communities | K-clique percolation |
| label_propagation_overlapping | SLPA-based overlapping communities |
| embedding | t-SNE 2D projection from distance matrix |

6. Graph Algorithms

Build graphs from association metrics and discover structure. Each algorithm treats items as nodes and similarity scores as weighted edges.

Graph Algorithms Comparison

Choosing the Right Algorithm

| Goal | Algorithm | Overlapping? | Key parameter |
| --- | --- | --- | --- |
| Auto-detect clusters with exemplars | clusters | No | cluster_preference_factor |
| Find natural communities (most popular) | communities | No | community_resolution |
| Fast large-scale grouping | label_propagation | No | min_edge_weight |
| Items in multiple groups | label_propagation_overlapping | Yes | threshold |
| Fully-connected subgroups | maximal_cliques | Yes | min_clique_size |
| Overlapping via clique merging | k_clique_communities | Yes | k |
| Isolated groups (simplest) | connected_components | No | min_edge_weight |

Usage

```python
import polars as pl
from associo import clusters, communities, label_propagation
from associo import maximal_cliques, k_clique_communities, connected_components
from associo import label_propagation_overlapping

similarity = pl.DataFrame({
    "item_a": ["A", "A", "B", "C"],
    "item_b": ["B", "C", "C", "D"],
    "sim":    [0.8, 0.3, 0.7, 0.9],
})

# Affinity Propagation — auto-detects number of clusters
cl = clusters(similarity, column_lhs="item_a", column_rhs="item_b", column_similarity="sim")

# Louvain communities — best general-purpose choice
comm = communities(similarity, column_lhs="item_a", column_rhs="item_b", column_similarity="sim")

# Label propagation — fast for large graphs
lp = label_propagation(similarity, column_lhs="item_a", column_rhs="item_b", column_similarity="sim")

# Maximal cliques — find tightly-knit groups (overlapping)
clq = maximal_cliques(similarity, column_lhs="item_a", column_rhs="item_b", column_similarity="sim")

# Connected components — which items are linked at all?
cc = connected_components(similarity, column_lhs="item_a", column_rhs="item_b", column_similarity="sim")
```

Tip: Start with communities() (Louvain) — it works well on most data. Use clusters() when you want automatic exemplar selection. Switch to overlapping methods (maximal_cliques, k_clique_communities, label_propagation_overlapping) when items naturally belong to multiple groups.


7. Embedding & Visualization

Project items into 2D space based on their pairwise distances using t-SNE (t-distributed Stochastic Neighbor Embedding).

t-SNE Embedding — Distance Matrix to 2D

t-SNE preserves local neighborhood structure: items that are similar (low distance) stay close together in 2D; dissimilar items are pushed apart. This makes it ideal for visualizing clusters of associated items.

Usage

```python
import polars as pl
from associo import embedding

distances = pl.DataFrame({
    "item_a": ["A", "A", "B"],
    "item_b": ["B", "C", "C"],
    "dist":   [0.2, 0.7, 0.3],
})

coords = embedding(distances, column_lhs="item_a", column_rhs="item_b", column_distance="dist")
# Returns DataFrame with columns: item, x, y
```

Tuning Perplexity

| Perplexity | Effect | Use when |
| --- | --- | --- |
| 5–10 | Tight local clusters, may miss global structure | Few items (<50) or very distinct groups |
| 30 (default) | Good balance of local and global | Most cases |
| 40–50 | More global structure, clusters may merge | Many items (>500) or gradual similarity |

Tip: Convert any similarity metric to a distance with 1 − similarity, then pass to embedding(). Combine with communities() or clusters() to color-code the scatter plot by group.


8. Algorithm Parameters

| Algorithm | Parameter | Default | Effect |
| --- | --- | --- | --- |
| Affinity Propagation | cluster_preference_factor | 50 | 0–100; lower = fewer, larger clusters |
| Louvain | community_resolution | 1.0 | Higher = more communities (0.5–2.0) |
| Maximal Cliques | min_clique_size | 3 | Minimum clique size to return |
| Maximal Cliques | min_edge_weight | 0.1 | Edge weight threshold |
| K-Clique | k | 3 | Clique size for percolation |
| K-Clique | min_edge_weight | 0.1 | Edge weight threshold |
| Connected Components | min_edge_weight | 0.0 | Edge weight threshold |
| Label Propagation | min_edge_weight | 0.0 | Edge weight threshold |
| SLPA Overlapping | n_iter | 20 | Propagation iterations |
| SLPA Overlapping | threshold | 0.1 | Label inclusion threshold (0–1) |
| t-SNE | perplexity | 30 | 5–50; local vs global structure |
| t-SNE | n_iter | 1000 | Iterations (1000–2000) |

Project Structure

```
├── src/associo/
│   ├── __init__.py          # Public API
│   ├── measures.py          # 60+ association metrics (Polars expressions)
│   ├── associations.py      # Direct & combinatorial association extraction
│   ├── graph.py             # Clustering, communities, cliques, components
│   ├── embedding.py         # t-SNE dimensionality reduction
│   └── _validation.py       # Input validation
├── tests/
│   ├── test_measures.py
│   ├── test_associations.py
│   ├── test_graph.py
│   ├── test_embedding.py
│   ├── test_formulas.py
│   └── test_validation.py
└── pyproject.toml
```

Development

```
pip install -e ".[dev]"
pytest
```

License

MIT
