diff --git a/README.md b/README.md index 4d1cede..1ab7645 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ See [`docs/PERF.md`](docs/PERF.md) for extended benchmarks, including gitcortex **This is the biggest practical distortion in every stat.** Line-count metrics treat a 50k-line `generated.pb.go` the same as a 50k-line hand-written module. Lock files like `package-lock.json` regenerate with every dependency bump. Vendored dependencies inflate churn whenever they're updated. OpenAPI specs, minified JS, `bindata.go`-style embeds — all common, all inflate churn and bus factor without reflecting real human contribution. -Run gitcortex on kubernetes without filtering and the top legacy-hotspots are `vendor/golang.org/x/tools/…/manifest.go`, `api/openapi-spec/v3/…v1alpha3_openapi.json`, and `staging/…/generated.pb.go` — technically correct per the data, practically useless for decision-making. +Run gitcortex on kubernetes without filtering and the top fading-silos are `vendor/golang.org/x/tools/…/manifest.go`, `api/openapi-spec/v3/…v1alpha3_openapi.json`, and `staging/…/generated.pb.go` — technically correct per the data, practically useless for decision-making. Mitigate with `--ignore` glob patterns at extract time. 
Files matched are dropped from the JSONL entirely, so **every downstream stat** (hotspots, churn-risk, bus factor, coupling, dev-network, profiles) reflects only hand-authored code: @@ -251,7 +251,7 @@ Available stats: | `activity` | Commits and line changes bucketed by day, week, month, or year | | `busfactor` | Files with lowest bus factor (fewest developers owning 80%+ of changes) | | `coupling` | Files that frequently change together, revealing hidden architectural dependencies | -| `churn-risk` | Files ranked by recent churn, classified into `cold` / `active` / `active-core` / `silo` / `legacy-hotspot` | +| `churn-risk` | Files ranked by recent churn, classified into `cold` / `active` / `active-core` / `silo` / `fading-silo` | | `working-patterns` | Commit heatmap by hour and day of week | | `dev-network` | Developer collaboration graph based on shared file ownership | | `profile` | Per-developer report: scope, specialization index, contribution type, pace, collaboration, top files | @@ -266,7 +266,7 @@ Output formats: `table` (default, human-readable), `csv` (single clean table per ```bash gitcortex stats --input data.jsonl --stat churn-risk --top 0 --format json \ - | jq '.churn_risk[] | select(.Label == "legacy-hotspot")' + | jq '.churn_risk[] | select(.Label == "fading-silo")' ``` CSV output also carries a stable header on line 1, but paths containing commas (font filenames, generated assets) are standard-quoted — a naive `awk -F','` will mis-split on those rows. For CSV pipelines use a proper parser (`csvkit`, `mlr`) or stick with the JSON path above. @@ -340,15 +340,15 @@ advanced/Scripts/utils.sh active-core (age P27, trend P94) | `active` | Shared ownership (bus factor ≥ 3). Healthy. | | `active-core` | New code (younger than most of the repo), single author. Usually fine. | | `silo` | Old + concentrated + stable/growing. Knowledge bottleneck — plan transfer. | -| `legacy-hotspot` | **Urgent.** Old + concentrated + declining. 
Deprecated paths still being touched. | +| `fading-silo` | **Urgent.** Old + concentrated + declining. A silo whose owner is drifting away. | -Sort order is **label priority** (legacy-hotspot → silo → active-core → active → cold), then `recent_churn` descending within the same label. The label answers "is this activity a problem?" and leads the table so the actionable classifications surface at the top — without this, a mature repo's `--top 20` would be dominated by unremarkable active files and the flagged risks would scroll off. The composite `risk_score` field (`recent_churn / bus_factor`) is still emitted for CI gate back-compat. +Sort order is **label priority** (fading-silo → silo → active-core → active → cold), then `recent_churn` descending within the same label. The label answers "is this activity a problem?" and leads the table so the actionable classifications surface at the top — without this, a mature repo's `--top 20` would be dominated by unremarkable active files and the flagged risks would scroll off. The composite `risk_score` field (`recent_churn / bus_factor`) is still emitted for CI gate back-compat. -**The `(age PXX, trend PYY)` suffix** reports where the file sits in this repo's distribution: `age P90` = older than 90% of tracked files, `trend P08` = declining more sharply than 92%. Classification thresholds are not absolute — they adapt to each dataset (P75 age and P25 trend, with a fallback to fixed constants for repos under 8 files). A `legacy-hotspot` with `(age P76, trend P24)` barely qualifies; one at `(age P98, trend P03)` is the real alarm. Distance from the boundary is now visible instead of hidden. See `docs/METRICS.md` for the adaptive-thresholds section. +**The `(age PXX, trend PYY)` suffix** reports where the file sits in this repo's distribution: `age P90` = older than 90% of tracked files, `trend P08` = declining more sharply than 92%. 
Classification thresholds are not absolute — they adapt to each dataset (P75 age and P25 trend, with a fallback to fixed constants for repos under 8 files). A `fading-silo` with `(age P76, trend P24)` barely qualifies; one at `(age P98, trend P03)` is the real alarm. Distance from the boundary is now visible instead of hidden. See `docs/METRICS.md` for the adaptive-thresholds section. `--churn-half-life` controls how fast old changes lose weight (default 90 days = changes lose half their weight every 90 days). -The HTML report precedes the Churn Risk table with a colored distribution strip — `48 legacy-hotspot · 1 silo · 2,330 active-core · 1,404 active · 4,585 cold` — counted over the full classified set. The truncated table below shows only the top N by label priority, so a reader glancing at "all 20 rows are legacy-hotspot" can still tell whether the repo has 20 legacy files or 20,000 before drawing a conclusion. To inspect the full list, use `--top 0 --format json` from the CLI and filter with `jq`. +The HTML report precedes the Churn Risk table with a colored distribution strip — `48 fading-silo · 1 silo · 2,330 active-core · 1,404 active · 4,585 cold` — counted over the full classified set. The truncated table below shows only the top N by label priority, so a reader glancing at "all 20 rows are fading-silo" can still tell whether the repo has 20 such files or 20,000 before drawing a conclusion. To inspect the full list, use `--top 0 --format json` from the CLI and filter with `jq`. ### Working patterns @@ -509,7 +509,7 @@ gitcortex report --input data.jsonl --output report.html --top 30 gitcortex report --input data.jsonl --email alice@company.com --output alice.html ``` -Includes: summary cards, activity heatmap (with table toggle), top contributors, file hotspots, churn risk (with full-dataset label distribution strip above the truncated table), bus factor, file coupling, working patterns heatmap, top commits, developer network, and developer profiles. 
A collapsible glossary at the top defines the terms (bus factor, churn, legacy-hotspot, specialization, etc.) for readers who are not already familiar. Typical size: 50-500KB depending on number of contributors. +Includes: summary cards, activity heatmap (with table toggle), top contributors, file hotspots, churn risk (with full-dataset label distribution strip above the truncated table), bus factor, file coupling, working patterns heatmap, top commits, developer network, and developer profiles. A collapsible glossary at the top defines the terms (bus factor, churn, fading-silo, specialization, etc.) for readers who are not already familiar. Typical size: 50-500KB depending on number of contributors. When the input is multi-repo (from `gitcortex scan` or multiple `--input` files) AND `--email` is set, the profile report renders a *Per-Repository Breakdown* with commit/churn/files/active-days per repo, filtered to that developer's contributions. The team-view report intentionally omits this section — per-repo aggregates on a consolidated dataset reduce to raw git-history distribution, which is more usefully inspected via `manifest.json` or `stats --input X.jsonl` per repo. @@ -537,7 +537,7 @@ Output formats: `text` (default), `github-actions` (annotations), `gitlab` (Code Exit code 1 when violations are found, 0 when clean. -> `--fail-on-churn-risk` evaluates the legacy `risk_score = recent_churn / bus_factor` field, not the new label classification surfaced by `stats --stat churn-risk`. The two can disagree — a file might have `risk_score` below the threshold yet still classify as `legacy-hotspot`. Use the stat command for triage; use the CI gate as a coarse threshold alarm. +> `--fail-on-churn-risk` evaluates the legacy `risk_score = recent_churn / bus_factor` field, not the new label classification surfaced by `stats --stat churn-risk`. The two can disagree — a file might have `risk_score` below the threshold yet still classify as `fading-silo`. 
Use the stat command for triage; use the CI gate as a coarse threshold alarm. ## Architecture diff --git a/docs/METRICS.md b/docs/METRICS.md index 72b9fff..ed9f2a4 100644 --- a/docs/METRICS.md +++ b/docs/METRICS.md @@ -110,7 +110,7 @@ Files ranked by recency-weighted churn, **classified into actionable labels** so ### Ranking -Sort order: **label priority** first, then `recent_churn` descending within the same label, then lower `bus_factor` first, then path ascending. Label priority runs `legacy-hotspot` → `silo` → `active-core` → `active` → `cold`, so the named actionable classifications always lead the table. Sorting by `recent_churn` alone used to bury `legacy-hotspot` files behind very active code (declining trend is part of the classification, so recent churn is low by definition) — a user running `--top 20` on a mature repo would see unremarkable active files and zero flagged risks. +Sort order: **label priority** first, then `recent_churn` descending within the same label, then lower `bus_factor` first, then path ascending. Label priority runs `fading-silo` → `silo` → `active-core` → `active` → `cold`, so the named actionable classifications always lead the table. Sorting by `recent_churn` alone used to bury `fading-silo` files behind very active code (declining trend is part of the classification, so recent churn is low by definition) — a user running `--top 20` on a mature repo would see unremarkable active files and zero flagged risks. `recent_churn` uses exponential decay: ``` @@ -135,39 +135,39 @@ rows implicitly assume the earlier rows didn't match. | 1 | **cold** | `recent_churn ≤ 0.5 × median(recent_churn)` | Ignore. | | 2 | **active** | `bus_factor ≥ 3` | Healthy, shared. | | 3 | **active-core** | `bus_factor ≤ 2` and `age < oldAgeThreshold` | New code, single author is expected. | -| 4 | **legacy-hotspot** | `bus_factor ≤ 2`, `age ≥ oldAgeThreshold`, and `trend < decliningTrendThreshold` | **Urgent.** Old + concentrated + declining. 
| +| 4 | **fading-silo** | `bus_factor ≤ 2`, `age ≥ oldAgeThreshold`, and `trend < decliningTrendThreshold` | **Urgent.** Old + concentrated + declining. | | 5 | **silo** | default (everything the rules above didn't catch) | Knowledge bottleneck — plan transfer. | Where: - `age = days between firstChange and latest commit in dataset` -- `trend = churn_last_3_months / churn_earlier`. Edge cases: empty history returns 1 (no signal); recent-only history returns 2 (grew from nothing); earlier-only history returns 0 (declined to nothing — the strongest `legacy-hotspot` signal); short-span datasets whose entire window fits inside the trend window return 1 to avoid false "growing" reports +- `trend = churn_last_3_months / churn_earlier`. Edge cases: empty history returns 1 (no signal); recent-only history returns 2 (grew from nothing); earlier-only history returns 0 (declined to nothing — the strongest `fading-silo` signal); short-span datasets whose entire window fits inside the trend window return 1 to avoid false "growing" reports ### Adaptive thresholds (per-dataset calibration) `oldAgeThreshold` and `decliningTrendThreshold` are not fixed constants: they are derived from the dataset's own distribution each run. With at least `classifyMinSample` (8) files present: - `oldAgeThreshold` = **P75** of file ages in this dataset -- `decliningTrendThreshold` = **P25** of file trends in this dataset, clamped to at least `adaptiveDecliningTrendFloor` (0.01). The floor matters on mature repos where ≥25% of files are dormant (trend=0 via the earlier-only path): P25 would otherwise collapse to 0 and the strict `trend < threshold` check would never fire, silently misclassifying every dormant concentrated file as `silo` instead of `legacy-hotspot`. The floor keeps the threshold strictly positive so the trend=0 signal — the strongest legacy-hotspot alarm — still reaches the rule. 
+- `decliningTrendThreshold` = **P25** of file trends in this dataset, clamped to at least `adaptiveDecliningTrendFloor` (0.01). The floor matters on mature repos where ≥25% of files are dormant (trend=0 via the earlier-only path): P25 would otherwise collapse to 0 and the strict `trend < threshold` check would never fire, silently misclassifying every dormant concentrated file as `silo` instead of `fading-silo`. The floor keeps the threshold strictly positive so the trend=0 signal — the strongest fading-silo alarm — still reaches the rule. -This makes "old" mean "older than 75% of tracked files in this repo" instead of an absolute 180 days. A 4-year-old file in a 12-year-old codebase was previously tagged `legacy-hotspot` even though it was newer than most of the repo — now the same file lands in `active-core`. Below the sample threshold, the absolute fallbacks `classifyOldAgeDays` and `classifyDecliningTrend` apply so tiny repos still produce labels. +This makes "old" mean "older than 75% of tracked files in this repo" instead of an absolute 180 days. A 4-year-old file in a 12-year-old codebase was previously tagged `fading-silo` even though it was newer than most of the repo — now the same file lands in `active-core`. Below the sample threshold, the absolute fallbacks `classifyOldAgeDays` and `classifyDecliningTrend` apply so tiny repos still produce labels. -Each `ChurnRiskResult` also exposes `AgePercentile` and `TrendPercentile` (0-100) showing where the file sits in the distribution. The fields are nil (omitted from JSON, empty in CSV) when the fallback path ran. The CLI and HTML surface these alongside the label — `legacy-hotspot (age P92, trend P08)` tells you the file is both old and sharply declining relative to peers; `legacy-hotspot (age P76, trend P24)` barely qualifies. Distance from the classification boundary is now readable, not hidden. 
+Each `ChurnRiskResult` also exposes `AgePercentile` and `TrendPercentile` (0-100) showing where the file sits in the distribution. The fields are nil (omitted from JSON, empty in CSV) when the fallback path ran. The CLI and HTML surface these alongside the label — `fading-silo (age P92, trend P08)` tells you the file is both old and sharply declining relative to peers; `fading-silo (age P76, trend P24)` barely qualifies. Distance from the classification boundary is now readable, not hidden. -> **Degenerate trend distribution.** When every file's entire history fits inside the trend window (e.g. a repo with <3 months of commits), `churnTrend` returns the flat-signal sentinel `1.0` for all of them. The adaptive P25 then lands on `1.0` too, and the `trend < P25` predicate matches nobody — no file reaches `legacy-hotspot` through the trend check. Old + concentrated files fall through to `silo` instead. This is mathematically correct (there's no variation to classify on) but can surprise readers of short-lived repos. Pinned by `TestChurnRiskAdaptiveDegenerateTrendDistribution` so future refactors don't silently flip it. +> **Degenerate trend distribution.** When every file's entire history fits inside the trend window (e.g. a repo with <3 months of commits), `churnTrend` returns the flat-signal sentinel `1.0` for all of them. The adaptive P25 then lands on `1.0` too, and the `trend < P25` predicate matches nobody — no file reaches `fading-silo` through the trend check. Old + concentrated files fall through to `silo` instead. This is mathematically correct (there's no variation to classify on) but can surprise readers of short-lived repos. Pinned by `TestChurnRiskAdaptiveDegenerateTrendDistribution` so future refactors don't silently flip it. -> **Sensitivity note.** Files touched a single time long ago and never again correctly route to `legacy-hotspot` via the earlier-only trend=0 path. On large mature repos this pattern is the common case, not the exception — e.g. 
validation on a kubernetes snapshot classified ~29k files this way. If the label distribution looks heavy on `legacy-hotspot` for a long-lived codebase, that is usually diagnosing real dormant code, not a bug. +> **Sensitivity note.** Files touched a single time long ago and never again correctly route to `fading-silo` via the earlier-only trend=0 path. On large mature repos this pattern is the common case, not the exception — e.g. validation on a kubernetes snapshot classified ~29k files this way. If the label distribution looks heavy on `fading-silo` for a long-lived codebase, that is usually diagnosing real dormant code, not a bug. ### Additional columns | Column | Meaning | |--------|---------| -| `risk_score` | `recent_churn / bus_factor` — legacy composite. Still consumed by `gitcortex ci --fail-on-churn-risk N`. Not used for ranking. May diverge from the label (a file can have low `risk_score` but be classified `legacy-hotspot`, and vice versa). | +| `risk_score` | `recent_churn / bus_factor` — legacy composite. Still consumed by `gitcortex ci --fail-on-churn-risk N`. Not used for ranking. May diverge from the label (a file can have low `risk_score` but be classified `fading-silo`, and vice versa). | | `first_change`, `last_change` | Bounds of the file's activity in the dataset (UTC) | | `age_days` | `latest - first_change` in days | | `trend` | Ratio described above | ### How to interpret -- **legacy-hotspot** is the alarm — investigate first. +- **fading-silo** is the alarm — investigate first. - **silo** suggests pairing / documentation work, not panic. - **active-core** is usually fine, but watch for `bus_factor=1` + growing. - **active** with growing trend may indicate a healthy shared module or a collision of too many cooks. @@ -214,7 +214,7 @@ Per-developer report combining multiple metrics. 
| Active days | Unique dates with at least one commit | | Pace | commits / active_days (smooths bursts — a dev with 100 commits on 2 days and silence for 28 shows pace=50, which reads as a steady rate but isn't) | | Weekend % | commits on Saturday+Sunday / total commits × 100 | -| Scope | Top 5 directories by unique file count, as % of the dev's **authored** files — i.e. files where the dev added or removed at least one line. Pure renames (file appears in the dev's change set with zero line changes) are excluded from both numerator and denominator so the visible Pct values sum to 100% (modulo the top-5 truncation). Same denominator is used for Extensions and for the Herfindahl specialization index, keeping the three consistent. | +| Scope | Top 5 directories by unique file count, as % of the dev's **authored** files — i.e. files where the dev added or removed at least one line. Pure renames (file appears in the dev's change set with zero line changes) are excluded from both numerator and denominator so the visible Pct values sum to 100% (modulo the top-5 truncation). Same denominator is used for Extensions and for the Herfindahl specialization index, keeping the three consistent. **Multi-repo:** the `:` prefix that `LoadMultiJSONL` adds to avoid filename collisions is stripped before bucketing, so `cmd/` in three repos aggregates into one `cmd` bucket instead of fragmenting into `repoA:cmd`, `repoB:cmd`, `repoC:cmd`. Without the strip, Scope burns the top-5 slots on repo-×-dir pairs and Specialization deflates toward "generalist" for anyone whose area of work happens to exist in several repos. The per-repo split is still surfaced by the Per-Repository Breakdown section when a profile report is multi-repo. | | Extensions | Top 5 file extensions the dev touched, sorted by **files desc** (tiebreak churn desc, then ext asc) so the displayed `Pct` is monotonic with the sort order and HTML bar widths read correctly. 
`Pct` is `Files / authored * 100` where `authored` is the count of files the dev added or removed at least one line on — same denominator as Scope, so Pcts sum to 100% modulo top-5 truncation. The raw dev-attributable `Churn` (sum of `devLines[email]` across bucket files) is kept on the struct for JSON consumers who want a churn-ranked view. Answers the "language/skill fingerprint" question (`.go` + `.yaml` → backend+infra; `.tsx` + `.ts` + `.css` → frontend). **Attribution caveat:** bucket is derived from the file's canonical (post-rename) path — a dev who worked on `foo.js` pre-migration still shows up under `.ts` if it was later renamed; per-era per-dev attribution would need `byExt` to carry a dev dimension, which isn't tracked. | | Specialization | Herfindahl index over the **full** per-directory file-count distribution: Σ pᵢ² where pᵢ is the share of the dev's files in directory i. 1 = all files in one directory (narrow specialist); 1/N for a uniform spread across N directories; approaches 0 as the distribution widens. Computed before the top-5 Scope truncation so it reflects actual breadth. Labels (see `specBroadGeneralistMax`, `specBalancedMax`, `specFocusedMax` constants): `< 0.15` broad generalist, `< 0.35` balanced, `< 0.7` focused specialist, `≥ 0.7` narrow specialist. Herfindahl, not Gini, because Gini would collapse "1 file in 1 dir" and "1 file in each of 5 dirs" to the same value (both have zero inequality among buckets), which misses the specialization distinction. **Measures file distribution, not domain expertise** — see caveat below. **Display vs raw:** CLI and HTML show the value rounded to 3 decimals (`%.3f`) for readability; JSON output preserves the full float64. Band classification runs against the raw float, so a value like 0.149 lands in `broad generalist` even though %.2f would have rounded it to `0.15`. JSON consumers that reproduce the banding must use the raw value, not a rounded version. 
| | Contribution type | Based on del/add ratio: growth (<0.4), balanced (0.4-0.8), refactor (>0.8) | @@ -342,7 +342,7 @@ Every classification boundary is a named constant in `internal/stats/stats.go`. | `classifyOldAgeDays` | `180` | **Fallback only** (dataset < `classifyMinSample` files). Adaptive path uses P75 of the dataset's own age distribution. | | `classifyDecliningTrend` | `0.5` | **Fallback only**. Adaptive path uses P25 of the dataset's own trend distribution. | | `classifyMinSample` | `8` | Below this many files, percentile estimates are too noisy to trust and the two thresholds above revert to absolutes. | -| `adaptiveDecliningTrendFloor` | `0.01` | Minimum value for the adaptive `decliningTrendThreshold`. Prevents P25 from collapsing to 0 on mature repos where dormant files dominate, which would hide every legacy-hotspot. | +| `adaptiveDecliningTrendFloor` | `0.01` | Minimum value for the adaptive `decliningTrendThreshold`. Prevents P25 from collapsing to 0 on mature repos where dormant files dominate, which would hide every fading-silo. | | `suspectWarningMinChurnRatio` | `0.10` | Vendor/generated path warning fires only when matched paths together exceed this fraction of total repo churn — prevents a single incidental `.lock` file from triggering noise. | | `classifyTrendWindowMonths` | `3` | Window (months, relative to latest commit) for the recent vs earlier split in `trend`. | | `contribRefactorRatio` | `0.8` | `del/add ≥ this` → dev profile `contribType = refactor`. 
| @@ -368,7 +368,7 @@ Every ranking function has an explicit tiebreaker so the same input produces the | `directories` | file_touches | dir asc | | `busfactor` | bus_factor (asc) | path asc | | `coupling` | co_changes | coupling_pct | -| `churn-risk` | label priority (legacy-hotspot → silo → active-core → active → cold) | recent_churn desc, then bus_factor asc | +| `churn-risk` | label priority (fading-silo → silo → active-core → active → cold) | recent_churn desc, then bus_factor asc | | `top-commits` | lines_changed | sha asc | | `dev-network` | shared_lines | shared_files | | `profile` | commits | email asc | diff --git a/internal/report/report.go b/internal/report/report.go index 24c982f..acd4603 100644 --- a/internal/report/report.go +++ b/internal/report/report.go @@ -44,8 +44,8 @@ type ReportData struct { MaxPattern int // Label distribution for the Churn Risk section — counted over the - // full classified set so the reader can tell "top 20, all legacy- - // hotspot" from "there are 48 legacy-hotspots in total". Populated + // full classified set so the reader can tell "top 20, all fading- + // silo" from "there are 48 fading-silos in total". Populated // alongside ChurnRisk in Generate(). ChurnRiskLabelCounts []LabelCount @@ -383,11 +383,11 @@ func Generate(w io.Writer, ds *stats.Dataset, repoName string, topN int, sf stat } // churnRiskLabelCounts aggregates the per-label totals for the Churn -// Risk distribution strip. Ordering matches the table below: legacy- -// hotspot first (most actionable), cold last. Labels with zero files +// Risk distribution strip. Ordering matches the table below: fading- +// silo first (most actionable), cold last. Labels with zero files // are omitted so the strip doesn't show empty chips on small repos. 
func buildLabelCountList(counts map[string]int) []LabelCount { - order := []string{"legacy-hotspot", "silo", "active-core", "active", "cold"} + order := []string{"fading-silo", "silo", "active-core", "active", "cold"} var result []LabelCount for i, lbl := range order { if n := counts[lbl]; n > 0 { diff --git a/internal/report/report_test.go b/internal/report/report_test.go index 45a91eb..af1d6e7 100644 --- a/internal/report/report_test.go +++ b/internal/report/report_test.go @@ -507,15 +507,15 @@ func TestHumanize(t *testing.T) { func TestBuildLabelCountList(t *testing.T) { counts := map[string]int{ - "active": 2, - "legacy-hotspot": 1, - "cold": 1, - "silo": 1, - "active-core": 3, + "active": 2, + "fading-silo": 1, + "cold": 1, + "silo": 1, + "active-core": 3, } got := buildLabelCountList(counts) - wantOrder := []string{"legacy-hotspot", "silo", "active-core", "active", "cold"} + wantOrder := []string{"fading-silo", "silo", "active-core", "active", "cold"} if len(got) != len(wantOrder) { t.Fatalf("got %d entries, want %d: %+v", len(got), len(wantOrder), got) } diff --git a/internal/report/template.go b/internal/report/template.go index d6a3a64..908bf2a 100644 --- a/internal/report/template.go +++ b/internal/report/template.go @@ -38,7 +38,7 @@ tr:last-child td { border-bottom: none; } footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; color: #656d76; font-size: 12px; } .churn-chips { display: flex; flex-wrap: wrap; align-items: center; gap: 8px; margin-bottom: 12px; } .churn-chips .chip { padding: 3px 10px; border-radius: 12px; font-size: 11px; font-weight: 500; white-space: nowrap; } -.chip-legacy-hotspot { background: #cf222e; color: #fff; } +.chip-fading-silo { background: #9a3412; color: #fff; } .chip-silo { background: #bf8700; color: #fff; } .chip-active-core { background: #0969da; color: #fff; } .chip-active { background: #2da44e; color: #fff; } @@ -61,7 +61,7 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid 
#d0d7de; col
Glossary — what do these terms mean? -

gitcortex is a repository behavior analyzer, not a code analyzer. These metrics describe what people and processes did in git — who touched what, when, and with whom — not the quality of the source code itself. A file classified as silo or legacy-hotspot reveals a human or process pattern; it is not a judgment on the code (a well-written library maintained by one person will classify as silo regardless of how good it is). Labels point at where to look, not what to conclude.

+

gitcortex is a repository behavior analyzer, not a code analyzer. These metrics describe what people and processes did in git — who touched what, when, and with whom — not the quality of the source code itself. A file classified as silo or fading-silo reveals a human or process pattern; it is not a judgment on the code (a well-written library maintained by one person will classify as silo regardless of how good it is). Labels point at where to look, not what to conclude.

Bus factor
How many developers would need to leave before critical knowledge is lost. A file with bus factor 1 has a single owner — losing that person means losing the context.
@@ -69,14 +69,14 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col
Total lines added plus lines removed. High churn files are heavily modified — often where bugs accumulate.
Recent churn
Churn weighted so recent changes count more. Default half-life is 90 days — a change loses half its weight every 90 days, so a change from a year ago (≈4 half-lives) is worth ~1/16 of a change today.
-
Legacy-hotspot
-
An old file with concentrated ownership and declining activity — deprecated code still being touched. Usually the most urgent refactor target.
+
Fading-silo
+
An old file with concentrated ownership whose activity is cooling — a silo whose owner is drifting away. Usually the most urgent refactor target.
Silo
Old, concentrated, and still stable or growing — a knowledge bottleneck. Plan transfer before the owner moves on.
Active-core
Newer code with a single main author. Often fine during early development; revisit if it ages without spreading ownership.
Trend
-
Ratio of recent churn to older churn for a file. Below 0.5 means activity is declining sharply; around 1 is stable; above 1.5 is growing. The declining case is what flips an old concentrated file from silo to legacy-hotspot.
+
Ratio of recent churn to older churn for a file. Below 0.5 means activity is declining sharply; around 1 is stable; above 1.5 is growing. The declining case is what flips an old concentrated file from silo to fading-silo.
Age P__ / Trend P__
Percentile suffixes on Churn Risk labels show where this file sits in the repo's own distribution. Age P90 = older than 90% of tracked files; Trend P10 = declining more sharply than 90%. Useful to separate a borderline classification (P76/P24) from a real alarm (P98/P03).
Coupling
@@ -262,7 +262,7 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{if .ChurnRisk}}

Churn Risk{{if lt (len .ChurnRisk) .Summary.TotalFiles}} {{thousands (len .ChurnRisk)}} of {{thousands .Summary.TotalFiles}}{{end}}

-

Files ranked by recent churn. Label classifies context so you can judge action: legacy-hotspot (old code + concentrated + declining) is the urgent alarm; silo suggests knowledge transfer; active-core is young code with a single author (often fine); active is shared healthy work; cold is quiet.{{if (index .ChurnRisk 0).AgePercentile}} Age P__ / Trend P__ under the label show where this file sits in the repo's distribution: age P90 means older than 90% of tracked files; trend P10 means declining more sharply than 90%. Classification boundaries are the P75 age and P25 trend of this dataset (see {{docRef "churn-risk"}}).{{end}}

+

Files ranked by recent churn. Label classifies context so you can judge action: fading-silo (old code + concentrated + declining) is the urgent alarm; silo suggests knowledge transfer; active-core is young code with a single author (often fine); active is shared healthy work; cold is quiet.{{if (index .ChurnRisk 0).AgePercentile}} Age P__ / Trend P__ under the label show where this file sits in the repo's distribution: age P90 means older than 90% of tracked files; trend P10 means declining more sharply than 90%. Classification boundaries are the P75 age and P25 trend of this dataset (see {{docRef "churn-risk"}}).{{end}}

{{if .ChurnRiskLabelCounts}}
{{range .ChurnRiskLabelCounts}} @@ -277,7 +277,7 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{range .ChurnRisk}} {{.Path}} - {{if eq .Label "legacy-hotspot"}}🔴 {{.Label}}{{else if eq .Label "silo"}}🟡 {{.Label}}{{else if eq .Label "active-core"}}{{.Label}}{{else if eq .Label "active"}}{{.Label}}{{else}}{{.Label}}{{end}}{{if .AgePercentile}}
age P{{derefInt .AgePercentile}} · trend P{{derefInt .TrendPercentile}}
{{end}} + {{if eq .Label "fading-silo"}}{{.Label}}{{else if eq .Label "silo"}}🟡 {{.Label}}{{else if eq .Label "active-core"}}{{.Label}}{{else if eq .Label "active"}}{{.Label}}{{else}}{{.Label}}{{end}}{{if .AgePercentile}}
age P{{derefInt .AgePercentile}} · trend P{{derefInt .TrendPercentile}}
{{end}} {{printf "%.1f" .RecentChurn}}
{{.BusFactor}} diff --git a/internal/stats/stats.go b/internal/stats/stats.go index 331ac3a..0719041 100644 --- a/internal/stats/stats.go +++ b/internal/stats/stats.go @@ -22,10 +22,10 @@ const ( // adaptiveDecliningTrendFloor keeps the P25-derived "declining" // threshold strictly positive. churnTrend clamps its output at 0 // for files with earlier-only history (the strongest - // legacy-hotspot signal). In mature repos where ≥25% of files are + // fading-silo signal). In mature repos where ≥25% of files are // dormant, P25 collapses to 0; without this floor, `trend < 0` // never fires and every dormant concentrated file is misrouted to - // silo instead of legacy-hotspot — the exact signal the rule is + // silo instead of fading-silo — the exact signal the rule is // supposed to surface. Epsilon is small enough not to widen the // declining band past the fallback (0.5) even in the pathological // case, large enough that 0.0 ≠ threshold under float compare. @@ -131,13 +131,13 @@ type ChurnRiskResult struct { FirstChangeDate string AgeDays int Trend float64 // recent 3mo churn / earlier churn; 1 = flat, <0.5 declining, >1.5 growing - Label string // "cold" | "active" | "active-core" | "silo" | "legacy-hotspot" + Label string // "cold" | "active" | "active-core" | "silo" | "fading-silo" // AgePercentile and TrendPercentile report where this file lands in the // per-dataset distribution (0-100). Nil when the fallback path ran // (dataset below classifyMinSample) so JSON consumers see the field // omitted rather than a `-1` sentinel. Surfacing these alongside the // label makes the distance from the classification boundary visible: - // `legacy-hotspot (age P92, trend P08)` vs a file that barely crossed. + // `fading-silo (age P92, trend P08)` vs a file that barely crossed. // Tag form `json:",omitempty"` (with the leading comma) keeps Go's // default PascalCase name — AgePercentile / TrendPercentile — so the // field names match every other field on this struct. 
Without it @@ -977,12 +977,12 @@ func rankFloat(sorted []float64, v float64) int { // churnRiskLabelPriority returns a sort key for ChurnRisk labels where // lower values rise to the top. Order reflects actionability: named -// risks (legacy-hotspot, silo) first, then young concentrated code +// risks (fading-silo, silo) first, then young concentrated code // (active-core), then healthy active code, then cold. Any unrecognized // label sorts last so the primary labels always lead the table. func churnRiskLabelPriority(label string) int { switch label { - case "legacy-hotspot": + case "fading-silo": return 0 case "silo": return 1 @@ -1011,7 +1011,7 @@ func classifyFile(recentChurn, lowChurn float64, bf, ageDays int, trend float64, return "active-core" // new code, single author is expected } if trend < bands.DecliningTrend { - return "legacy-hotspot" // old + concentrated + declining → urgent + return "fading-silo" // old + concentrated + declining → urgent } return "silo" // old + concentrated + stable/growing → knowledge bottleneck } @@ -1206,7 +1206,7 @@ func ChurnRisk(ds *Dataset, n int) []ChurnRiskResult { }) } - // Primary sort: label priority — legacy-hotspot and silo are the + // Primary sort: label priority — fading-silo and silo are the // actionable classifications (old + concentrated, with diverging // trend). Sorting by RecentChurn alone buried them behind very // active files, so a user running `--top 20` on a mature repo would @@ -1622,12 +1622,33 @@ func DevProfiles(ds *Dataset, filterEmail string, n int) []DevProfile { // otherwise a dev who only touches README, Makefile, go.mod, etc. // appears as a broad generalist across N pseudo-dirs instead of // a narrow specialist on the repo root. + // + // Multi-repo: strip the `:` prefix added by LoadMultiJSONL + // so a dev who works on `cmd/` across several repos aggregates + // into a single `cmd` bucket instead of `repoA:cmd`, `repoB:cmd`, + // ... 
— which would fragment the top-5 truncation, deflate the + // Herfindahl specialization index toward "generalist", and show + // awkward `slug:dir` labels in the HTML bar. The per-repo split + // is still visible in the Per-Repository Breakdown section below. + // + // The strip is gated on multi-repo mode (len(commitsByRepo) > 1) + // because stripRepoPrefix alone is too aggressive in single-repo + // datasets: a legitimate top-level dir containing ":" — say + // `ops:core/main.go` committed to a single repo — would be + // wrongly collapsed to `core/main.go`, silently moving files + // between buckets. The multi-repo gate guarantees we only strip + // paths that actually carry a LoadMultiJSONL prefix. + multiRepo := len(ds.commitsByRepo) > 1 dirCount := make(map[string]int) if files, ok := devFiles[email]; ok { for path := range files { + p := path + if multiRepo { + p = stripRepoPrefix(path) + } dir := "." - if idx := strings.LastIndex(path, "/"); idx >= 0 { - dir = path[:idx] + if idx := strings.LastIndex(p, "/"); idx >= 0 { + dir = p[:idx] } dirCount[dir]++ } diff --git a/internal/stats/stats_test.go b/internal/stats/stats_test.go index b9c418b..ba9e906 100644 --- a/internal/stats/stats_test.go +++ b/internal/stats/stats_test.go @@ -651,10 +651,10 @@ func TestChurnRisk(t *testing.T) { func TestChurnRiskLabelPriority(t *testing.T) { // Labels must sort in this actionability order regardless of - // RecentChurn — a legacy-hotspot with RC=50 outranks an active file + // RecentChurn — a fading-silo with RC=50 outranks an active file // with RC=50000. See the sort comment in stats.go. 
want := []string{ - "legacy-hotspot", "silo", "active-core", "active", "cold", + "fading-silo", "silo", "active-core", "active", "cold", } for i := 1; i < len(want); i++ { if churnRiskLabelPriority(want[i-1]) >= churnRiskLabelPriority(want[i]) { @@ -671,13 +671,13 @@ func TestChurnRiskLabelPriority(t *testing.T) { } func TestChurnRiskSortLegacyBeatsActiveDespiteHigherChurn(t *testing.T) { - // A legacy-hotspot with low RecentChurn must rank above an active + // A fading-silo with low RecentChurn must rank above an active // file with huge RecentChurn — otherwise the top-N display hides the // classified risks behind unremarkable active code (the WordPress bug // that motivated the label-first sort). results := []ChurnRiskResult{ {Path: "active.go", Label: "active", RecentChurn: 10000, BusFactor: 4}, - {Path: "legacy.go", Label: "legacy-hotspot", RecentChurn: 50, BusFactor: 1}, + {Path: "legacy.go", Label: "fading-silo", RecentChurn: 50, BusFactor: 1}, } sort.Slice(results, func(i, j int) bool { pi, pj := churnRiskLabelPriority(results[i].Label), churnRiskLabelPriority(results[j].Label) @@ -690,7 +690,7 @@ func TestChurnRiskSortLegacyBeatsActiveDespiteHigherChurn(t *testing.T) { return results[i].Path < results[j].Path }) if results[0].Path != "legacy.go" { - t.Errorf("legacy-hotspot must outrank active file, got top=%q", results[0].Path) + t.Errorf("fading-silo must outrank active file, got top=%q", results[0].Path) } } @@ -737,7 +737,7 @@ func TestClassifyFile(t *testing.T) { {"active-core: new code, single author", 200, 50, 1, 30, 1.0, "active-core"}, {"silo: old + concentrated + stable", 200, 50, 2, 365, 1.0, "silo"}, {"silo: old + concentrated + growing", 200, 50, 2, 365, 2.0, "silo"}, - {"legacy-hotspot: old + concentrated + declining", 200, 50, 1, 365, 0.3, "legacy-hotspot"}, + {"fading-silo: old + concentrated + declining", 200, 50, 1, 365, 0.3, "fading-silo"}, {"cold wins over everything when churn low", 10, 50, 1, 365, 0.1, "cold"}, } // Use defaultBands 
so the old absolute constants (180d age, 0.5 trend) @@ -884,7 +884,7 @@ func TestChurnRiskAdaptiveDormantP25ZeroFlooring(t *testing.T) { // earlier-only path), half are active. Without the declining-trend // floor, P25 collapses to 0 and `trend < 0` never fires — dormant // concentrated files would silently be misclassified as silo - // instead of legacy-hotspot, hiding the strongest alarm. + // instead of fading-silo, hiding the strongest alarm. // // The floor guarantees the threshold is strictly positive so the // signal survives the adaptive-mode switch. @@ -930,7 +930,7 @@ func TestChurnRiskAdaptiveDormantP25ZeroFlooring(t *testing.T) { var legacyCount int var dormantLabels []string for _, r := range results { - if r.Label == "legacy-hotspot" { + if r.Label == "fading-silo" { legacyCount++ } if strings.HasPrefix(r.Path, "dormant/") { @@ -938,13 +938,13 @@ func TestChurnRiskAdaptiveDormantP25ZeroFlooring(t *testing.T) { } } if legacyCount == 0 { - t.Errorf("expected dormant+concentrated files to be flagged legacy-hotspot; got 0 "+ + t.Errorf("expected dormant+concentrated files to be flagged fading-silo; got 0 "+ "(dormant labels: %v). P25 likely collapsed to 0 without the floor.", dormantLabels) } - // Sanity: every dormant file (bf=1, old, trend=0) should be legacy-hotspot. + // Sanity: every dormant file (bf=1, old, trend=0) should be fading-silo. 
for _, lbl := range dormantLabels { - if lbl != "legacy-hotspot" { - t.Errorf("dormant file got label %q, want legacy-hotspot", lbl) + if lbl != "fading-silo" { + t.Errorf("dormant file got label %q, want fading-silo", lbl) } } } @@ -965,8 +965,8 @@ func TestClassifyFileFloorBoundary(t *testing.T) { trend float64 want string }{ - {"trend 0 (dormant, earlier-only) → declining", 0.0, "legacy-hotspot"}, - {"trend 0.001 below floor → declining", 0.001, "legacy-hotspot"}, + {"trend 0 (dormant, earlier-only) → declining", 0.0, "fading-silo"}, + {"trend 0.001 below floor → declining", 0.001, "fading-silo"}, {"trend exactly at floor 0.01 → NOT declining", 0.01, "silo"}, {"trend 0.011 above floor → NOT declining", 0.011, "silo"}, {"trend 1.0 flat → NOT declining", 1.0, "silo"}, @@ -984,7 +984,7 @@ func TestChurnRiskAdaptiveDegenerateTrendDistribution(t *testing.T) { // trend window (earlier bucket is empty), so churnTrend returns the // sentinel 1.0 for all of them. The adaptive P25 then collapses onto // 1.0 and the "declining" check (`trend < 1.0`) matches nobody — no - // file can reach legacy-hotspot via the trend predicate. Old + + // file can reach fading-silo via the trend predicate. Old + // concentrated files fall through to silo. // // This test pins that behavior so future refactors don't silently @@ -1022,15 +1022,15 @@ func TestChurnRiskAdaptiveDegenerateTrendDistribution(t *testing.T) { legacyCount, siloCount := 0, 0 for _, r := range results { switch r.Label { - case "legacy-hotspot": + case "fading-silo": legacyCount++ case "silo": siloCount++ } // Trend of 1.0 means P25 of a constant-1 distribution is also 1, // so `trend < 1.0` never fires and no file is declining. 
- if r.Label == "legacy-hotspot" { - t.Errorf("%s: unexpected legacy-hotspot — trend distribution is degenerate (all 1.0), "+ + if r.Label == "fading-silo" { + t.Errorf("%s: unexpected fading-silo — trend distribution is degenerate (all 1.0), "+ "no file should be classified as declining", r.Path) } } @@ -1073,10 +1073,10 @@ func TestLabelWithPercentile(t *testing.T) { age, trend *int want string }{ - {"both nil → bare label", "legacy-hotspot", nil, nil, "legacy-hotspot"}, + {"both nil → bare label", "fading-silo", nil, nil, "fading-silo"}, {"age nil → bare label", "silo", nil, p(10), "silo"}, {"trend nil → bare label", "active", p(75), nil, "active"}, - {"both set → suffix", "legacy-hotspot", p(92), p(8), "legacy-hotspot (age P92, trend P8)"}, + {"both set → suffix", "fading-silo", p(92), p(8), "fading-silo (age P92, trend P8)"}, {"zero values render", "active-core", p(0), p(0), "active-core (age P0, trend P0)"}, {"three-digit value renders unpadded", "active", p(100), p(100), "active (age P100, trend P100)"}, } @@ -1159,7 +1159,7 @@ func TestChurnTrend(t *testing.T) { // Single-month histories used to short-circuit to 1 via len(monthChurn)<2, // silencing the two strongest trend signals: earlier-only (declined to // nothing) and recent-only (grew from nothing). Both must now come - // through so old concentrated files can be classified as legacy-hotspot. + // through so old concentrated files can be classified as fading-silo. // Earlier-only: a single month well before the cutoff. earlierOnly := map[string]int64{"2023-05": 500} @@ -1182,11 +1182,11 @@ func TestChurnTrend(t *testing.T) { } } -func TestChurnRiskLegacyHotspotFromSingleOldMonth(t *testing.T) { +func TestChurnRiskFadingSiloFromSingleOldMonth(t *testing.T) { // Integration: file touched only in one old month, bf=1, age > 180. // Before the churnTrend fix (len<2 guard), this file returned // trend=1 (stable) and landed at label=silo. 
After the fix, - // trend=0 (declined to nothing) routes it through the legacy-hotspot + // trend=0 (declined to nothing) routes it through the fading-silo // branch. Pins the end-to-end wiring so a future regression in // churnTrend, classifyFile, or ChurnRisk can't silently send such // files back to silo. @@ -1211,8 +1211,8 @@ func TestChurnRiskLegacyHotspotFromSingleOldMonth(t *testing.T) { t.Fatalf("len = %d, want 1", len(results)) } r := results[0] - if r.Label != "legacy-hotspot" { - t.Errorf("Label = %q, want legacy-hotspot (single-old-month + bf=1 + age>180)", r.Label) + if r.Label != "fading-silo" { + t.Errorf("Label = %q, want fading-silo (single-old-month + bf=1 + age>180)", r.Label) } if r.Trend != 0 { t.Errorf("Trend = %.2f, want 0 (earlier-only — the fix)", r.Trend) @@ -2720,6 +2720,111 @@ func TestDevProfilesSpecializationRootFilesBucket(t *testing.T) { } } +func TestDevProfilesScopeAggregatesAcrossMultiRepoPrefix(t *testing.T) { + // LoadMultiJSONL prepends `:` to every path so repos with + // colliding basenames (cmd/, src/, …) don't merge at the file + // level. Scope and Specialization operate on the *developer's* + // directory distribution, not the file universe — a dev who + // works on `cmd/` across three repos is a cmd specialist + // regardless of how the loader namespaced the paths. Strip the + // prefix before bucketing so the top-5 truncation isn't + // fragmented, the Herfindahl index reflects area-of-work + // (not area-×-repo), and the HTML bar avoids awkward + // `slug:dir` labels. Per-repo split is already surfaced by + // Per-Repository Breakdown below the scope section. + t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + // commitsByRepo must carry multiple slugs so DevProfiles recognizes + // the dataset as multi-repo and enables prefix stripping. 
Without + // this, the gate in stats.go skips the strip (correct behavior for + // single-repo datasets with legit ":" in path segments) and the + // test would assert fragmentation. + c1 := &commitEntry{email: "dev@x", date: t1, add: 50, del: 0, files: 5, repo: "repoA"} + ds := &Dataset{ + Earliest: t1, Latest: t1, + commits: map[string]*commitEntry{"c1": c1}, + commitsByRepo: map[string][]*commitEntry{ + "repoA": {c1}, + "repoB": {}, + "repoC": {}, + }, + contributors: map[string]*ContributorStat{ + "dev@x": {Email: "dev@x", Name: "D", Commits: 1, ActiveDays: 1, FilesTouched: 5, Additions: 50}, + }, + files: map[string]*fileEntry{ + "repoA:cmd/main.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + "repoA:cmd/run.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + "repoB:cmd/serve.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + "repoC:cmd/boot.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + "repoA:pkg/util.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + }, + } + profiles := DevProfiles(ds, "", 0) + if len(profiles) != 1 { + t.Fatalf("profiles = %d", len(profiles)) + } + p := profiles[0] + // Scope: two aggregated buckets — `cmd` (4 files across 3 repos) + // and `pkg` (1 file). Without prefix stripping there would be 4 + // entries (`repoA:cmd` with 2, three others with 1 each). 
+ if len(p.Scope) != 2 { + t.Fatalf("Scope = %d entries (%+v), want 2 (cmd, pkg) after cross-repo aggregation", len(p.Scope), p.Scope) + } + if p.Scope[0].Dir != "cmd" || p.Scope[0].Files != 4 { + t.Errorf("Scope[0] = %+v, want {cmd, 4}", p.Scope[0]) + } + if p.Scope[1].Dir != "pkg" || p.Scope[1].Files != 1 { + t.Errorf("Scope[1] = %+v, want {pkg, 1}", p.Scope[1]) + } + // Specialization: Herfindahl over {4,1} = 17/25 = 0.68 + // (focused specialist). Without aggregation it would be + // (4+1+1+1)/25 = 0.28 (broad generalist) — a regression that + // misclassifies every cross-repo specialist. + if got, want := p.Specialization, 0.68; got < want-0.01 || got > want+0.01 { + t.Errorf("Specialization = %.3f, want ~%.2f (focused specialist over {cmd:4, pkg:1})", got, want) + } +} + +func TestDevProfilesScopePreservesColonDirsInSingleRepo(t *testing.T) { + // Regression: stripRepoPrefix drops everything before the first ":" + // when the preceding segment has no slash, which would wrongly + // collapse a legitimate top-level dir like `ops:core` into `core` + // in a single-repo dataset. The multi-repo gate in DevProfiles + // guards against this — a dataset with a single entry in + // commitsByRepo must keep the raw path, so `ops:core/main.go` + // stays in the `ops:core` bucket and Specialization sees the dir + // the dev actually works in. 
+ t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + c1 := &commitEntry{email: "dev@x", date: t1, add: 20, del: 0, files: 2, repo: "(repo)"} + ds := &Dataset{ + Earliest: t1, Latest: t1, + commits: map[string]*commitEntry{"c1": c1}, + commitsByRepo: map[string][]*commitEntry{ + "(repo)": {c1}, + }, + contributors: map[string]*ContributorStat{ + "dev@x": {Email: "dev@x", Name: "D", Commits: 1, ActiveDays: 1, FilesTouched: 2, Additions: 20}, + }, + files: map[string]*fileEntry{ + "ops:core/main.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + "ops:core/run.go": {commits: 1, devLines: map[string]int64{"dev@x": 10}, devCommits: map[string]int{"dev@x": 1}, monthChurn: map[string]int64{}}, + }, + } + profiles := DevProfiles(ds, "", 0) + if len(profiles) != 1 { + t.Fatalf("profiles = %d", len(profiles)) + } + p := profiles[0] + if len(p.Scope) != 1 { + t.Fatalf("Scope = %d entries (%+v), want 1 — single-repo paths must not be stripped", len(p.Scope), p.Scope) + } + if p.Scope[0].Dir != "ops:core" { + t.Errorf("Scope[0].Dir = %q, want %q (stripRepoPrefix fired in single-repo mode and dropped the legit ops: segment)", p.Scope[0].Dir, "ops:core") + } + if p.Scope[0].Files != 2 { + t.Errorf("Scope[0].Files = %d, want 2", p.Scope[0].Files) + } +} + func TestDevProfilesSpecializationEdgeCases(t *testing.T) { t1 := time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC)