diff --git a/README.md b/README.md
index 29ea136..4d1cede 100644
--- a/README.md
+++ b/README.md
@@ -401,6 +401,73 @@ gitcortex stats --input auth.jsonl --input payments.jsonl --stat coupling --top
 Paths appear as `auth:src/main.go` and `payments:src/main.go`. Contributors are deduped by email across repos — the same developer contributing to both repos is counted once.
 
+For workspaces containing many repos (an engineer's `~/work`, a platform team's service folder), `gitcortex scan` discovers every `.git` under one or more roots and extracts them in parallel — see below.
+
+### Scan: discover and aggregate every repo under a root
+
+Walk one or more directories, find every git repository (working trees and bare clones are both detected), extract them in parallel, and optionally render HTML. Two output modes:
+
+- `--report-dir <dir>` — one standalone HTML per repo plus an `index.html` landing page linking them. Each per-repo report is equivalent to running `gitcortex report` against that repo alone; no metric mixing across unrelated codebases.
+- `--report <file> --email <addr>
` — a **single** consolidated profile report for one developer across every scanned repo. This is the only cross-repo aggregation in the feature, because "where did this person spend their time?" is the only question that genuinely benefits from pooling signal across projects.
+
+There is no third mode. Cross-repo consolidation at the team/codebase level inflates hotspots, bus factor, and coupling with noise from unrelated codebases; if that's what you want, inspect `manifest.json` or run `gitcortex report` per JSONL.
+
+```bash
+# Discover and extract every repo under ~/work (JSONLs + manifest, no HTML)
+gitcortex scan --root ~/work --output ./scan-out
+
+# Per-repo HTML reports + index landing page
+gitcortex scan --root ~/work --output ./scan-out --report-dir ./reports
+# then open ./reports/index.html → click through to each repo
+
+# Personal cross-repo profile: only MY commits, consolidated into one HTML
+gitcortex scan --root ~/work --output ./scan-out \
+  --report ./me.html --email me@company.com --since 1y \
+  --include-commit-messages
+
+# Multiple roots, higher parallelism, pre-set ignore patterns
+gitcortex scan --root ~/work --root ~/personal --root ~/oss \
+  --parallel 8 --max-depth 4 \
+  --output ./scan-out --report-dir ./reports
+```
+
+The scan output directory holds:
+
+| file | purpose |
+|---|---|
+| `<slug>.jsonl` | per-repo JSONL, one per discovered repo |
+| `<slug>.state` | resume checkpoint (safe to re-run scan to continue) |
+| `manifest.json` | discovery results, per-repo status (ok/failed/pending), timing |
+
+Each repo's slug is derived from its directory basename; colliding basenames get a short SHA-1 suffix (the suffix lengthens automatically on the rare truncation collision, so `<slug>.state` is stable across runs).
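The slug scheme can be sketched in a few lines of Go (an illustration of the idea only — `slugFor` and the 4-character starting suffix are assumptions, not gitcortex's actual implementation):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// slugFor derives a stable slug from a repo path: the directory
// basename, plus a short SHA-1-of-path suffix when the basename is
// already taken. The suffix grows on the (rare) collision of the
// truncated hash, so the result is deterministic across runs.
func slugFor(path, base string, taken map[string]bool) string {
	if !taken[base] {
		taken[base] = true
		return base
	}
	sum := sha1.Sum([]byte(path))
	hash := hex.EncodeToString(sum[:])
	for n := 4; n <= len(hash); n++ { // lengthen until unique
		candidate := base + "-" + hash[:n]
		if !taken[candidate] {
			taken[candidate] = true
			return candidate
		}
	}
	return base + "-" + hash
}

func main() {
	taken := map[string]bool{}
	fmt.Println(slugFor("/work/api", "api", taken))     // api
	fmt.Println(slugFor("/personal/api", "api", taken)) // api-<4 hex chars>
}
```

Because the suffix comes from the repo's path rather than a counter, re-running the scan over the same workspace reproduces the same slugs, which is what keeps the resume checkpoints valid across runs.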
+
+**Filtering discovery with `.gitcortex-ignore`.** Create a gitignore-style file at the scan root:
+
+```
+# skip heavy clones we don't want in the report
+node_modules
+chromium.git
+linux.git
+
+# skip vendored repos except the one we own
+vendor/
+!vendor/in-house-fork
+```
+
+Directory rules, globs, `**/foo`, and `!path` negations all work. Globbed negations like `!vendor*/keep` are honored — discovery descends into any dir where a negation rule could match a descendant. If `--ignore-file` is not set, scan looks for `.gitcortex-ignore` in the first `--root`.
+
+**Consolidated profile report.** When `scan --email me@company.com --report path.html` runs against a multi-repo dataset, the profile report renders a *Per-Repository Breakdown* section: commits, churn, files, active days, and share-of-total — all filtered to that developer's contributions (the files count reflects only files the dev touched). This is the one report that legitimately aggregates across repos; team-level views live in `--report-dir` (one HTML per repo, never mixed).
+
+**Flags worth knowing:**
+
+- `--parallel N` — repos extracted concurrently (defaults to a CPU-scaled value between 2 and 16). Extraction mixes CPU and disk I/O, so values far past NumCPU give diminishing returns.
+- `--max-depth N` — stop descending past N levels. Useful when a root contains a monorepo with deeply nested internal repos you don't want enumerated.
+- `--extract-ignore <pattern>` (repeatable) — forwarded to each per-repo `extract --ignore`, e.g. `--extract-ignore 'package-lock.json' --extract-ignore 'dist/*'`.
+- `--from / --to / --since` — time window applied to the consolidated report (same semantics as `report`).
+- `--churn-half-life`, `--coupling-max-files`, `--coupling-min-changes`, `--network-min-files` — tuning forwarded to the consolidated report, identical in meaning to the `gitcortex report` flags.
+
+Partial failures are non-fatal: the manifest records which repos failed, and the report is built from whichever JSONLs completed.
`Ctrl+C` aborts both the discovery walk and any in-flight extracts; re-running picks up from each repo's state file.
+
 ### Diff: compare time periods
 
 Compare stats between two time periods, or filter to a single period.
@@ -444,6 +511,8 @@ gitcortex report --input data.jsonl --email alice@company.com --output alice.htm
 Includes: summary cards, activity heatmap (with table toggle), top contributors, file hotspots, churn risk (with full-dataset label distribution strip above the truncated table), bus factor, file coupling, working patterns heatmap, top commits, developer network, and developer profiles. A collapsible glossary at the top defines the terms (bus factor, churn, legacy-hotspot, specialization, etc.) for readers who are not already familiar. Typical size: 50-500KB depending on number of contributors.
 
+When the input is multi-repo (from `gitcortex scan` or multiple `--input` files) AND `--email` is set, the profile report renders a *Per-Repository Breakdown* with commit/churn/files/active-days per repo, filtered to that developer's contributions. The team-view report intentionally omits this section — per-repo aggregates on a consolidated dataset reduce to the raw git-history distribution, which is more usefully inspected via `manifest.json` or `stats --input X.jsonl` per repo.
+
 > The HTML activity heatmap is always monthly (year × 12 months grid). For day/week/year buckets, use `gitcortex stats --stat activity --granularity <day|week|year>`.
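The per-repository breakdown falls out of the path prefixes naturally: in a multi-repo dataset every path carries a `repo:` prefix, so each breakdown column is a group-by on that prefix. A minimal sketch of the idea (illustrative only — `repoOf`, `fileChange`, and `churnByRepo` are hypothetical names, not gitcortex's real types):

```go
package main

import (
	"fmt"
	"strings"
)

// repoOf extracts the repo slug from a multi-repo path such as
// "auth:src/main.go". Single-repo datasets have no prefix, so the
// whole dataset falls into one "" bucket.
func repoOf(path string) string {
	if i := strings.Index(path, ":"); i >= 0 {
		return path[:i]
	}
	return ""
}

type fileChange struct {
	path                 string
	additions, deletions int
}

// churnByRepo groups one developer's file changes by repo slug and
// sums additions+deletions per bucket.
func churnByRepo(changes []fileChange) map[string]int {
	out := map[string]int{}
	for _, c := range changes {
		out[repoOf(c.path)] += c.additions + c.deletions
	}
	return out
}

func main() {
	changes := []fileChange{
		{"auth:src/main.go", 10, 2},
		{"payments:src/main.go", 5, 5},
		{"auth:src/login.go", 3, 0},
	}
	fmt.Println(churnByRepo(changes)) // map[auth:15 payments:10]
}
```

The same group-by serves commits, files, and active days — each is just a different value accumulated per prefix bucket.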
### CI: quality gates for pipelines @@ -483,9 +552,14 @@ internal/ parse.go Shared types (RawEntry, NumstatEntry) discard.go Malformed entry tracking extract/extract.go Extraction orchestration, state, JSONL writing + scan/ + scan.go Multi-repo orchestration (worker pool over extract) + discovery.go Directory walk, bare-repo detection, slug uniqueness + ignore.go Gitignore-style matcher with negation support stats/ - reader.go Streaming JSONL aggregator (single-pass) + reader.go Streaming JSONL aggregator (single-pass, multi-JSONL) stats.go Stat computations (9 stats) + repo_breakdown.go Per-repository aggregate (scan consolidated report) format.go Table/CSV/JSON output formatting ``` diff --git a/cmd/gitcortex/main.go b/cmd/gitcortex/main.go index 50267c4..256ec87 100644 --- a/cmd/gitcortex/main.go +++ b/cmd/gitcortex/main.go @@ -9,6 +9,8 @@ import ( "os" "os/signal" "path/filepath" + "runtime" + "sort" "strings" "syscall" "time" @@ -16,6 +18,7 @@ import ( "github.com/lex0c/gitcortex/internal/extract" "github.com/lex0c/gitcortex/internal/git" reportpkg "github.com/lex0c/gitcortex/internal/report" + "github.com/lex0c/gitcortex/internal/scan" "github.com/lex0c/gitcortex/internal/stats" "github.com/spf13/cobra" @@ -35,6 +38,7 @@ func main() { rootCmd.AddCommand(diffCmd()) rootCmd.AddCommand(ciCmd()) rootCmd.AddCommand(reportCmd()) + rootCmd.AddCommand(scanCmd()) if err := rootCmd.Execute(); err != nil { os.Exit(1) @@ -869,3 +873,455 @@ func reportCmd() *cobra.Command { return cmd } + +// defaultScanParallel picks an initial concurrency that scales with +// the host but caps the aggressive end: git extract is a mix of +// CPU (diff/numstat parsing) and disk I/O, so more than ~8 workers +// hit diminishing returns on typical SSDs and can thrash on +// spinning disks. The cap at 16 keeps 64-core servers from +// over-subscribing; the floor at 2 keeps single-core CI from +// serialising every scan. 
+func defaultScanParallel() int { + n := runtime.NumCPU() + if n > 16 { + n = 16 + } + if n < 2 { + n = 2 + } + return n +} + +// profileScanLabel builds the `RepoName` header string for +// `scan --report --email`. The profile dataset aggregates commits +// from every successful repo across all --root values, so titling +// the report after roots[0] alone would mislead readers into +// thinking the scope is a single root's work. +// +// Label policy: +// - 1 root → root basename (the default case, unchanged). +// - 2-3 roots → joined with " + " so the scope is visible at a glance. +// - 4+ roots → "N roots" because joining many basenames bloats +// the H1 and the exact set is available in the scan log and +// manifest anyway. +func profileScanLabel(roots []string) string { + switch len(roots) { + case 0: + return "scan" + case 1: + return filepath.Base(absPath(roots[0])) + case 2, 3: + parts := make([]string, len(roots)) + for i, r := range roots { + parts[i] = filepath.Base(absPath(r)) + } + return strings.Join(parts, " + ") + default: + return fmt.Sprintf("%d roots", len(roots)) + } +} + +// scanIndexStatusRank maps a ManifestRepo.Status to a sort bucket +// used by the index page ordering. Lower rank renders first; failed +// entries float to the top so operators spot them immediately, +// pending/skipped land in the middle, and ok entries make up the +// "everything else" tail ranked by commit count inside +// renderScanReportDir. +func scanIndexStatusRank(status string) int { + switch status { + case "failed": + return 0 + case "pending", "skipped": + return 1 + case "ok": + return 2 + default: + return 3 + } +} + +// renderScanReportDir is `scan --report-dir` in one place: for each +// successful repo in the manifest, load only that repo's JSONL as a +// standalone dataset, run the normal `report` pipeline into +// /.html, and afterward emit /index.html linking +// all of them. 
Intentionally uses LoadJSONL (single-input) rather +// than LoadMultiJSONL so each per-repo report's paths carry no +// `:` prefix — each tab of the experience is self-contained. +func renderScanReportDir(result *scan.Result, dir string, loadOpts stats.LoadOptions, sf stats.StatsFlags, topN int) error { + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("create report dir: %w", err) + } + + // Build an index-row slice parallel to Manifest.Repos so the + // landing page preserves scan order. We can't skip failed/pending + // entries — operators need to see what didn't produce a report. + entries := make([]reportpkg.ScanIndexEntry, 0, len(result.Manifest.Repos)) + var totalCommits int + allDevs := map[string]struct{}{} + maxCommits := 0 + + for _, m := range result.Manifest.Repos { + // Collapse "skipped" (worker picked the job up and saw + // ctx.Err before running extract) into "pending" for the + // index. The summary strip already buckets them together + // under PendingRepos; without normalizing here too, the + // per-row render would emit class="repo skipped" / + // status-skipped — CSS selectors the template doesn't + // define, so skipped rows lost the amber border and pill + // that make cancellation fallout easy to spot at a glance. + status := m.Status + if status == "skipped" { + status = "pending" + } + entry := reportpkg.ScanIndexEntry{ + Slug: m.Slug, + Path: m.Path, + Status: status, + Error: m.Error, + } + if m.Status != "ok" { + entries = append(entries, entry) + continue + } + + ds, err := stats.LoadJSONL(m.JSONL, loadOpts) + if err != nil { + // Treat load failure as a per-repo failure — index shows it, + // but the scan itself was already successful to that point. 
+ entry.Status = "failed" + entry.Error = fmt.Sprintf("load %s: %v", m.JSONL, err) + entries = append(entries, entry) + continue + } + + // All render-time errors for a single repo (create the output + // file, template.Execute) get demoted to an inline failure and + // the loop continues to the next repo — matching the policy + // already applied above for LoadJSONL. Aborting the batch on + // one bad render would discard every HTML already written and + // skip the index entirely, leaving operators staring at an + // incomplete directory with no way to discover the state. + outFile := filepath.Join(dir, m.Slug+".html") + f, err := os.Create(outFile) + if err != nil { + entry.Status = "failed" + entry.Error = fmt.Sprintf("create %s: %v", outFile, err) + entries = append(entries, entry) + continue + } + if err := reportpkg.Generate(f, ds, m.Slug, topN, sf); err != nil { + f.Close() + // Best-effort cleanup: leave nothing half-written on disk. + _ = os.Remove(outFile) + entry.Status = "failed" + entry.Error = fmt.Sprintf("render %s: %v", m.Slug, err) + entries = append(entries, entry) + continue + } + f.Close() + + summary := stats.ComputeSummary(ds) + entry.Commits = summary.TotalCommits + entry.Devs = summary.TotalDevs + entry.Files = summary.TotalFiles + entry.Churn = summary.TotalAdditions + summary.TotalDeletions + entry.FirstCommitDate = summary.FirstCommitDate + entry.LastCommitDate = summary.LastCommitDate + entry.LastCommitAgo, entry.RecencyBucket = reportpkg.HumanizeAgo(summary.LastCommitDate) + entry.ReportHref = m.Slug + ".html" + + totalCommits += entry.Commits + if entry.Commits > maxCommits { + maxCommits = entry.Commits + } + for _, email := range stats.DevEmails(ds) { + allDevs[email] = struct{}{} + } + + entries = append(entries, entry) + } + + // Order the index as a triage view rather than by discovery's + // alphabetical slug order. 
Failed entries come first (they need + // attention and their absence of metrics makes the red border + // the only signal), then pending/skipped, then ok entries + // ranked by commit count desc so the heaviest repos surface at + // the top of the "healthy" section. Slug asc is the tiebreaker + // in every bucket so scans with identical shapes still produce + // identical page order. + sort.SliceStable(entries, func(i, j int) bool { + a, b := scanIndexStatusRank(entries[i].Status), scanIndexStatusRank(entries[j].Status) + if a != b { + return a < b + } + if entries[i].Status == "ok" && entries[i].Commits != entries[j].Commits { + return entries[i].Commits > entries[j].Commits + } + return entries[i].Slug < entries[j].Slug + }) + + indexPath := filepath.Join(dir, "index.html") + f, err := os.Create(indexPath) + if err != nil { + return fmt.Errorf("create index: %w", err) + } + defer f.Close() + + // Pending and failed live in separate buckets: pending means the + // worker never reached the repo (user cancelled mid-scan, or a + // future scheduling change), while failed means the extract was + // attempted and broke. Conflating them in the summary strip would + // mislead operators reading a cancel-shaped failure as a real + // repo-level problem. 
+ var okCount, failedCount, pendingCount int + for _, e := range entries { + switch e.Status { + case "ok": + okCount++ + case "pending", "skipped": + pendingCount++ + default: + failedCount++ + } + } + + data := reportpkg.ScanIndexData{ + Repos: entries, + TotalRepos: len(entries), + OKRepos: okCount, + FailedRepos: failedCount, + PendingRepos: pendingCount, + TotalCommits: totalCommits, + TotalDevs: len(allDevs), + MaxCommits: maxCommits, + } + if err := reportpkg.GenerateScanIndex(f, data); err != nil { + return fmt.Errorf("generate index: %w", err) + } + + // Log line tracks the summary card's zero-suppression: only + // non-zero buckets appear, and the opener is whichever bucket + // has a count so an all-failed scan reads as "2 failed" instead + // of the misleading "0 ok, 2 failed". + var parts []string + if okCount > 0 { + parts = append(parts, fmt.Sprintf("%d ok", okCount)) + } + if failedCount > 0 { + parts = append(parts, fmt.Sprintf("%d failed", failedCount)) + } + if pendingCount > 0 { + parts = append(parts, fmt.Sprintf("%d pending", pendingCount)) + } + if len(parts) == 0 { + parts = []string{"0 repos"} // defensive; renderScanReportDir requires > 0 manifest entries to be reached at all + } + fmt.Fprintf(os.Stderr, "Per-repo reports written to %s (%s); open %s\n", + dir, strings.Join(parts, ", "), fileURL(indexPath)) + return nil +} + +// --- Scan --- + +func scanCmd() *cobra.Command { + var ( + roots []string + output string + ignoreFile string + maxDepth int + parallel int + email string + from string + to string + since string + reportPath string + reportDir string + topN int + extractIgnore []string + batchSize int + mailmap bool + firstParent bool + includeMessages bool + couplingMaxFiles int + couplingMinChanges int + churnHalfLife int + networkMinFiles int + ) + + cmd := &cobra.Command{ + Use: "scan", + Short: "Discover git repositories under one or more roots and consolidate their history", + Long: `Walk the given root(s), find every git 
repository, and run extract on each +repository in parallel. Outputs one JSONL per repo plus a manifest in --output. +Optionally generates a consolidated HTML report including a per-repository +breakdown — handy for showing aggregated work across many repos.`, + RunE: func(cmd *cobra.Command, args []string) error { + if len(roots) == 0 { + return fmt.Errorf("--root is required (repeatable for multiple roots)") + } + if since != "" && (from != "" || to != "") { + return fmt.Errorf("--since cannot be combined with --from/--to") + } + if err := validateDate(from, "--from"); err != nil { + return err + } + if err := validateDate(to, "--to"); err != nil { + return err + } + if from != "" && to != "" && from > to { + return fmt.Errorf("--from (%s) must be on or before --to (%s)", from, to) + } + // Report-flag combinations. --report is reserved for the + // only case that genuinely consolidates signal across repos + // — the per-developer profile; cross-repo team aggregation + // inflates hotspots/bus-factor/coupling with mixed-codebase + // noise and has been replaced by --report-dir which emits + // one standalone HTML per repo plus an index landing page. + if reportPath != "" && email == "" { + return fmt.Errorf("--report requires --email (profile consolidation); for per-repo HTML reports use --report-dir ") + } + if email != "" && reportPath == "" { + // Without this up-front check, --email alone would + // run the full scan, reach the profile branch, and + // only then call os.Create("") — surfacing a cryptic + // `open : no such file or directory` after the slow + // multi-repo extract phase had already completed. 
+ return fmt.Errorf("--email requires --report pointing at the profile HTML output path") + } + if reportPath != "" && reportDir != "" { + return fmt.Errorf("--report and --report-dir are mutually exclusive; pick profile (--report + --email) or per-repo (--report-dir)") + } + if email != "" && reportDir != "" { + return fmt.Errorf("--email filters a single consolidated profile report; combine it with --report , not --report-dir") + } + // Resolve --since up-front. If this fails we'd otherwise + // discover the typo only after a full multi-repo scan — + // minutes to hours on a large workspace, all thrown away + // because the user mistyped `--since 1yy`. Validate early, + // fail fast, keep the result for the report stage. + fromDate := from + if since != "" { + d, err := parseSince(since) + if err != nil { + return err + } + fromDate = d + } + + cfg := scan.Config{ + Roots: roots, + Output: output, + IgnoreFile: ignoreFile, + MaxDepth: maxDepth, + Parallel: parallel, + Extract: extract.Config{ + BatchSize: batchSize, + IncludeMessages: includeMessages, + CommandTimeout: extract.DefaultCommandTimeout, + FirstParent: firstParent, + Mailmap: mailmap, + IgnorePatterns: extractIgnore, + StartOffset: -1, + }, + } + + ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM) + defer stop() + + result, err := scan.Run(ctx, cfg) + // scan.Run returns a partial result alongside ctx.Err() on + // cancellation. Honor that — write whatever progress we made + // to disk and surface the error so the CLI exits non-zero. + if err != nil { + return err + } + + // Exit non-zero when every repo failed to extract, regardless + // of whether a report was requested. scan.Run intentionally + // treats per-repo failures as non-fatal so a transient error + // on one repo doesn't tank the whole batch — but when the + // count of successes is zero, there's nothing useful on disk + // to inspect. 
Without this guard CI sees exit code 0 and + // continues with empty artifacts; the manifest's per-repo + // Status fields remain the only way to see the failures. + if len(result.JSONLs) == 0 { + return fmt.Errorf("scan found %d repositor(ies) but all extracts failed; see %s for per-repo status", + len(result.Manifest.Repos), + filepath.Join(result.OutputDir, "manifest.json")) + } + + sf := stats.StatsFlags{CouplingMinChanges: couplingMinChanges, NetworkMinFiles: networkMinFiles} + loadOpts := stats.LoadOptions{ + From: fromDate, + To: to, + HalfLifeDays: churnHalfLife, + CoupMaxFiles: couplingMaxFiles, + } + + // Path 1: profile consolidation — the one cross-repo + // aggregation that earns its place (a single dev's work + // across all scanned repos). Build a multi-repo Dataset + // (path-prefixed via LoadMultiJSONL) and filter by email + // at render time. + if email != "" { + ds, err := stats.LoadMultiJSONL(result.JSONLs, loadOpts) + if err != nil { + return fmt.Errorf("load consolidated dataset: %w", err) + } + fmt.Fprintf(os.Stderr, "Loaded %d commits across %d repo(s)\n", ds.CommitCount, len(result.JSONLs)) + + f, err := os.Create(reportPath) + if err != nil { + return fmt.Errorf("create report: %w", err) + } + defer f.Close() + + repoLabel := profileScanLabel(cfg.Roots) + if err := reportpkg.GenerateProfile(f, ds, repoLabel, email); err != nil { + return fmt.Errorf("generate profile report: %w", err) + } + fmt.Fprintf(os.Stderr, "Profile report for %s written to %s\n", email, fileURL(reportPath)) + return nil + } + + // Path 2: per-repo reports + index. Each repo is loaded + // standalone so its stats reflect its own codebase — no + // hotspot/bus-factor/coupling mixing across unrelated + // projects. The index.html lists every repo with cards + // linking into each per-repo report; failed extracts are + // surfaced inline so operators can spot them without + // opening the manifest. 
+ if reportDir != "" { + return renderScanReportDir(result, reportDir, loadOpts, sf, topN) + } + + // Path 3: no report flag — just confirm JSONLs landed. + fmt.Fprintf(os.Stderr, "Scan complete: %d JSONL file(s) in %s\n", len(result.JSONLs), result.OutputDir) + return nil + }, + } + + cmd.Flags().StringSliceVar(&roots, "root", nil, "Root directory to walk for repositories (repeatable)") + cmd.Flags().StringVar(&output, "output", "scan-output", "Directory to write per-repo JSONL files and the manifest") + cmd.Flags().StringVar(&ignoreFile, "ignore-file", "", "Gitignore-style file with directories to skip during discovery. When unset, only the first --root is searched for a .gitcortex-ignore; pass an explicit path to apply rules across all roots.") + cmd.Flags().IntVar(&maxDepth, "max-depth", 0, "Maximum directory depth to descend into when looking for repos (0 = unlimited)") + cmd.Flags().IntVar(¶llel, "parallel", defaultScanParallel(), "Number of repositories to extract in parallel") + cmd.Flags().StringVar(&email, "email", "", "Generate a per-developer profile report (only when --report is set)") + cmd.Flags().StringVar(&from, "from", "", "Window start date YYYY-MM-DD (forwarded to the consolidated report)") + cmd.Flags().StringVar(&to, "to", "", "Window end date YYYY-MM-DD (forwarded to the consolidated report)") + cmd.Flags().StringVar(&since, "since", "", "Filter to recent period (e.g. 7d, 4w, 3m, 1y); mutually exclusive with --from/--to") + cmd.Flags().StringVar(&reportPath, "report", "", "With --email, write a developer profile consolidated across all scanned repos to this HTML file. For per-repo reports use --report-dir instead.") + cmd.Flags().StringVar(&reportDir, "report-dir", "", "Directory to write one standalone HTML report per repo plus an index.html landing page. 
Each per-repo report is equivalent to running `gitcortex report` on that repo alone — no cross-repo metric mixing.") + cmd.Flags().IntVar(&topN, "top", 20, "Top-N entries per section in each per-repo report") + cmd.Flags().StringSliceVar(&extractIgnore, "extract-ignore", nil, "Glob patterns forwarded to per-repo extract --ignore (e.g. package-lock.json)") + cmd.Flags().IntVar(&batchSize, "batch-size", 1000, "Per-repo extract checkpoint interval") + cmd.Flags().BoolVar(&mailmap, "mailmap", false, "Use .mailmap (per repo) to normalize identities") + cmd.Flags().BoolVar(&firstParent, "first-parent", false, "Restrict extracts to the first-parent chain") + cmd.Flags().BoolVar(&includeMessages, "include-commit-messages", false, "Include commit messages in JSONL (needed for Top Commits in the consolidated report)") + cmd.Flags().IntVar(&couplingMaxFiles, "coupling-max-files", 50, "Max files per commit for coupling analysis (consolidated report)") + cmd.Flags().IntVar(&couplingMinChanges, "coupling-min-changes", 5, "Min co-changes for coupling results (consolidated report)") + cmd.Flags().IntVar(&churnHalfLife, "churn-half-life", 90, "Half-life in days for churn decay (consolidated report)") + cmd.Flags().IntVar(&networkMinFiles, "network-min-files", 5, "Min shared files for dev-network edges (consolidated report)") + + return cmd +} diff --git a/cmd/gitcortex/main_test.go b/cmd/gitcortex/main_test.go index 57410bf..2795e2b 100644 --- a/cmd/gitcortex/main_test.go +++ b/cmd/gitcortex/main_test.go @@ -1,10 +1,565 @@ package main import ( + "bytes" + "fmt" + "os" + "os/exec" + "path/filepath" "strings" "testing" + + "github.com/lex0c/gitcortex/internal/scan" + "github.com/lex0c/gitcortex/internal/stats" ) +// scanCmd must validate --since BEFORE running the discovery walk +// and extract pool. 
Without the early check, an obvious typo like +// `--since 1yy` only surfaces after scan.Run has already walked +// every root and extracted every repo found — which can take +// minutes-to-hours on a large workspace and waste the work. +// +// Use an empty TempDir so scan.Run would fail fast with "no git +// repositories found" if we reached it — that error does not mention +// "since", so an error containing "since" proves the early +// validation fired first. +func TestScanCmd_ValidatesSinceBeforeScanning(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--since", "bogus", + }) + // Swallow any stderr output cobra might emit so we don't pollute + // go-test logs on success. + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for --since bogus, got nil") + } + if !strings.Contains(err.Error(), "since") { + t.Errorf("expected error to mention --since (proving early validation), got %q", err) + } + // Reaching scan.Run on an empty TempDir would produce the + // discovery error; asserting its absence confirms the walk was + // not started. + if strings.Contains(err.Error(), "no git repositories found") { + t.Errorf("scan.Run was reached before --since was validated — discovery ran on an invalid-flag input: %q", err) + } +} + +// scanCmd must exit non-zero when every discovered repo's extract +// failed, regardless of whether --report was requested. scan.Run +// intentionally treats per-repo failures as non-fatal (a transient +// error on one repo shouldn't tank the whole batch), but if ZERO +// repos succeed there's nothing useful on disk and CI should know. +// Previously the no-report branch returned success unconditionally; +// automation saw exit 0 and "Scan complete: 0 JSONL file(s)" and +// continued with empty artifacts. 
+// +// Test uses a fake repo (a bare `.git` dir with no history) so +// discovery picks it up but extract fails. +func TestScanCmd_ExitsNonZeroWhenAllExtractsFail(t *testing.T) { + root := t.TempDir() + if err := os.MkdirAll(filepath.Join(root, "fake-repo", ".git"), 0o755); err != nil { + t.Fatal(err) + } + + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", root, + "--output", t.TempDir(), + }) + + err := cmd.Execute() + if err == nil { + t.Fatal("expected non-zero exit when all extracts fail; CI would silently treat this as success") + } + if !strings.Contains(err.Error(), "all extracts failed") { + t.Errorf("error should explain every repo failed; got %q", err) + } +} + +// `scan --ignore-file /nonexistent` must fail at the CLI layer — +// silently falling back to zero rules would widen discovery scope +// without telling the user (e.g. node_modules and vendor/ dirs the +// user thought they excluded suddenly appear in the report). The +// default-path lookup in scan.loadMatcher still tolerates a missing +// `.gitcortex-ignore` at the first root; only the explicit flag is +// strict. +func TestScanCmd_RejectsMissingIgnoreFile(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--ignore-file", filepath.Join(t.TempDir(), "typo.ignore"), + }) + + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for missing --ignore-file; a typo must not silently widen scope") + } + if !strings.Contains(err.Error(), "typo.ignore") { + t.Errorf("error should identify the missing file; got %q", err) + } + // Must fail BEFORE discovery starts — scan on an empty TempDir + // would otherwise produce "no git repositories found". 
+ if strings.Contains(err.Error(), "no git repositories found") { + t.Errorf("ignore-file check ran after discovery; got %q", err) + } +} + +// --email without --report would silently reach os.Create("") after +// the full scan/extract phase finished, surfacing a cryptic "open : +// no such file or directory" error minutes into a wasted run. The +// flag pair is mandatory in both directions: reject up front +// alongside the other flag-combination checks. +func TestScanCmd_RejectsEmailWithoutReport(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--email", "me@x.com", + }) + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for --email without --report; silent fall-through would waste a full scan before failing") + } + if !strings.Contains(err.Error(), "--report") { + t.Errorf("error should name --report as the missing flag; got %q", err) + } + // Must fail BEFORE discovery starts; empty TempDir would + // otherwise produce "no git repositories found". + if strings.Contains(err.Error(), "no git repositories found") { + t.Errorf("flag validation ran after scan; got %q", err) + } +} + +// --report without --email is ambiguous under the "no team-wide +// consolidation" design: the only meaningful single-HTML output +// across multiple repos is a developer profile. The error must +// point the user at the right flag for each intent. 
+func TestScanCmd_RejectsReportWithoutEmail(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--report", filepath.Join(t.TempDir(), "out.html"), + }) + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for --report without --email") + } + if !strings.Contains(err.Error(), "--report-dir") || !strings.Contains(err.Error(), "--email") { + t.Errorf("error should point at both --report-dir and --email as remediation; got %q", err) + } +} + +// `scan --report-dir ` writes index.html plus one HTML per +// successful repo, with each per-repo HTML rendered from a +// single-input Dataset (no cross-repo path prefix). End-to-end +// check against two real repos built with git. +func TestScanCmd_ReportDirGeneratesPerRepoHTMLAndIndex(t *testing.T) { + if _, err := exec.LookPath("git"); err != nil { + t.Skip("git not installed") + } + root := t.TempDir() + makeRepoWithCommit(t, filepath.Join(root, "alpha"), "a.go", "package a\n") + makeRepoWithCommit(t, filepath.Join(root, "beta"), "b.py", "print('b')\n") + + scanOut := t.TempDir() + reportDir := t.TempDir() + + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", root, + "--output", scanOut, + "--report-dir", reportDir, + }) + if err := cmd.Execute(); err != nil { + t.Fatalf("scan --report-dir: %v", err) + } + + // Every successful repo gets a standalone HTML named .html. 
+	for _, slug := range []string{"alpha", "beta"} {
+		path := filepath.Join(reportDir, slug+".html")
+		b, err := os.ReadFile(path)
+		if err != nil {
+			t.Fatalf("per-repo report missing: %v", err)
+		}
+		if !strings.Contains(string(b), "</html>") {
+			t.Errorf("%s should be a full HTML document", path)
+		}
+		// Paths inside the per-repo report must NOT carry a
+		// `<slug>:` prefix — each tab is standalone, so LoadJSONL
+		// (not LoadMultiJSONL) was used and file paths appear raw.
+		if strings.Contains(string(b), slug+":") {
+			t.Errorf("per-repo report %s contains a `%s:` path prefix; Dataset was loaded as multi-repo", path, slug)
+		}
+	}
+
+	// index.html links to each per-repo report.
+	indexBytes, err := os.ReadFile(filepath.Join(reportDir, "index.html"))
+	if err != nil {
+		t.Fatalf("index.html missing: %v", err)
+	}
+	index := string(indexBytes)
+	if !strings.Contains(index, `href="alpha.html"`) || !strings.Contains(index, `href="beta.html"`) {
+		t.Errorf("index.html should link to both per-repo reports; got body excerpt: %.300s", index)
+	}
+	// The summary card strip is now the top-of-page anchor (no H1).
+	// Asserting on the "Repositories" label keeps the test meaningful
+	// if the card layout changes.
+	if !strings.Contains(index, "Repositories") {
+		t.Errorf("index.html missing summary strip")
+	}
+}
+
+// renderScanReportDir must render the index for a manifest that
+// mixes ok / failed / pending statuses. Failed entries show their
+// error message, pending entries render with the pending pill, and
+// neither inflates the "failed" count in the summary card.
+func TestRenderScanReportDir_MixedStatuses(t *testing.T) { + dir := t.TempDir() + okJSONL := filepath.Join(dir, "good.jsonl") + jsonlContent := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-01T00:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"a.go","additions":10,"deletions":0} +` + if err := os.WriteFile(okJSONL, []byte(jsonlContent), 0o644); err != nil { + t.Fatal(err) + } + + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: dir, + Manifest: scan.Manifest{ + Roots: []string{"/fake/root"}, + Repos: []scan.ManifestRepo{ + {Slug: "good", Path: "/fake/root/good", Status: "ok", JSONL: okJSONL}, + {Slug: "broken", Path: "/fake/root/broken", Status: "failed", Error: "simulated extract crash"}, + {Slug: "skipped", Path: "/fake/root/skipped", Status: "pending"}, + }, + }, + } + + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatalf("renderScanReportDir returned error despite inline-failure policy: %v", err) + } + + // OK entry's report file exists; failed/pending do not. + if _, err := os.Stat(filepath.Join(reportDir, "good.html")); err != nil { + t.Errorf("good.html missing: %v", err) + } + for _, unwanted := range []string{"broken.html", "skipped.html"} { + if _, err := os.Stat(filepath.Join(reportDir, unwanted)); err == nil { + t.Errorf("%s should NOT be written — status was not ok", unwanted) + } + } + + indexB, err := os.ReadFile(filepath.Join(reportDir, "index.html")) + if err != nil { + t.Fatalf("index.html: %v", err) + } + index := string(indexB) + + // Summary card splits failed and pending correctly: "(1 failed)" + // AND "(1 pending)" appear, not "(2 failed)". 
+ if strings.Contains(index, "(2 failed)") { + t.Error("summary miscounted pending as failed — `(2 failed)` appeared instead of separate buckets") + } + if !strings.Contains(index, "1 failed") { + t.Errorf("summary should show `1 failed`; index excerpt: %.600s", index) + } + if !strings.Contains(index, "1 pending") { + t.Errorf("summary should show `1 pending`; index excerpt: %.600s", index) + } + + // Failed entry's error bubbles up to the card. + if !strings.Contains(index, "simulated extract crash") { + t.Error("failed entry's error message missing from index") + } + + // Status pills render only for non-ok entries (ok is implicit + // given the link itself and the presence of metric cells). Assert + // the two non-ok pills exist and the ok one does not. + for _, want := range []string{"status-failed", "status-pending"} { + if !strings.Contains(index, want) { + t.Errorf("index missing %q pill — status branch lost its style", want) + } + } + if strings.Contains(index, `class="status-pill status-ok"`) { + t.Error("ok entries should not render a status pill — redundant with the link and metric cells") + } + + // Only OK entries carry an href; failed/pending must not link to + // a file that doesn't exist. + if strings.Contains(index, `href="broken.html"`) || strings.Contains(index, `href="skipped.html"`) { + t.Error("index links to a non-existent per-repo HTML for a non-ok entry") + } +} + +// Index order is a triage view: failed first (actionable), then +// pending (skipped work, less urgent), then ok ranked by commit +// count desc (heaviest healthy repos first). Slug asc breaks ties +// everywhere. +func TestRenderScanReportDir_IndexOrderFailedFirstThenByCommits(t *testing.T) { + dir := t.TempDir() + // One JSONL with 50 commits, one with 5 — lets us see the + // ok-bucket reorder from the "small -> big" alphabetical slug + // order to "big -> small" commit-count order. 
+ bigJSONL := filepath.Join(dir, "big.jsonl") + smallJSONL := filepath.Join(dir, "small.jsonl") + if err := os.WriteFile(bigJSONL, []byte(manyCommitsJSONL(50)), 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(smallJSONL, []byte(manyCommitsJSONL(5)), 0o644); err != nil { + t.Fatal(err) + } + + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: dir, + Manifest: scan.Manifest{ + Repos: []scan.ManifestRepo{ + // Slug asc would give: big, broken, pending, small. + // Triage order expected: broken (failed), pending, big (50c), small (5c). + {Slug: "big", Path: "/x/big", Status: "ok", JSONL: bigJSONL}, + {Slug: "broken", Path: "/x/broken", Status: "failed", Error: "explode"}, + {Slug: "pending", Path: "/x/pending", Status: "pending"}, + {Slug: "small", Path: "/x/small", Status: "ok", JSONL: smallJSONL}, + }, + }, + } + + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatal(err) + } + + b, err := os.ReadFile(filepath.Join(reportDir, "index.html")) + if err != nil { + t.Fatal(err) + } + out := string(b) + + // Each card renders `
+	// {{.Path}}` with the
+	// unique filesystem path. Finding each path's offset gives the
+	// render order; ok entries also include an anchor but failed /
+	// pending do not, so the `.path` cell is the stable locator.
+	idxOf := func(p string) int { return strings.Index(out, p) }
+	brokenIdx := idxOf("/x/broken")
+	pendingIdx := idxOf("/x/pending")
+	bigIdx := idxOf("/x/big")
+	smallIdx := idxOf("/x/small")
+	for name, i := range map[string]int{"broken": brokenIdx, "pending": pendingIdx, "big": bigIdx, "small": smallIdx} {
+		if i < 0 {
+			t.Fatalf("%q path not found in index output", name)
+		}
+	}
+	if !(brokenIdx < pendingIdx && pendingIdx < bigIdx && bigIdx < smallIdx) {
+		t.Errorf("triage order failed: broken=%d pending=%d big=%d small=%d — want broken < pending < big < small",
+			brokenIdx, pendingIdx, bigIdx, smallIdx)
+	}
+}
+
+// manyCommitsJSONL returns n valid commit+file rows, each with a
+// distinct SHA so LoadJSONL counts them as N commits.
+func manyCommitsJSONL(n int) string {
+	var sb strings.Builder
+	for i := 0; i < n; i++ {
+		sha := fmt.Sprintf("%040x", i+1)
+		fmt.Fprintf(&sb,
+			`{"type":"commit","sha":"%s","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-%02dT00:00:00Z","additions":10,"deletions":0,"files_changed":1}`+"\n",
+			sha, (i%28)+1)
+		fmt.Fprintf(&sb,
+			`{"type":"commit_file","commit":"%s","path_current":"f%d.go","additions":10,"deletions":0}`+"\n",
+			sha, i)
+	}
+	return sb.String()
+}
+
+// Regression: scan emits Status="skipped" for repos the worker
+// picked up but abandoned when ctx.Err fired (see scan.runOne).
+// The index template only styles .repo.pending / .status-pending;
+// a raw "skipped" status would produce class="repo skipped" with
+// no CSS match, hiding the amber border and pill that make
+// cancellation fallout triageable. The renderer must fold skipped
+// into pending so the row visuals stay consistent with the
+// summary-strip count (which already buckets them together).
+func TestRenderScanReportDir_NormalizesSkippedToPending(t *testing.T) { + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: t.TempDir(), + Manifest: scan.Manifest{ + Repos: []scan.ManifestRepo{ + {Slug: "a", Path: "/x/a", Status: "skipped", Error: "context canceled"}, + }, + }, + } + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatal(err) + } + indexB, err := os.ReadFile(filepath.Join(reportDir, "index.html")) + if err != nil { + t.Fatal(err) + } + index := string(indexB) + + // Summary counts skipped under pending (already the case, but + // anchor it so a future refactor doesn't drift). + if !strings.Contains(index, "1 pending") { + t.Errorf("skipped entry should count as pending in summary; got index excerpt: %.400s", index) + } + // Row CSS must use the pending selector — a raw "skipped" class + // has no style defined so the row would lose its warning affordance. + if !strings.Contains(index, `class="repo pending"`) { + t.Error("skipped entry's card must render with `class=\"repo pending\"`") + } + if strings.Contains(index, `class="repo skipped"`) { + t.Error("raw `skipped` leaked into CSS class — template selectors don't match it") + } + // Pill class must also normalize. + if !strings.Contains(index, `status-pending`) { + t.Error("skipped entry's status pill should read as `status-pending`") + } + if strings.Contains(index, `status-skipped`) { + t.Error("raw `skipped` leaked into pill class") + } +} + +// Render with zero successful repos must still emit an index +// without tripping on the MaxCommits==0 guard in the bar-width +// template expression. 
+func TestRenderScanReportDir_AllFailedStillEmitsIndex(t *testing.T) { + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: t.TempDir(), + Manifest: scan.Manifest{ + Repos: []scan.ManifestRepo{ + {Slug: "a", Status: "failed", Error: "boom"}, + {Slug: "b", Status: "failed", Error: "also boom"}, + }, + }, + } + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatalf("render should not fail on all-failed manifest: %v", err) + } + if _, err := os.Stat(filepath.Join(reportDir, "index.html")); err != nil { + t.Fatalf("index.html must be emitted even when every repo failed: %v", err) + } +} + +// makeRepoWithCommit initializes a git repo with one file and one +// commit using a deterministic identity. +func makeRepoWithCommit(t *testing.T, dir, file, contents string) { + t.Helper() + if err := os.MkdirAll(dir, 0o755); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(filepath.Join(dir, file), []byte(contents), 0o644); err != nil { + t.Fatal(err) + } + run := func(args ...string) { + cmd := exec.Command("git", args...) + cmd.Dir = dir + cmd.Env = append(os.Environ(), + "GIT_AUTHOR_NAME=Test", + "GIT_AUTHOR_EMAIL=test@example.com", + "GIT_COMMITTER_NAME=Test", + "GIT_COMMITTER_EMAIL=test@example.com", + ) + out, err := cmd.CombinedOutput() + if err != nil { + t.Fatalf("git %s: %v\n%s", strings.Join(args, " "), err, out) + } + } + run("init", "-q", "-b", "main") + run("add", ".") + run("commit", "-q", "-m", "initial") +} + +// The profile report title used to echo roots[0] even when the +// scan aggregated many roots, so "gitcortex scan --root ~/work +// --root ~/oss --report me.html --email me@x.com" produced a +// report titled "· work" that readers interpreted as belonging +// only to the first root. Ensure the label reflects the actual +// scope. 
+func TestProfileScanLabel(t *testing.T) { + cases := []struct { + name string + roots []string + want string + }{ + {"no roots", nil, "scan"}, + {"single root uses basename", []string{"/home/me/work"}, "work"}, + // Two and three-root scans join with " + " so the scope is + // visible at a glance rather than being summarised as a number. + {"two roots joined", []string{"/home/me/work", "/home/me/personal"}, "work + personal"}, + {"three roots joined", []string{"/a/one", "/b/two", "/c/three"}, "one + two + three"}, + // Large scans fall back to a count — joining 50 basenames + // would bloat the H1; the exact set lives in the scan log + // and manifest. + {"many roots use count", []string{"/a/x", "/b/x", "/c/x", "/d/x", "/e/x"}, "5 roots"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + if got := profileScanLabel(c.roots); got != c.want { + t.Errorf("profileScanLabel(%v) = %q, want %q", c.roots, got, c.want) + } + }) + } +} + +// defaultScanParallel must stay within [2, 16] regardless of host +// NumCPU. The floor protects single-core CI from serialising every +// scan; the cap prevents 64-core servers from over-subscribing +// when git I/O can't actually use that much concurrency. +func TestDefaultScanParallel(t *testing.T) { + got := defaultScanParallel() + if got < 2 || got > 16 { + t.Errorf("defaultScanParallel() = %d, want between 2 and 16 inclusive", got) + } +} + func TestValidateDate(t *testing.T) { cases := []struct { in string diff --git a/docs/METRICS.md b/docs/METRICS.md index 7ed473e..72b9fff 100644 --- a/docs/METRICS.md +++ b/docs/METRICS.md @@ -294,6 +294,29 @@ A `tree(1)`-style view of the repository's directory layout, built from paths se **When to use**: before drilling into hotspots or churn-risk, skim the structure to locate the modules those files live in. The tree is navigational context; ranked tables are where judgment happens. +## Per-Repository Breakdown + +Cross-repo aggregation scoped to a single developer. 
Renders in the HTML profile report (`report --email me@x.com` or `scan --email me@x.com --report …`) when the dataset was loaded from more than one JSONL. The metric is deliberately profile-only: on a team-view consolidated report, per-repo aggregates reduce to raw git-history distribution and are better inspected via `manifest.json` or by running `stats --input X.jsonl` per-repo. Filtered by a developer's email, the same numbers answer a genuinely different question — *"where did this person spend their time?"* — which is what the section surfaces.
+
+One row per repo:
+
+| column | meaning |
+|---|---|
+| `Commits` | author-dated commit count in this repo |
+| `% Commits` | share of total commits in the dataset |
+| `Churn` | additions + deletions attributed to this repo |
+| `% Churn` | share of total churn |
+| `Files` | distinct files the developer touched in this repo (always email-filtered, since the section only renders on profile reports) |
+| `Active days` | distinct UTC author-dates |
+| `Devs` | unique author emails in this repo |
+| `First → Last` | earliest and latest author-date |
+
+**How the repo label is derived**: `LoadMultiJSONL` prefixes every path in the dataset with `<repo>:` (so `WordPress.git.jsonl` contributes paths like `WordPress.git:wp-includes/foo.php`). The breakdown groups by that prefix. If only a single JSONL is loaded, no prefix is emitted and the breakdown collapses to a single `(repo)` row, which the HTML report hides.
+
+**Divergence between `% Commits` and `% Churn` is informative**: a repo dominating churn while holding modest commit share often signals large-content work (docs, data), while the reverse points to small-diff high-frequency repos (config, manifests).
+
+**SHA collisions across repos** (forks, mirrors, cherry-picks between sibling projects) are preserved here — the breakdown tracks commits per repo via a dedicated slice populated at ingest, not the SHA-keyed commit map.
Other file-level metrics (bus factor, coupling, dev network) still key by SHA and will collapse collided commits onto one record; if exact attribution matters for those, scan and aggregate the sibling repos separately. + ## Data Flow ``` @@ -364,6 +387,8 @@ Directory-segment heuristics (`vendor`, `node_modules`, `dist`, `build`, `third_ The warning is advisory. Nothing is auto-filtered; the user decides whether to re-extract. Matches do not affect computed stats in that run. JSON/CSV output paths skip the warning since they're typically piped. +On multi-JSONL loads (e.g. `stats --input a.jsonl --input b.jsonl` or a `gitcortex scan` dataset), paths carry a `:` prefix internally. The suspect detector strips that prefix before matching and suggesting, so root-level `vendor/`, `package-lock.json`, `go.sum`, etc. in any individual repo are detected and the emitted `--ignore` globs are repo-relative (drop-in for `extract --ignore` and `scan --extract-ignore`). Same-shape findings across repos collapse to one suggestion (`dist/*` applies everywhere rather than being listed once per repo). + Statistical heuristics (very high churn-per-commit, single-author bulk updates) are deliberately out of scope — their false-positive rate on hand-authored code is higher than the path-based list and we'd rather stay quiet than cry wolf. ### `--mailmap` off by default diff --git a/internal/report/profile_template.go b/internal/report/profile_template.go index 5bb8f0f..5fedc3a 100644 --- a/internal/report/profile_template.go +++ b/internal/report/profile_template.go @@ -134,6 +134,42 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{end}} +{{if gt (len .Repos) 1}} +

Per-Repository Breakdown {{thousands (len .Repos)}} repositories

+

How {{.Profile.Name}}'s work is split across repositories in this scan. Use this to point at the projects with the most activity, or to spot single-repo focus vs. broad multi-repo engagement.

+ + + + + + + + + + + +{{range .Repos}} + + + + + + + + + + +{{end}} +
RepositoryCommits% of My CommitsChurn% of My ChurnFilesActive daysFirst → Last
{{.Repo}}{{thousands .Commits}} +
+
+
+
+ {{printf "%.1f" .PctOfTotalCommits}}% +
+
{{thousands .Churn}}{{printf "%.1f" .PctOfTotalChurn}}%{{thousands .Files}}{{.ActiveDays}}{{.FirstCommitDate}} → {{.LastCommitDate}}
+{{end}} + {{if .Profile.TopFiles}}

Top Files

Files this developer changed most (churn = additions + deletions). High churn on few files suggests deep ownership and potential knowledge concentration. · {{docRef "hotspots"}}

diff --git a/internal/report/report.go b/internal/report/report.go index 80e875f..24c982f 100644 --- a/internal/report/report.go +++ b/internal/report/report.go @@ -641,6 +641,13 @@ type ProfileReportData struct { MaxActivityCommits int PatternGrid [7][24]int MaxPattern int + + // Repos is the per-repository breakdown filtered to this developer's + // commits. Empty on single-repo profile reports — gated in the + // template so existing single-repo callers see no change. The headline + // use case for `gitcortex scan --email me` lives here: the developer + // can see at a glance which repos they spent time in. + Repos []stats.RepoStat } func GenerateProfile(w io.Writer, ds *stats.Dataset, repoName, email string) error { @@ -672,6 +679,7 @@ func GenerateProfile(w io.Writer, ds *stats.Dataset, repoName, email string) err MaxActivityCommits: maxAct, PatternGrid: p.WorkGrid, MaxPattern: maxP, + Repos: stats.RepoBreakdown(ds, email), } return profileTmpl.Execute(w, data) diff --git a/internal/report/report_test.go b/internal/report/report_test.go index 2b4c826..45a91eb 100644 --- a/internal/report/report_test.go +++ b/internal/report/report_test.go @@ -5,6 +5,7 @@ import ( "fmt" "os" "path/filepath" + "regexp" "strings" "testing" @@ -77,6 +78,117 @@ func TestGenerate_SmokeRender(t *testing.T) { } } +// The Per-Repository Breakdown is a consolidation metric about the +// developer's work, not about the repository set. It belongs on the +// profile (--email) report where "3 of my commits in auth, 50 in +// payments" tells a story; on the team report it would just restate +// git-history distribution, which is tautological. Assert the team +// render does NOT surface the section even with a multi-repo dataset. 
+func TestGenerate_TeamReportOmitsPerRepoBreakdown(t *testing.T) { + dir := t.TempDir() + alpha := filepath.Join(dir, "alpha.jsonl") + beta := filepath.Join(dir, "beta.jsonl") + alphaRow := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-01T00:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"a.go","additions":10,"deletions":0} +` + betaRow := `{"type":"commit","sha":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb","author_email":"me@x.com","author_name":"Me","author_date":"2024-02-01T00:00:00Z","additions":20,"deletions":5,"files_changed":1} +{"type":"commit_file","commit":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb","path_current":"b.go","additions":20,"deletions":5} +` + if err := os.WriteFile(alpha, []byte(alphaRow), 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(beta, []byte(betaRow), 0o644); err != nil { + t.Fatal(err) + } + ds, err := stats.LoadMultiJSONL([]string{alpha, beta}) + if err != nil { + t.Fatalf("LoadMultiJSONL: %v", err) + } + var buf bytes.Buffer + if err := Generate(&buf, ds, "scan-fixture", 10, stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}); err != nil { + t.Fatalf("Generate: %v", err) + } + out := buf.String() + if strings.Contains(out, "Per-Repository Breakdown") { + t.Error("team report should not render the Per-Repository Breakdown; it's a profile-only metric") + } + // Sanity: the scan JSONL prefix still shows up in file paths etc. + // — we didn't accidentally strip the repo context from the whole + // report. + if !strings.Contains(out, "alpha:") && !strings.Contains(out, "beta:") { + t.Error("expected repo-prefixed paths to appear somewhere in the team report; none found") + } +} + +// End-to-end for the `gitcortex scan --email me@x.com --report …` +// flow. Covers three assertions at once: +// 1. 
GenerateProfile emits the Per-Repository Breakdown section +// when the dataset is multi-repo (gated on len(Repos) > 1). +// 2. Counts per repo are filtered to the dev — a commit by +// someone-else@x.com in alpha doesn't bleed into my profile's +// alpha row. +// 3. Files counted per repo are only files THIS dev touched — a +// colleague-exclusive file in alpha must not inflate my scope. +func TestGenerateProfile_MultiRepoBreakdownFiltersByEmail(t *testing.T) { + dir := t.TempDir() + alpha := filepath.Join(dir, "alpha.jsonl") + beta := filepath.Join(dir, "beta.jsonl") + + // alpha: me has 1 commit on a.go; colleague has 1 commit on + // colleague-only.go. beta: me has 1 commit on b.go. + alphaContent := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa01","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-10T00:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa01","path_current":"a.go","additions":10,"deletions":0} +{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa02","author_email":"colleague@x.com","author_name":"Col","author_date":"2024-01-11T00:00:00Z","additions":5,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa02","path_current":"colleague-only.go","additions":5,"deletions":0} +` + betaContent := `{"type":"commit","sha":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb01","author_email":"me@x.com","author_name":"Me","author_date":"2024-02-10T00:00:00Z","additions":20,"deletions":5,"files_changed":1} +{"type":"commit_file","commit":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb01","path_current":"b.go","additions":20,"deletions":5} +` + if err := os.WriteFile(alpha, []byte(alphaContent), 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(beta, []byte(betaContent), 0o644); err != nil { + t.Fatal(err) + } + + ds, err := stats.LoadMultiJSONL([]string{alpha, beta}) + if err != nil { + 
t.Fatalf("LoadMultiJSONL: %v", err)
+	}
+
+	var buf bytes.Buffer
+	if err := GenerateProfile(&buf, ds, "scan-profile", "me@x.com"); err != nil {
+		t.Fatalf("GenerateProfile: %v", err)
+	}
+	out := buf.String()
+
+	if !strings.Contains(out, "Per-Repository Breakdown") {
+		t.Fatal("profile report missing the breakdown section — scan --email users won't see their cross-repo split")
+	}
+
+	// Each row renders as `<td>alpha</td>` followed by a `<td>N</td>`
+	// for commits. Assert 1 commit per repo — if the filter leaked,
+	// alpha would show 2 (me's + colleague's).
+	commitCountRe := regexp.MustCompile(`<td[^>]*>(alpha|beta)</td>\s*<td[^>]*>(\d+)`)
+	rows := commitCountRe.FindAllStringSubmatch(out, -1)
+	if len(rows) != 2 {
+		t.Fatalf("expected 2 repo rows in breakdown, got %d: %v", len(rows), rows)
+	}
+	for _, r := range rows {
+		if r[2] != "1" {
+			t.Errorf("repo %s shows %s commits in profile breakdown — email filter leaked (want 1)", r[1], r[2])
+		}
+	}
+
+	// Colleague-exclusive file must not appear in my scope. The
+	// template renders file counts as `<td>N</td>` inside each row
+	// — indirect assertion: if the file-count bumped, the dev-filter
+	// on devCommits is wrong.
+	if strings.Contains(out, "colleague-only.go") {
+		t.Error("profile report mentions a file only the colleague touched")
+	}
+}
+
 func TestGenerate_EmptyDataset(t *testing.T) {
 	dir := t.TempDir()
 	path := filepath.Join(dir, "empty.jsonl")
diff --git a/internal/report/scan_index.go b/internal/report/scan_index.go
new file mode 100644
index 0000000..e8b943d
--- /dev/null
+++ b/internal/report/scan_index.go
@@ -0,0 +1,243 @@
+package report
+
+import (
+	"fmt"
+	"html/template"
+	"io"
+	"time"
+)
+
+// ScanIndexEntry is one repo's row on the scan-index landing page.
+// Successful repos populate the numeric fields; failed / pending
+// repos leave them zero and surface Error instead. ReportHref is the
+// relative URL the index uses to link into each per-repo report —
+// empty when no report exists for that entry.
+type ScanIndexEntry struct { + Slug string + Path string + Status string + Error string + Commits int + Devs int + Files int + Churn int64 + FirstCommitDate string + LastCommitDate string + // LastCommitAgo is LastCommitDate humanized relative to the + // index generation time: "today" / "Nd ago" / "Nmo ago" / "Ny ago". + // Lets operators spot abandoned repos in a list of 30 without + // doing date math in their head. + LastCommitAgo string + // RecencyBucket classifies LastCommitAgo into "fresh" (≤ 30 days), + // "stable" (≤ 1 year), "stale" (> 1 year). Drives the badge color + // in the template so the cold repos stand out at a glance. + RecencyBucket string + ReportHref string +} + +// HumanizeAgo is the caller entry point for the index: formats +// `lastDate` (YYYY-MM-DD) as a compact "ago" phrase relative to +// wall-clock now, and classifies recency into fresh / stable / +// stale for template coloring. Empty pair on unparsable input so +// failed/pending rows with no dates render nothing. +func HumanizeAgo(lastDate string) (label, bucket string) { + return humanizeAgoAt(lastDate, time.Now()) +} + +// humanizeAgoAt is the testable core — same logic as HumanizeAgo +// but takes an explicit "now" so tests can pin the reference time +// and avoid drift. +// +// Future-commit policy: dates strictly AFTER today's UTC midnight +// (e.g. tomorrow's YYYY-MM-DD due to clock skew or a future-dated +// CI rewrite) yield empty — the index surfaces how STALE each repo +// is, and labeling "in 3d" as "fresh" would actively mislead. Both +// sides of the comparison reduce to UTC midnights so "tomorrow" +// is detected even when the user's wall clock is mid-day; sub-day +// committer-clock skew inside today's date still falls into +// `days == 0 → "today"`, documented as the safe fallback at day +// granularity. +// +// The earlier implementation computed `days` as +// int(now.Sub(t).Hours() / 24), which truncates toward zero for +// negative durations. 
A "tomorrow" date 12 hours ahead produced +// days == 0 instead of -1, slipping past the future-date guard +// and rendering "today" / fresh. Comparing the parsed date +// directly against today's UTC midnight removes the truncation +// gap. +func humanizeAgoAt(lastDate string, now time.Time) (label, bucket string) { + t, err := time.Parse("2006-01-02", lastDate) + if err != nil { + return "", "" + } + today := now.UTC().Truncate(24 * time.Hour) + if t.After(today) { + return "", "" + } + days := int(today.Sub(t).Hours() / 24) + switch { + case days < 0: + // Defensive: t.After(today) above already covered this, but + // keep the branch so a later refactor that moves the guard + // can't silently produce a negative-days label. + return "", "" + case days == 0: + label = "today" + case days < 30: + label = fmt.Sprintf("%dd ago", days) + case days < 365: + // Cap the month reading at 11 — otherwise `days/30` emits + // "12mo ago" at days 360-364, which reads as older than + // "1y ago" even though it's actually in the still-stable + // (≤365d) band. Clamping keeps the label progression + // monotonic with the bucket color. + months := days / 30 + if months > 11 { + months = 11 + } + label = fmt.Sprintf("%dmo ago", months) + default: + label = fmt.Sprintf("%dy ago", days/365) + } + switch { + case days <= 30: + bucket = "fresh" + case days <= 365: + bucket = "stable" + default: + bucket = "stale" + } + return label, bucket +} + +// ScanIndexData is the top-level template input for the index page. +type ScanIndexData struct { + GeneratedAt string + Repos []ScanIndexEntry + // TotalRepos / OKRepos / FailedRepos / PendingRepos are + // precomputed so the template doesn't need conditional arithmetic. + // Pending is distinct from failed: a pending repo is one the + // worker never reached (cancelled mid-scan), a failed repo is one + // whose extract or render broke. Mixing them in the summary would + // read a cancel-shaped partial run as a fleet-of-errors. 
+ TotalRepos int + OKRepos int + FailedRepos int + PendingRepos int + TotalCommits int + TotalDevs int + // Largest repo commit count — used to normalize the bar widths so + // the relative-volume bars are visually comparable across repos. + MaxCommits int +} + +// GenerateScanIndex writes the scan landing page: a per-repo card +// list with links to each repo's standalone report and a short +// summary strip. Failures are surfaced inline rather than hidden so +// operators can spot them at a glance and dig into the manifest. +func GenerateScanIndex(w io.Writer, data ScanIndexData) error { + if data.GeneratedAt == "" { + data.GeneratedAt = time.Now().Format("2006-01-02 15:04") + } + return scanIndexTmpl.Execute(w, data) +} + +var scanIndexTmpl = template.Must(template.New("scan-index").Funcs(funcMap).Parse(scanIndexHTML)) + +const scanIndexHTML = ` + + + +gitcortex scan index ({{.TotalRepos}} repositories) + + + + +
+
Repositories
{{thousands .OKRepos}}{{if gt .FailedRepos 0}} ({{.FailedRepos}} failed){{end}}{{if gt .PendingRepos 0}} ({{.PendingRepos}} pending){{end}}
+
Total commits
{{humanize .TotalCommits}}
+
Unique devs
{{thousands .TotalDevs}}
+
+ +

Each repo below links to its own standalone report. Metrics are per-repository — no cross-repo aggregation that would mix signals from unrelated codebases. For a consolidated developer view, use gitcortex scan --report <file> --email <address>.

+ +{{$max := .MaxCommits}} +{{range .Repos}} +
+
+
+ {{if .ReportHref}}{{.Slug}}{{else}}{{.Slug}}{{end}} + {{if ne .Status "ok"}}{{.Status}}{{end}} +
+
{{.Path}}
+ {{if .Error}}
{{.Error}}
{{end}} +
+ {{if eq .Status "ok"}} +
+ {{humanize .Commits}} + commits +
+
+ {{humanize .Churn}} + churn +
+
+ {{.Devs}} + devs +
+
+ {{humanize .Files}} + files +
+
+ {{if .LastCommitAgo}}{{.LastCommitAgo}}
{{end}}
+ {{if and .FirstCommitDate .LastCommitDate}}{{.FirstCommitDate}}
→ {{.LastCommitDate}}{{end}} +
+ {{else}} +
No report available.
+ {{end}} +
+{{if eq .Status "ok"}} +
+
+
+{{end}} +{{end}} + + + + + +` diff --git a/internal/report/scan_index_test.go b/internal/report/scan_index_test.go new file mode 100644 index 0000000..470c10a --- /dev/null +++ b/internal/report/scan_index_test.go @@ -0,0 +1,126 @@ +package report + +import ( + "bytes" + "strings" + "testing" + "time" +) + +func TestHumanizeAgoAt(t *testing.T) { + now := time.Date(2026, 4, 20, 12, 0, 0, 0, time.UTC) + + cases := []struct { + name string + lastDate string + wantLabel string + wantBucket string + }{ + // Boundary on day 0 — "today" reads cleaner than "0d ago". + {"same day", "2026-04-20", "today", "fresh"}, + {"one day", "2026-04-19", "1d ago", "fresh"}, + {"29 days", "2026-03-22", "29d ago", "fresh"}, + // Transition from days to months at 30. + {"30 days exact", "2026-03-21", "1mo ago", "fresh"}, + {"two months", "2026-02-15", "2mo ago", "stable"}, + {"eleven months", "2025-05-25", "11mo ago", "stable"}, + // Month label must not exceed "11mo" — days/30 = 12 on the + // 360-day boundary, but the label is clamped so the stable + // band always reads as sub-year. + {"month clamp at 360 days", "2025-04-25", "11mo ago", "stable"}, + {"month clamp at 364 days", "2025-04-21", "11mo ago", "stable"}, + // 365-day boundary: still "stable" at exactly one year, + // "stale" the day after. 2025-04-20 is 365 days before now. + {"one year exact (stable boundary)", "2025-04-20", "1y ago", "stable"}, + {"one year + one day (stale)", "2025-04-19", "1y ago", "stale"}, + {"two years stale", "2024-04-10", "2y ago", "stale"}, + // Parse failure yields empty. + {"bad input", "not-a-date", "", ""}, + // Future dates (clock skew) yield empty — we don't label "in 3d" + // on an index that exists to surface recency of PAST commits. + {"full-day future", "2026-05-01", "", ""}, + // Regression for the truncation-gap bug: tomorrow (~12h ahead + // of `now`) used to yield int(-0.5) == 0 and silently render + // as "today" / fresh. 
Comparing UTC date midnights instead + // of raw durations rejects it correctly. + {"tomorrow (sub-24h future)", "2026-04-21", "", ""}, + {"two days future", "2026-04-22", "", ""}, + // Sub-day future (committer-date hours ahead of scanner clock) + // lands in days==0 → "today". Documented: the index works at + // day granularity and "today" is the least-misleading fallback + // for intra-day skew within today's date. + {"same-day future (sub-day skew)", "2026-04-20", "today", "fresh"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + label, bucket := humanizeAgoAt(c.lastDate, now) + if label != c.wantLabel { + t.Errorf("label: got %q, want %q", label, c.wantLabel) + } + if bucket != c.wantBucket { + t.Errorf("bucket: got %q, want %q", bucket, c.wantBucket) + } + }) + } +} + +// End-to-end check that the recency chip renders for an ok entry +// and is absent for failed entries (which have no dates). Confirms +// the CSS bucket class is reachable by the template. +func TestGenerateScanIndex_RecencyChipRenders(t *testing.T) { + data := ScanIndexData{ + GeneratedAt: "2026-04-20 12:00", + TotalRepos: 2, + OKRepos: 1, + FailedRepos: 1, + MaxCommits: 10, + Repos: []ScanIndexEntry{ + { + Slug: "alive", + Path: "/work/alive", + Status: "ok", + Commits: 10, + FirstCommitDate: "2024-01-01", + LastCommitDate: "2026-04-18", + LastCommitAgo: "2d ago", + RecencyBucket: "fresh", + ReportHref: "alive.html", + }, + { + Slug: "broken", + Path: "/work/broken", + Status: "failed", + Error: "boom", + }, + }, + } + var buf bytes.Buffer + if err := GenerateScanIndex(&buf, data); err != nil { + t.Fatal(err) + } + out := buf.String() + if !strings.Contains(out, `class="recency fresh"`) { + t.Error("fresh recency chip missing from ok entry") + } + if !strings.Contains(out, `>2d ago<`) { + t.Error("recency label text missing") + } + // Anchor the chip inside the `.dates` wrapper — without this, + // a future template change that moved the chip outside the + // dates 
cell (losing the date context) would still satisfy the + // class-only assertion above. + if !strings.Contains(out, `class="dates"`) { + t.Error("`.dates` wrapper missing — the recency chip has lost its structural anchor") + } + datesIdx := strings.Index(out, `class="dates"`) + recencyIdx := strings.Index(out, `class="recency fresh"`) + if datesIdx < 0 || recencyIdx < 0 || recencyIdx < datesIdx { + t.Errorf("recency chip not inside `.dates` wrapper: dates@%d recency@%d", datesIdx, recencyIdx) + } + // Failed entry has no dates block, so no recency chip should + // render for it — a weak but useful guard against a template + // restructure leaking the chip into the failure render. + if strings.Count(out, `class="recency`) != 1 { + t.Errorf("expected exactly one recency chip (ok entry only); got %d", strings.Count(out, `class="recency`)) + } +} diff --git a/internal/scan/discovery.go b/internal/scan/discovery.go new file mode 100644 index 0000000..8fc96e5 --- /dev/null +++ b/internal/scan/discovery.go @@ -0,0 +1,357 @@ +package scan + +import ( + "context" + "crypto/sha1" + "encoding/hex" + "fmt" + "log" + "os" + "path/filepath" + "sort" + "strings" +) + +// Repo describes one git repository discovered during scan. +type Repo struct { + // AbsPath is the absolute path to the working tree root. + AbsPath string + // RelPath is AbsPath relative to the scan root it was found under. + RelPath string + // Slug is a filesystem-safe identifier used to derive the output + // JSONL file name and the path prefix that stats will use as the + // repo label. Must be unique across the scan (collisions are resolved + // by appending a short hash of AbsPath). + Slug string +} + +// Discover walks the given roots and returns every git repository it finds. +// Directories matched by the ignore matcher are pruned. 
Each repo's Slug is
+// unique across the full result so downstream JSONL files don't collide,
+// and the prefix LoadMultiJSONL derives (basename-minus-ext) still groups
+// correctly.
+//
+// A cancelled ctx aborts the walk promptly — important because on large
+// roots (home directory scans, monorepos of repos) discovery is often
+// the longest phase of scan and Ctrl+C needs to land in real time, not
+// "after the walk naturally finishes". Each filepath.WalkDir callback
+// checks ctx.Done() before doing any filesystem work, so the abort
+// window is bounded by the cost of one stat call per dir entry.
+func Discover(ctx context.Context, roots []string, matcher *Matcher, maxDepth int) ([]Repo, error) {
+ if matcher == nil {
+ matcher = NewMatcher(nil)
+ }
+ if ctx == nil {
+ ctx = context.Background()
+ }
+ var repos []Repo
+ seen := make(map[string]bool)
+
+ for _, root := range roots {
+ if err := ctx.Err(); err != nil {
+ return nil, err
+ }
+ abs, err := filepath.Abs(root)
+ if err != nil {
+ return nil, fmt.Errorf("resolve %s: %w", root, err)
+ }
+ // Canonicalize via EvalSymlinks. The os.Stat below follows the
+ // symlink for the is-directory check, but filepath.WalkDir
+ // treats a symlink ROOT specially: it visits only the link
+ // itself (reported with ModeSymlink) and does NOT descend
+ // into the target. For a user whose ~/work is a symlink to
+ // /mnt/data/work — a common setup — the walk would yield
+ // zero callbacks past the root and scan would conclude "no
+ // git repositories found under ..." despite the target being
+ // full of repos. EvalSymlinks dereferences the root once so
+ // WalkDir starts with a real directory path; links ENCOUNTERED
+ // during the walk are still left as-is (default WalkDir
+ // behavior), which is what we want — we don't chase every
+ // symlink we come across, only the ones the user explicitly
+ // named as a root.
+ resolved, err := filepath.EvalSymlinks(abs) + if err != nil { + return nil, fmt.Errorf("resolve symlinks for %s: %w", abs, err) + } + abs = resolved + info, err := os.Stat(abs) + if err != nil { + return nil, fmt.Errorf("stat %s: %w", abs, err) + } + if !info.IsDir() { + return nil, fmt.Errorf("%s is not a directory", abs) + } + + err = filepath.WalkDir(abs, func(path string, d os.DirEntry, werr error) error { + // Return ctx.Err() (not SkipDir) so WalkDir short-circuits + // the entire walk. Without this check, a cancelled scan + // would keep stat'ing every directory in a large tree + // before the caller saw the interrupt. + if err := ctx.Err(); err != nil { + return err + } + if werr != nil { + // Permission errors on one subtree shouldn't abort the + // whole scan. Log and keep walking. + log.Printf("scan: skip %s: %v", path, werr) + if d != nil && d.IsDir() { + return filepath.SkipDir + } + return nil + } + if !d.IsDir() { + return nil + } + + rel, _ := filepath.Rel(abs, path) + rel = filepath.ToSlash(rel) + + if maxDepth > 0 && rel != "." { + depth := strings.Count(rel, "/") + 1 + if depth > maxDepth { + return filepath.SkipDir + } + } + + if rel != "." && matcher.Match(rel, true) { + // Don't SkipDir blindly — if any negation rule could + // re-include a descendant (e.g. `vendor/` + `!vendor/keep`), + // pruning here would drop the re-inclusion before its + // target is visited. + if !matcher.CouldReinclude(rel) { + return filepath.SkipDir + } + // The dir itself is ignored; we only walk in so a + // negated descendant can be visited in its own turn. + // Return early so the .git detection below doesn't + // record this ignored directory as a repo (and so its + // `return filepath.SkipDir` doesn't cut the descent + // off before the negation's target is seen). A + // descendant whose rel path is re-included by `!rule` + // will be examined by the next WalkDir callback + // invocation. + return nil + } + + // Two repo shapes to detect: + // 1. 
Working tree: path/.git is a dir (normal) or a + // `gitdir: …` pointer file (worktree/submodule). + // 2. Bare repo: path itself contains HEAD + objects/ + // + refs/ — common for clones used as fixtures or + // mirrors (`git clone --bare`, GitHub-style server dirs + // named foo.git). + // Validating the .git entry (not just its presence) matters: + // a random regular file named `.git` would otherwise be + // accepted, the parent path recorded as a repo, and the + // walker SkipDir'd out of the subtree — hiding any real + // repos nested underneath and guaranteeing a downstream + // "not a git repository" failure during extract. isBareRepo + // separately requires HEAD + objects + refs so a stray HEAD + // or empty objects dir can't false-positive either. + gitEntry := filepath.Join(path, ".git") + if isWorkingTreeEntry(gitEntry) || isBareRepo(path) { + if seen[path] { + return filepath.SkipDir + } + seen[path] = true + repos = append(repos, Repo{ + AbsPath: path, + RelPath: rel, + }) + // Don't descend into a repo — nested repos (submodules, + // vendored repos) get picked up separately only if they + // live outside this repo's worktree, which is rare. If + // users need submodule coverage they can list the parent + // and the submodule paths as separate roots. + return filepath.SkipDir + } + + return nil + }) + if err != nil { + return nil, err + } + } + + assignSlugs(repos) + // Sort by slug so output ordering is deterministic across runs + // (WalkDir ordering is, but concatenating multiple roots could shuffle). + sort.Slice(repos, func(i, j int) bool { return repos[i].Slug < repos[j].Slug }) + return repos, nil +} + +// initialSlugHashLen controls the default hash-suffix length in hex +// chars when two repos share a basename. Six chars (24 bits) gives +// readable slugs in the common case; the retry loop in assignSlugs +// grows the length if a truncation collision occurs. 
+const initialSlugHashLen = 6 + +// isReservedSlug reports whether base would collide with a name +// downstream consumers already use for their own output file. +// `scan --report-dir` writes /index.html as the landing page; +// a repo whose basename was literally `index` would overwrite it +// (or be overwritten by it). Forcing the hash branch for reserved +// names avoids the collision without losing the repo. +// +// Switch, not a package-level map, to keep the reserved set +// immutable — a mutable map would let one test silently leak an +// entry into every other test that runs afterward. Case folded to +// align with the case-insensitivity elsewhere in assignSlugs. +func isReservedSlug(base string) bool { + switch strings.ToLower(base) { + case "index": + return true + } + return false +} + +// assignSlugs derives a unique slug per repo from its basename, falling +// back to `-` when two repos share a name. +// The slug is also the JSONL filename stem and the persistence key +// (`.state`), so uniqueness AND determinism across runs both matter: +// if a re-run swaps which sibling gets the bare name, the per-repo +// state file is orphaned and the affected repos are re-extracted from +// scratch (or worse, two repos collide onto one state). +// +// Two-pass: count basenames first, then suffix with a hash whenever the +// base appears more than once anywhere in the result. This makes the +// slug a pure function of (absPath, set of basenames seen), independent +// of WalkDir traversal order. +// +// Short-hash truncation (6 hex = 24 bits) admits a small but nonzero +// collision probability — two absolute paths whose SHA-1 digests share +// the same 6-hex prefix would land on the same slug, and scan would +// silently overwrite one repo's JSONL + state file with the other's. +// The retry loop walks the proposed slug set; on any duplicate we +// redo the pass with a longer hash. 
Grows up to the full 40 hex chars +// before panicking — needing that much is a cryptographic event, not +// a scan-time bug. +// +// Case-insensitive uniqueness: on macOS (APFS/HFS+ in the default +// configuration) and Windows (NTFS), `Repo.jsonl` and `repo.jsonl` +// resolve to the same file on disk. A case-sensitive slug compare +// would treat the two repos as distinct, hand each the bare basename, +// and then quietly let the filesystem merge their JSONL and state +// files. Fold to lower case for BOTH the duplicate-detection count +// and the uniqueness seen-set so collision-triggered hashing fires +// on case-only differences. The emitted slug retains the original +// case for readability (so paths named Repo and repo produce +// `Repo-` and `repo-` — visually distinct to humans, +// case-insensitively distinct on disk). +func assignSlugs(repos []Repo) { + counts := make(map[string]int) + bases := make([]string, len(repos)) + for i := range repos { + base := sanitizeSlug(filepath.Base(repos[i].AbsPath)) + if base == "" { + base = "repo" + } + bases[i] = base + counts[strings.ToLower(base)]++ + } + + for hashLen := initialSlugHashLen; hashLen <= 40; hashLen += 2 { + proposed := make([]string, len(repos)) + seen := make(map[string]int, len(repos)) + collided := false + for i := range repos { + base := bases[i] + slug := base + if counts[strings.ToLower(base)] > 1 || isReservedSlug(base) { + h := sha1.Sum([]byte(repos[i].AbsPath)) + slug = fmt.Sprintf("%s-%s", base, hex.EncodeToString(h[:])[:hashLen]) + } + key := strings.ToLower(slug) + if prev, ok := seen[key]; ok && prev != i { + collided = true + break + } + seen[key] = i + proposed[i] = slug + } + if !collided { + for i := range repos { + repos[i].Slug = proposed[i] + } + return + } + } + panic("scan: slug assignment failed to find unique hash within SHA-1 range") +} + +// isBareRepo returns true when path is a bare git repository — i.e. 
the +// directory itself holds HEAD, objects/, and refs/ rather than wrapping +// them in a .git subdirectory. All three entries are required because +// a stray HEAD file or empty refs/ dir alone is not enough to be a real +// repo and we don't want false positives polluting the manifest. +func isBareRepo(path string) bool { + for _, name := range []string{"HEAD", "objects", "refs"} { + if _, err := os.Stat(filepath.Join(path, name)); err != nil { + return false + } + } + return true +} + +// isWorkingTreeEntry reports whether gitEntry (the path of a `.git` +// dirent inside a candidate repo) marks a real git working tree. +// Accepts: +// - A directory named `.git` — the standard layout. +// - A regular file named `.git` beginning with `gitdir: ` — the +// pointer format git writes for linked worktrees and submodules. +// Rejects everything else: symlinks (by existing policy — the target +// could be anywhere), irregular files, and — the case that motivated +// the validation — a regular file named `.git` that happens to exist +// with unrelated content. Without this check, any plain file called +// `.git` (source dump, build artifact, user mistake) made discovery +// record its parent as a repo and SkipDir out of the subtree, hiding +// nested real repos and producing a guaranteed extract failure. +func isWorkingTreeEntry(gitEntry string) bool { + info, err := os.Lstat(gitEntry) + if err != nil { + return false + } + if info.Mode()&os.ModeSymlink != 0 { + return false + } + if info.IsDir() { + return true + } + if !info.Mode().IsRegular() { + return false + } + // A valid gitdir pointer file starts with literally "gitdir: ". + // Checking only the first 8 bytes is sufficient (and cheap) — the + // full content isn't parsed here; extract will fail loudly if the + // pointer target is broken, which is a better signal than silently + // accepting every regular file and leaving the same confusion for + // the extract phase. 
+ f, openErr := os.Open(gitEntry) + if openErr != nil { + return false + } + defer f.Close() + const prefix = "gitdir: " + buf := make([]byte, len(prefix)) + n, _ := f.Read(buf) + return n == len(prefix) && string(buf) == prefix +} + +// sanitizeSlug strips characters that would break the LoadMultiJSONL prefix +// contract (`:`) or filesystem paths. Keeps alphanumerics, dash, +// underscore, and dot. +func sanitizeSlug(s string) string { + var b strings.Builder + for _, r := range s { + switch { + case r >= 'a' && r <= 'z', + r >= 'A' && r <= 'Z', + r >= '0' && r <= '9', + r == '-', r == '_', r == '.': + b.WriteRune(r) + default: + b.WriteRune('_') + } + } + return b.String() +} diff --git a/internal/scan/discovery_test.go b/internal/scan/discovery_test.go new file mode 100644 index 0000000..97d9e70 --- /dev/null +++ b/internal/scan/discovery_test.go @@ -0,0 +1,682 @@ +package scan + +import ( + "context" + "crypto/sha1" + "encoding/hex" + "fmt" + "os" + "path/filepath" + "strings" + "testing" + "time" +) + +func TestDiscover_FindsRepos(t *testing.T) { + root := t.TempDir() + + mustMkRepo(t, filepath.Join(root, "a")) + mustMkRepo(t, filepath.Join(root, "b")) + mustMkRepo(t, filepath.Join(root, "nested", "c")) + // Plain dir without .git — must NOT be picked up. 
+ if err := os.MkdirAll(filepath.Join(root, "not-a-repo", "src"), 0o755); err != nil { + t.Fatal(err) + } + + repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0) + if err != nil { + t.Fatalf("Discover: %v", err) + } + if len(repos) != 3 { + t.Fatalf("expected 3 repos, got %d: %+v", len(repos), repos) + } + got := map[string]bool{} + for _, r := range repos { + got[r.Slug] = true + } + for _, want := range []string{"a", "b", "c"} { + if !got[want] { + t.Errorf("expected slug %q in %v", want, got) + } + } +} + +func TestDiscover_RespectsIgnore(t *testing.T) { + root := t.TempDir() + mustMkRepo(t, filepath.Join(root, "keep")) + mustMkRepo(t, filepath.Join(root, "node_modules", "garbage")) + + matcher := NewMatcher([]string{"node_modules"}) + repos, err := Discover(context.Background(), []string{root}, matcher, 0) + if err != nil { + t.Fatal(err) + } + if len(repos) != 1 || repos[0].Slug != "keep" { + t.Fatalf("expected only `keep`, got %+v", repos) + } +} + +// Regression: `vendor/` + `!vendor/keep` must descend into vendor/ +// so the negation has a chance to re-include vendor/keep. Before the +// fix, the walker SkipDir'd vendor unconditionally and the re-included +// repo was silently dropped. 
+func TestDiscover_HonorsNegatedDescendant(t *testing.T) {
+ root := t.TempDir()
+ mustMkRepo(t, filepath.Join(root, "app"))
+ mustMkRepo(t, filepath.Join(root, "vendor", "garbage"))
+ mustMkRepo(t, filepath.Join(root, "vendor", "keep"))
+
+ matcher := NewMatcher([]string{"vendor/", "!vendor/keep"})
+ repos, err := Discover(context.Background(), []string{root}, matcher, 0)
+ if err != nil {
+ t.Fatal(err)
+ }
+ got := map[string]bool{}
+ for _, r := range repos {
+ got[r.RelPath] = true
+ }
+ if !got["app"] {
+ t.Errorf("app should be included: %+v", repos)
+ }
+ if !got["vendor/keep"] {
+ t.Errorf("vendor/keep should be re-included by negation rule: %+v", repos)
+ }
+ if got["vendor/garbage"] {
+ t.Errorf("vendor/garbage should remain ignored: %+v", repos)
+ }
+}
+
+// With a 6-hex (24-bit) suffix, the birthday paradox predicts a
+// truncation collision around ~2^12 duplicates. Generating 10k paths
+// that share a basename makes collisions statistically near-certain
+// (expected count ≈ 3). The invariant "all resulting slugs are
+// distinct" forces assignSlugs's retry loop to grow the hash and
+// still produce a unique namespace — the exact corruption class the
+// extra iterations were added to prevent.
+func TestAssignSlugs_UniqueEvenUnderTruncationCollisions(t *testing.T) { + const n = 10000 + repos := make([]Repo, n) + for i := 0; i < n; i++ { + repos[i] = Repo{AbsPath: fmt.Sprintf("/path/to/dir-%06d/myrepo", i)} + } + assignSlugs(repos) + + slugs := make(map[string]int, n) + for i, r := range repos { + if prev, ok := slugs[r.Slug]; ok { + t.Fatalf("duplicate slug %q between repos[%d]=%s and repos[%d]=%s — JSONL + state files would collide", + r.Slug, prev, repos[prev].AbsPath, i, r.AbsPath) + } + slugs[r.Slug] = i + } +} + +// Directly exercise the retry path with two paths constructed to +// collide at initialSlugHashLen. Without the retry, both would get +// slug "myrepo-" and the scan would silently overwrite one +// repo's files with the other's. With the retry, the loop grows +// the hash until the pair separates. +func TestAssignSlugs_ResolvesFirstPrefixCollision(t *testing.T) { + a, b, found := findColliding6HexPaths(50000) + if !found { + t.Skip("no colliding pair found within search budget — astronomically unlikely, skip rather than flake") + } + repos := []Repo{{AbsPath: a}, {AbsPath: b}} + assignSlugs(repos) + + if repos[0].Slug == repos[1].Slug { + t.Fatalf("retry failed: both repos got slug %q for colliding paths %s and %s", repos[0].Slug, a, b) + } + // Sanity: the slug suffix must be longer than the initial 6 hex, + // proving the retry branch actually fired. + const minLenAfterRetry = len("myrepo-") + initialSlugHashLen + 1 + if len(repos[0].Slug) < minLenAfterRetry && len(repos[1].Slug) < minLenAfterRetry { + t.Errorf("expected at least one slug to have a longer hash after retry; got %q and %q", repos[0].Slug, repos[1].Slug) + } +} + +// findColliding6HexPaths searches a deterministic sequence of +// `myrepo` paths for any two whose SHA-1 digests share the first +// initialSlugHashLen hex chars. At 24 bits of resolution the +// birthday bound hits 50% around N≈4900, so maxAttempts=50000 is +// overwhelmingly likely to yield a pair. 
Returns false only if no +// collision was found — the test treats that as a skip, not a +// failure, since the worst-case outcome is a missed opportunity to +// exercise the retry, not a bug. +func findColliding6HexPaths(maxAttempts int) (string, string, bool) { + seen := make(map[string]string, maxAttempts) + for i := 0; i < maxAttempts; i++ { + p := fmt.Sprintf("/search/path-%07d/myrepo", i) + h := sha1.Sum([]byte(p)) + prefix := hex.EncodeToString(h[:])[:initialSlugHashLen] + if prev, ok := seen[prefix]; ok { + return prev, p, true + } + seen[prefix] = p + } + return "", "", false +} + +// `index` is the landing-page filename used by `scan --report-dir`. +// A repo whose basename is literally `index` would emit +// `/index.html`, colliding with (and getting overwritten by) +// the landing page. Force the hash branch for reserved names so +// the per-repo report file never lands on a reserved path. +func TestAssignSlugs_ReservesIndex(t *testing.T) { + repos := []Repo{ + {AbsPath: "/workspace/index"}, + {AbsPath: "/other/unrelated"}, + } + assignSlugs(repos) + + for _, r := range repos { + if filepath.Base(r.AbsPath) == "index" && r.Slug == "index" { + t.Errorf("`index` basename must not produce the bare slug; landing page would be overwritten (path=%s, slug=%s)", r.AbsPath, r.Slug) + } + } + // Case-insensitivity: `Index` would also collide on a case- + // insensitive filesystem serving the HTML. + repos = []Repo{{AbsPath: "/workspace/Index"}} + assignSlugs(repos) + if strings.EqualFold(repos[0].Slug, "index") { + t.Errorf("`Index` basename must also be reserved; got %q", repos[0].Slug) + } +} + +// Regression: on macOS (APFS/HFS+ default) and Windows (NTFS), +// `Repo.jsonl` and `repo.jsonl` are the same file on disk. A +// case-sensitive slug compare would hand both repos the bare +// basename and let the filesystem merge their outputs — one scan's +// JSONL silently overwrites the other. 
Counts and seen-set are
+// lower-cased so any case-only difference is treated as a collision
+// and forces the hash suffix.
+func TestAssignSlugs_CaseInsensitiveUniqueness(t *testing.T) {
+ repos := []Repo{
+ {AbsPath: "/a/Team/Repo"},
+ {AbsPath: "/b/OSS/repo"},
+ }
+ assignSlugs(repos)
+
+ if repos[0].Slug == repos[1].Slug {
+ t.Fatalf("repos with case-only-differing basenames got same slug %q", repos[0].Slug)
+ }
+ // The fs collision only goes away if the slugs also differ when
+ // folded to lower case — which is what decides filenames on
+ // case-insensitive fs.
+ if strings.EqualFold(repos[0].Slug, repos[1].Slug) {
+ t.Fatalf("slugs %q and %q differ only in case — they would still collide on macOS/Windows", repos[0].Slug, repos[1].Slug)
+ }
+ // Sanity: both should have gained a hash suffix (the bare `Repo`
+ // and `repo` weren't safe to emit).
+ for _, r := range repos {
+ if !strings.Contains(r.Slug, "-") {
+ t.Errorf("repo %s got bare slug %q; case-sensitive branch did not trigger hashing", r.AbsPath, r.Slug)
+ }
+ }
+}
+
+// Same paths ingested twice must produce the same slugs — this is
+// what makes resume work across runs. Even with the retry loop
+// adjusting hash length under collisions, a deterministic input set
+// must yield a deterministic output.
+func TestAssignSlugs_DeterministicUnderRetry(t *testing.T) {
+ build := func() []Repo {
+ rs := make([]Repo, 5000)
+ for i := 0; i < 5000; i++ {
+ rs[i] = Repo{AbsPath: fmt.Sprintf("/work/p-%05d/myrepo", i)}
+ }
+ return rs
+ }
+ a := build()
+ b := build()
+ assignSlugs(a)
+ assignSlugs(b)
+ for i := range a {
+ if a[i].Slug != b[i].Slug {
+ t.Errorf("slug for %s differs across runs: %q vs %q", a[i].AbsPath, a[i].Slug, b[i].Slug)
+ }
+ }
+}
+
+// Regression: WalkDir given a symlink ROOT visits only the link
+// entry and refuses to descend. Users whose ~/work is a symlink to
+// a real data disk would previously see "no git repositories found"
+// despite the target being full of repos. Canonicalize via
+// EvalSymlinks before walking so the walk starts at a real dir.
+func TestDiscover_DereferencesSymlinkRoot(t *testing.T) {
+ real := t.TempDir()
+ mustMkRepo(t, filepath.Join(real, "repo-a"))
+ mustMkRepo(t, filepath.Join(real, "nested", "repo-b"))
+
+ // Place the symlink in a separate TempDir so the test doesn't
+ // have to clean it up explicitly.
+ linkDir := t.TempDir()
+ link := filepath.Join(linkDir, "work")
+ if err := os.Symlink(real, link); err != nil {
+ t.Skipf("symlinks not supported here: %v", err)
+ }
+
+ repos, err := Discover(context.Background(), []string{link}, NewMatcher(nil), 0)
+ if err != nil {
+ t.Fatalf("Discover via symlink root: %v", err)
+ }
+ if len(repos) != 2 {
+ t.Fatalf("want 2 repos found through symlink root, got %d: %+v", len(repos), repos)
+ }
+ got := map[string]bool{}
+ for _, r := range repos {
+ got[r.Slug] = true
+ }
+ for _, want := range []string{"repo-a", "repo-b"} {
+ if !got[want] {
+ t.Errorf("expected slug %q in %v (symlink root was not dereferenced)", want, got)
+ }
+ }
+}
+
+// Mid-walk cancel: the pre-cancelled test below only proves the
+// first callback checks ctx. This test cancels AFTER Discover has
+// already started walking — the scenario users hit with Ctrl+C on
+// a long home-dir scan. Asserting the walk stops in flight, not
+// just when prompted at the very start.
+//
+// The tree is large enough (5,000 dirs × 3 subdirs = 20k nodes)
+// that a full walk on typical hardware takes tens of ms; sleeping
+// briefly before cancel lands the signal mid-flight. We assert:
+//  - err is context.Canceled (ctx check fired inside the walk)
+//  - elapsed time < full baseline (walk was actually interrupted)
+//  - fewer repos than the baseline walk returned (the walk did
+//    not quietly run to completion before the cancel landed)
+func TestDiscover_AbortsMidWalk(t *testing.T) {
+ if testing.Short() {
+ t.Skip("builds a large synthetic tree; skipped in -short mode")
+ }
+ root := t.TempDir()
+ const topLevel = 5000
+ for i := 0; i < topLevel; i++ {
+ parent := filepath.Join(root, fmt.Sprintf("d-%05d", i))
+ for j := 0; j < 3; j++ {
+ if err := os.MkdirAll(filepath.Join(parent, fmt.Sprintf("sub-%d", j)), 0o755); err != nil {
+ t.Fatal(err)
+ }
+ }
+ // Every top-level dir is a repo so the result count is a
+ // direct proxy for how far the walk got.
+ mustMkRepo(t, parent)
+ }
+
+ // Baseline: uncancelled walk finds all topLevel repos.
+ baseStart := time.Now()
+ full, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+ if err != nil {
+ t.Fatal(err)
+ }
+ baseline := time.Since(baseStart)
+ if len(full) != topLevel {
+ t.Fatalf("baseline walk should find all %d repos, got %d", topLevel, len(full))
+ }
+
+ // Cancelled walk: give the walk a head start, then cancel.
+ ctx, cancel := context.WithCancel(context.Background()) + done := make(chan struct{}) + go func() { + time.Sleep(baseline / 4) + cancel() + close(done) + }() + + cancStart := time.Now() + repos, err := Discover(ctx, []string{root}, NewMatcher(nil), 0) + cancElapsed := time.Since(cancStart) + <-done + + if err != context.Canceled { + t.Fatalf("want context.Canceled after mid-walk cancel, got err=%v (elapsed %v, baseline %v, repos %d)", + err, cancElapsed, baseline, len(repos)) + } + if cancElapsed >= baseline { + t.Errorf("cancelled walk took %v — baseline was %v. Ctx respected but walk did not shortcut; users will see delayed Ctrl+C", cancElapsed, baseline) + } + if len(repos) >= topLevel { + t.Errorf("cancelled walk returned all %d repos — appears to have finished before cancel fired; baseline %v, elapsed %v", len(repos), baseline, cancElapsed) + } +} + +// A cancelled context must abort the walk — previously Discover +// ignored ctx entirely, so Ctrl+C during the walk phase only took +// effect after every directory had been stat'd. Test uses a +// pre-cancelled context to trip the early-return on the very first +// callback; no repos should be returned and the error should equal +// ctx.Err(). +func TestDiscover_AbortsOnCancelledContext(t *testing.T) { + root := t.TempDir() + // Enough decoy directories that an un-aborted walk would still + // produce observable work — the assertion below fails loudly if + // the check is skipped. 
+ for i := 0; i < 20; i++ { + if err := os.MkdirAll(filepath.Join(root, fmt.Sprintf("dir-%d", i)), 0o755); err != nil { + t.Fatal(err) + } + } + mustMkRepo(t, filepath.Join(root, "would-be-found")) + + ctx, cancel := context.WithCancel(context.Background()) + cancel() // pre-cancelled + + repos, err := Discover(ctx, []string{root}, NewMatcher(nil), 0) + if err != context.Canceled { + t.Fatalf("want context.Canceled, got err=%v", err) + } + if len(repos) != 0 { + t.Errorf("want empty repo list after cancel, got %+v", repos) + } +} + +func TestDiscover_IgnoredRepoNotRecorded_DescendantStillFound(t *testing.T) { + root := t.TempDir() + // `vendor` is itself a repo AND is ignored by `vendor/`. + mustMkRepo(t, filepath.Join(root, "vendor")) + // `vendor/keep` is a nested repo, re-included by `!vendor/keep`. + mustMkRepo(t, filepath.Join(root, "vendor", "keep")) + + matcher := NewMatcher([]string{"vendor/", "!vendor/keep"}) + repos, err := Discover(context.Background(), []string{root}, matcher, 0) + if err != nil { + t.Fatal(err) + } + + got := map[string]bool{} + for _, r := range repos { + got[r.RelPath] = true + } + if got["vendor"] { + t.Error("vendor itself is ignored and must not be recorded as a repo") + } + if !got["vendor/keep"] { + t.Errorf("vendor/keep should be re-included by the negation rule; got %+v", repos) + } +} + +func TestMatcher_CouldReinclude(t *testing.T) { + cases := []struct { + name string + patterns []string + dir string + want bool + }{ + {"no negation", []string{"vendor/"}, "vendor", false}, + {"explicit descendant", []string{"vendor/", "!vendor/keep"}, "vendor", true}, + {"unrelated negation", []string{"vendor/", "!src/main"}, "vendor", false}, + {"basename negation could fire anywhere", []string{"vendor/", "!keep"}, "vendor", true}, + {"deep-match negation", []string{"build/", "!**/src"}, "build", true}, + // Globbed first segment — reviewer case. 
+		{"glob star in segment", []string{"vendor*/", "!vendor*/keep"}, "vendor", true},
+		{"wildcard segment matches any parent", []string{"*/", "!*/keep"}, "vendor", true},
+		{"glob prefix that doesn't match dir", []string{"vendor*/", "!foo*/keep"}, "vendor", false},
+		{"nested dir with literal pattern", []string{"pkg/vendor/", "!pkg/vendor/keep"}, "pkg/vendor", true},
+		{"nested dir with glob in first segment", []string{"*/vendor/", "!*/vendor/keep"}, "pkg/vendor", true},
+		{"equal segment count can't name a descendant, but basename rule still fires", []string{"!vendor"}, "vendor", true}, // basename-anywhere
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			m := NewMatcher(c.patterns)
+			if got := m.CouldReinclude(c.dir); got != c.want {
+				t.Errorf("CouldReinclude(%q) with %v = %v, want %v", c.dir, c.patterns, got, c.want)
+			}
+		})
+	}
+}
+
+// Regression: glob-prefixed negations like `!vendor*/keep` or `!*/keep`
+// used to slip past CouldReinclude because it only matched literal
+// `dir + "/"` prefixes. Discovery then pruned vendor/ and the
+// re-included vendor/keep repo disappeared from the scan.
+func TestDiscover_HonorsGlobbedNegation(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "vendor", "keep"))
+	mustMkRepo(t, filepath.Join(root, "vendor", "garbage"))
+	mustMkRepo(t, filepath.Join(root, "vendor-old", "keep"))
+	mustMkRepo(t, filepath.Join(root, "unrelated"))
+
+	matcher := NewMatcher([]string{"vendor*/", "!vendor*/keep"})
+	repos, err := Discover(context.Background(), []string{root}, matcher, 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	got := map[string]bool{}
+	for _, r := range repos {
+		got[r.RelPath] = true
+	}
+	want := []string{"vendor/keep", "vendor-old/keep", "unrelated"}
+	for _, w := range want {
+		if !got[w] {
+			t.Errorf("missing %q in discovered repos: %+v", w, repos)
+		}
+	}
+	if got["vendor/garbage"] {
+		t.Error("vendor/garbage should remain ignored")
+	}
+}
+
+func TestDiscover_DoesNotDescendIntoRepo(t *testing.T) {
+	root := t.TempDir()
+	parent := filepath.Join(root, "parent")
+	mustMkRepo(t, parent)
+	// A nested .git inside an already-discovered repo is treated as a
+	// submodule and skipped — we don't double-count.
+	mustMkRepo(t, filepath.Join(parent, "submodule"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 {
+		t.Fatalf("expected 1 repo (parent only), got %d: %+v", len(repos), repos)
+	}
+}
+
+func TestDiscover_SlugCollisionGetsHashSuffix(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "a", "myrepo"))
+	mustMkRepo(t, filepath.Join(root, "b", "myrepo"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 2 {
+		t.Fatalf("expected 2 repos, got %d", len(repos))
+	}
+	if repos[0].Slug == repos[1].Slug {
+		t.Errorf("collision not resolved: both slugs are %q", repos[0].Slug)
+	}
+	// With the two-pass naming, BOTH duplicates carry a hash suffix —
+	// neither gets the bare basename. This guarantees the slug for any
+	// given path is stable regardless of which sibling WalkDir hits
+	// first across runs.
+	for _, r := range repos {
+		if r.Slug == "myrepo" {
+			t.Errorf("expected both colliding repos to get a hash suffix, but %s kept the bare name", r.AbsPath)
+		}
+	}
+}
+
+// Re-running discovery must produce the same slug for the same path
+// even when the WalkDir traversal could legally vary. This is the
+// invariant `.state` resumption depends on.
+func TestDiscover_SlugDeterministicAcrossRuns(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "a", "myrepo"))
+	mustMkRepo(t, filepath.Join(root, "b", "myrepo"))
+
+	first, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	second, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(first) != len(second) {
+		t.Fatalf("repo count differs across runs: %d vs %d", len(first), len(second))
+	}
+	pathSlug := map[string]string{}
+	for _, r := range first {
+		pathSlug[r.AbsPath] = r.Slug
+	}
+	for _, r := range second {
+		if pathSlug[r.AbsPath] != r.Slug {
+			t.Errorf("slug for %s changed across runs: %q → %q", r.AbsPath, pathSlug[r.AbsPath], r.Slug)
+		}
+	}
+}
+
+// Regression: a random regular file named `.git` (not the
+// `gitdir: …` pointer format) used to make discovery record the
+// parent as a repo, SkipDir out of the subtree, and hide any real
+// repos nested below. Now the .git entry is validated — arbitrary
+// files are rejected AND the walker keeps descending to find real
+// repos deeper.
+func TestDiscover_RejectsArbitraryGitFileAndDescends(t *testing.T) {
+	root := t.TempDir()
+	// Parent with a bogus `.git` file — looks like a repo by naive
+	// existence check, isn't by content check.
+	bogus := filepath.Join(root, "bogus")
+	if err := os.MkdirAll(bogus, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(bogus, ".git"), []byte("not a git pointer file\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	// Real nested repo the old code would have hidden via SkipDir
+	// after recording the bogus parent.
+	mustMkRepo(t, filepath.Join(bogus, "nested"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	got := map[string]bool{}
+	for _, r := range repos {
+		got[r.RelPath] = true
+	}
+	if got["bogus"] {
+		t.Error("parent with a non-pointer `.git` file must not be recorded as a repo")
+	}
+	if !got["bogus/nested"] {
+		t.Errorf("real repo nested under a bogus .git file should still be discovered; got %+v", repos)
+	}
+}
+
+// Regression companion: a valid `gitdir: …` pointer file (the
+// format git writes for linked worktrees and submodules) must still
+// be accepted as a repo.
+func TestDiscover_AcceptsGitdirPointerFile(t *testing.T) {
+	root := t.TempDir()
+	realGit := filepath.Join(t.TempDir(), "real-git-dir")
+	if err := os.MkdirAll(realGit, 0o755); err != nil {
+		t.Fatal(err)
+	}
+
+	wt := filepath.Join(root, "worktree")
+	if err := os.MkdirAll(wt, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	// Pointer file: git's actual format is literally "gitdir: <path>\n".
+	pointer := []byte("gitdir: " + realGit + "\n")
+	if err := os.WriteFile(filepath.Join(wt, ".git"), pointer, 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].RelPath != "worktree" {
+		t.Fatalf("expected to discover the gitdir-pointer worktree, got %+v", repos)
+	}
+}
+
+func TestDiscover_RejectsSymlinkGit(t *testing.T) {
+	root := t.TempDir()
+	// A "repo" whose .git is a symlink — should not be picked up.
+	bad := filepath.Join(root, "weird")
+	if err := os.MkdirAll(bad, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.Symlink("/etc/hostname", filepath.Join(bad, ".git")); err != nil {
+		t.Skipf("symlink unsupported here: %v", err)
+	}
+	mustMkRepo(t, filepath.Join(root, "real"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].Slug != "real" {
+		t.Fatalf("expected only `real`, got %+v", repos)
+	}
+}
+
+func TestDiscover_MaxDepthHonored(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "shallow"))
+	mustMkRepo(t, filepath.Join(root, "a", "b", "c", "deep"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 2)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].Slug != "shallow" {
+		t.Fatalf("expected only shallow, got %+v", repos)
+	}
+}
+
+func mustMkRepo(t *testing.T, path string) {
+	t.Helper()
+	if err := os.MkdirAll(filepath.Join(path, ".git"), 0o755); err != nil {
+		t.Fatal(err)
+	}
+}
+
+func TestDiscover_FindsBareRepo(t *testing.T) {
+	root := t.TempDir()
+	bare := filepath.Join(root, "myrepo.git")
+	for _, name := range []string{"HEAD", "objects", "refs"} {
+		full := filepath.Join(bare, name)
+		if name == "HEAD" {
+			if err := os.MkdirAll(bare, 0o755); err != nil {
+				t.Fatal(err)
+			}
+			if err := os.WriteFile(full, []byte("ref: refs/heads/main\n"), 0o644); err != nil {
+				t.Fatal(err)
+			}
+		} else {
+			if err := os.MkdirAll(full, 0o755); err != nil {
+				t.Fatal(err)
+			}
+		}
+	}
+	// Decoy: dir with HEAD only — must not be picked up.
+	decoy := filepath.Join(root, "not-a-repo")
+	if err := os.MkdirAll(decoy, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(decoy, "HEAD"), []byte("nope\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].Slug != "myrepo.git" {
+		t.Fatalf("expected single myrepo.git, got %+v", repos)
+	}
+}
diff --git a/internal/scan/ignore.go b/internal/scan/ignore.go
new file mode 100644
index 0000000..bffce2c
--- /dev/null
+++ b/internal/scan/ignore.go
@@ -0,0 +1,231 @@
+package scan
+
+import (
+	"bufio"
+	"fmt"
+	"os"
+	"path"
+	"path/filepath"
+	"strings"
+)
+
+// IgnoreRule is one line of the ignore file, pre-parsed.
+// Supported syntax (a subset of gitignore, enough for directory pruning):
+//
+//	# comment (ignored)
+//	node_modules (literal name — matches as basename or full path)
+//	vendor/ (directory-only; matches dirs named vendor)
+//	archive/* (glob — forwarded to path.Match on the rel path)
+//	**/generated (deep-match: same as `generated` matched anywhere)
+//	!important (negation — re-includes a match)
+//
+// Matching is evaluated in order; the last rule to match wins, following
+// gitignore semantics so `!foo` after `foo/` re-includes foo.
+type IgnoreRule struct {
+	Pattern string
+	Negate  bool
+	DirOnly bool
+}
+
+type Matcher struct {
+	rules []IgnoreRule
+}
+
+func NewMatcher(patterns []string) *Matcher {
+	m := &Matcher{}
+	for _, p := range patterns {
+		if r, ok := parseRule(p); ok {
+			m.rules = append(m.rules, r)
+		}
+	}
+	return m
+}
+
+// LoadMatcher reads an ignore file and returns a matcher. Missing files
+// are rejected — this entry point is used when the path was specified
+// explicitly (e.g. `--ignore-file foo.ignore`), where a typo should
+// fail loudly instead of silently disabling all ignore rules and
+// widening discovery scope. Callers that want "load if present, empty
+// matcher otherwise" semantics (implicit default-path lookup) should
+// os.Stat first.
+func LoadMatcher(file string) (*Matcher, error) {
+	f, err := os.Open(file)
+	if err != nil {
+		return nil, fmt.Errorf("open %s: %w", file, err)
+	}
+	defer f.Close()
+
+	var patterns []string
+	scanner := bufio.NewScanner(f)
+	for scanner.Scan() {
+		patterns = append(patterns, scanner.Text())
+	}
+	if err := scanner.Err(); err != nil {
+		return nil, fmt.Errorf("read %s: %w", file, err)
+	}
+	return NewMatcher(patterns), nil
+}
+
+func parseRule(line string) (IgnoreRule, bool) {
+	line = strings.TrimSpace(line)
+	if line == "" || strings.HasPrefix(line, "#") {
+		return IgnoreRule{}, false
+	}
+	r := IgnoreRule{}
+	if strings.HasPrefix(line, "!") {
+		r.Negate = true
+		line = line[1:]
+	}
+	if strings.HasSuffix(line, "/") {
+		r.DirOnly = true
+		line = strings.TrimSuffix(line, "/")
+	}
+	r.Pattern = line
+	return r, r.Pattern != ""
+}
+
+// Match reports whether relPath should be ignored. isDir hints directory-
+// only rules (`vendor/` only fires on dirs).
+//
+// Evaluation order: scan every rule, track the last match. This is what
+// gitignore does — it lets `!src/keep` override a broader earlier `src/`
+// block without forcing the user to care about rule ordering beyond the
+// obvious "put exceptions after the thing they exclude".
+func (m *Matcher) Match(relPath string, isDir bool) bool {
+	relPath = filepath.ToSlash(relPath)
+	matched := false
+	for _, r := range m.rules {
+		if r.DirOnly && !isDir {
+			continue
+		}
+		if !matchRule(r, relPath) {
+			continue
+		}
+		matched = !r.Negate
+	}
+	return matched
+}
+
+// CouldReinclude reports whether any negation rule targets a descendant
+// of dir — i.e. walking into the dir could still yield a re-included
+// path. Callers use this to decide whether to prune an ignored
+// directory via filepath.SkipDir or descend into it and evaluate
+// children individually.
+//
+// Without this check the walker short-circuits the matcher's last-
+// match-wins semantics: a `vendor/` + `!vendor/keep` pair would skip
+// the vendor subtree entirely before vendor/keep could be examined,
+// silently dropping the re-included path.
+//
+// Returns true when:
+// - a negation's pattern has no path separator (basename rules like
+// `!keep` or `!*.go` that could fire at any depth), OR
+// - a negation begins with `**/` (deep-match; applies anywhere in the
+// tree), OR
+// - the pattern's leading path segments are each compatible with the
+// corresponding segment of dir via path.Match, so its trailing
+// segment(s) could name a descendant. This covers literal prefixes
+// like `!vendor/keep` AND globbed prefixes like `!vendor*/keep` or
+// `!*/keep` — both can match children of an ignored `vendor`.
+//
+// Negations that target a sibling or ancestor path (e.g. `!src/keep`
+// when walking vendor) correctly don't trigger descent, so pruning
+// remains effective for unrelated ignored trees like `node_modules/`.
+func (m *Matcher) CouldReinclude(dir string) bool {
+	dir = filepath.ToSlash(dir)
+	dirSegs := strings.Split(dir, "/")
+	for _, r := range m.rules {
+		if !r.Negate {
+			continue
+		}
+		pat := r.Pattern
+		// `**/foo` and other deep-match patterns can fire at any
+		// depth — walking into any ignored subtree could reach one.
+		if strings.HasPrefix(pat, "**/") {
+			return true
+		}
+		// Basename-only rules (no `/`) apply to any segment. `!keep`
+		// could re-include vendor/keep, a/b/keep, anywhere.
+		if !strings.Contains(pat, "/") {
+			return true
+		}
+		patSegs := strings.Split(pat, "/")
+		// The pattern must have strictly more segments than dir;
+		// otherwise its deepest named entity is dir itself or an
+		// ancestor, never a descendant worth descending for.
+		if len(patSegs) <= len(dirSegs) {
+			continue
+		}
+		// Each of the pattern's leading segments must be compatible
+		// with the corresponding dir segment. path.Match handles both
+		// literal names (`vendor`) and globs (`vendor*`, `*`, `?`)
+		// uniformly; a failing match on any segment rules this
+		// negation out.
+		ok := true
+		for i := 0; i < len(dirSegs); i++ {
+			matched, err := path.Match(patSegs[i], dirSegs[i])
+			if err != nil || !matched {
+				ok = false
+				break
+			}
+		}
+		if ok {
+			return true
+		}
+	}
+	return false
+}
+
+func matchRule(r IgnoreRule, relPath string) bool {
+	pat := r.Pattern
+
+	// **/foo → strip leading **/ and treat as "match foo anywhere" —
+	// same as the basename-or-suffix check below. We don't support
+	// arbitrary ** in the middle of patterns because users coming from
+	// gitignore expect the common forms (prefix dir, basename, ext),
+	// not the full double-star algebra. Saves us writing a mini doublestar
+	// engine for a marginal feature.
+	if strings.HasPrefix(pat, "**/") {
+		pat = strings.TrimPrefix(pat, "**/")
+	}
+
+	base := path.Base(relPath)
+
+	// Literal basename: `node_modules` matches any segment named that.
+	if pat == base {
+		return true
+	}
+	// Any segment of the path equals the pattern (literal dir/file name
+	// embedded in the tree). "vendor" matches "a/b/vendor/c.go".
+	if !strings.ContainsAny(pat, "*?[") && segmentMatch(pat, relPath) {
+		return true
+	}
+	// Glob on basename: "*.log"
+	if matched, _ := path.Match(pat, base); matched {
+		return true
+	}
+	// Glob on full relative path: "archive/*"
+	if matched, _ := path.Match(pat, relPath); matched {
+		return true
+	}
+	// Directory-prefix glob like "archive/*": strip the trailing "*"
+	// (and any "/") and match anything under that prefix. Bare
+	// directory rules like "archive/" were already reduced to
+	// "archive" in parseRule and are caught by the segment checks
+	// above, so this branch only fires for patterns that actually
+	// shrank after stripping.
+	prefix := strings.TrimSuffix(pat, "*")
+	prefix = strings.TrimSuffix(prefix, "/")
+	if prefix != "" && prefix != pat {
+		if strings.HasPrefix(relPath, prefix+"/") || relPath == prefix {
+			return true
+		}
+	}
+	return false
+}
+
+func segmentMatch(needle, relPath string) bool {
+	for _, seg := range strings.Split(relPath, "/") {
+		if seg == needle {
+			return true
+		}
+	}
+	return false
+}
+
diff --git a/internal/scan/ignore_test.go b/internal/scan/ignore_test.go
new file mode 100644
index 0000000..cda4f25
--- /dev/null
+++ b/internal/scan/ignore_test.go
@@ -0,0 +1,87 @@
+package scan
+
+import (
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+)
+
+func TestMatcher_Basics(t *testing.T) {
+	m := NewMatcher([]string{
+		"# comment line",
+		"",
+		"node_modules",
+		"vendor/",
+		"archive/*",
+		"*.log",
+		"!important.log",
+		"**/generated",
+	})
+
+	cases := []struct {
+		name  string
+		path  string
+		isDir bool
+		want  bool
+	}{
+		{"basename dir match", "a/b/node_modules", true, true},
+		{"basename file match in middle", "a/node_modules/c.go", false, true},
+		{"vendor dir-only match", "src/vendor", true, true},
+		{"vendor on file does not match", "src/vendor", false, false},
+		{"glob in subdir", "archive/2024", true, true},
+		{"glob extension", "out/build.log", false, true},
+		{"negation re-includes", "out/important.log", false, false},
+		{"deep generated", "src/foo/generated", true, true},
+		{"unrelated path", "src/main.go", false, false},
+	}
+
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			got := m.Match(c.path, c.isDir)
+			if got != c.want {
+				t.Errorf("Match(%q, dir=%v) = %v, want %v", c.path, c.isDir, got, c.want)
+			}
+		})
+	}
+}
+
+// LoadMatcher is called when the user supplied an explicit
+// --ignore-file; a missing file there is almost always a typo. If
+// we silently returned an empty matcher, every discovery would
+// widen to include node_modules/, vendor/, chromium-clones/, etc.
+// that the user thought they had excluded — and they'd have no way to
+// tell from the console output. Fail loudly instead; the
+// default-path lookup in scan.loadMatcher handles the "silent when
+// absent" case via its own os.Stat before calling here.
+func TestLoadMatcher_MissingFileFails(t *testing.T) {
+	_, err := LoadMatcher(filepath.Join(t.TempDir(), "typo.ignore"))
+	if err == nil {
+		t.Fatal("expected error for missing explicit ignore file; a typo must not silently disable rules")
+	}
+	if !os.IsNotExist(err) && !strings.Contains(err.Error(), "typo.ignore") {
+		t.Errorf("error should identify the missing path; got %q", err)
+	}
+}
+
+func TestLoadMatcher_FromFile(t *testing.T) {
+	dir := t.TempDir()
+	path := filepath.Join(dir, ".gitcortex-ignore")
+	contents := "# comment\nnode_modules\nvendor/\n"
+	if err := os.WriteFile(path, []byte(contents), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	m, err := LoadMatcher(path)
+	if err != nil {
+		t.Fatalf("LoadMatcher: %v", err)
+	}
+	if !m.Match("foo/node_modules", true) {
+		t.Error("node_modules should match")
+	}
+	if !m.Match("vendor", true) {
+		t.Error("vendor/ should match dirs")
+	}
+	if m.Match("vendor", false) {
+		t.Error("vendor/ should not match files")
+	}
+}
diff --git a/internal/scan/scan.go b/internal/scan/scan.go
new file mode 100644
index 0000000..a634f21
--- /dev/null
+++ b/internal/scan/scan.go
@@ -0,0 +1,304 @@
+package scan
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"log"
+	"os"
+	"path/filepath"
+	"sync"
+	"time"
+
+	"github.com/lex0c/gitcortex/internal/extract"
+	"github.com/lex0c/gitcortex/internal/git"
+)
+
+// Config holds scan command input. The Extract template is copied per-repo
+// and its Repo/Output/StateFile/Branch are overwritten per worker — the
+// template exists to carry shared flags (ignore patterns, batch size,
+// mailmap, etc.) without re-declaring them on the scan surface.
+type Config struct {
+	Roots      []string
+	Output     string
+	IgnoreFile string
+	MaxDepth   int
+	Parallel   int
+	Extract    extract.Config
+}
+
+// Manifest is persisted to <output>/manifest.json so subsequent runs
+// (or the report stage) can discover which JSONL files to consolidate
+// without re-walking the filesystem.
+type Manifest struct {
+	GeneratedAt string         `json:"generated_at"`
+	Roots       []string       `json:"roots"`
+	Repos       []ManifestRepo `json:"repos"`
+}
+
+type ManifestRepo struct {
+	Slug       string `json:"slug"`
+	Path       string `json:"path"`
+	Branch     string `json:"branch"`
+	JSONL      string `json:"jsonl"`
+	StateFile  string `json:"state_file"`
+	Status     string `json:"status"` // ok | failed | skipped | pending
+	Error      string `json:"error,omitempty"`
+	Commits    int    `json:"commits,omitempty"`
+	DurationMS int64  `json:"duration_ms,omitempty"`
+}
+
+// Result is returned to the caller so the CLI can drive the report stage
+// with the concrete JSONL paths (no need to re-read the manifest).
+type Result struct {
+	OutputDir string
+	Manifest  Manifest
+	JSONLs    []string // only successful repos
+}
+
+// Run discovers repos under cfg.Roots, writes one JSONL per repo via
+// the existing extract pipeline, and emits a manifest. Failures on
+// individual repos are recorded in the manifest with Status="failed"
+// but do not abort the whole scan — the report can still be generated
+// from the repos that succeeded.
+func Run(ctx context.Context, cfg Config) (*Result, error) {
+	if len(cfg.Roots) == 0 {
+		return nil, fmt.Errorf("scan: at least one --root is required")
+	}
+	if cfg.Output == "" {
+		return nil, fmt.Errorf("scan: --output is required")
+	}
+	if err := os.MkdirAll(cfg.Output, 0o755); err != nil {
+		return nil, fmt.Errorf("create output dir: %w", err)
+	}
+	if cfg.Parallel <= 0 {
+		cfg.Parallel = 1
+	}
+
+	matcher, err := loadMatcher(cfg)
+	if err != nil {
+		return nil, err
+	}
+
+	repos, err := Discover(ctx, cfg.Roots, matcher, cfg.MaxDepth)
+	if err != nil {
+		return nil, err
+	}
+	if len(repos) == 0 {
+		return nil, fmt.Errorf("scan: no git repositories found under %v", cfg.Roots)
+	}
+	log.Printf("scan: discovered %d repositories", len(repos))
+
+	manifest := Manifest{
+		GeneratedAt: time.Now().UTC().Format(time.RFC3339),
+		Roots:       cfg.Roots,
+		Repos:       make([]ManifestRepo, len(repos)),
+	}
+	// Pre-fill every slot with the repo's identity and a "pending"
+	// status. If the user cancels (Ctrl+C) before a worker reaches a
+	// repo, the manifest still names it with Status="pending" instead
+	// of an empty struct that misleads "ok, 0 failed" accounting.
+	for i, r := range repos {
+		manifest.Repos[i] = ManifestRepo{
+			Slug:      r.Slug,
+			Path:      r.AbsPath,
+			JSONL:     filepath.Join(cfg.Output, r.Slug+".jsonl"),
+			StateFile: filepath.Join(cfg.Output, r.Slug+".state"),
+			Status:    "pending",
+		}
+	}
+
+	// Worker pool over repos. Each worker runs extract.Run against one
+	// repo and writes to its dedicated JSONL + state file. The extract
+	// package already checkpoints per-repo state, so an interrupted scan
+	// can be resumed by re-invoking the command — completed repos will
+	// be skipped (their state file's last SHA prevents re-emission) and
+	// partials will resume.
+	type job struct {
+		idx  int
+		repo Repo
+	}
+	jobs := make(chan job)
+	var wg sync.WaitGroup
+
+	for w := 0; w < cfg.Parallel; w++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for j := range jobs {
+				manifest.Repos[j.idx] = runJobSafely(ctx, cfg, j.repo)
+			}
+		}()
+	}
+
+	go func() {
+		for i, r := range repos {
+			select {
+			case <-ctx.Done():
+				close(jobs)
+				return
+			case jobs <- job{idx: i, repo: r}:
+			}
+		}
+		close(jobs)
+	}()
+
+	wg.Wait()
+
+	manifestPath := filepath.Join(cfg.Output, "manifest.json")
+	if err := writeManifest(manifestPath, manifest); err != nil {
+		return nil, fmt.Errorf("write manifest: %w", err)
+	}
+
+	var jsonls []string
+	var ok, failed, pending int
+	for _, r := range manifest.Repos {
+		switch r.Status {
+		case "ok":
+			jsonls = append(jsonls, r.JSONL)
+			ok++
+		case "pending":
+			pending++
+		default:
+			failed++
+		}
+	}
+	log.Printf("scan: %d ok, %d failed, %d not started; manifest at %s", ok, failed, pending, manifestPath)
+
+	res := &Result{
+		OutputDir: cfg.Output,
+		Manifest:  manifest,
+		JSONLs:    jsonls,
+	}
+	// Surface cancellation as a non-nil error so the CLI exits non-zero
+	// — but return the partial result anyway so the caller can still
+	// inspect what completed before the interrupt.
+	if err := ctx.Err(); err != nil {
+		return res, err
+	}
+	return res, nil
+}
+
+func loadMatcher(cfg Config) (*Matcher, error) {
+	// Per-root default: look for .gitcortex-ignore at the first root when
+	// no explicit path was given. Single-file load is enough for the MVP;
+	// if users need per-root rule sets they can concat them into one file.
+	path := cfg.IgnoreFile
+	if path == "" && len(cfg.Roots) > 0 {
+		candidate := filepath.Join(cfg.Roots[0], ".gitcortex-ignore")
+		if _, err := os.Stat(candidate); err == nil {
+			path = candidate
+		}
+	}
+	if path == "" {
+		return NewMatcher(nil), nil
+	}
+	return LoadMatcher(path)
+}
+
+// runJobFn is the injection point the worker pool uses. Default is
+// the real runOne; tests override it to simulate extract panics and
+// verify the pool's recovery path. Not exposed via any user-facing
+// API — this is strictly a test seam.
+var runJobFn = runOne
+
+// runJobSafely wraps runJobFn in a defer-recover. If extract panics
+// inside a worker the goroutine would otherwise crash the entire
+// process and the repo's manifest slot would stay at "pending", which
+// is indistinguishable from a job that was never dispatched. Convert
+// panics to a well-formed ManifestRepo with Status="failed" so the
+// manifest reflects reality and the rest of the pool keeps running.
+//
+// extract.Run doesn't panic under normal inputs — the recover exists
+// for the "impossible" cases (e.g. a git binary that writes malformed
+// output, nil-deref bugs introduced by a future refactor). Better to
+// isolate the blast radius than leave a silent corruption mode.
+func runJobSafely(ctx context.Context, cfg Config, repo Repo) (entry ManifestRepo) {
+	defer func() {
+		if r := recover(); r != nil {
+			entry = ManifestRepo{
+				Slug:      repo.Slug,
+				Path:      repo.AbsPath,
+				JSONL:     filepath.Join(cfg.Output, repo.Slug+".jsonl"),
+				StateFile: filepath.Join(cfg.Output, repo.Slug+".state"),
+				Status:    "failed",
+				Error:     fmt.Sprintf("panic: %v", r),
+			}
+			log.Printf("scan: [%s] panicked: %v", repo.Slug, r)
+		}
+	}()
+	return runJobFn(ctx, cfg, repo)
+}
+
+func runOne(ctx context.Context, cfg Config, repo Repo) ManifestRepo {
+	jsonlPath := filepath.Join(cfg.Output, repo.Slug+".jsonl")
+	statePath := filepath.Join(cfg.Output, repo.Slug+".state")
+	entry := ManifestRepo{
+		Slug:      repo.Slug,
+		Path:      repo.AbsPath,
+		JSONL:     jsonlPath,
+		StateFile: statePath,
+	}
+
+	// Skip cleanly if the user already cancelled before this worker
+	// picked up the job — avoids a noisy "extracting..." log followed
+	// by an immediate ctx.Err() from extract.
+	if err := ctx.Err(); err != nil {
+		entry.Status = "skipped"
+		entry.Error = err.Error()
+		return entry
+	}
+
+	branch := cfg.Extract.Branch
+	if branch == "" {
+		branch = git.DetectDefaultBranch(repo.AbsPath)
+	}
+	if branch == "" {
+		entry.Status = "failed"
+		entry.Error = "could not detect default branch"
+		return entry
+	}
+	entry.Branch = branch
+
+	// Copy the extract config so each worker gets its own Repo/Output/State
+	// without racing on shared fields. Preserves ignore patterns, batch
+	// size, mailmap, etc. from the template.
+	sub := cfg.Extract
+	sub.Repo = repo.AbsPath
+	sub.Branch = branch
+	sub.Output = jsonlPath
+	sub.StateFile = statePath
+	if sub.CommandTimeout == 0 {
+		sub.CommandTimeout = extract.DefaultCommandTimeout
+	}
+	// Always force StartOffset back to -1 (= "load from state file")
+	// regardless of the template's value. Programmatic callers passing
+	// a zero-value Config would otherwise feed flagOffset=0 to
+	// extract.LoadState, which would IGNORE the per-repo state file and
+	// re-extract from scratch. StartSHA gets cleared for the same
+	// reason: a single CLI-level SHA cannot apply to N different repos.
+	sub.StartSHA = ""
+	sub.StartOffset = -1
+
+	log.Printf("scan: [%s] extracting from %s (branch %s)", repo.Slug, repo.AbsPath, branch)
+
+	start := time.Now()
+	if err := extract.Run(ctx, sub); err != nil {
+		entry.Status = "failed"
+		entry.Error = fmt.Errorf("extract %s: %w", repo.Slug, err).Error()
+		entry.DurationMS = time.Since(start).Milliseconds()
+		log.Printf("scan: [%s] failed: %v", repo.Slug, err)
+		return entry
+	}
+	entry.Status = "ok"
+	entry.DurationMS = time.Since(start).Milliseconds()
+	return entry
+}
+
+func writeManifest(path string, m Manifest) error {
+	data, err := json.MarshalIndent(m, "", " ")
+	if err != nil {
+		return err
+	}
+	return os.WriteFile(path, data, 0o644)
+}
diff --git a/internal/scan/scan_test.go b/internal/scan/scan_test.go
new file mode 100644
index 0000000..ad0c073
--- /dev/null
+++ b/internal/scan/scan_test.go
@@ -0,0 +1,211 @@
+package scan
+
+import (
+	"context"
+	"encoding/json"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"github.com/lex0c/gitcortex/internal/extract"
+	"github.com/lex0c/gitcortex/internal/stats"
+)
+
+// Regression: if extract.Run panics inside a worker (a future
+// refactor with a nil deref, an unexpected git binary output, etc.),
+// defer wg.Done() still closes the WaitGroup — but without a
+// recover, the goroutine crashes the whole process before the
+// manifest slot for that repo is written. Even without a full crash,
+// the slot would silently stay at "pending", indistinguishable from
+// "worker never reached it". The pool must convert panics to a
+// ManifestRepo with Status="failed" and continue with sibling jobs.
+//
+// Override the injection seam runJobFn: first repo panics, second
+// runs normally. Assert (1) the pool survives, (2) the panicking
+// repo's slot is Status=failed with a panic: prefix, (3) the
+// non-panicking repo still completes.
+func TestRun_RecoversFromWorkerPanic(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skip("git not installed")
+	}
+
+	root := t.TempDir()
+	makeRealRepo(t, filepath.Join(root, "boom"), map[string]string{"x.go": "x\n"})
+	makeRealRepo(t, filepath.Join(root, "ok"), map[string]string{"y.go": "y\n"})
+
+	// Override runJobFn for the duration of this test only.
+	original := runJobFn
+	defer func() { runJobFn = original }()
+	runJobFn = func(ctx context.Context, cfg Config, repo Repo) ManifestRepo {
+		if repo.Slug == "boom" {
+			panic("simulated extract failure")
+		}
+		return original(ctx, cfg, repo)
+	}
+
+	output := filepath.Join(t.TempDir(), "out")
+	cfg := Config{
+		Roots:    []string{root},
+		Output:   output,
+		Parallel: 1, // force serial so the panicking job is reached
+		Extract: extract.Config{
+			BatchSize:      100,
+			CommandTimeout: extract.DefaultCommandTimeout,
+		},
+	}
+
+	res, err := Run(context.Background(), cfg)
+	if err != nil {
+		t.Fatalf("Run should not propagate the worker panic as an error; got %v", err)
+	}
+
+	gotStatus := map[string]ManifestRepo{}
+	for _, r := range res.Manifest.Repos {
+		gotStatus[r.Slug] = r
+	}
+	boom, ok := gotStatus["boom"]
+	if !ok {
+		t.Fatal("boom slot missing from manifest")
+	}
+	if boom.Status != "failed" {
+		t.Errorf("boom should be failed after panic, got Status=%q", boom.Status)
+	}
+	if !strings.Contains(boom.Error, "panic") {
+		t.Errorf("boom error should carry a panic: prefix so operators can tell panics from normal failures; got %q", boom.Error)
+	}
+	if okRepo, okFound := gotStatus["ok"]; !okFound || okRepo.Status != "ok" {
+		t.Errorf("sibling job must still complete despite the other worker's panic; got %+v", okRepo)
+	}
+	// Ensure the pool didn't leak: all slots have a non-pending status.
+	for _, r := range res.Manifest.Repos {
+		if r.Status == "pending" {
+			t.Errorf("repo %s left at 'pending' after Run returned — the pool abandoned a slot", r.Slug)
+		}
+	}
+}
+
+func TestRun_EndToEnd(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skip("git not installed")
+	}
+
+	root := t.TempDir()
+	makeRealRepo(t, filepath.Join(root, "alpha"), map[string]string{
+		"main.go":   "package main\nfunc main() {}\n",
+		"README.md": "# alpha\n",
+	})
+	makeRealRepo(t, filepath.Join(root, "beta"), map[string]string{
+		"app.py": "print('hi')\n",
+	})
+	// One node_modules-style decoy that should be ignored.
+	makeRealRepo(t, filepath.Join(root, "node_modules", "garbage"), map[string]string{
+		"x.js": "x\n",
+	})
+
+	output := filepath.Join(t.TempDir(), "out")
+	cfg := Config{
+		Roots:    []string{root},
+		Output:   output,
+		Parallel: 2,
+		Extract: extract.Config{
+			BatchSize:      100,
+			CommandTimeout: extract.DefaultCommandTimeout,
+		},
+	}
+	cfg.IgnoreFile = writeIgnoreFile(t, root, "node_modules\n")
+
+	res, err := Run(context.Background(), cfg)
+	if err != nil {
+		t.Fatalf("Run: %v", err)
+	}
+
+	if len(res.JSONLs) != 2 {
+		t.Fatalf("expected 2 JSONLs, got %d (%v)", len(res.JSONLs), res.JSONLs)
+	}
+
+	// Manifest sanity
+	manifestPath := filepath.Join(output, "manifest.json")
+	mb, err := os.ReadFile(manifestPath)
+	if err != nil {
+		t.Fatalf("read manifest: %v", err)
+	}
+	var m Manifest
+	if err := json.Unmarshal(mb, &m); err != nil {
+		t.Fatalf("parse manifest: %v", err)
+	}
+	if len(m.Repos) != 2 {
+		t.Fatalf("manifest should have 2 repos, got %d", len(m.Repos))
+	}
+	for _, r := range m.Repos {
+		if r.Status != "ok" {
+			t.Errorf("repo %s status=%s err=%s", r.Slug, r.Status, r.Error)
+		}
+	}
+
+	// Consolidated load + breakdown
+	ds, err := stats.LoadMultiJSONL(res.JSONLs)
+	if err != nil {
+		t.Fatalf("LoadMultiJSONL: %v", err)
+	}
+	breakdown := stats.RepoBreakdown(ds, "")
+	if len(breakdown) != 2 {
+		t.Fatalf("expected breakdown across 2 repos, got %d: %+v", len(breakdown), breakdown)
+	}
+	got := map[string]int{}
+	for _, b := range breakdown {
+		got[b.Repo] = b.Commits
+	}
+	for _, name := range []string{"alpha", "beta"} {
+		if got[name] == 0 {
+			t.Errorf("repo %s missing or has 0 commits in breakdown: %v", name, got)
+		}
+	}
+}
+
+// makeRealRepo initializes a git repo with the given files and a single
+// commit using a deterministic identity so the assertions don't depend on
+// the developer's git config.
+func makeRealRepo(t *testing.T, dir string, files map[string]string) {
+	t.Helper()
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	for name, contents := range files {
+		full := filepath.Join(dir, name)
+		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
+			t.Fatal(err)
+		}
+		if err := os.WriteFile(full, []byte(contents), 0o644); err != nil {
+			t.Fatal(err)
+		}
+	}
+
+	runGit := func(args ...string) {
+		cmd := exec.Command("git", args...)
+		cmd.Dir = dir
+		cmd.Env = append(os.Environ(),
+			"GIT_AUTHOR_NAME=Test",
+			"GIT_AUTHOR_EMAIL=test@example.com",
+			"GIT_COMMITTER_NAME=Test",
+			"GIT_COMMITTER_EMAIL=test@example.com",
+		)
+		out, err := cmd.CombinedOutput()
+		if err != nil {
+			t.Fatalf("git %s: %v\n%s", strings.Join(args, " "), err, out)
+		}
+	}
+	runGit("init", "-q", "-b", "main")
+	runGit("add", ".")
+	runGit("commit", "-q", "-m", "initial")
+}
+
+func writeIgnoreFile(t *testing.T, root, contents string) string {
+	t.Helper()
+	path := filepath.Join(root, ".gitcortex-ignore")
+	if err := os.WriteFile(path, []byte(contents), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	return path
+}
diff --git a/internal/stats/reader.go b/internal/stats/reader.go
index 87fc4ef..4162249 100644
--- a/internal/stats/reader.go
+++ b/internal/stats/reader.go
@@ -22,6 +22,11 @@ type commitEntry struct {
 	del     int64
 	files   int
 	message string
+	// repo carries the LoadMultiJSONL pathPrefix (without the trailing
+	// colon) so RepoBreakdown can attribute commits back to their source
+	// repository on multi-repo scans. Empty for single-file loads, which
+	// keeps single-repo callers' behavior unchanged.
+	repo string
 }
 
 type fileEntry struct {
@@ -82,6 +87,17 @@ type Dataset struct {
 	files    map[string]*fileEntry
 	workGrid [7][24]int
 
+	// commitsByRepo preserves commit records per repository so the
+	// per-repo breakdown stays correct when two repos share a SHA
+	// (forks, mirrors, cherry-picks between sibling projects). The
+	// commits map above is keyed by SHA and drops the losing side of
+	// a collision — acceptable for file-side lookups where a single
+	// record per SHA is enough — but RepoBreakdown would then
+	// silently under-count the overwritten repo's commits and churn.
+	// Populated at ingest in lockstep with commits; a single-repo
+	// load (no pathPrefix) keeps all entries under the "(repo)" key.
+	commitsByRepo map[string][]*commitEntry
+
 	// Coupling (pre-aggregated during load)
 	couplingPairs       map[filePair]int
 	couplingFileChanges map[string]int
@@ -158,6 +174,7 @@ func newDataset() *Dataset {
 		contribFiles: make(map[string]map[string]struct{}),
 		contribFirst: make(map[string]time.Time),
 		contribLast:  make(map[string]time.Time),
+		commitsByRepo: make(map[string][]*commitEntry),
 	}
 }
 
@@ -235,14 +252,25 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string
 	ds.TotalDeletions += c.Deletions
 	ds.TotalFilesChanged += int64(c.FilesChanged)
 
-	ds.commits[c.SHA] = &commitEntry{
+	entry := &commitEntry{
 		email:   c.AuthorEmail,
 		date:    t,
 		add:     c.Additions,
 		del:     c.Deletions,
 		files:   c.FilesChanged,
 		message: c.Message,
+		repo:    strings.TrimSuffix(pathPrefix, ":"),
+	}
+	ds.commits[c.SHA] = entry
+	// Record separately per-repo so RepoBreakdown survives
+	// SHA collisions across repositories (forks/mirrors/
+	// cherry-picks). Key defaults to "(repo)" in single-repo
+	// loads so the fallback row in RepoBreakdown still works.
+	repoKey := entry.repo
+	if repoKey == "" {
+		repoKey = "(repo)"
 	}
+	ds.commitsByRepo[repoKey] = append(ds.commitsByRepo[repoKey], entry)
 
 	// Contributors
 	cs, ok := ds.contributors[c.AuthorEmail]
diff --git a/internal/stats/repo_breakdown.go b/internal/stats/repo_breakdown.go
new file mode 100644
index 0000000..57822e8
--- /dev/null
+++ b/internal/stats/repo_breakdown.go
@@ -0,0 +1,188 @@
+package stats
+
+import (
+	"sort"
+	"strings"
+	"time"
+)
+
+// RepoStat aggregates commit-level activity for one repository in a
+// multi-repo Dataset. Populated only when LoadMultiJSONL ran with more
+// than one input file (single-repo loads have no path prefix to group by,
+// so the breakdown would be a no-op).
+type RepoStat struct {
+	Repo            string
+	Commits         int
+	Additions       int64
+	Deletions       int64
+	Churn           int64
+	Files           int
+	ActiveDays      int
+	UniqueDevs      int
+	FirstCommitDate string
+	LastCommitDate  string
+	// PctOfTotalCommits / PctOfTotalChurn are computed against the
+	// dataset totals so the report can show "this repo was 38% of
+	// your commits" without the template needing to do the math.
+	PctOfTotalCommits float64
+	PctOfTotalChurn   float64
+}
+
+// RepoBreakdown groups commits by repo and returns one row per repo,
+// sorted by commit count descending. When emailFilter is non-empty,
+// only commits authored by that email are counted — this is the
+// "show my work across repos" lens used by the scan report.
+//
+// Source of truth is ds.commitsByRepo, a per-repository slice
+// populated at ingest. The SHA-keyed ds.commits map cannot be used
+// directly because two repos that share a commit SHA (fork, mirror,
+// cherry-pick) would lose the earlier ingest to the later one,
+// under-counting one repo's totals. Iterating commitsByRepo sidesteps
+// that collision.
+//
+// Fallback: test datasets that hand-build ds.commits without
+// populating commitsByRepo still work — we detect the empty map
+// and bucket ds.commits by repo as before. Production ingest always
+// populates commitsByRepo.
+//
+// On single-repo datasets (no prefix tag), the function returns a
+// single row labeled "(repo)" so callers can render without a special
+// case.
+func RepoBreakdown(ds *Dataset, emailFilter string) []RepoStat {
+	type acc struct {
+		commits     int
+		add, del    int64
+		days        map[string]struct{}
+		devs        map[string]struct{}
+		first, last time.Time
+	}
+
+	repos := make(map[string]*acc)
+	emailLower := strings.ToLower(strings.TrimSpace(emailFilter))
+
+	visit := func(key string, c *commitEntry) {
+		if emailLower != "" && strings.ToLower(strings.TrimSpace(c.email)) != emailLower {
+			return
+		}
+		a, ok := repos[key]
+		if !ok {
+			a = &acc{
+				days: make(map[string]struct{}),
+				devs: make(map[string]struct{}),
+			}
+			repos[key] = a
+		}
+		a.commits++
+		a.add += c.add
+		a.del += c.del
+		if c.email != "" {
+			a.devs[strings.ToLower(c.email)] = struct{}{}
+		}
+		if !c.date.IsZero() {
+			a.days[c.date.UTC().Format("2006-01-02")] = struct{}{}
+			if a.first.IsZero() || c.date.Before(a.first) {
+				a.first = c.date
+			}
+			if c.date.After(a.last) {
+				a.last = c.date
+			}
+		}
+	}
+
+	if len(ds.commitsByRepo) > 0 {
+		for repo, entries := range ds.commitsByRepo {
+			for _, c := range entries {
+				visit(repo, c)
+			}
+		}
+	} else {
+		// Fallback for hand-built test datasets that populate
+		// ds.commits directly. Production ingest always goes through
+		// streamLoadInto and populates commitsByRepo.
+		for _, c := range ds.commits {
+			key := c.repo
+			if key == "" {
+				key = "(repo)"
+			}
+			visit(key, c)
+		}
+	}
+
+	// File counts come from ds.files. Path prefix is `:` so
+	// the split key matches the commit-side `c.repo`. Files without a
+	// prefix (single-repo loads) bucket under the same "(repo)" label.
+	//
+	// When emailFilter is set the count must reflect files THIS dev
+	// touched, not the repo's total — the unfiltered count would
+	// massively over-state per-dev scope (e.g. "I worked on 12k files
+	// in monorepo X" when really the dev touched 30). devCommits
+	// tracks per-file dev appearance and is the right denominator.
+	fileCounts := make(map[string]int)
+	for path, fe := range ds.files {
+		if emailLower != "" {
+			matched := false
+			for devEmail := range fe.devCommits {
+				if strings.ToLower(strings.TrimSpace(devEmail)) == emailLower {
+					matched = true
+					break
+				}
+			}
+			if !matched {
+				continue
+			}
+		}
+		key := ""
+		if i := strings.Index(path, ":"); i >= 0 {
+			key = path[:i]
+		}
+		if key == "" {
+			key = "(repo)"
+		}
+		fileCounts[key]++
+	}
+
+	// Totals computed AFTER the email filter: percentages are relative
+	// to the filtered universe so a dev profile shows "60% of MY commits
+	// happened in repo X", not "0.4% of the org's commits".
+	var totalCommits int
+	var totalChurn int64
+	for _, a := range repos {
+		totalCommits += a.commits
+		totalChurn += a.add + a.del
+	}
+
+	out := make([]RepoStat, 0, len(repos))
+	for repo, a := range repos {
+		stat := RepoStat{
+			Repo:       repo,
+			Commits:    a.commits,
+			Additions:  a.add,
+			Deletions:  a.del,
+			Churn:      a.add + a.del,
+			Files:      fileCounts[repo],
+			ActiveDays: len(a.days),
+			UniqueDevs: len(a.devs),
+		}
+		if !a.first.IsZero() {
+			stat.FirstCommitDate = a.first.UTC().Format("2006-01-02")
+		}
+		if !a.last.IsZero() {
+			stat.LastCommitDate = a.last.UTC().Format("2006-01-02")
+		}
+		if totalCommits > 0 {
+			stat.PctOfTotalCommits = float64(stat.Commits) / float64(totalCommits) * 100
+		}
+		if totalChurn > 0 {
+			stat.PctOfTotalChurn = float64(stat.Churn) / float64(totalChurn) * 100
+		}
+		out = append(out, stat)
+	}
+
+	sort.Slice(out, func(i, j int) bool {
+		if out[i].Commits != out[j].Commits {
+			return out[i].Commits > out[j].Commits
+		}
+		return out[i].Repo < out[j].Repo
+	})
+	return out
+}
diff --git a/internal/stats/repo_breakdown_test.go b/internal/stats/repo_breakdown_test.go
new file mode 100644
index 0000000..404a563
--- /dev/null
+++ b/internal/stats/repo_breakdown_test.go
@@ -0,0 +1,177 @@
+package stats
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+	"time"
+)
+
+func TestRepoBreakdown_GroupsByPrefix(t *testing.T) {
+	ds := &Dataset{
+		commits: map[string]*commitEntry{
+			"a1": {email: "me@x.com", repo: "alpha", date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 10, del: 0},
+			"a2": {email: "me@x.com", repo: "alpha", date: time.Date(2024, 1, 5, 0, 0, 0, 0, time.UTC), add: 5, del: 2},
+			"b1": {email: "you@x.com", repo: "beta", date: time.Date(2024, 2, 1, 0, 0, 0, 0, time.UTC), add: 30, del: 0},
+		},
+		files: map[string]*fileEntry{
+			"alpha:main.go":   {},
+			"alpha:helper.go": {},
+			"beta:app.py":     {},
+		},
+	}
+
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 2 {
+		t.Fatalf("want 2 repos, got %d", len(breakdown))
+	}
+	// Sort is by commit count desc, so alpha first.
+	if breakdown[0].Repo != "alpha" || breakdown[0].Commits != 2 {
+		t.Errorf("alpha: got %+v", breakdown[0])
+	}
+	if breakdown[0].Files != 2 {
+		t.Errorf("alpha files: got %d, want 2", breakdown[0].Files)
+	}
+	if breakdown[1].Repo != "beta" || breakdown[1].Commits != 1 {
+		t.Errorf("beta: got %+v", breakdown[1])
+	}
+	// Pct totals — alpha 2/3, beta 1/3.
+	if got := breakdown[0].PctOfTotalCommits; got < 66.6 || got > 66.7 {
+		t.Errorf("alpha pct commits: got %.2f, want ~66.67", got)
+	}
+}
+
+func TestRepoBreakdown_EmailFilter(t *testing.T) {
+	ds := &Dataset{
+		commits: map[string]*commitEntry{
+			"a1": {email: "me@x.com", repo: "alpha", date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 10},
+			"a2": {email: "other@x.com", repo: "alpha", date: time.Date(2024, 1, 2, 0, 0, 0, 0, time.UTC), add: 5},
+			"b1": {email: "me@x.com", repo: "beta", date: time.Date(2024, 2, 1, 0, 0, 0, 0, time.UTC), add: 30},
+		},
+		// Per-dev file counts: alpha has 2 files but me@ touched only 1;
+		// beta has 1 file me@ touched. Other dev's exclusive file in
+		// alpha must NOT be counted toward me@'s repo Files.
+		files: map[string]*fileEntry{
+			"alpha:mine.go":  {devCommits: map[string]int{"me@x.com": 1}},
+			"alpha:other.go": {devCommits: map[string]int{"other@x.com": 1}},
+			"beta:both.py":   {devCommits: map[string]int{"me@x.com": 1}},
+		},
+	}
+	breakdown := RepoBreakdown(ds, "me@x.com")
+	if len(breakdown) != 2 {
+		t.Fatalf("want 2 repos for me@x.com, got %d", len(breakdown))
+	}
+	for _, b := range breakdown {
+		if b.Commits != 1 {
+			t.Errorf("repo %s: expected 1 commit (filtered), got %d", b.Repo, b.Commits)
+		}
+		if b.Files != 1 {
+			t.Errorf("repo %s: expected 1 file (filtered), got %d", b.Repo, b.Files)
+		}
+	}
+}
+
+// A rename within a repo collapses two file paths onto one canonical
+// path during finalizeDataset. The collapsed path keeps its `:`
+// prefix, so RepoBreakdown should still attribute the file to the
+// correct repo. Regression guard: without proper handling, the
+// post-rename canonical key would either lose its prefix (and bucket
+// into "(repo)") or duplicate-count.
+func TestRepoBreakdown_SurvivesRename(t *testing.T) {
+	ds := newDataset()
+	ds.commits["c1"] = &commitEntry{
+		email: "me@x.com", repo: "alpha",
+		date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 5,
+	}
+	ds.commits["c2"] = &commitEntry{
+		email: "me@x.com", repo: "alpha",
+		date: time.Date(2024, 1, 5, 0, 0, 0, 0, time.UTC), add: 10,
+	}
+	ds.files["alpha:old.go"] = &fileEntry{
+		devCommits: map[string]int{"me@x.com": 1},
+	}
+	ds.files["alpha:new.go"] = &fileEntry{
+		devCommits: map[string]int{"me@x.com": 1},
+	}
+	ds.renameEdges = []renameEdge{{
+		oldPath: "alpha:old.go", newPath: "alpha:new.go",
+		commitDate: time.Date(2024, 1, 5, 0, 0, 0, 0, time.UTC),
+	}}
+	applyRenames(ds)
+
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 1 || breakdown[0].Repo != "alpha" {
+		t.Fatalf("want single alpha row, got %+v", breakdown)
+	}
+	// After rename collapse: 1 canonical file, both commits attributed.
+	if breakdown[0].Files != 1 {
+		t.Errorf("want 1 file (post-rename), got %d", breakdown[0].Files)
+	}
+	if breakdown[0].Commits != 2 {
+		t.Errorf("want 2 commits, got %d", breakdown[0].Commits)
+	}
+}
+
+// Regression: when two repos share a commit SHA (fork / mirror /
+// cherry-pick) the ingest map ds.commits drops one side. Without a
+// separate per-repo record, RepoBreakdown would silently under-count
+// the repo ingested first and the consolidated report would show
+// wrong percentages. Must preserve both.
+func TestRepoBreakdown_SurvivesCrossRepoSHACollision(t *testing.T) {
+	dir := t.TempDir()
+	// Two repos with IDENTICAL commit rows — a distinct file row per
+	// repo so we can verify both contribute to the breakdown. In real
+	// life this maps to a cherry-picked or mirrored commit where the
+	// SHA matches but the files and stats land in different repos.
+	commitRow := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_date":"2024-01-01T00:00:00Z","author_email":"me@x.com","author_name":"Me","additions":10,"deletions":0,"files_changed":1}`
+	alphaFile := `{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"a.go","additions":10,"deletions":0}`
+	betaFile := `{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"b.go","additions":10,"deletions":0}`
+
+	alpha := filepath.Join(dir, "alpha.jsonl")
+	beta := filepath.Join(dir, "beta.jsonl")
+	if err := os.WriteFile(alpha, []byte(commitRow+"\n"+alphaFile+"\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(beta, []byte(commitRow+"\n"+betaFile+"\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	ds, err := LoadMultiJSONL([]string{alpha, beta})
+	if err != nil {
+		t.Fatalf("LoadMultiJSONL: %v", err)
+	}
+
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 2 {
+		t.Fatalf("want breakdown across both alpha AND beta despite SHA collision, got %d rows: %+v", len(breakdown), breakdown)
+	}
+	got := map[string]int{}
+	for _, b := range breakdown {
+		got[b.Repo] = b.Commits
+	}
+	if got["alpha"] != 1 || got["beta"] != 1 {
+		t.Errorf("each repo should keep its commit; got alpha=%d beta=%d", got["alpha"], got["beta"])
+	}
+	// Sanity on percentage — each repo is 50% of the 2 collided commits.
+	for _, b := range breakdown {
+		if b.PctOfTotalCommits != 50 {
+			t.Errorf("repo %s: want 50%% commits, got %.1f", b.Repo, b.PctOfTotalCommits)
+		}
+	}
+}
+
+func TestRepoBreakdown_SingleRepoFallback(t *testing.T) {
+	// No prefix tag → all commits bucket under "(repo)".
+	ds := &Dataset{
+		commits: map[string]*commitEntry{
+			"x1": {email: "me@x.com", date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 5},
+		},
+		files: map[string]*fileEntry{
+			"src/foo.go": {},
+		},
+	}
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 1 || breakdown[0].Repo != "(repo)" {
+		t.Fatalf("want single (repo) row, got %+v", breakdown)
+	}
+}
diff --git a/internal/stats/stats.go b/internal/stats/stats.go
index a0b985c..331ac3a 100644
--- a/internal/stats/stats.go
+++ b/internal/stats/stats.go
@@ -239,6 +239,19 @@ func ComputeSummary(ds *Dataset) Summary {
 	return s
 }
 
+// DevEmails returns every author email in the dataset, unordered.
+// Cheaper than TopContributors(ds, 0) when the caller only needs
+// the email set (e.g. aggregating unique devs across multiple
+// repos): skips the per-contributor struct copy and the full sort
+// that TopContributors pays for ranking purposes.
+func DevEmails(ds *Dataset) []string {
+	out := make([]string, 0, len(ds.contributors))
+	for email := range ds.contributors {
+		out = append(out, email)
+	}
+	return out
+}
+
 func TopContributors(ds *Dataset, n int) []ContributorStat {
 	result := make([]ContributorStat, 0, len(ds.contributors))
 	for _, cs := range ds.contributors {
diff --git a/internal/stats/stats_test.go b/internal/stats/stats_test.go
index a22bd15..b9c418b 100644
--- a/internal/stats/stats_test.go
+++ b/internal/stats/stats_test.go
@@ -63,6 +63,30 @@ func makeDataset() *Dataset {
 	return ds
 }
 
+// TestDevEmails: the returned email set must match the dataset's
+// contributors exactly. DevEmails is used by scan's per-repo render
+// loop to aggregate unique devs across repos without paying the full
+// contributor sort per repo.
+func TestDevEmails(t *testing.T) {
+	ds := makeDataset()
+	got := map[string]bool{}
+	for _, e := range DevEmails(ds) {
+		got[e] = true
+	}
+	want := map[string]bool{
+		"alice@test.com": true,
+		"bob@test.com":   true,
+	}
+	for email := range want {
+		if !got[email] {
+			t.Errorf("missing %q in DevEmails output: %v", email, got)
+		}
+	}
+	if len(got) != len(want) {
+		t.Errorf("unexpected extras in DevEmails: got %v, want %v", got, want)
+	}
+}
+
 func TestComputeSummary(t *testing.T) {
 	ds := makeDataset()
 	s := ComputeSummary(ds)
diff --git a/internal/stats/suspect.go b/internal/stats/suspect.go
index 33a4b5f..c12f6f4 100644
--- a/internal/stats/suspect.go
+++ b/internal/stats/suspect.go
@@ -142,11 +142,22 @@ func DetectSuspectFiles(ds *Dataset) ([]SuspectBucket, bool) {
 	for path, fe := range ds.files {
 		fileChurn := fe.additions + fe.deletions
 		totalChurn += fileChurn
+		// Multi-repo loads prefix every path with `:` so files
+		// from different repos can coexist in one Dataset. The
+		// pattern predicates assume bare repo-relative paths — they
+		// look for root-level `vendor/`, full-path `package-lock.json`,
+		// and similar shapes that fail when the string actually
+		// begins with `alpha:vendor/x.go` or equals
+		// `alpha:package-lock.json`. Strip the prefix once and feed
+		// the normalized form to both Match and Suggest; nested
+		// occurrences (e.g. `repo:pkg/vendor/x.go`) still work via
+		// the normal substring checks on the stripped path.
+		matchPath := stripRepoPrefix(path)
 		// A file might match multiple patterns (e.g. vendor/*.pb.go).
 		// Attribute it to the first match so totals don't double-count;
 		// the first-match wins keeps output predictable.
 		for _, pat := range defaultSuspectPatterns {
-			if !pat.Match(path) {
+			if !pat.Match(matchPath) {
 				continue
 			}
 			b, ok := buckets[pat.Glob]
@@ -159,7 +170,7 @@ func DetectSuspectFiles(ds *Dataset) ([]SuspectBucket, bool) {
 			b.Churn += fileChurn
 			suspectChurn += fileChurn
 			if pat.Suggest != nil {
-				suggSets[pat.Glob][pat.Suggest(path)] = struct{}{}
+				suggSets[pat.Glob][pat.Suggest(matchPath)] = struct{}{}
 			}
 			break
 		}
@@ -211,11 +222,23 @@ func ShellQuoteSingle(s string) string {
 // trimming suggestions to just the displayed subset would leave
 // unshown suspects untouched and cause the warning to persist after
 // the suggested fix.
+//
+// Multi-repo note: LoadMultiJSONL prefixes file paths with `:`
+// so every consumer (hotspots, directories, suspect detector) can
+// attribute paths back to their source. The suggestion strings
+// inherit that prefix from suggestDirGlob, but `extract --ignore` and
+// `scan --extract-ignore` both consume bare repo-relative globs —
+// neither understands the `:` notation. Strip it here so the
+// copy-pasted remediation command actually matches the offending
+// paths. Same-shaped globs from different repos collapse via the
+// dedup below (e.g. `dist/*` appearing under multiple repos becomes
+// one suggestion, which is what the user wants).
 func CollectAllSuggestions(buckets []SuspectBucket) []string {
 	seen := map[string]struct{}{}
 	var out []string
 	for _, b := range buckets {
 		for _, s := range b.Suggestions {
+			s = stripRepoPrefix(s)
 			if _, ok := seen[s]; ok {
 				continue
 			}
@@ -225,3 +248,21 @@ func CollectAllSuggestions(buckets []SuspectBucket) []string {
 	}
 	return out
 }
+
+// stripRepoPrefix removes a leading `:` from a suggestion glob
+// when present. Conservative: the prefix is only recognized when the
+// segment before the colon has no slash or glob metacharacters, which
+// matches the LoadMultiJSONL convention and avoids misinterpreting
+// legitimate colons inside paths (rare but possible on some
+// filesystems). Single-repo datasets have no prefix and pass through
+// unchanged.
+func stripRepoPrefix(s string) string {
+	i := strings.Index(s, ":")
+	if i < 0 {
+		return s
+	}
+	if strings.ContainsAny(s[:i], "/*?[") {
+		return s
+	}
+	return s[i+1:]
+}
diff --git a/internal/stats/suspect_test.go b/internal/stats/suspect_test.go
index a83ea2c..a9dfa7c 100644
--- a/internal/stats/suspect_test.go
+++ b/internal/stats/suspect_test.go
@@ -1,6 +1,7 @@
 package stats
 
 import (
+	"strings"
 	"testing"
 
 	"github.com/lex0c/gitcortex/internal/extract"
@@ -353,6 +354,109 @@ func TestCollectAllSuggestionsCoversUnshownBuckets(t *testing.T) {
 	}
 }
 
+// When paths come from LoadMultiJSONL they carry a `:` prefix.
+// Those prefixes leak into directory-glob suggestions and make the
+// copy-pasted `extract --ignore` / `scan --extract-ignore` command
+// a no-op on real repo paths (which never carry the prefix).
+// CollectAllSuggestions must strip the prefix and collapse duplicates
+// that only differed in which repo they came from.
+func TestCollectAllSuggestionsStripsRepoPrefix(t *testing.T) {
+	ds := &Dataset{
+		files: map[string]*fileEntry{
+			// Two repos with the same offending shape — each suggestion
+			// comes out with its repo prefix stripped.
+			"repoA:js/dist/a.min.js":   {additions: 400, deletions: 400},
+			"repoB:css/dist/b.min.css": {additions: 400, deletions: 400},
+			// A single-repo path stays untouched (no prefix).
+			"vendor/c.go": {additions: 400, deletions: 400},
+		},
+	}
+	buckets, _ := DetectSuspectFiles(ds)
+	got := CollectAllSuggestions(buckets)
+
+	for _, s := range got {
+		if strings.Contains(s, ":") {
+			t.Errorf("suggestion %q still carries a repo prefix — users can't copy-paste this into extract --ignore", s)
+		}
+	}
+	has := func(want string) bool {
+		for _, s := range got {
+			if s == want {
+				return true
+			}
+		}
+		return false
+	}
+	for _, want := range []string{"js/dist/*", "css/dist/*", "vendor/*"} {
+		if !has(want) {
+			t.Errorf("expected suggestion %q in %v", want, got)
+		}
+	}
+}
+
+func TestStripRepoPrefix(t *testing.T) {
+	cases := []struct{ in, want string }{
+		{"repoA:wp-includes/dist/*", "wp-includes/dist/*"},
+		{"dist/*", "dist/*"},     // no prefix
+		{"*.min.js", "*.min.js"}, // no prefix
+		{"vendor/*", "vendor/*"}, // no prefix (no colon)
+		{"src/weird:name.go", "src/weird:name.go"},     // colon after slash = not a prefix
+		{"glob*prefix:foo/bar", "glob*prefix:foo/bar"}, // metachar before colon = not a prefix
+	}
+	for _, c := range cases {
+		if got := stripRepoPrefix(c.in); got != c.want {
+			t.Errorf("stripRepoPrefix(%q) = %q, want %q", c.in, got, c.want)
+		}
+	}
+}
+
+// Regression: multi-repo loads prefix every file path with `:`,
+// and the pattern predicates (hasPathSegment, basenameEquals) compare
+// against the raw string. Root-level occurrences like
+// `alpha:vendor/x.go` and `alpha:package-lock.json` slipped past
+// detection because `strings.HasPrefix(p, "vendor/")` and
+// `p == "package-lock.json"` both fail when the `:` prefix is
+// present. Symptom in the wild: when stats consumed multiple --input
+// files from a scan, the suspect warning was suppressed and users kept
+// seeing inflated churn/bus-factor from generated content with no hint
+// to fix it.
+func TestDetectSuspectFiles_RootLevelUnderRepoPrefix(t *testing.T) {
+	ds := &Dataset{
+		files: map[string]*fileEntry{
+			// Root-level occurrences — the cases that previously went undetected.
+			"alpha:vendor/x.go":      {additions: 400, deletions: 400},
+			"beta:package-lock.json": {additions: 400, deletions: 400},
+			"beta:go.sum":            {additions: 400, deletions: 400},
+			// Nested occurrence — worked before the fix and must still work.
+			"alpha:pkg/vendor/deep.go": {additions: 100, deletions: 100},
+		},
+	}
+	buckets, worth := DetectSuspectFiles(ds)
+	if !worth {
+		t.Fatal("expected warning-worthy dataset; prefix-stripped matching should fire on root-level vendor/lockfiles")
+	}
+
+	got := map[string]bool{}
+	for _, b := range buckets {
+		got[b.Pattern.Glob] = true
+	}
+	for _, want := range []string{"vendor/*", "package-lock.json", "go.sum"} {
+		if !got[want] {
+			t.Errorf("pattern %q missing from detected buckets %v — root-level %q under repo prefix went undetected", want, got, want)
+		}
+	}
+
+	// Suggestions should come out clean (no `:`) so the
+	// copy-pasted --ignore command actually matches these paths
+	// next run.
+	suggestions := CollectAllSuggestions(buckets)
+	for _, s := range suggestions {
+		if len(s) > 0 && s[0] == '\'' {
+			t.Errorf("suggestion %q should not carry shell quotes here", s)
+		}
+	}
+}
+
 func TestSuspectSuggestionsMatchExtractShouldIgnore(t *testing.T) {
 	// Mix root-level and nested occurrences of every pattern class:
 	// directory segments (vendor, dist), suffix (*.min.js), basename