diff --git a/README.md b/README.md
index 29ea136..4d1cede 100644
--- a/README.md
+++ b/README.md
@@ -401,6 +401,73 @@ gitcortex stats --input auth.jsonl --input payments.jsonl --stat coupling --top
 Paths appear as `auth:src/main.go` and `payments:src/main.go`. Contributors are deduped by email across repos — the same developer contributing to both repos is counted once.
 
+For workspaces containing many repos (an engineer's `~/work`, a platform team's service folder), `gitcortex scan` discovers every `.git` under one or more roots and extracts them in parallel — see below.
+
+### Scan: discover and aggregate every repo under a root
+
+Walk one or more directories, find every git repository (working trees and bare clones are both detected), extract them in parallel, and optionally render HTML. Two output modes:
+
+- `--report-dir <dir>` — one standalone HTML per repo plus an `index.html` landing page linking them. Each per-repo report is equivalent to running `gitcortex report` against that repo alone; no metric mixing across unrelated codebases.
+- `--report <file> --email <addr>
` — a **single** consolidated profile report for one developer across every scanned repo. This is the only cross-repo aggregation in the feature, because "where did this person spend their time?" is the only question that genuinely benefits from pooling signal across projects.
+
+There is no third mode. Cross-repo consolidation at the team/codebase level inflates hotspots, bus factor, and coupling with noise from unrelated codebases; if that's what you want, inspect `manifest.json` or run `gitcortex report` per JSONL.
+
+```bash
+# Discover and extract every repo under ~/work (JSONLs + manifest, no HTML)
+gitcortex scan --root ~/work --output ./scan-out
+
+# Per-repo HTML reports + index landing page
+gitcortex scan --root ~/work --output ./scan-out --report-dir ./reports
+# then open ./reports/index.html → click through to each repo
+
+# Personal cross-repo profile: only MY commits, consolidated into one HTML
+gitcortex scan --root ~/work --output ./scan-out \
+  --report ./me.html --email me@company.com --since 1y \
+  --include-commit-messages
+
+# Multiple roots, higher parallelism, pre-set ignore patterns
+gitcortex scan --root ~/work --root ~/personal --root ~/oss \
+  --parallel 8 --max-depth 4 \
+  --output ./scan-out --report-dir ./reports
+```
+
+The scan output directory holds:
+
+| file | purpose |
+|---|---|
+| `<slug>.jsonl` | per-repo JSONL, one per discovered repo |
+| `<slug>.state` | resume checkpoint (safe to re-run scan to continue) |
+| `manifest.json` | discovery results, per-repo status (ok/failed/pending), timing |
+
+Each repo's slug is derived from its directory basename; colliding basenames get a short SHA-1 suffix (the suffix lengthens automatically on the rare truncation collision, so `<slug>.state` is stable across runs).
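The slug scheme can be sketched in a few lines of Go (an illustration of the idea only — `slugFor` and the 4-character starting suffix are assumptions, not gitcortex's actual implementation):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// slugFor derives a stable slug from a repo path: the directory
// basename, plus a short SHA-1-of-path suffix when the basename is
// already taken. The suffix grows on the (rare) collision of the
// truncated hash, so the result is deterministic across runs.
func slugFor(path, base string, taken map[string]bool) string {
	if !taken[base] {
		taken[base] = true
		return base
	}
	sum := sha1.Sum([]byte(path))
	hash := hex.EncodeToString(sum[:])
	for n := 4; n <= len(hash); n++ { // lengthen until unique
		candidate := base + "-" + hash[:n]
		if !taken[candidate] {
			taken[candidate] = true
			return candidate
		}
	}
	return base + "-" + hash
}

func main() {
	taken := map[string]bool{}
	fmt.Println(slugFor("/work/api", "api", taken))     // api
	fmt.Println(slugFor("/personal/api", "api", taken)) // api-<4 hex chars>
}
```

Because the suffix comes from the repo's path rather than a counter, re-running the scan over the same workspace reproduces the same slugs, which is what keeps the resume checkpoints valid across runs.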
+
+**Filtering discovery with `.gitcortex-ignore`.** Create a gitignore-style file at the scan root:
+
+```
+# skip heavy clones we don't want in the report
+node_modules
+chromium.git
+linux.git
+
+# skip vendored repos except the one we own
+vendor/
+!vendor/in-house-fork
+```
+
+Directory rules, globs, `**/foo`, and `!path` negations all work. Globbed negations like `!vendor*/keep` are honored — discovery descends into any dir where a negation rule could match a descendant. If `--ignore-file` is not set, scan looks for `.gitcortex-ignore` in the first `--root`.
+
+**Consolidated profile report.** When `scan --email me@company.com --report path.html` runs against a multi-repo dataset, the profile report renders a *Per-Repository Breakdown* section: commits, churn, files, active days, and share-of-total — all filtered to that developer's contributions (the files count reflects only files the dev touched). This is the one report that legitimately aggregates across repos; team-level views live in `--report-dir` (one HTML per repo, never mixed).
+
+**Flags worth knowing:**
+
+- `--parallel N` — repos extracted concurrently (defaults to a CPU-scaled value between 2 and 16). Extraction mixes CPU and disk I/O, so values far past NumCPU give diminishing returns.
+- `--max-depth N` — stop descending past N levels. Useful when a root contains a monorepo with deeply nested internal repos you don't want enumerated.
+- `--extract-ignore <pattern>` (repeatable) — forwarded to each per-repo `extract --ignore`, e.g. `--extract-ignore 'package-lock.json' --extract-ignore 'dist/*'`.
+- `--from / --to / --since` — time window applied to the consolidated report (same semantics as `report`).
+- `--churn-half-life`, `--coupling-max-files`, `--coupling-min-changes`, `--network-min-files` — tuning forwarded to the consolidated report, identical in meaning to the `gitcortex report` flags.
+
+Partial failures are non-fatal: the manifest records which repos failed, and the report is built from whichever JSONLs completed.
`Ctrl+C` aborts both the discovery walk and any in-flight extracts; re-running picks up from each repo's state file.
+
 ### Diff: compare time periods
 
 Compare stats between two time periods, or filter to a single period.
@@ -444,6 +511,8 @@ gitcortex report --input data.jsonl --email alice@company.com --output alice.htm
 Includes: summary cards, activity heatmap (with table toggle), top contributors, file hotspots, churn risk (with full-dataset label distribution strip above the truncated table), bus factor, file coupling, working patterns heatmap, top commits, developer network, and developer profiles. A collapsible glossary at the top defines the terms (bus factor, churn, legacy-hotspot, specialization, etc.) for readers who are not already familiar. Typical size: 50-500KB depending on number of contributors.
 
+When the input is multi-repo (from `gitcortex scan` or multiple `--input` files) AND `--email` is set, the profile report renders a *Per-Repository Breakdown* with commit/churn/files/active-days per repo, filtered to that developer's contributions. The team-view report intentionally omits this section — per-repo aggregates on a consolidated dataset reduce to the raw git-history distribution, which is more usefully inspected via `manifest.json` or `stats --input X.jsonl` per repo.
+
 > The HTML activity heatmap is always monthly (year × 12 months grid). For day/week/year buckets, use `gitcortex stats --stat activity --granularity <day|week|year>`.
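The per-repository breakdown falls out of the path prefixes naturally: in a multi-repo dataset every path carries a `repo:` prefix, so each breakdown column is a group-by on that prefix. A minimal sketch of the idea (illustrative only — `repoOf`, `fileChange`, and `churnByRepo` are hypothetical names, not gitcortex's real types):

```go
package main

import (
	"fmt"
	"strings"
)

// repoOf extracts the repo slug from a multi-repo path such as
// "auth:src/main.go". Single-repo datasets have no prefix, so the
// whole dataset falls into one "" bucket.
func repoOf(path string) string {
	if i := strings.Index(path, ":"); i >= 0 {
		return path[:i]
	}
	return ""
}

type fileChange struct {
	path                 string
	additions, deletions int
}

// churnByRepo groups one developer's file changes by repo slug and
// sums additions+deletions per bucket.
func churnByRepo(changes []fileChange) map[string]int {
	out := map[string]int{}
	for _, c := range changes {
		out[repoOf(c.path)] += c.additions + c.deletions
	}
	return out
}

func main() {
	changes := []fileChange{
		{"auth:src/main.go", 10, 2},
		{"payments:src/main.go", 5, 5},
		{"auth:src/login.go", 3, 0},
	}
	fmt.Println(churnByRepo(changes)) // map[auth:15 payments:10]
}
```

The same group-by serves commits, files, and active days — each is just a different value accumulated per prefix bucket.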
### CI: quality gates for pipelines @@ -483,9 +552,14 @@ internal/ parse.go Shared types (RawEntry, NumstatEntry) discard.go Malformed entry tracking extract/extract.go Extraction orchestration, state, JSONL writing + scan/ + scan.go Multi-repo orchestration (worker pool over extract) + discovery.go Directory walk, bare-repo detection, slug uniqueness + ignore.go Gitignore-style matcher with negation support stats/ - reader.go Streaming JSONL aggregator (single-pass) + reader.go Streaming JSONL aggregator (single-pass, multi-JSONL) stats.go Stat computations (9 stats) + repo_breakdown.go Per-repository aggregate (scan consolidated report) format.go Table/CSV/JSON output formatting ``` diff --git a/cmd/gitcortex/main.go b/cmd/gitcortex/main.go index 50267c4..256ec87 100644 --- a/cmd/gitcortex/main.go +++ b/cmd/gitcortex/main.go @@ -9,6 +9,8 @@ import ( "os" "os/signal" "path/filepath" + "runtime" + "sort" "strings" "syscall" "time" @@ -16,6 +18,7 @@ import ( "github.com/lex0c/gitcortex/internal/extract" "github.com/lex0c/gitcortex/internal/git" reportpkg "github.com/lex0c/gitcortex/internal/report" + "github.com/lex0c/gitcortex/internal/scan" "github.com/lex0c/gitcortex/internal/stats" "github.com/spf13/cobra" @@ -35,6 +38,7 @@ func main() { rootCmd.AddCommand(diffCmd()) rootCmd.AddCommand(ciCmd()) rootCmd.AddCommand(reportCmd()) + rootCmd.AddCommand(scanCmd()) if err := rootCmd.Execute(); err != nil { os.Exit(1) @@ -869,3 +873,455 @@ func reportCmd() *cobra.Command { return cmd } + +// defaultScanParallel picks an initial concurrency that scales with +// the host but caps the aggressive end: git extract is a mix of +// CPU (diff/numstat parsing) and disk I/O, so more than ~8 workers +// hit diminishing returns on typical SSDs and can thrash on +// spinning disks. The cap at 16 keeps 64-core servers from +// over-subscribing; the floor at 2 keeps single-core CI from +// serialising every scan. 
+func defaultScanParallel() int { + n := runtime.NumCPU() + if n > 16 { + n = 16 + } + if n < 2 { + n = 2 + } + return n +} + +// profileScanLabel builds the `RepoName` header string for +// `scan --report --email`. The profile dataset aggregates commits +// from every successful repo across all --root values, so titling +// the report after roots[0] alone would mislead readers into +// thinking the scope is a single root's work. +// +// Label policy: +// - 1 root → root basename (the default case, unchanged). +// - 2-3 roots → joined with " + " so the scope is visible at a glance. +// - 4+ roots → "N roots" because joining many basenames bloats +// the H1 and the exact set is available in the scan log and +// manifest anyway. +func profileScanLabel(roots []string) string { + switch len(roots) { + case 0: + return "scan" + case 1: + return filepath.Base(absPath(roots[0])) + case 2, 3: + parts := make([]string, len(roots)) + for i, r := range roots { + parts[i] = filepath.Base(absPath(r)) + } + return strings.Join(parts, " + ") + default: + return fmt.Sprintf("%d roots", len(roots)) + } +} + +// scanIndexStatusRank maps a ManifestRepo.Status to a sort bucket +// used by the index page ordering. Lower rank renders first; failed +// entries float to the top so operators spot them immediately, +// pending/skipped land in the middle, and ok entries make up the +// "everything else" tail ranked by commit count inside +// renderScanReportDir. +func scanIndexStatusRank(status string) int { + switch status { + case "failed": + return 0 + case "pending", "skipped": + return 1 + case "ok": + return 2 + default: + return 3 + } +} + +// renderScanReportDir is `scan --report-dir` in one place: for each +// successful repo in the manifest, load only that repo's JSONL as a +// standalone dataset, run the normal `report` pipeline into +// /.html, and afterward emit /index.html linking +// all of them. 
Intentionally uses LoadJSONL (single-input) rather +// than LoadMultiJSONL so each per-repo report's paths carry no +// `:` prefix — each tab of the experience is self-contained. +func renderScanReportDir(result *scan.Result, dir string, loadOpts stats.LoadOptions, sf stats.StatsFlags, topN int) error { + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("create report dir: %w", err) + } + + // Build an index-row slice parallel to Manifest.Repos so the + // landing page preserves scan order. We can't skip failed/pending + // entries — operators need to see what didn't produce a report. + entries := make([]reportpkg.ScanIndexEntry, 0, len(result.Manifest.Repos)) + var totalCommits int + allDevs := map[string]struct{}{} + maxCommits := 0 + + for _, m := range result.Manifest.Repos { + // Collapse "skipped" (worker picked the job up and saw + // ctx.Err before running extract) into "pending" for the + // index. The summary strip already buckets them together + // under PendingRepos; without normalizing here too, the + // per-row render would emit class="repo skipped" / + // status-skipped — CSS selectors the template doesn't + // define, so skipped rows lost the amber border and pill + // that make cancellation fallout easy to spot at a glance. + status := m.Status + if status == "skipped" { + status = "pending" + } + entry := reportpkg.ScanIndexEntry{ + Slug: m.Slug, + Path: m.Path, + Status: status, + Error: m.Error, + } + if m.Status != "ok" { + entries = append(entries, entry) + continue + } + + ds, err := stats.LoadJSONL(m.JSONL, loadOpts) + if err != nil { + // Treat load failure as a per-repo failure — index shows it, + // but the scan itself was already successful to that point. 
+ entry.Status = "failed" + entry.Error = fmt.Sprintf("load %s: %v", m.JSONL, err) + entries = append(entries, entry) + continue + } + + // All render-time errors for a single repo (create the output + // file, template.Execute) get demoted to an inline failure and + // the loop continues to the next repo — matching the policy + // already applied above for LoadJSONL. Aborting the batch on + // one bad render would discard every HTML already written and + // skip the index entirely, leaving operators staring at an + // incomplete directory with no way to discover the state. + outFile := filepath.Join(dir, m.Slug+".html") + f, err := os.Create(outFile) + if err != nil { + entry.Status = "failed" + entry.Error = fmt.Sprintf("create %s: %v", outFile, err) + entries = append(entries, entry) + continue + } + if err := reportpkg.Generate(f, ds, m.Slug, topN, sf); err != nil { + f.Close() + // Best-effort cleanup: leave nothing half-written on disk. + _ = os.Remove(outFile) + entry.Status = "failed" + entry.Error = fmt.Sprintf("render %s: %v", m.Slug, err) + entries = append(entries, entry) + continue + } + f.Close() + + summary := stats.ComputeSummary(ds) + entry.Commits = summary.TotalCommits + entry.Devs = summary.TotalDevs + entry.Files = summary.TotalFiles + entry.Churn = summary.TotalAdditions + summary.TotalDeletions + entry.FirstCommitDate = summary.FirstCommitDate + entry.LastCommitDate = summary.LastCommitDate + entry.LastCommitAgo, entry.RecencyBucket = reportpkg.HumanizeAgo(summary.LastCommitDate) + entry.ReportHref = m.Slug + ".html" + + totalCommits += entry.Commits + if entry.Commits > maxCommits { + maxCommits = entry.Commits + } + for _, email := range stats.DevEmails(ds) { + allDevs[email] = struct{}{} + } + + entries = append(entries, entry) + } + + // Order the index as a triage view rather than by discovery's + // alphabetical slug order. 
Failed entries come first (they need + // attention and their absence of metrics makes the red border + // the only signal), then pending/skipped, then ok entries + // ranked by commit count desc so the heaviest repos surface at + // the top of the "healthy" section. Slug asc is the tiebreaker + // in every bucket so scans with identical shapes still produce + // identical page order. + sort.SliceStable(entries, func(i, j int) bool { + a, b := scanIndexStatusRank(entries[i].Status), scanIndexStatusRank(entries[j].Status) + if a != b { + return a < b + } + if entries[i].Status == "ok" && entries[i].Commits != entries[j].Commits { + return entries[i].Commits > entries[j].Commits + } + return entries[i].Slug < entries[j].Slug + }) + + indexPath := filepath.Join(dir, "index.html") + f, err := os.Create(indexPath) + if err != nil { + return fmt.Errorf("create index: %w", err) + } + defer f.Close() + + // Pending and failed live in separate buckets: pending means the + // worker never reached the repo (user cancelled mid-scan, or a + // future scheduling change), while failed means the extract was + // attempted and broke. Conflating them in the summary strip would + // mislead operators reading a cancel-shaped failure as a real + // repo-level problem. 
+ var okCount, failedCount, pendingCount int + for _, e := range entries { + switch e.Status { + case "ok": + okCount++ + case "pending", "skipped": + pendingCount++ + default: + failedCount++ + } + } + + data := reportpkg.ScanIndexData{ + Repos: entries, + TotalRepos: len(entries), + OKRepos: okCount, + FailedRepos: failedCount, + PendingRepos: pendingCount, + TotalCommits: totalCommits, + TotalDevs: len(allDevs), + MaxCommits: maxCommits, + } + if err := reportpkg.GenerateScanIndex(f, data); err != nil { + return fmt.Errorf("generate index: %w", err) + } + + // Log line tracks the summary card's zero-suppression: only + // non-zero buckets appear, and the opener is whichever bucket + // has a count so an all-failed scan reads as "2 failed" instead + // of the misleading "0 ok, 2 failed". + var parts []string + if okCount > 0 { + parts = append(parts, fmt.Sprintf("%d ok", okCount)) + } + if failedCount > 0 { + parts = append(parts, fmt.Sprintf("%d failed", failedCount)) + } + if pendingCount > 0 { + parts = append(parts, fmt.Sprintf("%d pending", pendingCount)) + } + if len(parts) == 0 { + parts = []string{"0 repos"} // defensive; renderScanReportDir requires > 0 manifest entries to be reached at all + } + fmt.Fprintf(os.Stderr, "Per-repo reports written to %s (%s); open %s\n", + dir, strings.Join(parts, ", "), fileURL(indexPath)) + return nil +} + +// --- Scan --- + +func scanCmd() *cobra.Command { + var ( + roots []string + output string + ignoreFile string + maxDepth int + parallel int + email string + from string + to string + since string + reportPath string + reportDir string + topN int + extractIgnore []string + batchSize int + mailmap bool + firstParent bool + includeMessages bool + couplingMaxFiles int + couplingMinChanges int + churnHalfLife int + networkMinFiles int + ) + + cmd := &cobra.Command{ + Use: "scan", + Short: "Discover git repositories under one or more roots and consolidate their history", + Long: `Walk the given root(s), find every git 
repository, and run extract on each +repository in parallel. Outputs one JSONL per repo plus a manifest in --output. +Optionally generates a consolidated HTML report including a per-repository +breakdown — handy for showing aggregated work across many repos.`, + RunE: func(cmd *cobra.Command, args []string) error { + if len(roots) == 0 { + return fmt.Errorf("--root is required (repeatable for multiple roots)") + } + if since != "" && (from != "" || to != "") { + return fmt.Errorf("--since cannot be combined with --from/--to") + } + if err := validateDate(from, "--from"); err != nil { + return err + } + if err := validateDate(to, "--to"); err != nil { + return err + } + if from != "" && to != "" && from > to { + return fmt.Errorf("--from (%s) must be on or before --to (%s)", from, to) + } + // Report-flag combinations. --report is reserved for the + // only case that genuinely consolidates signal across repos + // — the per-developer profile; cross-repo team aggregation + // inflates hotspots/bus-factor/coupling with mixed-codebase + // noise and has been replaced by --report-dir which emits + // one standalone HTML per repo plus an index landing page. + if reportPath != "" && email == "" { + return fmt.Errorf("--report requires --email (profile consolidation); for per-repo HTML reports use --report-dir ") + } + if email != "" && reportPath == "" { + // Without this up-front check, --email alone would + // run the full scan, reach the profile branch, and + // only then call os.Create("") — surfacing a cryptic + // `open : no such file or directory` after the slow + // multi-repo extract phase had already completed. 
+ return fmt.Errorf("--email requires --report pointing at the profile HTML output path") + } + if reportPath != "" && reportDir != "" { + return fmt.Errorf("--report and --report-dir are mutually exclusive; pick profile (--report + --email) or per-repo (--report-dir)") + } + if email != "" && reportDir != "" { + return fmt.Errorf("--email filters a single consolidated profile report; combine it with --report , not --report-dir") + } + // Resolve --since up-front. If this fails we'd otherwise + // discover the typo only after a full multi-repo scan — + // minutes to hours on a large workspace, all thrown away + // because the user mistyped `--since 1yy`. Validate early, + // fail fast, keep the result for the report stage. + fromDate := from + if since != "" { + d, err := parseSince(since) + if err != nil { + return err + } + fromDate = d + } + + cfg := scan.Config{ + Roots: roots, + Output: output, + IgnoreFile: ignoreFile, + MaxDepth: maxDepth, + Parallel: parallel, + Extract: extract.Config{ + BatchSize: batchSize, + IncludeMessages: includeMessages, + CommandTimeout: extract.DefaultCommandTimeout, + FirstParent: firstParent, + Mailmap: mailmap, + IgnorePatterns: extractIgnore, + StartOffset: -1, + }, + } + + ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM) + defer stop() + + result, err := scan.Run(ctx, cfg) + // scan.Run returns a partial result alongside ctx.Err() on + // cancellation. Honor that — write whatever progress we made + // to disk and surface the error so the CLI exits non-zero. + if err != nil { + return err + } + + // Exit non-zero when every repo failed to extract, regardless + // of whether a report was requested. scan.Run intentionally + // treats per-repo failures as non-fatal so a transient error + // on one repo doesn't tank the whole batch — but when the + // count of successes is zero, there's nothing useful on disk + // to inspect. 
Without this guard CI sees exit code 0 and + // continues with empty artifacts; the manifest's per-repo + // Status fields remain the only way to see the failures. + if len(result.JSONLs) == 0 { + return fmt.Errorf("scan found %d repositor(ies) but all extracts failed; see %s for per-repo status", + len(result.Manifest.Repos), + filepath.Join(result.OutputDir, "manifest.json")) + } + + sf := stats.StatsFlags{CouplingMinChanges: couplingMinChanges, NetworkMinFiles: networkMinFiles} + loadOpts := stats.LoadOptions{ + From: fromDate, + To: to, + HalfLifeDays: churnHalfLife, + CoupMaxFiles: couplingMaxFiles, + } + + // Path 1: profile consolidation — the one cross-repo + // aggregation that earns its place (a single dev's work + // across all scanned repos). Build a multi-repo Dataset + // (path-prefixed via LoadMultiJSONL) and filter by email + // at render time. + if email != "" { + ds, err := stats.LoadMultiJSONL(result.JSONLs, loadOpts) + if err != nil { + return fmt.Errorf("load consolidated dataset: %w", err) + } + fmt.Fprintf(os.Stderr, "Loaded %d commits across %d repo(s)\n", ds.CommitCount, len(result.JSONLs)) + + f, err := os.Create(reportPath) + if err != nil { + return fmt.Errorf("create report: %w", err) + } + defer f.Close() + + repoLabel := profileScanLabel(cfg.Roots) + if err := reportpkg.GenerateProfile(f, ds, repoLabel, email); err != nil { + return fmt.Errorf("generate profile report: %w", err) + } + fmt.Fprintf(os.Stderr, "Profile report for %s written to %s\n", email, fileURL(reportPath)) + return nil + } + + // Path 2: per-repo reports + index. Each repo is loaded + // standalone so its stats reflect its own codebase — no + // hotspot/bus-factor/coupling mixing across unrelated + // projects. The index.html lists every repo with cards + // linking into each per-repo report; failed extracts are + // surfaced inline so operators can spot them without + // opening the manifest. 
+ if reportDir != "" { + return renderScanReportDir(result, reportDir, loadOpts, sf, topN) + } + + // Path 3: no report flag — just confirm JSONLs landed. + fmt.Fprintf(os.Stderr, "Scan complete: %d JSONL file(s) in %s\n", len(result.JSONLs), result.OutputDir) + return nil + }, + } + + cmd.Flags().StringSliceVar(&roots, "root", nil, "Root directory to walk for repositories (repeatable)") + cmd.Flags().StringVar(&output, "output", "scan-output", "Directory to write per-repo JSONL files and the manifest") + cmd.Flags().StringVar(&ignoreFile, "ignore-file", "", "Gitignore-style file with directories to skip during discovery. When unset, only the first --root is searched for a .gitcortex-ignore; pass an explicit path to apply rules across all roots.") + cmd.Flags().IntVar(&maxDepth, "max-depth", 0, "Maximum directory depth to descend into when looking for repos (0 = unlimited)") + cmd.Flags().IntVar(¶llel, "parallel", defaultScanParallel(), "Number of repositories to extract in parallel") + cmd.Flags().StringVar(&email, "email", "", "Generate a per-developer profile report (only when --report is set)") + cmd.Flags().StringVar(&from, "from", "", "Window start date YYYY-MM-DD (forwarded to the consolidated report)") + cmd.Flags().StringVar(&to, "to", "", "Window end date YYYY-MM-DD (forwarded to the consolidated report)") + cmd.Flags().StringVar(&since, "since", "", "Filter to recent period (e.g. 7d, 4w, 3m, 1y); mutually exclusive with --from/--to") + cmd.Flags().StringVar(&reportPath, "report", "", "With --email, write a developer profile consolidated across all scanned repos to this HTML file. For per-repo reports use --report-dir instead.") + cmd.Flags().StringVar(&reportDir, "report-dir", "", "Directory to write one standalone HTML report per repo plus an index.html landing page. 
Each per-repo report is equivalent to running `gitcortex report` on that repo alone — no cross-repo metric mixing.") + cmd.Flags().IntVar(&topN, "top", 20, "Top-N entries per section in each per-repo report") + cmd.Flags().StringSliceVar(&extractIgnore, "extract-ignore", nil, "Glob patterns forwarded to per-repo extract --ignore (e.g. package-lock.json)") + cmd.Flags().IntVar(&batchSize, "batch-size", 1000, "Per-repo extract checkpoint interval") + cmd.Flags().BoolVar(&mailmap, "mailmap", false, "Use .mailmap (per repo) to normalize identities") + cmd.Flags().BoolVar(&firstParent, "first-parent", false, "Restrict extracts to the first-parent chain") + cmd.Flags().BoolVar(&includeMessages, "include-commit-messages", false, "Include commit messages in JSONL (needed for Top Commits in the consolidated report)") + cmd.Flags().IntVar(&couplingMaxFiles, "coupling-max-files", 50, "Max files per commit for coupling analysis (consolidated report)") + cmd.Flags().IntVar(&couplingMinChanges, "coupling-min-changes", 5, "Min co-changes for coupling results (consolidated report)") + cmd.Flags().IntVar(&churnHalfLife, "churn-half-life", 90, "Half-life in days for churn decay (consolidated report)") + cmd.Flags().IntVar(&networkMinFiles, "network-min-files", 5, "Min shared files for dev-network edges (consolidated report)") + + return cmd +} diff --git a/cmd/gitcortex/main_test.go b/cmd/gitcortex/main_test.go index 57410bf..2795e2b 100644 --- a/cmd/gitcortex/main_test.go +++ b/cmd/gitcortex/main_test.go @@ -1,10 +1,565 @@ package main import ( + "bytes" + "fmt" + "os" + "os/exec" + "path/filepath" "strings" "testing" + + "github.com/lex0c/gitcortex/internal/scan" + "github.com/lex0c/gitcortex/internal/stats" ) +// scanCmd must validate --since BEFORE running the discovery walk +// and extract pool. 
Without the early check, an obvious typo like +// `--since 1yy` only surfaces after scan.Run has already walked +// every root and extracted every repo found — which can take +// minutes-to-hours on a large workspace and waste the work. +// +// Use an empty TempDir so scan.Run would fail fast with "no git +// repositories found" if we reached it — that error does not mention +// "since", so an error containing "since" proves the early +// validation fired first. +func TestScanCmd_ValidatesSinceBeforeScanning(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--since", "bogus", + }) + // Swallow any stderr output cobra might emit so we don't pollute + // go-test logs on success. + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for --since bogus, got nil") + } + if !strings.Contains(err.Error(), "since") { + t.Errorf("expected error to mention --since (proving early validation), got %q", err) + } + // Reaching scan.Run on an empty TempDir would produce the + // discovery error; asserting its absence confirms the walk was + // not started. + if strings.Contains(err.Error(), "no git repositories found") { + t.Errorf("scan.Run was reached before --since was validated — discovery ran on an invalid-flag input: %q", err) + } +} + +// scanCmd must exit non-zero when every discovered repo's extract +// failed, regardless of whether --report was requested. scan.Run +// intentionally treats per-repo failures as non-fatal (a transient +// error on one repo shouldn't tank the whole batch), but if ZERO +// repos succeed there's nothing useful on disk and CI should know. +// Previously the no-report branch returned success unconditionally; +// automation saw exit 0 and "Scan complete: 0 JSONL file(s)" and +// continued with empty artifacts. 
+// +// Test uses a fake repo (a bare `.git` dir with no history) so +// discovery picks it up but extract fails. +func TestScanCmd_ExitsNonZeroWhenAllExtractsFail(t *testing.T) { + root := t.TempDir() + if err := os.MkdirAll(filepath.Join(root, "fake-repo", ".git"), 0o755); err != nil { + t.Fatal(err) + } + + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", root, + "--output", t.TempDir(), + }) + + err := cmd.Execute() + if err == nil { + t.Fatal("expected non-zero exit when all extracts fail; CI would silently treat this as success") + } + if !strings.Contains(err.Error(), "all extracts failed") { + t.Errorf("error should explain every repo failed; got %q", err) + } +} + +// `scan --ignore-file /nonexistent` must fail at the CLI layer — +// silently falling back to zero rules would widen discovery scope +// without telling the user (e.g. node_modules and vendor/ dirs the +// user thought they excluded suddenly appear in the report). The +// default-path lookup in scan.loadMatcher still tolerates a missing +// `.gitcortex-ignore` at the first root; only the explicit flag is +// strict. +func TestScanCmd_RejectsMissingIgnoreFile(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--ignore-file", filepath.Join(t.TempDir(), "typo.ignore"), + }) + + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for missing --ignore-file; a typo must not silently widen scope") + } + if !strings.Contains(err.Error(), "typo.ignore") { + t.Errorf("error should identify the missing file; got %q", err) + } + // Must fail BEFORE discovery starts — scan on an empty TempDir + // would otherwise produce "no git repositories found". 
+ if strings.Contains(err.Error(), "no git repositories found") { + t.Errorf("ignore-file check ran after discovery; got %q", err) + } +} + +// --email without --report would silently reach os.Create("") after +// the full scan/extract phase finished, surfacing a cryptic "open : +// no such file or directory" error minutes into a wasted run. The +// flag pair is mandatory in both directions: reject up front +// alongside the other flag-combination checks. +func TestScanCmd_RejectsEmailWithoutReport(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--email", "me@x.com", + }) + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for --email without --report; silent fall-through would waste a full scan before failing") + } + if !strings.Contains(err.Error(), "--report") { + t.Errorf("error should name --report as the missing flag; got %q", err) + } + // Must fail BEFORE discovery starts; empty TempDir would + // otherwise produce "no git repositories found". + if strings.Contains(err.Error(), "no git repositories found") { + t.Errorf("flag validation ran after scan; got %q", err) + } +} + +// --report without --email is ambiguous under the "no team-wide +// consolidation" design: the only meaningful single-HTML output +// across multiple repos is a developer profile. The error must +// point the user at the right flag for each intent. 
+func TestScanCmd_RejectsReportWithoutEmail(t *testing.T) { + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", t.TempDir(), + "--output", t.TempDir(), + "--report", filepath.Join(t.TempDir(), "out.html"), + }) + err := cmd.Execute() + if err == nil { + t.Fatal("expected error for --report without --email") + } + if !strings.Contains(err.Error(), "--report-dir") || !strings.Contains(err.Error(), "--email") { + t.Errorf("error should point at both --report-dir and --email as remediation; got %q", err) + } +} + +// `scan --report-dir ` writes index.html plus one HTML per +// successful repo, with each per-repo HTML rendered from a +// single-input Dataset (no cross-repo path prefix). End-to-end +// check against two real repos built with git. +func TestScanCmd_ReportDirGeneratesPerRepoHTMLAndIndex(t *testing.T) { + if _, err := exec.LookPath("git"); err != nil { + t.Skip("git not installed") + } + root := t.TempDir() + makeRepoWithCommit(t, filepath.Join(root, "alpha"), "a.go", "package a\n") + makeRepoWithCommit(t, filepath.Join(root, "beta"), "b.py", "print('b')\n") + + scanOut := t.TempDir() + reportDir := t.TempDir() + + cmd := scanCmd() + cmd.SilenceUsage = true + cmd.SilenceErrors = true + cmd.SetOut(&bytes.Buffer{}) + cmd.SetErr(&bytes.Buffer{}) + cmd.SetArgs([]string{ + "--root", root, + "--output", scanOut, + "--report-dir", reportDir, + }) + if err := cmd.Execute(); err != nil { + t.Fatalf("scan --report-dir: %v", err) + } + + // Every successful repo gets a standalone HTML named .html. 
+	for _, slug := range []string{"alpha", "beta"} {
+		path := filepath.Join(reportDir, slug+".html")
+		b, err := os.ReadFile(path)
+		if err != nil {
+			t.Fatalf("per-repo report missing: %v", err)
+		}
+		if !strings.Contains(string(b), "</html>") {
+			t.Errorf("%s should be a full HTML document", path)
+		}
+		// Paths inside the per-repo report must NOT carry a
+		// `<slug>:` prefix — each tab is standalone, so LoadJSONL
+		// (not LoadMultiJSONL) was used and file paths appear raw.
+		if strings.Contains(string(b), slug+":") {
+			t.Errorf("per-repo report %s contains a `%s:` path prefix; Dataset was loaded as multi-repo", path, slug)
+		}
+	}
+
+	// index.html links to each per-repo report.
+	indexBytes, err := os.ReadFile(filepath.Join(reportDir, "index.html"))
+	if err != nil {
+		t.Fatalf("index.html missing: %v", err)
+	}
+	index := string(indexBytes)
+	if !strings.Contains(index, `href="alpha.html"`) || !strings.Contains(index, `href="beta.html"`) {
+		t.Errorf("index.html should link to both per-repo reports; got body excerpt: %.300s", index)
+	}
+	// The summary card strip is now the top-of-page anchor (no H1).
+	// Asserting on the "Repositories" label keeps the test meaningful
+	// if the card layout changes.
+	if !strings.Contains(index, "Repositories") {
+		t.Errorf("index.html missing summary strip")
+	}
+}
+
+// renderScanReportDir must render the index for a manifest that
+// mixes ok / failed / pending statuses. Failed entries show their
+// error message, pending entries render with the pending pill, and
+// neither inflates the "failed" count in the summary card.
+func TestRenderScanReportDir_MixedStatuses(t *testing.T) { + dir := t.TempDir() + okJSONL := filepath.Join(dir, "good.jsonl") + jsonlContent := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-01T00:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"a.go","additions":10,"deletions":0} +` + if err := os.WriteFile(okJSONL, []byte(jsonlContent), 0o644); err != nil { + t.Fatal(err) + } + + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: dir, + Manifest: scan.Manifest{ + Roots: []string{"/fake/root"}, + Repos: []scan.ManifestRepo{ + {Slug: "good", Path: "/fake/root/good", Status: "ok", JSONL: okJSONL}, + {Slug: "broken", Path: "/fake/root/broken", Status: "failed", Error: "simulated extract crash"}, + {Slug: "skipped", Path: "/fake/root/skipped", Status: "pending"}, + }, + }, + } + + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatalf("renderScanReportDir returned error despite inline-failure policy: %v", err) + } + + // OK entry's report file exists; failed/pending do not. + if _, err := os.Stat(filepath.Join(reportDir, "good.html")); err != nil { + t.Errorf("good.html missing: %v", err) + } + for _, unwanted := range []string{"broken.html", "skipped.html"} { + if _, err := os.Stat(filepath.Join(reportDir, unwanted)); err == nil { + t.Errorf("%s should NOT be written — status was not ok", unwanted) + } + } + + indexB, err := os.ReadFile(filepath.Join(reportDir, "index.html")) + if err != nil { + t.Fatalf("index.html: %v", err) + } + index := string(indexB) + + // Summary card splits failed and pending correctly: "(1 failed)" + // AND "(1 pending)" appear, not "(2 failed)". 
+ if strings.Contains(index, "(2 failed)") { + t.Error("summary miscounted pending as failed — `(2 failed)` appeared instead of separate buckets") + } + if !strings.Contains(index, "1 failed") { + t.Errorf("summary should show `1 failed`; index excerpt: %.600s", index) + } + if !strings.Contains(index, "1 pending") { + t.Errorf("summary should show `1 pending`; index excerpt: %.600s", index) + } + + // Failed entry's error bubbles up to the card. + if !strings.Contains(index, "simulated extract crash") { + t.Error("failed entry's error message missing from index") + } + + // Status pills render only for non-ok entries (ok is implicit + // given the link itself and the presence of metric cells). Assert + // the two non-ok pills exist and the ok one does not. + for _, want := range []string{"status-failed", "status-pending"} { + if !strings.Contains(index, want) { + t.Errorf("index missing %q pill — status branch lost its style", want) + } + } + if strings.Contains(index, `class="status-pill status-ok"`) { + t.Error("ok entries should not render a status pill — redundant with the link and metric cells") + } + + // Only OK entries carry an href; failed/pending must not link to + // a file that doesn't exist. + if strings.Contains(index, `href="broken.html"`) || strings.Contains(index, `href="skipped.html"`) { + t.Error("index links to a non-existent per-repo HTML for a non-ok entry") + } +} + +// Index order is a triage view: failed first (actionable), then +// pending (skipped work, less urgent), then ok ranked by commit +// count desc (heaviest healthy repos first). Slug asc breaks ties +// everywhere. +func TestRenderScanReportDir_IndexOrderFailedFirstThenByCommits(t *testing.T) { + dir := t.TempDir() + // One JSONL with 50 commits, one with 5 — lets us see the + // ok-bucket reorder from the "small -> big" alphabetical slug + // order to "big -> small" commit-count order. 
+ bigJSONL := filepath.Join(dir, "big.jsonl") + smallJSONL := filepath.Join(dir, "small.jsonl") + if err := os.WriteFile(bigJSONL, []byte(manyCommitsJSONL(50)), 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(smallJSONL, []byte(manyCommitsJSONL(5)), 0o644); err != nil { + t.Fatal(err) + } + + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: dir, + Manifest: scan.Manifest{ + Repos: []scan.ManifestRepo{ + // Slug asc would give: big, broken, pending, small. + // Triage order expected: broken (failed), pending, big (50c), small (5c). + {Slug: "big", Path: "/x/big", Status: "ok", JSONL: bigJSONL}, + {Slug: "broken", Path: "/x/broken", Status: "failed", Error: "explode"}, + {Slug: "pending", Path: "/x/pending", Status: "pending"}, + {Slug: "small", Path: "/x/small", Status: "ok", JSONL: smallJSONL}, + }, + }, + } + + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatal(err) + } + + b, err := os.ReadFile(filepath.Join(reportDir, "index.html")) + if err != nil { + t.Fatal(err) + } + out := string(b) + + // Each card renders `
+	// {{.Path}}` with the
+	// unique filesystem path. Finding each path's offset gives the
+	// render order; ok entries also include an anchor but failed /
+	// pending do not, so the `.path` cell is the stable locator.
+	idxOf := func(p string) int { return strings.Index(out, p) }
+	brokenIdx := idxOf("/x/broken")
+	pendingIdx := idxOf("/x/pending")
+	bigIdx := idxOf("/x/big")
+	smallIdx := idxOf("/x/small")
+	for name, i := range map[string]int{"broken": brokenIdx, "pending": pendingIdx, "big": bigIdx, "small": smallIdx} {
+		if i < 0 {
+			t.Fatalf("%q path not found in index output", name)
+		}
+	}
+	if !(brokenIdx < pendingIdx && pendingIdx < bigIdx && bigIdx < smallIdx) {
+		t.Errorf("triage order failed: broken=%d pending=%d big=%d small=%d — want broken < pending < big < small",
+			brokenIdx, pendingIdx, bigIdx, smallIdx)
+	}
+}
+
+// manyCommitsJSONL returns n valid commit+file rows, each with a
+// distinct SHA so LoadJSONL counts them as N commits.
+func manyCommitsJSONL(n int) string {
+	var sb strings.Builder
+	for i := 0; i < n; i++ {
+		sha := fmt.Sprintf("%040x", i+1)
+		fmt.Fprintf(&sb,
+			`{"type":"commit","sha":"%s","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-%02dT00:00:00Z","additions":10,"deletions":0,"files_changed":1}`+"\n",
+			sha, (i%28)+1)
+		fmt.Fprintf(&sb,
+			`{"type":"commit_file","commit":"%s","path_current":"f%d.go","additions":10,"deletions":0}`+"\n",
+			sha, i)
+	}
+	return sb.String()
+}
+
+// Regression: scan emits Status="skipped" for repos the worker
+// picked up but abandoned when ctx.Err fired (see scan.runOne).
+// The index template only styles .repo.pending / .status-pending;
+// a raw "skipped" status would produce class="repo skipped" with
+// no CSS match, hiding the amber border and pill that make
+// cancellation fallout triageable. The renderer must fold skipped
+// into pending so the row visuals stay consistent with the
+// summary-strip count (which already buckets them together).
+func TestRenderScanReportDir_NormalizesSkippedToPending(t *testing.T) { + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: t.TempDir(), + Manifest: scan.Manifest{ + Repos: []scan.ManifestRepo{ + {Slug: "a", Path: "/x/a", Status: "skipped", Error: "context canceled"}, + }, + }, + } + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatal(err) + } + indexB, err := os.ReadFile(filepath.Join(reportDir, "index.html")) + if err != nil { + t.Fatal(err) + } + index := string(indexB) + + // Summary counts skipped under pending (already the case, but + // anchor it so a future refactor doesn't drift). + if !strings.Contains(index, "1 pending") { + t.Errorf("skipped entry should count as pending in summary; got index excerpt: %.400s", index) + } + // Row CSS must use the pending selector — a raw "skipped" class + // has no style defined so the row would lose its warning affordance. + if !strings.Contains(index, `class="repo pending"`) { + t.Error("skipped entry's card must render with `class=\"repo pending\"`") + } + if strings.Contains(index, `class="repo skipped"`) { + t.Error("raw `skipped` leaked into CSS class — template selectors don't match it") + } + // Pill class must also normalize. + if !strings.Contains(index, `status-pending`) { + t.Error("skipped entry's status pill should read as `status-pending`") + } + if strings.Contains(index, `status-skipped`) { + t.Error("raw `skipped` leaked into pill class") + } +} + +// Render with zero successful repos must still emit an index +// without tripping on the MaxCommits==0 guard in the bar-width +// template expression. 
+func TestRenderScanReportDir_AllFailedStillEmitsIndex(t *testing.T) { + reportDir := t.TempDir() + result := &scan.Result{ + OutputDir: t.TempDir(), + Manifest: scan.Manifest{ + Repos: []scan.ManifestRepo{ + {Slug: "a", Status: "failed", Error: "boom"}, + {Slug: "b", Status: "failed", Error: "also boom"}, + }, + }, + } + err := renderScanReportDir(result, reportDir, + stats.LoadOptions{HalfLifeDays: 90, CoupMaxFiles: 50}, + stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}, + 10) + if err != nil { + t.Fatalf("render should not fail on all-failed manifest: %v", err) + } + if _, err := os.Stat(filepath.Join(reportDir, "index.html")); err != nil { + t.Fatalf("index.html must be emitted even when every repo failed: %v", err) + } +} + +// makeRepoWithCommit initializes a git repo with one file and one +// commit using a deterministic identity. +func makeRepoWithCommit(t *testing.T, dir, file, contents string) { + t.Helper() + if err := os.MkdirAll(dir, 0o755); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(filepath.Join(dir, file), []byte(contents), 0o644); err != nil { + t.Fatal(err) + } + run := func(args ...string) { + cmd := exec.Command("git", args...) + cmd.Dir = dir + cmd.Env = append(os.Environ(), + "GIT_AUTHOR_NAME=Test", + "GIT_AUTHOR_EMAIL=test@example.com", + "GIT_COMMITTER_NAME=Test", + "GIT_COMMITTER_EMAIL=test@example.com", + ) + out, err := cmd.CombinedOutput() + if err != nil { + t.Fatalf("git %s: %v\n%s", strings.Join(args, " "), err, out) + } + } + run("init", "-q", "-b", "main") + run("add", ".") + run("commit", "-q", "-m", "initial") +} + +// The profile report title used to echo roots[0] even when the +// scan aggregated many roots, so "gitcortex scan --root ~/work +// --root ~/oss --report me.html --email me@x.com" produced a +// report titled "· work" that readers interpreted as belonging +// only to the first root. Ensure the label reflects the actual +// scope. 
+func TestProfileScanLabel(t *testing.T) { + cases := []struct { + name string + roots []string + want string + }{ + {"no roots", nil, "scan"}, + {"single root uses basename", []string{"/home/me/work"}, "work"}, + // Two and three-root scans join with " + " so the scope is + // visible at a glance rather than being summarised as a number. + {"two roots joined", []string{"/home/me/work", "/home/me/personal"}, "work + personal"}, + {"three roots joined", []string{"/a/one", "/b/two", "/c/three"}, "one + two + three"}, + // Large scans fall back to a count — joining 50 basenames + // would bloat the H1; the exact set lives in the scan log + // and manifest. + {"many roots use count", []string{"/a/x", "/b/x", "/c/x", "/d/x", "/e/x"}, "5 roots"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + if got := profileScanLabel(c.roots); got != c.want { + t.Errorf("profileScanLabel(%v) = %q, want %q", c.roots, got, c.want) + } + }) + } +} + +// defaultScanParallel must stay within [2, 16] regardless of host +// NumCPU. The floor protects single-core CI from serialising every +// scan; the cap prevents 64-core servers from over-subscribing +// when git I/O can't actually use that much concurrency. +func TestDefaultScanParallel(t *testing.T) { + got := defaultScanParallel() + if got < 2 || got > 16 { + t.Errorf("defaultScanParallel() = %d, want between 2 and 16 inclusive", got) + } +} + func TestValidateDate(t *testing.T) { cases := []struct { in string diff --git a/docs/METRICS.md b/docs/METRICS.md index 7ed473e..72b9fff 100644 --- a/docs/METRICS.md +++ b/docs/METRICS.md @@ -294,6 +294,29 @@ A `tree(1)`-style view of the repository's directory layout, built from paths se **When to use**: before drilling into hotspots or churn-risk, skim the structure to locate the modules those files live in. The tree is navigational context; ranked tables are where judgment happens. +## Per-Repository Breakdown + +Cross-repo aggregation scoped to a single developer. 
Renders in the HTML profile report (`report --email me@x.com` or `scan --email me@x.com --report …`) when the dataset was loaded from more than one JSONL. The metric is deliberately profile-only: on a team-view consolidated report, per-repo aggregates reduce to raw git-history distribution and are better inspected via `manifest.json` or by running `stats --input X.jsonl` per-repo. Filtered by a developer's email, the same numbers answer a genuinely different question — *"where did this person spend their time?"* — which is what the section surfaces.
+
+One row per repo:
+
+| column | meaning |
+|---|---|
+| `Commits` | author-dated commit count in this repo |
+| `% Commits` | share of total commits in the dataset |
+| `Churn` | additions + deletions attributed to this repo |
+| `% Churn` | share of total churn |
+| `Files` | distinct files the developer touched in this repo (always email-filtered, since the section only renders on profile reports) |
+| `Active days` | distinct UTC author-dates |
+| `Devs` | unique author emails in this repo |
+| `First → Last` | earliest and latest author-date |
+
+**How the repo label is derived**: `LoadMultiJSONL` prefixes every path in the dataset with `<repo>:` (so `WordPress.git.jsonl` contributes paths like `WordPress.git:wp-includes/foo.php`). The breakdown groups by that prefix. If only a single JSONL is loaded, no prefix is emitted and the breakdown collapses to a single `(repo)` row, which the HTML report hides.
+
+**Divergence between `% Commits` and `% Churn` is informative**: a repo dominating churn while holding modest commit share often signals large-content work (docs, data), while the reverse points to small-diff high-frequency repos (config, manifests).
+
+**SHA collisions across repos** (forks, mirrors, cherry-picks between sibling projects) are preserved here — the breakdown tracks commits per repo via a dedicated slice populated at ingest, not the SHA-keyed commit map.
Other file-level metrics (bus factor, coupling, dev network) still key by SHA and will collapse collided commits onto one record; if exact attribution matters for those, scan and aggregate the sibling repos separately. + ## Data Flow ``` @@ -364,6 +387,8 @@ Directory-segment heuristics (`vendor`, `node_modules`, `dist`, `build`, `third_ The warning is advisory. Nothing is auto-filtered; the user decides whether to re-extract. Matches do not affect computed stats in that run. JSON/CSV output paths skip the warning since they're typically piped. +On multi-JSONL loads (e.g. `stats --input a.jsonl --input b.jsonl` or a `gitcortex scan` dataset), paths carry a `:` prefix internally. The suspect detector strips that prefix before matching and suggesting, so root-level `vendor/`, `package-lock.json`, `go.sum`, etc. in any individual repo are detected and the emitted `--ignore` globs are repo-relative (drop-in for `extract --ignore` and `scan --extract-ignore`). Same-shape findings across repos collapse to one suggestion (`dist/*` applies everywhere rather than being listed once per repo). + Statistical heuristics (very high churn-per-commit, single-author bulk updates) are deliberately out of scope — their false-positive rate on hand-authored code is higher than the path-based list and we'd rather stay quiet than cry wolf. ### `--mailmap` off by default diff --git a/internal/report/profile_template.go b/internal/report/profile_template.go index 5bb8f0f..5fedc3a 100644 --- a/internal/report/profile_template.go +++ b/internal/report/profile_template.go @@ -134,6 +134,42 @@ footer { margin-top: 40px; padding-top: 16px; border-top: 1px solid #d0d7de; col {{end}} +{{if gt (len .Repos) 1}} +

Per-Repository Breakdown {{thousands (len .Repos)}} repositories

+

How {{.Profile.Name}}'s work is split across repositories in this scan. Use this to point at the projects with the most activity, or to spot single-repo focus vs. broad multi-repo engagement.

+ + + + + + + + + + + +{{range .Repos}} + + + + + + + + + + +{{end}} +
RepositoryCommits% of My CommitsChurn% of My ChurnFilesActive daysFirst → Last
{{.Repo}}{{thousands .Commits}} +
+
+
+
+ {{printf "%.1f" .PctOfTotalCommits}}% +
+
{{thousands .Churn}}{{printf "%.1f" .PctOfTotalChurn}}%{{thousands .Files}}{{.ActiveDays}}{{.FirstCommitDate}} → {{.LastCommitDate}}
+{{end}} + {{if .Profile.TopFiles}}

Top Files

Files this developer changed most (churn = additions + deletions). High churn on few files suggests deep ownership and potential knowledge concentration. · {{docRef "hotspots"}}

diff --git a/internal/report/report.go b/internal/report/report.go index 80e875f..24c982f 100644 --- a/internal/report/report.go +++ b/internal/report/report.go @@ -641,6 +641,13 @@ type ProfileReportData struct { MaxActivityCommits int PatternGrid [7][24]int MaxPattern int + + // Repos is the per-repository breakdown filtered to this developer's + // commits. Empty on single-repo profile reports — gated in the + // template so existing single-repo callers see no change. The headline + // use case for `gitcortex scan --email me` lives here: the developer + // can see at a glance which repos they spent time in. + Repos []stats.RepoStat } func GenerateProfile(w io.Writer, ds *stats.Dataset, repoName, email string) error { @@ -672,6 +679,7 @@ func GenerateProfile(w io.Writer, ds *stats.Dataset, repoName, email string) err MaxActivityCommits: maxAct, PatternGrid: p.WorkGrid, MaxPattern: maxP, + Repos: stats.RepoBreakdown(ds, email), } return profileTmpl.Execute(w, data) diff --git a/internal/report/report_test.go b/internal/report/report_test.go index 2b4c826..45a91eb 100644 --- a/internal/report/report_test.go +++ b/internal/report/report_test.go @@ -5,6 +5,7 @@ import ( "fmt" "os" "path/filepath" + "regexp" "strings" "testing" @@ -77,6 +78,117 @@ func TestGenerate_SmokeRender(t *testing.T) { } } +// The Per-Repository Breakdown is a consolidation metric about the +// developer's work, not about the repository set. It belongs on the +// profile (--email) report where "3 of my commits in auth, 50 in +// payments" tells a story; on the team report it would just restate +// git-history distribution, which is tautological. Assert the team +// render does NOT surface the section even with a multi-repo dataset. 
+func TestGenerate_TeamReportOmitsPerRepoBreakdown(t *testing.T) { + dir := t.TempDir() + alpha := filepath.Join(dir, "alpha.jsonl") + beta := filepath.Join(dir, "beta.jsonl") + alphaRow := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-01T00:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"a.go","additions":10,"deletions":0} +` + betaRow := `{"type":"commit","sha":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb","author_email":"me@x.com","author_name":"Me","author_date":"2024-02-01T00:00:00Z","additions":20,"deletions":5,"files_changed":1} +{"type":"commit_file","commit":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb","path_current":"b.go","additions":20,"deletions":5} +` + if err := os.WriteFile(alpha, []byte(alphaRow), 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(beta, []byte(betaRow), 0o644); err != nil { + t.Fatal(err) + } + ds, err := stats.LoadMultiJSONL([]string{alpha, beta}) + if err != nil { + t.Fatalf("LoadMultiJSONL: %v", err) + } + var buf bytes.Buffer + if err := Generate(&buf, ds, "scan-fixture", 10, stats.StatsFlags{CouplingMinChanges: 1, NetworkMinFiles: 1}); err != nil { + t.Fatalf("Generate: %v", err) + } + out := buf.String() + if strings.Contains(out, "Per-Repository Breakdown") { + t.Error("team report should not render the Per-Repository Breakdown; it's a profile-only metric") + } + // Sanity: the scan JSONL prefix still shows up in file paths etc. + // — we didn't accidentally strip the repo context from the whole + // report. + if !strings.Contains(out, "alpha:") && !strings.Contains(out, "beta:") { + t.Error("expected repo-prefixed paths to appear somewhere in the team report; none found") + } +} + +// End-to-end for the `gitcortex scan --email me@x.com --report …` +// flow. Covers three assertions at once: +// 1. 
GenerateProfile emits the Per-Repository Breakdown section +// when the dataset is multi-repo (gated on len(Repos) > 1). +// 2. Counts per repo are filtered to the dev — a commit by +// someone-else@x.com in alpha doesn't bleed into my profile's +// alpha row. +// 3. Files counted per repo are only files THIS dev touched — a +// colleague-exclusive file in alpha must not inflate my scope. +func TestGenerateProfile_MultiRepoBreakdownFiltersByEmail(t *testing.T) { + dir := t.TempDir() + alpha := filepath.Join(dir, "alpha.jsonl") + beta := filepath.Join(dir, "beta.jsonl") + + // alpha: me has 1 commit on a.go; colleague has 1 commit on + // colleague-only.go. beta: me has 1 commit on b.go. + alphaContent := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa01","author_email":"me@x.com","author_name":"Me","author_date":"2024-01-10T00:00:00Z","additions":10,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa01","path_current":"a.go","additions":10,"deletions":0} +{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa02","author_email":"colleague@x.com","author_name":"Col","author_date":"2024-01-11T00:00:00Z","additions":5,"deletions":0,"files_changed":1} +{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa02","path_current":"colleague-only.go","additions":5,"deletions":0} +` + betaContent := `{"type":"commit","sha":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb01","author_email":"me@x.com","author_name":"Me","author_date":"2024-02-10T00:00:00Z","additions":20,"deletions":5,"files_changed":1} +{"type":"commit_file","commit":"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb01","path_current":"b.go","additions":20,"deletions":5} +` + if err := os.WriteFile(alpha, []byte(alphaContent), 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(beta, []byte(betaContent), 0o644); err != nil { + t.Fatal(err) + } + + ds, err := stats.LoadMultiJSONL([]string{alpha, beta}) + if err != nil { + 
t.Fatalf("LoadMultiJSONL: %v", err)
+	}
+
+	var buf bytes.Buffer
+	if err := GenerateProfile(&buf, ds, "scan-profile", "me@x.com"); err != nil {
+		t.Fatalf("GenerateProfile: %v", err)
+	}
+	out := buf.String()
+
+	if !strings.Contains(out, "Per-Repository Breakdown") {
+		t.Fatal("profile report missing the breakdown section — scan --email users won't see their cross-repo split")
+	}
+
+	// Each row renders as `<td>alpha</td>` followed by a `<td>N</td>`
+	// for commits. Assert 1 commit per repo — if the filter leaked,
+	// alpha would show 2 (me's + colleague's).
+	commitCountRe := regexp.MustCompile(`<td[^>]*>(alpha|beta)</td>\s*<td[^>]*>(\d+)`)
+	rows := commitCountRe.FindAllStringSubmatch(out, -1)
+	if len(rows) != 2 {
+		t.Fatalf("expected 2 repo rows in breakdown, got %d: %v", len(rows), rows)
+	}
+	for _, r := range rows {
+		if r[2] != "1" {
+			t.Errorf("repo %s shows %s commits in profile breakdown — email filter leaked (want 1)", r[1], r[2])
+		}
+	}
+
+	// Colleague-exclusive file must not appear in my scope. The
+	// template renders file counts as `<td>N</td>` inside each row
+	// — indirect assertion: if the file-count bumped, the dev-filter
+	// on devCommits is wrong.
+	if strings.Contains(out, "colleague-only.go") {
+		t.Error("profile report mentions a file only the colleague touched")
+	}
+}
+
 func TestGenerate_EmptyDataset(t *testing.T) {
 	dir := t.TempDir()
 	path := filepath.Join(dir, "empty.jsonl")
diff --git a/internal/report/scan_index.go b/internal/report/scan_index.go
new file mode 100644
index 0000000..e8b943d
--- /dev/null
+++ b/internal/report/scan_index.go
@@ -0,0 +1,243 @@
+package report
+
+import (
+	"fmt"
+	"html/template"
+	"io"
+	"time"
+)
+
+// ScanIndexEntry is one repo's row on the scan-index landing page.
+// Successful repos populate the numeric fields; failed / pending
+// repos leave them zero and surface Error instead. ReportHref is the
+// relative URL the index uses to link into each per-repo report —
+// empty when no report exists for that entry.
+type ScanIndexEntry struct { + Slug string + Path string + Status string + Error string + Commits int + Devs int + Files int + Churn int64 + FirstCommitDate string + LastCommitDate string + // LastCommitAgo is LastCommitDate humanized relative to the + // index generation time: "today" / "Nd ago" / "Nmo ago" / "Ny ago". + // Lets operators spot abandoned repos in a list of 30 without + // doing date math in their head. + LastCommitAgo string + // RecencyBucket classifies LastCommitAgo into "fresh" (≤ 30 days), + // "stable" (≤ 1 year), "stale" (> 1 year). Drives the badge color + // in the template so the cold repos stand out at a glance. + RecencyBucket string + ReportHref string +} + +// HumanizeAgo is the caller entry point for the index: formats +// `lastDate` (YYYY-MM-DD) as a compact "ago" phrase relative to +// wall-clock now, and classifies recency into fresh / stable / +// stale for template coloring. Empty pair on unparsable input so +// failed/pending rows with no dates render nothing. +func HumanizeAgo(lastDate string) (label, bucket string) { + return humanizeAgoAt(lastDate, time.Now()) +} + +// humanizeAgoAt is the testable core — same logic as HumanizeAgo +// but takes an explicit "now" so tests can pin the reference time +// and avoid drift. +// +// Future-commit policy: dates strictly AFTER today's UTC midnight +// (e.g. tomorrow's YYYY-MM-DD due to clock skew or a future-dated +// CI rewrite) yield empty — the index surfaces how STALE each repo +// is, and labeling "in 3d" as "fresh" would actively mislead. Both +// sides of the comparison reduce to UTC midnights so "tomorrow" +// is detected even when the user's wall clock is mid-day; sub-day +// committer-clock skew inside today's date still falls into +// `days == 0 → "today"`, documented as the safe fallback at day +// granularity. +// +// The earlier implementation computed `days` as +// int(now.Sub(t).Hours() / 24), which truncates toward zero for +// negative durations. 
A "tomorrow" date 12 hours ahead produced +// days == 0 instead of -1, slipping past the future-date guard +// and rendering "today" / fresh. Comparing the parsed date +// directly against today's UTC midnight removes the truncation +// gap. +func humanizeAgoAt(lastDate string, now time.Time) (label, bucket string) { + t, err := time.Parse("2006-01-02", lastDate) + if err != nil { + return "", "" + } + today := now.UTC().Truncate(24 * time.Hour) + if t.After(today) { + return "", "" + } + days := int(today.Sub(t).Hours() / 24) + switch { + case days < 0: + // Defensive: t.After(today) above already covered this, but + // keep the branch so a later refactor that moves the guard + // can't silently produce a negative-days label. + return "", "" + case days == 0: + label = "today" + case days < 30: + label = fmt.Sprintf("%dd ago", days) + case days < 365: + // Cap the month reading at 11 — otherwise `days/30` emits + // "12mo ago" at days 360-364, which reads as older than + // "1y ago" even though it's actually in the still-stable + // (≤365d) band. Clamping keeps the label progression + // monotonic with the bucket color. + months := days / 30 + if months > 11 { + months = 11 + } + label = fmt.Sprintf("%dmo ago", months) + default: + label = fmt.Sprintf("%dy ago", days/365) + } + switch { + case days <= 30: + bucket = "fresh" + case days <= 365: + bucket = "stable" + default: + bucket = "stale" + } + return label, bucket +} + +// ScanIndexData is the top-level template input for the index page. +type ScanIndexData struct { + GeneratedAt string + Repos []ScanIndexEntry + // TotalRepos / OKRepos / FailedRepos / PendingRepos are + // precomputed so the template doesn't need conditional arithmetic. + // Pending is distinct from failed: a pending repo is one the + // worker never reached (cancelled mid-scan), a failed repo is one + // whose extract or render broke. Mixing them in the summary would + // read a cancel-shaped partial run as a fleet-of-errors. 
+ TotalRepos int + OKRepos int + FailedRepos int + PendingRepos int + TotalCommits int + TotalDevs int + // Largest repo commit count — used to normalize the bar widths so + // the relative-volume bars are visually comparable across repos. + MaxCommits int +} + +// GenerateScanIndex writes the scan landing page: a per-repo card +// list with links to each repo's standalone report and a short +// summary strip. Failures are surfaced inline rather than hidden so +// operators can spot them at a glance and dig into the manifest. +func GenerateScanIndex(w io.Writer, data ScanIndexData) error { + if data.GeneratedAt == "" { + data.GeneratedAt = time.Now().Format("2006-01-02 15:04") + } + return scanIndexTmpl.Execute(w, data) +} + +var scanIndexTmpl = template.Must(template.New("scan-index").Funcs(funcMap).Parse(scanIndexHTML)) + +const scanIndexHTML = ` + + + +gitcortex scan index ({{.TotalRepos}} repositories) + + + + +
+
Repositories
{{thousands .OKRepos}}{{if gt .FailedRepos 0}} ({{.FailedRepos}} failed){{end}}{{if gt .PendingRepos 0}} ({{.PendingRepos}} pending){{end}}
+
Total commits
{{humanize .TotalCommits}}
+
Unique devs
{{thousands .TotalDevs}}
+
+ +

Each repo below links to its own standalone report. Metrics are per-repository — no cross-repo aggregation that would mix signals from unrelated codebases. For a consolidated developer view, use gitcortex scan --report <file> --email <address>.

+ +{{$max := .MaxCommits}} +{{range .Repos}} +
+
+
+ {{if .ReportHref}}{{.Slug}}{{else}}{{.Slug}}{{end}} + {{if ne .Status "ok"}}{{.Status}}{{end}} +
+
{{.Path}}
+ {{if .Error}}
{{.Error}}
{{end}} +
+ {{if eq .Status "ok"}} +
+ {{humanize .Commits}} + commits +
+
+ {{humanize .Churn}} + churn +
+
+ {{.Devs}} + devs +
+
+ {{humanize .Files}} + files +
+
+ {{if .LastCommitAgo}}{{.LastCommitAgo}}
{{end}}
+ {{if and .FirstCommitDate .LastCommitDate}}{{.FirstCommitDate}}
→ {{.LastCommitDate}}{{end}} +
+ {{else}} +
No report available.
+ {{end}} +
+{{if eq .Status "ok"}} +
+
+
+{{end}} +{{end}} + + + + + +` diff --git a/internal/report/scan_index_test.go b/internal/report/scan_index_test.go new file mode 100644 index 0000000..470c10a --- /dev/null +++ b/internal/report/scan_index_test.go @@ -0,0 +1,126 @@ +package report + +import ( + "bytes" + "strings" + "testing" + "time" +) + +func TestHumanizeAgoAt(t *testing.T) { + now := time.Date(2026, 4, 20, 12, 0, 0, 0, time.UTC) + + cases := []struct { + name string + lastDate string + wantLabel string + wantBucket string + }{ + // Boundary on day 0 — "today" reads cleaner than "0d ago". + {"same day", "2026-04-20", "today", "fresh"}, + {"one day", "2026-04-19", "1d ago", "fresh"}, + {"29 days", "2026-03-22", "29d ago", "fresh"}, + // Transition from days to months at 30. + {"30 days exact", "2026-03-21", "1mo ago", "fresh"}, + {"two months", "2026-02-15", "2mo ago", "stable"}, + {"eleven months", "2025-05-25", "11mo ago", "stable"}, + // Month label must not exceed "11mo" — days/30 = 12 on the + // 360-day boundary, but the label is clamped so the stable + // band always reads as sub-year. + {"month clamp at 360 days", "2025-04-25", "11mo ago", "stable"}, + {"month clamp at 364 days", "2025-04-21", "11mo ago", "stable"}, + // 365-day boundary: still "stable" at exactly one year, + // "stale" the day after. 2025-04-20 is 365 days before now. + {"one year exact (stable boundary)", "2025-04-20", "1y ago", "stable"}, + {"one year + one day (stale)", "2025-04-19", "1y ago", "stale"}, + {"two years stale", "2024-04-10", "2y ago", "stale"}, + // Parse failure yields empty. + {"bad input", "not-a-date", "", ""}, + // Future dates (clock skew) yield empty — we don't label "in 3d" + // on an index that exists to surface recency of PAST commits. + {"full-day future", "2026-05-01", "", ""}, + // Regression for the truncation-gap bug: tomorrow (~12h ahead + // of `now`) used to yield int(-0.5) == 0 and silently render + // as "today" / fresh. 
Comparing UTC date midnights instead + // of raw durations rejects it correctly. + {"tomorrow (sub-24h future)", "2026-04-21", "", ""}, + {"two days future", "2026-04-22", "", ""}, + // Sub-day future (committer-date hours ahead of scanner clock) + // lands in days==0 → "today". Documented: the index works at + // day granularity and "today" is the least-misleading fallback + // for intra-day skew within today's date. + {"same-day future (sub-day skew)", "2026-04-20", "today", "fresh"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + label, bucket := humanizeAgoAt(c.lastDate, now) + if label != c.wantLabel { + t.Errorf("label: got %q, want %q", label, c.wantLabel) + } + if bucket != c.wantBucket { + t.Errorf("bucket: got %q, want %q", bucket, c.wantBucket) + } + }) + } +} + +// End-to-end check that the recency chip renders for an ok entry +// and is absent for failed entries (which have no dates). Confirms +// the CSS bucket class is reachable by the template. +func TestGenerateScanIndex_RecencyChipRenders(t *testing.T) { + data := ScanIndexData{ + GeneratedAt: "2026-04-20 12:00", + TotalRepos: 2, + OKRepos: 1, + FailedRepos: 1, + MaxCommits: 10, + Repos: []ScanIndexEntry{ + { + Slug: "alive", + Path: "/work/alive", + Status: "ok", + Commits: 10, + FirstCommitDate: "2024-01-01", + LastCommitDate: "2026-04-18", + LastCommitAgo: "2d ago", + RecencyBucket: "fresh", + ReportHref: "alive.html", + }, + { + Slug: "broken", + Path: "/work/broken", + Status: "failed", + Error: "boom", + }, + }, + } + var buf bytes.Buffer + if err := GenerateScanIndex(&buf, data); err != nil { + t.Fatal(err) + } + out := buf.String() + if !strings.Contains(out, `class="recency fresh"`) { + t.Error("fresh recency chip missing from ok entry") + } + if !strings.Contains(out, `>2d ago<`) { + t.Error("recency label text missing") + } + // Anchor the chip inside the `.dates` wrapper — without this, + // a future template change that moved the chip outside the + // dates 
cell (losing the date context) would still satisfy the + // class-only assertion above. + if !strings.Contains(out, `class="dates"`) { + t.Error("`.dates` wrapper missing — the recency chip has lost its structural anchor") + } + datesIdx := strings.Index(out, `class="dates"`) + recencyIdx := strings.Index(out, `class="recency fresh"`) + if datesIdx < 0 || recencyIdx < 0 || recencyIdx < datesIdx { + t.Errorf("recency chip not inside `.dates` wrapper: dates@%d recency@%d", datesIdx, recencyIdx) + } + // Failed entry has no dates block, so no recency chip should + // render for it — a weak but useful guard against a template + // restructure leaking the chip into the failure render. + if strings.Count(out, `class="recency`) != 1 { + t.Errorf("expected exactly one recency chip (ok entry only); got %d", strings.Count(out, `class="recency`)) + } +} diff --git a/internal/scan/discovery.go b/internal/scan/discovery.go new file mode 100644 index 0000000..8fc96e5 --- /dev/null +++ b/internal/scan/discovery.go @@ -0,0 +1,357 @@ +package scan + +import ( + "context" + "crypto/sha1" + "encoding/hex" + "fmt" + "log" + "os" + "path/filepath" + "sort" + "strings" +) + +// Repo describes one git repository discovered during scan. +type Repo struct { + // AbsPath is the absolute path to the working tree root. + AbsPath string + // RelPath is AbsPath relative to the scan root it was found under. + RelPath string + // Slug is a filesystem-safe identifier used to derive the output + // JSONL file name and the path prefix that stats will use as the + // repo label. Must be unique across the scan (collisions are resolved + // by appending a short hash of AbsPath). + Slug string +} + +// Discover walks the given roots and returns every git repository it finds. +// Directories matched by the ignore matcher are pruned. 
Each repo's Slug is
+// unique across the full result so downstream JSONL files don't collide,
+// and the prefix LoadMultiJSONL derives (basename-minus-ext) still groups
+// correctly.
+//
+// A cancelled ctx aborts the walk promptly — important because on large
+// roots (home directory scans, monorepos of repos) discovery is often
+// the longest phase of scan and Ctrl+C needs to land in real time, not
+// "after the walk naturally finishes". Each filepath.WalkDir callback
+// checks ctx.Done() before doing any filesystem work, so the abort
+// window is bounded by the cost of one stat call per dir entry.
+func Discover(ctx context.Context, roots []string, matcher *Matcher, maxDepth int) ([]Repo, error) {
+ if matcher == nil {
+ matcher = NewMatcher(nil)
+ }
+ if ctx == nil {
+ ctx = context.Background()
+ }
+ var repos []Repo
+ seen := make(map[string]bool)
+
+ for _, root := range roots {
+ if err := ctx.Err(); err != nil {
+ return nil, err
+ }
+ abs, err := filepath.Abs(root)
+ if err != nil {
+ return nil, fmt.Errorf("resolve %s: %w", root, err)
+ }
+ // Canonicalize via EvalSymlinks. The os.Stat below follows the
+ // symlink for the is-directory check, but filepath.WalkDir
+ // treats a symlink ROOT specially: it visits only the link
+ // itself (reported with ModeSymlink) and does NOT descend
+ // into the target. For a user whose ~/work is a symlink to
+ // /mnt/data/work — a common setup — the walk would yield
+ // zero callbacks past the root and scan would conclude "no
+ // git repositories found under ..." despite the target being
+ // full of repos. EvalSymlinks dereferences the root once so
+ // WalkDir starts with a real directory path; links ENCOUNTERED
+ // during the walk are still left as-is (default WalkDir
+ // behavior), which is what we want — we don't chase every
+ // symlink we come across, only the ones the user explicitly
+ // named as a root.
+ resolved, err := filepath.EvalSymlinks(abs) + if err != nil { + return nil, fmt.Errorf("resolve symlinks for %s: %w", abs, err) + } + abs = resolved + info, err := os.Stat(abs) + if err != nil { + return nil, fmt.Errorf("stat %s: %w", abs, err) + } + if !info.IsDir() { + return nil, fmt.Errorf("%s is not a directory", abs) + } + + err = filepath.WalkDir(abs, func(path string, d os.DirEntry, werr error) error { + // Return ctx.Err() (not SkipDir) so WalkDir short-circuits + // the entire walk. Without this check, a cancelled scan + // would keep stat'ing every directory in a large tree + // before the caller saw the interrupt. + if err := ctx.Err(); err != nil { + return err + } + if werr != nil { + // Permission errors on one subtree shouldn't abort the + // whole scan. Log and keep walking. + log.Printf("scan: skip %s: %v", path, werr) + if d != nil && d.IsDir() { + return filepath.SkipDir + } + return nil + } + if !d.IsDir() { + return nil + } + + rel, _ := filepath.Rel(abs, path) + rel = filepath.ToSlash(rel) + + if maxDepth > 0 && rel != "." { + depth := strings.Count(rel, "/") + 1 + if depth > maxDepth { + return filepath.SkipDir + } + } + + if rel != "." && matcher.Match(rel, true) { + // Don't SkipDir blindly — if any negation rule could + // re-include a descendant (e.g. `vendor/` + `!vendor/keep`), + // pruning here would drop the re-inclusion before its + // target is visited. + if !matcher.CouldReinclude(rel) { + return filepath.SkipDir + } + // The dir itself is ignored; we only walk in so a + // negated descendant can be visited in its own turn. + // Return early so the .git detection below doesn't + // record this ignored directory as a repo (and so its + // `return filepath.SkipDir` doesn't cut the descent + // off before the negation's target is seen). A + // descendant whose rel path is re-included by `!rule` + // will be examined by the next WalkDir callback + // invocation. + return nil + } + + // Two repo shapes to detect: + // 1. 
Working tree: path/.git is a dir (normal) or a + // `gitdir: …` pointer file (worktree/submodule). + // 2. Bare repo: path itself contains HEAD + objects/ + // + refs/ — common for clones used as fixtures or + // mirrors (`git clone --bare`, GitHub-style server dirs + // named foo.git). + // Validating the .git entry (not just its presence) matters: + // a random regular file named `.git` would otherwise be + // accepted, the parent path recorded as a repo, and the + // walker SkipDir'd out of the subtree — hiding any real + // repos nested underneath and guaranteeing a downstream + // "not a git repository" failure during extract. isBareRepo + // separately requires HEAD + objects + refs so a stray HEAD + // or empty objects dir can't false-positive either. + gitEntry := filepath.Join(path, ".git") + if isWorkingTreeEntry(gitEntry) || isBareRepo(path) { + if seen[path] { + return filepath.SkipDir + } + seen[path] = true + repos = append(repos, Repo{ + AbsPath: path, + RelPath: rel, + }) + // Don't descend into a repo — nested repos (submodules, + // vendored repos) get picked up separately only if they + // live outside this repo's worktree, which is rare. If + // users need submodule coverage they can list the parent + // and the submodule paths as separate roots. + return filepath.SkipDir + } + + return nil + }) + if err != nil { + return nil, err + } + } + + assignSlugs(repos) + // Sort by slug so output ordering is deterministic across runs + // (WalkDir ordering is, but concatenating multiple roots could shuffle). + sort.Slice(repos, func(i, j int) bool { return repos[i].Slug < repos[j].Slug }) + return repos, nil +} + +// initialSlugHashLen controls the default hash-suffix length in hex +// chars when two repos share a basename. Six chars (24 bits) gives +// readable slugs in the common case; the retry loop in assignSlugs +// grows the length if a truncation collision occurs. 
+const initialSlugHashLen = 6 + +// isReservedSlug reports whether base would collide with a name +// downstream consumers already use for their own output file. +// `scan --report-dir` writes /index.html as the landing page; +// a repo whose basename was literally `index` would overwrite it +// (or be overwritten by it). Forcing the hash branch for reserved +// names avoids the collision without losing the repo. +// +// Switch, not a package-level map, to keep the reserved set +// immutable — a mutable map would let one test silently leak an +// entry into every other test that runs afterward. Case folded to +// align with the case-insensitivity elsewhere in assignSlugs. +func isReservedSlug(base string) bool { + switch strings.ToLower(base) { + case "index": + return true + } + return false +} + +// assignSlugs derives a unique slug per repo from its basename, falling +// back to `-` when two repos share a name. +// The slug is also the JSONL filename stem and the persistence key +// (`.state`), so uniqueness AND determinism across runs both matter: +// if a re-run swaps which sibling gets the bare name, the per-repo +// state file is orphaned and the affected repos are re-extracted from +// scratch (or worse, two repos collide onto one state). +// +// Two-pass: count basenames first, then suffix with a hash whenever the +// base appears more than once anywhere in the result. This makes the +// slug a pure function of (absPath, set of basenames seen), independent +// of WalkDir traversal order. +// +// Short-hash truncation (6 hex = 24 bits) admits a small but nonzero +// collision probability — two absolute paths whose SHA-1 digests share +// the same 6-hex prefix would land on the same slug, and scan would +// silently overwrite one repo's JSONL + state file with the other's. +// The retry loop walks the proposed slug set; on any duplicate we +// redo the pass with a longer hash. 
Grows up to the full 40 hex chars +// before panicking — needing that much is a cryptographic event, not +// a scan-time bug. +// +// Case-insensitive uniqueness: on macOS (APFS/HFS+ in the default +// configuration) and Windows (NTFS), `Repo.jsonl` and `repo.jsonl` +// resolve to the same file on disk. A case-sensitive slug compare +// would treat the two repos as distinct, hand each the bare basename, +// and then quietly let the filesystem merge their JSONL and state +// files. Fold to lower case for BOTH the duplicate-detection count +// and the uniqueness seen-set so collision-triggered hashing fires +// on case-only differences. The emitted slug retains the original +// case for readability (so paths named Repo and repo produce +// `Repo-` and `repo-` — visually distinct to humans, +// case-insensitively distinct on disk). +func assignSlugs(repos []Repo) { + counts := make(map[string]int) + bases := make([]string, len(repos)) + for i := range repos { + base := sanitizeSlug(filepath.Base(repos[i].AbsPath)) + if base == "" { + base = "repo" + } + bases[i] = base + counts[strings.ToLower(base)]++ + } + + for hashLen := initialSlugHashLen; hashLen <= 40; hashLen += 2 { + proposed := make([]string, len(repos)) + seen := make(map[string]int, len(repos)) + collided := false + for i := range repos { + base := bases[i] + slug := base + if counts[strings.ToLower(base)] > 1 || isReservedSlug(base) { + h := sha1.Sum([]byte(repos[i].AbsPath)) + slug = fmt.Sprintf("%s-%s", base, hex.EncodeToString(h[:])[:hashLen]) + } + key := strings.ToLower(slug) + if prev, ok := seen[key]; ok && prev != i { + collided = true + break + } + seen[key] = i + proposed[i] = slug + } + if !collided { + for i := range repos { + repos[i].Slug = proposed[i] + } + return + } + } + panic("scan: slug assignment failed to find unique hash within SHA-1 range") +} + +// isBareRepo returns true when path is a bare git repository — i.e. 
the +// directory itself holds HEAD, objects/, and refs/ rather than wrapping +// them in a .git subdirectory. All three entries are required because +// a stray HEAD file or empty refs/ dir alone is not enough to be a real +// repo and we don't want false positives polluting the manifest. +func isBareRepo(path string) bool { + for _, name := range []string{"HEAD", "objects", "refs"} { + if _, err := os.Stat(filepath.Join(path, name)); err != nil { + return false + } + } + return true +} + +// isWorkingTreeEntry reports whether gitEntry (the path of a `.git` +// dirent inside a candidate repo) marks a real git working tree. +// Accepts: +// - A directory named `.git` — the standard layout. +// - A regular file named `.git` beginning with `gitdir: ` — the +// pointer format git writes for linked worktrees and submodules. +// Rejects everything else: symlinks (by existing policy — the target +// could be anywhere), irregular files, and — the case that motivated +// the validation — a regular file named `.git` that happens to exist +// with unrelated content. Without this check, any plain file called +// `.git` (source dump, build artifact, user mistake) made discovery +// record its parent as a repo and SkipDir out of the subtree, hiding +// nested real repos and producing a guaranteed extract failure. +func isWorkingTreeEntry(gitEntry string) bool { + info, err := os.Lstat(gitEntry) + if err != nil { + return false + } + if info.Mode()&os.ModeSymlink != 0 { + return false + } + if info.IsDir() { + return true + } + if !info.Mode().IsRegular() { + return false + } + // A valid gitdir pointer file starts with literally "gitdir: ". + // Checking only the first 8 bytes is sufficient (and cheap) — the + // full content isn't parsed here; extract will fail loudly if the + // pointer target is broken, which is a better signal than silently + // accepting every regular file and leaving the same confusion for + // the extract phase. 
+ f, openErr := os.Open(gitEntry) + if openErr != nil { + return false + } + defer f.Close() + const prefix = "gitdir: " + buf := make([]byte, len(prefix)) + n, _ := f.Read(buf) + return n == len(prefix) && string(buf) == prefix +} + +// sanitizeSlug strips characters that would break the LoadMultiJSONL prefix +// contract (`:`) or filesystem paths. Keeps alphanumerics, dash, +// underscore, and dot. +func sanitizeSlug(s string) string { + var b strings.Builder + for _, r := range s { + switch { + case r >= 'a' && r <= 'z', + r >= 'A' && r <= 'Z', + r >= '0' && r <= '9', + r == '-', r == '_', r == '.': + b.WriteRune(r) + default: + b.WriteRune('_') + } + } + return b.String() +} diff --git a/internal/scan/discovery_test.go b/internal/scan/discovery_test.go new file mode 100644 index 0000000..97d9e70 --- /dev/null +++ b/internal/scan/discovery_test.go @@ -0,0 +1,682 @@ +package scan + +import ( + "context" + "crypto/sha1" + "encoding/hex" + "fmt" + "os" + "path/filepath" + "strings" + "testing" + "time" +) + +func TestDiscover_FindsRepos(t *testing.T) { + root := t.TempDir() + + mustMkRepo(t, filepath.Join(root, "a")) + mustMkRepo(t, filepath.Join(root, "b")) + mustMkRepo(t, filepath.Join(root, "nested", "c")) + // Plain dir without .git — must NOT be picked up. 
+ if err := os.MkdirAll(filepath.Join(root, "not-a-repo", "src"), 0o755); err != nil { + t.Fatal(err) + } + + repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0) + if err != nil { + t.Fatalf("Discover: %v", err) + } + if len(repos) != 3 { + t.Fatalf("expected 3 repos, got %d: %+v", len(repos), repos) + } + got := map[string]bool{} + for _, r := range repos { + got[r.Slug] = true + } + for _, want := range []string{"a", "b", "c"} { + if !got[want] { + t.Errorf("expected slug %q in %v", want, got) + } + } +} + +func TestDiscover_RespectsIgnore(t *testing.T) { + root := t.TempDir() + mustMkRepo(t, filepath.Join(root, "keep")) + mustMkRepo(t, filepath.Join(root, "node_modules", "garbage")) + + matcher := NewMatcher([]string{"node_modules"}) + repos, err := Discover(context.Background(), []string{root}, matcher, 0) + if err != nil { + t.Fatal(err) + } + if len(repos) != 1 || repos[0].Slug != "keep" { + t.Fatalf("expected only `keep`, got %+v", repos) + } +} + +// Regression: `vendor/` + `!vendor/keep` must descend into vendor/ +// so the negation has a chance to re-include vendor/keep. Before the +// fix, the walker SkipDir'd vendor unconditionally and the re-included +// repo was silently dropped. 
+func TestDiscover_HonorsNegatedDescendant(t *testing.T) {
+ root := t.TempDir()
+ mustMkRepo(t, filepath.Join(root, "app"))
+ mustMkRepo(t, filepath.Join(root, "vendor", "garbage"))
+ mustMkRepo(t, filepath.Join(root, "vendor", "keep"))
+
+ matcher := NewMatcher([]string{"vendor/", "!vendor/keep"})
+ repos, err := Discover(context.Background(), []string{root}, matcher, 0)
+ if err != nil {
+ t.Fatal(err)
+ }
+ got := map[string]bool{}
+ for _, r := range repos {
+ got[r.RelPath] = true
+ }
+ if !got["app"] {
+ t.Errorf("app should be included: %+v", repos)
+ }
+ if !got["vendor/keep"] {
+ t.Errorf("vendor/keep should be re-included by negation rule: %+v", repos)
+ }
+ if got["vendor/garbage"] {
+ t.Errorf("vendor/garbage should remain ignored: %+v", repos)
+ }
+}
+
+// With a 6-hex (24-bit) suffix, the birthday paradox predicts a
+// truncation collision around ~2^12 duplicates. Generating 10k paths
+// that share a basename makes collisions statistically near-certain
+// (expected count ≈ 3). The invariant "all resulting slugs are
+// distinct" forces assignSlugs's retry loop to grow the hash and
+// still produce a unique namespace — the exact corruption class the
+// extra iterations were added to prevent.
+func TestAssignSlugs_UniqueEvenUnderTruncationCollisions(t *testing.T) { + const n = 10000 + repos := make([]Repo, n) + for i := 0; i < n; i++ { + repos[i] = Repo{AbsPath: fmt.Sprintf("/path/to/dir-%06d/myrepo", i)} + } + assignSlugs(repos) + + slugs := make(map[string]int, n) + for i, r := range repos { + if prev, ok := slugs[r.Slug]; ok { + t.Fatalf("duplicate slug %q between repos[%d]=%s and repos[%d]=%s — JSONL + state files would collide", + r.Slug, prev, repos[prev].AbsPath, i, r.AbsPath) + } + slugs[r.Slug] = i + } +} + +// Directly exercise the retry path with two paths constructed to +// collide at initialSlugHashLen. Without the retry, both would get +// slug "myrepo-" and the scan would silently overwrite one +// repo's files with the other's. With the retry, the loop grows +// the hash until the pair separates. +func TestAssignSlugs_ResolvesFirstPrefixCollision(t *testing.T) { + a, b, found := findColliding6HexPaths(50000) + if !found { + t.Skip("no colliding pair found within search budget — astronomically unlikely, skip rather than flake") + } + repos := []Repo{{AbsPath: a}, {AbsPath: b}} + assignSlugs(repos) + + if repos[0].Slug == repos[1].Slug { + t.Fatalf("retry failed: both repos got slug %q for colliding paths %s and %s", repos[0].Slug, a, b) + } + // Sanity: the slug suffix must be longer than the initial 6 hex, + // proving the retry branch actually fired. + const minLenAfterRetry = len("myrepo-") + initialSlugHashLen + 1 + if len(repos[0].Slug) < minLenAfterRetry && len(repos[1].Slug) < minLenAfterRetry { + t.Errorf("expected at least one slug to have a longer hash after retry; got %q and %q", repos[0].Slug, repos[1].Slug) + } +} + +// findColliding6HexPaths searches a deterministic sequence of +// `myrepo` paths for any two whose SHA-1 digests share the first +// initialSlugHashLen hex chars. At 24 bits of resolution the +// birthday bound hits 50% around N≈4900, so maxAttempts=50000 is +// overwhelmingly likely to yield a pair. 
Returns false only if no +// collision was found — the test treats that as a skip, not a +// failure, since the worst-case outcome is a missed opportunity to +// exercise the retry, not a bug. +func findColliding6HexPaths(maxAttempts int) (string, string, bool) { + seen := make(map[string]string, maxAttempts) + for i := 0; i < maxAttempts; i++ { + p := fmt.Sprintf("/search/path-%07d/myrepo", i) + h := sha1.Sum([]byte(p)) + prefix := hex.EncodeToString(h[:])[:initialSlugHashLen] + if prev, ok := seen[prefix]; ok { + return prev, p, true + } + seen[prefix] = p + } + return "", "", false +} + +// `index` is the landing-page filename used by `scan --report-dir`. +// A repo whose basename is literally `index` would emit +// `/index.html`, colliding with (and getting overwritten by) +// the landing page. Force the hash branch for reserved names so +// the per-repo report file never lands on a reserved path. +func TestAssignSlugs_ReservesIndex(t *testing.T) { + repos := []Repo{ + {AbsPath: "/workspace/index"}, + {AbsPath: "/other/unrelated"}, + } + assignSlugs(repos) + + for _, r := range repos { + if filepath.Base(r.AbsPath) == "index" && r.Slug == "index" { + t.Errorf("`index` basename must not produce the bare slug; landing page would be overwritten (path=%s, slug=%s)", r.AbsPath, r.Slug) + } + } + // Case-insensitivity: `Index` would also collide on a case- + // insensitive filesystem serving the HTML. + repos = []Repo{{AbsPath: "/workspace/Index"}} + assignSlugs(repos) + if strings.EqualFold(repos[0].Slug, "index") { + t.Errorf("`Index` basename must also be reserved; got %q", repos[0].Slug) + } +} + +// Regression: on macOS (APFS/HFS+ default) and Windows (NTFS), +// `Repo.jsonl` and `repo.jsonl` are the same file on disk. A +// case-sensitive slug compare would hand both repos the bare +// basename and let the filesystem merge their outputs — one scan's +// JSONL silently overwrites the other. 
Counts and seen-set are
+// lower-cased so any case-only difference is treated as a collision
+// and forces the hash suffix.
+func TestAssignSlugs_CaseInsensitiveUniqueness(t *testing.T) {
+ repos := []Repo{
+ {AbsPath: "/a/Team/Repo"},
+ {AbsPath: "/b/OSS/repo"},
+ }
+ assignSlugs(repos)
+
+ if repos[0].Slug == repos[1].Slug {
+ t.Fatalf("repos with case-only-differing basenames got same slug %q", repos[0].Slug)
+ }
+ // The fs collision only goes away if the slugs also differ when
+ // folded to lower case — which is what decides filenames on
+ // case-insensitive fs.
+ if strings.EqualFold(repos[0].Slug, repos[1].Slug) {
+ t.Fatalf("slugs %q and %q differ only in case — they would still collide on macOS/Windows", repos[0].Slug, repos[1].Slug)
+ }
+ // Sanity: both should have gained a hash suffix (the bare `Repo`
+ // and `repo` weren't safe to emit).
+ for _, r := range repos {
+ if !strings.Contains(r.Slug, "-") {
+ t.Errorf("repo %s got bare slug %q; case-sensitive branch did not trigger hashing", r.AbsPath, r.Slug)
+ }
+ }
+}
+
+// Same paths ingested twice must produce the same slugs — this is
+// what makes resume work across runs. Even with the retry loop
+// adjusting hash length under collisions, a deterministic input set
+// must yield a deterministic output.
+func TestAssignSlugs_DeterministicUnderRetry(t *testing.T) {
+ build := func() []Repo {
+ rs := make([]Repo, 5000)
+ for i := 0; i < 5000; i++ {
+ rs[i] = Repo{AbsPath: fmt.Sprintf("/work/p-%05d/myrepo", i)}
+ }
+ return rs
+ }
+ a := build()
+ b := build()
+ assignSlugs(a)
+ assignSlugs(b)
+ for i := range a {
+ if a[i].Slug != b[i].Slug {
+ t.Errorf("slug for %s differs across runs: %q vs %q", a[i].AbsPath, a[i].Slug, b[i].Slug)
+ }
+ }
+}
+
+// Regression: WalkDir given a symlink ROOT visits only the link
+// entry and refuses to descend. Users whose ~/work is a symlink to
+// a real data disk would previously see "no git repositories found"
+// despite the target being full of repos. Canonicalize via
+// EvalSymlinks before walking so the walk starts at a real dir.
+func TestDiscover_DereferencesSymlinkRoot(t *testing.T) {
+ real := t.TempDir()
+ mustMkRepo(t, filepath.Join(real, "repo-a"))
+ mustMkRepo(t, filepath.Join(real, "nested", "repo-b"))
+
+ // Place the symlink in a separate TempDir so the test doesn't
+ // have to clean it up explicitly.
+ linkDir := t.TempDir()
+ link := filepath.Join(linkDir, "work")
+ if err := os.Symlink(real, link); err != nil {
+ t.Skipf("symlinks not supported here: %v", err)
+ }
+
+ repos, err := Discover(context.Background(), []string{link}, NewMatcher(nil), 0)
+ if err != nil {
+ t.Fatalf("Discover via symlink root: %v", err)
+ }
+ if len(repos) != 2 {
+ t.Fatalf("want 2 repos found through symlink root, got %d: %+v", len(repos), repos)
+ }
+ got := map[string]bool{}
+ for _, r := range repos {
+ got[r.Slug] = true
+ }
+ for _, want := range []string{"repo-a", "repo-b"} {
+ if !got[want] {
+ t.Errorf("expected slug %q in %v (symlink root was not dereferenced)", want, got)
+ }
+ }
+}
+
+// Mid-walk cancel: the pre-cancelled test below only proves the
+// first callback checks ctx. This test cancels AFTER Discover has
+// already started walking — the scenario users hit with Ctrl+C on
+// a long home-dir scan. Asserting the walk stops in flight, not
+// just when prompted at the very start.
+//
+// The tree is large enough (5,000 dirs × 3 subdirs = 20k nodes)
+// that a full walk on typical hardware takes tens of ms; sleeping
+// briefly before cancel lands the signal mid-flight. We assert:
+//  - err is context.Canceled (ctx check fired inside the walk)
+//  - elapsed time < full baseline (walk was actually interrupted)
+//  - fewer repos than the baseline walk returned (the walk did
+//    not quietly run to completion before the cancel landed)
+func TestDiscover_AbortsMidWalk(t *testing.T) {
+ if testing.Short() {
+ t.Skip("builds a large synthetic tree; skipped in -short mode")
+ }
+ root := t.TempDir()
+ const topLevel = 5000
+ for i := 0; i < topLevel; i++ {
+ parent := filepath.Join(root, fmt.Sprintf("d-%05d", i))
+ for j := 0; j < 3; j++ {
+ if err := os.MkdirAll(filepath.Join(parent, fmt.Sprintf("sub-%d", j)), 0o755); err != nil {
+ t.Fatal(err)
+ }
+ }
+ // Every top-level dir is a repo so the result count is a
+ // direct proxy for how far the walk got.
+ mustMkRepo(t, parent)
+ }
+
+ // Baseline: uncancelled walk finds all topLevel repos.
+ baseStart := time.Now()
+ full, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+ if err != nil {
+ t.Fatal(err)
+ }
+ baseline := time.Since(baseStart)
+ if len(full) != topLevel {
+ t.Fatalf("baseline walk should find all %d repos, got %d", topLevel, len(full))
+ }
+
+ // Cancelled walk: give the walk a head start, then cancel.
+ ctx, cancel := context.WithCancel(context.Background()) + done := make(chan struct{}) + go func() { + time.Sleep(baseline / 4) + cancel() + close(done) + }() + + cancStart := time.Now() + repos, err := Discover(ctx, []string{root}, NewMatcher(nil), 0) + cancElapsed := time.Since(cancStart) + <-done + + if err != context.Canceled { + t.Fatalf("want context.Canceled after mid-walk cancel, got err=%v (elapsed %v, baseline %v, repos %d)", + err, cancElapsed, baseline, len(repos)) + } + if cancElapsed >= baseline { + t.Errorf("cancelled walk took %v — baseline was %v. Ctx respected but walk did not shortcut; users will see delayed Ctrl+C", cancElapsed, baseline) + } + if len(repos) >= topLevel { + t.Errorf("cancelled walk returned all %d repos — appears to have finished before cancel fired; baseline %v, elapsed %v", len(repos), baseline, cancElapsed) + } +} + +// A cancelled context must abort the walk — previously Discover +// ignored ctx entirely, so Ctrl+C during the walk phase only took +// effect after every directory had been stat'd. Test uses a +// pre-cancelled context to trip the early-return on the very first +// callback; no repos should be returned and the error should equal +// ctx.Err(). +func TestDiscover_AbortsOnCancelledContext(t *testing.T) { + root := t.TempDir() + // Enough decoy directories that an un-aborted walk would still + // produce observable work — the assertion below fails loudly if + // the check is skipped. 
+ for i := 0; i < 20; i++ { + if err := os.MkdirAll(filepath.Join(root, fmt.Sprintf("dir-%d", i)), 0o755); err != nil { + t.Fatal(err) + } + } + mustMkRepo(t, filepath.Join(root, "would-be-found")) + + ctx, cancel := context.WithCancel(context.Background()) + cancel() // pre-cancelled + + repos, err := Discover(ctx, []string{root}, NewMatcher(nil), 0) + if err != context.Canceled { + t.Fatalf("want context.Canceled, got err=%v", err) + } + if len(repos) != 0 { + t.Errorf("want empty repo list after cancel, got %+v", repos) + } +} + +func TestDiscover_IgnoredRepoNotRecorded_DescendantStillFound(t *testing.T) { + root := t.TempDir() + // `vendor` is itself a repo AND is ignored by `vendor/`. + mustMkRepo(t, filepath.Join(root, "vendor")) + // `vendor/keep` is a nested repo, re-included by `!vendor/keep`. + mustMkRepo(t, filepath.Join(root, "vendor", "keep")) + + matcher := NewMatcher([]string{"vendor/", "!vendor/keep"}) + repos, err := Discover(context.Background(), []string{root}, matcher, 0) + if err != nil { + t.Fatal(err) + } + + got := map[string]bool{} + for _, r := range repos { + got[r.RelPath] = true + } + if got["vendor"] { + t.Error("vendor itself is ignored and must not be recorded as a repo") + } + if !got["vendor/keep"] { + t.Errorf("vendor/keep should be re-included by the negation rule; got %+v", repos) + } +} + +func TestMatcher_CouldReinclude(t *testing.T) { + cases := []struct { + name string + patterns []string + dir string + want bool + }{ + {"no negation", []string{"vendor/"}, "vendor", false}, + {"explicit descendant", []string{"vendor/", "!vendor/keep"}, "vendor", true}, + {"unrelated negation", []string{"vendor/", "!src/main"}, "vendor", false}, + {"basename negation could fire anywhere", []string{"vendor/", "!keep"}, "vendor", true}, + {"deep-match negation", []string{"build/", "!**/src"}, "build", true}, + // Globbed first segment — reviewer case. 
+		{"glob star in segment", []string{"vendor*/", "!vendor*/keep"}, "vendor", true},
+		{"wildcard segment matches any parent", []string{"*/", "!*/keep"}, "vendor", true},
+		{"glob prefix that doesn't match dir", []string{"vendor*/", "!foo*/keep"}, "vendor", false},
+		{"nested dir with literal pattern", []string{"pkg/vendor/", "!pkg/vendor/keep"}, "pkg/vendor", true},
+		{"nested dir with glob in first segment", []string{"*/vendor/", "!*/vendor/keep"}, "pkg/vendor", true},
+		{"equal segment count can't name a descendant, but basename rule still fires", []string{"!vendor"}, "vendor", true}, // basename-anywhere
+	}
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			m := NewMatcher(c.patterns)
+			if got := m.CouldReinclude(c.dir); got != c.want {
+				t.Errorf("CouldReinclude(%q) with %v = %v, want %v", c.dir, c.patterns, got, c.want)
+			}
+		})
+	}
+}
+
+// Regression: glob-prefixed negations like `!vendor*/keep` or `!*/keep`
+// used to slip past CouldReinclude because it only matched literal
+// `dir + "/"` prefixes. Discovery then pruned vendor/ and the
+// re-included vendor/keep repo disappeared from the scan.
+func TestDiscover_HonorsGlobbedNegation(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "vendor", "keep"))
+	mustMkRepo(t, filepath.Join(root, "vendor", "garbage"))
+	mustMkRepo(t, filepath.Join(root, "vendor-old", "keep"))
+	mustMkRepo(t, filepath.Join(root, "unrelated"))
+
+	matcher := NewMatcher([]string{"vendor*/", "!vendor*/keep"})
+	repos, err := Discover(context.Background(), []string{root}, matcher, 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	got := map[string]bool{}
+	for _, r := range repos {
+		got[r.RelPath] = true
+	}
+	want := []string{"vendor/keep", "vendor-old/keep", "unrelated"}
+	for _, w := range want {
+		if !got[w] {
+			t.Errorf("missing %q in discovered repos: %+v", w, repos)
+		}
+	}
+	if got["vendor/garbage"] {
+		t.Error("vendor/garbage should remain ignored")
+	}
+}
+
+func TestDiscover_DoesNotDescendIntoRepo(t *testing.T) {
+	root := t.TempDir()
+	parent := filepath.Join(root, "parent")
+	mustMkRepo(t, parent)
+	// A nested .git inside an already-discovered repo is treated as a
+	// submodule and skipped — we don't double-count.
+	mustMkRepo(t, filepath.Join(parent, "submodule"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 {
+		t.Fatalf("expected 1 repo (parent only), got %d: %+v", len(repos), repos)
+	}
+}
+
+func TestDiscover_SlugCollisionGetsHashSuffix(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "a", "myrepo"))
+	mustMkRepo(t, filepath.Join(root, "b", "myrepo"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 2 {
+		t.Fatalf("expected 2 repos, got %d", len(repos))
+	}
+	if repos[0].Slug == repos[1].Slug {
+		t.Errorf("collision not resolved: both slugs are %q", repos[0].Slug)
+	}
+	// With the two-pass naming, BOTH duplicates carry a hash suffix —
+	// neither gets the bare basename. This guarantees the slug for any
+	// given path is stable regardless of which sibling WalkDir hits
+	// first across runs.
+	for _, r := range repos {
+		if r.Slug == "myrepo" {
+			t.Errorf("expected both colliding repos to get a hash suffix, but %s kept the bare name", r.AbsPath)
+		}
+	}
+}
+
+// Re-running discovery must produce the same slug for the same path
+// even when the WalkDir traversal could legally vary. This is the
+// invariant `.state` resumption depends on.
+func TestDiscover_SlugDeterministicAcrossRuns(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "a", "myrepo"))
+	mustMkRepo(t, filepath.Join(root, "b", "myrepo"))
+
+	first, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	second, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(first) != len(second) {
+		t.Fatalf("repo count differs across runs: %d vs %d", len(first), len(second))
+	}
+	pathSlug := map[string]string{}
+	for _, r := range first {
+		pathSlug[r.AbsPath] = r.Slug
+	}
+	for _, r := range second {
+		if pathSlug[r.AbsPath] != r.Slug {
+			t.Errorf("slug for %s changed across runs: %q → %q", r.AbsPath, pathSlug[r.AbsPath], r.Slug)
+		}
+	}
+}
+
+// Regression: a random regular file named `.git` (not the
+// `gitdir: …` pointer format) used to make discovery record the
+// parent as a repo, SkipDir out of the subtree, and hide any real
+// repos nested below. Now the .git entry is validated — arbitrary
+// files are rejected AND the walker keeps descending to find real
+// repos deeper.
+func TestDiscover_RejectsArbitraryGitFileAndDescends(t *testing.T) {
+	root := t.TempDir()
+	// Parent with a bogus `.git` file — looks like a repo by naive
+	// existence check, isn't by content check.
+	bogus := filepath.Join(root, "bogus")
+	if err := os.MkdirAll(bogus, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(bogus, ".git"), []byte("not a git pointer file\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	// Real nested repo the old code would have hidden via SkipDir
+	// after recording the bogus parent.
+	mustMkRepo(t, filepath.Join(bogus, "nested"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	got := map[string]bool{}
+	for _, r := range repos {
+		got[r.RelPath] = true
+	}
+	if got["bogus"] {
+		t.Error("parent with a non-pointer `.git` file must not be recorded as a repo")
+	}
+	if !got["bogus/nested"] {
+		t.Errorf("real repo nested under a bogus .git file should still be discovered; got %+v", repos)
+	}
+}
+
+// Regression companion: a valid `gitdir: …` pointer file (the
+// format git writes for linked worktrees and submodules) must still
+// be accepted as a repo.
+func TestDiscover_AcceptsGitdirPointerFile(t *testing.T) {
+	root := t.TempDir()
+	realGit := filepath.Join(t.TempDir(), "real-git-dir")
+	if err := os.MkdirAll(realGit, 0o755); err != nil {
+		t.Fatal(err)
+	}
+
+	wt := filepath.Join(root, "worktree")
+	if err := os.MkdirAll(wt, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	// Pointer file: git's actual format is literally "gitdir: <path>\n".
+	pointer := []byte("gitdir: " + realGit + "\n")
+	if err := os.WriteFile(filepath.Join(wt, ".git"), pointer, 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].RelPath != "worktree" {
+		t.Fatalf("expected to discover the gitdir-pointer worktree, got %+v", repos)
+	}
+}
+
+func TestDiscover_RejectsSymlinkGit(t *testing.T) {
+	root := t.TempDir()
+	// A "repo" whose .git is a symlink — should not be picked up.
+	bad := filepath.Join(root, "weird")
+	if err := os.MkdirAll(bad, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.Symlink("/etc/hostname", filepath.Join(bad, ".git")); err != nil {
+		t.Skipf("symlink unsupported here: %v", err)
+	}
+	mustMkRepo(t, filepath.Join(root, "real"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].Slug != "real" {
+		t.Fatalf("expected only `real`, got %+v", repos)
+	}
+}
+
+func TestDiscover_MaxDepthHonored(t *testing.T) {
+	root := t.TempDir()
+	mustMkRepo(t, filepath.Join(root, "shallow"))
+	mustMkRepo(t, filepath.Join(root, "a", "b", "c", "deep"))
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 2)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].Slug != "shallow" {
+		t.Fatalf("expected only shallow, got %+v", repos)
+	}
+}
+
+func mustMkRepo(t *testing.T, path string) {
+	t.Helper()
+	if err := os.MkdirAll(filepath.Join(path, ".git"), 0o755); err != nil {
+		t.Fatal(err)
+	}
+}
+
+func TestDiscover_FindsBareRepo(t *testing.T) {
+	root := t.TempDir()
+	bare := filepath.Join(root, "myrepo.git")
+	for _, name := range []string{"HEAD", "objects", "refs"} {
+		full := filepath.Join(bare, name)
+		if name == "HEAD" {
+			if err := os.MkdirAll(bare, 0o755); err != nil {
+				t.Fatal(err)
+			}
+			if err := os.WriteFile(full, []byte("ref: refs/heads/main\n"), 0o644); err != nil {
+				t.Fatal(err)
+			}
+		} else {
+			if err := os.MkdirAll(full, 0o755); err != nil {
+				t.Fatal(err)
+			}
+		}
+	}
+	// Decoy: dir with HEAD only — must not be picked up.
+	decoy := filepath.Join(root, "not-a-repo")
+	if err := os.MkdirAll(decoy, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(decoy, "HEAD"), []byte("nope\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	repos, err := Discover(context.Background(), []string{root}, NewMatcher(nil), 0)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(repos) != 1 || repos[0].Slug != "myrepo.git" {
+		t.Fatalf("expected single myrepo.git, got %+v", repos)
+	}
+}
diff --git a/internal/scan/ignore.go b/internal/scan/ignore.go
new file mode 100644
index 0000000..bffce2c
--- /dev/null
+++ b/internal/scan/ignore.go
@@ -0,0 +1,231 @@
+package scan
+
+import (
+	"bufio"
+	"fmt"
+	"os"
+	"path"
+	"path/filepath"
+	"strings"
+)
+
+// IgnoreRule is one line of the ignore file, pre-parsed.
+// Supported syntax (a subset of gitignore, enough for directory pruning):
+//
+//	# comment (ignored)
+//	node_modules (literal name — matches as basename or full path)
+//	vendor/ (directory-only; matches dirs named vendor)
+//	archive/* (glob — forwarded to path.Match on the rel path)
+//	**/generated (deep-match: same as `generated` matched anywhere)
+//	!important (negation — re-includes a match)
+//
+// Matching is evaluated in order; the last rule to match wins, following
+// gitignore semantics so `!foo` after `foo/` re-includes foo.
+type IgnoreRule struct {
+	Pattern string
+	Negate  bool
+	DirOnly bool
+}
+
+type Matcher struct {
+	rules []IgnoreRule
+}
+
+func NewMatcher(patterns []string) *Matcher {
+	m := &Matcher{}
+	for _, p := range patterns {
+		if r, ok := parseRule(p); ok {
+			m.rules = append(m.rules, r)
+		}
+	}
+	return m
+}
+
+// LoadMatcher reads an ignore file and returns a matcher. Missing files
+// are rejected — this entry point is used when the path was specified
+// explicitly (e.g. `--ignore-file foo.ignore`), where a typo should
+// fail loudly instead of silently disabling all ignore rules and
+// widening discovery scope. Callers that want "load if present, empty
+// matcher otherwise" semantics (implicit default-path lookup) should
+// os.Stat first.
+func LoadMatcher(file string) (*Matcher, error) {
+	f, err := os.Open(file)
+	if err != nil {
+		return nil, fmt.Errorf("open %s: %w", file, err)
+	}
+	defer f.Close()
+
+	var patterns []string
+	scanner := bufio.NewScanner(f)
+	for scanner.Scan() {
+		patterns = append(patterns, scanner.Text())
+	}
+	if err := scanner.Err(); err != nil {
+		return nil, fmt.Errorf("read %s: %w", file, err)
+	}
+	return NewMatcher(patterns), nil
+}
+
+func parseRule(line string) (IgnoreRule, bool) {
+	line = strings.TrimSpace(line)
+	if line == "" || strings.HasPrefix(line, "#") {
+		return IgnoreRule{}, false
+	}
+	r := IgnoreRule{}
+	if strings.HasPrefix(line, "!") {
+		r.Negate = true
+		line = line[1:]
+	}
+	if strings.HasSuffix(line, "/") {
+		r.DirOnly = true
+		line = strings.TrimSuffix(line, "/")
+	}
+	r.Pattern = line
+	return r, r.Pattern != ""
+}
+
+// Match reports whether relPath should be ignored. isDir hints directory-
+// only rules (`vendor/` only fires on dirs).
+//
+// Evaluation order: scan every rule, track the last match. This is what
+// gitignore does — it lets `!src/keep` override a broader earlier `src/`
+// block without forcing the user to care about rule ordering beyond the
+// obvious "put exceptions after the thing they exclude".
+func (m *Matcher) Match(relPath string, isDir bool) bool {
+	relPath = filepath.ToSlash(relPath)
+	matched := false
+	for _, r := range m.rules {
+		if r.DirOnly && !isDir {
+			continue
+		}
+		if !matchRule(r, relPath) {
+			continue
+		}
+		matched = !r.Negate
+	}
+	return matched
+}
+
+// CouldReinclude reports whether any negation rule targets a descendant
+// of dir — i.e. walking into the dir could still yield a re-included
+// path. Callers use this to decide whether to prune an ignored
+// directory via filepath.SkipDir or descend into it and evaluate
+// children individually.
+//
+// Without this check the walker short-circuits the matcher's last-
+// match-wins semantics: a `vendor/` + `!vendor/keep` pair would skip
+// the vendor subtree entirely before vendor/keep could be examined,
+// silently dropping the re-included path.
+//
+// Returns true when:
+// - a negation's pattern has no path separator (basename rules like
+// `!keep` or `!*.go` that could fire at any depth), OR
+// - a negation begins with `**/` (deep-match; applies anywhere in the
+// tree), OR
+// - the pattern's leading path segments are each compatible with the
+// corresponding segment of dir via path.Match, so its trailing
+// segment(s) could name a descendant. This covers literal prefixes
+// like `!vendor/keep` AND globbed prefixes like `!vendor*/keep` or
+// `!*/keep` — both can match children of an ignored `vendor`.
+//
+// Negations that target a sibling or ancestor path (e.g. `!src/keep`
+// when walking vendor) correctly don't trigger descent, so pruning
+// remains effective for unrelated ignored trees like `node_modules/`.
+func (m *Matcher) CouldReinclude(dir string) bool {
+	dir = filepath.ToSlash(dir)
+	dirSegs := strings.Split(dir, "/")
+	for _, r := range m.rules {
+		if !r.Negate {
+			continue
+		}
+		pat := r.Pattern
+		// `**/foo` and other deep-match patterns can fire at any
+		// depth — walking into any ignored subtree could reach one.
+		if strings.HasPrefix(pat, "**/") {
+			return true
+		}
+		// Basename-only rules (no `/`) apply to any segment. `!keep`
+		// could re-include vendor/keep, a/b/keep, anywhere.
+		if !strings.Contains(pat, "/") {
+			return true
+		}
+		patSegs := strings.Split(pat, "/")
+		// The pattern must have strictly more segments than dir;
+		// otherwise its deepest named entity is dir itself or an
+		// ancestor, never a descendant worth descending for.
+		if len(patSegs) <= len(dirSegs) {
+			continue
+		}
+		// Each of the pattern's leading segments must be compatible
+		// with the corresponding dir segment. path.Match handles both
+		// literal names (`vendor`) and globs (`vendor*`, `*`, `?`)
+		// uniformly; a failing match on any segment rules this
+		// negation out.
+		ok := true
+		for i := 0; i < len(dirSegs); i++ {
+			matched, err := path.Match(patSegs[i], dirSegs[i])
+			if err != nil || !matched {
+				ok = false
+				break
+			}
+		}
+		if ok {
+			return true
+		}
+	}
+	return false
+}
+
+func matchRule(r IgnoreRule, relPath string) bool {
+	pat := r.Pattern
+
+	// **/foo → strip leading **/ and treat as "match foo anywhere" —
+	// same as the basename-or-suffix check below. We don't support
+	// arbitrary ** in the middle of patterns because users coming from
+	// gitignore expect the common forms (prefix dir, basename, ext),
+	// not the full double-star algebra. Saves us writing a mini doublestar
+	// engine for a marginal feature.
+	if strings.HasPrefix(pat, "**/") {
+		pat = strings.TrimPrefix(pat, "**/")
+	}
+
+	base := path.Base(relPath)
+
+	// Literal basename: `node_modules` matches any segment named that.
+	if pat == base {
+		return true
+	}
+	// Any segment of the path equals the pattern (literal dir/file name
+	// embedded in the tree). "vendor" matches "a/b/vendor/c.go".
+	if !strings.ContainsAny(pat, "*?[") && segmentMatch(pat, relPath) {
+		return true
+	}
+	// Glob on basename: "*.log"
+	if matched, _ := path.Match(pat, base); matched {
+		return true
+	}
+	// Glob on full relative path: "archive/*"
+	if matched, _ := path.Match(pat, relPath); matched {
+		return true
+	}
+	// Directory-prefix glob like "archive/*": strip the trailing "*"
+	// (and any "/") and match anything under that prefix. Bare
+	// directory rules like "archive/" were already reduced to
+	// "archive" in parseRule and are caught by the segment checks
+	// above, so this branch only fires for patterns that actually
+	// shrank after stripping.
+	prefix := strings.TrimSuffix(pat, "*")
+	prefix = strings.TrimSuffix(prefix, "/")
+	if prefix != "" && prefix != pat {
+		if strings.HasPrefix(relPath, prefix+"/") || relPath == prefix {
+			return true
+		}
+	}
+	return false
+}
+
+func segmentMatch(needle, relPath string) bool {
+	for _, seg := range strings.Split(relPath, "/") {
+		if seg == needle {
+			return true
+		}
+	}
+	return false
+}
+
diff --git a/internal/scan/ignore_test.go b/internal/scan/ignore_test.go
new file mode 100644
index 0000000..cda4f25
--- /dev/null
+++ b/internal/scan/ignore_test.go
@@ -0,0 +1,87 @@
+package scan
+
+import (
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+)
+
+func TestMatcher_Basics(t *testing.T) {
+	m := NewMatcher([]string{
+		"# comment line",
+		"",
+		"node_modules",
+		"vendor/",
+		"archive/*",
+		"*.log",
+		"!important.log",
+		"**/generated",
+	})
+
+	cases := []struct {
+		name  string
+		path  string
+		isDir bool
+		want  bool
+	}{
+		{"basename dir match", "a/b/node_modules", true, true},
+		{"basename file match in middle", "a/node_modules/c.go", false, true},
+		{"vendor dir-only match", "src/vendor", true, true},
+		{"vendor on file does not match", "src/vendor", false, false},
+		{"glob in subdir", "archive/2024", true, true},
+		{"glob extension", "out/build.log", false, true},
+		{"negation re-includes", "out/important.log", false, false},
+		{"deep generated", "src/foo/generated", true, true},
+		{"unrelated path", "src/main.go", false, false},
+	}
+
+	for _, c := range cases {
+		t.Run(c.name, func(t *testing.T) {
+			got := m.Match(c.path, c.isDir)
+			if got != c.want {
+				t.Errorf("Match(%q, dir=%v) = %v, want %v", c.path, c.isDir, got, c.want)
+			}
+		})
+	}
+}
+
+// LoadMatcher is called when the user supplied an explicit
+// --ignore-file; a missing file there is almost always a typo. If
+// we silently returned an empty matcher, every discovery would
+// widen to include node_modules/, vendor/, chromium-clones/, etc.
+// that the user thought they had excluded — and they'd have no way to
+// tell from the console output. Fail loudly instead; the
+// default-path lookup in scan.loadMatcher handles the "silent when
+// absent" case via its own os.Stat before calling here.
+func TestLoadMatcher_MissingFileFails(t *testing.T) {
+	_, err := LoadMatcher(filepath.Join(t.TempDir(), "typo.ignore"))
+	if err == nil {
+		t.Fatal("expected error for missing explicit ignore file; a typo must not silently disable rules")
+	}
+	if !os.IsNotExist(err) && !strings.Contains(err.Error(), "typo.ignore") {
+		t.Errorf("error should identify the missing path; got %q", err)
+	}
+}
+
+func TestLoadMatcher_FromFile(t *testing.T) {
+	dir := t.TempDir()
+	path := filepath.Join(dir, ".gitcortex-ignore")
+	contents := "# comment\nnode_modules\nvendor/\n"
+	if err := os.WriteFile(path, []byte(contents), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	m, err := LoadMatcher(path)
+	if err != nil {
+		t.Fatalf("LoadMatcher: %v", err)
+	}
+	if !m.Match("foo/node_modules", true) {
+		t.Error("node_modules should match")
+	}
+	if !m.Match("vendor", true) {
+		t.Error("vendor/ should match dirs")
+	}
+	if m.Match("vendor", false) {
+		t.Error("vendor/ should not match files")
+	}
+}
diff --git a/internal/scan/scan.go b/internal/scan/scan.go
new file mode 100644
index 0000000..a634f21
--- /dev/null
+++ b/internal/scan/scan.go
@@ -0,0 +1,304 @@
+package scan
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"log"
+	"os"
+	"path/filepath"
+	"sync"
+	"time"
+
+	"github.com/lex0c/gitcortex/internal/extract"
+	"github.com/lex0c/gitcortex/internal/git"
+)
+
+// Config holds scan command input. The Extract template is copied per-repo
+// and its Repo/Output/StateFile/Branch are overwritten per worker — the
+// template exists to carry shared flags (ignore patterns, batch size,
+// mailmap, etc.) without re-declaring them on the scan surface.
+type Config struct {
+	Roots      []string
+	Output     string
+	IgnoreFile string
+	MaxDepth   int
+	Parallel   int
+	Extract    extract.Config
+}
+
+// Manifest is persisted to <output>/manifest.json so subsequent runs
+// (or the report stage) can discover which JSONL files to consolidate
+// without re-walking the filesystem.
+type Manifest struct {
+	GeneratedAt string         `json:"generated_at"`
+	Roots       []string       `json:"roots"`
+	Repos       []ManifestRepo `json:"repos"`
+}
+
+type ManifestRepo struct {
+	Slug       string `json:"slug"`
+	Path       string `json:"path"`
+	Branch     string `json:"branch"`
+	JSONL      string `json:"jsonl"`
+	StateFile  string `json:"state_file"`
+	Status     string `json:"status"` // ok | failed | skipped | pending
+	Error      string `json:"error,omitempty"`
+	Commits    int    `json:"commits,omitempty"`
+	DurationMS int64  `json:"duration_ms,omitempty"`
+}
+
+// Result is returned to the caller so the CLI can drive the report stage
+// with the concrete JSONL paths (no need to re-read the manifest).
+type Result struct {
+	OutputDir string
+	Manifest  Manifest
+	JSONLs    []string // only successful repos
+}
+
+// Run discovers repos under cfg.Roots, writes one JSONL per repo via
+// the existing extract pipeline, and emits a manifest. Failures on
+// individual repos are recorded in the manifest with Status="failed"
+// but do not abort the whole scan — the report can still be generated
+// from the repos that succeeded.
+func Run(ctx context.Context, cfg Config) (*Result, error) {
+	if len(cfg.Roots) == 0 {
+		return nil, fmt.Errorf("scan: at least one --root is required")
+	}
+	if cfg.Output == "" {
+		return nil, fmt.Errorf("scan: --output is required")
+	}
+	if err := os.MkdirAll(cfg.Output, 0o755); err != nil {
+		return nil, fmt.Errorf("create output dir: %w", err)
+	}
+	if cfg.Parallel <= 0 {
+		cfg.Parallel = 1
+	}
+
+	matcher, err := loadMatcher(cfg)
+	if err != nil {
+		return nil, err
+	}
+
+	repos, err := Discover(ctx, cfg.Roots, matcher, cfg.MaxDepth)
+	if err != nil {
+		return nil, err
+	}
+	if len(repos) == 0 {
+		return nil, fmt.Errorf("scan: no git repositories found under %v", cfg.Roots)
+	}
+	log.Printf("scan: discovered %d repositories", len(repos))
+
+	manifest := Manifest{
+		GeneratedAt: time.Now().UTC().Format(time.RFC3339),
+		Roots:       cfg.Roots,
+		Repos:       make([]ManifestRepo, len(repos)),
+	}
+	// Pre-fill every slot with the repo's identity and a "pending"
+	// status. If the user cancels (Ctrl+C) before a worker reaches a
+	// repo, the manifest still names it with Status="pending" instead
+	// of an empty struct that misleads "ok, 0 failed" accounting.
+	for i, r := range repos {
+		manifest.Repos[i] = ManifestRepo{
+			Slug:      r.Slug,
+			Path:      r.AbsPath,
+			JSONL:     filepath.Join(cfg.Output, r.Slug+".jsonl"),
+			StateFile: filepath.Join(cfg.Output, r.Slug+".state"),
+			Status:    "pending",
+		}
+	}
+
+	// Worker pool over repos. Each worker runs extract.Run against one
+	// repo and writes to its dedicated JSONL + state file. The extract
+	// package already checkpoints per-repo state, so an interrupted scan
+	// can be resumed by re-invoking the command — completed repos will
+	// be skipped (their state file's last SHA prevents re-emission) and
+	// partials will resume.
+	type job struct {
+		idx  int
+		repo Repo
+	}
+	jobs := make(chan job)
+	var wg sync.WaitGroup
+
+	for w := 0; w < cfg.Parallel; w++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for j := range jobs {
+				manifest.Repos[j.idx] = runJobSafely(ctx, cfg, j.repo)
+			}
+		}()
+	}
+
+	go func() {
+		for i, r := range repos {
+			select {
+			case <-ctx.Done():
+				close(jobs)
+				return
+			case jobs <- job{idx: i, repo: r}:
+			}
+		}
+		close(jobs)
+	}()
+
+	wg.Wait()
+
+	manifestPath := filepath.Join(cfg.Output, "manifest.json")
+	if err := writeManifest(manifestPath, manifest); err != nil {
+		return nil, fmt.Errorf("write manifest: %w", err)
+	}
+
+	var jsonls []string
+	var ok, failed, pending int
+	for _, r := range manifest.Repos {
+		switch r.Status {
+		case "ok":
+			jsonls = append(jsonls, r.JSONL)
+			ok++
+		case "pending":
+			pending++
+		default:
+			failed++
+		}
+	}
+	log.Printf("scan: %d ok, %d failed, %d not started; manifest at %s", ok, failed, pending, manifestPath)
+
+	res := &Result{
+		OutputDir: cfg.Output,
+		Manifest:  manifest,
+		JSONLs:    jsonls,
+	}
+	// Surface cancellation as a non-nil error so the CLI exits non-zero
+	// — but return the partial result anyway so the caller can still
+	// inspect what completed before the interrupt.
+	if err := ctx.Err(); err != nil {
+		return res, err
+	}
+	return res, nil
+}
+
+func loadMatcher(cfg Config) (*Matcher, error) {
+	// Per-root default: look for .gitcortex-ignore at the first root when
+	// no explicit path was given. Single-file load is enough for the MVP;
+	// if users need per-root rule sets they can concat them into one file.
+	path := cfg.IgnoreFile
+	if path == "" && len(cfg.Roots) > 0 {
+		candidate := filepath.Join(cfg.Roots[0], ".gitcortex-ignore")
+		if _, err := os.Stat(candidate); err == nil {
+			path = candidate
+		}
+	}
+	if path == "" {
+		return NewMatcher(nil), nil
+	}
+	return LoadMatcher(path)
+}
+
+// runJobFn is the injection point the worker pool uses. Default is
+// the real runOne; tests override it to simulate extract panics and
+// verify the pool's recovery path. Not exposed via any user-facing
+// API — this is strictly a test seam.
+var runJobFn = runOne
+
+// runJobSafely wraps runJobFn in a defer-recover. If extract panics
+// inside a worker the goroutine would otherwise crash the entire
+// process and the repo's manifest slot would stay at "pending", which
+// is indistinguishable from a job that was never dispatched. Convert
+// panics to a well-formed ManifestRepo with Status="failed" so the
+// manifest reflects reality and the rest of the pool keeps running.
+//
+// extract.Run doesn't panic under normal inputs — the recover exists
+// for the "impossible" cases (e.g. a git binary that writes malformed
+// output, nil-deref bugs introduced by a future refactor). Better to
+// isolate the blast radius than leave a silent corruption mode.
+func runJobSafely(ctx context.Context, cfg Config, repo Repo) (entry ManifestRepo) {
+	defer func() {
+		if r := recover(); r != nil {
+			entry = ManifestRepo{
+				Slug:      repo.Slug,
+				Path:      repo.AbsPath,
+				JSONL:     filepath.Join(cfg.Output, repo.Slug+".jsonl"),
+				StateFile: filepath.Join(cfg.Output, repo.Slug+".state"),
+				Status:    "failed",
+				Error:     fmt.Sprintf("panic: %v", r),
+			}
+			log.Printf("scan: [%s] panicked: %v", repo.Slug, r)
+		}
+	}()
+	return runJobFn(ctx, cfg, repo)
+}
+
+func runOne(ctx context.Context, cfg Config, repo Repo) ManifestRepo {
+	jsonlPath := filepath.Join(cfg.Output, repo.Slug+".jsonl")
+	statePath := filepath.Join(cfg.Output, repo.Slug+".state")
+	entry := ManifestRepo{
+		Slug:      repo.Slug,
+		Path:      repo.AbsPath,
+		JSONL:     jsonlPath,
+		StateFile: statePath,
+	}
+
+	// Skip cleanly if the user already cancelled before this worker
+	// picked up the job — avoids a noisy "extracting..." log followed
+	// by an immediate ctx.Err() from extract.
+	if err := ctx.Err(); err != nil {
+		entry.Status = "skipped"
+		entry.Error = err.Error()
+		return entry
+	}
+
+	branch := cfg.Extract.Branch
+	if branch == "" {
+		branch = git.DetectDefaultBranch(repo.AbsPath)
+	}
+	if branch == "" {
+		entry.Status = "failed"
+		entry.Error = "could not detect default branch"
+		return entry
+	}
+	entry.Branch = branch
+
+	// Copy the extract config so each worker gets its own Repo/Output/State
+	// without racing on shared fields. Preserves ignore patterns, batch
+	// size, mailmap, etc. from the template.
+	sub := cfg.Extract
+	sub.Repo = repo.AbsPath
+	sub.Branch = branch
+	sub.Output = jsonlPath
+	sub.StateFile = statePath
+	if sub.CommandTimeout == 0 {
+		sub.CommandTimeout = extract.DefaultCommandTimeout
+	}
+	// Always force StartOffset back to -1 (= "load from state file")
+	// regardless of the template's value. Programmatic callers passing
+	// a zero-value Config would otherwise feed flagOffset=0 to
+	// extract.LoadState, which would IGNORE the per-repo state file and
+	// re-extract from scratch. StartSHA gets cleared for the same
+	// reason: a single CLI-level SHA cannot apply to N different repos.
+	sub.StartSHA = ""
+	sub.StartOffset = -1
+
+	log.Printf("scan: [%s] extracting from %s (branch %s)", repo.Slug, repo.AbsPath, branch)
+
+	start := time.Now()
+	if err := extract.Run(ctx, sub); err != nil {
+		entry.Status = "failed"
+		entry.Error = fmt.Errorf("extract %s: %w", repo.Slug, err).Error()
+		entry.DurationMS = time.Since(start).Milliseconds()
+		log.Printf("scan: [%s] failed: %v", repo.Slug, err)
+		return entry
+	}
+	entry.Status = "ok"
+	entry.DurationMS = time.Since(start).Milliseconds()
+	return entry
+}
+
+func writeManifest(path string, m Manifest) error {
+	data, err := json.MarshalIndent(m, "", " ")
+	if err != nil {
+		return err
+	}
+	return os.WriteFile(path, data, 0o644)
+}
diff --git a/internal/scan/scan_test.go b/internal/scan/scan_test.go
new file mode 100644
index 0000000..ad0c073
--- /dev/null
+++ b/internal/scan/scan_test.go
@@ -0,0 +1,211 @@
+package scan
+
+import (
+	"context"
+	"encoding/json"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"github.com/lex0c/gitcortex/internal/extract"
+	"github.com/lex0c/gitcortex/internal/stats"
+)
+
+// Regression: if extract.Run panics inside a worker (a future
+// refactor with a nil deref, an unexpected git binary output, etc.),
+// defer wg.Done() still closes the WaitGroup — but without a
+// recover, the goroutine crashes the whole process before the
+// manifest slot for that repo is written. Even without a full crash,
+// the slot would silently stay at "pending", indistinguishable from
+// "worker never reached it". The pool must convert panics to a
+// ManifestRepo with Status="failed" and continue with sibling jobs.
+//
+// Override the injection seam runJobFn: first repo panics, second
+// runs normally. Assert (1) the pool survives, (2) the panicking
+// repo's slot is Status=failed with a panic: prefix, (3) the
+// non-panicking repo still completes.
+func TestRun_RecoversFromWorkerPanic(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skip("git not installed")
+	}
+
+	root := t.TempDir()
+	makeRealRepo(t, filepath.Join(root, "boom"), map[string]string{"x.go": "x\n"})
+	makeRealRepo(t, filepath.Join(root, "ok"), map[string]string{"y.go": "y\n"})
+
+	// Override runJobFn for the duration of this test only.
+	original := runJobFn
+	defer func() { runJobFn = original }()
+	runJobFn = func(ctx context.Context, cfg Config, repo Repo) ManifestRepo {
+		if repo.Slug == "boom" {
+			panic("simulated extract failure")
+		}
+		return original(ctx, cfg, repo)
+	}
+
+	output := filepath.Join(t.TempDir(), "out")
+	cfg := Config{
+		Roots:    []string{root},
+		Output:   output,
+		Parallel: 1, // force serial so the panicking job is reached
+		Extract: extract.Config{
+			BatchSize:      100,
+			CommandTimeout: extract.DefaultCommandTimeout,
+		},
+	}
+
+	res, err := Run(context.Background(), cfg)
+	if err != nil {
+		t.Fatalf("Run should not propagate the worker panic as an error; got %v", err)
+	}
+
+	gotStatus := map[string]ManifestRepo{}
+	for _, r := range res.Manifest.Repos {
+		gotStatus[r.Slug] = r
+	}
+	boom, ok := gotStatus["boom"]
+	if !ok {
+		t.Fatal("boom slot missing from manifest")
+	}
+	if boom.Status != "failed" {
+		t.Errorf("boom should be failed after panic, got Status=%q", boom.Status)
+	}
+	if !strings.Contains(boom.Error, "panic") {
+		t.Errorf("boom error should carry a panic: prefix so operators can tell panics from normal failures; got %q", boom.Error)
+	}
+	if okRepo, okFound := gotStatus["ok"]; !okFound || okRepo.Status != "ok" {
+		t.Errorf("sibling job must still complete despite the other worker's panic; got %+v", okRepo)
+	}
+	// Ensure the pool didn't leak: all slots have a non-pending status.
+	for _, r := range res.Manifest.Repos {
+		if r.Status == "pending" {
+			t.Errorf("repo %s left at 'pending' after Run returned — the pool abandoned a slot", r.Slug)
+		}
+	}
+}
+
+func TestRun_EndToEnd(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skip("git not installed")
+	}
+
+	root := t.TempDir()
+	makeRealRepo(t, filepath.Join(root, "alpha"), map[string]string{
+		"main.go":   "package main\nfunc main() {}\n",
+		"README.md": "# alpha\n",
+	})
+	makeRealRepo(t, filepath.Join(root, "beta"), map[string]string{
+		"app.py": "print('hi')\n",
+	})
+	// One node_modules-style decoy that should be ignored.
+	makeRealRepo(t, filepath.Join(root, "node_modules", "garbage"), map[string]string{
+		"x.js": "x\n",
+	})
+
+	output := filepath.Join(t.TempDir(), "out")
+	cfg := Config{
+		Roots:    []string{root},
+		Output:   output,
+		Parallel: 2,
+		Extract: extract.Config{
+			BatchSize:      100,
+			CommandTimeout: extract.DefaultCommandTimeout,
+		},
+	}
+	cfg.IgnoreFile = writeIgnoreFile(t, root, "node_modules\n")
+
+	res, err := Run(context.Background(), cfg)
+	if err != nil {
+		t.Fatalf("Run: %v", err)
+	}
+
+	if len(res.JSONLs) != 2 {
+		t.Fatalf("expected 2 JSONLs, got %d (%v)", len(res.JSONLs), res.JSONLs)
+	}
+
+	// Manifest sanity
+	manifestPath := filepath.Join(output, "manifest.json")
+	mb, err := os.ReadFile(manifestPath)
+	if err != nil {
+		t.Fatalf("read manifest: %v", err)
+	}
+	var m Manifest
+	if err := json.Unmarshal(mb, &m); err != nil {
+		t.Fatalf("parse manifest: %v", err)
+	}
+	if len(m.Repos) != 2 {
+		t.Fatalf("manifest should have 2 repos, got %d", len(m.Repos))
+	}
+	for _, r := range m.Repos {
+		if r.Status != "ok" {
+			t.Errorf("repo %s status=%s err=%s", r.Slug, r.Status, r.Error)
+		}
+	}
+
+	// Consolidated load + breakdown
+	ds, err := stats.LoadMultiJSONL(res.JSONLs)
+	if err != nil {
+		t.Fatalf("LoadMultiJSONL: %v", err)
+	}
+	breakdown := stats.RepoBreakdown(ds, "")
+	if len(breakdown) != 2 {
+		t.Fatalf("expected breakdown across 2 repos, got %d: %+v", len(breakdown), breakdown)
+	}
+	got := map[string]int{}
+	for _, b := range breakdown {
+		got[b.Repo] = b.Commits
+	}
+	for _, name := range []string{"alpha", "beta"} {
+		if got[name] == 0 {
+			t.Errorf("repo %s missing or has 0 commits in breakdown: %v", name, got)
+		}
+	}
+}
+
+// makeRealRepo initializes a git repo with the given files and a single
+// commit using a deterministic identity so the assertions don't depend on
+// the developer's git config.
+func makeRealRepo(t *testing.T, dir string, files map[string]string) {
+	t.Helper()
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	for name, contents := range files {
+		full := filepath.Join(dir, name)
+		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
+			t.Fatal(err)
+		}
+		if err := os.WriteFile(full, []byte(contents), 0o644); err != nil {
+			t.Fatal(err)
+		}
+	}
+
+	runGit := func(args ...string) {
+		cmd := exec.Command("git", args...)
+		cmd.Dir = dir
+		cmd.Env = append(os.Environ(),
+			"GIT_AUTHOR_NAME=Test",
+			"GIT_AUTHOR_EMAIL=test@example.com",
+			"GIT_COMMITTER_NAME=Test",
+			"GIT_COMMITTER_EMAIL=test@example.com",
+		)
+		out, err := cmd.CombinedOutput()
+		if err != nil {
+			t.Fatalf("git %s: %v\n%s", strings.Join(args, " "), err, out)
+		}
+	}
+	runGit("init", "-q", "-b", "main")
+	runGit("add", ".")
+	runGit("commit", "-q", "-m", "initial")
+}
+
+func writeIgnoreFile(t *testing.T, root, contents string) string {
+	t.Helper()
+	path := filepath.Join(root, ".gitcortex-ignore")
+	if err := os.WriteFile(path, []byte(contents), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	return path
+}
diff --git a/internal/stats/reader.go b/internal/stats/reader.go
index 87fc4ef..4162249 100644
--- a/internal/stats/reader.go
+++ b/internal/stats/reader.go
@@ -22,6 +22,11 @@ type commitEntry struct {
 	del     int64
 	files   int
 	message string
+	// repo carries the LoadMultiJSONL pathPrefix (without the trailing
+	// colon) so RepoBreakdown can attribute commits back to their source
+	// repository on multi-repo scans. Empty for single-file loads, which
+	// keeps single-repo callers' behavior unchanged.
+	repo string
 }
 
 type fileEntry struct {
@@ -82,6 +87,17 @@ type Dataset struct {
 	files    map[string]*fileEntry
 	workGrid [7][24]int
 
+	// commitsByRepo preserves commit records per repository so the
+	// per-repo breakdown stays correct when two repos share a SHA
+	// (forks, mirrors, cherry-picks between sibling projects). The
+	// commits map above is keyed by SHA and drops the losing side of
+	// a collision — acceptable for file-side lookups where a single
+	// record per SHA is enough — but RepoBreakdown would then
+	// silently under-count the overwritten repo's commits and churn.
+	// Populated at ingest in lockstep with commits; a single-repo
+	// load (no pathPrefix) keeps all entries under the "(repo)" key.
+	commitsByRepo map[string][]*commitEntry
+
 	// Coupling (pre-aggregated during load)
 	couplingPairs       map[filePair]int
 	couplingFileChanges map[string]int
@@ -158,6 +174,7 @@ func newDataset() *Dataset {
 		contribFiles: make(map[string]map[string]struct{}),
 		contribFirst: make(map[string]time.Time),
 		contribLast:  make(map[string]time.Time),
+		commitsByRepo: make(map[string][]*commitEntry),
 	}
 }
 
@@ -235,14 +252,25 @@ func streamLoadInto(ds *Dataset, r io.Reader, opt LoadOptions, pathPrefix string
 	ds.TotalDeletions += c.Deletions
 	ds.TotalFilesChanged += int64(c.FilesChanged)
 
-	ds.commits[c.SHA] = &commitEntry{
+	entry := &commitEntry{
 		email:   c.AuthorEmail,
 		date:    t,
 		add:     c.Additions,
 		del:     c.Deletions,
 		files:   c.FilesChanged,
 		message: c.Message,
+		repo:    strings.TrimSuffix(pathPrefix, ":"),
+	}
+	ds.commits[c.SHA] = entry
+	// Record separately per-repo so RepoBreakdown survives
+	// SHA collisions across repositories (forks/mirrors/
+	// cherry-picks). Key defaults to "(repo)" in single-repo
+	// loads so the fallback row in RepoBreakdown still works.
+	repoKey := entry.repo
+	if repoKey == "" {
+		repoKey = "(repo)"
 	}
+	ds.commitsByRepo[repoKey] = append(ds.commitsByRepo[repoKey], entry)
 
 	// Contributors
 	cs, ok := ds.contributors[c.AuthorEmail]
diff --git a/internal/stats/repo_breakdown.go b/internal/stats/repo_breakdown.go
new file mode 100644
index 0000000..57822e8
--- /dev/null
+++ b/internal/stats/repo_breakdown.go
@@ -0,0 +1,188 @@
+package stats
+
+import (
+	"sort"
+	"strings"
+	"time"
+)
+
+// RepoStat aggregates commit-level activity for one repository in a
+// multi-repo Dataset. Populated only when LoadMultiJSONL ran with more
+// than one input file (single-repo loads have no path prefix to group by,
+// so the breakdown would be a no-op).
+type RepoStat struct {
+	Repo            string
+	Commits         int
+	Additions       int64
+	Deletions       int64
+	Churn           int64
+	Files           int
+	ActiveDays      int
+	UniqueDevs      int
+	FirstCommitDate string
+	LastCommitDate  string
+	// PctOfTotalCommits / PctOfTotalChurn are computed against the
+	// dataset totals so the report can show "this repo was 38% of
+	// your commits" without the template needing to do the math.
+	PctOfTotalCommits float64
+	PctOfTotalChurn   float64
+}
+
+// RepoBreakdown groups commits by repo and returns one row per repo,
+// sorted by commit count descending. When emailFilter is non-empty,
+// only commits authored by that email are counted — this is the
+// "show my work across repos" lens used by the scan report.
+//
+// Source of truth is ds.commitsByRepo, a per-repository slice
+// populated at ingest. The SHA-keyed ds.commits map cannot be used
+// directly because two repos that share a commit SHA (fork, mirror,
+// cherry-pick) would lose the earlier ingest to the later one,
+// under-counting one repo's totals. Iterating commitsByRepo sidesteps
+// that collision.
+//
+// Fallback: test datasets that hand-build ds.commits without
+// populating commitsByRepo still work — we detect the empty map
+// and bucket ds.commits by repo as before. Production ingest always
+// populates commitsByRepo.
+//
+// On single-repo datasets (no prefix tag), the function returns a
+// single row labeled "(repo)" so callers can render without a special
+// case.
+func RepoBreakdown(ds *Dataset, emailFilter string) []RepoStat {
+	type acc struct {
+		commits     int
+		add, del    int64
+		days        map[string]struct{}
+		devs        map[string]struct{}
+		first, last time.Time
+	}
+
+	repos := make(map[string]*acc)
+	emailLower := strings.ToLower(strings.TrimSpace(emailFilter))
+
+	visit := func(key string, c *commitEntry) {
+		if emailLower != "" && strings.ToLower(strings.TrimSpace(c.email)) != emailLower {
+			return
+		}
+		a, ok := repos[key]
+		if !ok {
+			a = &acc{
+				days: make(map[string]struct{}),
+				devs: make(map[string]struct{}),
+			}
+			repos[key] = a
+		}
+		a.commits++
+		a.add += c.add
+		a.del += c.del
+		if c.email != "" {
+			a.devs[strings.ToLower(c.email)] = struct{}{}
+		}
+		if !c.date.IsZero() {
+			a.days[c.date.UTC().Format("2006-01-02")] = struct{}{}
+			if a.first.IsZero() || c.date.Before(a.first) {
+				a.first = c.date
+			}
+			if c.date.After(a.last) {
+				a.last = c.date
+			}
+		}
+	}
+
+	if len(ds.commitsByRepo) > 0 {
+		for repo, entries := range ds.commitsByRepo {
+			for _, c := range entries {
+				visit(repo, c)
+			}
+		}
+	} else {
+		// Fallback for hand-built test datasets that populate
+		// ds.commits directly. Production ingest always goes through
+		// streamLoadInto and populates commitsByRepo.
+		for _, c := range ds.commits {
+			key := c.repo
+			if key == "" {
+				key = "(repo)"
+			}
+			visit(key, c)
+		}
+	}
+
+	// File counts come from ds.files. Path prefix is `:` so
+	// the split key matches the commit-side `c.repo`. Files without a
+	// prefix (single-repo loads) bucket under the same "(repo)" label.
+	//
+	// When emailFilter is set the count must reflect files THIS dev
+	// touched, not the repo's total — the unfiltered count would
+	// massively over-state per-dev scope (e.g. "I worked on 12k files
+	// in monorepo X" when really the dev touched 30). devCommits
+	// tracks per-file dev appearance and is the right denominator.
+	fileCounts := make(map[string]int)
+	for path, fe := range ds.files {
+		if emailLower != "" {
+			matched := false
+			for devEmail := range fe.devCommits {
+				if strings.ToLower(strings.TrimSpace(devEmail)) == emailLower {
+					matched = true
+					break
+				}
+			}
+			if !matched {
+				continue
+			}
+		}
+		key := ""
+		if i := strings.Index(path, ":"); i >= 0 {
+			key = path[:i]
+		}
+		if key == "" {
+			key = "(repo)"
+		}
+		fileCounts[key]++
+	}
+
+	// Totals computed AFTER the email filter: percentages are relative
+	// to the filtered universe so a dev profile shows "60% of MY commits
+	// happened in repo X", not "0.4% of the org's commits".
+	var totalCommits int
+	var totalChurn int64
+	for _, a := range repos {
+		totalCommits += a.commits
+		totalChurn += a.add + a.del
+	}
+
+	out := make([]RepoStat, 0, len(repos))
+	for repo, a := range repos {
+		stat := RepoStat{
+			Repo:       repo,
+			Commits:    a.commits,
+			Additions:  a.add,
+			Deletions:  a.del,
+			Churn:      a.add + a.del,
+			Files:      fileCounts[repo],
+			ActiveDays: len(a.days),
+			UniqueDevs: len(a.devs),
+		}
+		if !a.first.IsZero() {
+			stat.FirstCommitDate = a.first.UTC().Format("2006-01-02")
+		}
+		if !a.last.IsZero() {
+			stat.LastCommitDate = a.last.UTC().Format("2006-01-02")
+		}
+		if totalCommits > 0 {
+			stat.PctOfTotalCommits = float64(stat.Commits) / float64(totalCommits) * 100
+		}
+		if totalChurn > 0 {
+			stat.PctOfTotalChurn = float64(stat.Churn) / float64(totalChurn) * 100
+		}
+		out = append(out, stat)
+	}
+
+	sort.Slice(out, func(i, j int) bool {
+		if out[i].Commits != out[j].Commits {
+			return out[i].Commits > out[j].Commits
+		}
+		return out[i].Repo < out[j].Repo
+	})
+	return out
+}
diff --git a/internal/stats/repo_breakdown_test.go b/internal/stats/repo_breakdown_test.go
new file mode 100644
index 0000000..404a563
--- /dev/null
+++ b/internal/stats/repo_breakdown_test.go
@@ -0,0 +1,177 @@
+package stats
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+	"time"
+)
+
+func TestRepoBreakdown_GroupsByPrefix(t *testing.T) {
+	ds := &Dataset{
+		commits: map[string]*commitEntry{
+			"a1": {email: "me@x.com", repo: "alpha", date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 10, del: 0},
+			"a2": {email: "me@x.com", repo: "alpha", date: time.Date(2024, 1, 5, 0, 0, 0, 0, time.UTC), add: 5, del: 2},
+			"b1": {email: "you@x.com", repo: "beta", date: time.Date(2024, 2, 1, 0, 0, 0, 0, time.UTC), add: 30, del: 0},
+		},
+		files: map[string]*fileEntry{
+			"alpha:main.go":   {},
+			"alpha:helper.go": {},
+			"beta:app.py":     {},
+		},
+	}
+
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 2 {
+		t.Fatalf("want 2 repos, got %d", len(breakdown))
+	}
+	// Sort is by commit count desc, so alpha first.
+	if breakdown[0].Repo != "alpha" || breakdown[0].Commits != 2 {
+		t.Errorf("alpha: got %+v", breakdown[0])
+	}
+	if breakdown[0].Files != 2 {
+		t.Errorf("alpha files: got %d, want 2", breakdown[0].Files)
+	}
+	if breakdown[1].Repo != "beta" || breakdown[1].Commits != 1 {
+		t.Errorf("beta: got %+v", breakdown[1])
+	}
+	// Pct totals — alpha 2/3, beta 1/3.
+	if got := breakdown[0].PctOfTotalCommits; got < 66.6 || got > 66.7 {
+		t.Errorf("alpha pct commits: got %.2f, want ~66.67", got)
+	}
+}
+
+func TestRepoBreakdown_EmailFilter(t *testing.T) {
+	ds := &Dataset{
+		commits: map[string]*commitEntry{
+			"a1": {email: "me@x.com", repo: "alpha", date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 10},
+			"a2": {email: "other@x.com", repo: "alpha", date: time.Date(2024, 1, 2, 0, 0, 0, 0, time.UTC), add: 5},
+			"b1": {email: "me@x.com", repo: "beta", date: time.Date(2024, 2, 1, 0, 0, 0, 0, time.UTC), add: 30},
+		},
+		// Per-dev file counts: alpha has 2 files but me@ touched only 1;
+		// beta has 1 file me@ touched. Other dev's exclusive file in
+		// alpha must NOT be counted toward me@'s repo Files.
+		files: map[string]*fileEntry{
+			"alpha:mine.go":  {devCommits: map[string]int{"me@x.com": 1}},
+			"alpha:other.go": {devCommits: map[string]int{"other@x.com": 1}},
+			"beta:both.py":   {devCommits: map[string]int{"me@x.com": 1}},
+		},
+	}
+	breakdown := RepoBreakdown(ds, "me@x.com")
+	if len(breakdown) != 2 {
+		t.Fatalf("want 2 repos for me@x.com, got %d", len(breakdown))
+	}
+	for _, b := range breakdown {
+		if b.Commits != 1 {
+			t.Errorf("repo %s: expected 1 commit (filtered), got %d", b.Repo, b.Commits)
+		}
+		if b.Files != 1 {
+			t.Errorf("repo %s: expected 1 file (filtered), got %d", b.Repo, b.Files)
+		}
+	}
+}
+
+// A rename within a repo collapses two file paths onto one canonical
+// path during finalizeDataset. The collapsed path keeps its `:`
+// prefix, so RepoBreakdown should still attribute the file to the
+// correct repo. Regression guard: without proper handling, the
+// post-rename canonical key would either lose its prefix (and bucket
+// into "(repo)") or duplicate-count.
+func TestRepoBreakdown_SurvivesRename(t *testing.T) {
+	ds := newDataset()
+	ds.commits["c1"] = &commitEntry{
+		email: "me@x.com", repo: "alpha",
+		date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 5,
+	}
+	ds.commits["c2"] = &commitEntry{
+		email: "me@x.com", repo: "alpha",
+		date: time.Date(2024, 1, 5, 0, 0, 0, 0, time.UTC), add: 10,
+	}
+	ds.files["alpha:old.go"] = &fileEntry{
+		devCommits: map[string]int{"me@x.com": 1},
+	}
+	ds.files["alpha:new.go"] = &fileEntry{
+		devCommits: map[string]int{"me@x.com": 1},
+	}
+	ds.renameEdges = []renameEdge{{
+		oldPath: "alpha:old.go", newPath: "alpha:new.go",
+		commitDate: time.Date(2024, 1, 5, 0, 0, 0, 0, time.UTC),
+	}}
+	applyRenames(ds)
+
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 1 || breakdown[0].Repo != "alpha" {
+		t.Fatalf("want single alpha row, got %+v", breakdown)
+	}
+	// After rename collapse: 1 canonical file, both commits attributed.
+	if breakdown[0].Files != 1 {
+		t.Errorf("want 1 file (post-rename), got %d", breakdown[0].Files)
+	}
+	if breakdown[0].Commits != 2 {
+		t.Errorf("want 2 commits, got %d", breakdown[0].Commits)
+	}
+}
+
+// Regression: when two repos share a commit SHA (fork / mirror /
+// cherry-pick) the ingest map ds.commits drops one side. Without a
+// separate per-repo record, RepoBreakdown would silently under-count
+// the repo ingested first and the consolidated report would show
+// wrong percentages. Must preserve both.
+func TestRepoBreakdown_SurvivesCrossRepoSHACollision(t *testing.T) {
+	dir := t.TempDir()
+	// Two repos with IDENTICAL commit rows — a distinct file row per
+	// repo so we can verify both contribute to the breakdown. In real
+	// life this maps to a cherry-picked or mirrored commit where the
+	// SHA matches but the files and stats land in different repos.
+	commitRow := `{"type":"commit","sha":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","author_date":"2024-01-01T00:00:00Z","author_email":"me@x.com","author_name":"Me","additions":10,"deletions":0,"files_changed":1}`
+	alphaFile := `{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"a.go","additions":10,"deletions":0}`
+	betaFile := `{"type":"commit_file","commit":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","path_current":"b.go","additions":10,"deletions":0}`
+
+	alpha := filepath.Join(dir, "alpha.jsonl")
+	beta := filepath.Join(dir, "beta.jsonl")
+	if err := os.WriteFile(alpha, []byte(commitRow+"\n"+alphaFile+"\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(beta, []byte(commitRow+"\n"+betaFile+"\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	ds, err := LoadMultiJSONL([]string{alpha, beta})
+	if err != nil {
+		t.Fatalf("LoadMultiJSONL: %v", err)
+	}
+
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 2 {
+		t.Fatalf("want breakdown across both alpha AND beta despite SHA collision, got %d rows: %+v", len(breakdown), breakdown)
+	}
+	got := map[string]int{}
+	for _, b := range breakdown {
+		got[b.Repo] = b.Commits
+	}
+	if got["alpha"] != 1 || got["beta"] != 1 {
+		t.Errorf("each repo should keep its commit; got alpha=%d beta=%d", got["alpha"], got["beta"])
+	}
+	// Sanity on percentage — each repo is 50% of the 2 collided commits.
+	for _, b := range breakdown {
+		if b.PctOfTotalCommits != 50 {
+			t.Errorf("repo %s: want 50%% commits, got %.1f", b.Repo, b.PctOfTotalCommits)
+		}
+	}
+}
+
+func TestRepoBreakdown_SingleRepoFallback(t *testing.T) {
+	// No prefix tag → all commits bucket under "(repo)".
+	ds := &Dataset{
+		commits: map[string]*commitEntry{
+			"x1": {email: "me@x.com", date: time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC), add: 5},
+		},
+		files: map[string]*fileEntry{
+			"src/foo.go": {},
+		},
+	}
+	breakdown := RepoBreakdown(ds, "")
+	if len(breakdown) != 1 || breakdown[0].Repo != "(repo)" {
+		t.Fatalf("want single (repo) row, got %+v", breakdown)
+	}
+}
diff --git a/internal/stats/stats.go b/internal/stats/stats.go
index a0b985c..331ac3a 100644
--- a/internal/stats/stats.go
+++ b/internal/stats/stats.go
@@ -239,6 +239,19 @@ func ComputeSummary(ds *Dataset) Summary {
 	return s
 }
 
+// DevEmails returns every author email in the dataset, unordered.
+// Cheaper than TopContributors(ds, 0) when the caller only needs
+// the email set (e.g. aggregating unique devs across multiple
+// repos): skips the per-contributor struct copy and the full sort
+// that TopContributors pays for ranking purposes.
+func DevEmails(ds *Dataset) []string {
+	out := make([]string, 0, len(ds.contributors))
+	for email := range ds.contributors {
+		out = append(out, email)
+	}
+	return out
+}
+
 func TopContributors(ds *Dataset, n int) []ContributorStat {
 	result := make([]ContributorStat, 0, len(ds.contributors))
 	for _, cs := range ds.contributors {
diff --git a/internal/stats/stats_test.go b/internal/stats/stats_test.go
index a22bd15..b9c418b 100644
--- a/internal/stats/stats_test.go
+++ b/internal/stats/stats_test.go
@@ -63,6 +63,30 @@ func makeDataset() *Dataset {
 	return ds
 }
 
+// TestDevEmails: the returned email set must match the dataset's
+// contributors exactly. DevEmails is used by scan's per-repo render
+// loop to aggregate unique devs across repos without paying the full
+// contributor sort per repo.
+func TestDevEmails(t *testing.T) {
+	ds := makeDataset()
+	got := map[string]bool{}
+	for _, e := range DevEmails(ds) {
+		got[e] = true
+	}
+	want := map[string]bool{
+		"alice@test.com": true,
+		"bob@test.com":   true,
+	}
+	for email := range want {
+		if !got[email] {
+			t.Errorf("missing %q in DevEmails output: %v", email, got)
+		}
+	}
+	if len(got) != len(want) {
+		t.Errorf("unexpected extras in DevEmails: got %v, want %v", got, want)
+	}
+}
+
 func TestComputeSummary(t *testing.T) {
 	ds := makeDataset()
 	s := ComputeSummary(ds)
diff --git a/internal/stats/suspect.go b/internal/stats/suspect.go
index 33a4b5f..c12f6f4 100644
--- a/internal/stats/suspect.go
+++ b/internal/stats/suspect.go
@@ -142,11 +142,22 @@ func DetectSuspectFiles(ds *Dataset) ([]SuspectBucket, bool) {
 	for path, fe := range ds.files {
 		fileChurn := fe.additions + fe.deletions
 		totalChurn += fileChurn
+		// Multi-repo loads prefix every path with `:` so files
+		// from different repos can coexist in one Dataset. The
+		// pattern predicates assume bare repo-relative paths — they
+		// look for root-level `vendor/`, full-path `package-lock.json`,
+		// and similar shapes that fail when the string actually
+		// begins with `alpha:vendor/x.go` or equals
+		// `alpha:package-lock.json`. Strip the prefix once and feed
+		// the normalized form to both Match and Suggest; nested
+		// occurrences (e.g. `repo:pkg/vendor/x.go`) still work via
+		// the normal substring checks on the stripped path.
+		matchPath := stripRepoPrefix(path)
 		// A file might match multiple patterns (e.g. vendor/*.pb.go).
 		// Attribute it to the first match so totals don't double-count;
 		// the first-match wins keeps output predictable.
 		for _, pat := range defaultSuspectPatterns {
-			if !pat.Match(path) {
+			if !pat.Match(matchPath) {
 				continue
 			}
 			b, ok := buckets[pat.Glob]
@@ -159,7 +170,7 @@ func DetectSuspectFiles(ds *Dataset) ([]SuspectBucket, bool) {
 			b.Churn += fileChurn
 			suspectChurn += fileChurn
 			if pat.Suggest != nil {
-				suggSets[pat.Glob][pat.Suggest(path)] = struct{}{}
+				suggSets[pat.Glob][pat.Suggest(matchPath)] = struct{}{}
 			}
 			break
 		}
@@ -211,11 +222,23 @@ func ShellQuoteSingle(s string) string {
 // trimming suggestions to just the displayed subset would leave
 // unshown suspects untouched and cause the warning to persist after
 // the suggested fix.
+//
+// Multi-repo note: LoadMultiJSONL prefixes file paths with `:`
+// so every consumer (hotspots, directories, suspect detector) can
+// attribute paths back to their source. The suggestion strings
+// inherit that prefix from suggestDirGlob, but `extract --ignore` and
+// `scan --extract-ignore` both consume bare repo-relative globs —
+// neither understands the `:` notation. Strip it here so the
+// copy-pasted remediation command actually matches the offending
+// paths. Same-shaped globs from different repos collapse via the
+// dedup below (e.g. `dist/*` appearing under multiple repos becomes
+// one suggestion, which is what the user wants).
 func CollectAllSuggestions(buckets []SuspectBucket) []string {
 	seen := map[string]struct{}{}
 	var out []string
 	for _, b := range buckets {
 		for _, s := range b.Suggestions {
+			s = stripRepoPrefix(s)
 			if _, ok := seen[s]; ok {
 				continue
 			}
@@ -225,3 +248,21 @@ func CollectAllSuggestions(buckets []SuspectBucket) []string {
 	}
 	return out
 }
+
+// stripRepoPrefix removes a leading `:` from a suggestion glob
+// when present. Conservative: the prefix is only recognized when the
+// segment before the colon has no slash or glob metacharacters, which
+// matches the LoadMultiJSONL convention and avoids misinterpreting
+// legitimate colons inside paths (rare but possible on some
+// filesystems). Single-repo datasets have no prefix and pass through
+// unchanged.
+func stripRepoPrefix(s string) string {
+	i := strings.Index(s, ":")
+	if i < 0 {
+		return s
+	}
+	if strings.ContainsAny(s[:i], "/*?[") {
+		return s
+	}
+	return s[i+1:]
+}
diff --git a/internal/stats/suspect_test.go b/internal/stats/suspect_test.go
index a83ea2c..a9dfa7c 100644
--- a/internal/stats/suspect_test.go
+++ b/internal/stats/suspect_test.go
@@ -1,6 +1,7 @@
 package stats
 
 import (
+	"strings"
 	"testing"
 
 	"github.com/lex0c/gitcortex/internal/extract"
@@ -353,6 +354,109 @@ func TestCollectAllSuggestionsCoversUnshownBuckets(t *testing.T) {
 	}
 }
 
+// When paths come from LoadMultiJSONL they carry a `:` prefix.
+// Those prefixes leak into directory-glob suggestions and make the
+// copy-pasted `extract --ignore` / `scan --extract-ignore` command
+// a no-op on real repo paths (which never carry the prefix).
+// CollectAllSuggestions must strip the prefix and collapse duplicates
+// that only differed in which repo they came from.
+func TestCollectAllSuggestionsStripsRepoPrefix(t *testing.T) {
+	ds := &Dataset{
+		files: map[string]*fileEntry{
+			// Two repos with the same offending shape — each suggestion
+			// comes out with its repo prefix stripped.
+			"repoA:js/dist/a.min.js":   {additions: 400, deletions: 400},
+			"repoB:css/dist/b.min.css": {additions: 400, deletions: 400},
+			// A single-repo path stays untouched (no prefix).
+			"vendor/c.go": {additions: 400, deletions: 400},
+		},
+	}
+	buckets, _ := DetectSuspectFiles(ds)
+	got := CollectAllSuggestions(buckets)
+
+	for _, s := range got {
+		if strings.Contains(s, ":") {
+			t.Errorf("suggestion %q still carries a repo prefix — users can't copy-paste this into extract --ignore", s)
+		}
+	}
+	has := func(want string) bool {
+		for _, s := range got {
+			if s == want {
+				return true
+			}
+		}
+		return false
+	}
+	for _, want := range []string{"js/dist/*", "css/dist/*", "vendor/*"} {
+		if !has(want) {
+			t.Errorf("expected suggestion %q in %v", want, got)
+		}
+	}
+}
+
+func TestStripRepoPrefix(t *testing.T) {
+	cases := []struct{ in, want string }{
+		{"repoA:wp-includes/dist/*", "wp-includes/dist/*"},
+		{"dist/*", "dist/*"},     // no prefix
+		{"*.min.js", "*.min.js"}, // no prefix
+		{"vendor/*", "vendor/*"}, // no prefix (no colon)
+		{"src/weird:name.go", "src/weird:name.go"},     // colon after slash = not a prefix
+		{"glob*prefix:foo/bar", "glob*prefix:foo/bar"}, // metachar before colon = not a prefix
+	}
+	for _, c := range cases {
+		if got := stripRepoPrefix(c.in); got != c.want {
+			t.Errorf("stripRepoPrefix(%q) = %q, want %q", c.in, got, c.want)
+		}
+	}
+}
+
+// Regression: multi-repo loads prefix every file path with `:`,
+// and the pattern predicates (hasPathSegment, basenameEquals) compare
+// against the raw string. Root-level occurrences like
+// `alpha:vendor/x.go` and `alpha:package-lock.json` slipped past
+// detection because `strings.HasPrefix(p, "vendor/")` and
+// `p == "package-lock.json"` both fail when the `:` prefix is
+// present. Symptom in the wild: when stats consumed multiple --input
+// files from a scan, the suspect warning was suppressed and users kept
+// seeing inflated churn/bus-factor from generated content with no hint
+// to fix it.
+func TestDetectSuspectFiles_RootLevelUnderRepoPrefix(t *testing.T) {
+	ds := &Dataset{
+		files: map[string]*fileEntry{
+			// Root-level occurrences — the cases that previously went undetected.
+			"alpha:vendor/x.go":      {additions: 400, deletions: 400},
+			"beta:package-lock.json": {additions: 400, deletions: 400},
+			"beta:go.sum":            {additions: 400, deletions: 400},
+			// Nested occurrence — worked before the fix and must still work.
+			"alpha:pkg/vendor/deep.go": {additions: 100, deletions: 100},
+		},
+	}
+	buckets, worth := DetectSuspectFiles(ds)
+	if !worth {
+		t.Fatal("expected warning-worthy dataset; prefix-stripped matching should fire on root-level vendor/lockfiles")
+	}
+
+	got := map[string]bool{}
+	for _, b := range buckets {
+		got[b.Pattern.Glob] = true
+	}
+	for _, want := range []string{"vendor/*", "package-lock.json", "go.sum"} {
+		if !got[want] {
+			t.Errorf("pattern %q missing from detected buckets %v — root-level %q under repo prefix went undetected", want, got, want)
+		}
+	}
+
+	// Suggestions should come out clean (no `:`) so the
+	// copy-pasted --ignore command actually matches these paths
+	// next run.
+	suggestions := CollectAllSuggestions(buckets)
+	for _, s := range suggestions {
+		if len(s) > 0 && s[0] == '\'' {
+			t.Errorf("suggestion %q should not carry shell quotes here", s)
+		}
+	}
+}
+
 func TestSuspectSuggestionsMatchExtractShouldIgnore(t *testing.T) {
 	// Mix root-level and nested occurrences of every pattern class:
 	// directory segments (vendor, dist), suffix (*.min.js), basename