Sumatoshi-tech · dmytrogajewski · May 13, 2026 · Apr 2, 2026
diff --git a/.claude/commands/review-data.md b/.claude/commands/review-data.md
@@ -0,0 +1,91 @@
+---
+name: review-data
+description: Product/data analyst review of generated report data for analytics readiness and DWH suitability
+---
+
+# Role
+
+You are a senior product data analyst with 10+ years of experience in data warehousing (ClickHouse, Greenplum, BigQuery, Snowflake), analytics engineering (dbt), and building data products from semi-structured sources. You think in terms of fact tables, dimension tables, grain, cardinality, query patterns, and downstream BI consumption.
+
+You are NOT a software engineer. You do not care about Go code or implementation details. You care about the **data** — its shape, quality, completeness, and fitness for analytical workloads.
+
+# Task
+
+Review the data file at: $ARGUMENTS
+
+If no file path is provided, ask the user for one.
+
+# Analysis Framework
+
+## Phase 1: Schema Discovery
+
+Sample the file (first 50KB, last 10KB, and 2-3 random sections from the middle). Map out:
+
+- Top-level structure (array of objects? nested report? envelope?)
+- Every distinct entity type (functions, files, commits, authors, clone pairs, etc.)
+- Nesting depth and where arrays-of-objects live
+- Key fields, identifiers, foreign-key-like references between entities
+- Data types: strings, numerics, booleans, timestamps, enums, free-text
+
+Produce a **data catalog** — a flat table listing every field path, its type, cardinality estimate (low/medium/high/unique), and nullability.
+
+## Phase 2: Grain & Relationship Analysis
+
+For each entity type:
+
+- What is the **grain** (one row = what)?
+- What are the natural keys?
+- What are the relationships (1:1, 1:N, M:N) between entities?
+- Are relationships explicit (foreign keys) or implicit (shared field values)?
+- Is there a time dimension? What's the temporal grain?
+
+Draw an **entity-relationship summary** in text/ASCII.
+
+## Phase 3: Analytical Quality Assessment
+
+Score each dimension (1-5 stars) with justification:
+
+1. **Completeness** — Are there gaps, nulls, missing relationships?
+2. **Consistency** — Same entity named differently in different analyzers? Units mismatched?
+3. **Granularity** — Is the data at a useful grain or pre-aggregated into uselessness?
+4. **Denormalization** — Is it query-friendly or would ETL need to unnest/flatten heavily?
+5. **Cardinality** — Are there high-cardinality string fields that would explode dimension tables?
+6. **Temporal coverage** — Is time-series data present? At what resolution?
+7. **Identifiers** — Are entities consistently identifiable across analyzers?
+
+## Phase 4: DWH Suitability Assessment
+
+For ClickHouse / Greenplum / columnar DWH specifically:
+
+- **Ingestion**: Can this JSON be loaded as-is, or does it need pre-processing? How much ETL?
+- **Table design**: Propose a star/snowflake schema sketch (fact tables + dimensions)
+- **Partitioning strategy**: What would you partition by? (time? file path prefix? analyzer?)
+- **Sort keys / ORDER BY**: What query patterns does this data naturally support?
+- **Materialized views**: What pre-aggregations would be valuable?
+- **Estimated row counts**: From this sample, project table sizes at scale (e.g., for repos with 100K commits, 50K files)
+- **Compression**: Are there fields that compress well (low-cardinality enums) vs poorly (unique strings)?
+
+## Phase 5: Analytics Readiness Verdict
+
+Answer these questions directly:
+
+1. **Can a BI analyst build dashboards from this data without engineering help?** (Yes/No/With caveats)
+2. **What analytics questions can this data answer today?** (List top 10)
+3. **What analytics questions are tantalizingly close but not quite supported by the data?** (List gaps)
+4. **What's the single biggest structural problem for analytics consumption?**
+5. **If you had to ship a "code health dashboard" product from this data in 2 weeks, what would you cut/change?**
+
+## Phase 6: Recommendations
+
+Provide a prioritized list of changes (P0/P1/P2):
+
+- Schema changes that would make DWH loading trivial
+- Missing fields or identifiers that would unlock key analytics
+- Structural changes for better query performance
+- Data quality issues to fix at the source
+
+# Output Format
+
+Use clear section headers. Be opinionated — this is a review, not a neutral description. Use tables where they help. Quote specific field paths from the actual data. Call out both strengths and problems bluntly.
+
+If the file is too large to read fully, sample strategically and note what you sampled vs. what you extrapolated.
diff --git a/AGENTS.md b/AGENTS.md
@@ -426,12 +426,17 @@ analyzer.Analyze(ctx, nodes)
 - `pkg/alg/lru` - Generic LRU cache with optional Bloom pre-filter, cost-based eviction, and clone-on-insert
 - `pkg/alg` - Generic algorithms: `Range` (half-open interval), `Chunk` (range partitioning), `ForEachPair` (C(n,2) pairwise iteration), `Iterator[T]` (pull-based sequence with `Next()` + `Close()`, EOF signals end), `CollectN[T](iter, limit)` (drain up to limit items, 0 = unlimited), `TraverseTree[T any](root, children, visit)` (iterative pre-order DFS with explicit stack — generic tree traversal). FRD: specs/frds/FRD-20260310-iterator.md, specs/frds/FRD-20260310-traverse-tree.md
 - `pkg/alg/stats` - Core statistics: `Mean`, `MeanStdDev`, `Percentile`, `Median`, `Clamp[T]`, `Min[T]`, `Max[T]`, `Sum[T]`, `ToPercent`, `PercentMultiplier`, `Distribution[T]` (classify-and-count), `EMA` (exponential moving average), `ExceedsThreshold(observed, predicted, threshold)` (absolute relative divergence check). FRD: specs/frds/FRD-20260310-exceeds-threshold.md
+- `internal/analyzers/common/perfile_retainer.go` - Per-file report retention: `PerFileRetainer` embeddable struct with `SetPerFileMode(bool)`, `Retain(report)`, `PerFileResults() map[string]Report`. Extracts source file path from `TypedCollection.SourceFile` or legacy `_source_file` items, stores shallow clone. Embedded in all 5 static analyzer aggregators (complexity, comments, halstead, cohesion, imports). Zero-value is disabled. FRD: specs/frds/FRD-20260327-perfile-retainer.md
+- `internal/analyzers/analyze/perfile.go` - Per-file orchestration: `PerFileModeEnabled` interface for aggregator type-assertion, `PerFileEnricher` interface for JSON enrichment (avoids import cycles), `StaticService.PerFileResults()` getter, `extractPerFileResults` collects per-file reports from aggregators, `enrichWithPerFileData` injects files into JSON output via `PerFileEnricher`, `MakeRelativePath(filePath, rootPath)` for relative file paths. `StaticService.PerFile` bool enables per-file mode in `initAggregators()` and `AnalyzeFolder()`. FRDs: specs/frds/FRD-20260327-static-perfile-orchestration.md, specs/frds/FRD-20260327-json-perfile-emission.md
 - `pkg/alg/mapx` - Generic map/slice operations: `CloneFunc`, `CloneNested`, `MergeAdditive`, `MergeNestedAdditive` (two-level map additive merge; nil dst = no-op; empty inner maps skipped), `SortedKeys`, `Unique`, `SortAndLimit`, `BuildLookupSet` (slice → `map[T]struct{}` set), `EstimateMapSize[K,V](m, entryBytes)` (map memory estimation — `int64(len(m)) * int64(entryBytes)`). Use stdlib `maps.Clone` for shallow map copies; use stdlib `slices.Clone` for shallow slice copies. FRD: specs/frds/FRD-20260310-estimate-map-size.md
 - `pkg/persist` - Codec-based file persistence: `Codec` interface, `JSONCodec`, `GobCodec`, `SaveState`, `LoadState`, `Persister[T]`
 - `pkg/textutil` - Byte-level text utilities: `IsBinary`, `CountLines`, `BinarySniffLength`, `WriteJSON(w, v, pretty)` (JSON encoding with optional two-space indentation). FRD: specs/frds/FRD-20260310-writejson-helper.md
 
+**Content Analyzers:**
+- `internal/analyzers/composition/` - File composition analyzer: `ContentAnalyzer` implementation that classifies files by type (source, vendor, generated, docs, config, binary, image) using enry. Reports breakdown, percentages, and non-source file issues. Info-only score. Uses `filehistory.Classifier` for classification. FRD: specs/frds/FRD-20260404-static-composition-analyzer.md
+
 **Caching:**
-- `internal/cache` - LRU blob cache (thin wrapper over `pkg/alg/lru`), hash sets, generic blob cache
+- `internal/cache` - LRU blob cache (thin wrapper over `pkg/alg/lru`), hash sets, generic blob cache. Incremental analysis cache: `IncrementalMeta` struct, `Key(rootSHA, branch)` deterministic directory name, `WriteMeta`/`ReadMeta` atomic JSON persistence, `IsStale` root SHA validation, `ErrCacheNotFound`/`ErrCacheCorrupt` sentinel errors. FRD: specs/frds/FRD-20260328-incremental-cache-meta.md
 
 **Shared Utilities:**
 - `pkg/sigutil` - Signal-handling utilities: `SignalCleanupGuard` (SIGINT/SIGTERM + `sync.Once` idempotent cleanup + goroutine listener + deregistration on `Close`)
@@ -449,7 +454,11 @@ analyzer.Analyze(ctx, nodes)
 - `internal/analyzers/common/plotpage/builders.go` - Chart factories: `BuildBarChart`, `BuildLineChart`, `BuildPieChart(co, seriesName, data, radius)`. `BuildPieChart` handles 600x400 dimensions, bottom legend, themed labels. Used by cohesion, complexity, comments, halstead, couples
 - `internal/analyzers/analyze/record_reader.go` - Generic store readers: `ReadRecordsIfPresent[T](reader, kinds, kind)` and `ReadRecordIfPresent[T](reader, kinds, kind)`. Used by all 10 analyzer store_reader.go files
 - `internal/analyzers/analyze/record_writer.go` - Generic store writer: `WriteSliceKind[T](w, kind, records)`. Used by devs, anomaly, quality, sentiment, typos, file_history, couples store_writer.go
-- `internal/analyzers/analyze/typed_collection.go` - `TypedCollection` wrapper for deferred map conversion: `TypedCollection{Items, SourceFile, ToMaps}`, `ItemConverter` func type, `SourceFileKey` const, `MapSlice()` method. Per-file analyzers return `TypedCollection` instead of `[]map[string]any`; conversion deferred to serialization boundary. FRD: specs/frds/FRD-20260311-typed-report-items.md
+- `internal/analyzers/analyze/typed_collection.go` - `TypedCollection` wrapper for deferred map conversion: `TypedCollection{Items, SourceFile, Language, Directory, ToMaps}`, `ItemConverter` func type, `SourceFileKey`/`LanguageKey`/`DirectoryKey` consts, `MapSlice()` method. Per-file analyzers return `TypedCollection` instead of `[]map[string]any`; conversion deferred to serialization boundary. `DetailedDataCollector.buildItems()` calls `stampCollectionMetadata()` to propagate Language and Directory to converted maps. FRD: specs/frds/FRD-20260311-typed-report-items.md
+- `internal/analyzers/analyze/metadata.go` - `AnalysisMetadata` struct (`RepoPath`, `RepoName`, `AnalyzedAt`, `CodefangVersion`), `NewAnalysisMetadata(repoPath)` constructor. Injected into `UnifiedModel.Metadata` after `DecodeCombinedBinaryReports`. FRD: specs/frds/FRD-20260408-output-metadata.md
+- `internal/analyzers/analyze/tick_bounds.go` - `TickBounds{StartTime, EndTime}` type with `FormatStartTime()`/`FormatEndTime()` (RFC 3339), `BuildTickBounds(ticks []TICK) map[int]TickBounds`. Used by all history analyzers to export tick timestamps. FRD: specs/frds/FRD-20260408-tick-timestamps.md
+- `internal/analyzers/analyze/schema_registry.go` - `FieldMeta{Type, Grain, Description}`, `AnalyzerSchema` (map alias), `SchemaForAnalyzer(id) AnalyzerSchema`. Static registry covering all 17 analyzers with type (list/aggregate/time_series/risk/scalar) and grain (function/file/tick/pair/developer). FRD: specs/frds/FRD-20260408-schema-manifest.md
+- `internal/identity/split.go` - `SplitIdentity(s string) (name, email string)`. Handles pipe-delimited (`"alice|alice@example.com"`), exact (`"alice <alice@example.com>"`), and plain name formats. Used by devs and couples analyzers. FRD: specs/frds/FRD-20260408-normalize-developer-identity.md
 - `internal/analyzers/analyze/analyzer.go` - Report helpers: `ReportFunctionList(report, key)` for single-key extraction (handles both `TypedCollection` and `[]map[string]any`), `ReportFunctionListWithFallback(report, primaryKey, fallbackKey)` for two-key fallback extraction. Used by complexity, halstead, cohesion, comments plot.go
 - `internal/analyzers/common/reportutil/reportutil.go` - Type-safe report accessors: `GetAs[T any](report, key) (T, bool)` (generic base, pure type assertion), `GetFloat64`/`GetInt` (safeconv coercion — handles cross-type), `GetString`/`GetStringSlice`/`GetStringIntMap`/`GetFunctions`/`MapString` (delegate to `GetAs`), `FormatInt`/`FormatFloat`/`FormatPercent`/`Pct`. `GetFunctions` handles `mapSlicer` interface (duck-typing for `TypedCollection` without import cycle). FRD: specs/frds/FRD-20260306-reportutil-getas.md