91 changes: 91 additions & 0 deletions .claude/commands/review-data.md
@@ -0,0 +1,91 @@
---
name: review-data
description: Product/data analyst review of generated report data for analytics readiness and DWH suitability
---

# Role

You are a senior product data analyst with 10+ years of experience in data warehousing (ClickHouse, Greenplum, BigQuery, Snowflake), analytics engineering (dbt), and building data products from semi-structured sources. You think in terms of fact tables, dimension tables, grain, cardinality, query patterns, and downstream BI consumption.

You are NOT a software engineer. You do not care about Go code or implementation details. You care about the **data** — its shape, quality, completeness, and fitness for analytical workloads.

# Task

Review the data file at: $ARGUMENTS

If no file path is provided, ask the user for one.

# Analysis Framework

## Phase 1: Schema Discovery

Sample the file (first 50KB, last 10KB, and 2-3 random sections from the middle). Map out:

- Top-level structure (array of objects? nested report? envelope?)
- Every distinct entity type (functions, files, commits, authors, clone pairs, etc.)
- Nesting depth and where arrays-of-objects live
- Key fields, identifiers, foreign-key-like references between entities
- Data types: strings, numerics, booleans, timestamps, enums, free-text

Produce a **data catalog** — a flat table listing every field path, its type, cardinality estimate (low/medium/high/unique), and nullability.
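As an illustration of the data catalog described above, here is a minimal Go sketch that walks a decoded JSON sample and records every field path with its observed types and nullability. The names `fieldInfo`, `walk`, `buildCatalog`, and the `$`-rooted path syntax are assumptions made for this example and are not part of the command or the repository. Cardinality estimation is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// fieldInfo accumulates what was observed for one field path (hypothetical type).
type fieldInfo struct {
	Types    map[string]bool // JSON scalar types seen at this path
	Nullable bool            // true if at least one null was seen
	Count    int             // number of scalar values observed
}

// jsonType names the JSON type of a scalar decoded by encoding/json.
func jsonType(v any) string {
	switch v.(type) {
	case string:
		return "string"
	case float64:
		return "number"
	case bool:
		return "boolean"
	default:
		return "unknown"
	}
}

// walk records every scalar field path under v. Array elements share the
// path "parent[]" so repeated objects collapse into one catalog row.
// Empty objects and empty arrays contribute no rows.
func walk(v any, path string, cat map[string]*fieldInfo) {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			walk(child, path+"."+k, cat)
		}
	case []any:
		for _, child := range t {
			walk(child, path+"[]", cat)
		}
	default:
		info := cat[path]
		if info == nil {
			info = &fieldInfo{Types: map[string]bool{}}
			cat[path] = info
		}
		info.Count++
		if v == nil {
			info.Nullable = true
			return
		}
		info.Types[jsonType(v)] = true
	}
}

// buildCatalog decodes a JSON sample and returns the catalog keyed by field path.
func buildCatalog(data []byte) (map[string]*fieldInfo, error) {
	var root any
	if err := json.Unmarshal(data, &root); err != nil {
		return nil, err
	}
	cat := map[string]*fieldInfo{}
	walk(root, "$", cat)
	return cat, nil
}

func main() {
	sample := []byte(`{"functions":[{"name":"a","complexity":3},{"name":null,"complexity":5}]}`)
	cat, err := buildCatalog(sample)
	if err != nil {
		panic(err)
	}
	paths := make([]string, 0, len(cat))
	for p := range cat {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	for _, p := range paths {
		types := make([]string, 0, len(cat[p].Types))
		for t := range cat[p].Types {
			types = append(types, t)
		}
		sort.Strings(types)
		fmt.Printf("%s\t%s\tnullable=%v\n", p, strings.Join(types, "|"), cat[p].Nullable)
	}
}
```

Collapsing array elements onto a single `[]` path keeps the catalog flat, which matches the "flat table" shape the prompt asks for. A real review would also need distinct-value counts per path to estimate cardinality.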

## Phase 2: Grain & Relationship Analysis

For each entity type:

- What is the **grain** (one row = what)?
- What are the natural keys?
- What are the relationships (1:1, 1:N, M:N) between entities?
- Are relationships explicit (foreign keys) or implicit (shared field values)?
- Is there a time dimension? What's the temporal grain?

Draw an **entity-relationship summary** in text/ASCII.

## Phase 3: Analytical Quality Assessment

Score each dimension (1-5 stars) with justification:

1. **Completeness** — Are there gaps, nulls, missing relationships?
2. **Consistency** — Same entity named differently in different analyzers? Units mismatched?
3. **Granularity** — Is the data at a useful grain or pre-aggregated into uselessness?
4. **Denormalization** — Is it query-friendly or would ETL need to unnest/flatten heavily?
5. **Cardinality** — Are there high-cardinality string fields that would explode dimension tables?
6. **Temporal coverage** — Is time-series data present? At what resolution?
7. **Identifiers** — Are entities consistently identifiable across analyzers?

## Phase 4: DWH Suitability Assessment

For ClickHouse / Greenplum / columnar DWH specifically:

- **Ingestion**: Can this JSON be loaded as-is, or does it need pre-processing? How much ETL?
- **Table design**: Propose a star/snowflake schema sketch (fact tables + dimensions)
- **Partitioning strategy**: What would you partition by? (time? file path prefix? analyzer?)
- **Sort keys / ORDER BY**: What query patterns does this data naturally support?
- **Materialized views**: What pre-aggregations would be valuable?
- **Estimated row counts**: From this sample, project table sizes at scale (e.g., for repos with 100K commits, 50K files)
- **Compression**: Are there fields that compress well (low-cardinality enums) vs poorly (unique strings)?

## Phase 5: Analytics Readiness Verdict

Answer these questions directly:

1. **Can a BI analyst build dashboards from this data without engineering help?** (Yes/No/With caveats)
2. **What analytics questions can this data answer today?** (List top 10)
3. **What analytics questions are tantalizingly close but the data doesn't quite support?** (List gaps)
4. **What's the single biggest structural problem for analytics consumption?**
5. **If you had to ship a "code health dashboard" product from this data in 2 weeks, what would you cut/change?**

## Phase 6: Recommendations

Provide a prioritized list of changes (P0/P1/P2):

- Schema changes that would make DWH loading trivial
- Missing fields or identifiers that would unlock key analytics
- Structural changes for better query performance
- Data quality issues to fix at the source

# Output Format

Use clear section headers. Be opinionated — this is a review, not a neutral description. Use tables where they help. Quote specific field paths from the actual data. Call out both strengths and problems bluntly.

If the file is too large to read fully, sample strategically and note what you sampled vs. what you extrapolated.
13 changes: 11 additions & 2 deletions AGENTS.md
@@ -426,12 +426,17 @@ analyzer.Analyze(ctx, nodes)
- `pkg/alg/lru` - Generic LRU cache with optional Bloom pre-filter, cost-based eviction, and clone-on-insert
- `pkg/alg` - Generic algorithms: `Range` (half-open interval), `Chunk` (range partitioning), `ForEachPair` (C(n,2) pairwise iteration), `Iterator[T]` (pull-based sequence with `Next()` + `Close()`, EOF signals end), `CollectN[T](iter, limit)` (drain up to limit items, 0 = unlimited), `TraverseTree[T any](root, children, visit)` (iterative pre-order DFS with explicit stack — generic tree traversal). FRD: specs/frds/FRD-20260310-iterator.md, specs/frds/FRD-20260310-traverse-tree.md
- `pkg/alg/stats` - Core statistics: `Mean`, `MeanStdDev`, `Percentile`, `Median`, `Clamp[T]`, `Min[T]`, `Max[T]`, `Sum[T]`, `ToPercent`, `PercentMultiplier`, `Distribution[T]` (classify-and-count), `EMA` (exponential moving average), `ExceedsThreshold(observed, predicted, threshold)` (absolute relative divergence check). FRD: specs/frds/FRD-20260310-exceeds-threshold.md
- `internal/analyzers/common/perfile_retainer.go` - Per-file report retention: `PerFileRetainer` embeddable struct with `SetPerFileMode(bool)`, `Retain(report)`, `PerFileResults() map[string]Report`. Extracts source file path from `TypedCollection.SourceFile` or legacy `_source_file` items, stores shallow clone. Embedded in all 5 static analyzer aggregators (complexity, comments, halstead, cohesion, imports). Zero-value is disabled. FRD: specs/frds/FRD-20260327-perfile-retainer.md
- `internal/analyzers/analyze/perfile.go` - Per-file orchestration: `PerFileModeEnabled` interface for aggregator type-assertion, `PerFileEnricher` interface for JSON enrichment (avoids import cycles), `StaticService.PerFileResults()` getter, `extractPerFileResults` collects per-file reports from aggregators, `enrichWithPerFileData` injects files into JSON output via `PerFileEnricher`, `MakeRelativePath(filePath, rootPath)` for relative file paths. `StaticService.PerFile` bool enables per-file mode in `initAggregators()` and `AnalyzeFolder()`. FRDs: specs/frds/FRD-20260327-static-perfile-orchestration.md, specs/frds/FRD-20260327-json-perfile-emission.md
- `pkg/alg/mapx` - Generic map/slice operations: `CloneFunc`, `CloneNested`, `MergeAdditive`, `MergeNestedAdditive` (two-level map additive merge; nil dst = no-op; empty inner maps skipped), `SortedKeys`, `Unique`, `SortAndLimit`, `BuildLookupSet` (slice → `map[T]struct{}` set), `EstimateMapSize[K,V](m, entryBytes)` (map memory estimation — `int64(len(m)) * int64(entryBytes)`). Use stdlib `maps.Clone` for shallow map copies; use stdlib `slices.Clone` for shallow slice copies. FRD: specs/frds/FRD-20260310-estimate-map-size.md
- `pkg/persist` - Codec-based file persistence: `Codec` interface, `JSONCodec`, `GobCodec`, `SaveState`, `LoadState`, `Persister[T]`
- `pkg/textutil` - Byte-level text utilities: `IsBinary`, `CountLines`, `BinarySniffLength`, `WriteJSON(w, v, pretty)` (JSON encoding with optional two-space indentation). FRD: specs/frds/FRD-20260310-writejson-helper.md
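To make the `pkg/alg/mapx` semantics listed above concrete, the following is a hedged sketch of how `MergeNestedAdditive` and `EstimateMapSize` might behave based only on their descriptions here (nil dst is a no-op, empty inner maps are skipped, size is `int64(len(m)) * int64(entryBytes)`). The `Number` constraint and the exact type-parameter lists are assumptions; the actual signatures in the repository may differ.

```go
package main

import "fmt"

// Number is a hypothetical constraint for additive values (assumption).
type Number interface {
	~int | ~int64 | ~float64
}

// MergeNestedAdditive adds every src[outer][inner] value into dst.
// A nil dst is a no-op, and empty inner maps in src are skipped so they
// do not create empty entries in dst.
func MergeNestedAdditive[K1, K2 comparable, V Number](dst, src map[K1]map[K2]V) {
	if dst == nil {
		return
	}
	for outer, inner := range src {
		if len(inner) == 0 {
			continue
		}
		target := dst[outer]
		if target == nil {
			target = make(map[K2]V, len(inner))
			dst[outer] = target
		}
		for k, v := range inner {
			target[k] += v
		}
	}
}

// EstimateMapSize returns a rough memory estimate: entry count times a
// caller-supplied per-entry byte cost.
func EstimateMapSize[K comparable, V any](m map[K]V, entryBytes int) int64 {
	return int64(len(m)) * int64(entryBytes)
}

func main() {
	dst := map[string]map[string]int{"alice": {"main.go": 2}}
	src := map[string]map[string]int{
		"alice": {"main.go": 3, "util.go": 1},
		"bob":   {},
	}
	MergeNestedAdditive(dst, src)
	fmt.Println(dst["alice"]["main.go"], dst["alice"]["util.go"], len(dst))
	fmt.Println(EstimateMapSize(dst["alice"], 48))
}
```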

**Content Analyzers:**
- `internal/analyzers/composition/` - File composition analyzer: `ContentAnalyzer` implementation that classifies files by type (source, vendor, generated, docs, config, binary, image) using enry. Reports breakdown, percentages, and non-source file issues. Info-only score. Uses `filehistory.Classifier` for classification. FRD: specs/frds/FRD-20260404-static-composition-analyzer.md

**Caching:**
- `internal/cache` - LRU blob cache (thin wrapper over `pkg/alg/lru`), hash sets, generic blob cache
- `internal/cache` - LRU blob cache (thin wrapper over `pkg/alg/lru`), hash sets, generic blob cache. Incremental analysis cache: `IncrementalMeta` struct, `Key(rootSHA, branch)` deterministic directory name, `WriteMeta`/`ReadMeta` atomic JSON persistence, `IsStale` root SHA validation, `ErrCacheNotFound`/`ErrCacheCorrupt` sentinel errors. FRD: specs/frds/FRD-20260328-incremental-cache-meta.md

**Shared Utilities:**
- `pkg/sigutil` - Signal-handling utilities: `SignalCleanupGuard` (SIGINT/SIGTERM + `sync.Once` idempotent cleanup + goroutine listener + deregistration on `Close`)
@@ -449,7 +454,11 @@ analyzer.Analyze(ctx, nodes)
- `internal/analyzers/common/plotpage/builders.go` - Chart factories: `BuildBarChart`, `BuildLineChart`, `BuildPieChart(co, seriesName, data, radius)`. `BuildPieChart` handles 600x400 dimensions, bottom legend, themed labels. Used by cohesion, complexity, comments, halstead, couples
- `internal/analyzers/analyze/record_reader.go` - Generic store readers: `ReadRecordsIfPresent[T](reader, kinds, kind)` and `ReadRecordIfPresent[T](reader, kinds, kind)`. Used by all 10 analyzer store_reader.go files
- `internal/analyzers/analyze/record_writer.go` - Generic store writer: `WriteSliceKind[T](w, kind, records)`. Used by devs, anomaly, quality, sentiment, typos, file_history, couples store_writer.go
- `internal/analyzers/analyze/typed_collection.go` - `TypedCollection` wrapper for deferred map conversion: `TypedCollection{Items, SourceFile, ToMaps}`, `ItemConverter` func type, `SourceFileKey` const, `MapSlice()` method. Per-file analyzers return `TypedCollection` instead of `[]map[string]any`; conversion deferred to serialization boundary. FRD: specs/frds/FRD-20260311-typed-report-items.md
- `internal/analyzers/analyze/typed_collection.go` - `TypedCollection` wrapper for deferred map conversion: `TypedCollection{Items, SourceFile, Language, Directory, ToMaps}`, `ItemConverter` func type, `SourceFileKey`/`LanguageKey`/`DirectoryKey` consts, `MapSlice()` method. Per-file analyzers return `TypedCollection` instead of `[]map[string]any`; conversion deferred to serialization boundary. `DetailedDataCollector.buildItems()` calls `stampCollectionMetadata()` to propagate Language and Directory to converted maps. FRD: specs/frds/FRD-20260311-typed-report-items.md
- `internal/analyzers/analyze/metadata.go` - `AnalysisMetadata` struct (`RepoPath`, `RepoName`, `AnalyzedAt`, `CodefangVersion`), `NewAnalysisMetadata(repoPath)` constructor. Injected into `UnifiedModel.Metadata` after `DecodeCombinedBinaryReports`. FRD: specs/frds/FRD-20260408-output-metadata.md
- `internal/analyzers/analyze/tick_bounds.go` - `TickBounds{StartTime, EndTime}` type with `FormatStartTime()`/`FormatEndTime()` (RFC 3339), `BuildTickBounds(ticks []TICK) map[int]TickBounds`. Used by all history analyzers to export tick timestamps. FRD: specs/frds/FRD-20260408-tick-timestamps.md
- `internal/analyzers/analyze/schema_registry.go` - `FieldMeta{Type, Grain, Description}`, `AnalyzerSchema` (map alias), `SchemaForAnalyzer(id) AnalyzerSchema`. Static registry covering all 17 analyzers with type (list/aggregate/time_series/risk/scalar) and grain (function/file/tick/pair/developer). FRD: specs/frds/FRD-20260408-schema-manifest.md
- `internal/identity/split.go` - `SplitIdentity(s string) (name, email string)`. Handles pipe-delimited (`"alice|alice@example.com"`), exact (`"alice <alice@example.com>"`), and plain name formats. Used by devs and couples analyzers. FRD: specs/frds/FRD-20260408-normalize-developer-identity.md
- `internal/analyzers/analyze/analyzer.go` - Report helpers: `ReportFunctionList(report, key)` for single-key extraction (handles both `TypedCollection` and `[]map[string]any`), `ReportFunctionListWithFallback(report, primaryKey, fallbackKey)` for two-key fallback extraction. Used by complexity, halstead, cohesion, comments plot.go
- `internal/analyzers/common/reportutil/reportutil.go` - Type-safe report accessors: `GetAs[T any](report, key) (T, bool)` (generic base, pure type assertion), `GetFloat64`/`GetInt` (safeconv coercion — handles cross-type), `GetString`/`GetStringSlice`/`GetStringIntMap`/`GetFunctions`/`MapString` (delegate to `GetAs`), `FormatInt`/`FormatFloat`/`FormatPercent`/`Pct`. `GetFunctions` handles `mapSlicer` interface (duck-typing for `TypedCollection` without import cycle). FRD: specs/frds/FRD-20260306-reportutil-getas.md
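The three input formats listed for `SplitIdentity` above can be illustrated with a short sketch. This is an assumption about the behavior based only on the description in this list, not the code in `internal/identity/split.go`; in particular, the whitespace trimming and the empty email for plain names are guesses.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitIdentity splits a developer identity string into name and email.
// Sketch only: the real implementation may normalize differently.
func SplitIdentity(s string) (name, email string) {
	s = strings.TrimSpace(s)

	// Pipe-delimited form: "alice|alice@example.com".
	if n, e, ok := strings.Cut(s, "|"); ok {
		return strings.TrimSpace(n), strings.TrimSpace(e)
	}

	// Angle-bracket form: "alice <alice@example.com>".
	open := strings.LastIndex(s, "<")
	if open >= 0 && strings.HasSuffix(s, ">") {
		return strings.TrimSpace(s[:open]), strings.TrimSpace(s[open+1 : len(s)-1])
	}

	// Plain name with no email (assumed to yield an empty email).
	return s, ""
}

func main() {
	for _, in := range []string{
		"alice|alice@example.com",
		"alice <alice@example.com>",
		"alice",
	} {
		name, email := SplitIdentity(in)
		fmt.Printf("%q -> name=%q email=%q\n", in, name, email)
	}
}
```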
