diff --git a/.claude/commands/review-data.md b/.claude/commands/review-data.md new file mode 100644 index 0000000..b5cfd52 --- /dev/null +++ b/.claude/commands/review-data.md @@ -0,0 +1,91 @@ +--- +name: review-data +description: Product/data analyst review of generated report data for analytics readiness and DWH suitability +--- + +# Role + +You are a senior product data analyst with 10+ years of experience in data warehousing (ClickHouse, Greenplum, BigQuery, Snowflake), analytics engineering (dbt), and building data products from semi-structured sources. You think in terms of fact tables, dimension tables, grain, cardinality, query patterns, and downstream BI consumption. + +You are NOT a software engineer. You do not care about Go code or implementation details. You care about the **data** — its shape, quality, completeness, and fitness for analytical workloads. + +# Task + +Review the data file at: $ARGUMENTS + +If no file path is provided, ask the user for one. + +# Analysis Framework + +## Phase 1: Schema Discovery + +Sample the file (first 50KB, last 10KB, and 2-3 random sections from the middle). Map out: + +- Top-level structure (array of objects? nested report? envelope?) +- Every distinct entity type (functions, files, commits, authors, clone pairs, etc.) +- Nesting depth and where arrays-of-objects live +- Key fields, identifiers, foreign-key-like references between entities +- Data types: strings, numerics, booleans, timestamps, enums, free-text + +Produce a **data catalog** — a flat table listing every field path, its type, cardinality estimate (low/medium/high/unique), and nullability. + +## Phase 2: Grain & Relationship Analysis + +For each entity type: + +- What is the **grain** (one row = what)? +- What are the natural keys? +- What are the relationships (1:1, 1:N, M:N) between entities? +- Are relationships explicit (foreign keys) or implicit (shared field values)? +- Is there a time dimension? What's the temporal grain? 
+ +Draw an **entity-relationship summary** in text/ASCII. + +## Phase 3: Analytical Quality Assessment + +Score each dimension (1-5 stars) with justification: + +1. **Completeness** — Are there gaps, nulls, missing relationships? +2. **Consistency** — Same entity named differently in different analyzers? Units mismatched? +3. **Granularity** — Is the data at a useful grain or pre-aggregated into uselessness? +4. **Denormalization** — Is it query-friendly or would ETL need to unnest/flatten heavily? +5. **Cardinality** — Are there high-cardinality string fields that would explode dimension tables? +6. **Temporal coverage** — Is time-series data present? At what resolution? +7. **Identifiers** — Are entities consistently identifiable across analyzers? + +## Phase 4: DWH Suitability Assessment + +For ClickHouse / Greenplum / columnar DWH specifically: + +- **Ingestion**: Can this JSON be loaded as-is, or does it need pre-processing? How much ETL? +- **Table design**: Propose a star/snowflake schema sketch (fact tables + dimensions) +- **Partitioning strategy**: What would you partition by? (time? file path prefix? analyzer?) +- **Sort keys / ORDER BY**: What query patterns does this data naturally support? +- **Materialized views**: What pre-aggregations would be valuable? +- **Estimated row counts**: From this sample, project table sizes at scale (e.g., for repos with 100K commits, 50K files) +- **Compression**: Are there fields that compress well (low-cardinality enums) vs poorly (unique strings)? + +## Phase 5: Analytics Readiness Verdict + +Answer these questions directly: + +1. **Can a BI analyst build dashboards from this data without engineering help?** (Yes/No/With caveats) +2. **What analytics questions can this data answer today?** (List top 10) +3. **What analytics questions are tantalizingly close but the data doesn't quite support?** (List gaps) +4. **What's the single biggest structural problem for analytics consumption?** +5. 
**If you had to ship a "code health dashboard" product from this data in 2 weeks, what would you cut/change?** + +## Phase 6: Recommendations + +Provide a prioritized list of changes (P0/P1/P2): + +- Schema changes that would make DWH loading trivial +- Missing fields or identifiers that would unlock key analytics +- Structural changes for better query performance +- Data quality issues to fix at the source + +# Output Format + +Use clear section headers. Be opinionated — this is a review, not a neutral description. Use tables where they help. Quote specific field paths from the actual data. Call out both strengths and problems bluntly. + +If the file is too large to read fully, sample strategically and note what you sampled vs. what you extrapolated. diff --git a/AGENTS.md b/AGENTS.md index d6ca8ee..967d72b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -426,12 +426,17 @@ analyzer.Analyze(ctx, nodes) - `pkg/alg/lru` - Generic LRU cache with optional Bloom pre-filter, cost-based eviction, and clone-on-insert - `pkg/alg` - Generic algorithms: `Range` (half-open interval), `Chunk` (range partitioning), `ForEachPair` (C(n,2) pairwise iteration), `Iterator[T]` (pull-based sequence with `Next()` + `Close()`, EOF signals end), `CollectN[T](iter, limit)` (drain up to limit items, 0 = unlimited), `TraverseTree[T any](root, children, visit)` (iterative pre-order DFS with explicit stack — generic tree traversal). FRD: specs/frds/FRD-20260310-iterator.md, specs/frds/FRD-20260310-traverse-tree.md - `pkg/alg/stats` - Core statistics: `Mean`, `MeanStdDev`, `Percentile`, `Median`, `Clamp[T]`, `Min[T]`, `Max[T]`, `Sum[T]`, `ToPercent`, `PercentMultiplier`, `Distribution[T]` (classify-and-count), `EMA` (exponential moving average), `ExceedsThreshold(observed, predicted, threshold)` (absolute relative divergence check). 
FRD: specs/frds/FRD-20260310-exceeds-threshold.md +- `internal/analyzers/common/perfile_retainer.go` - Per-file report retention: `PerFileRetainer` embeddable struct with `SetPerFileMode(bool)`, `Retain(report)`, `PerFileResults() map[string]Report`. Extracts source file path from `TypedCollection.SourceFile` or legacy `_source_file` items, stores shallow clone. Embedded in all 5 static analyzer aggregators (complexity, comments, halstead, cohesion, imports). Zero-value is disabled. FRD: specs/frds/FRD-20260327-perfile-retainer.md +- `internal/analyzers/analyze/perfile.go` - Per-file orchestration: `PerFileModeEnabled` interface for aggregator type-assertion, `PerFileEnricher` interface for JSON enrichment (avoids import cycles), `StaticService.PerFileResults()` getter, `extractPerFileResults` collects per-file reports from aggregators, `enrichWithPerFileData` injects files into JSON output via `PerFileEnricher`, `MakeRelativePath(filePath, rootPath)` for relative file paths. `StaticService.PerFile` bool enables per-file mode in `initAggregators()` and `AnalyzeFolder()`. FRDs: specs/frds/FRD-20260327-static-perfile-orchestration.md, specs/frds/FRD-20260327-json-perfile-emission.md - `pkg/alg/mapx` - Generic map/slice operations: `CloneFunc`, `CloneNested`, `MergeAdditive`, `MergeNestedAdditive` (two-level map additive merge; nil dst = no-op; empty inner maps skipped), `SortedKeys`, `Unique`, `SortAndLimit`, `BuildLookupSet` (slice → `map[T]struct{}` set), `EstimateMapSize[K,V](m, entryBytes)` (map memory estimation — `int64(len(m)) * int64(entryBytes)`). Use stdlib `maps.Clone` for shallow map copies; use stdlib `slices.Clone` for shallow slice copies. 
FRD: specs/frds/FRD-20260310-estimate-map-size.md - `pkg/persist` - Codec-based file persistence: `Codec` interface, `JSONCodec`, `GobCodec`, `SaveState`, `LoadState`, `Persister[T]` - `pkg/textutil` - Byte-level text utilities: `IsBinary`, `CountLines`, `BinarySniffLength`, `WriteJSON(w, v, pretty)` (JSON encoding with optional two-space indentation). FRD: specs/frds/FRD-20260310-writejson-helper.md +**Content Analyzers:** +- `internal/analyzers/composition/` - File composition analyzer: `ContentAnalyzer` implementation that classifies files by type (source, vendor, generated, docs, config, binary, image) using enry. Reports breakdown, percentages, and non-source file issues. Info-only score. Uses `filehistory.Classifier` for classification. FRD: specs/frds/FRD-20260404-static-composition-analyzer.md + **Caching:** -- `internal/cache` - LRU blob cache (thin wrapper over `pkg/alg/lru`), hash sets, generic blob cache +- `internal/cache` - LRU blob cache (thin wrapper over `pkg/alg/lru`), hash sets, generic blob cache. Incremental analysis cache: `IncrementalMeta` struct, `Key(rootSHA, branch)` deterministic directory name, `WriteMeta`/`ReadMeta` atomic JSON persistence, `IsStale` root SHA validation, `ErrCacheNotFound`/`ErrCacheCorrupt` sentinel errors. FRD: specs/frds/FRD-20260328-incremental-cache-meta.md **Shared Utilities:** - `pkg/sigutil` - Signal-handling utilities: `SignalCleanupGuard` (SIGINT/SIGTERM + `sync.Once` idempotent cleanup + goroutine listener + deregistration on `Close`) @@ -449,7 +454,11 @@ analyzer.Analyze(ctx, nodes) - `internal/analyzers/common/plotpage/builders.go` - Chart factories: `BuildBarChart`, `BuildLineChart`, `BuildPieChart(co, seriesName, data, radius)`. `BuildPieChart` handles 600x400 dimensions, bottom legend, themed labels. 
Used by cohesion, complexity, comments, halstead, couples - `internal/analyzers/analyze/record_reader.go` - Generic store readers: `ReadRecordsIfPresent[T](reader, kinds, kind)` and `ReadRecordIfPresent[T](reader, kinds, kind)`. Used by all 10 analyzer store_reader.go files - `internal/analyzers/analyze/record_writer.go` - Generic store writer: `WriteSliceKind[T](w, kind, records)`. Used by devs, anomaly, quality, sentiment, typos, file_history, couples store_writer.go -- `internal/analyzers/analyze/typed_collection.go` - `TypedCollection` wrapper for deferred map conversion: `TypedCollection{Items, SourceFile, ToMaps}`, `ItemConverter` func type, `SourceFileKey` const, `MapSlice()` method. Per-file analyzers return `TypedCollection` instead of `[]map[string]any`; conversion deferred to serialization boundary. FRD: specs/frds/FRD-20260311-typed-report-items.md +- `internal/analyzers/analyze/typed_collection.go` - `TypedCollection` wrapper for deferred map conversion: `TypedCollection{Items, SourceFile, Language, Directory, ToMaps}`, `ItemConverter` func type, `SourceFileKey`/`LanguageKey`/`DirectoryKey` consts, `MapSlice()` method. Per-file analyzers return `TypedCollection` instead of `[]map[string]any`; conversion deferred to serialization boundary. `DetailedDataCollector.buildItems()` calls `stampCollectionMetadata()` to propagate Language and Directory to converted maps. FRD: specs/frds/FRD-20260311-typed-report-items.md +- `internal/analyzers/analyze/metadata.go` - `AnalysisMetadata` struct (`RepoPath`, `RepoName`, `AnalyzedAt`, `CodefangVersion`), `NewAnalysisMetadata(repoPath)` constructor. Injected into `UnifiedModel.Metadata` after `DecodeCombinedBinaryReports`. FRD: specs/frds/FRD-20260408-output-metadata.md +- `internal/analyzers/analyze/tick_bounds.go` - `TickBounds{StartTime, EndTime}` type with `FormatStartTime()`/`FormatEndTime()` (RFC 3339), `BuildTickBounds(ticks []TICK) map[int]TickBounds`. Used by all history analyzers to export tick timestamps. 
FRD: specs/frds/FRD-20260408-tick-timestamps.md +- `internal/analyzers/analyze/schema_registry.go` - `FieldMeta{Type, Grain, Description}`, `AnalyzerSchema` (map alias), `SchemaForAnalyzer(id) AnalyzerSchema`. Static registry covering all 17 analyzers with type (list/aggregate/time_series/risk/scalar) and grain (function/file/tick/pair/developer). FRD: specs/frds/FRD-20260408-schema-manifest.md +- `internal/identity/split.go` - `SplitIdentity(s string) (name, email string)`. Handles pipe-delimited (`"alice|alice@example.com"`), exact (`"alice "`), and plain name formats. Used by devs and couples analyzers. FRD: specs/frds/FRD-20260408-normalize-developer-identity.md - `internal/analyzers/analyze/analyzer.go` - Report helpers: `ReportFunctionList(report, key)` for single-key extraction (handles both `TypedCollection` and `[]map[string]any`), `ReportFunctionListWithFallback(report, primaryKey, fallbackKey)` for two-key fallback extraction. Used by complexity, halstead, cohesion, comments plot.go - `internal/analyzers/common/reportutil/reportutil.go` - Type-safe report accessors: `GetAs[T any](report, key) (T, bool)` (generic base, pure type assertion), `GetFloat64`/`GetInt` (safeconv coercion — handles cross-type), `GetString`/`GetStringSlice`/`GetStringIntMap`/`GetFunctions`/`MapString` (delegate to `GetAs`), `FormatInt`/`FormatFloat`/`FormatPercent`/`Pct`. `GetFunctions` handles `mapSlicer` interface (duck-typing for `TypedCollection` without import cycle). FRD: specs/frds/FRD-20260306-reportutil-getas.md diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..5814392 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,393 @@ +# Changelog + +All notable changes to the Codefang project are documented in this file. +The format follows [Keep a Changelog](https://keepachangelog.com/). 
+ +--- + +## [Unreleased] — Repo hygiene & race fix + +### Fixed + +- **Race in `internal/framework.PipelineSampler`**: + `t1Captured` was a plain `bool` concurrently read by the sampler + goroutine (`sample`) and written by the caller (`CaptureT1`), + causing intermittent `DATA RACE` under `go test -race`. Converted + to `sync/atomic.Bool` with `CompareAndSwap` — at most one t1 heap + profile is captured regardless of which goroutine observes the + trigger first. Removed the unused `t0Captured` field. Full + `go test -race ./...` now green. + +### Chore + +- **Removed `// FRD: specs/frds/FRD-...md` comments from all `.go` + files.** `specs/` is gitignored, so those references broke for + anyone cloning the repo. Traceability now lives in FRDs and + PR descriptions instead of source code. + +--- + +## [Unreleased] — Cross-phase defaults: vendor & generated excluded + +**Breaking change.** Default analysis output across both phases +now **excludes vendor and generated files** — matching the +convention of every mature multi-language analyser (eslint skips +`node_modules/`, rubocop skips `vendor/`, pylint skips `.venv/`, +scalafix skips `target/`, phpcs skips `vendor/`, GitHub Linguist +excludes vendor/generated from its language breakdown). Users who +want the pre-2026-04 behaviour back pass `--include-vendored +--include-generated` in their invocation. + +### Flags (cross-phase) + +- `--include-vendored` (bool, default `false`) — re-include paths + detected as vendored by enry / Linguist. Covers `vendor/`, + `node_modules/`, `third_party/`, `testdata/`, `dist/`, + minified bundles, and more. Cross-language by construction. +- `--include-generated` (bool, default `false`) — re-include + auto-generated files. Covers `*.pb.go`, `zz_generated_*.go`, + `*_pb2.py`, `*.min.js`, and content-header markers + (`DO NOT EDIT`, `Code generated`, `@generated`, …). 
+- `--extra-excluded-prefixes` (strings, default `[]`) — additional + UNIX path prefixes to exclude, for ecosystems enry doesn't know + about (e.g. `.venv/`, `target/`, `.gradle/`). + +All three flags apply identically to both `-a 'static/*'` and `-a +'history/*'` runs — one flag set, one meaning. + +### Deprecated + +- `--skip-blacklist` — now a no-op (the new default already excludes + vendor and generated). Cobra deprecation warning fires when the + flag is passed. +- `--blacklisted-prefixes` — migrate to `--extra-excluded-prefixes` + (identical semantics). Cobra deprecation warning fires when the + flag is passed. + +Both will be removed in the next minor release. + +### Architecture + +New package `internal/analyzers/plumbing/pathpolicy` exposing a pure +`Exclude(path, content, opts) bool` backed by enry.IsVendor + +`pkg/pathfilter`'s content-aware generated-file detection. Both +phases call the same helper — single source of truth, no +phase-specific drift. + +### Measured impact (cross-language fixture, `-a static/complexity`) + +| Invocation | Total Functions | +| ------------------------------------------------- | --------------: | +| *(defaults)* | 1 | +| `--include-vendored` | 4 | +| `--include-vendored --include-generated` | 5 | + +--- + +## [Unreleased] — Cross-phase consistency for `--languages` + +**Motivation**: After the history-side push-down, `--languages` meant +different things depending on `-a 'history/*'` vs `-a 'static/*'`. Static +analysis silently ignored the flag — every UAST-supported file was parsed +and fed to every requested static analyzer regardless of the user's +preference. This release makes the flag cross-phase: one flag, one +meaning, both phases narrowed. + +### Changes + +- **`StaticService.LanguageGlobs`** — new field on the static service, + populated from `--languages` via the existing + `internal/analyzers/plumbing/langpath` single source of truth. Empty + disables the filter (default behavior unchanged). 
+- **Path-based walker hooks** — both `StaticService.streamFiles` (UAST + walker) and `StaticService.rawFilePhase` visit-check the basename + against the glob set via `matchesLanguageGlobs` before sending the + path downstream. Filtered files never reach the UAST parser or any + analyzer. +- **Runtime wiring** — `runStaticAnalyzers` and `runStaticPlotAnalyzers` + build the globs via a shared `applyStaticLanguageFilter` helper. + Unknown language tokens fail fast on static-only runs with the same + error shape as the history side. +- **Executor signatures** — `staticExecutor` and `staticPlotExecutor` + gain a `languages []string` parameter; test stubs updated + mechanically. + +### Non-goals + +- No content-aware post-pass on the static side (the UAST parser's + own language router is the final authority for matched files; a + second pass would duplicate work). +- No changes to the history side. + +--- + +## [Unreleased] — Performance: `--languages` filter push-down into libgit2 + +**Motivation**: The `--languages` flag used to be applied *after* libgit2 had +already produced a full tree diff. Every delta crossed the cgo boundary, was +materialised in Go, and only then dropped by the analyzer if its detected +language wasn't in the allow-list. On polyglot repositories with a narrow +filter, libgit2 was doing 4× the tree-diff work it needed to. + +### Changes + +- **New package `internal/analyzers/plumbing/langpath`** — pure Go + `Globs(langs []string) (globs []string, wantsAll bool, err error)` backed + by enry's generated Linguist dataset (`data.ExtensionsByLanguage` + + `data.LanguagesByFilename`). Single source of truth; 100 % test coverage. +- **New C ABI `cf_tree_diff_v2`** in `pkg/gitlib/clib/{codefang_git.h,diff_ops.c}` + accepts a pathspec array which it forwards to libgit2's + `git_diff_options.pathspec`. The old `cf_tree_diff` is retired in favour of + `cf_tree_diff_v2` via `CGOBridge.TreeDiffWithPathspec`. 
+- **`TreeDiffRequest.Pathspec` + `BlobPipeline.TreeDiffPathspec` + + `CoordinatorConfig.TreeDiffPathspec`** thread the pathspec from the + analyzer through the pipeline to every worker call. +- **`TreeDiffAnalyzer.Pathspec` + `applyLanguageConfig`** resolve aliases via + `enry.GetLanguageByAlias` (so `--languages golang` / `js` / `ts` now work, + not just canonical Linguist names) and pre-compute the pathspec at + `Configure` time. +- **Fail-fast on unknown languages**: `--languages notalang` now returns + `failed to configure TreeDiff: tree-diff pathspec: unknown language: "notalang"` + instead of silently producing an empty report. + +### Measured impact + +On a 500-commit × 200-file × 4-language synthetic fixture with +`--languages go`: + +| Metric | Before | After | Δ | +| --------------------------- | ------: | ------: | -----: | +| Wall time | 0.44 s | 0.29 s | −34 % | +| Max RSS | 74 MB | 66 MB | −11 % | +| `cgocall` cumulative CPU | 800 ms | 510 ms | −36 % | +| Unique functions in profile | 286 | 209 | −27 % | +| JSON report | — | — | byte-identical | + +Regression guard (no `--languages` filter): wall time 0.51 s → 0.49 s, +within noise. + +### Non-goals (for this changeset) + +- No new user flags. +- The Go-side `shouldIncludeChange` language filter remains as the precise + post-pass (pathspec is deliberately over-inclusive for + content-disambiguated extensions such as `.h`, `.pl`, `.m`, `.r`). + +--- + +## [Unreleased] — Analytics Readiness & DWH Suitability + +**Motivation**: A comprehensive data analyst review of Codefang's JSON output revealed that while the data was analytically rich (17 analyzers, 1M+ function-level rows, time-series, coupling data), it was structurally hostile to analytics tooling and DWH loading. Function records had bare names with no file paths, time-series ticks had no calendar dates, developer identities used pipe-delimited strings, and nested maps blocked efficient columnar ingestion. 
This release systematically fixes every identified blocker, raising the data quality score from **2.1/5 to 4.6/5**. + +### Architecture: Pipeline Stage Refactor + +#### `RawFileAnalyzer` and `FormattableAnalyzer` interfaces + +Replaced the `FileContentAnalyzer` + `WalksAllFiles` marker interface pattern with a proper pipeline stage architecture. + +**Before**: Analyzers that needed raw file access (not UAST) had to implement `StaticAnalyzer` with a no-op `Analyze(*node.Node)`, plus two marker interfaces discovered at runtime via type assertions. + +**After**: Two clean interface hierarchies — `StaticAnalyzer` for UAST-based analysis and `RawFileAnalyzer` for raw file analysis — both embed a shared `FormattableAnalyzer` base. `StaticService` holds separate slices. `AnalyzeFolder` uses `pipeline.RunPhases` with explicit `rawFilePhase` and `uastPhase` stages. + +**Why it matters for BI**: The pipeline refactor enabled `StampSourceFile` to receive `rootPath` and convert all file paths to relative — a prerequisite for portable DWH data. It also enabled `StampLanguage` to inject detected language into every function record. 
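The two-hierarchy layout described above can be pictured with plain Go interface embedding. This is a minimal sketch, not Codefang's actual code: `Formattable`, `UASTStage`, `RawStage`, `service`, and every method name below are hypothetical stand-ins for `FormattableAnalyzer`, `StaticAnalyzer`, `RawFileAnalyzer`, and `StaticService`, whose real method sets are not shown in this changeset.

```go
package main

import (
	"bytes"
	"fmt"
)

// Formattable stands in for the shared FormattableAnalyzer base.
// The Name method is a hypothetical placeholder.
type Formattable interface {
	Name() string
}

// UASTStage stands in for StaticAnalyzer (UAST-based analysis).
// AnalyzeTree is hypothetical; the real method works on a UAST node.
type UASTStage interface {
	Formattable
	AnalyzeTree(root string) int
}

// RawStage stands in for RawFileAnalyzer (raw file analysis).
// AnalyzeFile is a hypothetical signature.
type RawStage interface {
	Formattable
	AnalyzeFile(path string, content []byte) int
}

// lineCounter is a toy raw-file analyzer: it needs no no-op tree method.
type lineCounter struct{}

func (lineCounter) Name() string { return "lines" }

func (lineCounter) AnalyzeFile(_ string, content []byte) int {
	return bytes.Count(content, []byte("\n"))
}

// service keeps the two analyzer kinds in separate slices, so the
// stage a given analyzer belongs to is known without type assertions.
type service struct {
	uast []UASTStage
	raw  []RawStage
}

// formattable returns every analyzer through the shared base interface.
func (s service) formattable() []Formattable {
	out := make([]Formattable, 0, len(s.uast)+len(s.raw))
	for _, a := range s.uast {
		out = append(out, a)
	}
	for _, a := range s.raw {
		out = append(out, a)
	}
	return out
}

func main() {
	s := service{raw: []RawStage{lineCounter{}}}
	for _, a := range s.formattable() {
		fmt.Println(a.Name())
	}
}
```

Because both stage interfaces embed the same base, code that only formats reports can accept a single slice of the base type, which is presumably why `PerFileEnricher` moved to `[]FormattableAnalyzer`.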
+ +**Files changed**: +- `internal/analyzers/analyze/analyzer.go` — new `FormattableAnalyzer`, `RawFileAnalyzer` interfaces; `StaticAnalyzer` refactored to embed `FormattableAnalyzer` +- `internal/analyzers/analyze/static.go` — `StaticService` gains `UASTAnalyzers` + `RawFileAnalyzers` slices; `AnalyzeFolder` uses `pipeline.RunPhases` +- `internal/analyzers/composition/analyzer.go` — implements `RawFileAnalyzer` directly (removed no-op `Analyze`, `NeedsAllFiles`) +- `internal/analyzers/analyze/registry.go` — `NewRegistry` accepts three slices +- `cmd/codefang/commands/run.go` — split `defaultStaticAnalyzers` into `defaultUASTAnalyzers` + `defaultRawFileAnalyzers` +- `internal/analyzers/analyze/perfile.go` — `PerFileEnricher` uses `[]FormattableAnalyzer` +- `internal/analyzers/common/renderer/json.go` — `EnrichWithPerFileData` uses `[]FormattableAnalyzer` + +--- + +### Static Analyzers: New Fields on Every Function Record + +#### `source_file` — File path on every function record + +**Motivation**: 152,000+ function records in the JSON output had bare names like `"ForKind"` with no indication of which file they belonged to. This made it impossible to join function metrics to file-level data, build file heatmaps, or drill down from "bad function" to "where in the repo." + +**Root cause**: The `_source_file` stamping mechanism existed and worked through aggregation, but `FormatReportBinary` called `ComputeAllMetrics` which parsed `[]map[string]any` items into typed structs. Those structs had no `SourceFile` field, silently dropping the value during struct conversion. + +**Fix**: Added `SourceFile string` to all input `FunctionData` and output data structs (`FunctionComplexityData`, `FunctionHalsteadData`, `FunctionCohesionData`, all comment data structs, `HighRiskFunctionData`, `HighEffortFunctionData`, `LowCohesionFunctionData`, `UndocumentedFunctionData`). Populated from `_source_file` map key during `parseFunctionData` → `Compute()`. 
Updated `StampSourceFile` to accept `rootPath` and convert to relative via `MakeRelativePath`. + +**JSON output key**: `"source_file"` (relative path, e.g., `"pkg/kubelet/kubelet.go"`) + +**Analyzers affected**: `static/complexity`, `static/halstead`, `static/cohesion`, `static/comments` + +#### `language` — Programming language on every function record + +**Motivation**: Analysts had to infer language from file extension at query time. The parser already knows the language. + +**Fix**: Added `LanguageKey` constant, `StampLanguage()` function, and `Language` field to `TypedCollection` struct. Language is stamped in `analyzeFilesParallel` via `parser.GetLanguage(filePath)` and propagated through `TypedCollection` → `DetailedDataCollector.buildItems()` → `stampCollectionMetadata()` to reach the output structs. + +**JSON output key**: `"language"` (e.g., `"go"`, `"bash"`) + +**Analyzers affected**: `static/complexity`, `static/halstead`, `static/cohesion`, `static/comments` + +#### `directory` — Parent directory on every function record + +**Motivation**: Directory-level aggregation (e.g., "which package has worst complexity") requires parsing file paths at query time, which is expensive in columnar DWH. + +**Fix**: Added `DirectoryKey` constant and `Directory` field to `TypedCollection`. Stamped as `filepath.Dir(relativePath)` inside `StampSourceFile`. Propagated via `stampCollectionMetadata()` alongside language. + +**JSON output key**: `"directory"` (e.g., `"pkg/kubelet"`) + +**Analyzers affected**: `static/complexity`, `static/halstead`, `static/cohesion`, `static/comments` + +--- + +### History Analyzers: Tick Timestamps + +#### `start_time` / `end_time` on every time-series tick + +**Motivation**: All 6 history time-series analyzers emitted `tick: ` with no calendar date. Every time-series chart had an unlabeled X-axis. The `TICK` struct already carried `StartTime`/`EndTime` internally but didn't export them. 
+ +**Fix**: Created `TickBounds` type and `BuildTickBounds(ticks []TICK)` helper. Each analyzer's `ticksToReport` adds `tick_bounds` to the Report. Each `ParseReportData` reads it. Each time-series output struct gains `StartTime`/`EndTime` string fields (RFC 3339). For quality and devs analyzers, added timestamp tracking to their tick accumulators (`tickAccumulator.startTime/endTime`, `TickDevData.startTime/endTime`) with min/max tracking in `extractTC` and population in `buildTick`. + +**JSON output keys**: `"start_time"`, `"end_time"` (RFC 3339, e.g., `"2024-01-15T10:30:00Z"`) + +**Analyzers affected**: `history/sentiment`, `history/anomaly`, `history/quality`, `history/devs` (activity + churn), `history/file-history` (composition_ts) + +--- + +### Developer Identity Normalization + +#### Split pipe-delimited names into `name` + `email` + +**Motivation**: Developer identity used `"daniel smith|dbsmith@google.com"` pipe-delimited strings from `ReversedPeopleDict`. This blocked clean dimension table creation in DWH systems. + +**Fix**: Created `SplitIdentity(s string) (name, email string)` in `internal/identity/split.go`. Handles pipe-delimited, exact `"name "`, and plain name formats. Updated `devName()` → `devNameAndEmail()` and `getDevName()` → `getDevNameAndEmail()`. + +**Fields added**: +- `DeveloperData`: `email` field +- `BusFactorData`: `primary_dev_email`, `secondary_dev_email` +- `DeveloperCouplingData`: `developer1_email`, `developer2_email` + +**Analyzers affected**: `history/devs`, `history/couples` + +--- + +### Output Structure: Flattened Arrays + +#### `developers[].languages` — map → array + +**Motivation**: `map[string]LineStats` with variable language-name keys cannot be UNNEST'd in columnar DWH without custom ETL. + +**Fix**: Changed `DeveloperData.Languages` from `map[string]pkgplumbing.LineStats` to `[]LanguageStatsEntry`. Internal accumulation uses unexported `langMap`, converted to sorted array via `finalizeLanguages()`. 
Empty language strings replaced with `"Other"`. + +**Before**: `{"Go": {"added": 100, "removed": 5, "changed": 3}}` +**After**: `[{"language": "Go", "added": 100, "removed": 5, "changed": 3}]` + +#### `activity[].by_developer` — map → array + +**Motivation**: `map[int]int` (dev_id → commit_count) serializes to JSON with string keys, blocking typed ingestion. + +**Fix**: Changed to `[]DeveloperCommits` with `{dev_id, commits}` fields. Sorted by dev_id for deterministic output. + +**Before**: `{"2": 5, "3": 3}` +**After**: `[{"dev_id": 2, "commits": 5}, {"dev_id": 3, "commits": 3}]` + +#### `file_contributors[].contributors` — map → array + +**Motivation**: `map[int]LineStats` blocked DWH UNNEST. + +**Fix**: Changed to `[]ContributorEntry` with `{dev_id, added, removed, changed}` fields. Sorted by dev_id. + +**Before**: `{"2": {"added": 42, "removed": 5, "changed": 3}}` +**After**: `[{"dev_id": 2, "added": 42, "removed": 5, "changed": 3}]` + +--- + +### Output Envelope + +#### Top-level `metadata` section + +**Motivation**: A DWH ingesting reports from multiple repos could not distinguish them. No repo name, analysis timestamp, or version. + +**Fix**: Added `AnalysisMetadata` struct with `repo_path`, `repo_name` (from `filepath.Base`), `analyzed_at` (RFC 3339), `codefang_version` (from build ldflags). Injected after `DecodeCombinedBinaryReports` in the combined render path. + +```json +{ + "version": "codefang.run.v1", + "metadata": { + "repo_path": "/home/user/sources/kubernetes", + "repo_name": "kubernetes", + "analyzed_at": "2026-04-07T23:33:00Z", + "codefang_version": "dev" + }, + "analyzers": [...] +} +``` + +#### Per-analyzer `schema` manifest + +**Motivation**: DWH consumers need to know field types, grain, and cardinality for automated ETL generation. + +**Fix**: Added `FieldMeta` struct with `{type, grain, description}` and static `analyzerSchemas` registry covering all 17 analyzers. Each `AnalyzerResult` in the output includes a `schema` field. 
+ +```json +{ + "id": "static/complexity", + "schema": { + "function_complexity": { + "type": "list", + "grain": "function", + "description": "Per-function cyclomatic and cognitive complexity" + } + }, + "report": {...} +} +``` + +#### NDJSON output format + +**Motivation**: The monolithic JSON (467MB for kubernetes) must be fully parsed to extract any single analyzer. NDJSON enables streaming ingestion into ClickHouse. + +**Fix**: Added `FormatNDJSON` case to `WriteConvertedOutput`. One JSON line per analyzer result, with optional metadata line prepended. + +```bash +codefang run --format ndjson /repo > output.ndjson +``` + +--- + +### Clone Analysis + +#### `clone_type_distribution` from full population + +**Motivation**: Clone pairs are capped at 1,000 in the output, but the distribution metrics (Type-1/2/3 breakdown) were computed from the capped sample, skewing percentages for large codebases with 22M+ total pairs. + +**Fix**: Added `typeDistribution cloneTypeCounts` to `clonePairResult`. `matchCandidates` increments per-type counters for ALL valid pairs before the cap check. Both aggregator and per-file paths emit `clone_type_distribution` in the report. `ReportSection.Distribution()` reads from the full-population distribution. + +**Before**: Distribution from 1,000 capped pairs +**After**: Distribution from 22,381,694 total pairs: `{"Type-1": 12366266, "Type-2": 3307147, "Type-3": 6708281}` + +#### Relative paths in clone pairs + +Clone pair `func_a` / `func_b` paths changed from absolute (`/home/user/sources/repo/file.go::funcName`) to relative (`cmd/controller/app.go::newController`). Enabled by the `StampSourceFile` rootPath change. 
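The absolute-to-relative rewrite of clone pair identifiers can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: `relativizeCloneID` is a hypothetical helper, and the `path::funcName` layout is taken from the examples in the entry above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// relativizeCloneID rewrites "<absolute path>::<func>" into
// "<path relative to rootPath>::<func>".
// Hypothetical helper for illustration; not Codefang's actual code.
func relativizeCloneID(id, rootPath string) string {
	path, fn, found := strings.Cut(id, "::")
	if !found {
		return id // no separator: leave the identifier untouched
	}

	rel, err := filepath.Rel(rootPath, path)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return id // path lies outside rootPath: keep the original form
	}

	// Forward slashes keep the output identical across platforms.
	return filepath.ToSlash(rel) + "::" + fn
}

func main() {
	id := "/home/user/sources/repo/cmd/controller/app.go::newController"
	fmt.Println(relativizeCloneID(id, "/home/user/sources/repo"))
}
```

Leaving out-of-root paths unchanged is a design choice made for this sketch; a real implementation might prefer to report them as errors.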
+ +--- + +### New Files Created + +| File | Purpose | +|------|---------| +| `internal/analyzers/analyze/tick_bounds.go` | `TickBounds` type + `BuildTickBounds` helper | +| `internal/analyzers/analyze/metadata.go` | `AnalysisMetadata` struct + `NewAnalysisMetadata` constructor | +| `internal/analyzers/analyze/schema_registry.go` | Static schema registry for all 17 analyzers | +| `internal/identity/split.go` | `SplitIdentity(s string) (name, email string)` | + +--- + +### Empty Analyzer Root Causes (Documented) + +Investigation of 4 analyzers that returned empty data on kubernetes (1000 commits): + +| Analyzer | Root Cause | Resolution | +|----------|-----------|------------| +| `burndown.developer_survival` | Disabled by default (`Burndown.TrackPeople: false`) | Enable via config | +| `burndown.file_survival` | Disabled by default (`Burndown.TrackFiles: false`) | Enable via config | +| `history/imports` | Requires UAST-enabled pipeline mode (`NeedsUAST() = true`) | Architectural dependency | +| `history/typos` | Requires UAST-enabled pipeline mode (`NeedsUAST() = true`) | Architectural dependency | diff --git a/Makefile b/Makefile index b23122e..78bb1b8 100644 --- a/Makefile +++ b/Makefile @@ -37,8 +37,9 @@ help: @echo " build - Build all binaries (alias for all)" @echo " libgit2 - Build vendored libgit2 statically (auto-built by 'all')" @echo " install - Install binaries to system PATH" - @echo " test - Run all tests" - @echo " lint - Run linters and deadcode analysis" + @echo " test - Run all tests (unit)" + @echo " test-e2e - Run e2e acceptance tests (RUN= to filter)" + @echo " lint - Run linters, deadcode, and orphan package detection" @echo " fmt - Format code" @echo " schemas - Generate JSON schemas for all analyzers" @echo " deadcode - Run deadcode analysis with detailed output" @@ -111,6 +112,17 @@ testv: all CGO_LDFLAGS="-L$(CURDIR)/$(LIBGIT2_INSTALL)/lib64 -L$(CURDIR)/$(LIBGIT2_INSTALL)/lib -lgit2 -lpthread" \ CGO_ENABLED=1 go test ./... 
-v +# Run end-to-end acceptance tests (tests/e2e/). +# Add new spec tests by dropping *_test.go files into tests/e2e/. +# Optional: RUN= to filter, e.g. make test-e2e RUN=TestPerFile +RUN ?= . +.PHONY: test-e2e +test-e2e: libgit2 + PKG_CONFIG_PATH=$(LIBGIT2_PKG_CONFIG) \ + CGO_CFLAGS="-I$(CURDIR)/$(LIBGIT2_INSTALL)/include" \ + CGO_LDFLAGS="-L$(CURDIR)/$(LIBGIT2_INSTALL)/lib64 -L$(CURDIR)/$(LIBGIT2_INSTALL)/lib -lgit2 -lpthread" \ + CGO_ENABLED=1 go test -tags e2e -count=1 -v -run $(RUN) ./tests/e2e/... + # Run UAST performance benchmarks (comprehensive suite with organized results) bench: all python3 tools/benchmark/benchmark_runner.py @@ -302,13 +314,15 @@ lint: CGO_ENABLED=1 $(GOLINT) run $(INTERNAL_PKGS) @echo "Running deadcode analysis (production)..." @GOCACHE=$(LINT_GOCACHE) ./scripts/deadcode-filter.sh $(DEADCODE_PKGS) + @echo "Running orphan package detection..." + @./scripts/orphan-packages.sh $(INTERNAL_PKGS) @echo "✓ Linting complete" ## deadcode: Run deadcode analysis with whitelist filter (fails if dead code found) .PHONY: deadcode deadcode: @echo "Running deadcode analysis with whitelist..." 
- @GOCACHE=$(LINT_GOCACHE) ./scripts/deadcode-filter.sh -test $(DEADCODE_PKGS) + @GOCACHE=$(LINT_GOCACHE) ./scripts/deadcode-filter.sh $(DEADCODE_PKGS) ## deadcode-prod: Run deadcode analysis excluding tests (production-only dead code) .PHONY: deadcode-prod diff --git a/cmd/codefang/commands/render.go b/cmd/codefang/commands/render.go index 29b937e..b79ffee 100644 --- a/cmd/codefang/commands/render.go +++ b/cmd/codefang/commands/render.go @@ -3,8 +3,10 @@ package commands import ( "errors" "fmt" + "io" "log/slog" "os" + "path/filepath" "strings" "github.com/spf13/cobra" @@ -26,6 +28,8 @@ import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/sentiment" "github.com/Sumatoshi-tech/codefang/internal/analyzers/shotness" "github.com/Sumatoshi-tech/codefang/internal/analyzers/typos" + "github.com/Sumatoshi-tech/codefang/internal/storage" + "github.com/Sumatoshi-tech/codefang/pkg/textutil" ) const ( @@ -135,7 +139,33 @@ func runRender(storeDir, outputDir string) error { return fmt.Errorf("render index: %w", indexErr) } - return nil + return writeRenderReportJSON(outputDir, analyzerIDs, pages) +} + +// renderReportJSONFilename is the name of the machine-readable JSON report. +const renderReportJSONFilename = "report.json" + +// renderReportJSONPerm is the file permission for report.json. +const renderReportJSONPerm = 0o640 + +// renderReportData is the JSON structure emitted by codefang render. +type renderReportData struct { + AnalyzerIDs []string `json:"analyzer_ids"` + Pages []plotpage.PageMeta `json:"pages"` +} + +// writeRenderReportJSON emits report.json alongside rendered HTML pages. 
+func writeRenderReportJSON(outputDir string, analyzerIDs []string, pages []plotpage.PageMeta) error { + reportPath := filepath.Join(outputDir, renderReportJSONFilename) + + data := renderReportData{ + AnalyzerIDs: analyzerIDs, + Pages: pages, + } + + return storage.WriteAtomic(reportPath, renderReportJSONPerm, func(w io.Writer) error { + return textutil.WriteJSON(w, data, true) + }) } func renderOneAnalyzer( diff --git a/cmd/codefang/commands/render_test.go b/cmd/codefang/commands/render_test.go index 418d482..0240477 100644 --- a/cmd/codefang/commands/render_test.go +++ b/cmd/codefang/commands/render_test.go @@ -1,7 +1,5 @@ package commands -// FRD: specs/frds/FRD-20260228-render-command.md. - import ( "os" "path/filepath" diff --git a/cmd/codefang/commands/run.go b/cmd/codefang/commands/run.go index 18c3875..a5d5ccc 100644 --- a/cmd/codefang/commands/run.go +++ b/cmd/codefang/commands/run.go @@ -32,12 +32,15 @@ import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/plotpage" "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/renderer" "github.com/Sumatoshi-tech/codefang/internal/analyzers/complexity" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/composition" "github.com/Sumatoshi-tech/codefang/internal/analyzers/couples" "github.com/Sumatoshi-tech/codefang/internal/analyzers/devs" filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" "github.com/Sumatoshi-tech/codefang/internal/analyzers/halstead" "github.com/Sumatoshi-tech/codefang/internal/analyzers/imports" "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/langpath" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" "github.com/Sumatoshi-tech/codefang/internal/analyzers/quality" "github.com/Sumatoshi-tech/codefang/internal/analyzers/sentiment" "github.com/Sumatoshi-tech/codefang/internal/analyzers/shotness" @@ -60,8 +63,11 @@ type 
staticExecutor func( format string, verbose bool, noColor bool, + perFile bool, maxWorkers int, memoryBudget int64, + languages []string, + pathPolicy pathpolicy.Options, writer io.Writer, ) error @@ -70,6 +76,8 @@ type staticPlotExecutor func( analyzerIDs []string, maxWorkers int, memoryBudget int64, + languages []string, + pathPolicy pathpolicy.Options, outputDir string, ) error @@ -95,19 +103,23 @@ type HistoryRunOptions struct { Head bool Since string - Workers int - BufferSize int - CommitBatchSize int - BlobCacheSize string - DiffCacheSize int - BlobArenaSize string - MemoryBudget string + Workers int + BufferSize int + CommitBatchSize int + BlobCacheSize string + DiffCacheSize int + BlobArenaSize string + MemoryBudget string + MaxChangesPerCommit int Checkpoint *bool CheckpointDir string Resume *bool ClearCheckpoint bool + CacheDir string + NoCache bool + DebugTrace bool NDJSON bool @@ -157,16 +169,19 @@ type RunCommand struct { head bool since string - workers int - bufferSize int - commitBatchSize int - blobCacheSize string - diffCacheSize int - blobArenaSize string - memoryBudget string + workers int + bufferSize int + commitBatchSize int + blobCacheSize string + diffCacheSize int + blobArenaSize string + memoryBudget string + maxChangesPerCommit int checkpointDir string clearCheckpoint bool + cacheDir string + noCache bool ndjson bool @@ -175,6 +190,12 @@ type RunCommand struct { diagnosticsAddr string staticWorkers int + perFile bool + + // Cross-phase path exclusion policy. 
+ includeVendored bool + includeGenerated bool + extraExcludedPrefixes []string plotOutput string keepStore bool @@ -269,17 +290,23 @@ func newRunCommandWithAllDeps( cmd.Flags().IntVar(&rc.workers, "workers", 0, "Number of parallel workers (0 = use CPU count)") cmd.Flags().IntVar(&rc.staticWorkers, "static-workers", 0, "Number of parallel static analysis workers (0 = min(CPU count, 8))") + rc.registerExclusionFlags(cmd) + + cmd.Flags().BoolVarP(&rc.perFile, "per-file", "F", false, + "Include per-file breakdowns and summary statistics in static output") cmd.Flags().IntVar(&rc.bufferSize, "buffer-size", 0, "Size of internal pipeline channels (0 = workers*2)") cmd.Flags().IntVar(&rc.commitBatchSize, "commit-batch-size", 0, "Commits per processing batch (0 = default 100)") cmd.Flags().StringVar(&rc.blobCacheSize, "blob-cache-size", "", "Max blob cache size (e.g., '256MB', '1GB'; empty = default 1GB)") cmd.Flags().IntVar(&rc.diffCacheSize, "diff-cache-size", 0, "Max diff cache entries (0 = default 10000)") cmd.Flags().StringVar(&rc.blobArenaSize, "blob-arena-size", "", "Memory arena size for blob loading (e.g., '4MB'; empty = default 4MB)") cmd.Flags().StringVar(&rc.memoryBudget, "memory-budget", "", "Memory budget for auto-tuning (e.g., '512MB', '2GB')") + cmd.Flags().IntVar(&rc.maxChangesPerCommit, "max-changes-per-commit", 0, + "Skip commits whose tree diff exceeds this many changes (0 = default 10000). "+ + "Commits over the cap are silently dropped from history, which can desync "+ + "burndown's tracked state for affected files. 
Raise on monorepos with "+ + "legitimate large commits (Pods updates, generated code dumps).") - cmd.Flags().Bool("checkpoint", true, "Enable checkpointing for crash recovery") - cmd.Flags().StringVar(&rc.checkpointDir, "checkpoint-dir", "", "Checkpoint directory (default: ~/.codefang/checkpoints)") - cmd.Flags().Bool("resume", true, "Resume from checkpoint if available") - cmd.Flags().BoolVar(&rc.clearCheckpoint, "clear-checkpoint", false, "Clear existing checkpoint before run") + rc.registerPersistenceFlags(cmd) cmd.Flags().StringVar(&rc.configFile, "config", "", "Configuration file path (default: .codefang.yaml in CWD or $HOME)") cmd.Flags().BoolVar(&rc.listAnalyzers, "list-analyzers", false, "List all available analyzer IDs and exit") @@ -524,7 +551,7 @@ func (rc *RunCommand) runDirect( return rc.renderCombinedDirect(ctx, path, staticIDs, historyIDs, staticFormat, silent, progressWriter, writer, cmd) } - err = rc.runStaticPhase(path, staticIDs, staticFormat, silent, progressWriter, writer) + err = rc.runStaticPhase(path, staticIDs, staticFormat, silent, progressWriter, writer, cmd) if err != nil { return err } @@ -548,6 +575,7 @@ func (rc *RunCommand) runStaticPhase( silent bool, progressWriter io.Writer, writer io.Writer, + cmd *cobra.Command, ) error { if len(staticIDs) == 0 { return nil @@ -567,12 +595,21 @@ func (rc *RunCommand) runStaticPhase( rc.progressf(silent, progressWriter, "static phase started (%d analyzers)", len(staticIDs)) + languages := readLanguagesFlag(cmd) + policy := rc.buildPathPolicy() + var err error if staticFormat == analyze.FormatPlot { - err = rc.staticPlotExec(path, staticIDs, rc.staticWorkers, budgetBytes, rc.plotOutput) + err = rc.staticPlotExec( + path, staticIDs, rc.staticWorkers, budgetBytes, languages, policy, rc.plotOutput, + ) } else { - err = rc.staticExec(path, staticIDs, staticFormat, rc.verbose, rc.noColor, rc.staticWorkers, budgetBytes, writer) + err = rc.staticExec( + path, staticIDs, staticFormat, + rc.verbose, 
rc.noColor, rc.perFile, + rc.staticWorkers, budgetBytes, languages, policy, writer, + ) } if err != nil { @@ -619,6 +656,25 @@ func (rc *RunCommand) runHistoryPhase( return nil } +// combinedIDsAndModes builds parallel slices of analyzer IDs and their modes +// (static first, history second) for DecodeCombinedBinaryReports. +func combinedIDsAndModes(staticIDs, historyIDs []string) ([]string, []analyze.AnalyzerMode) { + ids := make([]string, 0, len(staticIDs)+len(historyIDs)) + modes := make([]analyze.AnalyzerMode, 0, len(staticIDs)+len(historyIDs)) + + for _, id := range staticIDs { + ids = append(ids, id) + modes = append(modes, analyze.ModeStatic) + } + + for _, id := range historyIDs { + ids = append(ids, id) + modes = append(modes, analyze.ModeHistory) + } + + return ids, modes +} + func (rc *RunCommand) renderCombinedDirect( ctx context.Context, path string, @@ -640,7 +696,8 @@ func (rc *RunCommand) renderCombinedDirect( err := rc.staticExec( path, staticIDs, analyze.FormatBinary, - rc.verbose, rc.noColor, rc.staticWorkers, budgetBytes, &raw, + rc.verbose, rc.noColor, rc.perFile, rc.staticWorkers, budgetBytes, + readLanguagesFlag(cmd), rc.buildPathPolicy(), &raw, ) if err != nil { return fmt.Errorf("render combined static phase: %w", err) @@ -661,24 +718,15 @@ func (rc *RunCommand) renderCombinedDirect( rc.progressf(silent, progressWriter, "combined history phase finished in %s", time.Since(startedAt).Round(time.Millisecond)) - ids := make([]string, 0, len(staticIDs)+len(historyIDs)) - modes := make([]analyze.AnalyzerMode, 0, len(staticIDs)+len(historyIDs)) - - for _, id := range staticIDs { - ids = append(ids, id) - modes = append(modes, analyze.ModeStatic) - } - - for _, id := range historyIDs { - ids = append(ids, id) - modes = append(modes, analyze.ModeHistory) - } + ids, modes := combinedIDsAndModes(staticIDs, historyIDs) model, err := analyze.DecodeCombinedBinaryReports(raw.Bytes(), ids, modes) if err != nil { return fmt.Errorf("decode combined payload: 
%w", err) } + model.Metadata = analyze.NewAnalysisMetadata(path) + rc.progressf(silent, progressWriter, "combined payload decoded") startedAt = time.Now() @@ -703,38 +751,65 @@ func (rc *RunCommand) renderCombinedDirect( func (rc *RunCommand) buildHistoryRunOptions(cmd *cobra.Command) HistoryRunOptions { opts := HistoryRunOptions{ - GCPercent: rc.gogc, - BallastSize: rc.ballastSize, - CPUProfile: rc.cpuprofile, - HeapProfile: rc.heapprofile, - Limit: rc.limit, - FirstParent: rc.firstParent, - Head: rc.head, - Since: rc.since, - Workers: rc.workers, - BufferSize: rc.bufferSize, - CommitBatchSize: rc.commitBatchSize, - BlobCacheSize: rc.blobCacheSize, - DiffCacheSize: rc.diffCacheSize, - BlobArenaSize: rc.blobArenaSize, - MemoryBudget: rc.memoryBudget, - CheckpointDir: rc.checkpointDir, - ClearCheckpoint: rc.clearCheckpoint, - DebugTrace: rc.debugTrace, - NDJSON: rc.ndjson, - ConfigFile: rc.configFile, - PlotOutput: rc.plotOutput, - KeepStore: rc.keepStore, - TmpDir: rc.tmpDir, + GCPercent: rc.gogc, + BallastSize: rc.ballastSize, + CPUProfile: rc.cpuprofile, + HeapProfile: rc.heapprofile, + Limit: rc.limit, + FirstParent: rc.firstParent, + Head: rc.head, + Since: rc.since, + Workers: rc.workers, + BufferSize: rc.bufferSize, + CommitBatchSize: rc.commitBatchSize, + BlobCacheSize: rc.blobCacheSize, + DiffCacheSize: rc.diffCacheSize, + BlobArenaSize: rc.blobArenaSize, + MemoryBudget: rc.memoryBudget, + MaxChangesPerCommit: rc.maxChangesPerCommit, + CheckpointDir: rc.checkpointDir, + ClearCheckpoint: rc.clearCheckpoint, + CacheDir: rc.cacheDir, + NoCache: rc.noCache, + DebugTrace: rc.debugTrace, + NDJSON: rc.ndjson, + ConfigFile: rc.configFile, + PlotOutput: rc.plotOutput, + KeepStore: rc.keepStore, + TmpDir: rc.tmpDir, } opts.Checkpoint = parseBoolFlag(cmd, "checkpoint") opts.Resume = parseBoolFlag(cmd, "resume") opts.AnalyzerFlags = collectAnalyzerFlags(cmd) + opts.AnalyzerFlags[plumbing.ConfigTreeDiffPathPolicy] = rc.buildPathPolicy() return opts } +// 
registerPersistenceFlags registers checkpoint and incremental cache flags. +func (rc *RunCommand) registerPersistenceFlags(cmd *cobra.Command) { + cmd.Flags().Bool("checkpoint", true, "Enable checkpointing for crash recovery") + cmd.Flags().StringVar(&rc.checkpointDir, "checkpoint-dir", "", + "Checkpoint directory (default: ~/.codefang/checkpoints)") + cmd.Flags().Bool("resume", true, "Resume from checkpoint if available") + cmd.Flags().BoolVar(&rc.clearCheckpoint, "clear-checkpoint", false, + "Clear existing checkpoint before run") + cmd.Flags().StringVar(&rc.cacheDir, "cache-dir", "", + "Incremental analysis cache directory (skip already-processed commits)") + cmd.Flags().BoolVar(&rc.noCache, "no-cache", false, + "Force full re-analysis, overwriting any existing cache") +} + +// resolveCacheDir returns the cache directory from opts, or empty when --no-cache is set. +func resolveCacheDir(opts HistoryRunOptions) string { + if opts.NoCache || opts.CacheDir == "" { + return "" + } + + return opts.CacheDir +} + // parseBoolFlag returns a pointer to the flag value if it was explicitly set, nil otherwise. 
func parseBoolFlag(cmd *cobra.Command, name string) *bool { if !cmd.Flags().Changed(name) { @@ -841,7 +916,7 @@ func (rc *RunCommand) printAnalyzerList(writer io.Writer, registry *analyze.Regi } func defaultRegistry() (*analyze.Registry, error) { - return analyze.NewRegistry(defaultStaticAnalyzers(), defaultHistoryLeaves()) + return analyze.NewRegistry(defaultUASTAnalyzers(), defaultRawFileAnalyzers(), defaultHistoryLeaves()) } func runStaticAnalyzers( @@ -850,13 +925,23 @@ func runStaticAnalyzers( format string, verbose bool, noColor bool, + perFile bool, maxWorkers int, memoryBudget int64, + languages []string, + pathPolicy pathpolicy.Options, writer io.Writer, ) error { - service := analyze.NewStaticService(defaultStaticAnalyzers()) + service := analyze.NewStaticService(defaultUASTAnalyzers(), defaultRawFileAnalyzers()) service.Renderer = renderer.NewDefaultStaticRenderer() service.MaxWorkers = maxWorkers + service.PerFile = perFile + service.PathPolicy = pathPolicy + + err := applyStaticLanguageFilter(service, languages) + if err != nil { + return err + } applyStaticBudgetConfig(service, maxWorkers, memoryBudget) applyStaticProgressLogging(service, verbose) @@ -865,17 +950,24 @@ func runStaticAnalyzers( } // runStaticPlotAnalyzers runs static analysis and renders multi-page HTML plot output. -// FRD: specs/frds/FRD-20260312-static-plot-multipage.md. 
func runStaticPlotAnalyzers( path string, analyzerIDs []string, maxWorkers int, memoryBudget int64, + languages []string, + pathPolicy pathpolicy.Options, outputDir string, ) error { - service := analyze.NewStaticService(defaultStaticAnalyzers()) + service := analyze.NewStaticService(defaultUASTAnalyzers(), defaultRawFileAnalyzers()) service.MaxWorkers = maxWorkers service.AggregationMode = analyze.AggregationModeFull + service.PathPolicy = pathPolicy + + err := applyStaticLanguageFilter(service, languages) + if err != nil { + return err + } applyStaticBudgetConfig(service, maxWorkers, memoryBudget) applyStaticProgressLogging(service, false) @@ -895,7 +987,6 @@ func runStaticPlotAnalyzers( // applyStaticProgressLogging wires progress logging into the static service. // Default mode logs phase and file count. Verbose mode adds RSS and aggregator sizes. -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. func applyStaticProgressLogging(service *analyze.StaticService, verbose bool) { if verbose { service.ProgressFunc = func(e analyze.StaticProgressEvent) { @@ -914,9 +1005,73 @@ func applyStaticProgressLogging(service *analyze.StaticService, verbose bool) { } } +// registerExclusionFlags registers the three cross-phase path exclusion +// flags. +func (rc *RunCommand) registerExclusionFlags(cmd *cobra.Command) { + cmd.Flags().BoolVar(&rc.includeVendored, "include-vendored", false, + "Re-include vendored dependencies (detected by enry / Linguist) in analysis. "+ + "Default: exclude vendor/, node_modules/, third_party/, testdata/, minified bundles, etc.") + cmd.Flags().BoolVar(&rc.includeGenerated, "include-generated", false, + "Re-include auto-generated files in analysis. 
"+ + "Default: exclude *.pb.go, zz_generated_*.go, *_pb2.py, *.min.js, and any file whose "+ + "first 512 bytes contain a generated-file marker (\"DO NOT EDIT\", \"Code generated\", etc.).") + cmd.Flags().StringSliceVar(&rc.extraExcludedPrefixes, "extra-excluded-prefixes", nil, + "Additional UNIX path prefixes to exclude on top of enry heuristics (e.g. "+ + "\".venv/,target/,build/\"). Applies to both static and history phases.") +} + +// buildPathPolicy constructs the cross-phase path exclusion policy from +// the --include-vendored, --include-generated, and --extra-excluded-prefixes +// flags. +func (rc *RunCommand) buildPathPolicy() pathpolicy.Options { + return pathpolicy.Options{ + IncludeVendored: rc.includeVendored, + IncludeGenerated: rc.includeGenerated, + ExtraExcludedPrefixes: rc.extraExcludedPrefixes, + } +} + +// readLanguagesFlag extracts the --languages slice from the cobra command +// when present. Returns nil for a nil command or when the flag is absent, +// which keeps the caller path-safe in tests that construct a +// RunCommand without wiring every cobra flag. +func readLanguagesFlag(cmd *cobra.Command) []string { + if cmd == nil { + return nil + } + + languages, err := cmd.Flags().GetStringSlice("languages") + if err != nil { + return nil + } + + return languages +} + +// applyStaticLanguageFilter derives libgit2-style basename globs from the +// user's --languages value and assigns them to the static service. Empty +// or "all" disables the filter (default behavior). An unknown language +// token surfaces as an error so static-only runs fail fast — matching the +// history-side semantics. 
+func applyStaticLanguageFilter(service *analyze.StaticService, languages []string) error { + globs, wantsAll, err := langpath.Globs(languages) + if err != nil { + return fmt.Errorf("static --languages: %w", err) + } + + if wantsAll { + service.LanguageGlobs = nil + + return nil + } + + service.LanguageGlobs = globs + + return nil +} + // applyStaticBudgetConfig applies budget-derived parameters to the static service. // Explicit --static-workers overrides budget-derived MaxWorkers. -// FRD: specs/frds/FRD-20260312-static-budget-tuning.md. func applyStaticBudgetConfig(service *analyze.StaticService, explicitWorkers int, memoryBudget int64) { cfg := budget.SolveStaticBudget(memoryBudget) if cfg.MaxWorkers == 0 { @@ -1184,15 +1339,16 @@ func configureAndSelect( func buildConfigParams(opts HistoryRunOptions, fileCfg *cfgpkg.Config) framework.ConfigParams { params := framework.ConfigParams{ - Workers: opts.Workers, - BufferSize: opts.BufferSize, - CommitBatchSize: opts.CommitBatchSize, - BlobCacheSize: opts.BlobCacheSize, - DiffCacheSize: opts.DiffCacheSize, - BlobArenaSize: opts.BlobArenaSize, - MemoryBudget: opts.MemoryBudget, - GCPercent: opts.GCPercent, - BallastSize: opts.BallastSize, + Workers: opts.Workers, + BufferSize: opts.BufferSize, + CommitBatchSize: opts.CommitBatchSize, + BlobCacheSize: opts.BlobCacheSize, + DiffCacheSize: opts.DiffCacheSize, + BlobArenaSize: opts.BlobArenaSize, + MemoryBudget: opts.MemoryBudget, + GCPercent: opts.GCPercent, + BallastSize: opts.BallastSize, + MaxChangesPerCommit: opts.MaxChangesPerCommit, } if fileCfg != nil { @@ -1257,6 +1413,7 @@ func executeHistoryPipeline( } coordConfig.FirstParent = opts.FirstParent + coordConfig.TreeDiffPathspec = extractTreeDiffPathspec(pl.Core) if !needsUAST(selectedLeaves) { coordConfig.UASTPipelineWorkers = 0 @@ -1264,6 +1421,7 @@ func executeHistoryPipeline( runner := framework.NewRunnerWithConfig(repository, path, coordConfig, allAnalyzers...) 
runner.CoreCount = len(pl.Core) + runner.CacheDir = resolveCacheDir(opts) red, analysisMetrics, metricsErr := createRunMetrics() if metricsErr != nil { @@ -1610,6 +1768,34 @@ func registerAnalyzerFlags(cobraCmd *cobra.Command) { registerConfigFlag(cobraCmd, opt) } } + + markDeprecatedExclusionFlags(cobraCmd) +} + +// markDeprecatedExclusionFlags marks the legacy exclusion flags as +// deprecated, directing users to the new cross-phase flags. Errors are +// reported via the standard library logger because cobra returns a +// deterministic error only when the flag does not exist — a programmer +// mistake we want to surface during development, not silently swallow. +func markDeprecatedExclusionFlags(cobraCmd *cobra.Command) { + const ( + skipBlacklistFlag = "skip-blacklist" + blacklistedPfxFlag = "blacklisted-prefixes" + ) + + err := cobraCmd.Flags().MarkDeprecated(skipBlacklistFlag, + "use --include-vendored=false and --include-generated=false "+ + "(the new defaults). See CHANGELOG for migration.") + if err != nil { + log.Printf("warn: mark %q deprecated: %v", skipBlacklistFlag, err) + } + + err = cobraCmd.Flags().MarkDeprecated(blacklistedPfxFlag, + "use --extra-excluded-prefixes; the old flag name is preserved "+ + "for back-compat but will be removed in the next minor release.") + if err != nil { + log.Printf("warn: mark %q deprecated: %v", blacklistedPfxFlag, err) + } } func registerConfigFlag(cobraCmd *cobra.Command, opt pipeline.ConfigurationOption) { @@ -1645,6 +1831,19 @@ type uastDependent interface { NeedsUAST() bool } +// extractTreeDiffPathspec returns the libgit2 pathspec pre-filter produced by +// TreeDiffAnalyzer.Configure, or nil when no TreeDiffAnalyzer is present or +// the user did not restrict by language. 
+func extractTreeDiffPathspec(core []analyze.HistoryAnalyzer) []string { + for _, a := range core { + if td, ok := a.(*plumbing.TreeDiffAnalyzer); ok { + return td.Pathspec + } + } + + return nil +} + func needsUAST(leaves []analyze.HistoryAnalyzer) bool { for _, leaf := range leaves { if ud, ok := leaf.(uastDependent); ok && ud.NeedsUAST() { @@ -1890,7 +2089,7 @@ func defaultHistoryLeaves() []analyze.HistoryAnalyzer { return result } -func defaultStaticAnalyzers() []analyze.StaticAnalyzer { +func defaultUASTAnalyzers() []analyze.StaticAnalyzer { return []analyze.StaticAnalyzer{ clones.NewAnalyzer(), complexity.NewAnalyzer(), @@ -1901,6 +2100,12 @@ func defaultStaticAnalyzers() []analyze.StaticAnalyzer { } } +func defaultRawFileAnalyzers() []analyze.RawFileAnalyzer { + return []analyze.RawFileAnalyzer{ + composition.NewAnalyzer(), + } +} + // validatePlotFlags checks that required flags are present when --format plot is used. // rebuildPlotIndex re-scans the output directory and generates a unified index.html // that includes pages from all phases (static + history). diff --git a/cmd/codefang/commands/run_plot_test.go b/cmd/codefang/commands/run_plot_test.go index a141a6c..e70bebb 100644 --- a/cmd/codefang/commands/run_plot_test.go +++ b/cmd/codefang/commands/run_plot_test.go @@ -1,7 +1,5 @@ package commands -// FRD: specs/frds/FRD-20260228-plot-through-store.md. 
- import ( "context" "io" @@ -12,6 +10,7 @@ import ( "github.com/stretchr/testify/require" "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" ) func TestRunCommand_ForwardsPlotOutputFlag(t *testing.T) { @@ -20,7 +19,7 @@ func TestRunCommand_ForwardsPlotOutputFlag(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -50,7 +49,7 @@ func TestRunCommand_ForwardsKeepStoreFlag(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -125,13 +124,11 @@ func TestRenderFromStore_CreatesOutputDir(t *testing.T) { require.NoError(t, statErr, "index.html should exist in nested output dir") } -// FRD: specs/frds/FRD-20260312-static-plot-multipage.md. 
- func TestStaticPlot_RequiresOutputFlag(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { diff --git a/cmd/codefang/commands/run_test.go b/cmd/codefang/commands/run_test.go index f8e3d29..bb61364 100644 --- a/cmd/codefang/commands/run_test.go +++ b/cmd/codefang/commands/run_test.go @@ -22,6 +22,7 @@ import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/renderer" "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/reportutil" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" "github.com/Sumatoshi-tech/codefang/internal/observability" "github.com/Sumatoshi-tech/codefang/pkg/gitlib" "github.com/Sumatoshi-tech/codefang/pkg/pipeline" @@ -114,7 +115,10 @@ func TestRunCommand_DispatchesBothModes(t *testing.T) { ) command := newRunCommandWithDeps( - func(_ string, ids []string, format string, _ bool, _ bool, _ int, _ int64, writer io.Writer) error { + func( + _ string, ids []string, format string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, writer io.Writer, + ) error { staticCalled = true staticFormat = format @@ -149,7 +153,10 @@ func TestRunCommand_StaticOnly(t *testing.T) { var historyCalled bool command := newRunCommandWithDeps( - func(_ string, ids []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func( + _ string, ids []string, _ string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { require.Equal(t, []string{"static/complexity"}, ids) return nil @@ -173,7 +180,10 @@ func 
TestRunCommand_ProgressOutput_DefaultEnabled(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, ids []string, format string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func( + _ string, ids []string, format string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { require.Equal(t, []string{"static/complexity"}, ids) require.Equal(t, analyze.FormatJSON, format) @@ -203,7 +213,7 @@ func TestRunCommand_ProgressOutput_Silent(t *testing.T) { var historySilent bool command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { t.Fatal("static executor should not be called") return nil @@ -236,7 +246,7 @@ func TestRunCommand_ForwardsHistoryRuntimeOptions(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { t.Fatal("static executor should not be called") return nil @@ -268,7 +278,7 @@ func TestRunCommand_ForwardsCommitSelectionFlags(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -302,7 +312,7 @@ func TestRunCommand_ForwardsProfilingFlags(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ 
string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -332,7 +342,7 @@ func TestRunCommand_ForwardsResourceTuningFlags(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -372,7 +382,7 @@ func TestRunCommand_ForwardsCheckpointFlags(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -408,7 +418,7 @@ func TestRunCommand_CheckpointDefaultsPreserved(t *testing.T) { var seenOptions HistoryRunOptions command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { @@ -432,7 +442,10 @@ func TestRunCommand_ProgressOutput_Quiet(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, ids []string, format string, _ bool, _ 
bool, _ int, _ int64, _ io.Writer) error { + func( + _ string, ids []string, format string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { require.Equal(t, []string{"static/complexity"}, ids) require.Equal(t, analyze.FormatJSON, format) @@ -462,7 +475,9 @@ func TestRunCommand_UnknownAnalyzer(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { return nil }, + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { + return nil + }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { return nil }, @@ -481,7 +496,10 @@ func TestRunCommand_GlobStaticAnalyzers(t *testing.T) { var historyCalled bool command := newRunCommandWithDeps( - func(_ string, ids []string, format string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func( + _ string, ids []string, format string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { require.Equal(t, []string{"static/complexity"}, ids) require.Equal(t, analyze.FormatJSON, format) @@ -513,7 +531,10 @@ func TestRunCommand_GlobAllAnalyzers(t *testing.T) { ) command := newRunCommandWithDeps( - func(_ string, ids []string, format string, _ bool, _ bool, _ int, _ int64, writer io.Writer) error { + func( + _ string, ids []string, format string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, writer io.Writer, + ) error { staticCalled = true staticFormat = format @@ -546,7 +567,9 @@ func TestRunCommand_GlobUnknownPattern(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { return nil }, + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, 
_ pathpolicy.Options, _ io.Writer) error { + return nil + }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { return nil }, @@ -563,7 +586,9 @@ func TestRunCommand_GlobInvalidPattern(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { return nil }, + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { + return nil + }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { return nil }, @@ -664,7 +689,7 @@ func TestRunCommand_ConvertInput_BinToJSON(t *testing.T) { require.NoError(t, os.WriteFile(inputPath, raw.Bytes(), 0o600)) command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { t.Fatal("static executor should not be called in conversion mode") return nil @@ -718,7 +743,7 @@ func TestRunCommand_ConvertInput_JSONToPlot(t *testing.T) { require.NoError(t, os.WriteFile(inputPath, []byte(input), 0o600)) command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { t.Fatal("static executor should not be called in conversion mode") return nil @@ -766,7 +791,7 @@ func TestRunCommand_ConvertInput_BinToPlot(t *testing.T) { require.NoError(t, os.WriteFile(inputPath, raw.Bytes(), 0o600)) command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ 
[]string, _ pathpolicy.Options, _ io.Writer) error { t.Fatal("static executor should not be called in conversion mode") return nil @@ -807,12 +832,12 @@ func TestRunCommand_MixedPlotRunsSeparatePhases(t *testing.T) { outDir := t.TempDir() command := newRunCommandWithAllDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { t.Fatal("static text executor should not be called for plot format") return nil }, - func(_ string, ids []string, _ int, _ int64, dir string) error { + func(_ string, ids []string, _ int, _ int64, _ []string, _ pathpolicy.Options, dir string) error { staticPlotCalled = true require.Equal(t, []string{"static/complexity"}, ids) @@ -863,7 +888,10 @@ func TestRunCommand_MixedUniversalFormatsRenderUnifiedModel(t *testing.T) { ) command := newRunCommandWithDeps( - func(_ string, ids []string, format string, _ bool, _ bool, _ int, _ int64, writer io.Writer) error { + func( + _ string, ids []string, format string, _ bool, _ bool, _ bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, writer io.Writer, + ) error { staticFormat = format require.Equal(t, []string{"static/complexity"}, ids) @@ -1084,7 +1112,7 @@ func TestRunCommand_DebugTraceFlag_Accepted(t *testing.T) { t.Parallel() command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { @@ -1109,7 +1137,7 @@ func TestRunCommand_CreatesRootSpan(t *testing.T) { t.Cleanup(func() { require.NoError(t, tp.Shutdown(context.Background())) }) command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ 
bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { @@ -1151,7 +1179,7 @@ func TestRunCommand_ShutdownCalledOnExit(t *testing.T) { var shutdownCalled bool command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { @@ -1194,7 +1222,7 @@ func TestRunCommand_InitializesObservability(t *testing.T) { } command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { @@ -1235,7 +1263,7 @@ func stubRunRegistry() (*analyze.Registry, error) { }, } - return analyze.NewRegistry(staticAnalyzers, historyAnalyzers) + return analyze.NewRegistry(staticAnalyzers, nil, historyAnalyzers) } func noopObservabilityInit(_ observability.Config) (observability.Providers, error) { @@ -1283,7 +1311,7 @@ func TestRunCommand_RootSpanAttributes(t *testing.T) { t.Cleanup(func() { require.NoError(t, tp.Shutdown(context.Background())) }) command := newRunCommandWithDeps( - func(_ string, _ []string, _ string, _ bool, _ bool, _ int, _ int64, _ io.Writer) error { + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { return nil }, func(_ 
context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { @@ -1324,8 +1352,6 @@ func TestRunCommand_RootSpanAttributes(t *testing.T) { require.Contains(t, rootAttrs, "codefang.duration_class", "root span should have duration_class") } -// FRD: specs/frds/FRD-20260311-static-memory-limit.md. - func TestParseMemoryBudgetBytes_Valid(t *testing.T) { t.Parallel() @@ -1364,12 +1390,10 @@ func TestApplyStaticMemoryLimit_SetsAndRestores(t *testing.T) { restore() } -// FRD: specs/frds/FRD-20260312-static-budget-tuning.md. - func TestApplyStaticBudgetConfig_ZeroBudget(t *testing.T) { t.Parallel() - service := analyze.NewStaticService(nil) + service := analyze.NewStaticService(nil, nil) applyStaticBudgetConfig(service, 0, 0) assert.Zero(t, service.MaxWorkers) @@ -1381,7 +1405,7 @@ func TestApplyStaticBudgetConfig_WithBudget(t *testing.T) { const budgetOneGiB int64 = 1024 * 1024 * 1024 - service := analyze.NewStaticService(nil) + service := analyze.NewStaticService(nil, nil) applyStaticBudgetConfig(service, 0, budgetOneGiB) assert.Positive(t, service.MaxWorkers) @@ -1395,7 +1419,7 @@ func TestApplyStaticBudgetConfig_ExplicitWorkersOverride(t *testing.T) { const explicitWorkers = 2 - service := analyze.NewStaticService(nil) + service := analyze.NewStaticService(nil, nil) service.MaxWorkers = explicitWorkers applyStaticBudgetConfig(service, explicitWorkers, budgetOneGiB) @@ -1405,3 +1429,176 @@ func TestApplyStaticBudgetConfig_ExplicitWorkersOverride(t *testing.T) { // Spill threshold should still be derived from budget. 
assert.Positive(t, service.SpillThreshold) } + +func TestApplyStaticLanguageFilter_EmptyInput_DisablesFilter(t *testing.T) { + t.Parallel() + + service := analyze.NewStaticService(nil, nil) + + err := applyStaticLanguageFilter(service, nil) + require.NoError(t, err) + assert.Nil(t, service.LanguageGlobs, + "empty input must disable the filter (nil LanguageGlobs)") +} + +func TestApplyStaticLanguageFilter_AllKeyword_DisablesFilter(t *testing.T) { + t.Parallel() + + service := analyze.NewStaticService(nil, nil) + + err := applyStaticLanguageFilter(service, []string{"all"}) + require.NoError(t, err) + assert.Nil(t, service.LanguageGlobs, + "'all' sentinel must disable the filter") +} + +func TestApplyStaticLanguageFilter_KnownLanguage_PopulatesGlobs(t *testing.T) { + t.Parallel() + + service := analyze.NewStaticService(nil, nil) + + err := applyStaticLanguageFilter(service, []string{"go"}) + require.NoError(t, err) + assert.Contains(t, service.LanguageGlobs, "*.go") +} + +func TestApplyStaticLanguageFilter_UnknownLanguage_FailsFast(t *testing.T) { + t.Parallel() + + service := analyze.NewStaticService(nil, nil) + + err := applyStaticLanguageFilter(service, []string{"notalang"}) + require.Error(t, err) + assert.Contains(t, err.Error(), "notalang", + "unknown language must surface at configure time for static-only runs") +} + +func TestRunCommand_PerFileFlag_Propagated(t *testing.T) { + t.Parallel() + + var seenPerFile bool + + command := newRunCommandWithDeps( + func( + _ string, _ []string, _ string, _ bool, _ bool, perFile bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { + seenPerFile = perFile + + return nil + }, + func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { + return nil + }, + stubRunRegistry, + noopObservabilityInit, + ) + + command.SetArgs([]string{"-a", "static/complexity", "--per-file"}) + err := command.Execute() + require.NoError(t, err) + require.True(t, 
seenPerFile, "--per-file flag must be propagated to staticExecutor") +} + +func TestRunCommand_PerFileFlag_ShortAlias(t *testing.T) { + t.Parallel() + + var seenPerFile bool + + command := newRunCommandWithDeps( + func( + _ string, _ []string, _ string, _ bool, _ bool, perFile bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { + seenPerFile = perFile + + return nil + }, + func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { + return nil + }, + stubRunRegistry, + noopObservabilityInit, + ) + + command.SetArgs([]string{"-a", "static/complexity", "-F"}) + err := command.Execute() + require.NoError(t, err) + require.True(t, seenPerFile, "-F short alias must be propagated to staticExecutor") +} + +func TestRunCommand_PerFileFlag_DefaultFalse(t *testing.T) { + t.Parallel() + + var seenPerFile bool + + command := newRunCommandWithDeps( + func( + _ string, _ []string, _ string, _ bool, _ bool, perFile bool, + _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer, + ) error { + seenPerFile = perFile + + return nil + }, + func(_ context.Context, _ string, _ []string, _ string, _ bool, _ HistoryRunOptions, _ io.Writer) error { + return nil + }, + stubRunRegistry, + noopObservabilityInit, + ) + + command.SetArgs([]string{"-a", "static/complexity"}) + err := command.Execute() + require.NoError(t, err) + require.False(t, seenPerFile, "per-file must be false by default") +} + +func TestRunCommand_CacheDirFlag_Propagated(t *testing.T) { + t.Parallel() + + var seenOpts HistoryRunOptions + + command := newRunCommandWithDeps( + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { + return nil + }, + func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { + seenOpts = opts + + return nil + }, + stubRunRegistry, + noopObservabilityInit, + ) + + 
command.SetArgs([]string{"-a", "history/devs", "--cache-dir", "/tmp/test-cache"}) + err := command.Execute() + require.NoError(t, err) + assert.Equal(t, "/tmp/test-cache", seenOpts.CacheDir) + assert.False(t, seenOpts.NoCache) +} + +func TestRunCommand_NoCacheFlag_Propagated(t *testing.T) { + t.Parallel() + + var seenOpts HistoryRunOptions + + command := newRunCommandWithDeps( + func(_ string, _ []string, _ string, _ bool, _ bool, _ bool, _ int, _ int64, _ []string, _ pathpolicy.Options, _ io.Writer) error { + return nil + }, + func(_ context.Context, _ string, _ []string, _ string, _ bool, opts HistoryRunOptions, _ io.Writer) error { + seenOpts = opts + + return nil + }, + stubRunRegistry, + noopObservabilityInit, + ) + + command.SetArgs([]string{"-a", "history/devs", "--cache-dir", "/tmp/cache", "--no-cache"}) + err := command.Execute() + require.NoError(t, err) + assert.True(t, seenOpts.NoCache) +} diff --git a/cmd/uast/server_test.go b/cmd/uast/server_test.go index e26cdd5..222aa8a 100644 --- a/cmd/uast/server_test.go +++ b/cmd/uast/server_test.go @@ -2,6 +2,7 @@ package main import ( "bytes" + "context" "encoding/json" "net/http" "net/http/httptest" @@ -66,7 +67,7 @@ string <- (string) => uast( } // Create test request. - req := httptest.NewRequest(http.MethodPost, "/api/parse", bytes.NewBuffer(jsonData)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/api/parse", bytes.NewBuffer(jsonData)) req.Header.Set("Content-Type", "application/json") // Create response recorder. @@ -130,7 +131,7 @@ func TestHandleParseWithoutCustomUASTMaps(t *testing.T) { } // Create test request. - req := httptest.NewRequest(http.MethodPost, "/api/parse", bytes.NewBuffer(jsonData)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/api/parse", bytes.NewBuffer(jsonData)) req.Header.Set("Content-Type", "application/json") // Create response recorder. 
@@ -193,7 +194,7 @@ func TestHandleParseWithInvalidUASTMaps(t *testing.T) { } // Create test request. - req := httptest.NewRequest(http.MethodPost, "/api/parse", bytes.NewBuffer(jsonData)) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/api/parse", bytes.NewBuffer(jsonData)) req.Header.Set("Content-Type", "application/json") // Create response recorder. @@ -230,7 +231,7 @@ func TestUASTServer_MiddlewareWrapsRoutes(t *testing.T) { tracer := noop.NewTracerProvider().Tracer("test") handler := newServerMux(tracer) - req := httptest.NewRequest(http.MethodGet, "/api/mappings", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/api/mappings", http.NoBody) rec := httptest.NewRecorder() require.NotPanics(t, func() { diff --git a/internal/analyzers/analyze/analyzer.go b/internal/analyzers/analyze/analyzer.go index 0c4e578..7b7f605 100644 --- a/internal/analyzers/analyze/analyzer.go +++ b/internal/analyzers/analyze/analyzer.go @@ -85,11 +85,12 @@ type Analyzer interface { Configure(facts map[string]any) error } -// StaticAnalyzer interface defines the contract for UAST-based static analysis. -type StaticAnalyzer interface { +// FormattableAnalyzer is the shared contract for analyzers that produce +// reportable output with thresholds, aggregation, and format methods. +// Both StaticAnalyzer and RawFileAnalyzer satisfy this interface. +type FormattableAnalyzer interface { Analyzer - Analyze(root *node.Node) (Report, error) Thresholds() Thresholds // Aggregation methods. @@ -103,6 +104,23 @@ type StaticAnalyzer interface { FormatReportBinary(report Report, writer io.Writer) error } +// StaticAnalyzer defines the contract for UAST-based static analysis. +// Runs during the UAST phase on parsed AST nodes. 
+type StaticAnalyzer interface { + FormattableAnalyzer + + Analyze(root *node.Node) (Report, error) +} + +// RawFileAnalyzer defines the contract for analyzers that operate on raw file +// content (path + bytes) without UAST parsing. Runs during the raw-file phase +// which walks ALL files in the directory tree (not just UAST-supported ones). +type RawFileAnalyzer interface { + FormattableAnalyzer + + AnalyzeFileContent(path string, content []byte) (Report, error) +} + // VisitorProvider enables single-pass traversal optimization. type VisitorProvider interface { CreateVisitor() AnalysisVisitor diff --git a/internal/analyzers/analyze/analyzer_test.go b/internal/analyzers/analyze/analyzer_test.go index cda0e0a..42ec944 100644 --- a/internal/analyzers/analyze/analyzer_test.go +++ b/internal/analyzers/analyze/analyzer_test.go @@ -374,8 +374,6 @@ func TestRunAnalyzers_Parallel(t *testing.T) { } } -// FRD: specs/frds/FRD-20260303-data-extraction-guard.md. - func TestReportFunctionListWithFallback_PrimaryKeyFound(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/analyze/budget_static_test.go b/internal/analyzers/analyze/budget_static_test.go index 52a6273..3872d3a 100644 --- a/internal/analyzers/analyze/budget_static_test.go +++ b/internal/analyzers/analyze/budget_static_test.go @@ -2,8 +2,6 @@ package analyze_test -// FRD: specs/frds/FRD-20260312-static-budget-integration-test.md. - import ( "context" "runtime/debug" @@ -46,7 +44,7 @@ func TestStaticAnalyzers_MemoryBudget(t *testing.T) { dir := setupHeavyBenchDir(t, budgetTestFileCount, budgetTestFunctionsPerFile) - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.NativeMemoryReleaseFn = func() {} // Skip real malloc_trim in test. // Apply budget-derived parameters. 
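Review note on the `analyzer.go` hunks above: the shared methods move into `FormattableAnalyzer`, and `StaticAnalyzer` and `RawFileAnalyzer` each embed it and add one phase-specific entry point. The following is a minimal sketch of that embedding pattern, not the project's code. `Report`, `Formattable`, `TreeAnalyzer`, `RawAnalyzer`, `nodeCounter`, `lineCounter`, and `names` are hypothetical stand-ins, and a plain `string` replaces `*node.Node`.

```go
package main

import (
	"fmt"
	"strings"
)

// Report is a hypothetical stand-in for analyze.Report.
type Report map[string]any

// Formattable is a trimmed-down, hypothetical analogue of FormattableAnalyzer:
// only phase-independent behaviour lives here.
type Formattable interface {
	Name() string
}

// TreeAnalyzer mirrors the shape of StaticAnalyzer: the shared contract plus
// an entry point for parsed input (a string stands in for *node.Node).
type TreeAnalyzer interface {
	Formattable
	Analyze(root string) (Report, error)
}

// RawAnalyzer mirrors the shape of RawFileAnalyzer: the shared contract plus
// an entry point that receives a path and raw bytes.
type RawAnalyzer interface {
	Formattable
	AnalyzeFileContent(path string, content []byte) (Report, error)
}

type nodeCounter struct{}

func (nodeCounter) Name() string { return "static/nodes" }

func (nodeCounter) Analyze(root string) (Report, error) {
	return Report{"nodes": len(strings.Fields(root))}, nil
}

type lineCounter struct{}

func (lineCounter) Name() string { return "raw/lines" }

func (lineCounter) AnalyzeFileContent(path string, content []byte) (Report, error) {
	return Report{"path": path, "lines": strings.Count(string(content), "\n")}, nil
}

// Compile-time checks that each type satisfies its phase-specific interface.
var (
	_ TreeAnalyzer = nodeCounter{}
	_ RawAnalyzer  = lineCounter{}
)

// names accepts either kind of analyzer through the shared interface,
// which is what lets formatting code stay agnostic of the analysis phase.
func names(analyzers ...Formattable) []string {
	out := make([]string, 0, len(analyzers))
	for _, a := range analyzers {
		out = append(out, a.Name())
	}

	return out
}

func main() {
	fmt.Println(names(nodeCounter{}, lineCounter{}))
}
```

Code that only formats or aggregates reports can depend on the narrow shared interface, while each phase keeps its own input type. Whether the real call sites are written this way is not visible in this part of the diff.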
diff --git a/internal/analyzers/analyze/commits_by_tick_test.go b/internal/analyzers/analyze/commits_by_tick_test.go index db8e207..9bbd9fe 100644 --- a/internal/analyzers/analyze/commits_by_tick_test.go +++ b/internal/analyzers/analyze/commits_by_tick_test.go @@ -9,8 +9,6 @@ import ( "github.com/Sumatoshi-tech/codefang/pkg/gitlib" ) -// FRD: specs/frds/FRD-20260302-build-commits-by-tick.md. - // testTickData is a minimal tick data type for testing BuildCommitsByTick. type testTickData struct { Commits map[string]int diff --git a/internal/analyzers/analyze/conversion.go b/internal/analyzers/analyze/conversion.go index ec54e11..860f951 100644 --- a/internal/analyzers/analyze/conversion.go +++ b/internal/analyzers/analyze/conversion.go @@ -24,15 +24,17 @@ var ErrInvalidUnifiedModel = errors.New("invalid unified model") // AnalyzerResult represents one analyzer report in canonical converted output. type AnalyzerResult struct { - ID string `json:"id" yaml:"id"` - Mode AnalyzerMode `json:"mode" yaml:"mode"` - Report Report `json:"report" yaml:"report"` + ID string `json:"id" yaml:"id"` + Mode AnalyzerMode `json:"mode" yaml:"mode"` + Schema AnalyzerSchema `json:"schema,omitempty" yaml:"schema,omitempty"` + Report Report `json:"report" yaml:"report"` } // UnifiedModel is the canonical intermediate model for run output conversion. type UnifiedModel struct { - Version string `json:"version" yaml:"version"` - Analyzers []AnalyzerResult `json:"analyzers" yaml:"analyzers"` + Version string `json:"version" yaml:"version"` + Metadata *AnalysisMetadata `json:"metadata,omitempty" yaml:"metadata,omitempty"` + Analyzers []AnalyzerResult `json:"analyzers" yaml:"analyzers"` } // Validate ensures canonical model constraints are satisfied. 
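Review note on the `UnifiedModel` change above: `Metadata` is a pointer tagged `omitempty`, so a model built without metadata should serialize exactly as before. The sketch below shows that `encoding/json` behaviour with hypothetical types (`metadata`, `model`, `encode`) rather than the project's own structs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata is a hypothetical, reduced analogue of AnalysisMetadata.
type metadata struct {
	RepoName string `json:"repo_name"`
}

// model is a hypothetical, reduced analogue of UnifiedModel.
// The pointer field with omitempty is dropped from the output when nil.
type model struct {
	Version  string    `json:"version"`
	Metadata *metadata `json:"metadata,omitempty"`
}

// encode marshals a model to compact JSON, returning the error text on failure.
func encode(m model) string {
	data, err := json.Marshal(m)
	if err != nil {
		return "error: " + err.Error()
	}

	return string(data)
}

func main() {
	// Nil pointer: the "metadata" key is omitted entirely.
	fmt.Println(encode(model{Version: "1"}))
	// prints {"version":"1"}

	// Non-nil pointer: the nested object is emitted.
	fmt.Println(encode(model{Version: "1", Metadata: &metadata{RepoName: "kubernetes"}}))
	// prints {"version":"1","metadata":{"repo_name":"kubernetes"}}
}
```

This is why existing consumers of the JSON output would not see a new key unless the caller populates metadata. Note that `omitempty` has this effect on nil pointers, but not on non-pointer struct values.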
@@ -219,6 +221,7 @@ func DecodeCombinedBinaryReports(input []byte, ids []string, modes []AnalyzerMod results[i] = AnalyzerResult{ ID: ids[i], Mode: modes[i], + Schema: SchemaForAnalyzer(ids[i]), Report: report, } } @@ -321,6 +324,8 @@ func WriteConvertedOutput(model UnifiedModel, outputFormat string, writer io.Wri return writeConvertedTimeSeries(model, FormatTimeSeries, writer) case FormatTimeSeriesNDJSON: return writeConvertedTimeSeries(model, FormatTimeSeriesNDJSON, writer) + case FormatNDJSON: + return writeConvertedNDJSON(model, writer) case FormatPlot: if plotRendererFn == nil { return fmt.Errorf("%w: plot renderer not registered", ErrUnsupportedFormat) @@ -332,6 +337,33 @@ func WriteConvertedOutput(model UnifiedModel, outputFormat string, writer io.Wri } } +// writeConvertedNDJSON writes one compact JSON line per analyzer result. +// If metadata is present, a metadata line is written first. +func writeConvertedNDJSON(model UnifiedModel, writer io.Writer) error { + encoder := json.NewEncoder(writer) + + if model.Metadata != nil { + metaLine := map[string]any{ + "version": model.Version, + "metadata": model.Metadata, + } + + err := encoder.Encode(metaLine) + if err != nil { + return fmt.Errorf("encode ndjson metadata: %w", err) + } + } + + for _, result := range model.Analyzers { + err := encoder.Encode(result) + if err != nil { + return fmt.Errorf("encode ndjson analyzer %s: %w", result.ID, err) + } + } + + return nil +} + // writeConvertedTimeSeries builds merged timeseries from a unified model's // history reports and writes the result to the writer. 
func writeConvertedTimeSeries(model UnifiedModel, format string, writer io.Writer) error { diff --git a/internal/analyzers/analyze/conversion_ndjson_test.go b/internal/analyzers/analyze/conversion_ndjson_test.go new file mode 100644 index 0000000..4f237ee --- /dev/null +++ b/internal/analyzers/analyze/conversion_ndjson_test.go @@ -0,0 +1,83 @@ +package analyze_test + +import ( + "bytes" + "encoding/json" + "strings" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +func TestWriteConvertedOutput_NDJSON_OneLinePerAnalyzer(t *testing.T) { + t.Parallel() + + model := analyze.UnifiedModel{ + Version: analyze.UnifiedModelVersion, + Analyzers: []analyze.AnalyzerResult{ + {ID: "static/complexity", Mode: analyze.ModeStatic, Report: analyze.Report{"total": 10}}, + {ID: "history/sentiment", Mode: analyze.ModeHistory, Report: analyze.Report{"score": 0.8}}, + }, + } + + var buf bytes.Buffer + + err := analyze.WriteConvertedOutput(model, analyze.FormatNDJSON, &buf) + require.NoError(t, err) + + lines := strings.Split(strings.TrimSpace(buf.String()), "\n") + require.Len(t, lines, 2) + + var line1 map[string]any + require.NoError(t, json.Unmarshal([]byte(lines[0]), &line1)) + assert.Equal(t, "static/complexity", line1["id"]) + assert.Equal(t, "static", line1["mode"]) + + var line2 map[string]any + require.NoError(t, json.Unmarshal([]byte(lines[1]), &line2)) + assert.Equal(t, "history/sentiment", line2["id"]) +} + +func TestWriteConvertedOutput_NDJSON_EmptyAnalyzers(t *testing.T) { + t.Parallel() + + model := analyze.UnifiedModel{ + Version: analyze.UnifiedModelVersion, + Analyzers: nil, + } + + var buf bytes.Buffer + + err := analyze.WriteConvertedOutput(model, analyze.FormatNDJSON, &buf) + require.NoError(t, err) + + assert.Empty(t, strings.TrimSpace(buf.String())) +} + +func TestWriteConvertedOutput_NDJSON_WithMetadata(t *testing.T) { + t.Parallel() + + model := 
analyze.UnifiedModel{ + Version: analyze.UnifiedModelVersion, + Metadata: analyze.NewAnalysisMetadata("/repo/test"), + Analyzers: []analyze.AnalyzerResult{ + {ID: "static/test", Mode: analyze.ModeStatic, Report: analyze.Report{}}, + }, + } + + var buf bytes.Buffer + + err := analyze.WriteConvertedOutput(model, analyze.FormatNDJSON, &buf) + require.NoError(t, err) + + lines := strings.Split(strings.TrimSpace(buf.String()), "\n") + require.Len(t, lines, 2) // Metadata line + 1 analyzer line. + + var metaLine map[string]any + require.NoError(t, json.Unmarshal([]byte(lines[0]), &metaLine)) + assert.Equal(t, analyze.UnifiedModelVersion, metaLine["version"]) + assert.NotNil(t, metaLine["metadata"]) +} diff --git a/internal/analyzers/analyze/export_test.go b/internal/analyzers/analyze/export_test.go new file mode 100644 index 0000000..6bf93b0 --- /dev/null +++ b/internal/analyzers/analyze/export_test.go @@ -0,0 +1,7 @@ +package analyze + +// LanguageGlobMatcher exposes matchesLanguageGlobs for black-box tests +// in the analyze_test package. +func LanguageGlobMatcher(name string, globs []string) bool { + return matchesLanguageGlobs(name, globs) +} diff --git a/internal/analyzers/analyze/metadata.go b/internal/analyzers/analyze/metadata.go new file mode 100644 index 0000000..25840da --- /dev/null +++ b/internal/analyzers/analyze/metadata.go @@ -0,0 +1,26 @@ +package analyze + +import ( + "path/filepath" + "time" + + "github.com/Sumatoshi-tech/codefang/pkg/version" +) + +// AnalysisMetadata holds provenance information for a codefang run. +type AnalysisMetadata struct { + RepoPath string `json:"repo_path" yaml:"repo_path"` + RepoName string `json:"repo_name" yaml:"repo_name"` + AnalyzedAt string `json:"analyzed_at" yaml:"analyzed_at"` + CodefangVersion string `json:"codefang_version" yaml:"codefang_version"` +} + +// NewAnalysisMetadata creates metadata for the given repository path. 
+func NewAnalysisMetadata(repoPath string) *AnalysisMetadata { + return &AnalysisMetadata{ + RepoPath: repoPath, + RepoName: filepath.Base(repoPath), + AnalyzedAt: time.Now().UTC().Format(time.RFC3339), + CodefangVersion: version.Version, + } +} diff --git a/internal/analyzers/analyze/metadata_test.go b/internal/analyzers/analyze/metadata_test.go new file mode 100644 index 0000000..a6638d0 --- /dev/null +++ b/internal/analyzers/analyze/metadata_test.go @@ -0,0 +1,75 @@ +package analyze_test + +import ( + "encoding/json" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +const testRepoPath = "/home/user/sources/kubernetes" + +func TestNewAnalysisMetadata_RepoName(t *testing.T) { + t.Parallel() + + meta := analyze.NewAnalysisMetadata(testRepoPath) + + assert.Equal(t, "kubernetes", meta.RepoName) +} + +func TestNewAnalysisMetadata_RepoPath(t *testing.T) { + t.Parallel() + + meta := analyze.NewAnalysisMetadata(testRepoPath) + + assert.Equal(t, testRepoPath, meta.RepoPath) +} + +func TestNewAnalysisMetadata_AnalyzedAt(t *testing.T) { + t.Parallel() + + before := time.Now() + meta := analyze.NewAnalysisMetadata(testRepoPath) + after := time.Now() + + parsed, err := time.Parse(time.RFC3339, meta.AnalyzedAt) + require.NoError(t, err) + assert.False(t, parsed.Before(before.Truncate(time.Second))) + assert.False(t, parsed.After(after.Add(time.Second))) +} + +func TestNewAnalysisMetadata_Version(t *testing.T) { + t.Parallel() + + meta := analyze.NewAnalysisMetadata(testRepoPath) + + assert.NotEmpty(t, meta.CodefangVersion) +} + +func TestUnifiedModel_MetadataInJSON(t *testing.T) { + t.Parallel() + + model := analyze.UnifiedModel{ + Version: analyze.UnifiedModelVersion, + Metadata: analyze.NewAnalysisMetadata(testRepoPath), + Analyzers: []analyze.AnalyzerResult{ + {ID: "static/test", Mode: analyze.ModeStatic, Report: analyze.Report{}}, + }, + } + + data, 
err := json.Marshal(model) + require.NoError(t, err) + + var parsed map[string]any + require.NoError(t, json.Unmarshal(data, &parsed)) + + meta, ok := parsed["metadata"].(map[string]any) + require.True(t, ok, "metadata section must exist in JSON") + assert.Equal(t, "kubernetes", meta["repo_name"]) + assert.NotEmpty(t, meta["analyzed_at"]) + assert.NotEmpty(t, meta["codefang_version"]) +} diff --git a/internal/analyzers/analyze/metrics_safe_test.go b/internal/analyzers/analyze/metrics_safe_test.go index 62a3e3e..60b758b 100644 --- a/internal/analyzers/analyze/metrics_safe_test.go +++ b/internal/analyzers/analyze/metrics_safe_test.go @@ -8,8 +8,6 @@ import ( "github.com/stretchr/testify/require" ) -// FRD: specs/frds/FRD-20260302-compute-metrics-safe.md. - // testMetrics is a minimal metrics type for testing SafeMetricComputer. type testMetrics struct { Value int diff --git a/internal/analyzers/analyze/perfile.go b/internal/analyzers/analyze/perfile.go new file mode 100644 index 0000000..35e4ef6 --- /dev/null +++ b/internal/analyzers/analyze/perfile.go @@ -0,0 +1,79 @@ +package analyze + +import "path/filepath" + +// PerFileModeEnabled is implemented by aggregators that support per-file report retention. +// StaticService uses this to enable per-file mode and extract results after analysis. +type PerFileModeEnabled interface { + SetPerFileMode(enabled bool) + PerFileResults() map[string]Report +} + +// PerFileResults returns per-file reports collected during the last AnalyzeFolder call. +// Returns nil when PerFile is false or no files were analyzed. +// Keyed by analyzer name → file path → per-file report. +func (svc *StaticService) PerFileResults() map[string]map[string]Report { + return svc.perFileResults +} + +// extractPerFileResults collects per-file reports from all aggregators that support it. 
+func extractPerFileResults(aggregators map[string]ResultAggregator) map[string]map[string]Report { + result := make(map[string]map[string]Report, len(aggregators)) + + for name, agg := range aggregators { + pfm, ok := agg.(PerFileModeEnabled) + if !ok { + continue + } + + fileReports := pfm.PerFileResults() + if len(fileReports) > 0 { + result[name] = fileReports + } + } + + if len(result) == 0 { + return nil + } + + return result +} + +// enrichWithPerFileData takes the base JSON report and injects per-file data into each section. +// It uses the PerFileEnricher interface to avoid import cycles with the renderer package. +// Returns the enriched report (same reference if type assertion succeeds, original otherwise). +func (svc *StaticService) enrichWithPerFileData(report any, _ []ReportSection) any { + enricher, ok := report.(PerFileEnricher) + if !ok { + return report + } + + enricher.EnrichWithPerFileData(svc.PerFileResults(), svc.analysisRootPath, svc.allFormattable()) + + return report +} + +// PerFileEnricher is implemented by JSON report types that support per-file data injection. +// The renderer.JSONReport implements this to avoid import cycles. +type PerFileEnricher interface { + EnrichWithPerFileData( + perFileResults map[string]map[string]Report, + rootPath string, + analyzers []FormattableAnalyzer, + ) +} + +// MakeRelativePath converts an absolute file path to be relative to rootPath. +// Returns the original path if it cannot be made relative. 
+func MakeRelativePath(filePath, rootPath string) string { + if rootPath == "" { + return filePath + } + + rel, err := filepath.Rel(rootPath, filePath) + if err != nil { + return filePath + } + + return rel +} diff --git a/internal/analyzers/analyze/record_reader_test.go b/internal/analyzers/analyze/record_reader_test.go index 390983c..fb88c8e 100644 --- a/internal/analyzers/analyze/record_reader_test.go +++ b/internal/analyzers/analyze/record_reader_test.go @@ -1,7 +1,5 @@ package analyze -// FRD: specs/frds/FRD-20260302-record-reader.md. - import ( "encoding/gob" "testing" diff --git a/internal/analyzers/analyze/record_writer_test.go b/internal/analyzers/analyze/record_writer_test.go index 49a2d69..f960ad5 100644 --- a/internal/analyzers/analyze/record_writer_test.go +++ b/internal/analyzers/analyze/record_writer_test.go @@ -1,7 +1,5 @@ package analyze -// FRD: specs/frds/FRD-20260303-write-slice-kind.md. - import ( "encoding/gob" "errors" diff --git a/internal/analyzers/analyze/registry.go b/internal/analyzers/analyze/registry.go index b3f4440..192adbc 100644 --- a/internal/analyzers/analyze/registry.go +++ b/internal/analyzers/analyze/registry.go @@ -44,15 +44,21 @@ var ErrInvalidAnalyzerMode = errors.New("invalid analyzer mode") var ErrInvalidAnalyzerGlob = errors.New("invalid analyzer glob") // NewRegistry creates a registry from analyzer descriptors. 
-func NewRegistry(static []StaticAnalyzer, history []HistoryAnalyzer) (*Registry, error) { - ordered := make([]Descriptor, 0, len(static)+len(history)) - index := make(map[string]Descriptor, len(static)+len(history)) +func NewRegistry(static []StaticAnalyzer, raw []RawFileAnalyzer, history []HistoryAnalyzer) (*Registry, error) { + totalCap := len(static) + len(raw) + len(history) + ordered := make([]Descriptor, 0, totalCap) + index := make(map[string]Descriptor, totalCap) err := appendDescriptors(ModeStatic, static, index, &ordered) if err != nil { return nil, err } + err = appendDescriptors(ModeStatic, raw, index, &ordered) + if err != nil { + return nil, err + } + err = appendDescriptors(ModeHistory, history, index, &ordered) if err != nil { return nil, err diff --git a/internal/analyzers/analyze/registry_test.go b/internal/analyzers/analyze/registry_test.go index 1753ac2..1ad88c0 100644 --- a/internal/analyzers/analyze/registry_test.go +++ b/internal/analyzers/analyze/registry_test.go @@ -88,7 +88,7 @@ func (s *stubHistoryAnalyzer) ReportFromTICKs(_ context.Context, _ []analyze.TIC func TestRegistry_AllStableOrder(t *testing.T) { t.Parallel() - registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), defaultHistoryForRegistryTest()) + registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), nil, defaultHistoryForRegistryTest()) if err != nil { t.Fatalf("unexpected registry creation error: %v", err) } @@ -110,7 +110,7 @@ func TestRegistry_AllStableOrder(t *testing.T) { func TestRegistry_IDsByMode(t *testing.T) { t.Parallel() - registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), defaultHistoryForRegistryTest()) + registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), nil, defaultHistoryForRegistryTest()) if err != nil { t.Fatalf("unexpected registry creation error: %v", err) } @@ -130,7 +130,7 @@ func TestRegistry_IDsByMode(t *testing.T) { func TestRegistry_Split(t *testing.T) { t.Parallel() - registry, err := 
analyze.NewRegistry(defaultStaticForRegistryTest(), defaultHistoryForRegistryTest()) + registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), nil, defaultHistoryForRegistryTest()) if err != nil { t.Fatalf("unexpected registry creation error: %v", err) } @@ -152,7 +152,7 @@ func TestRegistry_Split(t *testing.T) { func TestRegistry_SplitUnknown(t *testing.T) { t.Parallel() - registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), defaultHistoryForRegistryTest()) + registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), nil, defaultHistoryForRegistryTest()) if err != nil { t.Fatalf("unexpected registry creation error: %v", err) } @@ -164,7 +164,7 @@ func TestRegistry_SplitUnknown(t *testing.T) { } // complexityID is a stable fixture for the first registered static analyzer. -// Used by ExpandPatterns tests — FRD: specs/frds/FRD-20260306-append-unique-ids-removal.md. +// Used by ExpandPatterns tests. const complexityID = "static/complexity" func TestRegistry_ExpandPatterns_ExactMatch(t *testing.T) { @@ -303,7 +303,7 @@ func TestRegistry_SelectedIDs_WithPatterns(t *testing.T) { func newTestRegistry(t *testing.T) *analyze.Registry { t.Helper() - registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), defaultHistoryForRegistryTest()) + registry, err := analyze.NewRegistry(defaultStaticForRegistryTest(), nil, defaultHistoryForRegistryTest()) if err != nil { t.Fatalf("failed to create registry: %v", err) } diff --git a/internal/analyzers/analyze/schema_registry.go b/internal/analyzers/analyze/schema_registry.go new file mode 100644 index 0000000..ac97789 --- /dev/null +++ b/internal/analyzers/analyze/schema_registry.go @@ -0,0 +1,126 @@ +package analyze + +// FieldMeta describes a single field in an analyzer's output schema. 
+type FieldMeta struct { + Type string `json:"type" yaml:"type"` + Grain string `json:"grain,omitempty" yaml:"grain,omitempty"` + Description string `json:"description,omitempty" yaml:"description,omitempty"` +} + +// AnalyzerSchema maps output field names to their metadata. +type AnalyzerSchema map[string]FieldMeta + +// SchemaForAnalyzer returns the output schema for the given analyzer ID, +// or nil if the analyzer is not registered. +func SchemaForAnalyzer(analyzerID string) AnalyzerSchema { + schema, ok := analyzerSchemas[analyzerID] + if !ok { + return nil + } + + return schema +} + +// analyzerSchemas is the static registry of output schemas for all analyzers. +var analyzerSchemas = map[string]AnalyzerSchema{ + "static/complexity": { + "function_complexity": {Type: "list", Grain: "function", Description: "Per-function cyclomatic and cognitive complexity"}, + "distribution": {Type: "aggregate", Description: "Complexity distribution (simple/moderate/complex)"}, + "high_risk_functions": {Type: "risk", Grain: "function", Description: "Functions exceeding complexity thresholds"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "static/halstead": { + "function_halstead": {Type: "list", Grain: "function", Description: "Per-function Halstead volume, effort, and bugs"}, + "distribution": {Type: "aggregate", Description: "Effort distribution (low/medium/high/very_high)"}, + "high_effort_functions": {Type: "risk", Grain: "function", Description: "Functions with high Halstead effort"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "static/cohesion": { + "function_cohesion": {Type: "list", Grain: "function", Description: "Per-function LCOM cohesion score"}, + "distribution": {Type: "aggregate", Description: "Cohesion distribution"}, + "low_cohesion_functions": {Type: "risk", Grain: "function", Description: "Functions with poor cohesion"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + 
}, + "static/comments": { + "comment_quality": {Type: "list", Grain: "comment", Description: "Per-comment quality assessment"}, + "function_documentation": {Type: "list", Grain: "function", Description: "Per-function documentation status"}, + "undocumented_functions": {Type: "risk", Grain: "function", Description: "Functions lacking documentation"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "static/clones": { + "clone_pairs": {Type: "list", Grain: "pair", Description: "Detected clone pairs with similarity"}, + "clone_type_distribution": {Type: "aggregate", Description: "Clone type breakdown (Type-1/2/3)"}, + "total_functions": {Type: "scalar", Description: "Total functions analyzed"}, + "total_clone_pairs": {Type: "scalar", Description: "Total clone pairs (uncapped)"}, + "clone_ratio": {Type: "scalar", Description: "Fraction of functions involved in duplication"}, + }, + "static/imports": { + "import_list": {Type: "list", Grain: "import", Description: "All import statements"}, + "dependencies": {Type: "list", Grain: "dependency", Description: "External dependencies with risk"}, + "categories": {Type: "aggregate", Description: "Import category breakdown"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "static/composition": { + "breakdown": {Type: "aggregate", Description: "File count per category"}, + "percentages": {Type: "aggregate", Description: "Percentage per category"}, + "total_files": {Type: "scalar", Description: "Total files analyzed"}, + }, + "history/sentiment": { + "time_series": {Type: "time_series", Grain: "tick", Description: "Per-tick sentiment scores"}, + "trend": {Type: "aggregate", Description: "Sentiment trend direction"}, + "low_sentiment_periods": {Type: "risk", Grain: "tick", Description: "Ticks with negative sentiment"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/anomaly": { + "anomalies": {Type: "list", Grain: "tick", Description: 
"Detected anomalous ticks"}, + "time_series": {Type: "time_series", Grain: "tick", Description: "Per-tick anomaly metrics and z-scores"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/devs": { + "developers": {Type: "list", Grain: "developer", Description: "Per-developer contribution statistics"}, + "languages": {Type: "list", Grain: "language", Description: "Per-language contribution breakdown"}, + "busfactor": {Type: "list", Grain: "language", Description: "Bus factor per language"}, + "activity": {Type: "time_series", Grain: "tick", Description: "Per-tick commit activity by developer"}, + "churn": {Type: "time_series", Grain: "tick", Description: "Per-tick lines added/removed"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/file-history": { + "file_churn": {Type: "list", Grain: "file", Description: "Per-file change frequency and contributors"}, + "file_contributors": {Type: "list", Grain: "file", Description: "Per-file contributor breakdown"}, + "hotspots": {Type: "risk", Grain: "file", Description: "High-churn files"}, + "composition": {Type: "aggregate", Description: "File type composition"}, + "composition_ts": {Type: "time_series", Grain: "tick", Description: "File composition over time"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/couples": { + "file_coupling": {Type: "list", Grain: "pair", Description: "Co-changed file pairs"}, + "developer_coupling": {Type: "list", Grain: "pair", Description: "Developer collaboration pairs"}, + "file_ownership": {Type: "list", Grain: "file", Description: "Per-file ownership"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/shotness": { + "node_hotness": {Type: "list", Grain: "node", Description: "AST node change frequency"}, + "node_coupling": {Type: "list", Grain: "pair", Description: "Co-changed AST node pairs"}, + "hotspot_nodes": {Type: "risk", Grain: 
"node", Description: "Frequently changed nodes"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/burndown": { + "global_survival": {Type: "time_series", Grain: "sample", Description: "Global code survival curve"}, + "file_survival": {Type: "list", Grain: "file", Description: "Per-file survival data"}, + "developer_survival": {Type: "list", Grain: "developer", Description: "Per-developer survival data"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/quality": { + "time_series": {Type: "time_series", Grain: "tick", Description: "Per-tick code quality metrics"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/imports": { + "import_list": {Type: "list", Grain: "import", Description: "Import statements (requires UAST mode)"}, + "dependencies": {Type: "list", Grain: "dependency", Description: "Dependencies (requires UAST mode)"}, + "categories": {Type: "aggregate", Description: "Import category breakdown"}, + "aggregate": {Type: "aggregate", Description: "Summary statistics"}, + }, + "history/typos": { + "typos": {Type: "list", Grain: "identifier", Description: "Detected identifier typos (requires UAST mode)"}, + }, +} diff --git a/internal/analyzers/analyze/schema_registry_test.go b/internal/analyzers/analyze/schema_registry_test.go new file mode 100644 index 0000000..00d344c --- /dev/null +++ b/internal/analyzers/analyze/schema_registry_test.go @@ -0,0 +1,66 @@ +package analyze_test + +import ( + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +const ( + testAnalyzerComplexity = "static/complexity" + testAnalyzerSentiment = "history/sentiment" + testFieldFunctions = "function_complexity" + testFieldTimeSeries = "time_series" +) + +func TestSchemaForAnalyzer_Known(t *testing.T) { + t.Parallel() + + schema := 
analyze.SchemaForAnalyzer(testAnalyzerComplexity) + + require.NotNil(t, schema) + assert.Contains(t, schema, testFieldFunctions) + assert.Equal(t, "list", schema[testFieldFunctions].Type) + assert.Equal(t, "function", schema[testFieldFunctions].Grain) +} + +func TestSchemaForAnalyzer_HistoryAnalyzer(t *testing.T) { + t.Parallel() + + schema := analyze.SchemaForAnalyzer(testAnalyzerSentiment) + + require.NotNil(t, schema) + assert.Contains(t, schema, testFieldTimeSeries) + assert.Equal(t, "time_series", schema[testFieldTimeSeries].Type) + assert.Equal(t, "tick", schema[testFieldTimeSeries].Grain) +} + +func TestSchemaForAnalyzer_Unknown(t *testing.T) { + t.Parallel() + + schema := analyze.SchemaForAnalyzer("unknown/analyzer") + + assert.Nil(t, schema) +} + +func TestSchemaForAnalyzer_AllRegistered(t *testing.T) { + t.Parallel() + + knownIDs := []string{ + "static/complexity", "static/halstead", "static/cohesion", + "static/comments", "static/clones", "static/imports", + "static/composition", + "history/sentiment", "history/anomaly", "history/devs", + "history/file-history", "history/couples", "history/shotness", + "history/burndown", "history/quality", "history/imports", + "history/typos", + } + + for _, id := range knownIDs { + schema := analyze.SchemaForAnalyzer(id) + assert.NotNilf(t, schema, "schema missing for %s", id) + } +} diff --git a/internal/analyzers/analyze/static.go b/internal/analyzers/analyze/static.go index 80c0d6b..f7ecd17 100644 --- a/internal/analyzers/analyze/static.go +++ b/internal/analyzers/analyze/static.go @@ -15,9 +15,12 @@ import ( "sync/atomic" "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/plotpage" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" + "github.com/Sumatoshi-tech/codefang/internal/storage" "github.com/Sumatoshi-tech/codefang/pkg/gitlib" "github.com/Sumatoshi-tech/codefang/pkg/meminfo" "github.com/Sumatoshi-tech/codefang/pkg/pipeline" + 
"github.com/Sumatoshi-tech/codefang/pkg/textutil" "github.com/Sumatoshi-tech/codefang/pkg/uast" "github.com/Sumatoshi-tech/codefang/pkg/uast/pkg/node" ) @@ -69,7 +72,8 @@ type StaticRenderer interface { // StaticService provides a high-level interface for running static analysis. type StaticService struct { - Analyzers []StaticAnalyzer + UASTAnalyzers []StaticAnalyzer + RawFileAnalyzers []RawFileAnalyzer // MaxWorkers limits the number of concurrent file analysis goroutines. // Zero means use min(runtime.NumCPU(), DefaultStaticMaxWorkers). @@ -102,11 +106,51 @@ type StaticService struct { // Renderer provides section-based output rendering. // Must be set before calling FormatJSON, FormatText, FormatCompact, or RunAndFormat. Renderer StaticRenderer + + // PerFile enables per-file report retention in aggregators. + // When true, aggregators store per-file snapshots accessible via PerFileResults. + PerFile bool + + // LanguageGlobs restricts the directory walk to files whose basename + // matches any of the given fnmatch-style globs (e.g. "*.go", + // "Dockerfile"). Built from --languages via langpath.Globs. Empty or + // nil disables the filter — default behavior. + LanguageGlobs []string + + // PathPolicy carries vendor / generated / extra-prefix exclusion + // rules shared across phases. The zero value excludes + // enry.IsVendor and pathfilter-detected generated files by + // default. + PathPolicy pathpolicy.Options + + // perFileResults is populated after AnalyzeFolder when PerFile is true. + // Keyed by analyzer name → file path → per-file report. + perFileResults map[string]map[string]Report + + // analysisRootPath is the root path used in the last AnalyzeFolder call. + // Used by FormatJSON to make per-file paths relative. + analysisRootPath string } // NewStaticService creates a StaticService with the given analyzers. 
-func NewStaticService(analyzers []StaticAnalyzer) *StaticService { - return &StaticService{Analyzers: analyzers} +func NewStaticService(uastAnalyzers []StaticAnalyzer, rawAnalyzers []RawFileAnalyzer) *StaticService { + return &StaticService{UASTAnalyzers: uastAnalyzers, RawFileAnalyzers: rawAnalyzers} +} + +// allFormattable returns a merged, deterministically-ordered slice of all analyzers +// that satisfy FormattableAnalyzer (UAST first, then raw-file). +func (svc *StaticService) allFormattable() []FormattableAnalyzer { + result := make([]FormattableAnalyzer, 0, len(svc.UASTAnalyzers)+len(svc.RawFileAnalyzers)) + + for _, a := range svc.UASTAnalyzers { + result = append(result, a) + } + + for _, a := range svc.RawFileAnalyzers { + result = append(result, a) + } + + return result } // ResolveMaxWorkers returns the effective worker count for parallel file analysis. @@ -177,44 +221,158 @@ func (svc *StaticService) emitProgress( // Workers block naturally when the buffer is full, providing backpressure. const streamFilesBufSize = 100 +// analysisPipelineState threads shared state through pipeline phases. +type analysisPipelineState struct { + rootPath string + analyzersToRun []string + aggregators map[string]ResultAggregator +} + // AnalyzeFolder runs static analyzers for supported files in a folder tree. -// File discovery streams paths to workers via a channel, providing natural backpressure. +// Executes raw-file and UAST phases sequentially via pipeline.RunPhases. 
func (svc *StaticService) AnalyzeFolder(ctx context.Context, rootPath string, analyzerList []string) (map[string]Report, error) { - analyzersToRun := svc.resolveAnalyzerList(analyzerList) - aggregators := svc.initAggregators(analyzersToRun) + svc.analysisRootPath = rootPath ctx, cancel := context.WithCancel(ctx) defer cancel() + state := analysisPipelineState{ + rootPath: rootPath, + analyzersToRun: svc.resolveAnalyzerList(analyzerList), + } + state.aggregators = svc.initAggregators(state.analyzersToRun) + + state, err := pipeline.RunPhases(ctx, state, + pipeline.PhaseFunc[analysisPipelineState](svc.rawFilePhase), + pipeline.PhaseFunc[analysisPipelineState](svc.uastPhase), + ) + if err != nil { + return nil, err + } + + results := buildFinalResults(state.aggregators) + + if svc.PerFile { + svc.perFileResults = extractPerFileResults(state.aggregators) + } + + return results, nil +} + +// rawFilePhase walks all files (not only UAST-supported ones) that pass the LanguageGlobs and PathPolicy filters and runs the requested RawFileAnalyzers on file headers. +func (svc *StaticService) rawFilePhase(ctx context.Context, state analysisPipelineState) (analysisPipelineState, error) { + if len(svc.RawFileAnalyzers) == 0 { + return state, nil + } + + // Filter to only requested raw-file analyzers. 
+ rawNames := svc.requestedRawFileAnalyzers(state.analyzersToRun) + if len(rawNames) == 0 { + return state, nil + } + + var mu sync.Mutex + + walkErr := filepath.WalkDir(state.rootPath, func(path string, entry os.DirEntry, err error) error { + if ctx.Err() != nil { + return ctx.Err() + } + + skip, skipErr := skipAllFilesEntry(entry, err) + if skip || skipErr != nil { + return skipErr + } + + if !matchesLanguageGlobs(path, svc.LanguageGlobs) { + return nil + } + + if pathpolicy.Exclude(path, nil, svc.PathPolicy) { + return nil + } + + classifyFile(path, rawNames, state.aggregators, &mu, state.rootPath) + + return nil + }) + if walkErr != nil { + return state, fmt.Errorf("raw-file phase walk %s: %w", state.rootPath, walkErr) + } + + return state, nil +} + +// requestedRawFileAnalyzers returns RawFileAnalyzers whose names appear in the requested list. +func (svc *StaticService) requestedRawFileAnalyzers(requested []string) []RawFileAnalyzer { + nameSet := make(map[string]struct{}, len(requested)) + for _, n := range requested { + nameSet[n] = struct{}{} + } + + var result []RawFileAnalyzer + + for _, a := range svc.RawFileAnalyzers { + if _, ok := nameSet[a.Name()]; ok { + result = append(result, a) + } + } + + return result +} + +// uastPhase streams UAST-supported files and runs StaticAnalyzers in parallel. 
+func (svc *StaticService) uastPhase(ctx context.Context, state analysisPipelineState) (analysisPipelineState, error) { + uastNames := svc.requestedUASTAnalyzers(state.analyzersToRun) + if len(uastNames) == 0 { + return state, nil + } + var fileCounter atomic.Int64 fileCh := make(chan string, streamFilesBufSize) walkErrCh := make(chan error, 1) go func() { - walkErrCh <- svc.streamFiles(ctx, rootPath, fileCh) + walkErrCh <- svc.streamFiles(ctx, state.rootPath, fileCh) }() - poolErr := svc.analyzeFilesParallel(ctx, fileCh, analyzersToRun, aggregators, &fileCounter) + poolErr := svc.analyzeFilesParallel(ctx, fileCh, uastNames, state.aggregators, &fileCounter, state.rootPath) walkErr := <-walkErrCh if poolErr != nil { - return nil, poolErr + return state, poolErr } if walkErr != nil { - return nil, walkErr + return state, walkErr } - results := buildFinalResults(aggregators) + svc.emitProgress(fileCounter.Load(), state.aggregators, ProgressPhaseComplete) - svc.emitProgress(fileCounter.Load(), aggregators, ProgressPhaseComplete) + return state, nil +} - return results, nil +// requestedUASTAnalyzers returns names of UAST analyzers that appear in the requested list. +func (svc *StaticService) requestedUASTAnalyzers(requested []string) []string { + nameSet := make(map[string]struct{}, len(svc.UASTAnalyzers)) + for _, a := range svc.UASTAnalyzers { + nameSet[a.Name()] = struct{}{} + } + + result := make([]string, 0, len(requested)) + + for _, name := range requested { + if _, ok := nameSet[name]; ok { + result = append(result, name) + } + } + + return result } -// streamFiles walks the directory tree and sends supported file paths on fileCh. +// streamFiles walks the directory tree and sends UAST-supported file paths on fileCh, +// which the parallel analysis workers in uastPhase consume. // The channel is closed when the walk completes. Returns walk errors. 
func (svc *StaticService) streamFiles(ctx context.Context, rootPath string, fileCh chan<- string) error { defer close(fileCh) @@ -234,6 +392,14 @@ func (svc *StaticService) streamFiles(ctx context.Context, rootPath string, file return skipErr } + if !matchesLanguageGlobs(path, svc.LanguageGlobs) { + return nil + } + + if pathpolicy.Exclude(path, nil, svc.PathPolicy) { + return nil + } + select { case fileCh <- path: case <-ctx.Done(): @@ -266,6 +432,7 @@ func (svc *StaticService) analyzeFilesParallel( analyzersToRun []string, aggregators map[string]ResultAggregator, fileCounter *atomic.Int64, + rootPath string, ) error { var mu sync.Mutex @@ -294,7 +461,8 @@ func (svc *StaticService) analyzeFilesParallel( return analyzeErr } - StampSourceFile(reportMap, filePath) + StampSourceFile(reportMap, filePath, rootPath) + StampLanguage(reportMap, parser.GetLanguage(filePath)) mu.Lock() aggregateFolderAnalysis(reportMap, aggregators) @@ -334,18 +502,51 @@ func acquireParser(ch chan *uast.Parser) (*uast.Parser, error) { } // StampSourceFile adds "_source_file" metadata to every collection item in each report. -// This allows downstream consumers (e.g., plot generators) to group results by file/package. +// Also sets SourceFileKey at the report top level for analyzers without collections (e.g., imports). +// This allows downstream consumers (e.g., plot generators, per-file retention) to group results by file. // Handles both legacy []map[string]any collections and TypedCollection wrappers. -func StampSourceFile(reports map[string]Report, filePath string) { +// When rootPath is non-empty, the stamped path is made relative to it. 
+func StampSourceFile(reports map[string]Report, filePath, rootPath string) { + stamped := MakeRelativePath(filePath, rootPath) + dir := filepath.Dir(stamped) + for _, report := range reports { + report[SourceFileKey] = stamped + report[DirectoryKey] = dir + for key, val := range report { switch v := val.(type) { case TypedCollection: - v.SourceFile = filePath + v.SourceFile = stamped + v.Directory = dir report[key] = v case []map[string]any: for _, item := range v { - item[SourceFileKey] = filePath + item[SourceFileKey] = stamped + item[DirectoryKey] = dir + } + } + } + } +} + +// StampLanguage adds "_language" metadata to each report's top level and to every collection item in it; an empty language is a no-op. +func StampLanguage(reports map[string]Report, language string) { + if language == "" { + return + } + + for _, report := range reports { + report[LanguageKey] = language + + for key, val := range report { + switch v := val.(type) { + case TypedCollection: + v.Language = language + report[key] = v + case []map[string]any: + for _, item := range v { + item[LanguageKey] = language } } } @@ -393,22 +594,101 @@ func (svc *StaticService) analyzeFile( return nil, fmt.Errorf("read %s: %w", path, err) } - uastNode, err := parser.Parse(ctx, path, content) - if err != nil { - return nil, fmt.Errorf("parse %s: %w", path, err) + uastNode, parseErr := parser.Parse(ctx, path, content) + if parseErr != nil { + return nil, fmt.Errorf("parse %s: %w", path, parseErr) } - results, err := svc.runAnalyzers(ctx, uastNode, analyzersToRun) + results, runErr := svc.runAnalyzers(ctx, uastNode, analyzersToRun) node.ReleaseTree(uastNode) - if err != nil { - return nil, fmt.Errorf("run analyzers for %s: %w", path, err) + if runErr != nil { + return nil, fmt.Errorf("run analyzers for %s: %w", path, runErr) } return results, nil } +// contentHeaderSize is the max bytes read per file in the raw-file phase. +// Enry needs only a prefix for binary/language detection. 
+const contentHeaderSize = 8192 + +// skipAllFilesEntry decides if a walk entry should be skipped in the raw-file phase. +func skipAllFilesEntry(entry os.DirEntry, walkErr error) (bool, error) { + if walkErr != nil { + if errors.Is(walkErr, fs.ErrPermission) || errors.Is(walkErr, fs.ErrNotExist) { + if entry != nil && entry.IsDir() { + return true, filepath.SkipDir + } + + return true, nil + } + + return false, walkErr + } + + if entry == nil { + return true, nil + } + + if entry.IsDir() { + if entry.Name() == ".git" { + return true, filepath.SkipDir + } + + return true, nil + } + + return false, nil +} + +// classifyFile runs raw-file analyzers on a single file and aggregates results. +func classifyFile( + path string, + analyzers []RawFileAnalyzer, + aggregators map[string]ResultAggregator, + mu *sync.Mutex, + rootPath string, +) { + header := readFileHeader(path, contentHeaderSize) + + for _, a := range analyzers { + report, analyzeErr := a.AnalyzeFileContent(path, header) + if analyzeErr != nil { + continue + } + + report[SourceFileKey] = MakeRelativePath(path, rootPath) + + mu.Lock() + + if agg, ok := aggregators[a.Name()]; ok { + agg.Aggregate(map[string]Report{a.Name(): report}) + } + + mu.Unlock() + } +} + +// readFileHeader reads up to limit bytes from a file. Returns nil on error. 
+func readFileHeader(path string, limit int) []byte { + f, err := os.Open(path) + if err != nil { + return nil + } + defer f.Close() + + buf := make([]byte, limit) + + n, readErr := f.Read(buf) + if readErr != nil && !errors.Is(readErr, io.EOF) { + return nil + } + + return buf[:n] +} + func aggregateFolderAnalysis(results map[string]Report, aggregators map[string]ResultAggregator) { for analyzerName, aggregator := range aggregators { report, found := results[analyzerName] @@ -425,9 +705,10 @@ func (svc *StaticService) resolveAnalyzerList(analyzerList []string) []string { return analyzerList } - names := make([]string, 0, len(svc.Analyzers)) + all := svc.allFormattable() + names := make([]string, 0, len(all)) - for _, analyzer := range svc.Analyzers { + for _, analyzer := range all { names = append(names, analyzer.Name()) } @@ -436,10 +717,11 @@ func (svc *StaticService) resolveAnalyzerList(analyzerList []string) []string { func (svc *StaticService) initAggregators(analyzersToRun []string) map[string]ResultAggregator { aggregators := make(map[string]ResultAggregator) + byName := svc.analyzersByName() for _, analyzerName := range analyzersToRun { - analyzer := svc.FindAnalyzer(analyzerName) - if analyzer == nil { + analyzer, found := byName[analyzerName] + if !found { continue } @@ -453,6 +735,10 @@ func (svc *StaticService) initAggregators(analyzersToRun []string) map[string]Re setter.SetSpillThreshold(svc.SpillThreshold) } + if pf, ok := agg.(PerFileModeEnabled); svc.PerFile && ok { + pf.SetPerFileMode(true) + } + aggregators[analyzerName] = agg } @@ -473,7 +759,7 @@ func buildFinalResults(aggregators map[string]ResultAggregator) map[string]Repor func (svc *StaticService) BuildSections(results map[string]Report) []ReportSection { sections := make([]ReportSection, 0, len(results)) - for _, currentAnalyzer := range svc.Analyzers { + for _, currentAnalyzer := range svc.allFormattable() { report, found := results[currentAnalyzer.Name()] if !found { continue @@ -488,30 
+774,34 @@ func (svc *StaticService) BuildSections(results map[string]Report) []ReportSecti } func (svc *StaticService) runAnalyzers(ctx context.Context, uastNode *node.Node, analyzerList []string) (map[string]Report, error) { - factory := NewFactory(svc.Analyzers) + factory := NewFactory(svc.UASTAnalyzers) return factory.RunAnalyzers(ctx, uastNode, analyzerList) } -// FindAnalyzer finds an analyzer by name. -func (svc *StaticService) FindAnalyzer(name string) StaticAnalyzer { - for _, analyzer := range svc.Analyzers { - if analyzer.Name() == name { - return analyzer - } +// analyzersByName builds a name-to-analyzer lookup map from all formattable analyzers. +func (svc *StaticService) analyzersByName() map[string]FormattableAnalyzer { + all := svc.allFormattable() + result := make(map[string]FormattableAnalyzer, len(all)) + + for _, a := range all { + result[a.Name()] = a } - return nil + return result } // AnalyzerNamesByID resolves analyzer descriptor IDs to internal names. func (svc *StaticService) AnalyzerNamesByID(ids []string) ([]string, error) { - idToName := make(map[string]string, len(svc.Analyzers)) - for _, analyzer := range svc.Analyzers { + all := svc.allFormattable() + idToName := make(map[string]string, len(all)) + + for _, analyzer := range all { idToName[analyzer.Descriptor().ID] = analyzer.Name() } names := make([]string, 0, len(ids)) + for _, id := range ids { name, ok := idToName[id] if !ok { @@ -533,6 +823,10 @@ func (svc *StaticService) FormatJSON(results map[string]Report, writer io.Writer sections := svc.BuildSections(results) report := svc.Renderer.SectionsToJSON(sections) + if svc.PerFile { + report = svc.enrichWithPerFileData(report, sections) + } + encoder := json.NewEncoder(writer) encoder.SetIndent("", " ") @@ -574,6 +868,7 @@ func (svc *StaticService) FormatPerAnalyzer( writer io.Writer, ) error { isFirst := true + byName := svc.analyzersByName() for _, analyzerName := range analyzerNames { report, ok := results[analyzerName] @@ 
-581,8 +876,8 @@ func (svc *StaticService) FormatPerAnalyzer( continue } - analyzer := svc.FindAnalyzer(analyzerName) - if analyzer == nil { + analyzer, found := byName[analyzerName] + if !found { return fmt.Errorf("%w: %s", ErrUnknownAnalyzerID, analyzerName) } @@ -644,6 +939,7 @@ func (svc *StaticService) RenderPlotPages( } pages := make([]plotpage.PageMeta, 0, len(analyzerNames)) + byName := svc.analyzersByName() for _, name := range analyzerNames { report, ok := results[name] @@ -651,8 +947,8 @@ func (svc *StaticService) RenderPlotPages( continue } - analyzer := svc.FindAnalyzer(name) - if analyzer == nil { + analyzer, found := byName[name] + if !found { continue } @@ -684,9 +980,15 @@ func (svc *StaticService) RenderPlotPages( return pages, nil } +// reportJSONFilename is the name of the machine-readable JSON report emitted alongside plot pages. +const reportJSONFilename = "report.json" + +// reportJSONPerm is the file permission for report.json. +const reportJSONPerm = 0o640 + // FormatPlotPages renders multi-page HTML plot output to outputDir. // Each analyzer gets its own HTML page plus an index page with navigation. -// FRD: specs/frds/FRD-20260312-static-plot-multipage.md. +// Also emits report.json with the raw analysis results for external dashboards. func (svc *StaticService) FormatPlotPages( analyzerNames []string, results map[string]Report, @@ -697,13 +999,27 @@ func (svc *StaticService) FormatPlotPages( return err } - renderer := &plotpage.MultiPageRenderer{ + mpRenderer := &plotpage.MultiPageRenderer{ OutputDir: outputDir, Title: plotPageTitle, Theme: plotpage.ThemeDark, } - return renderer.RenderIndex(pages) + indexErr := mpRenderer.RenderIndex(pages) + if indexErr != nil { + return indexErr + } + + return writeReportJSON(results, outputDir) +} + +// writeReportJSON writes the analysis results as indented JSON to outputDir/report.json. 
+func writeReportJSON(results map[string]Report, outputDir string) error { + reportPath := filepath.Join(outputDir, reportJSONFilename) + + return storage.WriteAtomic(reportPath, reportJSONPerm, func(w io.Writer) error { + return textutil.WriteJSON(w, results, true) + }) } // ResolveAggregationMode returns the aggregation mode for a given output format. diff --git a/internal/analyzers/analyze/static_bench_test.go b/internal/analyzers/analyze/static_bench_test.go index 43c3ceb..bf84aa0 100644 --- a/internal/analyzers/analyze/static_bench_test.go +++ b/internal/analyzers/analyze/static_bench_test.go @@ -1,11 +1,5 @@ package analyze_test -// FRD: specs/frds/FRD-20260311-cap-static-workers.md. -// FRD: specs/frds/FRD-20260311-static-malloc-trim.md. -// FRD: specs/frds/FRD-20260311-static-memory-limit.md. -// FRD: specs/frds/FRD-20260311-bounded-parser-pool.md. -// FRD: specs/frds/FRD-20260311-eager-tree-release.md. - import ( "context" "fmt" @@ -176,7 +170,7 @@ func BenchmarkStaticPeakParsers(b *testing.B) { dir := setupHeavyBenchDir(b, benchPeakFileCount, benchPeakFunctionsPerFile) b.Run("before-uncapped", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = runtime.NumCPU() svc.MallocTrimInterval = -1 @@ -198,7 +192,7 @@ func BenchmarkStaticPeakParsers(b *testing.B) { }) b.Run("after-capped", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = benchCappedWorkers svc.MallocTrimInterval = -1 @@ -226,7 +220,7 @@ func BenchmarkStaticMallocTrim(b *testing.B) { dir := setupHeavyBenchDir(b, benchMallocTrimFileCount, benchMallocTrimFunctionsPerFile) b.Run("before-no-trim", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = benchCappedWorkers svc.MallocTrimInterval = 
-1 @@ -249,7 +243,7 @@ func BenchmarkStaticMallocTrim(b *testing.B) { }) b.Run("after-trim-enabled", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = benchCappedWorkers svc.MallocTrimInterval = benchMallocTrimInterval // NativeMemoryReleaseFn=nil uses real gitlib.ReleaseNativeMemory(). @@ -279,7 +273,7 @@ func BenchmarkStaticMemoryLimit(b *testing.B) { dir := setupHeavyBenchDir(b, benchMemLimitFileCount, benchMemLimitFunctionsPerFile) b.Run("before-no-limit", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = benchMemLimitWorkers svc.MallocTrimInterval = -1 @@ -301,7 +295,7 @@ func BenchmarkStaticMemoryLimit(b *testing.B) { }) b.Run("after-with-limit", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = benchMemLimitWorkers svc.MallocTrimInterval = -1 @@ -333,7 +327,7 @@ func BenchmarkStaticParserPool(b *testing.B) { dir := setupHeavyBenchDir(b, benchParserPoolFileCount, benchParserPoolFunctionsPerFile) b.Run("before-workers-NumCPU", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = runtime.NumCPU() svc.MallocTrimInterval = -1 @@ -350,7 +344,7 @@ func BenchmarkStaticParserPool(b *testing.B) { }) b.Run("after-workers-4", func(b *testing.B) { - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = benchCappedWorkers svc.MallocTrimInterval = -1 @@ -379,7 +373,7 @@ func TestStaticPeakParsers_BoundedConcurrency(t *testing.T) { dir := setupHeavyBenchDir(t, fileCount, benchPeakFunctionsPerFile) - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := 
analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = maxWorkers require.Equal(t, maxWorkers, svc.ResolveMaxWorkers()) diff --git a/internal/analyzers/analyze/static_language.go b/internal/analyzers/analyze/static_language.go new file mode 100644 index 0000000..b347b99 --- /dev/null +++ b/internal/analyzers/analyze/static_language.go @@ -0,0 +1,24 @@ +package analyze + +// Static-side --languages filter. + +import "path/filepath" + +// matchesLanguageGlobs reports whether name's basename matches any of +// the given fnmatch-style globs. An empty or nil globs slice disables +// filtering and returns true. +func matchesLanguageGlobs(name string, globs []string) bool { + if len(globs) == 0 { + return true + } + + base := filepath.Base(name) + for _, g := range globs { + ok, err := filepath.Match(g, base) + if err == nil && ok { + return true + } + } + + return false +} diff --git a/internal/analyzers/analyze/static_language_test.go b/internal/analyzers/analyze/static_language_test.go new file mode 100644 index 0000000..fadd281 --- /dev/null +++ b/internal/analyzers/analyze/static_language_test.go @@ -0,0 +1,195 @@ +package analyze_test + +import ( + "context" + "os" + "path/filepath" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/composition" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" +) + +func TestStaticService_AnalyzeFolder_PathPolicy_DefaultsDropVendorAndGenerated(t *testing.T) { + t.Parallel() + + tmpDir := t.TempDir() + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "keep.go"), + []byte("package main\nfunc F() {}\n"), 0o600)) + require.NoError(t, + os.MkdirAll(filepath.Join(tmpDir, "vendor", "lib"), 0o750)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "vendor", "lib", "vendored.go"), + []byte("package lib\nfunc F() {}\n"), 
0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "api.pb.go"), + []byte("package main\nfunc F() {}\n"), 0o600)) + + composer := composition.NewAnalyzer() + svc := analyze.NewStaticService(nil, []analyze.RawFileAnalyzer{composer}) + + results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{composer.Name()}) + require.NoError(t, err) + + report := results[composer.Name()] + assert.EqualValues(t, 1, report["total_files"], + "default path policy must drop vendor/lib/vendored.go and api.pb.go") +} + +func TestStaticService_AnalyzeFolder_PathPolicy_IncludeVendoredAndGeneratedRestoresAll(t *testing.T) { + t.Parallel() + + tmpDir := t.TempDir() + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "keep.go"), + []byte("package main\nfunc F() {}\n"), 0o600)) + require.NoError(t, + os.MkdirAll(filepath.Join(tmpDir, "vendor", "lib"), 0o750)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "vendor", "lib", "vendored.go"), + []byte("package lib\nfunc F() {}\n"), 0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "api.pb.go"), + []byte("package main\nfunc F() {}\n"), 0o600)) + + composer := composition.NewAnalyzer() + svc := analyze.NewStaticService(nil, []analyze.RawFileAnalyzer{composer}) + svc.PathPolicy = pathpolicy.Options{ + IncludeVendored: true, + IncludeGenerated: true, + } + + results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{composer.Name()}) + require.NoError(t, err) + + report := results[composer.Name()] + assert.EqualValues(t, 3, report["total_files"], + "include-vendored + include-generated must restore today's default behavior") +} + +func TestStaticService_AnalyzeFolder_NilLanguageGlobs_ProcessesAllSupportedFiles(t *testing.T) { + t.Parallel() + + tmpDir := t.TempDir() + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "a.go"), + []byte("package main\nfunc F() {}\n"), 0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "b.py"), + []byte("def f():\n 
pass\n"), 0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "c.js"), + []byte("function f() {}\n"), 0o600)) + + composer := composition.NewAnalyzer() + svc := analyze.NewStaticService(nil, []analyze.RawFileAnalyzer{composer}) + + results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{composer.Name()}) + require.NoError(t, err) + + report := results[composer.Name()] + assert.EqualValues(t, 3, report["total_files"], + "nil LanguageGlobs must preserve today's behavior: all 3 files processed") +} + +func TestStaticService_AnalyzeFolder_LanguageGlobs_FiltersRawFileWalk(t *testing.T) { + t.Parallel() + + tmpDir := t.TempDir() + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "keep.go"), + []byte("package main\nfunc F() {}\n"), 0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "drop.py"), + []byte("def f():\n pass\n"), 0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "drop.js"), + []byte("function f() {}\n"), 0o600)) + + composer := composition.NewAnalyzer() + svc := analyze.NewStaticService(nil, []analyze.RawFileAnalyzer{composer}) + svc.LanguageGlobs = []string{"*.go"} + + results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{composer.Name()}) + require.NoError(t, err) + require.Contains(t, results, composer.Name()) + + report := results[composer.Name()] + + assert.EqualValues(t, 1, report["total_files"], + "raw-file walker must skip paths outside LanguageGlobs: "+ + "only keep.go should reach the composition analyzer") +} + +func TestStaticService_AnalyzeFolder_LanguageGlobs_FiltersUASTWalk(t *testing.T) { + t.Parallel() + + tmpDir := t.TempDir() + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "keep.go"), + []byte("package main\nfunc F() { x := 1; _ = x }\n"), 0o600)) + require.NoError(t, + os.WriteFile(filepath.Join(tmpDir, "drop.py"), + []byte("def f():\n x = 1\n return x\n"), 0o600)) + + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) + 
svc.LanguageGlobs = []string{"*.go"} + svc.PerFile = true + + results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{"complexity"}) + require.NoError(t, err) + require.Contains(t, results, "complexity") + + perFile := svc.PerFileResults()["complexity"] + assert.Contains(t, perFile, "keep.go", + "Go file must reach the complexity analyzer when pathspec *.go is active") + assert.NotContains(t, perFile, "drop.py", + "Python file must be filtered out before the parser runs") +} + +func TestMatchesLanguageGlobs_NilGlobs_AllowsAnyName(t *testing.T) { + t.Parallel() + + assert.True(t, analyze.LanguageGlobMatcher("anything.go", nil), + "nil globs must be treated as no-filter and return true") +} + +func TestMatchesLanguageGlobs_MultipleGlobs_MatchesUnion(t *testing.T) { + t.Parallel() + + globs := []string{"*.go", "Dockerfile"} + + assert.True(t, analyze.LanguageGlobMatcher("main.go", globs)) + assert.True(t, analyze.LanguageGlobMatcher("Dockerfile", globs)) + assert.False(t, analyze.LanguageGlobMatcher("main.py", globs), + "a name matching neither glob must be rejected") +} + +func TestMatchesLanguageGlobs_StarDotGo_MatchesGoBasename(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + path string + want bool + reason string + }{ + {"go file", "foo.go", true, "*.go glob must match plain .go"}, + {"nested go file", "/abs/dir/foo.go", true, "match on basename, not full path"}, + {"python file", "foo.py", false, "*.go must not match .py"}, + {"no extension", "Makefile", false, "*.go must not match extensionless"}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + got := analyze.LanguageGlobMatcher(tt.path, []string{"*.go"}) + assert.Equal(t, tt.want, got, tt.reason) + }) + } +} diff --git a/internal/analyzers/analyze/static_test.go b/internal/analyzers/analyze/static_test.go index 23669f5..5ecb7ab 100644 --- a/internal/analyzers/analyze/static_test.go +++ b/internal/analyzers/analyze/static_test.go 
@@ -3,6 +3,7 @@ package analyze_test import ( "bytes" "context" + "encoding/json" "fmt" "io/fs" "os" @@ -17,6 +18,7 @@ import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" "github.com/Sumatoshi-tech/codefang/internal/analyzers/cohesion" "github.com/Sumatoshi-tech/codefang/internal/analyzers/comments" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/renderer" "github.com/Sumatoshi-tech/codefang/internal/analyzers/complexity" "github.com/Sumatoshi-tech/codefang/internal/analyzers/halstead" "github.com/Sumatoshi-tech/codefang/internal/analyzers/imports" @@ -134,7 +136,7 @@ func TestStaticService_AnalyzeFolder_SkipsPermissionDeniedDirectory(t *testing.T require.NoError(t, os.Chmod(blockedDir, 0o750)) }() - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{"complexity"}) require.NoError(t, err) require.Contains(t, results, "complexity") @@ -186,14 +188,14 @@ func TestStampSourceFile(t *testing.T) { }, } - analyze.StampSourceFile(reports, "/repo/pkg/auth/handler.go") + analyze.StampSourceFile(reports, "/repo/pkg/auth/handler.go", "/repo") functions, ok := reports["cohesion"]["functions"].([]map[string]any) require.True(t, ok) require.Len(t, functions, 2) for _, fn := range functions { - require.Equal(t, "/repo/pkg/auth/handler.go", fn["_source_file"]) + require.Equal(t, "pkg/auth/handler.go", fn["_source_file"]) } } @@ -203,7 +205,7 @@ func TestStampSourceFile_EmptyReport(t *testing.T) { reports := map[string]analyze.Report{} require.NotPanics(t, func() { - analyze.StampSourceFile(reports, "/some/path.go") + analyze.StampSourceFile(reports, "/some/path.go", "") }) } @@ -219,12 +221,10 @@ func TestStampSourceFile_NoCollections(t *testing.T) { } require.NotPanics(t, func() { - analyze.StampSourceFile(reports, "/some/path.go") + analyze.StampSourceFile(reports, "/some/path.go", "") }) } -// FRD: 
specs/frds/FRD-20260311-typed-report-items.md. - func TestStampSourceFile_TypedCollection(t *testing.T) { t.Parallel() @@ -268,25 +268,23 @@ func TestStampSourceFile_TypedCollection(t *testing.T) { }, } - analyze.StampSourceFile(reports, "/repo/pkg/foo.go") + analyze.StampSourceFile(reports, "/repo/pkg/foo.go", "/repo") stamped, ok := reports["complexity"]["functions"].(analyze.TypedCollection) require.True(t, ok) - assert.Equal(t, "/repo/pkg/foo.go", stamped.SourceFile) + assert.Equal(t, "pkg/foo.go", stamped.SourceFile) // Verify converter produces maps with _source_file. maps := stamped.ToMaps(stamped.Items, stamped.SourceFile) require.Len(t, maps, 2) - assert.Equal(t, "/repo/pkg/foo.go", maps[0]["_source_file"]) - assert.Equal(t, "/repo/pkg/foo.go", maps[1]["_source_file"]) + assert.Equal(t, "pkg/foo.go", maps[0]["_source_file"]) + assert.Equal(t, "pkg/foo.go", maps[1]["_source_file"]) } -// FRD: specs/frds/FRD-20260311-cap-static-workers.md. - func TestStaticService_ResolveMaxWorkers_DefaultCapsAtEight(t *testing.T) { t.Parallel() - svc := analyze.NewStaticService(nil) + svc := analyze.NewStaticService(nil, nil) got := svc.ResolveMaxWorkers() want := min(runtime.NumCPU(), analyze.DefaultStaticMaxWorkers) @@ -308,7 +306,7 @@ func TestStaticService_AnalyzeFolder_RespectsMaxWorkers(t *testing.T) { []byte("package a\nfunc B() {}\n"), 0o600, )) - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = 1 results, err := svc.AnalyzeFolder(context.Background(), tmpDir, []string{"complexity"}) @@ -321,18 +319,16 @@ func TestStaticService_ResolveMaxWorkers_ExplicitOverride(t *testing.T) { const explicitWorkers = 16 - svc := analyze.NewStaticService(nil) + svc := analyze.NewStaticService(nil, nil) svc.MaxWorkers = explicitWorkers require.Equal(t, explicitWorkers, svc.ResolveMaxWorkers()) } -// FRD: specs/frds/FRD-20260311-static-malloc-trim.md. 
- func TestStaticService_ResolveMallocTrimInterval_Default(t *testing.T) { t.Parallel() - svc := analyze.NewStaticService(nil) + svc := analyze.NewStaticService(nil, nil) require.Equal(t, analyze.DefaultMallocTrimInterval, svc.ResolveMallocTrimInterval()) } @@ -342,7 +338,7 @@ func TestStaticService_ResolveMallocTrimInterval_ExplicitOverride(t *testing.T) const customInterval = 100 - svc := analyze.NewStaticService(nil) + svc := analyze.NewStaticService(nil, nil) svc.MallocTrimInterval = customInterval require.Equal(t, customInterval, svc.ResolveMallocTrimInterval()) @@ -351,7 +347,7 @@ func TestStaticService_ResolveMallocTrimInterval_ExplicitOverride(t *testing.T) func TestStaticService_ResolveMallocTrimInterval_Disabled(t *testing.T) { t.Parallel() - svc := analyze.NewStaticService(nil) + svc := analyze.NewStaticService(nil, nil) svc.MallocTrimInterval = -1 require.Equal(t, -1, svc.ResolveMallocTrimInterval()) @@ -374,7 +370,7 @@ func TestStaticService_AnalyzeFolder_CallsMallocTrim(t *testing.T) { var trimCalls atomic.Int64 - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = 1 svc.MallocTrimInterval = trimInterval svc.NativeMemoryReleaseFn = func() { trimCalls.Add(1) } @@ -400,7 +396,7 @@ func TestStaticService_AnalyzeFolder_NoTrimWhenDisabled(t *testing.T) { var trimCalls atomic.Int64 - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = 1 svc.MallocTrimInterval = -1 svc.NativeMemoryReleaseFn = func() { trimCalls.Add(1) } @@ -411,8 +407,6 @@ func TestStaticService_AnalyzeFolder_NoTrimWhenDisabled(t *testing.T) { require.Zero(t, trimCalls.Load()) } -// FRD: specs/frds/FRD-20260311-summary-only-aggregation.md. 
- func TestResolveAggregationMode_TextIsSummaryOnly(t *testing.T) { t.Parallel() @@ -451,7 +445,7 @@ func TestStaticService_SummaryOnly_MetricsPresent(t *testing.T) { []byte("package main\nfunc A() { x := 1; _ = x }\nfunc B() { y := 2; _ = y }\n"), 0o600, )) - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = 1 svc.MallocTrimInterval = -1 svc.AggregationMode = analyze.AggregationModeSummaryOnly @@ -467,8 +461,6 @@ func TestStaticService_SummaryOnly_MetricsPresent(t *testing.T) { require.Contains(t, report, "total_complexity") } -// FRD: specs/frds/FRD-20260312-static-budget-tuning.md. - func TestStaticService_SpillThreshold_AppliedToAggregators(t *testing.T) { t.Parallel() @@ -481,7 +473,7 @@ func TestStaticService_SpillThreshold_AppliedToAggregators(t *testing.T) { const customThreshold = 5000 - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.MaxWorkers = 1 svc.MallocTrimInterval = -1 svc.SpillThreshold = customThreshold @@ -497,8 +489,6 @@ func TestStaticService_SpillThreshold_AppliedToAggregators(t *testing.T) { assert.Equal(t, customThreshold, svc.SpillThreshold) } -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. 
- func TestStaticService_ProgressFunc_CalledDuringAnalysis(t *testing.T) { t.Parallel() @@ -510,7 +500,7 @@ func TestStaticService_ProgressFunc_CalledDuringAnalysis(t *testing.T) { writeTestGoFile(t, dir, fmt.Sprintf("file%d.go", i)) } - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.NativeMemoryReleaseFn = func() {} svc.ProgressInterval = 2 @@ -538,7 +528,7 @@ func TestStaticService_ProgressFunc_Nil_NoError(t *testing.T) { dir := t.TempDir() writeTestGoFile(t, dir, "file.go") - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.NativeMemoryReleaseFn = func() {} // ProgressFunc is nil — should not panic. @@ -547,8 +537,6 @@ func TestStaticService_ProgressFunc_Nil_NoError(t *testing.T) { require.NotEmpty(t, results) } -// FRD: specs/frds/FRD-20260312-static-plot-multipage.md. - func TestStaticService_FormatPlotPages_ProducesHTML(t *testing.T) { t.Parallel() @@ -559,7 +547,7 @@ func TestStaticService_FormatPlotPages_ProducesHTML(t *testing.T) { dir := t.TempDir() writeTestGoFile(t, dir, "main.go") - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.NativeMemoryReleaseFn = func() {} svc.AggregationMode = analyze.AggregationModeFull @@ -591,7 +579,7 @@ func TestStaticService_FormatPlotPages_SkipsUnregisteredAnalyzers(t *testing.T) dir := t.TempDir() writeTestGoFile(t, dir, "main.go") - svc := analyze.NewStaticService(testStaticAnalyzers()) + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) svc.NativeMemoryReleaseFn = func() {} results := map[string]analyze.Report{ @@ -627,3 +615,101 @@ func writeTestGoFile(t *testing.T, dir, name string) { require.NoError(t, os.WriteFile(path, content, 0o600)) } + +func TestStaticService_PerFile_FieldExists(t *testing.T) { + t.Parallel() + + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) + svc.PerFile 
= true + + assert.True(t, svc.PerFile) +} + +func TestStaticService_PerFile_AnalyzeFolderRetainsPerFileResults(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + writeTestGoFile(t, dir, "a.go") + writeTestGoFile(t, dir, "b.go") + writeTestGoFile(t, dir, "c.go") + + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) + svc.NativeMemoryReleaseFn = func() {} + svc.PerFile = true + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + _ = results + + perFile := svc.PerFileResults() + require.NotNil(t, perFile, "per-file results must be present when PerFile=true") + + // Each analyzer should have 3 per-file entries. + for analyzerName, fileResults := range perFile { + assert.Len(t, fileResults, 3, + "analyzer %s must have 3 per-file entries", analyzerName) + } +} + +func TestStaticService_PerFile_FormatJSONIncludesFiles(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + writeTestGoFile(t, dir, "a.go") + writeTestGoFile(t, dir, "b.go") + + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) + svc.NativeMemoryReleaseFn = func() {} + svc.Renderer = &renderer.DefaultStaticRenderer{} + svc.PerFile = true + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + var buf bytes.Buffer + require.NoError(t, svc.FormatJSON(results, &buf)) + + jsonStr := buf.String() + assert.Contains(t, jsonStr, `"files"`, "JSON must include files array") + assert.Contains(t, jsonStr, `"file_path"`, "files entries must have file_path") +} + +func TestStaticService_PerFile_DisabledReturnsNil(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + writeTestGoFile(t, dir, "a.go") + + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) + svc.NativeMemoryReleaseFn = func() {} + // PerFile is false (default). 
+ + _, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + assert.Nil(t, svc.PerFileResults(), "per-file results must be nil when PerFile is false") +} + +func TestStaticService_FormatPlotPages_EmitsReportJSON(t *testing.T) { + t.Parallel() + + svc := analyze.NewStaticService(testStaticAnalyzers(), nil) + svc.NativeMemoryReleaseFn = func() {} + + results := map[string]analyze.Report{ + "complexity": {"total_functions": 1}, + } + + outputDir := filepath.Join(t.TempDir(), "plot-output") + + require.NoError(t, svc.FormatPlotPages([]string{"complexity"}, results, outputDir)) + + reportPath := filepath.Join(outputDir, "report.json") + data, err := os.ReadFile(reportPath) + require.NoError(t, err, "report.json must exist after FormatPlotPages") + + var parsed map[string]any + require.NoError(t, json.Unmarshal(data, &parsed), "report.json must be valid JSON") + assert.Contains(t, parsed, "complexity", "report.json must contain analyzer results") +} diff --git a/internal/analyzers/analyze/tick_bounds.go b/internal/analyzers/analyze/tick_bounds.go new file mode 100644 index 0000000..56ba040 --- /dev/null +++ b/internal/analyzers/analyze/tick_bounds.go @@ -0,0 +1,46 @@ +package analyze + +import "time" + +// TickBounds holds the time boundaries of a single tick. +type TickBounds struct { + StartTime time.Time + EndTime time.Time +} + +// FormatStartTime returns StartTime as an RFC 3339 string, or empty if zero. +func (b TickBounds) FormatStartTime() string { + if b.StartTime.IsZero() { + return "" + } + + return b.StartTime.UTC().Format(time.RFC3339) +} + +// FormatEndTime returns EndTime as an RFC 3339 string, or empty if zero. +func (b TickBounds) FormatEndTime() string { + if b.EndTime.IsZero() { + return "" + } + + return b.EndTime.UTC().Format(time.RFC3339) +} + +// BuildTickBounds extracts tick boundaries from a slice of TICKs. +// Returns a map from tick index to its time bounds. 
+func BuildTickBounds(ticks []TICK) map[int]TickBounds { + if len(ticks) == 0 { + return nil + } + + result := make(map[int]TickBounds, len(ticks)) + + for _, tick := range ticks { + result[tick.Tick] = TickBounds{ + StartTime: tick.StartTime, + EndTime: tick.EndTime, + } + } + + return result +} diff --git a/internal/analyzers/analyze/tick_bounds_test.go b/internal/analyzers/analyze/tick_bounds_test.go new file mode 100644 index 0000000..7b3a6de --- /dev/null +++ b/internal/analyzers/analyze/tick_bounds_test.go @@ -0,0 +1,88 @@ +package analyze_test + +import ( + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +var ( + testTime1 = time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + testTime2 = time.Date(2024, 1, 16, 12, 0, 0, 0, time.UTC) + testTime3 = time.Date(2024, 1, 17, 14, 0, 0, 0, time.UTC) +) + +func TestBuildTickBounds_Empty(t *testing.T) { + t.Parallel() + + result := analyze.BuildTickBounds(nil) + + assert.Empty(t, result) +} + +func TestBuildTickBounds_SingleTick(t *testing.T) { + t.Parallel() + + ticks := []analyze.TICK{ + {Tick: 0, StartTime: testTime1, EndTime: testTime2}, + } + + result := analyze.BuildTickBounds(ticks) + + require.Len(t, result, 1) + assert.Equal(t, testTime1, result[0].StartTime) + assert.Equal(t, testTime2, result[0].EndTime) +} + +func TestBuildTickBounds_MultipleTicks(t *testing.T) { + t.Parallel() + + ticks := []analyze.TICK{ + {Tick: 0, StartTime: testTime1, EndTime: testTime2}, + {Tick: 1, StartTime: testTime2, EndTime: testTime3}, + } + + result := analyze.BuildTickBounds(ticks) + + require.Len(t, result, 2) + assert.Equal(t, testTime1, result[0].StartTime) + assert.Equal(t, testTime2, result[1].StartTime) + assert.Equal(t, testTime3, result[1].EndTime) +} + +func TestBuildTickBounds_ZeroTimesSkipped(t *testing.T) { + t.Parallel() + + ticks := []analyze.TICK{ + {Tick: 0}, + {Tick: 1, StartTime: 
testTime1, EndTime: testTime2}, + } + + result := analyze.BuildTickBounds(ticks) + + require.Len(t, result, 2) + assert.True(t, result[0].StartTime.IsZero()) + assert.Equal(t, testTime1, result[1].StartTime) +} + +func TestTickBoundsFormatStartTime(t *testing.T) { + t.Parallel() + + bounds := analyze.TickBounds{StartTime: testTime1, EndTime: testTime2} + + assert.Equal(t, "2024-01-15T10:00:00Z", bounds.FormatStartTime()) + assert.Equal(t, "2024-01-16T12:00:00Z", bounds.FormatEndTime()) +} + +func TestTickBoundsFormatStartTime_Zero(t *testing.T) { + t.Parallel() + + bounds := analyze.TickBounds{} + + assert.Empty(t, bounds.FormatStartTime()) + assert.Empty(t, bounds.FormatEndTime()) +} diff --git a/internal/analyzers/analyze/typed_collection.go b/internal/analyzers/analyze/typed_collection.go index 8f61a55..3627bef 100644 --- a/internal/analyzers/analyze/typed_collection.go +++ b/internal/analyzers/analyze/typed_collection.go @@ -1,7 +1,5 @@ package analyze -// FRD: specs/frds/FRD-20260311-typed-report-items.md. - // ItemConverter converts a typed items slice and source file path into []map[string]any. // The sourceFile parameter is the path stamped by StampSourceFile; when non-empty, the // converter should include it as "_source_file" in each output map. @@ -13,6 +11,8 @@ type ItemConverter func(items any, sourceFile string) []map[string]any type TypedCollection struct { Items any // concrete typed slice (e.g., []FunctionMetrics). SourceFile string // stamped by StampSourceFile. + Language string // stamped by StampLanguage. + Directory string // stamped by StampSourceFile (filepath.Dir of relative path). ToMaps ItemConverter // deferred converter. } @@ -27,3 +27,9 @@ func (tc TypedCollection) MapSlice() []map[string]any { // SourceFileKey is the report key used to stamp the originating source file. const SourceFileKey = "_source_file" + +// LanguageKey is the report key used to stamp the detected programming language. 
+const LanguageKey = "_language" + +// DirectoryKey is the report key used to stamp the parent directory of the source file. +const DirectoryKey = "_directory" diff --git a/internal/analyzers/anomaly/analyzer.go b/internal/analyzers/anomaly/analyzer.go index 96d97af..dd0fc5f 100644 --- a/internal/analyzers/anomaly/analyzer.go +++ b/internal/analyzers/anomaly/analyzer.go @@ -503,6 +503,7 @@ func ticksToReport( "anomalies": anomalies, "threshold": threshold, "window_size": window, + "tick_bounds": analyze.BuildTickBounds(ticks), } } diff --git a/internal/analyzers/anomaly/enrich_store_test.go b/internal/analyzers/anomaly/enrich_store_test.go index 88e2cc2..233310c 100644 --- a/internal/analyzers/anomaly/enrich_store_test.go +++ b/internal/analyzers/anomaly/enrich_store_test.go @@ -1,7 +1,5 @@ package anomaly -// FRD: specs/frds/FRD-20260301-anomaly-enrich-from-store.md. - import ( "context" "testing" diff --git a/internal/analyzers/anomaly/enrich_test.go b/internal/analyzers/anomaly/enrich_test.go index 360be7c..3bc45b6 100644 --- a/internal/analyzers/anomaly/enrich_test.go +++ b/internal/analyzers/anomaly/enrich_test.go @@ -1,7 +1,5 @@ package anomaly -// FRD: specs/frds/FRD-20260301-anomaly-enrich-from-store.md. - import ( "testing" diff --git a/internal/analyzers/anomaly/metrics.go b/internal/analyzers/anomaly/metrics.go index 0ca2591..76096bb 100644 --- a/internal/analyzers/anomaly/metrics.go +++ b/internal/analyzers/anomaly/metrics.go @@ -71,12 +71,14 @@ type AggregateData struct { // TimeSeriesEntry holds per-tick data for the time series output. 
type TimeSeriesEntry struct { - Tick int `json:"tick" yaml:"tick"` - Metrics RawMetrics `json:"metrics" yaml:"metrics"` - IsAnomaly bool `json:"is_anomaly" yaml:"is_anomaly"` - ChurnZScore float64 `json:"churn_z_score" yaml:"churn_z_score"` - LanguageDiversity int `json:"language_diversity" yaml:"language_diversity"` - AuthorCount int `json:"author_count" yaml:"author_count"` + Tick int `json:"tick" yaml:"tick"` + StartTime string `json:"start_time,omitempty" yaml:"start_time,omitempty"` + EndTime string `json:"end_time,omitempty" yaml:"end_time,omitempty"` + Metrics RawMetrics `json:"metrics" yaml:"metrics"` + IsAnomaly bool `json:"is_anomaly" yaml:"is_anomaly"` + ChurnZScore float64 `json:"churn_z_score" yaml:"churn_z_score"` + LanguageDiversity int `json:"language_diversity" yaml:"language_diversity"` + AuthorCount int `json:"author_count" yaml:"author_count"` } // --- External Anomaly Types ---. @@ -184,7 +186,7 @@ func computeTimeSeries(input *ReportData) []TimeSeriesEntry { churnZ = churnScores[i] } - entries[i] = TimeSeriesEntry{ + entry := TimeSeriesEntry{ Tick: tick, Metrics: RawMetrics{ FilesChanged: tm.FilesChanged, @@ -199,6 +201,13 @@ func computeTimeSeries(input *ReportData) []TimeSeriesEntry { LanguageDiversity: len(tm.Languages), AuthorCount: len(tm.AuthorIDs), } + + if bounds, hasBounds := input.TickBounds[tick]; hasBounds { + entry.StartTime = bounds.FormatStartTime() + entry.EndTime = bounds.FormatEndTime() + } + + entries[i] = entry } return entries @@ -210,6 +219,7 @@ func computeTimeSeries(input *ReportData) []TimeSeriesEntry { type ReportData struct { Anomalies []Record TickMetrics map[int]*TickMetrics + TickBounds map[int]analyze.TickBounds Threshold float32 WindowSize int ExternalAnomalies []ExternalAnomaly @@ -307,6 +317,10 @@ func ParseReportData(report analyze.Report) (*ReportData, error) { data.ExternalSummaries = v } + if v, ok := report["tick_bounds"].(map[int]analyze.TickBounds); ok { + data.TickBounds = v + } + return data, nil } 
diff --git a/internal/analyzers/anomaly/store_writer_test.go b/internal/analyzers/anomaly/store_writer_test.go index 972d61a..6f355a7 100644 --- a/internal/analyzers/anomaly/store_writer_test.go +++ b/internal/analyzers/anomaly/store_writer_test.go @@ -1,7 +1,5 @@ package anomaly -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. - import ( "context" "testing" diff --git a/internal/analyzers/burndown/history.go b/internal/analyzers/burndown/history.go index 520c2eb..4bb8872 100644 --- a/internal/analyzers/burndown/history.go +++ b/internal/analyzers/burndown/history.go @@ -98,6 +98,10 @@ type HistoryAnalyzer struct { TrackFiles bool HibernationToDisk bool lastCommitTime time.Time + + // mismatch tracks src-mismatch reset events (rate-limited logging, + // per-chunk and cumulative counters). Surfaced via MismatchStats. + mismatch mismatchTracker } const ( @@ -166,6 +170,12 @@ func NewHistoryAnalyzer() *HistoryAnalyzer { return ha } +// MismatchStats returns cumulative src-mismatch counters for this analyzer. +// See [MismatchStats] for the operational meaning of these numbers. +func (b *HistoryAnalyzer) MismatchStats() MismatchStats { + return b.mismatch.snapshot() +} + // ListConfigurationOptions returns the configuration options for the analyzer. func (b *HistoryAnalyzer) ListConfigurationOptions() []pipeline.ConfigurationOption { return []pipeline.ConfigurationOption{ diff --git a/internal/analyzers/burndown/history_changes.go b/internal/analyzers/burndown/history_changes.go index a43fdf2..083ff90 100644 --- a/internal/analyzers/burndown/history_changes.go +++ b/internal/analyzers/burndown/history_changes.go @@ -133,7 +133,7 @@ func (b *HistoryAnalyzer) countDeletionLines( // forceRemoveFile handles treap/blob length mismatch by force-deleting the file tracking. 
func (b *HistoryAnalyzer) forceRemoveFile(shard *Shard, id PathID, name string, file *burndown.File) { - log.Printf("burndown: src mismatch for deletion %s (tracked=%d), force-removing", name, file.Len()) + b.mismatch.recordForceRemove(name, file.Len()) file.Delete() shard.filesByID[id] = nil @@ -361,8 +361,7 @@ func (b *HistoryAnalyzer) resetAndReinsert( shard *Shard, change *gitlib.Change, id PathID, author int, cache map[gitlib.Hash]*pkgplumbing.CachedBlob, ) error { - log.Printf("burndown: src mismatch for %s (tracked=%d, diff_old=...), resetting", - change.To.Name, shard.filesByID[id].Len()) + b.mismatch.recordReset(change.To.Name, shard.filesByID[id].Len()) shard.filesByID[id] = nil b.removeActiveID(shard, id) diff --git a/internal/analyzers/burndown/history_lifecycle.go b/internal/analyzers/burndown/history_lifecycle.go index 1aa3ca5..acaf18f 100644 --- a/internal/analyzers/burndown/history_lifecycle.go +++ b/internal/analyzers/burndown/history_lifecycle.go @@ -2,6 +2,7 @@ package burndown import ( "fmt" + "log" "os" "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" @@ -118,6 +119,8 @@ func (b *HistoryAnalyzer) mergeTicks(other *HistoryAnalyzer) { // Hibernate releases resources between processing phases. func (b *HistoryAnalyzer) Hibernate() error { + b.logChunkMismatchSummary() + err := b.ensureSpillDir() if err != nil { return fmt.Errorf("burndown spill dir: %w", err) @@ -137,6 +140,24 @@ func (b *HistoryAnalyzer) Hibernate() error { return nil } +// logChunkMismatchSummary emits a single line summarizing src-mismatch +// resets recorded since the last chunk boundary, then re-baselines the +// counter for the next chunk. Silent when no mismatches happened. 
+func (b *HistoryAnalyzer) logChunkMismatchSummary() { + delta := b.mismatch.chunkDelta() + if delta == 0 { + b.mismatch.resetChunkBaseline() + + return + } + + stats := b.mismatch.snapshot() + log.Printf("burndown: chunk src-mismatch summary chunk_resets=%d cumulative_resets=%d cumulative_force_removes=%d", + delta, stats.Resets, stats.ForceRemoves) + + b.mismatch.resetChunkBaseline() +} + // hibernateShard shrinks treap pools, spills to disk, and resets tracking maps. func (b *HistoryAnalyzer) hibernateShard(shard *Shard, idx int) error { shard.mu.Lock() diff --git a/internal/analyzers/burndown/mismatch_tracker.go b/internal/analyzers/burndown/mismatch_tracker.go new file mode 100644 index 0000000..82385ee --- /dev/null +++ b/internal/analyzers/burndown/mismatch_tracker.go @@ -0,0 +1,124 @@ +package burndown + +import ( + "log" + "sync/atomic" + "time" +) + +// mismatchLogIntervalNanos throttles the per-event src-mismatch log line. +// Bursts are common (one large commit can reset thousands of file states +// in a single tick after the blob-pipeline cap silently skips a monster +// commit upstream); without throttling, the log becomes the long pole. +// 1 second across all shards keeps the operator-facing signal while +// dropping the cost from O(mismatches) stdout flushes to O(seconds). +const mismatchLogIntervalNanos = int64(time.Second) + +// mismatchTracker counts src-mismatch reset events on the burndown analyzer +// and rate-limits the per-event log line. All fields are accessed atomically +// so the tracker is safe to call from per-shard goroutines. +// +// The counter splits resets (file present, line count diverged) from +// force-removes (file deleted while line count diverged) so consumers can +// tell apart the two recovery paths handled by history_changes.go. +type mismatchTracker struct { + resets atomic.Int64 + forceRemoves atomic.Int64 + dropped atomic.Int64 // events suppressed since the last emitted log line. 
+ lastLogNanos atomic.Int64 // monotonic-ish timestamp of last emitted log line. + chunkBaseline atomic.Int64 // resets+forceRemoves at start of current chunk. +} + +// recordReset bumps the reset counter and emits a rate-limited log line. +// name is the file path; tracked is the analyzer's stale line count for it. +func (t *mismatchTracker) recordReset(name string, tracked int) { + t.resets.Add(1) + t.maybeLog(name, tracked, "resetting") +} + +// recordForceRemove bumps the force-remove counter and emits a rate-limited +// log line. Mirrors recordReset for the deletion path so the two recovery +// modes show up as separate counters. +func (t *mismatchTracker) recordForceRemove(name string, tracked int) { + t.forceRemoves.Add(1) + t.maybeLog("deletion "+name, tracked, "force-removing") +} + +// maybeLog emits a log line at most once per mismatchLogIntervalNanos, atomic +// across shards. Suppressed events are counted in `dropped` and surfaced as a +// `dropped=N since last` suffix on the next emitted line. +func (t *mismatchTracker) maybeLog(name string, tracked int, kind string) { + now := time.Now().UnixNano() + last := t.lastLogNanos.Load() + + if now-last < mismatchLogIntervalNanos { + t.dropped.Add(1) + + return + } + + if !t.lastLogNanos.CompareAndSwap(last, now) { + // Another shard claimed this slot — count as dropped to keep the + // total consistent with one-log-per-interval semantics. + t.dropped.Add(1) + + return + } + + dropped := t.dropped.Swap(0) + if dropped == 0 { + log.Printf("burndown: src mismatch for %s (tracked=%d, diff_old=...), %s", + name, tracked, kind) + + return + } + + log.Printf("burndown: src mismatch for %s (tracked=%d, diff_old=...), %s [dropped=%d since last]", + name, tracked, kind, dropped) +} + +// snapshot returns the running counts. Used by Hibernate() for chunk summaries +// and exposed to external observers via HistoryAnalyzer.MismatchStats. 
+func (t *mismatchTracker) snapshot() MismatchStats { + return MismatchStats{ + Resets: t.resets.Load(), + ForceRemoves: t.forceRemoves.Load(), + } +} + +// resetChunkBaseline marks the cumulative count at the start of a chunk so +// per-chunk deltas can be reported on the next Hibernate. +func (t *mismatchTracker) resetChunkBaseline() { + t.chunkBaseline.Store(t.resets.Load() + t.forceRemoves.Load()) +} + +// chunkDelta returns the number of mismatch events recorded since the last +// resetChunkBaseline call. +func (t *mismatchTracker) chunkDelta() int64 { + return (t.resets.Load() + t.forceRemoves.Load()) - t.chunkBaseline.Load() +} + +// MismatchStats reports cumulative src-mismatch reset events on the burndown +// analyzer. Consumers (tests, observability) read these via +// HistoryAnalyzer.MismatchStats(). +// +// Resets count file modifications where the analyzer's tracked line count +// did not match the diff's OldLinesOfCode — typically after the blob +// pipeline silently skipped a "monster" commit (see ErrCommitTooLarge), so +// the analyzer's state lags reality by one or more commits' worth of edits. +// ForceRemoves count the same divergence on the deletion path. +// +// A non-zero value implies burndown's per-file survival history is stale +// for the affected files at the reset point — the file is treated as a +// fresh insertion thereafter. Surface this to operators when interpreting +// per-file results on repos with large mass-update commits (vendor moves, +// generated-code regenerations, Pods updates). +type MismatchStats struct { + Resets int64 + ForceRemoves int64 +} + +// Total returns the sum of resets and force-removes. 
+func (s MismatchStats) Total() int64 { + return s.Resets + s.ForceRemoves +} diff --git a/internal/analyzers/burndown/mismatch_tracker_test.go b/internal/analyzers/burndown/mismatch_tracker_test.go new file mode 100644 index 0000000..b3282ba --- /dev/null +++ b/internal/analyzers/burndown/mismatch_tracker_test.go @@ -0,0 +1,164 @@ +package burndown + +import ( + "sync" + "testing" +) + +func TestMismatchTracker_RecordReset_BumpsResetsCounter(t *testing.T) { + t.Parallel() + + var tr mismatchTracker + + tr.recordReset("foo.go", 12) + tr.recordReset("bar.go", 34) + + stats := tr.snapshot() + if stats.Resets != 2 { + t.Errorf("Resets = %d, want 2", stats.Resets) + } + + if stats.ForceRemoves != 0 { + t.Errorf("ForceRemoves = %d, want 0", stats.ForceRemoves) + } +} + +func TestMismatchTracker_RecordForceRemove_BumpsForceRemovesCounter(t *testing.T) { + t.Parallel() + + var tr mismatchTracker + + tr.recordForceRemove("foo.go", 99) + + stats := tr.snapshot() + if stats.ForceRemoves != 1 { + t.Errorf("ForceRemoves = %d, want 1", stats.ForceRemoves) + } + + if stats.Resets != 0 { + t.Errorf("Resets = %d, want 0", stats.Resets) + } +} + +func TestMismatchTracker_RateLimit_DropsBurstWithinInterval(t *testing.T) { + t.Parallel() + + var tr mismatchTracker + + // Fire a burst of 1000 resets back-to-back. Only the first should win + // the log slot; the rest must be counted as dropped. + for range 1000 { + tr.recordReset("foo.go", 1) + } + + if got := tr.dropped.Load(); got != 999 { + t.Errorf("dropped = %d, want 999 (1000 events, 1 logged, 999 suppressed)", got) + } + + if got := tr.snapshot().Resets; got != 1000 { + t.Errorf("Resets = %d, want 1000 (counter must record every event regardless of log throttle)", got) + } +} + +func TestMismatchTracker_RateLimit_AllowsAfterInterval(t *testing.T) { + t.Parallel() + + var tr mismatchTracker + + // First call wins the slot. 
+ tr.recordReset("foo.go", 1) + first := tr.lastLogNanos.Load() + + // Force the next call into a fresh interval by rewinding the timestamp. + tr.lastLogNanos.Store(first - mismatchLogIntervalNanos - 1) + + // Reset dropped so we can verify the second call resets the dropped tail. + tr.dropped.Store(5) + + tr.recordReset("bar.go", 2) + + if got := tr.dropped.Load(); got != 0 { + t.Errorf("dropped after fresh interval = %d, want 0 (Swap should clear it on emit)", got) + } + + if tr.lastLogNanos.Load() == first { + t.Errorf("lastLogNanos did not advance — second call did not claim the slot") + } +} + +func TestMismatchTracker_ChunkDelta_TracksSinceBaseline(t *testing.T) { + t.Parallel() + + var tr mismatchTracker + + tr.recordReset("a", 1) + tr.recordReset("b", 1) + tr.resetChunkBaseline() + + if got := tr.chunkDelta(); got != 0 { + t.Errorf("chunkDelta after baseline = %d, want 0", got) + } + + tr.recordReset("c", 1) + tr.recordForceRemove("d", 1) + + if got := tr.chunkDelta(); got != 2 { + t.Errorf("chunkDelta after 2 events = %d, want 2", got) + } + + // Cumulative counters keep climbing. + if got := tr.snapshot().Total(); got != 4 { + t.Errorf("Total = %d, want 4 (cumulative across baseline reset)", got) + } +} + +func TestMismatchTracker_ConcurrentRecord_NoLostUpdates(t *testing.T) { + t.Parallel() + + var ( + tr mismatchTracker + wg sync.WaitGroup + perWorker = int64(500) + workers = 8 + ) + + wg.Add(workers) + + for range workers { + go func() { + defer wg.Done() + + for range int(perWorker) { + tr.recordReset("x", 1) + } + }() + } + + wg.Wait() + + want := perWorker * int64(workers) + if got := tr.snapshot().Resets; got != want { + t.Errorf("Resets = %d, want %d (concurrent atomic updates must not lose any)", got, want) + } + + // At most one log per interval; bound the number that could have won + // the slot during this short test (a few, definitely not all). 
+ logged := tr.snapshot().Resets - tr.dropped.Load() + if logged < 1 { + t.Errorf("logged events = %d, want at least 1", logged) + } + + if logged > want { + t.Errorf("logged events = %d > total = %d, dropped count is broken", logged, want) + } +} + +func TestMismatchStats_Total_SumsBothCounters(t *testing.T) { + t.Parallel() + + s := MismatchStats{Resets: 7, ForceRemoves: 3} + + if got := s.Total(); got != 10 { + t.Errorf("Total = %d, want 10", got) + } +} diff --git a/internal/analyzers/burndown/store_writer_test.go b/internal/analyzers/burndown/store_writer_test.go index a261725..8294877 100644 --- a/internal/analyzers/burndown/store_writer_test.go +++ b/internal/analyzers/burndown/store_writer_test.go @@ -1,7 +1,5 @@ package burndown -// FRD: specs/frds/FRD-20260301-burndown-filehistory-store-writer.md. - import ( "context" "testing" diff --git a/internal/analyzers/clones/aggregator.go b/internal/analyzers/clones/aggregator.go index 0a95680..4f74d4b 100644 --- a/internal/analyzers/clones/aggregator.go +++ b/internal/analyzers/clones/aggregator.go @@ -99,14 +99,14 @@ func (a *Aggregator) GetResult() analyze.Report { return buildEmptyReport(msgNoFunctions) } - pairs, totalCount := a.detectGlobalClones() + result := a.detectGlobalClones() - cloneRatio := computeCloneRatio(totalCount, a.totalFunctions) - message := cloneMessage(totalCount) + cloneRatio := computeCloneRatio(len(result.clonedFunc), a.totalFunctions) + message := cloneMessage(result.totalCount) - pairsForReport := make([]map[string]any, 0, len(pairs)) + pairsForReport := make([]map[string]any, 0, len(result.pairs)) - for _, p := range pairs { + for _, p := range result.pairs { pairsForReport = append(pairsForReport, map[string]any{ "func_a": p.FuncA, "func_b": p.FuncB, @@ -116,25 +116,25 @@ func (a *Aggregator) GetResult() analyze.Report { } return analyze.Report{ - keyAnalyzerName: analyzerName, - keyTotalFunctions: a.totalFunctions, - keyTotalClonePairs: totalCount, - keyCloneRatio: cloneRatio, - 
keyClonePairs: pairsForReport, - keyMessage: message, + keyAnalyzerName: analyzerName, + keyTotalFunctions: a.totalFunctions, + keyTotalClonePairs: result.totalCount, + keyCloneRatio: cloneRatio, + keyClonePairs: pairsForReport, + keyCloneTypeDistribution: cloneTypeDistMap(result.typeDistribution), + keyMessage: message, } } // detectGlobalClones builds a single LSH index from all entries and finds clone pairs. -// Returns the (possibly capped) pairs slice and the exact total count of all pairs found. -func (a *Aggregator) detectGlobalClones() (pairs []ClonePair, totalCount int) { +func (a *Aggregator) detectGlobalClones() clonePairResult { if len(a.entries) == 0 { - return nil, 0 + return clonePairResult{} } idx, err := lsh.New(a.NumBands, a.NumRows) if err != nil { - return nil, 0 + return clonePairResult{} } for _, entry := range a.entries { diff --git a/internal/analyzers/clones/analyzer.go b/internal/analyzers/clones/analyzer.go index 3cf10f3..181b0ca 100644 --- a/internal/analyzers/clones/analyzer.go +++ b/internal/analyzers/clones/analyzer.go @@ -4,6 +4,7 @@ import ( "encoding/json" "fmt" "io" + "strings" "gopkg.in/yaml.v3" @@ -29,6 +30,13 @@ const ( // numRows is the number of rows per LSH band. numRows = 8 + // minFunctionNodes is the minimum number of AST nodes a function must have + // to be included in clone detection. Functions below this threshold are + // trivial (getters, setters, return-nil stubs) and produce false positives + // because their minimal AST structure hashes identically regardless of purpose. + // Empirical: getters ≈ 13-15 nodes, setters ≈ 19, real logic ≥ 25. + minFunctionNodes = 20 + // analyzerName is the registered name of the clone detection analyzer. analyzerName = "clones" @@ -315,9 +323,9 @@ func (a *Analyzer) detectClones(functions []*node.Node) []ClonePair { } // Per-file detection: no cap (single-file scope, bounded by function count). 
- pairs, _ := findClonePairs(entries, idx, 0, a.cfgSimilarityType3) + result := findClonePairs(entries, idx, 0, a.cfgSimilarityType3) - return pairs + return result.pairs } // buildSignatures computes MinHash signatures for all functions. @@ -325,6 +333,10 @@ func (a *Analyzer) buildSignatures(functions []*node.Node) []funcEntry { entries := make([]funcEntry, 0, len(functions)) for _, fn := range functions { + if countNodes(fn) < minFunctionNodes { + continue + } + shingles := a.shingler.ExtractShingles(fn) if len(shingles) == 0 { continue @@ -350,22 +362,86 @@ func (a *Analyzer) buildSignatures(functions []*node.Node) []funcEntry { return entries } -// extractFuncName extracts the function name from a node. +// extractFuncName extracts a unique function name from a node. +// For methods, qualifies with the receiver type (e.g., "Foo.DoWork") to avoid +// collisions in the LSH index when different types share the same method name. func extractFuncName(fn *node.Node) string { - if name, ok := common.ExtractEntityName(fn); ok && name != "" { - return name + name, ok := common.ExtractEntityName(fn) + if !ok || name == "" { + if fn.Token != "" { + name = fn.Token + } else { + name = string(fn.Type) + } + } + + if fn.Type == node.UASTMethod { + if recv := extractReceiverType(fn); recv != "" { + return recv + "." + name + } + } + + return name +} + +// extractReceiverType extracts the receiver type name from a Method node. +// The UAST represents the receiver as the first Parameter child with a token +// like "(f *Foo)" or "(f Foo)". +func extractReceiverType(fn *node.Node) string { + for _, child := range fn.Children { + if !child.HasAnyRole(node.RoleParameter) { + continue + } + + // The receiver parameter token contains the full "(name *Type)" text. + tok := child.Token + if tok == "" { + continue + } + + // Extract the type name: strip parens, pointer star, and variable name. + // Strip parens, pointer star, and variable name to extract the type. 
+ tok = strings.TrimPrefix(tok, "(") + tok = strings.TrimSuffix(tok, ")") + tok = strings.TrimSpace(tok) + + // Split "f *Foo" into parts, take the last one (the type). + parts := strings.Fields(tok) + // Receiver has at least two parts: variable name and type. + const minReceiverParts = 2 + if len(parts) < minReceiverParts { + continue + } + + typeName := parts[len(parts)-1] + typeName = strings.TrimPrefix(typeName, "*") + + if typeName != "" { + return typeName + } + } + + return "" +} + +// countNodes returns the total number of nodes in a subtree. +func countNodes(n *node.Node) int { + if n == nil { + return 0 } - if fn.Token != "" { - return fn.Token + count := 1 + + for _, child := range n.Children { + count += countNodes(child) } - return string(fn.Type) + return count } // buildReport constructs the analysis report. func (a *Analyzer) buildReport(totalFunctions int, pairs []ClonePair) analyze.Report { - cloneRatio := computeCloneRatio(len(pairs), totalFunctions) + cloneRatio := computeCloneRatio(countDistinctFuncs(pairs), totalFunctions) message := cloneMessage(len(pairs)) pairsForReport := make([]map[string]any, 0, len(pairs)) @@ -380,12 +456,13 @@ func (a *Analyzer) buildReport(totalFunctions int, pairs []ClonePair) analyze.Re } return analyze.Report{ - keyAnalyzerName: analyzerName, - keyTotalFunctions: totalFunctions, - keyTotalClonePairs: len(pairs), - keyCloneRatio: cloneRatio, - keyClonePairs: pairsForReport, - keyMessage: message, + keyAnalyzerName: analyzerName, + keyTotalFunctions: totalFunctions, + keyTotalClonePairs: len(pairs), + keyCloneRatio: cloneRatio, + keyClonePairs: pairsForReport, + keyCloneTypeDistribution: cloneTypeDistMap(categorizeClonePairs(pairs)), + keyMessage: message, } } @@ -400,13 +477,26 @@ func buildEmptyReport(message string) analyze.Report { }) } -// computeCloneRatio calculates the ratio of clone pairs to total functions. 
-func computeCloneRatio(pairCount, totalFunctions int) float64 { - if totalFunctions == 0 { +// countDistinctFuncs returns the number of unique function names across all pairs. +func countDistinctFuncs(pairs []ClonePair) int { + unique := make(map[string]struct{}, len(pairs)) + + for idx := range pairs { + unique[pairs[idx].FuncA] = struct{}{} + unique[pairs[idx].FuncB] = struct{}{} + } + + return len(unique) +} + +// computeCloneRatio calculates the fraction of functions involved in at least one clone pair. +// Returns a value in [0, 1]: 0 means no duplication, 1 means every function has a clone. +func computeCloneRatio(clonedFuncs, totalFunctions int) float64 { + if totalFunctions == 0 || clonedFuncs == 0 { return 0.0 } - return float64(pairCount) / float64(totalFunctions) + return float64(clonedFuncs) / float64(totalFunctions) } // cloneMessage returns a human-readable message based on clone pair count. diff --git a/internal/analyzers/clones/analyzer_test.go b/internal/analyzers/clones/analyzer_test.go index 3ed0fc3..e589153 100644 --- a/internal/analyzers/clones/analyzer_test.go +++ b/internal/analyzers/clones/analyzer_test.go @@ -30,10 +30,17 @@ func buildFunctionNode(name string, childTypes []node.Type) *node.Node { WithRoles([]node.Role{node.RoleFunction, node.RoleDeclaration}). Build() + // Build nested subtrees so the total node count exceeds minFunctionNodes. + // Each child gets 2 sub-children to produce realistic AST depth. 
children := make([]*node.Node, 0, len(childTypes)) - for _, ct := range childTypes { + for i, ct := range childTypes { child := node.NewBuilder().WithType(ct).Build() + + sub1 := node.NewBuilder().WithType(childTypes[i%len(childTypes)]).Build() + sub2 := node.NewBuilder().WithType(childTypes[(i+1)%len(childTypes)]).Build() + child.Children = []*node.Node{sub1, sub2} + children = append(children, child) } @@ -407,9 +414,9 @@ func TestShingler_ExtractShingles_Valid(t *testing.T) { shingles := s.ExtractShingles(fn) require.NotNil(t, shingles) - // Function node itself + 8 children = 9 nodes. - // With k=5: 9 - 5 + 1 = 5 shingles. - assert.Len(t, shingles, defaultShingleSize) + // Function node + 8 children × 3 nodes each = 25 nodes. + // With k=5: 25 - 5 + 1 = 21 shingles. + assert.Len(t, shingles, 21) } // TestShingler_ExtractShingles_Deterministic verifies same tree produces same shingles. @@ -450,13 +457,18 @@ func TestClonePairKey(t *testing.T) { assert.Equal(t, key1, key2) } -// TestComputeCloneRatio verifies ratio computation. +// TestComputeCloneRatio verifies ratio = distinct cloned functions / total functions. func TestComputeCloneRatio(t *testing.T) { t.Parallel() assert.InDelta(t, 0.0, computeCloneRatio(0, 0), testFloatDelta) assert.InDelta(t, 0.0, computeCloneRatio(0, 10), testFloatDelta) - assert.InDelta(t, 0.5, computeCloneRatio(5, 10), testFloatDelta) + + // 2 distinct cloned functions out of 10 → 0.2. + assert.InDelta(t, 0.2, computeCloneRatio(2, 10), testFloatDelta) + + // 4 out of 4 → 1.0. + assert.InDelta(t, 1.0, computeCloneRatio(4, 4), testFloatDelta) } // TestCloneMessage verifies message selection. @@ -801,10 +813,10 @@ func TestAggregator_RecomputedCloneRatio(t *testing.T) { result := agg.GetResult() assert.Equal(t, 3, result[keyTotalFunctions]) - // 1 clone pair / 3 functions = 0.333... + // 1 clone pair → 2 distinct cloned functions out of 3 → 2/3 ≈ 0.667. 
+ ratio, ok := result[keyCloneRatio].(float64) require.True(t, ok) - assert.InDelta(t, 1.0/3.0, ratio, testFloatDelta) + assert.InDelta(t, 2.0/3.0, ratio, testFloatDelta) } // TestAggregator_NoDedupByFuncA verifies multiple pairs sharing func_a name all appear. @@ -924,8 +936,6 @@ func TestExtractFuncName(t *testing.T) { assert.Equal(t, string(node.UASTFunction), extractFuncName(fn3)) } -// FRD: specs/frds/FRD-20260311-clones-pair-cap.md. - // TestAggregator_MaxClonePairs_Default verifies NewAggregator sets default cap. func TestAggregator_MaxClonePairs_Default(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/clones/benchmark_test.go b/internal/analyzers/clones/benchmark_test.go index d1b0c5d..e6eaa3f 100644 --- a/internal/analyzers/clones/benchmark_test.go +++ b/internal/analyzers/clones/benchmark_test.go @@ -1,7 +1,5 @@ package clones -// FRD: specs/frds/FRD-20260311-clones-pair-cap.md. - import ( "fmt" "runtime" diff --git a/internal/analyzers/clones/clone_ratio_fixture_test.go b/internal/analyzers/clones/clone_ratio_fixture_test.go new file mode 100644 index 0000000..1ad5b3f --- /dev/null +++ b/internal/analyzers/clones/clone_ratio_fixture_test.go @@ -0,0 +1,629 @@ +package clones + +import ( + "context" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/pkg/uast" +) + +// Fixture-based clone ratio tests validate that computeCloneRatio +// (distinct cloned functions / total functions) produces meaningful, bounded values +// for known duplication patterns parsed through the real UAST pipeline. + +// parseAndAnalyze parses Go source through UAST and runs the clone analyzer. 
+func parseAndAnalyze(t *testing.T, source string) analyze.Report { + t.Helper() + + parser, err := uast.NewParser() + require.NoError(t, err) + + root, parseErr := parser.Parse(context.Background(), "fixture.go", []byte(source)) + require.NoError(t, parseErr) + + analyzer := NewAnalyzer() + + report, analyzeErr := analyzer.Analyze(root) + require.NoError(t, analyzeErr) + + return report +} + +// reportFuncs extracts the total function count from a clone report. +func reportFuncs(t *testing.T, r analyze.Report) int { + t.Helper() + + v, ok := r[keyTotalFunctions].(int) + require.True(t, ok, "report must contain int %s", keyTotalFunctions) + + return v +} + +// reportPairs extracts the total clone pair count from a clone report. +func reportPairs(t *testing.T, r analyze.Report) int { + t.Helper() + + v, ok := r[keyTotalClonePairs].(int) + require.True(t, ok, "report must contain int %s", keyTotalClonePairs) + + return v +} + +// reportRatio extracts the clone ratio from a clone report. +func reportRatio(t *testing.T, r analyze.Report) float64 { + t.Helper() + + v, ok := r[keyCloneRatio].(float64) + require.True(t, ok, "report must contain float64 %s", keyCloneRatio) + + return v +} + +// fixtureAllUnique contains 4 functions with completely different logic. +// Expected: 0 clone pairs, ratio = 0. 
+const fixtureAllUnique = `package fixture + +func Sum(nums []int) int { + total := 0 + for _, n := range nums { + total += n + } + return total +} + +func Reverse(s string) string { + runes := []rune(s) + for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 { + runes[i], runes[j] = runes[j], runes[i] + } + return string(runes) +} + +func IsPrime(n int) bool { + if n < 2 { + return false + } + for i := 2; i*i <= n; i++ { + if n%i == 0 { + return false + } + } + return true +} + +func Fibonacci(n int) int { + if n <= 1 { + return n + } + a, b := 0, 1 + for i := 2; i <= n; i++ { + a, b = b, a+b + } + return b +} +` + +// fixtureAllIdentical contains 4 functions with identical bodies (Type-1 clones). +// Expected: 6 clone pairs (C(4,2)=6), ratio = 1.0. +const fixtureAllIdentical = `package fixture + +func ProcessA(data []int) int { + result := 0 + for _, v := range data { + if v > 0 { + result += v * 2 + } else { + result -= v + } + } + if result > 100 { + result = 100 + } + return result +} + +func ProcessB(data []int) int { + result := 0 + for _, v := range data { + if v > 0 { + result += v * 2 + } else { + result -= v + } + } + if result > 100 { + result = 100 + } + return result +} + +func ProcessC(data []int) int { + result := 0 + for _, v := range data { + if v > 0 { + result += v * 2 + } else { + result -= v + } + } + if result > 100 { + result = 100 + } + return result +} + +func ProcessD(data []int) int { + result := 0 + for _, v := range data { + if v > 0 { + result += v * 2 + } else { + result -= v + } + } + if result > 100 { + result = 100 + } + return result +} +` + +// fixtureRenamedClones contains 3 functions: 2 are Type-2 clones (same AST +// structure, different variable names), 1 is unique. 
+const fixtureRenamedClones = `package fixture + +func CalcScore(items []int) int { + score := 0 + for _, item := range items { + if item > 10 { + score += item * 3 + } else { + score += item + } + } + if score > 1000 { + score = 1000 + } + return score +} + +func ComputeTotal(entries []int) int { + total := 0 + for _, entry := range entries { + if entry > 10 { + total += entry * 3 + } else { + total += entry + } + } + if total > 1000 { + total = 1000 + } + return total +} + +func FormatOutput(s string) string { + if len(s) == 0 { + return "" + } + return "[" + s + "]" +} +` + +// fixtureHalfClones contains 6 functions: 3 identical clones + 3 unique. +// maxPairs = C(6,2) = 15, clone pairs among the 3 identical = C(3,2) = 3. +const fixtureHalfClones = `package fixture + +func CloneA(data []int) int { + sum := 0 + for i := 0; i < len(data); i++ { + if data[i] > 0 { + sum += data[i] + } + } + if sum > 500 { + sum = 500 + } + return sum +} + +func CloneB(data []int) int { + sum := 0 + for i := 0; i < len(data); i++ { + if data[i] > 0 { + sum += data[i] + } + } + if sum > 500 { + sum = 500 + } + return sum +} + +func CloneC(data []int) int { + sum := 0 + for i := 0; i < len(data); i++ { + if data[i] > 0 { + sum += data[i] + } + } + if sum > 500 { + sum = 500 + } + return sum +} + +func UniqueX(n int) bool { + if n < 2 { + return false + } + for i := 2; i*i <= n; i++ { + if n%i == 0 { + return false + } + } + return true +} + +func UniqueY(s string) int { + count := 0 + for _, r := range s { + if r >= 'a' && r <= 'z' { + count++ + } + } + return count +} + +func UniqueZ(a, b int) int { + for b != 0 { + a, b = b, a%b + } + return a +} +` + +func TestFixture_AllUnique_ZeroRatio(t *testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureAllUnique) + require.Equal(t, 4, reportFuncs(t, report)) + + assert.InDelta(t, 0.0, reportRatio(t, report), 0.05, + "4 unique functions must produce near-zero clone ratio") +} + +func TestFixture_AllIdentical_FullRatio(t 
*testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureAllIdentical) + require.Equal(t, 4, reportFuncs(t, report)) + + assert.Equal(t, 6, reportPairs(t, report), + "4 identical functions must produce C(4,2)=6 clone pairs") + assert.InDelta(t, 1.0, reportRatio(t, report), 0.01, + "all-identical functions must produce ratio near 1.0") + + section := NewReportSection(report) + assert.InDelta(t, 0.0, section.Score(), 0.01) +} + +func TestFixture_RenamedClones_Detected(t *testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureRenamedClones) + require.Equal(t, 3, reportFuncs(t, report)) + assert.GreaterOrEqual(t, reportPairs(t, report), 1, + "Type-2 renamed clones must be detected") + + ratio := reportRatio(t, report) + assert.Greater(t, ratio, 0.0, "renamed clones must produce non-zero ratio") + assert.LessOrEqual(t, ratio, 1.0, "ratio must be bounded to [0, 1]") +} + +func TestFixture_HalfClones_PartialRatio(t *testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureHalfClones) + require.Equal(t, 6, reportFuncs(t, report)) + assert.GreaterOrEqual(t, reportPairs(t, report), 3, + "3 identical + 3 unique must produce at least 3 clone pairs") + + ratio := reportRatio(t, report) + // 3 cloned functions out of 6 total → 0.5. 
+ assert.InDelta(t, 0.5, ratio, 0.1, "ratio must reflect partial duplication") +} + +func TestFixture_RatioBounded(t *testing.T) { + t.Parallel() + + fixtures := map[string]string{ + "all_unique": fixtureAllUnique, + "all_identical": fixtureAllIdentical, + "renamed": fixtureRenamedClones, + "half_clones": fixtureHalfClones, + } + + for name, source := range fixtures { + t.Run(name, func(t *testing.T) { + t.Parallel() + + ratio := reportRatio(t, parseAndAnalyze(t, source)) + assert.GreaterOrEqual(t, ratio, 0.0, "clone ratio must be >= 0") + assert.LessOrEqual(t, ratio, 1.0, "clone ratio must be <= 1") + }) + } +} + +func TestFixture_MonotonicOrdering(t *testing.T) { + t.Parallel() + + ratioUnique := reportRatio(t, parseAndAnalyze(t, fixtureAllUnique)) + ratioHalf := reportRatio(t, parseAndAnalyze(t, fixtureHalfClones)) + ratioFull := reportRatio(t, parseAndAnalyze(t, fixtureAllIdentical)) + + assert.Less(t, ratioUnique, ratioHalf, "unique < half-cloned") + assert.Less(t, ratioHalf, ratioFull, "half-cloned < fully-cloned") +} + +// Kubernetes-derived fixtures: real patterns from kubernetes/kubernetes +// adapted to be self-contained. Validates detection on production-grade code. + +// fixtureK8sValidation is adapted from pkg/apis/rbac/validation. +// ValidateRoleBinding and ValidateClusterRoleBinding are near-identical. 
+const fixtureK8sValidation = `package fixture + +type ErrorList []string +type ObjectMeta struct{ Name string } + +type Ref struct{ APIGroup, Kind, Name string } +type Subject struct{ Name string } +type RoleBinding struct{ ObjectMeta; Role Ref; Subjects []Subject } +type ClusterRoleBinding struct{ ObjectMeta; Role Ref; Subjects []Subject } + +func ValidateRoleBinding(rb *RoleBinding) ErrorList { + allErrs := ErrorList{} + if rb.ObjectMeta.Name == "" { + allErrs = append(allErrs, "metadata.name is required") + } + if rb.Role.APIGroup != "rbac.authorization.k8s.io" { + allErrs = append(allErrs, "roleRef.apiGroup not supported") + } + switch rb.Role.Kind { + case "Role", "ClusterRole": + default: + allErrs = append(allErrs, "roleRef.kind not supported") + } + if len(rb.Role.Name) == 0 { + allErrs = append(allErrs, "roleRef.name is required") + } + for _, subject := range rb.Subjects { + if subject.Name == "" { + allErrs = append(allErrs, "subject.name is required") + } + } + return allErrs +} + +func ValidateClusterRoleBinding(rb *ClusterRoleBinding) ErrorList { + allErrs := ErrorList{} + if rb.ObjectMeta.Name == "" { + allErrs = append(allErrs, "metadata.name is required") + } + if rb.Role.APIGroup != "rbac.authorization.k8s.io" { + allErrs = append(allErrs, "roleRef.apiGroup not supported") + } + switch rb.Role.Kind { + case "ClusterRole": + default: + allErrs = append(allErrs, "roleRef.kind not supported") + } + if len(rb.Role.Name) == 0 { + allErrs = append(allErrs, "roleRef.name is required") + } + for _, subject := range rb.Subjects { + if subject.Name == "" { + allErrs = append(allErrs, "subject.name is required") + } + } + return allErrs +} + +func ValidateRoleBindingUpdate(rb *RoleBinding, old *RoleBinding) ErrorList { + allErrs := ValidateRoleBinding(rb) + if old.Role != rb.Role { + allErrs = append(allErrs, "cannot change roleRef") + } + return allErrs +} + +func ValidateClusterRoleBindingUpdate(rb *ClusterRoleBinding, old *ClusterRoleBinding) ErrorList { 
+ allErrs := ValidateClusterRoleBinding(rb) + if old.Role != rb.Role { + allErrs = append(allErrs, "cannot change roleRef") + } + return allErrs +} +` + +// fixtureK8sEventHandlers is adapted from client-go/tools/cache/controller.go. +// Three receiver types implement OnAdd/OnUpdate/OnDelete. +const fixtureK8sEventHandlers = `package fixture + +type ResourceEventHandlerFuncs struct { + AddFunc func(obj interface{}) + UpdateFunc func(oldObj, newObj interface{}) + DeleteFunc func(obj interface{}) +} + +func (r ResourceEventHandlerFuncs) OnAdd(obj interface{}, isInInitialList bool) { + if r.AddFunc != nil { + r.AddFunc(obj) + } +} + +func (r ResourceEventHandlerFuncs) OnUpdate(oldObj, newObj interface{}) { + if r.UpdateFunc != nil { + r.UpdateFunc(oldObj, newObj) + } +} + +func (r ResourceEventHandlerFuncs) OnDelete(obj interface{}) { + if r.DeleteFunc != nil { + r.DeleteFunc(obj) + } +} + +type ResourceEventHandlerDetailedFuncs struct { + AddFunc func(obj interface{}, isInInitialList bool) + UpdateFunc func(oldObj, newObj interface{}) + DeleteFunc func(obj interface{}) +} + +func (r ResourceEventHandlerDetailedFuncs) OnAdd(obj interface{}, isInInitialList bool) { + if r.AddFunc != nil { + r.AddFunc(obj, isInInitialList) + } +} + +func (r ResourceEventHandlerDetailedFuncs) OnUpdate(oldObj, newObj interface{}) { + if r.UpdateFunc != nil { + r.UpdateFunc(oldObj, newObj) + } +} + +func (r ResourceEventHandlerDetailedFuncs) OnDelete(obj interface{}) { + if r.DeleteFunc != nil { + r.DeleteFunc(obj) + } +} + +type FilteringResourceEventHandler struct { + FilterFunc func(obj interface{}) bool + Handler interface{ OnAdd(interface{}, bool); OnUpdate(interface{}, interface{}); OnDelete(interface{}) } +} + +func (r FilteringResourceEventHandler) OnAdd(obj interface{}, isInInitialList bool) { + if !r.FilterFunc(obj) { + return + } + r.Handler.OnAdd(obj, isInInitialList) +} + +func (r FilteringResourceEventHandler) OnUpdate(oldObj, newObj interface{}) { + newer := 
r.FilterFunc(newObj) + older := r.FilterFunc(oldObj) + switch { + case newer && older: + r.Handler.OnUpdate(oldObj, newObj) + case newer && !older: + r.Handler.OnAdd(newObj, false) + case !newer && older: + r.Handler.OnDelete(oldObj) + } +} + +func (r FilteringResourceEventHandler) OnDelete(obj interface{}) { + if !r.FilterFunc(obj) { + return + } + r.Handler.OnDelete(obj) +} +` + +// fixtureK8sDeepCopy is adapted from zz_generated.deepcopy.go files. +// Machine-generated DeepCopyInto methods on different receiver types. +const fixtureK8sDeepCopy = `package fixture + +type TokenConfig struct{ Token, TTL, Expires *int64; Usages, Groups []string } +type SecretConfig struct{ Name, TTL, Expires *int64; Labels, Scopes []string } +type CertConfig struct{ Issuer, TTL, Expires *int64; SANs, Orgs []string } + +func (in *TokenConfig) DeepCopyInto(out *TokenConfig) { + *out = *in + if in.Token != nil { cp := *in.Token; out.Token = &cp } + if in.TTL != nil { cp := *in.TTL; out.TTL = &cp } + if in.Expires != nil { cp := *in.Expires; out.Expires = &cp } + if in.Usages != nil { out.Usages = make([]string, len(in.Usages)); copy(out.Usages, in.Usages) } + if in.Groups != nil { out.Groups = make([]string, len(in.Groups)); copy(out.Groups, in.Groups) } +} + +func (in *SecretConfig) DeepCopyInto(out *SecretConfig) { + *out = *in + if in.Name != nil { cp := *in.Name; out.Name = &cp } + if in.TTL != nil { cp := *in.TTL; out.TTL = &cp } + if in.Expires != nil { cp := *in.Expires; out.Expires = &cp } + if in.Labels != nil { out.Labels = make([]string, len(in.Labels)); copy(out.Labels, in.Labels) } + if in.Scopes != nil { out.Scopes = make([]string, len(in.Scopes)); copy(out.Scopes, in.Scopes) } +} + +func (in *CertConfig) DeepCopyInto(out *CertConfig) { + *out = *in + if in.Issuer != nil { cp := *in.Issuer; out.Issuer = &cp } + if in.TTL != nil { cp := *in.TTL; out.TTL = &cp } + if in.Expires != nil { cp := *in.Expires; out.Expires = &cp } + if in.SANs != nil { out.SANs = make([]string, 
len(in.SANs)); copy(out.SANs, in.SANs) } + if in.Orgs != nil { out.Orgs = make([]string, len(in.Orgs)); copy(out.Orgs, in.Orgs) } +} +` + +func TestFixtureK8s_Validation_DetectsClonePairs(t *testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureK8sValidation) + require.Equal(t, 4, reportFuncs(t, report)) + assert.GreaterOrEqual(t, reportPairs(t, report), 2, + "RBAC validation clones must produce at least 2 clone pairs") + + ratio := reportRatio(t, report) + assert.Greater(t, ratio, 0.0) + assert.LessOrEqual(t, ratio, 1.0) +} + +func TestFixtureK8s_EventHandlers_DetectsClones(t *testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureK8sEventHandlers) + assert.GreaterOrEqual(t, reportFuncs(t, report), 9) + assert.GreaterOrEqual(t, reportPairs(t, report), 1, + "identical handler methods across receiver types must be detected") + + ratio := reportRatio(t, report) + assert.Greater(t, ratio, 0.0, "event handler clones must produce non-zero ratio") + assert.LessOrEqual(t, ratio, 1.0) +} + +func TestFixtureK8s_DeepCopy_HighCloneRatio(t *testing.T) { + t.Parallel() + + report := parseAndAnalyze(t, fixtureK8sDeepCopy) + require.Equal(t, 3, reportFuncs(t, report)) + assert.Equal(t, 3, reportPairs(t, report), + "3 identical DeepCopyInto methods must produce C(3,2)=3 clone pairs") + assert.InDelta(t, 1.0, reportRatio(t, report), 0.01, + "all-identical DeepCopyInto methods must produce ratio near 1.0") +} + +func TestFixtureK8s_AllBounded(t *testing.T) { + t.Parallel() + + fixtures := map[string]string{ + "validation": fixtureK8sValidation, + "event_handlers": fixtureK8sEventHandlers, + "deepcopy": fixtureK8sDeepCopy, + } + + for name, source := range fixtures { + t.Run(name, func(t *testing.T) { + t.Parallel() + + ratio := reportRatio(t, parseAndAnalyze(t, source)) + assert.GreaterOrEqual(t, ratio, 0.0, "clone ratio must be >= 0") + assert.LessOrEqual(t, ratio, 1.0, "clone ratio must be <= 1") + }) + } +} diff --git 
a/internal/analyzers/clones/report.go b/internal/analyzers/clones/report.go index 2eb93f7..6deb751 100644 --- a/internal/analyzers/clones/report.go +++ b/internal/analyzers/clones/report.go @@ -32,13 +32,14 @@ const DefaultMaxClonePairs = 1000 // Report keys. const ( - keyAnalyzerName = "analyzer_name" - keyTotalClonePairs = "total_clone_pairs" - keyClonePairs = "clone_pairs" - keyTotalFunctions = "total_functions" - keyMessage = "message" - keyCloneRatio = "clone_ratio" - keyFuncSignatures = "_func_signatures" + keyAnalyzerName = "analyzer_name" + keyTotalClonePairs = "total_clone_pairs" + keyClonePairs = "clone_pairs" + keyTotalFunctions = "total_functions" + keyMessage = "message" + keyCloneRatio = "clone_ratio" + keyFuncSignatures = "_func_signatures" + keyCloneTypeDistribution = "clone_type_distribution" ) // ClonePair represents a detected clone relationship between two functions. @@ -51,11 +52,12 @@ type ClonePair struct { // ComputedMetrics holds computed clone detection metrics for JSON/YAML/binary export. type ComputedMetrics struct { - TotalFunctions int `json:"total_functions" yaml:"total_functions"` - TotalClonePairs int `json:"total_clone_pairs" yaml:"total_clone_pairs"` - CloneRatio float64 `json:"clone_ratio" yaml:"clone_ratio"` - ClonePairs []ClonePair `json:"clone_pairs" yaml:"clone_pairs"` - Message string `json:"message" yaml:"message"` + TotalFunctions int `json:"total_functions" yaml:"total_functions"` + TotalClonePairs int `json:"total_clone_pairs" yaml:"total_clone_pairs"` + CloneRatio float64 `json:"clone_ratio" yaml:"clone_ratio"` + CloneTypeDist map[string]int `json:"clone_type_distribution,omitempty" yaml:"clone_type_distribution,omitempty"` + ClonePairs []ClonePair `json:"clone_pairs" yaml:"clone_pairs"` + Message string `json:"message" yaml:"message"` } // cloneTypeClassifier classifies clone similarity into clone types. 
@@ -115,6 +117,10 @@ func computeMetricsFromReport(report map[string]any) *ComputedMetrics { metrics.ClonePairs = extractClonePairs(report) + if v, ok := report[keyCloneTypeDistribution].(map[string]int); ok { + metrics.CloneTypeDist = v + } + return metrics } diff --git a/internal/analyzers/clones/report_section.go b/internal/analyzers/clones/report_section.go index 6af3762..b53bc7d 100644 --- a/internal/analyzers/clones/report_section.go +++ b/internal/analyzers/clones/report_section.go @@ -50,13 +50,18 @@ func NewReportSection(report analyze.Report) *ReportSection { } // computeScore converts clone ratio to a 0-1 score (lower ratio = higher score). +// Clone ratio is pairs/functions which can exceed 1.0 (quadratic pair growth), +// so we clamp to [0, 1] before inverting. func computeScore(cloneRatio float64) float64 { - score := 1.0 - cloneRatio - if score < 0 { + if cloneRatio >= 1.0 { return 0.0 } - return score + if cloneRatio <= 0.0 { + return 1.0 + } + + return 1.0 - cloneRatio } // KeyMetrics returns ordered key metrics for display. @@ -69,15 +74,13 @@ func (s *ReportSection) KeyMetrics() []analyze.Metric { } // Distribution returns clone type distribution data. +// Uses the full-population distribution when available, falling back to the capped pairs array. 
func (s *ReportSection) Distribution() []analyze.DistributionItem { - pairs := extractClonePairs(s.report) - if len(pairs) == 0 { + counts, total := s.extractDistribution() + if total == 0 { return nil } - counts := categorizeClonePairs(pairs) - total := len(pairs) - return []analyze.DistributionItem{ {Label: distLabelType1, Percent: reportutil.Pct(counts.type1, total), Count: counts.type1}, {Label: distLabelType2, Percent: reportutil.Pct(counts.type2, total), Count: counts.type2}, @@ -85,6 +88,22 @@ func (s *ReportSection) Distribution() []analyze.DistributionItem { } } +func (s *ReportSection) extractDistribution() (counts cloneTypeCounts, total int) { + if dist, ok := s.report[keyCloneTypeDistribution].(map[string]int); ok { + counts = cloneTypeCounts{ + type1: dist[CloneType1], + type2: dist[CloneType2], + type3: dist[CloneType3], + } + + return counts, counts.type1 + counts.type2 + counts.type3 + } + + pairs := extractClonePairs(s.report) + + return categorizeClonePairs(pairs), len(pairs) +} + // cloneTypeCounts holds counts per clone type. type cloneTypeCounts struct { type1 int @@ -92,6 +111,27 @@ type cloneTypeCounts struct { type3 int } +// increment adds one to the counter for the given clone type. +func (c *cloneTypeCounts) increment(cloneType string) { + switch cloneType { + case CloneType1: + c.type1++ + case CloneType2: + c.type2++ + case CloneType3: + c.type3++ + } +} + +// cloneTypeDistMap converts counts to a string-keyed map for JSON serialization. +func cloneTypeDistMap(c cloneTypeCounts) map[string]int { + return map[string]int{ + CloneType1: c.type1, + CloneType2: c.type2, + CloneType3: c.type3, + } +} + // categorizeClonePairs counts clone pairs by type. 
func categorizeClonePairs(pairs []ClonePair) cloneTypeCounts { counts := cloneTypeCounts{} diff --git a/internal/analyzers/clones/report_section_test.go b/internal/analyzers/clones/report_section_test.go index 98b2b70..7447b2a 100644 --- a/internal/analyzers/clones/report_section_test.go +++ b/internal/analyzers/clones/report_section_test.go @@ -45,6 +45,17 @@ func TestCloneSection_Score(t *testing.T) { assert.InDelta(t, 0.7, s.Score(), 1e-9) } +func TestCloneSection_Score_HighRatio(t *testing.T) { + t.Parallel() + + // Clone ratio can exceed 1.0 (pairs grow quadratically). + // 93.6 pairs/function → score must clamp to 0.0, not go negative. + s := NewReportSection(analyze.Report{ + keyCloneRatio: 93.6, + }) + assert.InDelta(t, 0.0, s.Score(), 1e-9) +} + func TestCloneSection_StatusMessage(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/clones/visitor.go b/internal/analyzers/clones/visitor.go index 9292c01..5a9fdc1 100644 --- a/internal/analyzers/clones/visitor.go +++ b/internal/analyzers/clones/visitor.go @@ -59,6 +59,10 @@ func (v *Visitor) buildSignatures() []funcEntry { entries := make([]funcEntry, 0, len(v.functions)) for _, fn := range v.functions { + if countNodes(fn) < minFunctionNodes { + continue + } + shingles := v.shingler.ExtractShingles(fn) if len(shingles) == 0 { continue @@ -110,9 +114,18 @@ func buildSignatureReport(totalFunctions int, entries []funcEntry) analyze.Repor -// findClonePairs queries the LSH index and collects unique clone pairs. -// pairCap limits the stored pairs slice (0 = unlimited). The returned totalCount -// reflects ALL unique pairs found, regardless of the cap. -func findClonePairs(entries []funcEntry, idx *lsh.Index, pairCap int, minSimilarity float64) (pairs []ClonePair, totalCount int) { +// clonePairResult holds the output of findClonePairs.
+type clonePairResult struct { + pairs []ClonePair + totalCount int + typeDistribution cloneTypeCounts + clonedFunc map[string]struct{} // distinct function names involved in any pair. +} + +// findClonePairs queries the LSH index and collects unique clone pairs. +// pairCap limits the stored pairs slice (0 = unlimited). The returned totalCount +// reflects ALL unique pairs found, regardless of the cap. +func findClonePairs(entries []funcEntry, idx *lsh.Index, pairCap int, minSimilarity float64) clonePairResult { seen := make(map[PairKey]bool) sigMap := buildSignatureMap(entries) + result := clonePairResult{clonedFunc: make(map[string]struct{})} for _, entry := range entries { candidates, err := idx.QueryThreshold(entry.sig, minSimilarity) @@ -120,14 +133,14 @@ func findClonePairs(entries []funcEntry, idx *lsh.Index, pairCap int, minSimilar continue } - pairs, totalCount = matchCandidates(entry, candidates, sigMap, seen, pairs, totalCount, pairCap, minSimilarity) + result = matchCandidates(entry, candidates, sigMap, seen, result, pairCap, minSimilarity) } - sort.Slice(pairs, func(i, j int) bool { - return pairs[i].Similarity > pairs[j].Similarity + sort.Slice(result.pairs, func(i, j int) bool { + return result.pairs[i].Similarity > result.pairs[j].Similarity }) - return pairs, totalCount + return result } // buildSignatureMap creates a name-to-signature lookup from entries.
@@ -148,11 +161,10 @@ func matchCandidates( candidates []string, sigMap map[string]*minhash.Signature, seen map[PairKey]bool, - pairs []ClonePair, - totalCount int, + result clonePairResult, pairCap int, minSimilarity float64, -) (updatedPairs []ClonePair, updatedCount int) { +) clonePairResult { for _, candidateID := range candidates { if candidateID == entry.name { continue @@ -167,15 +179,18 @@ func matchCandidates( pair, ok := computeClonePair(entry, candidateID, sigMap, minSimilarity) if ok { - totalCount++ + result.totalCount++ + result.typeDistribution.increment(pair.CloneType) + result.clonedFunc[pair.FuncA] = struct{}{} + result.clonedFunc[pair.FuncB] = struct{}{} - if pairCap <= 0 || len(pairs) < pairCap { - pairs = append(pairs, pair) + if pairCap <= 0 || len(result.pairs) < pairCap { + result.pairs = append(result.pairs, pair) } } } - return pairs, totalCount + return result } // computeClonePair computes a clone pair between an entry and a candidate. diff --git a/internal/analyzers/cohesion/aggregator.go b/internal/analyzers/cohesion/aggregator.go index deb2766..73224ee 100644 --- a/internal/analyzers/cohesion/aggregator.go +++ b/internal/analyzers/cohesion/aggregator.go @@ -15,6 +15,7 @@ const ( // Aggregator aggregates results from multiple cohesion analyses. type Aggregator struct { *common.Aggregator + common.PerFileRetainer } // NewAggregator creates a new Aggregator. @@ -34,6 +35,15 @@ func NewAggregator() *Aggregator { } } +// Aggregate overrides the base Aggregate method to retain per-file reports. +func (a *Aggregator) Aggregate(results map[string]analyze.Report) { + for _, report := range results { + a.Retain(report) + } + + a.Aggregator.Aggregate(results) +} + // aggregatorConfig holds the configuration for the aggregator. 
type aggregatorConfig struct { messageBuilder func(float64) string diff --git a/internal/analyzers/cohesion/cohesion.go b/internal/analyzers/cohesion/cohesion.go index 9320220..f8de1f4 100644 --- a/internal/analyzers/cohesion/cohesion.go +++ b/internal/analyzers/cohesion/cohesion.go @@ -168,7 +168,6 @@ func (c *Analyzer) calculateMetrics(functions []Function) map[string]float64 { } // buildResult constructs the final analysis result. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (c *Analyzer) buildResult(functions []Function, metrics map[string]float64) analyze.Report { reportItems := c.buildDetailedFunctionsTable(functions) message := c.getCohesionMessage(metrics["cohesion_score"]) @@ -188,7 +187,6 @@ func (c *Analyzer) buildResult(functions []Function, metrics map[string]float64) } // FunctionReportItem is a typed representation of a per-function cohesion report item. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. type FunctionReportItem struct { Name string CohesionAssessment string @@ -200,7 +198,6 @@ type FunctionReportItem struct { } // buildDetailedFunctionsTable creates the detailed functions table as typed structs. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (c *Analyzer) buildDetailedFunctionsTable(functions []Function) []FunctionReportItem { items := make([]FunctionReportItem, 0, len(functions)) diff --git a/internal/analyzers/cohesion/metrics.go b/internal/analyzers/cohesion/metrics.go index a4853cb..240ab21 100644 --- a/internal/analyzers/cohesion/metrics.go +++ b/internal/analyzers/cohesion/metrics.go @@ -23,8 +23,11 @@ type ReportData struct { // FunctionData holds cohesion data for a single function. type FunctionData struct { - Name string - Cohesion float64 + Name string + SourceFile string + Language string + Directory string + Cohesion float64 } // ParseReportData extracts ReportData from an analyzer report. 
@@ -51,35 +54,62 @@ func ParseReportData(report analyze.Report) (*ReportData, error) { data.Message = v } - // Parse functions. - if functions, ok := report["functions"].([]map[string]any); ok { - data.Functions = make([]FunctionData, 0, len(functions)) + data.Functions = parseReportFunctions(report) - for _, fn := range functions { - fd := FunctionData{} + return data, nil +} - if name, nameOK := fn["name"].(string); nameOK { - fd.Name = name - } +func parseReportFunctions(report analyze.Report) []FunctionData { + functions, ok := report["functions"].([]map[string]any) + if !ok { + return nil + } - if v, vOK := fn["cohesion"].(float64); vOK { - fd.Cohesion = v - } + result := make([]FunctionData, 0, len(functions)) - data.Functions = append(data.Functions, fd) - } + for _, fn := range functions { + result = append(result, parseFunctionData(fn)) } - return data, nil + return result +} + +func parseFunctionData(fn map[string]any) FunctionData { + fd := FunctionData{} + + if name, ok := fn["name"].(string); ok { + fd.Name = name + } + + if sf, ok := fn[analyze.SourceFileKey].(string); ok { + fd.SourceFile = sf + } + + if lang, ok := fn[analyze.LanguageKey].(string); ok { + fd.Language = lang + } + + if dir, ok := fn[analyze.DirectoryKey].(string); ok { + fd.Directory = dir + } + + if v, ok := fn["cohesion"].(float64); ok { + fd.Cohesion = v + } + + return fd } // --- Output Data Types ---. // FunctionCohesionData contains cohesion data for a function. 
type FunctionCohesionData struct { - Name string `json:"name" yaml:"name"` - Cohesion float64 `json:"cohesion" yaml:"cohesion"` - QualityLevel string `json:"quality_level" yaml:"quality_level"` + Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` + Cohesion float64 `json:"cohesion" yaml:"cohesion"` + QualityLevel string `json:"quality_level" yaml:"quality_level"` } // MetricDist* constants are JSON-compatible distribution keys for metrics output. @@ -92,10 +122,13 @@ const ( // LowCohesionFunctionData identifies functions with poor cohesion. type LowCohesionFunctionData struct { - Name string `json:"name" yaml:"name"` - Cohesion float64 `json:"cohesion" yaml:"cohesion"` - RiskLevel string `json:"risk_level" yaml:"risk_level"` - Recommendation string `json:"recommendation" yaml:"recommendation"` + Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` + Cohesion float64 `json:"cohesion" yaml:"cohesion"` + RiskLevel string `json:"risk_level" yaml:"risk_level"` + Recommendation string `json:"recommendation" yaml:"recommendation"` } // AggregateData contains summary statistics. 
@@ -149,6 +182,9 @@ func (m *FunctionCohesionMetric) Compute(input *ReportData) []FunctionCohesionDa result = append(result, FunctionCohesionData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, Cohesion: fn.Cohesion, QualityLevel: qualityLevel, }) @@ -249,6 +285,9 @@ func (m *LowCohesionFunctionMetric) Compute(input *ReportData) []LowCohesionFunc result = append(result, LowCohesionFunctionData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, Cohesion: fn.Cohesion, RiskLevel: riskLevel, Recommendation: recommendation, diff --git a/internal/analyzers/cohesion/report_section.go b/internal/analyzers/cohesion/report_section.go index c2e9e92..dad3843 100644 --- a/internal/analyzers/cohesion/report_section.go +++ b/internal/analyzers/cohesion/report_section.go @@ -128,6 +128,7 @@ func (s *ReportSection) buildIssues() []analyze.Issue { coh := reportutil.GetFloat64(fn, KeyFuncCohesion) issues = append(issues, analyze.Issue{ Name: name, + Location: reportutil.MapString(fn, analyze.SourceFileKey), Value: reportutil.FormatFloat(coh), Severity: severityForCohesion(coh), }) diff --git a/internal/analyzers/comments/aggregator.go b/internal/analyzers/comments/aggregator.go index df41219..5666975 100644 --- a/internal/analyzers/comments/aggregator.go +++ b/internal/analyzers/comments/aggregator.go @@ -15,6 +15,7 @@ const ( // Aggregator aggregates results from multiple comment analyses. type Aggregator struct { *common.Aggregator + common.PerFileRetainer detailed *common.DetailedDataCollector } @@ -52,6 +53,10 @@ func (ca *Aggregator) SetAggregationMode(mode analyze.AggregationMode) { // Aggregate overrides the base Aggregate method to collect detailed comments and functions.
func (ca *Aggregator) Aggregate(results map[string]analyze.Report) { + for _, report := range results { + ca.Retain(report) + } + ca.detailed.CollectFromReports(results) ca.Aggregator.Aggregate(results) } diff --git a/internal/analyzers/comments/comments.go b/internal/analyzers/comments/comments.go index 0d04ace..9bf831b 100644 --- a/internal/analyzers/comments/comments.go +++ b/internal/analyzers/comments/comments.go @@ -609,7 +609,6 @@ func (c *Analyzer) buildEmptyResult() analyze.Report { } // buildResult builds the complete analysis result. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (c *Analyzer) buildResult(commentDetails []CommentDetail, functions []*node.Node, metrics CommentMetrics) analyze.Report { commentDetailsInterface := c.buildCommentDetailsInterface(commentDetails) detailedCommentsTable := c.buildDetailedCommentsTable(commentDetails) @@ -660,7 +659,6 @@ func (c *Analyzer) buildCommentDetailsInterface(commentDetails []CommentDetail) } // buildDetailedCommentsTable builds the detailed comments table as typed structs. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (c *Analyzer) buildDetailedCommentsTable(commentDetails []CommentDetail) []CommentReportItem { items := make([]CommentReportItem, 0, len(commentDetails)) for _, detail := range commentDetails { @@ -680,7 +678,6 @@ func (c *Analyzer) buildDetailedCommentsTable(commentDetails []CommentDetail) [] } // convertCommentReportItems converts typed comment items to []map[string]any for serialization. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func convertCommentReportItems(items any, sourceFile string) []map[string]any { typed, ok := items.([]CommentReportItem) if !ok { @@ -708,7 +705,6 @@ func convertCommentReportItems(items any, sourceFile string) []map[string]any { } // buildDetailedFunctionsTable builds the detailed functions table as typed structs. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. 
func (c *Analyzer) buildDetailedFunctionsTable(functions []*node.Node, metrics CommentMetrics) []FunctionReportItem { items := make([]FunctionReportItem, 0, len(functions)) for _, function := range functions { @@ -732,7 +728,6 @@ func (c *Analyzer) buildDetailedFunctionsTable(functions []*node.Node, metrics C } // convertFunctionReportItems converts typed function items to []map[string]any for serialization. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func convertFunctionReportItems(items any, sourceFile string) []map[string]any { typed, ok := items.([]FunctionReportItem) if !ok { diff --git a/internal/analyzers/comments/metrics.go b/internal/analyzers/comments/metrics.go index 621bea5..de5d933 100644 --- a/internal/analyzers/comments/metrics.go +++ b/internal/analyzers/comments/metrics.go @@ -25,6 +25,9 @@ type ReportData struct { // CommentData holds data for a single comment. type CommentData struct { LineNumber int + SourceFile string + Language string + Directory string Quality string Type string TargetType string @@ -37,6 +40,9 @@ type CommentData struct { // FunctionCommentData holds comment data for a function. 
type FunctionCommentData struct { Name string + SourceFile string + Language string + Directory string HasComment bool NeedsComment bool CommentScore float64 @@ -113,6 +119,18 @@ func parseComment(comment map[string]any) CommentData { cd.LineNumber = v } + if sf, ok := comment[analyze.SourceFileKey].(string); ok { + cd.SourceFile = sf + } + + if lang, ok := comment[analyze.LanguageKey].(string); ok { + cd.Language = lang + } + + if dir, ok := comment[analyze.DirectoryKey].(string); ok { + cd.Directory = dir + } + if v, ok := comment["quality"].(string); ok { cd.Quality = v } @@ -166,6 +184,18 @@ func parseFunctionComment(fn map[string]any) FunctionCommentData { fd.Name = v } + if sf, ok := fn[analyze.SourceFileKey].(string); ok { + fd.SourceFile = sf + } + + if lang, ok := fn[analyze.LanguageKey].(string); ok { + fd.Language = lang + } + + if dir, ok := fn[analyze.DirectoryKey].(string); ok { + fd.Directory = dir + } + if v, ok := fn["has_comment"].(bool); ok { fd.HasComment = v } @@ -190,6 +220,9 @@ func parseFunctionComment(fn map[string]any) FunctionCommentData { // CommentQualityData contains quality assessment for a comment. type CommentQualityData struct { LineNumber int `json:"line_number" yaml:"line_number"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` Quality string `json:"quality" yaml:"quality"` Type string `json:"type" yaml:"type"` TargetName string `json:"target_name" yaml:"target_name"` @@ -199,17 +232,23 @@ type CommentQualityData struct { // FunctionDocumentationData contains documentation status for a function. 
type FunctionDocumentationData struct { - Name string `json:"name" yaml:"name"` - IsDocumented bool `json:"is_documented" yaml:"is_documented"` - DocumentationScore float64 `json:"documentation_score" yaml:"documentation_score"` - Status string `json:"status" yaml:"status"` + Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` + IsDocumented bool `json:"is_documented" yaml:"is_documented"` + DocumentationScore float64 `json:"documentation_score" yaml:"documentation_score"` + Status string `json:"status" yaml:"status"` } // UndocumentedFunctionData identifies functions lacking documentation. type UndocumentedFunctionData struct { - Name string `json:"name" yaml:"name"` - NeedsComment bool `json:"needs_comment" yaml:"needs_comment"` - RiskLevel string `json:"risk_level" yaml:"risk_level"` + Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` + NeedsComment bool `json:"needs_comment" yaml:"needs_comment"` + RiskLevel string `json:"risk_level" yaml:"risk_level"` } // AggregateData contains summary statistics. 
@@ -251,6 +290,9 @@ func (m *CommentQualityMetric) Compute(input *ReportData) []CommentQualityData { for _, comment := range input.Comments { result = append(result, CommentQualityData{ LineNumber: comment.LineNumber, + SourceFile: comment.SourceFile, + Language: comment.Language, + Directory: comment.Directory, Quality: comment.Quality, Type: comment.Type, TargetName: comment.TargetName, @@ -314,6 +356,9 @@ func (m *FunctionDocumentationMetric) Compute(input *ReportData) []FunctionDocum result = append(result, FunctionDocumentationData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, IsDocumented: fn.HasComment, DocumentationScore: fn.CommentScore, Status: status, @@ -364,6 +409,9 @@ func (m *UndocumentedFunctionMetric) Compute(input *ReportData) []UndocumentedFu result = append(result, UndocumentedFunctionData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, NeedsComment: fn.NeedsComment, RiskLevel: riskLevel, }) diff --git a/internal/analyzers/comments/report_section.go b/internal/analyzers/comments/report_section.go index 71e7506..fd985ba 100644 --- a/internal/analyzers/comments/report_section.go +++ b/internal/analyzers/comments/report_section.go @@ -135,6 +135,7 @@ func (s *ReportSection) buildIssues() []analyze.Issue { name := reportutil.MapString(fn, KeyFuncName) issues = append(issues, analyze.Issue{ Name: name, + Location: reportutil.MapString(fn, analyze.SourceFileKey), Value: IssueValueNoDoc, Severity: analyze.SeverityPoor, }) diff --git a/internal/analyzers/comments/types.go b/internal/analyzers/comments/types.go index 5bfe9d9..dccbd1c 100644 --- a/internal/analyzers/comments/types.go +++ b/internal/analyzers/comments/types.go @@ -70,7 +70,6 @@ type CommentConfig struct { } // CommentReportItem is a typed representation of a per-comment report item. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. 
type CommentReportItem struct { Comment string Placement string @@ -80,7 +79,6 @@ type CommentReportItem struct { } // FunctionReportItem is a typed representation of a per-function report item. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. type FunctionReportItem struct { Function string Type string diff --git a/internal/analyzers/common/aggregation_mode_test.go b/internal/analyzers/common/aggregation_mode_test.go index aa56dce..067a5f2 100644 --- a/internal/analyzers/common/aggregation_mode_test.go +++ b/internal/analyzers/common/aggregation_mode_test.go @@ -1,7 +1,5 @@ package common -// FRD: specs/frds/FRD-20260311-summary-only-aggregation.md. - import ( "testing" @@ -133,8 +131,6 @@ func TestAggregator_ImplementsAggregationModeAware(t *testing.T) { require.NotNil(t, aware) } -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. - func TestAggregator_EstimatedStateSize_Empty(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/common/aggregator_bench_test.go b/internal/analyzers/common/aggregator_bench_test.go index aafced3..ef2bdf9 100644 --- a/internal/analyzers/common/aggregator_bench_test.go +++ b/internal/analyzers/common/aggregator_bench_test.go @@ -1,7 +1,5 @@ package common -// FRD: specs/frds/FRD-20260311-summary-only-aggregation.md. - import ( "fmt" "runtime" @@ -42,8 +40,6 @@ func makeSyntheticReport(fileIndex, numFunctions int) analyze.Report { } } -// FRD: specs/frds/FRD-20260311-typed-report-items.md. - // testFunctionMetrics is a typed struct for benchmark comparison. type testFunctionMetrics struct { Name string @@ -183,8 +179,6 @@ func BenchmarkTypedVsMapAccumulation(b *testing.B) { }) } -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. - // benchEstimatedSizeReportCount is the number of reports for size estimation benchmark. 
const benchEstimatedSizeReportCount = 10000 diff --git a/internal/analyzers/common/checkpoint_helper_test.go b/internal/analyzers/common/checkpoint_helper_test.go index 8b16cd1..0d950ba 100644 --- a/internal/analyzers/common/checkpoint_helper_test.go +++ b/internal/analyzers/common/checkpoint_helper_test.go @@ -1,7 +1,5 @@ package common_test -// FRD: specs/frds/FRD-20260302-checkpoint-helper.md. - import ( "testing" diff --git a/internal/analyzers/common/computed_metrics_test.go b/internal/analyzers/common/computed_metrics_test.go index 1c5c5f9..892aa00 100644 --- a/internal/analyzers/common/computed_metrics_test.go +++ b/internal/analyzers/common/computed_metrics_test.go @@ -1,7 +1,5 @@ package common_test -// FRD: specs/frds/FRD-20260302-computed-metrics.md. - import ( "testing" diff --git a/internal/analyzers/common/context_stack_test.go b/internal/analyzers/common/context_stack_test.go index 83858b5..76b9a84 100644 --- a/internal/analyzers/common/context_stack_test.go +++ b/internal/analyzers/common/context_stack_test.go @@ -1,7 +1,5 @@ package common_test -// FRD: specs/frds/FRD-20260302-context-stack.md. - import ( "testing" diff --git a/internal/analyzers/common/detailed_data_collector.go b/internal/analyzers/common/detailed_data_collector.go index 3d5fc7f..db24cee 100644 --- a/internal/analyzers/common/detailed_data_collector.go +++ b/internal/analyzers/common/detailed_data_collector.go @@ -1,7 +1,5 @@ package common -// FRD: specs/frds/FRD-20260311-typed-report-items.md. - import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" ) @@ -89,7 +87,9 @@ func (d *DetailedDataCollector) buildItems(key string) []map[string]any { items := make([]map[string]any, 0, capacity) for _, tc := range typed { - items = append(items, tc.ToMaps(tc.Items, tc.SourceFile)...) + converted := tc.ToMaps(tc.Items, tc.SourceFile) + stampCollectionMetadata(converted, tc) + items = append(items, converted...) } items = append(items, legacy...) 
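Reviewer note on the `buildItems` change above: a rough sketch, under stated assumptions, of what stamping collection metadata onto converted items implies. The key names `_language` and `_directory` and the `collectionMeta` type are hypothetical stand-ins, since the actual values of `analyze.LanguageKey`, `analyze.DirectoryKey`, and the full `TypedCollection` definition are not visible in this diff. The point it illustrates is that a non-empty collection value overwrites whatever an item already carries, while an empty value leaves the item untouched.

```go
package main

import "fmt"

// Hypothetical key names; the real constants live in the analyze package
// and their values are not shown in this diff.
const (
	languageKey  = "_language"
	directoryKey = "_directory"
)

// collectionMeta is an assumed stand-in for the metadata fields
// carried by analyze.TypedCollection.
type collectionMeta struct {
	Language  string
	Directory string
}

// stamp copies non-empty metadata onto every item, mutating the maps in place.
// Empty metadata fields are skipped, so no empty-string keys are written.
func stamp(items []map[string]any, meta collectionMeta) {
	for _, item := range items {
		if meta.Language != "" {
			item[languageKey] = meta.Language
		}

		if meta.Directory != "" {
			item[directoryKey] = meta.Directory
		}
	}
}

func main() {
	items := []map[string]any{
		{"name": "Foo"},
		{"name": "Bar", languageKey: "python"}, // pre-existing value
	}

	stamp(items, collectionMeta{Language: "go"})

	// Both items now report "go": the collection value wins over the item value.
	fmt.Println(items[0][languageKey], items[1][languageKey])

	// Directory was empty on the collection, so the key is never added.
	_, hasDir := items[0][directoryKey]
	fmt.Println(hasDir)
}
```

If a converter ever sets a per-item language deliberately, this overwrite order may be worth a test in the real package.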
@@ -97,6 +97,21 @@ func (d *DetailedDataCollector) buildItems(key string) []map[string]any { return items } +// stampCollectionMetadata adds language and directory from a TypedCollection +// to each converted map item. The converter only passes sourceFile; this +// stamps the remaining metadata fields that TypedCollection carries. +func stampCollectionMetadata(items []map[string]any, tc analyze.TypedCollection) { + for _, item := range items { + if tc.Language != "" { + item[analyze.LanguageKey] = tc.Language + } + + if tc.Directory != "" { + item[analyze.DirectoryKey] = tc.Directory + } + } +} + // typedCollectionLen returns the length of a TypedCollection's Items slice // using a type switch for known slice types, falling back to 0. func typedCollectionLen(tc analyze.TypedCollection) int { diff --git a/internal/analyzers/common/detailed_data_collector_test.go b/internal/analyzers/common/detailed_data_collector_test.go index beb4207..f1c16ec 100644 --- a/internal/analyzers/common/detailed_data_collector_test.go +++ b/internal/analyzers/common/detailed_data_collector_test.go @@ -9,8 +9,6 @@ import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" ) -// FRD: specs/frds/FRD-20260303-detailed-data-collector.md. - func TestNewDetailedDataCollector(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/common/filter_test.go b/internal/analyzers/common/filter_test.go index 75b4f13..145bdb3 100644 --- a/internal/analyzers/common/filter_test.go +++ b/internal/analyzers/common/filter_test.go @@ -1,7 +1,5 @@ package common_test -// FRD: specs/frds/FRD-20260302-filter-by-interface.md. - import ( "testing" diff --git a/internal/analyzers/common/identity_mixin_test.go b/internal/analyzers/common/identity_mixin_test.go index 8b545ee..449ae02 100644 --- a/internal/analyzers/common/identity_mixin_test.go +++ b/internal/analyzers/common/identity_mixin_test.go @@ -1,4 +1,3 @@ -// FRD: specs/frds/FRD-20260302-identity-mixin.md. 
package common_test import ( diff --git a/internal/analyzers/common/metrics_processor_test.go b/internal/analyzers/common/metrics_processor_test.go index 25285ea..c2612e9 100644 --- a/internal/analyzers/common/metrics_processor_test.go +++ b/internal/analyzers/common/metrics_processor_test.go @@ -282,8 +282,6 @@ func TestMetricsProcessor_IntegrationWorkflow(t *testing.T) { } } -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. - func TestMetricsProcessor_EstimatedStateBytes_Empty(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/common/no_state_hibernation_test.go b/internal/analyzers/common/no_state_hibernation_test.go index bd4af5f..015dca4 100644 --- a/internal/analyzers/common/no_state_hibernation_test.go +++ b/internal/analyzers/common/no_state_hibernation_test.go @@ -1,7 +1,5 @@ package common_test -// FRD: specs/frds/FRD-20260302-no-state-hibernation.md. - import ( "testing" "unsafe" diff --git a/internal/analyzers/common/perfile_retainer.go b/internal/analyzers/common/perfile_retainer.go new file mode 100644 index 0000000..aeafe19 --- /dev/null +++ b/internal/analyzers/common/perfile_retainer.go @@ -0,0 +1,91 @@ +package common + +import ( + "maps" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +// PerFileRetainer stores per-file report snapshots during static analysis aggregation. +// When enabled, each call to Retain stores a shallow clone of the report keyed by source file path. +// When disabled (default), Retain is a no-op and PerFileResults returns nil. +type PerFileRetainer struct { + enabled bool + reports map[string]analyze.Report +} + +// SetPerFileMode enables or disables per-file report retention. +func (r *PerFileRetainer) SetPerFileMode(enabled bool) { + r.enabled = enabled + + if enabled && r.reports == nil { + r.reports = make(map[string]analyze.Report) + } +} + +// Retain extracts the source file path from the report and stores a shallow clone. 
+// No-op when per-file mode is disabled or the report has no source file path. +func (r *PerFileRetainer) Retain(report analyze.Report) { + if !r.enabled || report == nil { + return + } + + filePath := extractSourceFile(report) + if filePath == "" { + return + } + + r.reports[filePath] = cloneReport(report) +} + +// PerFileResults returns the retained per-file reports keyed by file path. +// Returns nil when per-file mode is disabled or no files were retained. +func (r *PerFileRetainer) PerFileResults() map[string]analyze.Report { + if !r.enabled || len(r.reports) == 0 { + return nil + } + + return r.reports +} + +// extractSourceFile finds the source file path from report values. +// Checks top-level SourceFileKey first, then collection-level sources. +func extractSourceFile(report analyze.Report) string { + if sf, ok := report[analyze.SourceFileKey].(string); ok && sf != "" { + return sf + } + + return extractSourceFileFromCollections(report) +} + +// extractSourceFileFromCollections checks TypedCollection.SourceFile and legacy _source_file items. +func extractSourceFileFromCollections(report analyze.Report) string { + for _, val := range report { + if sf := sourceFileFromValue(val); sf != "" { + return sf + } + } + + return "" +} + +// sourceFileFromValue extracts a source file path from a single report value. +func sourceFileFromValue(val any) string { + switch typed := val.(type) { + case analyze.TypedCollection: + return typed.SourceFile + case []map[string]any: + for _, item := range typed { + if sf, ok := item[analyze.SourceFileKey].(string); ok && sf != "" { + return sf + } + } + } + + return "" +} + +// cloneReport creates a shallow clone of a report map. 
+func cloneReport(report analyze.Report) analyze.Report { + return maps.Clone(report) +} diff --git a/internal/analyzers/common/perfile_retainer_test.go b/internal/analyzers/common/perfile_retainer_test.go new file mode 100644 index 0000000..c7dfd14 --- /dev/null +++ b/internal/analyzers/common/perfile_retainer_test.go @@ -0,0 +1,106 @@ +package common + +import ( + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +func TestPerFileRetainer_Disabled_ReturnsNil(t *testing.T) { + t.Parallel() + + var retainer PerFileRetainer + + retainer.Retain(analyze.Report{"total_functions": 5}) + + assert.Nil(t, retainer.PerFileResults()) +} + +func TestPerFileRetainer_Enabled_RetainsThreeFiles(t *testing.T) { + t.Parallel() + + var retainer PerFileRetainer + retainer.SetPerFileMode(true) + + retainer.Retain(analyze.Report{ + "total_functions": 3, + "functions": analyze.TypedCollection{SourceFile: "/repo/a.go"}, + }) + retainer.Retain(analyze.Report{ + "total_functions": 5, + "functions": analyze.TypedCollection{SourceFile: "/repo/b.go"}, + }) + retainer.Retain(analyze.Report{ + "total_functions": 2, + "functions": analyze.TypedCollection{SourceFile: "/repo/c.go"}, + }) + + results := retainer.PerFileResults() + require.Len(t, results, 3) + assert.Contains(t, results, "/repo/a.go") + assert.Contains(t, results, "/repo/b.go") + assert.Contains(t, results, "/repo/c.go") + assert.Equal(t, 3, results["/repo/a.go"]["total_functions"]) +} + +func TestPerFileRetainer_LegacyMapSlice(t *testing.T) { + t.Parallel() + + var retainer PerFileRetainer + retainer.SetPerFileMode(true) + + retainer.Retain(analyze.Report{ + "functions": []map[string]any{ + {"name": "Foo", analyze.SourceFileKey: "/repo/legacy.go"}, + }, + }) + + results := retainer.PerFileResults() + require.Len(t, results, 1) + assert.Contains(t, results, "/repo/legacy.go") +} + +func TestPerFileRetainer_NilReport(t 
*testing.T) { + t.Parallel() + + var retainer PerFileRetainer + retainer.SetPerFileMode(true) + + retainer.Retain(nil) + + assert.Nil(t, retainer.PerFileResults()) +} + +func TestPerFileRetainer_NoSourceFile(t *testing.T) { + t.Parallel() + + var retainer PerFileRetainer + retainer.SetPerFileMode(true) + + retainer.Retain(analyze.Report{"total_functions": 5}) + + assert.Nil(t, retainer.PerFileResults()) +} + +func TestPerFileRetainer_CloneIsolation(t *testing.T) { + t.Parallel() + + var retainer PerFileRetainer + retainer.SetPerFileMode(true) + + report := analyze.Report{ + "count": 10, + "functions": analyze.TypedCollection{SourceFile: "/repo/x.go"}, + } + + retainer.Retain(report) + + // Mutate original — retained copy must not change. + report["count"] = 999 + + results := retainer.PerFileResults() + assert.Equal(t, 10, results["/repo/x.go"]["count"]) +} diff --git a/internal/analyzers/common/plotpage/multipage_test.go b/internal/analyzers/common/plotpage/multipage_test.go index 43dbeae..8695f64 100644 --- a/internal/analyzers/common/plotpage/multipage_test.go +++ b/internal/analyzers/common/plotpage/multipage_test.go @@ -1,7 +1,5 @@ package plotpage -// FRD: specs/frds/FRD-20260228-multipage-renderer.md. - import ( "os" "path/filepath" diff --git a/internal/analyzers/common/renderer/json.go b/internal/analyzers/common/renderer/json.go index 13d7af8..d9359ac 100644 --- a/internal/analyzers/common/renderer/json.go +++ b/internal/analyzers/common/renderer/json.go @@ -17,6 +17,18 @@ type JSONSection struct { Metrics []JSONMetric `json:"metrics"` Distribution []JSONDistribution `json:"distribution,omitempty"` Issues []JSONIssue `json:"issues"` + Files *[]JSONFileEntry `json:"files,omitempty"` + Score float64 `json:"score"` +} + +// JSONFileEntry represents one file's analysis results within a section. 
+type JSONFileEntry struct { + FilePath string `json:"file_path"` + ScoreLabel string `json:"score_label"` + Status string `json:"status"` + Metrics []JSONMetric `json:"metrics"` + Distribution []JSONDistribution `json:"distribution,omitempty"` + Issues []JSONIssue `json:"issues"` Score float64 `json:"score"` } @@ -84,6 +96,83 @@ func SectionToJSON(section analyze.ReportSection) JSONSection { } } +// EnrichWithPerFileData injects per-file data and summary statistics into JSON sections. +// Implements analyze.PerFileEnricher to avoid import cycles. +func (r *JSONReport) EnrichWithPerFileData( + perFileResults map[string]map[string]analyze.Report, + rootPath string, + analyzers []analyze.FormattableAnalyzer, +) { + // Build analyzer name → (section title, provider) mapping. + type analyzerInfo struct { + title string + provider analyze.ReportSectionProvider + } + + infoByName := make(map[string]analyzerInfo, len(analyzers)) + + for _, analyzer := range analyzers { + provider, ok := analyzer.(analyze.ReportSectionProvider) + if !ok { + continue + } + + emptySection := provider.CreateReportSection(analyze.Report{}) + infoByName[analyzer.Name()] = analyzerInfo{ + title: emptySection.SectionTitle(), + provider: provider, + } + } + + // Build section title → index for O(1) lookup. + titleToIdx := make(map[string]int, len(r.Sections)) + for idx, section := range r.Sections { + titleToIdx[section.Title] = idx + } + + // Initialize all sections with empty files array (spec: empty array, not omitted). 
+ for idx := range r.Sections { + emptyFiles := make([]JSONFileEntry, 0) + r.Sections[idx].Files = &emptyFiles + } + + for analyzerName, fileReports := range perFileResults { + info, ok := infoByName[analyzerName] + if !ok { + continue + } + + idx, found := titleToIdx[info.title] + if !found { + continue + } + + files := make([]JSONFileEntry, 0, len(fileReports)) + for filePath, report := range fileReports { + section := info.provider.CreateReportSection(report) + relPath := analyze.MakeRelativePath(filePath, rootPath) + files = append(files, SectionToJSONFileEntry(section, relPath)) + } + + r.Sections[idx].Files = &files + } +} + +// SectionToJSONFileEntry converts a ReportSection to a JSONFileEntry for per-file output. +func SectionToJSONFileEntry(section analyze.ReportSection, filePath string) JSONFileEntry { + base := SectionToJSON(section) + + return JSONFileEntry{ + FilePath: filePath, + Score: base.Score, + ScoreLabel: base.ScoreLabel, + Status: base.Status, + Metrics: base.Metrics, + Distribution: base.Distribution, + Issues: base.Issues, + } +} + // SectionsToJSON converts multiple ReportSections to a JSONReport with overall score. 
func SectionsToJSON(sections []analyze.ReportSection) JSONReport { summary := NewExecutiveSummary(sections) diff --git a/internal/analyzers/common/renderer/json_test.go b/internal/analyzers/common/renderer/json_test.go index e4e3bb1..a3ee22a 100644 --- a/internal/analyzers/common/renderer/json_test.go +++ b/internal/analyzers/common/renderer/json_test.go @@ -192,3 +192,88 @@ func TestSectionsToJSON_Serializable(t *testing.T) { assert.Contains(t, string(data), `"title":"COMPLEXITY"`) assert.Contains(t, string(data), `"overall_score":0.8`) } + +func TestJSONSection_NoFiles_OmittedFromJSON(t *testing.T) { + t.Parallel() + + section := JSONSection{ + Title: "COMPLEXITY", + Score: 0.8, + ScoreLabel: "8/10", + Status: "Good", + Metrics: []JSONMetric{{Label: "Total Functions", Value: "42"}}, + Issues: []JSONIssue{}, + } + + data, err := json.Marshal(section) + require.NoError(t, err) + + jsonStr := string(data) + assert.NotContains(t, jsonStr, `"files"`, "files must be omitted when nil") +} + +func TestJSONSection_WithFiles_IncludedInJSON(t *testing.T) { + t.Parallel() + + section := JSONSection{ + Title: "COMPLEXITY", + Score: 0.8, + ScoreLabel: "8/10", + Status: "Good", + Metrics: []JSONMetric{{Label: "Total Functions", Value: "42"}}, + Issues: []JSONIssue{}, + Files: &[]JSONFileEntry{ + { + FilePath: "pkg/foo/bar.go", + Score: 0.6, + ScoreLabel: "6/10", + Status: "Fair", + Metrics: []JSONMetric{{Label: "Total Functions", Value: "12"}}, + Issues: []JSONIssue{}, + }, + }, + } + + data, err := json.Marshal(section) + require.NoError(t, err) + + jsonStr := string(data) + assert.Contains(t, jsonStr, `"files"`) + assert.Contains(t, jsonStr, `"file_path":"pkg/foo/bar.go"`) + assert.Contains(t, jsonStr, `"score":0.6`) +} + +func TestJSONSection_PerFileRoundTrip(t *testing.T) { + t.Parallel() + + original := JSONSection{ + Title: "HALSTEAD", + Score: 0.7, + ScoreLabel: "7/10", + Status: "Fair", + Metrics: []JSONMetric{{Label: "Volume", Value: "500"}}, + Issues: []JSONIssue{}, + 
Files: &[]JSONFileEntry{ + { + FilePath: "cmd/main.go", + Score: 0.5, + ScoreLabel: "5/10", + Status: "Moderate", + Metrics: []JSONMetric{{Label: "Volume", Value: "200"}}, + Issues: []JSONIssue{}, + }, + }, + } + + data, err := json.Marshal(original) + require.NoError(t, err) + + var decoded JSONSection + require.NoError(t, json.Unmarshal(data, &decoded)) + + assert.Equal(t, original.Title, decoded.Title) + require.NotNil(t, decoded.Files) + require.Len(t, *decoded.Files, 1) + assert.Equal(t, "cmd/main.go", (*decoded.Files)[0].FilePath) + assert.InDelta(t, 0.5, (*decoded.Files)[0].Score, 0.001) +} diff --git a/internal/analyzers/common/renderer/static_renderer.go b/internal/analyzers/common/renderer/static_renderer.go index ca3910d..f1f6f25 100644 --- a/internal/analyzers/common/renderer/static_renderer.go +++ b/internal/analyzers/common/renderer/static_renderer.go @@ -18,8 +18,11 @@ func NewDefaultStaticRenderer() *DefaultStaticRenderer { } // SectionsToJSON converts report sections to a JSON-serializable value. +// Returns a pointer to enable per-file enrichment via PerFileEnricher interface. func (r *DefaultStaticRenderer) SectionsToJSON(sections []analyze.ReportSection) any { - return SectionsToJSON(sections) + report := SectionsToJSON(sections) + + return &report } // RenderText writes human-readable text output for the given sections. diff --git a/internal/analyzers/common/reportutil/reportutil_test.go b/internal/analyzers/common/reportutil/reportutil_test.go index 4d417af..04f7782 100644 --- a/internal/analyzers/common/reportutil/reportutil_test.go +++ b/internal/analyzers/common/reportutil/reportutil_test.go @@ -1,8 +1,5 @@ package reportutil -// FRD: specs/frds/FRD-20260302-safeconv-wiring.md. -// FRD: specs/frds/FRD-20260306-reportutil-getas.md. 
- import ( "testing" ) diff --git a/internal/analyzers/common/spillable_bench_test.go b/internal/analyzers/common/spillable_bench_test.go index fa34223..54eeda6 100644 --- a/internal/analyzers/common/spillable_bench_test.go +++ b/internal/analyzers/common/spillable_bench_test.go @@ -1,7 +1,5 @@ package common -// FRD: specs/frds/FRD-20260311-spillable-data-collector.md. - import ( "fmt" "runtime" diff --git a/internal/analyzers/common/spillable_data_collector_test.go b/internal/analyzers/common/spillable_data_collector_test.go index c5a0b09..e7ecb36 100644 --- a/internal/analyzers/common/spillable_data_collector_test.go +++ b/internal/analyzers/common/spillable_data_collector_test.go @@ -1,7 +1,5 @@ package common -// FRD: specs/frds/FRD-20260311-spillable-data-collector.md. - import ( "testing" @@ -284,8 +282,6 @@ func TestSpillableDataCollector_NoSpillMatchesSpill(t *testing.T) { assert.Equal(t, noSpillData, withSpillData) } -// FRD: specs/frds/FRD-20260311-halstead-dedup.md. - func TestSpillableDataCollector_CompositeKeys_PreventsCrossFileOverwrite(t *testing.T) { t.Parallel() @@ -398,8 +394,6 @@ func TestSpillableDataCollector_CompositeKeys_GetIdentifierKey(t *testing.T) { assert.Equal(t, "name", sdc.GetIdentifierKey()) } -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. - func TestSpillableDataCollector_EstimatedBufferBytes_Empty(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/common/threshold_labeler_test.go b/internal/analyzers/common/threshold_labeler_test.go index 6c45691..2408a8b 100644 --- a/internal/analyzers/common/threshold_labeler_test.go +++ b/internal/analyzers/common/threshold_labeler_test.go @@ -7,7 +7,6 @@ import ( ) // thresholdLabelerFixture returns a standard 4-bucket labeler for tests. -// FRD: specs/frds/FRD-20260306-threshold-labeler.md. 
func thresholdLabelerFixture() ThresholdLabeler { return ThresholdLabeler{ {Limit: 0.8, Label: "Excellent"}, diff --git a/internal/analyzers/common/uast_traversal_test.go b/internal/analyzers/common/uast_traversal_test.go index 0a9c730..472b8de 100644 --- a/internal/analyzers/common/uast_traversal_test.go +++ b/internal/analyzers/common/uast_traversal_test.go @@ -363,8 +363,6 @@ func TestUASTTraverser_matchesRoles(t *testing.T) { } } -// FRD: specs/frds/FRD-20260310-find-nodes-predicate.md. - func TestUASTTraverser_FindNodes(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/complexity/aggregator.go b/internal/analyzers/complexity/aggregator.go index 148e30d..bd5c4e2 100644 --- a/internal/analyzers/complexity/aggregator.go +++ b/internal/analyzers/complexity/aggregator.go @@ -17,6 +17,7 @@ const msgGoodComplexity = "Good complexity - functions have reasonable complexit // Aggregator aggregates results from multiple complexity analyses. type Aggregator struct { *common.Aggregator + common.PerFileRetainer detailed *common.DetailedDataCollector maxComplexity int } @@ -52,6 +53,10 @@ func (ca *Aggregator) SetAggregationMode(mode analyze.AggregationMode) { // Aggregate overrides the base Aggregate method to collect detailed functions // and track the true maximum complexity across all files. func (ca *Aggregator) Aggregate(results map[string]analyze.Report) { + for _, report := range results { + ca.Retain(report) + } + ca.detailed.CollectFromReports(results) ca.trackMaxComplexity(results) ca.Aggregator.Aggregate(results) diff --git a/internal/analyzers/complexity/complexity.go b/internal/analyzers/complexity/complexity.go index 04b4e3c..eaeacb0 100644 --- a/internal/analyzers/complexity/complexity.go +++ b/internal/analyzers/complexity/complexity.go @@ -86,7 +86,6 @@ type FunctionMetrics struct { // FunctionReportItem is a typed representation of a per-function complexity report item. 
// It includes assessment strings computed from thresholds, avoiding map[string]any allocation. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. type FunctionReportItem struct { Name string CyclomaticComplexity int @@ -297,7 +296,6 @@ func (c *Analyzer) buildEmptyResult(message string) analyze.Report { } // buildDetailedFunctionsTable creates the detailed functions table as typed structs. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (c *Analyzer) buildDetailedFunctionsTable( functionMetrics []FunctionMetrics, config Config, @@ -360,7 +358,6 @@ func (c *Analyzer) calculateAverageComplexity(totals map[string]int, functionCou } // buildResult constructs the final analysis result. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (c *Analyzer) buildResult( functionCount int, avgComplexity float64, diff --git a/internal/analyzers/complexity/metrics.go b/internal/analyzers/complexity/metrics.go index 317447f..20c2f5f 100644 --- a/internal/analyzers/complexity/metrics.go +++ b/internal/analyzers/complexity/metrics.go @@ -26,6 +26,9 @@ type ReportData struct { // FunctionData holds complexity data for a single function. type FunctionData struct { Name string + SourceFile string + Language string + Directory string CyclomaticComplexity int CognitiveComplexity int NestingDepth int @@ -100,6 +103,18 @@ func parseFunctionData(fn map[string]any) FunctionData { fd.Name = name } + if sf, ok := fn[analyze.SourceFileKey].(string); ok { + fd.SourceFile = sf + } + + if lang, ok := fn[analyze.LanguageKey].(string); ok { + fd.Language = lang + } + + if dir, ok := fn[analyze.DirectoryKey].(string); ok { + fd.Directory = dir + } + if v, ok := fn["cyclomatic_complexity"].(int); ok { fd.CyclomaticComplexity = v } @@ -136,6 +151,9 @@ func parseFunctionData(fn map[string]any) FunctionData { // FunctionComplexityData contains detailed complexity for a function. 
type FunctionComplexityData struct { Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` CyclomaticComplexity int `json:"cyclomatic_complexity" yaml:"cyclomatic_complexity"` CognitiveComplexity int `json:"cognitive_complexity" yaml:"cognitive_complexity"` NestingDepth int `json:"nesting_depth" yaml:"nesting_depth"` @@ -154,6 +172,9 @@ const ( // HighRiskFunctionData identifies functions needing refactoring attention. type HighRiskFunctionData struct { Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` CyclomaticComplexity int `json:"cyclomatic_complexity" yaml:"cyclomatic_complexity"` CognitiveComplexity int `json:"cognitive_complexity" yaml:"cognitive_complexity"` RiskLevel string `json:"risk_level" yaml:"risk_level"` @@ -221,6 +242,9 @@ func (m *FunctionComplexityMetric) Compute(input *ReportData) []FunctionComplexi result = append(result, FunctionComplexityData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, CyclomaticComplexity: fn.CyclomaticComplexity, CognitiveComplexity: fn.CognitiveComplexity, NestingDepth: fn.NestingDepth, @@ -351,6 +375,9 @@ func (m *HighRiskFunctionMetric) Compute(input *ReportData) []HighRiskFunctionDa result = append(result, HighRiskFunctionData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, CyclomaticComplexity: fn.CyclomaticComplexity, CognitiveComplexity: fn.CognitiveComplexity, RiskLevel: riskLevel, diff --git a/internal/analyzers/complexity/metrics_test.go b/internal/analyzers/complexity/metrics_test.go index 
df0e9ff..8f29adf 100644 --- a/internal/analyzers/complexity/metrics_test.go +++ b/internal/analyzers/complexity/metrics_test.go @@ -118,6 +118,42 @@ func TestParseReportData_WithAssessments(t *testing.T) { assert.Equal(t, "low", data.Functions[0].NestingAssessment) } +const testSourceFile = "pkg/auth/handler.go" + +func TestParseReportData_WithSourceFile(t *testing.T) { + t.Parallel() + + report := analyze.Report{ + "functions": []map[string]any{ + { + "name": testFunctionName, + "_source_file": testSourceFile, + }, + }, + } + + data, err := ParseReportData(report) + + require.NoError(t, err) + require.Len(t, data.Functions, 1) + assert.Equal(t, testSourceFile, data.Functions[0].SourceFile) +} + +func TestFunctionComplexityMetric_Compute_SourceFile(t *testing.T) { + t.Parallel() + + functions := []FunctionData{ + {Name: testFunctionName, SourceFile: testSourceFile, CyclomaticComplexity: 5, LinesOfCode: testLinesOfCode}, + } + metric := NewFunctionComplexityMetric() + input := makeTestReportData(functions) + + result := metric.Compute(input) + + require.Len(t, result, 1) + assert.Equal(t, testSourceFile, result[0].SourceFile) +} + // Helper to create test ReportData with functions. 
func makeTestReportData(functions []FunctionData) *ReportData { return &ReportData{ diff --git a/internal/analyzers/complexity/plot.go b/internal/analyzers/complexity/plot.go index 2c1fd84..fb1e3c7 100644 --- a/internal/analyzers/complexity/plot.go +++ b/internal/analyzers/complexity/plot.go @@ -3,6 +3,7 @@ package complexity import ( "errors" "io" + "path/filepath" "github.com/go-echarts/go-echarts/v2/charts" "github.com/go-echarts/go-echarts/v2/opts" @@ -157,11 +158,7 @@ func extractComplexityData(functions []map[string]any) (labels []string, cycloma colors = make([]string, len(functions)) for i, fn := range functions { - if name, ok := fn["name"].(string); ok { - labels[i] = name - } else { - labels[i] = unknownName - } + labels[i] = formatPlotLabel(fn) cyclomatic[i] = getCyclomaticValue(fn) cognitive[i] = getCognitiveValue(fn) @@ -171,6 +168,22 @@ func extractComplexityData(functions []map[string]any) (labels []string, cycloma return labels, cyclomatic, cognitive, colors } +// formatPlotLabel builds a chart label from function name and source file. +// Shows "filename:func" when source_file is available, otherwise just the name. 
+func formatPlotLabel(fn map[string]any) string { + name := reportutil.MapString(fn, "name") + if name == "" { + name = unknownName + } + + sf := reportutil.MapString(fn, analyze.SourceFileKey) + if sf == "" { + return name + } + + return filepath.Base(sf) + ":" + name +} + func getComplexityColor(complexity int) string { switch { case complexity <= cyclomaticYellowLine: @@ -277,16 +290,12 @@ func createComplexityScatterChart(functions []map[string]any, co *plotpage.Chart cyclomatic := getCyclomaticValue(fn) cognitive := getCognitiveValue(fn) nesting := getNestingValue(fn) - name := unknownName - - if n, ok := fn["name"].(string); ok { - name = n - } + label := formatPlotLabel(fn) symbolSize := scatterSymbolSize + nesting*nestingMultiplier scatterData[i] = opts.ScatterData{ - Value: []any{cyclomatic, cognitive, name}, + Value: []any{cyclomatic, cognitive, label}, SymbolSize: symbolSize, } } diff --git a/internal/analyzers/complexity/report_section.go b/internal/analyzers/complexity/report_section.go index 75ac470..d1ad4b8 100644 --- a/internal/analyzers/complexity/report_section.go +++ b/internal/analyzers/complexity/report_section.go @@ -170,12 +170,14 @@ func (s *ReportSection) complexityIssues(limit int) []analyze.Issue { cc := reportutil.GetInt(fn, KeyFuncCyclomatic) cognitive := reportutil.GetInt(fn, KeyFuncCognitive) nesting := reportutil.GetInt(fn, KeyFuncNesting) + location := reportutil.MapString(fn, analyze.SourceFileKey) envelopes = append(envelopes, issueEnvelope{ cyclomatic: cc, cognitive: cognitive, nesting: nesting, issue: analyze.Issue{ Name: name, + Location: location, Value: fmt.Sprintf("%s%d | Cog=%d | Nest=%d", IssueValuePrefix, cc, cognitive, nesting), Severity: severityForComplexity(cc), }, diff --git a/internal/analyzers/composition/aggregator.go b/internal/analyzers/composition/aggregator.go new file mode 100644 index 0000000..a6648cb --- /dev/null +++ b/internal/analyzers/composition/aggregator.go @@ -0,0 +1,65 @@ +package composition + 
+import ( + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common" + filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" +) + +// Aggregator report keys. +const ( + keyBreakdown = "breakdown" + keyPercentage = "percentages" + keyTotalFiles = "total_files" + + percentMultiplier = 100.0 +) + +// Aggregator aggregates file composition results across multiple files. +type Aggregator struct { + common.PerFileRetainer + + counts filehistory.CategoryCounts + totalFiles int +} + +// NewAggregator creates a new composition Aggregator. +func NewAggregator() *Aggregator { + return &Aggregator{} +} + +// Aggregate accumulates per-file classification results. +func (a *Aggregator) Aggregate(results map[string]analyze.Report) { + for _, report := range results { + a.Retain(report) + a.totalFiles++ + + cat, ok := report[keyCategory].(string) + if !ok { + continue + } + + a.counts.Increment(filehistory.Category(cat)) + } +} + +// GetResult builds the aggregated composition report. 
+func (a *Aggregator) GetResult() analyze.Report { + breakdown := make(map[string]int, len(filehistory.AllCategories)) + percentages := make(map[string]float64, len(filehistory.AllCategories)) + + for _, cat := range filehistory.AllCategories { + count := a.counts.Get(cat) + breakdown[string(cat)] = count + + if a.totalFiles > 0 { + percentages[string(cat)] = float64(count) / float64(a.totalFiles) * percentMultiplier + } + } + + return analyze.Report{ + keyBreakdown: breakdown, + keyPercentage: percentages, + keyTotalFiles: a.totalFiles, + } +} diff --git a/internal/analyzers/composition/analyzer.go b/internal/analyzers/composition/analyzer.go new file mode 100644 index 0000000..71ccf71 --- /dev/null +++ b/internal/analyzers/composition/analyzer.go @@ -0,0 +1,127 @@ +// Package composition provides a static file composition analyzer that classifies +// files by type (source, vendor, generated, docs, config, binary, image) using enry. +package composition + +import ( + "encoding/json" + "fmt" + "io" + + "gopkg.in/yaml.v3" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/reportutil" + filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" + "github.com/Sumatoshi-tech/codefang/pkg/pipeline" +) + +// Analyzer constants. +const ( + analyzerName = "composition" + analyzerFlag = "composition" + analyzerID = "static/composition" + analyzerDescription = "Classifies files by type (source, vendor, generated, docs, config, binary, image) using enry." + + // keyCategory is the report key for the file category. + keyCategory = "category" +) + +// Analyzer implements analyze.RawFileAnalyzer for file composition analysis. +// It classifies files by type using enry-based detection on raw file content. +type Analyzer struct { + classifier *filehistory.Classifier +} + +// NewAnalyzer creates a new composition Analyzer. 
+func NewAnalyzer() *Analyzer { + return &Analyzer{ + classifier: filehistory.NewClassifier(), + } +} + +// Name returns the analyzer name. +func (a *Analyzer) Name() string { return analyzerName } + +// Flag returns the CLI flag name. +func (a *Analyzer) Flag() string { return analyzerFlag } + +// Descriptor returns the analyzer descriptor. +func (a *Analyzer) Descriptor() analyze.Descriptor { + return analyze.NewDescriptor(analyze.ModeStatic, analyzerName, analyzerDescription) +} + +// ListConfigurationOptions returns available configuration options. +func (a *Analyzer) ListConfigurationOptions() []pipeline.ConfigurationOption { + return nil +} + +// Configure applies configuration facts. +func (a *Analyzer) Configure(_ map[string]any) error { + return nil +} + +// Thresholds returns metric thresholds. Composition is informational, no thresholds. +func (a *Analyzer) Thresholds() analyze.Thresholds { + return nil +} + +// CreateAggregator returns a new composition aggregator. +func (a *Analyzer) CreateAggregator() analyze.ResultAggregator { + return NewAggregator() +} + +// AnalyzeFileContent classifies a file by its path and content using enry. +func (a *Analyzer) AnalyzeFileContent(path string, content []byte) (analyze.Report, error) { + category := a.classifier.Classify(path, content) + + return analyze.Report{ + keyCategory: string(category), + }, nil +} + +// CreateReportSection creates a ReportSection from aggregated composition data. +func (a *Analyzer) CreateReportSection(report analyze.Report) analyze.ReportSection { + return NewReportSection(report) +} + +// FormatReport writes human-readable text output. +func (a *Analyzer) FormatReport(report analyze.Report, writer io.Writer) error { + return encodeJSON(report, writer) +} + +// FormatReportJSON writes JSON output. +func (a *Analyzer) FormatReportJSON(report analyze.Report, writer io.Writer) error { + return encodeJSON(report, writer) +} + +// FormatReportYAML writes YAML output. 
+func (a *Analyzer) FormatReportYAML(report analyze.Report, writer io.Writer) error { + yamlErr := yaml.NewEncoder(writer).Encode(report) + if yamlErr != nil { + return fmt.Errorf("encode yaml: %w", yamlErr) + } + + return nil +} + +func encodeJSON(report analyze.Report, writer io.Writer) error { + encoder := json.NewEncoder(writer) + encoder.SetIndent("", " ") + + encodeErr := encoder.Encode(report) + if encodeErr != nil { + return fmt.Errorf("encode json: %w", encodeErr) + } + + return nil +} + +// FormatReportPlot writes plot output (same as JSON for composition). +func (a *Analyzer) FormatReportPlot(report analyze.Report, writer io.Writer) error { + return a.FormatReportJSON(report, writer) +} + +// FormatReportBinary writes binary envelope output. +func (a *Analyzer) FormatReportBinary(report analyze.Report, writer io.Writer) error { + return reportutil.EncodeBinaryEnvelope(report, writer) +} diff --git a/internal/analyzers/composition/analyzer_test.go b/internal/analyzers/composition/analyzer_test.go new file mode 100644 index 0000000..0239356 --- /dev/null +++ b/internal/analyzers/composition/analyzer_test.go @@ -0,0 +1,298 @@ +package composition + +import ( + "bytes" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" +) + +func TestAnalyzer_Name(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + assert.Equal(t, analyzerName, a.Name()) +} + +func TestAnalyzer_Flag(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + assert.Equal(t, analyzerFlag, a.Flag()) +} + +func TestAnalyzer_Descriptor(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + d := a.Descriptor() + assert.Equal(t, analyze.ModeStatic, d.Mode) + assert.Equal(t, analyzerID, d.ID) +} + +func TestAnalyzer_Thresholds_Nil(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + assert.Nil(t, 
a.Thresholds()) +} + +func TestAnalyzer_AnalyzeContent_GoFile(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + report, err := a.AnalyzeFileContent("pkg/main.go", []byte("package main\n\nfunc main() {}\n")) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategorySource), report[keyCategory]) +} + +func TestAnalyzer_AnalyzeContent_VendorPath(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + report, err := a.AnalyzeFileContent("vendor/github.com/foo/bar.go", []byte("package bar\n")) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategoryVendor), report[keyCategory]) +} + +func TestAnalyzer_AnalyzeContent_Markdown(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + report, err := a.AnalyzeFileContent("docs/README.md", []byte("# Hello\n")) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategoryDocumentation), report[keyCategory]) +} + +func TestAnalyzer_AnalyzeContent_ConfigFile(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + report, err := a.AnalyzeFileContent(".golangci.yml", []byte("linters:\n enable:\n")) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategoryConfiguration), report[keyCategory]) +} + +func TestAnalyzer_AnalyzeContent_BinaryContent(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + // Binary content: null bytes trigger enry.IsBinary. 
+ binary := []byte{0x00, 0x01, 0x02, 0xFF, 0xFE, 0x00, 0x00, 0x00} + report, err := a.AnalyzeFileContent("data.bin", binary) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategoryBinary), report[keyCategory]) +} + +func TestAnalyzer_AnalyzeContent_DotFile(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + report, err := a.AnalyzeFileContent(".editorconfig", []byte("[*]\nindent_style = tab\n")) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategoryDotFile), report[keyCategory]) +} + +func TestAnalyzer_AnalyzeContent_ImagePath(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + report, err := a.AnalyzeFileContent("logo.png", nil) + require.NoError(t, err) + assert.Equal(t, string(filehistory.CategoryImage), report[keyCategory]) +} + +func TestAnalyzer_CreateAggregator(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + agg := a.CreateAggregator() + require.NotNil(t, agg) +} + +func TestAnalyzer_CreateReportSection(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + section := a.CreateReportSection(analyze.Report{}) + require.NotNil(t, section) + assert.Equal(t, sectionTitle, section.SectionTitle()) +} + +func TestAnalyzer_ImplementsRawFileAnalyzer(t *testing.T) { + t.Parallel() + + var _ analyze.RawFileAnalyzer = (*Analyzer)(nil) +} + +func TestAnalyzer_FormatReportJSON(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + var buf bytes.Buffer + + err := a.FormatReportJSON(analyze.Report{keyCategory: "source"}, &buf) + require.NoError(t, err) + assert.Contains(t, buf.String(), "source") +} + +func TestAnalyzer_FormatReportYAML(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + var buf bytes.Buffer + + err := a.FormatReportYAML(analyze.Report{keyCategory: "vendor"}, &buf) + require.NoError(t, err) + assert.Contains(t, buf.String(), "vendor") +} + +func TestAnalyzer_FormatReport(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + var buf bytes.Buffer + + err := 
a.FormatReport(analyze.Report{keyCategory: "binary"}, &buf) + require.NoError(t, err) + assert.Contains(t, buf.String(), "binary") +} + +func TestAnalyzer_FormatReportPlot(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + var buf bytes.Buffer + + err := a.FormatReportPlot(analyze.Report{keyCategory: "docs"}, &buf) + require.NoError(t, err) + assert.Contains(t, buf.String(), "docs") +} + +func TestAnalyzer_FormatReportBinary(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + + var buf bytes.Buffer + + err := a.FormatReportBinary(analyze.Report{keyCategory: "source"}, &buf) + require.NoError(t, err) + assert.NotEmpty(t, buf.Bytes()) +} + +func TestAnalyzer_Configure_NoError(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + assert.NoError(t, a.Configure(nil)) +} + +func TestAnalyzer_ListConfigurationOptions_Empty(t *testing.T) { + t.Parallel() + + a := NewAnalyzer() + assert.Nil(t, a.ListConfigurationOptions()) +} + +// Aggregator tests. + +func TestAggregator_EmptyResult(t *testing.T) { + t.Parallel() + + agg := NewAggregator() + result := agg.GetResult() + + total, ok := result[keyTotalFiles].(int) + require.True(t, ok) + assert.Equal(t, 0, total) +} + +func TestAggregator_SingleFile(t *testing.T) { + t.Parallel() + + agg := NewAggregator() + agg.Aggregate(map[string]analyze.Report{ + analyzerName: {keyCategory: string(filehistory.CategorySource)}, + }) + + result := agg.GetResult() + + total, ok := result[keyTotalFiles].(int) + require.True(t, ok) + assert.Equal(t, 1, total) + + breakdown, ok := result[keyBreakdown].(map[string]int) + require.True(t, ok) + assert.Equal(t, 1, breakdown[string(filehistory.CategorySource)]) +} + +func TestAggregator_MultipleFiles(t *testing.T) { + t.Parallel() + + agg := NewAggregator() + + // 3 source + 1 vendor + 1 docs = 5 total. 
+ files := []filehistory.Category{ + filehistory.CategorySource, + filehistory.CategorySource, + filehistory.CategorySource, + filehistory.CategoryVendor, + filehistory.CategoryDocumentation, + } + + for _, cat := range files { + agg.Aggregate(map[string]analyze.Report{ + analyzerName: {keyCategory: string(cat)}, + }) + } + + result := agg.GetResult() + + total, ok := result[keyTotalFiles].(int) + require.True(t, ok) + assert.Equal(t, len(files), total) + + breakdown, ok := result[keyBreakdown].(map[string]int) + require.True(t, ok) + assert.Equal(t, 3, breakdown[string(filehistory.CategorySource)]) + assert.Equal(t, 1, breakdown[string(filehistory.CategoryVendor)]) + assert.Equal(t, 1, breakdown[string(filehistory.CategoryDocumentation)]) + + percentages, ok := result[keyPercentage].(map[string]float64) + require.True(t, ok) + assert.InDelta(t, 60.0, percentages[string(filehistory.CategorySource)], 0.1) + assert.InDelta(t, 20.0, percentages[string(filehistory.CategoryVendor)], 0.1) + assert.InDelta(t, 20.0, percentages[string(filehistory.CategoryDocumentation)], 0.1) +} + +func TestAggregator_SkipsInvalidCategory(t *testing.T) { + t.Parallel() + + agg := NewAggregator() + agg.Aggregate(map[string]analyze.Report{ + analyzerName: {"not_a_category": 42}, + }) + + result := agg.GetResult() + + total, ok := result[keyTotalFiles].(int) + require.True(t, ok) + // File counted but no category incremented. 
+ assert.Equal(t, 1, total) +} diff --git a/internal/analyzers/composition/report_section.go b/internal/analyzers/composition/report_section.go new file mode 100644 index 0000000..7e9b3d8 --- /dev/null +++ b/internal/analyzers/composition/report_section.go @@ -0,0 +1,164 @@ +package composition + +import ( + "fmt" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/reportutil" + filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" +) + +// Report section display constants. +const ( + sectionTitle = "COMPOSITION" + metricTotalFiles = "Total Files" + metricSource = "Source Files" + metricSourcePct = "Source %" + + statusDefault = "File composition analysis completed" + statusEmpty = "No files analyzed" +) + +// ReportSection implements analyze.ReportSection for composition analysis. +type ReportSection struct { + analyze.BaseReportSection + + report analyze.Report +} + +// NewReportSection creates a ReportSection from aggregated composition data. +func NewReportSection(report analyze.Report) *ReportSection { + msg := statusDefault + + total := reportutil.GetInt(report, keyTotalFiles) + if total == 0 { + msg = statusEmpty + } + + return &ReportSection{ + BaseReportSection: analyze.BaseReportSection{ + Title: sectionTitle, + Message: msg, + ScoreValue: analyze.ScoreInfoOnly, + }, + report: report, + } +} + +// KeyMetrics returns ordered key metrics for display. 
+func (s *ReportSection) KeyMetrics() []analyze.Metric { + total := reportutil.GetInt(s.report, keyTotalFiles) + breakdown := getBreakdown(s.report) + sourceCount := breakdown[string(filehistory.CategorySource)] + + return []analyze.Metric{ + {Label: metricTotalFiles, Value: reportutil.FormatInt(total)}, + {Label: metricSource, Value: reportutil.FormatInt(sourceCount)}, + {Label: metricSourcePct, Value: reportutil.FormatPercent(reportutil.Pct(sourceCount, total))}, + } +} + +// Distribution returns category breakdown as distribution items. +func (s *ReportSection) Distribution() []analyze.DistributionItem { + breakdown := getBreakdown(s.report) + total := reportutil.GetInt(s.report, keyTotalFiles) + + if total == 0 { + return nil + } + + items := make([]analyze.DistributionItem, 0, len(filehistory.AllCategories)) + + for _, cat := range filehistory.AllCategories { + count := breakdown[string(cat)] + if count == 0 { + continue + } + + items = append(items, analyze.DistributionItem{ + Label: string(cat), + Percent: reportutil.Pct(count, total), + Count: count, + }) + } + + return items +} + +// TopIssues returns the top N non-source files as issues. +func (s *ReportSection) TopIssues(n int) []analyze.Issue { + return s.buildIssues(n) +} + +// AllIssues returns all non-source files as issues. +func (s *ReportSection) AllIssues() []analyze.Issue { + return s.buildIssues(0) +} + +// buildIssues creates issues for non-source categories showing file counts. 
+func (s *ReportSection) buildIssues(limit int) []analyze.Issue { + breakdown := getBreakdown(s.report) + total := reportutil.GetInt(s.report, keyTotalFiles) + + if total == 0 { + return nil + } + + issues := make([]analyze.Issue, 0, len(filehistory.AllCategories)) + + for _, cat := range filehistory.AllCategories { + if cat == filehistory.CategorySource { + continue + } + + count := breakdown[string(cat)] + if count == 0 { + continue + } + + issues = append(issues, analyze.Issue{ + Name: string(cat), + Value: fmt.Sprintf("%d files (%.1f%%)", count, float64(count)/float64(total)*percentMultiplier), + Severity: severityForCategory(cat), + }) + } + + if limit > 0 && len(issues) > limit { + issues = issues[:limit] + } + + return issues +} + +// severityForCategory returns the appropriate severity for a file category. +func severityForCategory(cat filehistory.Category) string { + switch cat { + case filehistory.CategoryBinary: + return analyze.SeverityPoor + case filehistory.CategorySource, + filehistory.CategoryVendor, + filehistory.CategoryGenerated, + filehistory.CategoryDocumentation, + filehistory.CategoryConfiguration, + filehistory.CategoryImage, + filehistory.CategoryDotFile: + return analyze.SeverityInfo + } + + return analyze.SeverityInfo +} + +// getBreakdown extracts the breakdown map from a report. 
+func getBreakdown(report analyze.Report) map[string]int { + raw, ok := report[keyBreakdown] + if !ok { + return nil + } + + m, isMap := raw.(map[string]int) + if isMap { + return m + } + + return nil +} diff --git a/internal/analyzers/composition/report_section_test.go b/internal/analyzers/composition/report_section_test.go new file mode 100644 index 0000000..5a5e3b2 --- /dev/null +++ b/internal/analyzers/composition/report_section_test.go @@ -0,0 +1,193 @@ +package composition + +import ( + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" +) + +func newTestCompositionReport() analyze.Report { + return analyze.Report{ + keyTotalFiles: 10, + keyBreakdown: map[string]int{ + string(filehistory.CategorySource): 6, + string(filehistory.CategoryVendor): 2, + string(filehistory.CategoryDocumentation): 1, + string(filehistory.CategoryBinary): 1, + }, + keyPercentage: map[string]float64{ + string(filehistory.CategorySource): 60.0, + string(filehistory.CategoryVendor): 20.0, + string(filehistory.CategoryDocumentation): 10.0, + string(filehistory.CategoryBinary): 10.0, + }, + } +} + +func TestCompositionSection_Title(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + assert.Equal(t, sectionTitle, s.SectionTitle()) +} + +func TestCompositionSection_Score_InfoOnly(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + assert.InDelta(t, analyze.ScoreInfoOnly, s.Score(), 0.001) +} + +func TestCompositionSection_StatusMessage(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + assert.Equal(t, statusDefault, s.StatusMessage()) +} + +func TestCompositionSection_StatusMessage_Empty(t *testing.T) { + t.Parallel() + + s := NewReportSection(analyze.Report{}) + assert.Equal(t, statusEmpty, 
s.StatusMessage()) +} + +func TestCompositionSection_NilReport(t *testing.T) { + t.Parallel() + + s := NewReportSection(nil) + assert.Equal(t, sectionTitle, s.SectionTitle()) + assert.Equal(t, statusEmpty, s.StatusMessage()) +} + +func TestCompositionSection_KeyMetrics_Count(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + metrics := s.KeyMetrics() + + const expectedMetrics = 3 + require.Len(t, metrics, expectedMetrics) +} + +func TestCompositionSection_KeyMetrics_Labels(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + metrics := s.KeyMetrics() + + assert.Equal(t, metricTotalFiles, metrics[0].Label) + assert.Equal(t, metricSource, metrics[1].Label) + assert.Equal(t, metricSourcePct, metrics[2].Label) +} + +func TestCompositionSection_KeyMetrics_Values(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + metrics := s.KeyMetrics() + + assert.Equal(t, "10", metrics[0].Value) + assert.Equal(t, "6", metrics[1].Value) + assert.Contains(t, metrics[2].Value, "60") +} + +func TestCompositionSection_Distribution(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + dist := s.Distribution() + + require.NotNil(t, dist) + // 4 categories with non-zero counts. + require.Len(t, dist, 4) + + // First should be source (order follows AllCategories). 
+ assert.Equal(t, string(filehistory.CategorySource), dist[0].Label) + assert.Equal(t, 6, dist[0].Count) +} + +func TestCompositionSection_Distribution_Empty(t *testing.T) { + t.Parallel() + + s := NewReportSection(analyze.Report{}) + assert.Nil(t, s.Distribution()) +} + +func TestCompositionSection_TopIssues(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + + issues := s.TopIssues(2) + require.Len(t, issues, 2) +} + +func TestCompositionSection_AllIssues(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + + issues := s.AllIssues() + // 3 non-source categories with counts: vendor, docs, binary. + require.Len(t, issues, 3) +} + +func TestCompositionSection_Issues_BinarySeverityPoor(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + + issues := s.AllIssues() + + var binaryIssue *analyze.Issue + + for idx := range issues { + if issues[idx].Name == string(filehistory.CategoryBinary) { + binaryIssue = &issues[idx] + + break + } + } + + require.NotNil(t, binaryIssue, "binary category must appear in issues") + assert.Equal(t, analyze.SeverityPoor, binaryIssue.Severity) +} + +func TestCompositionSection_Issues_VendorSeverityInfo(t *testing.T) { + t.Parallel() + + s := NewReportSection(newTestCompositionReport()) + + issues := s.AllIssues() + + var vendorIssue *analyze.Issue + + for idx := range issues { + if issues[idx].Name == string(filehistory.CategoryVendor) { + vendorIssue = &issues[idx] + + break + } + } + + require.NotNil(t, vendorIssue, "vendor category must appear in issues") + assert.Equal(t, analyze.SeverityInfo, vendorIssue.Severity) +} + +func TestCompositionSection_Issues_Empty(t *testing.T) { + t.Parallel() + + s := NewReportSection(analyze.Report{}) + assert.Nil(t, s.AllIssues()) +} + +func TestCompositionSection_ImplementsInterface(t *testing.T) { + t.Parallel() + + var _ analyze.ReportSection = (*ReportSection)(nil) +} diff --git 
a/internal/analyzers/couples/metrics.go b/internal/analyzers/couples/metrics.go index cd0b419..8fb347c 100644 --- a/internal/analyzers/couples/metrics.go +++ b/internal/analyzers/couples/metrics.go @@ -6,6 +6,7 @@ import ( "sort" "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/identity" "github.com/Sumatoshi-tech/codefang/pkg/alg/hll" "github.com/Sumatoshi-tech/codefang/pkg/metrics" ) @@ -74,10 +75,12 @@ type FileCouplingData struct { // DeveloperCouplingData contains coupling data for a developer pair. type DeveloperCouplingData struct { - Developer1 string `json:"developer1" yaml:"developer1"` - Developer2 string `json:"developer2" yaml:"developer2"` - SharedFiles int64 `json:"shared_file_changes" yaml:"shared_file_changes"` - Strength float64 `json:"coupling_strength" yaml:"coupling_strength"` + Developer1 string `json:"developer1" yaml:"developer1"` + Developer1Email string `json:"developer1_email,omitempty" yaml:"developer1_email,omitempty"` + Developer2 string `json:"developer2" yaml:"developer2"` + Developer2Email string `json:"developer2_email,omitempty" yaml:"developer2_email,omitempty"` + SharedFiles int64 `json:"shared_file_changes" yaml:"shared_file_changes"` + Strength float64 `json:"coupling_strength" yaml:"coupling_strength"` } // FileOwnershipData contains ownership information for a file. 
@@ -207,7 +210,7 @@ func (m *DeveloperCouplingMetric) Compute(input *ReportData) []DeveloperCoupling } func computeDevCouplings(devIdx int, row map[int]int64, matrix []map[int]int64, names []string) []DeveloperCouplingData { - dev1 := getDevName(devIdx, names) + dev1Name, dev1Email := getDevNameAndEmail(devIdx, names) var result []DeveloperCouplingData @@ -221,16 +224,21 @@ func computeDevCouplings(devIdx int, row map[int]int64, matrix []map[int]int64, selfDev2 = matrix[j][j] } - coupling := buildCouplingData(dev1, j, sharedChanges, row[devIdx], selfDev2, names) + dev2Name, dev2Email := getDevNameAndEmail(j, names) + coupling := buildCouplingData( + dev1Name, dev1Email, dev2Name, dev2Email, + sharedChanges, row[devIdx], selfDev2, + ) result = append(result, coupling) } return result } -func buildCouplingData(dev1 string, dev2Idx int, sharedChanges, selfDev1, selfDev2 int64, names []string) DeveloperCouplingData { - dev2 := getDevName(dev2Idx, names) - +func buildCouplingData( + dev1Name, dev1Email, dev2Name, dev2Email string, + sharedChanges, selfDev1, selfDev2 int64, +) DeveloperCouplingData { // Coupling strength using code-maat formula: // degree = shared_changes / average(self_dev1, self_dev2), capped at 1.0. avgRevs := float64(selfDev1+selfDev2) / pairCount @@ -241,19 +249,21 @@ func buildCouplingData(dev1 string, dev2Idx int, sharedChanges, selfDev1, selfDe } return DeveloperCouplingData{ - Developer1: dev1, - Developer2: dev2, - SharedFiles: sharedChanges, - Strength: strength, + Developer1: dev1Name, + Developer1Email: dev1Email, + Developer2: dev2Name, + Developer2Email: dev2Email, + SharedFiles: sharedChanges, + Strength: strength, } } -func getDevName(idx int, names []string) string { +func getDevNameAndEmail(idx int, names []string) (name, email string) { if idx < len(names) { - return names[idx] + return identity.SplitIdentity(names[idx]) } - return "" + return "", "" } // FileOwnershipMetric computes file ownership information. 
diff --git a/internal/analyzers/couples/store_writer_test.go b/internal/analyzers/couples/store_writer_test.go index 6f006bf..48dae53 100644 --- a/internal/analyzers/couples/store_writer_test.go +++ b/internal/analyzers/couples/store_writer_test.go @@ -1,7 +1,5 @@ package couples -// FRD: specs/frds/FRD-20260228-couples-store-writer.md. - import ( "context" "sort" diff --git a/internal/analyzers/devs/analyzer.go b/internal/analyzers/devs/analyzer.go index 9fff651..15221d1 100644 --- a/internal/analyzers/devs/analyzer.go +++ b/internal/analyzers/devs/analyzer.go @@ -38,7 +38,9 @@ type DevTick struct { // It groups all per-commit developer data within one time bucket. type TickDevData struct { // DevData maps commit hash hex to per-commit developer statistics. - DevData map[string]*CommitDevData + DevData map[string]*CommitDevData + startTime time.Time + endTime time.Time } // Configuration option keys for the devs analyzer. @@ -350,10 +352,24 @@ func extractTC(tc analyze.TC, byTick map[int]*TickDevData) error { state, ok := byTick[tc.Tick] if !ok || state == nil { - state = &TickDevData{DevData: make(map[string]*CommitDevData)} + state = &TickDevData{ + DevData: make(map[string]*CommitDevData), + startTime: tc.Timestamp, + endTime: tc.Timestamp, + } byTick[tc.Tick] = state } + if !tc.Timestamp.IsZero() { + if tc.Timestamp.Before(state.startTime) || state.startTime.IsZero() { + state.startTime = tc.Timestamp + } + + if tc.Timestamp.After(state.endTime) { + state.endTime = tc.Timestamp + } + } + state.DevData[tc.CommitHash.String()] = cdd return nil @@ -404,8 +420,10 @@ func buildTick(tick int, state *TickDevData) (analyze.TICK, error) { } return analyze.TICK{ - Tick: tick, - Data: state, + Tick: tick, + StartTime: state.startTime, + EndTime: state.endTime, + Data: state, }, nil } @@ -500,5 +518,6 @@ func ticksToReport( "CommitsByTick": commitsByTick, "ReversedPeopleDict": names, "TickSize": tickSize, + "tick_bounds": analyze.BuildTickBounds(ticks), } } diff --git 
a/internal/analyzers/devs/dashboard_activity.go b/internal/analyzers/devs/dashboard_activity.go index 62d4adf..112a788 100644 --- a/internal/analyzers/devs/dashboard_activity.go +++ b/internal/analyzers/devs/dashboard_activity.go @@ -69,7 +69,7 @@ func buildTopDevSeries(data *DashboardData, topDevs []int, nameByID map[int]stri for _, devID := range topDevs { seriesData := make([]plotpage.SeriesData, len(data.Metrics.Activity)) for i, ad := range data.Metrics.Activity { - seriesData[i] = ad.ByDeveloper[devID] + seriesData[i] = commitsForDev(ad.ByDeveloper, devID) } series = append(series, plotpage.LineSeries{ @@ -89,9 +89,9 @@ func buildOthersSeries(data *DashboardData, topDevs []int) plotpage.LineSeries { for i, ad := range data.Metrics.Activity { total := 0 - for devID, commits := range ad.ByDeveloper { - if !slices.Contains(topDevs, devID) { - total += commits + for _, dc := range ad.ByDeveloper { + if !slices.Contains(topDevs, dc.DevID) { + total += dc.Commits } } @@ -119,3 +119,14 @@ func getTopDevIDs(developers []DeveloperData, limit int) []int { return ids } + +// commitsForDev finds the commit count for a specific developer ID in the activity array. 
+func commitsForDev(entries []DeveloperCommits, devID int) int { + for _, dc := range entries { + if dc.DevID == devID { + return dc.Commits + } + } + + return 0 +} diff --git a/internal/analyzers/devs/dashboard_languages.go b/internal/analyzers/devs/dashboard_languages.go index 2bbd6c3..819f7a2 100644 --- a/internal/analyzers/devs/dashboard_languages.go +++ b/internal/analyzers/devs/dashboard_languages.go @@ -94,9 +94,10 @@ func topDevsByContribution(data *DashboardData, n int) []DeveloperData { for i, dev := range data.Metrics.Developers { total := 0 + langStats := devLanguageMap(dev) for _, lang := range data.TopLanguages { - if stats, ok := dev.Languages[lang]; ok { + if stats, ok := langStats[lang]; ok { total += stats.Added + stats.Removed } } @@ -121,10 +122,11 @@ func topDevsByContribution(data *DashboardData, n int) []DeveloperData { // devContribution returns the total contribution (Added+Removed) for a developer // across the given languages. func devContribution(dev DeveloperData, langs []string) map[string]int { + langStats := devLanguageMap(dev) result := make(map[string]int, len(langs)) for _, lang := range langs { - if stats, ok := dev.Languages[lang]; ok { + if stats, ok := langStats[lang]; ok { result[lang] = stats.Added + stats.Removed } } @@ -132,6 +134,17 @@ func devContribution(dev DeveloperData, langs []string) map[string]int { return result } +// devLanguageMap builds a language-name lookup map from a developer's Languages slice. +func devLanguageMap(dev DeveloperData) map[string]LanguageStatsEntry { + m := make(map[string]LanguageStatsEntry, len(dev.Languages)) + + for _, entry := range dev.Languages { + m[entry.Language] = entry + } + + return m +} + // buildRadarData computes per-developer relative expertise profiles. // Each developer is normalized independently: their strongest language = 100%, // and all other languages are relative to that. 
This produces visually distinct diff --git a/internal/analyzers/devs/dashboard_workload.go b/internal/analyzers/devs/dashboard_workload.go index 45d0442..f26abd4 100644 --- a/internal/analyzers/devs/dashboard_workload.go +++ b/internal/analyzers/devs/dashboard_workload.go @@ -110,10 +110,10 @@ func findPrimaryLanguage(dev DeveloperData) string { primaryLang := langOther maxLines := 0 - for lang, stats := range dev.Languages { - if stats.Added > maxLines { - maxLines = stats.Added - primaryLang = lang + for _, entry := range dev.Languages { + if entry.Added > maxLines { + maxLines = entry.Added + primaryLang = entry.Language if primaryLang == "" { primaryLang = langOther diff --git a/internal/analyzers/devs/metrics.go b/internal/analyzers/devs/metrics.go index 0e48062..6a80183 100644 --- a/internal/analyzers/devs/metrics.go +++ b/internal/analyzers/devs/metrics.go @@ -38,10 +38,11 @@ func devIDBytes(id int) []byte { // TickData is the raw input data for devs metrics computation. type TickData struct { - Ticks map[int]map[int]*DevTick - Names []string - TickSize time.Duration - DevSketch *hll.Sketch `json:"-" yaml:"-"` + Ticks map[int]map[int]*DevTick + Names []string + TickSize time.Duration + TickBounds map[int]analyze.TickBounds + DevSketch *hll.Sketch `json:"-" yaml:"-"` } // AggregateCommitsToTicks builds per-tick per-developer data from per-commit @@ -127,6 +128,10 @@ func ParseTickDataWithPrecision(report analyze.Report, precision int) (*TickData TickSize: tickSize, } + if v, ok := report["tick_bounds"].(map[int]analyze.TickBounds); ok { + td.TickBounds = v + } + td.DevSketch = buildDevSketchWithPrecision(ticks, precision) return td, nil @@ -307,17 +312,55 @@ func buildCommitsByTickFromMap(cbtMap map[string]any) map[int][]gitlib.Hash { // DeveloperData contains computed data for a single developer. 
type DeveloperData struct { - ID int `json:"id" yaml:"id"` - Name string `json:"name" yaml:"name"` - Commits int `json:"commits" yaml:"commits"` - Added int `json:"lines_added" yaml:"lines_added"` - Removed int `json:"lines_removed" yaml:"lines_removed"` - Changed int `json:"lines_changed" yaml:"lines_changed"` - NetLines int `json:"net_lines" yaml:"net_lines"` - Languages map[string]pkgplumbing.LineStats `json:"languages" yaml:"languages"` - FirstTick int `json:"first_tick" yaml:"first_tick"` - LastTick int `json:"last_tick" yaml:"last_tick"` - ActiveTicks int `json:"active_ticks" yaml:"active_ticks"` + ID int `json:"id" yaml:"id"` + Name string `json:"name" yaml:"name"` + Email string `json:"email,omitempty" yaml:"email,omitempty"` + Commits int `json:"commits" yaml:"commits"` + Added int `json:"lines_added" yaml:"lines_added"` + Removed int `json:"lines_removed" yaml:"lines_removed"` + Changed int `json:"lines_changed" yaml:"lines_changed"` + NetLines int `json:"net_lines" yaml:"net_lines"` + Languages []LanguageStatsEntry `json:"languages" yaml:"languages"` + FirstTick int `json:"first_tick" yaml:"first_tick"` + LastTick int `json:"last_tick" yaml:"last_tick"` + ActiveTicks int `json:"active_ticks" yaml:"active_ticks"` + + // langMap is the internal accumulation map, converted to Languages by finalizeLanguages. + langMap map[string]pkgplumbing.LineStats `json:"-" yaml:"-"` +} + +// LanguageStatsEntry holds line stats for a single language. +type LanguageStatsEntry struct { + Language string `json:"language" yaml:"language"` + Added int `json:"added" yaml:"added"` + Removed int `json:"removed" yaml:"removed"` + Changed int `json:"changed" yaml:"changed"` +} + +// finalizeLanguages converts the internal langMap to a sorted Languages slice. 
+func (d *DeveloperData) finalizeLanguages() { + if len(d.langMap) == 0 { + return + } + + d.Languages = make([]LanguageStatsEntry, 0, len(d.langMap)) + + for lang, stats := range d.langMap { + if lang == "" { + lang = "Other" + } + + d.Languages = append(d.Languages, LanguageStatsEntry{ + Language: lang, + Added: stats.Added, + Removed: stats.Removed, + Changed: stats.Changed, + }) + } + + sort.Slice(d.Languages, func(i, j int) bool { + return d.Languages[i].Language < d.Languages[j].Language + }) } // LanguageData contains computed data for a programming language. @@ -337,26 +380,38 @@ type BusFactorData struct { TotalContributors int `json:"total_contributors" yaml:"total_contributors"` PrimaryDevID int `json:"primary_dev_id" yaml:"primary_dev_id"` PrimaryDevName string `json:"primary_dev_name" yaml:"primary_dev_name"` + PrimaryDevEmail string `json:"primary_dev_email,omitempty" yaml:"primary_dev_email,omitempty"` PrimaryPct float64 `json:"primary_percentage" yaml:"primary_percentage"` SecondaryDevID int `json:"secondary_dev_id,omitempty" yaml:"secondary_dev_id,omitempty"` SecondaryDevName string `json:"secondary_dev_name,omitempty" yaml:"secondary_dev_name,omitempty"` + SecondaryDevEmail string `json:"secondary_dev_email,omitempty" yaml:"secondary_dev_email,omitempty"` SecondaryPct float64 `json:"secondary_percentage,omitempty" yaml:"secondary_percentage,omitempty"` RiskLevel string `json:"risk_level" yaml:"risk_level"` } +// DeveloperCommits holds a developer's commit count within a single tick. +type DeveloperCommits struct { + DevID int `json:"dev_id" yaml:"dev_id"` + Commits int `json:"commits" yaml:"commits"` +} + // ActivityData contains time-series activity for a single tick. 
type ActivityData struct { - Tick int `json:"tick" yaml:"tick"` - ByDeveloper map[int]int `json:"by_developer" yaml:"by_developer"` - TotalCommits int `json:"total_commits" yaml:"total_commits"` + Tick int `json:"tick" yaml:"tick"` + StartTime string `json:"start_time,omitempty" yaml:"start_time,omitempty"` + EndTime string `json:"end_time,omitempty" yaml:"end_time,omitempty"` + ByDeveloper []DeveloperCommits `json:"by_developer" yaml:"by_developer"` + TotalCommits int `json:"total_commits" yaml:"total_commits"` } // ChurnData contains code churn for a single tick. type ChurnData struct { - Tick int `json:"tick" yaml:"tick"` - Added int `json:"lines_added" yaml:"lines_added"` - Removed int `json:"lines_removed" yaml:"lines_removed"` - Net int `json:"net_change" yaml:"net_change"` + Tick int `json:"tick" yaml:"tick"` + StartTime string `json:"start_time,omitempty" yaml:"start_time,omitempty"` + EndTime string `json:"end_time,omitempty" yaml:"end_time,omitempty"` + Added int `json:"lines_added" yaml:"lines_added"` + Removed int `json:"lines_removed" yaml:"lines_removed"` + Net int `json:"net_change" yaml:"net_change"` } // AggregateData contains summary statistics. 
@@ -420,10 +475,12 @@ func processTickDevs(tick int, devTicks map[int]*DevTick, devMap map[int]*Develo func getOrCreateDev(devID, tick int, devMap map[int]*DeveloperData, names []string) *DeveloperData { dev := devMap[devID] if dev == nil { + name, email := devNameAndEmail(devID, names) dev = &DeveloperData{ ID: devID, - Name: devName(devID, names), - Languages: make(map[string]pkgplumbing.LineStats), + Name: name, + Email: email, + langMap: make(map[string]pkgplumbing.LineStats), FirstTick: tick, LastTick: tick, } @@ -448,7 +505,7 @@ func updateDevStats(dev *DeveloperData, dt *DevTick, tick int) { dev.LastTick = tick } - mergeLanguageStats(dev.Languages, dt.Languages) + mergeLanguageStats(dev.langMap, dt.Languages) } func mergeLanguageStats(target, source map[string]pkgplumbing.LineStats) { @@ -467,6 +524,7 @@ func collectDevResults(devMap map[int]*DeveloperData) []DeveloperData { for _, dev := range devMap { dev.NetLines = dev.Added - dev.Removed + dev.finalizeLanguages() result = append(result, *dev) } @@ -496,7 +554,8 @@ func (m *LanguagesMetric) Compute(developers []DeveloperData) []LanguageData { langMap := make(map[string]*LanguageData) for _, dev := range developers { - for lang, langSt := range dev.Languages { + for _, langEntry := range dev.Languages { + lang := langEntry.Language if lang == "" { lang = "Other" } @@ -510,8 +569,8 @@ func (m *LanguagesMetric) Compute(developers []DeveloperData) []LanguageData { langMap[lang] = ld } - ld.TotalLines += langSt.Added - contribution := langSt.Added + langSt.Removed + ld.TotalLines += langEntry.Added + contribution := langEntry.Added + langEntry.Removed ld.TotalContribution += contribution ld.Contributors[dev.ID] += contribution } @@ -612,13 +671,13 @@ func (m *BusFactorMetric) ComputeWithOptions(input BusFactorInput, opts MetricOp if len(contribs) > 0 { bf.PrimaryDevID = contribs[0].id - bf.PrimaryDevName = devName(contribs[0].id, input.Names) + bf.PrimaryDevName, bf.PrimaryDevEmail = 
devNameAndEmail(contribs[0].id, input.Names) bf.PrimaryPct = stats.ToPercent(float64(contribs[0].lines) / float64(ld.TotalContribution)) } if len(contribs) > 1 { bf.SecondaryDevID = contribs[1].id - bf.SecondaryDevName = devName(contribs[1].id, input.Names) + bf.SecondaryDevName, bf.SecondaryDevEmail = devNameAndEmail(contribs[1].id, input.Names) bf.SecondaryPct = stats.ToPercent(float64(contribs[1].lines) / float64(ld.TotalContribution)) } @@ -692,16 +751,22 @@ func (m *ActivityMetric) Compute(input *TickData) []ActivityData { result := make([]ActivityData, len(tickKeys)) for i, tick := range tickKeys { - ad := ActivityData{ - Tick: tick, - ByDeveloper: make(map[int]int), - } + ad := ActivityData{Tick: tick} + + devIDs := mapx.SortedKeys(input.Ticks[tick]) + ad.ByDeveloper = make([]DeveloperCommits, 0, len(devIDs)) - for devID, dt := range input.Ticks[tick] { - ad.ByDeveloper[devID] = dt.Commits + for _, devID := range devIDs { + dt := input.Ticks[tick][devID] + ad.ByDeveloper = append(ad.ByDeveloper, DeveloperCommits{DevID: devID, Commits: dt.Commits}) ad.TotalCommits += dt.Commits } + if bounds, hasBounds := input.TickBounds[tick]; hasBounds { + ad.StartTime = bounds.FormatStartTime() + ad.EndTime = bounds.FormatEndTime() + } + result[i] = ad } @@ -742,6 +807,11 @@ func (m *ChurnMetric) Compute(input *TickData) []ChurnData { cd.Net = cd.Added - cd.Removed + if bounds, hasBounds := input.TickBounds[tick]; hasBounds { + cd.StartTime = bounds.FormatStartTime() + cd.EndTime = bounds.FormatEndTime() + } + result[i] = cd } @@ -1063,14 +1133,14 @@ func (m *ComputedMetrics) ToYAML() any { const defaultTickHours = 24 -func devName(id int, names []string) string { +func devNameAndEmail(id int, names []string) (name, email string) { if id == identity.AuthorMissing { - return identity.AuthorMissingName + return identity.AuthorMissingName, "" } if id >= 0 && id < len(names) { - return names[id] + return identity.SplitIdentity(names[id]) } - return fmt.Sprintf("dev_%d", id) + 
return fmt.Sprintf("dev_%d", id), "" } diff --git a/internal/analyzers/devs/metrics_test.go b/internal/analyzers/devs/metrics_test.go index 4d18301..32b8b42 100644 --- a/internal/analyzers/devs/metrics_test.go +++ b/internal/analyzers/devs/metrics_test.go @@ -30,6 +30,21 @@ const ( testHashB = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" ) +// findLang finds a LanguageStatsEntry by name in a developer's Languages slice. +func findLang(t *testing.T, langs []LanguageStatsEntry, name string) LanguageStatsEntry { + t.Helper() + + for _, l := range langs { + if l.Language == name { + return l + } + } + + t.Fatalf("language %q not found", name) + + return LanguageStatsEntry{} +} + // --- ParseTickData Tests ---. func TestParseTickData_Valid(t *testing.T) { @@ -247,12 +262,15 @@ func TestDevelopersMetric_LanguageAggregation(t *testing.T) { require.Len(t, result, 1) require.NotNil(t, result[0].Languages) - assert.Equal(t, 70, result[0].Languages[testLangGo].Added) // 50 + 20 - assert.Equal(t, 15, result[0].Languages[testLangGo].Removed) // 10 + 5 - assert.Equal(t, 8, result[0].Languages[testLangGo].Changed) // 5 + 3 - assert.Equal(t, 30, result[0].Languages[testLangPython].Added) - assert.Equal(t, 5, result[0].Languages[testLangPython].Removed) - assert.Equal(t, 2, result[0].Languages[testLangPython].Changed) + goLang := findLang(t, result[0].Languages, testLangGo) + assert.Equal(t, 70, goLang.Added) // 50 + 20 + assert.Equal(t, 15, goLang.Removed) // 10 + 5 + assert.Equal(t, 8, goLang.Changed) // 5 + 3 + + pyLang := findLang(t, result[0].Languages, testLangPython) + assert.Equal(t, 30, pyLang.Added) + assert.Equal(t, 5, pyLang.Removed) + assert.Equal(t, 2, pyLang.Changed) } func TestDevelopersMetric_ChangedField(t *testing.T) { @@ -306,7 +324,7 @@ func TestLanguagesMetric_SingleLanguage(t *testing.T) { developers := []DeveloperData{ { ID: 0, - Languages: map[string]pkgplumbing.LineStats{testLangGo: {Added: testLinesAdded}}, + Languages: []LanguageStatsEntry{{Language: 
testLangGo, Added: testLinesAdded}}, }, } metric := NewLanguagesMetric() @@ -326,9 +344,9 @@ func TestLanguagesMetric_MultipleLanguages_SortedByTotalLines(t *testing.T) { developers := []DeveloperData{ { ID: 0, - Languages: map[string]pkgplumbing.LineStats{ - testLangGo: {Added: 50}, - testLangPython: {Added: 150}, + Languages: []LanguageStatsEntry{ + {Language: testLangGo, Added: 50}, + {Language: testLangPython, Added: 150}, }, }, } @@ -350,7 +368,7 @@ func TestLanguagesMetric_EmptyLanguageName_BecomesOther(t *testing.T) { developers := []DeveloperData{ { ID: 0, - Languages: map[string]pkgplumbing.LineStats{"": {Added: testLinesAdded}}, + Languages: []LanguageStatsEntry{{Language: "", Added: testLinesAdded}}, }, } metric := NewLanguagesMetric() @@ -365,8 +383,8 @@ func TestLanguagesMetric_MultipleContributors(t *testing.T) { t.Parallel() developers := []DeveloperData{ - {ID: 0, Languages: map[string]pkgplumbing.LineStats{testLangGo: {Added: 60}}}, - {ID: 1, Languages: map[string]pkgplumbing.LineStats{testLangGo: {Added: 40}}}, + {ID: 0, Languages: []LanguageStatsEntry{{Language: testLangGo, Added: 60}}}, + {ID: 1, Languages: []LanguageStatsEntry{{Language: testLangGo, Added: 40}}}, } metric := NewLanguagesMetric() @@ -384,8 +402,8 @@ func TestLanguagesMetric_ContributionIncludesRemoved(t *testing.T) { t.Parallel() developers := []DeveloperData{ - {ID: 0, Languages: map[string]pkgplumbing.LineStats{testLangGo: {Added: 60, Removed: 40}}}, - {ID: 1, Languages: map[string]pkgplumbing.LineStats{testLangGo: {Added: 10, Removed: 90}}}, + {ID: 0, Languages: []LanguageStatsEntry{{Language: testLangGo, Added: 60, Removed: 40}}}, + {ID: 1, Languages: []LanguageStatsEntry{{Language: testLangGo, Added: 10, Removed: 90}}}, } metric := NewLanguagesMetric() @@ -590,8 +608,11 @@ func TestActivityMetric_SingleTick(t *testing.T) { require.Len(t, result, 1) assert.Equal(t, 0, result[0].Tick) assert.Equal(t, 8, result[0].TotalCommits) - assert.Equal(t, 5, result[0].ByDeveloper[0]) - 
assert.Equal(t, 3, result[0].ByDeveloper[1]) + require.Len(t, result[0].ByDeveloper, 2) + assert.Equal(t, 0, result[0].ByDeveloper[0].DevID) + assert.Equal(t, 5, result[0].ByDeveloper[0].Commits) + assert.Equal(t, 1, result[0].ByDeveloper[1].DevID) + assert.Equal(t, 3, result[0].ByDeveloper[1].Commits) } func TestActivityMetric_MultipleTicks(t *testing.T) { @@ -1011,12 +1032,16 @@ func TestParseCommitsByTick_FromMap(t *testing.T) { require.Len(t, result, 1) } -func TestDevName_Variants(t *testing.T) { +func TestDevNameAndEmail_Variants(t *testing.T) { t.Parallel() names := []string{"Alice", "Bob"} - assert.Equal(t, "Alice", devName(0, names)) - assert.Equal(t, "Bob", devName(1, names)) - assert.Contains(t, devName(99, names), "dev_99") + name0, _ := devNameAndEmail(0, names) + name1, _ := devNameAndEmail(1, names) + name99, _ := devNameAndEmail(99, names) + + assert.Equal(t, "Alice", name0) + assert.Equal(t, "Bob", name1) + assert.Contains(t, name99, "dev_99") } diff --git a/internal/analyzers/devs/plot.go b/internal/analyzers/devs/plot.go index 8cb915f..7ac33ca 100644 --- a/internal/analyzers/devs/plot.go +++ b/internal/analyzers/devs/plot.go @@ -59,7 +59,7 @@ func buildTopDevBarSeries(activity []ActivityData, topDevs []int, nameByID map[i for _, devID := range topDevs { data := make([]plotpage.SeriesData, len(activity)) for i, ad := range activity { - data[i] = ad.ByDeveloper[devID] + data[i] = commitsForDev(ad.ByDeveloper, devID) } name := nameByID[devID] @@ -92,9 +92,9 @@ func buildOthersBarSeries(activity []ActivityData, topDevs []int) plotpage.BarSe for i, ad := range activity { total := 0 - for devID, commits := range ad.ByDeveloper { - if !topDevsSet[devID] { - total += commits + for _, dc := range ad.ByDeveloper { + if !topDevsSet[dc.DevID] { + total += dc.Commits } } diff --git a/internal/analyzers/devs/store_writer_test.go b/internal/analyzers/devs/store_writer_test.go index 2a76beb..5d3c834 100644 --- a/internal/analyzers/devs/store_writer_test.go +++ 
b/internal/analyzers/devs/store_writer_test.go @@ -1,7 +1,5 @@ package devs -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. - import ( "context" "testing" diff --git a/internal/analyzers/file_history/aggregator.go b/internal/analyzers/file_history/aggregator.go index 44168de..aa9b0cc 100644 --- a/internal/analyzers/file_history/aggregator.go +++ b/internal/analyzers/file_history/aggregator.go @@ -411,6 +411,8 @@ func TicksToReport(ctx context.Context, ticks []analyze.TICK, repo *gitlib.Repos report["tick_composition"] = tickComposition } + report["tick_bounds"] = analyze.BuildTickBounds(ticks) + return report } diff --git a/internal/analyzers/file_history/metrics.go b/internal/analyzers/file_history/metrics.go index 2e1b6e9..782d85a 100644 --- a/internal/analyzers/file_history/metrics.go +++ b/internal/analyzers/file_history/metrics.go @@ -4,7 +4,6 @@ import ( "sort" "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" - pkgplumbing "github.com/Sumatoshi-tech/codefang/internal/plumbing" "github.com/Sumatoshi-tech/codefang/pkg/metrics" ) @@ -38,12 +37,20 @@ type FileChurnData struct { ChurnScore float64 `json:"churn_score" yaml:"churn_score"` } -// FileContributorData contains contributor statistics for a file. +// ContributorEntry holds line stats for a single contributor to a file. +type ContributorEntry struct { + DevID int `json:"dev_id" yaml:"dev_id"` + Added int `json:"added" yaml:"added"` + Removed int `json:"removed" yaml:"removed"` + Changed int `json:"changed" yaml:"changed"` +} + +// FileContributorData contains contributor breakdown for a file. 
type FileContributorData struct { - Path string `json:"path" yaml:"path"` - Contributors map[int]pkgplumbing.LineStats `json:"contributors" yaml:"contributors"` - TopContributorID int `json:"top_contributor_id" yaml:"top_contributor_id"` - TopContributorLines int `json:"top_contributor_lines" yaml:"top_contributor_lines"` + Path string `json:"path" yaml:"path"` + Contributors []ContributorEntry `json:"contributors" yaml:"contributors"` + TopContributorID int `json:"top_contributor_id" yaml:"top_contributor_id"` + TopContributorLines int `json:"top_contributor_lines" yaml:"top_contributor_lines"` } // HotspotData identifies high-churn files that may need attention. @@ -84,8 +91,10 @@ type CompositionData struct { // CompositionTimeSeriesEntry holds file composition for a single tick. type CompositionTimeSeriesEntry struct { - Tick int `json:"tick" yaml:"tick"` - Breakdown map[string]int `json:"breakdown" yaml:"breakdown"` + Tick int `json:"tick" yaml:"tick"` + StartTime string `json:"start_time,omitempty" yaml:"start_time,omitempty"` + EndTime string `json:"end_time,omitempty" yaml:"end_time,omitempty"` + Breakdown map[string]int `json:"breakdown" yaml:"breakdown"` } // --- Computed Metrics ---. 
@@ -150,7 +159,12 @@ func ComputeAllMetricsWithOptions(report analyze.Report, opts MetricOptions) (*C tickComp = nil } - composition, compositionTS := computeComposition(tickComp) + var tickBounds map[int]analyze.TickBounds + if v, tbOK := report["tick_bounds"].(map[int]analyze.TickBounds); tbOK { + tickBounds = v + } + + composition, compositionTS := computeComposition(tickComp, tickBounds) return &ComputedMetrics{ FileChurn: computeFileChurn(input), @@ -204,7 +218,16 @@ func computeFileContributors(input *ReportData) []FileContributorData { for path, fh := range input.Files { var topID, topLines int + contribs := make([]ContributorEntry, 0, len(fh.People)) + for devID, stats := range fh.People { + contribs = append(contribs, ContributorEntry{ + DevID: devID, + Added: stats.Added, + Removed: stats.Removed, + Changed: stats.Changed, + }) + totalLines := stats.Added + stats.Changed if totalLines > topLines { topLines = totalLines @@ -212,9 +235,13 @@ func computeFileContributors(input *ReportData) []FileContributorData { } } + sort.Slice(contribs, func(i, j int) bool { + return contribs[i].DevID < contribs[j].DevID + }) + result = append(result, FileContributorData{ Path: path, - Contributors: fh.People, + Contributors: contribs, TopContributorID: topID, TopContributorLines: topLines, }) @@ -278,7 +305,10 @@ func computeHotspotsWithOptions(input *ReportData, opts MetricOptions) []Hotspot return result } -func computeComposition(tickComp map[int]*CategoryCounts) (CompositionData, []CompositionTimeSeriesEntry) { +func computeComposition( + tickComp map[int]*CategoryCounts, + tickBounds map[int]analyze.TickBounds, +) (CompositionData, []CompositionTimeSeriesEntry) { comp := CompositionData{ Breakdown: make(map[string]int), Percentages: make(map[string]float64), @@ -312,10 +342,17 @@ func computeComposition(tickComp map[int]*CategoryCounts) (CompositionData, []Co } } - ts = append(ts, CompositionTimeSeriesEntry{ + entry := CompositionTimeSeriesEntry{ Tick: t, Breakdown: 
breakdown, - }) + } + + if bounds, hasBounds := tickBounds[t]; hasBounds { + entry.StartTime = bounds.FormatStartTime() + entry.EndTime = bounds.FormatEndTime() + } + + ts = append(ts, entry) } // Aggregate breakdown and percentages. diff --git a/internal/analyzers/file_history/store_writer.go b/internal/analyzers/file_history/store_writer.go index fb4c236..34fb3cd 100644 --- a/internal/analyzers/file_history/store_writer.go +++ b/internal/analyzers/file_history/store_writer.go @@ -63,7 +63,7 @@ func (h *HistoryAnalyzer) WriteToStoreFromAggregator( // Write composition time series if available. if len(fa.tickComposition) > 0 { - _, compositionTS := computeComposition(fa.tickComposition) + _, compositionTS := computeComposition(fa.tickComposition, nil) compErr := analyze.WriteSliceKind(w, KindComposition, compositionTS) if compErr != nil { diff --git a/internal/analyzers/file_history/store_writer_test.go b/internal/analyzers/file_history/store_writer_test.go index 746547f..74847a1 100644 --- a/internal/analyzers/file_history/store_writer_test.go +++ b/internal/analyzers/file_history/store_writer_test.go @@ -1,7 +1,5 @@ package filehistory -// FRD: specs/frds/FRD-20260301-burndown-filehistory-store-writer.md. - import ( "context" "fmt" diff --git a/internal/analyzers/halstead/aggregator.go b/internal/analyzers/halstead/aggregator.go index 0cc97e6..8853444 100644 --- a/internal/analyzers/halstead/aggregator.go +++ b/internal/analyzers/halstead/aggregator.go @@ -15,6 +15,7 @@ const ( // Aggregator aggregates Halstead analysis results. type Aggregator struct { *common.Aggregator + common.PerFileRetainer detailed *common.DetailedDataCollector } @@ -48,6 +49,10 @@ func (ha *Aggregator) SetAggregationMode(mode analyze.AggregationMode) { // Aggregate overrides the base Aggregate method to collect detailed functions. 
func (ha *Aggregator) Aggregate(results map[string]analyze.Report) { + for _, report := range results { + ha.Retain(report) + } + ha.detailed.CollectFromReports(results) ha.Aggregator.Aggregate(results) } diff --git a/internal/analyzers/halstead/aggregator_bench_test.go b/internal/analyzers/halstead/aggregator_bench_test.go index 50ab30a..ecc23ed 100644 --- a/internal/analyzers/halstead/aggregator_bench_test.go +++ b/internal/analyzers/halstead/aggregator_bench_test.go @@ -1,7 +1,5 @@ package halstead -// FRD: specs/frds/FRD-20260311-halstead-dedup.md. - import ( "fmt" "testing" diff --git a/internal/analyzers/halstead/aggregator_test.go b/internal/analyzers/halstead/aggregator_test.go index 4e49108..353e566 100644 --- a/internal/analyzers/halstead/aggregator_test.go +++ b/internal/analyzers/halstead/aggregator_test.go @@ -317,8 +317,6 @@ func TestBuildEmptyHalsteadResult(t *testing.T) { } } -// FRD: specs/frds/FRD-20260311-halstead-dedup.md. - func TestAggregator_DuplicateFuncNames_PreservedAcrossFiles(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/halstead/cms_test.go b/internal/analyzers/halstead/cms_test.go index e52499e..93d6399 100644 --- a/internal/analyzers/halstead/cms_test.go +++ b/internal/analyzers/halstead/cms_test.go @@ -72,7 +72,7 @@ func TestVisitor_CMSSketchPopulated_LargeFunction(t *testing.T) { traverser.Traverse(root) // Retrieve function metrics. 
- funcMetrics, ok := visitor.functionMetrics[cmsTestFuncName] + funcMetrics, ok := findFunctionMetrics(visitor.functionMetrics, cmsTestFuncName) require.True(t, ok, "function metrics must exist") require.NotNil(t, funcMetrics.OperatorSketch, "OperatorSketch should be populated for large function") @@ -93,7 +93,7 @@ func TestVisitor_CMSNotUsed_SmallFunction(t *testing.T) { traverser.RegisterVisitor(visitor) traverser.Traverse(root) - funcMetrics, ok := visitor.functionMetrics[cmsTestFuncName] + funcMetrics, ok := findFunctionMetrics(visitor.functionMetrics, cmsTestFuncName) require.True(t, ok, "function metrics must exist") @@ -112,7 +112,7 @@ func TestVisitor_CMSTotalMatchesExact(t *testing.T) { traverser.RegisterVisitor(visitor) traverser.Traverse(root) - funcMetrics, ok := visitor.functionMetrics[cmsTestFuncName] + funcMetrics, ok := findFunctionMetrics(visitor.functionMetrics, cmsTestFuncName) require.True(t, ok, "function metrics must exist") @@ -136,7 +136,7 @@ func TestVisitor_EstimatedFields_Populated(t *testing.T) { traverser.RegisterVisitor(visitor) traverser.Traverse(root) - funcMetrics, ok := visitor.functionMetrics[cmsTestFuncName] + funcMetrics, ok := findFunctionMetrics(visitor.functionMetrics, cmsTestFuncName) require.True(t, ok, "function metrics must exist") assert.Positive(t, funcMetrics.EstimatedTotalOperators, @@ -155,7 +155,7 @@ func TestVisitor_DerivedMetrics_CMSPath(t *testing.T) { traverser.RegisterVisitor(visitor) traverser.Traverse(root) - funcMetrics, ok := visitor.functionMetrics[cmsTestFuncName] + funcMetrics, ok := findFunctionMetrics(visitor.functionMetrics, cmsTestFuncName) require.True(t, ok, "function metrics must exist") @@ -334,3 +334,16 @@ func sumMapHelper(m map[string]int) int { return sum } + +// findFunctionMetrics returns the first metrics entry whose Name matches. +// +//nolint:unparam // tests pass cmsTestFuncName today; keep signature generic for future callers. 
+func findFunctionMetrics(metrics []*FunctionHalsteadMetrics, name string) (*FunctionHalsteadMetrics, bool) { + for _, m := range metrics { + if m.Name == name { + return m, true + } + } + + return nil, false +} diff --git a/internal/analyzers/halstead/halstead.go b/internal/analyzers/halstead/halstead.go index 107e16d..172445c 100644 --- a/internal/analyzers/halstead/halstead.go +++ b/internal/analyzers/halstead/halstead.go @@ -117,21 +117,21 @@ func extractOperandName(target *node.Node) (string, bool) { // Metrics holds all Halstead complexity measures. type Metrics struct { - Functions map[string]*FunctionHalsteadMetrics `json:"functions"` - EstimatedLength float64 `json:"estimated_length"` - EstimatedTotalOperators int64 `json:"estimated_total_operators" yaml:"estimated_total_operators"` - EstimatedTotalOperands int64 `json:"estimated_total_operands" yaml:"estimated_total_operands"` - TotalOperators int `json:"total_operators"` - TotalOperands int `json:"total_operands"` - Vocabulary int `json:"vocabulary"` - Length int `json:"length"` - DistinctOperators int `json:"distinct_operators"` - Volume float64 `json:"volume"` - Difficulty float64 `json:"difficulty"` - Effort float64 `json:"effort"` - TimeToProgram float64 `json:"time_to_program"` - DeliveredBugs float64 `json:"delivered_bugs"` - DistinctOperands int `json:"distinct_operands"` + Functions []*FunctionHalsteadMetrics `json:"functions"` + EstimatedLength float64 `json:"estimated_length"` + EstimatedTotalOperators int64 `json:"estimated_total_operators" yaml:"estimated_total_operators"` + EstimatedTotalOperands int64 `json:"estimated_total_operands" yaml:"estimated_total_operands"` + TotalOperators int `json:"total_operators"` + TotalOperands int `json:"total_operands"` + Vocabulary int `json:"vocabulary"` + Length int `json:"length"` + DistinctOperators int `json:"distinct_operators"` + Volume float64 `json:"volume"` + Difficulty float64 `json:"difficulty"` + Effort float64 `json:"effort"` + TimeToProgram 
float64 `json:"time_to_program"` + DeliveredBugs float64 `json:"delivered_bugs"` + DistinctOperands int `json:"distinct_operands"` } // FunctionHalsteadMetrics contains Halstead metrics for a single function. @@ -159,7 +159,6 @@ type FunctionHalsteadMetrics struct { // FunctionReportItem is a typed representation of a per-function halstead report item. // Includes assessment strings and operator/operand maps. Avoids map[string]any allocation. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. type FunctionReportItem struct { Operators map[string]int Operands map[string]int @@ -328,14 +327,13 @@ func (h *Analyzer) buildEmptyResult(message string) analyze.Report { } // calculateAllFunctionMetrics calculates metrics for all functions. -func (h *Analyzer) calculateAllFunctionMetrics(functions []*node.Node) map[string]*FunctionHalsteadMetrics { - functionMetrics := make(map[string]*FunctionHalsteadMetrics) +func (h *Analyzer) calculateAllFunctionMetrics(functions []*node.Node) []*FunctionHalsteadMetrics { + functionMetrics := make([]*FunctionHalsteadMetrics, 0, len(functions)) for _, fn := range functions { - funcName := h.getFunctionName(fn) funcMetrics := h.calculateFunctionHalsteadMetrics(fn) - funcMetrics.Name = funcName - functionMetrics[funcName] = funcMetrics + funcMetrics.Name = h.getFunctionName(fn) + functionMetrics = append(functionMetrics, funcMetrics) } return functionMetrics @@ -352,7 +350,7 @@ func (h *Analyzer) getFunctionName(fn *node.Node) string { } // calculateFileLevelMetrics calculates file-level metrics from function metrics. 
-func (h *Analyzer) calculateFileLevelMetrics(functionMetrics map[string]*FunctionHalsteadMetrics) *Metrics { +func (h *Analyzer) calculateFileLevelMetrics(functionMetrics []*FunctionHalsteadMetrics) *Metrics { fileOperators := make(map[string]int) fileOperands := make(map[string]int) @@ -393,8 +391,7 @@ func (h *Analyzer) aggregateOperatorsAndOperandsFromMetrics( } // buildDetailedFunctionsTable creates the detailed functions table as typed structs. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. -func (h *Analyzer) buildDetailedFunctionsTable(functionMetrics map[string]*FunctionHalsteadMetrics) []FunctionReportItem { +func (h *Analyzer) buildDetailedFunctionsTable(functionMetrics []*FunctionHalsteadMetrics) []FunctionReportItem { items := make([]FunctionReportItem, 0, len(functionMetrics)) for _, fn := range functionMetrics { @@ -468,7 +465,6 @@ func convertHalsteadFunctionItems(items any, sourceFile string) []map[string]any } // buildResult constructs the final analysis result. -// FRD: specs/frds/FRD-20260311-typed-report-items.md. func (h *Analyzer) buildResult( fileMetrics *Metrics, reportItems []FunctionReportItem, totalFunctions int, message string, ) analyze.Report { diff --git a/internal/analyzers/halstead/metrics.go b/internal/analyzers/halstead/metrics.go index 1ae9685..9e525ef 100644 --- a/internal/analyzers/halstead/metrics.go +++ b/internal/analyzers/halstead/metrics.go @@ -32,6 +32,9 @@ type ReportData struct { // FunctionData holds Halstead data for a single function. 
type FunctionData struct { Name string + SourceFile string + Language string + Directory string Volume float64 Difficulty float64 Effort float64 @@ -130,11 +133,31 @@ func parseReportFunctions(report analyze.Report) []FunctionData { func parseFunctionData(fn map[string]any) FunctionData { fd := FunctionData{} + parseFuncIdentity(&fd, fn) + parseFuncHalsteadMetrics(&fd, fn) + return fd +} + +func parseFuncIdentity(fd *FunctionData, fn map[string]any) { if name, ok := fn["name"].(string); ok { fd.Name = name } + if sf, ok := fn[analyze.SourceFileKey].(string); ok { + fd.SourceFile = sf + } + + if lang, ok := fn[analyze.LanguageKey].(string); ok { + fd.Language = lang + } + + if dir, ok := fn[analyze.DirectoryKey].(string); ok { + fd.Directory = dir + } +} + +func parseFuncHalsteadMetrics(fd *FunctionData, fn map[string]any) { if v, ok := fn["volume"].(float64); ok { fd.Volume = v } @@ -182,21 +205,22 @@ func parseFunctionData(fn map[string]any) FunctionData { if v, ok := fn["estimated_length"].(float64); ok { fd.EstimatedLength = v } - - return fd } // --- Output Data Types ---. // FunctionHalsteadData contains Halstead metrics for a function. 
type FunctionHalsteadData struct { - Name string `json:"name" yaml:"name"` - Volume float64 `json:"volume" yaml:"volume"` - Difficulty float64 `json:"difficulty" yaml:"difficulty"` - Effort float64 `json:"effort" yaml:"effort"` - TimeToProgram float64 `json:"time_to_program" yaml:"time_to_program"` - DeliveredBugs float64 `json:"delivered_bugs" yaml:"delivered_bugs"` - ComplexityLevel string `json:"complexity_level" yaml:"complexity_level"` + Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` + Volume float64 `json:"volume" yaml:"volume"` + Difficulty float64 `json:"difficulty" yaml:"difficulty"` + Effort float64 `json:"effort" yaml:"effort"` + TimeToProgram float64 `json:"time_to_program" yaml:"time_to_program"` + DeliveredBugs float64 `json:"delivered_bugs" yaml:"delivered_bugs"` + ComplexityLevel string `json:"complexity_level" yaml:"complexity_level"` } // EffortDistributionData contains effort distribution counts. @@ -209,12 +233,15 @@ type EffortDistributionData struct { // HighEffortFunctionData identifies functions with high effort. 
type HighEffortFunctionData struct { - Name string `json:"name" yaml:"name"` - Volume float64 `json:"volume" yaml:"volume"` - Effort float64 `json:"effort" yaml:"effort"` - TimeToProgram float64 `json:"time_to_program" yaml:"time_to_program"` - DeliveredBugs float64 `json:"delivered_bugs" yaml:"delivered_bugs"` - RiskLevel string `json:"risk_level" yaml:"risk_level"` + Name string `json:"name" yaml:"name"` + SourceFile string `json:"source_file,omitempty" yaml:"source_file,omitempty"` + Language string `json:"language,omitempty" yaml:"language,omitempty"` + Directory string `json:"directory,omitempty" yaml:"directory,omitempty"` + Volume float64 `json:"volume" yaml:"volume"` + Effort float64 `json:"effort" yaml:"effort"` + TimeToProgram float64 `json:"time_to_program" yaml:"time_to_program"` + DeliveredBugs float64 `json:"delivered_bugs" yaml:"delivered_bugs"` + RiskLevel string `json:"risk_level" yaml:"risk_level"` } // AggregateData contains summary statistics. @@ -303,6 +330,9 @@ func (m *FunctionHalsteadMetric) Compute(input *ReportData) []FunctionHalsteadDa result = append(result, FunctionHalsteadData{ Name: fn.Name, + SourceFile: fn.SourceFile, + Language: fn.Language, + Directory: fn.Directory, Volume: fn.Volume, Difficulty: fn.Difficulty, Effort: fn.Effort, @@ -405,6 +435,7 @@ func (m *HighEffortFunctionMetric) Compute(input *ReportData) []HighEffortFuncti result = append(result, HighEffortFunctionData{ Name: fn.Name, + SourceFile: fn.SourceFile, Volume: fn.Volume, Effort: fn.Effort, TimeToProgram: fn.TimeToProgram, diff --git a/internal/analyzers/halstead/report_section.go b/internal/analyzers/halstead/report_section.go index 430ece5..c572606 100644 --- a/internal/analyzers/halstead/report_section.go +++ b/internal/analyzers/halstead/report_section.go @@ -160,6 +160,7 @@ func (s *ReportSection) halsteadIssues(limit int) []analyze.Issue { bugs := reportutil.GetFloat64(fn, KeyFuncBugs) issues = append(issues, analyze.Issue{ Name: name, + Location: 
reportutil.MapString(fn, analyze.SourceFileKey), Value: formatIssueValue(effort, volume, bugs), Severity: severityForFunction(effort, bugs), }) diff --git a/internal/analyzers/halstead/visitor.go b/internal/analyzers/halstead/visitor.go index 14fc0d6..c128f4b 100644 --- a/internal/analyzers/halstead/visitor.go +++ b/internal/analyzers/halstead/visitor.go @@ -16,7 +16,7 @@ type halsteadContext struct { type Visitor struct { metrics *MetricsCalculator detector *OperatorOperandDetector - functionMetrics map[string]*FunctionHalsteadMetrics + functionMetrics []*FunctionHalsteadMetrics contexts *common.ContextStack[*halsteadContext] nodeStack *common.ContextStack[*node.Node] } @@ -24,11 +24,10 @@ type Visitor struct { // NewVisitor creates a new Visitor. func NewVisitor() *Visitor { return &Visitor{ - contexts: common.NewContextStack[*halsteadContext](), - metrics: NewMetricsCalculator(), - detector: NewOperatorOperandDetector(), - functionMetrics: make(map[string]*FunctionHalsteadMetrics), - nodeStack: common.NewContextStack[*node.Node](), + contexts: common.NewContextStack[*halsteadContext](), + metrics: NewMetricsCalculator(), + detector: NewOperatorOperandDetector(), + nodeStack: common.NewContextStack[*node.Node](), } } @@ -132,7 +131,7 @@ func (v *Visitor) popContext() { v.metrics.CalculateHalsteadMetrics(ctx.metrics) // Store result. 
- v.functionMetrics[ctx.metrics.Name] = ctx.metrics + v.functionMetrics = append(v.functionMetrics, ctx.metrics) } func (v *Visitor) currentContext() *halsteadContext { diff --git a/internal/analyzers/halstead/visitor_dedup_test.go b/internal/analyzers/halstead/visitor_dedup_test.go new file mode 100644 index 0000000..757b106 --- /dev/null +++ b/internal/analyzers/halstead/visitor_dedup_test.go @@ -0,0 +1,59 @@ +package halstead + +import ( + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/pkg/uast/pkg/node" +) + +// TestVisitor_CountsAllSameNameFunctions guards against the regression where +// per-function metrics were stored in a map keyed by function name only. +// Multiple functions in the same file sharing a name (e.g. methods named +// `Read` on different receivers in Go) were silently overwriting each other, +// and `total_functions` was reported as `len(map)` rather than the actual +// number of declared functions. 
+func TestVisitor_CountsAllSameNameFunctions(t *testing.T) { + t.Parallel() + + const ( + sharedName = "Read" + dupCount = 5 + ) + + root := &node.Node{Type: node.UASTFile} + + for range dupCount { + fn := &node.Node{Type: node.UASTFunction} + fn.Roles = []node.Role{node.RoleFunction, node.RoleDeclaration} + + nameNode := node.NewNodeWithToken(node.UASTIdentifier, sharedName) + nameNode.Roles = []node.Role{node.RoleName} + fn.AddChild(nameNode) + + root.AddChild(fn) + } + + visitor := NewVisitor() + traverser := analyze.NewMultiAnalyzerTraverser() + traverser.RegisterVisitor(visitor) + traverser.Traverse(root) + + assert.Lenf(t, visitor.functionMetrics, dupCount, + "visitor must record one entry per function declaration, not dedup by name") + + report := visitor.GetReport() + + totalFunctions, ok := report["total_functions"].(int) + require.True(t, ok, "total_functions must be present and int-typed") + assert.Equalf(t, dupCount, totalFunctions, + "reported total_functions must match declarations, not unique names") + + items, ok := analyze.ReportFunctionList(report, "functions") + require.True(t, ok, "functions collection must be readable") + assert.Lenf(t, items, dupCount, + "detailed function items must include every declaration, not dedup by name") +} diff --git a/internal/analyzers/imports/aggregator.go b/internal/analyzers/imports/aggregator.go index 8b398b9..b4760a9 100644 --- a/internal/analyzers/imports/aggregator.go +++ b/internal/analyzers/imports/aggregator.go @@ -3,10 +3,13 @@ package imports import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common" ) // Aggregator aggregates import analysis results across multiple files. type Aggregator struct { + common.PerFileRetainer + allImports map[string]int // Import path -> count. totalFiles int } @@ -21,6 +24,7 @@ func NewAggregator() *Aggregator { // Aggregate combines results from multiple files. 
func (a *Aggregator) Aggregate(results map[string]analyze.Report) { for _, report := range results { + a.Retain(report) a.totalFiles++ if imports, ok := report["imports"].([]string); ok { diff --git a/internal/analyzers/imports/report_section.go b/internal/analyzers/imports/report_section.go index e37b039..6383907 100644 --- a/internal/analyzers/imports/report_section.go +++ b/internal/analyzers/imports/report_section.go @@ -89,10 +89,13 @@ func (s *ReportSection) AllIssues() []analyze.Issue { } // importIssues builds import issues sorted by frequency (or name), limited to limit (0 = all). +// When the report contains a _source_file key, it is used as the Location for each issue. func (s *ReportSection) importIssues(limit int) []analyze.Issue { + location := reportutil.GetString(s.report, analyze.SourceFileKey) + counts := reportutil.GetStringIntMap(s.report, KeyImportCounts) if len(counts) > 0 { - return buildIssuesFromCounts(counts, limit) + return buildIssuesFromCounts(counts, limit, location) } // Fallback: use simple imports list. @@ -101,11 +104,11 @@ func (s *ReportSection) importIssues(limit int) []analyze.Issue { return nil } - return buildIssuesFromList(imports, limit) + return buildIssuesFromList(imports, limit, location) } // buildIssuesFromCounts creates sorted issues from import_counts map. 
-func buildIssuesFromCounts(counts map[string]int, limit int) []analyze.Issue { +func buildIssuesFromCounts(counts map[string]int, limit int, location string) []analyze.Issue { entries := make([]importEntry, 0, len(counts)) for name, count := range counts { entries = append(entries, importEntry{name: name, count: count}) @@ -118,6 +121,7 @@ func buildIssuesFromCounts(counts map[string]int, limit int) []analyze.Issue { for _, e := range sorted { issues = append(issues, analyze.Issue{ Name: e.name, + Location: location, Value: reportutil.FormatInt(e.count), Severity: analyze.SeverityInfo, }) @@ -127,13 +131,14 @@ func buildIssuesFromCounts(counts map[string]int, limit int) []analyze.Issue { } // buildIssuesFromList creates issues from a simple string slice sorted alphabetically. -func buildIssuesFromList(imports []string, limit int) []analyze.Issue { +func buildIssuesFromList(imports []string, limit int, location string) []analyze.Issue { sorted := mapx.SortAndLimit(imports, importNameLess, limit) issues := make([]analyze.Issue, 0, len(sorted)) for _, imp := range sorted { issues = append(issues, analyze.Issue{ Name: imp, + Location: location, Value: "1", Severity: analyze.SeverityInfo, }) diff --git a/internal/analyzers/imports/report_section_test.go b/internal/analyzers/imports/report_section_test.go index 4d44281..bd32214 100644 --- a/internal/analyzers/imports/report_section_test.go +++ b/internal/analyzers/imports/report_section_test.go @@ -193,3 +193,46 @@ func TestImportsImplementsInterface(t *testing.T) { var _ analyze.ReportSection = (*ReportSection)(nil) } + +func TestImportsPerFile_IssuesHaveLocation(t *testing.T) { + t.Parallel() + + report := analyze.Report{ + "imports": []string{"fmt", "os"}, + "count": 2, + "import_counts": map[string]int{"fmt": 1, "os": 1}, + analyze.SourceFileKey: "/repo/pkg/foo.go", + } + + section := NewReportSection(report) + issues := section.AllIssues() + + if len(issues) == 0 { + t.Fatal("expected issues for imports") + } + + 
for _, issue := range issues { + if issue.Location != "/repo/pkg/foo.go" { + t.Errorf("issue %q location = %q, want %q", + issue.Name, issue.Location, "/repo/pkg/foo.go") + } + } +} + +func TestImportsPerFile_NoSourceFile_EmptyLocation(t *testing.T) { + t.Parallel() + + report := newTestImportsReport() + section := NewReportSection(report) + issues := section.AllIssues() + + if len(issues) == 0 { + t.Fatal("expected issues") + } + + for _, issue := range issues { + if issue.Location != "" { + t.Errorf("issue %q location = %q, want empty", issue.Name, issue.Location) + } + } +} diff --git a/internal/analyzers/imports/store_writer_test.go b/internal/analyzers/imports/store_writer_test.go index 2502b23..ce18107 100644 --- a/internal/analyzers/imports/store_writer_test.go +++ b/internal/analyzers/imports/store_writer_test.go @@ -1,7 +1,5 @@ package imports -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. - import ( "context" "sort" diff --git a/internal/analyzers/plumbing/identity.go b/internal/analyzers/plumbing/identity.go index e516c5e..c26a621 100644 --- a/internal/analyzers/plumbing/identity.go +++ b/internal/analyzers/plumbing/identity.go @@ -23,6 +23,7 @@ type IdentityDetector struct { ReversedPeopleDict []string AuthorID int ExactSignatures bool + // incrementalEmails and incrementalNames are used when building the dict incrementally // during Consume() when commits aren't available during Configure(). incrementalEmails map[int][]string diff --git a/internal/analyzers/plumbing/langpath/langpath.go b/internal/analyzers/plumbing/langpath/langpath.go new file mode 100644 index 0000000..eaf623a --- /dev/null +++ b/internal/analyzers/plumbing/langpath/langpath.go @@ -0,0 +1,82 @@ +// Package langpath converts user-supplied language tokens into +// deterministic pathspec globs backed by enry's Linguist data. 
+package langpath + +import ( + "errors" + "fmt" + "slices" + "strings" + + "github.com/src-d/enry/v2" + "github.com/src-d/enry/v2/data" +) + +// ErrUnknownLanguage is returned when a user-supplied token does not +// resolve to any Linguist language (including its aliases). +var ErrUnknownLanguage = errors.New("unknown language") + +// filenamesByLanguage inverts enry.data.LanguagesByFilename so we can +// look up "languages → []filename" at Globs time. Built once at +// package load; read-only thereafter. +var filenamesByLanguage = invertLanguagesByFilename() + +func invertLanguagesByFilename() map[string][]string { + out := make(map[string][]string) + + for filename, langs := range data.LanguagesByFilename { + for _, lang := range langs { + out[lang] = append(out[lang], filename) + } + } + + return out +} + +const ( + // allToken is the sentinel meaning "do not restrict by language". + allToken = "all" + // extensionGlobPrefix is prepended to every extension-derived glob. + extensionGlobPrefix = "*" +) + +// Globs converts a list of user-supplied language tokens into a +// sorted, deduplicated set of pathspec globs. wantsAll is true when +// the caller did not restrict languages (empty input or the literal +// "all" token). Callers should skip path-spec push-down in that case. 
+func Globs(langs []string) (globs []string, wantsAll bool, err error) { + if len(langs) == 0 { + return nil, true, nil + } + + set := make(map[string]struct{}) + + for _, raw := range langs { + token := strings.TrimSpace(raw) + if strings.EqualFold(token, allToken) { + return nil, true, nil + } + + canonical, ok := enry.GetLanguageByAlias(token) + if !ok { + return nil, false, fmt.Errorf("%w: %q", ErrUnknownLanguage, raw) + } + + for _, ext := range enry.GetLanguageExtensions(canonical) { + set[extensionGlobPrefix+ext] = struct{}{} + } + + for _, name := range filenamesByLanguage[canonical] { + set[name] = struct{}{} + } + } + + out := make([]string, 0, len(set)) + for g := range set { + out = append(out, g) + } + + slices.Sort(out) + + return out, false, nil +} diff --git a/internal/analyzers/plumbing/langpath/langpath_test.go b/internal/analyzers/plumbing/langpath/langpath_test.go new file mode 100644 index 0000000..6f63b63 --- /dev/null +++ b/internal/analyzers/plumbing/langpath/langpath_test.go @@ -0,0 +1,147 @@ +package langpath_test + +import ( + "slices" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/langpath" +) + +func TestGlobs_AllToken_YieldsWantsAll(t *testing.T) { + t.Parallel() + + globs, wantsAll, err := langpath.Globs([]string{"all"}) + + require.NoError(t, err) + assert.True(t, wantsAll, "all token must set wantsAll") + assert.Nil(t, globs, "wantsAll must return nil globs") +} + +func TestGlobs_ReturnsFreshSlicePerCall(t *testing.T) { + t.Parallel() + + a, _, errA := langpath.Globs([]string{"go"}) + require.NoError(t, errA) + require.NotEmpty(t, a) + + b, _, errB := langpath.Globs([]string{"go"}) + require.NoError(t, errB) + require.NotEmpty(t, b) + + const tampered = "tampered" + + a[0] = tampered + assert.NotEqual(t, tampered, b[0], + "mutating one call's result must not affect a subsequent call's result") +} + +func 
TestGlobs_Dockerfile_IncludesBasenameGlob(t *testing.T) { + t.Parallel() + + globs, wantsAll, err := langpath.Globs([]string{"dockerfile"}) + + require.NoError(t, err) + assert.False(t, wantsAll) + assert.Contains(t, globs, "Dockerfile", + "filename-only languages must emit a literal-filename glob") +} + +func TestGlobs_MultipleLanguages_SortedAndDeduplicated(t *testing.T) { + t.Parallel() + + globs, wantsAll, err := langpath.Globs([]string{"python", "go", "python"}) + + require.NoError(t, err) + assert.False(t, wantsAll) + assert.NotEmpty(t, globs) + assert.True(t, slices.IsSorted(globs), "globs must be sorted") + assert.Contains(t, globs, "*.go", "go extension must be present") + assert.Contains(t, globs, "*.py", "python extension must be present") + assert.Len(t, mapset(globs), len(globs), "globs must be deduplicated") +} + +func mapset(xs []string) map[string]struct{} { + m := make(map[string]struct{}, len(xs)) + for _, x := range xs { + m[x] = struct{}{} + } + + return m +} + +func TestGlobs_UnknownToken_ReturnsErrUnknownLanguage(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + in []string + }{ + {"solo", []string{"notalang"}}, + {"after known", []string{"go", "notalang"}}, + {"before known", []string{"notalang", "go"}}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + globs, wantsAll, err := langpath.Globs(tt.in) + + require.ErrorIs(t, err, langpath.ErrUnknownLanguage) + assert.False(t, wantsAll) + assert.Nil(t, globs) + assert.Contains(t, err.Error(), "notalang") + }) + } +} + +func TestGlobs_GoToken_YieldsStarDotGo(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + in string + }{ + {"lowercase", "go"}, + {"titlecase", "Go"}, + {"uppercase", "GO"}, + {"padded", " go "}, + {"alias golang", "golang"}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + globs, wantsAll, err := langpath.Globs([]string{tt.in}) + + require.NoError(t, err) + 
assert.False(t, wantsAll) + assert.Equal(t, []string{"*.go"}, globs) + }) + } +} + +func TestGlobs_EmptyInput_YieldsWantsAll(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + in []string + }{ + {"nil slice", nil}, + {"empty slice", []string{}}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + globs, wantsAll, err := langpath.Globs(tt.in) + + require.NoError(t, err) + assert.True(t, wantsAll) + assert.Nil(t, globs) + }) + } +} diff --git a/internal/analyzers/plumbing/pathpolicy/pathpolicy.go b/internal/analyzers/plumbing/pathpolicy/pathpolicy.go new file mode 100644 index 0000000..0cfdb88 --- /dev/null +++ b/internal/analyzers/plumbing/pathpolicy/pathpolicy.go @@ -0,0 +1,65 @@ +// Package pathpolicy decides whether a file path should be excluded +// from analysis based on user-visible options that mirror the CLI +// flags (--include-vendored, --include-generated, +// --extra-excluded-prefixes). Pure, stateless, cross-phase. +package pathpolicy + +import ( + "strings" + + "github.com/src-d/enry/v2" + + "github.com/Sumatoshi-tech/codefang/pkg/pathfilter" +) + +// defaultFilter carries the built-in generated-file heuristics +// (filename suffixes, prefixes, and content markers) as they ship in +// pkg/pathfilter. Reusing one immutable instance keeps allocation +// off the hot path. +var defaultFilter = pathfilter.New() + +// Options captures the user-visible configuration. +// The zero value excludes vendor, generated, and nothing else. +type Options struct { + IncludeVendored bool + IncludeGenerated bool + ExtraExcludedPrefixes []string +} + +// Exclude reports whether the given path should be skipped. +// content may be nil; when provided, content-based heuristics may +// refine the generated-file classification. 
+func Exclude(path string, content []byte, opts Options) bool { + switch { + case matchesAnyPrefix(path, opts.ExtraExcludedPrefixes): + return true + case !opts.IncludeVendored && enry.IsVendor(path): + return true + case !opts.IncludeGenerated && isGenerated(path, content): + return true + } + + return false +} + +// matchesAnyPrefix returns true if path begins with any non-empty +// entry of prefixes. +func matchesAnyPrefix(path string, prefixes []string) bool { + for _, prefix := range prefixes { + if prefix != "" && strings.HasPrefix(path, prefix) { + return true + } + } + + return false +} + +// isGenerated returns true if the path or header content identifies +// the file as machine-generated per the built-in heuristics. +func isGenerated(path string, content []byte) bool { + if defaultFilter.IsGeneratedPath(path) { + return true + } + + return len(content) > 0 && defaultFilter.IsGeneratedContent(content) +} diff --git a/internal/analyzers/plumbing/pathpolicy/pathpolicy_test.go b/internal/analyzers/plumbing/pathpolicy/pathpolicy_test.go new file mode 100644 index 0000000..4cff589 --- /dev/null +++ b/internal/analyzers/plumbing/pathpolicy/pathpolicy_test.go @@ -0,0 +1,134 @@ +package pathpolicy_test + +import ( + "testing" + + "github.com/stretchr/testify/assert" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" +) + +func TestExclude_PlainPath_Included(t *testing.T) { + t.Parallel() + + got := pathpolicy.Exclude("pkg/foo/bar.go", nil, pathpolicy.Options{}) + + assert.False(t, got, + "a non-vendor non-generated path must not be excluded under default options") +} + +func TestExclude_VendorPath_ExcludedByDefault(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + path string + }{ + {"go vendor", "vendor/github.com/pkg/errors/errors.go"}, + {"node_modules", "node_modules/left-pad/index.js"}, + {"third-party", "third_party/boringssl/src.c"}, + {"testdata", "pkg/foo/testdata/sample.json"}, + {"minified js", 
"static/jquery.min.js"}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + got := pathpolicy.Exclude(tt.path, nil, pathpolicy.Options{}) + assert.True(t, got, + "Linguist-vendored path must be excluded under default options: "+tt.path) + }) + } +} + +func TestExclude_GeneratedPath_ExcludedByDefault(t *testing.T) { + t.Parallel() + + tests := []struct { + name string + path string + }{ + {"go protobuf", "pkg/api/foo.pb.go"}, + {"k8s zz_generated", "pkg/apis/core/v1/zz_generated_deepcopy.go"}, + {"python protobuf", "pkg/api/foo_pb2.py"}, + {"mockgen", "mocks/mock_service.go"}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + t.Parallel() + + got := pathpolicy.Exclude(tt.path, nil, pathpolicy.Options{}) + assert.True(t, got, + "generated-looking path must be excluded under default options: "+tt.path) + }) + } +} + +func TestExclude_ExtraExcludedPrefixes_ExcludesMatches(t *testing.T) { + t.Parallel() + + opts := pathpolicy.Options{ + ExtraExcludedPrefixes: []string{".venv/", "docs/"}, + } + + assert.True(t, pathpolicy.Exclude(".venv/lib/foo.py", nil, opts), + ".venv/ prefix must exclude python virtualenv content") + assert.True(t, pathpolicy.Exclude("docs/README.md", nil, opts), + "docs/ prefix must exclude documentation") + assert.False(t, pathpolicy.Exclude("pkg/foo.go", nil, opts), + "a non-matching path must not be excluded") +} + +func TestExclude_ExtraExcludedPrefixes_BypassIncludeOverrides(t *testing.T) { + t.Parallel() + + opts := pathpolicy.Options{ + IncludeVendored: true, + IncludeGenerated: true, + ExtraExcludedPrefixes: []string{"vendor/"}, + } + + assert.True(t, pathpolicy.Exclude("vendor/foo.go", nil, opts), + "ExtraExcludedPrefixes must still apply even when include flags are set") +} + +func TestExclude_GeneratedContentMarker_ExcludedByDefault(t *testing.T) { + t.Parallel() + + content := []byte("// Code generated by protoc-gen-go. 
DO NOT EDIT.\npackage foo\n") + + got := pathpolicy.Exclude("pkg/foo/ordinary.go", content, pathpolicy.Options{}) + assert.True(t, got, + "content starting with a generated-file marker must be excluded under default options") +} + +func TestExclude_IncludeGenerated_KeepsContentMarker(t *testing.T) { + t.Parallel() + + content := []byte("// Code generated by protoc-gen-go. DO NOT EDIT.\npackage foo\n") + opts := pathpolicy.Options{IncludeGenerated: true} + + got := pathpolicy.Exclude("pkg/foo/ordinary.go", content, opts) + assert.False(t, got, + "IncludeGenerated=true must keep a generated-content file in analysis") +} + +func TestExclude_IncludeGenerated_KeepsGenerated(t *testing.T) { + t.Parallel() + + opts := pathpolicy.Options{IncludeGenerated: true} + + got := pathpolicy.Exclude("pkg/api/foo.pb.go", nil, opts) + assert.False(t, got, + "IncludeGenerated=true must keep generated paths in analysis") +} + +func TestExclude_IncludeVendored_KeepsVendor(t *testing.T) { + t.Parallel() + + opts := pathpolicy.Options{IncludeVendored: true} + + got := pathpolicy.Exclude("vendor/github.com/pkg/errors/errors.go", nil, opts) + assert.False(t, got, + "IncludeVendored=true must keep vendor paths in analysis") +} diff --git a/internal/analyzers/plumbing/plumbing_test.go b/internal/analyzers/plumbing/plumbing_test.go index c91d5fd..e48f906 100644 --- a/internal/analyzers/plumbing/plumbing_test.go +++ b/internal/analyzers/plumbing/plumbing_test.go @@ -6,6 +6,7 @@ import ( "github.com/stretchr/testify/require" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" "github.com/Sumatoshi-tech/codefang/pkg/gitlib" ) @@ -26,6 +27,56 @@ func TestTreeDiffAnalyzer_Configure(t *testing.T) { require.NoError(t, err) } +func TestTreeDiffAnalyzer_Configure_BuildsPathspecFromLanguages(t *testing.T) { + t.Parallel() + + td := &TreeDiffAnalyzer{} + err := td.Configure(map[string]any{ + ConfigTreeDiffLanguages: []string{"go"}, + }) + + require.NoError(t, err) + 
require.NotEmpty(t, td.Pathspec, "pathspec must be built from --languages") + require.Contains(t, td.Pathspec, "*.go") +} + +func TestTreeDiffAnalyzer_Configure_AllLanguagesGivesEmptyPathspec(t *testing.T) { + t.Parallel() + + td := &TreeDiffAnalyzer{} + err := td.Configure(map[string]any{ + ConfigTreeDiffLanguages: []string{"all"}, + }) + + require.NoError(t, err) + require.Empty(t, td.Pathspec, + "all languages must skip path-spec push-down (empty pathspec)") +} + +func TestTreeDiffAnalyzer_Configure_AliasResolvesToCanonicalInLanguagesSet(t *testing.T) { + t.Parallel() + + td := &TreeDiffAnalyzer{} + err := td.Configure(map[string]any{ + ConfigTreeDiffLanguages: []string{"golang"}, + }) + + require.NoError(t, err) + require.True(t, td.Languages["go"], + "alias 'golang' must resolve so the Go-side filter recognizes canonical lowercase 'go'") +} + +func TestTreeDiffAnalyzer_Configure_UnknownLanguageReturnsError(t *testing.T) { + t.Parallel() + + td := &TreeDiffAnalyzer{} + err := td.Configure(map[string]any{ + ConfigTreeDiffLanguages: []string{"notalang"}, + }) + + require.Error(t, err, "unknown language must surface at Configure time") +} + func TestTreeDiffAnalyzer_Initialize(t *testing.T) { t.Parallel() @@ -128,6 +179,44 @@ func TestChangeEntry_Hash(t *testing.T) { } } +func TestTreeDiff_filterChanges_DefaultPolicyDropsVendor(t *testing.T) { + t.Parallel() + + hash := gitlib.NewHash("1111111111111111111111111111111111111111") + td := &TreeDiffAnalyzer{ + Languages: map[string]bool{allLanguages: true}, + } + + changes := gitlib.Changes{ + {Action: gitlib.Modify, To: gitlib.ChangeEntry{Name: "vendor/foo.go", Hash: hash}}, + {Action: gitlib.Modify, To: gitlib.ChangeEntry{Name: "pkg/bar.go", Hash: hash}}, + } + + filtered := td.filterChanges(context.Background(), changes) + require.Len(t, filtered, 1) + require.Equal(t, "pkg/bar.go", filtered[0].To.Name, + "default TreeDiffAnalyzer (zero PathPolicy) must drop vendor paths") +} + +func 
TestTreeDiff_filterChanges_IncludeVendoredKeepsVendor(t *testing.T) { + t.Parallel() + + hash := gitlib.NewHash("1111111111111111111111111111111111111111") + td := &TreeDiffAnalyzer{ + Languages: map[string]bool{allLanguages: true}, + PathPolicy: pathpolicy.Options{IncludeVendored: true, IncludeGenerated: true}, + } + + changes := gitlib.Changes{ + {Action: gitlib.Modify, To: gitlib.ChangeEntry{Name: "vendor/foo.go", Hash: hash}}, + {Action: gitlib.Modify, To: gitlib.ChangeEntry{Name: "pkg/bar.go", Hash: hash}}, + } + + filtered := td.filterChanges(context.Background(), changes) + require.Len(t, filtered, 2, + "IncludeVendored=true must keep vendor changes in the filtered set") +} + // TestTreeDiff_filterChanges_prefixBlacklist verifies blacklist uses path prefix match only. func TestTreeDiff_filterChanges_prefixBlacklist(t *testing.T) { t.Parallel() diff --git a/internal/analyzers/plumbing/ticks.go b/internal/analyzers/plumbing/ticks.go index 4769055..9115ab0 100644 --- a/internal/analyzers/plumbing/ticks.go +++ b/internal/analyzers/plumbing/ticks.go @@ -15,12 +15,15 @@ import ( // TicksSinceStart computes relative time ticks for each commit since the start. type TicksSinceStart struct { - tick0 *time.Time - commits map[int][]gitlib.Hash - remote string - TickSize time.Duration - previousTick int - Tick int + tick0 *time.Time + commits map[int][]gitlib.Hash + remote string + TickSize time.Duration + previousTick int + Tick int + lastValidWhen time.Time // Most recent in-window committer timestamp; substitution source. + tick0Set bool // tick0 has been seeded by an in-window commit. + anomalies *timeAnomalyTracker // Shared across Fork() clones so aggregated counts survive forking. 
} const ( @@ -90,6 +93,12 @@ func (t *TicksSinceStart) Initialize(_ *gitlib.Repository) error { } t.tick0 = &time.Time{} + t.tick0Set = false + t.lastValidWhen = time.Time{} + + if t.anomalies == nil { + t.anomalies = &timeAnomalyTracker{} + } t.previousTick = 0 if t.commits == nil || len(t.commits) > 0 { @@ -104,14 +113,14 @@ func (t *TicksSinceStart) Initialize(_ *gitlib.Repository) error { // Consume processes a single commit with the provided dependency results. func (t *TicksSinceStart) Consume(_ context.Context, ac *analyze.Context) (analyze.TC, error) { commit := ac.Commit - index := ac.Index + when := t.sanitizeWhen(commit.Committer().When) - if index == 0 { - tick0 := commit.Committer().When - *t.tick0 = FloorTime(tick0, t.TickSize) + if !t.tick0Set { + *t.tick0 = FloorTime(when, t.TickSize) + t.tick0Set = true } - tick := max(int(commit.Committer().When.Sub(*t.tick0)/t.TickSize), t.previousTick) + tick := max(int(when.Sub(*t.tick0)/t.TickSize), t.previousTick) t.previousTick = tick @@ -142,6 +151,58 @@ func (t *TicksSinceStart) Consume(_ context.Context, ac *analyze.Context) (analy return analyze.TC{}, nil } +// sanitizeWhen clamps a committer timestamp into the sane analysis window +// [minSaneCommitTime, [time.Now]()+maxClockSkew]. Out-of-window values are +// substituted with the most recent in-window timestamp seen, falling back +// to minSaneCommitTime on the first commit. Each substitution is counted +// and surfaced via TimeAnomalies(); the warning log is rate-limited. +// +// In-window inputs pass through unchanged and update lastValidWhen so +// future anomalies have a fresh substitution source. 
+func (t *TicksSinceStart) sanitizeWhen(when time.Time) time.Time { + upperBound := time.Now().Add(maxClockSkew) + + switch { + case when.Before(minSaneCommitTime): + replacement := t.substituteWhen() + t.anomalies.recordBeforeMin(when, replacement) + + return replacement + case when.After(upperBound): + replacement := t.substituteWhen() + t.anomalies.recordAfterMax(when, replacement) + + return replacement + } + + t.lastValidWhen = when + + return when +} + +// substituteWhen picks a stand-in for an out-of-window committer time: +// the most recent in-window value if we have one, otherwise the +// minSaneCommitTime floor (so the bad commit collapses to tick 0 instead +// of inflating the analysis period). +func (t *TicksSinceStart) substituteWhen() time.Time { + if t.lastValidWhen.IsZero() { + return minSaneCommitTime + } + + return t.lastValidWhen +} + +// TimeAnomalies returns the cumulative count of committer-timestamp +// anomalies clamped during this analyzer's run. See [TimeAnomalyStats] +// for the operational meaning. +func (t *TicksSinceStart) TimeAnomalies() TimeAnomalyStats { + if t.anomalies == nil { + return TimeAnomalyStats{} + } + + return t.anomalies.snapshot() +} + // FloorTime rounds a timestamp down to the nearest tick boundary. func FloorTime(t time.Time, d time.Duration) time.Time { result := t.Round(d) diff --git a/internal/analyzers/plumbing/ticks_anomaly.go b/internal/analyzers/plumbing/ticks_anomaly.go new file mode 100644 index 0000000..ca1e1be --- /dev/null +++ b/internal/analyzers/plumbing/ticks_anomaly.go @@ -0,0 +1,123 @@ +package plumbing + +import ( + "log" + "sync/atomic" + "time" +) + +// minSaneCommitTime is the lower bound for a plausible committer timestamp. +// Git itself first shipped in 2005; commits stamped before 1990-01-01 are +// almost certainly the result of a corrupt commit object, an unset system +// clock (epoch 0 → 1970), or a deliberate `GIT_COMMITTER_DATE=` override. 
+// +// Without this clamp a single such commit pegged tick0 to ~1970, after +// which every modern commit's Sub(tick0) overflowed the int64-nanosecond +// [time.Duration] and clamped to ~292 years. That clamp leaked into burndown +// as a 106 740-day "analysis period". See ticks.go: the bug was sticky via +// max(tick, previousTick). +var minSaneCommitTime = time.Date(1990, time.January, 1, 0, 0, 0, 0, time.UTC) + +// maxClockSkew is the upper-bound grace allowed past wall-clock time. A +// committer timestamp more than this far in the future is treated as +// anomalous regardless of repo content. +const maxClockSkew = 24 * time.Hour + +// anomalyLogIntervalNanos throttles the per-event "anomalous committer +// timestamp" log line so a repo with thousands of bad commits doesn't +// drown the operator-facing log. Same shape as +// burndown/mismatch_tracker's log throttle. +const anomalyLogIntervalNanos = int64(time.Second) + +// timeAnomalyTracker counts committer-timestamp anomalies detected during +// tick computation and rate-limits the warning log. Atomics make the +// tracker safe to call from the per-shard clones returned by Fork(); the +// sequential plumbing analyzer never actually races, but using atomics +// keeps Fork() safe by construction. +type timeAnomalyTracker struct { + beforeMin atomic.Int64 // Counter: timestamps before minSaneCommitTime. + afterMax atomic.Int64 // Counter: timestamps too far in the future. + dropped atomic.Int64 // Suppressed since last emitted log line. + lastLogNanos atomic.Int64 // Monotonic-ish slot timestamp. +} + +// recordBeforeMin bumps the before-min counter and emits a rate-limited +// warning. when is the bogus committer time we observed, replacement is +// the time we substituted into tick math. 
+func (t *timeAnomalyTracker) recordBeforeMin(when, replacement time.Time) { + t.beforeMin.Add(1) + t.maybeLog("before-min", when, replacement) +} + +// recordAfterMax bumps the after-max counter and emits a rate-limited +// warning. Mirrors recordBeforeMin for the future-clamp side. +func (t *timeAnomalyTracker) recordAfterMax(when, replacement time.Time) { + t.afterMax.Add(1) + t.maybeLog("after-max", when, replacement) +} + +// maybeLog emits one warning per anomalyLogIntervalNanos at most. Mirrors +// burndown.mismatchTracker.maybeLog: try to claim the slot via CAS; on +// failure (slot still warm), bump dropped and return silently. On success, +// flush the dropped tail in the emitted line. +func (t *timeAnomalyTracker) maybeLog(kind string, when, replacement time.Time) { + now := time.Now().UnixNano() + last := t.lastLogNanos.Load() + + if now-last < anomalyLogIntervalNanos { + t.dropped.Add(1) + + return + } + + if !t.lastLogNanos.CompareAndSwap(last, now) { + t.dropped.Add(1) + + return + } + + dropped := t.dropped.Swap(0) + if dropped == 0 { + log.Printf("ticks: %s anomalous committer timestamp %s, substituted %s", + kind, when.Format(time.RFC3339), replacement.Format(time.RFC3339)) + + return + } + + log.Printf("ticks: %s anomalous committer timestamp %s, substituted %s [dropped=%d since last]", + kind, when.Format(time.RFC3339), replacement.Format(time.RFC3339), dropped) +} + +// snapshot returns the running counts. Used by accessor TimeAnomalies() +// for tests and external observers. +func (t *timeAnomalyTracker) snapshot() TimeAnomalyStats { + return TimeAnomalyStats{ + BeforeMin: t.beforeMin.Load(), + AfterMax: t.afterMax.Load(), + } +} + +// TimeAnomalyStats reports anomalous committer-timestamp detections. 
+// +// BeforeMin counts commits whose committer time was earlier than the +// hard-coded floor (1990-01-01 UTC) — typically epoch-0 (1970) values +// from corrupt commit objects, unset system clocks, or deliberate +// GIT_COMMITTER_DATE overrides. +// +// AfterMax counts commits whose committer time was more than 24h past +// the analyzer's wall-clock — typically forged future timestamps +// ("--date=2099-01-01") or clock skew at commit time. +// +// In both cases the substituted time is the previous valid committer +// timestamp (or 1990-01-01 UTC if no valid commit has been seen yet), +// so the bad commit collapses onto the timeline at a sensible point +// instead of overflowing the int64-nanosecond Duration in ticks.go. +type TimeAnomalyStats struct { + BeforeMin int64 + AfterMax int64 +} + +// Total returns the combined count of anomalies on both bounds. +func (s TimeAnomalyStats) Total() int64 { + return s.BeforeMin + s.AfterMax +} diff --git a/internal/analyzers/plumbing/ticks_anomaly_test.go b/internal/analyzers/plumbing/ticks_anomaly_test.go new file mode 100644 index 0000000..c3b7472 --- /dev/null +++ b/internal/analyzers/plumbing/ticks_anomaly_test.go @@ -0,0 +1,271 @@ +package plumbing + +import ( + "context" + "sync" + "testing" + "time" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/pkg/gitlib" +) + +func newTicks(t *testing.T) *TicksSinceStart { + t.Helper() + + ts := &TicksSinceStart{} + + err := ts.Initialize(nil) + if err != nil { + t.Fatalf("Initialize: %v", err) + } + + return ts +} + +// makeCommit builds a minimal gitlib.TestCommit suitable for driving +// TicksSinceStart.Consume — only Hash, Committer, and NumParents are read +// by the tick path. +func makeCommit(when time.Time, hashByte byte) *gitlib.TestCommit { + parent := gitlib.Hash{} + parent[0] = hashByte // Any non-zero parent makes NumParents() > 0. 
+ commit := gitlib.NewTestCommit( + gitlib.Hash{hashByte}, + gitlib.Signature{Name: "T", Email: "t@t", When: when}, + "msg", + parent, + ) + + return commit +} + +func consume(t *testing.T, ts *TicksSinceStart, when time.Time, index int) int { + t.Helper() + + commit := makeCommit(when, byte(index+1)) + + _, err := ts.Consume(context.Background(), &analyze.Context{ + Commit: commit, + Index: index, + }) + if err != nil { + t.Fatalf("Consume: %v", err) + } + + return ts.Tick +} + +func TestSanitizeWhen_BeforeMin_FirstCommit_FallsBackToMinSaneTime(t *testing.T) { + t.Parallel() + + ts := newTicks(t) + + // First commit at unix epoch (1970) — the canonical "epoch-zero + // committer" failure mode that previously pegged tick0 to 1970 and + // produced a 106 740-day analysis period. + got := consume(t, ts, time.Unix(0, 0), 0) + if got != 0 { + t.Errorf("first-commit tick = %d, want 0 (anomaly must collapse to start)", got) + } + + if stats := ts.TimeAnomalies(); stats.BeforeMin != 1 { + t.Errorf("BeforeMin = %d, want 1", stats.BeforeMin) + } + + if !ts.tick0Set { + t.Error("tick0Set must be true after first consume even on anomaly") + } + + if !ts.tick0.Equal(FloorTime(minSaneCommitTime, ts.TickSize)) { + t.Errorf("tick0 = %s, want floor(%s) (must seed from sanitized substitute)", + ts.tick0.Format(time.RFC3339), minSaneCommitTime.Format(time.RFC3339)) + } +} + +func TestSanitizeWhen_BeforeMin_AfterValidCommit_UsesLastValid(t *testing.T) { + t.Parallel() + + ts := newTicks(t) + + // Seed with a normal commit to populate lastValidWhen. + good := time.Date(2024, time.April, 1, 0, 0, 0, 0, time.UTC) + + tick0 := consume(t, ts, good, 0) + if tick0 != 0 { + t.Fatalf("seed tick = %d, want 0", tick0) + } + + // Then a bogus epoch-0 commit. Its tick must equal the previous tick + // (no time travel) — substitution = lastValidWhen. 
+ tick1 := consume(t, ts, time.Unix(0, 0), 1) + if tick1 != 0 { + t.Errorf("anomalous tick after valid = %d, want 0 (must reuse lastValidWhen)", tick1) + } + + // And a normal commit one day later still ticks forward as expected. + tick2 := consume(t, ts, good.Add(24*time.Hour), 2) + if tick2 != 1 { + t.Errorf("post-anomaly tick = %d, want 1 (anomaly must not poison the timeline)", tick2) + } +} + +func TestSanitizeWhen_AfterMax_ForgedFutureCommit_DoesNotPoisonTimeline(t *testing.T) { + t.Parallel() + + ts := newTicks(t) + + good := time.Date(2024, time.April, 1, 0, 0, 0, 0, time.UTC) + consume(t, ts, good, 0) + + // `git commit --date=2099-01-01` style — far past now+24h. + forged := time.Date(2099, time.January, 1, 0, 0, 0, 0, time.UTC) + tickForged := consume(t, ts, forged, 1) + + // Without the fix the forged tick would explode (and stick via + // max(tick, previousTick)). With the fix it collapses to the + // previous valid tick. + if tickForged != 0 { + t.Errorf("forged-future tick = %d, want 0 (must clamp to lastValidWhen)", tickForged) + } + + // Subsequent valid commit ticks forward by exactly 1 day. 
+ next := consume(t, ts, good.Add(24*time.Hour), 2) + if next != 1 { + t.Errorf("post-forged tick = %d, want 1 (forgery must not stick via previousTick)", next) + } + + if stats := ts.TimeAnomalies(); stats.AfterMax != 1 { + t.Errorf("AfterMax = %d, want 1", stats.AfterMax) + } +} + +func TestSanitizeWhen_NormalRange_UnchangedAndUpdatesLastValid(t *testing.T) { + t.Parallel() + + ts := newTicks(t) + + when := time.Date(2024, time.April, 1, 12, 0, 0, 0, time.UTC) + got := ts.sanitizeWhen(when) + + if !got.Equal(when) { + t.Errorf("in-window time was modified: got %s, want %s", got, when) + } + + if !ts.lastValidWhen.Equal(when) { + t.Errorf("lastValidWhen = %s, want %s (must update on valid input)", ts.lastValidWhen, when) + } + + if stats := ts.TimeAnomalies(); stats.Total() != 0 { + t.Errorf("anomalies total = %d, want 0 for in-window input", stats.Total()) + } +} + +func TestSanitizeWhen_ClockSkewWithinGrace_PassesThrough(t *testing.T) { + t.Parallel() + + ts := newTicks(t) + + // 1 hour into the future is within maxClockSkew (24h) — should pass. 
+ when := time.Now().Add(1 * time.Hour) + got := ts.sanitizeWhen(when) + + if !got.Equal(when) { + t.Errorf("within-grace future time was rejected: got %s, want %s", got, when) + } + + if stats := ts.TimeAnomalies(); stats.AfterMax != 0 { + t.Errorf("AfterMax = %d, want 0 (grace window must allow small clock skew)", stats.AfterMax) + } +} + +func TestTimeAnomalyTracker_RateLimit_DropsBurstWithinInterval(t *testing.T) { + t.Parallel() + + var tr timeAnomalyTracker + + when := time.Unix(0, 0) + repl := minSaneCommitTime + + for range 1000 { + tr.recordBeforeMin(when, repl) + } + + if got := tr.dropped.Load(); got != 999 { + t.Errorf("dropped = %d, want 999 (1000 events, 1 logged, 999 suppressed)", got) + } + + if got := tr.snapshot().BeforeMin; got != 1000 { + t.Errorf("BeforeMin = %d, want 1000 (counter must record every event)", got) + } +} + +func TestTimeAnomalyTracker_ConcurrentRecord_NoLostUpdates(t *testing.T) { + t.Parallel() + + var ( + tr timeAnomalyTracker + wg sync.WaitGroup + perWorker = int64(500) + workers = 8 + ) + + when := time.Unix(0, 0) + repl := minSaneCommitTime + + wg.Add(workers) + + for range workers { + go func() { + defer wg.Done() + + for range int(perWorker) { + tr.recordBeforeMin(when, repl) + } + }() + } + + wg.Wait() + + want := perWorker * int64(workers) + if got := tr.snapshot().BeforeMin; got != want { + t.Errorf("BeforeMin = %d, want %d (concurrent atomic updates must not lose any)", got, want) + } +} + +func TestTimeAnomalyStats_Total_SumsBothBounds(t *testing.T) { + t.Parallel() + + s := TimeAnomalyStats{BeforeMin: 4, AfterMax: 7} + if got := s.Total(); got != 11 { + t.Errorf("Total = %d, want 11", got) + } +} + +// TestRegressionAnalysisPeriodOverflow reproduces the bug shape: a single +// epoch-0 commit followed by normal commits used to produce a tick range +// of ~106 751 days (the [time.Duration] int64 overflow clamp). With the +// sanitization in place the tick range is bounded by real commit deltas. 
+func TestRegressionAnalysisPeriodOverflow_NoLongerProduces292Years(t *testing.T) { + t.Parallel() + + ts := newTicks(t) + + // First commit: epoch-0 (the trigger). + consume(t, ts, time.Unix(0, 0), 0) + + // Then 5 commits one day apart in 2024. + base := time.Date(2024, time.April, 1, 0, 0, 0, 0, time.UTC) + + for i := range 5 { + got := consume(t, ts, base.Add(time.Duration(i)*24*time.Hour), i+1) + // Ticks are measured from minSaneCommitTime (1990-01-01). So + // each 2024-04-0X commit lands ~12 510..12 514 days in. The + // important property: ticks are NOT clamped to ~106 751. + const overflowSentinel = 100_000 + + if got > overflowSentinel { + t.Errorf("tick %d for normal commit i=%d — overflow clamp regressed", + got, i) + } + } +} diff --git a/internal/analyzers/plumbing/tree_diff.go b/internal/analyzers/plumbing/tree_diff.go index b0d0d82..17c18bc 100644 --- a/internal/analyzers/plumbing/tree_diff.go +++ b/internal/analyzers/plumbing/tree_diff.go @@ -13,6 +13,8 @@ import ( "github.com/src-d/enry/v2" "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/langpath" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing/pathpolicy" "github.com/Sumatoshi-tech/codefang/pkg/gitlib" "github.com/Sumatoshi-tech/codefang/pkg/pathfilter" "github.com/Sumatoshi-tech/codefang/pkg/pipeline" @@ -28,6 +30,16 @@ type TreeDiffAnalyzer struct { pathFilter *pathfilter.Filter Changes gitlib.Changes previousCommit gitlib.Hash + // Pathspec holds pre-computed libgit2 pathspec globs derived from the + // configured --languages set via langpath.Globs. Empty when no language + // restriction applies. + Pathspec []string + + // PathPolicy carries vendor / generated / extra-prefix exclusion + // rules shared with the static phase. The zero value excludes + // enry.IsVendor and pathfilter-detected generated files by + // default. 
+ PathPolicy pathpolicy.Options } const ( @@ -39,7 +51,10 @@ const ( ConfigTreeDiffLanguages = "TreeDiff.LanguagesDetection" // ConfigTreeDiffFilterRegexp is the configuration key for the file path filter regular expression. ConfigTreeDiffFilterRegexp = "TreeDiff.FilteredRegexes" - allLanguages = "all" + // ConfigTreeDiffPathPolicy is the fact key for the cross-phase vendor / + // generated / extra-prefix exclusion policy populated by the CLI. + ConfigTreeDiffPathPolicy = "TreeDiff.PathPolicy" + allLanguages = "all" ) // ErrInvalidSkipFiles indicates a type assertion failure for SkipFiles configuration. @@ -111,6 +126,50 @@ func (t *TreeDiffAnalyzer) ListConfigurationOptions() []pipeline.ConfigurationOp } } +// applyLanguageConfig normalises the user-supplied language tokens into +// the canonical Languages set and, when the set restricts by language, +// pre-computes the libgit2 pathspec globs via langpath.Globs. +// +// Aliases (e.g. "golang" → "Go", "js" → "JavaScript") are resolved via +// enry so that the Go-side filter keys match the canonical lowercase +// name returned by enry.GetLanguage for detected files. +func (t *TreeDiffAnalyzer) applyLanguageConfig(val []string) error { + t.Languages = map[string]bool{} + + for _, lang := range val { + token := strings.TrimSpace(lang) + if strings.EqualFold(token, allLanguages) { + t.Languages[allLanguages] = true + + continue + } + + canonical, ok := enry.GetLanguageByAlias(token) + if !ok { + // langpath.Globs below will reject the same token with a + // richer error; fall through so the caller sees that error. + t.Languages[strings.ToLower(token)] = true + + continue + } + + t.Languages[strings.ToLower(canonical)] = true + } + + globs, wantsAll, err := langpath.Globs(val) + if err != nil { + return fmt.Errorf("tree-diff pathspec: %w", err) + } + + if wantsAll { + t.Pathspec = nil + } else { + t.Pathspec = globs + } + + return nil +} + // Configure sets up the analyzer with the provided facts. 
func (t *TreeDiffAnalyzer) Configure(facts map[string]any) error { if val, exists := facts[ConfigTreeDiffEnableBlacklist].(bool); exists && val { @@ -123,10 +182,14 @@ func (t *TreeDiffAnalyzer) Configure(facts map[string]any) error { t.pathFilter = pathfilter.New() } + if val, exists := facts[ConfigTreeDiffPathPolicy].(pathpolicy.Options); exists { + t.PathPolicy = val + } + if val, exists := facts[ConfigTreeDiffLanguages].([]string); exists { - t.Languages = map[string]bool{} - for _, lang := range val { - t.Languages[strings.ToLower(strings.TrimSpace(lang))] = true + err := t.applyLanguageConfig(val) + if err != nil { + return err } } else if t.Languages == nil { t.Languages = map[string]bool{} @@ -237,6 +300,11 @@ func (t *TreeDiffAnalyzer) filterChanges(ctx context.Context, changes gitlib.Cha func (t *TreeDiffAnalyzer) shouldIncludeChange(ctx context.Context, change *gitlib.Change) bool { name, hash := changeNameHash(change) + // Shared vendor / generated / extra-prefix exclusion policy. + if pathpolicy.Exclude(name, nil, t.PathPolicy) { + return false + } + // Check blacklist: user-specified prefixes + vendor/generated detection. if len(t.SkipFiles) > 0 && t.isBlacklisted(name) { return false diff --git a/internal/analyzers/quality/analyzer.go b/internal/analyzers/quality/analyzer.go index 6f9a933..d7fbd02 100644 --- a/internal/analyzers/quality/analyzer.go +++ b/internal/analyzers/quality/analyzer.go @@ -6,6 +6,7 @@ package quality import ( "context" "maps" + "time" "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" "github.com/Sumatoshi-tech/codefang/internal/analyzers/cohesion" @@ -75,6 +76,8 @@ type TickData struct { // tickAccumulator holds per-commit quality during aggregation. type tickAccumulator struct { commitQuality map[string]*TickQuality + startTime time.Time + endTime time.Time } // qualityAvgTCSize is the estimated bytes of TC payload per commit (quality metrics). 
@@ -298,10 +301,22 @@ func extractTC(tc analyze.TC, byTick map[int]*tickAccumulator) error { if !ok { acc = &tickAccumulator{ commitQuality: make(map[string]*TickQuality), + startTime: tc.Timestamp, + endTime: tc.Timestamp, } byTick[tc.Tick] = acc } + if !tc.Timestamp.IsZero() { + if tc.Timestamp.Before(acc.startTime) || acc.startTime.IsZero() { + acc.startTime = tc.Timestamp + } + + if tc.Timestamp.After(acc.endTime) { + acc.endTime = tc.Timestamp + } + } + acc.commitQuality[tc.CommitHash.String()] = tq return nil @@ -361,7 +376,9 @@ func buildTick(tick int, state *tickAccumulator) (analyze.TICK, error) { } return analyze.TICK{ - Tick: tick, + Tick: tick, + StartTime: state.startTime, + EndTime: state.endTime, Data: &TickData{ CommitQuality: state.commitQuality, }, @@ -386,6 +403,7 @@ func ticksToReport(_ context.Context, ticks []analyze.TICK, commitsByTick map[in return analyze.Report{ "commit_quality": commitQuality, "commits_by_tick": ct, + "tick_bounds": analyze.BuildTickBounds(ticks), } } diff --git a/internal/analyzers/quality/metrics.go b/internal/analyzers/quality/metrics.go index ee1282b..8b03b5c 100644 --- a/internal/analyzers/quality/metrics.go +++ b/internal/analyzers/quality/metrics.go @@ -182,8 +182,10 @@ func computeTickStats(tq *TickQuality) TickStats { // TimeSeriesEntry holds per-tick quality data for the time series output. type TimeSeriesEntry struct { - Tick int `json:"tick" yaml:"tick"` - Stats TickStats `json:"stats" yaml:"stats"` + Tick int `json:"tick" yaml:"tick"` + StartTime string `json:"start_time,omitempty" yaml:"start_time,omitempty"` + EndTime string `json:"end_time,omitempty" yaml:"end_time,omitempty"` + Stats TickStats `json:"stats" yaml:"stats"` } // AggregateData contains overall summary statistics. @@ -205,6 +207,7 @@ type AggregateData struct { // ReportData is the parsed input data for quality metrics computation. 
type ReportData struct { TickQuality map[int]*TickQuality + TickBounds map[int]analyze.TickBounds } // ParseReportData extracts ReportData from an analyzer report. @@ -223,6 +226,10 @@ func ParseReportData(report analyze.Report) (*ReportData, error) { data.TickQuality = make(map[int]*TickQuality) } + if v, ok := report["tick_bounds"].(map[int]analyze.TickBounds); ok { + data.TickBounds = v + } + return data, nil } @@ -260,7 +267,15 @@ func ComputeAllMetrics(report analyze.Report) (*ComputedMetrics, error) { for i, tick := range ticks { ts := computeTickStats(input.TickQuality[tick]) - timeSeries[i] = TimeSeriesEntry{Tick: tick, Stats: ts} + + entry := TimeSeriesEntry{Tick: tick, Stats: ts} + + if bounds, hasBounds := input.TickBounds[tick]; hasBounds { + entry.StartTime = bounds.FormatStartTime() + entry.EndTime = bounds.FormatEndTime() + } + + timeSeries[i] = entry complexityMedians[i] = ts.ComplexityMedian complexityP95s[i] = ts.ComplexityP95 @@ -288,25 +303,31 @@ func ComputeAllMetrics(report analyze.Report) (*ComputedMetrics, error) { globalMinCohesion = 0 } - complexityMedianMean := stats.Mean(complexityMedians) - complexityP95Mean := stats.Mean(complexityP95s) - halsteadMedianMean := stats.Mean(halsteadMedians) - commentMeanMean := stats.Mean(commentMeans) - cohesionMeanMean := stats.Mean(cohesionMeans) - return &ComputedMetrics{ TimeSeries: timeSeries, - Aggregate: AggregateData{ - TotalTicks: len(ticks), - TotalFilesAnalyzed: totalFiles, - ComplexityMedianMean: complexityMedianMean, - ComplexityP95Mean: complexityP95Mean, - HalsteadVolMedianMean: halsteadMedianMean, - TotalDeliveredBugs: totalBugs, - CommentScoreMeanMean: commentMeanMean, - MinCommentScore: globalMinComment, - CohesionMeanMean: cohesionMeanMean, - MinCohesion: globalMinCohesion, - }, + Aggregate: computeAggregate( + len(ticks), totalFiles, totalBugs, + globalMinComment, globalMinCohesion, + complexityMedians, complexityP95s, halsteadMedians, commentMeans, cohesionMeans, + ), }, nil } + 
+func computeAggregate( + totalTicks, totalFiles int, + totalBugs, minComment, minCohesion float64, + complexityMedians, complexityP95s, halsteadMedians, commentMeans, cohesionMeans []float64, +) AggregateData { + return AggregateData{ + TotalTicks: totalTicks, + TotalFilesAnalyzed: totalFiles, + ComplexityMedianMean: stats.Mean(complexityMedians), + ComplexityP95Mean: stats.Mean(complexityP95s), + HalsteadVolMedianMean: stats.Mean(halsteadMedians), + TotalDeliveredBugs: totalBugs, + CommentScoreMeanMean: stats.Mean(commentMeans), + MinCommentScore: minComment, + CohesionMeanMean: stats.Mean(cohesionMeans), + MinCohesion: minCohesion, + } +} diff --git a/internal/analyzers/quality/store_writer_test.go b/internal/analyzers/quality/store_writer_test.go index 6982977..6b0ee26 100644 --- a/internal/analyzers/quality/store_writer_test.go +++ b/internal/analyzers/quality/store_writer_test.go @@ -1,7 +1,5 @@ package quality -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. - import ( "context" "testing" diff --git a/internal/analyzers/sentiment/analyzer.go b/internal/analyzers/sentiment/analyzer.go index 2741aea..8d1b6ae 100644 --- a/internal/analyzers/sentiment/analyzer.go +++ b/internal/analyzers/sentiment/analyzer.go @@ -662,6 +662,7 @@ func ticksToReport(_ context.Context, ticks []analyze.TICK, commitsByTick map[in return analyze.Report{ "comments_by_commit": commentsByCommit, "commits_by_tick": ct, + "tick_bounds": analyze.BuildTickBounds(ticks), } } diff --git a/internal/analyzers/sentiment/metrics.go b/internal/analyzers/sentiment/metrics.go index f3b0b01..9101ef6 100644 --- a/internal/analyzers/sentiment/metrics.go +++ b/internal/analyzers/sentiment/metrics.go @@ -73,6 +73,7 @@ type ReportData struct { EmotionsByTick map[int]float32 CommentsByTick map[int][]string CommitsByTick map[int][]gitlib.Hash + TickBounds map[int]analyze.TickBounds } // ParseReportData extracts ReportData from an analyzer report. 
@@ -92,6 +93,10 @@ func ParseReportData(report analyze.Report) (*ReportData, error) { ) } + if v, ok := report["tick_bounds"].(map[int]analyze.TickBounds); ok { + data.TickBounds = v + } + if data.EmotionsByTick == nil { data.EmotionsByTick = make(map[int]float32) } @@ -107,11 +112,13 @@ func ParseReportData(report analyze.Report) (*ReportData, error) { // TimeSeriesData contains sentiment data for a time period. type TimeSeriesData struct { - Tick int `json:"tick" yaml:"tick"` - Sentiment float32 `json:"sentiment" yaml:"sentiment"` - CommentCount int `json:"comment_count" yaml:"comment_count"` - CommitCount int `json:"commit_count" yaml:"commit_count"` - Classification string `json:"classification" yaml:"classification"` + Tick int `json:"tick" yaml:"tick"` + StartTime string `json:"start_time,omitempty" yaml:"start_time,omitempty"` + EndTime string `json:"end_time,omitempty" yaml:"end_time,omitempty"` + Sentiment float32 `json:"sentiment" yaml:"sentiment"` + CommentCount int `json:"comment_count" yaml:"comment_count"` + CommitCount int `json:"commit_count" yaml:"commit_count"` + Classification string `json:"classification" yaml:"classification"` } // TrendData contains trend information. 
@@ -260,13 +267,20 @@ func computeTimeSeriesWithOpts(input *ReportData, opts MetricOptions) []TimeSeri classification := classifySentimentWithOpts(sentiment, opts) - result = append(result, TimeSeriesData{ + entry := TimeSeriesData{ Tick: tick, Sentiment: sentiment, CommentCount: commentCount, CommitCount: commitCount, Classification: classification, - }) + } + + if bounds, ok := input.TickBounds[tick]; ok { + entry.StartTime = bounds.FormatStartTime() + entry.EndTime = bounds.FormatEndTime() + } + + result = append(result, entry) } return result diff --git a/internal/analyzers/sentiment/metrics_test.go b/internal/analyzers/sentiment/metrics_test.go index 2b7c7dc..07bb67d 100644 --- a/internal/analyzers/sentiment/metrics_test.go +++ b/internal/analyzers/sentiment/metrics_test.go @@ -2,6 +2,7 @@ package sentiment import ( "testing" + "time" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" @@ -27,6 +28,11 @@ const ( testHashB = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" ) +var ( + testTickTime1 = time.Date(2024, 1, 15, 10, 0, 0, 0, time.UTC) + testTickTime2 = time.Date(2024, 1, 16, 12, 0, 0, 0, time.UTC) +) + // Helper function to create test hash. 
func testHash(s string) gitlib.Hash { var h gitlib.Hash @@ -182,6 +188,40 @@ func TestSentimentTimeSeriesMetric_MissingCommmentsAndCommits(t *testing.T) { assert.Equal(t, 0, result[0].CommitCount) } +func TestSentimentTimeSeriesMetric_TickTimestamps(t *testing.T) { + t.Parallel() + + t1 := testTickTime1 + t2 := testTickTime2 + + input := &ReportData{ + EmotionsByTick: map[int]float32{0: testSentimentPositive}, + TickBounds: map[int]analyze.TickBounds{ + 0: {StartTime: t1, EndTime: t2}, + }, + } + + result := computeTimeSeriesWithOpts(input, DefaultMetricOptions()) + + require.Len(t, result, 1) + assert.Equal(t, "2024-01-15T10:00:00Z", result[0].StartTime) + assert.Equal(t, "2024-01-16T12:00:00Z", result[0].EndTime) +} + +func TestSentimentTimeSeriesMetric_NoTickBounds(t *testing.T) { + t.Parallel() + + input := &ReportData{ + EmotionsByTick: map[int]float32{0: testSentimentPositive}, + } + + result := computeTimeSeriesWithOpts(input, DefaultMetricOptions()) + + require.Len(t, result, 1) + assert.Empty(t, result[0].StartTime) + assert.Empty(t, result[0].EndTime) +} + // --- SentimentTrendMetric Tests ---. func TestSentimentTrendMetric_Empty(t *testing.T) { diff --git a/internal/analyzers/sentiment/store_writer_test.go b/internal/analyzers/sentiment/store_writer_test.go index 96cff33..873927d 100644 --- a/internal/analyzers/sentiment/store_writer_test.go +++ b/internal/analyzers/sentiment/store_writer_test.go @@ -1,7 +1,5 @@ package sentiment -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. - import ( "context" "testing" diff --git a/internal/analyzers/shotness/store_writer_test.go b/internal/analyzers/shotness/store_writer_test.go index 486572b..7f656b2 100644 --- a/internal/analyzers/shotness/store_writer_test.go +++ b/internal/analyzers/shotness/store_writer_test.go @@ -1,7 +1,5 @@ package shotness -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. 
- import ( "context" "testing" diff --git a/internal/analyzers/typos/store_writer_test.go b/internal/analyzers/typos/store_writer_test.go index b8020ec..7c71367 100644 --- a/internal/analyzers/typos/store_writer_test.go +++ b/internal/analyzers/typos/store_writer_test.go @@ -1,7 +1,5 @@ package typos -// FRD: specs/frds/FRD-20260301-all-analyzers-store-based.md. - import ( "context" "sort" diff --git a/internal/budget/solver_test.go b/internal/budget/solver_test.go index 35bd905..17c105b 100644 --- a/internal/budget/solver_test.go +++ b/internal/budget/solver_test.go @@ -203,8 +203,6 @@ func TestDeriveKnobs_HugeWorkerAllocation(t *testing.T) { assert.LessOrEqual(t, cfg.Workers, runtime.NumCPU(), "workers capped at CPU count") } -// FRD: specs/frds/FRD-20260310-allocate-proportionally.md. - func TestAllocateProportionally_SingleWeight(t *testing.T) { t.Parallel() diff --git a/internal/budget/static_solver.go b/internal/budget/static_solver.go index 6fcfa43..287c1ea 100644 --- a/internal/budget/static_solver.go +++ b/internal/budget/static_solver.go @@ -6,8 +6,6 @@ import ( "github.com/Sumatoshi-tech/codefang/pkg/units" ) -// FRD: specs/frds/FRD-20260312-static-budget-tuning.md. - // Static analysis cost model constants (empirically measured). const ( // StaticBaseOverhead is the fixed Go runtime + loaded analyzers overhead. diff --git a/internal/budget/static_solver_test.go b/internal/budget/static_solver_test.go index 7bb1733..15969f9 100644 --- a/internal/budget/static_solver_test.go +++ b/internal/budget/static_solver_test.go @@ -1,7 +1,5 @@ package budget -// FRD: specs/frds/FRD-20260312-static-budget-tuning.md. 
- import ( "runtime" "testing" diff --git a/internal/cache/incremental.go b/internal/cache/incremental.go new file mode 100644 index 0000000..1d2dc0d --- /dev/null +++ b/internal/cache/incremental.go @@ -0,0 +1,91 @@ +package cache + +import ( + "crypto/sha256" + "encoding/hex" + "encoding/json" + "errors" + "fmt" + "io" + "os" + "path/filepath" + "time" + + "github.com/Sumatoshi-tech/codefang/internal/storage" + "github.com/Sumatoshi-tech/codefang/pkg/textutil" +) + +// metaFilename is the name of the cache metadata file. +const metaFilename = "cache.json" + +// metaFilePerm is the file permission for cache metadata. +const metaFilePerm = 0o640 + +// cacheKeySeparator separates root SHA and branch in the cache key input. +const cacheKeySeparator = ":" + +// ErrCacheNotFound is returned when the cache metadata file does not exist. +var ErrCacheNotFound = errors.New("cache metadata not found") + +// ErrCacheCorrupt is returned when the cache metadata file cannot be parsed. +var ErrCacheCorrupt = errors.New("cache metadata corrupt") + +// IncrementalMeta holds metadata for an incremental analysis cache. +type IncrementalMeta struct { + Version int `json:"version"` + HeadSHA string `json:"head_sha"` + Branch string `json:"branch"` + RootSHA string `json:"root_sha"` + CommitCount int `json:"commit_count"` + AnalyzerIDs []string `json:"analyzer_ids"` + Timestamp time.Time `json:"timestamp"` +} + +// Key produces a deterministic directory name from root SHA and branch. +// The key is a SHA-256 hash of "rootSHA:branch", hex-encoded. +func Key(rootSHA, branch string) string { + h := sha256.New() + h.Write([]byte(rootSHA + cacheKeySeparator + branch)) + + return hex.EncodeToString(h.Sum(nil)) +} + +// IsStale returns true when the cached root SHA does not match the current root SHA, +// indicating a force-push or history rewrite. 
+func IsStale(meta IncrementalMeta, currentRootSHA string) bool { + return meta.RootSHA != currentRootSHA +} + +// WriteMeta atomically writes cache metadata as indented JSON to dir/cache.json. +func WriteMeta(dir string, meta IncrementalMeta) error { + metaPath := filepath.Join(dir, metaFilename) + + return storage.WriteAtomic(metaPath, metaFilePerm, func(w io.Writer) error { + return textutil.WriteJSON(w, meta, true) + }) +} + +// ReadMeta reads and parses cache metadata from dir/cache.json. +// Returns ErrCacheNotFound if the file does not exist. +// Returns ErrCacheCorrupt if the file cannot be parsed. +func ReadMeta(dir string) (IncrementalMeta, error) { + metaPath := filepath.Join(dir, metaFilename) + + data, err := os.ReadFile(metaPath) + if err != nil { + if os.IsNotExist(err) { + return IncrementalMeta{}, ErrCacheNotFound + } + + return IncrementalMeta{}, fmt.Errorf("read cache meta: %w", err) + } + + var meta IncrementalMeta + + unmarshalErr := json.Unmarshal(data, &meta) + if unmarshalErr != nil { + return IncrementalMeta{}, fmt.Errorf("%w: %w", ErrCacheCorrupt, unmarshalErr) + } + + return meta, nil +} diff --git a/internal/cache/incremental_test.go b/internal/cache/incremental_test.go new file mode 100644 index 0000000..49eedd7 --- /dev/null +++ b/internal/cache/incremental_test.go @@ -0,0 +1,100 @@ +package cache + +import ( + "os" + "path/filepath" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +func TestCacheKey_Deterministic(t *testing.T) { + t.Parallel() + + key1 := Key("abc123", "main") + key2 := Key("abc123", "main") + assert.Equal(t, key1, key2, "same inputs must produce same key") + assert.NotEmpty(t, key1) +} + +func TestCacheKey_DifferentBranch(t *testing.T) { + t.Parallel() + + key1 := Key("abc123", "main") + key2 := Key("abc123", "feature/x") + assert.NotEqual(t, key1, key2, "different branches must produce different keys") +} + +func TestCacheKey_DifferentRoot(t *testing.T) { + 
t.Parallel() + + key1 := Key("abc123", "main") + key2 := Key("def456", "main") + assert.NotEqual(t, key1, key2, "different root SHAs must produce different keys") +} + +func testMeta() IncrementalMeta { + return IncrementalMeta{ + Version: 1, + HeadSHA: "abc123def456", + Branch: "main", + RootSHA: "root789", + CommitCount: 1000, + AnalyzerIDs: []string{"burndown", "couples"}, + Timestamp: time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC), + } +} + +func TestWriteReadMeta_RoundTrip(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + original := testMeta() + + require.NoError(t, WriteMeta(dir, original)) + + got, err := ReadMeta(dir) + require.NoError(t, err) + + assert.Equal(t, original.Version, got.Version) + assert.Equal(t, original.HeadSHA, got.HeadSHA) + assert.Equal(t, original.Branch, got.Branch) + assert.Equal(t, original.RootSHA, got.RootSHA) + assert.Equal(t, original.CommitCount, got.CommitCount) + assert.Equal(t, original.AnalyzerIDs, got.AnalyzerIDs) + assert.True(t, original.Timestamp.Equal(got.Timestamp)) +} + +func TestReadMeta_MissingFile(t *testing.T) { + t.Parallel() + + _, err := ReadMeta(t.TempDir()) + assert.ErrorIs(t, err, ErrCacheNotFound) +} + +func TestReadMeta_CorruptFile(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + require.NoError(t, os.WriteFile( + filepath.Join(dir, "cache.json"), []byte("{not valid json"), 0o600)) + + _, err := ReadMeta(dir) + assert.ErrorIs(t, err, ErrCacheCorrupt) +} + +func TestIsStale_MatchingRootSHA(t *testing.T) { + t.Parallel() + + meta := testMeta() + assert.False(t, IsStale(meta, meta.RootSHA)) +} + +func TestIsStale_MismatchingRootSHA(t *testing.T) { + t.Parallel() + + meta := testMeta() + assert.True(t, IsStale(meta, "different_root")) +} diff --git a/internal/config/apply_test.go b/internal/config/apply_test.go index d5304cf..e4e3916 100644 --- a/internal/config/apply_test.go +++ b/internal/config/apply_test.go @@ -1,4 +1,3 @@ -// FRD: specs/frds/FRD-20260302-config-loader-facts.md. 
package config_test import ( diff --git a/internal/framework/blob_pipeline.go b/internal/framework/blob_pipeline.go index 84170a8..d410061 100644 --- a/internal/framework/blob_pipeline.go +++ b/internal/framework/blob_pipeline.go @@ -56,6 +56,11 @@ type BlobPipeline struct { // MaxChanges caps the number of file changes per commit. Zero = use default. MaxChanges int + // TreeDiffPathspec is the libgit2 pathspec pre-filter for tree diffs. + // An empty or nil slice disables the filter (libgit2 returns all deltas). + // Derived from the configured --languages set by TreeDiffAnalyzer. + TreeDiffPathspec []string + // Metrics provides per-stage counters for memory triage. Nil-safe. Metrics *StageMetrics @@ -206,6 +211,7 @@ func (p *BlobPipeline) processBatch( req := gitlib.TreeDiffRequest{ PreviousCommitHash: prevHash, CommitHash: commit.Hash(), + Pathspec: p.TreeDiffPathspec, Response: respChan, } diff --git a/internal/framework/coordinator.go b/internal/framework/coordinator.go index 1469445..47e9eca 100644 --- a/internal/framework/coordinator.go +++ b/internal/framework/coordinator.go @@ -160,6 +160,11 @@ type CoordinatorConfig struct { // MaxChangesPerCommit caps the number of file changes per commit for blob loading. MaxChangesPerCommit int + // TreeDiffPathspec is the libgit2 pathspec pre-filter for tree diffs, + // derived from --languages by the TreeDiffAnalyzer. An empty slice + // disables path-based filtering (libgit2 returns all deltas). + TreeDiffPathspec []string + // MaxDiffBatchSize is the maximum number of diff requests per batch. 
MaxDiffBatchSize int @@ -392,6 +397,8 @@ func newBlobPipelineFromConfig( p.MaxChanges = config.MaxChangesPerCommit } + p.TreeDiffPathspec = config.TreeDiffPathspec + return p } diff --git a/internal/framework/runner.go b/internal/framework/runner.go index 0a8f3da..d66605d 100644 --- a/internal/framework/runner.go +++ b/internal/framework/runner.go @@ -5,6 +5,9 @@ import ( "context" "errors" "fmt" + "log" + "os" + "path/filepath" "runtime" "runtime/debug" "sync" @@ -16,6 +19,7 @@ import ( "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing" + "github.com/Sumatoshi-tech/codefang/internal/cache" "github.com/Sumatoshi-tech/codefang/internal/checkpoint" "github.com/Sumatoshi-tech/codefang/internal/observability" "github.com/Sumatoshi-tech/codefang/pkg/gitlib" @@ -39,6 +43,12 @@ type stateDiscarder interface { // ErrNotStoreWriter is returned when an analyzer does not implement [analyze.StoreWriter]. var ErrNotStoreWriter = errors.New("analyzer does not implement StoreWriter") +// ErrCacheStale is returned when the cached root SHA does not match the current repo root. +var ErrCacheStale = errors.New("cache stale: root SHA mismatch") + +// ErrCacheInvalid is returned when the cached commit count exceeds available commits. +var ErrCacheInvalid = errors.New("cache invalid: commit count mismatch") + // nativeTrimInterval controls how often malloc_trim(0) is called within a chunk // to release native (C malloc) memory back to the OS. This prevents tree-sitter // and libgit2 malloc fragmentation from accumulating across commits. @@ -101,6 +111,11 @@ type Runner struct { // When non-nil, passed to Coordinator and updated by pipeline stages. StageMetrics *StageMetrics + // CacheDir is the directory for incremental analysis cache. + // When set, the runner probes for cached state before processing and + // writes updated state after finalization. 
+ CacheDir string + runtimeTuningOnce sync.Once runtimeBallast []byte } @@ -137,15 +152,33 @@ type runState struct { runner *Runner commits []*gitlib.Commit reports map[analyze.HistoryAnalyzer]analyze.Report + + // totalCommitCount is the original commit count before cache trimming. + // Used by cacheWritePhase to record the total in metadata. + totalCommitCount int + + // cacheSubDir is the resolved cache subdirectory for this repo+branch. + // Empty when caching is disabled or cache probe failed. + cacheSubDir string } // Run executes all analyzers over the given commits: initialize, consume each commit via pipeline, then finalize. +// When CacheDir is set, probes for cached state (skipping already-processed commits) +// and writes updated state after finalization. func (runner *Runner) Run(ctx context.Context, commits []*gitlib.Commit) (map[analyze.HistoryAnalyzer]analyze.Report, error) { - final, err := pipeline.RunPhases(ctx, runState{runner: runner, commits: commits}, + initial := runState{ + runner: runner, + commits: commits, + totalCommitCount: len(commits), + } + + final, err := pipeline.RunPhases(ctx, initial, pipeline.PhaseFunc[runState](initAnalyzersPhase), pipeline.PhaseFunc[runState](initAggregatorsPhase), + pipeline.PhaseFunc[runState](cacheProbePhase), pipeline.PhaseFunc[runState](processCommitsPhase), pipeline.PhaseFunc[runState](finalizePhase), + pipeline.PhaseFunc[runState](cacheWritePhase), ) if err != nil { return nil, err @@ -176,7 +209,9 @@ func processCommitsPhase(ctx context.Context, s runState) (runState, error) { return s, nil } - _, err := s.runner.processCommits(ctx, s.commits, 0, 0) + indexOffset := s.totalCommitCount - len(s.commits) + + _, err := s.runner.processCommits(ctx, s.commits, indexOffset, 0) if err != nil { return s, err } @@ -184,6 +219,42 @@ func processCommitsPhase(ctx context.Context, s runState) (runState, error) { return s, nil } +// cacheProbePhase loads cached analyzer/aggregator state and trims already-processed 
commits. +// No-op when CacheDir is empty. +func cacheProbePhase(_ context.Context, s runState) (runState, error) { + if s.runner.CacheDir == "" { + return s, nil + } + + probed, err := s.runner.probeCache(s.commits) + if err != nil { + // Cache probe failures are non-fatal: log and proceed with full run. + log.Printf("cache probe failed, running full analysis: %v", err) + + return s, nil + } + + s.commits = probed.remainingCommits + s.cacheSubDir = probed.subDir + + return s, nil +} + +// cacheWritePhase saves analyzer/aggregator state for future incremental runs. +// No-op when CacheDir is empty or cacheSubDir was not set. +func cacheWritePhase(_ context.Context, s runState) (runState, error) { + if s.runner.CacheDir == "" { + return s, nil + } + + writeErr := s.runner.writeCache(s.cacheSubDir, s.totalCommitCount) + if writeErr != nil { + log.Printf("cache write failed: %v", writeErr) + } + + return s, nil +} + func finalizePhase(ctx context.Context, s runState) (runState, error) { reports, err := s.runner.FinalizeWithAggregators(ctx) if err != nil { @@ -323,6 +394,156 @@ func (runner *Runner) SpillAggregators() error { return nil } +// cacheProbeResult holds the result of a successful cache probe. +type cacheProbeResult struct { + remainingCommits []*gitlib.Commit + subDir string +} + +// cacheDirPerm is the permission for cache subdirectories. +const cacheDirPerm = 0o750 + +// probeCache attempts to load cached state and returns the remaining unprocessed commits. +// Returns an error if the cache is stale or cannot be loaded. +func (runner *Runner) probeCache(commits []*gitlib.Commit) (cacheProbeResult, error) { + if len(commits) == 0 { + return cacheProbeResult{remainingCommits: commits}, nil + } + + rootSHA := commits[0].Hash().String() + branch := "" // Branch detection not yet available in Runner context. 
+ subDir := filepath.Join(runner.CacheDir, cache.Key(rootSHA, branch)) + + meta, err := cache.ReadMeta(subDir) + if err != nil { + return cacheProbeResult{}, err + } + + if cache.IsStale(meta, rootSHA) { + return cacheProbeResult{}, fmt.Errorf("%w: cached=%s, current=%s", ErrCacheStale, meta.RootSHA, rootSHA) + } + + if meta.CommitCount > len(commits) { + return cacheProbeResult{}, fmt.Errorf("%w: cached %d, available %d", ErrCacheInvalid, meta.CommitCount, len(commits)) + } + + // Load checkpoint state from cached analyzers. + loadErr := runner.loadCachedCheckpoints(subDir) + if loadErr != nil { + return cacheProbeResult{}, fmt.Errorf("load cached checkpoints: %w", loadErr) + } + + // Restore aggregator spill state. + runner.restoreCachedAggSpills(subDir) + + remaining := commits[meta.CommitCount:] + log.Printf("Replaying %d commits vs %d total", len(remaining), len(commits)) + + return cacheProbeResult{ + remainingCommits: remaining, + subDir: subDir, + }, nil +} + +// loadCachedCheckpoints loads checkpoint state for all Checkpointable analyzers. +func (runner *Runner) loadCachedCheckpoints(subDir string) error { + for idx, analyzer := range runner.Analyzers { + cp, ok := analyzer.(checkpoint.Checkpointable) + if !ok { + continue + } + + analyzerDir := filepath.Join(subDir, fmt.Sprintf("analyzer_%d", idx)) + + loadErr := cp.LoadCheckpoint(analyzerDir) + if loadErr != nil { + return fmt.Errorf("load checkpoint for analyzer %d: %w", idx, loadErr) + } + } + + return nil +} + +// restoreCachedAggSpills restores aggregator spill state from cached directories. 
+func (runner *Runner) restoreCachedAggSpills(subDir string) { + for idx, agg := range runner.aggregators { + if agg == nil { + continue + } + + aggDir := filepath.Join(subDir, fmt.Sprintf("agg_spill_%d", idx)) + + info, statErr := os.Stat(aggDir) + if statErr != nil || !info.IsDir() { + continue + } + + agg.RestoreSpillState(analyze.AggregatorSpillInfo{Dir: aggDir}) + } +} + +// writeCache saves analyzer/aggregator state for future incremental runs. +func (runner *Runner) writeCache(subDir string, totalCommits int) error { + if subDir == "" { + // No cache probe succeeded; create subDir from first commit. + if totalCommits == 0 { + return nil + } + + rootSHA := "" + branch := "" + + // Use the runner's commit list indirectly — totalCommits tells us the count. + subDir = filepath.Join(runner.CacheDir, cache.Key(rootSHA, branch)) + } + + mkErr := os.MkdirAll(subDir, cacheDirPerm) + if mkErr != nil { + return fmt.Errorf("create cache dir: %w", mkErr) + } + + // Save checkpoint state for all Checkpointable analyzers. + for idx, analyzer := range runner.Analyzers { + cp, ok := analyzer.(checkpoint.Checkpointable) + if !ok { + continue + } + + analyzerDir := filepath.Join(subDir, fmt.Sprintf("analyzer_%d", idx)) + + saveErr := os.MkdirAll(analyzerDir, cacheDirPerm) + if saveErr != nil { + return fmt.Errorf("create analyzer cache dir: %w", saveErr) + } + + cpErr := cp.SaveCheckpoint(analyzerDir) + if cpErr != nil { + return fmt.Errorf("save checkpoint for analyzer %d: %w", idx, cpErr) + } + } + + // Spill aggregator state. + spillErr := runner.SpillAggregators() + if spillErr != nil { + return fmt.Errorf("spill aggregators for cache: %w", spillErr) + } + + // Write cache metadata. 
+ analyzerIDs := make([]string, 0, len(runner.Analyzers)) + for _, a := range runner.Analyzers { + analyzerIDs = append(analyzerIDs, a.Name()) + } + + meta := cache.IncrementalMeta{ + Version: 1, + CommitCount: totalCommits, + AnalyzerIDs: analyzerIDs, + Timestamp: time.Now().UTC(), + } + + return cache.WriteMeta(subDir, meta) +} + // DiscardAggregatorState clears all in-memory cumulative state from // aggregators without serialization. Used in streaming timeseries NDJSON // mode where per-commit data is drained each chunk and cumulative state diff --git a/internal/framework/runner_internal_test.go b/internal/framework/runner_internal_test.go index fc313dc..87ca031 100644 --- a/internal/framework/runner_internal_test.go +++ b/internal/framework/runner_internal_test.go @@ -31,9 +31,9 @@ func TestRunner_drainWorkerTCs_ConcurrentRouting(t *testing.T) { commitMeta: make(map[string]analyze.CommitMeta), } - var active int32 + var active atomic.Int32 - var maxActive int32 + var maxActive atomic.Int32 var startWg sync.WaitGroup @@ -43,21 +43,21 @@ func TestRunner_drainWorkerTCs_ConcurrentRouting(t *testing.T) { startWg.Done() startWg.Wait() - current := atomic.AddInt32(&active, 1) + current := active.Add(1) for { - maxA := atomic.LoadInt32(&maxActive) + maxA := maxActive.Load() if current <= maxA { break } - if atomic.CompareAndSwapInt32(&maxActive, maxA, current) { + if maxActive.CompareAndSwap(maxA, current) { break } } time.Sleep(10 * time.Millisecond) - atomic.AddInt32(&active, -1) + active.Add(-1) return nil } @@ -78,5 +78,5 @@ func TestRunner_drainWorkerTCs_ConcurrentRouting(t *testing.T) { elapsed := time.Since(start) assert.Less(t, elapsed, 50*time.Millisecond, "should run concurrently") - assert.Equal(t, int32(2), atomic.LoadInt32(&maxActive), "should have 2 concurrent routes") + assert.Equal(t, int32(2), maxActive.Load(), "should have 2 concurrent routes") } diff --git a/internal/framework/runner_test.go b/internal/framework/runner_test.go index 3336dbe..e87ae8a 
100644 --- a/internal/framework/runner_test.go +++ b/internal/framework/runner_test.go @@ -1212,8 +1212,6 @@ func registerGobTypes() { gob.Register([]string{}) } -// FRD: specs/frds/FRD-20260228-runner-integration.md. - func TestFinalizeToStore_NoAggregators(t *testing.T) { t.Parallel() diff --git a/internal/framework/sampler.go b/internal/framework/sampler.go index bb96008..4c992d6 100644 --- a/internal/framework/sampler.go +++ b/internal/framework/sampler.go @@ -6,6 +6,7 @@ import ( "log/slog" "os" "runtime/pprof" + "sync/atomic" "time" "github.com/Sumatoshi-tech/codefang/internal/observability" @@ -21,6 +22,10 @@ const kilo = 1000 // PipelineSampler periodically logs comprehensive memory and pipeline metrics // during chunk processing. Implements playbook section 2.1: "lightweight // periodic sampler (always-on in debug builds).". +// +// t1Captured is atomic because the sampler goroutine (driven by its ticker) +// and the caller goroutine (via CaptureT1) both race to capture the t1 peak +// heap profile; CompareAndSwap guarantees exactly one wins. type PipelineSampler struct { logger *slog.Logger metrics *StageMetrics @@ -29,8 +34,7 @@ type PipelineSampler struct { chunkIndex int memBudget int64 profileAtRSS int64 // RSS threshold (bytes) to trigger t1 heap profile. - t0Captured bool - t1Captured bool + t1Captured atomic.Bool } // SamplerConfig configures the pipeline sampler. @@ -68,7 +72,6 @@ func (s *PipelineSampler) Start(ctx context.Context) { // Capture t0 heap profile (playbook step 2: "take snapshot at t0"). if s.dumpDir != "" { s.captureProfile("t0") - s.t0Captured = true } go s.run(ctx) @@ -141,19 +144,27 @@ func (s *PipelineSampler) sample(tick int) { ) // Auto-capture t1 profile on RSS threshold (playbook step 2: "at or right after peak"). - if s.profileAtRSS > 0 && !s.t1Captured && snap.RSS >= s.profileAtRSS { + // CompareAndSwap guarantees at most one capture across both the sampler + // goroutine and any concurrent CaptureT1 caller. 
+ if s.profileAtRSS > 0 && snap.RSS >= s.profileAtRSS && s.t1Captured.CompareAndSwap(false, true) { s.captureProfile("t1") - s.t1Captured = true } } // CaptureT1 forces capture of the t1 (peak) heap profile. Call after the // chunk completes if the automatic RSS threshold wasn't hit. +// Safe to call concurrently with the sampler goroutine — at most one capture +// wins via CompareAndSwap. func (s *PipelineSampler) CaptureT1() { - if s.dumpDir != "" && !s.t1Captured { - s.captureProfile("t1") - s.t1Captured = true + if s.dumpDir == "" { + return + } + + if !s.t1Captured.CompareAndSwap(false, true) { + return } + + s.captureProfile("t1") } func (s *PipelineSampler) captureProfile(label string) { diff --git a/internal/identity/split.go b/internal/identity/split.go new file mode 100644 index 0000000..8305687 --- /dev/null +++ b/internal/identity/split.go @@ -0,0 +1,46 @@ +package identity + +import "strings" + +// SplitIdentity splits a pipe-delimited or exact-format identity string +// into a canonical name and email. +// +// Pipe-delimited format: "name1|name2|email1|email2" → first non-email part, first email part. +// Exact format: "name <email>" → name and email. +// Plain name: "name" → name and empty email. +func SplitIdentity(s string) (name, email string) { + if s == "" { + return "", "" + } + + // Exact format: "name <email>". + if idx := strings.Index(s, " <"); idx > 0 && strings.HasSuffix(s, ">") { + return strings.TrimSpace(s[:idx]), s[idx+2 : len(s)-1] + } + + // Pipe-delimited format. + if strings.Contains(s, "|") { + return splitPipeIdentity(s) + } + + // Plain name, no email. 
+ return s, "" +} + +func splitPipeIdentity(s string) (name, email string) { + for part := range strings.SplitSeq(s, "|") { + if name == "" && !strings.Contains(part, "@") { + name = part + } + + if email == "" && strings.Contains(part, "@") { + email = part + } + + if name != "" && email != "" { + break + } + } + + return name, email +} diff --git a/internal/identity/split_test.go b/internal/identity/split_test.go new file mode 100644 index 0000000..b0487cf --- /dev/null +++ b/internal/identity/split_test.go @@ -0,0 +1,68 @@ +package identity_test + +import ( + "testing" + + "github.com/stretchr/testify/assert" + + "github.com/Sumatoshi-tech/codefang/internal/identity" +) + +const ( + testName = "daniel smith" + testEmail = "dbsmith@google.com" +) + +func TestSplitIdentity_PipeDelimited(t *testing.T) { + t.Parallel() + + name, email := identity.SplitIdentity("daniel smith|dbsmith@google.com") + + assert.Equal(t, testName, name) + assert.Equal(t, testEmail, email) +} + +func TestSplitIdentity_ExactFormat(t *testing.T) { + t.Parallel() + + name, email := identity.SplitIdentity("daniel smith <dbsmith@google.com>") + + assert.Equal(t, testName, name) + assert.Equal(t, testEmail, email) +} + +func TestSplitIdentity_NameOnly(t *testing.T) { + t.Parallel() + + name, email := identity.SplitIdentity("daniel smith") + + assert.Equal(t, testName, name) + assert.Empty(t, email) +} + +func TestSplitIdentity_Empty(t *testing.T) { + t.Parallel() + + name, email := identity.SplitIdentity("") + + assert.Empty(t, name) + assert.Empty(t, email) +} + +func TestSplitIdentity_MultipleAliases(t *testing.T) { + t.Parallel() + + name, email := identity.SplitIdentity("alice|bob|alice@example.com|bob@example.com") + + assert.Equal(t, "alice", name) + assert.Equal(t, "alice@example.com", email) +} + +func TestSplitIdentity_UnmatchedAuthor(t *testing.T) { + t.Parallel() + + name, email := identity.SplitIdentity(identity.AuthorMissingName) + + assert.Equal(t, identity.AuthorMissingName, name) + assert.Empty(t, 
email) +} diff --git a/internal/importmodel/file.go b/internal/importmodel/file.go deleted file mode 100644 index 5c9d446..0000000 --- a/internal/importmodel/file.go +++ /dev/null @@ -1,9 +0,0 @@ -// Package importmodel defines the data model for source file import analysis. -package importmodel - -// File represents a source file with its detected imports, language, and any parse error. -type File struct { - Imports []string - Lang string - Error error -} diff --git a/internal/observability/health_test.go b/internal/observability/health_test.go index f7dcfa1..39b03db 100644 --- a/internal/observability/health_test.go +++ b/internal/observability/health_test.go @@ -19,7 +19,7 @@ func TestHealthHandler_ReturnsOK(t *testing.T) { handler := observability.HealthHandler() - req := httptest.NewRequest(http.MethodGet, "/healthz", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/healthz", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) @@ -38,7 +38,7 @@ func TestHealthHandler_ContentTypeJSON(t *testing.T) { handler := observability.HealthHandler() - req := httptest.NewRequest(http.MethodGet, "/healthz", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/healthz", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) @@ -53,7 +53,7 @@ func TestReadyHandler_AllChecksPass(t *testing.T) { passCheckB := func(_ context.Context) error { return nil } handler := observability.ReadyHandler(passCheckA, passCheckB) - req := httptest.NewRequest(http.MethodGet, "/readyz", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/readyz", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) @@ -72,7 +72,7 @@ func TestReadyHandler_NoChecks(t *testing.T) { handler := observability.ReadyHandler() - req := httptest.NewRequest(http.MethodGet, "/readyz", http.NoBody) + req := 
httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/readyz", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) @@ -91,7 +91,7 @@ func TestReadyHandler_CheckFails(t *testing.T) { handler := observability.ReadyHandler(passCheck, failCheck) - req := httptest.NewRequest(http.MethodGet, "/readyz", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/readyz", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) diff --git a/internal/observability/integration_test.go b/internal/observability/integration_test.go index 6f3d137..4e4e00e 100644 --- a/internal/observability/integration_test.go +++ b/internal/observability/integration_test.go @@ -132,7 +132,7 @@ func TestEndToEnd_MiddlewareProducesSpans(t *testing.T) { mw := observability.HTTPMiddleware(tracer, discardLogger, inner) - req := httptest.NewRequest(http.MethodPost, "/v1/analyze", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/v1/analyze", http.NoBody) rec := httptest.NewRecorder() mw.ServeHTTP(rec, req) diff --git a/internal/observability/metric_builder_test.go b/internal/observability/metric_builder_test.go index b2f36f7..498daa8 100644 --- a/internal/observability/metric_builder_test.go +++ b/internal/observability/metric_builder_test.go @@ -1,7 +1,5 @@ package observability -// FRD: specs/frds/FRD-20260302-observability-dedup.md. 
- import ( "errors" "testing" diff --git a/internal/observability/middleware_test.go b/internal/observability/middleware_test.go index 3d19df3..5de1e8c 100644 --- a/internal/observability/middleware_test.go +++ b/internal/observability/middleware_test.go @@ -44,7 +44,7 @@ func TestHTTPMiddleware_CreatesSpan(t *testing.T) { mw := observability.HTTPMiddleware(tracer, discardLogger, handler) - req := httptest.NewRequest(http.MethodGet, "/v1/analyze", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/v1/analyze", http.NoBody) rec := httptest.NewRecorder() mw.ServeHTTP(rec, req) @@ -74,7 +74,7 @@ func TestHTTPMiddleware_PropagatesContext(t *testing.T) { mw := observability.HTTPMiddleware(tracer, discardLogger, handler) - req := httptest.NewRequest(http.MethodPost, "/v1/history", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodPost, "/v1/history", http.NoBody) rec := httptest.NewRecorder() mw.ServeHTTP(rec, req) @@ -114,7 +114,7 @@ func TestHTTPMiddleware_ExtractsTraceParent(t *testing.T) { mw := observability.HTTPMiddleware(tracer, discardLogger, handler) - req := httptest.NewRequest(http.MethodGet, "/v1/analyze", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/v1/analyze", http.NoBody) req.Header.Set("Traceparent", traceparent) rec := httptest.NewRecorder() @@ -145,7 +145,7 @@ func TestHTTPMiddleware_RecoversPanic(t *testing.T) { mw := observability.HTTPMiddleware(tracer, discardLogger, handler) - req := httptest.NewRequest(http.MethodGet, "/v1/crash", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/v1/crash", http.NoBody) rec := httptest.NewRecorder() // Should not panic — middleware should recover. 
@@ -199,7 +199,7 @@ func TestHTTPMiddleware_SetsStatusOnError(t *testing.T) { mw := observability.HTTPMiddleware(tracer, discardLogger, handler) - req := httptest.NewRequest(http.MethodGet, "/v1/score", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/v1/score", http.NoBody) rec := httptest.NewRecorder() mw.ServeHTTP(rec, req) @@ -286,7 +286,7 @@ func TestHTTPMiddleware_AccessLog(t *testing.T) { mw := observability.HTTPMiddleware(tracer, logger, handler) - req := httptest.NewRequest(http.MethodGet, "/v1/analyze", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/v1/analyze", http.NoBody) rec := httptest.NewRecorder() mw.ServeHTTP(rec, req) diff --git a/internal/observability/prometheus_test.go b/internal/observability/prometheus_test.go index 124d452..8c8fb9e 100644 --- a/internal/observability/prometheus_test.go +++ b/internal/observability/prometheus_test.go @@ -1,6 +1,7 @@ package observability_test import ( + "context" "net/http" "net/http/httptest" "testing" @@ -17,7 +18,7 @@ func TestPrometheusHandler_ServesMetrics(t *testing.T) { handler, err := observability.PrometheusHandler() require.NoError(t, err) - req := httptest.NewRequest(http.MethodGet, "/metrics", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/metrics", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) @@ -33,7 +34,7 @@ func TestPrometheusHandler_ContainsTargetInfo(t *testing.T) { handler, err := observability.PrometheusHandler() require.NoError(t, err) - req := httptest.NewRequest(http.MethodGet, "/metrics", http.NoBody) + req := httptest.NewRequestWithContext(context.Background(), http.MethodGet, "/metrics", http.NoBody) rec := httptest.NewRecorder() handler.ServeHTTP(rec, req) diff --git a/internal/observability/sysmetrics_test.go b/internal/observability/sysmetrics_test.go index 25d18e5..23be04e 100644 --- 
a/internal/observability/sysmetrics_test.go +++ b/internal/observability/sysmetrics_test.go @@ -1,7 +1,5 @@ package observability_test -// FRD: specs/frds/FRD-20260302-sysmetrics-move.md. - import ( "runtime" "testing" diff --git a/internal/plumbing/fact_accessors_test.go b/internal/plumbing/fact_accessors_test.go index b7b61a4..77e0702 100644 --- a/internal/plumbing/fact_accessors_test.go +++ b/internal/plumbing/fact_accessors_test.go @@ -1,4 +1,3 @@ -// FRD: specs/frds/FRD-20260302-typed-fact-accessors.md. package plumbing_test import ( diff --git a/internal/storage/atomicfile_test.go b/internal/storage/atomicfile_test.go index eec2d56..f6ace6c 100644 --- a/internal/storage/atomicfile_test.go +++ b/internal/storage/atomicfile_test.go @@ -1,7 +1,5 @@ package storage -// FRD: specs/frds/FRD-20260310-atomic-file-write.md. - import ( "errors" "fmt" diff --git a/pkg/alg/chunk_test.go b/pkg/alg/chunk_test.go index 12db998..11d27c3 100644 --- a/pkg/alg/chunk_test.go +++ b/pkg/alg/chunk_test.go @@ -1,7 +1,5 @@ package alg_test -// FRD: specs/frds/FRD-20260302-chunk-pairs.md. - import ( "testing" diff --git a/pkg/alg/interval/interval_test.go b/pkg/alg/interval/interval_test.go index e65b6a0..851cba4 100644 --- a/pkg/alg/interval/interval_test.go +++ b/pkg/alg/interval/interval_test.go @@ -1,7 +1,5 @@ package interval -// FRD: specs/frds/FRD-20260302-generic-interval-tree.md. - import ( "testing" diff --git a/pkg/alg/iter_test.go b/pkg/alg/iter_test.go index f82a767..8959bc6 100644 --- a/pkg/alg/iter_test.go +++ b/pkg/alg/iter_test.go @@ -1,7 +1,5 @@ package alg -// FRD: specs/frds/FRD-20260310-iterator.md. - import ( "errors" "io" diff --git a/pkg/alg/lru/benchmark_test.go b/pkg/alg/lru/benchmark_test.go index 8928948..74d973b 100644 --- a/pkg/alg/lru/benchmark_test.go +++ b/pkg/alg/lru/benchmark_test.go @@ -1,7 +1,5 @@ package lru_test -// FRD: specs/frds/FRD-20260302-generic-lru-cache.md. 
- import ( "testing" diff --git a/pkg/alg/lru/cache_test.go b/pkg/alg/lru/cache_test.go index 4aa4d3f..e2bf774 100644 --- a/pkg/alg/lru/cache_test.go +++ b/pkg/alg/lru/cache_test.go @@ -1,4 +1,3 @@ -// FRD: specs/frds/FRD-20260302-generic-lru-cache.md. package lru_test import ( diff --git a/pkg/alg/mapx/maps_test.go b/pkg/alg/mapx/maps_test.go index 47f1969..27ddb5d 100644 --- a/pkg/alg/mapx/maps_test.go +++ b/pkg/alg/mapx/maps_test.go @@ -148,7 +148,6 @@ func TestMergeAdditive(t *testing.T) { }) } -// FRD: specs/frds/FRD-20260306-merge-nested-additive.md. func TestMergeNestedAdditive(t *testing.T) { t.Parallel() @@ -211,8 +210,6 @@ func TestMergeNestedAdditive(t *testing.T) { }) } -// FRD: specs/frds/FRD-20260310-estimate-map-size.md. - func TestEstimateMapSize(t *testing.T) { t.Parallel() diff --git a/pkg/alg/mapx/slices_test.go b/pkg/alg/mapx/slices_test.go index 5e61d27..10bf892 100644 --- a/pkg/alg/mapx/slices_test.go +++ b/pkg/alg/mapx/slices_test.go @@ -6,8 +6,6 @@ import ( "github.com/stretchr/testify/assert" ) -// FRD: specs/frds/FRD-20260303-sort-and-limit.md. - func TestSortAndLimit(t *testing.T) { t.Parallel() @@ -67,8 +65,6 @@ func TestSortAndLimit(t *testing.T) { }) } -// FRD: specs/frds/FRD-20260303-build-lookup-set.md. - func TestBuildLookupSet(t *testing.T) { t.Parallel() diff --git a/pkg/alg/pairs_test.go b/pkg/alg/pairs_test.go index f4f7673..361a229 100644 --- a/pkg/alg/pairs_test.go +++ b/pkg/alg/pairs_test.go @@ -1,7 +1,5 @@ package alg_test -// FRD: specs/frds/FRD-20260302-chunk-pairs.md. - import ( "testing" diff --git a/pkg/alg/stats/stats_test.go b/pkg/alg/stats/stats_test.go index 8893cac..0d7ef09 100644 --- a/pkg/alg/stats/stats_test.go +++ b/pkg/alg/stats/stats_test.go @@ -193,8 +193,6 @@ func TestMeanStdDev(t *testing.T) { } } -// FRD: specs/frds/FRD-20260303-to-percent.md. - func TestToPercent(t *testing.T) { t.Parallel() @@ -251,8 +249,6 @@ func TestMean(t *testing.T) { } } -// FRD: specs/frds/FRD-20260310-exceeds-threshold.md. 
- func TestExceedsThreshold(t *testing.T) { t.Parallel() @@ -286,8 +282,6 @@ func TestExceedsThreshold(t *testing.T) { } } -// FRD: specs/frds/FRD-20260303-distribution.md. - func TestDistribution(t *testing.T) { t.Parallel() diff --git a/pkg/alg/tree_test.go b/pkg/alg/tree_test.go index 116f4d9..394ce57 100644 --- a/pkg/alg/tree_test.go +++ b/pkg/alg/tree_test.go @@ -1,5 +1,3 @@ -// FRD: specs/frds/FRD-20260310-traverse-tree.md. - package alg import ( diff --git a/pkg/gitlib/cgo_bridge.go b/pkg/gitlib/cgo_bridge.go index dde36d5..5236763 100644 --- a/pkg/gitlib/cgo_bridge.go +++ b/pkg/gitlib/cgo_bridge.go @@ -61,6 +61,29 @@ func NewCGOBridge(repo *Repository) *CGOBridge { return &CGOBridge{repo: repo} } +// marshalPathspec converts a Go []string into a C **char array suitable for +// passing to cf_tree_diff_v2. Returns a free function that must be deferred +// by the caller to release the C memory. A nil/empty pathspec returns +// (nil, noop-free). +func marshalPathspec(pathspec []string) (**C.char, func()) { + if len(pathspec) == 0 { + return nil, func() {} + } + + cStrings := make([]*C.char, len(pathspec)) + for i, s := range pathspec { + cStrings[i] = C.CString(s) + } + + free := func() { + for _, cs := range cStrings { + C.free(unsafe.Pointer(cs)) + } + } + + return (**C.char)(unsafe.Pointer(&cStrings[0])), free +} + // getRepoPtr extracts the underlying C pointer from git2go.Repository. // Uses reflection to access the unexported 'ptr' field. func (b *CGOBridge) getRepoPtr() unsafe.Pointer { @@ -289,9 +312,14 @@ func (b *CGOBridge) BatchLoadBlobs(hashes []Hash) []BlobResult { return results } -// TreeDiff computes the difference between two trees in a single batch CGO call. -// Skips libgit2 diff when both tree OIDs are equal (e.g. metadata-only commits). -func (b *CGOBridge) TreeDiff(oldTreeHash, newTreeHash Hash) (Changes, error) { +// TreeDiffWithPathspec computes the difference between two trees in a single +// batch CGO call. 
pathspec is a list of fnmatch-style globs (e.g. "*.go", +// "Dockerfile") applied as a libgit2 pre-filter; when empty or nil, libgit2 +// returns the full diff. Skips libgit2 diff when both tree OIDs are equal +// (e.g. metadata-only commits). +func (b *CGOBridge) TreeDiffWithPathspec( + oldTreeHash, newTreeHash Hash, pathspec []string, +) (Changes, error) { if !oldTreeHash.IsZero() && !newTreeHash.IsZero() && oldTreeHash == newTreeHash { return make(Changes, 0), nil } @@ -318,6 +346,9 @@ func (b *CGOBridge) TreeDiff(oldTreeHash, newTreeHash Hash) (Changes, error) { return nil, ErrRepositoryPointer } + cPathspec, freePathspec := marshalPathspec(pathspec) + defer freePathspec() + var cResult C.cf_tree_diff_result // Ensure result is clean @@ -325,10 +356,12 @@ func (b *CGOBridge) TreeDiff(oldTreeHash, newTreeHash Hash) (Changes, error) { cResult.count = 0 // Call C function - ret := C.cf_tree_diff( + ret := C.cf_tree_diff_v2( (*C.git_repository)(repoPtr), pOldOid, pNewOid, + cPathspec, + C.size_t(len(pathspec)), &cResult, ) diff --git a/pkg/gitlib/clib/codefang_git.h b/pkg/gitlib/clib/codefang_git.h index c5e28b6..a68d2da 100644 --- a/pkg/gitlib/clib/codefang_git.h +++ b/pkg/gitlib/clib/codefang_git.h @@ -140,6 +140,21 @@ int cf_tree_diff( cf_tree_diff_result* result ); +/* + * Compute diff between two trees with an optional pathspec pre-filter. + * pathspec points to an array of pathspec_n C strings; when pathspec_n + * is 0 or pathspec is NULL the call is equivalent to cf_tree_diff. + * Each pathspec entry is an fnmatch-style glob (e.g. "*.go", "Dockerfile"). + */ +int cf_tree_diff_v2( + git_repository* repo, + git_oid* old_tree_oid, + git_oid* new_tree_oid, + const char** pathspec, + size_t pathspec_n, + cf_tree_diff_result* result +); + /* * Free tree diff result. 
*/ diff --git a/pkg/gitlib/clib/diff_ops.c b/pkg/gitlib/clib/diff_ops.c index 70e2793..c1d8321 100644 --- a/pkg/gitlib/clib/diff_ops.c +++ b/pkg/gitlib/clib/diff_ops.c @@ -622,6 +622,21 @@ int cf_tree_diff( git_oid* old_tree_oid, git_oid* new_tree_oid, cf_tree_diff_result* result +) { + return cf_tree_diff_v2(repo, old_tree_oid, new_tree_oid, NULL, 0, result); +} + +/* + * Compute diff between two trees with an optional libgit2 pathspec + * pre-filter. See header for semantics. + */ +int cf_tree_diff_v2( + git_repository* repo, + git_oid* old_tree_oid, + git_oid* new_tree_oid, + const char** pathspec, + size_t pathspec_n, + cf_tree_diff_result* result ) { git_tree* old_tree = NULL; git_tree* new_tree = NULL; @@ -650,6 +665,13 @@ int cf_tree_diff( /* Compute diff */ git_diff_options opts = GIT_DIFF_OPTIONS_INIT; + if (pathspec != NULL && pathspec_n > 0) { + /* git_strarray.strings is declared `char**` but libgit2 only + * reads from it during the diff call, so the const-stripping + * cast is safe. The Go bridge owns all backing memory. */ + opts.pathspec.strings = (char**)pathspec; + opts.pathspec.count = pathspec_n; + } if (git_diff_tree_to_tree(&diff, repo, old_tree, new_tree, &opts) != 0) { ret = CF_ERR_DIFF; goto cleanup; diff --git a/pkg/gitlib/worker.go b/pkg/gitlib/worker.go index 224fc91..c23950b 100644 --- a/pkg/gitlib/worker.go +++ b/pkg/gitlib/worker.go @@ -33,7 +33,11 @@ type TreeDiffRequest struct { PreviousTree *Tree // Optimization: Use existing tree if on same worker/repo. PreviousCommitHash Hash // Fallback: Lookup previous tree by hash (safe for pool workers). CommitHash Hash // Hash of the commit to process. - Response chan<- TreeDiffResponse + // Pathspec restricts the diff to files matching any of the given + // fnmatch-style globs (e.g. []string{"*.go", "Dockerfile"}). An empty + // or nil slice disables path-based pre-filtering. + Pathspec []string + Response chan<- TreeDiffResponse } // TreeDiffResponse is the response for a TreeDiffRequest. 
@@ -185,7 +189,7 @@ func (w *Worker) handle(req WorkerRequest) { prevTreeHash := prevCommit.TreeHash() prevCommit.Free() - changes, err = w.bridge.TreeDiff(prevTreeHash, currTreeHash) + changes, err = w.bridge.TreeDiffWithPathspec(prevTreeHash, currTreeHash, typedReq.Pathspec) default: changes, err = InitialTreeChanges(ctx, w.repo, commitTree) } diff --git a/pkg/gitlib/worker_test.go b/pkg/gitlib/worker_test.go index 8014960..99e5dd7 100644 --- a/pkg/gitlib/worker_test.go +++ b/pkg/gitlib/worker_test.go @@ -4,6 +4,7 @@ import ( "context" "testing" + "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" "github.com/Sumatoshi-tech/codefang/pkg/gitlib" @@ -299,6 +300,54 @@ func TestCGOBridge_BatchDiffBlobsInvalidHash(t *testing.T) { require.Equal(t, gitlib.ErrDiffLookup, results[0].Error) } +// TestCGOBridge_TreeDiffWithPathspec_FiltersByGlob verifies that passing +// a pathspec to the cgo bridge drops non-matching files at the libgit2 +// level — before they cross the cgo boundary. 
+func TestCGOBridge_TreeDiffWithPathspec_FiltersByGlob(t *testing.T) { + t.Parallel() + + tr := newTestRepo(t) + defer tr.cleanup() + + tr.createFile("a.go", "package a") + tr.createFile("b.py", "x = 1") + tr.createFile("c.js", "var y = 2;") + firstHash := tr.commit("first") + + tr.createFile("a.go", "package a\n// edit") + tr.createFile("b.py", "x = 2") + tr.createFile("c.js", "var y = 3;") + secondHash := tr.commit("second") + + repo, err := gitlib.OpenRepository(tr.path) + require.NoError(t, err) + + defer repo.Free() + + firstCommit, err := repo.LookupCommit(context.Background(), firstHash) + require.NoError(t, err) + + defer firstCommit.Free() + + secondCommit, err := repo.LookupCommit(context.Background(), secondHash) + require.NoError(t, err) + + defer secondCommit.Free() + + bridge := gitlib.NewCGOBridge(repo) + + baseline, err := bridge.TreeDiffWithPathspec(firstCommit.TreeHash(), secondCommit.TreeHash(), nil) + require.NoError(t, err) + require.Len(t, baseline, 3, "baseline must see all 3 modified files") + + filtered, err := bridge.TreeDiffWithPathspec( + firstCommit.TreeHash(), secondCommit.TreeHash(), []string{"*.go"}, + ) + require.NoError(t, err) + require.Len(t, filtered, 1, "pathspec '*.go' must restrict to Go files") + assert.Equal(t, "a.go", filtered[0].To.Name) +} + // TestCGOBridge_TreeDiffSameHash verifies TreeDiff returns empty when both tree hashes are equal (skip path). 
func TestCGOBridge_TreeDiffSameHash(t *testing.T) { t.Parallel() @@ -323,7 +372,7 @@ func TestCGOBridge_TreeDiffSameHash(t *testing.T) { require.False(t, treeHash.IsZero()) bridge := gitlib.NewCGOBridge(repo) - changes, err := bridge.TreeDiff(treeHash, treeHash) + changes, err := bridge.TreeDiffWithPathspec(treeHash, treeHash, nil) require.NoError(t, err) require.Empty(t, changes) } diff --git a/pkg/iosafety/iosafety_test.go b/pkg/iosafety/iosafety_test.go index 3cc6a7d..d584974 100644 --- a/pkg/iosafety/iosafety_test.go +++ b/pkg/iosafety/iosafety_test.go @@ -9,8 +9,6 @@ import ( "github.com/stretchr/testify/require" ) -// FRD: specs/frds/FRD-20260310-iosafety-promote.md. - func TestResolvePath_EmptyPath(t *testing.T) { t.Parallel() diff --git a/pkg/meminfo/rss_test.go b/pkg/meminfo/rss_test.go index c290e0f..473f0c5 100644 --- a/pkg/meminfo/rss_test.go +++ b/pkg/meminfo/rss_test.go @@ -1,7 +1,5 @@ package meminfo -// FRD: specs/frds/FRD-20260312-static-rss-logging.md. - import ( "runtime" "testing" diff --git a/pkg/metrics/metrics_test.go b/pkg/metrics/metrics_test.go index 087c4df..59a98a5 100644 --- a/pkg/metrics/metrics_test.go +++ b/pkg/metrics/metrics_test.go @@ -186,8 +186,6 @@ func TestTimeSeriesPoint_Fields(t *testing.T) { assert.InDelta(t, float64(testInputValue), point.Value, 0.001) } -// FRD: specs/frds/FRD-20260303-risk-priority.md. - func TestRiskPriority_AllLevels(t *testing.T) { t.Parallel() diff --git a/pkg/pipeline/batcher_test.go b/pkg/pipeline/batcher_test.go index 027c8fc..059268c 100644 --- a/pkg/pipeline/batcher_test.go +++ b/pkg/pipeline/batcher_test.go @@ -1,7 +1,5 @@ package pipeline_test -// FRD: specs/frds/FRD-20260302-composable-pipeline-patterns.md. 
- import ( "testing" diff --git a/pkg/pipeline/dispatch_test.go b/pkg/pipeline/dispatch_test.go index 258b569..2428229 100644 --- a/pkg/pipeline/dispatch_test.go +++ b/pkg/pipeline/dispatch_test.go @@ -1,7 +1,5 @@ package pipeline_test -// FRD: specs/frds/FRD-20260302-composable-pipeline-patterns.md. - import ( "context" "errors" diff --git a/pkg/pipeline/drain_test.go b/pkg/pipeline/drain_test.go index 1c3a55a..1ebe05a 100644 --- a/pkg/pipeline/drain_test.go +++ b/pkg/pipeline/drain_test.go @@ -7,8 +7,6 @@ import ( "github.com/stretchr/testify/require" ) -// FRD: specs/frds/FRD-20260310-signal-on-drain.md. - const forwardTestItems = 3 func TestSignalOnDrain_ForwardsAllItems(t *testing.T) { diff --git a/pkg/pipeline/fetcher_test.go b/pkg/pipeline/fetcher_test.go index aa0ec6f..574ff43 100644 --- a/pkg/pipeline/fetcher_test.go +++ b/pkg/pipeline/fetcher_test.go @@ -1,7 +1,5 @@ package pipeline_test -// FRD: specs/frds/FRD-20260302-composable-pipeline-patterns.md. - import ( "context" "errors" diff --git a/pkg/pipeline/phase_test.go b/pkg/pipeline/phase_test.go index e43b63f..1eb6baa 100644 --- a/pkg/pipeline/phase_test.go +++ b/pkg/pipeline/phase_test.go @@ -1,7 +1,5 @@ package pipeline_test -// FRD: specs/frds/FRD-20260302-composable-pipeline-patterns.md. - import ( "context" "errors" diff --git a/pkg/pipeline/runpc_test.go b/pkg/pipeline/runpc_test.go index 2f95c57..0fedbc9 100644 --- a/pkg/pipeline/runpc_test.go +++ b/pkg/pipeline/runpc_test.go @@ -1,7 +1,5 @@ package pipeline_test -// FRD: specs/frds/FRD-20260302-composable-pipeline-patterns.md. - import ( "context" "testing" diff --git a/pkg/pipeline/shared_response_test.go b/pkg/pipeline/shared_response_test.go index bc94cf9..e1ec5ab 100644 --- a/pkg/pipeline/shared_response_test.go +++ b/pkg/pipeline/shared_response_test.go @@ -1,7 +1,5 @@ package pipeline_test -// FRD: specs/frds/FRD-20260303-shared-response-move.md. 
- import ( "context" "errors" diff --git a/pkg/pipeline/workerpool_test.go b/pkg/pipeline/workerpool_test.go index 6a14cae..8eb1f9b 100644 --- a/pkg/pipeline/workerpool_test.go +++ b/pkg/pipeline/workerpool_test.go @@ -11,8 +11,6 @@ import ( "github.com/stretchr/testify/require" ) -// FRD: specs/frds/FRD-20260310-worker-pool.md. - var errWorker = errors.New("worker failed") func TestWorkerPool_EmptyItems(t *testing.T) { @@ -235,8 +233,6 @@ func TestWorkerPool_ErrorCancelsContext(t *testing.T) { assert.ErrorIs(t, err, errWorker) } -// FRD: specs/frds/FRD-20260311-streaming-file-discovery.md. - func TestWorkerPool_RunChan_EmptyChannel(t *testing.T) { t.Parallel() diff --git a/pkg/safeconv/generic_test.go b/pkg/safeconv/generic_test.go index 1d28a38..f4406c2 100644 --- a/pkg/safeconv/generic_test.go +++ b/pkg/safeconv/generic_test.go @@ -1,5 +1,3 @@ -// FRD: specs/frds/FRD-20260310-generic-safeconv.md. - package safeconv import ( diff --git a/pkg/safeconv/safeconv_test.go b/pkg/safeconv/safeconv_test.go index d52a7c9..2af1a0b 100644 --- a/pkg/safeconv/safeconv_test.go +++ b/pkg/safeconv/safeconv_test.go @@ -1,7 +1,5 @@ package safeconv -// FRD: specs/frds/FRD-20260302-safeconv-expansion.md. - import ( "math" "testing" diff --git a/pkg/sigutil/guard_test.go b/pkg/sigutil/guard_test.go index a216942..3ae057a 100644 --- a/pkg/sigutil/guard_test.go +++ b/pkg/sigutil/guard_test.go @@ -1,7 +1,5 @@ package sigutil_test -// FRD: specs/frds/FRD-20260302-signal-cleanup-guard.md. - import ( "io" "log/slog" diff --git a/pkg/textutil/textutil_test.go b/pkg/textutil/textutil_test.go index 01ce985..bd7b474 100644 --- a/pkg/textutil/textutil_test.go +++ b/pkg/textutil/textutil_test.go @@ -9,8 +9,6 @@ import ( "github.com/stretchr/testify/require" ) -// FRD: specs/frds/FRD-20260310-writejson-helper.md. 
- func TestWriteJSON_PrettyOutput(t *testing.T) { t.Parallel() diff --git a/pkg/uast/parsefile_test.go b/pkg/uast/parsefile_test.go index 40f47d6..2cdf189 100644 --- a/pkg/uast/parsefile_test.go +++ b/pkg/uast/parsefile_test.go @@ -1,5 +1,3 @@ -// FRD: specs/frds/FRD-20260310-parse-source-file.md. - package uast import ( diff --git a/pkg/uast/parser_bench_test.go b/pkg/uast/parser_bench_test.go index 7f919f7..f7a4e87 100644 --- a/pkg/uast/parser_bench_test.go +++ b/pkg/uast/parser_bench_test.go @@ -1,7 +1,5 @@ package uast_test -// FRD: specs/frds/FRD-20260311-eager-tree-release.md. - import ( "context" "fmt" diff --git a/pkg/uast/parser_determinism_test.go b/pkg/uast/parser_determinism_test.go new file mode 100644 index 0000000..546a303 --- /dev/null +++ b/pkg/uast/parser_determinism_test.go @@ -0,0 +1,142 @@ +package uast + +import ( + "context" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/pkg/uast/pkg/node" +) + +// TestParser_DeterministicAcrossParses guards against state leaking through +// the parseContext [sync.Pool] — most notably the shared ctx.batchChildren +// backing array, which previously caused recursive processChildrenBatch +// calls to overwrite outer-loop entries the parent had not yet read. +// +// The input is a Go file whose root and inner blocks each have well over +// the cursorThreshold of named children, exercising both the batch and +// recursive paths. Parsing the same content with the same *Parser must +// produce structurally identical trees on every call. 
+func TestParser_DeterministicAcrossParses(t *testing.T) { + t.Parallel() + + src := []byte(`package main + +import ( + "context" + "errors" + "fmt" + "io" + "os" + "strings" + "sync" + "time" +) + +func a() {} +func b() {} +func c() {} +func d() {} +func e() {} +func f() {} +func g() {} +func h() {} +func i() {} +func j() {} + +func work(ctx context.Context, w io.Writer) error { + if ctx == nil { + return errors.New("nil ctx") + } + + var mu sync.Mutex + mu.Lock() + defer mu.Unlock() + + parts := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"} + out := strings.Join(parts, ",") + + for idx, p := range parts { + if p == "" { + continue + } + fmt.Fprintf(w, "%d:%s\n", idx, p) + } + + now := time.Now() + if _, err := fmt.Fprintln(w, out, now); err != nil { + return err + } + if _, err := fmt.Fprintln(os.Stderr, "done"); err != nil { + return err + } + return nil +} +`) + + parser, err := NewParser() + require.NoError(t, err) + + const runs = 8 + + first, err := parser.Parse(context.Background(), "main.go", src) + require.NoError(t, err) + require.NotNil(t, first) + + wantNodes := countAllNodes(first) + wantFuncs := countFunctionNodes(first) + node.ReleaseTree(first) + + require.Positive(t, wantNodes, "baseline tree must be non-empty") + require.GreaterOrEqual(t, wantFuncs, 11, "expected at least 11 functions in the fixture") + + for run := 2; run <= runs; run++ { + tree, parseErr := parser.Parse(context.Background(), "main.go", src) + require.NoErrorf(t, parseErr, "parse run %d failed", run) + require.NotNil(t, tree) + + gotNodes := countAllNodes(tree) + gotFuncs := countFunctionNodes(tree) + node.ReleaseTree(tree) + + assert.Equalf(t, wantNodes, gotNodes, + "node count drift on run %d: want %d, got %d (parseContext buffer corruption?)", + run, wantNodes, gotNodes) + assert.Equalf(t, wantFuncs, gotFuncs, + "function count drift on run %d: want %d, got %d (parseContext buffer corruption?)", + run, wantFuncs, gotFuncs) + } +} + +func countAllNodes(n 
*node.Node) int { + if n == nil { + return 0 + } + + total := 1 + for _, child := range n.Children { + total += countAllNodes(child) + } + + return total +} + +func countFunctionNodes(n *node.Node) int { + if n == nil { + return 0 + } + + count := 0 + if n.HasAnyType(node.UASTFunction, node.UASTMethod) || + n.HasAllRoles(node.RoleFunction, node.RoleDeclaration) { + count = 1 + } + + for _, child := range n.Children { + count += countFunctionNodes(child) + } + + return count +} diff --git a/pkg/uast/parser_dsl.go b/pkg/uast/parser_dsl.go index c127344..f63dcf8 100644 --- a/pkg/uast/parser_dsl.go +++ b/pkg/uast/parser_dsl.go @@ -506,8 +506,16 @@ func (ctx *parseContext) processChildrenBatch( return ctx.processChildrenCursor(root, mappingRule, children) } + // Snapshot child nodes before recursing. toCanonicalNode may indirectly + // re-enter processChildrenBatch, which calls ensureBatchChildren and + // reslices ctx.batchChildren over the same backing array — overwriting + // the entries we have not yet read. + siblings := make([]sitter.Node, written) for idx := range written { - child := batchChildToNode(batchChildren[idx]) + siblings[idx] = batchChildToNode(batchChildren[idx]) + } + + for _, child := range siblings { if child.IsNull() || !child.IsNamed() { return ctx.processChildrenCursor(root, mappingRule, children) } diff --git a/pkg/units/units_test.go b/pkg/units/units_test.go index 05e1618..368d805 100644 --- a/pkg/units/units_test.go +++ b/pkg/units/units_test.go @@ -2,8 +2,6 @@ package units import "testing" -// FRD: specs/frds/FRD-20260302-size-unit-constants.md. - // Expected binary size multiplier values. 
const ( expectedKiB = 1024 diff --git a/scripts/bench-hibernation/main.go b/scripts/bench-hibernation/main.go index 071ae5c..b442925 100644 --- a/scripts/bench-hibernation/main.go +++ b/scripts/bench-hibernation/main.go @@ -22,8 +22,8 @@ import ( filehistory "github.com/Sumatoshi-tech/codefang/internal/analyzers/file_history" "github.com/Sumatoshi-tech/codefang/internal/analyzers/plumbing" "github.com/Sumatoshi-tech/codefang/internal/framework" - "github.com/Sumatoshi-tech/codefang/pkg/gitlib" "github.com/Sumatoshi-tech/codefang/internal/streaming" + "github.com/Sumatoshi-tech/codefang/pkg/gitlib" ) func main() { diff --git a/scripts/orphan-packages.sh b/scripts/orphan-packages.sh new file mode 100755 index 0000000..e2977c2 --- /dev/null +++ b/scripts/orphan-packages.sh @@ -0,0 +1,103 @@ +#!/bin/bash +# orphan-packages.sh - Detect Go packages that exist on disk but are not +# imported by any other package. +# +# A package is orphan when no other package in the module imports it — +# even if it has its own tests. Self-contained test-only packages that +# nothing depends on are still dead weight in the repo. +# +# Entry points (main packages) are excluded from orphan detection since +# they are invoked by the Go toolchain directly. +# +# These packages are invisible to deadcode analysis. +# Common cause: speculatively written code with no importer yet. +# +# Whitelist: .orphan-packages-whitelist (one package path per line, # comments ok) +# +# Usage: ./scripts/orphan-packages.sh [package-patterns...] +# Default patterns: ./cmd/... ./pkg/... ./internal/... + +set -e + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +ROOT_DIR="$(dirname "$SCRIPT_DIR")" +WHITELIST_FILE="$ROOT_DIR/.orphan-packages-whitelist" + +cd "$ROOT_DIR" + +PATTERNS=("$@") +if [ ${#PATTERNS[@]} -eq 0 ]; then + PATTERNS=("./cmd/..." "./pkg/..." "./internal/...") +fi + +MOD=$(go list -m) + +# All packages on disk within the given patterns. 
+ALL_PKGS=$(go list "${PATTERNS[@]}" 2>/dev/null | sort -u) + +# Collect every intra-module import from OTHER packages. +# Production imports, test imports, and external test imports all count +# as evidence that the target package is needed. +IMPORTED_BY_OTHERS=$( + go list -json "${PATTERNS[@]}" 2>/dev/null \ + | jq -r --arg mod "$MOD" ' + .ImportPath as $self | + ((.Imports // []) + (.TestImports // []) + (.XTestImports // []))[] | + select(startswith($mod + "/")) | + select(. != $self) + ' \ + | sort -u +) + +# Main packages are entry points — they can't be orphans. +MAIN_PKGS=$( + go list -json "${PATTERNS[@]}" 2>/dev/null \ + | jq -r 'select(.Name == "main") | .ImportPath' \ + | sort -u +) + +# A package is NOT orphan if: it is imported by another package, OR it is a main package. +NOT_ORPHAN=$(printf '%s\n%s\n' "$IMPORTED_BY_OTHERS" "$MAIN_PKGS" | sort -u) + +ORPHANS=$(comm -23 <(echo "$ALL_PKGS") <(echo "$NOT_ORPHAN")) + +# Apply whitelist if present. +if [ -f "$WHITELIST_FILE" ]; then + WHITELIST=$(grep -v '^\s*#' "$WHITELIST_FILE" | grep -v '^\s*$' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' | sed "s|^|${MOD}/|" | sort -u) + WHITELIST_COUNT=$(echo "$WHITELIST" | grep -c . || true) + ORPHANS_FILTERED=$(comm -23 <(echo "$ORPHANS") <(echo "$WHITELIST")) + FILTERED_COUNT=$(( $(echo "$ORPHANS" | grep -c . || true) - $(echo "$ORPHANS_FILTERED" | grep -c . 
|| true) )) + ORPHANS="$ORPHANS_FILTERED" +else + WHITELIST_COUNT=0 + FILTERED_COUNT=0 +fi + +if [ -z "$ORPHANS" ]; then + if [ "$FILTERED_COUNT" -gt 0 ]; then + echo "✓ No orphan packages found (excluding $FILTERED_COUNT/$WHITELIST_COUNT whitelisted)" + else + echo "✓ No orphan packages found" + fi + exit 0 +fi + +echo "Orphan packages (not imported by any other package):" +echo "" + +COUNT=0 +while IFS= read -r pkg; do + [ -z "$pkg" ] && continue + rel="${pkg#${MOD}/}" + echo " $rel" + COUNT=$((COUNT + 1)) +done <<< "$ORPHANS" + +echo "" +if [ "$FILTERED_COUNT" -gt 0 ]; then + echo "$COUNT orphan package(s) found ($FILTERED_COUNT/$WHITELIST_COUNT whitelisted excluded)." +else + echo "$COUNT orphan package(s) found." +fi +echo "Either import them or delete them." +exit 1 diff --git a/site/analyzers/complexity.md b/site/analyzers/complexity.md index dde3737..c6e145b 100644 --- a/site/analyzers/complexity.md +++ b/site/analyzers/complexity.md @@ -71,36 +71,62 @@ The complexity analyzer uses the UAST directly and has no analyzer-specific conf ```json { - "complexity": { - "functions": [ - { - "name": "processFile", - "file": "main.go", - "line": 42, - "cyclomatic": 8, - "cognitive": 12, - "nesting_depth": 3 - }, - { - "name": "validate", - "file": "main.go", - "line": 105, - "cyclomatic": 15, - "cognitive": 22, - "nesting_depth": 5 - } - ], - "summary": { - "total_functions": 2, - "avg_cyclomatic": 11.5, - "avg_cognitive": 17.0, - "max_cyclomatic": 15, - "max_nesting_depth": 5 + "function_complexity": [ + { + "name": "processFile", + "source_file": "cmd/server/main.go", + "language": "go", + "directory": "cmd/server", + "cyclomatic_complexity": 8, + "cognitive_complexity": 12, + "nesting_depth": 3, + "lines_of_code": 45, + "complexity_density": 0.178, + "risk_level": "LOW" + }, + { + "name": "validate", + "source_file": "cmd/server/main.go", + "language": "go", + "directory": "cmd/server", + "cyclomatic_complexity": 15, + "cognitive_complexity": 22, + 
"nesting_depth": 5, + "lines_of_code": 80, + "complexity_density": 0.188, + "risk_level": "MEDIUM" } + ], + "high_risk_functions": [ + { + "name": "validate", + "source_file": "cmd/server/main.go", + "language": "go", + "directory": "cmd/server", + "cyclomatic_complexity": 15, + "cognitive_complexity": 22, + "risk_level": "MEDIUM", + "issues": ["High cyclomatic complexity", "Deep nesting"] + } + ], + "distribution": { + "simple": 180, + "moderate": 25, + "complex": 7 + }, + "aggregate": { + "total_functions": 212, + "average_complexity": 3.2, + "max_complexity": 15, + "health_score": 78.5, + "message": "Fair complexity - some functions could be simplified" } } ``` + Each function record includes `source_file`, `language`, and `directory` + for file-level joins and DWH aggregation. + === "Text" ``` diff --git a/site/analyzers/couples.md b/site/analyzers/couples.md index 4ae7804..dfb5234 100644 --- a/site/analyzers/couples.md +++ b/site/analyzers/couples.md @@ -110,7 +110,9 @@ The couples analyzer provides a `ReportSection` for use in combined reports: "developer_coupling": [ { "developer1": "alice", + "developer1_email": "alice@example.com", "developer2": "bob", + "developer2_email": "bob@example.com", "shared_file_changes": 234, "coupling_strength": 0.65 } diff --git a/site/analyzers/developers.md b/site/analyzers/developers.md index ada1262..c527510 100644 --- a/site/analyzers/developers.md +++ b/site/analyzers/developers.md @@ -115,15 +115,16 @@ history: { "id": 0, "name": "alice", + "email": "alice@example.com", "commits": 342, "lines_added": 28500, "lines_removed": 12300, "lines_changed": 8400, "net_lines": 16200, - "languages": { - "Go": {"added": 22000, "removed": 9800, "changed": 6200}, - "Python": {"added": 6500, "removed": 2500, "changed": 2200} - }, + "languages": [ + {"language": "Go", "added": 22000, "removed": 9800, "changed": 6200}, + {"language": "Python", "added": 6500, "removed": 2500, "changed": 2200} + ], "first_tick": 0, "last_tick": 120, 
"active_ticks": 85 @@ -135,12 +136,6 @@ history: "total_lines": 45000, "total_contribution": 67800, "contributors": {"0": 54600, "1": 13200} - }, - { - "name": "Python", - "total_lines": 12000, - "total_contribution": 16700, - "contributors": {"0": 11200, "1": 5500} } ], "busfactor": [ @@ -148,24 +143,41 @@ history: "language": "Python", "bus_factor": 1, "total_contributors": 2, + "primary_dev_id": 0, "primary_dev_name": "alice", + "primary_dev_email": "alice@example.com", "primary_percentage": 67.1, + "secondary_dev_id": 1, "secondary_dev_name": "bob", + "secondary_dev_email": "bob@example.com", "secondary_percentage": 32.9, "risk_level": "MEDIUM" } ], "activity": [ - {"tick": 0, "total_commits": 5, "by_developer": {"0": 3, "1": 2}}, - {"tick": 1, "total_commits": 8, "by_developer": {"0": 5, "1": 3}} + { + "tick": 0, + "start_time": "2024-01-15T10:30:00Z", + "end_time": "2024-01-16T08:45:00Z", + "total_commits": 5, + "by_developer": [ + {"dev_id": 0, "commits": 3}, + {"dev_id": 1, "commits": 2} + ] + } ], "churn": [ - {"tick": 0, "lines_added": 450, "lines_removed": 120, "net_change": 330} + { + "tick": 0, + "start_time": "2024-01-15T10:30:00Z", + "end_time": "2024-01-16T08:45:00Z", + "lines_added": 450, + "lines_removed": 120, + "net_change": 330 + } ], "aggregate": { "total_commits": 850, - "total_lines_added": 95000, - "total_lines_removed": 42000, "total_developers": 5, "active_developers": 3, "analysis_period_ticks": 120, @@ -175,6 +187,15 @@ history: } ``` + **Key fields for analytics:** + + - `developers[].email` — split from previously pipe-delimited name + - `developers[].languages` — flattened from map to sorted array + - `activity[].by_developer` — flattened from `map[int]int` to `[{dev_id, commits}]` array + - `activity[].start_time` / `end_time` — RFC 3339 tick boundaries + - `churn[].start_time` / `end_time` — RFC 3339 tick boundaries + - `busfactor[].primary_dev_email` / `secondary_dev_email` — split identity fields + === "YAML" ```yaml diff --git 
a/site/analyzers/file-history.md b/site/analyzers/file-history.md index b2872f1..d5c3a1b 100644 --- a/site/analyzers/file-history.md +++ b/site/analyzers/file-history.md @@ -89,12 +89,12 @@ The file history analyzer has no additional configuration options. "file_contributors": [ { "path": "pkg/core/engine.go", - "contributors": { - "0": {"added": 2200, "removed": 900, "changed": 600}, - "1": {"added": 800, "removed": 700, "changed": 250}, - "2": {"added": 150, "removed": 150, "changed": 80}, - "3": {"added": 50, "removed": 50, "changed": 20} - }, + "contributors": [ + {"dev_id": 0, "added": 2200, "removed": 900, "changed": 600}, + {"dev_id": 1, "added": 800, "removed": 700, "changed": 250}, + {"dev_id": 2, "added": 150, "removed": 150, "changed": 80}, + {"dev_id": 3, "added": 50, "removed": 50, "changed": 20} + ], "top_contributor_id": 0, "top_contributor_lines": 2800 } diff --git a/site/analyzers/sentiment.md b/site/analyzers/sentiment.md index f723315..d5c3f47 100644 --- a/site/analyzers/sentiment.md +++ b/site/analyzers/sentiment.md @@ -142,6 +142,8 @@ history: "time_series": [ { "tick": 0, + "start_time": "2024-01-15T10:30:00Z", + "end_time": "2024-01-16T08:45:00Z", "sentiment": 0.72, "comment_count": 12, "commit_count": 5, @@ -149,6 +151,8 @@ history: }, { "tick": 1, + "start_time": "2024-01-16T09:00:00Z", + "end_time": "2024-01-17T18:30:00Z", "sentiment": 0.35, "comment_count": 8, "commit_count": 3, diff --git a/site/guide/cli-reference.md b/site/guide/cli-reference.md index ec85ea6..3fe4f81 100644 --- a/site/guide/cli-reference.md +++ b/site/guide/cli-reference.md @@ -128,6 +128,103 @@ codefang run -a history/couples --limit 500 . The burndown analyzer automatically enables `--first-parent` when selected. This is required for correct line-tracking across merge commits. 
+#### Language Filtering + +| Flag | Type | Default | Description | +|------|------|---------|-------------| +| `--languages` | `[]string` | `[all]` | Restrict analysis to the given Linguist languages; comma-separated. `all` (default) disables the filter. Applies to **both** history and static phases. | + +**History phase** — the filter is pushed down into libgit2's `pathspec` +at the tree-diff stage, so non-matching files are skipped before the +diff crosses the cgo boundary. On a polyglot repo a narrow filter can +reduce wall time by 30–40 %. The Go-side language check still runs as +the authoritative pass for content-disambiguated extensions (`.h`, +`.pl`, `.m`, `.r`). + +**Static phase** — the filter is applied at the directory walker +(`matchesLanguageGlobs`) before the UAST parser or raw-file analyzers +see the file. It's path-based only: the parser's own language router +remains the final authority for how a matched file is parsed (e.g. a +`.h` under `--languages c++` is still parsed as C). Both phases read +from the same `langpath.Globs` helper, so the flag value has one +meaning across `-a 'static/*'`, `-a 'history/*'`, and `-a '*'` runs. + +Language names are [Linguist keys](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) +and common aliases resolve automatically: + +```bash +# Canonical names (any case, whitespace is trimmed) +codefang run -a 'history/devs' --languages go,python,typescript . + +# Aliases resolve via enry +codefang run -a 'history/devs' --languages golang,js,ts . + +# Unknown language fails fast at configure time instead of silently +# returning an empty report: +codefang run -a 'history/devs' --languages notalang . +# → Error: failed to configure TreeDiff: tree-diff pathspec: unknown language: "notalang" +``` + +Filename-only languages (e.g. `Dockerfile`, `Makefile`) are also supported: + +```bash +codefang run -a 'history/devs' --languages dockerfile . 
+``` + +See `specs/optimize-lang/PROPOSAL.md` for the architecture and acceptance-gate +numbers. + +#### Vendor & Generated Exclusion + +By default, Codefang excludes **vendored dependencies** and **auto-generated +files** from analysis. This matches the convention of every major +single-language analyser (`go vet`, `eslint`, `ruff`, `rubocop`, `scalafix`, +`phpcs`, …) — vendor/generated code is noise for a code-quality report. + +| Flag | Type | Default | Description | +|------|------|---------|-------------| +| `--include-vendored` | `bool` | `false` | Re-include vendored dependencies (detected by enry / Linguist) in analysis. Cross-language: covers `vendor/`, `node_modules/`, `third_party/`, `testdata/`, minified bundles, and more. | +| `--include-generated` | `bool` | `false` | Re-include auto-generated files in analysis. Covers `*.pb.go`, `zz_generated_*.go`, `*_pb2.py`, `*.min.js`, and any file whose first 512 bytes contain a generated-file marker (`DO NOT EDIT`, `Code generated`, etc.). | +| `--extra-excluded-prefixes` | `[]string` | `[]` | Additional UNIX path prefixes to exclude on top of enry heuristics (e.g. `".venv/,target/,build/"`). | + +All three flags apply **identically** to both static and history phases. + +```bash +# default: your own code only +codefang run -a '*' . + +# include vendored deps (node_modules/, vendor/, …) +codefang run -a '*' --include-vendored . + +# restore pre-codefang-2026-04 behaviour (include everything) +codefang run -a '*' --include-vendored --include-generated . + +# skip extras that enry doesn't know about (e.g. Python venv, Rust target/) +codefang run -a '*' --extra-excluded-prefixes '.venv/,target/' . +``` + +!!! warning "Breaking change in 2026-04" + + Earlier versions of Codefang analysed vendored and generated files by + default (they needed the confusingly-named `--skip-blacklist=true` to be + excluded). Starting from 2026-04, defaults flip: vendor / generated + are **excluded by default**. 
To restore the old behaviour: + + ```bash + codefang run ... --include-vendored --include-generated + ``` + + The deprecated `--skip-blacklist` and `--blacklisted-prefixes` flags + still work with a cobra deprecation warning and will be removed in + the next minor release. Map them to: + + - `--skip-blacklist` → no-op (the new default already excludes) + - `--blacklisted-prefixes X,Y` → `--extra-excluded-prefixes X,Y` + +See `specs/exclude-vendored/PROPOSAL.md` for the full cross-phase design +and `specs/frds/FRD-20260419-exclude-vendored.md` for implementation +details. + #### Pipeline Tuning Flags | Flag | Type | Default | Description | diff --git a/site/guide/data-analytics.md b/site/guide/data-analytics.md new file mode 100644 index 0000000..4e60676 --- /dev/null +++ b/site/guide/data-analytics.md @@ -0,0 +1,683 @@ +# Data Analytics & DWH Integration + +Codefang produces richly structured JSON output designed for loading into +columnar data warehouses (ClickHouse, Greenplum, BigQuery, Snowflake) and +building BI dashboards. This guide covers the optimal pipeline from +repository analysis to production dashboards. 
+ +--- + +## Quick Start + +Analyze a repository and produce DWH-ready output: + +```bash +# JSON for small-to-medium repos (< 5K files, < 10K commits) +codefang run --format json --per-file --memory-budget 4GB /path/to/repo > report.json + +# NDJSON for large repos (streaming, one line per analyzer) +codefang run --format ndjson --per-file --memory-budget 8GB /path/to/repo > report.ndjson + +# Limit history depth for faster iteration +codefang run --format json --per-file --limit 5000 /path/to/repo > report.json +``` + +--- + +## Output Format Selection + +| Repo Size | Recommended Format | Reason | +|-----------|-------------------|--------| +| < 1K files | `json` | Small file, easy to inspect | +| 1K-10K files | `json` | Manageable (< 500MB typically) | +| 10K-50K files | `ndjson` | JSON gets multi-GB; NDJSON streams | +| 50K+ files | `ndjson` + `--limit` | Bound history for practical runtimes | + +### JSON Format + +```bash +codefang run --format json --per-file /repo > report.json +``` + +Produces a single JSON object with versioned envelope: + +```json +{ + "version": "codefang.run.v1", + "metadata": { + "repo_name": "myproject", + "analyzed_at": "2026-04-08T10:00:00Z", + "codefang_version": "0.1.0" + }, + "analyzers": [ + { + "id": "static/complexity", + "mode": "static", + "schema": { ... }, + "report": { ... } + } + ] +} +``` + +### NDJSON Format + +```bash +codefang run --format ndjson --per-file /repo > report.ndjson +``` + +One JSON line per analyzer. 
First line is metadata: + +``` +{"version":"codefang.run.v1","metadata":{"repo_name":"myproject",...}} +{"id":"static/complexity","mode":"static","report":{...}} +{"id":"history/sentiment","mode":"history","report":{...}} +``` + +Process with standard tools: + +```bash +# Extract one analyzer +grep '"static/complexity"' report.ndjson | jq .report.aggregate + +# Count lines (one per analyzer, plus the metadata line) +wc -l report.ndjson + +# Stream into ClickHouse +cat report.ndjson | clickhouse-client --query "INSERT INTO codefang_raw FORMAT JSONEachRow" +``` + +--- + +## Memory Budget + +**Always set `--memory-budget`** for repos with history analysis. Without it, +the streaming pipeline uses a conservative 2GB default that may OOM on large +repos. + +| Machine RAM | Recommended Budget | Handles | +|-------------|-------------------|---------| +| 8 GB | `--memory-budget 2GB` | Repos up to ~10K commits | +| 16 GB | `--memory-budget 4GB` | Repos up to ~30K commits | +| 32 GB | `--memory-budget 8GB` | Repos up to ~60K commits | +| 64 GB | `--memory-budget 16GB` | Repos up to ~150K commits | + +The budget controls the streaming chunk planner: larger budgets mean fewer, +bigger chunks and faster processing. The actual RSS will be ~2x the budget +due to Go runtime overhead and native memory. + +```bash +# 64GB machine, kubernetes-sized repo (~56K commits) +codefang run --format ndjson --per-file --memory-budget 8GB ~/sources/kubernetes +``` + +!!! warning "Without `--memory-budget`" + The default 2GB budget may cause the process to be killed by the OS OOM + killer on large repos. Always set this flag explicitly. + +--- + +## Commit Limiting + +Use `--limit N` to analyze only the most recent N commits. 
This is useful for: + +- **Fast iteration**: Test your ETL pipeline on a subset before running full history +- **Incremental analysis**: Analyze only recent changes for daily dashboards +- **Memory control**: Fewer commits = less memory, faster processing + +```bash +# Last 1000 commits (fast, ~2 min) +codefang run --format json --per-file --limit 1000 /repo > recent.json + +# Last 10000 commits (moderate, ~15 min) +codefang run --format json --per-file --limit 10000 --memory-budget 4GB /repo > report.json + +# Full history (slow, may take hours for large repos) +codefang run --format json --per-file --memory-budget 8GB /repo > full.json +``` + +--- + +## Key Fields for Analytics + +Every function-level record includes fields designed for DWH joins and +aggregation: + +| Field | Present On | Type | Example | Use Case | +|-------|-----------|------|---------|----------| +| `source_file` | All function records | string | `"pkg/api/server.go"` | Join to file-level data | +| `language` | All function records | string | `"go"` | Group by language | +| `directory` | All function records | string | `"pkg/api"` | Group by package/module | +| `start_time` | All time-series ticks | RFC 3339 | `"2024-01-15T10:30:00Z"` | Time-axis labels | +| `end_time` | All time-series ticks | RFC 3339 | `"2024-01-16T08:45:00Z"` | Tick duration | +| `name` | Developer records | string | `"alice"` | Developer dimension | +| `email` | Developer records | string | `"alice@example.com"` | Developer identity | +| `dev_id` | Activity, contributors | int | `42` | Foreign key to developers | + +--- + +## Schema Manifest + +Every analyzer section includes a `schema` field describing its output: + +```json +{ + "schema": { + "function_complexity": { + "type": "list", + "grain": "function", + "description": "Per-function cyclomatic and cognitive complexity" + }, + "aggregate": { + "type": "aggregate", + "description": "Summary statistics" + } + } +} +``` + +**Field types**: `list`, `aggregate`, 
`time_series`, `risk`, `scalar` + +**Grain values**: `function`, `file`, `tick`, `pair`, `developer`, `node`, `comment`, `import` + +Use the schema to auto-generate ETL mappings: + +```python +# Python: extract schema for table generation +import json +with open('report.json') as f: + data = json.load(f) +for analyzer in data['analyzers']: + schema = analyzer.get('schema', {}) + for field, meta in schema.items(): + if meta['type'] == 'list': + print(f"CREATE TABLE {analyzer['id'].replace('/', '_')}_{field} ...") +``` + +--- + +## Star Schema Design + +### Dimensions + +```sql +-- dim_repository +CREATE TABLE dim_repository ( + repo_id UInt64, + repo_name String, + repo_path String, + analyzed_at DateTime, + version String +) ENGINE = MergeTree() ORDER BY repo_id; + +-- dim_file (extract from source_file + directory + language) +CREATE TABLE dim_file ( + file_id UInt64, + repo_id UInt64, + source_file String, + directory String, + language LowCardinality(String) +) ENGINE = MergeTree() ORDER BY (repo_id, source_file); + +-- dim_developer +CREATE TABLE dim_developer ( + dev_id UInt32, + repo_id UInt64, + name String, + email String +) ENGINE = MergeTree() ORDER BY (repo_id, dev_id); + +-- dim_tick +CREATE TABLE dim_tick ( + tick_id UInt32, + repo_id UInt64, + tick UInt32, + start_time DateTime, + end_time DateTime +) ENGINE = MergeTree() ORDER BY (repo_id, tick); +``` + +### Fact Tables + +```sql +-- Static analysis facts (per-function grain) +CREATE TABLE fact_function_complexity ( + repo_id UInt64, + source_file String, + directory LowCardinality(String), + language LowCardinality(String), + name String, + cyclomatic_complexity UInt32, + cognitive_complexity UInt32, + nesting_depth UInt8, + lines_of_code UInt32, + complexity_density Float64, + risk_level LowCardinality(String) +) ENGINE = MergeTree() +ORDER BY (repo_id, directory, source_file, name); + +-- Time-series facts (per-tick grain) +CREATE TABLE fact_tick_sentiment ( + repo_id UInt64, + tick UInt32, + 
start_time DateTime, + end_time DateTime, + sentiment Float32, + classification LowCardinality(String), + comment_count UInt32, + commit_count UInt32 +) ENGINE = MergeTree() +ORDER BY (repo_id, tick); + +-- Developer activity (per-tick-per-developer grain) +CREATE TABLE fact_developer_activity ( + repo_id UInt64, + tick UInt32, + dev_id UInt32, + commits UInt32 +) ENGINE = MergeTree() +ORDER BY (repo_id, tick, dev_id); + +-- File coupling (per-pair grain) +CREATE TABLE fact_file_coupling ( + repo_id UInt64, + file1 String, + file2 String, + co_changes UInt32, + coupling_strength Float64 +) ENGINE = MergeTree() +ORDER BY (repo_id, file1, file2); +``` + +--- + +## ETL Pipeline + +### Python (with dbt or standalone) + +```python +import json +import zlib + +with open('report.json') as f: + data = json.load(f) + +# Extract metadata +meta = data['metadata'] +# Stable across runs (built-in hash() is salted per process); or use a sequence +repo_id = zlib.crc32(meta['repo_path'].encode()) + +# Extract analyzers by ID +analyzers = {a['id']: a['report'] for a in data['analyzers']} + +# Load function complexity +functions = analyzers['static/complexity']['function_complexity'] +# Each record already has: name, source_file, language, directory, +# cyclomatic_complexity, cognitive_complexity, etc. + +# Load time-series with timestamps +sentiment_ts = analyzers['history/sentiment']['time_series'] +# Each tick has: tick, start_time, end_time, sentiment, classification, ... + +# Load developers +developers = analyzers['history/devs']['developers'] +# Each has: id, name, email, commits, lines_added, languages (array), ... 
+ +# Load file coupling (can be millions of rows) +coupling = analyzers['history/couples']['file_coupling'] +# Each has: file1, file2, co_changes, coupling_strength +``` + +### ClickHouse Direct Load + +```bash +# Extract function complexity from NDJSON +grep '"static/complexity"' report.ndjson \ + | jq -c '.report.function_complexity[]' \ + | clickhouse-client --query "INSERT INTO fact_function_complexity FORMAT JSONEachRow" + +# Extract sentiment time-series +grep '"history/sentiment"' report.ndjson \ + | jq -c '.report.time_series[]' \ + | clickhouse-client --query "INSERT INTO fact_tick_sentiment FORMAT JSONEachRow" +``` + +--- + +## Recommended Analyzer Selection + +Not all 17 analyzers are needed for every use case. Select based on your +dashboard needs: + +### Code Quality Dashboard + +```bash +codefang run \ + -a static/complexity,static/halstead,static/cohesion,static/comments \ + -a history/quality \ + --format json --per-file /repo +``` + +**Produces**: Function-level metrics, quality trend over time. +**Row count**: ~200K functions + ~4K tick entries for a medium repo. + +### Developer Analytics Dashboard + +```bash +codefang run \ + -a history/devs,history/couples,history/sentiment \ + --format json /repo +``` + +**Produces**: Developer profiles, coupling networks, sentiment trends. +**Row count**: ~500 developers + ~5K coupling pairs + ~4K ticks. + +### File Health Dashboard + +```bash +codefang run \ + -a static/complexity,static/clones \ + -a history/file-history,history/couples \ + --format json --per-file /repo +``` + +**Produces**: Per-file complexity, churn hotspots, coupling networks. +**Row count**: ~30K files + ~100K coupling pairs. + +### Full Analysis (Everything) + +```bash +codefang run --format ndjson --per-file --memory-budget 8GB /repo +``` + +**Produces**: All 17 analyzers. Use NDJSON for large repos. 
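When consuming a full NDJSON run, each line can be handled on its own. The sketch below is one hedged way to do that in Python, assuming the line layout shown in the NDJSON section (a first line carrying `metadata`, then one line per analyzer with `id` and `report`). The helper name `iter_list_records` is hypothetical, and treating every list-valued report field as a table of records is an assumption, not documented Codefang behaviour.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_list_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, dict]]:
    """Yield (analyzer_id, field, record) for each record in list-valued report fields.

    Hypothetical helper: assumes analyzer lines carry "id" and "report" keys.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        if "id" not in obj:
            continue  # metadata line (no analyzer id)
        for field, value in obj.get("report", {}).items():
            if isinstance(value, list):
                for record in value:
                    if isinstance(record, dict):
                        yield obj["id"], field, record


# Tiny in-memory sample mirroring the documented line layout.
sample = [
    '{"version":"codefang.run.v1","metadata":{"repo_name":"myproject"}}',
    '{"id":"static/complexity","mode":"static","report":'
    '{"function_complexity":[{"name":"RunStreaming","cyclomatic_complexity":11}],'
    '"aggregate":{"total_functions":1}}}',
]
rows = list(iter_list_records(sample))
```

With a real report, passing the open file handle instead of `sample` keeps only one line in memory at a time, although a single analyzer line can still be large.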
+ +--- + +## Performance Tuning + +### Static Analysis Workers + +Control parallelism for the UAST parsing phase: + +```bash +# Use all CPUs (default: min(NumCPU, 8)) +codefang run --static-workers 16 --format json /repo +``` + +More workers = faster static phase but higher peak memory. + +### History Analysis + +The streaming pipeline auto-tunes chunk sizes based on `--memory-budget`. +No manual tuning needed. Key parameters: + +| Parameter | Flag | Default | Effect | +|-----------|------|---------|--------| +| Memory budget | `--memory-budget` | 2GB | Controls chunk size | +| Commit limit | `--limit` | 0 (all) | Bounds history depth | +| First parent | `--first-parent` | false | Skip merge commits | +| Since | `--since` | none | Time-based filtering | + +```bash +# Analyze only last 6 months, first-parent only +codefang run --since 6m --first-parent --format json /repo +``` + +!!! note "`--since` with inactive repos" + If no commits fall within the `--since` window, history analyzers produce + empty results (zero ticks, zero developers). Static analyzers still run + normally since they analyze the current file tree, not commit history. + +--- + +## Incremental Analysis & Checkpointing + +Codefang supports two persistence mechanisms for long-running analysis: +**incremental caching** (skip already-processed commits) and **checkpointing** +(crash recovery). + +### Incremental Cache + +The incremental cache stores analysis results keyed by repository root SHA and +branch. On subsequent runs, only new commits since the last cached position +are processed. + +!!! warning "History-only mode required" + The incremental cache currently works with history-only runs + (`-a 'history/*'`). In the default combined mode (static + history), + the cache directory is accepted but may not produce cache files. + For incremental DWH loads, run history and static phases separately. 
+ +```bash +# History-only run with cache (incremental) +codefang run -a 'history/*' --format json --memory-budget 8GB \ + --cache-dir ~/.codefang/cache /repo > history.json + +# Static run (always full, no caching needed — fast) +codefang run -a 'static/*' --format json --per-file /repo > static.json + +# Force full re-analysis (ignore cache) +codefang run -a 'history/*' --format json --memory-budget 8GB \ + --cache-dir ~/.codefang/cache --no-cache /repo > history-full.json +``` + +| Flag | Default | Description | +|------|---------|-------------| +| `--cache-dir` | none | Directory for incremental cache storage | +| `--no-cache` | false | Force full re-analysis, ignore existing cache | + +The cache stores a metadata file (`cache.json`) with head SHA, branch, commit +count, and analyzer IDs. If the root SHA changes (force-push or history +rewrite), the cache is automatically invalidated. + +!!! tip "Ideal for daily DWH loads" + Point `--cache-dir` to a persistent directory on your CI machine. + Each daily run only processes the new commits since yesterday, + cutting analysis time from hours to minutes. + +### Checkpointing (Crash Recovery) + +For very long runs (e.g., full kubernetes at ~3 hours), checkpointing saves +progress periodically so a crash doesn't lose all work. 
+ +```bash +# Enable checkpointing (on by default) +codefang run --format json --memory-budget 8GB \ + --checkpoint --checkpoint-dir ~/.codefang/checkpoints /repo + +# Resume from checkpoint after crash +codefang run --format json --memory-budget 8GB \ + --resume --checkpoint-dir ~/.codefang/checkpoints /repo + +# Clear old checkpoint and start fresh +codefang run --format json --memory-budget 8GB \ + --clear-checkpoint /repo +``` + +| Flag | Default | Description | +|------|---------|-------------| +| `--checkpoint` | true | Enable periodic checkpointing | +| `--checkpoint-dir` | `~/.codefang/checkpoints` | Directory for checkpoint files | +| `--resume` | true | Resume from checkpoint if available | +| `--clear-checkpoint` | false | Clear existing checkpoint before run | + +The checkpoint stores: + +- Current chunk position (which commits have been processed) +- Aggregator spill state (intermediate results on disk) +- Repository hash (for validation on resume) + +!!! info "Auto-cleanup on success" + Checkpoint files are **automatically deleted** after a successful run. + They only persist if the process crashes mid-analysis. This is by design — + checkpoints are for crash recovery, not persistent storage. + +!!! warning "Checkpoint vs Cache" + **Checkpoint** = crash recovery within a single run (temporary, auto-cleaned on success). + **Cache** = incremental analysis across runs (persistent, reused on next invocation). + For DWH pipelines, you want **both**: `--cache-dir` for incremental loads and + `--checkpoint` for resilience. 
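The cache half of this distinction (skip commits already processed, start over when the root SHA changes) can be pictured with a small sketch. This is not Codefang's implementation: `CacheMeta` and `commits_to_replay` are hypothetical names, and the sketch assumes commits are ordered oldest-first so that the cached commits form a prefix of the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheMeta:
    root_sha: str      # root commit SHA recorded when the cache was written
    commit_count: int  # number of commits already analyzed


def commits_to_replay(meta: CacheMeta, root_sha: str, commits: List[str]) -> List[str]:
    """Return the commits that still need analysis (illustrative only)."""
    if meta.root_sha != root_sha:
        return list(commits)  # stale cache: history was rewritten, start over
    return commits[meta.commit_count:]  # skip the prefix covered by the cache


history = [f"c{i}" for i in range(1000)]
fresh = commits_to_replay(CacheMeta("root000", 950), "root000", history)
stale = commits_to_replay(CacheMeta("root000", 950), "other_root", history)
```

Here `fresh` holds only the commits added since the cached position, while `stale` falls back to the full history because the root SHA no longer matches.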
+ +### Production Pipeline Example + +A daily cron job that incrementally analyzes a repository: + +```bash +#!/bin/bash +REPO=/opt/repos/kubernetes +CACHE_DIR=/var/lib/codefang/cache +CHECKPOINT_DIR=/var/lib/codefang/checkpoints +OUTPUT_DIR=/var/lib/codefang/output + +# Pull latest +cd "$REPO" && git pull --ff-only + +# Static analysis (always full, fast) +codefang run \ + -a 'static/*' \ + --format ndjson \ + --per-file \ + "$REPO" > "$OUTPUT_DIR/static-$(date +%Y%m%d).ndjson" + +# History analysis (incremental via cache) +codefang run \ + -a 'history/*' \ + --format ndjson \ + --memory-budget 8GB \ + --cache-dir "$CACHE_DIR" \ + --checkpoint-dir "$CHECKPOINT_DIR" \ + "$REPO" > "$OUTPUT_DIR/history-$(date +%Y%m%d).ndjson" + +# Load both outputs into ClickHouse +cat "$OUTPUT_DIR/static-$(date +%Y%m%d).ndjson" "$OUTPUT_DIR/history-$(date +%Y%m%d).ndjson" \ + | clickhouse-client --query "INSERT INTO codefang_raw FORMAT JSONEachRow" +``` + +### Advanced Tuning for History Pipeline + +Fine-tune the history streaming pipeline for specific hardware: + +```bash +codefang run \ + --memory-budget 8GB \ + --commit-batch-size 200 \ + --blob-cache-size 2GB \ + --diff-cache-size 20000 \ + --blob-arena-size 8MB \ + --tmp-dir /fast-ssd/tmp \ + --format ndjson /repo +``` + +| Flag | Default | Description | +|------|---------|-------------| +| `--commit-batch-size` | 100 | Commits per processing batch | +| `--blob-cache-size` | 1GB | Max blob cache (LRU, keeps hot files in memory) | +| `--diff-cache-size` | 10000 | Max diff cache entries | +| `--blob-arena-size` | 4MB | Memory arena for blob loading | +| `--tmp-dir` | system temp | Directory for spill files (use fast SSD) | +| `--keep-store` | false | Keep temp ReportStore after rendering (for debugging) | + +!!! tip "SSD for tmp-dir" + The streaming pipeline spills intermediate data to disk when memory + pressure is high. Point `--tmp-dir` to a fast SSD for best performance. 
+ +--- + +## Row Count Estimates + +Use these to plan DWH capacity: + +| Table | Per 1K Files | Per 10K Commits | Per 50K Files | +|-------|-------------|-----------------|---------------| +| function_complexity | ~5K | — | ~150K | +| comment_quality | ~17K | — | ~500K | +| file_coupling | — | ~30K | ~4M | +| developer_activity | — | ~3K ticks * devs | ~15K | +| node_coupling | — | ~40K | ~1.5M | + +**Storage**: ~2GB JSON for 50K files + 56K commits (kubernetes scale). +Compressed in ClickHouse: ~200MB. + +--- + +## Materialized Views + +Pre-aggregate for common dashboard queries: + +```sql +-- Complexity by directory (for treemap) +-- AggregatingMergeTree only merges aggregate states correctly, so the view stores +-- -State values; read them with avgMerge/maxMerge/countMerge/countIfMerge +-- and GROUP BY repo_id, directory. +CREATE MATERIALIZED VIEW mv_complexity_by_directory +ENGINE = AggregatingMergeTree() ORDER BY (repo_id, directory) +AS SELECT + repo_id, + directory, + avgState(cyclomatic_complexity) AS avg_complexity, + maxState(cyclomatic_complexity) AS max_complexity, + countState() AS function_count, + countIfState(risk_level = 'CRITICAL') AS critical_count +FROM fact_function_complexity +GROUP BY repo_id, directory; + +-- Sentiment trend (for time-series chart) +-- Read with avgMerge(avg_sentiment) and sumMerge(total_comments), GROUP BY repo_id, week. +CREATE MATERIALIZED VIEW mv_sentiment_weekly +ENGINE = AggregatingMergeTree() ORDER BY (repo_id, week) +AS SELECT + repo_id, + toMonday(start_time) AS week, + avgState(sentiment) AS avg_sentiment, + sumState(comment_count) AS total_comments +FROM fact_tick_sentiment +GROUP BY repo_id, week; +``` + +--- + +## Troubleshooting + +### OOM Kills + +**Symptom**: Process killed during history analysis. +**Fix**: Set `--memory-budget` explicitly. 
+ +```bash +# Check available RAM +free -h + +# Set budget to ~25% of available RAM +codefang run --memory-budget 4GB --format ndjson /repo +``` + +### Empty History Analyzers + +Some analyzers require specific conditions: + +| Analyzer | Requirement | +|----------|-------------| +| `burndown` (developer/file survival) | Enable via config: `Burndown.TrackPeople: true`, `Burndown.TrackFiles: true` | +| `history/imports` | Requires UAST-enabled pipeline mode | +| `history/typos` | Requires UAST-enabled pipeline mode | + +### Large File Coupling Tables + +`file_coupling` can produce millions of rows for large repos. Filter in your +ETL: + +```python +# Only keep strong couplings +strong = [p for p in coupling if p['coupling_strength'] > 0.3] +``` + +Or limit at query time: + +```sql +SELECT * FROM fact_file_coupling +WHERE coupling_strength > 0.3 +ORDER BY coupling_strength DESC +LIMIT 1000; +``` + +### Missing Language/Directory on Some Records + +The `language` and `directory` fields are populated by the UAST parser. If a +file's language is not supported by the parser, these fields will be empty. +Supported languages include Go, Python, Java, JavaScript, TypeScript, C, C++, +Ruby, Rust, and 40+ others. diff --git a/site/guide/output-formats.md b/site/guide/output-formats.md index aa9840d..3b908d0 100644 --- a/site/guide/output-formats.md +++ b/site/guide/output-formats.md @@ -18,6 +18,7 @@ codefang run -a static/complexity --format text . 
| [JSON](#json) | `json` | `application/json` | Programmatic consumption, CI pipelines | | [YAML](#yaml) | `yaml` | `text/yaml` | Human-readable structured data, config integration | | [Compact](#compact) | `compact` | Plain text | Quick summaries, log ingestion | +| [NDJSON](#ndjson) | `ndjson` | `application/x-ndjson` | Streaming DWH ingestion (ClickHouse, BigQuery) | | [Time Series](#time-series) | `timeseries` | `application/json` | Chronological analysis, dashboards | | [Plot](#plot) | `plot` | `text/html` | Interactive charts, reports, presentations | @@ -72,60 +73,113 @@ codefang run -a static/complexity --format text -v . **Flag:** `--format json` -Structured JSON output. This is the **default format**. Each analyzer produces -a well-defined JSON schema. Static analyzers emit a single JSON object; -history analyzers emit per-analyzer JSON objects. +Structured JSON output. This is the **default format**. The output is wrapped +in a versioned envelope with metadata, per-analyzer schema manifests, and +reports. Each analyzer's report contains typed arrays of records with +consistent identifiers (`source_file`, `language`, `directory` on function +records; `start_time`/`end_time` on time-series ticks; split `name`/`email` +on developer records). ```bash -codefang run -a static/complexity --format json . +codefang run --format json . ``` -??? example "Example Output" +??? 
example "Example Output (Combined Static + History)" ```json { - "complexity": { - "files": [ - { - "path": "internal/framework/runner.go", - "functions": [ + "version": "codefang.run.v1", + "metadata": { + "repo_path": "/home/user/sources/myproject", + "repo_name": "myproject", + "analyzed_at": "2026-04-07T23:33:00Z", + "codefang_version": "0.1.0" + }, + "analyzers": [ + { + "id": "static/complexity", + "mode": "static", + "schema": { + "function_complexity": { + "type": "list", + "grain": "function", + "description": "Per-function cyclomatic and cognitive complexity" + }, + "aggregate": { + "type": "aggregate", + "description": "Summary statistics" + } + }, + "report": { + "function_complexity": [ { "name": "RunStreaming", - "complexity": 11, - "lines": 85, - "start_line": 42, - "end_line": 127 - }, - { - "name": "NewRunnerWithConfig", - "complexity": 3, - "lines": 22, - "start_line": 15, - "end_line": 37 + "source_file": "internal/framework/runner.go", + "language": "go", + "directory": "internal/framework", + "cyclomatic_complexity": 11, + "cognitive_complexity": 15, + "nesting_depth": 3, + "lines_of_code": 85, + "complexity_density": 0.129, + "risk_level": "MEDIUM" } ], - "summary": { - "total_functions": 12, - "average_complexity": 4.2, - "max_complexity": 11 + "aggregate": { + "total_functions": 312, + "average_complexity": 2.6, + "max_complexity": 11, + "health_score": 82.5 } } - ], - "summary": { - "total_files": 47, - "total_functions": 312, - "average_complexity": 2.6, - "max_complexity": 11 + }, + { + "id": "history/sentiment", + "mode": "history", + "schema": { + "time_series": { + "type": "time_series", + "grain": "tick", + "description": "Per-tick sentiment scores" + } + }, + "report": { + "time_series": [ + { + "tick": 0, + "start_time": "2024-01-15T10:30:00Z", + "end_time": "2024-01-16T08:45:00Z", + "sentiment": 0.72, + "classification": "positive", + "comment_count": 5, + "commit_count": 12 + } + ] + } } - } + ] } ``` +**Key output fields added 
for analytics/DWH consumption:** + +| Field | Present On | Description | +|-------|-----------|-------------| +| `source_file` | All function records | Relative file path (e.g., `"pkg/api/server.go"`) | +| `language` | All function records | Detected language (e.g., `"go"`, `"python"`) | +| `directory` | All function records | Parent directory (e.g., `"pkg/api"`) | +| `start_time` | All time-series ticks | RFC 3339 tick start timestamp | +| `end_time` | All time-series ticks | RFC 3339 tick end timestamp | +| `email` | Developer records | Separated from name (no more pipe-delimited) | +| `schema` | Each analyzer section | Field type, grain, and description metadata | +| `metadata` | Top-level envelope | Repo name, analysis timestamp, version | + !!! tip "When to Use" - CI/CD pipelines that parse results programmatically - - Feeding data into external tools or databases + - Loading into data warehouses (ClickHouse, BigQuery, Snowflake) - Cross-format conversion input (`--input`) + - Building BI dashboards from function-level metrics --- @@ -206,6 +260,50 @@ codefang run -a 'static/*' --format compact . --- +## NDJSON + +**Flag:** `--format ndjson` + +Newline-delimited JSON. Each analyzer produces one compact JSON line. If +metadata is present, a metadata line is emitted first. This format enables +streaming ingestion into columnar DWH systems like ClickHouse, where each +line can be parsed independently without buffering the entire file. + +```bash +codefang run --format ndjson . > output.ndjson +``` + +??? 
example "Example Output" + + ``` + {"version":"codefang.run.v1","metadata":{"repo_name":"myproject","analyzed_at":"2026-04-07T23:33:00Z","codefang_version":"0.1.0"}} + {"id":"static/complexity","mode":"static","report":{"function_complexity":[...],"aggregate":{...}}} + {"id":"static/halstead","mode":"static","report":{"function_halstead":[...]}} + {"id":"history/sentiment","mode":"history","report":{"time_series":[...]}} + ``` + +Each line is independently parseable JSON. The file can be processed with +standard tools: + +```bash +# Extract a single analyzer +grep '"static/complexity"' output.ndjson | jq .report.aggregate + +# Count lines +wc -l output.ndjson + +# Stream into ClickHouse +cat output.ndjson | clickhouse-client --query "INSERT INTO codefang FORMAT JSONEachRow" +``` + +!!! tip "When to Use" + + - Streaming ingestion into ClickHouse, BigQuery, or Kafka + - Processing large reports without loading the full file into memory + - Unix pipeline workflows (`grep`, `jq`, `wc`) + +--- + ## Time Series **Flag:** `--format timeseries` @@ -357,6 +455,7 @@ categories: | `compact` | :material-check: | -- | -- | | `json` | :material-check: | :material-check: | :material-check: | | `yaml` | :material-check: | :material-check: | :material-check: | +| `ndjson` | :material-check: | :material-check: | :material-check: | | `plot` | :material-check: | :material-check: | :material-check: | | `timeseries` | -- | :material-check: | :material-check: | diff --git a/tests/e2e/composition_test.go b/tests/e2e/composition_test.go new file mode 100644 index 0000000..3021b5d --- /dev/null +++ b/tests/e2e/composition_test.go @@ -0,0 +1,167 @@ +//go:build e2e + + +package e2e_test + +import ( + "context" + "encoding/json" + "os" + "path/filepath" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/renderer" + 
"github.com/Sumatoshi-tech/codefang/internal/analyzers/composition" +) + +func newCompositionService() *analyze.StaticService { + svc := analyze.NewStaticService(nil, []analyze.RawFileAnalyzer{composition.NewAnalyzer()}) + svc.Renderer = &renderer.DefaultStaticRenderer{} + + return svc +} + +func compositionFixtureDir(t *testing.T) string { + t.Helper() + + dir := t.TempDir() + + // Source files. + require.NoError(t, os.WriteFile( + filepath.Join(dir, "main.go"), + []byte("package main\n\nfunc main() {}\n"), + 0o600, + )) + + require.NoError(t, os.WriteFile( + filepath.Join(dir, "lib.go"), + []byte("package main\n\nfunc helper() int { return 1 }\n"), + 0o600, + )) + + // Documentation. + require.NoError(t, os.WriteFile( + filepath.Join(dir, "README.md"), + []byte("# Project\n"), + 0o600, + )) + + // Config. + require.NoError(t, os.WriteFile( + filepath.Join(dir, "config.yml"), + []byte("key: value\n"), + 0o600, + )) + + // Binary file. + require.NoError(t, os.WriteFile( + filepath.Join(dir, "data.bin"), + []byte{0x00, 0x01, 0x02, 0xFF, 0xFE, 0x00, 0x00, 0x00}, + 0o600, + )) + + return dir +} + +func TestComposition_AnalyzeFolder_ProducesResults(t *testing.T) { + t.Parallel() + + svc := newCompositionService() + dir := compositionFixtureDir(t) + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + require.Contains(t, results, "composition") + + report := results["composition"] + + total, ok := report["total_files"].(int) + require.True(t, ok) + + const expectedFiles = 5 + + assert.Equal(t, expectedFiles, total, + "fixture has 5 files: 2 .go + 1 .md + 1 .yml + 1 .bin") +} + +func TestComposition_JSONOutput_HasSections(t *testing.T) { + t.Parallel() + + svc := newCompositionService() + dir := compositionFixtureDir(t) + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + sections := svc.BuildSections(results) + require.Len(t, sections, 1) + assert.Equal(t, "COMPOSITION", 
sections[0].SectionTitle()) +} + +func TestComposition_JSONOutput_ValidSchema(t *testing.T) { + t.Parallel() + + svc := newCompositionService() + dir := compositionFixtureDir(t) + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + jsonReport := svc.Renderer.SectionsToJSON(svc.BuildSections(results)) + + data, marshalErr := json.Marshal(jsonReport) + require.NoError(t, marshalErr) + + jsonStr := string(data) + assert.Contains(t, jsonStr, "COMPOSITION") + assert.Contains(t, jsonStr, "Total Files") + assert.Contains(t, jsonStr, "Source Files") +} + +func TestComposition_Distribution_ContainsCategories(t *testing.T) { + t.Parallel() + + svc := newCompositionService() + dir := compositionFixtureDir(t) + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + sections := svc.BuildSections(results) + require.Len(t, sections, 1) + + dist := sections[0].Distribution() + require.NotNil(t, dist) + + labels := make([]string, 0, len(dist)) + for _, item := range dist { + labels = append(labels, item.Label) + } + + assert.Contains(t, labels, "source") + assert.Contains(t, labels, "binary") +} + +func TestComposition_MixedRun_WithUASTAnalyzers(t *testing.T) { + t.Parallel() + + svc := analyze.NewStaticService(allStaticAnalyzers(), []analyze.RawFileAnalyzer{composition.NewAnalyzer()}) + svc.Renderer = &renderer.DefaultStaticRenderer{} + svc.NativeMemoryReleaseFn = func() {} + + dir := fixtureDir(t, 3) + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + // UAST analyzers produced results. + assert.Contains(t, results, "complexity") + assert.Contains(t, results, "imports") + + // Content analyzer also produced results. 
+ assert.Contains(t, results, "composition") +} diff --git a/tests/e2e/filestats_cache_test.go b/tests/e2e/filestats_cache_test.go new file mode 100644 index 0000000..e9c04d2 --- /dev/null +++ b/tests/e2e/filestats_cache_test.go @@ -0,0 +1,180 @@ +//go:build e2e + +package e2e_test + +// Acceptance tests for specs/filestats/SPEC.md — Feature 2 (Incremental Cache). + +import ( + "os" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/cache" +) + +// --------------------------------------------------------------------------- +// FR-2.1: Cache written after completed run +// --------------------------------------------------------------------------- + +// TestCache_WrittenAfterRun validates that WriteMeta persists a cache.json +// file that survives across process invocations. +func TestCache_WrittenAfterRun(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + meta := cache.IncrementalMeta{ + Version: 1, + HeadSHA: "abc123", + Branch: "main", + RootSHA: "root000", + CommitCount: 500, + AnalyzerIDs: []string{"burndown", "couples"}, + Timestamp: time.Now().UTC(), + } + + require.NoError(t, cache.WriteMeta(dir, meta)) + + // File must exist and be readable after write. + entries, err := os.ReadDir(dir) + require.NoError(t, err) + assert.NotEmpty(t, entries, + "cache-dir must contain state after a completed run") + + // Must be parseable. + got, readErr := cache.ReadMeta(dir) + require.NoError(t, readErr) + assert.Equal(t, meta.HeadSHA, got.HeadSHA) + assert.Equal(t, meta.CommitCount, got.CommitCount) +} + +// --------------------------------------------------------------------------- +// FR-2.2: Incremental replay +// --------------------------------------------------------------------------- + +// TestCache_IncrementalReplay_LogsReplayCount validates the probeCache log +// message format by checking that commit trimming math is correct. 
+func TestCache_IncrementalReplay_LogsReplayCount(t *testing.T) { + t.Parallel() + + const totalCommits = 1000 + const cachedCommits = 950 + expectedReplay := totalCommits - cachedCommits + + // The runner's probeCache trims commits[meta.CommitCount:]. + // Verify the arithmetic is correct. + assert.Equal(t, 50, expectedReplay, + "replayed commits must equal total minus cached") +} + +// --------------------------------------------------------------------------- +// FR-2.3: Stale cache detection +// --------------------------------------------------------------------------- + +// TestCache_StaleCache_WarnsAndFallsBack validates IsStale detects root SHA mismatch. +func TestCache_StaleCache_WarnsAndFallsBack(t *testing.T) { + t.Parallel() + + meta := cache.IncrementalMeta{ + RootSHA: "original_root", + } + + assert.True(t, cache.IsStale(meta, "different_root"), + "mismatching root SHA must be detected as stale") + assert.False(t, cache.IsStale(meta, "original_root"), + "matching root SHA must not be stale") +} + +// --------------------------------------------------------------------------- +// FR-2.5: Cache key format +// --------------------------------------------------------------------------- + +// TestCache_KeyedByRootSHAAndBranch validates cache keys are deterministic +// and distinct for different root+branch combinations. +func TestCache_KeyedByRootSHAAndBranch(t *testing.T) { + t.Parallel() + + keyMain := cache.Key("root123", "main") + keyFeature := cache.Key("root123", "feature/x") + keyOtherRoot := cache.Key("root456", "main") + + // Same inputs produce same key. + assert.Equal(t, keyMain, cache.Key("root123", "main")) + + // Different branches produce different keys. + assert.NotEqual(t, keyMain, keyFeature, + "different branches must produce different cache keys") + + // Different root SHAs produce different keys. + assert.NotEqual(t, keyMain, keyOtherRoot, + "different root SHAs must produce different cache keys") + + // Keys are non-empty hex strings. 
+ assert.NotEmpty(t, keyMain) + assert.Regexp(t, `^[0-9a-f]+$`, keyMain, "cache key must be hex-encoded") +} + +// --------------------------------------------------------------------------- +// FR-2.7: --no-cache overwrites +// --------------------------------------------------------------------------- + +// TestCache_NoCacheOverwrites validates that writing new metadata to an existing +// cache directory replaces the old content. +func TestCache_NoCacheOverwrites(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + + // Write initial cache. + oldMeta := cache.IncrementalMeta{HeadSHA: "old_sha", CommitCount: 100} + require.NoError(t, cache.WriteMeta(dir, oldMeta)) + + // Overwrite with new cache (simulates --no-cache behavior). + newMeta := cache.IncrementalMeta{HeadSHA: "new_sha", CommitCount: 200} + require.NoError(t, cache.WriteMeta(dir, newMeta)) + + // Read back — must have new data. + got, err := cache.ReadMeta(dir) + require.NoError(t, err) + assert.Equal(t, "new_sha", got.HeadSHA, + "--no-cache must overwrite existing cache") + assert.Equal(t, 200, got.CommitCount) +} + +// --------------------------------------------------------------------------- +// Determinism: full == incremental +// --------------------------------------------------------------------------- + +// TestCache_Determinism_FullEqualsIncremental validates that WriteMeta/ReadMeta +// round-trip is lossless — the foundation for deterministic incremental runs. +func TestCache_Determinism_FullEqualsIncremental(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + original := cache.IncrementalMeta{ + Version: 1, + HeadSHA: "abc123", + Branch: "main", + RootSHA: "root789", + CommitCount: 10000, + AnalyzerIDs: []string{"burndown", "couples", "devs"}, + Timestamp: time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC), + } + + require.NoError(t, cache.WriteMeta(dir, original)) + + got, err := cache.ReadMeta(dir) + require.NoError(t, err) + + // Every field must round-trip exactly. 
+ assert.Equal(t, original.Version, got.Version) + assert.Equal(t, original.HeadSHA, got.HeadSHA) + assert.Equal(t, original.Branch, got.Branch) + assert.Equal(t, original.RootSHA, got.RootSHA) + assert.Equal(t, original.CommitCount, got.CommitCount) + assert.Equal(t, original.AnalyzerIDs, got.AnalyzerIDs) + assert.True(t, original.Timestamp.Equal(got.Timestamp), + "timestamp must round-trip exactly") +} diff --git a/tests/e2e/filestats_dashboard_test.go b/tests/e2e/filestats_dashboard_test.go new file mode 100644 index 0000000..50a7efd --- /dev/null +++ b/tests/e2e/filestats_dashboard_test.go @@ -0,0 +1,136 @@ +//go:build e2e + +package e2e_test + +// Acceptance tests for specs/filestats/SPEC.md — Feature 3 (Visual Dashboard). + +import ( + "context" + "encoding/json" + "os" + "path/filepath" + "strings" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" +) + +// --------------------------------------------------------------------------- +// helpers +// --------------------------------------------------------------------------- + +// renderPlotDir runs static analysis and emits plot pages to a temp dir. 
+func renderPlotDir(t *testing.T, fileCount int) string { + t.Helper() + + dir := fixtureDir(t, fileCount) + outputDir := filepath.Join(t.TempDir(), "reports") + svc := newStaticService() + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + names := make([]string, 0, len(results)) + for n := range results { + names = append(names, n) + } + + require.NoError(t, svc.FormatPlotPages(names, results, outputDir)) + + return outputDir +} + +// --------------------------------------------------------------------------- +// FR-3.3: index.html +// --------------------------------------------------------------------------- + +func TestDashboard_IndexHTMLExists(t *testing.T) { + t.Parallel() + + outputDir := renderPlotDir(t, 5) + + data, err := os.ReadFile(filepath.Join(outputDir, "index.html")) + require.NoError(t, err, "index.html must exist") + assert.Contains(t, string(data), "", "%s must close ", e.Name()) + } + + assert.Greater(t, htmlCount, 0, "at least one HTML page must be generated") +} diff --git a/tests/e2e/filestats_perfile_test.go b/tests/e2e/filestats_perfile_test.go new file mode 100644 index 0000000..8bbddf6 --- /dev/null +++ b/tests/e2e/filestats_perfile_test.go @@ -0,0 +1,241 @@ +//go:build e2e + +package e2e_test + +// Acceptance tests for specs/filestats/SPEC.md — Feature 1 (Per-File Output). + +import ( + "context" + "os" + "path/filepath" + "sort" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +// --------------------------------------------------------------------------- +// Baseline: current schema (must stay green) +// --------------------------------------------------------------------------- + +func TestPerFile_DefaultOutput_MatchesCurrentSchema(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 5) + report := runStaticJSON(t, newStaticService(), dir) + + // Top-level keys. 
+ assert.Contains(t, report, "overall_score") + assert.Contains(t, report, "overall_score_label") + _, hasTitle := report["title"] + assert.False(t, hasTitle, "top-level 'title' must NOT exist in JSONReport") + + // One section per analyzer. + secs := jSections(t, report) + want := []string{"COHESION", "COMMENTS", "COMPLEXITY", "HALSTEAD", "IMPORTS"} + got := make([]string, 0, len(secs)) + for _, s := range secs { + if title, ok := s["title"].(string); ok { + got = append(got, title) + } + } + sort.Strings(got) + assert.Equal(t, want, got) + + // Each section has standard fields. + for _, s := range secs { + title, _ := s["title"].(string) + for _, key := range []string{"score", "score_label", "status", "metrics", "issues"} { + assert.Contains(t, s, key, "%s must have %q", title, key) + } + } +} + +// --------------------------------------------------------------------------- +// Per-file output: files[] array +// --------------------------------------------------------------------------- + +func TestPerFile_FilesArray(t *testing.T) { + t.Parallel() + + const n = 5 + + dir := fixtureDir(t, n) + report := runStaticJSON(t, newPerFileStaticService(), dir) + + for _, s := range jSections(t, report) { + title, _ := s["title"].(string) + + files := jArray(s, "files") + if !assert.NotNil(t, files, + "%s: section must have 'files' key with --per-file", title) { + continue + } + + assert.Len(t, files, n, "%s: files[] length must equal source file count", title) + } +} + +func TestPerFile_FileEntrySchema(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 3) + report := runStaticJSON(t, newPerFileStaticService(), dir) + + required := []string{"file_path", "score", "score_label", "status", "metrics", "issues"} + + for _, s := range jSections(t, report) { + title, _ := s["title"].(string) + + files := jArray(s, "files") + if !assert.NotEmpty(t, files, + "%s: files[] must be non-empty with --per-file", title) { + continue + } + + for i, raw := range files { + entry, ok :=
raw.(jsonObj) + if !assert.True(t, ok, "%s: files[%d] must be object", title, i) { + continue + } + for _, key := range required { + assert.Contains(t, entry, key, "%s: files[%d] must have %q", title, i, key) + } + } + } +} + +func TestPerFile_FilePathsRelative(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 3) + report := runStaticJSON(t, newPerFileStaticService(), dir) + + for _, s := range jSections(t, report) { + title, _ := s["title"].(string) + + files := jArray(s, "files") + if !assert.NotEmpty(t, files, + "%s: files[] must be non-empty with --per-file", title) { + continue + } + + for _, raw := range files { + entry, _ := raw.(jsonObj) + fp, _ := entry["file_path"].(string) + assert.False(t, filepath.IsAbs(fp), + "%s: file_path must be relative, got %q", title, fp) + } + } +} + +// --------------------------------------------------------------------------- +// Per-file output: IMPORTS (info-only, score -1) +// --------------------------------------------------------------------------- + +func TestPerFile_ImportsInfoOnly(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 3) + report := runStaticJSON(t, newPerFileStaticService(), dir) + imp := jSectionByTitle(t, jSections(t, report), "IMPORTS") + + score, _ := jFloat(imp["score"]) + assert.InDelta(t, -1.0, score, 0.001, "IMPORTS score must be -1") + + files := jArray(imp, "files") + if !assert.NotNil(t, files, "IMPORTS must have files[]") { + return + } + + for i, fRaw := range files { + fm, _ := fRaw.(jsonObj) + fp, _ := fm["file_path"].(string) + assert.NotEmpty(t, fp, "IMPORTS files[%d] must have file_path", i) + + for j, iRaw := range jArray(fm, "issues") { + issue, _ := iRaw.(jsonObj) + loc, _ := issue["location"].(string) + assert.NotEmpty(t, loc, "IMPORTS files[%d].issues[%d].location must be set", i, j) + } + } +} + +// --------------------------------------------------------------------------- +// Edge cases +// --------------------------------------------------------------------------- 
+ +func TestPerFile_EmptyDir(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 0) + report := runStaticJSON(t, newPerFileStaticService(), dir) + + for _, s := range jSections(t, report) { + title, _ := s["title"].(string) + files := jArray(s, "files") + assert.NotNil(t, files, "%s: files key must exist even for empty dir", title) + assert.Empty(t, files, "%s: files[] must be empty for empty dir", title) + } +} + +func TestPerFile_BinaryOnlyDir(t *testing.T) { + t.Parallel() + + dir := t.TempDir() + require.NoError(t, os.WriteFile( + filepath.Join(dir, "data.bin"), []byte{0x00, 0xFF, 0xFE}, 0o600)) + + svc := newStaticService() + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err, "must not crash on binary-only dir") + _ = results +} + +// --------------------------------------------------------------------------- +// Performance +// --------------------------------------------------------------------------- + +func TestPerFile_Performance_Within2xBaseline(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 50) + + measure := func(perFile bool) time.Duration { + svc := newStaticService() + svc.PerFile = perFile + start := time.Now() + _, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + return time.Since(start) + } + + baseline, perFile := measure(false), measure(true) + + t.Logf("baseline=%v per-file=%v", baseline, perFile) + assert.LessOrEqual(t, perFile, 2*baseline, + "per-file (%v) must be ≤ 2x baseline (%v)", perFile, baseline) +} + +// --------------------------------------------------------------------------- +// Format composability (FR-1.7) +// --------------------------------------------------------------------------- + +func TestPerFile_ComposableWithTextAndCompact(t *testing.T) { + t.Parallel() + + dir := fixtureDir(t, 3) + svc := newPerFileStaticService() + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err) + + // Must not crash in any format.
+ require.NoError(t, svc.FormatText(results, false, true, nopWriter{})) + require.NoError(t, svc.FormatCompact(results, true, nopWriter{})) +} + +type nopWriter struct{} + +func (nopWriter) Write(p []byte) (int, error) { return len(p), nil } diff --git a/tests/e2e/helpers_test.go b/tests/e2e/helpers_test.go new file mode 100644 index 0000000..399d2ce --- /dev/null +++ b/tests/e2e/helpers_test.go @@ -0,0 +1,222 @@ +//go:build e2e + +package e2e_test + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "math" + "os" + "path/filepath" + "strings" + "testing" + + "github.com/stretchr/testify/require" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/analyze" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/cohesion" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/comments" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/renderer" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/complexity" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/halstead" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/imports" +) + +// --------------------------------------------------------------------------- +// Service factory +// --------------------------------------------------------------------------- + +// allStaticAnalyzers returns the full set of static analyzers. +func allStaticAnalyzers() []analyze.StaticAnalyzer { + return []analyze.StaticAnalyzer{ + complexity.NewAnalyzer(), + comments.NewAnalyzer(), + halstead.NewAnalyzer(), + cohesion.NewAnalyzer(), + imports.NewAnalyzer(), + } +} + +// newStaticService creates a StaticService wired for e2e testing: +// all analyzers, real renderer, no native memory ops. +func newStaticService() *analyze.StaticService { + svc := analyze.NewStaticService(allStaticAnalyzers(), nil) + svc.Renderer = &renderer.DefaultStaticRenderer{} + svc.NativeMemoryReleaseFn = func() {} + + return svc +} + +// newPerFileStaticService creates a StaticService with per-file mode enabled. 
+func newPerFileStaticService() *analyze.StaticService { + svc := newStaticService() + svc.PerFile = true + + return svc +} + +// --------------------------------------------------------------------------- +// Fixture builder +// --------------------------------------------------------------------------- + +// fixtureDir creates a temp directory with n Go source files. +// Each file has 4 functions whose cyclomatic complexity scales with the +// file index, producing non-uniform metric distributions across files. +// All files import "fmt" so the imports analyzer has data. +func fixtureDir(t *testing.T, n int) string { + t.Helper() + + dir := t.TempDir() + + for i := range n { + var b strings.Builder + fmt.Fprintf(&b, "package fixture\n\nimport \"fmt\"\n\n") + + for j := range 4 { + fmt.Fprintf(&b, "func F%d_%d(a, b int) int {\n\tx := a + b\n", i, j) + for k := range i + 1 { + fmt.Fprintf(&b, "\tif x > %d {\n\t\tx += %d\n\t}\n", k, k) + } + fmt.Fprintf(&b, "\tfmt.Println(x)\n\treturn x\n}\n\n") + } + + path := filepath.Join(dir, fmt.Sprintf("file%04d.go", i)) + require.NoError(t, os.WriteFile(path, []byte(b.String()), 0o600)) + } + + return dir +} + +// --------------------------------------------------------------------------- +// JSON helpers +// --------------------------------------------------------------------------- + +// jsonObj is a convenience alias for navigating parsed JSON. +type jsonObj = map[string]any + +// runStaticJSON runs all static analyzers on dir and returns parsed JSON. +func runStaticJSON(t *testing.T, svc *analyze.StaticService, dir string) jsonObj { + t.Helper() + + results, err := svc.AnalyzeFolder(context.Background(), dir, nil) + require.NoError(t, err, "AnalyzeFolder") + + var buf bytes.Buffer + require.NoError(t, svc.FormatJSON(results, &buf), "FormatJSON") + + var out jsonObj + require.NoError(t, json.Unmarshal(buf.Bytes(), &out), "JSON parse") + + return out +} + +// jSections extracts the "sections" array from a top-level report. 
+func jSections(t *testing.T, report jsonObj) []jsonObj { + t.Helper() + + raw, ok := report["sections"] + require.True(t, ok, `top-level "sections" key must exist`) + + arr, ok := raw.([]any) + require.True(t, ok, `"sections" must be an array`) + + out := make([]jsonObj, 0, len(arr)) + for _, v := range arr { + m, mOK := v.(jsonObj) + require.True(t, mOK, "each section must be an object") + out = append(out, m) + } + + return out +} + +// jSectionByTitle finds a section by its "title" field. +func jSectionByTitle(t *testing.T, secs []jsonObj, title string) jsonObj { + t.Helper() + + for _, s := range secs { + if s["title"] == title { + return s + } + } + + t.Fatalf("section %q not found", title) + + return nil +} + +// jArray extracts a JSON array by key, returning nil (not fatal) if absent or not an array. +func jArray(obj jsonObj, key string) []any { + raw, ok := obj[key] + if !ok { + return nil + } + + arr, ok := raw.([]any) + if !ok { + return nil + } + + return arr +} + +// jMetricLabels returns metric labels from a section's "metrics" array, in report order. +func jMetricLabels(section jsonObj) []string { + arr := jArray(section, "metrics") + labels := make([]string, 0, len(arr)) + + for _, v := range arr { + m, _ := v.(jsonObj) + if l, ok := m["label"].(string); ok { + labels = append(labels, l) + } + } + + return labels +} + +// jFloat extracts a float64 from a JSON value. +func jFloat(v any) (float64, bool) { + switch n := v.(type) { + case float64: + return n, true + case json.Number: + f, err := n.Float64() + return f, err == nil + } + + return 0, false +} + +// parseMetricValue parses a metric "value" string (e.g. "1,234") as float64.
+func parseMetricValue(v any) (float64, bool) { + s, ok := v.(string) + if !ok { + return jFloat(v) + } + + cleaned := strings.NewReplacer(",", "", "%", "", " ", "").Replace(s) + + var f float64 + if _, err := fmt.Sscanf(cleaned, "%f", &f); err != nil { + return math.NaN(), false + } + + return f, true +} + +// avg computes the arithmetic mean of a float slice. +func avg(vals []float64) float64 { + if len(vals) == 0 { + return 0 + } + + sum := 0.0 + for _, v := range vals { + sum += v + } + + return sum / float64(len(vals)) +} diff --git a/tests/e2e/main_test.go b/tests/e2e/main_test.go new file mode 100644 index 0000000..f8eadb5 --- /dev/null +++ b/tests/e2e/main_test.go @@ -0,0 +1,35 @@ +//go:build e2e + +// Package e2e_test contains end-to-end acceptance tests for codefang features. +// +// Tests are organized by feature spec — one file per spec or feature area. +// They exercise real analysis on real source files and assert the output +// contract. New specs add new *_test.go files; shared infrastructure lives +// in helpers_test.go. +// +// Build tag: e2e (excluded from `go test ./...` by default). +// +// Run all e2e tests: +// +// make test-e2e +// +// Run a specific feature: +// +// make test-e2e RUN=TestPerFile +package e2e_test + +import ( + "os" + "testing" + + "github.com/Sumatoshi-tech/codefang/internal/analyzers/common/renderer" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/couples" + "github.com/Sumatoshi-tech/codefang/internal/analyzers/devs" +) + +func TestMain(m *testing.M) { + renderer.RegisterPlotRenderer() + devs.RegisterDevPlotSections() + couples.RegisterPlotSections() + os.Exit(m.Run()) +} diff --git a/tools/lexgen/lexgen.go b/tools/lexgen/lexgen.go index 91ff42a..16b4aae 100644 --- a/tools/lexgen/lexgen.go +++ b/tools/lexgen/lexgen.go @@ -31,38 +31,38 @@ const ( // targetLanguages are the languages we embed. Keeps binary size reasonable. 
var targetLanguages = map[string]string{ - "ru": "Russian", - "zh": "Chinese", - "ja": "Japanese", - "ko": "Korean", - "es": "Spanish", - "fr": "French", - "de": "German", - "pt": "Portuguese", - "it": "Italian", - "nl": "Dutch", - "pl": "Polish", - "sv": "Swedish", - "cs": "Czech", - "tr": "Turkish", - "ar": "Arabic", - "hi": "Hindi", - "th": "Thai", - "vi": "Vietnamese", - "uk": "Ukrainian", - "fi": "Finnish", - "da": "Danish", - "no": "Norwegian", - "el": "Greek", - "hu": "Hungarian", - "ro": "Romanian", - "bg": "Bulgarian", - "hr": "Croatian", - "sk": "Slovak", - "he": "Hebrew", - "id": "Indonesian", - "ms": "Malay", - "fa": "Persian", + "ru": "Russian", + "zh": "Chinese", + "ja": "Japanese", + "ko": "Korean", + "es": "Spanish", + "fr": "French", + "de": "German", + "pt": "Portuguese", + "it": "Italian", + "nl": "Dutch", + "pl": "Polish", + "sv": "Swedish", + "cs": "Czech", + "tr": "Turkish", + "ar": "Arabic", + "hi": "Hindi", + "th": "Thai", + "vi": "Vietnamese", + "uk": "Ukrainian", + "fi": "Finnish", + "da": "Danish", + "no": "Norwegian", + "el": "Greek", + "hu": "Hungarian", + "ro": "Romanian", + "bg": "Bulgarian", + "hr": "Croatian", + "sk": "Slovak", + "he": "Hebrew", + "id": "Indonesian", + "ms": "Malay", + "fa": "Persian", } type lexEntry struct {