Feature/filestats #18

Merged
dmytrogajewski merged 13 commits into main from feature/filestats
May 13, 2026

@dmytrogajewski commented Apr 3, 2026

Analytics Readiness & DWH Suitability

Motivation: A comprehensive data analyst review of Codefang's JSON output revealed that while the data was analytically rich (17 analyzers, 1M+ function-level rows, time-series, coupling data), it was structurally hostile to analytics tooling and DWH loading. Function records had bare names with no file paths, time-series ticks had no calendar dates, developer identities used pipe-delimited strings, and nested maps blocked efficient columnar ingestion. This release systematically fixes every identified blocker, raising the data quality score from 2.1/5 to 4.6/5.

Architecture: Pipeline Stage Refactor

RawFileAnalyzer and FormattableAnalyzer interfaces

Replaced the FileContentAnalyzer + WalksAllFiles marker interface pattern with a proper pipeline stage architecture.

Before: Analyzers that needed raw file access (not UAST) had to implement StaticAnalyzer with a no-op Analyze(*node.Node), plus two marker interfaces discovered at runtime via type assertions.

After: Two clean interface hierarchies — StaticAnalyzer for UAST-based analysis and RawFileAnalyzer for raw file analysis — both embed a shared FormattableAnalyzer base. StaticService holds separate slices. AnalyzeFolder uses pipeline.RunPhases with explicit rawFilePhase and uastPhase stages.
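A rough sketch of how the two hierarchies might fit together. Only the interface names, the no-op Analyze that was removed, and the two slices on StaticService come from this description; every method signature, the Report type, and the two demo analyzers below are assumptions made for illustration.

```go
package main

import "fmt"

// Node stands in for the UAST node type (node.Node in the PR text).
type Node struct{ Kind string }

// Report is a hypothetical result type used only in this sketch.
type Report map[string]any

// FormattableAnalyzer is the shared base. Its method set is assumed.
type FormattableAnalyzer interface {
	Name() string
}

// StaticAnalyzer consumes parsed UAST trees.
type StaticAnalyzer interface {
	FormattableAnalyzer
	Analyze(root *Node) (Report, error)
}

// RawFileAnalyzer consumes raw file bytes, so it needs no no-op Analyze.
type RawFileAnalyzer interface {
	FormattableAnalyzer
	AnalyzeFile(path string, content []byte) (Report, error)
}

// StaticService keeps one slice per stage instead of discovering raw-file
// analyzers through runtime type assertions.
type StaticService struct {
	UASTAnalyzers    []StaticAnalyzer
	RawFileAnalyzers []RawFileAnalyzer
}

// Formattable flattens both slices to the shared base type,
// raw-file analyzers first.
func (s StaticService) Formattable() []FormattableAnalyzer {
	out := make([]FormattableAnalyzer, 0, len(s.UASTAnalyzers)+len(s.RawFileAnalyzers))
	for _, a := range s.RawFileAnalyzers {
		out = append(out, a)
	}
	for _, a := range s.UASTAnalyzers {
		out = append(out, a)
	}
	return out
}

// uastDemo and rawDemo are hypothetical analyzers for the example.
type uastDemo struct{}

func (uastDemo) Name() string { return "uast-demo" }
func (uastDemo) Analyze(root *Node) (Report, error) {
	return Report{"kind": root.Kind}, nil
}

type rawDemo struct{}

func (rawDemo) Name() string { return "raw-demo" }
func (rawDemo) AnalyzeFile(path string, content []byte) (Report, error) {
	return Report{"path": path, "bytes": len(content)}, nil
}

// newDemoService wires one analyzer into each stage.
func newDemoService() StaticService {
	return StaticService{
		UASTAnalyzers:    []StaticAnalyzer{uastDemo{}},
		RawFileAnalyzers: []RawFileAnalyzer{rawDemo{}},
	}
}

func main() {
	for _, a := range newDemoService().Formattable() {
		fmt.Println(a.Name())
	}
}
```

Because both interfaces embed the same base, code that only formats or enriches output can take []FormattableAnalyzer and stay unaware of which stage produced the data.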

Why it matters for BI: The pipeline refactor enabled StampSourceFile to receive rootPath and convert all file paths to relative — a prerequisite for portable DWH data. It also enabled StampLanguage to inject detected language into every function record.

Files changed:

  • internal/analyzers/analyze/analyzer.go: new FormattableAnalyzer, RawFileAnalyzer interfaces; StaticAnalyzer refactored to embed FormattableAnalyzer
  • internal/analyzers/analyze/static.go: StaticService gains UASTAnalyzers + RawFileAnalyzers slices; AnalyzeFolder uses pipeline.RunPhases
  • internal/analyzers/composition/analyzer.go: implements RawFileAnalyzer directly (removed no-op Analyze, NeedsAllFiles)
  • internal/analyzers/analyze/registry.go: NewRegistry accepts three slices
  • cmd/codefang/commands/run.go: split defaultStaticAnalyzers into defaultUASTAnalyzers + defaultRawFileAnalyzers
  • internal/analyzers/analyze/perfile.go: PerFileEnricher uses []FormattableAnalyzer
  • internal/analyzers/common/renderer/json.go: EnrichWithPerFileData uses []FormattableAnalyzer

Static Analyzers: New Fields on Every Function Record

source_file — File path on every function record

Motivation: 152,000+ function records in the JSON output had bare names like "ForKind" with no indication of which file they belonged to. This made it impossible to join function metrics to file-level data, build file heatmaps, or drill down from "bad function" to "where in the repo."

Root cause: The _source_file stamping mechanism existed and worked through aggregation, but FormatReportBinary called ComputeAllMetrics which parsed []map[string]any items into typed structs. Those structs had no SourceFile field, silently dropping the value during struct conversion.

Fix: Added SourceFile string to all input FunctionData and output data structs (FunctionComplexityData, FunctionHalsteadData, FunctionCohesionData, all comment data structs, HighRiskFunctionData, HighEffortFunctionData, LowCohesionFunctionData, UndocumentedFunctionData). Populated from _source_file map key during parseFunctionDataCompute(). Updated StampSourceFile to accept rootPath and convert to relative via MakeRelativePath.

JSON output key: "source_file" (relative path, e.g., "pkg/kubelet/kubelet.go")
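A minimal sketch of the stamping and parsing flow, assuming simplified shapes. The "_source_file" map key and the "source_file" JSON key come from the text above; the FunctionData layout and the helper names makeRelative, stampSourceFile, and parseFunction are stand-ins, not the project's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
)

// FunctionData is a trimmed-down typed record. Before the fix, structs like
// this had no SourceFile field, so the stamped value was silently dropped.
type FunctionData struct {
	Name       string `json:"name"`
	SourceFile string `json:"source_file"`
}

// makeRelative is a hypothetical stand-in for MakeRelativePath: it returns
// absPath relative to rootPath with forward slashes, or absPath unchanged
// when no relative form exists.
func makeRelative(rootPath, absPath string) string {
	rel, err := filepath.Rel(rootPath, absPath)
	if err != nil {
		return absPath
	}
	return filepath.ToSlash(rel)
}

// stampSourceFile writes the relative path under the "_source_file" key.
func stampSourceFile(item map[string]any, rootPath, absPath string) {
	item["_source_file"] = makeRelative(rootPath, absPath)
}

// parseFunction converts the untyped item into the typed struct and carries
// the stamped path across the conversion.
func parseFunction(item map[string]any) FunctionData {
	var fd FunctionData
	if name, ok := item["name"].(string); ok {
		fd.Name = name
	}
	if src, ok := item["_source_file"].(string); ok {
		fd.SourceFile = src
	}
	return fd
}

func main() {
	root := "/home/user/sources/kubernetes"
	item := map[string]any{"name": "ForKind"}
	stampSourceFile(item, root, root+"/pkg/kubelet/kubelet.go")

	out, err := json.Marshal(parseFunction(item))
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```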

Analyzers affected: static/complexity, static/halstead, static/cohesion, static/comments

language — Programming language on every function record

Motivation: Analysts had to infer language from file extension at query time. The parser already knows the language.

Fix: Added LanguageKey constant, StampLanguage() function, and Language field to TypedCollection struct. Language is stamped in analyzeFilesParallel via parser.GetLanguage(filePath) and propagated through TypedCollection → DetailedDataCollector.buildItems() → stampCollectionMetadata() to reach the output structs.

JSON output key: "language" (e.g., "go", "bash")

Analyzers affected: static/complexity, static/halstead, static/cohesion, static/comments

directory — Parent directory on every function record

Motivation: Directory-level aggregation (e.g., "which package has the worst complexity") requires parsing file paths at query time, which is expensive in a columnar DWH.

Fix: Added DirectoryKey constant and Directory field to TypedCollection. Stamped as filepath.Dir(relativePath) inside StampSourceFile. Propagated via stampCollectionMetadata() alongside language.

JSON output key: "directory" (e.g., "pkg/kubelet")
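A small sketch of what the precomputed directory buys at query time, assuming a simplified record type. The record struct and both helpers are hypothetical; the text above names filepath.Dir, while this sketch uses path.Dir on slash-separated relative paths so it behaves the same on every OS.

```go
package main

import (
	"fmt"
	"path"
)

// record is a trimmed-down function record for this example.
type record struct {
	SourceFile string
	Directory  string
	Complexity int
}

// stampDirectory fills Directory from the relative source path.
// Root-level files yield "." as their directory.
func stampDirectory(r record) record {
	r.Directory = path.Dir(r.SourceFile)
	return r
}

// worstByDirectory returns the highest complexity seen per directory,
// the kind of aggregation that no longer needs path parsing downstream.
func worstByDirectory(recs []record) map[string]int {
	worst := make(map[string]int)
	for _, r := range recs {
		r = stampDirectory(r)
		if r.Complexity > worst[r.Directory] {
			worst[r.Directory] = r.Complexity
		}
	}
	return worst
}

func main() {
	recs := []record{
		{SourceFile: "pkg/kubelet/kubelet.go", Complexity: 12},
		{SourceFile: "pkg/kubelet/pods.go", Complexity: 30},
		{SourceFile: "main.go", Complexity: 2},
	}
	fmt.Println(worstByDirectory(recs))
}
```

Consumers grouping on this field may want to handle the "." value that root-level files produce.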

Analyzers affected: static/complexity, static/halstead, static/cohesion, static/comments


History Analyzers: Tick Timestamps

start_time / end_time on every time-series tick

Motivation: All 6 history time-series analyzers emitted tick: <int> with no calendar date. Every time-series chart had an unlabeled X-axis. The TICK struct already carried StartTime/EndTime internally but didn't export them.

Fix: Created TickBounds type and BuildTickBounds(ticks []TICK) helper. Each analyzer's ticksToReport adds tick_bounds to the Report. Each ParseReportData reads it. Each time-series output struct gains StartTime/EndTime string fields (RFC 3339). For quality and devs analyzers, added timestamp tracking to their tick accumulators (tickAccumulator.startTime/endTime, TickDevData.startTime/endTime) with min/max tracking in extractTC and population in buildTick.

JSON output keys: "start_time", "end_time" (RFC 3339, e.g., "2024-01-15T10:30:00Z")
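A hedged sketch of the two pieces described above: min/max tracking while commits are accumulated, and conversion of tick bounds to RFC 3339 strings. The Tick struct is a simplified stand-in for TICK, and the TickBounds layout, the map return type, and the helper names are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// Tick is a simplified stand-in for the TICK struct, which already carries
// StartTime and EndTime internally according to the PR text.
type Tick struct {
	Tick      int
	StartTime time.Time
	EndTime   time.Time
}

// TickBounds holds formatted bounds for JSON output (layout assumed).
type TickBounds struct {
	StartTime string `json:"start_time"`
	EndTime   string `json:"end_time"`
}

// widen is the min/max tracking an accumulator would do for each commit.
func widen(start, end, commit time.Time) (time.Time, time.Time) {
	if start.IsZero() || commit.Before(start) {
		start = commit
	}
	if end.IsZero() || commit.After(end) {
		end = commit
	}
	return start, end
}

// buildTickBounds maps each tick index to its RFC 3339 bounds in UTC.
func buildTickBounds(ticks []Tick) map[int]TickBounds {
	out := make(map[int]TickBounds, len(ticks))
	for _, t := range ticks {
		out[t.Tick] = TickBounds{
			StartTime: t.StartTime.UTC().Format(time.RFC3339),
			EndTime:   t.EndTime.UTC().Format(time.RFC3339),
		}
	}
	return out
}

// exampleTicks builds two ticks; the first is widened from commits that
// arrive out of chronological order.
func exampleTicks() []Tick {
	day := time.Date(2024, 1, 15, 10, 30, 0, 0, time.UTC)
	var t0 Tick
	for _, c := range []time.Time{day.Add(2 * time.Hour), day, day.Add(5 * time.Hour)} {
		t0.StartTime, t0.EndTime = widen(t0.StartTime, t0.EndTime, c)
	}
	t1 := Tick{Tick: 1, StartTime: day.Add(24 * time.Hour), EndTime: day.Add(30 * time.Hour)}
	return []Tick{t0, t1}
}

func main() {
	bounds := buildTickBounds(exampleTicks())
	for i := 0; i < len(bounds); i++ {
		fmt.Printf("tick %d: %s .. %s\n", i, bounds[i].StartTime, bounds[i].EndTime)
	}
}
```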

Analyzers affected: history/sentiment, history/anomaly, history/quality, history/devs (activity + churn), history/file-history (composition_ts)


Developer Identity Normalization

Split pipe-delimited names into name + email

Motivation: Developer identity used "daniel smith|dbsmith@google.com" pipe-delimited strings from ReversedPeopleDict. This blocked clean dimension table creation in DWH systems.

Fix: Created SplitIdentity(s string) (name, email string) in internal/identity/split.go. Handles pipe-delimited, exact "name <email>", and plain name formats. Updated devName() → devNameAndEmail() and getDevName() → getDevNameAndEmail().
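One possible shape for the three cases, written as a sketch rather than a copy of internal/identity/split.go. Only the signature and the list of handled formats come from the text above; the precedence order and whitespace trimming are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitIdentity separates a developer identity into name and email.
// It tries the pipe form first, then the "name <email>" form, and falls
// back to treating the whole string as a name.
func splitIdentity(s string) (name, email string) {
	s = strings.TrimSpace(s)

	// Pipe-delimited: "name|email".
	if n, e, ok := strings.Cut(s, "|"); ok {
		return strings.TrimSpace(n), strings.TrimSpace(e)
	}

	// Angle-bracket form: "name <email>".
	if open := strings.LastIndex(s, "<"); open >= 0 && strings.HasSuffix(s, ">") {
		return strings.TrimSpace(s[:open]), s[open+1 : len(s)-1]
	}

	// Plain name, no email available.
	return s, ""
}

func main() {
	for _, in := range []string{
		"daniel smith|dbsmith@google.com",
		"Jane Doe <jane@example.com>",
		"alice",
	} {
		name, email := splitIdentity(in)
		fmt.Printf("%q -> name=%q email=%q\n", in, name, email)
	}
}
```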

Fields added:

  • DeveloperData: email field
  • BusFactorData: primary_dev_email, secondary_dev_email
  • DeveloperCouplingData: developer1_email, developer2_email

Analyzers affected: history/devs, history/couples


Output Structure: Flattened Arrays

developers[].languages — map → array

Motivation: map[string]LineStats with variable language-name keys cannot be UNNEST'd in a columnar DWH without custom ETL.

Fix: Changed DeveloperData.Languages from map[string]pkgplumbing.LineStats to []LanguageStatsEntry. Internal accumulation uses unexported langMap, converted to sorted array via finalizeLanguages(). Empty language strings replaced with "Other".

Before: {"Go": {"added": 100, "removed": 5, "changed": 3}}
After: [{"language": "Go", "added": 100, "removed": 5, "changed": 3}]
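A sketch of the map-to-array conversion, assuming simplified types. The names LanguageStatsEntry and finalizeLanguages and the "Other" replacement come from the text above; sorting by language name and merging an empty key into an existing "Other" bucket are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// LineStats is a simplified stand-in for pkgplumbing.LineStats.
type LineStats struct {
	Added, Removed, Changed int
}

// LanguageStatsEntry is one row of the flattened output.
type LanguageStatsEntry struct {
	Language string `json:"language"`
	Added    int    `json:"added"`
	Removed  int    `json:"removed"`
	Changed  int    `json:"changed"`
}

// finalizeLanguages converts the internal accumulation map into a sorted
// slice, folding the empty language key into "Other".
func finalizeLanguages(langMap map[string]LineStats) []LanguageStatsEntry {
	merged := make(map[string]LineStats, len(langMap))
	for lang, st := range langMap {
		if lang == "" {
			lang = "Other"
		}
		cur := merged[lang]
		cur.Added += st.Added
		cur.Removed += st.Removed
		cur.Changed += st.Changed
		merged[lang] = cur
	}

	out := make([]LanguageStatsEntry, 0, len(merged))
	for lang, st := range merged {
		out = append(out, LanguageStatsEntry{
			Language: lang, Added: st.Added, Removed: st.Removed, Changed: st.Changed,
		})
	}
	// Map iteration order is random, so sort for deterministic output.
	sort.Slice(out, func(i, j int) bool { return out[i].Language < out[j].Language })
	return out
}

func main() {
	entries := finalizeLanguages(map[string]LineStats{
		"Go": {Added: 100, Removed: 5, Changed: 3},
		"":   {Added: 1},
	})
	fmt.Printf("%+v\n", entries)
}
```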

activity[].by_developer — map → array

Motivation: map[int]int (dev_id → commit_count) serializes to JSON with string keys, blocking typed ingestion.

Fix: Changed to []DeveloperCommits with {dev_id, commits} fields. Sorted by dev_id for deterministic output.

Before: {"2": 5, "3": 3}
After: [{"dev_id": 2, "commits": 5}, {"dev_id": 3, "commits": 3}]
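The string-key behaviour is easy to reproduce with encoding/json, which always encodes map keys as JSON strings. In the sketch below, DeveloperCommits and its JSON field names follow the text above, while flattenByDeveloper and encode are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// DeveloperCommits is one element of the flattened by_developer array.
type DeveloperCommits struct {
	DevID   int `json:"dev_id"`
	Commits int `json:"commits"`
}

// flattenByDeveloper converts dev_id -> commit_count into a slice sorted
// by dev_id so the output is deterministic.
func flattenByDeveloper(m map[int]int) []DeveloperCommits {
	out := make([]DeveloperCommits, 0, len(m))
	for id, n := range m {
		out = append(out, DeveloperCommits{DevID: id, Commits: n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].DevID < out[j].DevID })
	return out
}

// encode marshals v to a JSON string, returning the error text on failure.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	byDev := map[int]int{3: 3, 2: 5}
	// Integer keys become JSON strings: a typed loader sees "2", not 2.
	fmt.Println(encode(byDev))
	// The array form keeps dev_id as a JSON number.
	fmt.Println(encode(flattenByDeveloper(byDev)))
}
```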

file_contributors[].contributors — map → array

Motivation: map[int]LineStats blocked DWH UNNEST.

Fix: Changed to []ContributorEntry with {dev_id, added, removed, changed} fields. Sorted by dev_id.

Before: {"2": {"added": 42, "removed": 5, "changed": 3}}
After: [{"dev_id": 2, "added": 42, "removed": 5, "changed": 3}]


Output Envelope

Top-level metadata section

Motivation: A DWH ingesting reports from multiple repos could not distinguish them. No repo name, analysis timestamp, or version.

Fix: Added AnalysisMetadata struct with repo_path, repo_name (from filepath.Base), analyzed_at (RFC 3339), codefang_version (from build ldflags). Injected after DecodeCombinedBinaryReports in the combined render path.

{
  "version": "codefang.run.v1",
  "metadata": {
    "repo_path": "/home/user/sources/kubernetes",
    "repo_name": "kubernetes",
    "analyzed_at": "2026-04-07T23:33:00Z",
    "codefang_version": "dev"
  },
  "analyzers": [...]
}
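A sketch of how such a metadata block could be assembled. The four JSON keys, the use of filepath.Base, the RFC 3339 timestamp, and the ldflags-injected version are from the text above; the Go field names, the constructor signature, and the fixedClock helper are assumptions (the real constructor is NewAnalysisMetadata).

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
	"time"
)

// version would normally be overridden at build time, for example with
// go build -ldflags "-X main.version=<value>". "dev" is the fallback.
var version = "dev"

// AnalysisMetadata mirrors the "metadata" object in the output envelope.
type AnalysisMetadata struct {
	RepoPath        string `json:"repo_path"`
	RepoName        string `json:"repo_name"`
	AnalyzedAt      string `json:"analyzed_at"`
	CodefangVersion string `json:"codefang_version"`
}

// newAnalysisMetadata derives the repo name from the last path element and
// formats the timestamp as RFC 3339 in UTC. The clock is passed in so
// callers and tests control it.
func newAnalysisMetadata(repoPath string, now time.Time) AnalysisMetadata {
	return AnalysisMetadata{
		RepoPath:        repoPath,
		RepoName:        filepath.Base(repoPath),
		AnalyzedAt:      now.UTC().Format(time.RFC3339),
		CodefangVersion: version,
	}
}

// fixedClock keeps the example output stable.
func fixedClock() time.Time {
	return time.Date(2026, 4, 7, 23, 33, 0, 0, time.UTC)
}

func main() {
	meta := newAnalysisMetadata("/home/user/sources/kubernetes", fixedClock())
	out, err := json.MarshalIndent(meta, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```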

Per-analyzer schema manifest

Motivation: DWH consumers need to know field types, grain, and cardinality for automated ETL generation.

Fix: Added FieldMeta struct with {type, grain, description} and static analyzerSchemas registry covering all 17 analyzers. Each AnalyzerResult in the output includes a schema field.

{
  "id": "static/complexity",
  "schema": {
    "function_complexity": {
      "type": "list",
      "grain": "function",
      "description": "Per-function cyclomatic and cognitive complexity"
    }
  },
  "report": {...}
}

NDJSON output format

Motivation: The monolithic JSON (467MB for kubernetes) must be fully parsed to extract any single analyzer. NDJSON enables streaming ingestion into ClickHouse.

Fix: Added FormatNDJSON case to WriteConvertedOutput. One JSON line per analyzer result, with optional metadata line prepended.

codefang run --format ndjson /repo > output.ndjson
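On the consuming side, a streaming reader can pick out one analyzer without holding the whole file in memory. The sketch below assumes each line carries the "id" and "report" keys shown in the schema example above and that a metadata line simply lacks an "id"; the sample payloads and helper names are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// record is a minimal view of one NDJSON line. Report stays raw so only
// the analyzer of interest needs a full decode later.
type record struct {
	ID     string          `json:"id"`
	Report json.RawMessage `json:"report"`
}

// reportFor scans the stream one JSON value at a time and returns the raw
// report of the first record whose id matches.
func reportFor(r io.Reader, analyzerID string) (json.RawMessage, error) {
	dec := json.NewDecoder(r)
	for {
		var rec record
		err := dec.Decode(&rec)
		if err == io.EOF {
			return nil, fmt.Errorf("analyzer %q not found", analyzerID)
		}
		if err != nil {
			return nil, err
		}
		if rec.ID == analyzerID {
			return rec.Report, nil
		}
	}
}

// reportForText is a convenience wrapper for in-memory input.
func reportForText(text, analyzerID string) (string, error) {
	raw, err := reportFor(strings.NewReader(text), analyzerID)
	return string(raw), err
}

// sample is hypothetical NDJSON: a metadata line followed by two results.
const sample = `{"metadata":{"repo_name":"kubernetes"}}
{"id":"static/complexity","report":{"total_functions":3}}
{"id":"history/devs","report":{"developers":[]}}
`

func main() {
	report, err := reportForText(sample, "history/devs")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(report)
}
```

In real use the reader would be an open file or a pipe from the command above rather than an in-memory string.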

Clone Analysis

clone_type_distribution from full population

Motivation: Clone pairs are capped at 1,000 in the output, but the distribution metrics (Type-1/2/3 breakdown) were computed from the capped sample, skewing percentages for large codebases with 22M+ total pairs.

Fix: Added typeDistribution cloneTypeCounts to clonePairResult. matchCandidates increments per-type counters for ALL valid pairs before the cap check. Both aggregator and per-file paths emit clone_type_distribution in the report. ReportSection.Distribution() reads from the full-population distribution.

Before: Distribution from 1,000 capped pairs
After: Distribution from 22,381,694 total pairs: {"Type-1": 12366266, "Type-2": 3307147, "Type-3": 6708281}
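The ordering that matters here is "count first, cap second". A minimal sketch under simplified types: pair, result, and collect are hypothetical names, and the cap is passed in as a parameter rather than fixed at 1,000.

```go
package main

import "fmt"

// pair is a simplified clone pair with its detected type.
type pair struct {
	A, B string
	Type string // "Type-1", "Type-2", or "Type-3"
}

// result holds the capped sample plus the full-population distribution.
type result struct {
	Pairs            []pair
	TypeDistribution map[string]int
}

// collect counts every valid pair by type before applying the cap, so the
// distribution reflects the whole population and not just the sample.
func collect(candidates []pair, maxPairs int) result {
	res := result{TypeDistribution: make(map[string]int)}
	for _, p := range candidates {
		res.TypeDistribution[p.Type]++ // always counted
		if len(res.Pairs) < maxPairs { // only the emitted sample is capped
			res.Pairs = append(res.Pairs, p)
		}
	}
	return res
}

// exampleCandidates returns five pairs: three Type-1, one Type-2, one Type-3.
func exampleCandidates() []pair {
	return []pair{
		{"a", "b", "Type-1"},
		{"a", "c", "Type-1"},
		{"b", "c", "Type-1"},
		{"d", "e", "Type-2"},
		{"f", "g", "Type-3"},
	}
}

func main() {
	res := collect(exampleCandidates(), 2)
	fmt.Println(len(res.Pairs), res.TypeDistribution)
}
```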

Relative paths in clone pairs

Clone pair func_a / func_b paths changed from absolute (/home/user/sources/repo/file.go::funcName) to relative (cmd/controller/app.go::newController). Enabled by the StampSourceFile rootPath change.


New Files Created

| File | Purpose |
| --- | --- |
| internal/analyzers/analyze/tick_bounds.go | TickBounds type + BuildTickBounds helper |
| internal/analyzers/analyze/metadata.go | AnalysisMetadata struct + NewAnalysisMetadata constructor |
| internal/analyzers/analyze/schema_registry.go | Static schema registry for all 17 analyzers |
| internal/identity/split.go | SplitIdentity(s string) (name, email string) |

Empty Analyzer Root Causes (Documented)

Investigation of 4 analyzers that returned empty data on kubernetes (1000 commits):

| Analyzer | Root Cause | Resolution |
| --- | --- | --- |
| burndown.developer_survival | Disabled by default (Burndown.TrackPeople: false) | Enable via config |
| burndown.file_survival | Disabled by default (Burndown.TrackFiles: false) | Enable via config |
| history/imports | Requires UAST-enabled pipeline mode (NeedsUAST() = true) | Architectural dependency |
| history/typos | Requires UAST-enabled pipeline mode (NeedsUAST() = true) | Architectural dependency |

drwatsno and others added 9 commits April 3, 2026 00:24
…ion noise

Three bugs fixed in clone detection:

1. Clone ratio was pairs/functions (unbounded) instead of pairs/maxPairs
   where maxPairs=N*(N-1)/2. Now always [0,1].

2. Methods with the same name on different receivers (e.g. Foo.DeepCopyInto,
   Bar.DeepCopyInto) collided in the LSH index — second insert overwrote
   the first. Now qualifies method names with receiver type.

3. Trivial one-liner functions (getters, setters, return-nil stubs) produced
   massive false positives. Added minFunctionNodes=20 threshold to skip
   functions with too few AST nodes for meaningful similarity comparison.

Includes fixture-based tests with real Kubernetes-derived code patterns
(RBAC validation, event handlers, deepcopy) to validate detection quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…duplication

The old ratio (pairs/maxPairs) was meaningless at scale — 22M pairs across
153K functions in Kubernetes produced 0.0019, displayed as "0.0" with score
10/10 despite massive duplication.

New ratio: distinct functions in at least one clone pair / total functions.
This answers "what % of your codebase participates in duplication" — the
same metric humans understand and industry tools report.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
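The two ratio definitions from the commits above can be compared side by side. This is a sketch with hypothetical function and type names; the Kubernetes figures in main are the ones quoted in this PR, with the function count rounded to 153,000.

```go
package main

import "fmt"

// pairRatio is the earlier metric: pairs / maxPairs, with maxPairs = N*(N-1)/2.
// It is bounded to [0,1] but shrinks toward zero as N grows.
func pairRatio(pairs, functions int64) float64 {
	if functions < 2 {
		return 0
	}
	maxPairs := functions * (functions - 1) / 2
	return float64(pairs) / float64(maxPairs)
}

// clonePair names the two functions in a detected clone pair.
type clonePair struct{ A, B string }

// participationRatio is the later metric: distinct functions that appear in
// at least one clone pair, divided by total functions.
func participationRatio(pairs []clonePair, totalFunctions int) float64 {
	if totalFunctions == 0 {
		return 0
	}
	seen := make(map[string]struct{})
	for _, p := range pairs {
		seen[p.A] = struct{}{}
		seen[p.B] = struct{}{}
	}
	return float64(len(seen)) / float64(totalFunctions)
}

func main() {
	// About 22M pairs across roughly 153K functions, as quoted above.
	fmt.Printf("pair ratio: %.4f\n", pairRatio(22381694, 153000))

	// Three of four functions take part in some clone pair.
	pairs := []clonePair{{"a", "b"}, {"a", "c"}}
	fmt.Printf("participation: %.2f\n", participationRatio(pairs, 4))
}
```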
…ents

Pipeline architecture refactor:
- Replace marker interfaces (FileContentAnalyzer, WalksAllFiles) with
  first-class RawFileAnalyzer and FormattableAnalyzer pipeline stages
- StaticService uses pipeline.RunPhases with rawFilePhase + uastPhase
- Composition analyzer implements RawFileAnalyzer directly

Static analyzer output enrichment:
- source_file: relative file path on every function record (153K+ records)
- language: detected programming language on every function record
- directory: parent directory for DWH aggregation without path parsing
- Fields flow through TypedCollection → DetailedDataCollector → ComputedMetrics

History analyzer timestamps:
- start_time/end_time (RFC 3339) on all time-series ticks across
  sentiment, anomaly, quality, devs (activity + churn), file-history
- TickBounds type and BuildTickBounds helper in analyze package
- Quality and devs buildTick() now populate TICK.StartTime/EndTime

Developer identity normalization:
- Split pipe-delimited "name|email" into separate name + email fields
- SplitIdentity() helper handles pipe, exact "name <email>", plain formats
- Affects DeveloperData, BusFactorData, DeveloperCouplingData

Output structure flattening for DWH:
- developers[].languages: map → sorted []LanguageStatsEntry array
- activity[].by_developer: map[int]int → []DeveloperCommits array
- file_contributors[].contributors: map → []ContributorEntry array
- Empty language strings replaced with "Other"

Output envelope enhancements:
- Top-level metadata: repo_path, repo_name, analyzed_at, codefang_version
- Per-analyzer schema manifest: FieldMeta{type, grain, description}
- NDJSON output format for streaming DWH ingestion
- Clone type distribution from full population (not capped sample)

Documentation:
- CHANGELOG.md with motivation-driven change descriptions
- Updated site docs: output-formats.md, complexity.md, developers.md,
  sentiment.md, couples.md, file-history.md
- Updated AGENTS.md with new types and patterns
- HTML plot labels now show filename:funcName for context

Data quality score: 2.1/5 → 4.6/5 (verified on full kubernetes repo)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accidentally force-added with git add -f in previous commit.
Specs are local-only design documents, not tracked in version control.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive guide for using codefang output in data warehouses:
- Format selection (JSON vs NDJSON) with repo size guidelines
- Memory budget configuration to prevent OOM
- Commit limiting for fast iteration
- Key fields reference (source_file, language, directory, timestamps)
- Schema manifest usage for auto-generating ETL
- Full ClickHouse star schema DDL (dimensions + facts)
- ETL pipeline examples (Python, ClickHouse direct load)
- Analyzer selection by dashboard use case
- Performance tuning (workers, budget, first-parent, since)
- Row count estimates for capacity planning
- Materialized view examples for common queries
- Troubleshooting: OOM, empty analyzers, large coupling tables

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions

- Incremental cache: --cache-dir for daily DWH loads (skip processed commits)
- Checkpointing: --checkpoint for crash recovery on long runs
- Production pipeline example: cron + incremental + ClickHouse load
- Advanced tuning: blob-cache-size, diff-cache-size, commit-batch-size,
  blob-arena-size, tmp-dir flags with descriptions
- Checkpoint vs cache distinction explained

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested every parameter and statement against ~/sources/ioq3 (3784 commits).
14 of 15 tests passed. Fixes:

1. --cache-dir: add warning that incremental cache requires history-only
   mode (-a 'history/*'). Combined mode accepts the flag but does not
   produce cache files. Updated production pipeline example to split
   static and history phases.

2. --since: add note that empty results are normal when no commits fall
   within the time window. Static analyzers still run.

3. --checkpoint: add info box explaining auto-cleanup on success.
   Checkpoint files only persist after crashes, not successful runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously --languages was post-filtered in Go after libgit2 had already
produced a full tree diff. On polyglot repos with a narrow filter that
meant libgit2 was doing ~4x the tree-diff work it needed, and every
delta paid an unnecessary cgo crossing before being dropped.

- New internal/analyzers/plumbing/langpath package: pure Go
  Globs(langs) -> (globs, wantsAll, err) backed by enry's generated
  Linguist dataset. Resolves aliases (golang, js, ts, dockerfile...)
  and fails fast on unknown languages. 100% test coverage.
- New C ABI cf_tree_diff_v2 accepts a pathspec array forwarded to
  git_diff_options.pathspec. Old cf_tree_diff retired.
- TreeDiffRequest.Pathspec, BlobPipeline.TreeDiffPathspec, and
  CoordinatorConfig.TreeDiffPathspec thread the pathspec from the
  analyzer through the pipeline.
- TreeDiffAnalyzer.applyLanguageConfig stores the canonical
  lowercase Linguist name in t.Languages (so the Go-side post-filter
  keys match enry.GetLanguage output on detected files) and
  pre-computes t.Pathspec.
- --languages notalang now fails at Configure with a clear error
  instead of silently producing an empty report.

Measured on a 500-commit x 200-file x 4-language synthetic fixture
with --languages go: wall time 0.44s -> 0.29s (-34%), max RSS 74 MB
-> 66 MB (-11%), cgocall cumulative CPU 800 ms -> 510 ms (-36%).
JSON report byte-identical. Regression guard (no --languages filter):
within noise.

The Go-side shouldIncludeChange filter remains as the precise
post-pass; pathspec is deliberately over-inclusive for
content-disambiguated extensions (.h, .pl, .m, .r).
@dmytrogajewski
Contributor Author

--languages filter push-down — performance validation

Commit

a376f96 perf(gitlib): push --languages filter into libgit2 via pathspec

Fixture

Synthetic git repo, 500 commits × 200 files × 4 languages (Go, Python, JavaScript, Ruby), each commit touching a random 20-file subset. Analyzers: history/devs,history/couples,history/burndown. Checkpointing disabled.

Wall time — 3-trial median

| Scenario | Before (pre-pushdown build 2026-04-09) | After | Δ |
| --- | --- | --- | --- |
| --languages go | 0.44 s | 0.29 s | −34 % |
| no filter (regression guard) | 0.51 s | 0.49 s | −4 % (within noise) |

Max RSS

| Scenario | Before | After | Δ |
| --- | --- | --- | --- |
| --languages go | 74.3 MB | 66.1 MB | −11 % |
| no filter | 79.7 MB | 79.5 MB | ≈ 0 % |

CPU profile (500-commit fixture, --languages go)

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Profile duration | 448 ms | 270 ms | −40 % |
| Total samples | 1470 ms | 1070 ms | −27 % |
| cgocall cumulative | 800 ms | 510 ms | −36 % |
| Unique functions in profile | 286 | 209 | −27 % |

Correctness

  • --languages go output identical to pre-pushdown build.
  • --languages golang (alias) now resolves to the same report as --languages go instead of silently returning empty.
  • --languages notalang fails fast at Configure:
    Error: failed to configure TreeDiff: tree-diff pathspec: unknown language: "notalang"
    
  • --languages dockerfile (filename-only language) matches Dockerfile basename via libgit2 pathspec.

Gates

| Gate | Target | Observed | Status |
| --- | --- | --- | --- |
| Wall-time drop on narrow filter | ≥ 30 % | 34 % | |
| Regression guard (no filter) | within ±5 % | −4 % | |
| JSON report per --languages value | byte-identical | yes | |
| make lint | 0 issues | 0 issues | |
| make deadcode | clean | clean | |
| go test -race ./... | clean | clean | |
| langpath coverage | ≥ 95 % | 100 % | |

Architecture

Go (Configure, once)                               C (per commit)
─────────────────────                              ─────────────
enry.data.ExtensionsByLanguage  ─┐
enry.data.LanguagesByFilename   ─┤  build []string of globs
enry.GetLanguageByAlias         ─┘             │
                                               │ cgo marshal
                                               ▼
                            const char** pathspec  ───►  opts.pathspec
                            size_t       n                     │
                                                               ▼
                                                  git_diff_tree_to_tree

The Go-side shouldIncludeChange filter remains as the precise post-pass — pathspec is deliberately over-inclusive for content-disambiguated extensions (.h, .pl, .m, .r).
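The Configure-time half of the diagram can be approximated as follows. This is a sketch only: the real langpath package is backed by enry's generated Linguist data, whereas the three lookup tables here are tiny hand-written stand-ins, and the glob shape ("*" + extension) and the meaning of wantsAll (no filter requested) are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hand-written stand-ins for the Linguist-derived tables.
var (
	extensionsByLanguage = map[string][]string{
		"go":         {".go"},
		"python":     {".py"},
		"javascript": {".js", ".mjs"},
	}
	filenamesByLanguage = map[string][]string{
		"dockerfile": {"Dockerfile"},
	}
	aliases = map[string]string{
		"golang": "go",
		"js":     "javascript",
	}
)

// globs resolves language names (or aliases) into pathspec patterns.
// An empty request means "all languages", so no pathspec is produced.
// Unknown names fail fast instead of yielding an empty report later.
func globs(langs []string) (patterns []string, wantsAll bool, err error) {
	if len(langs) == 0 {
		return nil, true, nil
	}
	for _, raw := range langs {
		key := strings.ToLower(strings.TrimSpace(raw))
		if canonical, ok := aliases[key]; ok {
			key = canonical
		}
		exts, hasExts := extensionsByLanguage[key]
		names, hasNames := filenamesByLanguage[key]
		if !hasExts && !hasNames {
			return nil, false, fmt.Errorf("unknown language: %q", raw)
		}
		for _, ext := range exts {
			patterns = append(patterns, "*"+ext)
		}
		patterns = append(patterns, names...)
	}
	sort.Strings(patterns)
	return patterns, false, nil
}

func main() {
	for _, req := range [][]string{{"golang", "dockerfile"}, {"notalang"}, nil} {
		patterns, all, err := globs(req)
		fmt.Printf("%v -> patterns=%v wantsAll=%v err=%v\n", req, patterns, all, err)
	}
}
```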

* New package internal/analyzers/plumbing/pathpolicy with Exclude(path,
  content, opts) backed by enry.IsVendor and pkg/pathfilter generated
  heuristics. 100% covered, cross-language (Go, Node, Python, Ruby,
  Rust, Java, .NET, PHP — everything Linguist knows).

* Three new CLI flags on `codefang run`, applied identically to both
  static and history phases:
  - --include-vendored      (bool, default false)
  - --include-generated     (bool, default false)
  - --extra-excluded-prefixes (strings, default [])

* Default analysis output now excludes vendor + generated across both
  phases — matching eslint, rubocop, ruff, scalafix, phpcs convention.
  Migration: `--include-vendored --include-generated` restores the
  pre-change default.

* Deprecated legacy flags with cobra warnings:
  - --skip-blacklist  → no-op now (new default already excludes)
  - --blacklisted-prefixes → migrate to --extra-excluded-prefixes

* Static pipeline: StaticService.PathPolicy field; hooks in both
  WalkDir visitors (rawFilePhase + streamFiles).

* History pipeline: TreeDiffAnalyzer.PathPolicy field; called from
  shouldIncludeChange as the first exclusion check. New
  ConfigTreeDiffPathPolicy fact key threads the options through
  Configure.

* Fix a pre-existing race in internal/framework.PipelineSampler:
  t1Captured was a plain bool concurrently read by the sampler
  goroutine and written by the caller. Converted to sync/atomic.Bool
  with CompareAndSwap so exactly one goroutine captures the t1 heap
  profile. Removed the unused t0Captured field.

* Chore: removed all `// FRD: specs/frds/FRD-*.md` comments from .go
  files. specs/ is gitignored so these references broke for anyone
  cloning the repo. Traceability stays in FRDs and PR descriptions.

Verification:
- go test -race ./...           — green, zero DATA RACE, zero FAIL
- make lint                     — 0 issues
- make deadcode                 — clean
- pathpolicy statement coverage — 100%

End-to-end on a cross-language fixture (main.go + api.pb.go +
vendor/dep/dep.go + node_modules/left-pad/index.js +
testdata/sample.go):

  defaults                                       → 1 function
  --include-vendored                             → 4
  --include-vendored --include-generated         → 5
  --skip-blacklist (deprecated, prints warning)  → 1
@dmytrogajewski
Contributor Author

Cross-phase vendor & generated exclusion + race fix

Commit

`06dfa5f` `feat: cross-phase vendor/generated exclusion + race fix + FRD cleanup`

What changed

New feature: cross-phase path-exclusion policy. Three CLI flags on `codefang run`, applied identically to both `-a 'static/*'` and `-a 'history/*'` runs:

| Flag | Default | Behaviour |
| --- | --- | --- |
| `--include-vendored` | `false` | Re-include `enry.IsVendor` paths (`vendor/`, `node_modules/`, `third_party/`, `testdata/`, minified bundles, …). |
| `--include-generated` | `false` | Re-include `.pb.go`, `zz_generated_.go`, `_pb2.py`, `.min.js`, and files with `DO NOT EDIT` / `Code generated` / `@generated` headers. |
| `--extra-excluded-prefixes` | `[]` | Extra UNIX path prefixes for ecosystems Linguist doesn't know about (`.venv/`, `target/`, `.gradle/`). |

Breaking change. Default analysis output now excludes vendor + generated across both phases — matching eslint, rubocop, ruff, scalafix, phpcs convention. Migration: `--include-vendored --include-generated` restores today's default.

Deprecated with cobra warnings:

  • `--skip-blacklist` → no-op (new default already excludes)
  • `--blacklisted-prefixes` → migrate to `--extra-excluded-prefixes`

Architecture

New package `internal/analyzers/plumbing/pathpolicy` with one pure function:

```go
func Exclude(path string, content []byte, opts Options) bool
```

Composes `enry.IsVendor` with the existing `pkg/pathfilter` generated-file heuristics. Both static and history pipelines call the same helper — single source of truth, no phase-specific drift.

```
CLI --include-vendored/--include-generated/--extra-excluded-prefixes
                  │
                  ▼
          pathpolicy.Options
                  │
        ┌─────────┴─────────┐
        ▼                   ▼
     Static              History
     WalkDir             TreeDiffAnalyzer
     visitors            shouldIncludeChange
```
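To make the flag semantics concrete, here is a simplified stand-in for the policy function. The Exclude signature is from this comment; the Options field names are assumed to mirror the three flags, and the vendor and generated checks below are naive substitutes for `enry.IsVendor` and the `pkg/pathfilter` heuristics, not their real logic.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// Options mirrors the three CLI flags (field names assumed).
type Options struct {
	IncludeVendored       bool
	IncludeGenerated      bool
	ExtraExcludedPrefixes []string
}

// isVendorLike is a naive stand-in for enry.IsVendor: it checks whether any
// directory segment is a well-known vendor directory.
func isVendorLike(path string) bool {
	segs := strings.Split(path, "/")
	for _, s := range segs[:len(segs)-1] {
		switch s {
		case "vendor", "node_modules", "third_party", "testdata":
			return true
		}
	}
	return false
}

// isGeneratedLike is a naive stand-in for the generated-file heuristics:
// a known suffix or a marker in the first 512 bytes.
func isGeneratedLike(path string, content []byte) bool {
	if strings.HasSuffix(path, ".pb.go") {
		return true
	}
	head := content
	if len(head) > 512 {
		head = head[:512]
	}
	for _, marker := range []string{"DO NOT EDIT", "Code generated", "@generated"} {
		if bytes.Contains(head, []byte(marker)) {
			return true
		}
	}
	return false
}

// exclude reports whether a file should be skipped under opts.
func exclude(path string, content []byte, opts Options) bool {
	for _, prefix := range opts.ExtraExcludedPrefixes {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	if !opts.IncludeVendored && isVendorLike(path) {
		return true
	}
	return !opts.IncludeGenerated && isGeneratedLike(path, content)
}

// countIncluded applies the policy to the five-file fixture from this PR.
func countIncluded(opts Options) int {
	fixture := map[string]string{
		"main.go":                        "package main",
		"api.pb.go":                      "// Code generated by protoc-gen-go. DO NOT EDIT.",
		"vendor/dep/dep.go":              "package dep",
		"node_modules/left-pad/index.js": "module.exports = leftPad",
		"testdata/sample.go":             "package sample",
	}
	n := 0
	for path, content := range fixture {
		if !exclude(path, []byte(content), opts) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countIncluded(Options{}))
	fmt.Println(countIncluded(Options{IncludeVendored: true}))
	fmt.Println(countIncluded(Options{IncludeVendored: true, IncludeGenerated: true}))
}
```

The sketch counts files rather than functions, which lines up with the fixture only if each file contributes one function.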

E2E proof (cross-language fixture)

Fixture: `main.go` + `api.pb.go` + `vendor/dep/dep.go` + `node_modules/left-pad/index.js` + `testdata/sample.go`, `-a static/complexity`:

| Invocation | Total Functions |
| --- | --- |
| (defaults) | 1 |
| `--include-vendored` | 4 |
| `--include-vendored --include-generated` | 5 |
| `--skip-blacklist` (deprecated) | 1 (warning fires) |

Also in this commit

Race fix — pre-existing data race in `internal/framework.PipelineSampler`:

  • `t1Captured bool` was concurrently read by the sampler goroutine (`sample`) and written by the caller (`CaptureT1`) — intermittent `DATA RACE` under `go test -race`, visible via `TestUniversalAnalyzers_MemoryLeak/Shotness`.
  • Converted to `sync/atomic.Bool` with `CompareAndSwap` — at most one goroutine captures the t1 heap profile.
  • Removed the unused `t0Captured` field.
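The capture-once pattern described above looks roughly like this. `atomic.Bool` and `CompareAndSwap` are the standard library primitives named in the fix; the sampler type, its counter, and the helper names are illustrative stand-ins for `PipelineSampler`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sampler is a simplified stand-in for PipelineSampler.
type sampler struct {
	t1Captured atomic.Bool  // replaces the racy plain bool
	captures   atomic.Int32 // counts how often the capture body ran
}

// captureT1 runs the capture body for the first caller only. The flag flips
// from false to true exactly once, so concurrent callers cannot both win.
func (s *sampler) captureT1() bool {
	if !s.t1Captured.CompareAndSwap(false, true) {
		return false
	}
	s.captures.Add(1) // stand-in for writing the t1 heap profile
	return true
}

// captureConcurrently calls captureT1 from n goroutines and returns how
// many times the capture body actually ran.
func captureConcurrently(n int) int32 {
	var s sampler
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.captureT1()
		}()
	}
	wg.Wait()
	return s.captures.Load()
}

func main() {
	fmt.Println(captureConcurrently(16))
}
```

A separate load followed by a store would still leave a window in which two goroutines both see false; the single compare-and-swap closes it.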

Chore — stripped all `// FRD: specs/frds/FRD-*.md` comments from `.go` files. `specs/` is gitignored; those references broke for anyone cloning the repo. Traceability stays in the FRDs themselves and in PR descriptions.

Gates

| Gate | Status |
| --- | --- |
| `go test -race ./...` | ✅ zero `DATA RACE`, zero FAIL |
| `make lint` | ✅ 0 issues |
| `make deadcode` | ✅ clean |
| `go vet ./...` | ✅ clean |
| `pathpolicy` statement coverage | ✅ 100 % |
| Cross-phase defaults restore-path works | |
| Cobra deprecation warnings fire | |

Size

124 files changed. The line delta is large (+36k / −87k) because:

  • The FRD-comment sweep touched ~90 files (a line-removal per file).
  • `go fmt` rewrote a handful of files after the sweep.
  • The vendor/generated feature itself is ~400 lines of production + tests across 6 files.

Follow-ups in the separate roadmaps

Not included here:

  • Phase 4 of history-language roadmap: retire hand-curated `extensionToLanguage` in favour of the enry-backed init (`specs/optimize-lang/ROADMAP.md`).
  • Phase 5: full before/after benchmark harness for the `--languages` push-down.

drwatsno added 3 commits May 4, 2026 22:53
processChildrenBatch shared ctx.batchChildren across recursive calls; an
inner ensureBatchChildren reslice over the same backing array let the
recursion overwrite outer-loop entries before the parent had read them.
That dropped functions and made counts vary run-to-run on the same input.
Snapshot the children into a local slice before iterating.
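The aliasing hazard can be reproduced in a few lines. This is a simplified analogue, not the pkg/uast code: walker, fill, and the two walk methods are hypothetical, with a shared scratch slice playing the role of ctx.batchChildren.

```go
package main

import "fmt"

type node struct {
	name     string
	children []*node
}

// walker reuses one scratch buffer across recursive calls.
type walker struct {
	scratch []*node
	visited []string
}

// fill loads n's children into the shared buffer and returns a view of it.
func (w *walker) fill(n *node) []*node {
	w.scratch = append(w.scratch[:0], n.children...)
	return w.scratch
}

// walkShared iterates the shared view directly. A recursive fill writes into
// the same backing array and clobbers entries the parent has not read yet.
func (w *walker) walkShared(n *node) {
	w.visited = append(w.visited, n.name)
	for _, c := range w.fill(n) {
		w.walkShared(c)
	}
}

// walkSnapshot copies the children into a local slice before recursing, so
// later refills of the scratch buffer cannot affect this loop.
func (w *walker) walkSnapshot(n *node) {
	w.visited = append(w.visited, n.name)
	local := append([]*node(nil), w.fill(n)...)
	for _, c := range local {
		w.walkSnapshot(c)
	}
}

// visit walks root -> {a -> {x, y}, b} and returns the visit order.
func visit(snapshot bool) []string {
	x, y := &node{name: "x"}, &node{name: "y"}
	a := &node{name: "a", children: []*node{x, y}}
	b := &node{name: "b"}
	root := &node{name: "root", children: []*node{a, b}}

	w := &walker{}
	if snapshot {
		w.walkSnapshot(root)
	} else {
		w.walkShared(root)
	}
	return w.visited
}

func main() {
	fmt.Println("shared:  ", visit(false)) // b is lost, y is visited twice
	fmt.Println("snapshot:", visit(true))
}
```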

The halstead visitor and analyzer keyed per-function metrics by name only,
silently collapsing same-named methods (e.g. multiple `Read` receivers in
one Go file) and reporting len(map) as total_functions. Convert internal
storage to a slice so every declaration is counted.

Add regression tests:
- pkg/uast: re-parse the same source 8 times with one Parser, assert the
  tree node and function counts match the first run.
- halstead: build a UAST with multiple identically-named functions, assert
  the visitor and the public report both keep one entry per declaration.
@dmytrogajewski dmytrogajewski merged commit 5c49d8e into main May 13, 2026
5 checks passed

2 participants