Feature/filestats by dmytrogajewski · Pull Request #18 · Sumatoshi-tech/codefang

dmytrogajewski · 2026-04-03T18:32:30Z

Analytics Readiness & DWH Suitability

Motivation: A comprehensive data analyst review of Codefang's JSON output revealed that while the data was analytically rich (17 analyzers, 1M+ function-level rows, time-series, coupling data), it was structurally hostile to analytics tooling and DWH loading. Function records had bare names with no file paths, time-series ticks had no calendar dates, developer identities used pipe-delimited strings, and nested maps blocked efficient columnar ingestion. This release systematically fixes every identified blocker, raising the data quality score from 2.1/5 to 4.6/5.

Architecture: Pipeline Stage Refactor

`RawFileAnalyzer` and `FormattableAnalyzer` interfaces

Replaced the FileContentAnalyzer + WalksAllFiles marker interface pattern with a proper pipeline stage architecture.

Before: Analyzers that needed raw file access (not UAST) had to implement StaticAnalyzer with a no-op Analyze(*node.Node), plus two marker interfaces discovered at runtime via type assertions.

After: Two clean interface hierarchies — StaticAnalyzer for UAST-based analysis and RawFileAnalyzer for raw file analysis — both embed a shared FormattableAnalyzer base. StaticService holds separate slices. AnalyzeFolder uses pipeline.RunPhases with explicit rawFilePhase and uastPhase stages.
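The two hierarchies described under "After" can be sketched roughly as follows. Only the interface names, the embedding relationship, and the two separate slices come from the description; the method sets (`ID`, `FormatReport`, `AnalyzeFile`), the `Node` stand-in, and the analyzer IDs are assumptions made for illustration.

```go
package main

import "fmt"

// Node stands in for the UAST node type (*node.Node in the PR description).
type Node struct{ Kind string }

// FormattableAnalyzer is the shared base. Its method set here is hypothetical.
type FormattableAnalyzer interface {
	ID() string
	FormatReport(report map[string]any) string
}

// StaticAnalyzer covers UAST-based analysis.
type StaticAnalyzer interface {
	FormattableAnalyzer
	Analyze(root *Node) map[string]any
}

// RawFileAnalyzer covers raw-file analysis; it needs no no-op Analyze method.
type RawFileAnalyzer interface {
	FormattableAnalyzer
	AnalyzeFile(path string, content []byte) map[string]any
}

type base struct{ id string }

func (b base) ID() string { return b.id }
func (b base) FormatReport(r map[string]any) string {
	return fmt.Sprintf("%s: %d keys", b.id, len(r))
}

type complexity struct{ base }

func (complexity) Analyze(root *Node) map[string]any {
	return map[string]any{"root_kind": root.Kind}
}

type composition struct{ base }

func (composition) AnalyzeFile(path string, content []byte) map[string]any {
	return map[string]any{"path": path, "bytes": len(content)}
}

// Service keeps the two kinds in separate slices, so no runtime type
// assertions are needed to tell them apart.
type Service struct {
	UASTAnalyzers    []StaticAnalyzer
	RawFileAnalyzers []RawFileAnalyzer
}

// Formattables flattens both slices for code that only formats reports.
func (s Service) Formattables() []FormattableAnalyzer {
	out := make([]FormattableAnalyzer, 0, len(s.UASTAnalyzers)+len(s.RawFileAnalyzers))
	for _, a := range s.UASTAnalyzers {
		out = append(out, a)
	}
	for _, a := range s.RawFileAnalyzers {
		out = append(out, a)
	}
	return out
}

func newService() Service {
	return Service{
		UASTAnalyzers:    []StaticAnalyzer{complexity{base{id: "static/complexity"}}},
		RawFileAnalyzers: []RawFileAnalyzer{composition{base{id: "static/composition"}}},
	}
}

func main() {
	svc := newService()
	for _, a := range svc.RawFileAnalyzers {
		fmt.Println(a.FormatReport(a.AnalyzeFile("go.mod", []byte("module x"))))
	}
	for _, a := range svc.UASTAnalyzers {
		fmt.Println(a.FormatReport(a.Analyze(&Node{Kind: "File"})))
	}
}
```

Because both interfaces embed the same base, helpers such as a per-file enricher can accept `[]FormattableAnalyzer` without caring which phase produced the report.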

Why it matters for BI: The pipeline refactor enabled StampSourceFile to receive rootPath and convert all file paths to relative — a prerequisite for portable DWH data. It also enabled StampLanguage to inject detected language into every function record.

Files changed:

internal/analyzers/analyze/analyzer.go — new FormattableAnalyzer, RawFileAnalyzer interfaces; StaticAnalyzer refactored to embed FormattableAnalyzer
internal/analyzers/analyze/static.go — StaticService gains UASTAnalyzers + RawFileAnalyzers slices; AnalyzeFolder uses pipeline.RunPhases
internal/analyzers/composition/analyzer.go — implements RawFileAnalyzer directly (removed no-op Analyze, NeedsAllFiles)
internal/analyzers/analyze/registry.go — NewRegistry accepts three slices
cmd/codefang/commands/run.go — split defaultStaticAnalyzers into defaultUASTAnalyzers + defaultRawFileAnalyzers
internal/analyzers/analyze/perfile.go — PerFileEnricher uses []FormattableAnalyzer
internal/analyzers/common/renderer/json.go — EnrichWithPerFileData uses []FormattableAnalyzer

Static Analyzers: New Fields on Every Function Record

`source_file` — File path on every function record

Motivation: 152,000+ function records in the JSON output had bare names like "ForKind" with no indication of which file they belonged to. This made it impossible to join function metrics to file-level data, build file heatmaps, or drill down from "bad function" to "where in the repo."

Root cause: The _source_file stamping mechanism existed and worked through aggregation, but FormatReportBinary called ComputeAllMetrics which parsed []map[string]any items into typed structs. Those structs had no SourceFile field, silently dropping the value during struct conversion.

Fix: Added SourceFile string to all input FunctionData and output data structs (FunctionComplexityData, FunctionHalsteadData, FunctionCohesionData, all comment data structs, HighRiskFunctionData, HighEffortFunctionData, LowCohesionFunctionData, UndocumentedFunctionData). Populated from _source_file map key during parseFunctionData → Compute(). Updated StampSourceFile to accept rootPath and convert to relative via MakeRelativePath.

JSON output key: "source_file" (relative path, e.g., "pkg/kubelet/kubelet.go")

Analyzers affected: static/complexity, static/halstead, static/cohesion, static/comments
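The root cause above is easy to reproduce in miniature: a map-to-struct conversion only carries over the keys it explicitly reads. The sketch below assumes a simplified `FunctionData` and a hypothetical `parseFunctionData` signature; only the `_source_file` key and the `SourceFile` field name come from the description.

```go
package main

import "fmt"

// FunctionData is a simplified stand-in for the typed input struct.
type FunctionData struct {
	Name       string
	Complexity int
	SourceFile string // the field that was missing before the fix
}

// parseFunctionData converts one loosely typed item into a typed struct.
// Any map key that is not read here is silently dropped.
func parseFunctionData(item map[string]any) FunctionData {
	var fd FunctionData
	if v, ok := item["name"].(string); ok {
		fd.Name = v
	}
	if v, ok := item["complexity"].(int); ok {
		fd.Complexity = v
	}
	// Without this read, the stamped path never reaches the output.
	if v, ok := item["_source_file"].(string); ok {
		fd.SourceFile = v
	}
	return fd
}

func main() {
	item := map[string]any{
		"name":         "ForKind",
		"complexity":   3,
		"_source_file": "pkg/kubelet/kubelet.go",
	}
	fmt.Printf("%+v\n", parseFunctionData(item))
}
```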

`language` — Programming language on every function record

Motivation: Analysts had to infer language from file extension at query time. The parser already knows the language.

Fix: Added LanguageKey constant, StampLanguage() function, and Language field to TypedCollection struct. Language is stamped in analyzeFilesParallel via parser.GetLanguage(filePath) and propagated through TypedCollection → DetailedDataCollector.buildItems() → stampCollectionMetadata() to reach the output structs.

JSON output key: "language" (e.g., "go", "bash")

Analyzers affected: static/complexity, static/halstead, static/cohesion, static/comments

`directory` — Parent directory on every function record

Motivation: Directory-level aggregation (e.g., "which package has worst complexity") requires parsing file paths at query time, which is expensive in columnar DWH.

Fix: Added DirectoryKey constant and Directory field to TypedCollection. Stamped as filepath.Dir(relativePath) inside StampSourceFile. Propagated via stampCollectionMetadata() alongside language.

JSON output key: "directory" (e.g., "pkg/kubelet")

Analyzers affected: static/complexity, static/halstead, static/cohesion, static/comments
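A minimal sketch of stamping a relative `source_file` and its `directory` in one step, assuming a Unix-style filesystem. The function below is a hypothetical simplification, not the actual `StampSourceFile`; it uses `filepath.Rel` where the PR uses its own `MakeRelativePath` helper.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// stampSourceFile writes a repo-relative path and its parent directory
// into one function record. Signature and map keys are illustrative.
func stampSourceFile(item map[string]any, rootPath, filePath string) error {
	rel, err := filepath.Rel(rootPath, filePath)
	if err != nil {
		return err
	}
	// Forward slashes keep the output portable across operating systems.
	item["source_file"] = filepath.ToSlash(rel)
	item["directory"] = filepath.ToSlash(filepath.Dir(rel))
	return nil
}

// stamped is a small helper that returns the two stamped values.
func stamped(rootPath, filePath string) (file, dir string) {
	item := map[string]any{"name": "ForKind"}
	if err := stampSourceFile(item, rootPath, filePath); err != nil {
		return "", ""
	}
	return item["source_file"].(string), item["directory"].(string)
}

func main() {
	file, dir := stamped(
		"/home/user/sources/kubernetes",
		"/home/user/sources/kubernetes/pkg/kubelet/kubelet.go",
	)
	fmt.Println(file, dir)
}
```

Computing the directory once at write time means a warehouse query can `GROUP BY directory` without any string functions.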

History Analyzers: Tick Timestamps

`start_time` / `end_time` on every time-series tick

Motivation: All 6 history time-series analyzers emitted tick: <int> with no calendar date. Every time-series chart had an unlabeled X-axis. The TICK struct already carried StartTime/EndTime internally but didn't export them.

Fix: Created TickBounds type and BuildTickBounds(ticks []TICK) helper. Each analyzer's ticksToReport adds tick_bounds to the Report. Each ParseReportData reads it. Each time-series output struct gains StartTime/EndTime string fields (RFC 3339). For quality and devs analyzers, added timestamp tracking to their tick accumulators (tickAccumulator.startTime/endTime, TickDevData.startTime/endTime) with min/max tracking in extractTC and population in buildTick.

JSON output keys: "start_time", "end_time" (RFC 3339, e.g., "2024-01-15T10:30:00Z")

Analyzers affected: history/sentiment, history/anomaly, history/quality, history/devs (activity + churn), history/file-history (composition_ts)
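The min/max tracking and RFC 3339 formatting described above might look roughly like this. The `Tick` struct is a reduced, hypothetical version of `TICK`, and the `map[int]TickBounds` return type of `BuildTickBounds` is an assumption; the PR only names the type and the helper.

```go
package main

import (
	"fmt"
	"time"
)

// Tick is a reduced stand-in for the TICK struct.
type Tick struct {
	Index     int
	StartTime time.Time
	EndTime   time.Time
}

// TickBounds carries the exported calendar bounds of one tick.
type TickBounds struct {
	StartTime string `json:"start_time"`
	EndTime   string `json:"end_time"`
}

// tickAccumulator tracks the earliest and latest commit time seen in a tick.
type tickAccumulator struct{ startTime, endTime time.Time }

func (a *tickAccumulator) observe(ts time.Time) {
	if a.startTime.IsZero() || ts.Before(a.startTime) {
		a.startTime = ts
	}
	if a.endTime.IsZero() || ts.After(a.endTime) {
		a.endTime = ts
	}
}

// BuildTickBounds formats each tick's bounds as RFC 3339 strings in UTC.
func BuildTickBounds(ticks []Tick) map[int]TickBounds {
	out := make(map[int]TickBounds, len(ticks))
	for _, t := range ticks {
		out[t.Index] = TickBounds{
			StartTime: t.StartTime.UTC().Format(time.RFC3339),
			EndTime:   t.EndTime.UTC().Format(time.RFC3339),
		}
	}
	return out
}

// boundsForCommits feeds commit timestamps through the accumulator
// and returns the resulting bounds for a single tick.
func boundsForCommits(index int, commitTimes ...string) TickBounds {
	var acc tickAccumulator
	for _, s := range commitTimes {
		ts, err := time.Parse(time.RFC3339, s)
		if err != nil {
			continue // skip unparsable timestamps in this sketch
		}
		acc.observe(ts)
	}
	tick := Tick{Index: index, StartTime: acc.startTime, EndTime: acc.endTime}
	return BuildTickBounds([]Tick{tick})[index]
}

func main() {
	b := boundsForCommits(0,
		"2024-01-15T12:00:00Z", "2024-01-15T10:30:00Z", "2024-01-16T08:00:00Z")
	fmt.Println(b.StartTime, b.EndTime)
}
```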

Developer Identity Normalization

Split pipe-delimited names into `name` + `email`

Motivation: Developer identity used "daniel smith|dbsmith@google.com" pipe-delimited strings from ReversedPeopleDict. This blocked clean dimension table creation in DWH systems.

Fix: Created SplitIdentity(s string) (name, email string) in internal/identity/split.go. Handles pipe-delimited, exact "name <email>", and plain name formats. Updated devName() → devNameAndEmail() and getDevName() → getDevNameAndEmail().

Fields added:

DeveloperData: email field
BusFactorData: primary_dev_email, secondary_dev_email
DeveloperCouplingData: developer1_email, developer2_email

Analyzers affected: history/devs, history/couples
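For illustration, a function with the `SplitIdentity` signature could handle the three formats like this. This is a sketch of the described behaviour, not the code in `internal/identity/split.go`; whitespace trimming and the handling of malformed input are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitIdentity separates a developer identity into name and email.
// It accepts "name|email", "name <email>", and a plain name.
func SplitIdentity(s string) (name, email string) {
	s = strings.TrimSpace(s)

	// Pipe-delimited form, e.g. "daniel smith|dbsmith@google.com".
	if i := strings.Index(s, "|"); i >= 0 {
		return strings.TrimSpace(s[:i]), strings.TrimSpace(s[i+1:])
	}

	// Git-style form, e.g. "Jane Doe <jane@example.com>".
	if strings.HasSuffix(s, ">") {
		if i := strings.LastIndex(s, "<"); i >= 0 {
			return strings.TrimSpace(s[:i]), strings.TrimSpace(s[i+1 : len(s)-1])
		}
	}

	// Plain name with no email.
	return s, ""
}

func main() {
	for _, in := range []string{
		"daniel smith|dbsmith@google.com",
		"Jane Doe <jane@example.com>",
		"build-bot",
	} {
		name, email := SplitIdentity(in)
		fmt.Printf("name=%q email=%q\n", name, email)
	}
}
```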

Output Structure: Flattened Arrays

`developers[].languages` — map → array

Motivation: map[string]LineStats with variable language-name keys cannot be UNNEST'd in columnar DWH without custom ETL.

Fix: Changed DeveloperData.Languages from map[string]pkgplumbing.LineStats to []LanguageStatsEntry. Internal accumulation uses unexported langMap, converted to sorted array via finalizeLanguages(). Empty language strings replaced with "Other".

Before: {"Go": {"added": 100, "removed": 5, "changed": 3}}
After: [{"language": "Go", "added": 100, "removed": 5, "changed": 3}]
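A sketch of the map-to-sorted-array conversion, under the assumption that an empty language key is merged into any existing "Other" bucket (the PR only says empty strings are replaced). `LineStats` here is a simplified stand-in for `pkgplumbing.LineStats`.

```go
package main

import (
	"fmt"
	"sort"
)

// LineStats is a simplified stand-in for pkgplumbing.LineStats.
type LineStats struct{ Added, Removed, Changed int }

// LanguageStatsEntry is one row of the flattened array.
type LanguageStatsEntry struct {
	Language string `json:"language"`
	Added    int    `json:"added"`
	Removed  int    `json:"removed"`
	Changed  int    `json:"changed"`
}

// finalizeLanguages turns the internal map into a deterministic array.
func finalizeLanguages(langMap map[string]LineStats) []LanguageStatsEntry {
	merged := make(map[string]LineStats, len(langMap))
	for lang, st := range langMap {
		if lang == "" {
			lang = "Other" // assumption: fold unnamed languages together
		}
		cur := merged[lang]
		cur.Added += st.Added
		cur.Removed += st.Removed
		cur.Changed += st.Changed
		merged[lang] = cur
	}

	out := make([]LanguageStatsEntry, 0, len(merged))
	for lang, st := range merged {
		out = append(out, LanguageStatsEntry{lang, st.Added, st.Removed, st.Changed})
	}
	// Sorting by name makes the JSON output byte-stable across runs.
	sort.Slice(out, func(i, j int) bool { return out[i].Language < out[j].Language })
	return out
}

func main() {
	rows := finalizeLanguages(map[string]LineStats{
		"Go": {Added: 100, Removed: 5, Changed: 3},
		"":   {Added: 1},
	})
	fmt.Printf("%+v\n", rows)
}
```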

`activity[].by_developer` — map → array

Motivation: map[int]int (dev_id → commit_count) serializes to JSON with string keys, blocking typed ingestion.

Fix: Changed to []DeveloperCommits with {dev_id, commits} fields. Sorted by dev_id for deterministic output.

Before: {"2": 5, "3": 3}
After: [{"dev_id": 2, "commits": 5}, {"dev_id": 3, "commits": 3}]
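The string-key problem is a property of `encoding/json`, which converts integer map keys to strings. The short sketch below shows both encodings side by side; the helper names `flattenByDeveloper` and `mustJSON` are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// DeveloperCommits is one {dev_id, commits} row.
type DeveloperCommits struct {
	DevID   int `json:"dev_id"`
	Commits int `json:"commits"`
}

// flattenByDeveloper converts dev_id -> commit_count into a sorted array.
func flattenByDeveloper(m map[int]int) []DeveloperCommits {
	out := make([]DeveloperCommits, 0, len(m))
	for id, n := range m {
		out = append(out, DeveloperCommits{DevID: id, Commits: n})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].DevID < out[j].DevID })
	return out
}

// mustJSON marshals v and panics on error; fine for a demo.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	byDev := map[int]int{3: 3, 2: 5}
	fmt.Println(mustJSON(byDev))                     // integer keys become JSON strings
	fmt.Println(mustJSON(flattenByDeveloper(byDev))) // dev_id stays a JSON number
}
```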

`file_contributors[].contributors` — map → array

Motivation: map[int]LineStats blocked DWH UNNEST.

Fix: Changed to []ContributorEntry with {dev_id, added, removed, changed} fields. Sorted by dev_id.

Before: {"2": {"added": 42, "removed": 5, "changed": 3}}
After: [{"dev_id": 2, "added": 42, "removed": 5, "changed": 3}]

Output Envelope

Top-level `metadata` section

Motivation: A DWH ingesting reports from multiple repos could not distinguish them. No repo name, analysis timestamp, or version.

Fix: Added AnalysisMetadata struct with repo_path, repo_name (from filepath.Base), analyzed_at (RFC 3339), codefang_version (from build ldflags). Injected after DecodeCombinedBinaryReports in the combined render path.

{
  "version": "codefang.run.v1",
  "metadata": {
    "repo_path": "/home/user/sources/kubernetes",
    "repo_name": "kubernetes",
    "analyzed_at": "2026-04-07T23:33:00Z",
    "codefang_version": "dev"
  },
  "analyzers": [...]
}

Per-analyzer `schema` manifest

Motivation: DWH consumers need to know field types, grain, and cardinality for automated ETL generation.

Fix: Added FieldMeta struct with {type, grain, description} and static analyzerSchemas registry covering all 17 analyzers. Each AnalyzerResult in the output includes a schema field.

{
  "id": "static/complexity",
  "schema": {
    "function_complexity": {
      "type": "list",
      "grain": "function",
      "description": "Per-function cyclomatic and cognitive complexity"
    }
  },
  "report": {...}
}

NDJSON output format

Motivation: The monolithic JSON (467MB for kubernetes) must be fully parsed to extract any single analyzer. NDJSON enables streaming ingestion into ClickHouse.

Fix: Added FormatNDJSON case to WriteConvertedOutput. One JSON line per analyzer result, with optional metadata line prepended.

codefang run --format ndjson /repo > output.ndjson
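A rough sketch of how one-line-per-analyzer output can be produced with `json.Encoder`, which appends a newline after every value. The struct shapes are trimmed down, and wrapping the metadata line in a `{"metadata": ...}` object is an assumption; the PR does not show the exact shape of that line.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// AnalysisMetadata mirrors the metadata fields listed in the PR.
type AnalysisMetadata struct {
	RepoPath        string `json:"repo_path"`
	RepoName        string `json:"repo_name"`
	AnalyzedAt      string `json:"analyzed_at"`
	CodefangVersion string `json:"codefang_version"`
}

// AnalyzerResult is a trimmed-down result envelope.
type AnalyzerResult struct {
	ID     string         `json:"id"`
	Report map[string]any `json:"report"`
}

// writeNDJSON emits an optional metadata line, then one line per analyzer.
func writeNDJSON(w io.Writer, meta *AnalysisMetadata, results []AnalyzerResult) error {
	enc := json.NewEncoder(w) // Encode writes the value followed by '\n'
	if meta != nil {
		if err := enc.Encode(map[string]any{"metadata": meta}); err != nil {
			return err
		}
	}
	for _, r := range results {
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
	return nil
}

// renderNDJSON is a convenience wrapper that returns the output as a string.
func renderNDJSON(meta *AnalysisMetadata, results []AnalyzerResult) string {
	var sb strings.Builder
	if err := writeNDJSON(&sb, meta, results); err != nil {
		return ""
	}
	return sb.String()
}

// countLines counts newline-terminated lines.
func countLines(s string) int { return strings.Count(s, "\n") }

func main() {
	out := renderNDJSON(
		&AnalysisMetadata{RepoName: "kubernetes", CodefangVersion: "dev"},
		[]AnalyzerResult{{ID: "static/complexity"}, {ID: "history/devs"}},
	)
	fmt.Print(out)
}
```

A consumer can then read the file line by line and decode only the analyzers it needs, instead of parsing the whole document.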

Clone Analysis

`clone_type_distribution` from full population

Motivation: Clone pairs are capped at 1,000 in the output, but the distribution metrics (Type-1/2/3 breakdown) were computed from the capped sample, skewing percentages for large codebases with 22M+ total pairs.

Fix: Added typeDistribution cloneTypeCounts to clonePairResult. matchCandidates increments per-type counters for ALL valid pairs before the cap check. Both aggregator and per-file paths emit clone_type_distribution in the report. ReportSection.Distribution() reads from the full-population distribution.

Before: Distribution from 1,000 capped pairs
After: Distribution from 22,381,694 total pairs: {"Type-1": 12366266, "Type-2": 3307147, "Type-3": 6708281}
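The essence of the fix is the order of two operations: count first, cap second. In the sketch below the struct fields and the plain `map[string]int` are simplifications (the PR uses a `cloneTypeCounts` type), and the function signature is hypothetical.

```go
package main

import "fmt"

// clonePair is a simplified clone pair record.
type clonePair struct{ FuncA, FuncB, Type string }

// clonePairResult holds the capped sample and full-population counters.
type clonePairResult struct {
	pairs            []clonePair
	typeDistribution map[string]int
	totalPairs       int
}

// matchCandidates counts every valid pair before applying the output cap.
func matchCandidates(candidates []clonePair, maxPairs int) clonePairResult {
	res := clonePairResult{typeDistribution: map[string]int{}}
	for _, p := range candidates {
		// Full-population counters are updated for ALL pairs.
		res.typeDistribution[p.Type]++
		res.totalPairs++

		// Only the emitted sample is capped.
		if len(res.pairs) < maxPairs {
			res.pairs = append(res.pairs, p)
		}
	}
	return res
}

// sample returns five pairs: three Type-1, one Type-2, one Type-3.
func sample() []clonePair {
	return []clonePair{
		{"a.go::f", "b.go::f", "Type-1"},
		{"a.go::g", "c.go::g", "Type-1"},
		{"a.go::h", "d.go::h", "Type-2"},
		{"b.go::k", "e.go::k", "Type-3"},
		{"c.go::m", "e.go::m", "Type-1"},
	}
}

func main() {
	res := matchCandidates(sample(), 2)
	fmt.Println(len(res.pairs), res.totalPairs, res.typeDistribution)
}
```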

Relative paths in clone pairs

Clone pair func_a / func_b paths changed from absolute (/home/user/sources/repo/file.go::funcName) to relative (cmd/controller/app.go::newController). Enabled by the StampSourceFile rootPath change.

New Files Created

| File | Purpose |
| --- | --- |
| `internal/analyzers/analyze/tick_bounds.go` | `TickBounds` type + `BuildTickBounds` helper |
| `internal/analyzers/analyze/metadata.go` | `AnalysisMetadata` struct + `NewAnalysisMetadata` constructor |
| `internal/analyzers/analyze/schema_registry.go` | Static schema registry for all 17 analyzers |
| `internal/identity/split.go` | `SplitIdentity(s string) (name, email string)` |

Empty Analyzer Root Causes (Documented)

Investigation of 4 analyzers that returned empty data on kubernetes (1000 commits):

| Analyzer | Root Cause | Resolution |
| --- | --- | --- |
| `burndown.developer_survival` | Disabled by default (`Burndown.TrackPeople: false`) | Enable via config |
| `burndown.file_survival` | Disabled by default (`Burndown.TrackFiles: false`) | Enable via config |
| `history/imports` | Requires UAST-enabled pipeline mode (`NeedsUAST() = true`) | Architectural dependency |
| `history/typos` | Requires UAST-enabled pipeline mode (`NeedsUAST() = true`) | Architectural dependency |

…ion noise

Three bugs fixed in clone detection:

1. Clone ratio was pairs/functions (unbounded) instead of pairs/maxPairs, where maxPairs = N*(N-1)/2. It is now always in [0,1].
2. Methods with the same name on different receivers (e.g. Foo.DeepCopyInto, Bar.DeepCopyInto) collided in the LSH index: the second insert overwrote the first. Method names are now qualified with the receiver type.
3. Trivial one-liner functions (getters, setters, return-nil stubs) produced massive false positives. Added a minFunctionNodes=20 threshold to skip functions with too few AST nodes for meaningful similarity comparison.

Includes fixture-based tests with real Kubernetes-derived code patterns (RBAC validation, event handlers, deepcopy) to validate detection quality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…duplication The old ratio (pairs/maxPairs) was meaningless at scale — 22M pairs across 153K functions in Kubernetes produced 0.0019, displayed as "0.0" with score 10/10 despite massive duplication. New ratio: distinct functions in at least one clone pair / total functions. This answers "what % of your codebase participates in duplication" — the same metric humans understand and industry tools report. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
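To make the difference between the two ratios concrete, here is a small sketch that computes both. The function names are invented for this example, and the Kubernetes figures are the approximate ones quoted in the commit message (about 22M pairs, about 153K functions).

```go
package main

import "fmt"

// pairDensity is the old metric: pairs / maxPairs with maxPairs = N*(N-1)/2.
func pairDensity(pairs, totalFunctions int) float64 {
	if totalFunctions < 2 {
		return 0
	}
	n := float64(totalFunctions)
	return float64(pairs) / (n * (n - 1) / 2)
}

// participationRatio is the new metric: distinct functions that appear
// in at least one clone pair, divided by total functions.
func participationRatio(pairs [][2]string, totalFunctions int) float64 {
	if totalFunctions == 0 {
		return 0
	}
	seen := make(map[string]struct{})
	for _, p := range pairs {
		seen[p[0]] = struct{}{}
		seen[p[1]] = struct{}{}
	}
	return float64(len(seen)) / float64(totalFunctions)
}

func main() {
	// Old metric at Kubernetes scale: vanishingly small despite heavy duplication.
	fmt.Printf("old: %.4f\n", pairDensity(22381694, 153000))

	// New metric on a toy input: three of six functions take part in a clone pair.
	pairs := [][2]string{{"a", "b"}, {"a", "c"}}
	fmt.Printf("new: %.2f\n", participationRatio(pairs, 6))
}
```

The denominator of the old metric grows quadratically with the number of functions, which is why it collapses toward zero on large repositories.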

…ents

Pipeline architecture refactor:
- Replace marker interfaces (FileContentAnalyzer, WalksAllFiles) with first-class RawFileAnalyzer and FormattableAnalyzer pipeline stages
- StaticService uses pipeline.RunPhases with rawFilePhase + uastPhase
- Composition analyzer implements RawFileAnalyzer directly

Static analyzer output enrichment:
- source_file: relative file path on every function record (153K+ records)
- language: detected programming language on every function record
- directory: parent directory for DWH aggregation without path parsing
- Fields flow through TypedCollection → DetailedDataCollector → ComputedMetrics

History analyzer timestamps:
- start_time/end_time (RFC 3339) on all time-series ticks across sentiment, anomaly, quality, devs (activity + churn), file-history
- TickBounds type and BuildTickBounds helper in analyze package
- Quality and devs buildTick() now populate TICK.StartTime/EndTime

Developer identity normalization:
- Split pipe-delimited "name|email" into separate name + email fields
- SplitIdentity() helper handles pipe, exact "name <email>", plain formats
- Affects DeveloperData, BusFactorData, DeveloperCouplingData

Output structure flattening for DWH:
- developers[].languages: map → sorted []LanguageStatsEntry array
- activity[].by_developer: map[int]int → []DeveloperCommits array
- file_contributors[].contributors: map → []ContributorEntry array
- Empty language strings replaced with "Other"

Output envelope enhancements:
- Top-level metadata: repo_path, repo_name, analyzed_at, codefang_version
- Per-analyzer schema manifest: FieldMeta{type, grain, description}
- NDJSON output format for streaming DWH ingestion
- Clone type distribution from full population (not capped sample)

Documentation:
- CHANGELOG.md with motivation-driven change descriptions
- Updated site docs: output-formats.md, complexity.md, developers.md, sentiment.md, couples.md, file-history.md
- Updated AGENTS.md with new types and patterns
- HTML plot labels now show filename:funcName for context

Data quality score: 2.1/5 → 4.6/5 (verified on full kubernetes repo)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Accidentally force-added with git add -f in previous commit. Specs are local-only design documents, not tracked in version control. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Comprehensive guide for using codefang output in data warehouses:
- Format selection (JSON vs NDJSON) with repo size guidelines
- Memory budget configuration to prevent OOM
- Commit limiting for fast iteration
- Key fields reference (source_file, language, directory, timestamps)
- Schema manifest usage for auto-generating ETL
- Full ClickHouse star schema DDL (dimensions + facts)
- ETL pipeline examples (Python, ClickHouse direct load)
- Analyzer selection by dashboard use case
- Performance tuning (workers, budget, first-parent, since)
- Row count estimates for capacity planning
- Materialized view examples for common queries
- Troubleshooting: OOM, empty analyzers, large coupling tables

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ions

- Incremental cache: --cache-dir for daily DWH loads (skip processed commits)
- Checkpointing: --checkpoint for crash recovery on long runs
- Production pipeline example: cron + incremental + ClickHouse load
- Advanced tuning: blob-cache-size, diff-cache-size, commit-batch-size, blob-arena-size, tmp-dir flags with descriptions
- Checkpoint vs cache distinction explained

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tested every parameter and statement against ~/sources/ioq3 (3784 commits). 14 of 15 tests passed.

Fixes:

1. --cache-dir: add warning that incremental cache requires history-only mode (-a 'history/*'). Combined mode accepts the flag but does not produce cache files. Updated production pipeline example to split static and history phases.
2. --since: add note that empty results are normal when no commits fall within the time window. Static analyzers still run.
3. --checkpoint: add info box explaining auto-cleanup on success. Checkpoint files only persist after crashes, not successful runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously --languages was post-filtered in Go after libgit2 had already produced a full tree diff. On polyglot repos with a narrow filter that meant libgit2 was doing ~4x the tree-diff work it needed, and every delta paid an unnecessary cgo crossing before being dropped.

- New internal/analyzers/plumbing/langpath package: pure Go Globs(langs) -> (globs, wantsAll, err) backed by enry's generated Linguist dataset. Resolves aliases (golang, js, ts, dockerfile...) and fails fast on unknown languages. 100% test coverage.
- New C ABI cf_tree_diff_v2 accepts a pathspec array forwarded to git_diff_options.pathspec. Old cf_tree_diff retired.
- TreeDiffRequest.Pathspec, BlobPipeline.TreeDiffPathspec, and CoordinatorConfig.TreeDiffPathspec thread the pathspec from the analyzer through the pipeline.
- TreeDiffAnalyzer.applyLanguageConfig stores the canonical lowercase Linguist name in t.Languages (so the Go-side post-filter keys match enry.GetLanguage output on detected files) and pre-computes t.Pathspec.
- --languages notalang now fails at Configure with a clear error instead of silently producing an empty report.

Measured on a 500-commit x 200-file x 4-language synthetic fixture with --languages go: wall time 0.44s -> 0.29s (-34%), max RSS 74 MB -> 66 MB (-11%), cgocall cumulative CPU 800 ms -> 510 ms (-36%). JSON report byte-identical. Regression guard (no --languages filter): within noise.

The Go-side shouldIncludeChange filter remains as the precise post-pass; pathspec is deliberately over-inclusive for content-disambiguated extensions (.h, .pl, .m, .r).

dmytrogajewski · 2026-04-19T14:03:19Z

`--languages` filter push-down — performance validation

Commit

a376f96 perf(gitlib): push --languages filter into libgit2 via pathspec

Fixture

Synthetic git repo, 500 commits × 200 files × 4 languages (Go, Python, JavaScript, Ruby), each commit touching a random 20-file subset. Analyzers: history/devs,history/couples,history/burndown. Checkpointing disabled.

Wall time — 3-trial median

| Scenario | Before (pre-pushdown build 2026-04-09) | After | Δ |
| --- | --- | --- | --- |
| `--languages go` | 0.44 s | 0.29 s | −34 % |
| no filter (regression guard) | 0.51 s | 0.49 s | −4 % (within noise) |

Max RSS

| Scenario | Before | After | Δ |
| --- | --- | --- | --- |
| `--languages go` | 74.3 MB | 66.1 MB | −11 % |
| no filter | 79.7 MB | 79.5 MB | ≈ 0 % |

CPU profile (500-commit fixture, `--languages go`)

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Profile duration | 448 ms | 270 ms | −40 % |
| Total samples | 1470 ms | 1070 ms | −27 % |
| `cgocall` cumulative | 800 ms | 510 ms | −36 % |
| Unique functions in profile | 286 | 209 | −27 % |

Correctness

--languages go output identical to pre-pushdown build.
--languages golang (alias) now resolves to the same report as --languages go instead of silently returning empty.

--languages notalang fails fast at Configure:

Error: failed to configure TreeDiff: tree-diff pathspec: unknown language: "notalang"

--languages dockerfile (filename-only language) matches Dockerfile basename via libgit2 pathspec.
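The alias resolution and fail-fast behaviour listed above can be sketched with a tiny hand-written table. The real `langpath.Globs` is backed by enry's generated Linguist data; the tables below, the glob shapes, and the meaning of `wantsAll` (assumed here to mean "no filter given") are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hypothetical miniature tables; the real package derives these from enry.
var (
	aliases    = map[string]string{"golang": "go", "js": "javascript", "ts": "typescript"}
	extensions = map[string][]string{
		"go":         {".go"},
		"javascript": {".js", ".mjs"},
		"typescript": {".ts"},
		"python":     {".py"},
	}
	filenames = map[string][]string{"dockerfile": {"Dockerfile"}}
)

// Globs maps language names to pathspec globs and rejects unknown names.
func Globs(langs []string) (globs []string, wantsAll bool, err error) {
	if len(langs) == 0 {
		return nil, true, nil // assumption: an empty filter means "everything"
	}
	for _, l := range langs {
		name := strings.ToLower(strings.TrimSpace(l))
		if canonical, ok := aliases[name]; ok {
			name = canonical
		}
		exts, hasExt := extensions[name]
		files, hasFile := filenames[name]
		if !hasExt && !hasFile {
			return nil, false, fmt.Errorf("unknown language: %q", l)
		}
		for _, e := range exts {
			globs = append(globs, "*"+e)
		}
		globs = append(globs, files...) // filename-only languages match by basename
	}
	sort.Strings(globs) // deterministic order
	return globs, false, nil
}

func main() {
	for _, in := range [][]string{{"golang"}, {"dockerfile"}, {"notalang"}} {
		g, all, err := Globs(in)
		fmt.Println(in, g, all, err)
	}
}
```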

Gates

| Gate | Target | Observed | Status |
| --- | --- | --- | --- |
| Wall-time drop on narrow filter | ≥ 30 % | 34 % | ✅ |
| Regression guard (no filter) | within ±5 % | −4 % | ✅ |
| JSON report per `--languages` value | byte-identical | yes | ✅ |
| `make lint` | 0 issues | 0 issues | ✅ |
| `make deadcode` | clean | clean | ✅ |
| `go test -race ./...` | clean | clean | ✅ |
| `langpath` coverage | ≥ 95 % | 100 % | ✅ |

Architecture

Go (Configure, once)                               C (per commit)
─────────────────────                              ─────────────
enry.data.ExtensionsByLanguage  ─┐
enry.data.LanguagesByFilename   ─┤  build []string of globs
enry.GetLanguageByAlias         ─┘             │
                                               │ cgo marshal
                                               ▼
                            const char** pathspec  ───►  opts.pathspec
                            size_t       n                     │
                                                               ▼
                                                  git_diff_tree_to_tree

The Go-side shouldIncludeChange filter remains as the precise post-pass — pathspec is deliberately over-inclusive for content-disambiguated extensions (.h, .pl, .m, .r).

* New package internal/analyzers/plumbing/pathpolicy with Exclude(path, content, opts) backed by enry.IsVendor and pkg/pathfilter generated heuristics. 100% covered, cross-language (Go, Node, Python, Ruby, Rust, Java, .NET, PHP, everything Linguist knows).
* Three new CLI flags on `codefang run`, applied identically to both static and history phases:
  - --include-vendored (bool, default false)
  - --include-generated (bool, default false)
  - --extra-excluded-prefixes (strings, default [])
* Default analysis output now excludes vendor + generated across both phases, matching eslint, rubocop, ruff, scalafix, phpcs convention. Migration: `--include-vendored --include-generated` restores the pre-change default.
* Deprecated legacy flags with cobra warnings:
  - --skip-blacklist → no-op now (new default already excludes)
  - --blacklisted-prefixes → migrate to --extra-excluded-prefixes
* Static pipeline: StaticService.PathPolicy field; hooks in both WalkDir visitors (rawFilePhase + streamFiles).
* History pipeline: TreeDiffAnalyzer.PathPolicy field; called from shouldIncludeChange as the first exclusion check. New ConfigTreeDiffPathPolicy fact key threads the options through Configure.
* Fix a pre-existing race in internal/framework.PipelineSampler: t1Captured was a plain bool concurrently read by the sampler goroutine and written by the caller. Converted to sync/atomic.Bool with CompareAndSwap so exactly one goroutine captures the t1 heap profile. Removed the unused t0Captured field.
* Chore: removed all `// FRD: specs/frds/FRD-*.md` comments from .go files. specs/ is gitignored so these references broke for anyone cloning the repo. Traceability stays in FRDs and PR descriptions.

Verification:
- go test -race ./...: green, zero DATA RACE, zero FAIL
- make lint: 0 issues
- make deadcode: clean
- pathpolicy statement coverage: 100%

End-to-end on a cross-language fixture (main.go + api.pb.go + vendor/dep/dep.go + node_modules/left-pad/index.js + testdata/sample.go):
- defaults → 1 function
- --include-vendored → 4
- --include-vendored --include-generated → 5
- --skip-blacklist (deprecated, prints warning) → 1

dmytrogajewski · 2026-04-19T17:19:42Z

Cross-phase vendor & generated exclusion + race fix

Commit

`06dfa5f` `feat: cross-phase vendor/generated exclusion + race fix + FRD cleanup`

What changed

New feature: cross-phase path-exclusion policy. Three CLI flags on `codefang run`, applied identically to both `-a 'static/*'` and `-a 'history/*'` runs:

| Flag | Default | Behaviour |
| --- | --- | --- |
| `--include-vendored` | `false` | Re-include `enry.IsVendor` paths (`vendor/`, `node_modules/`, `third_party/`, `testdata/`, minified bundles, …). |
| `--include-generated` | `false` | Re-include `.pb.go`, `zz_generated_.go`, `_pb2.py`, `.min.js`, and files with `DO NOT EDIT` / `Code generated` / `@generated` headers. |
| `--extra-excluded-prefixes` | `[]` | Extra UNIX path prefixes for ecosystems Linguist doesn't know about (`.venv/`, `target/`, `.gradle/`). |

Breaking change. Default analysis output now excludes vendored and generated files across both phases, matching the convention of eslint, rubocop, ruff, scalafix, and phpcs. Migration: `--include-vendored --include-generated` restores the pre-change default.

Deprecated with cobra warnings:

`--skip-blacklist` → no-op (new default already excludes)
`--blacklisted-prefixes` → migrate to `--extra-excluded-prefixes`

Architecture

New package `internal/analyzers/plumbing/pathpolicy` with one pure function:

```go
func Exclude(path string, content []byte, opts Options) bool
```

Composes `enry.IsVendor` with the existing `pkg/pathfilter` generated-file heuristics. Both static and history pipelines call the same helper — single source of truth, no phase-specific drift.

```
CLI --include-vendored/--include-generated/--extra-excluded-prefixes
                  │
                  ▼
          pathpolicy.Options
                  │
             ┌────┴────┐
             ▼         ▼
          Static    History
          WalkDir   TreeDiffAnalyzer
          visitors  shouldIncludeChange
```

E2E proof (cross-language fixture)

Fixture: `main.go` + `api.pb.go` + `vendor/dep/dep.go` + `node_modules/left-pad/index.js` + `testdata/sample.go`, `-a static/complexity`:

| Invocation | Total Functions |
| --- | --- |
| (defaults) | 1 |
| `--include-vendored` | 4 |
| `--include-vendored --include-generated` | 5 |
| `--skip-blacklist` (deprecated) | 1 (warning fires) |
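A stdlib-only sketch of how such a shared policy function could reproduce the fixture counts above, assuming one function per fixture file. The `Options` field names and the `isVendor` / `isGenerated` helpers are hypothetical stand-ins; the real package composes `enry.IsVendor` with the `pkg/pathfilter` heuristics.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// Options mirrors the three CLI flags; field names are assumptions.
type Options struct {
	IncludeVendored       bool
	IncludeGenerated      bool
	ExtraExcludedPrefixes []string
}

// isVendor is a simplified stand-in for enry.IsVendor.
func isVendor(path string) bool {
	for _, dir := range []string{"vendor/", "node_modules/", "third_party/", "testdata/"} {
		if strings.HasPrefix(path, dir) || strings.Contains(path, "/"+dir) {
			return true
		}
	}
	return false
}

// isGenerated is a simplified stand-in for the generated-file heuristics.
func isGenerated(path string, content []byte) bool {
	if strings.HasSuffix(path, ".pb.go") || strings.HasSuffix(path, ".min.js") {
		return true
	}
	head := content
	if len(head) > 512 {
		head = head[:512] // only inspect the file header
	}
	for _, marker := range []string{"DO NOT EDIT", "Code generated", "@generated"} {
		if bytes.Contains(head, []byte(marker)) {
			return true
		}
	}
	return false
}

// Exclude reports whether a file should be skipped by either pipeline.
func Exclude(path string, content []byte, opts Options) bool {
	for _, p := range opts.ExtraExcludedPrefixes {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	if !opts.IncludeVendored && isVendor(path) {
		return true
	}
	if !opts.IncludeGenerated && isGenerated(path, content) {
		return true
	}
	return false
}

// countKept applies the policy to the fixture paths from the E2E table.
func countKept(opts Options) int {
	fixture := []string{
		"main.go", "api.pb.go", "vendor/dep/dep.go",
		"node_modules/left-pad/index.js", "testdata/sample.go",
	}
	kept := 0
	for _, p := range fixture {
		if !Exclude(p, nil, opts) {
			kept++
		}
	}
	return kept
}

func main() {
	fmt.Println(countKept(Options{}))
	fmt.Println(countKept(Options{IncludeVendored: true}))
	fmt.Println(countKept(Options{IncludeVendored: true, IncludeGenerated: true}))
}
```

Keeping the decision in one pure function is what lets the static walker and the history tree-diff filter agree on every path.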

Also in this commit

Race fix — pre-existing data race in `internal/framework.PipelineSampler`:

`t1Captured bool` was concurrently read by the sampler goroutine (`sample`) and written by the caller (`CaptureT1`) — intermittent `DATA RACE` under `go test -race`, visible via `TestUniversalAnalyzers_MemoryLeak/Shotness`.
Converted to `sync/atomic.Bool` with `CompareAndSwap` — at most one goroutine captures the t1 heap profile.
Removed the unused `t0Captured` field.
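The capture-once pattern with `atomic.Bool` (Go 1.19+) can be sketched as below. The `sampler` type, its `captures` counter, and the `raceCapture` helper are invented for this example; only the `t1Captured` field name and the use of `CompareAndSwap` come from the description.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sampler is a minimal stand-in for PipelineSampler.
type sampler struct {
	t1Captured atomic.Bool
	captures   atomic.Int32 // counts how many times the capture body ran
}

// captureT1 runs the capture body at most once, no matter how many
// goroutines call it concurrently.
func (s *sampler) captureT1() bool {
	// Only the goroutine that flips false -> true proceeds.
	if !s.t1Captured.CompareAndSwap(false, true) {
		return false
	}
	s.captures.Add(1) // the heap profile would be written here
	return true
}

// raceCapture calls captureT1 from n goroutines and reports how many
// captures actually happened.
func raceCapture(n int) int32 {
	var s sampler
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.captureT1()
		}()
	}
	wg.Wait()
	return s.captures.Load()
}

func main() {
	fmt.Println("captures:", raceCapture(64))
}
```

With a plain `bool`, the check and the write are two separate steps that another goroutine can interleave with; `CompareAndSwap` makes them a single atomic step.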

Chore — stripped all `// FRD: specs/frds/FRD-*.md` comments from `.go` files. `specs/` is gitignored; those references broke for anyone cloning the repo. Traceability stays in the FRDs themselves and in PR descriptions.

Gates

| Gate | Status |
| --- | --- |
| `go test -race ./...` | ✅ zero `DATA RACE`, zero FAIL |
| `make lint` | ✅ 0 issues |
| `make deadcode` | ✅ clean |
| `go vet ./...` | ✅ clean |
| `pathpolicy` statement coverage | ✅ 100 % |
| Cross-phase defaults restore-path works | ✅ |
| Cobra deprecation warnings fire | ✅ |

Size

124 files changed. The line delta is large (+36k / −87k) because:

The FRD-comment sweep touched ~90 files (a line-removal per file).
`go fmt` rewrote a handful of files after the sweep.
The vendor/generated feature itself is ~400 lines of production + tests across 6 files.

Follow-ups in the separate roadmaps

Not included here:

Phase 4 of history-language roadmap: retire hand-curated `extensionToLanguage` in favour of the enry-backed init (`specs/optimize-lang/ROADMAP.md`).
Phase 5: full before/after benchmark harness for the `--languages` push-down.

processChildrenBatch shared ctx.batchChildren across recursive calls; an inner ensureBatchChildren reslice over the same backing array let the recursion overwrite outer-loop entries before the parent had read them. That dropped functions and made counts vary run-to-run on the same input. Snapshot the children into a local slice before iterating.

The halstead visitor and analyzer keyed per-function metrics by name only, silently collapsing same-named methods (e.g. multiple `Read` receivers in one Go file) and reporting len(map) as total_functions. Convert internal storage to a slice so every declaration is counted.

Add regression tests:
- pkg/uast: re-parse the same source 8 times with one Parser, assert the tree node and function counts match the first run.
- halstead: build a UAST with multiple identically-named functions, assert the visitor and the public report both keep one entry per declaration.
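The shared-buffer bug is a general Go slice-aliasing hazard and can be reproduced in a few lines. The sketch below is a simplified reconstruction of that class of bug, not the actual `processChildrenBatch` code; the `node` and `walker` types are invented for illustration.

```go
package main

import "fmt"

// node is a minimal tree node.
type node struct {
	name string
	kids []*node
}

// walker reuses one scratch buffer across recursive calls.
type walker struct{ batch []*node }

// children copies n's children into the shared buffer and returns it.
func (w *walker) children(n *node) []*node {
	w.batch = append(w.batch[:0], n.kids...)
	return w.batch
}

// countBuggy iterates the shared buffer directly, so a recursive call
// can overwrite entries the outer loop has not read yet.
func (w *walker) countBuggy(n *node) int {
	total := 1
	for _, k := range w.children(n) {
		total += w.countBuggy(k)
	}
	return total
}

// countFixed snapshots the children into a local slice before recursing.
func (w *walker) countFixed(n *node) int {
	total := 1
	kids := append([]*node(nil), w.children(n)...)
	for _, k := range kids {
		total += w.countFixed(k)
	}
	return total
}

// sampleTree builds root -> [A -> [C, D], B -> [E]], six nodes in total.
func sampleTree() *node {
	a := &node{name: "A", kids: []*node{{name: "C"}, {name: "D"}}}
	b := &node{name: "B", kids: []*node{{name: "E"}}}
	return &node{name: "root", kids: []*node{a, b}}
}

func main() {
	fmt.Println("buggy:", (&walker{}).countBuggy(sampleTree()))
	fmt.Println("fixed:", (&walker{}).countFixed(sampleTree()))
}
```

On this tree the buggy walk visits D twice and never reaches B or E, so it reports 5 nodes instead of 6. In the real code the effect depended on buffer capacity and tree shape, which is consistent with counts varying between runs.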

drwatsno and others added 9 commits April 3, 2026 00:24

feat(stats): stats per file

d534020

fix: remove specs/ from tracking — directory is in .gitignore

5b030f4

Accidentally force-added with git add -f in previous commit. Specs are local-only design documents, not tracked in version control. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

drwatsno added 3 commits May 4, 2026 22:53

feat(analyzer): add max-changes-per-commit flag to change this default

676bbcd

fix(tick): guarding against anomalous time in commits

f2641ca

dmytrogajewski merged commit 5c49d8e into main May 13, 2026
5 checks passed

dmytrogajewski merged 13 commits into main from feature/filestats
