Feature/filestats#18
…ion noise
Three bugs fixed in clone detection:
1. Clone ratio was pairs/functions (unbounded) instead of pairs/maxPairs where maxPairs=N*(N-1)/2. Now always [0,1].
2. Methods with the same name on different receivers (e.g. Foo.DeepCopyInto, Bar.DeepCopyInto) collided in the LSH index: the second insert overwrote the first. Now qualifies method names with receiver type.
3. Trivial one-liner functions (getters, setters, return-nil stubs) produced massive false positives. Added minFunctionNodes=20 threshold to skip functions with too few AST nodes for meaningful similarity comparison.
Includes fixture-based tests with real Kubernetes-derived code patterns (RBAC validation, event handlers, deepcopy) to validate detection quality.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
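The receiver collision in fix 2 can be illustrated with a minimal sketch. The `funcDecl` type, `qualifiedName`, and `indexBy` are hypothetical names invented for this illustration, and a plain map stands in for the LSH index; the only point is that a name-only key collapses same-named methods while a receiver-qualified key keeps them apart.

```go
package main

import "fmt"

// funcDecl is a hypothetical, simplified function record.
type funcDecl struct {
	Receiver string // empty for plain functions
	Name     string
}

// qualifiedName prefixes a method name with its receiver type so that
// Foo.DeepCopyInto and Bar.DeepCopyInto get distinct keys.
func qualifiedName(d funcDecl) string {
	if d.Receiver == "" {
		return d.Name
	}
	return d.Receiver + "." + d.Name
}

// indexBy stores declarations under the key produced by keyFn.
// A repeated key overwrites the earlier entry, as in any map insert.
func indexBy(decls []funcDecl, keyFn func(funcDecl) string) map[string]funcDecl {
	idx := make(map[string]funcDecl, len(decls))
	for _, d := range decls {
		idx[keyFn(d)] = d
	}
	return idx
}

func main() {
	decls := []funcDecl{
		{Receiver: "Foo", Name: "DeepCopyInto"},
		{Receiver: "Bar", Name: "DeepCopyInto"},
	}
	byName := indexBy(decls, func(d funcDecl) string { return d.Name })
	byQualified := indexBy(decls, qualifiedName)
	fmt.Println(len(byName), len(byQualified)) // prints "1 2"
}
```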
…duplication
The old ratio (pairs/maxPairs) was meaningless at scale: 22M pairs across 153K functions in Kubernetes produced 0.0019, displayed as "0.0" with score 10/10 despite massive duplication.
New ratio: distinct functions in at least one clone pair / total functions. This answers "what % of your codebase participates in duplication", the same metric humans understand and industry tools report.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
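The two ratios can be compared with a small sketch. The function and type names here (`pairRatio`, `participationRatio`, `clonePair`) are assumptions made for illustration, not the project's identifiers; the arithmetic follows the definitions in the commit messages above.

```go
package main

import "fmt"

// pairRatio is the earlier metric: clone pairs divided by the maximum
// possible number of pairs, N*(N-1)/2.
func pairRatio(pairs, functions int) float64 {
	if functions < 2 {
		return 0
	}
	maxPairs := float64(functions) * float64(functions-1) / 2
	return float64(pairs) / maxPairs
}

// clonePair is a hypothetical pair of function identifiers.
type clonePair struct{ A, B string }

// participationRatio is the newer metric: distinct functions that appear
// in at least one clone pair, divided by the total number of functions.
func participationRatio(pairs []clonePair, totalFunctions int) float64 {
	if totalFunctions == 0 {
		return 0
	}
	seen := make(map[string]struct{})
	for _, p := range pairs {
		seen[p.A] = struct{}{}
		seen[p.B] = struct{}{}
	}
	return float64(len(seen)) / float64(totalFunctions)
}

func main() {
	// Figures quoted above: 22M pairs across 153K functions.
	fmt.Printf("%.4f\n", pairRatio(22_000_000, 153_000)) // prints 0.0019

	// Three of four functions participate in some clone pair.
	pairs := []clonePair{{"a", "b"}, {"a", "c"}, {"b", "c"}}
	fmt.Printf("%.2f\n", participationRatio(pairs, 4)) // prints 0.75
}
```

The pair-based ratio shrinks quadratically as the function count grows, which is why it rounds to zero on large repositories; the participation ratio scales linearly and stays readable.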
…ents
Pipeline architecture refactor:
- Replace marker interfaces (FileContentAnalyzer, WalksAllFiles) with
first-class RawFileAnalyzer and FormattableAnalyzer pipeline stages
- StaticService uses pipeline.RunPhases with rawFilePhase + uastPhase
- Composition analyzer implements RawFileAnalyzer directly
Static analyzer output enrichment:
- source_file: relative file path on every function record (153K+ records)
- language: detected programming language on every function record
- directory: parent directory for DWH aggregation without path parsing
- Fields flow through TypedCollection → DetailedDataCollector → ComputedMetrics
History analyzer timestamps:
- start_time/end_time (RFC 3339) on all time-series ticks across
sentiment, anomaly, quality, devs (activity + churn), file-history
- TickBounds type and BuildTickBounds helper in analyze package
- Quality and devs buildTick() now populate TICK.StartTime/EndTime
Developer identity normalization:
- Split pipe-delimited "name|email" into separate name + email fields
- SplitIdentity() helper handles pipe, exact "name <email>", plain formats
- Affects DeveloperData, BusFactorData, DeveloperCouplingData
Output structure flattening for DWH:
- developers[].languages: map → sorted []LanguageStatsEntry array
- activity[].by_developer: map[int]int → []DeveloperCommits array
- file_contributors[].contributors: map → []ContributorEntry array
- Empty language strings replaced with "Other"
Output envelope enhancements:
- Top-level metadata: repo_path, repo_name, analyzed_at, codefang_version
- Per-analyzer schema manifest: FieldMeta{type, grain, description}
- NDJSON output format for streaming DWH ingestion
- Clone type distribution from full population (not capped sample)
Documentation:
- CHANGELOG.md with motivation-driven change descriptions
- Updated site docs: output-formats.md, complexity.md, developers.md,
sentiment.md, couples.md, file-history.md
- Updated AGENTS.md with new types and patterns
- HTML plot labels now show filename:funcName for context
Data quality score: 2.1/5 → 4.6/5 (verified on full kubernetes repo)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Accidentally force-added with git add -f in previous commit. Specs are local-only design documents, not tracked in version control. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive guide for using codefang output in data warehouses:
- Format selection (JSON vs NDJSON) with repo size guidelines
- Memory budget configuration to prevent OOM
- Commit limiting for fast iteration
- Key fields reference (source_file, language, directory, timestamps)
- Schema manifest usage for auto-generating ETL
- Full ClickHouse star schema DDL (dimensions + facts)
- ETL pipeline examples (Python, ClickHouse direct load)
- Analyzer selection by dashboard use case
- Performance tuning (workers, budget, first-parent, since)
- Row count estimates for capacity planning
- Materialized view examples for common queries
- Troubleshooting: OOM, empty analyzers, large coupling tables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ions
- Incremental cache: --cache-dir for daily DWH loads (skip processed commits)
- Checkpointing: --checkpoint for crash recovery on long runs
- Production pipeline example: cron + incremental + ClickHouse load
- Advanced tuning: blob-cache-size, diff-cache-size, commit-batch-size, blob-arena-size, tmp-dir flags with descriptions
- Checkpoint vs cache distinction explained
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested every parameter and statement against ~/sources/ioq3 (3784 commits). 14 of 15 tests passed. Fixes:
1. --cache-dir: add warning that incremental cache requires history-only mode (-a 'history/*'). Combined mode accepts the flag but does not produce cache files. Updated production pipeline example to split static and history phases.
2. --since: add note that empty results are normal when no commits fall within the time window. Static analyzers still run.
3. --checkpoint: add info box explaining auto-cleanup on success. Checkpoint files only persist after crashes, not successful runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously --languages was post-filtered in Go after libgit2 had already produced a full tree diff. On polyglot repos with a narrow filter that meant libgit2 was doing ~4x the tree-diff work it needed, and every delta paid an unnecessary cgo crossing before being dropped.
- New internal/analyzers/plumbing/langpath package: pure Go Globs(langs) -> (globs, wantsAll, err) backed by enry's generated Linguist dataset. Resolves aliases (golang, js, ts, dockerfile...) and fails fast on unknown languages. 100% test coverage.
- New C ABI cf_tree_diff_v2 accepts a pathspec array forwarded to git_diff_options.pathspec. Old cf_tree_diff retired.
- TreeDiffRequest.Pathspec, BlobPipeline.TreeDiffPathspec, and CoordinatorConfig.TreeDiffPathspec thread the pathspec from the analyzer through the pipeline.
- TreeDiffAnalyzer.applyLanguageConfig stores the canonical lowercase Linguist name in t.Languages (so the Go-side post-filter keys match enry.GetLanguage output on detected files) and pre-computes t.Pathspec.
- --languages notalang now fails at Configure with a clear error instead of silently producing an empty report.
Measured on a 500-commit x 200-file x 4-language synthetic fixture with --languages go: wall time 0.44s -> 0.29s (-34%), max RSS 74 MB -> 66 MB (-11%), cgocall cumulative CPU 800 ms -> 510 ms (-36%). JSON report byte-identical. Regression guard (no --languages filter): within noise.
The Go-side shouldIncludeChange filter remains as the precise post-pass; pathspec is deliberately over-inclusive for content-disambiguated extensions (.h, .pl, .m, .r).
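The language-to-pathspec resolution described in this commit can be sketched as follows. This is a simplified illustration under stated assumptions: the `aliases` and `patterns` tables are tiny hand-written stand-ins for enry's generated Linguist data, and the lowercase `globs` function only mirrors the `(globs, wantsAll, err)` shape quoted above rather than reproducing the real `langpath` package.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hypothetical tables standing in for enry's generated Linguist dataset.
var aliases = map[string]string{"golang": "go", "js": "javascript", "ts": "typescript"}

var patterns = map[string][]string{
	"go":         {"*.go"},
	"javascript": {"*.js", "*.mjs"},
	"typescript": {"*.ts", "*.tsx"},
	"dockerfile": {"Dockerfile"}, // filename-only language
}

// globs resolves language names or aliases to pathspec-style patterns.
// An empty filter means "all languages": no patterns, wantsAll = true.
func globs(langs []string) (out []string, wantsAll bool, err error) {
	if len(langs) == 0 {
		return nil, true, nil
	}
	seen := make(map[string]struct{})
	for _, l := range langs {
		name := strings.ToLower(strings.TrimSpace(l))
		if canon, ok := aliases[name]; ok {
			name = canon
		}
		pats, ok := patterns[name]
		if !ok {
			// Fail fast instead of silently producing an empty report.
			return nil, false, fmt.Errorf("unknown language: %q", l)
		}
		for _, p := range pats {
			if _, dup := seen[p]; !dup {
				seen[p] = struct{}{}
				out = append(out, p)
			}
		}
	}
	sort.Strings(out) // deterministic order for reproducible runs
	return out, false, nil
}

func main() {
	g, all, err := globs([]string{"golang", "dockerfile"})
	fmt.Println(g, all, err)

	_, _, err = globs([]string{"notalang"})
	fmt.Println(err)
}
```

In the real pipeline the resulting slice would be marshalled once at configure time and handed to the diff call for each commit; that part is omitted here because it depends on cgo.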
| Scenario | Before (pre-pushdown build 2026-04-09) | After | Δ |
|---|---|---|---|
| `--languages go` | 0.44 s | 0.29 s | −34 % |
| no filter (regression guard) | 0.51 s | 0.49 s | −4 % (within noise) |
Max RSS
| Scenario | Before | After | Δ |
|---|---|---|---|
| `--languages go` | 74.3 MB | 66.1 MB | −11 % |
| no filter | 79.7 MB | 79.5 MB | ≈ 0 % |
CPU profile (500-commit fixture, --languages go)
| Metric | Before | After | Δ |
|---|---|---|---|
| Profile duration | 448 ms | 270 ms | −40 % |
| Total samples | 1470 ms | 1070 ms | −27 % |
| `cgocall` cumulative | 800 ms | 510 ms | −36 % |
| Unique functions in profile | 286 | 209 | −27 % |
Correctness
- `--languages go` output identical to pre-pushdown build.
- `--languages golang` (alias) now resolves to the same report as `--languages go` instead of silently returning empty.
- `--languages notalang` fails fast at `Configure`: `Error: failed to configure TreeDiff: tree-diff pathspec: unknown language: "notalang"`
- `--languages dockerfile` (filename-only language) matches `Dockerfile` basename via libgit2 pathspec.
Gates
| Gate | Target | Observed | Status |
|---|---|---|---|
| Wall-time drop on narrow filter | ≥ 30 % | 34 % | ✅ |
| Regression guard (no filter) | within ±5 % | −4 % | ✅ |
| JSON report per `--languages` value | byte-identical | yes | ✅ |
| `make lint` | 0 issues | 0 issues | ✅ |
| `make deadcode` | clean | clean | ✅ |
| `go test -race ./...` | clean | clean | ✅ |
| `langpath` coverage | ≥ 95 % | 100 % | ✅ |
Architecture
Go (Configure, once) C (per commit)
───────────────────── ─────────────
enry.data.ExtensionsByLanguage ─┐
enry.data.LanguagesByFilename ─┤ build []string of globs
enry.GetLanguageByAlias ─┘ │
│ cgo marshal
▼
const char** pathspec ───► opts.pathspec
size_t n │
▼
git_diff_tree_to_tree
The Go-side shouldIncludeChange filter remains as the precise post-pass — pathspec is deliberately over-inclusive for content-disambiguated extensions (.h, .pl, .m, .r).
* New package internal/analyzers/plumbing/pathpolicy with Exclude(path, content, opts) backed by enry.IsVendor and pkg/pathfilter generated heuristics. 100% covered, cross-language (Go, Node, Python, Ruby, Rust, Java, .NET, PHP: everything Linguist knows).
* Three new CLI flags on `codefang run`, applied identically to both static and history phases:
  - --include-vendored (bool, default false)
  - --include-generated (bool, default false)
  - --extra-excluded-prefixes (strings, default [])
* Default analysis output now excludes vendor + generated across both phases, matching eslint, rubocop, ruff, scalafix, phpcs convention. Migration: `--include-vendored --include-generated` restores the pre-change default.
* Deprecated legacy flags with cobra warnings:
  - --skip-blacklist → no-op now (new default already excludes)
  - --blacklisted-prefixes → migrate to --extra-excluded-prefixes
* Static pipeline: StaticService.PathPolicy field; hooks in both WalkDir visitors (rawFilePhase + streamFiles).
* History pipeline: TreeDiffAnalyzer.PathPolicy field; called from shouldIncludeChange as the first exclusion check. New ConfigTreeDiffPathPolicy fact key threads the options through Configure.
* Fix a pre-existing race in internal/framework.PipelineSampler: t1Captured was a plain bool concurrently read by the sampler goroutine and written by the caller. Converted to sync/atomic.Bool with CompareAndSwap so exactly one goroutine captures the t1 heap profile. Removed the unused t0Captured field.
* Chore: removed all `// FRD: specs/frds/FRD-*.md` comments from .go files. specs/ is gitignored so these references broke for anyone cloning the repo. Traceability stays in FRDs and PR descriptions.
Verification:
- go test -race ./...: green, zero DATA RACE, zero FAIL
- make lint: 0 issues
- make deadcode: clean
- pathpolicy statement coverage: 100%
End-to-end on a cross-language fixture (main.go + api.pb.go + vendor/dep/dep.go + node_modules/left-pad/index.js + testdata/sample.go):
- defaults → 1 function
- --include-vendored → 4
- --include-vendored --include-generated → 5
- --skip-blacklist (deprecated, prints warning) → 1
Cross-phase vendor & generated exclusion + race fix
Commit `06dfa5f` `feat: cross-phase vendor/generated exclusion + race fix + FRD cleanup`
What changed
New feature: cross-phase path-exclusion policy. Three CLI flags on `codefang run`, applied identically to both `-a 'static/'` and `-a 'history/'` runs:
Breaking change. Default analysis output now excludes vendor + generated across both phases, matching eslint, rubocop, ruff, scalafix, phpcs convention. Migration: `--include-vendored --include-generated` restores today's default.
Deprecated with cobra warnings:
Architecture
New package `internal/analyzers/plumbing/pathpolicy` with one pure function, which composes `enry.IsVendor` with the existing `pkg/pathfilter` generated-file heuristics. Both static and history pipelines call the same helper: single source of truth, no phase-specific drift.
E2E proof (cross-language fixture)
Fixture: `main.go` + `api.pb.go` + `vendor/dep/dep.go` + `node_modules/left-pad/index.js` + `testdata/sample.go`, `-a static/complexity`:
Also in this commit
Race fix: pre-existing data race in `internal/framework.PipelineSampler`:
Chore: stripped all `// FRD: specs/frds/FRD-*.md` comments from `.go` files. `specs/` is gitignored; those references broke for anyone cloning the repo. Traceability stays in the FRDs themselves and in PR descriptions.
Gates
Size
124 files changed. The line delta is large (+36k / −87k) because:
Follow-ups in the separate roadmaps
Not included here:
processChildrenBatch shared ctx.batchChildren across recursive calls; an inner ensureBatchChildren reslice over the same backing array let the recursion overwrite outer-loop entries before the parent had read them. That dropped functions and made counts vary run-to-run on the same input. Snapshot the children into a local slice before iterating.
The halstead visitor and analyzer keyed per-function metrics by name only, silently collapsing same-named methods (e.g. multiple `Read` receivers in one Go file) and reporting len(map) as total_functions. Convert internal storage to a slice so every declaration is counted.
Add regression tests:
- pkg/uast: re-parse the same source 8 times with one Parser, assert the tree node and function counts match the first run.
- halstead: build a UAST with multiple identically-named functions, assert the visitor and the public report both keep one entry per declaration.
Analytics Readiness & DWH Suitability
Motivation: A comprehensive data analyst review of Codefang's JSON output revealed that while the data was analytically rich (17 analyzers, 1M+ function-level rows, time-series, coupling data), it was structurally hostile to analytics tooling and DWH loading. Function records had bare names with no file paths, time-series ticks had no calendar dates, developer identities used pipe-delimited strings, and nested maps blocked efficient columnar ingestion. This release systematically fixes every identified blocker, raising the data quality score from 2.1/5 to 4.6/5.
Architecture: Pipeline Stage Refactor
`RawFileAnalyzer` and `FormattableAnalyzer` interfaces
Replaced the `FileContentAnalyzer` + `WalksAllFiles` marker interface pattern with a proper pipeline stage architecture.
Before: Analyzers that needed raw file access (not UAST) had to implement `StaticAnalyzer` with a no-op `Analyze(*node.Node)`, plus two marker interfaces discovered at runtime via type assertions.
After: Two clean interface hierarchies, `StaticAnalyzer` for UAST-based analysis and `RawFileAnalyzer` for raw file analysis, both embedding a shared `FormattableAnalyzer` base. `StaticService` holds separate slices. `AnalyzeFolder` uses `pipeline.RunPhases` with explicit `rawFilePhase` and `uastPhase` stages.
Why it matters for BI: The pipeline refactor enabled `StampSourceFile` to receive `rootPath` and convert all file paths to relative, a prerequisite for portable DWH data. It also enabled `StampLanguage` to inject detected language into every function record.
Files changed:
- `internal/analyzers/analyze/analyzer.go`: new `FormattableAnalyzer`, `RawFileAnalyzer` interfaces; `StaticAnalyzer` refactored to embed `FormattableAnalyzer`
- `internal/analyzers/analyze/static.go`: `StaticService` gains `UASTAnalyzers` + `RawFileAnalyzers` slices; `AnalyzeFolder` uses `pipeline.RunPhases`
- `internal/analyzers/composition/analyzer.go`: implements `RawFileAnalyzer` directly (removed no-op `Analyze`, `NeedsAllFiles`)
- `internal/analyzers/analyze/registry.go`: `NewRegistry` accepts three slices
- `cmd/codefang/commands/run.go`: split `defaultStaticAnalyzers` into `defaultUASTAnalyzers` + `defaultRawFileAnalyzers`
- `internal/analyzers/analyze/perfile.go`: `PerFileEnricher` uses `[]FormattableAnalyzer`
- `internal/analyzers/common/renderer/json.go`: `EnrichWithPerFileData` uses `[]FormattableAnalyzer`
Static Analyzers: New Fields on Every Function Record
`source_file`: file path on every function record
Motivation: 152,000+ function records in the JSON output had bare names like `"ForKind"` with no indication of which file they belonged to. This made it impossible to join function metrics to file-level data, build file heatmaps, or drill down from "bad function" to "where in the repo."
Root cause: The `_source_file` stamping mechanism existed and worked through aggregation, but `FormatReportBinary` called `ComputeAllMetrics`, which parsed `[]map[string]any` items into typed structs. Those structs had no `SourceFile` field, silently dropping the value during struct conversion.
Fix: Added `SourceFile string` to all input `FunctionData` and output data structs (`FunctionComplexityData`, `FunctionHalsteadData`, `FunctionCohesionData`, all comment data structs, `HighRiskFunctionData`, `HighEffortFunctionData`, `LowCohesionFunctionData`, `UndocumentedFunctionData`). Populated from the `_source_file` map key during `parseFunctionData` → `Compute()`. Updated `StampSourceFile` to accept `rootPath` and convert to relative via `MakeRelativePath`.
JSON output key: `"source_file"` (relative path, e.g., `"pkg/kubelet/kubelet.go"`)
Analyzers affected: `static/complexity`, `static/halstead`, `static/cohesion`, `static/comments`
`language`: programming language on every function record
Motivation: Analysts had to infer language from file extension at query time. The parser already knows the language.
Fix: Added `LanguageKey` constant, `StampLanguage()` function, and `Language` field to `TypedCollection` struct. Language is stamped in `analyzeFilesParallel` via `parser.GetLanguage(filePath)` and propagated through `TypedCollection` → `DetailedDataCollector.buildItems()` → `stampCollectionMetadata()` to reach the output structs.
JSON output key: `"language"` (e.g., `"go"`, `"bash"`)
Analyzers affected: `static/complexity`, `static/halstead`, `static/cohesion`, `static/comments`
`directory`: parent directory on every function record
Motivation: Directory-level aggregation (e.g., "which package has the worst complexity") requires parsing file paths at query time, which is expensive in columnar DWH.
Fix: Added `DirectoryKey` constant and `Directory` field to `TypedCollection`. Stamped as `filepath.Dir(relativePath)` inside `StampSourceFile`. Propagated via `stampCollectionMetadata()` alongside language.
JSON output key: `"directory"` (e.g., `"pkg/kubelet"`)
Analyzers affected: `static/complexity`, `static/halstead`, `static/cohesion`, `static/comments`
History Analyzers: Tick Timestamps
`start_time` / `end_time` on every time-series tick
Motivation: All 6 history time-series analyzers emitted `tick: <int>` with no calendar date. Every time-series chart had an unlabeled X-axis. The `TICK` struct already carried `StartTime`/`EndTime` internally but didn't export them.
Fix: Created `TickBounds` type and `BuildTickBounds(ticks []TICK)` helper. Each analyzer's `ticksToReport` adds `tick_bounds` to the Report. Each `ParseReportData` reads it. Each time-series output struct gains `StartTime`/`EndTime` string fields (RFC 3339). For quality and devs analyzers, added timestamp tracking to their tick accumulators (`tickAccumulator.startTime/endTime`, `TickDevData.startTime/endTime`) with min/max tracking in `extractTC` and population in `buildTick`.
JSON output keys: `"start_time"`, `"end_time"` (RFC 3339, e.g., `"2024-01-15T10:30:00Z"`)
Analyzers affected: `history/sentiment`, `history/anomaly`, `history/quality`, `history/devs` (activity + churn), `history/file-history` (composition_ts)
Developer Identity Normalization
Split pipe-delimited names into `name` + `email`
Motivation: Developer identity used `"daniel smith|dbsmith@google.com"` pipe-delimited strings from `ReversedPeopleDict`. This blocked clean dimension table creation in DWH systems.
Fix: Created `SplitIdentity(s string) (name, email string)` in `internal/identity/split.go`. Handles pipe-delimited, exact `"name <email>"`, and plain name formats. Updated `devName()` → `devNameAndEmail()` and `getDevName()` → `getDevNameAndEmail()`.
Fields added:
- `DeveloperData`: `email` field
- `BusFactorData`: `primary_dev_email`, `secondary_dev_email`
- `DeveloperCouplingData`: `developer1_email`, `developer2_email`
Analyzers affected: `history/devs`, `history/couples`
Output Structure: Flattened Arrays
`developers[].languages`: map → array
Motivation: `map[string]LineStats` with variable language-name keys cannot be UNNEST'd in columnar DWH without custom ETL.
Fix: Changed `DeveloperData.Languages` from `map[string]pkgplumbing.LineStats` to `[]LanguageStatsEntry`. Internal accumulation uses unexported `langMap`, converted to sorted array via `finalizeLanguages()`. Empty language strings replaced with `"Other"`.
Before: `{"Go": {"added": 100, "removed": 5, "changed": 3}}`
After: `[{"language": "Go", "added": 100, "removed": 5, "changed": 3}]`
`activity[].by_developer`: map → array
Motivation: `map[int]int` (dev_id → commit_count) serializes to JSON with string keys, blocking typed ingestion.
Fix: Changed to `[]DeveloperCommits` with `{dev_id, commits}` fields. Sorted by dev_id for deterministic output.
Before: `{"2": 5, "3": 3}`
After: `[{"dev_id": 2, "commits": 5}, {"dev_id": 3, "commits": 3}]`
`file_contributors[].contributors`: map → array
Motivation: `map[int]LineStats` blocked DWH UNNEST.
Fix: Changed to `[]ContributorEntry` with `{dev_id, added, removed, changed}` fields. Sorted by dev_id.
Before: `{"2": {"added": 42, "removed": 5, "changed": 3}}`
After: `[{"dev_id": 2, "added": 42, "removed": 5, "changed": 3}]`
Output Envelope
Top-level `metadata` section
Motivation: A DWH ingesting reports from multiple repos could not distinguish them. No repo name, analysis timestamp, or version.
Fix: Added `AnalysisMetadata` struct with `repo_path`, `repo_name` (from `filepath.Base`), `analyzed_at` (RFC 3339), `codefang_version` (from build ldflags). Injected after `DecodeCombinedBinaryReports` in the combined render path.
{ "version": "codefang.run.v1", "metadata": { "repo_path": "/home/user/sources/kubernetes", "repo_name": "kubernetes", "analyzed_at": "2026-04-07T23:33:00Z", "codefang_version": "dev" }, "analyzers": [...] }
Per-analyzer `schema` manifest
Motivation: DWH consumers need to know field types, grain, and cardinality for automated ETL generation.
Fix: Added `FieldMeta` struct with `{type, grain, description}` and static `analyzerSchemas` registry covering all 17 analyzers. Each `AnalyzerResult` in the output includes a `schema` field.
{ "id": "static/complexity", "schema": { "function_complexity": { "type": "list", "grain": "function", "description": "Per-function cyclomatic and cognitive complexity" } }, "report": {...} }
NDJSON output format
Motivation: The monolithic JSON (467MB for kubernetes) must be fully parsed to extract any single analyzer. NDJSON enables streaming ingestion into ClickHouse.
Fix: Added `FormatNDJSON` case to `WriteConvertedOutput`. One JSON line per analyzer result, with optional metadata line prepended.
codefang run --format ndjson /repo > output.ndjson
Clone Analysis
`clone_type_distribution` from full population
Motivation: Clone pairs are capped at 1,000 in the output, but the distribution metrics (Type-1/2/3 breakdown) were computed from the capped sample, skewing percentages for large codebases with 22M+ total pairs.
Fix: Added `typeDistribution cloneTypeCounts` to `clonePairResult`. `matchCandidates` increments per-type counters for ALL valid pairs before the cap check. Both aggregator and per-file paths emit `clone_type_distribution` in the report. `ReportSection.Distribution()` reads from the full-population distribution.
Before: Distribution from 1,000 capped pairs
After: Distribution from 22,381,694 total pairs: `{"Type-1": 12366266, "Type-2": 3307147, "Type-3": 6708281}`
Relative paths in clone pairs
Clone pair `func_a`/`func_b` paths changed from absolute (`/home/user/sources/repo/file.go::funcName`) to relative (`cmd/controller/app.go::newController`). Enabled by the `StampSourceFile` `rootPath` change.
New Files Created
- `internal/analyzers/analyze/tick_bounds.go`: `TickBounds` type + `BuildTickBounds` helper
- `internal/analyzers/analyze/metadata.go`: `AnalysisMetadata` struct + `NewAnalysisMetadata` constructor
- `internal/analyzers/analyze/schema_registry.go`
- `internal/identity/split.go`: `SplitIdentity(s string) (name, email string)`
Empty Analyzer Root Causes (Documented)
Investigation of 4 analyzers that returned empty data on kubernetes (1000 commits):
- `burndown.developer_survival`: `Burndown.TrackPeople: false`
- `burndown.file_survival`: `Burndown.TrackFiles: false`
- `history/imports`: `NeedsUAST() = true`
- `history/typos`: `NeedsUAST() = true`