diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md index 306104eca..cbf38a358 100644 --- a/docs/BENCHMARKS.md +++ b/docs/BENCHMARKS.md @@ -11,7 +11,7 @@ The chart below uses a log-log scatter plot: file count on the x-axis, wall-cloc ![Scan duration vs. file count for Provenant and ScanCode](benchmarks/scan-duration-vs-files.svg) -> Provenant is faster on 145 of 147 recorded runs, with a **11.6× median speedup** and **10.2× geometric-mean speedup** overall; the median gap grows from **6.4×** on sub-100-file targets to **20.1×** on 10k+ file targets. +> Provenant is faster on 148 of 150 recorded runs, with a **11.7× median speedup** and **10.2× geometric-mean speedup** overall; the median gap grows from **6.4×** on sub-100-file targets to **20.1×** on 10k+ file targets. > Generated from the benchmark timing rows in this document via `cargo run --manifest-path xtask/Cargo.toml --bin generate-benchmark-chart`. ## Current benchmark examples @@ -199,6 +199,8 @@ The tables below provide the per-target detail behind the chart. Each row is one | Target snapshot | Run context | Timing snapshot | Advantages over ScanCode | | -------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [commercialhaskell/stack @ cb6070f](https://github.com/commercialhaskell/stack/tree/cb6070feb211ddb305ee2384c86932ffeef76cbe)
1,110 files | 2026-04-17 · stack-72934 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 9 proc | Provenant: 15.49s
ScanCode: 167.47s
**10.81× faster (-90.8%)** | Far broader Hackage package and dependency extraction (`76` vs `1` packages, `524` vs `4` dependencies) from the root `stack.cabal`, `stack.yaml`, `cabal.project`, and committed integration-fixture manifests, with richer maintainer identity on Cabal metadata | +| [HaxeFlixel/flixel @ ec54c5a](https://github.com/HaxeFlixel/flixel/tree/ec54c5a582b252de3aca69283045719d3201778b)
446 files | 2026-04-22 · flixel-45256 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 10.70s
ScanCode: 135.43s
**12.66× faster (-92.1%)** | Matched Haxe package and dependency coverage on the repo-root `haxelib.json`, with compound `LicenseRef-scancode-public-domain AND OFL-1.1` font licensing on `assets/fonts/monsterrat.ttf` instead of split duplicate detections, cleaner URL normalization across docs and snippets, and much faster same-host runtime | +| [HeapsIO/heaps @ d2992b0](https://github.com/HeapsIO/heaps/tree/d2992b061db3f51b47cdb87c39d659a5bb96dd83)
666 files | 2026-04-22 · heaps-50135 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 10.63s
ScanCode: 169.15s
**15.91× faster (-93.7%)** | Matched Haxe package and dependency coverage on the repo-root `haxelib.json`, with cleaner copyright and holder recovery on `hxd/fmt/fbx/Writer.hx` and `samples/text_res/trueTypeFont.ttf`, safer trailing-slash URL normalization, and much faster same-host runtime | | [jgm/pandoc @ d9838eb](https://github.com/jgm/pandoc/tree/d9838eba11ae18216f52e233dbbca735f0f97ccb)
2,768 files | 2026-04-17 · pandoc-69673 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 9 proc | Provenant: 22.78s
ScanCode: 332.82s
**14.61× faster (-93.2%)** | Broader mixed Hackage and Nix package extraction (`5` vs `0` packages, `197` vs `0` dependencies) from sibling `pandoc*.cabal` manifests, `stack.yaml`, and `flake.nix` / `flake.lock`, with explicit package identities across `pandoc`, `pandoc-cli`, `pandoc-lua-engine`, and `pandoc-server` | | [JuliaLang/julia @ afc71c2](https://github.com/JuliaLang/julia/tree/afc71c255e327d8a64b69061c15994e80740974d)
1,948 files | 2026-04-19 · julia-15784 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 25.28s
ScanCode: 549.75s
**21.75× faster (-95.4%)** | Direct Julia package visibility and much broader dependency extraction (`115` vs `0` packages, `240` vs `0` dependencies) from stdlib, test, and nested `Project.toml` / `Manifest.toml` pairs across the tree, with richer author recovery on Julia metadata and cleaner rejection of prose-only copyright or holder noise | | [JuliaLang/Pkg.jl @ c96cfdf](https://github.com/JuliaLang/Pkg.jl/tree/c96cfdf70976e8a5cc21fcef53c0ba137f6b2f64)
486 files | 2026-04-19 · Pkg.jl-15780 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 13.20s
ScanCode: 96.27s
**7.29× faster (-86.3%)** | Direct Julia package visibility and much broader dependency extraction (`98` vs `0` packages, `150` vs `0` dependencies) from `Project.toml`, `Manifest.toml`, and sibling project-plus-manifest assembly across root, docs, and test fixture trees, with safer URL credential stripping in Julia metadata examples | @@ -209,6 +211,7 @@ The tables below provide the per-target detail behind the chart. Each row is one | [ocaml/dune @ b13ab94](https://github.com/ocaml/dune/tree/b13ab949e185a205a39eb6163eea050b7d60a047)
7,751 files | 2026-04-22 · dune-32635 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 9 proc | Provenant: 20.74s
ScanCode: 519.01s
**25.02× faster (-96.0%)** | Broader opam and Nix package visibility (`4` vs `2` packages, `130` vs `116` dependencies) from the generated `opam/*.opam` manifests and `flake.lock`, with structured opam description, maintainer, and dependency recovery instead of ScanCode's field-bleeding author text on those manifests | | [ocaml/merlin @ 30b4f24](https://github.com/ocaml/merlin/tree/30b4f24fdd76fdbf32685aac73de7fd4a6ff7470)
2,120 files | 2026-04-22 · merlin-47624 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 9 proc | Provenant: 31.93s
ScanCode: 656.13s
**20.55× faster (-95.1%)** | Direct opam package visibility (`1` vs `0` packages) with broader dependency extraction (`27` vs `24`) from the repo-root `merlin*.opam`, `dot-merlin-reader.opam`, `ocaml-index.opam`, and `flake.lock` surfaces, plus Unicode-preserving copyright normalization across the Merlin source tree | | [ocaml/ocaml-lsp @ 788ff73](https://github.com/ocaml/ocaml-lsp/tree/788ff738991189537141776bfa07652547bff9c4)
546 files | 2026-04-22 · ocaml-lsp-41966 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 9 proc | Provenant: 13.83s
ScanCode: 185.33s
**13.40× faster (-92.5%)** | Broader opam package visibility (`3` vs `1` packages) with slightly richer dependency extraction (`380` vs `376`) from the root and submodule `.opam` manifests plus `flake.lock`, with cleaner maintainer and email recovery on opam metadata and Unicode-preserving copyright normalization | +| [openfl/openfl @ 74d8f72](https://github.com/openfl/openfl/tree/74d8f72890b9ae70bba589d034ea35b86588e548)
1,196 files | 2026-04-22 · openfl-32439 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 12.77s
ScanCode: 216.36s
**16.94× faster (-94.1%)** | Matched Haxe package and dependency coverage on the repo-root `haxelib.json`, with richer bundled Windows executable identity on `assets/templates/bin/openfl.exe`, extra Docker package visibility on `Dockerfile`, cleaner URL normalization across shipped font metadata, and much faster same-host runtime | | [univention/Nubus @ fef2258](https://github.com/univention/Nubus/tree/fef2258483c56cce0e1f14e4c8d8fce24d26b891)
16 files | 2026-04-19 · Nubus-321 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 10.53s
ScanCode: 72.03s
**6.84× faster (-85.4%)** | Direct `publiccode.yml` package visibility on the root metadata file (`1` vs `0` on that file), with cleaner SPDX copyright placeholder normalization for `Univention GmbH` and the same zero-scan-error behavior under the shared profile | | [yesodweb/yesod @ 1b033c7](https://github.com/yesodweb/yesod/tree/1b033c741ce81d01070de993b285a17e71178156)
324 files | 2026-04-17 · yesod-71400 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 9 proc | Provenant: 10.62s
ScanCode: 99.03s
**9.32× faster (-89.3%)** | Broader multi-package Hackage extraction (`16` vs `0` packages, `391` vs `0` dependencies) from the repo's many sibling `yesod-*/*.cabal` manifests, with explicit package identities across the Yesod family where ScanCode stays manifest-blind | diff --git a/docs/benchmarks/scan-duration-vs-files.svg b/docs/benchmarks/scan-duration-vs-files.svg index ae3e81b15..f029f07c1 100644 --- a/docs/benchmarks/scan-duration-vs-files.svg +++ b/docs/benchmarks/scan-duration-vs-files.svg @@ -200,6 +200,9 @@ ScanCode: 97.37s vernemq/vernemq @ 4681e54 Files: 441 ScanCode: 149.29s + HaxeFlixel/flixel @ ec54c5a +Files: 446 +ScanCode: 135.43s tidyverse/dplyr @ 2f9f49e Files: 462 ScanCode: 170.71s @@ -233,6 +236,9 @@ ScanCode: 214.30s rpm-software-management/dnf @ e47634f Files: 655 ScanCode: 203.47s + HeapsIO/heaps @ d2992b0 +Files: 666 +ScanCode: 169.15s boostorg/json @ 70efd4b Files: 701 ScanCode: 150.19s @@ -275,6 +281,9 @@ ScanCode: 178.35s rpm-software-management/libdnf @ d395731 Files: 1162 ScanCode: 168.27s + openfl/openfl @ 74d8f72 +Files: 1196 +ScanCode: 216.36s OpenMDAO/OpenMDAO @ bf1fcb6 Files: 1199 ScanCode: 298.91s @@ -643,6 +652,9 @@ Provenant: 9.33s vernemq/vernemq @ 4681e54 Files: 441 Provenant: 13.90s + HaxeFlixel/flixel @ ec54c5a +Files: 446 +Provenant: 10.70s tidyverse/dplyr @ 2f9f49e Files: 462 Provenant: 13.86s @@ -676,6 +688,9 @@ Provenant: 16.74s rpm-software-management/dnf @ e47634f Files: 655 Provenant: 14.37s + HeapsIO/heaps @ d2992b0 +Files: 666 +Provenant: 10.63s boostorg/json @ 70efd4b Files: 701 Provenant: 32.30s @@ -718,6 +733,9 @@ Provenant: 14.46s rpm-software-management/libdnf @ d395731 Files: 1162 Provenant: 13.65s + openfl/openfl @ 74d8f72 +Files: 1196 +Provenant: 12.77s OpenMDAO/OpenMDAO @ bf1fcb6 Files: 1199 Provenant: 17.94s diff --git a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md index d393114e1..1bbb6268c 
100644 --- a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md +++ b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md @@ -111,7 +111,7 @@ The ranking below is ordered by **practical verification value first**: broad ec | 37 | FreeBSD | ⚪ Planned | FreeBSD `pkg` package archive sample
FreeBSD `bash` package archive sample
FreeBSD `curl` package archive sample | Important artifact-family support, but narrower day-to-day scan prevalence than the higher-priority distro lanes. | | 38 | Chef | 🟢 Verified | `sous-chefs/apache2` (<500 files)
`sous-chefs/mysql` (<500 files)
`chef/chef` (2k–10k files) | Worth covering, but lower priority than the mainstream language and distro families. | | 39 | Bower | 🟢 Verified | `jquery/jquery-ui` (500–2k files)
`select2/select2` (<500 files)
`jashkenas/backbone` (<500 files) | Legacy ecosystem with ongoing value mostly for backward compatibility. | -| 40 | Haxe | ⚪ Planned | `openfl/openfl` (500–2k files)
`HaxeFlixel/flixel` (500–2k files)
`HeapsIO/heaps` (500–2k files) | Smaller ecosystem; still useful, but lower-value than the broader mainstream families above. | +| 40 | Haxe | 🟢 Verified | `openfl/openfl` (500–2k files)
`HaxeFlixel/flixel` (500–2k files)
`HeapsIO/heaps` (500–2k files) | Smaller ecosystem; still useful, but lower-value than the broader mainstream families above. | | 41 | Windows Update | ⚪ Planned | `wsusscn2.cab` extracted tree
Windows cumulative update `.msu` extracted tree
Windows servicing stack update extracted tree | Artifact-oriented family with real value, but specialized and best handled after the higher-signal source/package ecosystems. | | 42 | `misc.py` recognizers | ⚪ Planned | Apache Tomcat binary release artifacts
Firefox add-on / language-pack artifacts
NSIS official installer artifacts | Broad recognizer family, but not a normal package-manager lane; treat as specialized follow-up verification. | | 43 | Julia | 🟢 Verified | `JuliaLang/Pkg.jl` (500–2k files)
`JuliaLang/julia` (10k–50k files)
`JuliaPlots/Plots.jl` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `JuliaLang/Pkg.jl` is the canonical `Project.toml` and `Manifest.toml` reference, `JuliaLang/julia` adds a large real-world Julia project tree, and `JuliaPlots/Plots.jl` is a mid-sized consumer library. Focus on correct `Project.toml` metadata extraction, `Manifest.toml` resolved dependency coverage, and sibling assembly of project-plus-manifest pairs. |
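
Review note: the "N× faster (-P%)" cells on the three added Haxe rows can be rechecked from the two timings in each row. The sketch below is a minimal check, assuming the speedup is ScanCode time divided by Provenant time and the percentage is the relative reduction in wall-clock time. The helper names (`speedup`, `reduction_pct`, `format_cell`) are hypothetical and are not taken from the `generate-benchmark-chart` xtask, which may compute or round these values differently.

```python
from statistics import geometric_mean, median


def speedup(provenant_s: float, scancode_s: float) -> float:
    """Assumed definition: how many times faster Provenant ran than ScanCode."""
    return scancode_s / provenant_s


def reduction_pct(provenant_s: float, scancode_s: float) -> float:
    """Assumed definition: relative wall-clock reduction, in percent."""
    return (1 - provenant_s / scancode_s) * 100


def format_cell(provenant_s: float, scancode_s: float) -> str:
    """Render the summary text in the same shape as the table cells (without bold markers)."""
    ratio = speedup(provenant_s, scancode_s)
    pct = reduction_pct(provenant_s, scancode_s)
    return f"{ratio:.2f}× faster (-{pct:.1f}%)"


# (Provenant seconds, ScanCode seconds) copied from the three Haxe rows in this diff.
haxe_rows = {
    "HaxeFlixel/flixel": (10.70, 135.43),
    "HeapsIO/heaps": (10.63, 169.15),
    "openfl/openfl": (12.77, 216.36),
}

for target, (provenant_s, scancode_s) in haxe_rows.items():
    print(f"{target}: {format_cell(provenant_s, scancode_s)}")

# Summary statistics of the same kind as the headline sentence, restricted to
# these three rows only. The headline figures cover all recorded runs, so they
# cannot be reproduced from this subset.
ratios = [speedup(p, s) for p, s in haxe_rows.values()]
subset_median = median(ratios)
subset_geomean = geometric_mean(ratios)
```

Under those assumptions the three rows come out as stated in the diff (12.66×, 15.91×, and 16.94×), so the new table cells are internally consistent with their timings.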