From 85b018100b88c72015783f2ea5bddc8f55bd04af Mon Sep 17 00:00:00 2001
From: Maxim Stykow
Date: Wed, 22 Apr 2026 18:13:39 +0200
Subject: [PATCH 1/6] feat(parser): add Erlang/OTP parser for app.src, rebar.config, rebar.lock

Add net-new Erlang/OTP package metadata support with three parsers backed by a native Erlang term parser. No Python ScanCode reference exists for this ecosystem.

Signed-off-by: Maxim Stykow
---
 docs/SUPPORTED_FORMATS.md | 3 +
 .../package-detection/PARSER_PLAN.md | 2 +-
 .../PARSER_VERIFICATION_SCORECARD.md | 116 +-
 docs/improvements/erlang-otp-parser.md | 49 +
 src/assembly/assemblers.rs | 7 +
 src/models/datasource_id.rs | 8 +
 src/parsers/erlang_otp.rs | 1048 +++++++++++++++++
 src/parsers/erlang_otp_golden_test.rs | 63 +
 src/parsers/erlang_otp_test.rs | 366 ++++++
 src/parsers/golden_test.rs | 2 +
 src/parsers/mod.rs | 7 +
 testdata/erlang-otp-golden/lager.app.src | 9 +
 .../erlang-otp-golden/lager.app.src.expected | 58 +
 testdata/erlang-otp-golden/rebar.config | 5 +
 .../erlang-otp-golden/rebar.config.expected | 82 ++
 testdata/erlang-otp-golden/rebar.lock | 8 +
 .../erlang-otp-golden/rebar.lock.expected | 155 +++
 testdata/erlang-otp/app-src/fast_xml.app.src | 10 +
 testdata/erlang-otp/app-src/lager.app.src | 18 +
 testdata/erlang-otp/rebar-config/rebar.config | 16 +
 testdata/erlang-otp/rebar-lock/rebar.lock | 18 +
 21 files changed, 1992 insertions(+), 58 deletions(-)
 create mode 100644 docs/improvements/erlang-otp-parser.md
 create mode 100644 src/parsers/erlang_otp.rs
 create mode 100644 src/parsers/erlang_otp_golden_test.rs
 create mode 100644 src/parsers/erlang_otp_test.rs
 create mode 100644 testdata/erlang-otp-golden/lager.app.src
 create mode 100644 testdata/erlang-otp-golden/lager.app.src.expected
 create mode 100644 testdata/erlang-otp-golden/rebar.config
 create mode 100644 testdata/erlang-otp-golden/rebar.config.expected
 create mode 100644 testdata/erlang-otp-golden/rebar.lock
 create mode 100644 testdata/erlang-otp-golden/rebar.lock.expected
 create mode 100644 testdata/erlang-otp/app-src/fast_xml.app.src
 create mode 100644 testdata/erlang-otp/app-src/lager.app.src
 create mode 100644 testdata/erlang-otp/rebar-config/rebar.config
 create mode 100644 testdata/erlang-otp/rebar-lock/rebar.lock

diff --git a/docs/SUPPORTED_FORMATS.md b/docs/SUPPORTED_FORMATS.md
index 974b5cea5..ce7b677b6 100644
--- a/docs/SUPPORTED_FORMATS.md
+++ b/docs/SUPPORTED_FORMATS.md
@@ -82,7 +82,10 @@ Provenant supports package manifests, installed-package metadata, recognizers, a
 | Hackage cabal.project workspace file | `**/cabal.project` | hackage | Haskell | [Link](https://cabal.readthedocs.io/en/stable/cabal-project-description-file.html) |
 | Haxe haxelib.json package manifest | `**/haxelib.json` | haxe | Haxe | [Link](https://lib.haxe.org/documentation/creating-a-haxelib-package/) |
 | Helm chart metadata | `**/Chart.yaml, **/Chart.lock` | helm | YAML | [Link](https://helm.sh/docs/topics/charts/) |
+| Erlang OTP application resource file | `**/*.app.src` | hex | Erlang | [Link](https://www.erlang.org/doc/apps/kernel/application) |
 | Hex mix.lock lockfile | `**/mix.lock` | hex | Elixir | [Link](https://hexdocs.pm/mix/Mix.Tasks.Deps.html) |
+| Rebar3 configuration | `**/rebar.config` | hex | Erlang | [Link](https://rebar3.org/docs/configuration/configuration/) |
+| Rebar3 lockfile | `**/rebar.lock` | hex | Erlang | [Link](https://rebar3.org/docs/configuration/configuration/) |
 | Julia Manifest.toml resolved dependencies | `**/Manifest.toml` | julia | Julia | [Link](https://pkgdocs.julialang.org/v1/toml-files/) |
 | Julia Project.toml manifest | `**/Project.toml` | julia | Julia | [Link](https://pkgdocs.julialang.org/v1/toml-files/) |
 | Linux OS release metadata file | `*etc/os-release, *usr/lib/os-release` | linux-distro | | [Link](https://www.freedesktop.org/software/systemd/man/os-release.html) |
diff --git a/docs/implementation-plans/package-detection/PARSER_PLAN.md b/docs/implementation-plans/package-detection/PARSER_PLAN.md
index 9462accdf..e5db397d1 100644
--- a/docs/implementation-plans/package-detection/PARSER_PLAN.md
+++ b/docs/implementation-plans/package-detection/PARSER_PLAN.md
@@ -60,6 +60,7 @@ All production handlers in the original plan scope are covered. Some ecosystems
 | Deno | ✅ Implemented | `deno.json`, `deno.jsonc`, `deno.lock` |
 | Debian | ✅ Implemented | Includes ⭐ `.deb` introspection, copyright, distroless, `md5sums` variants |
 | Docker | ✅ Implemented | `Dockerfile`, `Containerfile`, OCI label extraction |
+| Erlang / OTP | ✅ Implemented | `*.app.src`, `rebar.config`, `rebar.lock` with Erlang term parser, OTP stdlib filtering, git/hex dependency extraction, profile deps, and rebar.lock hash resolution; ⭐ net-new parser with no Python ScanCode reference |
 | FreeBSD | ✅ Implemented | `FreebsdCompactManifestParser` |
 | Git submodules | ✅ Implemented | `GitmodulesParser` |
 | Go | ✅ Implemented | `go.mod`, `go.sum`, `Godeps.json`, `go.mod.graph`, `go.work`, and scanner-gated compiled-binary extraction for embedded Go build info |
@@ -171,7 +172,6 @@ These issues may still be worth doing, but they are currently lower-value becaus
 | #354 | HuggingFace model metadata (`config.json`, model cards, repo metadata) | Manifest conventions are weaker than mainstream package ecosystems and closer to artifact metadata than classic dependency parsing. | Previously isolated in the future purl list; still better handled after stronger package-manager style formats. Related upstream issue: `aboutcode-org/scancode-toolkit#4826`. |
 | #351 | Bitnami catalog metadata | Captures a narrow packaging family with weaker general-purpose source-repo reach than the higher-ranked package ecosystems. | Real but narrow, so it belongs with opportunistic work rather than the main parser queue. Related upstream issue: `aboutcode-org/scancode-toolkit#4829`. |
 | #353 | MLflow model metadata (`MLmodel`) | More model/artifact metadata than classic package parsing, with weaker package identity and dependency semantics. | Better treated as artifact metadata follow-on work. Related upstream issue: `aboutcode-org/scancode-toolkit#4827`. |
-| #355 | Erlang / OTP manifests (`*.app.src`, `rebar.config`) | Overlaps heavily with the stronger Hex opportunity and is less compelling as a standalone parser family today. | Keep behind `mix.exs` / `mix.lock` unless strong user demand appears. Related upstream issue: `aboutcode-org/scancode-toolkit#4828`. |
 | #79 | `datapackage.json` | Structured metadata, but narrower adoption and lighter dependency value. | More metadata-oriented than dependency-oriented. |
 | #70 | DOAP RDF/XML | Rich project metadata, but niche and metadata-first. | Better as enrichment work. |
 | #67 | PEX Python binaries | Valuable for packaged Python artifacts, but artifact parsing is costlier than manifest follow-ons. | Lower priority than Python lockfile and metadata improvements. |
diff --git a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md
index 13839df82..6eace1fa6 100644
--- a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md
+++ b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md
@@ -57,64 +57,66 @@ Method rules:
 The ranking below is ordered by **practical verification value first**: broad ecosystem prevalence, likelihood of exposing real parser-plus-license/copyright interactions under `--profile common`, and coverage breadth within the implemented family.
-| Priority | Ecosystem | Status | Candidate targets | Priority and scope notes | +<<<<<<< HEAD +| Priority | Ecosystem | Status | Candidate targets | Priority and scope notes | | -------- | ------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| 0a | Cross-cutting broad `C++` repository scans (non-parser reference) | 🟢 Verified | `boostorg/boost` (236 files)
`boostorg/json` (701 files)
`mongodb/mongo` (11k files) | There is no generic `C++` parser row. These repositories are still valuable reference targets because they exercise multiple implemented `C++`-adjacent families and package-adjacent detection in realistic trees. They complement, but do not replace, family-specific verification for Autotools, Conan, vcpkg, Bazel, and Buck. | -| 0b | Cross-cutting broad polyglot / vendored monorepo scans (non-parser reference) | 🟢 Verified | `chromium/chromium` (490,886 files)
`apache/airflow` (11,854 files)
`kubernetes/kubernetes` (29,080 files) | These are good early warning targets for interaction bugs across multiple parser families, vendored third-party metadata, README/submodule handling, and common-profile license/copyright detection in very large trees. They complement, but do not replace, family-specific rows. | -| 0c | Cross-cutting rootfs / shipped-artifact snapshot scans (non-parser reference) | 🟢 Verified | Debian base-image rootfs snapshot (3,267 files)
Fedora base-image rootfs snapshot (1,579 files)
official Alpine minirootfs snapshot (84 files) | These targets simultaneously exercise distro metadata, package DB/archive surfaces, package-adjacent files, and common-profile detection on unpacked system trees. They complement, but do not replace, the Debian, RPM, Alpine, Linux Distro, and Windows Update family rows. | -| 0d | Cross-cutting filesystem-scale native source-tree scans (non-parser reference) | 🟢 Verified | `torvalds/linux` (100k files)
`rust-lang/rust` (8k files) | Use this lane when traversal robustness matters more than parser breadth. `torvalds/linux` is the extreme large native-tree and sparse-manifest case with lots of COPYING/README-style text noise, while `rust-lang/rust` adds a mixed Cargo-plus-bootstrap native layout. Watch generated/build artifacts, vendored/bootstrap directories, and common-profile deltas that are really tree-shape issues rather than parser regressions. | -| 0e | Cross-cutting licensing-edge-case repository scans (non-parser reference) | 🟢 Verified | `nmap/nmap` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files)
`mongodb/mongo` (11k files) | Use this lane when the main goal is license-classification accuracy rather than parser breadth. These targets are useful when the verification focus is classification quality on real repository text, reference notices, and packaging-adjacent licensing material rather than parser coverage alone. | -| 1 | npm / yarn / pnpm (+ Bun) | 🟢 Verified | `npm/cli` (500–2k files)
`yarnpkg/berry` (500–2k files)
`vercel/next.js` (5k files)
`oven-sh/bun` (500–2k files)
`microsoft/vscode` (3k files) | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise. | -| 2 | Python / PyPI | 🟢 Verified | `pandas-dev/pandas` (1.2k files)
`scipy/scipy` (1.3k files)
`django/django` (2.5k files)
`python-poetry/poetry` (500–2k files)
`astral-sh/uv` (500–2k files) | Broad Python family with both classic and modern metadata. `pandas-dev/pandas`, `scipy/scipy`, and `django/django` add realistic mixed source/doc/test trees, while `python-poetry/poetry` and `astral-sh/uv` cover Poetry- and uv-era lockfile/group behavior. Watch interactions between `pyproject.toml`, legacy setup metadata, extras/groups, and large doc/test subtrees that can dominate common-profile deltas. | -| 3 | Maven / Java | 🟢 Verified | `apache/maven` (500–2k files)
`apache/camel` (2k–10k files)
`spring-projects/spring-boot` (2k–10k files)
`apache/felix-dev` (2k–10k files) | High-value JVM lane. `apache/maven` is the clearest parent/module inheritance reference, `apache/camel` and `spring-projects/spring-boot` stress large nested multi-module builds, and `apache/felix-dev` adds OSGi plus `MANIFEST.MF` bundle metadata. Watch inherited metadata, nested-module aggregation, and bundle-manifest extraction rather than treating every Java delta as leaf-`pom.xml` parsing failure. | -| 3a | Clojure / Leiningen | 🟢 Verified | `technomancy/leiningen` (500–2k files)
`metabase/metabase` (2k–10k files)
`renovatebot/renovate` Leiningen fixtures | Keep this row explicit instead of assuming the broader Maven or SBT rows cover it. `technomancy/leiningen` is the canonical `project.clj` reference, `metabase/metabase` gives a real-world root `deps.edn`, and `renovatebot/renovate` adds a fixture-heavy Leiningen edge-case lane. The shipped Rust surface is bounded static parsing of `deps.edn` and `project.clj`, not generic JVM build inheritance, and these manifests are intentionally treated as standalone unassembled inputs. | -| 4 | Go | 🟢 Verified | `containerd/containerd` (2k–10k files)
`go-gitea/gitea` (2k–10k files)
Go build-info sample binaries via local `--target-path` + `common-with-compiled` lane | Use both source and binary lanes here. `containerd/containerd` and `go-gitea/gitea` cover large real-world module graphs, while the local binary lane is the only way to verify embedded Go build info that repo scans cannot see. Watch nested modules, `go.work` workspace roots, vendored trees, and source-versus-binary coverage gaps explicitly during compare review. | -| 5 | Cargo/Rust | 🟢 Verified | `tokio-rs/tokio` (250 files)
`rust-lang/cargo` (700 files)
cargo-auditable sample binaries via local `--target-path` + `common-with-compiled` lane | Strong workspace/member coverage plus an explicit compiled-metadata lane for the scanner-gated cargo-auditable surface. Watch workspace root/member ownership, manifest-declared file references such as `README` and license files, and compiled-versus-source coverage gaps. Keep bootstrap-scale mixed Rust/C++ trees such as `rust-lang/rust` in the dedicated filesystem-scale cross-cutting lane instead of duplicating them here. | -| 5a | Compiled artifacts (`go build info`, cargo-auditable, Windows PE `VERSIONINFO`) | 🟢 Verified | `itchyny/gojq` release binaries via local `--target-path` + `common-with-compiled` lane
`lichess-org/fishnet` release binaries via local `--target-path` + `common-with-compiled` lane
`glzr-io/glazewm` Windows release executables via local `--target-path` lane | Keep this detector-oriented row explicit so compiled-binary verification does not stay implicit inside the Go, Cargo/Rust, Windows Update, or `misc.py` rows. `itchyny/gojq` is a clean Go build-info target, `lichess-org/fishnet` is an explicit cargo-auditable release lane, and `glzr-io/glazewm` gives a focused Windows `VERSIONINFO` executable target. Prefer small release trees that include nearby README or LICENSE material when possible, so the compare still exercises common-profile interactions rather than only binary package identity. | -| 6 | NuGet | 🟢 Verified | `OrchardCMS/OrchardCore` (2k–10k files)
`AvaloniaUI/Avalonia` (2k–10k files)
`.nupkg` / `.deps.json` snapshots via local `--target-path` lane | Broad .NET lane across source and shipped artifacts. `OrchardCMS/OrchardCore` and `AvaloniaUI/Avalonia` cover large solution-style repos and central package management patterns, while the `.nupkg` / `.deps.json` lane covers runtime and package-artifact metadata that source repos may miss. Watch duplicate package signals across solution props/targets, project files, and runtime artifacts before counting them as regressions. | -| 7 | PHP / Composer | 🟢 Verified | `laravel/framework` (2k–10k files)
`composer/composer` (500–2k files)
`symfony/symfony` (2k–10k files) | Mature Composer lane. `composer/composer` is the canonical Composer reference, while `laravel/framework` and `symfony/symfony` add large real-world monorepo/library dependency graphs. Watch `composer.json` versus `composer.lock` behavior, split-package repo structure, and README/LICENSE-heavy trees that can create unrelated common-profile deltas. | -| 8 | Gradle | 🟢 Verified | `gradle/gradle` (2k–10k files)
`elastic/elasticsearch` (11k files)
`apache/kafka` (2k–10k files) | High-signal JVM build family with settings/includes and large build graphs; `elastic/elasticsearch` adds an especially large multi-project Gradle and packaging target with meaningful licensing/distribution complexity. | -| 8a | Android metadata and package artifacts | 🟢 Verified | `aosp-mirror/platform_build` (Soong `METADATA` coverage)
`aosp-mirror/platform_frameworks_base` (Android manifest surfaces)
representative local `.aab`, `.apk`, and standalone binary `AndroidManifest.xml` artifacts via `--target-path` lane | Keep this Android-specific lane explicit instead of assuming the broader Gradle row covers it. Use the repository targets for Soong `METADATA` files and committed manifest surfaces, and the local artifact lane for proto-encoded `.aab` plus binary AXML/APK manifest metadata that ordinary repository scans do not usually contain. | -| 9 | Ruby | 🟢 Verified | `rails/rails` (2k–10k files)
`rubocop/rubocop` (500–2k files)
`.gem` archive sample via local `--target-path` lane | Use this row to separate source-repo and shipped-gem behavior. `rails/rails` is the large multi-gemspec/Bundler stress case, `rubocop/rubocop` is a smaller modern Bundler contrast, and the `.gem` lane covers archive metadata. Watch Gemfile-versus-gemspec-versus-lockfile precedence and differences between source trees and packaged gem metadata. | -| 10 | Debian | 🟢 Verified | `guillemj/dpkg` (500–2k files)
`Debian/apt` (2k–10k files)
official `.deb` / dpkg status / distroless `status.d` snapshots via local `--target-path` lane | Keep source-package and installed-state coverage separate. `guillemj/dpkg` and `Debian/apt` exercise Debian source-package metadata, while the `.deb`, `dpkg status`, and distroless `status.d` lanes cover binary-package and installed-database behavior. Watch source-versus-binary package identity, multiple package stanzas, and Debian copyright/license files generating common-profile deltas that are not parser failures. | -| 11 | Docker | 🟢 Verified | `moby/moby` (2k–10k files)
`docker-library/official-images` (<500 files)
`docker-library/python` (<500 files)
`getsentry/self-hosted` (<500 files) | Docker needs both canonical and real deployment targets. `moby/moby` is the broad Dockerfile/build-context reference, `docker-library/official-images` is the source-of-truth library-definition lane, `docker-library/python` is a useful generated official-image leaf target, and `getsentry/self-hosted` adds compose-heavy multi-service realism. Watch multi-stage Dockerfiles, compose-plus-Dockerfile overlap, and template/env noise before treating extra findings as parser regressions. | -| 11a | Helm | 🟢 Verified | `baserow/baserow` (2k–10k files)
`appsmithorg/appsmith` (10k–50k files)
`DefectDojo/django-DefectDojo` (500–2k files) | Keep Helm explicit instead of relying on incidental chart files inside larger application repositories. `baserow/baserow` gives a strong `Chart.yaml` plus `Chart.lock` lane, `appsmithorg/appsmith` adds a large conventional chart deployment tree, and `DefectDojo/django-DefectDojo` is a smaller contrast target. The implemented Rust surface is static `Chart.yaml` plus `Chart.lock` parsing with sibling assembly, declared-versus-locked dependency coverage, and bounded malformed-entry tolerance; that needs at least one focused chart-first verification lane. | -| 12 | Conda | 🟢 Verified | `conda/conda` (500–2k files)
`conda/conda-build` (500–2k files)
`conda-forge/pandas-feedstock` (<500 files) | Conda needs three distinct target shapes. `conda/conda` covers user-facing environment metadata, `conda/conda-build` covers recipes and build outputs, and `conda-forge/pandas-feedstock` is the feedstock pattern Provenant must handle. Watch recipe-output duplication and generated feedstock files before overcounting package or license deltas. | -| 12a | Pixi | 🟢 Verified | `prefix-dev/pixi` (500–2k files)
`pydata/xarray` (500–2k files)
`OpenMDAO/OpenMDAO` (500–2k files) | Keep Pixi explicit even though some Python and Conda compare targets already surface `pixi.toml` and `pixi.lock`. `prefix-dev/pixi` is the canonical upstream with both `pixi.toml` and `pixi.lock`, `pydata/xarray` adds a real consumer repo, and `OpenMDAO/OpenMDAO` adds a second repo with both manifest and lockfile. This row isolates the native `pixi.toml` plus `pixi.lock` contract, mixed Conda/PyPI dependency behavior, and topology-planned root assembly instead of letting those behaviors hide inside broader Python-family compare noise. | -| 13 | Swift | 🟢 Verified | `pointfreeco/swift-composable-architecture` (500–2k files)
`SwiftFiddle/swiftfiddle-web` (<500 files)
`Package.swift.json` / `Package.resolved` snapshots via local `--target-path` lane | `pointfreeco/swift-composable-architecture` is a clean SwiftPM library reference, `SwiftFiddle/swiftfiddle-web` adds a real committed `Resources/Package.swift.json` plus `Package.resolved` target shape, and the local snapshot lane remains important for future pinned production captures that record generated SwiftPM surfaces alongside their source manifests. Watch repo-only verification gaps whenever a bug might live in `Package.swift.json` or `Package.resolved` rather than in source manifests. | -| 14 | Haskell / Hackage | 🟢 Verified | `commercialhaskell/stack` (500–2k files)
`jgm/pandoc` (500–2k files)
`yesodweb/yesod` (500–2k files) | Good mix of Cabal, Stack, and multi-package Haskell repository structure. | -| 15 | Scala / SBT | 🟢 Verified | `akka/akka` (2k–10k files)
`playframework/playframework` (2k–10k files)
`scalatest/scalatest` (500–2k files) | Valuable JVM surface, but current Rust scope is bounded static parsing rather than full evaluation semantics. | -| 16 | CocoaPods | 🟢 Verified | `AFNetworking/AFNetworking` (<500 files)
`Alamofire/Alamofire` (<500 files)
`SDWebImage/SDWebImage` (<500 files) | Strong Apple packaging coverage through widely used podspec-based libraries. | -| 16a | Carthage | 🟢 Verified | `Carthage/Carthage` (500–2k files)
`ReactiveCocoa/ReactiveCocoa` (<500 files)
`Mantle/Mantle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `Carthage/Carthage` is the canonical upstream with both `Cartfile` and `Cartfile.resolved`, while `ReactiveCocoa/ReactiveCocoa` and `Mantle/Mantle` are representative consumer libraries. Focus on correct `Cartfile` dependency extraction, `Cartfile.resolved` pinned-version coverage, and the dependency-hoisting contract for sibling manifest-plus-lockfile pairs without inventing a root Carthage package identity. | -| 16b | Yocto / BitBake | 🟢 Verified | `yoctoproject/poky` (10k–50k files)
`openembedded/meta-openembedded` (10k–50k files)
`pocketbeagle/meta-pocketbeagle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `yoctoproject/poky` is the canonical Yocto reference distribution, while `openembedded/meta-openembedded` provides a large recipe corpus across many layers. Focus on correct package identity extraction from filenames and `PN`/`PV` variables, license normalization of BitBake-specific operator syntax (`&`/`\|`), and `DEPENDS`/`RDEPENDS` dependency scoping. | -| 17 | Nix | 🟢 Verified | `NixOS/nixpkgs` (50k+ files)
`NixOS/nix` (2k–10k files)
`numtide/devshell` (<500 files) | Valuable ecosystem with explicit note that current `default.nix` support is intentionally bounded. | -| 18 | CPAN | 🟢 Verified | `Plack/Plack` (500–2k files)
`libwww-perl/libwww-perl` (500–2k files)
`PerlDancer/Dancer2` (500–2k files) | Good Perl metadata variety through `META.*`, `dist.ini`, and `Makefile.PL`. | -| 19 | CRAN / R | 🟢 Verified | `tidyverse/dplyr` (500–2k files)
`tidyverse/ggplot2` (500–2k files)
`r-lib/devtools` (500–2k files) | Strong DESCRIPTION-based metadata with realistic dependency fields. | -| 20 | Alpine | ⚪ Planned | `alpinelinux/aports`
official `.apk` sample via local `--target-path` lane
Alpine `lib/apk/db/installed` snapshot via local `--target-path` lane | Keep this family row even though Alpine rootfs targets also appear in `0c`: `0c` is the cross-cutting rootfs lane, while this row tracks Alpine-specific source, archive, and installed-DB surfaces. Do not treat rootfs-only verification as verification of the remaining `APKBUILD`, `.apk`, and standalone installed-DB surfaces listed here. | -| 21 | RPM | 🟢 Verified | `rpm-software-management/dnf` (2k–10k files)
`rpm-software-management/libdnf` (500–2k files)
official `.rpm` / RPM BDB, NDB, and SQLite DB snapshots via local `--target-path` lane | Important distro-family lane across source and installed-state metadata. `rpm-software-management/dnf` and `rpm-software-management/libdnf` cover realistic RPM-adjacent source trees, while the local `.rpm` and RPM DB lanes cover shipped package and installed-database behavior. Watch specfile subpackages, changelog/license fields, namespace-from-`os-release` behavior, and DB-versus-source differences separately during triage. | -| 22 | Arch Linux | ⚪ Planned | Arch Linux GitLab packaging repo for `pacman`
Arch Linux GitLab packaging repo for `grep`
official built package sample for `.PKGINFO` via local `--target-path` lane | Use one source-package contrast plus one built-package lane here. The Arch packaging repos cover PKGBUILD and `.SRCINFO` source metadata, while the local built-package lane covers `.PKGINFO` behavior that source repos do not contain. Keep the candidate repos concrete because the canonical Arch packaging sources live in the Arch packaging tree rather than in one obvious GitHub umbrella repository. | -| 23 | Bazel | 🟢 Verified | `tensorflow/tensorflow` (10k files)
`bazelbuild/bazel` (2k–10k files)
`protocolbuffers/protobuf` (2.5k files) | Strong Bazel lane across old and new module surfaces. `bazelbuild/bazel` is the canonical direct reference, `tensorflow/tensorflow` is the large mixed-language stress case, and `protocolbuffers/protobuf` is a smaller contrast target. Watch `WORKSPACE` versus `MODULE.bazel`, macro-heavy static-parsing limits, and giant `third_party` trees producing unrelated common-profile noise. | -| 24 | Autotools | 🟢 Verified | `curl/curl` (1k files)
`libevent/libevent` (<500 files)
`libgit2/libgit2` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files) | Mature native-build lane with several useful contrasts. `curl/curl` is the clearest autoconf-heavy reference, `libevent/libevent` is a smaller contrast, `libgit2/libgit2` adds a mixed native project shape, and `ffmpeg/ffmpeg` adds strong GPL/LGPL-conditional licensing pressure in a `configure`-driven native tree. Watch generated `configure` / `Makefile.in` noise and avoid collapsing file-level licensing differences into one top-level verdict. | -| 24a | Meson | 🟢 Verified | `qemu/qemu` (10k–50k files)
`systemd/systemd` (10k–50k files)
`LinuxCNC/linuxcnc` (2k–10k files) | Keep Meson explicit instead of assuming the Autotools or generic native-tree rows cover it. `qemu/qemu` and `systemd/systemd` are high-signal root-`meson.build` upstreams, while `LinuxCNC/linuxcnc` is a smaller contrast target. The shipped Rust surface is bounded static `meson.build` parsing for literal `project()` metadata and top-level `dependency()` calls, with explicit no-evaluation guardrails that deserve a focused verification lane. | -| 25 | Conan | 🟢 Verified | `conan-io/conan-center-index` (10k–50k files)
`catchorg/Catch2` (<500 files)
`fmtlib/fmt` (<500 files) | Conan needs both recipe-corpus and upstream-library targets. `conan-io/conan-center-index` is the authoritative recipe index, while `catchorg/Catch2` and `fmtlib/fmt` are smaller upstream consumer-library contrasts. Watch recipe-only repository structure, versioned recipe directories, and the difference between Conan recipe metadata and normal source-package behavior. | -| 26 | vcpkg | 🟢 Verified | `microsoft/vcpkg` (10k–50k files)
`microsoft/terminal` (2k–10k files)
`microsoft/onnxruntime` (10k–50k files) | Important Windows/`C++` manifest-mode lane. `microsoft/vcpkg` is the authoritative manifest and registry target, while `microsoft/terminal` and `microsoft/onnxruntime` cover large consumer repos that use `vcpkg.json` in real codebases. Watch current scope boundaries carefully: this row is about implemented manifest-mode metadata, not every vendored or toolchain surface in those trees. | -| 27 | Deno | 🟢 Verified | `denoland/fresh` (500–2k files)
`oakserver/oak` (500–2k files)
`denoland/std` (2k–10k files) | Useful modern JS/TS ecosystem with explicit config and lockfile coverage. | -| 28 | Dart / Pub | 🟢 Verified | `rrousselGit/riverpod` (500–2k files)
`firebase/flutterfire` (2k–10k files)
`flutter/packages` (2k–10k files) | Good Pub and Flutter-adjacent coverage through large multi-package repositories. | -| 29 | Git submodules | 🟢 Verified | `grpc/grpc` (10k–50k files)
`git/git` (500–2k files)
`chromium/chromium` (490,886 files) | This is a package-adjacent lane, not a parser-breadth lane. `git/git` is the clearest focused `.gitmodules` reference, `grpc/grpc` adds large real-world third-party trees, and `chromium/chromium` is the stress case. Watch absent submodule checkouts and vendored-tree context so `.gitmodules` findings stay coherent instead of being drowned by unrelated common-profile output. | -| 30 | Structured metadata (`CITATION.cff`, `publiccode.yml`) | 🟢 Verified | `astropy/astropy` (2k–10k files)
`iTowns/itowns` (500–2k files)
`univention/Nubus` (500–2k files) | Keep both structured-metadata families explicit here. `astropy/astropy` is the strongest `CITATION.cff` reference, `univention/Nubus` is the clearest `publiccode.yml` case, and `iTowns/itowns` adds mixed-project contrast. Watch that structured metadata stays visible beside richer README and package findings instead of being lost in broader common-profile output. | -| 31 | README | 🟢 Verified | `chromium/chromium` vendored `README.chromium` samples (490,886 files)
`vercel/next.js` (5k files)
`django/django` (2.5k files) | Chromium is the main proof target for the specialized README variants; the other two are broader repo-level contrast targets only. Use this row to verify that package-adjacent README parsing stays visible under the common profile instead of disappearing inside unrelated monorepo noise. | -| 32 | Linux Distro (`os-release`) | ⚪ Planned | Debian base-image rootfs snapshot
Fedora base-image rootfs snapshot
Distroless `base-debian12` rootfs snapshot | This row is rootfs-only on purpose. Debian and Fedora give conventional distro metadata layouts, while Distroless shows the minimal-image case where `os-release` may be one of the few package-identity signals present. Watch path/layout differences and do not treat intentionally sparse distroless metadata as a parser regression by itself. | -| 33 | AboutCode | ⚪ Planned | `aboutcode-org/scancode-toolkit` (10k–50k files)
`aboutcode-org/scancode.io` (500–2k files)
`aboutcode-org/dejacode` (500–2k files) | Niche but very high-fit `.ABOUT` lane. `aboutcode-org/scancode-toolkit` is the broadest real-world `.ABOUT` reference, while `aboutcode-org/scancode.io` and `aboutcode-org/dejacode` provide smaller product-style contrasts. Watch `.ABOUT` extraction staying visible beside denser package, README, and license output in these application trees. | -| 34 | Hex / Elixir | 🟢 Verified | `phoenixframework/phoenix` (500–2k files)
`elixir-ecto/ecto` (500–2k files)
`elixir-plug/plug` (<500 files) | Useful ecosystem, but current Rust scope is still the lockfile/static subset, so this ranks below the broader mainstream families. | -| 35 | OCaml / opam | 🟢 Verified | `ocaml/dune` (500–2k files)
`ocaml/ocaml-lsp` (500–2k files)
`ocaml/merlin` (500–2k files) | Good `opam` coverage, but lower practical verification priority than the broader ecosystems above. | -| 36 | Buck | 🟢 Verified | `facebook/buck2` (2k–10k files)
`facebook/watchman` (500–2k files)
`facebook/react-native` (10k–50k files) | Real Buck lane, even if narrower than Bazel in practice. `facebook/buck2` is the canonical direct reference, `facebook/watchman` is a smaller focused contrast, and `facebook/react-native` adds a large mixed-language consumer tree. Watch Buck metadata separately from the rest of the monorepo so unrelated JS/native/common-profile noise does not hide actual build-metadata gaps. | -| 37 | FreeBSD | ⚪ Planned | FreeBSD `pkg` package archive sample
FreeBSD `bash` package archive sample
FreeBSD `curl` package archive sample | Important artifact-family support, but narrower day-to-day scan prevalence than the higher-priority distro lanes. | -| 38 | Chef | ⚪ Planned | `sous-chefs/apache2` (<500 files)
`sous-chefs/mysql` (<500 files)
`chef/chef` (2k–10k files) | Worth covering, but lower priority than the mainstream language and distro families. | -| 39 | Bower | ⚪ Planned | `jquery/jquery-ui` (500–2k files)
`select2/select2` (<500 files)
`jashkenas/backbone` (<500 files) | Legacy ecosystem with ongoing value mostly for backward compatibility. | -| 40 | Haxe | ⚪ Planned | `openfl/openfl` (500–2k files)
`HaxeFlixel/flixel` (500–2k files)
`HeapsIO/heaps` (500–2k files) | Smaller ecosystem; still useful, but lower-value than the broader mainstream families above. | -| 41 | Windows Update | ⚪ Planned | `wsusscn2.cab` extracted tree
Windows cumulative update `.msu` extracted tree
Windows servicing stack update extracted tree | Artifact-oriented family with real value, but specialized and best handled after the higher-signal source/package ecosystems. | -| 42 | `misc.py` recognizers | ⚪ Planned | Apache Tomcat binary release artifacts
Firefox add-on / language-pack artifacts
NSIS official installer artifacts | Broad recognizer family, but not a normal package-manager lane; treat as specialized follow-up verification. | -| 43 | Julia | 🟢 Verified | `JuliaLang/Pkg.jl` (500–2k files)
`JuliaLang/julia` (10k–50k files)
`JuliaPlots/Plots.jl` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `JuliaLang/Pkg.jl` is the canonical `Project.toml` and `Manifest.toml` reference, `JuliaLang/julia` adds a large real-world Julia project tree, and `JuliaPlots/Plots.jl` is a mid-sized consumer library. Focus on correct `Project.toml` metadata extraction, `Manifest.toml` resolved dependency coverage, and sibling assembly of project-plus-manifest pairs. | +| 0a | Cross-cutting broad `C++` repository scans (non-parser reference) | 🟢 Verified | `boostorg/boost` (236 files)
`boostorg/json` (701 files)
`mongodb/mongo` (11k files) | There is no generic `C++` parser row. These repositories are still valuable reference targets because they exercise multiple implemented `C++`-adjacent families and package-adjacent detection in realistic trees. They complement, but do not replace, family-specific verification for Autotools, Conan, vcpkg, Bazel, and Buck. | +| 0b | Cross-cutting broad polyglot / vendored monorepo scans (non-parser reference) | 🟢 Verified | `chromium/chromium` (490,886 files)
`apache/airflow` (11,854 files)
`kubernetes/kubernetes` (29,080 files) | These are good early warning targets for interaction bugs across multiple parser families, vendored third-party metadata, README/submodule handling, and common-profile license/copyright detection in very large trees. They complement, but do not replace, family-specific rows. | +| 0c | Cross-cutting rootfs / shipped-artifact snapshot scans (non-parser reference) | 🟢 Verified | Debian base-image rootfs snapshot (3,267 files)
Fedora base-image rootfs snapshot (1,579 files)
official Alpine minirootfs snapshot (84 files) | These targets simultaneously exercise distro metadata, package DB/archive surfaces, package-adjacent files, and common-profile detection on unpacked system trees. They complement, but do not replace, the Debian, RPM, Alpine, Linux Distro, and Windows Update family rows. | +| 0d | Cross-cutting filesystem-scale native source-tree scans (non-parser reference) | 🟢 Verified | `torvalds/linux` (100k files)
`rust-lang/rust` (8k files) | Use this lane when traversal robustness matters more than parser breadth. `torvalds/linux` is the extreme large native-tree and sparse-manifest case with lots of COPYING/README-style text noise, while `rust-lang/rust` adds a mixed Cargo-plus-bootstrap native layout. Watch generated/build artifacts, vendored/bootstrap directories, and common-profile deltas that are really tree-shape issues rather than parser regressions. | +| 0e | Cross-cutting licensing-edge-case repository scans (non-parser reference) | 🟢 Verified | `nmap/nmap` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files)
`mongodb/mongo` (11k files) | Use this lane when the main goal is license-classification accuracy rather than parser breadth. These targets are useful when the verification focus is classification quality on real repository text, reference notices, and packaging-adjacent licensing material rather than parser coverage alone. | +| 1 | npm / yarn / pnpm (+ Bun) | 🟢 Verified | `npm/cli` (500–2k files)
`yarnpkg/berry` (500–2k files)
`vercel/next.js` (5k files)
`oven-sh/bun` (500–2k files)
`microsoft/vscode` (3k files) | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise. | +| 2 | Python / PyPI | 🟢 Verified | `pandas-dev/pandas` (1.2k files)
`scipy/scipy` (1.3k files)
`django/django` (2.5k files)
`python-poetry/poetry` (500–2k files)
`astral-sh/uv` (500–2k files) | Broad Python family with both classic and modern metadata. `pandas-dev/pandas`, `scipy/scipy`, and `django/django` add realistic mixed source/doc/test trees, while `python-poetry/poetry` and `astral-sh/uv` cover Poetry- and uv-era lockfile/group behavior. Watch interactions between `pyproject.toml`, legacy setup metadata, extras/groups, and large doc/test subtrees that can dominate common-profile deltas. | +| 3 | Maven / Java | 🟢 Verified | `apache/maven` (500–2k files)
`apache/camel` (2k–10k files)
`spring-projects/spring-boot` (2k–10k files)
`apache/felix-dev` (2k–10k files) | High-value JVM lane. `apache/maven` is the clearest parent/module inheritance reference, `apache/camel` and `spring-projects/spring-boot` stress large nested multi-module builds, and `apache/felix-dev` adds OSGi plus `MANIFEST.MF` bundle metadata. Watch inherited metadata, nested-module aggregation, and bundle-manifest extraction rather than treating every Java delta as leaf-`pom.xml` parsing failure. | +| 3a | Clojure / Leiningen | 🟢 Verified | `technomancy/leiningen` (500–2k files)
`metabase/metabase` (2k–10k files)
`renovatebot/renovate` Leiningen fixtures | Keep this row explicit instead of assuming the broader Maven or SBT rows cover it. `technomancy/leiningen` is the canonical `project.clj` reference, `metabase/metabase` gives a real-world root `deps.edn`, and `renovatebot/renovate` adds a fixture-heavy Leiningen edge-case lane. The shipped Rust surface is bounded static parsing of `deps.edn` and `project.clj`, not generic JVM build inheritance, and these manifests are intentionally treated as standalone unassembled inputs. | +| 4 | Go | 🟢 Verified | `containerd/containerd` (2k–10k files)
`go-gitea/gitea` (2k–10k files)
Go build-info sample binaries via local `--target-path` + `common-with-compiled` lane | Use both source and binary lanes here. `containerd/containerd` and `go-gitea/gitea` cover large real-world module graphs, while the local binary lane is the only way to verify embedded Go build info that repo scans cannot see. Watch nested modules, `go.work` workspace roots, vendored trees, and source-versus-binary coverage gaps explicitly during compare review. | +| 5 | Cargo/Rust | 🟢 Verified | `tokio-rs/tokio` (250 files)
`rust-lang/cargo` (700 files)
cargo-auditable sample binaries via local `--target-path` + `common-with-compiled` lane | Strong workspace/member coverage plus an explicit compiled-metadata lane for the scanner-gated cargo-auditable surface. Watch workspace root/member ownership, manifest-declared file references such as `README` and license files, and compiled-versus-source coverage gaps. Keep bootstrap-scale mixed Rust/C++ trees such as `rust-lang/rust` in the dedicated filesystem-scale cross-cutting lane instead of duplicating them here. | +| 5a | Compiled artifacts (`go build info`, cargo-auditable, Windows PE `VERSIONINFO`) | 🟢 Verified | `itchyny/gojq` release binaries via local `--target-path` + `common-with-compiled` lane
`lichess-org/fishnet` release binaries via local `--target-path` + `common-with-compiled` lane
`glzr-io/glazewm` Windows release executables via local `--target-path` lane | Keep this detector-oriented row explicit so compiled-binary verification does not stay implicit inside the Go, Cargo/Rust, Windows Update, or `misc.py` rows. `itchyny/gojq` is a clean Go build-info target, `lichess-org/fishnet` is an explicit cargo-auditable release lane, and `glzr-io/glazewm` gives a focused Windows `VERSIONINFO` executable target. Prefer small release trees that include nearby README or LICENSE material when possible, so the compare still exercises common-profile interactions rather than only binary package identity. | +| 6 | NuGet | 🟢 Verified | `OrchardCMS/OrchardCore` (2k–10k files)
`AvaloniaUI/Avalonia` (2k–10k files)
`.nupkg` / `.deps.json` snapshots via local `--target-path` lane | Broad .NET lane across source and shipped artifacts. `OrchardCMS/OrchardCore` and `AvaloniaUI/Avalonia` cover large solution-style repos and central package management patterns, while the `.nupkg` / `.deps.json` lane covers runtime and package-artifact metadata that source repos may miss. Watch duplicate package signals across solution props/targets, project files, and runtime artifacts before counting them as regressions. | +| 7 | PHP / Composer | 🟢 Verified | `laravel/framework` (2k–10k files)
`composer/composer` (500–2k files)
`symfony/symfony` (2k–10k files) | Mature Composer lane. `composer/composer` is the canonical Composer reference, while `laravel/framework` and `symfony/symfony` add large real-world monorepo/library dependency graphs. Watch `composer.json` versus `composer.lock` behavior, split-package repo structure, and README/LICENSE-heavy trees that can create unrelated common-profile deltas. | +| 8 | Gradle | 🟢 Verified | `gradle/gradle` (2k–10k files)
`elastic/elasticsearch` (11k files)
`apache/kafka` (2k–10k files) | High-signal JVM build family with settings/includes and large build graphs; `elastic/elasticsearch` adds an especially large multi-project Gradle and packaging target with meaningful licensing/distribution complexity. | +| 8a | Android metadata and package artifacts | 🟢 Verified | `aosp-mirror/platform_build` (Soong `METADATA` coverage)
`aosp-mirror/platform_frameworks_base` (Android manifest surfaces)
representative local `.aab`, `.apk`, and standalone binary `AndroidManifest.xml` artifacts via `--target-path` lane | Keep this Android-specific lane explicit instead of assuming the broader Gradle row covers it. Use the repository targets for Soong `METADATA` files and committed manifest surfaces, and the local artifact lane for proto-encoded `.aab` plus binary AXML/APK manifest metadata that ordinary repository scans do not usually contain. | +| 9 | Ruby | 🟢 Verified | `rails/rails` (2k–10k files)
`rubocop/rubocop` (500–2k files)
`.gem` archive sample via local `--target-path` lane | Use this row to separate source-repo and shipped-gem behavior. `rails/rails` is the large multi-gemspec/Bundler stress case, `rubocop/rubocop` is a smaller modern Bundler contrast, and the `.gem` lane covers archive metadata. Watch Gemfile-versus-gemspec-versus-lockfile precedence and differences between source trees and packaged gem metadata. | +| 10 | Debian | 🟢 Verified | `guillemj/dpkg` (500–2k files)
`Debian/apt` (2k–10k files)
official `.deb` / dpkg status / distroless `status.d` snapshots via local `--target-path` lane | Keep source-package and installed-state coverage separate. `guillemj/dpkg` and `Debian/apt` exercise Debian source-package metadata, while the `.deb`, `dpkg status`, and distroless `status.d` lanes cover binary-package and installed-database behavior. Watch source-versus-binary package identity, multiple package stanzas, and Debian copyright/license files generating common-profile deltas that are not parser failures. | +| 11 | Docker | 🟢 Verified | `moby/moby` (2k–10k files)
`docker-library/official-images` (<500 files)
`docker-library/python` (<500 files)
`getsentry/self-hosted` (<500 files) | Docker needs both canonical and real deployment targets. `moby/moby` is the broad Dockerfile/build-context reference, `docker-library/official-images` is the source-of-truth library-definition lane, `docker-library/python` is a useful generated official-image leaf target, and `getsentry/self-hosted` adds compose-heavy multi-service realism. Watch multi-stage Dockerfiles, compose-plus-Dockerfile overlap, and template/env noise before treating extra findings as parser regressions. | +| 11a | Helm | 🟢 Verified | `baserow/baserow` (2k–10k files)
`appsmithorg/appsmith` (10k–50k files)
`DefectDojo/django-DefectDojo` (500–2k files) | Keep Helm explicit instead of relying on incidental chart files inside larger application repositories. `baserow/baserow` gives a strong `Chart.yaml` plus `Chart.lock` lane, `appsmithorg/appsmith` adds a large conventional chart deployment tree, and `DefectDojo/django-DefectDojo` is a smaller contrast target. The implemented Rust surface is static `Chart.yaml` plus `Chart.lock` parsing with sibling assembly, declared-versus-locked dependency coverage, and bounded malformed-entry tolerance; that needs at least one focused chart-first verification lane. | +| 12 | Conda | 🟢 Verified | `conda/conda` (500–2k files)
`conda/conda-build` (500–2k files)
`conda-forge/pandas-feedstock` (<500 files) | Conda needs three distinct target shapes. `conda/conda` covers user-facing environment metadata, `conda/conda-build` covers recipes and build outputs, and `conda-forge/pandas-feedstock` is the feedstock pattern Provenant must handle. Watch recipe-output duplication and generated feedstock files before overcounting package or license deltas. | +| 12a | Pixi | 🟢 Verified | `prefix-dev/pixi` (500–2k files)
`pydata/xarray` (500–2k files)
`OpenMDAO/OpenMDAO` (500–2k files) | Keep Pixi explicit even though some Python and Conda compare targets already surface `pixi.toml` and `pixi.lock`. `prefix-dev/pixi` is the canonical upstream with both `pixi.toml` and `pixi.lock`, `pydata/xarray` adds a real consumer repo, and `OpenMDAO/OpenMDAO` adds a second repo with both manifest and lockfile. This row isolates the native `pixi.toml` plus `pixi.lock` contract, mixed Conda/PyPI dependency behavior, and topology-planned root assembly instead of letting those behaviors hide inside broader Python-family compare noise. | +| 13 | Swift | 🟢 Verified | `pointfreeco/swift-composable-architecture` (500–2k files)
`SwiftFiddle/swiftfiddle-web` (<500 files)
`Package.swift.json` / `Package.resolved` snapshots via local `--target-path` lane | `pointfreeco/swift-composable-architecture` is a clean SwiftPM library reference, `SwiftFiddle/swiftfiddle-web` adds a real committed `Resources/Package.swift.json` plus `Package.resolved` target shape, and the local snapshot lane remains important for future pinned production captures that record generated SwiftPM surfaces alongside their source manifests. Watch repo-only verification gaps whenever a bug might live in `Package.swift.json` or `Package.resolved` rather than in source manifests. | +| 14 | Haskell / Hackage | 🟢 Verified | `commercialhaskell/stack` (500–2k files)
`jgm/pandoc` (500–2k files)
`yesodweb/yesod` (500–2k files) | Good mix of Cabal, Stack, and multi-package Haskell repository structure. | +| 15 | Scala / SBT | 🟢 Verified | `akka/akka` (2k–10k files)
`playframework/playframework` (2k–10k files)
`scalatest/scalatest` (500–2k files) | Valuable JVM surface, but current Rust scope is bounded static parsing rather than full evaluation semantics. | +| 16 | CocoaPods | 🟢 Verified | `AFNetworking/AFNetworking` (<500 files)
`Alamofire/Alamofire` (<500 files)
`SDWebImage/SDWebImage` (<500 files) | Strong Apple packaging coverage through widely used podspec-based libraries. | +| 16a | Carthage | 🟢 Verified | `Carthage/Carthage` (500–2k files)
`ReactiveCocoa/ReactiveCocoa` (<500 files)
`Mantle/Mantle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `Carthage/Carthage` is the canonical upstream with both `Cartfile` and `Cartfile.resolved`, while `ReactiveCocoa/ReactiveCocoa` and `Mantle/Mantle` are representative consumer libraries. Focus on correct `Cartfile` dependency extraction, `Cartfile.resolved` pinned-version coverage, and the dependency-hoisting contract for sibling manifest-plus-lockfile pairs without inventing a root Carthage package identity. | +| 16b | Yocto / BitBake | 🟢 Verified | `yoctoproject/poky` (10k–50k files)
`openembedded/meta-openembedded` (10k–50k files)
`pocketbeagle/meta-pocketbeagle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `yoctoproject/poky` is the canonical Yocto reference distribution, while `openembedded/meta-openembedded` provides a large recipe corpus across many layers. Focus on correct package identity extraction from filenames and `PN`/`PV` variables, license normalization of BitBake-specific operator syntax (`&`/`\|`), and `DEPENDS`/`RDEPENDS` dependency scoping. | +| 17 | Nix | 🟢 Verified | `NixOS/nixpkgs` (50k+ files)
`NixOS/nix` (2k–10k files)
`numtide/devshell` (<500 files) | Valuable ecosystem with explicit note that current `default.nix` support is intentionally bounded. | +| 18 | CPAN | 🟢 Verified | `Plack/Plack` (500–2k files)
`libwww-perl/libwww-perl` (500–2k files)
`PerlDancer/Dancer2` (500–2k files) | Good Perl metadata variety through `META.*`, `dist.ini`, and `Makefile.PL`. | +| 19 | CRAN / R | 🟢 Verified | `tidyverse/dplyr` (500–2k files)
`tidyverse/ggplot2` (500–2k files)
`r-lib/devtools` (500–2k files) | Strong DESCRIPTION-based metadata with realistic dependency fields. | +| 20 | Alpine | ⚪ Planned | `alpinelinux/aports`
official `.apk` sample via local `--target-path` lane
Alpine `lib/apk/db/installed` snapshot via local `--target-path` lane | Keep this family row even though Alpine rootfs targets also appear in `0c`: `0c` is the cross-cutting rootfs lane, while this row tracks Alpine-specific source, archive, and installed-DB surfaces. Do not treat rootfs-only verification as verification of the remaining `APKBUILD`, `.apk`, and standalone installed-DB surfaces listed here. | +| 21 | RPM | 🟢 Verified | `rpm-software-management/dnf` (2k–10k files)
`rpm-software-management/libdnf` (500–2k files)
official `.rpm` / RPM BDB, NDB, and SQLite DB snapshots via local `--target-path` lane | Important distro-family lane across source and installed-state metadata. `rpm-software-management/dnf` and `rpm-software-management/libdnf` cover realistic RPM-adjacent source trees, while the local `.rpm` and RPM DB lanes cover shipped package and installed-database behavior. Watch specfile subpackages, changelog/license fields, namespace-from-`os-release` behavior, and DB-versus-source differences separately during triage. | +| 22 | Arch Linux | ⚪ Planned | Arch Linux GitLab packaging repo for `pacman`
Arch Linux GitLab packaging repo for `grep`
official built package sample for `.PKGINFO` via local `--target-path` lane | Use one source-package contrast plus one built-package lane here. The Arch packaging repos cover PKGBUILD and `.SRCINFO` source metadata, while the local built-package lane covers `.PKGINFO` behavior that source repos do not contain. Keep the candidate repos concrete because the canonical Arch packaging sources live in the Arch packaging tree rather than in one obvious GitHub umbrella repository. | +| 23 | Bazel | 🟢 Verified | `tensorflow/tensorflow` (10k files)
`bazelbuild/bazel` (2k–10k files)
`protocolbuffers/protobuf` (2.5k files) | Strong Bazel lane across old and new module surfaces. `bazelbuild/bazel` is the canonical direct reference, `tensorflow/tensorflow` is the large mixed-language stress case, and `protocolbuffers/protobuf` is a smaller contrast target. Watch `WORKSPACE` versus `MODULE.bazel`, macro-heavy static-parsing limits, and giant `third_party` trees producing unrelated common-profile noise. | +| 24 | Autotools | 🟢 Verified | `curl/curl` (1k files)
`libevent/libevent` (<500 files)
`libgit2/libgit2` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files) | Mature native-build lane with several useful contrasts. `curl/curl` is the clearest autoconf-heavy reference, `libevent/libevent` is a smaller contrast, `libgit2/libgit2` adds a mixed native project shape, and `ffmpeg/ffmpeg` adds strong GPL/LGPL-conditional licensing pressure in a `configure`-driven native tree. Watch generated `configure` / `Makefile.in` noise and avoid collapsing file-level licensing differences into one top-level verdict. | +| 24a | Meson | 🟢 Verified | `qemu/qemu` (10k–50k files)
`systemd/systemd` (10k–50k files)
`LinuxCNC/linuxcnc` (2k–10k files) | Keep Meson explicit instead of assuming the Autotools or generic native-tree rows cover it. `qemu/qemu` and `systemd/systemd` are high-signal root-`meson.build` upstreams, while `LinuxCNC/linuxcnc` is a smaller contrast target. The shipped Rust surface is bounded static `meson.build` parsing for literal `project()` metadata and top-level `dependency()` calls, with explicit no-evaluation guardrails that deserve a focused verification lane. | +| 25 | Conan | 🟢 Verified | `conan-io/conan-center-index` (10k–50k files)
`catchorg/Catch2` (<500 files)
`fmtlib/fmt` (<500 files) | Conan needs both recipe-corpus and upstream-library targets. `conan-io/conan-center-index` is the authoritative recipe index, while `catchorg/Catch2` and `fmtlib/fmt` are smaller upstream consumer-library contrasts. Watch recipe-only repository structure, versioned recipe directories, and the difference between Conan recipe metadata and normal source-package behavior. | +| 26 | vcpkg | 🟢 Verified | `microsoft/vcpkg` (10k–50k files)
`microsoft/terminal` (2k–10k files)
`microsoft/onnxruntime` (10k–50k files) | Important Windows/`C++` manifest-mode lane. `microsoft/vcpkg` is the authoritative manifest and registry target, while `microsoft/terminal` and `microsoft/onnxruntime` cover large consumer repos that use `vcpkg.json` in real codebases. Watch current scope boundaries carefully: this row is about implemented manifest-mode metadata, not every vendored or toolchain surface in those trees. | +| 27 | Deno | 🟢 Verified | `denoland/fresh` (500–2k files)
`oakserver/oak` (500–2k files)
`denoland/std` (2k–10k files) | Useful modern JS/TS ecosystem with explicit config and lockfile coverage. | +| 28 | Dart / Pub | 🟢 Verified | `rrousselGit/riverpod` (500–2k files)
`firebase/flutterfire` (2k–10k files)
`flutter/packages` (2k–10k files) | Good Pub and Flutter-adjacent coverage through large multi-package repositories. | +| 29 | Git submodules | 🟢 Verified | `grpc/grpc` (10k–50k files)
`git/git` (500–2k files)
`chromium/chromium` (490,886 files) | This is a package-adjacent lane, not a parser-breadth lane. `git/git` is the clearest focused `.gitmodules` reference, `grpc/grpc` adds large real-world third-party trees, and `chromium/chromium` is the stress case. Watch absent submodule checkouts and vendored-tree context so `.gitmodules` findings stay coherent instead of being drowned by unrelated common-profile output. | +| 30 | Structured metadata (`CITATION.cff`, `publiccode.yml`) | 🟢 Verified | `astropy/astropy` (2k–10k files)
`iTowns/itowns` (500–2k files)
`univention/Nubus` (500–2k files) | Keep both structured-metadata families explicit here. `astropy/astropy` is the strongest `CITATION.cff` reference, `univention/Nubus` is the clearest `publiccode.yml` case, and `iTowns/itowns` adds mixed-project contrast. Watch that structured metadata stays visible beside richer README and package findings instead of being lost in broader common-profile output. | +| 31 | README | 🟢 Verified | `chromium/chromium` vendored `README.chromium` samples (490,886 files)
`vercel/next.js` (5k files)
`django/django` (2.5k files) | Chromium is the main proof target for the specialized README variants; the other two are broader repo-level contrast targets only. Use this row to verify that package-adjacent README parsing stays visible under the common profile instead of disappearing inside unrelated monorepo noise. | +| 32 | Linux Distro (`os-release`) | ⚪ Planned | Debian base-image rootfs snapshot
Fedora base-image rootfs snapshot
Distroless `base-debian12` rootfs snapshot | This row is rootfs-only on purpose. Debian and Fedora give conventional distro metadata layouts, while Distroless shows the minimal-image case where `os-release` may be one of the few package-identity signals present. Watch path/layout differences and do not treat intentionally sparse distroless metadata as a parser regression by itself. | +| 33 | AboutCode | ⚪ Planned | `aboutcode-org/scancode-toolkit` (10k–50k files)
`aboutcode-org/scancode.io` (500–2k files)
`aboutcode-org/dejacode` (500–2k files) | Niche but very high-fit `.ABOUT` lane. `aboutcode-org/scancode-toolkit` is the broadest real-world `.ABOUT` reference, while `aboutcode-org/scancode.io` and `aboutcode-org/dejacode` provide smaller product-style contrasts. Watch `.ABOUT` extraction staying visible beside denser package, README, and license output in these application trees. | +| 34 | Hex / Elixir | 🟢 Verified | `phoenixframework/phoenix` (500–2k files)
`elixir-ecto/ecto` (500–2k files)
`elixir-plug/plug` (<500 files) | Useful ecosystem, but current Rust scope is still the lockfile/static subset, so this ranks below the broader mainstream families. | +| 35 | OCaml / opam | 🟢 Verified | `ocaml/dune` (500–2k files)
`ocaml/ocaml-lsp` (500–2k files)
`ocaml/merlin` (500–2k files) | Good `opam` coverage, but lower practical verification priority than the broader ecosystems above. | +| 36 | Buck | 🟢 Verified | `facebook/buck2` (2k–10k files)
`facebook/watchman` (500–2k files)
`facebook/react-native` (10k–50k files) | Real Buck lane, even if narrower than Bazel in practice. `facebook/buck2` is the canonical direct reference, `facebook/watchman` is a smaller focused contrast, and `facebook/react-native` adds a large mixed-language consumer tree. Watch Buck metadata separately from the rest of the monorepo so unrelated JS/native/common-profile noise does not hide actual build-metadata gaps. | +| 37 | FreeBSD | ⚪ Planned | FreeBSD `pkg` package archive sample
FreeBSD `bash` package archive sample
FreeBSD `curl` package archive sample | Important artifact-family support, but narrower day-to-day scan prevalence than the higher-priority distro lanes. | +| 38 | Chef | ⚪ Planned | `sous-chefs/apache2` (<500 files)
`sous-chefs/mysql` (<500 files)
`chef/chef` (2k–10k files) | Worth covering, but lower priority than the mainstream language and distro families. | +| 39 | Bower | ⚪ Planned | `jquery/jquery-ui` (500–2k files)
`select2/select2` (<500 files)
`jashkenas/backbone` (<500 files) | Legacy ecosystem with ongoing value mostly for backward compatibility. | +| 40 | Haxe | ⚪ Planned | `openfl/openfl` (500–2k files)
`HaxeFlixel/flixel` (500–2k files)
`HeapsIO/heaps` (500–2k files) | Smaller ecosystem; still useful, but lower-value than the broader mainstream families above. | +| 41 | Windows Update | ⚪ Planned | `wsusscn2.cab` extracted tree
Windows cumulative update `.msu` extracted tree
Windows servicing stack update extracted tree | Artifact-oriented family with real value, but specialized and best handled after the higher-signal source/package ecosystems. | +| 42 | `misc.py` recognizers | ⚪ Planned | Apache Tomcat binary release artifacts
Firefox add-on / language-pack artifacts
NSIS official installer artifacts | Broad recognizer family, but not a normal package-manager lane; treat as specialized follow-up verification. | +| 43 | Julia | 🟢 Verified | `JuliaLang/Pkg.jl` (500–2k files)
`JuliaLang/julia` (10k–50k files)
`JuliaPlots/Plots.jl` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `JuliaLang/Pkg.jl` is the canonical `Project.toml` and `Manifest.toml` reference, `JuliaLang/julia` adds a large real-world Julia project tree, and `JuliaPlots/Plots.jl` is a mid-sized consumer library. Focus on correct `Project.toml` metadata extraction, `Manifest.toml` resolved dependency coverage, and sibling assembly of project-plus-manifest pairs. | +| 44 | Erlang / OTP | ⚪ Planned | `processone/ejabberd` (2k–10k files)
`erlang/otp` (10k–50k files)
`vernemq/vernemq` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `processone/ejabberd` is a large real-world Erlang project with `rebar.config`, `rebar.lock`, and multiple `.app.src` files across its dependency tree. `erlang/otp` is the canonical OTP distribution with many `.app.src` files. `vernemq/vernemq` adds a complex multi-dependency rebar project with mixed pkg and git dependencies in `rebar.lock`. Focus on correct `.app.src` metadata and dependency extraction, `rebar.config` dependency parsing including git and profile deps, and `rebar.lock` resolved dependency and hash coverage. | ## How to maintain this file diff --git a/docs/improvements/erlang-otp-parser.md b/docs/improvements/erlang-otp-parser.md new file mode 100644 index 000000000..4633c8996 --- /dev/null +++ b/docs/improvements/erlang-otp-parser.md @@ -0,0 +1,49 @@ +# Erlang / OTP Parser Improvements + +## Summary + +Rust now ships static Erlang/OTP package metadata support for `*.app.src` application resource +files, `rebar.config` build configuration, and `rebar.lock` lockfiles. Python ScanCode does not +currently provide a production Erlang/OTP parser. + +## Rust Improvements + +### Application resource file coverage (`*.app.src`) + +- Rust parses OTP application resource files using a native Erlang term parser. +- Extracts package identity from the `{application, Name, Props}` tuple, including `vsn`, + `description`, `licenses`, and `links` fields. +- Filters OTP standard library applications (`kernel`, `stdlib`, `sasl`, `crypto`, etc.) from the + `applications` dependency list so only third-party dependencies appear in parser output. +- Handles `runtime_dependencies` entries with embedded version requirements (e.g., `"cowboy-2.10.0"`). +- Skips template version strings like `"%VSN%"` that are replaced at build time. +- Extracts `maintainers` and `keywords` metadata when present. 
+ +### Rebar3 configuration coverage (`rebar.config`) + +- Rust parses `rebar.config` files and extracts dependencies from the `deps` field. +- Supports Hex package dependencies (`{Name, Version}`), git dependencies with tag/branch/ref + references, and version-constrained git dependencies (`{Name, Version, {git, URL, Ref}}`). +- Extracts profile-scoped dependencies from the `profiles` field (e.g., test dependencies). +- Preserves git source URLs in dependency `extra_data` for provenance tracking. + +### Rebar3 lockfile coverage (`rebar.lock`) + +- Rust parses both v1 (flat list) and v2 (`{"1.2.0", [deps]}`) rebar.lock formats. +- Extracts resolved package versions and git commit references as pinned dependencies. +- Resolves SHA256 checksums from the `pkg_hash` section into `resolved_package` metadata. +- Produces `ResolvedPackage` entries with Hex registry homepage and API URLs. + +### Sibling assembly + +- `rebar.config` and `rebar.lock` participate in sibling merge assembly so manifest and lockfile + data combine into one logical package when both files are present. +- `*.app.src` files remain standalone since they describe individual OTP applications rather than + project-level build configuration. + +## Guardrails + +- Rust does **not** evaluate Erlang expressions, resolve variables, or execute rebar3 plugins. +- Conditional dependency wrappers like `{if_var_true, ...}` are skipped rather than guessed at. +- The Erlang term parser handles atoms, strings, binaries (`<<"...">>`), tuples, lists, integers, + floats, and Erlang-style `%` comments but does not attempt full Erlang syntax coverage. 
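Reviewer sketch (not part of the patch): the lockfile note above distinguishes a v1 flat list from a v2 `{"1.2.0", [deps]}` tuple. A minimal way to tell the two shapes apart could look like the following. `Term`, `LockFormat`, and `lock_deps` are hypothetical names for illustration only; the shipped `ErlTerm` AST has more variants, and the actual dispatch in `erlang_otp.rs` may differ.

```rust
// Hypothetical, simplified term type used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Atom(String),
    Str(String),
    Tuple(Vec<Term>),
    List(Vec<Term>),
}

// Which rebar.lock layout the top-level term matched.
#[derive(Debug, PartialEq)]
enum LockFormat {
    V1,
    V2 { version: String },
}

// Return the detected format and the dependency entries, or None when the
// top-level term is neither a bare list nor a {Version, Deps} pair.
fn lock_deps(top: &Term) -> Option<(LockFormat, &[Term])> {
    match top {
        // v1: the whole file is one flat list of dependency tuples.
        Term::List(deps) => Some((LockFormat::V1, deps.as_slice())),
        // v2: a two-element tuple of a version string and the dependency list.
        Term::Tuple(items) => match items.as_slice() {
            [Term::Str(version), Term::List(deps)] => Some((
                LockFormat::V2 {
                    version: version.clone(),
                },
                deps.as_slice(),
            )),
            _ => None,
        },
        _ => None,
    }
}
```

Anything that matches neither shape yields `None`, which is in line with the guardrail of skipping unrecognized input rather than guessing at it.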
diff --git a/src/assembly/assemblers.rs b/src/assembly/assemblers.rs index 4ce243e1f..f04168089 100644 --- a/src/assembly/assemblers.rs +++ b/src/assembly/assemblers.rs @@ -346,6 +346,12 @@ pub static ASSEMBLERS: &[AssemblerConfig] = &[ sibling_file_patterns: &["Project.toml", "Manifest.toml"], mode: AssemblyMode::SiblingMerge, }, + // Erlang/OTP Rebar ecosystem + AssemblerConfig { + datasource_ids: &[DatasourceId::RebarConfig, DatasourceId::RebarLock], + sibling_file_patterns: &["rebar.config", "rebar.lock"], + mode: AssemblyMode::SiblingMerge, + }, // Carthage ecosystem AssemblerConfig { datasource_ids: &[ @@ -887,6 +893,7 @@ pub static UNASSEMBLED_DATASOURCE_IDS: &[DatasourceId] = &[ DatasourceId::GoBinary, DatasourceId::WindowsExecutable, DatasourceId::Dockerfile, + DatasourceId::ErlangOtpAppSrc, DatasourceId::HexMixLock, DatasourceId::JavaEarApplicationXml, DatasourceId::JavaWarWebXml, diff --git a/src/models/datasource_id.rs b/src/models/datasource_id.rs index 446e20793..7b43d2572 100644 --- a/src/models/datasource_id.rs +++ b/src/models/datasource_id.rs @@ -180,6 +180,11 @@ pub enum DatasourceId { // ── Docker ── Dockerfile, + // ── Erlang / OTP ── + ErlangOtpAppSrc, + RebarConfig, + RebarLock, + // ── FreeBSD ── FreebsdCompactManifest, @@ -479,6 +484,9 @@ impl DatasourceId { Self::DenoJson => "deno_json", Self::DenoLock => "deno_lock", Self::Dockerfile => "dockerfile", + Self::ErlangOtpAppSrc => "erlang_otp_app_src", + Self::RebarConfig => "rebar_config", + Self::RebarLock => "rebar_lock", Self::BazelModule => "bazel_module", // FreeBSD diff --git a/src/parsers/erlang_otp.rs b/src/parsers/erlang_otp.rs new file mode 100644 index 000000000..60277c453 --- /dev/null +++ b/src/parsers/erlang_otp.rs @@ -0,0 +1,1048 @@ +// SPDX-FileCopyrightText: Provenant contributors +// SPDX-License-Identifier: Apache-2.0 + +use std::collections::HashMap; +use std::path::Path; + +use packageurl::PackageUrl; +use serde_json::Value as JsonValue; + +use crate::models::{ + 
DatasourceId, Dependency, PackageData, PackageType, ResolvedPackage, Sha256Digest, +}; +use crate::parser_warn as warn; +use crate::parsers::utils::{ + MAX_ITERATION_COUNT, MAX_RECURSION_DEPTH, read_file_to_string, truncate_field, +}; + +use super::PackageParser; + +// ── Parser structs ── + +pub struct ErlangAppSrcParser; +pub struct RebarConfigParser; +pub struct RebarLockParser; + +// ── Erlang term AST ── + +#[derive(Clone, Debug)] +enum ErlTerm { + Atom(String), + String(String), + Binary(String), + Integer(i64), + Float(f64), + Tuple(Vec), + List(Vec), +} + +// ── Erlang term parser ── + +struct ErlParser { + chars: Vec, + pos: usize, + depth: usize, +} + +impl ErlParser { + fn new(source: &str) -> Self { + Self { + chars: source.chars().collect(), + pos: 0, + depth: 0, + } + } + + fn parse_term(&mut self) -> Result { + if self.depth >= MAX_RECURSION_DEPTH { + return Err("recursion depth exceeded".to_string()); + } + self.depth += 1; + let result = self.parse_term_inner(); + self.depth -= 1; + result + } + + fn parse_term_inner(&mut self) -> Result { + self.skip_whitespace_and_comments(); + match self.peek() { + Some('{') => self.parse_tuple(), + Some('[') => self.parse_list(), + Some('"') => self.parse_string().map(ErlTerm::String), + Some('<') if self.peek_n(1) == Some('<') => self.parse_binary().map(ErlTerm::Binary), + Some('\'') => self.parse_quoted_atom().map(ErlTerm::Atom), + Some(c) if c.is_ascii_digit() || c == '-' => self.parse_number(), + Some(c) if c.is_ascii_lowercase() || c == '_' => self.parse_atom_or_bool(), + Some(c) => Err(format!( + "Unexpected character '{}' at position {}", + c, self.pos + )), + None => Err("Unexpected end of input".to_string()), + } + } + + fn parse_tuple(&mut self) -> Result { + self.expect('{')?; + let items = self.parse_comma_separated('}')?; + Ok(ErlTerm::Tuple(items)) + } + + fn parse_list(&mut self) -> Result { + self.expect('[')?; + let items = self.parse_comma_separated(']')?; + Ok(ErlTerm::List(items)) + } + + fn 
parse_comma_separated(&mut self, closing: char) -> Result, String> { + let mut items = Vec::new(); + let mut count = 0usize; + loop { + self.skip_whitespace_and_comments(); + if self.peek() == Some(closing) { + self.pos += 1; + break; + } + if count >= MAX_ITERATION_COUNT { + return Err("too many items".to_string()); + } + items.push(self.parse_term()?); + count += 1; + self.skip_whitespace_and_comments(); + if self.peek() == Some(',') { + self.pos += 1; + } else if self.peek() == Some('|') { + // list tail syntax: [H | T] — skip rest + self.pos += 1; + self.parse_term()?; + self.skip_whitespace_and_comments(); + if self.peek() == Some(closing) { + self.pos += 1; + } + break; + } + } + Ok(items) + } + + fn parse_string(&mut self) -> Result { + self.expect('"')?; + let mut out = String::new(); + while let Some(c) = self.peek() { + self.pos += 1; + match c { + '"' => return Ok(out), + '\\' => { + let escaped = self + .peek() + .ok_or_else(|| "Unterminated string escape".to_string())?; + self.pos += 1; + out.push(match escaped { + 'n' => '\n', + 'r' => '\r', + 't' => '\t', + '"' => '"', + '\\' => '\\', + other => other, + }); + } + other => out.push(other), + } + } + Err("Unterminated string literal".to_string()) + } + + fn parse_binary(&mut self) -> Result { + self.expect('<')?; + self.expect('<')?; + self.skip_whitespace_and_comments(); + let value = if self.peek() == Some('"') { + self.parse_string()? 
+ } else { + String::new() + }; + self.skip_whitespace_and_comments(); + self.expect('>')?; + self.expect('>')?; + Ok(value) + } + + fn parse_quoted_atom(&mut self) -> Result { + self.expect('\'')?; + let mut out = String::new(); + while let Some(c) = self.peek() { + self.pos += 1; + match c { + '\'' => return Ok(out), + '\\' => { + if let Some(escaped) = self.peek() { + self.pos += 1; + out.push(escaped); + } + } + other => out.push(other), + } + } + Err("Unterminated quoted atom".to_string()) + } + + fn parse_atom_or_bool(&mut self) -> Result { + let atom = self.parse_bare_atom()?; + match atom.as_str() { + "true" => Ok(ErlTerm::Atom("true".to_string())), + "false" => Ok(ErlTerm::Atom("false".to_string())), + _ => Ok(ErlTerm::Atom(atom)), + } + } + + fn parse_bare_atom(&mut self) -> Result { + let start = self.pos; + while let Some(c) = self.peek() { + if c.is_ascii_alphanumeric() || c == '_' || c == '@' { + self.pos += 1; + } else { + break; + } + } + if self.pos == start { + return Err("Expected atom".to_string()); + } + Ok(self.chars[start..self.pos].iter().collect()) + } + + fn parse_number(&mut self) -> Result { + let start = self.pos; + if self.peek() == Some('-') { + self.pos += 1; + } + while let Some(c) = self.peek() { + if c.is_ascii_digit() { + self.pos += 1; + } else { + break; + } + } + if self.peek() == Some('.') && self.peek_n(1).is_some_and(|c| c.is_ascii_digit()) { + self.pos += 1; + while let Some(c) = self.peek() { + if c.is_ascii_digit() { + self.pos += 1; + } else { + break; + } + } + let s: String = self.chars[start..self.pos].iter().collect(); + return s + .parse::() + .map(ErlTerm::Float) + .map_err(|e| format!("Invalid float: {}", e)); + } + let s: String = self.chars[start..self.pos].iter().collect(); + s.parse::() + .map(ErlTerm::Integer) + .map_err(|e| format!("Invalid integer: {}", e)) + } + + fn skip_whitespace_and_comments(&mut self) { + loop { + match self.peek() { + Some(c) if c.is_whitespace() => { + self.pos += 1; + } + 
Some('%') => { + while let Some(c) = self.peek() { + self.pos += 1; + if c == '\n' { + break; + } + } + } + _ => break, + } + } + } + + fn expect(&mut self, expected: char) -> Result<(), String> { + self.skip_whitespace_and_comments(); + match self.peek() { + Some(c) if c == expected => { + self.pos += 1; + Ok(()) + } + Some(c) => Err(format!( + "Expected '{}' but found '{}' at position {}", + expected, c, self.pos + )), + None => Err(format!("Expected '{}' but reached end of input", expected)), + } + } + + fn peek(&self) -> Option { + self.chars.get(self.pos).copied() + } + + fn peek_n(&self, n: usize) -> Option { + self.chars.get(self.pos + n).copied() + } + + fn is_eof(&self) -> bool { + self.pos >= self.chars.len() + } +} + +fn parse_dotted_terms(content: &str) -> Result, String> { + let mut parser = ErlParser::new(content); + let mut terms = Vec::new(); + let mut count = 0usize; + loop { + parser.skip_whitespace_and_comments(); + if parser.is_eof() { + break; + } + if count >= MAX_ITERATION_COUNT { + break; + } + let term = parser.parse_term()?; + terms.push(term); + count += 1; + parser.skip_whitespace_and_comments(); + if parser.peek() == Some('.') { + parser.pos += 1; + } + } + Ok(terms) +} + +// ── Helpers ── + +fn term_to_str(term: &ErlTerm) -> Option { + match term { + ErlTerm::String(s) | ErlTerm::Binary(s) | ErlTerm::Atom(s) => Some(s.clone()), + ErlTerm::Integer(n) => Some(n.to_string()), + ErlTerm::Float(f) => Some(f.to_string()), + _ => None, + } +} + +fn term_to_proplist(term: &ErlTerm) -> Option> { + let items = match term { + ErlTerm::List(items) => items, + _ => return None, + }; + let mut result = Vec::new(); + for item in items { + if let ErlTerm::Tuple(fields) = item + && fields.len() == 2 + && let Some(key) = term_to_str(&fields[0]) + { + result.push((key, fields[1].clone())); + } + } + Some(result) +} + +fn term_to_atom_list(term: &ErlTerm) -> Vec { + match term { + ErlTerm::List(items) => items.iter().filter_map(term_to_str).collect(), + _ 
=> Vec::new(), + } +} + +fn build_hex_purl(name: &str, version: Option<&str>) -> Option { + let mut purl = PackageUrl::new("hex", name).ok()?; + if let Some(version) = version { + purl.with_version(version).ok()?; + } + Some(purl.to_string()) +} + +// ── ErlangAppSrcParser ── + +impl PackageParser for ErlangAppSrcParser { + const PACKAGE_TYPE: PackageType = PackageType::Hex; + + fn is_match(path: &Path) -> bool { + path.extension() + .and_then(|e| e.to_str()) + .is_some_and(|ext| ext == "src") + && path + .file_stem() + .and_then(|s| s.to_str()) + .is_some_and(|stem| stem.ends_with(".app")) + } + + fn extract_packages(path: &Path) -> Vec { + let content = match read_file_to_string(path, None) { + Ok(c) => c, + Err(e) => { + warn!("Failed to read {:?}: {}", path, e); + return vec![default_app_src_package()]; + } + }; + + match parse_app_src(&content) { + Ok(pkg) => vec![pkg], + Err(e) => { + warn!("Failed to parse {:?}: {}", path, e); + vec![default_app_src_package()] + } + } + } +} + +fn default_app_src_package() -> PackageData { + PackageData { + package_type: Some(PackageType::Hex), + primary_language: Some("Erlang".to_string()), + datasource_id: Some(DatasourceId::ErlangOtpAppSrc), + ..Default::default() + } +} + +fn parse_app_src(content: &str) -> Result { + let terms = parse_dotted_terms(content)?; + + let app_tuple = terms + .into_iter() + .find_map(|term| { + if let ErlTerm::Tuple(fields) = &term + && fields.len() == 3 + && term_to_str(&fields[0]).as_deref() == Some("application") + { + Some(term) + } else { + None + } + }) + .ok_or_else(|| "No {application, _, _} tuple found".to_string())?; + + let fields = match app_tuple { + ErlTerm::Tuple(fields) => fields, + _ => unreachable!(), + }; + + let app_name = term_to_str(&fields[1]); + let props = term_to_proplist(&fields[2]).unwrap_or_default(); + + let mut package = default_app_src_package(); + package.name = app_name.map(truncate_field); + + let mut extra_data = HashMap::new(); + + for (key, value) in 
&props { + match key.as_str() { + "vsn" => { + if let Some(v) = term_to_str(value) + && !v.contains('%') + { + package.version = Some(truncate_field(v)); + } + } + "description" => { + package.description = term_to_str(value).map(truncate_field); + } + "licenses" => { + let licenses = term_to_atom_list(value); + if !licenses.is_empty() { + package.extracted_license_statement = Some(truncate_field(licenses.join(", "))); + } + } + "links" => { + if let Some(link_props) = term_to_proplist(value) { + for (link_name, link_val) in &link_props { + if let Some(url) = term_to_str(link_val) { + let lower = link_name.to_lowercase(); + if lower.contains("github") + || lower.contains("source") + || lower.contains("repo") + { + package.vcs_url = Some(truncate_field(url.clone())); + } + if package.homepage_url.is_none() { + package.homepage_url = Some(truncate_field(url)); + } + } + } + } + } + "applications" => { + let apps = term_to_atom_list(value); + for app in apps { + if is_otp_stdlib(&app) { + continue; + } + package.dependencies.push(Dependency { + purl: build_hex_purl(&app, None).map(truncate_field), + extracted_requirement: None, + scope: Some("dependencies".to_string()), + is_runtime: Some(true), + is_optional: None, + is_pinned: None, + is_direct: None, + resolved_package: None, + extra_data: None, + }); + } + } + "runtime_dependencies" => { + let deps = term_to_atom_list(value); + for dep_str in deps { + if let Some((name, version)) = dep_str.split_once('-') { + if is_otp_stdlib(name) { + continue; + } + let version_str = if version.starts_with('@') { + None + } else { + Some(version) + }; + package.dependencies.push(Dependency { + purl: build_hex_purl(name, version_str).map(truncate_field), + extracted_requirement: version_str + .map(|v| truncate_field(v.to_string())), + scope: Some("dependencies".to_string()), + is_runtime: Some(true), + is_optional: None, + is_pinned: None, + is_direct: None, + resolved_package: None, + extra_data: None, + }); + } + } + } + 
"maintainers" => { + let maintainers = term_to_atom_list(value); + if !maintainers.is_empty() { + extra_data.insert( + "maintainers".to_string(), + JsonValue::Array( + maintainers + .into_iter() + .map(|m| JsonValue::String(truncate_field(m))) + .collect(), + ), + ); + } + } + "keywords" => { + let keywords = term_to_atom_list(value); + if !keywords.is_empty() { + package.keywords = keywords.into_iter().map(truncate_field).collect(); + } + } + _ => {} + } + } + + if let Some(ref name) = package.name { + package.purl = build_hex_purl(name, package.version.as_deref()).map(truncate_field); + package.repository_homepage_url = + Some(truncate_field(format!("https://hex.pm/packages/{}", name))); + package.api_data_url = Some(truncate_field(format!( + "https://hex.pm/api/packages/{}", + name + ))); + } + + if !extra_data.is_empty() { + package.extra_data = Some(extra_data); + } + + Ok(package) +} + +fn is_otp_stdlib(name: &str) -> bool { + matches!( + name, + "kernel" + | "stdlib" + | "sasl" + | "erts" + | "compiler" + | "crypto" + | "inets" + | "ssl" + | "public_key" + | "asn1" + | "syntax_tools" + | "tools" + | "os_mon" + | "runtime_tools" + | "mnesia" + | "observer" + | "wx" + | "debugger" + | "reltool" + | "xmerl" + | "edoc" + | "eunit" + | "common_test" + | "dialyzer" + | "et" + | "megaco" + | "parsetools" + | "snmp" + | "ssh" + | "tftp" + | "ftp" + | "erl_interface" + | "jinterface" + | "odbc" + | "eldap" + | "diameter" + ) +} + +// ── RebarConfigParser ── + +impl PackageParser for RebarConfigParser { + const PACKAGE_TYPE: PackageType = PackageType::Hex; + + fn is_match(path: &Path) -> bool { + path.file_name().and_then(|n| n.to_str()) == Some("rebar.config") + } + + fn extract_packages(path: &Path) -> Vec { + let content = match read_file_to_string(path, None) { + Ok(c) => c, + Err(e) => { + warn!("Failed to read {:?}: {}", path, e); + return vec![default_rebar_config_package()]; + } + }; + + match parse_rebar_config(&content) { + Ok(pkg) => vec![pkg], + Err(e) => 
{ + warn!("Failed to parse {:?}: {}", path, e); + vec![default_rebar_config_package()] + } + } + } +} + +fn default_rebar_config_package() -> PackageData { + PackageData { + package_type: Some(PackageType::Hex), + primary_language: Some("Erlang".to_string()), + datasource_id: Some(DatasourceId::RebarConfig), + ..Default::default() + } +} + +fn parse_rebar_config(content: &str) -> Result { + let terms = parse_dotted_terms(content)?; + + let mut package = default_rebar_config_package(); + + for term in &terms { + if let ErlTerm::Tuple(fields) = term + && fields.len() == 2 + { + let key = term_to_str(&fields[0]); + match key.as_deref() { + Some("deps") => { + if let ErlTerm::List(deps) = &fields[1] { + for dep in deps.iter().take(MAX_ITERATION_COUNT) { + if let Some(d) = parse_rebar_dep(dep) { + package.dependencies.push(d); + } + } + } + } + Some("profiles") => { + parse_profile_deps(&fields[1], &mut package.dependencies); + } + _ => {} + } + } + } + + Ok(package) +} + +fn parse_rebar_dep(term: &ErlTerm) -> Option { + let fields = match term { + ErlTerm::Tuple(fields) => fields, + _ => return None, + }; + + if fields.is_empty() { + return None; + } + + if let Some(key) = term_to_str(&fields[0]) + && key.starts_with("if_") + { + return fields.last().and_then(parse_rebar_dep); + } + + let name = term_to_str(&fields[0])?; + + match fields.len() { + // {Name, Version} or {Name, {git, URL, Ref}} + 2 => { + if let Some(version) = term_to_str(&fields[1]) { + // {Name, Version} + Some(Dependency { + purl: build_hex_purl(&name, Some(&version)).map(truncate_field), + extracted_requirement: Some(truncate_field(version)), + scope: Some("dependencies".to_string()), + is_runtime: None, + is_optional: None, + is_pinned: None, + is_direct: None, + resolved_package: None, + extra_data: None, + }) + } else { + // {Name, {git, URL, Ref}} + let vcs_url = extract_git_url(&fields[1]); + let version = extract_git_version(&fields[1]); + let git_extra = vcs_url.map(|url| { + HashMap::from([( 
+ "vcs_url".to_string(), + JsonValue::String(truncate_field(url)), + )]) + }); + Some(Dependency { + purl: build_hex_purl(&name, version.as_deref()).map(truncate_field), + extracted_requirement: version.map(truncate_field), + scope: Some("dependencies".to_string()), + is_runtime: None, + is_optional: None, + is_pinned: None, + is_direct: None, + resolved_package: None, + extra_data: git_extra, + }) + } + } + // {Name, Version, Source} + 3 => { + if let Some(version) = term_to_str(&fields[1]) { + // {Name, Version, {git, URL, Ref}} + let git_extra = extract_git_url(&fields[2]).map(|vcs_url| { + HashMap::from([( + "vcs_url".to_string(), + JsonValue::String(truncate_field(vcs_url)), + )]) + }); + Some(Dependency { + purl: build_hex_purl(&name, Some(&version)).map(truncate_field), + extracted_requirement: Some(truncate_field(version)), + scope: Some("dependencies".to_string()), + is_runtime: None, + is_optional: None, + is_pinned: None, + is_direct: None, + resolved_package: None, + extra_data: git_extra, + }) + } else { + // {Name, {git, URL, Ref}} + let vcs_url = extract_git_url(&fields[1]); + let version = extract_git_version(&fields[1]); + let git_extra = vcs_url.map(|url| { + HashMap::from([( + "vcs_url".to_string(), + JsonValue::String(truncate_field(url)), + )]) + }); + Some(Dependency { + purl: build_hex_purl(&name, version.as_deref()).map(truncate_field), + extracted_requirement: version.map(truncate_field), + scope: Some("dependencies".to_string()), + is_runtime: None, + is_optional: None, + is_pinned: None, + is_direct: None, + resolved_package: None, + extra_data: git_extra, + }) + } + } + _ => None, + } +} + +fn extract_git_url(term: &ErlTerm) -> Option { + if let ErlTerm::Tuple(fields) = term + && fields.len() >= 2 + && term_to_str(&fields[0]).as_deref() == Some("git") + { + term_to_str(&fields[1]) + } else { + None + } +} + +fn extract_git_version(term: &ErlTerm) -> Option { + if let ErlTerm::Tuple(fields) = term + && fields.len() >= 3 + && 
term_to_str(&fields[0]).as_deref() == Some("git") + { + if let ErlTerm::Tuple(ref_fields) = &fields[2] + && ref_fields.len() == 2 + { + let ref_type = term_to_str(&ref_fields[0])?; + let ref_val = term_to_str(&ref_fields[1])?; + match ref_type.as_str() { + "tag" => Some(ref_val), + _ => None, + } + } else { + None + } + } else { + None + } +} + +fn parse_profile_deps(term: &ErlTerm, dependencies: &mut Vec) { + let profiles = match term { + ErlTerm::List(items) => items, + _ => return, + }; + + for profile in profiles.iter().take(MAX_ITERATION_COUNT) { + if let ErlTerm::Tuple(fields) = profile + && fields.len() == 2 + { + let profile_name = term_to_str(&fields[0]).unwrap_or_default(); + if let ErlTerm::List(profile_opts) = &fields[1] { + for opt in profile_opts { + if let ErlTerm::Tuple(opt_fields) = opt + && opt_fields.len() == 2 + && term_to_str(&opt_fields[0]).as_deref() == Some("deps") + && let ErlTerm::List(deps) = &opt_fields[1] + { + for dep in deps.iter().take(MAX_ITERATION_COUNT) { + if let Some(mut d) = parse_rebar_dep(dep) { + d.scope = Some(truncate_field(profile_name.clone())); + dependencies.push(d); + } + } + } + } + } + } + } +} + +// ── RebarLockParser ── + +impl PackageParser for RebarLockParser { + const PACKAGE_TYPE: PackageType = PackageType::Hex; + + fn is_match(path: &Path) -> bool { + path.file_name().and_then(|n| n.to_str()) == Some("rebar.lock") + } + + fn extract_packages(path: &Path) -> Vec { + let content = match read_file_to_string(path, None) { + Ok(c) => c, + Err(e) => { + warn!("Failed to read {:?}: {}", path, e); + return vec![default_rebar_lock_package()]; + } + }; + + match parse_rebar_lock(&content) { + Ok(pkg) => vec![pkg], + Err(e) => { + warn!("Failed to parse {:?}: {}", path, e); + vec![default_rebar_lock_package()] + } + } + } +} + +fn default_rebar_lock_package() -> PackageData { + PackageData { + package_type: Some(PackageType::Hex), + primary_language: Some("Erlang".to_string()), + datasource_id: 
Some(DatasourceId::RebarLock), + ..Default::default() + } +} + +fn parse_rebar_lock(content: &str) -> Result { + let terms = parse_dotted_terms(content)?; + + // rebar.lock format: first term is either: + // - {Version, [deps]} (v2 format, e.g. {"1.2.0", [...]}) + // - [deps] (v1 format, flat list) + // Second term (if present): [{pkg_hash, [...]}, {pkg_hash_ext, [...]}] + + let (dep_list, hash_map) = match terms.as_slice() { + // v2 format: {"1.2.0", [deps]} + [ErlTerm::Tuple(fields), rest @ ..] if fields.len() == 2 => { + let deps = match &fields[1] { + ErlTerm::List(items) => items.clone(), + _ => return Err("Expected dependency list in lock tuple".to_string()), + }; + let hashes = rest.first().map(extract_pkg_hashes).unwrap_or_default(); + (deps, hashes) + } + // v1 format: [deps] + [ErlTerm::List(items), rest @ ..] => { + let hashes = rest.first().map(extract_pkg_hashes).unwrap_or_default(); + (items.clone(), hashes) + } + _ => return Err("Unrecognized rebar.lock format".to_string()), + }; + + let mut package = default_rebar_lock_package(); + + for dep_term in dep_list.iter().take(MAX_ITERATION_COUNT) { + if let Some(dep) = parse_lock_dep(dep_term, &hash_map) { + package.dependencies.push(dep); + } + } + + Ok(package) +} + +fn parse_lock_dep(term: &ErlTerm, hashes: &HashMap) -> Option { + let fields = match term { + ErlTerm::Tuple(fields) if fields.len() >= 3 => fields, + _ => return None, + }; + + let name = term_to_str(&fields[0])?; + // fields[2] is the level (integer) + + let (version, vcs_url) = match &fields[1] { + // {pkg, <<"name">>, <<"version">>} + ErlTerm::Tuple(pkg_fields) + if pkg_fields.len() >= 3 && term_to_str(&pkg_fields[0]).as_deref() == Some("pkg") => + { + let ver = term_to_str(&pkg_fields[2]); + (ver, None) + } + // {git, "url", {ref, "hash"}} + ErlTerm::Tuple(git_fields) + if git_fields.len() >= 2 && term_to_str(&git_fields[0]).as_deref() == Some("git") => + { + let url = term_to_str(&git_fields[1]); + let ver = if git_fields.len() >= 3 { 
+ extract_git_version_from_lock_ref(&git_fields[2]) + } else { + None + }; + (ver, url) + } + _ => (None, None), + }; + + let sha256 = hashes + .get(&name) + .and_then(|h| Sha256Digest::from_hex(h).ok()); + + let resolved_package = ResolvedPackage { + primary_language: Some("Erlang".to_string()), + sha256, + is_virtual: true, + datasource_id: Some(DatasourceId::RebarLock), + purl: build_hex_purl(&name, version.as_deref()).map(truncate_field), + repository_homepage_url: Some(truncate_field(format!("https://hex.pm/packages/{}", name))), + api_data_url: Some(truncate_field(format!( + "https://hex.pm/api/packages/{}", + name + ))), + ..ResolvedPackage::new( + PackageType::Hex, + String::new(), + name.clone(), + version.clone().unwrap_or_default(), + ) + }; + + let mut extra_data = HashMap::new(); + if let Some(url) = vcs_url { + extra_data.insert( + "vcs_url".to_string(), + JsonValue::String(truncate_field(url)), + ); + } + + Some(Dependency { + purl: build_hex_purl(&name, version.as_deref()).map(truncate_field), + extracted_requirement: version.map(truncate_field), + scope: Some("dependencies".to_string()), + is_runtime: None, + is_optional: None, + is_pinned: Some(true), + is_direct: None, + resolved_package: Some(Box::new(resolved_package)), + extra_data: if extra_data.is_empty() { + None + } else { + Some(extra_data) + }, + }) +} + +fn extract_git_version_from_lock_ref(term: &ErlTerm) -> Option { + if let ErlTerm::Tuple(fields) = term + && fields.len() == 2 + && term_to_str(&fields[0]).as_deref() == Some("ref") + { + term_to_str(&fields[1]) + } else { + None + } +} + +fn extract_pkg_hashes(term: &ErlTerm) -> HashMap { + let items = match term { + ErlTerm::List(items) => items, + _ => return HashMap::new(), + }; + + let mut hashes = HashMap::new(); + for item in items { + if let ErlTerm::Tuple(fields) = item + && fields.len() == 2 + && term_to_str(&fields[0]).as_deref() == Some("pkg_hash") + && let ErlTerm::List(hash_list) = &fields[1] + { + for entry in 
hash_list.iter().take(MAX_ITERATION_COUNT) { + if let ErlTerm::Tuple(pair) = entry + && pair.len() == 2 + && let (Some(name), Some(hash)) = (term_to_str(&pair[0]), term_to_str(&pair[1])) + { + hashes.insert(name, hash); + } + } + } + } + hashes +} + +// ── Parser metadata registration ── + +crate::register_parser!( + "Erlang OTP application resource file", + &["**/*.app.src"], + "hex", + "Erlang", + Some("https://www.erlang.org/doc/apps/kernel/application"), +); + +crate::register_parser!( + "Rebar3 configuration", + &["**/rebar.config"], + "hex", + "Erlang", + Some("https://rebar3.org/docs/configuration/configuration/"), +); + +crate::register_parser!( + "Rebar3 lockfile", + &["**/rebar.lock"], + "hex", + "Erlang", + Some("https://rebar3.org/docs/configuration/configuration/"), +); diff --git a/src/parsers/erlang_otp_golden_test.rs b/src/parsers/erlang_otp_golden_test.rs new file mode 100644 index 000000000..a7a8a42e0 --- /dev/null +++ b/src/parsers/erlang_otp_golden_test.rs @@ -0,0 +1,63 @@ +// SPDX-FileCopyrightText: Provenant contributors +// SPDX-License-Identifier: Apache-2.0 + +#[cfg(all(test, feature = "golden-tests"))] +mod golden_tests { + use crate::parsers::PackageParser; + use crate::parsers::erlang_otp::{ErlangAppSrcParser, RebarConfigParser, RebarLockParser}; + use crate::parsers::golden_test_utils::compare_package_data_parser_only; + use std::path::Path; + use std::path::PathBuf; + + fn assert_fixture_exists(path: &Path) { + assert!(path.exists(), "missing fixture: {}", path.display()); + } + + #[test] + fn test_golden_app_src() { + let test_file = PathBuf::from("testdata/erlang-otp-golden/lager.app.src"); + let expected_file = PathBuf::from("testdata/erlang-otp-golden/lager.app.src.expected"); + + assert_fixture_exists(&test_file); + assert_fixture_exists(&expected_file); + + let package_data = ErlangAppSrcParser::extract_first_package(&test_file); + + match compare_package_data_parser_only(&package_data, &expected_file) { + Ok(_) => (), + Err(e) 
=> panic!("Golden test failed for app.src: {}", e), + } + } + + #[test] + fn test_golden_rebar_config() { + let test_file = PathBuf::from("testdata/erlang-otp-golden/rebar.config"); + let expected_file = PathBuf::from("testdata/erlang-otp-golden/rebar.config.expected"); + + assert_fixture_exists(&test_file); + assert_fixture_exists(&expected_file); + + let package_data = RebarConfigParser::extract_first_package(&test_file); + + match compare_package_data_parser_only(&package_data, &expected_file) { + Ok(_) => (), + Err(e) => panic!("Golden test failed for rebar.config: {}", e), + } + } + + #[test] + fn test_golden_rebar_lock() { + let test_file = PathBuf::from("testdata/erlang-otp-golden/rebar.lock"); + let expected_file = PathBuf::from("testdata/erlang-otp-golden/rebar.lock.expected"); + + assert_fixture_exists(&test_file); + assert_fixture_exists(&expected_file); + + let package_data = RebarLockParser::extract_first_package(&test_file); + + match compare_package_data_parser_only(&package_data, &expected_file) { + Ok(_) => (), + Err(e) => panic!("Golden test failed for rebar.lock: {}", e), + } + } +} diff --git a/src/parsers/erlang_otp_test.rs b/src/parsers/erlang_otp_test.rs new file mode 100644 index 000000000..a26cedf69 --- /dev/null +++ b/src/parsers/erlang_otp_test.rs @@ -0,0 +1,366 @@ +// SPDX-FileCopyrightText: Provenant contributors +// SPDX-License-Identifier: Apache-2.0 + +#[cfg(test)] +mod tests { + use std::fs; + use std::path::PathBuf; + + use tempfile::TempDir; + + use super::super::PackageParser; + use super::super::erlang_otp::{ErlangAppSrcParser, RebarConfigParser, RebarLockParser}; + use super::super::try_parse_file; + use crate::models::{DatasourceId, PackageType}; + + // ── is_match ── + + #[test] + fn test_app_src_is_match() { + assert!(ErlangAppSrcParser::is_match(&PathBuf::from( + "src/myapp.app.src" + ))); + assert!(ErlangAppSrcParser::is_match(&PathBuf::from( + "apps/web/src/web.app.src" + ))); + 
assert!(!ErlangAppSrcParser::is_match(&PathBuf::from( + "src/myapp.erl" + ))); + assert!(!ErlangAppSrcParser::is_match(&PathBuf::from( + "src/myapp.app" + ))); + } + + #[test] + fn test_rebar_config_is_match() { + assert!(RebarConfigParser::is_match(&PathBuf::from("rebar.config"))); + assert!(RebarConfigParser::is_match(&PathBuf::from( + "apps/web/rebar.config" + ))); + assert!(!RebarConfigParser::is_match(&PathBuf::from( + "rebar.config.script" + ))); + } + + #[test] + fn test_rebar_lock_is_match() { + assert!(RebarLockParser::is_match(&PathBuf::from("rebar.lock"))); + assert!(!RebarLockParser::is_match(&PathBuf::from("rebar.config"))); + } + + // ── app.src parsing ── + + #[test] + fn test_parse_app_src_fixture() { + let package = ErlangAppSrcParser::extract_first_package(&PathBuf::from( + "testdata/erlang-otp/app-src/lager.app.src", + )); + + assert_eq!(package.package_type, Some(PackageType::Hex)); + assert_eq!(package.datasource_id, Some(DatasourceId::ErlangOtpAppSrc)); + assert_eq!(package.name.as_deref(), Some("lager")); + assert_eq!(package.version.as_deref(), Some("3.9.2")); + assert_eq!( + package.description.as_deref(), + Some("Erlang logging framework") + ); + assert_eq!( + package.extracted_license_statement.as_deref(), + Some("Apache 2") + ); + assert_eq!( + package.vcs_url.as_deref(), + Some("https://github.com/erlang-lager/lager") + ); + + // goldrush should be a dependency, kernel/stdlib should be excluded + assert_eq!(package.dependencies.len(), 1); + assert!( + package.dependencies[0] + .purl + .as_deref() + .unwrap() + .contains("goldrush") + ); + } + + #[test] + fn test_parse_app_src_with_multiple_deps() { + let package = ErlangAppSrcParser::extract_first_package(&PathBuf::from( + "testdata/erlang-otp/app-src/fast_xml.app.src", + )); + + assert_eq!(package.name.as_deref(), Some("fast_xml")); + assert_eq!(package.version.as_deref(), Some("1.1.60")); + assert_eq!( + package.description.as_deref(), + Some("Fast Expat-based Erlang / Elixir XML 
parsing library") + ); + assert_eq!( + package.extracted_license_statement.as_deref(), + Some("Apache 2.0") + ); + + // p1_utils should be a dependency, kernel/stdlib should be excluded + assert_eq!(package.dependencies.len(), 1); + assert!( + package.dependencies[0] + .purl + .as_deref() + .unwrap() + .contains("p1_utils") + ); + } + + #[test] + fn test_parse_app_src_template_version_skipped() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("myapp.app.src"); + fs::write( + &path, + r#"{application, myapp, [{vsn, "%VSN%"}, {description, "test"}]}."#, + ) + .expect("write"); + + let package = ErlangAppSrcParser::extract_first_package(&path); + assert_eq!(package.name.as_deref(), Some("myapp")); + assert!(package.version.is_none()); + } + + #[test] + fn test_parse_app_src_runtime_dependencies() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("stdlib.app.src"); + fs::write( + &path, + r#"{application, stdlib, [ + {vsn, "5.0"}, + {runtime_dependencies, ["sasl-3.0","kernel-9.0","crypto-4.5"]} + ]}."#, + ) + .expect("write"); + + let package = ErlangAppSrcParser::extract_first_package(&path); + assert_eq!(package.name.as_deref(), Some("stdlib")); + assert_eq!(package.version.as_deref(), Some("5.0")); + // sasl, kernel, crypto are all OTP stdlib — should be filtered + assert!(package.dependencies.is_empty()); + } + + #[test] + fn test_parse_app_src_with_non_stdlib_runtime_deps() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("myapp.app.src"); + fs::write( + &path, + r#"{application, myapp, [ + {vsn, "1.0.0"}, + {runtime_dependencies, ["cowboy-2.10.0","ranch-2.1.0"]} + ]}."#, + ) + .expect("write"); + + let package = ErlangAppSrcParser::extract_first_package(&path); + assert_eq!(package.dependencies.len(), 2); + assert_eq!( + package.dependencies[0].extracted_requirement.as_deref(), + Some("2.10.0") + ); + assert!( + package.dependencies[0] + .purl + 
.as_deref() + .unwrap() + .contains("cowboy") + ); + } + + #[test] + fn test_parse_app_src_malformed_returns_fallback() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("bad.app.src"); + fs::write(&path, "not valid erlang at all!!!").expect("write"); + + let package = ErlangAppSrcParser::extract_first_package(&path); + assert_eq!(package.package_type, Some(PackageType::Hex)); + assert_eq!(package.datasource_id, Some(DatasourceId::ErlangOtpAppSrc)); + assert!(package.name.is_none()); + } + + // ── rebar.config parsing ── + + #[test] + fn test_parse_rebar_config_fixture() { + let package = RebarConfigParser::extract_first_package(&PathBuf::from( + "testdata/erlang-otp/rebar-config/rebar.config", + )); + + assert_eq!(package.package_type, Some(PackageType::Hex)); + assert_eq!(package.datasource_id, Some(DatasourceId::RebarConfig)); + + // 3 main deps + 1 test profile dep + assert_eq!(package.dependencies.len(), 4); + + let cowboy = &package.dependencies[0]; + assert!(cowboy.purl.as_deref().unwrap().contains("cowboy")); + assert_eq!(cowboy.extracted_requirement.as_deref(), Some("2.10.0")); + assert_eq!(cowboy.scope.as_deref(), Some("dependencies")); + + let jiffy = &package.dependencies[1]; + assert!(jiffy.purl.as_deref().unwrap().contains("jiffy")); + assert_eq!(jiffy.extracted_requirement.as_deref(), Some("1.1.1")); + assert!( + jiffy + .extra_data + .as_ref() + .unwrap() + .get("vcs_url") + .unwrap() + .as_str() + .unwrap() + .contains("jiffy") + ); + + let proper = &package.dependencies[3]; + assert!(proper.purl.as_deref().unwrap().contains("proper")); + assert_eq!(proper.scope.as_deref(), Some("test")); + } + + #[test] + fn test_parse_rebar_config_git_only_dep() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.config"); + fs::write( + &path, + r#"{deps, [{lager, {git, "https://github.com/erlang-lager/lager.git", {branch, "master"}}}]}."#, + ) + .expect("write"); + + let package = 
RebarConfigParser::extract_first_package(&path); + assert_eq!(package.dependencies.len(), 1); + let dep = &package.dependencies[0]; + assert!(dep.purl.as_deref().unwrap().contains("lager")); + // branch deps don't get a version + assert!(dep.extracted_requirement.is_none()); + } + + #[test] + fn test_parse_rebar_config_empty_deps() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.config"); + fs::write(&path, "{deps, []}.\n{erl_opts, [debug_info]}.\n").expect("write"); + + let package = RebarConfigParser::extract_first_package(&path); + assert_eq!(package.datasource_id, Some(DatasourceId::RebarConfig)); + assert!(package.dependencies.is_empty()); + } + + #[test] + fn test_parse_rebar_config_malformed_returns_fallback() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.config"); + fs::write(&path, "}}}}garbage").expect("write"); + + let package = RebarConfigParser::extract_first_package(&path); + assert_eq!(package.datasource_id, Some(DatasourceId::RebarConfig)); + } + + // ── rebar.lock parsing ── + + #[test] + fn test_parse_rebar_lock_fixture() { + let package = RebarLockParser::extract_first_package(&PathBuf::from( + "testdata/erlang-otp/rebar-lock/rebar.lock", + )); + + assert_eq!(package.package_type, Some(PackageType::Hex)); + assert_eq!(package.datasource_id, Some(DatasourceId::RebarLock)); + + // 4 dependencies: cowboy, cowlib, ranch (pkg), jiffy (git) + assert_eq!(package.dependencies.len(), 4); + + let cowboy = &package.dependencies[0]; + assert!(cowboy.purl.as_deref().unwrap().contains("cowboy")); + assert_eq!(cowboy.extracted_requirement.as_deref(), Some("2.10.0")); + assert_eq!(cowboy.is_pinned, Some(true)); + assert!(cowboy.resolved_package.is_some()); + + let jiffy = &package.dependencies[3]; + assert!(jiffy.purl.as_deref().unwrap().contains("jiffy")); + // git ref dep gets the ref as version + assert_eq!(jiffy.extracted_requirement.as_deref(), 
Some("abc123def456")); + assert!( + jiffy + .extra_data + .as_ref() + .unwrap() + .get("vcs_url") + .unwrap() + .as_str() + .unwrap() + .contains("jiffy") + ); + } + + #[test] + fn test_parse_rebar_lock_with_hashes() { + let package = RebarLockParser::extract_first_package(&PathBuf::from( + "testdata/erlang-otp/rebar-lock/rebar.lock", + )); + + // cowboy has a pkg_hash entry + let cowboy = &package.dependencies[0]; + let resolved = cowboy.resolved_package.as_ref().unwrap(); + assert!(resolved.sha256.is_some()); + } + + #[test] + fn test_parse_rebar_lock_malformed_returns_fallback() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.lock"); + fs::write(&path, "not valid erlang lock").expect("write"); + + let package = RebarLockParser::extract_first_package(&path); + assert_eq!(package.datasource_id, Some(DatasourceId::RebarLock)); + } + + // ── Scanner dispatch ── + + #[test] + fn test_dispatch_app_src() { + let result = try_parse_file(&PathBuf::from("testdata/erlang-otp/app-src/lager.app.src")) + .expect("should be claimed by parser dispatch"); + assert!(result.scan_errors.is_empty()); + assert_eq!(result.packages.len(), 1); + assert_eq!( + result.packages[0].datasource_id, + Some(DatasourceId::ErlangOtpAppSrc) + ); + } + + #[test] + fn test_dispatch_rebar_config() { + let result = try_parse_file(&PathBuf::from( + "testdata/erlang-otp/rebar-config/rebar.config", + )) + .expect("should be claimed by parser dispatch"); + assert!(result.scan_errors.is_empty()); + assert_eq!(result.packages.len(), 1); + assert_eq!( + result.packages[0].datasource_id, + Some(DatasourceId::RebarConfig) + ); + } + + #[test] + fn test_dispatch_rebar_lock() { + let result = try_parse_file(&PathBuf::from("testdata/erlang-otp/rebar-lock/rebar.lock")) + .expect("should be claimed by parser dispatch"); + assert!(result.scan_errors.is_empty()); + assert_eq!(result.packages.len(), 1); + assert_eq!( + result.packages[0].datasource_id, + 
Some(DatasourceId::RebarLock) + ); + } +} diff --git a/src/parsers/golden_test.rs b/src/parsers/golden_test.rs index 2726109fb..c4fec4b21 100644 --- a/src/parsers/golden_test.rs +++ b/src/parsers/golden_test.rs @@ -57,6 +57,8 @@ mod debian_golden_test; mod deno_golden_test; #[path = "docker_golden_test.rs"] mod docker_golden_test; +#[path = "erlang_otp_golden_test.rs"] +mod erlang_otp_golden_test; #[path = "freebsd_golden_test.rs"] mod freebsd_golden_test; #[path = "gitmodules_golden_test.rs"] diff --git a/src/parsers/mod.rs b/src/parsers/mod.rs index 74be19aae..e73246ad2 100644 --- a/src/parsers/mod.rs +++ b/src/parsers/mod.rs @@ -127,6 +127,9 @@ mod docker; mod docker_scan_test; #[cfg(test)] mod docker_test; +mod erlang_otp; +#[cfg(test)] +mod erlang_otp_test; mod freebsd; #[cfg(test)] mod freebsd_scan_test; @@ -585,6 +588,7 @@ pub use self::debian::{ pub use self::deno::DenoParser; pub use self::deno_lock::DenoLockParser; pub use self::docker::DockerfileParser; +pub use self::erlang_otp::{ErlangAppSrcParser, RebarConfigParser, RebarLockParser}; pub use self::freebsd::FreebsdCompactManifestParser; pub use self::gitmodules::GitmodulesParser; pub use self::go::{GoModParser, GoSumParser, GoWorkParser, GodepsParser}; @@ -859,6 +863,9 @@ register_package_handlers! { DenoParser, DenoLockParser, DockerfileParser, + ErlangAppSrcParser, + RebarConfigParser, + RebarLockParser, FreebsdCompactManifestParser, GemArchiveParser, GemfileLockParser, diff --git a/testdata/erlang-otp-golden/lager.app.src b/testdata/erlang-otp-golden/lager.app.src new file mode 100644 index 000000000..2a37df883 --- /dev/null +++ b/testdata/erlang-otp-golden/lager.app.src @@ -0,0 +1,9 @@ +{application, lager, + [ + {description, "Erlang logging framework"}, + {vsn, "3.9.2"}, + {modules, []}, + {applications, [kernel, stdlib, goldrush]}, + {licenses, ["Apache 2"]}, + {links, [{"Github", "https://github.com/erlang-lager/lager"}]} + ]}. 
diff --git a/testdata/erlang-otp-golden/lager.app.src.expected b/testdata/erlang-otp-golden/lager.app.src.expected new file mode 100644 index 000000000..cd65c33ce --- /dev/null +++ b/testdata/erlang-otp-golden/lager.app.src.expected @@ -0,0 +1,58 @@ +[ + { + "type": "hex", + "namespace": null, + "name": "lager", + "version": "3.9.2", + "qualifiers": null, + "subpath": null, + "primary_language": "Erlang", + "description": "Erlang logging framework", + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": "https://github.com/erlang-lager/lager", + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": null, + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": "https://github.com/erlang-lager/lager", + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": "Apache 2", + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": false, + "extra_data": null, + "dependencies": [ + { + "purl": "pkg:hex/goldrush", + "extracted_requirement": null, + "scope": "dependencies", + "is_runtime": true, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": null + } + ], + "repository_homepage_url": "https://hex.pm/packages/lager", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/lager", + "datasource_id": "erlang_otp_app_src", + "purl": "pkg:hex/lager@3.9.2" + } +] diff --git a/testdata/erlang-otp-golden/rebar.config b/testdata/erlang-otp-golden/rebar.config new file mode 100644 index 000000000..07880540d --- /dev/null +++ b/testdata/erlang-otp-golden/rebar.config @@ -0,0 +1,5 @@ +{deps, [ + {cowboy, "2.10.0"}, + {jiffy, {git, 
"https://github.com/davisp/jiffy.git", {tag, "1.1.1"}}}, + {goldrush, "0.1.9"} +]}. diff --git a/testdata/erlang-otp-golden/rebar.config.expected b/testdata/erlang-otp-golden/rebar.config.expected new file mode 100644 index 000000000..172dd272e --- /dev/null +++ b/testdata/erlang-otp-golden/rebar.config.expected @@ -0,0 +1,82 @@ +[ + { + "type": "hex", + "namespace": null, + "name": null, + "version": null, + "qualifiers": null, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": null, + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": false, + "extra_data": null, + "dependencies": [ + { + "purl": "pkg:hex/cowboy@2.10.0", + "extracted_requirement": "2.10.0", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": null + }, + { + "purl": "pkg:hex/jiffy@1.1.1", + "extracted_requirement": "1.1.1", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": { + "vcs_url": "https://github.com/davisp/jiffy.git" + } + }, + { + "purl": "pkg:hex/goldrush@0.1.9", + "extracted_requirement": "0.1.9", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + 
"extra_data": null + } + ], + "repository_homepage_url": null, + "repository_download_url": null, + "api_data_url": null, + "datasource_id": "rebar_config", + "purl": null + } +] diff --git a/testdata/erlang-otp-golden/rebar.lock b/testdata/erlang-otp-golden/rebar.lock new file mode 100644 index 000000000..98bafd778 --- /dev/null +++ b/testdata/erlang-otp-golden/rebar.lock @@ -0,0 +1,8 @@ +{"1.2.0", +[{<<"cowboy">>,{pkg,<<"cowboy">>,<<"2.10.0">>},0}, + {<<"ranch">>,{pkg,<<"ranch">>,<<"2.1.0">>},1}]}. +[ +{pkg_hash,[ + {<<"cowboy">>, <<"AA68E5ECABE53F3B27CEEEFFE972E5BDBA31AE59">>}, + {<<"ranch">>, <<"8306D225F3A4BE20A9F3F3E5F3C8B234A2AE7AED">>}]} +]. diff --git a/testdata/erlang-otp-golden/rebar.lock.expected b/testdata/erlang-otp-golden/rebar.lock.expected new file mode 100644 index 000000000..49a3f2e63 --- /dev/null +++ b/testdata/erlang-otp-golden/rebar.lock.expected @@ -0,0 +1,155 @@ +[ + { + "type": "hex", + "namespace": null, + "name": null, + "version": null, + "qualifiers": null, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": null, + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": false, + "extra_data": null, + "dependencies": [ + { + "purl": "pkg:hex/cowboy@2.10.0", + "extracted_requirement": "2.10.0", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": true, + "is_direct": null, + 
"resolved_package": { + "type": "hex", + "namespace": "", + "name": "cowboy", + "version": "2.10.0", + "qualifiers": null, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": null, + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": true, + "extra_data": null, + "dependencies": [], + "repository_homepage_url": "https://hex.pm/packages/cowboy", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/cowboy", + "datasource_id": "rebar_lock", + "purl": "pkg:hex/cowboy@2.10.0" + }, + "extra_data": null + }, + { + "purl": "pkg:hex/ranch@2.1.0", + "extracted_requirement": "2.1.0", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": true, + "is_direct": null, + "resolved_package": { + "type": "hex", + "namespace": "", + "name": "ranch", + "version": "2.1.0", + "qualifiers": null, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": null, + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": 
null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": true, + "extra_data": null, + "dependencies": [], + "repository_homepage_url": "https://hex.pm/packages/ranch", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/ranch", + "datasource_id": "rebar_lock", + "purl": "pkg:hex/ranch@2.1.0" + }, + "extra_data": null + } + ], + "repository_homepage_url": null, + "repository_download_url": null, + "api_data_url": null, + "datasource_id": "rebar_lock", + "purl": null + } +] diff --git a/testdata/erlang-otp/app-src/fast_xml.app.src b/testdata/erlang-otp/app-src/fast_xml.app.src new file mode 100644 index 000000000..9671ef5e1 --- /dev/null +++ b/testdata/erlang-otp/app-src/fast_xml.app.src @@ -0,0 +1,10 @@ +{application, fast_xml, + [{description, "Fast Expat-based Erlang / Elixir XML parsing library"}, + {vsn, "1.1.60"}, + {modules, []}, + {registered, []}, + {applications, [kernel, stdlib, p1_utils]}, + {mod, {fast_xml,[]}}, + {files, ["include/", "lib/", "src/", "c_src/fxml.c", "rebar.config"]}, + {licenses, ["Apache 2.0"]}, + {links, [{"Github", "https://github.com/processone/fast_xml"}]}]}. 
diff --git a/testdata/erlang-otp/app-src/lager.app.src b/testdata/erlang-otp/app-src/lager.app.src new file mode 100644 index 000000000..7d0acc72c --- /dev/null +++ b/testdata/erlang-otp/app-src/lager.app.src @@ -0,0 +1,18 @@ +%% -*- tab-width: 4;erlang-indent-level: 4;indent-tabs-mode: nil -*- +%% ex: ts=4 sw=4 et +{application, lager, + [ + {description, "Erlang logging framework"}, + {vsn, "3.9.2"}, + {modules, []}, + {applications, [ + kernel, + stdlib, + goldrush + ]}, + {registered, [lager_sup, lager_event, lager_crash_log, lager_handler_watcher_sup]}, + {mod, {lager_app, []}}, + {env, []}, + {licenses, ["Apache 2"]}, + {links, [{"Github", "https://github.com/erlang-lager/lager"}]} + ]}. diff --git a/testdata/erlang-otp/rebar-config/rebar.config b/testdata/erlang-otp/rebar-config/rebar.config new file mode 100644 index 000000000..404160e40 --- /dev/null +++ b/testdata/erlang-otp/rebar-config/rebar.config @@ -0,0 +1,16 @@ +%% Example rebar.config +{erl_opts, [debug_info, warn_export_vars]}. + +{deps, [ + {cowboy, "2.10.0"}, + {jiffy, {git, "https://github.com/davisp/jiffy.git", {tag, "1.1.1"}}}, + {goldrush, "0.1.9"} +]}. + +{profiles, [ + {test, [ + {deps, [ + {proper, "1.4.0"} + ]} + ]} +]}. diff --git a/testdata/erlang-otp/rebar-lock/rebar.lock b/testdata/erlang-otp/rebar-lock/rebar.lock new file mode 100644 index 000000000..3108632df --- /dev/null +++ b/testdata/erlang-otp/rebar-lock/rebar.lock @@ -0,0 +1,18 @@ +{"1.2.0", +[{<<"cowboy">>,{pkg,<<"cowboy">>,<<"2.10.0">>},0}, + {<<"cowlib">>,{pkg,<<"cowlib">>,<<"2.12.1">>},1}, + {<<"ranch">>,{pkg,<<"ranch">>,<<"2.1.0">>},1}, + {<<"jiffy">>, + {git,"https://github.com/davisp/jiffy.git", + {ref,"abc123def456"}}, + 0}]}. 
+[ +{pkg_hash,[ + {<<"cowboy">>, <<"AA68E5ECABE53F3B27CEEEFFE972E5BDBA31AE5900000000000000000000ABCD">>}, + {<<"cowlib">>, <<"2B3E9DA0B21C4565751A6D4901C20D1B4CC25CBB00000000000000000000ABCD">>}, + {<<"ranch">>, <<"8306D225F3A4BE20A9F3F3E5F3C8B234A2AE7AED00000000000000000000ABCD">>}]}, +{pkg_hash_ext,[ + {<<"cowboy">>, <<"E64658D0B465CA0C576E1AEF51A2BCE78C8B60A200000000000000000000ABCD">>}, + {<<"cowlib">>, <<"DB768DB88BEE444B62DA81B2160D9F3CD1AADC7000000000000000000000ABCD">>}, + {<<"ranch">>, <<"244EE3FA2A6175270D8E1E30E41B3313C56A8D8700000000000000000000ABCD">>}]} +]. From 83550a532ab1d870c164e2f40bfc3df0f7f5680b Mon Sep 17 00:00:00 2001 From: Maxim Stykow Date: Wed, 22 Apr 2026 18:49:39 +0200 Subject: [PATCH 2/6] fix(parser): handle Erlang OTP map and alias metadata Keep map-bearing OTP metadata from falling back and preserve real Hex package identity for aliased rebar dependencies. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus Signed-off-by: Maxim Stykow --- src/parsers/erlang_otp.rs | 231 ++++++++++++++++++++++++--------- src/parsers/erlang_otp_test.rs | 109 ++++++++++++++++ 2 files changed, 282 insertions(+), 58 deletions(-) diff --git a/src/parsers/erlang_otp.rs b/src/parsers/erlang_otp.rs index 60277c453..2043bab70 100644 --- a/src/parsers/erlang_otp.rs +++ b/src/parsers/erlang_otp.rs @@ -34,6 +34,7 @@ enum ErlTerm { Float(f64), Tuple(Vec<ErlTerm>), List(Vec<ErlTerm>), + Map(Vec<(ErlTerm, ErlTerm)>), } // ── Erlang term parser ── @@ -68,6 +69,7 @@ impl ErlParser { match self.peek() { Some('{') => self.parse_tuple(), Some('[') => self.parse_list(), + Some('#') if self.peek_n(1) == Some('{') => self.parse_map(), Some('"') => self.parse_string().map(ErlTerm::String), Some('<') if self.peek_n(1) == Some('<') => self.parse_binary().map(ErlTerm::Binary), Some('\'') => self.parse_quoted_atom().map(ErlTerm::Atom), @@ -93,6 +95,65 @@ impl ErlParser { Ok(ErlTerm::List(items)) } + fn parse_map(&mut self) -> Result<ErlTerm, String> { +
self.expect('#')?; + self.expect('{')?; + + let mut entries = Vec::new(); + let mut count = 0usize; + + loop { + self.skip_whitespace_and_comments(); + if self.peek() == Some('}') { + self.pos += 1; + break; + } + + if count >= MAX_ITERATION_COUNT { + return Err("too many map entries".to_string()); + } + + let key = self.parse_term()?; + self.skip_whitespace_and_comments(); + + match (self.peek(), self.peek_n(1)) { + (Some('='), Some('>')) | (Some(':'), Some('=')) => { + self.pos += 2; + } + _ => { + return Err(format!( + "Expected map association operator at position {}", + self.pos + )); + } + } + + let value = self.parse_term()?; + entries.push((key, value)); + count += 1; + + self.skip_whitespace_and_comments(); + match self.peek() { + Some(',') => { + self.pos += 1; + } + Some('}') => { + self.pos += 1; + break; + } + Some(c) => { + return Err(format!( + "Expected ',' or '}}' in map but found '{}' at position {}", + c, self.pos + )); + } + None => return Err("Unterminated map literal".to_string()), + } + } + + Ok(ErlTerm::Map(entries)) + } + fn parse_comma_separated(&mut self, closing: char) -> Result<Vec<ErlTerm>, String> { let mut items = Vec::new(); let mut count = 0usize; @@ -340,6 +401,18 @@ fn term_to_proplist(term: &ErlTerm) -> Option<Vec<(String, ErlTerm)>> { Some(result) } +fn term_to_key_value_pairs(term: &ErlTerm) -> Option<Vec<(String, ErlTerm)>> { + match term { + ErlTerm::Map(entries) => Some( + entries + .iter() + .filter_map(|(key, value)| term_to_str(key).map(|key| (key, value.clone()))) + .collect(), + ), + _ => term_to_proplist(term), + } +} + fn term_to_atom_list(term: &ErlTerm) -> Vec<String> { match term { ErlTerm::List(items) => items.iter().filter_map(term_to_str).collect(), @@ -447,7 +520,7 @@ fn parse_app_src(content: &str) -> Result { } } "links" => { - if let Some(link_props) = term_to_proplist(value) { + if let Some(link_props) = term_to_key_value_pairs(value) { for (link_name, link_val) in &link_props { if let Some(url) = term_to_str(link_val) { let lower = link_name.to_lowercase(); @@ -674,10
+747,10 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { if let Some(key) = term_to_str(&fields[0]) && key.starts_with("if_") { - return fields.last().and_then(parse_rebar_dep); + return None; } - let name = term_to_str(&fields[0])?; + let app_name = term_to_str(&fields[0])?; match fields.len() { // {Name, Version} or {Name, {git, URL, Ref}} @@ -685,7 +758,7 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { if let Some(version) = term_to_str(&fields[1]) { // {Name, Version} Some(Dependency { - purl: build_hex_purl(&name, Some(&version)).map(truncate_field), + purl: build_hex_purl(&app_name, Some(&version)).map(truncate_field), extracted_requirement: Some(truncate_field(version)), scope: Some("dependencies".to_string()), is_runtime: None, @@ -696,17 +769,11 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { extra_data: None, }) } else { - // {Name, {git, URL, Ref}} + let package_name = extract_rebar_package_name(&fields[1], &app_name); let vcs_url = extract_git_url(&fields[1]); let version = extract_git_version(&fields[1]); - let git_extra = vcs_url.map(|url| { - HashMap::from([( - "vcs_url".to_string(), - JsonValue::String(truncate_field(url)), - )]) - }); Some(Dependency { - purl: build_hex_purl(&name, version.as_deref()).map(truncate_field), + purl: build_hex_purl(&package_name, version.as_deref()).map(truncate_field), extracted_requirement: version.map(truncate_field), scope: Some("dependencies".to_string()), is_runtime: None, @@ -714,22 +781,21 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { is_pinned: None, is_direct: None, resolved_package: None, - extra_data: git_extra, + extra_data: build_rebar_dependency_extra_data( + vcs_url, + app_name.as_str(), + package_name.as_str(), + ), }) } } // {Name, Version, Source} 3 => { if let Some(version) = term_to_str(&fields[1]) { - // {Name, Version, {git, URL, Ref}} - let git_extra = extract_git_url(&fields[2]).map(|vcs_url| { - HashMap::from([( - "vcs_url".to_string(), - JsonValue::String(truncate_field(vcs_url)), -
)]) - }); + let package_name = extract_rebar_package_name(&fields[2], &app_name); + let vcs_url = extract_git_url(&fields[2]); Some(Dependency { - purl: build_hex_purl(&name, Some(&version)).map(truncate_field), + purl: build_hex_purl(&package_name, Some(&version)).map(truncate_field), extracted_requirement: Some(truncate_field(version)), scope: Some("dependencies".to_string()), is_runtime: None, @@ -737,20 +803,18 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { is_pinned: None, is_direct: None, resolved_package: None, - extra_data: git_extra, + extra_data: build_rebar_dependency_extra_data( + vcs_url, + app_name.as_str(), + package_name.as_str(), + ), }) } else { - // {Name, {git, URL, Ref}} + let package_name = extract_rebar_package_name(&fields[1], &app_name); let vcs_url = extract_git_url(&fields[1]); let version = extract_git_version(&fields[1]); - let git_extra = vcs_url.map(|url| { - HashMap::from([( - "vcs_url".to_string(), - JsonValue::String(truncate_field(url)), - )]) - }); Some(Dependency { - purl: build_hex_purl(&name, version.as_deref()).map(truncate_field), + purl: build_hex_purl(&package_name, version.as_deref()).map(truncate_field), extracted_requirement: version.map(truncate_field), scope: Some("dependencies".to_string()), is_runtime: None, @@ -758,7 +822,11 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { is_pinned: None, is_direct: None, resolved_package: None, - extra_data: git_extra, + extra_data: build_rebar_dependency_extra_data( + vcs_url, + app_name.as_str(), + package_name.as_str(), + ), }) } } @@ -766,10 +834,53 @@ fn parse_rebar_dep(term: &ErlTerm) -> Option<Dependency> { } } +fn extract_rebar_package_name(term: &ErlTerm, fallback_name: &str) -> String { + if let ErlTerm::Tuple(fields) = term + && fields.len() >= 2 + && term_to_str(&fields[0]).as_deref() == Some("pkg") + && let Some(package_name) = term_to_str(&fields[1]) + { + package_name + } else { + fallback_name.to_string() + } +} + +fn build_rebar_dependency_extra_data( + vcs_url: Option<String>,
+ app_name: &str, + package_name: &str, +) -> Option<HashMap<String, JsonValue>> { + let mut extra_data = HashMap::new(); + + if let Some(url) = vcs_url { + extra_data.insert( + "vcs_url".to_string(), + JsonValue::String(truncate_field(url)), + ); + } + + if app_name != package_name { + extra_data.insert( + "app_name".to_string(), + JsonValue::String(truncate_field(app_name.to_string())), + ); + } + + if extra_data.is_empty() { + None + } else { + Some(extra_data) + } +} + fn extract_git_url(term: &ErlTerm) -> Option<String> { if let ErlTerm::Tuple(fields) = term && fields.len() >= 2 - && term_to_str(&fields[0]).as_deref() == Some("git") + && matches!( + term_to_str(&fields[0]).as_deref(), + Some("git") | Some("git_subdir") + ) { term_to_str(&fields[1]) } else { @@ -780,7 +891,10 @@ fn extract_git_url(term: &ErlTerm) -> Option<String> { fn extract_git_version(term: &ErlTerm) -> Option<String> { if let ErlTerm::Tuple(fields) = term && fields.len() >= 3 - && term_to_str(&fields[0]).as_deref() == Some("git") + && matches!( + term_to_str(&fields[0]).as_deref(), + Some("git") | Some("git_subdir") + ) { if let ErlTerm::Tuple(ref_fields) = &fields[2] && ref_fields.len() == 2 @@ -910,20 +1024,25 @@ fn parse_lock_dep(term: &ErlTerm, hashes: &HashMap) -> Option return None, }; - let name = term_to_str(&fields[0])?; + let app_name = term_to_str(&fields[0])?; // fields[2] is the level (integer) - let (version, vcs_url) = match &fields[1] { + let (package_name, version, vcs_url) = match &fields[1] { // {pkg, <<"name">>, <<"version">>} ErlTerm::Tuple(pkg_fields) if pkg_fields.len() >= 3 && term_to_str(&pkg_fields[0]).as_deref() == Some("pkg") => { + let package_name = term_to_str(&pkg_fields[1]).unwrap_or_else(|| app_name.clone()); let ver = term_to_str(&pkg_fields[2]); - (ver, None) + (package_name, ver, None) } // {git, "url", {ref, "hash"}} ErlTerm::Tuple(git_fields) - if git_fields.len() >= 2 && term_to_str(&git_fields[0]).as_deref() == Some("git") => + if git_fields.len() >= 2 + && matches!( +
term_to_str(&git_fields[0]).as_deref(), + Some("git") | Some("git_subdir") + ) => { let url = term_to_str(&git_fields[1]); let ver = if git_fields.len() >= 3 { @@ -931,13 +1050,14 @@ fn parse_lock_dep(term: &ErlTerm, hashes: &HashMap) -> Option (None, None), + _ => (app_name.clone(), None, None), }; let sha256 = hashes - .get(&name) + .get(&app_name) + .or_else(|| hashes.get(&package_name)) .and_then(|h| Sha256Digest::from_hex(h).ok()); let resolved_package = ResolvedPackage { @@ -945,30 +1065,25 @@ fn parse_lock_dep(term: &ErlTerm, hashes: &HashMap) -> Option) -> Option <<"mqtt">>, retries => 3, nested => #{}}}, + {links, #{"Docs" => "https://example.com/docs", "Github" => "https://github.com/example/myapp"}} + ]}."#, + ) + .expect("write"); + + let package = ErlangAppSrcParser::extract_first_package(&path); + assert_eq!(package.name.as_deref(), Some("myapp")); + assert_eq!(package.version.as_deref(), Some("1.2.3")); + assert_eq!(package.description.as_deref(), Some("Map-aware app")); + assert_eq!( + package.homepage_url.as_deref(), + Some("https://example.com/docs") + ); + assert_eq!( + package.vcs_url.as_deref(), + Some("https://github.com/example/myapp") + ); + } + #[test] fn test_parse_app_src_with_non_stdlib_runtime_deps() { let temp_dir = TempDir::new().expect("temp dir"); @@ -255,6 +284,54 @@ mod tests { assert!(package.dependencies.is_empty()); } + #[test] + fn test_parse_rebar_config_skips_conditional_wrappers_instead_of_guessing() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.config"); + fs::write( + &path, + r#"{deps, [{if_var_true, coverage, {proper, "1.4.0"}}, {cowboy, "2.10.0"}]}. 
+"#, + ) + .expect("write"); + + let package = RebarConfigParser::extract_first_package(&path); + assert_eq!(package.dependencies.len(), 1); + assert_eq!( + package.dependencies[0].purl.as_deref(), + Some("pkg:hex/cowboy@2.10.0") + ); + } + + #[test] + fn test_parse_rebar_config_pkg_alias_dep() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.config"); + fs::write( + &path, + r#"{deps, [{uuid, "1.2.0", {pkg, uuid_erl}}, {cowboy_alias, {pkg, cowboy}}]}."#, + ) + .expect("write"); + + let package = RebarConfigParser::extract_first_package(&path); + assert_eq!(package.dependencies.len(), 2); + + let uuid = &package.dependencies[0]; + assert_eq!(uuid.purl.as_deref(), Some("pkg:hex/uuid_erl@1.2.0")); + assert_eq!(uuid.extracted_requirement.as_deref(), Some("1.2.0")); + assert_eq!( + uuid.extra_data + .as_ref() + .and_then(|extra| extra.get("app_name")) + .and_then(|value| value.as_str()), + Some("uuid") + ); + + let cowboy = &package.dependencies[1]; + assert_eq!(cowboy.purl.as_deref(), Some("pkg:hex/cowboy")); + assert!(cowboy.extracted_requirement.is_none()); + } + #[test] fn test_parse_rebar_config_malformed_returns_fallback() { let temp_dir = TempDir::new().expect("temp dir"); @@ -314,6 +391,38 @@ mod tests { assert!(resolved.sha256.is_some()); } + #[test] + fn test_parse_rebar_lock_pkg_alias_dep() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("rebar.lock"); + fs::write( + &path, + concat!( + "{\"1.2.0\", [{<<\"uuid\">>, {pkg, <<\"uuid_erl\">>, <<\"1.2.0\">>}, 0}]}.\n", + "[{pkg_hash, [{<<\"uuid\">>, <<\"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\">>}]}].\n" + ), + ) + .expect("write"); + + let package = RebarLockParser::extract_first_package(&path); + assert_eq!(package.dependencies.len(), 1); + + let dep = &package.dependencies[0]; + assert_eq!(dep.purl.as_deref(), Some("pkg:hex/uuid_erl@1.2.0")); + assert_eq!(dep.extracted_requirement.as_deref(), 
Some("1.2.0")); + assert_eq!( + dep.extra_data + .as_ref() + .and_then(|extra| extra.get("app_name")) + .and_then(|value| value.as_str()), + Some("uuid") + ); + + let resolved = dep.resolved_package.as_ref().expect("resolved package"); + assert_eq!(resolved.name, "uuid_erl"); + assert!(resolved.sha256.is_some()); + } + #[test] fn test_parse_rebar_lock_malformed_returns_fallback() { let temp_dir = TempDir::new().expect("temp dir"); From ad9c9bc8a3d02e74438121c611cedc6fcc915f1f Mon Sep 17 00:00:00 2001 From: Maxim Stykow Date: Wed, 22 Apr 2026 18:51:12 +0200 Subject: [PATCH 3/6] test(parser): cover Erlang OTP rebar scan and assembly Lock in the rebar.config plus rebar.lock contract so dependency hoisting and assembly output stay stable across refactors. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus Signed-off-by: Maxim Stykow --- src/assembly/assembly_golden_test.rs | 8 + src/parsers/erlang_otp_scan_test.rs | 63 ++++ src/parsers/mod.rs | 2 + .../erlang-otp-basic/expected.json | 310 ++++++++++++++++++ .../erlang-otp-basic/rebar.config | 16 + .../erlang-otp-basic/rebar.lock | 18 + 6 files changed, 417 insertions(+) create mode 100644 src/parsers/erlang_otp_scan_test.rs create mode 100644 testdata/assembly-golden/erlang-otp-basic/expected.json create mode 100644 testdata/assembly-golden/erlang-otp-basic/rebar.config create mode 100644 testdata/assembly-golden/erlang-otp-basic/rebar.lock diff --git a/src/assembly/assembly_golden_test.rs b/src/assembly/assembly_golden_test.rs index 70af61aa6..b11f9c484 100644 --- a/src/assembly/assembly_golden_test.rs +++ b/src/assembly/assembly_golden_test.rs @@ -587,6 +587,14 @@ mod tests { } } + #[test] + fn test_assembly_erlang_otp_basic() { + match run_assembly_golden_test("erlang-otp-basic") { + Ok(_) => (), + Err(e) => panic!("Assembly golden test failed for erlang-otp-basic: {}", e), + } + } + #[test] fn test_assembly_nuget_basic() { match 
run_assembly_golden_test("nuget-basic") { diff --git a/src/parsers/erlang_otp_scan_test.rs b/src/parsers/erlang_otp_scan_test.rs new file mode 100644 index 000000000..e06450a7b --- /dev/null +++ b/src/parsers/erlang_otp_scan_test.rs @@ -0,0 +1,63 @@ +// SPDX-FileCopyrightText: Provenant contributors +// SPDX-License-Identifier: Apache-2.0 + +#[cfg(test)] +mod tests { + use std::path::Path; + + use super::super::scan_test_utils::{assert_dependency_present, scan_and_assemble}; + use crate::models::DatasourceId; + + #[test] + fn test_erlang_otp_scan_hoists_rebar_manifest_and_lock_dependencies() { + let (files, result) = + scan_and_assemble(Path::new("testdata/assembly-golden/erlang-otp-basic")); + + assert!(result.packages.is_empty()); + assert_eq!(result.dependencies.len(), 8); + assert!( + result + .dependencies + .iter() + .all(|dependency| dependency.for_package_uid.is_none()) + ); + + assert_dependency_present( + &result.dependencies, + "pkg:hex/cowboy@2.10.0", + "rebar.config", + ); + assert_dependency_present(&result.dependencies, "pkg:hex/jiffy@1.1.1", "rebar.config"); + assert_dependency_present(&result.dependencies, "pkg:hex/proper@1.4.0", "rebar.config"); + assert_dependency_present(&result.dependencies, "pkg:hex/cowboy@2.10.0", "rebar.lock"); + assert_dependency_present(&result.dependencies, "pkg:hex/cowlib@2.12.1", "rebar.lock"); + assert_dependency_present( + &result.dependencies, + "pkg:hex/jiffy@abc123def456", + "rebar.lock", + ); + + let rebar_config = files + .iter() + .find(|file| file.path.ends_with("/rebar.config")) + .expect("rebar.config should be scanned"); + let rebar_lock = files + .iter() + .find(|file| file.path.ends_with("/rebar.lock")) + .expect("rebar.lock should be scanned"); + + assert!(rebar_config.for_packages.is_empty()); + assert!(rebar_lock.for_packages.is_empty()); + + assert!( + rebar_config.package_data.iter().any(|package_data| { + package_data.datasource_id == Some(DatasourceId::RebarConfig) + }) + ); + assert!( + 
rebar_lock.package_data.iter().any(|package_data| { + package_data.datasource_id == Some(DatasourceId::RebarLock) + }) + ); + } +} diff --git a/src/parsers/mod.rs b/src/parsers/mod.rs index e73246ad2..52fa78a81 100644 --- a/src/parsers/mod.rs +++ b/src/parsers/mod.rs @@ -129,6 +129,8 @@ mod docker_scan_test; mod docker_test; mod erlang_otp; #[cfg(test)] +mod erlang_otp_scan_test; +#[cfg(test)] mod erlang_otp_test; mod freebsd; #[cfg(test)] diff --git a/testdata/assembly-golden/erlang-otp-basic/expected.json b/testdata/assembly-golden/erlang-otp-basic/expected.json new file mode 100644 index 000000000..b88d70460 --- /dev/null +++ b/testdata/assembly-golden/erlang-otp-basic/expected.json @@ -0,0 +1,310 @@ +{ + "packages": [], + "dependencies": [ + { + "purl": "pkg:hex/cowboy@2.10.0", + "extracted_requirement": "2.10.0", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": {}, + "dependency_uid": "pkg:hex/cowboy@2.10.0?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.config", + "datasource_id": "rebar_config", + "namespace": null + }, + { + "purl": "pkg:hex/cowboy@2.10.0", + "extracted_requirement": "2.10.0", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": true, + "is_direct": null, + "resolved_package": { + "type": "hex", + "namespace": "", + "name": "cowboy", + "version": "2.10.0", + "qualifiers": {}, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": "aa68e5ecabe53f3b27ceeeffe972e5bdba31ae5900000000000000000000abcd", + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + 
"declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": true, + "extra_data": {}, + "dependencies": [], + "repository_homepage_url": "https://hex.pm/packages/cowboy", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/cowboy", + "datasource_id": "rebar_lock", + "purl": "pkg:hex/cowboy@2.10.0" + }, + "extra_data": {}, + "dependency_uid": "pkg:hex/cowboy@2.10.0?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.lock", + "datasource_id": "rebar_lock", + "namespace": null + }, + { + "purl": "pkg:hex/cowlib@2.12.1", + "extracted_requirement": "2.12.1", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": true, + "is_direct": null, + "resolved_package": { + "type": "hex", + "namespace": "", + "name": "cowlib", + "version": "2.12.1", + "qualifiers": {}, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": "2b3e9da0b21c4565751a6d4901c20d1b4cc25cbb00000000000000000000abcd", + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": true, + "extra_data": {}, + 
"dependencies": [], + "repository_homepage_url": "https://hex.pm/packages/cowlib", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/cowlib", + "datasource_id": "rebar_lock", + "purl": "pkg:hex/cowlib@2.12.1" + }, + "extra_data": {}, + "dependency_uid": "pkg:hex/cowlib@2.12.1?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.lock", + "datasource_id": "rebar_lock", + "namespace": null + }, + { + "purl": "pkg:hex/goldrush@0.1.9", + "extracted_requirement": "0.1.9", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": {}, + "dependency_uid": "pkg:hex/goldrush@0.1.9?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.config", + "datasource_id": "rebar_config", + "namespace": null + }, + { + "purl": "pkg:hex/jiffy@1.1.1", + "extracted_requirement": "1.1.1", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": { + "vcs_url": "https://github.com/davisp/jiffy.git" + }, + "dependency_uid": "pkg:hex/jiffy@1.1.1?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.config", + "datasource_id": "rebar_config", + "namespace": null + }, + { + "purl": "pkg:hex/jiffy@abc123def456", + "extracted_requirement": "abc123def456", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": true, + "is_direct": null, + "resolved_package": { + "type": "hex", + "namespace": "", + "name": "jiffy", + "version": "abc123def456", + "qualifiers": {}, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + "size": null, + "sha1": null, + "md5": null, + "sha256": null, 
+ "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": true, + "extra_data": {}, + "dependencies": [], + "repository_homepage_url": "https://hex.pm/packages/jiffy", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/jiffy", + "datasource_id": "rebar_lock", + "purl": "pkg:hex/jiffy@abc123def456" + }, + "extra_data": { + "vcs_url": "https://github.com/davisp/jiffy.git" + }, + "dependency_uid": "pkg:hex/jiffy@abc123def456?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.lock", + "datasource_id": "rebar_lock", + "namespace": null + }, + { + "purl": "pkg:hex/proper@1.4.0", + "extracted_requirement": "1.4.0", + "scope": "test", + "is_runtime": null, + "is_optional": null, + "is_pinned": null, + "is_direct": null, + "resolved_package": null, + "extra_data": {}, + "dependency_uid": "pkg:hex/proper@1.4.0?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.config", + "datasource_id": "rebar_config", + "namespace": null + }, + { + "purl": "pkg:hex/ranch@2.1.0", + "extracted_requirement": "2.1.0", + "scope": "dependencies", + "is_runtime": null, + "is_optional": null, + "is_pinned": true, + "is_direct": null, + "resolved_package": { + "type": "hex", + "namespace": "", + "name": "ranch", + "version": "2.1.0", + "qualifiers": {}, + "subpath": null, + "primary_language": "Erlang", + "description": null, + "release_date": null, + "parties": [], + "keywords": [], + "homepage_url": null, + "download_url": null, + 
"size": null, + "sha1": null, + "md5": null, + "sha256": "8306d225f3a4be20a9f3f3e5f3c8b234a2ae7aed00000000000000000000abcd", + "sha512": null, + "bug_tracking_url": null, + "code_view_url": null, + "vcs_url": null, + "copyright": null, + "holder": null, + "declared_license_expression": null, + "declared_license_expression_spdx": null, + "license_detections": [], + "other_license_expression": null, + "other_license_expression_spdx": null, + "other_license_detections": [], + "extracted_license_statement": null, + "notice_text": null, + "source_packages": [], + "file_references": [], + "is_private": false, + "is_virtual": true, + "extra_data": {}, + "dependencies": [], + "repository_homepage_url": "https://hex.pm/packages/ranch", + "repository_download_url": null, + "api_data_url": "https://hex.pm/api/packages/ranch", + "datasource_id": "rebar_lock", + "purl": "pkg:hex/ranch@2.1.0" + }, + "extra_data": {}, + "dependency_uid": "pkg:hex/ranch@2.1.0?uuid=fixed-uid-done-for-testing-5642512d1758", + "for_package_uid": null, + "datafile_path": "rebar.lock", + "datasource_id": "rebar_lock", + "namespace": null + } + ], + "files_with_packages": [] +} diff --git a/testdata/assembly-golden/erlang-otp-basic/rebar.config b/testdata/assembly-golden/erlang-otp-basic/rebar.config new file mode 100644 index 000000000..404160e40 --- /dev/null +++ b/testdata/assembly-golden/erlang-otp-basic/rebar.config @@ -0,0 +1,16 @@ +%% Example rebar.config +{erl_opts, [debug_info, warn_export_vars]}. + +{deps, [ + {cowboy, "2.10.0"}, + {jiffy, {git, "https://github.com/davisp/jiffy.git", {tag, "1.1.1"}}}, + {goldrush, "0.1.9"} +]}. + +{profiles, [ + {test, [ + {deps, [ + {proper, "1.4.0"} + ]} + ]} +]}. 
diff --git a/testdata/assembly-golden/erlang-otp-basic/rebar.lock b/testdata/assembly-golden/erlang-otp-basic/rebar.lock new file mode 100644 index 000000000..3108632df --- /dev/null +++ b/testdata/assembly-golden/erlang-otp-basic/rebar.lock @@ -0,0 +1,18 @@ +{"1.2.0", +[{<<"cowboy">>,{pkg,<<"cowboy">>,<<"2.10.0">>},0}, + {<<"cowlib">>,{pkg,<<"cowlib">>,<<"2.12.1">>},1}, + {<<"ranch">>,{pkg,<<"ranch">>,<<"2.1.0">>},1}, + {<<"jiffy">>, + {git,"https://github.com/davisp/jiffy.git", + {ref,"abc123def456"}}, + 0}]}. +[ +{pkg_hash,[ + {<<"cowboy">>, <<"AA68E5ECABE53F3B27CEEEFFE972E5BDBA31AE5900000000000000000000ABCD">>}, + {<<"cowlib">>, <<"2B3E9DA0B21C4565751A6D4901C20D1B4CC25CBB00000000000000000000ABCD">>}, + {<<"ranch">>, <<"8306D225F3A4BE20A9F3F3E5F3C8B234A2AE7AED00000000000000000000ABCD">>}]}, +{pkg_hash_ext,[ + {<<"cowboy">>, <<"E64658D0B465CA0C576E1AEF51A2BCE78C8B60A200000000000000000000ABCD">>}, + {<<"cowlib">>, <<"DB768DB88BEE444B62DA81B2160D9F3CD1AADC7000000000000000000000ABCD">>}, + {<<"ranch">>, <<"244EE3FA2A6175270D8E1E30E41B3313C56A8D8700000000000000000000ABCD">>}]} +]. From ed404c6910ed179efa3f3a055ff07c95521b8ca3 Mon Sep 17 00:00:00 2001 From: Maxim Stykow Date: Wed, 22 Apr 2026 18:52:10 +0200 Subject: [PATCH 4/6] docs(parser): sync Erlang OTP improvement notes Document the map, alias, and git_subdir behavior shipped with the parser fixes so the improvement notes stay aligned with the code. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus Signed-off-by: Maxim Stykow --- docs/improvements/erlang-otp-parser.md | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/docs/improvements/erlang-otp-parser.md b/docs/improvements/erlang-otp-parser.md index 4633c8996..bf9f32de6 100644 --- a/docs/improvements/erlang-otp-parser.md +++ b/docs/improvements/erlang-otp-parser.md @@ -11,8 +11,12 @@ currently provide a production Erlang/OTP parser. 
 ### Application resource file coverage (`*.app.src`)
 
 - Rust parses OTP application resource files using a native Erlang term parser.
+- The bounded Erlang term surface now accepts maps (`#{...}`) in addition to atoms, strings,
+  binaries, tuples, lists, integers, floats, and `%` comments, so map-bearing metadata blocks no
+  longer force fallback parser output.
 - Extracts package identity from the `{application, Name, Props}` tuple, including `vsn`,
   `description`, `licenses`, and `links` fields.
+- Accepts both proplist-style and map-style `links` metadata when recovering homepage and VCS URLs.
 - Filters OTP standard library applications (`kernel`, `stdlib`, `sasl`, `crypto`, etc.) from the
   `applications` dependency list so only third-party dependencies appear in parser output.
 - Handles `runtime_dependencies` entries with embedded version requirements (e.g., `"cowboy-2.10.0"`).
@@ -23,9 +27,13 @@ currently provide a production Erlang/OTP parser.
 
 - Rust parses `rebar.config` files and extracts dependencies from the `deps` field.
 - Supports Hex package dependencies (`{Name, Version}`), git dependencies with tag/branch/ref
-  references, and version-constrained git dependencies (`{Name, Version, {git, URL, Ref}}`).
+  references, `git_subdir` dependencies, and version-constrained git dependencies
+  (`{Name, Version, {git, URL, Ref}}`).
 - Extracts profile-scoped dependencies from the `profiles` field (e.g., test dependencies).
 - Preserves git source URLs in dependency `extra_data` for provenance tracking.
+- Preserves `{pkg, PackageName}` alias identity by emitting package-facing purls from the real Hex
+  package name and storing the outer OTP application name in dependency `extra_data.app_name` when
+  they differ.
 
 ### Rebar3 lockfile coverage (`rebar.lock`)
 
@@ -33,6 +41,9 @@ currently provide a production Erlang/OTP parser.
 - Extracts resolved package versions and git commit references as pinned dependencies.
 - Resolves SHA256 checksums from the `pkg_hash` section into `resolved_package` metadata.
 - Produces `ResolvedPackage` entries with Hex registry homepage and API URLs.
+- Preserves lockfile alias identity for `{pkg, PackageName, Version}` entries, keeping package URLs
+  and resolved-package names aligned with the real Hex package while retaining the outer app name in
+  dependency `extra_data.app_name` when needed.
 
 ### Sibling assembly
 
@@ -45,5 +56,5 @@ currently provide a production Erlang/OTP parser.
 
 - Rust does **not** evaluate Erlang expressions, resolve variables, or execute rebar3 plugins.
 - Conditional dependency wrappers like `{if_var_true, ...}` are skipped rather than guessed at.
-- The Erlang term parser handles atoms, strings, binaries (`<<"...">>`), tuples, lists, integers,
-  floats, and Erlang-style `%` comments but does not attempt full Erlang syntax coverage.
+- The Erlang term parser handles atoms, strings, binaries (`<<"...">>`), tuples, lists, maps,
+  integers, floats, and Erlang-style `%` comments but does not attempt full Erlang syntax coverage.

From 6f0aac79ad1cb5e7bb12b55df85621e5886bfa6a Mon Sep 17 00:00:00 2001
From: Maxim Stykow
Date: Wed, 22 Apr 2026 20:38:12 +0200
Subject: [PATCH 5/6] fix(parser): handle OTP app.src template placeholders

Strip bounded %PLACEHOLDER% macro runs outside strings so canonical OTP
app.src templates keep parsing while weak bare-word GPL hits remain
clue-only evidence.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus Signed-off-by: Maxim Stykow --- src/parsers/erlang_otp.rs | 100 ++++++++++++++++++++++++++++++++- src/parsers/erlang_otp_test.rs | 61 ++++++++++++++++++++ 2 files changed, 160 insertions(+), 1 deletion(-) diff --git a/src/parsers/erlang_otp.rs b/src/parsers/erlang_otp.rs index 2043bab70..e3dfd44de 100644 --- a/src/parsers/erlang_otp.rs +++ b/src/parsers/erlang_otp.rs @@ -351,7 +351,8 @@ impl ErlParser { } fn parse_dotted_terms(content: &str) -> Result, String> { - let mut parser = ErlParser::new(content); + let normalized = strip_template_placeholders(content); + let mut parser = ErlParser::new(&normalized); let mut terms = Vec::new(); let mut count = 0usize; loop { @@ -359,6 +360,10 @@ fn parse_dotted_terms(content: &str) -> Result, String> { if parser.is_eof() { break; } + if parser.peek() == Some('.') { + parser.pos += 1; + continue; + } if count >= MAX_ITERATION_COUNT { break; } @@ -373,6 +378,99 @@ fn parse_dotted_terms(content: &str) -> Result, String> { Ok(terms) } +fn strip_template_placeholders(source: &str) -> String { + let chars: Vec = source.chars().collect(); + let mut result = String::with_capacity(source.len()); + let mut i = 0usize; + let mut in_string = false; + let mut in_quoted_atom = false; + + while i < chars.len() { + let c = chars[i]; + + if in_string { + result.push(c); + i += 1; + if c == '\\' && i < chars.len() { + result.push(chars[i]); + i += 1; + continue; + } + if c == '"' { + in_string = false; + } + continue; + } + + if in_quoted_atom { + result.push(c); + i += 1; + if c == '\\' && i < chars.len() { + result.push(chars[i]); + i += 1; + continue; + } + if c == '\'' { + in_quoted_atom = false; + } + continue; + } + + match c { + '"' => { + in_string = true; + result.push(c); + i += 1; + } + '\'' => { + in_quoted_atom = true; + result.push(c); + i += 1; + } + '%' if chars.get(i + 1) != Some(&'%') => { + let line_end = chars[i..] 
+ .iter() + .position(|&ch| ch == '\n') + .map(|offset| i + offset) + .unwrap_or(chars.len()); + + let last_percent = chars[i + 1..line_end] + .iter() + .rposition(|&ch| ch == '%') + .map(|offset| i + 1 + offset); + + if let Some(last_percent) = last_percent { + let placeholder_body: String = chars[i + 1..last_percent].iter().collect(); + let trailing: String = chars[last_percent + 1..line_end].iter().collect(); + let looks_like_placeholder = !placeholder_body.is_empty() + && placeholder_body.chars().all(|ch| { + ch.is_ascii_uppercase() + || ch.is_ascii_digit() + || matches!(ch, '_' | ',' | '%') + }) + && trailing + .chars() + .all(|ch| ch.is_whitespace() || matches!(ch, ',' | ']' | '}' | ')')); + + if looks_like_placeholder { + i = last_percent + 1; + continue; + } + } + + result.push(c); + i += 1; + } + _ => { + result.push(c); + i += 1; + } + } + } + + result +} + // ── Helpers ── fn term_to_str(term: &ErlTerm) -> Option { diff --git a/src/parsers/erlang_otp_test.rs b/src/parsers/erlang_otp_test.rs index b43fc06c1..37db0a40c 100644 --- a/src/parsers/erlang_otp_test.rs +++ b/src/parsers/erlang_otp_test.rs @@ -176,6 +176,67 @@ mod tests { ); } + #[test] + fn test_parse_app_src_handles_commented_placeholder_blocks() { + let temp_dir = TempDir::new().expect("temp dir"); + let path = temp_dir.path().join("diameter.app.src"); + fs::write( + &path, + r#"%% +%% %CopyrightBegin% +%% +%% SPDX-License-Identifier: Apache-2.0 +%% +%% Copyright Ericsson AB 2010-2025. All Rights Reserved. 
+%% +%% %CopyrightEnd% +%% + +{application, diameter, + [{description, "Diameter protocol"}, + {vsn, "%VSN%"}, + {modules, [ + %MODULES% + %,%COMPILER% + %,%INFO% + ]}, + {registered, [%REGISTERED%]}, + {applications, [ + stdlib, + kernel + %, ssl + %, syntax-tools + %, runtime-tools + ]}, + {env, []}, + {mod, {diameter_app, []}}, + {runtime_dependencies, [ + "erts-10.0", + "stdlib-5.0", + "kernel-3.2", + "ssl-9.0" + %, "syntax-tools-1.6.18" + %, "runtime-tools-1.8.16" + ]} + %% + %% Note that ssl is only required if configured on TCP transports, + %% and syntax-tools and runtime-tools are only required if the + %% dictionary compiler and debug modules (respectively) are + %% needed/wanted at runtime, which they typically aren't. These + %% modules are the two commented lines in the 'modules' tuple. + %% + ]}. +"#, + ) + .expect("write"); + + let package = ErlangAppSrcParser::extract_first_package(&path); + assert_eq!(package.name.as_deref(), Some("diameter")); + assert_eq!(package.description.as_deref(), Some("Diameter protocol")); + assert!(package.version.is_none()); + assert!(package.dependencies.is_empty()); + } + #[test] fn test_parse_app_src_with_non_stdlib_runtime_deps() { let temp_dir = TempDir::new().expect("temp dir"); From da94654d3a34b53378b570984adcab8e393296c2 Mon Sep 17 00:00:00 2001 From: Maxim Stykow Date: Wed, 22 Apr 2026 20:39:25 +0200 Subject: [PATCH 6/6] docs(benchmarks): verify Erlang OTP compare targets Record the ejabberd, OTP, and VerneMQ compare runs, regenerate the benchmark chart, and mark scorecard row 44 verified after triaging the remaining deltas. 
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus Signed-off-by: Maxim Stykow --- docs/BENCHMARKS.md | 17 +-- docs/benchmarks/scan-duration-vs-files.svg | 18 +++ .../PARSER_VERIFICATION_SCORECARD.md | 119 +++++++++--------- 3 files changed, 88 insertions(+), 66 deletions(-) diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md index 60fc20165..eb4599d50 100644 --- a/docs/BENCHMARKS.md +++ b/docs/BENCHMARKS.md @@ -11,7 +11,7 @@ The chart below uses a log-log scatter plot: file count on the x-axis, wall-cloc ![Scan duration vs. file count for Provenant and ScanCode](benchmarks/scan-duration-vs-files.svg) -> Provenant is faster on 136 of 138 recorded runs, with a **11.6× median speedup** and **10.1× geometric-mean speedup** overall; the median gap grows from **6.4×** on sub-100-file targets to **19.7×** on 10k+ file targets. +> Provenant is faster on 139 of 141 recorded runs, with a **11.7× median speedup** and **10.2× geometric-mean speedup** overall; the median gap grows from **6.4×** on sub-100-file targets to **20.1×** on 10k+ file targets. > Generated from the benchmark timing rows in this document via `cargo run --manifest-path xtask/Cargo.toml --bin generate-benchmark-chart`. ## Current benchmark examples @@ -53,13 +53,16 @@ The tables below provide the per-target detail behind the chart. Each row is one | [r-lib/devtools @ a3447b9](https://github.com/r-lib/devtools/tree/a3447b9f3d59abb6cc8b63a54db3435819324c1e)
266 files | 2026-04-19 · devtools-24729 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 9.28s
ScanCode: 80.85s
**8.71× faster (-88.5%)** | Far broader CRAN package and dependency extraction (`14` vs `1` packages, `45` vs `1` dependencies) from the root `DESCRIPTION` plus committed test-package fixtures, with correct filtering of fake `pkg:cran/R` dependency noise and cleaner maintainer or URL normalization | | [tidyverse/ggplot2 @ 7d79c95](https://github.com/tidyverse/ggplot2/tree/7d79c956b5707cb7c762d834caf842dc6496b032)
1,154 files | 2026-04-19 · ggplot2-95481 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 14.46s
ScanCode: 178.35s
**12.33× faster (-91.9%)** | Direct CRAN package visibility on the root `DESCRIPTION` plus declared dependency extraction (`41` vs `0`) across `Imports`, `Suggests`, and `Enhances`, with correct hyphenated CRAN version constraints such as `sf (>= 0.7-3)` and cleaner Rd or roxygen URL recovery | -#### Hex / Elixir +#### Hex / Elixir / Erlang / OTP -| Target snapshot | Run context | Timing snapshot | Advantages over ScanCode | -| -------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [elixir-ecto/ecto @ 28d9282](https://github.com/elixir-ecto/ecto/tree/28d928267388018d5b0bb1f83e04368b7e8cae50)
156 files | 2026-04-22 · ecto-26520 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 14.03s
ScanCode: 135.56s
**9.66× faster (-89.7%)** | Broader Hex dependency extraction (`16` vs `0`) from the repo-root `mix.lock` plus `examples/friends/mix.lock`, with direct locked package identities for entries such as `ecto_sql`, `postgrex`, and `telemetry` that ScanCode leaves dependency-blind | -| [elixir-plug/plug @ 47649aa](https://github.com/elixir-plug/plug/tree/47649aa7bb910f481b66cc3e98c14b2c3b761c3c)
104 files | 2026-04-22 · plug-22829 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 10.77s
ScanCode: 92.08s
**8.55× faster (-88.3%)** | Direct Hex package visibility on `mix.lock` (`1` vs `0`) plus locked dependency extraction (`9` vs `0`) for `plug_crypto`, `telemetry`, `ex_doc`, and sibling Hex pins that ScanCode leaves at zero, with Unicode-preserving `Loïc Hoguin` holder normalization | -| [phoenixframework/phoenix @ e7b8081](https://github.com/phoenixframework/phoenix/tree/e7b8081792fa51c9fede6d0fb9ddb610bac3f26f)
476 files | 2026-04-22 · phoenix-13265 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 12.80s
ScanCode: 149.17s
**11.66× faster (-91.4%)** | Direct Hex package visibility on the repo-root, `installer/mix.lock`, and `integration_test/mix.lock` surfaces (`3` vs `0` file-level package records), while keeping top-level package and dependency counts aligned elsewhere and preserving structured npm party metadata | +| Target snapshot | Run context | Timing snapshot | Advantages over ScanCode | +| -------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [elixir-ecto/ecto @ 28d9282](https://github.com/elixir-ecto/ecto/tree/28d928267388018d5b0bb1f83e04368b7e8cae50)
156 files | 2026-04-22 · ecto-26520 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 14.03s
ScanCode: 135.56s
**9.66× faster (-89.7%)** | Broader Hex dependency extraction (`16` vs `0`) from the repo-root `mix.lock` plus `examples/friends/mix.lock`, with direct locked package identities for entries such as `ecto_sql`, `postgrex`, and `telemetry` that ScanCode leaves dependency-blind | +| [elixir-plug/plug @ 47649aa](https://github.com/elixir-plug/plug/tree/47649aa7bb910f481b66cc3e98c14b2c3b761c3c)
104 files | 2026-04-22 · plug-22829 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 10.77s
ScanCode: 92.08s
**8.55× faster (-88.3%)** | Direct Hex package visibility on `mix.lock` (`1` vs `0`) plus locked dependency extraction (`9` vs `0`) for `plug_crypto`, `telemetry`, `ex_doc`, and sibling Hex pins that ScanCode leaves at zero, with Unicode-preserving `Loïc Hoguin` holder normalization | +| [erlang/otp @ 264def5](https://github.com/erlang/otp/tree/264def545b8214ea7100bfede1a4629c676ff1c0)
11,749 files | 2026-04-22 · otp-15523 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 135.93s
ScanCode: 3197.26s
**23.52× faster (-95.7%)** | Direct OTP application package visibility (`11` vs `0`) across committed `lib/*/src/*.app.src` templates, with bounded `%PLACEHOLDER%` handling that keeps canonical manifests such as `diameter.app.src` scannable and preserves the same non-stdlib runtime dependency inventory ScanCode finds | +| [phoenixframework/phoenix @ e7b8081](https://github.com/phoenixframework/phoenix/tree/e7b8081792fa51c9fede6d0fb9ddb610bac3f26f)
476 files | 2026-04-22 · phoenix-13265 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 10 proc | Provenant: 12.80s
ScanCode: 149.17s
**11.66× faster (-91.4%)** | Direct Hex package visibility on the repo-root, `installer/mix.lock`, and `integration_test/mix.lock` surfaces (`3` vs `0` file-level package records), while keeping top-level package and dependency counts aligned elsewhere and preserving structured npm party metadata | +| [processone/ejabberd @ 87475d8](https://github.com/processone/ejabberd/tree/87475d813b974492f338720eab5c9c3d4646a4ce)
623 files | 2026-04-22 · ejabberd-26578 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 16.74s
ScanCode: 214.30s
**12.80× faster (-92.2%)** | Broader Erlang/Rebar package and dependency extraction (`2` vs `1` packages, `43` vs `3` dependencies) from the root `rebar.config`, `rebar.lock`, nested `_checkouts/configure_deps` manifests, and committed Dockerfiles, with the bundled `priv/mod_invites/copyright` notice kept as clue-level license evidence instead of being overstated as Debian package metadata | +| [vernemq/vernemq @ 4681e54](https://github.com/vernemq/vernemq/tree/4681e5490cc42e6cc26a504bb4b3c5413315c21f)
441 files | 2026-04-22 · vernemq-20484 · macOS 26.3.1 · Apple M1 Max · 32 GB · arm64 · 4 proc | Provenant: 13.90s
ScanCode: 149.29s
**10.74× faster (-90.7%)** | Broader Erlang/Rebar dependency extraction (`119` vs `0`) from the repo-root and per-app `rebar.config` / `.app.src` manifests, plus direct `.gitmodules` package visibility and mixed Hex or git package identity across the VerneMQ app tree where ScanCode stays manifest-blind | #### JavaScript / TypeScript / web stacks diff --git a/docs/benchmarks/scan-duration-vs-files.svg b/docs/benchmarks/scan-duration-vs-files.svg index 0f90dd9a0..dbb7caeeb 100644 --- a/docs/benchmarks/scan-duration-vs-files.svg +++ b/docs/benchmarks/scan-duration-vs-files.svg @@ -188,6 +188,9 @@ ScanCode: 186.62s PerlDancer/Dancer2 @ a1faa22 Files: 436 ScanCode: 97.37s + vernemq/vernemq @ 4681e54 +Files: 441 +ScanCode: 149.29s tidyverse/dplyr @ 2f9f49e Files: 462 ScanCode: 170.71s @@ -215,6 +218,9 @@ ScanCode: 219.94s iTowns/itowns @ 08e08f5 Files: 616 ScanCode: 170.19s + processone/ejabberd @ 87475d8 +Files: 623 +ScanCode: 214.30s rpm-software-management/dnf @ e47634f Files: 655 ScanCode: 203.47s @@ -425,6 +431,9 @@ ScanCode: 1974.56s spring-projects/spring-boot @ 53827d4 Files: 11610 ScanCode: 776.24s + erlang/otp @ 264def5 +Files: 11749 +ScanCode: 3197.26s apache/airflow @ 47ce5f3 Files: 11854 ScanCode: 936.34s @@ -604,6 +613,9 @@ Provenant: 12.89s PerlDancer/Dancer2 @ a1faa22 Files: 436 Provenant: 9.33s + vernemq/vernemq @ 4681e54 +Files: 441 +Provenant: 13.90s tidyverse/dplyr @ 2f9f49e Files: 462 Provenant: 13.86s @@ -631,6 +643,9 @@ Provenant: 14.57s iTowns/itowns @ 08e08f5 Files: 616 Provenant: 12.53s + processone/ejabberd @ 87475d8 +Files: 623 +Provenant: 16.74s rpm-software-management/dnf @ e47634f Files: 655 Provenant: 14.37s @@ -841,6 +856,9 @@ Provenant: 200.80s spring-projects/spring-boot @ 53827d4 Files: 11610 Provenant: 67.58s + erlang/otp @ 264def5 +Files: 11749 +Provenant: 135.93s apache/airflow @ 47ce5f3 Files: 11854 Provenant: 65.32s diff --git a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md 
b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md index 6eace1fa6..e311cd610 100644 --- a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md +++ b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md @@ -58,65 +58,66 @@ Method rules: The ranking below is ordered by **practical verification value first**: broad ecosystem prevalence, likelihood of exposing real parser-plus-license/copyright interactions under `--profile common`, and coverage breadth within the implemented family. <<<<<<< HEAD -| Priority | Ecosystem | Status | Candidate targets | Priority and scope notes | -| -------- | ------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| 0a | Cross-cutting broad `C++` repository scans (non-parser reference) | 🟢 Verified | `boostorg/boost` (236 files)
`boostorg/json` (701 files)
`mongodb/mongo` (11k files) | There is no generic `C++` parser row. These repositories are still valuable reference targets because they exercise multiple implemented `C++`-adjacent families and package-adjacent detection in realistic trees. They complement, but do not replace, family-specific verification for Autotools, Conan, vcpkg, Bazel, and Buck. | -| 0b | Cross-cutting broad polyglot / vendored monorepo scans (non-parser reference) | 🟢 Verified | `chromium/chromium` (490,886 files)
`apache/airflow` (11,854 files)
`kubernetes/kubernetes` (29,080 files) | These are good early warning targets for interaction bugs across multiple parser families, vendored third-party metadata, README/submodule handling, and common-profile license/copyright detection in very large trees. They complement, but do not replace, family-specific rows. | -| 0c | Cross-cutting rootfs / shipped-artifact snapshot scans (non-parser reference) | 🟢 Verified | Debian base-image rootfs snapshot (3,267 files)
Fedora base-image rootfs snapshot (1,579 files)
official Alpine minirootfs snapshot (84 files) | These targets simultaneously exercise distro metadata, package DB/archive surfaces, package-adjacent files, and common-profile detection on unpacked system trees. They complement, but do not replace, the Debian, RPM, Alpine, Linux Distro, and Windows Update family rows. | -| 0d | Cross-cutting filesystem-scale native source-tree scans (non-parser reference) | 🟢 Verified | `torvalds/linux` (100k files)
`rust-lang/rust` (8k files) | Use this lane when traversal robustness matters more than parser breadth. `torvalds/linux` is the extreme large native-tree and sparse-manifest case with lots of COPYING/README-style text noise, while `rust-lang/rust` adds a mixed Cargo-plus-bootstrap native layout. Watch generated/build artifacts, vendored/bootstrap directories, and common-profile deltas that are really tree-shape issues rather than parser regressions. | -| 0e | Cross-cutting licensing-edge-case repository scans (non-parser reference) | 🟢 Verified | `nmap/nmap` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files)
`mongodb/mongo` (11k files) | Use this lane when the main goal is license-classification accuracy rather than parser breadth. These targets are useful when the verification focus is classification quality on real repository text, reference notices, and packaging-adjacent licensing material rather than parser coverage alone. | -| 1 | npm / yarn / pnpm (+ Bun) | 🟢 Verified | `npm/cli` (500–2k files)
`yarnpkg/berry` (500–2k files)
`vercel/next.js` (5k files)
`oven-sh/bun` (500–2k files)
`microsoft/vscode` (3k files) | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise. | -| 2 | Python / PyPI | 🟢 Verified | `pandas-dev/pandas` (1.2k files)
`scipy/scipy` (1.3k files)
`django/django` (2.5k files)
`python-poetry/poetry` (500–2k files)
`astral-sh/uv` (500–2k files) | Broad Python family with both classic and modern metadata. `pandas-dev/pandas`, `scipy/scipy`, and `django/django` add realistic mixed source/doc/test trees, while `python-poetry/poetry` and `astral-sh/uv` cover Poetry- and uv-era lockfile/group behavior. Watch interactions between `pyproject.toml`, legacy setup metadata, extras/groups, and large doc/test subtrees that can dominate common-profile deltas. | -| 3 | Maven / Java | 🟢 Verified | `apache/maven` (500–2k files)
`apache/camel` (2k–10k files)
`spring-projects/spring-boot` (2k–10k files)
`apache/felix-dev` (2k–10k files) | High-value JVM lane. `apache/maven` is the clearest parent/module inheritance reference, `apache/camel` and `spring-projects/spring-boot` stress large nested multi-module builds, and `apache/felix-dev` adds OSGi plus `MANIFEST.MF` bundle metadata. Watch inherited metadata, nested-module aggregation, and bundle-manifest extraction rather than treating every Java delta as leaf-`pom.xml` parsing failure. | -| 3a | Clojure / Leiningen | 🟢 Verified | `technomancy/leiningen` (500–2k files)
`metabase/metabase` (2k–10k files)
`renovatebot/renovate` Leiningen fixtures | Keep this row explicit instead of assuming the broader Maven or SBT rows cover it. `technomancy/leiningen` is the canonical `project.clj` reference, `metabase/metabase` gives a real-world root `deps.edn`, and `renovatebot/renovate` adds a fixture-heavy Leiningen edge-case lane. The shipped Rust surface is bounded static parsing of `deps.edn` and `project.clj`, not generic JVM build inheritance, and these manifests are intentionally treated as standalone unassembled inputs. | -| 4 | Go | 🟢 Verified | `containerd/containerd` (2k–10k files)
`go-gitea/gitea` (2k–10k files)
Go build-info sample binaries via local `--target-path` + `common-with-compiled` lane | Use both source and binary lanes here. `containerd/containerd` and `go-gitea/gitea` cover large real-world module graphs, while the local binary lane is the only way to verify embedded Go build info that repo scans cannot see. Watch nested modules, `go.work` workspace roots, vendored trees, and source-versus-binary coverage gaps explicitly during compare review. | -| 5 | Cargo/Rust | 🟢 Verified | `tokio-rs/tokio` (250 files)
`rust-lang/cargo` (700 files)
cargo-auditable sample binaries via local `--target-path` + `common-with-compiled` lane | Strong workspace/member coverage plus an explicit compiled-metadata lane for the scanner-gated cargo-auditable surface. Watch workspace root/member ownership, manifest-declared file references such as `README` and license files, and compiled-versus-source coverage gaps. Keep bootstrap-scale mixed Rust/C++ trees such as `rust-lang/rust` in the dedicated filesystem-scale cross-cutting lane instead of duplicating them here. | -| 5a | Compiled artifacts (`go build info`, cargo-auditable, Windows PE `VERSIONINFO`) | 🟢 Verified | `itchyny/gojq` release binaries via local `--target-path` + `common-with-compiled` lane
`lichess-org/fishnet` release binaries via local `--target-path` + `common-with-compiled` lane
`glzr-io/glazewm` Windows release executables via local `--target-path` lane | Keep this detector-oriented row explicit so compiled-binary verification does not stay implicit inside the Go, Cargo/Rust, Windows Update, or `misc.py` rows. `itchyny/gojq` is a clean Go build-info target, `lichess-org/fishnet` is an explicit cargo-auditable release lane, and `glzr-io/glazewm` gives a focused Windows `VERSIONINFO` executable target. Prefer small release trees that include nearby README or LICENSE material when possible, so the compare still exercises common-profile interactions rather than only binary package identity. | -| 6 | NuGet | 🟢 Verified | `OrchardCMS/OrchardCore` (2k–10k files)
`AvaloniaUI/Avalonia` (2k–10k files)
`.nupkg` / `.deps.json` snapshots via local `--target-path` lane | Broad .NET lane across source and shipped artifacts. `OrchardCMS/OrchardCore` and `AvaloniaUI/Avalonia` cover large solution-style repos and central package management patterns, while the `.nupkg` / `.deps.json` lane covers runtime and package-artifact metadata that source repos may miss. Watch duplicate package signals across solution props/targets, project files, and runtime artifacts before counting them as regressions. | -| 7 | PHP / Composer | 🟢 Verified | `laravel/framework` (2k–10k files)
`composer/composer` (500–2k files)
`symfony/symfony` (2k–10k files) | Mature Composer lane. `composer/composer` is the canonical Composer reference, while `laravel/framework` and `symfony/symfony` add large real-world monorepo/library dependency graphs. Watch `composer.json` versus `composer.lock` behavior, split-package repo structure, and README/LICENSE-heavy trees that can create unrelated common-profile deltas. | -| 8 | Gradle | 🟢 Verified | `gradle/gradle` (2k–10k files)
`elastic/elasticsearch` (11k files)
`apache/kafka` (2k–10k files) | High-signal JVM build family with settings/includes and large build graphs; `elastic/elasticsearch` adds an especially large multi-project Gradle and packaging target with meaningful licensing/distribution complexity. | -| 8a | Android metadata and package artifacts | 🟢 Verified | `aosp-mirror/platform_build` (Soong `METADATA` coverage)
`aosp-mirror/platform_frameworks_base` (Android manifest surfaces)
representative local `.aab`, `.apk`, and standalone binary `AndroidManifest.xml` artifacts via `--target-path` lane | Keep this Android-specific lane explicit instead of assuming the broader Gradle row covers it. Use the repository targets for Soong `METADATA` files and committed manifest surfaces, and the local artifact lane for proto-encoded `.aab` plus binary AXML/APK manifest metadata that ordinary repository scans do not usually contain. | -| 9 | Ruby | 🟢 Verified | `rails/rails` (2k–10k files)
`rubocop/rubocop` (500–2k files)
`.gem` archive sample via local `--target-path` lane | Use this row to separate source-repo and shipped-gem behavior. `rails/rails` is the large multi-gemspec/Bundler stress case, `rubocop/rubocop` is a smaller modern Bundler contrast, and the `.gem` lane covers archive metadata. Watch Gemfile-versus-gemspec-versus-lockfile precedence and differences between source trees and packaged gem metadata. | -| 10 | Debian | 🟢 Verified | `guillemj/dpkg` (500–2k files)
`Debian/apt` (2k–10k files)
official `.deb` / dpkg status / distroless `status.d` snapshots via local `--target-path` lane | Keep source-package and installed-state coverage separate. `guillemj/dpkg` and `Debian/apt` exercise Debian source-package metadata, while the `.deb`, `dpkg status`, and distroless `status.d` lanes cover binary-package and installed-database behavior. Watch source-versus-binary package identity, multiple package stanzas, and Debian copyright/license files generating common-profile deltas that are not parser failures. | -| 11 | Docker | 🟢 Verified | `moby/moby` (2k–10k files)
`docker-library/official-images` (<500 files)
`docker-library/python` (<500 files)
`getsentry/self-hosted` (<500 files) | Docker needs both canonical and real deployment targets. `moby/moby` is the broad Dockerfile/build-context reference, `docker-library/official-images` is the source-of-truth library-definition lane, `docker-library/python` is a useful generated official-image leaf target, and `getsentry/self-hosted` adds compose-heavy multi-service realism. Watch multi-stage Dockerfiles, compose-plus-Dockerfile overlap, and template/env noise before treating extra findings as parser regressions. | -| 11a | Helm | 🟢 Verified | `baserow/baserow` (2k–10k files)
`appsmithorg/appsmith` (10k–50k files)
`DefectDojo/django-DefectDojo` (500–2k files) | Keep Helm explicit instead of relying on incidental chart files inside larger application repositories. `baserow/baserow` gives a strong `Chart.yaml` plus `Chart.lock` lane, `appsmithorg/appsmith` adds a large conventional chart deployment tree, and `DefectDojo/django-DefectDojo` is a smaller contrast target. The implemented Rust surface is static `Chart.yaml` plus `Chart.lock` parsing with sibling assembly, declared-versus-locked dependency coverage, and bounded malformed-entry tolerance; that needs at least one focused chart-first verification lane. | -| 12 | Conda | 🟢 Verified | `conda/conda` (500–2k files)
`conda/conda-build` (500–2k files)
`conda-forge/pandas-feedstock` (<500 files) | Conda needs three distinct target shapes. `conda/conda` covers user-facing environment metadata, `conda/conda-build` covers recipes and build outputs, and `conda-forge/pandas-feedstock` is the feedstock pattern Provenant must handle. Watch recipe-output duplication and generated feedstock files before overcounting package or license deltas. | -| 12a | Pixi | 🟢 Verified | `prefix-dev/pixi` (500–2k files)
`pydata/xarray` (500–2k files)
`OpenMDAO/OpenMDAO` (500–2k files) | Keep Pixi explicit even though some Python and Conda compare targets already surface `pixi.toml` and `pixi.lock`. `prefix-dev/pixi` is the canonical upstream with both `pixi.toml` and `pixi.lock`, `pydata/xarray` adds a real consumer repo, and `OpenMDAO/OpenMDAO` adds a second repo with both manifest and lockfile. This row isolates the native `pixi.toml` plus `pixi.lock` contract, mixed Conda/PyPI dependency behavior, and topology-planned root assembly instead of letting those behaviors hide inside broader Python-family compare noise. | -| 13 | Swift | 🟢 Verified | `pointfreeco/swift-composable-architecture` (500–2k files)
`SwiftFiddle/swiftfiddle-web` (<500 files)
`Package.swift.json` / `Package.resolved` snapshots via local `--target-path` lane | `pointfreeco/swift-composable-architecture` is a clean SwiftPM library reference, `SwiftFiddle/swiftfiddle-web` adds a real committed `Resources/Package.swift.json` plus `Package.resolved` target shape, and the local snapshot lane remains important for future pinned production captures that record generated SwiftPM surfaces alongside their source manifests. Watch repo-only verification gaps whenever a bug might live in `Package.swift.json` or `Package.resolved` rather than in source manifests. | -| 14 | Haskell / Hackage | 🟢 Verified | `commercialhaskell/stack` (500–2k files)
`jgm/pandoc` (500–2k files)
`yesodweb/yesod` (500–2k files) | Good mix of Cabal, Stack, and multi-package Haskell repository structure. | -| 15 | Scala / SBT | 🟢 Verified | `akka/akka` (2k–10k files)
`playframework/playframework` (2k–10k files)
`scalatest/scalatest` (500–2k files) | Valuable JVM surface, but current Rust scope is bounded static parsing rather than full evaluation semantics. | -| 16 | CocoaPods | 🟢 Verified | `AFNetworking/AFNetworking` (<500 files)
`Alamofire/Alamofire` (<500 files)
`SDWebImage/SDWebImage` (<500 files) | Strong Apple packaging coverage through widely used podspec-based libraries. | -| 16a | Carthage | 🟢 Verified | `Carthage/Carthage` (500–2k files)
`ReactiveCocoa/ReactiveCocoa` (<500 files)
`Mantle/Mantle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `Carthage/Carthage` is the canonical upstream with both `Cartfile` and `Cartfile.resolved`, while `ReactiveCocoa/ReactiveCocoa` and `Mantle/Mantle` are representative consumer libraries. Focus on correct `Cartfile` dependency extraction, `Cartfile.resolved` pinned-version coverage, and the dependency-hoisting contract for sibling manifest-plus-lockfile pairs without inventing a root Carthage package identity. | -| 16b | Yocto / BitBake | 🟢 Verified | `yoctoproject/poky` (10k–50k files)
`openembedded/meta-openembedded` (10k–50k files)
`pocketbeagle/meta-pocketbeagle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `yoctoproject/poky` is the canonical Yocto reference distribution, while `openembedded/meta-openembedded` provides a large recipe corpus across many layers. Focus on correct package identity extraction from filenames and `PN`/`PV` variables, license normalization of BitBake-specific operator syntax (`&`/`\|`), and `DEPENDS`/`RDEPENDS` dependency scoping. | -| 17 | Nix | 🟢 Verified | `NixOS/nixpkgs` (50k+ files)
`NixOS/nix` (2k–10k files)
`numtide/devshell` (<500 files) | Valuable ecosystem with explicit note that current `default.nix` support is intentionally bounded. | -| 18 | CPAN | 🟢 Verified | `Plack/Plack` (500–2k files)
`libwww-perl/libwww-perl` (500–2k files)
`PerlDancer/Dancer2` (500–2k files) | Good Perl metadata variety through `META.*`, `dist.ini`, and `Makefile.PL`. | -| 19 | CRAN / R | 🟢 Verified | `tidyverse/dplyr` (500–2k files)
`tidyverse/ggplot2` (500–2k files)
`r-lib/devtools` (500–2k files) | Strong DESCRIPTION-based metadata with realistic dependency fields. | -| 20 | Alpine | ⚪ Planned | `alpinelinux/aports`
official `.apk` sample via local `--target-path` lane
Alpine `lib/apk/db/installed` snapshot via local `--target-path` lane | Keep this family row even though Alpine rootfs targets also appear in `0c`: `0c` is the cross-cutting rootfs lane, while this row tracks Alpine-specific source, archive, and installed-DB surfaces. Do not treat rootfs-only verification as verification of the remaining `APKBUILD`, `.apk`, and standalone installed-DB surfaces listed here. | -| 21 | RPM | 🟢 Verified | `rpm-software-management/dnf` (2k–10k files)
`rpm-software-management/libdnf` (500–2k files)
official `.rpm` / RPM BDB, NDB, and SQLite DB snapshots via local `--target-path` lane | Important distro-family lane across source and installed-state metadata. `rpm-software-management/dnf` and `rpm-software-management/libdnf` cover realistic RPM-adjacent source trees, while the local `.rpm` and RPM DB lanes cover shipped package and installed-database behavior. Watch specfile subpackages, changelog/license fields, namespace-from-`os-release` behavior, and DB-versus-source differences separately during triage. | -| 22 | Arch Linux | ⚪ Planned | Arch Linux GitLab packaging repo for `pacman`
Arch Linux GitLab packaging repo for `grep`
official built package sample for `.PKGINFO` via local `--target-path` lane | Use one source-package contrast plus one built-package lane here. The Arch packaging repos cover PKGBUILD and `.SRCINFO` source metadata, while the local built-package lane covers `.PKGINFO` behavior that source repos do not contain. Keep the candidate repos concrete because the canonical Arch packaging sources live in the Arch packaging tree rather than in one obvious GitHub umbrella repository. | -| 23 | Bazel | 🟢 Verified | `tensorflow/tensorflow` (10k files)
`bazelbuild/bazel` (2k–10k files)
`protocolbuffers/protobuf` (2.5k files) | Strong Bazel lane across old and new module surfaces. `bazelbuild/bazel` is the canonical direct reference, `tensorflow/tensorflow` is the large mixed-language stress case, and `protocolbuffers/protobuf` is a smaller contrast target. Watch `WORKSPACE` versus `MODULE.bazel`, macro-heavy static-parsing limits, and giant `third_party` trees producing unrelated common-profile noise. | -| 24 | Autotools | 🟢 Verified | `curl/curl` (1k files)
`libevent/libevent` (<500 files)
`libgit2/libgit2` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files) | Mature native-build lane with several useful contrasts. `curl/curl` is the clearest autoconf-heavy reference, `libevent/libevent` is a smaller contrast, `libgit2/libgit2` adds a mixed native project shape, and `ffmpeg/ffmpeg` adds strong GPL/LGPL-conditional licensing pressure in a `configure`-driven native tree. Watch generated `configure` / `Makefile.in` noise and avoid collapsing file-level licensing differences into one top-level verdict. | -| 24a | Meson | 🟢 Verified | `qemu/qemu` (10k–50k files)
`systemd/systemd` (10k–50k files)
`LinuxCNC/linuxcnc` (2k–10k files) | Keep Meson explicit instead of assuming the Autotools or generic native-tree rows cover it. `qemu/qemu` and `systemd/systemd` are high-signal root-`meson.build` upstreams, while `LinuxCNC/linuxcnc` is a smaller contrast target. The shipped Rust surface is bounded static `meson.build` parsing for literal `project()` metadata and top-level `dependency()` calls, with explicit no-evaluation guardrails that deserve a focused verification lane. | -| 25 | Conan | 🟢 Verified | `conan-io/conan-center-index` (10k–50k files)
`catchorg/Catch2` (<500 files)
`fmtlib/fmt` (<500 files) | Conan needs both recipe-corpus and upstream-library targets. `conan-io/conan-center-index` is the authoritative recipe index, while `catchorg/Catch2` and `fmtlib/fmt` are smaller upstream consumer-library contrasts. Watch recipe-only repository structure, versioned recipe directories, and the difference between Conan recipe metadata and normal source-package behavior. | -| 26 | vcpkg | 🟢 Verified | `microsoft/vcpkg` (10k–50k files)
`microsoft/terminal` (2k–10k files)
`microsoft/onnxruntime` (10k–50k files) | Important Windows/`C++` manifest-mode lane. `microsoft/vcpkg` is the authoritative manifest and registry target, while `microsoft/terminal` and `microsoft/onnxruntime` cover large consumer repos that use `vcpkg.json` in real codebases. Watch current scope boundaries carefully: this row is about implemented manifest-mode metadata, not every vendored or toolchain surface in those trees. | -| 27 | Deno | 🟢 Verified | `denoland/fresh` (500–2k files)
`oakserver/oak` (500–2k files)
`denoland/std` (2k–10k files) | Useful modern JS/TS ecosystem with explicit config and lockfile coverage. | -| 28 | Dart / Pub | 🟢 Verified | `rrousselGit/riverpod` (500–2k files)
`firebase/flutterfire` (2k–10k files)
`flutter/packages` (2k–10k files) | Good Pub and Flutter-adjacent coverage through large multi-package repositories. | -| 29 | Git submodules | 🟢 Verified | `grpc/grpc` (10k–50k files)
`git/git` (500–2k files)
`chromium/chromium` (490,886 files) | This is a package-adjacent lane, not a parser-breadth lane. `git/git` is the clearest focused `.gitmodules` reference, `grpc/grpc` adds large real-world third-party trees, and `chromium/chromium` is the stress case. Watch absent submodule checkouts and vendored-tree context so `.gitmodules` findings stay coherent instead of being drowned by unrelated common-profile output. | -| 30 | Structured metadata (`CITATION.cff`, `publiccode.yml`) | 🟢 Verified | `astropy/astropy` (2k–10k files)
`iTowns/itowns` (500–2k files)
`univention/Nubus` (500–2k files) | Keep both structured-metadata families explicit here. `astropy/astropy` is the strongest `CITATION.cff` reference, `univention/Nubus` is the clearest `publiccode.yml` case, and `iTowns/itowns` adds mixed-project contrast. Watch that structured metadata stays visible beside richer README and package findings instead of being lost in broader common-profile output. | -| 31 | README | 🟢 Verified | `chromium/chromium` vendored `README.chromium` samples (490,886 files)
`vercel/next.js` (5k files)
`django/django` (2.5k files) | Chromium is the main proof target for the specialized README variants; the other two are broader repo-level contrast targets only. Use this row to verify that package-adjacent README parsing stays visible under the common profile instead of disappearing inside unrelated monorepo noise. | -| 32 | Linux Distro (`os-release`) | ⚪ Planned | Debian base-image rootfs snapshot
Fedora base-image rootfs snapshot
Distroless `base-debian12` rootfs snapshot | This row is rootfs-only on purpose. Debian and Fedora give conventional distro metadata layouts, while Distroless shows the minimal-image case where `os-release` may be one of the few package-identity signals present. Watch path/layout differences and do not treat intentionally sparse distroless metadata as a parser regression by itself. | -| 33 | AboutCode | ⚪ Planned | `aboutcode-org/scancode-toolkit` (10k–50k files)
`aboutcode-org/scancode.io` (500–2k files)
`aboutcode-org/dejacode` (500–2k files) | Niche but very high-fit `.ABOUT` lane. `aboutcode-org/scancode-toolkit` is the broadest real-world `.ABOUT` reference, while `aboutcode-org/scancode.io` and `aboutcode-org/dejacode` provide smaller product-style contrasts. Watch `.ABOUT` extraction staying visible beside denser package, README, and license output in these application trees. | -| 34 | Hex / Elixir | 🟢 Verified | `phoenixframework/phoenix` (500–2k files)
`elixir-ecto/ecto` (500–2k files)
`elixir-plug/plug` (<500 files) | Useful ecosystem, but current Rust scope is still the lockfile/static subset, so this ranks below the broader mainstream families. | -| 35 | OCaml / opam | 🟢 Verified | `ocaml/dune` (500–2k files)
`ocaml/ocaml-lsp` (500–2k files)
`ocaml/merlin` (500–2k files) | Good `opam` coverage, but lower practical verification priority than the broader ecosystems above. | -| 36 | Buck | 🟢 Verified | `facebook/buck2` (2k–10k files)
`facebook/watchman` (500–2k files)
`facebook/react-native` (10k–50k files) | Real Buck lane, even if narrower than Bazel in practice. `facebook/buck2` is the canonical direct reference, `facebook/watchman` is a smaller focused contrast, and `facebook/react-native` adds a large mixed-language consumer tree. Watch Buck metadata separately from the rest of the monorepo so unrelated JS/native/common-profile noise does not hide actual build-metadata gaps. | -| 37 | FreeBSD | ⚪ Planned | FreeBSD `pkg` package archive sample
FreeBSD `bash` package archive sample
FreeBSD `curl` package archive sample | Important artifact-family support, but narrower day-to-day scan prevalence than the higher-priority distro lanes. | -| 38 | Chef | ⚪ Planned | `sous-chefs/apache2` (<500 files)
`sous-chefs/mysql` (<500 files)
`chef/chef` (2k–10k files) | Worth covering, but lower priority than the mainstream language and distro families. | -| 39 | Bower | ⚪ Planned | `jquery/jquery-ui` (500–2k files)
`select2/select2` (<500 files)
`jashkenas/backbone` (<500 files) | Legacy ecosystem with ongoing value mostly for backward compatibility. | -| 40 | Haxe | ⚪ Planned | `openfl/openfl` (500–2k files)
`HaxeFlixel/flixel` (500–2k files)
`HeapsIO/heaps` (500–2k files) | Smaller ecosystem; still useful, but lower-value than the broader mainstream families above. | -| 41 | Windows Update | ⚪ Planned | `wsusscn2.cab` extracted tree
Windows cumulative update `.msu` extracted tree
Windows servicing stack update extracted tree | Artifact-oriented family with real value, but specialized and best handled after the higher-signal source/package ecosystems. | -| 42 | `misc.py` recognizers | ⚪ Planned | Apache Tomcat binary release artifacts
Firefox add-on / language-pack artifacts
NSIS official installer artifacts | Broad recognizer family, but not a normal package-manager lane; treat as specialized follow-up verification. | -| 43 | Julia | 🟢 Verified | `JuliaLang/Pkg.jl` (500–2k files)
`JuliaLang/julia` (10k–50k files)
`JuliaPlots/Plots.jl` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `JuliaLang/Pkg.jl` is the canonical `Project.toml` and `Manifest.toml` reference, `JuliaLang/julia` adds a large real-world Julia project tree, and `JuliaPlots/Plots.jl` is a mid-sized consumer library. Focus on correct `Project.toml` metadata extraction, `Manifest.toml` resolved dependency coverage, and sibling assembly of project-plus-manifest pairs. | -| 44 | Erlang / OTP | ⚪ Planned | `processone/ejabberd` (2k–10k files)
`erlang/otp` (10k–50k files)
`vernemq/vernemq` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `processone/ejabberd` is a large real-world Erlang project with `rebar.config`, `rebar.lock`, and multiple `.app.src` files across its dependency tree. `erlang/otp` is the canonical OTP distribution with many `.app.src` files. `vernemq/vernemq` adds a complex multi-dependency rebar project with mixed pkg and git dependencies in `rebar.lock`. Focus on correct `.app.src` metadata and dependency extraction, `rebar.config` dependency parsing including git and profile deps, and `rebar.lock` resolved dependency and hash coverage. | + +| Priority | Ecosystem | Status | Candidate targets | Priority and scope notes | +| -------- | ------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 0a | Cross-cutting broad `C++` repository scans (non-parser reference) | 🟢 Verified | `boostorg/boost` (236 files)
`boostorg/json` (701 files)
`mongodb/mongo` (11k files) | There is no generic `C++` parser row. These repositories are still valuable reference targets because they exercise multiple implemented `C++`-adjacent families and package-adjacent detection in realistic trees. They complement, but do not replace, family-specific verification for Autotools, Conan, vcpkg, Bazel, and Buck. | +| 0b | Cross-cutting broad polyglot / vendored monorepo scans (non-parser reference) | 🟢 Verified | `chromium/chromium` (490,886 files)
`apache/airflow` (11,854 files)
`kubernetes/kubernetes` (29,080 files) | These are good early warning targets for interaction bugs across multiple parser families, vendored third-party metadata, README/submodule handling, and common-profile license/copyright detection in very large trees. They complement, but do not replace, family-specific rows. | +| 0c | Cross-cutting rootfs / shipped-artifact snapshot scans (non-parser reference) | 🟢 Verified | Debian base-image rootfs snapshot (3,267 files)
Fedora base-image rootfs snapshot (1,579 files)
official Alpine minirootfs snapshot (84 files) | These targets simultaneously exercise distro metadata, package DB/archive surfaces, package-adjacent files, and common-profile detection on unpacked system trees. They complement, but do not replace, the Debian, RPM, Alpine, Linux Distro, and Windows Update family rows. | +| 0d | Cross-cutting filesystem-scale native source-tree scans (non-parser reference) | 🟢 Verified | `torvalds/linux` (100k files)
`rust-lang/rust` (8k files) | Use this lane when traversal robustness matters more than parser breadth. `torvalds/linux` is the extreme large native-tree and sparse-manifest case with lots of COPYING/README-style text noise, while `rust-lang/rust` adds a mixed Cargo-plus-bootstrap native layout. Watch generated/build artifacts, vendored/bootstrap directories, and common-profile deltas that are really tree-shape issues rather than parser regressions. | +| 0e | Cross-cutting licensing-edge-case repository scans (non-parser reference) | 🟢 Verified | `nmap/nmap` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files)
`mongodb/mongo` (11k files) | Use this lane when the main goal is license-classification accuracy rather than parser breadth. These targets are useful when the verification focus is classification quality on real repository text, reference notices, and packaging-adjacent licensing material rather than parser coverage alone. | +| 1 | npm / yarn / pnpm (+ Bun) | 🟢 Verified | `npm/cli` (500–2k files)
`yarnpkg/berry` (500–2k files)
`vercel/next.js` (5k files)
`oven-sh/bun` (500–2k files)
`microsoft/vscode` (3k files) | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise. | +| 2 | Python / PyPI | 🟢 Verified | `pandas-dev/pandas` (1.2k files)
`scipy/scipy` (1.3k files)
`django/django` (2.5k files)
`python-poetry/poetry` (500–2k files)
`astral-sh/uv` (500–2k files) | Broad Python family with both classic and modern metadata. `pandas-dev/pandas`, `scipy/scipy`, and `django/django` add realistic mixed source/doc/test trees, while `python-poetry/poetry` and `astral-sh/uv` cover Poetry- and uv-era lockfile/group behavior. Watch interactions between `pyproject.toml`, legacy setup metadata, extras/groups, and large doc/test subtrees that can dominate common-profile deltas. | +| 3 | Maven / Java | 🟢 Verified | `apache/maven` (500–2k files)
`apache/camel` (2k–10k files)
`spring-projects/spring-boot` (2k–10k files)
`apache/felix-dev` (2k–10k files) | High-value JVM lane. `apache/maven` is the clearest parent/module inheritance reference, `apache/camel` and `spring-projects/spring-boot` stress large nested multi-module builds, and `apache/felix-dev` adds OSGi plus `MANIFEST.MF` bundle metadata. Watch inherited metadata, nested-module aggregation, and bundle-manifest extraction rather than treating every Java delta as leaf-`pom.xml` parsing failure. | +| 3a | Clojure / Leiningen | 🟢 Verified | `technomancy/leiningen` (500–2k files)
`metabase/metabase` (2k–10k files)
`renovatebot/renovate` Leiningen fixtures | Keep this row explicit instead of assuming the broader Maven or SBT rows cover it. `technomancy/leiningen` is the canonical `project.clj` reference, `metabase/metabase` gives a real-world root `deps.edn`, and `renovatebot/renovate` adds a fixture-heavy Leiningen edge-case lane. The shipped Rust surface is bounded static parsing of `deps.edn` and `project.clj`, not generic JVM build inheritance, and these manifests are intentionally treated as standalone unassembled inputs. | +| 4 | Go | 🟢 Verified | `containerd/containerd` (2k–10k files)
`go-gitea/gitea` (2k–10k files)
Go build-info sample binaries via local `--target-path` + `common-with-compiled` lane | Use both source and binary lanes here. `containerd/containerd` and `go-gitea/gitea` cover large real-world module graphs, while the local binary lane is the only way to verify embedded Go build info that repo scans cannot see. Watch nested modules, `go.work` workspace roots, vendored trees, and source-versus-binary coverage gaps explicitly during compare review. | +| 5 | Cargo/Rust | 🟢 Verified | `tokio-rs/tokio` (250 files)
`rust-lang/cargo` (700 files)
cargo-auditable sample binaries via local `--target-path` + `common-with-compiled` lane | Strong workspace/member coverage plus an explicit compiled-metadata lane for the scanner-gated cargo-auditable surface. Watch workspace root/member ownership, manifest-declared file references such as `README` and license files, and compiled-versus-source coverage gaps. Keep bootstrap-scale mixed Rust/C++ trees such as `rust-lang/rust` in the dedicated filesystem-scale cross-cutting lane instead of duplicating them here. | +| 5a | Compiled artifacts (`go build info`, cargo-auditable, Windows PE `VERSIONINFO`) | 🟢 Verified | `itchyny/gojq` release binaries via local `--target-path` + `common-with-compiled` lane
`lichess-org/fishnet` release binaries via local `--target-path` + `common-with-compiled` lane
`glzr-io/glazewm` Windows release executables via local `--target-path` lane | Keep this detector-oriented row explicit so compiled-binary verification does not stay implicit inside the Go, Cargo/Rust, Windows Update, or `misc.py` rows. `itchyny/gojq` is a clean Go build-info target, `lichess-org/fishnet` is an explicit cargo-auditable release lane, and `glzr-io/glazewm` gives a focused Windows `VERSIONINFO` executable target. Prefer small release trees that include nearby README or LICENSE material when possible, so the compare still exercises common-profile interactions rather than only binary package identity. | +| 6 | NuGet | 🟢 Verified | `OrchardCMS/OrchardCore` (2k–10k files)
`AvaloniaUI/Avalonia` (2k–10k files)
`.nupkg` / `.deps.json` snapshots via local `--target-path` lane | Broad .NET lane across source and shipped artifacts. `OrchardCMS/OrchardCore` and `AvaloniaUI/Avalonia` cover large solution-style repos and central package management patterns, while the `.nupkg` / `.deps.json` lane covers runtime and package-artifact metadata that source repos may miss. Watch duplicate package signals across solution props/targets, project files, and runtime artifacts before counting them as regressions. | +| 7 | PHP / Composer | 🟢 Verified | `laravel/framework` (2k–10k files)
`composer/composer` (500–2k files)
`symfony/symfony` (2k–10k files) | Mature Composer lane. `composer/composer` is the canonical Composer reference, while `laravel/framework` and `symfony/symfony` add large real-world monorepo/library dependency graphs. Watch `composer.json` versus `composer.lock` behavior, split-package repo structure, and README/LICENSE-heavy trees that can create unrelated common-profile deltas. | +| 8 | Gradle | 🟢 Verified | `gradle/gradle` (2k–10k files)
`elastic/elasticsearch` (11k files)
`apache/kafka` (2k–10k files) | High-signal JVM build family with settings/includes and large build graphs; `elastic/elasticsearch` adds an especially large multi-project Gradle and packaging target with meaningful licensing/distribution complexity. | +| 8a | Android metadata and package artifacts | 🟢 Verified | `aosp-mirror/platform_build` (Soong `METADATA` coverage)
`aosp-mirror/platform_frameworks_base` (Android manifest surfaces)
representative local `.aab`, `.apk`, and standalone binary `AndroidManifest.xml` artifacts via `--target-path` lane | Keep this Android-specific lane explicit instead of assuming the broader Gradle row covers it. Use the repository targets for Soong `METADATA` files and committed manifest surfaces, and the local artifact lane for proto-encoded `.aab` plus binary AXML/APK manifest metadata that ordinary repository scans do not usually contain. | +| 9 | Ruby | 🟢 Verified | `rails/rails` (2k–10k files)
`rubocop/rubocop` (500–2k files)
`.gem` archive sample via local `--target-path` lane | Use this row to separate source-repo and shipped-gem behavior. `rails/rails` is the large multi-gemspec/Bundler stress case, `rubocop/rubocop` is a smaller modern Bundler contrast, and the `.gem` lane covers archive metadata. Watch Gemfile-versus-gemspec-versus-lockfile precedence and differences between source trees and packaged gem metadata. | +| 10 | Debian | 🟢 Verified | `guillemj/dpkg` (500–2k files)
`Debian/apt` (2k–10k files)
official `.deb` / dpkg status / distroless `status.d` snapshots via local `--target-path` lane | Keep source-package and installed-state coverage separate. `guillemj/dpkg` and `Debian/apt` exercise Debian source-package metadata, while the `.deb`, `dpkg status`, and distroless `status.d` lanes cover binary-package and installed-database behavior. Watch source-versus-binary package identity, multiple package stanzas, and Debian copyright/license files generating common-profile deltas that are not parser failures. | +| 11 | Docker | 🟢 Verified | `moby/moby` (2k–10k files)
`docker-library/official-images` (<500 files)
`docker-library/python` (<500 files)
`getsentry/self-hosted` (<500 files) | Docker needs both canonical and real deployment targets. `moby/moby` is the broad Dockerfile/build-context reference, `docker-library/official-images` is the source-of-truth library-definition lane, `docker-library/python` is a useful generated official-image leaf target, and `getsentry/self-hosted` adds compose-heavy multi-service realism. Watch multi-stage Dockerfiles, compose-plus-Dockerfile overlap, and template/env noise before treating extra findings as parser regressions. | +| 11a | Helm | 🟢 Verified | `baserow/baserow` (2k–10k files)
`appsmithorg/appsmith` (10k–50k files)
`DefectDojo/django-DefectDojo` (500–2k files) | Keep Helm explicit instead of relying on incidental chart files inside larger application repositories. `baserow/baserow` gives a strong `Chart.yaml` plus `Chart.lock` lane, `appsmithorg/appsmith` adds a large conventional chart deployment tree, and `DefectDojo/django-DefectDojo` is a smaller contrast target. The implemented Rust surface is static `Chart.yaml` plus `Chart.lock` parsing with sibling assembly, declared-versus-locked dependency coverage, and bounded malformed-entry tolerance; that needs at least one focused chart-first verification lane. | +| 12 | Conda | 🟢 Verified | `conda/conda` (500–2k files)
`conda/conda-build` (500–2k files)
`conda-forge/pandas-feedstock` (<500 files) | Conda needs three distinct target shapes. `conda/conda` covers user-facing environment metadata, `conda/conda-build` covers recipes and build outputs, and `conda-forge/pandas-feedstock` is the feedstock pattern Provenant must handle. Watch recipe-output duplication and generated feedstock files before overcounting package or license deltas. | +| 12a | Pixi | 🟢 Verified | `prefix-dev/pixi` (500–2k files)
`pydata/xarray` (500–2k files)
`OpenMDAO/OpenMDAO` (500–2k files) | Keep Pixi explicit even though some Python and Conda compare targets already surface `pixi.toml` and `pixi.lock`. `prefix-dev/pixi` is the canonical upstream with both `pixi.toml` and `pixi.lock`, `pydata/xarray` adds a real consumer repo, and `OpenMDAO/OpenMDAO` adds a second repo with both manifest and lockfile. This row isolates the native `pixi.toml` plus `pixi.lock` contract, mixed Conda/PyPI dependency behavior, and topology-planned root assembly instead of letting those behaviors hide inside broader Python-family compare noise. | +| 13 | Swift | 🟢 Verified | `pointfreeco/swift-composable-architecture` (500–2k files)
`SwiftFiddle/swiftfiddle-web` (<500 files)
`Package.swift.json` / `Package.resolved` snapshots via local `--target-path` lane | `pointfreeco/swift-composable-architecture` is a clean SwiftPM library reference, `SwiftFiddle/swiftfiddle-web` adds a real committed `Resources/Package.swift.json` plus `Package.resolved` target shape, and the local snapshot lane remains important for future pinned production captures that record generated SwiftPM surfaces alongside their source manifests. Watch repo-only verification gaps whenever a bug might live in `Package.swift.json` or `Package.resolved` rather than in source manifests. | +| 14 | Haskell / Hackage | 🟢 Verified | `commercialhaskell/stack` (500–2k files)
`jgm/pandoc` (500–2k files)
`yesodweb/yesod` (500–2k files) | Good mix of Cabal, Stack, and multi-package Haskell repository structure. | +| 15 | Scala / SBT | 🟢 Verified | `akka/akka` (2k–10k files)
`playframework/playframework` (2k–10k files)
`scalatest/scalatest` (500–2k files) | Valuable JVM surface, but current Rust scope is bounded static parsing rather than full evaluation semantics. | +| 16 | CocoaPods | 🟢 Verified | `AFNetworking/AFNetworking` (<500 files)
`Alamofire/Alamofire` (<500 files)
`SDWebImage/SDWebImage` (<500 files) | Strong Apple packaging coverage through widely used podspec-based libraries. | +| 16a | Carthage | 🟢 Verified | `Carthage/Carthage` (500–2k files)
`ReactiveCocoa/ReactiveCocoa` (<500 files)
`Mantle/Mantle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `Carthage/Carthage` is the canonical upstream with both `Cartfile` and `Cartfile.resolved`, while `ReactiveCocoa/ReactiveCocoa` and `Mantle/Mantle` are representative consumer libraries. Focus on correct `Cartfile` dependency extraction, `Cartfile.resolved` pinned-version coverage, and the dependency-hoisting contract for sibling manifest-plus-lockfile pairs without inventing a root Carthage package identity. | +| 16b | Yocto / BitBake | 🟢 Verified | `yoctoproject/poky` (10k–50k files)
`openembedded/meta-openembedded` (10k–50k files)
`pocketbeagle/meta-pocketbeagle` (<500 files) | New Provenant-only parser with no Python ScanCode reference implementation. `yoctoproject/poky` is the canonical Yocto reference distribution, while `openembedded/meta-openembedded` provides a large recipe corpus across many layers. Focus on correct package identity extraction from filenames and `PN`/`PV` variables, license normalization of BitBake-specific operator syntax (`&`/`\|`), and `DEPENDS`/`RDEPENDS` dependency scoping. | +| 17 | Nix | 🟢 Verified | `NixOS/nixpkgs` (50k+ files)
`NixOS/nix` (2k–10k files)
`numtide/devshell` (<500 files) | Valuable ecosystem with explicit note that current `default.nix` support is intentionally bounded. | +| 18 | CPAN | 🟢 Verified | `Plack/Plack` (500–2k files)
`libwww-perl/libwww-perl` (500–2k files)
`PerlDancer/Dancer2` (500–2k files) | Good Perl metadata variety through `META.*`, `dist.ini`, and `Makefile.PL`. | +| 19 | CRAN / R | 🟢 Verified | `tidyverse/dplyr` (500–2k files)
`tidyverse/ggplot2` (500–2k files)
`r-lib/devtools` (500–2k files) | Strong DESCRIPTION-based metadata with realistic dependency fields. | +| 20 | Alpine | ⚪ Planned | `alpinelinux/aports`
official `.apk` sample via local `--target-path` lane
Alpine `lib/apk/db/installed` snapshot via local `--target-path` lane | Keep this family row even though Alpine rootfs targets also appear in `0c`: `0c` is the cross-cutting rootfs lane, while this row tracks Alpine-specific source, archive, and installed-DB surfaces. Do not treat rootfs-only verification as verification of the remaining `APKBUILD`, `.apk`, and standalone installed-DB surfaces listed here. | +| 21 | RPM | 🟢 Verified | `rpm-software-management/dnf` (2k–10k files)
`rpm-software-management/libdnf` (500–2k files)
official `.rpm` / RPM BDB, NDB, and SQLite DB snapshots via local `--target-path` lane | Important distro-family lane across source and installed-state metadata. `rpm-software-management/dnf` and `rpm-software-management/libdnf` cover realistic RPM-adjacent source trees, while the local `.rpm` and RPM DB lanes cover shipped package and installed-database behavior. Watch specfile subpackages, changelog/license fields, namespace-from-`os-release` behavior, and DB-versus-source differences separately during triage. | +| 22 | Arch Linux | ⚪ Planned | Arch Linux GitLab packaging repo for `pacman`
Arch Linux GitLab packaging repo for `grep`
official built package sample for `.PKGINFO` via local `--target-path` lane | Use one source-package contrast plus one built-package lane here. The Arch packaging repos cover PKGBUILD and `.SRCINFO` source metadata, while the local built-package lane covers `.PKGINFO` behavior that source repos do not contain. Keep the candidate repos concrete because the canonical Arch packaging sources live in the Arch packaging tree rather than in one obvious GitHub umbrella repository. | +| 23 | Bazel | 🟢 Verified | `tensorflow/tensorflow` (10k files)
`bazelbuild/bazel` (2k–10k files)
`protocolbuffers/protobuf` (2.5k files) | Strong Bazel lane across old and new module surfaces. `bazelbuild/bazel` is the canonical direct reference, `tensorflow/tensorflow` is the large mixed-language stress case, and `protocolbuffers/protobuf` is a smaller contrast target. Watch `WORKSPACE` versus `MODULE.bazel`, macro-heavy static-parsing limits, and giant `third_party` trees producing unrelated common-profile noise. | +| 24 | Autotools | 🟢 Verified | `curl/curl` (1k files)
`libevent/libevent` (<500 files)
`libgit2/libgit2` (500–2k files)
`ffmpeg/ffmpeg` (10,200 files) | Mature native-build lane with several useful contrasts. `curl/curl` is the clearest autoconf-heavy reference, `libevent/libevent` is a smaller contrast, `libgit2/libgit2` adds a mixed native project shape, and `ffmpeg/ffmpeg` adds strong GPL/LGPL-conditional licensing pressure in a `configure`-driven native tree. Watch generated `configure` / `Makefile.in` noise and avoid collapsing file-level licensing differences into one top-level verdict. | +| 24a | Meson | 🟢 Verified | `qemu/qemu` (10k–50k files)
`systemd/systemd` (10k–50k files)
`LinuxCNC/linuxcnc` (2k–10k files) | Keep Meson explicit instead of assuming the Autotools or generic native-tree rows cover it. `qemu/qemu` and `systemd/systemd` are high-signal root-`meson.build` upstreams, while `LinuxCNC/linuxcnc` is a smaller contrast target. The shipped Rust surface is bounded static `meson.build` parsing for literal `project()` metadata and top-level `dependency()` calls, with explicit no-evaluation guardrails that deserve a focused verification lane. | +| 25 | Conan | 🟢 Verified | `conan-io/conan-center-index` (10k–50k files)
`catchorg/Catch2` (<500 files)
`fmtlib/fmt` (<500 files) | Conan needs both recipe-corpus and upstream-library targets. `conan-io/conan-center-index` is the authoritative recipe index, while `catchorg/Catch2` and `fmtlib/fmt` are smaller upstream consumer-library contrasts. Watch recipe-only repository structure, versioned recipe directories, and the difference between Conan recipe metadata and normal source-package behavior. | +| 26 | vcpkg | 🟢 Verified | `microsoft/vcpkg` (10k–50k files)
`microsoft/terminal` (2k–10k files)
`microsoft/onnxruntime` (10k–50k files) | Important Windows/`C++` manifest-mode lane. `microsoft/vcpkg` is the authoritative manifest and registry target, while `microsoft/terminal` and `microsoft/onnxruntime` cover large consumer repos that use `vcpkg.json` in real codebases. Watch current scope boundaries carefully: this row is about implemented manifest-mode metadata, not every vendored or toolchain surface in those trees. | +| 27 | Deno | 🟢 Verified | `denoland/fresh` (500–2k files)
`oakserver/oak` (500–2k files)
`denoland/std` (2k–10k files) | Useful modern JS/TS ecosystem with explicit config and lockfile coverage. | +| 28 | Dart / Pub | 🟢 Verified | `rrousselGit/riverpod` (500–2k files)
`firebase/flutterfire` (2k–10k files)
`flutter/packages` (2k–10k files) | Good Pub and Flutter-adjacent coverage through large multi-package repositories. | +| 29 | Git submodules | 🟢 Verified | `grpc/grpc` (10k–50k files)
`git/git` (500–2k files)
`chromium/chromium` (490,886 files) | This is a package-adjacent lane, not a parser-breadth lane. `git/git` is the clearest focused `.gitmodules` reference, `grpc/grpc` adds large real-world third-party trees, and `chromium/chromium` is the stress case. Watch absent submodule checkouts and vendored-tree context so `.gitmodules` findings stay coherent instead of being drowned by unrelated common-profile output. | +| 30 | Structured metadata (`CITATION.cff`, `publiccode.yml`) | 🟢 Verified | `astropy/astropy` (2k–10k files)
`iTowns/itowns` (500–2k files)
`univention/Nubus` (500–2k files) | Keep both structured-metadata families explicit here. `astropy/astropy` is the strongest `CITATION.cff` reference, `univention/Nubus` is the clearest `publiccode.yml` case, and `iTowns/itowns` adds mixed-project contrast. Watch that structured metadata stays visible beside richer README and package findings instead of being lost in broader common-profile output. | +| 31 | README | 🟢 Verified | `chromium/chromium` vendored `README.chromium` samples (490,886 files)
`vercel/next.js` (5k files)
`django/django` (2.5k files) | Chromium is the main proof target for the specialized README variants; the other two are broader repo-level contrast targets only. Use this row to verify that package-adjacent README parsing stays visible under the common profile instead of disappearing inside unrelated monorepo noise. | +| 32 | Linux Distro (`os-release`) | ⚪ Planned | Debian base-image rootfs snapshot
Fedora base-image rootfs snapshot
Distroless `base-debian12` rootfs snapshot | This row is rootfs-only on purpose. Debian and Fedora give conventional distro metadata layouts, while Distroless shows the minimal-image case where `os-release` may be one of the few package-identity signals present. Watch path/layout differences and do not treat intentionally sparse distroless metadata as a parser regression by itself. | +| 33 | AboutCode | ⚪ Planned | `aboutcode-org/scancode-toolkit` (10k–50k files)
`aboutcode-org/scancode.io` (500–2k files)
`aboutcode-org/dejacode` (500–2k files) | Niche but very high-fit `.ABOUT` lane. `aboutcode-org/scancode-toolkit` is the broadest real-world `.ABOUT` reference, while `aboutcode-org/scancode.io` and `aboutcode-org/dejacode` provide smaller product-style contrasts. Watch `.ABOUT` extraction staying visible beside denser package, README, and license output in these application trees. | +| 34 | Hex / Elixir | 🟢 Verified | `phoenixframework/phoenix` (500–2k files)
`elixir-ecto/ecto` (500–2k files)
`elixir-plug/plug` (<500 files) | Useful ecosystem, but current Rust scope is still the lockfile/static subset, so this ranks below the broader mainstream families. | +| 35 | OCaml / opam | 🟢 Verified | `ocaml/dune` (500–2k files)
`ocaml/ocaml-lsp` (500–2k files)
`ocaml/merlin` (500–2k files) | Good `opam` coverage, but lower practical verification priority than the broader ecosystems above. | +| 36 | Buck | 🟢 Verified | `facebook/buck2` (2k–10k files)
`facebook/watchman` (500–2k files)
`facebook/react-native` (10k–50k files) | Real Buck lane, even if narrower than Bazel in practice. `facebook/buck2` is the canonical direct reference, `facebook/watchman` is a smaller focused contrast, and `facebook/react-native` adds a large mixed-language consumer tree. Watch Buck metadata separately from the rest of the monorepo so unrelated JS/native/common-profile noise does not hide actual build-metadata gaps. | +| 37 | FreeBSD | ⚪ Planned | FreeBSD `pkg` package archive sample
FreeBSD `bash` package archive sample
FreeBSD `curl` package archive sample | Important artifact-family support, but narrower day-to-day scan prevalence than the higher-priority distro lanes. | +| 38 | Chef | ⚪ Planned | `sous-chefs/apache2` (<500 files)
`sous-chefs/mysql` (<500 files)
`chef/chef` (2k–10k files) | Worth covering, but lower priority than the mainstream language and distro families. | +| 39 | Bower | ⚪ Planned | `jquery/jquery-ui` (500–2k files)
`select2/select2` (<500 files)
`jashkenas/backbone` (<500 files) | Legacy ecosystem with ongoing value mostly for backward compatibility. | +| 40 | Haxe | ⚪ Planned | `openfl/openfl` (500–2k files)
`HaxeFlixel/flixel` (500–2k files)
`HeapsIO/heaps` (500–2k files) | Smaller ecosystem; still useful, but lower-value than the broader mainstream families above. | +| 41 | Windows Update | ⚪ Planned | `wsusscn2.cab` extracted tree
Windows cumulative update `.msu` extracted tree
Windows servicing stack update extracted tree | Artifact-oriented family with real value, but specialized and best handled after the higher-signal source/package ecosystems. | +| 42 | `misc.py` recognizers | ⚪ Planned | Apache Tomcat binary release artifacts
Firefox add-on / language-pack artifacts
NSIS official installer artifacts | Broad recognizer family, but not a normal package-manager lane; treat as specialized follow-up verification. | +| 43 | Julia | 🟢 Verified | `JuliaLang/Pkg.jl` (500–2k files)
`JuliaLang/julia` (10k–50k files)
`JuliaPlots/Plots.jl` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `JuliaLang/Pkg.jl` is the canonical `Project.toml` and `Manifest.toml` reference, `JuliaLang/julia` adds a large real-world Julia project tree, and `JuliaPlots/Plots.jl` is a mid-sized consumer library. Focus on correct `Project.toml` metadata extraction, `Manifest.toml` resolved dependency coverage, and sibling assembly of project-plus-manifest pairs. | +| 44 | Erlang / OTP | 🟢 Verified | `processone/ejabberd` (2k–10k files)
`erlang/otp` (10k–50k files)
`vernemq/vernemq` (2k–10k files) | New Provenant-only parser with no Python ScanCode reference implementation. `processone/ejabberd` is a large real-world Erlang project with `rebar.config`, `rebar.lock`, and multiple `.app.src` files across its dependency tree. `erlang/otp` is the canonical OTP distribution with many `.app.src` files. `vernemq/vernemq` adds a complex multi-dependency rebar project with mixed pkg and git dependencies in `rebar.lock`. Focus on correct `.app.src` metadata and dependency extraction, `rebar.config` dependency parsing including git and profile deps, and `rebar.lock` resolved dependency and hash coverage. |

## How to maintain this file