1 change: 1 addition & 0 deletions README.md
@@ -166,6 +166,7 @@ provenant --json-pp scan-results.json --license --package ~/projects/my-codebase
Use `-` as `FILE` to write an output stream to stdout, for example `--json-pp -`.
Multiple output flags can be used in a single run, matching ScanCode CLI behavior.
When using `--from-json`, you can pass multiple JSON inputs. Native directory scans also support multiple input paths, matching ScanCode's common-prefix behavior.
When you need to scan an explicit allowlist of files or directories under one root (for example PR-changed files from CI), use `--paths-file <FILE>` with one explicit scan root instead of expanding the list into positional args.
Use `--incremental` for repeated scans of the same tree. After a completed scan, Provenant keeps
an incremental manifest and uses it on the next run to skip unchanged files. That is useful for
local iteration, CI-style reruns, and retrying after a later failed or interrupted scan. The
50 changes: 50 additions & 0 deletions docs/CLI_GUIDE.md
@@ -481,8 +481,55 @@ This is useful for:
- scanning split source trees in one run
- collecting one combined report for several directories

These native multi-input paths still follow the current common-prefix behavior. They work best when you can invoke Provenant from a cwd where the relative input paths share a usable common ancestor.

You can also pass multiple JSON inputs with `--from-json`.

### 20. "I want to scan only files matching certain patterns"

```sh
provenant --json-pp scan.json --license /path/to/repo --include "*.rs" --include "src/**/*.toml"
```

Use `--include` when you want glob-style path filtering inside one scan root.

Current behavior:

- `--include` matches file/path patterns; repeated flags are additive
- use `**` when you want recursion across directory boundaries
- plain directory-looking tokens such as `src/foo` are treated as literal path patterns, not as an implicit “scan this whole subtree” shortcut
- if you already know the exact files or directories you want, prefer `--paths-file` instead of encoding that selection indirectly through globs

### 21. "I have an explicit list of files or directories to scan"

```sh
provenant --json-pp scan.json --license /path/to/repo --paths-file changed-files.txt
```

Use this when you already have a selected path list under one known root, especially for CI and pull-request workflows where cwd cannot be the repo root.

`--paths-file` is the preferred workflow when:

- `git diff --name-only` or another tool already produced the changed-file list
- Provenant must run from a fixed mount location or other non-repo cwd
- you want Provenant itself, not shell `xargs`, to own the selection semantics

Current behavior:

- pass exactly one native scan root as the positional input
- entries in the paths file are interpreted relative to that root
- one path per line, with blank lines ignored and CRLF tolerated
- directory entries select that subtree
- missing entries are skipped with a warning
- `--paths-file -` reads the list from stdin
- `--paths-file` cannot currently be combined with `--from-json`

Example with stdin:

```sh
git diff --name-only --diff-filter=d origin/main...HEAD | provenant --json-pp - --license /path/to/repo --paths-file -
```
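
The rules listed under "Current behavior" can be previewed with a small shell sketch. `preview_paths_file` is a hypothetical helper, not part of Provenant, and it only mirrors the documented rules (root-relative entries, blank lines ignored, CRLF tolerated, directories treated as subtrees, missing entries skipped with a warning). Provenant's actual implementation may differ in details this guide does not cover.

```sh
# Hypothetical helper, not part of Provenant: previews how a paths file
# would be read under the documented rules.
preview_paths_file() {
    root=$1
    list=$2
    cr=$(printf '\r')
    # `|| [ -n "$entry" ]` keeps a final line that lacks a trailing newline.
    while IFS= read -r entry || [ -n "$entry" ]; do
        entry=${entry%"$cr"}            # tolerate CRLF line endings
        [ -z "$entry" ] && continue     # ignore blank lines
        if [ -d "$root/$entry" ]; then
            # directory entries select the whole subtree
            printf 'subtree %s\n' "$entry"
        elif [ -e "$root/$entry" ]; then
            printf 'file %s\n' "$entry"
        else
            # missing entries are skipped with a warning on stderr
            printf 'warning: skipping missing entry: %s\n' "$entry" >&2
        fi
    done < "$list"
}
```

Running `preview_paths_file /path/to/repo changed-files.txt` before a scan is one way to check that a generated list is relative to the intended root.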

## Important Flag Combinations

These are worth learning early because they change what the output means:
@@ -497,6 +544,7 @@ These are worth learning early because they change what the output means:
- `--tallies-key-files` requires `--tallies` and `--classify`
- `--tallies-by-facet` requires `--facet` and `--tallies`
- `--debian <FILE>` requires `--license`, `--copyright`, and `--license-text`
- `--paths-file <FILE>` requires exactly one native scan root and is currently native-scan only (no `--from-json`)
- `--reindex` only matters when the license engine is initialized (`--license` and some `--from-json` reference-recompute flows)
- `--no-license-index-cache` only matters when the license engine is initialized

Expand All @@ -512,6 +560,8 @@ If you are not sure where to start, use this rule of thumb:
- Want browser-friendly review? → `--html`
- Want policy-aware license review? → add `--license-references`, `--filter-clues`, and optionally `--license-policy`
- Want summary/tally/facet review? → add `--classify`, `--summary`, and optionally `--tallies*` / `--facet`
- Want glob-style file filtering inside one scan root? → add one or more `--include` patterns
- Want an explicit rooted list of files/directories? → use `--paths-file`
- Already have JSON and only want to filter or reshape it? → `--from-json`

## Where to Go Next
21 changes: 21 additions & 0 deletions docs/MIGRATING_FROM_SCANCODE.md
@@ -113,6 +113,26 @@ These are not random incompatibilities; they are documented behavior improvement

See [Beyond-Parity Improvements](improvements/README.md) for the full index.

### 6. Path selection is split more explicitly between patterns and exact rooted paths

If you previously relied on `--include` as a rough way to express “scan this subtree”, pay close attention to Provenant's newer split here.

- `--include` is for glob-style path filtering
- recursion should be explicit in the pattern (for example `src/**`)
- `--paths-file` is the explicit rooted workflow for “scan exactly these files or directories under this root”

That means Provenant now prefers:

- `--include '*.rs' --include 'src/**/*.toml'` when you mean pattern filtering
- `--paths-file changed-files.txt /path/to/repo` when you already know the exact rooted file or directory list

This is a workflow-level difference worth knowing when you migrate existing ScanCode habits or shell wrappers.
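
When adapting a shell wrapper that used to pass absolute paths, the entries need to become root-relative first, because `--paths-file` entries are interpreted relative to the scan root. A minimal sketch, assuming a hypothetical helper named `to_root_relative` that is not part of Provenant or ScanCode:

```sh
# Hypothetical helper, not part of Provenant: turns absolute paths under
# a scan root into root-relative entries, one per line.
to_root_relative() {
    root=${1%/}     # drop a trailing slash from the root, if any
    shift
    for path in "$@"; do
        case $path in
            "$root"/*) printf '%s\n' "${path#"$root"/}" ;;
            *) printf 'warning: %s is outside %s\n' "$path" "$root" >&2 ;;
        esac
    done
}
```

For example, `to_root_relative /path/to/repo /path/to/repo/src/foo /path/to/repo/Cargo.toml > changed-files.txt` would write `src/foo` and `Cargo.toml`, and the resulting file could then be passed as `--paths-file changed-files.txt /path/to/repo`.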

See also:

- [CLI Guide](CLI_GUIDE.md)
- [CLI Workflows](improvements/cli-workflows.md)

## Practical migration advice

If you are moving an existing ScanCode workflow to Provenant:
@@ -121,6 +141,7 @@ If you are moving an existing ScanCode workflow to Provenant:
2. compare outputs on one representative codebase
3. check this guide if you see a meaningful delta
4. use the exported dataset workflow if you previously customized license/rule data in a ScanCode checkout
5. if your old workflow used `--include` to approximate explicit path lists, consider switching that part to `--paths-file`

## Other differences worth knowing

31 changes: 16 additions & 15 deletions docs/implementation-plans/infrastructure/CLI_PLAN.md
@@ -44,21 +44,22 @@ Treat this file as a maintained compatibility ledger rather than the primary use

### Invocation & Input Handling

| Flag | What it does | Status | Notes |
| ---------------------- | ------------------------------------------------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `<input>...` | Supplies the path or paths to scan | `Done` | Native scans now support the upstream-style relative multi-input common-prefix flow, and `--from-json` still supports multiple scan files. |
| `-h, --help` | Prints CLI help | `Done` | Provided by `clap`. |
| `-V, --version` | Prints CLI version | `Done` | Provided by `clap`. |
| `-q, --quiet` | Reduces runtime output | `Done` | Matches the current quiet-mode surface. |
| `-v, --verbose` | Increases runtime path reporting | `Done` | Matches the current verbose-path surface: per-file paths on TTY, bounded progress plus per-file warning/error context on non-TTY stderr. |
| `-m, --max-depth` | Limits recursive scan depth | `Done` | `0` means no depth limit. |
| `-n, --processes` | Controls worker count | `Done` | Positive values set the worker count; `0` disables parallel file scanning; `-1` also disables timeout-backed interruption checks. |
| `--timeout` | Sets per-file processing timeout | `Done` | Wired through the scanner runtime. |
| `--exclude / --ignore` | Excludes files by glob pattern | `Done` | `--ignore` is the ScanCode-facing alias. |
| `--include` | Re-includes matching paths after filtering | `Done` | Native scans now apply ScanCode-style combined include/ignore path filtering before file scanning; `--from-json` applies the same path selection as a shaping step over the loaded result tree. |
| `--strip-root` | Rewrites paths relative to the scan root | `Done` | Root-resource, single-file, native multi-input, nested reference, and top-level package/dependency path projection are now handled in the final shaping pass. |
| `--full-root` | Preserves absolute/rooted output paths | `Done` | Full-root display paths now follow the ScanCode-style formatting pass, including path cleanup and field-specific projection rules. |
| `--from-json` | Loads prior scan JSON instead of rescanning input files | `Done` | Supports multiple input scans, shaping-time include/ignore filtering, root-flag reshaping per loaded scan before merge, and recomputation of followed top-level license outputs after load. |
| Flag | What it does | Status | Notes |
| ---------------------- | ------------------------------------------------------- | --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `<input>...` | Supplies the path or paths to scan | `Done` | Native scans now support the upstream-style relative multi-input common-prefix flow, and `--from-json` still supports multiple scan files. |
| `--paths-file <FILE>` | Loads selected native scan paths from a file | `Rust-specific` | Explicit-path convenience for one rooted native scan. v1 uses one explicit scan root and root-relative entries instead of extending the common-prefix argv flow. |
| `-h, --help` | Prints CLI help | `Done` | Provided by `clap`. |
| `-V, --version` | Prints CLI version | `Done` | Provided by `clap`. |
| `-q, --quiet` | Reduces runtime output | `Done` | Matches the current quiet-mode surface. |
| `-v, --verbose` | Increases runtime path reporting | `Done` | Matches the current verbose-path surface: per-file paths on TTY, bounded progress plus per-file warning/error context on non-TTY stderr. |
| `-m, --max-depth` | Limits recursive scan depth | `Done` | `0` means no depth limit. |
| `-n, --processes` | Controls worker count | `Done` | Positive values set the worker count; `0` disables parallel file scanning; `-1` also disables timeout-backed interruption checks. |
| `--timeout` | Sets per-file processing timeout | `Done` | Wired through the scanner runtime. |
| `--exclude / --ignore` | Excludes files by glob pattern | `Done` | `--ignore` is the ScanCode-facing alias. |
| `--include` | Re-includes matching paths after filtering | `Done` | Native scans now apply ScanCode-style combined include/ignore path filtering before file scanning; `--from-json` applies the same path selection as a shaping step over the loaded result tree. |
| `--strip-root` | Rewrites paths relative to the scan root | `Done` | Root-resource, single-file, native multi-input, nested reference, and top-level package/dependency path projection are now handled in the final shaping pass. |
| `--full-root` | Preserves absolute/rooted output paths | `Done` | Full-root display paths now follow the ScanCode-style formatting pass, including path cleanup and field-specific projection rules. |
| `--from-json` | Loads prior scan JSON instead of rescanning input files | `Done` | Supports multiple input scans, shaping-time include/ignore filtering, root-flag reshaping per loaded scan before merge, and recomputation of followed top-level license outputs after load. |

### Output Formats & Result Shaping
