Merged
23 commits
8bf6753
docs(research): add research on wasm detector and latin-tag interaction
dev-pi2pie Mar 24, 2026
71c75be
chore(release): update version to 0.1.5-canary.3 in package.json and …
dev-pi2pie Mar 24, 2026
1984135
docs: link related plan for WASM mode Latin hint ordering fix
dev-pi2pie Mar 24, 2026
e34dfe9
docs: update WASM Latin ordering documentation to clarify hinting beh…
dev-pi2pie Mar 24, 2026
8871ec4
fix(detector): defer latin hints until after wasm detection
dev-pi2pie Mar 24, 2026
12251e3
docs: clarify behavior of Latin hint rules in WASM detector mode
dev-pi2pie Mar 24, 2026
b160ec1
docs(detector): draft wasm latin quality guardrails
dev-pi2pie Mar 24, 2026
34b5040
docs: remove outdated WASM Latin detector quality guardrails plan and…
dev-pi2pie Mar 24, 2026
e237884
docs: update status to completed and add resolution notes for WASM de…
dev-pi2pie Mar 24, 2026
23781d9
docs: update recommendations for handling detector quality issues and…
dev-pi2pie Mar 24, 2026
fe36f50
docs: update status to in-progress for global debug observability mod…
dev-pi2pie Mar 24, 2026
42294ee
docs(release): close release workflow consolidation plan
dev-pi2pie Mar 24, 2026
00b0c21
docs: update debug observability and WASM Latin quality plans with ne…
dev-pi2pie Mar 24, 2026
4a8f750
docs: update debug observability and WASM Latin quality plans with sc…
dev-pi2pie Mar 24, 2026
6dc9060
feat(detector): add debug event envelope and wasm latin guardrails
dev-pi2pie Mar 24, 2026
1012d53
feat(debug): finish observability plan and tighten detector debug gating
dev-pi2pie Mar 24, 2026
3d812b3
docs: update status to completed for observability model and wasm lat…
dev-pi2pie Mar 24, 2026
f8e6337
docs(research): add detector evidence debug surface documentation
dev-pi2pie Mar 24, 2026
ec4cff9
docs(research): enhance detector evidence verbosity behavior and clar…
dev-pi2pie Mar 24, 2026
a9a6c24
docs(detector): refine evidence debug research and add implementation…
dev-pi2pie Mar 24, 2026
48aef81
feat(detector): add debug evidence events for wasm fallback analysis
dev-pi2pie Mar 24, 2026
dfd560e
docs(detector): finalize evidence contract and close implementation plan
dev-pi2pie Mar 24, 2026
9902c86
fix(detector): preserve latin hints and wrapped prose in wasm mode
dev-pi2pie Mar 24, 2026
28 changes: 25 additions & 3 deletions README.md
@@ -109,8 +109,10 @@ Detector mode notes:
- `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
- `--detector regex` keeps the original script/regex chunk-first detection path.
- `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
- In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
- Very short chunks stay on the original `und-*` fallback.
- Low-confidence or unsupported detector results fall back to `und-*`.
- Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
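
The ordering described in the notes above can be sketched as follows. This is a minimal illustration, not the package's implementation: `resolveLatinChunk`, the `Detector` and `LatinHint` types, the stub functions, and the `MIN_DETECTOR_LENGTH` threshold are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical types and names for illustration; not the package's actual API.
type Tag = string;

interface DetectorResult {
  tag: Tag | null; // null when the detector has no supported guess
  confident: boolean; // stands in for the real confidence/reliability checks
}

type Detector = (text: string) => DetectorResult;
type LatinHint = (text: string) => Tag | null;

const MIN_DETECTOR_LENGTH = 24; // assumed threshold, chosen only for this sketch

function resolveLatinChunk(text: string, detect: Detector, hint: LatinHint): Tag {
  // 1. The chunk stays `und-Latn` while the detector gets the first look.
  if (text.length >= MIN_DETECTOR_LENGTH) {
    const result = detect(text);
    if (result.tag !== null && result.confident) {
      return result.tag;
    }
  }
  // 2. Only unresolved chunks reach the Latin hint rules.
  return hint(text) ?? "und-Latn";
}

// Stub detector and hint rule, used only to exercise the ordering.
const stubDetect: Detector = (text) =>
  text.includes("sentence")
    ? { tag: "en", confident: true }
    : { tag: null, confident: false };
const stubHint: LatinHint = (text) => (/[äöüß]/i.test(text) ? "de" : null);

const detected = resolveLatinChunk("This sentence is long enough for the detector.", stubDetect, stubHint);
const hinted = resolveLatinChunk("Grüße", stubDetect, stubHint);
const unresolved = resolveLatinChunk("ok", stubDetect, stubHint);
```

In this sketch a short chunk never reaches the detector, so a hint can still relabel it, while a chunk the detector accepts is never overridden by a hint.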

Collect non-words (emoji/symbols/punctuation):

@@ -285,14 +287,24 @@ word-counter --path ./examples/test-case-multi-files-support --debug --verbose

Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:

- - no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-<pid>.jsonl`
+ - no path: writes to current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
- no path with `--detector-evidence`: writes with pattern `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
- path provided: writes to the specified location
- default-name collision handling: appends `-<n>` suffix to avoid overwriting existing files
- explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
- compatibility note: the autogenerated filename moved from the older local-time pattern to the new UTC `...-utc-...jsonl` pattern
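
As a rough illustration of the naming rules above, the autogenerated filename could be built like this. The function names are hypothetical, and placing the `-<n>` collision suffix before the `.jsonl` extension is an assumption; the README only says a suffix is appended.

```typescript
// Sketch only: function names are hypothetical, not taken from the package source.
function pad(value: number, width = 2): string {
  return String(value).padStart(width, "0");
}

function defaultReportName(now: Date, pid: number, evidence = false): string {
  // All date parts are read in UTC to match the `-utc-` marker in the name.
  const date = `${now.getUTCFullYear()}${pad(now.getUTCMonth() + 1)}${pad(now.getUTCDate())}`;
  const time = `${pad(now.getUTCHours())}${pad(now.getUTCMinutes())}${pad(now.getUTCSeconds())}`;
  const prefix = evidence ? "wc-detector-evidence" : "wc-debug";
  return `${prefix}-${date}-${time}-utc-${pid}.jsonl`;
}

// Assumed placement: append -<n> before the extension until the name is free.
function avoidCollision(name: string, exists: (candidate: string) => boolean): string {
  if (!exists(name)) return name;
  const stem = name.replace(/\.jsonl$/, "");
  for (let n = 1; ; n += 1) {
    const candidate = `${stem}-${n}.jsonl`;
    if (!exists(candidate)) return candidate;
  }
}
```

Passing the clock and process ID in as arguments keeps the sketch deterministic; a real CLI would presumably read them from the runtime.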

By default with `--debug-report`, debug lines are file-only (not mirrored to terminal).
Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and `stderr`.
- Flag dependencies: `--verbose` requires `--debug`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
+ Flag dependencies: `--verbose` requires `--debug`; `--detector-evidence` requires `--debug` and `--detector wasm`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
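
The dependency rules in the updated line can be expressed as a small validation pass. This is only a sketch: the `DebugFlags` shape, the function name, and the error strings are hypothetical and may not match the real CLI parser.

```typescript
// Hypothetical option shape; the real CLI parser and error messages may differ.
interface DebugFlags {
  debug?: boolean;
  verbose?: boolean;
  detector?: string;
  detectorEvidence?: boolean;
  debugReport?: boolean;
  debugReportTee?: boolean; // also covers the --debug-tee alias
}

function checkFlagDependencies(flags: DebugFlags): string[] {
  const errors: string[] = [];
  if (flags.verbose && !flags.debug) {
    errors.push("--verbose requires --debug");
  }
  if (flags.detectorEvidence && !flags.debug) {
    errors.push("--detector-evidence requires --debug");
  }
  if (flags.detectorEvidence && flags.detector !== "wasm") {
    errors.push("--detector-evidence requires --detector wasm");
  }
  if (flags.debugReport && !flags.debug) {
    errors.push("--debug-report requires --debug");
  }
  if (flags.debugReportTee && !flags.debugReport) {
    errors.push("--debug-report-tee requires --debug-report");
  }
  return errors;
}
```

Collecting every violation instead of stopping at the first one is a design choice of this sketch, so that a user sees all missing flags in one run.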

Use `--detector-evidence` to add per-window detector evidence to the same debug stream:

- only meaningful with `--detector wasm`
- compact mode emits bounded single-line previews plus detector decision metadata
- verbose mode emits full raw detector windows and full normalized samples
- evidence remains detector-window based even when output mode changes to `collector`, `char`, or another counting mode
- fallback evidence reports the post-fallback final tag used by downstream counting output; in rare split-relabel cases it may also include `finalLocales`
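
To make the compact versus verbose distinction above concrete, here is a minimal sketch of a bounded single-line preview. The helper names, the 80-character default, and the trailing ellipsis are assumptions; the README does not state the actual bound or truncation marker.

```typescript
// Hypothetical helper: the real preview length and truncation marker are not stated in the README.
function compactPreview(windowText: string, maxChars = 80): string {
  // Collapse newlines and runs of whitespace so the preview fits on one line.
  const singleLine = windowText.replace(/\s+/g, " ").trim();
  if (singleLine.length <= maxChars) return singleLine;
  return `${singleLine.slice(0, maxChars - 1)}…`;
}

// Verbose mode keeps the raw window; compact mode emits only the bounded preview.
function evidenceText(windowText: string, verbose: boolean): string {
  return verbose ? windowText : compactPreview(windowText);
}
```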

Examples:

@@ -301,17 +313,26 @@ word-counter --path ./examples/test-case-multi-files-support --debug --debug-rep
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
word-counter --detector wasm --debug --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
word-counter --detector wasm --debug --verbose --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
word-counter --detector wasm --debug --detector-evidence --debug-report
```

Skip details stay debug-gated and can be suppressed with `--quiet-skips`.

When `--format json` is combined with `--debug`, debug-only diagnostics are emitted under `debug.*`:

- single input and merged batch may include `debug.detector`
- per-file batch may include `debug.skipped`, `debug.detector`, and per-entry `files[i].debug.detector`
- per-file top-level `skipped` is still emitted temporarily for compatibility
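
A consumer of the JSON output might gather the detector summaries listed above as in the sketch below. Only the key paths `debug.detector`, `debug.skipped`, and `files[i].debug.detector` come from the README; the interface names are hypothetical, and the summary contents are typed as `unknown` because their shape is not shown here.

```typescript
// Only the key paths come from the README; summary contents are left opaque.
interface DebugBlock {
  detector?: unknown;
  skipped?: unknown;
}

interface JsonOutput {
  debug?: DebugBlock;
  files?: Array<{ debug?: DebugBlock }>;
}

function collectDetectorSummaries(output: JsonOutput): unknown[] {
  const summaries: unknown[] = [];
  // Top-level summary: single input, merged batch, or per-file batch aggregate.
  const top = output.debug?.detector;
  if (top !== undefined) summaries.push(top);
  // Per-entry summaries only exist in per-file batch output.
  for (const file of output.files ?? []) {
    const perFile = file.debug?.detector;
    if (perFile !== undefined) summaries.push(perFile);
  }
  return summaries;
}
```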

## How It Works

- The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `und-Hani`, `ja`).
- Adjacent characters that share the same locale tag are grouped into a chunk.
- Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
- Per-locale counts are summed into an overall total and printed to stdout.
- - With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting.
+ - With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting; unresolved `und-Latn` chunks then fall back to the existing Latin hint rules and explicit Latin hint precedence.
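
The counting step described above can be sketched with the standard `Intl.Segmenter` API. The cache and function names are hypothetical, and filtering on `isWordLike` is an assumption about how segments are counted; the README only states that chunks are segmented at `granularity: "word"` and that segmenters are cached.

```typescript
// One segmenter per locale tag, created lazily and reused across chunks.
const segmenterCache = new Map<string, Intl.Segmenter>();

function getSegmenter(locale: string): Intl.Segmenter {
  let segmenter = segmenterCache.get(locale);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    segmenterCache.set(locale, segmenter);
  }
  return segmenter;
}

function countWords(text: string, locale: string): number {
  let count = 0;
  for (const segment of getSegmenter(locale).segment(text)) {
    // Assumed rule: skip whitespace and punctuation segments.
    if (segment.isWordLike) count += 1;
  }
  return count;
}
```

Calling `countWords` repeatedly with the same locale reuses the cached segmenter, which is the point of the caching bullet above.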

## Locale vs Language Code

@@ -696,6 +717,7 @@ Example JSON (trimmed):
- Detection is regex/script based, not statistical language-ID.
- Ambiguous Latin defaults to `und-Latn`; Han fallback defaults to `und-Hani`.
- `--detector wasm` is optional and conservative; it only runs for ambiguous chunks that meet minimum script-bearing length thresholds.
- In `--detector wasm` mode, ambiguous Latin stays on `und-Latn` for detector eligibility first, then built-in/custom Latin rules and explicit Latin hints are applied only if the detector leaves that chunk unresolved.
- The first (and currently only) WASM engine is `whatlang`, remapped into this package's public tags.
- The npm package ships one portable WASM artifact; users do not install per-OS detector packages.
- Use explicit tag and hint flags when you need deterministic tagging.
34 changes: 34 additions & 0 deletions docs/plans/jobs/2026-03-24-detector-evidence-fallback-tag-fix.md
@@ -0,0 +1,34 @@
---
title: "detector evidence fallback tag fix"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Goal

Correct detector evidence output so fallback windows report the same post-fallback Latin locale that the final counted result uses when Latin hints relabel `und-Latn`.

## What Changed

- Updated `src/detector/wasm.ts` so fallback debug payloads derive their reported final locale from the same deferred Latin fallback pass used by the runtime result.
- Kept the runtime fallback return value unchanged for the detector pipeline while fixing only the emitted debug metadata.
- Added a CLI regression test in `test/command.test.ts` that verifies `--detector-evidence` reports `de` instead of `und-Latn` for a hinted short Latin fallback window.

## Why

- The previous detector evidence payload could report `und-Latn` even when deferred Latin fallback relabeled the final chunk to a hinted locale such as `de`.
- That made the new debugging surface disagree with the actual counted output for supported Latin-hint configurations.
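
The fix described above amounts to having the debug payload and the counted result call the same fallback pass. A minimal sketch of that idea follows; apart from the `detector.window.evidence` event name and the `finalTag` field mentioned in the related docs, every name here is hypothetical and the real code in `src/detector/wasm.ts` may be structured differently.

```typescript
// Hypothetical names; only the idea of sharing one fallback pass comes from the job note.
type LatinHintRule = (text: string) => string | null;

// Single source of truth for the post-fallback Latin tag.
function applyDeferredLatinFallback(text: string, hint: LatinHintRule): string {
  return hint(text) ?? "und-Latn";
}

interface FallbackEvidence {
  event: "detector.window.evidence";
  finalTag: string;
}

function buildFallbackEvidence(text: string, hint: LatinHintRule): FallbackEvidence {
  // Reuse the same pass as the counted result so the two cannot disagree.
  return {
    event: "detector.window.evidence",
    finalTag: applyDeferredLatinFallback(text, hint),
  };
}
```

With a hint rule that maps a short window to `de`, both the counted tag and `finalTag` resolve to `de` instead of the debug payload reporting `und-Latn`.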

## Verification

- `bun test test/command.test.ts`
- `bun run type-check`

## Related Plans

- `docs/plans/plan-2026-03-24-detector-evidence-debug-implementation.md`

## Related Jobs

- `docs/plans/jobs/2026-03-24-detector-evidence-phases-1-4-implementation.md`
@@ -0,0 +1,43 @@
---
title: "detector evidence phase 5 docs and closure"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Goal

Finish phase 5 of `docs/plans/plan-2026-03-24-detector-evidence-debug-implementation.md` by documenting the detector evidence event contract, the evidence-specific debug-report naming branch, and the fallback-tag alignment behavior.

## What Changed

- Updated `docs/schemas/debug-event-stream-contract.md` to document:
- `detector.window.evidence`
- evidence-enabled autogenerated debug-report filenames
- compact vs verbose evidence payload rules
- optional `decision.finalLocales` and fallback-event `finalLocales` for split relabeling cases
- post-fallback `finalTag` behavior for hinted Latin fallback windows
- Updated `README.md` to document:
- `--detector-evidence` flag dependencies and examples
- evidence-specific autogenerated report filenames
- compact and verbose detector evidence behavior
- the fallback-tag alignment note and optional `finalLocales` disclosure
- Marked the detector-evidence implementation plan completed.

## Why

- The implementation was already complete in code and tests, but the published schema and user-facing docs still lagged behind the behavior.
- The fallback-tag fix introduced a small but important contract detail: debug evidence should report the same post-fallback tag that users see in final counting output, and split relabeling may surface as optional `finalLocales`.

## Verification

- Documentation-only pass; no code or tests were required for this phase.

## Related Plans

- `docs/plans/plan-2026-03-24-detector-evidence-debug-implementation.md`

## Related Jobs

- `docs/plans/jobs/2026-03-24-detector-evidence-phases-1-4-implementation.md`
- `docs/plans/jobs/2026-03-24-detector-evidence-fallback-tag-fix.md`
@@ -0,0 +1,43 @@
---
title: "detector evidence phases 1-4 implementation"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Goal

Implement phases 1 through 4 of `docs/plans/plan-2026-03-24-detector-evidence-debug-implementation.md` so `--detector-evidence` works end-to-end across single-input, async batch, and worker batch execution.

## What Changed

- Added the `--detector-evidence` CLI flag and first-version validation rules:
- requires `--debug`
- requires `--detector wasm`
- Added evidence-aware debug report naming for autogenerated report paths:
- `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
- Extended detector debug context so evidence configuration carries:
- compact vs verbose evidence behavior
- output mode metadata
- section metadata
- Added `detector.window.evidence` emission from the existing WASM detector decision flow.
- Kept evidence window-based instead of output-row-based across `chunk`, `collector`, and `char` usage.
- Preserved file-scoped debug routing for async batch and worker-forwarded detector evidence events.
- Added regression coverage for:
- CLI validation
- compact preview evidence payloads
- verbose full-text evidence payloads
- async batch and worker batch evidence routing
- evidence-specific autogenerated report filenames
- output-mode invariance of evidence granularity

## Verification

- `bun test test/command.test.ts`
- `bun test test/detector-interop.test.ts`
- `bun run type-check`
- `bun run build`

## Remaining Work

- Phase 5 of `docs/plans/plan-2026-03-24-detector-evidence-debug-implementation.md` remains open for schema and user-facing doc updates.
29 changes: 29 additions & 0 deletions docs/plans/jobs/2026-03-24-fix-detector-debug-gating-and-scope.md
@@ -0,0 +1,29 @@
---
title: "fix detector debug gating and scope"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Summary

Addressed two review findings in the detector debug pipeline:

- prevented worker and async batch detector debug contexts from being created when `--debug` is not enabled
- marked per-file batch detector events as `scope: "file"` in the shared debug event envelope

## What Changed

- updated `src/cli/batch/run.ts` to:
- gate detector debug callbacks on `debug.enabled`
- wrap batch detector events with explicit `scope: "file"`
- updated `src/cli/batch/jobs/load-count.ts` to stop creating fallback detector summaries when no debug context is requested
- updated `src/cli/batch/jobs/load-count-worker.ts` and `src/cli/batch/jobs/worker/count-worker.ts` so worker-side detector debug state is only created when debug forwarding is enabled
- updated `src/cli/debug/channel.ts` to accept an explicit event scope override
- added regression coverage in `test/command.test.ts` for:
- file-scoped detector events in async and worker batch executors
- absence of worker detector debug summaries when no debug callback is provided
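
The two review fixes above can be pictured with a small sketch: no detector debug callback exists unless debugging is enabled, and every forwarded per-file event is wrapped with `scope: "file"`. The `DebugEvent` shape, the `path` field, and the function name are hypothetical; the real channel API in `src/cli/debug/channel.ts` may differ.

```typescript
// Hypothetical shapes for illustration only.
interface DebugEvent {
  name: string;
  scope?: string;
  [key: string]: unknown;
}

type Emit = (event: DebugEvent) => void;

function fileDetectorCallback(debugEnabled: boolean, emit: Emit, path: string): Emit | undefined {
  // No callback at all when --debug is off, so no detector debug state is built downstream.
  if (!debugEnabled) return undefined;
  // Every forwarded event is marked as file-scoped, overriding any inferred scope.
  return (event) => emit({ ...event, scope: "file", path });
}
```

Returning `undefined` instead of a no-op callback lets downstream code skip building debug summaries entirely, which matches the intent of the gating fix.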

## Verification

- ran `bun test test/command.test.ts`
@@ -0,0 +1,51 @@
---
title: "Phase 1 and Phase 4 debug envelope and latin guardrails"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Goal

Implement Phase 1 of the debug observability plan and Phase 4 of the WASM Latin quality plan in one pass.

## What Changed

- Added a shared debug event envelope in `src/cli/debug/channel.ts`.
- Added `schemaVersion: 1`
- Added UTC ISO timestamps
- Added per-run `runId` using `wc-debug-<epochMs>-<pid>`
- Added inferred `topic` and `scope`
- Kept current flat event payload fields and current event names
- Changed autogenerated debug report filenames to the new UTC contract:
- `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
- Hardened WASM Latin corroborated acceptance in `src/detector/wasm.ts`.
- Corroboration now requires at least one sample with `reliable = true`
- Added the first Latin token-quality gate in `src/detector/policy.ts`.
- Uses a prose-vs-technical dominance rule
- Biases mixed technical windows back to `und-Latn`
- Expanded regression coverage in `test/command.test.ts` and `test/word-counter.test.ts`.
- Added debug envelope assertions
- Added UTC filename assertions
- Added the approved eight-fixture WASM Latin quality matrix
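
For orientation, the shared envelope described above might look roughly like the sketch below. The field names `schemaVersion`, `runId`, `topic`, and `scope` and the `wc-debug-<epochMs>-<pid>` run ID format come from this job note; the `timestamp` and `event` field names, the default scope, and the rule that infers `topic` from the event name prefix are assumptions made for the example.

```typescript
// `timestamp`, `event`, and the topic inference are assumptions; see the lead-in.
interface DebugEnvelope {
  schemaVersion: 1;
  timestamp: string;
  runId: string;
  topic: string;
  scope: string;
  event: string;
  [payloadField: string]: unknown;
}

function createRunId(epochMs: number, pid: number): string {
  return `wc-debug-${epochMs}-${pid}`;
}

function wrapEvent(
  event: string,
  payload: Record<string, unknown>,
  runId: string,
  now: Date,
  scope = "run",
): DebugEnvelope {
  return {
    ...payload, // payload fields stay flat beside the envelope fields
    schemaVersion: 1,
    timestamp: now.toISOString(), // toISOString always reports UTC
    runId,
    topic: event.split(".", 1)[0] ?? event, // assumed: first segment of the event name
    scope,
    event,
  };
}
```

Spreading the payload first means envelope fields win on a name clash, which keeps `schemaVersion` and `runId` trustworthy in this sketch.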

## Why

- Debug observability needed a stable v1 envelope before more subsystems add one-off event shapes.
- WASM Latin routing still needed conservative guardrails to avoid wrong language upgrades on technical English windows.

## Verification

- `bun test test/word-counter.test.ts`
- `bun test test/command.test.ts`
- `bun run type-check`
- `bun run build`

## Related Plans

- `docs/plans/plan-2026-03-24-debug-observability-and-wasm-latin-quality.md`

## Related Research

- `docs/researches/research-2026-03-24-global-debug-observability-model.md`
- `docs/researches/research-2026-03-24-wasm-latin-detector-quality-false-positives.md`
@@ -0,0 +1,59 @@
---
title: "Phase 2 phase 3 phase 5 debug json detector observability and docs"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Goal

Finish the remaining phases of the debug observability and WASM Latin quality plan by landing single-input debug parity, detector observability, and the schema/README closure work.

## What Changed

- Extended single-input execution in `src/cli/runtime/single.ts`.
- added `runtime.single.start` and `runtime.single.complete` debug events
- added debug-gated single-input JSON detector summaries under `debug.detector`
- Normalized debug-gated JSON output in `src/cli/runtime/batch.ts`.
- added top-level `debug.skipped`
- retained top-level `skipped` for per-file compatibility
- added aggregated `debug.detector` summaries
- added per-entry `files[i].debug.detector` summaries
- Added detector observability plumbing in `src/detector/debug.ts` and `src/detector/wasm.ts`.
- compact detector summary events
- verbose per-window detector events
- detector summary aggregation for JSON output
- Extended batch execution routes to preserve detector observability across both async and worker execution.
- updated `src/cli/batch/jobs/load-count.ts`
- updated worker protocol and worker-pool forwarding
- updated `src/cli/batch/jobs/worker/count-worker.ts`
- Added and updated documentation:
- `docs/schemas/debug-event-stream-contract.md`
- `docs/schemas/json-output-contract.md`
- `README.md`

## Why

- Single-input runs needed to use the same debug model as batch execution.
- Detector investigation needed structured observability that survives both direct and worker-backed execution paths.
- The JSON and debug-report contracts needed final schema and user-facing documentation before the plan could be considered complete.

## Verification

- `bun test test/word-counter.test.ts`
- `bun test test/command.test.ts`
- `bun run type-check`
- `bun run build`

## Related Plans

- `docs/plans/plan-2026-03-24-debug-observability-and-wasm-latin-quality.md`

## Related Jobs

- `docs/plans/jobs/2026-03-24-phase1-phase4-debug-envelope-and-latin-guardrails.md`

## Related Research

- `docs/researches/research-2026-03-24-global-debug-observability-model.md`
- `docs/researches/research-2026-03-24-wasm-latin-detector-quality-false-positives.md`
34 changes: 34 additions & 0 deletions docs/plans/jobs/2026-03-24-release-workflow-plan-closure.md
@@ -0,0 +1,34 @@
---
title: "release workflow plan closure"
created-date: 2026-03-24
status: completed
agent: Codex
---

## Goal

Confirm whether `docs/plans/plan-2026-03-24-release-workflow-consolidation.md` still had unresolved work after the `v0.1.5-canary.2` integration point and update the plan to match the repository state.

## Findings

- Tag `v0.1.5-canary.2` points to merge commit `fde1039` on 2026-03-24.
- That merge already included the consolidation and follow-up workflow changes:
- `dd0274e` added Rust caching and package-content verification.
- `37084fe` consolidated publish workflows into `release.yml` and removed the duplicated publish workflow files.
- `04e05ca` fixed CI type-check behavior for Bun tests.
- `5052a8c` reduced CI triggering to pull requests only.
- The current workflow set contains only `.github/workflows/ci.yml` and `.github/workflows/release.yml`.
- `.github/workflows/release.yml` still supports `workflow_dispatch` inputs for `tag` and `shallow_since`, verifies package contents in `prepare`, uploads `release-package-${tag}`, and has both publish jobs consume that artifact.
- `scripts/verify-package-contents.mjs` explicitly requires:
- `dist/wasm-language-detector/language_detector.js`
- `dist/wasm-language-detector/language_detector_bg.wasm`

## What Changed

- Marked `docs/plans/plan-2026-03-24-release-workflow-consolidation.md` as completed.
- Replaced the stale remaining rollout items with completed confirmation items aligned with the workflow state that shipped around `v0.1.5-canary.2`.
- Updated the plan text so its CI trigger description matches the later pull-request-only cleanup.

## Related Plans

- `docs/plans/plan-2026-03-24-release-workflow-consolidation.md`