🎮 Playground · 📚 Handbook (mdbook) · 📖 API reference (rustdoc) · 📦 Releases & binaries · 🇯🇵 日本語
Pure-functional Rust parser for 青空文庫記法 (Aozora Bunko notation):
ruby (|青梅《おうめ》), bouten ([#「X」に傍点]), 縦中横, 外字
references (※[#…、第3水準1-85-54]), kunten / kaeriten,
indent / align-end containers ([#ここから2字下げ]… [#ここで字下げ終わり]),
and page / section breaks.
The parser is CommonMark-free, Markdown-free — this repository deals only with the 青空文庫 notation itself. The renderer emits semantic HTML5; the lexer reports structured diagnostics; the AST is a borrowed-arena tree that can be walked in O(n) without copying source bytes.
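To make the notation concrete, here is a toy sketch of how the explicit-base ruby form `|base《reading》` could map onto an HTML `<ruby>` element. It is not the crate's lexer or renderer: `ruby_to_html` is a hypothetical helper, it handles only this one construct, it accepts either an ASCII `|` or a fullwidth `｜` as the base marker, and it performs no HTML escaping.

```rust
/// Toy converter for explicit-base ruby only: `|base《reading》`.
/// Hypothetical helper for illustration; not part of the aozora API.
fn ruby_to_html(src: &str) -> String {
    let mut out = String::new();
    let mut rest = src;
    // Look for the next base marker (ASCII or fullwidth vertical bar).
    while let Some(bar) = rest.find(|c: char| c == '|' || c == '｜') {
        out.push_str(&rest[..bar]);
        let marker_len = rest[bar..].chars().next().map_or(0, char::len_utf8);
        let after_bar = &rest[bar + marker_len..];
        // A ruby run needs an opening 《 followed by a closing 》.
        let parsed = after_bar.split_once('《').and_then(|(base, after_open)| {
            after_open
                .split_once('》')
                .map(|(reading, tail)| (base, reading, tail))
        });
        match parsed {
            Some((base, reading, tail)) => {
                out.push_str(&format!("<ruby>{base}<rt>{reading}</rt></ruby>"));
                rest = tail;
            }
            None => {
                // No complete ruby run: keep the marker as literal text.
                out.push_str(&rest[bar..bar + marker_len]);
                rest = after_bar;
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    // prints "<ruby>青梅<rt>おうめ</rt></ruby>へ"
    println!("{}", ruby_to_html("|青梅《おうめ》へ"));
}
```

The real parser covers far more than this (implicit ruby bases, bouten, 外字 references, containers) and reports diagnostics instead of silently passing malformed input through.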
Pre-built aozora CLI binaries for Linux x86_64, macOS arm64,
and Windows x86_64 are attached to every GitHub Release —
the releases page carries
aozora-vX.Y.Z-<target>.{tar.gz,zip} archives with SHA256SUMS.

    cargo install --git https://github.com/P4suta/aozora --locked aozora-cli

(This builds the latest main; pin to a release tag for reproducible builds. See the install chapter for the tag-pinned form.)
The Cargo.toml snippet (with the current release tag) lives in the
install chapter —
keeping it in one place avoids version-pin drift across multiple READMEs.
crates.io publication tracks the 1.0 API freeze.
For WASM / C ABI / Python bindings see the Bindings chapters of the handbook.

    use aozora::Document;

    // Document takes ownership of the source text.
    let source = "|青梅《おうめ》".to_owned();
    let doc = Document::new(source);

    // The tree borrows from `doc`.
    let tree = doc.parse();
    let html: String = tree.to_html();        // HTML rendering
    let canonical: String = tree.serialize(); // serialized back to the notation
    let diagnostics = tree.diagnostics();     // structured diagnostics

    assert_eq!(canonical, "|青梅《おうめ》");

Document owns a bumpalo arena; tree borrows from it for the lifetime of the Document. Dropping the Document releases every node in a single Bump::reset step.
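The ownership relationship described above can be sketched with the standard library alone. This is a simplified model under stated assumptions, not the crate's implementation: the `Document`, `Tree`, and `Node` types below are hypothetical stand-ins, a plain `Vec` replaces the bumpalo arena, and `parse` recognises only a single leading ruby run. What it does show is the lifetime tie: every node holds slices borrowed from the document, so the tree cannot outlive it and no source bytes are copied.

```rust
/// Hypothetical stand-in for the real `Document`: owns the source text.
struct Document {
    source: String,
}

/// Hypothetical node type: every variant borrows slices of the source.
#[derive(Debug, PartialEq)]
enum Node<'src> {
    Text(&'src str),
    Ruby { base: &'src str, reading: &'src str },
}

/// Borrowed view; the `'src` lifetime ties it to its `Document`.
struct Tree<'src> {
    nodes: Vec<Node<'src>>,
}

impl Document {
    fn new(source: String) -> Self {
        Self { source }
    }

    /// Recognises at most one leading `|base《reading》`; the rest is text.
    fn parse(&self) -> Tree<'_> {
        let src = self.source.as_str();
        let mut nodes = Vec::new();
        let ruby = src
            .strip_prefix('|')
            .and_then(|s| s.split_once('《'))
            .and_then(|(base, s)| s.split_once('》').map(|(reading, tail)| (base, reading, tail)));
        match ruby {
            Some((base, reading, tail)) => {
                nodes.push(Node::Ruby { base, reading });
                if !tail.is_empty() {
                    nodes.push(Node::Text(tail));
                }
            }
            None if !src.is_empty() => nodes.push(Node::Text(src)),
            None => {}
        }
        Tree { nodes }
    }
}

/// True when `s` points into the buffer owned by `doc` (no copy was made).
fn borrows_from(doc: &Document, s: &str) -> bool {
    let start = doc.source.as_ptr() as usize;
    let end = start + doc.source.len();
    let p = s.as_ptr() as usize;
    p >= start && p + s.len() <= end
}

fn main() {
    let doc = Document::new("|青梅《おうめ》へ".to_owned());
    let tree = doc.parse();
    // drop(doc); // would not compile: `tree` still borrows from `doc`
    if let Node::Ruby { base, .. } = &tree.nodes[0] {
        println!("base borrowed from source: {}", borrows_from(&doc, base));
    }
    println!("{:?}", tree.nodes);
}
```

Uncommenting the `drop(doc)` line produces a borrow-checker error, which is the compile-time guarantee the borrowed-arena design relies on.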

    aozora check FILE.txt           # lex + report diagnostics
    aozora fmt --check FILE.txt     # round-trip parse ∘ serialize check
    aozora render FILE.txt          # render to HTML on stdout
    aozora check -E sjis FILE.txt   # Shift_JIS source from Aozora Bunko

All subcommands accept - (or no path argument) to read from stdin. See the CLI reference chapter for the full subcommand reference.
aozora is a 21-crate workspace.
crates/aozora is the public facade — library
consumers usually import only this one.
| Crate | Purpose |
|---|---|
| crates/aozora | Top-level facade. Document::parse() → AozoraTree<'_>, structured Diagnostics, SLUGS catalogue, canonicalise_slug. The single front door. |
| crates/aozora-spec | Single source of truth for shared types: Span, TriggerKind, PairKind, Diagnostic, PUA sentinel codepoints, SLUGS dispatch table. No internal dependency. |
| crates/aozora-syntax | AST types (AozoraNode borrowed-arena variants, ContainerKind, BoutenKind, Indent). |
| crates/aozora-encoding | Shift_JIS decoding + 外字 lookup (compile-time PHF, JIS X 0213 + UCS resolution). |
| crates/aozora-scan | SIMD-friendly multi-pattern scanner backends (Teddy / structural-bitmap / Hoehrmann DFA / naive fallback). |
| crates/aozora-veb | Eytzinger-layout sorted-set lookup (cache-friendly binary search). |
| crates/aozora-pipeline | 4-phase lexer (sanitize → events → pair → classify) plus the lex_into_arena orchestrator, a pure fn(&str, &Arena) -> BorrowedLexOutput<'_>. |
| crates/aozora-render | HTML and serialise renderers: html::render_to_string, serialize::serialize. |
| crates/aozora-cst | rowan-backed lossless concrete syntax tree. Editor/formatter surface. |
| crates/aozora-query | Tree-sitter-style pattern DSL (SyntaxKind + capture) for queries over the CST. |
| crates/aozora-pandoc | Pandoc AST projection (AozoraTree → pandoc_ast::Pandoc); unlocks 50+ output formats via Pandoc writers. |
| crates/aozora-cli | aozora binary: check / fmt / schema / kinds / explain / pandoc. |
| crates/aozora-wasm | wasm32-unknown-unknown target for wasm-pack build --target web. |
| crates/aozora-ffi | C ABI driver (opaque handle, JSON-encoded structured data). |
| crates/aozora-py | PyO3 bindings, distributed via maturin. |
| crates/aozora-bench | Criterion + corpus-driven probes (PGO profile source). |
| crates/aozora-conformance | WPT-style conformance fixture runner (golden HTML / serialize / diagnostics / wire across 23 fixtures). |
| crates/aozora-corpus | Corpus source abstraction for sweep tests (dev-only, set AOZORA_CORPUS_ROOT). |
| crates/aozora-proptest | Shared proptest strategies (aozora_fragment / pathological_aozora / unicode_adversarial and friends; dev-only). |
| crates/aozora-trace | DWARF symbolicator for samply traces. |
| crates/aozora-xtask | Repo automation (samply wrapper, trace analysis, corpus pack/unpack, schema dumps). |

See the Architecture chapter of the handbook for the layered design, the borrowed-arena AST, the SIMD scanner backends, and the dependency graph between these crates.
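As background for the "Eytzinger-layout sorted-set lookup" entry, the sketch below shows the general technique: a sorted slice is rearranged into the breadth-first order of an implicit binary search tree, so a lookup walks from index `k` to `2k` or `2k + 1` and the first few levels stay close together in memory. This is a minimal illustration, not the code in crates/aozora-veb; `EytzingerSet`, `from_sorted`, and `contains` are hypothetical names, and the search uses a simple early-exit loop where tuned implementations tend to use a branchless one.

```rust
/// Hypothetical sorted set stored in Eytzinger (breadth-first) order.
/// Slot 0 is unused so that the children of slot k are 2k and 2k + 1.
struct EytzingerSet {
    slots: Vec<u32>,
}

impl EytzingerSet {
    /// Builds the layout from an ascending, duplicate-free slice.
    fn from_sorted(sorted: &[u32]) -> Self {
        let mut slots = vec![0; sorted.len() + 1];
        let mut next = 0;
        fill(sorted, &mut slots, &mut next, 1);
        Self { slots }
    }

    /// Binary search over the implicit tree.
    fn contains(&self, x: u32) -> bool {
        let mut k = 1;
        while k < self.slots.len() {
            let v = self.slots[k];
            if v == x {
                return true;
            }
            // Go right (2k + 1) when the slot is smaller than x, else left (2k).
            k = 2 * k + usize::from(v < x);
        }
        false
    }
}

/// In-order walk of the implicit tree, consuming `sorted` left to right,
/// so every slot ends up greater than its left subtree and smaller than
/// its right subtree.
fn fill(sorted: &[u32], slots: &mut [u32], next: &mut usize, k: usize) {
    if k < slots.len() {
        fill(sorted, slots, next, 2 * k);
        slots[k] = sorted[*next];
        *next += 1;
        fill(sorted, slots, next, 2 * k + 1);
    }
}

fn main() {
    let set = EytzingerSet::from_sorted(&[1, 3, 5, 7, 9, 11, 13]);
    println!("{:?}", &set.slots[1..]); // prints "[7, 3, 11, 1, 5, 9, 13]"
    println!("{}", set.contains(9)); // prints "true"
    println!("{}", set.contains(4)); // prints "false"
}
```

The number of comparisons is the same as for an ordinary binary search; the gain comes from the access pattern, because the root and its nearest descendants sit in adjacent slots instead of being spread across the whole array.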
Everything runs inside Docker — the host toolchain is never invoked.
Bring up the dev image once, then drive every operation through just:

    just              # list targets
    just build        # cargo build --workspace --all-targets
    just test         # cargo nextest run --workspace
    just prop         # property-based sweep (128 cases per block)
    just lint         # fmt + clippy pedantic+nursery + typos + strict-code
    just deny         # cargo-deny licenses + advisories + bans
    just coverage     # cargo llvm-cov branch coverage
    just ci           # full CI replica
    just book-build   # render the mdbook handbook
    just book-serve   # live-preview the handbook at localhost:3000

Use just run to invoke the CLI inside the container:

    just run check FILE.txt
    just run render -E sjis FILE.txt > out.html

See CONTRIBUTING.md for the contribution flow, testing strategy, and lint policy.
- 📚 Handbook: the mdbook site, covering the notation reference, architecture (borrowed-arena AST, SIMD scanner backends, encoding), bindings (Rust / WASM / C ABI / Python), performance (samply / bench / corpus sweep), CLI / API / env reference, and the contributor guide.
- 📖 API reference (rustdoc): auto-deployed alongside the handbook.
- CONTRIBUTING.md: dev setup, TDD flow, PR rules.
- SECURITY.md: vulnerability disclosure.
- CHANGELOG.md: release history.

| Repo | What it is |
|---|---|
| P4suta/afm | CommonMark + GFM + 青空文庫記法 integrated Markdown dialect, built on top of this parser. |
| P4suta/aozora-tools | Authoring tools: formatter, LSP server, tree-sitter grammar, VS Code extension. |

Dual-licensed under Apache-2.0 OR MIT
at your option, matching Rust community convention. See
NOTICE for third-party attribution (Aozora Bunko spec
snapshots and public-domain sample works used in tests).