aozora

🎮 Playground · 📚 Handbook (mdbook) · 📖 API reference (rustdoc) · 📦 Releases & binaries · 🇯🇵 日本語

Pure-functional Rust parser for 青空文庫記法 (Aozora Bunko notation): ruby (|青梅《おうめ》), bouten ([#「X」に傍点]), 縦中横, 外字 references (※[#…、第3水準1-85-54]), kunten / kaeriten, indent / align-end containers ([#ここから2字下げ]… [#ここで字下げ終わり]), and page / section breaks.

The parser contains no CommonMark or Markdown handling; this repository deals only with the 青空文庫 notation itself. The renderer emits semantic HTML5, the lexer reports structured diagnostics, and the AST is a borrowed-arena tree that can be walked in O(n) without copying source bytes.
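To make the ruby notation concrete, here is a toy, standard-library-only sketch that rewrites explicit ruby spans such as |青梅《おうめ》 into an HTML `<ruby>` element. It is not the aozora renderer and its output is not claimed to match what aozora emits. The function names `render_explicit_ruby` and `escape_into` are hypothetical, and accepting an ASCII `|` alongside the full-width bar is an assumption made for convenience.

```rust
/// Toy converter: rewrites explicit ruby spans (bar, base, 《reading》) as
/// HTML <ruby>. Illustration only; no other notation is handled.
fn render_explicit_ruby(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    // Assumption: accept both the full-width bar (U+FF5C) and ASCII '|'.
    while let Some(bar) = rest.find(|c: char| c == '\u{FF5C}' || c == '|') {
        let bar_len = rest[bar..].chars().next().map_or(1, char::len_utf8);
        let after_bar = &rest[bar + bar_len..];
        // A ruby span needs 《 after the base and 》 after the reading.
        let span = after_bar.find('《').and_then(|open| {
            let reading_start = open + '《'.len_utf8();
            after_bar[reading_start..]
                .find('》')
                .map(|close| (open, reading_start, reading_start + close))
        });
        match span {
            Some((open, reading_start, reading_end)) if open > 0 => {
                escape_into(&mut out, &rest[..bar]);
                out.push_str("<ruby>");
                escape_into(&mut out, &after_bar[..open]);
                out.push_str("<rp>(</rp><rt>");
                escape_into(&mut out, &after_bar[reading_start..reading_end]);
                out.push_str("</rt><rp>)</rp></ruby>");
                rest = &after_bar[reading_end + '》'.len_utf8()..];
            }
            _ => {
                // Not a ruby span: keep the bar as literal text and move on.
                escape_into(&mut out, &rest[..bar + bar_len]);
                rest = after_bar;
            }
        }
    }
    escape_into(&mut out, rest);
    out
}

/// Minimal HTML text escaping for the three characters that matter here.
fn escape_into(out: &mut String, text: &str) {
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
}
```

The sketch deliberately ignores implicit-base ruby, nesting, and every `[#…]` annotation. Covering those cases is what the real lexer phases are for.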

Installation

Pre-built CLI

Pre-built aozora CLI binaries for Linux x86_64, macOS arm64, and Windows x86_64 are attached to every GitHub Release — the releases page carries aozora-vX.Y.Z-<target>.{tar.gz,zip} archives with SHA256SUMS.

Build from source

cargo install --git https://github.com/P4suta/aozora --locked aozora-cli

This builds the latest main. For reproducible builds, pin to a release tag; see the install chapter for the tag-pinned form.

As a Rust library

The Cargo.toml snippet (with the current release tag) lives in the install chapter — keeping it in one place avoids version-pin drift across multiple READMEs. crates.io publication tracks the 1.0 API freeze.

For WASM / C ABI / Python bindings see the Bindings chapters of the handbook.

Quickstart

use aozora::Document;

let source = "|青梅《おうめ》".to_owned();
let doc = Document::new(source);
let tree = doc.parse();

let html: String = tree.to_html();
let canonical: String = tree.serialize();
let diagnostics = tree.diagnostics();

assert_eq!(canonical, "|青梅《おうめ》");

Document owns a bumpalo arena; tree borrows from it for the lifetime of the Document. Dropping the Document releases every node in a single Bump::reset step.
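The ownership relationship described above can be sketched with the standard library alone. The types below (`Doc`, `Tree`, `points_into`) are hypothetical stand-ins, not the aozora API, and they use plain string slices instead of a bumpalo arena. The point is only that a tree whose nodes borrow from their owner cannot outlive it and never copies the source bytes.

```rust
/// Hypothetical stand-in for an owner type: it owns the source text.
struct Doc {
    source: String,
}

/// Hypothetical borrowed view: every node is a slice of `Doc::source`.
struct Tree<'doc> {
    lines: Vec<&'doc str>,
}

impl Doc {
    fn new(source: String) -> Self {
        Doc { source }
    }

    /// The returned tree borrows from `self`, so the borrow checker
    /// rejects any code that drops the `Doc` while the tree is in use.
    fn parse(&self) -> Tree<'_> {
        Tree {
            lines: self.source.lines().collect(),
        }
    }
}

impl<'doc> Tree<'doc> {
    fn node_count(&self) -> usize {
        self.lines.len()
    }

    fn first(&self) -> Option<&'doc str> {
        self.lines.first().copied()
    }
}

/// True when `slice` lies inside the buffer of `owner` (no copy was made).
fn points_into(slice: &str, owner: &str) -> bool {
    let start = owner.as_ptr() as usize;
    let end = start + owner.len();
    let p = slice.as_ptr() as usize;
    p >= start && p + slice.len() <= end
}
```

With these definitions, `Doc::new(text).parse()` yields a `Tree` whose slices all satisfy `points_into(slice, &doc.source)`, and moving `doc` while the tree is alive is a compile error.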

CLI

aozora check FILE.txt           # lex + report diagnostics
aozora fmt --check FILE.txt     # round-trip parse ∘ serialize check
aozora render FILE.txt          # render to HTML on stdout
aozora check -E sjis FILE.txt   # Shift_JIS source from Aozora Bunko

All subcommands accept - (or no path argument) to read from stdin. See the CLI reference chapter for the full subcommand reference.
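The stdin convention is a common CLI pattern. A minimal sketch of how a tool might implement it follows; `Input`, `resolve_input`, and `read_all` are hypothetical names, not taken from the aozora-cli source. Input is read as raw bytes on the assumption that decoding (for example Shift_JIS via `-E sjis`) happens in a later step.

```rust
use std::fs;
use std::io::{self, Read};

/// Where a subcommand should read its input from.
#[derive(Debug, PartialEq, Eq)]
enum Input<'a> {
    Stdin,
    File(&'a str),
}

/// A missing path argument and a literal "-" both select stdin.
fn resolve_input(path_arg: Option<&str>) -> Input<'_> {
    match path_arg {
        None | Some("-") => Input::Stdin,
        Some(path) => Input::File(path),
    }
}

/// Reads the whole input as bytes; the caller decides how to decode them.
fn read_all(input: &Input<'_>) -> io::Result<Vec<u8>> {
    match input {
        Input::Stdin => {
            let mut buf = Vec::new();
            io::stdin().lock().read_to_end(&mut buf)?;
            Ok(buf)
        }
        Input::File(path) => fs::read(path),
    }
}
```

Returning bytes rather than a `String` keeps the reader usable for sources that are not valid UTF-8.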

Crate layout

aozora is a 21-crate workspace. crates/aozora is the public facade — library consumers usually import only this one.

| Crate | Purpose |
| --- | --- |
| crates/aozora | Top-level facade. Document::parse() → AozoraTree<'_>, structured Diagnostics, SLUGS catalogue, canonicalise_slug. The single front door. |
| crates/aozora-spec | Single source of truth for shared types: Span, TriggerKind, PairKind, Diagnostic, PUA sentinel codepoints, SLUGS dispatch table. No internal dependency. |
| crates/aozora-syntax | AST types (AozoraNode borrowed-arena variants, ContainerKind, BoutenKind, Indent). |
| crates/aozora-encoding | Shift_JIS decoding + 外字 lookup (compile-time PHF, JIS X 0213 + UCS resolution). |
| crates/aozora-scan | SIMD-friendly multi-pattern scanner backends (Teddy / structural-bitmap / Hoehrmann DFA / naive fallback). |
| crates/aozora-veb | Eytzinger-layout sorted-set lookup (cache-friendly binary search). |
| crates/aozora-pipeline | 4-phase lexer (sanitize → events → pair → classify) plus the lex_into_arena orchestrator: pure fn(&str, &Arena) -> BorrowedLexOutput<'_>. |
| crates/aozora-render | HTML and serialise renderers: html::render_to_string, serialize::serialize. |
| crates/aozora-cst | rowan-backed lossless concrete syntax tree. Editor/formatter surface. |
| crates/aozora-query | Tree-sitter-style pattern DSL (SyntaxKind + capture) for queries over the CST. |
| crates/aozora-pandoc | Pandoc AST projection (AozoraTree → pandoc_ast::Pandoc); unlocks 50+ output formats via Pandoc writers. |
| crates/aozora-cli | aozora binary: check / fmt / schema / kinds / explain / pandoc. |
| crates/aozora-wasm | wasm32-unknown-unknown target for wasm-pack build --target web. |
| crates/aozora-ffi | C ABI driver (opaque handle, JSON-encoded structured data). |
| crates/aozora-py | PyO3 bindings, distributed via maturin. |
| crates/aozora-bench | Criterion + corpus-driven probes (PGO profile source). |
| crates/aozora-conformance | WPT-style conformance fixture runner (golden HTML / serialize / diagnostics / wire across 23 fixtures). |
| crates/aozora-corpus | Corpus source abstraction for sweep tests (dev-only, set AOZORA_CORPUS_ROOT). |
| crates/aozora-proptest | Shared proptest strategies (aozora_fragment / pathological_aozora / unicode_adversarial and friends; dev-only). |
| crates/aozora-trace | DWARF symbolicator for samply traces. |
| crates/aozora-xtask | Repo automation (samply wrapper, trace analysis, corpus pack/unpack, schema dumps). |

See the Architecture chapter of the handbook for the layered design, the borrowed-arena AST, the SIMD scanner backends, and the dependency graph between these crates.
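For readers unfamiliar with the Eytzinger layout named in the aozora-veb row, here is a generic sketch of the technique. The README states only that the crate uses this layout; the element type (`u32`), the type name `Eytzinger`, and the methods `from_sorted` and `contains` are assumptions made for illustration and say nothing about the crate's actual API.

```rust
/// Sorted set stored in Eytzinger (breadth-first) order. Slot 0 is unused
/// so that the children of node `k` sit at `2k` and `2k + 1`.
struct Eytzinger {
    slots: Vec<u32>,
}

impl Eytzinger {
    /// `sorted` must be strictly ascending.
    fn from_sorted(sorted: &[u32]) -> Self {
        let mut slots = vec![0; sorted.len() + 1];
        let mut next = 0;
        fill(sorted, &mut slots, &mut next, 1);
        Eytzinger { slots }
    }

    /// Binary search that walks the implicit tree from the root (slot 1).
    fn contains(&self, key: u32) -> bool {
        let n = self.slots.len() - 1;
        let mut k = 1;
        while k <= n {
            let v = self.slots[k];
            if v == key {
                return true;
            }
            // Go right when the stored value is smaller, left otherwise.
            k = 2 * k + usize::from(v < key);
        }
        false
    }
}

/// In-order walk of the implicit tree, consuming `sorted` left to right.
fn fill(sorted: &[u32], slots: &mut [u32], next: &mut usize, k: usize) {
    if k < slots.len() {
        fill(sorted, slots, next, 2 * k);
        slots[k] = sorted[*next];
        *next += 1;
        fill(sorted, slots, next, 2 * k + 1);
    }
}
```

The first few levels of the tree occupy adjacent slots, so the early steps of every lookup touch the same few cache lines. That locality is the usual motivation for this layout over a plain sorted array.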

Development

Everything runs inside Docker — the host toolchain is never invoked. Bring up the dev image once, then drive every operation through just:

just                # list targets
just build          # cargo build --workspace --all-targets
just test           # cargo nextest run --workspace
just prop           # property-based sweep (128 cases per block)
just lint           # fmt + clippy pedantic+nursery + typos + strict-code
just deny           # cargo-deny licenses + advisories + bans
just coverage       # cargo llvm-cov branch coverage
just ci             # full CI replica
just book-build     # render the mdbook handbook
just book-serve     # live-preview the handbook at localhost:3000

Use just run to invoke the CLI inside the container:

just run check FILE.txt
just run render -E sjis FILE.txt > out.html

See CONTRIBUTING.md for the contribution flow, testing strategy, and lint policy.

Documentation

  • 📚 Handbook — the mdbook site: notation reference, architecture (borrowed-arena AST, SIMD scanner backends, encoding), bindings (Rust / WASM / C ABI / Python), performance (samply / bench / corpus sweep), CLI / API / env reference, and the contributor guide.
  • 📖 API reference (rustdoc) — auto-deployed alongside the handbook.
  • CONTRIBUTING.md — dev setup, TDD flow, PR rules.
  • SECURITY.md — vulnerability disclosure.
  • CHANGELOG.md — release history.

Related projects

| Repo | What it is |
| --- | --- |
| P4suta/afm | CommonMark + GFM + 青空文庫記法 integrated Markdown dialect, built on top of this parser. |
| P4suta/aozora-tools | Authoring tools: formatter, LSP server, tree-sitter grammar, VS Code extension. |

License

Dual-licensed under Apache-2.0 OR MIT at your option, matching Rust community convention. See NOTICE for third-party attribution (Aozora Bunko spec snapshots and public-domain sample works used in tests).
