Arcana is a personal, local-first retrieval CLI for books and papers.
Current shape:
- one live corpus per source
- one live Tantivy index per source
- SQLite as the canonical backend for document-like corpora
- no user-visible corpus/index generations
- exact lookup by ISBN / DOI / MD5
- source-specific capabilities like Anna download
This is intentionally a breaking, rebuild-friendly tool. It prefers a clean single-user model over compatibility or artifact-hoarding.
The current implementation is centered on the Anna's Archive source:
- arcana source aa refresh --input ...
- SQLite corpus storage with compressed binary document payloads
- live Tantivy lexical index at a fixed source path
- arcana search ...
- arcana doc ...
- arcana source aa download ...
- arcana source aa mirrors
- arcana status
cargo build --release
Then run either:
- arcana ...
- target/release/arcana ...
Arcana reads Anna's Archive-derived Elasticsearch shards directly:
- a directory containing aarecords__*.json.gz
- or a single aarecords__*.json.gz file
Initialize config:
arcana config init
Refresh the live Anna source from AA shards:
arcana source aa refresh \
--input ~/Datasets/aa/elasticsearch
Or build a smaller language-limited live corpus on purpose:
arcana source aa refresh \
--input ~/Datasets/aa/elasticsearch \
--language en \
--language fr
Inspect current state:
arcana status
arcana source aa show
Search:
arcana search "large language models"
arcana search isbn:9780131103627
arcana search --show-doc "large language models"
arcana doc aa:aa-llm-1
Download from Anna's Archive:
export ANNAS_ARCHIVE_SECRET_KEY=...
arcana source aa mirrors
arcana source aa download aa:aa-nof1-1
arcana source aa download isbn:9789401771993
arcana source aa download --output-dir ~/Downloads/arcana aa:aa-nof1-1
Rebuild only the live index from the current live SQLite corpus:
arcana source aa reindex
For measurement runs, you can publish only the corpus and remove any stale live index:
arcana source aa refresh \
--input ~/Datasets/aa/elasticsearch \
--no-index
arcana source aa reindex
Exact lookup still works after --no-index; keyword search requires reindex.
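The split between exact lookup and keyword search can be pictured as a small routing step. The sketch below is illustrative only, not Arcana's code: Query, classify, and can_serve are hypothetical names, and the doi: and md5: prefixes are assumed by analogy with the isbn: form used in the search examples.

```rust
/// How a search string would be served (hypothetical model, not Arcana's types).
#[derive(Debug, PartialEq)]
enum Query {
    /// Exact identifier lookup, answered from the SQLite corpus.
    Exact { kind: &'static str, value: String },
    /// Free-text search, answered from the Tantivy index.
    Keyword(String),
}

/// Classify a raw query by its `kind:` prefix.
/// Only `isbn:` appears in the README; `doi:` and `md5:` are assumptions.
fn classify(raw: &str) -> Query {
    let raw = raw.trim();
    for kind in ["isbn", "doi", "md5"] {
        if let Some(rest) = raw.strip_prefix(kind).and_then(|r| r.strip_prefix(':')) {
            if !rest.is_empty() {
                return Query::Exact { kind, value: rest.to_string() };
            }
        }
    }
    Query::Keyword(raw.to_string())
}

/// Decide whether a query can run, given whether a live index exists.
fn can_serve(query: &Query, index_present: bool) -> bool {
    match query {
        Query::Exact { .. } => true,        // corpus-only path
        Query::Keyword(_) => index_present, // needs a reindex first
    }
}
```

Under this model, can_serve(&classify("isbn:9780131103627"), false) is true, while a free-text query is refused until an index is present.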
Arcana no longer keeps user-visible corpus/index generations around.
For each source there is one live pair:
- one live SQLite corpus directory
- one live Tantivy index
Refreshes build into staging paths and then publish atomically into the live paths. Old live artifacts are replaced instead of accumulating forever.
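As a rough illustration of the staging-then-publish pattern, here is a minimal sketch using only the Rust standard library. It is not Arcana's implementation: the publish function and the ".old" scratch name are hypothetical, both paths are assumed to live on the same filesystem, and the swap uses two renames, so each step is atomic but the pair is not.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Publish a fully built `staging` directory as `live`.
/// Sketch only: assumes both paths are on the same filesystem.
fn publish(staging: &Path, live: &Path) -> io::Result<()> {
    let retired = live.with_extension("old"); // hypothetical scratch name
    if retired.exists() {
        fs::remove_dir_all(&retired)?; // leftover from an interrupted run
    }
    if live.exists() {
        fs::rename(live, &retired)?; // move the old artifact aside
    }
    fs::rename(staging, live)?; // the new artifact becomes live
    if retired.exists() {
        fs::remove_dir_all(&retired)?; // old artifacts do not accumulate
    }
    Ok(())
}
```

Building in a staging path first means a failed refresh never leaves a half-written live corpus or index behind; only the final renames touch the live path.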
For AA, the live corpus now contains:
- manifest.yaml
- router.sqlite
- one SQLite DB per AA input shard under shards/
Exact identifier lookup still works, but it now fans out across shard-local SQLite DBs instead of depending on one giant monolithic corpus database.
Each shard keeps document rows payload-first, assigns a shard-local integer doc_seq, and uses that compact key in shard-local identifier tables/indexes. Payloads are CBOR-encoded normalized documents compressed with zstd; AA refresh trains and stores a small payload.zdict when enough samples are available. Tantivy does not duplicate exact ISBN/DOI/MD5 fields.
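The shard layout and the fan-out lookup can be modeled in memory to show how the pieces relate. This is a sketch under stated assumptions: Shard, insert, and lookup are hypothetical names, a Vec stands in for the payload table, a HashMap stands in for the shard-local identifier tables, and payload encoding (CBOR plus zstd) is left out entirely.

```rust
use std::collections::HashMap;

/// One shard, modeled in memory (stand-in for a shard-local SQLite DB).
#[derive(Default)]
struct Shard {
    /// Payload rows; the position in this Vec plays the role of `doc_seq`.
    payloads: Vec<Vec<u8>>,
    /// Identifier table: (kind, value) -> shard-local doc_seq values.
    ids: HashMap<(String, String), Vec<u32>>,
}

impl Shard {
    /// Store the payload first, then key identifiers by the compact doc_seq.
    fn insert(&mut self, payload: Vec<u8>, ids: &[(&str, &str)]) -> u32 {
        let doc_seq = self.payloads.len() as u32;
        self.payloads.push(payload);
        for (kind, value) in ids {
            self.ids
                .entry((kind.to_string(), value.to_string()))
                .or_default()
                .push(doc_seq);
        }
        doc_seq
    }
}

/// Fan an exact lookup out across every shard; hits are (shard index, doc_seq).
fn lookup(shards: &[Shard], kind: &str, value: &str) -> Vec<(usize, u32)> {
    let key = (kind.to_string(), value.to_string());
    shards
        .iter()
        .enumerate()
        .flat_map(|(i, shard)| {
            shard.ids.get(&key).into_iter().flatten().map(move |seq| (i, *seq))
        })
        .collect()
}
```

Because doc_seq is only meaningful inside its shard, a hit has to carry the shard index alongside it, which is why lookup returns pairs rather than bare integers.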
Default config path:
~/.config/arcana/config.yaml
Typical config:
state_path: "~/.local/state/arcana"
corpus_store_path: "~/.local/state/arcana/corpora"
index_store_path: "~/.local/state/arcana/indexes"
download_dir: "~/Downloads"
secret_key_env: "ANNAS_ARCHIVE_SECRET_KEY"
fast_download_api_url: "https://annas-archive.gl/dyn/api/fast_download.json"
Notes:
- state_path holds lightweight source state and locks
- corpus_store_path holds heavy SQLite corpus stores
- index_store_path holds heavy Tantivy indexes
- download_dir is the default destination for arcana source aa download
- fast_download_api_url is an override; when left at the legacy default, Arcana resolves current Anna mirrors from the Anna's Archive Wikipedia page and tries them in order
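The "tries them in order" behavior for mirrors amounts to a first-success loop. The sketch below shows that pattern only: first_working is a hypothetical name, fetch is a caller-supplied stand-in for the real HTTP request, and nothing here reflects how Arcana actually resolves or contacts mirrors.

```rust
/// Try each mirror in order and return the first success.
/// `fetch` is a caller-supplied stand-in for the real request (hypothetical).
fn first_working<T, E>(
    mirrors: &[&str],
    mut fetch: impl FnMut(&str) -> Result<T, E>,
) -> Result<(String, T), Vec<(String, E)>> {
    let mut failures = Vec::new();
    for &mirror in mirrors {
        match fetch(mirror) {
            Ok(value) => return Ok((mirror.to_string(), value)),
            Err(err) => failures.push((mirror.to_string(), err)),
        }
    }
    Err(failures) // every mirror failed; report each error in order
}
```

Returning the per-mirror errors, rather than only the last one, makes it easier to tell a dead mirror from a rejected key when every attempt fails.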
Useful config commands:
arcana config
arcana config path
arcana config --json
arcana config init --force

arcana config ...
arcana doc ...
arcana search ...
arcana source aa refresh ...
arcana source aa reindex
arcana source aa show
arcana source aa mirrors
arcana source aa download ...
arcana status
Use the in-process benchmark runner against the live AA index:
cargo run --release --example search_bench -- \
--queries bench/search_queries.json
Or point it at an explicit Tantivy index directory:
cargo run --release --example search_bench -- \
--index-path /path/to/tantivy/index \
--queries bench/search_queries.json
See bench/README.md for details.
Storage experiments that should not build a corpus/index first:
cargo run --release --example payload_codec_bench -- --input /path/to/aarecords --max-shards 4
cargo run --release --example payload_shape_bench -- --input /path/to/aarecords --max-shards 4
payload_shape_bench includes a simulated description hot/cold split and writes only a small report.