Arcana

Arcana is a personal, local-first retrieval CLI for books and papers.

Current shape:

one live corpus per source
one live Tantivy index per source
SQLite as the canonical backend for document-like corpora
no user-visible corpus/index generations
exact lookup by ISBN / DOI / MD5
source-specific capabilities like Anna download

Arcana is intentionally a breaking, rebuild-friendly tool. It prefers a clean single-user model over backward compatibility or artifact hoarding.

Status

The current implementation is centered on the Anna's Archive source:

arcana source aa refresh --input ...
SQLite corpus storage with compressed binary document payloads
live Tantivy lexical index at a fixed source path
arcana search ...
arcana doc ...
arcana source aa download ...
arcana source aa mirrors
arcana status

Install

cargo build --release

Then run either of the following (the first form assumes the arcana binary is on your PATH):

arcana ...
target/release/arcana ...

Input data

Arcana reads Anna's Archive-derived Elasticsearch shards directly:

a directory containing aarecords__*.json.gz
or a single aarecords__*.json.gz file
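
The two accepted input shapes can be sketched as a small resolver. This is a hedged illustration in Python, not Arcana's actual (Rust) code: the function name resolve_shards and the error behavior are assumptions, and only the aarecords__*.json.gz pattern comes from this README.

```python
from fnmatch import fnmatch
from pathlib import Path

# Pattern taken from the README; everything else here is illustrative.
SHARD_PATTERN = "aarecords__*.json.gz"


def resolve_shards(input_path: Path) -> list[Path]:
    """Return the shard files named by --input (hypothetical helper)."""
    if input_path.is_file():
        # Single-file form: the file itself must look like a shard.
        if not fnmatch(input_path.name, SHARD_PATTERN):
            raise ValueError(f"not an AA shard file: {input_path.name}")
        return [input_path]
    if input_path.is_dir():
        # Directory form: collect matching files in a stable order.
        shards = sorted(p for p in input_path.glob(SHARD_PATTERN) if p.is_file())
        if not shards:
            raise ValueError(f"no {SHARD_PATTERN} files in {input_path}")
        return shards
    raise FileNotFoundError(input_path)
```

Sorting keeps the shard order deterministic between runs, which makes refreshes easier to compare.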

Quick start

Initialize config:

arcana config init

Refresh the live Anna source from AA shards:

arcana source aa refresh \
  --input ~/Datasets/aa/elasticsearch

Or deliberately build a smaller, language-limited live corpus:

arcana source aa refresh \
  --input ~/Datasets/aa/elasticsearch \
  --language en \
  --language fr

Inspect current state:

arcana status
arcana source aa show

Search:

arcana search "large language models"
arcana search isbn:9780131103627
arcana search --show-doc "large language models"
arcana doc aa:aa-llm-1

Download from Anna's Archive:

export ANNAS_ARCHIVE_SECRET_KEY=...

arcana source aa mirrors
arcana source aa download aa:aa-nof1-1
arcana source aa download isbn:9789401771993
arcana source aa download --output-dir ~/Downloads/arcana aa:aa-nof1-1

Rebuild only the live index from the current live SQLite corpus:

arcana source aa reindex

For measurement runs, you can publish only the corpus (which also removes any stale live index) and build the index as a separate step:

arcana source aa refresh \
  --input ~/Datasets/aa/elasticsearch \
  --no-index

arcana source aa reindex

Exact lookup still works after --no-index; keyword search requires reindex.
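
The split between exact lookup and keyword search can be pictured as a small router. This is a sketch under stated assumptions: only the isbn: prefix appears in this README, the doi: and md5: prefixes are guesses, and the route function and its error message are hypothetical rather than Arcana's real behavior.

```python
# Only "isbn:" is shown in the README; "doi:" and "md5:" are assumed spellings.
EXACT_PREFIXES = ("isbn:", "doi:", "md5:")


def route(query: str, index_available: bool) -> str:
    """Decide which backend would serve a query (hypothetical helper)."""
    if query.lower().startswith(EXACT_PREFIXES):
        # Exact identifiers are answered from the SQLite corpus,
        # so they do not need the Tantivy index.
        return "exact"
    if not index_available:
        raise RuntimeError(
            "keyword search needs the live index; run: arcana source aa reindex"
        )
    return "keyword"
```

With this shape, route("isbn:9780131103627", index_available=False) still resolves to the exact path, while a free-text query without an index fails early and points at reindex.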

Storage model

Arcana no longer keeps user-visible corpus/index generations around.

For each source there is one live pair:

one live SQLite corpus directory
one live Tantivy index

Refreshes build into staging paths and then publish atomically into the live paths. Old live artifacts are replaced instead of accumulating forever.
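
A minimal sketch of the stage-then-publish pattern, in Python for brevity. The directory names, the marker.txt file, and the publish/refresh helpers are all illustrative assumptions. Note that replacing an existing directory this way takes two renames, so the swap is only near-atomic; how Arcana itself closes that window (for example with the locks under state_path) is not described here.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish(staging: Path, live: Path) -> None:
    """Move a fully built staging directory into the live path."""
    retired = live.with_name(live.name + ".retired")
    if retired.exists():
        shutil.rmtree(retired)    # leftover from an interrupted publish
    if live.exists():
        os.rename(live, retired)  # step 1: move the old artifact aside
    os.rename(staging, live)      # step 2: the new artifact becomes live
    if retired.exists():
        shutil.rmtree(retired)    # old artifacts are replaced, not kept


def refresh(root: Path, content: str) -> Path:
    """Build into a staging directory next to the live path, then publish."""
    staging = Path(tempfile.mkdtemp(prefix="staging-", dir=root))
    (staging / "marker.txt").write_text(content)  # stands in for the real build
    live = root / "live"
    publish(staging, live)
    return live
```

Staging inside the same parent directory keeps both paths on one filesystem, which is what lets a rename be a cheap metadata operation rather than a copy.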

For AA, the live corpus now contains:

manifest.yaml
router.sqlite
one SQLite DB per AA input shard under shards/

Exact identifier lookup still works, but it now fans out across shard-local SQLite DBs instead of depending on one monolithic corpus database. Each shard keeps document rows payload-first, assigns a shard-local integer doc_seq, and uses that compact key in shard-local identifier tables/indexes.

Payloads are CBOR-encoded normalized documents compressed with zstd; AA refresh trains and stores a small payload.zdict when enough samples are available. Tantivy does not duplicate exact ISBN/DOI/MD5 fields.
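
The fan-out lookup can be sketched with in-memory SQLite databases. Everything below is an assumption made for illustration: the table and column names (other than doc_seq) are invented, JSON plus zlib stand in for CBOR plus zstd because they ship with Python, and router.sqlite is not modeled at all.

```python
import json
import sqlite3
import zlib


def make_shard(docs):
    """Build one in-memory shard from (identifiers, document) pairs."""
    db = sqlite3.connect(":memory:")
    # doc_seq is a shard-local integer key; the payload is an opaque blob.
    db.execute("CREATE TABLE docs (doc_seq INTEGER PRIMARY KEY, payload BLOB NOT NULL)")
    db.execute(
        "CREATE TABLE identifiers (kind TEXT, value TEXT, doc_seq INTEGER,"
        " PRIMARY KEY (kind, value, doc_seq)) WITHOUT ROWID"
    )
    for identifiers, doc in docs:
        payload = zlib.compress(json.dumps(doc).encode("utf-8"))
        cur = db.execute("INSERT INTO docs (payload) VALUES (?)", (payload,))
        seq = cur.lastrowid  # the doc_seq SQLite just assigned
        db.executemany(
            "INSERT INTO identifiers VALUES (?, ?, ?)",
            [(kind, value, seq) for kind, value in identifiers],
        )
    return db


def exact_lookup(shards, kind, value):
    """Ask every shard for an identifier and decode any matching payloads."""
    hits = []
    for shard_no, db in enumerate(shards):
        rows = db.execute(
            "SELECT d.payload FROM identifiers i JOIN docs d USING (doc_seq)"
            " WHERE i.kind = ? AND i.value = ?",
            (kind, value),
        )
        for (payload,) in rows:
            hits.append((shard_no, json.loads(zlib.decompress(payload))))
    return hits
```

Because doc_seq is only meaningful inside its own shard, a hit has to be reported together with the shard it came from, which is why the sketch returns (shard_no, document) pairs.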

Configuration

Default config path:

~/.config/arcana/config.yaml

Typical config:

state_path: "~/.local/state/arcana"
corpus_store_path: "~/.local/state/arcana/corpora"
index_store_path: "~/.local/state/arcana/indexes"
download_dir: "~/Downloads"
secret_key_env: "ANNAS_ARCHIVE_SECRET_KEY"
fast_download_api_url: "https://annas-archive.gl/dyn/api/fast_download.json"

Notes:

state_path holds lightweight source state and locks
corpus_store_path holds heavy SQLite corpus stores
index_store_path holds heavy Tantivy indexes
download_dir is the default destination for arcana source aa download
fast_download_api_url is an override. When it is left at the legacy default, Arcana resolves the current Anna mirrors from the Anna's Archive Wikipedia page and tries them in order.
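
"Tries them in order" can be read as a simple first-success loop. The sketch below assumes that reading; try_in_order, AllMirrorsFailed, and the injected fetch callable are hypothetical names, and the mirror discovery step (reading the Wikipedia page) is deliberately left out.

```python
from typing import Callable, Iterable


class AllMirrorsFailed(Exception):
    """Raised when no mirror in the list produced a response."""


def try_in_order(
    mirrors: Iterable[str], fetch: Callable[[str], bytes]
) -> tuple[str, bytes]:
    """Return (mirror, response) for the first mirror whose fetch succeeds."""
    errors = []
    for base_url in mirrors:
        try:
            return base_url, fetch(base_url)
        except OSError as exc:
            # Network-style failure: remember it and move on to the next mirror.
            errors.append(f"{base_url}: {exc}")
    raise AllMirrorsFailed("; ".join(errors) or "no mirrors configured")
```

Passing fetch in as a callable keeps the ordering logic testable without any network access.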

Useful config commands:

arcana config
arcana config path
arcana config --json
arcana config init --force

Main commands

arcana config ...
arcana doc ...
arcana search ...
arcana source aa refresh ...
arcana source aa reindex
arcana source aa show
arcana source aa mirrors
arcana source aa download ...
arcana status

Benchmarking

Use the in-process benchmark runner against the live AA index:

cargo run --release --example search_bench -- \
  --queries bench/search_queries.json

Or point it at an explicit Tantivy index directory:

cargo run --release --example search_bench -- \
  --index-path /path/to/tantivy/index \
  --queries bench/search_queries.json

See bench/README.md for details.

Storage experiments that run directly on the input shards, without building a corpus or index first:

cargo run --release --example payload_codec_bench -- --input /path/to/aarecords --max-shards 4
cargo run --release --example payload_shape_bench -- --input /path/to/aarecords --max-shards 4

payload_shape_bench includes a simulated description hot/cold split and writes only a small report.
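
To give a feel for the kind of question a payload codec experiment asks, here is a sketch that compares compressing small records with and without a preset dictionary. It does not reproduce what payload_codec_bench measures. zlib stands in for zstd because it ships with Python, the record fields are invented, and the "training" step is just a concatenation of samples.

```python
import json
import zlib
from typing import Optional

# Invented sample records; real AA payload fields are not shown in this README.
records = [
    {"title": f"Sample title {i}", "language": "en", "extension": "pdf",
     "filesize": 1000 + i, "year": 1990 + i}
    for i in range(20)
]
encoded = [json.dumps(r, sort_keys=True).encode("utf-8") for r in records]

# Crude stand-in for dictionary training: concatenate a few sample payloads.
dictionary = b"".join(encoded[:5])


def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


held_out = encoded[5:]  # payloads the dictionary was not built from
plain_total = sum(len(compress(p)) for p in held_out)
dict_total = sum(len(compress(p, dictionary)) for p in held_out)
print(f"no dictionary: {plain_total} bytes, with dictionary: {dict_total} bytes")
```

Small, similarly shaped payloads are where a shared dictionary tends to pay off, since each record alone has little internal repetition for the compressor to exploit.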
