Arcana

Arcana is a personal, local-first retrieval CLI for books and papers.

Current shape:

  • one live corpus per source
  • one live Tantivy index per source
  • SQLite as the canonical backend for document-like corpora
  • no user-visible corpus/index generations
  • exact lookup by ISBN / DOI / MD5
  • source-specific capabilities like Anna download

Arcana is intentionally a breaking, rebuild-friendly tool: it prefers a clean single-user model over backward compatibility or keeping old artifacts around.

Status

The current implementation is centered on the Anna's Archive source:

  • arcana source aa refresh --input ...
  • SQLite corpus storage with compressed binary document payloads
  • live Tantivy lexical index at a fixed source path
  • arcana search ...
  • arcana doc ...
  • arcana source aa download ...
  • arcana source aa mirrors
  • arcana status

Install

cargo build --release

Then run the binary, either:

  • arcana ... (if the binary is on your PATH)
  • target/release/arcana ... (directly from the build directory)

Input data

Arcana reads Anna's Archive-derived Elasticsearch shards directly:

  • a directory containing aarecords__*.json.gz
  • or a single aarecords__*.json.gz file
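
The two accepted input shapes can be sketched as follows. This is only an illustration in Python, not Arcana's own code: the function name discover_shards and the sorted ordering are assumptions; only the aarecords__*.json.gz pattern comes from the text above.

```python
from fnmatch import fnmatch
from pathlib import Path

SHARD_PATTERN = "aarecords__*.json.gz"  # pattern taken from the README


def discover_shards(input_path: Path) -> list[Path]:
    """Resolve an --input value to a list of shard files (hypothetical helper)."""
    if input_path.is_file():
        # Single-file form: the file itself must look like a shard.
        if not fnmatch(input_path.name, SHARD_PATTERN):
            raise ValueError(f"not an AA shard: {input_path.name}")
        return [input_path]
    if input_path.is_dir():
        # Directory form: keep only matching files, sorted for a stable order.
        return sorted(
            p for p in input_path.iterdir()
            if p.is_file() and fnmatch(p.name, SHARD_PATTERN)
        )
    raise FileNotFoundError(input_path)
```

Unrelated files in the directory are ignored rather than treated as errors in this sketch; whether Arcana does the same is not stated here.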

Quick start

Initialize config:

arcana config init

Refresh the live Anna source from AA shards:

arcana source aa refresh \
  --input ~/Datasets/aa/elasticsearch

Or build a smaller language-limited live corpus on purpose:

arcana source aa refresh \
  --input ~/Datasets/aa/elasticsearch \
  --language en \
  --language fr

Inspect current state:

arcana status
arcana source aa show

Search:

arcana search "large language models"
arcana search isbn:9780131103627
arcana search --show-doc "large language models"
arcana doc aa:aa-llm-1

Download from Anna's Archive:

export ANNAS_ARCHIVE_SECRET_KEY=...

arcana source aa mirrors
arcana source aa download aa:aa-nof1-1
arcana source aa download isbn:9789401771993
arcana source aa download --output-dir ~/Downloads/arcana aa:aa-nof1-1

Rebuild only the live index from the current live SQLite corpus:

arcana source aa reindex

For measurement runs, you can publish only the corpus (removing any stale live index) and rebuild the index separately afterwards:

arcana source aa refresh \
  --input ~/Datasets/aa/elasticsearch \
  --no-index

arcana source aa reindex

Exact lookup still works after --no-index; keyword search requires reindex.
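
The split between exact lookup and keyword search can be pictured with a small Python sketch. It is an assumption-laden illustration, not Arcana's parser: the isbn: prefix appears in the examples above, while doi: and md5: prefixes and the function names are hypothetical.

```python
# isbn: is shown in the README examples; doi: and md5: are assumed to follow suit.
EXACT_PREFIXES = ("isbn", "doi", "md5")


def classify_query(query: str) -> tuple[str, str]:
    """Return (kind, value), where kind is an identifier type or "keyword"."""
    prefix, sep, rest = query.partition(":")
    if sep and rest and prefix.lower() in EXACT_PREFIXES:
        return prefix.lower(), rest.strip()
    return "keyword", query


def can_answer(query: str, index_present: bool) -> bool:
    """Exact lookups only need the SQLite corpus; keyword search needs the index."""
    kind, _ = classify_query(query)
    return kind != "keyword" or index_present
```

Under these assumptions, can_answer("isbn:9780131103627", index_present=False) is true, while a free-text query needs the index to be rebuilt first.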

Storage model

Arcana no longer keeps user-visible corpus/index generations around.

For each source there is one live pair:

  • one live SQLite corpus directory
  • one live Tantivy index

Refreshes build into staging paths and then publish atomically into the live paths. Old live artifacts are replaced instead of accumulating forever.
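
A minimal Python sketch of the stage-then-publish idea, assuming the live corpus is a directory and that staging and live paths sit on the same filesystem. The publish function and the ".retired" suffix are hypothetical; Arcana's actual mechanism is not described here and may differ.

```python
import os
import shutil
from pathlib import Path


def publish(staging: Path, live: Path) -> None:
    """Swap a fully built staging directory into the live path (hypothetical)."""
    retired = live.with_name(live.name + ".retired")
    # Clear leftovers from an earlier interrupted publish.
    if retired.exists():
        shutil.rmtree(retired)
    # Move the old live artifact aside; a rename within one filesystem is a
    # single step, so readers never see a half-written directory.
    if live.exists():
        os.rename(live, retired)
    os.rename(staging, live)
    # Old artifacts are dropped instead of accumulating.
    shutil.rmtree(retired, ignore_errors=True)
```

Between the two renames there is a brief moment with no live directory, so this sketch is "atomic" only in the sense that a partially built corpus is never visible; a single-user tool guarded by a lock can usually tolerate that.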

For AA, the live corpus now contains:

  • manifest.yaml
  • router.sqlite
  • one SQLite DB per AA input shard under shards/

Exact identifier lookup still works, but it now fans out across shard-local SQLite DBs instead of depending on one giant monolithic corpus database. Each shard keeps document rows payload-first, assigns a shard-local integer doc_seq, and uses that compact key in shard-local identifier tables/indexes. Payloads are CBOR-encoded normalized documents compressed with zstd; AA refresh trains and stores a small payload.zdict when enough samples are available. Tantivy does not duplicate exact ISBN/DOI/MD5 fields.
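
The fan-out can be sketched in Python with in-memory SQLite databases standing in for the per-shard files. Everything below is an assumption made for illustration: the table and column layout is invented (only the doc_seq name comes from the text), and JSON with zlib stands in for the CBOR and zstd encoding, which the Python standard library does not provide.

```python
import json
import sqlite3
import zlib


def make_shard(docs):
    """Build one in-memory shard from (identifiers, document) pairs."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: payload rows keyed by a shard-local integer.
    db.execute("CREATE TABLE documents (doc_seq INTEGER PRIMARY KEY, payload BLOB NOT NULL)")
    db.execute("CREATE TABLE identifiers (kind TEXT, value TEXT, doc_seq INTEGER)")
    db.execute("CREATE INDEX identifiers_lookup ON identifiers (kind, value)")
    for identifiers, document in docs:
        # Stand-in encoding: JSON + zlib instead of CBOR + zstd.
        payload = zlib.compress(json.dumps(document).encode("utf-8"))
        doc_seq = db.execute(
            "INSERT INTO documents (payload) VALUES (?)", (payload,)
        ).lastrowid
        db.executemany(
            "INSERT INTO identifiers VALUES (?, ?, ?)",
            [(kind, value, doc_seq) for kind, value in identifiers],
        )
    return db


def lookup(shards, kind, value):
    """Ask every shard for the identifier and decode any matching payloads."""
    hits = []
    for db in shards:
        rows = db.execute(
            "SELECT d.payload FROM identifiers i "
            "JOIN documents d ON d.doc_seq = i.doc_seq "
            "WHERE i.kind = ? AND i.value = ?",
            (kind, value),
        )
        hits.extend(json.loads(zlib.decompress(payload)) for (payload,) in rows)
    return hits
```

Because doc_seq is only unique within a shard, the identifier tables stay small and integer-keyed, at the cost of querying each shard in turn.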

Configuration

Default config path:

~/.config/arcana/config.yaml

Typical config:

state_path: "~/.local/state/arcana"
corpus_store_path: "~/.local/state/arcana/corpora"
index_store_path: "~/.local/state/arcana/indexes"
download_dir: "~/Downloads"
secret_key_env: "ANNAS_ARCHIVE_SECRET_KEY"
fast_download_api_url: "https://annas-archive.gl/dyn/api/fast_download.json"

Notes:

  • state_path holds lightweight source state and locks
  • corpus_store_path holds heavy SQLite corpus stores
  • index_store_path holds heavy Tantivy indexes
  • download_dir is the default destination for arcana source aa download
  • fast_download_api_url is an override; when left at the legacy default, Arcana resolves current Anna mirrors from the Anna's Archive Wikipedia page and tries them in order

Useful config commands:

arcana config
arcana config path
arcana config --json
arcana config init --force

Main commands

arcana config ...
arcana doc ...
arcana search ...
arcana source aa refresh ...
arcana source aa reindex
arcana source aa show
arcana source aa mirrors
arcana source aa download ...
arcana status

Benchmarking

Use the in-process benchmark runner against the live AA index:

cargo run --release --example search_bench -- \
  --queries bench/search_queries.json

Or point it at an explicit Tantivy index directory:

cargo run --release --example search_bench -- \
  --index-path /path/to/tantivy/index \
  --queries bench/search_queries.json

See bench/README.md for details.

Storage experiments that run directly on the input shards, without building a corpus or index first:

cargo run --release --example payload_codec_bench -- --input /path/to/aarecords --max-shards 4
cargo run --release --example payload_shape_bench -- --input /path/to/aarecords --max-shards 4

payload_shape_bench includes a simulated description hot/cold split and writes only a small report.
