Skip to content

Archerkattri/mathlas

Repository files navigation

mathlas

mathlas

PyPI DOI Glama score License Python

An airtight-math tool an AI uses — no LLM, no API key, free. Plug it into Claude Code, Cursor, or any MCP client. The AI is the brain; mathlas is the hands — it gives the AI the capabilities it lacks and returns data (candidates, verdicts, checklists, scaffolds) for the AI to reason over. Apache-2.0. The code is free for any use; published corpus/index artifacts carry their own per-source terms (CC-BY/CC0).


Is this for you?

  • You use Claude Code / Cursor and want your AI to stop hallucinating mathsearch_existing_math finds the real theorem from a 3.68M-doc index; verify_numeric and verify_formal check claims with zero hallucination risk.
  • You have a numeric constant or integer sequence you can't identifyidentify_constant runs PSLQ + closed-form matching (50-digit precision); identify_sequence does an exact OEIS term-match.
  • You need the formal (Lean/mathlib) name of a resultsearch_formal_math proxies the public Loogle + LeanSearch services and returns declaration names + types, provenance-labeled.
  • You're building an agent pipeline that needs airtight math in the loop — all 12 tools are pure data-returning MCP tools, no LLM inside, composable with any framework.

Install & register with Claude Code (no API key)

One line, nothing to install first (needs uv):

claude mcp add mathlas -- uvx mathlas-mcp

uvx mathlas-mcp fetches + runs the server in an isolated env on first use. Prefer pip?

pip install mathlas-mcp              # core: numeric + retrieval + verify + scaffolds
pip install 'mathlas-mcp[mcp]'       # + official MCP SDK
pip install 'mathlas-mcp[retrieve]'  # + pyarrow, to read the real index
pip install 'mathlas-mcp[embed]'     # + sentence-transformers/torch, for the Qwen3 embedder

claude mcp add mathlas -- python -m mathlas.server

mathlas now appears as twelve tools the agent can call. The server prefers the official mcp SDK and falls back to a dependency-free stdio JSON-RPC server if mcp isn't installed — it always runs. (Cursor / any MCP client: point it at the same uvx mathlas-mcp or python -m mathlas.server stdio command.)

Optional local data (degrades honestly): identify_sequence wants a local OEIS copy; verify_formal wants a Lean toolchain. Without them the tools return a clear "data/toolchain not available" — never a fake answer. See RESULTS.md for the one-line setup of each.


A worked example — an AI using the tools

User:  "Does x = cos(x) have a unique solution I can reach by iterating?"

AI →   search_existing_math("contraction mapping unique fixed point complete metric space")
       ← [{name:"Banach Fixed-Point Theorem", statement:"Let (X,d) be a complete metric
            space and T a contraction. Then T has a unique fixed point ...", ...}, ...]
AI →   applicability_checklist(banach.statement)
       ← preconditions: ["(X,d) is a complete metric space", "T: X→X is a contraction"]
          conclusion:    "T has a unique fixed point"
AI  (reasons): [0,1] is complete; cos is a contraction there (|cos'|=|sin|≤sin 1<1).
       Every precondition holds ⇒ Banach applies ⇒ unique fixed point, reachable by iteration.
AI →   verify_numeric("0.7390851332151607", "<the Dottie-number closed form, if claimed>")

mathlas supplied the search, the checklist, and the airtight numeric check. The AI did the judging. No LLM was called inside mathlas.


Results

The discipline is airtight-or-nothing: a result is an independently-checkable fact or an honest "nothing." The false-positive rate is 0 across every tier (full tables + commands in RESULTS.md):

Tier Recovery@known False-positive Why it's airtight Benchmark
Numeric (identify_constant) 8/8 0/3 independent high-precision re-eval (50–51 digits) benchmarks/numeric_bench.py
Sequence (identify_sequence) 8/8 (7 top-1) 0/3 exact term-match vs local OEIS (~400k seqs) benchmarks/tier_bench.py
Formal (verify_formal) 7/7 verdicts real Lean 4.30 kernel typecheck benchmarks/tier_bench.py
Ramanujan (conjecture_relation) 6/6 0/2 PSLQ + CF, every hit re-verified ≥25 digits benchmarks/tier_bench.py
Applicability moat 15/15 decomp + 6/6 catch atomic preconditions, misapplication traps benchmarks/moat_bench.py
FunSearch + web-aug 14/14 sandbox containment (network / timeout / memory) benchmarks/tools_bench.py

Agent-in-the-loop, honestly reported (2026-06-10, Claude Fable 5): the same headless agent given 18 math tasks WITH the live mathlas MCP server as its only tool vs WITHOUT any tools scores 18/18 vs 15/18. The original 10-task set is saturated (10/10 both ways: a frontier model passes it from parametric knowledge alone, and we say so plainly), so an 8-task hard set was added where verification, not recall, is the bottleneck: that set goes 8/8 WITH vs 5/8 WITHOUT. The bare model times out on 50-digit integer-relation detection (PSLQ) and cannot name obscure OEIS sequences that shadow Catalan/Fibonacci prefixes and only diverge at depth. The bare passes it does earn are remarkable and we report them: it evaluated a 6-term constant relation to 45 digits by hand (residual 1.475e-27, correct), simulated IEEE-754 rounding bit-for-bit in its head (with one wrong exponent in prose), and proved a Machin-like formula exactly via Gaussian integers, all in-context at 3-9x the latency of a tool call. Every ground truth is a deterministic computation recorded in the bench; full table and provenance: RESULTS.md §2c. Run: benchmarks/agent_bench.py.

The 3.68M-doc index. search_existing_math is served from a 3,683,428-document dense index (Qwen3-Embedding-8B, 4096-d): the 1.34M permissive CC-BY/CC0 TheoremSearch subset + 2.34M slogan-embedded arXiv-math documents from Dolma, dense + Okapi-BM25 + RRF. Honest headline recall at full 3.68M scale: R@1 0.614 / R@10 0.832 querying by a document's raw body against its slogan-embedded entry — the hard cross-representation self-recall regime. (At the earlier 1.635M build, the easier same-representation slogan→slogan self-recall was R@1 0.977 / R@10 0.998 on its 81,833-doc held-out split.)

Quantized laptop tier (opt-in). The fp16 matrix is 30 GB on disk (~60 GB fp32 resident) — fine on the build box, not on a laptop. MATHLAS_QUANTIZED=binary (or quantized="binary" on HybridRetriever.from_index) serves the SAME index from memmapped quantized sidecars instead: sign-bit Hamming over 1.9 GB shortlists 1000 candidates, exact rescore picks the top-k — measured on the full 3.68M index with the same n=3000 protocol as the headline, it is recall-lossless (R@1 0.6143 vs 0.6140 fp16, R@10 equal at 0.8323; int8 mode: R@1 0.6147, 15 GB) at 2.4 s/query on 4 CPU threads. Honest caveat: this shrinks the document side only — queries must still be embedded by the same Qwen3-Embedding-8B (a small 0.6B encoder lives in a different vector space). The true end-to-end small-encoder tier is the 0.6B tier below. Numbers, build command, and the caveat in full: docs/QUANTIZED_TIER.md.

0.6B end-to-end laptop tier (opt-in, v1.3). The SAME 3,683,428-doc corpus re-embedded once with Qwen3-Embedding-0.6B (1024-d, row-aligned with the served meta), so the query encoder itself runs on a laptop CPU: MATHLAS_ENCODER=0.6b (composes with MATHLAS_QUANTIZED=binary). Measured with the identical n=3000 cross-representation protocol, queries re-encoded by the 0.6B model: R@1 0.545 / R@10 0.745 (binary + int8 rescore; the 0.6B fp16 exact scan is 0.544 / 0.745, so quantization is again lossless within the tier). The honest price vs the 8B tier (0.614 / 0.832) is about 7-9pp recall; the dual-channel 8B configuration (0.965 / 0.999) stays the big-box quality ceiling. The laptop headline: end-to-end 0.67 s/query on 4 CPU threads (0.88 s on 2), query encoding included, over all 3.68M documents. Dense-channel footprint: binary sidecar 0.47 GB + 0.6B encoder ~1.2 GB (~1.7 GB; int8 rescore source 3.77 GB recommended; full fp16 sibling index 7.54 GB). On the TheoremSearch-110 corpus-only probe the tier scores Hit@20 8.2% / 10.0% theorem/paper vs the 8B tier's 10.0% / 11.8% (both licensing-bounded floors). Full tables, footprints, and caveats: docs/QUANTIZED_TIER.md; build: scripts/build_06b_index.py; eval: scripts/eval_06b_tier.py.

Dual-channel retrieval (opt-in, v1.2). The 0.614 headline is a cross-representation gap: LaTeX-statement-shaped queries searched against slogan-embedded docs. v1.2 adds a second dense channel, the same 3,683,428 docs embedded by their cleaned LaTeX statement (Qwen3-Embedding-8B, row-aligned, built by scripts/build_statement_channel.py), folded into the dense ranking by per-doc max-sim. Measured on the same n=3000 sample at full corpus scale: R@1 0.614 to 0.965, R@10 0.832 to 0.999. Honest caveats: that eval is a self-retrieval proxy in which the statement channel indexes the very text the queries are drawn from (an exact-text advantage, like BM25's); on the no-leak 110 human-query benchmark the lift is real but partial (paper Hit@20 11.8% to 12.7%). And the second matrix roughly doubles serving RAM (measured at full scale: 150 GB process peak for the dual server vs ~95 GB single-channel; ~2.75 s/query dual dense scan on 2 CPU threads), so it ships strictly opt-in (MATHLAS_STATEMENT_INDEX=/path/index_full_statement.npz, never auto-detected) and is not combinable with the quantized tier. Full numbers and the serving-tier table: docs/RETRIEVAL_UPGRADE_NOTES.md. Also in v1.2: the production hybrid default rrf_k is now 10 (measured best at every k tested) and an opt-in cross-encoder rerank blend (MATHLAS_RERANK=1, Qwen3-Reranker-0.6B, +1.7pp R@1 honest lift).

The self-augmenting loop — beating TheoremSearch

On TheoremSearch's own 110 human-written queries, baseline mathlas hits a coverage floor — TheoremSearch withheld 85% of their private 9.2M corpus, so 95 target papers are unreachable for any open system. The AI then runs the loop: for each missing theorem it web-finds the real statement, embeds it with the same Qwen3-Embedding-8B, and add_finding(dense_vec=…) fuses it through the dense channel at runtime (re-measured 2026-06-10 on the served 3.68M index — the after-loop headline reproduced exactly; the corpus-only baseline dipped 13.6% → 11.8% paper-level from the added Dolma distractors, reported as is):

Method theorem Hit@20 paper Hit@20
Google (site:arxiv.org) 37.8%
ChatGPT 5.2 w/ Search 19.8%
Gemini 3 Pro 27.0%
TheoremSearch (Qwen3-8B, full private 9.2M) 45.0% 56.8%
mathlas — baseline (corpus-only) 10.0% 11.8%
mathlas — after self-augmenting web loop 59.1% (65/110) 70.0% (77/110)

The 10.0% floor exists because TheoremSearch withheld 85% of their corpus — the loop repairs that coverage gap. Reproduce with benchmarks/webaug_110_bench.py (use the full 82-finding worklist _findings_worklist_full.json).

Source-aware retrieval (opt-in). Growing the index 1.34M → 3.68M had a measured cost: the 2.34M web-mined Dolma docs crowd canonical papers out of the top-20 (corpus-only paper-level 13.6% → 11.8% on these same 110 queries). search_existing_math now takes optional source_filter / source_weights — e.g. source_filter={"exclude": ["dolma"]} when you want canonical theorem statements only — and excluding dolma fully recovers the pre-growth 13.6% paper-level (15/110; reachable-15 paper 15/15 = 100%) with theorem-level above the old index (11.8% vs 10.9%). The default ranking stays byte-identical (test-pinned). It is a per-query-intent knob, not a free win: on the n=3000 self-recall, 65% of whose targets ARE Dolma docs, down-weighting dolma is catastrophic for those queries (dolma-target R@10 0.999 → 0.884 at weight 0.5, → 0 when excluded) — exactly why it ships opt-in, default off. We also tested whether the v1.2 dual channel fixes this regression structurally, without the knob: it recovers part of it (paper 11.8% to 12.7%, theorem 10.0% to 10.9% at default settings) but not the full 13.6%, so the knob remains the documented mitigation on this benchmark. Full matrix: docs/02_eval_vs_theoremsearch.md.


The 12 tools

search_existing_math ─▶ mapping_scaffold + applicability_checklist ─▶ (AI judges) ─▶ verify_numeric / verify_formal
   (own index)            (needs↔guarantees, no LLM)                                  (airtight)

Core four — what most agents use:

Tool What it does
search_existing_math(query, k) query → ranked results from the 3.68M-doc dense + BM25 + RRF index
identify_constant(value) a real value → known closed form + provenance (50-digit re-eval)
verify_numeric(value, closed_form) digit-agreement verdict — different engine, higher precision
verify_formal(statement, lean?, proof?) runs the real Lean kernel — typecheck a snippet, or pass proof to kernel-check a full Lean 4 proof: VERIFIED_PROOF / REFUTED (the kernel's exact error, for the repair loop) / honest UNDETERMINED

Full toolkit:

Tool What it does
search_formal_math(query, backend) mathlib declaration names + types via the public Loogle (pattern/type) + LeanSearch (natural language) services, provenance-labeled; honest "service unavailable" — with a 7-day on-disk cache that serves the last good response when a service is down, clearly labeled (cached, <age> old)
identify_sequence(terms) integer sequence → matching OEIS entries (exact term-match)
applicability_checklist(statement) result's hypotheses as an atomic checklist for the AI to mark
mapping_scaffold(problem, statement) needs↔guarantees questions + fill-in template
conjecture_relation(value) Ramanujan Machine: PSLQ over rich basis + CF/recurrence conjectures
funsearch(action, problem_id, …) FunSearch harness in one tool — action=evaluate (sandbox-score an AI-written program), register (MAP-Elites DB), status (best + few-shot)
search_directive(problem) web-search plan: arXiv queries + sub-fields + which tools to run
add_finding(statement, slogan, source) ingest a web-found result into the live corpus

All tools return data. No tool calls an LLM. search_formal_math is the one tool that itself makes a web call (to the public Loogle/LeanSearch services); everything else is fully local.

Proof checking — the repair loop

verify_formal doesn't just typecheck statements: give it a proposition and your Lean 4 proof, and the real kernel checks the full declaration. mathlas never writes a proof (the generator/verifier split is absolute) — but when your proof is wrong, the kernel tells you exactly why, verbatim, in kernel_error. That turns proof writing into a tight loop: the agent writes a proof → mathlas's kernel says exactly what's wrong → the agent repairs and re-calls.

verify_formal(statement="∀ n : Nat, n + 0 = n", proof="by\n  intro n\n  rfl")
// → {"proof_status": "VERIFIED_PROOF", "checked": true, ...}
verify_formal(statement="2 + 2 = 5", proof="rfl")
// → {"proof_status": "REFUTED", "kernel_error": "error: Not a definitional equality:
//     the left-hand side 2 + 2 is not definitionally equal to the right-hand side 5 ...", ...}

No fake passes, by construction: sorry/admit holes are REJECTED (Lean itself exits 0 on a sorried proof — mathlas scans the source and the kernel's sorryAx diagnostics); a missing toolchain, a timeout (60 s cap), or an import this bare toolchain can't resolve all return an honest UNDETERMINED, never a verdict. The whole contract is pinned by tests/test_proof_check.py (20 tests against the real Lean 4.30 kernel: correct term and tactic-block proofs verified, wrong proofs refuted with the kernel's message, sorried proofs rejected, toolchain-absent honest).


CLI / Python

mathlas 1.6449340668482264364724151666460251892   # -> pi**2/6  [verified 51 digits]
mathlas 1,1,2,3,5,8,13,21                          # -> A000045 Fibonacci  https://oeis.org/A000045
mathlas "a bounded sequence has a convergent subsequence" --k 5   # search + scaffold
mathlas mcp                                                        # run the MCP server
import mpmath
from mathlas import identify, identify_sequence, mapping_scaffold, applicability_checklist
print(identify(mpmath.zeta(2)))            # -> pi**2/6 [verified 51 digits]
print(identify_sequence([1,1,2,3,5,8,13,21]).matches[1].a_number)  # -> 'A000045'

Docs

Positioning — retrieval is table stakes; verification is the moat

Credit where due: the closest system, TheoremSearch (UW Math AI Lab), now ships a production REST API and its own MCP endpoint (api.theoremsearch.com/mcp) over a 9.2M-document corpus — on raw recall over math literature it is the system to beat, and "we're MCP-native, they're a lab tool" is no longer a differentiator. We reuse only their openly-licensed (CC-BY/CC0) dataset subset as raw data for our own index — not their API, MCP, index, or code.

What no competitor has is everything that happens after retrieval:

  • Verification tiersverify_numeric (independent 50-digit re-evaluation) and verify_formal (a real Lean kernel typecheck — including full proof checking with the kernel's error returned verbatim for agent repair loops — or an honest UNDETERMINED). Retrieval hands you a candidate; mathlas can also check the claim and check your proof of it.
  • applicability_checklist — decomposes a candidate theorem into atomic preconditions the AI verifies one by one, catching misapplications (open vs closed interval, infinite vs finite group). No competitor has one.
  • The self-augmenting add_finding loop — the AI web-finds a missing statement, embeds it, and fuses it into the live index at runtime: 59.1% vs TheoremSearch's 45.0% theorem Hit@20 on their own 110-query benchmark (see above).
  • Zero-false-positive discipline — every tier returns an independently-checkable fact or an honest "nothing"; measured false-positive rate is 0 across all tiers (RESULTS.md).
  • Free, no API key, provenance-labeled — every result carries where it came from (known_constant, conjectured_relation, web_added, external:loogle, …), and the index is built 100% from openly-licensed data.
mathlas TheoremSearch LeanSearch / Loogle Wolfram MCP sympy-mcp
Informal math retrieval ✅ 3.68M docs, open ✅ 9.2M docs (~85% private) ❌ (mathlib decls only)
Formal (mathlib) search ✅ proxies both → one MCP tool ✅ (is exactly this)
Numeric verification ✅ airtight 50-digit re-eval ⚠️ CAS eval ⚠️ CAS (no claim-check framing)
Formal verification ✅ real Lean kernel (statements and full proofs, repair-loop errors) ❌ (search, not check)
Applicability checklist unique
Self-augmenting corpus add_finding (59.1 vs 45.0 Hit@20)
Constant/sequence ID ✅ PSLQ + OEIS + Ramanujan-Machine ⚠️ some
Provenance labels ✅ every result n/a n/a
Cost / key free, no key free endpoint free paid Wolfram API key free
MCP ✅ stdio, uvx one-liner ✅ remote endpoint ❌ (mathlas proxies them)

(sympy-mcp is a fine CAS-manipulation server — its scope barely overlaps: it rewrites expressions you give it; mathlas finds, scopes, and verifies existing math.)


Official MCP registry

mathlas is published as io.github.Archerkattri/mathlas (see docs/REGISTRY_PUBLISH.md and server.json).

mcp-name: io.github.Archerkattri/mathlas

About

Airtight math tools an AI uses over MCP — 3.7M-theorem search, PSLQ constant ID, OEIS, real Lean kernel checks, applicability checklists. No LLM inside, no API key.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages