Stop feeding broken Markdown to your AI.
v1.4.1 — Test-hardening patch: 378 tests (+107), community coverage for parser helpers, CLI agents, FORGE visitors, and wave-2 good-first issues (#43–#52) — no production API changes (see CHANGELOG).
Turning a forest of local plain-text files into a unified semantic powerhouse.
👉 TRY THE LIVE INTERACTIVE DEMO
📘 ARCHITECTURE · AST Primer · Cookbook · Good first issues · Docs index · CodeQL · Changelog · Release process
The PKM (Personal Knowledge Management) world is currently forcing users to make a painful choice between Data Longevity and AI Power.
- Vanilla Logseq / Obsidian is a "Forest" of decentralized Markdown files. It guarantees the Lindy effect (plain-text lasts forever) and perfect Git versioning, but standard AI chunkers treat it like a blender, destroying the outliner hierarchy.
- Tana is a centralized "Tree". It offers incredible semantic power, but traps your brain in a proprietary cloud database.
- The new Logseq DB (SQLite) aims for database speed, but at a huge cost: it locks your notes inside a binary .db file. You lose human-readable files, you lose line-by-line Git diffs, and you lose the immortality of plain-text.
Logseq Matryca Parser is the ultimate bridge. It allows you to keep your sovereign, future-proof Markdown files, while synthesizing a Virtual Global Graph in RAM at runtime.
It acts as the strict File System Driver for your LLM OS. By using a deterministic Stack-Machine to parse your outliner topology, it feeds LangChain or LlamaIndex with the exact parent-child context of every single block.
You get the reasoning power of a centralized relational database, without sacrificing the plain-text soul of your Second Brain in Logseq.
| Feature | Vanilla Markdown | Matryca Parser | Logseq DB (SQLite) | Tana |
|---|---|---|---|---|
| Data Format | Plain-text (.md) | Plain-text (.md) | Binary (.db) | Proprietary Cloud |
| Version Control | Perfect (Git) | Perfect (Git) | Poor (Binary blob) | None |
| Data Structure | Decentralized Forest | Virtually Centralized Graph | Relational Database | Centralized Tree |
| AI Readiness | Low (Linear Chunks) | High (Topological AST) | TBD (Requires SQL) | High (Proprietary) |
| Sovereignty | 100% Local | 100% Local (Sovereign AI) | 100% Local | Cloud-Only |
| Capability | Typical LangChain / LlamaIndex Markdown loaders | Matryca (LOGOS + SYNAPSE + graph) |
|---|---|---|
| Parent–child context | Character or heading splits; children often orphaned from parents | True outliner AST: every block carries parent_id, path, left_id and visits in deterministic tree order |
| Block references ((uuid)) | Treated as opaque text or dropped | Resolved against LogseqGraph; optional embed expansion and Obsidian [[Page#^anchor]] export |
| Property inheritance | Page-level frontmatter at best | get_effective_properties: page + ancestor outline keys merged top-down (Org-mode style), then exposed on enriched chunks |
| Live sync | Re-read whole tree or poll | LogseqGraph.start_watching() (optional watchdog): per-file invalidation — re-parse one page, purge stale UUIDs from registries, refresh backlinks |
| Page aliases & titles | Filename-only or manual link maps | title::, alias:: / aliases:: re-key graph.pages and wire backlinks for alias wikilinks |
| Case-insensitive pages & tags | Exact string match on filenames | get_page, resolve_relative_page_link, search_content, and GraphQuery.has_tag use case-insensitive matching (Datomic / Logseq parity) |
| Attachments & assets | Opaque text in chunks | LogseqNode.assets + LogseqPage.resolve_asset_path for graph-root PDFs and images |
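The "Property inheritance" row describes merging page-level keys with ancestor outline keys from the top down. The sketch below illustrates that merge order on plain Python objects. It is not the library's `get_effective_properties`; the `Block` class and the `effective_properties` function are hypothetical names chosen for illustration, and "nearest block wins" is an assumed conflict rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Block:
    # Hypothetical stand-in for an outline block: a parent pointer plus key:: properties.
    parent_id: Optional[str]
    properties: Dict[str, str] = field(default_factory=dict)


def effective_properties(
    block_id: str,
    blocks: Dict[str, Block],
    page_properties: Dict[str, str],
) -> Dict[str, str]:
    """Merge page properties, then each ancestor's, then the block's own."""
    chain: List[Block] = []
    current: Optional[str] = block_id
    while current is not None:  # walk from the block up to its root ancestor
        chain.append(blocks[current])
        current = blocks[current].parent_id
    merged = dict(page_properties)
    for block in reversed(chain):  # root-most ancestor first, target block last
        merged.update(block.properties)
    return merged


blocks = {
    "a": Block(None, {"project": "matryca", "status": "draft"}),
    "b": Block("a", {"status": "review"}),
    "c": Block("b"),
}
props = effective_properties("c", blocks, {"type": "note", "project": "inbox"})
print(props)
```

Block `c` has no properties of its own, yet it ends up with the page's `type`, the root block's `project`, and its parent's `status`.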
Standard RAG pipelines treat your notes like a blender. They chop Markdown into random shards, destroying the parent-child hierarchy that makes Logseq powerful.
graph TD
Raw[(Logseq Markdown\nFiles)]
subgraph Standard RAG
Blender[Standard Text Splitter\n'The Blender']
Chunk1[Chunk 1: Orphan text]
Chunk2[Chunk 2: Lost context]
Blender --> Chunk1 & Chunk2
end
subgraph Matryca Parser
Architect[Logos Engine\nStack-Machine]
Parent[Parent Node\n+ Properties]
Child[Child Node\n+ Task State & Time]
Architect --> Parent --> Child
end
Raw --> Blender
Raw --> Architect
classDef bad fill:#fee2e2,stroke:#ef4444,color:#000;
classDef good fill:#dcfce7,stroke:#22c55e,color:#000;
class Chunk1,Chunk2 bad;
class Parent,Child good;
Logseq Matryca Parser is a deterministic Stack-Machine engine that acts as the File System Driver for your LLM. It preserves the true topology of your thoughts, ensuring AI understands spatial hierarchy, time, and block-lineage—including structured task state and first-class temporal attributes you can query in downstream graph databases and GraphRAG engines without re-parsing raw Markdown.
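To make the stack-machine idea concrete, here is a minimal sketch of how an explicit stack can recover parent-child links from bullet indentation. It is an illustration under simplifying assumptions (space indentation only, `- ` bullets only, no properties or continuation lines), not the LOGOS engine itself; `Node` and `parse_outline` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    text: str
    depth: int
    parent_id: Optional[int]
    path: List[str] = field(default_factory=list)


def parse_outline(markdown: str, tab_size: int = 2) -> List[Node]:
    """Parse '- ' bullets into nodes, tracking ancestry with an explicit stack."""
    nodes: List[Node] = []
    stack: List[Node] = []  # currently open ancestors, shallowest first
    for line in markdown.splitlines():
        stripped = line.lstrip(" ")
        if not stripped.startswith("- "):
            continue  # this sketch skips anything that is not a bullet
        depth = (len(line) - len(stripped)) // tab_size
        while stack and stack[-1].depth >= depth:
            stack.pop()  # close siblings and deeper blocks
        parent = stack[-1] if stack else None
        text = stripped[2:]
        node = Node(
            node_id=len(nodes),
            text=text,
            depth=depth,
            parent_id=parent.node_id if parent else None,
            path=(parent.path if parent else []) + [text],
        )
        nodes.append(node)
        stack.append(node)
    return nodes


sample = "- Project\n  - Design\n    - Parser\n  - Launch\n- Inbox"
nodes = parse_outline(sample)
for n in nodes:
    print(n.node_id, n.parent_id, " > ".join(n.path))
```

Because the stack only ever holds the open ancestors of the current line, the same input always yields the same tree, which is the property a text splitter based on character counts cannot offer.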
Patch release — contributor test coverage and onboarding refresh. No intentional changes to parser, graph, or CLI runtime behavior.
| Area | Change |
|---|---|
| Test suite | 378 pytest cases (+107 vs v1.4.0): normalize_logseq_timestamp, clean_node_content, logseq_paths fallbacks, exception hierarchy, extract_changelog script, KINETIC --help, agent-read --query, direct ObsidianForgeVisitor tests (#42). |
| New test modules | tests/test_exceptions.py, tests/test_extract_changelog.py. |
| Contributor index | docs/GOOD_FIRST_ISSUES.md wave 2 (#43–#52); wave-1 items marked complete. |
Minor release — graph integrity, export hygiene, and parser hardening from the local static-analysis bug hunt (waves 1–8). No intentional breaking changes to default parse behavior.
| Area | Change |
|---|---|
| Graph index | iter_canonical_pages() and page_for_node() deduplicate alias keys; load_directory rebuilds _node_registry from indexed pages only (no ghost nodes after title collision). |
| Case-insensitive queries | search_content, GraphQuery.has_tag, and get_nodes_by_tag match tags case-insensitively (optional # prefix). |
| Live watcher | LogseqGraphWatcher handles on_deleted and on_moved; invalidate_and_reload_page purges registries when a page file was deleted. |
| Agent writes | append_child_to_node calls invalidate_and_reload_page so the in-memory graph matches disk after headless splice. |
| SYNAPSE | Page/block embed expansion uses get_page (case-insensitive) and fail-safe empty replacement (no infinite loops on unresolved embeds). |
| Serialization | Per-page tab_size at parse time; serialize_logseq_page and append_child_to_node preserve four-space vault indentation. |
| Paths & assets | resolve_relative_page_link supports ../ / ./; resolve_asset_path rejects absolute paths and links that escape the graph root. |
| Strict refs | LogseqGraph.load_directory(strict_refs=True) validates cross-page block refs via raise_if_broken_references(). |
| Docs & community | docs/COOKBOOK.md, docs/GOOD_FIRST_ISSUES.md, docs/BUG_HUNT_REPORT.md (audit complete). |
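The "Paths & assets" row says asset resolution rejects absolute paths and links that escape the graph root. A minimal sketch of that kind of guard follows, using only the standard library (Python 3.9+ for `Path.is_relative_to`). The function name `resolve_asset` and its `ValueError` behavior are assumptions for illustration, not the signature of `resolve_asset_path`.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote


def resolve_asset(graph_root: Path, link: str) -> Path:
    """Resolve a graph-relative asset link, refusing anything outside graph_root."""
    decoded = unquote(link)  # e.g. "%20" becomes a space
    if PurePosixPath(decoded).is_absolute():
        raise ValueError(f"absolute asset paths are rejected: {link!r}")
    root = graph_root.resolve()
    candidate = (root / decoded).resolve()  # collapses any ../ segments
    if not candidate.is_relative_to(root):
        raise ValueError(f"asset link escapes the graph root: {link!r}")
    return candidate


print(resolve_asset(Path("graph"), "assets/My%20Paper.pdf"))
```

Resolving both the root and the candidate before comparing them is what catches `../` traversal; comparing raw strings would not.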
Patch release — aligns example and skill install docs with the project's uv workflow. No parser or public API changes.
| Area | Change |
|---|---|
| Examples | examples/run_demo.py error hint uses uv sync --all-extras. |
| Claude skill | claude-skill-logseq-read/SKILL.md recommends uv pip install. |
Minor release — architectural quick wins, runtime robustness, and expanded public API. No breaking changes to default parser behavior.
| Area | Change |
|---|---|
| Public API | Root logseq_matryca_parser exports SynapseAdapter, SessionAliasRegistry, GraphVisualizer, discover_graph_files, and core LOGOS symbols via explicit __all__. |
| Graph model | LogseqGraph uses validate_assignment=True instead of frozen/object.__setattr__ for incremental reloads. |
| Live watcher | start_watching() debounces filesystem events (~500ms) and ignores editor temp/swap files (.swp, ~, .tmp, .DS_Store). |
| Strict refs | StackMachineParser(strict_refs=True) raises BlockReferenceError for unresolved same-page ((uuid)) refs (default off). |
| SYNAPSE | SynapseMetadata / build_synapse_metadata for vector-store-safe fields; LlamaIndex adds SOURCE, NEXT, PREVIOUS relationships. |
| KINETIC CLI | Global --verbose / --graph via @app.callback(); optional-dependency hints recommend uv sync --extra ai|viz. |
| LENS | Lazy-imports NetworkX/PyVis so core installs stay lightweight. |
| Security | Transitive aiohttp / nltk constraints for optional [ai] extras. |
Patch release — fixes a failing CodeQL GitHub Actions workflow; no parser or public API changes.
| Area | Change |
|---|---|
| CodeQL | Removed duplicate .github/workflows/codeql.yml; scanning continues via GitHub default setup (Node 24 runners). |
| Docs | New docs/CODEQL.md explains default vs advanced setup and troubleshooting. |
Infrastructure and contributor experience — no parser API breaks.
| Area | Capability |
|---|---|
| Python matrix | CI and PyPI pre-flight test 3.12 and 3.13; PyPI classifier for 3.13. |
| Quality gates | make all parity in GitHub Actions (uv sync --all-extras → lint, mypy, pytest with ≥80% coverage). |
| Security | GitHub CodeQL default setup (SAST), pip-audit on production deps, expanded SECURITY.md, PyPI publish blocked until pre-flight passes. |
| Community | CODE_OF_CONDUCT.md, CODEOWNERS, issue-template config, CONTRIBUTING with uv workflow. |
| Docs | Root ROADMAP_*.md consolidated under docs/roadmaps/. |
Contributor setup: CONTRIBUTING.md · docs/GOOD_FIRST_ISSUES.md · Security: SECURITY.md · CodeQL: docs/CODEQL.md
| Area | Capability |
|---|---|
| Asset extraction | LogseqNode.assets collects markdown images, {{pdf}} macros, and local [label](path) attachments; LogseqPage.resolve_asset_path maps to absolute paths (%20 decode, graph-root relative). |
| YAML frontmatter | --- blocks at file start populate LogseqPage.properties like native key:: lines; title: in YAML sets page.title at parse; serialize_logseq_page preserves --- fences on round-trip when the source file used YAML. |
| page-tags:: | Block and page page-tags:: inject implicit graph tokens like tags::; list-shaped values feed refs. |
| Case-insensitive routing | LogseqGraph.get_page and resolve_relative_page_link resolve titles via a lowercase index (Datomic parity). |
| Extended shielding | HTML comments, {{query}} / {{advancedquery}}, and escaped \# / \[\[ do not emit false graph tokens (embed macros still harvest nested wikilinks). |
| Property & temporal fixes | Comma-split ignores commas inside [[wikilinks]]; properties after code fences; quoted value stripping; SCHEDULED/DEADLINE ranges, repeaters, and Org warning periods; legacy ___ / %2F / Dendron filenames; UTF-8 BOM via utf-8-sig. |
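The comma-split fix in the last row is easy to get wrong with a plain `str.split(",")`. The sketch below shows one way to split a property value on commas while leaving commas inside `[[wikilinks]]` alone. It is an illustrative scanner written for this README, not the parser's actual code, and `split_property_values` is a hypothetical name.

```python
from typing import List


def split_property_values(raw: str) -> List[str]:
    """Split on commas, except commas that sit inside [[...]] wikilinks."""
    values: List[str] = []
    current: List[str] = []
    depth = 0  # how many [[ are currently open
    i = 0
    while i < len(raw):
        pair = raw[i:i + 2]
        if pair == "[[":
            depth += 1
            current.append(pair)
            i += 2
        elif pair == "]]" and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif raw[i] == "," and depth == 0:
            values.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(raw[i])
            i += 1
    values.append("".join(current).strip())
    return [v for v in values if v]  # drop empty entries such as a trailing comma


parts = split_property_values("[[Smith, John]], research, [[AI, ethics]]")
print(parts)  # ['[[Smith, John]]', 'research', '[[AI, ethics]]']
```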
| Area | Capability |
|---|---|
| Soft-break bodies | Multiline block continuations serialize without double-indenting alignment spaces. |
| List-shaped block props | tags:: / page-tags:: with indented - bullets round-trip as Logseq lists (not Python repr). |
| :LOGBOOK: drawers | Org drawers re-emit as :LOGBOOK: / :END: blocks, not bogus logbook:: property lines. |
| Derived temporal keys | Parsed scheduled::, repeater::, and related derived fields are omitted from serialized key:: output. |
| Stable block UUIDs | Parse → serialize_logseq_page → parse preserves block id:: / UUIDs on the same outline. |
from logseq_matryca_parser.graph import LogseqGraph
from logseq_matryca_parser.logos_parser import LogosParser
graph = LogseqGraph.load_directory("/path/to/logseq/graph")
# Case-insensitive page lookup
page = graph.get_page("my page") # same object as graph.pages["My Page"]
# Assets on a parsed block (Vision / document pipelines)
single = LogosParser().parse_page_file("pages/Notes.md")
block = single.root_nodes[0]
if block.assets:
    abs_path = single.resolve_asset_path(block.assets[0])

Deep dive: Architecture §3.1 (LOGOS) · §3.6 (LogseqGraph) · AST primer.
| Area | Capability |
|---|---|
| Graph index | title:: / TITLE:: overrides filename titles; alias:: / aliases:: inject extra graph.pages keys. |
| Backlinks | [[Dev]] resolves against alias keys (get_backlinks("Dev")). |
| Incremental reload | invalidate_and_reload_page re-applies title/alias enrichment after watcher edits. |
| Parser shields | LaTeX, #+BEGIN_QUERY, fenced code, drawers; {{embed [[Page]]}} harvests nested wikilinks. |
| Property contiguity | key:: contiguous under bullets; soft-break closes the window (fence exception in v1.2.0). |
| Tasks & bullets | GFM checkboxes, extended Org markers, ordered-list bullets, aliased ((uuid)) clean text. |
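The "Graph index" and "Backlinks" rows rely on one lookup table that answers for a page's title and all of its aliases. A minimal sketch of such an index is shown below, combined with case-insensitive matching. `PageIndex` is a hypothetical class, the naive comma split ignores wikilink edge cases, and the rule that `title::` replaces the filename-derived title is taken from the table above.

```python
from typing import Dict, Optional


class PageIndex:
    """Map titles and aliases to one canonical page name, ignoring case."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def add(self, filename_title: str, properties: Dict[str, str]) -> str:
        # title:: overrides the title derived from the filename.
        canonical = properties.get("title", filename_title)
        raw_aliases = properties.get("alias", "") + "," + properties.get("aliases", "")
        aliases = [a.strip() for a in raw_aliases.split(",") if a.strip()]
        for key in [canonical, *aliases]:
            self._by_key[key.casefold()] = canonical  # casefold gives case-insensitive keys
        return canonical

    def get(self, name: str) -> Optional[str]:
        return self._by_key.get(name.casefold())


index = PageIndex()
index.add("development", {"title": "Development", "alias": "Dev, Engineering"})
print(index.get("dev"), index.get("ENGINEERING"), index.get("missing"))
```

With every alias pointing at the same canonical name, a backlink lookup for `[[Dev]]` and one for `[[Development]]` can land on the same page.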
Compile an entire Logseq graph into an Obsidian vault layout: YAML frontmatter from page properties, list body preserved, Logseq ((uuid)) links rewritten to [[Page#^anchor]], and trailing ^block-id on referenced blocks. Namespace titles become nested folders (e.g. Projects/AI/Demo.md).
matryca-parse export /path/to/logseq/graph /path/to/obsidian/vault --format obsidian

Note: Wikilinks currently use the Logseq page title (e.g. [[Target#^…]]). Vault files may live under namespace folders (Projects/AI/Demo.md). Obsidian usually resolves unique titles; aligning link text to folder paths is a possible future refinement.
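The rewrite from `((uuid))` to `[[Page#^anchor]]` can be pictured as a registry lookup applied to every block reference in a line. The sketch below assumes a registry that maps each block UUID to a `(page title, anchor)` pair; that layout, the short anchor value, and the function name `rewrite_block_refs` are all hypothetical and only meant to show the shape of the transformation.

```python
import re
from typing import Dict, Tuple

# A Logseq block reference: a 36-character UUID wrapped in double parentheses.
BLOCK_REF = re.compile(r"\(\(([0-9a-fA-F-]{36})\)\)")


def rewrite_block_refs(text: str, registry: Dict[str, Tuple[str, str]]) -> str:
    """Replace ((uuid)) with [[Page#^anchor]]; leave unknown refs untouched."""

    def replace(match: "re.Match[str]") -> str:
        target = registry.get(match.group(1).lower())
        if target is None:
            return match.group(0)  # unresolved: keep the original text
        page_title, anchor = target
        return f"[[{page_title}#^{anchor}]]"

    return BLOCK_REF.sub(replace, text)


registry = {"6ba7b810-9dad-11d1-80b4-00c04fd430c8": ("Demo", "6ba7b810")}
line = "See ((6ba7b810-9dad-11d1-80b4-00c04fd430c8)) and ((00000000-0000-0000-0000-000000000000))."
rewritten = rewrite_block_refs(line, registry)
print(rewritten)
```

Leaving unresolved references as they are keeps the export lossless: nothing is silently dropped, and a later lint pass can still find them.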
LogseqGraph supports surgical file invalidation (optional dependency: uv sync --extra watch). start_watching() runs a recursive watchdog observer with a ~500 ms debounce and ignores editor temp/swap files. On created / modified / deleted / moved events under pages/ or journals/, only the affected file is re-parsed (or purged when deleted); stale synthetic UUIDs are removed from _node_registry and scrubbed from _backlink_registry, so no full-graph cold reload is needed.
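The debounce step is what turns a burst of editor save events into a single re-parse per file. Below is a small, dependency-free sketch of that idea with an injected clock so it can be tested deterministically. It does not use watchdog and is not the library's watcher; `Debouncer`, `record`, and `drain` are hypothetical names, and the ignored suffixes are the ones listed in the release notes.

```python
from pathlib import PurePath
from typing import Dict, List

IGNORED_SUFFIXES = (".swp", "~", ".tmp", ".DS_Store")


class Debouncer:
    """Coalesce bursts of events per path; a path is ready after a quiet period."""

    def __init__(self, quiet_seconds: float = 0.5) -> None:
        self.quiet_seconds = quiet_seconds
        self._last_seen: Dict[str, float] = {}

    def record(self, path: str, now: float) -> None:
        if PurePath(path).name.endswith(IGNORED_SUFFIXES):
            return  # editor temp/swap files never trigger a re-parse
        self._last_seen[path] = now  # a newer event restarts the quiet period

    def drain(self, now: float) -> List[str]:
        ready = sorted(
            p for p, seen in self._last_seen.items() if now - seen >= self.quiet_seconds
        )
        for path in ready:
            del self._last_seen[path]
        return ready


debouncer = Debouncer()
debouncer.record("pages/Notes.md", now=0.0)
debouncer.record("pages/Notes.md", now=0.2)  # same file saved again
debouncer.record("pages/.Notes.md.swp", now=0.2)  # ignored swap file
first = debouncer.drain(now=0.6)   # only 0.4 s of quiet so far
second = debouncer.drain(now=0.8)  # quiet period elapsed
print(first, second)
```

Each path returned by `drain` would then be handed to a per-file reload, which is the point where stale UUIDs and backlinks for that one page get purged.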
Filter the global node registry with a chainable API (tags, task state, ancestry under a parent UUID):
from logseq_matryca_parser.graph import LogseqGraph
graph = LogseqGraph.load_directory("/path/to/logseq/graph")
hits = (
    graph.query()
    .has_tag("idea")
    .under_parent("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")
    .is_task_state("TODO")
    .execute()
)

For autonomous LLM agents, passing raw Markdown into the context window wastes thousands of tokens on 36-character UUIDs, hidden id:: properties, drawers, and collapsed directives that carry no immediate semantic signal. X-Ray mode compresses the parsed AST into ultra-dense, zero-fluff plain text: each block becomes {indent}[{alias}] {clean_text}, with heavy Logseq UUIDs replaced by sequential integer aliases ([0], [1], …) held in a session registry. On typical outlines this can reduce context consumption by up to ~35× compared to dumping full block payloads.
matryca-parse agent-read /path/to/graph --tag idea
matryca-parse agent-read /path/to/graph --query "quantum"

The agent reads cheap topology now; the registry resolves aliases back to sovereign UUIDs when you wire targeted writes.
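The alias mechanism can be sketched in a few lines: hand out sequential integers for UUIDs, render each block as `{indent}[{alias}] {clean_text}`, and keep the reverse mapping for later writes. The code below is an illustration of that scheme, not the library's `SessionAliasRegistry` or `to_xray_markdown`; the class, the function, and the two-space indent are assumptions.

```python
from typing import Dict, List, Tuple


class AliasRegistry:
    """Hand out sequential integer aliases for UUIDs and resolve them back."""

    def __init__(self) -> None:
        self._alias_by_uuid: Dict[str, int] = {}
        self._uuid_by_alias: List[str] = []

    def alias_for(self, uuid: str) -> int:
        if uuid not in self._alias_by_uuid:
            self._alias_by_uuid[uuid] = len(self._uuid_by_alias)
            self._uuid_by_alias.append(uuid)
        return self._alias_by_uuid[uuid]

    def resolve(self, alias: int) -> str:
        return self._uuid_by_alias[alias]


def to_xray(blocks: List[Tuple[int, str, str]], registry: AliasRegistry) -> str:
    """Render (depth, uuid, clean_text) rows as '{indent}[{alias}] {clean_text}'."""
    return "\n".join(
        f"{'  ' * depth}[{registry.alias_for(uuid)}] {text}"
        for depth, uuid, text in blocks
    )


registry = AliasRegistry()
rows = [
    (0, "6ba7b810-9dad-11d1-80b4-00c04fd430c8", "Quantum notes"),
    (1, "6ba7b811-9dad-11d1-80b4-00c04fd430c8", "Entanglement idea"),
]
xray = to_xray(rows, registry)
print(xray)
print(registry.resolve(1))
```

The agent only ever sees `[0]` and `[1]`; the 36-character identifiers stay in the registry until a write needs them.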
The parser is no longer read-only. Wave 12 adds a headless Markdown splicer (agent_writer.py): append_child_to_node uses AST line numbers and indentation ((indent_level + 1) × tab_size) to insert a new bullet atomically into the sovereign .md file—via tempfile + os.replace—without Logseq’s fragile HTTP API. Beyond surgical node splicing, the engine now supports full bidirectional page generation via serialize_logseq_page and write_logseq_page—rebuilding entire Logseq-compliant .md pages from an in-memory AST. Pair agent-read with agent-write: X-Ray persists its alias map to .matryca_xray_state.json at the graph root so stateless CLI invocations can read, then write in sequence.
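The splice described above combines two ideas: computing the child's indentation as (indent_level + 1) × tab_size, and swapping the file in atomically through a temporary file and `os.replace`. The following sketch shows both with the standard library only. It is not `append_child_to_node`; the function name `append_child`, its parameters, and the 0-based `parent_line` are assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path


def append_child(path: Path, parent_line: int, parent_indent_level: int,
                 content: str, tab_size: int = 2) -> None:
    """Add '- content' as the last child of the parent bullet, atomically."""
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    parent_width = parent_indent_level * tab_size
    insert_at = parent_line + 1
    while insert_at < len(lines):  # skip over the parent's existing subtree
        line = lines[insert_at]
        width = len(line) - len(line.lstrip(" "))
        if line.strip() and width <= parent_width:
            break  # next sibling or shallower block: the subtree ended
        insert_at += 1
    if not lines[insert_at - 1].endswith("\n"):
        lines[insert_at - 1] += "\n"
    indent = " " * ((parent_indent_level + 1) * tab_size)
    lines.insert(insert_at, f"{indent}- {content}\n")
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.writelines(lines)
        os.replace(tmp_name, path)  # atomic swap on the same filesystem
    except BaseException:
        os.unlink(tmp_name)  # never leave a half-written temp file behind
        raise


with tempfile.TemporaryDirectory() as folder:
    page = Path(folder) / "Notes.md"
    page.write_text("- Parent\n  - Existing child\n- Sibling\n", encoding="utf-8")
    append_child(page, parent_line=0, parent_indent_level=0, content="New child")
    result = page.read_text(encoding="utf-8")
print(result)
```

Creating the temporary file in the same directory as the page matters: `os.replace` is only atomic when source and destination live on the same filesystem.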
matryca-parse agent-read /path/to/graph --tag idea
matryca-parse agent-write /path/to/graph --alias 0 --content "Follow-up from the agent"

For graph hygiene, LogseqGraph.get_broken_references() flags nodes whose ((uuid)) block refs point at missing registry targets: structural linting, not regex guessing.
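Once block references are stored as parsed fields, finding broken ones is a membership check against the node registry. The sketch below shows that check on a hypothetical `Node` dataclass with a `block_refs` list; it mirrors the idea behind `get_broken_references()` but is not its implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    # Hypothetical node shape: its own UUID plus the UUIDs it references.
    uuid: str
    block_refs: List[str] = field(default_factory=list)


def broken_references(registry: Dict[str, Node]) -> List[Node]:
    """Return the originating nodes that reference a UUID missing from the registry."""
    return [
        node
        for node in registry.values()
        if any(ref not in registry for ref in node.block_refs)
    ]


nodes = [
    Node("aaa"),
    Node("bbb", block_refs=["aaa"]),  # resolves
    Node("ccc", block_refs=["aaa", "zzz"]),  # "zzz" does not exist
]
registry = {node.uuid: node for node in nodes}
broken = broken_references(registry)
print([node.uuid for node in broken])  # ['ccc']
```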
| Feature | Description |
|---|---|
| LOGOS Engine | Deterministic AST parsing. YAML + native frontmatter ingest, format-preserving serialize_logseq_page (YAML vs key:: by source), list-shaped block property layout, assets, property contiguity (incl. post-fence), comma-safe wikilink splits, temporal ranges/repeaters, legacy filename decode, BOM-safe reads, and shielded code/math/query/HTML/escape regions. |
| Multimodal assets | LogseqNode.assets + LogseqPage.resolve_asset_path for PDFs and images relative to the graph root (Vision / document RAG). |
| LogseqGraph | In-memory vault: pages index (with title/alias enrichment and case-insensitive lookup), iter_canonical_pages() / page_for_node(), backlinks, effective properties, namespace resolution, fluent GraphQuery, optional watchdog invalidation (create/modify/delete/move). |
| Advanced Task Extraction | Task state (TODO / DOING / DELEGATED / IN-PROGRESS / …), priority markers [#A]–[#C] promoted to task_priority, and SCHEDULED / DEADLINE Logseq timestamps normalized to UTC Unix epoch seconds on scheduled_at / deadline_at for temporal graph and retrieval pipelines. |
| SYNAPSE Adapter | Native exports for LangChain and LlamaIndex with automated lineage metadata; context-enriched chunks with breadcrumbs, embed expansion, and inherited properties. |
| FORGE | JSON, clean Markdown, and Obsidian vault serialization (ObsidianForgeVisitor, ForgeExporter.to_obsidian_markdown). |
| LENS Visualizer | 60FPS interactive graph rendering (10k+ nodes) with Glassmorphism HUD. |
| Agent-Native Printing Press | agent_press.py: SessionAliasRegistry maps session aliases ↔ block UUIDs; to_xray_markdown emits token-minimal outline text for autonomous agents (matryca-parse agent-read). |
| Native Markdown Serialization | logseq_markdown.py + logseq_paths.py: rebuild and write Logseq-compliant markdown from an AST—page header preserves YAML --- or native key:: by source format, block properties at parent whitespace + 2 spaces (including bullet-list tags::), :LOGBOOK: drawers, and namespace titles via ___ pathing rules. |
| Headless Write Engine | agent_writer.py: append_child_to_node splices child bullets into on-disk Markdown from AST topology; serialize_logseq_page / write_logseq_page emit full pages; matryca-parse agent-write resolves aliases via .matryca_xray_state.json. |
| AST Linters | LogseqGraph.get_broken_references() returns originating nodes when block_refs target UUIDs absent from the global registry. |
| Sovereign AI | 100% Local. Zero telemetry. Private by design. |
Each AST block is a LogseqNode. Alongside task_status, the parser surfaces priority and schedule metadata as typed fields (epoch integers are seconds since Unix epoch, UTC):
{
  "uuid": "6ba7b810-9dad-11d1-80b4-00c04fd430c8",
  "task_status": "TODO",
  "task_priority": "A",
  "scheduled_at": 1641600000,
  "deadline_at": 1641772800,
  "clean_text": "Cut v0.3.2 release"
}

Marker syntax ([#A], SCHEDULED: <...>, DEADLINE: <...>) is stripped from clean_text so embeddings stay clean; the promoted fields carry the structured signal for downstream graph databases and GraphRAG engines.
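The promotion of markers into typed fields can be illustrated with a short extraction routine. The sketch below handles only a subset of task states and date-only timestamps, and it assumes a date without a time means midnight UTC (which is consistent with the epoch values in the JSON above). `extract_task` and the regular expressions are written for this example and are not the parser's internals.

```python
import re
from datetime import datetime, timezone
from typing import Dict, Optional

PRIORITY = re.compile(r"\[#([ABC])\]\s*")
STAMP = re.compile(r"\b(SCHEDULED|DEADLINE):\s*<(\d{4}-\d{2}-\d{2})[^>]*>\s*")
STATES = ("TODO", "DOING", "DONE")  # subset chosen for this sketch


def extract_task(raw: str) -> Dict[str, Optional[object]]:
    """Pull task state, priority, and timestamps out of a block's raw text."""
    fields: Dict[str, Optional[object]] = {
        "task_status": None, "task_priority": None,
        "scheduled_at": None, "deadline_at": None,
    }
    text = raw
    for state in STATES:
        if text.startswith(state + " "):
            fields["task_status"] = state
            text = text[len(state) + 1:]
            break
    priority = PRIORITY.search(text)
    if priority:
        fields["task_priority"] = priority.group(1)
        text = PRIORITY.sub("", text, count=1)
    for kind, day in STAMP.findall(text):
        # Assumption: a date-only timestamp is interpreted as midnight UTC.
        moment = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        key = "scheduled_at" if kind == "SCHEDULED" else "deadline_at"
        fields[key] = int(moment.timestamp())
    fields["clean_text"] = " ".join(STAMP.sub("", text).split())
    return fields


block = "TODO [#A] Cut v0.3.2 release SCHEDULED: <2022-01-08 Sat> DEADLINE: <2022-01-10 Mon>"
task = extract_task(block)
print(task)
```

Run on the sample line, the sketch yields the same `task_status`, `task_priority`, `scheduled_at`, `deadline_at`, and `clean_text` values as the JSON node shown earlier.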
# Install from PyPI (latest: v1.4.1)
uv pip install logseq-matryca-parser
# Optional: filesystem watcher for live incremental graph updates
uv pip install 'logseq-matryca-parser[watch]'
# Or clone and sync all extras locally
uv sync --all-extras

# 1. Visualize your local graph (LENS)
matryca-parse visualize /path/to/logseq/graph my-map.html
# 2. Export for AI / RAG (SYNAPSE)
matryca-parse export /path/to/logseq/graph output --format langchain
# 3. Context-enriched LangChain JSON (graph + inheritance + embed expansion)
matryca-parse export /path/to/logseq/graph output --format langchain-enriched
# 4. Obsidian vault (YAML frontmatter + ^ block ids)
matryca-parse export /path/to/logseq/graph output --format obsidian
# Global options (all subcommands): --verbose, --graph /path/to/vault
matryca-parse --graph /path/to/logseq/graph --verbose export output --format json

Prefer the package root for stable imports (see __all__ in logseq_matryca_parser):
from logseq_matryca_parser import (
    LogseqGraph,
    LogosParser,
    SynapseAdapter,
    SessionAliasRegistry,
    discover_graph_files,
)
# Parse a single page to AST (YAML or native frontmatter; utf-8-sig BOM-safe)
page = LogosParser().parse_page_file("page.md")
if page.root_nodes[0].assets:
    absolute = page.resolve_asset_path(page.root_nodes[0].assets[0])
# Load the whole vault (pages, backlinks, node registry)
graph = LogseqGraph.load_directory("/path/to/logseq/graph")
page_obj = graph.get_page("My Page") # case-insensitive
effective = graph.get_effective_properties(page_obj.root_nodes[0].uuid)
# Export to LangChain with lineage metadata
docs = SynapseAdapter.to_langchain_documents(page.root_nodes, source_name=page.title)
# Optional strict same-page block-ref validation at parse time
from logseq_matryca_parser import StackMachineParser
strict_page = StackMachineParser(strict_refs=True).parse_page_file("page.md")

Agents such as Hermes or OpenClaw can record structured notes into a Logseq graph without rewriting existing pages. The helper logseq_agent_write only opens the weekly agent page in append mode ("a"), writes a new bullet (journal link + optional tag links + body), and never truncates or replaces prior content, so routine logging cannot wipe blocks that already live in that file.
Point it at your graph’s pages directory and config.edn so journal titles match Logseq’s :journal/page-title-format (including ordinal days when you use do in the pattern).
from logseq_matryca_parser import logseq_agent_write
result = logseq_agent_write(
    "Summarized user intent and proposed next steps.",
    config_path="/path/to/logseq/config.edn",
    pages_dir="/path/to/logseq/pages",
    context_tags=["agent/hermes", "#session"],
)
assert result["status"] == "success"
# result["path"] → e.g. .../pages/2026-18-agent.md

- Desktop GUI: Standalone app for non-technical users. (Join the RFC)
- Obsidian Adapter: Native CLI export (--format obsidian) with YAML frontmatter and ^block anchors.
- Ollama Integration: One-click local RAG setup. (RFC draft) · (Track progress #34)
Logseq Matryca Parser is open-source. If it powers your pipeline, consider a star ⭐ or a sponsorship!
Need custom RAG integrations or consulting? Contact: marco@marcoporcellato.it
We welcome issues, pull requests, and constructive feedback.
| Resource | Link |
|---|---|
| Good first issues | docs/GOOD_FIRST_ISSUES.md — starter tasks (#19–#52) |
| Contributing | CONTRIBUTING.md — setup, tests, PR workflow |
| Cookbook | docs/COOKBOOK.md — integration recipes (Synapse, graph query, watcher) |
| Documentation index | docs/README.md — active vs historical docs |
| Code of Conduct | CODE_OF_CONDUCT.md — community standards |
| Security | SECURITY.md — report vulnerabilities privately |
Architected by Marco Porcellato | Powered by Matryca.ai