Stress-test codec architecture with SQLite and JSON sidecar storage formats #102

@abegong

Architecture spec proposal. Status: planning.

Overview

Add SQLite and JSON as sidecar storage formats: configured storage that augments
an existing primary item store instead of replacing it.

This is the second stress test for the codec architecture. Standalone SQLite
proves that a backend can own the whole item store; sidecars prove that Katalyst
can compose content from multiple storage units without collapsing codecs,
storage definitions, and check inputs into one filesystem-shaped object.

Value

Many knowledge bases keep human-authored markdown as the primary artifact while
placing structured data nearby: a JSON metadata file, an index database, an
annotation table, or generated enrichment state. Katalyst should be able to
validate and inspect that shape without forcing users to inline every attribute
into frontmatter.

Sidecars also test a harder architectural question than a standalone backend:
when item content is assembled from more than one source, the storage layer must
own identity and persistence while codecs own decoding. Checks should still see
the same content shapes they already understand.

Current State

The current filesystem backend treats one markdown file as one item. The
markdownbodytext codec parses frontmatter and body from that file, and checks
consume the resulting metadata/body shape.

The storage deep-dive already frames storage as a two-way mapping between
backend-native references and Katalyst collections/items. It also treats
Reference as opaque and keeps backend-native IO under
internal/storage/collection/<backend>.

What does not exist yet is a sidecar model: one Katalyst item assembled from a
primary reference plus one or more auxiliary references.

Design

Define sidecar storage explicitly

A sidecar is auxiliary storage attached to an item from a primary collection.
It should not become an independent collection unless configured that way.

The sidecar definition needs to answer three questions:

  1. How does Katalyst find the sidecar for a given item?
  2. How does decoded sidecar content merge with the primary item shape?
  3. Which operations are allowed to write the sidecar?

Start with sidecars attached to filesystem markdown collections, because that
is the shape users are most likely to have and the easiest way to compare
primary file behavior with augmented metadata behavior.
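
As a rough sketch, the three questions above could map onto fields of a sidecar definition. Every name here (SidecarDef, MergeMode, Validate) is hypothetical and only illustrates the shape; the issue does not fix a config or type layout.

```go
package main

import (
	"errors"
	"fmt"
)

// MergeMode says how decoded sidecar fields join the primary metadata map.
type MergeMode string

const (
	MergeNamespace MergeMode = "namespace" // fields land under Namespace
	MergeTopLevel  MergeMode = "top-level" // fields land beside frontmatter keys
)

// SidecarDef is a hypothetical definition that answers the three questions.
type SidecarDef struct {
	Name      string
	Type      string    // "json" or "sqlite": how the sidecar is found and decoded
	Merge     MergeMode // how decoded content merges with the primary item
	Namespace string    // required when Merge is MergeNamespace
	Writable  bool      // false means no operation may write the sidecar
}

// Validate rejects definitions that leave one of the questions unanswered.
func (d SidecarDef) Validate() error {
	if d.Name == "" {
		return errors.New("sidecar: name is required")
	}
	switch d.Type {
	case "json", "sqlite":
	default:
		return fmt.Errorf("sidecar %q: unsupported type %q", d.Name, d.Type)
	}
	switch d.Merge {
	case MergeNamespace:
		if d.Namespace == "" {
			return fmt.Errorf("sidecar %q: namespace merge needs a namespace", d.Name)
		}
	case MergeTopLevel:
	default:
		return fmt.Errorf("sidecar %q: merge mode must be explicit", d.Name)
	}
	return nil
}

func main() {
	def := SidecarDef{Name: "enrichments", Type: "sqlite", Merge: MergeNamespace, Namespace: "enrichments"}
	fmt.Println(def.Validate() == nil)
}
```

Leaving Writable at its zero value keeps a sidecar read-only by default, which matches the conservative first cut discussed under Open Questions.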

Support JSON sidecars

JSON sidecars should cover the common file-adjacent case:

notes/dune.md
notes/dune.json

or a configured sidecar directory:

notes/dune.md
.katalyst/sidecars/notes/dune.json

The JSON sidecar should decode to structured metadata and merge into the item
metadata map under a predictable policy.
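
The two layouts above could be resolved by one small function. This is a sketch under assumptions: jsonSidecarPath and its sidecarDir parameter are hypothetical, and paths are taken to be slash-separated and relative to the storage root.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// jsonSidecarPath maps a primary markdown path to its JSON sidecar path.
// An empty sidecarDir selects the file-adjacent layout; otherwise the
// primary path is mirrored under sidecarDir.
func jsonSidecarPath(primary, sidecarDir string) string {
	base := strings.TrimSuffix(primary, path.Ext(primary)) + ".json"
	if sidecarDir == "" {
		return base
	}
	return path.Join(sidecarDir, base)
}

func main() {
	fmt.Println(jsonSidecarPath("notes/dune.md", ""))                  // notes/dune.json
	fmt.Println(jsonSidecarPath("notes/dune.md", ".katalyst/sidecars")) // .katalyst/sidecars/notes/dune.json
}
```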

The merge policy is part of this issue. Prefer a conservative first cut:

  • require explicit config for whether sidecar fields merge at the top level or
    under a namespace;
  • report collisions clearly;
  • avoid silently overwriting primary frontmatter.
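
A minimal sketch of that conservative policy follows. The function name and signature are assumptions for illustration, not an existing Katalyst API. A non-empty namespace nests the sidecar fields; an empty one merges at the top level and fails on any collision.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeSidecar returns a new metadata map combining primary frontmatter with
// decoded sidecar fields. A key already present in primary is reported as a
// collision and never overwritten. The primary map is not modified.
func mergeSidecar(primary, sidecar map[string]any, namespace string) (map[string]any, error) {
	out := make(map[string]any, len(primary)+len(sidecar))
	for k, v := range primary {
		out[k] = v
	}
	if namespace != "" {
		if _, taken := primary[namespace]; taken {
			return nil, fmt.Errorf("sidecar namespace %q collides with a primary field", namespace)
		}
		out[namespace] = sidecar
		return out, nil
	}
	var collisions []string
	for k, v := range sidecar {
		if _, taken := primary[k]; taken {
			collisions = append(collisions, k)
			continue
		}
		out[k] = v
	}
	if len(collisions) > 0 {
		sort.Strings(collisions) // map order is random; sort for stable diagnostics
		return nil, fmt.Errorf("sidecar fields collide with primary frontmatter: %s",
			strings.Join(collisions, ", "))
	}
	return out, nil
}

func main() {
	primary := map[string]any{"title": "Dune", "rating": 5}
	sidecar := map[string]any{"rating": 4, "genre": "sf"}

	_, err := mergeSidecar(primary, sidecar, "")
	fmt.Println(err)

	merged, _ := mergeSidecar(primary, sidecar, "enrichments")
	fmt.Println(len(merged))
}
```

Sorting the collision list is what makes the diagnostic deterministic, since Go randomizes map iteration order.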

Support SQLite sidecars

SQLite sidecars should cover the indexed/enrichment case: one database stores
extra rows keyed by the primary item id or coordinates.

Example shape:

type: filesystem
root: .
collections:
  notes:
    path: notes
    sidecars:
      enrichments:
        type: sqlite
        path: ./.katalyst/enrichments.sqlite
        table: note_enrichments
        key: slug

The exact config shape can change. The important constraint is that SQLite is
not the primary collection here; it is an auxiliary source keyed by the primary
collection's item identity.
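
To make the keyed lookup concrete, here is a sketch of the statement a SQLite sidecar might run for one item, built from the table and key values in the example config. The helper names are hypothetical. Executing the statement would go through database/sql with a SQLite driver, which is left out here.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps a SQLite identifier in double quotes and doubles any
// embedded quote, so a configured table or column name cannot break the statement.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// lookupQuery builds the statement that fetches the sidecar row for one item.
// The item key is bound as a parameter at query time, never spliced into the SQL.
func lookupQuery(table, keyColumn string) string {
	return fmt.Sprintf("SELECT * FROM %s WHERE %s = ?", quoteIdent(table), quoteIdent(keyColumn))
}

func main() {
	fmt.Println(lookupQuery("note_enrichments", "slug"))
}
```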

Keep composition in storage, decoding in codecs

Sidecar lookup and identity matching belong in storage/collection code. Decoding
the sidecar payload belongs in codecs when the payload shape is reusable.

Likely codec pressure points:

  • a structured JSON object codec for .json sidecars;
  • row/object decoding for SQLite sidecar rows;
  • merge/overlay behavior that may belong above individual codecs if multiple
    sidecar types share it.
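
The first two pressure points could look roughly like this: two decoders, one per sidecar type, that both produce the same metadata-shaped map. Both function names are hypothetical, and the row decoder assumes the storage layer has already scanned column names and values out of the database.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// decodeJSONObject decodes a .json sidecar payload and insists that the
// top-level value is an object, since only objects map onto metadata fields.
func decodeJSONObject(data []byte) (map[string]any, error) {
	var v any
	if err := json.Unmarshal(data, &v); err != nil {
		return nil, fmt.Errorf("decode json sidecar: %w", err)
	}
	obj, ok := v.(map[string]any)
	if !ok {
		return nil, errors.New("decode json sidecar: top-level value must be an object")
	}
	return obj, nil
}

// decodeRow turns one SQLite row, given as parallel column names and values,
// into the same map shape the JSON decoder produces.
func decodeRow(columns []string, values []any) (map[string]any, error) {
	if len(columns) != len(values) {
		return nil, fmt.Errorf("decode row: %d columns but %d values", len(columns), len(values))
	}
	out := make(map[string]any, len(columns))
	for i, col := range columns {
		out[col] = values[i]
	}
	return out, nil
}

func main() {
	fromJSON, _ := decodeJSONObject([]byte(`{"genre": "sf"}`))
	fromRow, _ := decodeRow([]string{"slug", "genre"}, []any{"dune", "sf"})
	fmt.Println(fromJSON["genre"], fromRow["genre"])
}
```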

Do not make checks aware of sidecars. Checks should receive a content shape:
metadata, body, and any future typed shapes. They should not know whether a
field came from frontmatter, JSON, or SQLite unless the configured merge policy
chooses to preserve provenance.
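
One way to picture that boundary: a check takes only a content shape, so it gives the same answer whether a field was written in frontmatter or merged in from a sidecar. The Content and Check types below are illustrative assumptions, not Katalyst's actual types.

```go
package main

import "fmt"

// Content is the shape a check receives. It carries no source information.
type Content struct {
	Metadata map[string]any
	Body     string
}

// Check is a hypothetical check signature: content in, diagnostics out.
type Check func(Content) []string

// requireField returns a check that reports items missing a metadata field.
func requireField(name string) Check {
	return func(c Content) []string {
		if _, ok := c.Metadata[name]; !ok {
			return []string{fmt.Sprintf("missing required field %q", name)}
		}
		return nil
	}
}

func main() {
	check := requireField("genre")

	// "genre" written directly in frontmatter.
	inline := Content{Metadata: map[string]any{"title": "Dune", "genre": "sf"}}
	// "genre" merged in by the storage layer from a sidecar; the check cannot tell.
	composed := Content{Metadata: map[string]any{"title": "Dune", "genre": "sf"}}
	// No sidecar and no frontmatter field.
	bare := Content{Metadata: map[string]any{"title": "Dune"}}

	fmt.Println(len(check(inline)), len(check(composed)), len(check(bare)))
}
```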

Open Questions

  1. Is sidecar composition a storage feature, a codec feature, or its own
    layer?

    Context. Storage owns references and identity. Codecs own decoding.
    Sidecars add composition across multiple decoded sources.

    Recommendation. Start in the collection storage implementation, then
    extract only if both JSON and SQLite sidecars share enough composition logic
    to justify a common package.

  2. What is the first merge policy?

    Context. Top-level merging is ergonomic but collision-prone. Namespacing
    is safer but makes checks more verbose.

    Recommendation. Support an explicit namespace first. Allow top-level
    merge only when collisions are reported as errors. Do not silently
    overwrite primary frontmatter.

  3. Are sidecars read-only at first?

    Context. The check and inspect commands only need reads. The fix and
    item update commands raise harder questions about which source should be
    modified.

    Recommendation. Make sidecars read-only in the first cut unless a
    specific write behavior is needed to validate the storage seam. If read-only,
    diagnostics should say which fields cannot be fixed because they come from a
    sidecar.

  4. How should missing sidecars behave?

    Context. Some sidecars are optional enrichments; others may be required
    for validation.

    Recommendation. Make required/optional explicit in config. Missing
    required sidecars should be a storage/config diagnostic; missing optional
    sidecars should produce an item with only primary content.

  5. Should provenance be preserved?

    Context. Users may need to understand whether a field came from
    frontmatter, JSON, or SQLite, especially when fixes are unavailable.

    Recommendation. Preserve provenance internally if it is cheap, but do
    not expose it in check APIs until a concrete diagnostic or write-path need
    forces the shape.
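
The recommendations for questions 3 through 5 can be sketched together: a composition step that treats a missing sidecar differently depending on whether it is required, records provenance internally, and uses it to say which fields are fixable. All names here (compose, Loader, ErrSidecarNotFound, Fixable) are hypothetical, and the sketch assumes a namespaced merge with read-only sidecars.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSidecarNotFound is a hypothetical sentinel a loader returns when no
// sidecar exists for the item.
var ErrSidecarNotFound = errors.New("sidecar not found")

// Loader fetches the decoded sidecar fields for one item key.
type Loader func(key string) (map[string]any, error)

// Composed holds merged metadata plus the source of each top-level field.
type Composed struct {
	Metadata   map[string]any
	Provenance map[string]string // field -> "frontmatter" or the sidecar name
}

// Fixable reports whether a field may be rewritten; sidecars are read-only here.
func (c Composed) Fixable(field string) bool {
	return c.Provenance[field] == "frontmatter"
}

// compose merges one sidecar under its own name as the namespace.
func compose(key string, primary map[string]any, name string, required bool, load Loader) (Composed, error) {
	out := Composed{Metadata: map[string]any{}, Provenance: map[string]string{}}
	for k, v := range primary {
		out.Metadata[k] = v
		out.Provenance[k] = "frontmatter"
	}
	fields, err := load(key)
	if errors.Is(err, ErrSidecarNotFound) {
		if required {
			return Composed{}, fmt.Errorf("item %q: required sidecar %q is missing", key, name)
		}
		return out, nil // optional sidecar: primary content only
	}
	if err != nil {
		return Composed{}, fmt.Errorf("item %q: sidecar %q: %w", key, name, err)
	}
	if _, taken := primary[name]; taken {
		return Composed{}, fmt.Errorf("item %q: sidecar %q collides with a primary field", key, name)
	}
	out.Metadata[name] = fields
	out.Provenance[name] = name
	return out, nil
}

func main() {
	found := func(string) (map[string]any, error) { return map[string]any{"genre": "sf"}, nil }
	c, err := compose("dune", map[string]any{"title": "Dune"}, "enrichments", true, found)
	fmt.Println(err, c.Fixable("title"), c.Fixable("enrichments"))
}
```

Provenance stays on the composed value and never enters a check signature, which keeps it internal until a diagnostic or write path needs it.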

Documentation Updates

  • docs/content/deep-dives/storage.md: document sidecars as auxiliary storage
    attached to a primary item store, including identity and write limitations.
  • docs/content/reference/configuration.md: document sidecar configuration for
    JSON and SQLite once the config shape is decided.
  • docs/content/deep-dives/formatting.md: mention when metadata comes from
    sidecars rather than markdown frontmatter, if user-visible.
  • internal/storage/collection/AGENTS.md: add conventions for composition and
    sidecar ownership.
  • New backend-specific AGENTS.md files if JSON or SQLite sidecar code gets
    its own package.

Test Checklist

  • A filesystem markdown collection can configure a JSON sidecar.
  • A filesystem markdown collection can configure a SQLite sidecar.
  • katalyst check can validate fields supplied by a JSON sidecar.
  • katalyst check can validate fields supplied by a SQLite sidecar.
  • katalyst inspect includes sidecar-supplied fields in collection
    evidence according to the merge policy.
  • Field collisions between primary frontmatter and sidecars produce clear,
    deterministic diagnostics.
  • Missing required sidecars and missing optional sidecars behave
    differently and are covered by tests.
  • Sidecar read-only behavior is explicit in fix or item update
    diagnostics if write paths are not supported.
  • Checks and inspectors do not import sidecar backend packages directly.
  • go test ./... passes.

Rejected Alternatives

  • Treat every sidecar as its own collection. That loses the point of a
    sidecar: auxiliary content attached to the same item identity as the primary
    store.
  • Inline sidecar data into markdown frontmatter before checks run. This may
    be an implementation detail, but it should not erase collision handling,
    provenance, or write-path decisions.
  • Support only JSON sidecars first and defer SQLite. JSON is useful, but
    SQLite is the better architecture stress test because it forces keyed lookup,
    non-file references, and row decoding.
  • Let checks read sidecars directly. That would punch through both the
    storage and codec boundaries and make each check family backend-aware.
