Architecture spec proposal. Status: planning.
Overview
Add SQLite and JSON as sidecar storage formats: configured storage that augments
an existing primary item store instead of replacing it.
This is the second stress test for the codec architecture. Standalone SQLite
proves that a backend can own the whole item store; sidecars prove that Katalyst
can compose content from multiple storage units without collapsing codecs,
storage definitions, and check inputs into one filesystem-shaped object.
Value
Many knowledge bases keep human-authored markdown as the primary artifact while
placing structured data nearby: a JSON metadata file, an index database, an
annotation table, or generated enrichment state. Katalyst should be able to
validate and inspect that shape without forcing users to inline every attribute
into frontmatter.
Sidecars also test a harder architectural question than a standalone backend:
when item content is assembled from more than one source, the storage layer must
own identity and persistence while codecs own decoding. Checks should still see
the same content shapes they already understand.
Current State
The current filesystem backend treats one markdown file as one item. The
markdownbodytext codec parses frontmatter and body from that file, and checks
consume the resulting metadata/body shape.
The storage deep-dive already frames storage as a two-way mapping between
backend-native references and Katalyst collections/items. It also treats
Reference as opaque and keeps backend-native IO under
internal/storage/collection/<backend>.
What does not exist yet is a sidecar model: one Katalyst item assembled from a
primary reference plus one or more auxiliary references.
Design
Define sidecar storage explicitly
A sidecar is auxiliary storage attached to an item from a primary collection.
It should not become an independent collection unless configured that way.
The sidecar definition needs to answer three questions:
- How does Katalyst find the sidecar for a given item?
- How does decoded sidecar content merge with the primary item shape?
- Which operations are allowed to write the sidecar?
Start with sidecars attached to filesystem markdown collections, because that
is the shape users are most likely to have and the easiest way to compare
primary file behavior with augmented metadata behavior.
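The three questions above could map onto a definition type roughly like the following. This is a minimal sketch under stated assumptions: `SidecarDef`, `MergeMode`, and every field name are hypothetical and do not come from Katalyst's code.

```go
package main

import "fmt"

// MergeMode says where decoded sidecar fields land in the item metadata.
type MergeMode int

const (
	MergeNamespace MergeMode = iota // fields nest under SidecarDef.Namespace
	MergeTopLevel                   // fields merge beside frontmatter keys
)

// SidecarDef is a hypothetical sidecar definition; the names are assumptions.
type SidecarDef struct {
	Name      string
	Type      string // "json" or "sqlite"
	Locator   string // how to find the sidecar: a path pattern or a key column
	Merge     MergeMode
	Namespace string // required when Merge is MergeNamespace
	Writable  bool   // false keeps the sidecar read-only
}

// Validate rejects definitions that leave one of the three questions unanswered.
func (d SidecarDef) Validate() error {
	if d.Type != "json" && d.Type != "sqlite" {
		return fmt.Errorf("sidecar %q: unsupported type %q", d.Name, d.Type)
	}
	if d.Locator == "" {
		return fmt.Errorf("sidecar %q: no locator configured", d.Name)
	}
	if d.Merge == MergeNamespace && d.Namespace == "" {
		return fmt.Errorf("sidecar %q: namespace merge needs a namespace", d.Name)
	}
	return nil
}

func main() {
	def := SidecarDef{
		Name:      "enrichments",
		Type:      "sqlite",
		Locator:   "slug",
		Merge:     MergeNamespace,
		Namespace: "enrichments",
	}
	fmt.Println(def.Validate()) // <nil>
}
```

Leaving `Writable` false by default matches a read-only first cut: a definition has to opt in before any operation may write to the sidecar.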
Support JSON sidecars
JSON sidecars should cover the common file-adjacent case:
notes/dune.md
notes/dune.json
or a configured sidecar directory:
notes/dune.md
.katalyst/sidecars/notes/dune.json
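Both layouts can be resolved from the primary item's path. The function below is a sketch only: `jsonSidecarPath` and its parameters are hypothetical names, and it assumes slash-separated paths relative to the storage root.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// jsonSidecarPath maps a primary markdown path to its JSON sidecar path.
// An empty sidecarDir means the sidecar sits next to the markdown file;
// otherwise the item's relative path is mirrored under sidecarDir.
func jsonSidecarPath(itemPath, sidecarDir string) string {
	jsonPath := strings.TrimSuffix(itemPath, path.Ext(itemPath)) + ".json"
	if sidecarDir == "" {
		return jsonPath
	}
	return path.Join(sidecarDir, jsonPath)
}

func main() {
	fmt.Println(jsonSidecarPath("notes/dune.md", ""))                   // notes/dune.json
	fmt.Println(jsonSidecarPath("notes/dune.md", ".katalyst/sidecars")) // .katalyst/sidecars/notes/dune.json
}
```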
The JSON sidecar should decode to structured metadata and merge into the item
metadata map under a predictable policy.
The merge policy is part of this issue. Prefer a conservative first cut:
- require explicit config for whether sidecar fields merge at the top level or
under a namespace;
- report collisions clearly;
- avoid silently overwriting primary frontmatter.
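A conservative merge along these lines might look like the sketch below. `mergeSidecar` is a hypothetical helper, not an existing Katalyst function; it assumes metadata is a `map[string]any` and that an empty namespace means a top-level merge.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeSidecar returns a new metadata map combining primary frontmatter with
// decoded sidecar fields. With a namespace, sidecar fields nest under that key;
// without one they merge at the top level. Any key that would overwrite primary
// metadata is reported as a collision, and nothing is overwritten.
func mergeSidecar(primary, sidecar map[string]any, namespace string) (map[string]any, error) {
	merged := make(map[string]any, len(primary)+len(sidecar))
	for k, v := range primary {
		merged[k] = v
	}
	if namespace != "" {
		if _, taken := primary[namespace]; taken {
			return nil, fmt.Errorf("sidecar namespace %q collides with a frontmatter field", namespace)
		}
		merged[namespace] = sidecar
		return merged, nil
	}
	var collisions []string
	for k, v := range sidecar {
		if _, taken := primary[k]; taken {
			collisions = append(collisions, k)
			continue
		}
		merged[k] = v
	}
	if len(collisions) > 0 {
		sort.Strings(collisions) // sorted keys keep the diagnostic deterministic
		return nil, fmt.Errorf("sidecar fields collide with frontmatter: %v", collisions)
	}
	return merged, nil
}

func main() {
	front := map[string]any{"title": "Dune", "rating": 5}
	side := map[string]any{"rating": 4, "genre": "sf"}

	_, err := mergeSidecar(front, side, "")
	fmt.Println(err) // sidecar fields collide with frontmatter: [rating]

	merged, _ := mergeSidecar(front, side, "enrichments")
	fmt.Println(merged["rating"]) // 5: the frontmatter value is untouched
}
```

Returning an error instead of picking a winner keeps the first cut honest: the user resolves the collision in config or content, and primary frontmatter is never changed implicitly.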
Support SQLite sidecars
SQLite sidecars should cover the indexed/enrichment case: one database stores
extra rows keyed by the primary item id or coordinates.
Example shape:
type: filesystem
root: .
collections:
  notes:
    path: notes
    sidecars:
      enrichments:
        type: sqlite
        path: ./.katalyst/enrichments.sqlite
        table: note_enrichments
        key: slug
The exact config shape can change. The important constraint is that SQLite is
not the primary collection here; it is an auxiliary source keyed by the primary
collection's item identity.
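To illustrate the keyed lookup, the sketch below builds the statement a storage backend could run for one item. `sidecarLookupQuery` is a hypothetical helper; the table and key names are taken from the example config, and the identifier check is an assumption about how configured names might be guarded.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe accepts plain SQL identifiers only, so configured table and column
// names can be placed in the query text safely.
var identRe = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// sidecarLookupQuery builds the statement that fetches the sidecar row for one
// item. The item key itself stays a bound parameter (the "?" placeholder).
func sidecarLookupQuery(table, keyColumn string) (string, error) {
	for _, name := range []string{table, keyColumn} {
		if !identRe.MatchString(name) {
			return "", fmt.Errorf("invalid SQL identifier %q in sidecar config", name)
		}
	}
	return fmt.Sprintf(`SELECT * FROM "%s" WHERE "%s" = ?`, table, keyColumn), nil
}

func main() {
	q, err := sidecarLookupQuery("note_enrichments", "slug")
	fmt.Println(q, err)
	// A backend would then run the query through database/sql, binding the
	// primary item's key (for example "dune") with whichever SQLite driver
	// the project adopts.
}
```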
Keep composition in storage, decoding in codecs
Sidecar lookup and identity matching belong in storage/collection code. Decoding
the sidecar payload belongs in codecs when the payload shape is reusable.
Likely codec pressure points:
- a structured JSON object codec for .json sidecars;
- row/object decoding for SQLite sidecar rows;
- merge/overlay behavior that may belong above individual codecs if multiple
sidecar types share it.
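Row decoding is the least familiar of these pressure points, so here is one possible shape for it. `rowToMetadata` is hypothetical; it assumes the caller has already scanned a row into parallel column and value slices, that the key column identifies the item and so is left out of the metadata, and that the table holds text rather than binary blobs.

```go
package main

import "fmt"

// rowToMetadata turns one scanned sidecar row into a metadata map. columns and
// values are parallel slices, as a caller would get from database/sql's
// Rows.Columns and a Rows.Scan into []any. The key column is skipped, and
// []byte values are converted to strings because some drivers may hand text
// back as bytes.
func rowToMetadata(columns []string, values []any, keyColumn string) (map[string]any, error) {
	if len(columns) != len(values) {
		return nil, fmt.Errorf("row has %d columns but %d values", len(columns), len(values))
	}
	meta := make(map[string]any, len(columns))
	for i, col := range columns {
		if col == keyColumn {
			continue
		}
		if b, ok := values[i].([]byte); ok {
			meta[col] = string(b)
			continue
		}
		meta[col] = values[i]
	}
	return meta, nil
}

func main() {
	cols := []string{"slug", "genre", "word_count"}
	vals := []any{"dune", []byte("sf"), int64(1200)}
	meta, err := rowToMetadata(cols, vals, "slug")
	fmt.Println(meta, err)
}
```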
Do not make checks aware of sidecars. Checks should receive a content shape:
metadata, body, and any future typed shapes. They should not know whether a
field came from frontmatter, JSON, or SQLite unless the configured merge policy
chooses to preserve provenance.
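The boundary described here can be sketched with three small pieces: a codec that only decodes, a storage-side compose step, and a check that only sees content. All type and function names below (`Content`, `Codec`, `compose`, `requireField`) are hypothetical and chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Content is the shape checks receive; it carries no hint of where fields came from.
type Content struct {
	Metadata map[string]any
	Body     string
}

// Codec decodes raw bytes into metadata. Hypothetical interface for this sketch.
type Codec interface {
	Decode(raw []byte) (map[string]any, error)
}

// jsonObjectCodec decodes a JSON object payload.
type jsonObjectCodec struct{}

func (jsonObjectCodec) Decode(raw []byte) (map[string]any, error) {
	var out map[string]any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out, nil
}

// compose is the storage-side step: it holds the primary content and the raw
// sidecar bytes it looked up, delegates decoding to the codec, and nests the
// decoded fields under a namespace.
func compose(primary Content, sidecarRaw []byte, codec Codec, namespace string) (Content, error) {
	fields, err := codec.Decode(sidecarRaw)
	if err != nil {
		return Content{}, fmt.Errorf("decode sidecar: %w", err)
	}
	meta := make(map[string]any, len(primary.Metadata)+1)
	for k, v := range primary.Metadata {
		meta[k] = v
	}
	meta[namespace] = fields
	return Content{Metadata: meta, Body: primary.Body}, nil
}

// requireField is a check: it inspects Content and nothing else.
func requireField(c Content, key string) error {
	if _, ok := c.Metadata[key]; !ok {
		return fmt.Errorf("missing required field %q", key)
	}
	return nil
}

func main() {
	primary := Content{Metadata: map[string]any{"title": "Dune"}, Body: "..."}
	item, err := compose(primary, []byte(`{"genre":"sf"}`), jsonObjectCodec{}, "enrichments")
	fmt.Println(err, requireField(item, "enrichments"))
}
```

Swapping `jsonObjectCodec` for a row-based codec would leave `requireField` untouched, which is the property this section asks for.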
Open Questions
- Is sidecar composition a storage feature, a codec feature, or its own layer?
Context. Storage owns references and identity. Codecs own decoding.
Sidecars add composition across multiple decoded sources.
Recommendation. Start in the collection storage implementation, then
extract only if both JSON and SQLite sidecars share enough composition logic
to justify a common package.
- What is the first merge policy?
Context. Top-level merging is ergonomic but collision-prone. Namespacing
is safer but makes checks more verbose.
Recommendation. Support an explicit namespace first, and only allow
top-level merge with collision errors. Do not silently overwrite primary
frontmatter.
- Are sidecars read-only at first?
Context. The check and inspect commands only need reads. The fix and item
update commands raise harder questions about which source should be modified.
Recommendation. Make sidecars read-only in the first cut unless a
specific write behavior is needed to validate the storage seam. If read-only,
diagnostics should say which fields cannot be fixed because they come from a
sidecar.
- How should missing sidecars behave?
Context. Some sidecars are optional enrichments; others may be required
for validation.
Recommendation. Make required/optional explicit in config. Missing
required sidecars should be a storage/config diagnostic; missing optional
sidecars should produce an item with only primary content.
- Should provenance be preserved?
Context. Users may need to understand whether a field came from
frontmatter, JSON, or SQLite, especially when fixes are unavailable.
Recommendation. Preserve provenance internally if it is cheap, but do
not expose it in check APIs until a concrete diagnostic or write-path need
forces the shape.
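Two of the recommendations above lend themselves to a small sketch: explicit required/optional handling for missing sidecars, and read-only diagnostics backed by internally kept provenance. Everything named here (`readSidecar`, `fixBlocker`, `Source`, `demoRoot`) is hypothetical, and `fstest.MapFS` only stands in for the real storage root.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// Source names where a field came from. Hypothetical; not a Katalyst type.
type Source string

const (
	SourceFrontmatter Source = "frontmatter"
	SourceJSON        Source = "json"
	SourceSQLite      Source = "sqlite"
)

// readSidecar loads a sidecar file. A missing optional sidecar yields (nil, nil)
// so the caller builds the item from primary content alone; a missing required
// sidecar is returned as an error for the storage/config diagnostic.
func readSidecar(fsys fs.FS, name string, required bool) ([]byte, error) {
	raw, err := fs.ReadFile(fsys, name)
	if errors.Is(err, fs.ErrNotExist) {
		if required {
			return nil, fmt.Errorf("required sidecar %q is missing: %w", name, err)
		}
		return nil, nil
	}
	return raw, err
}

// fixBlocker explains why a field cannot be fixed while sidecars are read-only.
// It returns "" for frontmatter fields and for fields with no recorded source.
// The provenance map stays internal to storage and is not handed to checks.
func fixBlocker(provenance map[string]Source, key string) string {
	src, ok := provenance[key]
	if !ok || src == SourceFrontmatter {
		return ""
	}
	return fmt.Sprintf("field %q comes from a read-only %s sidecar and cannot be fixed", key, src)
}

// demoRoot is an in-memory file tree with a primary file and no sidecar.
var demoRoot = fstest.MapFS{"notes/dune.md": {Data: []byte("# Dune")}}

func main() {
	raw, err := readSidecar(demoRoot, "notes/dune.json", false)
	fmt.Println(raw == nil, err == nil) // true true

	_, err = readSidecar(demoRoot, "notes/dune.json", true)
	fmt.Println(err != nil) // true

	prov := map[string]Source{"title": SourceFrontmatter, "genre": SourceSQLite}
	fmt.Println(fixBlocker(prov, "genre"))
}
```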
Documentation Updates
- docs/content/deep-dives/storage.md: document sidecars as auxiliary storage
attached to a primary item store, including identity and write limitations.
- docs/content/reference/configuration.md: document sidecar configuration for
JSON and SQLite once the config shape is decided.
- docs/content/deep-dives/formatting.md: mention when metadata comes from
sidecars rather than markdown frontmatter, if user-visible.
- internal/storage/collection/AGENTS.md: add conventions for composition and
sidecar ownership.
- New backend-specific AGENTS.md files if JSON or SQLite sidecar code gets
its own package.
Rejected Alternatives
- Treat every sidecar as its own collection. That loses the point of a
sidecar: auxiliary content attached to the same item identity as the primary
store.
- Inline sidecar data into markdown frontmatter before checks run. This may
be an implementation detail, but it should not erase collision handling,
provenance, or write-path decisions.
- Support only JSON sidecars first and defer SQLite. JSON is useful, but
SQLite is the better architecture stress test because it forces keyed lookup,
non-file references, and row decoding.
- Let checks read sidecars directly. That would punch through both the
storage and codec boundaries and make each check family backend-aware.
Test Checklist
- katalyst check can validate fields supplied by a JSON sidecar.
- katalyst check can validate fields supplied by a SQLite sidecar.
- katalyst inspect includes sidecar-supplied fields in collection evidence
according to the merge policy.
- Merge collisions produce deterministic diagnostics.
- Required and optional sidecars behave differently and are covered by tests.
- Read-only sidecars produce clear fix or item update diagnostics if write
paths are not supported.
- go test ./... passes.