Stress-test codec architecture with SQLite and JSON sidecar storage formats #102

> Architecture spec proposal. Status: **planning**.

## Overview

Add SQLite and JSON as sidecar storage formats: configured storage that augments
an existing primary item store instead of replacing it.

This is the second stress test for the codec architecture. Standalone SQLite
proves that a backend can own the whole item store; sidecars prove that Katalyst
can compose content from multiple storage units without collapsing codecs,
storage definitions, and check inputs into one filesystem-shaped object.

## Value

Many knowledge bases keep human-authored markdown as the primary artifact while
placing structured data nearby: a JSON metadata file, an index database, an
annotation table, or generated enrichment state. Katalyst should be able to
validate and inspect that shape without forcing users to inline every attribute
into frontmatter.

Sidecars also test a harder architectural question than a standalone backend:
when item content is assembled from more than one source, the storage layer must
own identity and persistence while codecs own decoding. Checks should still see
the same content shapes they already understand.

## Current State

The current filesystem backend treats one markdown file as one item. The
`markdownbodytext` codec parses frontmatter and body from that file, and checks
consume the resulting metadata/body shape.

The storage deep-dive already frames storage as a two-way mapping between
backend-native references and Katalyst collections/items. It also treats
`Reference` as opaque and keeps backend-native IO under
`internal/storage/collection/<backend>`.

What does not exist yet is a sidecar model: one Katalyst item assembled from a
primary reference plus one or more auxiliary references.

## Design

### Define sidecar storage explicitly

A sidecar is auxiliary storage attached to an item from a primary collection.
It should not become an independent collection unless configured that way.

The sidecar definition needs to answer three questions:

1. How does Katalyst find the sidecar for a given item?
2. How does decoded sidecar content merge with the primary item shape?
3. Which operations are allowed to write the sidecar?

Start with sidecars attached to filesystem markdown collections, because that
is the shape users are most likely to have and the easiest way to compare
primary file behavior with augmented metadata behavior.
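
One way to keep the three questions visible is to give each one a field in the sidecar definition. The sketch below is illustrative only: `SidecarDef`, `MergeMode`, and every field name are hypothetical, not existing Katalyst types.

```go
package main

import "fmt"

// MergeMode says how decoded sidecar fields join the primary metadata map.
type MergeMode string

const (
	MergeNamespace MergeMode = "namespace" // fields land under a named key
	MergeTopLevel  MergeMode = "top-level" // fields land beside frontmatter
)

// SidecarDef is a hypothetical definition; names are assumptions for this sketch.
type SidecarDef struct {
	Name     string    // config key, e.g. "enrichments"
	Type     string    // "json" or "sqlite"
	Locate   string    // question 1: path rule or key column used for lookup
	Merge    MergeMode // question 2: how decoded content merges
	Writable bool      // question 3: whether any operation may write the sidecar
}

// Validate rejects definitions that leave one of the three questions open.
func (d SidecarDef) Validate() error {
	if d.Name == "" || d.Locate == "" {
		return fmt.Errorf("sidecar %q: name and locate rule are required", d.Name)
	}
	if d.Type != "json" && d.Type != "sqlite" {
		return fmt.Errorf("sidecar %q: unsupported type %q", d.Name, d.Type)
	}
	if d.Merge != MergeNamespace && d.Merge != MergeTopLevel {
		return fmt.Errorf("sidecar %q: merge mode must be explicit", d.Name)
	}
	return nil
}

func main() {
	def := SidecarDef{Name: "enrichments", Type: "sqlite", Locate: "slug", Merge: MergeNamespace}
	fmt.Println(def.Validate() == nil) // prints "true"
}
```

`Writable` defaults to `false`, which lines up with a read-only first cut.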

### Support JSON sidecars

JSON sidecars should cover the common file-adjacent case:

```text
notes/dune.md
notes/dune.json
```

or a configured sidecar directory:

```text
notes/dune.md
.katalyst/sidecars/notes/dune.json
```

The JSON sidecar should decode to structured metadata and merge into the item
metadata map under a predictable policy.
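
A minimal sketch of the two layouts above plus the decode step, assuming collection-relative slash paths. `jsonSidecarPath` and `decodeSidecar` are hypothetical helpers, and rejecting non-object payloads is an assumption rather than a decided rule.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

// jsonSidecarPath maps a primary markdown path to its JSON sidecar path.
// An empty sidecarDir means the sidecar sits next to the primary file;
// otherwise the primary's relative path is mirrored under sidecarDir.
func jsonSidecarPath(primary, sidecarDir string) string {
	stem := strings.TrimSuffix(primary, path.Ext(primary))
	return path.Join(sidecarDir, stem+".json")
}

// decodeSidecar decodes a sidecar payload into a metadata map.
// Only a top-level JSON object is accepted (assumption for this sketch).
func decodeSidecar(data []byte) (map[string]any, error) {
	var fields map[string]any
	if err := json.Unmarshal(data, &fields); err != nil {
		return nil, fmt.Errorf("decode JSON sidecar: %w", err)
	}
	if fields == nil {
		return nil, fmt.Errorf("decode JSON sidecar: top-level value must be an object")
	}
	return fields, nil
}

func main() {
	fmt.Println(jsonSidecarPath("notes/dune.md", ""))                   // notes/dune.json
	fmt.Println(jsonSidecarPath("notes/dune.md", ".katalyst/sidecars")) // .katalyst/sidecars/notes/dune.json
	fields, err := decodeSidecar([]byte(`{"rating": 5}`))
	fmt.Println(fields, err)
}
```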

The merge policy is part of this issue. Prefer a conservative first cut:

- require explicit config for whether sidecar fields merge at the top level or
  under a namespace;
- report collisions clearly;
- avoid silently overwriting primary frontmatter.
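
The conservative policy could look roughly like this. `mergeSidecar` is a hypothetical function, and using an empty namespace to mean "top-level merge" is an assumption made to keep the sketch short; real config would presumably spell the mode out.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeSidecar returns a new metadata map that combines primary frontmatter
// with sidecar fields. A non-empty namespace nests the sidecar fields under
// that key; an empty namespace merges them at the top level. Collisions are
// errors in both modes, so primary frontmatter is never overwritten.
func mergeSidecar(primary, sidecar map[string]any, namespace string) (map[string]any, error) {
	out := make(map[string]any, len(primary)+len(sidecar))
	for k, v := range primary {
		out[k] = v
	}
	if namespace != "" {
		if _, taken := primary[namespace]; taken {
			return nil, fmt.Errorf("sidecar namespace %q collides with a primary field", namespace)
		}
		out[namespace] = sidecar
		return out, nil
	}
	var collisions []string
	for k, v := range sidecar {
		if _, taken := primary[k]; taken {
			collisions = append(collisions, k)
			continue
		}
		out[k] = v
	}
	if len(collisions) > 0 {
		sort.Strings(collisions) // map order is random; sort for stable diagnostics
		return nil, fmt.Errorf("sidecar fields collide with primary frontmatter: %v", collisions)
	}
	return out, nil
}

func main() {
	primary := map[string]any{"title": "Dune"}
	merged, err := mergeSidecar(primary, map[string]any{"rating": 5}, "enrichments")
	fmt.Println(merged, err)
	_, err = mergeSidecar(primary, map[string]any{"title": "Other"}, "")
	fmt.Println(err)
}
```

Sorting the colliding keys keeps the diagnostic deterministic regardless of map iteration order.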

### Support SQLite sidecars

SQLite sidecars should cover the indexed/enrichment case: one database stores
extra rows keyed by the primary item id or coordinates.

Example shape:

```yaml
type: filesystem
root: .
collections:
  notes:
    path: notes
    sidecars:
      enrichments:
        type: sqlite
        path: ./.katalyst/enrichments.sqlite
        table: note_enrichments
        key: slug
```

The exact config shape can change. The important constraint is that SQLite is
not the primary collection here; it is an auxiliary source keyed by the primary
collection's item identity.
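
To make the keyed lookup concrete, here is a sketch of how the `table` and `key` settings might turn into a parameterized query. `SQLiteSidecar`, `quoteIdent`, and `LookupQuery` are hypothetical names, and `SELECT *` is a placeholder until the row-to-metadata mapping is decided.

```go
package main

import (
	"fmt"
	"strings"
)

// SQLiteSidecar mirrors the example config above; the type is hypothetical.
type SQLiteSidecar struct {
	Path  string // database file
	Table string // table holding the auxiliary rows
	Key   string // column matched against the primary item's identity
}

// quoteIdent wraps an identifier in double quotes and doubles any embedded
// quote, so config-supplied table and column names cannot break the statement.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// LookupQuery builds the statement that fetches the row for one item.
// The item key itself stays a bound parameter (?), never string-interpolated.
func (s SQLiteSidecar) LookupQuery() string {
	return fmt.Sprintf("SELECT * FROM %s WHERE %s = ?", quoteIdent(s.Table), quoteIdent(s.Key))
}

func main() {
	s := SQLiteSidecar{Path: "./.katalyst/enrichments.sqlite", Table: "note_enrichments", Key: "slug"}
	fmt.Println(s.LookupQuery())
	// prints: SELECT * FROM "note_enrichments" WHERE "slug" = ?
}
```

The resulting string would be passed, together with the item's key value, to a `database/sql` query call backed by whichever SQLite driver the project adopts.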

### Keep composition in storage, decoding in codecs

Sidecar lookup and identity matching belong in storage/collection code. Decoding
the sidecar payload belongs in codecs when the payload shape is reusable.

Likely codec pressure points:

- a structured JSON object codec for `.json` sidecars;
- row/object decoding for SQLite sidecar rows;
- merge/overlay behavior that may belong above individual codecs if multiple
  sidecar types share it.

Do not make checks aware of sidecars. Checks should receive a content shape:
metadata, body, and any future typed shapes. They should not know whether a
field came from frontmatter, JSON, or SQLite unless the configured merge policy
chooses to preserve provenance.
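
The boundary can be sketched with three small pieces: a codec that only decodes bytes, a storage-side compose step that attaches the decoded fields to an item, and a check that sees nothing but the resulting shape. All names here (`Content`, `Codec`, `jsonObjectCodec`, `compose`, `requireKey`) are hypothetical and not taken from the existing codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Content is the shape checks consume: metadata plus body.
type Content struct {
	Metadata map[string]any
	Body     string
}

// Codec decodes a raw payload. It knows nothing about items or references.
type Codec interface {
	Decode(raw []byte) (map[string]any, error)
}

// jsonObjectCodec is a stand-in for a structured JSON object codec.
type jsonObjectCodec struct{}

func (jsonObjectCodec) Decode(raw []byte) (map[string]any, error) {
	var fields map[string]any
	err := json.Unmarshal(raw, &fields)
	return fields, err
}

// compose is the storage-side step. Storage has already located the payload
// for this item; the codec decodes it; the result nests under a namespace.
func compose(primary Content, raw []byte, codec Codec, namespace string) (Content, error) {
	fields, err := codec.Decode(raw)
	if err != nil {
		return Content{}, fmt.Errorf("decode sidecar %q: %w", namespace, err)
	}
	meta := make(map[string]any, len(primary.Metadata)+1)
	for k, v := range primary.Metadata {
		meta[k] = v
	}
	meta[namespace] = fields
	return Content{Metadata: meta, Body: primary.Body}, nil
}

// requireKey is a check. It receives Content and cannot tell where a key came from.
func requireKey(c Content, key string) error {
	if _, ok := c.Metadata[key]; !ok {
		return fmt.Errorf("missing metadata key %q", key)
	}
	return nil
}

func main() {
	primary := Content{Metadata: map[string]any{"title": "Dune"}, Body: "..."}
	item, err := compose(primary, []byte(`{"rating": 5}`), jsonObjectCodec{}, "enrichments")
	fmt.Println(err, requireKey(item, "enrichments"))
}
```

Swapping `jsonObjectCodec` for a row decoder would leave both `compose` and `requireKey` unchanged, which is the property this section asks for.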

## Open Questions

1. **Is sidecar composition a storage feature, a codec feature, or its own
   layer?**

   **Context.** Storage owns references and identity. Codecs own decoding.
   Sidecars add composition across multiple decoded sources.

   **Recommendation.** Start in the collection storage implementation, then
   extract only if both JSON and SQLite sidecars share enough composition logic
   to justify a common package.

2. **What is the first merge policy?**

   **Context.** Top-level merging is ergonomic but collision-prone. Namespacing
   is safer but makes checks more verbose.

   **Recommendation.** Support an explicit namespace first. Allow top-level
   merge only when collisions are reported as errors. Do not silently overwrite
   primary frontmatter.

3. **Are sidecars read-only at first?**

   **Context.** `check` and `inspect` only need reads. `fix` and `item update`
   raise harder questions about which source should be modified.

   **Recommendation.** Make sidecars read-only in the first cut unless a
   specific write behavior is needed to validate the storage seam. If read-only,
   diagnostics should say which fields cannot be fixed because they come from a
   sidecar.

4. **How should missing sidecars behave?**

   **Context.** Some sidecars are optional enrichments; others may be required
   for validation.

   **Recommendation.** Make required/optional explicit in config. Missing
   required sidecars should be a storage/config diagnostic; missing optional
   sidecars should produce an item with only primary content.

5. **Should provenance be preserved?**

   **Context.** Users may need to understand whether a field came from
   frontmatter, JSON, or SQLite, especially when fixes are unavailable.

   **Recommendation.** Preserve provenance internally if it is cheap, but do
   not expose it in check APIs until a concrete diagnostic or write-path need
   forces the shape.
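
For question 4, the required/optional split might reduce to a small wrapper around whatever reads the sidecar. This is a sketch under assumptions: `resolveSidecar` is a hypothetical helper, and "missing" is detected via `fs.ErrNotExist`, which fits file-based sidecars; a SQLite sidecar would need its own "no row" signal.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// resolveSidecar runs read and classifies the outcome. A missing optional
// sidecar is not an error (found=false), so the item keeps only primary
// content. A missing required sidecar becomes a storage diagnostic.
func resolveSidecar(name string, required bool, read func() ([]byte, error)) (data []byte, found bool, err error) {
	data, err = read()
	switch {
	case err == nil:
		return data, true, nil
	case errors.Is(err, fs.ErrNotExist) && !required:
		return nil, false, nil
	case errors.Is(err, fs.ErrNotExist):
		return nil, false, fmt.Errorf("required sidecar %q is missing: %w", name, err)
	default:
		return nil, false, fmt.Errorf("read sidecar %q: %w", name, err)
	}
}

// Stand-in readers for the demo; real code would read a file or a row.
func readMissing() ([]byte, error) { return nil, fs.ErrNotExist }
func readPresent() ([]byte, error) { return []byte(`{"rating": 5}`), nil }

func main() {
	_, found, err := resolveSidecar("meta", false, readMissing)
	fmt.Println(found, err) // prints "false <nil>"
	_, _, err = resolveSidecar("meta", true, readMissing)
	fmt.Println(err != nil) // prints "true"
	data, found, _ := resolveSidecar("meta", true, readPresent)
	fmt.Println(found, len(data) > 0) // prints "true true"
}
```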

## Documentation Updates

- `docs/content/deep-dives/storage.md`: document sidecars as auxiliary storage
  attached to a primary item store, including identity and write limitations.
- `docs/content/reference/configuration.md`: document sidecar configuration for
  JSON and SQLite once the config shape is decided.
- `docs/content/deep-dives/formatting.md`: mention when metadata comes from
  sidecars rather than markdown frontmatter, if user-visible.
- `internal/storage/collection/AGENTS.md`: add conventions for composition and
  sidecar ownership.
- New backend-specific `AGENTS.md` files if JSON or SQLite sidecar code gets
  its own package.

## Test Checklist

- [ ] A filesystem markdown collection can configure a JSON sidecar.
- [ ] A filesystem markdown collection can configure a SQLite sidecar.
- [ ] `katalyst check` can validate fields supplied by a JSON sidecar.
- [ ] `katalyst check` can validate fields supplied by a SQLite sidecar.
- [ ] `katalyst inspect` includes sidecar-supplied fields in collection
      evidence according to the merge policy.
- [ ] Field collisions between primary frontmatter and sidecars produce clear,
      deterministic diagnostics.
- [ ] Missing required sidecars and missing optional sidecars behave
      differently and are covered by tests.
- [ ] If write paths are not supported, `fix` and `item update` diagnostics
      state explicitly that sidecars are read-only.
- [ ] Checks and inspectors do not import sidecar backend packages directly.
- [ ] `go test ./...` passes.
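
The import-boundary item could be enforced mechanically. The sketch below parses only the import block of a source file and reports paths under a forbidden prefix; `forbiddenImports` is a hypothetical helper, and the module path `example.com/katalyst` is a placeholder, not the real one.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports parses only the imports of src and returns every import
// path that starts with one of the given prefixes.
func forbiddenImports(src string, prefixes []string) ([]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "check.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, spec := range file.Imports {
		importPath, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, prefix := range prefixes {
			if strings.HasPrefix(importPath, prefix) {
				hits = append(hits, importPath)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	// Placeholder source for a check package that breaks the boundary.
	src := "package checks\n\nimport (\n\t\"fmt\"\n\t\"example.com/katalyst/internal/storage/collection/sqlite\"\n)\n"
	hits, err := forbiddenImports(src, []string{"example.com/katalyst/internal/storage/collection/"})
	fmt.Println(hits, err)
}
```

A real test would walk the check and inspector packages on disk and fail when the returned slice is non-empty.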

## Rejected Alternatives

- **Treat every sidecar as its own collection.** That loses the point of a
  sidecar: auxiliary content attached to the same item identity as the primary
  store.
- **Inline sidecar data into markdown frontmatter before checks run.** This may
  be an implementation detail, but it should not erase collision handling,
  provenance, or write-path decisions.
- **Support only JSON sidecars first and defer SQLite.** JSON is useful, but
  SQLite is the better architecture stress test because it forces keyed lookup,
  non-file references, and row decoding.
- **Let checks read sidecars directly.** That would punch through both the
  storage and codec boundaries and make each check family backend-aware.

