# Stress-test codec layer with SQLite as a standalone storage format

> Architecture spec proposal. Status: **planning**.

## Overview

Add SQLite as the first non-filesystem standalone storage format: a configured
storage instance whose collections and items live in a `.sqlite` database rather
than in markdown files on disk.

This is primarily a stress test for the new `internal/codec` layer and the
storage seam. SQLite should force Katalyst to prove that checks, inspect, and
item operations consume codec-owned content shapes rather than filesystem
assumptions.

## Value

The codec layer exists so storage readers can decode backend-native units into
reusable item content shapes, while checks and inspectors stay independent of
where the data came from. A standalone SQLite backend is the first practical
test of that promise.

If SQLite can support the same collection and item workflows as filesystem
markdown without pushing SQL details into `checks`, `inspect`, `fix`, or `cmd`,
the architecture is doing real work rather than just moving the markdown parser
to a nicer package path.

## Current State

Katalyst currently has the storage seam and a filesystem implementation:

- `internal/storage` names backend kinds such as `filesystem` and owns the
  backend registry.
- `internal/storage/collection.CollectionDefinition` maps backend-native
  references to Katalyst collections and items.
- `internal/storage/collection/filesystem` implements the first backend, with
  `FileIsItem` granularity.
- `internal/codec/markdownbodytext` parses and encodes markdown body text with
  YAML, TOML, or JSON frontmatter.
- The storage deep-dive names SQLite as the first planned stress test because
  it forces the granularity question.

Loader test coverage already shows that `type: sqlite` is recognized as an
intended storage type, but no SQLite `CollectionDefinition` or read/write
implementation exists yet.

## Design

### Add a SQLite collection definition

Introduce a backend implementation under `internal/storage/collection/sqlite`
and register a `sqlite` storage type only when a real `CollectionDefinition`
exists.

The backend should keep SQL-specific concerns below the storage boundary:

- opening the database;
- discovering configured tables (arbitrary queries are deferred; see Rejected
  Alternatives);
- mapping rows to item references and item ids;
- listing, reading, adding, updating, and deleting items where the current
  command surface requires it;
- translating backend-native errors into useful Katalyst diagnostics.

No SQL package should leak into `internal/checks`, `internal/inspect`,
`internal/fix`, or codec packages.

### Follow the Data Source / Data Asset split

Great Expectations' Data Source and Data Asset model is the useful prior art:
the connection to a backend and the logical named data inside it are separate
ideas, and a request/selector carries the parameters needed to recreate a
specific slice.

Map that to Katalyst without copying GX directly:

- `StorageInstance` is the connectable SQLite database.
- `CollectionDefinition` is the mapping from one configured table to one named
  Katalyst collection.
- The configured id column is the row coordinate Katalyst selectors use for
  item identity.

This keeps table/id mapping storage-owned. It does not make SQL a codec
concept.
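
One way to picture storage-owned table/id mapping is the sketch below. The
`tableMapping` type and its methods are assumptions made for illustration, not
the shape of the existing `CollectionDefinition`. The sketch quotes configured
identifiers and binds the item id as a parameter, which keeps all SQL text
inside the storage package.

```go
package main

import (
	"fmt"
	"strings"
)

// tableMapping is a hypothetical stand-in for the SQLite-specific part of a
// collection definition: one table, one id column.
type tableMapping struct {
	Collection string // Katalyst collection name
	Table      string // SQLite table name
	IDColumn   string // column whose value is the item id
}

// quoteIdent wraps a SQL identifier in double quotes and doubles any embedded
// quote, so configured names cannot break out of the statement.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// selectByID builds the statement for reading one item. The id itself is a
// bound parameter (?), never interpolated into the SQL text.
func (m tableMapping) selectByID() string {
	return fmt.Sprintf("SELECT * FROM %s WHERE %s = ?",
		quoteIdent(m.Table), quoteIdent(m.IDColumn))
}

func main() {
	m := tableMapping{Collection: "notes", Table: "notes", IDColumn: "slug"}
	// Prints: SELECT * FROM "notes" WHERE "slug" = ?
	fmt.Println(m.selectByID())
}
```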

### Exercise row-oriented granularity

SQLite uses the tabular granularity described in the storage deep-dive: one
table is a collection, and each row is an item.

A first cut can be deliberately narrow, but it should still force the
architecture through the non-file path:

```yaml
type: sqlite
path: ./content.sqlite
collections:
  notes:
    table: notes
    id: slug
    body: body
    checks:
      - kind: object_required_field
        field: title
```

The first schema contract is:

- one table per collection;
- one configured id column;
- scalar columns as metadata;
- one optional configured body column.

The body column is useful for text/markdown checks, but the backend must be
valuable without it for structured-object checks.
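
The schema contract above could be enforced at load time with something like
the following sketch. The `collectionConfig` struct and its `problems` method
are hypothetical, and the rule that the body column must differ from the id
column is an assumption added here rather than something this proposal states.

```go
package main

import "fmt"

// collectionConfig mirrors the YAML keys shown above. The struct and its
// field names are assumptions, not the loader's real types.
type collectionConfig struct {
	Table string
	ID    string
	Body  string // optional; empty means the collection has no body column
}

// problems returns one message per violation of the first-cut contract.
func (c collectionConfig) problems(name string) []string {
	var out []string
	if c.Table == "" {
		out = append(out, fmt.Sprintf("collection %q: table is required", name))
	}
	if c.ID == "" {
		out = append(out, fmt.Sprintf("collection %q: id column is required", name))
	}
	// Assumption: a single column cannot serve as both id and body.
	if c.Body != "" && c.Body == c.ID {
		out = append(out, fmt.Sprintf("collection %q: body and id must be different columns", name))
	}
	return out
}

func main() {
	ok := collectionConfig{Table: "notes", ID: "slug", Body: "body"}
	bad := collectionConfig{Table: "notes"}
	fmt.Println(len(ok.problems("notes")), bad.problems("notes"))
}
```

Note that a config without `body` passes validation, matching the requirement
that the backend stay useful for structured-object checks alone.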

### Add or extract the codec shape only when forced

This issue should answer whether a second backend needs a new reusable codec or
a small adapter around existing content shapes.

Likely directions:

- A row/object codec that turns a SQL row into the metadata map checks already
  consume.
- A markdown body column that can reuse `internal/codec/markdownbodytext` for
  body-bearing rows.
- A hybrid shape where structured columns provide metadata and one text column
  provides body text.

Do not add a generic `internal/codec` interface just because this backend
exists. The GX lesson is that backend connection/mapping concepts and content
decoding concepts are adjacent but distinct. Add or extract a codec only if row
decoding becomes reusable content-shape logic rather than SQLite mapping.
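
To make the hybrid direction concrete, here is a rough sketch of splitting one
row into metadata and body text. `itemContent` and `decodeRow` are hypothetical
names, and whether the id column stays in the metadata map is an open choice;
this sketch assumes it does. Nothing here depends on SQL, which is roughly the
test for whether such logic belongs in a codec or in the SQLite mapping.

```go
package main

import "fmt"

// itemContent is a hypothetical hybrid shape: metadata from scalar columns
// plus optional body text from one configured column.
type itemContent struct {
	ID       string
	Metadata map[string]any
	Body     string
	HasBody  bool
}

// decodeRow splits one row (column name -> value) into an itemContent.
// Assumption: the id column stays in Metadata; the body column does not.
func decodeRow(row map[string]any, idColumn, bodyColumn string) (itemContent, error) {
	rawID, ok := row[idColumn]
	if !ok || rawID == nil {
		return itemContent{}, fmt.Errorf("row has no value in id column %q", idColumn)
	}
	item := itemContent{ID: fmt.Sprint(rawID), Metadata: map[string]any{}}
	for col, val := range row {
		if bodyColumn == "" || col != bodyColumn {
			item.Metadata[col] = val
			continue
		}
		// Drivers may hand TEXT back as string or []byte; NULL means no body.
		switch v := val.(type) {
		case nil:
		case string:
			item.Body, item.HasBody = v, true
		case []byte:
			item.Body, item.HasBody = string(v), true
		default:
			return itemContent{}, fmt.Errorf("body column %q holds %T, want text", bodyColumn, val)
		}
	}
	return item, nil
}

func main() {
	row := map[string]any{"slug": "hello-world", "title": "Hello", "body": "# Hello\n"}
	item, err := decodeRow(row, "slug", "body")
	fmt.Println(item.ID, item.Metadata["title"], item.HasBody, err)
}
```

The resulting `Body` string is what a body-bearing row could hand to
`internal/codec/markdownbodytext`, if reuse turns out to be warranted.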

### Preserve existing command behavior where meaningful

`katalyst check`, `katalyst inspect`, and `katalyst item get/add/update/delete`
should work against a SQLite-backed collection using the same collection
selector concepts users already have.

Filesystem-specific checks remain filesystem-specific. Object checks run
against row metadata, and text/markdown checks apply only when the collection
configures a body column.

`fix` is explicitly out of scope for the first cut. It can come later after the
SQLite read and item CRUD paths prove the storage seam.

## Decisions

1. **Use one table per collection.**

   A SQLite collection maps to one configured table. Each row is one item, the
   configured id column provides item identity, scalar columns become metadata,
   and one optional body column provides body text.

2. **Keep table/id mapping in storage.**

   Great Expectations' Data Source / Data Asset / Batch Request split guides
   the shape: backend connection, logical collection mapping, and selectors are
   separate concepts. For Katalyst, that means SQLite connection and table/id
   mapping live under storage. Codecs should only appear when reusable content
   decoding is forced by the implementation.

3. **Include item CRUD, defer fix.**

   `item add`, `item update`, and `item delete` are in scope for the first cut.
   `fix` is out of scope and should become follow-up work.

4. **Fail unsupported check families at config/load time.**

   Filesystem-specific checks on SQLite-backed collections should produce a
   clear diagnostic during configuration or project load. They should not be
   skipped silently.

## Documentation Updates

- `docs/content/deep-dives/storage.md`: update the "first stress test" language
  with the implemented SQLite behavior and granularity.
- `docs/content/reference/configuration.md`: document SQLite storage instance
  configuration once the config shape is real.
- `internal/storage/AGENTS.md`: add backend registration and SQLite-specific
  conventions if they differ from filesystem.
- `internal/storage/collection/sqlite/AGENTS.md`: document SQLite ownership,
  schema assumptions, and dependency boundaries.
- `docs/content/reference/glossary.md`: update storage-related terms only if
  new public vocabulary is introduced.

## Test Checklist

- [ ] A SQLite storage instance can be loaded from `.katalyst/storage/*.yaml`.
- [ ] A configured SQLite collection lists items using stable Katalyst item ids.
- [ ] `katalyst check` runs structured-object checks against SQLite-backed rows.
- [ ] `katalyst inspect` reports collection evidence for SQLite-backed rows.
- [ ] `katalyst item get` proves SQLite item addressing works through the
      storage seam.
- [ ] `katalyst item add` inserts a new SQLite row.
- [ ] `katalyst item update` updates scalar metadata and, when configured, body
      content.
- [ ] `katalyst item delete` deletes a SQLite row.
- [ ] `fix` is explicitly unsupported or deferred for SQLite collections.
- [ ] Unsupported filesystem-specific checks on SQLite collections produce a
      clear diagnostic.
- [ ] No production package outside `internal/storage/collection/sqlite`
      imports a SQLite driver directly.
- [ ] No codec package imports storage, checks, project, fix, inspect, or cmd.
- [ ] `go test ./...` passes.
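
The driver-import boundary in the checklist could be enforced mechanically. The
sketch below uses `go/parser` in imports-only mode. The two driver paths are
examples of existing Go SQLite drivers, listed only to make the sketch concrete;
this proposal does not pick a driver. The function names are hypothetical, and
a real test would walk the repository rather than take source as a string.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// Example driver import paths. Which driver Katalyst adopts is not decided
// here; these two are assumptions for the sketch.
var driverPaths = []string{"github.com/mattn/go-sqlite3", "modernc.org/sqlite"}

// allowedPkg is the only package directory permitted to import a driver.
const allowedPkg = "internal/storage/collection/sqlite"

// driverImports parses only the import block of src and returns any driver
// paths it finds.
func driverImports(filename, src string) ([]string, error) {
	file, err := parser.ParseFile(token.NewFileSet(), filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var found []string
	for _, imp := range file.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, d := range driverPaths {
			if path == d {
				found = append(found, path)
			}
		}
	}
	return found, nil
}

// violates reports whether a production file outside allowedPkg imports a
// driver. Test files are exempt because the rule covers production packages.
func violates(filename, src string) (bool, error) {
	if strings.HasPrefix(filename, allowedPkg+"/") || strings.HasSuffix(filename, "_test.go") {
		return false, nil
	}
	found, err := driverImports(filename, src)
	return len(found) > 0, err
}

func main() {
	src := "package checks\n\nimport _ \"modernc.org/sqlite\"\n"
	bad, err := violates("internal/checks/run.go", src)
	fmt.Println(bad, err)
}
```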

## Rejected Alternatives

- **Implement SQLite as filesystem markdown files synchronized to a database.**
  That would avoid the row granularity problem this issue is meant to expose.
- **Add a generic codec interface before implementing SQLite.** A second backend
  should provide evidence for the abstraction; guessing it early risks hardening
  the wrong shape.
- **Start with a broad SQL abstraction.** Supporting PostgreSQL, arbitrary
  queries, migrations, or multiple SQL dialects can come later. SQLite is
  valuable here because it is small enough to stress the seam without turning
  into a database framework.

