Stress-test codec layer with SQLite as a standalone storage format #101

Description

@abegong

Architecture spec proposal. Status: planning.

Overview

Add SQLite as the first non-filesystem standalone storage format: a configured
storage instance whose collections and items live in a .sqlite database rather
than in markdown files on disk.

This is primarily a stress test for the new internal/codec layer and the
storage seam. SQLite should force Katalyst to prove that checks, inspect, and
item operations consume codec-owned content shapes rather than relying on
filesystem assumptions.

Value

The codec layer exists so storage readers can decode backend-native units into
reusable item content shapes, while checks and inspectors stay independent of
where the data came from. A standalone SQLite backend is the first practical
test of that promise.

If SQLite can support the same collection and item workflows as filesystem
markdown without pushing SQL details into checks, inspect, fix, or cmd,
the architecture is doing real work rather than just moving the markdown parser
to a nicer package path.

Current State

Katalyst currently has the storage seam and a filesystem implementation:

  • internal/storage names backend kinds such as filesystem and owns the
    backend registry.
  • internal/storage/collection.CollectionDefinition maps backend-native
    references to Katalyst collections and items.
  • internal/storage/collection/filesystem implements the first backend, with
    FileIsItem granularity.
  • internal/codec/markdownbodytext parses and encodes markdown body text with
    YAML, TOML, or JSON frontmatter.
  • The storage deep-dive names SQLite as the first planned stress test because
    it forces the granularity question.

Existing loader tests already recognize type: sqlite as a desired storage
type, but no SQLite collection definition or read/write implementation exists
yet.

Design

Add a SQLite collection definition

Introduce a backend implementation under internal/storage/collection/sqlite
and register a sqlite storage type only when a real CollectionDefinition
exists.

The backend should keep SQL-specific concerns below the storage boundary:

  • opening the database;
  • discovering configured tables or queries;
  • mapping rows to item references and item ids;
  • listing, reading, adding, updating, and deleting items where the current
    command surface requires it;
  • translating backend-native errors into useful Katalyst diagnostics.

No SQL package should leak into internal/checks, internal/inspect,
internal/fix, or codec packages.
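
One way to picture the error-translation bullet above is a small wrapper that
turns database/sql errors into backend-neutral ones before they cross the
storage boundary. This is a minimal sketch, not the Katalyst API:
ErrItemNotFound, translateError, IsNotFound, and fakeScan are hypothetical
names, and fakeScan only stands in for a real row scan so the sketch runs
without a SQLite driver.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// ErrItemNotFound is a hypothetical backend-neutral sentinel that callers
// above the storage seam could test for without importing database/sql.
var ErrItemNotFound = errors.New("item not found")

// translateError converts a backend-native error into a diagnostic that
// names the collection and item. Unknown errors are formatted with %v, not
// %w, so the driver's error type does not leak to callers.
func translateError(err error, collection, itemID string) error {
	switch {
	case err == nil:
		return nil
	case errors.Is(err, sql.ErrNoRows):
		return fmt.Errorf("collection %q: item %q: %w", collection, itemID, ErrItemNotFound)
	default:
		return fmt.Errorf("collection %q: item %q: storage error: %v", collection, itemID, err)
	}
}

// IsNotFound lets checks, inspect, or cmd ask the question without knowing
// which backend produced the error.
func IsNotFound(err error) bool {
	return errors.Is(err, ErrItemNotFound)
}

// fakeScan stands in for (*sql.Row).Scan, which returns sql.ErrNoRows when
// the query matched no row.
func fakeScan(found bool) error {
	if !found {
		return sql.ErrNoRows
	}
	return nil
}

func main() {
	err := translateError(fakeScan(false), "notes", "hello")
	fmt.Println(err)
	fmt.Println("not found:", IsNotFound(err))
}
```

Under this sketch, only the SQLite backend package would ever mention
sql.ErrNoRows; everything above it would see ErrItemNotFound.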

Follow the Data Source / Data Asset split

Great Expectations' Data Source and Data Asset model is the useful prior art:
the connection to a backend and the logical named data inside it are separate
ideas, and a request/selector carries the parameters needed to recreate a
specific slice.

Map that to Katalyst without copying GX directly:

  • StorageInstance is the connectable SQLite database.
  • CollectionDefinition is the mapping from one configured table to one named
    Katalyst collection.
  • The configured id column is the row coordinate Katalyst selectors use for
    item identity.

This keeps table/id mapping storage-owned. It does not make SQL a codec
concept.

Exercise row-oriented granularity

SQLite uses the tabular granularity described in the storage deep-dive: one
table is a collection, and each row is an item.

A first cut can be deliberately narrow, but it should still force the
architecture through the non-file path:

type: sqlite
path: ./content.sqlite
collections:
  notes:
    table: notes
    id: slug
    body: body
    checks:
      - kind: object_required_field
        field: title

The first schema contract is:

  • one table per collection;
  • one configured id column;
  • scalar columns as metadata;
  • one optional configured body column.

The body column is useful for text/markdown checks, but the backend must be
valuable without it for structured-object checks.
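
As a rough illustration of the contract above, the per-collection config could
be mirrored by a small Go struct that validates the required fields and builds
the query for reading one item by id. This is a sketch under assumptions:
CollectionConfig, Validate, SelectItemSQL, and quoteIdent are hypothetical
names, and the real CollectionDefinition may look quite different.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// CollectionConfig is a hypothetical mirror of one entry under
// `collections:` in the YAML above. Field names are assumptions.
type CollectionConfig struct {
	Table string // required: the single table backing the collection
	ID    string // required: column that provides item identity
	Body  string // optional: column that provides body text
}

// Validate enforces the first-cut contract: one table and one id column.
// The body column is allowed to be empty.
func (c CollectionConfig) Validate() error {
	if c.Table == "" {
		return errors.New("sqlite collection: table is required")
	}
	if c.ID == "" {
		return errors.New("sqlite collection: id column is required")
	}
	return nil
}

// quoteIdent wraps an identifier in double quotes and doubles any embedded
// double quote, which is how SQLite escapes identifiers.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// SelectItemSQL returns a parameterized query that reads one row by item id.
// The id value is bound as a parameter, never interpolated into the string.
func (c CollectionConfig) SelectItemSQL() string {
	return fmt.Sprintf("SELECT * FROM %s WHERE %s = ?", quoteIdent(c.Table), quoteIdent(c.ID))
}

func main() {
	notes := CollectionConfig{Table: "notes", ID: "slug", Body: "body"}
	if err := notes.Validate(); err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(notes.SelectItemSQL())
}
```

Table and column names come from configuration, so the sketch quotes them
instead of trusting them, while the item id stays a bound parameter.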

Add or extract the codec shape only when forced

This issue should answer whether a second backend needs a new reusable codec or
a small adapter around existing content shapes.

Likely directions:

  • A row/object codec that turns a SQL row into the metadata map checks already
    consume.
  • A markdown body column that can reuse internal/codec/markdownbodytext for
    body-bearing rows.
  • A hybrid shape where structured columns provide metadata and one text column
    provides body text.

Do not add a generic internal/codec interface just because this backend
exists. The GX lesson is that backend connection/mapping concepts and content
decoding concepts are adjacent but distinct. Add or extract a codec only if row
decoding becomes reusable content-shape logic rather than SQLite mapping.
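
To make the hybrid direction concrete, here is a minimal sketch of row
decoding that stays on the storage side: structured columns become the
metadata map and one configured text column becomes the body. Item and
decodeRow are hypothetical names, and keeping the id column inside the
metadata map is an assumption, not a decision recorded in this issue.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for the content shape that checks consume:
// an id, a metadata map, and optional body text.
type Item struct {
	ID       string
	Metadata map[string]any
	Body     string
	HasBody  bool
}

// decodeRow turns one scanned row (column name -> value) into an Item.
// bodyCol may be empty, in which case every column is treated as metadata.
func decodeRow(row map[string]any, idCol, bodyCol string) (Item, error) {
	rawID, ok := row[idCol]
	if !ok || rawID == nil {
		return Item{}, fmt.Errorf("row has no value in id column %q", idCol)
	}
	item := Item{ID: fmt.Sprint(rawID), Metadata: make(map[string]any, len(row))}
	for col, val := range row {
		if bodyCol != "" && col == bodyCol {
			switch v := val.(type) {
			case string:
				item.Body, item.HasBody = v, true
			case []byte:
				item.Body, item.HasBody = string(v), true
			case nil:
				// A NULL body means the row simply has no body text.
			default:
				return Item{}, fmt.Errorf("body column %q holds %T, want text", bodyCol, val)
			}
			continue
		}
		// Assumption: the id column is kept in metadata like any other scalar.
		item.Metadata[col] = val
	}
	return item, nil
}

func main() {
	row := map[string]any{"slug": "hello", "title": "Hello", "body": "# Hi"}
	item, err := decodeRow(row, "slug", "body")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(item.ID, item.Metadata["title"], item.HasBody)
}
```

If this function stays this small and SQLite-specific, it is mapping logic and
belongs under storage; it would only justify a codec if another backend
needed the same decoding.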

Preserve existing command behavior where meaningful

katalyst check, katalyst inspect, and katalyst item add/update/delete
should work against a SQLite-backed collection using the same collection
selector concepts users already have.

Filesystem-specific checks remain filesystem-specific. Object and
text/markdown checks should work only when the SQLite item shape provides the
content they need.

fix is explicitly out of scope for the first cut. It can come later after the
SQLite read and item CRUD paths prove the storage seam.

Decisions

  1. Use one table per collection.

    A SQLite collection maps to one configured table. Each row is one item, the
    configured id column provides item identity, scalar columns become metadata,
    and one optional body column provides body text.

  2. Keep table/id mapping in storage.

    Great Expectations' Data Source / Data Asset / Batch Request split guides
    the shape: backend connection, logical collection mapping, and selectors are
    separate concepts. For Katalyst, that means SQLite connection and table/id
    mapping live under storage. Codecs should only appear when reusable content
    decoding is forced by the implementation.

  3. Include item CRUD, defer fix.

    item add, item update, and item delete are in scope for the first cut.
    fix is out of scope and should become follow-up work.

  4. Fail unsupported check families at config/load time.

    Filesystem-specific checks on SQLite-backed collections should produce a
    clear diagnostic during configuration or project load. They should not be
    skipped silently.
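
Decision 4 could be enforced with a small validation pass at project load.
The sketch below is illustrative only: the checkFamily registry,
validateChecks, and the filesystem_file_extension kind are hypothetical, and
only object_required_field comes from the config example in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// checkFamily maps a check kind to its family. This registry is a
// hypothetical stand-in for however Katalyst classifies checks.
var checkFamily = map[string]string{
	"object_required_field":     "object",     // kind taken from the config example
	"filesystem_file_extension": "filesystem", // hypothetical kind, for illustration
}

// validateChecks rejects filesystem-family checks on any non-filesystem
// storage type and names every offending kind in a single diagnostic.
func validateChecks(collection, storageType string, kinds []string) error {
	if storageType == "filesystem" {
		return nil
	}
	var rejected []string
	for _, kind := range kinds {
		if checkFamily[kind] == "filesystem" {
			rejected = append(rejected, kind)
		}
	}
	if len(rejected) == 0 {
		return nil
	}
	return fmt.Errorf("collection %q: storage type %q does not support filesystem checks: %s",
		collection, storageType, strings.Join(rejected, ", "))
}

func main() {
	kinds := []string{"object_required_field", "filesystem_file_extension"}
	if err := validateChecks("notes", "sqlite", kinds); err != nil {
		fmt.Println("load error:", err)
	}
}
```

Reporting every rejected kind at once, rather than stopping at the first,
lets a user fix the whole configuration in one pass.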

Documentation Updates

  • docs/content/deep-dives/storage.md: update the "first stress test" language
    with the implemented SQLite behavior and granularity.
  • docs/content/reference/configuration.md: document SQLite storage instance
    configuration once the config shape is real.
  • internal/storage/AGENTS.md: add backend registration and SQLite-specific
    conventions if they differ from filesystem.
  • internal/storage/collection/sqlite/AGENTS.md: document SQLite ownership,
    schema assumptions, and dependency boundaries.
  • docs/content/reference/glossary.md: update storage-related terms only if
    new public vocabulary is introduced.

Test Checklist

  • A SQLite storage instance can be loaded from .katalyst/storage/*.yaml.
  • A configured SQLite collection lists items using stable Katalyst item ids.
  • katalyst check runs structured-object checks against SQLite-backed rows.
  • katalyst inspect reports collection evidence for SQLite-backed rows.
  • katalyst item get proves SQLite item addressing works through the
    storage seam.
  • katalyst item add inserts a new SQLite row.
  • katalyst item update updates scalar metadata and, when configured, body
    content.
  • katalyst item delete deletes a SQLite row.
  • fix is explicitly unsupported or deferred for SQLite collections.
  • Unsupported filesystem-specific checks on SQLite collections produce a
    clear diagnostic.
  • No production package outside internal/storage/collection/sqlite
    imports a SQLite driver directly.
  • No codec package imports storage, checks, project, fix, inspect, or cmd.
  • go test ./... passes.
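
The two import-boundary items above could be checked mechanically. The sketch
below uses the standard go/parser package to list forbidden imports in a
single Go source file; a real test would walk the repository and skip
internal/storage/collection/sqlite. forbiddenImports is a hypothetical helper,
and the driver paths are examples of common Go SQLite drivers, since this
issue does not name one.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports parses only the import section of src and returns every
// import path that equals, or is nested under, one of the forbidden paths.
func forbiddenImports(filename, src string, forbidden []string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, spec := range file.Imports {
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, prefix := range forbidden {
			if path == prefix || strings.HasPrefix(path, prefix+"/") {
				hits = append(hits, path)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	// Example driver paths; the actual driver choice is an assumption.
	forbidden := []string{"database/sql", "github.com/mattn/go-sqlite3", "modernc.org/sqlite"}
	src := "package checks\n\nimport (\n\t\"database/sql\"\n\t\"fmt\"\n)\n"
	hits, err := forbiddenImports("checks.go", src, forbidden)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("forbidden imports:", hits)
}
```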

Rejected Alternatives

  • Implement SQLite as filesystem markdown files synchronized to a database.
    That would avoid the row granularity problem this issue is meant to expose.
  • Add a generic codec interface before implementing SQLite. A second backend
    should provide evidence for the abstraction; guessing it early risks hardening
    the wrong shape.
  • Start with a broad SQL abstraction. Supporting PostgreSQL, arbitrary
    queries, migrations, or multiple SQL dialects can come later. SQLite is
    valuable here because it is small enough to stress the seam without turning
    into a database framework.
