Architecture spec proposal. Status: planning.
Overview
Add SQLite as the first non-filesystem standalone storage format: a configured
storage instance whose collections and items live in a .sqlite database rather
than in markdown files on disk.
This is primarily a stress test for the new internal/codec layer and the
storage seam. SQLite should force Katalyst to prove that checks, inspect, and
item operations consume codec-owned content shapes rather than filesystem
assumptions.
Value
The codec layer exists so storage readers can decode backend-native units into
reusable item content shapes, while checks and inspectors stay independent of
where the data came from. A standalone SQLite backend is the first practical
test of that promise.
If SQLite can support the same collection and item workflows as filesystem
markdown without pushing SQL details into checks, inspect, fix, or cmd,
the architecture is doing real work rather than just moving the markdown parser
to a nicer package path.
Current State
Katalyst currently has the storage seam and a filesystem implementation:
- internal/storage names backend kinds such as filesystem and owns the
  backend registry.
- internal/storage/collection.CollectionDefinition maps backend-native
  references to Katalyst collections and items.
- internal/storage/collection/filesystem implements the first backend, with
  FileIsItem granularity.
- internal/codec/markdownbodytext parses and encodes markdown body text with
  YAML, TOML, or JSON frontmatter.
- The storage deep-dive names SQLite as the first planned stress test because
  it forces the granularity question.
There is already loader test coverage showing that type: sqlite is recognized
as a desired storage type shape, but no SQLite collection definition or
read/write implementation exists yet.
Design
Add a SQLite collection definition
Introduce a backend implementation under internal/storage/collection/sqlite
and register a sqlite storage type only when a real CollectionDefinition
exists.
The backend should keep SQL-specific concerns below the storage boundary:
- opening the database;
- discovering configured tables or queries;
- mapping rows to item references and item ids;
- listing, reading, adding, updating, and deleting items where the current
command surface requires it;
- translating backend-native errors into useful Katalyst diagnostics.
No SQL package should leak into internal/checks, internal/inspect,
internal/fix, or codec packages.
Follow the Data Source / Data Asset split
Great Expectations' Data Source and Data Asset model is the useful prior art:
the connection to a backend and the logical named data inside it are separate
ideas, and a request/selector carries the parameters needed to recreate a
specific slice.
Map that to Katalyst without copying GX directly:
- StorageInstance is the connectable SQLite database.
- CollectionDefinition is the mapping from one configured table to one named
  Katalyst collection.
- The configured id column is the row coordinate Katalyst selectors use for
  item identity.
This keeps table/id mapping storage-owned. It does not make SQL a codec
concept.
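The split above could be sketched in Go roughly as follows. The struct fields, TableMapping, quoteIdent, and selectByID are illustrative assumptions, not the existing Katalyst types. The sketch only shows that the connection, the table mapping, and the per-item selector are three separate pieces, and that the id value is bound as a query parameter rather than spliced into the SQL.

```go
package main

import (
	"fmt"
	"strings"
)

// StorageInstance is the connectable database (hypothetical field set).
type StorageInstance struct {
	Path string
}

// TableMapping is a hypothetical storage-owned mapping from one configured
// table to one named collection.
type TableMapping struct {
	Collection string
	Table      string
	IDColumn   string
}

// quoteIdent quotes a SQLite identifier, doubling embedded double quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// selectByID builds the parameterized query that resolves one item selector
// to one row. The id value itself is passed separately as a bound parameter.
func (m TableMapping) selectByID() string {
	return fmt.Sprintf("SELECT * FROM %s WHERE %s = ?",
		quoteIdent(m.Table), quoteIdent(m.IDColumn))
}

func main() {
	db := StorageInstance{Path: "./content.sqlite"}
	notes := TableMapping{Collection: "notes", Table: "notes", IDColumn: "slug"}
	fmt.Println(db.Path)
	fmt.Println(notes.selectByID())
}
```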
Exercise row-oriented granularity
SQLite uses the tabular granularity described in the storage deep-dive: one
table is a collection, and each row is an item.
A first cut can be deliberately narrow, but it should still force the
architecture through the non-file path:
    type: sqlite
    path: ./content.sqlite
    collections:
      notes:
        table: notes
        id: slug
        body: body
        checks:
          - kind: object_required_field
            field: title
The first schema contract is:
- one table per collection;
- one configured id column;
- scalar columns as metadata;
- one optional configured body column.
The body column is useful for text/markdown checks, but the backend must be
valuable without it for structured-object checks.
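A minimal sketch of that contract, assuming a row has already been scanned into a map keyed by column name. Item, RowMapping, and rowToItem are hypothetical names. Excluding the id column from the metadata map is also an assumption, since the contract does not say whether the id is repeated there.

```go
package main

import "fmt"

// Item is a hypothetical backend-neutral content shape.
type Item struct {
	ID       string
	Metadata map[string]any
	Body     string
	HasBody  bool
}

// RowMapping carries the configured column roles for one collection.
type RowMapping struct {
	IDColumn   string
	BodyColumn string // empty when no body column is configured
}

// rowToItem splits one row into identity, metadata, and optional body.
func rowToItem(m RowMapping, row map[string]any) (Item, error) {
	rawID, ok := row[m.IDColumn]
	if !ok || rawID == nil {
		return Item{}, fmt.Errorf("row has no value in id column %q", m.IDColumn)
	}
	item := Item{ID: fmt.Sprint(rawID), Metadata: map[string]any{}}
	for col, val := range row {
		switch {
		case col == m.IDColumn:
			continue // identity is carried by Item.ID (assumption)
		case m.BodyColumn != "" && col == m.BodyColumn:
			// Only string values count as body text in this sketch.
			if s, ok := val.(string); ok {
				item.Body, item.HasBody = s, true
			}
		default:
			item.Metadata[col] = val
		}
	}
	return item, nil
}

func main() {
	mapping := RowMapping{IDColumn: "slug", BodyColumn: "body"}
	row := map[string]any{"slug": "hello-world", "title": "Hello", "body": "# Hello\n"}
	item, err := rowToItem(mapping, row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(item.ID, item.Metadata, item.HasBody)
}
```

With BodyColumn left empty, the same function yields a metadata-only item, which is the case structured-object checks need.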
Add or extract the codec shape only when forced
This issue should answer whether a second backend needs a new reusable codec or
a small adapter around existing content shapes.
Likely directions:
- A row/object codec that turns a SQL row into the metadata map checks already
consume.
- A markdown body column that can reuse internal/codec/markdownbodytext for
  body-bearing rows.
- A hybrid shape where structured columns provide metadata and one text column
provides body text.
Do not add a generic internal/codec interface just because this backend
exists. The GX lesson is that backend connection/mapping concepts and content
decoding concepts are adjacent but distinct. Add or extract a codec only if row
decoding becomes reusable content-shape logic rather than SQLite mapping.
Preserve existing command behavior where meaningful
katalyst check, katalyst inspect, and katalyst item add/update/delete
should work against a SQLite-backed collection using the same collection
selector concepts users already have.
Filesystem-specific checks remain filesystem-specific. Object and
text/markdown checks should work only when the SQLite item shape provides the
content they need.
fix is explicitly out of scope for the first cut. It can come later after the
SQLite read and item CRUD paths prove the storage seam.
Decisions
- Use one table per collection.
  A SQLite collection maps to one configured table. Each row is one item, the
  configured id column provides item identity, scalar columns become metadata,
  and one optional body column provides body text.
- Keep table/id mapping in storage.
  Great Expectations' Data Source / Data Asset / Batch Request split guides
  the shape: backend connection, logical collection mapping, and selectors are
  separate concepts. For Katalyst, that means SQLite connection and table/id
  mapping live under storage. Codecs should only appear when reusable content
  decoding is forced by the implementation.
- Include item CRUD, defer fix.
  item add, item update, and item delete are in scope for the first cut.
  fix is out of scope and should become follow-up work.
- Fail unsupported check families at config/load time.
  Filesystem-specific checks on SQLite-backed collections should produce a
  clear diagnostic during configuration or project load. They should not be
  skipped silently.
Documentation Updates
- docs/content/deep-dives/storage.md: update the "first stress test" language
  with the implemented SQLite behavior and granularity.
- docs/content/reference/configuration.md: document SQLite storage instance
  configuration once the config shape is real.
- internal/storage/AGENTS.md: add backend registration and SQLite-specific
  conventions if they differ from filesystem.
- internal/storage/collection/sqlite/AGENTS.md: document SQLite ownership,
  schema assumptions, and dependency boundaries.
- docs/content/reference/glossary.md: update storage-related terms only if
  new public vocabulary is introduced.
Rejected Alternatives
- Implement SQLite as filesystem markdown files synchronized to a database.
That would avoid the row granularity problem this issue is meant to expose.
- Add a generic codec interface before implementing SQLite. A second backend
should provide evidence for the abstraction; guessing it early risks hardening
the wrong shape.
- Start with a broad SQL abstraction. Supporting PostgreSQL, arbitrary
queries, migrations, or multiple SQL dialects can come later. SQLite is
valuable here because it is small enough to stress the seam without turning
into a database framework.
Test Checklist
- … .katalyst/storage/*.yaml.
- katalyst check runs structured-object checks against SQLite-backed rows.
- katalyst inspect reports collection evidence for SQLite-backed rows.
- katalyst item get proves SQLite item addressing works through the
  storage seam.
- katalyst item add inserts a new SQLite row.
- katalyst item update updates scalar metadata and, when configured, body
  content.
- katalyst item delete deletes a SQLite row.
- fix is explicitly unsupported or deferred for SQLite collections.
- … clear diagnostic.
- … internal/storage/collection/sqlite imports a SQLite driver directly.
- go test ./... passes.