Improve raw-source inspector: document_shape should tell a coherent candidate-collection story #112

Description

@abegong

Context

document_shape is probably the most important raw-source inspector for the onboarding path: it should help a user understand which files appear to belong together as candidate collections. Today its generic classes / outliers output can be technically accurate but still fail to tell a coherent story.

After running katalyst inspect ., the reader should come away believing that Katalyst saw the actual documents, inferred useful similarities, and can point to the evidence.

Goal

Make document_shape answer, at a glance:

  • What candidate document groups or collections did Katalyst see?
  • What makes each group coherent: frontmatter keys, body sections, naming, location, or file type?
  • How many files are in each group?
  • Which representative files belong to each group?
  • Which documents are exceptions, and why?
  • Are there too few documents to infer a meaningful shape?

Possible shape

Render candidate groups as named or numbered groups with a short explanation and representative members. Translate feature tokens into user-facing evidence, for example:

  • frontmatter: title, status
  • sections: Review
  • naming: kebab-case markdown files
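
One way the token-to-evidence translation could look is sketched below. The `prefix:value` token format, the prefix names (`fm`, `sec`, `name`), and both function names are assumptions made for illustration; the inspector's real feature tokens may be shaped differently.

```python
from collections import defaultdict

# Hypothetical token prefixes; the real inspector's token format may differ.
LABELS = {"fm": "frontmatter", "sec": "sections", "name": "naming"}


def describe_evidence(tokens):
    """Group 'prefix:value' tokens into user-facing evidence lines."""
    grouped = defaultdict(list)
    for token in tokens:
        prefix, sep, value = token.partition(":")
        if not sep:
            # Token has no prefix: keep it visible rather than dropping it.
            grouped["other"].append(token)
            continue
        # Unknown prefixes fall back to the raw prefix as the label.
        grouped[LABELS.get(prefix, prefix)].append(value)
    return [f"{label}: {', '.join(values)}" for label, values in grouped.items()]


def render_group(index, tokens, paths, max_examples=3):
    """Render one candidate group: count, evidence, representative paths."""
    lines = [f"Group {index} ({len(paths)} files)"]
    lines += [f"  {line}" for line in describe_evidence(tokens)]
    lines += [f"  e.g. {path}" for path in sorted(paths)[:max_examples]]
    return "\n".join(lines)
```

Calling `describe_evidence(["fm:title", "fm:status", "sec:Review"])` would yield the lines `frontmatter: title, status` and `sections: Review`, matching the style of the examples above.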

Outliers should be framed as exceptions to an understandable pattern, not as the primary organizing principle.
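
To frame an outlier as an exception to a group, one option is to diff its features against the group's shared features and report what is missing or extra. This is only a sketch: `explain_outlier` and the `prefix:value` tokens are hypothetical, not existing Katalyst names.

```python
def explain_outlier(group_tokens, doc_tokens):
    """Describe how one document differs from the group's shared features."""
    group, doc = set(group_tokens), set(doc_tokens)
    missing = sorted(group - doc)  # features the group has but the document lacks
    extra = sorted(doc - group)    # features only the document has
    parts = []
    if missing:
        parts.append("missing " + ", ".join(missing))
    if extra:
        parts.append("extra " + ", ".join(extra))
    return "; ".join(parts) or "no visible difference"
```

For a group sharing `fm:title` and `fm:status`, a document with `fm:title` and `sec:Notes` would be explained as `missing fm:status; extra sec:Notes`.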

Acceptance criteria

  • katalyst inspect . --inspector document_shape presents candidate document groups in a way humans can scan and AI agents can cite.
  • Each group includes count, concrete representative paths, and the visible evidence behind the grouping.
  • Outliers include concrete paths and an explanation of what differs.
  • Very small inputs avoid overclaiming; the output can say there is not enough evidence to infer a stable document shape.
  • Existing JSON output remains complete and parseable; any schema changes are intentional and covered by tests.
  • Snapshot tests cover a coherent collection, a mixed directory, and a tiny directory.
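
The small-input criterion could be met with a guard along these lines. The threshold of 5, the function names, and the JSON field names are all assumptions for illustration; the issue does not specify a threshold or a schema.

```python
import json

MIN_DOCS = 5  # assumed threshold for illustration; the issue does not specify one


def shape_summary(paths, min_docs=MIN_DOCS):
    """Summarize whether there are enough documents to claim a shape."""
    count = len(paths)
    enough = count >= min_docs
    if enough:
        message = f"{count} documents inspected."
    else:
        message = (
            f"Only {count} document(s) found; "
            "not enough evidence to infer a stable document shape."
        )
    # Field names here are hypothetical, not the inspector's actual schema.
    return {"document_count": count, "enough_evidence": enough, "message": message}


def to_json(summary):
    """Serialize with sorted keys so snapshot tests get stable output."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

A tiny-directory snapshot test could then assert on `to_json(shape_summary(paths))` and confirm that the output round-trips through `json.loads`.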
