Pluggable chunking strategy trait for tree-sitter/AST-based code chunking #3

@dataO1

For code-heavy knowledge graphs, AST-based chunking (via tree-sitter) significantly outperforms token-based splitting by preserving syntactic boundaries and complete semantic units, which is critical for accurate code retrieval and generation tasks. Currently, the chunking logic appears to be embedded without a pluggable trait abstraction.

Would you be open to extracting chunking into a trait-based strategy pattern (e.g., a ChunkingStrategy trait with a chunk(&self, text: &str) method that returns a Vec of chunks)? If a modular chunking interface already exists, or if you'd accept such a refactor, I'd implement a tree-sitter-based strategy for my use case and could share it back if useful.
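
To make the proposal concrete, here is a rough sketch of what such a trait could look like. It is not taken from this project's code. The `Chunk` struct, the `BlankLineChunker` placeholder, and the `index_document` call site are all hypothetical names invented for illustration, and the element type of the returned `Vec` is an assumption. The placeholder splits on blank lines only to show how strategies plug in; a tree-sitter-based strategy would implement the same trait.

```rust
/// Hypothetical chunk type. The element type of the returned Vec is an
/// assumption made for this sketch, not something the project defines.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    /// The chunk's text.
    pub text: String,
    /// Byte offset of the chunk's start within the input.
    pub start: usize,
}

/// The proposed abstraction: any chunker implements this one method.
pub trait ChunkingStrategy {
    fn chunk(&self, text: &str) -> Vec<Chunk>;
}

/// Placeholder strategy that splits on blank lines. It stands in for a
/// real strategy (token-based, tree-sitter-based, ...) in this sketch.
pub struct BlankLineChunker;

impl ChunkingStrategy for BlankLineChunker {
    fn chunk(&self, text: &str) -> Vec<Chunk> {
        let mut chunks = Vec::new();
        let mut offset = 0;
        for part in text.split("\n\n") {
            // Skip pieces that contain only whitespace.
            if !part.trim().is_empty() {
                chunks.push(Chunk {
                    text: part.to_string(),
                    start: offset,
                });
            }
            // Advance past this piece and the two-byte separator.
            offset += part.len() + 2;
        }
        chunks
    }
}

/// Hypothetical call site: it depends only on the trait, so strategies
/// can be swapped without touching the indexing code.
pub fn index_document(strategy: &dyn ChunkingStrategy, text: &str) -> usize {
    strategy.chunk(text).len()
}
```

With this shape, the existing chunking logic would become one implementation of the trait, and an AST-based chunker could be passed to the same call site as `&dyn ChunkingStrategy` (or via a generic parameter, if dynamic dispatch is not wanted).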

Context: Research shows tree-sitter chunking improves code RAG accuracy by 4-5+ points on retrieval and generation benchmarks compared to semantic/token chunking (see CMU cAST paper).
