For code-heavy knowledge graphs, AST-based chunking (via tree-sitter) outperforms token-based splitting because it preserves syntactic boundaries and keeps semantic units complete, which matters for accurate code retrieval and generation. Currently, the chunking logic appears to be embedded inline, with no pluggable trait abstraction.
Would you be open to extracting chunking into a trait-based strategy pattern (e.g., a ChunkingStrategy trait with a chunk(&self, text: &str) method that returns a Vec of chunks)? If a modular chunking interface already exists, or if you'd accept such a refactor, I'd implement a tree-sitter-based strategy for my use case and could share it back if useful.
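To make the proposal concrete, here is a minimal sketch of the kind of interface I have in mind. It is not based on the project's actual code: the Chunk struct, the LineWindowChunker baseline, and the chunk_document helper are hypothetical names chosen for illustration, and the element type of the returned Vec is my assumption. A tree-sitter-backed strategy would be one more implementation of the same trait.

```rust
/// Hypothetical chunk type: the text plus its byte range in the input.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub text: String,
    pub start: usize,
    pub end: usize,
}

/// Proposed strategy interface (assumed shape, not the project's API).
pub trait ChunkingStrategy {
    fn chunk(&self, text: &str) -> Vec<Chunk>;
}

/// Illustrative baseline: fixed windows of at most `max_lines` lines.
/// An AST-aware strategy would implement the same trait instead.
pub struct LineWindowChunker {
    pub max_lines: usize,
}

impl ChunkingStrategy for LineWindowChunker {
    fn chunk(&self, text: &str) -> Vec<Chunk> {
        let max_lines = self.max_lines.max(1); // guard against a zero window
        let mut chunks = Vec::new();
        let mut start = 0; // byte offset where the current chunk begins
        let mut offset = 0; // byte offset just past the last line seen
        let mut count = 0; // lines accumulated in the current chunk

        for line in text.split_inclusive('\n') {
            offset += line.len();
            count += 1;
            if count == max_lines {
                chunks.push(Chunk {
                    text: text[start..offset].to_string(),
                    start,
                    end: offset,
                });
                start = offset;
                count = 0;
            }
        }
        // Flush any trailing lines that did not fill a whole window.
        if start < text.len() {
            chunks.push(Chunk {
                text: text[start..].to_string(),
                start,
                end: text.len(),
            });
        }
        chunks
    }
}

/// Callers depend only on the trait, so strategies can be swapped freely.
pub fn chunk_document(strategy: &dyn ChunkingStrategy, text: &str) -> Vec<Chunk> {
    strategy.chunk(text)
}
```

With something like this, the pipeline could hold a Box<dyn ChunkingStrategy> (or take a generic parameter) and the existing behavior would simply become the default implementation.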
Context: Research shows that tree-sitter chunking improves code RAG accuracy by 4-5+ points on retrieval and generation benchmarks compared with semantic/token chunking (see the CMU cAST paper).