Implement Code-Aware Custom Chunking for Vector Ingestion #12

@avishek0769

Description

Summary

Introduce a chunking strategy optimized for technical documentation with code blocks and API references.

Problem

Generic character-based splitting can cut through code blocks, function signatures, and semantic boundaries, which reduces retrieval quality.

Expected Solution

Build a custom chunker that is structure-aware (headings, code fences, lists, tables, API sections), preserves useful context windows, and improves retrieval precision.
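One possible shape for such a chunker, as a minimal sketch rather than a proposed implementation: it splits Markdown at headings, never splits inside a code fence, and only applies the size limit at blank lines outside a fence. The names `chunkMarkdown` and `maxChars` and the returned fields are hypothetical and do not exist in the repository. Lists and tables are not handled separately here, and a fence nested inside another fence of the same character would be misread.

```javascript
// Illustrative sketch only: chunkMarkdown and maxChars are hypothetical names,
// not existing functions in backend/utils/ragUtilities.js.
function chunkMarkdown(markdown, maxChars = 1200) {
  const chunks = [];
  const headingPath = []; // heading trail in effect for the current buffer
  let buffer = [];
  let size = 0;
  let hasCode = false;
  let fenceChar = null; // "`" or "~" while inside a fenced block, else null

  const flush = () => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ headingPath: [...headingPath], hasCode, text });
    buffer = [];
    size = 0;
    hasCode = false;
  };

  for (const line of markdown.split("\n")) {
    const fence = line.match(/^\s*(`{3,}|~{3,})/);
    if (fence && (fenceChar === null || fence[1][0] === fenceChar)) {
      // Toggle fence state; lines inside a fence are never split or read as headings.
      fenceChar = fenceChar === null ? fence[1][0] : null;
      hasCode = true;
    } else if (fenceChar === null) {
      const heading = line.match(/^(#{1,6})\s+(.+)$/);
      if (heading) {
        flush(); // close the previous section under its own heading trail
        headingPath.splice(heading[1].length - 1);
        headingPath.push(heading[2].trim());
      } else if (line.trim() === "" && size >= maxChars) {
        flush(); // size-based split, only at a blank line outside a fence
        continue;
      }
    }
    buffer.push(line);
    size += line.length + 1;
  }
  flush();
  return chunks;
}
```

Each chunk carries the heading trail it was found under, so a code block under "connect" can still be retrieved with its parent "API" context even when the heading text itself ends up in a different chunk.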

Scope

  • backend/chatWorker.js
  • backend/utils/ragUtilities.js
  • Chunk metadata fields and retrieval payload structure
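For discussion, the chunk metadata could look something like the sketch below. Every field name here (`headingPath`, `hasCode`, `codeLanguages`, `chunkIndex`) and the `toChunkRecord` helper are assumptions for illustration; the actual fields must follow whatever the current storage/query pipeline already expects. Values are kept as flat scalars on the assumption that the vector store may not accept nested metadata.

```javascript
// Hypothetical field names; the real schema in the storage/query pipeline may differ.
function toChunkRecord(chunk, source, chunkIndex) {
  // Collect info strings from opening fences, e.g. "js" or "python".
  const languages = new Set();
  for (const m of chunk.text.matchAll(/^\s*(?:`{3,}|~{3,})([\w+-]+)/gm)) {
    languages.add(m[1].toLowerCase());
  }
  return {
    id: `${source}#${chunkIndex}`,
    text: chunk.text,
    metadata: {
      source,
      chunkIndex,
      headingPath: chunk.headingPath.join(" > "), // flattened to a string
      hasCode: chunk.hasCode,
      codeLanguages: [...languages].join(","),
    },
  };
}
```

Fields like `hasCode` and `codeLanguages` would allow retrieval to filter or boost code-bearing chunks, if the query side chooses to use them.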

Acceptance Criteria

  • Chunking preserves code fence integrity and key heading context.
  • Retrieval quality improves on code-heavy documentation examples.
  • Chunk metadata remains compatible with current storage/query pipeline.
  • Feature includes tests or evaluation fixtures for regression checks.
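The code-fence criterion lends itself to an automated regression check. The sketch below is one way to express it, assuming chunks are objects with a `text` field; `findBrokenFences` and `naiveSplit` are hypothetical helpers, with `naiveSplit` standing in for generic fixed-size splitting as a baseline. A chunk counts as broken when it contains an odd number of fence lines, meaning a fence was opened or closed in a different chunk.

```javascript
// Hypothetical regression helper: returns the indexes of chunks whose
// code fences are unbalanced (an odd number of fence lines).
function findBrokenFences(chunks) {
  const fenceLine = /^\s*(`{3,}|~{3,})/;
  return chunks
    .map((chunk, index) => {
      const count = chunk.text.split("\n").filter((l) => fenceLine.test(l)).length;
      return count % 2 === 0 ? null : index;
    })
    .filter((index) => index !== null);
}

// Baseline for comparison: fixed-size character splitting.
function naiveSplit(text, size) {
  const out = [];
  for (let i = 0; i < text.length; i += size) {
    out.push({ text: text.slice(i, i + size) });
  }
  return out;
}
```

Running `findBrokenFences` over fixture documents for both the baseline and the new chunker gives a simple pass/fail signal: the new chunker should return an empty list on every fixture.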

Metadata

Assignees

No one assigned

    Labels

    backend (Backend issues), enhancement (New feature or request), hard (This issue is hard to solve)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
