diff --git a/.claude/commands/merge.md b/.claude/commands/merge.md new file mode 100644 index 0000000..3931644 --- /dev/null +++ b/.claude/commands/merge.md @@ -0,0 +1,89 @@ +--- +description: Land the current feature branch on develop — README/CHANGELOG checks, commit, push, PR, auto-merge +argument-hint: [optional PR title] +allowed-tools: Bash(git:*), Bash(gh:*), Bash(cargo:*), Bash(grep:*), Read, Edit, Grep, Glob +--- + +# /merge — land the current feature branch on `develop` + +Run the project's **merge workflow**: verify docs are current, then bring the current +feature branch into `develop` through a pull request. This command does **not** tag a +release — tagging happens only in `/release`. + +## Branch & version facts (this repo) +- Flow: `feature/*` | `features/*` | `fix/*` → PR → **`develop`** → (later) PR → **`master`**. +- `master` is protected (`.github/workflows/protect-master.yml`): it accepts PRs only from + `develop` or `release/*`. +- The pre-commit hook **bumps the patch version (+1) and rebuilds the binary on feature + branches only** (`feature/*`, `features/*`, `fix/*`). On `develop`/`master`/`release`/`chore` + it runs `cargo fmt` only — no bump. So **the feature-branch commit here fixes the release + version**; it carries forward unchanged through develop, master, and the tag. + +## Guardrails +- ABORT unless the current branch matches `feature/*`, `features/*`, or `fix/*` — i.e. the + branches the pre-commit hook version-bumps. Never run from `develop`, `master`, `release/*`, + or `chore/*`: on those the hook does **not** bump, so the version/CHANGELOG premise below + would silently break. +- NEVER push directly to `develop` or `master` — everything lands via a PR. +- NEVER pass `--no-verify` / `--no-gpg-sign` — let the pre-commit hook run (it bumps + rebuilds). +- Do NOT create or push a tag here. That is `/release`'s job. +- Do NOT force-push. + +## Steps + +1. **Context** + - `git rev-parse --abbrev-ref HEAD` → current branch. If it is NOT `feature/*`, `features/*`, + or `fix/*`, STOP with an error (see Guardrails). + - `git fetch origin`. + - Compute the change set landing on develop: `git log origin/develop..HEAD --oneline` + plus `git status --short` for uncommitted work. If there is nothing to land, report and STOP. + +2. **README up to date?** + - Inspect the change set for user-facing changes: new/removed CLI flags or subcommands, + behavior changes, new env vars, new supported languages, new MCP tools. + - Compare against `README.md`. If anything is missing, wrong, or stale, **UPDATE `README.md`** + so it matches reality. Keep examples free of hardcoded config strings (per CLAUDE.md). + - If README already matches, state that and move on. + +3. **CHANGELOG up to date?** + - Ensure `CHANGELOG.md` has an entry for this change under a `## [X.Y.Z] - YYYY-MM-DD` + heading with `Added` / `Changed` / `Fixed` subsections describing every user-facing change. + - **Version for the heading**: the hook bumps the patch by +1 on **every** feature-branch + commit where the working-tree version still equals HEAD's. The most reliable approach is to + land this branch in a **single commit** — then the heading version = current + `Cargo.toml` version + 1 (`grep -m1 '^version' Cargo.toml`). If you commit more than once, + the version advances once per commit; after the final commit, read the actual + `Cargo.toml` version and make sure the CHANGELOG heading matches it (fix it if not). + - Use today's date. If an accurate entry already exists for the pending version, leave it. + +4. **Commit** + - Stage code + doc changes (`git add -A`, plus `git add -f` for any tracked-but-gitignored file). + - Commit with a clear, scoped message. End the message with: + `Co-Authored-By: Claude Opus 4.8 (1M context) ` + - Let the pre-commit hook finish (fmt → version bump → rebuild). This can take 60–120s. + +5. **Validate** (fast loop, per CLAUDE.md — do NOT run `--release`): + - `cargo fmt --all -- --check` + - `cargo check --all-targets` + - `cargo clippy --all-targets -- -D warnings` + - Fix any failures and commit again before pushing. Never push code that fails these. + +6. **Push** + - `git push -u origin HEAD`. + +7. **Open PR → develop** + - `gh pr create --base develop --head --title "" --body "<body>"`. + - Title: use `$ARGUMENTS` if provided; otherwise summarize the branch concisely. + - Body: bullet summary of changes; end with: + `🤖 Generated with [Claude Code](https://claude.com/claude-code)`. + - Capture the PR number for the next step: + `PR=$(gh pr view --json number --jq .number)`. + +8. **Auto-merge after CI** + - `gh pr merge "$PR" --auto --merge` so the PR lands automatically once required checks pass. + - If auto-merge is not enabled on the repo (command errors), fall back: poll + `gh pr checks "$PR" --watch`, then `gh pr merge "$PR" --merge` once green. + +## Report +Branch, pending release version, doc updates made, PR URL, and merge status +(auto-merge enabled / merged). diff --git a/.claude/commands/release.md b/.claude/commands/release.md new file mode 100644 index 0000000..16f2641 --- /dev/null +++ b/.claude/commands/release.md @@ -0,0 +1,63 @@ +--- +description: Cut a release — run /merge (feature → develop), then promote develop → master and push the version tag +argument-hint: [optional PR/release title] +allowed-tools: Bash(git:*), Bash(gh:*), Bash(cargo:*), Bash(grep:*), Read, Edit, Grep, Glob +--- + +# /release — full release: land on `develop`, promote to `master`, tag + +This is `/merge` **plus** the `develop → master` promotion and the version-tag push that +triggers the build/publish pipeline. + +## Branch & version facts (this repo) +- Flow: `feature/*` → PR → **`develop`** → PR → **`master`** → push tag `vX.Y.Z`. +- `master` is protected: PRs to it may come **only** from `develop` or `release/*` + (`.github/workflows/protect-master.yml`). +- Pushing a `vX.Y.Z` tag triggers `.github/workflows/release.yml` (builds Windows/Linux/macOS + archives, plain + `-with-csharp`, and publishes the GitHub release). **Push the tag only + AFTER the develop→master PR has merged.** +- The version is fixed by the feature-branch commit (the pre-commit hook bumps only on + feature branches). develop/master merges and the tag all carry that same version. + +## Guardrails +- NEVER use `--no-verify`. NEVER force-push shared branches. +- Push the tag exactly once, only after master has the release commit. +- If CI fails at any gate, STOP and report — do not promote or tag a red build. + +## Part 1 — land on `develop` (the `/merge` workflow) +Execute every step of **`/merge`** (README/CHANGELOG checks → commit → push → PR → auto-merge +to `develop`). Then **wait for the develop PR to actually merge** (auto-merge waits on CI): +- Capture the PR number (`PR=$(gh pr view --json number --jq .number)`), then poll + `gh pr view "$PR" --json state,mergedAt,mergeStateStatus` until `state` is `MERGED`. +- If checks fail, STOP and report. Do not proceed to Part 2. + +## Part 2 — promote `develop` → `master` +1. `git fetch origin && git checkout develop && git pull --ff-only origin develop`. +2. Determine the release version: `VERSION=v$(grep -m1 '^version' Cargo.toml | sed -E 's/.*"(.+)".*/\1/')`. +3. Open the release PR (source `develop`, which protect-master allows): + - `gh pr create --base master --head develop --title "Release $VERSION — <summary>" --body "<body>"`. + - Title: prefix `Release $VERSION — ` then a short summary (or `$ARGUMENTS` if provided), + matching history (e.g. `Release v1.0.142 — serve responsive during warmup`). + - Body ends with: `🤖 Generated with [Claude Code](https://claude.com/claude-code)`. + - Capture the PR number: `RELEASE_PR=$(gh pr view develop --json number --jq .number)`. +4. `gh pr merge "$RELEASE_PR" --auto --merge`. Wait until `state` is + `MERGED` (poll as in Part 1). If auto-merge is unavailable, `gh pr checks "$RELEASE_PR" --watch` + then `gh pr merge "$RELEASE_PR" --merge`. If CI fails, STOP. + +## Part 3 — tag the release +1. `git fetch origin --tags && git checkout master && git pull --ff-only origin master`. +2. Confirm the version on master matches: `grep -m1 '^version' Cargo.toml` equals `$VERSION` (minus the `v`). + If it does not match, STOP and report (do not guess a tag). +3. Guard against a double release: if `$VERSION` already exists as a tag + (`git tag -l "$VERSION"` non-empty, or `git ls-remote --tags origin "$VERSION"` non-empty), + STOP — the release was already cut. +4. `git tag "$VERSION" && git push origin "$VERSION"` → triggers `release.yml`. +5. Report the pushed tag and remind the user to watch the Actions "Release" run for artifacts. + +## Part 4 — keep `develop` in sync (only if needed) +If `master` ended up ahead of `develop` (e.g. a CHANGELOG/version edit merged only on master), +open a sync PR `master → develop` (or fast-forward develop) — matching the repo's post-release +sync convention (e.g. PR #90 "sync: backfill CHANGELOG … from master"). Skip if already in sync. + +## Report +develop PR URL, release PR URL, tag pushed (`vX.Y.Z`), final version, and sync action (if any). diff --git a/AGENTS.md b/AGENTS.md index a6b967e..f239200 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -267,6 +267,27 @@ LMDB **does not allow** two `EnvOpenOptions::open()` handles on the same directo --- +## Release workflow — `/merge` and `/release` + +Two committed Claude Code slash commands codify the release process +(`.claude/commands/merge.md`, `.claude/commands/release.md`; force-added past `.gitignore`). + +- **`/merge`** — land the current feature branch on `develop`: README/CHANGELOG freshness + checks → commit → `cargo fmt`/`check`/`clippy` → push → PR to `develop` → `gh pr merge --auto` + (lands after CI). Does **not** tag. +- **`/release`** — `/merge`, then promote `develop` → `master` via a `Release vX.Y.Z` PR + (`protect-master.yml` allows PRs to `master` only from `develop` or `release/*`), then push + the `vX.Y.Z` tag that triggers `.github/workflows/release.yml` (6 archives, plain + + `-with-csharp`). Includes an optional post-release `master → develop` sync. + +**Version rule (encoded in the commands):** the `pre-commit` hook bumps the patch (+1) and +rebuilds **only on `feature/*` | `features/*` | `fix/*` branches**; `develop`/`master`/`release`/ +`chore` get `cargo fmt` only. So the release version is fixed at the feature-branch commit and +carries forward unchanged through develop, master, and the tag. `/merge` therefore aborts unless +run from a feature/fix branch. + +--- + ## Live Test Report — 2026-05-08 **Versie**: codesearch v1.0.93+416 diff --git a/CHANGELOG.md b/CHANGELOG.md index 2c4a3d2..acb6322 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 +## [1.0.146] - 2026-06-02 + +### Added + +- **Semantic Markdown chunking** — Markdown files (`.md`, `.markdown`, `.txt`) are + now parsed with the **tree-sitter-md block grammar**, so chunks align to sections, + headings, and code fences instead of arbitrary line ranges. `Language::Markdown` + now reports `supports_tree_sitter() == true` and has a compiled-in grammar. + +### Changed + +- **Supported-languages documentation corrected** — the README language table now + lists all 15 tree-sitter languages actually supported (Rust, Python, JavaScript, + TypeScript, C, C++, C#, Go, Java, Shell, Ruby, PHP, YAML, JSON, Markdown); + it previously showed only 9, omitting Shell, Ruby, PHP, YAML, JSON, and Markdown. + ## [1.0.142] - 2026-06-01 ### Fixed diff --git a/Cargo.lock b/Cargo.lock index 9d36a43..c22ffaa 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -628,7 +628,7 @@ dependencies = [ [[package]] name = "codesearch" -version = "1.0.142" +version = "1.0.146" dependencies = [ "anyhow", "arroy", @@ -686,6 +686,7 @@ dependencies = [ "tree-sitter-java", "tree-sitter-javascript", "tree-sitter-json", + "tree-sitter-md", "tree-sitter-php", "tree-sitter-python", "tree-sitter-ruby", @@ -5166,6 +5167,16 @@ version = "0.1.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "009994f150cc0cd50ff54917d5bc8bffe8cad10ca10d81c34da2ec421ae61782" +[[package]] +name = "tree-sitter-md" +version = "0.5.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2efd398be546456c814598ee56c0f51769a77241511b4a58077815d120afa882" +dependencies = [ + "cc", + "tree-sitter-language", +] + [[package]] name = "tree-sitter-php" version = "0.24.2" diff --git a/Cargo.toml b/Cargo.toml index e41c0b9..004caa1 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "codesearch" -version = "1.0.142" +version = "1.0.146" edition = "2021" authors = ["codesearch contributors"] license = "Apache-2.0" @@ -52,6 +52,7 @@ tree-sitter-ruby = "0.23.1" tree-sitter-php = "0.24.2" tree-sitter-yaml = "0.7.2" tree-sitter-json = "0.24.8" +tree-sitter-md = "0.5.3" # File handling ignore = "0.4" diff --git a/README.md b/README.md index e4afec5..8ee61fe 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ codesearch gives AI agents (OpenCode, Claude Code, Cursor, and any MCP client) d - **Multi-repo serve mode**: Fan-out queries across repository groups with cross-repo RRF ranking - **Hybrid retrieval**: Vector embeddings + BM25 full-text search fused with Reciprocal Rank Fusion - **Symbol navigation**: Jump to definitions, find usages, trace imports and dependents — in the same tool -- **AST-aware chunking**: Tree-sitter parsing for 9 languages — chunks align to functions/classes, not arbitrary line ranges +- **AST-aware chunking**: Tree-sitter parsing for 15 languages — chunks align to functions/classes (and Markdown sections), not arbitrary line ranges - **Token-efficient**: Returns metadata by default; agents fetch full code only when needed via `get_chunk` - **Lightweight footprint**: Hundreds of MB on disk, runs on CPU only, no runtime model downloads (works behind enterprise proxies) - **Zero config for single repos**: `codesearch index && codesearch mcp` — done @@ -410,16 +410,23 @@ Tree-sitter AST-aware chunking: | Language | Extensions | |----------|-----------| | Rust | `.rs` | -| Python | `.py` | -| JavaScript | `.js`, `.jsx` | -| TypeScript | `.ts`, `.tsx` | +| Python | `.py`, `.pyw`, `.pyi` | +| JavaScript | `.js`, `.mjs`, `.cjs` | +| TypeScript | `.ts`, `.tsx`, `.jsx`, `.mts`, `.cts` | | C | `.c`, `.h` | -| C++ | `.cpp`, `.hpp` | +| C++ | `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hxx` | | C# | `.cs` | | Go | `.go` | | Java | `.java` | - -All other text files use line-based chunking as fallback. +| Shell | `.sh`, `.bash`, `.zsh` | +| Ruby | `.rb`, `.rake` | +| PHP | `.php` | +| YAML | `.yaml`, `.yml` | +| JSON | `.json` | +| Markdown | `.md`, `.markdown`, `.txt` | + +Markdown uses the tree-sitter-md **block** grammar — chunks align to sections, +headings, and code fences. All other text files use line-based chunking as fallback. ## Core Technology diff --git a/src/chunker/grammar.rs b/src/chunker/grammar.rs index e891628..a6e7dfb 100644 --- a/src/chunker/grammar.rs +++ b/src/chunker/grammar.rs @@ -72,6 +72,11 @@ impl GrammarManager { Language::Php => Ok(tree_sitter_php::LANGUAGE_PHP.into()), Language::Yaml => Ok(tree_sitter_yaml::LANGUAGE.into()), Language::Json => Ok(tree_sitter_json::LANGUAGE.into()), + // Markdown uses the tree-sitter-md *block* grammar (sections, headings, + // code fences). The inline grammar is intentionally not used: chunk + // boundaries only need block structure, and the block grammar runs on a + // plain `Parser` like every other language here. + Language::Markdown => Ok(tree_sitter_md::LANGUAGE.into()), _ => Err(anyhow!( "Language {} does not support tree-sitter", language.name() @@ -96,6 +101,7 @@ impl GrammarManager { Language::Php, Language::Yaml, Language::Json, + Language::Markdown, ] } @@ -251,10 +257,19 @@ mod tests { } #[test] - fn test_unsupported_language() { + fn test_load_markdown_grammar() { let manager = GrammarManager::new(); let grammar = manager.get_grammar(Language::Markdown); + assert!(grammar.is_some()); + } + + #[test] + fn test_unsupported_language() { + let manager = GrammarManager::new(); + // Toml has no compiled-in grammar. + let grammar = manager.get_grammar(Language::Toml); + assert!(grammar.is_none()); } @@ -304,6 +319,7 @@ mod tests { assert!(manager.is_supported(Language::Php)); assert!(manager.is_supported(Language::Yaml)); assert!(manager.is_supported(Language::Json)); - assert!(!manager.is_supported(Language::Markdown)); + assert!(manager.is_supported(Language::Markdown)); + assert!(!manager.is_supported(Language::Toml)); } } diff --git a/src/chunker/parser.rs b/src/chunker/parser.rs index a4fc193..0efe8b3 100644 --- a/src/chunker/parser.rs +++ b/src/chunker/parser.rs @@ -280,7 +280,8 @@ fn baz() {} let mut parser = CodeParser::new(); let source = "some code"; - let result = parser.parse(Language::Markdown, source); + // Toml has no compiled-in grammar, so parsing must fail. + let result = parser.parse(Language::Toml, source); assert!(result.is_err()); } diff --git a/src/chunker/semantic.rs b/src/chunker/semantic.rs index b2557ab..3b8bc5a 100644 --- a/src/chunker/semantic.rs +++ b/src/chunker/semantic.rs @@ -42,12 +42,25 @@ impl SemanticChunker { path: &Path, content: &str, ) -> Result<Vec<Chunk>> { + // Markdown/txt are chunked by heading section rather than by definition + // node, so they take a dedicated path (no LanguageExtractor). + if language == Language::Markdown { + return self.chunk_markdown(path, content); + } + // 1. Check if we have an extractor for this language let extractor = match get_extractor(language) { Some(ext) => ext, None => { - // Fall back to simple chunking for unsupported languages - return Ok(self.fallback_chunk(path, content)); + // Fall back to simple chunking for unsupported languages. The + // line-windowed fallback ignores the char budget, so route its + // output through split_oversized to enforce max_chunk_chars and + // avoid pathological huge single chunks (e.g. minified one-line text). + return Ok(self + .fallback_chunk(path, content) + .into_iter() + .flat_map(|c| self.split_oversized(c)) + .collect()); } }; @@ -89,6 +102,180 @@ impl SemanticChunker { Ok(final_chunks) } + /// Chunk a Markdown/text file by heading section. + /// + /// Uses the tree-sitter-md *block* grammar: the document is a tree of nested + /// `section` nodes (one per heading). Each chunk is a single heading plus its + /// own prose/code, *excluding* nested subsections (which become their own + /// chunks). Heading text is carried in the breadcrumb context so the embedding + /// captures the section's place in the document (e.g. `File: x.md > Title > + /// Subsection`). Leading document content (YAML front-matter, prose before the + /// first heading) becomes a single preamble chunk. Oversized sections are + /// char/line-bounded via `split_oversized`, and a file with no parseable + /// structure falls back to the line-windowed chunker (also bounded). + fn chunk_markdown(&mut self, path: &Path, content: &str) -> Result<Vec<Chunk>> { + let bounded_fallback = |this: &Self| -> Vec<Chunk> { + this.fallback_chunk(path, content) + .into_iter() + .flat_map(|c| this.split_oversized(c)) + .collect() + }; + + let parsed = match self.parser.parse(Language::Markdown, content) { + Ok(p) => p, + Err(_) => return Ok(bounded_fallback(self)), + }; + + let source = content.as_bytes(); + let path_str = normalize_path(path); + let file_context = format!("File: {}", path_str); + let root = parsed.root_node(); + + let mut cursor = root.walk(); + let top: Vec<Node> = root.named_children(&mut cursor).collect(); + + let mut chunks: Vec<Chunk> = Vec::new(); + + // Leading non-section nodes (front-matter / prose before the first heading) + // form a single preamble chunk. + let mut preamble_end = 0; + while preamble_end < top.len() && top[preamble_end].kind() != "section" { + preamble_end += 1; + } + if preamble_end > 0 { + let start_byte = top[0].start_byte(); + let end_byte = top[preamble_end - 1].end_byte(); + if let Some(chunk) = Self::md_chunk( + source, + start_byte, + end_byte, + top[0].start_position().row, + std::slice::from_ref(&file_context), + &path_str, + ) { + chunks.push(chunk); + } + } + + // Each top-level section (and, recursively, its subsections) becomes a chunk. + for node in top.iter().filter(|n| n.kind() == "section") { + self.emit_md_section( + *node, + source, + &path_str, + std::slice::from_ref(&file_context), + &mut chunks, + ); + } + + if chunks.is_empty() { + return Ok(bounded_fallback(self)); + } + + let source_lines: Vec<&str> = content.lines().collect(); + self.populate_context_windows(&mut chunks, &source_lines); + + let final_chunks = chunks + .into_iter() + .flat_map(|c| self.split_oversized(c)) + .collect(); + Ok(final_chunks) + } + + /// Emit a chunk for one `section` node (heading + direct content), then recurse + /// into nested subsections with an extended breadcrumb. + fn emit_md_section( + &self, + section: Node, + source: &[u8], + path_str: &str, + context_stack: &[String], + chunks: &mut Vec<Chunk>, + ) { + let mut cursor = section.walk(); + let children: Vec<Node> = section.named_children(&mut cursor).collect(); + + // Heading text (if the section opens with one) extends the breadcrumb. + let heading_text = children + .first() + .filter(|c| Self::md_is_heading(c.kind())) + .map(|h| Self::md_heading_text(*h, source)) + .unwrap_or_default(); + + let mut new_context = context_stack.to_vec(); + if !heading_text.is_empty() { + new_context.push(heading_text); + } + + // Direct content = section start .. first nested subsection (exclusive). + let first_sub = children.iter().find(|c| c.kind() == "section"); + let end_byte = first_sub.map_or_else(|| section.end_byte(), |s| s.start_byte()); + if let Some(chunk) = Self::md_chunk( + source, + section.start_byte(), + end_byte, + section.start_position().row, + &new_context, + path_str, + ) { + chunks.push(chunk); + } + + for child in children.iter().filter(|c| c.kind() == "section") { + self.emit_md_section(*child, source, path_str, &new_context, chunks); + } + } + + /// Build a Markdown chunk from a byte range, or None if it is blank. + fn md_chunk( + source: &[u8], + start_byte: usize, + end_byte: usize, + start_line: usize, + context: &[String], + path_str: &str, + ) -> Option<Chunk> { + let text = std::str::from_utf8(source.get(start_byte..end_byte)?).ok()?; + if text.trim().is_empty() { + return None; + } + let line_count = text.lines().count().max(1); + let mut chunk = Chunk::new( + text.to_string(), + start_line, + start_line + line_count, + ChunkKind::Block, + path_str.to_string(), + ); + chunk.context = context.to_vec(); + Some(chunk) + } + + /// True if a node kind is a Markdown heading. + fn md_is_heading(kind: &str) -> bool { + kind == "atx_heading" || kind == "setext_heading" + } + + /// Extract clean heading text (no `#` markers / underline). + fn md_heading_text(node: Node, source: &[u8]) -> String { + // atx_heading exposes the text via the `heading_content` field. + if let Some(inline) = node.child_by_field_name("heading_content") { + if let Ok(t) = inline.utf8_text(source) { + return t.trim().to_string(); + } + } + // Fallback (e.g. setext_heading): first line, stripped of '#'. + node.utf8_text(source) + .unwrap_or("") + .lines() + .next() + .unwrap_or("") + .trim() + .trim_matches('#') + .trim() + .to_string() + } + /// Populate context_prev and context_next for each chunk fn populate_context_windows(&self, chunks: &mut [Chunk], source_lines: &[&str]) { let total_lines = source_lines.len(); @@ -257,6 +444,88 @@ impl SemanticChunker { chunks } + /// Char- *and* line-aware splitter for unstructured text (Markdown/txt and the + /// generic fallback). Unlike `split_if_needed`, which windows purely by line + /// count, this also enforces `max_chunk_chars`: a single physical line longer + /// than the char budget is hard-split on UTF-8 boundaries. This is what keeps + /// scraped HTML/markdown — which can be one 80 KB line — from producing a single + /// enormous chunk. The structured code path keeps using `split_if_needed`, so + /// code chunking is unchanged. + fn split_oversized(&self, chunk: Chunk) -> Vec<Chunk> { + if chunk.line_count() <= self.max_chunk_lines && chunk.size_bytes() <= self.max_chunk_chars + { + return vec![chunk]; + } + + // 1. Expand into "units": one per line, but any line over the char budget is + // fragmented on char boundaries so no single unit exceeds max_chunk_chars. + let mut units: Vec<String> = Vec::new(); + for line in chunk.content.lines() { + if line.len() <= self.max_chunk_chars { + units.push(line.to_string()); + continue; + } + let mut frag = String::new(); + for ch in line.chars() { + if !frag.is_empty() && frag.len() + ch.len_utf8() > self.max_chunk_chars { + units.push(std::mem::take(&mut frag)); + } + frag.push(ch); + } + if !frag.is_empty() { + units.push(frag); + } + } + + if units.is_empty() { + return vec![chunk]; + } + + // 2. Greedily window units, bounded by both max_chunk_lines and + // max_chunk_chars. Windows advance without overlap (context_prev/next + // already supply surrounding lines), so no content is duplicated. + let mut out: Vec<Chunk> = Vec::new(); + let mut i = 0; + let mut split_index = 0; + while i < units.len() { + let mut j = i; + let mut char_count = 0usize; + while j < units.len() + && (j - i) < self.max_chunk_lines + && (j == i || char_count + units[j].len() < self.max_chunk_chars) + { + char_count += units[j].len() + 1; + j += 1; + } + let end = if j > i { j } else { i + 1 }; + + let content = units[i..end].join("\n"); + let mut piece = Chunk::new( + content, + chunk.start_line + i, + chunk.start_line + end, + chunk.kind, + chunk.path.clone(), + ); + piece.context = chunk.context.clone(); + piece.signature = chunk.signature.clone(); + piece.is_complete = false; + piece.split_index = Some(split_index); + out.push(piece); + + split_index += 1; + i = end; + } + + // A single resulting piece means no real split happened — keep it whole. + if out.len() == 1 { + out[0].is_complete = true; + out[0].split_index = None; + } + + out + } + /// Split a chunk if it exceeds size limits fn split_if_needed(&self, chunk: Chunk) -> Vec<Chunk> { let line_count = chunk.line_count(); @@ -586,6 +855,118 @@ class Calculator: ); } + #[test] + fn test_chunk_markdown_sections() { + let mut chunker = SemanticChunker::new(100, 2000, 10); + + let md = "---\nsource: dam_help\ntitle: E-mail ordering\nurl: https://help.example.com/x\npath: dam_help/Ordering/EmailOrd\n---\n\n# E-mail ordering\n\nIntro paragraph about ordering.\n\n## Configure SMTP\n\nSteps to configure the mail server.\n\n## Troubleshooting\n\nFinal section text about errors.\n"; + + let path = Path::new("EmailOrd.md"); + let chunks = chunker + .chunk_semantic(Language::Markdown, path, md) + .unwrap(); + + // Preamble (front-matter) + h1 intro + 2 h2 sections = at least 4 chunks. + assert!( + chunks.len() >= 4, + "Expected >=4 section chunks, got {}", + chunks.len() + ); + + // No chunk should span the whole page: the "Configure SMTP" body and the + // "Troubleshooting" body must live in *different* chunks. + let smtp = chunks + .iter() + .find(|c| c.content.contains("Steps to configure")) + .expect("should have a Configure SMTP chunk"); + assert!( + !smtp.content.contains("Final section text"), + "sections must not be merged into a whole-page block" + ); + + // Breadcrumb context must carry the heading path (document title + section). + assert!(smtp.context.iter().any(|c| c.contains("E-mail ordering"))); + assert!(smtp.context.iter().any(|c| c.contains("Configure SMTP"))); + + // Every chunk stays within the char budget. + assert!(chunks.iter().all(|c| c.content.len() <= 2000)); + } + + #[test] + fn test_chunk_markdown_nested_breadcrumb() { + let mut chunker = SemanticChunker::new(100, 2000, 10); + let md = "# Top\n\nlead\n\n## Middle\n\nmid body\n\n### Deep\n\ndeep body here\n"; + let chunks = chunker + .chunk_semantic(Language::Markdown, Path::new("n.md"), md) + .unwrap(); + + let deep = chunks + .iter() + .find(|c| c.content.contains("deep body here")) + .expect("should find deep section"); + // File > Top > Middle > Deep + assert!(deep.context.iter().any(|c| c.contains("Top"))); + assert!(deep.context.iter().any(|c| c.contains("Middle"))); + assert!(deep.context.iter().any(|c| c.contains("Deep"))); + // The deep chunk must not contain its ancestors' bodies. + assert!(!deep.content.contains("mid body")); + assert!(!deep.content.contains("lead")); + } + + #[test] + fn test_chunk_markdown_oversized_section_split() { + let mut chunker = SemanticChunker::new(100, 200, 5); + let big_body = (0..50) + .map(|i| format!("line of section body number {}", i)) + .collect::<Vec<_>>() + .join("\n"); + let md = format!("# Heading\n\n{}\n", big_body); + + let chunks = chunker + .chunk_semantic(Language::Markdown, Path::new("big.md"), &md) + .unwrap(); + + // A single >200-char section must be split into multiple bounded parts. + assert!(chunks.len() > 1, "oversized section should be split"); + assert!(chunks.iter().any(|c| !c.is_complete)); + } + + #[test] + fn test_chunk_markdown_hard_splits_long_line() { + // Mirrors real-world scraped docs: a section whose body is ONE huge line + // (no internal newlines). Line-based splitting can't bound this; the + // char-aware splitter must. + let mut chunker = SemanticChunker::new(100, 500, 10); + let long_line = "word ".repeat(2000); // ~10_000 chars, single line + let md = format!("# Title\n\n{}\n", long_line); + + let chunks = chunker + .chunk_semantic(Language::Markdown, Path::new("huge.md"), &md) + .unwrap(); + + assert!(chunks.len() > 1, "a single 10KB line must be hard-split"); + assert!( + chunks.iter().all(|c| c.content.len() <= 500), + "every piece must respect the char budget; got max {}", + chunks.iter().map(|c| c.content.len()).max().unwrap_or(0) + ); + } + + #[test] + fn test_chunk_markdown_no_headings_falls_back() { + let mut chunker = SemanticChunker::new(100, 2000, 10); + let md = "Just some plain text\nwith a few lines\nand no headings at all.\n"; + let chunks = chunker + .chunk_semantic(Language::Markdown, Path::new("plain.txt"), md) + .unwrap(); + + assert!(!chunks.is_empty()); + // All content is preserved across chunks. + let joined: String = chunks.iter().map(|c| c.content.clone()).collect(); + assert!(joined.contains("plain text")); + assert!(joined.contains("no headings")); + } + #[test] fn test_chunk_unsupported_language() { let mut chunker = SemanticChunker::new(100, 2000, 10); diff --git a/src/file/language.rs b/src/file/language.rs index cd2f310..f164a23 100644 --- a/src/file/language.rs +++ b/src/file/language.rs @@ -105,6 +105,7 @@ impl Language { | Self::Php | Self::Yaml | Self::Json + | Self::Markdown ) } @@ -176,7 +177,9 @@ mod tests { assert!(Language::Python.supports_tree_sitter()); assert!(Language::TypeScript.supports_tree_sitter()); assert!(Language::Json.supports_tree_sitter()); - assert!(!Language::Markdown.supports_tree_sitter()); + assert!(Language::Markdown.supports_tree_sitter()); + // Toml has no tree-sitter grammar yet. + assert!(!Language::Toml.supports_tree_sitter()); } #[test]