Validate token spans before CST; panic in debug, skip in release#204
Conversation
Add validation for token spans to ensure they are within source bounds before the CST builder advances span cursors. In debug builds, invalid spans cause a panic to catch lexer bugs early. In release builds, invalid spans generate a warning and are skipped to maintain error recovery without panics. Include tests covering panics on out-of-bounds spans in debug and ignoring them in release builds. Update docs/parser-plan.md to describe this validation behavior. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
Reviewer's GuideCentralizes token span validation for CST construction, enforcing strict panics in debug builds and warning/skip behavior in release, and updates documentation and tests to reflect and verify this behavior. Sequence diagram for token span validation in build_green_treesequenceDiagram
participant build_green_tree
participant validate_token_span
participant SpanCursors
participant GreenNodeBuilder
build_green_tree->>SpanCursors: new(spans)
loop for each (kind, span) in tokens
build_green_tree->>validate_token_span: validate_token_span(span, src_len)
alt span is valid
validate_token_span-->>build_green_tree: true
build_green_tree->>SpanCursors: advance_and_start(builder, span.start)
build_green_tree->>GreenNodeBuilder: push_token(kind, span, src)
else span is invalid
alt debug_assertions enabled
validate_token_span-->>build_green_tree: panic
build_green_tree--xbuild_green_tree: unwind stack
else debug_assertions disabled
validate_token_span-->>build_green_tree: false
build_green_tree-->>build_green_tree: continue (skip token)
end
end
end
build_green_tree->>GreenNodeBuilder: finish()
build_green_tree-->>build_green_tree: GreenNode
Class diagram for CST builder and token span validationclassDiagram
class Span {
usize start
usize end
}
class SpanCursors {
+new(spans ParsedSpans) SpanCursors
+advance_and_start(builder GreenNodeBuilder, offset usize) void
}
class GreenNodeBuilder {
+finish() GreenNode
}
class ParserCstBuilderModule {
+validate_token_span(span Span, src_len usize) bool
+build_green_tree(tokens Vec~(SyntaxKind, Span)~, src &str, spans &ParsedSpans) GreenNode
}
SpanCursors --> Span : uses
ParserCstBuilderModule --> Span : uses
ParserCstBuilderModule --> SpanCursors : uses
ParserCstBuilderModule --> GreenNodeBuilder : uses
ParserCstBuilderModule --> Span : validates
ParserCstBuilderModule ..> Span : panics_or_warns_on_oob
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Summary by CodeRabbitRelease Notes
✏️ Tip: You can customize this high-level summary in your review settings. WalkthroughIntroduce token-span validation before advancing CST builder cursors. Document the control flow: in debug builds panic on out-of-bounds spans to catch lexer bugs; in release builds log a warning and skip the token to preserve predictable error recovery. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~18 minutes Possibly related issues
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used📓 Path-based instructions (1)**/*.rs📄 CodeRabbit inference engine (AGENTS.md)
Files:
⚙️ CodeRabbit configuration file
Files:
🧠 Learnings (1)📚 Learning: 2025-12-26T13:23:35.435ZApplied to files:
🧬 Code graph analysis (1)src/parser/cst_builder/tree.rs (3)
🔍 Remote MCP DeepwikiRelevant facts useful for reviewing this PR (sources cited):
Tags: ⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (3)
Comment |
Replaced the usage of positional formatting with inline format strings in the panic and warn macro calls within `validate_token_span` for improved readability and consistency. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
docs/parser-plan.mdsrc/parser/cst_builder/tree.rs
🧰 Additional context used
📓 Path-based instructions (5)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Clippy warnings must be disallowed in Rust code.
Fix any warnings emitted during tests in the code itself rather than silencing them in Rust.
Where a Rust function is too long, extract meaningfully named helper functions adhering to separation of concerns and the Command Query Responsibility Segregation pattern.
Where a Rust function has too many parameters, group related parameters in meaningfully named structs.
Where a Rust function is returning a large error, consider usingArcto reduce the amount of data returned.
Write unit and scenario tests for new Rust functionality; run both before and after making any change.
Every Rust module must begin with a module level (//!) comment explaining the module's purpose and utility.
Document public APIs in Rust using Rustdoc comments (///), so documentation can be generated withcargo doc.
Prefer immutable data and avoid unnecessarymutbindings in Rust.
Handle errors with theResulttype instead of panicking in Rust where feasible.
Avoidunsafecode in Rust unless absolutely necessary and document any usage clearly.
Place function attributes after doc comments in Rust.
Do not usereturnin single-line functions in Rust.
Use predicate functions for conditional criteria with more than two branches in Rust.
Lints must not be silenced in Rust except as a last resort; lint rule suppressions must be tightly scoped and include a clear reason.
Preferexpectoverallowin Rust.
Userstestfixtures for shared test setup in Rust.
Replace duplicated tests with#[rstest(…)]parameterized cases in Rust.
Prefermockallfor mocks/stubs in Rust.
Prefer.expect()over.unwrap()in Rust.
Useconcat!()to combine long string literals in Rust rather than escaping newlines with a backslash.
Prefer single-line versions of functions in Rust where appropriate (e.g.,pub fn new(id: u64) -> Self { Self(id) }instead of multi-line variants).
Use NewTypes in Rust to model domain values and eliminate "i...
Files:
src/parser/cst_builder/tree.rs
⚙️ CodeRabbit configuration file
**/*.rs: * Seek to keep the cognitive complexity of functions no more than 9.
- Adhere to single responsibility and CQRS
- Place function attributes after doc comments.
- Do not use
returnin single-line functions.- Move conditionals with >2 branches into a predicate function.
- Avoid
unsafeunless absolutely necessary.- Every module must begin with a
//!doc comment that explains the module's purpose and utility.- Comments and docs must follow en-GB-oxendict (-ize / -yse / -our) spelling and grammar
- Lints must not be silenced except as a last resort.
#[allow]is forbidden.- Only narrowly scoped
#[expect(lint, reason = "...")]is allowed.- No lint groups, no blanket or file-wide suppression.
- Include
FIXME:with link if a fix is expected.- Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused_code applied.
- Use
rstestfixtures for shared setup and to avoid repetition between tests.- Replace duplicated tests with
#[rstest(...)]parameterised cases.- Prefer
mockallfor mocks/stubs.- Prefer
.expect()over.unwrap()in tests..expect()and.unwrap()are forbidden outside of tests. Errors must be propagated.- Ensure that any API or behavioural changes are reflected in the documentation in
docs/- Ensure that any completed roadmap steps are recorded in the appropriate roadmap in
docs/- Files must not exceed 400 lines in length
- Large modules must be decomposed
- Long match statements or dispatch tables should be decomposed by domain and collocated with targets
- Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
- Environment access (env::set_var and env::remove_var) are always unsafe in Rust 2024 and MUST be marked as such
- For testing of functionality depending upon environment variables, dependency injection and...
Files:
src/parser/cst_builder/tree.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
**/*.md: Markdown paragraphs must be wrapped at 80 columns, and bullet points need the same limit.
Code blocks in Markdown must be wrapped at 120 columns.
Tables in Markdown must not be wrapped, and headings must remain unwrapped.
Use dashes (-) for list bullets in Markdown.
Use GitHub-flavoured Markdown footnotes ([^1]) for references and footnotes.
Files:
docs/parser-plan.md
⚙️ CodeRabbit configuration file
**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")
- Use en-GB-oxendict (-ize / -yse / -our) spelling and grammar
- Headings must not be wrapped.
- Documents must start with a level 1 heading
- Headings must correctly increase or decrease by no more than one level at a time
- Use GitHub-flavoured Markdown style for footnotes and endnotes.
- Numbered footnotes must be numbered by order of appearance in the document.
Files:
docs/parser-plan.md
docs/**/*.{md,rs,hs,py}
📄 CodeRabbit inference engine (docs/differential-datalog-parser-syntax-spec-migration-plan.md)
Update all repository references from
docs/haskell-parser-analysis.mdto point to the normative DDlog syntax specification and new implementation notes
Files:
docs/parser-plan.md
docs/**/*.md
📄 CodeRabbit inference engine (docs/documentation-style-guide.md)
docs/**/*.md: Use British English based on Oxford English Dictionary locale en-GB, including: -ize suffixes in words like 'realize' and 'organization', -lyse suffixes in words like 'analyse' and 'paralyse', -our suffixes in words like 'colour' and 'behaviour', -re suffixes in words like 'centre' and 'calibre', double 'l' in words like 'cancelled' and 'counsellor', maintain 'e' in words like 'likeable' and 'liveable', -ogue suffixes in words like 'catalogue'
Keep United States (US) spelling when used in an API (e.g., 'color')
Use the Oxford comma in documentation: 'ships, planes, and hovercraft' where it aids comprehension
Treat company names as collective nouns in documentation (e.g., 'Concordat Industries are expanding')
Write headings in sentence case
Use Markdown headings (#, ##, ###, etc.) in order without skipping levels
Follow markdownlint recommendations for markdown formatting
Always provide a language identifier for fenced code blocks; use 'plaintext' for non-code text
Use '-' as the first level bullet and renumber lists when items change
Prefer inline links using text or angle brackets around the URL in documentation
Ensure blank lines before and after bulleted lists and fenced blocks in documentation
Ensure tables have a delimiter line below the header row
Expand any uncommon acronym on first use (e.g., Continuous Integration (CI))
Wrap paragraphs at 80 columns in documentation
Wrap code at 120 columns in documentation
Do not wrap tables in documentation
Use GitHub-flavoured numeric footnotes referenced as [^1] in documentation
Footnotes must be numbered in order of appearance in the document
Caption every table and every diagram in documentation
Include Mermaid diagrams where they add clarity to documentation
Use '' syntax for embedding figures and provide brief alt text describing the content in documentation
Add a short description before each Mermaid diagram in documentation so screen readers can understand it
Files:
docs/parser-plan.md
docs/**/!(README).md
📄 CodeRabbit inference engine (docs/documentation-style-guide.md)
Avoid first and second person personal pronouns outside the README.md file
Files:
docs/parser-plan.md
🧠 Learnings (1)
📚 Learning: 2025-12-26T13:23:35.435Z
Learnt from: leynos
Repo: leynos/ddlint PR: 197
File: src/test_util/literals.rs:135-139
Timestamp: 2025-12-26T13:23:35.435Z
Learning: In Rust projects, rustfmt may reflow long simple constructor functions (e.g., pub fn lit_bool(b: bool) -> Expr { Expr::Literal(Literal::Bool(b)) }) to a multi-line form. Ensure you run and respect rustfmt formatting so CI formatting checks pass. Do not manually enforce non-default line breaks; format with rustfmt. Consider adding a rustfmt --check (or cargo fmt) step in CI and running cargo fmt before commits to avoid formatting-related failures.
Applied to files:
src/parser/cst_builder/tree.rs
🔍 Remote MCP Deepwiki
Summary of Relevant Context for PR Review
Token Span Structure and Usage
Token spans in the ddlint parser are represented by the Span type, which is a std::ops::Range<usize> indicating a byte range within the source text. These spans are organized in a ParsedSpans structure that categorizes spans for different statement types (imports, typedefs, relations, indexes, functions, transformers, and rules), and are used by the build_green_tree function through SpanCursors to manage iteration and control CST node creation.
Existing Error Handling Patterns
The ddlint parser employs a robust error handling strategy that focuses on error recovery rather than halting on first error. When errors occur, an N_ERROR node is emitted into the CST, encapsulating invalid tokens, and the parser attempts to resynchronize by skipping tokens. The parser also uses property-based testing with proptest to ensure the system does not panic even with arbitrary or malformed inputs.
Out-of-Bounds Span Handling
Invalid or out-of-bounds token spans can occur from tokenization mismatches (where a logos lexer produces a span extending beyond the source length) or incorrect span ordering/overlaps during span analysis. Currently, the parser logs warnings for out-of-bounds spans during the push_token function, sets the token text to an empty string to prevent further errors, and in debug builds uses validate_span_lists_sorted to panic on span ordering issues.
CST Construction Flow
The build_green_tree function initializes a GreenNodeBuilder and iterates through tokens, using SpanCursors to determine when to start and finish nodes. The SpanCursors provides methods like advance_and_start and finish which call underlying SpanCursor methods (advance_to, start_if, finish_if) to coordinate node creation with token processing.
Review Implications
This PR appropriately extends the existing validation philosophy by introducing early token-span bounds checking before cursor advancement. The dual behavior (panic in debug builds to surface lexer bugs, warn and skip in release builds for resilience) aligns with the established error recovery patterns and property-based testing strategy already in place for ensuring robustness in production.,
⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Sourcery review
- GitHub Check: Sourcery review
🔇 Additional comments (4)
docs/parser-plan.md (1)
90-93: LGTM!The documentation accurately describes the token span validation behavior, distinguishing debug (panic) from release (warn-and-skip) builds. Formatting complies with the 80-column guideline.
src/parser/cst_builder/tree.rs (3)
124-126: LGTM!The validation check correctly guards cursor advancement. Tokens with out-of-bounds spans are skipped, preventing CST corruption.
Note: After this validation,
push_tokenat line 129 will never receive an invalid span, rendering the bounds check at lines 139-149 redundant. However, the redundancy provides defense in depth and may be intentional.
182-191: LGTM!The test correctly verifies that debug builds panic on out-of-bounds token spans. The minimal setup (empty source, span 0..1) is sufficient to trigger the validation failure.
193-205: LGTM!The test correctly verifies that release builds skip out-of-bounds token spans without panicking. Appending an out-of-bounds token after valid tokenization and asserting the tree matches the original source confirms the skip behavior.
…nching The validate_token_span function's conditional logic was refactored to properly handle out-of-bounds spans. The method now correctly panics in debug mode and logs a warning in release mode when a token span is invalid, ensuring improved error handling and consistency. Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Summary
Changes
Core functionality
validate_token_span(span: &Span, src_len: usize) -> boolwith behavior:start <= endandend <= src_len.debug_assertions, panics with a descriptive message when out-of-bounds.false.build_green_treeto skip tokens with invalid spans:validate_token_span(span, src.len())and continues if false.Documentation
Tests
build_green_tree_panics_on_oob_token_span(debug mode): ensures a panic occurs when a token span is out of bounds.build_green_tree_skips_oob_token_span_in_release(release mode): ensures an out-of-bounds span is skipped and the resulting CST remains consistent.Test plan
Notes
🌿 Generated by Terry
ℹ️ Tag @terragon-labs to ask questions and address PR feedback
📎 Task: https://www.terragonlabs.com/task/9cce6c64-a042-4924-8040-a8bf7085ef04
Summary by Sourcery
Validate token spans during CST construction to catch lexer span bugs in debug builds while preserving robust error recovery in release builds.
New Features:
Bug Fixes:
Documentation:
Tests: