Validate token spans before CST; panic in debug, skip in release #204

Merged: leynos merged 5 commits into main from terragon/handle-token-span-out-of-bounds-r2hkjz on Jan 3, 2026

leynos (Owner) commented Jan 2, 2026

Summary

  • Validates token spans before the CST builder advances span cursors
  • Debug builds panic on out-of-bounds token spans to surface lexer bugs
  • Release builds log a warning and skip the offending token to keep error recovery predictable

Changes

Core functionality

  • Added validate_token_span(span: &Span, src_len: usize) -> bool with behavior:
    • Returns true if start <= end and end <= src_len.
    • When debug_assertions is enabled, panics with a descriptive message if the span is invalid.
    • In non-debug builds, logs a warning and returns false.
  • Updated build_green_tree to skip tokens with invalid spans:
    • Before advancing cursors, the code now checks validate_token_span(span, src.len()) and continues if false.
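
A minimal sketch of the helper described above, assuming Span is std::ops::Range<usize> (as the review context further down this thread states). The span_in_bounds predicate is a hypothetical name introduced only for illustration, and eprintln! stands in for whatever logging macro the crate actually uses; the real body in src/parser/cst_builder/tree.rs may differ.

```rust
use std::ops::Range;

/// Assumed alias: byte offsets into the source text.
type Span = Range<usize>;

/// Hypothetical pure predicate: the span is ordered and ends within the source.
fn span_in_bounds(span: &Span, src_len: usize) -> bool {
    span.start <= span.end && span.end <= src_len
}

/// Returns `true` for a valid span. For an invalid span, panics when
/// `debug_assertions` is enabled; otherwise warns and returns `false`.
fn validate_token_span(span: &Span, src_len: usize) -> bool {
    if span_in_bounds(span, src_len) {
        return true;
    }
    let (start, end) = (span.start, span.end);
    if cfg!(debug_assertions) {
        panic!("token span {start}..{end} is out of bounds for source length {src_len}");
    }
    // Stand-in for a logging call such as `log::warn!` (assumption).
    eprintln!("warning: skipping token span {start}..{end}; source length is {src_len}");
    false
}
```

Keeping the bounds check in a separate predicate is a design choice of this sketch, not something the PR states; it lets the check be tested in any build mode without triggering the panic.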

Documentation

  • Updated docs/parser-plan.md to reflect the new validation behavior:
    • Token spans are validated before the CST builder advances span cursors.
    • Debug builds panic on out-of-bounds spans to catch lexer bugs early.
    • Release builds warn and skip the token to maintain predictable error recovery.

Tests

  • Added tests to cover both debug and release behaviors:
    • build_green_tree_panics_on_oob_token_span (debug mode): ensures a panic occurs when a token span is out of bounds.
    • build_green_tree_skips_oob_token_span_in_release (release mode): ensures an out-of-bounds span is skipped and the resulting CST remains consistent.
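
The two tests can be gated on debug_assertions so that each one runs only in the build mode it describes. The sketch below is an illustration under assumptions: collect_valid_text is a hypothetical stand-in for build_green_tree that concatenates token text instead of building a tree, and the test bodies are not the ones added in this PR.

```rust
use std::ops::Range;

type Span = Range<usize>;

/// Hypothetical stand-in for `build_green_tree`: joins the text of every token
/// whose span is valid. An invalid span panics in debug builds and is skipped
/// in release builds.
fn collect_valid_text(tokens: &[Span], src: &str) -> String {
    let mut text = String::new();
    for span in tokens {
        let valid = span.start <= span.end && span.end <= src.len();
        if !valid {
            assert!(!cfg!(debug_assertions), "token span {:?} out of bounds", span);
            continue;
        }
        text.push_str(&src[span.clone()]);
    }
    text
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Compiled only when `debug_assertions` is on (for example `cargo test`).
    #[cfg(debug_assertions)]
    #[test]
    #[should_panic(expected = "out of bounds")]
    fn panics_on_oob_token_span() {
        collect_valid_text(&[0..1], "");
    }

    /// Compiled only when `debug_assertions` is off (for example
    /// `cargo test --release`).
    #[cfg(not(debug_assertions))]
    #[test]
    fn skips_oob_token_span_in_release() {
        let src = "ab";
        let tokens = vec![0..1, 1..2, 5..9];
        assert_eq!(collect_valid_text(&tokens, src), src);
    }
}
```

Because each test is compiled in only one mode, both `cargo test` and `cargo test --release` need to run for the pair to give full coverage.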

Test plan

  • Run tests in debug to verify panic on out-of-bounds token spans
  • Run tests in release to verify skipping behavior and preserved CST
  • Validate that valid inputs still produce correct CST text

Notes

  • This change helps catch lexer bugs early in debug builds while avoiding panics for user inputs in production, ensuring more predictable error handling.

🌿 Generated by Terry


📎 Task: https://www.terragonlabs.com/task/9cce6c64-a042-4924-8040-a8bf7085ef04

Summary by Sourcery

Validate token spans during CST construction to catch lexer span bugs in debug builds while preserving robust error recovery in release builds.

New Features:

  • Introduce a token span validation helper used by the CST builder before advancing span cursors.

Bug Fixes:

  • Prevent CST construction from operating on out-of-bounds token spans by panicking in debug builds and skipping invalid tokens in release builds.

Documentation:

  • Document the new token span validation behavior and differing debug vs. release handling in the parser plan.

Tests:

  • Add tests verifying panic behavior for invalid token spans in debug builds and skip behavior with preserved CST text in release builds.

Add validation for token spans to ensure they are within source bounds before the CST builder advances span cursors. In debug builds, invalid spans cause a panic to catch lexer bugs early. In release builds, invalid spans generate a warning and are skipped to maintain error recovery without panics.

Include tests covering panics on out-of-bounds spans in debug and ignoring them in release builds.

Update docs/parser-plan.md to describe this validation behavior.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
sourcery-ai Bot (Contributor) commented Jan 2, 2026

Reviewer's Guide

Centralizes token span validation for CST construction, enforcing strict panics in debug builds and warning/skip behavior in release, and updates documentation and tests to reflect and verify this behavior.

Sequence diagram for token span validation in build_green_tree

sequenceDiagram
    participant build_green_tree
    participant validate_token_span
    participant SpanCursors
    participant GreenNodeBuilder

    build_green_tree->>SpanCursors: new(spans)
    loop for each (kind, span) in tokens
        build_green_tree->>validate_token_span: validate_token_span(span, src_len)
        alt span is valid
            validate_token_span-->>build_green_tree: true
            build_green_tree->>SpanCursors: advance_and_start(builder, span.start)
            build_green_tree->>GreenNodeBuilder: push_token(kind, span, src)
        else span is invalid
            alt debug_assertions enabled
                validate_token_span-->>build_green_tree: panic
                build_green_tree--xbuild_green_tree: unwind stack
            else debug_assertions disabled
                validate_token_span-->>build_green_tree: false
                build_green_tree-->>build_green_tree: continue (skip token)
            end
        end
    end
    build_green_tree->>GreenNodeBuilder: finish()
    build_green_tree-->>build_green_tree: GreenNode

Class diagram for CST builder and token span validation

classDiagram
    class Span {
        usize start
        usize end
    }

    class SpanCursors {
        +new(spans ParsedSpans) SpanCursors
        +advance_and_start(builder GreenNodeBuilder, offset usize) void
    }

    class GreenNodeBuilder {
        +finish() GreenNode
    }

    class ParserCstBuilderModule {
        +validate_token_span(span Span, src_len usize) bool
        +build_green_tree(tokens Vec~(SyntaxKind, Span)~, src &str, spans &ParsedSpans) GreenNode
    }

    SpanCursors --> Span : uses
    ParserCstBuilderModule --> Span : uses
    ParserCstBuilderModule --> SpanCursors : uses
    ParserCstBuilderModule --> GreenNodeBuilder : uses
    ParserCstBuilderModule --> Span : validates
    ParserCstBuilderModule ..> Span : panics_or_warns_on_oob

File-Level Changes

Change: Introduce centralized token span validation and use it in CST construction loop.
Details:
  • Add validate_token_span(span: &Span, src_len: usize) -> bool that enforces span ordering and bounds against the source length.
  • In debug builds, have invalid spans trigger a panic with a descriptive message; in non-debug builds, log a warning and return false.
  • Call validate_token_span before advancing span cursors in build_green_tree, skipping tokens whose spans are invalid.
Files: src/parser/cst_builder/tree.rs

Change: Document the new token span validation behavior in the parser design docs.
Details:
  • Describe that token spans are validated before advancing span cursors in the CST builder.
  • Document differing debug (panic) vs release (warn and skip) behavior for out-of-bounds spans.
Files: docs/parser-plan.md

Change: Add tests to cover debug and release behaviors for out-of-bounds token spans during CST construction.
Details:
  • Add a debug-only test that expects build_green_tree to panic on an out-of-bounds token span.
  • Add a non-debug test that appends an out-of-bounds token, asserts no parse errors, and verifies build_green_tree skips the bad token while preserving CST text.
Files: src/parser/cst_builder/tree.rs

Possibly linked issues

  • Refactor CST construction into module #79: PR adds token span validation in build_green_tree, matching the issue’s request for bounds checks and debug panics.
  • #unknown: PR adds token span validation with debug panics and release warnings/skips, directly addressing span validation concerns in release builds.

coderabbitai Bot (Contributor) commented Jan 2, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced error handling and recovery mechanisms for token processing edge cases, improving application stability and robustness in both debug and release configurations
  • Documentation

    • Updated documentation detailing token validation procedures, error recovery strategies, and diagnostic logging behaviour implemented across different build environments

Walkthrough

Introduce token-span validation before advancing CST builder cursors. Document the control flow: in debug builds panic on out-of-bounds spans to catch lexer bugs; in release builds log a warning and skip the token to preserve predictable error recovery.

Changes

Cohort: Documentation
File(s): docs/parser-plan.md
Summary: Update to describe pre-validation of token spans and the differing control flow between debug and release builds.

Cohort: Token-span validation
File(s): src/parser/cst_builder/tree.rs
Summary: Add private validate_token_span() helper to check span bounds (panic in debug, warn-and-skip in release). Modify build_green_tree() to call this helper before advancing cursors. Add tests asserting debug-build panic and release-build skip behaviour.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Poem

Bounds and spans now guard the way,
In debug mode—no bugs shall stay!
Release builds skip with care and grace,
Validation steadies every place. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title clearly and concisely summarises the main change: validating token spans with differing debug vs release behaviours.
  • Description check: ✅ Passed. The description comprehensively covers the changeset, documenting the validation logic, modified functions, documentation updates, and test coverage.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38850c7 and af165f3.

📒 Files selected for processing (1)
  • src/parser/cst_builder/tree.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Clippy warnings must be disallowed in Rust code.
Fix any warnings emitted during tests in the code itself rather than silencing them in Rust.
Where a Rust function is too long, extract meaningfully named helper functions adhering to separation of concerns and the Command Query Responsibility Segregation pattern.
Where a Rust function has too many parameters, group related parameters in meaningfully named structs.
Where a Rust function is returning a large error, consider using Arc to reduce the amount of data returned.
Write unit and scenario tests for new Rust functionality; run both before and after making any change.
Every Rust module must begin with a module level (//!) comment explaining the module's purpose and utility.
Document public APIs in Rust using Rustdoc comments (///), so documentation can be generated with cargo doc.
Prefer immutable data and avoid unnecessary mut bindings in Rust.
Handle errors with the Result type instead of panicking in Rust where feasible.
Avoid unsafe code in Rust unless absolutely necessary and document any usage clearly.
Place function attributes after doc comments in Rust.
Do not use return in single-line functions in Rust.
Use predicate functions for conditional criteria with more than two branches in Rust.
Lints must not be silenced in Rust except as a last resort; lint rule suppressions must be tightly scoped and include a clear reason.
Prefer expect over allow in Rust.
Use rstest fixtures for shared test setup in Rust.
Replace duplicated tests with #[rstest(…)] parameterized cases in Rust.
Prefer mockall for mocks/stubs in Rust.
Prefer .expect() over .unwrap() in Rust.
Use concat!() to combine long string literals in Rust rather than escaping newlines with a backslash.
Prefer single-line versions of functions in Rust where appropriate (e.g., pub fn new(id: u64) -> Self { Self(id) } instead of multi-line variants).
Use NewTypes in Rust to model domain values and eliminate "i...

Files:

  • src/parser/cst_builder/tree.rs

⚙️ CodeRabbit configuration file

**/*.rs: * Seek to keep the cognitive complexity of functions no more than 9.

  • Adhere to single responsibility and CQRS
  • Place function attributes after doc comments.
  • Do not use return in single-line functions.
  • Move conditionals with >2 branches into a predicate function.
  • Avoid unsafe unless absolutely necessary.
  • Every module must begin with a //! doc comment that explains the module's purpose and utility.
  • Comments and docs must follow en-GB-oxendict (-ize / -yse / -our) spelling and grammar
  • Lints must not be silenced except as a last resort.
    • #[allow] is forbidden.
    • Only narrowly scoped #[expect(lint, reason = "...")] is allowed.
    • No lint groups, no blanket or file-wide suppression.
    • Include FIXME: with link if a fix is expected.
  • Where code is only used by specific features, it must be conditionally compiled or a conditional expectation for unused_code applied.
  • Use rstest fixtures for shared setup and to avoid repetition between tests.
  • Replace duplicated tests with #[rstest(...)] parameterised cases.
  • Prefer mockall for mocks/stubs.
  • Prefer .expect() over .unwrap() in tests.
  • .expect() and .unwrap() are forbidden outside of tests. Errors must be propagated.
  • Ensure that any API or behavioural changes are reflected in the documentation in docs/
  • Ensure that any completed roadmap steps are recorded in the appropriate roadmap in docs/
  • Files must not exceed 400 lines in length
    • Large modules must be decomposed
    • Long match statements or dispatch tables should be decomposed by domain and collocated with targets
    • Large blocks of inline data (e.g., test fixtures, constants or templates) must be moved to external files and inlined at compile-time or loaded at run-time.
  • Environment access (env::set_var and env::remove_var) are always unsafe in Rust 2024 and MUST be marked as such
    • For testing of functionality depending upon environment variables, dependency injection and...

Files:

  • src/parser/cst_builder/tree.rs
🧠 Learnings (1)
📚 Learning: 2025-12-26T13:23:35.435Z
Learnt from: leynos
Repo: leynos/ddlint PR: 197
File: src/test_util/literals.rs:135-139
Timestamp: 2025-12-26T13:23:35.435Z
Learning: In Rust projects, rustfmt may reflow long simple constructor functions (e.g., pub fn lit_bool(b: bool) -> Expr { Expr::Literal(Literal::Bool(b)) }) to a multi-line form. Ensure you run and respect rustfmt formatting so CI formatting checks pass. Do not manually enforce non-default line breaks; format with rustfmt. Consider adding a rustfmt --check (or cargo fmt) step in CI and running cargo fmt before commits to avoid formatting-related failures.

Applied to files:

  • src/parser/cst_builder/tree.rs
🧬 Code graph analysis (1)
src/parser/cst_builder/tree.rs (3)
src/parser/token_stream.rs (2)
  • src (118-120)
  • tokens (105-107)
src/parser/cst_builder/spans.rs (1)
  • builder (148-150)
src/test_util/mod.rs (1)
  • tokenize (41-43)
🔍 Remote MCP Deepwiki

Relevant facts useful for reviewing this PR (sources cited):

  • build_green_tree location & role: build_green_tree constructs the rowan GreenNode using tokens + ParsedSpans; it uses SpanCursors (advance_and_start / finish) and push_token to drive CST construction. Changes to validate and skip tokens before advancing will alter that token-processing loop.

  • Span/ParsedSpans invariants: ParsedSpans are built by span-scanner/collectors and ParsedSpansBuilder enforces span list validation (sorted / non‑overlapping) in debug builds via validate_span_lists_sorted — the codebase already uses debug-only assertions for span invariants. This aligns with the PR’s debug-vs-release panic vs warn behavior.

  • Tokenization and Span semantics: Tokens are produced by tokenize() as Vec<(SyntaxKind, Span)>, with Span = Range<usize> (byte offsets). The tokenizer already produces N_ERROR tokens for lexing errors and logs warnings for out‑of‑bounds spans in push_token; out‑of‑bounds spans can originate from lexer mismatches. Thus validating spans before cursor advancement targets a known risk.

  • SpanCursors/SpanCursor behavior: build_green_tree coordinates seven SpanCursor instances (imports, relations, functions, rules, etc.) which call start_if/finish_if based on token positions; skipping tokens with invalid spans will change when/if these cursors start or finish nodes. Review should confirm skipping doesn’t break node boundaries or leave unfinished nodes.

  • Existing error-recovery & tests philosophy: The project emphasizes error recovery (emit N_ERROR nodes, continue parsing) and uses property/round‑trip tests (rstest; plan for proptest). Tests in the codebase check for non‑panic behavior on malformed inputs; the PR’s added debug panic test and release skipping test map directly to these testing goals but reviewers should ensure CI/test targets (debug vs release) reflect intended assertions.

  • Documentation touchpoint: docs/parser-plan.md and Architecture/Core Parsing Pipeline docs describe the pipeline: tokenize() → parse_tokens() → build_green_tree() → ast::Root. The docs were updated by the PR; confirm doc wording matches implementation (early validation before cursor advance, debug panic vs release warn+skip).

⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Sourcery review
🔇 Additional comments (3)
src/parser/cst_builder/tree.rs (3)

80-95: LGTM: validation logic correctly implements debug panic vs release skip.

The dual-mode behaviour aligns with the PR objectives—debug builds catch lexer bugs early, release builds preserve error recovery. The validation criteria (start <= end && end <= src_len) are correct.

Minor style observation: the blank line at 88 between mutually exclusive cfg blocks is slightly unusual (typically omitted), but this does not affect correctness.


124-126: Validate-and-skip approach correctly prevents invalid spans from reaching cursor logic.

Skipping tokens with invalid spans is safe: the SpanCursors track positions from ParsedSpans (which are built from valid tokens by parse_tokens), so an out-of-bounds token from a lexer bug won't correspond to any cursor-tracked span. This prevents cursor state corruption.
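
One way to picture the cursor-corruption concern is with a toy model. The sketch below is an assumption-laden illustration, not the crate's SpanCursors: SpanCursor here is a hypothetical single cursor with simplified advance_to and starts_at methods, and count_started_nodes is an invented helper that counts how many statement nodes would be opened.

```rust
use std::ops::Range;

type Span = Range<usize>;

/// Hypothetical cursor over sorted, non-overlapping statement spans.
struct SpanCursor<'a> {
    spans: &'a [Span],
    index: usize,
}

impl<'a> SpanCursor<'a> {
    fn new(spans: &'a [Span]) -> Self {
        Self { spans, index: 0 }
    }

    /// Moves past every statement span that ends at or before `pos`.
    fn advance_to(&mut self, pos: usize) {
        while self.spans.get(self.index).map_or(false, |s| s.end <= pos) {
            self.index += 1;
        }
    }

    /// Reports whether the current statement span starts exactly at `pos`.
    fn starts_at(&self, pos: usize) -> bool {
        self.spans.get(self.index).map_or(false, |s| s.start == pos)
    }
}

/// Counts statement nodes that would be opened while walking `tokens`.
/// With `validate` set, tokens whose spans fall outside the source are
/// skipped before the cursor moves.
fn count_started_nodes(
    tokens: &[Span],
    statements: &[Span],
    src_len: usize,
    validate: bool,
) -> usize {
    let mut cursor = SpanCursor::new(statements);
    let mut started = 0;
    for span in tokens {
        let in_bounds = span.start <= span.end && span.end <= src_len;
        if validate && !in_bounds {
            continue;
        }
        cursor.advance_to(span.start);
        if cursor.starts_at(span.start) {
            started += 1;
        }
    }
    started
}
```

In this model, with statement spans 0..2 and 2..4 and a stray token at 100..101 between two valid tokens, both statement nodes are opened when validation is on, but only the first when it is off, because the stray offset drags the cursor past the second span.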


182-205: LGTM: tests correctly cover debug panic and release skip behaviours.

The debug test (lines 182-191) appropriately asserts that an out-of-bounds span (0..1 with empty source) triggers the expected panic. The release test (lines 193-205) verifies that an appended OOB token is skipped by confirming the resulting tree text matches the original source. Both tests are correctly gated with cfg attributes for their respective build modes.


Replaced the usage of positional formatting with inline format strings in the panic and warn macro calls within `validate_token_span` for improved readability and consistency.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
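
For readers unfamiliar with the two styles named in this commit message, the sketch below contrasts them using format!. The function names and message wording are hypothetical; only the formatting technique is the point.

```rust
/// Positional style: each `{}` is filled by the next trailing argument.
fn positional_message(start: usize, end: usize, src_len: usize) -> String {
    format!("token span {}..{} exceeds source length {}", start, end, src_len)
}

/// Inline style: identifiers in scope are captured by name in the string.
fn inline_message(start: usize, end: usize, src_len: usize) -> String {
    format!("token span {start}..{end} exceeds source length {src_len}")
}
```

Inline capture accepts plain identifiers only, so a field such as span.start has to be bound to a local variable before it can appear inside the braces.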
@leynos leynos marked this pull request as ready for review January 3, 2026 02:00

sourcery-ai Bot (Contributor) left a comment

Hey - I've reviewed your changes and they look great!


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06f9376 and 38850c7.

📒 Files selected for processing (2)
  • docs/parser-plan.md
  • src/parser/cst_builder/tree.rs
🧰 Additional context used
📓 Path-based instructions (5)
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

**/*.md: Markdown paragraphs must be wrapped at 80 columns, and bullet points need the same limit.
Code blocks in Markdown must be wrapped at 120 columns.
Tables in Markdown must not be wrapped, and headings must remain unwrapped.
Use dashes (-) for list bullets in Markdown.
Use GitHub-flavoured Markdown footnotes ([^1]) for references and footnotes.

Files:

  • docs/parser-plan.md

⚙️ CodeRabbit configuration file

**/*.md: * Avoid 2nd person or 1st person pronouns ("I", "you", "we")

  • Use en-GB-oxendict (-ize / -yse / -our) spelling and grammar
  • Headings must not be wrapped.
  • Documents must start with a level 1 heading
  • Headings must correctly increase or decrease by no more than one level at a time
  • Use GitHub-flavoured Markdown style for footnotes and endnotes.
  • Numbered footnotes must be numbered by order of appearance in the document.

Files:

  • docs/parser-plan.md
docs/**/*.{md,rs,hs,py}

📄 CodeRabbit inference engine (docs/differential-datalog-parser-syntax-spec-migration-plan.md)

Update all repository references from docs/haskell-parser-analysis.md to point to the normative DDlog syntax specification and new implementation notes

Files:

  • docs/parser-plan.md
docs/**/*.md

📄 CodeRabbit inference engine (docs/documentation-style-guide.md)

docs/**/*.md: Use British English based on Oxford English Dictionary locale en-GB, including: -ize suffixes in words like 'realize' and 'organization', -lyse suffixes in words like 'analyse' and 'paralyse', -our suffixes in words like 'colour' and 'behaviour', -re suffixes in words like 'centre' and 'calibre', double 'l' in words like 'cancelled' and 'counsellor', maintain 'e' in words like 'likeable' and 'liveable', -ogue suffixes in words like 'catalogue'
Keep United States (US) spelling when used in an API (e.g., 'color')
Use the Oxford comma in documentation: 'ships, planes, and hovercraft' where it aids comprehension
Treat company names as collective nouns in documentation (e.g., 'Concordat Industries are expanding')
Write headings in sentence case
Use Markdown headings (#, ##, ###, etc.) in order without skipping levels
Follow markdownlint recommendations for markdown formatting
Always provide a language identifier for fenced code blocks; use 'plaintext' for non-code text
Use '-' as the first level bullet and renumber lists when items change
Prefer inline links using text or angle brackets around the URL in documentation
Ensure blank lines before and after bulleted lists and fenced blocks in documentation
Ensure tables have a delimiter line below the header row
Expand any uncommon acronym on first use (e.g., Continuous Integration (CI))
Wrap paragraphs at 80 columns in documentation
Wrap code at 120 columns in documentation
Do not wrap tables in documentation
Use GitHub-flavoured numeric footnotes referenced as [^1] in documentation
Footnotes must be numbered in order of appearance in the document
Caption every table and every diagram in documentation
Include Mermaid diagrams where they add clarity to documentation
Use 'alt text' syntax for embedding figures and provide brief alt text describing the content in documentation
Add a short description before each Mermaid diagram in documentation so screen readers can understand it

Files:

  • docs/parser-plan.md
docs/**/!(README).md

📄 CodeRabbit inference engine (docs/documentation-style-guide.md)

Avoid first and second person personal pronouns outside the README.md file

Files:

  • docs/parser-plan.md
🔍 Remote MCP Deepwiki

Summary of Relevant Context for PR Review

Token Span Structure and Usage

Token spans in the ddlint parser are represented by the Span type, which is a std::ops::Range<usize> indicating a byte range within the source text. These spans are organized in a ParsedSpans structure that categorizes spans for different statement types (imports, typedefs, relations, indexes, functions, transformers, and rules), and are used by the build_green_tree function through SpanCursors to manage iteration and control CST node creation.
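
A rough sketch of the shapes this paragraph describes, under stated assumptions: the field names on ParsedSpans are hypothetical (chosen to mirror the seven statement categories listed), and is_sorted_non_overlapping is an invented helper that only approximates the kind of invariant a check such as validate_span_lists_sorted is said to enforce.

```rust
use std::ops::Range;

/// Byte range within the source text.
type Span = Range<usize>;

/// Hypothetical layout: one span list per statement category.
#[derive(Debug, Default)]
struct ParsedSpans {
    imports: Vec<Span>,
    typedefs: Vec<Span>,
    relations: Vec<Span>,
    indexes: Vec<Span>,
    functions: Vec<Span>,
    transformers: Vec<Span>,
    rules: Vec<Span>,
}

/// True when every span is ordered and starts at or after the previous one ends.
fn is_sorted_non_overlapping(spans: &[Span]) -> bool {
    spans.iter().all(|s| s.start <= s.end)
        && spans.windows(2).all(|pair| pair[0].end <= pair[1].start)
}

impl ParsedSpans {
    /// Checks the invariant for every category list.
    fn all_sorted(&self) -> bool {
        [
            &self.imports,
            &self.typedefs,
            &self.relations,
            &self.indexes,
            &self.functions,
            &self.transformers,
            &self.rules,
        ]
        .iter()
        .all(|list| is_sorted_non_overlapping(list))
    }
}
```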

Existing Error Handling Patterns

The ddlint parser employs a robust error handling strategy that focuses on error recovery rather than halting on first error. When errors occur, an N_ERROR node is emitted into the CST, encapsulating invalid tokens, and the parser attempts to resynchronize by skipping tokens. The parser also uses property-based testing with proptest to ensure the system does not panic even with arbitrary or malformed inputs.

Out-of-Bounds Span Handling

Invalid or out-of-bounds token spans can occur from tokenization mismatches (where a logos lexer produces a span extending beyond the source length) or incorrect span ordering/overlaps during span analysis. Currently, the parser logs warnings for out-of-bounds spans during the push_token function, sets the token text to an empty string to prevent further errors, and in debug builds uses validate_span_lists_sorted to panic on span ordering issues.
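
The warn-and-substitute behaviour attributed to push_token can be approximated with str::get, which returns None instead of panicking when a range does not address a valid slice. The token_text function below is a hypothetical helper for illustration, and eprintln! stands in for the crate's actual logging call.

```rust
use std::ops::Range;

type Span = Range<usize>;

/// Returns the token's text, or an empty string (after a warning) when the
/// span does not address a valid slice of `src`.
fn token_text<'a>(src: &'a str, span: &Span) -> &'a str {
    src.get(span.clone()).unwrap_or_else(|| {
        // Stand-in for a logging call such as `log::warn!` (assumption).
        eprintln!(
            "warning: token span {span:?} is outside the source (length {})",
            src.len()
        );
        ""
    })
}
```

str::get also returns None when an offset falls inside a multi-byte character, so this fallback covers misaligned spans as well as out-of-bounds ones.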

CST Construction Flow

The build_green_tree function initializes a GreenNodeBuilder and iterates through tokens, using SpanCursors to determine when to start and finish nodes. The SpanCursors provides methods like advance_and_start and finish which call underlying SpanCursor methods (advance_to, start_if, finish_if) to coordinate node creation with token processing.

Review Implications

This PR appropriately extends the existing validation philosophy by introducing early token-span bounds checking before cursor advancement. The dual behavior (panic in debug builds to surface lexer bugs, warn and skip in release builds for resilience) aligns with the established error recovery patterns and property-based testing strategy already in place for ensuring robustness in production.

⏰ Context from checks skipped due to timeout of 120000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Sourcery review
  • GitHub Check: Sourcery review
🔇 Additional comments (4)
docs/parser-plan.md (1)

90-93: LGTM!

The documentation accurately describes the token span validation behavior, distinguishing debug (panic) from release (warn-and-skip) builds. Formatting complies with the 80-column guideline.

src/parser/cst_builder/tree.rs (3)

124-126: LGTM!

The validation check correctly guards cursor advancement. Tokens with out-of-bounds spans are skipped, preventing CST corruption.

Note: After this validation, push_token at line 129 will never receive an invalid span, rendering the bounds check at lines 139-149 redundant. However, the redundancy provides defense in depth and may be intentional.


182-191: LGTM!

The test correctly verifies that debug builds panic on out-of-bounds token spans. The minimal setup (empty source, span 0..1) is sufficient to trigger the validation failure.


193-205: LGTM!

The test correctly verifies that release builds skip out-of-bounds token spans without panicking. Appending an out-of-bounds token after valid tokenization and asserting the tree matches the original source confirms the skip behavior.

Comment thread src/parser/cst_builder/tree.rs
…nching

The validate_token_span function's conditional logic was refactored to properly handle out-of-bounds spans. The method now correctly panics in debug mode and logs a warning in release mode when a token span is invalid, ensuring improved error handling and consistency.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
@leynos leynos changed the title from "Validate token spans before CST construction; panic in debug, warn in release" to "Validate token spans before CST; panic in debug, skip in release" Jan 3, 2026
leynos (Owner, Author) commented Jan 3, 2026

@coderabbitai review

coderabbitai Bot (Contributor) commented Jan 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@leynos leynos linked an issue Jan 3, 2026 that may be closed by this pull request
@leynos leynos merged commit 6182d8a into main Jan 3, 2026
4 checks passed
@leynos leynos deleted the terragon/handle-token-span-out-of-bounds-r2hkjz branch January 3, 2026 14:30

Linked issue (may be closed by this pull request): Add bounds checking for token spans in CST construction