I had this UTF-8 issue when I used graphrag_py. It compiled correctly with maturin develop, but then caused the issue below when used in a script. This fix worked for me; please verify whether you have this issue as well. #6

@marcusjihansson

UTF-8 Character Boundary Issue Report

Executive Summary

A critical bug was discovered in the graphrag-rs Rust codebase that causes runtime panics when processing documents containing multi-byte UTF-8 characters (e.g., Unicode arrows, emoji, non-ASCII text). The panic occurs due to improper string slicing at byte indices that fall within multi-byte characters.

Issue Details

Error Message

thread 'tokio-runtime-worker' (4577899) panicked at graphrag-core/src/entity/mod.rs:534:32:
byte index 384 is not a char boundary; it is inside '→' (bytes 383..386) of `ponsible use of AI in government. In
this document, the government recognizes the
potential benefits of AI and notes the public
expects the government to use the technology
safely and responsibly. According to the policy,
government agencies must adop`[...]

Root Cause

Rust's string type (&str) uses UTF-8 encoding where characters can be 1-4 bytes. When slicing a string with text[start..end], Rust requires that both start and end indices fall on character boundaries. If they fall in the middle of a multi-byte character, Rust panics.

The problematic code pattern was:

let name = text[start..end].trim().to_string();

Here start and end were computed from byte positions returned by find() operations, with no check that the resulting indices fall on character boundaries.
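
To make the failure mode concrete, here is a minimal sketch of how an index derived from find() can end up inside a character. The helper window_end and the sample text are hypothetical, and the fixed byte offset is an assumption about how such indices might be derived; the exact arithmetic in graphrag-core is not shown in this report. str::find itself always returns a character boundary, so the trouble starts once byte arithmetic is applied to its result.

```rust
/// Hypothetical helper: byte offset `offset` bytes past the first match of `needle`.
/// This mimics taking a fixed-width window after a `find()` hit.
fn window_end(text: &str, needle: &str, offset: usize) -> Option<usize> {
    text.find(needle).map(|pos| pos + offset)
}

fn main() {
    let text = "AI policy → Acme Corp";

    // The index returned by `find` is always a char boundary.
    let start = text.find("policy").expect("needle is present");
    assert!(text.is_char_boundary(start));

    // Adding a fixed number of bytes can land inside the 3-byte '→'.
    let end = window_end(text, "policy", 8).expect("needle is present");
    assert!(!text.is_char_boundary(end));

    // `&text[start..end]` would panic here; `get` returns None instead.
    assert_eq!(text.get(start..end), None);

    // One byte earlier the range ends just before the arrow and is valid.
    assert_eq!(text.get(start..end - 1), Some("policy "));
}
```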

Specific Character That Triggered the Panic

  • Character: → (U+2192, Unicode rightwards arrow)
  • Byte representation: 3 bytes (bytes 383..386 in the error)
  • The code attempted to slice at byte 384, which is in the middle of this 3-byte character
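
The byte layout described above can be checked directly. This is a small sketch using only the standard library; the helper char_boundaries and the three-character sample string are made up for illustration.

```rust
/// Hypothetical helper: every byte offset in `text` at which slicing is allowed.
fn char_boundaries(text: &str) -> Vec<usize> {
    (0..=text.len())
        .filter(|&i| text.is_char_boundary(i))
        .collect()
}

fn main() {
    // '→' (U+2192) is encoded as three bytes in UTF-8.
    assert_eq!('→'.len_utf8(), 3);
    assert_eq!("→".as_bytes(), &[0xE2u8, 0x86, 0x92]);

    // In "a→b" the arrow occupies bytes 1..4, so offsets 2 and 3 are not boundaries.
    let text = "a→b";
    assert_eq!(char_boundaries(text), vec![0, 1, 4, 5]);
    assert!(!text.is_char_boundary(2));
}
```

In the reported panic the arrow occupies bytes 383..386, so 383 and 386 are valid slice positions while 384 and 385 are not.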

Fix Applied

Phase 1: Critical Fix (Immediate)

File: graphrag-core/src/entity/mod.rs

Changes:

  1. Added a safe string slicing helper method (after line 552):
/// Safely extract a substring, ensuring UTF-8 character boundaries
/// 
/// This helper prevents panics when slicing strings with multi-byte UTF-8 characters.
/// It adjusts start/end indices to the nearest valid character boundaries.
// Explicit lifetime: the result borrows from `text`, not from `&self`.
fn safe_slice<'a>(&self, text: &'a str, start: usize, end: usize) -> Option<&'a str> {
    let len = text.len();
    
    // Clamp indices to string length
    let start = start.min(len);
    let end = end.min(len);
    
    // Adjust start to the next valid character boundary
    let start = if start == 0 || text.is_char_boundary(start) {
        start
    } else {
        // Find the next valid character boundary after start
        let mut pos = start;
        while pos < len && !text.is_char_boundary(pos) {
            pos += 1;
        }
        pos
    };
    
    // Adjust end to the previous valid character boundary
    let end = if end == len || text.is_char_boundary(end) {
        end
    } else {
        // Find the previous valid character boundary before end
        let mut pos = end;
        while pos > 0 && !text.is_char_boundary(pos) {
            pos -= 1;
        }
        pos
    };
    
    // Ensure start <= end
    if start <= end {
        text.get(start..end)
    } else {
        None
    }
}
  2. Updated line 509 (organization suffix extraction):
// Before:
let name = text[start..end].trim().to_string();

// After:
let name = self.safe_slice(text, start, end)
    .map(|s| s.trim().to_string())
    .unwrap_or_default();
  3. Updated line 534 (organization prefix extraction):
// Before:
let name = text[start..end].trim().to_string();

// After:
let name = self.safe_slice(text, start, end)
    .map(|s| s.trim().to_string())
    .unwrap_or_default();
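
One consequence of the call-site pattern above is that a failed slice becomes an empty name. If empty names are not wanted downstream, a caller could skip them instead. The sketch below shows both behaviours; extract_name is a hypothetical helper, not part of graphrag-core, and whether empty names matter there is an assumption.

```rust
/// Hypothetical helper: turn an optional slice into a trimmed, non-empty name.
/// Returns None both when slicing failed and when only whitespace remains.
fn extract_name(slice: Option<&str>) -> Option<String> {
    slice
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
}

fn main() {
    // Pattern used in the fix: a missing slice becomes an empty string.
    let failed: Option<&str> = None;
    let name = failed.map(|s| s.trim().to_string()).unwrap_or_default();
    assert_eq!(name, "");

    // Alternative: keep the Option so callers can skip unusable names.
    assert_eq!(extract_name(Some("  Acme Corp ")), Some("Acme Corp".to_string()));
    assert_eq!(extract_name(Some("   ")), None);
    assert_eq!(extract_name(None), None);
}
```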

How the Fix Works

The safe_slice method:

  1. Clamps indices to valid string bounds
  2. Uses text.is_char_boundary(pos) to check if an index is valid
  3. Adjusts invalid indices by moving forward (for start) or backward (for end) until a valid boundary is found
  4. Uses text.get(start..end) which returns Option<&str> instead of panicking
  5. Returns None if no valid slice can be created
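
The steps above can be exercised in isolation. The following is a condensed sketch written as a free function (no &self) so it compiles on its own; it is an adaptation for illustration, not the exact method from the patch, and the sample string is made up.

```rust
/// Condensed free-function variant of the helper, for illustration only.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    let len = text.len();
    // Step 1: clamp both indices to the string length.
    let mut start = start.min(len);
    let mut end = end.min(len);
    // Steps 2-3: move start forward and end backward to char boundaries.
    while start < len && !text.is_char_boundary(start) {
        start += 1;
    }
    while end > 0 && !text.is_char_boundary(end) {
        end -= 1;
    }
    // Steps 4-5: `get` never panics; an inverted range yields None.
    if start <= end {
        text.get(start..end)
    } else {
        None
    }
}

fn main() {
    let text = "a→b"; // '→' occupies bytes 1..4

    // An end index inside the arrow is pulled back to byte 1.
    assert_eq!(safe_slice(text, 0, 2), Some("a"));
    // A start index inside the arrow is pushed forward to byte 4.
    assert_eq!(safe_slice(text, 2, 5), Some("b"));
    // Out-of-range indices are clamped to the string length.
    assert_eq!(safe_slice(text, 0, 99), Some("a→b"));
    // A range lying entirely inside the arrow cannot be repaired.
    assert_eq!(safe_slice(text, 2, 3), None);
}
```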

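This approach relies only on standard-library checks (is_char_boundary and get) and does not panic on multi-byte characters. Because the indices are adjusted inward, the returned slice can be a few bytes shorter than requested, and a range that lies entirely inside one character yields None, which the call sites turn into an empty name via unwrap_or_default().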
