UTF-8 Character Boundary Issue Report
Executive Summary
A critical bug was discovered in the graphrag-rs Rust codebase that causes runtime panics when processing documents containing multi-byte UTF-8 characters (e.g., Unicode arrows, emoji, non-ASCII text). The panic occurs due to improper string slicing at byte indices that fall within multi-byte characters.
Issue Details
Error Message
thread 'tokio-runtime-worker' (4577899) panicked at graphrag-core/src/entity/mod.rs:534:32:
byte index 384 is not a char boundary; it is inside '→' (bytes 383..386) of `ponsible use of AI in government. In
this document, the government recognizes the
potential benefits of AI and notes the public
expects the government to use the technology
safely and responsibly. According to the policy,
government agencies must adop`[...]
Root Cause
Rust's string type (&str) uses UTF-8 encoding where characters can be 1-4 bytes. When slicing a string with text[start..end], Rust requires that both start and end indices fall on character boundaries. If they fall in the middle of a multi-byte character, Rust panics.
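The boundary rule can be seen on a short string. The following is a minimal sketch using only standard-library methods; the helper names checked_slice and char_boundaries are hypothetical and do not appear in graphrag-rs.

```rust
/// Returns `text[start..end]` when both indices are char boundaries,
/// or `None` otherwise. Unlike the indexing operator, `str::get` never panics.
fn checked_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

/// Lists every byte offset in `text` that is a valid char boundary,
/// including 0 and `text.len()`.
fn char_boundaries(text: &str) -> Vec<usize> {
    (0..=text.len())
        .filter(|&i| text.is_char_boundary(i))
        .collect()
}
```

For the string "a→b", offsets 2 and 3 are missing from the boundary list because they fall inside the 3-byte arrow, so any slice that starts or ends there is rejected.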
The problematic code pattern was:
let name = text[start..end].trim().to_string();
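Here start and end were derived from byte positions returned by find() operations, without ensuring that the derived values still align with character boundaries. Note that find() itself returns boundary-aligned offsets; an index can only land inside a multi-byte character after further arithmetic on it (for example, adding or subtracting a fixed byte count) or when it is applied to a different string than the one searched.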
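The report does not show how start and end were computed, so the following is only an assumed illustration of one way a find()-based offset can drift off a boundary: extending the match by a fixed number of bytes. The function window_after and its width parameter are hypothetical.

```rust
/// Hypothetical illustration: locate `needle` with `find` and extend the
/// range by `width` bytes past the match, as a context window might.
/// The returned `end` is a raw byte offset and is NOT boundary-checked.
fn window_after(text: &str, needle: &str, width: usize) -> Option<(usize, usize)> {
    // `find` returns the byte offset of the match, always a char boundary.
    let start = text.find(needle)?;
    // Fixed-width byte arithmetic can land inside a multi-byte character.
    let end = (start + needle.len() + width).min(text.len());
    Some((start, end))
}
```

With the text "Acme Corp → Berlin", the needle "Corp", and a width of 2, the start offset 5 is a valid boundary but the end offset 11 falls inside the arrow (bytes 10..13), so text[5..11] would panic.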
Specific Character That Triggered the Panic
- Character: → (Unicode right arrow, U+2192)
- Byte representation: 3 bytes in UTF-8 (E2 86 92), occupying bytes 383..386 of the text in the error
- The code attempted to slice at byte 384, which falls in the middle of this 3-byte character
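The byte layout described above can be checked directly. This is a small sketch with hypothetical helper names (utf8_bytes, enclosing_char); the second helper reproduces the kind of information the panic message prints, namely which character contains a given byte index and which bytes it spans.

```rust
use std::ops::Range;

/// Returns the UTF-8 encoding of a single character as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Finds the character whose encoded bytes contain `index`, together with
/// the byte range it occupies. Returns `None` if `index` is past the end.
fn enclosing_char(text: &str, index: usize) -> Option<(char, Range<usize>)> {
    text.char_indices()
        .map(|(i, c)| (c, i..i + c.len_utf8()))
        .find(|(_, range)| range.contains(&index))
}
```

Applied to "a→b" with index 2, enclosing_char reports the arrow and the range 1..4, which mirrors the "inside '→' (bytes 383..386)" wording in the error.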
Fix Applied
Phase 1: Critical Fix (Immediate)
File: graphrag-core/src/entity/mod.rs
Changes:
- Added a safe string slicing helper method (after line 552):
/// Safely extract a substring, ensuring UTF-8 character boundaries
///
/// This helper prevents panics when slicing strings with multi-byte UTF-8 characters.
/// It adjusts start/end indices to the nearest valid character boundaries.
fn safe_slice(&self, text: &str, start: usize, end: usize) -> Option<&str> {
    let len = text.len();

    // Clamp indices to string length
    let start = start.min(len);
    let end = end.min(len);

    // Adjust start to the next valid character boundary
    let start = if start == 0 || text.is_char_boundary(start) {
        start
    } else {
        // Find the next valid character boundary after start
        let mut pos = start;
        while pos < len && !text.is_char_boundary(pos) {
            pos += 1;
        }
        pos
    };

    // Adjust end to the previous valid character boundary
    let end = if end == len || text.is_char_boundary(end) {
        end
    } else {
        // Find the previous valid character boundary before end
        let mut pos = end;
        while pos > 0 && !text.is_char_boundary(pos) {
            pos -= 1;
        }
        pos
    };

    // Ensure start <= end
    if start <= end {
        text.get(start..end)
    } else {
        None
    }
}
- Updated line 509 (organization suffix extraction):
// Before:
let name = text[start..end].trim().to_string();
// After:
let name = self.safe_slice(text, start, end)
    .map(|s| s.trim().to_string())
    .unwrap_or_default();
- Updated line 534 (organization prefix extraction):
// Before:
let name = text[start..end].trim().to_string();
// After:
let name = self.safe_slice(text, start, end)
    .map(|s| s.trim().to_string())
    .unwrap_or_default();
How the Fix Works
The safe_slice method:
- Clamps indices to valid string bounds
- Uses text.is_char_boundary(pos) to check if an index is valid
- Adjusts invalid indices by moving forward (for start) or backward (for end) until a valid boundary is found
- Uses text.get(start..end), which returns Option<&str> instead of panicking
- Returns None if no valid slice can be created
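This approach relies only on standard-library methods (str::is_char_boundary and str::get) and handles multi-byte characters without panicking. Because the indices snap inward, the returned slice can be up to three bytes shorter at each end than the requested range, and at the two call sites a None result becomes an empty name via unwrap_or_default().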
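The behavior can be exercised in isolation. The following is a standalone sketch adapted from the helper in this report: it drops the &self receiver (which the method body does not use) and folds the boundary checks into the loops, which is assumed to be behaviorally equivalent. The wrapper extract_name is a hypothetical name for the call-site pattern at lines 509 and 534.

```rust
/// Free-function variant of the report's `safe_slice` (sketch).
/// Snaps `start` forward and `end` backward to char boundaries.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    let len = text.len();
    // Clamp both indices to the string length.
    let mut start = start.min(len);
    let mut end = end.min(len);
    // Move start forward to the next valid boundary (len is always one).
    while start < len && !text.is_char_boundary(start) {
        start += 1;
    }
    // Move end backward to the previous valid boundary (0 is always one).
    while end > 0 && !text.is_char_boundary(end) {
        end -= 1;
    }
    // The adjusted range may be empty or inverted; reject the inverted case.
    if start <= end {
        text.get(start..end)
    } else {
        None
    }
}

/// Hypothetical wrapper mirroring the updated call sites: trims the slice
/// and falls back to an empty string when no valid slice exists.
fn extract_name(text: &str, start: usize, end: usize) -> String {
    safe_slice(text, start, end)
        .map(|s| s.trim().to_string())
        .unwrap_or_default()
}
```

For "a→b", a request for bytes 0..2 yields "a" (the end index moves back from 2 to 1), while a request for bytes 2..3 yields None because both indices sit inside the arrow and the adjusted range is inverted. Callers that should not record empty names may want to check for the empty result explicitly.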