I hit this UTF-8 issue while using graphrag_py. It built correctly with `maturin develop`, but panicked at runtime when used in a script, as shown below. This fix worked for me; please verify whether you see the same issue.

# UTF-8 Character Boundary Issue Report

## Executive Summary

A critical bug was discovered in the `graphrag-rs` Rust codebase that causes runtime panics when processing documents containing multi-byte UTF-8 characters (e.g., Unicode arrows, emoji, non-ASCII text). The panic occurs due to improper string slicing at byte indices that fall within multi-byte characters.

## Issue Details

### Error Message
```
thread 'tokio-runtime-worker' (4577899) panicked at graphrag-core/src/entity/mod.rs:534:32:
byte index 384 is not a char boundary; it is inside '→' (bytes 383..386) of `ponsible use of AI in government. In
this document, the government recognizes the
potential benefits of AI and notes the public
expects the government to use the technology
safely and responsibly. According to the policy,
government agencies must adop`[...]
```

### Root Cause

Rust's string type (`&str`) uses UTF-8 encoding where characters can be 1-4 bytes. When slicing a string with `text[start..end]`, Rust requires that both `start` and `end` indices fall on character boundaries. If they fall in the middle of a multi-byte character, Rust panics.

The problematic code pattern was:
```rust
let name = text[start..end].trim().to_string();
```

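Here `start` and `end` were byte offsets derived from `find()` results, with no check that they still fell on character boundaries. Note that `find()` itself always returns an index on a character boundary, so an off-boundary index most likely comes from arithmetic applied afterwards (for example, adding or subtracting a fixed byte count) or from applying the offset to a different string than the one searched.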
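
To make the failure mode concrete, here is a minimal sketch of how an offset derived from `find()` can end up inside a character. `window_start` and `checked_slice` are hypothetical helpers written for illustration; they are not taken from `graphrag-rs`, and the fixed-width subtraction is only an assumption about the kind of arithmetic involved.

```rust
/// Hypothetical helper, not taken from graphrag-rs: returns the byte index
/// `width` bytes before the first occurrence of `needle` in `text`.
fn window_start(text: &str, needle: &str, width: usize) -> Option<usize> {
    // `find` yields a byte offset that is always on a char boundary...
    text.find(needle)
        // ...but subtracting a fixed byte count can land inside a character.
        .map(|pos| pos.saturating_sub(width))
}

/// Non-panicking counterpart of `text[start..end]`: returns `None` when
/// either index is out of range or not on a char boundary.
fn checked_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}
```

With `"x → Acme Corp"`, `find("Corp")` returns 11 and `window_start(text, "Corp", 8)` returns 3, which sits inside `→` (bytes 2..5). `checked_slice(text, 3, 11)` therefore returns `None`, where `text[3..11]` would panic.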

### Specific Character That Triggered the Panic
- Character: `→` (U+2192, rightwards arrow)
- Byte representation: 3 bytes in UTF-8 (bytes 383..386 in the error, i.e. offsets 383, 384 and 385)
- The code attempted to slice at byte 384, the second byte of this 3-byte character
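
The byte layout can be checked directly. The following sketch uses two small helper functions (`utf8_bytes` and `boundaries`, both names invented here) to show the encoded bytes of `→` and which byte offsets of a short string are valid slice points.

```rust
/// Returns the UTF-8 encoding of a single character as a byte vector.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Lists every byte offset in `text` (including `text.len()`) that is a
/// valid char boundary and can therefore be used as a slice index.
fn boundaries(text: &str) -> Vec<usize> {
    (0..=text.len())
        .filter(|&i| text.is_char_boundary(i))
        .collect()
}
```

For `"a→b"` the valid offsets are 0, 1, 4 and 5; offsets 2 and 3 fall inside the arrow, which is the same situation as byte 384 in the panic message.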

## Fix Applied

### Phase 1: Critical Fix (Immediate)

**File:** `graphrag-core/src/entity/mod.rs`

**Changes:**

1. **Added a safe string slicing helper method** (after line 552):
```rust
/// Safely extract a substring, ensuring UTF-8 character boundaries
///
/// This helper prevents panics when slicing strings with multi-byte UTF-8 characters.
/// It moves `start` forward and `end` backward to the closest valid character
/// boundary, so the returned slice never extends beyond the requested range.
fn safe_slice(&self, text: &str, start: usize, end: usize) -> Option<&str> {
    let len = text.len();
    
    // Clamp indices to string length
    let start = start.min(len);
    let end = end.min(len);
    
    // Adjust start to the next valid character boundary
    let start = if start == 0 || text.is_char_boundary(start) {
        start
    } else {
        // Find the next valid character boundary after start
        let mut pos = start;
        while pos < len && !text.is_char_boundary(pos) {
            pos += 1;
        }
        pos
    };
    
    // Adjust end to the previous valid character boundary
    let end = if end == len || text.is_char_boundary(end) {
        end
    } else {
        // Find the previous valid character boundary before end
        let mut pos = end;
        while pos > 0 && !text.is_char_boundary(pos) {
            pos -= 1;
        }
        pos
    };
    
    // Ensure start <= end
    if start <= end {
        text.get(start..end)
    } else {
        None
    }
}
```

2. **Updated line 509** (organization suffix extraction):
```rust
// Before:
let name = text[start..end].trim().to_string();

// After:
let name = self.safe_slice(text, start, end)
    .map(|s| s.trim().to_string())
    .unwrap_or_default();
```

3. **Updated line 534** (organization prefix extraction):
```rust
// Before:
let name = text[start..end].trim().to_string();

// After:
let name = self.safe_slice(text, start, end)
    .map(|s| s.trim().to_string())
    .unwrap_or_default();
```

### How the Fix Works

The `safe_slice` method:
1. Clamps indices to valid string bounds
2. Uses `text.is_char_boundary(pos)` to check if an index is valid
3. Adjusts invalid indices by moving forward (for start) or backward (for end) until a valid boundary is found
4. Uses `text.get(start..end)` which returns `Option<&str>` instead of panicking
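5. Returns `None` if the adjusted `start` ends up past the adjusted `end`; the call sites then fall back to an empty name via `unwrap_or_default()`

This avoids the panic without changing the result for input where the indices are already on boundaries. Because the range only shrinks inward, a slice whose index landed inside a multi-byte character drops that character rather than including it.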
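
The boundary adjustment described above can be exercised in isolation. The sketch below is a free-function restatement of `safe_slice` (without `&self`), written for illustration rather than copied from the patch. It relies on the fact that offsets 0 and `text.len()` are always char boundaries, so both loops terminate.

```rust
/// Illustrative free-function variant of `safe_slice`: clamps the indices,
/// moves `start` forward and `end` backward to char boundaries, then slices.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    let len = text.len();
    let mut start = start.min(len);
    let mut end = end.min(len);

    // `len` is always a boundary, so this stops at or before `len`.
    while !text.is_char_boundary(start) {
        start += 1;
    }
    // `0` is always a boundary, so this stops at or after `0`.
    while !text.is_char_boundary(end) {
        end -= 1;
    }

    if start <= end {
        text.get(start..end)
    } else {
        None
    }
}
```

For `"AI → Org"` the arrow occupies bytes 3..6. `safe_slice(text, 4, 10)` moves `start` forward to 6 and returns `" Org"`; `safe_slice(text, 0, 4)` moves `end` back to 3 and returns `"AI "`; `safe_slice(text, 4, 5)` returns `None` because the adjusted range is inverted.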
