Fix _decode_bytes to handle invalid UTF-8 gracefully #505

markknoffler · 2026-01-12T19:31:38Z

Solves #504

Solution: Fix `_decode_bytes` to handle invalid UTF-8 gracefully

Problem

The `_decode_bytes` function in `gemma/gm/data/_tasks.py` crashed with `UnicodeDecodeError` when it encountered invalid UTF-8 byte sequences, causing entire data processing pipelines to fail.

Solution

Added error handling to gracefully handle invalid UTF-8 sequences by:

- Wrapping the decode operation in a try-except block to catch `UnicodeDecodeError`
- Using the `errors='replace'` parameter to replace invalid bytes with the Unicode replacement character (U+FFFD)
- Issuing a `UnicodeWarning` to inform users about data quality issues

Implementation

Changes Made

File: gemma/gm/data/_tasks.py

Added import:
```
import warnings
```

Updated _decode_bytes function:

```
def _decode_bytes(element):
  """Decode bytes to string, handling invalid UTF-8 gracefully.

  Some datasets (e.g., TFDS) return bytes instead of str. This function
  decodes bytes to UTF-8 strings, replacing invalid UTF-8 sequences with
  the Unicode replacement character (U+FFFD) rather than crashing.

  Args:
    element: Either bytes to decode or a non-bytes value to return as-is.

  Returns:
    Decoded string if element is bytes, otherwise the element unchanged.
  """
  if isinstance(element, bytes):
    try:
      return element.decode("utf-8")
    except UnicodeDecodeError as e:
      # Replace invalid UTF-8 sequences with the Unicode replacement character.
      # This is safer than 'ignore' as it preserves data flow while marking
      # corrupted bytes, and is better than crashing the entire pipeline.
      warnings.warn(
          f"Encountered invalid UTF-8 byte sequence at position {e.start}-{e.end}: "
          f"{e.object[e.start:e.end]!r}. Replacing with Unicode replacement "
          "character (U+FFFD).",
          UnicodeWarning,
          stacklevel=2,
      )
      return element.decode("utf-8", errors="replace")
  else:
    return element
```

Design Decisions

Why `errors='replace'` instead of `errors='ignore'`?

errors='replace' replaces invalid bytes with U+FFFD (), which:
- Preserves the data flow and length information
- Makes corrupted data visible in the output
- Allows users to identify and filter problematic data if needed
errors='ignore' would silently remove invalid bytes, which:
- Loses information about data corruption
- Can cause subtle bugs if data length matters
- Makes it harder to detect and debug data quality issues
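
The difference between the two handlers can be seen on a single corrupted input. This is a minimal sketch using only the built-in `bytes.decode`; the sample bytes are made up for illustration.

```python
# Hypothetical input: the word "cafe latte" with one stray 0xFF byte inside.
raw = b"caf" + bytes([0xFF]) + b"e latte"

replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")

# ascii() escapes non-ASCII characters so the marker is easy to see.
print(ascii(replaced))
print(ascii(ignored))

# Only the 'replace' output carries evidence that the input was corrupted.
replaced_has_marker = "\ufffd" in replaced
ignored_has_marker = "\ufffd" in ignored
```

With `errors='ignore'` the result is indistinguishable from clean text, so a downstream consumer has no way to tell that a byte was dropped.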

Why issue warnings?

- Informs users about data quality issues without breaking the pipeline
- Helps with debugging by providing context about where corruption occurred
- Allows users to filter warnings if they choose to ignore them
- Follows Python best practices for handling recoverable errors
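
As a sketch of what filtering looks like in practice, callers can silence or escalate `UnicodeWarning` with the standard `warnings` module. The helper `decode_with_warning` below is a hypothetical stand-in that mirrors the warn-then-replace behavior; it is not the function from this PR.

```python
import warnings


def decode_with_warning(data: bytes) -> str:
  """Hypothetical stand-in mirroring the warn-then-replace behavior."""
  try:
    return data.decode("utf-8")
  except UnicodeDecodeError as e:
    warnings.warn(
        f"Invalid UTF-8 at position {e.start}-{e.end}",
        UnicodeWarning,
        stacklevel=2,
    )
    return data.decode("utf-8", errors="replace")


bad = b"abc\xff"

# Option 1: silence the warning for a block of code.
with warnings.catch_warnings():
  warnings.simplefilter("ignore", category=UnicodeWarning)
  silent_result = decode_with_warning(bad)

# Option 2: escalate the warning to an exception to fail fast on bad data.
with warnings.catch_warnings():
  warnings.simplefilter("error", category=UnicodeWarning)
  try:
    decode_with_warning(bad)
    raised = False
  except UnicodeWarning:
    raised = True
```

Option 2 restores the strict behavior for callers who would rather stop on corrupted input than continue with replacement characters.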

Mixed Valid/Invalid Bytes

```
mixed_bytes = b"Valid text " + bytes([0xFF, 0xFE]) + b" more valid text"
result = _decode_bytes(mixed_bytes)
# result: 'Valid text \ufffd\ufffd more valid text'
# Valid parts are preserved, only invalid bytes are replaced
```
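
Because the replacement character stays in the output, a downstream step can detect affected records. The following is a rough sketch, not part of this PR; `has_replacement_char` and the sample records are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"


def has_replacement_char(text: str) -> bool:
  """Hypothetical helper: True if the text contains U+FFFD."""
  return REPLACEMENT_CHAR in text


# Sample records decoded the same way the fallback path decodes them.
records = [
    b"clean example".decode("utf-8", errors="replace"),
    (b"Valid text " + bytes([0xFF, 0xFE]) + b" more valid text").decode(
        "utf-8", errors="replace"
    ),
]

# Keep only records that show no sign of corruption.
clean_records = [r for r in records if not has_replacement_char(r)]
```

One caveat: text that legitimately contains U+FFFD in the source would also be flagged, so this check is a heuristic rather than proof of a decoding error.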

Benefits

- Prevents pipeline crashes: data processing continues even with corrupted bytes
- Preserves valid data: valid UTF-8 sequences are decoded correctly
- Marks corrupted data: invalid bytes are replaced with visible U+FFFD characters
- User awareness: warnings inform users about data quality issues
- Backward compatible: valid inputs work exactly as before
- Negligible performance impact: the fallback path runs only when invalid UTF-8 is encountered

Testing

The fix handles:

✅ Valid UTF-8 bytes (works as before, no warnings)
✅ Invalid UTF-8 bytes (replaced with U+FFFD, warning issued)
✅ Mixed valid/invalid bytes (valid parts preserved, invalid replaced)
✅ Non-bytes input (returned as-is, no changes)

Impact

- Affected functions: `Seq2SeqTask.map()` and `ContrastiveTask.map()`, which use `_decode_bytes`
- Backward compatibility: Fully maintained, no breaking changes
- Performance: Negligible impact (only when invalid UTF-8 is encountered)

Fix _decode_bytes to handle invalid UTF-8 gracefully

f81b343
