Open
Description
The _decode_bytes helper function in gemma/gm/data/_tasks.py crashes with UnicodeDecodeError when encountering invalid UTF-8 byte sequences. This prevents data processing pipelines from handling datasets (especially TFDS datasets) that may contain corrupted or non-UTF-8 bytes.
Current Behavior
The function attempts to decode bytes as UTF-8 without error handling:
def _decode_bytes(element):
    if isinstance(element, bytes):
        return element.decode("utf-8")  # Crashes on invalid UTF-8
    else:
        return element

When invalid UTF-8 sequences are encountered (e.g., bytes([0xFF, 0xFE, 0xFD])), the function raises UnicodeDecodeError, causing the entire data processing pipeline to crash.
Expected Behavior
The function should handle invalid UTF-8 sequences gracefully by:
- Replacing invalid bytes with the Unicode replacement character (U+FFFD)
- Issuing a warning to inform users about data quality issues
- Allowing the data processing pipeline to continue
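The three points above could be sketched roughly as follows. This is only an illustration of the requested behavior, not the project's actual fix: the name decode_bytes_lenient, the warning category, and the warning text are assumptions made for this example.

```python
import warnings


def decode_bytes_lenient(element):
    """Decode bytes as UTF-8, replacing invalid sequences instead of raising.

    Hypothetical stand-in for _decode_bytes; the name is an assumption.
    """
    if not isinstance(element, bytes):
        return element  # Non-bytes values pass through unchanged.
    try:
        return element.decode("utf-8")  # Fast path: input is valid UTF-8.
    except UnicodeDecodeError as e:
        # Tell the user where the first bad byte is, then keep going.
        warnings.warn(
            f"Invalid UTF-8 at byte offset {e.start} ({e.reason}); "
            "replacing invalid bytes with U+FFFD.",
            UnicodeWarning,
            stacklevel=2,
        )
        # errors="replace" substitutes U+FFFD for each undecodable sequence.
        return element.decode("utf-8", errors="replace")
```

With this sketch, decode_bytes_lenient(bytes([0xFF, 0xFE, 0xFD])) returns a string of replacement characters and emits one warning, while valid input is decoded exactly as before. Warning once per bad element may be noisy on large datasets, so a real fix might prefer logging or an aggregated count.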
Impact
- Crashes entire data processing pipelines when datasets contain invalid UTF-8 bytes
- No graceful degradation - valid data cannot be processed if any invalid bytes exist
- Poor error messages: UnicodeDecodeError doesn't clearly indicate the issue is with data encoding
- Affects both Seq2SeqTask and ContrastiveTask, which use this helper function
Reproduction
from gemma.gm.data._tasks import _decode_bytes
# This crashes with UnicodeDecodeError
invalid_bytes = bytes([0xFF, 0xFE, 0xFD])
result = _decode_bytes(invalid_bytes)

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
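For anyone without gemma installed, the failure can also be reproduced standalone. The sketch below copies the function body quoted in this issue rather than importing it, and the summary variable is only there for illustration. It also shows that the exception carries the byte offset and reason, which could feed a clearer error message.

```python
def _decode_bytes(element):
    # Copied from the snippet quoted in this issue, not imported from gemma.
    if isinstance(element, bytes):
        return element.decode("utf-8")
    return element


try:
    _decode_bytes(bytes([0xFF, 0xFE, 0xFD]))
except UnicodeDecodeError as e:
    # e.start is the offset of the first undecodable byte; e.reason says why.
    summary = f"{e.encoding} decode failed at byte {e.start}: {e.reason}"
    print(summary)
```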
