Fix _decode_bytes to handle invalid UTF-8 gracefully #505
Solves #504
Solution: Fix `_decode_bytes` to handle invalid UTF-8 gracefully
Problem
The `_decode_bytes` function in `gemma/gm/data/_tasks.py` was crashing with `UnicodeDecodeError` when encountering invalid UTF-8 byte sequences, causing entire data processing pipelines to fail.
Solution
Added error handling to gracefully handle invalid UTF-8 sequences by:
- Catching `UnicodeDecodeError`
- Using the `errors='replace'` parameter to replace invalid bytes with the Unicode replacement character (U+FFFD)
- Issuing a `UnicodeWarning` to inform users about data quality issues
Implementation
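The approach listed above can be sketched roughly as follows. This is an illustration only, not the PR's actual diff: the function name `decode_bytes_sketch`, the warning message, and the pass-through behavior for non-bytes inputs are all assumptions made for the example.

```python
import warnings


def decode_bytes_sketch(element):
  """Hypothetical stand-in for _decode_bytes; decodes bytes as UTF-8."""
  # Assumption: values that are not bytes are returned unchanged.
  if not isinstance(element, bytes):
    return element
  try:
    # Fast path: strict decoding succeeds for well-formed UTF-8.
    return element.decode("utf-8")
  except UnicodeDecodeError:
    # Tell the user about the data quality issue instead of crashing.
    warnings.warn(
        "Invalid UTF-8 bytes were replaced with U+FFFD.",  # assumed wording
        UnicodeWarning,
        stacklevel=2,
    )
    # Fallback: each undecodable byte becomes the replacement character.
    return element.decode("utf-8", errors="replace")
```

Trying strict decoding first keeps the common case free of warnings; the fallback path only runs when the input is actually malformed.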
Changes Made
File: `gemma/gm/data/_tasks.py`
Added import:
Updated `_decode_bytes` function:
Design Decisions
Why
errors='replace'instead oferrors='ignore'?errors='replace'replaces invalid bytes with U+FFFD (), which:errors='ignore'would silently remove invalid bytes, which:Why issue warnings?
Mixed Valid/Invalid Bytes
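As a small illustration of how Python's built-in error handlers treat input that mixes valid and invalid bytes, consider the snippet below. The sample byte string is made up for the example and does not come from the PR.

```python
# Made-up sample: valid ASCII on both sides of one invalid byte (0xFF).
raw = b"hello \xff world"

# 'replace' keeps a visible U+FFFD marker where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' drops the bad byte silently, leaving no trace of the corruption.
ignored = raw.decode("utf-8", errors="ignore")

# The valid bytes survive in both cases; only the marker differs.
assert replaced == "hello \ufffd world"
assert ignored == "hello  world"
```

With `errors='replace'` the position of the corruption stays visible in the decoded string, whereas `errors='ignore'` yields text that looks clean but is one character shorter.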
Benefits
Testing
The fix handles:
Impact
`Seq2SeqTask.map()` and `ContrastiveTask.map()`, which use `_decode_bytes`