Skip to content

fix(gitea): tolerate binary file payloads (#2380)#2440

Open
raywcm wants to merge 2 commits into
The-PR-Agent:mainfrom
raywcm:fix/gitea-binary-diff-decode
Open

fix(gitea): tolerate binary file payloads (#2380)#2440
raywcm wants to merge 2 commits into
The-PR-Agent:mainfrom
raywcm:fix/gitea-binary-diff-decode

Conversation

@raywcm

@raywcm raywcm commented Jun 9, 2026

Copy link
Copy Markdown

Summary

Fixes #2380.

Gitea PRs that contain binary media changes can currently crash the provider when raw bytes are decoded as UTF-8. This affects both raw file fetches and .diff retrieval before ignored binary extensions can be filtered out cleanly.

This PR makes the Gitea provider tolerate those binary payloads:

  • raw file content with invalid UTF-8 is skipped and treated as empty content
  • PR diff payloads are decoded with replacement so the provider can still parse the textual diff header/context instead of crashing
  • regression tests cover both paths

Testing

Ran focused local verification against the changed RepoApi methods with invalid UTF-8 bytes:

diff decode ok: 'diff --git a/old.png b/new.webp\n+�binary'
LOG_WARNING Skipping non-UTF-8 file content for asset.webp: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
file content decode ok: ''

Also ran syntax checks:

python3 -m py_compile pr_agent/git_providers/gitea_provider.py tests/unittest/test_gitea_provider.py

The full pytest test module could not be run in this minimal environment without installing PR-Agent's full pinned dependency set; the added tests follow the existing RepoApi mock style in tests/unittest/test_gitea_provider.py.

@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

Context used

Grey Divider


Remediation recommended

1. Non-UTF8 text lost 🐞 Bug ≡ Correctness
Description
RepoApi.get_file_content now returns "" on UnicodeDecodeError, so valid non-UTF-8 text files (e.g.,
UTF-16) are indistinguishable from empty files. This causes downstream patch/context expansion to
miss real content for those files, reducing review/diff accuracy in that case.
Code

pr_agent/git_providers/gitea_provider.py[R949-953]

+        except UnicodeDecodeError as e:
+            self.logger.warning(
+                f"Skipping non-UTF-8 file content for {filepath}: {e}"
+            )
+            return ""
Evidence
get_file_content() still decodes strictly as UTF-8 and the new handler returns an empty string on
decode failure. Patch processing already has a fallback decoder for byte inputs and returns early
when content is empty, so returning "" here prevents using those fallbacks and makes non-UTF-8 text
effectively disappear.

pr_agent/git_providers/gitea_provider.py[917-956]
pr_agent/algo/git_patch_processing.py[16-28]
pr_agent/algo/git_patch_processing.py[39-51]
tests/unittest/test_gitea_provider.py[108-135]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`RepoApi.get_file_content()` decodes raw bytes strictly as UTF-8 and returns an empty string on `UnicodeDecodeError`. This fixes binary crashes, but it also drops legitimate non-UTF-8 *text* files (e.g., UTF-16), making them appear empty downstream and preventing patch/context logic from using the actual content.

## Issue Context
The repo already has `decode_if_bytes()` logic with fallback encodings (including `utf-16`) used by patch processing, but it only helps if the provider passes bytes (or performs similar fallback decoding itself). Returning `""` on decode failure bypasses these fallbacks entirely.

## Fix Focus Areas
- pr_agent/git_providers/gitea_provider.py[917-956]
- pr_agent/algo/git_patch_processing.py[16-28]
- pr_agent/algo/git_patch_processing.py[39-51]

### Concrete fix options
1) On `UnicodeDecodeError`, return `raw_data` (bytes) instead of `""`, and adjust type hints / downstream callers accordingly (they already have `decode_if_bytes()` for bytes).
2) Alternatively, attempt a small, safe set of fallback decodes (mirroring `decode_if_bytes()`), and only return `""` if all fail.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

@github-actions github-actions Bot added the bug label Jun 9, 2026
@qodo-free-for-open-source-projects

Copy link
Copy Markdown
Contributor

PR Summary by Qodo

Gitea provider: tolerate non-UTF-8 binary payloads in diffs and raw files
🐞 Bug fix 🧪 Tests 🕐 10-20 Minutes

Grey Divider

Walkthroughs

User Description

Summary

Fixes #2380.

Gitea PRs that contain binary media changes can currently crash the provider when raw bytes are decoded as UTF-8. This affects both raw file fetches and .diff retrieval before ignored binary extensions can be filtered out cleanly.

This PR makes the Gitea provider tolerate those binary payloads:

  • raw file content with invalid UTF-8 is skipped and treated as empty content
  • PR diff payloads are decoded with replacement so the provider can still parse the textual diff header/context instead of crashing
  • regression tests cover both paths

Testing

Ran focused local verification against the changed RepoApi methods with invalid UTF-8 bytes:

diff decode ok: 'diff --git a/old.png b/new.webp\n+�binary'
LOG_WARNING Skipping non-UTF-8 file content for asset.webp: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
file content decode ok: ''

Also ran syntax checks:

python3 -m py_compile pr_agent/git_providers/gitea_provider.py tests/unittest/test_gitea_provider.py

The full pytest test module could not be run in this minimal environment without installing PR-Agent's full pinned dependency set; the added tests follow the existing RepoApi mock style in tests/unittest/test_gitea_provider.py.

AI Description
• Decode Gitea PR .diff payloads with replacement to avoid UTF-8 decode crashes.
• Skip non-UTF-8 raw file contents and return empty content with a warning.
• Add regression tests covering both binary diff and binary raw file paths.
Diagram
graph TD
  A["PR-Agent (Gitea)" ] --> B["RepoApi" ] --> C{{"Gitea HTTP API"}} --> D[("Bytes payload")]
  D --> E["Diff decode" ] --> F["Text diff parse" ]
  D --> G["File decode" ] --> H["Empty content" ]

  subgraph Legend
    direction LR
    _svc["Component" ] ~~~ _ext{{"External"}} ~~~ _db[("Data")]
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Detect binary via Content-Type/headers before decoding
  • ➕ Avoids introducing replacement characters into diffs
  • ➕ Can fast-path known binary payloads without attempting decode
  • ➖ Gitea responses may not reliably provide accurate content-type for all endpoints
  • ➖ Still requires a fallback when headers are missing/incorrect
2. Decode diffs as latin-1 (lossless byte mapping)
  • ➕ Never fails decoding; preserves 1:1 byte values
  • ➕ Avoids U+FFFD replacement characters in output
  • ➖ Downstream diff parsing may assume UTF-8 semantics; non-UTF-8 text could be misinterpreted
  • ➖ Can silently produce misleading characters for truly UTF-8 content if misapplied globally
3. Pre-scan bytes and strip/skip binary hunks
  • ➕ Keeps diff text cleaner by removing binary noise
  • ➕ Could reduce parsing overhead for large binary sections
  • ➖ More complex and error-prone; may miss edge cases in diff formats
  • ➖ Harder to maintain than using standard decode error handling

Recommendation: The PR’s approach is the best tradeoff: it preserves provider stability with minimal, targeted changes. Using errors='replace' for .diff keeps the textual headers/context parseable, while skipping non-UTF-8 raw file content prevents crashes on binary assets. Header-based binary detection or binary-hunk filtering could be considered later if replacement characters materially affect downstream behavior, but they add complexity and rely on less reliable signals.

Grey Divider

File Changes

Bug fix (1)
gitea_provider.py Harden diff/file decoding against binary (non-UTF-8) payloads +7/-2

Harden diff/file decoding against binary (non-UTF-8) payloads

• Decodes PR diff bytes using UTF-8 with replacement to avoid UnicodeDecodeError. Adds a UnicodeDecodeError handler for raw file content fetches that logs a warning and returns an empty string when content is not valid UTF-8.

pr_agent/git_providers/gitea_provider.py


Tests (1)
test_gitea_provider.py Add regression tests for binary diff and binary raw file handling +62/-0

Add regression tests for binary diff and binary raw file handling

• Adds unit tests that simulate invalid UTF-8 bytes returned from Gitea for raw file content and for PR .diff retrieval. Verifies the provider returns empty content for files and successfully returns a diff string containing the replacement character.

tests/unittest/test_gitea_provider.py


Grey Divider

Qodo Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] UnicodeDecodeError in gitea_provider.py when parsing binary files before extension filtering

1 participant