Skip to content

Return 4xx for undecodable/corrupt images instead of 500#27

Open
JasonWildMe wants to merge 1 commit into
mainfrom
fix/reject-undecodable-image-4xx
Open

Return 4xx for undecodable/corrupt images instead of 500#27
JasonWildMe wants to merge 1 commit into
mainfrom
fix/reject-undecodable-image-4xx

Conversation

@JasonWildMe

Copy link
Copy Markdown

Summary

Inference endpoints returned HTTP 500 for a corrupt/undecodable input image, which made a permanent failure look transient to clients and could stall downstream pipelines.

Root cause: a corrupt image (e.g. a JPEG with a valid header but a broken entropy-coded scan stream — broken data stream when reading image file / Unsupported marker type 0xNN) fails during PIL decode inside model.predict/extract. The routers catch that with their generic except Exception500. Wildbook (the caller) classifies any 5xx as retryable and re-queues the job — so a permanently-bad image is retried indefinitely and the asset never reaches a terminal state (observed as a bulk import stuck at 99% detection).

Fix: reject an undecodable image as a 4xx (client error) so callers treat it as permanent / non-retryable.

  • app/utils/image_uri.py: add ImageDecodeError(ValueError) and validate_decodable(), which fully Image.open(...).load()s the bytes (a header-only verify() would miss broken scan streams) and raises ImageDecodeError on UnidentifiedImageError / OSError / DecompressionBombError.
  • predict, pipeline, extract, classify routers: call validate_decodable(image_bytes) right after resolve_image_uri(...), inside the existing except ValueError → HTTP 400 block. Since ImageDecodeError subclasses ValueError, an undecodable image now returns 400 instead of escaping to the generic 500.

Test Plan

  • tests/test_image_decode_validation.py (pure PIL, no GPU) — valid image passes; corrupt-marker, non-image, empty, truncated, and decompression-bomb inputs all raise ImageDecodeError; ImageDecodeError is a ValueError. 7/7 pass locally.
  • CI / model env: endpoint-level check that a corrupt image returns 400 (not 500) from /predict/, /pipeline/, /extract/, /classify/, and that inference is not invoked. (Router 400 contract verified by inspection; endpoint tests need the model/GPU environment.)

Notes

  • Trade-off: the validator decodes the image once and the model decodes again (~2× decode time, negligible vs GPU inference). Accepted for the robustness gain; a future optimization could share one decoded representation.
  • Status choice: reuses the routers' existing image-input convention (400). A dedicated 422 would be marginally more semantic, but any 4xx achieves the non-retryable classification.
  • Companion Wildbook-side change (WildMeOrg/Wildbook#1609) rejects corrupt bulk-import images before they reach ml-service; this PR is defense-in-depth for images that arrive via other paths or pass Wildbook's (Java ImageIO) decoder but fail ml-service's (PIL) decoder.

🤖 Generated with Claude Code

A corrupt image (e.g. a JPEG with a broken entropy-coded scan stream:
"broken data stream when reading image file" / "Unsupported marker type
0xNN") fails during PIL decode inside model.predict/extract, which the
inference routers catch with their generic `except Exception` and return as
HTTP 500. Consumers (Wildbook) classify 5xx as retryable and re-queue the
job, so a *permanent* bad-image failure is retried as if transient — the
work never reaches a terminal state and bulk imports can stall.

Reject undecodable images as a 4xx (client error) so callers treat them as
permanent and non-retryable:

- app/utils/image_uri.py: add ImageDecodeError(ValueError) and
  validate_decodable(), which fully Image.open(...).load()s the bytes (a
  header-only verify() would miss broken scan streams) and raises
  ImageDecodeError on UnidentifiedImageError / OSError / DecompressionBombError.
- predict, pipeline, extract, classify routers: call validate_decodable()
  right after resolve_image_uri(), inside the existing `except ValueError ->
  HTTP 400` block. ImageDecodeError subclasses ValueError, so an undecodable
  image now returns 400 instead of escaping to the generic 500.
- tests: pure-PIL unit tests for validate_decodable (valid passes; corrupt,
  non-image, empty, truncated, and decompression-bomb inputs raise).

Trade-off: the validator decodes the image once and the model decodes again
(~2x decode time, negligible vs GPU inference) — accepted for the robustness
gain. Router-level 400 contract verified by inspection; endpoint tests
require the model/GPU environment (CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant