[INBT-135] Fix: Sanitize source_id to prevent URL-as-filename errors #1

Merged
captainyugi00 merged 2 commits into main from fix/INBT-135-sanitize-source-id
Mar 9, 2026

@captainyugi00
Contributor

Summary

Fix a critical bug where scrapers and downloaders fell back to the raw URL as source_id when ID extraction failed, producing invalid filesystem paths and S3 keys. Adds a Pydantic-level safety net that hashes any unsafe value into a deterministic 16-char hex string.

Linear Issue

INBT-135

Changes

  • Add _sanitize_source_id field validator on ContentMetadata.source_id — clean IDs pass through, anything unsafe gets SHA-256 hashed
  • Fix URL fallback in all 6 affected locations (ssstik, snapinsta, sssinstagram, apify x3, instagram_downloader)
  • Add _last_path_segment() helper in apify_downloader for DRY extraction
  • Add 10 unit tests for sanitization logic (test_source_id_sanitization.py)
  • Update existing ssstik test assertion for fixed behavior
  • Add GitHub Actions CI workflow (lint + unit tests + typecheck)
  • Fix ruff config compatibility (remove unsupported ASYNC240 rule)
  • Update README source_id field description
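To illustrate the kind of extraction the `_last_path_segment()` bullet refers to, here is a minimal sketch. The PR text does not show the helper's real signature or behavior, so the name `last_path_segment_sketch`, the `Optional[str]` return type, and the choice to return `None` for an empty path are all assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit


def last_path_segment_sketch(url: str) -> Optional[str]:
    """Return the last non-empty path segment of url, or None if the path is empty.

    Hypothetical stand-in for the PR's _last_path_segment() helper;
    the actual implementation is not shown in the PR description.
    """
    # urlsplit separates the path from the query string and fragment,
    # so "?x=1" or "#top" never ends up inside the extracted ID.
    segments = [part for part in urlsplit(url).path.split("/") if part]
    return segments[-1] if segments else None
```

Returning `None` instead of the original URL leaves the caller to decide on a fallback, which avoids reintroducing the raw-URL-as-ID behavior this PR removes.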

Testing

  • All 166 unit tests pass
  • Ruff lint passes clean
  • Ruff format passes clean
  • Tested with the exact URL from the bug report (https://vm.tiktok.com/ZNRufq2ex/)

Notes

The validator uses a simple allow-list regex (^[A-Za-z0-9_\-]{1,200}$). If the value matches, it passes through unchanged. Otherwise it is replaced with the first 16 hex characters of its SHA-256 digest (sha256(value)[:16]), which is deterministic and always matches the allow-list itself.
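The rule described above can be sketched as a plain function. This is not the PR's code: the name `sanitize_source_id_sketch` is hypothetical, and the Pydantic wiring on `ContentMetadata.source_id` is left out to keep the example dependency-free. The sketch also uses `re.fullmatch` rather than `^...$`, because in Python `$` also matches just before a trailing newline; whether the merged validator handles that case is not stated in the PR.

```python
import hashlib
import re

# Allow-list from the PR notes: letters, digits, underscore, hyphen; 1 to 200 chars.
# Anchors are dropped because fullmatch() requires the whole string to match.
_SAFE_SOURCE_ID = re.compile(r"[A-Za-z0-9_\-]{1,200}")


def sanitize_source_id_sketch(value: str) -> str:
    """Return value unchanged if it is on the allow-list, else a 16-char hash.

    Hypothetical illustration of the validator's rule, not the PR's code.
    """
    if _SAFE_SOURCE_ID.fullmatch(value):
        return value
    # Unsafe input (URLs, slashes, empty strings, over-long IDs) is replaced
    # by the first 16 hex characters of its SHA-256 digest.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
```

Because the 16-character hex output itself satisfies the allow-list, applying the function twice gives the same result as applying it once. Truncating to 16 hex characters keeps 64 bits of the digest, so collisions between distinct inputs are possible in principle but very unlikely at typical dataset sizes.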

captainyugi00 and others added 2 commits March 9, 2026 20:46
…BT-135]

Add a Pydantic field validator on ContentMetadata.source_id that hashes
any unsafe value (URLs, special chars, path traversals) into a 16-char
SHA-256 hex string while passing clean alphanumeric IDs through unchanged.

Fix all 6 scrapers/downloaders that fell back to the raw URL when regex
extraction failed, and add 10 unit tests for the sanitization logic.

Also add CI workflow (lint + test + typecheck) and fix ruff config
compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pin ruff>=0.9.0 for ASYNC240 support, add mypy ignore_missing_imports
for untyped third-party libs (aioboto3, noble_tls, instagrapi, yt_dlp,
bs4), fix type annotations to pass mypy strict mode, and apply ruff
format across all files for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@captainyugi00 captainyugi00 merged commit 7e34728 into main Mar 9, 2026
10 checks passed