[INBT-135] Fix: Sanitize source_id to prevent URL-as-filename errors by captainyugi00 · Pull Request #1 · Inoue-AI/Inoue-AI-Content-Downloader-SDK

captainyugi00 · 2026-03-09T19:47:26Z

Summary

Fix a critical bug where all scrapers and downloaders could use raw URLs as source_id, producing invalid filesystem paths and S3 keys. Adds a Pydantic-level safety net that hashes any unsafe value into a deterministic 16-char hex string.

Linear Issue

INBT-135

Changes

Add _sanitize_source_id field validator on ContentMetadata.source_id — clean IDs pass through, anything unsafe gets SHA-256 hashed
Fix URL fallback in all 6 affected locations (ssstik, snapinsta, sssinstagram, apify x3, instagram_downloader)
Add _last_path_segment() helper in apify_downloader for DRY extraction
Add 10 unit tests for sanitization logic (test_source_id_sanitization.py)
Update existing ssstik test assertion for fixed behavior
Add GitHub Actions CI workflow (lint + unit tests + typecheck)
Fix ruff config compatibility (remove unsupported ASYNC240 rule)
Update README source_id field description
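
For illustration, a helper like the _last_path_segment() mentioned in the list above could look roughly like the following. This is a minimal sketch under assumptions: the PR does not show the real helper's signature or behavior, so the function name last_path_segment, its return value for URLs with no path, and the use of urllib.parse are all hypothetical stand-ins.

```python
from urllib.parse import urlsplit


def last_path_segment(url: str) -> str:
    """Return the final non-empty path segment of url, or "" if there is none.

    Hypothetical stand-in for the PR's _last_path_segment(); the real
    helper's behavior is not shown in the PR description.
    """
    # urlsplit separates the path from the query string and fragment,
    # so "?x=1" or "#top" never leak into the extracted segment.
    path = urlsplit(url).path
    # Drop empty pieces produced by leading, trailing, or doubled slashes.
    segments = [piece for piece in path.split("/") if piece]
    return segments[-1] if segments else ""
```

Returning an empty string when no segment exists leaves the decision about a fallback to the caller, rather than handing back the raw URL.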

Testing

All 166 unit tests pass
Ruff lint passes clean
Ruff format passes clean
Tested with the exact URL from the bug report (https://vm.tiktok.com/ZNRufq2ex/)

Notes

The validator uses a simple allow-list regex (^[A-Za-z0-9_\-]{1,200}$). If the value matches, it passes through unchanged. Otherwise it is replaced with sha256(value)[:16], the first 16 hex characters of the SHA-256 digest: deterministic and always safe to use as a filename or S3 key.
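
The rule described in these notes can be sketched as a plain function. This is not the PR's code: the PR attaches the logic to ContentMetadata.source_id as a Pydantic field validator, while the sketch below uses only the standard library so it runs on its own. The function name sanitize_source_id, the UTF-8 encoding step, and the use of fullmatch are assumptions.

```python
import hashlib
import re

# Allow-list quoted in the notes: letters, digits, underscore, hyphen; 1-200 chars.
_SAFE_SOURCE_ID = re.compile(r"^[A-Za-z0-9_\-]{1,200}$")


def sanitize_source_id(value: str) -> str:
    """Return value unchanged if it is a safe ID, else a 16-char SHA-256 hex prefix.

    Hypothetical name; a sketch of the rule in the notes, not the PR's validator.
    """
    # fullmatch (an assumption here) rejects a trailing newline, which
    # re.match with "$" would otherwise let through.
    if _SAFE_SOURCE_ID.fullmatch(value):
        return value
    # Anything else (URLs, path traversals, empty strings) is hashed.
    # The same input always yields the same 16 hex characters.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
```

A 16-character hex prefix carries 64 bits of the digest, so distinct inputs can in principle collide, although that is very unlikely at the scale of a content downloader.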

…BT-135] Add a Pydantic field validator on ContentMetadata.source_id that hashes any unsafe value (URLs, special chars, path traversals) into a 16-char SHA-256 hex string while passing clean alphanumeric IDs through unchanged. Fix all 6 scrapers/downloaders that fell back to the raw URL when regex extraction failed, and add 10 unit tests for the sanitization logic. Also add CI workflow (lint + test + typecheck) and fix ruff config compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pin ruff>=0.9.0 for ASYNC240 support, add mypy ignore_missing_imports for untyped third-party libs (aioboto3, noble_tls, instagrapi, yt_dlp, bs4), fix type annotations to pass mypy strict mode, and apply ruff format across all files for consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

captainyugi00 and others added 2 commits March 9, 2026 20:46

captainyugi00 merged commit 7e34728 into main Mar 9, 2026
10 checks passed

captainyugi00 merged 2 commits into main from fix/INBT-135-sanitize-source-id
