Summary
fetch_reddit_user_about and fetch_reddit_recent_comments in specific_scrapers.py access Reddit's JSON API using a fake browser User-Agent without OAuth authentication. This violates Reddit's API Terms of Service, which require all third-party applications to use OAuth 2.0 and a registered app client ID for any programmatic API access.
Evidence
File: src/adapters/specific_scrapers.py
async def fetch_reddit_user_about(*, username: str, settings: AppSettings | None = None) -> dict[str, Any] | None:
    ...
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (compatible; OSINT-D2/1.0)",  # ← fake browser UA
    }
    async with build_async_client(settings, extra_headers=headers) as client:
        resp = await client.get(url)  # https://www.reddit.com/user/{username}/about.json
The about.json and comments.json endpoints are part of Reddit's public API surface, but since June 2023 Reddit has required all API access, including access to public endpoints, to go through OAuth with a registered app. Direct access to the JSON endpoints with browser-style headers bypasses that requirement and is prohibited under those terms; it is the kind of unauthenticated scraping that Reddit's 2023 API policy changes were aimed at.
Reddit's current Terms of Service state: "You may not use the Reddit Platform... in a way that does not comply with Reddit's API Terms." The API Terms require OAuth 2.0 for all access.
Scraping with a fake browser UA (Mozilla/5.0 (compatible; OSINT-D2/1.0)) also provides no honest identification of the client, which is specifically what Reddit requires via the User-Agent format <platform>:<app ID>:<version string> (by /u/<reddit username>).
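As an illustration of the format above, the following is a minimal sketch of a helper that assembles such a User-Agent string. The function name build_reddit_user_agent and the example values are hypothetical and do not exist in the codebase; the maintainer account name is a placeholder.

```python
def build_reddit_user_agent(
    platform: str, app_id: str, version: str, reddit_username: str
) -> str:
    """Format a User-Agent as <platform>:<app ID>:<version string> (by /u/<reddit username>)."""
    fields = {
        "platform": platform,
        "app_id": app_id,
        "version": version,
        "reddit_username": reddit_username,
    }
    # Reject blank components so the client never sends a half-filled identifier.
    for label, value in fields.items():
        if not value or not value.strip():
            raise ValueError(f"{label} must be a non-empty string")
    return f"{platform}:{app_id}:{version} (by /u/{reddit_username})"


# Hypothetical values for illustration only.
user_agent = build_reddit_user_agent("python", "osint-d2", "v0.1", "example_maintainer")
print(user_agent)
```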
Why this matters
- Legal risk: Reddit's ToS explicitly prohibit unauthenticated API access. Distributing a tool that violates these terms exposes the project and its users to cease-and-desist actions.
- Reliability: Reddit has actively blocked unauthenticated scraping since 2023. The endpoints can and do return 403, 429, or incorrect data at any time. The code checks for status_code != 200 but does not distinguish between "user doesn't exist" and "Reddit blocked the request"; both are silently treated as None.
- IP reputation: The tool uses residential proxies via ScrapingAnt; accessing Reddit this way may get proxy IP ranges flagged and could impact other ScrapingAnt users.
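One way to address the reliability point is to classify the HTTP status instead of collapsing every non-200 response into None. The sketch below is illustrative only: RedditLookupOutcome and classify_reddit_status are hypothetical names, and the mapping assumes that a missing user yields 404 while blocking shows up as 401/403 and throttling as 429.

```python
from enum import Enum


class RedditLookupOutcome(Enum):
    FOUND = "found"
    NOT_FOUND = "not_found"        # the account does not exist
    BLOCKED = "blocked"            # Reddit refused the request
    RATE_LIMITED = "rate_limited"  # too many requests
    ERROR = "error"                # anything else (5xx, unexpected codes)


def classify_reddit_status(status_code: int) -> RedditLookupOutcome:
    """Map an HTTP status code to an outcome the caller can report separately."""
    if status_code == 200:
        return RedditLookupOutcome.FOUND
    if status_code == 404:
        return RedditLookupOutcome.NOT_FOUND
    if status_code in (401, 403):
        return RedditLookupOutcome.BLOCKED
    if status_code == 429:
        return RedditLookupOutcome.RATE_LIMITED
    return RedditLookupOutcome.ERROR
```

With this split, only NOT_FOUND would justify reporting "no Reddit presence"; BLOCKED, RATE_LIMITED, and ERROR would be surfaced to the operator as an inconclusive scan.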
Attack or failure scenario
Reddit begins returning HTTP 200 responses with GDPR-compliant empty payloads for unauthenticated scrapers (a pattern they have used before). The fetch_reddit_user_about function receives a well-formed but empty response, returns None, and the operator concludes the target has no Reddit presence — a false negative in a privacy-sensitive investigation.
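A guard against this scenario could validate the body of a 200 response before trusting it. The sketch below assumes the about.json payload has the shape {"kind": ..., "data": {"name": ...}}; that shape, the exception UnexpectedRedditPayload, and the function extract_reddit_profile are assumptions for illustration, not code from the project.

```python
from typing import Any


class UnexpectedRedditPayload(Exception):
    """Raised when a 200 response does not contain a usable profile."""


def extract_reddit_profile(payload: Any) -> dict[str, Any]:
    """Return the profile dict, or raise if the body is empty or malformed."""
    if not isinstance(payload, dict):
        raise UnexpectedRedditPayload("response body is not a JSON object")
    data = payload.get("data")
    # An empty or missing "data" object is treated as inconclusive, not as "no user".
    if not isinstance(data, dict) or not data.get("name"):
        raise UnexpectedRedditPayload("response has no profile data")
    return data
```

Raising here, rather than returning None, keeps an empty-but-valid response from being read as evidence that the account does not exist.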
Root cause
The Reddit scrapers were implemented against the legacy API surface that existed before Reddit enforced OAuth. No OAuth flow was implemented when Reddit changed its policies in 2023.
Recommended fix
- Register a Reddit app and add OSINT_D2_REDDIT_CLIENT_ID and OSINT_D2_REDDIT_CLIENT_SECRET to AppSettings.
- Implement the OAuth client credentials flow (grant_type=client_credentials) before making API calls.
- Use the proper Reddit API User-Agent format, <platform>:<app ID>:<version string> (by /u/<reddit username>), for example python:osint-d2:v0.1 (by /u/<maintainer_username>).
- If Reddit credentials are not configured, skip the Reddit scan and emit a warning rather than scraping silently.
The PRAW library handles this cleanly, or use httpx directly with the OAuth token flow:
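token_resp = await client.post(
    "https://www.reddit.com/api/v1/access_token",
    data={"grant_type": "client_credentials"},
    auth=(client_id, client_secret),  # HTTP Basic auth with the app credentials
    # User-Agent in Reddit's required format; the username is a placeholder.
    headers={"User-Agent": "python:osint-d2:v0.1 (by /u/<maintainer_username>)"},
)
token_resp.raise_for_status()  # fail loudly on 401/429 instead of parsing an error body
token = token_resp.json()["access_token"]
# Then send "Authorization: bearer {token}" on API calls to https://oauth.reddit.com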
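The "skip and warn" recommendation and the bearer header can be sketched as two small helpers. This is a minimal illustration using only the standard library; reddit_credentials, authorized_headers, and the logger name are hypothetical, and reading from a plain mapping stands in for however AppSettings would actually expose the two values.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger("osint_d2.reddit")  # hypothetical logger name


def reddit_credentials(env: Mapping[str, str]) -> Optional[tuple[str, str]]:
    """Return (client_id, client_secret), or None with a warning if either is missing."""
    client_id = env.get("OSINT_D2_REDDIT_CLIENT_ID", "").strip()
    client_secret = env.get("OSINT_D2_REDDIT_CLIENT_SECRET", "").strip()
    if not client_id or not client_secret:
        logger.warning("Reddit credentials not configured; skipping Reddit scan.")
        return None
    return client_id, client_secret


def authorized_headers(token: str, user_agent: str) -> dict[str, str]:
    """Build the headers for an authenticated Reddit API call."""
    return {
        "Authorization": f"bearer {token}",
        "User-Agent": user_agent,
        "Accept": "application/json",
    }
```

A caller would check reddit_credentials(os.environ) first and return early on None, so the scan is skipped visibly rather than falling back to unauthenticated scraping.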
Acceptance criteria
- Reddit API calls use OAuth 2.0 client credentials flow.
- User-Agent follows Reddit's required format.
- OSINT_D2_REDDIT_CLIENT_ID and OSINT_D2_REDDIT_CLIENT_SECRET are documented in .env.example.
- If credentials are absent, RedditScanner skips gracefully with a warning.
Suggested labels
security, bug, technical-debt
Priority
P2
Severity
Medium — Active ToS violation affecting all users of the tool. Produces unreliable results post-2023 Reddit API changes. Legal risk is real but enforcement is typically directed at large-scale abusers.
Confidence
Confirmed — no OAuth token flow, no client ID, fake UA.