Reddit scraper violates API Terms of Service — no OAuth, fake browser User-Agent #28

@tg12

Summary

fetch_reddit_user_about and fetch_reddit_recent_comments in specific_scrapers.py access Reddit's JSON API using a fake browser User-Agent without OAuth authentication. This violates Reddit's API Terms of Service, which require all third-party applications to use OAuth 2.0 and a registered app client ID for any programmatic API access.

Evidence

File: src/adapters/specific_scrapers.py

async def fetch_reddit_user_about(*, username: str, settings: AppSettings | None = None) -> dict[str, Any] | None:
    ...
    headers = {
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (compatible; OSINT-D2/1.0)",  # ← fake browser UA
    }
    async with build_async_client(settings, extra_headers=headers) as client:
        resp = await client.get(url)   # https://www.reddit.com/user/{username}/about.json

The about.json and comments.json endpoints are part of Reddit's public API surface, but since June 2023, Reddit requires all API access (including public endpoints) to go through OAuth with a registered app. Direct JSON endpoint access using browser-style headers is explicitly prohibited and is the scraping pattern that triggered Reddit's API pricing controversy.

Reddit's current Terms of Service state: "You may not use the Reddit Platform... in a way that does not comply with Reddit's API Terms." The API Terms require OAuth 2.0 for all access.

Scraping with a fake browser UA (Mozilla/5.0 (compatible; OSINT-D2/1.0)) also provides no honest identification of the client, which is specifically what Reddit requires via the User-Agent format <platform>:<app ID>:<version string> (by /u/<reddit username>).

Why this matters

  1. Legal risk: Reddit's ToS explicitly prohibit unauthenticated API access. Distributing a tool that violates these terms exposes the project and its users to cease-and-desist actions.
  2. Reliability: Reddit has actively blocked unauthenticated scraping since 2023. The endpoints can and do return 403, 429, or incorrect data at any time. The code checks for status_code != 200 but does not distinguish between "user doesn't exist" and "Reddit blocked the request" — both are silently treated as None.
  3. IP reputation: The tool uses residential proxies via ScrapingAnt; accessing Reddit this way may get proxy IP ranges flagged and could impact other ScrapingAnt users.
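
As an illustration of the reliability point, here is a minimal sketch of how the scraper could stop collapsing every non-200 response into None. The names RedditFetchOutcome and classify_reddit_status are hypothetical and do not exist in the repository, and the status-code grouping is an assumption about how Reddit signals blocking.

```python
from enum import Enum


class RedditFetchOutcome(Enum):
    """Hypothetical outcome categories for a Reddit profile lookup."""

    OK = "ok"                  # 200: a body was returned and still needs validation
    NOT_FOUND = "not_found"    # 404: the account most likely does not exist
    BLOCKED = "blocked"        # 401/403/429: Reddit refused or throttled the request
    UNEXPECTED = "unexpected"  # anything else: treat the result as inconclusive


def classify_reddit_status(status_code: int) -> RedditFetchOutcome:
    """Map an HTTP status code to an outcome instead of returning None for all failures."""
    if status_code == 200:
        return RedditFetchOutcome.OK
    if status_code == 404:
        return RedditFetchOutcome.NOT_FOUND
    if status_code in (401, 403, 429):
        return RedditFetchOutcome.BLOCKED
    return RedditFetchOutcome.UNEXPECTED
```

With something like this, the caller can report "no such user" only for NOT_FOUND and surface BLOCKED or UNEXPECTED to the operator as "could not determine".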

Attack or failure scenario

Reddit begins returning HTTP 200 responses with GDPR-compliant empty payloads for unauthenticated scrapers (a pattern they have used before). The fetch_reddit_user_about function receives a well-formed but empty response, returns None, and the operator concludes the target has no Reddit presence — a false negative in a privacy-sensitive investigation.
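
One way to guard against this scenario is to validate the payload before drawing a conclusion. The sketch below assumes the about response wraps the profile in a top-level "data" object with a "name" field; InconclusiveRedditResponse and parse_about_payload are hypothetical names, not existing project code.

```python
from typing import Any


class InconclusiveRedditResponse(Exception):
    """Raised when a 200 response carries no usable profile data."""


def parse_about_payload(payload: Any) -> dict[str, Any]:
    """Return the profile dict, or raise if the body is empty or malformed.

    Raising (rather than returning None) keeps "Reddit gave us nothing useful"
    distinct from "this user has no Reddit presence".
    """
    if not isinstance(payload, dict):
        raise InconclusiveRedditResponse("response body is not a JSON object")
    data = payload.get("data")
    if not isinstance(data, dict) or not data.get("name"):
        raise InconclusiveRedditResponse("response has no profile data")
    return data
```

The caller would catch InconclusiveRedditResponse and mark the Reddit result as unknown instead of negative.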

Root cause

The Reddit scrapers were implemented against the legacy API surface that existed before Reddit enforced OAuth. No OAuth flow was implemented when Reddit changed its policies in 2023.

Recommended fix

  1. Register a Reddit app and add OSINT_D2_REDDIT_CLIENT_ID and OSINT_D2_REDDIT_CLIENT_SECRET to AppSettings.
  2. Implement client credentials OAuth flow (grant_type=client_credentials) before making API calls.
  3. Use the Reddit API User-Agent format described above, for example: python:osint-d2:v0.1 (by /u/<maintainer_username>).
  4. If Reddit credentials are not configured, skip the Reddit scan and emit a warning rather than scraping silently.
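
Steps 3 and 4 above could look roughly like the sketch below. RedditCredentials, reddit_scan_enabled, and build_user_agent are hypothetical helpers introduced for illustration; in the real project the credentials would presumably live on AppSettings, whose actual shape is not shown in this issue.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("osint_d2.reddit")  # hypothetical logger name


@dataclass(frozen=True)
class RedditCredentials:
    """Stand-in for the two proposed settings (hypothetical container)."""

    client_id: Optional[str] = None
    client_secret: Optional[str] = None


def reddit_scan_enabled(creds: RedditCredentials) -> bool:
    """Return True only when both credentials are set; otherwise warn and skip."""
    if creds.client_id and creds.client_secret:
        return True
    logger.warning("Reddit credentials not configured; skipping Reddit scan.")
    return False


def build_user_agent(platform: str, app_id: str, version: str, reddit_username: str) -> str:
    """Format a User-Agent as <platform>:<app ID>:<version string> (by /u/<reddit username>)."""
    return f"{platform}:{app_id}:{version} (by /u/{reddit_username})"
```

The scanner would call reddit_scan_enabled first and return early when it is False, so no unauthenticated request is ever sent.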

The PRAW library handles this cleanly, or use httpx directly with the OAuth token flow:

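token_resp = await client.post(
    "https://www.reddit.com/api/v1/access_token",
    data={"grant_type": "client_credentials"},
    auth=(client_id, client_secret),  # HTTP Basic auth with the registered app's credentials
    headers={"User-Agent": "python:osint-d2:v0.1 (by /u/<maintainer_username>)"},
)
token_resp.raise_for_status()  # surface 401/403/429 instead of failing on a missing key
token = token_resp.json()["access_token"]
# Then send "Authorization: bearer {token}" on API calls to https://oauth.reddit.com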
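
For the follow-up call, authenticated requests go to the oauth.reddit.com host rather than www.reddit.com. The helper below is a hypothetical sketch (build_about_request is not an existing function) that only assembles the URL and headers, so it can be unit-tested without network access.

```python
from urllib.parse import quote


def build_about_request(username: str, token: str, user_agent: str) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for an authenticated profile lookup.

    The username is percent-encoded so it cannot alter the request path.
    """
    url = f"https://oauth.reddit.com/user/{quote(username, safe='')}/about"
    headers = {
        "Authorization": f"bearer {token}",
        "User-Agent": user_agent,
    }
    return url, headers
```

The result can be passed straight to the existing async client, for example `resp = await client.get(url, headers=headers)`.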

Acceptance criteria

  • Reddit API calls use OAuth 2.0 client credentials flow.
  • User-Agent follows Reddit's required format.
  • OSINT_D2_REDDIT_CLIENT_ID and OSINT_D2_REDDIT_CLIENT_SECRET are documented in .env.example.
  • If credentials are absent, RedditScanner skips gracefully with a warning.

Suggested labels

security, bug, technical-debt

Priority

P2

Severity

Medium — Active ToS violation affecting all users of the tool. Produces unreliable results post-2023 Reddit API changes. Legal risk is real but enforcement is typically directed at large-scale abusers.

Confidence

Confirmed — no OAuth token flow, no client ID, fake UA.
