Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
af9a1f7
Fixes #9
myz21 Jan 17, 2026
cf48adf
Fixes #9
myz21 Jan 29, 2026
8f7396a
Fixes #9
myz21 Mar 2, 2026
12950e3
refactor: use a constant for Selenium wait timeout
myz21 Mar 8, 2026
b27d3d6
feat: add Selenium timeouts and minimum text length thresholds
myz21 Mar 8, 2026
2d30df6
docs: enhance comments for privacy regex patterns and function return…
myz21 Mar 8, 2026
4b583bc
refactor: streamline link collection logic in resolve_privacy_url fun…
myz21 Mar 8, 2026
296519b
refactor: simplify common path verification logic in resolve_privacy_…
myz21 Mar 8, 2026
8b62b6d
refactor: remove obsolete resolve_privacy_url function for cleaner co…
myz21 Mar 8, 2026
0ad4f68
feat: add OPENAI_BASE_URL to .env.example for configuration clarity
myz21 Mar 8, 2026
40a423f
feat: add click library for command-line interface enhancements
myz21 Mar 8, 2026
e19ff9a
chore: resolve uv.lock merge conflict by regenerating
myz21 May 23, 2026
9504283
chore: add google-genai dependency and regenerate lockfile
myz21 May 23, 2026
2a69765
feat: add native google gemini api integration to analyze_chunk_json
myz21 May 23, 2026
6ed62df
docs: add real-world benchmark results for GitHub, TikTok, and Meta
myz21 May 23, 2026
0b8d9d2
chore: simplify .env.example template with basic key options
myz21 May 23, 2026
5af4c8c
docs: add architecture sequence diagram and update README to document…
myz21 May 23, 2026
952bc4d
docs: add simplified PlantUML workflow diagram and integrate into README
myz21 May 23, 2026
75aff70
refactor: apply modern typing, centralize constants, and improve docs…
myz21 May 23, 2026
e2d566f
refactor: improve candidate scoring, resolve debug print statements t…
myz21 May 23, 2026
51030ee
refactor: add detailed docstrings to chunk analysis functions
myz21 May 23, 2026
d26c2ab
refactor: eliminate lambda functions in sorting and enrich docstrings…
myz21 May 23, 2026
29723b8
refactor: add tenacity retry to LLM analysis functions for handling r…
myz21 May 23, 2026
de3e804
docs: add comprehensive test results for Facebook and TikTok privacy …
myz21 May 23, 2026
3c61266
docs: integrate interactive mermaid sequence diagram into README and …
myz21 May 23, 2026
b57ac92
docs: generate and embed premium PlantUML workflow flowchart in README
myz21 May 23, 2026
0e6fd37
docs: remove redundant mermaid sequence diagram and promote clean wor…
myz21 May 23, 2026
d4e6cca
con
myz21 May 25, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .env.example
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
OPENAI_API_KEY=
OPENAI_MODEL=
GEMINI_API_KEY=
LOG_LEVEL=INFO
12 changes: 10 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,13 @@ and **aggregate** category scores into an overall score with strengths, risks, r
> **Part of [Happy Hacking Space](https://github.com/HappyHackingSpace) - A community-driven
> organization focused on security, AI, and software development.**

## Architecture & How It Works

![Privacy Policy Analyzer System Workflow](docs/workflow.svg)

## Features

- **Multi-LLM Provider Engine**: Seamlessly switch between **Google Gemini** (using official `google-genai` SDK) and **OpenAI GPT** models based on model name prefix.
- **Auto-discovery**: Common paths → robots.txt/sitemaps → footer links.
- **HTTP-first extraction**: `trafilatura` (clean text) or `BeautifulSoup` fallback; **Selenium** for dynamic pages.
- **Structured scoring (JSON)**: Per-category (0–10) scores + rationales; aggregated to 0–100 overall in `scoring.py`.
Expand Down Expand Up @@ -48,6 +53,7 @@ privacy-policy-analyzer/
├── pyproject.toml # Project configuration
├── requirements.txt # Legacy requirements
├── .env.example # Environment template
├── RESULTS.md # Benchmarks & real-world scores (GitHub, TikTok, Meta)
├── .gitignore
├── LICENSE
└── README.md
Expand All @@ -56,7 +62,7 @@ privacy-policy-analyzer/
## Requirements

- Python **3.10.11 or higher**
- An **OpenAI API key**
- An **OpenAI API key** (for GPT models) and/or **Google Gemini API key** (for Gemini models)
- (Optional) **Chrome/Chromium** on the machine (Selenium fallback; driver auto-installs)

## Installation
Expand Down Expand Up @@ -100,8 +106,10 @@ Copy `.env.example` → `.env` and set your credentials:

```
OPENAI_API_KEY=sk-************************
# Optional (overrides default):
OPENAI_MODEL=gpt-4o

# Required if using Gemini models (e.g. gemini-2.5-flash)
GEMINI_API_KEY=AIzaSy*********************
```

## Usage
Expand Down
71 changes: 71 additions & 0 deletions RESULTS.MD
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Privacy Policy Analyzer - Test Results

This document contains the analysis results for **Facebook** and **TikTok** privacy policies using the newly refactored production-grade CLI tool.

---

## 1. Facebook Analysis Results

* **Target Website:** `https://facebook.com`
* **Resolved Privacy URL:** `https://www.facebook.com/privacy/policy/?entry_point=facebook_page_footer`
* **LLM Model Used:** `openai/gpt-oss-120b:free` (OpenRouter)
* **Total Chunks Analyzed:** 6 (5 valid chunks aggregated)

### Score & Overview
* **Overall Privacy Score:** `42.16 / 100` (Moderate risk profile)
* **Confidence Level:** `1.0`
* **Total Red Flags Identified:** `20`

### Top Strengths
1. **Lawful Basis & Purpose:** `5.8 / 10`
2. **Transparency & Notice:** `5.8 / 10`
3. **Collection & Minimization:** `5.2 / 10`

### Top Risks
1. **Retention & Deletion:** `2.8 / 10` (High Risk)
2. **User Rights & Redress:** `3.2 / 10` (High Risk)
3. **Sensitive Children, Ads & Profiling:** `3.2 / 10` (High Risk)

---

## 2. TikTok Analysis Results

* **Target Website:** `https://tiktok.com`
* **Resolved Privacy URL:** `https://www.tiktok.com/legal/privacy-policy-row`
* **LLM Model Used:** `openai/gpt-oss-120b:free` (OpenRouter)
* **Total Chunks Analyzed:** 4 (2 valid chunks aggregated)

### Score & Overview
* **Overall Privacy Score:** `49.9 / 100` (Moderate risk profile, slightly better than Facebook)
* **Confidence Level:** `1.0`
* **Total Red Flags Identified:** `9`

### Top Strengths
1. **Lawful Basis & Purpose:** `6.5 / 10`
2. **Third Parties & Processors:** `6.5 / 10`
3. **User Rights & Redress:** `6.0 / 10`

### Top Risks
1. **Retention & Deletion:** `3.0 / 10` (High Risk)
2. **Security & Breach:** `3.0 / 10` (High Risk)
3. **Cross Border Transfers:** `3.5 / 10` (Medium-High Risk)

---

## 3. Performance & System Benchmarks

| Metric | Facebook (`https://facebook.com`) | TikTok (`https://tiktok.com`) |
| :--- | :--- | :--- |
| **Discovery Time** | 1.60s | 5.11s (Discovered via Sitemap) |
| **Fetching Time** | 3.50s | 1.97s |
| **LLM Analysis Time** | 114.72s | 59.91s |
| **Total Duration** | 119.83s | 66.99s |
| **Average Per Chunk** | 19.12s | 14.98s |
| **Text Characters** | 48,763 chars | 31,001 chars |

---

## 4. Key Takeaways
1. **Robust Sitemap Discovery:** The new scoring and sitemap heuristics successfully located TikTok's exact privacy policy at `https://www.tiktok.com/legal/privacy-policy-row` under 5.11s.
2. **Parallel Execution & Rate Limits:** The multi-threaded pipeline analyzed multiple policy chunks simultaneously. The integrated `tenacity` exponential retry logic flawlessly handled rate limits on the free tier backend.
3. **Clean Output Architecture:** Standardized error outputs (`click.secho(..., err=True)`) kept stdout completely clean, allowing accurate piping of the output JSON string.
80 changes: 80 additions & 0 deletions RESULTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# Privacy Policy Analysis Results

This document records the empirical results of evaluating real-world privacy policies using the Privacy Policy Analyzer. Analyses were conducted using directly supported LLM providers on **Google Gemini** (`gemini-2.5-flash`).

---

## 📊 Summary of Analyzed Policies

| Platform | Domain | Resolved Policy URL | Overall Score | Red Flags | Confidence | Analysis Time |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: |
| **GitHub** | `github.com` | [GitHub Privacy Statement](https://docs.github.com/site-policy/privacy-policies/github-privacy-statement) | **44.40 / 100** | 8 | 100% | 23.66s |
| **TikTok** | `tiktok.com` | [TikTok Privacy Policy](https://www.tiktok.com/legal/privacy-policy) | **41.41 / 100** | 15 | 100% | 36.90s |
| **Facebook** | `facebook.com` | [Meta Privacy Policy](https://www.facebook.com/privacy/policy/) | **40.15 / 100** | 9 | 100% | 28.62s |

---

## 🔍 Detailed Breakdown per Platform

### 1. GitHub (Score: 44.40/100)

> [!NOTE]
> GitHub provides relatively transparent data sharing disclosures and clear lawful bases, but scores extremely poorly on retention timelines and cross-border safeguards in user-facing summaries.

* **Top Strengths:**
1. **Third Parties & Processors** (`8.5/10`) — Detailed processor disclosures and joint-controller roles are clearly demarcated.
2. **Transparency & Notice** (`8.0/10`) — Standard simplified language with comprehensive version notices and clear contacts.
3. **Lawful Basis & Purpose** (`7.5/10`) — Well-articulated data processing purposes and legitimate business interests.
* **Top Risks:**
1. **Retention & Deletion** (`0.0/10`) — Criticized for indefinite retention clauses (e.g., "retained as long as your account is active").
2. **Cross-Border Transfers** (`0.5/10`) — Extremely vague transfer mechanism safety disclosures in main text chunks.
3. **Security & Breach** (`1.0/10`) — Lack of concrete technical and organizational safety measures detailed in the text excerpt analyzed.

---

### 2. TikTok (Score: 41.41/100)

> [!WARNING]
> TikTok suffers from a massive volume of Red Flags (15 distinct flags detected). While transparent about *why* they collect data, their security standards and user-redress frameworks remain highly problematic under global scoring rubrics.

* **Top Strengths:**
1. **Lawful Basis & Purpose** (`7.0/10`) — Purposes of processing are detailed but heavily geared toward personalized content.
2. **Secondary Use & Limits** (`6.67/10`) — Stated boundaries on auxiliary features.
3. **Transparency & Notice** (`6.67/10`) — Highly structured layout despite massive length.
* **Top Risks:**
1. **Security & Breach** (`0.0/10`) — Analyzed chunks contained zero actionable breach details or concrete technical guardrails.
2. **Retention & Deletion** (`2.0/10`) — "Keep as long as necessary" phrasing used frequently without specific schedules.
3. **User Rights & Redress** (`2.67/10`) — Complex mechanisms to object or request erasure for global users outside EEA/California jurisdictions.

---

### 3. Facebook / Meta (Score: 40.15/100)

> [!CAUTION]
> Meta's privacy policy is highly structured and written in plain language (high transparency scoring), yet the underlying substance scores heavily against user protection—particularly on indefinite data retention and aggressive cross-border transfers.

* **Top Strengths:**
1. **Transparency & Notice** (`7.0/10`) — Excellent usage of plain-language headings, bullet points, and interactive cookie control guidance.
2. **Lawful Basis & Purpose** (`5.67/10`) — Categorized lists of purposes.
3. **Secondary Use & Limits** (`5.67/10`) — Stated boundaries on ad rendering vs. diagnostic telemetry.
* **Top Risks:**
1. **Retention & Deletion** (`0.67/10`) — Extremely low scoring due to massive loops of indefinite storage policies.
2. **Security & Breach** (`1.67/10`) — Highly generic security assertions with no robust technical benchmarks.
3. **Cross-Border Transfers** (`2.0/10`) — Aggressive globally dispersed storage notices with standard contractual clauses (SCC) obscured in deeply nested external links.

---

## 🚀 Execution Commands

To reproduce these exact results, configure your `.env` with a `GEMINI_API_KEY` and run:

```bash
# Analyze GitHub
uv run python src/main.py --url https://github.com --model gemini-2.5-flash --report summary --max-chunks 2

# Analyze TikTok
uv run python src/main.py --url https://tiktok.com --model gemini-2.5-flash --report summary --max-chunks 3

# Analyze Facebook
uv run python src/main.py --url https://facebook.com --model gemini-2.5-flash --report summary --max-chunks 3
```
1 change: 1 addition & 0 deletions docs/workflow.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 2 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ dependencies = [
"selenium>=4.35.0",
"chromedriver-autoinstaller>=0.6.4",
"requests>=2.32.5",
"click>=8.1.0",
"google-genai>=1.0.0",
]

[dependency-groups]
Expand Down
40 changes: 31 additions & 9 deletions src/analyzer/scoring.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,17 +19,39 @@


def _avg(nums: list[float]) -> float:
"""Calculate the average of a list of floats.

Args:
nums: A list of float numbers.

Returns:
The average value as a float, or 0.0 if the list is empty.
"""
return sum(nums) / len(nums) if nums else 0.0


def _get_score_descending(kv: tuple[str, float]) -> float:
"""Helper key function to sort categories by score in descending order."""
return -kv[1]


def _get_score_ascending(kv: tuple[str, float]) -> float:
"""Helper key function to sort categories by score in ascending order."""
return kv[1]


def aggregate_chunk_results(chunk_json_list: list[dict[str, Any]]) -> dict[str, Any]:
"""
Aggregates per-chunk JSON results into a weighted overall report.
Expects each item to contain:
- "scores": {category: int 0..10}
- "rationales": {category: str}
- optional "red_flags": [str]
- optional "notes": [str]
"""Aggregates per-chunk JSON results into a weighted overall report.

Args:
chunk_json_list: A list of chunk evaluation results from LLM. Each item must contain
"scores" (dict mapping category keys to int 0..10), "rationales" (dict mapping
category keys to string explanations), optional "red_flags" (list of strings),
and optional "notes" (list of strings).

Returns:
A aggregated report dictionary with overall score, confidence, category scores,
top strengths, top risks, red flags, and recommendations.
"""
per_cat: dict[str, list[float]] = {k: [] for k in _REQUIRED_KEYS}
rationales: dict[str, list[str]] = {k: [] for k in _REQUIRED_KEYS}
Expand Down Expand Up @@ -73,11 +95,11 @@ def aggregate_chunk_results(chunk_json_list: list[dict[str, Any]]) -> dict[str,

strengths: list[tuple[str, float]] = sorted(
((k, v["score"]) for k, v in category_scores.items()),
key=lambda kv: -kv[1],
key=_get_score_descending,
)[:3]
risks: list[tuple[str, float]] = sorted(
((k, v["score"]) for k, v in category_scores.items()),
key=lambda kv: kv[1],
key=_get_score_ascending,
)[:3]

return {
Expand Down
Loading
Loading