
Fixes #9 #10

Open
myz21 wants to merge 28 commits into HappyHackingSpace:main from myz21:main

@myz21 myz21 commented Jan 17, 2026

Changes Made to src/main.py

Old System

  • Tried common paths first (/privacy, /privacy-policy)
  • Found something? Return immediately
  • Result: Short URLs preferred → returns landing pages

New System

  1. New Functions:
  • _collect_link_candidates() - Collects all privacy links from the page
  • _score_candidate() - Scores each URL by priority and anchor text
  • _pick_best_verified_candidate() - Selects and verifies the best URL
  2. Regex Priority Ranking:
  • privacy-policy ← HIGHEST (detailed policy)
  • privacy/policy
  • privacy-policy-[a-z]+
  • ...
  • privacy ← LOWEST (hub page)
  • legal
  • terms
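
The ranking above can be sketched as an ordered list of compiled patterns, where the index of the first match is the priority. This is a minimal sketch under assumptions: `PRIORITY_PATTERNS`, `url_priority`, and `rank` are illustrative names, the list covers only some of the patterns named above (the elided ones are left out), and the real `_PRIVACY_REGEX_PATTERNS` in src/main.py may differ.

```python
import re

# Hypothetical subset of the priority list, most specific pattern first.
# The real _PRIVACY_REGEX_PATTERNS in src/main.py may differ.
PRIORITY_PATTERNS = [
    re.compile(r"privacy-policy", re.I),
    re.compile(r"privacy/policy", re.I),
    re.compile(r"privacy", re.I),
    re.compile(r"legal", re.I),
    re.compile(r"terms", re.I),
]
NO_MATCH = 999  # sentinel meaning "matches no pattern"


def url_priority(url: str) -> int:
    """Return the index of the first matching pattern; lower is better."""
    for idx, pattern in enumerate(PRIORITY_PATTERNS):
        if pattern.search(url):
            return idx
    return NO_MATCH


def rank(urls: list[str]) -> list[str]:
    """Sort URLs so the most policy-like ones come first (stable sort)."""
    return sorted(urls, key=url_priority)
```

With this ordering, a deep link such as /legal/page/row/privacy-policy/tr sorts ahead of the /privacy hub page, which is the behavior the description aims for.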

3-Tier Discovery Strategy:

  1. Link-based (real links on page)
  2. ↓ (if not found) Sitemap-based (from sitemap.xml)
  3. ↓ (if not found) Common Paths (/privacy, /privacy-policy, etc.)
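
The staged fallback can be pictured as a list of finder functions tried in order. This is a sketch rather than the PR's code: `resolve`, `Finder`, and the stand-in lambdas are hypothetical, and the second element of the return value here is the phase name, chosen for illustration.

```python
from typing import Callable, Optional

Finder = Callable[[str], Optional[str]]


def resolve(base_url: str, phases: list[tuple[str, Finder]]) -> tuple[Optional[str], Optional[str]]:
    """Run each discovery phase in order and stop at the first hit."""
    for name, finder in phases:
        found = finder(base_url)
        if found:
            return found, name
    return None, None


# Stand-in finders; real ones would fetch pages, read sitemaps, or probe paths.
phases = [
    ("links", lambda base: None),                            # nothing found on the page
    ("sitemap", lambda base: base + "/legal/privacy-policy"),
    ("common-paths", lambda base: base + "/privacy"),        # not reached in this run
]
result = resolve("https://example.com", phases)
```

Because each phase runs only when the previous one returns nothing, the cheaper and more reliable sources are consulted first and path guessing stays a last resort.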

Result

Finds the most detailed policy from real site links and skips landing pages, at the cost of a longer run time.

  • Old: /privacy (landing page)
  • New: /legal/page/row/privacy-policy/tr (full policy)

Summary by CodeRabbit

  • New Features

    • Multi-phase privacy URL discovery with prioritized patterns, sitemap checks, and staged fallbacks
    • New CLI with structured options (fetch method, chunk size, disable discovery) and JSON/status output
    • Environment-configurable AI endpoint (OPENAI_BASE_URL)
  • Improvements

    • More robust policy fetching and extraction with preferred main-content parsing, HTTP hardening and dynamic-page fallbacks
    • Parallelized chunk analysis, larger default chunks and higher token limits
    • Per-phase timing and stderr performance summary


coderabbitai Bot commented Jan 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Implements a multi‑phase privacy URL discovery and verification flow in src/main.py (link, sitemap, common paths) with regex‑prioritized matching, candidate scoring/verification, improved content extraction (prefer <main>), parallelized chunk analysis, OpenAI base URL support, and increased LLM token limits.

Changes

Policy Discovery & Analysis

| Layer / File(s) | Summary |
| --- | --- |
| Imports, constants, and regex catalog / src/main.py | Replaces argparse-era imports, adds click, updates typing, and introduces _PRIVACY_REGEX_PATTERNS and related constants. |
| HTTP helpers & probes / src/main.py | Adds _http_get, _fetch_text, and _head_ok with richer headers, shorter timeouts, and typed returns. |
| HTTP extraction (main-first) / src/main.py | Implements _extract_text_http that prefers <main> over <body> and enforces minimum-length validation. |
| Selenium dynamic extraction / src/main.py | Reworks Selenium fetching with eager load, reduced resource loading, waits for <body>, prefers <main>, and ensures teardown. |
| Light verification & robots parsing / src/main.py | Updates light verify to use _head_ok, parses robots.txt for sitemap links, and defaults to /sitemap.xml. |
| Sitemap fetching & parsing / src/main.py | Adds _fetch_sitemap_urls supporting gzipped payloads, sitemapindex/urlset parsing, recursion, and deduplication. |
| Prioritization & link discovery / src/main.py | Adds _get_url_priority and find_best_policy_url to rank anchor URLs by regex priorities. |
| Candidate collection, scoring & refinement / src/main.py | Introduces _collect_link_candidates, _score_candidate, _pick_best_verified_candidate, and _improve_candidate for hub link refinement and verification. |
| Three-phase resolver / src/main.py | Replaces resolver with link-based Phase 1, sitemap Phase 2, and common-path Phase 3; returns (resolved_url, source_origin). |
| Chunking & OpenAI analysis / src/main.py | Updates split_text_into_chunks typing/defaults, analyze_chunk_json uses OPENAI_BASE_URL, increases max_tokens to 2000, and returns `dict[str, Any] |
| Click CLI & parallel analysis / src/main.py | Replaces argparse entrypoint with Click main(...), adds options (--chunk-size, --fetch, --no-discover, etc.), runs parallel chunk analysis, aggregates reports, and prints per-phase timings. |
| Dependency & env / pyproject.toml, .env.example | Adds click>=8.1.0 to dependencies and OPENAI_BASE_URL to .env.example. |
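
The sitemap row mentions gzipped payloads, sitemapindex/urlset parsing, and deduplication. The sketch below shows one way those three pieces could fit together using only the standard library. `parse_sitemap` and `SAMPLE` are hypothetical, not the PR's `_fetch_sitemap_urls`, and fetching plus recursion into child sitemaps is left to the caller.

```python
import gzip
import xml.etree.ElementTree as ET

SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/privacy</loc></url>"
    b"<url><loc>https://example.com/privacy</loc></url>"
    b"<url><loc>https://example.com/terms</loc></url>"
    b"</urlset>"
)


def parse_sitemap(payload: bytes) -> tuple[str, list[str]]:
    """Return the root kind ("sitemapindex" or "urlset") and unique <loc> values."""
    if payload[:2] == b"\x1f\x8b":  # gzip magic bytes
        payload = gzip.decompress(payload)
    root = ET.fromstring(payload)
    kind = root.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
    locs: list[str] = []
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
            url = el.text.strip()
            if url not in locs:  # deduplicate while keeping document order
                locs.append(url)
    return kind, locs
```

A caller would typically recurse when the kind is "sitemapindex" (each `<loc>` is another sitemap) and treat the values as page URLs when it is "urlset".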

Sequence Diagram

sequenceDiagram
  participant Caller
  participant Resolver as resolve_privacy_url
  participant Fetcher as HTTP Fetcher
  participant LinkDiscovery as Link Discovery
  participant Sitemap as Robots/Sitemap
  participant Prober as Common Paths Prober
  participant Verifier as Scorer & Verifier
  participant Analyzer as Chunk Analyzer (OpenAI)

  Caller->>Resolver: resolve_privacy_url(input_url)
  Resolver->>Fetcher: fetch page & homepage
  Fetcher-->>LinkDiscovery: page HTML
  LinkDiscovery->>Verifier: candidate list + priorities
  Verifier-->>Resolver: verified best_url (if found)
  alt no verified link
    Resolver->>Sitemap: fetch robots -> sitemap URLs
    Sitemap->>Verifier: sitemap candidates
    Verifier-->>Resolver: verified sitemap_url (if found)
    alt none found
      Resolver->>Prober: probe common privacy paths
      Prober->>Verifier: probe results
      Verifier-->>Resolver: best_url or fallback
    end
  end
  Resolver->>Fetcher: fetch policy content (HTTP or Selenium)
  Fetcher-->>Resolver: policy text
  Resolver->>Analyzer: split -> parallel analyze chunks (OPENAI_BASE_URL)
  Analyzer-->>Resolver: chunk analysis results
  Resolver-->>Caller: return aggregated JSON

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

"I hopped through HTML, sitemap, and path,
Regex in paw, I chased the policy's tail,
Phases three and chunks that race,
Threads humming as tokens swell,
A happy rabbit found the privacy trail 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title 'Fixes #9' is too vague and does not convey the actual changes made in the pull request. | Use a descriptive title that highlights the main change, such as 'Improve privacy policy URL discovery with multi-phase resolver and link prioritization' or similar. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main.py (1)

369-411: Remove the duplicated resolve_privacy_url; the earlier version is dead code.
The second definition overwrites the first, so edits to the earlier version (including _improve_candidate usage) never take effect and are easy to miss. Please delete the first definition and, if you still want hub-page deepening, apply _improve_candidate in the active flow.

♻️ Suggested integration in the active resolver
-        if best_url:
-            return best_url, input_url
+        if best_url:
+            return _improve_candidate(best_url), input_url
...
-            return cand, input_url
+            return _improve_candidate(cand), input_url
...
-            return cand, input_url
+            return _improve_candidate(cand), input_url

Also applies to: 487-539

🤖 Fix all issues with AI agents
In `@src/main.py`:
- Around line 414-446: The current _collect_link_candidates filters out URLs
using _is_privacy_like which doesn’t catch "legal"/"terms" patterns from the
regex priority list; update _collect_link_candidates to also accept URLs that
match the module’s priority regex list (use the same regexes used elsewhere
instead of only _is_privacy_like) so links like /legal or /terms are included,
preserve the deduplication map behavior (candidates: Dict[str, str]) and ensure
you still keep the best anchor text per URL when inserting/updating candidates,
and stop when limit is reached.
- Around line 555-571: The code currently defaults OPENAI_BASE_URL to
"https://openrouter.ai/api/v1" which routes users to OpenRouter unexpectedly;
change os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1") to
os.getenv("OPENAI_BASE_URL") so the OpenAI SDK uses its native endpoint by
default (keep the existing OpenAI(client) usage in main.py/resp creation), and
then add a clear entry for OPENAI_BASE_URL with the OpenRouter value to
.env.example and a short note in README instructing users to set
OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly if they want OpenRouter.
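
The first item above asks for candidate collection that also accepts "legal"/"terms" links, deduplicates by URL, keeps the best anchor text, and stops at a limit. A minimal standard-library sketch of that behavior follows. `POLICY_LIKE`, `_AnchorParser`, and `collect_candidates` are hypothetical names, and "best anchor text" is assumed here to mean the longest one.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pattern; the review suggests reusing the module's priority regex list.
POLICY_LIKE = re.compile(r"privacy|legal|terms", re.I)


class _AnchorParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_candidates(html: str, base_url: str, limit: int = 100) -> dict[str, str]:
    """Map absolute URL -> anchor text, keeping at most `limit` distinct URLs."""
    parser = _AnchorParser()
    parser.feed(html)
    candidates: dict[str, str] = {}
    for href, text in parser.links:
        url = urljoin(base_url, href)
        if not (POLICY_LIKE.search(url) or POLICY_LIKE.search(text)):
            continue
        if url not in candidates and len(candidates) >= limit:
            break  # limit reached: stop collecting new URLs
        # Assumption: the longer anchor text is the more descriptive one.
        if len(text) > len(candidates.get(url, "")):
            candidates[url] = text
        else:
            candidates.setdefault(url, text)
    return candidates
```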

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main.py (1)

369-411: Critical: Duplicate function definition - this entire block is dead code.

resolve_privacy_url is defined twice: here (lines 369-411) and again at lines 500-552. Python uses the last definition, so this first implementation is completely shadowed and never executes.

Additionally, this first version calls _improve_candidate() on discovered URLs (lines 384, 396, 402), while the active second version at line 500 does not call _improve_candidate() anywhere. This means the "upgrade generic pages to deeper policy links" feature described in the PR objectives is not functioning.

Recommended action: Remove this dead code block (lines 369-411) and integrate _improve_candidate calls into the active function at line 500.

🛠️ Example fix for active function (lines 527-528)
         best_url = _pick_best_verified_candidate(candidates_list, max_verify=5)
         if best_url:
-            return best_url, input_url
+            return _improve_candidate(best_url), input_url
🤖 Fix all issues with AI agents
In `@src/main.py`:
- Line 692: The print statement formatting in the progress message (print(...)
in src/main.py) is missing a space after the bracket; update the string in the
print call that starts with "[3/3]Analyzing" to include a space so it reads
"[3/3] Analyzing {len(chunks)} chunks in parallel..." to fix the typo.
- Around line 489-496: The loop currently returns a URL immediately when
score[0] <= 1 (high-priority candidate) without checking if the link is live,
which can return broken URLs; change the logic in the for-loop that iterates
over scored[:max_verify] so that even when score[0] <= 1 you call
_light_verify(url) and only return url if that verification succeeds (print the
same debug message on success); if verification fails, continue to the next
candidate (and only return after a successful _light_verify or after exhausting
candidates), using the existing variables scored, max_verify, _light_verify,
candidates and the same debug message.
- Around line 198-206: The Selenium path currently accepts main text >100 chars
while the HTTP path enforces a ≥400 char minimum; make these thresholds
consistent by introducing a shared constant (e.g., CONTENT_MIN_LENGTH = 400) and
using it in both the HTTP extraction checks and the Selenium extraction block
(the driver.find_element(By.TAG_NAME, "main") / body.get_attribute("innerText")
logic), and ensure you use the same comparison operator (>=) in both places so
identical pages behave consistently regardless of extraction method.
🧹 Nitpick comments (3)
src/main.py (3)

139-143: Redundant redirect status check.

With allow_redirects=True, the response status code reflects the final destination, not intermediate redirects. The explicit check for 3xx codes (lines 141-142) will rarely match unless the server returns a redirect without a Location header or hits a redirect limit.

This isn't a bug, but the logic could be simplified.

♻️ Simplified version
-        if 200 <= r.status_code < 300:
-            return True
-        if r.status_code in (301, 302, 303, 307, 308):
-            return True
-        return False
+        return 200 <= r.status_code < 400
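
One detail worth checking before adopting the simplified form: the range test accepts every 3xx code, while the original accepted only five of them. The sketch below compares the two predicates on plain integers; the function names are illustrative and no HTTP request is made.

```python
def head_ok_explicit(status: int) -> bool:
    """Original form: any 2xx, plus an explicit list of redirect codes."""
    return 200 <= status < 300 or status in (301, 302, 303, 307, 308)


def head_ok_range(status: int) -> bool:
    """Simplified form: any 2xx or 3xx."""
    return 200 <= status < 400


# Status codes on which the two predicates disagree.
differs = [s for s in range(100, 600) if head_ok_explicit(s) != head_ok_range(s)]
```

The two forms agree on all 2xx, 4xx, and 5xx codes and differ only on the 3xx codes outside the explicit list (for example 300 and 304), so the simplification is slightly more permissive rather than strictly equivalent.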

354-364: Consider using logging instead of print for DEBUG statements.

Multiple print(f"DEBUG: ...") statements are scattered throughout (lines 354, 363, 494, 534, 543, 549, 648, 654, 692). These should use Python's logging module for proper log level control and configurability.
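
A minimal sketch of what that change could look like with the standard `logging` module. The logger name and `report_selection` helper are hypothetical; only the `logging` calls themselves are standard.

```python
import logging

logger = logging.getLogger("privacy_resolver")  # hypothetical logger name


def report_selection(url: str, total: int) -> None:
    """Emit the selection message at DEBUG level instead of printing it."""
    logger.debug("Selected URL %r from %d candidates", url, total)


if __name__ == "__main__":
    # Opt in to debug output; without this call the message is not shown.
    logging.basicConfig(level=logging.DEBUG)
    report_selection("https://example.com/privacy-policy", 3)
```

Passing the values as arguments rather than formatting an f-string defers string building until a handler actually emits the record.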


695-703: Move import to top of file and note that chunk order is not preserved.

  1. The ThreadPoolExecutor import at line 695 should be at the top of the file with other imports.
  2. Iterating over futures dict doesn't guarantee completion order matches submission order. If aggregate_chunk_results depends on chunk ordering, this could affect results. The current code assigns index (line 702) but the list order itself is completion-order.
♻️ Preserve chunk order by consuming futures in submission order
+from concurrent.futures import ThreadPoolExecutor
+
 # ... at top of file
 
-    from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
-        futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
-        for future in futures:
-            idx = futures[future]
-            res = future.result()
+        futures = [executor.submit(analyze_chunk_json, chunk, args.model) for chunk in chunks]
+        for idx, future in enumerate(futures, 1):
+            res = future.result()  # blocks in order
             if res:
                 res["index"] = idx
                 results.append(res)
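
The ordering question can be checked with a small experiment. The sketch below uses a hypothetical `work` function with artificial delays: iterating over the submitted futures (a dict preserves insertion order) and calling `result()` waits on them in submission order, whereas `as_completed` yields them as they finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(i: int, delay: float) -> int:
    """Sleep for `delay` seconds, then return the task index."""
    time.sleep(delay)
    return i


delays = [0.3, 0.0, 0.1]  # the first task is the slowest

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(work, i, d): i for i, d in enumerate(delays)}
    # Dicts keep insertion order, so this loop blocks on futures in submission order.
    in_submission_order = [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(work, i, d) for i, d in enumerate(delays)]
    # as_completed yields each future when it finishes, regardless of submission order.
    in_completion_order = [f.result() for f in as_completed(futures)]
```

So a results list built by looping over the futures container is already in chunk order; it only becomes completion-ordered if `as_completed` is used without sorting afterwards.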

myz21 commented Jan 29, 2026

I’ve streamlined the privacy policy discovery and analysis system. Here’s a rundown of the optimizations and the current performance status:

Key Improvements

  • Smarter Discovery: Built a 0-9 ranking system that evaluates URLs and anchor text to nail down the right policy links faster.
  • Parallel Processing: Switched to ThreadPoolExecutor to handle text chunks simultaneously, cutting down the old sequential wait times.
  • Browser Tuning: Optimized Selenium with an eager load strategy and disabled images to shave seconds off page loads.
  • Lightweight Checks: Used HEAD requests for link validation to avoid downloading unnecessary data.

Performance Bottlenecks

Testing on sites like TikTok and Google highlighted a few hurdles:

  • LLM Latency: Analysis takes up to 170 seconds on free-tier models—this is currently our biggest slowdown.
  • Rate Limits (429): Large documents (>10 chunks) are hitting API quota limits.
  • Bot Detection: Anti-bot measures on sites like Google can push Selenium fetch times up to 15 seconds.

Comment thread src/main.py Outdated


-def _http_get(url: str, timeout: int = 15) -> Optional[requests.Response]:
+def _http_get(url: str, timeout: int = 5) -> Optional[requests.Response]:
Contributor

`Optional` should not be used; on Python 3.9+ it is unnecessary, and `|` should be used instead.

Comment thread src/main.py Outdated


-def _fetch_text(url: str, timeout: int = 12) -> Optional[str]:
+def _fetch_text(url: str, timeout: int = 5) -> Optional[str]:
Contributor

Same here.

Comment thread src/main.py Outdated
 driver.set_page_load_timeout(10)
 driver.get(url)
-WebDriverWait(driver, 12).until(
+WebDriverWait(driver, 5).until(
Contributor

What does the number 5 mean? It should be explained; it needs to be a constant.

Comment thread src/main.py Outdated
try:
    content_element = driver.find_element(By.TAG_NAME, "main")
    text = content_element.get_attribute("innerText")
    if text and len(text.strip()) > 100:
Contributor

Why 100?

Comment thread src/main.py Outdated
for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
    if pattern.search(url):
        return idx
return 999
Contributor

What does this mean?

Comment thread src/main.py Outdated
# Collect links from input page and homepage
candidates_set: Dict[str, str] = {}  # url -> anchor_text

for page_url in [input_url, base]:
Contributor

You could write it like this:

for page_url in (input_url, base):
    if not (resp := _http_get(page_url)):
        continue

    for url, text in _collect_link_candidates(resp.text, resp.url, limit=100):
        candidates_set.setdefault(url, text)

Comment thread src/main.py Outdated

# === PHASE 3: Common paths (last resort) ===
path_heads: List[str] = []
for p in _COMMON_PATHS:
Contributor

You can shorten it:

for candidate_url in (base + path for path in _COMMON_PATHS):
    if _light_verify(candidate_url):
        print(f" Found via common path: {candidate_url}")
        return candidate_url, input_url

Comment thread src/main.py Outdated
@@ -351,17 +567,20 @@ def split_text_into_chunks(
def analyze_chunk_json(text_chunk: str, model: str) -> Optional[Dict[str, Any]]:
Contributor

typing issues

Comment thread src/main.py Outdated
def analyze_chunk_json(text_chunk: str, model: str) -> Optional[Dict[str, Any]]:
    """Analyze a text chunk with the LLM and return one JSON object."""
    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
Contributor

This should be added to `.env.template`.

Comment thread src/main.py Outdated

print(json.dumps(out, ensure_ascii=False, indent=2))

print("\n" + "="*40)
Contributor

The print calls here should be removed and click should be used instead.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
src/main.py (3)

156-171: ⚠️ Potential issue | 🟡 Minor

Use one shared minimum-content threshold for HTTP and Selenium paths.

HTTP extraction uses >= 400 while Selenium accepts > 100 at Line 201, causing inconsistent acceptance for the same page content.

♻️ Suggested fix
+CONTENT_MIN_LENGTH = 400
...
-                return t if len(t) >= 400 else None
+                return t if len(t) >= CONTENT_MIN_LENGTH else None
...
-    return t if len(t) >= 400 else None
+    return t if len(t) >= CONTENT_MIN_LENGTH else None
...
-            if text and len(text.strip()) > 100:
+            if text and len(text.strip()) >= CONTENT_MIN_LENGTH:
                 return text

Also applies to: 197-203

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 156 - 171, Define a single shared minimum content
constant (e.g., MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the
HTTP and Selenium extraction paths instead of hard-coded literals; replace the
HTTP check "return t if len(t) >= 400" and the Selenium path's "> 100" check
with "len(t) >= MIN_CONTENT_CHARS" so both branches (the code using _http_get
and the code that sets content_element/get_text into t) enforce the same
threshold and comparison semantics.

489-493: ⚠️ Potential issue | 🟠 Major

Do not bypass URL verification for top-ranked candidates.

At Lines 491-492, high-priority candidates are returned without _light_verify. This can emit broken URLs.

🔧 Suggested fix
-    for i, (url, text, score) in enumerate(scored[:max_verify]):
-        # if the score is already very good, skip verification
-        if score[0] <= 1: 
-            return url
+    for _, (url, text, score) in enumerate(scored[:max_verify]):
         if _light_verify(url):
             print(f"DEBUG: Selected URL '{url}' from {len(candidates)} candidates (score: {score})")
             return url
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 489 - 493, The loop over scored candidates
currently returns a top-ranked URL when score[0] <= 1 before calling
_light_verify, which bypasses verification; change the control flow in the loop
(the for i, (url, text, score) in enumerate(scored[:max_verify]) block) so that
you do not return immediately for high-scoring items—instead run
_light_verify(url) for every candidate (including those with score[0] <= 1) and
only return the url when _light_verify succeeds; if verification fails, continue
to the next candidate and only return after a successful verification (or
exhaust candidates).

570-575: ⚠️ Potential issue | 🟠 Major

Defaulting to OpenRouter should be explicit or documented.

At Line 570, defaulting OPENAI_BASE_URL to OpenRouter reroutes users unexpectedly when they only set OPENAI_API_KEY.

🔧 Suggested fix (SDK default unless explicitly configured)
-    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+    base_url = os.getenv("OPENAI_BASE_URL")
@@
-    client = OpenAI(api_key=api_key, base_url=base_url)
+    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
In openai Python SDK 1.102.0, what is the default API base URL when `OpenAI(api_key=...)` is created without `base_url`?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 570 - 575, The code defaults OPENAI_BASE_URL to
OpenRouter and always passes base_url into the OpenAI constructor (base_url
variable and client = OpenAI(api_key=api_key, base_url=base_url)), which
silently overrides the SDK default; change the logic so OPENAI_BASE_URL is
optional: read os.getenv("OPENAI_BASE_URL") without a default, and only pass
base_url into OpenAI(...) when that env var is present (or otherwise leave the
OpenAI(...) call as OpenAI(api_key=api_key)) so the SDK default base URL is
preserved unless explicitly configured; update any docstring or README to state
that OPENAI_BASE_URL must be set to override the SDK default.
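
One way to express "only pass base_url when it is configured" is to build the constructor arguments separately. This is a sketch: `client_kwargs` is a hypothetical helper, and the client construction itself is shown only as a comment because it needs the openai package.

```python
import os


def client_kwargs(env=os.environ) -> dict[str, str]:
    """Build keyword arguments for the client; omit base_url unless configured."""
    kwargs = {}
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    base_url = env.get("OPENAI_BASE_URL")
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs


# Intended use (requires the openai package):
#     client = OpenAI(**client_kwargs())
```

Users who want OpenRouter would then set OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly, and everyone else keeps the SDK default.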
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main.py`:
- Around line 576-585: The OpenAI per-chunk call (client.chat.completions.create
used to assign resp with build_user_prompt) has no error handling and futures
are consumed with an unguarded future.result(); wrap the API call inside a
try/except that catches network/HTTP/timeout/rate-limit exceptions and returns
None (or a sentinel) for failed chunks, and then guard future.result() where
futures are iterated so you skip/ignore None results or catch exceptions from
future.result() and continue processing remaining chunks; reference
client.chat.completions.create, build_user_prompt, resp, and future.result()
when applying the changes.
- Around line 369-411: Remove the obsolete duplicate implementation of
resolve_privacy_url (the one that uses find_best_policy_url and
_extract_text_http) so only the intended phase-based implementation remains (the
one that uses _collect_link_candidates and _pick_best_verified_candidate);
delete the first definition entirely to avoid shadowing and ensure imports/refs
still match the remaining resolve_privacy_url, then run tests/static checks to
verify no references break.

---

Duplicate comments:
In `@src/main.py`:
- Around line 156-171: Define a single shared minimum content constant (e.g.,
MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the HTTP and Selenium
extraction paths instead of hard-coded literals; replace the HTTP check "return
t if len(t) >= 400" and the Selenium path's "> 100" check with "len(t) >=
MIN_CONTENT_CHARS" so both branches (the code using _http_get and the code that
sets content_element/get_text into t) enforce the same threshold and comparison
semantics.
- Around line 489-493: The loop over scored candidates currently returns a
top-ranked URL when score[0] <= 1 before calling _light_verify, which bypasses
verification; change the control flow in the loop (the for i, (url, text, score)
in enumerate(scored[:max_verify]) block) so that you do not return immediately
for high-scoring items—instead run _light_verify(url) for every candidate
(including those with score[0] <= 1) and only return the url when _light_verify
succeeds; if verification fails, continue to the next candidate and only return
after a successful verification (or exhaust candidates).
- Around line 570-575: The code defaults OPENAI_BASE_URL to OpenRouter and
always passes base_url into the OpenAI constructor (base_url variable and client
= OpenAI(api_key=api_key, base_url=base_url)), which silently overrides the SDK
default; change the logic so OPENAI_BASE_URL is optional: read
os.getenv("OPENAI_BASE_URL") without a default, and only pass base_url into
OpenAI(...) when that env var is present (or otherwise leave the OpenAI(...)
call as OpenAI(api_key=api_key)) so the SDK default base URL is preserved unless
explicitly configured; update any docstring or README to state that
OPENAI_BASE_URL must be set to override the SDK default.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf48adf and 8f7396a.

📒 Files selected for processing (1)
  • src/main.py

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/main.py (1)

693-704: ⚠️ Potential issue | 🟡 Minor

Missing space in progress message and unguarded future.result().

  1. Typo (line 693): Missing space — "[3/3]Analyzing" should be "[3/3] Analyzing".

  2. Unguarded future.result() (line 701): If any chunk's API call fails (network error, rate limit), future.result() raises an exception, aborting the entire analysis. Wrap in try/except to skip failed chunks gracefully.

🛠️ Proposed fix
-    print(f"[3/3]Analyzing {len(chunks)} chunks in parallel...")
+    print(f"[3/3] Analyzing {len(chunks)} chunks in parallel...")
     
     # Parallel analysis of chunks
     from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
         futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
         for future in futures:
             idx = futures[future]
-            res = future.result()
-            if res:
-                res["index"] = idx
-                results.append(res)
+            try:
+                res = future.result()
+                if res:
+                    res["index"] = idx
+                    results.append(res)
+            except Exception as e:
+                print(f"Warning: Chunk {idx} analysis failed: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 693 - 704, The progress print message is missing a
space and the ThreadPoolExecutor loop calls future.result() unguarded which will
raise and abort processing if a chunk fails; update the string to "[3/3]
Analyzing {len(chunks)} chunks in parallel..." and wrap the call to
future.result() inside a try/except around the loop that iterates over futures
(created when submitting analyze_chunk_json) to catch exceptions, log or skip
failed chunk results, and continue appending successful results (ensuring you
still set res["index"] = idx when res is returned).
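
The guarded loop can be tried in isolation with a fake analyzer. In this sketch `analyze_all` and the `analyze` parameter are hypothetical; in the PR the submitted callable is `analyze_chunk_json` with a model argument.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_all(chunks: list[str], analyze, max_workers: int = 5) -> list[dict]:
    """Analyze chunks in parallel; skip chunks whose call raises or returns nothing."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(analyze, chunk) for chunk in chunks]
        for idx, future in enumerate(futures, 1):
            try:
                res = future.result()
            except Exception as exc:  # e.g. network error or rate limit
                print(f"Warning: chunk {idx} analysis failed: {exc}")
                continue
            if res:
                res["index"] = idx
                results.append(res)
    return results
```

A failing chunk then costs only its own result, and the surviving entries keep their original 1-based chunk index.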
🧹 Nitpick comments (4)
src/main.py (4)

696-696: Move import to module level.

ThreadPoolExecutor is imported inside the function body. Per PEP 8, imports should be at the top of the file.

♻️ Proposed fix

At module level (near line 1):

from concurrent.futures import ThreadPoolExecutor

Then remove line 696.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` at line 696, Move the "from concurrent.futures import
ThreadPoolExecutor" import out of the function body and place it at the top of
the module (module-level imports), then delete the local import statement inside
the function so all references to ThreadPoolExecutor use the top-level import;
keep the existing usage of ThreadPoolExecutor unchanged.

501-553: Active resolve_privacy_url implementation — clean up debug prints.

This is the intended 3-phase implementation. After removing the shadowed definition at lines 370-412, clean up the DEBUG print statements (lines 535, 544, 550) or convert them to proper logging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 501 - 553, In resolve_privacy_url, replace the
three DEBUG print statements with proper logging calls (e.g., use the module
logger) or remove them: the prints occur when a sitemap candidate is found
inside the loop over _get_sitemaps_from_robots/_fetch_sitemap_urls and in the
two places inside the common-path checks that currently call print(f"DEBUG:
..."). Update those prints to logger.debug(...) (or remove) so they don't leak
stdout; keep the rest of the 3-phase logic and return values unchanged and
ensure the logger is imported/initialized if added.
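
A minimal sketch of the print-to-logger conversion described above. The logger name, `configure_logging`, and `report_candidate` are hypothetical and not taken from `src/main.py`; the point is only that debug output goes through `logging` to stderr rather than to stdout.

```python
import logging
import sys

# Module-level logger; the name is an assumption for illustration.
logger = logging.getLogger("privacy_resolver")


def configure_logging(verbose: bool = False) -> None:
    """Route log records to stderr so stdout stays reserved for results."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    logger.propagate = False


def report_candidate(phase: str, url: str) -> str:
    """Hypothetical stand-in for the spots that called print(f"DEBUG: ...")."""
    # Lazy %-formatting: the message is only built if DEBUG is enabled.
    logger.debug("%s candidate found: %s", phase, url)
    return url
```

With this shape, the debug lines stay silent by default and can be enabled by a verbose flag without touching the 3-phase logic.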

190-190: Move _SELENIUM_WAIT_TIMEOUT to module level.

The constant is defined inside the function body. Per the commit message "use a constant for Selenium wait timeout", this should be a module-level constant for better visibility and configurability.

♻️ Proposed refactor

Move to module level near other constants:

+_SELENIUM_WAIT_TIMEOUT = 5
+
 def fetch_content_with_selenium(url: str) -> str | None:
     """Return visible text using headless Chrome; robust for dynamic pages."""
     chromedriver_autoinstaller.install()
     # ... options setup ...
-    _SELENIUM_WAIT_TIMEOUT = 5
     try:
         driver.set_page_load_timeout(10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` at line 190, Move the `_SELENIUM_WAIT_TIMEOUT` constant out of
the function body and declare it at module level alongside the other constants;
locate the occurrence of `_SELENIUM_WAIT_TIMEOUT` inside the function (the
function that currently defines it) and remove the in-function definition, then
add a top-level declaration `_SELENIUM_WAIT_TIMEOUT = 5` near the other module
constants so the constant is globally visible and configurable.

294-299: Extract magic number 999 to a named constant.

The value 999 is used as a "no match" sentinel but lacks explanation. Define a module-level constant for clarity.

♻️ Proposed fix
+_NO_MATCH_PRIORITY = 999  # Sentinel for URLs not matching any priority pattern

 def _get_url_priority(url: str) -> int:
     """Return the priority index of a URL based on regex patterns. Lower is better."""
     for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
         if pattern.search(url):
             return idx
-    return 999
+    return _NO_MATCH_PRIORITY

Then update references at lines 332, 445-446, and 492.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 294 - 299, Introduce a descriptive module-level
constant (e.g., _NO_MATCH_PRIORITY = 999) and replace the magic literal 999 in
_get_url_priority with that constant; also replace any other uses in this module
where 999 is used as the "no match" sentinel (places that return, compare to, or
rely on that sentinel) so all comparisons/returns use _NO_MATCH_PRIORITY instead
of the literal. Ensure the constant is documented with a short comment and
update the return in _get_url_priority and any other functions or conditionals
that previously used 999 to reference this new symbol.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 8f7396a and 12950e3.

📒 Files selected for processing (1)
  • src/main.py

@dogancanbakir dogancanbakir requested a review from ahsentekd May 22, 2026 13:06
@ahsentekd
Contributor

@myz21 if you have addressed the threads, we can merge this once you mark them resolved. Also, FYI: there is a conflict in the lock file.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main.py (1)

615-620: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unguarded future.result() will abort analysis on any chunk failure.

If any chunk's API call fails (network error, timeout, rate limit), future.result() raises the exception, aborting the entire analysis. This should be wrapped in try/except to allow partial results.

🛡️ Suggested fix
         for future in futures:
             idx = futures[future]
-            res = future.result()
-            if res:
-                res["index"] = idx
-                results.append(res)
+            try:
+                res = future.result()
+                if res:
+                    res["index"] = idx
+                    results.append(res)
+            except Exception as e:
+                click.echo(f"Warning: Chunk {idx} failed: {e}", err=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.py` around lines 615 - 620, The loop over futures currently calls
future.result() without protection, so any exception from a chunk will abort the
whole run; wrap the future.result() call in a try/except around the block that
sets res["index"] and appends to results (referencing the futures dict, loop
variable future, the idx lookup and the results list) — on exception log or
record the error (using future.exception() or the caught exception),
skip/continue to the next future, and optionally append a partial error marker
containing idx so downstream code can handle partial results.
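
A self-contained sketch of the guarded loop with an optional error marker, as suggested above. `analyze_chunk` is a hypothetical stand-in for `analyze_chunk_json`, and the shape of the error records is an assumption rather than the project's actual format.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def analyze_chunk(chunk: str) -> Optional[dict]:
    """Hypothetical stand-in for analyze_chunk_json; fails on a marker chunk."""
    if chunk == "boom":
        raise RuntimeError("simulated API failure")
    return {"length": len(chunk)}


def analyze_all(chunks: list[str], max_workers: int = 4) -> tuple[list[dict], list[dict]]:
    """Analyze chunks in parallel, keeping successes and recording failures."""
    results: list[dict] = []
    errors: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the index of the chunk it was created for.
        futures = {executor.submit(analyze_chunk, c): i for i, c in enumerate(chunks)}
        for future, idx in futures.items():
            try:
                res = future.result()
            except Exception as exc:
                # One failed chunk must not abort the whole run.
                errors.append({"index": idx, "error": str(exc)})
                continue
            if res:
                res["index"] = idx
                results.append(res)
    return results, errors
```

Returning the failures separately lets the caller decide whether a partial analysis is acceptable or whether the failed indices should be retried.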
🧹 Nitpick comments (1)
src/main.py (1)

217-217: 💤 Low value

Body fallback lacks minimum length check.

The <body> fallback returns text without any minimum threshold, whereas _extract_text_http enforces >= 400 characters. This could cause Selenium to return very short/empty content that HTTP extraction would have rejected, leading to inconsistent behavior.

🛡️ Suggested threshold check
-        return driver.find_element(By.TAG_NAME, "body").get_attribute("innerText")
+        body_text = driver.find_element(By.TAG_NAME, "body").get_attribute("innerText")
+        if body_text and len(body_text.strip()) >= _MIN_TEXT_LENGTH_POLICY:
+            return body_text
+        return None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.py` at line 217, The body fallback currently returns
driver.find_element(By.TAG_NAME, "body").get_attribute("innerText") with no
length guard; update the Selenium extraction function (the code that returns the
body innerText) to enforce the same minimum-character threshold used by
_extract_text_http (>= 400 chars): after retrieving body_text from
driver.find_element(By.TAG_NAME, "body").get_attribute("innerText"), check its
length and only return it if len(body_text) >= 400, otherwise treat it as a
failure (e.g., return None or raise/continue the existing fallback flow) so
behavior matches _extract_text_http.
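
The length guard can be isolated in a small helper, sketched here under the assumption that the threshold is the 400 characters cited for `_extract_text_http`. The names `MIN_TEXT_LENGTH` and `accept_body_text` are hypothetical; the suggested diff above refers to `_MIN_TEXT_LENGTH_POLICY` instead.

```python
from typing import Optional

# Assumed to mirror the 400-character minimum cited for _extract_text_http.
MIN_TEXT_LENGTH = 400


def accept_body_text(body_text: Optional[str], min_length: int = MIN_TEXT_LENGTH) -> Optional[str]:
    """Return the stripped text if it is long enough, otherwise None."""
    if not body_text:
        return None
    stripped = body_text.strip()
    # Whitespace-only or very short pages are treated as extraction failures.
    return stripped if len(stripped) >= min_length else None
```

Applying the same helper to both the HTTP and Selenium paths would keep the two extraction routes consistent.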

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 12950e3 and e19ff9a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • .env.example
  • pyproject.toml
  • src/main.py
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
