Fixes #9#10
Walkthrough
Implements a multi-phase privacy URL discovery and verification flow in
Changes
Policy Discovery & Analysis
Sequence Diagram
sequenceDiagram
participant Caller
participant Resolver as resolve_privacy_url
participant Fetcher as HTTP Fetcher
participant LinkDiscovery as Link Discovery
participant Sitemap as Robots/Sitemap
participant Prober as Common Paths Prober
participant Verifier as Scorer & Verifier
participant Analyzer as Chunk Analyzer (OpenAI)
Caller->>Resolver: resolve_privacy_url(input_url)
Resolver->>Fetcher: fetch page & homepage
Fetcher-->>LinkDiscovery: page HTML
LinkDiscovery->>Verifier: candidate list + priorities
Verifier-->>Resolver: verified best_url (if found)
alt no verified link
Resolver->>Sitemap: fetch robots -> sitemap URLs
Sitemap->>Verifier: sitemap candidates
Verifier-->>Resolver: verified sitemap_url (if found)
alt none found
Resolver->>Prober: probe common privacy paths
Prober->>Verifier: probe results
Verifier-->>Resolver: best_url or fallback
end
end
Resolver->>Fetcher: fetch policy content (HTTP or Selenium)
Fetcher-->>Resolver: policy text
Resolver->>Analyzer: split -> parallel analyze chunks (OPENAI_BASE_URL)
Analyzer-->>Resolver: chunk analysis results
Resolver-->>Caller: return aggregated JSON
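The fallback order in the diagram (page links, then robots/sitemap, then common paths) can be sketched as a loop over discovery phases. This is only an illustration under stated assumptions: `resolve_in_phases`, the `Phase` type, and the `verify` callable are hypothetical names, not the functions in `src/main.py`.

```python
from typing import Callable, Iterable, Optional

# Hypothetical phase signature: each phase yields candidate URLs for a site.
Phase = Callable[[str], Iterable[str]]


def resolve_in_phases(
    input_url: str,
    phases: list[Phase],
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Run discovery phases in order; return the first candidate that verifies."""
    for phase in phases:
        for candidate in phase(input_url):
            if verify(candidate):
                return candidate
    return None  # the caller decides on a fallback
```

A later phase only runs when every candidate from the earlier phases failed verification, which mirrors the nested `alt` branches above.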
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main.py (1)
369-411: Remove the duplicated `resolve_privacy_url`; the earlier version is dead code.
The second definition overwrites the first, so edits to the earlier version (including `_improve_candidate` usage) never take effect and are easy to miss. Please delete the first definition and, if you still want hub-page deepening, apply `_improve_candidate` in the active flow.
♻️ Suggested integration in the active resolver
-    if best_url:
-        return best_url, input_url
+    if best_url:
+        return _improve_candidate(best_url), input_url
 ...
-    return cand, input_url
+    return _improve_candidate(cand), input_url
 ...
-    return cand, input_url
+    return _improve_candidate(cand), input_url
Also applies to: 487-539
🤖 Fix all issues with AI agents
In `@src/main.py`:
- Around line 414-446: The current _collect_link_candidates filters out URLs
using _is_privacy_like which doesn’t catch "legal"/"terms" patterns from the
regex priority list; update _collect_link_candidates to also accept URLs that
match the module’s priority regex list (use the same regexes used elsewhere
instead of only _is_privacy_like) so links like /legal or /terms are included,
preserve the deduplication map behavior (candidates: Dict[str, str]) and ensure
you still keep the best anchor text per URL when inserting/updating candidates,
and stop when limit is reached.
- Around line 555-571: The code currently defaults OPENAI_BASE_URL to
"https://openrouter.ai/api/v1" which routes users to OpenRouter unexpectedly;
change os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1") to
os.getenv("OPENAI_BASE_URL") so the OpenAI SDK uses its native endpoint by
default (keep the existing OpenAI(client) usage in main.py/resp creation), and
then add a clear entry for OPENAI_BASE_URL with the OpenRouter value to
.env.example and a short note in README instructing users to set
OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly if they want OpenRouter.
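A minimal sketch of the "optional base URL" idea, assuming the client is later built with `OpenAI(**kwargs)`. The helper name `openai_client_kwargs` is hypothetical; only the `api_key` and `base_url` keyword names come from the review text above.

```python
import os
from typing import Mapping


def openai_client_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect client keyword arguments, adding base_url only when it is set."""
    kwargs: dict[str, str] = {}
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    base_url = env.get("OPENAI_BASE_URL")
    if base_url:  # unset or empty: leave the SDK on its own default endpoint
        kwargs["base_url"] = base_url
    return kwargs
```

With this shape, OpenRouter is used only when `OPENAI_BASE_URL=https://openrouter.ai/api/v1` is set explicitly.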
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main.py (1)
369-411: Critical: Duplicate function definition - this entire block is dead code.
`resolve_privacy_url` is defined twice: here (lines 369-411) and again at lines 500-552. Python uses the last definition, so this first implementation is completely shadowed and never executes.
Additionally, this first version calls `_improve_candidate()` on discovered URLs (lines 384, 396, 402), while the active second version at line 500 does not call `_improve_candidate()` anywhere. This means the "upgrade generic pages to deeper policy links" feature described in the PR objectives is not functioning.
Recommended action: Remove this dead code block (lines 369-411) and integrate `_improve_candidate` calls into the active function at line 500.
🛠️ Example fix for active function (lines 527-528)
     best_url = _pick_best_verified_candidate(candidates_list, max_verify=5)
     if best_url:
-        return best_url, input_url
+        return _improve_candidate(best_url), input_url
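The shadowing behaviour can be reproduced in a few lines. This toy example uses a hypothetical `resolve` function, not the project's code: the second `def` rebinds the name, so the first body is never reachable through it.

```python
def resolve(url: str) -> str:
    """First definition: unreachable by name once the name is rebound below."""
    return "first definition: " + url


def resolve(url: str) -> str:  # noqa: F811 (deliberate redefinition)
    """Second definition: this is the object the name refers to at call time."""
    return "second definition: " + url


result = resolve("https://example.com")
```

Linters such as pyflakes report this pattern as a redefinition, which is one way to catch it before review.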
🤖 Fix all issues with AI agents
In `@src/main.py`:
- Line 692: The print statement formatting in the progress message (print(...)
in src/main.py) is missing a space after the bracket; update the string in the
print call that starts with "[3/3]Analyzing" to include a space so it reads
"[3/3] Analyzing {len(chunks)} chunks in parallel..." to fix the typo.
- Around line 489-496: The loop currently returns a URL immediately when
score[0] <= 1 (high-priority candidate) without checking if the link is live,
which can return broken URLs; change the logic in the for-loop that iterates
over scored[:max_verify] so that even when score[0] <= 1 you call
_light_verify(url) and only return url if that verification succeeds (print the
same debug message on success); if verification fails, continue to the next
candidate (and only return after a successful _light_verify or after exhausting
candidates), using the existing variables scored, max_verify, _light_verify,
candidates and the same debug message.
- Around line 198-206: The Selenium path currently accepts main text >100 chars
while the HTTP path enforces a ≥400 char minimum; make these thresholds
consistent by introducing a shared constant (e.g., CONTENT_MIN_LENGTH = 400) and
using it in both the HTTP extraction checks and the Selenium extraction block
(the driver.find_element(By.TAG_NAME, "main") / body.get_attribute("innerText")
logic), and ensure you use the same comparison operator (>=) in both places so
identical pages behave consistently regardless of extraction method.
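The "verify before returning" item above can be sketched as follows. It assumes the candidates are already sorted best-first and that verification is a plain callable; `pick_first_verified` and its parameters are illustrative names, not the module's `_pick_best_verified_candidate`.

```python
from typing import Callable, Optional, Sequence


def pick_first_verified(
    scored_urls: Sequence[str],
    verify: Callable[[str], bool],
    max_verify: int = 5,
) -> Optional[str]:
    """Return the best-ranked URL that passes verification, or None."""
    for url in scored_urls[:max_verify]:
        # No shortcut for high-priority candidates: every URL must be live.
        if verify(url):
            return url
    return None
```

The cost is at most `max_verify` verification calls; the benefit is that a well-scored but broken link can no longer be returned.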
🧹 Nitpick comments (3)
src/main.py (3)
139-143: Redundant redirect status check.
With `allow_redirects=True`, the response status code reflects the final destination, not intermediate redirects. The explicit check for 3xx codes (lines 141-142) will rarely match unless the server returns a redirect without a `Location` header or hits a redirect limit.
This isn't a bug, but the logic could be simplified.
♻️ Simplified version
-    if 200 <= r.status_code < 300:
-        return True
-    if r.status_code in (301, 302, 303, 307, 308):
-        return True
-    return False
+    return 200 <= r.status_code < 400
354-364: Consider using `logging` instead of `print`.
Multiple `print(f"DEBUG: ...")` statements are scattered throughout (lines 354, 363, 494, 534, 543, 549, 648, 654, 692). These should use Python's `logging` module for proper log level control and configurability.
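As a rough illustration of the suggestion, a debug print can be replaced by a module-level logger. The logger name `privacy_resolver` and the helper `report_selection` are assumptions made for this sketch.

```python
import logging

logger = logging.getLogger("privacy_resolver")  # hypothetical logger name


def report_selection(url: str, total: int) -> None:
    """Log the chosen URL at DEBUG level instead of printing it."""
    # Lazy %-style arguments are only formatted if the record is emitted.
    logger.debug("Selected URL %r from %d candidates", url, total)
```

Calling `logging.basicConfig(level=logging.DEBUG)` once at program start makes these messages visible; leaving the default level keeps them silent.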
695-703: Move import to top of file and note that chunk order is not preserved.
- The `ThreadPoolExecutor` import at line 695 should be at the top of the file with other imports.
- Iterating over the `futures` dict doesn't guarantee completion order matches submission order. If `aggregate_chunk_results` depends on chunk ordering, this could affect results. The current code assigns `index` (line 702) but the list order itself is completion-order.
♻️ Preserve chunk order using as_completed with sorting
+from concurrent.futures import ThreadPoolExecutor, as_completed
+# ... at top of file
-    from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
-        futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
-        for future in futures:
-            idx = futures[future]
-            res = future.result()
+        futures = [executor.submit(analyze_chunk_json, chunk, args.model) for chunk in chunks]
+        for idx, future in enumerate(futures, 1):
+            res = future.result()  # blocks in order
             if res:
                 res["index"] = idx
                 results.append(res)
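An alternative sketch for keeping results in chunk order uses `executor.map`, which yields results in submission order regardless of which worker finishes first. The function name `analyze_in_order` and the `analyze` callable are hypothetical stand-ins for `analyze_chunk_json`. One detail worth checking against the real code: since Python 3.7 a dict iterates in insertion order, so looping over a `futures` dict already visits futures in submission order; a list or `map` simply makes that intent explicit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def analyze_in_order(
    chunks: list[str],
    analyze: Callable[[str], Optional[dict]],
    max_workers: int = 5,
) -> list[dict]:
    """Analyze chunks in parallel and return results in chunk order."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() preserves input order; enumerate gives the 1-based chunk index.
        for idx, res in enumerate(executor.map(analyze, chunks), 1):
            if res:
                res["index"] = idx
                results.append(res)
    return results
```

Note that iterating `executor.map` re-raises the first worker exception, so this variant still needs error handling if partial results are wanted.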
I've streamlined the privacy policy discovery and analysis system. Here's a rundown of the optimizations and the current performance status:
Key Improvements
Performance Bottlenecks
Testing on sites like TikTok and Google highlighted a few hurdles:
def _http_get(url: str, timeout: int = 15) -> Optional[requests.Response]:
def _http_get(url: str, timeout: int = 5) -> Optional[requests.Response]:
`Optional` shouldn't be used; on Python 3.9+ there is no need for it, `|` should be used instead.
def _fetch_text(url: str, timeout: int = 12) -> Optional[str]:
def _fetch_text(url: str, timeout: int = 5) -> Optional[str]:
driver.set_page_load_timeout(10)
driver.get(url)
WebDriverWait(driver, 12).until(
WebDriverWait(driver, 5).until(
The meaning of the number 5 should be explained; it should be a constant.
try:
    content_element = driver.find_element(By.TAG_NAME, "main")
    text = content_element.get_attribute("innerText")
    if text and len(text.strip()) > 100:
for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
    if pattern.search(url):
        return idx
return 999
# Collect links from input page and homepage
candidates_set: Dict[str, str] = {}  # url -> anchor_text
for page_url in [input_url, base]:
You can do it like this:
for page_url in (input_url, base):
    if not (resp := _http_get(page_url)):
        continue
    for url, text in _collect_link_candidates(resp.text, resp.url, limit=100):
        candidates_set.setdefault(url, text)
# === PHASE 3: Common paths (last resort) ===
path_heads: List[str] = []
for p in _COMMON_PATHS:
You can shorten it:
for candidate_url in (base + path for path in _COMMON_PATHS):
    if _head_ok(candidate_url) or _light_verify(candidate_url):
        if _light_verify(candidate_url):
            print(f" Found via common path: {candidate_url}")
            return candidate_url, input_url
@@ -351,17 +567,20 @@ def split_text_into_chunks(
def analyze_chunk_json(text_chunk: str, model: str) -> Optional[Dict[str, Any]]:
    """Analyze a text chunk with the LLM and return one JSON object."""
    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
print(json.dumps(out, ensure_ascii=False, indent=2))
print("\n" + "="*40)
The print calls here should be removed; click should be used instead.
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/main.py (3)
156-171: ⚠️ Potential issue | 🟡 Minor
Use one shared minimum-content threshold for HTTP and Selenium paths.
HTTP extraction uses `>= 400` while Selenium accepts `> 100` at Line 201, causing inconsistent acceptance for the same page content.
♻️ Suggested fix
+CONTENT_MIN_LENGTH = 400
 ...
-    return t if len(t) >= 400 else None
+    return t if len(t) >= CONTENT_MIN_LENGTH else None
 ...
-    return t if len(t) >= 400 else None
+    return t if len(t) >= CONTENT_MIN_LENGTH else None
 ...
-    if text and len(text.strip()) > 100:
+    if text and len(text.strip()) >= CONTENT_MIN_LENGTH:
         return text
Also applies to: 197-203
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 156 - 171, Define a single shared minimum content constant (e.g., MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the HTTP and Selenium extraction paths instead of hard-coded literals; replace the HTTP check "return t if len(t) >= 400" and the Selenium path's "> 100" check with "len(t) >= MIN_CONTENT_CHARS" so both branches (the code using _http_get and the code that sets content_element/get_text into t) enforce the same threshold and comparison semantics.
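One way to make both extraction paths agree is to route them through a single acceptance helper. This is a sketch under assumptions: `accept_text` is a hypothetical helper, and `MIN_CONTENT_CHARS = 400` follows the name and value proposed in the prompt above.

```python
from typing import Optional

MIN_CONTENT_CHARS = 400  # shared threshold; value taken from the review suggestion


def accept_text(raw: Optional[str]) -> Optional[str]:
    """Return stripped text if it meets the shared minimum length, else None."""
    if raw is None:
        return None
    text = raw.strip()
    return text if len(text) >= MIN_CONTENT_CHARS else None
```

If the HTTP and Selenium branches both end in `return accept_text(...)`, the same page cannot be accepted by one path and rejected by the other on length alone.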
489-493: ⚠️ Potential issue | 🟠 Major
Do not bypass URL verification for top-ranked candidates.
At Lines 491-492, high-priority candidates are returned without `_light_verify`. This can emit broken URLs.
🔧 Suggested fix
-    for i, (url, text, score) in enumerate(scored[:max_verify]):
-        # if the score is already very good, skip verification
-        if score[0] <= 1:
-            return url
+    for _, (url, text, score) in enumerate(scored[:max_verify]):
         if _light_verify(url):
             print(f"DEBUG: Selected URL '{url}' from {len(candidates)} candidates (score: {score})")
             return url
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 489 - 493, The loop over scored candidates currently returns a top-ranked URL when score[0] <= 1 before calling _light_verify, which bypasses verification; change the control flow in the loop (the for i, (url, text, score) in enumerate(scored[:max_verify]) block) so that you do not return immediately for high-scoring items—instead run _light_verify(url) for every candidate (including those with score[0] <= 1) and only return the url when _light_verify succeeds; if verification fails, continue to the next candidate and only return after a successful verification (or exhaust candidates).
570-575: ⚠️ Potential issue | 🟠 Major
Defaulting to OpenRouter should be explicit or documented.
At Line 570, defaulting `OPENAI_BASE_URL` to OpenRouter reroutes users unexpectedly when they only set `OPENAI_API_KEY`.
🔧 Suggested fix (SDK default unless explicitly configured)
-    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+    base_url = os.getenv("OPENAI_BASE_URL")
@@
-    client = OpenAI(api_key=api_key, base_url=base_url)
+    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
In openai Python SDK 1.102.0, what is the default API base URL when `OpenAI(api_key=...)` is created without `base_url`?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 570 - 575, The code defaults OPENAI_BASE_URL to OpenRouter and always passes base_url into the OpenAI constructor (base_url variable and client = OpenAI(api_key=api_key, base_url=base_url)), which silently overrides the SDK default; change the logic so OPENAI_BASE_URL is optional: read os.getenv("OPENAI_BASE_URL") without a default, and only pass base_url into OpenAI(...) when that env var is present (or otherwise leave the OpenAI(...) call as OpenAI(api_key=api_key)) so the SDK default base URL is preserved unless explicitly configured; update any docstring or README to state that OPENAI_BASE_URL must be set to override the SDK default.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main.py`:
- Around line 576-585: The OpenAI per-chunk call (client.chat.completions.create
used to assign resp with build_user_prompt) has no error handling and futures
are consumed with an unguarded future.result(); wrap the API call inside a
try/except that catches network/HTTP/timeout/rate-limit exceptions and returns
None (or a sentinel) for failed chunks, and then guard future.result() where
futures are iterated so you skip/ignore None results or catch exceptions from
future.result() and continue processing remaining chunks; reference
client.chat.completions.create, build_user_prompt, resp, and future.result()
when applying the changes.
- Around line 369-411: Remove the obsolete duplicate implementation of
resolve_privacy_url (the one that uses find_best_policy_url and
_extract_text_http) so only the intended phase-based implementation remains (the
one that uses _collect_link_candidates and _pick_best_verified_candidate);
delete the first definition entirely to avoid shadowing and ensure imports/refs
still match the remaining resolve_privacy_url, then run tests/static checks to
verify no references break.
♻️ Duplicate comments (1)
src/main.py (1)
693-704: ⚠️ Potential issue | 🟡 Minor
Missing space in progress message and unguarded `future.result()`.
Typo (line 693): Missing space: `"[3/3]Analyzing"` should be `"[3/3] Analyzing"`.
Unguarded future.result() (line 701): If any chunk's API call fails (network error, rate limit), `future.result()` raises an exception, aborting the entire analysis. Wrap in try/except to skip failed chunks gracefully.
🛠️ Proposed fix
-    print(f"[3/3]Analyzing {len(chunks)} chunks in parallel...")
+    print(f"[3/3] Analyzing {len(chunks)} chunks in parallel...")
     # Parallel analysis of chunks
     from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
         futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
         for future in futures:
             idx = futures[future]
-            res = future.result()
-            if res:
-                res["index"] = idx
-                results.append(res)
+            try:
+                res = future.result()
+                if res:
+                    res["index"] = idx
+                    results.append(res)
+            except Exception as e:
+                print(f"Warning: Chunk {idx} analysis failed: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 693 - 704, The progress print message is missing a space and the ThreadPoolExecutor loop calls future.result() unguarded which will raise and abort processing if a chunk fails; update the string to "[3/3] Analyzing {len(chunks)} chunks in parallel..." and wrap the call to future.result() inside a try/except around the loop that iterates over futures (created when submitting analyze_chunk_json) to catch exceptions, log or skip failed chunk results, and continue appending successful results (ensuring you still set res["index"] = idx when res is returned).
🧹 Nitpick comments (4)
src/main.py (4)
696-696: Move import to module level.
`ThreadPoolExecutor` is imported inside the function body. Per PEP 8, imports should be at the top of the file.
♻️ Proposed fix
At module level (near line 1):
from concurrent.futures import ThreadPoolExecutor
Then remove line 696.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` at line 696, Move the "from concurrent.futures import ThreadPoolExecutor" import out of the function body and place it at the top of the module (module-level imports), then delete the local import statement inside the function so all references to ThreadPoolExecutor use the top-level import; keep the existing usage of ThreadPoolExecutor unchanged.
501-553: Active `resolve_privacy_url` implementation: clean up debug prints.
This is the intended 3-phase implementation. After removing the shadowed definition at lines 370-412, clean up the `DEBUG` print statements (lines 535, 544, 550) or convert them to proper logging.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 501 - 553, In resolve_privacy_url, replace the three DEBUG print statements with proper logging calls (e.g., use the module logger) or remove them: the prints occur when a sitemap candidate is found inside the loop over _get_sitemaps_from_robots/_fetch_sitemap_urls and in the two places inside the common-path checks that currently call print(f"DEBUG: ..."). Update those prints to logger.debug(...) (or remove) so they don't leak stdout; keep the rest of the 3-phase logic and return values unchanged and ensure the logger is imported/initialized if added.
190-190: Move `_SELENIUM_WAIT_TIMEOUT` to module level.
The constant is defined inside the function body. Per the commit message "use a constant for Selenium wait timeout", this should be a module-level constant for better visibility and configurability.
♻️ Proposed refactor
Move to module level near other constants:
+_SELENIUM_WAIT_TIMEOUT = 5
+
 def fetch_content_with_selenium(url: str) -> str | None:
     """Return visible text using headless Chrome; robust for dynamic pages."""
     chromedriver_autoinstaller.install()
     # ... options setup ...
-    _SELENIUM_WAIT_TIMEOUT = 5
     try:
         driver.set_page_load_timeout(10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` at line 190, Move the `_SELENIUM_WAIT_TIMEOUT` constant out of the function body and declare it at module level alongside the other constants; locate the occurrence of `_SELENIUM_WAIT_TIMEOUT` inside the function (the function that currently defines it) and remove the in-function definition, then add a top-level declaration `_SELENIUM_WAIT_TIMEOUT = 5` near the other module constants so the constant is globally visible and configurable.
294-299: Extract magic number 999 to a named constant.
The value `999` is used as a "no match" sentinel but lacks explanation. Define a module-level constant for clarity.
♻️ Proposed fix
+_NO_MATCH_PRIORITY = 999  # Sentinel for URLs not matching any priority pattern
 def _get_url_priority(url: str) -> int:
     """Return the priority index of a URL based on regex patterns. Lower is better."""
     for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
         if pattern.search(url):
             return idx
-    return 999
+    return _NO_MATCH_PRIORITY
Then update references at lines 332, 445-446, and 492.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 294 - 299, Introduce a descriptive module-level constant (e.g., NO_MATCH_PRIORITY = 999) and replace the magic literal 999 in _get_url_priority with that constant; also replace any other uses in this module where 999 is used as the "no match" sentinel (places that return, compare to, or rely on that sentinel) so all comparisons/returns use NO_MATCH_PRIORITY instead of the literal. Ensure the constant is documented with a short comment and update the return in _get_url_priority and any other functions or conditionals that previously used 999 to reference this new symbol.
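A self-contained sketch of the sentinel idea, assuming a short ordered pattern list. `PATTERNS` is invented for illustration because the real `_PRIVACY_REGEX_PATTERNS` is not shown in this thread; `NO_MATCH_PRIORITY` follows the name used in the prompt above.

```python
import re

# Hypothetical priority list (lower index = better match); not the module's real patterns.
PATTERNS = [
    re.compile(r"privacy[-_]?policy", re.IGNORECASE),
    re.compile(r"privacy", re.IGNORECASE),
    re.compile(r"legal|terms", re.IGNORECASE),
]

NO_MATCH_PRIORITY = 999  # sentinel: URL matched none of the priority patterns


def url_priority(url: str) -> int:
    """Return the index of the first matching pattern, or the no-match sentinel."""
    for idx, pattern in enumerate(PATTERNS):
        if pattern.search(url):
            return idx
    return NO_MATCH_PRIORITY
```

Callers can then compare against `NO_MATCH_PRIORITY` by name, so the meaning of the value stays in one documented place.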
@myz21 if you've addressed the threads, we can merge this once you mark them as resolved. Also, FYI, there is a conflict in the lock file.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main.py (1)
615-620: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Unguarded `future.result()` will abort analysis on any chunk failure.
If any chunk's API call fails (network error, timeout, rate limit), `future.result()` raises the exception, aborting the entire analysis. This should be wrapped in try/except to allow partial results.
🛡️ Suggested fix
     for future in futures:
         idx = futures[future]
-        res = future.result()
-        if res:
-            res["index"] = idx
-            results.append(res)
+        try:
+            res = future.result()
+            if res:
+                res["index"] = idx
+                results.append(res)
+        except Exception as e:
+            click.echo(f"Warning: Chunk {idx} failed: {e}", err=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main.py` around lines 615 - 620, The loop over futures currently calls future.result() without protection, so any exception from a chunk will abort the whole run; wrap the future.result() call in a try/except around the block that sets res["index"] and appends to results (referencing the futures dict, loop variable future, the idx lookup and the results list) — on exception log or record the error (using future.exception() or the caught exception), skip/continue to the next future, and optionally append a partial error marker containing idx so downstream code can handle partial results.
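The partial-results behaviour described above might look like the sketch below. The names `collect_partial` and `analyze` are hypothetical, and failures are returned to the caller instead of being printed, so the example has no dependency on click.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def collect_partial(
    chunks: list[str],
    analyze: Callable[[str], Optional[dict]],
    max_workers: int = 5,
) -> tuple[list[dict], list[tuple[int, str]]]:
    """Return (successful results, [(chunk index, error message), ...])."""
    results: list[dict] = []
    failures: list[tuple[int, str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(analyze, chunk) for chunk in chunks]
        for idx, future in enumerate(futures, 1):
            try:
                res = future.result()  # re-raises any exception from the worker
            except Exception as exc:
                failures.append((idx, str(exc)))
                continue  # keep going so other chunks still contribute
            if res:
                res["index"] = idx
                results.append(res)
    return results, failures
```

Returning the failure list leaves the reporting choice (warning, retry, or abort) to the caller.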
🧹 Nitpick comments (1)
src/main.py (1)
217-217: 💤 Low value
Body fallback lacks minimum length check.
The `<body>` fallback returns text without any minimum threshold, whereas `_extract_text_http` enforces `>= 400` characters. This could cause Selenium to return very short/empty content that HTTP extraction would have rejected, leading to inconsistent behavior.
🛡️ Suggested threshold check
-    return driver.find_element(By.TAG_NAME, "body").get_attribute("innerText")
+    body_text = driver.find_element(By.TAG_NAME, "body").get_attribute("innerText")
+    if body_text and len(body_text.strip()) >= _MIN_TEXT_LENGTH_POLICY:
+        return body_text
+    return None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main.py` at line 217, The body fallback currently returns driver.find_element(By.TAG_NAME, "body").get_attribute("innerText") with no length guard; update the Selenium extraction function (the code that returns the body innerText) to enforce the same minimum-character threshold used by _extract_text_http (>= 400 chars): after retrieving body_text from driver.find_element(By.TAG_NAME, "body").get_attribute("innerText"), check its length and only return it if len(body_text) >= 400, otherwise treat it as a failure (e.g., return None or raise/continue the existing fallback flow) so behavior matches _extract_text_http.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 52208454-0882-4647-b29c-d261a3997dc6
⛔ Files ignored due to path filters (1)
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (3)
- .env.example
- pyproject.toml
- src/main.py
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
… Multi-LLM provider support
…trings in main.py
…o click.secho, and add docstrings to url resolution
…configure OpenAI base_url
Changes Made to `src/main.py`
Old System
/privacy, /privacy-policy)
New System
- `_collect_link_candidates()` - Collects all privacy links from the page
- `_score_candidate()` - Scores each URL by priority and anchor text
- `_pick_best_verified_candidate()` - Selects and verifies the best URL
3-Tier Discovery Strategy:
Result
Finds the most detailed policy from real site links and skips landing pages, but it takes longer to run.
- /privacy (landing page)
- /legal/page/row/privacy-policy/tr (full policy)
Summary by CodeRabbit
New Features
Improvements