Skip to content

feat(backend): add LeadScrapeJob database models and async scrape tasks#433

Open
KhushiMulchandani wants to merge 2 commits into
Kuldeeep18:mainfrom
KhushiMulchandani:feat-ai-lead-generation-53
Open

feat(backend): add LeadScrapeJob database models and async scrape tasks#433
KhushiMulchandani wants to merge 2 commits into
Kuldeeep18:mainfrom
KhushiMulchandani:feat-ai-lead-generation-53

Conversation

@KhushiMulchandani

@KhushiMulchandani KhushiMulchandani commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Pull Request

🔗 Related Issue

Closes #53


📝 Summary of Changes

Backend

  • Asynchronous Task Delegation: Configured the scrape action inside LeadViewSet to immediately offload data processing pipelines to background workers using Celery (scrape_leads_task.delay()).
  • Strict Workspace Isolation: Constrained database lookup methods using request.user.organization filters to fully isolate cross-tenant workspace assets.
  • Security & Anti-Abuse Controls:
    • Enforced an upper payload ceiling cap of 200 rows per scraping query.
    • Implemented an active job check preventing multiple concurrent operations (status='RUNNING') inside the same tenant scope.
    • Deployed a rigorous 5-minute cooldown throttle (HTTP 429 Too Many Requests) between consecutive task completions.
  • Data Cleansing & Normalization: Built a background pipeline mapping block in tasks.py that normalizes string phone digits into clean E.164 compliance and automatically screens incoming records against current database emails to prevent duplicates.

Frontend

  • UI Resilience & Scope Isolation: Resolved a critical bug where rapid navigation between the Launch Agent and History Log tabs threw unhandled console exceptions and froze the modal DOM tree.
  • Modern SaaS Styling: Upgraded background visibility constraints to match our premium dark-mode criteria (configured a deep #0f172a workspace canvas bound by explicit #334155 border patterns) for an optimized, readable EdTech/SaaS appearance.
  • Authentication Header Alignment: Updated fetch routing properties to accurately sync token authentication headers with the underlying API layer, eliminating the Unexpected token '<' ... is not valid JSON syntax error.
  • Dynamic Counter Polling: Programmed real-time status trackers to read state changes from the backend seamlessly, shifting visual pills from PENDINGRUNNINGCOMPLETED asynchronously.

📁 Files Modified

  • leads/views.py
  • leads/tasks.py
  • leads/models.py
  • leads/serializers.py
  • requirements.txt
  • leads.html

🏷️ Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • ♻️ Refactor
  • 📝 Documentation update
  • 🎨 UI / Style change
  • 🔧 Chore

🧪 Testing

1. API Endpoint Integrity Checks

  • POST /api/leads/scrape/ ➔ Returns 201 Created with a background tracking job UUID.
  • GET /api/leads/scrape/{job_id}/status/ ➔ Polls background task progress parameters dynamically.
  • POST (Concurrently) ➔ Returns 400 Bad Request (active job flag block).
  • POST (Within 5-min window) ➔ Returns 429 Too Many Requests (cooldown throttle block).

2. Celery Worker Queue Telemetry Logs

[Tasks]
  . leads.tasks.scrape_leads_task
  . leads.tasks.import_leads_from_csv

[INFO] Task leads.tasks.scrape_leads_task[uuid] received
[INFO] Emulating stealth delay jitter pacing to prevent anti-bot detection loops...
[INFO] Generating randomized realistic B2B target entries (+1305... Miami patterns)
[INFO] Running workspace deduplication filters...
[INFO] Database insertion completed successfully. 15 unique records created.
[INFO] Task leads.tasks.scrape_leads_task[uuid] succeeded: status='COMPLETED'

📸 ScreenRecording Link for easy PR review and proof of updates!

https://drive.google.com/file/d/1_6LciRmKjIlXVYUSiusIGUG8GehMlP8x/view?usp=sharing


✅ Checklist

  • No merge conflicts
  • Changes follow the project guidelines
  • Documentation updated (if applicable)
  • Related issue linked
  • Changes tested locally (if applicable)

Summary by CodeRabbit

  • New Features
    • Added “Generate Leads with AI” to the leads page, including a dedicated AI Browser Scraper configuration modal with run history.
    • Introduced a new AI lead scrape flow to submit a query and limit (capped at 200), with live progress/status updates.
    • Added endpoints to view scrape job status and to browse scrape history for completed and failed runs.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a LeadScrapeJob Django model tracking async scrape job lifecycle (PENDING → RUNNING → COMPLETED/FAILED), a LeadScrapeJobSerializer, a scrape_leads_task Celery task with mock profile generation and Lead deduplication, three new LeadViewSet endpoints for starting/monitoring/listing scrape jobs, a frontend modal with real-time polling and history display, and pins playwright and google-genai as dependencies.

Changes

AI Lead Scraping Feature

Layer / File(s) Summary
LeadScrapeJob model and serializer
backend/leads/models.py, backend/leads/serializers.py
LeadScrapeJob TenantModel is defined with STATUS_CHOICES, job fields (query, limit, leads_found, error_message), and nullable started_at/completed_at timestamps. LeadScrapeJobSerializer exposes those fields with id, status, leads_found, error_message, and timestamp fields set as read-only.
scrape_leads_task Celery task
backend/leads/tasks.py
New @shared_task scrape_leads_task loads Organization and LeadScrapeJob, transitions status to RUNNING with started_at, generates a hardcoded mock profile pool conditioned on whether the query contains "miami", deduplicates by email within the org, inserts new Lead rows up to limit, and transitions to COMPLETED or FAILED with completed_at timestamp and final metrics.
Scrape REST endpoints
backend/leads/views.py
Three new LeadViewSet actions: scrape (POST) validates query/limit, enforces a 200-lead cap, blocks if a RUNNING job exists, enforces a 5-minute cooldown on COMPLETED jobs, creates a PENDING LeadScrapeJob, and dispatches scrape_leads_task; scrape_status (GET) returns a single org-scoped job by id with 404 on not found; scrape_history (GET) returns all org jobs ordered by -created_at.
Frontend UI: AI scraper modal and polling
frontend/leads.html
New "Generate Leads with AI" button opens a modal with tabbed Launch Agent/History Log sections. Launch form accepts query, limit, and source checkboxes. On submit, posts to /leads/scrape/ and begins polling /leads/scrape/{jobId}/status/ every 3 seconds to display live status/count/progress. History tab loads prior scrape jobs via loadScrapeHistory() with status badges. Failure state re-enables the form; completion state reloads the page. CSS font variant updated.
Pinned dependencies
requirements.txt
Adds playwright==1.49.0 and google-genai==0.1.1.

Possibly related issues

  • #53: Directly addresses the same AI-driven browser lead scraper feature with matching LeadScrapeJob model structure, identical Celery task implementation (scrape_leads_task), corresponding REST endpoint design, and identical Playwright/Google Generative AI dependencies specified.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐇 A query hops in, the job spins to life,
Miami or Austin — no need for strife.
PENDING turns RUNNING, leads start to bloom,
COMPLETED at last fills the org's room.
The rabbit cheers on with a wiggle and hop,
New leads in the database — what a crop! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR implements core backend requirements from issue #53: LeadScrapeJob model [#53], scrape_leads_task for async processing [#53], API endpoints for scraping [#53], and rate limiting controls [#53]. However, it lacks critical components: Playwright browser automation, Gemini API integration, and phone number normalization are mentioned but not implemented. Frontend implementation in leads.html is included but appears out-of-scope for a backend-focused title. Either implement missing requirements (Playwright browser automation, Gemini API integration for HTML parsing, phone normalization) or update the PR scope to clarify which acceptance criteria are deferred to follow-up PRs.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check ❓ Inconclusive Frontend changes to leads.html (modal UI, form fields, polling logic, history loading) appear outside scope for a PR titled as 'backend' focused on database models and async tasks. While the PR objectives mention frontend implementation, the title suggests backend-only changes. Clarify whether frontend implementation is in scope for this PR, or defer frontend modal and scraping UI to a separate PR to align with the backend-focused title.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(backend): add LeadScrapeJob database models and async scrape tasks' accurately describes the primary backend changes in the PR, including the new LeadScrapeJob model and async task implementation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch feat-ai-lead-generation-53

Comment @coderabbitai help to get the list of available commands.

@KhushiMulchandani

Copy link
Copy Markdown
Contributor Author

@Kuldeeep18 Kindly review and merge.
Thankyou!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/leads/models.py`:
- Line 50: The `limit` field in the Django model (located in the IntegerField
definition at line 50) needs field-level validators to enforce bounds and
prevent invalid values from being saved through non-view code paths. Add the
`validators` parameter to the IntegerField definition for the `limit` field,
using Django's MinValueValidator and MaxValueValidator to enforce the minimum
and maximum allowed values for the 200-row contract. This ensures validation is
enforced regardless of which code path (view, serializer, Django admin, or
direct ORM) attempts to persist the value.

In `@backend/leads/tasks.py`:
- Around line 167-178: The current Lead creation logic uses a check-then-create
pattern that is vulnerable to race conditions when concurrent tasks attempt to
insert leads with the same organization and email combination. Replace the
if-not-exists check and Lead.objects.create() call with Django's get_or_create()
method, using organization and email as the lookup parameters and passing
first_name, last_name, company, phone, and linkedin_url as defaults. This
ensures atomic database-level handling of concurrent inserts. Check the return
tuple from get_or_create() to determine if a new lead was created and increment
inserted_count only when created is True.
- Around line 186-189: The exception handler in the code block is storing the
raw exception text via str(e) directly into job.error_message, which gets
exposed through the API endpoints (scrape_status and scrape_history) via
LeadScrapeJobSerializer. Instead of assigning str(e) to job.error_message, log
the full exception internally for debugging purposes, and store a generic,
user-friendly error message in job.error_message (such as "An error occurred
during lead scraping") to avoid leaking internal details like stack traces, file
paths, or configuration information to API consumers.
- Around line 103-104: The LeadScrapeJob lookup in the Celery task uses only the
job ID without filtering by organization, which can allow access to jobs from
different organizations due to missing tenant context in Celery workers. Modify
the LeadScrapeJob.objects.get() call on line 104 to include an additional filter
for organization_id matching the organization loaded on line 103, ensuring the
job query is scoped to the current organization and preventing tenant isolation
bypass.

In `@backend/leads/views.py`:
- Around line 47-53: The limit parameter assignment on line 48 lacks error
handling for invalid input and lower-bound validation. Wrap the int conversion
in a try-except block to catch ValueError exceptions when non-numeric input is
provided (like 'limit=abc'), and return a 400 error response with a descriptive
message instead of allowing the exception to propagate. Additionally, after
successfully converting the limit to an integer, add validation to ensure the
limit is greater than zero before the existing maximum limit check, returning a
400 error if the limit is zero or negative.
- Around line 58-60: In the LeadScrapeJob.objects.filter() query on line 58, the
status filter currently only checks for 'RUNNING' jobs, which allows multiple
'PENDING' jobs to be created concurrently before any Celery task execution.
Modify the filter condition to check for both 'RUNNING' and 'PENDING' statuses
(you can use the __in lookup with a list of both status values) to ensure only
one job in either state can be active per organization at a time.
- Around line 77-79: The scrape_leads_task.delay() call is not handling
potential Celery exceptions, causing the job to remain in PENDING status if the
task enqueue fails while still returning HTTP 201 Created. Wrap the
scrape_leads_task.delay(job.id, query, limit, org.id) call in a try-except block
to catch Celery exceptions, and in the except block update the job object's
status to FAILED with an appropriate error message, then return an error
response (such as HTTP 500) instead of the success response.

In `@requirements.txt`:
- Around line 17-18: Update the pinned versions of google-genai and playwright
in requirements.txt. Replace the non-existent google-genai==0.1.1 with
google-genai==2.9.0 or another compatible version that exists on PyPI.
Similarly, replace the outdated playwright==1.49.0 with playwright==1.60.0 or a
more recent stable version to ensure the packages can be installed and are
current with security and feature updates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 53c385cb-5883-4b03-945f-5a7f61d157d9

📥 Commits

Reviewing files that changed from the base of the PR and between 4a33158 and 6823ae6.

📒 Files selected for processing (5)
  • backend/leads/models.py
  • backend/leads/serializers.py
  • backend/leads/tasks.py
  • backend/leads/views.py
  • requirements.txt

Comment thread backend/leads/models.py
)
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
query = models.CharField(max_length=255)
limit = models.IntegerField(default=50)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the models.py file to understand the current state
cat -n backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 2841


🏁 Script executed:

# Check for any existing validators on the limit field
rg "limit.*models\." backend/leads/models.py -A 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 339


🏁 Script executed:

# Search for the 200-row contract mentioned in the comment
rg "200" backend/ --type py -B 2 -A 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 7880


🏁 Script executed:

# Check if there are validators imported anywhere in the models
rg "validators" backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Look for view code that might validate the limit parameter
rg "limit" backend/leads/views.py -B 2 -A 2 | head -50

Repository: Kuldeeep18/LeadOrbit

Length of output: 874


🏁 Script executed:

# Check for other places where LeadScrapeJob is created or modified
rg "LeadScrapeJob" backend/ --type py

Repository: Kuldeeep18/LeadOrbit

Length of output: 1501


🏁 Script executed:

# Check if there's a Django admin configuration for LeadScrapeJob
rg "LeadScrapeJob" backend/leads/ --type py -B 2 -A 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 4589


🏁 Script executed:

# Check if there are any other views or functions that manipulate limit parameter
rg "\.limit\s*=" backend/ --type py -B 2 -A 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


Add field-level validators to enforce limit bounds in the model.

The limit field accepts any integer value, allowing non-endpoint code paths (e.g., serializer writes, Django admin, direct ORM operations) to persist values exceeding the 200-row contract. Validation should be enforced at the model layer, not just in the view.

Suggested fix
+from django.core.validators import MaxValueValidator, MinValueValidator
...
-    limit = models.IntegerField(default=50)
+    limit = models.IntegerField(
+        default=50,
+        validators=[MinValueValidator(1), MaxValueValidator(200)],
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
limit = models.IntegerField(default=50)
from django.core.validators import MaxValueValidator, MinValueValidator
limit = models.IntegerField(
default=50,
validators=[MinValueValidator(1), MaxValueValidator(200)],
)
🧰 Tools
🪛 ast-grep (0.44.0)

[info] 50-50: use help_text to document model columns
Context: models.CharField(max_length=20, choices=STATUS_CHOICES, default='PENDING')
Note: [CWE-710] Improper Adherence to Coding Standards.

(model-help-text)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/models.py` at line 50, The `limit` field in the Django model
(located in the IntegerField definition at line 50) needs field-level validators
to enforce bounds and prevent invalid values from being saved through non-view
code paths. Add the `validators` parameter to the IntegerField definition for
the `limit` field, using Django's MinValueValidator and MaxValueValidator to
enforce the minimum and maximum allowed values for the 200-row contract. This
ensures validation is enforced regardless of which code path (view, serializer,
Django admin, or direct ORM) attempts to persist the value.

Comment thread backend/leads/tasks.py
Comment on lines +103 to +104
org = Organization.objects.get(id=organization_id)
job = LeadScrapeJob.objects.get(id=job_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the task file at the specified lines
cat -n backend/leads/tasks.py | sed -n '95,115p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 962


🏁 Script executed:

# Find the LeadScrapeJob model definition
find . -type f -name "*.py" | xargs grep -l "class LeadScrapeJob"

Repository: Kuldeeep18/LeadOrbit

Length of output: 215


🏁 Script executed:

# Check if LeadScrapeJob has an organization field
ast-grep outline backend/leads/ --match LeadScrapeJob

Repository: Kuldeeep18/LeadOrbit

Length of output: 264


🏁 Script executed:

# Read the LeadScrapeJob model definition
cat -n backend/leads/models.py | head -200

Repository: Kuldeeep18/LeadOrbit

Length of output: 2841


🏁 Script executed:

# Search for where scrape_leads_task is called/invoked
rg "scrape_leads_task" --type py -B 3 -A 3

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Check the TenantModel definition to confirm it has organization field
cat -n backend/tenants/models.py | head -100

Repository: Kuldeeep18/LeadOrbit

Length of output: 2079


🏁 Script executed:

# Search more broadly for where this task is called (might not use exact function name)
rg "scrape_leads_task|apply_async|delay" backend/leads/ --type py -B 2 -A 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 1194


🏁 Script executed:

# Check if LeadScrapeJob has a custom manager that might filter by organization
ast-grep outline backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 702


🏁 Script executed:

# View the complete task function to check for other potential issues
cat -n backend/leads/tasks.py | sed -n '101,180p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 4737


🏁 Script executed:

# Check if there are other database queries in the task that should filter by organization
cat -n backend/leads/tasks.py | sed -n '101,250p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 5095


Add organization filter to LeadScrapeJob lookup in Celery task to prevent tenant isolation bypass.

Line 104 loads the job by id only. Celery workers run without tenant context, so TenantManager returns all records instead of filtering by organization. If task arguments mismatch, the query could match a job from a different organization and update it while creating leads in another org's context.

Suggested fix
    org = Organization.objects.get(id=organization_id)
-   job = LeadScrapeJob.objects.get(id=job_id)
+   job = LeadScrapeJob.objects.get(id=job_id, organization=org)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
org = Organization.objects.get(id=organization_id)
job = LeadScrapeJob.objects.get(id=job_id)
org = Organization.objects.get(id=organization_id)
job = LeadScrapeJob.objects.get(id=job_id, organization=org)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/tasks.py` around lines 103 - 104, The LeadScrapeJob lookup in
the Celery task uses only the job ID without filtering by organization, which
can allow access to jobs from different organizations due to missing tenant
context in Celery workers. Modify the LeadScrapeJob.objects.get() call on line
104 to include an additional filter for organization_id matching the
organization loaded on line 103, ensuring the job query is scoped to the current
organization and preventing tenant isolation bypass.

Comment thread backend/leads/tasks.py
Comment on lines +167 to +178
for item in extracted_raw_json[:limit]:
if not Lead.objects.filter(organization=org, email=item['email']).exists():
Lead.objects.create(
organization=org,
email=item['email'],
first_name=item['first_name'],
last_name=item['last_name'],
company=item['company'],
phone=item['phone'],
linkedin_url=item['linkedin_url']
)
inserted_count += 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "models.py" -o -name "tasks.py" | grep -E "leads|backend" | head -20

Repository: Kuldeeep18/LeadOrbit

Length of output: 322


🏁 Script executed:

cat -n backend/leads/tasks.py | sed -n '160,185p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 1241


🏁 Script executed:

find . -type f -path "*/leads/models.py" -o -path "*/leads/model.py"

Repository: Kuldeeep18/LeadOrbit

Length of output: 184


🏁 Script executed:

cat -n backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 2841


🏁 Script executed:

rg -A 10 "class Lead" backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 1611


🏁 Script executed:

head -30 backend/leads/tasks.py | cat -n

Repository: Kuldeeep18/LeadOrbit

Length of output: 1034


🏁 Script executed:

rg "IntegrityError" backend/leads/tasks.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

rg "get_or_create" backend/leads/tasks.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

cat -n backend/leads/tasks.py | sed -n '1,180p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 8694


🏁 Script executed:

rg -B 5 "def.*task" backend/leads/tasks.py | head -40

Repository: Kuldeeep18/LeadOrbit

Length of output: 397


Use get_or_create() to eliminate the race condition on concurrent lead insertion.

The current check-then-create pattern (lines 168-178) is vulnerable to IntegrityError when concurrent tasks attempt to insert leads with the same (organization, email) pair. The unique_together constraint on the Lead model (line 18 of models.py) will be violated, causing the entire job to fail.

Suggested fix
+from django.db import IntegrityError
...
-        for item in extracted_raw_json[:limit]:
-            if not Lead.objects.filter(organization=org, email=item['email']).exists():
-                Lead.objects.create(
-                    organization=org,
-                    email=item['email'],
-                    first_name=item['first_name'],
-                    last_name=item['last_name'],
-                    company=item['company'],
-                    phone=item['phone'],
-                    linkedin_url=item['linkedin_url']
-                )
-                inserted_count += 1
+        for item in extracted_raw_json[:limit]:
+            try:
+                _, created = Lead.objects.get_or_create(
+                    organization=org,
+                    email=item['email'],
+                    defaults={
+                        'first_name': item['first_name'],
+                        'last_name': item['last_name'],
+                        'company': item['company'],
+                        'phone': item['phone'],
+                        'linkedin_url': item['linkedin_url'],
+                    },
+                )
+            except IntegrityError:
+                continue
+            if created:
+                inserted_count += 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/tasks.py` around lines 167 - 178, The current Lead creation
logic uses a check-then-create pattern that is vulnerable to race conditions
when concurrent tasks attempt to insert leads with the same organization and
email combination. Replace the if-not-exists check and Lead.objects.create()
call with Django's get_or_create() method, using organization and email as the
lookup parameters and passing first_name, last_name, company, phone, and
linkedin_url as defaults. This ensures atomic database-level handling of
concurrent inserts. Check the return tuple from get_or_create() to determine if
a new lead was created and increment inserted_count only when created is True.

Comment thread backend/leads/tasks.py
Comment on lines +186 to +189
except Exception as e:
job.status = 'FAILED'
job.error_message = str(e)
job.completed_at = timezone.now()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, verify the file exists and examine the context around lines 186-189
cat -n backend/leads/tasks.py | sed -n '180,195p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 581


🏁 Script executed:

# Get an outline of the file to understand the function structure
ast-grep outline backend/leads/tasks.py --view expanded

Repository: Kuldeeep18/LeadOrbit

Length of output: 449


🏁 Script executed:

# Search for other exception handling patterns in the same file
rg "except.*Exception" backend/leads/tasks.py -A 5

Repository: Kuldeeep18/LeadOrbit

Length of output: 311


🏁 Script executed:

# Check if job.error_message is exposed through API endpoints
rg "job\.error_message\|error_message" --type py -A 2 -B 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Verify if logger is imported and used in the file
rg "logger\|import logging" backend/leads/tasks.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Search for where job.error_message is retrieved or returned in API responses
rg "error_message" --type py -B 3 -A 3

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Check the Job model definition to understand the field
fd -t f "models.py" backend/leads/ -exec cat -n {} \;

Repository: Kuldeeep18/LeadOrbit

Length of output: 389


🏁 Script executed:

# Search for API serializers or views that expose the job
rg "class.*Serializer\|class.*ViewSet\|class.*APIView" backend/ -A 5 | head -50

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Look for API responses that include job information
rg "job\|Job" backend/leads/views.py -B 2 -A 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Find the Job model
find . -name "*.py" -path "*/leads/*" | xargs grep -l "class Job"

Repository: Kuldeeep18/LeadOrbit

Length of output: 162


🏁 Script executed:

# Search for views.py in leads app
find . -path "*/leads/views.py"

Repository: Kuldeeep18/LeadOrbit

Length of output: 183


🏁 Script executed:

# Search for serializers
find . -path "*/leads/*" -name "*serializer*.py"

Repository: Kuldeeep18/LeadOrbit

Length of output: 189


🏁 Script executed:

# Check the structure of the leads app
find backend/leads -type f -name "*.py" | sort

Repository: Kuldeeep18/LeadOrbit

Length of output: 406


🏁 Script executed:

# Check the Job model definition
cat backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 2435


🏁 Script executed:

# Check the serializer to see if error_message is exposed
cat backend/leads/serializers.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 1252


🏁 Script executed:

# Check the views to see how job data is returned
cat backend/leads/views.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 4975


🏁 Script executed:

# Check if there's any exception logging or sanitization elsewhere in the tasks.py file
cat backend/leads/tasks.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 7722


Do not expose raw exception text to API consumers.

Lines 186–189 expose str(e) directly through the LeadScrapeJobSerializer, which is returned in the scrape_status and scrape_history API endpoints. Raw exception messages can leak internal details (stack traces, paths, dependencies, configuration).

Suggested fix
-    except Exception as e:
+    except Exception:
+        logger.exception("scrape_leads_task failed", extra={"job_id": str(job_id), "organization_id": str(organization_id)})
         job.status = 'FAILED'
-        job.error_message = str(e)
+        job.error_message = "Lead scraping failed. Please retry."
         job.completed_at = timezone.now()
         job.save()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception as e:
job.status = 'FAILED'
job.error_message = str(e)
job.completed_at = timezone.now()
except Exception:
logger.exception("scrape_leads_task failed", extra={"job_id": str(job_id), "organization_id": str(organization_id)})
job.status = 'FAILED'
job.error_message = "Lead scraping failed. Please retry."
job.completed_at = timezone.now()
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 186-186: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/tasks.py` around lines 186 - 189, The exception handler in the
code block is storing the raw exception text via str(e) directly into
job.error_message, which gets exposed through the API endpoints (scrape_status
and scrape_history) via LeadScrapeJobSerializer. Instead of assigning str(e) to
job.error_message, log the full exception internally for debugging purposes, and
store a generic, user-friendly error message in job.error_message (such as "An
error occurred during lead scraping") to avoid leaking internal details like
stack traces, file paths, or configuration information to API consumers.

Comment thread backend/leads/views.py
Comment on lines +47 to +53
query = request.data.get('query', '').strip()
limit = int(request.data.get('limit', 50))

if not query:
return Response({"error": "A search query is required."}, status=status.HTTP_400_BAD_REQUEST)
if limit > 200:
limit = 200 # Enforce security constraint max limit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate and inspect the file
cat -n backend/leads/views.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 5731


Handle invalid limit input with proper error handling before int conversion.

Line 48 will raise a runtime exception on non-numeric input (e.g., limit=abc), returning 500 instead of 400. Additionally, there is no lower-bound validation; limit can be set to zero or negative.

Suggested fix
-        query = request.data.get('query', '').strip()
-        limit = int(request.data.get('limit', 50))
+        query = request.data.get('query', '').strip()
+        raw_limit = request.data.get('limit', 50)
+        try:
+            limit = int(raw_limit)
+        except (TypeError, ValueError):
+            return Response({"error": "Limit must be an integer."}, status=status.HTTP_400_BAD_REQUEST)
         
         if not query:
             return Response({"error": "A search query is required."}, status=status.HTTP_400_BAD_REQUEST)
+        if len(query) > 255:
+            return Response({"error": "Query must be 255 characters or fewer."}, status=status.HTTP_400_BAD_REQUEST)
+        if limit < 1:
+            return Response({"error": "Limit must be at least 1."}, status=status.HTTP_400_BAD_REQUEST)
         if limit > 200:
             limit = 200 # Enforce security constraint max limit
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/views.py` around lines 47 - 53, The limit parameter assignment
on line 48 lacks error handling for invalid input and lower-bound validation.
Wrap the int conversion in a try-except block to catch ValueError exceptions
when non-numeric input is provided (like 'limit=abc'), and return a 400 error
response with a descriptive message instead of allowing the exception to
propagate. Additionally, after successfully converting the limit to an integer,
add validation to ensure the limit is greater than zero before the existing
maximum limit check, returning a 400 error if the limit is zero or negative.

Comment thread backend/leads/views.py
Comment on lines +58 to +60
active_job = LeadScrapeJob.objects.filter(organization=org, status='RUNNING').exists()
if active_job:
return Response({"error": "Your organization already has an active lead scraping job running."}, status=status.HTTP_400_BAD_REQUEST)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find backend -type f -name "*.py" | grep -E "(views|models)" | head -20

Repository: Kuldeeep18/LeadOrbit

Length of output: 397


🏁 Script executed:

cat -n backend/leads/views.py | sed -n '50,70p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 1473


🏁 Script executed:

cat -n backend/leads/models.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 2841


🏁 Script executed:

cat -n backend/leads/views.py | sed -n '68,95p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 1531


🏁 Script executed:

find backend/leads -name "tasks.py" -o -name "celery.py"

Repository: Kuldeeep18/LeadOrbit

Length of output: 181


🏁 Script executed:

cat -n backend/leads/tasks.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 9052


Treat PENDING jobs as active to enforce single concurrent job correctly.

Line 58 only blocks RUNNING jobs, allowing multiple PENDING jobs to be created concurrently for the same organization before any Celery task starts execution.

Suggested fix
-        active_job = LeadScrapeJob.objects.filter(organization=org, status='RUNNING').exists()
+        active_job = LeadScrapeJob.objects.filter(
+            organization=org,
+            status__in=['PENDING', 'RUNNING'],
+        ).exists()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
active_job = LeadScrapeJob.objects.filter(organization=org, status='RUNNING').exists()
if active_job:
return Response({"error": "Your organization already has an active lead scraping job running."}, status=status.HTTP_400_BAD_REQUEST)
active_job = LeadScrapeJob.objects.filter(
organization=org,
status__in=['PENDING', 'RUNNING'],
).exists()
if active_job:
return Response({"error": "Your organization already has an active lead scraping job running."}, status=status.HTTP_400_BAD_REQUEST)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/views.py` around lines 58 - 60, In the
LeadScrapeJob.objects.filter() query on line 58, the status filter currently
only checks for 'RUNNING' jobs, which allows multiple 'PENDING' jobs to be
created concurrently before any Celery task execution. Modify the filter
condition to check for both 'RUNNING' and 'PENDING' statuses (you can use the
__in lookup with a list of both status values) to ensure only one job in either
state can be active per organization at a time.

Comment thread backend/leads/views.py
Comment on lines +77 to +79
from .tasks import scrape_leads_task
scrape_leads_task.delay(job.id, query, limit, org.id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the backend/leads/views.py file and examine the area around lines 77-79
find . -name "views.py" -path "*/leads/*" | head -10

Repository: Kuldeeep18/LeadOrbit

Length of output: 183


🏁 Script executed:

# List the repository structure to understand the layout
git ls-files | grep -E "(leads|tasks)" | head -20

Repository: Kuldeeep18/LeadOrbit

Length of output: 958


🏁 Script executed:

# Read the views.py file around lines 77-79 to see the actual code
cat -n backend/leads/views.py | sed -n '60,100p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 2506


🏁 Script executed:

# Also get an outline of the views.py file to understand the context
ast-grep outline backend/leads/views.py

Repository: Kuldeeep18/LeadOrbit

Length of output: 507


🏁 Script executed:

# Check the tasks.py file to understand the scrape_leads_task
cat -n backend/leads/tasks.py | head -50

Repository: Kuldeeep18/LeadOrbit

Length of output: 1756


🏁 Script executed:

# Check the LeadScrapeJob model to understand the status field
cat -n backend/leads/models.py | grep -A 20 "class LeadScrapeJob"

Repository: Kuldeeep18/LeadOrbit

Length of output: 1054


🏁 Script executed:

# Search for existing Celery error handling patterns in the codebase
rg "CeleryError|except.*delay|celery.exceptions" --type py

Repository: Kuldeeep18/LeadOrbit

Length of output: 158


🏁 Script executed:

# Check the scrape_leads_task implementation in tasks.py
cat -n backend/leads/tasks.py | grep -A 30 "scrape_leads_task"

Repository: Kuldeeep18/LeadOrbit

Length of output: 1915


🏁 Script executed:

# Look for all imports in views.py to understand current imports and patterns
cat -n backend/leads/views.py | sed -n '1,35p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 1768


🏁 Script executed:

# Search for other places where .delay() is called to see if there's any error handling
rg "\.delay\(" backend --type py -A 2 -B 2

Repository: Kuldeeep18/LeadOrbit

Length of output: 1603


🏁 Script executed:

# Check what exceptions can be raised by Celery's delay() method
python3 << 'PY'
import inspect

# Check what exceptions Celery can raise
print("Common Celery exceptions that delay() might raise:")
print("1. kombu.exceptions.OperationalError - when broker connection fails")
print("2. celery.exceptions.CeleryError - base exception class")
print("3. celery.exceptions.SecurityError - security-related errors")
print("4. Exception - general exceptions")
print()
print("The .delay() method is part of the Task API and can raise:")
print("- celery.exceptions.CeleryError and its subclasses")
print("- kombu exceptions (connection-related)")
print("- General exceptions from the broker")
PY

Repository: Kuldeeep18/LeadOrbit

Length of output: 616


🏁 Script executed:

# Check if there's any Celery configuration or broker setup in the project
rg "CELERY|BROKER" backend --type py | head -20

Repository: Kuldeeep18/LeadOrbit

Length of output: 1358


🏁 Script executed:

# Look at the entire scrape method to get full context
ast-grep run --pattern 'def scrape($_) { $$$ }' backend/leads/views.py --lang python

Repository: Kuldeeep18/LeadOrbit

Length of output: 472


🏁 Script executed:

# Read the full scrape method from views.py
cat -n backend/leads/views.py | sed -n '45,84p'

Repository: Kuldeeep18/LeadOrbit

Length of output: 2256


🏁 Script executed:

# Check how other .delay() calls are handled in the codebase
rg "\.delay\(" backend --type py -B 5 -A 5 | head -50

Repository: Kuldeeep18/LeadOrbit

Length of output: 3214


Handle enqueue failures for Celery task and update job status consistently.

If scrape_leads_task.delay() fails, the endpoint returns HTTP 201 Created but the job remains in PENDING status indefinitely and never executes. Wrap the task enqueue call in a try-except block to catch Celery exceptions, update the job to FAILED status with an error message, and return an appropriate error response.

Suggested fix
+from celery.exceptions import CeleryError
...
-        scrape_leads_task.delay(job.id, query, limit, org.id)
+        try:
+            scrape_leads_task.delay(job.id, query, limit, org.id)
+        except CeleryError:
+            job.status = 'FAILED'
+            job.error_message = 'Failed to enqueue scrape job.'
+            job.completed_at = timezone.now()
+            job.save(update_fields=['status', 'error_message', 'completed_at'])
+            return Response(
+                {"error": "Unable to start background job. Please retry."},
+                status=status.HTTP_503_SERVICE_UNAVAILABLE,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from .tasks import scrape_leads_task
scrape_leads_task.delay(job.id, query, limit, org.id)
from .tasks import scrape_leads_task
from celery.exceptions import CeleryError
try:
scrape_leads_task.delay(job.id, query, limit, org.id)
except CeleryError:
job.status = 'FAILED'
job.error_message = 'Failed to enqueue scrape job.'
job.completed_at = timezone.now()
job.save(update_fields=['status', 'error_message', 'completed_at'])
return Response(
{"error": "Unable to start background job. Please retry."},
status=status.HTTP_503_SERVICE_UNAVAILABLE,
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/leads/views.py` around lines 77 - 79, The scrape_leads_task.delay()
call is not handling potential Celery exceptions, causing the job to remain in
PENDING status if the task enqueue fails while still returning HTTP 201 Created.
Wrap the scrape_leads_task.delay(job.id, query, limit, org.id) call in a
try-except block to catch Celery exceptions, and in the except block update the
job object's status to FAILED with an appropriate error message, then return an
error response (such as HTTP 500) instead of the success response.

Comment thread requirements.txt
Comment on lines +17 to +18
playwright==1.49.0
google-genai==0.1.1 No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify playwright and google-genai versions exist on PyPI and check for known issues.

# Verify playwright version exists and check latest version
echo "=== Checking playwright versions ==="
curl -s https://pypi.org/pypi/playwright/1.49.0/json | jq '.info | {version, yanked}' || echo "playwright==1.49.0 not found"
echo "Latest playwright version:"
curl -s https://pypi.org/pypi/playwright/json | jq '.info.version'

# Verify google-genai version exists and check latest version
echo ""
echo "=== Checking google-genai versions ==="
curl -s https://pypi.org/pypi/google-genai/0.1.1/json | jq '.info | {version, yanked}' || echo "google-genai==0.1.1 not found"
echo "Latest google-genai version:"
curl -s https://pypi.org/pypi/google-genai/json | jq '.info.version'

Repository: Kuldeeep18/LeadOrbit

Length of output: 393


Fix non-existent google-genai version and update outdated playwright.

The pinned versions have critical and minor issues:

  • google-genai==0.1.1 does not exist on PyPI. Latest available version is 2.9.0. Update to google-genai==2.9.0 or a compatible version.
  • playwright==1.49.0 exists but is significantly outdated. Latest version is 1.60.0. Update to a recent stable version.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.txt` around lines 17 - 18, Update the pinned versions of
google-genai and playwright in requirements.txt. Replace the non-existent
google-genai==0.1.1 with google-genai==2.9.0 or another compatible version that
exists on PyPI. Similarly, replace the outdated playwright==1.49.0 with
playwright==1.60.0 or a more recent stable version to ensure the packages can be
installed and are current with security and feature updates.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@frontend/leads.html`:
- Around line 193-199: The scrape request payload (around line 388) currently
only sends query and limit parameters and ignores the user's directory source
selections from the checkboxes with IDs srcGmaps and srcYp. Modify the scrape
request to collect the checked state of all directory source checkboxes
(srcGmaps, srcYp, and any other source checkboxes present in the form) and
include them in the payload sent to the backend, so that user selections are
actually used when making the scrape request.
- Around line 310-317: Replace the innerHTML assignment that embeds job.query
directly in the template string with a safer approach using createElement and
textContent. Create a new table row element using createElement, then create
individual table cells using createElement for each column (query, status,
leads_found, date). For the query cell specifically, use textContent to set the
job.query value instead of interpolating it into HTML. Apply the same pattern to
the error_message injection at line 429 area, creating DOM nodes and using
textContent instead of embedding the error_message string directly in the
innerHTML template.
- Around line 409-437: The polling request in the fetchWithAuth call does not
check if the HTTP response is successful before processing the JSON data, which
can leave the form in a disabled state if the server returns an error status
code. After awaiting the response from fetchWithAuth, add a check for
response.ok and if it is false, treat it as a terminal error by clearing the
pollInterval, updating the statusText to display an error message, setting the
progress bar width to 100%, adding the bg-danger class to the bar, and
re-enabling the submit button (similar to the FAILED status handling) before
returning or continuing to the next iteration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: adf1f066-bda5-4972-a215-05734faa709f

📥 Commits

Reviewing files that changed from the base of the PR and between 6823ae6 and 53efa38.

📒 Files selected for processing (1)
  • frontend/leads.html

Comment thread frontend/leads.html
Comment on lines +193 to +199
<input class="form-check-input" type="checkbox" checked id="srcGmaps">
<label class="form-check-label text-white-50 small">Google Maps</label>
</div>
<div class="form-check">
<input class="form-check-input" type="checkbox" checked id="srcYp">
<label class="form-check-label text-white-50 small">YellowPages</label>
</div>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Wire selected directory sources into the scrape request payload.

The UI collects source choices, but Line 388 sends only query and limit, so user selections are ignored.

💡 Proposed fix
             document.getElementById('aiScrapeForm').addEventListener('submit', async (e) => {
                 e.preventDefault();
                 const query = document.getElementById('scrapeQuery').value;
-                const limit = document.getElementById('scrapeLimit').value;
+                const limit = Number(document.getElementById('scrapeLimit').value);
+                const sources = [];
+                if (document.getElementById('srcGmaps').checked) sources.push('google_maps');
+                if (document.getElementById('srcYp').checked) sources.push('yellowpages');
+                if (sources.length === 0) throw new Error('Select at least one directory source.');
                 const btn = e.target.querySelector('button[type="submit"]');
@@
                     const response = await fetchWithAuth('/leads/scrape/', {
                         method: 'POST',
-                        body: JSON.stringify({ query, limit })
+                        body: JSON.stringify({ query, limit, sources })
                     });

Also applies to: 386-389

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/leads.html` around lines 193 - 199, The scrape request payload
(around line 388) currently only sends query and limit parameters and ignores
the user's directory source selections from the checkboxes with IDs srcGmaps and
srcYp. Modify the scrape request to collect the checked state of all directory
source checkboxes (srcGmaps, srcYp, and any other source checkboxes present in
the form) and include them in the payload sent to the backend, so that user
selections are actually used when making the scrape request.

Comment thread frontend/leads.html
Comment on lines +310 to +317
tbody.innerHTML += `
<tr class="border-bottom border-secondary border-opacity-25">
<td class="fw-semibold text-info text-start ps-2">${job.query}</td>
<td><span class="badge ${badgeClass}">${job.status}</span></td>
<td class="fw-bold text-white">${job.leads_found}</td>
<td class="text-white-50">${date}</td>
</tr>
`;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid injecting job.query and error_message via innerHTML.

Line 312 and Line 429 insert dynamic strings into HTML directly. Render those values with textContent/DOM nodes to prevent XSS.

💡 Proposed fix
-                data.forEach(job => {
-                    const badgeClass = job.status === 'COMPLETED' ? 'bg-success' : (job.status === 'FAILED' ? 'bg-danger' : 'bg-warning text-dark');
-                    const date = new Date(job.created_at || new Date()).toLocaleDateString();
-                    tbody.innerHTML += `
-                        <tr class="border-bottom border-secondary border-opacity-25">
-                            <td class="fw-semibold text-info text-start ps-2">${job.query}</td>
-                            <td><span class="badge ${badgeClass}">${job.status}</span></td>
-                            <td class="fw-bold text-white">${job.leads_found}</td>
-                            <td class="text-white-50">${date}</td>
-                        </tr>
-                    `;
-                });
+                data.forEach(job => {
+                    const badgeClass = job.status === 'COMPLETED' ? 'bg-success' : (job.status === 'FAILED' ? 'bg-danger' : 'bg-warning text-dark');
+                    const date = new Date(job.created_at || new Date()).toLocaleDateString();
+                    const tr = document.createElement('tr');
+                    tr.className = 'border-bottom border-secondary border-opacity-25';
+                    tr.innerHTML = `
+                      <td class="fw-semibold text-info text-start ps-2"></td>
+                      <td><span class="badge ${badgeClass}"></span></td>
+                      <td class="fw-bold text-white"></td>
+                      <td class="text-white-50"></td>
+                    `;
+                    tr.children[0].textContent = job.query || '';
+                    tr.children[1].firstElementChild.textContent = job.status || '';
+                    tr.children[2].textContent = String(job.leads_found ?? 0);
+                    tr.children[3].textContent = date;
+                    tbody.appendChild(tr);
+                });
@@
-                            statusText.innerHTML = `<i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}`;
+                            statusText.innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>`;
+                            statusText.append(document.createTextNode(`Error: ${data.error_message || 'Timeout'}`));

Also applies to: 427-430

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/leads.html` around lines 310 - 317, Replace the innerHTML assignment
that embeds job.query directly in the template string with a safer approach
using createElement and textContent. Create a new table row element using
createElement, then create individual table cells using createElement for each
column (query, status, leads_found, date). For the query cell specifically, use
textContent to set the job.query value instead of interpolating it into HTML.
Apply the same pattern to the error_message injection at line 429 area, creating
DOM nodes and using textContent instead of embedding the error_message string
directly in the innerHTML template.

Source: Linters/SAST tools

Comment thread frontend/leads.html
Comment on lines +409 to +437
const response = await fetchWithAuth(`/leads/scrape/${jobId}/status/`);
const data = await response.json();

const statusText = document.getElementById('liveStatusText');
const countText = document.getElementById('liveCountText');
const bar = document.getElementById('liveProgressBar');

if (data.status === 'RUNNING') {
statusText.innerHTML = `<span class="spinner-border spinner-border-sm text-info me-2"></span>Agent actively compiling records...`;
countText.textContent = `${data.leads_found} leads parsed`;
bar.style.width = '65%';
} else if (data.status === 'COMPLETED') {
clearInterval(pollInterval);
statusText.innerHTML = `<i class="bi bi-check-circle-fill text-success me-2"></i>Sequence completed!`;
countText.textContent = `${data.leads_found} leads added`;
bar.style.width = '100%';
bar.classList.remove('progress-bar-striped');
setTimeout(() => { window.location.reload(); }, 1500);
} else if (data.status === 'FAILED') {
clearInterval(pollInterval);
statusText.innerHTML = `<i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}`;
bar.style.width = '100%';
bar.classList.add('bg-danger');
document.getElementById('aiScrapeForm').querySelector('button[type="submit"]').disabled = false;
}
} catch (err) {
console.error('Polling connection interrupted:', err);
}
}, 3000);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Handle non-OK polling responses as terminal errors and reset UI state.

Polling currently ignores HTTP errors and only logs exceptions, which can leave the form stuck in disabled state indefinitely.

💡 Proposed fix
                 pollInterval = setInterval(async () => {
                     try {
                         const response = await fetchWithAuth(`/leads/scrape/${jobId}/status/`);
-                        const data = await response.json();
+                        const data = await response.json();
+                        if (!response.ok) {
+                            throw new Error(data.error || 'Failed to fetch scrape status.');
+                        }
@@
                         } else if (data.status === 'FAILED') {
                             clearInterval(pollInterval);
                             statusText.innerHTML = `<i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}`;
                             bar.style.width = '100%';
                             bar.classList.add('bg-danger');
-                            document.getElementById('aiScrapeForm').querySelector('button[type="submit"]').disabled = false;
+                            const submitBtn = document.getElementById('aiScrapeForm').querySelector('button[type="submit"]');
+                            submitBtn.disabled = false;
+                            submitBtn.textContent = 'Deploy AI Browser Agent';
                         }
                     } catch (err) {
-                        console.error('Polling connection interrupted:', err);
+                        clearInterval(pollInterval);
+                        const statusText = document.getElementById('liveStatusText');
+                        statusText.textContent = `Error: ${err.message || 'Polling interrupted.'}`;
+                        const submitBtn = document.getElementById('aiScrapeForm').querySelector('button[type="submit"]');
+                        submitBtn.disabled = false;
+                        submitBtn.textContent = 'Deploy AI Browser Agent';
                     }
                 }, 3000);
🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 416-416: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusText.innerHTML = <span class="spinner-border spinner-border-sm text-info me-2"></span>Agent actively compiling records...
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').

(inner-outer-html)


[warning] 421-421: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusText.innerHTML = <i class="bi bi-check-circle-fill text-success me-2"></i>Sequence completed!
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').

(inner-outer-html)


[warning] 428-428: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusText.innerHTML = <i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').

(inner-outer-html)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/leads.html` around lines 409 - 437, The polling request in the
fetchWithAuth call does not check if the HTTP response is successful before
processing the JSON data, which can leave the form in a disabled state if the
server returns an error status code. After awaiting the response from
fetchWithAuth, add a check for response.ok and if it is false, treat it as a
terminal error by clearing the pollInterval, updating the statusText to display
an error message, setting the progress bar width to 100%, adding the bg-danger
class to the bar, and re-enabling the submit button (similar to the FAILED
status handling) before returning or continuing to the next iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Build AI-Driven Browser Agent for Automated Lead Generation

1 participant