feat(backend): add LeadScrapeJob database models and async scrape tasks#433
feat(backend): add LeadScrapeJob database models and async scrape tasks#433KhushiMulchandani wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughAdds a ChangesAI Lead Scraping Feature
Possibly related issues
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Comment |
|
@Kuldeeep18 Kindly review and merge. |
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/leads/models.py`:
- Line 50: The `limit` field in the Django model (located in the IntegerField
definition at line 50) needs field-level validators to enforce bounds and
prevent invalid values from being saved through non-view code paths. Add the
`validators` parameter to the IntegerField definition for the `limit` field,
using Django's MinValueValidator and MaxValueValidator to enforce the minimum
and maximum allowed values for the 200-row contract. This ensures validation is
enforced regardless of which code path (view, serializer, Django admin, or
direct ORM) attempts to persist the value.
In `@backend/leads/tasks.py`:
- Around line 167-178: The current Lead creation logic uses a check-then-create
pattern that is vulnerable to race conditions when concurrent tasks attempt to
insert leads with the same organization and email combination. Replace the
if-not-exists check and Lead.objects.create() call with Django's get_or_create()
method, using organization and email as the lookup parameters and passing
first_name, last_name, company, phone, and linkedin_url as defaults. This
ensures atomic database-level handling of concurrent inserts. Check the return
tuple from get_or_create() to determine if a new lead was created and increment
inserted_count only when created is True.
- Around line 186-189: The exception handler in the code block is storing the
raw exception text via str(e) directly into job.error_message, which gets
exposed through the API endpoints (scrape_status and scrape_history) via
LeadScrapeJobSerializer. Instead of assigning str(e) to job.error_message, log
the full exception internally for debugging purposes, and store a generic,
user-friendly error message in job.error_message (such as "An error occurred
during lead scraping") to avoid leaking internal details like stack traces, file
paths, or configuration information to API consumers.
- Around line 103-104: The LeadScrapeJob lookup in the Celery task uses only the
job ID without filtering by organization, which can allow access to jobs from
different organizations due to missing tenant context in Celery workers. Modify
the LeadScrapeJob.objects.get() call on line 104 to include an additional filter
for organization_id matching the organization loaded on line 103, ensuring the
job query is scoped to the current organization and preventing tenant isolation
bypass.
In `@backend/leads/views.py`:
- Around line 47-53: The limit parameter assignment on line 48 lacks error
handling for invalid input and lower-bound validation. Wrap the int conversion
in a try-except block to catch ValueError exceptions when non-numeric input is
provided (like 'limit=abc'), and return a 400 error response with a descriptive
message instead of allowing the exception to propagate. Additionally, after
successfully converting the limit to an integer, add validation to ensure the
limit is greater than zero before the existing maximum limit check, returning a
400 error if the limit is zero or negative.
- Around line 58-60: In the LeadScrapeJob.objects.filter() query on line 58, the
status filter currently only checks for 'RUNNING' jobs, which allows multiple
'PENDING' jobs to be created concurrently before any Celery task execution.
Modify the filter condition to check for both 'RUNNING' and 'PENDING' statuses
(you can use the __in lookup with a list of both status values) to ensure only
one job in either state can be active per organization at a time.
- Around line 77-79: The scrape_leads_task.delay() call is not handling
potential Celery exceptions, causing the job to remain in PENDING status if the
task enqueue fails while still returning HTTP 201 Created. Wrap the
scrape_leads_task.delay(job.id, query, limit, org.id) call in a try-except block
to catch Celery exceptions, and in the except block update the job object's
status to FAILED with an appropriate error message, then return an error
response (such as HTTP 500) instead of the success response.
In `@requirements.txt`:
- Around line 17-18: Update the pinned versions of google-genai and playwright
in requirements.txt. Replace the non-existent google-genai==0.1.1 with
google-genai==2.9.0 or another compatible version that exists on PyPI.
Similarly, replace the outdated playwright==1.49.0 with playwright==1.60.0 or a
more recent stable version to ensure the packages can be installed and are
current with security and feature updates.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 53c385cb-5883-4b03-945f-5a7f61d157d9
📒 Files selected for processing (5)
backend/leads/models.pybackend/leads/serializers.pybackend/leads/tasks.pybackend/leads/views.pyrequirements.txt
| ) | ||
| id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) | ||
| query = models.CharField(max_length=255) | ||
| limit = models.IntegerField(default=50) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the models.py file to understand the current state
cat -n backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 2841
🏁 Script executed:
# Check for any existing validators on the limit field
rg "limit.*models\." backend/leads/models.py -A 2Repository: Kuldeeep18/LeadOrbit
Length of output: 339
🏁 Script executed:
# Search for the 200-row contract mentioned in the comment
rg "200" backend/ --type py -B 2 -A 2Repository: Kuldeeep18/LeadOrbit
Length of output: 7880
🏁 Script executed:
# Check if there are validators imported anywhere in the models
rg "validators" backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Look for view code that might validate the limit parameter
rg "limit" backend/leads/views.py -B 2 -A 2 | head -50Repository: Kuldeeep18/LeadOrbit
Length of output: 874
🏁 Script executed:
# Check for other places where LeadScrapeJob is created or modified
rg "LeadScrapeJob" backend/ --type pyRepository: Kuldeeep18/LeadOrbit
Length of output: 1501
🏁 Script executed:
# Check if there's a Django admin configuration for LeadScrapeJob
rg "LeadScrapeJob" backend/leads/ --type py -B 2 -A 2Repository: Kuldeeep18/LeadOrbit
Length of output: 4589
🏁 Script executed:
# Check if there are any other views or functions that manipulate limit parameter
rg "\.limit\s*=" backend/ --type py -B 2 -A 2Repository: Kuldeeep18/LeadOrbit
Length of output: 158
Add field-level validators to enforce limit bounds in the model.
The limit field accepts any integer value, allowing non-endpoint code paths (e.g., serializer writes, Django admin, direct ORM operations) to persist values exceeding the 200-row contract. Validation should be enforced at the model layer, not just in the view.
Suggested fix
+from django.core.validators import MaxValueValidator, MinValueValidator
...
- limit = models.IntegerField(default=50)
+ limit = models.IntegerField(
+ default=50,
+ validators=[MinValueValidator(1), MaxValueValidator(200)],
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| limit = models.IntegerField(default=50) | |
| from django.core.validators import MaxValueValidator, MinValueValidator | |
| limit = models.IntegerField( | |
| default=50, | |
| validators=[MinValueValidator(1), MaxValueValidator(200)], | |
| ) |
🧰 Tools
🪛 ast-grep (0.44.0)
[info] 50-50: use help_text to document model columns
Context: models.CharField(max_length=20, choices=STATUS_CHOICES, default='PENDING')
Note: [CWE-710] Improper Adherence to Coding Standards.
(model-help-text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/models.py` at line 50, The `limit` field in the Django model
(located in the IntegerField definition at line 50) needs field-level validators
to enforce bounds and prevent invalid values from being saved through non-view
code paths. Add the `validators` parameter to the IntegerField definition for
the `limit` field, using Django's MinValueValidator and MaxValueValidator to
enforce the minimum and maximum allowed values for the 200-row contract. This
ensures validation is enforced regardless of which code path (view, serializer,
Django admin, or direct ORM) attempts to persist the value.
| org = Organization.objects.get(id=organization_id) | ||
| job = LeadScrapeJob.objects.get(id=job_id) |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the task file at the specified lines
cat -n backend/leads/tasks.py | sed -n '95,115p'Repository: Kuldeeep18/LeadOrbit
Length of output: 962
🏁 Script executed:
# Find the LeadScrapeJob model definition
find . -type f -name "*.py" | xargs grep -l "class LeadScrapeJob"Repository: Kuldeeep18/LeadOrbit
Length of output: 215
🏁 Script executed:
# Check if LeadScrapeJob has an organization field
ast-grep outline backend/leads/ --match LeadScrapeJobRepository: Kuldeeep18/LeadOrbit
Length of output: 264
🏁 Script executed:
# Read the LeadScrapeJob model definition
cat -n backend/leads/models.py | head -200Repository: Kuldeeep18/LeadOrbit
Length of output: 2841
🏁 Script executed:
# Search for where scrape_leads_task is called/invoked
rg "scrape_leads_task" --type py -B 3 -A 3Repository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Check the TenantModel definition to confirm it has organization field
cat -n backend/tenants/models.py | head -100Repository: Kuldeeep18/LeadOrbit
Length of output: 2079
🏁 Script executed:
# Search more broadly for where this task is called (might not use exact function name)
rg "scrape_leads_task|apply_async|delay" backend/leads/ --type py -B 2 -A 2Repository: Kuldeeep18/LeadOrbit
Length of output: 1194
🏁 Script executed:
# Check if LeadScrapeJob has a custom manager that might filter by organization
ast-grep outline backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 702
🏁 Script executed:
# View the complete task function to check for other potential issues
cat -n backend/leads/tasks.py | sed -n '101,180p'Repository: Kuldeeep18/LeadOrbit
Length of output: 4737
🏁 Script executed:
# Check if there are other database queries in the task that should filter by organization
cat -n backend/leads/tasks.py | sed -n '101,250p'Repository: Kuldeeep18/LeadOrbit
Length of output: 5095
Add organization filter to LeadScrapeJob lookup in Celery task to prevent tenant isolation bypass.
Line 104 loads the job by id only. Celery workers run without tenant context, so TenantManager returns all records instead of filtering by organization. If task arguments mismatch, the query could match a job from a different organization and update it while creating leads in another org's context.
Suggested fix
org = Organization.objects.get(id=organization_id)
- job = LeadScrapeJob.objects.get(id=job_id)
+ job = LeadScrapeJob.objects.get(id=job_id, organization=org)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| org = Organization.objects.get(id=organization_id) | |
| job = LeadScrapeJob.objects.get(id=job_id) | |
| org = Organization.objects.get(id=organization_id) | |
| job = LeadScrapeJob.objects.get(id=job_id, organization=org) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/tasks.py` around lines 103 - 104, The LeadScrapeJob lookup in
the Celery task uses only the job ID without filtering by organization, which
can allow access to jobs from different organizations due to missing tenant
context in Celery workers. Modify the LeadScrapeJob.objects.get() call on line
104 to include an additional filter for organization_id matching the
organization loaded on line 103, ensuring the job query is scoped to the current
organization and preventing tenant isolation bypass.
| for item in extracted_raw_json[:limit]: | ||
| if not Lead.objects.filter(organization=org, email=item['email']).exists(): | ||
| Lead.objects.create( | ||
| organization=org, | ||
| email=item['email'], | ||
| first_name=item['first_name'], | ||
| last_name=item['last_name'], | ||
| company=item['company'], | ||
| phone=item['phone'], | ||
| linkedin_url=item['linkedin_url'] | ||
| ) | ||
| inserted_count += 1 |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "models.py" -o -name "tasks.py" | grep -E "leads|backend" | head -20Repository: Kuldeeep18/LeadOrbit
Length of output: 322
🏁 Script executed:
cat -n backend/leads/tasks.py | sed -n '160,185p'Repository: Kuldeeep18/LeadOrbit
Length of output: 1241
🏁 Script executed:
find . -type f -path "*/leads/models.py" -o -path "*/leads/model.py"Repository: Kuldeeep18/LeadOrbit
Length of output: 184
🏁 Script executed:
cat -n backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 2841
🏁 Script executed:
rg -A 10 "class Lead" backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 1611
🏁 Script executed:
head -30 backend/leads/tasks.py | cat -nRepository: Kuldeeep18/LeadOrbit
Length of output: 1034
🏁 Script executed:
rg "IntegrityError" backend/leads/tasks.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
rg "get_or_create" backend/leads/tasks.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
cat -n backend/leads/tasks.py | sed -n '1,180p'Repository: Kuldeeep18/LeadOrbit
Length of output: 8694
🏁 Script executed:
rg -B 5 "def.*task" backend/leads/tasks.py | head -40Repository: Kuldeeep18/LeadOrbit
Length of output: 397
Use get_or_create() to eliminate the race condition on concurrent lead insertion.
The current check-then-create pattern (lines 168-178) is vulnerable to IntegrityError when concurrent tasks attempt to insert leads with the same (organization, email) pair. The unique_together constraint on the Lead model (line 18 of models.py) will be violated, causing the entire job to fail.
Suggested fix
+from django.db import IntegrityError
...
- for item in extracted_raw_json[:limit]:
- if not Lead.objects.filter(organization=org, email=item['email']).exists():
- Lead.objects.create(
- organization=org,
- email=item['email'],
- first_name=item['first_name'],
- last_name=item['last_name'],
- company=item['company'],
- phone=item['phone'],
- linkedin_url=item['linkedin_url']
- )
- inserted_count += 1
+ for item in extracted_raw_json[:limit]:
+ try:
+ _, created = Lead.objects.get_or_create(
+ organization=org,
+ email=item['email'],
+ defaults={
+ 'first_name': item['first_name'],
+ 'last_name': item['last_name'],
+ 'company': item['company'],
+ 'phone': item['phone'],
+ 'linkedin_url': item['linkedin_url'],
+ },
+ )
+ except IntegrityError:
+ continue
+ if created:
+ inserted_count += 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/tasks.py` around lines 167 - 178, The current Lead creation
logic uses a check-then-create pattern that is vulnerable to race conditions
when concurrent tasks attempt to insert leads with the same organization and
email combination. Replace the if-not-exists check and Lead.objects.create()
call with Django's get_or_create() method, using organization and email as the
lookup parameters and passing first_name, last_name, company, phone, and
linkedin_url as defaults. This ensures atomic database-level handling of
concurrent inserts. Check the return tuple from get_or_create() to determine if
a new lead was created and increment inserted_count only when created is True.
| except Exception as e: | ||
| job.status = 'FAILED' | ||
| job.error_message = str(e) | ||
| job.completed_at = timezone.now() |
There was a problem hiding this comment.
🔒 Security & Privacy | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
# First, verify the file exists and examine the context around lines 186-189
cat -n backend/leads/tasks.py | sed -n '180,195p'Repository: Kuldeeep18/LeadOrbit
Length of output: 581
🏁 Script executed:
# Get an outline of the file to understand the function structure
ast-grep outline backend/leads/tasks.py --view expandedRepository: Kuldeeep18/LeadOrbit
Length of output: 449
🏁 Script executed:
# Search for other exception handling patterns in the same file
rg "except.*Exception" backend/leads/tasks.py -A 5Repository: Kuldeeep18/LeadOrbit
Length of output: 311
🏁 Script executed:
# Check if job.error_message is exposed through API endpoints
rg "job\.error_message\|error_message" --type py -A 2 -B 2Repository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Verify if logger is imported and used in the file
rg "logger\|import logging" backend/leads/tasks.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Search for where job.error_message is retrieved or returned in API responses
rg "error_message" --type py -B 3 -A 3Repository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Check the Job model definition to understand the field
fd -t f "models.py" backend/leads/ -exec cat -n {} \;Repository: Kuldeeep18/LeadOrbit
Length of output: 389
🏁 Script executed:
# Search for API serializers or views that expose the job
rg "class.*Serializer\|class.*ViewSet\|class.*APIView" backend/ -A 5 | head -50Repository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Look for API responses that include job information
rg "job\|Job" backend/leads/views.py -B 2 -A 2Repository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Find the Job model
find . -name "*.py" -path "*/leads/*" | xargs grep -l "class Job"Repository: Kuldeeep18/LeadOrbit
Length of output: 162
🏁 Script executed:
# Search for views.py in leads app
find . -path "*/leads/views.py"Repository: Kuldeeep18/LeadOrbit
Length of output: 183
🏁 Script executed:
# Search for serializers
find . -path "*/leads/*" -name "*serializer*.py"Repository: Kuldeeep18/LeadOrbit
Length of output: 189
🏁 Script executed:
# Check the structure of the leads app
find backend/leads -type f -name "*.py" | sortRepository: Kuldeeep18/LeadOrbit
Length of output: 406
🏁 Script executed:
# Check the Job model definition
cat backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 2435
🏁 Script executed:
# Check the serializer to see if error_message is exposed
cat backend/leads/serializers.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 1252
🏁 Script executed:
# Check the views to see how job data is returned
cat backend/leads/views.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 4975
🏁 Script executed:
# Check if there's any exception logging or sanitization elsewhere in the tasks.py file
cat backend/leads/tasks.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 7722
Do not expose raw exception text to API consumers.
Lines 186–189 expose str(e) directly through the LeadScrapeJobSerializer, which is returned in the scrape_status and scrape_history API endpoints. Raw exception messages can leak internal details (stack traces, paths, dependencies, configuration).
Suggested fix
- except Exception as e:
+ except Exception:
+ logger.exception("scrape_leads_task failed", extra={"job_id": str(job_id), "organization_id": str(organization_id)})
job.status = 'FAILED'
- job.error_message = str(e)
+ job.error_message = "Lead scraping failed. Please retry."
job.completed_at = timezone.now()
job.save()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception as e: | |
| job.status = 'FAILED' | |
| job.error_message = str(e) | |
| job.completed_at = timezone.now() | |
| except Exception: | |
| logger.exception("scrape_leads_task failed", extra={"job_id": str(job_id), "organization_id": str(organization_id)}) | |
| job.status = 'FAILED' | |
| job.error_message = "Lead scraping failed. Please retry." | |
| job.completed_at = timezone.now() |
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 186-186: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/tasks.py` around lines 186 - 189, The exception handler in the
code block is storing the raw exception text via str(e) directly into
job.error_message, which gets exposed through the API endpoints (scrape_status
and scrape_history) via LeadScrapeJobSerializer. Instead of assigning str(e) to
job.error_message, log the full exception internally for debugging purposes, and
store a generic, user-friendly error message in job.error_message (such as "An
error occurred during lead scraping") to avoid leaking internal details like
stack traces, file paths, or configuration information to API consumers.
| query = request.data.get('query', '').strip() | ||
| limit = int(request.data.get('limit', 50)) | ||
|
|
||
| if not query: | ||
| return Response({"error": "A search query is required."}, status=status.HTTP_400_BAD_REQUEST) | ||
| if limit > 200: | ||
| limit = 200 # Enforce security constraint max limit |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
# First, locate and inspect the file
cat -n backend/leads/views.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 5731
Handle invalid limit input with proper error handling before int conversion.
Line 48 will raise a runtime exception on non-numeric input (e.g., limit=abc), returning 500 instead of 400. Additionally, there is no lower-bound validation; limit can be set to zero or negative.
Suggested fix
- query = request.data.get('query', '').strip()
- limit = int(request.data.get('limit', 50))
+ query = request.data.get('query', '').strip()
+ raw_limit = request.data.get('limit', 50)
+ try:
+ limit = int(raw_limit)
+ except (TypeError, ValueError):
+ return Response({"error": "Limit must be an integer."}, status=status.HTTP_400_BAD_REQUEST)
if not query:
return Response({"error": "A search query is required."}, status=status.HTTP_400_BAD_REQUEST)
+ if len(query) > 255:
+ return Response({"error": "Query must be 255 characters or fewer."}, status=status.HTTP_400_BAD_REQUEST)
+ if limit < 1:
+ return Response({"error": "Limit must be at least 1."}, status=status.HTTP_400_BAD_REQUEST)
if limit > 200:
limit = 200 # Enforce security constraint max limit🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/views.py` around lines 47 - 53, The limit parameter assignment
on line 48 lacks error handling for invalid input and lower-bound validation.
Wrap the int conversion in a try-except block to catch ValueError exceptions
when non-numeric input is provided (like 'limit=abc'), and return a 400 error
response with a descriptive message instead of allowing the exception to
propagate. Additionally, after successfully converting the limit to an integer,
add validation to ensure the limit is greater than zero before the existing
maximum limit check, returning a 400 error if the limit is zero or negative.
| active_job = LeadScrapeJob.objects.filter(organization=org, status='RUNNING').exists() | ||
| if active_job: | ||
| return Response({"error": "Your organization already has an active lead scraping job running."}, status=status.HTTP_400_BAD_REQUEST) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
find backend -type f -name "*.py" | grep -E "(views|models)" | head -20Repository: Kuldeeep18/LeadOrbit
Length of output: 397
🏁 Script executed:
cat -n backend/leads/views.py | sed -n '50,70p'Repository: Kuldeeep18/LeadOrbit
Length of output: 1473
🏁 Script executed:
cat -n backend/leads/models.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 2841
🏁 Script executed:
cat -n backend/leads/views.py | sed -n '68,95p'Repository: Kuldeeep18/LeadOrbit
Length of output: 1531
🏁 Script executed:
find backend/leads -name "tasks.py" -o -name "celery.py"Repository: Kuldeeep18/LeadOrbit
Length of output: 181
🏁 Script executed:
cat -n backend/leads/tasks.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 9052
Treat PENDING jobs as active to enforce single concurrent job correctly.
Line 58 only blocks RUNNING jobs, allowing multiple PENDING jobs to be created concurrently for the same organization before any Celery task starts execution.
Suggested fix
- active_job = LeadScrapeJob.objects.filter(organization=org, status='RUNNING').exists()
+ active_job = LeadScrapeJob.objects.filter(
+ organization=org,
+ status__in=['PENDING', 'RUNNING'],
+ ).exists()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| active_job = LeadScrapeJob.objects.filter(organization=org, status='RUNNING').exists() | |
| if active_job: | |
| return Response({"error": "Your organization already has an active lead scraping job running."}, status=status.HTTP_400_BAD_REQUEST) | |
| active_job = LeadScrapeJob.objects.filter( | |
| organization=org, | |
| status__in=['PENDING', 'RUNNING'], | |
| ).exists() | |
| if active_job: | |
| return Response({"error": "Your organization already has an active lead scraping job running."}, status=status.HTTP_400_BAD_REQUEST) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/views.py` around lines 58 - 60, In the
LeadScrapeJob.objects.filter() query on line 58, the status filter currently
only checks for 'RUNNING' jobs, which allows multiple 'PENDING' jobs to be
created concurrently before any Celery task execution. Modify the filter
condition to check for both 'RUNNING' and 'PENDING' statuses (you can use the
__in lookup with a list of both status values) to ensure only one job in either
state can be active per organization at a time.
| from .tasks import scrape_leads_task | ||
| scrape_leads_task.delay(job.id, query, limit, org.id) | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
# First, find the backend/leads/views.py file and examine the area around lines 77-79
find . -name "views.py" -path "*/leads/*" | head -10Repository: Kuldeeep18/LeadOrbit
Length of output: 183
🏁 Script executed:
# List the repository structure to understand the layout
git ls-files | grep -E "(leads|tasks)" | head -20Repository: Kuldeeep18/LeadOrbit
Length of output: 958
🏁 Script executed:
# Read the views.py file around lines 77-79 to see the actual code
cat -n backend/leads/views.py | sed -n '60,100p'Repository: Kuldeeep18/LeadOrbit
Length of output: 2506
🏁 Script executed:
# Also get an outline of the views.py file to understand the context
ast-grep outline backend/leads/views.pyRepository: Kuldeeep18/LeadOrbit
Length of output: 507
🏁 Script executed:
# Check the tasks.py file to understand the scrape_leads_task
cat -n backend/leads/tasks.py | head -50Repository: Kuldeeep18/LeadOrbit
Length of output: 1756
🏁 Script executed:
# Check the LeadScrapeJob model to understand the status field
cat -n backend/leads/models.py | grep -A 20 "class LeadScrapeJob"Repository: Kuldeeep18/LeadOrbit
Length of output: 1054
🏁 Script executed:
# Search for existing Celery error handling patterns in the codebase
rg "CeleryError|except.*delay|celery.exceptions" --type pyRepository: Kuldeeep18/LeadOrbit
Length of output: 158
🏁 Script executed:
# Check the scrape_leads_task implementation in tasks.py
cat -n backend/leads/tasks.py | grep -A 30 "scrape_leads_task"Repository: Kuldeeep18/LeadOrbit
Length of output: 1915
🏁 Script executed:
# Look for all imports in views.py to understand current imports and patterns
cat -n backend/leads/views.py | sed -n '1,35p'Repository: Kuldeeep18/LeadOrbit
Length of output: 1768
🏁 Script executed:
# Search for other places where .delay() is called to see if there's any error handling
rg "\.delay\(" backend --type py -A 2 -B 2Repository: Kuldeeep18/LeadOrbit
Length of output: 1603
🏁 Script executed:
# Check what exceptions can be raised by Celery's delay() method
python3 << 'PY'
import inspect
# Check what exceptions Celery can raise
print("Common Celery exceptions that delay() might raise:")
print("1. kombu.exceptions.OperationalError - when broker connection fails")
print("2. celery.exceptions.CeleryError - base exception class")
print("3. celery.exceptions.SecurityError - security-related errors")
print("4. Exception - general exceptions")
print()
print("The .delay() method is part of the Task API and can raise:")
print("- celery.exceptions.CeleryError and its subclasses")
print("- kombu exceptions (connection-related)")
print("- General exceptions from the broker")
PYRepository: Kuldeeep18/LeadOrbit
Length of output: 616
🏁 Script executed:
# Check if there's any Celery configuration or broker setup in the project
rg "CELERY|BROKER" backend --type py | head -20Repository: Kuldeeep18/LeadOrbit
Length of output: 1358
🏁 Script executed:
# Look at the entire scrape method to get full context
ast-grep run --pattern 'def scrape($_) { $$$ }' backend/leads/views.py --lang pythonRepository: Kuldeeep18/LeadOrbit
Length of output: 472
🏁 Script executed:
# Read the full scrape method from views.py
cat -n backend/leads/views.py | sed -n '45,84p'Repository: Kuldeeep18/LeadOrbit
Length of output: 2256
🏁 Script executed:
# Check how other .delay() calls are handled in the codebase
rg "\.delay\(" backend --type py -B 5 -A 5 | head -50Repository: Kuldeeep18/LeadOrbit
Length of output: 3214
Handle enqueue failures for Celery task and update job status consistently.
If scrape_leads_task.delay() fails, the endpoint returns HTTP 201 Created but the job remains in PENDING status indefinitely and never executes. Wrap the task enqueue call in a try-except block to catch Celery exceptions, update the job to FAILED status with an error message, and return an appropriate error response.
Suggested fix
+from celery.exceptions import CeleryError
...
- scrape_leads_task.delay(job.id, query, limit, org.id)
+ try:
+ scrape_leads_task.delay(job.id, query, limit, org.id)
+ except CeleryError:
+ job.status = 'FAILED'
+ job.error_message = 'Failed to enqueue scrape job.'
+ job.completed_at = timezone.now()
+ job.save(update_fields=['status', 'error_message', 'completed_at'])
+ return Response(
+ {"error": "Unable to start background job. Please retry."},
+ status=status.HTTP_503_SERVICE_UNAVAILABLE,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from .tasks import scrape_leads_task | |
| scrape_leads_task.delay(job.id, query, limit, org.id) | |
| from .tasks import scrape_leads_task | |
| from celery.exceptions import CeleryError | |
| try: | |
| scrape_leads_task.delay(job.id, query, limit, org.id) | |
| except CeleryError: | |
| job.status = 'FAILED' | |
| job.error_message = 'Failed to enqueue scrape job.' | |
| job.completed_at = timezone.now() | |
| job.save(update_fields=['status', 'error_message', 'completed_at']) | |
| return Response( | |
| {"error": "Unable to start background job. Please retry."}, | |
| status=status.HTTP_503_SERVICE_UNAVAILABLE, | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/leads/views.py` around lines 77 - 79, The scrape_leads_task.delay()
call is not handling potential Celery exceptions, causing the job to remain in
PENDING status if the task enqueue fails while still returning HTTP 201 Created.
Wrap the scrape_leads_task.delay(job.id, query, limit, org.id) call in a
try-except block to catch Celery exceptions, and in the except block update the
job object's status to FAILED with an appropriate error message, then return an
error response (such as HTTP 500) instead of the success response.
| playwright==1.49.0 | ||
| google-genai==0.1.1 No newline at end of file |
There was a problem hiding this comment.
🔒 Security & Privacy | 🔴 Critical
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify playwright and google-genai versions exist on PyPI and check for known issues.
# Verify playwright version exists and check latest version
echo "=== Checking playwright versions ==="
curl -s https://pypi.org/pypi/playwright/1.49.0/json | jq '.info | {version, yanked}' || echo "playwright==1.49.0 not found"
echo "Latest playwright version:"
curl -s https://pypi.org/pypi/playwright/json | jq '.info.version'
# Verify google-genai version exists and check latest version
echo ""
echo "=== Checking google-genai versions ==="
curl -s https://pypi.org/pypi/google-genai/0.1.1/json | jq '.info | {version, yanked}' || echo "google-genai==0.1.1 not found"
echo "Latest google-genai version:"
curl -s https://pypi.org/pypi/google-genai/json | jq '.info.version'Repository: Kuldeeep18/LeadOrbit
Length of output: 393
Fix non-existent google-genai version and update outdated playwright.
The pinned versions have critical and minor issues:
google-genai==0.1.1does not exist on PyPI. Latest available version is 2.9.0. Update togoogle-genai==2.9.0or a compatible version.playwright==1.49.0exists but is significantly outdated. Latest version is 1.60.0. Update to a recent stable version.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@requirements.txt` around lines 17 - 18, Update the pinned versions of
google-genai and playwright in requirements.txt. Replace the non-existent
google-genai==0.1.1 with google-genai==2.9.0 or another compatible version that
exists on PyPI. Similarly, replace the outdated playwright==1.49.0 with
playwright==1.60.0 or a more recent stable version to ensure the packages can be
installed and are current with security and feature updates.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@frontend/leads.html`:
- Around line 193-199: The scrape request payload (around line 388) currently
only sends query and limit parameters and ignores the user's directory source
selections from the checkboxes with IDs srcGmaps and srcYp. Modify the scrape
request to collect the checked state of all directory source checkboxes
(srcGmaps, srcYp, and any other source checkboxes present in the form) and
include them in the payload sent to the backend, so that user selections are
actually used when making the scrape request.
- Around line 310-317: Replace the innerHTML assignment that embeds job.query
directly in the template string with a safer approach using createElement and
textContent. Create a new table row element using createElement, then create
individual table cells using createElement for each column (query, status,
leads_found, date). For the query cell specifically, use textContent to set the
job.query value instead of interpolating it into HTML. Apply the same pattern to
the error_message injection at line 429 area, creating DOM nodes and using
textContent instead of embedding the error_message string directly in the
innerHTML template.
- Around line 409-437: The polling request in the fetchWithAuth call does not
check if the HTTP response is successful before processing the JSON data, which
can leave the form in a disabled state if the server returns an error status
code. After awaiting the response from fetchWithAuth, add a check for
response.ok and if it is false, treat it as a terminal error by clearing the
pollInterval, updating the statusText to display an error message, setting the
progress bar width to 100%, adding the bg-danger class to the bar, and
re-enabling the submit button (similar to the FAILED status handling) before
returning or continuing to the next iteration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| <input class="form-check-input" type="checkbox" checked id="srcGmaps"> | ||
| <label class="form-check-label text-white-50 small">Google Maps</label> | ||
| </div> | ||
| <div class="form-check"> | ||
| <input class="form-check-input" type="checkbox" checked id="srcYp"> | ||
| <label class="form-check-label text-white-50 small">YellowPages</label> | ||
| </div> |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Wire selected directory sources into the scrape request payload.
The UI collects source choices, but Line 388 sends only query and limit, so user selections are ignored.
💡 Proposed fix
document.getElementById('aiScrapeForm').addEventListener('submit', async (e) => {
e.preventDefault();
const query = document.getElementById('scrapeQuery').value;
- const limit = document.getElementById('scrapeLimit').value;
+ const limit = Number(document.getElementById('scrapeLimit').value);
+ const sources = [];
+ if (document.getElementById('srcGmaps').checked) sources.push('google_maps');
+ if (document.getElementById('srcYp').checked) sources.push('yellowpages');
+ if (sources.length === 0) throw new Error('Select at least one directory source.');
const btn = e.target.querySelector('button[type="submit"]');
@@
const response = await fetchWithAuth('/leads/scrape/', {
method: 'POST',
- body: JSON.stringify({ query, limit })
+ body: JSON.stringify({ query, limit, sources })
});Also applies to: 386-389
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/leads.html` around lines 193 - 199, The scrape request payload
(around line 388) currently only sends query and limit parameters and ignores
the user's directory source selections from the checkboxes with IDs srcGmaps and
srcYp. Modify the scrape request to collect the checked state of all directory
source checkboxes (srcGmaps, srcYp, and any other source checkboxes present in
the form) and include them in the payload sent to the backend, so that user
selections are actually used when making the scrape request.
| tbody.innerHTML += ` | ||
| <tr class="border-bottom border-secondary border-opacity-25"> | ||
| <td class="fw-semibold text-info text-start ps-2">${job.query}</td> | ||
| <td><span class="badge ${badgeClass}">${job.status}</span></td> | ||
| <td class="fw-bold text-white">${job.leads_found}</td> | ||
| <td class="text-white-50">${date}</td> | ||
| </tr> | ||
| `; |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Avoid injecting job.query and error_message via innerHTML.
Line 312 and Line 429 insert dynamic strings into HTML directly. Render those values with textContent/DOM nodes to prevent XSS.
💡 Proposed fix
- data.forEach(job => {
- const badgeClass = job.status === 'COMPLETED' ? 'bg-success' : (job.status === 'FAILED' ? 'bg-danger' : 'bg-warning text-dark');
- const date = new Date(job.created_at || new Date()).toLocaleDateString();
- tbody.innerHTML += `
- <tr class="border-bottom border-secondary border-opacity-25">
- <td class="fw-semibold text-info text-start ps-2">${job.query}</td>
- <td><span class="badge ${badgeClass}">${job.status}</span></td>
- <td class="fw-bold text-white">${job.leads_found}</td>
- <td class="text-white-50">${date}</td>
- </tr>
- `;
- });
+ data.forEach(job => {
+ const badgeClass = job.status === 'COMPLETED' ? 'bg-success' : (job.status === 'FAILED' ? 'bg-danger' : 'bg-warning text-dark');
+ const date = new Date(job.created_at || new Date()).toLocaleDateString();
+ const tr = document.createElement('tr');
+ tr.className = 'border-bottom border-secondary border-opacity-25';
+ tr.innerHTML = `
+ <td class="fw-semibold text-info text-start ps-2"></td>
+ <td><span class="badge ${badgeClass}"></span></td>
+ <td class="fw-bold text-white"></td>
+ <td class="text-white-50"></td>
+ `;
+ tr.children[0].textContent = job.query || '';
+ tr.children[1].firstElementChild.textContent = job.status || '';
+ tr.children[2].textContent = String(job.leads_found ?? 0);
+ tr.children[3].textContent = date;
+ tbody.appendChild(tr);
+ });
@@
- statusText.innerHTML = `<i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}`;
+ statusText.innerHTML = `<i class="bi bi-exclamation-triangle-fill text-danger me-2"></i>`;
+ statusText.append(document.createTextNode(`Error: ${data.error_message || 'Timeout'}`));Also applies to: 427-430
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/leads.html` around lines 310 - 317, Replace the innerHTML assignment
that embeds job.query directly in the template string with a safer approach
using createElement and textContent. Create a new table row element using
createElement, then create individual table cells using createElement for each
column (query, status, leads_found, date). For the query cell specifically, use
textContent to set the job.query value instead of interpolating it into HTML.
Apply the same pattern to the error_message injection at line 429 area, creating
DOM nodes and using textContent instead of embedding the error_message string
directly in the innerHTML template.
Source: Linters/SAST tools
| const response = await fetchWithAuth(`/leads/scrape/${jobId}/status/`); | ||
| const data = await response.json(); | ||
|
|
||
| const statusText = document.getElementById('liveStatusText'); | ||
| const countText = document.getElementById('liveCountText'); | ||
| const bar = document.getElementById('liveProgressBar'); | ||
|
|
||
| if (data.status === 'RUNNING') { | ||
| statusText.innerHTML = `<span class="spinner-border spinner-border-sm text-info me-2"></span>Agent actively compiling records...`; | ||
| countText.textContent = `${data.leads_found} leads parsed`; | ||
| bar.style.width = '65%'; | ||
| } else if (data.status === 'COMPLETED') { | ||
| clearInterval(pollInterval); | ||
| statusText.innerHTML = `<i class="bi bi-check-circle-fill text-success me-2"></i>Sequence completed!`; | ||
| countText.textContent = `${data.leads_found} leads added`; | ||
| bar.style.width = '100%'; | ||
| bar.classList.remove('progress-bar-striped'); | ||
| setTimeout(() => { window.location.reload(); }, 1500); | ||
| } else if (data.status === 'FAILED') { | ||
| clearInterval(pollInterval); | ||
| statusText.innerHTML = `<i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}`; | ||
| bar.style.width = '100%'; | ||
| bar.classList.add('bg-danger'); | ||
| document.getElementById('aiScrapeForm').querySelector('button[type="submit"]').disabled = false; | ||
| } | ||
| } catch (err) { | ||
| console.error('Polling connection interrupted:', err); | ||
| } | ||
| }, 3000); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Handle non-OK polling responses as terminal errors and reset UI state.
Polling currently ignores HTTP errors and only logs exceptions, which can leave the form stuck in disabled state indefinitely.
💡 Proposed fix
pollInterval = setInterval(async () => {
try {
const response = await fetchWithAuth(`/leads/scrape/${jobId}/status/`);
- const data = await response.json();
+ const data = await response.json();
+ if (!response.ok) {
+ throw new Error(data.error || 'Failed to fetch scrape status.');
+ }
@@
} else if (data.status === 'FAILED') {
clearInterval(pollInterval);
statusText.innerHTML = `<i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}`;
bar.style.width = '100%';
bar.classList.add('bg-danger');
- document.getElementById('aiScrapeForm').querySelector('button[type="submit"]').disabled = false;
+ const submitBtn = document.getElementById('aiScrapeForm').querySelector('button[type="submit"]');
+ submitBtn.disabled = false;
+ submitBtn.textContent = 'Deploy AI Browser Agent';
}
} catch (err) {
- console.error('Polling connection interrupted:', err);
+ clearInterval(pollInterval);
+ const statusText = document.getElementById('liveStatusText');
+ statusText.textContent = `Error: ${err.message || 'Polling interrupted.'}`;
+ const submitBtn = document.getElementById('aiScrapeForm').querySelector('button[type="submit"]');
+ submitBtn.disabled = false;
+ submitBtn.textContent = 'Deploy AI Browser Agent';
}
}, 3000);🧰 Tools
🪛 ast-grep (0.44.0)
[warning] 416-416: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusText.innerHTML = <span class="spinner-border spinner-border-sm text-info me-2"></span>Agent actively compiling records...
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').
(inner-outer-html)
[warning] 421-421: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusText.innerHTML = <i class="bi bi-check-circle-fill text-success me-2"></i>Sequence completed!
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').
(inner-outer-html)
[warning] 428-428: Avoid assigning untrusted data to innerHTML/outerHTML or document.write
Context: statusText.innerHTML = <i class="bi bi-exclame-triangle-fill text-danger me-2"></i>Error: ${data.error_message || 'Timeout'}
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting').
(inner-outer-html)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/leads.html` around lines 409 - 437, The polling request in the
fetchWithAuth call does not check if the HTTP response is successful before
processing the JSON data, which can leave the form in a disabled state if the
server returns an error status code. After awaiting the response from
fetchWithAuth, add a check for response.ok and if it is false, treat it as a
terminal error by clearing the pollInterval, updating the statusText to display
an error message, setting the progress bar width to 100%, adding the bg-danger
class to the bar, and re-enabling the submit button (similar to the FAILED
status handling) before returning or continuing to the next iteration.
Pull Request
🔗 Related Issue
Closes #53
📝 Summary of Changes
Backend
scrapeaction insideLeadViewSetto immediately offload data processing pipelines to background workers using Celery (scrape_leads_task.delay()).request.user.organizationfilters to fully isolate cross-tenant workspace assets.200rows per scraping query.status='RUNNING') inside the same tenant scope.HTTP 429 Too Many Requests) between consecutive task completions.tasks.pythat normalizes string phone digits into clean E.164 compliance and automatically screens incoming records against current database emails to prevent duplicates.Frontend
#0f172aworkspace canvas bound by explicit#334155border patterns) for an optimized, readable EdTech/SaaS appearance.Unexpected token '<' ... is not valid JSONsyntax error.PENDING➔RUNNING➔COMPLETEDasynchronously.📁 Files Modified
leads/views.pyleads/tasks.pyleads/models.pyleads/serializers.pyrequirements.txtleads.html🏷️ Type of Change
🧪 Testing
1. API Endpoint Integrity Checks
POST /api/leads/scrape/➔ Returns201 Createdwith a background tracking job UUID.GET /api/leads/scrape/{job_id}/status/➔ Polls background task progress parameters dynamically.POST(Concurrently) ➔ Returns400 Bad Request(active job flag block).POST(Within 5-min window) ➔ Returns429 Too Many Requests(cooldown throttle block).2. Celery Worker Queue Telemetry Logs
📸 ScreenRecording Link for easy PR review and proof of updates!
https://drive.google.com/file/d/1_6LciRmKjIlXVYUSiusIGUG8GehMlP8x/view?usp=sharing
✅ Checklist
Summary by CodeRabbit