Skip to content

Feature/async pdf processing queue#437

Open
anshika1179 wants to merge 7 commits into
FireFistisDead:masterfrom
anshika1179:feature/async-pdf-processing-queue
Open

Feature/async pdf processing queue#437
anshika1179 wants to merge 7 commits into
FireFistisDead:masterfrom
anshika1179:feature/async-pdf-processing-queue

Conversation

@anshika1179

@anshika1179 anshika1179 commented Jun 1, 2026

Copy link
Copy Markdown

Summary

Refactors PDF ingestion into an asynchronous job-based architecture using Redis and Celery, eliminating long-running synchronous requests.

Previously, PDF parsing, chunking, and embedding generation were executed synchronously on the FastAPI thread, which caused request blocking, HTTP timeouts for large files, and poor scalability for concurrent uploads.

This PR introduces:

  • Redis-backed queue: Manages Celery tasks and tracks progress via Redis hashes.
  • Celery workers: Migrated extraction, chunking, and FAISS indexing into background workers.
  • Asynchronous API: Endpoints immediately return a jobId.
  • Status endpoint: Exposes GET /upload/:jobId/status to poll progress.
  • Frontend progress UI: React UI now polls the status endpoint and displays a smooth progress bar during upload.

Related issue

Closes #406

Testing

  • I ran the relevant checks locally
  • I verified the app still starts
  • I tested the affected flow end-to-end

Additional Testing Details:

  • Small PDFs: Verified that the UI uploads quickly, state updates smoothly, and transitions to the chat viewer.
  • Large PDFs: Verified that the job is queued properly, progress bar updates accurately, and the HTTP request no longer times out.
  • Worker failures: Simulated failures to ensure they are handled gracefully and the frontend receives the error state.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

Screenshots / recordings

(N/A - the UI now shows a progress bar and status text instead of a static loading spinner during uploads)

Notes

  • Architecture Changes: The upload flow now looks like this: User -> React -> Express -> FastAPI -> Redis (job queued) -> Response (jobId). The React app then polls for the status until the Celery worker finishes the embedding generation.
  • Deployment Requirement: You must run the redis and celery-worker containers defined in docker-compose.yml for uploads to work in this branch.

Security

  • No sensitive data included

Summary by CodeRabbit

  • New Features
    • Queued PDF processing with async upload, live progress UI, and a backend upload job-status endpoint
    • Authenticated background processing via a new worker service
  • Improvements
    • Upload UI now shows determinate progress while processing and displays dynamic status text
    • Stronger security: JWT-based authentication, resource ownership checks, and audit logging
  • Infrastructure
    • Added Redis-backed job storage and Celery worker; updated service wiring
    • Added SQLite-backed persistence and updated service dependencies

@vercel

vercel Bot commented Jun 1, 2026

Copy link
Copy Markdown

@rhoggs-bot-test-account is attempting to deploy a commit to the firefistisdead's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Decouples PDF processing from synchronous request-response by implementing Celery background jobs persisted in Redis, adds frontend polling UI with progress indicators, introduces SQLite-backed JWT authentication and multi-tenant resource ownership checks, and routes user identity through Express gateway to RAG service endpoints.

Changes

Asynchronous PDF Processing with Multi-Tenant Authorization

Layer / File(s) Summary
Redis & Celery Infrastructure
docker-compose.yml, rag-service/celery_app.py, rag-service/requirements.txt
Adds celery-worker service in Compose with Redis broker/backend config, introduces Celery app with worker module registration, and adds celery>=5.3.0 and redis>=5.0.0 dependencies.
Redis Job Storage
rag-service/job_store.py, rag-service/workers/__init__.py
Implements create_job, update_job_status, and get_job functions storing job hashes in Redis with 24-hour TTL and progress/status/session metadata.
Celery PDF Processing Worker
rag-service/workers/pdf_processor.py
Adds process_pdf_task Celery task that extracts PDF text, chunks and deduplicates content, generates embeddings, persists FAISS vectorstore under thread lock, updates job progress/completion status, and cleans temp files; handles exceptions by marking job failed and logging errors.
RAG Service Async Endpoints
rag-service/main.py
/process-pdf now requires X-User-Id header, validates PDF magic bytes, enforces 20MB max size, streams to temp file, creates job via job_store, enqueues Celery task, and returns { jobId, status: "queued" } immediately. Adds /process-pdf/{job_id}/status endpoint with user ownership enforcement (403 if mismatch). Adds optional user-scoped filtering to search_retrieval_candidates. Note: /validate-session-write contains malformed duplicated raise statements and undefined variable reference.
SQLite Database & User Authentication
package.json, src/db/db.js, src/controllers/authController.js, src/middleware/auth.js
Adds sqlite3 npm dependency. Creates SQLite schema with users, documents, chat_sessions, and embeddings tables. Provides promise-based query helpers (dbQuery, dbRun, dbGet). Migrates signup/login from file-based to SQL-backed user persistence with password hashing and JWT token generation (signed with JWT_SECRET env var). Introduces authenticateUser middleware validating Bearer tokens and attaching decoded req.user.
Resource Authorization & Audit Logging
src/services/authz/resource-access.js, src/services/security/audit.js
Adds verifyDocumentOwnership and verifyChatSessionOwnership async helpers querying SQL for ownership, logging cross-tenant access attempts. Provides authorizeDocument and authorizeSession middleware returning 403 on unauthorized access. Implements logEvent audit logger writing timestamped newline-JSON entries to security/logs/audit.log.
Express Gateway Auth & Proxying
server.js
Gateway adds authenticateUser requirement on /upload, /process-from-url, /ask, /ask/stream, /summarize, /knowledge-gaps, /sessions/lookup routes. Updates ragAuthHeaders to conditionally include X-User-Id from authenticated user. Forwards /validate-session-write and all inference endpoints with user-aware RAG headers. Changes /upload and /process-from-url success responses to queued-style { jobId, status } payloads. Adds new GET /upload/:jobId/status endpoint proxying to RAG service status endpoint. Simplifies request validation to direct Zod parsing (removes cached validators).
Frontend Upload API & Polling
frontend/src/services/api.js, frontend/src/App.js
Adds checkJobStatusApi(jobId) helper making GET requests to job status endpoint. Extends handleUpload to initialize progress/status, call uploadPdfApi, then enter 1-second polling loop on returned jobId updating uploadProgress and uploadStatusText, throwing on failure, and on completion updating pdfs list and selectedPdf. Adds uploadProgress and uploadStatusText React state.
Upload Card Progress Rendering
frontend/src/components/UploadCard/UploadCard.jsx
Updates component to accept uploadProgress and uploadStatusText props. Renders CircularProgress in determinate mode when uploadProgress > 0 (with value set to progress), otherwise indeterminate. Displays dynamic uploadStatusText with fallback to "Uploading...".
Patch & Transformation Scripts
scratch/patch_*.py
Multiple Python scripts automate regex-based rewrites of source files: patch_app.py injects API import and polling logic into App.js; patch_uploadcard.py updates UploadCard props and JSX; patch_express.py modifies server.js response shapes and adds job status route; patch_fastapi.py adds header auth checks to RAG endpoints; patch_process_pdf.py replaces /process-pdf handler with queued job flow; patch_test.py adds auth token generation to tests.

Sequence Diagram

sequenceDiagram
  participant User
  participant Frontend
  participant Gateway as Express_Gateway
  participant RAG as RAG_Service
  participant Redis
  participant Worker as Celery_Worker
  User->>Frontend: Select PDF and upload
  Frontend->>Gateway: POST /upload (Bearer token)
  Gateway->>Gateway: authenticateUser
  Gateway->>RAG: POST /process-pdf (X-User-Id header)
  RAG->>RAG: Validate PDF, write temp file
  RAG->>Redis: create_job(jobId, filename, user_id)
  RAG->>Worker: process_pdf_task.delay(jobId, ...)
  RAG->>Gateway: {jobId, status: "queued"}
  Gateway->>Frontend: {jobId, status: "queued"}
  Frontend->>Frontend: Start polling every 1s
  loop Poll
    Frontend->>Gateway: GET /upload/:jobId/status
    Gateway->>RAG: GET /process-pdf/:jobId/status (X-User-Id)
    RAG->>Redis: get_job(jobId)
    RAG->>Gateway: {progress, status}
    Gateway->>Frontend: {progress, status}
    Frontend->>Frontend: Update progress bar
  end
  Worker->>Worker: Extract PDF, chunk, embed
  Worker->>Redis: update_job_status(progress=50%)
  Worker->>Redis: update_job_status(completed, sessionId, sessionSecret)
  Frontend->>Frontend: Polling detects completion
  Frontend->>Frontend: Add PDF to list, select it
  User->>User: See success message and PDF loaded
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

level:intermediate

Suggested reviewers

  • FireFistisDead

Poem

🐰 I hopped through queues where Celery keeps time,
Jobs in Redis, workers working in rhyme.
Frontend polls softly, progress bars show,
SQLite remembers the users we know.
A rabbit applauds — async makes things flow. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR includes failing tests reported by the reviewer, preventing verification that all linked issue objectives are met. The PR requires test remediation before merge. Address the failing tests reported in the review comments. Verify all tests pass locally and in CI before proceeding with merge.
Out of Scope Changes check ⚠️ Warning Multiple out-of-scope changes detected: backend authentication refactoring (authController, auth middleware, database layer), authorization checks (resource-access), audit logging, and extensive test patching—these exceed the async PDF queue scope. Isolate async queue work (celery, redis, job_store, pdf_processor, api changes) into a focused PR; defer auth/db/audit/authorization features to separate feature branches to maintain clarity and simplify review.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Feature/async pdf processing queue' clearly and concisely summarizes the main change: converting PDF processing to an asynchronous queue-based system.
Description check ✅ Passed The description covers all required template sections: Summary, Related issue, Testing (with details), Checklist completion, Notes, and Security declaration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Biome (2.5.0)
frontend/src/services/api.js

File contains syntax errors that prevent linting: Line 183: expected , but instead found {; Line 184: expected , but instead found return

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

frontend/src/App.js

ESLint skipped: missing config or dependency (missing-dependency). The ESLint configuration references a package that is not available in the sandbox.

frontend/src/services/api.js

ESLint skipped: the ESLint configuration for this file references a package that is not available in the sandbox.

server.js

ESLint skipped: the ESLint configuration for this file references a package that is not available in the sandbox.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added backend Express or API gateway work bug Something isn't working docs Documentation only enhancement New feature or request feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work level:advanced rag-service FastAPI / model service work type:security type:testing labels Jun 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
frontend/src/App.js (1)

315-317: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Upload progress state not reset in finally block.

uploadProgress and uploadStatusText are set at the start of upload but never cleared on completion or error. After a failed upload, the next upload attempt will briefly show stale progress values until overwritten.

🧹 Suggested fix
     } finally {
       setUploading(false);
+      setUploadProgress(0);
+      setUploadStatusText("");
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/App.js` around lines 315 - 317, The finally block currently only
calls setUploading(false) but leaves uploadProgress and uploadStatusText with
stale values; update the finally block (where setUploading is called) to also
reset upload progress and status by invoking setUploadProgress(0) and
setUploadStatusText('') (or a suitable initial message) so both uploadProgress
and uploadStatusText are cleared on success/error; locate these state setters in
the upload handler (references: setUploading, setUploadProgress,
setUploadStatusText, uploadProgress, uploadStatusText) and add the resets in the
same finally branch.
rag-service/main.py (2)

2469-2479: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix undefined user_id in /sessions/lookup authorization check.

user_id is referenced but never defined in this handler, which will raise NameError at runtime. Read it from Request headers (or remove this check if enforced elsewhere).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2469 - 2479, The handler lookup_sessions
references user_id but never defines it; update the function signature to accept
the incoming Request (e.g., add a request: Request dependency) and extract the
user id from request.headers (for example from "X-User-Id" or whichever header
your service uses), validate it exists (return 401 if missing) and then use that
user_id in the existing authorization check, keeping the existing session lookup
via _touch_session_unlocked and sessions_lock; alternatively, if the check is
enforced upstream, remove the user_id check in lookup_sessions instead.

2537-2543: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Validate existing session ownership/secret before enqueuing upload jobs.

If session_id is supplied, this path currently queues work without confirming the caller owns that session (and without requiring a valid session secret), enabling unauthorized write attempts and wasted background processing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2537 - 2543, When a session_id is provided
we must verify caller ownership/secret before enqueuing work: after calling
normalize_session_id(session_id) assign requested_session_id and
require/normalize requested_session_secret, then look up and validate the
session (e.g., via an existing lookup like get_session_by_id or a new
verify_session_secret(session_id, secret) helper) and raise HTTPException(403)
for missing/invalid secret or if the session does not belong to the caller; only
proceed to enqueue upload jobs after that validation succeeds. Ensure you
reference normalize_session_id, session_id and requested_session_secret when
adding the lookup/verify call so the check happens before any enqueue operation.
🧹 Nitpick comments (4)
frontend/src/App.js (1)

285-289: 💤 Low value

Duplicate setSelectedPdf(pdfId) in both branches.

Both the if and else branches execute the same setSelectedPdf(pdfId) call, making the conditional pointless.

♻️ Simplify
-        if (prev.length === 0) {
-          setSelectedPdf(pdfId);
-        } else {
-          setSelectedPdf(pdfId);
-        }
+        setSelectedPdf(pdfId);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/App.js` around lines 285 - 289, The conditional around
setSelectedPdf(pdfId) is redundant because both branches call the same function;
simplify by removing the if/else and call setSelectedPdf(pdfId) once. Update the
block that references prev and pdfId (where setSelectedPdf is used) to a single
statement invoking setSelectedPdf(pdfId); if there was intended different
behavior for the prev.length === 0 case, implement that separate logic instead
of duplicating the setSelectedPdf call.
src/db/db.js (1)

3-3: 💤 Low value

Remove unused import.

The fs module is imported but never used in this file.

 const sqlite3 = require('sqlite3').verbose();
 const path = require('path');
-const fs = require('fs');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/db/db.js` at line 3, Remove the unused import of the fs module by
deleting the line "const fs = require('fs');" from the file (so the module is
not required where it's unused); update any related requires if you intended to
use filesystem functions elsewhere and run linting to ensure no other unused
imports remain.
scratch/patch_app.py (1)

119-125: ⚡ Quick win

Redundant conditional branches.

Both branches of this if/else execute the identical statement setSelectedPdf(pdfId). The condition is unnecessary.

Proposed simplification
-        if (prev.length === 0) {
-          setSelectedPdf(pdfId);
-        } else {
-          setSelectedPdf(pdfId);
-        }
+        setSelectedPdf(pdfId);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scratch/patch_app.py` around lines 119 - 125, The if/else is redundant
because both branches call setSelectedPdf(pdfId); remove the conditional and
replace it with a single call to setSelectedPdf(pdfId) (leaving the subsequent
return updated intact) — locate the block referencing prev, setSelectedPdf,
pdfId, and updated and simplify by calling setSelectedPdf(pdfId) once.
scratch/patch_process_pdf.py (1)

60-65: 💤 Low value

Inline imports inside function body.

Importing job_store and process_pdf_task inside the function body is non-idiomatic. While this may avoid circular import issues, it adds overhead on each request and reduces code clarity.

If circular imports aren't a concern, move these to the module level.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scratch/patch_process_pdf.py` around lines 60 - 65, The imports for job_store
and process_pdf_task are inside the function body causing per-call overhead and
non-idiomatic structure; move the imports to the module top-level (import
job_store and from workers.pdf_processor import process_pdf_task) and update the
function to call job_store.create_job(...) and process_pdf_task.delay(...)
directly; if a circular import prevents top-level imports, instead refactor to
break the cycle (e.g., move shared logic into a third module) or use a single,
documented lazy import fallback (e.g., importlib.import_module in a small
helper) so the normal path uses module-level imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@frontend/src/App.js`:
- Around line 246-262: The polling loop for jobId (while (!isDone) in the block
that calls checkJobStatusApi) can run forever; add a maximum retry count or
overall timeout and break with a graceful error state if exceeded. Implement a
counter/expiry (e.g., maxPolls or pollDeadline) in the loop that increments each
cycle and, when reached, sets upload progress/status via setUploadProgress and
setUploadStatusText and throws or returns a descriptive error so the UI shows
failure; keep existing handling for status === "completed" and "failed" and
ensure you still clear any pending state when exiting the loop.

In `@frontend/src/services/api.js`:
- Around line 179-182: The checkJobStatusApi function is using a malformed URL
string and never interpolates jobId; update the axios.get call in
checkJobStatusApi to use a template literal with backticks and include the jobId
parameter (e.g. `${API_BASE}/upload/${jobId}/status`) and remove any duplicate
slashes so the correct endpoint is requested and res.data is returned as before.

In `@rag-service/celery_app.py`:
- Around line 13-21: Add a Celery soft time limit and explicitly handle it:
update celery_app.conf to include task_soft_time_limit (e.g., slightly below the
existing task_time_limit) so long-running PDF jobs get a soft timeout, and
modify the process_pdf_task function to catch
celery.exceptions.SoftTimeLimitExceeded explicitly (in addition to the generic
except) to perform deterministic status/progress updates and cleanup before the
hard time limit kills the worker.

In `@rag-service/main.py`:
- Around line 2609-2618: Collapse the duplicated authorization checks in the
/validate-session-write handler (validate_session_write) into a single existence
check and a single ownership check: first raise HTTPException(404, "Session not
found") if session is falsy, then compare session.get("user_id") to the
authenticated request user id (do not use an undefined user_id) and raise
HTTPException(403, "Forbidden") if they differ; remove all repeated/ unreachable
raises and obtain the request user id from the actual authentication context
(e.g., a current_user or request_user_id parameter/state) before performing the
comparison.

In `@rag-service/workers/pdf_processor.py`:
- Around line 72-99: The code mutates rag_main.sessions for
processing_session_id under rag_main.sessions_lock without verifying
ownership/secret; add a guard when the session already exists to check that
session["user_id"] == user_id and that if requested_session_secret is provided
it matches session["session_secret"] (or otherwise require a valid secret), and
only proceed to merge vectors, append uploaded_document, update last_accessed,
call rag_main.persist_vectorstore and rag_main.persist_session_registry_entry
when the check passes; if the check fails, raise an authorization error or abort
the operation instead of mutating the session.

In `@scratch/patch_app.py`:
- Around line 140-143: The script writes back file_path using content without
validating that the earlier regex/string replacements actually changed anything;
update the patch logic (the places using re.sub and str.replace) to use re.subn
(or compare content_before vs content_after) for each substitution, check the
returned count or diff, and raise/log a clear error (or abort) if a substitution
count is zero so you don't silently write an unchanged or corrupted file; ensure
this validation happens before the final write and reference the same variable
names (content, file_path) and functions (re.sub/re.subn, str.replace) so
failures are reported with helpful messages tied to the specific
pattern/description.
- Around line 84-97: The polling loop in the block using
checkJobStatusApi(jobId) with variables isDone, data, setUploadProgress and
setUploadStatusText can run indefinitely on unexpected statuses; add a maximum
retry count or overall timeout (e.g., maxPolls or deadline timestamp) and break
the loop when exceeded, setting an error state (or throwing) and updating
setUploadStatusText accordingly. Also explicitly handle unexpected statuses
(e.g., "cancelled" or malformed statusRes) by treating them as terminal (set
isDone/throw) instead of continuing to poll, and ensure network/errors from
checkJobStatusApi are caught so the timeout logic still triggers. Ensure
references to jobId, checkJobStatusApi, setUploadProgress and
setUploadStatusText are used when updating state and error messages.

In `@scratch/patch_express.py`:
- Around line 54-59: The current patch insertion uses a fragile string match
('app.post("/ask",') and performs a blind replace without validating success;
update the logic around status_endpoint and the replace to first check that the
target string (e.g., 'app.post("/ask",') exists in content, raise a clear
RuntimeError if not found, perform the insertion (maintaining the use of
status_endpoint and the existing 'app.post("/ask",' token), then verify that
'/upload/:jobId/status' is present in the modified content and raise if the
post-check fails so the script fails loudly on mismatch; use the existing
variables status_endpoint and content to locate and validate the insertion point
and final content.

In `@scratch/patch_fastapi.py`:
- Around line 4-5: The file is opening files without a specified encoding in
some places (e.g., the with open(file_path, "r") as f: block that reads into
content and the later open(...) at lines around 104–105); update both open calls
to explicitly pass encoding="utf-8" so reads are consistent across platforms
(change the with open(file_path, "r") and the later open(...) usages to with
open(file_path, "r", encoding="utf-8") and with open(..., "r", encoding="utf-8")
respectively) to prevent Windows/default-encoding mismatches.
- Around line 84-88: The substitution uses an overly generic pattern and may
affect any `if not session:`; restrict the regex to the /summarize endpoint by
matching the unique surrounding context (e.g., the route decorator or function
signature) such as "`@app.post`('/summarize')" or "def summarize" and the local
variable "user_id" so only that function body is changed; update the pattern to
a multiline/DOTALL-aware regex that first locates the summarize handler
(decorator or def) and then replaces the specific `if not session:` inside that
block with the two HTTPException checks, ensuring you use re.MULTILINE/re.DOTALL
(or explicit non-greedy context) so other `if not session:` instances are
untouched.

In `@scratch/patch_test.py`:
- Around line 31-35: The global regex that injects Authorization into headers is
affecting the signup request because authToken is empty at signup time; update
the replacement to target only non-auth requests (e.g., match
fetch(`\${baseUrl}/...` and exclude paths like api/auth with a negative
lookahead) or perform the auth-header injection before the code that inserts the
signup/before() hook so the signup request is not modified; reference the
headers object and authToken variable and adjust the regex to skip endpoints
starting with api/auth (or reorder the replacement sequence) so signup requests
remain without Authorization.
- Around line 4-5: The file reads open(file_path, "r") as f and another similar
open call later without specifying encoding; update both open(...) calls (the
one using file_path and the one around the second read at lines shown in review)
to include encoding="utf-8" so reads are deterministic across platforms—i.e.,
change open(file_path, "r") and the second open(...) to open(..., "r",
encoding="utf-8").

In `@scratch/patch_uploadcard.py`:
- Around line 52-55: The write-back currently blindly writes `content` to
`file_path` after performing `str.replace`/`re.sub` operations without checking
they actually matched; before opening the file for writing, capture the original
file contents (e.g., `original_content`), perform the replacements to produce
`patched_content`, then verify replacements succeeded by either comparing
`patched_content != original_content` or checking for presence/absence of
expected marker strings used in the replacements; if no changes were made,
log/raise an error and do not overwrite the file, otherwise proceed to write
`patched_content` and print success. Ensure you reference the same replacement
calls and variables used in the script where they occur so failures are detected
prior to writing `file_path`.

In `@src/controllers/authController.js`:
- Line 8: The JWT secret fallback is insecure—replace the hardcoded default by
requiring process.env.JWT_SECRET and fail fast if missing: set SECRET =
process.env.JWT_SECRET; if (!SECRET && process.env.NODE_ENV !== 'development')
throw a clear startup error (or process.exit(1)); if you want a dev-only
convenience, allow a dev-only fallback when NODE_ENV === 'development' (e.g.,
SECRET = process.env.JWT_SECRET || 'dev-only-fallback') and ensure this logic
runs at module initialization (where SECRET is declared) so the app refuses to
start in non-development environments without a configured JWT_SECRET.

In `@src/db/db.js`:
- Around line 11-54: Enable SQLite foreign key enforcement before creating
tables: call the PRAGMA to turn on foreign_keys on the DB connection (e.g.,
invoke db.run or db.exec with "PRAGMA foreign_keys = ON") prior to the
db.serialize block that creates tables (referencing db.serialize and the
subsequent db.run table creation calls for
users/documents/chat_sessions/embeddings); ensure the PRAGMA is executed on the
same connection so the declared FOREIGN KEY constraints are enforced for all
subsequent schema operations and runtime DML.

In `@src/middleware/auth.js`:
- Line 3: Replace the duplicate hardcoded fallback SECRET by creating a single
shared config module that exports the required SECRET and throws if
process.env.JWT_SECRET is missing, then update both modules to import that
constant; specifically, add a new module that does const SECRET =
process.env.JWT_SECRET; if (!SECRET) throw new Error('JWT_SECRET environment
variable is required'); module.exports = { SECRET }; and in auth.js (where
SECRET is currently defined) and in authController.js remove the fallback
assignment and require/import the shared SECRET instead so both modules use the
same source of truth.

In `@src/services/authz/resource-access.js`:
- Around line 28-48: The authorizeDocument and authorizeSession middlewares call
async ownership checks (verifyDocumentOwnership, verifyChatSessionOwnership) but
do not handle DB errors; wrap the await calls in try-catch blocks inside
authorizeDocument and authorizeSession, and on catch return
res.status(500).json({ error: 'Internal Server Error' }) (or include
error.message if preferred) so a DB failure doesn't crash the middleware; ensure
you still call next() when ownership succeeds and only return 403 for verified
false.

In `@src/services/security/audit.js`:
- Around line 11-22: The async function logEvent currently uses fs.appendFile
with a callback so it returns before the disk write completes; change it to
return/await a Promise-backed append (e.g., use fs.promises.appendFile or wrap
fs.appendFile in a new Promise) so callers that await logEvent (such as in
authController.js) truly wait for the write to finish; ensure errors from the
awaited append are propagated or logged via processLogger/console so failures
are visible.

---

Outside diff comments:
In `@frontend/src/App.js`:
- Around line 315-317: The finally block currently only calls
setUploading(false) but leaves uploadProgress and uploadStatusText with stale
values; update the finally block (where setUploading is called) to also reset
upload progress and status by invoking setUploadProgress(0) and
setUploadStatusText('') (or a suitable initial message) so both uploadProgress
and uploadStatusText are cleared on success/error; locate these state setters in
the upload handler (references: setUploading, setUploadProgress,
setUploadStatusText, uploadProgress, uploadStatusText) and add the resets in the
same finally branch.

In `@rag-service/main.py`:
- Around line 2469-2479: The handler lookup_sessions references user_id but
never defines it; update the function signature to accept the incoming Request
(e.g., add a request: Request dependency) and extract the user id from
request.headers (for example from "X-User-Id" or whichever header your service
uses), validate it exists (return 401 if missing) and then use that user_id in
the existing authorization check, keeping the existing session lookup via
_touch_session_unlocked and sessions_lock; alternatively, if the check is
enforced upstream, remove the user_id check in lookup_sessions instead.
- Around line 2537-2543: When a session_id is provided we must verify caller
ownership/secret before enqueuing work: after calling
normalize_session_id(session_id) assign requested_session_id and
require/normalize requested_session_secret, then look up and validate the
session (e.g., via an existing lookup like get_session_by_id or a new
verify_session_secret(session_id, secret) helper) and raise HTTPException(403)
for missing/invalid secret or if the session does not belong to the caller; only
proceed to enqueue upload jobs after that validation succeeds. Ensure you
reference normalize_session_id, session_id and requested_session_secret when
adding the lookup/verify call so the check happens before any enqueue operation.

---

Nitpick comments:
In `@frontend/src/App.js`:
- Around line 285-289: The conditional around setSelectedPdf(pdfId) is redundant
because both branches call the same function; simplify by removing the if/else
and call setSelectedPdf(pdfId) once. Update the block that references prev and
pdfId (where setSelectedPdf is used) to a single statement invoking
setSelectedPdf(pdfId); if there was intended different behavior for the
prev.length === 0 case, implement that separate logic instead of duplicating the
setSelectedPdf call.

In `@scratch/patch_app.py`:
- Around line 119-125: The if/else is redundant because both branches call
setSelectedPdf(pdfId); remove the conditional and replace it with a single call
to setSelectedPdf(pdfId) (leaving the subsequent return updated intact) — locate
the block referencing prev, setSelectedPdf, pdfId, and updated and simplify by
calling setSelectedPdf(pdfId) once.

In `@scratch/patch_process_pdf.py`:
- Around line 60-65: The imports for job_store and process_pdf_task are inside
the function body causing per-call overhead and non-idiomatic structure; move
the imports to the module top-level (import job_store and from
workers.pdf_processor import process_pdf_task) and update the function to call
job_store.create_job(...) and process_pdf_task.delay(...) directly; if a
circular import prevents top-level imports, instead refactor to break the cycle
(e.g., move shared logic into a third module) or use a single, documented lazy
import fallback (e.g., importlib.import_module in a small helper) so the normal
path uses module-level imports.

In `@src/db/db.js`:
- Line 3: Remove the unused import of the fs module by deleting the line "const
fs = require('fs');" from the file (so the module is not required where it's
unused); update any related requires if you intended to use filesystem functions
elsewhere and run linting to ensure no other unused imports remain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b01747eb-ca0c-4684-8ae2-f39978e76a2f

📥 Commits

Reviewing files that changed from the base of the PR and between 2b7887f and fb3cb3e.

⛔ Files ignored due to path filters (3)
  • frontend/package-lock.json is excluded by !**/package-lock.json
  • package-lock.json is excluded by !**/package-lock.json
  • security/logs/audit.log is excluded by !**/*.log
📒 Files selected for processing (24)
  • database.sqlite
  • docker-compose.yml
  • frontend/src/App.js
  • frontend/src/components/UploadCard/UploadCard.jsx
  • frontend/src/services/api.js
  • package.json
  • rag-service/celery_app.py
  • rag-service/job_store.py
  • rag-service/main.py
  • rag-service/requirements.txt
  • rag-service/workers/__init__.py
  • rag-service/workers/pdf_processor.py
  • scratch/patch_app.py
  • scratch/patch_express.py
  • scratch/patch_fastapi.py
  • scratch/patch_process_pdf.py
  • scratch/patch_test.py
  • scratch/patch_uploadcard.py
  • server.js
  • src/controllers/authController.js
  • src/db/db.js
  • src/middleware/auth.js
  • src/services/authz/resource-access.js
  • src/services/security/audit.js

Comment thread frontend/src/App.js
Comment thread frontend/src/services/api.js
Comment thread rag-service/celery_app.py
Comment thread rag-service/main.py
Comment thread rag-service/workers/pdf_processor.py
Comment thread src/controllers/authController.js
Comment thread src/db/db.js
Comment thread src/middleware/auth.js
Comment thread src/services/authz/resource-access.js
Comment thread src/services/security/audit.js

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
server.js (3)

799-799: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-add the concurrent upload guard on /upload.

This route still receives the full PDF, writes it to disk, and forwards it upstream before responding. Dropping uploadConcurrencyGuard means one client can now open many large uploads in parallel and pin disk, network, and worker capacity despite the async queue.

🔧 Suggested fix
-app.post("/upload", authenticateUser, uploadLimiter, upload.single("file"), multerErrorHandler, async (req, res) => {
+app.post("/upload", authenticateUser, uploadLimiter, uploadConcurrencyGuard, upload.single("file"), multerErrorHandler, async (req, res) => {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server.js` at line 799, The POST /upload route removed the
uploadConcurrencyGuard middleware, allowing clients to open many concurrent
large uploads; restore uploadConcurrencyGuard into the middleware chain for the
route (use the uploadConcurrencyGuard symbol) so it runs before the heavy file
handling (i.e., place it before upload.single("file") in the app.post("/upload",
...) middleware list) to limit concurrent uploads and protect
disk/network/worker resources.

1060-1069: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Sanitize upstream errors in the job-status proxy.

This handler returns err.response?.data verbatim even in production. Any traceback or internal job metadata emitted by the RAG service becomes user-visible here, unlike the other proxied routes.

🔧 Suggested fix
   } catch (err) {
     const statusCode = err.response?.status || 500;
-    const details = err.response?.data || err.message;
-    return res.status(statusCode).json({ error: "Failed to fetch job status", details });
+    const details = extractServiceDetails(err, "Failed to fetch job status");
+    return res.status(statusCode).json({
+      error: "Failed to fetch job status",
+      details: isDevelopment ? details : "Internal processing error",
+    });
   }
 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server.js` around lines 1060 - 1069, The job-status proxy handler at
app.get("/upload/:jobId/status") currently returns err.response?.data verbatim;
change it to sanitize upstream errors by returning a generic error payload to
the client (e.g. { error: "Failed to fetch job status" }) and not including raw
err.response.data or stack traces; log the full upstream response/error
internally (use your existing logger or console) along with ctx info
(req.params.jobId and RAG_SERVICE_URL request metadata) so operators can debug,
and if you must return any upstream fields only allow a vetted whitelist (e.g. {
code: upstreamCode }) instead of the full err.response.data; keep ragAuthHeaders
usage unchanged for the proxied request.

961-961: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Preserve the gateway user when validating the Supabase token.

authenticateUser runs first here, but requireSupabaseAuth overwrites req.user. ragAuthHeaders(req) will then forward Supabase claims instead of the app user ID, which breaks the new per-user ownership checks on the RAG side.

🔧 Suggested fix
-      req.user = jwt.verify(token, secret);
+      req.supabaseUser = jwt.verify(token, secret);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server.js` at line 961, The route uses authenticateUser then
requireSupabaseAuth but requireSupabaseAuth currently overwrites req.user so
ragAuthHeaders(req) forwards Supabase claims instead of the gateway user; change
requireSupabaseAuth to preserve the gateway user by not assigning to req.user
(e.g., set req.supabaseUser or req.supabaseClaims instead, or only set req.user
if it's undefined), and update ragAuthHeaders to explicitly read the gateway
user from req.user and Supabase claims from req.supabaseUser when necessary so
per-user ownership checks use the original app user ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@server.js`:
- Line 799: The POST /upload route removed the uploadConcurrencyGuard
middleware, allowing clients to open many concurrent large uploads; restore
uploadConcurrencyGuard into the middleware chain for the route (use the
uploadConcurrencyGuard symbol) so it runs before the heavy file handling (i.e.,
place it before upload.single("file") in the app.post("/upload", ...) middleware
list) to limit concurrent uploads and protect disk/network/worker resources.
- Around line 1060-1069: The job-status proxy handler at
app.get("/upload/:jobId/status") currently returns err.response?.data verbatim;
change it to sanitize upstream errors by returning a generic error payload to
the client (e.g. { error: "Failed to fetch job status" }) and not including raw
err.response.data or stack traces; log the full upstream response/error
internally (use your existing logger or console) along with ctx info
(req.params.jobId and RAG_SERVICE_URL request metadata) so operators can debug,
and if you must return any upstream fields only allow a vetted whitelist (e.g. {
code: upstreamCode }) instead of the full err.response.data; keep ragAuthHeaders
usage unchanged for the proxied request.
- Line 961: The route uses authenticateUser then requireSupabaseAuth but
requireSupabaseAuth currently overwrites req.user so ragAuthHeaders(req)
forwards Supabase claims instead of the gateway user; change requireSupabaseAuth
to preserve the gateway user by not assigning to req.user (e.g., set
req.supabaseUser or req.supabaseClaims instead, or only set req.user if it's
undefined), and update ragAuthHeaders to explicitly read the gateway user from
req.user and Supabase claims from req.supabaseUser when necessary so per-user
ownership checks use the original app user ID.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6cf6ecc7-e51f-4251-b577-2b5f5e1b8da6

📥 Commits

Reviewing files that changed from the base of the PR and between e0689ed and 994ca3a.

⛔ Files ignored due to path filters (1)
  • frontend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (3)
  • docker-compose.yml
  • rag-service/main.py
  • server.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • docker-compose.yml
  • rag-service/main.py

@anshika1179

Copy link
Copy Markdown
Author

@FireFistisDead could u pleasse review this pr it has been 2 days

@FireFistisDead

Copy link
Copy Markdown
Owner

@anshika1179 sorry was busy can you reolve the merge conflits

@anshika1179

Copy link
Copy Markdown
Author

@FireFistisDead could u plz review my both pr first i had already solved merge conflict ... otherwise there will surely be more if other prs get reviewed first :((

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
rag-service/requirements.txt (1)

13-19: ⚠️ Potential issue | 🔴 Critical

Fix: pypdf is still imported at runtimerag-service/crawler/pdf_extractor.py contains from pypdf import PdfReader, so removing pypdf from rag-service/requirements.txt will still break execution (e.g., via that ingestion path / any “PyPDF fallback”).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/requirements.txt` around lines 13 - 19, The requirements list
removed pypdf but rag-service/crawler/pdf_extractor.py still imports from pypdf
(from pypdf import PdfReader), so either restore pypdf to
rag-service/requirements.txt (pin a compatible version) or refactor
pdf_extractor.py to stop using PdfReader and instead use the existing PyMuPDF
API (e.g., replace PdfReader usage with fitz-based extraction), ensuring all
runtime imports match the declared dependencies.
rag-service/main.py (4)

3087-3092: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Duplicate function definition breaks /ask endpoint.

The @app.post("/ask") decorator is bound to the first ask_question function (lines 3088-3090) which only calls cleanup_expired_sessions() and implicitly returns None. The second definition (line 3092) contains the actual implementation but is NOT registered as a route handler because the decorator was already consumed.

The /ask endpoint will return a null/empty response instead of processing questions.

🐛 Proposed fix

Remove the incomplete first definition and ensure the decorator is on the complete implementation:

 `@app.post`("/ask")
-def ask_question(data: Question, _ready: None = Depends(require_models_ready)):
-    cleanup_expired_sessions()
-
-
-def ask_question(data: Question):
+def ask_question(data: Question, _ready: None = Depends(require_models_ready)):
+    cleanup_expired_sessions()
     question = (data.question or "").strip()
     # ... rest of implementation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 3087 - 3092, Remove the accidental
duplicate/empty route handler: delete the first, incomplete definition of
ask_question that only calls cleanup_expired_sessions() and is decorated with
`@app.post`("/ask"), and move the decorator `@app.post`("/ask") and dependency
require_models_ready to the complete implementation of ask_question so the
actual handler (the function with the full request-processing logic) is
registered as the route; ensure cleanup_expired_sessions() is still invoked
inside that real ask_question as needed.

3852-3856: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Duplicate function definition breaks /summarize endpoint.

Same issue as /ask: the decorator is bound to the stub function that only calls cleanup_expired_sessions(), while the full implementation below is not registered as a route.

🐛 Proposed fix
 `@app.post`("/summarize")
-def summarize_pdf(data: SummarizeRequest, _ready: None = Depends(require_models_ready)):
-    cleanup_expired_sessions()
-def summarize_pdf(data: SummarizeRequest):
+def summarize_pdf(data: SummarizeRequest, _ready: None = Depends(require_models_ready)):
+    cleanup_expired_sessions()
     session_id = str(data.session_id)
     # ... rest of implementation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 3852 - 3856, The /summarize endpoint is
broken because there are two summarize_pdf definitions: one decorated that only
calls cleanup_expired_sessions() and the real implementation below which is not
registered; remove the duplicate stub or merge them so the decorated
summarize_pdf is the full implementation, ensuring the route decorator and
dependency (require_models_ready) are applied to the actual function that
accepts data: SummarizeRequest and uses session_id, and keep the call to
cleanup_expired_sessions() inside that single implementation.

2993-3000: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

File size limit is inconsistent with MAX_UPLOAD_SIZE_MB configuration.

Line 2993 hard-codes a 20MB limit, but line 311 defines MAX_UPLOAD_SIZE_MB with a 50MB default. This will reject files between 20-50MB even though the configurable limit allows them.

🐛 Proposed fix
-    max_size = 20 * 1024 * 1024
+    max_size = MAX_UPLOAD_SIZE_MB * 1024 * 1024
     bytes_written = 0
     with open(temp_path, "wb") as f_out:
         while chunk := file.file.read(65536):
             bytes_written += len(chunk)
             if bytes_written > max_size:
                 os.remove(temp_path)
-                raise HTTPException(status_code=413, detail="Uploaded PDF exceeds the maximum size of 20MB.")
+                raise HTTPException(status_code=413, detail=f"Uploaded PDF exceeds the maximum size of {MAX_UPLOAD_SIZE_MB}MB.")
             f_out.write(chunk)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2993 - 3000, The upload size check
currently hard-codes 20MB, causing inconsistency with the configurable
MAX_UPLOAD_SIZE_MB; replace the literal with a computed max_size =
MAX_UPLOAD_SIZE_MB * 1024 * 1024 and use that in the loop where bytes_written is
compared (symbols: MAX_UPLOAD_SIZE_MB, max_size, bytes_written, temp_path,
file.file.read); keep the same cleanup and HTTPException behavior but ensure the
computed max_size is used so the runtime config controls the limit.

2906-2918: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

user_id is undefined; ownership check will crash at runtime.

The function references user_id at line 2916 but never extracts it from the request. This will raise a NameError when a session with a user_id is accessed. Additionally, the continue at line 2918 is unreachable after raise.

🐛 Proposed fix
 `@app.post`("/sessions/lookup")
-def lookup_sessions(data: SessionsLookupRequest):
+def lookup_sessions(data: SessionsLookupRequest, request: Request):
     sessions_out = []
+    user_id = request.headers.get("X-User-Id")

     with sessions_lock:
         for item in data.sessions:
             sid = str(item.session_id)
             session = _touch_session_unlocked(sid)
             if not session:
                 raise HTTPException(status_code=404, detail="Session not found")
             if session.get("user_id") and session.get("user_id") != user_id:
                 raise HTTPException(status_code=403, detail="Forbidden")
-                continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2906 - 2918, The lookup_sessions handler
references an undefined user_id and contains an unreachable continue; fix by
extracting the caller's user id from the incoming request object (e.g.,
add/consume data.user_id or pull current user id from the auth/context used
elsewhere) before the loop, then use that value in the ownership check
(session.get("user_id") != user_id) inside the for-loop, and remove the
unreachable `continue` after the `raise`; locate symbols lookup_sessions,
SessionsLookupRequest, _touch_session_unlocked, and sessions_lock to apply the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/main.py`:
- Around line 1-2: Remove the duplicate FastAPI imports by consolidating them
into a single import statement from the fastapi module; keep only one import
that includes all used symbols (FastAPI, Depends, Request, HTTPException, File,
UploadFile, Form, Header, Query) so functions/classes that reference these names
(e.g., any usages of FastAPI, Depends, Request, HTTPException, File, UploadFile,
Form, Header, Query) will still resolve and there are no repeated import lines.

---

Outside diff comments:
In `@rag-service/main.py`:
- Around line 3087-3092: Remove the accidental duplicate/empty route handler:
delete the first, incomplete definition of ask_question that only calls
cleanup_expired_sessions() and is decorated with `@app.post`("/ask"), and move the
decorator `@app.post`("/ask") and dependency require_models_ready to the complete
implementation of ask_question so the actual handler (the function with the full
request-processing logic) is registered as the route; ensure
cleanup_expired_sessions() is still invoked inside that real ask_question as
needed.
- Around line 3852-3856: The /summarize endpoint is broken because there are two
summarize_pdf definitions: one decorated that only calls
cleanup_expired_sessions() and the real implementation below which is not
registered; remove the duplicate stub or merge them so the decorated
summarize_pdf is the full implementation, ensuring the route decorator and
dependency (require_models_ready) are applied to the actual function that
accepts data: SummarizeRequest and uses session_id, and keep the call to
cleanup_expired_sessions() inside that single implementation.
- Around line 2993-3000: The upload size check currently hard-codes 20MB,
causing inconsistency with the configurable MAX_UPLOAD_SIZE_MB; replace the
literal with a computed max_size = MAX_UPLOAD_SIZE_MB * 1024 * 1024 and use that
in the loop where bytes_written is compared (symbols: MAX_UPLOAD_SIZE_MB,
max_size, bytes_written, temp_path, file.file.read); keep the same cleanup and
HTTPException behavior but ensure the computed max_size is used so the runtime
config controls the limit.
- Around line 2906-2918: The lookup_sessions handler references an undefined
user_id and contains an unreachable continue; fix by extracting the caller's
user id from the incoming request object (e.g., add/consume data.user_id or pull
current user id from the auth/context used elsewhere) before the loop, then use
that value in the ownership check (session.get("user_id") != user_id) inside the
for-loop, and remove the unreachable `continue` after the `raise`; locate
symbols lookup_sessions, SessionsLookupRequest, _touch_session_unlocked, and
sessions_lock to apply the change.

In `@rag-service/requirements.txt`:
- Around line 13-19: The requirements list removed pypdf but
rag-service/crawler/pdf_extractor.py still imports from pypdf (from pypdf import
PdfReader), so either restore pypdf to rag-service/requirements.txt (pin a
compatible version) or refactor pdf_extractor.py to stop using PdfReader and
instead use the existing PyMuPDF API (e.g., replace PdfReader usage with
fitz-based extraction), ensuring all runtime imports match the declared
dependencies.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f1f28b22-50ed-4f7d-b9a2-c66d17b337fd

📥 Commits

Reviewing files that changed from the base of the PR and between 994ca3a and a605a9d.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (5)
  • frontend/src/App.js
  • package.json
  • rag-service/main.py
  • rag-service/requirements.txt
  • server.js
🚧 Files skipped from review as they are similar to previous changes (3)
  • package.json
  • frontend/src/App.js
  • server.js

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
rag-service/requirements.txt (1)

13-19: ⚠️ Potential issue | 🔴 Critical

Fix: pypdf is still imported at runtimerag-service/crawler/pdf_extractor.py contains from pypdf import PdfReader, so removing pypdf from rag-service/requirements.txt will still break execution (e.g., via that ingestion path / any “PyPDF fallback”).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/requirements.txt` around lines 13 - 19, The requirements list
removed pypdf but rag-service/crawler/pdf_extractor.py still imports from pypdf
(from pypdf import PdfReader), so either restore pypdf to
rag-service/requirements.txt (pin a compatible version) or refactor
pdf_extractor.py to stop using PdfReader and instead use the existing PyMuPDF
API (e.g., replace PdfReader usage with fitz-based extraction), ensuring all
runtime imports match the declared dependencies.
rag-service/main.py (4)

3087-3092: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Duplicate function definition breaks /ask endpoint.

The @app.post("/ask") decorator is bound to the first ask_question function (lines 3088-3090) which only calls cleanup_expired_sessions() and implicitly returns None. The second definition (line 3092) contains the actual implementation but is NOT registered as a route handler because the decorator was already consumed.

The /ask endpoint will return a null/empty response instead of processing questions.

🐛 Proposed fix

Remove the incomplete first definition and ensure the decorator is on the complete implementation:

 `@app.post`("/ask")
-def ask_question(data: Question, _ready: None = Depends(require_models_ready)):
-    cleanup_expired_sessions()
-
-
-def ask_question(data: Question):
+def ask_question(data: Question, _ready: None = Depends(require_models_ready)):
+    cleanup_expired_sessions()
     question = (data.question or "").strip()
     # ... rest of implementation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 3087 - 3092, Remove the accidental
duplicate/empty route handler: delete the first, incomplete definition of
ask_question that only calls cleanup_expired_sessions() and is decorated with
`@app.post`("/ask"), and move the decorator `@app.post`("/ask") and dependency
require_models_ready to the complete implementation of ask_question so the
actual handler (the function with the full request-processing logic) is
registered as the route; ensure cleanup_expired_sessions() is still invoked
inside that real ask_question as needed.

3852-3856: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Duplicate function definition breaks /summarize endpoint.

Same issue as /ask: the decorator is bound to the stub function that only calls cleanup_expired_sessions(), while the full implementation below is not registered as a route.

🐛 Proposed fix
 `@app.post`("/summarize")
-def summarize_pdf(data: SummarizeRequest, _ready: None = Depends(require_models_ready)):
-    cleanup_expired_sessions()
-def summarize_pdf(data: SummarizeRequest):
+def summarize_pdf(data: SummarizeRequest, _ready: None = Depends(require_models_ready)):
+    cleanup_expired_sessions()
     session_id = str(data.session_id)
     # ... rest of implementation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 3852 - 3856, The /summarize endpoint is
broken because there are two summarize_pdf definitions: one decorated that only
calls cleanup_expired_sessions() and the real implementation below which is not
registered; remove the duplicate stub or merge them so the decorated
summarize_pdf is the full implementation, ensuring the route decorator and
dependency (require_models_ready) are applied to the actual function that
accepts data: SummarizeRequest and uses session_id, and keep the call to
cleanup_expired_sessions() inside that single implementation.

2993-3000: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

File size limit is inconsistent with MAX_UPLOAD_SIZE_MB configuration.

Line 2993 hard-codes a 20MB limit, but line 311 defines MAX_UPLOAD_SIZE_MB with a 50MB default. This will reject files between 20-50MB even though the configurable limit allows them.

🐛 Proposed fix
-    max_size = 20 * 1024 * 1024
+    max_size = MAX_UPLOAD_SIZE_MB * 1024 * 1024
     bytes_written = 0
     with open(temp_path, "wb") as f_out:
         while chunk := file.file.read(65536):
             bytes_written += len(chunk)
             if bytes_written > max_size:
                 os.remove(temp_path)
-                raise HTTPException(status_code=413, detail="Uploaded PDF exceeds the maximum size of 20MB.")
+                raise HTTPException(status_code=413, detail=f"Uploaded PDF exceeds the maximum size of {MAX_UPLOAD_SIZE_MB}MB.")
             f_out.write(chunk)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2993 - 3000, The upload size check
currently hard-codes 20MB, causing inconsistency with the configurable
MAX_UPLOAD_SIZE_MB; replace the literal with a computed max_size =
MAX_UPLOAD_SIZE_MB * 1024 * 1024 and use that in the loop where bytes_written is
compared (symbols: MAX_UPLOAD_SIZE_MB, max_size, bytes_written, temp_path,
file.file.read); keep the same cleanup and HTTPException behavior but ensure the
computed max_size is used so the runtime config controls the limit.

2906-2918: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

user_id is undefined; ownership check will crash at runtime.

The function references user_id at line 2916 but never extracts it from the request. This will raise a NameError when a session with a user_id is accessed. Additionally, the continue at line 2918 is unreachable after raise.

🐛 Proposed fix
 `@app.post`("/sessions/lookup")
-def lookup_sessions(data: SessionsLookupRequest):
+def lookup_sessions(data: SessionsLookupRequest, request: Request):
     sessions_out = []
+    user_id = request.headers.get("X-User-Id")

     with sessions_lock:
         for item in data.sessions:
             sid = str(item.session_id)
             session = _touch_session_unlocked(sid)
             if not session:
                 raise HTTPException(status_code=404, detail="Session not found")
             if session.get("user_id") and session.get("user_id") != user_id:
                 raise HTTPException(status_code=403, detail="Forbidden")
-                continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 2906 - 2918, The lookup_sessions handler
references an undefined user_id and contains an unreachable continue; fix by
extracting the caller's user id from the incoming request object (e.g.,
add/consume data.user_id or pull current user id from the auth/context used
elsewhere) before the loop, then use that value in the ownership check
(session.get("user_id") != user_id) inside the for-loop, and remove the
unreachable `continue` after the `raise`; locate symbols lookup_sessions,
SessionsLookupRequest, _touch_session_unlocked, and sessions_lock to apply the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rag-service/main.py`:
- Around line 1-2: Remove the duplicate FastAPI imports by consolidating them
into a single import statement from the fastapi module; keep only one import
that includes all used symbols (FastAPI, Depends, Request, HTTPException, File,
UploadFile, Form, Header, Query) so functions/classes that reference these names
(e.g., any usages of FastAPI, Depends, Request, HTTPException, File, UploadFile,
Form, Header, Query) will still resolve and there are no repeated import lines.

---

Outside diff comments:
In `@rag-service/main.py`:
- Around line 3087-3092: Remove the accidental duplicate/empty route handler:
delete the first, incomplete definition of ask_question that only calls
cleanup_expired_sessions() and is decorated with `@app.post`("/ask"), and move the
decorator `@app.post`("/ask") and dependency require_models_ready to the complete
implementation of ask_question so the actual handler (the function with the full
request-processing logic) is registered as the route; ensure
cleanup_expired_sessions() is still invoked inside that real ask_question as
needed.
- Around line 3852-3856: The /summarize endpoint is broken because there are two
summarize_pdf definitions: one decorated that only calls
cleanup_expired_sessions() and the real implementation below which is not
registered; remove the duplicate stub or merge them so the decorated
summarize_pdf is the full implementation, ensuring the route decorator and
dependency (require_models_ready) are applied to the actual function that
accepts data: SummarizeRequest and uses session_id, and keep the call to
cleanup_expired_sessions() inside that single implementation.
- Around line 2993-3000: The upload size check currently hard-codes 20MB,
causing inconsistency with the configurable MAX_UPLOAD_SIZE_MB; replace the
literal with a computed max_size = MAX_UPLOAD_SIZE_MB * 1024 * 1024 and use that
in the loop where bytes_written is compared (symbols: MAX_UPLOAD_SIZE_MB,
max_size, bytes_written, temp_path, file.file.read); keep the same cleanup and
HTTPException behavior but ensure the computed max_size is used so the runtime
config controls the limit.
- Around line 2906-2918: The lookup_sessions handler references an undefined
user_id and contains an unreachable continue; fix by extracting the caller's
user id from the incoming request object (e.g., add/consume data.user_id or pull
current user id from the auth/context used elsewhere) before the loop, then use
that value in the ownership check (session.get("user_id") != user_id) inside the
for-loop, and remove the unreachable `continue` after the `raise`; locate
symbols lookup_sessions, SessionsLookupRequest, _touch_session_unlocked, and
sessions_lock to apply the change.

In `@rag-service/requirements.txt`:
- Around line 13-19: The requirements list removed pypdf but
rag-service/crawler/pdf_extractor.py still imports from pypdf (from pypdf import
PdfReader), so either restore pypdf to rag-service/requirements.txt (pin a
compatible version) or refactor pdf_extractor.py to stop using PdfReader and
instead use the existing PyMuPDF API (e.g., replace PdfReader usage with
fitz-based extraction), ensuring all runtime imports match the declared
dependencies.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f1f28b22-50ed-4f7d-b9a2-c66d17b337fd

📥 Commits

Reviewing files that changed from the base of the PR and between 994ca3a and a605a9d.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (5)
  • frontend/src/App.js
  • package.json
  • rag-service/main.py
  • rag-service/requirements.txt
  • server.js
🚧 Files skipped from review as they are similar to previous changes (3)
  • package.json
  • frontend/src/App.js
  • server.js
🛑 Comments failed to post (1)
rag-service/main.py (1)

1-2: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Consolidate duplicate FastAPI imports.

FastAPI, Request, HTTPException, File, UploadFile, and Form are imported on both lines. Merge into a single import statement.

🔧 Suggested fix
-from fastapi import Depends, FastAPI, Request, HTTPException, File, UploadFile, Form
-from fastapi import FastAPI, Request, HTTPException, File, UploadFile, Form, Header, Query
+from fastapi import Depends, FastAPI, Request, HTTPException, File, UploadFile, Form, Header, Query
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

from fastapi import Depends, FastAPI, Request, HTTPException, File, UploadFile, Form, Header, Query
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 1 - 2, Remove the duplicate FastAPI imports
by consolidating them into a single import statement from the fastapi module;
keep only one import that includes all used symbols (FastAPI, Depends, Request,
HTTPException, File, UploadFile, Form, Header, Query) so functions/classes that
reference these names (e.g., any usages of FastAPI, Depends, Request,
HTTPException, File, UploadFile, Form, Header, Query) will still resolve and
there are no repeated import lines.

Comment thread server.js
Comment on lines +1101 to +1103
const response = await axios.get(`${RAG_SERVICE_URL}/process-pdf/${req.params.jobId}/status`, {
headers: ragAuthHeaders(req)
});
@FireFistisDead

Copy link
Copy Markdown
Owner

@anshika1179 all tests are falinig

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rag-service/main.py (1)

3719-3737: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stale session_id variable used after multi-session loop.

After iterating through session_id_list, the session_id variable retains only the last session's ID. The evidence-gate log, chat exchange append, and dirty-marking all reference this stale variable, which is incorrect when multiple sessions are provided.

For multi-session queries, consider either:

  1. Tracking the "primary" session explicitly for logging/chat history
  2. Or appending chat exchanges to all participating sessions
Suggested fix approach
+    # Use first session as primary for logging and chat history
+    primary_session_id = session_id_list[0] if session_id_list else None
+
     # Sort all candidates from multiple documents by score ascending
     all_scored_candidates.sort(key=lambda x: x[1])
     ...
     
-            session_id,
+            primary_session_id,
             intent,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 3719 - 3737, The variable `session_id`
retains the last value from the `session_id_list` loop iteration after the loop
completes, causing the evidence gate logging, append_chat_exchange call, and
_mark_session_dirty call to all operate on a stale session ID instead of the
correct one for each session. To fix this, either explicitly track and use a
single "primary" session ID before the loop (separate from the iteration
variable) for the logging and chat history operations, or modify the logic to
append the chat exchange and mark as dirty for all sessions in the
session_id_list instead of relying on the loop's final value. Ensure that the
passes_evidence_gate check result is correctly applied to all relevant sessions
in the multi-session context.
🧹 Nitpick comments (2)
rag-service/main.py (1)

3646-3648: 💤 Low value

Consider exception chaining for better tracebacks.

Per static analysis, use raise ... from exc to preserve the exception chain for debugging.

             except Exception as exc:
                 logger.error("Failed to lazy load vectorstore session_id=%s error=%s", session_id, exc)
-                raise HTTPException(status_code=500, detail="Failed to load session index.")
+                raise HTTPException(status_code=500, detail="Failed to load session index.") from exc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rag-service/main.py` around lines 3646 - 3648, The exception raised in the
except block does not use exception chaining, which loses the original error
context. Modify the raise HTTPException statement to use exception chaining by
adding `from exc` at the end, changing it to `raise
HTTPException(status_code=500, detail="Failed to load session index.") from
exc`. This preserves the original exception chain for better debugging and
clearer tracebacks.

Source: Linters/SAST tools

server.js (1)

1700-1731: 💤 Low value

Duplicate requireInternalRagToken() call.

requireInternalRagToken() is called at line 1701 and again at line 1709 inside the async IIFE. The second call is redundant since the first would already throw and exit if the token is missing.

     requireInternalRagToken();
     if (!SUPABASE_JWT_SECRET) {
       console.error("SUPABASE_JWT_SECRET missing in .env – required for /process-from-url authentication");
       process.exit(1);
     }

     (async () => {
       try {
-        requireInternalRagToken();
-
         if (redisConnectPromise) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server.js` around lines 1700 - 1731, The requireInternalRagToken() function
is called twice in succession: once before the async IIFE and again at the
beginning of the async IIFE. Since the first call will throw and exit the
process if the token is missing, the second call inside the async IIFE is
redundant and unnecessary. Remove the duplicate requireInternalRagToken() call
from inside the async IIFE (the one that appears after the opening try block
inside the async function).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@rag-service/main.py`:
- Around line 3719-3737: The variable `session_id` retains the last value from
the `session_id_list` loop iteration after the loop completes, causing the
evidence gate logging, append_chat_exchange call, and _mark_session_dirty call
to all operate on a stale session ID instead of the correct one for each
session. To fix this, either explicitly track and use a single "primary" session
ID before the loop (separate from the iteration variable) for the logging and
chat history operations, or modify the logic to append the chat exchange and
mark as dirty for all sessions in the session_id_list instead of relying on the
loop's final value. Ensure that the passes_evidence_gate check result is
correctly applied to all relevant sessions in the multi-session context.

---

Nitpick comments:
In `@rag-service/main.py`:
- Around line 3646-3648: The exception raised in the except block does not use
exception chaining, which loses the original error context. Modify the raise
HTTPException statement to use exception chaining by adding `from exc` at the
end, changing it to `raise HTTPException(status_code=500, detail="Failed to load
session index.") from exc`. This preserves the original exception chain for
better debugging and clearer tracebacks.

In `@server.js`:
- Around line 1700-1731: The requireInternalRagToken() function is called twice
in succession: once before the async IIFE and again at the beginning of the
async IIFE. Since the first call will throw and exit the process if the token is
missing, the second call inside the async IIFE is redundant and unnecessary.
Remove the duplicate requireInternalRagToken() call from inside the async IIFE
(the one that appears after the opening try block inside the async function).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c275290b-36da-4b2f-99a0-f0b46f279a66

📥 Commits

Reviewing files that changed from the base of the PR and between a605a9d and fa3671c.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (4)
  • frontend/src/App.js
  • frontend/src/services/api.js
  • rag-service/main.py
  • server.js
🚧 Files skipped from review as they are similar to previous changes (2)
  • frontend/src/services/api.js
  • frontend/src/App.js

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend Express or API gateway work bug Something isn't working docs Documentation only enhancement New feature or request feature A new feature or improvement fix A targeted fix or cleanup frontend Frontend-related work gssoc:approved level:advanced quality:clean rag-service FastAPI / model service work type:security type:testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Decouple PDF Processing with an Asynchronous Message Queue

3 participants