Mission: The first open-source, plug-and-play, AI-powered UPSC learning platform. Built for Desktop & Web. Free forever. Anyone can add content. Anyone can fork it for any exam.
This plan supersedes all prior versions. It merges the original architecture, senior developer review (12 points), and all revisions into a single source of truth.
- Why This Will Succeed
- Architecture Overview
- Content Pack System
- All 20 Gaps → Resolved
- All 12 Assumptions → Mitigated
- All 6 Decisions → Resolved
- POC Tests & Eval Harness
- Cold-Start Content Strategy
- Integration Reality Check
- Team & Capacity Model
- 16-Week Timeline
- Post-v1 Roadmap (Phase 5)
- Cost Model
- Operational Runbook
There is no free, AI-powered, open-source UPSC preparation tool. 15 lakh+ aspirants sit for UPSC every year. The market is served by ₹50K–₹2L coaching packages and ₹5K–₹15K app subscriptions. An open-source alternative with AI + community content will spread like wildfire.
| # | Advantage | Why It Works |
|---|---|---|
| 1 | Zero-Cost Operation | Gemini free tier (1,500 req/day) + Vercel free + Supabase free = ₹0/month at <50 users. No VC funding needed. Students trust it because there's no business model to corrupt it. |
| 2 | 200K Lines of Code for Free | DeepTutor (16.5K stars, Apache-2.0) gives us RAG, quiz generation, TutorBots, memory, CLI, multi-channel agents, all production-tested. We build a UPSC skin, not an AI engine. |
| 3 | Community Builds The Product | Content packs = the moat. Every student who adds PYQs, notes, or flashcards makes the platform better for everyone. Wikipedia model for UPSC prep. Community contributes AFTER v1 launches with self-authored seed content. |
| 4 | Network Effects | More content → better AI answers → more students → more content contributed → stronger RAG → better quiz generation → the cycle accelerates. |
| 5 | Fork-Ready = Unstoppable | GATE, SSC, State PSC, NEET: fork, swap content packs, change branding. The engine doesn't care about the exam. |
| 6 | Self-Hosted = No Lock-In | Students own their data. No server dependency. No "company shutting down" risk. Open source = permanent. |
| 7 | India-Specific Timing | India has the world's largest competitive exam ecosystem (5Cr+ aspirants/year combined). India-specific open-source AI tools are nearly zero. |
| 8 | No Competitor Can Match Free + Open + AI | Unacademy (₹10K+/yr), Testbook (₹5K+/yr): all closed-source, subscription-based, no AI tutoring. Free + AI-native is a different category. |
| Platform | Price | AI-Powered | Open Source | Offline | Community Content | Self-Hosted |
|---|---|---|---|---|---|---|
| Unacademy | ₹10K–₹60K/yr | ❌ | ❌ | ❌ | ❌ | ❌ |
| Testbook | ₹5K–₹15K/yr | ✅ Basic | ❌ | ❌ | ❌ | ❌ |
| BYJU's | ₹30K–₹1.5L | ❌ | ❌ | ❌ | ❌ | ❌ |
| Khan Academy | Free | ❌ | ❌ | ❌ | ❌ | ❌ |
| Free YouTube/Telegram | Free | ❌ | N/A | ❌ | ✅ Informal | N/A |
| GS360 Open Source | Free | ✅ Full AI | ✅ Yes | ✅ PWA | ✅ Plug-and-Play | ✅ Yes |
```mermaid
graph LR
A["Student joins GS360"] --> B["Uses AI features for free"]
B --> C["Studies with content packs"]
C --> D["Creates their own notes/MCQs"]
D --> E["Submits as content pack PR"]
E --> F["Community validates + merges"]
F --> G["RAG knowledge base grows"]
G --> H["AI answers get better"]
H --> I["Word spreads to more students"]
I --> A
style A fill:#22C55E,color:#000
style G fill:#3B82F6,color:#fff
style I fill:#DC3545,color:#fff
```
```mermaid
graph TB
subgraph "Frontend – GS360 UI"
A["Next.js App Shell"] --> B["GS360 Design System"]
B --> C["Daily Command Center"]
B --> D["AI Notes / Chat"]
B --> E["Notes & Materials"]
B --> F["Testing / Quiz"]
B --> G["Performance Dashboard"]
B --> H["Plan View"]
end
subgraph "Auth & Multi-Tenancy Layer"
AUTH["NextAuth.js – Google/GitHub OAuth"]
SEC["Security Middleware – Path Validation + Audit Log"]
MT["Per-User Namespace Manager"]
RL["Token-Bucket Rate Limiter"]
end
subgraph "Backend – DeepTutor Engine"
N["FastAPI Server"] --> O["RAG Pipeline"]
N --> P["Chat / Deep Solve / Quiz Gen"]
N --> Q["Knowledge Base Manager"]
N --> R["TutorBot Agent System"]
N --> S["Persistent Memory"]
end
subgraph "Plug-and-Play Content Layer"
PP1["content-packs/ (Git-tracked)"]
PP2["Content Registry – manifest.json"]
PP3["Community Content Hub – GitHub"]
PP1 --> PP2
PP3 -->|"PR + Review"| PP1
end
subgraph "Data Layer"
V[("Knowledge Bases – Per User")]
W[("User Data / Sessions")]
X[("Embeddings – Vector Store")]
BK[("Daily Backup – R2/S3")]
end
subgraph "External Services"
Y["LLM Provider – Gemini → DeepSeek → Ollama Fallback"]
Z["Embedding Provider"]
AA["Search Provider – SearXNG Self-Hosted"]
end
subgraph "Evaluation & Monitoring"
EVAL["RAG Eval Harness – Continuous"]
COST["LLM Cost Monitor + Alerts"]
HEALTH["System Health Dashboard"]
end
A <-->|"WebSocket + REST"| AUTH
AUTH --> SEC
SEC --> MT
MT --> N
RL --> N
PP1 -->|"Auto-ingest on boot"| Q
N --> V
N --> W
N --> X
W --> BK
N --> Y
N --> Z
N --> AA
N --> EVAL
COST --> Y
```
Note
v1 scope exclusions: Voice Bot (Siri Orb), AI Cowork Studio, and Study Mode Focus Timer are deferred to v1.1. This is a deliberate scope-discipline decision, not a technical limitation.
```
gs360-live/
├── content-packs/                     # ALL content lives here
│   ├── registry.json                  # Master manifest; lists all packs
│   │
│   ├── upsc-polity/                   # One folder = one content pack
│   │   ├── pack.json                  # Pack metadata
│   │   ├── documents/                 # Raw source materials (PDF, MD, TXT)
│   │   ├── questions/                 # MCQ + Mains question banks (JSON)
│   │   ├── notes/                     # Pre-made study notes (Markdown)
│   │   ├── prompts/                   # TutorBot personas (Markdown)
│   │   ├── flashcards/                # Spaced repetition cards (JSON)
│   │   └── cache/                     # Pre-generated AI outputs (fallback)
│   │
│   ├── upsc-economy/
│   ├── upsc-history/
│   ├── upsc-current-affairs-apr-2026/
│   └── gate-cse/                      # Non-UPSC; fork-ready
│
├── private-vault/                     # User's PRIVATE content (not in Git)
│   ├── uploads/
│   ├── video-transcripts/
│   └── custom-kb/
│
├── eval/                              # Evaluation datasets
│   ├── golden-dataset.json            # 200 UPSC questions with verified answers
│   ├── eval-results/                  # Weekly eval run outputs
│   └── eval-runner.py                 # Automated eval script
│
├── templates/                         # Pack creation templates
│   ├── pack-template/
│   └── CONTENT_GUIDE.md
│
└── scripts/
    ├── ingest-packs.py                # Auto-ingest into DeepTutor KBs
    ├── validate-pack.py               # Schema + quality validation
    ├── export-pack.py                 # Export KB back to pack format
    └── cost-monitor.py                # LLM usage tracking + alerts
```
- `pack.json`: declares metadata, exam target, subject, language, content counts. The ingestion script reads this to route content.
- `questions/*.json`: `id`, `year`, `question`, `options[]`, `correct`, `explanation`, `difficulty`, `topics[]`, `source`, `contributor`. Feeds the quiz engine + RAG context.
- `flashcards/*.json`: `front`, `back`, `difficulty`, `topic` for spaced repetition.
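For illustration, a minimal `questions/*.json` entry using the fields above might look like this. All values are placeholders, and the convention that `correct` is a zero-based option index is an assumption, not part of the spec:

```json
[
  {
    "id": "polity-2019-042",
    "year": 2019,
    "question": "Which Article of the Constitution deals with the Right to Constitutional Remedies?",
    "options": ["Article 14", "Article 21", "Article 32", "Article 368"],
    "correct": 2,
    "explanation": "Article 32 lets citizens move the Supreme Court directly to enforce Fundamental Rights.",
    "difficulty": "easy",
    "topics": ["polity", "fundamental-rights"],
    "source": "UPSC Prelims (illustrative)",
    "contributor": "github-username"
  }
]
```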
```
Contributor (no code needed)
────────────────────────────
1. Fork repo
2. Copy templates/pack-template/
3. Add PDFs to documents/
4. Add MCQs to questions/*.json
   (or use web form → auto-generates JSON)
5. Edit pack.json metadata
6. Submit PR

Automated Pipeline
──────────────────
→ GitHub Actions triggers:
    → validate-pack.py (schema)
    → Question format check
    → Copyright scan (hash + text fingerprint)
    → Duplicate detection (embedding similarity)
    → LLM quality scorer
→ 2 community reviewers approve
→ Auto-merge → Auto-ingest
```
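As a sketch of the first pipeline step, the schema check in `validate-pack.py` could look roughly like this. The field list comes from the questions schema above; treating `correct` as an option index is my assumption:

```python
# scripts/validate-pack.py (sketch): schema check for questions/*.json
import json
import sys
from pathlib import Path

REQUIRED_FIELDS = {"id", "year", "question", "options", "correct",
                   "explanation", "difficulty", "topics", "source", "contributor"}

def validate_questions_file(path: Path) -> list[str]:
    """Return human-readable errors; an empty list means the file passes."""
    errors = []
    questions = json.loads(path.read_text(encoding="utf-8"))
    for i, q in enumerate(questions):
        missing = REQUIRED_FIELDS - q.keys()
        if missing:
            errors.append(f"{path}:{i}: missing fields {sorted(missing)}")
        # Assumption: 'correct' is a zero-based index into options[].
        elif not (0 <= q["correct"] < len(q["options"])):
            errors.append(f"{path}:{i}: 'correct' index out of range")
    return errors

if __name__ == "__main__":
    pack_dir = Path(sys.argv[1])
    all_errors = [e for f in sorted((pack_dir / "questions").glob("*.json"))
                  for e in validate_questions_file(f)]
    print("\n".join(all_errors) or "OK")
    sys.exit(1 if all_errors else 0)
```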
Solution: Dual-Mode Auth

| Mode | Auth | Use Case |
|---|---|---|
| Self-Hosted (Single User) | `AUTH_MODE=none`: no auth needed. Works like stock DeepTutor. | Student running locally |
| Hosted / Multi-User | `AUTH_MODE=multi`: NextAuth.js with Google OAuth, GitHub OAuth, Email magic link (Resend free tier: 3K emails/mo). Session stored in Supabase free tier. | Shared hosted platform |

Effort: 1 day. NextAuth.js is drop-in for Next.js. A Supabase adapter exists.
Caution
This is the #1 technical risk. The UserNamespace class is the easy part. The hard part is tracing every hardcoded path assumption across DeepTutor's 16K+ lines. This is debugging work, not codegen. Phase 0 Spike validates feasibility before committing.
Solution: Per-User Namespace Manager + Security Middleware
```python
# middleware/namespace.py
import os
import re

class UserNamespace:
    """Routes all DeepTutor file operations to user-specific directories."""

    VALID_USER_ID = re.compile(r'^[a-zA-Z0-9_-]{1,64}$')

    def __init__(self, user_id: str):
        if not self.VALID_USER_ID.match(user_id):
            raise ValueError(f"Invalid user_id: {user_id}")
        self.user_id = user_id
        self.base = os.path.realpath(f"data/users/{user_id}")
        # Trailing separator prevents a sibling dir like "data/users_evil"
        # from passing a bare prefix check.
        if not self.base.startswith(os.path.realpath("data/users") + os.sep):
            raise PermissionError(f"Path traversal attempt: {user_id}")
```

```python
# middleware/security.py
class SecurityMiddleware:
    """Request-level security enforcement. Every API call passes through this."""

    def validate_request(self, request, jwt_claims: dict):
        user_id = jwt_claims["sub"]  # From JWT, NEVER from request body
        namespace = UserNamespace(user_id)
        audit_logger.info(f"ACCESS user={user_id} endpoint={request.path}")
        return namespace

    def validate_file_path(self, namespace, requested_path: str):
        real_path = os.path.realpath(requested_path)
        user_base = os.path.realpath(namespace.base) + os.sep
        shared_base = os.path.realpath("data/knowledge_bases") + os.sep
        if real_path.startswith(user_base) or real_path.startswith(shared_base):
            return True
        audit_logger.warning(f"BLOCKED user={namespace.user_id} path={requested_path}")
        raise PermissionError("Access denied: path outside your namespace")
```

Data Isolation:
```
data/
├── knowledge_bases/            # SHARED: content packs (read-only for users)
└── users/                      # ISOLATED: per-user data
    ├── user_abc123/
    │   ├── memory/             # Learner profile
    │   ├── sessions/           # Chat history
    │   ├── notebooks/          # Saved notes
    │   └── knowledge_bases/    # Private uploads (physically separate vector index)
    └── user_def456/
        └── ...
```
Security Requirements (Non-Negotiable for v1):
| Requirement | Implementation |
|---|---|
| Path traversal prevention | os.path.realpath() + prefix validation on every file access |
| Index isolation | User private KBs use physically separate vector stores, NOT filtered views |
| Request-level auth | user_id from JWT sub claim, NEVER from request body |
| Audit logging | Every file access logged with user_id, path, timestamp |
| WebSocket isolation | Each WS connection authenticated + bound to single user namespace |
| Pre-launch pen test | 10 common attack vectors (path traversal, IDOR, session fixation, KB cross-contamination) |
Key Design: Content packs (community knowledge) = shared read-only. User data (notes, scores, memory) = fully isolated with physically separate vector indices. RAG queries merge both at query-time, never at index-time.
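A minimal sketch of that query-time merge, with `shared_index_retriever` and `open_user_index()` as hypothetical stand-ins for the real LlamaIndex retrievers:

```python
# Query-time merge of shared + private KBs (sketch; retriever API simplified)
def retrieve_for_user(namespace, query: str, top_k: int = 5):
    # Shared index: built once from content-packs/, read-only for all users.
    shared_hits = shared_index_retriever.retrieve(query)            # hypothetical handle
    # Private index: physically separate store under the user's namespace.
    private_hits = open_user_index(namespace.base).retrieve(query)  # hypothetical helper
    # Merge at query time only; the two indices themselves are never mixed.
    merged = sorted(shared_hits + private_hits,
                    key=lambda hit: hit.score, reverse=True)
    return merged[:top_k]
```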
Effort: 5–7 days realistic.
- 2 days patching DeepTutor's file path resolution
- 2 days for LlamaIndex per-user vector store pooling
- 1–2 days for WebSocket session isolation
- 1 day for security middleware + audit logging
Decision: Desktop-first. No mobile app. Desktop/web optimized for deep research & note-taking.
```css
.app {
  display: grid;
  grid-template-columns: 240px 1fr; /* Fixed sidebar */
  height: 100vh;
}
@media (max-width: 1024px) {
  .app { grid-template-columns: 80px 1fr; } /* Collapsed sidebar */
}
@media (max-width: 768px) {
  .app { grid-template-columns: 1fr; }
  .sidebar { display: none; } /* Hamburger menu on mobile */
}
```

Validation: Post-launch, Plausible analytics tracks device type. If >60% of traffic is mobile after 30 days, reconsider in v1.1 with data, not assumptions.
Effort: 1 day.
Solution: Progressive Web App + Offline Quiz
| Feature | Offline? | How |
|---|---|---|
| Quiz (from content packs) | ✅ Full | Question banks cached in IndexedDB |
| Flashcard revision | ✅ Full | Cached locally |
| Read/write notes | ✅ Full | IndexedDB, synced on reconnect |
| Study timer | ✅ Full | Client-side only |
| AI Chat / Notes Gen | ❌ No | Shows "Connect to internet for AI features" |
| Upload materials | ⚠️ Queued | Saved locally, uploaded when online |
Effort: 3 days.
Solution: Token-Bucket Rate Limiter + Multi-Provider Fallback Chain
```
Request comes in
      ↓
1. Try Gemini Flash 2.0 (free, fast)
      ↓ if rate-limited or down
2. Try DeepSeek V3 ($0.14/M tokens, ultra-cheap backup)
      ↓ if rate-limited or down
3. Try Ollama local (if self-hosted with GPU)
      ↓ if unavailable
4. Serve pre-generated cached response (from content pack cache/)
      ↓ if nothing cached
5. Show "AI quota reached – try again in X minutes" + offer offline quiz
```
Per-user limits (free tier): 20 AI requests/hour, 100/day, 10 quiz generations/day, 3 deep research/day.
Effort: 3 days.
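A minimal sketch of the limiter plus fallback chain, assuming hypothetical provider clients (`gemini_client`, etc.), exception types, and a `cached_response()` helper; the real implementation will differ:

```python
# Token-bucket limiter + provider fallback chain (sketch)
import time

class TokenBucket:
    """Refills `rate` tokens/sec up to `capacity`; one token per AI request."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 20 requests/hour from the free-tier limits above, with a small burst allowance.
user_bucket = TokenBucket(rate=20 / 3600, capacity=5)

def answer(prompt: str) -> str:
    if not user_bucket.allow():
        return "AI quota reached – try again later (offline quiz still works)."
    for provider in (gemini_client, deepseek_client, ollama_client):   # hypothetical clients
        try:
            return provider.complete(prompt)
        except (RateLimitError, ProviderDownError):                    # hypothetical exceptions
            continue
    # Step 4 of the chain: pre-generated cache from the content pack's cache/ dir.
    return cached_response(prompt) or "AI quota reached – try the offline quiz."
```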
```
Layer 1: Automated Scan (on PR / upload)
─────────────────────────────────────────
• SHA-256 hash check against known copyrighted PDFs
• Filename pattern matching ("Laxmikanth*.pdf", etc.)
• File size threshold (>5MB PDF flagged)
• PDF metadata extraction (author/publisher fields)
• Text fingerprinting: extract 10 random pages, compute n-gram
  signatures against known textbook corpus
• Paragraph-level similarity against reference corpus (~500 paragraphs)

Layer 2: Community Review (on PR)
──────────────────────────────────
• 2 reviewer approvals required
• PR template checklist: original / public domain / PYQ / fair use

Layer 3: DMCA Process (post-publish)
─────────────────────────────────────
• DMCA.md in repo root with takedown instructions
• Email: dmca@gs360.study
• Response SLA: 48 hours
• Auto-remove on valid claim, reinstate on counter-notice
```
Note
Layer 1 will never be bulletproof; even YouTube can't reliably detect copyright infringement. The 3-layer approach is the industry standard used by GitHub, Wikipedia, and the Internet Archive.
Allowed: NCERT, UPSC PYQs, Constitution, PIB, Economic Survey, Budget docs, original notes. Not allowed: Full copyrighted textbooks, coaching material, scanned paid test series.
Effort: 2 days.
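A sketch of the hash and filename portions of the Layer 1 scan; the hash list, blocked patterns, and size threshold below are placeholders, and the fingerprinting/similarity steps are separate:

```python
# Layer 1 copyright scan (sketch): hash + filename + size checks only
import fnmatch
import hashlib
from pathlib import Path

KNOWN_HASHES: set[str] = set()          # loaded from a maintained list of flagged PDFs
BLOCKED_PATTERNS = ["laxmikanth*.pdf"]  # illustrative pattern from the plan
MAX_PDF_BYTES = 5 * 1024 * 1024         # >5MB PDFs get flagged for manual review

def scan_pdf(path: Path) -> list[str]:
    """Return a list of flags; an empty list means the file passed this layer."""
    flags = []
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in KNOWN_HASHES:
        flags.append("hash matches known copyrighted PDF")
    if any(fnmatch.fnmatch(path.name.lower(), p) for p in BLOCKED_PATTERNS):
        flags.append("filename matches blocked pattern")
    if path.stat().st_size > MAX_PDF_BYTES:
        flags.append("large PDF: manual review required")
    return flags
```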
Solution: Daily automated backup to Cloudflare R2 (free: 10GB, 1M ops/month).
- Frequency: Daily at 2:00 AM IST
- What:
data/users/+data/knowledge_bases/(content packs already in git) - Retention: 7 daily + 4 weekly snapshots
- Encryption: AES-256 at rest
- User-side: "Export My Data" button downloads zip with all personal data
Effort: Half day.
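Since R2 speaks the S3 API, the backup script can be sketched with boto3. The endpoint URL, bucket name, key layout, and inline credentials below are placeholders (credentials would come from secrets in practice):

```python
# scripts/backup.py (sketch): daily tar + upload to Cloudflare R2
import tarfile
from datetime import date
import boto3

def backup_to_r2():
    stamp = date.today().isoformat()
    archive = f"/tmp/gs360-{stamp}.tar.gz"
    # Bundle exactly what the retention policy above covers.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add("data/users")
        tar.add("data/knowledge_bases")
    s3 = boto3.client(
        "s3",
        endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder
        aws_access_key_id="...",        # from env/secrets in practice
        aws_secret_access_key="...",
    )
    s3.upload_file(archive, "gs360-backups", f"daily/{stamp}.tar.gz")
```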
v1 launches English-only. next-intl framework setup in v1.1. Community translates via same PR process as content packs.
Applied during Phase 2 design:
- `aria-label` on all interactive elements
- Keyboard navigation (Tab, Enter)
- `:focus-visible` outlines
- Color contrast ≥ 4.5:1 (WCAG AA)
- Semantic HTML (`<nav>`, `<main>`, `<aside>`)
- `aria-live="polite"` on the quiz timer
- Enforced by the ESLint `jsx-a11y` plugin
Effort: 1 day during design phase.
Plausible Analytics (self-hosted, free, privacy-respecting). Tracks: page views, content pack usage, quiz completion rates, geography, device type (validates desktop-first decision).
Does NOT track: personal identity, study content, AI conversations.
Effort: 2 hours.
Embedding similarity check (>0.92 threshold) during CI before merge. Blocks duplicate questions.
Effort: Half day.
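A sketch of the CI duplicate check, with `embed()` as a placeholder for whatever embedding call the pipeline uses:

```python
# Duplicate-question check (sketch): flags pairs above the 0.92 threshold
import numpy as np

THRESHOLD = 0.92

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_duplicates(new_questions: list[str], existing: list[str]) -> list[tuple]:
    new_vecs = [embed(q) for q in new_questions]   # embed() is hypothetical
    old_vecs = [embed(q) for q in existing]
    dupes = []
    for i, nv in enumerate(new_vecs):
        for j, ov in enumerate(old_vecs):
            sim = cosine(nv, ov)
            if sim > THRESHOLD:
                dupes.append((i, j, sim))          # (new idx, existing idx, similarity)
    return dupes
```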
Syllabus version tracked in registry.json. Outdated packs flagged. Config-only.
A k6 script ships with the repo. Ramp to 50 concurrent users, hold 5 min. Documented limits: the free tier handles ~30–50 concurrent users.
Effort: Half day.
Public pages (landing, PYQ database, CA summaries) are SSR via Next.js. Meta tags, structured data, sitemap. Monthly CA packs auto-publish as blog posts.
Effort: 1 day.
Settings → "Export My Data" → a zip with `profile.json`, `notes/`, `quiz-history.json`, `flashcard-progress.json`, `sessions/`, `README.md`. Import is also supported.
Effort: 1 day.
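The export itself can be sketched as a zip of the user's namespace; the function signature is hypothetical, and the file names mirror the list above:

```python
# "Export My Data" endpoint body (sketch): bundle the user's namespace into a zip
import io
import zipfile
from pathlib import Path

def export_user_data(namespace_base: str) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        root = Path(namespace_base)
        for f in root.rglob("*"):
            if f.is_file():
                # Arcnames stay relative: profile.json, notes/, sessions/, ...
                zf.write(f, f.relative_to(root))
    return buf.getvalue()
```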
| Gap | Solution | When |
|---|---|---|
| G16: Real-time collaboration | WebSocket rooms via Partykit | v2 |
| G17: Plagiarism detection | Embedding similarity against answer corpus | v2 |
| G18: Native mobile app | Non-goal. Desktop/web only. Validated post-launch. | v2 |
| G19: UPSC model fine-tuning | Fine-tune Qwen2.5-7B on PYQ explanations | v3 |
| G20: Admin panel | `/admin` with moderation queue, user stats | v1.1 |
| # | Assumption | Mitigation |
|---|---|---|
| A1 | DeepTutor stays stable | Pin to v1.0.2. Soft fork (overlay, don't modify core). Apache-2.0 = we can continue independently if they pivot. |
| A2 | Gemini free tier persists | Multi-provider fallback chain. Worst case: DeepSeek at $0.14/M tokens → ~₹500/month for 10K users. |
| A3 | Students have internet | PWA with offline quiz, notes, flashcards. Core study flow works offline. |
| A4 | NCERT is redistributable | Link to official NCERT portal. Extract only summaries + key concepts under fair use. |
| A5 | Community contributes | Do NOT rely on community for v1 content. Self-author all seed content from public sources. Community contributes AFTER platform has traction. |
| A6 | Desktop-first is OK | Defensible (Notion, Obsidian, Anki all desktop-first). Validate post-launch with analytics. |
| A7 | Self-hosted primary | Ship both: self-hosted (default) + hosted demo at gs360.study (rate-limited). |
| A8 | LLMs won't hallucinate | RAG-grounded only. System prompt enforces "refuse if not in context." Source attribution mandatory. Validated continuously via eval harness. |
| A9 | Web Speech API for Hindi | Voice bot deferred to v1.1. Text-only for v1. |
| A10 | Vector store scales | LlamaIndex local storage. Monitor at 10K pages. Migrate to Qdrant if slow. Content packs enable sharding by subject. |
| A11 | Students can run Docker | "Deploy to Railway" one-click button. YouTube walkthrough in Hindi. Hosted version for everyone else. |
| A12 | Non-devs can write JSON | Web form at /contribute → auto-generates JSON → auto-submits PR. Zero JSON knowledge needed. |
| Decision | Resolution | Rationale |
|---|---|---|
| D1: Hosted vs Self-Hosted | Both. Self-hosted default + hosted demo (rate-limited). | Open-source promise + accessibility. |
| D2: Platform Priority | Desktop-first. Validated post-launch. | Deep study → large screens. Directional, not irreversible. |
| D3: Contribution Method | Web form (primary) + GitHub PRs (advanced). | /contribute → auto JSON → auto PR. |
| D4: Fork Strategy | Soft fork. | Changes in gs360/ overlay. Core DeepTutor untouched. Pull upstream cleanly. |
| D5: AI Grounding | RAG-grounded with citations. Continuously evaluated. | Prevents hallucination. Community verifies via citations. |
| D6: Exam Scope | UPSC-first, multi-exam ready. | Config-driven branding + pack system supports any exam. |
| # | Test | Method | Pass Criteria | Blocking? |
|---|---|---|---|---|
| T1 | Quiz quality | Upload 50 PYQs → generate 20 MCQs → 3 aspirants rate 1–5 | Avg ≥ 3.5/5 | Yes |
| T2 | Gemini throughput | k6: 50 users × 10 req/min for 30 min | <5% rate-limit errors with fallback active | Yes |
| T3 | RAG accuracy | Continuous harness: micro set (30 Qs) from Week 3, scaling to 200 by Week 12. Per-category scoring. | See per-category targets below | Yes (see launch thresholds) |
| T4 | Pack ingestion speed | 10K pages across 8 packs | <30 min on 4GB RAM | No |
| T5 | Hindi speech | Deferred to v1.1 (voice bot cut from v1) | – | Deferred |
| T6 | Desktop usability | Chrome, Firefox, Safari at 1920×1080 and 1366×768 | All core flows completable, premium feel | Yes |
Important
RAG accuracy is the core value proposition. If the AI gives wrong answers about Article 370 or the 73rd Amendment, the platform is worse than useless. This section treats accuracy as an engineering discipline, not a checkbox.
Build the minimal eval harness alongside the quiz engine in Week 3, not after content generation in Week 12. This gives 9 weeks of tuning runway instead of discovering problems with days left.
- Week 2, Day 5: First content pack ingested + queryable via RAG (already in plan)
- Week 3, Day 1: Hand-curate the 30-question micro golden set
- Week 3, Day 2: Run baseline eval on the stock pipeline → FIRST ACCURACY READING
- Weeks 3–10: One tuning lever per week alongside feature work
- Week 12: Scale to the full 200-question golden set + domain expert review
Micro golden set composition (30 questions):
- 12 factual recall ("Which article of the Constitution deals with...")
- 9 comprehension ("Explain the significance of...")
- 6 analytical ("Examine the role of..." / "Critically analyze...")
- 3 current affairs ("Discuss the implications of Budget 2026...")
Baseline test: Run stock pipeline (LlamaIndex + Gemini Flash + default 512-token chunking) on 1 sample content pack. No tuning. Just measure where we start.
Warning
One lever at a time. Running 2+ simultaneously makes attribution impossible. Each lever gets an A/B eval run before/after. Results logged in eval/changelog.md.
| Priority | Lever | Expected Gain | What to Test | When |
|---|---|---|---|---|
| 1 | Chunking strategy | 5–15% | 4 strategies: default 512-token, semantic (split on headers), hierarchical (parent + child nodes), question-aware. Pick the winner via eval delta. | Week 3–4 |
| 2 | Retrieval improvements | 5–10% | Hybrid search (BM25 + vector via `QueryFusionRetriever`), tune Top-K (3/5/10), add `bge-reranker-base` reranking top-20 → top-5. | Week 5–6 |
| 3 | Prompt engineering | 5–10% on hallucination | Force "answer ONLY from provided context", add few-shot UPSC examples, force citation format, separate prompts for factual vs analytical questions. | Week 6–7 |
| 4 | Embedding model | 3–7% | A/B test Gemini `text-embedding-004` vs `BAAI/bge-large-en-v1.5` vs `nomic-embed-text-v1.5` on 1 pack. | Week 7–8 |
| 5 | Query rewriting | 3–5% on analytical | HyDE pattern: LLM rewrites the question into 2–3 retrieval-friendly variants, retrieve the union. Helps with "Examine the role of..." style queries. | Week 8–9 |
| 6 | Answer model | Variable | Test Gemini 2.5 Pro for synthesis (keep Flash for retrieval). Compare DeepSeek V3 on analytical questions only. | Week 9–10 |
Why this order: Chunking changes what the LLM sees, which makes it the highest-leverage lever. Prompt engineering changes how it reasons. Model swaps are lowest-leverage because they're expensive and the delta is often smaller than chunking.
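Lever 5, for example, can be sketched as below. Note this is the multi-variant rewrite the table describes (classic HyDE embeds a hypothetical answer document instead), and `llm_complete()`, `retriever`, and the `node_id`/`score` attributes are placeholders:

```python
# Lever 5 (sketch): rewrite an analytical question into retrieval-friendly
# variants, retrieve for each, and deduplicate the union.
def rewrite_and_retrieve(question: str, top_k: int = 5):
    prompt = (
        "Rewrite this UPSC question into 3 short, fact-seeking search queries:\n"
        f"{question}"
    )
    variants = llm_complete(prompt).splitlines()[:3]   # hypothetical LLM call
    seen, merged = set(), []
    for q in [question, *variants]:
        for hit in retriever.retrieve(q):              # hypothetical retriever
            if hit.node_id not in seen:                # dedupe across variants
                seen.add(hit.node_id)
                merged.append(hit)
    return sorted(merged, key=lambda h: h.score, reverse=True)[:top_k]
```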
Blended accuracy is a vanity metric. A 65% blended score could hide 90% factual + 20% analytical, and that's a terrible product. Score per category:
| Category | % of Golden Set | Day-1 Target | Week 12 Target | Why This Target |
|---|---|---|---|---|
| Factual recall | 40% | 75% | 90% | Direct retrieval. If chunking is right, this should be high. |
| Comprehension | 30% | 65% | 80% | Needs multi-chunk synthesis. Harder but tractable. |
| Analytical | 20% | 45% | 65% | UPSC analytical Qs are genuinely hard for RAG: retrieval finds the right topic with the wrong framing. 45% Day-1 is honest. |
| Current affairs | 10% | 55% | 75% | Depends on CA content pack freshness. Floor is lower. |
Why 45% Day-1 for analytical is OK: Failing analytical doesn't tank the blended score (it's 20% of the set). And honestly, even human UPSC aspirants don't ace analytical questions; they're designed to be hard. A 45%→65% improvement arc over 9 weeks is achievable via query rewriting + prompt engineering.
Weekly cadence:
1. Run eval harness (automated via GitHub Actions)
2. Review per-category scores
3. Check for regressions (≥7% drop in any category on the 30-Q set,
   ≥3% on the 200-Q set, adjusted for statistical significance)
4. Pick 1 tuning lever
5. A/B test: run eval with and without the change
6. If the delta is positive → merge. If neutral or negative → revert.
7. Log in eval/changelog.md with before/after %
eval/changelog.md format:
```
## Week 5 – Hybrid Search (BM25 + Vector)
- Change: Added BM25 to QueryFusionRetriever, top-K=5
- Factual: 72% → 78% (+6%) ✅
- Comprehension: 60% → 63% (+3%) ✅
- Analytical: 42% → 44% (+2%) ✅
- Hallucination: 8% → 6% (−2%) ✅
- Verdict: MERGED
```

Hard rule: No prompt or chunking change ships without an eval delta. No vibes-based tuning.
Note
Statistical significance on small sets: 3% on a 30-question set = 1 question flipping, which is noise. Use ≥7% (2+ questions) as the regression threshold on the micro set. When scaling to 200 questions, 3% (6 questions) becomes meaningful.
Decide these NOW, not on launch day when motivated reasoning kicks in:
| Blended Accuracy | Action |
|---|---|
| <55% | 🔴 Block launch. Extend Phase 3. Revisit chunking + embedding fundamentally. |
| 55–65% | 🟡 Launch with caveats (see below). |
| ≥65% | ✅ Launch as planned. |
| ≥75% | 🚀 Pull launch forward if other gates (security, content) also pass. |

| Hallucination Rate | Action |
|---|---|
| >10% | 🔴 Block launch regardless of accuracy. A confident wrong answer about Article 370 is worse than "I don't know." |
| 5–10% | 🟡 Launch with caveats. |
| <5% | ✅ Acceptable. |
"Launch with caveats" means concretely:
- Per-answer retrieval confidence badge (🟢 high / 🟡 medium / 🔴 low) based on top-k similarity score
- Banner on all AI features: "AI answers are generated from content packs and may contain errors. Always verify from original sources."
- Low-confidence answers (🔴) include a "Flag this answer" button for community review
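A sketch of the badge mapping; the 0.80/0.60 similarity cutoffs are assumptions to be tuned against the eval set, not decided values:

```python
# Per-answer confidence badge (sketch; thresholds are placeholder assumptions)
def confidence_badge(top_k_scores: list[float]) -> str:
    best = max(top_k_scores, default=0.0)  # similarity of the best retrieved chunk
    if best >= 0.80:
        return "🟢 high"
    if best >= 0.60:
        return "🟡 medium"
    return "🔴 low"  # low-confidence answers also get the "Flag this answer" button
```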
| Item | Cost | Notes |
|---|---|---|
| LLM API for eval runs (~10 weekly runs × 200 queries) | ~₹3,000–₹5,000 | Gemini free tier covers most; overflow to DeepSeek |
| Domain expert micro-review of golden set (30–200 Qs) | ~₹3,000 | Verify answer keys are correct. Bad golden data = bad eval. |
| Total eval budget | ~₹8,000 | Added to v1 one-time costs |
```python
# eval/eval_runner.py – runs weekly via GitHub Actions
def run_eval():
    """Produce a per-category scorecard, not a blended pass/fail."""
    golden = load("eval/golden-dataset.json")
    results = {
        "date": now(),
        "scores": {
            "exact_match": 0,
            "partial_correct": 0,
            "hallucination": 0,
            "no_answer": 0,
            "wrong_refusal": 0,
        },
        "per_category": {
            "factual": {"correct": 0, "total": 0},
            "comprehension": {"correct": 0, "total": 0},
            "analytical": {"correct": 0, "total": 0},
            "current_affairs": {"correct": 0, "total": 0},
        },
        "low_confidence": [],
    }
    for q in golden:
        response = query_rag(q["question"])
        score = evaluate_response(response, q["verified_answer"])
        results["scores"][score.category] += 1
        results["per_category"][q["type"]]["total"] += 1
        if score.is_correct:
            results["per_category"][q["type"]]["correct"] += 1
    # Save timestamped results. Compare against the previous week.
    # Alert if any category drops ≥7% (micro set) or ≥3% (full set).
    # Alert if the hallucination rate exceeds 10%.
```

Golden Dataset (Phased Construction):
| Phase | When | Size | Source |
|---|---|---|---|
| Micro set | Week 3 | 30 questions | Hand-curated: 12 factual, 9 comprehension, 6 analytical, 3 CA |
| Full set | Week 12 | 200 questions | UPSC PYQ 2020–2025 (100) + NCERT chapter-end (50) + custom analytical (50) |
| Quarterly refresh | Post-launch | +20 questions | New PYQs, updated CA, community-flagged edge cases |
Important
Community doesn't exist yet. Community contributes AFTER you have something worth contributing to. v1 content must be self-authored from public sources.
| Content Pack | Source | Volume | Effort |
|---|---|---|---|
| upsc-pyq-2000-2025 | UPSC official papers (public record) | ~2,500 MCQs with explanations | 3 days |
| upsc-polity | NCERT + Constitution (public domain) | ~200 MCQs, ~150 flashcards, ~50 notes | 4 days |
| upsc-economy | NCERT + Economic Survey + Budget | ~200 MCQs, ~150 flashcards, ~50 notes | 4 days |
| upsc-history | NCERT Class 6–12 History | ~200 MCQs, ~150 flashcards, ~50 notes | 4 days |
| upsc-geography | NCERT + India Year Book | ~150 MCQs, ~100 flashcards, ~40 notes | 4 days |
| upsc-ethics | PYQ case studies + Constitution | ~100 MCQs, ~80 flashcards, ~30 notes | 2 days |
| upsc-current-affairs-2026 | PIB + Economic Survey + Budget 2026 | ~100 flashcards, ~50 MCQs | 2 days |
| upsc-science-tech | NCERT Science + PIB S&T | ~100 MCQs, ~60 flashcards, ~30 notes | 2 days |
Total seed content: ~1,500 MCQs, ~840 flashcards, ~300 study notes, ~2,500 PYQs.
- Budget: ₹15,000–₹20,000 (freelance, 2 weeks part-time)
- Reviews: Factual accuracy, UPSC-relevance, note quality, copyright flags
- Where to find: LinkedIn, UPSC Telegram groups, Internshala, Pepper Content
```
v1 Launch (self-authored content)
  ↓ Students use, find value
  ↓ Analytics prove usage
  ↓ v1.1: Enable /contribute form
  ↓ Gamify: badges, leaderboard
  ↓ Partner with UPSC Telegram groups (100K+ member groups)
  ↓ Monthly content drives
  ↓ Network effects kick in
```
Warning
Five runtimes (Python FastAPI, Node.js/Next.js, LlamaIndex, external LLM APIs, auth layer) mean integration pain is guaranteed, not merely possible.
| Integration | What Will Break | Mitigation | Buffer |
|---|---|---|---|
| Next.js ↔ FastAPI | CORS, cookie/session passing, SSR vs client fetch | Shared `API_URL` env var. Next.js API routes as proxy. CORS middleware with explicit origin whitelist. | 2 days |
| NextAuth.js ↔ FastAPI | JWT format mismatch, session validation, token refresh | FastAPI validates the NextAuth JWT with a shared secret (see the sketch below). Test: expired, malformed, missing tokens. | 1 day |
| LlamaIndex ↔ Gemini | Rate limit responses unhandled, embedding timeouts | Wrap all calls in try/except with fallback. Pin model version. Circuit breaker pattern. | 2 days |
| WebSocket ↔ Auth | WS doesn't carry cookies like HTTP. Token expiry mid-session. | Auth on WS handshake via query-param token. Re-auth on reconnect. | 1 day |
| Docker Compose | Startup order, health checks, volume mounting, memory limits | `depends_on` with health checks. Test Windows + Linux. Minimum: 4GB RAM. | 1 day |
Total integration buffer: 7 days distributed across Phase 3.
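A minimal sketch of the FastAPI side of that NextAuth handshake, assuming NextAuth is configured to issue signed HS256 JWTs rather than its default encrypted session tokens (that default is exactly the "JWT format mismatch" the table warns about):

```python
# FastAPI-side validation of the NextAuth JWT (sketch; Bearer header assumed)
import os
import jwt  # PyJWT
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
SHARED_SECRET = os.environ["NEXTAUTH_SECRET"]

@app.middleware("http")
async def require_valid_jwt(request: Request, call_next):
    token = request.headers.get("Authorization", "").removeprefix("Bearer ")
    try:
        # Verifies signature and expiry; covers expired, malformed, missing tokens.
        claims = jwt.decode(token, SHARED_SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        return JSONResponse(status_code=401,
                            content={"detail": "Invalid or expired token"})
    request.state.user_id = claims["sub"]  # user_id from the JWT sub claim, never the body
    return await call_next(request)
```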
Before starting any new integration:
☐ Both services start independently and respond to health checks
☐ Auth token format documented and agreed by both sides
☐ Error response format standardized (JSON, consistent schema)
☐ Timeout values set (30s for LLM, 5s for everything else)
☐ One happy-path e2e test passes
☐ One error-path test exists (timeout, invalid token)
| Role | Who | Hours/Week | Notes |
|---|---|---|---|
| Lead Developer | You (primary) | 25–30 productive hrs | 5 hrs/day × 5–6 days. Includes review, debugging, deployment. |
| AI Codegen (Opus 4.6) | Assisted development | N/A | Boilerplate, tests, schemas. Does NOT: debug integration, trace upstream paths, make security decisions. |
| Domain Expert | Freelance (₹15–20K) | 10–15 hrs/week | Reviews AI-generated content. Phase 3 only. |
| DMCA Handler | You (initially) | 1 hr/week | Near-zero volume at launch. |
v1 scope: ~55 working days of effort
Lead developer: 5 hrs/day × 5.5 days/week = ~27.5 hrs/week
Effective dev weeks: 55 days ÷ 5.5 = 10 work-weeks
Calendar weeks (with buffer): 16 weeks
AI codegen reduces boilerplate writing ~40%
AI does NOT reduce: integration debugging, security review, testing, deployment
| AI CAN reliably generate | AI CANNOT reliably do |
|---|---|
| NextAuth.js config boilerplate | Debug why LlamaIndex returns wrong user's data |
| Rate limiter middleware | Trace hardcoded paths across 16K lines of upstream code |
| Service worker skeleton | Decide if a vector index should be shared or isolated |
| CSS theme / design system | Test WebSocket auth edge cases |
| Pack schema validation scripts | Evaluate if an AI UPSC answer is factually correct |
| Backup pipeline scripts | Determine the right chunking strategy |
| CI/CD workflows | Negotiate partnerships with Telegram groups |
| Test scaffolding | Make security architecture decisions under ambiguity |
Rule: Use AI for code generation. Use humans for judgment, debugging, and integration.
1. v1 has EXACTLY ONE GOAL: "A student can use AI to study UPSC content
packs and take quizzes." Everything else is v1.1+.
2. Feature freeze at Week 10. Weeks 11–16 are testing, content, debugging,
   and launch.
3. Every feature request gets a "What breaks if we don't ship this in v1?"
test. If "nothing critical," it's v1.1.
4. Track on public GitHub project board.
5. Cheap codegen ≠ cheap integration + testing + deployment. Resist scope creep.
| Feature | Priority | Why |
|---|---|---|
| Auth (dual-mode) | P0 | Multi-user doesn't work without it |
| Multi-tenancy (security-hardened) | P0 | Data isolation is non-negotiable |
| Content pack ingestion + registry | P0 | This IS the product |
| AI Notes (RAG-powered) | P0 | Core differentiator |
| Quiz engine (content pack + AI-generated) | P0 | Most tangible student value |
| Rate limiter + LLM fallback chain | P0 | Platform dies without this |
| Daily Command Center UI | P0 | The interface students see |
| PWA + offline quiz | P1 | Tier-2/3 access |
| Feature | Why Cut |
|---|---|
| Voice bot (Siri Orb) | Complex. Entire week for a nice-to-have. |
| AI Cowork Studio | Advanced. Students need basic RAG chat first. |
| Study Mode focus timer | Pure frontend, not core. |
| i18n (Hindi, Tamil, etc.) | English-only for v1. |
| Plausible analytics | 2-hour setup. Add Week 15 or post-launch. |
| CA engine automation | Manual publishing as content packs for v1. |
| Admin panel | Not needed at <100 users. |
```
PHASE 0: SPIKE & FOUNDATION (Weeks 1–2)
────────────────────────────────────────

Week 1: Multi-Tenancy Spike (BLOCKING)
  Day 1–2: Get DeepTutor running locally (Docker, Python env,
           LlamaIndex setup, verify all features work)
  Day 3:   Trace every file path reference in DeepTutor core
           (grep + manual read). Produce PATH_AUDIT.md
  Day 4:   Identify session-scoped vs global modules.
           Prototype UserNamespace patching on 2 modules.
  Day 5:   Build minimal 2-user test:
           User A uploads doc → User B should NOT see it
           User A chats → User B's session is separate

  ⚠ DECISION GATE: Does multi-tenancy work?
     YES → Continue to Week 2
     NO  → Re-scope: self-hosted single-user only for v1.
           Multi-tenant hosted version becomes v1.1.

Week 2: Auth + Content Pack Foundation
  Day 1:   NextAuth.js setup (Google + GitHub OAuth)
  Day 2:   Security middleware (path validation, audit logging)
  Day 3–4: Content pack schema validation + ingest pipeline
  Day 5:   First content pack ingested + queryable via RAG

PHASE 1: CORE ENGINE (Weeks 3–6)
────────────────────────────────

Week 3–4: AI Core Features + Eval Baseline
  • Per-user namespace integration (all modules patched)
  • AI Notes: RAG-powered note generation from content packs
  • Quiz engine: generate from content packs + AI
  • Rate limiter + LLM fallback chain
  • Week 3 Day 1–2: Build eval harness + curate 30-Q micro set
  • Week 3 Day 3: Run BASELINE eval on stock pipeline
    → FIRST ACCURACY READING (9 weeks of tuning runway)
  • Week 3–4: Test chunking strategies (Lever 1: biggest gain)

Week 5–6: Platform Features + Retrieval Tuning
  • Content Pack Manager UI (browse, install, search packs)
  • Knowledge Base manager (user uploads, private vault)
  • Performance dashboard (quiz history, scores, progress)
  • PWA manifest + service worker + offline quiz
  • Lever 2: Hybrid search + reranking (eval delta tracked)

  End of Week 6: Run T1 (quiz quality) + T2 (throughput)
  Weekly eval runs ongoing → accuracy tracked per category

PHASE 2: UX & DESIGN (Weeks 7–10)
─────────────────────────────────

Week 7–8: GS360 Theme + Command Center + RAG Levers 3–4
  • GS360 design system (CSS, components, dark theme)
  • Daily Command Center layout
  • Sidebar navigation + keyboard shortcuts
  • Accessibility pass (WCAG AA, aria-labels, focus visible)
  • Lever 3: Prompt engineering (factual vs analytical prompts)
  • Lever 4: Embedding model A/B test (eval delta tracked)

Week 9–10: Polish + Integration Debugging + Final Levers
  • End-to-end integration testing (all 5 runtimes talking)
  • CORS, auth token passing, WebSocket fixes
  • Error handling + loading states + offline UI states
  • Lever 5: Query rewriting (HyDE) + Lever 6: Answer model A/B
  • Run T6 (desktop/web usability)

  ⚠ FEATURE FREEZE AT END OF WEEK 10
  No new features after this point. Only bug fixes.
  RAG tuning continues through Phase 3 (eval-driven only).

PHASE 3: CONTENT, EVAL, & HARDENING (Weeks 11–13)
─────────────────────────────────────────────────

Week 11: Cold-Start Content Creation
  • AI-generate all 8 seed content packs
  • Structure 2,500 PYQs (2000–2025)
  • AI-generate NCERT summaries + MCQs + flashcards

Week 12: Domain Expert Review + Full Golden Set
  • Domain expert reviews AI content (₹15–20K budget)
  • Scale golden dataset: 30 → 200 verified UPSC Q&A pairs
  • Domain expert micro-review of golden set answers (₹3K)
  • Run full T3 eval. Apply launch threshold decision:
    <55% → block launch | 55–65% → caveats | ≥65% → go
  • Hallucination >10% → block launch regardless of accuracy

Week 13: Security + Bug Fixing Buffer
  • Pre-launch pen test (10 attack vectors)
  • Integration bug fixing (7-day buffer)
  • Copyright scan pipeline testing
  • Backup pipeline verification (backup + restore test)
  • Load testing (k6 → T2 re-run with real content)

PHASE 4: LAUNCH PREP (Weeks 14–16)
──────────────────────────────────

Week 14: Deployment + Infrastructure
  • Deploy to Vercel (frontend) + Railway (backend)
  • DNS, SSL, domain (gs360.study)
  • Backup pipeline live (Cloudflare R2)
  • Monitoring: basic health check endpoint
  • Smoke testing on production

Week 15: Documentation + Community Setup
  • README.md, CONTRIBUTING.md, CONTENT_GUIDE.md
  • DMCA.md + LICENSE (Apache-2.0) + SECURITY.md
  • Discord server + Telegram channel setup
  • (Optional) Plausible analytics: 2 hours

Week 16: Soft Launch
  • Invite 20–30 beta users from UPSC Telegram groups
  • Monitor 5 days: crashes, data leaks, UX confusion
  • Hotfix cycle: fix critical bugs daily
  • Day 5: Public launch on GitHub, Reddit, Twitter
  • Day 7: First user feedback → plan v1.1 based on data
```
Tracking ticket: Issue #20
Ship a compliant, local-first content bootstrap and current-affairs ingestion pipeline so a new user can install GS360 and immediately start with a high-quality baseline notebook.
| Stream | Deliverable | Notes |
|---|---|---|
| Starter notebook packs | Manifest-driven auto-download | Store manifests/metadata in Git, not copyrighted files |
| Source compliance | License metadata + policy checks | Prefer public-domain/open-license and user-provided sources |
| Local ingestion | Download → validate → index | One-command local bootstrap |
| Current affairs | Connector framework + scheduler | Prioritize RSS/API/allowed sources first |
| Quality controls | Dedupe + tagging + source citations | UPSC-friendly summaries with traceable provenance |
| Milestone | Outcome |
|---|---|
| A | Manifest schema, downloader, checksum validation |
| B | Local indexing and one-command starter setup |
| C | Current-affairs connector + normalized article schema |
| D | Summarization, tagging, scheduling, and resilience tests |
- Fresh local install can bootstrap starter knowledge with one command.
- Chat answers include source metadata for ingested content.
- Daily current-affairs sync appends into knowledge bases reliably.
- Pipeline handles source errors with retries and safe fallback behavior.
| Users | Monthly LLM | Infra | Total Monthly | Funding Source |
|---|---|---|---|---|
| 1–50 | ₹0 | ₹0 | ₹0 | Free tier |
| 50–200 | ₹500–₹800 | ₹0 | ~₹800/mo | Personal / bootstrap |
| 200–1,000 | ₹2K–₹4K | ₹500 | ~₹5K/mo | GitHub Sponsors + OpenCollective |
| 1,000–5,000 | ₹15K–₹20K | ₹2K | ~₹22K/mo | Institutional sponsor needed |
| 5,000+ | ₹50K+ | ₹5K+ | ₹55K+/mo | Coaching partnership or govt grant |
| Threshold | Trigger | Action |
|---|---|---|
| >200 users | LLM > ₹1K/mo | GitHub Sponsors + OpenCollective |
| >1,000 users | LLM > ₹5K/mo | GitHub Education grant. Google Education credits. Coaching institute sponsorship. |
| >5,000 users | LLM > ₹20K/mo | Government grant (MyGov, AICTE, NITI Aayog). Infosys Foundation / Tata Trusts. Optional premium tier (₹99/mo for heavy AI users, platform stays free). |
| Item | Cost | Notes |
|---|---|---|
| Domain (gs360.study) | ~₹800/year | |
| Domain expert content review | ₹15,000–₹20,000 | 2 weeks, part-time |
| Eval infrastructure (LLM API for eval runs) | ~₹3,000–₹5,000 | ~10 weekly eval runs × 200 queries |
| Golden set expert micro-review | ~₹3,000 | Verify answer keys are correct |
| Total v1 investment | ~₹28,000 | |
| Responsibility | Who | Time | Automation |
|---|---|---|---|
| DMCA monitoring | Lead dev | ~1 hr/week | dmca@gs360.study inbox |
| PR review (content packs) | Lead dev + 1 community maintainer | ~2 hrs/week (post-community) | CI validates schema + copyright scan |
| Backup verification | Automated | 0 (alert on failure) | GitHub Actions + Telegram alert |
| RAG eval monitoring | Lead dev | ~30 min/week | Weekly automated run, review scorecard |
| Eval review | Lead dev | ~30 min/week | Manual review of regressions/edge cases |
| LLM cost monitoring | Automated | 0 (alert on threshold) | cost-monitor.py alerts if spend exceeds 120% of budget |
| Community management | Lead dev | ~3 hrs/week | Discord + Telegram |
| Security incidents | Lead dev | On-call | SECURITY.md + weekly audit log review |
"15 lakh students. Zero free AI tools. One open-source platform. The community builds it. The community owns it. The community benefits."
"But the community builds it AFTER we build something worth building on. And we build it in 16 weeks, not 10."
β GS360 Open Source, Final Plan