avyasaini/ContentGen-RAG-System

🚀 ContentGen-RAG-System

Advanced AI Pipeline for Automated Textbook Content Generation

A multi-agent system that uses knowledge graphs and hybrid RAG to transform raw textbook PDFs into exam-aligned educational chapters.

Key Features · Technical Architecture · Workflow · Tech Stack · Getting Started


🎯 Project Overview

ContentGen-RAG-System is an AI pipeline that automates the creation of comprehensive study materials. Using Retrieval-Augmented Generation (RAG) and specialized AI agents, the system parses complex textbook PDFs (including equations, diagrams, and tables) and generates Markdown chapters tailored to exams such as JEE, NEET, and board exams.

✨ Key Capabilities

| Feature | Description |
| --- | --- |
| Intelligent PDF Ingestion | Deep parsing of scientific PDFs using MinerU to capture text, LaTeX equations, and layout structure. |
| Knowledge Graph Integration | Building semantic relationships between concepts using LightRAG for high-accuracy retrieval. |
| Multi-Agent Orchestration | A team of 5 specialized CrewAI agents (Researcher, Indexer, Writer, Enhancer, Formatter) collaborating sequentially. |
| Hybrid Retrieval | Combining dense vector search (FAISS/ChromaDB) with BM25 keyword matching for superior context discovery. |
| Multimodal Understanding | Processing and contextualizing figures and tables alongside textual data. |
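
To make the Hybrid Retrieval row concrete, here is a minimal sketch of one way to merge a dense ranking with a BM25 ranking. The README does not say how the two result lists are combined, so reciprocal rank fusion is an assumption here, as are the function names and the hard-coded `dense_scores` (a stand-in for similarities that FAISS or ChromaDB would return).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would do more."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores):
    """Return document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings; each adds 1 / (k + position) per document."""
    fused = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = [
    "Newton's second law relates force mass and acceleration",
    "Ohm's law relates voltage current and resistance",
    "Kinetic energy depends on mass and velocity",
]
query = "force and acceleration"

# Hypothetical dense similarities, standing in for a vector store lookup.
dense_scores = [0.82, 0.10, 0.35]

fused = reciprocal_rank_fusion([rank(dense_scores), rank(bm25_scores(query, docs))])
print("Best match:", docs[fused[0]])
```

Rank fusion is used in this sketch because it needs no score normalization between the two retrievers; a weighted sum of normalized scores would be another reasonable choice.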

🏗 Technical Architecture

The system is built on a modular architecture that separates document ingestion from content generation, so the knowledge base can be built once and reused across generation runs.

graph TD
    A[Raw Textbook PDFs] --> B[MinerU Parser]
    B --> C[Multimodal Processing]
    C --> D[LightRAG Knowledge Graph]
    D --> E[Hybrid Vector Storage]
    E --> F[CrewAI Multi-Agent Pipeline]
    F --> G[Final Markdown Chapters]
    
    subgraph "Agents Layer"
    F1[Research Agent] --> F2[Content Indexer]
    F2 --> F3[Content Generator]
    F3 --> F4[RAG Enhancer]
    F4 --> F5[Markdown Formatter]
    end

System Components

  • Ingestion Layer: Utilizes MinerU for high-fidelity extraction of complex academic content.
  • Knowledge Layer: Implements LightRAG to maintain a graph-based understanding of physics entities and relationships.
  • Orchestration Layer: Uses CrewAI to manage sequential workflows where each agent refines the output of the previous one.
  • Output Layer: Produces detailed Markdown files (50–180 KB) with full LaTeX support and structured sectioning.
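
As an illustration of what the Output Layer promises, a generated chapter could be sanity-checked along these lines. This is a sketch, not part of the repository: the `check_chapter` helper and the choice of checks (file size, heading count, balanced `$` and `$$` delimiters) are assumptions.

```python
import re


def check_chapter(markdown: str) -> dict:
    """Report basic size, structure, and LaTeX delimiter facts for a chapter."""
    size_kb = len(markdown.encode("utf-8")) / 1024
    headings = re.findall(r"^#{1,6} .+$", markdown, flags=re.MULTILINE)
    display_delimiters = markdown.count("$$")
    # Drop complete display-math blocks, then count the remaining unescaped '$'.
    without_display = re.sub(r"\$\$.*?\$\$", "", markdown, flags=re.DOTALL)
    inline_delimiters = len(re.findall(r"(?<!\\)\$", without_display))
    return {
        "size_kb": round(size_kb, 2),
        "heading_count": len(headings),
        "display_math_balanced": display_delimiters % 2 == 0,
        "inline_math_balanced": inline_delimiters % 2 == 0,
    }


sample = (
    "# Kinematics\n\n"
    "## Equations of motion\n\n"
    "For constant acceleration, $v = u + at$.\n\n"
    "$$s = ut + \\frac{1}{2}at^2$$\n"
)
report = check_chapter(sample)
broken = check_chapter("Energy is $E = mc^2")
print(report)
```

A check like this only catches unpaired delimiters; it does not verify that the LaTeX inside them compiles.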

🔄 How It Works

Phase 1: Knowledge Base Construction

  1. Parsing: Documents are broken down into semantically meaningful chunks.
  2. Indexing: Entities and relations are extracted to build a Knowledge Graph.
  3. Embedding: Chunks are stored in vector databases (FAISS/ChromaDB) for semantic retrieval.
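
The parsing and embedding steps of Phase 1 can be sketched in plain Python. Everything here is illustrative: `chunk_paragraphs`, `toy_embed`, and `search` are hypothetical helpers, the hashed bag-of-words vector is only a stand-in for the all-MiniLM-L6-v2 embeddings, and the in-memory list stands in for FAISS or ChromaDB. Knowledge Graph extraction (step 2) is left out because it depends on an LLM.

```python
import hashlib
import math


def chunk_paragraphs(text, max_chars=150):
    """Group consecutive paragraphs into chunks of roughly max_chars."""
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        if current and len(current) + len(paragraph) + 1 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current} {paragraph}".strip()
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text, dim=64):
    """Hashed bag-of-words unit vector; NOT a real sentence embedding."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def search(query, index, top_k=1):
    """Return the top_k chunks by cosine similarity (vectors are unit length)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


text = "\n\n".join([
    "Newton's first law states that a body stays at rest or in uniform "
    "motion unless acted on by a net force.",
    "Newton's second law states that net force equals mass times acceleration.",
    "Ohm's law states that voltage equals current times resistance.",
])

chunks = chunk_paragraphs(text)
index = [(chunk, toy_embed(chunk)) for chunk in chunks]
results = search("force mass acceleration", index, top_k=1)
print(len(chunks), "chunks indexed")
```

With the default `max_chars`, the first paragraph fills a chunk on its own and the two shorter paragraphs are grouped into a second one.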

Phase 2: Content Generation

  1. Research: The Research Agent queries the RAG system for comprehensive data on a specific topic.
  2. Structuring: The Indexer builds a hierarchical chapter outline aligned with official curriculum standards.
  3. Writing: The Generator crafts detailed prose, derivations, and examples.
  4. Refinement: The Enhancer and Formatter ensure technical accuracy and proper Markdown/LaTeX formatting.
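
The sequential hand-off in Phase 2 can be pictured as a chain of functions, each taking the previous stage's draft and returning a refined one. The sketch below deliberately avoids the CrewAI API: the `Draft` type and the stage functions are hypothetical stubs, and in the real system each stage would be an LLM-backed agent.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    """State passed from one stage to the next (hypothetical structure)."""
    topic: str
    notes: list[str] = field(default_factory=list)
    outline: list[str] = field(default_factory=list)
    body: str = ""


def research(draft):
    # Stand-in for querying the RAG system about the topic.
    draft.notes.append(f"Retrieved context for {draft.topic}")
    return draft


def index_outline(draft):
    # Stand-in for building a curriculum-aligned outline.
    draft.outline = ["Introduction", "Key Concepts", "Worked Examples"]
    return draft


def write(draft):
    sections = [f"## {title}\n\nPlaceholder prose about {draft.topic}."
                for title in draft.outline]
    draft.body = "\n\n".join(sections)
    return draft


def enhance(draft):
    # Stand-in for grounding the text in retrieved material.
    draft.body += "\n\nSources consulted: " + "; ".join(draft.notes)
    return draft


def format_markdown(draft):
    draft.body = f"# {draft.topic}\n\n{draft.body.strip()}\n"
    return draft


def run_pipeline(draft, stages):
    """Apply each stage in order; every stage sees the previous output."""
    for stage in stages:
        draft = stage(draft)
    return draft


result = run_pipeline(
    Draft(topic="Laws of Motion"),
    [research, index_outline, write, enhance, format_markdown],
)
print(result.body)
```

Keeping each stage a plain function of the draft makes it easy to test one stage in isolation or to reorder stages while experimenting.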

🛠 Tech Stack

  • Frameworks: CrewAI, LightRAG, RAG-Anything
  • PDF Processing: MinerU (magic-pdf)
  • Vector Databases: FAISS, ChromaDB, nano-vectordb
  • Models: OpenAI GPT-4o-mini (Primary), Ollama LLaMA 3.1 (Local Fallback)
  • Embeddings: all-MiniLM-L6-v2 (SentenceTransformers)
  • Search: SerperDev API
  • Language: Python 3.10 to 3.13
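
The primary/fallback model arrangement listed above can be sketched as a loop over backends. The README does not describe when the local fallback is triggered, so falling back on any exception is an assumption, and `fake_primary` and `fake_local` are stubs rather than real OpenAI or Ollama calls.

```python
def generate_with_fallback(prompt, backends):
    """Try each (name, callable) backend in order; return the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All backends failed: " + "; ".join(errors))


def fake_primary(prompt):
    # Stub simulating an unreachable hosted API.
    raise ConnectionError("API unreachable")


def fake_local(prompt):
    # Stub simulating a local model response.
    return f"[local draft] {prompt}"


used, text = generate_with_fallback(
    "Explain Ohm's law.",
    [("gpt-4o-mini", fake_primary), ("llama3.1", fake_local)],
)
print(used, "->", text)
```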

🚀 Getting Started

Prerequisites

  • Python 3.10 to 3.13
  • OpenAI API Key
  • Serper API Key (Optional)

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/ContentGen-RAG-System.git
    cd ContentGen-RAG-System
  2. Set up Environment: Create a .env file with your credentials:

    OPENAI_API_KEY=your_key_here
    SERPER_API_KEY=your_key_here
    MODEL=gpt-4o-mini
  3. Install Dependencies:

    pip install -r requirements_complete.txt
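
Before running the pipeline, it can help to confirm that the `.env` values are complete. The following is a minimal sketch, not the project's own loader: `parse_env` and `load_settings` are hypothetical helpers, and treating `MODEL` as defaulting to `gpt-4o-mini` is an assumption (the Serper key is optional per the prerequisites).

```python
REQUIRED_KEYS = ("OPENAI_API_KEY",)
# Assumed defaults: the Serper key is optional, MODEL falls back to gpt-4o-mini.
OPTIONAL_DEFAULTS = {"SERPER_API_KEY": "", "MODEL": "gpt-4o-mini"}


def parse_env(text):
    """Parse KEY=value lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_text):
    """Merge defaults with parsed values and fail fast on missing keys."""
    values = {**OPTIONAL_DEFAULTS, **parse_env(env_text)}
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return values


settings = load_settings("OPENAI_API_KEY=your_key_here\nMODEL=gpt-4o-mini\n")
print(settings["MODEL"])
```

In practice the text would come from reading the `.env` file; passing it as a string keeps the sketch easy to test.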

Running the System

  • Process New PDFs:
    python -m physics_content.rag_system
  • Generate Chapters:
    python -m physics_content.main

📄 License

This project is specialized for educational content generation and is provided for research and portfolio demonstration purposes.
