🚀 ContentGen-RAG-System

Advanced AI Pipeline for Automated Textbook Content Generation

A multi-agent system that uses a Knowledge Graph and Hybrid RAG to transform raw textbook PDFs into exam-aligned educational chapters.

Key Capabilities • Technical Architecture • How It Works • Tech Stack • Getting Started

🎯 Project Overview

ContentGen-RAG-System is an AI pipeline that automates the creation of comprehensive study materials. Using Retrieval-Augmented Generation (RAG) and specialized AI agents, it parses complex textbook PDFs (including equations, diagrams, and tables) and generates Markdown chapters tailored for competitive exams such as JEE and NEET, as well as Board Exams.

✨ Key Capabilities

| Feature | Description |
| --- | --- |
| Intelligent PDF Ingestion | Parses scientific PDFs with MinerU to capture text, LaTeX equations, and layout structure. |
| Knowledge Graph Integration | Builds semantic relationships between concepts with LightRAG to support retrieval. |
| Multi-Agent Orchestration | Five specialized CrewAI agents (Research Agent, Content Indexer, Content Generator, RAG Enhancer, Markdown Formatter) that run sequentially. |
| Hybrid Retrieval | Combines dense vector search (FAISS/ChromaDB) with BM25 keyword matching to find relevant context. |
| Multimodal Understanding | Processes figures and tables and places them in context alongside the text. |
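
The hybrid retrieval idea above can be illustrated with a small rank-fusion sketch. It assumes the dense retriever and the BM25 retriever each return a best-first list of chunk IDs, and it merges them with reciprocal rank fusion (RRF). The README does not state which fusion method the project uses, so RRF, the function name `fuse_rankings`, the constant `k=60`, and the sample chunk IDs are all illustrative assumptions.

```python
from collections import defaultdict


def fuse_rankings(rankings, k=60):
    """Merge several ranked lists of chunk IDs with reciprocal rank fusion.

    Each list is ordered best-first. A chunk scores 1 / (k + rank) in every
    list that contains it, and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers for one query.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = fuse_rankings([dense_hits, bm25_hits])
```

Chunks that appear in both lists (`c1`, `c3`) rise to the top, which is the behaviour a hybrid retriever is usually after.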

🏗 Technical Architecture

The system is built on a modular architecture that separates document ingestion from content generation, and each stage has its own entry point.

```mermaid
graph TD
    A[Raw Textbook PDFs] --> B[MinerU Parser]
    B --> C[Multimodal Processing]
    C --> D[LightRAG Knowledge Graph]
    D --> E[Hybrid Vector Storage]
    E --> F[CrewAI Multi-Agent Pipeline]
    F --> G[Final Markdown Chapters]

    subgraph "Agents Layer"
    F1[Research Agent] --> F2[Content Indexer]
    F2 --> F3[Content Generator]
    F3 --> F4[RAG Enhancer]
    F4 --> F5[Markdown Formatter]
    end
```

System Components

Ingestion Layer: Utilizes MinerU for high-fidelity extraction of complex academic content.
Knowledge Layer: Implements LightRAG to maintain a graph-based understanding of physics entities and relationships.
Orchestration Layer: Uses CrewAI to manage sequential workflows where each agent refines the output of the previous one.
Output Layer: Produces detailed Markdown files (50–180 KB) with full LaTeX support and structured sectioning.
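
To make the Knowledge Layer more concrete, here is a toy sketch of a graph of entities and relations with a one-hop neighbour lookup. It is a stand-in for illustration only and does not use LightRAG's API; the class name `TinyKnowledgeGraph`, its methods, and the sample triples are hypothetical.

```python
from collections import defaultdict


class TinyKnowledgeGraph:
    """Toy entity-relation store; a stand-in for illustration, not LightRAG."""

    def __init__(self):
        # Maps an entity to the set of (relation, other_entity) edges it touches.
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))
        # Store the reverse direction so lookups work from either entity.
        self._edges[obj].add((f"inverse:{relation}", subject))

    def neighbours(self, entity):
        """Return the one-hop edges of an entity, sorted for stable output."""
        return sorted(self._edges.get(entity, set()))


kg = TinyKnowledgeGraph()
kg.add("Coulomb's law", "defines", "electrostatic force")
kg.add("electric field", "derived_from", "electrostatic force")
related = kg.neighbours("electrostatic force")
```

A lookup on "electrostatic force" returns both connected concepts, which is the kind of related context a graph-based retriever could pass to the agents in addition to plain text matches.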

🔄 How It Works

Phase 1: Knowledge Base Construction

Parsing: Documents are broken down into semantically meaningful chunks.
Indexing: Entities and relations are extracted to build a Knowledge Graph.
Embedding: Chunks are stored in vector databases (FAISS/ChromaDB) for semantic retrieval.
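
The parsing step above can be sketched as a simple paragraph-based chunker. This is a minimal illustration under stated assumptions, not the project's actual chunking logic: it treats blank-line-separated paragraphs as the semantic unit and packs them into chunks under a character budget. The function name `chunk_paragraphs` and the `max_chars` parameter are hypothetical.

```python
def chunk_paragraphs(text, max_chars=500):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A paragraph longer than max_chars becomes its own oversized chunk rather
    than being split mid-sentence.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Coulomb's law.\n\nElectric field lines.\n\nGauss's law."
chunks = chunk_paragraphs(sample, max_chars=40)
```

Each resulting chunk would then be embedded and written to the vector store; that step is omitted here because it depends on the embedding model and database in use.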

Phase 2: Content Generation

Research: The Research Agent queries the RAG system for comprehensive data on a specific topic.
Structuring: The Indexer builds a hierarchical chapter outline aligned with official curriculum standards.
Writing: The Generator crafts detailed prose, derivations, and examples.
Refinement: The Enhancer and Formatter ensure technical accuracy and proper Markdown/LaTeX formatting.
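
The sequential hand-off described above, where each agent refines the previous agent's output, can be sketched with plain functions. This does not use CrewAI; every function name and the shape of the `state` dictionary are hypothetical, and each stage body is a placeholder for what would be an LLM call in the real pipeline.

```python
def research(state):
    # Placeholder for a RAG query; a real agent would retrieve source passages.
    return {**state, "notes": [f"retrieved context for {state['topic']}"]}


def build_outline(state):
    return {**state, "outline": ["Introduction", "Key Concepts"]}


def write_draft(state):
    sections = [{"heading": h, "body": "draft text"} for h in state["outline"]]
    return {**state, "sections": sections}


def enhance(state):
    # Placeholder refinement: mark each section as checked against the notes.
    sections = [{**s, "body": s["body"] + " (checked)"} for s in state["sections"]]
    return {**state, "sections": sections}


def format_markdown(state):
    parts = [f"# {state['topic']}"]
    parts += [f"## {s['heading']}\n\n{s['body']}" for s in state["sections"]]
    return {**state, "markdown": "\n\n".join(parts)}


STAGES = [research, build_outline, write_draft, enhance, format_markdown]


def run_pipeline(topic, stages=STAGES):
    """Pass a state dict through each stage in order and return the final state."""
    state = {"topic": topic}
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline("Current Electricity")
```

Because every stage takes and returns the same kind of dictionary, a stage can be swapped or tested in isolation without touching the others.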

🛠 Tech Stack

Frameworks: CrewAI, LightRAG, RAG-Anything
PDF Processing: MinerU (magic-pdf)
Vector Databases: FAISS, ChromaDB, nano-vectordb
Models: OpenAI GPT-4o-mini (Primary), Ollama LLaMA 3.1 (Local Fallback)
Embeddings: all-MiniLM-L6-v2 (SentenceTransformers)
Search: SerperDev API
Language: Python 3.10 to 3.13
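
The primary and fallback model arrangement listed above could work along the following lines. The README does not describe how or when the fallback is triggered, so this sketch assumes the simplest policy: try the primary model and switch to the local one if the call raises. The function names are hypothetical, and the two model callables are stubs rather than real OpenAI or Ollama clients.

```python
def generate_with_fallback(prompt, primary, fallback):
    """Try the primary model first; use the fallback if the call raises."""
    try:
        return primary(prompt)
    except Exception:
        # Broad on purpose in this sketch: any primary failure triggers fallback.
        return fallback(prompt)


def call_primary(prompt):
    # Stub that simulates an unreachable hosted model.
    raise ConnectionError("primary model unreachable")


def call_local(prompt):
    # Stub standing in for a locally served model.
    return f"[local] {prompt}"


answer = generate_with_fallback("Define electric flux.", call_primary, call_local)
```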

🚀 Getting Started

Prerequisites

Python 3.10 to 3.13
OpenAI API Key
Serper API Key (Optional)

Installation

Clone the repository:

```
git clone https://github.com/your-username/ContentGen-RAG-System.git
cd ContentGen-RAG-System
```

Set up Environment: Create a .env file with your credentials:

```
OPENAI_API_KEY=your_key_here
SERPER_API_KEY=your_key_here
MODEL=gpt-4o-mini
```

Install Dependencies:

```
pip install -r requirements_complete.txt
```
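
Before running anything, it can help to confirm that the credentials are visible to Python. The sketch below is a hypothetical preflight check, not part of the repository. It assumes the values from `.env` have already been loaded into the process environment, treats `OPENAI_API_KEY` as required and `SERPER_API_KEY` as optional (as listed in the prerequisites), and assumes `gpt-4o-mini` as the default for `MODEL`.

```python
import os

REQUIRED = ("OPENAI_API_KEY",)
# Assumed defaults for this sketch; an empty Serper key means search is off.
OPTIONAL_DEFAULTS = {"SERPER_API_KEY": "", "MODEL": "gpt-4o-mini"}


def load_settings(env=None):
    """Collect settings from an environment mapping, failing fast if a
    required variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    settings = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[name] = env.get(name) or default
    return settings


# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({"OPENAI_API_KEY": "your_key_here"})
```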

Running the System

Process New PDFs:
```
python -m physics_content.rag_system
```
Generate Chapters:
```
python -m physics_content.main
```

📄 License

This project is specialized for educational content generation and is provided for research and portfolio demonstration purposes.
