Advanced AI Pipeline for Automated Textbook Content Generation
A robust multi-agent system powered by Knowledge Graphs and Hybrid RAG to transform raw PDFs into high-quality, exam-aligned educational chapters.
Key Features • Technical Architecture • Workflow • Tech Stack • Getting Started
ContentGen-RAG-System is an AI pipeline that automates the creation of comprehensive study materials. Using Retrieval-Augmented Generation (RAG) and specialized AI agents, the system parses complex textbook PDFs (including equations, diagrams, and tables) and generates publication-quality Markdown chapters tailored for exams such as JEE, NEET, and board exams.
| Feature | Description |
|---|---|
| Intelligent PDF Ingestion | Deep parsing of scientific PDFs using MinerU to capture text, LaTeX equations, and layout structure. |
| Knowledge Graph Integration | Building semantic relationships between concepts using LightRAG for high-accuracy retrieval. |
| Multi-Agent Orchestration | A team of five specialized CrewAI agents (Researcher, Indexer, Generator, Enhancer, Formatter) working in sequence. |
| Hybrid Retrieval | Combining dense vector search (FAISS/ChromaDB) with BM25 keyword matching, so that both paraphrased and exact-term queries find relevant context. |
| Multimodal Understanding | Processing and contextualizing figures and tables alongside textual data. |
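
The hybrid retrieval row can be illustrated with a small, dependency-free sketch. It scores documents with BM25, ranks them separately by cosine similarity over toy vectors, and merges the two rankings with reciprocal rank fusion (RRF). The fusion method, the function names, and the two-dimensional vectors are assumptions made for illustration; the project's actual scoring and its FAISS/ChromaDB calls are not shown here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(scores):
    """Return document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def rrf(rankings, k=60):
    """Fuse rankings: each document earns 1 / (k + position) per list."""
    fused = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = [
    "newton second law relates force mass and acceleration".split(),
    "ohm law relates voltage current to resistance".split(),
    "kinetic energy depends on mass and velocity".split(),
]
# Toy 2-d vectors standing in for sentence embeddings (assumption).
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]
query = "force and acceleration".split()
query_vec = [1.0, 0.0]

keyword_rank = rank(bm25_scores(query, docs))
dense_rank = rank([cosine(query_vec, v) for v in doc_vecs])
hybrid = rrf([keyword_rank, dense_rank])
print(hybrid)
```

RRF only needs the rank positions, so the BM25 and cosine scores never have to be put on a common scale; a weighted sum of normalized scores would be an equally plausible alternative.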
The system is built on a modular architecture that separates document ingestion from content generation, so PDFs can be parsed and indexed once and chapters generated from that index in a separate step.

    graph TD
        A[Raw Textbook PDFs] --> B[MinerU Parser]
        B --> C[Multimodal Processing]
        C --> D[LightRAG Knowledge Graph]
        D --> E[Hybrid Vector Storage]
        E --> F[CrewAI Multi-Agent Pipeline]
        F --> G[Final Markdown Chapters]
        subgraph "Agents Layer"
            F1[Research Agent] --> F2[Content Indexer]
            F2 --> F3[Content Generator]
            F3 --> F4[RAG Enhancer]
            F4 --> F5[Markdown Formatter]
        end

- Ingestion Layer: Utilizes MinerU for high-fidelity extraction of complex academic content.
- Knowledge Layer: Implements LightRAG to maintain a graph-based understanding of physics entities and relationships.
- Orchestration Layer: Uses CrewAI to manage sequential workflows where each agent refines the output of the previous one.
- Output Layer: Produces detailed Markdown files (50–180 KB) with full LaTeX support and structured sectioning.
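
The sequential hand-off in the orchestration layer can be pictured with plain Python functions. This is a minimal sketch of the pattern only: the five stage functions are hypothetical stand-ins for the agents, and none of this is CrewAI's API or the project's actual code.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def run_sequential(stages: List[Stage], topic: str) -> str:
    """Pass the topic to the first stage, then feed each result onward."""
    result = topic
    for stage in stages:
        result = stage(result)
    return result


# Hypothetical stand-ins for the five agents; each wraps its input so the
# order of refinement is visible in the final string.
def research(topic: str) -> str:
    return f"notes({topic})"


def index(notes: str) -> str:
    return f"outline({notes})"


def generate(outline: str) -> str:
    return f"draft({outline})"


def enhance(draft: str) -> str:
    return f"enhanced({draft})"


def format_markdown(text: str) -> str:
    return f"markdown({text})"


pipeline = [research, index, generate, enhance, format_markdown]
chapter = run_sequential(pipeline, "Kinematics")
print(chapter)
```

Because every stage takes and returns a string, a stage can be swapped or tested in isolation without touching the others.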
- Parsing: Documents are broken down into semantically meaningful chunks.
- Indexing: Entities and relations are extracted to build a Knowledge Graph.
- Embedding: Chunks are stored in vector databases (FAISS/ChromaDB) for semantic retrieval.
- Research: The Research Agent queries the RAG system for comprehensive data on a specific topic.
- Structuring: The Indexer builds a hierarchical chapter outline aligned with official curriculum standards.
- Writing: The Generator crafts detailed prose, derivations, and examples.
- Refinement: The Enhancer and Formatter ensure technical accuracy and proper Markdown/LaTeX formatting.
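
As a rough illustration of the indexing step, the sketch below stores (subject, relation, object) triples in an adjacency map and expands a query entity by a fixed number of hops to gather related context. The class, its methods, and the sample triples are hypothetical; LightRAG extracts entities and relations with an LLM and has its own storage and API, which this does not reproduce.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy directed graph of (subject, relation, object) triples."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        """Record one directed, labeled edge."""
        self._edges[subject].append((relation, obj))

    def neighbors(self, entity):
        """Return the outgoing (relation, object) pairs of an entity."""
        return list(self._edges.get(entity, []))

    def expand(self, start, hops=1):
        """Collect every entity reachable from start within `hops` edges."""
        seen = {start}
        frontier = [start]
        for _ in range(hops):
            next_frontier = []
            for entity in frontier:
                for _, target in self._edges.get(entity, []):
                    if target not in seen:
                        seen.add(target)
                        next_frontier.append(target)
            frontier = next_frontier
        return seen


kg = KnowledgeGraph()
kg.add("force", "equals", "mass times acceleration")
kg.add("force", "measured_in", "newton")
kg.add("newton", "is_a", "SI unit")

print(kg.neighbors("force"))
print(sorted(kg.expand("force", hops=2)))
```

Widening `hops` trades precision for coverage: one hop returns only directly related concepts, while two hops also pulls in concepts related to those.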
- Frameworks: CrewAI, LightRAG, RAG-Anything
- PDF Processing: MinerU (magic-pdf)
- Vector Databases: FAISS, ChromaDB, nano-vectordb
- Models: OpenAI GPT-4o-mini (Primary), Ollama LLaMA 3.1 (Local Fallback)
- Embeddings: all-MiniLM-L6-v2 (SentenceTransformers)
- Search: SerperDev API
- Language: Python 3.10+
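
The primary/fallback model pairing listed above could be wired up along these lines. This is a sketch under stated assumptions: `primary` and `fallback` are hypothetical callables wrapping the hosted and local models, and falling back on any exception is an assumed policy rather than the project's documented behavior.

```python
from typing import Callable

Model = Callable[[str], str]


def generate_with_fallback(prompt: str, primary: Model, fallback: Model) -> str:
    """Try the primary model; if it raises, answer with the fallback."""
    try:
        return primary(prompt)
    except Exception:
        # Assumed policy: any failure (network, quota, auth) triggers fallback.
        return fallback(prompt)


# Hypothetical stand-ins for the hosted and local model clients.
def hosted_model(prompt: str) -> str:
    raise ConnectionError("hosted API unreachable")


def local_model(prompt: str) -> str:
    return f"[local] {prompt}"


answer = generate_with_fallback("Define momentum.", hosted_model, local_model)
print(answer)
```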
- Python 3.10 to 3.13
- OpenAI API Key
- Serper API Key (Optional)

1. Clone the repository:

        git clone https://github.com/your-username/ContentGen-RAG-System.git
        cd ContentGen-RAG-System

2. Set up the environment. Create a `.env` file with your credentials:

        OPENAI_API_KEY=your_key_here
        SERPER_API_KEY=your_key_here
        MODEL=gpt-4o-mini

3. Install dependencies:

        pip install -r requirements_complete.txt

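
Before running the pipeline, it can help to confirm that the `.env` file contains the keys it needs. The following is a minimal standard-library sketch; the helper names are hypothetical, and treating only `OPENAI_API_KEY` as required follows the prerequisites list, where the Serper key is marked optional. The project itself may load these values differently.

```python
REQUIRED_KEYS = ("OPENAI_API_KEY",)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip().strip('"').strip("'")
    return config


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not config.get(key)]


sample = """
# credentials
OPENAI_API_KEY=your_key_here
MODEL=gpt-4o-mini
"""
config = parse_env(sample)
print(missing_keys(config))
```

To check a real file, read it first, for example `parse_env(open(".env").read())`.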

- Process new PDFs:

        python -m physics_content.rag_system

- Generate chapters:

        python -m physics_content.main

This project is specialized for educational content generation and is provided for research and portfolio demonstration purposes.