This project builds an end-to-end data analytics system that combines Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Apache Spark. Spark handles distributed processing of large-scale datasets, a RAG pipeline retrieves passages from a document corpus and generates insights from them, and a unified application ties the two together to answer natural language queries. The project shows how structured data analytics and unstructured document retrieval can work together, mirroring real-world systems in which organizations pair data processing frameworks with AI-driven search.
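As a rough illustration of how the two sides might meet, the sketch below ranks a few documents against a query and assembles a prompt that also carries a structured aggregate. It is not the notebook's code. Everything in it is an assumption made for illustration: the `DOCS` corpus, the `REGION_REVENUE` figures (standing in for a result that Spark would compute), and the helper names. Simple word-overlap cosine scoring stands in for embedding-based retrieval, and no LLM is called.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the project's document collection.
DOCS = [
    "Quarterly report: revenue grew in the north region due to new retail contracts.",
    "Support handbook: refunds are processed within five business days.",
    "Logistics memo: shipping delays affected the south region in March.",
]

# Hypothetical aggregate standing in for a result computed with Spark.
REGION_REVENUE = {"north": 1200.0, "south": 800.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query (toy retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query, docs, stats):
    """Combine retrieved text and structured aggregates into one prompt string."""
    context = "\n".join(retrieve(query, docs))
    facts = ", ".join(f"{region}={value}" for region, value in sorted(stats.items()))
    return f"Context:\n{context}\n\nAggregates: {facts}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("Why did revenue grow in the north region?", DOCS, REGION_REVENUE))
```

In a full pipeline, the prompt string would be passed to an LLM, the aggregates would come from a Spark job, and the retriever would typically query a vector index instead of scoring raw word counts.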
- Run the command "pip install -r requirements.txt".
- Click the "Run All" button to run the entire notebook.
Claude (Anthropic) was used for troubleshooting and guidance.