IntelliQuery is a Streamlit application that provides document-based question answering using a Retrieval-Augmented Generation (RAG) approach. It supports multiple document types (PDF, PPT, Excel, audio/video transcripts, images) and answers user questions based on the embedded content. Additionally, it integrates Hierarchical BERT (HBERT) for long queries (more than 15 words) to improve retrieval on large or complex questions, while retaining a simpler vector store approach for short queries.
Document Upload & Processing
- Upload PDF, PPT, Excel, Image, Audio, or Video files.
- Automatic text extraction from files (PPT slides, PDF pages, Excel sheets, etc.).
- Speech-to-text transcription for audio and video via Whisper.
- Cleans and tokenizes text for downstream tasks.
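The cleaning and tokenization step can be sketched as below. This is an illustrative helper, not the app's actual implementation; the function names `clean_text` and `tokenize` are assumptions.

```python
import re

def clean_text(raw: str) -> str:
    """Collapse whitespace and strip control characters
    (illustrative helper, not the app's exact code)."""
    text = re.sub(r"[\x00-\x1f]+", " ", raw)  # drop control characters
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()

def tokenize(text: str) -> list[str]:
    """Naive word-level tokenization on non-alphanumeric boundaries."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

sample = "Slide 1:\n\n  Retrieval-Augmented   Generation\x0c"
print(clean_text(sample))   # Slide 1: Retrieval-Augmented Generation
print(tokenize(clean_text(sample)))
```

In practice the extracted text from each file type (PDF pages, PPT slides, Excel cells) would be passed through helpers like these before embedding.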
Vector Store with FAISS
- Uses Google Generative AI Embeddings (models/embedding-001) for short queries.
- Stores embeddings in a FAISS index for fast similarity search.
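The idea behind the FAISS similarity search can be illustrated with a NumPy stand-in. The real app stores Google embedding vectors in a FAISS index; here random vectors take their place, and the L2 search mirrors the exact metric of FAISS's IndexFlatL2.

```python
import numpy as np

# Stand-in for real embeddings (the app uses Google's models/embedding-001);
# each "document" here is a random 8-dim vector purely for illustration.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(100, 8)).astype("float32")

def search(query_vec: np.ndarray, k: int = 3) -> list[int]:
    """Exact L2 nearest-neighbour search, the same metric
    FAISS's IndexFlatL2 uses."""
    dists = np.linalg.norm(doc_vectors - query_vec, axis=1)
    return np.argsort(dists)[:k].tolist()

query = doc_vectors[42] + 0.01  # a vector very close to document 42
print(search(query))            # document 42 ranks first
```

FAISS performs the same computation with optimized index structures, which matters once the corpus grows beyond a few thousand chunks.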
Hierarchical BERT for Long Queries
- Automatically switches to Hierarchical BERT retrieval if the query exceeds 15 words.
- Splits documents into large “doc chunks,” which are further split into sub-chunks.
- Performs two-stage retrieval:
- First retrieves the top doc chunks.
- Then retrieves the top sub-chunks within them for finer context.
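The two-stage scheme above can be sketched as a toy example. The `score` callable stands in for a SentenceTransformers similarity against the query; the chunk sizes follow the ~5k-character doc chunks and ~500-character sub-chunks described later in this README.

```python
def split_chunks(text: str, size: int) -> list[str]:
    """Fixed-size character chunks (real pipelines often add
    overlap; omitted here for brevity)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def two_stage_retrieve(text: str, score, top_docs=2, top_subs=3):
    """Toy two-stage retrieval: rank ~5k-char doc chunks with `score`,
    then re-rank their ~500-char sub-chunks. Any callable
    str -> float works as a stand-in for an embedding similarity."""
    doc_chunks = split_chunks(text, 5000)
    best_docs = sorted(doc_chunks, key=score, reverse=True)[:top_docs]
    sub_chunks = [s for d in best_docs for s in split_chunks(d, 500)]
    return sorted(sub_chunks, key=score, reverse=True)[:top_subs]

corpus = ("filler " * 2000) + "hierarchical bert retrieval " + ("filler " * 2000)
hits = two_stage_retrieve(corpus, score=lambda c: c.count("bert"))
print("bert" in hits[0])  # True
```

The coarse pass cheaply narrows the candidate set, so the fine-grained pass only has to score sub-chunks from a handful of doc chunks.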
Q&A Pipeline
- Uses Google Generative AI (Gemini) for final answer generation.
- Produces summaries or direct answers to user queries based on the retrieved text.
- Optional Cohere query expansion for short queries to enrich retrieval.
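The hand-off from retrieval to Gemini can be sketched as simple prompt assembly. The app sends the retrieved chunks through a Q&A chain; the template below is a hypothetical example of how such a grounded prompt is built, not the app's exact wording.

```python
def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a grounded Q&A prompt from retrieved chunks
    (hypothetical template for illustration)."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What is RAG?", ["RAG combines retrieval with generation."])
print(prompt)
```

Numbering the chunks makes it easy for the model to attribute its answer to specific passages if citation is needed later.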
Responsive UI
- Built with Streamlit for an interactive web app.
- Displays conversation history and supports continuous queries.
- Downloadable PDF of the conversation history.
- Python 3.9+
- Streamlit – interactive UI for uploading files and chatting.
- Whisper – speech-to-text for audio/video.
- FAISS – vector database for fast similarity searches.
- Google Generative AI Embeddings & Gemini – text embeddings and LLM Q&A.
- SentenceTransformers (Hierarchical BERT) – hierarchical embedding for long queries.
- Cohere – fetch related terms for short queries.
- PyPDF2, python-pptx, Pandas – for file parsing.
Clone the Repository
git clone https://github.com/pranavv34/IntelliQuery.git
Create and Activate a Virtual Environment (recommended)
python -m venv myenv
source myenv/bin/activate   # Linux/Mac
myenv\Scripts\activate      # Windows
Install Dependencies
pip install -r requirements.txt
Make sure libraries such as faiss-cpu, sentence-transformers, and whisper are installed.
Add API Keys
- Create a .env file in the project root and add:
  COHERE_API_KEY=YOUR_COHERE_API_KEY
  GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
- Adjust as needed if you have more environment variables.
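Loading these keys at startup might look like the sketch below. It is a minimal stand-in for a library such as python-dotenv (which the app may or may not use); the `load_env` helper and the demo key values are assumptions for illustration.

```python
import os
import tempfile

def load_env(path: str) -> dict[str, str]:
    """Minimal .env parser (stand-in for python-dotenv's load_dotenv):
    reads KEY=VALUE lines, exports them, and returns them as a dict."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
                os.environ.setdefault(key.strip(), value.strip())
    return values

# Example: write a throwaway .env and load it.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write("COHERE_API_KEY=demo-key\nGOOGLE_API_KEY=demo-key-2\n")
keys = load_env(fh.name)
print(keys["COHERE_API_KEY"])  # demo-key
```

Keeping keys in .env rather than in source keeps them out of version control; remember to add .env to .gitignore.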
Run the App
streamlit run app.py
This will launch the Streamlit interface in your default browser.
Upload Documents
- In the sidebar, select the file type and upload one or more files.
- IntelliQuery automatically processes and indexes these documents (creating a FAISS index for short queries).
Ask Questions
- Type your query in the text box at the bottom.
- If the query is 15 words or fewer, IntelliQuery uses standard FAISS + Google Generative AI Embeddings.
- If the query is more than 15 words, IntelliQuery builds or loads the Hierarchical BERT indexes on demand and performs two-stage retrieval.
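The routing rule above reduces to a one-line word count check. The function name `choose_retriever` and its return labels are illustrative; only the 15-word threshold and the two retrieval modes come from this README.

```python
def choose_retriever(query: str) -> str:
    """Route on query length: 15 words or fewer -> FAISS + Google
    embeddings, longer -> Hierarchical BERT (threshold per the
    behaviour described above; the function itself is a sketch)."""
    return "hbert" if len(query.split()) > 15 else "faiss"

print(choose_retriever("What is RAG?"))        # faiss
print(choose_retriever(" ".join(["word"] * 16)))  # hbert
```

Word count is a cheap proxy for query complexity, so the heavier hierarchical pipeline is only paid for when a query is likely to need it.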
View and Download Conversations
- The conversation displays in a chat-like format.
- Click “Download Conversation” to get a PDF summary of the Q&A.
- On-demand Indexing: If you ask a query longer than 15 words, the app checks if the Hierarchical BERT indexes are built. If not, it builds them from the same text corpus already stored.
- Two-stage Retrieval:
- Doc-level retrieval with ~5k-character chunks.
- Sub-chunk retrieval with ~500-character chunks to narrow context further.
- Passes the final sub-chunks to the Q&A chain for a detailed answer.
- Authors: Pranav Vuddagiri & Sivani Varada
- Issues: Please open an issue on this repo if you encounter any problems.
Enjoy using IntelliQuery! Let us know if you have any questions or feature suggestions.