Live Demo: azureragchatbotapp.azurewebsites.net
(Status: demo availability may depend on Azure hosting.)
This repository contains a Streamlit-based application that uses Azure OpenAI and Azure Cognitive Search (or Chroma) to ingest, index, and semantically query text datasets (e.g., CSV files) stored in Azure Blob Storage, delivering context-rich answers on demand.
Demo video: `azure_rag_chatbot_demo.webm`
- Semantic Ingestion: Automatically load and deduplicate text entries from CSV files in Azure Blob Storage.
- Dynamic Indexing: Monitor CSV additions, modifications, and deletions, and sync changes to Azure Cognitive Search or a local Chroma DB.
- RAG QA Pipeline: Use LangChain's RetrievalQA with Azure OpenAI for context-aware question answering.
- Interactive UI: User-friendly chat interface built with Streamlit.
- Flexible Vector Store: Switch between Azure Cognitive Search and Chroma via a single environment variable (`VECTOR_DB_TYPE`):
  - `azure`: Uses Azure Cognitive Search with semantic search capabilities.
  - `chroma`: Uses a local Chroma vector database persisted under `./chroma_db`.
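The backend switch can be illustrated with a small sketch. The function name `choose_vector_store` is hypothetical (the repository's actual helper may be named differently); only the `VECTOR_DB_TYPE` variable and its two values come from this README:

```python
import os

def choose_vector_store():
    """Resolve the vector store backend from the environment.

    Returns "azure" or "chroma"; defaults to "chroma" for local runs.
    """
    backend = os.getenv("VECTOR_DB_TYPE", "chroma").lower()
    if backend not in ("azure", "chroma"):
        raise ValueError(f"Unsupported VECTOR_DB_TYPE: {backend!r}")
    return backend
```

Defaulting to `chroma` keeps the app usable on a laptop with no Azure Search service provisioned.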
- Python 3.8 or higher
- Azure account with:
  - Blob Storage account and container for CSV files
  - Azure OpenAI resource with deployed models for embeddings and chat
  - Azure Cognitive Search service (if using `VECTOR_DB_TYPE=azure`)
- Azure CLI
- (Optional) Docker for containerized deployment
```shell
git clone https://github.com/87tana/RAG_Chatbot.git
cd RAG_Chatbot
pip install -r requirements.txt
```
Create a `.env` file in the project root with the following variables (the `AZURE_SEARCH_*` values are only needed when `VECTOR_DB_TYPE=azure`):

```
AZURE_OPENAI_API_KEY=
AZURE_OPENAI_ENDPOINT=
AZURE_SEARCH_KEY=
AZURE_SEARCH_ENDPOINT=
AZURE_STORAGE_CONNECTION_STRING=
VECTOR_DB_TYPE=<azure|chroma>
```
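Catching missing configuration before the first Azure call gives clearer error messages. A hypothetical startup check (not part of the repository; the variable names match the template above):

```python
import os

# Settings that every backend needs; AZURE_SEARCH_* are checked separately
# only when VECTOR_DB_TYPE=azure.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_STORAGE_CONNECTION_STRING",
]

def missing_settings(env=os.environ):
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

A call at app startup can then fail fast, e.g. `if missing_settings(): raise SystemExit(...)`.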
- Start the Streamlit application: `streamlit run app.py`
- Enter questions about your ingested documents in the chat input box.
- The app maintains a conversation history during your session.
- If no CSV files are found, a dummy document is loaded to keep the chatbot functional.
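The dummy-document fallback can be sketched as a one-liner; the function name and placeholder text here are illustrative, not the app's actual code:

```python
def documents_or_dummy(docs):
    """Fall back to a placeholder so the retriever never sees an empty index."""
    if docs:
        return docs
    return ["No documents ingested yet. Upload CSV files to Blob Storage and reindex."]
```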
To test the RAG Chatbot with your own documents (for example, research papers or other private data):

- Prepare Your Data: Export your document into one or more CSV files, each with a `content` column.
- Upload to Blob Storage: Place these CSV files into your configured Azure Blob Storage container. The app will automatically pick up new or updated files when you restart or invoke reindexing.
- Run the App: Start the Streamlit app (`streamlit run app.py`) and navigate to the UI. Your uploaded documents will be ingested and indexed on launch.
- Local Testing Tip: Set `VECTOR_DB_TYPE=chroma`, place CSV files in a `./data` folder, and modify the ingestion code to read locally instead of from Blob Storage.
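A minimal local loader for that tip might look like the following; it mirrors the Blob Storage path (extract the `content` column, skip malformed files, deduplicate) but reads from `./data` instead. The function name is an assumption, not the repository's API:

```python
import csv
from pathlib import Path

def load_local_csvs(data_dir="./data"):
    """Read every CSV in `data_dir`, collect its `content` column, and
    deduplicate entries while preserving first-seen order."""
    seen = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            if not reader.fieldnames or "content" not in reader.fieldnames:
                continue  # like the app: skip files without a `content` column
            for row in reader:
                text = (row.get("content") or "").strip()
                if text:
                    seen.setdefault(text, None)  # dict preserves insertion order
    return list(seen)
```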
- Blob Loading: `app.py` fetches all CSV blobs, extracts the `content` column, and deduplicates entries.
- Index Sync: `reindex_if_blob_changed` checks for file changes, computes hashes, and updates the vector index accordingly.
- Embedding: Text documents are converted to semantic vectors via Azure OpenAI embeddings.
- Retrieval: A retriever pulls the top-k most relevant documents for each query.
- Generation: The Azure OpenAI chat model crafts a response based on the retrieved context.
- Display: The Streamlit chat UI renders the conversation.
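The hash-based change detection behind `reindex_if_blob_changed` can be sketched as follows. This is an illustration of the idea (compare a content digest against the one recorded at the last indexing pass), assuming SHA-256; the repository's actual hashing scheme may differ:

```python
import hashlib

def blob_changed(blob_bytes, previous_hashes, blob_name):
    """Return True (and record the new digest) when a blob's content hash
    differs from the one stored at the last indexing pass."""
    digest = hashlib.sha256(blob_bytes).hexdigest()
    if previous_hashes.get(blob_name) == digest:
        return False  # unchanged: skip re-embedding this blob
    previous_hashes[blob_name] = digest
    return True
```

Only blobs for which this returns `True` need to be re-embedded, which keeps reindexing cheap when most files are unchanged.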
- Missing CSV Files: Loads a dummy document to ensure functionality.
- Invalid CSV Format: Skips files without a `content` column and logs an error.
- Connection Issues: Retries Azure API calls with exponential backoff.
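Exponential backoff doubles the wait between attempts (1 s, 2 s, 4 s, ...) so transient Azure throttling can clear. A generic sketch, not the repository's actual retry helper:

```python
import time

def with_backoff(call, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()` up to `retries` times, sleeping base_delay * 2**attempt
    between failures; re-raise the last exception when attempts run out."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:  # a real app would catch specific Azure SDK errors
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The injectable `sleep` parameter keeps the helper testable without real delays.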
- CSV Size: Indexing large CSV files (>1 GB) may increase processing time; split large files for better performance.
- Indexing Time: Initial indexing depends on dataset size and Azure API latency (typically 1-5 seconds per 1,000 text entries).
- Scalability: Azure Cognitive Search is recommended for large datasets; Chroma suits smaller, local deployments.
- Azure App Service: Containerize with Docker or use the Python runtime.
- Streamlit Cloud: Deploy the repo directly and set environment variables in the dashboard.
See the Dockerfile for container setup and deployment instructions.
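For orientation, a minimal Dockerfile for a Streamlit app typically looks like the sketch below; the repository's own Dockerfile is authoritative and may differ (base image, port, extra system packages):

```dockerfile
# Illustrative sketch only — see the repository's Dockerfile for the real setup.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

Secrets such as `AZURE_OPENAI_API_KEY` should be injected at run time (e.g., App Service application settings), not baked into the image.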