- Overview
- Interface Options
- System Configurations
- Project Structure
- Requirements
- Models Used
- Setup Instructions
- Running the Application
- Batch Evaluation
## Overview

This repository contains the code for an assignment in the Conversational AI module of the MPhil in Human-Inspired Artificial Intelligence programme at the University of Cambridge. The project implements IndoGuide, an intelligent travel companion chatbot designed to help users explore Indonesia by providing information on must-see destinations, visas, transportation, safety, and local etiquette.

The system leverages Retrieval-Augmented Generation (RAG) with multiple reranking strategies to deliver contextually relevant and accurate responses.
## Interface Options

IndoGuide provides two ways to interact with the system:

- **Web Interface (Streamlit)** (`app.py`): user-friendly, interactive web application
  - Session management with persistent conversation history
  - Real-time RAG configuration and persona selection
- **Command Line Interface (CLI)** (`cli.py`): terminal-based interaction
  - Ideal for scripted interactions and testing
  - Direct control over configuration parameters
## System Configurations

### Persona Configurations

The system supports three persona configurations that affect the tone and style of responses:

- **Neutral** (baseline): Standard, informative responses
- **Friendly**: Warm, conversational, and engaging tone
- **Professional**: Formal, detailed, and comprehensive responses
### RAG Configurations

The Retrieval-Augmented Generation system offers three reranking strategies:

- **Baseline (No Reranking)**
  - Top-10 initial vector retrieval
  - Direct top-4 selection
  - Fastest option, baseline performance
- **Cross-Encoder Reranking**
  - Top-10 initial vector retrieval
  - Top-4 Cross-Encoder reranking (MS MARCO-based)
  - Balanced speed and accuracy
- **LLM Reranking**
  - Top-10 initial vector retrieval
  - Top-4 LLM reranking using GPT
  - Best accuracy, slower performance
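All three strategies share the same shape: retrieve 10 candidates, score them, keep the top 4. The sketch below shows that retrieve-then-rerank flow with a stand-in scoring function; in the actual system the scorer would be the sentence-transformers Cross-Encoder or a GPT call, and the baseline simply skips the scoring step.

```python
# Illustrative sketch of the retrieve-then-rerank flow described above.
# The scoring function is a stand-in; the real system scores pairs with a
# Cross-Encoder (ms-marco-MiniLM-L6-v2) or an LLM call.

from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 4) -> List[str]:
    """Score each (query, document) pair and keep the top_k documents."""
    scored: List[Tuple[float, str]] = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Toy scorer: counts shared words. (The baseline configuration skips
# reranking entirely and keeps the first 4 retrieved documents.)
def overlap_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

docs = [f"doc about topic {i}" for i in range(10)]  # top-10 retrieval result
top4 = rerank("topic 3", docs, overlap_score)
```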
## Project Structure

```
IndoGuide/
├── app.py                             # Streamlit web application
├── cli.py                             # CLI chat interface
├── batch_replay.py                    # Batch dialogue replay for evaluation
├── evaluate_batch.py                  # Evaluation metrics calculator
├── config/
│   └── config.py                      # Configuration management
├── core/
│   ├── llm_client.py                  # LLM API client
│   ├── rag_system.py                  # RAG system implementation
│   └── logger.py                      # Session logging utilities
├── data/
│   ├── indonesia_knowledge_base.json  # Knowledge base for RAG
│   ├── test_dialogues.json            # Test dialogues for batch replay
│   └── prompts.json                   # System prompts for personas and metrics
├── results/
│   ├── batch/                         # Batch replay output results
│   ├── laaj/                          # LLM-as-a-Judge rating results
│   └── eval/                          # Evaluation metric calculations
├── logs/                              # Session conversation logs
├── assets/
│   └── style.css                      # Streamlit UI styling
└── environment.yml                    # Conda environment specification
```
Key data files:

- **Knowledge Base** (`data/indonesia_knowledge_base.json`): Contains all the factual information about Indonesia used by the RAG system
- **Test Dialogues** (`data/test_dialogues.json`): Collection of test dialogues used for batch replay and technical research evaluation
- **Prompts** (`data/prompts.json`): System prompts for the different personas and evaluation metrics (factuality, faithfulness, helpfulness, overall)

Results directories:

- `results/batch/`: Output from batch dialogue replay, containing system responses and metadata
- `results/laaj/`: LLM-as-a-Judge ratings for responses (factuality, faithfulness, helpfulness, overall quality)
- `results/eval/`: Calculated evaluation metrics (Recall@K, MRR, NDCG@K, and averaged LAAJ metrics)
## Requirements

- Python 3.10 or higher

The project requires the following Python libraries:

| Library | Version | Purpose |
|---|---|---|
| `streamlit` | ≥1.51 | Web interface framework |
| `openai` | ≥2.8 | OpenAI API client for LLM interactions |
| `python-dotenv` | Latest | Environment variable management |
| `chromadb` | ≥0.4.0 | Vector database for RAG |
| `sentence-transformers` | ≥2.2.0 | Embedding models and cross-encoders |

All dependencies are automatically installed via the conda environment.
## Models Used

The system uses several specialized models for different components:

| Component | Model | Provider | Purpose |
|---|---|---|---|
| Chatbot | `gpt-5-nano-2025-08-07` | OpenAI | Main conversational AI for generating responses |
| Embedding | `text-embedding-3-small` | OpenAI | Text vectorization for semantic retrieval |
| Cross-Encoder Reranking | `cross-encoder/ms-marco-MiniLM-L6-v2` | Hugging Face | Re-ranks retrieved documents for relevance |
| LLM Reranker | `gpt-5-nano-2025-08-07` | OpenAI | LLM-based re-ranking of retrieved documents |
| LLM-as-a-Judge (LAAJ) | `gpt-4o-mini-2024-07-18` | OpenAI | Evaluates response quality (factuality, faithfulness, helpfulness) |

Note: The chatbot and LLM reranker use the same model (`gpt-5-nano-2025-08-07`) for consistency and cost efficiency.
## Setup Instructions

Create and activate the conda environment from `environment.yml`:

```bash
conda env create -f environment.yml
conda activate IndoGuide
```

This will install all required dependencies, including:

- OpenAI API client
- Streamlit
- Chroma vector database
- Cross-Encoder models

Create an `openai.key` file in the root directory with your OpenAI API key:

```bash
echo "your-openai-api-key-here" > openai.key
```

## Running the Application

### Web Interface (Streamlit)

Run the interactive web application:

```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501
Features:
- Select persona and RAG configuration from the sidebar
- Real-time conversation history
- Session management with option to save or create new chat
### Command Line Interface (CLI)

Run the command-line chat interface:

```bash
python cli.py [OPTIONS]
```

Arguments:

- `--persona {neutral,friendly,professional}` (default: `neutral`): Choose the persona for the assistant
- `--rag-config {baseline,crossencoder,llm}` (default: `baseline`): Choose the RAG reranking strategy

Example:

```bash
python cli.py --persona friendly --rag-config llm
```

CLI commands:

- `/reset`: Start a new conversation
- `/history`: Show conversation history
- `/config`: Show current configuration
- `/exit`: Exit the CLI
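A chat REPL like this typically checks for slash commands before sending input to the model. The sketch below shows that dispatch pattern in minimal form; the handler and its return values are illustrative assumptions, not the actual internals of `cli.py`.

```python
# Minimal sketch of slash-command dispatch in a chat REPL.
# Handler behavior and messages are illustrative; cli.py may differ.

def handle_command(cmd: str, history: list) -> str:
    """Handle a slash command, mutating history as needed; return a status string."""
    if cmd == "/reset":
        history.clear()
        return "Started a new conversation."
    if cmd == "/history":
        return "\n".join(f"{role}: {text}" for role, text in history) or "(empty)"
    if cmd == "/config":
        return "persona=neutral rag-config=baseline"  # placeholder values
    if cmd == "/exit":
        return "exit"
    return f"Unknown command: {cmd}"

history = [("user", "Hi"), ("assistant", "Hello!")]
print(handle_command("/history", history))
print(handle_command("/reset", history))
```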
## Batch Evaluation

### Batch Replay

Replay a set of test dialogues and collect system responses:

```bash
python batch_replay.py [OPTIONS]
```

Arguments:

- `--persona {neutral,friendly,professional}` (default: `neutral`): Persona for responses
- `--rag-config {baseline,crossencoder,llm}` (default: `baseline`): RAG reranking strategy
- `--input-file PATH` (default: `data/test_dialogues.json`): Path to test dialogues JSON file
- `--output-dir PATH` (default: `results/batch`): Directory to save replay results

Example:

```bash
python batch_replay.py --persona friendly --rag-config llm --output-dir results/batch
```

Results are saved as JSON files with metadata and turn-by-turn dialogue data.
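Downstream scripts consume these JSON files directly. The snippet below sketches how a result file might be read; the field names (`metadata`, `dialogues`, `turns`) are illustrative assumptions about the layout, not a documented schema.

```python
# Sketch of reading a batch replay result. The keys used here
# ("metadata", "dialogues", "turns") are assumed, not a documented schema.

import json

sample = {
    "metadata": {"persona": "neutral", "rag_config": "baseline"},
    "dialogues": [
        {"id": "d1",
         "turns": [{"user": "Best time to visit Bali?",
                    "assistant": "The dry season, roughly April to October..."}]}
    ],
}

def summarize(result: dict) -> str:
    """One-line summary: configuration plus total turn count."""
    meta = result["metadata"]
    n_turns = sum(len(d["turns"]) for d in result["dialogues"])
    return f"{meta['persona']}/{meta['rag_config']}: {n_turns} turns"

# In practice, load a real file:
#   with open("results/batch/batchreplay_....json") as f:
#       result = json.load(f)
print(summarize(sample))
```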
### Evaluation Metrics

Rate batch replay results using LLM-as-a-Judge metrics:

```bash
python evaluate_batch.py [OPTIONS]
```

Arguments:

- `--batch-result PATH`: Path to a batch replay result file (from `results/batch/`)
- `--knowledge-base PATH` (default: `data/indonesia_knowledge_base.json`): Path to knowledge base JSON file
- `--output-dir PATH` (default: `results/laaj`): Directory to save LAAJ ratings
- `--eval-dir PATH` (default: `results/eval`): Directory to save calculated metrics

Example:

```bash
python evaluate_batch.py --batch-result results/batch/batchreplay_baseline_neutral_gpt-5-nano-2025-08-07_20251207185252.json
```
The evaluation produces:

- **LAAJ Ratings (LLM-as-a-Judge)**: Factuality, Faithfulness, Helpfulness, Overall Quality (saved in `results/laaj/`)
- **Retrieval Metrics**: Recall@K, MRR, NDCG@K (saved in `results/eval/`)
- **Aggregated Metrics**: Mean scores across all test dialogues
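For reference, the retrieval metrics can be defined as follows. This is a minimal sketch assuming binary relevance; `evaluate_batch.py` may differ in details (e.g. graded relevance for NDCG).

```python
# Reference implementations of Recall@K, MRR, and NDCG@K under binary
# relevance. evaluate_batch.py's actual implementation may differ.

import math
from typing import List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """DCG over the top k, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["d2", "d7", "d1", "d4"]
relevant = {"d1", "d9"}
print(recall_at_k(ranked, relevant, 4))  # 0.5  (1 of 2 relevant docs in top 4)
print(mrr(ranked, relevant))             # first relevant doc is at rank 3 -> 1/3
```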
Here's a typical evaluation workflow:

1. Run batch replay with different configurations:

   ```bash
   python batch_replay.py --rag-config baseline --persona neutral
   python batch_replay.py --rag-config crossencoder --persona neutral
   python batch_replay.py --rag-config llm --persona neutral
   ```

2. Evaluate with LLM-as-a-Judge:

   ```bash
   python evaluate_batch.py --batch-result results/batch/batchreplay_baseline_neutral_*.json
   python evaluate_batch.py --batch-result results/batch/batchreplay_crossencoder_neutral_*.json
   python evaluate_batch.py --batch-result results/batch/batchreplay_llm_neutral_*.json
   ```

3. Review results in the `results/batch/`, `results/laaj/`, and `results/eval/` directories.