- Overview
- Interface Options
- System Configurations
- Project Structure
- Requirements
- Models Used
- Setup Instructions
- Running the Application
- Batch Evaluation
## Overview

This repository contains the code for an assignment in the Conversational AI module of the MPhil in Human-Inspired Artificial Intelligence programme at the University of Cambridge. The project implements IndoGuide, an intelligent travel companion chatbot designed to help users explore Indonesia by providing information on must-see destinations, visas, transportation, safety, and local etiquette.

The system leverages Retrieval-Augmented Generation (RAG) with multiple reranking strategies to deliver contextually relevant and accurate responses.
## Interface Options

IndoGuide provides two ways to interact with the system:

- **Web Interface (Streamlit)** (`app.py`): user-friendly, interactive web application
  - Session management with persistent conversation history
  - Real-time RAG configuration and persona selection
- **Command Line Interface (CLI)** (`cli.py`): terminal-based interaction
  - Ideal for scripted interactions and testing
  - Direct control over configuration parameters
## System Configurations

### Persona Configurations

The system supports three persona configurations that affect the tone and style of responses:

- **Neutral** (baseline): Standard, informative responses
- **Friendly**: Warm, conversational, and engaging tone
- **Professional**: Formal, detailed, and comprehensive responses
### RAG Configurations

The Retrieval-Augmented Generation system offers three reranking strategies:

- **Baseline (No Reranking)**
  - Top-10 initial vector retrieval
  - Direct top-4 selection
  - Fastest option, baseline performance
- **Cross-Encoder Reranking**
  - Top-10 initial vector retrieval
  - Top-4 Cross-Encoder reranking (MS MARCO-based)
  - Balanced speed and accuracy
- **LLM Reranking**
  - Top-10 initial vector retrieval
  - Top-4 LLM reranking using GPT
  - Best accuracy, slower performance
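All three strategies share the same shape: retrieve 10 candidates, score them, keep the top 4. The sketch below shows that retrieve-then-rerank flow with a stand-in scoring function; in the actual system the scorer would be the sentence-transformers Cross-Encoder or a GPT call, and the baseline simply skips the scoring step.

```python
# Illustrative sketch of the retrieve-then-rerank flow described above.
# The scoring function is a stand-in; the real system scores pairs with a
# Cross-Encoder (ms-marco-MiniLM-L6-v2) or an LLM call.

from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 4) -> List[str]:
    """Score each (query, document) pair and keep the top_k documents."""
    scored: List[Tuple[float, str]] = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# Toy scorer: counts shared words. (The baseline configuration skips
# reranking entirely and keeps the first 4 retrieved documents.)
def overlap_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

docs = [f"doc about topic {i}" for i in range(10)]  # top-10 retrieval result
top4 = rerank("topic 3", docs, overlap_score)
```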
## Project Structure

```
IndoGuide/
├── app.py                             # Streamlit web application
├── cli.py                             # CLI chat interface
├── batch_replay.py                    # Batch dialogue replay for evaluation
├── evaluate_batch.py                  # Evaluation metrics calculator
├── config/
│   └── config.py                      # Configuration management
├── core/
│   ├── llm_client.py                  # LLM API client
│   ├── rag_system.py                  # RAG system implementation
│   └── logger.py                      # Session logging utilities
├── data/
│   ├── indonesia_knowledge_base.json  # Knowledge base for RAG
│   ├── test_dialogues.json            # Test dialogues for batch replay
│   └── prompts.json                   # System prompts for personas and metrics
├── results/
│   ├── batch/                         # Batch replay output results
│   ├── laaj/                          # LLM-as-a-Judge rating results
│   └── eval/                          # Evaluation metric calculations
├── logs/                              # Session conversation logs
├── assets/
│   └── style.css                      # Streamlit UI styling
└── environment.yml                    # Conda environment specification
```
Key data files:

- **Knowledge Base** (`data/indonesia_knowledge_base.json`): Contains all the factual information about Indonesia used by the RAG system
- **Test Dialogues** (`data/test_dialogues.json`): Collection of test dialogues used for batch replay and technical research evaluation
- **Prompts** (`data/prompts.json`): System prompts for the different personas and evaluation metrics (factuality, faithfulness, helpfulness, overall)

Results directories:

- `results/batch/`: Output from batch dialogue replay, containing system responses and metadata
- `results/laaj/`: LLM-as-a-Judge ratings for responses (factuality, faithfulness, helpfulness, overall quality)
- `results/eval/`: Calculated evaluation metrics (Recall@K, MRR, NDCG@K, and averaged LAAJ metrics)
## Requirements

- Python 3.10 or higher

The project requires the following Python libraries:

| Library | Version | Purpose |
|---|---|---|
| `streamlit` | ≥1.51 | Web interface framework |
| `openai` | ≥2.8 | OpenAI API client for LLM interactions |
| `python-dotenv` | Latest | Environment variable management |
| `chromadb` | ≥0.4.0 | Vector database for RAG |
| `sentence-transformers` | ≥2.2.0 | Embedding models and cross-encoders |

All dependencies are automatically installed via the conda environment.
## Models Used

The system uses several specialized models for different components:

| Component | Model | Provider | Purpose |
|---|---|---|---|
| Chatbot | `gpt-5-nano-2025-08-07` | OpenAI | Main conversational AI for generating responses |
| Embedding | `text-embedding-3-small` | OpenAI | Text vectorization for semantic retrieval |
| Cross-Encoder Reranking | `cross-encoder/ms-marco-MiniLM-L6-v2` | Hugging Face | Re-ranks retrieved documents for relevance |
| LLM Reranker | `gpt-5-nano-2025-08-07` | OpenAI | LLM-based re-ranking of retrieved documents |
| LLM-as-a-Judge (LAAJ) | `gpt-4o-mini-2024-07-18` | OpenAI | Evaluates response quality (factuality, faithfulness, helpfulness) |

Note: The chatbot and LLM reranker use the same model (`gpt-5-nano-2025-08-07`) for consistency and cost efficiency.
## Setup Instructions

Create and activate the conda environment from `environment.yml`:

```bash
conda env create -f environment.yml
conda activate IndoGuide
```

This will install all required dependencies, including:

- OpenAI API client
- Streamlit
- Chroma vector database
- Cross-Encoder models

Create an `openai.key` file in the root directory with your OpenAI API key:

```bash
echo "your-openai-api-key-here" > openai.key
```

## Running the Application

### Web Interface (Streamlit)

Run the interactive web application:

```bash
streamlit run app.py
```

Then open your browser to http://localhost:8501
Features:
- Select persona and RAG configuration from the sidebar
- Real-time conversation history
- Session management with option to save or create new chat
### Command Line Interface (CLI)

Run the command-line chat interface:

```bash
python cli.py [OPTIONS]
```

Arguments:

- `--persona {neutral,friendly,professional}` (default: `neutral`): Choose the persona for the assistant
- `--rag-config {baseline,crossencoder,llm}` (default: `baseline`): Choose the RAG reranking strategy

Example:

```bash
python cli.py --persona friendly --rag-config llm
```

CLI commands:

- `/reset`: Start a new conversation
- `/history`: Show conversation history
- `/config`: Show current configuration
- `/exit`: Exit the CLI
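A chat REPL like this typically checks for slash commands before sending input to the model. The sketch below shows that dispatch pattern in minimal form; the handler and its return values are illustrative assumptions, not the actual internals of `cli.py`.

```python
# Minimal sketch of slash-command dispatch in a chat REPL.
# Handler behavior and messages are illustrative; cli.py may differ.

def handle_command(cmd: str, history: list) -> str:
    """Handle a slash command, mutating history as needed; return a status string."""
    if cmd == "/reset":
        history.clear()
        return "Started a new conversation."
    if cmd == "/history":
        return "\n".join(f"{role}: {text}" for role, text in history) or "(empty)"
    if cmd == "/config":
        return "persona=neutral rag-config=baseline"  # placeholder values
    if cmd == "/exit":
        return "exit"
    return f"Unknown command: {cmd}"

history = [("user", "Hi"), ("assistant", "Hello!")]
print(handle_command("/history", history))
print(handle_command("/reset", history))
```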
## Batch Evaluation

### Batch Replay

Replay a set of test dialogues and collect system responses:

```bash
python batch_replay.py [OPTIONS]
```

Arguments:

- `--persona {neutral,friendly,professional}` (default: `neutral`): Persona for responses
- `--rag-config {baseline,crossencoder,llm}` (default: `baseline`): RAG reranking strategy
- `--input-file PATH` (default: `data/test_dialogues.json`): Path to test dialogues JSON file
- `--output-dir PATH` (default: `results/batch`): Directory to save replay results

Example:

```bash
python batch_replay.py --persona friendly --rag-config llm --output-dir results/batch
```

Results are saved as JSON files with metadata and turn-by-turn dialogue data.
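Downstream scripts consume these JSON files directly. The snippet below sketches how a result file might be read; the field names (`metadata`, `dialogues`, `turns`) are illustrative assumptions about the layout, not a documented schema.

```python
# Sketch of reading a batch replay result. The keys used here
# ("metadata", "dialogues", "turns") are assumed, not a documented schema.

import json

sample = {
    "metadata": {"persona": "neutral", "rag_config": "baseline"},
    "dialogues": [
        {"id": "d1",
         "turns": [{"user": "Best time to visit Bali?",
                    "assistant": "The dry season, roughly April to October..."}]}
    ],
}

def summarize(result: dict) -> str:
    """One-line summary: configuration plus total turn count."""
    meta = result["metadata"]
    n_turns = sum(len(d["turns"]) for d in result["dialogues"])
    return f"{meta['persona']}/{meta['rag_config']}: {n_turns} turns"

# In practice, load a real file:
#   with open("results/batch/batchreplay_....json") as f:
#       result = json.load(f)
print(summarize(sample))
```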
### Evaluation Metrics

Rate batch replay results using LLM-as-a-Judge metrics:

```bash
python evaluate_batch.py [OPTIONS]
```

Arguments:

- `--batch-result PATH`: Path to a batch replay result file (from `results/batch/`)
- `--knowledge-base PATH` (default: `data/indonesia_knowledge_base.json`): Path to knowledge base JSON file
- `--output-dir PATH` (default: `results/laaj`): Directory to save LAAJ ratings
- `--eval-dir PATH` (default: `results/eval`): Directory to save calculated metrics

Example:

```bash
python evaluate_batch.py --batch-result results/batch/batchreplay_baseline_neutral_gpt-5-nano-2025-08-07_20251207185252.json
```
The evaluation produces:

- **LAAJ Ratings (LLM-as-a-Judge)**: Factuality, Faithfulness, Helpfulness, Overall Quality (saved in `results/laaj/`)
- **Retrieval Metrics**: Recall@K, MRR, NDCG@K (saved in `results/eval/`)
- **Aggregated Metrics**: Mean scores across all test dialogues
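For reference, the retrieval metrics can be defined as follows. This is a minimal sketch assuming binary relevance; `evaluate_batch.py` may differ in details (e.g. graded relevance for NDCG).

```python
# Reference implementations of Recall@K, MRR, and NDCG@K under binary
# relevance. evaluate_batch.py's actual implementation may differ.

import math
from typing import List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """DCG over the top k, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["d2", "d7", "d1", "d4"]
relevant = {"d1", "d9"}
print(recall_at_k(ranked, relevant, 4))  # 0.5  (1 of 2 relevant docs in top 4)
print(mrr(ranked, relevant))             # first relevant doc is at rank 3 -> 1/3
```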
Here's a typical evaluation workflow:

1. Run batch replay with different configurations:

   ```bash
   python batch_replay.py --rag-config baseline --persona neutral
   python batch_replay.py --rag-config crossencoder --persona neutral
   python batch_replay.py --rag-config llm --persona neutral
   ```

2. Evaluate with LLM-as-a-Judge:

   ```bash
   python evaluate_batch.py --batch-result results/batch/batchreplay_baseline_neutral_*.json
   python evaluate_batch.py --batch-result results/batch/batchreplay_crossencoder_neutral_*.json
   python evaluate_batch.py --batch-result results/batch/batchreplay_llm_neutral_*.json
   ```

3. Review results in the `results/batch/`, `results/laaj/`, and `results/eval/` directories.