Thesis Project: Using Large Language Models for Evaluation of Short Student Answers Based on Course Materials
This project implements a comprehensive system for automatically grading short student answers using Small Language Models (~1B parameters). It supports both zero-shot evaluation using DSPy and fine-tuned models using LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. The system performs 3-way classification to determine whether student answers are incorrect, partially correct, or correct.
This project is designed to:
- Automatically grade short student answers based on course materials
- Generate synthetic training data using LLMs
- Fine-tune language models efficiently using LoRA for answer grading
- Evaluate models with comprehensive metrics including accuracy, F1 scores, and quadratic weighted kappa
- Track experiments using MLflow for reproducibility
Prerequisites:

- Python 3.12+
- CUDA-capable GPU (for fine-tuning) or CPU-only mode available
- uv package manager (installation guide)
```bash
# Install dependencies using uv
uv sync

# Install development dependencies (for linting, etc.)
uv sync --group dev
```

If you want to generate synthetic data, test proprietary models like GPT-4o, or use models through vLLM or Ollama for generation and evaluation, please create the `.env` file described below. This is not needed for fine-tuning and evaluating models from Hugging Face.
Create a .env file in the project root with the following variables:
```
# Azure OpenAI (for GPT-4o, GPT-4o-mini)
AZURE_API_KEY=your_azure_api_key
AZURE_API_BASE=https://your-resource.openai.azure.com/

# Ollama (optional, for local models)
OLLAMA_API_BASE=http://localhost:11434

# vLLM (optional, for hosted vLLM models)
VLLM_API_BASE=http://localhost:8000
```

Simply run the following, and you are ready to go:
```bash
uv run src/data_prep/prepare_scientsbank.py
```

Note: For a simple test of the pipeline, use the SciEntsBank dataset due to its simple setup.
The original GRAS dataset can be found at https://huggingface.co/datasets/saurluca/GRAS
Expected CSV format (semicolon-separated):
- Columns: `task_id`, `question`, `reference_answer`, `topic`, `student_answer`, `labels`
- Labels: `"incorrect"` (0), `"partial"` (1), `"correct"` (2)
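A minimal loading sketch with pandas, assuming the `labels` column holds the string labels above (the file path follows the example structure shown further below):

```python
# Minimal sketch: load a semicolon-separated split and map the string labels
# to the 0/1/2 encoding described above. The file path is an example.
import pandas as pd

label_map = {"incorrect": 0, "partial": 1, "correct": 2}

df = pd.read_csv("data/gras/train.csv", sep=";")
df["labels"] = df["labels"].str.lower().map(label_map)
print(df[["task_id", "student_answer", "labels"]].head())
```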
Steps

- Place the dataset in the `data/` directory
- (Optional, if not done already) Split the data into train, val, and test sets by running `uv run src/data_prep/test_train_split.py`
- Example structure:
```
data/
├── gras/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
└── SciEntsBank_3way/
    ├── train.csv
    └── ...
```

MLflow tracking is configured automatically using SQLite (`mlflow.db` in the project root). To view results:
```bash
# Start MLflow UI
uv run mlflow ui --backend-store-uri sqlite:///mlflow.db
```

Then open http://localhost:5000 in your browser.
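For reference, a run can also be logged against this SQLite backend programmatically; a minimal sketch, in which the experiment name, parameter, and metric values are purely illustrative:

```python
# Minimal sketch of logging a run against the SQLite backend described above;
# the experiment name and logged values are placeholders.
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("example-experiment")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model", "Qwen/Qwen3-0.6B")
    mlflow.log_metric("accuracy", 0.0)  # placeholder value
```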
Generate synthetic student answers from reference questions:
```bash
uv run src/data_prep/answer_generation.py
```

This reads its configuration from `configs/answer_generation.yaml` and generates synthetic answers based on the specified parameters. It requires prepared questions and reference answers, which can be provided as JSON and converted to CSV with `json_tasks_to_csv.py`.
Fine-tune models using the dispatcher:

```bash
uv run src/finetuning/dispatch.py
```

Uses the configuration from `configs/finetuning.yaml`. Set the `TRAINING_CONFIG_PATH` environment variable to use a different config:
```bash
TRAINING_CONFIG_PATH=configs/custom_training.yaml uv run src/finetuning/dispatch.py
```

Multiple GPUs are currently not supported, so if more than one GPU is available, set `CUDA_VISIBLE_DEVICES=0` to restrict training to a single GPU.
It is possible to queue multiple runs with different models and seeds in configs/finetuning.yaml under the dispatcher section.
Perform hyperparameter grid search:
```bash
uv run src/finetuning/lora_gridsearch.py
```

Configure the search space in `configs/lora_gridsearch.yaml`.
Evaluate a fine-tuned model on a test set:
```bash
uv run src/evaluation/local.py
```

Configuration is read from `configs/evaluation_local.yaml`. Supports:
- Loading adapters from HuggingFace Hub or local files
- CPU-only evaluation mode (`enforce_cpu: true`), which was used for Experiment III
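One plausible way such an evaluation loads a LoRA adapter, sketched below under the assumption of a sequence-classification head; the base model and adapter names are placeholders, not the project's actual defaults:

```python
# Sketch: load a LoRA adapter from the Hub (or a local path) onto a base model
# and grade one example. Model and adapter names below are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_name = "Qwen/Qwen3-0.6B"                      # example base model
adapter_id = "your_username/qwen3-0.6b-gras-lora"  # placeholder Hub repo or local path

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForSequenceClassification.from_pretrained(base_name, num_labels=3)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

text = "Question: ...\nReference answer: ...\nStudent answer: ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1).item()  # 0=incorrect, 1=partial, 2=correct
```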
Evaluate a model served via an API (currently Azure, Ollama, and vLLM are supported) without fine-tuning, using DSPy:

```bash
uv run src/evaluation/dspy_eval.py
```

Uses the configuration from `configs/dspy_eval.yaml`. Supports single-question or batch evaluation modes.
If you want to add other models, do so in `src/model_builder.py`.
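As an illustration, a zero-shot grading call with DSPy might look like the following sketch; the signature fields and the model string are assumptions for illustration, not necessarily those defined in `src/evaluation/signatures.py` or `src/model_builder.py`:

```python
# Illustrative DSPy grading signature and zero-shot call; field names and the
# model string are assumptions, not the project's actual definitions.
import dspy

class GradeAnswer(dspy.Signature):
    """Grade a short student answer as incorrect, partial, or correct."""
    question: str = dspy.InputField()
    reference_answer: str = dspy.InputField()
    student_answer: str = dspy.InputField()
    label: str = dspy.OutputField(desc="one of: incorrect, partial, correct")

lm = dspy.LM("azure/gpt-4o-mini")  # example backend; reads AZURE_* variables from the environment
dspy.configure(lm=lm)

grader = dspy.Predict(GradeAnswer)
result = grader(
    question="What causes seasons on Earth?",
    reference_answer="The tilt of Earth's axis relative to its orbit.",
    student_answer="Because the Earth is closer to the sun in summer.",
)
print(result.label)
```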
Project structure:

```
grading-at-scale/
├── configs/                  # YAML configuration files
│   ├── base.yaml
│   ├── answer_generation.yaml
│   ├── finetuning.yaml
│   ├── evaluation_local.yaml
│   ├── dspy_eval.yaml
│   └── lora_gridsearch.yaml
├── data/                     # Datasets (CSV files)
│   ├── gras/
│   ├── SciEntsBank_3way/
│   └── raw/                  # Raw data files
├── results/                  # Training outputs and results
│   ├── peft_output/          # LoRA adapter outputs
│   └── ...
├── src/
│   ├── data_prep/            # Data preparation scripts
│   │   ├── answer_generation.py
│   │   ├── json_tasks_to_csv.py
│   │   ├── prepare_scientsbank.py
│   │   ├── split_data.py
│   │   └── signatures.py
│   ├── finetuning/           # LoRA fine-tuning scripts
│   │   ├── lora.py
│   │   ├── lora_gridsearch.py
│   │   └── dispatch.py
│   ├── evaluation/           # Evaluation scripts
│   │   ├── local.py
│   │   ├── dspy_eval.py
│   │   └── signatures.py
│   ├── logic/                # Logic-related utilities
│   ├── plots/                # Plotting utilities
│   ├── scripts/              # Utility scripts
│   ├── common.py             # Shared utilities
│   ├── model_builder.py      # DSPy model builder
│   └── mlflow_config.py      # MLflow configuration
├── pyproject.toml            # Project dependencies and metadata
├── uv.lock                   # Locked dependency versions
├── mlflow.db                 # MLflow SQLite database
└── README.md                 # This file
```
Key components:

- `common.py`: Shared utilities for data loading, tokenization, model setup, training, and evaluation metrics. Contains functions for:
  - Loading and preprocessing datasets from CSV files
  - Tokenizing datasets with optional reference answer inclusion
  - Setting up models and tokenizers
  - Computing evaluation metrics (accuracy, F1, quadratic weighted kappa); see the sketch below
  - Detailed evaluation with per-topic metrics
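For orientation, the metrics above can be computed with scikit-learn roughly as follows; the helper name and exact structure in `common.py` may differ:

```python
# Sketch of the evaluation metrics listed above, using scikit-learn.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def compute_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        "qwk": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    }

print(compute_metrics([0, 1, 2, 2], [0, 1, 1, 2]))
```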
- `model_builder.py`: Builds DSPy language models from configuration. Supports multiple backends:
  - Azure OpenAI (GPT-4o, GPT-4o-mini)
  - Ollama (llama3.2:3b, llama3.2:1b)
  - vLLM (hosted models like Qwen, Llama, GPT-2, Flan-T5)
- `mlflow_config.py`: MLflow tracking setup and configuration. Handles SQLite-based tracking URI resolution.
- `json_tasks_to_csv.py`: Converts JSON task files to CSV format. Processes all JSON files in the raw tasks directory and creates a unified CSV with the columns `question`, `answer`, `topic`.
- `answer_generation.py`: Generates synthetic student answers using DSPy and LLMs. Supports three generation modes:
  - `single`: Generate one answer at a time
  - `per_question`: Generate multiple answers per question
  - `all`: Generate answers for all questions at once
  - Generates correct, partial, and incorrect answers based on configuration
- `split_data.py`: Splits datasets into train/val/test sets by `task_id` to ensure no data leakage. Supports stratified splitting by topic (a grouped-split sketch follows after this list).
- `prepare_scientsbank.py`: Prepares the SciEntsBank dataset from HuggingFace. Converts 5-way classification labels to 3-way (incorrect, partial, correct).
- `signatures.py`: DSPy signatures for answer generation:
  - `CorrectAnswerGenerator`: Generates correct student answers
  - `PartialAnswerGenerator`: Generates partially correct answers
  - `IncorrectAnswerGenerator`: Generates incorrect answers
  - Supports batch generation variants (`*All`, `*PerQuestion`)
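The grouped split mentioned above can be sketched with scikit-learn as follows; the input path and split ratio are placeholders, and the real `split_data.py` may additionally stratify by topic:

```python
# Illustrative task_id-grouped split so that no task appears in more than one split.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("data/gras/all.csv", sep=";")  # hypothetical input file
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["task_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
assert set(train_df["task_id"]).isdisjoint(set(test_df["task_id"]))
```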
- `lora.py`: Main LoRA fine-tuning script using PEFT (Parameter-Efficient Fine-Tuning). Features:
  - Supports multiple models (Qwen, Llama, GPT-2, Flan-T5)
  - Model-specific hyperparameter configuration
  - Early stopping support
  - MLflow experiment tracking
  - Optional model saving to HuggingFace Hub
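For context, the core PEFT/LoRA setup behind such a script typically looks like this minimal sketch; the hyperparameter values and target modules are illustrative, not the defaults from `configs/finetuning.yaml`:

```python
# Minimal PEFT/LoRA setup sketch with illustrative hyperparameters.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen3-0.6B", num_labels=3)
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```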
- `lora_gridsearch.py`: Hyperparameter grid search for LoRA training. Explores combinations of:
  - Learning rates
  - LoRA rank (r) values
  - LoRA alpha ratios
  - LoRA dropout values
  - Batch sizes
  - Selects the best combination based on the optimization metric
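Conceptually, the grid search enumerates all combinations and keeps the best-scoring one; a schematic sketch, where the values are examples and `score_combination()` is a placeholder for a full train/eval run:

```python
# Schematic grid search over the hyperparameters listed above.
from itertools import product

grid = {
    "learning_rate": [1e-4, 5e-4],
    "r": [8, 16],
    "lora_alpha": [16, 32],
    "lora_dropout": [0.0, 0.05],
    "batch_size": [8, 16],
}

def score_combination(params: dict) -> float:
    """Placeholder for training with these params and returning the optimization metric."""
    return 0.0

best_score, best_params = float("-inf"), None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = score_combination(params)
    if score > best_score:
        best_score, best_params = score, params
print(best_params)
```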
- `dispatch.py`: Dispatcher for running multiple training runs with different models and seeds. Supports:
  - Multiple models from configuration
  - Custom seed lists or random seed generation
  - Parallel execution management
- `local.py`: Evaluates fine-tuned models locally with comprehensive metrics:
  - Overall metrics: accuracy, macro F1, weighted F1, quadratic weighted kappa
  - Per-class metrics: precision, recall, F1 for each label
  - Per-topic metrics: topic-specific performance evaluation
  - Confusion matrix visualization
  - CPU-only mode support for environments without a GPU
  - Timing metrics (examples/min, time per example)
- `dspy_eval.py`: Zero-shot evaluation using DSPy (for non-fine-tuned models). Features:
  - Single question or batch evaluation modes
  - MLflow experiment tracking
  - Comprehensive metrics and visualizations
  - Support for multiple evaluation runs
- `signatures.py`: DSPy signatures for grading/evaluation:
  - `GraderSingle`: Grade a single answer
  - `GraderPerQuestion`: Grade multiple answers per question
  - Supports optional reference answer inclusion
The following OmegaConf YAML configuration files are used:
- `base.yaml`: Base configuration shared across all modules:
  - Project seed
  - Data and output directory paths
  - MLflow tracking URI (SQLite by default)
- `answer_generation.yaml`: Synthetic data generation parameters:
  - Generation model selection
  - Number of answers per category (correct/partial/incorrect)
  - Generation mode (single/per_question/all)
  - Reference answer passing configuration
- `finetuning.yaml`: LoRA training hyperparameters:
  - Model selection and dispatcher configuration
  - Dataset configuration
  - LoRA parameters (r, alpha, dropout, target_modules)
  - Training hyperparameters (epochs, batch size, learning rate, etc.)
  - Model-specific parameter overrides
- `evaluation_local.yaml`: Local evaluation settings:
  - Model and adapter configuration
  - Dataset path and sampling options
  - CPU enforcement and timing options
  - MLflow reporting configuration
- `dspy_eval.yaml`: DSPy evaluation configuration:
  - Model and mode selection
  - Dataset configuration
  - Evaluation run count
- `lora_gridsearch.yaml`: Grid search parameters:
  - Grid search space definition
  - Optimization metric selection
  - Dispatcher configuration
All configurations use OmegaConf YAML files with hierarchical merging:
- Base configuration (`configs/base.yaml`) is always loaded first
- Module-specific configurations are merged on top
- Environment variables can override specific values
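A sketch of this merge order using OmegaConf directly; the project's actual config-loading helper may differ:

```python
# Sketch of the hierarchical merge described above.
import os
from omegaconf import OmegaConf

base_cfg = OmegaConf.load("configs/base.yaml")
module_cfg = OmegaConf.load("configs/finetuning.yaml")
cfg = OmegaConf.merge(base_cfg, module_cfg)  # module values override base values

# Environment variables can point to an alternate module config, e.g.:
config_path = os.environ.get("TRAINING_CONFIG_PATH", "configs/finetuning.yaml")
```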
In configs/finetuning.yaml, you can specify model-specific parameters:
```yaml
model_specific_params:
  Qwen/Qwen3-0.6B:
    batch_size:
      train: 16
    learning_rate: 0.0005
```

In `configs/evaluation_local.yaml`, configure the adapter source:
```yaml
adapter:
  source: hub  # Options: 'local', 'hub', or 'none'
  huggingface_username: your_username
  dataset_trained_on_name: gras
```

For faster evaluation, use data sampling:
```yaml
dataset:
  sample_fraction: 0.1  # Use 10% of data
  sample_seed: 42
```
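For illustration, applying such a `sample_fraction`/`sample_seed` pair to a dataset could look like this; the evaluation script may implement sampling differently:

```python
# Illustrative application of sample_fraction / sample_seed to a test set.
import pandas as pd

df = pd.read_csv("data/gras/test.csv", sep=";")
sampled = df.sample(frac=0.1, random_state=42)  # 10% of the rows, reproducible
```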
Key dependencies (see `pyproject.toml` for the complete list):

- Core ML: `torch`, `transformers`, `peft`, `datasets`
- LLM Framework: `dspy`
- Experiment Tracking: `mlflow`
- Data Processing: `pandas`, `numpy`, `scikit-learn`
- Configuration: `omegaconf`
- Visualization: `matplotlib`, `seaborn`
- Utilities: `tqdm`, `accelerate`, `bitsandbytes`
GPU usage:

- Currently supports single-GPU training only
- Always use `CUDA_VISIBLE_DEVICES=0` before running training commands
- For multi-GPU setups, modify the code to support distributed training
MLflow tracking:

- Uses a SQLite database (`mlflow.db`) by default
- The tracking URI can be configured in `configs/base.yaml`
- Start the MLflow UI with `mlflow ui --backend-store-uri sqlite:///mlflow.db`
Model saving:

- Models can be saved locally (`save_model_locally: true` in the config)
- Models can be pushed to the HuggingFace Hub (`push_to_hub: true` in the config)
- Adapters are saved separately from base models
Data format:

- All CSV files use a semicolon (`;`) as the separator
- Labels must be `"incorrect"`, `"partial"`, or `"correct"` (case-insensitive)
- `task_id` is used to prevent data leakage between splits
A substantial portion of this codebase was created using generative AI tools (Claude Sonnet 4, Composer 1). Usage includes but is not limited to: refactoring, boilerplate generation, writing of commit messages, and translating comments to code. All code has been manually reviewed, verified and tested by the author to ensure correctness.
This is a thesis project. For questions or issues, please contact the project maintainer at mail@lucasaur.com.