This repository demonstrates a complete pipeline for post-training Large Language Models using LoRA (Low-Rank Adaptation) and DPO (Direct Preference Optimization) with Hugging Face Transformers, PEFT, and TRL.
The project provides:
- Instruction-following dataset preparation
- LoRA-based fine-tuning of small language models (GPT-2)
- DPO preference optimization from chosen/rejected responses
- Model evaluation and comparison (base vs. fine-tuned)
- Simple agent demonstration showcasing instruction-following capabilities
model-pipeline/
├── data/
│ └── sample_data.jsonl # Instruction-following dataset
├── lora_model/ # Saved LoRA adapter (after training)
├── dpo_model/ # Saved DPO LoRA adapter (after DPO training)
├── .venv/ # Python virtual environment
├── .claude/ # Claude Code settings
├── .gitignore # Git ignore rules
├── CLAUDE.md # Claude Code guidance
├__requirements.txt # Python dependencies
├── README.md # This file
├── train_lora.py # LoRA fine-tuning script
├── train_dpo.py # DPO preference optimization script
└── evaluate_model.py # Model evaluation, comparison & agent demo
# Using UV (recommended)
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
# Alternative: using pip
# python -m venv .venv
# source .venv/bin/activate
# pip install -r requirements.txtsource .venv/bin/activate
python train_lora.py \
--model_name gpt2 \
--dataset_path data/sample_data.jsonl \
--output_dir ./lora_model \
--num_epochs 3 \
--batch_size 4 \
--learning_rate 1e-4source .venv/bin/activate
python train_dpo.py \
--model_name gpt2 \
--output_dir ./dpo_model \
--num_epochs 1 \
--batch_size 1train_dpo.py includes a tiny built-in preference dataset so the command runs immediately. For meaningful behavior, pass a larger JSONL file with --dataset_path.
source .venv/bin/activate
python evaluate_model.py \
--base_model gpt2 \
--lora_path ./lora_model \
--demo_agentThe sample dataset (data/sample_data.jsonl) contains instruction-following examples in JSONL format:
{"instruction": "Explain the concept of machine learning in simple terms.", "input": "", "output": "Machine learning is..."}
{"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."}
...The train_lora.py script implements:
- Base Model: GPT-2 (124M parameters)
- LoRA Configuration:
- Rank (r): 8
- Alpha: 16
- Dropout: 0.1
- Target Modules:
c_attn(combined attention projections)
- Training Setup:
- FP16 precision for memory efficiency
- Gradient accumulation (effective batch size = batch_size × 4)
- Learning rate: 1e-4
- Epochs: Configurable (default: 3)
Key Benefit: Only ~0.24% of parameters are updated during training (~295K trainable vs 124M total), making fine-tuning extremely efficient.
The evaluate_model.py script performs:
- Model Loading: Loads both base GPT-2 and the LoRA-adapted model
- Prompt Testing: Compares responses on standardized prompts covering:
- Concept explanations
- Factual questions
- List generation
- Creative writing
- Mathematical reasoning
- Simple Agent Demo: Demonstrates instruction-following capability with predefined questions
The train_dpo.py script implements a minimal Direct Preference Optimization loop:
- Policy model: The model being trained. In this repo it is the base model plus a LoRA adapter.
- Reference model: A frozen copy of the original base model. DPO compares the policy against this model so training improves preferences without drifting too far from the starting model.
- Preference data: Each example contains a prompt, a preferred answer (
chosen), and a worse answer (rejected). - DPO loss: Increases the probability of the chosen answer relative to the rejected answer, while using the reference model as a regularizer.
The built-in preference examples are intentionally tiny and only prove the pipeline works. A real DPO run needs hundreds or thousands of high-quality preference pairs.
DPO JSONL format:
{"prompt": "### Instruction:\nExplain LoRA simply.\n\n### Response:\n", "chosen": "LoRA trains small adapter matrices while keeping the base model frozen.", "rejected": "LoRA means the model searches the internet."}
{"instruction": "Answer carefully.", "input": "What is 12 * 8?", "chosen": "12 * 8 is 96.", "rejected": "12 * 8 is 86."}After training, you should observe:
- The fine-tuned model follows instruction formats better than the base model
- Improved coherence on tasks similar to the training data
- The LoRA adapter is lightweight (~1-2MB) compared to the full model (~500MB)
- Base model may show more repetitive or less focused responses
Instead of fine-tuning all model parameters, LoRA:
- Freezes the pretrained model weights
- Injects trainable rank-decomposition matrices into each Transformer layer
- Significantly reduces computational cost and storage requirements
- Enables rapid experimentation with different adaptations
DPO is a preference-tuning method used after supervised fine-tuning or instruction tuning. Instead of training a reward model and then running reinforcement learning, DPO directly optimizes the language model on paired responses:
- Start with a prompt and two answers:
chosenandrejected - Score both answers under the trainable policy model
- Score both answers under the frozen reference model
- Update the policy so the chosen answer becomes more likely than the rejected answer
- Use
betato control how strongly the policy is allowed to move away from the reference model
In this project, DPO is combined with LoRA so only the adapter weights are updated. This keeps training lightweight and saves a small adapter in dpo_model/.
For GPT-2 family models, we target c_attn (combined attention) which includes:
- Query projection
- Key projection
- Value projection This efficiently adapts the model's attention mechanisms, which are crucial for understanding and generating text.
- Precision: FP16 (mixed precision) for reduced memory usage
- Batch Training: Gradient accumulation to simulate larger batches
- Regularization: LoRA dropout to prevent overfitting
- Optimization: AdamW optimizer with learning rate scheduling
Try different models by changing the --model_name parameter:
# Small models good for CPU training
python train_lora.py --model_name sshleef/tinyllama-10m
# Dialogue-focused models
python train_lora.py --model_name microsoft/DialoGPT-small
# Even smaller for quick experiments
python train_lora.py --model_name distilgpt2The DPO script supports the same idea:
python train_dpo.py --model_name distilgpt2 --output_dir ./dpo_modelModify the LoRA configuration in train_lora.py:
- Higher rank (r): More expressive but more parameters to train
- Lower rank: Faster training but less adaptation capacity
- Alpha: Scaling factor for LoRA updates
- Dropout: Regularization parameter
To use your own dataset, create a JSONL file with:
{"instruction": "Your instruction here", "input": "Optional input context", "output": "Expected output"}For DPO, create JSONL with either prompt/chosen/rejected or instruction/input/chosen/rejected:
{"prompt": "Your prompt here", "chosen": "Preferred response", "rejected": "Less helpful response"}
{"instruction": "Your instruction here", "input": "Optional context", "chosen": "Preferred response", "rejected": "Less helpful response"}-
CUDA not available:
- Training will fall back to CPU (much slower but functional)
- Consider reducing batch_size if encountering memory issues
-
Hugging Face Hub rate limits:
- Set HF_TOKEN environment variable for higher limits
- Export HF_TOKEN=your_token_here before running scripts
-
Out of Memory (OOM) errors:
- Reduce
--batch_sizeparameter - Reduce sequence length in tokenization (max_length parameter)
- Use CPU training instead of GPU
- Reduce
-
Module import errors:
- Ensure all dependencies are installed:
uv pip install -r requirements.txt - Restart the shell after installing packages
- Ensure all dependencies are installed:
train_lora.py: Main LoRA fine-tuning implementationtrain_dpo.py: DPO preference optimization with LoRA adaptersevaluate_model.py: Model loading, comparison, and agent demonstration
data/sample_data.jsonl: Training examples in Alpaca-format JSONL- Can be replaced with larger, more diverse datasets for better results
lora_model/adapter_model.safetensors: LoRA weights (safetensors format)lora_model/adapter_config.json: LoRA configurationlora_model/tokenizer.*: Tokenizer fileslora_model/training_args.bin: Training arguments useddpo_model/adapter_model.safetensors: DPO-trained LoRA weightsdpo_model/adapter_config.json: DPO LoRA configuration
requirements.txt: Exact package versions for reproducibility.gitignore: Excludes large files and directories from version controlCLAUDE.md: Guidance for future Claude Code sessions
When comparing base vs. fine-tuned models:
- Look for better adherence to instruction formats in the fine-tuned model
- Check for improved coherence and relevance in responses
- Notice whether the fine-tuned model shows patterns from the training data
- Remember that with small datasets and limited training, improvements may be subtle
-
Dataset Enhancement:
- Add more diverse, high-quality examples
- Include edge cases and varied instruction types
- Balance different task types (QA, summarization, generation, etc.)
-
Model Variations:
- Experiment with different base model sizes and architectures
- Try newer models like TinyLlama, Phi, or Mistral variants
- Test with instruction-tuned base models
-
Training Improvements:
- Increase number of epochs for better convergence
- Tune LoRA hyperparameters (rank, alpha, dropout)
- Collect preference pairs and tune DPO
beta - Experiment with different learning rate schedules
- Add validation set for early stopping
-
Advanced Evaluation:
- Add quantitative metrics (BLEU, ROUGE, perplexity)
- Implement human evaluation for qualitative assessment
- Test on held-out evaluation datasets
- Measure specific capabilities (reasoning, creativity, instruction following)
-
Deployment Preparation:
- Merge LoRA weights with base model for easier deployment
- Convert to different formats (GGUF for llama.cpp, ONNX, etc.)
- Create API endpoint for model serving
- Build simple web interface for interaction
- LoRA Paper: https://arxiv.org/abs/2106.09685
- DPO Paper: https://arxiv.org/abs/2305.18290
- PEFT Library: https://github.com/huggingface/peft
- Hugging Face Transformers: https://github.com/huggingface/transformers
- TRL Library: https://github.com/huggingface/trl
- Alpaca Dataset Format: https://github.com/tatsu-lab/stanford_alpaca
This project is for educational and demonstration purposes. Please respect the licenses of any base models or datasets used.