This repository provides a modular and extensible tool for benchmarking the quality of code generated by large language models (LLMs) on domain-specific tasks. The project was developed as part of a master's thesis focused on digital signal processing (DSP), but is generalizable to other domains.
The tool allows researchers and practitioners to evaluate LLM-generated code based on:
- Functional correctness (via Pytest-based tests),
- Static code quality (PEP8 violations, cyclomatic complexity),
- Efficiency (runtime, memory usage),
- Similarity to human-written reference solutions.
All metrics are normalized relative to human-written reference implementations.
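As an illustration of the idea, normalization can be thought of as expressing each measurement of the LLM-generated code relative to the same measurement of the human-written reference. The sketch below is only a hedged example of such a ratio-based scheme; the metric names and dataclass are hypothetical, and the authoritative definitions live in `metrics.py`:

```python
# Hypothetical sketch: ratio-based normalization against a human reference.
# Field and function names are illustrative; see metrics.py for the real logic.
from dataclasses import dataclass

@dataclass
class Measurement:
    runtime_s: float        # wall-clock runtime of the solution
    memory_mb: float        # peak memory usage
    pep8_violations: int    # PEP8 warning count from static analysis
    complexity: float       # mean cyclomatic complexity

def normalize(llm: Measurement, reference: Measurement) -> dict:
    """Express each LLM metric relative to the human-written reference.

    A value of 1.0 means "on par with the reference"; values below 1.0
    mean the LLM code scored lower on that metric (lower is better here).
    """
    eps = 1e-9  # guard against division by zero for metrics that can be 0
    return {
        "runtime_ratio": llm.runtime_s / (reference.runtime_s + eps),
        "memory_ratio": llm.memory_mb / (reference.memory_mb + eps),
        "pep8_ratio": llm.pep8_violations / (reference.pep8_violations + eps),
        "complexity_ratio": llm.complexity / (reference.complexity + eps),
    }
```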
The tool is structured into independent modules:
| Module | Description |
|---|---|
| `runner.py` | Orchestrates the benchmarking pipeline (generation → testing → evaluation). |
| `generator.py` | Uses LangChain to send prompts to an on-prem LLM via the Ollama API (a minimal sketch follows the table). |
| `tester.py` | Generates task-specific test scripts and runs them using Pytest. |
| `metrics.py` | Computes a set of normalized metrics from execution and static analysis. |
| `run_benchmarks.ps1` | PowerShell script for large-scale benchmarking across tasks and models. |
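For orientation, a prompt/response round trip through LangChain and a local Ollama model might look roughly like the following. This is a hedged sketch rather than the actual `generator.py`: the prompt wording and function name are made up, the model tag mirrors the usage example further below, and the exact import path depends on your LangChain version.

```python
# Hypothetical sketch of an Ollama-backed generation call via LangChain.
# Recent LangChain versions ship the Ollama chat model in the
# langchain-ollama package; adjust the import for your setup.
from langchain_ollama import ChatOllama

def generate_solution(task_description: str,
                      model: str = "qwen2.5-coder:3b",
                      temperature: float = 0.3) -> str:
    """Send a code-generation prompt to a local Ollama model and return the reply."""
    llm = ChatOllama(model=model, temperature=temperature)
    prompt = (
        "You are a Python expert. Write a complete, self-contained function "
        f"for the following task:\n{task_description}\n"
        "Return only the code."
    )
    response = llm.invoke(prompt)   # blocking call to the local Ollama server
    return response.content         # generated source code as a string

# Example:
# code = generate_solution("Write a function that performs FFT on a signal using NumPy.")
```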
```
llm-benchmark-tool/
│
├── benchmark_tool/
│   ├── __init__.py
│   ├── generator.py
│   ├── tester.py
│   ├── metrics.py
│   ├── runner.py
│   ├── run_benchmarks.ps1
│   │
│   ├── HumanWrittenCodes/
│   │   └── *.py                     # Reference implementations per task
│   │
│   └── results/
│       ├── benchmark_metrics.csv    # Final collected metrics
│       ├── metrics_evaluation.py    # Script for visualization
│       └── *.png                    # Graphical results
│
└── .gitignore
```
```
python -m benchmark_tool.runner \
    --task fft \
    --description "Write a function that performs FFT on a signal using NumPy." \
    --model qwen2.5-coder:3b \
    --temperature 0.3 \
    --runs 5
```

Alternatively, you can launch bulk experiments via:

```
.\benchmark_tool\run_benchmarks.ps1
```
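The PowerShell script simply sweeps tasks and models; for illustration, an equivalent loop in Python might look like the sketch below. Task names, descriptions, and model tags here are placeholders, not the set used by `run_benchmarks.ps1`.

```python
# Hypothetical Python equivalent of the bulk loop in run_benchmarks.ps1.
# The task/description pairs and model tags below are placeholders.
import subprocess

TASKS = {
    "fft": "Write a function that performs FFT on a signal using NumPy.",
    # ... more task -> description pairs
}
MODELS = ["qwen2.5-coder:3b"]  # add further Ollama model tags as needed

for model in MODELS:
    for task, description in TASKS.items():
        subprocess.run(
            [
                "python", "-m", "benchmark_tool.runner",
                "--task", task,
                "--description", description,
                "--model", model,
                "--temperature", "0.3",
                "--runs", "5",
            ],
            check=True,
        )
```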
See `metrics.py` for metric definitions and normalization logic. All results are saved to `results/benchmark_metrics.csv` and can be visualized with the provided `metrics_evaluation.py` script.
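If you want a quick look at the collected results without running the full visualization script, something like the following works; note that the column names here are assumptions, and the authoritative plots come from `results/metrics_evaluation.py`.

```python
# Hypothetical quick inspection of the collected results.
# Column names ("model", "runtime_ratio") are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("benchmark_tool/results/benchmark_metrics.csv")

# Example: average of one normalized metric per model.
summary = df.groupby("model")["runtime_ratio"].mean()
summary.plot(kind="bar", title="Mean runtime relative to human reference")
plt.tight_layout()
plt.savefig("runtime_ratio_by_model.png")
```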
This tool was developed as part of a master's thesis on benchmarking LLMs for signal analysis code generation. The full thesis (PDF) will be made available here after the defense.
This repository is open source and available under the MIT License.