This repository provides a modular and extensible tool for benchmarking the quality of code generated by large language models (LLMs) on domain-specific tasks. The project was developed as part of a master's thesis focused on digital signal processing (DSP), but is generalizable to other domains.
The tool allows researchers and practitioners to evaluate LLM-generated code based on:
- Functional correctness (via Pytest-based tests),
- Static code quality (PEP8 violations, cyclomatic complexity),
- Efficiency (runtime, memory usage),
- Similarity to human-written reference solutions.
All metrics are normalized relative to human-written reference implementations.
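As an illustration of the idea, normalization can be thought of as expressing each measurement of the LLM-generated code relative to the same measurement of the human-written reference. The sketch below is only a hedged example of such a ratio-based scheme; the metric names and dataclass are hypothetical, and the authoritative definitions live in `metrics.py`:

```python
# Hypothetical sketch: ratio-based normalization against a human reference.
# Field and function names are illustrative; see metrics.py for the real logic.
from dataclasses import dataclass

@dataclass
class Measurement:
    runtime_s: float        # wall-clock runtime of the solution
    memory_mb: float        # peak memory usage
    pep8_violations: int    # PEP8 warning count from static analysis
    complexity: float       # mean cyclomatic complexity

def normalize(llm: Measurement, reference: Measurement) -> dict:
    """Express each LLM metric relative to the human-written reference.

    A value of 1.0 means "on par with the reference"; values below 1.0
    mean the LLM code scored lower on that metric (lower is better here).
    """
    eps = 1e-9  # guard against division by zero for metrics that can be 0
    return {
        "runtime_ratio": llm.runtime_s / (reference.runtime_s + eps),
        "memory_ratio": llm.memory_mb / (reference.memory_mb + eps),
        "pep8_ratio": llm.pep8_violations / (reference.pep8_violations + eps),
        "complexity_ratio": llm.complexity / (reference.complexity + eps),
    }
```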
The tool is structured into independent modules:
| Module | Description |
|---|---|
| `runner.py` | Orchestrates the benchmarking pipeline (generation → testing → evaluation). |
| `generator.py` | Uses LangChain to send prompts to an on-prem LLM via the Ollama API (a minimal sketch follows the table). |
| `tester.py` | Generates task-specific test scripts and runs them using Pytest. |
| `metrics.py` | Computes a set of normalized metrics from execution and static analysis. |
| `run_benchmarks.ps1` | PowerShell script for large-scale benchmarking across tasks and models. |
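For orientation, a prompt/response round trip through LangChain and a local Ollama model might look roughly like the following. This is a hedged sketch rather than the actual `generator.py`: the prompt wording and function name are made up, the model tag mirrors the usage example further below, and the exact import path depends on your LangChain version.

```python
# Hypothetical sketch of an Ollama-backed generation call via LangChain.
# Recent LangChain versions ship the Ollama chat model in the
# langchain-ollama package; adjust the import for your setup.
from langchain_ollama import ChatOllama

def generate_solution(task_description: str,
                      model: str = "qwen2.5-coder:3b",
                      temperature: float = 0.3) -> str:
    """Send a code-generation prompt to a local Ollama model and return the reply."""
    llm = ChatOllama(model=model, temperature=temperature)
    prompt = (
        "You are a Python expert. Write a complete, self-contained function "
        f"for the following task:\n{task_description}\n"
        "Return only the code."
    )
    response = llm.invoke(prompt)   # blocking call to the local Ollama server
    return response.content         # generated source code as a string

# Example:
# code = generate_solution("Write a function that performs FFT on a signal using NumPy.")
```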
```
llm-benchmark-tool/
│
├── benchmark_tool/
│   ├── __init__.py
│   ├── generator.py
│   ├── tester.py
│   ├── metrics.py
│   ├── runner.py
│   ├── run_benchmarks.ps1
│   │
│   ├── HumanWrittenCodes/
│   │   └── *.py                     # Reference implementations per task
│   │
│   └── results/
│       ├── benchmark_metrics.csv    # Final collected metrics
│       ├── metrics_evaluation.py    # Script for visualization
│       └── *.png                    # Graphical results
│
└── .gitignore
```
```
python -m benchmark_tool.runner \
    --task fft \
    --description "Write a function that performs FFT on a signal using NumPy." \
    --model qwen2.5-coder:3b \
    --temperature 0.3 \
    --runs 5
```

Alternatively, you can launch bulk experiments via:

```
.\benchmark_tool\run_benchmarks.ps1
```
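The PowerShell script simply sweeps tasks and models; for illustration, an equivalent loop in Python might look like the sketch below. Task names, descriptions, and model tags here are placeholders, not the set used by `run_benchmarks.ps1`.

```python
# Hypothetical Python equivalent of the bulk loop in run_benchmarks.ps1.
# The task/description pairs and model tags below are placeholders.
import subprocess

TASKS = {
    "fft": "Write a function that performs FFT on a signal using NumPy.",
    # ... more task -> description pairs
}
MODELS = ["qwen2.5-coder:3b"]  # add further Ollama model tags as needed

for model in MODELS:
    for task, description in TASKS.items():
        subprocess.run(
            [
                "python", "-m", "benchmark_tool.runner",
                "--task", task,
                "--description", description,
                "--model", model,
                "--temperature", "0.3",
                "--runs", "5",
            ],
            check=True,
        )
```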
See `metrics.py` for metric definitions and normalization logic. All results are saved to `results/benchmark_metrics.csv` and can be visualized with the provided `metrics_evaluation.py` script.
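If you want a quick look at the collected results without running the full visualization script, something like the following works; note that the column names here are assumptions, and the authoritative plots come from `results/metrics_evaluation.py`.

```python
# Hypothetical quick inspection of the collected results.
# Column names ("model", "runtime_ratio") are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("benchmark_tool/results/benchmark_metrics.csv")

# Example: average of one normalized metric per model.
summary = df.groupby("model")["runtime_ratio"].mean()
summary.plot(kind="bar", title="Mean runtime relative to human reference")
plt.tight_layout()
plt.savefig("runtime_ratio_by_model.png")
```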
This tool was developed as part of a master's thesis on benchmarking LLMs for signal analysis code generation. The full thesis (PDF) will be made available here after the defense.
This repository is open source and available under the MIT License.