LLM Benchmark Tool

This repository provides a modular and extensible tool for benchmarking the quality of code generated by large language models (LLMs) on domain-specific tasks. The project was developed as part of a master's thesis focused on digital signal processing (DSP), but is generalizable to other domains.

Overview

The tool allows researchers and practitioners to evaluate LLM-generated code based on:

  • Functional correctness (via Pytest-based tests),
  • Static code quality (PEP8 violations, cyclomatic complexity),
  • Efficiency (runtime, memory usage),
  • Similarity to human-written reference solutions.

All metrics are normalized relative to human-written reference implementations.
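
The normalization principle can be illustrated with a small sketch (the actual logic lives in metrics.py; the function name and sign convention below are hypothetical): each raw measurement for the LLM solution is divided by the corresponding measurement for the human reference, so a value of 1.0 means parity with the reference.

def normalize_against_reference(llm_value: float, reference_value: float) -> float:
    # Hypothetical illustration of the normalization idea; see metrics.py for
    # the actual definitions. Values above 1.0 mean the LLM solution used more
    # of the resource (e.g. runtime, memory) than the human-written reference.
    if reference_value == 0:
        raise ValueError("Reference value must be non-zero for normalization.")
    return llm_value / reference_value

# Example: LLM code runs in 12 ms, the reference in 8 ms -> normalized runtime of 1.5
print(normalize_against_reference(12.0, 8.0))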

Architecture

The tool is structured into independent modules:

Module               Description
runner.py            Orchestrates the benchmarking pipeline (generation → testing → evaluation).
generator.py         Uses LangChain to send prompts to an on-prem LLM (via the Ollama API).
tester.py            Generates task-specific test scripts and runs them using Pytest.
metrics.py           Computes a set of normalized metrics from execution and static analysis.
run_benchmarks.ps1   PowerShell script for large-scale benchmarking across tasks and models.
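
For orientation, prompting an on-prem model through LangChain's Ollama wrapper typically looks like the sketch below. The class and import path follow the langchain_community package; treating this as how generator.py is wired internally is an assumption, not a copy of its code.

from langchain_community.llms import Ollama

# Assumes a local Ollama server is running and the model has been pulled.
llm = Ollama(model="qwen2.5-coder:3b", temperature=0.3)

prompt = "Write a function that performs FFT on a signal using NumPy."
generated_code = llm.invoke(prompt)  # raw model completion as a string
print(generated_code)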

Directory Structure

llm-benchmark-tool/
│
├── benchmark_tool/
│   ├── __init__.py
│   ├── generator.py
│   ├── tester.py
│   ├── metrics.py
│   ├── runner.py
│   ├── run_benchmarks.ps1
│   │
│   ├── HumanWrittenCodes/
│   │   └── *.py                      # Reference implementations per task
│   │
│   └── results/
│       ├── benchmark_metrics.csv     # Final collected metrics
│       ├── metrics_evaluation.py     # Script for visualization
│       └── *.png                     # Graphical results
│
└── .gitignore

Example Benchmark Run

python -m benchmark_tool.runner \
    --task fft \
    --description "Write a function that performs FFT on a signal using NumPy." \
    --model qwen2.5-coder:3b \
    --temperature 0.3 \
    --runs 5

Alternatively, you can launch bulk experiments across all tasks and models via the PowerShell script:

.\benchmark_tool\run_benchmarks.ps1

Metrics

See metrics.py for metric definitions and normalization logic. All results are saved to results/benchmark_metrics.csv and visualized with the provided results/metrics_evaluation.py script.
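
As a rough illustration of the static-quality side (PEP8 violations and cyclomatic complexity), the snippet below uses pycodestyle and radon; whether metrics.py relies on these exact libraries is an assumption.

import pycodestyle
from radon.complexity import cc_visit

def static_quality(path: str) -> dict:
    # Count PEP8 violations reported by pycodestyle for the given file.
    style = pycodestyle.StyleGuide(quiet=True)
    pep8_violations = style.check_files([path]).total_errors

    # Average cyclomatic complexity over all functions/classes found by radon.
    with open(path, encoding="utf-8") as f:
        blocks = cc_visit(f.read())
    avg_complexity = sum(b.complexity for b in blocks) / len(blocks) if blocks else 0.0

    return {"pep8_violations": pep8_violations, "cyclomatic_complexity": avg_complexity}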

Related Thesis

This tool was developed as part of a master's thesis on benchmarking LLMs for signal analysis code generation. The full thesis (PDF) will be made available here after the defense.

License

This repository is open source and available under the MIT License.
