A comprehensive benchmarking framework for evaluating open-source generative AI models on automated Verilog HDL and testbench generation tasks.
This project establishes the first structured, reproducible benchmark for AI-assisted hardware design at the RTL (Register Transfer Level). It evaluates models across:
-
Final dataset scope: 50 curated Verilog design and verification tasks spanning 23 combinational, 14 sequential, 8 FSM, and 5 mixed/complex designs.
-
Circuit (HDL) generation from textual specifications
-
Testbench generation for functional verification
-
Quantitative benchmarking against reference implementations
| Tier | Model | Size | Purpose |
|---|---|---|---|
| Large | Llama 3 8B Instruct | 8B | High-quality general-purpose baseline |
| Medium | StarCoder2 | 7B | Code-specialized mid-tier model |
| Small | TinyLlama | 1.1B | Lightweight resource-constrained baseline |
- Syntax Validity (SV): % of files that compile without errors (Verilator + iverilog)
- Functional Correctness (FC): % producing expected simulation outputs (iverilog + testbench)
- Generation Time (GT): Average inference time per task
- Compile Time: Verilator/iverilog compilation time
- Simulation Time: iverilog/vvp simulation time
- Iteration Count: Number of generation–evaluation cycles per task
- Confidence Entropy: Token-level entropy as a model confidence proxy
- Waveform Diff Summary: Signal-level mismatch between generated and reference waveform (when enabled)
- Formal Equivalence Status: Result of formal equivalence check (when enabled)
- Semantic Repair Applied: List of repair operations applied during iterative refinement
The following were originally planned but have not been implemented in any benchmark run:
- Synthesis Quality (SQ): Gate count / area via Yosys —
SynthesisToolclass exists in code but is never invoked; all results havegate_count=None,cell_count=None - Testbench Detection Rate (TDR): Fault injection coverage — no fault injector built;
fault_detection_ratiois alwaysNone - Prompt Sensitivity (PS): Variance across prompt templates A/B/C — each phase uses a single fixed prompt style; cross-template comparison not run
- Hallucination Index (HI): Undeclared signal count — no code exists to detect these
- Usability Score (US): Composite formula
0.4*FC + 0.3*SV + 0.2*(1−area) + 0.1*TDR— never evaluated (depends on SQ and TDR)
Required Tools:
# Ubuntu/Debian
sudo apt-get install iverilog verilator yosys python3.10 python3-pip
# macOS
brew install icarus-verilog verilator yosys python@3.10Python Dependencies:
pip install -r requirements.txtOptional Dependencies (for Phase 4 features):
pyvcd>=0.4.0- For waveform analysis (installed by default in requirements.txt)pyverilog>=1.3.0- For AST-based code repair (installed by default in requirements.txt)
Note: These are optional. If not installed, Phase 4 will gracefully disable the corresponding features (waveform analysis, AST repair).
AI Models (Ollama):
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull llama3
ollama pull tinyllama
ollama pull starcoder2:7bNote: StarCoder2 can also be used via HuggingFace Transformers. See model_interface.py for configuration options.
- Validate Dataset:
cd Quantitative
python dataset_loader.py- Run Phase 2 Benchmark (Recommended):
python run_phase2.pyThis runs the full benchmark dataset (currently 50 tasks) with 3 repetitions per model and saves results to ../results/phase2_benchmark/.
Alternative: Run Mini Benchmark:
python run_mini_benchmark.pyThis evaluates 5 starter tasks and saves results to ../results/mini_benchmark/.
- Analyze Results:
python statistical_analysis.py ../results/mini_benchmark/benchmark_results.json- Generate Visualizations:
python visualizations.py ../results/mini_benchmark/benchmark_results.json ../figures/Note on reproducibility: All 12 benchmarks (1,610 runs) were conducted locally on macOS (Apple Silicon) using Icarus Verilog 12.0, Verilator 5.038, and Yosys 0.58 installed via Homebrew. The Docker setup is provided to allow others to run the pipeline on any platform without manual tool installation. It has been validated by successfully running the mini benchmark (5 tasks × 3 models × 3 repetitions) end-to-end inside the container.
![]()
Tool versions inside the container (Debian Trixie apt): Icarus Verilog 12.0, Verilator 5.032, Yosys 0.52.
On the Host Machine:
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh ollama serve # Start Ollama service
- Pull required models:
ollama pull llama3 ollama pull tinyllama ollama pull starcoder2:7b
# Build image
docker-compose build
# Run container (Ollama connection configured automatically)
docker-compose up -d
# Access shell
docker exec -it eda_benchmark bash
# Inside container — run mini benchmark (5 tasks, validated):
cd /workspace/Quantitative
python run_mini_benchmark.py
# Or run the full 50-task benchmark:
python run_phase5.pyOllama runs on the host machine; the container connects to it via OLLAMA_BASE_URL:
- Windows/Mac Docker Desktop: Uses
http://host.docker.internal:11434(configured automatically) - Linux: Set
OLLAMA_BASE_URLto your host IP:export OLLAMA_BASE_URL=http://your-host-ip:11434 docker-compose up -d
Verify Connection:
# Inside container
python -c "from model_interface import OllamaInterface; OllamaInterface('llama3')"Paper/
├── DATASET_LICENSE # Dataset license (CC BY-NC 4.0)
├── Docs/ # Extended documentation set
│ ├── architecture.md
│ ├── benchmark_history.md
│ ├── dataset_description.md
│ └── ...
├── Quantitative/ # Core benchmarking stack
│ ├── Eval_Pipeline.py # Main evaluation pipeline
│ ├── instruction.json # Methodology + run configuration
│ ├── model_interface.py # AI model integration (Ollama/HF)
│ ├── dataset_loader.py # Task loading utilities
│ ├── statistical_analysis.py # Statistical analysis module
│ ├── visualizations.py # Plotting and visualization
│ ├── confidence_tracker.py # Entropy-based confidence modeling
│ ├── feedback_generator.py # Error feedback for refinement loops
│ ├── iterative_evaluator.py # Iterative evaluation driver
│ ├── waveform_analyzer.py # Waveform comparison + diffing
│ ├── formal_verifier.py # Formal equivalence checks
│ ├── ast_repair.py # AST-guided repair helpers
│ ├── semantic_repair.py # Semantic repair orchestrator
│ ├── phase4_config.py # Phase 4 feature toggles
│ ├── phase5_config.py # Phase 5 experiment settings
│ ├── phase5_feedback.py # Prompt templates for Phase 5
│ ├── phase5_repair.py # Phase 5 repair utilities
│ ├── run_phase1.py # Phase 1: Few-shot prompting
│ ├── run_phase2.py # Phase 2: Constrained prompts + post-processing
│ ├── run_phase3.py # Phase 3: Iterative refinement (legacy)
│ ├── run_phase4.py # Phase 4: Semantic-aware refinement
│ ├── run_phase5.py # Phase 5: Extended repair experiments
│ ├── Research_Data/ # Benchmark analysis reports (1st–12th)
│ │ ├── 1st_Benchmark_Results.md
│ │ ├── ...
│ │ └── 12th_Benchmark_Results.md
│ └── dataset/
│ ├── tasks.json # Task metadata (50 tasks, final scope)
│ ├── combinational/ # Combinational circuits (23 tasks)
│ ├── sequential/ # Sequential circuits (14 tasks)
│ ├── fsm/ # Finite state machines (8 tasks)
│ └── mixed/ # Mixed designs (5 tasks)
├── Research_Paper/ # Reference papers + citations
│ ├── CITATION.md
│ └── *.pdf
├── results/ # Generated benchmark outputs
│ ├── Benchmark_1&2_Results/
│ ├── Benchmark_3&4_Results/
│ ├── Benchmark_5_Results/
│ ├── Benchmark_6_Results/
│ ├── Benchmark_7_Results/
│ ├── Benchmark_8_Results/
│ ├── Benchmark_9_Results/
│ ├── Benchmark_10_Results/
│ ├── Benchmark_11_Results/
│ ├── Benchmark_12_Results/
│ └── mini_benchmark/ # Docker-validated mini benchmark results
├── figures/ # Visualization exports by benchmark
│ ├── 1st_Benchmark_figures/
│ ├── ...
│ └── 12th_Benchmark_figures/
├── docker-compose.yml # Containerized workflow entrypoint
├── Dockerfile # Base container definition
├── LICENSE # MIT license (code)
├── README.md # This file
├── ROADMAP.md # Development roadmap
└── requirements.txt # Python dependencies
This section documents all 12 benchmark tests, which files were used, and their key features.
Eval_Pipeline.py- Main evaluation pipeline (compilation, simulation, metrics)model_interface.py- AI model integration (Ollama/HuggingFace interfaces)dataset_loader.py- Task loading and validation utilitiesinstruction.json- Configuration and methodology specifications
statistical_analysis.py- Standalone statistical analysis toolvisualizations.py- Standalone visualization generation tool
Runner: run_phase1.py
Methodology: Phase 1 - Few-shot prompting
Tasks: 5 tasks (first 5 from dataset)
Models: Llama-3-8B, TinyLlama-1.1B
Repetitions: 1 per task
Key Features:
- Basic few-shot prompting with examples
- No post-processing
- System configuration validation (EDA tools)
Results: results/Benchmark_1&2_Results/
Analysis: Research_Data/1st_Benchmark_Results.md
Runner: run_phase2.py
Methodology: Phase 2 - Constrained prompts + post-processing
Tasks: 5 tasks
Models: Llama-3-8B, TinyLlama-1.1B
Repetitions: 1 per task
Key Features:
- Task-specific module/port name constraints
- Basic post-processing fixes
- Module name extraction and correction
Results: results/Benchmark_1&2_Results/
Analysis: Research_Data/2nd_Benchmark_Results.md
Runner: run_phase2.py
Methodology: Phase 2 - Enhanced post-processing
Tasks: 5 tasks
Models: Llama-3-8B, TinyLlama-1.1B
Repetitions: 1 per task
Key Features:
- Improved post-processing with SystemVerilog→Verilog conversion
- BSV construct removal
- Port name normalization
Results: results/Benchmark_3&4_Results/
Analysis: Research_Data/3rd_Benchmark_Results.md
Runner: run_phase2.py
Methodology: Phase 2 + Statistical analysis (3 repetitions)
Tasks: 5 tasks
Models: Llama-3-8B, TinyLlama-1.1B
Repetitions: 3 per task (per instruction.json)
Key Features:
- Multiple repetitions for statistical significance
- Mean rates with standard deviations (σ)
- Variance quantification
Results: results/Benchmark_3&4_Results/
Analysis: Research_Data/4th_Benchmark_Results.md
Runner: run_phase2.py
Methodology: Phase 2 + StarCoder2-7B addition
Tasks: 5 tasks
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task
Key Features:
- Three-tier model comparison (Large/Medium/Small)
- StarCoder2 code-specialized model evaluation
- Sequential logic performance analysis
Results: results/Benchmark_5_Results/
Analysis: Research_Data/5th_Benchmark_Results.md
Runner: run_phase2.py
Methodology: Phase 2 + Sequential normalization post-processing
Tasks: 5 tasks
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task
Key Features:
- Enhanced
post_process_verilog()with sequential normalization - Enforced
begin/endstructure for sequential blocks - Sequential template matching and repair
- DFF and counter reliability improvements
Results: results/Benchmark_6_Results/
Analysis: Research_Data/6th_Benchmark_Results.md
Runner: run_phase2.py
Methodology: Phase 2 + Full 20-task dataset
Tasks: 20 tasks (9 combinational, 6 sequential, 3 FSM, 2 mixed)
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task
Key Features:
- Complete benchmark coverage
- Category-specific performance analysis
- FSM and mixed design challenges exposed
Results: results/Benchmark_7_Results/
Analysis: Research_Data/7th_Benchmark_results.md
Runner: run_phase2.py
Methodology: Phase 2 + Comprehensive examples for all task types
Tasks: 20 tasks
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task
Key Features:
- Complete examples for all task categories (combinational, sequential, FSM, mixed)
- Enhanced post-processing with FSM/mixed template generation
- Category-specific scaffolding to prevent truncation
- FSM syntax validity breakthrough (StarCoder2: 0% → 66.7%)
Results: results/Benchmark_8_Results/
Analysis: Research_Data/8th_Benchmark_Results.md
Runner: run_phase4.py
Methodology: Phase 4 - Semantic-aware iterative refinement
Tasks: 20 tasks
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task
Key Features:
- Semantic repair components:
waveform_analyzer.py- Waveform difference analysisformal_verifier.py- Formal equivalence checkingast_repair.py- AST-based code repairsemantic_repair.py- Orchestrates semantic repair tools
- Iterative evaluation:
iterative_evaluator.py- Adaptive iterative refinement loopfeedback_generator.py- Error feedback generationconfidence_tracker.py- Confidence modeling (entropy tracking)
- Configuration:
phase4_config.py- Phase 4 feature flags and settings
- Imports from Phase 2:
- Uses
extract_module_name(),get_port_spec(),get_constrained_prompt(),post_process_verilog()fromrun_phase2.py
- Uses
Results: results/Benchmark_9_Results/
Analysis: Research_Data/ (if available)
Runner: run_phase4.py
Methodology: Phase 4 - Semantic-aware iterative refinement with expanded dataset
Tasks: 50 tasks (23 combinational, 14 sequential, 8 FSM, 5 mixed)
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task (450 total generations)
Key Features:
- Validates scalability of the Phase 4 pipeline on the full 50-task scope
- Semantic repair stack (waveform analysis, formal verification, AST repair) plus confidence tracking
- Captures per-model entropy/iteration statistics for the larger dataset
- Highlights remaining FSM and mixed-design functional correctness gaps
Results: results/Benchmark_10_Results/
Analysis: Research_Data/10th_Benchmark_Results.md
Runner: run_phase4.py
Methodology: Phase 4 - Single-model validation run (Llama-3-8B only)
Tasks: 50 tasks (full dataset)
Models: Llama-3-8B only
Repetitions: 3 per task (150 total generations)
Key Features:
- Validates reproducibility of Benchmark 10 results with a single model
- Measures generation time optimization (8.84s → 4.77s, −46%)
- Confirms syntax validity (71.3%) and simulation pass rate (61.3%) are stable
Results: results/Benchmark_11_Results/
Analysis: Quantitative/Research_Data/11th_Benchmark_Results.md
Runner: run_phase5.py
Methodology: Phase 5 strict mode (waveform + formal enabled, semantic/AST repair)
Tasks: 50 tasks (full dataset)
Models: Llama-3-8B, StarCoder2-7B, TinyLlama-1.1B
Repetitions: 3 per task (450 total generations)
Key Features:
- Strict mode (no entropy skips), waveform analysis, formal verification for FSM/complex tasks
- Higher iteration caps (up to 6) with lower improvement threshold for refinement
- Confidence tracking, semantic repair, and AST repair kept enabled
Results: results/Benchmark_12_Results/
Analysis: Quantitative/Research_Data/12th_Benchmark_Results.md
| File | Benchmarks Used | Purpose |
|---|---|---|
run_phase1.py |
1 | Phase 1: Few-shot prompting baseline |
run_phase2.py |
2-8 | Phase 2: Constrained prompts + post-processing (evolved across benchmarks) |
run_phase3.py |
None | Phase 3: Iterative refinement (not used in final benchmarks) |
run_phase4.py |
9, 10 | Phase 4: Semantic-aware iterative refinement |
run_phase5.py |
12 | Phase 5: Enhanced FSM/mixed prompts + micro-repair |
waveform_analyzer.py |
9, 10, 12 | Waveform analysis for semantic repair |
formal_verifier.py |
9, 10, 12 | Formal verification for semantic repair |
ast_repair.py |
9, 10, 12 | AST-based code repair |
semantic_repair.py |
9, 10, 12 | Semantic repair orchestrator |
iterative_evaluator.py |
9, 10, 12 | Adaptive iterative evaluation loop |
feedback_generator.py |
9, 10 | Error feedback generation |
confidence_tracker.py |
9, 10, 12 | Confidence modeling and entropy tracking |
phase4_config.py |
9, 10 | Phase 4 configuration |
phase5_config.py |
12 | Phase 5 configuration |
phase5_feedback.py |
12 | Category-aware feedback templates for Phase 5 |
phase5_repair.py |
12 | Phase 5 micro-repair engine |
Phase 1 → Phase 2 (Benchmarks 1-8):
- Few-shot prompting → Constrained prompts with exact module/port specs
- No post-processing → Comprehensive post-processing
- Single runs → Statistical analysis (3 repetitions)
- 5 tasks → 20 tasks (full dataset)
- 2 models → 3 models (added StarCoder2)
- Basic fixes → Sequential normalization + FSM/mixed templates
Phase 2 → Phase 4 (Benchmarks 9–10):
- Single-pass generation → Iterative refinement with feedback
- Syntax-only validation → Semantic validation (waveform, formal verification)
- Static post-processing → Adaptive AST-based repair
- No confidence tracking → Confidence modeling with entropy
- Fixed methodology → Adaptive stopping based on confidence
Phase 4 → Phase 5 (Benchmark 12):
- Standard constrained prompts → Enhanced FSM/mixed-specific prompts with reference templates
- Standard post-processing → Micro-repair engine runs before standard post-processing
- Generic feedback → Category-aware feedback targeting FSM and mixed design failure modes
from model_interface import OllamaInterface
from dataset_loader import load_tasks_from_json
from Eval_Pipeline import BenchmarkPipeline
# Load tasks
tasks = load_tasks_from_json("dataset/tasks.json")
# Initialize model
model = OllamaInterface("llama3")
# Run evaluation
pipeline = BenchmarkPipeline(Path("./results"))
metrics = pipeline.evaluate_task(tasks[0], model)from statistical_analysis import BenchmarkAnalyzer
analyzer = BenchmarkAnalyzer("results/benchmark_results.json")
analyzer.print_summary_report()
# Statistical test
result = analyzer.paired_statistical_test("Llama-3-8B", "TinyLlama-1.1B")
print(f"p-value: {result['wilcoxon_p_value']}")from visualizations import BenchmarkVisualizer
viz = BenchmarkVisualizer(results_json="results/benchmark_results.json")
viz.plot_overall_comparison("figures/comparison.png")
viz.plot_pass_rate_by_category("figures/by_category.png")- Create reference Verilog and testbench files
- Add entry to
dataset/tasks.json:
{
"task_id": "new_task_001",
"category": "combinational",
"difficulty": "medium",
"specification": "Design a...",
"reference_hdl": "path/to/reference.v",
"reference_tb": "path/to/testbench.v",
"inputs": ["a", "b"],
"outputs": ["y"]
}# For Ollama models
model = OllamaInterface("model-name")
# For HuggingFace models
from model_interface import HuggingFaceInterface
model = HuggingFaceInterface("org/model-name")- ✅ 50 benchmark tasks (finalized scope)
- 23 combinational circuits (basic gates, arithmetic blocks, mux/decoder variants)
- 14 sequential circuits (flip-flops, shift registers, counters, Johnson/PIPO variants)
- 8 FSM designs (sequence detectors, controllers, traffic light)
- 5 mixed/complex designs (priority encoder, ALU)
- 🛑 Further dataset expansion intentionally paused at 50 tasks to focus on semantic-aware refinement, evaluation quality, and documentation.
- Pipeline infrastructure
- Model integration (Ollama/HF)
- Starter dataset (5 tasks)
- Statistical analysis
- Visualization module
- Constrained prompts with exact module/port specifications
- Comprehensive examples for all task types (20 tasks)
- Enhanced post-processing with FSM/mixed template generation
- Sequential normalization for reliable sequential designs
- Category-specific scaffolding to prevent truncation
- Full 20-task benchmark evaluation (8th benchmark)
Key Achievements:
- FSM Breakthrough: StarCoder2 achieves 66.7% syntax validity for FSM tasks (previously 0%)
- Mixed Design Success: StarCoder2 achieves 66.7% functional correctness on priority encoder
- Sequential Expansion: All models handle expanded sequential library (T flip-flop, shift register, PIPO register)
- Overall Improvement: 70% syntax validity (Llama-3), 55% (StarCoder2), 65% (TinyLlama)
- Expand to 30 tasks
- Expand to 50 tasks (final scope)
Decision: Scope is intentionally capped at 50 curated tasks to prioritize deeper analysis, semantic-aware refinement, and documentation polish over further breadth.
- Run complete experiments on final 50-task dataset (Benchmarks 10, 11, 12 — 1,610 total runs)
- Generate publication-ready results
- Write research paper (submitted for journal publication)
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
- HDLBits for circuit examples
- OpenCores for reference designs
- Open-source EDA tool developers
The source code in this repository is released under the MIT License.
See: LICENSE
The dataset files (reference Verilog, testbenches, and tasks.json) are released under the
Creative Commons Attribution–NonCommercial 4.0 License (CC BY-NC 4.0).
See: DATASET_LICENSE
Status: Research complete, paper under journal review | Last Updated: March 2026
