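This benchmark is based on the official VerilogEval dataset (ICCAD 2023), which contains 156 Verilog digital circuit design problems for evaluating the Verilog code generation ability of large language models.
- Dataset: dataset_code-complete-iccad2023
- Number of problems: 156
- Task type: code completion
- Evaluation tool: iverilog v12
Each problem includes:
- Prob*_prompt.txt: problem description and module interface
- Prob*_test.sv: functional verification testbench
- Prob*_ref.sv: reference implementation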
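Test configuration (baseline prompt, parameter set 1):
- Temperature: 0.0 (greedy decoding, deterministic output)
- top-p: 0.01 (has no practical effect at temperature 0.0)
- Shot setting: zero-shot (0-shot)
- Max tokens: 1024
- Test environment: Windows + iverilog v12
- Python version: 3.13.9
- Hardware: Intel i7-12700KF + RTX 5060Ti 16GB
- Evaluation tools: iverilog + vvp + static filter (guards against simulations hanging in infinite loops)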
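
To illustrate why top-p is irrelevant under greedy decoding, here is a minimal sketch of temperature scaling combined with nucleus (top-p) filtering. It is a generic textbook formulation, not the sampler of any particular inference backend; the function name `sample_token` and its arguments are hypothetical.

```python
import math
import random


def sample_token(logits, temperature, top_p, rng=random):
    """Pick a token index using temperature scaling plus nucleus (top-p) filtering."""
    if temperature <= 0.0:
        # Greedy decoding: always take the highest-scoring token; top_p is never consulted.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits (shifted by the maximum for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of top tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample inside the nucleus, weighted by the original probabilities.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

With `temperature=0.0` the function returns the argmax regardless of `top_p`. With `top_p=0.01` the nucleus almost always contains only the single most likely token, so the output is effectively deterministic even at a non-zero temperature.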
Test results:
| Model | Parameters | Generated | Compiled | Tests passed | Total | Compile failures | Test failures |
|---|---|---|---|---|---|---|---|
| gpt-oss_20b | 20B | 156(100%) | 93(59.6%) | 88(56.4%) | 156 | 63 | 5 |
| deepseek_v3_2 | API | 156(100%) | 92(59.0%) | 72(46.2%) | 156 | 64 | 20 |
| phi4_14b | 14B | 156(100%) | 107(68.6%) | 47(30.1%) | 156 | 49 | 60 |
| gemma3_12b | 12B | 156(100%) | 84(53.8%) | 44(28.2%) | 156 | 72 | 40 |
| glm_4_6 | API | 156(100%) | 41(26.3%) | 39(25.0%) | 156 | 115 | 2 |
| qwen3_8b | 8B | 156(100%) | 28(17.9%) | 27(17.3%) | 156 | 128 | 1 |
| deepseek-r1_14b | 14B | 156(100%) | 20(12.8%) | 15(9.6%) | 156 | 136 | 5 |
| llama3.2_3b | 3B | 156(100%) | 48(30.8%) | 15(9.6%) | 156 | 108 | 33 |
| mistral_7b | 7B | 156(100%) | 34(21.8%) | 14(9.0%) | 156 | 122 | 20 |
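Key findings:
- gpt-oss_20b performs best, with a test pass rate of 56.4%
- phi4_14b has the highest compile success rate (68.6%) but a pass rate of only 30.1%: 60 of its 107 compiling designs fail the functional tests, so compiling successfully does not imply functional correctness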
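
The columns of the results table are related by simple arithmetic, which the following sketch makes explicit. The helper name `summarize` is hypothetical; the only inputs taken from the report are the raw counts and the dataset size of 156.

```python
TOTAL_PROBLEMS = 156  # size of dataset_code-complete-iccad2023


def summarize(compiled, passed, total=TOTAL_PROBLEMS):
    """Derive the rate and failure columns of the results table from raw counts."""
    if not 0 <= passed <= compiled <= total:
        raise ValueError("expected 0 <= passed <= compiled <= total")
    return {
        "compile_rate": round(100.0 * compiled / total, 1),
        "pass_rate": round(100.0 * passed / total, 1),
        "compile_failures": total - compiled,  # never reached simulation
        "test_failures": compiled - passed,    # compiled but failed the testbench
    }


# phi4_14b at temperature 0.0: 107 compiled, 47 passed
print(summarize(107, 47))
```

Both rates use all 156 problems as the denominator, so the pass rate can never exceed the compile rate, and the gap between them is the share of designs that compile but are functionally wrong.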
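Test configuration (baseline prompt, parameter set 2):
- Temperature: 0.8 (more diverse output)
- top-p: 0.95 (looser sampling)
- Shot setting: zero-shot (0-shot)
- Max tokens: 1024
- Test environment: Windows + iverilog v12
- Python version: 3.13.9
- Hardware: Intel i7-12700KF + RTX 5060Ti 16GB
- Evaluation tools: iverilog + vvp + static filter (guards against simulations hanging in infinite loops)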
Test results:
| Model | Parameters | Generated | Compiled | Tests passed | Total | Compile failures | Test failures |
|---|---|---|---|---|---|---|---|
| deepseek_v3_2 | API | 156(100%) | 92(59.0%) | 76(48.7%) | 156 | 64 | 16 |
| gpt_oss_20b | 20B | 156(100%) | 67(42.9%) | 65(41.7%) | 156 | 89 | 2 |
| phi4_14b | 14B | 156(100%) | 120(76.9%) | 62(39.7%) | 156 | 36 | 58 |
| gemma3_12b | 12B | 156(100%) | 105(67.3%) | 58(37.2%) | 156 | 51 | 47 |
| glm_4_6 | API | 156(100%) | 45(28.8%) | 39(25.0%) | 156 | 111 | 6 |
| deepseek-r1_14b | 14B | 156(100%) | 25(16.0%) | 24(15.4%) | 156 | 131 | 1 |
| qwen3_8b | 8B | 156(100%) | 16(10.3%) | 16(10.3%) | 156 | 140 | 0 |
| mistral_7b | 7B | 156(100%) | 27(17.3%) | 8(5.1%) | 156 | 129 | 19 |
| llama3.2_3b | 3B | 156(100%) | 28(17.9%) | 7(4.5%) | 156 | 128 | 21 |
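Key findings:
- deepseek_v3_2 performs best, improving by 2.5 percentage points over the baseline run with parameter set 1 (from 46.2% to 48.7%)
- phi4_14b again has the highest compile success rate (76.9%) but a comparatively low pass rate (39.7%), the same pattern as in the baseline run with parameter set 1
- Pass rate correlates roughly with parameter count: the local models mostly rank in order of size, although deepseek-r1_14b (15.4%) falls below the smaller gemma3_12b (37.2%). glm_4_6, an API model, is the main exception, passing only 25.0%
- qwen3_8b has identical compile and test pass rates (10.3%): every design that compiled also passed its tests, so all of its failures are compile failures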
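Baseline prompt, parameter set 1 vs. parameter set 2 (changes are in percentage points):
| Model | Parameters | Compile rate (temp=0.0) | Compile rate (temp=0.8) | Compile change | Pass rate (temp=0.0) | Pass rate (temp=0.8) | Pass change |
|---|---|---|---|---|---|---|---|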
| phi4_14b | 14B | 68.6% | 76.9% | +8.3% | 30.1% | 39.7% | +9.6% |
| gemma3_12b | 12B | 53.8% | 67.3% | +13.5% | 28.2% | 37.2% | +9.0% |
| deepseek-r1_14b | 14B | 12.8% | 16.0% | +3.2% | 9.6% | 15.4% | +5.8% |
| deepseek_v3_2 | API | 59.0% | 59.0% | 0.0% | 46.2% | 48.7% | +2.5% |
| glm_4_6 | API | 26.3% | 28.8% | +2.5% | 25.0% | 25.0% | 0.0% |
| mistral_7b | 7B | 21.8% | 17.3% | -4.5% | 9.0% | 5.1% | -3.9% |
| llama3.2_3b | 3B | 30.8% | 17.9% | -12.9% | 9.6% | 4.5% | -5.1% |
| qwen3_8b | 8B | 17.9% | 10.3% | -7.6% | 17.3% | 10.3% | -7.0% |
| gpt_oss_20b | 20B | 59.6% | 42.9% | -16.7% | 56.4% | 41.7% | -14.7% |
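
The "change" columns in the comparison table are differences in percentage points, not relative changes. The sketch below shows both quantities side by side; the helper names are hypothetical.

```python
def point_change(before, after):
    """Absolute difference in percentage points, as used in the comparison tables."""
    return round(after - before, 1)


def relative_change(before, after):
    """Relative change in percent; not the figure reported in the tables."""
    return round(100.0 * (after - before) / before, 1)


# deepseek_v3_2 pass rate: 46.2% at temp=0.0, 48.7% at temp=0.8
print(point_change(46.2, 48.7))     # percentage points
print(relative_change(46.2, 48.7))  # relative percent
```

Since each problem is worth 100/156 (about 0.64) percentage points, a change of 2.5 points corresponds to four additional problems passed.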
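Test configuration (optimized prompt v1, parameter set 1):
- Temperature: 0.0 (greedy decoding, deterministic output)
- top-p: 0.01
- Shot setting: zero-shot (0-shot)
- Max tokens: 1024
- Prompt strategy: v1 (optimized version)
- Test environment: Windows + iverilog v12
- Python version: 3.13.9
- Hardware: Intel i7-12700KF + RTX 5060Ti 16GB
- Evaluation tools: iverilog + vvp + static filter (guards against simulations hanging in infinite loops)
Test results:
| Model | Parameters | Generated | Compiled | Tests passed | Total | Compile failures | Test failures |
|---|---|---|---|---|---|---|---|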
| glm_4_6 | API | 156(100.0%) | 128( 82.1%) | 106( 67.9%) | 156 | 28 | 22 |
| deepseek_v3_2 | API | 156(100.0%) | 115( 73.7%) | 92( 59.0%) | 156 | 41 | 23 |
| gpt_oss_20b | 20B | 156(100.0%) | 96( 61.5%) | 86( 55.1%) | 156 | 60 | 10 |
| phi4_14b | 14B | 156(100.0%) | 122( 78.2%) | 66( 42.3%) | 156 | 34 | 56 |
| gemma3_12b | 12B | 156(100.0%) | 80( 51.3%) | 44( 28.2%) | 156 | 76 | 36 |
| deepseek_r1_14b | 14B | 156(100.0%) | 47( 30.1%) | 39( 25.0%) | 156 | 109 | 8 |
| qwen3_8b | 8B | 156(100.0%) | 32( 20.5%) | 30( 19.2%) | 156 | 124 | 2 |
| llama3_2_3b | 3B | 156(100.0%) | 46( 29.5%) | 15( 9.6%) | 156 | 110 | 31 |
| mistral_7b | 7B | 156(100.0%) | 1( 0.6%) | 0( 0.0%) | 156 | 155 | 1 |
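Key findings:
- glm_4_6 performs best with a 67.9% test pass rate, up 42.9 percentage points from the baseline run with parameter set 1 (from 25.0% to 67.9%), by far the largest gain
- deepseek_v3_2 ranks second with a 59.0% pass rate, up 12.8 percentage points from the baseline run with parameter set 1 (from 46.2% to 59.0%)
- gpt_oss_20b stays roughly flat at 55.1%, down 1.3 percentage points from the baseline run with parameter set 1 (from 56.4% to 55.1%)
- The optimized prompt affects models very differently:
  - glm_4_6 gains the most (+42.9 points), followed by deepseek_r1_14b (+15.4), deepseek_v3_2 (+12.8) and phi4_14b (+12.2)
  - mistral_7b regresses sharply (9.0% to 0.0%) and gpt_oss_20b slightly (-1.3 points); gemma3_12b and llama3_2_3b keep the same pass rate while their compile rates dip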
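Baseline vs. optimized prompt v1, parameter set 1 (changes are in percentage points):
| Model | Parameters | Compile rate (baseline) | Compile rate (optimized) | Compile change | Pass rate (baseline) | Pass rate (optimized) | Pass change |
|---|---|---|---|---|---|---|---|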
| glm_4_6 | API | 26.3% | 82.1% | +55.8% | 25.0% | 67.9% | +42.9% |
| deepseek_r1_14b | 14B | 12.8% | 30.1% | +17.3% | 9.6% | 25.0% | +15.4% |
| deepseek_v3_2 | API | 59.0% | 73.7% | +14.7% | 46.2% | 59.0% | +12.8% |
| phi4_14b | 14B | 68.6% | 78.2% | +9.6% | 30.1% | 42.3% | +12.2% |
| qwen3_8b | 8B | 17.9% | 20.5% | +2.6% | 17.3% | 19.2% | +1.9% |
| gemma3_12b | 12B | 53.8% | 51.3% | -2.5% | 28.2% | 28.2% | 0.0% |
| llama3_2_3b | 3B | 30.8% | 29.5% | -1.3% | 9.6% | 9.6% | 0.0% |
| gpt_oss_20b | 20B | 59.6% | 61.5% | +1.9% | 56.4% | 55.1% | -1.3% |
| mistral_7b | 7B | 21.8% | 0.6% | -21.2% | 9.0% | 0.0% | -9.0% |
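Test configuration (optimized prompt v1, parameter set 2):
- Temperature: 0.8 (more diverse output)
- top-p: 0.95 (looser sampling)
- Shot setting: zero-shot (0-shot)
- Max tokens: 1024
- Prompt strategy: v1 (optimized version)
- Test environment: Windows + iverilog v12
- Python version: 3.13.9
- Hardware: Intel i7-12700KF + RTX 5060Ti 16GB
- Evaluation tools: iverilog + vvp + static filter (guards against simulations hanging in infinite loops)
Test results:
| Model | Parameters | Generated | Compiled | Tests passed | Total | Compile failures | Test failures |
|---|---|---|---|---|---|---|---|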
| glm_4_6 | API | 156(100.0%) | 133( 85.3%) | 111( 71.2%) | 156 | 23 | 22 |
| gpt_oss_20b | 20B | 156(100.0%) | 102( 65.4%) | 95( 60.9%) | 156 | 54 | 7 |
| deepseek_v3_2 | API | 156(100.0%) | 117( 75.0%) | 92( 59.0%) | 156 | 39 | 25 |
| phi4_14b | 14B | 156(100.0%) | 119( 76.3%) | 65( 41.7%) | 156 | 37 | 54 |
| gemma3_12b | 12B | 156(100.0%) | 87( 55.8%) | 45( 28.8%) | 156 | 69 | 42 |
| deepseek_r1_14b | 14B | 156(100.0%) | 44( 28.2%) | 37( 23.7%) | 156 | 112 | 7 |
| qwen3_8b | 8B | 156(100.0%) | 32( 20.5%) | 30( 19.2%) | 156 | 124 | 2 |
| llama3_2_3b | 3B | 156(100.0%) | 43( 27.6%) | 11( 7.1%) | 156 | 113 | 32 |
| mistral_7b | 7B | 156(100.0%) | 5( 3.2%) | 1( 0.6%) | 156 | 151 | 4 |
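Key findings:
- glm_4_6 performs best with a 71.2% test pass rate, the highest among all baseline and optimized-prompt runs
- gpt_oss_20b ranks second with a 60.9% pass rate, up 19.2 percentage points from the baseline run with parameter set 2 (from 41.7% to 60.9%)
- deepseek_v3_2 ranks third with a 59.0% pass rate, the same as with the optimized prompt and parameter set 1
- The combination of high temperature and optimized prompt works well: seven of the nine models beat their baseline pass rate under this configuration, while gemma3_12b and mistral_7b regress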
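Baseline vs. optimized prompt v1, parameter set 2 (changes are in percentage points):
| Model | Parameters | Compile rate (baseline) | Compile rate (optimized) | Compile change | Pass rate (baseline) | Pass rate (optimized) | Pass change |
|---|---|---|---|---|---|---|---|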
| glm_4_6 | API | 28.8% | 85.3% | +56.5% | 25.0% | 71.2% | +46.2% |
| gpt_oss_20b | 20B | 42.9% | 65.4% | +22.5% | 41.7% | 60.9% | +19.2% |
| deepseek_v3_2 | API | 59.0% | 75.0% | +16.0% | 48.7% | 59.0% | +10.3% |
| deepseek_r1_14b | 14B | 16.0% | 28.2% | +12.2% | 15.4% | 23.7% | +8.3% |
| llama3_2_3b | 3B | 17.9% | 27.6% | +9.7% | 4.5% | 7.1% | +2.6% |
| qwen3_8b | 8B | 10.3% | 20.5% | +10.2% | 10.3% | 19.2% | +8.9% |
| gemma3_12b | 12B | 67.3% | 55.8% | -11.5% | 37.2% | 28.8% | -8.4% |
| phi4_14b | 14B | 76.9% | 76.3% | -0.6% | 39.7% | 41.7% | +2.0% |
| mistral_7b | 7B | 17.3% | 3.2% | -14.1% | 5.1% | 0.6% | -4.5% |
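Optimized prompt v1, parameter set 1 vs. parameter set 2 (changes are in percentage points):
| Model | Parameters | Pass rate (optimized, parameter set 1) | Pass rate (optimized, parameter set 2) | Change |
|---|---|---|---|---|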
| glm_4_6 | API | 67.9% | 71.2% | +3.3% |
| gpt_oss_20b | 20B | 55.1% | 60.9% | +5.8% |
| deepseek_v3_2 | API | 59.0% | 59.0% | 0.0% |
| phi4_14b | 14B | 42.3% | 41.7% | -0.6% |
| gemma3_12b | 12B | 28.2% | 28.8% | +0.6% |
| deepseek_r1_14b | 14B | 25.0% | 23.7% | -1.3% |
| qwen3_8b | 8B | 19.2% | 19.2% | 0.0% |
| llama3_2_3b | 3B | 9.6% | 7.1% | -2.5% |
| mistral_7b | 7B | 0.0% | 0.6% | +0.6% |
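Test configuration (iterative strategy, parameter set 1):
- Temperature: 0.0 (greedy decoding, deterministic output)
- top-p: 0.01
- Shot setting: zero-shot (0-shot)
- Max tokens: 1024
- Iteration strategy: up to 3 compile-repair iterations plus up to 3 test-repair iterations
- Test environment: Windows + iverilog v12
- Python version: 3.13.9
- Hardware: Intel i7-12700KF + RTX 5060Ti 16GB
- Evaluation tools: iverilog + vvp + static filter (guards against simulations hanging in infinite loops)
Test results:
| Model | Parameters | Compiled | Tests passed | Total | Fixed by compile iterations | Fixed by test iterations | Avg. compile iterations |
|---|---|---|---|---|---|---|---|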
| glm_4_6 | API | 140(89.7%) | 118(75.6%) | 156 | +100 | +2 | 1.96 |
| deepseek_v3_2 | API | 136(87.2%) | 112(71.8%) | 156 | +45 | +2 | 1.56 |
| gpt_oss_20b | 20B | 104(66.7%) | 99(63.5%) | 156 | +38 | +0 | 1.99 |
| phi4_14b | 14B | 136(87.2%) | 75(48.1%) | 156 | +6 | +8 | 1.31 |
| gemma3_12b | 12B | 120(76.9%) | 61(39.1%) | 156 | +13 | +4 | 1.58 |
| deepseek_r1_14b | 14B | 70(44.9%) | 43(27.6%) | 156 | +45 | +0 | 2.47 |
| qwen3_8b | 8B | 20(12.8%) | 20(12.8%) | 156 | +4 | +0 | 2.79 |
| llama3_2_3b | 3B | 56(35.9%) | 15(9.6%) | 156 | +13 | +1 | 2.42 |
| mistral_7b | 7B | 37(23.7%) | 8(5.1%) | 156 | +6 | +1 | 2.58 |
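Key findings:
- glm_4_6 performs best with a 75.6% test pass rate; compile iterations fixed 100 of its problems
- deepseek_v3_2 ranks second with a 71.8% pass rate; compile iterations fixed 45 of its problems
- Iteration is very effective at repairing compile failures, fixing about 30 problems per model on average (270 in total across the nine models)
- Test iterations help much less: phi4_14b gains the most (8 problems), gemma3_12b gains 4, and the remaining models gain 0 to 2
- deepseek_r1_14b also benefits substantially: compile iterations fixed 45 problems, at an average of 2.47 iterations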
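
The iteration budget described in the configuration (up to 3 compile-repair rounds plus up to 3 test-repair rounds) can be sketched as a bounded feedback loop. This is an illustrative outline only: the callables `generate`, `compile_check` and `run_tests`, the feedback wording, and the result fields are all assumptions, not the project's actual implementation.

```python
def _feedback(prompt, code, label, log):
    """Build a repair prompt from the previous attempt and the tool output (hypothetical format)."""
    return f"{prompt}\n\nPrevious attempt:\n{code}\n\n{label}:\n{log}\n\nFix the code."


def _result(status, code, compile_fixes, test_fixes):
    return {"status": status, "code": code,
            "compile_fixes": compile_fixes, "test_fixes": test_fixes}


def iterative_repair(generate, compile_check, run_tests, prompt,
                     max_compile_fixes=3, max_test_fixes=3):
    """Generate code, then repair it with bounded compile and test feedback rounds."""
    code = generate(prompt)
    compile_fixes = test_fixes = 0
    while True:
        compiled, log = compile_check(code)
        if not compiled:
            if compile_fixes >= max_compile_fixes:
                return _result("compile_failed", code, compile_fixes, test_fixes)
            compile_fixes += 1
            code = generate(_feedback(prompt, code, "Compiler output", log))
            continue  # re-check compilation before running any tests
        passed, log = run_tests(code)
        if passed:
            return _result("passed", code, compile_fixes, test_fixes)
        if test_fixes >= max_test_fixes:
            return _result("test_failed", code, compile_fixes, test_fixes)
        test_fixes += 1
        code = generate(_feedback(prompt, code, "Simulation output", log))
```

Every pass through the loop either returns or consumes one unit of a budget, so the loop ends after at most seven generations. A test-repair attempt that breaks compilation falls back into the compile-repair branch as long as that budget is not exhausted.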
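Test configuration (iterative strategy, parameter set 2):
- Temperature: 0.8 (more diverse output)
- top-p: 0.95 (looser sampling)
- Shot setting: zero-shot (0-shot)
- Max tokens: 1024
- Iteration strategy: up to 3 compile-repair iterations plus up to 3 test-repair iterations
- Test environment: Windows + iverilog v12
- Python version: 3.13.9
- Hardware: Intel i7-12700KF + RTX 5060Ti 16GB
- Evaluation tools: iverilog + vvp + static filter (guards against simulations hanging in infinite loops)
Test results:
| Model | Parameters | Compiled | Tests passed | Total | Fixed by compile iterations | Fixed by test iterations | Avg. compile iterations |
|---|---|---|---|---|---|---|---|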
| glm_4_6 | API | 139(89.1%) | 120(76.9%) | 156 | +99 | +0 | 1.99 |
| deepseek_v3_2 | API | 141(90.4%) | 118(75.6%) | 156 | +55 | +3 | 1.58 |
| gpt_oss_20b | 20B | 109(69.9%) | 100(64.1%) | 156 | +44 | +1 | 1.97 |
| phi4_14b | 14B | 138(88.5%) | 78(50.0%) | 156 | +23 | +7 | 1.40 |
| gemma3_12b | 12B | 115(73.7%) | 63(40.4%) | 156 | +10 | +1 | 1.61 |
| deepseek_r1_14b | 14B | 53(34.0%) | 36(23.1%) | 156 | +32 | +0 | 2.60 |
| qwen3_8b | 8B | 25(16.0%) | 24(15.4%) | 156 | +6 | +0 | 2.74 |
| llama3_2_3b | 3B | 30(19.2%) | 9(5.8%) | 156 | +12 | +0 | 2.72 |
| mistral_7b | 7B | 20(12.8%) | 6(3.8%) | 156 | +5 | +0 | 2.78 |
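Key findings:
- glm_4_6 performs best with a 76.9% test pass rate, the highest result across all strategies and parameter sets
- deepseek_v3_2 also performs very well, with a 75.6% pass rate and a compile success rate of 90.4%
- The iterative strategy does marginally better at the higher temperature: the average pass rate of 39.5% is slightly above the 39.2% of parameter set 1
- Compile iterations fixed 286 problems in total, while test iterations fixed only 12, so the gains come mainly from the compile stage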
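
The averages and totals quoted in the findings can be reproduced from the two iterative result tables. The sketch below copies the pass rates and the parameter-set-2 fix counts from those tables; the variable names are hypothetical.

```python
from statistics import mean

# Pass rates (%) from the iterative results, in table order (glm_4_6 ... mistral_7b).
PASS_RATE_SET1 = [75.6, 71.8, 63.5, 48.1, 39.1, 27.6, 12.8, 9.6, 5.1]
PASS_RATE_SET2 = [76.9, 75.6, 64.1, 50.0, 40.4, 23.1, 15.4, 5.8, 3.8]

# Problems fixed by compile and test iterations with parameter set 2.
COMPILE_FIXED_SET2 = [99, 55, 44, 23, 10, 32, 6, 12, 5]
TEST_FIXED_SET2 = [0, 3, 1, 7, 1, 0, 0, 0, 0]

avg_set1 = round(mean(PASS_RATE_SET1), 1)
avg_set2 = round(mean(PASS_RATE_SET2), 1)
print(avg_set1, avg_set2)
print(sum(COMPILE_FIXED_SET2), sum(TEST_FIXED_SET2))
```

These are unweighted means over nine models. The 0.3-point gap between the two parameter sets is smaller than the weight of a single problem (about 0.64 points), so it should be read as a tie rather than a clear advantage.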
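Baseline vs. iterative strategy, parameter set 1 (changes are in percentage points):
| Model | Parameters | Compile rate (Baseline) | Compile rate (Iterative) | Compile change | Pass rate (Baseline) | Pass rate (Iterative) | Pass change |
|---|---|---|---|---|---|---|---|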
| glm_4_6 | API | 26.3% | 89.7% | +63.4% | 25.0% | 75.6% | +50.6% |
| deepseek_v3_2 | API | 59.0% | 87.2% | +28.2% | 46.2% | 71.8% | +25.6% |
| deepseek_r1_14b | 14B | 12.8% | 44.9% | +32.1% | 9.6% | 27.6% | +18.0% |
| phi4_14b | 14B | 68.6% | 87.2% | +18.6% | 30.1% | 48.1% | +18.0% |
| gemma3_12b | 12B | 53.8% | 76.9% | +23.1% | 28.2% | 39.1% | +10.9% |
| gpt_oss_20b | 20B | 59.6% | 66.7% | +7.1% | 56.4% | 63.5% | +7.1% |
| llama3_2_3b | 3B | 30.8% | 35.9% | +5.1% | 9.6% | 9.6% | 0.0% |
| qwen3_8b | 8B | 17.9% | 12.8% | -5.1% | 17.3% | 12.8% | -4.5% |
| mistral_7b | 7B | 21.8% | 23.7% | +1.9% | 9.0% | 5.1% | -3.9% |
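Baseline vs. iterative strategy, parameter set 2 (changes are in percentage points):
| Model | Parameters | Compile rate (Baseline) | Compile rate (Iterative) | Compile change | Pass rate (Baseline) | Pass rate (Iterative) | Pass change |
|---|---|---|---|---|---|---|---|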
| deepseek_v3_2 | API | 59.0% | 90.4% | +31.4% | 48.7% | 75.6% | +26.9% |
| glm_4_6 | API | 28.8% | 89.1% | +60.3% | 25.0% | 76.9% | +51.9% |
| gpt_oss_20b | 20B | 42.9% | 69.9% | +27.0% | 41.7% | 64.1% | +22.4% |
| phi4_14b | 14B | 76.9% | 88.5% | +11.6% | 39.7% | 50.0% | +10.3% |
| deepseek_r1_14b | 14B | 16.0% | 34.0% | +18.0% | 15.4% | 23.1% | +7.7% |
| gemma3_12b | 12B | 67.3% | 73.7% | +6.4% | 37.2% | 40.4% | +3.2% |
| qwen3_8b | 8B | 10.3% | 16.0% | +5.7% | 10.3% | 15.4% | +5.1% |
| llama3_2_3b | 3B | 17.9% | 19.2% | +1.3% | 4.5% | 5.8% | +1.3% |
| mistral_7b | 7B | 17.3% | 12.8% | -4.5% | 5.1% | 3.8% | -1.3% |
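Optimized prompt v1 vs. iterative strategy, parameter set 1 (changes are in percentage points):
| Model | Parameters | Pass rate (optimized v1) | Pass rate (Iterative) | Pass change |
|---|---|---|---|---|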
| deepseek_v3_2 | API | 59.0% | 71.8% | +12.8% |
| gpt_oss_20b | 20B | 55.1% | 63.5% | +8.4% |
| glm_4_6 | API | 67.9% | 75.6% | +7.7% |
| phi4_14b | 14B | 42.3% | 48.1% | +5.8% |
| gemma3_12b | 12B | 28.2% | 39.1% | +10.9% |
| deepseek_r1_14b | 14B | 25.0% | 27.6% | +2.6% |
| qwen3_8b | 8B | 19.2% | 12.8% | -6.4% |
| llama3_2_3b | 3B | 9.6% | 9.6% | 0.0% |
| mistral_7b | 7B | 0.0% | 5.1% | +5.1% |
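Optimized prompt v1 vs. iterative strategy, parameter set 2 (changes are in percentage points):
| Model | Parameters | Pass rate (optimized v1) | Pass rate (Iterative) | Pass change |
|---|---|---|---|---|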
| deepseek_v3_2 | API | 59.0% | 75.6% | +16.6% |
| glm_4_6 | API | 71.2% | 76.9% | +5.7% |
| phi4_14b | 14B | 41.7% | 50.0% | +8.3% |
| gemma3_12b | 12B | 28.8% | 40.4% | +11.6% |
| gpt_oss_20b | 20B | 60.9% | 64.1% | +3.2% |
| mistral_7b | 7B | 0.6% | 3.8% | +3.2% |
| qwen3_8b | 8B | 19.2% | 15.4% | -3.8% |
| deepseek_r1_14b | 14B | 23.7% | 23.1% | -0.6% |
| llama3_2_3b | 3B | 7.1% | 5.8% | -1.3% |
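Iterative strategy, parameter set 1 vs. parameter set 2 (changes are in percentage points):
| Model | Parameters | Pass rate (iterative, parameter set 1) | Pass rate (iterative, parameter set 2) | Change |
|---|---|---|---|---|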
| glm_4_6 | API | 75.6% | 76.9% | +1.3% |
| deepseek_v3_2 | API | 71.8% | 75.6% | +3.8% |
| gpt_oss_20b | 20B | 63.5% | 64.1% | +0.6% |
| phi4_14b | 14B | 48.1% | 50.0% | +1.9% |
| gemma3_12b | 12B | 39.1% | 40.4% | +1.3% |
| deepseek_r1_14b | 14B | 27.6% | 23.1% | -4.5% |
| qwen3_8b | 8B | 12.8% | 15.4% | +2.6% |
| llama3_2_3b | 3B | 9.6% | 5.8% | -3.8% |
| mistral_7b | 7B | 5.1% | 3.8% | -1.3% |
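- The optimized prompt helps noticeably: most models improve with it, and glm_4_6 improves the most
- Higher temperature helps some models: with the optimized prompt, glm_4_6 and gpt_oss_20b do better at temperature 0.8
- Prompt sensitivity varies widely: different models respond very differently to the optimized prompt
- The iterative strategy works best: it achieves the highest pass rate of all strategies, with glm_4_6 reaching 76.9%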
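| Model | Best baseline | Best optimized v1 | Best iterative | Largest gain | Notes |
|---|---|---|---|---|---|
| glm_4_6 | 25.0% | 71.2% | 76.9% | +51.9% | Largest gain from iteration; compile iterations fixed 100 problems |
| deepseek_v3_2 | 48.7% | 59.0% | 75.6% | +26.9% | Large gain from iteration; compile success rate reaches 90.4% |
| gpt_oss_20b | 56.4% | 60.9% | 64.1% | +22.4% | Stable across strategies, improving steadily |
| phi4_14b | 39.7% | 42.3% | 50.0% | +18.0% | Clear gain from iteration; compile success rate reaches 88.5% |
| gemma3_12b | 37.2% | 28.8% | 40.4% | +11.6% | Iteration offsets the negative effect of the optimized prompt |
| deepseek_r1_14b | 15.4% | 25.0% | 27.6% | +18.0% | Iteration adds further gains; compile iterations fixed 45 problems |
| qwen3_8b | 17.3% | 19.2% | 15.4% | - | Limited benefit from iteration |
| llama3_2_3b | 9.6% | 9.6% | 9.6% | 0.0% | Constrained by small model size |
| mistral_7b | 9.0% | 0.6% | 5.1% | - | Iteration partially recovers performance |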
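- The iterative strategy is the most effective overall:
  - The average pass rate is 39.2% with parameter set 1 and 39.5% with parameter set 2
  - Compile iterations fixed 556 problems in total across the two parameter sets
- Compile iterations are highly effective:
  - glm_4_6 benefits the most: 100 problems fixed with parameter set 1 and 99 with parameter set 2
  - deepseek_v3_2 follows: 45 problems fixed with parameter set 1 (tied with deepseek_r1_14b) and 55 with parameter set 2
  - The average number of compile iterations ranges from about 1.3 to 2.8, with clear differences between models
- Iterative strategy vs. optimized prompt v1:
  - Most models do better with the iterative strategy than with optimized prompt v1
  - deepseek_v3_2 improves the most: from 59.0% to 75.6% with parameter set 2 (+16.6 percentage points)
  - gemma3_12b benefits clearly: iteration makes up for the negative effect of the optimized prompt
- Limitations of the iterative strategy:
  - qwen3_8b, llama3_2_3b and mistral_7b gain little from iteration
  - Small models struggle to learn from error feedback and correct their code
  - Test iterations help little; most of the gain comes from compile iterations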
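- glm_4_6 responds most strongly to the optimized prompt: its test pass rate rises from 25.0% at baseline to 71.2%, a gain of 46.2 percentage points
- Both deepseek models (v3_2 and r1_14b) improve markedly, which suggests that this family responds well to prompt optimization
- gpt_oss_20b does best among the single-shot runs with high temperature plus the optimized prompt, reaching a 60.9% pass rate
- mistral_7b is severely incompatible with the optimized prompt: its compile success rate collapses from 21.8% at baseline to 0.6%, which may indicate that it misinterprets the instructions. gemma3_12b also regresses, though far less severely (pass rate from 37.2% to 28.8% with parameter set 2)
- The small model llama3_2_3b gains little from prompt optimization; its ability is limited by model size
- Temperature affects models very differently:
  - The API models (glm_4_6, deepseek_v3_2) perform as well or better at the higher temperature under every strategy
  - Some open-weight models (phi4_14b, gemma3_12b) compile and pass more often at the higher temperature with the baseline prompt, but with the optimized prompt their pass rates change by no more than 0.6 points
- The iterative strategy is the most effective optimization:
  - glm_4_6 with the iterative strategy and high temperature achieves the highest pass rate, 76.9%
  - Iteration repairs compile failures very effectively but does little for test failures
  - API models and larger models benefit more from iteration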
VERILOG_PROMPT_V1 = """You are an expert Verilog/SystemVerilog code generator. Generate ONLY the module body code (the content between the module declaration and endmodule). Follow these CRITICAL SYNTAX RULES:
- Module Boundary
  - The module declaration is ALREADY provided in the prompt
  - You MUST end your code with `endmodule`
  - Do NOT redeclare the module header
- Continuous Assignment (for combinational logic on wire outputs)
  - ALWAYS use `assign` keyword for wire-type outputs
  - Correct: `assign out = a & b;`
  - Wrong: `out = a & b;` (missing assign)
- Output Port Types
  - If output is assigned in `always` block -> declare as `reg` or use `logic`
  - If output uses `assign` statement -> leave as `wire` (default)
  - Check the module interface: if output is already declared as `reg`, use always block
  - If output is NOT declared as `reg`, use `assign` for combinational logic
- Always Block Assignments
  - Sequential logic (with clk): use non-blocking `<=`
  - Combinational logic (always @* or always_comb): use blocking `=`
  - NEVER mix blocking and non-blocking in same always block
- Signal Declarations
  - Only use signals that are defined in module ports or declared internally
  - Do NOT use `clk` in combinational-only circuits
  - Check module ports before using any signal
- Common Patterns
  Combinational logic (wire output):
    assign out = a & b;
    endmodule
  Combinational logic (reg output, using always):
    always @(*) begin
      out = a & b;
    end
    endmodule
  Sequential logic (D flip-flop):
    always @(posedge clk) begin
      if (reset) q <= 1'b0;
      else q <= d;
    end
    endmodule
- Generate ONLY the module body (between declaration and endmodule)
- Must include `endmodule` at the end
- No explanations, no comments unless necessary for complex logic
- No markdown code blocks
Before outputting, verify:
[ ] Code ends with endmodule
[ ] All wire outputs use assign keyword
[ ] All reg outputs are assigned in always blocks
[ ] Only declared signals are used
[ ] Blocking/non-blocking assignments are correct
"""
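Each problem includes:
- Prob*_prompt.txt: problem description and module interface definition
- Prob*_test.sv: functional verification testbench
- Prob*_ref.sv: reference implementation (golden answer)
- Prob*_ifc.txt: module interface definition file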
Example problem: Prob001_zero (constant low output)
Problem description (Prob001_zero_prompt.txt):
Build a circuit that always outputs a LOW.
module TopModule (
  output zero
);
Reference implementation (Prob001_zero_ref.sv):
module RefModule (
  output zero
);
  assign zero = 1'b0;
endmodule
Test file (Prob001_zero_test.sv, abridged):
module tb();
  wire zero_ref;
  wire zero_dut;
  RefModule good1 (.zero(zero_ref));
  TopModule top_module1 (.zero(zero_dut));
  assign tb_match = ( { zero_ref } === ( { zero_ref } ^ { zero_dut } ^ { zero_ref } ) );
  always @(posedge clk, negedge clk) begin
    if (!tb_match) begin
      stats1.errors++;
    end
  end
endmodule
System message:
You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions.
User prompt:
// Implement the Verilog module based on the following description. Assume that signals are positive clock/clk triggered unless otherwise stated.
[problem description]
module TopModule (
[port definitions]
);
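
The system message and user template above could be assembled into a chat request along the following lines. This is a sketch: `build_messages` is a hypothetical helper, and the exact whitespace between the description and the module header is an assumption.

```python
SYSTEM_MESSAGE = (
    "You only complete chats with syntax correct Verilog code. "
    "End the Verilog module code completion with 'endmodule'. "
    "Do not include module, input and output definitions."
)
USER_PREFIX = (
    "// Implement the Verilog module based on the following description. "
    "Assume that signals are positive clock/clk triggered unless otherwise stated.\n"
)


def build_messages(description, module_header):
    """Return the two-message chat payload for one benchmark problem."""
    user = f"{USER_PREFIX}{description.strip()}\n\n{module_header.strip()}\n"
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "Build a circuit that always outputs a LOW.",
    "module TopModule (\n  output zero\n);",
)
```

The user message ends with the open module header, so the model only has to continue with the module body and a closing `endmodule`.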
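Complete API call example:
[
  {"role": "system", "content": "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions."},
  {"role": "user", "content": "// Implement the Verilog module based on the following description. Assume that signals are positive clock/clk triggered unless otherwise stated.\nBuild a circuit that always outputs a LOW.\n\nmodule TopModule (\n output zero\n);\n"}
]
Evaluation pipeline:
- Code generation: send the prompt to the model to generate Verilog code
- Code extraction: extract clean Verilog code from the model response
- Static check: detect code patterns that could hang the simulation
- Code assembly: combine the generated code with the module interface into a complete file
- Compilation: compile the generated code with iverilog
- Functional verification: run the simulation and compare against the reference implementation
- Result logging: record the compile status, test outcome and detailed logs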
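
The extraction, assembly and compilation steps listed above might look roughly like the sketch below. The helper names, the fence-stripping rule and the exact command line are assumptions; only `iverilog`, `vvp` and the `tb` testbench module name come from the report. The flags `-g2012`, `-s` and `-o` are standard Icarus Verilog options.

```python
import re

FENCE = "`" * 3  # markdown code fence marker
FENCE_RE = re.compile(re.escape(FENCE) + r"(?:\w+)?\n(.*?)" + re.escape(FENCE), re.DOTALL)


def extract_body(response):
    """Strip a markdown fence if present and drop any text after the last 'endmodule'."""
    match = FENCE_RE.search(response)
    code = match.group(1) if match else response
    end = code.rfind("endmodule")
    if end == -1:
        return code.strip() + "\n"  # left as-is; the compile step will report the problem
    return code[: end + len("endmodule")].strip() + "\n"


def assemble(module_header, body):
    """Join the provided module header with the generated module body."""
    return module_header.rstrip() + "\n" + body


def simulator_commands(design, testbench, reference, out="sim.vvp"):
    """Command lines for compiling with iverilog and running the result with vvp."""
    compile_cmd = ["iverilog", "-g2012", "-s", "tb", "-o", out, design, testbench, reference]
    return [compile_cmd, ["vvp", out]]
```

Returning the commands as argument lists rather than shell strings keeps file names with spaces safe on Windows and lets the caller decide how to run them and which timeout to apply.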
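Background: some models generate code containing combinational loops (for example assign out = (clk && ~out) ^ in;), which make the simulation spin forever. Examples are llama3.2:3b on problems 53 and 155. A timeout mechanism alone does not solve this.
Solution: integrate a static analyzer that detects and filters out dangerous code patterns.
Detected patterns:
- Combinational loops: a signal that references itself in an assign statement
- Complex feedback loops: circular dependencies formed by multiple signals
- Unsafe clock usage: clocking patterns that may cause race conditions
Processing flow:
- Run static analysis on the generated code before compilation
- If a dangerous pattern is detected, skip the simulation and mark the problem as "STATIC FAILED"
- Record the specific failure reason in the log file
- This keeps the evaluation framework running reliably, so that individual problems cannot interrupt the whole test run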
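
The first detection pattern (a signal that reads its own value inside an `assign`) can be approximated with a regular expression. This is a deliberately simple sketch, not the project's analyzer: it does not handle multi-signal feedback or clock misuse, and it can produce false positives for vectors where one bit is assigned from a different bit of the same signal.

```python
import re

# assign <target>[optional index] = <expression>;
ASSIGN_RE = re.compile(r"\bassign\s+(\w+)\s*(?:\[[^\]]*\])?\s*=\s*([^;]*);")
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def find_self_referencing_assigns(code):
    """Return the names of signals whose continuous assignment reads its own target."""
    code = COMMENT_RE.sub("", code)  # ignore commented-out code
    hits = []
    for target, expr in ASSIGN_RE.findall(code):
        if re.search(rf"\b{re.escape(target)}\b", expr):
            hits.append(target)
    return hits


if find_self_referencing_assigns("assign out = (clk && ~out) ^ in;"):
    print("STATIC FAILED: combinational loop")
```

A design flagged this way would be recorded as "STATIC FAILED" and never handed to the simulator, which is what protects the run from hanging.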
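After the evaluation finishes, results are saved in the following locations:
results/
├── corrected_test_results.txt   # summary report
├── corrected_test_results.csv   # data in CSV format
├── corrected_test_results.json  # detailed data in JSON
└── [model]_0shot_temp0.0/       # detailed results per model
    ├── Prob*_raw_response.txt   # raw model response
    ├── Prob*_extracted_code.txt # extracted Verilog code
    ├── Prob*.sv                 # complete test code file
    └── Prob*-sv-iv-test.log     # compile and test log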
This project is based on the official VerilogEval dataset. Related work:
- VerilogEval v1 (2023): "VerilogEval: Evaluating Large Language Models for Verilog Code Generation", ICCAD 2023
- VerilogEval v2 (2024): "Revisiting VerilogEval: Newer LLMs, In-Context Learning, and Specification-to-RTL Tasks", which adds specification-to-RTL tasks and in-context learning
Branches:
- main: improved VerilogEval V2 version
- release/1.0.0: original VerilogEval 1.0 benchmark
If you use this benchmark, please cite:
VerilogEval V2:
@misc{pinckney2024revisitingverilogevalnewerllms,
  title={Revisiting VerilogEval: Newer LLMs, In-Context Learning, and Specification-to-RTL Tasks},
  author={Nathaniel Pinckney and Christopher Batten and Mingjie Liu and Haoxing Ren and Brucek Khailany},
  year={2024},
  eprint={2408.11053},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2408.11053},
}
Original VerilogEval:
@inproceedings{liu2023verilogeval,
  title={{VerilogEval:} Evaluating Large Language Models for Verilog Code Generation},
  author={Liu, Mingjie and Pinckney, Nathaniel and Khailany, Brucek and Ren, Haoxing},
  booktitle={2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)},
  year={2023}
}
This document is a running record of VerilogEval benchmark results. Add later results to the corresponding section, organized by date and task type.