[Bug] With a custom dataset, the accuracy statistics are wrong even though the model outputs the correct answer

### Operating system and version

ubuntu VERSION="22.04.5 LTS (Jammy Jellyfish)"

### Python environment where the tool is installed

Python environment inside a Docker container

### Python version

3.11

### AISBench tool version

Latest version, downloaded on March 18

### AISBench command

ais_bench --work-dir /workspace/test --models vllm_api_general_chat --custom-dataset-path /home/dataset/customdataset/cus_test_mcq.jsonl --num-prompts 10 --dump-eval-details

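### Model configuration file or custom configuration file contents

Only the dataset was customized. It is the sample from the documentation: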
{"question": "165+833+650+615=", "A": "2258", "B": "2263", "C": "2281", "answer": "B"}
{"question": "368+959+918+653+978=", "A": "3876", "B": "3878", "C": "3880", "answer": "A"}
{"question": "776+208+589+882+571+996+515+726=", "A": "5213", "B": "5263", "C": "5383", "answer": "B"}
{"question": "803+862+815+100+409+758+262+169=", "A": "4098", "B": "4128", "C": "4178", "answer": "C"}

### Expected behavior

The accuracy test should report 100%.

### Actual behavior

The reported accuracy is only 25%, and the extracted answer is wrong even though the model's prediction is correct:
{
            "example_abbr": "cus_test_mcq_test_0",
            "pred": [
                "To solve the problem:\n\n$$\n165 + 833 + 650 + 615\n$$\n\nwe will add the numbers step by step to ensure accuracy.\n\n---\n\n### Step 1: Add 165 and 833\n\n$$\n165 + 833 = 998\n$$\n\n---\n\n### Step 2: Add 650 to the result\n\n$$\n998 + 650 = 1648\n$$\n\n---\n\n### Step 3: Add 615 to the new total\n\n$$\n1648 + 615 = 2263\n$$\n\n---\n\n### Verification via Column Addition\n\nLet’s verify the final addition using column-wise addition:\n\n```\n   1648\n+   615\n--------\n   2263\n```\n\n- **Units place**: 8 + 5 = 13 → write 3, carry 1  \n- **Tens place**: 4 + 1 + 1 = 6  \n- **Hundreds place**: 6 + 6 = 12 → write 2, carry 1  \n- **Thousands place**: 1 + 1 = 2  \n\nResult: **2263**\n\n---\n\n### Final Answer\n\n$$\n\\boxed{B}\n$$"
            ],
            "parsed": [
                "A"
            ],
            "refr": [
                "B"
            ],
            "correct": [
                false
            ]
        }
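One possible explanation, not confirmed against the AISBench source, is that the answer extractor falls back to the first capital option letter it finds in the response. In this output that would be the `A` in "Add", which matches the `"parsed": ["A"]` above. The sketch below contrasts that assumed behavior with an extractor that prefers the final `\boxed{...}` answer. Both function names are hypothetical and neither is AISBench's actual API.

```python
import re

OPTIONS = "ABC"  # option letters used by the custom dataset


def naive_first_letter(text: str) -> str:
    """Hypothetical fallback: return the first capital option letter anywhere in the text."""
    match = re.search(f"[{OPTIONS}]", text)
    return match.group(0) if match else ""


def boxed_first(text: str) -> str:
    """Hypothetical alternative: prefer the last LaTeX boxed option, else use the fallback."""
    hits = re.findall(r"\\boxed\{\s*([" + OPTIONS + r"])\s*\}", text)
    return hits[-1] if hits else naive_first_letter(text)


# Shortened version of the model output shown above (same structure, same final answer).
pred = (
    "To solve the problem:\n\n### Step 1: Add 165 and 833\n\n"
    "Result: **2263**\n\n### Final Answer\n\n$$\n\\boxed{B}\n$$"
)

print(naive_first_letter(pred))  # A  (picked up from the word "Add")
print(boxed_first(pred))         # B  (taken from the boxed final answer)
```

If the real extractor behaves like the naive version, any chain-of-thought response containing a capital A, B, or C before the final answer would be mis-scored in the same way.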

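### Pre-submission checklist

- [x] I have read the Quick Start in the main documentation and it did not resolve the problem
- [x] I have searched the FAQ and found no duplicate
- [x] I have searched existing issues and found no duplicate
- [x] I have updated to the latest version and the problem persists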
