An automated evaluation framework for Large Language Models based on Fuzzy Analytic Hierarchy Process (Fuzzy AHP), supporting multi-dimensional criteria generation, weight calculation, and response quality assessment.
- Dual-Method Evaluation: Supports both traditional Crisp AHP and Fuzzy AHP methods
- Automatic Criteria Generation: LLM automatically generates evaluation criteria and calculates weights
- Consistency Check: Built-in CR (Consistency Ratio) verification and automatic matrix repair mechanism
- Multi-Platform Support: Supports OpenAI, Gemini, Dashscope, Ollama, and other LLM platforms
- Incremental Evaluation: Supports resuming evaluation from checkpoints, automatically skips evaluated samples
- Two Evaluation Modes: per_sample (generate criteria for each sample) and category (predefined criteria by category)
- Flexible Scoring Scales: Supports multiple scoring scales (1-5, 1-10) and AHP scales (1-5, 1-9)
- Intelligent Caching: Built-in LLM response caching to avoid redundant API calls
- Adaptive Weighting: Adaptive combination of AHP scores and direct scores based on CR values
- Result Analysis: Comprehensive result collection and statistical analysis tools
```
DualJudge/
├── ahp_processor.py             # AHP core processor (weight calculation, CR check, matrix repair)
├── fahp_judge.py                # Main evaluation system (criteria generation, model comparison, dual-method aggregation)
├── generate_criteria_config.py  # Category criteria configuration generator
├── llm_api_server.py            # Unified LLM API interface (OpenAI, Gemini, Dashscope, Ollama)
├── prompts.py                   # Prompt templates and scoring configurations
├── utils.py                     # Utility functions (JSON encoder, etc.)
├── res_collection.py            # Result collection and analysis tools
├── api_config.json              # API configuration file
├── experiments/                 # Experiment results directory
└── README.md
```
```bash
pip install numpy openai dashscope google-genai requests datasets tqdm
```

Create an `api_config.json` file:

```json
{
  "default": {
    "type": "openai",
    "api_key": "your-api-key",
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-4"
  }
}
```

Supported platform types:
- `openai`: OpenAI API
- `gemini`: Google Gemini API
- `dashscope`: Alibaba Cloud Dashscope API
- `ollama`: Local Ollama service
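Multiple named entries can coexist in one `api_config.json`, selected via `--llm-name`. For example, a hypothetical second entry pointing at a local Ollama instance (the field names for the Ollama entry are assumptions; check `REQUIRED_KEYS` in `llm_api_server.py` for what each platform actually requires):

```json
{
  "default": {
    "type": "openai",
    "api_key": "your-api-key",
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-4"
  },
  "local": {
    "type": "ollama",
    "base_url": "http://localhost:11434",
    "model": "llama3"
  }
}
```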
If using category mode, generate criteria and weights for each category first:
```bash
python generate_criteria_config.py \
    --config api_config.json \
    --llm-name default \
    --split claude \
    --samples-per-category 15 \
    --method sequential \
    --project gpt_oss20b_category
```

Parameters:
- `--config`: Path to the API configuration file
- `--llm-name`: LLM name in the configuration file
- `--split`: Dataset split (`gpt` or `claude`)
- `--samples-per-category`: Number of samples per category
- `--method`: Weight calculation method (`batch` or `sequential`)
- `--project`: Project name; output will be saved in this folder
Category mode:

```bash
python fahp_judge.py \
    --config api_config.json \
    --llm-name default \
    --split claude \
    --mode category \
    --criteria-config gpt_oss20b_category/criteria_config.json \
    --project gpt_oss20b_category \
    --scoring_scale "1-10" \
    --comparison_scale "1-9"
```

Per-sample mode:

```bash
python fahp_judge.py \
    --config api_config.json \
    --llm-name default \
    --split claude \
    --mode per_sample \
    --method sequential \
    --project my_evaluation \
    --scoring_scale "1-10" \
    --comparison_scale "1-9"
```

Parameters:
- `--mode`: Evaluation mode (`category` or `per_sample`)
- `--criteria-config`: Criteria configuration file (required in `category` mode)
- `--method`: Criteria comparison method (`batch` or `sequential`; `per_sample` mode only)
- `--max-samples`: Maximum number of samples to evaluate (optional)
```bash
python simple_judge.py \
    --config api_config.json \
    --llm-name default \
    --split claude \
    --output simple_records.jsonl
```

`FuzzyNumber` Class: Triangular fuzzy number representation
- Supports fuzzy number operations and defuzzification
- Provides conversion from AHP scale (1-9) to fuzzy numbers
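A triangular fuzzy number can be sketched roughly as follows. This is a minimal illustration, not the actual `FuzzyNumber` implementation in `ahp_processor.py`; the class name `TFN`, the `spread` parameter, and centroid defuzzification are assumptions:

```python
from dataclasses import dataclass

# Hypothetical sketch of a triangular fuzzy number (l, m, u);
# the actual FuzzyNumber class in ahp_processor.py may differ.
@dataclass
class TFN:
    l: float  # lower bound
    m: float  # modal (most likely) value
    u: float  # upper bound

    def defuzzify(self) -> float:
        """Centroid defuzzification: the mean of (l, m, u)."""
        return (self.l + self.m + self.u) / 3.0

    def reciprocal(self) -> "TFN":
        """Reciprocal TFN, used for the lower triangle of a fuzzy matrix."""
        return TFN(1.0 / self.u, 1.0 / self.m, 1.0 / self.l)

def from_ahp_scale(score: int, spread: int = 1) -> TFN:
    """Widen a crisp 1-9 AHP judgment into a TFN by +/- spread, clipped to [1, 9]."""
    return TFN(max(1, score - spread), score, min(9, score + spread))
```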
`AHPProcessor` Class: Core AHP calculation engine
- `build_matrix()`: Build the judgment matrix
- `build_fuzzy_matrix()`: Build the fuzzy judgment matrix
- `calculate_weights()`: Calculate eigenvector weights
- `calculate_fuzzy_weights()`: Calculate fuzzy weights
- `check_consistency()`: CR consistency check
- `repair_matrix()`: Automatically repair inconsistent matrices
- `compute()`: Calculate Crisp and Fuzzy weights simultaneously
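As an illustration of what eigenvector weight calculation and the CR check compute, here is a minimal sketch of the standard method with NumPy. The function names and details are assumptions for illustration, not the actual `AHPProcessor` code:

```python
import numpy as np

# Saaty's random consistency index, indexed by matrix size n.
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}

def eigenvector_weights(matrix: np.ndarray) -> np.ndarray:
    """Principal-eigenvector weights of a pairwise comparison matrix."""
    vals, vecs = np.linalg.eig(matrix)
    principal = np.argmax(vals.real)
    w = np.abs(vecs[:, principal].real)
    return w / w.sum()

def consistency_ratio(matrix: np.ndarray) -> float:
    """CR = CI / RI, where CI = (lambda_max - n) / (n - 1)."""
    n = matrix.shape[0]
    if n <= 2:
        return 0.0  # 1x1 and 2x2 matrices are always consistent
    lambda_max = np.max(np.linalg.eigvals(matrix).real)
    ci = (lambda_max - n) / (n - 1)
    return ci / RI[n]
```

For a perfectly consistent matrix (every entry equals `w_i / w_j`), `lambda_max == n` and the CR is 0; inconsistency pushes `lambda_max` above `n`.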
`UnifiedLLMJudge` Class: Unified LLM interface
- `generate_criteria()`: Automatically generate evaluation criteria
- `compute_criteria_weights()`: Calculate criteria weights (batch mode)
- `compute_criteria_weights_sequential()`: Calculate criteria weights (sequential mode)
- `compare_models_all_criteria()`: Compare two model responses across all criteria
- `direct_score_responses()`: Direct scoring method
`DualAggregator` Class: Dual-method aggregator
- `aggregate_crisp()`: Crisp AHP aggregation
- `aggregate_fuzzy()`: Fuzzy AHP aggregation
CategoryBasedEvaluator Class: Category-based evaluator
- Uses predefined criteria and weights for each category
- Supports incremental evaluation and result saving
AutoJudgeBenchEvaluator Class: Fully automatic evaluator
- Independently generates criteria and weights for each sample
- Supports regenerating criteria when CR check fails
Analysis Functions:
- `load_category_mapping()`: Load dataset category information
- `compute_adaptive_weight()`: Compute adaptive weights based on CR values
- `combine_adaptive()`: Adaptively combine AHP and direct scores
- `analyze_results()`: Statistical analysis of evaluation results
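The adaptive combination can be illustrated with a simple linear-decay sketch: trust in the AHP score is highest when CR = 0 and falls to zero at the CR threshold. The exact formula used by the framework's `compute_adaptive_weight()` is not documented here, so this is an assumed example, not the actual rule:

```python
# Illustrative sketch of CR-based adaptive weighting; the exact formula in
# fahp_judge.py may differ.
CR_THRESHOLD = 0.15  # matches the consistency threshold used by the framework

def compute_adaptive_weight(cr: float, threshold: float = CR_THRESHOLD) -> float:
    """Trust in the AHP score: 1.0 for a perfectly consistent matrix,
    decaying linearly to 0.0 as CR reaches the threshold."""
    return max(0.0, 1.0 - cr / threshold)

def combine_adaptive(ahp_score: float, direct_score: float, cr: float) -> float:
    """Blend the AHP and direct scores according to matrix consistency."""
    w = compute_adaptive_weight(cr)
    return w * ahp_score + (1.0 - w) * direct_score
```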
Category mode:

```
1. Generate general criteria for each category (generate_criteria_config.py)
       ↓
2. Evaluate using these criteria (fahp_judge.py --mode category)
       ↓
3. LLM compares responses across multiple criteria
       ↓
4. Crisp/Fuzzy dual methods calculate comprehensive scores
```

Per-sample mode:

```
1. LLM generates evaluation criteria for the sample
       ↓
2. LLM compares criteria importance (supports sequential comparison)
       ↓
3. Calculate criteria weights (CR check + automatic repair)
       ↓
4. LLM compares model responses across all criteria
       ↓
5. Crisp/Fuzzy dual methods calculate comprehensive scores
```
Evaluation results are saved in the project folder:
- `records.jsonl`: Detailed evaluation records for each sample
- `results.json`: Summary statistics
- `cache_eval.json`: LLM call cache (avoids duplicate calls)
- Crisp AHP Accuracy: Traditional method prediction accuracy
- Fuzzy AHP Accuracy: Fuzzy method prediction accuracy
- Improvement Rate: Fuzzy accuracy improvement over Crisp
- Consistency: Consistency ratio between the two methods' predictions
- Per-Category Statistics: Accuracy performance by category
Importance comparison uses a 1-9 scale:
| Score | Meaning |
|---|---|
| 1 | The second criterion is extremely more important |
| 3 | The second criterion is moderately more important |
| 5 | Both criteria are equally important |
| 7 | The first criterion is moderately more important |
| 9 | The first criterion is extremely more important |
| 2,4,6,8 | Intermediate values |
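Since 5 denotes equality on this scale, a natural way to turn a score into a reciprocal pairwise matrix entry is to map the four steps on either side of 5 onto mutually reciprocal ratios. The mapping below is a hypothetical sketch, not necessarily the one `ahp_processor.py` implements:

```python
# Hypothetical mapping from the bipolar 1-9 comparison score (5 = equal)
# to a pairwise ratio a_ij; the framework's actual mapping may differ.
def score_to_ratio(score: int) -> float:
    """5 maps to 1 (equal); scores above 5 favour the first criterion,
    scores below 5 favour the second, reciprocally."""
    if score >= 5:
        return float(score - 4)   # 5 -> 1, 6 -> 2, ..., 9 -> 5
    return 1.0 / (6 - score)      # 4 -> 1/2, ..., 1 -> 1/5
```

With this convention, symmetric scores give reciprocal entries, e.g. a score of 7 and a score of 3 yield ratios 3 and 1/3.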
The system automatically performs CR (Consistency Ratio) verification:
- CR ≤ 0.15: Passes consistency check
- CR > 0.15: Automatically repairs matrix or regenerates criteria
Repair strategy:
- Identify inconsistent triplets
- Adjust contradictory judgment values
- Maximum 5 repair attempts
- If still failing, suggests regenerating criteria
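One repair pass of this kind can be sketched as follows: estimate weights from the current matrix, find the entry that deviates most from the ratio those weights imply, and replace it (and its reciprocal) with that ratio. This is an illustrative strategy only; the actual `repair_matrix()` may differ:

```python
import numpy as np

# Illustrative single repair pass; not the actual repair_matrix() logic.
def repair_step(matrix: np.ndarray) -> np.ndarray:
    """Replace the entry farthest (in log space) from w_i / w_j with that ratio."""
    vals, vecs = np.linalg.eig(matrix)
    w = np.abs(vecs[:, np.argmax(vals.real)].real)
    w = w / w.sum()
    implied = np.outer(w, 1.0 / w)                    # fully consistent matrix
    error = np.abs(np.log(matrix) - np.log(implied))  # deviation per entry
    i, j = np.unravel_index(np.argmax(error), error.shape)
    repaired = matrix.copy()
    repaired[i, j] = implied[i, j]
    repaired[j, i] = 1.0 / repaired[i, j]             # keep reciprocity
    return repaired
```

Repeating such a pass (up to the 5-attempt cap) drives `lambda_max` toward `n`, lowering the CR.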
Uses the JudgeBench dataset:

```python
from datasets import load_dataset

dataset = load_dataset("ScalerLab/JudgeBench", split="claude")
```

The system supports multiple scoring and AHP scale configurations:
Scoring scales:
- 1-5: Suitable for quick evaluations with 5 levels
- 1-10: More granular scoring with 10 levels

Comparison (AHP) scales:
- 1-5: Simpler comparisons with 5 levels
- 1-9: Traditional AHP scale with 9 levels
Configure via --scoring_scale and --comparison_scale parameters.
In `llm_api_server.py`:
- Add platform configuration validation rules to `REQUIRED_KEYS`
- Implement the call function
- Register it in `API_HANDLERS` and `CLIENT_BUILDERS`
Inherit from the `CategoryBasedEvaluator` or `AutoJudgeBenchEvaluator` class and override:
- `evaluate_sample()`: Single-sample evaluation logic
- `get_criteria_for_source()`: Criteria retrieval logic
MIT License
```bibtex
@misc{he2026structuredmulticriteriaevaluationlarge,
  title={Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge},
  author={Yulong He and Ivan Smirnov and Dmitry Fedrushkov and Sergey Kovalchuk and Ilya Revin},
  year={2026},
  eprint={2604.03742},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2604.03742},
}
```