LLM Automatic Evaluation System (Based on Fuzzy AHP)

An automated evaluation framework for Large Language Models based on Fuzzy Analytic Hierarchy Process (Fuzzy AHP), supporting multi-dimensional criteria generation, weight calculation, and response quality assessment.

Features

  • Dual-Method Evaluation: Supports both traditional Crisp AHP and Fuzzy AHP methods
  • Automatic Criteria Generation: LLM automatically generates evaluation criteria and calculates weights
  • Consistency Check: Built-in CR (Consistency Ratio) verification and automatic matrix repair mechanism
  • Multi-Platform Support: Supports OpenAI, Gemini, Dashscope, Ollama, and other LLM platforms
  • Incremental Evaluation: Supports resuming evaluation from checkpoints, automatically skips evaluated samples
  • Two Evaluation Modes: per_sample (generate criteria for each sample) and category (predefined criteria by category)
  • Flexible Scoring Scales: Supports multiple scoring scales (1-5, 1-10) and AHP scales (1-5, 1-9)
  • Intelligent Caching: Built-in LLM response caching to avoid redundant API calls
  • Adaptive Weighting: Combines AHP scores and direct scores, with the blend controlled by the CR value
  • Result Analysis: Comprehensive result collection and statistical analysis tools

Project Structure

DualJudge/
├── ahp_processor.py           # AHP core processor (weight calculation, CR check, matrix repair)
├── fahp_judge.py              # Main evaluation system (criteria generation, model comparison, dual-method aggregation)
├── generate_criteria_config.py # Category criteria configuration generator
├── llm_api_server.py          # Unified LLM API interface (OpenAI, Gemini, Dashscope, Ollama)
├── prompts.py                 # Prompt templates and scoring configurations
├── utils.py                   # Utility functions (JSON encoder, etc.)
├── res_collection.py          # Result collection and analysis tools
├── simple_judge.py            # Simple baseline evaluation script
├── api_config.json            # API configuration file
├── experiments/               # Experiment results directory
└── README.md

Installation

pip install numpy openai dashscope google-genai requests datasets tqdm

Quick Start

1. Configure API

Create an api_config.json file:

{
  "default": {
    "type": "openai",
    "api_key": "your-api-key",
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-4"
  }
}

Supported platform types:

  • openai: OpenAI API
  • gemini: Google Gemini API
  • dashscope: Alibaba Cloud Dashscope API
  • ollama: Local Ollama service
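
Multiple entries can coexist in the same file and are selected with --llm-name. A hypothetical two-entry configuration is shown below; the entry name local-llama and its values are illustrative, and which keys each platform actually requires is defined by REQUIRED_KEYS in llm_api_server.py:

{
  "default": {
    "type": "openai",
    "api_key": "your-api-key",
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-4"
  },
  "local-llama": {
    "type": "ollama",
    "base_url": "http://localhost:11434",
    "model": "llama3"
  }
}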

2. Generate Category Criteria Configuration (Optional)

If using category mode, generate criteria and weights for each category first:

python generate_criteria_config.py \
  --config api_config.json \
  --llm-name default \
  --split claude \
  --samples-per-category 15 \
  --method sequential \
  --project gpt_oss20b_category

Parameters:

  • --config: Path to API configuration file
  • --llm-name: LLM name in the configuration file
  • --split: Dataset split (gpt or claude)
  • --samples-per-category: Number of samples per category
  • --method: Weight calculation method (batch or sequential)
  • --project: Project name, output will be saved in this folder

3. Run Evaluation

Category Mode (Recommended; more efficient because criteria are reused per category)

python fahp_judge.py \
  --config api_config.json \
  --llm-name default \
  --split claude \
  --mode category \
  --criteria-config gpt_oss20b_category/criteria_config.json \
  --project gpt_oss20b_category \
  --scoring_scale "1-10" \
  --comparison_scale "1-9"

Per-sample Mode (Generate Criteria Independently for Each Sample)

python fahp_judge.py \
  --config api_config.json \
  --llm-name default \
  --split claude \
  --mode per_sample \
  --method sequential \
  --project my_evaluation \
  --scoring_scale "1-10" \
  --comparison_scale "1-9"

Parameters:

  • --mode: Evaluation mode (category or per_sample)
  • --criteria-config: Criteria configuration file (required for category mode)
  • --method: Criteria comparison method (batch or sequential in per_sample mode)
  • --max-samples: Maximum number of samples to evaluate (optional)

4. Simple Baseline Evaluation

python simple_judge.py \
  --config api_config.json \
  --llm-name default \
  --split claude \
  --output simple_records.jsonl

Core Modules

AHP Processor (ahp_processor.py)

FuzzyNumber Class: Triangular fuzzy number representation

  • Supports fuzzy number operations and defuzzification
  • Provides conversion from AHP scale (1-9) to fuzzy numbers

AHPProcessor Class: Core AHP calculation engine

  • build_matrix(): Build judgment matrix
  • build_fuzzy_matrix(): Build fuzzy judgment matrix
  • calculate_weights(): Calculate eigenvector weights
  • calculate_fuzzy_weights(): Calculate fuzzy weights
  • check_consistency(): CR consistency check
  • repair_matrix(): Automatically repair inconsistent matrices
  • compute(): Simultaneously calculate Crisp and Fuzzy weights
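
As a mental model, the fuzzy side of this engine can be sketched in a few lines of Python. This is an illustrative sketch only: the class below is not the repository's FuzzyNumber, and the +/-1 spread and centroid defuzzification are assumptions about one common Fuzzy AHP convention.

from dataclasses import dataclass

@dataclass
class TriangularFuzzyNumber:
    """Triangular fuzzy number (l, m, u) with l <= m <= u."""
    l: float
    m: float
    u: float

    def defuzzify(self) -> float:
        # Centroid defuzzification: collapse (l, m, u) into one crisp value.
        return (self.l + self.m + self.u) / 3.0

def fuzzify_saaty(value: float, spread: float = 1.0) -> TriangularFuzzyNumber:
    # Map a crisp 1-9 AHP judgment to a triangular fuzzy number by
    # spreading it +/- `spread`, clipped to the scale bounds [1, 9].
    return TriangularFuzzyNumber(max(1.0, value - spread),
                                 value,
                                 min(9.0, value + spread))

print(fuzzify_saaty(3).defuzzify())  # a crisp 3 becomes (2, 3, 4) -> 3.0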

Evaluation System (fahp_judge.py)

UnifiedLLMJudge Class: Unified LLM interface

  • generate_criteria(): Automatically generate evaluation criteria
  • compute_criteria_weights(): Calculate criteria weights (batch mode)
  • compute_criteria_weights_sequential(): Calculate criteria weights (sequential mode)
  • compare_models_all_criteria(): Compare two model responses across all criteria
  • direct_score_responses(): Direct scoring method

DualAggregator Class: Dual-method aggregator

  • aggregate_crisp(): Crisp AHP aggregation
  • aggregate_fuzzy(): Fuzzy AHP aggregation
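
Whatever the exact implementation, both aggregations reduce to a weighted sum over per-criterion scores, carried out in crisp or fuzzy arithmetic. A minimal sketch, reusing the TriangularFuzzyNumber above (the function signatures are assumptions, not the repository's API):

import numpy as np

def aggregate_crisp(weights, scores):
    # Crisp AHP: plain weighted sum of per-criterion scores.
    return float(np.dot(weights, scores))

def aggregate_fuzzy(fuzzy_weights, scores):
    # Fuzzy AHP: scale each score by a triangular fuzzy weight (l, m, u),
    # add component-wise, then defuzzify the total by its centroid.
    l = sum(w.l * s for w, s in zip(fuzzy_weights, scores))
    m = sum(w.m * s for w, s in zip(fuzzy_weights, scores))
    u = sum(w.u * s for w, s in zip(fuzzy_weights, scores))
    return (l + m + u) / 3.0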

CategoryBasedEvaluator Class: Category-based evaluator

  • Uses predefined criteria and weights for each category
  • Supports incremental evaluation and result saving

AutoJudgeBenchEvaluator Class: Fully automatic evaluator

  • Independently generates criteria and weights for each sample
  • Supports regenerating criteria when CR check fails

Result Collection (res_collection.py)

Analysis Functions:

  • load_category_mapping(): Load dataset category information
  • compute_adaptive_weight(): Compute adaptive weights based on CR values
  • combine_adaptive(): Combine AHP and direct scores adaptively
  • analyze_results(): Statistical analysis of evaluation results
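
The adaptive combination mentioned under Features can be pictured as a CR-driven interpolation between the two scores. The linear mapping below is an illustrative assumption; the actual formula in res_collection.py may differ:

def compute_adaptive_weight(cr: float, cr_threshold: float = 0.15) -> float:
    # Lower CR means a more consistent judgment matrix, so the AHP score
    # gets more weight; at or beyond the threshold it gets none at all.
    return max(0.0, 1.0 - cr / cr_threshold)

def combine_adaptive(ahp_score: float, direct_score: float, cr: float) -> float:
    alpha = compute_adaptive_weight(cr)
    return alpha * ahp_score + (1.0 - alpha) * direct_score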

Evaluation Workflow

Category Mode Workflow

1. Generate general criteria for each category (generate_criteria_config.py)
   ↓
2. Evaluate using these criteria (fahp_judge.py --mode category)
   ↓
3. LLM compares responses across multiple criteria
   ↓
4. Crisp/Fuzzy dual methods calculate comprehensive scores

Per-sample Mode Workflow

1. LLM generates evaluation criteria for the sample
   ↓
2. LLM compares criteria importance (supports sequential comparison)
   ↓
3. Calculate criteria weights (CR check + automatic repair)
   ↓
4. LLM compares model responses across all criteria
   ↓
5. Crisp/Fuzzy dual methods calculate comprehensive scores
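
Put together, the per-sample loop is roughly the following. The method names come from UnifiedLLMJudge above, but the return shapes and retry policy are assumptions for illustration, and aggregate_crisp/aggregate_fuzzy are the sketches from the DualAggregator section:

def evaluate_one_sample(judge, sample, cr_threshold=0.15, max_retries=3):
    for _ in range(max_retries):
        criteria = judge.generate_criteria(sample)                    # step 1
        # steps 2-3: pairwise comparison, weights, CR check and repair
        weights, fuzzy_weights, cr = judge.compute_criteria_weights(criteria)
        if cr <= cr_threshold:
            break  # consistent enough; otherwise regenerate the criteria
    scores = judge.compare_models_all_criteria(sample, criteria)     # step 4
    return (aggregate_crisp(weights, scores),                        # step 5
            aggregate_fuzzy(fuzzy_weights, scores))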

Output Description

Evaluation results are saved in the project folder:

  • records.jsonl: Detailed evaluation records for each sample
  • results.json: Summary statistics
  • cache_eval.json: LLM call cache (avoids duplicate calls)
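
Because records.jsonl is line-delimited JSON, it is easy to load for ad-hoc inspection (the per-record field names depend on the evaluator and are not shown here):

import json

def load_records(path):
    # One JSON object per line, each describing a single sample's evaluation.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = load_records("my_evaluation/records.jsonl")
print(f"{len(records)} samples evaluated")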

Evaluation Metrics

  • Crisp AHP Accuracy: Traditional method prediction accuracy
  • Fuzzy AHP Accuracy: Fuzzy method prediction accuracy
  • Improvement Rate: Fuzzy accuracy improvement over Crisp
  • Consistency: Agreement rate between the two methods' predictions
  • Per-Category Statistics: Accuracy performance by category

AHP Scale Description

Importance comparison uses a bipolar 1-9 scale, with 5 as the midpoint (both criteria equally important):

Score       Meaning
1           The second criterion is extremely more important
3           The second criterion is moderately more important
5           Both criteria are equally important
7           The first criterion is moderately more important
9           The first criterion is extremely more important
2, 4, 6, 8  Intermediate values between adjacent judgments
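
Because 5 is the midpoint, a bipolar response must be converted to a reciprocal ratio before it enters the judgment matrix. One plausible conversion is sketched below; it is an illustrative assumption, not necessarily the repository's exact mapping:

import math

def bipolar_to_ratio(v: int) -> float:
    # v in 1..9 with 5 = equal. Values above 5 favor the first criterion,
    # values below 5 the second. The result is a Saaty-style matrix entry
    # a_ij, and the mirror cell is its reciprocal: a_ji = 1 / a_ij.
    if v >= 5:
        return float(v - 4)   # 5 -> 1, 7 -> 3, 9 -> 5
    return 1.0 / (6 - v)      # 3 -> 1/3, 1 -> 1/5

assert math.isclose(bipolar_to_ratio(7), 1 / bipolar_to_ratio(3))  # symmetric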

Consistency Check

The system automatically performs CR (Consistency Ratio) verification:

  • CR ≤ 0.15: Passes consistency check
  • CR > 0.15: Automatically repairs matrix or regenerates criteria

Repair strategy:

  1. Identify inconsistent triplets
  2. Adjust contradictory judgment values
  3. Maximum 5 repair attempts
  4. If still failing, suggests regenerating criteria
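
The CR computation itself is standard AHP. A sketch with NumPy, using Saaty's random-index table (the repair loop is not shown):

import numpy as np

# Saaty's random consistency index, keyed by matrix size n.
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
      6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def consistency_ratio(matrix: np.ndarray) -> float:
    """CR = CI / RI(n), where CI = (lambda_max - n) / (n - 1)."""
    n = matrix.shape[0]
    if n <= 2:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    lambda_max = float(np.max(np.real(np.linalg.eigvals(matrix))))
    return (lambda_max - n) / (n - 1) / RI[n]

# A perfectly consistent 3x3 matrix (a_ik = a_ij * a_jk) gives CR ~= 0.
m = np.array([[1, 2, 4], [0.5, 1, 2], [0.25, 0.5, 1]])
print(consistency_ratio(m) <= 0.15)  # True: passes the check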

Dataset

Uses the JudgeBench dataset:

from datasets import load_dataset
dataset = load_dataset("ScalerLab/JudgeBench", split="claude")

Scoring and AHP Scales

The system supports multiple scoring and AHP scale configurations:

Scoring Scales

  • 1-5 Scale: Suitable for quick evaluations with 5 levels
  • 1-10 Scale: More granular scoring with 10 levels

AHP Scales

  • 1-5 Scale: Simpler comparisons with 5 levels
  • 1-9 Scale: Traditional AHP scale with 9 levels

Configure via --scoring_scale and --comparison_scale parameters.

Development and Extension

Adding New LLM Platforms

In llm_api_server.py:

  1. Add platform configuration validation rules to REQUIRED_KEYS
  2. Implement call function
  3. Register it in API_HANDLERS and CLIENT_BUILDERS (see the sketch below)
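
A sketch of what that registration could look like. REQUIRED_KEYS, API_HANDLERS, and CLIENT_BUILDERS are the names used in llm_api_server.py, but the handler signature, endpoint, and response shape below are purely hypothetical:

import requests

# 1. Declare which config keys the new platform requires.
REQUIRED_KEYS["myplatform"] = ["api_key", "base_url", "model"]

# 2. Implement the call function (hypothetical endpoint and payload).
def call_myplatform(client, config, prompt):
    resp = requests.post(
        f"{config['base_url']}/chat",
        headers={"Authorization": f"Bearer {config['api_key']}"},
        json={"model": config["model"], "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]

# 3. Register the handler; this platform needs no client object.
API_HANDLERS["myplatform"] = call_myplatform
CLIENT_BUILDERS["myplatform"] = lambda config: None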

Customizing Evaluation Workflow

Inherit from CategoryBasedEvaluator or AutoJudgeBenchEvaluator classes, override:

  • evaluate_sample(): Single sample evaluation logic
  • get_criteria_for_source(): Criteria retrieval logic
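
For example, a subclass might look like this (the self.criteria_config attribute and the sample's field names are assumptions for illustration):

from fahp_judge import CategoryBasedEvaluator

class MyEvaluator(CategoryBasedEvaluator):
    def get_criteria_for_source(self, source):
        # Fall back to a generic criteria set for unseen categories.
        return self.criteria_config.get(source, self.criteria_config["default"])

    def evaluate_sample(self, sample):
        # Skip malformed samples before delegating to the base logic.
        if not sample.get("response_A"):
            return None
        return super().evaluate_sample(sample)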

License

MIT License

Citation / Preprint

@misc{he2026structuredmulticriteriaevaluationlarge,
      title={Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge}, 
      author={Yulong He and Ivan Smirnov and Dmitry Fedrushkov and Sergey Kovalchuk and Ilya Revin},
      year={2026},
      eprint={2604.03742},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.03742}, 
}
