Raja904/KMB-Benchmark

KMB: Kandpal Metacognition Benchmark

KMB is a framework for measuring the metacognitive ("thinking about thinking") capabilities of large language models. Designed for the Measuring Progress Toward AGI Kaggle competition, it moves beyond binary correctness to evaluate epistemic humility, calibration, and self-correction.

📊 Dataset (KMB-235)

The benchmark resides in dataset/metacognition_final.json and consists of 235 hand-curated items across more than 10 sub-types, including:

  • Abstention & Knowledge Boundaries: Knowing when to say "I don't know."
  • Deceptive Traps: Resisting intuitive heuristic errors (e.g., counter-intuitive math).
  • Mental Monitoring: Multi-step reasoning and error detection.
  • Fictional Strategies: Identifying optimal metacognitive workflows.

🧠 Evaluation Strategy: The Hybrid Assessment Engine

KMB uses a Hybrid Scoring Engine to evaluate whether a free-response answer is equivalent to the reference answer. Simple string matching is insufficient for reasoning tasks, so scoring goes through a three-tier arbiter:

  1. Level 1: Semantic Embeddings (Auto-Pass)
    Uses BAAI/bge-small-en-v1.5 dense embeddings. If cosine similarity $\ge 0.75$, the answer is marked correct.
  2. Level 2: LLM Arbiter (The Judge)
    For borderline cases ($0.65 \le \text{sim} < 0.75$), the pipeline calls a stronger reasoning model (e.g., Mistral Large) to perform a semantic equivalence check.
  3. Level 3: Logic Fallback (Auto-Fail)
    Items with $\text{sim} < 0.65$ are marked incorrect without consulting the judge.
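The routing logic above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in this repository: the names `cosine_similarity`, `route`, and `is_correct` are hypothetical, the embedding step (BAAI/bge-small-en-v1.5) is assumed to have already produced the two vectors, and the judge is represented by a caller-supplied callback rather than a real API call. Only the thresholds 0.75 and 0.65 come from the description above.

```python
import math
from typing import Callable, Optional, Sequence

AUTO_PASS = 0.75    # Level 1: similarity at or above this is auto-passed
JUDGE_FLOOR = 0.65  # Level 2: similarity in [0.65, 0.75) goes to the judge


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def route(sim: float) -> str:
    """Map a similarity score to one of the three tiers."""
    if sim >= AUTO_PASS:
        return "auto_pass"
    if sim >= JUDGE_FLOOR:
        return "judge"
    return "auto_fail"


def is_correct(sim: float, judge: Optional[Callable[[], bool]] = None) -> bool:
    """Resolve a score to a verdict; `judge` is a hypothetical LLM callback."""
    tier = route(sim)
    if tier == "auto_pass":
        return True
    if tier == "judge":
        if judge is None:
            raise ValueError("borderline similarity requires a judge callback")
        return bool(judge())
    return False
```

Passing the judge as a callback means the model is only called for borderline items, which matches the description of Level 2 as a fallback for the $0.65 \le \text{sim} < 0.75$ band.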

📈 The Meta-Score Components

To reach a final 0-100 score, KMB aggregates seven distinct dimensions:

  • Calibration (25%): Measures Expected Calibration Error (ECE). Penalizes high-confidence hallucinations.
  • Self-Awareness (20%): Does the model accurately judge if its initial guess was right?
  • Correction Success (20%): Ability to flip a wrong initial answer to a right final answer during reflection.
  • Raw Accuracy (15%): Standard correctness of the final output.
  • Epistemic Humility (10%): Rewards successful abstention on items with insufficient information.
  • Confabulation Resistance (5%): Penalty for models that are wrong but assert they are right.
  • Consistency (5%): Penalty for "overthinking" (flipping a right answer to a wrong one).
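As a rough illustration of how these weights could combine, the sketch below computes a standard equal-width-bin ECE and a weighted sum. It rests on assumptions the text above does not confirm: every component is taken to be already normalized to [0, 1] with higher meaning better (so penalty-style components are inverted beforehand), confidences are assumed to lie in [0, 1], and the bin count of 10 is a conventional default. The function names and dictionary keys are hypothetical and may not match the repository's code.

```python
from typing import Dict, List, Sequence, Tuple

# Weights taken from the list above; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "calibration": 0.25,
    "self_awareness": 0.20,
    "correction_success": 0.20,
    "raw_accuracy": 0.15,
    "epistemic_humility": 0.10,
    "confabulation_resistance": 0.05,
    "consistency": 0.05,
}


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Standard ECE over equal-width confidence bins on [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def meta_score(components: Dict[str, float]) -> float:
    """Weighted sum of [0, 1] component scores, scaled to 0-100."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

One plausible way to turn ECE into the calibration component is `1.0 - expected_calibration_error(...)`, so that a perfectly calibrated model earns the full 25 points; that mapping is an assumption here, not something the description specifies.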

🚀 Getting Started

1. Installation

pip install -r requirements.txt

2. Run an Evaluation

To run the benchmark on a local or cloud model (OpenAI/Mistral compatible):

python run_metacog_eval.py --model mistral-large-latest

3. Generate the Meta-Score

To run the Hybrid Scoring Engine (requires an API key for the Level 2 Judge):

python evaluate_metacog_results.py \
    --results runs/your_results.jsonl \
    --judge-api-key YOUR_KEY \
    --judge-model mistral-large-latest

🧪 Smoke Test

Test the full pipeline with a 3-item representative sample:

python run_metacog_smoke_test.py --model mistral-small-latest

🏆 Run on Kaggle

In addition to running it locally, you can run the benchmark and view the official leaderboard on Kaggle: Kandpal Metacognition Benchmark on Kaggle

📜 Citation

If you use the KMB dataset or evaluation framework in your research, please cite:

@misc{kmb:_the_kandpal_metacognition_benchmark,
    author = {Rajeev Kandpal},
    title = {KMB: The Kandpal Metacognition Benchmark},
    year = {2026},
    howpublished = {\url{https://www.kaggle.com/benchmarks/rajeevkandpal/agi-eval-metacognition/leaderboard}}
}

Developed for the AGI Evaluation track. Its design follows the research principles of epistemic depth and calibration monitoring.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
