KMB: Kandpal Metacognition Benchmark

KMB is a framework for measuring the metacognitive ("thinking about thinking") capabilities of large language models. Designed for the Measuring Progress Toward AGI Kaggle competition, it goes beyond binary correctness to evaluate epistemic humility, calibration, and self-correction.

📊 Dataset (KMB-235)

The benchmark resides in dataset/metacognition_final.json and consists of 235 hand-curated items across more than 10 sub-types, including:

Abstention & Knowledge Boundaries: Knowing when to say "I don't know."
Deceptive Traps: Resisting intuitive heuristic errors (e.g., counter-intuitive math).
Mental Monitoring: Multi-step reasoning and error detection.
Fictional Strategies: Identifying optimal metacognitive workflows.
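
As a quick sanity check on the dataset, the sketch below tallies items per sub-type. It assumes the JSON file holds a top-level list of objects and that each object carries its sub-type under a key named "subtype"; both are assumptions for illustration (the actual schema is not described here), so adjust the key to match the file.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def load_items(path: str) -> List[Dict[str, Any]]:
    """Load benchmark items, assuming the file is a JSON list of objects."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON list of items")
    return data


def count_by_subtype(items: List[Dict[str, Any]], key: str = "subtype") -> Counter:
    """Count items per sub-type; `key` is a hypothetical field name."""
    return Counter(item.get(key, "unknown") for item in items)


if __name__ == "__main__":
    # In-memory sample so the sketch runs without the dataset file.
    sample = [
        {"subtype": "abstention"},
        {"subtype": "deceptive_trap"},
        {"subtype": "abstention"},
    ]
    for name, total in count_by_subtype(sample).most_common():
        print(f"{name}: {total}")
```

With the real file, `count_by_subtype(load_items("dataset/metacognition_final.json"))` should sum to 235 if the assumptions above hold.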

🧠 Evaluation Strategy: The Hybrid Assessment Engine

KMB uses a Hybrid Scoring Engine to judge whether a free-response answer is equivalent to the reference answer. Simple string matching is insufficient for reasoning tasks, so scoring passes through a three-tier arbiter:

Level 1: Semantic Embeddings (Auto-Pass)
Compares the model answer with the reference answer using BAAI/bge-small-en-v1.5 dense embeddings. If cosine similarity $\ge 0.75$, the answer is marked correct.
Level 2: LLM Arbiter (The Judge)
For borderline cases ($0.65 \le \text{sim} < 0.75$), the pipeline calls a high-reasoning model (e.g., Mistral Large) to perform a semantic equivalence check.
Level 3: Logic Fallback (Auto-Fail)
Items with $\text{sim} < 0.65$ are marked incorrect without consulting the judge.
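
The routing between the three levels can be sketched as follows. This is a minimal illustration built only from the thresholds stated above, not the code in evaluate_metacog_results.py; the names `grade`, `cosine_similarity`, and the `judge` callback are hypothetical, and the judge is passed in as a plain callable so the sketch runs without an API key or an embedding model.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

AUTO_PASS = 0.75    # Level 1: similarity at or above this passes outright
JUDGE_FLOOR = 0.65  # Level 2: similarity in [0.65, 0.75) goes to the judge


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def grade(sim: float, judge: Optional[Callable[[], bool]] = None) -> Tuple[bool, str]:
    """Return (is_correct, tier) for a precomputed similarity score.

    `judge` stands in for the LLM equivalence check and is only invoked
    for borderline scores, which keeps judge calls to a minimum.
    """
    if sim >= AUTO_PASS:
        return True, "level1_auto_pass"
    if sim >= JUDGE_FLOOR:
        if judge is None:
            raise ValueError("borderline similarity requires a judge")
        return bool(judge()), "level2_judge"
    return False, "level3_auto_fail"


if __name__ == "__main__":
    print(grade(0.81))                 # passes on embeddings alone
    print(grade(0.70, lambda: True))   # decided by the (stubbed) judge
    print(grade(0.40))                 # fails without a judge call
```

Only answers in the borderline band trigger a judge call, so clear passes and clear failures never require the Level 2 model.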

📈 The Meta-Score Components

To reach a final 0-100 score, KMB aggregates seven distinct dimensions:

Calibration (25%): Measured via Expected Calibration Error (ECE). Penalizes high-confidence hallucinations.
Self-Awareness (20%): Does the model accurately judge whether its initial guess was right?
Correction Success (20%): Ability to flip a wrong initial answer to a right final answer during reflection.
Raw Accuracy (15%): Standard correctness of the final output.
Epistemic Humility (10%): Rewards successful abstention on items with insufficient information.
Confabulation Resistance (5%): Penalizes models that are wrong but assert they are right.
Consistency (5%): Penalizes "overthinking" (flipping a right answer to a wrong one).
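
The weighting above can be sketched as a simple weighted sum, together with a standard equal-width-bin ECE computation. This is an illustration under stated assumptions rather than the benchmark's own code: it assumes every component has already been normalized to the range 0 to 1 with higher meaning better, that calibration enters as 1 - ECE, and that ECE uses 10 bins. The function and key names are hypothetical.

```python
from typing import Dict, Sequence

# Weights taken from the list above; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "calibration": 0.25,
    "self_awareness": 0.20,
    "correction_success": 0.20,
    "raw_accuracy": 0.15,
    "epistemic_humility": 0.10,
    "confabulation_resistance": 0.05,
    "consistency": 0.05,
}


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    if n != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if n == 0:
        return 0.0
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(ok)
    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece


def meta_score(components: Dict[str, float]) -> float:
    """Combine normalized component scores (0 to 1) into a 0-100 score."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


if __name__ == "__main__":
    ece = expected_calibration_error([0.9, 0.9, 0.6, 0.3], [True, False, True, False])
    scores = {name: 0.5 for name in WEIGHTS}
    scores["calibration"] = 1.0 - ece  # assumption: lower ECE means higher score
    print(round(meta_score(scores), 2))
```

Because calibration carries the largest weight, a model that is accurate but overconfident can still score below a less accurate, well-calibrated one under this scheme.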

🚀 Getting Started

1. Installation

pip install -r requirements.txt

2. Run an Evaluation

To run the benchmark on a local or cloud model served through an OpenAI- or Mistral-compatible API:

python run_metacog_eval.py --model mistral-large-latest

3. Generate the Meta-Score

To run the Hybrid Scoring Engine (requires an API key for the Level 2 Judge):

python evaluate_metacog_results.py \
    --results runs/your_results.jsonl \
    --judge-api-key YOUR_KEY \
    --judge-model mistral-large-latest

🧪 Smoke Test

Test the full pipeline with a 3-item representative sample:

python run_metacog_smoke_test.py --model mistral-small-latest

🏆 Run on Kaggle

In addition to running locally, you can run the benchmark and view its official leaderboard on Kaggle: Kandpal Metacognition Benchmark on Kaggle

📜 Citation

If you use the KMB dataset or evaluation framework in your research, please cite:

@misc{kmb:_the_kandpal_metacognition_benchmark,
    author = {Rajeev Kandpal},
    title = {KMB: The Kandpal Metacognition Benchmark},
    year = {2026},
    howpublished = {\url{https://www.kaggle.com/benchmarks/rajeevkandpal/agi-eval-metacognition/leaderboard}}
}

Developed for the AGI Evaluation track. The design follows the research principles of epistemic depth and calibration monitoring.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
