ASpiderA-bot/small-llm-toolcalling-benchmark

Small LLM Tool-Calling Benchmark

Inspired by: "I tested 21 small LLMs on tool-calling judgment" by u/MikeNonect on r/LocalLLaMA

📄 View Report

Tests 6 small LLMs on agent tool-routing judgment across 12 prompt scenarios. Five models produced results; one crashed and is not scored.

Models Tested

  • Qwen/Qwen2.5-1.5B-Instruct
  • LiquidAI/LFM2.5-1.2B-Instruct
  • microsoft/Phi-3.5-mini-instruct
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  • microsoft/Phi-4-mini-instruct
  • mistralai/Ministral-3-3B-Instruct-2512 (crashed, no results)

Results

Model                          Action  Restraint  Score  Avg Latency
Qwen2.5-1.5B-Instruct          1.0     0.714      0.857  150 ms
Phi-3.5-mini-instruct          1.0     0.714      0.857  1206 ms
LFM2.5-1.2B-Instruct           0.2     1.0        0.600  194 ms
DeepSeek-R1-Distill-Qwen-1.5B  0.4     0.857      0.579  15454 ms
Phi-4-mini-instruct            0.8     0.143      0.421  7026 ms

Scoring

agent_score = (action_score + restraint_score) / 2 − (wrong_tool_calls × 0.05)
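
The formula above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual code: the function name and signature are assumptions, and the results table does not list wrong-tool-call counts, so the example rows assume zero wrong calls.

```python
def agent_score(action_score: float, restraint_score: float,
                wrong_tool_calls: int = 0) -> float:
    """Mean of the action and restraint scores, minus 0.05 per wrong tool call.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return (action_score + restraint_score) / 2 - wrong_tool_calls * 0.05


# Rows taken from the results table, assuming zero wrong tool calls.
qwen = agent_score(action_score=1.0, restraint_score=0.714)
lfm = agent_score(action_score=0.2, restraint_score=1.0)
print(f"Qwen2.5-1.5B-Instruct: {qwen:.3f}")
print(f"LFM2.5-1.2B-Instruct:  {lfm:.3f}")
```

Because the penalty is subtracted after averaging, a single wrong tool call costs as much as a 0.1 drop in either component score.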

Setup

pip install transformers torch

Run on Kaggle (a T4 GPU is recommended). Add your Hugging Face (HF) token as an environment variable.
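
As a rough sketch of the token step, the snippet below reads the token from the environment. It assumes the variable is named `HF_TOKEN`, which is the name the `huggingface_hub` library looks for by default; this README does not say which name the benchmark uses, and `get_hf_token` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Hugging Face token from the environment, or None if unset.

    Assumes the variable name HF_TOKEN; adjust if the notebook uses another.
    """
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; gated models may fail to download.")
else:
    print("HF_TOKEN found.")
```

In recent versions of `transformers`, the value can then be passed as the `token` argument to `from_pretrained`.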

Internship Context

Built as Task 1 of the IIT Ropar Summer Internship (Summership) 2026.
