Small LLM Tool-Calling Benchmark

Inspired by "I tested 21 small LLMs on tool-calling judgment" by u/MikeNonect on r/LocalLLaMA.

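This benchmark tests 6 small LLMs on agent tool-routing judgment across 12 prompt scenarios: does the model call a tool when one is needed (Action) and hold back when one is not (Restraint)?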

Models Tested

Qwen/Qwen2.5-1.5B-Instruct
LiquidAI/LFM2.5-1.2B-Instruct
microsoft/Phi-3.5-mini-instruct
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
microsoft/Phi-4-mini-instruct
mistralai/Ministral-3-3B-Instruct-2512 (crashed; no results recorded)

Results

Model	Action	Restraint	Score	Avg Latency
Qwen2.5-1.5B-Instruct	1.0	0.714	0.857	150ms
Phi-3.5-mini-instruct	1.0	0.714	0.857	1206ms
LFM2.5-1.2B-Instruct	0.2	1.0	0.600	194ms
DeepSeek-R1-Distill-Qwen-1.5B	0.4	0.857	0.579	15454ms
Phi-4-mini-instruct	0.8	0.143	0.421	7026ms
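
The Action and Restraint values in the table look like simple fractions. As a rough check, and assuming each column is the share of passed prompts within its group (an assumption; the group sizes are not stated here), the sketch below finds the closest small fraction for each reported value. The helper name `as_fraction` is hypothetical.

```python
from fractions import Fraction

# Values copied from the Results table, in row order.
ACTION = [1.0, 0.2, 0.4, 0.8, 0.8][:4] if False else [1.0, 1.0, 0.2, 0.4, 0.8]
RESTRAINT = [0.714, 0.714, 1.0, 0.857, 0.143]

def as_fraction(value, max_prompts=12):
    """Closest fraction to `value` with a denominator of at most max_prompts."""
    return Fraction(value).limit_denominator(max_prompts)

action_fracs = [str(as_fraction(v)) for v in ACTION]
restraint_fracs = [str(as_fraction(v)) for v in RESTRAINT]

print(action_fracs)     # ['1', '1', '1/5', '2/5', '4/5']
print(restraint_fracs)  # ['5/7', '5/7', '1', '6/7', '1/7']
```

The denominators that appear are 5 and 7, which is consistent with 5 Action prompts and 7 Restraint prompts making up the 12 scenarios. This is an inference from the rounded values, not a documented split.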

Scoring

agent_score = (action_score + restraint_score) / 2 − (wrong_tool_calls × 0.05)
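
As a worked example, the formula can be written as a small function and checked against the Results table. The wrong_tool_calls counts below are not reported in this README; they are the values that make the formula reproduce the published Score column, so treat them as assumptions.

```python
def agent_score(action_score, restraint_score, wrong_tool_calls=0):
    """Mean of action and restraint scores, minus 0.05 per wrong tool call."""
    return (action_score + restraint_score) / 2 - wrong_tool_calls * 0.05

# model: (action, restraint, ASSUMED wrong_tool_calls, published score)
ROWS = {
    "Qwen2.5-1.5B-Instruct": (1.0, 0.714, 0, 0.857),
    "Phi-3.5-mini-instruct": (1.0, 0.714, 0, 0.857),
    "LFM2.5-1.2B-Instruct": (0.2, 1.0, 0, 0.600),
    "DeepSeek-R1-Distill-Qwen-1.5B": (0.4, 0.857, 1, 0.579),
    "Phi-4-mini-instruct": (0.8, 0.143, 1, 0.421),
}

for model, (action, restraint, wrong, published) in ROWS.items():
    computed = agent_score(action, restraint, wrong)
    # Table inputs are rounded to 3 decimals, so allow a small tolerance.
    assert abs(computed - published) < 0.001, model
    print(f"{model}: computed {computed:.4f}, published {published:.3f}")
```

Under these assumed counts, the two lowest-ranked models each carry one 0.05 penalty, which is why their Score is lower than the plain average of Action and Restraint.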

Setup

pip install transformers torch

Run on Kaggle (a T4 GPU is recommended). Add your Hugging Face token as an environment variable.
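
A minimal sketch of reading the token inside the notebook, assuming it is stored under the name HF_TOKEN (the name Hugging Face libraries conventionally look for; this README does not specify one). The helper names `get_hf_token` and `load_model` are hypothetical and may differ from what the notebook does.

```python
import os

def get_hf_token(var_name="HF_TOKEN"):
    """Return the Hugging Face token from the environment, or fail loudly."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token

def load_model(model_id):
    """Load a tokenizer and causal LM, passing the token explicitly."""
    # Imported lazily so the token check works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    token = get_hf_token()
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
    model = AutoModelForCausalLM.from_pretrained(model_id, token=token)
    return tokenizer, model
```

For example, `tokenizer, model = load_model("Qwen/Qwen2.5-1.5B-Instruct")` would load the first model in the list once the variable is set.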

Internship Context

Built as part of Task 1 of the IIT Ropar Summer Internship (Summership) 2026.
