Inspired by: I tested 21 small LLMs on tool-calling judgment by u/MikeNonect on r/LocalLLaMA
Testing 6 small LLMs on agent tool-routing judgment across 12 prompt scenarios.
- Qwen/Qwen2.5-1.5B-Instruct
- LiquidAI/LFM2.5-1.2B-Instruct
- microsoft/Phi-3.5-mini-instruct
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- microsoft/Phi-4-mini-instruct
- mistralai/Ministral-3-3B-Instruct-2512 (crashed, so it has no row in the results table)
| Model | Action | Restraint | Score | Avg Latency |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 1.0 | 0.714 | 0.857 | 150ms |
| Phi-3.5-mini-instruct | 1.0 | 0.714 | 0.857 | 1206ms |
| LFM2.5-1.2B-Instruct | 0.2 | 1.0 | 0.600 | 194ms |
| DeepSeek-R1-Distill-Qwen-1.5B | 0.4 | 0.857 | 0.579 | 15454ms |
| Phi-4-mini-instruct | 0.8 | 0.143 | 0.421 | 7026ms |
agent_score = (action_score + restraint_score) / 2 − (wrong_tool_calls × 0.05)
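The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual scoring code: the function name and signature are hypothetical, and the results table does not list wrong-call counts, so the zero used in the example is an assumption that happens to reproduce the reported score.

```python
def agent_score(action_score: float, restraint_score: float,
                wrong_tool_calls: int = 0) -> float:
    """Mean of the action and restraint scores, minus 0.05 per wrong tool call."""
    return (action_score + restraint_score) / 2 - wrong_tool_calls * 0.05


if __name__ == "__main__":
    # Qwen2.5-1.5B-Instruct row, assuming zero wrong tool calls (not stated in the table).
    print(round(agent_score(1.0, 0.714), 3))
```

Because the penalty is subtracted after averaging, a model with strong action and restraint scores can still rank lower if it frequently picks the wrong tool.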
`pip install transformers torch`

Run on Kaggle (T4 GPU recommended). Add your HF token as an environment variable.
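One way to pick up the token inside the notebook is sketched below. The variable name `HF_TOKEN` is an assumption (this README does not name the variable), and `get_hf_token` is a hypothetical helper, not part of the project.

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if it is unset."""
    # Accept an explicit mapping so the helper can be exercised without touching os.environ.
    env = os.environ if env is None else env
    # HF_TOKEN is an assumed variable name; adjust it to match your setup.
    return env.get("HF_TOKEN") or None


if __name__ == "__main__":
    token = get_hf_token()
    print("token found" if token else "no token set")
```

Returning `None` rather than raising lets public models load without a token, while gated models would still need one.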
Built as part of Task 1 under the IIT Ropar Summer Internship (Summership) 2026.