Talking CLI is a linter that audits whether your AI skill and MCP tools have moved guidance out of bloated SKILL.md and into the tool responses themselves — giving tools a voice at the moment they are called.
Three chronic diseases in today's AI toolchains:
- Tools are mute — they return raw JSON and say nothing about errors, empty results, or ambiguity
- Documents bloat — every "if zero results, broaden the query" and "if ambiguous, ask the user" gets shoved into SKILL.md, 400+ lines loaded in full every turn
- Budget leaks — 90% of that guidance is noise 90% of the time, yet the agent pays token rent on all of it
Prompt Surface = SKILL.md ∪ { tool_result.hints } — two halves, one budget.
Move guidance that only applies after a specific tool call from static documents into dynamic responses. The tool speaks only when called, and only about what just happened. We call this Prompt-On-Call; the cumulative effect across every tool is Distributed Prompting.
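A minimal sketch of what that looks like in practice, assuming a plain function-style tool handler (the `searchDocs` backend is hypothetical; the `hints` field name follows this README's convention):

```typescript
// A sketch of Prompt-On-Call, assuming a plain function-style tool handler.
// The `hints` field follows the convention described in this README;
// `searchDocs` is a hypothetical stand-in for the tool's real backend.
interface ToolResult {
  results: string[];
  hints?: string[]; // present only when there is something worth saying
}

async function searchDocs(query: string): Promise<string[]> {
  // Placeholder backend; a real tool would hit an index or API here.
  return query.includes("mcp") ? ["docs/mcp-overview.md"] : [];
}

export async function searchTool(query: string): Promise<ToolResult> {
  const results = await searchDocs(query);
  if (results.length === 0) {
    // Guidance that used to live in SKILL.md now rides on the response,
    // and only at the moment it actually applies.
    return {
      results,
      hints: [
        "No results. Drop the least specific keyword and retry.",
        "If the request is still ambiguous, ask the user to clarify.",
      ],
    };
  }
  return { results };
}
```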
| 🎯 Token Savings | 🧪 Validation Scale | 🤖 Model Coverage | 🔍 Ecosystem Audit |
|---|---|---|---|
| 17–26% | 2,340+ executions | 3 frontier models | 0/68 pass |
| Lean Skill + Tool Hints | Cross-difficulty, cross-model | DeepSeek / Kimi / GLM | 4 official Anthropic MCP servers |
Anthropic and Carmack have pointed at this direction, but nobody has named it, budgeted it, or audited it — until now.
Standing on shoulders — why now?
The CLI is the native interface for AI agents — an idea crystallized by Carmack, CodeAct (Wang et al., ICML 2024), and Karpathy.
Progressive Disclosure as a skill-loading architecture was formalized by Anthropic (Oct 2025) and is now an open standard. Anthropic also advocates "steering agents with helpful instructions in tool responses" — but only as a paragraph-level best practice. Nobody has named it, budgeted it, audited it, or proposed it as a protocol-level primitive. That gap is what Talking CLI fills. We believe Prompt-On-Call / Distributed Prompting is the next evolutionary step of this idea.
| | Mute CLI (Before) | Distributed Prompting (After) |
|---|---|---|
| SKILL.md | 400+ lines, full load every turn | < 150 lines, generic guidance only |
| Tool Response | Raw JSON, zero hints | JSON + hints field, context-aware guidance |
| Prompt Cost | Paying rent on 400 lines per turn | Paying only for precise hints at call time |
| Audit | None | talking-cli audit scores four heuristics |
📊 Visual: Before vs After
```mermaid
graph LR
    subgraph Before ["❌ Before: Mute CLI"]
        A1[SKILL.md<br/>400+ lines] --> A2[Agent]
        A3[Tool returns<br/>raw JSON only] --> A2
        A1 -.->|"guidance shoved upstream"| A3
    end
    subgraph After ["✅ After: Distributed Prompting"]
        B1[SKILL.md<br/>< 150 lines] --> B2[Agent]
        B3[Tool returns<br/>JSON + hints] --> B2
    end
    Before -->|Audit + Optimize| After
```
| Heuristic | What It Checks | Pass Threshold |
|---|---|---|
| H1 · Document Budget | SKILL.md line count | ≤ 150 lines |
| H2 · Fixture Coverage | Error + empty-result scenarios | ≥ 2 fixtures per tool |
| H3 · Structured Hints | Response contains hint fields | hints / suggestions / guidance |
| H4 · Actionable Guidance | Hint content is specific and actionable | ≥ 10 chars with action verbs |
Total score 0–100. ≥ 80 to pass.
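As an illustration only (not the actual talking-cli implementation), an H3-style check over recorded tool-response fixtures can be as simple as:

```typescript
// Illustrative H3-style check: does a recorded tool response carry a
// non-empty structured hint field? The field names mirror the ones the
// audit looks for; everything else here is a sketch, not talking-cli code.
const HINT_FIELDS = ["hints", "suggestions", "guidance"] as const;

function hasStructuredHints(response: unknown): boolean {
  if (typeof response !== "object" || response === null) return false;
  const record = response as Record<string, unknown>;
  return HINT_FIELDS.some((field) => {
    const value = record[field];
    if (Array.isArray(value)) return value.length > 0;
    return typeof value === "string" && value.length > 0;
  });
}

// Example fixture captured from an empty-result scenario.
console.log(hasStructuredHints({ results: [], hints: ["Broaden the query."] })); // true
console.log(hasStructuredHints({ results: [] }));                                // false
```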
📊 Visual: Scoring Flow
```mermaid
graph TD
    H1[H1 · Document Budget<br/>SKILL.md ≤ 150 lines]
    H2[H2 · Fixture Coverage<br/>error + empty scenarios]
    H3[H3 · Structured Hints<br/>hints / suggestions / guidance]
    H4[H4 · Actionable Guidance<br/>specific, actionable content]
    H1 & H2 & H3 & H4 --> Score[Total Score<br/>0–100]
    Score -->|≥ 80| Pass[✅ PASS]
    Score -->|< 80| Fail[❌ FAIL]
```
```bash
# Audit your skill — plain-language report telling you what to fix
npx talking-cli audit ./my-skill

# CI mode — machine-readable, exit-code driven
npx talking-cli audit ./my-skill --ci

# JSON mode — structured output for tooling integration
npx talking-cli audit ./my-skill --json

# Audit an MCP server — static analysis (fast, safe)
npx talking-cli audit-mcp ./my-mcp-server

# Deep audit — runtime heuristics (spawns the server)
# ⚠️ Only use --deep on servers you trust. See SECURITY.md.
npx talking-cli audit-mcp ./my-mcp-server --deep

# Generate an optimization plan (plan-only, never touches source)
npx talking-cli optimize ./my-skill

# Scaffold a new skill directory with audit-passing templates
npx talking-cli init my-skill
```

All commands run fully local — no API key required.
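For tooling that consumes the JSON report, a sketch along these lines should work; note that the exact output schema (e.g. a top-level `score` field) is an assumption here, not a documented contract:

```typescript
// Sketch of gating a build on the audit score, assuming --json emits a
// top-level numeric `score` (the real schema may differ).
import { execFileSync } from "node:child_process";

const raw = execFileSync(
  "npx",
  ["talking-cli", "audit", "./my-skill", "--json"],
  { encoding: "utf8" },
);
const report = JSON.parse(raw) as { score?: number };

if ((report.score ?? 0) < 80) {
  console.error(`Skill audit failed: score ${report.score ?? "unknown"} is below 80`);
  process.exit(1);
}
```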
0 / 68. We scanned 4 official Anthropic MCP servers across 68 error / empty-result scenarios. None returned actionable guidance. Static analysis of 823 Composio tools showed the same result.
| Server | Tools | Scenarios | Guidance Returned |
|---|---|---|---|
| server-filesystem | 11 | 21 | 0 |
| server-everything | 13 | 13 | 0 |
| server-memory | 9 | 9 | 0 |
| server-github | 25 | 25 | 0 |
| Total | 58 | 68 | 0 / 68 |
Full 2×2 ablation (Full/Lean Skill × Mute/Hinting Tools) across 3 frontier models on 45 MCP tasks (k=3 trials per cell), plus 15 harder tasks on 2 models:
| Model | Full/Mute | Lean/Hints | Δ | Token Save | Hard Baseline | Hard Δ | Hard Save |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Pro | 91.1% | 90.4% | −0.7 | −17% | 22.2% / 22.2% | 0.0 | −24% |
| Kimi K2.6 | 88.1% | 90.4% | +1.5 | −18% | — | — | — |
| GLM-5.1 | 90.4% | 93.3% | +2.2 | −22% | 20.0% / 20.0% | 0.0 | −26% |
What the data supports:
- Token efficiency is cross-model and cross-difficulty: 17–26% savings with no statistically significant quality degradation
- No harm: worst case is −0.7pp, within noise
- Skill bloat is real: SkillsBench (36K real-world skills) independently found verbose skills degrade by −2.9pp while moderate ones improve by +18.8pp
What the data does not support:
- Pass-rate improvement is not statistically significant (p = 1.0) — token savings are proven; quality signal remains unproven
- Adding hints to a verbose skill can hurt (GLM-5.1: −6pp). Distributed Prompting only works when the skill is compressed first
What's next:
- Harder benchmarks — tasks calibrated to a 40–60% baseline to surface the quality signal currently buried by ceiling effects
- MCP spec proposal — an RFC for a first-class `agent_hints` field in tool responses (see the sketch below)
- H4 semantic upgrade — replacing the `≥ 10 chars` heuristic with a lightweight classifier
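For a sense of what the proposed field could look like (a speculative shape, not part of the current MCP specification):

```typescript
// Speculative shape for the proposed agent_hints field; everything here is
// illustrative and not part of the current MCP specification.
interface AgentHint {
  when: "error" | "empty_result" | "ambiguous" | "success";
  text: string; // one specific, actionable instruction for the calling agent
}

interface ToolCallResult {
  content: unknown;          // the tool's ordinary payload
  isError?: boolean;
  agent_hints?: AgentHint[]; // guidance scoped to this specific call
}

const example: ToolCallResult = {
  content: { matches: [] },
  agent_hints: [
    { when: "empty_result", text: "Retry with a broader query before asking the user." },
  ],
};

console.log(JSON.stringify(example, null, 2));
```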
MIT