[codex] Count tokens with tokenizer chat templates#3593

Open
neubig wants to merge 1 commit into main from codex/chat-template-token-count

@neubig neubig commented Jun 9, 2026

Member

HUMAN:

  • A human has tested these changes.

AGENT:


Why

The LLMSummarizingCondenser uses LLM.get_token_count() to decide when to condense. For OpenAI-compatible local servers, the serving backend may render messages and tool schemas through the model chat template before tokenization. LiteLLM's generic token counter can undercount that rendered prompt. In the Qwen/GGUF smoke run, LiteLLM estimated a rejected request at 55,615 tokens while llama.cpp counted about 68.7k.

Summary

  • Prefer tokenizer apply_chat_template for LLM.get_token_count() when a configured tokenizer supports it.
  • Preserve the existing LiteLLM token counter as the fallback path.
  • Normalize text-only OpenAI content blocks before chat-template rendering and handle Hugging Face BatchEncoding/Encoding tokenized outputs.
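The preference order in this summary can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `count_prompt_tokens` and its `fallback` parameter are hypothetical names, and the only tokenizer call assumed is Hugging Face's `apply_chat_template(..., tools=..., tokenize=True)`.

```python
from typing import Any, Callable, Optional, Sequence


def count_prompt_tokens(
    tokenizer: Any,
    messages: Sequence[dict],
    tools: Optional[Sequence[dict]],
    fallback: Callable[[], int],
) -> int:
    """Prefer the tokenizer's chat template; otherwise use the fallback counter."""
    apply = getattr(tokenizer, "apply_chat_template", None)
    if not callable(apply):
        # No chat template support: keep the existing generic counter.
        return fallback()
    try:
        rendered = apply(
            list(messages),
            tools=list(tools) if tools else None,
            tokenize=True,
        )
    except Exception:
        # Sketch only: any rendering failure drops back to the generic counter.
        return fallback()
    if hasattr(rendered, "keys"):
        # Dict-like result (for example a BatchEncoding): use its input_ids.
        rendered = rendered["input_ids"]
    return len(rendered)
```

In the SDK the fallback would be the existing LiteLLM counter. It is passed in as a plain callable here so the sketch stays independent of both libraries.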

Issue Number

N/A

How to Test

Automated validation run locally:

uv run pytest tests/sdk/llm/test_llm.py -k 'token_counting or chat_template_tokenizer or tokenized_output'
uv run pytest tests/sdk/context/condenser/test_utils.py tests/sdk/context/condenser/test_llm_summarizing_condenser.py
uv run ruff check openhands-sdk/openhands/sdk/llm/llm.py tests/sdk/llm/test_llm.py
uv run ruff format --check openhands-sdk/openhands/sdk/llm/llm.py tests/sdk/llm/test_llm.py

Manual smoke testing against the live Agent Canvas/Lemonade Qwen setup:

uv run --with transformers python scripts/run_daily_workflow_qwen_smoke.py \
  --thinking canvas \
  --max-iterations 40 \
  --log-completions \
  --custom-tokenizer Qwen/Qwen3.6-35B-A3B-FP8

uv run --with transformers python scripts/run_daily_workflow_qwen_smoke.py \
  --thinking canvas \
  --max-iterations 80 \
  --log-completions \
  --custom-tokenizer Qwen/Qwen3.6-35B-A3B-FP8

Results:

  • Local sanity check on the previously failed smoke-test payload: the new chat-template path counted 68,984 tokens, matching llama.cpp/llama-server's rejected prompt size of roughly 68.7k instead of LiteLLM's 55,615-token estimate.
  • 40-iteration daily-workflow smoke did not hit a context-window error. It triggered 1 condensation, survived a clipped 349,168-character MCP observation, then stopped at Agent reached maximum iterations limit (40).
  • 80-iteration daily-workflow smoke did not hit a context-window error. It triggered 3 condensations, survived repeated clipped Linear MCP observations up to 140,211 characters, then stopped at Agent reached maximum iterations limit (80).
  • The full daily workflow did not complete end to end. The remaining blocker appears behavioral/external: repeated linear__list_issues calls and one mcp.linear.app Cloudflare 1101 worker exception. This PR fixes the condenser token-count mismatch, but it does not make Qwen complete the whole daily workflow.
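To put the mismatch in proportion, the two counts reported above can be compared directly. The helper name `relative_undercount` is hypothetical; the numbers are the ones quoted in the results.

```python
def relative_undercount(estimated: int, actual: int) -> float:
    """Fraction of the actual prompt size that the estimate missed."""
    return (actual - estimated) / actual


litellm_estimate = 55_615      # LiteLLM estimate for the rejected request
chat_template_count = 68_984   # count from the new chat-template path

missing = chat_template_count - litellm_estimate
print(missing)  # 13369
print(f"{relative_undercount(litellm_estimate, chat_template_count):.1%}")  # 19.4%
```

An estimate that runs about a fifth low can stay under a condensation threshold while the rendered prompt already exceeds the server's context window.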

Smoke artifacts are under:

/tmp/oh-daily-workflow-smoke-runs/20260609-161625
/tmp/oh-daily-workflow-smoke-runs/20260609-162146

Video/Screenshots

N/A. This is a backend token-counting fix; the smoke-test logs above are the reproduction evidence.

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Notes

This is model-agnostic: it uses apply_chat_template when available and falls back to the existing LiteLLM counter otherwise. To exercise this path in a local Qwen setup, custom_tokenizer must point to a tokenizer id with a chat template, such as Qwen/Qwen3.6-35B-A3B-FP8, and transformers must be available.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6f3fa97-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-6f3fa97-python \
  ghcr.io/openhands/agent-server:6f3fa97-python

All tags pushed for this build

ghcr.io/openhands/agent-server:6f3fa97-golang-amd64
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-golang-amd64
ghcr.io/openhands/agent-server:codex-chat-template-token-count-golang-amd64
ghcr.io/openhands/agent-server:6f3fa97-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:6f3fa97-golang-arm64
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-golang-arm64
ghcr.io/openhands/agent-server:codex-chat-template-token-count-golang-arm64
ghcr.io/openhands/agent-server:6f3fa97-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:6f3fa97-java-amd64
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-java-amd64
ghcr.io/openhands/agent-server:codex-chat-template-token-count-java-amd64
ghcr.io/openhands/agent-server:6f3fa97-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:6f3fa97-java-arm64
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-java-arm64
ghcr.io/openhands/agent-server:codex-chat-template-token-count-java-arm64
ghcr.io/openhands/agent-server:6f3fa97-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:6f3fa97-python-amd64
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-python-amd64
ghcr.io/openhands/agent-server:codex-chat-template-token-count-python-amd64
ghcr.io/openhands/agent-server:6f3fa97-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:6f3fa97-python-arm64
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-python-arm64
ghcr.io/openhands/agent-server:codex-chat-template-token-count-python-arm64
ghcr.io/openhands/agent-server:6f3fa97-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:6f3fa97-golang
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-golang
ghcr.io/openhands/agent-server:codex-chat-template-token-count-golang
ghcr.io/openhands/agent-server:6f3fa97-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:6f3fa97-java
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-java
ghcr.io/openhands/agent-server:codex-chat-template-token-count-java
ghcr.io/openhands/agent-server:6f3fa97-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:6f3fa97-python
ghcr.io/openhands/agent-server:6f3fa97ab738611059290a05812213665015c373-python
ghcr.io/openhands/agent-server:codex-chat-template-token-count-python
ghcr.io/openhands/agent-server:6f3fa97-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 6f3fa97-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 6f3fa97-python-amd64) are also available if needed

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| openhands-sdk/openhands/sdk/llm | | | | |
| llm.py | 840 | 132 | 84% | 531, 547, 580–581, 866–867, 870–874, 876, 884–886, 890, 907–908, 912, 914–915, 917–919, 1042, 1165, 1358, 1367–1369, 1468, 1479, 1520, 1532–1534, 1537–1540, 1546, 1604, 1615, 1658, 1671–1673, 1676–1679, 1685, 1864–1869, 1985–1986, 2326–2327, 2336, 2354, 2378–2379, 2381, 2383, 2385, 2393, 2396, 2398, 2400, 2408, 2412–2413, 2423–2427, 2431, 2435–2436, 2441, 2445, 2451, 2456, 2497, 2499–2504, 2506–2523, 2526–2530, 2532–2533, 2539–2548, 2605, 2607 |
| TOTAL | 29722 | 8367 | 71% | |

@neubig neubig marked this pull request as ready for review June 9, 2026 20:10
@neubig neubig added the review-this label Jun 9, 2026

all-hands-bot commented Jun 9, 2026

Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Collaborator

Code Review: [codex] Count tokens with tokenizer chat templates

🟡 Acceptable — Functional implementation with a few areas for improvement.

[CRITICAL ISSUES]

None identified. The implementation is sound and handles edge cases appropriately.

[IMPROVEMENT OPPORTUNITIES]

  • [openhands-sdk/openhands/sdk/llm/llm.py, Line 2364] Edge Case Handling: The _messages_for_chat_template method clears content to empty list when encountering non-text content blocks in a mixed content list. This may silently discard image blocks or other content types. Consider preserving the original content format if non-text blocks are found, rather than clearing.

  • [openhands-sdk/openhands/sdk/llm/llm.py, Line 2384] Exception Handling: Catching Exception is broad. Consider catching specific exceptions (ImportError, OSError) that are more likely when loading pretrained tokenizers, and let unexpected errors propagate.

  • [openhands-sdk/openhands/sdk/llm/llm.py, Line 2360] Type Handling: The _count_tokenized_output method handles multiple types but isinstance(tokenized, Sequence) matches strings. This could cause issues if a tokenizer returns a string directly (though you handle strings before this check, so it should be fine).
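The first suggestion above, keeping the original content when a non-text block is present, could look roughly like this. It is a hedged sketch: `normalize_text_only_content` is a hypothetical helper, messages are assumed to be OpenAI-style dicts, and joining text blocks with a newline is an assumption rather than the PR's behavior.

```python
from typing import Any, Dict


def normalize_text_only_content(message: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a list of text blocks into one string; leave anything else as-is."""
    content = message.get("content")
    if not isinstance(content, list):
        return message  # already a plain string (or None)
    texts = []
    for block in content:
        if not (isinstance(block, dict) and block.get("type") == "text"):
            # Mixed content (for example an image block): return the original
            # message untouched instead of clearing its content.
            return message
        texts.append(str(block.get("text", "")))
    # Build a new dict so the caller's message is never mutated.
    return {**message, "content": "\n".join(texts)}
```

Returning a new dict only when something changes also avoids copying the whole message history up front.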

[STYLE NOTES]

  • [openhands-sdk/openhands/sdk/llm/llm.py, Lines 2340-2342] Comment: The comment block is clear and explains the motivation well. No issues here.

[TESTING GAPS]

  • [tests/sdk/llm/test_llm.py] Test Coverage: Tests cover the happy path and fallback behavior well. Consider adding a test for _messages_for_chat_template with mixed content (e.g., one text block, one image block) to verify the edge case behavior.

  • [tests/sdk/llm/test_llm.py] Test Coverage: No test for when custom_tokenizer is None but _chat_template_tokenizer might be set (though this should not happen in normal flow).

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

The implementation is well-isolated:

  • Falls back gracefully to LiteLLM when chat template tokenizer is unavailable
  • All new methods are private and optional
  • No breaking changes to existing public API
  • Proper exception handling throughout
  • Tests cover the fallback path

The only theoretical risk is if a tokenizer's apply_chat_template produces substantially different output than expected, but this would only affect token counting accuracy (not correctness), and the fallback handles failures.


VERDICT:
Worth merging: Core logic is sound, minor improvements suggested.

KEY INSIGHT:
This is a well-scoped feature that adds token counting accuracy for local OpenAI-compatible servers by using the tokenizer's native chat template rather than LiteLLM's generic estimation. The graceful fallback ensures existing functionality is preserved.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.

@all-hands-bot all-hands-bot left a comment

Collaborator

❌ QA Report: FAIL

The real SDK token-counting path still returned the LiteLLM fallback count for a Qwen Transformers chat-template tokenizer, so the PR did not deliver its stated behavior in this smoke test.

Does this PR achieve its stated goal?

No. The stated goal is to prefer tokenizer apply_chat_template for LLM.get_token_count() when a configured tokenizer supports it. With Qwen/Qwen2.5-0.5B-Instruct loaded through Transformers, direct chat-template tokenization counted 32 input IDs, but LLM.get_token_count() returned 10 on both main and this PR, showing the PR still falls back instead of using the chat-template count.

| Phase | Result |
| --- | --- |
| Environment Setup | uv run created the project environment; optional Transformers supplied with uv run --with transformers |
| CI Status | ⚠️ 37 success, 18 skipped, 1 failing (Validate PR description), 1 in progress (qa-changes) at check time |
| Functional Verification | ❌ Real Qwen AutoTokenizer chat-template counting did not change from main and did not match Transformers output |

Functional Verification

Test 1: Qwen chat-template token count through the public SDK

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 TRANSFORMERS_VERBOSITY=error uv run --with transformers python /tmp/qa_qwen_expected_vs_llm.py:

transformers_chat_template_input_ids=32
llm_get_token_count=10
matches_chat_template=False

This confirms the baseline bug: the configured tokenizer's rendered chat template counts 32 tokens, while LLM.get_token_count() returns the lower LiteLLM-style count.

Step 2 — Apply the PR's changes:
Checked out codex/chat-template-token-count at d3fc6baf9988d3319798bfd4ca4c6975528aeb25.

Step 3 — Re-run with the fix in place:
Ran git checkout codex/chat-template-token-count && OPENHANDS_SUPPRESS_BANNER=1 TRANSFORMERS_VERBOSITY=error uv run --with transformers python /tmp/qa_qwen_expected_vs_llm.py:

transformers_chat_template_input_ids=32
llm_get_token_count=10
matches_chat_template=False

This shows the PR did not change the real Qwen Transformers path: LLM.get_token_count() still does not match the chat-template tokenization.

Setup notes

Ran uv run python to create the repo .venv and install workspace packages. Since Transformers is optional and not part of the project environment by default, I used uv run --with transformers to exercise the PR's optional AutoTokenizer path as a user would when configuring a chat-template-aware tokenizer.

Issues Found

  • 🟠 Issue: LLM.get_token_count() still falls back to LiteLLM for a real Transformers AutoTokenizer (Qwen/Qwen2.5-0.5B-Instruct), so the PR does not achieve the promised chat-template-aware counting in this scenario.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

Comment thread openhands-sdk/openhands/sdk/llm/llm.py
@neubig neubig force-pushed the codex/chat-template-token-count branch from d3fc6ba to 6f3fa97 Compare June 9, 2026 20:35
@neubig neubig added qa-this review-this and removed review-this labels Jun 9, 2026
@neubig neubig requested a review from all-hands-bot June 9, 2026 21:23

all-hands-bot commented Jun 9, 2026

Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here

@all-hands-bot all-hands-bot left a comment

Collaborator

Code Review: [codex] Count tokens with tokenizer chat templates

🟡 Acceptable — Core logic is sound, minor improvements suggested.

[IMPROVEMENT OPPORTUNITIES]

  • [openhands-sdk/openhands/sdk/llm/llm.py, Line 579] Redundant Tokenizer Loading: _post_init already loads _tokenizer from custom_tokenizer, but _load_chat_template_tokenizer calls AutoTokenizer.from_pretrained(identifier) again. If the existing _tokenizer already has apply_chat_template, we could reuse it directly instead of reloading — avoiding an extra API call and potential network delay on startup.

  • [openhands-sdk/openhands/sdk/llm/llm.py, Line 2445] Expensive Deep Copy: _messages_for_chat_template calls copy.deepcopy(messages) which is expensive for large message histories. Since the function only modifies string content in text blocks, a shallow copy with individual field updates would be more efficient.

  • [openhands-sdk/openhands/sdk/llm/llm.py, Line 2420] Missing Indexable Check: In _count_tokenized_output, after isinstance(tokenized, Sequence), the code does tokenized[0] but doesn't guard against non-indexable Sequences. A hasattr(tokenized, '__getitem__') check would be safer.
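A compact way to count the different tokenized shapes discussed in these reviews might be the following. It is only a sketch under assumptions: `count_tokenized_output` here is a stand-alone function rather than the PR's method, dict-like results are assumed to expose `input_ids`, and `Encoding`-like objects are assumed to expose `ids`.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def count_tokenized_output(tokenized: Any) -> int:
    """Count token ids in a flat list, a batch of one, a mapping, or an Encoding."""
    # Strings are Sequences too, so reject them before any Sequence check.
    if isinstance(tokenized, (str, bytes)):
        raise TypeError("expected token ids, got rendered text")
    if isinstance(tokenized, Mapping):
        # Dict-like output such as a BatchEncoding.
        return count_tokenized_output(tokenized["input_ids"])
    ids = getattr(tokenized, "ids", None)
    if ids is not None:
        # Encoding-like object exposing its ids directly.
        return len(ids)
    if isinstance(tokenized, Sequence):
        if len(tokenized) == 0:
            return 0  # guard before indexing tokenized[0]
        first = tokenized[0]
        if isinstance(first, Sequence) and not isinstance(first, (str, bytes)):
            return len(first)  # batch containing a single conversation
        return len(tokenized)
    raise TypeError(f"unsupported tokenized output: {type(tokenized).__name__}")
```

Any `collections.abc.Sequence` already supports indexing, so the guard that matters before `tokenized[0]` is the empty case rather than a `__getitem__` check.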


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This PR adds an optional token counting path that falls back to the existing LiteLLM counter when the chat template approach fails. The implementation is well-contained to get_token_count() with no public API changes. The previous QA issue (BatchEncoding handling) has been addressed. Risk is minimal — worst case is the fallback path is used.


[VERDICT]

Worth merging: Core logic is sound, minor improvements suggested

[KEY INSIGHT]

The new chat-template token counting correctly addresses the real problem of LiteLLM undercounting tokens for local servers that render messages through a model-specific template. The graceful fallback ensures existing functionality is never broken.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.

@juanmichelini juanmichelini self-assigned this Jun 9, 2026

@all-hands-bot all-hands-bot left a comment

Collaborator

Code Review - PR #3593

🟡 Taste Rating: Acceptable - Functional implementation with some areas for improvement.


[IMPROVEMENT OPPORTUNITIES]

  • [openhands-sdk/openhands/sdk/llm/llm.py:2376] Complex Type Handling: _count_tokenized_output has 9 branches handling different tokenized output types. This is necessary for compatibility but creates a high cyclomatic complexity. Consider extracting the type-checking logic into a separate helper with a registry pattern for easier maintenance.

  • [openhands-sdk/openhands/sdk/llm/llm.py:2403] Deep Copy in Loop: _messages_for_chat_template uses copy.deepcopy(messages) which copies the entire messages structure. If messages are large, this could be expensive. Consider whether a shallow copy with selective deep copy of content blocks would suffice.


[STYLE NOTES]

  • The docstring on _get_chat_template_token_count is thorough and explains the "why" well. Keep this standard for new public-ish methods.

[TESTING GAPS]

Good test coverage: Tests cover transformer module loading, chat template tokenizer preference, various tokenized output types, and graceful fallback. No blocking testing gaps identified.


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

The change is additive with graceful fallback. The new _chat_template_tokenizer attribute is initialized only when custom_tokenizer is set. The transformers module is imported lazily. No breaking changes to existing APIs.
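The lazy import and the narrower exception handling suggested in an earlier review could be combined roughly like this. It is a sketch under assumptions: `load_chat_template_tokenizer` and its `module_name` parameter are hypothetical, and `OSError` is only the most common failure from `AutoTokenizer.from_pretrained`, not necessarily the only one.

```python
import importlib
from typing import Any, Optional


def load_chat_template_tokenizer(
    identifier: str, module_name: str = "transformers"
) -> Optional[Any]:
    """Return a tokenizer with apply_chat_template, or None if unavailable."""
    try:
        # Imported only when a custom tokenizer is actually configured.
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # optional dependency not installed
    auto_tokenizer = getattr(module, "AutoTokenizer", None)
    if auto_tokenizer is None:
        return None
    try:
        tokenizer = auto_tokenizer.from_pretrained(identifier)
    except OSError:
        # Typically an unknown tokenizer id or missing files.
        return None
    if not callable(getattr(tokenizer, "apply_chat_template", None)):
        return None
    return tokenizer
```

Returning `None` for every expected failure lets the caller fall back to the generic counter, while unexpected errors still propagate.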


VERDICT:
Worth merging - Core logic is sound. The implementation correctly handles the problem of local OpenAI-compatible servers using model-specific chat templates for tokenization.

KEY INSIGHT:
The real value here is providing accurate token counts for local models where LiteLLM's generic estimation may diverge from the actual tokenizer behavior once chat templates are applied.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.

@all-hands-bot all-hands-bot left a comment

Collaborator

⚠️ QA Report: PASS WITH ISSUES

The changed SDK token-count path works as intended in a real LLM.get_token_count() run with a Qwen chat-template tokenizer, but the PR still has a non-functional CI/process failure.

Does this PR achieve its stated goal?

Yes. The stated goal is to make LLM.get_token_count() prefer a configured tokenizer's chat template so condenser decisions reflect the actual rendered prompt for local OpenAI-compatible backends. I reproduced the old undercount on origin/main with a real Qwen tokenizer and FinishTool schema (791 SDK tokens vs 910 chat-template tokens), then reran the same SDK call on this PR and got an exact match (910 vs 910). I also verified the no-custom-tokenizer fallback still returns a normal LiteLLM count.

| Phase | Result |
| --- | --- |
| Environment Setup | make build completed successfully and installed the uv environment |
| CI Status | ⚠️ gh pr checks showed 35 successful, 1 failing (PR Description Check), 1 pending (QA Changes by OpenHands), 16 skipped |
| Functional Verification | ✅ Chat-template counting matched the independently rendered tokenizer count; fallback count remained positive |

Functional Verification

Test 1: Chat-template tokenizer count fixes the undercount

Step 1 — Reproduce / establish baseline without the fix:

Checked out base with git checkout --detach origin/main, then ran a real SDK token-count script:

OPENHANDS_SUPPRESS_BANNER=1 uv run --with transformers python - <<'PY'
# Script instantiated LLM(model="gpt-4o-mini", custom_tokenizer="Qwen/Qwen2.5-0.5B-Instruct"),
# passed SDK Message objects plus FinishTool.create() into LLM.get_token_count(),
# and independently counted AutoTokenizer.apply_chat_template(..., tools=[tool.to_openai_tool()], tokenize=True).
PY

Relevant output:

tokenizer=Qwen/Qwen2.5-0.5B-Instruct
tokenized_type=BatchEncoding
tool_count=1
sdk_count=791
chat_template_count=910
matches_chat_template=False
delta=-119

This confirms the bug exists on base: the SDK token counter undercounts the same rendered Qwen chat-template prompt by 119 tokens when tool schema rendering is included.

Step 2 — Apply the PR's changes:

Checked out the PR branch at 6f3fa97ab738611059290a05812213665015c373.

Step 3 — Re-run with the fix in place:

Ran the same SDK/tokenizer comparison command on the PR branch.

Relevant output:

tokenizer=Qwen/Qwen2.5-0.5B-Instruct
tokenized_type=BatchEncoding
tool_count=1
sdk_count=910
chat_template_count=910
matches_chat_template=True
delta=0

This confirms the PR fixes the stated mismatch for a real tokenizer chat-template path: LLM.get_token_count() now returns the same count as the independently rendered tokenizer prompt.

Test 2: LiteLLM fallback still works without a custom tokenizer

On the PR branch, ran a real SDK count without custom_tokenizer:

OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# Script instantiated LLM(model="gpt-4o-mini") and called get_token_count()
# with SDK Message objects and FinishTool.create().
PY

Relevant output:

custom_tokenizer=None
fallback_count=292
fallback_count_positive=True

This shows the non-chat-template fallback path remains usable for normal SDK callers who do not configure a custom tokenizer.

Issues Found

  • 🔴 Blocker (process, not functional): PR Description Check/Validate PR description is failing. The human-only PR description section is still the template placeholder and the human-tested checkbox is unchecked; I did not edit those reserved fields.

This review was created by an AI agent (OpenHands) on behalf of the user.

Labels

qa-this review-this This label triggers a PR review by OpenHands

3 participants