[codex] Count tokens with tokenizer chat templates#3593
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: [codex] Count tokens with tokenizer chat templates
🟡 Acceptable — Functional implementation with a few areas for improvement.
[CRITICAL ISSUES]
None identified. The implementation is sound and handles edge cases appropriately.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-sdk/openhands/sdk/llm/llm.py, Line 2364] Edge Case Handling: The
_messages_for_chat_templatemethod clearscontentto empty list when encountering non-text content blocks in a mixed content list. This may silently discard image blocks or other content types. Consider preserving the original content format if non-text blocks are found, rather than clearing. -
[openhands-sdk/openhands/sdk/llm/llm.py, Line 2384] Exception Handling: Catching
Exceptionis broad. Consider catching specific exceptions (ImportError, OSError) that are more likely when loading pretrained tokenizers, and let unexpected errors propagate. -
[openhands-sdk/openhands/sdk/llm/llm.py, Line 2360] Type Handling: The
_count_tokenized_outputmethod handles multiple types butisinstance(tokenized, Sequence)matches strings. This could cause issues if a tokenizer returns a string directly (though you handle strings before this check, so it should be fine).
[STYLE NOTES]
- [openhands-sdk/openhands/sdk/llm/llm.py, Lines 2340-2342] Comment: The comment block is clear and explains the motivation well. No issues here.
[TESTING GAPS]
-
[tests/sdk/llm/test_llm.py] Test Coverage: Tests cover the happy path and fallback behavior well. Consider adding a test for
_messages_for_chat_templatewith mixed content (e.g., one text block, one image block) to verify the edge case behavior. -
[tests/sdk/llm/test_llm.py] Test Coverage: No test for when
custom_tokenizerisNonebut_chat_template_tokenizermight be set (though this should not happen in normal flow).
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
The implementation is well-isolated:
- Falls back gracefully to LiteLLM when chat template tokenizer is unavailable
- All new methods are private and optional
- No breaking changes to existing public API
- Proper exception handling throughout
- Tests cover the fallback path
The only theoretical risk is if a tokenizer's apply_chat_template produces substantially different output than expected, but this would only affect token counting accuracy (not correctness), and the fallback handles failures.
VERDICT:
✅ Worth merging: Core logic is sound, minor improvements suggested.
KEY INSIGHT:
This is a well-scoped feature that adds token counting accuracy for local OpenAI-compatible servers by using the tokenizer's native chat template rather than LiteLLM's generic estimation. The graceful fallback ensures existing functionality is preserved.
_This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
❌ QA Report: FAIL
The real SDK token-counting path still returned the LiteLLM fallback count for a Qwen Transformers chat-template tokenizer, so the PR did not deliver its stated behavior in this smoke test.
Does this PR achieve its stated goal?
No. The stated goal is to prefer tokenizer apply_chat_template for LLM.get_token_count() when a configured tokenizer supports it. With Qwen/Qwen2.5-0.5B-Instruct loaded through Transformers, direct chat-template tokenization counted 32 input IDs, but LLM.get_token_count() returned 10 on both main and this PR, showing the PR still falls back instead of using the chat-template count.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run created the project environment; optional Transformers supplied with uv run --with transformers |
| CI Status | Validate PR description), 1 in progress (qa-changes) at check time |
| Functional Verification | ❌ Real Qwen AutoTokenizer chat-template counting did not change from main and did not match Transformers output |
Functional Verification
Test 1: Qwen chat-template token count through the public SDK
Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 TRANSFORMERS_VERBOSITY=error uv run --with transformers python /tmp/qa_qwen_expected_vs_llm.py:
transformers_chat_template_input_ids=32
llm_get_token_count=10
matches_chat_template=False
This confirms the baseline bug: the configured tokenizer's rendered chat template counts 32 tokens, while LLM.get_token_count() returns the lower LiteLLM-style count.
Step 2 — Apply the PR's changes:
Checked out codex/chat-template-token-count at d3fc6baf9988d3319798bfd4ca4c6975528aeb25.
Step 3 — Re-run with the fix in place:
Ran git checkout codex/chat-template-token-count && OPENHANDS_SUPPRESS_BANNER=1 TRANSFORMERS_VERBOSITY=error uv run --with transformers python /tmp/qa_qwen_expected_vs_llm.py:
transformers_chat_template_input_ids=32
llm_get_token_count=10
matches_chat_template=False
This shows the PR did not change the real Qwen Transformers path: LLM.get_token_count() still does not match the chat-template tokenization.
Setup notes
Ran uv run python to create the repo .venv and install workspace packages. Since Transformers is optional and not part of the project environment by default, I used uv run --with transformers to exercise the PR's optional AutoTokenizer path as a user would when configuring a chat-template-aware tokenizer.
Issues Found
- 🟠 Issue:
LLM.get_token_count()still falls back to LiteLLM for a real TransformersAutoTokenizer(Qwen/Qwen2.5-0.5B-Instruct), so the PR does not achieve the promised chat-template-aware counting in this scenario.
This QA review was created by an AI agent (OpenHands) on behalf of the user.
d3fc6ba to
6f3fa97
Compare
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: [codex] Count tokens with tokenizer chat templates
🟡 Acceptable — Core logic is sound, minor improvements suggested.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-sdk/openhands/sdk/llm/llm.py, Line 579] Redundant Tokenizer Loading:
_post_initalready loads_tokenizerfromcustom_tokenizer, but_load_chat_template_tokenizercallsAutoTokenizer.from_pretrained(identifier)again. If the existing_tokenizeralready hasapply_chat_template, we could reuse it directly instead of reloading — avoiding an extra API call and potential network delay on startup. -
[openhands-sdk/openhands/sdk/llm/llm.py, Line 2445] Expensive Deep Copy:
_messages_for_chat_templatecallscopy.deepcopy(messages)which is expensive for large message histories. Since the function only modifies string content in text blocks, a shallow copy with individual field updates would be more efficient. -
[openhands-sdk/openhands/sdk/llm/llm.py, Line 2420] Missing Indexable Check: In
_count_tokenized_output, afterisinstance(tokenized, Sequence), the code doestokenized[0]but doesn't guard against non-indexable Sequences. Ahasattr(tokenized, '__getitem__')check would be safer.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This PR adds an optional token counting path that falls back to the existing LiteLLM counter when the chat template approach fails. The implementation is well-contained to get_token_count() with no public API changes. The previous QA issue (BatchEncoding handling) has been addressed. Risk is minimal — worst case is the fallback path is used.
[VERDICT]
✅ Worth merging: Core logic is sound, minor improvements suggested
[KEY INSIGHT]
The new chat-template token counting correctly addresses the real problem of LiteLLM undercounting tokens for local servers that render messages through a model-specific template. The graceful fallback ensures existing functionality is never broken.
_This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review - PR #3593
🟡 Taste Rating: Acceptable - Functional implementation with some areas for improvement.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-sdk/openhands/sdk/llm/llm.py:2376] Complex Type Handling:
_count_tokenized_outputhas 9 branches handling different tokenized output types. This is necessary for compatibility but creates a high cyclomatic complexity. Consider extracting the type-checking logic into a separate helper with a registry pattern for easier maintenance. -
[openhands-sdk/openhands/sdk/llm/llm.py:2403] Deep Copy in Loop:
_messages_for_chat_templateusescopy.deepcopy(messages)which copies the entire messages structure. If messages are large, this could be expensive. Consider whether a shallow copy with selective deep copy of content blocks would suffice.
[STYLE NOTES]
- The docstring on
_get_chat_template_token_countis thorough and explains the "why" well. Keep this standard for new public-ish methods.
[TESTING GAPS]
✅ Good test coverage: Tests cover transformer module loading, chat template tokenizer preference, various tokenized output types, and graceful fallback. No blocking testing gaps identified.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
The change is additive with graceful fallback. The new _chat_template_tokenizer attribute is initialized only when custom_tokenizer is set. The transformers module is imported lazily. No breaking changes to existing APIs.
VERDICT:
✅ Worth merging - Core logic is sound. The implementation correctly handles the problem of local OpenAI-compatible servers using model-specific chat templates for tokenization.
KEY INSIGHT:
The real value here is providing accurate token counts for local models where LiteLLM's generic estimation may diverge from the actual tokenizer behavior once chat templates are applied.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
The changed SDK token-count path works as intended in a real LLM.get_token_count() run with a Qwen chat-template tokenizer, but the PR still has a non-functional CI/process failure.
Does this PR achieve its stated goal?
Yes. The stated goal is to make LLM.get_token_count() prefer a configured tokenizer's chat template so condenser decisions reflect the actual rendered prompt for local OpenAI-compatible backends. I reproduced the old undercount on origin/main with a real Qwen tokenizer and FinishTool schema (791 SDK tokens vs 910 chat-template tokens), then reran the same SDK call on this PR and got an exact match (910 vs 910). I also verified the no-custom-tokenizer fallback still returns a normal LiteLLM count.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully and installed the uv environment |
| CI Status | gh pr checks showed 35 successful, 1 failing (PR Description Check), 1 pending (QA Changes by OpenHands), 16 skipped |
| Functional Verification | ✅ Chat-template counting matched the independently rendered tokenizer count; fallback count remained positive |
Functional Verification
Test 1: Chat-template tokenizer count fixes the undercount
Step 1 — Reproduce / establish baseline without the fix:
Checked out base with git checkout --detach origin/main, then ran a real SDK token-count script:
OPENHANDS_SUPPRESS_BANNER=1 uv run --with transformers python - <<'PY'
# Script instantiated LLM(model="gpt-4o-mini", custom_tokenizer="Qwen/Qwen2.5-0.5B-Instruct"),
# passed SDK Message objects plus FinishTool.create() into LLM.get_token_count(),
# and independently counted AutoTokenizer.apply_chat_template(..., tools=[tool.to_openai_tool()], tokenize=True).
PYRelevant output:
tokenizer=Qwen/Qwen2.5-0.5B-Instruct
tokenized_type=BatchEncoding
tool_count=1
sdk_count=791
chat_template_count=910
matches_chat_template=False
delta=-119
This confirms the bug exists on base: the SDK token counter undercounts the same rendered Qwen chat-template prompt by 119 tokens when tool schema rendering is included.
Step 2 — Apply the PR's changes:
Checked out the PR branch at 6f3fa97ab738611059290a05812213665015c373.
Step 3 — Re-run with the fix in place:
Ran the same SDK/tokenizer comparison command on the PR branch.
Relevant output:
tokenizer=Qwen/Qwen2.5-0.5B-Instruct
tokenized_type=BatchEncoding
tool_count=1
sdk_count=910
chat_template_count=910
matches_chat_template=True
delta=0
This confirms the PR fixes the stated mismatch for a real tokenizer chat-template path: LLM.get_token_count() now returns the same count as the independently rendered tokenizer prompt.
Test 2: LiteLLM fallback still works without a custom tokenizer
On the PR branch, ran a real SDK count without custom_tokenizer:
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# Script instantiated LLM(model="gpt-4o-mini") and called get_token_count()
# with SDK Message objects and FinishTool.create().
PYRelevant output:
custom_tokenizer=None
fallback_count=292
fallback_count_positive=True
This shows the non-chat-template fallback path remains usable for normal SDK callers who do not configure a custom tokenizer.
Issues Found
- 🔴 Blocker (process, not functional):
PR Description Check/Validate PR descriptionis failing. The human-only PR description section is still the template placeholder and the human-tested checkbox is unchecked; I did not edit those reserved fields.
This review was created by an AI agent (OpenHands) on behalf of the user.
HUMAN:
AGENT:
Why
The LLMSummarizingCondenser uses
LLM.get_token_count()to decide when to condense. For OpenAI-compatible local servers, the serving backend may render messages and tool schemas through the model chat template before tokenization. LiteLLM's generic token counter can undercount that rendered prompt. In the Qwen/GGUF smoke run, LiteLLM estimated a rejected request at 55,615 tokens while llama.cpp counted about 68.7k.Summary
apply_chat_templateforLLM.get_token_count()when a configured tokenizer supports it.BatchEncoding/Encodingtokenized outputs.Issue Number
N/A
How to Test
Automated validation run locally:
uv run pytest tests/sdk/llm/test_llm.py -k 'token_counting or chat_template_tokenizer or tokenized_output' uv run pytest tests/sdk/context/condenser/test_utils.py tests/sdk/context/condenser/test_llm_summarizing_condenser.py uv run ruff check openhands-sdk/openhands/sdk/llm/llm.py tests/sdk/llm/test_llm.py uv run ruff format --check openhands-sdk/openhands/sdk/llm/llm.py tests/sdk/llm/test_llm.pyManual smoke testing against the live Agent Canvas/Lemonade Qwen setup:
Results:
Agent reached maximum iterations limit (40).Agent reached maximum iterations limit (80).linear__list_issuescalls and onemcp.linear.appCloudflare 1101 worker exception. This PR fixes the condenser token-count mismatch, but it does not make Qwen complete the whole daily workflow.Smoke artifacts are under:
Video/Screenshots
N/A. This is a backend token-counting fix; the smoke-test logs above are the reproduction evidence.
Type
Notes
This is model-agnostic: it uses
apply_chat_templatewhen available and falls back to the existing LiteLLM counter otherwise. To exercise this path in a local Qwen setup,custom_tokenizermust point to a tokenizer id with a chat template, such asQwen/Qwen3.6-35B-A3B-FP8, andtransformersmust be available.Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:6f3fa97-pythonRun
All tags pushed for this build
About Multi-Architecture Support
6f3fa97-python) is a multi-arch manifest supporting both amd64 and arm646f3fa97-python-amd64) are also available if needed