DRAFT: feat(terminal): reject Python/JSON literals passed as bash commands#3582
DRAFT: feat(terminal): reject Python/JSON literals passed as bash commands#3582juanmichelini wants to merge 2 commits into
Conversation
When the model emits a tool call where `command` is a Python/JSON
literal — e.g. `[{'default': {...}}, ['apps'], ['code...']]` — bash
returns a useless "command not found" and the model rarely
self-corrects. Trajectory analysis of a SWE-Bench Nemotron run showed
~12% of terminal calls (18 of 148) burning context this way, with
cascading cost growth because failed turns stay in history.
This PR adds a pre-execution guard in `TerminalExecutor` that:
* Detects the literal head pattern (`[{`, `[[`, `["`, `['`,
`{"`, `{'`) while carefully exempting real bash uses of `[`,
`[[`, and `{` (which are always followed by whitespace).
* Returns a `TerminalObservation(is_error=True)` whose text names the
literal kind, shows the offending head, and points the model at the
two recovery patterns: `file_editor create` + `python /tmp/x.py`,
or an inline heredoc.
* Emits one structured WARN log per rejection so eval pipelines can
grep for the pattern.
* Skips the check when `is_input=True` (raw bytes forwarded to a
running process, e.g. `C-c` keystrokes).
The detection lives in `definition.py` next to `TerminalAction` as
`looks_like_python_literal_argument(command) -> str | None`, exported
for reuse and individually unit-tested.
Tests: `tests/tools/terminal/test_literal_arg_guard.py` — 24 new
tests covering the heuristic matrix (real malformed samples, synthetic
variants, bash false-positives like `[ -f x ]` and `{ ls; }`), the
executor returning the structured error without hitting the shell,
`is_input=True` bypass, and WARN log emission. Existing terminal
parsing/observation suite (35 tests) passes unchanged.
Lint+pyright clean.
Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
|||||||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
The terminal tool now rejects top-level Python/JSON literal command values with an actionable tool-argument error while preserving normal shell usage.
Does this PR achieve its stated goal?
Yes. I reproduced the old behavior on origin/main: the same malformed literal was sent to bash and returned [{default:: command not found with exit code 127. On PR commit 56588a8f, the same real TerminalExecutor call returned is_error=True, exit_code=None, and a structured [Tool-argument error] hint instead; bash test expressions, bash groups, heredocs, JSON-shaped command arguments, and is_input=True payloads still executed/bypassed as expected.
| Phase | Result |
|---|---|
| Environment Setup | ✅ uv run created/synced .venv and built the local packages needed to import and execute the terminal tool. |
| CI Status | Validate PR description is failing and qa-changes is in progress; most other visible checks are passing or skipped. |
| Functional Verification | ✅ Real terminal tool execution showed the malformed-command fix works and key allowed shell/input paths still work. |
Functional Verification
Test 1: Malformed Python/JSON literal is rejected before bash
Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main (50d73244) and ran the actual terminal tool path:
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
from pathlib import Path
from openhands.tools.terminal.definition import TerminalAction
from openhands.tools.terminal.impl import TerminalExecutor
executor = TerminalExecutor(working_dir=Path.cwd())
obs = executor(TerminalAction(command="[{'default': {'ENGINE': 'sqlite3'}}]"))
print('is_error=', obs.is_error)
print('exit_code=', obs.exit_code)
print('text=')
print(obs.text.strip())
PYObserved:
is_error= False
exit_code= 127
text=
[{default:: command not found
This confirms the bug exists on main: the malformed structured literal reaches bash and produces a generic command-not-found error.
Step 2 — Apply the PR's changes:
Checked out PR commit 56588a8f672328d658bf7114b1b510169a7b8d6e.
Step 3 — Re-run with the fix in place:
Ran the same terminal tool call on the PR commit. Observed:
WARNING ... Rejected terminal call: command argument looks like a Python/JSON list literal ...
is_error= True
exit_code= None
text=
[Tool-argument error] The `command` argument looks like a Python/JSON list literal, not a shell command. It starts with: "[{'default': {'ENGINE': 'sqlite3'}}]"
The `terminal` tool runs exactly ONE shell command at a time. To pass structured data or multi-line code:
- Write a script first with `file_editor` (command="create", path="/tmp/run.py", ...), then invoke it: `python /tmp/run.py`.
- Or use a heredoc inline, e.g.:
python - <<'EOF'
DATABASES = {'default': {...}}
# your code here
EOF
This shows the PR achieves the main goal: the literal is caught pre-execution, returns an actionable hint, and emits the promised warning.
Test 2: Legitimate shell commands are not blocked
On both origin/main and the PR commit, I ran the actual terminal tool with commands that should remain valid:
[ -f pyproject.toml ] && echo FOUND
python - <<'EOF'
print({'default': {'ENGINE': 'sqlite3'}}['default']['ENGINE'])
EOFOn the PR commit, the outputs were:
=== bash_test ===
is_error= False
exit_code= 0
text=
FOUND
=== json_argument ===
is_error= False
exit_code= 0
text=
sqlite3
I also sampled additional claimed cases on the PR commit:
=== dict_literal ===
is_error= True
exit_code= None
text=
[Tool-argument error] The `command` argument looks like a Python/JSON dict literal, not a shell command. It starts with: '{"key": "value"}'
=== nested_list_literal ===
is_error= True
exit_code= None
text=
[Tool-argument error] The `command` argument looks like a Python/JSON nested list literal, not a shell command. It starts with: "[['col1', 'col2']]"
=== bash_double_bracket ===
is_error= False
exit_code= 0
text=
DOUBLE_OK
=== bash_group ===
is_error= False
exit_code= 0
text=
GROUP_OK
This confirms the guard catches representative list/dict/nested-list literals while allowing common bash [ ... ], [[ ... ]], { ...; }, heredoc, and JSON-as-argument patterns.
Test 3: is_input=True bypass still allows literal payloads
I started a real subprocess terminal cat process, sent a Python-literal-looking payload as interactive input, then interrupted it:
=== start_cat ===
is_error= False
exit_code= -1
text=
=== is_input_literal_payload ===
is_error= False
exit_code= -1
text=
[{'sent_as_keystrokes': True}]
=== stop_cat ===
is_error= False
exit_code= 130
text=
^C
This shows literal-looking text sent as keystrokes was forwarded to the running process instead of being rejected by the new guard.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
|
This was tested here https://openhands-eval-monitor.vercel.app/?days=15&run=swebench%2Flitellm_proxy-nemotron-3-ultra-550b-a55b%2F27224350936%2F It does improve acceptance rate to 100% |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: feat(terminal): reject Python/JSON literals passed as bash commands
Taste Rating: 🟡 Acceptable — Works, solves a real problem, but has rough edges worth polishing.
[CRITICAL ISSUES]
None found. No breaking changes, no security risks.
[IMPROVEMENT OPPORTUNITIES]
-
[openhands-tools/openhands/tools/terminal/definition.py, Lines 39-55] Verbose template string: The
_LITERAL_ARG_HINT_TEMPLATEuses{{escaping for f-strings, making it harder to read. Consider usingtextwrap.dedentor moving the template to a separate constant without f-string interpolation (interpolate only at call time). -
[openhands-tools/openhands/tools/terminal/impl.py, Lines 546-569] Guard block indentation: The guard logic is nested under
if not action.is_input:which is itself inside__call__. Consider extracting to a small helper function for clarity to reduce nesting levels. -
[tests/tools/terminal/test_literal_arg_guard.py] Consider adding explicit edge case: A command starting with
[[followed by whitespace (valid bash) should be tested explicitly rather than relying on the absence of that pattern.
[TESTING GAPS]
The test suite is solid. The sentinel-based executor_without_shell fixture is an excellent pattern — it proves the guard intercepts before shell execution. No gaps found.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This is a defensive guard that only fires when the model produces malformed tool calls. It never blocks legitimate commands (confirmed by comprehensive negative test cases). The is_input=True bypass is correctly implemented. No user-facing API changes. Tests provide high confidence.
VERDICT:
✅ Worth merging — The core logic is sound and solves a real problem observed in production eval runs. Polish the template readability and consider the helper extraction for long-term maintainability.
KEY INSIGHT:
The sentinel-based testing approach (stubbing shell paths to raise) is the right pattern for this — it proves the guard runs before shell execution without needing a real subprocess.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
⚠️ QA Report: PASS WITH ISSUES
I functionally verified the terminal executor now rejects Python/JSON literal commands with actionable hints while preserving ordinary shell commands and interactive input.
Does this PR achieve its stated goal?
Yes. On origin/main, sending "[{'default': {'ENGINE': 'sqlite3'}}]" through the real TerminalExecutor reached bash and returned [{default:: command not found with exit code 127. On PR commit c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae, the same executor call returned a structured [Tool-argument error] mentioning list literal, included recovery guidance, set exit_code: null, and emitted the expected Rejected terminal call warning. I also verified legitimate [ ... ], [[ ... ]], JSON-as-argument commands, and is_input=True interactive input continue to work.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; no tests/linters were run locally. |
| CI Status | REST API (OpenAPI) was failing and several checks were still in progress when checked; many core checks were green. |
| Functional Verification | ✅ Real TerminalExecutor usage showed the fix works and did not block tested valid shell/input cases. |
Functional Verification
Test 1: Malformed Python list/dict literal is rejected before bash
Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main (16ad9e13) and ran the real terminal executor with a malformed command.
Relevant output:
{"case": "literal_list_of_dict", "command": "[{'default': {'ENGINE': 'sqlite3'}}]", "is_error": false, "exit_code": 127, "text": "\u001b[?2004l[{default:: command not found\n\u001b[?2004h"}This confirms the old behavior: the malformed structured literal reached bash and produced the confusing command not found error.
Step 2 — Apply the PR's changes:
Checked out openhands/sdk-5-bash-literal-arg-guard at c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae.
Step 3 — Re-run with the fix in place:
Ran the same executor scenario on the PR branch.
Relevant output:
WARNING ... Rejected terminal call: command argument looks like a Python/JSON list literal ... Returning structured hint to the model instead of executing.
{"case": "literal_list_of_dict", "command": "[{'default': {'ENGINE': 'sqlite3'}}]", "is_error": true, "exit_code": null, "text": "[Tool-argument error] The `command` argument looks like a Python/JSON list literal, not a shell command. It starts with: "[{'default': {'ENGINE': 'sqlite3'}}]" ... Write a script first with `file_editor` ... Or use a heredoc inline ..."}This shows the PR achieves the core goal: the literal is rejected pre-execution with actionable guidance instead of being passed to bash.
Test 2: Legitimate shell syntax and JSON-shaped arguments still run
Step 1 — Baseline without the fix:
On origin/main, the same executor accepted a bash test expression and a JSON-shaped argument:
{"case": "bash_test_expression", "command": "[ -f /tmp/oh_qa_missing_file ]; echo status:$?", "is_error": false, "exit_code": 0, "text": "\u001b[?2004lstatus:1\n\u001b[?2004h"}
{"case": "echo_json_argument", "command": "printf '%s\\n' '{"a": 1}'", "is_error": false, "exit_code": 0, "text": "\u001b[?2004l{"a": 1}\n\u001b[?2004h"}Step 2 — Apply the PR's changes:
Returned to the PR branch.
Step 3 — Re-run with the fix in place:
The same commands still executed normally:
{"case": "bash_test_expression", "command": "[ -f /tmp/oh_qa_missing_file ]; echo status:$?", "is_error": false, "exit_code": 0, "text": "\u001b[?2004lstatus:1\n\u001b[?2004h"}
{"case": "echo_json_argument", "command": "printf '%s\\n' '{"a": 1}'", "is_error": false, "exit_code": 0, "text": "\u001b[?2004l{"a": 1}\n\u001b[?2004h"}
{"case": "bash_double_bracket", "is_input": false, "is_error": false, "exit_code": 0, "text": "\u001b[?2004lstatus:0\n\u001b[?2004h"}This shows the guard did not block the tested valid shell forms or commands with JSON-looking arguments.
Test 3: Dict/nested-list literals and interactive input behavior
Step 1 — Baseline without the fix:
Before the PR, top-level structured literals were sent to the shell, as shown by Test 1's command not found result. is_input=True had no literal guard to trip.
Step 2 — Apply the PR's changes:
Ran additional real executor scenarios on the PR branch.
Step 3 — Re-run with the fix in place:
Dict and nested-list literals were rejected with their specific kinds:
{"case": "json_dict_literal", "is_input": false, "is_error": true, "exit_code": null, "text": "[Tool-argument error] The `command` argument looks like a Python/JSON dict literal, not a shell command..."}
{"case": "nested_list_literal", "is_input": false, "is_error": true, "exit_code": null, "text": "[Tool-argument error] The `command` argument looks like a Python/JSON nested list literal, not a shell command..."}For is_input=True, I started an interactive read command and then sent a literal-looking payload as input:
{"case": "start_read", "is_error": false, "exit_code": -1, "text": "\u001b[?2004l"}
{"case": "send_literal_input", "is_error": false, "exit_code": 0, "text": "captured:[{'sent_as_keystrokes': True}]\n\u001b[?2004h"}This confirms the guard does not reject literal-looking bytes when they are interactive input to a running process.
Issues Found
- 🟠 CI status:
REST API (OpenAPI)was failing and several checks were still in progress when checked. I did not find a functional terminal-behavior issue, but the PR is not fully green yet.
This review was created by an AI agent (OpenHands) on behalf of the user.
Final verdict: PASS WITH ISSUES
TL;DR
When the model packs structured data into the
terminaltool'scommandfield — e.g.[{'default': {...}}, ['apps'], ['code...']]— bash returnscommand not foundand the model rarely self-corrects. This PR catches the malformation pre-execution and replies with an actionable hint instead.Why
Trajectory analysis of a SWE-Bench Verified Nemotron 550B run (the one resolved instance in
swebench/litellm_proxy-nemotron-3-ultra-550b-a55b/27176499467) showed:148 actions to solve a
<15 mindifficulty Django bug (a competent agent does it in ~30–50).18 of those 148 turns (12%) were
terminalcalls wherecommandwas a Python/JSON literal:{"command": "[{'default': {'ENGINE': 'django.db.backends.sqlite3'}}, ['django.contrib.auth'], [\"field.column]\\n...\"]"}Bash responded
bash: [{default:: command not found. The model treated the error as a bug in its reproduction script (not its tool call) and rewrote the script 12 different times.Each malformed turn cost
$0.025–$0.04 directly, plus a cascading context-bloat tax (the failed call + error stay in conversation history forever, paid at every subsequent turn on non-caching models). Estimated total waste on this single trajectory: **$0.70**, or roughly 18% of the $3.97 spend.The exact malformed indices in the run were 41, 50, 58, 59, 65–68, 76, 77, 87, 92, 95, 125, 135, 136, 141, 142.
What
1. Heuristic:
looks_like_python_literal_argument(command) -> str | NoneIn
openhands-tools/openhands/tools/terminal/definition.py, next toTerminalAction. Returns one of"list literal","nested list literal","dict literal", orNone.Detection rules — deliberately conservative, exempting all real bash uses:
[{["['[[followed by non-space[[...]][[(with space)[[ -f x ]][(with space)[ -f x ]{"{'{(with space){ ls; }echo '[1, 2]',curl -d '{"a":1}'2. Pre-execution guard in
TerminalExecutor.__call__When the heuristic fires:
TerminalObservation(is_error=True)whose text:terminalruns one shell command.file_editor create→python /tmp/x.py, or an inline heredoc.logger.warning(...)so eval pipelines can grep for the pattern (e.g.grep "Rejected terminal call" *.log | wc -l).is_input=True(raw bytes forwarded to a running process, e.g.C-ckeystrokes, where literals are legal payload).3. Tests
tests/tools/terminal/test_literal_arg_guard.py— 24 tests:[,[[,{,echo,curl, edge-cases (empty / single char).[ -f /tmp/foo ]do reach the shell;is_input=Truebypasses the guard; the WARN log fires with the literal kind.All 24 pass. Existing 35 terminal parsing/observation tests pass unchanged.
Risk / blast radius
[ -f,[[ -z, or{ cmd; }always has whitespace as the second char, so the heuristic exempts them by construction.TerminalObservation(is_error=True)path that the framework already handles for tool-execution errors.is_input=Truecarve-out preserves the keystroke-forwarding contract.[{— vanishingly rare in shells, and the model can disambiguate by adding a leading space or quoting. The hint text explicitly tells them what to do.How to verify locally
uv run pytest tests/tools/terminal/test_literal_arg_guard.py -q # 24 passedTo see the pattern this addresses in a real trajectory, the per-turn malformed-call indices for the Nemotron 550B run linked above are
[41, 50, 58, 59, 65, 66, 67, 68, 76, 77, 87, 92, 95, 125, 135, 136, 141, 142]. Each one returnedbash: ...: command not foundfrom a structured-literalcommand.This PR was created by an AI agent (OpenHands) on behalf of the user investigating high per-instance cost in eval run
swebench/litellm_proxy-nemotron-3-ultra-550b-a55b/27176499467.@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:c1799d2-pythonRun
All tags pushed for this build
About Multi-Architecture Support
c1799d2-python) is a multi-arch manifest supporting both amd64 and arm64c1799d2-python-amd64) are also available if needed