Skip to content

DRAFT: feat(terminal): reject Python/JSON literals passed as bash commands#3582

Open
juanmichelini wants to merge 2 commits into
mainfrom
openhands/sdk-5-bash-literal-arg-guard
Open

DRAFT: feat(terminal): reject Python/JSON literals passed as bash commands#3582
juanmichelini wants to merge 2 commits into
mainfrom
openhands/sdk-5-bash-literal-arg-guard

Conversation

@juanmichelini

@juanmichelini juanmichelini commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

TL;DR

When the model packs structured data into the terminal tool's command field — e.g. [{'default': {...}}, ['apps'], ['code...']] — bash returns command not found and the model rarely self-corrects. This PR catches the malformation pre-execution and replies with an actionable hint instead.

Why

Trajectory analysis of a SWE-Bench Verified Nemotron 550B run (the one resolved instance in swebench/litellm_proxy-nemotron-3-ultra-550b-a55b/27176499467) showed:

  • 148 actions to solve a <15 min difficulty Django bug (a competent agent does it in ~30–50).

  • 18 of those 148 turns (12%) were terminal calls where command was a Python/JSON literal:

    {"command": "[{'default': {'ENGINE': 'django.db.backends.sqlite3'}}, ['django.contrib.auth'], [\"field.column]\\n...\"]"}

    Bash responded bash: [{default:: command not found. The model treated the error as a bug in its reproduction script (not its tool call) and rewrote the script 12 different times.

  • Each malformed turn cost $0.025–$0.04 directly, plus a cascading context-bloat tax (the failed call + error stay in conversation history forever, paid at every subsequent turn on non-caching models). Estimated total waste on this single trajectory: **$0.70**, or roughly 18% of the $3.97 spend.

The exact malformed indices in the run were 41, 50, 58, 59, 65–68, 76, 77, 87, 92, 95, 125, 135, 136, 141, 142.

What

1. Heuristic: looks_like_python_literal_argument(command) -> str | None

In openhands-tools/openhands/tools/terminal/definition.py, next to TerminalAction. Returns one of "list literal", "nested list literal", "dict literal", or None.

Detection rules — deliberately conservative, exempting all real bash uses:

Head Verdict Reason
[{ ❌ list literal Python/JSON list-of-dict
[" [' ❌ list literal Python list of strings
[[ followed by non-space ❌ nested list literal Python [[...]]
[[ (with space) ✅ allowed bash extended-test [[ -f x ]]
[ (with space) ✅ allowed bash test [ -f x ]
{" {' ❌ dict literal Python/JSON dict
{ (with space) ✅ allowed bash group { ls; }
echo '[1, 2]', curl -d '{"a":1}' ✅ allowed command name is plain text

2. Pre-execution guard in TerminalExecutor.__call__

When the heuristic fires:

  • Return a TerminalObservation(is_error=True) whose text:
    1. Names the literal kind and shows the offending head (first 60 chars).
    2. Reminds the model terminal runs one shell command.
    3. Teaches two concrete recovery paths: file_editor createpython /tmp/x.py, or an inline heredoc.
  • Emit one structured logger.warning(...) so eval pipelines can grep for the pattern (e.g. grep "Rejected terminal call" *.log | wc -l).
  • Skip the check when is_input=True (raw bytes forwarded to a running process, e.g. C-c keystrokes, where literals are legal payload).

3. Tests

tests/tools/terminal/test_literal_arg_guard.py — 24 tests:

  • Heuristic unit tests: parametric over real malformed samples from the Nemotron trajectory + synthetic variants; parametric negative cases for bash [, [[, {, echo, curl, edge-cases (empty / single char).
  • Executor integration: literal command returns the structured error and never reaches the shell; bash test expressions like [ -f /tmp/foo ] do reach the shell; is_input=True bypasses the guard; the WARN log fires with the literal kind.

All 24 pass. Existing 35 terminal parsing/observation tests pass unchanged.

Risk / blast radius

  • Purely behavior-changing for malformed input. A valid bash command starting with [ -f, [[ -z, or { cmd; } always has whitespace as the second char, so the heuristic exempts them by construction.
  • No schema change, no new dependency, no new event type. The error flows through the existing TerminalObservation(is_error=True) path that the framework already handles for tool-execution errors.
  • is_input=True carve-out preserves the keystroke-forwarding contract.
  • Worst case false positive: a user's legitimate command happens to start with [{ — vanishingly rare in shells, and the model can disambiguate by adding a leading space or quoting. The hint text explicitly tells them what to do.

How to verify locally

uv run pytest tests/tools/terminal/test_literal_arg_guard.py -q
# 24 passed

To see the pattern this addresses in a real trajectory, the per-turn malformed-call indices for the Nemotron 550B run linked above are [41, 50, 58, 59, 65, 66, 67, 68, 76, 77, 87, 92, 95, 125, 135, 136, 141, 142]. Each one returned bash: ...: command not found from a structured-literal command.


This PR was created by an AI agent (OpenHands) on behalf of the user investigating high per-instance cost in eval run swebench/litellm_proxy-nemotron-3-ultra-550b-a55b/27176499467.

@juanmichelini can click here to continue refining the PR


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:c1799d2-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-c1799d2-python \
  ghcr.io/openhands/agent-server:c1799d2-python

All tags pushed for this build

ghcr.io/openhands/agent-server:c1799d2-golang-amd64
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-golang-amd64
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-golang-amd64
ghcr.io/openhands/agent-server:c1799d2-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:c1799d2-golang-arm64
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-golang-arm64
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-golang-arm64
ghcr.io/openhands/agent-server:c1799d2-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:c1799d2-java-amd64
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-java-amd64
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-java-amd64
ghcr.io/openhands/agent-server:c1799d2-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:c1799d2-java-arm64
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-java-arm64
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-java-arm64
ghcr.io/openhands/agent-server:c1799d2-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:c1799d2-python-amd64
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-python-amd64
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-python-amd64
ghcr.io/openhands/agent-server:c1799d2-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:c1799d2-python-arm64
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-python-arm64
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-python-arm64
ghcr.io/openhands/agent-server:c1799d2-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:c1799d2-golang
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-golang
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-golang
ghcr.io/openhands/agent-server:c1799d2-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:c1799d2-java
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-java
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-java
ghcr.io/openhands/agent-server:c1799d2-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:c1799d2-python
ghcr.io/openhands/agent-server:c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae-python
ghcr.io/openhands/agent-server:openhands-sdk-5-bash-literal-arg-guard-python
ghcr.io/openhands/agent-server:c1799d2-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., c1799d2-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., c1799d2-python-amd64) are also available if needed

When the model emits a tool call where `command` is a Python/JSON
literal — e.g. `[{'default': {...}}, ['apps'], ['code...']]` — bash
returns a useless "command not found" and the model rarely
self-corrects. Trajectory analysis of a SWE-Bench Nemotron run showed
~12% of terminal calls (18 of 148) burning context this way, with
cascading cost growth because failed turns stay in history.

This PR adds a pre-execution guard in `TerminalExecutor` that:

* Detects the literal head pattern (`[{`, `[[`, `["`, `['`,
  `{"`, `{'`) while carefully exempting real bash uses of `[`,
  `[[`, and `{` (which are always followed by whitespace).
* Returns a `TerminalObservation(is_error=True)` whose text names the
  literal kind, shows the offending head, and points the model at the
  two recovery patterns: `file_editor create` + `python /tmp/x.py`,
  or an inline heredoc.
* Emits one structured WARN log per rejection so eval pipelines can
  grep for the pattern.
* Skips the check when `is_input=True` (raw bytes forwarded to a
  running process, e.g. `C-c` keystrokes).

The detection lives in `definition.py` next to `TerminalAction` as
`looks_like_python_literal_argument(command) -> str | None`, exported
for reuse and individually unit-tested.

Tests: `tests/tools/terminal/test_literal_arg_guard.py` — 24 new
tests covering the heuristic matrix (real malformed samples, synthetic
variants, bash false-positives like `[ -f x ]` and `{ ls; }`), the
executor returning the structured error without hitting the shell,
`is_input=True` bypass, and WARN log emission. Existing terminal
parsing/observation suite (35 tests) passes unchanged.

Lint+pyright clean.

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini added the enhancement New feature or request label Jun 9, 2026 — with OpenHands AI
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-tools/openhands/tools/terminal
   definition.py1265953%68, 72, 76–78, 82, 119, 122, 125–126, 128, 131–133, 135–137, 139–141, 143, 171, 179, 206, 208–210, 213, 215, 217–219, 221, 225–226, 229–231, 233–234, 237–240, 244–246, 251, 255–260, 262–263, 265, 280, 314
   impl.py24510457%112, 119–120, 148–152, 155–157, 159–163, 170, 189–190, 195–196, 225–228, 230, 248–257, 259, 264, 270–271, 294, 297, 300, 304, 322, 325, 333–334, 343, 376–377, 379, 387, 391–394, 396–397, 404, 406, 410, 434–435, 437–439, 444–445, 447–448, 450, 459, 461–462, 464, 484–487, 492–493, 515, 525–528, 532–533, 541, 552–553, 560, 573, 582–589, 597–598
TOTAL301471537249% 

@juanmichelini juanmichelini marked this pull request as ready for review June 9, 2026 21:28

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

The terminal tool now rejects top-level Python/JSON literal command values with an actionable tool-argument error while preserving normal shell usage.

Does this PR achieve its stated goal?

Yes. I reproduced the old behavior on origin/main: the same malformed literal was sent to bash and returned [{default:: command not found with exit code 127. On PR commit 56588a8f, the same real TerminalExecutor call returned is_error=True, exit_code=None, and a structured [Tool-argument error] hint instead; bash test expressions, bash groups, heredocs, JSON-shaped command arguments, and is_input=True payloads still executed/bypassed as expected.

Phase Result
Environment Setup uv run created/synced .venv and built the local packages needed to import and execute the terminal tool.
CI Status ⚠️ Not fully green: Validate PR description is failing and qa-changes is in progress; most other visible checks are passing or skipped.
Functional Verification ✅ Real terminal tool execution showed the malformed-command fix works and key allowed shell/input paths still work.
Functional Verification

Test 1: Malformed Python/JSON literal is rejected before bash

Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main (50d73244) and ran the actual terminal tool path:

OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
from pathlib import Path
from openhands.tools.terminal.definition import TerminalAction
from openhands.tools.terminal.impl import TerminalExecutor

executor = TerminalExecutor(working_dir=Path.cwd())
obs = executor(TerminalAction(command="[{'default': {'ENGINE': 'sqlite3'}}]"))
print('is_error=', obs.is_error)
print('exit_code=', obs.exit_code)
print('text=')
print(obs.text.strip())
PY

Observed:

is_error= False
exit_code= 127
text=
[{default:: command not found

This confirms the bug exists on main: the malformed structured literal reaches bash and produces a generic command-not-found error.

Step 2 — Apply the PR's changes:
Checked out PR commit 56588a8f672328d658bf7114b1b510169a7b8d6e.

Step 3 — Re-run with the fix in place:
Ran the same terminal tool call on the PR commit. Observed:

WARNING ... Rejected terminal call: command argument looks like a Python/JSON list literal ...
is_error= True
exit_code= None
text=
[Tool-argument error] The `command` argument looks like a Python/JSON list literal, not a shell command. It starts with: "[{'default': {'ENGINE': 'sqlite3'}}]"

The `terminal` tool runs exactly ONE shell command at a time. To pass structured data or multi-line code:
  - Write a script first with `file_editor` (command="create", path="/tmp/run.py", ...), then invoke it: `python /tmp/run.py`.
  - Or use a heredoc inline, e.g.:
        python - <<'EOF'
        DATABASES = {'default': {...}}
        # your code here
        EOF

This shows the PR achieves the main goal: the literal is caught pre-execution, returns an actionable hint, and emits the promised warning.

Test 2: Legitimate shell commands are not blocked

On both origin/main and the PR commit, I ran the actual terminal tool with commands that should remain valid:

[ -f pyproject.toml ] && echo FOUND
python - <<'EOF'
print({'default': {'ENGINE': 'sqlite3'}}['default']['ENGINE'])
EOF

On the PR commit, the outputs were:

=== bash_test ===
is_error= False
exit_code= 0
text=
FOUND
=== json_argument ===
is_error= False
exit_code= 0
text=
sqlite3

I also sampled additional claimed cases on the PR commit:

=== dict_literal ===
is_error= True
exit_code= None
text=
[Tool-argument error] The `command` argument looks like a Python/JSON dict literal, not a shell command. It starts with: '{"key": "value"}'

=== nested_list_literal ===
is_error= True
exit_code= None
text=
[Tool-argument error] The `command` argument looks like a Python/JSON nested list literal, not a shell command. It starts with: "[['col1', 'col2']]"

=== bash_double_bracket ===
is_error= False
exit_code= 0
text=
DOUBLE_OK
=== bash_group ===
is_error= False
exit_code= 0
text=
GROUP_OK

This confirms the guard catches representative list/dict/nested-list literals while allowing common bash [ ... ], [[ ... ]], { ...; }, heredoc, and JSON-as-argument patterns.

Test 3: is_input=True bypass still allows literal payloads

I started a real subprocess terminal cat process, sent a Python-literal-looking payload as interactive input, then interrupted it:

=== start_cat ===
is_error= False
exit_code= -1
text=

=== is_input_literal_payload ===
is_error= False
exit_code= -1
text=
[{'sent_as_keystrokes': True}]
=== stop_cat ===
is_error= False
exit_code= 130
text=
^C

This shows literal-looking text sent as keystrokes was forwarded to the running process instead of being rejected by the new guard.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

@juanmichelini

Copy link
Copy Markdown
Collaborator Author

all-hands-bot commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: feat(terminal): reject Python/JSON literals passed as bash commands

Taste Rating: 🟡 Acceptable — Works, solves a real problem, but has rough edges worth polishing.


[CRITICAL ISSUES]

None found. No breaking changes, no security risks.


[IMPROVEMENT OPPORTUNITIES]

  1. [openhands-tools/openhands/tools/terminal/definition.py, Lines 39-55] Verbose template string: The _LITERAL_ARG_HINT_TEMPLATE uses {{ escaping for f-strings, making it harder to read. Consider using textwrap.dedent or moving the template to a separate constant without f-string interpolation (interpolate only at call time).

  2. [openhands-tools/openhands/tools/terminal/impl.py, Lines 546-569] Guard block indentation: The guard logic is nested under if not action.is_input: which is itself inside __call__. Consider extracting to a small helper function for clarity to reduce nesting levels.

  3. [tests/tools/terminal/test_literal_arg_guard.py] Consider adding explicit edge case: A command starting with [[ followed by whitespace (valid bash) should be tested explicitly rather than relying on the absence of that pattern.


[TESTING GAPS]

The test suite is solid. The sentinel-based executor_without_shell fixture is an excellent pattern — it proves the guard intercepts before shell execution. No gaps found.


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a defensive guard that only fires when the model produces malformed tool calls. It never blocks legitimate commands (confirmed by comprehensive negative test cases). The is_input=True bypass is correctly implemented. No user-facing API changes. Tests provide high confidence.


VERDICT:
Worth merging — The core logic is sound and solves a real problem observed in production eval runs. Polish the template readability and consider the helper extraction for long-term maintainability.

KEY INSIGHT:
The sentinel-based testing approach (stubbing shell paths to raise) is the right pattern for this — it proves the guard runs before shell execution without needing a real subprocess.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ QA Report: PASS WITH ISSUES

I functionally verified the terminal executor now rejects Python/JSON literal commands with actionable hints while preserving ordinary shell commands and interactive input.

Does this PR achieve its stated goal?

Yes. On origin/main, sending "[{'default': {'ENGINE': 'sqlite3'}}]" through the real TerminalExecutor reached bash and returned [{default:: command not found with exit code 127. On PR commit c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae, the same executor call returned a structured [Tool-argument error] mentioning list literal, included recovery guidance, set exit_code: null, and emitted the expected Rejected terminal call warning. I also verified legitimate [ ... ], [[ ... ]], JSON-as-argument commands, and is_input=True interactive input continue to work.

Phase Result
Environment Setup make build completed successfully; no tests/linters were run locally.
CI Status ⚠️ REST API (OpenAPI) was failing and several checks were still in progress when checked; many core checks were green.
Functional Verification ✅ Real TerminalExecutor usage showed the fix works and did not block tested valid shell/input cases.
Functional Verification

Test 1: Malformed Python list/dict literal is rejected before bash

Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main (16ad9e13) and ran the real terminal executor with a malformed command.

Relevant output:

{"case": "literal_list_of_dict", "command": "[{'default': {'ENGINE': 'sqlite3'}}]", "is_error": false, "exit_code": 127, "text": "\u001b[?2004l[{default:: command not found\n\u001b[?2004h"}

This confirms the old behavior: the malformed structured literal reached bash and produced the confusing command not found error.

Step 2 — Apply the PR's changes:
Checked out openhands/sdk-5-bash-literal-arg-guard at c1799d2dfff2e6861ab09b5394e2e7ca6d28b3ae.

Step 3 — Re-run with the fix in place:
Ran the same executor scenario on the PR branch.

Relevant output:

WARNING ... Rejected terminal call: command argument looks like a Python/JSON list literal ... Returning structured hint to the model instead of executing.
{"case": "literal_list_of_dict", "command": "[{'default': {'ENGINE': 'sqlite3'}}]", "is_error": true, "exit_code": null, "text": "[Tool-argument error] The `command` argument looks like a Python/JSON list literal, not a shell command. It starts with: "[{'default': {'ENGINE': 'sqlite3'}}]" ... Write a script first with `file_editor` ... Or use a heredoc inline ..."}

This shows the PR achieves the core goal: the literal is rejected pre-execution with actionable guidance instead of being passed to bash.

Test 2: Legitimate shell syntax and JSON-shaped arguments still run

Step 1 — Baseline without the fix:
On origin/main, the same executor accepted a bash test expression and a JSON-shaped argument:

{"case": "bash_test_expression", "command": "[ -f /tmp/oh_qa_missing_file ]; echo status:$?", "is_error": false, "exit_code": 0, "text": "\u001b[?2004lstatus:1\n\u001b[?2004h"}
{"case": "echo_json_argument", "command": "printf '%s\\n' '{"a": 1}'", "is_error": false, "exit_code": 0, "text": "\u001b[?2004l{"a": 1}\n\u001b[?2004h"}

Step 2 — Apply the PR's changes:
Returned to the PR branch.

Step 3 — Re-run with the fix in place:
The same commands still executed normally:

{"case": "bash_test_expression", "command": "[ -f /tmp/oh_qa_missing_file ]; echo status:$?", "is_error": false, "exit_code": 0, "text": "\u001b[?2004lstatus:1\n\u001b[?2004h"}
{"case": "echo_json_argument", "command": "printf '%s\\n' '{"a": 1}'", "is_error": false, "exit_code": 0, "text": "\u001b[?2004l{"a": 1}\n\u001b[?2004h"}
{"case": "bash_double_bracket", "is_input": false, "is_error": false, "exit_code": 0, "text": "\u001b[?2004lstatus:0\n\u001b[?2004h"}

This shows the guard did not block the tested valid shell forms or commands with JSON-looking arguments.

Test 3: Dict/nested-list literals and interactive input behavior

Step 1 — Baseline without the fix:
Before the PR, top-level structured literals were sent to the shell, as shown by Test 1's command not found result. is_input=True had no literal guard to trip.

Step 2 — Apply the PR's changes:
Ran additional real executor scenarios on the PR branch.

Step 3 — Re-run with the fix in place:
Dict and nested-list literals were rejected with their specific kinds:

{"case": "json_dict_literal", "is_input": false, "is_error": true, "exit_code": null, "text": "[Tool-argument error] The `command` argument looks like a Python/JSON dict literal, not a shell command..."}
{"case": "nested_list_literal", "is_input": false, "is_error": true, "exit_code": null, "text": "[Tool-argument error] The `command` argument looks like a Python/JSON nested list literal, not a shell command..."}

For is_input=True, I started an interactive read command and then sent a literal-looking payload as input:

{"case": "start_read", "is_error": false, "exit_code": -1, "text": "\u001b[?2004l"}
{"case": "send_literal_input", "is_error": false, "exit_code": 0, "text": "captured:[{'sent_as_keystrokes': True}]\n\u001b[?2004h"}

This confirms the guard does not reject literal-looking bytes when they are interactive input to a running process.

Issues Found

  • 🟠 CI status: REST API (OpenAPI) was failing and several checks were still in progress when checked. I did not find a functional terminal-behavior issue, but the PR is not fully green yet.

This review was created by an AI agent (OpenHands) on behalf of the user.

Final verdict: PASS WITH ISSUES

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants