Add dynamic workflow tool by neubig · Pull Request #3426 · OpenHands/software-agent-sdk

neubig · 2026-05-28T22:04:41Z

Summary

add a WorkflowToolSet with WorkflowAction/WorkflowObservation for Python dynamic workflow scripts
add a WorkflowContext that supports run_agent, map_agents, reduce_agent, and flatten using the existing TaskManager
aggregate map_agents sub-agent failures so one failing item does not cancel sibling tasks before their results/errors are collected
add focused workflow tests and a runnable standalone SDK example where the parent agent generates the workflow code for a repo-wide test coverage audit

Closes #3425

Related docs PR: OpenHands/docs#535

Tests

uv run pytest tests/tools/workflow/test_workflow_tool.py -q
uv run pytest tests/tools/workflow/test_workflow_tool.py tests/tools/task/test_task_manager.py tests/tools/task/test_task_manager_thread_safety.py -q
uv run pre-commit run --files openhands-tools/openhands/tools/task/manager.py openhands-tools/openhands/tools/workflow/definition.py openhands-tools/openhands/tools/workflow/impl.py openhands-tools/openhands/tools/workflow/__init__.py openhands-tools/openhands/tools/__init__.py tests/tools/workflow/test_workflow_tool.py examples/01_standalone_sdk/52_dynamic_workflow.py
uv run python examples/01_standalone_sdk/52_dynamic_workflow.py
DOCS_PATH=/workspace/project/docs GITHUB_EVENT_NAME=pull_request GITHUB_EVENT_PATH=/tmp/pr_event.json python .github/scripts/check_documented_examples.py

This PR was created by an AI agent (OpenHands) on behalf of the user.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:16f873b-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-16f873b-python \
  ghcr.io/openhands/agent-server:16f873b-python

All tags pushed for this build

ghcr.io/openhands/agent-server:16f873b-golang-amd64
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-golang-amd64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-golang-amd64
ghcr.io/openhands/agent-server:16f873b-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:16f873b-golang-arm64
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-golang-arm64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-golang-arm64
ghcr.io/openhands/agent-server:16f873b-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:16f873b-java-amd64
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-java-amd64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-java-amd64
ghcr.io/openhands/agent-server:16f873b-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:16f873b-java-arm64
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-java-arm64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-java-arm64
ghcr.io/openhands/agent-server:16f873b-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:16f873b-python-amd64
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-python-amd64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-python-amd64
ghcr.io/openhands/agent-server:16f873b-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:16f873b-python-arm64
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-python-arm64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-python-arm64
ghcr.io/openhands/agent-server:16f873b-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:16f873b-golang
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-golang
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-golang
ghcr.io/openhands/agent-server:16f873b-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:16f873b-java
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-java
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-java
ghcr.io/openhands/agent-server:16f873b-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:16f873b-python
ghcr.io/openhands/agent-server:16f873b608ddb1cb988ef14645bfc5644b3f8304-python
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-python
ghcr.io/openhands/agent-server:16f873b-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 16f873b-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 16f873b-python-amd64) are also available if needed

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-28T22:05:16Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-28T22:05:16Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot

⚠️ QA Report: PASS WITH ISSUES

The dynamic workflow tool worked end-to-end in manual SDK usage, but PR CI currently has a failed check-examples check.

Does this PR achieve its stated goal?

Yes, functionally. On the base branch, WorkflowToolSet is absent; on the PR branch, I could import/create the tool, execute workflow scripts, block an unsafe open() script, run a single real subagent via wf.run_agent, and run the new example that fans out subagents then reduces their answers into a final result with EXAMPLE_COST.

Phase	Result
Environment Setup	✅ `make build` completed successfully
CI Status	⚠️ `check-examples` failed; several checks were still in progress when checked
Functional Verification	✅ Direct tool usage, guardrail behavior, `run_agent`, and example `map_agents`/`reduce_agent` flow worked

Functional Verification

Test 1: Baseline absence vs PR availability

Step 1 — Establish baseline without the feature:
Ran git checkout --quiet origin/main; uv run python -c "import importlib.util; print('workflow_module_found=', importlib.util.find_spec('openhands.tools.workflow') is not None)"; uv run python -c "from openhands.tools import WorkflowToolSet; print(WorkflowToolSet)" || echo 'WorkflowToolSet import failed on base branch as expected':

BASE_COMMIT=5e614afe
workflow_module_found= False
ImportError: cannot import name 'WorkflowToolSet' from 'openhands.tools'
WorkflowToolSet import failed on base branch as expected

This confirms the workflow tool is new user-facing functionality, not pre-existing behavior.

Step 2 — Apply the PR's changes:
Checked out redo-dynamic-workflow-mvp at d91a5e8.

Step 3 — Re-run with the feature in place:
Used WorkflowToolSet.create(...).as_executable() from a real Conversation, then executed a flatten workflow and an unsafe workflow:

tool_available= WorkflowToolSet
flatten_status= completed is_error= False text= ['alpha', 'beta', 'gamma', 'delta']
unsafe_status= error is_error= True text= Workflow scripts may not call `open`

This shows the new tool is importable/creatable, executes a valid workflow script, and enforces the advertised best-effort unsafe-call guardrail.

Test 2: Single subagent workflow primitive

Step 1 — Establish baseline:
The base branch import check above shows there was no WorkflowToolSet/wf.run_agent path to exercise.

Step 2 — Apply the PR's changes:
On redo-dynamic-workflow-mvp, registered a small QA subagent and invoked the workflow tool with a script calling await wf.run_agent(...).

Step 3 — Run with the feature in place:
Ran the one-off SDK script through uv run python:

run_agent_status= completed is_error= False
run_agent_text= PASS-WORKFLOW-RUN-AGENT
cost= 0.019145000000000002

This confirms the single-subagent primitive works through the actual TaskManager/subagent path with a real LLM-backed subagent.

Test 3: Runnable standalone example with fan-out and reduce

Step 1 — Establish baseline:
The example file does not exist on the base branch as part of the new feature; the base import check also shows the workflow tool itself is absent.

Step 2 — Apply the PR's changes:
On redo-dynamic-workflow-mvp, ran the new standalone example exactly as a user would.

Step 3 — Run with the feature in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python examples/01_standalone_sdk/52_dynamic_workflow.py:

Task 'task_00000002' completed.
Task 'task_00000001' completed.
Task 'task_00000003' completed.
Task 'task_00000004' completed.
Dynamic workflow result:
An octopus runs on three hearts, with the main one pausing whenever it swims; a honeybee can recognize human faces through clever pattern-based vision despite its tiny brain; and snow leopards skip roaring in favor of chuffs, hisses, growls, and a loud, non-aggressive “prusten.”
EXAMPLE_COST: 0.053985000000000005

This confirms the advertised fan-out (map_agents) and reducer (reduce_agent) workflow runs end-to-end with real subagents and prints the expected example cost line.

Issues Found

🟠 Issue: GitHub CI currently reports [Optional] Docs example / check-examples as failed, with additional checks still in progress. Functional QA did not reproduce a workflow-tool failure—the new example ran successfully locally—but the PR should not be considered fully healthy until CI is green.

This review was created by an AI agent (OpenHands) on behalf of the user.

github-actions · 2026-05-28T22:11:16Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-tools/openhands/tools
__init__.py	13	2	84%	37–38
openhands-tools/openhands/tools/task
manager.py	168	112	33%	81–83, 87–89, 99–100, 102–103, 107, 117, 121, 124, 127–132, 134, 140–141, 145, 149–152, 155–159, 181–182, 184–185, 190, 195, 202–204, 209–212, 221, 226, 233, 246–247, 249, 255–256, 258, 267, 272, 278, 289–290, 292–295, 297, 314–315, 319–320, 322–323, 326, 328, 331, 334, 338–339, 341–344, 346–354, 356–357, 359, 365–366, 370–372, 374, 377, 379–380, 391–392, 396, 403–404, 412, 416, 421, 423–424
openhands-tools/openhands/tools/workflow
definition.py	24	4	83%	146, 148, 173, 175
impl.py	171	129	24%	83–90, 92–94, 98–100, 109–110, 125–127, 134–136, 151–153, 159–164, 169–170, 172, 176–178, 184, 188, 203, 211–214, 216–217, 220–223, 227–228, 231–232, 238–239, 245–247, 251–252, 254–256, 259, 273–274, 278–281, 283, 288–289, 293–294, 302, 304–310, 313, 318, 322, 326–327, 332, 336–339, 344–347, 349, 353–358, 360–362, 364–367, 373–374, 378–379, 383, 421, 432–433, 440, 444–446, 451–454, 461
TOTAL	29431	15109	48%

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

✅ QA Report: PASS

The dynamic workflow tool works end-to-end: it imports only on the PR branch, executes generated workflow scripts, rejects unsafe scripts, runs real map/reduce sub-agent orchestration, and the new standalone example completes.

Does this PR achieve its stated goal?

Yes. The PR set out to add a WorkflowToolSet/workflow actions, provide WorkflowContext helpers (run_agent, map_agents, reduce_agent, flatten), and include a runnable SDK example. I exercised those as a user: direct workflow tool execution returned a completed observation, unsafe script execution returned an error observation, real LLM-backed sub-agents produced a reducer result through map_agents/reduce_agent, and examples/01_standalone_sdk/52_dynamic_workflow.py ran successfully and printed EXAMPLE_COST.

Phase	Result
Environment Setup	✅ `make build` completed and installed the uv environment.
CI Status	✅ All checked non-QA jobs are passing; this `qa-changes` job was still in progress when checked.
Functional Verification	✅ Verified import baseline, direct tool execution, safety rejection, real sub-agent map/reduce, and the standalone example.

Functional Verification

Test 1: Baseline import vs PR import/execution

Step 1 — Establish baseline without the feature:
Ran git switch --detach origin/main && uv run python -c 'import importlib; importlib.import_module("openhands.tools.workflow")':

ModuleNotFoundError: No module named 'openhands.tools.workflow'

This shows the workflow package is new user-visible functionality added by this PR.

Step 2 — Apply the PR's changes:
Switched back to redo-dynamic-workflow-mvp at 181521b3a694e847453c1eaa3e2820f792811df8.

Step 3 — Re-run with the PR behavior in place:
Ran a user-style SDK script that creates an Agent with WorkflowToolSet, creates a Conversation, and calls WorkflowExecutor with a generated async def main(wf) script using wf.flatten:

status= completed
is_error= False
text= {'areas': ['sdk', 'tools', 'workspace', 'agent-server'], 'count': 4}

This confirms the new tool package imports and the workflow tool executes generated Python through the conversation context.

Test 2: Unsafe generated workflow code is rejected

Ran the workflow tool with a generated script that attempted open('/tmp/qa-workflow-should-not-open'):

status= error
is_error= True
text= Workflow scripts may not call `open`

This confirms the tool returns an error observation instead of executing a direct file read from generated workflow code.

Test 3: Real `map_agents` + `reduce_agent` orchestration

Ran a workflow action that registered a real qa_echo sub-agent backed by the configured LLM, fanned out two sub-agent tasks with wf.map_agents(max_concurrency=2), then reduced their outputs with wf.reduce_agent:

Task 'task_00000002' completed.
Task 'task_00000001' completed.
Task 'task_00000003' completed.
status= completed
is_error= False
text= QA-RED-BLUE

This confirms the workflow context can fan out and reduce real sub-agent conversations, not just run local helper logic.

Test 4: Standalone example runs as documented

Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python examples/01_standalone_sdk/52_dynamic_workflow.py:

Message from Coverage Workflow Agent to User
## Repo-wide test coverage audit
...
Recommended next step
Start with a focused test push covering the P0 security and data-integrity gaps...

Tokens: ↑ input 177.15K • cache hit 59.25% • reasoning 3.9K • ↓ output 18.44K • $ 0.9667

EXAMPLE_COST: 0.9666999999999999

This confirms the parent agent can generate and run a dynamic workflow using the new tool, producing the intended repo-wide coverage audit and example cost line.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

neubig · 2026-05-28T22:57:11Z

CI is green on 181521b, the example now has the parent agent generate the workflow code, and related docs are in OpenHands/docs#535. Ready for another automated review.\n\n_This comment was created by an AI agent (OpenHands) on behalf of the user._

all-hands-bot · 2026-05-28T22:58:12Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Good overall shape for an MVP — the three-layer design (WorkflowAction/WorkflowObservation/WorkflowToolSet in definition.py, runtime orchestration in impl.py, and the honest security disclaimer in the tool description) is clean and readable. The AST-level validation paired with _safe_globals is the right defence-in-depth approach for in-process script execution.

A few issues worth addressing before landing:

asyncio.gather in map_agents is missing return_exceptions=True, so one failing sub-agent silently aborts all parallel work and loses every partial result — the most impactful bug here.
One unreachable assert inside run_one should be removed or converted to a proper guard.
A private method (_ensure_parent) is called across module boundaries — fragile contract worth flagging.
Minor placement nit on the bottom-of-file TYPE_CHECKING import in definition.py.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

✅ QA Report: PASS

Verified the dynamic workflow tool end-to-end as an SDK user; it imports on the PR branch, orchestrates real sub-agents via map/reduce, rejects unsafe scripts, and the new standalone example completed with a coverage report.

Does this PR achieve its stated goal?

Yes. The PR set out to add a WorkflowToolSet/WorkflowContext for Python dynamic workflow scripts plus a runnable standalone example. On the PR commit, I ran a real WorkflowExecutor with an SDK Conversation and a registered sub-agent; wf.map_agents, wf.flatten, and wf.reduce_agent completed and returned the expected workflow tokens. I also ran the new example path and its output ended with a repo-wide coverage report plus EXAMPLE_COST: 0.9111179999999999.

Phase	Result
Environment Setup	✅ `make build` completed successfully and installed the uv workspace environment.
CI Status	🟡 At check time: 33 checks passing, 3 skipped, 1 pending (`qa-changes`); no CI jobs were rerun.
Functional Verification	✅ New import surface, real workflow execution, unsafe-script handling, and bundled example were exercised.

Functional Verification

Test 1: New workflow tool import surface

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout origin/main && uv run python -c "from openhands.tools import WorkflowToolSet; print(WorkflowToolSet)":

BRANCH=HEAD COMMIT=e65a8abd
ImportError: cannot import name 'WorkflowToolSet' from 'openhands.tools' (.../openhands-tools/openhands/tools/__init__.py)
EXIT_CODE=1

This shows the advertised workflow tool was not available from the public tools package on the base branch.

Step 2 — Apply the PR's changes:
Checked out redo-dynamic-workflow-mvp at 181521b3.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python -c "from openhands.tools import WorkflowToolSet; from openhands.tools.workflow import WorkflowAction, WorkflowExecutor; print(WorkflowToolSet.name, WorkflowAction.__name__, WorkflowExecutor.__name__)":

BRANCH=redo-dynamic-workflow-mvp COMMIT=181521b3
workflow_tool_set WorkflowAction WorkflowExecutor
EXIT_CODE=0

This confirms the new tool set and workflow action/executor are importable as SDK users would expect.

Test 2: Real dynamic workflow with sub-agent map/reduce

Step 1 — Reproduce / establish baseline (without the fix):
The base-branch import failure above establishes that this SDK user path did not exist before the PR.

Step 2 — Apply the PR's changes:
On 181521b3, I registered a minimal qa_workflow_worker sub-agent, created a parent SDK Conversation with WorkflowToolSet, and invoked WorkflowExecutor with generated-style workflow code using wf.map_agents, wf.flatten, and wf.reduce_agent.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_workflow_real.py:

Task 'task_00000002' completed.
Task 'task_00000001' completed.
Task 'task_00000003' completed.
STATUS completed
IS_ERROR False
TEXT_START
WORKFLOW_QA_alpha WORKFLOW_QA_beta WORKFLOW_QA_tail
TEXT_END

This confirms the workflow tool can orchestrate multiple real SDK sub-agent tasks, preserve map output, flatten intermediate values, run a reducer, and return a completed observation.

Test 3: Shipped standalone example

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout origin/main && test -f examples/01_standalone_sdk/52_dynamic_workflow.py:

BRANCH=HEAD COMMIT=e65a8abd
example_missing

This shows the advertised standalone example did not exist on the base branch.

Step 2 — Apply the PR's changes:
Checked out redo-dynamic-workflow-mvp at 181521b3.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python examples/01_standalone_sdk/52_dynamic_workflow.py > /tmp/qa_dynamic_workflow_example.out 2>&1 and inspected the final output:

## Concrete next tests to add first

1. Agent Server secrets exposure tests
2. Agent Server persistence store tests
3. SDK LLM serialization/tool-call tests
4. Workspace lifecycle/error tests
5. Tools task tracker and timeout tests
6. SDK plugin/condenser orchestration tests
7. Websocket and patch/parser tests

EXAMPLE_COST: 0.9111179999999999

This confirms the new example ran through the parent-agent-generated workflow and produced the intended repo-wide coverage audit report.

Test 4: Unsafe generated script handling

Step 1 — Reproduce / establish baseline (without the fix):
The base branch had no workflow tool to run this error path.

Step 2 — Apply the PR's changes:
On 181521b3, I invoked the workflow executor with a script whose main(wf) attempted open('/etc/passwd').

Step 3 — Re-run with the fix in place:
Ran a real WorkflowExecutor invocation with an SDK Conversation:

STATUS error
IS_ERROR True
TEXT Workflow scripts may not call `open`

This confirms the tool returns an error observation instead of executing a blocked file-read operation directly inside workflow code.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

neubig · 2026-05-28T23:22:17Z

Review feedback addressed in 70d18ca; all checks are green and review threads are resolved. Requesting another automated review.\n\n_This comment was created by an AI agent (OpenHands) on behalf of the user._

all-hands-bot · 2026-05-28T23:23:05Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

All four issues from the previous review are cleanly resolved in this iteration — attach_parent(), the return_exceptions=True gather, the dead assert removal, and the TYPE_CHECKING block placement. The three-layer design (definition.py / impl.py / __init__.py) is easy to navigate, the AST validator + _safe_globals restriction is the right defence-in-depth approach for in-process code execution, and the test coverage is solid.

A few new observations below — one worth fixing before landing, two suggestions for your consideration.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

✅ QA Report: PASS

Verified the dynamic workflow feature end-to-end through the SDK, including parent workflow tool execution, sub-agent map/reduce orchestration, failure aggregation, and the new standalone example.

Does this PR achieve its stated goal?

Yes. On origin/main, the workflow tool was not importable; on the PR branch, it imported as workflow_tool_set, a parent SDK Conversation requested the workflow tool, the workflow spawned sub-agent map jobs plus a reducer through the task path, and the workflow observation reached the parent. I also verified map_agents aggregates multiple sub-agent failures instead of stopping after the first failure, and the new example script ran to completion with an EXAMPLE_COST line.

Phase	Result
Environment Setup	✅ `make build` completed successfully
CI Status	✅ 33 checks passing; 1 QA Changes check pending for this run; 3 PR artifact jobs skipped
Functional Verification	✅ Dynamic workflow import, conversation execution, failure aggregation, and example run all succeeded

Functional Verification

Test 1: Feature availability before and after the PR

Step 1 — Establish baseline without the PR:
Ran git checkout --detach origin/main && uv run python /tmp/qa_dynamic_workflow.py import:

HEAD is now at e65a8abd Register claude-opus-4-8 in SDK model lists (#3424)
Traceback (most recent call last):
  File "/tmp/qa_dynamic_workflow.py", line 16, in <module>
    from openhands.tools.workflow import WorkflowToolSet
ImportError: cannot import name 'WorkflowToolSet' from 'openhands.tools.workflow' (unknown location)

This confirms the dynamic workflow tool is new behavior introduced by this PR.

Step 2 — Apply the PR changes:
Checked out redo-dynamic-workflow-mvp at 70d18ca12ef08f52c8de2457acf337f27eb9a59c.

Step 3 — Re-run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow.py import:

workflow tool set import ok: workflow_tool_set

This confirms the new workflow tool set is available from the package entry point.

Test 2: Parent agent executes a dynamic workflow through the SDK tool path

Step 1 — Baseline:
The base branch import failure above shows there was no workflow entry point to execute through a parent agent before the PR.

Step 2 — Apply the PR changes:
Used the PR branch with a local OpenAI-compatible fake LLM server so the SDK could run a real Conversation deterministically without external model dependencies.

Step 3 — Run the workflow as a user-facing SDK conversation:
Ran uv run python /tmp/qa_dynamic_workflow.py conversation:

fake llm requests: 6
parent requested workflow tool: True
subagent/reducer prompts observed: [{"type": "text", "text": "audit sdk"}] | [{"type": "text", "text": "audit tools"}] | [{"type": "text", "text": "audit workspace"}] | [{"type": "text", "text": "combine findings

Input:
{
  "findings": [...]
workflow observation reached parent: True
workflow observation snippet: [{'type': 'text', 'text': 'SUBAGENT_RESULT for [{"type": "text", "text": "combine findings\n\nInput:...'}]

This shows the parent agent requested the workflow tool, the generated script fanned out three sub-agent prompts, ran a reducer prompt with serialized intermediate results, and returned the workflow observation back to the parent conversation.

Test 3: `map_agents` aggregates sibling failures

Step 1 — Baseline:
The base branch did not have the workflow API, so there was no map_agents failure behavior to exercise.

Step 2 — Apply the PR changes:
Ran the PR implementation with a recording task manager that returns errors for two items and success for one item.

Step 3 — Execute the failing workflow:
Ran uv run python /tmp/qa_dynamic_workflow.py failure-aggregation:

started prompts: audit alpha, audit beta, audit gamma
aggregated errors: simulated failure for audit beta | simulated failure for audit gamma

This confirms all sibling tasks were attempted and both failures were surfaced together, matching the PR goal that one failing item should not cancel sibling collection.

Test 4: New standalone example runs to completion

Step 1 — Baseline:
examples/01_standalone_sdk/52_dynamic_workflow.py does not exist on origin/main, so this is a new runnable example from the PR.

Step 2 — Apply the PR changes:
Ran the example on the PR branch with LLM_MODEL=openai/gpt-4o-mini, LLM_API_KEY=fake-key, and LLM_BASE_URL pointing to a local OpenAI-compatible fake server.

Step 3 — Run the actual example script:
Ran python /tmp/qa_run_dynamic_workflow_example.py:

example exit code: 0
fake llm requests: 5
Tool: workflow
Result:
Coverage note for [{"type": "text", "text": "write final coverage report

Input:..."}]
Message from Coverage Workflow Agent to User
Example completed after workflow observation.
EXAMPLE_COST: 3.75e-06

This confirms the added example can execute the workflow tool path and emit the expected example cost marker.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot · 2026-05-28T23:33:52Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

The three-layer design (WorkflowAction/WorkflowObservation → WorkflowContext → WorkflowExecutor) is clean, the _TaskStarter Protocol enables genuine dependency injection without mocking, and the ExceptionGroup aggregation + attach_parent() fixes from earlier iterations are solid. The standalone example demonstrates the intended usage clearly. A few new findings below — the wf private-attribute exposure is the most substantive; the others are usability and robustness notes.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-29T04:31:15Z

@OpenHands /iterate

openhands-ai · 2026-05-29T04:31:38Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-05-29T04:43:35Z

CI is all green (33/33 checks passed). All review findings from the previous review have been addressed:

return_exceptions=True added to asyncio.gather in map_agents
Dead assert removed, replaced with _render_required_template helper
ExceptionGroup unpacking added via _format_exception for richer error text
wf private attribute access blocked in AST validator (wf._ prefix now rejected)
TYPE_CHECKING block confirmed at top of definition.py
Standard exceptions (ValueError, TypeError, RuntimeError) added to _safe_globals
0 unresolved review threads

Ready for another automated review.

This comment was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot · 2026-05-29T09:37:26Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

This is a well-structured, thoughtfully documented MVP for dynamic workflow orchestration. The code is clean, the tests are comprehensive, and the security model is honest about its limitations. A few issues worth addressing before merge — the most actionable ones are the wf-alias sandbox bypass and the silent no-op in attach_parent.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

- add docstring note to validate_workflow_script acknowledging the wf-alias bypass limitation (documentation gap, not security gap) - update type() comment to accurately note methods injected via wf are not re-sandboxed, though they only expose public wf API - add warning log in attach_parent when called with a different conversation (surfaces potential programming errors) - thread 2 (run_one re-wrap): kept as-is; the [item N] prefix is valuable context for debugging ExceptionGroup failures Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-29T09:47:59Z

Round 24 in 8a484b7:

Fixed:

Added docstring note to validate_workflow_script (wf-alias bypass is documented gap, not security gap)
Added warning log in attach_parent when called with a different conversation
Updated type() comment to accurately note injected-object limitation

Kept as-is:

run_one re-wrap: [item N] prefix provides valuable context; from-exc preserves original type
asyncio.timeout: target-version = py313, Python 3.13+, asyncio.timeout is available

All 5 threads resolved.

This comment was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot · 2026-05-29T09:48:58Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

This PR has gone through extensive iterative refinement and is in very good shape. The three-layer design (WorkflowAction/WorkflowObservation → WorkflowContext → WorkflowExecutor), the _TaskStarter Protocol for real dependency injection (no mocks needed in most tests), the ExceptionGroup failure aggregation with 1-based item indexing, asyncio.timeout for hung-workflow protection, idempotent close(), the str.replace prompt-injection fix, and the per-call max_concurrency cap are all cleanly implemented and tested.

Status of open threads:

Thread 3322692914 (add test for per-call max_concurrency cap): test_map_agents_respects_context_concurrency_cap at line 321 of the test file addresses this — it verifies that a per-call max_concurrency=1000 is silently capped at the context's limit of 3 and confirms peak concurrency never exceeds 3. ✅
Thread 3322692922 (document the two register_tool calls): WorkflowTool's docstring already says "Prefer WorkflowToolSet for standard SDK auto-create usage. Use WorkflowTool when you need to inject a custom executor (e.g., in tests or extensions)" which documents the intent. A short comment above the two register_tool lines would close the suggestion explicitly (see inline comment below).

Note on the _call_name false-positive (review 4386764961): The latest bot review flagged this as still unresolved, but the fix (isinstance(node.func, ast.Name) guard) has actually been in place since commit cd0aeb1. Attribute method calls like result.open() are already correctly allowed. No action needed.

[RISK ASSESSMENT]

⚠️ Risk: 🟡 MEDIUM — executes LLM-generated Python in-process with best-effort AST validation. This is explicitly and clearly documented.
No breaking changes; all new public symbols are additive.

VERDICT: ✅ Ready to merge. Two minor items below — one is a missing test for a documented behaviour guarantee, the other is a one-line comment.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

✅ QA Report: PASS

Verified the dynamic workflow feature end-to-end: the new module is absent on main, imports on the PR, executes workflow scripts, delegates to real registered sub-agents through TaskManager, aggregates map failures, and the new standalone example completed successfully.

Does this PR achieve its stated goal?

Yes. The PR set out to add a WorkflowToolSet/workflow action-observation path, provide WorkflowContext helpers (run_agent, map_agents, reduce_agent, flatten), aggregate map_agents failures, and ship a runnable SDK example. I exercised those surfaces as a user would: direct SDK imports/calls, real LLM-backed sub-agent delegation through the workflow executor, and examples/01_standalone_sdk/52_dynamic_workflow.py, which generated and ran a coverage-audit workflow and printed EXAMPLE_COST: 1.118247.

Phase	Result
Environment Setup	✅ `make build` completed and installed the uv-managed dev environment.
CI Status	🟡 Many checks passing; `tools-tests`, `qa-changes`, and several build/push jobs were still in progress when checked; no failures observed.
Functional Verification	✅ Core workflow tool behavior and the standalone example both ran successfully.

Functional Verification

Test 1: Baseline vs PR import behavior

Step 1 — Establish baseline without the feature:
Checked out origin/main and ran uv run python /tmp/qa_dynamic_workflow_import.py:

IMPORT_ERROR:ModuleNotFoundError:No module named 'openhands.tools.workflow'

This confirms the base branch does not expose the dynamic workflow module.

Step 2 — Apply the PR's changes:
Switched back to redo-dynamic-workflow-mvp at 8a484b75764fbc47af7577b36c68cd77e8c92498.

Step 3 — Re-run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow_import.py:

IMPORT_OK:WorkflowToolSet:WorkflowAction:WorkflowExecutor

This confirms the new public import surface is available on the PR branch.

Test 2: Workflow executor and context helpers

Step 1 — Baseline:
On main, the import above fails, so this workflow tool entry point cannot be exercised there.

Step 2 — Apply the PR's changes:
Used the PR branch and called WorkflowExecutor with a generated script containing async def main(wf).

Step 3 — Run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow_basic.py:

STATUS:completed
IS_ERROR:False
TEXT:alpha|beta|gamma|delta

Then ran uv run python /tmp/qa_dynamic_workflow_manager.py:

SUCCESS_RESULT:done:reduce

Input:
{
  "one": "done:single",
  "mapped": [
    "done:map a",
    "done:map b"
  ]
}
SUCCESS_PROMPTS:['worker:single', 'mapper:map a', 'mapper:map b', 'reducer:reduce

Input:
{
  "one": "done:single",
  "mapped": [
    "done:map a",
    "done:map b"
  ]
}']
CLOSED:True
FAILURE_TYPE:ExceptionGroup
FAILURE_TEXT:map_agents: one or more sub-agents failed (2 sub-exceptions)
FAILURE_ITEMS:[item 2] boom:item FAIL-1|[item 3] boom:item FAIL-2

This confirms flatten, run_agent, map_agents, reduce_agent, manager cleanup, and aggregated map failures all behave as described.

Test 3: Real registered sub-agents via TaskManager

Step 1 — Baseline:
The workflow module is absent on main, so real workflow delegation is unavailable there.

Step 2 — Apply the PR's changes:
Registered a minimal qa_echo sub-agent and invoked WorkflowExecutor without injecting a custom manager, so the workflow used the real TaskManager path.

Step 3 — Run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow_real_subagents.py:

Task 'task_00000001' completed.
Task 'task_00000002' completed.
Task 'task_00000003' completed.
STATUS:completed
IS_ERROR:False
TEXT:RED, BLUE

This demonstrates the workflow can fan out to registered sub-agents and reduce their outputs through real SDK conversation/task machinery.

Test 4: Validation error path and standalone example

Validation path:
Ran uv run python /tmp/qa_dynamic_workflow_error.py:

STATUS:error
IS_ERROR:True
TEXT:Workflow scripts may not access private wf attributes or call wf.close()

This confirms unsafe/unsupported workflow calls surface as an error observation rather than silently running.

Standalone example:
Ran uv run python examples/01_standalone_sdk/52_dynamic_workflow.py:

Summary: Return synthesized repository coverage audit report
...
Finish with message:
## Repo-wide test coverage report
...
Tokens: ↑ input 185.62K • cache hit 58.48% •  reasoning 4.85K • ↓ output 22.62K • $ 1.1182

EXAMPLE_COST: 1.118247

This confirms the new example runs end-to-end: the parent agent generated workflow code, the workflow executed sub-agent coverage audits, the reducer returned a repo-wide report, and the required EXAMPLE_COST line was printed.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-31T18:20:41Z

CI is green on fa05bbc3 after addressing the two automated review suggestions (added plain except Exception coverage for ExceptionGroup and clarified the dual workflow tool registrations). Review threads are resolved.

Low-risk note: the dynamic workflow tool is additive and is not added to any defaults or default execution paths; callers only get it if they explicitly opt into the tool.

Ready for another automated review.

This PR comment was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot · 2026-05-31T18:21:37Z

🔍 Review in progress…

We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

🟡 QA Report: PARTIAL

Core dynamic workflow behavior works in local execution, but a successful live LLM-backed sub-agent run could not be verified because the available LLM configuration failed before model completion.

Does this PR achieve its stated goal?

Partially verified. The PR adds the workflow tool surface and I was able to run dynamic workflow scripts that exercised run_agent, map_agents, reduce_agent, and flatten, including aggregation of multiple map_agents failures and creation of the tool inside a real SDK Conversation. I could not verify the fully live TaskManager + LLM sub-agent success path or the standalone example end-to-end because the available environment either lacked the LiteLLM proxy base URL or had an invalid direct OpenAI credential.

Phase	Result
Environment Setup	✅ `make build` completed successfully; no tests/linters were run.
CI Status	⏳ 34 successful checks, 0 failed, 5 skipped, 1 in progress (`qa-changes`) when checked.
Functional Verification	🟡 Core workflow execution verified; live LLM-backed sub-agent completion blocked by environment credentials/config.

Functional Verification

Test 1: Baseline vs PR tool availability

Step 1 — Establish baseline without the PR:
Ran this on a detached origin/main worktree:

cd /tmp/qa-dynamic-workflow-base && uv run python - <<'PY'
try:
    from openhands.tools.workflow import WorkflowToolSet
except Exception as exc:
    print(type(exc).__name__ + ': ' + str(exc))
else:
    print('WorkflowToolSet import unexpectedly succeeded:', WorkflowToolSet)
PY

Output:

ModuleNotFoundError: No module named 'openhands.tools.workflow'

This shows the workflow tool is new behavior introduced by the PR.

Step 2 — Apply the PR's changes:
Used the checked-out PR head fa05bbc3efb03c1956a520cf85e94222064c814b.

Step 3 — Re-run with the PR in place:
Ran a workflow script through the PR's API that imports WorkflowToolSet, executes a generated async def main(wf) script, fans out map_agents, calls reduce_agent, and closes the manager:

imported WorkflowToolSet: workflow_tool_set
workflow result: {'flattened': ['qa-worker handled [single item]', 'qa-worker handled [map alpha]', 'qa-worker handled [map {"name": "beta"}]'], 'reduced': 'qa-reducer handled [summarize\n\nInput:\n{...}]'}
task calls: [('qa-worker', 'single item', 'single run'), ..., ('qa-reducer', 'summarize\n\nInput:\n{...}', 'final reduce')]
manager closed: True

This confirms the new workflow module is importable and generated workflow code can coordinate run_agent, map_agents, reduce_agent, and flatten with real execution semantics.

Test 2: `map_agents` failure aggregation

Ran a generated workflow where two mapped items failed while one succeeded:

aggregated failure count: 2
aggregated failure messages: ['[item 2] subagent refused prompt: item fail-one', '[item 3] subagent refused prompt: item fail-two']

This confirms a failing item does not hide sibling failures; both mapped failures were collected and surfaced together.

Test 3: Tool creation inside a real SDK conversation

Ran a real Agent/Conversation construction with Tool(name=WorkflowToolSet.name) and executed a simple workflow through WorkflowExecutor:

tool definitions created: [('workflow', 'WorkflowAction')]
observation: completed {'flattened': ['a', 'b', 'c']} False

This confirms the new tool set can be created from a conversation state and returns a successful WorkflowObservation for a valid dynamic script.

Unable to Verify

Live LLM-backed sub-agent success path / standalone example

I attempted a minimal real TaskManager-backed sub-agent workflow with a registered sub-agent and the environment's default model:

using model: litellm_proxy/openai/gpt-5.5
api key present: True
workflow status: error
workflow text: map_agents: one or more sub-agents failed:
  [1] [item 1] Conversation run failed ... Litellm_proxyException - api_base not set for LiteLLM Proxy responses API. Set via api_base parameter or LITELLM_PROXY_API_BASE environment variable
  [2] [item 2] Conversation run failed ... Litellm_proxyException - api_base not set for LiteLLM Proxy responses API. Set via api_base parameter or LITELLM_PROXY_API_BASE environment variable

I retried with a direct OpenAI model to rule out the proxy setting:

using model: openai/gpt-4o-mini
api key present: True
workflow status: error
workflow text: Conversation run failed ... litellm.AuthenticationError: OpenAIException - Incorrect API key provided

Because both attempts failed before a model response due to environment configuration/authentication, I could not honestly verify a successful live LLM-backed sub-agent run or the full examples/01_standalone_sdk/52_dynamic_workflow.py behavior. Future QA would be easier if AGENTS.md documented a known-good low-cost LLM_MODEL/LLM_BASE_URL combination for SDK example verification.

Issues Found

None.

This review was generated by an AI agent (OpenHands) on behalf of the user.

VascoSch92

LGTM!

Just some style comments.

A things to flag is that the timeout is just effective for coroutine which awaits.

So this code will not be caught by the timeout:

 async def main(wf):
      while True:
          pass

or a similar recursive script.

Is that a possible user-case or just a corner one?

neubig · 2026-06-02T22:04:49Z

@OpenHands fix the code review comments

openhands-ai · 2026-06-02T22:05:36Z

Uh oh! There was an unexpected error starting the job :(

neubig · 2026-06-03T02:19:10Z

@OpenHands fix the code review comments

openhands-ai · 2026-06-03T02:19:55Z

Uh oh! There was an unexpected error starting the job :(

VascoSch92 · 2026-06-03T06:54:35Z

@OpenHands fix the code review comments

openhands-ai · 2026-06-03T06:54:51Z

I'm on it! VascoSch92 can track my progress at all-hands.dev

VascoSch92 · 2026-06-03T06:55:16Z

@neubig I launched my OH to fix the review. Hope is not a problem.

- Use Final[str] for _WORKFLOW_DESCRIPTION - Add 'from __future__ import annotations' to definition.py for consistency and drop quoted forward references on WorkflowTool/WorkflowToolSet.create - Drop the docstring-duplicating comments above register_tool calls - Simplify validate_workflow_script main-signature check via main_args alias - Collapse the unsafe-call check into a single condition - Use ternary in _attribute_root_name Co-authored-by: openhands <openhands@all-hands.dev>

VascoSch92 · 2026-06-03T06:58:41Z

Addressed @VascoSch92's style nits in 16f873b:

_WORKFLOW_DESCRIPTION is now typed Final[str]
definition.py uses from __future__ import annotations; quoted forward references on WorkflowTool/WorkflowToolSet removed
Dropped the duplicate register_tool comments — the docstrings already explain the split
Simplified the validate_workflow_script signature check via a main_args alias
Collapsed the unsafe-call check into a single condition
Used a ternary in _attribute_root_name

Re: the timeout caveat — asyncio.timeout only fires at await points, so a purely synchronous while True body would indeed bypass it. The workflow API surface (run_agent/map_agents/reduce_agent) is fully await-based, so realistic generated workflows yield to the loop. Tight CPU loops without awaits are a fair corner case that we can revisit (e.g. an explicit step-count cap) if it shows up in practice.

All six review threads are resolved.

This comment was created by an AI agent (OpenHands) on behalf of the user.

openhands-ai · 2026-06-03T06:59:52Z

Summary

Addressed @VascoSch92's six inline code review comments on PR #3426 in commit 16f873b6, pushed to redo-dynamic-workflow-mvp.

Request fulfillment checklist

All 6 review comments addressed:
1. _WORKFLOW_DESCRIPTION → typed as Final[str] (definition.py)
2. Removed duplicate register_tool comments (the class docstrings already explain the split) (definition.py)
3. Added from __future__ import annotations and removed quoted forward references on WorkflowTool.create / WorkflowToolSet.create (definition.py)
4. Refactored signature check with a main_args = main_defs[0].args alias + [a.arg for a in main_args.args] != ["wf"] (impl.py)
5. Collapsed nested unsafe-call check into a single condition (impl.py)
6. Replaced if/return/return None in _attribute_root_name with a ternary (impl.py)
Pre-commit hooks pass on both touched files (ruff format, ruff lint, pycodestyle, pyright, import-dependency rules, Tool subclass registration check)
uv run pytest tests/tools/workflow/ → 21/21 passed
All 6 review threads replied to with the commit SHA and resolved via GraphQL (0 unresolved remaining)
Posted a summary PR comment that also addresses the approver's asyncio.timeout-doesn't-catch-tight-sync-loops caveat as a known corner case
AI disclosure included in the PR comment

Conciseness

The changes are purely the requested style nits — no extraneous modifications. The diff is +24/−24 across exactly the two files called out by the reviewer (definition.py and impl.py). No unrelated edits, no new files, no test churn.

PR: #3426

all-hands-bot · 2026-06-06T12:48:21Z

[Automatic Post]: This PR seems to be currently waiting for review. @neubig @VascoSch92, could you please take a look when you have a chance?

This comment was created by an AI agent (OpenHands) on behalf of the user.

Add dynamic workflow tool

d91a5e8

Co-authored-by: openhands <openhands@all-hands.dev>