feat: add live LLM integration tests (label-triggered CI) #624

Draft
xingyaoww wants to merge 1 commit into main from feat/live-llm-integration-tests

xingyaoww commented Mar 31, 2026

What changed

Add a new test suite that runs the real CLI against a real LLM provider in --headless mode, validating the full end-to-end pipeline. Tests are opt-in and triggered by the run-live-llm PR label.
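As a rough illustration of what "running the real CLI in --headless mode" from a test can look like, here is a minimal sketch of a subprocess wrapper. It is not the PR's `run_cli` fixture: the `CliResult` type, the function signature, and the timeout default are assumptions made for this example, and the command line is passed in by the caller because the PR description does not show the exact invocation.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class CliResult:
    """Captured outcome of one CLI run (hypothetical type for this sketch)."""

    returncode: int
    stdout: str
    stderr: str


def run_cli(
    command: list,
    timeout: float = 300.0,
    env: Optional[dict] = None,
) -> CliResult:
    """Run a command to completion and capture its text output.

    In the live tests the command would be the CLI started with the
    --headless and --override-with-envs flags named in this PR; the
    executable name and the way the prompt is passed are not shown in
    the PR description, so they are left to the caller here.
    """
    proc = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode bytes to str
        timeout=timeout,      # raises subprocess.TimeoutExpired if exceeded
        env=env,              # None means "inherit the current environment"
        check=False,          # let the test decide how to treat exit codes
    )
    return CliResult(proc.returncode, proc.stdout, proc.stderr)
```

Returning the exit code and both streams, rather than raising on failure, lets each test assert on whichever part of the headless output it cares about.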

Why

All existing "conversation tests" use mock LLM servers with trajectory replay. This new layer validates that the CLI actually works end-to-end with a real LLM — covering LLM connectivity, tool-call parsing, terminal execution, observation routing, multi-step planning, code generation, and headless output.

New files

| File | Purpose |
| --- | --- |
| tests/live_llm/conftest.py | --run-live-llm pytest option, run_cli fixture (headless subprocess wrapper), JSON result reporting |
| tests/live_llm/test_live_llm.py | 3 test cases (see below) |
| tests/live_llm/README.md | Usage docs |
| .github/workflows/live-llm-tests.yml | CI workflow: run-live-llm label trigger, PR comment with results |
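The JSON result reporting and the PR comment with results could fit together roughly as sketched below. This is an assumption-heavy illustration, not the code in conftest.py: the one-file-per-test layout, the field names (`test`, `passed`, `duration_s`), and both function names are hypothetical, and only the `.live-llm-results/` directory name comes from this PR.

```python
import json
from pathlib import Path


def write_result(results_dir: Path, test_name: str, passed: bool, duration_s: float) -> Path:
    """Write one test outcome as a JSON file (hypothetical schema)."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{test_name}.json"
    payload = {
        "test": test_name,
        "passed": passed,
        "duration_s": round(duration_s, 2),
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def summarize(results_dir: Path) -> str:
    """Render all stored results as a Markdown table for a PR comment."""
    lines = ["| Test | Result | Duration (s) |", "| --- | --- | --- |"]
    for result_file in sorted(results_dir.glob("*.json")):
        row = json.loads(result_file.read_text(encoding="utf-8"))
        status = "passed" if row["passed"] else "failed"
        lines.append(f"| {row['test']} | {status} | {row['duration_s']} |")
    return "\n".join(lines)
```

Writing machine-readable results first and formatting them in a separate step keeps the test run independent of how CI chooses to present the outcome.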

Test cases

Three prompts were chosen to maximize component coverage with minimal LLM calls:

| Test | Prompt | Components covered |
| --- | --- | --- |
| test_echo_command | echo 'hello from openhands' | LLM connectivity → tool-call parsing → TerminalTool → observation → agent summary → --override-with-envs |
| test_file_create_and_read | Create check.txt + cat it back | Multi-step planning, file I/O (write → read), cross-turn observation correctness |
| test_python_code_gen_and_run | Write calc.py (2**10) + run it | Code generation, file creation, Python execution, numeric output |
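To make the intent of these cases concrete, the sketch below shows the kind of assertions such tests might make. It assumes a simplified, hypothetical `run_cli` callable that takes a prompt and returns the captured output; the prompt wording and the helper names are illustrative only and are not taken from test_live_llm.py.

```python
from pathlib import Path
from typing import Callable

# Hypothetical fixture shape: takes a prompt, returns the CLI's captured output.
RunCli = Callable[[str], str]


def check_echo(run_cli: RunCli) -> None:
    """The echoed string should appear somewhere in the headless output."""
    output = run_cli("Run this command: echo 'hello from openhands'")
    assert "hello from openhands" in output


def check_python_code_gen(run_cli: RunCli, workdir: Path) -> None:
    """The agent should create calc.py and report the value of 2**10."""
    output = run_cli("Write calc.py that prints 2**10, then run it.")
    assert (workdir / "calc.py").exists()  # side effect on disk
    assert str(2**10) in output            # numeric result, i.e. 1024
```

Because a real LLM phrases its answers differently from run to run, the checks look for short, stable substrings and on-disk side effects rather than comparing the full output.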

Verified

All 3 tests pass with anthropic/claude-haiku-4-5-20251001 via LiteLLM proxy:

tests/live_llm/test_live_llm.py::TestEchoCommand::test_echo_command PASSED
tests/live_llm/test_live_llm.py::TestFileCreateAndRead::test_file_create_and_read PASSED
tests/live_llm/test_live_llm.py::TestCodeGenAndExecution::test_python_code_gen_and_run PASSED
3 passed in 30.04s

Existing tests unaffected (1290 passed, 3 skipped):

uv run pytest --ignore=tests/snapshots  # 1290 passed, 3 skipped

Commands run

  • uv run ruff check tests/live_llm/
  • uv run ruff format tests/live_llm/ --check
  • uv run pytest --ignore=tests/snapshots -q → 1290 passed, 3 skipped ✅
  • uv run pytest tests/live_llm/ -v → 3 skipped (no flag) ✅
  • uv run pytest tests/live_llm/ --run-live-llm -v → 3 passed ✅
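The skip-without-flag behaviour seen in the last two commands is commonly implemented with two pytest hooks in conftest.py. The following is a sketch of that general pattern, not the PR's actual code: the path-based filter, the skip reason, and the local import are choices made for this example.

```python
LIVE_LLM_DIR = "tests/live_llm/"  # assumed location, taken from the file list


def pytest_addoption(parser):
    """Register the opt-in command-line flag."""
    parser.addoption(
        "--run-live-llm",
        action="store_true",
        default=False,
        help="Run tests that call a real LLM provider.",
    )


def pytest_collection_modifyitems(config, items):
    """Mark live LLM tests as skipped unless --run-live-llm was passed."""
    if config.getoption("--run-live-llm", default=False):
        return  # flag given: run everything as collected

    import pytest  # local import: only needed when tests must be skipped

    skip_live = pytest.mark.skip(reason="live LLM tests need --run-live-llm")
    for item in items:
        # The hook receives every collected test, so filter by path to
        # leave the rest of the suite untouched.
        if LIVE_LLM_DIR in item.nodeid:
            item.add_marker(skip_live)
```

Skipping rather than deselecting keeps the three tests visible in the report, which matches the "3 skipped" count shown when the flag is absent.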

Other changes

  • Makefile: add test-live-llm target
  • .gitignore: add .live-llm-results/
  • AGENTS.md: document live LLM test layer

🚀 Try this PR

uvx --python 3.12 git+https://github.com/OpenHands/OpenHands-CLI.git@feat/live-llm-integration-tests

Add a new test suite that runs the real CLI against a real LLM provider
in --headless mode. Tests are opt-in and designed to validate the full
end-to-end pipeline with minimal, high-coverage prompts.

New files:
- tests/live_llm/conftest.py — fixtures (run_cli), pytest --run-live-llm option,
  JSON result reporting for CI
- tests/live_llm/test_live_llm.py — 3 tests:
  1. test_echo_command: LLM → tool-call → terminal → observation → summary
  2. test_file_create_and_read: multi-step planning + file I/O
  3. test_python_code_gen_and_run: code gen + file creation + Python execution
- .github/workflows/live-llm-tests.yml — triggered by 'run-live-llm' PR label
  or manual dispatch; posts results as sticky PR comment
- tests/live_llm/README.md — usage docs

Also:
- Makefile: add test-live-llm target
- .gitignore: add .live-llm-results/
- AGENTS.md: document the new test layer

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| TOTAL | 6631 | 906 | 86% | |
report-only-changed-files is enabled. No files were changed during this commit :)
