ci: Add robot framework structure by johnramsden · Pull Request #732 · canonical/microceph

johnramsden · 2026-05-13T14:16:52Z

Description

Migrate MicroCeph's CI test suite from bash + GitHub Actions to the Robot Framework. The new tests must be runnable locally without any GitHub Actions dependency.

See for context:

Type of change

Clean code (code refactor, test updates; does not introduce functional changes)

Contributor checklist

Please check that you have:

self-reviewed the code in this PR
added code comments, particularly in less straightforward areas
checked and added or updated relevant documentation
added or updated HTML meta descriptions for any new or modified documentation pages (see #643)
verified that page title and headings accurately represent page content for new or modified documentation pages
checked and added or updated relevant release notes
added tests to verify effectiveness of this change

The previous 14-test-case structure called test_dsl_functest.sh once per test case. Each call bootstraps its own fresh VMs/containers, so the job took 1h+ and never completed (cancelled after 1h 16m on PR canonical#732).

The original CI ran run_dsl_full_tests as a single step, letting the script manage all VM lifecycles internally (shared VMs for baseline/validation/dryrun, isolated VMs for provision/cleanup/consistency). Restore that behaviour with one test case and a 4-hour timeout, matching the upstream contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Replaces all 22 bash CI test jobs with Robot Framework 7.x suites that produce structured HTML/XML reports, support selective suite execution, and make failures easier to diagnose with inline keyword-level output.

Structure:
- tests/robot/resources/microceph_harness.resource -- ~110 shared keywords (VM lifecycle, snap install, cluster bootstrap, OSD/RGW/NFS helpers)
- tests/robot/resources/streaming_process.py -- real-time output for long-running processes (DSL, cephadm-adopt, wiping)
- 23 suite directories under tests/robot/, one per CI job: single-system-tests, multi-node-tests, availability-zone-tests, multi-node-tests-with-custom-microceph-ip, test-sequential-mon-host-refresh, test-maintenance-modes, loop-file-tests, wal-db-tests, upgrade-reef-tests, cluster-tests, rbd-replication-test, cephfs-replication-test, nfs-test, nfs-multinode-test, messenger-v2-tests, wiping-test, cephadm-adopt-test, dsl-functional-tests (6 parallel jobs), api-tests, static-checks, unit-tests
- robot.py / tox.ini -- CLI wrapper and tox integration for local runs
- tests/scripts/: actionutils.sh idempotency fix, adoptutils.sh upstream fixes, test_dsl_functest.sh timeout hardening

Migration style:
- Inline reimplementation: bash logic rewritten as Robot/harness keywords (the majority, checked line-by-line for 1:1 parity)
- Direct bash execution: very long suites (DSL x6, cephadm-adopt, wiping, api-disk) run the original .sh unchanged via Run Streaming Process
- All flakiness fixes from upstream (canonical#737, canonical#741) incorporated; additional retry loops and polling guards added throughout

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Rewrites .github/workflows/tests.yml to invoke Robot Framework instead of calling actionutils.sh functions directly:
- Each of the 22 test jobs now runs: python3 tests/robot/robot.py --snap-path <snap> --test-suite <suite>
- static-checks and unit-tests moved to checks.yml (run on every push, not just when a snap artifact is available)
- DSL functional tests split into 6 parallel jobs (baseline, validation, dryrun, provision, cleanup, consistency) to cut wall-clock time
- LXD initialisation made explicit; host dependency checks added
- Wiping test streams output from inside the outer VM (no nested KVM)
- bash -x tracing enabled for all DSL jobs to aid debugging

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

sabaini · 2026-06-01T16:07:29Z

Thanks @johnramsden, good stuff! I'm still reading through the PR, it's big :-) One high-level comment: I'd be wary of mixing too much bash into the Robot test cases, since that subverts the nice readability properties of Robot tests.

So e.g. having Run In Container can be ok for small bits, but I wonder if Run In Container node-wrk0 microceph.ceph -s | grep "mon: 1 daemons, quorum node-wrk0" wouldn't be better as a Should Have One Mon cmd or equivalent. Similarly, some of the poll loops could be replaced with a Poll For Ceph Health cmd (for example).

johnramsden · 2026-06-01T16:15:44Z

@sabaini I completely agree. I tried to architect it so that reused behavior was put into modules; however, the migration was large and some of the resulting tests ended up less readable than I would have hoped.

I can take another pass at it and try to make some of the more complex tests more readable.

sabaini

Hey @johnramsden some more comments. Yes, factoring out more robot keywords would be nice!

Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Replaces raw bash one-liners in test bodies and keyword bodies with descriptive Robot keywords.

Harness additions (microceph_harness.resource):
- Generate Self Signed CA And Server Cert In Container -- openssl chain
- Read Base64 File From Container -- base64 -w0 file read
- Create Loop Device At -- single mktemp/truncate/losetup/mknod sequence
- Get Synced Image Count On Node -- RBD replication jq image count
- Get Primary Image Count On Node -- RBD is_primary==true jq count
- Get RBD Mirror Pool Health -- rbd mirror pool status health string
- Wait For CephFS Replication List Non Empty -- jq .vol == {} poll
- Wait For CephFS Snaps Synced -- jq snaps_synced poll (jq on outer VM)
- Read File In VM -- cat wrapper
- Get Node IP -- container hostname -I
- Mount NFS In VM -- mount -t nfs wrapper
- Write File In VM -- echo | tee wrapper
- File In VM Should Contain -- cat | grep -F wrapper
- Create Loop Devices refactored to call Create Loop Device At

Suite changes:
- multi_node_tests: Enable RGW SSL On Head Node and Test Cross Node Certificate Rotation Inline use harness openssl/base64
- single_system_tests: Add OSD With Failure uses Create Loop Device At
- wal_db_tests: encrypted WAL/DB loop uses Create Loop Device At
- rbd_replication_tests: Wait For Secondary Sync, Failover To Site B, Wait For RBD Mirror Health use harness RBD helpers
- cephfs_replication_tests: inline jq/FOR loops replaced with Wait For CephFS Replication List Non Empty / Wait For CephFS Snaps Synced; cat calls replaced with Read File In VM
- maintenance_mode_tests: Test Quorum Guardrail Blocks Enter 30-line body replaced with 5 new local keywords
- nfs_multinode_tests: Test Mount And Write NFS uses Get Node IP / Mount NFS In VM / Write File In VM / File In VM Should Contain
- nfs_tests: Test Log Rotation Inline refactored into 5 local keywords; mount test uses Write File In VM / File In VM Should Contain

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Robot Framework rejects a keyword that has both an embedded argument in its name and a separate [Arguments] line. Remove the embedded ${expected_rule_id} from the name; the [Arguments] line is sufficient.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

…on refactors)

Replaces raw bash in test bodies with descriptive Robot keywords.

cluster_tests.robot:
- Bombard RGW Configs: remove redundant '|| true' from all 12 config-set lines (Run In VM already ignores non-zero exit codes; docstring explains)

sequential_mon_host_refresh_tests.robot:
- Extract Wait For IP In Ceph Conf On Node, Mon Host Line Should Contain IP Once, Public Network Should Be Set Once local keywords
- Replace three identical 11-line FOR loops with Wait For IP In Ceph Conf On Node
- Replace inline grep -c / Should Be Equal pairs with the new keywords
- Replace two inline lxc network list | grep | cut calls with Get Public Network CIDR

single-node/basic_tests.robot:
- Test Orchestrator Module: replace Run In VM hostname + Strip String with Get VM Hostname

availability_zone_tests.robot:
- Bootstrap AZ Cluster: replace inline lxc network list | grep | cut with Get Public Network CIDR

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

…process.py

Previously the stdout-reading loop ran on the main thread, so proc.wait(timeout=) was only ever called after the process had already exited, meaning a hanging subprocess would hang forever regardless of the timeout argument.

Fix: move the stdout reader into a daemon thread so proc.wait(timeout=) on the main thread genuinely fires when the limit is exceeded.

Also add start_new_session=True so the shell and all its children share a new process group (PGID == proc.pid). On timeout, os.killpg() kills the whole group. Without this, killing only the shell leaves grandchild processes (e.g. a nested "sleep N") holding the stdout pipe open, which blocks the reader thread for the full join(timeout=5) window.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
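The pattern described in this commit message can be sketched roughly as follows. This is an illustrative reconstruction and not the contents of streaming_process.py: the function name run_streaming, its signature, and the use of sh -c are assumptions made for the example. It is POSIX-only, since it relies on os.killpg.

```python
import os
import signal
import subprocess
import sys
import threading


def run_streaming(cmd: str, timeout: float) -> int:
    """Run cmd in a shell, echo its output live, and enforce the timeout.

    Hypothetical helper for illustration; not the project's actual API.
    """
    proc = subprocess.Popen(
        ["sh", "-c", cmd],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # shell leads a new group: PGID == proc.pid
    )

    def pump() -> None:
        # Reads on a background thread so the main thread is free to wait
        # with a timeout even when the child produces no output.
        for line in proc.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the shell and every descendant, so nothing keeps the
        # stdout pipe open and the reader thread can finish.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        reader.join(timeout=5)
        raise
    reader.join(timeout=5)
    return rc
```

A call such as run_streaming("sleep 30 & sleep 30", timeout=1) should raise subprocess.TimeoutExpired after about one second, because the background sleep is killed together with the shell rather than being left to hold the pipe.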

Run In Container -- nested quoting:
Replace 'lxc exec container -- bash -c "${cmd}"' (embedded in a shell string on the outer VM) with nested Run Process calls:
lxc exec OUTER_VM -- lxc exec container -- bash -eo pipefail -c cmd
${cmd} is now a literal argv element; no intermediate shell on the outer VM means no nested-quoting issues regardless of quote characters in the command string.

robot.py -- safe default:
Remove the stub_test.robot fallback; running without --test-suite now defaults to the full test tree (same as --all) so an unqualified invocation exercises everything rather than silently doing nothing.

AGENTS.md -- link to robot README:
Add a Robot Framework section pointing agents to tests/robot/README.md for suite layout and harness conventions.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
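The difference between passing the command as one argv element and splicing it into a shell string can be shown with a small local sketch. The helper names below (container_argv, run_via_argv, run_via_shell_string) are hypothetical, and the two run_* functions use a local bash as a stand-in for the lxc layers, so no LXD is needed to try it.

```python
import subprocess


def container_argv(outer_vm: str, container: str, cmd: str) -> list[str]:
    """Build the nested lxc exec argv; cmd stays one literal element."""
    return [
        "lxc", "exec", outer_vm, "--",
        "lxc", "exec", container, "--",
        "bash", "-eo", "pipefail", "-c", cmd,
    ]


def run_via_argv(cmd: str) -> str:
    """Local stand-in: hand cmd to bash as a single argv element."""
    done = subprocess.run(["bash", "-c", cmd], capture_output=True, text=True)
    return done.stdout


def run_via_shell_string(cmd: str) -> str:
    """The fragile form: cmd is pasted into a string an outer shell re-parses."""
    outer = f'bash -c "{cmd}"'
    done = subprocess.run(["bash", "-c", outer], capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    cmd = 'echo "a  b"'
    print(repr(run_via_argv(cmd)))          # quotes reach the inner bash intact
    print(repr(run_via_shell_string(cmd)))  # outer shell consumes the quotes
```

With the argv form, the double quotes inside cmd are only ever interpreted by the innermost bash. With the string form, the outer shell pairs them with the wrapping quotes, so the inner bash sees a different command.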

Node Is In Mon List (harness):
Wraps the repeated 'microceph.ceph -s | grep -q mon: .*daemons.*NODE' yes/no check. Replaces identical inline greps in Enable Services On Head Node For and Remove Node Head Node (multi_node_tests) and Wait For Node Absent From Mons (maintenance_mode_tests).

Wait For N Nodes In Cluster (harness):
Wraps the identical 3-line FOR loop that polls microceph status until N nodes appear. Replaces copies in Join Worker Nodes To Cluster (harness), Bootstrap AZ Cluster, and Rejoin Node Wrk3 Into AZ-C (az tests).

OSD Tree Should Contain AZ Rack Bucket (local, az tests):
Wraps lxc exec node-wrk0 -- ceph osd tree | grep -F "az.ZONE". Replaces 6 identical one-liners across 4 test cases.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
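As a rough illustration of what these wrappers factor out, the sketch below pairs a yes/no check with a bounded polling loop. It is written in Python rather than Robot syntax, and the names node_in_mon_line and wait_for, the attempt and delay parameters, and the choice of TimeoutError are all assumptions for the example, not the harness's actual keywords.

```python
import re
import time
from typing import Callable


def node_in_mon_line(ceph_status: str, node: str) -> bool:
    """Mirror of: ceph -s | grep -q 'mon: .*daemons.*NODE' (per-line match)."""
    pattern = rf"mon: .*daemons.*{re.escape(node)}"
    return re.search(pattern, ceph_status) is not None


def wait_for(check: Callable[[], bool], attempts: int, delay: float,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Call check() up to `attempts` times; return the attempt that passed.

    Raises TimeoutError when the condition never holds, so a caller cannot
    silently fall through after the last retry.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        sleep(delay)
    raise TimeoutError(f"condition not met after {attempts} attempts")


if __name__ == "__main__":
    # Canned status output standing in for successive `ceph -s` calls.
    samples = iter([
        "mon: 1 daemons, quorum node-wrk0",
        "mon: 2 daemons, quorum node-wrk0,node-wrk1",
    ])
    tries = wait_for(lambda: node_in_mon_line(next(samples), "node-wrk1"),
                     attempts=5, delay=0)
    print(f"node-wrk1 joined after {tries} polls")
```

Keeping the check and the retry policy separate means each test states only what it is waiting for, while the loop bounds and the failure behaviour live in one place.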

Harness additions:
- Wait For RGW On Head Node: moved from multi_node_tests.robot (was also inlined in Enable RGW Head Node with fewer retries); Enable RGW Head Node now calls it.
- Get VM IP: outer-VM analogue of Get Node IP (hostname -I | cut).
- Mount CephFS From Container: wraps the 4-step conf-pull + mkdir + mount sequence that was duplicated in cephfs-replication-test.

Duplicate/inline fixes:
- multi_node_tests: Test Cross Node Certificate Rotation Inline now calls existing Wait For RGW SSL Port / Get RGW SSL CN instead of two raw openssl s_client FOR loops + inline sed. First loop was a latent bug (no Fail on timeout).
- availability_zone_tests: delete AZ Wait For OSD Count (verbatim re-implementation of Wait For OSD Count Head); replace 6 call sites.
- multi-node/basic_tests: replace inscrutable double-grep regex with Wait For N Nodes In Cluster.
- cluster_tests / single_system_tests: replace raw hostname + Strip String with Get VM Hostname.
- nfs_tests: replace hostname -I | cut + Strip String with Get VM IP.
- messenger_v2_tests: extract Ceph Conf Should Have No V1 Addresses; replace useless cat file | grep with grep file (shellcheck finding).

maintenance_mode_tests:
- Replace the two WHILE loops with manual elapsed counters (only WHILE loops in the entire suite) with FOR IN RANGE using Evaluate for the iteration count.
- Extract Run Maintenance Enter Exit Cycle (flags, noout state, svc state as args); collapse four 50-line near-identical enter/exit keyword bodies to one call each (~200 lines -> ~40).

cephfs_replication_tests:
- Use Mount CephFS From Container for both primary and secondary mounts.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Both files now explain:
- static-checks and unit-tests run on the host with no LXD/snap needed
- Integration tests require LXD initialised with outbound internet access in VMs (apt-get install s3cmd/jq/ceph-common runs during suite setup)
- How to verify the network requirement before investing in a full run
- How to build the snap and invoke individual suites or the full tree
- Per-suite host resource guide (vCPU/RAM/disk/duration)

Also adds harness convention summary to tests/robot/README.md.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>

… check

grep -c prints "0" to stdout AND exits rc=1 when no matches are found. Using '|| echo 0' caused the fallback to also print "0", yielding "0\n0" instead of "0", which made Should Be Equal As Strings fail with "0 != 0".

Use '|| true' instead: grep -c still outputs the count, and || true only normalises the exit code without adding any extra output.

The original code used 'cat file | grep | wc -l' where wc -l always exits 0 (even on empty input), so the fallback never fired. grep -c behaves differently and the || echo 0 pattern is wrong for it.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
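The behaviour described here is easy to reproduce outside the test suite. The sketch below is a minimal demonstration, assuming a POSIX sh and grep are available; the helper name count_with_fallback is made up for the example and the sample text is arbitrary.

```python
import subprocess


def count_with_fallback(pattern: str, text: str, fallback: str) -> str:
    """Run `grep -c PATTERN <fallback>` on text and return stripped stdout.

    `fallback` is the shell suffix under test, e.g. '|| echo 0' or '|| true'.
    The pattern is passed as a positional parameter, not spliced into the
    script, so it needs no extra quoting.
    """
    script = f'grep -c -- "$1" {fallback}'
    done = subprocess.run(
        ["sh", "-c", script, "sh", pattern],
        input=text, capture_output=True, text=True,
    )
    return done.stdout.strip()


if __name__ == "__main__":
    text = "alpha\nbeta\n"
    # No match: grep -c already printed 0, then the fallback prints it again.
    print(repr(count_with_fallback("gamma", text, "|| echo 0")))
    # No match: only grep's own 0 is printed; the exit code is normalised.
    print(repr(count_with_fallback("gamma", text, "|| true")))
    # With a match grep exits 0, so neither fallback fires.
    print(repr(count_with_fallback("alpha", text, "|| echo 0")))
```

The doubled output only appears in the zero-match case, which is why the bug can stay hidden until a test asserts that a count is exactly zero.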

johnramsden marked this pull request as draft May 13, 2026 14:17

johnramsden force-pushed the megademo-robot branch 4 times, most recently from 68f3a89 to ecacc58 Compare May 21, 2026 18:04

johnramsden force-pushed the megademo-robot branch from ffee815 to 97139ed Compare May 23, 2026 00:01

johnramsden force-pushed the megademo-robot branch 2 times, most recently from d899910 to 2ca33a5 Compare May 28, 2026 16:05

johnramsden linked an issue May 28, 2026 that may be closed by this pull request

Integration tests cannot be run locally #704

Open

johnramsden force-pushed the megademo-robot branch from 06bf096 to e868f2f Compare May 29, 2026 18:38

johnramsden marked this pull request as ready for review May 29, 2026 20:20

johnramsden added 2 commits May 29, 2026 13:23

johnramsden force-pushed the megademo-robot branch from e868f2f to bcf5f0e Compare May 29, 2026 20:25

sabaini requested changes Jun 2, 2026

Comment thread tests/robot/resources/streaming_process.py

Comment thread tests/robot/resources/microceph_harness.resource Outdated

Comment thread tests/robot/robot.py Outdated

Comment thread tests/robot/README.md

johnramsden marked this pull request as draft June 2, 2026 21:18

johnramsden added 9 commits June 2, 2026 15:27

fix: Remove 'translated from', not helpful

1199572

Signed-off-by: John Ramsden <john.ramsden@canonical.com>

johnramsden force-pushed the megademo-robot branch from 23afe4e to c0e2372 Compare June 3, 2026 03:34

johnramsden marked this pull request as ready for review June 4, 2026 00:55

johnramsden wants to merge 12 commits into canonical:main from johnramsden:megademo-robot

2 participants