ci: Add robot framework structure #732
68f3a89 to
ecacc58
Compare
The previous 14-test-case structure called test_dsl_functest.sh once per test case. Each call bootstraps its own fresh VMs/containers, so the job took over an hour and never completed (cancelled after 1h 16m on PR canonical#732).
The original CI ran run_dsl_full_tests as a single step, letting the script manage all VM lifecycles internally (shared VMs for baseline/validation/dryrun, isolated VMs for provision/cleanup/consistency). Restore that behaviour with one test case and a 4-hour timeout, matching the upstream contract.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
ffee815 to
97139ed
Compare
d899910 to
2ca33a5
Compare
06bf096 to
e868f2f
Compare
Replaces all 22 bash CI test jobs with Robot Framework 7.x suites that
produce structured HTML/XML reports, support selective suite execution,
and make failures easier to diagnose with inline keyword-level output.
Structure:
- tests/robot/resources/microceph_harness.resource -- ~110 shared keywords
(VM lifecycle, snap install, cluster bootstrap, OSD/RGW/NFS helpers)
- tests/robot/resources/streaming_process.py -- real-time output for
long-running processes (DSL, cephadm-adopt, wiping)
- 23 suite directories under tests/robot/, one per CI job:
single-system-tests, multi-node-tests, availability-zone-tests,
multi-node-tests-with-custom-microceph-ip, test-sequential-mon-host-refresh,
test-maintenance-modes, loop-file-tests, wal-db-tests, upgrade-reef-tests,
cluster-tests, rbd-replication-test, cephfs-replication-test, nfs-test,
nfs-multinode-test, messenger-v2-tests, wiping-test, cephadm-adopt-test,
dsl-functional-tests (6 parallel jobs), api-tests, static-checks, unit-tests
- robot.py / tox.ini -- CLI wrapper and tox integration for local runs
- tests/scripts/: actionutils.sh idempotency fix, adoptutils.sh upstream
fixes, test_dsl_functest.sh timeout hardening
Migration style:
- Inline reimplementation: bash logic rewritten as Robot/harness keywords
(the majority, checked line-by-line for 1:1 parity)
- Direct bash execution: very long suites (DSL x6, cephadm-adopt, wiping,
api-disk) run the original .sh unchanged via Run Streaming Process
- All flakiness fixes from upstream (canonical#737, canonical#741)
incorporated; additional retry loops and polling guards added throughout
Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Rewrites .github/workflows/tests.yml to invoke Robot Framework instead of
calling actionutils.sh functions directly:
- Each of the 22 test jobs now runs:
python3 tests/robot/robot.py --snap-path <snap> --test-suite <suite>
- static-checks and unit-tests moved to checks.yml (run on every push,
not just when a snap artifact is available)
- DSL functional tests split into 6 parallel jobs (baseline, validation,
dryrun, provision, cleanup, consistency) to cut wall-clock time
- LXD initialisation made explicit; host dependency checks added
- Wiping test streams output from inside the outer VM (no nested KVM)
- bash -x tracing enabled for all DSL jobs to aid debugging
Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
e868f2f to
bcf5f0e
Compare
Thanks @johnramsden, good stuff! I'm still reading through the PR, it's big :-) One high-level comment: I'd be wary of mixing too much bash into the Robot test cases, as that subverts the nice readability properties of Robot tests. So e.g. having
@sabaini I completely agree. I tried to architect it so that reused behavior was put into modules; however, the migration was large and some of the resulting tests ended up less readable than I had hoped. I can take another pass at it and try to make some of the more complex tests more readable.
sabaini
left a comment
Hey @johnramsden some more comments. Yes, factoring out more robot keywords would be nice!
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Replaces raw bash one-liners in test bodies and keyword bodies with
descriptive Robot keywords.
Harness additions (microceph_harness.resource):
- Generate Self Signed CA And Server Cert In Container -- openssl chain
- Read Base64 File From Container -- base64 -w0 file read
- Create Loop Device At -- single mktemp/truncate/losetup/mknod sequence
- Get Synced Image Count On Node -- RBD replication jq image count
- Get Primary Image Count On Node -- RBD is_primary==true jq count
- Get RBD Mirror Pool Health -- rbd mirror pool status health string
- Wait For CephFS Replication List Non Empty -- jq .vol == {} poll
- Wait For CephFS Snaps Synced -- jq snaps_synced poll (jq on outer VM)
- Read File In VM -- cat wrapper
- Get Node IP -- container hostname -I
- Mount NFS In VM -- mount -t nfs wrapper
- Write File In VM -- echo | tee wrapper
- File In VM Should Contain -- cat | grep -F wrapper
- Create Loop Devices refactored to call Create Loop Device At
Suite changes:
- multi_node_tests: Enable RGW SSL On Head Node and
Test Cross Node Certificate Rotation Inline use harness openssl/base64
- single_system_tests: Add OSD With Failure uses Create Loop Device At
- wal_db_tests: encrypted WAL/DB loop uses Create Loop Device At
- rbd_replication_tests: Wait For Secondary Sync, Failover To Site B,
Wait For RBD Mirror Health use harness RBD helpers
- cephfs_replication_tests: inline jq/FOR loops replaced with
Wait For CephFS Replication List Non Empty / Wait For CephFS Snaps Synced;
cat calls replaced with Read File In VM
- maintenance_mode_tests: Test Quorum Guardrail Blocks Enter 30-line body
replaced with 5 new local keywords
- nfs_multinode_tests: Test Mount And Write NFS uses Get Node IP /
Mount NFS In VM / Write File In VM / File In VM Should Contain
- nfs_tests: Test Log Rotation Inline refactored into 5 local keywords;
mount test uses Write File In VM / File In VM Should Contain
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Robot Framework rejects a keyword that has both an embedded argument
in its name and a separate [Arguments] line. Remove the embedded
${expected_rule_id} from the name; the [Arguments] line is sufficient.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
…on refactors)
Replaces raw bash in test bodies with descriptive Robot keywords.
cluster_tests.robot:
- Bombard RGW Configs: remove redundant '|| true' from all 12 config-set
lines (Run In VM already ignores non-zero exit codes; docstring explains)
sequential_mon_host_refresh_tests.robot:
- Extract Wait For IP In Ceph Conf On Node, Mon Host Line Should Contain
IP Once, Public Network Should Be Set Once local keywords
- Replace three identical 11-line FOR loops with
Wait For IP In Ceph Conf On Node
- Replace inline grep -c / Should Be Equal pairs with the new keywords
- Replace two inline lxc network list | grep | cut calls with
Get Public Network CIDR
single-node/basic_tests.robot:
- Test Orchestrator Module: replace Run In VM hostname + Strip String
with Get VM Hostname
availability_zone_tests.robot:
- Bootstrap AZ Cluster: replace inline lxc network list | grep | cut
with Get Public Network CIDR
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
…process.py
Previously the stdout-reading loop ran on the main thread, so
proc.wait(timeout=) was only ever called after the process had already
exited -- meaning a hanging subprocess would hang forever regardless of
the timeout argument.
Fix: move the stdout reader into a daemon thread so proc.wait(timeout=)
on the main thread genuinely fires when the limit is exceeded.
Also add start_new_session=True so the shell and all its children share
a new process group (PGID == proc.pid). On timeout, os.killpg() kills the
whole group -- without this, killing only the shell leaves grandchild
processes (e.g. a nested "sleep N") holding the stdout pipe open, which
blocks the reader thread for the full join(timeout=5) window.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
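The fix described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in streaming_process.py: the function name run_streaming and its return shape are assumptions made for the example, and it is POSIX-only because it relies on os.killpg.

```python
import os
import signal
import subprocess
import sys
import threading
from typing import List, Optional, Tuple


def run_streaming(cmd: str, timeout: float) -> Tuple[Optional[int], List[str]]:
    """Run cmd in a shell, echo its output live, and enforce a timeout.

    Hypothetical helper for illustration. Returns (returncode, lines);
    returncode is None when the timeout was hit.
    """
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # shell becomes a group leader: PGID == proc.pid
    )
    lines: List[str] = []

    def pump() -> None:
        # Runs off the main thread so proc.wait(timeout=...) below can fire
        # while the child is still producing (or withholding) output.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            sys.stdout.write(line)
            sys.stdout.flush()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    try:
        rc: Optional[int] = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the whole process group so grandchildren release the pipe.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the shell
        rc = None
    reader.join(timeout=5)
    return rc, lines
```

Calling run_streaming("echo start; sleep 30", timeout=1) should return after about one second with a return code of None, because the group kill also terminates the sleep that would otherwise keep the pipe open.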
Run In Container -- nested quoting:
Replace 'lxc exec container -- bash -c "${cmd}"' (embedded in a shell
string on the outer VM) with nested Run Process calls:
lxc exec OUTER_VM -- lxc exec container -- bash -eo pipefail -c cmd
${cmd} is now a literal argv element; no intermediate shell on the
outer VM means no nested-quoting issues regardless of quote characters
in the command string.
robot.py -- safe default:
Remove the stub_test.robot fallback; running without --test-suite now
defaults to the full test tree (same as --all) so an unqualified
invocation exercises everything rather than silently doing nothing.
AGENTS.md -- link to robot README:
Add a Robot Framework section pointing agents to tests/robot/README.md
for suite layout and harness conventions.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
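The nested-quoting change to Run In Container can be illustrated with a small sketch. The helper names below are hypothetical and "outer-vm" is a placeholder; only the argv layout is taken from the commit message. The point is that an argv element is passed through untouched, while a command spliced into a double-quoted shell string is re-tokenised (shlex.split stands in for the outer shell's tokeniser here).

```python
import shlex
from typing import List


def nested_exec_argv(outer_vm: str, container: str, cmd: str) -> List[str]:
    """Build the argv for running cmd in a container nested inside a VM.

    Hypothetical helper: cmd stays a single literal argv element, so no
    intermediate shell ever re-parses its quotes.
    """
    return [
        "lxc", "exec", outer_vm, "--",
        "lxc", "exec", container, "--",
        "bash", "-eo", "pipefail", "-c", cmd,
    ]


def legacy_shell_string(container: str, cmd: str) -> str:
    """Old approach: splice cmd into a double-quoted shell string."""
    return f'lxc exec {container} -- bash -c "{cmd}"'


cmd = 'echo "two  words"'

# New approach: the command survives byte-for-byte as the last element.
argv = nested_exec_argv("outer-vm", "node-wrk0", cmd)
print(argv[-1] == cmd)

# Old approach: tokenising the string the way a POSIX shell would shows
# the inner quotes being consumed and the command split apart.
legacy_tokens = shlex.split(legacy_shell_string("node-wrk0", cmd))
print(cmd in legacy_tokens)
```

With the list form, the same call works for commands containing single quotes, double quotes, or both, without any escaping at the call site.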
Node Is In Mon List (harness):
Wraps the repeated 'microceph.ceph -s | grep -q mon: .*daemons.*NODE'
yes/no check. Replaces identical inline greps in Enable Services On Head
Node For and Remove Node Head Node (multi_node_tests) and Wait For Node
Absent From Mons (maintenance_mode_tests).
Wait For N Nodes In Cluster (harness):
Wraps the identical 3-line FOR loop that polls microceph status until N
nodes appear. Replaces copies in Join Worker Nodes To Cluster (harness),
Bootstrap AZ Cluster, and Rejoin Node Wrk3 Into AZ-C (az tests).
OSD Tree Should Contain AZ Rack Bucket (local, az tests):
Wraps lxc exec node-wrk0 -- ceph osd tree | grep -F "az.ZONE". Replaces
6 identical one-liners across 4 test cases.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Harness additions:
- Wait For RGW On Head Node: moved from multi_node_tests.robot (was also
inlined in Enable RGW Head Node with fewer retries); Enable RGW Head Node
now calls it.
- Get VM IP: outer-VM analogue of Get Node IP (hostname -I | cut).
- Mount CephFS From Container: wraps the 4-step conf-pull + mkdir + mount
sequence that was duplicated in cephfs-replication-test.
Duplicate/inline fixes:
- multi_node_tests: Test Cross Node Certificate Rotation Inline now calls
existing Wait For RGW SSL Port / Get RGW SSL CN instead of two raw
openssl s_client FOR loops + inline sed. First loop was a latent bug
(no Fail on timeout).
- availability_zone_tests: delete AZ Wait For OSD Count (verbatim
re-implementation of Wait For OSD Count Head); replace 6 call sites.
- multi-node/basic_tests: replace inscrutable double-grep regex with
Wait For N Nodes In Cluster.
- cluster_tests / single_system_tests: replace raw hostname + Strip String
with Get VM Hostname.
- nfs_tests: replace hostname -I | cut + Strip String with Get VM IP.
- messenger_v2_tests: extract Ceph Conf Should Have No V1 Addresses;
replace useless cat file | grep with grep file (shellcheck finding).
maintenance_mode_tests:
- Replace the two WHILE loops with manual elapsed counters (only WHILE
loops in the entire suite) with FOR IN RANGE using Evaluate for the
iteration count.
- Extract Run Maintenance Enter Exit Cycle (flags, noout state, svc state
as args); collapse four 50-line near-identical enter/exit keyword bodies
to one call each (~200 lines -> ~40).
cephfs_replication_tests:
- Use Mount CephFS From Container for both primary and secondary mounts.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
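The switch from WHILE loops with elapsed counters to a bounded FOR IN RANGE, and the "no Fail on timeout" bug noted above, both come down to the same polling shape. The sketch below shows that shape in Python rather than Robot syntax; poll_until and its parameters are hypothetical names, not harness keywords.

```python
import math
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() a fixed number of times instead of tracking elapsed time.

    The iteration count is computed once up front (the role Evaluate plays
    in the Robot version), so the loop is bounded by construction.
    """
    attempts = max(1, math.ceil(timeout_s / interval_s))
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)  # no point sleeping after the final attempt
    # The caller must treat False as a failure; silently falling through
    # here is the kind of latent bug the commit message describes.
    return False
```

A caller would typically wrap this as `if not poll_until(...): raise AssertionError("timed out")`, so that a timeout cannot pass unnoticed.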
Both files now explain:
- static-checks and unit-tests run on the host with no LXD/snap needed
- Integration tests require LXD initialised with outbound internet access
in VMs (apt-get install s3cmd/jq/ceph-common runs during suite setup)
- How to verify the network requirement before investing in a full run
- How to build the snap and invoke individual suites or the full tree
- Per-suite host resource guide (vCPU/RAM/disk/duration)
Also adds harness convention summary to tests/robot/README.md.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
… check
grep -c prints "0" to stdout AND exits rc=1 when no matches are found.
Using '|| echo 0' caused the fallback to also print "0", yielding "0\n0"
instead of "0" -- making Should Be Equal As Strings fail with "0 != 0".
Use '|| true' instead: grep -c still outputs the count, and || true only
normalises the exit code without adding any extra output.
The original code used 'cat file | grep | wc -l' where wc -l always exits
0 (even on empty input), so the fallback never fired. grep -c behaves
differently and the || echo 0 pattern is wrong for it.
Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
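The grep -c behaviour described in this commit is easy to reproduce. The snippet below is a small reproduction under assumed inputs (a two-line stream with no match for the placeholder pattern 'v1:'); shell_stdout is a hypothetical helper, and bash and grep are assumed to be on PATH.

```python
import subprocess


def shell_stdout(script: str) -> str:
    """Return the raw stdout of a bash one-liner, ignoring its exit code."""
    return subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True, check=False
    ).stdout


# No input line matches, so grep -c prints "0" and exits with status 1.
broken = shell_stdout("printf 'a\\nb\\n' | grep -c 'v1:' || echo 0")
fixed = shell_stdout("printf 'a\\nb\\n' | grep -c 'v1:' || true")
# The older pipeline ends in wc -l, which exits 0, so the fallback never runs.
legacy = shell_stdout("printf 'a\\nb\\n' | grep 'v1:' | wc -l || echo 0")

print(repr(broken))  # '0\n0\n' : count from grep plus the fallback echo
print(repr(fixed))   # '0\n'    : count only
print(repr(legacy.strip()))
```

The two-line output of the broken variant is what makes the string comparison fail even though the logged message looks like "0 != 0".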
23afe4e to
c0e2372
Compare
Description
Migrate MicroCeph's CI test suite from bash + GitHub Actions to the Robot Framework. The new tests must be runnable locally without any GitHub Actions dependency.
See for context:
Type of change
Contributor checklist
Please check that you have: