ci: Add robot framework structure#732

Open
johnramsden wants to merge 12 commits into canonical:main from johnramsden:megademo-robot

Conversation

@johnramsden
Member

@johnramsden johnramsden commented May 13, 2026

Description

Migrate MicroCeph's CI test suite from bash + GitHub Actions to the Robot Framework. The new tests must be runnable locally without any GitHub Actions dependency.

See for context:

Type of change

  • Clean code (code refactor, test updates; does not introduce functional changes)

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • added or updated HTML meta descriptions for any new or modified documentation pages (see #643)
  • verified that page title and headings accurately represent page content for new or modified documentation pages
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

@johnramsden johnramsden marked this pull request as draft May 13, 2026 14:17
@johnramsden johnramsden force-pushed the megademo-robot branch 4 times, most recently from 68f3a89 to ecacc58 Compare May 21, 2026 18:04
johnramsden added a commit to johnramsden/microceph that referenced this pull request May 21, 2026
The previous 14-test-case structure called test_dsl_functest.sh once per
test case. Each call bootstraps its own fresh VMs/containers, so the job
took 1h+ and never completed (cancelled after 1h 16m on PR canonical#732).

The original CI ran run_dsl_full_tests as a single step, letting the
script manage all VM lifecycles internally (shared VMs for
baseline/validation/dryrun, isolated VMs for provision/cleanup/
consistency). Restore that behaviour with one test case and a 4-hour
timeout, matching the upstream contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
johnramsden added a commit to johnramsden/microceph that referenced this pull request May 28, 2026
The previous 14-test-case structure called test_dsl_functest.sh once per
test case. Each call bootstraps its own fresh VMs/containers, so the job
took 1h+ and never completed (cancelled after 1h 16m on PR canonical#732).

The original CI ran run_dsl_full_tests as a single step, letting the
script manage all VM lifecycles internally (shared VMs for
baseline/validation/dryrun, isolated VMs for provision/cleanup/
consistency). Restore that behaviour with one test case and a 4-hour
timeout, matching the upstream contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden johnramsden force-pushed the megademo-robot branch 2 times, most recently from d899910 to 2ca33a5 Compare May 28, 2026 16:05
@johnramsden johnramsden linked an issue May 28, 2026 that may be closed by this pull request
@johnramsden johnramsden marked this pull request as ready for review May 29, 2026 20:20
Replaces all 22 bash CI test jobs with Robot Framework 7.x suites that
produce structured HTML/XML reports, support selective suite execution,
and make failures easier to diagnose with inline keyword-level output.

Structure:
- tests/robot/resources/microceph_harness.resource — ~110 shared keywords
  (VM lifecycle, snap install, cluster bootstrap, OSD/RGW/NFS helpers)
- tests/robot/resources/streaming_process.py — real-time output for
  long-running processes (DSL, cephadm-adopt, wiping)
- 23 suite directories under tests/robot/, one per CI job:
  single-system-tests, multi-node-tests, availability-zone-tests,
  multi-node-tests-with-custom-microceph-ip, test-sequential-mon-host-refresh,
  test-maintenance-modes, loop-file-tests, wal-db-tests, upgrade-reef-tests,
  cluster-tests, rbd-replication-test, cephfs-replication-test, nfs-test,
  nfs-multinode-test, messenger-v2-tests, wiping-test, cephadm-adopt-test,
  dsl-functional-tests (6 parallel jobs), api-tests, static-checks, unit-tests
- robot.py / tox.ini — CLI wrapper and tox integration for local runs
- tests/scripts/: actionutils.sh idempotency fix, adoptutils.sh upstream
  fixes, test_dsl_functest.sh timeout hardening

Migration style:
- Inline reimplementation: bash logic rewritten as Robot/harness keywords
  (the majority — checked line-by-line for 1:1 parity)
- Direct bash execution: very long suites (DSL x6, cephadm-adopt, wiping,
  api-disk) run the original .sh unchanged via Run Streaming Process
- All flakiness fixes from upstream (canonical#737, canonical#741) incorporated; additional
  retry loops and polling guards added throughout

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Rewrites .github/workflows/tests.yml to invoke Robot Framework instead of
calling actionutils.sh functions directly:

- Each of the 22 test jobs now runs:
    python3 tests/robot/robot.py --snap-path <snap> --test-suite <suite>
- static-checks and unit-tests moved to checks.yml (run on every push,
  not just when a snap artifact is available)
- DSL functional tests split into 6 parallel jobs (baseline, validation,
  dryrun, provision, cleanup, consistency) to cut wall-clock time
- LXD initialisation made explicit; host dependency checks added
- Wiping test streams output from inside the outer VM (no nested KVM)
- bash -x tracing enabled for all DSL jobs to aid debugging

Assisted-by: claude-code:claude-sonnet-4-6
Assisted-by: claude-code:claude-opus-4-7
Assisted-by: claude-code:claude-opus-4-8
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
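
For readers unfamiliar with the wrapper pattern, here is a minimal sketch of how a script like robot.py might map the `--snap-path` and `--test-suite` flags onto a `robot` command line. It is not the PR's implementation: the `SNAP_PATH` variable name, the `results` output directory, and the `build_robot_command` helper are all assumptions made for illustration. An omitted `--test-suite` falls back to the whole tree, which is the default a later commit in this PR describes.

```python
import argparse
from pathlib import Path

# Suites live in one directory per CI job under tests/robot/ (per the PR text).
ROBOT_ROOT = Path("tests/robot")


def build_robot_command(argv):
    """Translate wrapper flags into a `robot` argv list (built, not executed)."""
    parser = argparse.ArgumentParser(prog="robot.py")
    parser.add_argument("--snap-path", help="path to the snap under test")
    parser.add_argument("--test-suite", help="suite directory name; omit to run all")
    args = parser.parse_args(argv)

    # No --test-suite: point robot at the whole tree.
    target = ROBOT_ROOT / args.test_suite if args.test_suite else ROBOT_ROOT

    cmd = ["robot", "--outputdir", "results"]  # output location is an assumption
    if args.snap_path:
        # SNAP_PATH is a hypothetical variable name, not taken from the PR.
        cmd += ["--variable", f"SNAP_PATH:{args.snap_path}"]
    cmd.append(str(target))
    return cmd
```

Returning the argv list instead of running it keeps the flag handling easy to check without LXD or a snap present.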
@sabaini
Collaborator

sabaini commented Jun 1, 2026

Thanks @johnramsden, good stuff! I'm still reading through the PR, it's big :-) One high-level comment is that I'd be wary of mixing too much bash into the robot test cases; that subverts the nice readability properties of robot tests.

So e.g. having Run In Container can be OK for small bits, but I wonder if `Run In Container node-wrk0 microceph.ceph -s | grep "mon: 1 daemons, quorum node-wrk0"` wouldn't be better as a keyword such as Should Have One Mon or equivalent. Similarly, some of the poll loops could be replaced with a keyword such as Poll For Ceph Health.

@johnramsden
Member Author

@sabaini I completely agree. I tried to architect it so that reused behavior was put into modules. However, the migration was large and some of the resulting tests ended up less readable than I would have hoped.

I can take another pass at it and try to make some of the more complex tests more readable.

Collaborator

@sabaini sabaini left a comment

Hey @johnramsden, some more comments. Yes, factoring out more robot keywords would be nice!

Comment thread tests/robot/resources/streaming_process.py
Comment thread tests/robot/resources/microceph_harness.resource Outdated
Comment thread tests/robot/robot.py Outdated
Comment thread tests/robot/README.md
@johnramsden johnramsden marked this pull request as draft June 2, 2026 21:18
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Replaces raw bash one-liners in test bodies and keyword bodies with
descriptive Robot keywords.

Harness additions (microceph_harness.resource):
- Generate Self Signed CA And Server Cert In Container -- openssl chain
- Read Base64 File From Container -- base64 -w0 file read
- Create Loop Device At -- single mktemp/truncate/losetup/mknod sequence
- Get Synced Image Count On Node -- RBD replication jq image count
- Get Primary Image Count On Node -- RBD is_primary==true jq count
- Get RBD Mirror Pool Health -- rbd mirror pool status health string
- Wait For CephFS Replication List Non Empty -- jq .vol == {} poll
- Wait For CephFS Snaps Synced -- jq snaps_synced poll (jq on outer VM)
- Read File In VM -- cat wrapper
- Get Node IP -- container hostname -I
- Mount NFS In VM -- mount -t nfs wrapper
- Write File In VM -- echo | tee wrapper
- File In VM Should Contain -- cat | grep -F wrapper
- Create Loop Devices refactored to call Create Loop Device At

Suite changes:
- multi_node_tests: Enable RGW SSL On Head Node and
  Test Cross Node Certificate Rotation Inline use harness openssl/base64
- single_system_tests: Add OSD With Failure uses Create Loop Device At
- wal_db_tests: encrypted WAL/DB loop uses Create Loop Device At
- rbd_replication_tests: Wait For Secondary Sync, Failover To Site B,
  Wait For RBD Mirror Health use harness RBD helpers
- cephfs_replication_tests: inline jq/FOR loops replaced with
  Wait For CephFS Replication List Non Empty / Wait For CephFS Snaps Synced;
  cat calls replaced with Read File In VM
- maintenance_mode_tests: Test Quorum Guardrail Blocks Enter 30-line body
  replaced with 5 new local keywords
- nfs_multinode_tests: Test Mount And Write NFS uses Get Node IP /
  Mount NFS In VM / Write File In VM / File In VM Should Contain
- nfs_tests: Test Log Rotation Inline refactored into 5 local keywords;
  mount test uses Write File In VM / File In VM Should Contain

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Robot Framework rejects a keyword that has both an embedded argument
in its name and a separate [Arguments] line. Remove the embedded
${expected_rule_id} from the name; the [Arguments] line is sufficient.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
…on refactors)

Replaces raw bash in test bodies with descriptive Robot keywords

cluster_tests.robot:
- Bombard RGW Configs: remove redundant '|| true' from all 12 config-set
  lines (Run In VM already ignores non-zero exit codes; docstring explains)

sequential_mon_host_refresh_tests.robot:
- Extract Wait For IP In Ceph Conf On Node, Mon Host Line Should Contain IP Once,
  Public Network Should Be Set Once local keywords
- Replace three identical 11-line FOR loops with Wait For IP In Ceph Conf On Node
- Replace inline grep -c / Should Be Equal pairs with the new keywords
- Replace two inline lxc network list | grep | cut calls with Get Public Network CIDR

single-node/basic_tests.robot:
- Test Orchestrator Module: replace Run In VM hostname + Strip String with Get VM Hostname

availability_zone_tests.robot:
- Bootstrap AZ Cluster: replace inline lxc network list | grep | cut with Get Public Network CIDR

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
…process.py

Previously the stdout-reading loop ran on the main thread, so
proc.wait(timeout=) was only ever called after the process had already
exited -- meaning a hanging subprocess would hang forever regardless of
the timeout argument.

Fix: move the stdout reader into a daemon thread so proc.wait(timeout=)
on the main thread genuinely fires when the limit is exceeded.

Also add start_new_session=True so the shell and all its children share
a new process group (PGID == proc.pid).  On timeout, os.killpg() kills
the whole group -- without this, killing only the shell leaves grandchild
processes (e.g. a nested "sleep N") holding the stdout pipe open, which
blocks the reader thread for the full join(timeout=5) window.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
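
The timeout fix described in the commit above can be sketched roughly as follows. This is an illustration of the technique (reader thread, `proc.wait(timeout=)` on the main thread, process-group kill), not the contents of streaming_process.py; the `run_streaming` name and the `sink` parameter are hypothetical, and the sketch assumes a POSIX host.

```python
import os
import signal
import subprocess
import threading


def run_streaming(cmd, timeout, sink=print):
    """Run a shell command, stream its output line by line, enforce a timeout.

    Returns the exit code, or raises TimeoutError after killing the whole
    process group.
    """
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # shell leads a new group: PGID == proc.pid
    )

    def pump():
        # Runs off the main thread so it cannot delay proc.wait(timeout=).
        for line in proc.stdout:
            sink(line.rstrip("\n"))

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # Kill the shell and any grandchildren still holding the pipe.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()
        reader.join(timeout=5)
        raise TimeoutError(f"command exceeded {timeout}s: {cmd}")

    reader.join(timeout=5)  # let the reader drain remaining output
    return rc
```

If the read loop ran on the main thread instead, `proc.wait(timeout=)` would only be reached once the pipe had closed, which is the hang the commit describes.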
Run In Container -- nested quoting:
  Replace 'lxc exec container -- bash -c "${cmd}"' (embedded in a shell
  string on the outer VM) with nested Run Process calls:
    lxc exec OUTER_VM -- lxc exec container -- bash -eo pipefail -c cmd
  ${cmd} is now a literal argv element; no intermediate shell on the
  outer VM means no nested-quoting issues regardless of quote characters
  in the command string.

robot.py -- safe default:
  Remove the stub_test.robot fallback; running without --test-suite now
  defaults to the full test tree (same as --all) so an unqualified
  invocation exercises everything rather than silently doing nothing.

AGENTS.md -- link to robot README:
  Add a Robot Framework section pointing agents to tests/robot/README.md
  for suite layout and harness conventions.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
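
To illustrate the nested-quoting point from the commit above, here is a small Python sketch that contrasts the two shapes. The helper names are hypothetical, and `shlex.split` is used only as a stand-in for how a POSIX shell on the outer VM would tokenise the spliced string.

```python
import shlex


def container_argv(outer_vm, container, cmd):
    """argv form: cmd stays a single literal element all the way to bash -c."""
    return [
        "lxc", "exec", outer_vm, "--",
        "lxc", "exec", container, "--",
        "bash", "-eo", "pipefail", "-c", cmd,
    ]


def naive_shell_string(container, cmd):
    """Older form: cmd is spliced into a double-quoted shell string."""
    return f'lxc exec {container} -- bash -c "{cmd}"'


cmd = 'microceph.ceph -s | grep "mon: 1 daemons"'

# The argv form hands bash exactly the command that was written.
argv = container_argv("outer-vm", "node-wrk0", cmd)

# The spliced form is re-tokenised by the intermediate shell; the inner
# double quotes end the outer quoting early, so the script is cut short.
retokenised = shlex.split(naive_shell_string("node-wrk0", cmd))
```

With the argv form, `argv[-1]` is the original command unchanged, whereas the script argument in `retokenised` is truncated at the inner quote and the rest spills into extra arguments.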
Node Is In Mon List (harness):
  Wraps the repeated 'microceph.ceph -s | grep -q mon: .*daemons.*NODE'
  yes/no check. Replaces identical inline greps in Enable Services On Head
  Node For and Remove Node Head Node (multi_node_tests) and
  Wait For Node Absent From Mons (maintenance_mode_tests).

Wait For N Nodes In Cluster (harness):
  Wraps the identical 3-line FOR loop that polls microceph status until
  N nodes appear. Replaces copies in Join Worker Nodes To Cluster (harness),
  Bootstrap AZ Cluster, and Rejoin Node Wrk3 Into AZ-C (az tests).

OSD Tree Should Contain AZ Rack Bucket (local, az tests):
  Wraps lxc exec node-wrk0 -- ceph osd tree | grep -F "az.ZONE".
  Replaces 6 identical one-liners across 4 test cases.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
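
As a rough illustration of what a check like Node Is In Mon List boils down to, the sketch below applies the same `mon: .*daemons.*NODE` pattern to captured `microceph.ceph -s` output. It is written in Python rather than Robot syntax, the function name is hypothetical, and the sample status text in the usage note is an assumption modelled on the `mon: 1 daemons, quorum node-wrk0` line quoted in the review.

```python
import re


def node_is_in_mon_list(status_output, node):
    """Return True if any line of `ceph -s` output lists `node` among the mons.

    Mirrors `grep -q 'mon: .*daemons.*NODE'`: a per-line substring match,
    so "node-wrk1" would also match a line that only mentions "node-wrk10".
    """
    pattern = re.compile(r"mon: .*daemons.*" + re.escape(node))
    return any(pattern.search(line) for line in status_output.splitlines())
```

For example, given a status line such as `mon: 3 daemons, quorum node-wrk0,node-wrk1,node-wrk2`, the function returns True for `node-wrk1` and False for `node-wrk3`. Returning a boolean rather than failing lets callers use the same check for both "present" and "absent" waits.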
Harness additions:
- Wait For RGW On Head Node: moved from multi_node_tests.robot (was also
  inlined in Enable RGW Head Node with fewer retries); Enable RGW Head Node
  now calls it.
- Get VM IP: outer-VM analogue of Get Node IP (hostname -I | cut).
- Mount CephFS From Container: wraps the 4-step conf-pull + mkdir + mount
  sequence that was duplicated in cephfs-replication-test.

Duplicate/inline fixes:
- multi_node_tests: Test Cross Node Certificate Rotation Inline now calls
  existing Wait For RGW SSL Port / Get RGW SSL CN instead of two raw
  openssl s_client FOR loops + inline sed. First loop was a latent bug
  (no Fail on timeout).
- availability_zone_tests: delete AZ Wait For OSD Count (verbatim
  re-implementation of Wait For OSD Count Head); replace 6 call sites.
- multi-node/basic_tests: replace inscrutable double-grep regex with
  Wait For N Nodes In Cluster.
- cluster_tests / single_system_tests: replace raw hostname + Strip String
  with Get VM Hostname.
- nfs_tests: replace hostname -I | cut + Strip String with Get VM IP.
- messenger_v2_tests: extract Ceph Conf Should Have No V1 Addresses;
  replace useless cat file | grep with grep file (shellcheck finding).

maintenance_mode_tests:
- Replace the two WHILE loops that used manual elapsed counters (the only
  WHILE loops in the entire suite) with FOR IN RANGE loops, using Evaluate
  to compute the iteration count.
- Extract Run Maintenance Enter Exit Cycle (flags, noout state, svc
  state as args); collapse four 50-line near-identical enter/exit keyword
  bodies to one call each (~200 lines -> ~40).

cephfs_replication_tests:
- Use Mount CephFS From Container for both primary and secondary mounts.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
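
The polling changes in the commit above share one shape: a bounded loop with a precomputed iteration count that fails loudly when the condition never becomes true. A minimal Python sketch of that shape follows; `wait_until` and `attempts_for` are hypothetical names, not harness keywords, and the `sleep` parameter exists only so the loop can be exercised without real delays.

```python
import time


def attempts_for(timeout_s, delay_s):
    """Iteration count for a bounded loop: ceil(timeout / delay), at least 1."""
    return max(1, -(-timeout_s // delay_s))


def wait_until(check, attempts, delay, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns the attempt number that succeeded. Raises AssertionError when the
    attempts run out, so a timeout can never pass silently.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise AssertionError(f"condition not met after {attempts} attempts")
```

The explicit raise at the end is the part the "latent bug (no Fail on timeout)" note refers to: without it, a loop that exhausts its range simply falls through and the test carries on.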
Both files now explain:
- static-checks and unit-tests run on the host with no LXD/snap needed
- Integration tests require LXD initialised with outbound internet access
  in VMs (apt-get install s3cmd/jq/ceph-common runs during suite setup)
- How to verify the network requirement before investing in a full run
- How to build the snap and invoke individual suites or the full tree
- Per-suite host resource guide (vCPU/RAM/disk/duration)

Also adds harness convention summary to tests/robot/README.md.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
… check

grep -c prints "0" to stdout AND exits rc=1 when no matches are found.
Using '|| echo 0' caused the fallback to also print "0", yielding "0\n0"
instead of "0" -- making Should Be Equal As Strings fail with "0 != 0".

Use '|| true' instead: grep -c still outputs the count, and || true only
normalises the exit code without adding any extra output.

The original code used 'cat file | grep | wc -l' where wc -l always exits 0
(even on empty input), so the fallback never fired. grep -c behaves
differently and the || echo 0 pattern is wrong for it.

Assisted-by: claude-code:claude-sonnet-4-6
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
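
The `grep -c` behaviour described in the commit above is easy to reproduce. The sketch below drives `sh` and `grep` from Python against a throwaway file; the helper names and the file contents are made up for the demonstration, and it assumes a POSIX `sh` and `grep` on the PATH.

```python
import subprocess
import tempfile


def count_matches(pattern, path, fallback):
    """Run `grep -c PATTERN FILE || FALLBACK` and return raw stdout."""
    script = f'grep -c -- "$1" "$2" || {fallback}'
    result = subprocess.run(
        ["sh", "-c", script, "sh", pattern, path],
        capture_output=True,
        text=True,
    )
    return result.stdout


def demo():
    """Compare the two fallbacks when the pattern matches nothing."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf") as conf:
        conf.write("mon host = 10.0.0.1\n")
        conf.flush()
        with_echo = count_matches("no-such-ip", conf.name, "echo 0")
        with_true = count_matches("no-such-ip", conf.name, "true")
    return with_echo, with_true
```

With `|| echo 0` the captured output contains the count twice (once from grep, once from the fallback); with `|| true` only grep's own count is captured.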
@johnramsden johnramsden marked this pull request as ready for review June 4, 2026 00:55

Development

Successfully merging this pull request may close these issues.

Integration tests cannot be run locally

2 participants