Rebase to develop
Pull request overview
This PR targets GitHub Actions L2 test stability by improving bundle handling and platform-specific behavior (cgroupv2/CI quirks), and by adjusting templates/CI steps to reduce environment-dependent failures.
Changes:
- Harden L2 test runner utilities (bundle extraction validation, DobbyDaemon readiness probing, container launch retries, improved async log reading).
- Make multiple L2 tests more CI-tolerant (bundle config sanitization, cgroupv1/v2 pid-limit path resolution, more robust bundle comparisons).
- Add CI workflow step to regenerate bundles for cgroupv2 compatibility and relax LCOV handling for unused coverage artifacts.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/L2_testing/test_runner/thunder_plugin.py | Sanitizes bundle config and makes test set platform-conditional; improves handling of invalid bundles. |
| tests/L2_testing/test_runner/test_utils.py | Adds bundle extraction validation, nested bundle path detection, DobbyDaemon readiness probing, and more robust container launch behavior. |
| tests/L2_testing/test_runner/start_from_bundle.py | Uses validated bundle context and waits briefly for log flushing before asserting. |
| tests/L2_testing/test_runner/runner.py | Fixes test-results aggregation to only attach per-suite JSON when tests actually ran. |
| tests/L2_testing/test_runner/pid_limit_tests.py | Resolves pids.max across cgroup v1/v2 layouts, including /proc/<pid>/cgroup-derived paths. |
| tests/L2_testing/test_runner/network_tests.py | Switches to shared launch_container and adds invalid-bundle handling. |
| tests/L2_testing/test_runner/memcr_tests.py | Hardens PID parsing and optionally skips PID checkpoint validation when PIDs aren’t reported. |
| tests/L2_testing/test_runner/bundle_generation.py | Replaces filesystem diff with normalized config.json comparison and adds rootfs existence check. |
| tests/L2_testing/test_runner/bundle/regenerate_bundles.py | New CI helper to patch and repack test bundles for cgroupv2 compatibility. |
| tests/L2_testing/test_runner/basic_sanity_tests.py | Reworks async log reading with select() and makes daemon-stop verification less log-dependent. |
| tests/L2_testing/test_runner/annotation_tests.py | Starts container from spec path (vs extracted bundle) and stops container after annotation checks. |
| client/tool/source/Main.cpp | Removes mutex lock from stop callback before completing a promise. |
| bundle/lib/source/templates/OciConfigJsonVM1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| bundle/lib/source/templates/OciConfigJson1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| .github/workflows/L2-tests.yml | Adds bundle-regeneration step and relaxes LCOV errors for unused coverage. |
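The pids.max resolution summarized for pid_limit_tests.py can be sketched roughly as below. This is an illustration only, not the code in the PR: the function names, the `cgroup_root` parameter, and the conventional `/sys/fs/cgroup/pids` mount for cgroup v1 are assumptions.

```python
import os


def is_cgroup_v2(cgroup_root="/sys/fs/cgroup"):
    """The unified hierarchy exposes cgroup.controllers at the mount root."""
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))


def pids_max_path(proc_cgroup_text, cgroup_root="/sys/fs/cgroup"):
    """Map the text of /proc/<pid>/cgroup to a candidate pids.max path."""
    unified = None
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip malformed lines
        hierarchy_id, controllers, path = parts
        rel = path.lstrip("/")
        if "pids" in controllers.split(","):
            # cgroup v1: assume the pids controller is mounted at <root>/pids
            return os.path.join(cgroup_root, "pids", rel, "pids.max")
        if hierarchy_id == "0" and controllers == "":
            # cgroup v2: single unified entry of the form "0::/path"
            unified = os.path.join(cgroup_root, rel, "pids.max")
    return unified
```

A v1 pids entry is preferred over the unified entry so that hybrid layouts still resolve to the controller that actually enforces the limit.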
Comments suppressed due to low confidence (1)
tests/L2_testing/test_runner/network_tests.py:121
- If launch_container fails, the test still proceeds to sleep and read from netcat, and then overwrites the earlier launch-failure message with a generic "Received ... expected ..." message. This makes failures harder to diagnose and can produce misleading output. Consider short-circuiting when launch_result is false (or preserving the original error message/log) before attempting to validate the netcat payload.
launch_result = test_utils.launch_container(container_name, bundle_path)
message = ""
result = True
if not launch_result:
    message = "Container did not launch successfully"
    result = False
# give container time to start and send message before checking netcat listener
sleep(2)
nc_message = nc.get_output().rstrip("\n")
# check if netcat listener received message
if test.expected_output.lower() not in nc_message.lower():
    message = "Received '%s' from container, expected '%s'" % (nc_message.lower(), test.expected_output.lower())
    result = False
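One way to short-circuit on launch failure, as the suppressed comment proposes, is sketched below. The helper name `check_netcat_payload` and its parameters are hypothetical; `read_output` stands in for `nc.get_output`.

```python
from time import sleep


def check_netcat_payload(launch_result, read_output, expected_output, settle_seconds=2):
    """Return (result, message) for a netcat-based network test.

    `read_output` is a hypothetical zero-argument callable standing in for
    nc.get_output(); it is only called when the launch succeeded.
    """
    if not launch_result:
        # Short-circuit: keep the launch-failure message instead of
        # overwriting it with a payload-mismatch message.
        return False, "Container did not launch successfully"

    # give the container time to start and send its message
    sleep(settle_seconds)
    nc_message = read_output().rstrip("\n")

    if expected_output.lower() not in nc_message.lower():
        return False, "Received '%s' from container, expected '%s'" % (
            nc_message.lower(), expected_output.lower())
    return True, ""
```

Returning early keeps the first failure cause visible in the test report.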
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
f1bf099 to 298f0e7
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
def patch_config_for_cgroupv2(config_path):
    """Patch OCI config.json for cgroup v2 compatibility.

    cgroup v1: Uses memory.swappiness to control swap behavior
    cgroup v2: Uses memory.swap.max (no direct swappiness equivalent)
    """
    try:
        with open(config_path, 'r') as f:
            config = json.load(f)

        modified = False

        # Remove swappiness from linux.resources.memory
        if 'linux' in config and 'resources' in config['linux']:
            resources = config['linux']['resources']
            if 'memory' in resources and 'swappiness' in resources['memory']:
                del resources['memory']['swappiness']
                modified = True
                print_log("Stripped 'swappiness' from config.json for cgroup v2 compatibility", Severity.debug)

        if modified:
            with open(config_path, 'w') as f:
                json.dump(config, f, indent=4)
    except Exception as err:
        print_log("Warning: Failed to patch config.json for cgroup v2: %s" % err, Severity.warning)
patch_config_for_cgroupv2() currently strips only linux.resources.memory.swappiness. The PR description mentions also removing other cgroupv2-incompatible settings (e.g., kernel memory / RT scheduling fields); if those are still present in some bundles/spec-generated configs, tests may continue to fail on cgroupv2. Either extend this patching to cover the other unsupported fields you’re seeing in CI, or update the PR description to match the actual behavior.
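If the patch were extended as suggested, it might look like the sketch below. The key lists are an assumption drawn from the categories the PR description names (swappiness, kernel memory, RT scheduling) and use OCI runtime-spec field names; the function name is hypothetical. Unlike the PR's null-only check for the realtime fields, this variant removes them regardless of value.

```python
# Assumed cgroup v1-only fields; adjust to whatever CI actually rejects.
MEMORY_KEYS = ("swappiness", "kernel", "kernelTCP")
CPU_KEYS = ("realtimeRuntime", "realtimePeriod")


def strip_cgroupv1_only_fields(config):
    """Remove the listed resource fields in place; return the removed names."""
    removed = []
    resources = (config.get("linux") or {}).get("resources") or {}
    for section, keys in (("memory", MEMORY_KEYS), ("cpu", CPU_KEYS)):
        block = resources.get(section)
        if not isinstance(block, dict):
            continue  # section absent or null: nothing to strip
        for key in keys:
            if key in block:
                del block[key]
                removed.append("%s.%s" % (section, key))
    return removed
```

A non-empty return value can play the role of the `modified` flag, so the file is only rewritten when something changed.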
 mkdir -p build
 cd build
-cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..
+cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DUSE_LEGACY_CGROUP=OFF -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..
The workflow passes -DUSE_LEGACY_CGROUP=OFF to CMake, but there doesn’t appear to be any CMake option / code reference for USE_LEGACY_CGROUP in the repo. If this flag is intended to control behavior, please add the corresponding CMake option and use it; otherwise consider removing it to avoid confusion (CMake will silently accept unused cache variables).
Suggested change:
-cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DUSE_LEGACY_CGROUP=OFF -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..
+cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..
- name: Generate coverage
  if: ${{ matrix.coverage == 'with-coverage' }}
  run: >
    lcov -c
    -o coverage.info
    -d $GITHUB_WORKSPACE
    &&
    lcov
    --ignore-errors unused
    -r coverage.info
PR description says the CI fix includes setting GCOV_PREFIX (and preserving GCOV env through sudo) to avoid .gcda write failures, but this workflow doesn’t currently set GCOV_PREFIX/GCOV_PREFIX_STRIP anywhere. If coverage runs are still expected to work on GitHub Actions, please add those env exports (and ensure any sudo invocation preserves them), or update the PR description if that fix was removed/out of scope.
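For a test runner that starts the daemon itself, the requested environment handling could be sketched as follows. The helper names are hypothetical, `/tmp/gcov` is the value the PR description mentions, and `--preserve-env=<list>` assumes a sudo version that accepts a variable list (the description's `sudo -E` would preserve the whole environment instead).

```python
import os


def gcov_env(prefix="/tmp/gcov", strip=None, base_env=None):
    """Return a copy of the environment with GCOV output redirected."""
    env = dict(os.environ if base_env is None else base_env)
    env["GCOV_PREFIX"] = prefix  # directory where .gcda files are written
    if strip is not None:
        # number of leading path components to drop from the build path
        env["GCOV_PREFIX_STRIP"] = str(strip)
    return env


def sudo_preserving_gcov(command):
    """Prefix a command so sudo keeps only the GCOV variables."""
    return ["sudo", "--preserve-env=GCOV_PREFIX,GCOV_PREFIX_STRIP"] + list(command)
```

The two pieces would be combined as `subprocess.Popen(sudo_preserving_gcov(cmd), env=gcov_env())`, with `cmd` being whatever the runner already uses to start DobbyDaemon.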
# Extract (with path-traversal protection)
print(f" Extracting...")
with tarfile.open(bundle_tarball, 'r:gz') as tar:
    # Reject members that escape the target directory via absolute paths
    # or '..' components to prevent path-traversal attacks.
    for member in tar.getmembers():
        member_path = (bundle_dir / member.name).resolve()
        if not str(member_path).startswith(str(bundle_dir.resolve())):
            raise RuntimeError(
                f"Tarball member '{member.name}' would escape extraction "
                f"directory '{bundle_dir}' — aborting for safety"
            )
    tar.extractall(path=bundle_dir)
The tar extraction safety check only validates member.name path traversal, but tarballs can still escape via symlinks/hardlinks (e.g., create a symlink inside the bundle pointing outside, then extract a file through it). To make extraction robust, reject link members (issym()/islnk()) and/or use tarfile’s safe extraction filtering where available, and ensure extracted paths are verified after resolving symlinks.
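A stricter extraction along the lines of this comment might look like the sketch below. `safe_extract` is a hypothetical helper, and rejecting every link member is an assumption that may be too strict for bundles whose rootfs legitimately contains symlinks.

```python
import os
import tarfile


def safe_extract(tar, dest):
    """Extract `tar` into `dest`, refusing link members and escaping paths."""
    dest_real = os.path.realpath(dest)
    for member in tar.getmembers():
        if member.issym() or member.islnk():
            raise RuntimeError("Refusing link member '%s'" % member.name)
        target = os.path.realpath(os.path.join(dest_real, member.name))
        # commonpath compares whole path components, unlike str.startswith
        if os.path.commonpath([dest_real, target]) != dest_real:
            raise RuntimeError("Member '%s' escapes '%s'" % (member.name, dest))
    if hasattr(tarfile, "data_filter"):
        # Extraction filters exist only in newer Python releases
        tar.extractall(path=dest, filter="data")
    else:
        tar.extractall(path=dest)
```

Comparing with `os.path.commonpath` also avoids the prefix pitfall of `startswith`, where a sibling directory such as `/tmp/bundle_evil` would pass a check against `/tmp/bundle`.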
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
if cpu.get('realtimeRuntime') is None:
    del cpu['realtimeRuntime']
    print(" - Removed null 'realtimeRuntime'")
if cpu.get('realtimePeriod') is None:
In patch_config_for_cgroupv2(), if cpu.get('realtimeRuntime') is None: del cpu['realtimeRuntime'] (and the same for realtimePeriod) will raise KeyError when the key is missing because .get() returns None for absent keys. This can abort bundle regeneration. Only delete when the key exists and is null, or use pop(key, None) after checking for explicit null.
Suggested change:
-if cpu.get('realtimeRuntime') is None:
-    del cpu['realtimeRuntime']
-    print(" - Removed null 'realtimeRuntime'")
-if cpu.get('realtimePeriod') is None:
+if 'realtimeRuntime' in cpu and cpu['realtimeRuntime'] is None:
+    del cpu['realtimeRuntime']
+    print(" - Removed null 'realtimeRuntime'")
+if 'realtimePeriod' in cpu and cpu['realtimePeriod'] is None:
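The guard in this suggestion can be factored into a small helper so both realtime fields share one code path. `drop_if_null` is a hypothetical name used only for illustration.

```python
def drop_if_null(mapping, key):
    """Delete `key` only when it is present with an explicit null (None) value.

    Returns True when something was removed. A missing key is left alone,
    which avoids the KeyError raised by `del` after a bare `.get()` check.
    """
    if key in mapping and mapping[key] is None:
        del mapping[key]
        return True
    return False


cpu = {"shares": 1024, "realtimeRuntime": None}
# The list is built first so every key is processed before any() inspects it.
modified = any([drop_if_null(cpu, k) for k in ("realtimeRuntime", "realtimePeriod")])
```

After this runs, `cpu` holds only `shares`, and `modified` is true because one null field was removed.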
if cpu.get('realtimeRuntime') is None:
    del cpu['realtimeRuntime']
    modified = True
    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
if cpu.get('realtimePeriod') is None:
Same issue as realtimeRuntime: if cpu.get('realtimePeriod') is None: del cpu['realtimePeriod'] will raise KeyError when the key is absent, aborting the patch and leaving the config unmodified. Guard on key existence + null value, or use pop('realtimePeriod', None) after checking for an explicit null.
Suggested change:
-if cpu.get('realtimeRuntime') is None:
-    del cpu['realtimeRuntime']
-    modified = True
-    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
-if cpu.get('realtimePeriod') is None:
+if 'realtimeRuntime' in cpu and cpu['realtimeRuntime'] is None:
+    del cpu['realtimeRuntime']
+    modified = True
+    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
+if 'realtimePeriod' in cpu and cpu['realtimePeriod'] is None:
@@ -367,6 +511,9 @@ def launch_container(container_id, spec_path):
    # Timeout
    print_log("Waited 5 seconds for exit.. timeout", Severity.error)
    return True
launch_container() logs an error on timeout waiting for the container to exit, but still returns True. This marks the launch as successful even though the container may still be running (and could affect subsequent tests). Return False on timeout (or stop the container explicitly) so callers can fail the test deterministically.
Suggested change:
-    return True
+    return False
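A polling loop with the behavior this comment asks for could be sketched as below. `wait_for_exit`, `is_running`, and `on_timeout` are hypothetical names; the real launch_container() may detect exit differently.

```python
import time


def wait_for_exit(is_running, timeout=5.0, poll_interval=0.1, on_timeout=None):
    """Poll until `is_running()` is false; return False if the timeout expires.

    `on_timeout` is an optional callable (for example, one that stops the
    container) invoked before reporting the failure.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running():
            return True
        time.sleep(poll_interval)
    if on_timeout is not None:
        on_timeout()
    return False
```

Returning False on timeout lets callers fail the test immediately, and the optional callback gives a place to stop a container that is still running.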
if cpu.get('realtimeRuntime') is None:
    del cpu['realtimeRuntime']
    modified = True
    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
if cpu.get('realtimePeriod') is None:
patch_config_for_cgroupv2() uses if cpu.get('realtimeRuntime') is None: del cpu['realtimeRuntime'], but dict.get() returns None when the key is missing, so this will raise KeyError on configs that have no realtimeRuntime field. That exception aborts the whole patch and prevents earlier changes (e.g., swappiness removal) from being written. Update the condition to delete only when the key exists and its value is null (or use pop(..., None)).
Suggested change:
-if cpu.get('realtimeRuntime') is None:
-    del cpu['realtimeRuntime']
-    modified = True
-    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
-if cpu.get('realtimePeriod') is None:
+if 'realtimeRuntime' in cpu and cpu['realtimeRuntime'] is None:
+    del cpu['realtimeRuntime']
+    modified = True
+    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
+if 'realtimePeriod' in cpu and cpu['realtimePeriod'] is None:
def __enter__(self):
    """Returns the bundle path when valid, or None when extraction/validation
    failed. Callers must check .valid (or the returned path) before use."""
    if not self.valid:
        return None
    return self.path
untar_bundle.__enter__ now returns None when extraction/validation fails. There are still callers that use with test_utils.untar_bundle(...) as bundle_path: without checking .valid (e.g., plugin_launcher.py and memcr_tests.py), which will now pass None into path operations / subprocess args and crash with TypeError instead of producing a clean test failure. Consider keeping __enter__ returning the path and requiring callers to check .valid, or raising a dedicated exception, or update all call sites in this PR.
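The "dedicated exception" option from this comment might be sketched as follows. `BundleError` and `ValidatedBundle` are hypothetical stand-ins, not the real untar_bundle class.

```python
class BundleError(Exception):
    """Raised when a bundle could not be extracted or validated."""


class ValidatedBundle:
    """Simplified stand-in for untar_bundle that fails loudly on entry."""

    def __init__(self, path, valid):
        self.path = path
        self.valid = valid

    def __enter__(self):
        if not self.valid:
            # Callers that forget to check .valid get a clear error
            # instead of a TypeError from using None as a path.
            raise BundleError("Bundle at '%s' is invalid" % self.path)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions from the with-body
```

A test can then catch `BundleError` around the `with` statement and report a clean failure.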
# On cgroup v2, generate a patched bundle to avoid swappiness issues
if test_utils.is_cgroup_v2():
    bundle_path = test_utils.generate_bundle_from_spec(container_id)
    if bundle_path:
        spec_path = bundle_path
    else:
        spec_path = test_utils.get_container_spec_path(container_id)
else:
    spec_path = test_utils.get_container_spec_path(container_id)

command = ["DobbyTool",
           "start",
           container_id,
           spec_path]

status = test_utils.run_command_line(command)
if "started '" + container_id + "' container" not in status.stdout:
    return False, "Container did not launch successfully"

result = validate_annotation(container_id, expected_output)

status = test_utils.run_command_line(command)
if "started '" + container_id + "' container" not in status.stdout:
    return False, "Container did not launch successfully"
# Stop the container after the test
test_utils.dobby_tool_command("stop", container_id)
On cgroup v2 this test generates a bundle via generate_bundle_from_spec() but never cleans up the generated output directory. Because generate_bundle_from_spec() writes under the test_runner/bundle tree, repeated runs can leave stale artifacts and may cause bundle generation to fail if the output dir already exists. Use a temporary directory (e.g., under /tmp) or ensure the generated bundle directory is removed after the test.
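The temporary-directory approach could be sketched as below. `temporary_bundle` is a hypothetical helper, and it assumes a generator callable that accepts an output directory, which the existing generate_bundle_from_spec(container_id) may not do in its current form.

```python
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_bundle(generate, container_id):
    """Yield a generated bundle path that is removed when the block exits.

    `generate` is a hypothetical callable taking (container_id, out_dir) and
    returning the bundle path, or None when generation fails.
    """
    with tempfile.TemporaryDirectory(prefix="dobby_bundle_") as out_dir:
        yield generate(container_id, out_dir)
```

Because the directory is removed even when the test body raises, repeated runs start from a clean state.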
Fix L2 test failures on GitHub Actions CI (cgroupv2 compatibility and GCOV coverage issues)
This PR addresses multiple cascading failures in the L2 test suite when running on GitHub Actions runners which use cgroupv2 (unified hierarchy) and have coverage instrumentation enabled.
Root Causes Fixed:
cgroupv2 Compatibility - The codebase assumed cgroupv1 (separate controller mounts). GitHub Actions uses cgroupv2 where all controllers share a single mount point. Updated DobbyEnv.cpp, IonMemoryPlugin.cpp, GpuPlugin.cpp, and DobbyTemplate.cpp to detect and handle cgroupv2. Also added runtime patching in test_utils.py to remove unsupported settings (swappiness, kernel memory, RT scheduling) from bundle configs.
GCOV Coverage Write Failures - Instrumented binaries (DobbyPluginLauncher) couldn't write .gcda files to /home/runner/work, causing hooks to exit with code 1 and breaking all container operations. Fixed by setting GCOV_PREFIX=/tmp/gcov at workflow level and passing GCOV environment to DobbyDaemon via sudo -E.
Network Setup Issues - DNS resolution and NAT not working in containers because eth0 was hardcoded but doesn't exist on GitHub runners, and systemd stub resolver doesn't work well in containers. Fixed by detecting default-route interface dynamically and forcing reliable DNS servers (1.1.1.1, 8.8.8.8).
OCI Hook Config Path - createContainer hooks run in container namespace and couldn't access host paths for config.json. Fixed by mounting config.json to /tmp/dobby_config.json inside container and adding fallback path resolution in DobbyPluginLauncher.
Invalid Interface Index in Networking - Netlink code was attempting to set addresses on interface index 0 (invalid), causing network setup failures. Added validation to reject invalid interface indices before setting addresses.
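The dynamic default-route lookup and the interface-index check described above can be illustrated in Python, although the PR's actual fixes presumably live in the C++ networking code. Both function names are hypothetical, and the parser assumes the usual column layout of /proc/net/route.

```python
RTF_GATEWAY = 0x2  # route flag: destination is reached through a gateway


def default_route_interface(route_table_text):
    """Return the interface of the default route from /proc/net/route text."""
    for line in route_table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        iface, destination = fields[0], fields[1]
        flags = int(fields[3], 16)
        if destination == "00000000" and flags & RTF_GATEWAY:
            return iface
    return None


def is_valid_ifindex(index):
    """Interface indices start at 1; 0 means 'no interface'."""
    return isinstance(index, int) and index > 0
```

On a host, the first function would be fed the contents of `/proc/net/route`; `socket.if_nametoindex()` can then turn the returned name into an index for the second check.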
Test Procedure
- Push PR to trigger GitHub Actions L2 test workflow
- Verify all test groups pass:
  - basic_sanity_tests
  - container_manipulations
  - bundle_generation
  - plugin_launcher
  - command_line_containers
  - annotation_tests
  - start_from_bundle
  - thunder_plugin
  - network_tests
  - pid_limit_tests
  - memcr_tests
Type of Change
Bug fix (non-breaking change which fixes an issue)
Requires Bitbake Recipe changes?
No - This PR contains only code fixes and CI workflow changes. No new dependencies, build options, or installed files are added.