
RDKEMW-13253- Dobby l2 fix #433

Open
Sonajeya31 wants to merge 11 commits into rdkcentral:develop from Sonajeya31:topic/RDKEMW-13253-5


@Sonajeya31

Fix L2 test failures on GitHub Actions CI (cgroupv2 compatibility and GCOV coverage issues)

This PR addresses multiple cascading failures in the L2 test suite when running on GitHub Actions runners which use cgroupv2 (unified hierarchy) and have coverage instrumentation enabled.
Root Causes Fixed:

cgroupv2 Compatibility - The codebase assumed cgroupv1 (separate controller mounts). GitHub Actions uses cgroupv2 where all controllers share a single mount point. Updated DobbyEnv.cpp, IonMemoryPlugin.cpp, GpuPlugin.cpp, and DobbyTemplate.cpp to detect and handle cgroupv2. Also added runtime patching in test_utils.py to remove unsupported settings (swappiness, kernel memory, RT scheduling) from bundle configs.

GCOV Coverage Write Failures - Instrumented binaries (DobbyPluginLauncher) couldn't write .gcda files to /home/runner/work, causing hooks to exit with code 1 and breaking all container operations. Fixed by setting GCOV_PREFIX=/tmp/gcov at workflow level and passing GCOV environment to DobbyDaemon via sudo -E.

Network Setup Issues - DNS resolution and NAT not working in containers because eth0 was hardcoded but doesn't exist on GitHub runners, and systemd stub resolver doesn't work well in containers. Fixed by detecting default-route interface dynamically and forcing reliable DNS servers (1.1.1.1, 8.8.8.8).

OCI Hook Config Path - createContainer hooks run in container namespace and couldn't access host paths for config.json. Fixed by mounting config.json to /tmp/dobby_config.json inside container and adding fallback path resolution in DobbyPluginLauncher.

Invalid Interface Index in Networking - Netlink code was attempting to set addresses on interface index 0 (invalid), causing network setup failures. Added validation to reject invalid interface indices before setting addresses.
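
The cgroup v2 detection that the first root cause depends on can be sketched in Python as follows. This is a minimal sketch and not the code from DobbyEnv.cpp or test_utils.py: it assumes the conventional /sys/fs/cgroup mount point, treats the presence of cgroup.controllers at the hierarchy root as the v2 signal, and takes a hypothetical cgroup_root parameter so the check can be run against a scratch directory.

```python
import os


def is_cgroup_v2(cgroup_root="/sys/fs/cgroup"):
    """Return True when cgroup_root looks like a cgroup v2 unified hierarchy."""
    # The unified hierarchy exposes cgroup.controllers at its root.
    # A cgroup v1 layout has per-controller subdirectories there instead.
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))
```

On a hybrid layout, where the v2 hierarchy is mounted in a subdirectory, this check reports False for the top-level path, so callers would fall back to the v1 code path.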

Test Procedure

  1. Push PR to trigger GitHub Actions L2 test workflow
  2. Verify all test groups pass:
       • basic_sanity_tests
       • container_manipulations
       • bundle_generation
       • plugin_launcher
       • command_line_containers
       • annotation_tests
       • start_from_bundle
       • thunder_plugin
       • network_tests
       • pid_limit_tests
       • memcr_tests

Type of Change
Bug fix (non-breaking change which fixes an issue)
Requires Bitbake Recipe changes?
No - This PR contains only code fixes and CI workflow changes. No new dependencies, build options, or installed files are added.

Copilot AI review requested due to automatic review settings April 15, 2026 07:37

Copilot AI left a comment

Pull request overview

This PR targets GitHub Actions L2 test stability by improving bundle handling and platform-specific behavior (cgroupv2/CI quirks), and by adjusting templates/CI steps to reduce environment-dependent failures.

Changes:

  • Harden L2 test runner utilities (bundle extraction validation, DobbyDaemon readiness probing, container launch retries, improved async log reading).
  • Make multiple L2 tests more CI-tolerant (bundle config sanitization, cgroupv1/v2 pid-limit path resolution, more robust bundle comparisons).
  • Add CI workflow step to regenerate bundles for cgroupv2 compatibility and relax LCOV handling for unused coverage artifacts.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/L2_testing/test_runner/thunder_plugin.py | Sanitizes bundle config and makes test set platform-conditional; improves handling of invalid bundles. |
| tests/L2_testing/test_runner/test_utils.py | Adds bundle extraction validation, nested bundle path detection, DobbyDaemon readiness probing, and more robust container launch behavior. |
| tests/L2_testing/test_runner/start_from_bundle.py | Uses validated bundle context and waits briefly for log flushing before asserting. |
| tests/L2_testing/test_runner/runner.py | Fixes test-results aggregation to only attach per-suite JSON when tests actually ran. |
| tests/L2_testing/test_runner/pid_limit_tests.py | Resolves pids.max across cgroup v1/v2 layouts, including /proc/<pid>/cgroup-derived paths. |
| tests/L2_testing/test_runner/network_tests.py | Switches to shared launch_container and adds invalid-bundle handling. |
| tests/L2_testing/test_runner/memcr_tests.py | Hardens PID parsing and optionally skips PID checkpoint validation when PIDs aren’t reported. |
| tests/L2_testing/test_runner/bundle_generation.py | Replaces filesystem diff with normalized config.json comparison and adds rootfs existence check. |
| tests/L2_testing/test_runner/bundle/regenerate_bundles.py | New CI helper to patch and repack test bundles for cgroupv2 compatibility. |
| tests/L2_testing/test_runner/basic_sanity_tests.py | Reworks async log reading with select() and makes daemon-stop verification less log-dependent. |
| tests/L2_testing/test_runner/annotation_tests.py | Starts container from spec path (vs extracted bundle) and stops container after annotation checks. |
| client/tool/source/Main.cpp | Removes mutex lock from stop callback before completing a promise. |
| bundle/lib/source/templates/OciConfigJsonVM1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| bundle/lib/source/templates/OciConfigJson1.0.2-dobby.template | Removes memory swappiness field from generated OCI config templates. |
| .github/workflows/L2-tests.yml | Adds bundle-regeneration step and relaxes LCOV errors for unused coverage. |
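
The pids.max resolution summarized for pid_limit_tests.py can be illustrated with a small parser for the contents of /proc/<pid>/cgroup. This is a sketch under assumptions rather than the test's actual code: the function name pids_max_path is hypothetical, and the v1 branch assumes the pids controller is mounted at <cgroup_root>/pids.

```python
import posixpath


def pids_max_path(proc_cgroup_text, cgroup_root="/sys/fs/cgroup"):
    """Derive the pids.max path from the text of /proc/<pid>/cgroup."""
    v2_path = None
    for line in proc_cgroup_text.splitlines():
        # Each line has the form hierarchy-ID:controller-list:cgroup-path.
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        if "pids" in controllers.split(","):
            # cgroup v1: the pids controller has its own mount (assumed location).
            return posixpath.join(cgroup_root, "pids", path.lstrip("/"), "pids.max")
        if hierarchy_id == "0" and controllers == "":
            # cgroup v2: a single "0::<path>" entry for the unified hierarchy.
            v2_path = path
    if v2_path is not None:
        return posixpath.join(cgroup_root, v2_path.lstrip("/"), "pids.max")
    return None
```

A v1 entry takes precedence when both kinds of line are present, which is one possible policy for hybrid hosts; a real test might prefer to check which of the candidate files exists.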
Comments suppressed due to low confidence (1)

tests/L2_testing/test_runner/network_tests.py:121

  • If launch_container fails, the test still proceeds to sleep and read from netcat, and then overwrites the earlier launch-failure message with a generic "Received ... expected ..." message. This makes failures harder to diagnose and can produce misleading output. Consider short-circuiting when launch_result is false (or preserving the original error message/log) before attempting to validate the netcat payload.
        launch_result = test_utils.launch_container(container_name, bundle_path)

        message = ""
        result = True

        if not launch_result:
            message = "Container did not launch successfully"
            result = False

        # give container time to start and send message before checking netcat listener
        sleep(2)

        nc_message = nc.get_output().rstrip("\n")

        # check if netcat listener received message
        if test.expected_output.lower() not in nc_message.lower():
            message = "Received '%s' from container, expected '%s'" % (nc_message.lower(), test.expected_output.lower())
            result = False
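
The short-circuit suggested in this comment could look roughly like the following. It is a sketch only: evaluate_netcat_result is a hypothetical helper, not a function in network_tests.py, and it takes the launch result and netcat output as plain arguments so the decision logic can be shown in isolation.

```python
def evaluate_netcat_result(launch_result, nc_message, expected_output):
    """Return (result, message), preserving the launch failure if there was one."""
    if not launch_result:
        # Stop here so the launch error is not overwritten by a payload mismatch.
        return False, "Container did not launch successfully"

    if expected_output.lower() not in nc_message.lower():
        return False, "Received '%s' from container, expected '%s'" % (
            nc_message.lower(), expected_output.lower())

    return True, ""
```

In the test itself, the early return would also allow skipping the sleep and the netcat read when the container never started.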


Comment thread tests/L2_testing/test_runner/thunder_plugin.py
Comment thread tests/L2_testing/test_runner/test_utils.py
Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Copilot AI review requested due to automatic review settings April 16, 2026 06:01

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.



Comment thread tests/L2_testing/test_runner/test_utils.py Outdated
Comment thread client/tool/source/Main.cpp
Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Comment thread tests/L2_testing/test_runner/thunder_plugin.py
Comment thread tests/L2_testing/test_runner/start_from_bundle.py
Comment thread tests/L2_testing/test_runner/pid_limit_tests.py
Comment thread tests/L2_testing/test_runner/bundle_generation.py
Comment thread client/tool/source/Main.cpp
Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Comment thread tests/L2_testing/test_runner/network_tests.py
Copilot AI review requested due to automatic review settings April 16, 2026 08:32
@Sonajeya31 Sonajeya31 force-pushed the topic/RDKEMW-13253-5 branch from f1bf099 to 298f0e7 Compare April 16, 2026 08:32

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.



Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Copilot AI review requested due to automatic review settings April 20, 2026 11:23

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.



Comment thread tests/L2_testing/test_runner/thunder_plugin.py
Comment thread tests/L2_testing/test_runner/bundle/regenerate_bundles.py
Copilot AI review requested due to automatic review settings April 21, 2026 11:57

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.



Comment on lines +651 to +675
def patch_config_for_cgroupv2(config_path):
    """Patch OCI config.json for cgroup v2 compatibility.

    cgroup v1: Uses memory.swappiness to control swap behavior
    cgroup v2: Uses memory.swap.max (no direct swappiness equivalent)
    """
    try:
        with open(config_path, 'r') as f:
            config = json.load(f)

        modified = False

        # Remove swappiness from linux.resources.memory
        if 'linux' in config and 'resources' in config['linux']:
            resources = config['linux']['resources']
            if 'memory' in resources and 'swappiness' in resources['memory']:
                del resources['memory']['swappiness']
                modified = True
                print_log("Stripped 'swappiness' from config.json for cgroup v2 compatibility", Severity.debug)

        if modified:
            with open(config_path, 'w') as f:
                json.dump(config, f, indent=4)
    except Exception as err:
        print_log("Warning: Failed to patch config.json for cgroup v2: %s" % err, Severity.warning)

Copilot AI Apr 21, 2026

patch_config_for_cgroupv2() currently strips only linux.resources.memory.swappiness. The PR description mentions also removing other cgroupv2-incompatible settings (e.g., kernel memory / RT scheduling fields); if those are still present in some bundles/spec-generated configs, tests may continue to fail on cgroupv2. Either extend this patching to cover the other unsupported fields you’re seeing in CI, or update the PR description to match the actual behavior.
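
One way to extend the patching to the other settings named in the PR description is sketched below. The field list is an assumption: it maps "swappiness, kernel memory, RT scheduling" onto the OCI runtime-spec keys swappiness, kernel, kernelTCP, realtimeRuntime and realtimePeriod, and should be checked against what actually fails in CI. The helper strip_cgroupv1_fields is hypothetical and works on an already loaded config dictionary rather than on a file.

```python
import copy

# Assumed set of cgroup v1-only fields under linux.resources (verify against CI failures).
V1_ONLY_FIELDS = {
    "memory": ("swappiness", "kernel", "kernelTCP"),
    "cpu": ("realtimeRuntime", "realtimePeriod"),
}


def strip_cgroupv1_fields(config):
    """Return (patched_copy, removed_keys) without mutating the input config."""
    patched = copy.deepcopy(config)
    resources = (patched.get("linux") or {}).get("resources") or {}
    removed = []
    for section, keys in V1_ONLY_FIELDS.items():
        block = resources.get(section)
        if not isinstance(block, dict):
            continue
        for key in keys:
            # Membership test avoids KeyError when a key is simply absent.
            if key in block:
                del block[key]
                removed.append("%s.%s" % (section, key))
    return patched, removed
```

Returning the list of removed keys lets the caller decide whether the file needs rewriting and log exactly what changed.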

mkdir -p build
cd build
- cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..
+ cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DUSE_LEGACY_CGROUP=OFF -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..

Copilot AI Apr 21, 2026

The workflow passes -DUSE_LEGACY_CGROUP=OFF to CMake, but there doesn’t appear to be any CMake option / code reference for USE_LEGACY_CGROUP in the repo. If this flag is intended to control behavior, please add the corresponding CMake option and use it; otherwise consider removing it to avoid confusion (CMake will silently accept unused cache variables).

Suggested change
- cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DUSE_LEGACY_CGROUP=OFF -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..
+ cmake -DCMAKE_TOOLCHAIN_FILE="${{ env.TOOLCHAIN_FILE }}" -DRDK_PLATFORM=DEV_VM -DCMAKE_INSTALL_PREFIX:PATH=/usr ${{ matrix.extra_flags }} -DRDK=ON -DUSE_SYSTEMD=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DAI_BUILD_TYPE=AI_DEBUG -DDOBBY_HIBERNATE_MEMCR_IMPL=ON -DDOBBY_HIBERNATE_MEMCR_PARAMS_ENABLED=ON ..

Comment on lines 234 to 243
- name: Generate coverage
  if: ${{ matrix.coverage == 'with-coverage' }}
  run: >
    lcov -c
    -o coverage.info
    -d $GITHUB_WORKSPACE
    &&
    lcov
    --ignore-errors unused
    -r coverage.info

Copilot AI Apr 21, 2026

PR description says the CI fix includes setting GCOV_PREFIX (and preserving GCOV env through sudo) to avoid .gcda write failures, but this workflow doesn’t currently set GCOV_PREFIX/GCOV_PREFIX_STRIP anywhere. If coverage runs are still expected to work on GitHub Actions, please add those env exports (and ensure any sudo invocation preserves them), or update the PR description if that fix was removed/out of scope.
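
How GCOV_PREFIX and GCOV_PREFIX_STRIP relocate .gcda output can be illustrated with a small path calculation. This is a sketch of the documented gcov behaviour, not workflow code: gcov_output_path is a hypothetical helper, /tmp/gcov is the prefix named in the PR description, and the project path under /home/runner/work is made up for illustration.

```python
import posixpath


def gcov_output_path(compiled_path, prefix, strip=0):
    """Compute where an instrumented binary would write a .gcda file.

    compiled_path is the absolute path baked in at build time, prefix plays
    the role of GCOV_PREFIX and strip the role of GCOV_PREFIX_STRIP.
    """
    components = [part for part in compiled_path.split("/") if part]
    # Drop the requested number of leading directories, then re-root the rest.
    kept = components[strip:]
    return posixpath.join(prefix, *kept)


# Hypothetical build path on a runner, redirected to a writable location.
relocated = gcov_output_path("/home/runner/work/Dobby/build/Foo.gcda", "/tmp/gcov")
print(relocated)
```

Because the variables are read by the instrumented process at run time, they only help if they reach that process, which is why the PR description pairs the prefix with preserving the environment across sudo.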

Comment on lines +111 to +124
# Extract (with path-traversal protection)
print(f" Extracting...")
with tarfile.open(bundle_tarball, 'r:gz') as tar:
    # Reject members that escape the target directory via absolute paths
    # or '..' components to prevent path-traversal attacks.
    for member in tar.getmembers():
        member_path = (bundle_dir / member.name).resolve()
        if not str(member_path).startswith(str(bundle_dir.resolve())):
            raise RuntimeError(
                f"Tarball member '{member.name}' would escape extraction "
                f"directory '{bundle_dir}' — aborting for safety"
            )
    tar.extractall(path=bundle_dir)


Copilot AI Apr 21, 2026

The tar extraction safety check only validates member.name path traversal, but tarballs can still escape via symlinks/hardlinks (e.g., create a symlink inside the bundle pointing outside, then extract a file through it). To make extraction robust, reject/link members (issym()/islnk()) and/or use tarfile’s safe extraction filtering where available, and ensure extracted paths are verified after resolving symlinks.
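
A stricter extraction along the lines of this comment might look like the sketch below. It is illustrative only: safe_extract is a hypothetical function, and rejecting every link member is the simplest policy rather than necessarily the right one for these bundles. It also compares paths with os.path.commonpath instead of a string prefix test, and uses tarfile's "data" extraction filter when the running Python provides it.

```python
import os
import tarfile
from pathlib import Path


def safe_extract(tarball, dest):
    """Extract tarball into dest, refusing links and paths that leave dest."""
    dest = Path(dest).resolve()
    with tarfile.open(tarball, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            # Symlinks and hardlinks can redirect later writes outside dest.
            if member.issym() or member.islnk():
                raise RuntimeError("link member %r rejected" % member.name)
            # Only plain files and directories are accepted in this sketch.
            if not (member.isfile() or member.isdir()):
                raise RuntimeError("special member %r rejected" % member.name)
            target = (dest / member.name).resolve()
            if os.path.commonpath([str(dest), str(target)]) != str(dest):
                raise RuntimeError("member %r escapes %s" % (member.name, dest))
        if hasattr(tarfile, "data_filter"):
            # Extraction filters exist on newer Python versions; use the strict one.
            tar.extractall(path=dest, members=members, filter="data")
        else:
            tar.extractall(path=dest, members=members)
```

Container rootfs archives commonly contain symlinks, so a practical variant would probably validate each link target against the destination instead of rejecting links outright.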

Copilot AI review requested due to automatic review settings April 22, 2026 06:58

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.



Comment on lines +39 to +42
if cpu.get('realtimeRuntime') is None:
    del cpu['realtimeRuntime']
    print(" - Removed null 'realtimeRuntime'")
if cpu.get('realtimePeriod') is None:

Copilot AI Apr 22, 2026

In patch_config_for_cgroupv2(), if cpu.get('realtimeRuntime') is None: del cpu['realtimeRuntime'] (and the same for realtimePeriod) will raise KeyError when the key is missing because .get() returns None for absent keys. This can abort bundle regeneration. Only delete when the key exists and is null, or use pop(key, None) after checking for explicit null.

Suggested change
- if cpu.get('realtimeRuntime') is None:
-     del cpu['realtimeRuntime']
-     print(" - Removed null 'realtimeRuntime'")
- if cpu.get('realtimePeriod') is None:
+ if 'realtimeRuntime' in cpu and cpu['realtimeRuntime'] is None:
+     del cpu['realtimeRuntime']
+     print(" - Removed null 'realtimeRuntime'")
+ if 'realtimePeriod' in cpu and cpu['realtimePeriod'] is None:
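
The difference between "key absent" and "key present with a null value" can be shown in isolation. The helper below is a hypothetical illustration, not code from regenerate_bundles.py or test_utils.py; it applies the guarded check from the suggestion to any mapping.

```python
def drop_null_key(mapping, key):
    """Delete key only if it is present and explicitly None; report whether it was removed."""
    # mapping.get(key) alone cannot tell an absent key from a null value,
    # so test membership first to avoid KeyError on del.
    if key in mapping and mapping[key] is None:
        del mapping[key]
        return True
    return False


cpu_without_rt = {"shares": 1024}
cpu_with_null_rt = {"shares": 1024, "realtimeRuntime": None}
cpu_with_real_rt = {"shares": 1024, "realtimeRuntime": 950000}

for cpu in (cpu_without_rt, cpu_with_null_rt, cpu_with_real_rt):
    drop_null_key(cpu, "realtimeRuntime")
```

After the loop only the explicitly null entry has been removed; the absent key causes no exception and the configured value is kept.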

Comment on lines +691 to +695
if cpu.get('realtimeRuntime') is None:
    del cpu['realtimeRuntime']
    modified = True
    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
if cpu.get('realtimePeriod') is None:

Copilot AI Apr 22, 2026

Same issue as realtimeRuntime: if cpu.get('realtimePeriod') is None: del cpu['realtimePeriod'] will raise KeyError when the key is absent, aborting the patch and leaving the config unmodified. Guard on key existence + null value, or use pop('realtimePeriod', None) after checking for an explicit null.

Suggested change
- if cpu.get('realtimeRuntime') is None:
-     del cpu['realtimeRuntime']
-     modified = True
-     print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
- if cpu.get('realtimePeriod') is None:
+ if 'realtimeRuntime' in cpu and cpu['realtimeRuntime'] is None:
+     del cpu['realtimeRuntime']
+     modified = True
+     print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
+ if 'realtimePeriod' in cpu and cpu['realtimePeriod'] is None:

@@ -367,6 +511,9 @@ def launch_container(container_id, spec_path):
    # Timeout
    print_log("Waited 5 seconds for exit.. timeout", Severity.error)
    return True

Copilot AI Apr 22, 2026

launch_container() logs an error on timeout waiting for the container to exit, but still returns True. This marks the launch as successful even though the container may still be running (and could affect subsequent tests). Return False on timeout (or stop the container explicitly) so callers can fail the test deterministically.

Suggested change
-     return True
+     return False
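
The timeout behaviour being asked for can be sketched as a polling helper that reports failure when the deadline passes. This is an assumption-laden illustration: wait_for_exit and its is_running callback are hypothetical and stand in for whatever check launch_container() uses to see whether the container is still alive.

```python
import time


def wait_for_exit(is_running, timeout=5.0, interval=0.1):
    """Poll is_running() until it returns False; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running():
            return True
        time.sleep(interval)
    # Deadline passed with the container still running: report failure
    # so the caller can fail the test or stop the container explicitly.
    return False
```

A caller could then stop the container when this returns False, so that a stuck container does not leak into later tests.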

Comment on lines +691 to +695
if cpu.get('realtimeRuntime') is None:
    del cpu['realtimeRuntime']
    modified = True
    print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
if cpu.get('realtimePeriod') is None:

Copilot AI Apr 22, 2026

patch_config_for_cgroupv2() uses if cpu.get('realtimeRuntime') is None: del cpu['realtimeRuntime'], but dict.get() returns None when the key is missing, so this will raise KeyError on configs that have no realtimeRuntime field. That exception aborts the whole patch and prevents earlier changes (e.g., swappiness removal) from being written. Update the condition to delete only when the key exists and its value is null (or use pop(..., None)).

Suggested change
- if cpu.get('realtimeRuntime') is None:
-     del cpu['realtimeRuntime']
-     modified = True
-     print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
- if cpu.get('realtimePeriod') is None:
+ if 'realtimeRuntime' in cpu and cpu['realtimeRuntime'] is None:
+     del cpu['realtimeRuntime']
+     modified = True
+     print_log("Removed null 'realtimeRuntime' for cgroup v2 compatibility", Severity.debug)
+ if 'realtimePeriod' in cpu and cpu['realtimePeriod'] is None:

Comment on lines 92 to 97
def __enter__(self):
    """Returns the bundle path when valid, or None when extraction/validation
    failed. Callers must check .valid (or the returned path) before use."""
    if not self.valid:
        return None
    return self.path

Copilot AI Apr 22, 2026

untar_bundle.__enter__ now returns None when extraction/validation fails. There are still callers that use with test_utils.untar_bundle(...) as bundle_path: without checking .valid (e.g., plugin_launcher.py and memcr_tests.py), which will now pass None into path operations / subprocess args and crash with TypeError instead of producing a clean test failure. Consider keeping __enter__ returning the path and requiring callers to check .valid, or raising a dedicated exception, or updating all call sites in this PR.
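
The "dedicated exception" option mentioned here could be sketched as follows. Both BundleExtractionError and ValidatedBundle are hypothetical names introduced for illustration; ValidatedBundle is a stand-in for test_utils.untar_bundle and takes the validation result as a constructor argument instead of performing any extraction.

```python
class BundleExtractionError(Exception):
    """Raised when a test bundle could not be extracted or validated."""


class ValidatedBundle:
    """Hypothetical stand-in for test_utils.untar_bundle."""

    def __init__(self, path, valid):
        self.path = path
        self.valid = valid

    def __enter__(self):
        if not self.valid:
            # Fail loudly with a specific type instead of handing None to callers.
            raise BundleExtractionError(
                "bundle at %r failed extraction/validation" % self.path)
        return self.path

    def __exit__(self, exc_type, exc_value, traceback):
        # Cleanup of the extracted directory would go here in a real helper.
        return False
```

Call sites that have not been updated would then fail with a clear error type that a test runner can catch and report, rather than with a TypeError deep inside a path operation.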

Comment on lines +56 to 79
# On cgroup v2, generate a patched bundle to avoid swappiness issues
if test_utils.is_cgroup_v2():
    bundle_path = test_utils.generate_bundle_from_spec(container_id)
    if bundle_path:
        spec_path = bundle_path
    else:
        spec_path = test_utils.get_container_spec_path(container_id)
else:
    spec_path = test_utils.get_container_spec_path(container_id)

command = ["DobbyTool",
           "start",
           container_id,
           spec_path]

status = test_utils.run_command_line(command)
if "started '" + container_id + "' container" not in status.stdout:
    return False, "Container did not launch successfully"

result = validate_annotation(container_id, expected_output)

status = test_utils.run_command_line(command)
if "started '" + container_id + "' container" not in status.stdout:
    return False, "Container did not launch successfully"
# Stop the container after the test
test_utils.dobby_tool_command("stop", container_id)


Copilot AI Apr 22, 2026

On cgroup v2 this test generates a bundle via generate_bundle_from_spec() but never cleans up the generated output directory. Because generate_bundle_from_spec() writes under the test_runner/bundle tree, repeated runs can leave stale artifacts and may cause bundle generation to fail if the output dir already exists. Use a temporary directory (e.g., under /tmp) or ensure the generated bundle directory is removed after the test.
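
The temporary-directory approach suggested here might be sketched like this. The names are assumptions: temporary_bundle is a hypothetical context manager, and its generate argument is a caller-supplied callable taking (container_id, output_dir) and returning a truthy value on success, which differs from the single-argument generate_bundle_from_spec(container_id) shown in the diff.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_bundle(generate, container_id):
    """Generate a bundle in a scratch directory and remove it on exit.

    Yields the bundle path, or None when generation reports failure.
    """
    with tempfile.TemporaryDirectory(prefix="dobby_bundle_") as scratch:
        bundle_path = os.path.join(scratch, container_id)
        ok = generate(container_id, bundle_path)
        # The scratch directory is deleted when the with-block ends,
        # even if the test body raises.
        yield bundle_path if ok else None
```

Because each run gets a fresh directory, a stale output directory from a previous run cannot make bundle generation fail.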
