Skip to content

Agent: Add comprehensive logging for nsenter failures debugging #9713

@fregataa

Description

@fregataa

Problem

Container network statistics collection fails intermittently (50% failure rate reported at one site). Need detailed logging to diagnose the root cause of setns() system call failures.

Objective

Add comprehensive logging to identify *why* setns() fails intermittently:

  • Which errno values occur?

  • When does failure happen (FD open, setns, or restore)?

  • What are the system resource states at failure time?

  • Container lifecycle correlation?

Logging Requirements

1. Add Error Logging to setns()

*File*: src/ai/backend/common/netns.py

def setns(fd: int) -> None:

    libc_path = ctypes.util.find_library("c")

    libc = ctypes.CDLL(libc_path, use_errno=True)

    CLONE_NEWNET = 1 << 30

    

    result = libc.setns(fd, CLONE_NEWNET)

    if result == -1:

        errno_val = ctypes.get_errno()

        log.error(

            "setns failed: fd={}, errno={}, error={}",

            fd, errno_val, os.strerror(errno_val)

        )

        raise OSError(errno_val, f"setns failed: {os.strerror(errno_val)}")

2. Add FD Management Logging

*File*: src/ai/backend/common/netns.py

class NetworkNamespaceManager:

    def __init__(self, path: Path) -> None:

        try:

            self.self_ns = os.open("/proc/self/ns/net", os.O_RDONLY)

            log.debug("Opened current ns: fd={}", self.self_ns)

        except OSError as e:

            log.error("Failed to open current ns: {}", e)

            raise

        

        try:

            self.new_ns = os.open(path, os.O_RDONLY)

            log.debug("Opened target ns: path={}, fd={}", path, self.new_ns)

        except OSError as e:

            log.error("Failed to open target ns: path={}, error={}", path, e)

            os.close(self.self_ns)

            raise

3. Add Thread Pool Context Logging

*File*: src/ai/backend/agent/docker/intrinsic.py

async def netstat_ns(ns_path: Path) -> dict[str, Any]:

    loop = asyncio.get_running_loop()

    current_process = multiprocessing.current_process()

    is_daemon = getattr(current_process, 'daemon', False)

    

    log.debug(

        "netstat_ns: ns_path={}, is_daemon={}, thread={}",

        ns_path, is_daemon, threading.current_thread().name

    )

    

    if is_daemon:

        log.warning("Using thread pool fallback for netstat_ns (daemon process)")

        return await loop.run_in_executor(None, netstat_ns_work, ns_path)

1. ...

4. Add Resource Usage Logging

Log system resource state when failures occur:

def log_resource_state():

    import resource

    nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    proc_fd_count = len(os.listdir('/proc/self/fd'))

    

    log.info(

        "Resource state: open_fds={}, limit_soft={}, limit_hard={}",

        proc_fd_count, nofile_soft, nofile_hard

    )

5. Add Container Lifecycle Correlation

*File*: src/ai/backend/agent/docker/intrinsic.py

async def sysfs_impl(container_id: str):

1. ...

    container = DockerContainer(self._docker, id=container_id)

    data = await container.show()

    sandbox_key = data["NetworkSettings"]["SandboxKey"]

    

1. Log before attempting namespace operation

    log.debug(

        "Collecting netstat: container={}, sandbox_key={}, state={}",

        container_id[:7], sandbox_key, data["State"]["Status"]

    )

    

    try:

        nstat = await netstat_ns(sandbox_key)

    except Exception as e:

        log.error(

            "netstat_ns failed: container={}, sandbox_key={}, error={}",

            container_id[:7], sandbox_key, e

        )

        raise

Success Criteria

After implementation, logs should answer:

  1. *What errno* causes setns() failures? (EBADF, EINVAL, ENOMEM, EPERM, etc.)

  2. *When* does it fail? (opening FD, switching ns, or restoring ns)

  3. *Resource state* at failure time (FD count vs limit)

  4. *Container state* when failure occurs (running, stopping, stopped)

  5. *Concurrency pattern* (how many concurrent netstat_ns calls, thread pool usage)

Implementation Notes

  • Use DEBUG level for normal operations

  • Use WARNING for fallback paths

  • Use ERROR for failures with full context

  • Include thread/process identifiers for correlation

Related

  • Parent Epic: BA-4880 (Sprint 26.3 requests)

JIRA Issue: BA-4909

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions