Agent: Add comprehensive logging for nsenter failures debugging

## Problem



Container network statistics collection fails intermittently (50% failure rate reported at one site). Need detailed logging to diagnose the root cause of `setns()` system call failures.



## Objective



Add comprehensive logging to identify \***why**\* `setns()` fails intermittently:

- Which errno values occur?

- When does failure happen (FD open, setns, or restore)?

- What are the system resource states at failure time?

- Container lifecycle correlation?



## Logging Requirements



### 1. Add Error Logging to `setns()`



\***File**\*: `src/ai/backend/common/netns.py`



```python

def setns(fd: int) -> None:

    libc_path = ctypes.util.find_library("c")

    libc = ctypes.CDLL(libc_path, use_errno=True)

    CLONE_NEWNET = 1 << 30

    

    result = libc.setns(fd, CLONE_NEWNET)

    if result == -1:

        errno_val = ctypes.get_errno()

        log.error(

            "setns failed: fd={}, errno={}, error={}",

            fd, errno_val, os.strerror(errno_val)

        )

        raise OSError(errno_val, f"setns failed: {os.strerror(errno_val)}")

```



### 2. Add FD Management Logging



\***File**\*: `src/ai/backend/common/netns.py`



```python

class NetworkNamespaceManager:

    def __init__(self, path: Path) -> None:

        try:

            self.self_ns = os.open("/proc/self/ns/net", os.O_RDONLY)

            log.debug("Opened current ns: fd={}", self.self_ns)

        except OSError as e:

            log.error("Failed to open current ns: {}", e)

            raise

        

        try:

            self.new_ns = os.open(path, os.O_RDONLY)

            log.debug("Opened target ns: path={}, fd={}", path, self.new_ns)

        except OSError as e:

            log.error("Failed to open target ns: path={}, error={}", path, e)

            os.close(self.self_ns)

            raise

```



### 3. Add Thread Pool Context Logging



\***File**\*: `src/ai/backend/agent/docker/intrinsic.py`



```python

async def netstat_ns(ns_path: Path) -> dict[str, Any]:

    loop = asyncio.get_running_loop()

    current_process = multiprocessing.current_process()

    is_daemon = getattr(current_process, 'daemon', False)

    

    log.debug(

        "netstat_ns: ns_path={}, is_daemon={}, thread={}",

        ns_path, is_daemon, threading.current_thread().name

    )

    

    if is_daemon:

        log.warning("Using thread pool fallback for netstat_ns (daemon process)")

        return await loop.run_in_executor(None, netstat_ns_work, ns_path)

1. ...

```



### 4. Add Resource Usage Logging



Log system resource state when failures occur:



```python

def log_resource_state():

    import resource

    nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    proc_fd_count = len(os.listdir('/proc/self/fd'))

    

    log.info(

        "Resource state: open_fds={}, limit_soft={}, limit_hard={}",

        proc_fd_count, nofile_soft, nofile_hard

    )

```



### 5. Add Container Lifecycle Correlation



\***File**\*: `src/ai/backend/agent/docker/intrinsic.py`



```python

async def sysfs_impl(container_id: str):

1. ...

    container = DockerContainer(self._docker, id=container_id)

    data = await container.show()

    sandbox_key = data["NetworkSettings"]["SandboxKey"]

    

1. Log before attempting namespace operation

    log.debug(

        "Collecting netstat: container={}, sandbox_key={}, state={}",

        container_id[:7], sandbox_key, data["State"]["Status"]

    )

    

    try:

        nstat = await netstat_ns(sandbox_key)

    except Exception as e:

        log.error(

            "netstat_ns failed: container={}, sandbox_key={}, error={}",

            container_id[:7], sandbox_key, e

        )

        raise

```



## Success Criteria



After implementation, logs should answer:

1. \***What errno**\* causes setns() failures? (EBADF, EINVAL, ENOMEM, EPERM, etc.)

2. \***When**\* does it fail? (opening FD, switching ns, or restoring ns)

3. \***Resource state**\* at failure time (FD count vs limit)

4. \***Container state**\* when failure occurs (running, stopping, stopped)

5. \***Concurrency pattern**\* (how many concurrent netstat_ns calls, thread pool usage)



## Implementation Notes



- Use DEBUG level for normal operations

- Use WARNING for fallback paths

- Use ERROR for failures with full context

- Include thread/process identifiers for correlation



## Related



- Parent Epic: BA-4880 (Sprint 26.3 requests)


JIRA Issue: BA-4909

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent: Add comprehensive logging for nsenter failures debugging #9713

Problem

Objective

Logging Requirements

1. Add Error Logging to `setns()`

2. Add FD Management Logging

3. Add Thread Pool Context Logging

4. Add Resource Usage Logging

5. Add Container Lifecycle Correlation

Success Criteria

Implementation Notes

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Agent: Add comprehensive logging for nsenter failures debugging #9713

Description

Problem

Objective

Logging Requirements

1. Add Error Logging to setns()

2. Add FD Management Logging

3. Add Thread Pool Context Logging

4. Add Resource Usage Logging

5. Add Container Lifecycle Correlation

Success Criteria

Implementation Notes

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

1. Add Error Logging to `setns()`