-
Notifications
You must be signed in to change notification settings - Fork 168
Description
Problem
Container network statistics collection fails intermittently (50% failure rate reported at one site). Need detailed logging to diagnose the root cause of setns() system call failures.
Objective
Add comprehensive logging to identify *why* setns() fails intermittently:
-
Which errno values occur?
-
When does failure happen (FD open, setns, or restore)?
-
What are the system resource states at failure time?
-
Container lifecycle correlation?
Logging Requirements
1. Add Error Logging to setns()
*File*: src/ai/backend/common/netns.py
def setns(fd: int) -> None:
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path, use_errno=True)
CLONE_NEWNET = 1 << 30
result = libc.setns(fd, CLONE_NEWNET)
if result == -1:
errno_val = ctypes.get_errno()
log.error(
"setns failed: fd={}, errno={}, error={}",
fd, errno_val, os.strerror(errno_val)
)
raise OSError(errno_val, f"setns failed: {os.strerror(errno_val)}")2. Add FD Management Logging
*File*: src/ai/backend/common/netns.py
class NetworkNamespaceManager:
def __init__(self, path: Path) -> None:
try:
self.self_ns = os.open("/proc/self/ns/net", os.O_RDONLY)
log.debug("Opened current ns: fd={}", self.self_ns)
except OSError as e:
log.error("Failed to open current ns: {}", e)
raise
try:
self.new_ns = os.open(path, os.O_RDONLY)
log.debug("Opened target ns: path={}, fd={}", path, self.new_ns)
except OSError as e:
log.error("Failed to open target ns: path={}, error={}", path, e)
os.close(self.self_ns)
raise3. Add Thread Pool Context Logging
*File*: src/ai/backend/agent/docker/intrinsic.py
async def netstat_ns(ns_path: Path) -> dict[str, Any]:
loop = asyncio.get_running_loop()
current_process = multiprocessing.current_process()
is_daemon = getattr(current_process, 'daemon', False)
log.debug(
"netstat_ns: ns_path={}, is_daemon={}, thread={}",
ns_path, is_daemon, threading.current_thread().name
)
if is_daemon:
log.warning("Using thread pool fallback for netstat_ns (daemon process)")
return await loop.run_in_executor(None, netstat_ns_work, ns_path)
1. ...4. Add Resource Usage Logging
Log system resource state when failures occur:
def log_resource_state():
import resource
nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
proc_fd_count = len(os.listdir('/proc/self/fd'))
log.info(
"Resource state: open_fds={}, limit_soft={}, limit_hard={}",
proc_fd_count, nofile_soft, nofile_hard
)5. Add Container Lifecycle Correlation
*File*: src/ai/backend/agent/docker/intrinsic.py
async def sysfs_impl(container_id: str):
1. ...
container = DockerContainer(self._docker, id=container_id)
data = await container.show()
sandbox_key = data["NetworkSettings"]["SandboxKey"]
1. Log before attempting namespace operation
log.debug(
"Collecting netstat: container={}, sandbox_key={}, state={}",
container_id[:7], sandbox_key, data["State"]["Status"]
)
try:
nstat = await netstat_ns(sandbox_key)
except Exception as e:
log.error(
"netstat_ns failed: container={}, sandbox_key={}, error={}",
container_id[:7], sandbox_key, e
)
raiseSuccess Criteria
After implementation, logs should answer:
-
*What errno* causes setns() failures? (EBADF, EINVAL, ENOMEM, EPERM, etc.)
-
*When* does it fail? (opening FD, switching ns, or restoring ns)
-
*Resource state* at failure time (FD count vs limit)
-
*Container state* when failure occurs (running, stopping, stopped)
-
*Concurrency pattern* (how many concurrent netstat_ns calls, thread pool usage)
Implementation Notes
-
Use DEBUG level for normal operations
-
Use WARNING for fallback paths
-
Use ERROR for failures with full context
-
Include thread/process identifiers for correlation
Related
- Parent Epic: BA-4880 (Sprint 26.3 requests)
JIRA Issue: BA-4909