From 7021b73940bf50766e8c2f347aaeb778d7192975 Mon Sep 17 00:00:00 2001 From: Rajkamal CV Date: Thu, 30 Apr 2026 13:52:38 +0530 Subject: [PATCH] Initial commit for AI documentation --- .github/ai-agent-definition.md | 91 ++++++ .github/ai-knowledge-base.md | 208 ++++++++++++ .github/ai-playbooks.md | 195 +++++++++++ .github/ai-prompts.md | 171 ++++++++++ .github/ai-workflows.md | 243 ++++++++++++++ docs/01-component-overview.md | 80 +++++ docs/02-high-level-design.md | 231 +++++++++++++ docs/03-low-level-design.md | 527 ++++++++++++++++++++++++++++++ docs/04-functional-workflows.md | 343 +++++++++++++++++++ docs/05-external-dependencies.md | 167 ++++++++++ docs/06-troubleshooting-guide.md | 413 +++++++++++++++++++++++ docs/07-developer-guide.md | 333 +++++++++++++++++++ docs/08-architecture-deep-dive.md | 281 ++++++++++++++++ 13 files changed, 3283 insertions(+) create mode 100755 .github/ai-agent-definition.md create mode 100755 .github/ai-knowledge-base.md create mode 100755 .github/ai-playbooks.md create mode 100755 .github/ai-prompts.md create mode 100755 .github/ai-workflows.md create mode 100755 docs/01-component-overview.md create mode 100755 docs/02-high-level-design.md create mode 100755 docs/03-low-level-design.md create mode 100755 docs/04-functional-workflows.md create mode 100755 docs/05-external-dependencies.md create mode 100755 docs/06-troubleshooting-guide.md create mode 100755 docs/07-developer-guide.md create mode 100755 docs/08-architecture-deep-dive.md diff --git a/.github/ai-agent-definition.md b/.github/ai-agent-definition.md new file mode 100755 index 00000000..83c39bf3 --- /dev/null +++ b/.github/ai-agent-definition.md @@ -0,0 +1,91 @@ +# Utopia AI Agent Definition + +## Agent Identity + +**Name:** Utopia Debug & Development Agent +**Scope:** RDK-B Utopia component — system infrastructure, configuration management, event bus, and network services +**Version:** 1.0 + +## Responsibilities + +1. 
**Issue Triage** — Classify incoming issues by subsystem (syscfg, sysevent, firewall, DHCP, WAN, IPv6, multinet) and severity +2. **Root Cause Analysis** — Trace symptoms through event flows, configuration state, and service dependencies to identify failure origin +3. **Debug Assistance** — Provide targeted debug commands, log locations, and inspection procedures +4. **Recovery Guidance** — Recommend recovery actions based on failure type and impact assessment +5. **Feature Development Support** — Guide implementation of new services following Utopia patterns + +## Core Knowledge + +### Architecture Model +- Syscfg: shared-memory hash table with file backing, robust mutex locking +- Sysevent: multi-threaded daemon (main + workers + sanity thread + fork helper) +- Services: event-driven binaries with command dispatch tables +- Integration: Unix domain sockets, pipes, FIFOs, shared memory + +### Key Signals +- Service status events: `-status` = stopped/starting/started/stopping/error +- WAN state: `wan-status`, `current_wan_ipaddr`, `default_router` +- Network: `lan-status`, `multinet_N-status`, `ipv4_N-status` +- System: `system-status`, `firewall-status` + +### Critical Paths +- `/nvram/syscfg.db` — persistent configuration +- `/var/run/syseventd.pid` — daemon singleton check +- `/tmp/.ipt` — firewall rule staging +- `/rdklogs/logs/` — all runtime logs +- `/etc/utopia/system_defaults` — factory defaults +- `/etc/utopia/service.d/` — service handler scripts + +## Skills + +| Skill | Capability | +|-------|-----------| +| Log Analysis | Parse ulog, console logs, firewall debug, sysevent tracer | +| Event Tracing | Follow event propagation through syseventd → trigger → handler | +| Config Inspection | Query syscfg/sysevent state, detect misconfigurations | +| Dependency Mapping | Identify cascading failures from dependency graph | +| Pattern Matching | Recognize known failure signatures from symptoms | +| Code Navigation | Map symptoms to source files and functions | 
+ +## Workflows + +### Triage Workflow +``` +Input: Issue description / logs + → Extract key signals (service name, error message, log line) + → Classify subsystem (syscfg | sysevent | firewall | dhcp | wan | ipv6 | multinet | other) + → Assess severity (P1: system down | P2: service degraded | P3: cosmetic | P4: feature gap) + → Route to appropriate debug workflow +Output: Classification + initial debug steps +``` + +### Debug Workflow +``` +Input: Classified issue + subsystem + → Identify relevant logs and commands + → Construct inspection sequence + → Analyze gathered data + → Narrow to root cause + → Provide fix recommendation +Output: Root cause + resolution steps +``` + +### RCA Workflow +``` +Input: Confirmed failure + logs + → Timeline reconstruction (first error → cascading effects) + → Dependency chain analysis (what triggered what) + → State validation (expected vs actual at each stage) + → Root cause identification (code path + trigger condition) + → Prevention recommendation +Output: RCA report with timeline, cause, fix, prevention +``` + +## Interaction Rules + +1. Always start with available signals (logs, sysevent state, syscfg values) +2. Never assume — verify through debug commands before concluding +3. Consider cascading failures — a single root cause may produce multiple symptoms +4. Distinguish between configuration errors and runtime failures +5. Check service status transitions for stuck states +6. 
Validate external dependency availability before blaming Utopia code diff --git a/.github/ai-knowledge-base.md b/.github/ai-knowledge-base.md new file mode 100755 index 00000000..b15dd894 --- /dev/null +++ b/.github/ai-knowledge-base.md @@ -0,0 +1,208 @@ +# AI Knowledge Base — Utopia + +## Signal Dictionary + +| Signal | Source | Meaning | Action | +|--------|--------|---------|--------| +| `wan-status = started` | service_wan/service_udhcpc | WAN IP acquired, internet reachable | Triggers: firewall-restart, routing, DDNS, IPv6 | +| `wan-status = stopped` | service_wan | WAN connection lost | Triggers: service cleanup, LED change | +| `dhcp_server-status = started` | service_dhcp | dnsmasq running, serving leases | LAN clients can get IPs | +| `dhcp_server-status = error` | service_dhcp | dnsmasq failed to start | Check port conflict, config syntax | +| `firewall-status = started` | firewall | Rules applied successfully | Traffic filtering active | +| `firewall-status = error` | firewall | iptables-restore failed | Check /tmp/.ipt for syntax errors | +| `system-status = started` | init scripts | Full system initialization complete | All services should be registered | +| `multinet_N-status = ready` | service_multinet | Bridge N created and configured | Dependent services can bind | +| `current_wan_ipaddr = ` | service_udhcpc | Current WAN IPv4 address | Used by firewall, NAT, DDNS | +| `lan-status = started` | service_dhcp | LAN interface ready | DHCP, DNS can serve | +| `ipv4_N-status = up` | service_dhcp | IPv4 instance N configured | Interface has IP, ready for traffic | +| `bridge_mode = 1` | syscfg | Device in bridge mode | Most services disabled, passthrough | + +## Error Code Reference + +### Syscfg Errors +| Code | Constant | Meaning | Fix | +|------|----------|---------|-----| +| -1 | ERR_INVALID_PARAM | NULL key or buffer | Check caller code | +| -2 | ERR_IO_FAILURE | File read/write failed | Check /nvram space | +| -3 | ERR_SHM_CREATE | shmget/shmat 
failed | Check SHMMAX kernel param | +| -4 | ERR_MEM_ALLOC | malloc failed | System OOM | +| -5 | ERR_SEMAPHORE_INIT | Mutex init failed | Shared memory corrupt | + +### Sysevent Client Errors +| Code | Constant | Meaning | Fix | +|------|----------|---------|-----| +| -1 | ERR_NOT_INITED | Client table not initialized | Restart syseventd | +| -2 | ERR_ALLOC_MEM | Cannot grow client table | Check memory | +| -3 | ERR_UNKNOWN_CLIENT | Token doesn't match any client | Client disconnected | + +### Firewall Return Codes +| Code | Meaning | Fix | +|------|---------|-----| +| 0 | Success | — | +| -1 | syscfg initialization failed | Check syscfg shared memory | +| -2 | sysevent connection failed | Check syseventd running | +| -3 | Time retrieval error | System clock issue | +| -4 | Mutex file creation failed | /tmp filesystem issue | + +## Known Failure Patterns + +### Pattern 1: Cascading Failure from Syseventd Restart +``` +Signature: + - Multiple services report "connection refused" in logs + - All service-status values go stale (no updates) + - Processes still running but non-responsive + +Root Cause: syseventd crashed or was restarted; all client connections invalidated + +Recovery: + 1. All services must re-register (restart each or reboot) + 2. Fix: syseventd OOM protection (writes -17 to oom_adj) +``` + +### Pattern 2: Firewall Mutex Deadlock After OOM Kill +``` +Signature: + - firewall-status stuck at "starting" for >60 seconds + - /rdklogs/logs/FirewallDebug.txt shows "acquiring mutex..." + - No "mutex acquired" message follows + +Root Cause: Previous firewall process OOM-killed while holding mutex. + EOWNERDEAD not always delivered if process killed by kernel. 
+ +Recovery: + rm /tmp/firewall_mutex + killall -9 firewall + sysevent set firewall-restart +``` + +### Pattern 3: DHCP Fails Due to Bridge Not Ready +``` +Signature: + - dhcp_server-status = "error" + - dnsmasq log: "failed to bind DHCP server socket: No such device" + - brctl show: bridge not listed + +Root Cause: service_dhcp started before multinet created the bridge. + Race condition in service startup ordering. + +Recovery: + sysevent set multinet_1-up + # Wait for: sysevent get multinet_1-status = "ready" + sysevent set dhcp_server-restart +``` + +### Pattern 4: WAN DHCP Fails — SIGCHLD Inherited +``` +Signature: + - service_wan logs: "system() returned -1" + - WAN interface up but no DHCP client started + +Root Cause: service_wan invoked by syseventd inherits SIG_IGN for SIGCHLD. + system() then returns -1 because wait() fails. + +Fix in code: + signal(SIGCHLD, SIG_DFL); // Reset before system() calls + +Recovery (runtime): + Manually start: udhcpc -i erouter0 -p /tmp/udhcpc.erouter0.pid -s /usr/bin/service_udhcpc +``` + +### Pattern 5: Syscfg Corruption on Power Loss +``` +Signature: + - After unexpected reboot, syscfg returns wrong values + - Log: "syscfg: WARNING - loading from backup file" + - Or: "syscfg: ERROR - both primary and backup corrupt, loading defaults" + +Root Cause: Power lost during syscfg_commit (file write interrupted) + +Prevention: Atomic write pattern (write temp → rename). Already implemented. + Issue occurs if temp file AND rename both interrupted. + +Recovery: + If backup loaded: minimal data loss (last committed state) + If defaults loaded: factory reset occurred — customer config lost +``` + +### Pattern 6: Sysevent Worker Threads All Blocked +``` +Signature: + - Events stop being processed + - sysevent_tracer.txt shows SET entries but no ACTION entries + - Sanity thread log: "killing blocked process " after 300s + +Root Cause: External handler script hangs (infinite loop, blocking I/O), + consuming all worker threads. 
+ +Recovery: + 1. Identify hung processes: ps aux | grep defunct + 2. Kill hung handlers + 3. Workers auto-recover after sanity thread kills blocked processes + +Prevention: Add timeout to all handler scripts +``` + +## Dependency Graph (Startup Order) + +``` +Level 0: Linux kernel + filesystems + │ +Level 1: syscfg_create (shared memory database) + │ +Level 2: syseventd (event bus daemon) + │ +Level 3: apply_system_defaults + service registration + │ +Level 4: ┌─────────────┬───────────────┬──────────────┐ + │ multinet-up │ macclone │ pmon start │ + │ (bridges) │ (WAN MAC) │ (monitoring) │ + └──────┬──────┴───────────────┴──────────────┘ + │ +Level 5: ┌──────┴──────┬───────────────┐ + │ dhcp-start │ wan-start │ + │ (LAN DHCP) │ (WAN conn) │ + └─────────────┴───────┬───────┘ + │ +Level 6: ┌──────────────────────┴───────────────────────┐ + │ firewall-restart (needs WAN IP + LAN config) │ + └──────────────────────┬──────────────────────┘ + │ +Level 7: ┌──────────┬───────────┴──────┬──────────────┐ + │ routing │ ipv6 services │ ddns update │ + └──────────┴──────────────────┴──────────────┘ +``` + +## Configuration Key Categories + +| Category | Key Pattern | Example | Service Consumer | +|----------|-------------|---------|-----------------| +| WAN | `wan_*` | `wan_proto=dhcp` | service_wan | +| LAN | `lan_*` | `lan_ipaddr=192.168.1.1` | service_dhcp, firewall | +| DHCP | `dhcp_*` | `dhcp_start=192.168.1.100` | service_dhcp | +| Firewall | `firewall_*`, `*Forward*` | `firewall_level=high` | firewall | +| WiFi | `wl*` | `wl0_ssid=MyNetwork` | utapi_wlan | +| IPv6 | `*v6*`, `router_adv_*` | `dhcpv6s_enable=1` | service_ipv6 | +| System | `device_*`, `hostname` | `hostname=RDK-Gateway` | utapi | +| DDNS | `ddns_*` | `ddns_enable=1` | service_ddns | +| Routing | `rip_*`, `StaticRoute*` | `StaticRoute_1=...` | service_routed | +| MultiNet | (PSM-based) | `dmsb.l2net.1.Name=brlan0` | service_multinet | + +## Symptom → Root Cause Quick Reference + +| Symptom | Most Likely 
Root Cause | Verification Command | Fix | +|---------|----------------------|---------------------|-----| +| No internet, WAN IP empty | udhcpc not running or ISP DHCP down | `ps \| grep udhcpc; cat /sys/class/net/erouter0/carrier` | `sysevent set wan-restart` | +| No internet, WAN IP present | Firewall blocking or route missing | `ip route show; iptables -L FORWARD -n \| grep REJECT` | `sysevent set firewall-restart` | +| LAN clients can't get IP | dnsmasq crashed or bridge missing | `ps \| grep dnsmasq; brctl show` | `sysevent set dhcp_server-restart` | +| Config lost after reboot | /nvram full, commit failed | `df /nvram; ls -la /nvram/syscfg.db` | Clear nvram space, `syscfg commit` | +| Service stuck "starting" | Previous handler killed mid-transition | `sysevent get -status` | Reset status to "stopped", then restart | +| All events stop working | syseventd workers all blocked | `grep BLOCKED sysevent_tracer.txt` | Kill hung handlers, wait 300s for sanity thread | +| Firewall hangs on restart | Mutex deadlock (holder crashed) | `fuser /tmp/firewall_mutex` | `rm /tmp/firewall_mutex; killall firewall` | +| Port forward not working | Rule generated but conntrack stale | `iptables -t nat -L \| grep ; conntrack -L` | `conntrack -F; sysevent set firewall-restart` | +| IPv6 not working on LAN | No prefix delegation received | `sysevent get tr_erouter0_dhcpv6_client_v6pref` | Restart WAN (re-triggers DHCPv6) | +| Bridge interface missing | multinet-up not triggered or failed | `brctl show; sysevent get multinet_N-status` | `sysevent set multinet_N-up` | +| DNS not resolving | resolv.conf empty or dnsmasq proxy down | `cat /etc/resolv.conf; ps \| grep dnsmasq` | `sysevent set dhcp_server-restart` | +| DDNS not updating | curl failed or credentials wrong | Check service_ddns log; `syscfg get ddns_enable` | `sysevent set ddns-retry` | +| Routing broken after WAN up | zebra/ripd not started | `ps \| grep zebra` | `sysevent set service_routed-restart` | +| Device stuck in 
wrong mode | DeviceMode switch incomplete | Check service_devicemode log | Manually call `service_devicemode DeviceMode <0\|1>` | diff --git a/.github/ai-playbooks.md b/.github/ai-playbooks.md new file mode 100755 index 00000000..f5c56ba2 --- /dev/null +++ b/.github/ai-playbooks.md @@ -0,0 +1,195 @@ +# AI Playbooks + +## Playbook 1: Issue Triage + +### Input +- Bug report or operational alert +- May contain: logs, symptoms, affected services, timestamp + +### Triage Steps + +``` +Step 1: CLASSIFY SUBSYSTEM +───────────────────────── +Keywords → Subsystem: + "syscfg" / "config" / "nvram" / "persist" → syscfg + "sysevent" / "event" / "trigger" / "callback" → sysevent + "iptables" / "firewall" / "blocked" / "NAT" → firewall + "dhcp" / "dnsmasq" / "lease" / "pool" → service_dhcp + "wan" / "udhcpc" / "internet" / "erouter" → service_wan + "ipv6" / "prefix" / "dibbler" / "radvd" → service_ipv6 + "bridge" / "vlan" / "brlan" / "multinet" → service_multinet + "route" / "zebra" / "rip" / "default gw" → service_routed + "ddns" / "dynamic dns" / "dyndns" → service_ddns + "process" / "crash" / "restart" / "pmon" → process_health + +Step 2: ASSESS SEVERITY +──────────────────────── + P1 (Critical): System unresponsive, syseventd down, all services failed + P2 (High): WAN down, no internet, major service (DHCP/FW) failed + P3 (Medium): Single feature broken (DDNS, IPv6, port forward) + P4 (Low): Cosmetic, logging issue, non-functional regression + +Step 3: IDENTIFY FIRST DEBUG ACTION +──────────────────────────────────── + syscfg issues → syscfg show; ipcs -m + sysevent issues → ps | grep syseventd; cat sysevent_tracer.txt + firewall issues → sysevent get firewall-status; cat /tmp/.ipt + DHCP issues → ps | grep dnsmasq; sysevent get dhcp_server-status + WAN issues → sysevent get wan-status; sysevent get current_wan_ipaddr + IPv6 issues → sysevent get tr_erouter0_dhcpv6_client_v6pref + Multinet issues → brctl show; sysevent get multinet_N-status + Routing issues → ip route show; 
ps | grep zebra + +Step 4: OUTPUT +────────────── + → Subsystem: [identified] + → Severity: [P1-P4] + → Immediate action: [first debug command] + → Escalation path: [if P1/P2, who owns this] +``` + +--- + +## Playbook 2: Log Analysis + +### Input +- Log file content (Consolelog.txt, FirewallDebug.txt, sysevent_tracer.txt, SelfHeal.txt) + +### Analysis Steps + +``` +Step 1: IDENTIFY LOG SOURCE +──────────────────────────── + /rdklogs/logs/Consolelog.txt.0 → Service operations (DHCP, WAN, routing) + /rdklogs/logs/FirewallDebug.txt → Firewall rule generation + /rdklogs/logs/sysevent_tracer.txt → Event flow tracing + /rdklogs/logs/MnetDebug.txt → MultiNet/bridge operations + /rdklogs/logs/SelfHeal.txt.0 → Process crash/restart events + /var/log/messages → Kernel + syslog (ulog output) + +Step 2: EXTRACT SIGNALS +──────────────────────── + Error patterns to look for: + "failed" → Operation failure (note what failed) + "errno" → System error (decode errno value) + "segfault" → Memory corruption / NULL deref + "timeout" → IPC or network timeout + "EOWNERDEAD" → Mutex holder crashed (usually auto-recovered) + "killed" → Process killed (OOM? sanity thread?) + "ENOSPC" → Filesystem full + "connection refused" → Daemon not running (syseventd? dbus?) + +Step 3: CONSTRUCT TIMELINE +────────────────────────── + - Sort events by timestamp + - Identify first error (root) vs cascading errors (effects) + - Map event names to services + - Identify state transitions (starting→error, started→stopping) + +Step 4: CORRELATE WITH KNOWN PATTERNS +────────────────────────────────────── + See Knowledge Base (ai-knowledge-base.md) for known failure signatures. 
+ +Step 5: OUTPUT +────────────── + → Timeline: [ordered events with timestamps] + → Root event: [first anomaly] + → Cascade: [subsequent failures caused by root] + → Confidence: [high/medium/low] + → Next steps: [verification commands] +``` + +--- + +## Playbook 3: Recovery Procedures + +### By Subsystem + +#### Syscfg Recovery +``` +Symptom: Configuration reads returning garbage or errors +Level 1: syscfg commit (force write to file) +Level 2: syscfg destroy; syscfg_create -f /nvram/syscfg.db (recreate shm) +Level 3: cp /nvram/syscfg.db.prev /nvram/syscfg.db; reboot (restore backup) +Level 4: rm /nvram/syscfg.db*; reboot (factory reset - loads system_defaults) +``` + +#### Sysevent Recovery +``` +Symptom: Events not delivered, services not responding +Level 1: Restart individual service (sysevent set -restart) +Level 2: Check PID file and worker threads +Level 3: kill syseventd; /usr/bin/syseventd --threads 10 (restart bus) +Level 4: Reboot (full system restart) +``` + +#### Firewall Recovery +``` +Symptom: Traffic blocked/incorrect filtering +Level 1: sysevent set firewall-restart (regenerate and reapply) +Level 2: rm /tmp/firewall_mutex; sysevent set firewall-restart (clear stuck mutex) +Level 3: iptables -F; iptables -P INPUT ACCEPT; iptables -P FORWARD ACCEPT (emergency allow-all) +Level 4: syscfg set firewall_level low; sysevent set firewall-restart (reduce ruleset) +``` + +#### DHCP Recovery +``` +Symptom: LAN clients not getting IPs +Level 1: sysevent set dhcp_server-restart +Level 2: killall dnsmasq; sysevent set dhcp_server-start +Level 3: sysevent set dhcp_server-status stopped; sysevent set dhcp_server-start (reset state machine) +Level 4: Check/regenerate /etc/dnsmasq.conf manually +``` + +#### WAN Recovery +``` +Symptom: No internet connectivity +Level 1: sysevent set wan-restart +Level 2: kill udhcpc; sysevent set wan-start (restart DHCP client) +Level 3: ifconfig erouter0 down; sleep 2; ifconfig erouter0 up; sysevent set wan-start +Level 4: Check 
physical link: cat /sys/class/net/erouter0/carrier +``` + +--- + +## Playbook 4: Performance Investigation + +### Input +- Slow system response, event processing delays, high CPU + +### Steps + +``` +Step 1: IDENTIFY BOTTLENECK +──────────────────────────── + CPU: top -b -n 1 | head -20 + Memory: free -m; cat /proc/meminfo + Disk I/O: iostat (if available); df /nvram; df /tmp + FD usage: ls /proc//fd | wc -l + +Step 2: CHECK SYSEVENT HEALTH +────────────────────────────── + Workers blocked: grep "BLOCKED" /rdklogs/logs/sysevent_tracer.txt + Event queue depth: (no direct metric — infer from processing delays) + Client count: count unique client connections + +Step 3: CHECK SYSCFG CONTENTION +──────────────────────────────── + Lock contention: strace -e futex syscfg get + Commit frequency: grep "commit" /rdklogs/logs/Consolelog.txt.0 | tail + File size: ls -la /nvram/syscfg.db + +Step 4: CHECK FIREWALL +─────────────────────── + Rule count: iptables -L -n | wc -l + Conntrack: cat /proc/net/nf_conntrack | wc -l + Rebuild time: time sysevent set firewall-restart + +Step 5: REMEDIATION +─────────────────── + - Reduce firewall rule complexity + - Batch syscfg commits (don't commit per-key) + - Kill orphaned sysevent clients + - Increase worker thread count if event queue growing +``` diff --git a/.github/ai-prompts.md b/.github/ai-prompts.md new file mode 100755 index 00000000..bdf6d6d7 --- /dev/null +++ b/.github/ai-prompts.md @@ -0,0 +1,171 @@ +# AI Prompts for Utopia Debugging & Development + +## Debugging Prompts + +### Prompt: Diagnose Service Failure +``` +Context: Utopia is the RDK-B system infrastructure component managing configuration (syscfg), events (sysevent), and network services (DHCP, firewall, WAN, routing, IPv6). + +Task: Diagnose why [SERVICE_NAME] is not functioning correctly. 
+ +Available Information: +- Service status: [sysevent get -status output] +- Relevant logs: [paste log excerpt] +- System state: [describe current behavior] + +Analysis Framework: +1. Check service state machine: Is it stuck in a transitional state (starting/stopping)? +2. Check dependencies: Are upstream services ready? (sysevent get system-status, wan-status, lan-status) +3. Check configuration: Are required syscfg keys set and valid? +4. Check external daemons: Are dependent processes (dnsmasq, udhcpc, dibbler) running? +5. Check resources: Is shared memory accessible? Are FD limits hit? + +Provide: Root cause hypothesis, verification commands, and resolution steps. +``` + +### Prompt: Analyze Firewall Issue +``` +Context: Utopia firewall generates iptables/ip6tables rules from syscfg configuration and applies atomically via iptables-restore. It uses a process-shared mutex (PTHREAD_PROCESS_SHARED + PTHREAD_MUTEX_ROBUST) at /tmp/firewall_mutex. + +Task: Analyze why [describe traffic issue - blocked/allowed incorrectly]. + +Information needed: +- firewall-status value (sysevent get firewall-status) +- Current WAN IP (sysevent get current_wan_ipaddr) +- Generated rules (/tmp/.ipt content relevant section) +- Active rules (iptables -L -n) +- Relevant syscfg entries (port forwarding, DMZ, firewall level) + +Analysis: +1. Was the rule generated? (check /tmp/.ipt) +2. Was it applied? (check iptables -L vs file content) +3. Is there a conflicting rule earlier in the chain? +4. Is conntrack holding stale state? +5. Is the mutex preventing regeneration? +``` + +### Prompt: Trace Event Flow +``` +Context: Sysevent is the IPC bus. When a value is set, DataMgr detects the change, TriggerMgr looks up registered callbacks, and either sends a notification message to connected clients or executes an external binary via Fork Helper. + +Task: Trace why event [EVENT_NAME] is not triggering expected action. + +Debug sequence: +1. Verify event was set: sysevent get [EVENT_NAME] +2. 
Check sysevent tracer: grep EVENT_NAME /rdklogs/logs/sysevent_tracer.txt +3. Verify trigger registration: look for async_id storage (xsm__async_id_) +4. Check worker thread health: are workers blocked? (sanity thread kills after 300s) +5. Check fork helper: is the target binary accessible and executable? +6. Check handler output: did the invoked handler produce errors? + +Expected flow: SET → DataMgr change detect → TriggerMgr dispatch → Worker sends to Fork Helper → Fork Helper fork+exec → Handler runs +``` + +## RCA Prompts + +### Prompt: Root Cause Analysis Template +``` +Context: Utopia component failure requiring root cause analysis. + +Failure: [Describe the failure] +Impact: [P1/P2/P3 + user-visible impact] +Timeline: [When first observed, duration] + +RCA Framework: +1. TIMELINE RECONSTRUCTION + - First anomaly timestamp in logs + - Sequence of events leading to failure + - Cascading effects on dependent services + +2. STATE ANALYSIS + - Expected state vs actual state at failure point + - Last known good state and transition that broke it + - Dependency states at time of failure + +3. CODE PATH IDENTIFICATION + - Which source file/function was executing + - What conditional branch was taken + - What error code was returned/logged + +4. ROOT CAUSE CLASSIFICATION + - Configuration error (wrong syscfg value) + - Race condition (timing between events) + - Resource exhaustion (FDs, memory, disk) + - External dependency failure (daemon crash, HAL error) + - Code defect (logic error, missing error handling) + +5. PREVENTION + - What check/guard would have prevented this + - What monitoring would have detected it earlier +``` + +## Feature Development Prompts + +### Prompt: Implement New Service +``` +Context: Utopia services follow a standard pattern: event-driven command dispatch with sysevent/syscfg integration. Each service is a separate binary invoked by sysevent callbacks. + +Task: Design and implement a new service for [FEATURE_DESCRIPTION]. 
+ +Implementation checklist: +1. Define service name and events: + - -start, -stop, -restart + - Custom events this service responds to + - Events this service publishes + +2. Define configuration keys (syscfg): + - What parameters does the service need? + - What are sensible defaults? + - Add to system_defaults file + +3. Implement using standard pattern: + - State structure (sefd, setok, + custom fields) + - Command dispatch table (cmd_ops[]) + - Handler functions for each operation + - Status transitions (stopped → starting → started) + - Error handling with appropriate logging + +4. Register with build system: + - Create source/service_/Makefile.am + - Add to source/Makefile.am SUBDIRS + - Add AC_CONFIG_FILES in configure.ac + - Consider conditional compilation (AM_CONDITIONAL) + +5. Register with sysevent: + - Add callback registration (via srvmgr or service script) + - Define trigger event dependencies + +6. Test: + - Manual: sysevent set -start; check status transitions + - Verify: dependent services notified on state changes + - Error cases: missing config, dependency unavailable +``` + +### Prompt: Add Configuration Parameter +``` +Context: Utopia configuration flows through syscfg (persistent) or sysevent (runtime). UTAPI provides typed access, and UTCTX manages transaction semantics. + +Task: Add new configuration parameter [PARAM_NAME] for [PURPOSE]. + +Steps: +1. Choose storage type: + - Persistent (survives reboot): syscfg → add to system_defaults + - Runtime (transient): sysevent → no persistence needed + +2. If UTAPI access needed: + - Add enum value to UtopiaValue enum in utctx headers + - Add entry to g_Utopia[] table in utctx.c with: + - Type (Config/IndexedConfig/Event/NamedConfig) + - Key format string + - Namespace (NULL for global) + - Event flags (which events to fire on change) + - Add getter/setter in utapi.c + +3. Add to system_defaults: + - Format: "key=default_value" + - Will be applied on first boot or factory reset + +4. 
Consuming service: + - syscfg_get(NULL, "param_name", buf, sizeof(buf)) + - React to change event (register callback via sysevent) +``` diff --git a/.github/ai-workflows.md b/.github/ai-workflows.md new file mode 100755 index 00000000..052368b9 --- /dev/null +++ b/.github/ai-workflows.md @@ -0,0 +1,243 @@ +# AI Workflows + +## Workflow 1: Issue Triage Flow + +```mermaid +flowchart TD + Start([Issue Reported]) --> Extract[Extract Key Information] + Extract --> |"Log excerpts, symptoms"| Classify{Classify Subsystem} + + Classify --> |"config/persist"| Syscfg[Syscfg Issue] + Classify --> |"event/callback"| Sysevent[Sysevent Issue] + Classify --> |"traffic/rules"| Firewall[Firewall Issue] + Classify --> |"IP/lease"| DHCP[DHCP Issue] + Classify --> |"internet/wan"| WAN[WAN Issue] + Classify --> |"v6/prefix"| IPv6[IPv6 Issue] + Classify --> |"bridge/vlan"| Multinet[MultiNet Issue] + + Syscfg --> Sev{Severity?} + Sysevent --> Sev + Firewall --> Sev + DHCP --> Sev + WAN --> Sev + IPv6 --> Sev + Multinet --> Sev + + Sev --> |"System down"| P1[P1: Immediate - Debug Flow] + Sev --> |"Feature broken"| P2[P2: High - Debug Flow] + Sev --> |"Degraded"| P3[P3: Medium - Investigate] + Sev --> |"Minor"| P4[P4: Low - Backlog] +``` + +## Workflow 2: Debug Flow + +```mermaid +flowchart TD + Start([Classified Issue]) --> Gather[Gather State] + + Gather --> |"1. Service status"| CheckStatus[sysevent get service-status] + Gather --> |"2. Dependencies"| CheckDeps[Check upstream deps] + Gather --> |"3. Config"| CheckConfig[syscfg get relevant keys] + Gather --> |"4. Processes"| CheckProc[ps grep relevant daemons] + Gather --> |"5. 
Logs"| CheckLogs[Read appropriate log file] + + CheckStatus --> Analyze{Analyze State} + CheckDeps --> Analyze + CheckConfig --> Analyze + CheckProc --> Analyze + CheckLogs --> Analyze + + Analyze --> |"Stuck state"| StuckFix[Reset state + restart service] + Analyze --> |"Dep missing"| DepFix[Start missing dependency first] + Analyze --> |"Bad config"| CfgFix[Fix syscfg value + restart] + Analyze --> |"Process dead"| ProcFix[Restart process/service] + Analyze --> |"Resource issue"| ResFix[Free resources + restart] + Analyze --> |"Unknown"| Deeper[Deeper investigation - RCA Flow] + + StuckFix --> Verify{Verified Fixed?} + DepFix --> Verify + CfgFix --> Verify + ProcFix --> Verify + ResFix --> Verify + + Verify --> |Yes| Done([Resolved]) + Verify --> |No| Deeper +``` + +## Workflow 3: RCA Flow + +```mermaid +flowchart TD + Start([Unresolved Issue]) --> Timeline[Reconstruct Timeline] + + Timeline --> FirstErr[Identify First Error in Logs] + FirstErr --> Cascade[Map Cascading Effects] + + Cascade --> StateCheck{State at Failure Point} + + StateCheck --> |"Config wrong"| ConfigRCA[Configuration Root Cause] + StateCheck --> |"Race/timing"| RaceRCA[Race Condition Root Cause] + StateCheck --> |"Resource"| ResourceRCA[Resource Exhaustion Root Cause] + StateCheck --> |"External"| ExtRCA[External Dependency Root Cause] + StateCheck --> |"Code bug"| CodeRCA[Code Defect Root Cause] + + ConfigRCA --> CodePath[Identify Code Path] + RaceRCA --> CodePath + ResourceRCA --> CodePath + ExtRCA --> CodePath + CodeRCA --> CodePath + + CodePath --> Reproduce[Define Reproduction Steps] + Reproduce --> Fix[Propose Fix] + Fix --> Prevent[Propose Prevention] + Prevent --> Report([RCA Report]) +``` + +## Workflow 4: Service Health Check + +``` +INPUT: Service name to verify + +SEQUENCE: +1. 
Check service status + → sysevent get -status + → Expected: "started" + → If "starting"/"stopping" for >30s → STUCK (see recovery) + → If "stopped" when expected running → START NEEDED + → If "error" → CHECK LOGS + +2. Check dependent process + → Map: service_dhcp→dnsmasq, service_wan→udhcpc, service_ipv6→dibbler + → ps | grep + → If missing: service must restart it + +3. Check functional behavior + → DHCP: nmap --script broadcast-dhcp-discover on LAN + → WAN: ping -c 1 -I erouter0 8.8.8.8 + → Firewall: iptables -L | wc -l (should be > 50) + → DNS: nslookup google.com 127.0.0.1 + +4. Check configuration consistency + → Compare syscfg values with runtime state + → Example: syscfg get lan_ipaddr vs ifconfig brlan0 + +OUTPUT: Health status (healthy/degraded/down) + specific findings +``` + +## Workflow 5: Configuration Audit + +``` +INPUT: Optional specific subsystem to audit + +SEQUENCE: +1. Dump full configuration + → syscfg show > /tmp/config_audit.txt + +2. Validate mandatory keys exist + → REQUIRED_KEYS = [lan_ipaddr, lan_netmask, wan_proto, dhcp_start, + dhcp_end, firewall_level, hostname] + → For each: verify non-empty and syntactically valid + +3. Validate cross-key consistency + → dhcp_start and dhcp_end in same subnet as lan_ipaddr/lan_netmask + → wan_proto matches expected values (dhcp|static|pppoe) + → StaticRoute entries have valid IP format + → PortForward entries have valid port ranges (1-65535) + +4. Check for corruption indicators + → Keys with non-printable characters + → Values exceeding 256 bytes (unusual) + → Duplicate keys (shouldn't exist in hash table) + → Keys not in system_defaults AND not in any code (orphaned) + +5. 
Compare with defaults + → Load /etc/utopia/system_defaults + → Flag any key deviating from default (informational) + +OUTPUT: Audit report with findings categorized as: + - CRITICAL: Mandatory key missing/invalid + - WARNING: Inconsistent cross-references + - INFO: Non-default values (expected for configured device) +``` + +## Decision Trees + +### Decision Tree: "No Internet" Triage + +``` +Q: Does WAN interface have an IP? + → sysevent get current_wan_ipaddr + │ + ├── EMPTY: WAN IP not acquired + │ Q: Is WAN interface UP? + │ │ → ifconfig erouter0 + │ ├── DOWN: Interface not up + │ │ → sysevent set wan-start + │ └── UP: Interface up, no IP + │ Q: Is DHCP client running? + │ │ → ps | grep udhcpc + │ ├── NO: Start DHCP client + │ │ → sysevent set wan-start + │ └── YES: DHCP client running + │ → Problem is upstream (ISP/cable/physical) + │ → Check: cat /sys/class/net/erouter0/carrier + │ + └── HAS IP: WAN has address + Q: Can ping gateway? + │ → ping -c 1 $(sysevent get default_router) + ├── NO: Gateway unreachable + │ Q: Is default route set? + │ │ → ip route | grep default + │ ├── NO: Route missing + │ │ → sysevent set wan-restart (will re-add route) + │ └── YES: Route exists, gateway unreachable + │ → ARP issue or gateway down + │ + └── YES: Gateway reachable + Q: Can ping external (8.8.8.8)? + ├── NO: NAT/firewall blocking + │ → iptables -L FORWARD -n -v | grep REJECT + │ → sysevent set firewall-restart + └── YES: Ping works + Q: DNS working? + │ → nslookup google.com + ├── NO: DNS issue + │ → cat /etc/resolv.conf + │ → sysevent set dhcp_server-restart + └── YES: All working + → Issue is client-specific, not Utopia +``` + +### Decision Tree: "Service Not Starting" + +``` +Q: What does service-status show? 
+│ → sysevent get <service>-status +│ +├── "starting" (stuck) +│ → Previous start didn't complete +│ → FIX: sysevent set <service>-status stopped +│ sysevent set <service>-start +│ +├── "stopping" (stuck) +│ → Previous stop didn't complete +│ → FIX: kill remaining process, then: +│ sysevent set <service>-status stopped +│ sysevent set <service>-start +│ +├── "error" +│ → Service tried to start and failed +│ → CHECK: logs for specific error +│ → COMMON: port conflict, missing config, dependency not ready +│ +├── "stopped" (won't start) +│ → Start event not reaching service +│ → CHECK: Is callback registered? +│ sysevent get xsm__async_id_start +│ → If empty: service not registered → re-register or restart syseventd +│ +└── (empty/not set) + → Service never initialized + → CHECK: Is service in Makefile.am SUBDIRS? + → CHECK: Was it conditionally compiled out? +``` diff --git a/docs/01-component-overview.md b/docs/01-component-overview.md new file mode 100755 index 00000000..7ef4e1a7 --- /dev/null +++ b/docs/01-component-overview.md @@ -0,0 +1,80 @@ +# Utopia Component Overview + +## Purpose + +Utopia is the foundational system infrastructure component for RDK-B (Reference Design Kit - Broadband) middleware. It provides system initialization, configuration management, inter-process communication (IPC) via an event bus, and network service orchestration for residential gateway devices. 
+ +## Core Responsibilities + +| Responsibility | Description | +|---|---| +| **Configuration Persistence** | Shared-memory-based key-value store (syscfg) with filesystem backing | +| **Event Bus IPC** | Publish-subscribe messaging (sysevent) over Unix domain sockets | +| **Network Service Orchestration** | Coordinated management of DHCP, firewall, routing, WAN, IPv6 | +| **Unified API Layer** | UTAPI/UTCTX libraries abstracting config and event access | +| **Service Lifecycle** | Start/stop/restart coordination for all managed services | +| **Process Monitoring** | Health checks and auto-restart via pmon | + +## Module Inventory + +| Module | Type | Purpose | +|--------|------|---------| +| `syscfg` | Library + Daemon | Persistent configuration database (shared memory + file) | +| `sysevent` | Daemon + Library | Real-time event bus with trigger-based action execution | +| `utapi` | Library + CLI | High-level configuration API (typed getters/setters) | +| `utctx` | Library | Transaction context manager over syscfg/sysevent | +| `firewall` | Service binary | iptables/ip6tables rule generation and management | +| `service_dhcp` | Service binary | DHCP server lifecycle (dnsmasq management) | +| `service_wan` | Service binary | WAN interface and DHCP client management | +| `service_routed` | Service binary | Routing daemon management (zebra/ripd, radvd) | +| `service_ipv6` | Service binary | IPv6 provisioning and DHCPv6 server (dibbler) | +| `service_multinet` | Service binary | VLAN/bridge management and network isolation | +| `service_udhcpc` | Service binary | DHCP client callback handler | +| `service_ddns` | Service binary | Dynamic DNS registration with external providers | +| `service_dslite` | Service binary | DS-Lite (IPv4-in-IPv6) tunnel management | +| `service_deviceMode` | Service binary | Router/Extender mode switching | +| `trigger` | Daemon | Port-range triggering via netfilter queue | +| `pmon` | Service binary | Process health monitor with 
auto-restart | +| `newhost` | Service binary | New LAN host detection and firewall trigger | +| `macclone` | Service binary | WAN interface MAC address cloning | +| `ulog` | Library | Unified logging (wraps syslog with component tags) | +| `pal` | Library | Platform Abstraction Layer (network, UPnP, XML) | +| `services/lib` | Library | Service registration framework (srvmgr) | +| `util` | Library | Shared utilities (vsystem, iface ops, PSM access) | +| `scripts/init` | Shell scripts | System bootstrap and service initialization | +| `walled_garden` | Shell scripts | Guest/parental-control access aging | + +## Key Interfaces + +### Northbound (consumed by other RDK-B components) +- **UTAPI C API** — typed configuration access for CCSP components +- **syscfg CLI** — command-line configuration read/write +- **sysevent CLI** — command-line event publish/subscribe +- **RBus** — enhanced message bus integration (optional) + +### Southbound (consumed by Utopia) +- **Linux kernel** — iptables, ip route, netlink, /proc, /sys +- **HAL APIs** — wifi_hal, ethernet_hal, docsis_hal, platform_hal +- **External daemons** — dnsmasq, udhcpc, dibbler-server, zebra, ripd, radvd, curl + +### Lateral (IPC between Utopia modules) +- **Sysevent bus** — all inter-service coordination +- **Syscfg shared memory** — cross-process configuration access +- **Pipes/FIFOs** — syseventd internal worker communication + +## Platform Support + +Utopia supports multiple hardware platforms via compile-time configuration: + +| Platform | Flag | Notes | +|----------|------|-------| +| Intel USG (Puma6) | `--with-ccsp-platform=intel_usg` | Legacy Puma6 devices | +| Intel Puma7 | `--with-ccsp-platform=intel_puma7` | XB6/XB7 platforms | +| Broadcom | `--with-ccsp-platform=bcm` | BCM-based gateways | +| PC (emulation) | `--with-ccsp-platform=pc` | Development/testing | + +## Build System + +- **Autotools** (autoconf/automake/libtool) +- Conditional compilation via `AM_CONDITIONAL` flags +- Feature flags: 
DSLite, CoreNetLib, Extender, Hotspot, DDNS, MoCA diff --git a/docs/02-high-level-design.md b/docs/02-high-level-design.md new file mode 100755 index 00000000..9c7a3472 --- /dev/null +++ b/docs/02-high-level-design.md @@ -0,0 +1,231 @@ +# High-Level Design (HLD) + +## System Architecture + +Utopia operates as a layered service-oriented system providing infrastructure and networking services to the RDK-B middleware stack. + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ RDK-B Management Layer │ +│ (TR-069, WebPA, USP, WebUI, CcspPandM, CcspPsm) │ +├─────────────────────────────────────────────────────────────────────┤ +│ UTAPI / UTCTX (API Layer) │ +│ Typed config getters/setters, transaction context │ +├────────────────────────┬────────────────────────────────────────────┤ +│ Syscfg (Config DB) │ Sysevent (Event Bus) │ +│ Shared memory + │ Unix domain socket server + │ +│ filesystem backing │ trigger-based action dispatch │ +├────────────────────────┴────────────────────────────────────────────┤ +│ Network Services Layer │ +│ ┌──────────┬──────────┬──────────┬──────────┬──────────┐ │ +│ │ Firewall │ DHCP │ WAN │ Routing │ IPv6 │ │ +│ ├──────────┼──────────┼──────────┼──────────┼──────────┤ │ +│ │ MultiNet │ Trigger │ DDNS │ DSLite │ DevMode │ │ +│ └──────────┴──────────┴──────────┴──────────┴──────────┘ │ +├─────────────────────────────────────────────────────────────────────┤ +│ Support Services │ +│ (pmon, newhost, macclone, ulog, walled_garden) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Platform Layer │ +│ Linux Kernel │ HAL APIs │ iptables │ ip route │ External Daemons │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +## Major Components + +### 1. Syscfg — Configuration Persistence Engine + +**Architecture:** Process-shared POSIX shared memory segment containing a hash table (djb2 hash, configurable bucket count). 
Filesystem-backed for persistence across reboots. + +**Design Decisions:** +- Shared memory for sub-millisecond cross-process reads (no IPC round-trip) +- Robust POSIX mutexes handle process crashes (EOWNERDEAD recovery) +- Dual-file backing (primary + backup) for corruption resilience +- Namespace support for multi-tenant isolation + +**Data Flow:** +``` +syscfg_set() → acquire write_lock → update hash table in shm → release lock +syscfg_commit() → acquire commit_lock → serialize hash table → write to file → release lock +syscfg_get() → acquire read_lock → lookup hash table → copy value → release lock +``` + +### 2. Sysevent — Event Bus Daemon + +**Architecture:** Multi-threaded daemon (1 main + N workers + 1 sanity) serving TCP and Unix domain socket clients. Implements pub-sub with trigger-based action dispatch. + +**Design Decisions:** +- select()-based I/O multiplexing on main thread for connection acceptance +- Worker thread pool (default 10) for event processing parallelism +- Fork helper child process for external executable invocation +- Serial execution mode for ordered action sequences +- Named FIFOs for fork-helper→worker result delivery + +**Component Structure:** +``` +syseventd +├── Main Thread (accept loop, client token assignment) +├── Worker Threads[N] (event processing, action dispatch) +├── Sanity Thread (watchdog: kill blocked processes >300s) +├── Fork Helper (child process for external exec) +├── ClientsMgr (dynamic client table, FD↔token mapping) +├── TriggerMgr (action registry, serial/parallel dispatch) +└── DataMgr (tuple storage, change detection, trigger firing) +``` + +### 3. 
UTAPI/UTCTX — API and Context Layer + +**Architecture:** Two-tier abstraction: +- **UTCTX** (lower): Transaction manager buffering reads/writes, committing atomically +- **UTAPI** (upper): Typed domain-specific APIs (LAN, WAN, DHCP, Firewall, WLAN) + +**Design Pattern:** Unit-of-Work +``` +Utopia_Init() → open sysevent connection + Utopia_Set() × N → buffer changes in linked list +Utopia_Free() → commit to syscfg → fire accumulated events → close +``` + +### 4. Network Services + +All network services follow a common architectural pattern: + +``` +┌─────────────────────────────────────────┐ +│ Service Binary │ +├─────────────────────────────────────────┤ +│ main() → parse CLI args → dispatch │ +│ ┌───────────────────────────────────┐ │ +│ │ State Structure (serv_xxx) │ │ +│ │ - sysevent fd/token │ │ +│ │ - service-specific state │ │ +│ └───────────────────────────────────┘ │ +│ ┌───────────────────────────────────┐ │ +│ │ Command Dispatch Table │ │ +│ │ cmd_ops[] = {name, handler_fn} │ │ +│ └───────────────────────────────────┘ │ +│ ┌───────────────────────────────────┐ │ +│ │ Handlers │ │ +│ │ - start/stop/restart │ │ +│ │ - event-specific logic │ │ +│ └───────────────────────────────────┘ │ +├─────────────────────────────────────────┤ +│ IPC: sysevent_open/get/set/close │ +│ Config: syscfg_get/set/commit │ +│ Platform: v_secure_system / HAL APIs │ +└─────────────────────────────────────────┘ +``` + +### 5. Service Manager (srvmgr) + +**Role:** Registers services with sysevent for event-driven activation. 
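Event-driven activation depends on a naming convention: a service is addressed through events named after it, such as `dhcp_server-restart` or `firewall-restart`. The sketch below shows one way such names could be composed before they are registered as callbacks. `build_event_name`, `print_default_events`, and `DEFAULT_VERBS` are hypothetical helpers written for illustration, not srvmgr functions, and the start/stop/restart verb set is an assumption based on the events named in this document.

```c
#include <stddef.h>
#include <stdio.h>

/* Verbs assumed to be registered for every service (illustrative only). */
static const char *const DEFAULT_VERBS[] = { "start", "stop", "restart" };
#define NUM_DEFAULT_VERBS (sizeof(DEFAULT_VERBS) / sizeof(DEFAULT_VERBS[0]))

/*
 * Hypothetical helper, not part of Utopia: writes "<service>-<verb>" to out.
 * Returns 0 on success, -1 on invalid arguments or if the name would be
 * truncated.
 */
int build_event_name(char *out, size_t outsz,
                     const char *service, const char *verb)
{
    if (out == NULL || outsz == 0 || service == NULL || verb == NULL)
        return -1;

    int n = snprintf(out, outsz, "%s-%s", service, verb);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/*
 * Prints the default event names for one service, one per line.
 * Returns the number of names printed. A real registration would pass each
 * name to the sysevent callback API instead of printing it.
 */
int print_default_events(const char *service)
{
    char name[64];
    int count = 0;

    for (size_t i = 0; i < NUM_DEFAULT_VERBS; i++) {
        if (build_event_name(name, sizeof(name), service, DEFAULT_VERBS[i]) == 0) {
            printf("%s\n", name);
            count++;
        }
    }
    return count;
}
```

Calling `print_default_events("dhcp_server")` prints `dhcp_server-start`, `dhcp_server-stop`, and `dhcp_server-restart`. Rejecting truncated names rather than registering a shortened one keeps an overlong service name from silently colliding with another event.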
+ +**Registration Pattern:** +``` +sm_register(service_name, default_events[], custom_events[]) + → sysevent_setcallback(event, handler_path, flags) + → store async_id for later cancellation +``` + +## Component Interactions + +### Event-Driven Orchestration + +``` +Configuration Change (e.g., DHCP pool update) + │ + ▼ +UTAPI → syscfg_set() → syscfg_commit() + │ + ▼ +UTCTX posts "dhcp_server-restart" to sysevent + │ + ▼ +Sysevent TriggerMgr matches registered callback + │ + ▼ +Fork Helper executes: /usr/bin/service_dhcp dhcp_server-restart + │ + ▼ +service_dhcp: stops dnsmasq → regenerates config → starts dnsmasq + │ + ▼ +service_dhcp posts "dhcp_server-status" = "started" to sysevent +``` + +### Cross-Service Dependencies + +``` + firewall-restart + ▲ + │ + ┌────────────────────┼────────────────────┐ + │ │ │ +wan-status lan-status newhost-trigger + │ │ │ +service_wan service_dhcp newhost + │ │ + ▼ ▼ + routing dhcp_server +``` + +## External Dependencies + +### Runtime Dependencies + +| Dependency | Type | Used By | Failure Impact | +|---|---|---|---| +| dnsmasq | External daemon | service_dhcp | No DHCP/DNS on LAN | +| udhcpc / ti_udhcpc | External daemon | service_wan | No WAN IP acquisition | +| dibbler-server | External daemon | service_ipv6 | No DHCPv6 on LAN | +| zebra / ripd | External daemon | service_routed | No dynamic routing | +| radvd | External daemon | service_routed | No IPv6 RA on LAN | +| iptables / ip6tables | System utility | firewall | No packet filtering | +| ip (iproute2) | System utility | Multiple | No routing/interface config | +| curl | System utility | service_ddns | No DDNS updates | +| conntrack_delete | System utility | walled_garden | Stale connections persist | +| cron | System service | Multiple | No scheduled operations | + +### Library Dependencies + +| Library | Purpose | Linked By | +|---|---|---| +| libsyscfg | Configuration API | All services | +| libsysevent | Event bus client API | All services | +| libulog | Logging | All 
services | +| libccsp_common | CCSP IPC framework | firewall, service_dhcp, service_routed | +| libsafec | Safe C string operations | All services | +| libnetfilter_queue | NFQ packet handling | trigger | +| libdbus-1 | D-Bus IPC | firewall, service_dhcp | +| libcjson | JSON parsing | apply_system_defaults | +| libnet | Network operations | macclone (optional) | + +### Filesystem Dependencies + +| Path | Purpose | Critical | +|---|---|---| +| `/nvram/syscfg.db` | Primary persistent config | Yes | +| `/opt/secure/data/syscfg.db` | Alternate config location | Platform-specific | +| `/etc/utopia/system_defaults` | Factory default values | Yes (first boot) | +| `/etc/utopia/service.d/` | Service handler scripts | Yes | +| `/tmp/syseventd_connection` | Sysevent UDS path | Yes (runtime) | +| `/var/run/syseventd.pid` | Daemon PID file | Yes (singleton) | +| `/tmp/.ipt` / `/tmp/.ipt_v6` | Firewall rule staging | Yes (firewall) | +| `/rdklogs/logs/` | Runtime log directory | No (degraded logging) | + +## Threading Model + +| Component | Model | Details | +|---|---|---| +| syseventd | Multi-threaded | 1 main + 10 workers + 1 sanity + 1 fork helper | +| syscfg | Lock-based shared | POSIX robust mutexes across processes | +| Network services | Single-threaded | Each runs as separate process | +| Firewall | Single-threaded | Process-level mutex for serialization | +| Trigger | Single-threaded | select() event loop with NFQ | + +## Scalability Design + +- **Horizontal:** New services register with sysevent independently +- **Vertical:** Worker thread pool in syseventd scales event processing +- **Modularity:** Conditional compilation enables/disables features per platform +- **Isolation:** Each service runs as separate process (fault isolation) diff --git a/docs/03-low-level-design.md b/docs/03-low-level-design.md new file mode 100755 index 00000000..9c851e0a --- /dev/null +++ b/docs/03-low-level-design.md @@ -0,0 +1,527 @@ +# Low-Level Design (LLD) + +## 1. 
Syscfg Module + +### Data Structures + +```c +// Shared memory control block +struct syscfg_shm_ctx { + shm_cb *cb; // Control block in shared memory + int shm_fd; // Shared memory file descriptor + size_t shm_size; // Total shared memory size +}; + +// Hash table entry (stored in shared memory at offsets) +struct ht_entry { + int name_sz; // Key length + int value_sz; // Value length + int next; // Offset to next entry (0 = end) + // followed by: name[name_sz] + value[value_sz] +}; + +// Default value node (in-process linked list) +struct ConfigNode { + char *key; + char *value; + struct ConfigNode *next; +}; +``` + +### Hash Table Implementation +- **Algorithm:** djb2 (`hash = hash * 33 + c`) +- **Bucket count:** `SYSCFG_HASH_TABLE_SZ` (compile-time constant) +- **Collision resolution:** Separate chaining via offset-based linked lists in shared memory +- **Storage:** All data stored as offsets from shared memory base (portable across processes) + +### Locking Strategy + +``` +┌─────────────────────────────────────────┐ +│ Three-Lock Protocol │ +│ │ +│ read_lock - Multiple readers allowed │ +│ write_lock - Exclusive writer access │ +│ commit_lock - Exclusive file I/O │ +│ │ +│ EOWNERDEAD handling: │ +│ pthread_mutex_consistent() → recover │ +│ Continue operation (no data loss) │ +└─────────────────────────────────────────┘ +``` + +### State Machine + +``` + syscfg_create() + │ + ▼ +┌──────────────────────────────────┐ +│ UNINITIALIZED │ +└──────────────────────────────────┘ + │ shmget/shmat + load_from_file + ▼ +┌──────────────────────────────────┐ +│ ACTIVE │◄──────────────────┐ +│ (shared memory ready) │ │ +└──────────────────────────────────┘ │ + │ │ │ + syscfg_set() syscfg_commit() │ + │ │ │ + ▼ ▼ │ +┌──────────────┐ ┌──────────────────┐ │ +│ MODIFIED │ │ COMMITTING │ │ +│(in-memory) │ │(write to file) │────────────────┘ +└──────────────┘ └──────────────────┘ + │ + syscfg_commit() + │ + ▼ + (write to file → back to ACTIVE) +``` + +--- + +## 2. 
Sysevent Module + +### Client Manager Data Structures + +```c +typedef unsigned int token_t; + +typedef struct { + int used; // Slot active flag + token_t id; // Unique client token + int fd; // Socket file descriptor + int notifications; // Notification count + int errors; // Error count + char name[15]; // Client identifier + int isData; // Data-only client flag +} a_client_t; + +typedef struct { + pthread_mutex_t mutex; + int num_cur_clients; + int max_cur_clients; // Grows dynamically + a_client_t *clients; // Dynamic array +} clients_t; +``` + +### Trigger Manager Data Structures + +```c +typedef struct { + int used; + token_t owner; // Owning client token + int action_flags; + int action_type; // ACTION_TYPE_EXT_FUNCTION | ACTION_TYPE_MESSAGE + int action_id; + char *action; // Executable path or message format + int argc; + char **argv; // Additional arguments +} trigger_action_t; + +typedef struct { + int used; + int trigger_id; + int max_actions; + int num_actions; + int next_action_id; + trigger_action_t *trigger_actions; + int trigger_flags; // TUPLE_FLAG_SERIAL, TUPLE_FLAG_EVENT, etc. 
+} trigger_t; +``` + +### Event Processing Sequence + +``` +Client: sysevent_set("wan-status", "started") + │ + ▼ +Worker Thread receives SE_MSG_SET message + │ + ▼ +DATA_MGR_set("wan-status", "started") + │ + ├── Compare with current value + │ └── If unchanged → return (no trigger) + │ + ├── Update data_element_t.value + │ + └── If trigger_id != 0: + │ + ▼ + Write trigger_id to trigger_communication_pipe + │ + ▼ + Worker reads from pipe → TRIGGER_MGR.execute_trigger_actions() + │ + ├── For each ACTION_TYPE_MESSAGE: + │ Build SE_MSG_SEND_NOTIFICATION + │ Send to client FD (CLI_MGR_id2fd) + │ + └── For each ACTION_TYPE_EXT_FUNCTION: + Build SE_MSG_RUN_EXTERNAL_EXECUTABLE + Send to fork_helper via pipe + Fork helper: fork() + execve(action_path) +``` + +### Worker Thread State Machine + +``` +┌─────────────┐ semaphore wait ┌─────────────┐ +│ IDLE │──────────────────►│ ACTIVE │ +└─────────────┘ └─────────────┘ + ▲ │ + │ ▼ + │ Process message: + │ - SE_MSG_SET + │ - SE_MSG_GET + │ - SE_MSG_SET_OPTIONS + │ - SE_MSG_REMOVE_ASYNC + │ - SE_MSG_CLOSE + │ │ + └──────────────────────────────────┘ + message processed +``` + +### Sanity Thread Logic + +``` +Every 5 seconds: + For each entry in blocked_exec_list: + increment mark_counter + If mark_counter > MAX_ACTIVATION_BLOCKING_SECS/5: + kill(pid, SIGKILL) + remove from list + log "killed blocked process" +``` + +--- + +## 3. 
Firewall Module + +### Rule Generation Pipeline + +``` +service_start() + │ + ├── fw_shm_mutex_init() → acquire process mutex + │ + ├── Initialize: sysevent_open + syscfg_init + │ + ├── Read ALL configuration: + │ - WAN IP/interface/protocol + │ - LAN settings (IP, mask, bridge) + │ - Port forwarding rules + │ - DMZ configuration + │ - ACL/MAC filtering + │ - QoS rules + │ - Parental controls + │ + ├── prepare_ipv4_firewall("/tmp/.ipt") + │ ├── raw table rules + │ ├── mangle table (QoS marks, DSCP) + │ ├── nat table (DNAT, SNAT, masquerade) + │ └── filter table (all chains) + │ + ├── prepare_ipv6_firewall("/tmp/.ipt_v6") + │ └── (similar structure for IPv6) + │ + ├── system("iptables-restore < /tmp/.ipt") + │ system("ip6tables-restore < /tmp/.ipt_v6") + │ + └── sysevent_set("firewall-status", "started") +``` + +### Iptables Chain Architecture + +``` + PREROUTING + │ + ┌─────────────┼─────────────┐ + │ │ │ + prerouting_ prerouting_ prerouting_ + fromwan fromlan trigger + │ │ + ▼ ▼ + INPUT FORWARD + │ │ + ┌─────┴─────┐ ┌───┴───────────┐ + │ │ │ │ + wan2self lan2self lan2wan wan2lan + │ + ┌──────┴──────┐ + │ │ + wan2lan_ wan2lan_ + accept dns_intercept + + POSTROUTING + │ + postrouting_towan +``` + +### Mutex Design (Cross-Process) + +```c +// Shared memory mutex for process-level firewall serialization +#define SHM_MUTEX "/tmp/firewall_mutex" + +fw_shm_mutex_init(): + fd = open(SHM_MUTEX, O_CREAT|O_RDWR) + ftruncate(fd, sizeof(pthread_mutex_t)) + mmap → pshared_mutex + pthread_mutexattr_setpshared(PTHREAD_PROCESS_SHARED) + pthread_mutexattr_setrobust(PTHREAD_MUTEX_ROBUST) + pthread_mutex_init(pshared_mutex) + +// On EOWNERDEAD: + pthread_mutex_consistent(pshared_mutex) // Mark consistent + // Continue execution (previous holder crashed) +``` + +--- + +## 4. 
Service DHCP Module + +### Event Handler Dispatch + +```c +static const struct cmd_entry cmd_table[] = { + {"dhcp_server-restart", dhcp_server_restart}, + {"dhcp_server-start", dhcp_server_start}, + {"dhcp_server-stop", dhcp_server_stop}, + {"lan-status", lan_status_change}, + {"bring-lan", bring_lan_up}, + {"ipv4_N-status", ipv4_status}, + {"ipv4-up", ipv4_up}, + {"ipv4-down", teardown_instance}, + {"multinet_N-status", handle_l2_status}, + ... +}; +``` + +### DHCP Server Lifecycle + +``` +dhcp_server_start() + │ + ├── wait_till_end_state("dhcp_server", 9 retries, 1s each) + │ └── Ensures no concurrent start/stop transition + │ + ├── sysevent_set("dhcp_server-status", "starting") + │ + ├── Read pool configuration from syscfg: + │ - Pool ranges + │ - Static leases + │ - Options (DNS, gateway, lease time) + │ + ├── Generate /etc/dnsmasq.conf + │ + ├── Start dnsmasq process + │ + └── sysevent_set("dhcp_server-status", "started") + +dhcp_server_stop() + │ + ├── sysevent_set("dhcp_server-status", "stopping") + ├── kill dnsmasq (SIGTERM via PID file) + └── sysevent_set("dhcp_server-status", "stopped") +``` + +--- + +## 5. 
Service WAN Module + +### State Structure + +```c +struct serv_wan { + int sefd; // sysevent FD + int setok; // sysevent token + char ifname[IFNAMSIZ]; // WAN interface (e.g., erouter0) + enum wan_rt_mod rtmod; // IPv4Only/IPv6Only/DualStack/Unknown + enum wan_prot prot; // DHCP/Static + int timo; // DHCP timeout +}; +``` + +### WAN Connection State Machine + +``` + wan_start() + │ + ▼ +┌────────────────────────┐ +│ INITIALIZING │ +│ - Read wan_proto │ +│ - Read erouter_mode │ +│ - Set interface name │ +└────────────────────────┘ + │ + ▼ +┌────────────────────────┐ wan_iface_down() +│ INTERFACE_UP │◄────────────────────┐ +│ - ifconfig up │ │ +│ - sysctl forwarding │ │ +└────────────────────────┘ │ + │ │ + ▼ │ +┌────────────────────────┐ │ +│ DHCP_REQUESTING │ lease fail │ +│ - Start udhcpc │────────────────────►│ +│ - Wait for lease │ │ +└────────────────────────┘ │ + │ lease acquired │ + ▼ │ +┌────────────────────────┐ │ +│ CONNECTED │ │ +│ - Set routes │ │ +│ - Set DNS │ link down │ +│ - Fire wan-started │─────────────────────┘ +└────────────────────────┘ +``` + +--- + +## 6. 
Trigger Module (Port Triggering) + +### NFQ Processing Loop + +```c +main_loop(): + while(1): + select(nfq_fd, timeout=trigger_lifetime) + + if (fd readable): + nfq_handle_packet() // → trigger_callback() + + if (timeout): + update_quanta() // Decrement all active triggers + expire_triggers() // Remove expired, cleanup rules +``` + +### Trigger Activation Flow + +``` +Packet on NFQ 22 → trigger_callback(mark, src_addr) + │ + ├── Extract trigger index from mark + │ + ├── update_trigger_entry(mark, saddr): + │ ├── Read syscfg "PortRangeTrigger_" + │ ├── Parse: enabled,protocol,trigger_range,forward_range,lifetime + │ └── Populate trigger_info[mark] + │ + ├── start_forwarding(id): + │ ├── Build DNAT rule string + │ ├── sysevent_set_unique("NatFirewallRule", rule) + │ ├── Build FORWARD ACCEPT rule + │ ├── sysevent_set_unique("GeneralPurposeFirewallRule", rule) + │ └── sysevent_set("firewall-restart") + │ + └── Set trigger_info[id].active = 1 + +Expiry: quanta reaches 0 → stop_forwarding(id) + ├── sysevent_del_unique(rule_handles) + └── sysevent_set("firewall-restart") +``` + +--- + +## 7. 
UTCTX Transaction Layer + +### Transaction Commit Sequence + +```c +Utopia_Free(ctx): + │ + ├── For each node in transaction list: + │ ├── Determine target: syscfg or sysevent + │ ├── syscfg_set(namespace, key, value) + │ └── Accumulate event flags (bitmask) + │ + ├── syscfg_commit() // Atomic persist + │ + ├── s_UtopiaEvent_Trigger(accumulated_flags): + │ │ For each bit set in event_flags: + │ │ sysevent_set(g_Utopia_Events[bit].event_key, value) + │ │ + │ │ If any event has wait_key: + │ │ s_UtopiaEvent_Wait(wait_key, wait_value, timeout) + │ │ // Blocking wait for service completion + │ │ + │ └── Return + │ + └── sysevent_close(ctx->fd, ctx->token) +``` + +### Value Type Resolution + +```c +Utopia_Get(ctx, UtopiaValue_WanProto, buf, sz): + │ + ├── Lookup g_Utopia[UtopiaValue_WanProto]: + │ type = Utopia_Type_Config + │ key = "wan_proto" + │ ns = NULL (global namespace) + │ + ├── UtopiaTransact_Get(): + │ ├── Check transaction buffer (pending writes) + │ └── If not found: syscfg_get(ns, key, buf, sz) + │ + └── Return value +``` + +--- + +## 8. Service Registration (srvmgr) + +### Registration Protocol + +```c +sm_register(service_name, se_fd, se_token): + │ + ├── Register default events: + │ sysevent_setcallback(fd, token, + │ "<service>-start", + │ SE_FLAG_NORMAL, + │ "/path/to/handler start", + │ TUPLE_FLAG_EVENT) + │ + │ sysevent_setcallback(... "<service>-stop" ...) + │ sysevent_setcallback(... "<service>-restart" ...) + │ + ├── Register custom events: + │ For each custom_event in definitions: + │ parse event spec: "event|handler|flags|tuple_flags|params" + │ sysevent_setcallback(...) 
+ │ + └── Store async_ids via sysevent_set: + "xsm__async_id_" = " 0x 0x" +``` + +### Service State Protocol + +Every service follows this state convention: + +``` +sysevent "<service>-status" values: + "stopped" → Service not running + "starting" → Service initialization in progress + "started" → Service running and healthy + "stopping" → Service shutdown in progress + "error" → Service in error state +``` + +Services check state before start/stop to prevent races: +```c +wait_till_end_state(service): + for i in 0..9: + status = sysevent_get("<service>-status") + if status == "starting" || status == "stopping": + sleep(1) + continue + else: + return // Safe to proceed +``` diff --git a/docs/04-functional-workflows.md b/docs/04-functional-workflows.md new file mode 100755 index 00000000..fb51105a --- /dev/null +++ b/docs/04-functional-workflows.md @@ -0,0 +1,343 @@ +# Functional Workflows + +## 1. System Initialization Flow + +### Phase 1: Bootstrap (utopia_init.sh) + +``` +System Boot + │ + ├── 1. Kernel parameter tuning + │ └── Set nf_conntrack timeouts, TCP buffers, network params + │ + ├── 2. Syscfg database initialization + │ ├── Check /nvram/syscfg.db (primary) + │ ├── Check /nvram/syscfg.db.prev (backup) + │ ├── If both corrupt: load from /etc/utopia/system_defaults + │ └── syscfg_create -f → creates shared memory segment + │ + ├── 3. Factory reset detection + │ ├── Check hardware reset button (GPIO) + │ ├── If factory reset: wipe nvram, PSM, DHCP leases + │ └── Reload system_defaults + │ + ├── 4. Sysevent daemon start + │ └── /usr/bin/syseventd --threads 10 + │ ├── Create PID file /var/run/syseventd.pid + │ ├── Initialize TCP + UDS listeners + │ ├── Spawn worker threads + │ └── Ready for client connections + │ + ├── 5. Apply system defaults + │ └── apply_system_defaults + │ ├── Compare current syscfg against /etc/utopia/system_defaults + │ ├── Set missing keys to defaults + │ └── Handle partner-specific overrides (JSON) + │ + ├── 6. 
Service registration + │ └── For each service in /etc/utopia/service.d/: + │ sm_register(service_name, events[]) + │ → Callback handlers installed in sysevent + │ + └── 7. Post "system-ready" event + └── sysevent set system-status "started" +``` + +### Phase 2: Service Activation + +``` +system-status = "started" + │ + ├── LAN services + │ ├── service_multinet: create bridges, assign VLANs + │ ├── service_dhcp: start dnsmasq (DHCP server + DNS proxy) + │ └── firewall: generate and apply iptables rules + │ + ├── WAN services + │ ├── service_wan: configure WAN interface + │ ├── Start DHCP client (udhcpc) on WAN + │ └── Wait for IP address acquisition + │ + └── Post-WAN services (triggered by wan-status = "started") + ├── service_routed: start routing daemons + ├── service_ipv6: configure IPv6, start dibbler + ├── service_ddns: register with DDNS provider + └── firewall-restart: regenerate rules with WAN IP +``` + +## 2. Configuration Change Flow + +### User Changes a Parameter (e.g., DHCP Pool Range) + +``` +Step 1: External request arrives + │ (TR-069 SetParameterValues or WebUI form submit) + │ + ▼ +Step 2: CcspPandM calls UTAPI + │ Utopia_SetDHCPServerPool(ctx, pool_id, start_ip, end_ip) + │ + ▼ +Step 3: UTAPI buffers changes + │ UTOPIA_SET(ctx, UtopiaValue_DHCP_Start, "192.168.1.100") + │ UTOPIA_SET(ctx, UtopiaValue_DHCP_End, "192.168.1.200") + │ → Added to UtopiaTransact_Node linked list + │ + ▼ +Step 4: Transaction commit (Utopia_Free) + │ syscfg_set(NULL, "dhcp_start", "192.168.1.100") + │ syscfg_set(NULL, "dhcp_end", "192.168.1.200") + │ syscfg_commit() → Write to /nvram/syscfg.db + │ + ▼ +Step 5: Event trigger + │ sysevent_set("dhcp_server-restart", "") + │ + ▼ +Step 6: Sysevent dispatch + │ DataMgr detects value change → TriggerMgr matches callback + │ Fork Helper executes: /usr/bin/service_dhcp dhcp_server-restart + │ + ▼ +Step 7: Service handler executes + │ service_dhcp: + │ ├── Stop dnsmasq (SIGTERM) + │ ├── Regenerate /etc/dnsmasq.conf from syscfg + │ 
├── Start dnsmasq + │ └── sysevent_set("dhcp_server-status", "started") + │ + ▼ +Step 8: Completion propagation + CcspPandM receives success response +``` + +## 3. WAN Connection Establishment + +``` +wan-start event received + │ + ├── Read configuration + │ ├── wan_proto (dhcp/static/pppoe) + │ ├── erouter_mode (ipv4/ipv6/dual/bridge) + │ └── wan_ifname (erouter0) + │ + ├── Interface bringup + │ ├── ifconfig erouter0 up + │ ├── sysctl net.ipv4.ip_forward = 1 + │ └── sysctl accept_ra = 2 (for IPv6) + │ + ├── Address acquisition (DHCP mode) + │ ├── Start udhcpc on erouter0 with options: + │ │ -i erouter0 -p /tmp/udhcpc.erouter0.pid + │ │ -s /usr/bin/service_udhcpc + │ │ -O 100 (DS-Lite AFTR) + │ │ + │ └── udhcpc sends DHCPDISCOVER → DHCPOFFER → DHCPREQUEST → DHCPACK + │ + ├── DHCP callback (service_udhcpc handle_wan) + │ ├── Parse environment: ip, subnet, router, dns, lease, opt100 + │ ├── Configure interface: ip addr add $ip/$mask dev erouter0 + │ ├── Set default route: ip route add default via $router + │ ├── Update resolv.conf with DNS servers + │ ├── sysevent_set("current_wan_ipaddr", ip) + │ ├── sysevent_set("wan_service-status", "started") + │ └── sysevent_set("wan-status", "started") + │ + ├── Post-connection triggers (wan-status = "started") + │ ├── Firewall regeneration with real WAN IP + │ ├── Routing daemon restart + │ ├── DDNS update + │ └── IPv6 DHCPv6 client start (if dual-stack) + │ + └── DS-Lite handling (if option 64/100 received) + ├── Wait for AFTR address (max 60s) + ├── service_dslite: create ip6tnl tunnel + └── Update routing for IPv4-over-IPv6 +``` + +## 4. 
Firewall Regeneration Flow + +``` +firewall-restart event + │ + ├── Acquire process-shared mutex (prevent concurrent rebuilds) + │ + ├── Read configuration (~100+ syscfg keys) + │ ├── WAN: ip, interface, protocol, bridge mode + │ ├── LAN: ip, subnet, bridge interfaces + │ ├── Features: DMZ, port_forward, port_trigger + │ ├── Security: firewall_level, ping_block, ident_block + │ ├── QoS: enabled, defined policies + │ └── Parental: managed_sites, managed_services + │ + ├── Generate IPv4 rules → /tmp/.ipt + │ ├── *raw table (notrack for local, connection tracking bypass) + │ ├── *mangle table (QoS DSCP marks, TTL) + │ ├── *nat table + │ │ ├── Port forwarding (DNAT) + │ │ ├── DMZ (DNAT catch-all) + │ │ ├── Port triggering rules + │ │ ├── MASQUERADE (outbound NAT) + │ │ └── DNS/HTTP intercept (captive portal) + │ └── *filter table + │ ├── INPUT: wan2self, lan2self chains + │ ├── FORWARD: lan2wan, wan2lan chains + │ ├── OUTPUT: self2wan chain + │ ├── Rate limiting (SYN flood, ICMP) + │ └── Logging rules (for dropped packets) + │ + ├── Generate IPv6 rules → /tmp/.ipt_v6 + │ + ├── Apply atomically + │ ├── iptables-restore < /tmp/.ipt + │ └── ip6tables-restore < /tmp/.ipt_v6 + │ + ├── conntrack -F (flush stale connections) + │ + ├── Release mutex + │ + └── sysevent_set("firewall-status", "started") +``` + +## 5. 
DHCP Lease Renewal Flow + +``` +udhcpc receives DHCPACK (renewal) + │ + ├── udhcpc invokes script: service_udhcpc renew + │ + ├── handle_wan() compares new lease with current: + │ ├── IP unchanged: update lease time only + │ ├── IP changed: + │ │ ├── Remove old IP from interface + │ │ ├── Add new IP + │ │ ├── Update routes + │ │ ├── sysevent_set("current_wan_ipaddr", new_ip) + │ │ └── Trigger firewall-restart (rules reference WAN IP) + │ │ + │ ├── DNS changed: + │ │ ├── Update /etc/resolv.conf + │ │ └── sysevent_set("wan_dns", new_dns) + │ │ + │ └── Router/gateway changed: + │ ├── Update default route + │ └── sysevent_set("default_router", new_gw) + │ + └── Continue normal operation +``` + +## 6. IPv6 Prefix Delegation Flow + +``` +DHCPv6 client receives IA_PD (prefix delegation) + │ + ├── Client stores prefix info in sysevent: + │ sysevent_set("tr_erouter0_dhcpv6_client_v6pref", "2001:db8:1::/48") + │ sysevent_set("tr_erouter0_dhcpv6_client_v6pref_vtime", "3600") + │ + ├── service_ipv6 receives event → serv_ipv6_start() + │ + ├── Topology mode determines sub-prefix allocation: + │ ├── FAVOR_DEPTH: longer prefixes (/64) per LAN interface + │ └── FAVOR_WIDTH: more /64 networks from available space + │ + ├── For each LAN bridge (brlan0, brlan1, ...): + │ ├── Calculate sub-prefix from delegated prefix + │ ├── Assign ::1 address on bridge interface + │ ├── sysevent_set("ipv6__prefix", sub_prefix) + │ └── Fire lan_addr6_set event + │ + ├── Configure DHCPv6 server (dibbler-server): + │ ├── Generate /etc/dibbler/server.conf + │ ├── Set IA_NA/IA_PD pools from delegated prefix + │ └── Restart dibbler-server + │ + └── Configure Router Advertisement: + ├── Generate radvd.conf with prefix info + └── Restart radvd → LAN clients get IPv6 via SLAAC +``` + +## 7. 
Service Recovery Flow
+
+```
+pmon detects process death
+  │
+  ├── Read config: "<process_name> <pid_file> <restart_command>"
+  │
+  ├── Verify death:
+  │   ├── Read PID from pidfile
+  │   ├── Check /proc/<pid>/cmdline
+  │   └── If process alive: skip (false alarm)
+  │
+  ├── If confirmed dead:
+  │   ├── Log to /rdklogs/logs/SelfHeal.txt.0
+  │   ├── Send telemetry event
+  │   └── Execute restart command via fork()+execl()
+  │
+  └── Service restarts → registers with sysevent → resumes operation
+
+Sysevent sanity thread (blocked process recovery):
+  │
+  ├── Check every 5s: any fork_helper child blocked > 300s?
+  │
+  ├── If blocked:
+  │   ├── kill(blocked_pid, SIGKILL)
+  │   ├── Remove from blocked list
+  │   └── Log "killed blocked process <pid>"
+  │
+  └── Continues monitoring
+```
+
+## 8. Multi-Network (Bridge/VLAN) Setup
+
+```
+multinet-up event (instance N)
+  │
+  ├── Read instance config from syscfg/PSM:
+  │   ├── Bridge name (e.g., brlan0)
+  │   ├── Member interfaces
+  │   ├── VLAN IDs
+  │   └── IP configuration
+  │
+  ├── Create bridge:
+  │   ├── brctl addbr brlan<X>
+  │   └── Configure bridge parameters (STP, aging)
+  │
+  ├── Add member interfaces:
+  │   ├── For each member configured:
+  │   │   ├── Create VLAN interface if needed (vconfig add)
+  │   │   └── brctl addif brlan<X> <member>
+  │   └── Handle platform-specific setup (Puma6/7, BCM)
+  │
+  ├── Assign IP:
+  │   └── ifconfig brlan<X> <ip> netmask <mask> up
+  │
+  ├── Fire status event:
+  │   └── sysevent_set("multinet_<N>-status", "ready")
+  │
+  └── Dependent services activate:
+      ├── DHCP server starts for pool on brlan<X>
+      └── Firewall adds rules for new interface
+```
+
+## 9. 
Device Mode Switching (Router ↔ Extender) + +``` +DeviceMode update event (0=Router, 1=Extender) + │ + ├── Stop services of OLD mode: + │ ├── sysevent_set("lan-stop") + │ ├── ipv4-down for all LAN instances + │ ├── Kill zebra, CcspLMLite + │ └── Stop NAT/routing services + │ + ├── Start services of NEW mode: + │ ├── sysevent_set("lan-start") + │ ├── ipv4-up for appropriate instances + │ ├── lnf-setup (Lost-and-Found network) + │ ├── dhcp_server-restart + │ └── firewall-restart + │ + └── Mode fully switched + └── Normal operation in new mode +``` diff --git a/docs/05-external-dependencies.md b/docs/05-external-dependencies.md new file mode 100755 index 00000000..78169af0 --- /dev/null +++ b/docs/05-external-dependencies.md @@ -0,0 +1,167 @@ +# External Dependencies & Integration + +## Dependency Matrix + +| Dependency | Category | Used By | Communication | Failure Impact | Recovery | +|---|---|---|---|---|---| +| dnsmasq | Daemon | service_dhcp | Process management (PID file) | No DHCP/DNS on LAN | pmon auto-restart | +| udhcpc | Daemon | service_wan | Script callback | No WAN IP | WAN service retry | +| dibbler-server | Daemon | service_ipv6 | Process management | No DHCPv6 | Service restart | +| zebra/ripd | Daemon | service_routed | Config file + signals | No dynamic routing | Service restart | +| radvd | Daemon | service_routed | Config file + HUP | No IPv6 RA | Service restart | +| iptables | Kernel utility | firewall | iptables-restore pipe | No packet filtering | Firewall restart | +| iproute2 (ip) | Utility | Multiple | Command execution | No routing config | Manual intervention | +| curl | Utility | service_ddns | HTTP client | DDNS update fails | Cron retry with backoff | +| conntrack tools | Utility | firewall, walled_garden | Command execution | Stale connections | Non-critical | +| cron (crond) | Daemon | Multiple | Cron job registration | No scheduled tasks | crond-restart event | +| D-Bus | IPC framework | firewall, service_dhcp | libdbus API | No 
CCSP access | Component restart | +| RBus | IPC framework | UTAPI | Message API | Degraded management | Fallback to direct | + +## Detailed Integration Points + +### 1. dnsmasq (DHCP Server + DNS Proxy) + +**Interaction Pattern:** +``` +service_dhcp → generates /etc/dnsmasq.conf → starts dnsmasq process +service_dhcp → monitors via PID file (/var/run/dnsmasq.pid) +dnsmasq → serves DHCP to LAN clients +dnsmasq → provides DNS proxy/cache +``` + +**Configuration Generation:** +- Pool ranges from syscfg (`dhcp_start`, `dhcp_end`) +- Static leases from syscfg (`dhcp_static_host_N`) +- DNS upstream from sysevent (`wan_dns`) +- Interface binding from multinet config + +**Failure Scenarios:** +| Symptom | Root Cause | Detection | +|---|---|---| +| LAN clients get no IP | dnsmasq crashed | pmon check, missing PID | +| DNS timeout on LAN | dnsmasq not responding | Health check failure | +| Wrong DHCP pool | Config regeneration failed | DHCP lease logs | + +### 2. udhcpc (WAN DHCP Client) + +**Interaction Pattern:** +``` +service_wan → starts udhcpc with -s /usr/bin/service_udhcpc +udhcpc → DHCP discovery on WAN interface +udhcpc → calls service_udhcpc with bound/renew/deconfig +service_udhcpc → updates sysevent (wan_ipaddr, wan_dns, wan_status) +``` + +**Critical Events:** +- `bound`: First IP acquired → full WAN initialization +- `renew`: Lease renewed → check for IP change +- `deconfig`: Lease lost → WAN down, clear all state +- `leasefail`: DHCP failed → retry or WAN Manager notification + +**Failure Impact:** +- udhcpc crash → No WAN IP renewal → eventual lease expiry → connectivity loss +- Slow DHCP server → WAN comes up late → dependent services delayed + +### 3. 
Linux Netfilter (iptables/ip6tables) + +**Interaction Pattern:** +``` +firewall module → generates rule file → iptables-restore (atomic apply) +trigger module → individual rule insertion via sysevent pools +walled_garden scripts → individual iptables commands +``` + +**Critical Notes:** +- Atomic restore prevents partial rule state +- Rule count can reach thousands (enterprise deployments) +- conntrack flush after rule change prevents stale connections +- NFQ (netfilter queue) used by trigger module for packet inspection + +**Failure Scenarios:** +| Symptom | Root Cause | Detection | +|---|---|---| +| All traffic blocked | iptables-restore syntax error | Firewall status != "started" | +| Partial filtering | Race during apply | Mutex contention log | +| NFQ not working | Module not loaded | trigger daemon log | + +### 4. CCSP/D-Bus Integration + +**Used By:** firewall (PSM queries), service_dhcp (bus init), service_routed (PSM queries) + +**Pattern:** +```c +// Initialize +CCSP_Message_Bus_Init(component_id, config_file, &bus_handle) + +// Query PSM +PSM_Get_Record_Value2(bus_handle, CCSP_SUBSYS, key, &type, &value) + +// Query TR-181 Data Model +CcspBaseIf_getParameterValues(bus_handle, component, dbus_path, + param_names, param_count, &size, &val) +``` + +**Failure Impact:** +- D-Bus unavailable → Cannot read PSM values → Use defaults/cached +- Component not registered → Query times out → Service degradation + +### 5. HAL Layer Integration + +**Network HAL:** +``` +wifi_hal_init() → WiFi hardware initialization +wifi_hal_getSSIDName() → SSID configuration +ethernet_hal_getEthWanLinkStatus() → Physical link monitoring +docsis_hal_GetDhcpInfo() → Cable modem DHCP info +platform_hal_GetDeviceConfigStatus() → System health +``` + +**Failure Impact:** +- HAL init failure → Hardware not controllable → Feature disabled +- HAL stale data → Incorrect state in sysevent → Cascading misconfig + +### 6. 
Persistent Storage + +**Primary:** `/nvram/syscfg.db` — main configuration database +**Backup:** `/nvram/syscfg.db.prev` or `/opt/secure/data/syscfg.db` + +**Corruption Recovery:** +``` +syscfg_create(): + Try primary file → if corrupt: + Try backup file → if corrupt: + Load /etc/utopia/system_defaults (factory reset) +``` + +**Write Pattern:** +- syscfg_commit() serializes entire hash table to temp file +- Rename (atomic) onto target path +- Backup file updated on successful commit + +## Dependency Startup Order + +``` +1. Linux kernel + filesystem mounted +2. syscfg_create (shared memory + file load) +3. syseventd (event bus ready) +4. apply_system_defaults (config baseline) +5. Service registration (callbacks installed) +6. LAN services (multinet → DHCP → firewall) +7. WAN services (interface up → DHCP client) +8. Post-WAN services (routing, IPv6, DDNS) +9. CCSP components (depend on syscfg + sysevent) +``` + +## Dependency Failure Matrix + +| Failed Dependency | Immediate Impact | Cascading Impact | Auto-Recovery | Manual Fix | +|---|---|---|---|---| +| syscfg shm | All config reads fail | All services fail | Reboot | Check shm limits | +| syseventd crash | No IPC delivery | All services orphaned | Reboot required | Fix OOM/bug | +| dnsmasq crash | No new DHCP leases | Clients lose connectivity | pmon restart | Check config | +| udhcpc crash | No WAN lease renewal | WAN IP lost eventually | WAN service retry | Restart service_wan | +| iptables module | Firewall rules fail | No packet filtering | Module reload | Kernel config | +| D-Bus daemon | No PSM/CCSP access | Management plane down | systemd restart | Check bus config | +| /nvram full | syscfg_commit fails | Config changes lost | None (auto) | Clear nvram space | +| dibbler crash | No DHCPv6 service | IPv6 degraded | Service restart | Check config | +| zebra/ripd | No dynamic routing | Static routes only | pmon/service restart | Check routing config | diff --git a/docs/06-troubleshooting-guide.md 
b/docs/06-troubleshooting-guide.md
new file mode 100755
index 00000000..b3d5c935
--- /dev/null
+++ b/docs/06-troubleshooting-guide.md
@@ -0,0 +1,413 @@
+# Troubleshooting Guide
+
+## 1. Sysevent Bus Failures
+
+### 1.1 Sysevent Daemon Not Starting
+
+**Symptom:** All services fail to register. Events not delivered. System stuck in early boot.
+
+**Logs:**
+```
+/var/log/messages: UTOPIA: syseventd: could not create PID file
+/var/log/messages: UTOPIA: syseventd: another instance already running
+```
+
+**Root Cause:** PID file stale from previous crash, or filesystem full.
+
+**Debug Steps:**
+1. `cat /var/run/syseventd.pid` — check if PID is valid
+2. `ls -la /proc/<pid>/` — verify process exists
+3. `df /var/run/` — check filesystem space
+4. `cat /proc/sys/kernel/threads-max` — check thread limits
+
+**Resolution:**
+```bash
+rm -f /var/run/syseventd.pid
+/usr/bin/syseventd --threads 10
+```
+
+### 1.2 Events Not Being Delivered
+
+**Symptom:** Service callbacks not firing. Configuration changes have no effect.
+
+**Logs:**
+```
+/rdklogs/logs/sysevent_tracer.txt: [timestamp] EVENT SET: <name> = <value>
+# Missing corresponding: ACTION EXEC: <handler>
+```
+
+**Root Cause:** Trigger registration lost (client disconnected), or worker threads blocked.
+
+**Debug Steps:**
+1. `sysevent get <event>` — verify event was set
+2. Check `/tmp/syseventd_worker_*` FIFOs exist
+3. `ls /proc/<syseventd_pid>/task/` — count active threads
+4. `cat /rdklogs/logs/sysevent_tracer.txt | grep BLOCKED` — check for blocked actions
+
+**Resolution:**
+- Restart affected service (re-registers callbacks)
+- If all workers blocked: `kill -TERM <syseventd_pid>` and restart syseventd
+
+### 1.3 Client Connection Exhaustion
+
+**Symptom:** New services cannot connect to sysevent. Error: "connection refused."
+
+**Root Cause:** Client table full or FD limit reached.
+
+**Debug Steps:**
+1. `ls /proc/<syseventd_pid>/fd | wc -l` — count open FDs
+2. `ulimit -n` — check FD limit
+3. 
Check for leaked connections (orphaned service processes) + +**Resolution:** +- Kill orphaned service processes +- Increase FD limit in syseventd startup +- Restart syseventd to reset client table + +--- + +## 2. Syscfg Configuration Issues + +### 2.1 Configuration Not Persisting Across Reboot + +**Symptom:** Settings revert to defaults after reboot. Changes made via CLI or API are lost. + +**Logs:** +``` +UTOPIA: syscfg: commit failed, errno=28 (ENOSPC) +UTOPIA: syscfg: WARNING - loading from backup file +``` + +**Root Cause:** `/nvram` partition full, preventing commit. Or file corruption. + +**Debug Steps:** +1. `df /nvram/` — check partition space +2. `ls -la /nvram/syscfg.db*` — check file sizes and timestamps +3. `syscfg show | wc -l` — count total entries (expect ~200-500) +4. Check for stuck commit lock: look for processes holding semaphore + +**Resolution:** +```bash +# Clear nvram space +rm -f /nvram/*.log /nvram/core.* +# Force commit +syscfg commit +# Verify +syscfg get lan_ipaddr +``` + +### 2.2 Shared Memory Corruption + +**Symptom:** Services crash with SIGSEGV when reading syscfg. Random garbage values returned. + +**Logs:** +``` +kernel: service_dhcp[1234]: segfault at ip sp +``` + +**Root Cause:** Process crashed while holding write lock, corrupting hash table linkage. + +**Debug Steps:** +1. `ipcs -m` — list shared memory segments, check for orphaned +2. `syscfg show 2>&1 | grep -i error` — look for read errors +3. `syscfg get ` — test basic retrieval +4. `_syscfg_find_corrupted_keys` function detects bad entries + +**Resolution:** +```bash +# Nuclear option: recreate from file +syscfg destroy +syscfg_create -f /nvram/syscfg.db +# Or factory reset if file also corrupt +``` + +### 2.3 EOWNERDEAD Lock Recovery + +**Symptom:** First syscfg operation after process crash returns error, then succeeds on retry. + +**Logs:** +``` +UTOPIA: syscfg: mutex EOWNERDEAD, recovering +``` + +**Root Cause:** Previous process died while holding mutex. 
Robust mutex protocol recovers automatically. + +**Debug Steps:** Usually self-healing. Monitor for: +1. Frequent EOWNERDEAD messages (indicates chronic crasher) +2. Identify crashing process via `/var/log/messages` coredump entries + +**Resolution:** Fix the crashing process. The lock recovery is automatic. + +--- + +## 3. Firewall Issues + +### 3.1 All Traffic Blocked After Firewall Restart + +**Symptom:** No internet access. LAN clients cannot reach WAN. Ping from router fails. + +**Logs:** +``` +/rdklogs/logs/FirewallDebug.txt: iptables-restore: line N failed +UTOPIA: firewall: iptables-restore returned non-zero +``` + +**Root Cause:** Syntax error in generated rule file. Typically due to empty/null WAN IP or invalid port range from misconfigured port forwarding. + +**Debug Steps:** +1. `cat /tmp/.ipt | head -50` — examine generated rules +2. `iptables-restore --test < /tmp/.ipt` — validate without applying +3. `sysevent get current_wan_ipaddr` — check if WAN IP is set +4. `iptables -L -n --line-numbers` — see current active rules +5. `grep -n "^-" /tmp/.ipt | grep "0.0.0.0"` — find rules with empty IPs + +**Resolution:** +```bash +# Emergency: allow all traffic +iptables -P INPUT ACCEPT +iptables -P FORWARD ACCEPT +iptables -P OUTPUT ACCEPT +iptables -F +# Then fix the bad syscfg entry and restart +sysevent set firewall-restart +``` + +### 3.2 Firewall Mutex Deadlock + +**Symptom:** Firewall restart hangs indefinitely. `firewall-status` stays at "starting." + +**Logs:** +``` +/rdklogs/logs/FirewallDebug.txt: acquiring mutex... +# No "mutex acquired" follow-up +``` + +**Root Cause:** Previous firewall process crashed while holding shared mutex. EOWNERDEAD not triggered (rare kernel bug). + +**Debug Steps:** +1. `ls -la /tmp/firewall_mutex` — check mutex file exists +2. `fuser /tmp/firewall_mutex` — check who holds it +3. 
`ps aux | grep firewall` — look for zombie firewall processes + +**Resolution:** +```bash +# Remove stale mutex, kill zombies +rm -f /tmp/firewall_mutex +killall -9 firewall +sysevent set firewall-restart +``` + +### 3.3 Port Forwarding Not Working + +**Symptom:** External clients cannot reach forwarded port. Service accessible from LAN. + +**Debug Steps:** +1. `iptables -t nat -L prerouting_fromwan -n` — check DNAT rules exist +2. `iptables -L FORWARD -n | grep ` — check forward rule +3. `syscfg get SinglePortForwardCount` — verify config present +4. `sysevent get current_wan_ipaddr` — WAN IP must be non-empty +5. `conntrack -L | grep ` — check for stale conntrack entries + +**Resolution:** +```bash +conntrack -F +sysevent set firewall-restart +``` + +--- + +## 4. DHCP Service Issues + +### 4.1 LAN Clients Not Getting IP Addresses + +**Symptom:** Devices connect to WiFi/Ethernet but get 169.254.x.x (link-local) address. + +**Logs:** +``` +/rdklogs/logs/Consolelog.txt.0: [service_dhcp] dhcp_server_start: dnsmasq failed to start +/var/log/messages: dnsmasq: failed to bind DHCP server socket: Address already in use +``` + +**Root Cause:** dnsmasq not running, port conflict, or wrong interface binding. + +**Debug Steps:** +1. `ps | grep dnsmasq` — check if running +2. `cat /var/run/dnsmasq.pid` — verify PID file +3. `netstat -ulnp | grep :67` — check port 67 binding +4. `cat /etc/dnsmasq.conf | grep interface` — verify correct bridge +5. `sysevent get dhcp_server-status` — check service state +6. `brctl show` — verify bridge exists and has members + +**Resolution:** +```bash +killall dnsmasq +sysevent set dhcp_server-restart +``` + +### 4.2 DHCP Server Stuck in "starting" State + +**Symptom:** `sysevent get dhcp_server-status` returns "starting" indefinitely. + +**Root Cause:** Previous start/stop operation interrupted (process killed during transition). + +**Debug Steps:** +1. `sysevent get dhcp_server-status` — confirm stuck state +2. 
`ps | grep service_dhcp` — check for running handler +3. `ps | grep dnsmasq` — check if dnsmasq actually running + +**Resolution:** +```bash +sysevent set dhcp_server-status stopped +sysevent set dhcp_server-restart +``` + +--- + +## 5. WAN Connectivity Issues + +### 5.1 WAN Interface Not Getting IP + +**Symptom:** No internet. `sysevent get current_wan_ipaddr` returns empty. + +**Logs:** +``` +/rdklogs/logs/Consolelog.txt.0: [service_wan] udhcpc start failed +/var/log/messages: udhcpc: sending discover... (repeated) +``` + +**Root Cause:** WAN physical link down, DHCP server unresponsive, or interface misconfigured. + +**Debug Steps:** +1. `cat /sys/class/net/erouter0/carrier` — check physical link (1=up) +2. `ifconfig erouter0` — verify interface is UP +3. `ps | grep udhcpc` — check DHCP client running +4. `cat /var/run/udhcpc.erouter0.pid` — verify PID file +5. `sysevent get wan-status` — check WAN state +6. `syscfg get wan_proto` — verify expected protocol + +**Resolution:** +```bash +# Restart WAN service +sysevent set wan-restart +# Or manually restart DHCP client +kill $(cat /var/run/udhcpc.erouter0.pid) +sysevent set wan-start +``` + +### 5.2 WAN IP Acquired But No Internet + +**Symptom:** WAN IP present but cannot reach external servers. Ping to 8.8.8.8 fails. + +**Debug Steps:** +1. `ip route show` — verify default route exists +2. `ip route get 8.8.8.8` — check routing decision +3. `iptables -L FORWARD -n -v` — check for blocked forwarding +4. `cat /etc/resolv.conf` — verify DNS configured +5. `sysevent get default_router` — check gateway set +6. `arping -I erouter0 ` — verify gateway reachable at L2 + +**Resolution:** +```bash +# Add default route manually +ip route add default via dev erouter0 +# Restart firewall (may be blocking) +sysevent set firewall-restart +``` + +--- + +## 6. IPv6 Issues + +### 6.1 No IPv6 Prefix Delegation + +**Symptom:** LAN clients get no IPv6 global address. Only link-local (fe80::). 
+ +**Logs:** +``` +/rdklogs/logs/Consolelog.txt.0: [service_ipv6] no valid prefix from DHCPv6 client +``` + +**Debug Steps:** +1. `sysevent get tr_erouter0_dhcpv6_client_v6pref` — check prefix received +2. `ps | grep dibbler` — check DHCPv6 client running +3. `cat /tmp/.dibbler-info/client_received_options` — raw DHCPv6 data +4. `ip -6 addr show brlan0` — check LAN bridge IPv6 addresses +5. `ps | grep radvd` — check router advertisement daemon + +**Resolution:** +```bash +sysevent set service_ipv6-restart +# Or restart DHCPv6 client +killall dibbler-client +sysevent set wan-restart # Triggers DHCPv6 re-negotiation +``` + +--- + +## 7. Multi-Network / Bridge Issues + +### 7.1 Bridge Not Created + +**Symptom:** `brctl show` doesn't list expected bridge. Associated services fail. + +**Logs:** +``` +/rdklogs/logs/MnetDebug.txt: multinet_bridgeUpInst: failed to create bridge brlan1 +``` + +**Debug Steps:** +1. `brctl show` — list existing bridges +2. `sysevent get multinet_1-status` — check instance status +3. Check interface availability: `ip link show` +4. Verify PSM config: multinet instance definitions + +**Resolution:** +```bash +sysevent set multinet_1-up +# Or manually +brctl addbr brlan1 +ifconfig brlan1 up +``` + +--- + +## 8. Process Monitor (pmon) Issues + +### 8.1 Service Not Being Auto-Restarted + +**Symptom:** Process crashed but pmon doesn't restart it. + +**Debug Steps:** +1. Check pmon config file for the process entry +2. Verify PID file path matches pmon config +3. `cat ` — confirm PID file is stale +4. Check pmon cron job: `crontab -l | grep pmon` +5. 
Verify executable path in restart command exists
+
+**Resolution:** Add/fix entry in pmon configuration file:
+```
+<process_name> <pid_file> <restart_command>
+```
+
+---
+
+## Diagnostic Command Reference
+
+| Command | Purpose |
+|---|---|
+| `sysevent get <name>` | Read runtime event/state value |
+| `sysevent set <name> <value>` | Set event (trigger handlers) |
+| `syscfg get <key>` | Read persistent configuration |
+| `syscfg set <key> <value>; syscfg commit` | Write persistent config |
+| `syscfg show` | Dump all configuration |
+| `iptables -L -n -v` | List active firewall rules with counters |
+| `ip route show` | Display routing table |
+| `ip -6 route show` | Display IPv6 routing table |
+| `brctl show` | List bridges and members |
+| `cat /proc/net/nf_conntrack` | Active connection tracking entries |
+| `ps \| grep -E "syseventd\|dnsmasq\|udhcpc\|dibbler\|zebra"` | Check service processes |
+| `cat /rdklogs/logs/Consolelog.txt.0` | Main console log |
+| `cat /rdklogs/logs/FirewallDebug.txt` | Firewall-specific debug |
+| `cat /rdklogs/logs/MnetDebug.txt` | MultiNet debug log |
+| `cat /rdklogs/logs/sysevent_tracer.txt` | Event trace log |
+| `ipcs -m` | Shared memory segments (syscfg) |
diff --git a/docs/07-developer-guide.md b/docs/07-developer-guide.md
new file mode 100755
index 00000000..4e27c8e8
--- /dev/null
+++ b/docs/07-developer-guide.md
@@ -0,0 +1,333 @@
+# Developer Guide
+
+## Build & Development Setup
+
+### Prerequisites
+- GCC cross-compiler for target platform
+- Autotools (autoconf >= 2.65, automake, libtool)
+- Libraries: libsafec, libdbus-1, libnetfilter_queue, libcjson
+- RDK-B SDK with HAL headers
+
+### Build Commands
+```bash
+# Generate configure script
+./autogen.sh
+
+# Configure for target platform
+./configure --with-ccsp-platform=bcm \
+    --enable-dslite_feature_support \
+    --enable-core_net_lib_feature_support \
+    --host=arm-rdk-linux-gnueabi
+
+# Build
+make -j$(nproc)
+
+# Install to staging
+make install DESTDIR=/path/to/staging
+```
+
+### Key Build Flags
+
+| Flag | Effect |
+|---|---|
+| 
`--with-ccsp-platform=` | Platform selection (intel_usg, intel_puma7, bcm, pc) | +| `--enable-dslite_feature_support` | Include DS-Lite tunnel service | +| `--enable-core_net_lib_feature_support` | Include service_dhcp + DHCPv6 client | +| `--enable-extender` | Include device mode switching | +| `--enable-hotspot` | Include HotSpot captive portal support | +| `--enable-ddns_binary_client_support` | Include DDNS service | +| `--enable-unitTestDockerSupport` | Enable unit test build | + +## Code Organization + +``` +source/ +├── syscfg/ # Configuration database (library + CLI) +│ ├── lib/ # libsyscfg.so (shared memory hash table) +│ └── cmd/ # syscfg CLI tool +├── sysevent/ # Event bus system +│ ├── server/ # syseventd daemon (main, clients, triggers, data) +│ ├── lib/ # libsysevent.so (client API) +│ ├── control/ # sysevent CLI tool +│ ├── proxy/ # Event proxy for remote access +│ └── fork_helper/ # Child process executor +├── utapi/ # High-level configuration API +│ ├── lib/ # libutapi.so +│ └── cmd/ # utapi CLI tool +├── utctx/ # Transaction context manager +│ ├── lib/ # libutctx.so +│ └── bin/ # utctx utilities +├── firewall/ # Firewall rule generator (binary) +├── service_dhcp/ # DHCP server management +├── service_wan/ # WAN interface management +├── service_routed/ # Routing daemon management +├── service_ipv6/ # IPv6 service management +├── service_multinet/# Bridge/VLAN management +├── service_udhcpc/ # DHCP client callback handler +├── service_ddns/ # Dynamic DNS service +├── service_dslite/ # DS-Lite tunnel service +├── service_deviceMode/ # Router/Extender mode switching +├── trigger/ # NFQ port triggering daemon +├── pmon/ # Process monitor +├── newhost/ # New host detection +├── macclone/ # MAC address cloning +├── ulog/ # Unified logging library +├── pal/ # Platform abstraction layer +├── services/lib/ # Service registration framework (srvmgr) +├── util/ # Shared utilities +├── scripts/init/ # Boot scripts and service handlers +└── walled_garden/ # 
Guest/parental access aging scripts
+```
+
+## Coding Patterns
+
+### Adding a New Service
+
+1. Create directory: `source/service_myfeature/`
+2. Implement main with standard pattern:
+
+```c
+#include <string.h>
+#include <sysevent/sysevent.h>  // sysevent client API (header path assumed)
+#include <syscfg/syscfg.h>      // syscfg API (header path assumed)
+#include "util.h"
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
+#endif
+
+struct serv_myfeature {
+    int sefd;      // sysevent file descriptor
+    int setok;     // sysevent token
+    // service-specific state
+};
+
+static int myfeature_start(struct serv_myfeature *sf) {
+    // Read config from syscfg
+    // Apply configuration
+    // Start external daemon if needed
+    sysevent_set(sf->sefd, sf->setok, "myfeature-status", "started", 0);
+    return 0;
+}
+
+static int myfeature_stop(struct serv_myfeature *sf) {
+    sysevent_set(sf->sefd, sf->setok, "myfeature-status", "stopping", 0);
+    // Stop daemon, cleanup
+    sysevent_set(sf->sefd, sf->setok, "myfeature-status", "stopped", 0);
+    return 0;
+}
+
+// Restart is a stop followed by a start
+static int myfeature_restart(struct serv_myfeature *sf) {
+    myfeature_stop(sf);
+    return myfeature_start(sf);
+}
+
+static struct cmd_op {
+    const char *name;
+    int (*handler)(struct serv_myfeature *);
+} cmd_ops[] = {
+    {"start", myfeature_start},
+    {"stop", myfeature_stop},
+    {"restart", myfeature_restart},
+};
+
+int main(int argc, char *argv[]) {
+    struct serv_myfeature sf;
+    int rc = -1;   // stays -1 if no command matches
+
+    if (argc < 2) return -1;   // a command argument is required
+
+    sf.sefd = sysevent_open("127.0.0.1", SE_SERVER_WELL_KNOWN_PORT,
+                            SE_VERSION, "myfeature", &sf.setok);
+    if (sf.sefd < 0) return -1;
+
+    // Dispatch command
+    for (size_t i = 0; i < ARRAY_SIZE(cmd_ops); i++) {
+        if (strcmp(argv[1], cmd_ops[i].name) == 0) {
+            rc = cmd_ops[i].handler(&sf);
+            break;
+        }
+    }
+
+    sysevent_close(sf.sefd, sf.setok);
+    return rc;
+}
+```
+
+3. Create `Makefile.am` and add to `source/Makefile.am` SUBDIRS
+4. Add `AC_CONFIG_FILES` entry in `configure.ac`
+5. 
Register with sysevent (in service script or apply_system_defaults) + +### Sysevent Event Naming Convention + +``` +-start → Request service start +-stop → Request service stop +-restart → Request service restart +-status → Service state (stopped/starting/started/stopping/error) +-up → Resource became available +-down → Resource became unavailable +current_ → Runtime parameter value (e.g., current_wan_ipaddr) +``` + +### Syscfg Key Naming Convention + +``` +_ → e.g., dhcp_start, wan_proto +__ → e.g., StaticRoute_1, PortForward_3 +:: → Namespaced keys for isolation +``` + +## Logging + +### Log Levels and Files + +| Log Target | Path | Content | Verbosity Control | +|---|---|---|---| +| System syslog | `/var/log/messages` | All ulog output | syslog config | +| Console log | `/rdklogs/logs/Consolelog.txt.0` | Service operations | Always on | +| Firewall debug | `/rdklogs/logs/FirewallDebug.txt` | Rule generation detail | Compile flag | +| MultiNet debug | `/rdklogs/logs/MnetDebug.txt` | Bridge operations | Always on | +| Sysevent trace | `/rdklogs/logs/sysevent_tracer.txt` | Event flow tracing | Runtime flag | +| SelfHeal | `/rdklogs/logs/SelfHeal.txt.0` | Process recovery events | Always on | + +### Adding Logging to Your Code + +```c +#include + +// Standard logging (goes to syslog) +ulog(ULOG_SYSTEM, UL_INFO, "service started successfully"); +ulogf(ULOG_SYSTEM, UL_INFO, "configured %s with IP %s", ifname, ip); + +// Error logging +ulog_error(ULOG_SYSTEM, UL_MYSERVICE, "failed to open config file"); +ulog_errorf(ULOG_SYSTEM, UL_MYSERVICE, "syscfg_get failed for key: %s", key); + +// Debug (only when enabled) +ulog_debug(ULOG_SYSTEM, UL_MYSERVICE, "entering state: CONNECTED"); +``` + +### Telemetry Events + +```c +#include + +// Send telemetry marker +t2_event_d("SYS_SH_RDKB_FIREWALL_RESTART", 1); +t2_event_s("WAN_INFO_IPAddress", wan_ip); +``` + +## Debug Commands + +### Runtime Inspection + +```bash +# Sysevent state inspection +sysevent get wan-status # Check WAN 
state
+sysevent get dhcp_server-status         # Check DHCP state
+sysevent get firewall-status            # Check firewall state
+sysevent get system-status              # Check system state
+
+# Configuration inspection
+syscfg get wan_proto                    # WAN protocol (dhcp/static)
+syscfg get dhcp_start                   # DHCP pool start
+syscfg get lan_ipaddr                   # LAN IP address
+syscfg show | grep firewall             # All firewall config
+
+# Service status
+sysevent get multinet_1-status          # Bridge instance status
+sysevent get ipv4_4-status              # IPv4 instance status
+```
+
+### Forcing Service Operations
+
+```bash
+# Restart specific services
+sysevent set dhcp_server-restart
+sysevent set firewall-restart
+sysevent set wan-restart
+sysevent set service_ipv6-restart
+
+# Force interface reconfiguration
+sysevent set multinet_1-up
+sysevent set ipv4-up 4
+
+# Reset stuck states
+sysevent set dhcp_server-status stopped
+sysevent set firewall-status stopped
+```
+
+### Network Diagnostics
+
+```bash
+# Interface status
+ip addr show                            # All interface IPs
+ip link show                            # Interface states
+brctl show                              # Bridge configuration
+cat /sys/class/net/erouter0/carrier     # Physical link
+
+# Routing
+ip route show                           # IPv4 routes
+ip -6 route show                        # IPv6 routes
+ip rule show                            # Policy routing
+
+# Firewall
+iptables -L -n -v                       # IPv4 rules with counters
+ip6tables -L -n -v                      # IPv6 rules
+iptables -t nat -L -n                   # NAT rules
+cat /proc/net/nf_conntrack | wc -l      # Connection count
+
+# DNS/DHCP
+cat /etc/dnsmasq.conf                   # DHCP server config
+cat /tmp/dnsmasq.leases                 # Active leases
+cat /etc/resolv.conf                    # DNS configuration
+```
+
+## Validation Steps
+
+### After Modifying Syscfg Logic
+1. `syscfg set test_key test_value && syscfg commit`
+2. `syscfg get test_key` → should return "test_value"
+3. Reboot → `syscfg get test_key` → still "test_value"
+4. Check `/nvram/syscfg.db` contains the entry
+
+### After Modifying Sysevent Logic
+1. Terminal 1: `sysevent async test_event`
+2. Terminal 2: `sysevent set test_event hello`
+3. Terminal 1 should receive notification
+4. Verify trigger execution with tracer log
+
+### After Modifying a Service
+1. `sysevent set <service>-restart`
+2. Check `sysevent get <service>-status` transitions: stopped → starting → started
+3. Verify functional behavior (e.g., DHCP lease acquisition)
+4. Confirm there are no error logs in `/rdklogs/logs/Consolelog.txt.0`
+
+### After Modifying Firewall Rules
+1. `sysevent set firewall-restart`
+2. `sysevent get firewall-status` → "started"
+3. `iptables -L -n | wc -l` → rule count reasonable
+4. Test: ping external host, verify port forwarding
+5. Check `/rdklogs/logs/FirewallDebug.txt` for errors
+
+## Unit Testing
+
+Unit tests are available under `source/test/` when built with `--enable-unitTestDockerSupport`:
+
+```bash
+# Build with tests
+./configure --enable-unitTestDockerSupport ...
+make
+
+# Test directories
+source/test/service_routed/
+source/test/service_ipv6/
+source/test/service_udhcpc/
+source/test/service_dhcp/
+source/test/service_wan/
+source/test/apply_system_defaults/
+```
+
+Tests use mocked sysevent/syscfg to validate service logic without hardware.
+
+## Common Pitfalls
+
+1. **SIGCHLD handling**: Services invoked by syseventd inherit `SIG_IGN` for SIGCHLD. Reset it to `SIG_DFL` before calling `system()`; otherwise the child is reaped automatically and `system()` returns -1 (with `errno` set to `ECHILD`) even when the command succeeded.
+
+2. **Sysevent connection leaks**: Always close the sysevent connection in error paths. Leaked FDs exhaust the client table.
+
+3. **Syscfg commit frequency**: Don't commit after every set. Batch changes, then commit once (reduces flash wear).
+
+4. **Event loops**: Don't trigger your own event recursively (e.g., the firewall-restart handler must not trigger firewall-restart).
+
+5. **Shared memory alignment**: When modifying syscfg hash table structures, ensure all offsets are properly aligned for the target architecture.
+
+6. **Process mutex recovery**: If using shared-memory mutexes (like firewall), always handle EOWNERDEAD.
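
The EOWNERDEAD pitfall above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not Utopia's actual locking code: the helper names `robust_mutex_init`, `robust_mutex_lock`, and `demo_owner_death` are hypothetical, and the mutex is placed in an anonymous shared mapping purely to keep the example self-contained. The sketch shows a child process dying while it holds the lock and the parent recovering with `pthread_mutex_consistent()`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* MAP_ANONYMOUS and the robust-mutex APIs */
#endif
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: initialise a process-shared, robust mutex. */
static int robust_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical helper: lock the mutex and recover if the previous owner
 * died. Returns 0 on success; *recovered is set to 1 when EOWNERDEAD
 * was observed and the mutex was marked consistent again. */
static int robust_mutex_lock(pthread_mutex_t *m, int *recovered)
{
    int rc = pthread_mutex_lock(m);
    *recovered = 0;
    if (rc == EOWNERDEAD) {
        /* The lock is held, but the protected data may be half-updated.
         * A real caller would validate or repair that data here. */
        rc = pthread_mutex_consistent(m);
        *recovered = 1;
    }
    return rc;
}

/* Demo: a child takes the lock and exits without unlocking.
 * Returns 1 if the parent recovered from EOWNERDEAD, 0 if the lock was
 * acquired normally, -1 on any error. */
int demo_owner_death(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    if (robust_mutex_init(m) != 0) {
        munmap(m, sizeof(*m));
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(m, sizeof(*m));
        return -1;
    }
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);                 /* dies while holding the lock */
    }
    waitpid(pid, NULL, 0);        /* make sure the child is gone */

    int recovered = 0;
    int rc = robust_mutex_lock(m, &recovered);
    if (rc == 0)
        pthread_mutex_unlock(m);
    pthread_mutex_destroy(m);
    munmap(m, sizeof(*m));
    return rc == 0 ? recovered : -1;
}
```

Calling `demo_owner_death()` from a small `main()` is enough to exercise the recovery path. If the new owner unlocks without first calling `pthread_mutex_consistent()`, the mutex becomes permanently unusable and later lock attempts fail with `ENOTRECOVERABLE`, which is why the pitfall says to always handle EOWNERDEAD.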
diff --git a/docs/08-architecture-deep-dive.md b/docs/08-architecture-deep-dive.md
new file mode 100755
index 00000000..e5956b43
--- /dev/null
+++ b/docs/08-architecture-deep-dive.md
@@ -0,0 +1,281 @@
+# Architecture Deep Dive — Core Subsystems
+
+## Syscfg Internal Architecture
+
+### Memory Layout
+
+```
+Shared Memory Segment (syscfg)
+┌──────────────────────────────────────────────────┐
+│ Control Block (shm_cb)                           │
+│   ├── read_lock (pthread_mutex_t, ROBUST)        │
+│   ├── write_lock (pthread_mutex_t, ROBUST)       │
+│   ├── commit_lock (pthread_mutex_t, ROBUST)      │
+│   ├── store_path[128] ("/nvram/syscfg.db")       │
+│   ├── max_size                                   │
+│   ├── used_size                                  │
+│   └── entry_count                                │
+├──────────────────────────────────────────────────┤
+│ Hash Table Buckets [SYSCFG_HASH_TABLE_SZ]        │
+│   bucket[0] → offset to first ht_entry           │
+│   bucket[1] → offset to first ht_entry           │
+│   ...                                            │
+│   bucket[N] → 0 (empty)                          │
+├──────────────────────────────────────────────────┤
+│ ht_entry pool (variable size)                    │
+│  ┌─────────────────────────────────┐             │
+│  │ name_sz | value_sz | next_offset│             │
+│  │ name_bytes[name_sz]             │             │
+│  │ value_bytes[value_sz]           │             │
+│  └─────────────────────────────────┘             │
+│  ... more entries ...                            │
+└──────────────────────────────────────────────────┘
+```
+
+### Read/Write Concurrency Model
+
+```
+Reader:
+  acquire(read_lock)          ← blocks only if writer active
+  result = hash_lookup(key)
+  release(read_lock)
+  return result
+
+Writer:
+  acquire(write_lock)         ← exclusive
+  hash_insert_or_update(key, value)
+  release(write_lock)
+
+Committer:
+  acquire(commit_lock)        ← exclusive, independent of read/write
+  serialize_hash_table_to_file()
+  atomic_rename(tmp_file, store_path)
+  release(commit_lock)
+```
+
+### Corruption Detection & Recovery
+
+```
+_syscfg_find_corrupted_keys():
+  For each entry in hash table:
+    Validate: name_sz > 0, value_sz >= 0
+    Validate: name contains only printable chars
+    Validate: next_offset within bounds or 0
+    Cross-reference with system_defaults keys (optional)
+  Report: corrupted entries found
+
+Recovery path:
+  If minor corruption (few bad entries):
+    → Remove corrupted entries, reconstruct linkage
+  If major corruption (control block damaged):
+    → Destroy shared memory
+    → Recreate from backup file
+    → If backup corrupt: factory reset from system_defaults
+```
+
+---
+
+## Sysevent Internal Architecture
+
+### Thread Communication Design
+
+```
+┌────────────────────────────────────────────┐
+│                MAIN THREAD                 │
+│  select() on TCP_fd + UDS_fd               │
+│  accept() → assign token →                 │
+│  write(main_pipe) → notify workers         │
+└──────────────┬─────────────────────────────┘
+               │ main_communication_pipe
+┌──────────────▼─────────────────────────────┐
+│           WORKER THREADS [0..N]            │
+│  sem_wait(worker_sem)                      │
+│  read from main_pipe OR trigger_pipe       │
+│  process SE_MSG_*                          │
+│  ├── SE_MSG_SET → DataMgr.set()            │
+│  │   └── if changed: write trigger_pipe    │
+│  ├── SE_MSG_GET → DataMgr.get()            │
+│  ├── SE_MSG_CLOSE → ClientsMgr.remove()    │
+│  └── SE_MSG_SET_OPTIONS → flags update     │
+└──────────────┬─────────────────────────────┘
+               │ trigger_communication_pipe
+┌──────────────▼─────────────────────────────┐
+│             TRIGGER PROCESSING             │
+│  Read trigger_id from pipe                 │
+│  TriggerMgr.execute_trigger_actions()      │
+│  ├── ACTION_TYPE_MESSAGE:                  │
+│  │   send notification to client FD        │
+│  └── ACTION_TYPE_EXT_FUNCTION:             │
+│      write to fork_helper_pipe             │
+└──────────────┬─────────────────────────────┘
+               │ fork_helper_pipe
+┌──────────────▼─────────────────────────────┐
+│            FORK HELPER PROCESS             │
+│  (separate child process)                  │
+│  Read action from pipe                     │
+│  fork() → execve(handler_binary, args)     │
+│  Write result to worker FIFO               │
+│    (/tmp/syseventd_worker_N)               │
+└────────────────────────────────────────────┘
+```
+
+### Message Protocol (SE_MSG)
+
+```
+Message Header:
+┌────────────────────────────────────┐
+│ msg_type (uint32)                  │  SE_MSG_SET, SE_MSG_GET, etc.
+│ msg_size (uint32)                  │  Total message size
+│ token (uint32)                     │  Client authentication
+│ async_id (uint32)                  │  For async operations
+└────────────────────────────────────┘
+
+Message Types:
+  SE_MSG_OPEN_CONNECTION          → Client registration
+  SE_MSG_CLOSE_CONNECTION         → Client deregistration
+  SE_MSG_SET                      → Set tuple value
+  SE_MSG_GET                      → Get tuple value
+  SE_MSG_SET_OPTIONS              → Set tuple flags
+  SE_MSG_REMOVE_ASYNC             → Remove async callback
+  SE_MSG_SEND_NOTIFICATION        → Trigger notification to client
+  SE_MSG_RUN_EXTERNAL_EXECUTABLE  → Execute via fork helper
+  SE_MSG_EXECUTE_SERIALLY         → Serial execution group
+  SE_MSG_DIE                      → Shutdown signal to workers
+```
+
+### Tuple Flags
+
+```
+TUPLE_FLAG_EVENT  (0x00000001) — Value is an event (triggers on any set)
+TUPLE_FLAG_SERIAL (0x00000002) — Actions execute serially (ordered)
+TUPLE_FLAG_NORMAL (0x00000000) — Default: parallel execution, trigger on change
+```
+
+---
+
+## Firewall Rule Generation Engine
+
+### Pipeline Stages
+
+```
+Stage 1: Configuration Collection
+  ├── WAN config (interface, IP, protocol, MTU)
+  ├── LAN config (bridges, IPs, subnets)
+  ├── Feature flags (DMZ, port_forward_enabled, etc.)
+  ├── Rule sets (PortForward_1..N, PortTrigger_1..N)
+  ├── Security (firewall_level, block_ping, block_ident)
+  ├── QoS (policies, DSCP marks)
+  └── Parental (managed sites/services, time windows)
+
+Stage 2: Rule Composition
+  ├── For each table (raw, mangle, nat, filter):
+  │   ├── Write table header (*raw / *mangle / *nat / *filter)
+  │   ├── Declare chains (:CHAIN_NAME POLICY)
+  │   ├── Generate rules per function (20+ sub-generators)
+  │   └── Write COMMIT
+  │
+  ├── Sub-generators (partial list):
+  │   ├── do_raw_table_general_rules()
+  │   ├── do_mangle_qos_marking()
+  │   ├── do_nat_port_forwarding()
+  │   ├── do_nat_dmz()
+  │   ├── do_nat_masquerade()
+  │   ├── do_filter_wan2self()
+  │   ├── do_filter_lan2wan()
+  │   ├── do_filter_wan2lan()
+  │   ├── do_filter_rate_limiting()
+  │   └── do_filter_logging()
+  │
+  └── Dynamic rules (from sysevent pools):
+      ├── "NatFirewallRule" pool → NAT table rules
+      ├── "GeneralPurposeFirewallRule" pool → filter FORWARD rules
+      └── "v6GeneralPurposeFirewallRule" pool → ip6tables rules
+
+Stage 3: Atomic Application
+  ├── iptables-restore < /tmp/.ipt
+  ├── ip6tables-restore < /tmp/.ipt_v6
+  └── conntrack -F (flush stale entries)
+```
+
+### Dynamic Rule Pool Mechanism
+
+Services (like trigger module) can inject firewall rules at runtime without regenerating the entire ruleset:
+
+```
+Injection:
+  sysevent_set_unique("NatFirewallRule",
+    "-A prerouting_fromwan -p tcp --dport 8080 -j DNAT --to 192.168.1.100:80")
+  → Returns handle_id for later removal
+  sysevent_set("firewall-restart")
+
+Removal:
+  sysevent_del_unique("NatFirewallRule", handle_id)
+  sysevent_set("firewall-restart")
+
+During regeneration:
+  firewall iterates all pool entries:
+    sysevent_get("NatFirewallRule") → gets count
+    for i in 1..count:
+      sysevent_get("NatFirewallRule_i") → gets rule string
+      write rule to /tmp/.ipt
+```
+
+---
+
+## Service Lifecycle Protocol
+
+### Standard Service State Machine
+
+```
+            ┌─────────┐
+  ┌─────────│ stopped │◄──────────────┐
+  │         └────┬────┘               │
+  │              │ -start             │ -stop
+  │              ▼                    │
+  │         ┌─────────┐               │
+  │         │starting │               │
+  │         └────┬────┘               │
+  │              │ success            │
+  │              ▼                    │
+  │         ┌─────────┐               │
+  │    ┌───►│ started │───────────────┤
+  │    │    └────┬────┘               │
+  │    │         │ -restart           │
+  │    │         ▼                    │
+  │    │    ┌──────────┐              │
+  │    │    │restarting│              │
+  │    │    └────┬─────┘              │
+  │    │         │                    │
+  │    └─────────┘                    │
+  │                                   │
+  │  failure at any point             │
+  │         ┌─────────┐               │
+  └────────►│ error   │───────────────┘
+            └─────────┘  (manual restart required)
+```
+
+### Event Registration Convention
+
+```
+Service "myservice" registers these standardized events:
+
+1. myservice-start   → handler invoked with "start" arg
+2. myservice-stop    → handler invoked with "stop" arg
+3. myservice-restart → handler invoked with "restart" arg
+
+Status published to: "myservice-status"
+Values: "stopped" | "starting" | "started" | "stopping" | "error"
+
+Other services depend on status:
+  sysevent_get("myservice-status") == "started" → safe to interact
+```
+
+### Anti-Patterns Prevented by Design
+
+| Anti-Pattern | Prevention Mechanism |
+|---|---|
+| Double-start | `wait_till_end_state()` checks transitional states |
+| Stop during start | Status check before action |
+| Recursive restart | Handler checks current state, no-ops if already restarting |
+| Orphaned processes | PID file tracking + kill on stop |
+| Race with config | Transactional commit before event fire (UTCTX) |
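
The status checks listed in the table above can be sketched as a small guard function in C. This is an illustrative sketch only: the names `svc_state_t`, `svc_cmd_t`, `svc_state_from_string`, and `svc_should_act` are hypothetical and are not taken from the Utopia sources. It also assumes that a command arriving during a transitional state is simply ignored, whereas `wait_till_end_state()` (by its name) waits for the transition to finish. The status strings are the ones published to `myservice-status`.

```c
#include <string.h>

/* Hypothetical state and command types mirroring the published
 * status values and the start/stop/restart events. */
typedef enum {
    SVC_STOPPED, SVC_STARTING, SVC_STARTED, SVC_STOPPING, SVC_ERROR
} svc_state_t;

typedef enum { SVC_CMD_START, SVC_CMD_STOP, SVC_CMD_RESTART } svc_cmd_t;

/* Map a status string to a state. Unknown or NULL input is treated as
 * SVC_ERROR (an assumption made for this sketch). */
svc_state_t svc_state_from_string(const char *s)
{
    /* Order must match svc_state_t. */
    static const char *const names[] = {
        "stopped", "starting", "started", "stopping", "error"
    };
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (s != NULL && strcmp(s, names[i]) == 0)
            return (svc_state_t)i;
    }
    return SVC_ERROR;
}

/* Return 1 if a handler should act on `cmd` while in `state`,
 * or 0 if it should no-op. Transitional states never accept a command. */
int svc_should_act(svc_state_t state, svc_cmd_t cmd)
{
    if (state == SVC_STARTING || state == SVC_STOPPING)
        return 0;                       /* stop-during-start, double-start */

    switch (cmd) {
    case SVC_CMD_START:
        return state == SVC_STOPPED || state == SVC_ERROR;
    case SVC_CMD_STOP:
        return state == SVC_STARTED || state == SVC_ERROR;
    case SVC_CMD_RESTART:
        return 1;                       /* any settled state may restart */
    }
    return 0;
}
```

A handler would read the current status string, convert it with `svc_state_from_string()`, and return early when `svc_should_act()` yields 0. Keeping the rule in one function makes the double-start and stop-during-start checks easy to unit test with mocked status values.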