epic: Sprint 18 - AAMI Node Agent Implementation

## Overview

A lightweight Go-based agent runs on each monitored node, pulls scripts and policies from Config Server, executes them, and reports status back.

### Goals
- Support large-scale node management in air-gapped environments
- Overcome limitations of cron-based approach (error handling, state management, retry logic)
- Integrate script deployment with Config Server

### Non-Goals
- Push-based execution (SSH) - future phase
- Real-time WebSocket communication - future phase

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Config Server                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ │
│  │ Targets  │  │ Scripts  │  │ Policies │  │ Agent API        │ │
│  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                              ▲
                              │ HTTPS (poll)
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
   ┌─────────┐           ┌─────────┐           ┌─────────┐
   │  Agent  │           │  Agent  │           │  Agent  │
   │ Node-01 │           │ Node-02 │           │ Node-03 │
   └─────────┘           └─────────┘           └─────────┘
```

### Key Design Decisions
1. **Pull-based**: Firewall/NAT friendly, only outbound connections from nodes
2. **Go single binary**: No dependencies, easy deployment in air-gapped environments
3. **Hash-based change detection**: Prevent unnecessary re-execution
4. **Graceful degradation**: Maintain last state on server connection failure
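
A minimal sketch of decision 3, hash-based change detection. SHA-256 is an assumption (it happens to match the 64-character `script_hash` column in the Phase 2 schema), and `hashScript`, `needsRun`, and the in-memory `lastHashes` map are hypothetical names used only for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashScript returns the hex-encoded SHA-256 digest of a script body.
// The choice of SHA-256 is an assumption, not something this epic specifies.
func hashScript(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// needsRun reports whether a script should be executed: either it has never
// been seen on this node, or its content hash differs from the recorded one.
func needsRun(lastHashes map[string]string, name string, body []byte) bool {
	prev, seen := lastHashes[name]
	return !seen || prev != hashScript(body)
}

func main() {
	lastHashes := map[string]string{} // hypothetical view of the local state
	script := []byte("#!/bin/sh\necho ok\n")

	fmt.Println(needsRun(lastHashes, "check-disk", script)) // true: first sighting

	lastHashes["check-disk"] = hashScript(script)
	fmt.Println(needsRun(lastHashes, "check-disk", script)) // false: unchanged

	changed := append([]byte{}, script...)
	changed = append(changed, "# v2\n"...)
	fmt.Println(needsRun(lastHashes, "check-disk", changed)) // true: content changed
}
```

Whether an unchanged script should still be re-run on a schedule is a policy question this sketch does not address; it only shows the comparison itself.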

## Directory Structure

```
services/aami-agent/
├── cmd/
│   └── agent/
│       └── main.go                 # Entry point
├── internal/
│   ├── config/
│   │   └── config.go               # Configuration management
│   ├── client/
│   │   └── api_client.go           # Config Server API client
│   ├── executor/
│   │   ├── executor.go             # Script execution engine
│   │   └── result.go               # Execution result types
│   ├── poller/
│   │   └── poller.go               # Polling loop
│   ├── state/
│   │   └── state.go                # Local state management (JSON file)
│   └── reporter/
│       └── reporter.go             # Status/result reporting
├── scripts/
│   └── install-agent.sh            # Installation script
├── Dockerfile
├── go.mod
├── go.sum
└── README.md
```

---

## Implementation Phases

### Phase 1: Core Agent (Priority: High)

Basic polling and script execution functionality.

**Components:**
- Config Module - YAML configuration loading
- API Client - Config Server API integration
- Executor - Script execution with timeout
- State Manager - Local state persistence (JSON)
- Poller - Main polling loop

**Tasks:**
- [ ] Create `services/aami-agent/` directory structure
- [ ] Implement Config module
- [ ] Implement API Client (using existing `/api/v1/checks/target/hostname/:hostname`)
- [ ] Implement Executor
- [ ] Implement State Manager
- [ ] Implement Poller
- [ ] Implement Main entry point
- [ ] Unit tests

---

### Phase 2: Server-side Agent API (Priority: High)

Add Agent-specific API endpoints to Config Server.

**New Endpoints:**
```
POST /api/v1/agent/heartbeat          # Heartbeat + node status
POST /api/v1/agent/executions         # Script execution result reporting
GET  /api/v1/agent/config             # Agent configuration (poll interval, etc.)
```

**Database Schema:**
```sql
-- Agent status tracking table
CREATE TABLE agent_status (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    target_id UUID NOT NULL REFERENCES targets(id),
    agent_version VARCHAR(50),
    last_heartbeat TIMESTAMPTZ,
    last_poll TIMESTAMPTZ,
    uptime_seconds BIGINT,
    scripts_total INT DEFAULT 0,
    scripts_success INT DEFAULT 0,
    scripts_failed INT DEFAULT 0,
    system_info JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Script execution history table
CREATE TABLE script_executions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    target_id UUID NOT NULL REFERENCES targets(id),
    script_policy_id UUID REFERENCES script_policies(id),
    script_name VARCHAR(255) NOT NULL,
    script_version VARCHAR(50),
    script_hash VARCHAR(64),
    started_at TIMESTAMPTZ NOT NULL,
    finished_at TIMESTAMPTZ,
    duration_ms INT,
    exit_code INT,
    success BOOLEAN,
    stdout TEXT,
    stderr TEXT,
    error_message TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

**Tasks:**
- [ ] Agent status domain model
- [ ] Script execution domain model
- [ ] Agent repository implementation
- [ ] Agent service implementation
- [ ] Agent handler implementation
- [ ] Register Agent API in router
- [ ] Database migration

---

### Phase 3: Installation & Deployment (Priority: Medium)

Installation scripts and systemd service configuration.

**Components:**
- `install-agent.sh` - Automated installation script
- systemd unit file for service management
- `--agent` option in `bootstrap.sh`
- Dockerfile for containerized deployment

**Tasks:**
- [ ] install-agent.sh script
- [ ] systemd unit file
- [ ] Add --agent option to bootstrap.sh
- [ ] Dockerfile
- [ ] Add agent example to docker-compose

---

### Phase 4: Web UI Integration (Priority: Medium)

Agent status and execution history UI.

**Features:**
- Agent status section on Target detail page
- Agent version, last heartbeat, uptime display
- Script execution success/failure counts
- Execution history table with filtering

**Tasks:**
- [ ] Agent status API client (Web UI)
- [ ] Agent status component
- [ ] Execution history table component
- [ ] Integration with Target detail page

---

### Phase 5: Advanced Features (Priority: Low)

**5.1 Force Execution**
- Trigger immediate script execution from Web UI
- Agent detects force flag on next poll

**5.2 Agent Groups**
- Group-based poll interval configuration
- Group-based script assignment

**5.3 Metrics Export**
- Agent's own Prometheus metrics endpoint
- `aami_agent_poll_duration_seconds`
- `aami_agent_scripts_executed_total`
- `aami_agent_scripts_failed_total`

---

## CLI Interface

```bash
# Installation
curl -fsSL https://config-server/install-agent.sh | bash -s -- \
  --server https://config-server:8080 \
  --token <bootstrap-token>

# Direct execution (debugging)
aami-agent run \
  --server https://config-server:8080 \
  --hostname "$(hostname)" \
  --poll-interval 30s \
  --verbose

# Status check
aami-agent status

# Force poll
aami-agent poll --now

# Version check
aami-agent version
```

---

## Migration from Cron

### Coexistence Period
1. Existing cron method and Agent can run simultaneously
2. Option to auto-disable cron job when installing Agent
3. Support gradual migration

### Migration Procedure
1. Deploy Agent binary
2. Install and verify Agent on test node
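3. Disable the AAMI cron jobs by removing `/etc/cron.d/aami-*` (`systemctl stop crond` also works, but it stops every cron job on the node, not only AAMI's)
4. Enable and start Agent: `systemctl enable --now aami-agent`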
5. Establish monitoring and rollback plan

---

## Security Considerations

1. **TLS Communication**: HTTPS with Config Server
2. **Token Authentication**: Bootstrap token or Agent-specific token
3. **Script Verification**: Hash-based integrity check
4. **Least Privilege**: Agent requests only necessary permissions
5. **Log Security**: Sensitive information masking

---

## Success Metrics

| Metric | Target |
|--------|--------|
| Agent installation success rate | 99%+ |
| Poll success rate | 99.9%+ |
| Script execution success rate | Per policy |
| Heartbeat interval compliance | 95%+ (±5s) |
| Resource usage | CPU < 1%, Memory < 50MB |

---

## Dependencies

- #3 - Generic async Job Manager
- #4 - Apply Job Manager to long-running API endpoints
- Go 1.23+
- Config Server API (existing)
- PostgreSQL (existing)
- systemd (Linux nodes)

## Related Documents

- Sprint Planning: `.agent/planning/sprints/planned/sprint-18-node-agent.md`
