Merged
20 changes: 12 additions & 8 deletions README.md
@@ -186,6 +186,7 @@ yoq up --dev watch and hot-restart on changes
yoq up --server host:port                 deploy to a cluster
yoq down [-f manifest.toml]               stop services from a manifest
yoq run-worker <name>                     run a one-shot worker
yoq run-worker --server host:port <name>
yoq init [-f path]                        scaffold a manifest
yoq validate [-f manifest.toml] [-q]      validate a manifest
```
@@ -260,15 +261,17 @@ yoq gpu bench [--gpus N] GPU-to-GPU bandwidth benchmark
### training

```text
yoq train start <name>               start a training job
yoq train status <name>              show training job status
yoq train stop <name>                stop a training job
yoq train pause <name>               pause a training job
yoq train resume <name>              resume a paused job
yoq train scale <name>               scale training ranks
yoq train logs <name> [--rank N]     show logs for a training rank
yoq train start [--server host:port] <name>                 start a training job
yoq train status [--server host:port] <name>                show training job status
yoq train stop [--server host:port] <name>                  stop a training job
yoq train pause [--server host:port] <name>                 pause a training job
yoq train resume [--server host:port] <name>                resume a paused job
yoq train scale [--server host:port] <name> --gpus <n>      scale training ranks
yoq train logs [--server host:port] <name> [--rank N]       show logs for a training rank
```

for clustered training logs, the control plane now proxies log reads to the agent that hosts the selected rank. if that agent is unreachable or does not expose the log endpoint, the API returns an explicit hosting-agent error instead of a misleading empty result.
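the routing decision can be sketched roughly as follows; the function and error names here are illustrative, not part of yoq:

```python
# Hypothetical sketch of the control plane's log-read routing. The names
# (Agent, HostingAgentError, read_rank_logs) are illustrative, not yoq's API.
from dataclasses import dataclass


@dataclass
class Agent:
    node_id: str
    reachable: bool
    has_log_endpoint: bool


class HostingAgentError(Exception):
    """Explicit error naming the hosting agent, instead of an empty result."""


def read_rank_logs(rank_hosts: dict[int, Agent], rank: int) -> str:
    agent = rank_hosts.get(rank)
    if agent is None:
        raise HostingAgentError(f"no agent hosts rank {rank}")
    if not agent.reachable:
        raise HostingAgentError(f"hosting agent {agent.node_id} is unreachable")
    if not agent.has_log_endpoint:
        raise HostingAgentError(f"agent {agent.node_id} does not expose the log endpoint")
    # A real control plane would forward the HTTP read here.
    return f"logs from {agent.node_id} for rank {rank}"
```

the point of the explicit error type is that callers can distinguish "the rank produced no logs" from "the hosting agent could not be reached at all".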

### diagnostics

```text
@@ -290,7 +293,8 @@ Notes:
- `--json` is available on `ps`, `images`, `prune`, `version`, `gpu topo`, and `doctor`.
- crons defined in the manifest start automatically with `yoq up`.
- deployment, metrics, and certificate commands also support `--server host:port`.
- clustered manifest deploys now go through the app-first `/apps/apply` API. the older `/deploy` route remains as a compatibility shim for legacy callers.
- clustered manifest deploys now go through the app-first `/apps/apply` API and carry services, workers, crons, and training definitions in one app snapshot. the older `/deploy` route remains as a compatibility shim for legacy callers.
- remote app applies now register active cron schedules in cluster state, and `yoq apps` / `yoq status --app` include live training runtime summaries for the current app.

## current status

14 changes: 12 additions & 2 deletions docs/cluster-guide.md
@@ -136,7 +136,7 @@ port = 3000
yoq up --server
```

the `--server` flag tells yoq to submit the manifest to the cluster API instead of running locally. under the hood the CLI now sends a canonical app snapshot to `POST /apps/apply`; the older `/deploy` route remains only for compatibility. the scheduler places containers on agents using bin-packing (scores by free CPU + memory). service discovery and load balancing work transparently across nodes via the WireGuard overlay and eBPF.
the `--server` flag tells yoq to submit the manifest to the cluster API instead of running locally. under the hood the CLI now sends a canonical app snapshot to `POST /apps/apply`; that snapshot carries services, workers, crons, and training jobs together. the older `/deploy` route remains only for compatibility. the scheduler places containers on agents using bin-packing (scores by free CPU + memory). service discovery and load balancing work transparently across nodes via the WireGuard overlay and eBPF.
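the placement rule can be sketched as below; the field names and the equal CPU/memory weighting are assumptions, since the guide only states that scoring uses free CPU + memory:

```python
# Illustrative bin-packing placement: pick the agent with the highest combined
# free-resource score. Field names and the weighting are assumptions, not
# yoq's actual scheduler code.
from dataclasses import dataclass


@dataclass
class AgentCapacity:
    node_id: str
    free_cpu_cores: float
    free_memory_mb: float


def score(agent: AgentCapacity) -> float:
    # Normalize memory to rough "core-equivalents" so both terms are
    # comparable; a real scheduler would tune this weighting.
    return agent.free_cpu_cores + agent.free_memory_mb / 1024.0


def place(agents: list[AgentCapacity]) -> AgentCapacity:
    # Choose the least-loaded agent by combined free CPU + memory.
    return max(agents, key=score)
```

with this shape, adding a new scoring signal (say, free GPU count) is just another term in `score`.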

after deploy, use the app-first day-2 commands:

@@ -147,7 +147,7 @@ yoq history --app [name] --server 10.0.0.1:7700
yoq rollback --app [name] --server 10.0.0.1:7700 --release <release-id>
```

`yoq apps` shows the latest release summary for every app, `status --app` shows the latest release metadata for one app, `history --app` lists prior releases, and remote `rollback --app ... --release` re-applies a stored app snapshot.
`yoq apps` shows the latest release summary for every app, `status --app` shows the latest release metadata for one app, `history --app` lists prior releases, and remote `rollback --app ... --release` re-applies a stored app snapshot. `yoq run-worker --server ...` and `yoq train ... --server ...` now resolve workers and training jobs from the current app release on the server. clustered app applies also register cron schedules from the current app snapshot, and the app summary/status views include live training runtime counts for the app.

---

@@ -342,12 +342,22 @@ for app operations, the important write paths are:

- `POST /apps/apply`
- `POST /apps/<name>/rollback`
- `POST /apps/<app>/workers/<name>/run`
- `POST /apps/<app>/training/<name>/start`
- `POST /apps/<app>/training/<name>/stop`
- `POST /apps/<app>/training/<name>/pause`
- `POST /apps/<app>/training/<name>/resume`
- `POST /apps/<app>/training/<name>/scale`

the important read paths are:

- `GET /apps`
- `GET /apps/<name>/status`
- `GET /apps/<name>/history`
- `GET /apps/<app>/training/<name>/status`
- `GET /apps/<app>/training/<name>/logs`

`GET /apps/<app>/training/<name>/logs` now proxies the request to the agent that hosts the selected rank. if that agent is unreachable or does not expose the log endpoint, the route returns an explicit hosting-agent error.
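as a rough illustration of how the app-scoped training routes above compose (the route shapes come from the lists in this section; the helper itself is hypothetical, not part of yoq):

```python
# Hypothetical helper composing the app-scoped training routes listed above.
# Write actions go through POST, inspection through GET, mirroring the
# write/read split documented in this section.
WRITE_ACTIONS = {"start", "stop", "pause", "resume", "scale"}
READ_ACTIONS = {"status", "logs"}


def training_route(app: str, job: str, action: str) -> str:
    if action not in WRITE_ACTIONS | READ_ACTIONS:
        raise ValueError(f"unknown training action: {action}")
    method = "POST" if action in WRITE_ACTIONS else "GET"
    return f"{method} /apps/{app}/training/{job}/{action}"
```

for example, `training_route("shop", "ranker", "scale")` yields the scale write path for the `ranker` job in the `shop` app.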

### draining a node

14 changes: 13 additions & 1 deletion docs/users-guide.md
@@ -170,6 +170,12 @@ this gives the operator one app-first day-2 model:
- `yoq rollback --app [name]` — print the last successful local app snapshot
- `yoq rollback --app [name] --server host:port --release <id>` — re-apply a prior remote app release
- `yoq apps` — list app release summaries across all known apps
- `yoq run-worker [--server host:port] <name>` — run a worker from the current app release
- `yoq train start|status|stop|pause|resume|scale|logs [--server host:port] <name>` — manage training jobs from the current app release

`yoq apps` and `yoq status --app` now show both the desired workload mix from the latest app release and the current training runtime summary for that app. on clustered applies, cron definitions from the app snapshot are also registered in cluster state, so a rollback restores the active cron schedule set along with the rest of the snapshot.

remote `yoq train logs --server ...` now proxies the request to the agent that hosts the selected rank. if that agent is unreachable or does not expose the log endpoint, the API returns an explicit hosting-agent error instead of a misleading empty or missing result.

### dev mode

@@ -217,14 +223,20 @@ if the leader changes, agents follow automatically — heartbeat responses include the current leader

### app-first control plane

cluster manifest deploys now use `POST /apps/apply` as the canonical write path. the older `POST /deploy` route is still accepted as a compatibility shim, but new CLI work targets the app-first route.
cluster manifest deploys now use `POST /apps/apply` as the canonical write path. the app snapshot includes services, workers, crons, and training jobs. the older `POST /deploy` route is still accepted as a compatibility shim, but new CLI work targets the app-first route.

the cluster API also exposes app-scoped day-2 reads and rollback:

- `GET /apps` — latest release summary per app
- `GET /apps/<name>/status` — latest app release metadata
- `GET /apps/<name>/history` — app release history
- `POST /apps/<name>/rollback` with `{"release_id":"..."}` — re-apply a stored app release snapshot
- `POST /apps/<app>/workers/<name>/run` — run a worker from the current app release
- `POST /apps/<app>/training/<name>/start|stop|pause|resume|scale` — manage training jobs for the current app release
- `GET /apps/<app>/training/<name>/status|logs` — inspect training jobs for the current app release

the app status surfaces (`GET /apps`, `GET /apps/<name>/status`, `yoq apps`, and `yoq status --app`) also report live training runtime counts for the app: active, paused, and failed jobs.
for `GET /apps/<app>/training/<name>/logs`, the control plane now proxies the request to the agent hosting the selected rank. if that agent is unreachable or does not expose the log endpoint, the route returns an explicit hosting-agent error.
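the runtime counts could be aggregated along these lines; the state names and record shape here are assumptions rather than yoq's actual schema:

```python
# Hypothetical aggregation behind the "live training runtime counts":
# tally per-job states into the active/paused/failed summary the app
# status surfaces report. State names are assumed, not yoq's schema.
from collections import Counter


def training_summary(jobs: list[dict]) -> dict:
    counts = Counter(job["state"] for job in jobs)
    return {
        "active": counts.get("active", 0),
        "paused": counts.get("paused", 0),
        "failed": counts.get("failed", 0),
    }
```

states outside the three reported buckets (e.g. a hypothetical `"completed"`) would simply be omitted from the summary.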

### rolling upgrades

2 changes: 2 additions & 0 deletions src/api/routes/cluster_agents.zig
@@ -6,6 +6,7 @@ const testing = std.testing;
const cluster_routes = @import("cluster_agents/cluster_routes.zig");
const agent_routes = @import("cluster_agents/agent_routes.zig");
const app_routes = @import("cluster_agents/app_routes.zig");
const workload_routes = @import("cluster_agents/workload_routes.zig");
const deploy_routes = @import("cluster_agents/deploy_routes.zig");
const writers = @import("cluster_agents/writers.zig");

@@ -37,6 +38,7 @@ pub fn route(request: http.Request, alloc: std.mem.Allocator, ctx: RouteContext)
}

    if (app_routes.route(request, alloc, ctx)) |resp| return resp;
    if (workload_routes.route(request, alloc, ctx)) |resp| return resp;

    if (path.len > "/agents/".len and std.mem.startsWith(u8, path, "/agents/")) {
        const rest = path["/agents/".len..];
16 changes: 12 additions & 4 deletions src/api/routes/cluster_agents/agent_routes.zig
@@ -2,6 +2,7 @@ const std = @import("std");
const http = @import("../../http.zig");
const agent_registry = @import("../../../cluster/registry.zig");
const cluster_config = @import("../../../cluster/config.zig");
const request_support = @import("../../../cluster/agent/request_support.zig");
const json_helpers = @import("../../../lib/json_helpers.zig");
const common = @import("../common.zig");
const writers = @import("writers.zig");
@@ -18,10 +19,14 @@ pub fn handleAgentRegister(alloc: std.mem.Allocator, request: http.Request, ctx:

    const token = extractJsonString(request.body, "token") orelse return common.badRequest("missing token field");
    const address = extractJsonString(request.body, "address") orelse return common.badRequest("missing address field");
    const agent_api_port = extractJsonInt(request.body, "agent_api_port");
    const cpu_cores = extractJsonInt(request.body, "cpu_cores") orelse return common.badRequest("missing cpu_cores field");
    const memory_mb = extractJsonInt(request.body, "memory_mb") orelse return common.badRequest("missing memory_mb field");
    if (cpu_cores <= 0 or cpu_cores > 10000) return common.badRequest("invalid cpu_cores");
    if (memory_mb <= 0 or memory_mb > 10_000_000) return common.badRequest("invalid memory_mb");
    if (agent_api_port) |port| {
        if (port <= 0 or port > 65535) return common.badRequest("invalid agent_api_port");
    }
    if (cpu_cores > std.math.maxInt(u32)) return common.badRequest("cpu_cores too large");
    if (memory_mb > std.math.maxInt(u64)) return common.badRequest("memory_mb too large");

@@ -42,7 +47,6 @@ pub fn handleAgentRegister(alloc: std.mem.Allocator, request: http.Request, ctx:
    var container_subnet_buf: [20]u8 = undefined;
    var container_subnet: ?[]const u8 = null;
    var endpoint_buf: [64]u8 = undefined;
    var endpoint: ?[]const u8 = null;
    var peer_sql: ?[]const u8 = null;
    var peer_sql_buf: [1024]u8 = undefined;

@@ -76,15 +80,18 @@ pub fn handleAgentRegister(alloc: std.mem.Allocator, request: http.Request, ctx:
        if (p <= 0 or p > 65535) return common.badRequest("invalid wg_listen_port");
        break :blk @intCast(p);
    } else 51820;
    endpoint = std.fmt.bufPrint(&endpoint_buf, "{s}:{d}", .{ address, port }) catch null;
    const endpoint_host = if (request_support.parseHostPort(address)) |hp|
        std.fmt.bufPrint(&endpoint_buf, "{d}.{d}.{d}.{d}:{d}", .{ hp.addr[0], hp.addr[1], hp.addr[2], hp.addr[3], port }) catch null
    else
        std.fmt.bufPrint(&endpoint_buf, "{s}:{d}", .{ address, port }) catch null;

    if (endpoint != null and overlay_ip_str != null and container_subnet != null) {
    if (endpoint_host != null and overlay_ip_str != null and container_subnet != null) {
        peer_sql = agent_registry.wireguardPeerSql(
            &peer_sql_buf,
            nid,
            &id_buf,
            pub_key,
            endpoint.?,
            endpoint_host.?,
            overlay_ip_str.?,
            container_subnet.?,
        ) catch return common.internalError();
@@ -117,6 +124,7 @@ pub fn handleAgentRegister(alloc: std.mem.Allocator, request: http.Request, ctx:
        std.time.timestamp(),
        .{
            .node_id = assigned_node_id,
            .agent_api_port = if (agent_api_port) |port| @intCast(port) else null,
            .wg_public_key = wg_public_key,
            .overlay_ip = overlay_ip_str,
            .role = role_str,