diff --git a/docs/index.yml b/docs/index.yml
index 730f9e7782..11e8312488 100644
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -110,8 +110,18 @@ navigation:
   - section: Operations
     contents:
-      - page: NVLink Partitioning
-        path: manuals/nvlink_partitioning.md
+      - section: Network Isolation
+        contents:
+          - page: Overview
+            path: manuals/network_isolation.md
+          - page: Ethernet Isolation
+            path: manuals/networking/ethernet_isolation.md
+          - page: Network Security Groups
+            path: manuals/networking/network_security_groups.md
+          - page: InfiniBand Isolation
+            path: manuals/networking/infiniband_isolation.md
+          - page: NVLink Partitioning
+            path: manuals/nvlink_partitioning.md
       - page: Release Instance API Enhancements
         path: manuals/breakfix_integration.md
       - page: IP Resource Pools
diff --git a/docs/manuals/network_isolation.md b/docs/manuals/network_isolation.md
new file mode 100644
index 0000000000..3ff5f2b773
--- /dev/null
+++ b/docs/manuals/network_isolation.md
@@ -0,0 +1,100 @@
+# Network Isolation
+
+NICo enforces tenant network isolation across three independent fabrics. Each
+fabric uses a different mechanism, is configured through a different operator
+API, and is verified separately. This page summarises the model so an operator
+can choose the right guide; it is not a replacement for the per-fabric
+configuration guides linked below.
+
+| Fabric | Operator-facing primitive | Isolation enforced by |
+|---|---|---|
+| Ethernet | VPC + VpcPrefix (+ optional Network Security Group) | DPU VRF per VPC (HBN / NVUE) over a pure type-5 EVPN overlay |
+| InfiniBand | InfiniBand partition | UFM P_Key partition membership; `IbFabricMonitor` reconciler |
+| NVLink | NVLink logical partition | NMX-M / NMX-C partition lifecycle; `NvlPartitionMonitor` reconciler |
+
+---
+
+## Ethernet
+
+A tenant's instance reaches a VPC by drawing addresses from one of the
+**VpcPrefixes** attached to that VPC. NICo carves a /31 link-net per
+interface from the prefix — one address to the instance, one to the
+DPU's SVI in the VPC's VRF. An instance may participate in several VPCs
+at once by having interfaces drawing from prefixes in different VPCs.
+On the DPU of the managed host backing the instance, each related VPC
+materialises as a Linux VRF; every host interface drawing from a prefix
+in that VPC lives in that VRF.
+
+VRFs are isolated by default. Cross-VPC reachability requires explicit VPC
+peering or controlled route leaking via the VPC's routing profile. Layer 3 / 4
+filtering within or across VPCs is provided by Network Security Groups,
+attached at VPC or instance scope.
+
+See [Ethernet Isolation](networking/ethernet_isolation.md) for the operator
+configuration guide.
+
+---
+
+## InfiniBand
+
+Each tenant InfiniBand partition maps to a UFM P_Key. Membership is enforced
+by the subnet manager at the fabric level: hosts that are not full members of
+a P_Key cannot exchange traffic with other members of that P_Key, regardless
+of physical connectivity. NICo reconciles desired partition membership against
+UFM via the `IbFabricMonitor` background task and surfaces the synchronisation
+status to operators and to tenants.
+
+See [InfiniBand Isolation](networking/infiniband_isolation.md) for the operator
+configuration guide.
+
+---
+
+## NVLink
+
+NVLink logical partitions group GPUs across hosts into a single isolated
+NVLink domain. NICo drives partition lifecycle against the NMX-M REST API and
+the NMX-C gRPC API and reconciles desired partitions periodically. Each tenant
+instance that requests NVLink connectivity is placed into the partition
+corresponding to its allocation; a host whose GPUs are not in a partition
+cannot reach any other host's GPUs over NVLink.
+
+See [NVLink Partitioning](nvlink_partitioning.md) for the operator
+configuration guide.
+
+---
+
+## Cross-cutting behaviour
+
+The following invariants apply to every fabric.
+
+- **Per-fabric synchronisation status.** Each instance's `InstanceStatus`
+  exposes a per-fabric `configs_synced` field that is `true` only when the
+  observed fabric state matches the desired configuration. The aggregate
+  `configs_synced` field is the logical AND of all per-fabric fields and gates
+  the instance's `Ready` state.
+- **Provisioning blocks on isolation convergence.** During initial
+  provisioning, the instance state machine waits until every requested fabric
+  has applied the desired configuration before the instance is marked `Ready`.
+  Tenants observe this as the `Configuring` tenant state, and the machine
+  remains in `WaitingForNetworkConfig` until the DPU reports back.
+- **Termination blocks on isolation convergence.** During termination, the
+  state machine waits until every fabric reports that the host has been
+  removed from all tenant partitions before the instance is reported as
+  deleted. This guarantees a terminated instance cannot continue to exchange
+  traffic on any fabric.
+- **Force-delete still tears down fabric state.** Force-deleting a managed
+  host explicitly detaches it from every fabric through the same external
+  APIs the normal lifecycle uses, so external fabric managers do not retain
+  stale tenant references.
+- **External fabric reachability is monitored.** Each external fabric service
+  (UFM, NMX-M, NMX-C) is monitored from NICo with request-success and latency
+  metrics so that fabric-side outages can be distinguished from NICo-side
+  configuration errors.
+
+For the architectural rationale and the patterns shared across all three
+fabrics, see
+[Networking Integrations](../architecture/networking_integrations.md).
+
+For the Day 0 IP, DHCP, DNS, and admin-network configuration that every
+isolation guarantee on this page rests on, see
+[Day 0 IP and Network Configuration](../getting-started/installation-options/day0-ip-network-config.md).
diff --git a/docs/manuals/networking/ethernet_isolation.md b/docs/manuals/networking/ethernet_isolation.md
new file mode 100644
index 0000000000..d1bcd5ef4b
--- /dev/null
+++ b/docs/manuals/networking/ethernet_isolation.md
@@ -0,0 +1,397 @@
+# Ethernet Isolation
+
+This page explains how NICo provides Ethernet network isolation between
+tenants and across VPCs, and how an operator configures and verifies it. It
+is the Day-1 configuration guide; the architectural rationale lives in
+[Networking Integrations](../../architecture/networking_integrations.md), and
+the deep mechanics of VXLAN / EVPN, BGP route-targets, and routing profiles
+live in [VPC Network Virtualization](../vpc/vpc_network_virtualization.md).
+
+**Related pages**
+
+- [Network Isolation Overview](../network_isolation.md)
+- [Day 0 IP and Network Configuration](../../getting-started/installation-options/day0-ip-network-config.md)
+  — the operator-facing Day 0 reference for IP pools, admin / underlay
+  segments, DHCP, and DNS. The isolation guarantees on this page assume
+  the Day 0 configuration is already in place.
+- [VPC Network Virtualization](../vpc/vpc_network_virtualization.md) — the
+  full VXLAN / EVPN, VRF, BGP, and routing-profile reference
+- [VPC Routing Profiles](../vpc/vpc_routing_profiles.md)
+- [VPC Peering](../vpc/vpc_peering_management.md)
+- [Network Security Groups](network_security_groups.md)
+- [IP Resource Pools](ip_resource_pools.md)
+- [VNI Resource Pools](../vpc/vni_resource_pools.md)
+
+---
+
+## The Isolation Model
+
+NICo's Ethernet isolation is built on three objects that compose into a
+single chain. This page describes the **Native Networking (FNN)** model,
+which is the official NICo network virtualization model.
+
+```
+Instance ──► Network Interface ──► VpcPrefix ──► VPC ──► VRF on DPU
+ (/31 link-net is vended per interface)
+```
+
+Read this chain top to bottom:
+
+1. A **tenant instance** owns one or more **network interfaces** on the host
+   it is allocated to.
+2. Each interface is allocated an IP address from one of the **VpcPrefixes**
+   attached to a VPC. NICo carves a `/31` link-net from the VpcPrefix per
+   interface — one address goes to the instance, the other goes to the
+   DPU's SVI. The /31 is the operator-visible unit of consumption for the
+   prefix; `VpcPrefixStatus` reports `total_31_segments` and
+   `available_31_segments` so operators can size prefixes against expected
+   instance counts.
+3. Every VpcPrefix belongs to exactly one **VPC**. Attaching an interface
+   to a VpcPrefix is what places the interface in the parent VPC. A VPC
+   may have several VpcPrefixes; an instance may have interfaces drawing
+   from prefixes in different VPCs.
+4. On the DPU of the managed host backing the instance, each VPC the
+   instance touches materialises as a **Linux VRF**. The DPU's SVI for
+   each vended /31 lives in that VRF; everything beyond the SVI is
+   routed.
+
+> **Implementation detail.** Internally, NICo records each vended /31 as a
+> `NetworkSegment` row attached to the VpcPrefix. Operators do not
+> normally manipulate these directly — they are visible through the same
+> RPCs but are tenant-managed only as a byproduct of instance
+> configuration.
+
+An operator configures three things independently to control what reaches
+what:
+
+| Concern | What it determines | Operator primitive |
+|---|---|---|
+| Subnet attachment | Which VPC's VRF an interface joins, and which CIDR pool its IP comes from | **VpcPrefix** attached to a VPC |
+| Routing (L3) | Reachability between subnets and between VPCs across the fabric | VPC and its routing profile (type-5 EVPN imports / exports, route leaking) |
+| Filtering (L3 / L4) | Which permitted flows reach an instance's prefixes and ports | Network Security Group |
+
+An instance may have interfaces drawing from VpcPrefixes in **several VPCs
+at once** (for example, a tenant-data VPC and a separate storage VPC). Each
+VPC that the instance touches becomes its own VRF on the DPU; nothing
+forwards between VRFs by default.
+
+---
+
+## VXLAN / EVPN Underlay (in Brief)
+
+NICo carries tenant traffic over a VXLAN / EVPN overlay. Each DPU is a VTEP
+that peers with the site fabric (route servers or top-of-rack switches) using
+BGP EVPN. Each VPC is identified on the overlay by a VNI; the per-VPC VRF on
+the DPU imports and exports BGP routes tagged with route-targets derived from
+the VPC's VNI and the site's `datacenter_asn`. Isolation between VPCs is a
+direct consequence: a VRF imports only the route-targets its routing profile
+declares, so a route advertised in one VPC does not appear in another VPC's
+forwarding table.
+
+The tenant overlay is a **pure type-5 EVPN (IP-prefix) overlay**. NICo does
+not stretch any tenant L2 segment across the fabric. The host-to-DPU link is
+layer-2 (the segment's VLAN ID is the multiplexer), the DPU acts as the L3
+gateway via the segment's SVI, and the DPU re-advertises the host's
+instance route into the fabric as a type-5 EVPN prefix tagged with the VPC's
+route-target. Cross-host reachability inside a tenant VPC is therefore an
+L3 routing decision, never an L2 bridging decision; the segment's VNI is an
+L3VNI identifying the parent VPC's VRF, not an L2VNI extending a broadcast
+domain.
+
+The admin overlay (see [Default Isolation](#default-isolation-the-admin-overlay))
+is the one exception: admin segments carry both an L2VNI and an L3VNI to
+support admin-side workflows that occasionally require L2 reachability.
+Tenant segments never do.
+
+For the full BGP / route-target / VTEP picture (loopback pools, per-DPU ASN,
+internal vs. external VNI pools, default-route mechanisms, and the
+configuration checklist for a new site), follow
+[VPC Network Virtualization](../vpc/vpc_network_virtualization.md). This page
+does not duplicate that material.
+
+---
+
+## Routing Isolation: VPCs and VRFs
+
+A VPC is the unit of routing isolation:
+
+- Every VPC has its own VRF on every DPU that hosts an instance with an
+  interface in that VPC. The VRF holds only the routes the VPC's routing
+  profile permits.
+- Routes inside one VPC's VRF are not visible to another VPC's VRF. This is
+  enforced by BGP route-target import / export, not by ACLs, and applies
+  identically to instances of the same tenant and instances of different
+  tenants.
+- Reachability between two VPCs is opt-in. An operator can establish it
+  through:
+  - **VPC peering** — see
+    [VPC Peering](../vpc/vpc_peering_management.md). Peered VPCs install
+    each other's host routes via additional route-target imports.
+  - **A shared external route-target** declared in both VPCs' routing
+    profiles, when the network team operates a transit VRF.
+  - **Controlled route leaking** between a VPC VRF and the underlay default
+    VRF via the `leak_*` fields on the routing profile. This is intended for
+    internet access or for narrow injection of underlay prefixes, and is
+    described in detail under
+    [VPC Network Virtualization → Controlled Route Leaking](../vpc/vpc_network_virtualization.md#controlled-route-leaking).
+- The set of prefixes a tenant must never reach (for example, the management
+  plane) is declared once site-wide in `deny_prefixes` in the API server
+  configuration. The DPU enforces this as an ACL on every tenant VRF.
+
+---
+
+## Default Isolation: The Admin Overlay
+
+NICo guarantees that a managed host is never permitted to carry tenant
+traffic unless an explicit tenant configuration places it into a VPC. This
+guarantee is upheld by an **admin overlay** that is separate from every
+tenant VPC.
+
+- During site initialisation NICo creates an admin VPC and a set of admin
+  network segments. These are not tenant-visible and exist only to give the
+  DPU somewhere safe to attach a host before, between, and after tenant
+  allocations.
+- When the API server assembles the per-host network configuration for a
+  DPU, it sets `use_admin_network = true` in `ManagedHostNetworkConfig`
+  whenever the host has no instance allocated, the instance has no
+  interfaces configured for this DPU, or the host is in a transient
+  lifecycle state in which tenant traffic must not flow. The DPU agent
+  then places every host interface on the admin overlay instead of any
+  tenant VPC's VRF.
+- A DPU that cannot retrieve its configuration at all — for example,
+  because the host is unknown to NICo and `GetManagedHostNetworkConfig`
+  returns `NOT_FOUND` — places itself into **isolated mode**: every host
+  interface is detached from any tenant overlay until NICo issues an
+  explicit configuration. This is the fail-closed default; nothing in the
+  data path will silently fall back to a tenant network.
+- The same admin-overlay path is used to enforce isolation during
+  instance termination. The instance state machine blocks the
+  termination flow until the DPU confirms that all tenant interfaces have
+  been moved off tenant VPCs and onto the admin overlay (or that the
+  machine has been tagged with a health alert preventing reuse). This is
+  what guarantees that a tenant whose instance has been released cannot
+  remain on the wire as a "ghost instance".
+
+A "default VPC" in the cloud-provider sense — a system-created VPC that
+every tenant inherits — does not exist in NICo. Tenants get the VPCs an
+operator (or the tenant API) creates for them; absent any such VPC, an
+instance has nowhere to send tenant traffic.
+
+For the deeper handler-level account, see
+[VPC Network Virtualization → How a DPU Gets Its Configuration](../vpc/vpc_network_virtualization.md#how-a-dpu-gets-its-configuration).
+
+---
+
+## Subnet Attachment: VPC Prefixes
+
+A **VpcPrefix** is the tenant-facing primitive for declaring an IPv4 or
+IPv6 CIDR pool that a VPC may draw instance-interface addresses from.
+Creating a VpcPrefix is how a tenant says "instances allocated into this
+VPC should get IPs from this CIDR."
+
+A VpcPrefix has:
+
+- A **parent VPC**, set at creation time. The prefix cannot be moved
+  between VPCs.
+- A **CIDR** (`config.prefix`), IPv4 or IPv6, that NICo carves /31
+  link-nets out of when instance interfaces are allocated.
+- A **status** field reporting `total_31_segments` and
+  `available_31_segments`. A prefix is exhausted when available reaches
+  zero; further interface allocations against this prefix fail until
+  either another VpcPrefix with capacity is attached to the same VPC, or
+  the exhausted prefix is replaced.
+- A **metadata** block (name, labels) for operator and tenant
+  bookkeeping.
+
+A VPC may have any number of VpcPrefixes attached. When an instance
+interface is created in a VPC, NICo selects one of that VPC's prefixes
+with available capacity and vends the next free /31 from it: one address
+to the instance, the other to the DPU's SVI in the VPC's VRF.
+
+The fabric carries the prefix as a type-5 EVPN route in the parent VPC's
+VRF, tagged with the route-targets the VPC's routing profile declares.
+There is no per-prefix VNI; the L3VNI is the VPC's VNI and applies
+uniformly to every prefix attached to that VPC.
+
+### How a Tenant Creates a VpcPrefix
+
+In the FNN model, prefixes are created via the gRPC `CreateVpcPrefix` RPC
+or its REST equivalent (`POST /v2/org/{org}/carbide/vpc-prefix`).
+Operators can drive the same flow via `admin-cli`:
+
+```
+nico-admin-cli vpc-prefix create --vpc-id --prefix [--name