You work at a company that uses GPU VMs from three cloud providers:
- Crusoe Cloud — REST API, project-scoped, async operations, reservations, reboot/reset/restart
- Lambda Cloud — REST API, flat/simple, synchronous, reserved instances
- Nebius AI Cloud — gRPC interface, parent-scoped, async operations, reservation policies
Each provider has a different API design and protocol. Your job is to build a unified CLI tool that abstracts away these differences.
You may use any programming language, libraries, AI tools, and internet resources.
Build a CLI that provides a unified interface to manage VMs across all three providers.
# List all instances across all providers (or filter by provider)
vm list [--provider <name>]
# Create a new instance
vm create --provider <name> --gpu <type> --count <n> [--name <name>] [--region <region>]
# Get instance details
vm get <instance_id> --provider <name>
# Stop an instance
vm stop <instance_id> --provider <name>
# Start an instance
vm start <instance_id> --provider <name>
# Destroy/terminate an instance
vm destroy <instance_id> --provider <name>- Clean abstraction over three different API styles (two REST + one gRPC)
- Proper error handling (auth failures, not found, capacity errors)
- Readable output (table, JSON, or similar)
- Code that's easy to extend to new providers
Build on top of Layer 1 to support fleet operations — managing groups of VMs as a logical unit.
# Request N machines of a given GPU type, spread across providers
vm fleet create --gpu <type> --count <n> [--name <fleet_name>]
# List all fleets
vm fleet list
# Show fleet status (which VMs, which providers, states)
vm fleet status <fleet_name>
# Destroy an entire fleet
vm fleet destroy <fleet_name>- Cross-provider scheduling: Query all providers for capacity, then allocate across them to fulfill the request.
- Partial failure handling: If you need 8 VMs and provider A only has 3, get the rest from provider B/C.
- Rollback: If the total request can't be fulfilled, clean up any VMs that were already created.
- Fleet state tracking: Persist which VMs belong to which fleet (file, SQLite, etc.).
- Concurrent operations: Don't create VMs one at a time — parallelize across providers.
Three mock servers simulate the real cloud APIs. Start them with:
cd mock_servers
bash start_all.sh| Provider | Endpoint | API Key | Protocol |
|---|---|---|---|
| Crusoe | http://localhost:8001 |
crusoe-test-key-001 |
REST (project: proj-001) |
| Lambda | http://localhost:8002 |
lambda-test-key-001 |
REST |
| Nebius | grpc://localhost:50051 |
nebius-test-key-001 |
gRPC (parent: project-e1a2b3c4) |
# Lambda - list instances (REST)
curl -s http://localhost:8002/api/v1/instances \
-H "Authorization: Bearer lambda-test-key-001" | python3 -m json.tool
# Crusoe - list instances (REST)
curl -s http://localhost:8001/v1alpha5/projects/proj-001/compute/vms/instances \
-H "Authorization: Bearer crusoe-test-key-001" | python3 -m json.tool
# Nebius - list instances (gRPC — requires grpcurl or a gRPC client)
grpcurl -plaintext \
-H "authorization: Bearer nebius-test-key-001" \
-d '{"parent_id": "project-e1a2b3c4"}' \
localhost:50051 nebius.compute.v1.InstanceService/ListSee docs/ for detailed API documentation per provider.
| Crusoe | Lambda | Nebius | |
|---|---|---|---|
| Protocol | REST | REST | gRPC |
| Scoping | /projects/{pid}/... |
Flat /api/v1/... |
parent_id field |
| Create | POST → async operation | POST launch → sync IDs | RPC Create → async operation |
| Stop/Start | PATCH + action body | No stop/start | RPC Stop / Start |
| Delete | DELETE → async | POST terminate (batch) | RPC Delete → async |
| GPU naming | h100.8x |
gpu_8x_h100 |
gpu-h100-sxm + preset |
| Reservations | Auto-placement, explicit ID | is_reserved flag, reservation_id |
Policy: AUTO/FORBID/STRICT |
| Reboot/Reset | 3 distinct ops (see below) | Restart only | — |
| Reboot | Reset | Restart | |
|---|---|---|---|
| Method | Graceful (ACPI) | Hard (power cycle) | Full stop+start |
| State change | Stays RUNNING | Stays RUNNING | STOPPING→STOPPED→STARTING→RUNNING |
| Same host/IP | Yes | Yes | Maybe not |
| Use case | Routine restart | Unresponsive VM | Clean environment |
Each provider handles reservations differently:
- Crusoe: VMs auto-placed into cheapest matching reservation. Explicit
reservation_idoptional. Stop releases capacity, start reclaims it. - Lambda: Instances flagged
is_reserved: true. Reserved instances cannot be terminated via API. Launch withreservation_idto use reserved capacity. - Nebius:
ReservationPolicyin instance spec —AUTO(try reservation first),FORBID(always on-demand),STRICT(must use specific reservation, fail otherwise).
| Area | What we look at |
|---|---|
| Abstraction design | How cleanly do you unify 3 different APIs + protocols (REST + gRPC)? |
| Error handling | Capacity errors, partial failures, auth issues, reserved instance constraints. |
| Concurrency | Parallel VM creation in fleet mode. Not sequential one-by-one. |
| State management | Fleet tracking, reconciliation with real provider state. |
| Code quality | Readable, testable, well-structured. Not over-engineered. |
| Trade-off communication | Can you explain your design decisions? |
- Start with Layer 1. Get
listandcreateworking for one REST provider first, then add gRPC. - For Layer 2, think about what happens when things go wrong mid-flight.
- The mock servers have limited capacity — this is intentional.
- Crusoe and Nebius operations are async (2-3 second delay). Lambda is synchronous.
- Reserved instances on Lambda can't be terminated — your fleet manager needs to handle this.
- The Nebius proto files are in
mock_servers/proto/— you'll need to compile them for your language. - Don't over-engineer Layer 1 — you need time for Layer 2.