OpenTela (also known as OpenFabric) is a distributed computing platform designed to orchestrate computing resources across a decentralized network. It combines peer-to-peer networking with CRDT-based state management to create a resilient and scalable network of computing resources. It powers the serving system of the SwissAI Initiative.
Tela is the Latin word for "fabric", referring to the interconnected network of computing resources that OpenTela manages.
- [2026/02] 💡 How SwissAI Leverages OpenTela: We wrote a case study on how SwissAI uses OpenTela to orchestrate its distributed GPU nodes for scalable model serving. Read more.
- Decentralized Orchestration: OpenTela eliminates the need for a central coordinator by using a gossip-based P2P network. It utilizes a Conflict-free Replicated Data Type (CRDT) registry to manage service discovery, health monitoring, and routing across distributed nodes. This architecture allows the system to remain operational and maintain a global view of resources even during network partitions.
- Non-Invasive HPC Integration: Designed specifically for the constraints of supercomputing environments, the system operates entirely as a user-space overlay. It bridges the gap between batch schedulers (like Slurm) and interactive serving engines (like vLLM or SGLang) without requiring root privileges or kernel modifications. This allows researchers to spin up "cloud-like" serving clusters using standard permissions.
- Robust Fault Tolerance and Elasticity: OpenTela is built for high-churn environments where resources are volatile or preemptible (e.g., scavenger queues, preemptible cloud instances, or Slurm preemption). It uses peer-to-peer heartbeats to detect node failures within seconds, automatically marking failed nodes as "LEFT" and rerouting traffic to healthy replicas without service interruption.
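The CRDT registry and heartbeat mechanism described above can be sketched as a last-writer-wins (LWW) map keyed by node ID. This is an illustrative simplification, not OpenTela's actual implementation: the names `Registry`, `heartbeat`, `sweep`, and the `HEARTBEAT_TIMEOUT` value are assumptions for the example. The key property is that `merge` is commutative, associative, and idempotent, so gossiped updates converge regardless of delivery order.

```python
import time

# Assumed timeout for the sketch; the real system's window may differ.
HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a node is LEFT


class Registry:
    """Hypothetical LWW-map CRDT: node_id -> (timestamp, status)."""

    def __init__(self):
        self.entries = {}  # node_id -> (ts, status)

    def heartbeat(self, node_id, ts=None):
        ts = ts if ts is not None else time.time()
        self._apply(node_id, ts, "ALIVE")

    def _apply(self, node_id, ts, status):
        cur = self.entries.get(node_id)
        # Last-writer-wins: keep whichever entry has the newer timestamp.
        if cur is None or ts > cur[0]:
            self.entries[node_id] = (ts, status)

    def merge(self, other):
        # Merging replicas is order-independent and idempotent,
        # which is what lets gossip work without a coordinator.
        for node_id, (ts, status) in other.entries.items():
            self._apply(node_id, ts, status)

    def sweep(self, now=None):
        # Mark nodes whose heartbeats have expired as LEFT.
        now = now if now is not None else time.time()
        for node_id, (ts, status) in list(self.entries.items()):
            if status == "ALIVE" and now - ts > HEARTBEAT_TIMEOUT:
                self._apply(node_id, now, "LEFT")

    def alive(self):
        return [n for n, (_, s) in self.entries.items() if s == "ALIVE"]
```

Two replicas that merge each other's state end up with identical entries, and a replica that stops heartbeating is demoted to "LEFT" on the next sweep rather than deleted, so its departure also propagates through gossip.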
- OpenTela is used to power SwissAI Serving. It acts as the decentralized orchestration layer, routing inference requests to distributed GPU nodes while managing state, metrics, and peer discovery to ensure resilient and scalable model serving.
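The routing behavior described above can be illustrated with a minimal sketch: a router that round-robins requests over whichever replicas the registry currently reports as healthy. The names `make_router` and `pick_replica` are hypothetical, chosen for the example rather than taken from OpenTela's interface.

```python
import itertools

def make_router(get_alive_replicas):
    """Return a picker that round-robins over currently healthy replicas.

    get_alive_replicas: callable returning the set of replica IDs the
    registry currently considers ALIVE (an assumption of this sketch).
    """
    counter = itertools.count()

    def pick_replica():
        # Re-read the healthy set on every request: a failed node drops
        # out of the ALIVE set after the registry's next sweep, so
        # traffic shifts to the remaining replicas without intervention.
        replicas = sorted(get_alive_replicas())
        if not replicas:
            raise RuntimeError("no healthy replicas available")
        return replicas[next(counter) % len(replicas)]

    return pick_replica
```

Because the healthy set is consulted per request rather than cached, failover requires no explicit rerouting step: the dead replica simply stops appearing in the rotation.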
- Installation — Download and install OpenTela
- Spin Up LLM Serving — Set up multi-LLM serving cluster
- Request Routing — Understand how requests are routed
- Wallet & Ownership — Manage Solana wallets and node identity
- Solana Settlement — Configure automated usage billing
- Docker Serving — Use Docker containers for LLM serving
- Glossary — Key terms and concepts
- CRDT Internals — How CRDT synchronization works
- CRDT Tombstones — Node departure handling
- Security Hardening — Build attestation, trust, and access control
- Performance Benchmark — Proxy latency measurements
- Large-Scale Simulation — Run 100+ node simulations
- Fleet Manager — Deploy to SLURM clusters with otela-fleet
Contributions are welcome! Please follow the code of conduct and submit pull requests for any enhancements or bug fixes.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.