⚠️ WARNING: Experimental Project

This project is a proof of concept (PoC) and is not ready for production use. It was created to demonstrate the capabilities of the Scatter language in the context of distributed LLM inference.
Current limitations:
- Code not optimized for production
- API subject to change without notice
- Incomplete documentation
- Limited testing
This project primarily serves as a demonstration of Scatter's features for distributed machine learning.
ScatterLM is a distributed large language model (LLM) inference engine built on the Scatter runtime. It provides:
- GGUF Model Support: Native loading and inference of GGUF quantized models
- Multi-node Distribution: Automatic model sharding across multiple nodes (see the sharding sketch after this list)
- Metal GPU Acceleration: Optimized dequantization kernels for Apple Silicon
- Quantization Support: Q4_K, Q6_K, and other GGML quantization formats (see the dequantization sketch after this list)
- Streaming Inference: Token-by-token generation with async support (see the streaming sketch after this list)
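One common strategy for multi-node distribution is pipeline-style sharding: the transformer's layer stack is split into contiguous ranges, one per node. Whether ScatterLM shards by layers is an assumption here; the Nim helper below is only a hypothetical sketch of the partitioning step, not part of the actual API.

# Hypothetical helper: partition nLayers transformer layers into
# contiguous ranges, one per node, differing in size by at most one.
proc shardLayers(nLayers, nNodes: int): seq[Slice[int]] =
  let base = nLayers div nNodes    # minimum layers per node
  let extra = nLayers mod nNodes   # first `extra` nodes take one more
  var start = 0
  for node in 0 ..< nNodes:
    let count = base + (if node < extra: 1 else: 0)
    result.add(start .. start + count - 1)
    start += count

when isMainModule:
  # 32 layers over 3 nodes -> 0..10, 11..21, 22..31
  for r in shardLayers(32, 3):
    echo r.a, "..", r.b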
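The GGML block formats named above pack weights into fixed-size blocks, each carrying low-bit values plus scale data. As a hedged illustration of the general dequantization idea, here is a Nim sketch for the simpler Q4_0 layout (32 weights per block, one scale, two 4-bit values per byte, each stored with a bias of 8); Q4_K and Q6_K add per-sub-block scales on top of the same scheme. The names below are illustrative, not the actual quants.nim API.

const QK4_0 = 32  # weights per Q4_0 block

type BlockQ4_0 = object
  d: float32                     # block scale (float16 on disk)
  qs: array[QK4_0 div 2, uint8]  # 32 x 4-bit values, two per byte

proc dequantizeQ4_0(blocks: openArray[BlockQ4_0]): seq[float32] =
  ## Expand packed 4-bit weights back to float32.
  result = newSeq[float32](blocks.len * QK4_0)
  for bi, blk in blocks:
    for j in 0 ..< QK4_0 div 2:
      let b = blk.qs[j]
      # low nibble -> element j, high nibble -> element j + 16;
      # the stored bias of 8 makes the value range -8..7
      result[bi * QK4_0 + j] = (float32(b and 0x0F) - 8.0'f32) * blk.d
      result[bi * QK4_0 + j + QK4_0 div 2] = (float32(b shr 4) - 8.0'f32) * blk.d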
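Token-by-token streaming boils down to a loop: sample a token, stop on end-of-sequence, otherwise append it to the context and emit its decoded text. A minimal Nim sketch of that loop as a closure iterator, where Model, nextToken, and decode are hypothetical stand-ins for the real runtime pieces:

type Model = ref object  # placeholder for the real model handle

proc nextToken(m: Model, ctx: seq[int32]): int32 = discard  # stub: the engine runs the transformer here
proc decode(m: Model, tok: int32): string = discard         # stub: token id -> text

iterator generateStream(m: Model, prompt: seq[int32],
                        maxTokens: int, eosId: int32): string {.closure.} =
  var ctx = prompt
  for _ in 0 ..< maxTokens:
    let tok = m.nextToken(ctx)
    if tok == eosId: break  # stop at end-of-sequence
    ctx.add tok             # feed the token back as context
    yield m.decode(tok)     # hand one decoded piece to the caller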
Requirements:
- Scatter language and runtime (see scatter)
- Nim >= 2.0
- macOS with Metal (for GPU acceleration) or Linux
Quick start (Scatter):

import ml/gguf
import ml/inference
# Load a quantized model
let model = loadModel("models/mistral-7b-q4_k.gguf")
# Generate text
let response = model.generate("Hello, how are you?", maxTokens=100)
print(response)
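loadModel presumably starts by parsing the GGUF container, whose fixed header is small: the 4-byte magic "GGUF", a uint32 format version, then the tensor count and metadata key/value count as uint64s. A hedged Nim sketch of reading just those fields (the real parser lives in src/runtime/gguf.nim; readGgufHeader here is illustrative and assumes a little-endian host, which matches GGUF's on-disk byte order):

import std/streams

type GgufHeader = object
  version: uint32
  tensorCount, kvCount: uint64

proc readGgufHeader(path: string): GgufHeader =
  ## Read only the fixed-size header; metadata key/value pairs and
  ## tensor descriptors follow immediately after these fields.
  let s = newFileStream(path, fmRead)
  doAssert s != nil, "cannot open " & path
  defer: s.close()
  var magic: array[4, char]
  doAssert s.readData(addr magic, 4) == 4 and magic == ['G', 'G', 'U', 'F'],
    "not a GGUF file"
  result.version = s.readUint32()      # GGUF format version (3 at time of writing)
  result.tensorCount = s.readUint64()  # number of tensors in the file
  result.kvCount = s.readUint64()      # number of metadata key/value pairs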
Project structure:

scatterlm/
├── src/runtime/ # Nim runtime modules for LLM inference
│ ├── gguf.nim # GGUF file format parser
│ ├── tensor.nim # Tensor operations
│ ├── quants.nim # Quantization/dequantization
│ ├── metal_gpu.nim # Metal GPU kernels
│ └── ...
├── stdlib/ml/ # Scatter ML standard library
├── poc/ # Proof of concept - core Scatter implementation
├── examples/ # Example scripts demonstrating usage
├── tests/ # Test suite
└── docs/ # Documentation
See docs/ for detailed documentation.
MIT License