- Introduction
- Use Cases
- What is ByteNite?
- Project Structure
- CPU vs. GPU Versions
- Prerequisites
- Installing the ByteNite Developer CLI
- LLM Serving App Components
- Running an LLM Serving Job on ByteNite
- Troubleshooting & FAQ
- References
This repository provides an LLM serving pipeline designed to run on ByteNite's distributed, serverless container platform. It supports both CPU and GPU execution, so text generation can scale with minimal infrastructure configuration. The pipeline runs a quantized Llama 4 Scout model (17B active parameters) via llama-cpp-python for inference on both CPU and GPU architectures.
The Llama 4 Scout model is an open-source large language model suited to text generation, question answering, and complex reasoning tasks. Because it is open source, it can be inspected and customized, and quantization keeps the resource requirements of its 17B-active-parameter architecture manageable.
This LLM serving pipeline is ideal for:
- Content Generation: Marketing copy, documentation, creative writing
- Document Analysis: Summarization, question answering, information extraction
- Code Generation: Programming assistance and code completion
- Research Applications: Academic writing, data analysis, report generation
- Financial Analysis: News classification, sentiment analysis, report summarization
- Customer Support: Automated responses, FAQ generation, ticket classification
ByteNite is a serverless container platform for stateless, compute-intensive workloads. It abstracts away cloud infrastructure, letting you focus on your application logic. ByteNite provides:
- Near-instant startup times and flexible compute
- Distributed execution fabric (native fan-in/fan-out logic)
- Modular building blocks: Partitioners, Apps, Assemblers
- Simple job submission via CLI or API
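The Partitioner, App, and Assembler building blocks can be pictured with a small, purely illustrative sketch. The function names below are hypothetical and are not part of any ByteNite SDK; they only show the fan-out/fan-in shape of a job, with plain Python threads standing in for distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_job(
    data: str,
    partition: Callable[[str], List[str]],
    app: Callable[[str], str],
    assemble: Callable[[List[str]], str],
) -> str:
    """Fan out: split the input and run the app per chunk. Fan in: merge results."""
    chunks = partition(data)
    with ThreadPoolExecutor() as pool:
        # map() preserves chunk order, so the assembler sees results in order.
        results = list(pool.map(app, chunks))
    return assemble(results)


def passthrough_partition(data: str) -> List[str]:
    """A passthrough partitioner: the whole input becomes a single chunk."""
    return [data]


def passthrough_assemble(results: List[str]) -> str:
    """A passthrough assembler: return the single result unchanged."""
    return results[0]
```

With the two passthrough functions, `run_job` runs the app exactly once on the whole input, which is the shape a single-prompt LLM job takes.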
llm-serving/
├── llama4-app-cpu/                      # CPU-optimized LLM app
│   ├── manifest.json                    # App manifest (CPU resources)
│   ├── Dockerfile                       # CPU container setup
│   └── app/main.py                      # Main LLAMA4 CPU app logic
├── llama4-app-gpu/                      # GPU-optimized LLM app
│   ├── manifest.json                    # App manifest (GPU resources)
│   ├── Dockerfile                       # GPU container setup (CUDA)
│   └── app/main.py                      # Main LLAMA4 GPU app logic
├── templates/
│   ├── llama4-app-cpu-template.json     # Centralized CPU job template
│   └── llama4-app-gpu-template.json     # Centralized GPU job template
└── README.md
CPU Version (llama4-app-cpu):
- Runs on high-core-count CPUs (minimum: 30 cores, 60GB+ RAM)
- Suitable for cost-effective inference when GPU resources are unavailable
- Uses pure CPU inference with configurable thread count
- Container:
chandrabytenite/llama4-scout-cpu:v0.11
GPU Version (llama4-app-gpu):
- Runs on NVIDIA A100 40GB GPUs
- Optimized for CUDA 12.2 with GPU layer offloading
- Significantly faster inference for real-time workloads
- Container:
chandrabytenite/llama4-scout-gpu:v0.4
Both versions use the same Llama 4 Scout model and can be deployed interchangeably depending on your hardware requirements and performance needs.
If you've already created an account, set up payment, and installed the CLI for a previous app, you can skip this section and jump straight to Installing the ByteNite Developer CLI or the next relevant step.
- Request an Access Code and fill out the resulting form with your contact info.
- Once you receive your access code, sign up on the computing platform.
- Once logged in, go to the Billing page (also reachable from the Billing tab in the sidebar).
- Locate the Payment Info card and open the Customer Portal. Add a payment method to your account through Stripe.
- Your payment info is used for manual and automatic top-ups. Keep enough funds in your balance to avoid service interruptions.
- If you have a coupon code, redeem it on your billing page to add ByteChips (credits) to your balance.
- Go to the Account Balance card and click "Redeem". Enter your coupon code and complete the process. Refresh the page to confirm the balance.
- We'd love to get you started with free credits to test our platform; contact ByteNite support to request some.
Most users should use the CLI/SDK for the easiest experience:
- Download and install the ByteNite Developer CLI (see below for instructions by OS).
- Authenticate by running:
bytenite auth
This will open a browser window for secure login.
- Once authenticated, you can use all bytenite CLI commands to manage apps, engines, templates, and jobs.
If you plan to use the ByteNite API directly (e.g., with Postman or custom scripts), you'll need an API key and access token:
- Go to your ByteNite profile or click your profile avatar (top right).
- Click New API Key, configure its settings, and enter the confirmation code sent to your email.
- Copy your API key immediately and store it securely. You will not be able to view it again.
- If a key is no longer needed or is compromised, revoke it from your profile.
- An access token is required to authenticate all requests to the ByteNite API (including Postman).
- Request an access token from the Access Token endpoint using your API key. Tokens last 1 hour by default.
- See the API Reference for details and example requests.
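Because tokens last 1 hour by default, a script that sends many requests can reuse one token instead of requesting a new one each time. The following is a minimal sketch of that idea, not ByteNite client code: `TokenCache`, its `fetch` callable (which would wrap the access-token request), and the 60-second safety margin are all hypothetical.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Reuse an access token until shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 3600.0,   # tokens last 1 hour by default
        margin_seconds: float = 60.0,  # assumed safety margin, not a ByteNite setting
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a cached token, fetching a new one when it is near expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

The `clock` argument exists only so the expiry logic can be exercised without waiting an hour; in normal use the default monotonic clock is enough.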
- Python 3.8+ for local development or running scripts.
- Git (to clone this repository)
- (Optional) Docker if you plan to build custom containers.
-
Add the ByteNite repository:
echo "deb [trusted=yes] https://storage.googleapis.com/bytenite-prod-apt-repo/debs ./" | sudo tee /etc/apt/sources.list.d/bytenite.list
-
Update package lists:
sudo apt update
-
Install the ByteNite CLI:
sudo apt install bytenite
Troubleshooting:
- Update your system:
sudo apt update && sudo apt upgrade
- Verify repository:
cat /etc/apt/sources.list.d/bytenite.list
- Check package:
apt search bytenite
-
Add the ByteNite tap:
brew tap ByteNite2/bytenite-dev-cli https://github.com/ByteNite2/bytenite-dev-cli.git
-
Install the CLI:
brew install bytenite
-
Update Homebrew:
brew update
-
Upgrade ByteNite CLI:
brew upgrade bytenite
Download and run the latest Windows release from the ByteNite CLI GitHub page.
Check the CLI version:
bytenite version
Authenticate with OAuth2:
bytenite auth
This opens a browser for login. Credentials are stored at:
- Linux:
$HOME/.config/bytenite-cli/auth-prod.json
- Mac:
/Users/[user]/Library/Application Support/bytenite-cli/auth-prod.json
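The credential locations listed above can be resolved from a script, for example to check whether `bytenite auth` has already been run. This is a small sketch using only those two documented paths; the helper name `credentials_path` is hypothetical, and no location is guessed for Windows because none is given here.

```python
import sys
from pathlib import Path
from typing import Optional


def credentials_path(
    platform: str = sys.platform, home: Optional[Path] = None
) -> Optional[Path]:
    """Return the documented credentials file location, or None if unknown."""
    home = home or Path.home()
    if platform.startswith("linux"):
        return home / ".config" / "bytenite-cli" / "auth-prod.json"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "bytenite-cli" / "auth-prod.json"
    # The Windows location is not documented here, so no path is guessed.
    return None
```

Calling `credentials_path()` with no arguments uses the current platform and home directory; checking `.exists()` on the result tells you whether a login has been stored.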
Follow these steps to get up and running with your own ByteNite LLM serving pipeline:
-
Clone this repository to your own machine:
git clone <your-fork-or-this-repo-url> && cd llm-serving
(Optional but recommended) Fork this repo to your own GitHub account.
-
Install the ByteNite Developer CLI (see instructions above for your OS).
-
Authenticate with ByteNite:
bytenite auth
-
Push the apps to your ByteNite account:
bytenite app push ./llama4-app-cpu && \
bytenite app push ./llama4-app-gpu
-
Activate the apps:
bytenite app activate llama4-app-cpu && \
bytenite app activate llama4-app-gpu
-
Push the templates:
bytenite template push ./templates/llama4-app-cpu-template.json && \
bytenite template push ./templates/llama4-app-gpu-template.json
-
Launch a job using the methods described below.
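The push, activate, and template steps above can also be scripted. The sketch below only assembles the same CLI commands listed in those steps and runs them in order; the helper names are hypothetical, and it assumes you have already run `bytenite auth` and are in the repository root.

```python
import subprocess
from typing import List, Sequence

APPS = ("llama4-app-cpu", "llama4-app-gpu")


def setup_commands(apps: Sequence[str] = APPS) -> List[List[str]]:
    """Build the push, activate, and template-push commands in step order."""
    commands = [["bytenite", "app", "push", f"./{app}"] for app in apps]
    commands += [["bytenite", "app", "activate", app] for app in apps]
    commands += [
        ["bytenite", "template", "push", f"./templates/{app}-template.json"]
        for app in apps
    ]
    return commands


def run_setup(dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for cmd in setup_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command, like `&&` in a shell.
            subprocess.run(cmd, check=True)
```

Running `run_setup()` with the default `dry_run=True` only prints the commands, which is a cheap way to review them before calling `run_setup(dry_run=False)`.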
Run the help command to see all options:
bytenite --help
Most users only need these commands in order:
- bytenite app push [app_folder]
- bytenite app activate [app_tag]
- bytenite app status [app_tag] (to check status)
For more commands, run bytenite --help or see the ByteNite documentation.
- Implements text generation using the Llama 4 Scout model via llama-cpp-python
- Accepts a text prompt and generates a response
- Uses passthrough partitioner/assembler (no data splitting required)
- See app/main.py for implementation details
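The overall flow of the app (read a prompt, generate text, write the result) might look roughly like the sketch below. It is not the code in app/main.py. The environment variable names `APP_PARAMS` and `TASK_RESULTS_DIR` are assumptions about how parameters and the results directory reach the container, the `output.txt` file name is hypothetical, and `generate` is a placeholder for the actual llama-cpp-python call.

```python
import json
import os
from pathlib import Path
from typing import Callable, Mapping


def run_task(
    generate: Callable[[str, int], str],
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Read the prompt from job parameters, generate text, and save it."""
    # ASSUMPTION: parameters arrive as JSON in APP_PARAMS and results belong
    # in TASK_RESULTS_DIR; check app/main.py for the names actually used.
    params = json.loads(env.get("APP_PARAMS", "{}"))
    results_dir = Path(env.get("TASK_RESULTS_DIR", "."))

    prompt = params["prompt"]
    max_tokens = int(params.get("max_tokens", 256))  # documented default

    text = generate(prompt, max_tokens)

    results_dir.mkdir(parents=True, exist_ok=True)
    output_path = results_dir / "output.txt"  # hypothetical file name
    output_path.write_text(text, encoding="utf-8")
    return output_path
```

Passing the generator in as a function keeps the file handling testable without loading a model.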
- prompt: The input text prompt for the model to respond to
- n_threads: Number of CPU threads for parallel processing (CPU: recommended 59, GPU: recommended 23)
- gpu_layers: Number of model layers to offload to GPU (GPU only, recommended: 30)
- n_ctx: Maximum context size (input + output tokens combined, default: 2048)
- max_tokens: Maximum number of tokens to generate in response (default: 256)
- n_batch: Number of tokens processed in parallel during inference (larger = faster but more VRAM)
These parameters help optimize resource usage and performance for both CPU and GPU instances.
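As a rough illustration of how these parameters relate to one another, the sketch below applies the documented defaults and recommended values and splits them into model-loading and generation arguments. The helper `split_params` is hypothetical, and mapping `gpu_layers` to llama-cpp-python's `n_gpu_layers` argument is an assumption about how app/main.py wires things up.

```python
from typing import Any, Dict, Tuple


def split_params(
    params: Dict[str, Any], use_gpu: bool
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split job parameters into model-loading and generation keyword arguments."""
    n_ctx = int(params.get("n_ctx", 2048))           # documented default
    max_tokens = int(params.get("max_tokens", 256))  # documented default
    if max_tokens >= n_ctx:
        # n_ctx covers input and output tokens combined, so the prompt needs room too.
        raise ValueError("max_tokens must be smaller than n_ctx")

    load_kwargs: Dict[str, Any] = {
        "n_ctx": n_ctx,
        # Recommended thread counts: 23 on GPU instances, 59 on CPU instances.
        "n_threads": int(params.get("n_threads", 23 if use_gpu else 59)),
        # ASSUMPTION: gpu_layers maps to llama-cpp-python's n_gpu_layers.
        "n_gpu_layers": int(params.get("gpu_layers", 30)) if use_gpu else 0,
    }
    if "n_batch" in params:
        load_kwargs["n_batch"] = int(params["n_batch"])

    return load_kwargs, {"max_tokens": max_tokens}
```

The first dictionary could then be unpacked into the `llama_cpp.Llama` constructor alongside a model path, and the second into the completion call. The `max_tokens < n_ctx` check reflects that the context window has to hold the prompt as well as the response.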
- Templates define job configuration and parameters
- CPU template: Uses llama4-app-cpu-template
- GPU template: Uses llama4-app-gpu-template
To launch an LLM serving job, create a new ByteNite job with llama4-app-gpu-template or llama4-app-cpu-template as the templateId and provide your prompt and parameters. The model processes your input and generates a text response, and the results are saved to the output directory, where you can access them via the ByteNite UI or API. You can monitor job progress and view logs directly through the ByteNite platform.
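For scripted submissions, the request body can be assembled in Python. The sketch below mirrors the example request bodies in this README and fills in the recommended parameter values; the helper name `build_job_payload` is hypothetical, and because the job-creation endpoint is not given here, the sketch stops at building the body (see the API Reference for where to send it).

```python
from typing import Any, Dict


def build_job_payload(prompt: str, use_gpu: bool, **overrides: Any) -> Dict[str, Any]:
    """Build a job request body for the CPU or GPU template."""
    app_params: Dict[str, Any] = {
        "prompt": prompt,
        "n_threads": 23 if use_gpu else 59,  # recommended values from this README
        "n_ctx": 2048,
        "max_tokens": 256,
    }
    if use_gpu:
        app_params["gpu_layers"] = 30  # GPU only
    app_params.update(overrides)

    template = "llama4-app-gpu-template" if use_gpu else "llama4-app-cpu-template"
    return {
        "templateId": template,
        "description": "LLM text generation job",
        "params": {"partitioner": {}, "app": app_params, "assembler": {}},
        "dataSource": {"dataSourceDescriptor": "bypass"},
        "dataDestination": {"dataSourceDescriptor": "bucket"},
        "config": {"isTestJob": True, "jobTimeout": 3600, "taskTimeout": 3600},
    }
```

For example, `build_job_payload("Make me laugh", use_gpu=False, max_tokens=64)` produces a CPU job body with a shorter response limit; serialize it with `json.dumps` before sending.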
Follow these steps to launch a job using the ByteNite GUI:
- Go to https://computing.bytenite.com and log in.
- Navigate to the Templates section in the sidebar.
- Select your desired template (e.g., llama4-app-gpu-template or llama4-app-cpu-template) and click on it to create a new job.
- In the job configuration form, fill in the required App parameters (see Configurable App Parameters). Make sure to add all necessary fields (e.g., prompt, n_threads, gpu_layers, etc.) under the App section.
- For Data Source, select Bypass.
- For Data Destination, select Temporary Bucket.
- Review your configuration and click Start Job.
- Monitor job progress and logs from the job overview page. Once complete, download the results directly from the interface.
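import requests

# Exchange the API key for a short-lived access token.
response = requests.post(
    "https://api.bytenite.com/v1/auth/access_token",
    json={"apiKey": "<YOUR_API_KEY>"},
)
response.raise_for_status()  # fail early on an invalid or revoked key
token = response.json()["token"]

{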
"templateId": "llama4-app-gpu-template",
"description": "LLM text generation job",
"params": {
"partitioner": {},
"app": {
"prompt": "Make me laugh",
"gpu_layers": 30,
"n_threads": 23,
"n_ctx": 2048,
"max_tokens": 256
},
"assembler": {}
},
"dataSource": {
"dataSourceDescriptor": "bypass"
},
"dataDestination": {
"dataSourceDescriptor": "bucket"
},
"config": {
"isTestJob": true,
"jobTimeout": 3600,
"taskTimeout": 3600
}
}

{
"templateId": "llama4-app-cpu-template",
"description": "LLM text generation job",
"params": {
"partitioner": {},
"app": {
"prompt": "Make me laugh",
"n_threads": 59,
"n_ctx": 2048,
"max_tokens": 256
},
"assembler": {}
},
"dataSource": {
"dataSourceDescriptor": "bypass"
},
"dataDestination": {
"dataSourceDescriptor": "bucket"
},
"config": {
"isTestJob": true,
"jobTimeout": 3600,
"taskTimeout": 3600
}
}

- App fails to start: Check your container image and manifest.json for correct dependencies and entrypoint.
- No text output: Ensure the output path in main.py matches ByteNite's expected results directory.
- Resource errors: Increase min_cpu/min_memory or use the GPU version for heavy workloads.
- Authentication issues: Regenerate your API key and access token.
- Model loading errors: Verify the model path and ensure sufficient memory allocation.
See ByteNite Docs FAQ for more.
For questions or support, please open an issue or contact the ByteNite team via the official docs.