A real-time D3.js training dashboard that visualizes GPU metrics, loss curves, gradient flow, and more — with live streaming from Google Colab or Kaggle notebooks via cloudflared tunnels.
- 12 interactive D3.js charts: Loss + Accuracy, GPU Utilization, Memory, Throughput, Gradient Norms, LR Schedule, Step Time Histogram, Gradient Flow Heatmap, ZeRO Memory Breakdown, System Health, and more
- Live streaming from Colab/Kaggle: Paste a tunnel URL and watch training in real-time
- Model-agnostic: Works with any model — titles and charts adapt dynamically
- Gated Self-Attention: Novel architecture modification with per-layer gate visualization
- Real GPU metrics: `pynvml` for utilization/memory, CUDA events for timing, per-layer profiling
- Single-file dashboard: No build step, no dependencies beyond the D3.js CDN
- DeepSpeed ZeRO Stage 2: FP16, CPU optimizer offload, gradient accumulation
Open the GitHub Pages dashboard — it auto-loads real BERT-Large training data (460 steps, 11 evaluations, trained on IMDB with Gated Attention on a T4 GPU).
You can also drag & drop any training_metrics.json onto the dashboard to visualize a different run.
BERT-Large with Gated Attention (recommended):
GPT-2 Fine-Tuning (beginner-friendly, self-contained server + tunnel):
```
Colab/Kaggle (GPU)                        Your Browser
┌──────────────────┐                    ┌──────────────────────┐
│ Training Loop    │    HTTP POST       │  D3.js Dashboard     │
│ + Monitor class  ├───────────────────►│  (GitHub Pages or    │
│ + pynvml GPU     │    /api/push       │  local server.py)    │
│ + CUDA timing    │                    │                      │
└──────────────────┘                    └──────────────────────┘
         │                                        ▲
         │           cloudflared tunnel           │
         └── https://xxx.trycloudflare.com ───────┘
```
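The push path above can be sketched with nothing beyond the Python standard library. Note that the payload field names and helper names below are illustrative assumptions, not the project's API; consult `deepseed_monitor.py` for the schema the dashboard actually expects:

```python
import json
import urllib.request

def build_push_request(dashboard_url, step, payload):
    """Build the POST request for one metrics sample.

    Field names in `payload` are illustrative; see deepseed_monitor.py
    for the schema the dashboard actually expects.
    """
    body = json.dumps({"step": step, **payload}).encode("utf-8")
    return urllib.request.Request(
        dashboard_url.rstrip("/") + "/api/push",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def push_metrics(dashboard_url, step, payload, timeout=5):
    """Send one sample to the dashboard; returns the HTTP status code."""
    req = build_push_request(dashboard_url, step, payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status

# Usage (hypothetical tunnel URL):
# push_metrics("https://xxx.trycloudflare.com", 42,
#              {"loss": 0.31, "gpu_util": 87})
```

Because only `urllib` and `json` are used, the same snippet runs unchanged inside a Colab or Kaggle cell without extra installs.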
Option A — Self-contained (GPT-2 notebook):
- Open the GPT-2 Colab notebook above
- Run All — it installs deps, starts a dashboard server + cloudflared tunnel inside Colab
- Click the tunnel URL printed in the output — that's your live dashboard
- Training starts automatically and charts update in real-time
Option B — Remote dashboard (BERT notebook):
- On your local machine: `python3 server.py --tunnel`
- Copy the tunnel URL printed in the terminal
- Open the BERT Colab notebook and paste the URL into `DASHBOARD_URL`
- Run All — metrics stream to your local dashboard
```
deepseed/
├── index.html                   # Single-file D3.js dashboard (~3300 lines)
├── training_metrics.json        # Real BERT-Large training data (460 steps)
├── server.py                    # Python HTTP server with SSE + CORS + tunnel support
├── deepseed_monitor.py          # Python client library for pushing metrics (stdlib only)
├── deepspeed_bert_colab.ipynb   # BERT-Large + Gated Attention notebook
├── deepspeed_gpt2_colab.ipynb   # GPT-2 fine-tuning notebook (self-contained)
├── run_charts.py                # Static chart generation (matplotlib)
├── orchestrator/                # K8s control plane (job store, Kaggle controller)
├── k8s/                         # Kustomize manifests for K8s deployment
├── k8s_deploy.py                # K8s deployer CLI
├── jobs/                        # Job runner implementations
├── charts/                      # Pre-rendered chart assets
└── multigpu_lora/               # Multi-GPU LoRA fine-tuning experiments
```
| Chart | What it shows |
|---|---|
| Training & Validation Loss | Loss curve with EMA overlay (α=0.1) + validation dots |
| Validation Accuracy & F1 | Eval metrics over training steps |
| GPU Utilization | Real pynvml GPU utilization percentage |
| GPU Memory Usage | VRAM consumption over time |
| Training Throughput | Samples/second with moving average |
| LR Schedule | Learning rate warmup + linear decay |
| Gradient Norms | Gradient magnitude tracking |
| Step Time Breakdown | Forward / Backward / Optimizer / Communication stacked bars |
| Gradient Flow Heatmap | Per-layer forward timing heatmap |
| ZeRO Memory | Parameter / Gradient / Optimizer / Activation memory breakdown |
| Step Time Histogram | Distribution of step durations |
| System Health | Live GPU/memory stats (live mode only) |
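The EMA overlay on the loss chart applies standard recursive exponential smoothing with α = 0.1. A minimal sketch, assuming the series is seeded with its first raw value (the dashboard's exact initialization may differ):

```python
def ema(values, alpha=0.1):
    """Exponential moving average as used for the loss overlay:
    smoothed[t] = alpha * value[t] + (1 - alpha) * smoothed[t-1],
    seeded with the first raw value."""
    smoothed = []
    for v in values:
        if not smoothed:
            smoothed.append(v)
        else:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed
```

With α = 0.1, each point carries only 10% of the newest sample, which is why the overlay tracks the trend of a noisy loss curve rather than its spikes.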
The BERT notebook includes a novel Gated Self-Attention mechanism:
```
g = sigmoid(W_g * attention_output)        # learnable gate per position
output = g * attention_output + (1 - g) * x  # blend attend vs. skip
```
- Adds only 24,576 parameters to BERT-Large (0.007% overhead)
- Early layers learn to partially skip attention, late layers attend fully
- Improves convergence and gradient flow
- Gate evolution is tracked and visualized in the dashboard
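The 24,576-parameter figure is consistent with BERT-Large's dimensions (hidden size 1024, 24 encoder layers) under the assumption of one bias-free gate vector per layer, each mapping a position's hidden state to a scalar gate. A minimal numeric sketch; the real module is a batched PyTorch matmul, and `gate` below is a hypothetical single-position illustration:

```python
import math

HIDDEN = 1024   # BERT-Large hidden size (assumption from model config)
LAYERS = 24     # BERT-Large encoder layers

# One gate vector W_g per layer, each of length HIDDEN, no bias
extra_params = HIDDEN * LAYERS  # 24,576

def gate(attn_out, x, w_g):
    """Per-position gated blend g*attn_out + (1-g)*x for one position.

    attn_out, x, w_g are equal-length vectors (plain lists here);
    g = sigmoid(w_g . attn_out) is a scalar per position.
    """
    g = 1.0 / (1.0 + math.exp(-sum(a * w for a, w in zip(attn_out, w_g))))
    return [g * a + (1.0 - g) * xi for a, xi in zip(attn_out, x)]
```

At initialization with `w_g = 0`, the gate sits at 0.5, so each layer starts as an even blend of attention output and residual input; training then pushes early layers toward the skip path and late layers toward full attention, as described above.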
The included training_metrics.json contains real training data from fine-tuning BERT-Large (335M params) with Gated Attention on IMDB sentiment classification:
- GPU: NVIDIA Tesla T4 (15.8 GB)
- Optimizer: DeepSpeed ZeRO Stage 2 + FP16 + CPU offload
- Dataset: IMDB (25K train / 25K test)
- Results: 460 logged steps, 11 evaluations, ~93% validation accuracy
- Metrics: Real GPU utilization, memory, per-layer timing via CUDA events
```bash
# Start the dashboard server with a cloudflared tunnel
python3 server.py --tunnel

# Or just serve locally
python3 server.py
# Open http://localhost:8080
```
- Dashboard: Vanilla HTML + CSS + D3.js v7 (single file, no build step)
- Training: PyTorch + DeepSpeed + Hugging Face Transformers
- Metrics: pynvml (GPU), CUDA events (timing), custom Monitor class
- Tunnel: cloudflared (Cloudflare Tunnel) for remote access
- Orchestration: Kubernetes + custom control plane for Kaggle job management
MIT

