DeepSeed — Real-Time Training Dashboard & DeepSpeed Notebooks

A real-time D3.js training dashboard that visualizes GPU metrics, loss curves, gradient flow, and more — with live streaming from Google Colab or Kaggle notebooks via cloudflared tunnels.

Live Dashboard →

Dashboard Screenshot

Features

  • 12 interactive D3.js charts: Loss + Accuracy, GPU Utilization, Memory, Throughput, Gradient Norms, LR Schedule, Step Time Histogram, Gradient Flow Heatmap, ZeRO Memory Breakdown, System Health, and more
  • Live streaming from Colab/Kaggle: Paste a tunnel URL and watch training in real-time
  • Model-agnostic: Works with any model — titles and charts adapt dynamically
  • Gated Self-Attention: Novel architecture modification with per-layer gate visualization
  • Real GPU metrics: pynvml for utilization/memory, CUDA events for timing, per-layer profiling
  • Single-file dashboard: No build step, no dependencies beyond D3.js CDN
  • DeepSpeed ZeRO Stage 2: FP16, CPU optimizer offload, gradient accumulation

Quick Start

View the Dashboard (Static Mode)

Open the GitHub Pages dashboard — it auto-loads real BERT-Large training data (460 steps, 11 evaluations, trained on IMDB with Gated Attention on a T4 GPU).

You can also drag & drop any training_metrics.json onto the dashboard to visualize a different run.

Run Training in Google Colab (Live Mode)

BERT-Large with Gated Attention (recommended):

Open BERT in Colab

GPT-2 Fine-Tuning (beginner-friendly, self-contained server + tunnel):

Open GPT-2 in Colab

How Live Streaming Works

Colab/Kaggle (GPU)                    Your Browser
┌──────────────────┐                  ┌──────────────────────┐
│  Training Loop   │   HTTP POST      │  D3.js Dashboard     │
│  + Monitor class ├─────────────────►│  (GitHub Pages or    │
│  + pynvml GPU    │   /api/push      │   local server.py)   │
│  + CUDA timing   │                  │                      │
└──────────────────┘                  └──────────────────────┘
        │                                       ▲
        │  cloudflared tunnel                   │
        └── https://xxx.trycloudflare.com ──────┘

Option A — Self-contained (GPT-2 notebook):

  1. Open the GPT-2 Colab notebook above
  2. Run All — it installs deps, starts a dashboard server + cloudflared tunnel inside Colab
  3. Click the tunnel URL printed in the output — that's your live dashboard
  4. Training starts automatically and charts update in real-time

Option B — Remote dashboard (BERT notebook):

  1. On your local machine: python3 server.py --tunnel
  2. Copy the tunnel URL printed in terminal
  3. Open the BERT Colab notebook and paste the URL into DASHBOARD_URL
  4. Run All — metrics stream to your local dashboard
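Under the hood, both options boil down to the training loop POSTing JSON snapshots to the dashboard's `/api/push` endpoint (see the diagram above). A minimal stdlib-only client might look like the sketch below — the field names in the example payload (`step`, `loss`) are illustrative assumptions; the repo's `deepseed_monitor.py` is the real client library:

```python
import json
import urllib.request

def build_push_request(dashboard_url, payload):
    # Build the HTTP POST carrying one metrics snapshot to /api/push.
    return urllib.request.Request(
        dashboard_url.rstrip("/") + "/api/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def push_metrics(dashboard_url, payload, timeout=5):
    # Send the snapshot; returns the HTTP status code.
    req = build_push_request(dashboard_url, payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status

# e.g. push_metrics("https://xxx.trycloudflare.com", {"step": 10, "loss": 0.42})
```

Because the client is pure stdlib, it works unchanged inside Colab and Kaggle kernels without extra installs.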

Project Structure

deepseed/
├── index.html                    # Single-file D3.js dashboard (~3300 lines)
├── training_metrics.json         # Real BERT-Large training data (460 steps)
├── server.py                     # Python HTTP server with SSE + CORS + tunnel support
├── deepseed_monitor.py           # Python client library for pushing metrics (stdlib only)
├── deepspeed_bert_colab.ipynb    # BERT-Large + Gated Attention notebook
├── deepspeed_gpt2_colab.ipynb    # GPT-2 fine-tuning notebook (self-contained)
├── run_charts.py                 # Static chart generation (matplotlib)
├── orchestrator/                 # K8s control plane (job store, Kaggle controller)
├── k8s/                          # Kustomize manifests for K8s deployment
├── k8s_deploy.py                 # K8s deployer CLI
├── jobs/                         # Job runner implementations
├── charts/                       # Pre-rendered chart assets
└── multigpu_lora/                # Multi-GPU LoRA fine-tuning experiments

Dashboard Charts

| Chart | What it shows |
| --- | --- |
| Training & Validation Loss | Loss curve with EMA overlay (α=0.1) + validation dots |
| Validation Accuracy & F1 | Eval metrics over training steps |
| GPU Utilization | Real pynvml GPU utilization percentage |
| GPU Memory Usage | VRAM consumption over time |
| Training Throughput | Samples/second with moving average |
| LR Schedule | Learning rate warmup + linear decay |
| Gradient Norms | Gradient magnitude tracking |
| Step Time Breakdown | Forward / Backward / Optimizer / Communication stacked bars |
| Gradient Flow Heatmap | Per-layer forward timing heatmap |
| ZeRO Memory | Parameter / Gradient / Optimizer / Activation memory breakdown |
| Step Time Histogram | Distribution of step durations |
| System Health | Live GPU/memory stats (live mode only) |
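The EMA overlay on the loss chart is the standard exponential moving average, `ema_t = α·x_t + (1−α)·ema_{t−1}` with α = 0.1. A small sketch (not the dashboard's actual D3 code, which lives in `index.html`):

```python
def ema(values, alpha=0.1):
    """Exponential moving average: ema_t = alpha*x_t + (1-alpha)*ema_{t-1}."""
    out = []
    smoothed = None
    for x in values:
        smoothed = x if smoothed is None else alpha * x + (1 - alpha) * smoothed
        out.append(smoothed)
    return out
```

A small α like 0.1 smooths heavily, which is why the overlay lags sharp loss spikes in the raw curve.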

Gated Self-Attention

The BERT notebook includes a novel Gated Self-Attention mechanism:

g = sigmoid(W_g * attention_output)          # learnable gate per position
output = g * attention_output + (1-g) * x    # blend attend vs. skip
  • Adds only 24,576 parameters to BERT-Large (0.007% overhead)
  • Early layers learn to partially skip attention, late layers attend fully
  • Improves convergence and gradient flow
  • Gate evolution is tracked and visualized in the dashboard
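The pseudocode above can be made concrete. The exact shape of `W_g` isn't stated, but assuming one weight vector of size `hidden` per layer reproduces the quoted 24,576-parameter figure for BERT-Large (24 layers × 1024 hidden). A plain-Python sketch of the per-position gate and blend:

```python
import math

HIDDEN = 1024   # BERT-Large hidden size
LAYERS = 24     # BERT-Large encoder layers

# Assumed shape: one bias-free weight vector per layer -> 24 x 1024 = 24,576
extra_params = LAYERS * HIDDEN

def gate(attn_vec, w_g):
    # Per-position scalar gate: g = sigmoid(w_g . attention_output)
    z = sum(w * a for w, a in zip(w_g, attn_vec))
    return 1.0 / (1.0 + math.exp(-z))

def gated_blend(attn_vec, x_vec, w_g):
    # output = g * attention_output + (1 - g) * x  (blend attend vs. skip)
    g = gate(attn_vec, w_g)
    return [g * a + (1 - g) * xi for a, xi in zip(attn_vec, x_vec)]
```

With `w_g` at zero the gate sits at 0.5, blending attention output and residual input equally; training then pushes early layers toward the skip path and late layers toward full attention, as described above.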

Gated Attention Analysis

Training Data

The included training_metrics.json contains real training data from fine-tuning BERT-Large (335M params) with Gated Attention on IMDB sentiment classification:

  • GPU: NVIDIA Tesla T4 (15.8 GB)
  • Optimizer: DeepSpeed ZeRO Stage 2 + FP16 + CPU offload
  • Dataset: IMDB (25K train / 25K test)
  • Results: 460 logged steps, 11 evaluations, ~93% validation accuracy
  • Metrics: Real GPU utilization, memory, per-layer timing via CUDA events

Local Development

# Start the dashboard server with cloudflared tunnel
python3 server.py --tunnel

# Or just serve locally
python3 server.py
# Open http://localhost:8080

Tech Stack

  • Dashboard: Vanilla HTML + CSS + D3.js v7 (single file, no build step)
  • Training: PyTorch + DeepSpeed + Hugging Face Transformers
  • Metrics: pynvml (GPU), CUDA events (timing), custom Monitor class
  • Tunnel: cloudflared (Cloudflare Tunnel) for remote access
  • Orchestration: Kubernetes + custom control plane for Kaggle job management

License

MIT
