hpc-llm-bench

Reference workflow for end-to-end LLM fine-tuning and inference benchmarking on a multi-node Slurm + Pyxis GPU cluster. Built as a Nebius take-home exercise (AI/ML Specialist Customer Solution Architect role); designed to be a reusable starting point for similar PoCs.

Architecture

The diagram is interactive on GitHub: click any node to jump to its source or results.

flowchart LR
    Q[Qwen2.5-7B-Instruct]
    M[cais/mmlu auxiliary_train]
    NGC[NGC PyTorch container]

    P0[Phase 0: cluster validation]
    P1[Phase 1: SFT with FSDP, 2 nodes x 2 GPUs]
    C[merged checkpoint]
    P2A[Phase 2a: lm-eval-harness accuracy]
    P2B[Phase 2b: vLLM vs transformers throughput]

    R0[validation report]
    R1[accuracy delta]
    R2[throughput speedup]

    NGC -.-> P0
    NGC -.-> P1
    NGC -.-> P2A
    NGC -.-> P2B

    Q --> P1
    M --> P1
    P1 --> C
    C --> P2A
    C --> P2B
    Q --> P2A

    P0 --> R0
    P2A --> R1
    P2B --> R2

    click P0 "https://github.com/JayDS22/HPC-llm-bench/tree/main/validation" "validation suite source"
    click P1 "https://github.com/JayDS22/HPC-llm-bench/tree/main/training" "training pipeline source"
    click P2A "https://github.com/JayDS22/HPC-llm-bench/blob/main/evaluation/eval_mmlu.py" "accuracy eval source"
    click P2B "https://github.com/JayDS22/HPC-llm-bench/tree/main/evaluation/bench" "throughput bench source"
    click R0 "https://github.com/JayDS22/HPC-llm-bench/blob/main/results/validation/386/report.md" "validation report"
    click R1 "https://github.com/JayDS22/HPC-llm-bench/blob/main/results/eval/accuracy.md" "accuracy results"
    click R2 "https://github.com/JayDS22/HPC-llm-bench/blob/main/results/eval/throughput.md" "throughput results"

Dotted lines: every Slurm job runs inside the NGC PyTorch container via Pyxis (no image build step). Solid lines: data and artifact flow.
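
As a rough illustration of the dotted lines, the sketch below assembles an srun command that starts a job step inside a container through Pyxis. The `--container-image` and `--container-mounts` flags are Pyxis options; the image tag placeholder, the script path `training/train.py`, and the mount path are hypothetical and are not taken from this repo's sbatch scripts.

```python
import shlex


def pyxis_srun_command(image, script, nodes=2, gpus_per_node=2, mounts=None):
    """Build an srun command that runs `script` inside a container via Pyxis.

    Defaults mirror the Phase 1 shape in the diagram (2 nodes x 2 GPUs),
    with one task per GPU.
    """
    cmd = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks-per-node={gpus_per_node}",
        f"--gpus-per-node={gpus_per_node}",
        f"--container-image={image}",  # Pyxis pulls the image; no build step
    ]
    if mounts:
        # Pyxis expects a comma-separated list of SRC:DST pairs.
        cmd.append("--container-mounts=" + ",".join(mounts))
    cmd += ["python", script]
    return cmd


if __name__ == "__main__":
    # All three argument values below are placeholders (assumptions).
    cmd = pyxis_srun_command(
        image="nvcr.io#nvidia/pytorch:<tag>",
        script="training/train.py",
        mounts=["/shared/data:/data"],
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting explicit; `shlex.join` is only used to print a copy-pasteable form.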

Assignment summary

A VC-funded startup wants to reserve 512 × H100s for 6 months on Nebius. Before committing, they want a PoC that proves performance and reliability. The PoC environment is a Slurm cluster (2 nodes × 8 H200 each = 16 H200s total) shared with other users; each job may use up to 4 GPUs.

Deliverables

Cluster validation — a lightweight, portable container that validates the cluster is ready for training/inference. Build code + results in this repo.
Phase 1 — Training — end-to-end multi-node fine-tune of an open-source LLM against at least one category from cais/mmlu. Justify model, framework, hyperparams, distributed-training technique.
Phase 2 — Evaluation — (a) show accuracy improvement of fine-tuned vs base, then (b) optimize either latency or throughput for inference and prove the result with experiments. Provide reproduction docs.
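
The two Phase 2 headline numbers reduce to simple arithmetic. The sketch below shows one way to compute them; the function names are hypothetical, and the choice of percentage points for accuracy and a plain ratio of tokens per second for throughput is an assumption rather than the definition used in the results files.

```python
def accuracy_delta(base_acc, tuned_acc):
    """Accuracy change in percentage points; inputs are fractions in [0, 1]."""
    for value in (base_acc, tuned_acc):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"accuracy must be in [0, 1], got {value}")
    return (tuned_acc - base_acc) * 100.0


def throughput_speedup(baseline_tok_s, optimized_tok_s):
    """Ratio of optimized to baseline throughput (both in tokens/second)."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_tok_s / baseline_tok_s
```

Reporting the accuracy change in percentage points rather than as a relative percentage avoids overstating gains when the base accuracy is low.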

Note: the brief says "choose one of the following options to complete", but Phase 2 references the Phase 1 model, so this repo treats the phases as a sequence rather than a choice. To be confirmed with Marija if this is ambiguous.

Layout

validation/   # cluster-readiness container + checks (NCCL, GPU, fabric, storage I/O)
training/     # Phase 1 — fine-tuning code & configs
evaluation/   # Phase 2 — accuracy + latency/throughput experiments
sbatch/       # Slurm submission scripts (one per stage)
docs/         # runbook, design decisions, reproduction guide
results/      # logs, metrics, plots (gitignored where bulky)
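
For a sense of what the GPU part of the readiness checks under validation/ might involve, here is a minimal sketch that inventories GPUs through `nvidia-smi`. It is not the repo's validation suite: the function names and the pass criterion (expected count plus a name substring) are assumptions, and the NCCL, fabric, and storage checks are out of scope here.

```python
import shutil
import subprocess

# Real nvidia-smi query flags; output is one "name, memory_in_MiB" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]


def parse_gpu_inventory(text):
    """Parse nvidia-smi CSV output into a list of (name, memory_mib) tuples."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the name cannot break parsing.
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem_mib)))
    return gpus


def visible_gpus():
    """Return the GPUs visible on this node, or [] if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_inventory(result.stdout)


def gpu_check(expected_count, name_substring="H200"):
    """True if exactly `expected_count` GPUs are visible and all match the name."""
    gpus = visible_gpus()
    return len(gpus) == expected_count and all(name_substring in n for n, _ in gpus)
```

Keeping the parser separate from the subprocess call makes the check testable on a machine without GPUs.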

Environment

Cluster: 2 nodes × 8 × H200, Slurm scheduler, shared with other users.
Job size: up to 4 GPUs per job — be considerate of memory/storage/CPU.
Access key: ~/.ssh/nebius_assignment (local).
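
The job-size rule can be checked before submission. The sketch below encodes the limits stated above (2 nodes, 8 GPUs per node, at most 4 GPUs per job); the function name and the idea of a pre-submit check are illustrative assumptions, not part of the repo.

```python
NUM_NODES = 2          # nodes in the cluster
GPUS_PER_NODE = 8      # H200s per node
MAX_GPUS_PER_JOB = 4   # per-job limit on the shared cluster


def check_job_shape(nodes, gpus_per_node):
    """Return a list of violated constraints; an empty list means the job fits."""
    problems = []
    if not 1 <= nodes <= NUM_NODES:
        problems.append(f"nodes must be between 1 and {NUM_NODES}, got {nodes}")
    if not 1 <= gpus_per_node <= GPUS_PER_NODE:
        problems.append(
            f"gpus_per_node must be between 1 and {GPUS_PER_NODE}, got {gpus_per_node}"
        )
    total = nodes * gpus_per_node
    if total > MAX_GPUS_PER_JOB:
        problems.append(f"job requests {total} GPUs, limit is {MAX_GPUS_PER_JOB}")
    return problems
```

The Phase 1 shape of 2 nodes x 2 GPUs requests exactly 4 GPUs, so `check_job_shape(2, 2)` returns an empty list, while 2 nodes x 4 GPUs would exceed the per-job limit.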
