fast-dit-serving

Overview

Our contributions are summarized as follows:

Attention caching: We propose an attention caching mechanism that computes attention for a single request per timestep and reuses the cached attention maps for the remaining requests, which reduces redundant computation and improves throughput.
Progressive batching: We introduce a progressive batching approach that incrementally grows the batch size across timesteps, controlled by a configurable caching interval.
Latent-aware denoising: To mitigate the quality degradation introduced by attention reuse, we develop a latent-aware denoising mechanism that dynamically modulates noise removal based on the current latent state.
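
A small schedule calculation can help picture how attention caching and progressive batching interact. The sketch below is not the repository's implementation. The function names, the rule of admitting one extra request every `cache_interval` timesteps, and the `max_batch` cap are all assumptions made for illustration.

```python
from typing import List, Tuple


def progressive_batch_sizes(num_timesteps: int, cache_interval: int, max_batch: int) -> List[int]:
    """Hypothetical schedule: start with one request and admit one more
    every `cache_interval` timesteps, never exceeding `max_batch`."""
    if cache_interval < 1 or max_batch < 1:
        raise ValueError("cache_interval and max_batch must be >= 1")
    return [min(max_batch, 1 + t // cache_interval) for t in range(num_timesteps)]


def attention_work(batch_sizes: List[int]) -> Tuple[int, int]:
    """Assuming attention is computed for one request per timestep and the
    other requests in the batch reuse the cached maps, return
    (computed, reused) counts over the whole schedule."""
    computed = len(batch_sizes)
    reused = sum(size - 1 for size in batch_sizes)
    return computed, reused


sizes = progressive_batch_sizes(num_timesteps=8, cache_interval=2, max_batch=3)
print(sizes)                  # [1, 1, 2, 2, 3, 3, 3, 3]
print(attention_work(sizes))  # (8, 10)
```

Under these assumptions, a smaller caching interval grows the batch faster and increases the share of reused attention maps, which is the trade-off that latent-aware denoising is meant to compensate for.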

Figure 1 illustrates the processing pipeline of our framework, detailing the flow from request prefill and partitioning through parallel attention computation, denoising, and decoding stages.

Figure 1. Processing pipeline: request prefill → partitioning → parallel attention computation → denoising → decoding.

Figure 2. Comparison of SD3 serving strategies. (a) Our approach: attention caching with continuous batching. (b) Attention caching without batching.

Performance

Latency, Throughput and Speedup

Figure 3. (a) Inference latency comparison across batch sizes for Diffusers SD3, Stability SD3, xDiT, and our system. (b) Speedup ratio of different systems compared to Diffusers SD3.

Figure 4. (a) End-to-end job timings for different request rates over 100 seconds. (b) Serving throughput (images/min) vs. total time and GPU time across request rates.

Image Quality

Qualitative Comparisons

Figure 5. Qualitative comparisons across systems for representative prompts (“A helicopter flies over Yosemite.”, “an old-fashioned windmill surrounded by flowers.”, “a peaceful lakeside landscape.”, “a squirrel driving a toy car.”).

PartiPrompts (P2) Benchmark

Figure 6. PartiPrompts (P2) benchmark with CLIP Score (↑), FID (↓), and SSIM (↑) across systems.
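
For readers unfamiliar with SSIM, the following is a minimal sketch of the metric computed over a whole image as a single window. It is not the evaluation code used for Figure 6. Standard SSIM implementations average the same formula over local windows, and the function name and flat pixel-list input are assumptions for illustration.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized grayscale images,
    each given as a flat sequence of pixel values."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the usual SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [10.0, 50.0, 90.0, 130.0]
print(global_ssim(reference, reference))                  # identical images score 1
print(global_ssim(reference, [130.0, 90.0, 50.0, 10.0]))  # reversed gradient scores lower
```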

Prerequisites

Python: 3.8 – 3.11 (3.10 recommended)
NVIDIA GPU + CUDA: 11.8+ or 12.x (must match the CUDA version your PyTorch build targets)
cuDNN: A version compatible with your CUDA installation
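
A quick way to check these prerequisites before installing is a short script like the one below. It is a sketch, not part of the repository: the helper names are made up for illustration, and the version bounds are copied from the list above.

```python
import sys


def python_supported(version=sys.version_info[:2]) -> bool:
    """True if the interpreter is within the 3.8 to 3.11 range listed above."""
    return (3, 8) <= tuple(version) <= (3, 11)


def describe_cuda() -> str:
    """Report the CUDA version the installed PyTorch build targets, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    return f"PyTorch {torch.__version__} built for CUDA {torch.version.cuda}"


print("Python version supported:", python_supported())
print(describe_cuda())
```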

Getting Started

1. Installation

cd sd3_serve
pip install -e .

2. Download Models

cd scripts
./download_sd3_from_links.sh

3. Usage

Start the Server

cd sd3_serve
python server.py

Single and Batched Inference

cd scripts
# Single prompt
python run_simple_example.py "A beautiful landscape" --timesteps 50

# Batched prompts
python run_batched_example.py test_prompts.json

Benchmarking

Benchmarks were run on an NVIDIA A100 GPU.
For detailed usage instructions on running these benchmarks, see the Benchmarking Guide.

Results

Server Tested	Mean Time (s)	Median Time (s)
sd3_serve (Single)	1.23	1.10
sd3_serve (Batched)	0.85	0.80
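
Mean and median timings like those in the table can be collected with a small harness. The sketch below is an assumption about how such numbers might be gathered, not the benchmarking code referenced above. `time_calls` and `summarize` are hypothetical helpers, and the workload is a stand-in for a real request to the server.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `fn` several times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Mean and median of a list of durations, matching the table's columns."""
    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
    }


# Stand-in workload; replace with a call that sends one prompt (or one batch) to the server.
print(summarize(time_calls(lambda: sum(range(100_000)))))
```

Reporting the median next to the mean is useful here because a single slow run (for example, a cold start) shifts the mean noticeably but barely moves the median.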

Load Testing

Load tests exercise the system at different request rates and report metrics such as end-to-end job latency and the number of images generated per minute. For detailed usage instructions on running these tests, see the Testing Guide.
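
The two metrics named above can be derived from per-job submit and finish timestamps. The following sketch assumes Poisson arrivals and one image per job. Neither assumption is stated in this README, and the function names are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Tuple


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> List[float]:
    """Submit times (seconds) for requests arriving at `rate_per_s` on average."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap between requests
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def summarize_jobs(jobs: List[Tuple[float, float]]) -> Dict[str, float]:
    """`jobs` holds (submit_time, finish_time) pairs in seconds, one image per job."""
    latencies = [finish - submit for submit, finish in jobs]
    span = max(finish for _, finish in jobs) - min(submit for submit, _ in jobs)
    return {
        "mean_latency_s": mean(latencies),
        "images_per_min": 60.0 * len(jobs) / span,
    }


print(len(poisson_arrivals(rate_per_s=0.5, duration_s=100.0)))
print(summarize_jobs([(0.0, 10.0), (5.0, 20.0), (10.0, 30.0)]))
# {'mean_latency_s': 15.0, 'images_per_min': 6.0}
```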
