
Add RTX 6000 speed#107

Open
d3y4n wants to merge 1 commit into antirez:main from d3y4n:feature/speed-cuda
Open

Add RTX 6000 speed#107
d3y4n wants to merge 1 commit into
antirez:mainfrom
d3y4n:feature/speed-cuda

Conversation

@d3y4n

d3y4n commented May 12, 2026

Based on the output of the default benchmark command.

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,313.21,128,35.66,52184460
4096,2048,317.81,128,35.12,80373132
8192,2048,317.33,128,34.36,136750476
16384,2048,310.08,128,32.78,249505164
32768,2048,296.95,128,31.72,475014540
65536,2048,273.55,128,29.62,926033292
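One quick property worth sanity-checking in these numbers: the kvcache_bytes column should grow linearly with context length. A minimal Python sketch (rows copied from the table above; the per-token figure is just the difference quotient between consecutive rows, not anything ds4-bench itself reports):

```python
# Rows copied from the benchmark table above: (ctx_tokens, kvcache_bytes).
rows = [
    (2048, 52184460),
    (4096, 80373132),
    (8192, 136750476),
    (16384, 249505164),
    (32768, 475014540),
    (65536, 926033292),
]

# Per-token KV-cache cost between consecutive rows; every step should
# give the same bytes/token if the growth is linear.
per_token = [
    (b2 - b1) / (c2 - c1)
    for (c1, b1), (c2, b2) in zip(rows, rows[1:])
]
print(per_token)
```

If the values disagree between steps, the cache is not growing linearly and the kvcache_bytes column would be worth a closer look.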

@tao12345666333

Could you please share the complete operating steps and information about the machine resources?

This result differs significantly from the speed I tested.🤔

https://x.com/i/status/2054161265577308453

@d3y4n
Author

d3y4n commented May 12, 2026

@tao12345666333 good callout, it did indeed look a bit suspicious. I'm running this on a g7e.2xlarge AWS instance at the moment; I will test the card in my own PC in the upcoming weeks.

This is what I ran:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 65536 \
  --step-incr 2048 \
  --gen-tokens 128

@luoq

luoq commented May 14, 2026

Another result:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 65536 \
  --step-incr 2048 \
  --gen-tokens 128
ds4-bench: context buffers 1311.89 MiB (ctx=65665, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: CUDA backend initialized on NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120)
ds4: CUDA host registration skipped: operation not supported
ds4: CUDA loading model tensors into device cache: 80.04 GiB
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 19.100s
ds4: cuda backend initialized for graph diagnostics
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
ds4: CUDA q8 fp16 cache budget exhausted; using q8 kernels (request=8.00 MiB cached=0.00 GiB free=3.42 GiB reserve=4.75 GiB total=94.96 GiB)
2048,2048,375.34,128,44.31,52184460
4096,2048,369.07,128,43.54,80373132
6144,2048,367.39,128,43.25,108561804
8192,2048,366.01,128,42.51,136750476
10240,2048,364.06,128,42.13,164939148
12288,2048,362.11,128,42.02,193127820
14336,2048,359.93,128,41.72,221316492
16384,2048,358.73,128,42.59,249505164
18432,2048,357.38,128,42.37,277693836
20480,2048,356.18,128,42.35,305882508
22528,2048,355.03,128,42.38,334071180
24576,2048,353.84,128,42.21,362259852
26624,2048,352.70,128,41.97,390448524
28672,2048,351.51,128,41.65,418637196
30720,2048,350.41,128,41.33,446825868
32768,2048,349.44,128,38.90,475014540
34816,2048,346.21,128,37.93,503203212
36864,2048,345.57,128,37.68,531391884
38912,2048,344.61,128,37.65,559580556
40960,2048,343.78,128,37.61,587769228
43008,2048,342.95,128,37.50,615957900
45056,2048,342.18,128,37.47,644146572
47104,2048,341.44,128,37.37,672335244
49152,2048,340.76,128,37.31,700523916
51200,2048,339.56,128,37.14,728712588
53248,2048,338.77,128,37.13,756901260
55296,2048,338.09,128,37.05,785089932
57344,2048,337.44,128,36.98,813278604
59392,2048,336.65,128,36.87,841467276
61440,2048,336.07,128,36.64,869655948
63488,2048,335.44,128,36.34,897844620
65536,2048,334.79,128,36.12,926033292
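For comparing runs like this one against the numbers in the PR description, the CSV can be parsed directly. A minimal Python sketch (column names taken from the ds4-bench header line; the two data rows are the first and last rows of the run above, so the drop is measured from 2K to 64K context):

```python
import csv
import io

# Benchmark CSV as emitted by ds4-bench (header plus first and last rows
# of the run above).
CSV_TEXT = """\
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,375.34,128,44.31,52184460
65536,2048,334.79,128,36.12,926033292
"""

def throughput_drop(csv_text: str) -> tuple[float, float]:
    """Return (prefill_drop_pct, gen_drop_pct) from the first to the last row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    first, last = rows[0], rows[-1]
    prefill_drop = 100 * (1 - float(last["prefill_tps"]) / float(first["prefill_tps"]))
    gen_drop = 100 * (1 - float(last["gen_tps"]) / float(first["gen_tps"]))
    return prefill_drop, gen_drop

p, g = throughput_drop(CSV_TEXT)
print(f"prefill drops {p:.1f}%, generation drops {g:.1f}% from 2K to 64K ctx")
```

Running the same function over another run's CSV makes the 4x discrepancy discussed below easy to quantify rather than eyeball.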

@d3y4n
Author

d3y4n commented May 15, 2026

Thanks @luoq, that's more in line with what I got. Wondering how @tao12345666333 got almost 4x the prompt-processing speed 🤔
