Commit 648e4e0

fix(blog): unlink tomtunguz citation in Kimi K2 post
User asked to drop the https://tomtunguz.com/cursor-kimi-open-source-ai-imperative link. Both instances removed (lede + the model-architecture section's parenthetical citation). Surrounding text preserved: the xAI Cursor Composer 2 / 2.5 claim itself stays, just no longer hyperlinked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d3f8c9c commit 648e4e0

1 file changed

Lines changed: 2 additions & 2 deletions

packages/app/content/blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx

@@ -15,7 +15,7 @@ tags:
1515
- nvfp4
1616
---
1717

18-
Kimi K2.5 and K2.6 are the open-weights models behind [xAI's Cursor Composer 2 and Composer 2.5](https://tomtunguz.com/cursor-kimi-open-source-ai-imperative/) — 1M+ daily active users from the Cursor IDE, and the current leader on SWE-Bench Pro at 58.6%. On the 8K/1K workload, vLLM on NVIDIA B200 in NVFP4 serves K2.5/K2.6 cheaper than H200 in INT4 across the entire single-node Pareto frontier. **B200 NVFP4 is 2.71x–2.95x cheaper per million tokens than H200 INT4 in the 30–90 tok/s/user serving band**, peaking at **2.95x at 32 tok/s/user** ($0.140/M on B200 NVFP4 vs $0.413/M on H200 INT4 — a 66% reduction). On the same B200 silicon, swapping INT4 for NVFP4 is worth another **2.45x–2.74x at iso-interactivity** ($0.397/M → $0.154/M at 40 tok/s/user). Measured on SemiAnalysis InferenceX, 2026-05-19, [GHA run 26118912054](https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26118912054).
18+
Kimi K2.5 and K2.6 are the open-weights models behind xAI's Cursor Composer 2 and Composer 2.5 — 1M+ daily active users from the Cursor IDE, and the current leader on SWE-Bench Pro at 58.6%. On the 8K/1K workload, vLLM on NVIDIA B200 in NVFP4 serves K2.5/K2.6 cheaper than H200 in INT4 across the entire single-node Pareto frontier. **B200 NVFP4 is 2.71x–2.95x cheaper per million tokens than H200 INT4 in the 30–90 tok/s/user serving band**, peaking at **2.95x at 32 tok/s/user** ($0.140/M on B200 NVFP4 vs $0.413/M on H200 INT4 — a 66% reduction). On the same B200 silicon, swapping INT4 for NVFP4 is worth another **2.45x–2.74x at iso-interactivity** ($0.397/M → $0.154/M at 40 tok/s/user). Measured on SemiAnalysis InferenceX, 2026-05-19, [GHA run 26118912054](https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26118912054).
1919

2020
Both SKUs run the same `vllm/vllm-openai:v0.21.0` container. The spread comes from the silicon and the precision. B200 has 2.27x H200's FP8 dense throughput (4,500 vs 1,979 TFLOP/s), 1.67x its HBM bandwidth (8 vs 4.8 TB/s), and 2.00x its NVLink scale-up bandwidth (900 vs 450 GB/s uni-di). On the FP4 axis H200 has nothing — Hopper SM90 has no FP4 tensor cores, and the [official datasheet](https://resources.nvidia.com/en-us-data-center-overview/gtc24-h200-datasheet) stops at FP8. B200's NVFP4 cores deliver 9,000 TFLOP/s. The measured 3x cost-per-token gap is what those silicon ratios look like once you fold in B200's 1.38x TCO penalty ($1.95 vs $1.41 per GPU/hr per the [SemiAnalysis AI Cloud TCO Model](https://newsletter.semianalysis.com/p/ai-cloud-economics)).
2121

@@ -41,7 +41,7 @@ Both SKUs run the same `vllm/vllm-openai:v0.21.0` container. The spread comes fr
4141
caption="Kimi K2.5/K2.6 architecture (1.0T total / 32B active / 262K context). Shared backbone across both releases — K2.6 is a post-training refinement of the K2.5 pre-trained weights. Source: Moonshot AI model card via the SemiAnalysis InferenceX dashboard."
4242
/>
4343

44-
**K2.5 and K2.6 are the open-weights models powering xAI's Cursor Composer 2 and Composer 2.5** ([reporting on the K2.5 → Composer 2 connection](https://tomtunguz.com/cursor-kimi-open-source-ai-imperative/)), serving 1M+ daily active users from the Cursor IDE. **K2.6 also leads frontier models on the public agentic-coding benchmarks**: 58.6% on SWE-Bench Pro — ahead of GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%) — and 80.2% on SWE-Bench Verified ([Moonshot K2.6 model card](https://huggingface.co/moonshotai/Kimi-K2.6)). Cline's [production deployment data](https://cline.bot/blog/moonshots-kimi-k2-for-coding-our-first-impressions-in-cline) puts it at 3.3% failure rate on complex diff-editing tasks, matching Claude 4 Sonnet. K2.6's Agent Swarm primitive fans out to **300 parallel sub-agents across 4,000 coordinated steps**, up from K2.5's 100 / 1,500. If you're hosting an OSS agentic coding stack today, K2.5 or K2.6 is the model you're serving.
44+
**K2.5 and K2.6 are the open-weights models powering xAI's Cursor Composer 2 and Composer 2.5**, serving 1M+ daily active users from the Cursor IDE. **K2.6 also leads frontier models on the public agentic-coding benchmarks**: 58.6% on SWE-Bench Pro — ahead of GPT-5.4 (57.7%), Claude Opus 4.6 (53.4%), and Gemini 3.1 Pro (54.2%) — and 80.2% on SWE-Bench Verified ([Moonshot K2.6 model card](https://huggingface.co/moonshotai/Kimi-K2.6)). Cline's [production deployment data](https://cline.bot/blog/moonshots-kimi-k2-for-coding-our-first-impressions-in-cline) puts it at 3.3% failure rate on complex diff-editing tasks, matching Claude 4 Sonnet. K2.6's Agent Swarm primitive fans out to **300 parallel sub-agents across 4,000 coordinated steps**, up from K2.5's 100 / 1,500. If you're hosting an OSS agentic coding stack today, K2.5 or K2.6 is the model you're serving.
4545

4646
A note on quantization: Moonshot ships K2.5/K2.6 with **native INT4 weights** as the default open-weights checkpoint, which is what the H200 INT4 and B200 INT4 curves in this post use directly. The **B200 NVFP4 curve uses a NVFP4 requantization of the same weights** so B200's FP4 tensor cores can do the MoE GEMMs at full rate. H200 cannot run this path — Hopper SM90 has no FP4 tensor cores.
4747
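
The ratios quoted in the unchanged context lines of this diff can be reproduced from the raw figures the post gives. The sketch below is a reader's sanity check under stated assumptions, not something from the post or the InferenceX harness. The function name and dictionary keys are hypothetical. The last entry assumes that cost per million tokens is per-GPU hourly TCO divided by per-GPU token throughput, which the post does not state explicitly.

```python
def check_ratios():
    """Recompute the ratios quoted in the post from its own raw figures.

    All inputs are copied from the post's text; nothing is measured here.
    """
    # Silicon ratios, B200 over H200.
    fp8 = 4500 / 1979          # FP8 dense TFLOP/s
    hbm = 8 / 4.8              # HBM bandwidth, TB/s
    nvlink = 900 / 450         # NVLink unidirectional GB/s
    tco = 1.95 / 1.41          # $ per GPU-hour

    # Cost per million tokens at 32 tok/s/user.
    h200_int4 = 0.413
    b200_nvfp4 = 0.140
    cost_gap = h200_int4 / b200_nvfp4
    reduction = 1 - b200_nvfp4 / h200_int4

    # Same B200 silicon, INT4 -> NVFP4 at 40 tok/s/user.
    precision_gain = 0.397 / 0.154

    # ASSUMPTION: $/Mtok = ($/GPU-hr) / (Mtok per GPU-hr). Under that model,
    # the per-GPU throughput ratio is the cost gap times the TCO ratio.
    implied_throughput = cost_gap * tco

    return {
        "fp8": round(fp8, 2),
        "hbm": round(hbm, 2),
        "nvlink": round(nvlink, 2),
        "tco": round(tco, 2),
        "cost_gap": round(cost_gap, 2),
        "reduction": round(reduction, 3),
        "precision_gain": round(precision_gain, 2),
        "implied_throughput": round(implied_throughput, 2),
    }


if __name__ == "__main__":
    for name, value in check_ratios().items():
        print(f"{name}: {value}")
```

Under that cost model, the quoted 2.95x cost gap together with the 1.38x TCO ratio would correspond to roughly a 4x per-GPU throughput advantage for B200 NVFP4 at that operating point. Treat that figure as illustrative, since it depends on the assumed cost formula.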
