[DO NOT MERGE] Make Migraphx backend "fast" (i.e. not slow) by aussetg · Pull Request #122 · lightonai/next-plaid

aussetg · 2026-06-05T20:02:24Z

This draft PR is a best-effort attempt to make ROCm/MIGraphX useful for ColGREP indexing and a request for review from anyone who knows a better MIGraphX route.

My conclusion, after implementing and benchmarking it, is that MIGraphX should not become a normal/default ColGREP backend. If it is kept at all, it should remain an explicit experimental feature. The implementation cost and cache/shape machinery required to make it competitive are too high for the gains we get.

I am opening this PR partly as evidence for that conclusion and partly to justify moving backend work toward a llama.cpp/ggml implementation instead.

Summary

The short version is:

CPU INT8 ONNX remains a more reliable and usually faster backend for ColGREP than using the GPU via MIGraphX.
MIGraphX can beat CPU only in narrow, warm-cache indexing cases (AVX2/512 is fast!).
Those wins require static-shape cache warming, provider-specific routing, cache-key hardening, tail-padding heuristics, and run-level thresholds.
Even after the best measured optimizations, the useful gains are too small to amortize compile cost on realistic repeat counts.
One-shot search/query latency is especially unsuitable for MIGraphX because session/cache loading dominates the tiny query workload.
Dynamic-shape MIGraphX would be the clean solution, but direct dynamic sequence length remains blocked by MIGraphX GPU compiler limitations. The dynamic-shape support is not currently sufficient for this workload.

Even shorter version:

MIGraphX required substantial backend-specific machinery to barely match or modestly beat CPU. The complexity is not justified.

What this branch implements

This branch is a best-effort implementation of cache-hit-only MIGraphX indexing:

generic GPU execution-provider selection;
MIGraphX-capable ONNX Runtime discovery;
static-shape MIGraphX cache inspection and validation;
cache-hit-only Auto behavior, so normal indexing does not cold-compile MIGraphX;
CPU fallback for cold/missing shapes;
bounded warm-tail padding;
run-level thresholds to avoid using MIGraphX on small workloads;
explicit colgrep warm-cache --provider migraphx.

The point of the PR is not “MIGraphX is ready”; it is to show what is required to make it almost competitive.

Hardware / Software I used for my test:

CPU: AMD Ryzen AI MAX+ 395
Logical cores: 32
CPU features: AVX2 / AVX512 / FMA
GPU: GFX1151 / AMD 8060 RDNA3.5
GPU type: integrated/UMA
MIGraphX: 2.16.0.20250912-17-354-g4874f127d7-dirty (compiled for ROCm 7.14)
ROCm version: 7.14
OS: CachyOS / Arch-based Linux
platform: Linux-7.0.10-1-cachyos-x86_64-with-glibc2.43
python:   3.13.12
machine:  x86_64
cpu_count: 32

Why MIGraphX is hard for ColGREP

ColGREP indexing is not a single large static tensor workload. It is a pipeline:

scan files
→ parse code units
→ tokenize many variable-length units
→ encode document batches
→ pool embeddings
→ write vector/metadata indexes

GPU acceleration only affects the encode stage. Everything else remains CPU-side and competes with GPU setup/launch/session costs.

More importantly, MIGraphX also strongly prefers static shapes. And by strongly, I mean it is basically mandatory.
The ColBERT/ModernBERT document path naturally produces many variable row counts and sequence lengths. MIGraphX has preliminary support for making one dimension dynamic, but we need two (row count and sequence length), and even then some operators we need are not supported. Without a true dynamic sequence-length path, the backend needs:

static-shape planning;
cache-keying and validation;
per-shape cache directories;
warm-cache inspection without ORT session initialization;
route selection between exact GPU, padded GPU tail, split GPU, and CPU fallback;
thresholds to avoid using GPU on tiny runs;
special query/search handling;
strict handling of forced GPU mode.

This complexity is the central negative result.
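
To make the cache-keying and per-shape directory items above more concrete, here is a minimal sketch of what such a key could look like. The struct, its field names, and the directory naming scheme are all hypothetical and are not taken from this branch; the fields only mirror the things a compiled shape depends on (model, provider options, runtime version, GPU arch, and the static shape itself).

```rust
/// Hypothetical cache key for one compiled static shape.
/// Field names are illustrative; they are not the branch's actual types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ShapeCacheKey {
    model_id: String,
    provider_options: String,
    runtime_version: String,
    gpu_arch: String,
    batch: usize,
    seq_len: usize,
}

/// Replace characters that are awkward in directory names with '_'.
/// Note: this is lossy, so a real implementation would need a collision check.
fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '-' || c == '.' { c } else { '_' })
        .collect()
}

impl ShapeCacheKey {
    /// Build one directory name per (model, options, runtime, arch, shape)
    /// combination, so changing a field normally selects a different directory.
    fn dir_name(&self) -> String {
        format!(
            "{}__{}__{}__{}__b{}_s{}",
            sanitize(&self.model_id),
            sanitize(&self.provider_options),
            sanitize(&self.runtime_version),
            sanitize(&self.gpu_arch),
            self.batch,
            self.seq_len
        )
    }
}
```

The point of the sketch is only that every input that can invalidate a compiled graph has to appear in the key; leaving one out is how a stale `.mxr` ends up being loaded for the wrong model or runtime.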

Performance Results

I diligently benchmarked each of the “improvements”/features above to ensure they genuinely helped. I can provide results, but to avoid polluting this draft PR, I will only give the big ones.

Only use pre-specified fixed shapes

First, I fixed what I think is a bug in tokenize_documents_in_batches. Comments clearly said:

        // GPU path: token-budget dynamic batching. Documents are sorted by
        // length and bucketed into fixed shapes (quantized to 32-token steps).
        // This lets the GPU reuse execution plans across batches with the same
        // shape, reducing kernel launch overhead and minimizing padding

But this wasn’t actually the case. Nobody noticed, likely because CUDA tolerates this better: it does not have the same static-graph compile-cache behavior. With MIGraphX, however, it meant that nearly every batch had a different shape from the previous one. This is a problem because MIGraphX wants static shapes, so it recompiles each new shape into a separate “optimized” graph, and MIGraphX compilation is slow (multiple seconds per shape).

Just solving that by padding up to the next fixed (warm) bucket brings a colgrep init run on the next-plaid repo from multiple minutes down to about 12.5s of encoding plus about 34s of one-time cache warming. That would sound impressive if the CPU weren’t taking 8s.
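
The padding step described above can be sketched as follows. This is a simplified illustration, not the code in this branch: the function names are made up, the 32-token step comes from the quoted comment, and the list of warm buckets is assumed to be known up front.

```rust
/// Round a token length up to the next multiple of `step`
/// (32 in the comment quoted above).
fn quantize_len(len: usize, step: usize) -> usize {
    assert!(step > 0, "step must be non-zero");
    (len + step - 1) / step * step
}

/// Pick the smallest warm bucket that can hold `len` tokens.
/// `None` means no warm shape fits, so the caller has to fall back
/// (for example to the CPU) instead of compiling a new shape.
fn pick_warm_bucket(len: usize, warm_buckets: &[usize]) -> Option<usize> {
    warm_buckets.iter().copied().filter(|&b| b >= len).min()
}
```

For example, a 70-token sequence with warm buckets `[64, 128, 256]` is padded to 128, while a 300-token sequence has no warm bucket at all. Choosing a small fixed bucket set trades padding waste for fewer compiled graphs.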

Static-shape cache and cold-shape CPU fallback

You can cache MIGraphX’s compiled graphs (.mxr) and reuse them later.

auto CPU:                         8.045s
--force-gpu cold hybrid:         12.257s
--force-gpu + doc length 512:     6.610s

Before that, --force-gpu was timing out around 80s.

Separate cache warming cost:

colgrep warm-cache --provider migraphx: ~34.2s, warmed 4 shapes

After adding shape-specialized MIGraphX sessions plus cold-shape CPU fallback, the colgrep/src benchmark went from an ~80s --force-gpu timeout to a completed cold hybrid run in 12.257s. That is a big improvement over the broken GPU path but still slower than the CPU path at 8.045s.

The explicit cache-warming step was separate and cost about 34.2s for the initial four shapes. It did not make this small colgrep/src case a clear win; many long/partial shapes still fell back to CPU. The 6.610s result came from the optional NEXT_PLAID_MIGRAPHX_DOCUMENT_LENGTH=512 cap, which changes truncation/quality and should not be treated as a fair default.
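
The cache-hit-only behaviour with cold-shape CPU fallback boils down to a lookup like the one below. This is a minimal sketch under stated assumptions: the `Provider` enum, the `(rows, seq_len)` shape tuple, and both function names are hypothetical, and forced-GPU mode is deliberately left out.

```rust
use std::collections::HashSet;

/// (rows, seq_len) of one padded document batch.
type Shape = (usize, usize);

/// Where a batch gets encoded. Names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Provider {
    MiGraphX,
    Cpu,
}

/// Auto mode never cold-compiles: only shapes already in the warm set
/// go to the GPU, everything else falls back to the CPU.
fn route_auto(shape: Shape, warm_shapes: &HashSet<Shape>) -> Provider {
    if warm_shapes.contains(&shape) {
        Provider::MiGraphX
    } else {
        Provider::Cpu
    }
}

/// Count how many batches of a run would fall back to the CPU.
fn cpu_fallback_count(batches: &[Shape], warm_shapes: &HashSet<Shape>) -> usize {
    batches
        .iter()
        .filter(|&&shape| route_auto(shape, warm_shapes) == Provider::Cpu)
        .count()
}
```

A high fallback count is what happened on the small colgrep/src case: with only four warm shapes, many long or partial batches never reached the GPU.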

Hybrid routing for batches

One way to recover some of the lost performance is to run a batch on the GPU only when it is close to a warm bucket size, so that little padding is needed, and to run the other batches on the CPU.
Doing that gets us to approximately 12s for colgrep init.

Similarly, I’ve also added logic to heuristically decide whether a batch is “worth it” on GPU vs. CPU.

Batches routed to the CPU must also have their padding removed first, so that the CPU only processes real tokens. The cost model is roughly:

GPU cost = per-shape session load
         + per-run launch/session overhead
         + padded_tokens × GPU_token_cost

CPU cost = CPU_model_load_if_needed
         + real_tokens × CPU_token_cost
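
Written out as code, the comparison looks roughly like this. It is a sketch of the cost model above, not the heuristic as implemented in this branch; the struct, the function names, and any constants you plug in are assumptions and would have to be calibrated per machine.

```rust
/// Hypothetical per-machine calibration constants, all in seconds.
#[derive(Debug, Clone, Copy)]
struct CostParams {
    gpu_session_load_s: f64,
    gpu_run_overhead_s: f64,
    gpu_s_per_token: f64,
    cpu_model_load_s: f64,
    cpu_s_per_token: f64,
}

/// GPU cost: session load + launch overhead + work on *padded* tokens.
fn gpu_cost_s(p: &CostParams, padded_tokens: u64) -> f64 {
    p.gpu_session_load_s + p.gpu_run_overhead_s + padded_tokens as f64 * p.gpu_s_per_token
}

/// CPU cost: model load (only if not loaded yet) + work on *real* tokens.
fn cpu_cost_s(p: &CostParams, real_tokens: u64, model_already_loaded: bool) -> f64 {
    let load = if model_already_loaded { 0.0 } else { p.cpu_model_load_s };
    load + real_tokens as f64 * p.cpu_s_per_token
}

/// Route to the GPU only when its estimated cost is strictly lower.
fn gpu_is_worth_it(
    p: &CostParams,
    real_tokens: u64,
    padded_tokens: u64,
    cpu_model_already_loaded: bool,
) -> bool {
    gpu_cost_s(p, padded_tokens) < cpu_cost_s(p, real_tokens, cpu_model_already_loaded)
}
```

Because the GPU side pays for padded tokens and a fixed overhead while the CPU side pays only for real tokens, small or heavily padded batches lose on the GPU even when its per-token cost is lower.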

No MIGraphX for queries

Every use of the MIGraphX backend incurs a fixed cost from ORT session creation and MIGraphX graph creation. So even on my system, where there is no CPU→GPU memory transfer, end-to-end CLI query/search (i.e., shapes of the form [1, n_seq]) is roughly 6–11× slower with MIGraphX, even when the query shape cache is warm.
So, normal/Auto CLI search should not use the GPU. --force-gpu can still be used to benchmark MIGraphX query encoding explicitly, but it is not a good default.

Direct query encode, 10 queries per repetition:

Variant	Model build median	First query batch	Steady query batch
CPU	156ms	12.4ms	11.0ms
MIGraphX cold fallback	509ms	1821ms	135.8ms
MIGraphX warm `[1,256]`	470ms	617ms	8.5ms

Warm MIGraphX was only slightly faster after the static query session was loaded;
first use dominated ordinary CLI search.

Full colgrep search, geomean median over 10 queries:

Variant	Geomean median	Slowdown vs CPU
`--force-cpu`	0.220s	1.00×
`--force-gpu`, cold MIGraphX cache	2.379s	10.83×
`--force-gpu`, warm query cache	1.275s	5.80×

After changing the default to keep MIGraphX search/query embedding on CPU:

Variant	Geomean median	vs CPU
`--force-cpu`	0.218s	1.00×
`--force-gpu` + query GPU override, warm cache	1.278s	5.86× slower

Conclusion: normal search/query should remain on CPU unless there is a long-lived
process that keeps a MIGraphX query session hot.
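
To put a number on “long-lived process”, the direct-encode table can be turned into a back-of-envelope crossover estimate. The sketch below assumes the steady per-batch times stay constant and that model build plus first batch are the only fixed costs; the type and function names are made up for illustration.

```rust
/// Timings for one backend, in milliseconds, as in the direct-encode table.
#[derive(Debug, Clone, Copy)]
struct Variant {
    model_build_ms: f64,
    first_batch_ms: f64,
    steady_batch_ms: f64,
}

impl Variant {
    /// Cumulative time to build the model and encode `batches` query batches.
    fn total_ms(&self, batches: u64) -> f64 {
        if batches == 0 {
            return self.model_build_ms;
        }
        self.model_build_ms
            + self.first_batch_ms
            + (batches - 1) as f64 * self.steady_batch_ms
    }
}

/// Smallest batch count (up to `max_batches`) at which the GPU variant
/// has a lower cumulative time than the CPU variant, if any.
fn first_winning_batch_count(gpu: Variant, cpu: Variant, max_batches: u64) -> Option<u64> {
    (1..=max_batches).find(|&n| gpu.total_ms(n) < cpu.total_ms(n))
}
```

With the warm `[1,256]` row (470ms, 617ms, 8.5ms) against the CPU row (156ms, 12.4ms, 11.0ms), the crossover under these assumptions lands at 369 batches of 10 queries, so a single process would have to serve a few thousand queries before MIGraphX comes out ahead.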

Other things I have tried

I tried doing graph surgery on the ONNX graphs to produce models that I hoped would compile faster because they map directly onto MIGraphX operators; this did not materially improve compile time or runtime.

This involved:

replacing decomposed LayerNorm subgraphs with LayerNormalization;
rewriting fixed-batch reshape targets to constants;
externalizing or simplifying attention masks;
removing dynamic-shape construction subgraphs before MIGraphX sees the graph.

I also tried using MIGraphX’s partial dynamic-shape support.
I first tried dynamic sequence length and fixed batch size, as it was, in my opinion, the most promising. After graph surgery to make sure the operators were compatible, I got stuck with failed GPU compilation due to:

Error fuse_horizontal:
SHAPE: lens() called on a dynamic shape

Other observed dynamic compiler limitations included:

fuse_horizontal: SHAPE: lens() called on a dynamic shape
fuse_pointwise: Wrong number of arguments: expected 2 but given 1
fuse_pointwise: add: Dimensions do not match
split_reduce: elements() called on dynamic shape
gpu::lowering: gpu::contiguous: Dynamic shapes not supported
gpu::lowering: gpu::gemm: Dynamic shapes not supported

I then tried dynamic batch and fixed sequence. It did actually “work” with

MIGRAPHX_DISABLE_PASSES=fuse_concat

But it was slower and larger than static:

Shape mode	Compile	MXR size	B=16 runtime
dynamic B=1..16, S=64	183s	533 MiB	≈1.29ms
static B=16, S=64	9s	34 MiB	≈0.86ms

So, when is the GPU worth it?

In practice: only when the exact static MIGraphX shapes are already warm, the indexing run is large enough to amortize session/load overhead, and the same shape set will be reused many times.

Warm-cache cost is shape-set dependent, not corpus-size dependent. It depends on the model, batch/static-shape token budget, sequence lengths warmed, precision/provider options, ORT/MIGraphX version, and GPU arch. So compile/warm numbers from different shape sets are not directly comparable.

The cleanest final cache-hit-only Auto benchmark I have is a synthetic threshold case:

dataset: synthetic dataset made from replicating the Zed repo
code units: 9289
estimated warm token slots: 9289 × 128 = 1,188,992
Auto threshold: 1,048,576
warmed shape: 1024×128
warm-cache wall: 136.7s
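
The run-level threshold check for this case can be sketched as below. The function names are hypothetical, and whether the real comparison is `>=` or `>` is an assumption on my part; the arithmetic is the same as in the list above.

```rust
/// Estimated token slots if every code unit is padded to `warm_seq_len`.
fn estimated_warm_token_slots(code_units: u64, warm_seq_len: u64) -> u64 {
    code_units.saturating_mul(warm_seq_len)
}

/// Auto only considers MIGraphX when the run is large enough to amortize
/// session/GPU overhead. Assumption: the threshold is inclusive.
fn auto_may_use_gpu(code_units: u64, warm_seq_len: u64, threshold: u64) -> bool {
    estimated_warm_token_slots(code_units, warm_seq_len) >= threshold
}
```

For this dataset, 9289 units at sequence length 128 give 1,188,992 slots, which is just above the 1,048,576 threshold, so the run qualifies. A run with 8000 units (1,024,000 slots) would stay on the CPU.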

Indexing result:

Run	Wall	Profile total	Encoding	Provider
CPU	3.641s	3.597s	2.375s	CPU INT8
Auto warm MIGraphX	3.172s	2.729s	2.197s	MIGraphX FP32/FP16

So the warmed GPU path saved:

wall saved:     ~0.47s (3.641s − 3.172s)
wall reduction: ~13% (≈1.15× speedup)
break-even:     136.7 / 0.469 ≈ 291.5, i.e. about 292 repeated index runs

That is the core problem. The warm run is faster, but not enough faster to justify the compile/warm-cache cost for normal usage.

I also ran larger end-to-end warm-cache experiments. These showed that MIGraphX can win on sufficiently large corpora when the required shape set is already warm:

Scenario	tiny	medium	large
CPU median	2.52s	7.92s	81.03s
MIGraphX cold, no warm	3.78s	9.31s	97.21s
MIGraphX prewarmed	—	10.61s	59.49s

For the large corpus:

CPU:                 81.03s
MIGraphX fully warm: 59.49s
speedup:             ~1.36×

But this should be read carefully: the win appears only after the relevant static shapes have already been compiled and cached. Cold/no-warm MIGraphX was slower than CPU even on the large corpus:

CPU large:              81.03s
MIGraphX cold/no warm:  97.21s

So my conclusion is:

tiny/small repos: CPU wins;
one-shot search/query: CPU wins;
cold MIGraphX indexing: CPU wins;
warm MIGraphX indexing can win on large repeated indexing workloads;
the warm-cache compile cost usually makes that win unattractive unless the same model/shape set is reused many times.

That is why this branch makes Auto cache-hit-only and keeps CPU as the practical default.

Disclaimer

I used Coding Agents to aid me.

Teach the ONNX layer to specialize MIGraphX sessions to validated static document shapes, key the cache by the selected model and provider options, and preserve strict --force-gpu semantics. ColGREP auto mode keeps CPU as the default path: it only enables MIGraphX for warm eligible document shapes when the run has enough work to amortize session/GPU overhead, with CPU fallback for cold shapes. Add opt-in COLGREP_PROFILE diagnostics so backend routing, model loading, and indexing/search phases can be measured without changing normal output.

Expose colgrep warm-cache as an explicit, advanced path for preparing provider-specific runtime caches. For MIGraphX it warms only eligible expensive static document shapes and reports when there is nothing worth compiling.

raphaelsty · 2026-06-05T20:28:01Z

Thank you for this MR @aussetg. I’ll watch it carefully; there might be a world in which this backend is complementary to the CPU and accelerates inference overall.

aussetg added 4 commits June 5, 2026 20:35

Generalize GPU execution provider selection

da4f0a1

Handle MIGraphX ONNX Runtime discovery

b18fb2a

Add ColGREP cache warming command

b9171fe


aussetg mentioned this pull request Jun 5, 2026

feat(onnx): add DirectML/MIGraphX/CoreML execution provider support #119

Merged

aussetg wants to merge 4 commits into lightonai:main from aussetg:pr/migraphx-backend-fast
