[DO NOT MERGE] Make Migraphx backend "fast" (i.e. not slow) #122

Draft
aussetg wants to merge 4 commits into lightonai:main from aussetg:pr/migraphx-backend-fast

aussetg commented Jun 5, 2026

This draft PR is a best-effort attempt to make ROCm/MIGraphX useful for ColGREP indexing and a request for review from anyone who knows a better MIGraphX route.

My conclusion, after implementing and benchmarking it, is that MIGraphX should not become a normal/default ColGREP backend. If it is kept at all, it should remain an explicit experimental feature. The implementation cost and cache/shape machinery required to make it competitive are too high for the gains we get.

I am opening this PR partly as evidence for that conclusion and partly to justify moving backend work toward a llama.cpp/ggml implementation instead.

Summary

The short version is:

  • CPU INT8 ONNX remains a more reliable and usually faster backend for ColGREP than using the GPU via MIGraphX.
  • MIGraphX can beat CPU only in narrow, warm-cache indexing cases (AVX2/512 is fast!).
  • Those wins require static-shape cache warming, provider-specific routing, cache-key hardening, tail-padding heuristics, and run-level thresholds.
  • Even after the best measured optimizations, the useful gains are too small to amortize compile cost on realistic repeat counts.
  • One-shot search/query latency is especially unsuitable for MIGraphX because session/cache loading dominates the tiny query workload.
  • Dynamic-shape MIGraphX would be the clean solution, but direct dynamic sequence length remains blocked by MIGraphX GPU compiler limitations. The dynamic-shape support is not currently sufficient for this workload.

Even shorter version:

MIGraphX required substantial backend-specific machinery to barely match or modestly beat CPU. The complexity is not justified.

What this branch implements

This branch is a best-effort implementation of cache-hit-only MIGraphX indexing:

  • generic GPU execution-provider selection;
  • MIGraphX-capable ONNX Runtime discovery;
  • static-shape MIGraphX cache inspection and validation;
  • cache-hit-only Auto behavior, so normal indexing does not cold-compile MIGraphX;
  • CPU fallback for cold/missing shapes;
  • bounded warm-tail padding;
  • run-level thresholds to avoid using MIGraphX on small workloads;
  • explicit colgrep warm-cache --provider migraphx.

The point of the PR is not “MIGraphX is ready”; it is to show what is required to make it almost competitive.

Hardware / Software I used for my test:

CPU: AMD Ryzen AI MAX+ 395
Logical cores: 32
CPU features: AVX2 / AVX512 / FMA
GPU: GFX1151 / AMD 8060 RDNA3.5
GPU type: integrated/UMA
MIGraphX: 2.16.0.20250912-17-354-g4874f127d7-dirty (compiled for ROCm 7.14)
ROCm version: 7.14
OS: CachyOS / Arch-based Linux
platform: Linux-7.0.10-1-cachyos-x86_64-with-glibc2.43
python:   3.13.12
machine:  x86_64
cpu_count: 32

Why MIGraphX is hard for ColGREP

ColGREP indexing is not a single large static tensor workload. It is a pipeline:

scan files
→ parse code units
→ tokenize many variable-length units
→ encode document batches
→ pool embeddings
→ write vector/metadata indexes

GPU acceleration only affects the encode stage. Everything else remains CPU-side and competes with GPU setup/launch/session costs.

More importantly, MIGraphX strongly prefers static shapes. And by strongly, I mean it is basically mandatory.
The ColBERT/ModernBERT document path naturally produces many variable row counts and sequence lengths. MIGraphX has preliminary support for one variable dimension, but we need two, and even then some operators we need are not supported. Without a true dynamic sequence-length path, the backend needs:

  • static-shape planning;
  • cache-keying and validation;
  • per-shape cache directories;
  • warm-cache inspection without ORT session initialization;
  • route selection between exact GPU, padded GPU tail, split GPU, and CPU fallback;
  • thresholds to avoid using GPU on tiny runs;
  • special query/search handling;
  • strict handling of forced GPU mode.

This complexity is the central negative result.

Performance Results

I diligently benchmarked each of the “improvements”/features above to ensure they genuinely helped. I can provide results, but to avoid polluting this draft PR, I will only give the big ones.

Only use pre-specified fixed shapes

First, I fixed what I think is a bug in tokenize_documents_in_batches. Comments clearly said:

        // GPU path: token-budget dynamic batching. Documents are sorted by
        // length and bucketed into fixed shapes (quantized to 32-token steps).
        // This lets the GPU reuse execution plans across batches with the same
        // shape, reducing kernel launch overhead and minimizing padding 

But this wasn’t actually the case. Nobody seems to have noticed, probably because CUDA tolerates this better: it does not have the same static-graph compile-cache behavior. With MIGraphX, however, it meant that nearly every batch had a different shape from the next. That is a problem because MIGraphX wants static shapes: it recompiles each new shape into a separate “optimized” graph, and MIGraphX compilation time is fairly long (multiple seconds).

Just padding up to the next fixed (warm) bucket brings a colgrep init run on the next-plaid repo from multiple minutes down to 12.5s encode + 34s encode, which would sound impressive if the CPU weren’t taking 8s.
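As a rough illustration of the fix described above, here is a minimal Python sketch of rounding document lengths up to 32-token buckets so that batches share a small set of static shapes. The function names are hypothetical, and the real `tokenize_documents_in_batches` (not shown in this PR description) may differ in its details.

```python
from collections import defaultdict

STEP = 32  # quantization step quoted in the code comment above


def bucket_length(n_tokens: int, step: int = STEP) -> int:
    """Round a token count up to the next multiple of `step` (at least one step)."""
    return max(step, -(-n_tokens // step) * step)


def group_by_bucket(doc_lengths: list[int], step: int = STEP) -> dict[int, list[int]]:
    """Map each padded sequence length to the indices of the documents using it."""
    buckets: dict[int, list[int]] = defaultdict(list)
    for idx, n_tokens in enumerate(doc_lengths):
        buckets[bucket_length(n_tokens, step)].append(idx)
    return dict(buckets)


# Six documents collapse into three static sequence lengths instead of six.
lengths = [5, 31, 32, 33, 64, 100]
grouped = group_by_bucket(lengths)
print(grouped)
```

With this kind of quantization, the number of distinct sequence lengths is bounded by the maximum document length divided by the step, which is what makes a finite compile cache possible at all.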

Static-shape cache and cold-shape CPU fallback

You can cache MIGraphX’s compiled graphs (.mxr) and reuse them later.

auto CPU:                         8.045s
--force-gpu cold hybrid:         12.257s
--force-gpu + doc length 512:     6.610s

Before that, --force-gpu was timing out around 80s.

Separate cache warming cost:

colgrep warm-cache --provider migraphx: ~34.2s, warmed 4 shapes

After adding shape-specialized MIGraphX sessions plus cold-shape CPU fallback, the colgrep/src benchmark went from an ~80s --force-gpu timeout to a completed cold hybrid run in 12.257s. That is a big improvement over the broken GPU path but still slower than the CPU path at 8.045s.

The explicit cache-warming step was separate and cost about 34.2s for the initial four shapes. It did not make this small colgrep/src case a clear win; many long/partial shapes still fell back to CPU. The 6.610s result came from the optional NEXT_PLAID_MIGRAPHX_DOCUMENT_LENGTH=512 cap, which changes truncation/quality and should not be treated as a fair default.

Hybrid routing for batches

One way to regain some of the performance we know we lack is to run some batches on the GPU when they are close to the bucket size and we don’t have to pad too much, and run the others on the CPU.
Doing that gets us to approximately 12s for colgrep init.
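The routing idea can be sketched as follows. This is only an illustration in Python: `choose_route`, the `(rows, seq_len)` shape tuples, and the 25% padding bound are all assumptions made for the example, not the values or names used in the branch.

```python
def choose_route(rows, seq_len, warm_shapes, max_pad_ratio=0.25):
    """Pick a warm static shape with bounded padding, else fall back to CPU.

    Returns ("gpu", (rows, seq_len)) for the smallest warm shape that fits the
    batch without wasting more than `max_pad_ratio` of its token slots, or
    ("cpu", None) when no warm shape qualifies.
    """
    real_tokens = rows * seq_len
    best = None  # (padded_tokens, shape)
    for warm_rows, warm_seq in warm_shapes:
        if warm_rows < rows or warm_seq < seq_len:
            continue  # the batch does not fit into this compiled shape
        padded_tokens = warm_rows * warm_seq
        pad_ratio = (padded_tokens - real_tokens) / padded_tokens
        if pad_ratio <= max_pad_ratio and (best is None or padded_tokens < best[0]):
            best = (padded_tokens, (warm_rows, warm_seq))
    if best is None:
        return ("cpu", None)
    return ("gpu", best[1])


warm = {(16, 64), (16, 128)}
print(choose_route(16, 64, warm))   # exact hit on a warm shape
print(choose_route(14, 60, warm))   # small tail, padded into (16, 64)
print(choose_route(4, 40, warm))    # too much padding, stays on CPU
```

The bound on padding is the important design choice: padded slots cost GPU time but produce nothing, so a batch that fills only a small part of a warm shape is usually better left on the CPU.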

Similarly, I’ve also added logic to heuristically decide if a batch is “worth it” on GPU vs. CPU.

We also must depad the batches before sending them to the CPU.

GPU cost = per-shape session load
         + per-run launch/session overhead
         + padded_tokens × GPU_token_cost

CPU cost = CPU_model_load_if_needed
         + real_tokens × CPU_token_cost
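A direct transcription of this cost model into Python might look like the sketch below. The field names mirror the terms above, but the structure and the numbers in the usage example are illustrative placeholders, not measured values from this branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    gpu_session_load_s: float   # per-shape session load
    gpu_run_overhead_s: float   # per-run launch/session overhead
    gpu_token_cost_s: float     # seconds per padded token slot on GPU
    cpu_model_load_s: float     # paid only if the CPU model is not loaded yet
    cpu_token_cost_s: float     # seconds per real token on CPU


def gpu_cost(p: CostParams, padded_tokens: int) -> float:
    return p.gpu_session_load_s + p.gpu_run_overhead_s + padded_tokens * p.gpu_token_cost_s


def cpu_cost(p: CostParams, real_tokens: int, cpu_model_loaded: bool) -> float:
    load = 0.0 if cpu_model_loaded else p.cpu_model_load_s
    return load + real_tokens * p.cpu_token_cost_s


def worth_gpu(p: CostParams, padded_tokens: int, real_tokens: int,
              cpu_model_loaded: bool = False) -> bool:
    """True when the estimated GPU cost is strictly below the CPU cost."""
    return gpu_cost(p, padded_tokens) < cpu_cost(p, real_tokens, cpu_model_loaded)


# Placeholder parameters chosen only to show the crossover behaviour.
params = CostParams(0.5, 0.1, 1e-6, 0.2, 2e-6)
print(worth_gpu(params, padded_tokens=1_000, real_tokens=900))
print(worth_gpu(params, padded_tokens=2_000_000, real_tokens=1_900_000))
```

Note that the GPU side is charged for padded tokens while the CPU side is charged only for real tokens, which is why heavily padded batches tilt toward the CPU even when the per-token GPU cost is lower.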

No MIGraphX for queries

Every use of the MIGraphX backend incurs a fixed cost from ORT session creation and MIGraphX graph creation. So even on my system, where there is no CPU→GPU memory transfer, end-to-end CLI query/search (i.e., shapes of the type [1, n_seq]) is 5–11× slower with MIGraphX, even when the query shape cache is warm.
Normal/Auto CLI search should therefore not use the GPU. --force-gpu can still be used to benchmark MIGraphX query encoding explicitly, but it is not a good default.

Direct query encode, 10 queries per repetition:

| Variant | Model build median | First query batch | Steady query batch |
|---|---|---|---|
| CPU | 156ms | 12.4ms | 11.0ms |
| MIGraphX cold fallback | 509ms | 1821ms | 135.8ms |
| MIGraphX warm [1,256] | 470ms | 617ms | 8.5ms |

Warm MIGraphX was only slightly faster after the static query session was loaded;
first use dominated ordinary CLI search.
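To make "first use dominated" concrete, the sketch below estimates how many 10-query batches a single process would need before warm MIGraphX overtakes CPU, using the medians from the table above. It assumes, as a simplification, that model build and first-batch costs are paid once per process and that every later batch costs the steady value; the function names are hypothetical.

```python
import math


def total_ms(build_ms: float, first_ms: float, steady_ms: float, n_batches: int) -> float:
    """Total time for n_batches in one process: build + first batch + steady batches."""
    return build_ms + first_ms + (n_batches - 1) * steady_ms


def batches_until_gpu_wins(cpu: tuple, gpu: tuple):
    """Smallest batch count at which the GPU total is strictly lower, or None."""
    cpu_build, cpu_first, cpu_steady = cpu
    gpu_build, gpu_first, gpu_steady = gpu
    upfront_gap = (gpu_build + gpu_first) - (cpu_build + cpu_first)
    if upfront_gap < 0:
        return 1  # GPU already ahead on the first batch
    per_batch_gain = cpu_steady - gpu_steady
    if per_batch_gain <= 0:
        return None  # GPU never catches up
    return math.floor(upfront_gap / per_batch_gain) + 2


CPU = (156.0, 12.4, 11.0)            # ms, from the table above
MIGRAPHX_WARM = (470.0, 617.0, 8.5)  # ms, from the table above
MIGRAPHX_COLD = (509.0, 1821.0, 135.8)

n = batches_until_gpu_wins(CPU, MIGRAPHX_WARM)
print(n, "batches, i.e. about", n * 10, "queries")
print(batches_until_gpu_wins(CPU, MIGRAPHX_COLD))
```

Under these assumptions the warm path needs several hundred batches in the same process to pay back its startup cost, which a one-shot CLI search never reaches.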

Full colgrep search, geomean median over 10 queries:

| Variant | Geomean median | Slowdown vs CPU |
|---|---|---|
| --force-cpu | 0.220s | 1.00× |
| --force-gpu, cold MIGraphX cache | 2.379s | 10.83× |
| --force-gpu, warm query cache | 1.275s | 5.80× |

After changing the default to keep MIGraphX search/query embedding on CPU:

| Variant | Geomean median | vs CPU |
|---|---|---|
| --force-cpu | 0.218s | 1.00× |
| --force-gpu + query GPU override, warm cache | 1.278s | 5.86× slower |

Conclusion: normal search/query should remain on CPU unless there is a long-lived
process that keeps a MIGraphX query session hot.

Other things I have tried

I tried graph surgery on the ONNX graphs to produce models that I hoped would compile faster because they would map directly to MIGraphX operators; this did not materially improve compile time or runtime.

This involved:

  • replacing decomposed LayerNorm subgraphs with LayerNormalization;
  • rewriting fixed-batch reshape targets to constants;
  • externalizing or simplifying attention masks;
  • removing dynamic-shape construction subgraphs before MIGraphX sees the graph.

I also tried using MIGraphX’s partial dynamic-shape support.
I first tried dynamic sequence length and fixed batch size, as it was, in my opinion, the most promising. After graph surgery to make sure the operators were compatible, I got stuck with failed GPU compilation due to:

Error fuse_horizontal:
SHAPE: lens() called on a dynamic shape

Other observed dynamic compiler limitations included:

fuse_horizontal: SHAPE: lens() called on a dynamic shape
fuse_pointwise: Wrong number of arguments: expected 2 but given 1
fuse_pointwise: add: Dimensions do not match
split_reduce: elements() called on dynamic shape
gpu::lowering: gpu::contiguous: Dynamic shapes not supported
gpu::lowering: gpu::gemm: Dynamic shapes not supported

I then tried dynamic batch and fixed sequence. It did actually “work” with

MIGRAPHX_DISABLE_PASSES=fuse_concat

But it was slower and larger than static:

| Shape mode | Compile | MXR size | B=16 runtime |
|---|---|---|---|
| dynamic B=1..16, S=64 | 183s | 533 MiB | ≈1.29ms |
| static B=16, S=64 | 9s | 34 MiB | ≈0.86ms |

So, when is the GPU worth it?

In practice: only when the exact static MIGraphX shapes are already warm, the indexing run is large enough to amortize session/load overhead, and the same shape set will be reused many times.

Warm-cache cost is shape-set dependent, not corpus-size dependent. It depends on the model, batch/static-shape token budget, sequence lengths warmed, precision/provider options, ORT/MIGraphX version, and GPU arch. So compile/warm numbers from different shape sets are not directly comparable.
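Because a compiled graph is only valid for one combination of those inputs, a cache key has to cover all of them. The following is a minimal sketch of such a key in Python; the function name, field names, and the placeholder version strings are assumptions for illustration and do not describe the key format used in this branch.

```python
import hashlib
import json


def migraphx_cache_key(model_sha256: str, rows: int, seq_len: int, precision: str,
                       provider_options: dict, ort_version: str,
                       migraphx_version: str, gpu_arch: str) -> str:
    """Derive a short, deterministic key from everything that invalidates a compiled graph."""
    payload = {
        "model": model_sha256,
        "shape": [rows, seq_len],
        "precision": precision,
        "provider_options": provider_options,
        "ort": ort_version,
        "migraphx": migraphx_version,
        "arch": gpu_arch,
    }
    # sort_keys makes the key independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


key_a = migraphx_cache_key("abc", 16, 64, "fp16", {"a": "1", "b": "2"},
                           "x.y.z", "2.16.0", "gfx1151")
key_b = migraphx_cache_key("abc", 16, 64, "fp16", {"b": "2", "a": "1"},
                           "x.y.z", "2.16.0", "gfx1151")
key_c = migraphx_cache_key("abc", 16, 64, "fp16", {"a": "1", "b": "2"},
                           "x.y.z", "2.16.0", "gfx1100")
print(key_a, key_a == key_b, key_a == key_c)
```

Using the key as a per-shape directory name would make it possible to check for a warm shape by testing whether the directory exists, without initializing an ORT session.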

The cleanest final cache-hit-only Auto benchmark I have is a synthetic threshold case:

dataset: synthetic dataset made from replicating the Zed repo
code units: 9289
estimated warm token slots: 9289 × 128 = 1,188,992
Auto threshold: 1,048,576
warmed shape: 1024×128
warm-cache wall: 136.7s
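The run-level gate implied by these numbers can be sketched as below. The threshold value and the token-slot estimate come from the figures above; the function names and the use of `>=` for the comparison are assumptions for the example.

```python
AUTO_THRESHOLD_TOKEN_SLOTS = 1_048_576  # Auto threshold quoted above


def estimated_token_slots(code_units: int, seq_len: int) -> int:
    """Upper-bound estimate: every code unit occupies one full padded row."""
    return code_units * seq_len


def auto_uses_migraphx(code_units: int, seq_len: int, shape_is_warm: bool,
                       threshold: int = AUTO_THRESHOLD_TOKEN_SLOTS) -> bool:
    """Cache-hit-only Auto: require a warm shape and enough work to amortize overhead."""
    return shape_is_warm and estimated_token_slots(code_units, seq_len) >= threshold


slots = estimated_token_slots(9289, 128)
print(slots)
print(auto_uses_migraphx(9289, 128, shape_is_warm=True))
print(auto_uses_migraphx(9289, 128, shape_is_warm=False))
print(auto_uses_migraphx(4000, 128, shape_is_warm=True))
```

The synthetic dataset sits just above the threshold, so it shows the smallest kind of run for which Auto would pick MIGraphX at all.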

Indexing result:

Run Wall Profile total Encoding Provider
CPU 3.641s 3.597s 2.375s CPU INT8
Auto warm MIGraphX 3.172s 2.729s 2.197s MIGraphX FP32/FP16

So the warmed GPU path saved:

wall saved:          ~0.47s (3.641s - 3.172s)
wall-time reduction: ~13% (≈1.15× speedup)
break-even:          136.7 / 0.469 ≈ 291.5, so 292 repeated index runs
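The arithmetic behind these numbers can be reproduced with a few lines of Python, using the wall times from the indexing table and the warm-cache wall time above. This is only a worked calculation; the variable names are made up for the example.

```python
import math

cpu_wall_s = 3.641     # CPU run, wall time
gpu_wall_s = 3.172     # Auto warm MIGraphX run, wall time
warm_cache_s = 136.7   # one-time warm-cache wall time

saved_s = cpu_wall_s - gpu_wall_s            # time saved per warm index run
reduction = saved_s / cpu_wall_s             # fraction of CPU wall time saved
speedup = cpu_wall_s / gpu_wall_s            # same result expressed as a ratio
break_even_runs = math.ceil(warm_cache_s / saved_s)

print(f"saved per run: {saved_s:.3f}s")
print(f"wall-time reduction: {reduction:.1%} ({speedup:.2f}x)")
print(f"break-even after {break_even_runs} warm index runs")
```

The break-even count only covers the warm-cache cost for one shape set; any change of model, provider options, or MIGraphX version would restart the count.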

That is the core problem. The warm run is faster, but not enough faster to justify the compile/warm-cache cost for normal usage.

I also ran larger end-to-end warm-cache experiments. These showed that MIGraphX can win on sufficiently large corpora when the required shape set is already warm:

Scenario tiny medium large
CPU median 2.52s 7.92s 81.03s
MIGraphX cold, no warm 3.78s 9.31s 97.21s
MIGraphX prewarmed 10.61s 59.49s

For the large corpus:

CPU:                 81.03s
MIGraphX fully warm: 59.49s
speedup:             ~1.36×

But this should be read carefully: the win appears only after the relevant static shapes have already been compiled and cached. Cold/no-warm MIGraphX was slower than CPU even on the large corpus:

CPU large:              81.03s
MIGraphX cold/no warm:  97.21s

So my conclusion is:

  • tiny/small repos: CPU wins;
  • one-shot search/query: CPU wins;
  • cold MIGraphX indexing: CPU wins;
  • warm MIGraphX indexing can win on large repeated indexing workloads;
  • the warm-cache compile cost usually makes that win unattractive unless the same model/shape set is reused many times.

That is why this branch makes Auto cache-hit-only and keeps CPU as the practical default.

Disclaimer

I used Coding Agents to aid me.

aussetg added 4 commits June 5, 2026 20:35
Teach the ONNX layer to specialize MIGraphX sessions to validated static document shapes, key the cache by the selected model and provider options, and preserve strict --force-gpu semantics.

ColGREP auto mode keeps CPU as the default path: it only enables MIGraphX for warm eligible document shapes when the run has enough work to amortize session/GPU overhead, with CPU fallback for cold shapes.

Add opt-in COLGREP_PROFILE diagnostics so backend routing, model loading, and indexing/search phases can be measured without changing normal output.
Expose colgrep warm-cache as an explicit, advanced path for preparing provider-specific runtime caches. For MIGraphX it warms only eligible expensive static document shapes and reports when there is nothing worth compiling.

@raphaelsty (Collaborator)

Thank you for this MR, @aussetg. I'll follow it closely; there might be a world in which this backend is complementary to the CPU and accelerates inference overall.
