[DO NOT MERGE] Make Migraphx backend "fast" (i.e. not slow) #122
Draft
aussetg wants to merge 4 commits into
Teach the ONNX layer to specialize MIGraphX sessions to validated static document shapes, key the cache by the selected model and provider options, and preserve strict --force-gpu semantics. ColGREP auto mode keeps CPU as the default path: it only enables MIGraphX for warm eligible document shapes when the run has enough work to amortize session/GPU overhead, with CPU fallback for cold shapes. Add opt-in COLGREP_PROFILE diagnostics so backend routing, model loading, and indexing/search phases can be measured without changing normal output.
Expose colgrep warm-cache as an explicit, advanced path for preparing provider-specific runtime caches. For MIGraphX it warms only eligible expensive static document shapes and reports when there is nothing worth compiling.
Collaborator
Thank you for this MR @aussetg. I'll watch the MR carefully; there might be a world in which this backend is complementary to CPU and accelerates inference overall.
This draft PR is a best-effort attempt to make ROCm/MIGraphX useful for ColGREP indexing and a request for review from anyone who knows a better MIGraphX route.
My conclusion, after implementing and benchmarking it, is that MIGraphX should not become a normal/default ColGREP backend. If it is kept at all, it should remain an explicit experimental feature. The implementation cost and cache/shape machinery required to make it competitive are too high for the gains we get.
I am opening this PR partly as evidence for that conclusion and partly to justify moving backend work toward a llama.cpp/ggml implementation instead.
Summary
The short version is:
Even shorter version:
MIGraphX required substantial backend-specific machinery to barely match or modestly beat CPU. The complexity is not justified.
What this branch implements
This branch is a best-effort implementation of cache-hit-only MIGraphX indexing:
colgrep warm-cache --provider migraphx.
The point of the PR is not “MIGraphX is ready”; it is to show what is required to make it almost competitive.
Hardware / Software I used for my test:
Why MIGraphX is hard for ColGREP
ColGREP indexing is not a single large static tensor workload. It is a pipeline:
GPU acceleration only affects the encode stage. Everything else remains CPU-side and competes with GPU setup/launch/session costs.
More importantly, MIGraphX also strongly prefers static shapes. And by strongly, I mean it is basically mandatory.
The ColBERT/ModernBERT document path naturally produces many variable row counts and sequence lengths. MIGraphX has only preliminary support for batching over one variable dimension, whereas we need two, and even then some operators we need are not supported. Without a true dynamic sequence-length path, the backend needs:
This complexity is the central negative result.
Performance Results
I diligently benchmarked each of the “improvements”/features above to ensure they genuinely helped. I can provide results, but to avoid polluting this draft PR, I will only give the big ones.
Only use pre-specified fixed shapes
First, I fixed what I think is a bug in tokenize_documents_in_batches. Comments clearly said:
But this wasn’t actually the case. Given that nobody noticed, CUDA likely tolerates this better because it does not have the same static-graph compile-cache behavior. With MIGraphX, however, it meant that nearly every batch had a different shape from the previous one. This is a problem: MIGraphX wants static shapes, so it recompiles each new shape into a separate “optimized” graph, and MIGraphX compilation is fairly slow (multiple seconds).
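One way to stop every batch from having its own shape is to round each batch up to a small, fixed set of sequence-length buckets. The following is a minimal Python sketch of that idea only; the bucket sizes, `bucket_for`, and `pad_batch` are illustrative assumptions and not the values or functions used in this branch.

```python
import bisect

# Hypothetical sequence-length buckets; the real branch may use different ones.
SEQ_BUCKETS = [64, 128, 256, 512]

def bucket_for(seq_len, buckets=SEQ_BUCKETS):
    """Return the smallest bucket >= seq_len, or None if no bucket is large enough."""
    i = bisect.bisect_left(buckets, seq_len)
    return buckets[i] if i < len(buckets) else None

def pad_batch(rows, pad_id=0, buckets=SEQ_BUCKETS):
    """Pad every row of token ids to the bucket that fits the longest row."""
    target = bucket_for(max(len(r) for r in rows), buckets)
    if target is None:
        raise ValueError("batch is longer than the largest bucket")
    return [r + [pad_id] * (target - len(r)) for r in rows]

# Six distinct raw lengths collapse into three static shapes to compile.
raw_lengths = [37, 41, 90, 101, 130, 255]
distinct_raw = len(set(raw_lengths))
distinct_bucketed = len({bucket_for(n) for n in raw_lengths})
```

With a static-graph compiler, the number of distinct shapes is what drives compile time, so the trade-off is extra padding work in exchange for far fewer compilations.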
Solving just that by padding up to the next fixed (warm) bucket brings a colgrep init run on the next-plaid repo from multiple minutes to 12.5s encode + 34s encode. That would sound impressive if the CPU weren’t taking 8s.
Static-shape cache and cold-shape CPU fallback
You can cache MIGraphX’s compiled graphs (.mxr) and reuse them later.
Before that, --force-gpu was timing out around 80s.
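The cache-hit-only idea can be sketched as a lookup keyed by static shape: a shape that is already compiled goes to MIGraphX, and anything else falls back to CPU instead of triggering a multi-second compile. This is a simplified Python illustration; `WarmShapeRouter`, the shape tuples, and the string return values are hypothetical and do not mirror the branch's actual types.

```python
from typing import Iterable, List, Set, Tuple

Shape = Tuple[int, int]  # (batch_rows, seq_len)

class WarmShapeRouter:
    """Send a batch to the GPU only if its exact static shape is already compiled."""

    def __init__(self, warm_shapes: Iterable[Shape]):
        self.warm_shapes: Set[Shape] = set(warm_shapes)
        self.cold_seen: List[Shape] = []  # shapes that fell back, useful for profiling

    def route(self, shape: Shape) -> str:
        if shape in self.warm_shapes:
            return "migraphx"
        # Cold shape: do not compile on the hot path, use the CPU session instead.
        self.cold_seen.append(shape)
        return "cpu"

router = WarmShapeRouter([(32, 128), (16, 256)])
decisions = [router.route(s) for s in [(32, 128), (8, 512), (16, 256)]]
```

Recording the cold shapes, as `cold_seen` does here, is one way to decide later which shapes would be worth warming.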
Separate cache warming cost:
After adding shape-specialized MIGraphX sessions plus cold-shape CPU fallback, the colgrep/src benchmark went from an ~80s --force-gpu timeout to a completed cold hybrid run in 12.257s. That is a big improvement over the broken GPU path but still slower than the CPU path at 8.045s.
The explicit cache-warming step was separate and cost about 34.2s for the initial four shapes. It did not make this small colgrep/src case a clear win; many long/partial shapes still fell back to CPU. The 6.610s result came from the optional NEXT_PLAID_MIGRAPHX_DOCUMENT_LENGTH=512 cap, which changes truncation/quality and should not be treated as a fair default.
Hybrid routing for batches
One way to regain some of the performance we know we lack is to run some batches on the GPU when they are close to the bucket size and we don’t have to pad too much, and run the others on the CPU.
Doing that gets us to approximately 12s for colgrep init.
Similarly, I’ve also added logic to heuristically decide if a batch is “worth it” on GPU vs. CPU.
We also must depad the batches before sending them to the CPU.
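A rough Python sketch of this kind of routing follows. It sends a batch to the GPU only when little of the padded tensor would be padding, and strips the padding again for batches that go to the CPU. The 25% threshold and all function names are assumptions chosen for illustration; the heuristic in this branch may weigh other factors.

```python
def padding_waste(lengths, bucket_len):
    """Fraction of the padded [rows, bucket_len] tensor that is padding."""
    real = sum(lengths)
    total = len(lengths) * bucket_len
    return 1.0 - real / total

def choose_backend(lengths, bucket_len, max_waste=0.25):
    """Hypothetical heuristic: use the GPU only when the batch nearly fills its bucket."""
    if max(lengths) > bucket_len:
        return "cpu"  # does not fit the static shape at all
    return "gpu" if padding_waste(lengths, bucket_len) <= max_waste else "cpu"

def depad(rows, lengths):
    """Strip bucket padding so the CPU path only processes real tokens."""
    return [row[:n] for row, n in zip(rows, lengths)]

dense = choose_backend([120, 128, 110, 125], bucket_len=128)  # about 6% padding
sparse = choose_backend([10, 20, 128], bucket_len=128)        # about 59% padding
trimmed = depad([[1, 2, 0, 0], [3, 0, 0, 0]], [2, 1])
```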
No MIGraphX for queries
Every use of the MIGraphX backend incurs a fixed cost from ORT session creation and MIGraphX graph creation. So even on my system, where there is no CPU→GPU memory transfer, end-to-end CLI query/search (i.e., shapes of the type [1, n_seq]) is 5–11× slower with MIGraphX, even when the query shape cache is warm.
So, normal/Auto CLI search should not use the GPU. --force-gpu can still be used to benchmark MIGraphX query encoding explicitly, but it is not a good default.
Direct query encode, 10 queries per repetition:
[1,256]
Warm MIGraphX was only slightly faster after the static query session was loaded; first use dominated ordinary CLI search.
Full colgrep search, geomean median over 10 queries:
--force-cpu
--force-gpu, cold MIGraphX cache
--force-gpu, warm query cache
After changing the default to keep MIGraphX search/query embedding on CPU:
--force-cpu
--force-gpu + query GPU override, warm cache
Conclusion: normal search/query should remain on CPU unless there is a long-lived process that keeps a MIGraphX query session hot.
Other things I have tried
I tried doing graph surgery on the ONNX graphs to produce models that I hoped would compile faster because they would map directly to MIGraphX operators. This did not materially improve compile time or runtime.
This involved:
LayerNormalization;
I also tried using MIGraphX’s partial dynamic-shape support.
I first tried dynamic sequence length and fixed batch size, as it was, in my opinion, the most promising. After graph surgery to make sure the operators were compatible, I got stuck with failed GPU compilation due to:
Other observed dynamic compiler limitations included:
I then tried dynamic batch and fixed sequence. It did actually “work” with
But it was slower and larger than static:
So, when is the GPU worth it?
In practice: only when the exact static MIGraphX shapes are already warm, the indexing run is large enough to amortize session/load overhead, and the same shape set will be reused many times.
Warm-cache cost is shape-set dependent, not corpus-size dependent. It depends on the model, batch/static-shape token budget, sequence lengths warmed, precision/provider options, ORT/MIGraphX version, and GPU arch. So compile/warm numbers from different shape sets are not directly comparable.
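Because a compiled graph is only valid for one combination of these factors, a cache entry needs a key that covers all of them. The sketch below shows one way such a key could be derived in Python; the field names, the JSON encoding, and the use of SHA-256 are assumptions for illustration and are not taken from this branch.

```python
import hashlib
import json

def warm_cache_key(model_digest, shape, precision, provider_options,
                   ort_version, migraphx_version, gpu_arch):
    """Derive a stable key from everything that can invalidate a compiled graph."""
    payload = {
        "model": model_digest,
        "shape": list(shape),          # (batch_rows, seq_len)
        "precision": precision,
        "provider_options": provider_options,
        "ort": ort_version,
        "migraphx": migraphx_version,
        "gpu_arch": gpu_arch,
    }
    # sort_keys makes the key independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

# All values below are placeholders, not real versions or digests.
key_a = warm_cache_key("abc123", (32, 128), "fp16", {"x": "1", "y": "2"}, "ort-A", "mgx-A", "arch-A")
key_b = warm_cache_key("abc123", (32, 128), "fp16", {"y": "2", "x": "1"}, "ort-A", "mgx-A", "arch-A")
key_c = warm_cache_key("abc123", (32, 256), "fp16", {"x": "1", "y": "2"}, "ort-A", "mgx-A", "arch-A")
```

Here `key_a` and `key_b` match because only the option order differs, while `key_c` differs because the sequence length changed.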
The cleanest final cache-hit-only Auto benchmark I have is a synthetic threshold case:
Indexing result:
So the warmed GPU path saved:
That is the core problem. The warm run is faster, but not enough faster to justify the compile/warm-cache cost for normal usage.
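The amortization argument can be written as a simple break-even calculation: the one-off warm-cache cost divided by the time saved per warm run gives the number of runs needed before warming pays for itself. The Python sketch below uses made-up round numbers purely to show the arithmetic; they are not measurements from this PR.

```python
import math

def break_even_runs(warm_cost_s, cpu_run_s, gpu_warm_run_s):
    """Number of warm runs needed to repay the one-off warm-cache cost.

    Returns None when the warm GPU run is not faster than the CPU run,
    because the cost is then never repaid.
    """
    saving = cpu_run_s - gpu_warm_run_s
    if saving <= 0:
        return None
    return math.ceil(warm_cost_s / saving)

# Placeholder values: 30 s to warm, 10 s per CPU run, 8 s per warm GPU run.
runs_needed = break_even_runs(warm_cost_s=30.0, cpu_run_s=10.0, gpu_warm_run_s=8.0)
never = break_even_runs(warm_cost_s=30.0, cpu_run_s=8.0, gpu_warm_run_s=12.0)
```

With these placeholder values the same shape set would have to be reused for 15 runs before the warm cost is recovered, and any change to the shape set resets the count.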
I also ran larger end-to-end warm-cache experiments. These showed that MIGraphX can win on sufficiently large corpora when the required shape set is already warm:
For the large corpus:
But this should be read carefully: the win appears only after the relevant static shapes have already been compiled and cached. Cold/no-warm MIGraphX was slower than CPU even on the large corpus:
So my conclusion is:
That is why this branch makes Auto cache-hit-only and keeps CPU as the practical default.
Disclaimer
I used Coding Agents to aid me.