Reasons for Ratchet Quantized structure #143
FL33TW00D announced in Announcements
Avoiding 2 model implementations
When we want to support both full-precision and quantised models, we can end up in a sticky situation: two implementations of the model to maintain, both of which the user must be aware of.
Why is this?
Many quantisation schemes look something like the following:
This is a quantized block, where 16 f32 values are converted to i8, and packed into 4 u32.
The abs (absmax) is used to scale the values back into the range of an f32.
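A minimal sketch of such a block, assuming symmetric absmax quantisation with a scale of 127 / absmax. The field names `data` and `absmax`, the byte order within each `u32`, and the helper functions `quantize` and `dequantize` are illustrative assumptions, not Ratchet's actual code:

```rust
/// One quantised block: 16 values plus the scale needed to recover them.
/// Field names and packing order are assumptions made for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BlockQ8 {
    /// 16 i8 values, packed four per u32 (lowest byte first).
    data: [u32; 4],
    /// Largest absolute value among the original 16 floats.
    absmax: f32,
}

/// Quantise 16 f32 values into one block.
fn quantize(values: &[f32; 16]) -> BlockQ8 {
    let absmax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Guard against dividing by zero when the whole block is zero.
    let scale = if absmax == 0.0 { 0.0 } else { 127.0 / absmax };
    let mut data = [0u32; 4];
    for (i, v) in values.iter().enumerate() {
        let q = (v * scale).round() as i8; // lands in [-127, 127]
        data[i / 4] |= (q as u8 as u32) << (8 * (i % 4));
    }
    BlockQ8 { data, absmax }
}

/// Recover approximate f32 values by rescaling with absmax.
fn dequantize(block: &BlockQ8) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    for (i, o) in out.iter_mut().enumerate() {
        let byte = (block.data[i / 4] >> (8 * (i % 4))) as u8;
        *o = (byte as i8) as f32 * block.absmax / 127.0;
    }
    out
}
```

The payload here is 16 bytes of weights plus 4 bytes of scale, 20 bytes per 16 values, which is where the memory saving over f32 or f16 comes from.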
This introduces a problem for us: how do we represent this data in a Tensor?
We could just use 2 tensors, one for the weights, and one for the absmax.
This would work! But see the problem this introduces in the below example:
Because we have 2 separate tensors for the weight & absmax, we need 2 model loaders!
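To illustrate the duplication, here is a rough sketch under simplifying assumptions. The `Tensor` stand-in, the `Checkpoint` map, the tensor names (`fc.weight`, `fc.absmax`) and both loader functions are hypothetical and only meant to show the shape of the problem:

```rust
use std::collections::HashMap;

/// Stand-in for a real tensor type: just raw bytes here.
#[derive(Debug, Clone, PartialEq)]
struct Tensor(Vec<u8>);

/// Hypothetical checkpoint: tensor name -> tensor.
type Checkpoint = HashMap<String, Tensor>;

/// Full-precision layer: one tensor per weight.
struct Linear {
    weight: Tensor,
}

/// Quantised layer: the same weight now needs two tensors.
struct QuantLinear {
    weight: Tensor,
    absmax: Tensor,
}

/// Loader number one: looks up a single tensor.
fn load_linear(ckpt: &Checkpoint, name: &str) -> Option<Linear> {
    Some(Linear {
        weight: ckpt.get(&format!("{name}.weight"))?.clone(),
    })
}

/// Loader number two: same layer, but it must also find the absmax tensor.
fn load_quant_linear(ckpt: &Checkpoint, name: &str) -> Option<QuantLinear> {
    Some(QuantLinear {
        weight: ckpt.get(&format!("{name}.weight"))?.clone(),
        absmax: ckpt.get(&format!("{name}.absmax"))?.clone(),
    })
}
```

Every layer type would need this pair of definitions and loaders, and the user has to pick the right one for the checkpoint they have.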
So this leads us to a pretty strict requirement: 1 tensor for quantised weights.
Now we have the problem of:
How do we bind this in the GPU shader?
Not to worry, WebGPU supports struct bindings! But now we run into a problem:
This is how WebGPU lays out the struct: its size is padded up to a multiple of the alignment of its largest member. Each BlockQ8 struct therefore takes 32 bytes, which is exactly what 16 float 16 values would take (16 × 2 bytes), so we use the same amount of GPU memory as if we had just used float 16 in the first place. Yikes.
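The arithmetic can be checked with a small sketch of the WGSL struct layout rule (each member is placed at the next multiple of its alignment, and the struct size is rounded up to the largest member alignment). The struct definition itself is not reproduced here, so the member list below, a `vec4<u32>` followed by an `f32`, is an assumption chosen because it is consistent with the 32-byte figure:

```rust
/// Round `n` up to the next multiple of `align`.
const fn round_up(align: usize, n: usize) -> usize {
    (n + align - 1) / align * align
}

/// Size in bytes of a struct whose members are given in order as
/// (size, alignment) pairs, following the WGSL layout rule.
fn struct_size(members: &[(usize, usize)]) -> usize {
    let mut end = 0; // first byte past the previous member
    let mut max_align = 1;
    for &(size, align) in members {
        end = round_up(align, end) + size;
        max_align = max_align.max(align);
    }
    // The whole struct is padded to a multiple of its largest alignment.
    round_up(max_align, end)
}

/// Assumed members of the block: vec4<u32> (16 bytes, align 16)
/// followed by f32 (4 bytes, align 4).
fn assumed_block_size() -> usize {
    struct_size(&[(16, 16), (4, 4)])
}
```

Under that assumption the 20 bytes of payload are padded to 32, so 12 bytes per block are wasted and the saving over float 16 disappears.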
Given that we can't use the above, we are going to need some kind of translation layer between GGUF models and the GPU. We will have to parse these structs and convert them into whichever option we choose.
Options
In the end, I think this leaves us at Option 1.
Unfortunately, it has the following cons:
Option 3 was untenable, because we don't get any GPU memory savings.
Multiple buffers backing each tensor seems to break an important invariant.
Therefore Option 1 is optimal.