Reasons for Ratchet Quantized structure #143

FL33TW00D · 2024-03-27T16:25:50Z


Avoiding 2 model implementations

When we want to support both full-precision and quantised models, we can end up in a sticky situation: there are 2 implementations of each model to maintain, and the user must be aware of both.

Why is this?

Many quantisation schemes look something like the following:

pub struct BlockQ8 {
    pub(crate) abs: f32,
    pub(crate) qs: [u32; 4],
}

This is a quantized block: 16 f32 values are converted to i8 and packed into 4 u32 (4 i8 values per u32).
The abs (absmax) is used to scale the values back into the range of an f32.
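
To make the scheme concrete, here is a minimal sketch of absmax quantisation for a single block. The helper names `quantize_block` and `dequantize_block` are hypothetical, and the scaling to [-127, 127] and the little-endian byte order inside each u32 are assumptions; real formats may differ on both.

```rust
/// Mirrors the BlockQ8 struct above: one absmax scale plus 16 i8 values packed into 4 u32.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BlockQ8 {
    pub abs: f32,
    pub qs: [u32; 4],
}

/// Quantise 16 f32 values into one block (hypothetical helper).
pub fn quantize_block(values: &[f32; 16]) -> BlockQ8 {
    // absmax: the largest magnitude in the block.
    let abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Assumption: map [-abs, abs] onto [-127, 127]; an all-zero block quantises to zeros.
    let inv = if abs == 0.0 { 0.0 } else { 127.0 / abs };
    let mut qs = [0u32; 4];
    for (i, v) in values.iter().enumerate() {
        let q = (*v * inv).round() as i8;
        // Assumption: value i lives in byte (i % 4) of word (i / 4), lowest byte first.
        qs[i / 4] |= ((q as u8) as u32) << (8 * (i % 4));
    }
    BlockQ8 { abs, qs }
}

/// Recover approximate f32 values from a block (hypothetical helper).
pub fn dequantize_block(block: &BlockQ8) -> [f32; 16] {
    let scale = block.abs / 127.0;
    let mut out = [0.0f32; 16];
    for (i, o) in out.iter_mut().enumerate() {
        let byte = ((block.qs[i / 4] >> (8 * (i % 4))) & 0xFF) as u8;
        // Reinterpret the byte as a signed value before scaling.
        *o = (byte as i8) as f32 * scale;
    }
    out
}
```

Each block stores 20 bytes for 16 weights, and the round trip loses at most half a quantisation step (abs / 254) per value.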

This introduces a problem for us: how do we represent this data in a Tensor?
We could just use 2 tensors, one for the weights and one for the absmax.
This would work! But see the problem this introduces in the example below:

use std::collections::HashMap;
use std::io::{BufRead, Seek};

// Example stem from the Whisper model (both QDecoderStem and DecoderStem are required)
pub struct QDecoderStem {
    pub token_embed: (GPUTensor, GPUTensor),
    pub pos_embed: GPUTensor,
}

impl Stem<'_> for QDecoderStem {
    fn load<R: BufRead + Seek>(
        manager: &GPUResourceManager,
        reader: &mut R,
        tensor_map: &HashMap<String, TensorHeader>,
    ) -> Result<Self, WhisperError>
    where
        Self: Sized,
    {
        let mut load = |name: &str| {
            let key = format!("decoder.{}", name);
            GPUTensor::from_disk(manager, reader, tensor_map.get(&key).unwrap())
        };
        Ok(Self {
            token_embed: (
                load("token_embedding.weight")?,
                load("token_embedding.weight.absmax")?,
            ),
            pos_embed: load("positional_embedding")?,
        })
    }
}

Because the weight and absmax live in 2 separate tensors, the quantised model needs its own struct and its own loader, so we need 2 model loaders!

So this leads us to a pretty strict requirement: 1 tensor for quantised weights.

Now we have the problem of:
How do we bind this in the GPU shader?

Not to worry, WebGPU supports struct bindings!

struct BlockQ8 {
    q: vec4<u32>,
    absmax: f32,
}

@group(0) @binding(0) var<storage, read> A: array<vec4<f32>>;
@group(0) @binding(1) var<storage, read> B: array<BlockQ8>;
@group(0) @binding(2) var<storage, read_write> result: array<vec4<f32>>;

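But now we run into a problem with the way WebGPU lays out the struct. A struct's size is rounded up to a multiple of its alignment, and its alignment is that of its most strictly aligned member (16 bytes for vec4<u32>). The 20 bytes of data are therefore padded out, so each BlockQ8 struct is 32 bytes in size. That is 2 bytes per weight, so we use the same amount of GPU memory as if we had just used float 16 in the first place, yikes.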
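
The arithmetic can be checked on the CPU side. This is only a sketch: `BlockQ8Wgsl` is a hypothetical Rust stand-in that uses `align(16)` to imitate the alignment of `vec4<u32>`. It is not how the shader struct is actually declared, and the helper names are assumptions too.

```rust
use std::mem::size_of;

/// Tightly laid-out block, as in the Rust struct earlier: 4 + 16 = 20 bytes.
#[repr(C)]
pub struct BlockQ8Packed {
    pub abs: f32,
    pub qs: [u32; 4],
}

/// Stand-in for the WGSL struct: align(16) imitates the alignment of vec4<u32>.
#[repr(C, align(16))]
pub struct BlockQ8Wgsl {
    pub q: [u32; 4],
    pub absmax: f32,
}

/// Round `size` up to the next multiple of `align`.
pub const fn round_up(size: usize, align: usize) -> usize {
    (size + align - 1) / align * align
}

/// Size in bytes of the tightly laid-out block.
pub fn packed_block_size() -> usize {
    size_of::<BlockQ8Packed>()
}

/// Size in bytes of the 16-byte-aligned block.
pub fn wgsl_block_size() -> usize {
    size_of::<BlockQ8Wgsl>()
}

/// Bytes of storage spent per quantised weight (16 weights per block).
pub fn bytes_per_weight(block_size: usize) -> usize {
    block_size / 16
}
```

With these definitions the packed block is 20 bytes, while the aligned one is `round_up(20, 16)` = 32 bytes, which works out to 2 bytes per weight.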

Given that we can't use the above: we are going to need some kind of translation layer between GGUF models and the GPU. We will have to parse these structs and convert them into whichever option we choose.

Options

In the end, I think this leaves us at Option 1.
Unfortunately, it has the following cons:

We will need to parse a GGUF quantized tensor from the struct format into the packed format.
Handling the binding of the different segments of the buffer will be slightly tricky.
Each of the different segments will now need to be padded to a multiple of 256 bytes.
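
A rough sketch of what that translation step could look like. It assumes the packed format is a single buffer holding all the quantised weights first and all the absmax values second, with each segment padded to 256 bytes (the default WebGPU `minStorageBufferOffsetAlignment`). `PackedQ8`, `repack` and `pad_to` are hypothetical names, not Ratchet's API.

```rust
/// Assumption: segment offsets and sizes must be multiples of 256 bytes.
pub const SEGMENT_ALIGN: usize = 256;

/// Struct-format block, as parsed from disk.
pub struct BlockQ8 {
    pub abs: f32,
    pub qs: [u32; 4],
}

/// One buffer, two segments: quantised weights first, absmax values second.
pub struct PackedQ8 {
    pub bytes: Vec<u8>,
    /// Byte offset at which the absmax segment starts.
    pub absmax_offset: usize,
}

/// Round `len` up to the next multiple of `align`.
pub fn pad_to(len: usize, align: usize) -> usize {
    (len + align - 1) / align * align
}

/// Convert struct-format blocks into the packed, segmented layout.
pub fn repack(blocks: &[BlockQ8]) -> PackedQ8 {
    let weights_len = blocks.len() * 16; // 4 u32 per block
    let absmax_offset = pad_to(weights_len, SEGMENT_ALIGN);
    // Pad the tail as well, so both segments span a multiple of 256 bytes.
    let total = pad_to(absmax_offset + blocks.len() * 4, SEGMENT_ALIGN);
    let mut bytes = vec![0u8; total];
    for (i, block) in blocks.iter().enumerate() {
        for (j, word) in block.qs.iter().enumerate() {
            let at = i * 16 + j * 4;
            bytes[at..at + 4].copy_from_slice(&word.to_le_bytes());
        }
        let at = absmax_offset + i * 4;
        bytes[at..at + 4].copy_from_slice(&block.abs.to_le_bytes());
    }
    PackedQ8 { bytes, absmax_offset }
}
```

The two segments could then be bound separately, at offset 0 and at `absmax_offset`. The padding costs at most 255 bytes per segment, which is negligible next to 12 wasted bytes per block.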

Option 3 was untenable, because we don't get any GPU memory savings.
Multiple buffers backing each tensor seems to break an important invariant.
Therefore Option 1 is optimal.
