
4 bit quantization support #260

@bil-ash

I would like to use this library for in-browser ML inference because, with the upcoming CPU support, it will be a better fit than:

  1. ggml (llama.cpp/whisper.cpp) - ratchet supports both CPU and GPU, so it can use the GPU on devices where WebGPU is available and deliver better performance
  2. web-llm (which is WebGPU only) - ratchet will have a CPU backend, allowing inference on devices where WebGPU is not supported (many Android browsers)
  3. onnx - ratchet is lighter than onnx

However, all three of them support 4-bit quantization, whereas ratchet (apparently) only supports 8-bit quantization. 4-bit quantization is essential: without it, whisper-v3-turbo and llama-3.2-1b cannot run in a browser with limited RAM. Please add 4-bit quantization support soon.
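
For a rough sense of the RAM difference, here is a back-of-the-envelope sketch of weight storage at different bit widths. The parameter counts are approximate figures I am assuming (about 0.8B for whisper-v3-turbo and about 1.2B for llama-3.2-1b), `weight_bytes` is a hypothetical helper rather than anything in ratchet, and the estimate covers weights only (no activations, KV cache, or per-block scales, which push real 4-bit formats slightly above 4 bits per weight).

```python
GIB = 1024 ** 3


def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store `num_params` weights at the given bit width."""
    return num_params * bits_per_weight / 8


# Approximate parameter counts (assumptions, not measured values).
MODELS = {
    "whisper-v3-turbo": 809_000_000,
    "llama-3.2-1b": 1_240_000_000,
}

if __name__ == "__main__":
    for name, num_params in MODELS.items():
        for bits in (16, 8, 4):
            size_gib = weight_bytes(num_params, bits) / GIB
            print(f"{name}: {bits}-bit weights ~ {size_gib:.2f} GiB")
```

Going from 8-bit to 4-bit halves the weight footprint, which is the margin that matters on memory-constrained mobile browsers.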
