4-bit quantization support

I would like to use this library for in-browser ML inference because, with the upcoming CPU support, it compares favorably with the alternatives:
1. ggml (llama.cpp/whisper.cpp): Ratchet supports both CPU and GPU, so it can use the GPU on devices where WebGPU is available and deliver better performance.
2. web-llm (which is WebGPU-only): Ratchet will have a CPU backend, allowing inference on devices where WebGPU is not supported (many Android browsers).
3. ONNX: Ratchet is lighter.

However, all three of them support 4-bit quantization, whereas Ratchet (apparently) supports only 8-bit quantization. 4-bit quantization is essential: without it, it is impossible to run [whisper-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) and [llama-3.2-1b](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) in a browser with limited RAM. So please add 4-bit quantization support soon.
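To put rough numbers on the RAM concern, here is a back-of-the-envelope sketch of weight memory at different bit widths. The parameter counts are approximate figures I am assuming from the model cards (please verify them), and the helper names are made up for illustration. The estimate covers weights only and ignores activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate at different quantization bit widths.
# ASSUMPTION: parameter counts below are approximate and for illustration only.
APPROX_PARAMS = {
    "whisper-large-v3-turbo": 809e6,
    "llama-3.2-1b": 1.23e9,
}


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone at the given bit width."""
    return num_params * bits_per_weight / 8


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


for name, params in APPROX_PARAMS.items():
    for bits in (16, 8, 4):
        size = to_gib(weight_bytes(params, bits))
        print(f"{name}: {bits}-bit weights ~ {size:.2f} GiB")
```

Going from 8-bit to 4-bit halves the weight footprint, which is what matters on memory-constrained browsers. Real 4-bit formats usually store per-block scales as well, so the effective size is somewhat above the 4-bit figure.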


4 bit quantization support #260
