Overall performance limits #196

Hi, 
First and foremost, I want to express my gratitude for the idea behind this project and the way you keep up with the development. 

Three days ago I opened a small issue regarding the [PHI-3 demo here](https://huggingface.co/FL33TW00D-HF/phi3/discussions/1). That issue most probably has to do with the "stop words" config, but it prompted me to test the same model's performance in the browser and as a standalone solution using LMStudio (llama.cpp).

The results deserve a discussion. Tests were conducted on an M1 Pro with 16 GB of RAM. The model is [FL33TW00D-HF/phi3/phi3-mini-4k_q8_0.gguf](https://huggingface.co/FL33TW00D-HF/phi3/tree/main), except for the [Xenova](https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu) demo, which was measured with the smaller PHI-3_4q model. A small ~200-token prompt was given as input and the output speed was measured. The table contains average values. This is not a precise benchmark by any means, but the results differ so much that precise benchmarking does not seem necessary yet.
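For anyone who wants to reproduce numbers like these, here is a minimal sketch of how time to first token and generation speed can be measured. It assumes the runtime exposes its output as an iterable of tokens; `measure_stream` and `fake_model` are hypothetical names for illustration, not part of LMStudio, llama.cpp, or Ratchet, and this is not necessarily how the figures in this issue were collected.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report time to first token and decode speed."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            # Everything before this point is prompt processing / startup.
            first_token_at = now
        count += 1
    end = clock()

    if first_token_at is None:
        raise ValueError("the stream produced no tokens")

    decode_time = end - first_token_at
    # Decode speed counts the tokens produced after the first one,
    # divided by the time elapsed since the first token arrived.
    if count > 1 and decode_time > 0:
        tokens_per_s = (count - 1) / decode_time
    else:
        tokens_per_s = float("nan")

    return {
        "tokens": count,
        "time_to_first_token_s": first_token_at - start,
        "tokens_per_s": tokens_per_s,
    }


def fake_model(n_tokens: int, startup_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a real model: sleeps instead of computing."""
    time.sleep(startup_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_model(n_tokens=20, startup_s=0.05, per_token_s=0.01)))
```

Averaging a few runs of such a measurement per backend would give figures comparable to the table below, with the caveat that browser and native runtimes may count tokens differently.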

| Metric              | **LMStudio GPU** | **LMStudio CPU** | **Ratchet Chrome WebGPU** | **Xenova PHI-3_4q** (model about half the size) |
|---------------------|------------------|------------------|---------------------------|--------------------------------------|
| RAM usage           | ~5.5 GB          | ~5.5 GB          | ~7.7 GB                   | 5.3 GB (GPU) + 4.2 GB (Renderer) = ~10 GB |
| CPU usage           | ~10%             | ~800%            | ~20%                      | ~15%                                 |
| GPU usage           | ~95%             | ~5%              | ~95%                      | ~90%                                 |
| Time to first token | 0.5 s            | 1.5 s            | ~12 s                     | ~5 s                                 |
| Generation speed    | ~32 tok/s        | ~16 tok/s        | ~6 tok/s                  | ~12 tok/s                            |
| Generation time     | 5 s              | 10 s             |                           |                                      |
| GPU layers          | 33               | 0                |                           |                                      |
| CPU threads         | 4                | 4                |                           |                                      |
| mlock               | true             | true             |                           |                                      |

It seems that, compared to Ratchet on WebGPU (~6 tok/s), llama.cpp is roughly 2.7x faster with CPU-only inference (~16 tok/s) and roughly 5.3x faster with GPU inference (~32 tok/s).

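
The ratios can be recomputed from the approximate speeds in the table. The snippet below is just that arithmetic; the `speedup_over` helper is a hypothetical name, and the inputs are the rounded "~" values from the table, so the results are only as precise as those.

```python
# Approximate generation speeds (tokens per second), copied from the table.
speeds = {
    "LMStudio GPU": 32.0,
    "LMStudio CPU": 16.0,
    "Ratchet Chrome WebGPU": 6.0,
    "Xenova PHI-3_4q": 12.0,
}


def speedup_over(baseline: str, speeds: dict[str, float]) -> dict[str, float]:
    """Return how many times faster each entry is than the baseline entry."""
    base = speeds[baseline]
    return {
        name: round(value / base, 2)
        for name, value in speeds.items()
        if name != baseline
    }


ratios = speedup_over("Ratchet Chrome WebGPU", speeds)
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the Ratchet WebGPU speed")
```

Note that the Xenova column uses a smaller model, so its ratio is not a like-for-like comparison.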
**Questions**:

1. What is the performance limit of WebGPU implementations compared to "native" solutions like llama.cpp?
2. llama.cpp seems to be less memory hungry, peaking at ~5.5 GB during inference and dropping back to 3.73 GB while idle. Is there room for improvement on the WebGPU side?
3. The most surprising part is that llama.cpp CPU inference reaches ~16 tok/s. Does this mean a WASM-only implementation could theoretically achieve the same speed?

My apologies in advance if some of these questions were answered before or make little sense.  

