Hi,
First and foremost, I want to express my gratitude for the idea behind this project and the way you keep up with the development.
Three days ago I opened a small issue regarding the PHI-3 demo here. While that issue most probably has to do with the "stop words" config, I decided to test the same model's performance in the browser and as a standalone solution using LMStudio (llama.cpp).
The results deserve a discussion. Tests were conducted on an M1 Pro with 16 GB of RAM. The model is FL33TW00D-HF/phi3/phi3-mini-4k_q8_0.gguf, except for Xenova, whose results were measured on the smaller PHI-3_4q model. A small prompt of ~200 tokens was given as input and output speeds were measured. The table contains average values. This is not a precise benchmark by any means, but the results differ so much that precise benchmarking does not seem necessary yet.
| | LMStudio GPU | LMStudio CPU | Ratchet Chrome WebGPU | Xenova PHI-3_4q (half-size model) |
| --- | --- | --- | --- | --- |
| RAM usage | ~5.5 GB | ~5.5 GB | ~7.7 GB | 5.3 GB (GPU) + 4.2 GB (renderer) = ~10 GB |
| CPU usage | ~10% | ~800% | ~20% | ~15% |
| GPU usage | ~95% | ~5% | ~95% | ~90% |
| Time to first token | 0.5 s | 1.5 s | ~12 s | ~5 s |
| Speed | ~32 tok/s | ~16 tok/s | ~6 tok/s | ~12 tok/s |
| Generation time | 5 s | 10 s | | |
| GPU layers | 33 | 0 | | |
| CPU threads | 4 | 4 | | |
| mlock | true | true | | |
It seems that llama.cpp is roughly 3x faster with CPU-only inference and roughly 5.5x faster with GPU inference compared to Ratchet on WebGPU.
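For anyone who wants to redo the arithmetic, here is a minimal sketch that derives the speedups from the rounded average speeds in the table. The dictionary keys and the `speedup` helper are just illustrative names, and since the inputs are approximate averages the ratios are only indicative. With these rounded inputs, the CPU and GPU ratios come out slightly below the rough figures quoted above.

```python
# Average output speeds in tok/s, copied from the table (rounded values).
speeds = {
    "LMStudio GPU": 32.0,
    "LMStudio CPU": 16.0,
    "Ratchet Chrome WebGPU": 6.0,
    "Xenova PHI-3_4q": 12.0,  # smaller model, so not directly comparable
}


def speedup(baseline: str, other: str) -> float:
    """Return how many times faster `other` is than `baseline`."""
    return speeds[other] / speeds[baseline]


if __name__ == "__main__":
    baseline = "Ratchet Chrome WebGPU"
    for name in speeds:
        if name != baseline:
            print(f"{name}: {speedup(baseline, name):.2f}x vs {baseline}")
```

The Xenova row is included only for completeness, since it uses a different quantization than the other three columns.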
Questions:
- What is the performance limit of WebGPU implementations compared to "native" solutions like llama.cpp?
- llama.cpp seems to be less memory hungry, peaking at 5.5 GB during inference and dropping back to 3.73 GB while idle. Is there room for improvement on the WebGPU side?
- The most surprising part is that llama.cpp reaches 16 tok/s with CPU-only inference. Does that mean a WASM-only implementation could theoretically achieve the same speed?
My apologies in advance if some of these questions were answered before or make little sense.