Overall performance limits #196

@iSuslov

Hi,
First and foremost, I want to express my gratitude for the idea behind this project and the way you keep up with the development.

Three days ago I opened a small issue about the PHI-3 demo here. That issue most probably has to do with the "stop words" config, but it prompted me to run tests comparing the same model's performance in the browser and as a standalone solution using LMStudio (llama.cpp).

The results deserve a discussion. Tests were conducted on an M1 Pro with 16 GB of RAM. The model is FL33TW00D-HF/phi3/phi3-mini-4k_q8_0.gguf, except for Xenova, where the smaller PHI-3_4q model was measured. A small ~200-token prompt was given as input and output speeds were measured; the table contains average values. This is not a precise benchmark by any means, but the results differ so much that precise benchmarking is not needed yet.
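For anyone who wants to repeat the measurement, here is a minimal sketch of how time to first token and generation speed could be derived from a token stream. It is not the procedure used for the numbers in this issue; `measure_stream` and its injectable `clock` parameter are hypothetical names, and the stream is assumed to be any Python iterable that yields one item per generated token.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, tokens_per_second, token_count).

    Hypothetical helper: `tokens` is any iterable yielding generated tokens.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # timestamp of the first generated token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")

    ttft = first - start
    # Generation speed excludes prompt processing: count only the tokens
    # that arrived after the first one, over the time since the first one.
    decode_time = last - first
    speed = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, speed, count
```

Averaging the returned values over several runs of the same prompt would give figures comparable in kind to the ones reported here.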

| | LMStudio GPU | LMStudio CPU | Ratchet Chrome WebGPU | Xenova PHI-3_4q (model half the size) |
|---|---|---|---|---|
| RAM usage | ~5.5 GB | ~5.5 GB | ~7.7 GB | 5.3 (GPU) + 4.2 (Renderer) = ~10 GB |
| CPU usage | ~10% | ~800% | ~20% | ~15% |
| GPU usage | ~95% | ~5% | ~95% | ~90% |
| Time to first token | 0.5s | 1.5s | ~12s | ~5s |
| Speed | ~32 tok/s | ~16 tok/s | ~6 tok/s | ~12 tok/s |
| gen t | 5s | 10s | | |
| GPU layers | 33 | 0 | | |
| CPU threads | 4 | 4 | | |
| mlock | true | true | | |

Compared to Ratchet on WebGPU (~6 tok/s), llama.cpp appears to be about 2.7x faster with CPU-only inference (~16 tok/s) and about 5.3x faster on the GPU (~32 tok/s).
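The ratios can be recomputed from the averaged speeds in the table. The snippet below is only a sketch of that arithmetic; the `speeds` dictionary and the `speedup` helper are illustrative names, and the inputs are the rounded table values, so the results are approximate.

```python
# Average generation speeds in tok/s, copied from the table in this issue.
speeds = {
    "LMStudio GPU": 32.0,
    "LMStudio CPU": 16.0,
    "Ratchet Chrome WebGPU": 6.0,
    "Xenova PHI-3_4q": 12.0,  # smaller model, so not a like-for-like comparison
}


def speedup(baseline: str, other: str) -> float:
    """How many times faster `other` is than `baseline`."""
    return speeds[other] / speeds[baseline]


if __name__ == "__main__":
    baseline = "Ratchet Chrome WebGPU"
    for name in speeds:
        if name != baseline:
            print(f"{name}: {speedup(baseline, name):.2f}x vs {baseline}")
```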
Questions:

  1. What is the performance limit of WebGPU implementations compared to "native" solutions like llama.cpp?
  2. llama.cpp seems to be less memory hungry, peaking at ~5.5 GB during inference and dropping back to 3.73 GB while idle. Is there room for improvement?
  3. The most surprising part is that llama.cpp reaches 16 tok/s with CPU-only inference. Does this mean a WASM-only implementation could theoretically achieve the same speed?

My apologies in advance if some of these questions were answered before or make little sense.
