Enable compiler optimizations to improve inference speed by imatrisciano · Pull Request #4 · andriydruk/LMPlayground

imatrisciano · 2024-11-08T18:36:13Z

This PR introduces a couple of simple compiler flags that can greatly improve inference speed.

As described in section 5.1.2 of the paper by Jie Xiao, Qianyi Huang, Xu Chen, and Chen Tian, "Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation" (arXiv:2410.03613), the flag i8mm has been added to the architecture description for arm64-v8a processors. This flag supposedly enables the generation of machine instructions optimised for int8 matrix multiplication.
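To check whether a given build actually picked up the i8mm flag, a small compile-time probe can help. The sketch below is not part of this PR. It assumes the toolchain follows the Arm C Language Extensions and defines the feature macro `__ARM_FEATURE_MATMUL_INT8` when the int8 matrix-multiply extension is enabled. The helper names `built_with_i8mm` and `i8mm_status` are hypothetical.

```cpp
#include <string>

// Hypothetical helper: reports whether this translation unit was compiled
// with the int8 matrix-multiply extension enabled. It relies on the ACLE
// feature macro __ARM_FEATURE_MATMUL_INT8, which compilers define when the
// target description includes i8mm.
constexpr bool built_with_i8mm() {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: a human-readable status line, e.g. for logging at
// startup of the native library.
std::string i8mm_status() {
    return built_with_i8mm() ? "i8mm: enabled at compile time"
                             : "i8mm: not enabled at compile time";
}
```

This only reflects what the compiler was told about the target. It does not detect whether the CPU the app runs on supports the extension, so a binary built this way would still need a device that implements i8mm.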

The flag -Ofast has been specified in CMakeLists.txt to enable aggressive compiler optimisations for every architecture. Because -Ofast implies -ffast-math, which in turn enables -ffinite-math-only, the flag -fno-finite-math-only must also be specified. It disables the optimisations that assume floating-point values are never infinite or NaN.
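The following sketch illustrates why -fno-finite-math-only matters. It is not code from this PR or from llama.cpp. It assumes a GCC- or Clang-compatible compiler, which predefines `__FINITE_MATH_ONLY__` as 1 when finite-math-only is active and 0 otherwise. The function `is_masked` is a hypothetical example of code that uses negative infinity as a sentinel value.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true when the compiler was allowed to assume that
// no floating-point value is ever infinite or NaN.
constexpr bool finite_math_only() {
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
    return true;
#else
    return false;
#endif
}

// Hypothetical example of a check that depends on infinities being real
// values: a score of negative infinity marks an entry that should be
// ignored. Under finite-math-only the compiler may treat std::isinf as
// always false, so this guard may silently stop working.
bool is_masked(float score) {
    return std::isinf(score) && score < 0.0f;
}
```

With -Ofast alone, `finite_math_only()` would be expected to return true and `is_masked` may no longer recognise the sentinel. Adding -fno-finite-math-only keeps the remaining optimisations while restoring the usual behaviour of such checks.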

With these changes, I observed large performance improvements on my device (Motorola Edge 20) when using Llama3.2-1B-Q4K_M:

1568% improvement for prompt eval time, from 2.27 to 37.9 tokens/s
636% improvement for inference speed, from 1.30 to 9.58 tokens/s
86% energy usage reduction, from 364 to 50 μAh per generated token

ScottArbeit · 2024-11-13T05:56:06Z

Just because I was curious about the compiler flag changes, I asked GPT-4o for more details, which you can find at https://chatgpt.com/share/67343ef3-8ae4-8003-9c41-82ffa7cf7f5a.

Thanks for working on LM Playground! ❤️

When resetMessages() clears the message list while a generation callback is still delivering tokens, updateLastMessage() and markThinkingStarted() call _messages.last() on an empty list, throwing NoSuchElementException. This was the #4 crash on Google Play for v1.4.0 with 2 reports across 2 users. Added early return guards when _messages is empty in both updateLastMessage() and markThinkingStarted().

imatrisciano added 3 commits November 6, 2024 19:19

Upgraded gradle

920fc47

Added 'i8mm' compiler flag

e07169f

Enable compiler optimizations for llama.cpp

08ca811

andriydruk self-requested a review November 12, 2024 13:10

imatrisciano wants to merge 3 commits into andriydruk:main from imatrisciano:performance