Enable compiler optimizations to improve inference speed #4

Open
imatrisciano wants to merge 3 commits into andriydruk:main from imatrisciano:performance
@imatrisciano

This PR introduces a couple of simple compiler flag changes that can greatly improve inference speed.

As described in section 5.1.2 of the paper by Jie Xiao, Qianyi Huang, Xu Chen, and Chen Tian, "Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation" (arXiv:2410.03613), the i8mm flag has been added to the architecture description for arm64-v8a processors. This flag is supposed to enable the generation of machine instructions optimised for int8 math.

The flag -Ofast has been specified in CMakeLists.txt to enable aggressive compiler optimisations for every architecture. Because -Ofast implies -ffast-math, this change also requires the flag -fno-finite-math-only, which disables the optimisations that assume floating-point math never produces infinities or NaNs.
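
For readers who want to see roughly what these two changes look like in a build script, here is a minimal sketch. It is not the actual diff from this PR: the base architecture level (`armv8.2-a`), the `+dotprod` feature, and the use of `add_compile_options` are assumptions chosen for illustration, and the real project may pass these flags differently (for example through Gradle arguments). `ANDROID_ABI` is the variable set by the Android NDK CMake toolchain.

```cmake
# Illustrative sketch only; the base -march value and +dotprod are assumptions,
# not taken from this PR.
if(ANDROID_ABI STREQUAL "arm64-v8a")
    # +i8mm lets the compiler emit the 8-bit integer matrix-multiply instructions.
    # Devices whose CPU lacks this extension may crash on such instructions.
    add_compile_options(-march=armv8.2-a+dotprod+i8mm)
endif()

# -Ofast turns on -O3 plus fast-math optimisations for every architecture.
# -fno-finite-math-only switches back on correct handling of infinities and NaNs.
add_compile_options(-Ofast -fno-finite-math-only)
```

The order matters in the last line: -fno-finite-math-only has to come after -Ofast so that it overrides the finite-math assumption that -Ofast enables.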

With these changes, I observed large performance improvements on my device (Motorola Edge 20) when running Llama3.2-1B-Q4K_M:

  • 1568% improvement for prompt eval time, from 2.27 to 37.9 tokens/s
  • 636% improvement for inference speed, from 1.30 to 9.58 tokens/s
  • 86% energy usage reduction, from 364 to 50 μAh per generated token

@andriydruk andriydruk self-requested a review November 12, 2024 13:10
@ScottArbeit

Just because I was curious about the compiler flag changes, I asked GPT-4o for more details, which you can find at https://chatgpt.com/share/67343ef3-8ae4-8003-9c41-82ffa7cf7f5a.

Thanks for working on LM Playground! ❤️

andriydruk added a commit that referenced this pull request Mar 23, 2026
When resetMessages() clears the message list while a generation callback
is still delivering tokens, updateLastMessage() and markThinkingStarted()
call _messages.last() on an empty list, throwing NoSuchElementException.

This was the #4 crash on Google Play for v1.4.0 with 2 reports across
2 users.

Added early return guards when _messages is empty in both
updateLastMessage() and markThinkingStarted().