Advanced Quantization: A Practical Course

This course provides a hands-on introduction to extreme model quantization, hardware-aware optimization, and on-device deployment for generative AI models. You'll explore advanced techniques to reduce model size, accelerate inference, and deploy compact LLMs on edge devices like Android smartphones.

This course is Part 1 of a two‑part Advanced AI series. For Part 2, focusing on Mixture of Experts (MoE) models, see the companion material: Mixture of Experts: A Practical Course.

Intended Audience

This course is designed for industry ML engineers who want practical, ready-to-adapt examples for integrating advanced optimization techniques into their own applications. It is also well-suited for ML researchers interested in exploring and experimenting with cutting‑edge methods for model compression and performance optimization.

Learners should be comfortable with the following:

  • Deep learning architectures — transformers, attention mechanisms, and feed-forward networks
  • Training and optimization — gradient descent variants, regularization, and overfitting control
  • Practical skills — Python programming experience and intermediate understanding of PyTorch

Those who want to build or refresh these fundamentals can first complete introductory and intermediate material such as the Intro to AI or Optimizing Generative AI on Arm courses.

Labs Overview

Lab 1: Extreme Quantization

Train a language model and progressively quantize it from FP32 to 8-bit, 4-bit, 2-bit, and 1-bit precision. Implement and evaluate Quantization-Aware Training (QAT) to mitigate accuracy degradation in ultra-low-bit models.
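To give a sense of the core trick, the sketch below shows symmetric fake quantization with a straight-through estimator, the mechanism that lets QAT train through the non-differentiable rounding step. It is illustrative only, not the lab's actual API:

import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor fake quantization (illustrative sketch).
    if bits == 1:
        q = w.abs().mean() * w.sign()           # binary: sign times mean magnitude
    else:
        qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit
        scale = w.abs().max().clamp(min=1e-8) / qmax
        q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Straight-through estimator: forward sees q, backward treats it as identity.
    return w + (q - w).detach()

# Example: a 4-bit QAT-style forward/backward pass on a random weight tensor.
w = torch.randn(64, 64, requires_grad=True)
fake_quantize(w, bits=4).pow(2).mean().backward()   # gradients reach w via the STE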

Lab 2: Hardware–Software Model Co-Design

Wrap all nn.Linear layers with a custom QLinear module and explore layerwise post-training quantization. Search for the optimal bit-width configuration to maximize efficiency while maintaining model fidelity in a hardware–software co-design process.
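A rough sketch of the wrapping idea, assuming simple symmetric rounding (the lab's QLinear is more complete):

import torch.nn as nn
import torch.nn.functional as F

class QLinear(nn.Module):
    # Illustrative wrapper: quantize an existing nn.Linear's weights once,
    # post-training, to a chosen per-layer bit-width.
    def __init__(self, linear: nn.Linear, bits: int = 8):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        scale = linear.weight.abs().max().clamp(min=1e-8) / qmax
        w_q = ((linear.weight / scale).round().clamp(-qmax, qmax) * scale).detach()
        self.register_buffer("weight", w_q)
        self.bias = linear.bias

    def forward(self, x):
        return F.linear(x, self.weight, self.bias)

def wrap_linears(model: nn.Module, bits: int) -> nn.Module:
    # Recursively replace every nn.Linear; a layerwise search would vary
    # `bits` per layer and keep the cheapest setting that preserves accuracy.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, QLinear(child, bits))
        else:
            wrap_linears(child, bits)
    return model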

Lab 3: Running & Quantizing Models on Android

Use llama.cpp to quantize and deploy LLaMA-style LLMs on Android. Learn how to benchmark and run models offline, directly on your mobile hardware.
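The lab itself runs llama.cpp's native tools on the device. As an optional desktop sanity check before deploying, the third-party llama-cpp-python bindings can load the same quantized GGUF file; the model path below is a placeholder:

from llama_cpp import Llama   # pip install llama-cpp-python (optional, not required by the lab)

llm = Llama(
    model_path="models/tinyllama-q4_0.gguf",  # placeholder: any quantized GGUF file
    n_ctx=2048,                               # context window
)
out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])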


Getting Started

Requirements

This course runs cross‑platform and has been validated on an NVIDIA DGX Spark; for training the models, we recommend an accelerated backend such as a GPU. You are expected to be comfortable configuring software on your operating system, and you may need to install additional packages depending on your environment.

This repository uses a unified requirements.txt and Git LFS to manage dependencies and large pretrained models.

1. Clone the Repository and Download Model Weights

# Install Git LFS if needed
sudo apt install git-lfs              # or: brew install git-lfs
git lfs install

# Clone the repo and pull large files
git clone https://github.com/arm-university/Advanced-AI-Hardware-Software-Co-Design.git
cd Advanced-AI-Hardware-Software-Co-Design
git lfs pull

2. Set Up the Python Environment

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Run the Labs

jupyter lab

Open:

  • lab1.ipynb for Extreme Quantization
  • lab2.ipynb for Hardware–Software Co-Design
  • Follow lab3.md for Android deployment with llama.cpp

Repository Structure

.
├── assets
├── lab1.ipynb
├── lab2.ipynb
├── lab3.md
├── LICENSE.md
├── README.md
├── requirements.txt
└── src

Android Deployment Notes

To complete Lab 3, make sure the following are installed:

  • Android Studio (Hedgehog or later)
  • Android NDK + ADB
  • A physical Android 10+ device with ≥ 6 GB RAM

Windows users: use WSL 2 with Ubuntu 22.04 for full compatibility with build tools.


Learning Outcomes

  • Understand bit-width trade-offs (accuracy vs. compression; see the footprint sketch below)
  • Apply QAT to recover performance in quantized models
  • Perform per-layer hardware-aware optimization
  • Deploy and benchmark local LLMs on Android devices
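On the compression side of the first outcome, weight storage scales linearly with bit-width. A back-of-the-envelope sketch for a hypothetical 1-billion-parameter model (weights only, ignoring activations, KV cache, and runtime overhead):

PARAMS = 1_000_000_000   # hypothetical 1B-parameter model

for bits in (32, 8, 4, 2, 1):
    gib = PARAMS * bits / 8 / 2**30          # bits -> bytes -> GiB
    print(f"{bits:>2}-bit: {gib:.2f} GiB")
# Prints roughly 3.73, 0.93, 0.47, 0.23, and 0.12 GiB respectively.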

License

You are free to fork or clone this material. See LICENSE.md for the complete license.

Inclusive Language Commitment

Arm is committed to making the language we use inclusive, meaningful, and respectful. Our goal is to remove and replace non-inclusive language from our vocabulary to reflect our values and represent our global ecosystem.

Arm is working actively with our partners, standards bodies, and the wider ecosystem to adopt a consistent approach to the use of inclusive language and to eradicate and replace offensive terms. We recognise that this will take time. This course may contain references to non-inclusive language; it will be updated with newer terms as those terms are agreed and ratified with the wider community.

Contact us at education@arm.com with questions or comments about this course. You can also report non-inclusive and offensive terminology usage in Arm content at terms@arm.com.

