
Releases: thatAverageGuy/mono-quant

v1.1: Critical Bug Fixes & Major Feature Enhancements

04 Feb 10:35


v1.1 Release

This release includes critical bug fixes and major feature enhancements for mono-quant.

🎯 What's New

True INT8 Conv2d Quantization

  • Fixed: Conv2d layers now properly store INT8 weights (previously dequantized to FP32 immediately)
  • Benefit: ~4x memory reduction for Conv2d layers
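The fix amounts to keeping the weight tensor in INT8 and dequantizing on the fly. A minimal sketch of the underlying arithmetic, using symmetric per-output-channel scales (illustrative plain PyTorch, not the mono-quant internals):

```python
import torch

# A Conv2d weight has shape (out_channels, in_channels, kH, kW);
# symmetric INT8 uses one scale per output channel.
w = torch.randn(16, 3, 3, 3)
scale = w.abs().amax(dim=(1, 2, 3)) / 127.0      # one scale per output channel
q = (w / scale.view(-1, 1, 1, 1)).round().clamp(-127, 127).to(torch.int8)

ratio = w.element_size() / q.element_size()      # INT8 is 4x smaller than FP32
w_hat = q.float() * scale.view(-1, 1, 1, 1)      # dequantize at inference time
```

Storing `q` plus the per-channel scales is what yields the ~4x memory reduction; the previous behavior materialized `w_hat` immediately, losing the benefit.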

Dynamic Quantization Exclusions

  • Added: Skip sensitive layers during dynamic quantization (LayerNorm, Embeddings, etc.)
  • Parameters: modules_to_not_convert, skip_layer_types, skip_layer_names, skip_param_threshold
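Judging from the parameter names above, the exclusion check presumably combines a type test, a name test, and a size threshold. A hedged sketch of that logic in plain PyTorch (the function and its defaults are illustrative, not the actual mono-quant code):

```python
import torch.nn as nn

# Hypothetical helper mirroring the documented parameters; mono-quant's
# real implementation may differ.
def should_skip(name, module, skip_layer_types=(), skip_layer_names=(),
                skip_param_threshold=0):
    if isinstance(module, tuple(skip_layer_types)):
        return True                       # e.g. LayerNorm, Embedding
    if name in skip_layer_names:
        return True                       # e.g. "lm_head"
    n_params = sum(p.numel() for p in module.parameters(recurse=False))
    if skip_param_threshold and n_params < skip_param_threshold:
        return True                       # too small to be worth quantizing
    return False

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
skipped = [n for n, m in model.named_modules()
           if should_skip(n, m, skip_layer_types=(nn.LayerNorm,))]
# skipped contains only the LayerNorm submodule ("1")
```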

PyTorch-Native Deployment

  • Feature: Models can be saved and loaded without mono-quant installed
  • Mechanism: Auto-conversion to standard PyTorch modules before saving
  • Benefit: Zero dependency at inference time

nn.Embedding Quantization

  • Added: QuantizedEmbedding class for embedding layer quantization
  • Constraint: INT8 and FP16 only (INT4 is blocked for embeddings to protect accuracy)
  • Impact: Reduces memory for LLMs, where embeddings often account for 20-30% of parameters
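Conceptually, an INT8-quantized embedding stores an int8 table plus per-row scales, and only the gathered rows are dequantized at lookup time. A sketch of that storage scheme (not the `QuantizedEmbedding` implementation):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(100, 16)
weight = emb.weight.detach()
scale = weight.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per row
table = (weight / scale).round().clamp(-127, 127).to(torch.int8)

ids = torch.tensor([3, 7])
vectors = table[ids].float() * scale[ids]                # dequantized lookup
```

Because each row carries its own scale, the reconstruction error stays bounded per row, which is why INT8 is safe here while INT4's coarser grid is not.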

Module Reversion for Ecosystem Compatibility

  • Feature: revert_to_standard_modules() converts quantized modules back to standard PyTorch
  • Enables: ONNX export, pruning tools, model inspection utilities
  • Use Case: Export quantized models to ONNX for deployment

Custom Serialization

  • Added: _save_to_state_dict and _load_from_state_dict methods
  • Benefit: Quantized models can be properly saved and loaded with metadata
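`_save_to_state_dict` and `_load_from_state_dict` are standard `nn.Module` hooks, so overriding them is how a module carries extra metadata through a state_dict round-trip. A minimal sketch with a single scale value (mono-quant stores richer metadata such as bits and zero-points):

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Linear):
    """Toy module that persists one extra float through state_dict."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.scale = 1.0

    def _save_to_state_dict(self, destination, prefix, keep_vars):
        super()._save_to_state_dict(destination, prefix, keep_vars)
        destination[prefix + "scale"] = torch.tensor(self.scale)

    def _load_from_state_dict(self, state_dict, prefix, *args):
        # Pop our extra key first so strict loading sees no unexpected keys.
        self.scale = state_dict.pop(prefix + "scale").item()
        super()._load_from_state_dict(state_dict, prefix, *args)

m = ScaledLinear(4, 2)
m.scale = 0.5
restored = ScaledLinear(4, 2)
restored.load_state_dict(m.state_dict())
# restored.scale == 0.5
```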

πŸ› Bug Fixes

  • Fake Conv2d quantization: weights are now stored as true INT8
  • Crashes in dynamic quantization when exclusion parameters were passed
  • Broken state_dict serialization: quantization metadata is now saved and loaded correctly
  • Models could not be loaded without mono-quant installed (fixed by PyTorch-native conversion)

🔧 CI/CD Improvements

  • Fixed all linting errors (ruff)
  • Fixed test failures and expectations
  • Enabled PyPI auto-publishing on release
  • Added package installation step to CI/CD

📦 Installation

```bash
pip install mono-quant==1.1
```

📚 Documentation

Full documentation: https://thataverageguy.github.io/mono-quant

Full Changelog: v1.0...v1.1

v1.0.1: Fix safetensors dependency

03 Feb 19:45


Fixes

  • Relaxed the safetensors dependency constraint from >=0.4 to >=0.3
  • This resolves compatibility issues with the uv package manager
  • Maintains full compatibility with pip

Installation

```bash
pip install mono-quant==1.0.1
```

Or with uv:

```bash
uv pip install mono-quant==1.0.1
```

What Changed

Only the dependency constraint changed - no code changes.

Previous: safetensors>=0.4
New: safetensors>=0.3

This allows the package to work with both the safetensors 0.3.x and 0.4.x series.


Full Changelog: See v1.0.0 release notes for initial release features.

v1.0 - Mono Quant Initial Release

03 Feb 17:56


Mono Quant v1.0 - Initial Release

Ultra-lightweight, model-agnostic quantization package for PyTorch models.

🎯 What is Mono Quant?

Mono Quant is a simple, reliable model quantization package for PyTorch with minimal dependencies. Just torch, no bloat.

✨ Key Features

Core Quantization

  • ✅ INT8 quantization with per-channel scaling
  • ✅ INT4 quantization with group-wise scaling (2x compression vs INT8)
  • ✅ FP16 quantization for memory reduction
  • ✅ Dynamic quantization (no calibration data required)
  • ✅ Static quantization with calibration data
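For intuition, the group-wise scaling behind the INT4 mode can be sketched in a few lines: every group of consecutive weights shares one scale, and values live in [-7, 7]. Illustrative only; mono-quant would additionally pack two 4-bit values per byte:

```python
import torch

group_size = 64
w = torch.randn(256)
groups = w.view(-1, group_size)                 # (4 groups, 64 weights each)
scale = groups.abs().amax(dim=1, keepdim=True) / 7.0
q = (groups / scale).round().clamp(-7, 7).to(torch.int8)
w_hat = (q * scale).view(-1)                    # dequantized reconstruction
```

Smaller groups give each scale less dynamic range to cover, which is how group-wise INT4 stays usable despite having only 16 levels.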

Calibration

  • ✅ MinMaxObserver (default, fast)
  • ✅ MovingAverageMinMaxObserver (robust, EMA smoothing)
  • ✅ HistogramObserver (outlier-aware, KL divergence)
  • ✅ Calibration data from tensors or DataLoader
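The simplest observer style above, sketched: a min-max observer just tracks the running extremes across calibration batches (a moving-average variant would smooth them with an EMA instead). Illustrative class, not the mono-quant implementation:

```python
import torch

class MinMax:
    """Track the global min/max seen across calibration batches."""

    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, x: torch.Tensor):
        self.lo = min(self.lo, x.min().item())
        self.hi = max(self.hi, x.max().item())

obs = MinMax()
for batch in (torch.tensor([0.0, 1.0]), torch.tensor([-2.0, 0.5])):
    obs.observe(batch)
# obs.lo == -2.0, obs.hi == 1.0
```

The observed range then determines the quantization scale (and zero-point, for asymmetric schemes); the histogram observer instead clips outliers before choosing the range.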

User Interface

  • ✅ Unified quantize() Python API
  • ✅ QuantizationResult with .save() and .validate() methods
  • ✅ CLI with git-style subcommands (monoquant)
  • ✅ Progress bars with CI/TTY auto-detection

Serialization

  • ✅ PyTorch format (.pt/.pth) support
  • ✅ Safetensors format support
  • ✅ Metadata preservation (bits, scheme, scales, zero-points)
  • ✅ Model dequantization back to FP32

Validation

  • ✅ SQNR (signal-to-quantization-noise ratio) computation
  • ✅ Model size comparison
  • ✅ Load testing (round-trip validation)
  • ✅ Accuracy warnings for aggressive quantization
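SQNR in the usual sense is 10·log10 of signal power over quantization-noise power, reported in dB. A sketch of the metric (not necessarily mono-quant's exact code):

```python
import torch

def sqnr_db(original: torch.Tensor, quantized: torch.Tensor) -> float:
    """Signal-to-quantization-noise ratio in decibels; higher is better."""
    noise = original - quantized
    return (10 * torch.log10(original.pow(2).mean() / noise.pow(2).mean())).item()

x = torch.linspace(-1, 1, 1000)
scale = x.abs().max() / 127.0
x_q = (x / scale).round().clamp(-127, 127) * scale
# An INT8 round-trip of a uniform signal lands roughly in the 45-50 dB range
```

As a rule of thumb each bit contributes about 6 dB, so aggressive INT4 settings produce much lower SQNR, which is where the accuracy warnings come in.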

Advanced Features

  • ✅ Model-agnostic design (any PyTorch model)
  • ✅ Layer skipping for INT4 (protects sensitive layers)
  • ✅ Symmetric and asymmetric quantization schemes
  • ✅ Custom exception hierarchy with actionable suggestions

📊 Statistics

  • Requirements delivered: 30/30 (100%)
  • Integration points: 8/8 verified
  • E2E flows: 8/8 working
  • Lines of code: 5,228 Python
  • Files: 26 source files
  • Technical debt: None identified

📦 Installation

```bash
pip install mono-quant
```

🚀 Quick Start

Python API

```python
from mono_quant import quantize

# Dynamic INT8 quantization (no calibration data needed)
result = quantize(model, bits=8, dynamic=True)

# Save the quantized model
result.save("model_quantized.pt")

# Check metrics
print(f"Compression: {result.info.compression_ratio:.2f}x")
print(f"SQNR: {result.info.sqnr_db:.2f} dB")
```

CLI

```bash
# Dynamic quantization
monoquant quantize --model model.pt --bits 8 --dynamic

# With custom output
monoquant quantize --model model.pt --bits 8 --output model_quantized.pt
```

💡 Use Cases

  • CI/CD Pipelines - Automate quantization during build
  • Local Development - Test quantized models before deployment
  • Model Compression - Reduce model size by 4-8x
  • Inference Speedup - Faster inference with quantized models

🔧 Requirements

  • Python: 3.8 or higher
  • PyTorch: 2.0 or higher

Optional Dependencies

  • safetensors>=0.4 - For Safetensors format support
  • click>=8.1 - For CLI
  • tqdm>=4.66 - For progress bars

📚 Documentation

Full documentation available at: https://thataverageguy.github.io/mono-quant

  • Installation guide
  • Quick start tutorial
  • User guide (modes, calibration, INT4, layer skipping)
  • CLI reference
  • API documentation
  • Examples and tutorials

🎁 What's Included

  • Model-agnostic quantization (works with HuggingFace, local, or custom models)
  • Dynamic and static quantization modes
  • INT8, INT4, and FP16 support
  • Robust calibration with 3 observer types
  • Layer skipping to protect sensitive components
  • Serialization to PyTorch and Safetensors formats
  • Validation with SQNR metrics and accuracy warnings
  • Python API and CLI for automation

🚧 Known Limitations

  • CLI does not support loading calibration data from files (use Python API)
  • INT4 quantization requires calibration data (no dynamic INT4)
  • No quantization-aware training (QAT) - build-phase only
  • No ONNX/TFLite export (use dedicated conversion tools)

🗺️ Roadmap

v2 (Future)

  • Genetic optimization for quantization parameters
  • Experiment tracking and logging
  • Mixed precision (different bits per layer)
  • LLM.int8() style outlier detection
  • Automatic layer sensitivity analysis

📄 License

MIT License - see LICENSE for details.

🙏 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.


Full Changelog: https://thataverageguy.github.io/mono-quant/about/changelog/