HiSVD: Hierarchical SVD Compression for Large Language Models

HiSVD is a training-free, two-stage SVD-based compression framework for LLMs. It combines rank-density-aware baseline allocation (Stage 1) with cluster-level sublayer redundancy refinement (Stage 2), and then applies whitening-based SVD decomposition to produce the compressed model.

Directory Structure

HiSVD/
├── compress.py          # Main compression pipeline (Stage 1 + 2 + SVD)
├── calibrate.py         # Generate calibration data (whitening matrices)
├── evaluate.py          # Evaluate PPL for HuggingFace or compressed models
├── hisvd/               # Core algorithm modules
│   ├── calibration.py       # Calibration data collector
│   ├── rank_allocation.py   # Stage 1: rank-density-aware baseline
│   ├── clustering.py        # LESA layer clustering (reserved)
│   ├── cluster_sublayer_lwe.py  # LWE score computation
│   ├── refinement.py        # Stage 2: cluster-level LWE refinement
│   ├── whitening_svd.py     # Whitening SVD decomposition
│   └── shape_utils.py       # Weight shape inference (GQA support)
├── component/           # Model-specific SVD wrappers
│   ├── svd_llama.py
│   └── svd_mistral.py
├── utils/               # Shared utilities
│   ├── data_utils.py        # Dataset loading (C4, WikiText-2, PTB)
│   ├── model_utils.py       # Model loading helpers
│   └── eval_utils.py        # Perplexity evaluation
├── scripts/             # Reproduction shell scripts
│   ├── calibrate_all.sh
│   ├── compress_llama7b.sh
│   ├── compress_llama2_7b.sh
│   ├── compress_llama3_8b.sh
│   ├── compress_llama_30b.sh
│   ├── compress_mistral_7b.sh
│   ├── evaluate_all.sh
│   └── (optional) extra scripts: compress_llama13b.sh / compress_opt6.7b.sh
├── data/                # Local dataset files
│   └── c4-validation.json
└── requirements.txt
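hisvd/shape_utils.py infers weight shapes with GQA support. The snippet below is a minimal sketch of that idea, not the module's API. The function `sublayer_shapes` and the `LlamaLikeConfig` fields (named after the usual HuggingFace LLaMA config keys) are hypothetical. Under grouped-query attention, `k_proj` and `v_proj` project to `num_key_value_heads * head_dim` rather than `hidden_size`, so their shapes differ from `q_proj`.

```python
from dataclasses import dataclass


@dataclass
class LlamaLikeConfig:
    # Hypothetical config container; field names mirror common HF LLaMA config keys.
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    intermediate_size: int


def sublayer_shapes(cfg: LlamaLikeConfig) -> dict:
    """Return the (out_features, in_features) weight shape of each linear
    sublayer in one LLaMA-style decoder layer (hypothetical helper)."""
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    # Under GQA, K/V use fewer heads than Q, so their output dim shrinks.
    kv_dim = cfg.num_key_value_heads * head_dim
    h, f = cfg.hidden_size, cfg.intermediate_size
    return {
        "q_proj": (h, h),
        "k_proj": (kv_dim, h),
        "v_proj": (kv_dim, h),
        "o_proj": (h, h),
        "gate_proj": (f, h),
        "up_proj": (f, h),
        "down_proj": (h, f),
    }
```

For a LLaMA-3-8B-style config (hidden size 4096, 32 query heads, 8 KV heads, intermediate size 14336), `k_proj` comes out as (1024, 4096).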

Installation

pip install -r requirements.txt

Quick Start

1. Generate calibration data

Calibration needs to be run only once per model. It collects whitening matrices and weight statistics from WikiText-2 samples; the example below uses 256 sequences of 2048 tokens (set with --nsamples and --seqlen).

python calibrate.py \
  --model_path /path/to/llama-7b \
  --model_name llama-7b \
  --nsamples 256 \
  --seqlen 2048

This writes calibration_cache/{name}_n256_s2048_with_hessian.pkl, where {name} is the value passed to --model_name. The file contains whitening matrices, Hessian matrices, and weight statistics (including "Vh").
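The README does not spell out how the whitening matrices are built. A common construction in whitening-based SVD compression takes the Cholesky factor of the input Gram matrix accumulated over calibration tokens. The sketch below assumes that construction; the helper name `whitening_matrix` and the damping term `eps` are hypothetical, and the sketch says nothing about how the .pkl cache is laid out.

```python
import numpy as np


def whitening_matrix(activations, eps=1e-6):
    """Hypothetical helper. activations: (num_tokens, in_features) inputs to one
    linear sublayer. Returns a lower-triangular S with S @ S.T == X.T @ X + eps*I."""
    X = np.asarray(activations, dtype=np.float64)
    gram = X.T @ X  # (in_features, in_features), summed over calibration tokens
    # Small diagonal damping keeps the Cholesky factorization well-posed
    # when the Gram matrix is (near-)singular.
    gram += eps * np.eye(gram.shape[0])
    return np.linalg.cholesky(gram)
```

In practice the Gram matrix would be accumulated batch by batch over the calibration set rather than from one stacked activation array.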

2. Compress

python compress.py \
  --model_path /path/to/llama-7b \
  --calib_cache calibration_cache/llama-7b_n256_s2048_with_hessian.pkl \
  --target_compression 0.8 \
  --rank_method stable \
  --alpha 1.0 \
  --lwe_strength 1.0 \
  --tail_redundancy_weight 1.0 \
  --modulation logarithmic \
  --save_model compressed_models/llama7b_80p.pt

Key arguments:

| Argument | Description | Default |
|---|---|---|
| `--target_compression` | Target compression ratio, given as the fraction of parameters kept (e.g. 0.8 keeps 80%) | none |
| `--rank_method` | Stage 1 rank metric: `stable`, `effective`, or `mathematical` | `stable` |
| `--alpha` | Rank-density sensitivity (Stage 1) | 1.0 |
| `--modulation` | Stage 2 modulation function: `logarithmic`, `power`, `exponential`, or `linear` | `logarithmic` |
| `--lwe_strength` | Stage 2 modulation strength (β) | 1.0 |
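When a weight of shape out × in is replaced by two factors of shapes out × r and r × in, it stores r·(out + in) parameters instead of out·in. The sketch below assumes that storage format and applies one keep ratio uniformly to a single matrix. HiSVD's Stage 1 and Stage 2 distribute the budget non-uniformly, so treat `rank_for_keep_ratio` (a hypothetical helper) only as the per-matrix arithmetic behind `--target_compression`.

```python
def rank_for_keep_ratio(out_features: int, in_features: int, keep_ratio: float) -> int:
    """Largest rank r whose factorized form (out x r) + (r x in) stores at most
    keep_ratio of the dense out x in parameter count (hypothetical helper)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    dense = out_features * in_features
    # Each unit of rank costs one column of the left factor plus one row of the right.
    return int(keep_ratio * dense // (out_features + in_features))
```

For a 4096 × 4096 projection, a keep ratio of 0.8 allows rank 1638, and even a ratio of 1.0 only allows rank 2048: the factorization saves parameters only while r < out·in / (out + in).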

3. Evaluate

# Evaluate a compressed .pt model
python evaluate.py --model_path compressed_models/llama7b_80p.pt --datasets wikitext2,c4

# Evaluate an original HuggingFace model
python evaluate.py --model_path /path/to/llama-7b --datasets wikitext2
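evaluate.py reports perplexity (PPL), which is the exponential of the mean per-token negative log-likelihood. The minimal sketch below shows only that formula; how evaluate.py windows each dataset (sequence length, stride) is not stated here, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLLs in nats."""
    nlls = list(token_nlls)
    if not nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(nlls) / len(nlls))
```

A model that assigns probability 1/4 to every target token has a perplexity of 4.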

Reproduce Paper Results

The shell scripts under scripts/ compress each model at target ratios from 40% to 80%. Set MODEL_DIR to your local model directory before running them:

export MODEL_DIR=/path/to/models

# Generate calibration for all models (run once)
bash scripts/calibrate_all.sh

# Compress (one model per GPU)
CUDA_VISIBLE_DEVICES=0 bash scripts/compress_llama7b.sh 0 &
CUDA_VISIBLE_DEVICES=1 bash scripts/compress_llama2_7b.sh 1 &
CUDA_VISIBLE_DEVICES=2 bash scripts/compress_llama3_8b.sh 2 &
CUDA_VISIBLE_DEVICES=3 bash scripts/compress_llama_30b.sh 3 &
CUDA_VISIBLE_DEVICES=4 bash scripts/compress_mistral_7b.sh 4 &
wait  # block until all background compression jobs finish

# Evaluate all compressed .pt files (after compression finishes)
bash scripts/evaluate_all.sh 0

Supported Models

LLaMA-7B (see scripts/compress_llama7b.sh)
LLaMA-2-7B (see scripts/compress_llama2_7b.sh)
LLaMA-3-8B (see scripts/compress_llama3_8b.sh)
LLaMA-30B (see scripts/compress_llama_30b.sh)
Mistral-7B (see scripts/compress_mistral_7b.sh)

Method Overview

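Stage 1 (Rank-Density-Aware Baseline): Computes a baseline rank for each sublayer from its rank density (stable rank by default; see --rank_method), allocating more capacity to sublayers with higher information density. --alpha controls how strongly density influences the allocation.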
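The stable rank of a matrix W is ‖W‖_F² / ‖W‖_2², the sum of squared singular values divided by the square of the largest one. The sketch below is an illustration under stated assumptions, not HiSVD's rank_allocation.py: it assumes rank density is stable rank divided by min(out, in), splits the parameter budget in proportion to density**alpha times each matrix's dense size, and converts each share to a rank using the r·(out + in) factorized cost. `stable_rank` and `allocate_ranks` are hypothetical names, and the `effective` and `mathematical` metrics are not shown.

```python
import numpy as np


def stable_rank(W):
    """||W||_F^2 / ||W||_2^2, i.e. sum(sigma_i^2) / sigma_max^2."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(np.sum(s**2) / s[0] ** 2)


def allocate_ranks(weights, keep_ratio, alpha=1.0):
    """Hypothetical density-weighted allocation (assumption, not HiSVD's formula).
    weights: {name: 2-D array}. Returns {name: rank}."""
    names = list(weights)
    # Assumed rank density: stable rank normalized by the maximum possible rank.
    dens = np.array([stable_rank(weights[n]) / min(weights[n].shape) for n in names])
    sizes = np.array([weights[n].size for n in names], dtype=np.float64)
    score = dens**alpha * sizes
    budget = keep_ratio * sizes.sum()      # total parameters allowed to remain
    shares = budget * score / score.sum()  # per-matrix parameter share
    ranks = {}
    for n, share in zip(names, shares):
        m, k = weights[n].shape
        # Each rank unit costs m + k parameters; never exceed the full rank.
        ranks[n] = int(min(share // (m + k), min(m, k)))
    return ranks
```

On two equally sized matrices, one dense random and one nearly rank-1, the random matrix receives the larger rank.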

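Stage 2 (Cluster-Level LWE Refinement): Refines the Stage 1 ranks using tail redundancy scores computed per sublayer type. All layers that share a sublayer type receive the same modulation factor, which keeps the allocation structurally consistent across depth. The mapping from score to factor is selected with --modulation and scaled by --lwe_strength (β).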
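Only the modulation names (`logarithmic`, `power`, `exponential`, `linear`) and the strength β come from the README. The curves below, the convention that a higher tail-redundancy score shrinks a sublayer type's ranks, and the renormalization that keeps the total rank equal to the Stage 1 total are all assumptions made for illustration; `MODULATIONS` and `refine_ranks` are hypothetical. What the sketch does mirror from the text is the structural rule: one factor per sublayer type, shared by every layer of that type.

```python
import math

# Hypothetical curves g(s, beta): map a tail-redundancy score s >= 0 to a rank
# multiplier in (0, 1]; a higher score yields a smaller multiplier. Illustrative only.
MODULATIONS = {
    "logarithmic": lambda s, beta: 1.0 / (1.0 + beta * math.log1p(s)),
    "linear": lambda s, beta: 1.0 / (1.0 + beta * s),
    "power": lambda s, beta: (1.0 + s) ** (-beta),
    "exponential": lambda s, beta: math.exp(-beta * s),
}


def refine_ranks(base_ranks, type_scores, modulation="logarithmic", beta=1.0):
    """base_ranks: {(layer_idx, sublayer_type): rank} from Stage 1.
    type_scores: {sublayer_type: score}, one score per type (cluster).
    Returns refined ranks with the same keys."""
    g = MODULATIONS[modulation]
    # One shared factor per sublayer type, applied to every layer of that type.
    factor = {t: g(s, beta) for t, s in type_scores.items()}
    raw = {key: r * factor[key[1]] for key, r in base_ranks.items()}
    # Rescale so the total rank matches the Stage 1 total (simplifying assumption).
    scale = sum(base_ranks.values()) / sum(raw.values())
    return {key: max(1, round(v * scale)) for key, v in raw.items()}
```

Keeping the total rank fixed is a simplification: matrices of different shapes cost different numbers of parameters per unit of rank, so a real implementation would rebalance against the parameter budget instead.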

Compression (after Stage 2): Whitens each weight with its calibration-derived whitening matrix, so that truncation error is measured on the actual input distribution rather than on the raw weights, then performs truncated SVD at the allocated rank.
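The whitened truncated SVD can be sketched as follows, assuming the whitening factor S satisfies S Sᵀ = XᵀX for calibration inputs X (tokens × in), as a Cholesky factorization provides. For a linear layer computing X Wᵀ, the output error of a replacement W′ equals ‖(W − W′)S‖_F, so truncating the SVD of W S to rank r gives the best rank-r W′ under that error, and the right factor is mapped back by S⁻¹. The function name `whitened_truncated_svd` is hypothetical; this is not the whitening_svd.py API.

```python
import numpy as np


def whitened_truncated_svd(W, S, rank):
    """Hypothetical sketch. W: (out, in) weight; S: (in, in) invertible whitening
    factor with S @ S.T == X.T @ X. Returns A (out, rank), B (rank, in), W ~= A @ B."""
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]  # fold singular values into the left factor
    # Undo the whitening on the right factor: B = Vt_r @ inv(S),
    # computed with a linear solve instead of an explicit inverse.
    B = np.linalg.solve(S.T, Vt[:rank].T).T
    return A, B
```

At full rank the factors reproduce W exactly; at lower ranks the output error on the calibration inputs is never larger than that of plain (unwhitened) truncated SVD at the same rank.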

License

MIT
