HiSVD is a training-free, two-stage SVD-based compression framework for LLMs. It combines rank-density-aware baseline allocation (Stage 1) with cluster-level sublayer redundancy refinement (Stage 2), followed by whitening SVD decomposition to produce the compressed model.
HiSVD/
├── compress.py # Main compression pipeline (Stage 1 + 2 + SVD)
├── calibrate.py # Generate calibration data (whitening matrices)
├── evaluate.py # Evaluate PPL for HuggingFace or compressed models
├── hisvd/ # Core algorithm modules
│ ├── calibration.py # Calibration data collector
│ ├── rank_allocation.py # Stage 1: rank-density-aware baseline
│ ├── clustering.py # LESA layer clustering (reserved)
│ ├── cluster_sublayer_lwe.py # LWE score computation
│ ├── refinement.py # Stage 2: cluster-level LWE refinement
│ ├── whitening_svd.py # Whitening SVD decomposition
│ └── shape_utils.py # Weight shape inference (GQA support)
├── component/ # Model-specific SVD wrappers
│ ├── svd_llama.py
│ └── svd_mistral.py
├── utils/ # Shared utilities
│ ├── data_utils.py # Dataset loading (C4, WikiText-2, PTB)
│ ├── model_utils.py # Model loading helpers
│ └── eval_utils.py # Perplexity evaluation
├── scripts/ # Reproduction shell scripts
│ ├── calibrate_all.sh
│ ├── compress_llama7b.sh
│ ├── compress_llama2_7b.sh
│ ├── compress_llama3_8b.sh
│ ├── compress_llama_30b.sh
│ ├── compress_mistral_7b.sh
│ ├── evaluate_all.sh
│ └── (optional) extra scripts: compress_llama13b.sh / compress_opt6.7b.sh
├── data/ # Local dataset files
│ └── c4-validation.json
└── requirements.txt
pip install -r requirements.txt

Calibration only needs to be done once per model. It collects whitening matrices and weight statistics from a small set of WikiText-2 samples.

python calibrate.py \
--model_path /path/to/llama-7b \
--model_name llama-7b \
--nsamples 256 \
--seqlen 2048

This produces calibration_cache/{name}_n256_s2048_with_hessian.pkl, which contains whitening matrices, Hessian matrices, and weight statistics (including "Vh").

python compress.py \
--model_path /path/to/llama-7b \
--calib_cache calibration_cache/llama-7b_n256_s2048_with_hessian.pkl \
--target_compression 0.8 \
--rank_method stable \
--alpha 1.0 \
--lwe_strength 1.0 \
--tail_redundancy_weight 1.0 \
--modulation logarithmic \
--save_model compressed_models/llama7b_80p.pt

Key arguments:

| Argument | Description | Default |
|---|---|---|
| --target_compression | Compression ratio (e.g. 0.8 = keep 80% params) | - |
| --rank_method | Stage 1 rank metric: stable, effective, mathematical | stable |
| --alpha | Rank density sensitivity | 1.0 |
| --modulation | Stage 2 modulation: logarithmic, power, exponential, linear | logarithmic |
| --lwe_strength | Modulation strength (beta) | 1.0 |

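To make the compression ratio concrete: factoring an m×n weight into two rank-r factors stores r(m + n) parameters instead of mn. The sketch below computes the largest rank that fits a given keep ratio for a single matrix. `rank_for_ratio` is a hypothetical helper written for illustration, not a function in compress.py, and HiSVD's per-sublayer ranks come from Stage 1 and Stage 2 rather than from this uniform rule.

```python
def rank_for_ratio(rows: int, cols: int, keep_ratio: float) -> int:
    """Largest rank r with r * (rows + cols) <= keep_ratio * rows * cols."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Parameter budget for this matrix after compression.
    budget = keep_ratio * rows * cols
    # Each unit of rank costs one column of the left factor (rows values)
    # plus one row of the right factor (cols values).
    return max(1, int(budget // (rows + cols)))


if __name__ == "__main__":
    # A square 4096 x 4096 projection at keep_ratio 0.8.
    print(rank_for_ratio(4096, 4096, 0.8))
```

Note that a factorized layer only saves parameters when r is below mn / (m + n), which is 2048 for a 4096×4096 matrix.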
# Evaluate a compressed .pt model
python evaluate.py --model_path compressed_models/llama7b_80p.pt --datasets wikitext2,c4
# Evaluate an original HuggingFace model
python evaluate.py --model_path /path/to/llama-7b --datasets wikitext2

Shell scripts under scripts/ compress each model at ratios from 40% to 80%. Set MODEL_DIR to your local model directory before running:

export MODEL_DIR=/path/to/models
# Generate calibration for all models (run once)
bash scripts/calibrate_all.sh
# Compress (one model per GPU)
CUDA_VISIBLE_DEVICES=0 bash scripts/compress_llama7b.sh 0 &
CUDA_VISIBLE_DEVICES=1 bash scripts/compress_llama2_7b.sh 1 &
CUDA_VISIBLE_DEVICES=2 bash scripts/compress_llama3_8b.sh 2 &
CUDA_VISIBLE_DEVICES=3 bash scripts/compress_llama_30b.sh 3 &
CUDA_VISIBLE_DEVICES=4 bash scripts/compress_mistral_7b.sh 4 &
# Evaluate all compressed .pt files (after compression finishes)
bash scripts/evaluate_all.sh 0

Models with reproduction scripts:

- LLaMA-7B (see scripts/compress_llama7b.sh)
- LLaMA-2-7B (see scripts/compress_llama2_7b.sh)
- LLaMA-3-8B (see scripts/compress_llama3_8b.sh)
- LLaMA-30B (see scripts/compress_llama_30b.sh)
- Mistral-7B (see scripts/compress_mistral_7b.sh)

Stage 1 — Rank-Density-Aware Baseline: Computes per-sublayer baseline ranks using stable rank density, allocating more capacity to sublayers with higher information density.
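The Stage 1 idea can be pictured with a small NumPy sketch. It assumes that "stable rank density" means the stable rank (squared Frobenius norm over squared spectral norm) divided by the smaller matrix dimension, and that alpha acts as an exponent on that density. Both are assumptions made for illustration, as are the function names; the actual logic lives in hisvd/rank_allocation.py and may differ.

```python
import numpy as np


def stable_rank(weight: np.ndarray) -> float:
    """Stable rank: sum of squared singular values over the largest one squared."""
    s = np.linalg.svd(weight, compute_uv=False)  # sorted in descending order
    return float((s ** 2).sum() / (s[0] ** 2))


def baseline_ranks(weights: dict, keep_ratio: float, alpha: float = 1.0) -> dict:
    """Scale each sublayer's uniform rank by its relative density (assumed rule).

    This sketch does not re-normalize to hit the parameter budget exactly.
    """
    # Assumed density definition: stable rank relative to the maximum possible rank.
    score = {n: (stable_rank(w) / min(w.shape)) ** alpha for n, w in weights.items()}
    mean_score = sum(score.values()) / len(score)
    ranks = {}
    for name, w in weights.items():
        rows, cols = w.shape
        uniform = keep_ratio * rows * cols / (rows + cols)
        r = int(round(uniform * score[name] / mean_score))
        ranks[name] = max(1, min(r, min(rows, cols)))
    return ranks


if __name__ == "__main__":
    toy = {"dense": np.eye(8), "low": np.outer(np.ones(8), np.ones(8))}
    print(baseline_ranks(toy, keep_ratio=0.5))
```

In the toy example the identity matrix spreads its energy over all directions, so it receives a higher rank than the rank-one matrix.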
Stage 2 — Cluster-Level LWE Refinement: Refines ranks using sublayer-type-level tail redundancy scores. All layers sharing the same sublayer type receive a unified modulation factor, ensuring structural consistency.
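A rough sketch of how one shared factor per sublayer type could be applied is shown below. The modulation curves here are invented for illustration: they only mirror the names of the modulation options (logarithmic, power, exponential, linear) and are not taken from hisvd/refinement.py. The redundancy scores are assumed to lie in [0, 1], `strength` stands in for the modulation strength (beta), and the sketch does not re-normalize ranks to the parameter budget.

```python
import math

# Hypothetical shaping curves, each mapping a redundancy score in [0, 1] to [0, 1].
_CURVES = {
    "linear": lambda r: r,
    "power": lambda r: r ** 2,
    "logarithmic": lambda r: math.log1p(r) / math.log(2.0),
    "exponential": lambda r: math.expm1(r) / (math.e - 1.0),
}


def modulation_factor(redundancy: float, strength: float = 1.0,
                      mode: str = "logarithmic") -> float:
    """Rank multiplier in (0, 1]; higher redundancy gives a smaller factor."""
    return 1.0 / (1.0 + strength * _CURVES[mode](redundancy))


def refine_ranks(baseline: dict, type_redundancy: dict,
                 strength: float = 1.0, mode: str = "logarithmic") -> dict:
    """baseline maps (layer_index, sublayer_type) to a rank.

    Every layer with the same sublayer type is scaled by the same factor.
    """
    factors = {t: modulation_factor(s, strength, mode)
               for t, s in type_redundancy.items()}
    return {key: max(1, round(rank * factors[key[1]]))
            for key, rank in baseline.items()}


if __name__ == "__main__":
    baseline = {(0, "q_proj"): 100, (1, "q_proj"): 80, (0, "down_proj"): 100}
    print(refine_ranks(baseline, {"q_proj": 0.5, "down_proj": 0.0}))
```

Because the factor depends only on the sublayer type, the relative ordering of baseline ranks across layers of the same type is preserved.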
Compression (after Stage 2): Applies calibration-based whitening to remove input distribution bias, then performs truncated SVD at the allocated rank.
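The final step can be sketched in NumPy as follows. This version assumes the whitening matrix is the Cholesky factor of the calibration activations' Gram matrix, which is one common way to do calibration-based whitening; whether hisvd/whitening_svd.py uses exactly this construction is an assumption, and the function name and `eps` regularizer are illustrative.

```python
import numpy as np


def whitening_svd(weight: np.ndarray, activations: np.ndarray,
                  rank: int, eps: float = 1e-6):
    """Factor weight (out, in) into A (out, rank) and B (rank, in).

    activations has shape (tokens, in). The truncation minimizes the output
    error on the calibration inputs rather than the raw weight error.
    """
    in_dim = weight.shape[1]
    # Gram matrix of the inputs, regularized so the Cholesky factor exists.
    gram = activations.T @ activations + eps * np.eye(in_dim)
    S = np.linalg.cholesky(gram)  # gram = S @ S.T
    # Truncated SVD of the whitened weight.
    U, sigma, Vt = np.linalg.svd(weight @ S, full_matrices=False)
    U, sigma, Vt = U[:, :rank], sigma[:rank], Vt[:rank]
    root = np.sqrt(sigma)
    A = U * root
    # Undo the whitening on the right factor: (root * Vt) @ inv(S).
    B = np.linalg.solve(S.T, (root[:, None] * Vt).T).T
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 8))
    X = rng.standard_normal((64, 8)) * np.linspace(0.1, 3.0, 8)
    A, B = whitening_svd(W, X, rank=3)
    print(np.linalg.norm((W - A @ B) @ X.T))
```

Splitting the singular values evenly between A and B is a design choice of this sketch; any split gives the same product A @ B.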
MIT