This project was inspired by an interest in exploring pruning and sparse tensors.
- Pruning_Tutorial.ipynb: Based on the official PyTorch pruning tutorial: https://docs.pytorch.org/tutorials/intermediate/pruning_tutorial.html
- Testing_Prunes.ipynb: An exploration that builds on the tutorial techniques to study how different pruning styles affect model accuracy.
I originally expected a smoother trade-off: the model loses some weights, retrains in the pruned state, and recovers to reasonably high accuracy. In the (limited) comparisons I ran here, the two approaches diverged far more sharply than that.
I compared two pruning approaches (a minimal code sketch follows the list):
- Per-layer pruning: unstructured, L1-magnitude pruning applied independently per layer (removing the least important fraction within each layer).
- Global pruning: unstructured, L1-magnitude pruning applied across the whole model (removing the least important weights across all layers).
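A minimal sketch of both approaches using torch.nn.utils.prune. The architecture and the 20% amount are illustrative, not the exact values from the notebooks:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def make_model():
    # Illustrative architecture; not the exact model from the notebooks.
    # 54 inputs / 7 outputs match the Covertype dataset used below.
    return nn.Sequential(nn.Linear(54, 256), nn.ReLU(), nn.Linear(256, 7))

# Per-layer: remove the smallest-magnitude 20% of weights in EACH layer,
# regardless of how important that layer is overall.
per_layer = make_model()
for module in per_layer.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.2)

# Global: remove the smallest-magnitude 20% of weights across ALL layers,
# so redundant layers can absorb more of the cut than critical ones.
global_model = make_model()
parameters_to_prune = [
    (m, "weight") for m in global_model.modules() if isinstance(m, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
```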
In general, the globally pruned model is far more reliable than the per-layer model. That makes intuitive sense: with global pruning, the most redundant weights can be removed from across the (intentionally over-parameterized) network. I chose an over-parameterized model on purpose to see whether a deeper model could survive more iterative pruning while retaining accuracy.
By contrast, the per-layer model's performance collapses immediately, and it then appears to go through jarring training cycles. In the runs I performed, it never reached higher accuracy than the globally pruned model. This also makes sense: per-layer pruning removes (for example) the “worst” 20% of weights in every layer, regardless of how important that layer is overall. A layer’s “worst” weights can still be more important than another layer’s “best” weights, so pruning uniformly per layer can be much more destructive.
One interesting detail: from the printed metrics during pruning/training, much of the global pruning seems to happen in the middle layers of the network—the most “inner” layers appear to be pruned the most. This suggests a follow-up exploration into model size, depth, and layer-wise redundancy.
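A hedged sketch of how that layer-wise sparsity can be inspected; the metrics printed in the notebooks may differ in detail, but calling report_sparsity on the globally pruned model from the sketch above would surface where the cuts landed:

```python
import torch.nn as nn

def report_sparsity(model: nn.Module) -> None:
    """Print the fraction of zeroed weights in each Linear layer."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            weight = module.weight  # the masked weight once pruning is applied
            pct = 100.0 * float((weight == 0).sum()) / weight.numel()
            print(f"{name}: {pct:.1f}% of weights pruned")
```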
This experiment uses the relatively simple Covertype dataset, fetched via scikit-learn: https://archive.ics.uci.edu/dataset/31/covertype
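For reference, a sketch of how the dataset can be loaded; fetch_covtype downloads the same UCI Covertype data, and the split below is illustrative:

```python
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

data = fetch_covtype()  # 581,012 rows, 54 features, 7 cover types
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# Note: targets are labeled 1-7; subtract 1 before using CrossEntropyLoss.
```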
Because the model is quite deep, it fits the dataset early in training. As a result, the best-performing checkpoints appear very early: obviously so in the per-layer case, since accuracy crashes after the first pruning step, and less obviously in the global case, which in my runs reaches its best accuracy and lowest evaluation loss around epoch ~300.
All tests were run on an NVIDIA RTX 4000 (Blackwell architecture) with CUDA 13 and PyTorch 2.10.
Feel free to adjust pruning hyperparameters, training settings, or even swap the dataset to conduct your own experiments.
These plots are generated by the comparison run stored under checkpoints/_comparisons/.
Corrective note (data issue): In the train vs. validation loss plot above, the training-loss and validation-loss curves are not computed from the same model state at each checkpoint. In the training loop, train_loss is logged before pruning (after the model has trained for the epoch), while the checkpoints later evaluated for val_loss are saved after pruning (the newly pruned model). At prune epochs, the reported train_loss therefore corresponds to the pre-prune, recovered network, whereas val_loss corresponds to the post-prune, freshly cut network. This systematic pre-prune vs. post-prune mismatch makes the two lines appear to diverge sharply (especially under per-layer pruning) because they sample opposite sides of the pruning “shock,” not the same model snapshot.
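A hypothetical restructuring of the logging order that would remove this artifact. Here train_one_epoch, evaluate, prune_step, is_prune_epoch, log, and save_checkpoint are placeholder names standing in for the notebook's actual helpers, not its real API:

```python
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)  # pre-prune: same state as train_loss
    log(epoch, train_loss=train_loss, val_loss=val_loss)

    if is_prune_epoch(epoch):
        prune_step(model)  # apply the next round of pruning
        # Log the post-prune loss as its own series so the pruning "shock"
        # is visible explicitly instead of contaminating val_loss.
        log(epoch, post_prune_val_loss=evaluate(model, val_loader))
        save_checkpoint(model, epoch)  # checkpoint now matches its own logs
```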

