- Clone the repo (change `FASTDIR` as preferred):
export FASTDIR=/workspace
cd $FASTDIR/git/
git clone https://github.com/aim-uofa/model-quantization
git clone https://github.com/blueardour/pytorch-utils
cd model-quantization
ln -s ../pytorch-utils utils
# create separate log and weight folders (optional; if the symbolic links are not created, the script creates these folders under the project path)
#mkdir -p /data/pretrained/pytorch/model-quantization/{exp,weights}
#ln -s /data/pretrained/pytorch/model-quantization/exp .
#ln -s /data/pretrained/pytorch/model-quantization/weights .
-
Install prerequisite packages
cd $FASTDIR/git/model-quantization
# python 3 is required
pip install -r requirement.txt
Quantization for the classification task has no strict requirement on the PyTorch version. However, other tasks such as detection and segmentation require a newer PyTorch version. `detectron2` currently requires Torch 1.4+. In addition, the CUDA version on the machine should match the one PyTorch was compiled with.
-
Install the NVIDIA image pre-processing packages and the mixed-precision training packages (optional, highly recommended)
This repo supports the ImageNet and CIFAR datasets. Create the necessary folders and prepare the datasets. Example:
# dataset
mkdir -p /data/cifar
mkdir -p /data/imagenet
# download ImageNet and move the training and evaluation data into /data/imagenet/{train,val}, respectively.
# cifar dataset can be downloaded on the fly
Some of the quantization results are listed in result_cls.md. We provide pretrained models on Google Drive.
Both training and testing use the `train.sh` script. Calling `main.py` directly is also possible.
bash train.sh config.xxxx
`config.xxxx` is the configuration file, which contains the network architecture, quantization-related, and training-related parameters. For more about the supported options, refer to Training script options below and to config.md. Also refer to the examples in the `config` subfolder.
Training is often time-consuming. Try our `start_on_terminate.sh` script, which can be used to queue a second task. The next round of training starts automatically when the current training process terminates.
# wait in a screen shell
screen -S next-round
bash start_on_terminate.sh [current training thread pid] [next round config.xxxx]
# Ctrl+A D to detach the screen session to the background
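The wait-then-launch idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the contents of `start_on_terminate.sh`; the helper names `pid_alive` and `run_after` are hypothetical, and the existence check relies on POSIX signal semantics.

```python
import os
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def run_after(pid: int, command: list, poll_seconds: float = 30.0) -> int:
    """Block until `pid` disappears, then run `command` and return its exit code."""
    while pid_alive(pid):
        time.sleep(poll_seconds)
    return subprocess.call(command)


# Hypothetical usage: start the next round once process 12345 has finished.
# run_after(12345, ["bash", "train.sh", "config.xxxx"])
```

Polling keeps the sketch simple and works for processes that are not children of the waiting shell, which is the situation when the first training run was started elsewhere.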
In addition, `tools.py` provides many useful functions for debugging, verbose output, and model conversion. Refer to tools.md for detailed usage.
See known issues
-
As of 2020.07.28, dynamic loading of the training options from a policy file is supported.
-
Option parsing
Common options are parsed in `util/config.py`. Quantization-related options are kept separately in `main.py`.
-
Keyword (choosing quantization method)
The `--keyword` option is one of the most important variables; it controls the model architecture and the choice of quantization algorithm.
We currently support quantization algorithms by adding the following options in the `keyword`:
a. `lq` for LQ-Nets
b. `pact` for PACT
c. `dorefa` for DoReFa-Net. Additional keywords are `lsq` for learned step size and `non-uniform` for FATNN.
d. `xnor` for XNOR-Net. If `gamma` is combined with `xnor` in the keyword, a separate learnable scale coefficient is added (that is, it becomes XNOR-Net++).
-
Keyword (structure control):
The network structure is chosen with `--arch` or `--model`. For ResNet, the official ResNet model is available as `pytorch-resnetxx`, and a more flexible ResNet architecture is available by setting `--arch` or `--model` to `resnetxx`. In the latter case, many options can be combined to customize the network structure:
a. The presence or absence of `origin` in `keyword` chooses whether the Bi-Real skip connection is used (block-wise skip connection versus layer-wise skip connection).
b. `bacs`, `cbas`, etc. indicate the layer order in a ResNet block. For example, `bacs` is a pre-activation structure: within a ResNet block, the normalization layer comes first, then the activation layer, then the convolution layer, and finally the skip connection. For pre-activation structures, `preBN` is required for the first ResNet block. Refer to resnet.md for more information.
c. By default, all layers except the first and last are quantized. `real_skip` can be added to keep the skip connection layers in ResNet in full precision, which is widely used in XNOR-Net and Bi-Real Net.
d. For the normalization and activation layers, we also provide `keyword` options for different variants. For example, `NRelU` means the network does not include the ReLU activation, and `PRelU` indicates that PReLU is employed. Refer to `model/layer.py` for details.
e. Padding and quantization order. I think it is an error to pad the feature map with 0 after quantization, especially in BNNs. From my perspective, this strategy turns BNNs into TNNs, because the padded zeros add a third value. Thus, I advocate padding the feature map with zeros first and then applying the quantization step. To stay compatible with the publications while also providing a revised method, `padding_after_quant` can be set to control the order between padding and quantization. Refer to line 445 in `model/quant.py` for the implementation.
f. Skip connection realization. Two choices are provided. One is average pooling with stride followed by a conv1x1 with stride=1. The other is a single conv1x1 with the required stride. `singleconv` in `keyword` selects between them.
g. `fixup` enables the architecture from Fixup Initialization.
h. The option `base`, which is a standalone option rather than a word in the `keyword` list, is used to realize the branch configuration in Group-Net.
Self-defined `keyword` entries are supported and can easily be added as the user wishes. As introduced above, the options can be combined to build different architecture variants. Examples can be found in the `config` subfolder.
-
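To make the padding-order argument in item e concrete, here is a minimal pure-Python sketch on a 1-D feature row. It does not reproduce the code in `model/quant.py`; the helpers `sign` and `pad1d` are hypothetical, and mapping zero to +1 is an assumed (though common) binarization convention.

```python
def sign(x: float) -> int:
    """Binarize to {-1, +1}; zero maps to +1 (assumed convention)."""
    return 1 if x >= 0 else -1


def pad1d(values, width=1, fill=0):
    """Pad a 1-D sequence with `fill` on both sides."""
    return [fill] * width + list(values) + [fill] * width


features = [0.7, -1.2, 0.3, -0.4]

# Quantize first, then pad: the padded zeros survive as a third level.
quant_then_pad = pad1d([sign(v) for v in features])

# Pad first, then quantize: the zeros are binarized like any other value.
pad_then_quant = [sign(v) for v in pad1d(features)]

print(sorted(set(quant_then_pad)))  # [-1, 0, 1]
print(sorted(set(pad_then_quant)))  # [-1, 1]
```

With padding applied after quantization the convolution input takes three distinct values, which is the ternary behaviour the note above objects to.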
Activation and weight quantization options
The script provides independent configurations for activations and weights. Here we explain some advanced options.
-
`xx_quant_group` indicates the number of groups for the quantization parameter along the channel dimension.
-
`xx_adaptive`, in most cases, indicates an additional normalization operation, which shows great potential to improve performance.
-
`xx_grad_type` defines a custom gradient approximation method. In general, the quantization step is not differentiable, so techniques such as the straight-through estimator (STE) are used to approximate the gradient. Other types of approximation exist. In addition, some works advocate adding a scale coefficient to the gradient to stabilize training.
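As an illustration of the gradient approximation idea, the sketch below implements a scalar uniform quantizer with a straight-through backward pass. It is a simplified stand-in and not one of the repo's `xx_grad_type` implementations; the function names, the [0, 1] clip range, and the `scale` argument are assumptions made for the example.

```python
def quantize_forward(x: float, levels: int = 4) -> float:
    """Clip x to [0, 1] and round it to one of `levels` evenly spaced values."""
    x = min(max(x, 0.0), 1.0)
    steps = levels - 1
    return round(x * steps) / steps


def ste_backward(x: float, grad_output: float, scale: float = 1.0) -> float:
    """Straight-through estimator for the quantizer above.

    Rounding is treated as the identity inside the clip range, so the
    incoming gradient passes through (optionally multiplied by `scale`).
    Outside the clip range the gradient is zero.
    """
    inside = 0.0 <= x <= 1.0
    return grad_output * scale if inside else 0.0


print(quantize_forward(0.4))              # one of 0, 1/3, 2/3, 1
print(ste_backward(0.4, 2.0))             # gradient passes through
print(ste_backward(1.7, 2.0))             # clipped input, gradient blocked
print(ste_backward(0.4, 2.0, scale=0.5))  # scaled gradient
```

The `scale` argument shows where a gradient scale coefficient of the kind mentioned above would enter.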
-
-
Weight decay
Three major related options.
-
`--wd` sets the default L2 weight decay value.
-
Weight decay was originally proposed to avoid overfitting with a large number of parameters. For some small tensors, for example the parameters in BatchNorm layers (as well as custom quantization parameters, such as the clip value), a weight decay of zero is advocated. `--decay_small` controls whether those small tensors are decayed.
-
`--custom_decay_list` and `--custom_decay` are combined to assign a custom decay value to specific parameters. For example, in PACT, the clip_boundary can have its own weight decay for regularization.
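How the three options could interact is sketched below. This is an assumption-laden illustration rather than the repo's logic: the function `assign_weight_decay` is hypothetical, "small tensor" is taken to mean a tensor with at most one dimension, and custom entries are matched by substring against the parameter name.

```python
def assign_weight_decay(named_shapes, wd, decay_small=False,
                        custom_decay_list=(), custom_decay=()):
    """Return a {parameter name: weight decay} mapping.

    named_shapes maps parameter names to their shapes (tuples).
    """
    decay = {}
    for name, shape in named_shapes.items():
        value = wd
        # Assumed rule: 0-D and 1-D tensors count as "small".
        if len(shape) <= 1 and not decay_small:
            value = 0.0
        # Custom entries override the defaults (substring match, assumed).
        for pattern, custom in zip(custom_decay_list, custom_decay):
            if pattern in name:
                value = custom
        decay[name] = value
    return decay


# Hypothetical parameter names and shapes.
params = {
    "layer1.conv.weight": (64, 64, 3, 3),
    "layer1.bn.weight": (64,),
    "layer1.conv.clip_boundary": (1,),
}
result = assign_weight_decay(params, wd=1e-4,
                             custom_decay_list=["clip_boundary"],
                             custom_decay=[2e-5])
print(result)
```

In a PyTorch training script, a mapping like this would typically be turned into per-parameter optimizer groups, each with its own `weight_decay` entry.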
-
-
Learning rate
-
multi-step decay
-
poly decay
-
sgdr (with restart)
-
`--custom_lr_list` and `--custom_lr` are provided, similar to the weight decay options above, to assign a custom learning rate to specific parameters.
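For reference, the three schedule families listed above are commonly written as follows. These are the textbook formulas in plain Python, not the repo's scheduler code; the function names, the default `gamma` and `power`, and the fixed restart period in the SGDR variant are assumptions.

```python
import math


def multistep_lr(base_lr, epoch, milestones, gamma=0.1):
    """Multiply the rate by `gamma` at every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def poly_lr(base_lr, epoch, total_epochs, power=1.0):
    """Polynomial decay from base_lr at epoch 0 to zero at total_epochs."""
    return base_lr * (1 - epoch / total_epochs) ** power


def sgdr_lr(base_lr, epoch, cycle_length, min_lr=0.0):
    """Cosine annealing with warm restarts, using a fixed cycle length."""
    t = epoch % cycle_length  # position inside the current cycle
    cosine = 1 + math.cos(math.pi * t / cycle_length)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine


for epoch in (0, 5, 10, 30, 60):
    print(epoch,
          multistep_lr(0.1, epoch, milestones=[30, 60]),
          poly_lr(0.1, epoch, total_epochs=100),
          sgdr_lr(0.1, epoch, cycle_length=10))
```

In the SGDR variant the rate jumps back to `base_lr` at the start of every cycle, which is the "restart" referred to in the list.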
-
-
Mixed precision training options
`--fp16` and `--opt_level [O1]` are provided for mixed-precision training.
-
FP32
-
FP16 with a custom level; the `O1` level is recommended.
-