In this project I implemented a training flow for GPT on a single GPU using Hugging Face Accelerate. I then exported the model to ONNX format for faster inference and deployed it to Hugging Face Spaces. Training logs can be viewed on Wandb.

Demo on Hugging Face Spaces: https://huggingface.co/spaces/prerana1205/GPT-Inference


| Inference type | Time taken for 1000 tokens |
|---|---|
| PyTorch model | 83 s |
| Quantized model | 81 s |
| ONNX quantized model | 56 s |
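
To make the comparison easier to read, the timings in the table can be converted to throughput and to a speedup relative to the PyTorch baseline. The snippet below is only a small arithmetic sketch: the numbers are copied from the table, while the dictionary keys and the `summarize` helper are hypothetical names chosen for illustration and are not part of the repository.

```python
# Timings copied from the table above (seconds to generate 1000 tokens).
# The keys are illustrative labels, not identifiers from the repository.
timings_s = {
    "PyTorch": 83,
    "Quantized": 81,
    "ONNX quantized": 56,
}
TOKENS = 1000


def summarize(timings, tokens, baseline_s):
    """Return {name: (tokens_per_second, speedup_vs_baseline)}."""
    return {
        name: (tokens / secs, baseline_s / secs)
        for name, secs in timings.items()
    }


summary = summarize(timings_s, TOKENS, timings_s["PyTorch"])
for name, (tokens_per_s, speedup) in summary.items():
    print(f"{name:15s} {tokens_per_s:6.2f} tokens/s  {speedup:4.2f}x vs PyTorch")
```

By this arithmetic the ONNX quantized model is roughly 1.5x faster than the PyTorch baseline, while quantization alone saves only about 2 seconds per 1000 tokens.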

Clone the project

    git clone https://github.com/kurchi1205/GPT-Scratch.git

Go to the project directory

    cd GPT-Scratch

Install dependencies

    pip install -r requirements_train.txt

Start the training

    ./train.sh

Export the model to ONNX and quantize it

    python export.py

Run inference

    python generate.py        # for the PyTorch model
    python generate_onnx.py   # for the ONNX model

I have also implemented the flash attention flow. The CUDA kernel is not implemented, but the matrix slicing (tiling) part is.
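
For readers unfamiliar with the slicing idea behind flash attention, here is a minimal pure-Python sketch. It is not the code from this repository: the function names, the `block_size` parameter, and the list-of-rows representation are all assumptions made for illustration. It slices only the keys and values into blocks (the real algorithm also tiles the queries and runs inside a fused GPU kernel) and it omits the causal mask that a GPT model would apply. The point it shows is that attention can be computed one slice at a time, with a running softmax, and still match the full-matrix result.

```python
import math
import random


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v computed on the full matrices."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but K and V are visited one slice (block) at a time.

    Per query row we keep a running max `m`, a running normaliser `l` and an
    unnormalised accumulator `acc`, and rescale them whenever a new block
    raises the max (the "online softmax" trick).
    """
    d = len(q[0])
    dv = len(v[0])
    out = []
    for qi in q:
        m, l, acc = float("-inf"), 0.0, [0.0] * dv
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [
                sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k_blk
            ]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # 0.0 on the first block, since m = -inf
            weights = [math.exp(s - m_new) for s in scores]
            l = l * scale + sum(weights)
            acc = [
                a * scale + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                for c, a in enumerate(acc)
            ]
            m = m_new
        out.append([a / l for a in acc])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
q = random_matrix(3, 4, rng)
k = random_matrix(5, 4, rng)
v = random_matrix(5, 4, rng)

full = naive_attention(q, k, v)
tiled = tiled_attention(q, k, v, block_size=2)
max_diff = max(abs(a - b) for ra, rb in zip(full, tiled) for a, b in zip(ra, rb))
print("max difference between naive and tiled attention:", max_diff)
```

The two results agree up to floating-point rounding. The tiled version never needs more than one block of scores at a time, which is the memory saving that a fused CUDA kernel exploits.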