This repo offers a simple implementation of ViT (Vision Transformer) from scratch using PyTorch.
A more elaborate implementation is available here.
Please follow the instructions below.
git clone https://github.com/bskkimm/Simple-ViT-Implementation.git
conda create -n ViT python=3.10 -y
conda activate ViT
pip install -r requirements.txt

Then, implement ViT step by step using tutorial_from_scratch.ipynb.
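
As a rough idea of what the first step of a from-scratch ViT usually looks like, here is a minimal sketch of a patch embedding layer. It is not taken from the notebook: the class name `PatchEmbedding` and the hyperparameters (`patch_size=4`, `embed_dim=64`) are assumptions chosen for CIFAR-10-sized inputs, and the notebook's own version may differ.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to a vector.

    Hypothetical illustration; names and default sizes are assumptions.
    """

    def __init__(self, img_size=32, patch_size=4, in_channels=3, embed_dim=64):
        super().__init__()
        assert img_size % patch_size == 0, "image size must be divisible by patch size"
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution with kernel == stride == patch size is equivalent to
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and position embeddings (one per token, including [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        batch = x.shape[0]
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D)
        cls = self.cls_token.expand(batch, -1, -1)
        x = torch.cat([cls, x], dim=1)          # (B, N + 1, D)
        return x + self.pos_embed


tokens = PatchEmbedding()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 65, 64])
```

With 32×32 images and 4×4 patches there are 64 patches, so each image becomes a sequence of 65 tokens once the [CLS] token is prepended. These tokens are what the Transformer encoder blocks consume.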
| Model | Dataset | Train Accuracy | Test Accuracy | GPU Used | Training Time |
|---|---|---|---|---|---|
| ViT-B/12 | CIFAR-10 | 98.88% | 77.40% | RTX 4070 Laptop | 2.0 hours |
Because CIFAR-10 images are small (32×32 pixels), I implemented attention map visualization on the Food-101 dataset instead, whose higher-resolution samples are better suited to visual interpretation.
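
One common way to build such a visualization is to take the attention from the [CLS] token to every patch, average it over heads, and upsample it to the image size. The sketch below shows that idea only; it is not necessarily the method used in this repo. The function name `cls_attention_map`, the 224×224 input size, and the 16×16 patch size (giving 197 tokens) are all assumptions, and random weights stand in for a real model's attention.

```python
import math

import torch
import torch.nn.functional as F


def cls_attention_map(attn, img_size):
    """Turn one layer's attention weights into a per-pixel heatmap.

    attn: (B, heads, N + 1, N + 1) softmax attention with [CLS] at index 0.
    Returns a (B, img_size, img_size) tensor scaled to [0, 1].
    Hypothetical helper for illustration.
    """
    batch, _, tokens, _ = attn.shape
    grid = math.isqrt(tokens - 1)
    assert grid * grid == tokens - 1, "expects a square grid of patches"
    # Attention from [CLS] (row 0) to each patch (columns 1..N), averaged over heads.
    cls_to_patches = attn[:, :, 0, 1:].mean(dim=1)            # (B, N)
    heat = cls_to_patches.reshape(batch, 1, grid, grid)        # (B, 1, grid, grid)
    # Upsample the coarse patch grid to the input resolution.
    heat = F.interpolate(heat, size=(img_size, img_size), mode="bilinear", align_corners=False)
    heat = heat.squeeze(1)                                     # (B, H, W)
    # Min-max normalise each map so it can be overlaid on the image.
    lo = heat.amin(dim=(1, 2), keepdim=True)
    hi = heat.amax(dim=(1, 2), keepdim=True)
    return (heat - lo) / (hi - lo + 1e-8)


# Stand-in attention weights: 12 heads, 14 x 14 patches plus [CLS] (assumed sizes).
attn = torch.softmax(torch.randn(1, 12, 197, 197), dim=-1)
heatmap = cls_attention_map(attn, img_size=224)
print(heatmap.shape)  # torch.Size([1, 224, 224])
```

The resulting heatmap can be overlaid on the input image, for example with matplotlib's `imshow` and an `alpha` value. At CIFAR-10's resolution the patch grid is very coarse, which is why higher-resolution images give more readable maps.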
