An image captioning model with a ResNet-50 encoder and a Transformer decoder, trained on the MS COCO dataset. Converted from a Jupyter notebook into a clean, production-ready project structure.

Project layout:
```text
image_captioning/
|
|-- configs/
|   |-- config.yaml             <- all paths and hyperparameters
|   |-- config_loader.py        <- YAML loader
|
|-- src/
|   |-- data/
|   |   |-- prepare.py          <- load COCO pairs, compute mean/std, build vocab
|   |   |-- dataset.py          <- CaptionDataset, collate_function, build_loaders
|   |-- models/
|   |   |-- encoder.py          <- CNNEncoder (ResNet-50)
|   |   |-- decoder.py          <- PositionalEncoding + TransformerDecoder
|   |   |-- captioning_model.py <- ImageCaptioningModel (encoder + decoder)
|   |-- training/
|   |   |-- train_utils.py      <- generate_square_subsequent_mask, train_one_epoch, validate
|   |-- inference/
|   |   |-- predictor.py        <- generate_caption + CaptionPredictor wrapper
|   |-- api/
|       |-- app.py              <- FastAPI server (POST /caption)
|
|-- scripts/
|   |-- prepare_data.py         <- run once to cache vocab + stats
|   |-- train.py                <- full training run
|   |-- predict.py              <- CLI inference
|   |-- run_api.py              <- start the API server
|
|-- frontend/
|   |-- index.html              <- drag-and-drop web UI
|
|-- Dockerfile                  <- Docker build file (CPU version)
|-- docker-compose.yml          <- Docker Compose config
|-- .dockerignore               <- files excluded from Docker image
|-- requirements.txt            <- GPU dependencies
|-- requirements-cpu.txt        <- CPU dependencies (for Docker / no GPU)
|-- data/                       <- put your COCO images + annotations here
|-- saved_models/               <- checkpoints saved here
|-- README.md
```
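For orientation, here is a simplified sketch of how the pieces in src/models/ plausibly fit together. The class names match the files above, but the hyperparameters and details (positional encoding, dropout, weight init) are illustrative, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """ResNet-50 backbone with the classifier head replaced by a projection to d_model."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.proj = nn.Linear(backbone.fc.in_features, d_model)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.features(images).flatten(1)   # (B, 2048)
        return self.proj(x).unsqueeze(1)       # (B, 1, d_model), used as decoder memory

class ImageCaptioningModel(nn.Module):
    """Encoder + Transformer decoder; positional encoding omitted here for brevity."""
    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        self.encoder = CNNEncoder(d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions, tgt_mask=None):
        memory = self.encoder(images)           # (B, 1, d_model)
        tgt = self.embed(captions)              # (B, T, d_model)
        out = self.decoder(tgt, memory, tgt_mask=tgt_mask)
        return self.fc_out(out)                 # (B, T, vocab_size)
```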
Which requirements file should you use?

| Situation | File to use |
|---|---|
| You have an NVIDIA GPU (CUDA) | requirements.txt |
| You have no GPU (CPU only) | requirements-cpu.txt |
| Building Docker image | uses requirements-cpu.txt automatically |
Before installing any packages, create an isolated environment. This keeps the project's dependencies separate from other projects. Use either conda or venv.

Using conda:

```bash
# Create a new conda environment with Python 3.10
conda create -n imgcap python=3.10

# Activate it
conda activate imgcap

# You will see (imgcap) in your terminal now
```

Using venv:

```bash
# Create a virtual environment
python -m venv imgcap

# Activate it on Windows
imgcap\Scripts\activate

# Activate it on Mac/Linux
source imgcap/bin/activate

# You will see (imgcap) in your terminal now
```

NOTE: Always make sure your environment is activated before running any commands in this project. You will see (imgcap) at the start of your terminal line when it is active.
Install the dependencies (GPU version):

```bash
pip install -r requirements.txt
```

requirements.txt contains:

```text
torch==2.6.0
torchvision==0.21.0
Pillow==12.0.0
numpy==2.2.6
tqdm==4.67.3
PyYAML==6.0.3
fastapi==0.135.1
uvicorn==0.41.0
python-multipart==0.0.22
pydantic==2.12.5
matplotlib==3.10.8
starlette==0.52.1
anyio==4.12.1
```
Make sure you have the correct CUDA version of PyTorch installed. Check your version at: https://pytorch.org/get-started/locally/
```bash
# This project uses CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
```

If you have no GPU, install the CPU dependencies instead:

```bash
pip install -r requirements-cpu.txt
```

requirements-cpu.txt contains:
```text
torch==2.2.0
torchvision==0.17.0
Pillow==12.0.0
numpy==1.26.0
tqdm==4.67.3
PyYAML==6.0.3
fastapi==0.135.1
uvicorn==0.41.0
python-multipart==0.0.22
pydantic==2.12.5
matplotlib==3.10.8
starlette==0.52.1
anyio==4.12.1
```
NOTE: Training on CPU will be very slow. GPU is strongly recommended for training. For inference (predicting captions on single images) CPU is fine, just slower (5-15 sec per image).
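To confirm which device PyTorch will actually use, a quick check (standard PyTorch calls, nothing project-specific):

```python
import torch

# Pick CUDA if available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"torch {torch.__version__} will run on: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```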
Expected data layout:

```text
data/
|-- images/
|   |-- val2017/                <- *.jpg images from COCO val2017
|-- annotations/
    |-- captions_val2017.json
```
| File | Download Link |
|---|---|
| Images (val2017) | http://images.cocodataset.org/zips/val2017.zip |
| Captions | http://images.cocodataset.org/annotations/annotations_trainval2017.zip |
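If you prefer to script the download, here is a plain-Python sketch for the two archives in the table above (no extra dependencies; note that val2017.zip is roughly 1 GB, so this takes a while):

```python
import urllib.request
import zipfile
from pathlib import Path

# Download and unpack both COCO archives into the layout shown above.
downloads = [
    ("http://images.cocodataset.org/zips/val2017.zip", Path("data/images")),
    ("http://images.cocodataset.org/annotations/annotations_trainval2017.zip", Path("data")),
]
for url, dest in downloads:
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / url.rsplit("/", 1)[-1]
    if not zip_path.exists():
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)  # yields data/images/val2017/ and data/annotations/
```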
After downloading, update the paths in configs/config.yaml to point to your local data.
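A quick sanity check that the layout matches (paths assume the defaults above; adjust them if your config.yaml points elsewhere):

```python
from pathlib import Path

# Verify the files that prepare_data.py and training will expect.
expected = [
    Path("data/images/val2017"),
    Path("data/annotations/captions_val2017.json"),
]
for p in expected:
    print(("OK      " if p.exists() else "MISSING ") + str(p))
```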
Download the trained model weights from Google Drive:
https://drive.google.com/file/d/1ySPlVzzgmCZdO4S2xRXUmvV-meOVcCYk/view?usp=sharing
Place the downloaded file into the saved_models/ folder:

```text
saved_models/
|-- image_caption_model_final.pth   <- put it here
```
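To confirm the download is intact, you can load it on CPU and peek at its contents. Whether it is a bare state_dict or a dict with extra keys (epoch, optimizer state, ...) depends on how train.py saved it:

```python
import torch

# weights_only=False is needed on PyTorch >= 2.6 if the file stores more than
# raw tensors; only use it for files you trust.
ckpt = torch.load("saved_models/image_caption_model_final.pth",
                  map_location="cpu", weights_only=False)
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```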
Build the vocabulary and dataset statistics (run once):

```bash
python scripts/prepare_data.py
```

Saves data/vocab.json and data/dataset_stats.json.
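Both cached files are plain JSON; the exact schema is whatever src/data/prepare.py wrote, so treat the field access here as illustrative:

```python
import json

# Peek at the cached artifacts produced by prepare_data.py.
with open("data/vocab.json") as f:
    vocab = json.load(f)
with open("data/dataset_stats.json") as f:
    stats = json.load(f)
print(f"vocab entries: {len(vocab)}")
print(f"dataset stats: {stats}")
```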
NOTE: Training requires a GPU. Make sure you installed requirements.txt (the GPU version).

```bash
python scripts/train.py
```

Resume training from a saved checkpoint:

```bash
python scripts/train.py --resume saved_models/checkpoint_epoch_005.pth
```
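During training, the decoder needs a causal mask so each position can only attend to earlier tokens. train_utils.py names this generate_square_subsequent_mask, which matches the standard PyTorch pattern; a sketch of that pattern (the repo's implementation may differ in detail):

```python
import torch

def generate_square_subsequent_mask(size: int) -> torch.Tensor:
    # Entries above the diagonal are -inf (blocked); on and below are 0.0 (allowed),
    # so position i attends only to positions <= i.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

print(generate_square_subsequent_mask(4))
```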
Run inference from the command line:

```bash
# Provide the full path to your image inside quotes
python scripts/predict.py --image "C:/Users/YourName/Pictures/your_image.jpg"
```

Start the API server:
```bash
python scripts/run_api.py
```

Then open frontend/index.html by double-clicking it in your file explorer.
Make sure the API URL box in the frontend shows http://127.0.0.1:8000 before uploading an image.
Swagger API docs: http://127.0.0.1:8000/docs
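You can also call the endpoint from a Python script; a minimal client sketch (requests is not in requirements.txt, so install it separately):

```python
import requests  # pip install requests

# POST an image to the running server; mirrors what the web UI does.
with open("photo.jpg", "rb") as f:
    resp = requests.post("http://127.0.0.1:8000/caption", files={"file": f})
print(resp.json())  # e.g. {"caption": "..."}
```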
Docker uses requirements-cpu.txt automatically, so no GPU is needed on the Docker machine.
```text
Your Machine
|-- saved_models/   <- your trained model (mounted into container)
|-- data/           <- vocab + stats files (mounted into container)
|
|-- Docker Container
    |-- Python 3.10
    |-- PyTorch 2.2.0 (CPU)
    |-- All packages from requirements-cpu.txt
    |-- runs: python scripts/run_api.py
        |
        v
    API at http://localhost:8000
```
Download Docker Desktop from: https://www.docker.com/products/docker-desktop/
Verify it works:

```bash
docker --version
docker ps
```

Download the trained model weights from Google Drive:
https://drive.google.com/file/d/1ySPlVzzgmCZdO4S2xRXUmvV-meOVcCYk/view?usp=sharing
After downloading, place the file here:

```text
saved_models/
|-- image_caption_model_final.pth   <- put the downloaded file here
```
Also make sure these files exist (built by prepare_data.py):

```text
data/vocab.json
data/dataset_stats.json
```
Build the image:

```bash
docker build -t image-captioning .
```

This takes 10-15 minutes the first time (it downloads Python and installs packages).

Start the container:

```bash
docker-compose up
```

To stop:

```bash
docker-compose down
```

Double-click frontend/index.html in your file explorer.
Make sure the API URL shows http://127.0.0.1:8000 then upload an image.
| File | Purpose |
|---|---|
| Dockerfile | builds the image using python:3.10-slim and requirements-cpu.txt |
| docker-compose.yml | mounts data/ and saved_models/ and runs the container |
| .dockerignore | excludes large files from being copied into the image |
| requirements-cpu.txt | CPU-only packages used inside Docker |
Test the API with curl:

```bash
curl -X POST http://127.0.0.1:8000/caption \
  -F "file=@photo.jpg"
```

Response:

```json
{
  "caption": "a group of people standing in a field"
}
```

Notes:

- Always run all scripts from the project root directory, not from inside the scripts/ folder.
- The pretrained deprecation warning from torchvision is harmless - the model loads correctly.
- vocab.json and dataset_stats.json are cached after first prepare_data.py run - no need to rerun.
- A checkpoint is saved after every epoch inside saved_models/ so you can resume training anytime.
- For CPU-only inference the model automatically falls back to CPU - no code changes needed.
- Docker uses the CPU version of PyTorch - do not use Docker for training, only for inference and API.
Author: Muhammad Rehman Ashraf

Built with PyTorch, torchvision, FastAPI, and MS COCO.