diff --git a/README.md b/README.md
index 9062b8d..4a48e38 100644
--- a/README.md
+++ b/README.md
@@ -1,119 +1,76 @@
 # DS598 DL4DS Midterm Project
 ## Introduction
-For this project, you will train a network to generate captions for the
-[VizWiz Image Captioning dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/).
-The images are taken by people who are blind and typically rely on
-human-based image captioning services. Your objective will be to beat a
-a baseline score on the [test set leaderboard](https://eval.ai/web/challenges/challenge-page/739/leaderboard/2006).
-## Developer Setup
+The project aims to provide image-to-caption services for blind people using Transformer technology. It employs the [blip-image-captioning-base model](https://huggingface.co/Salesforce/blip-image-captioning-base), fine-tuned on the [VizWiz Image Captioning dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/). The optimizer is [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) with a learning rate of 2e-5 and a weight decay of 5e-4. The model is set to train for up to 16 epochs, but training was stopped early at epoch 7 because the model overfits afterwards. The training and validation batch sizes are 6 and 32, respectively. The model achieved a CIDEr-D score of 75.37 on the [test dataset](https://eval.ai/web/challenges/challenge-page/739/leaderboard/2006).
-Clone this repo to your directory on the SCC DS598 project space, e.g.
-`/projectnb/ds598/students/`.
-
-Once you have a training script setup, create a shell script, e.g. `train.sh`,
-that loads and activates a conda environment and then runs your training
-script. An example shell script is below.
-
-```sh
-#!/bin/bash -l
-
-# Set SCC project
-#$ -P ds598
-
-# load and activate the academic-ml conda environment on SCC
-module load miniconda
-module load academic-ml/spring-2024
-conda activate spring-2024-pyt
-
-# Add the path to your source project directory to the python search path
-# so that the local `import` commands will work.
-export PYTHONPATH="/projectnb/ds598/students//:$PYTHONPATH"
-
-# Update this path to point to your training file
-python path/to/train.py
-
-# After updating the two paths above, run the command below from an SCC
-# command prompt in the same directory as this file to submit this as a
-# batch job.
-### qsub -pe omp 4 -P ds598 -l gpus=1 train.sh
-```
-
-Note that there are train and test scripts for the two folders already.
-
-## Run Example Scripts
-
-When you run the example scripts, make sure to add the path to the repo
-folder before running the script.
+## Dataset
-```export PYTHONPATH="/projectnb/ds598/path/to/folder:$PYTHONPATH"```
+The dataset used in this project is the VizWiz-Captions dataset, which includes 39,181 images sourced from individuals who are blind. Each image is accompanied by 5 descriptive captions.
-The example shell scripts include this command.
+Download the dataset from the [VizWiz Image Captioning dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/) website and update the `annotation_file` and `image_folder` paths in `src/base/dataset.py`.
+## Evaluation
-Set the paths in `src/base/constants.py` to the correct paths on your system.
+The VizWiz challenge evaluation refers to five different evaluation metrics, although it uses CIDEr-D as its primary metric.
-Follow the .sh files to run the code. As an example, to run the `cnnlstm_train.sh`
-script, you would run at the command prompt from the base of your local repo
-folder:
+The challenge also references the BLEU metric, but that metric has limitations, as described in [2] below.
-```sh
-$ qsub -pe omp 4 -P ds598 -l gpus=1 cnnlstm_train.sh
-Your job 5437870 ("cnnlstm_train.sh") has been submitted
-```
-As shown, you should get notification that your job was submitted and get a
-job ID number.
+### Validation Results
-You can check your job status by typing:
+At Epoch 7, the training loss was 1.3944. The performance scores for this epoch are as follows:
-```sh
-$ qstat -u
-ob-ID prior name user state submit/start at queue slots ja-task-ID
------------------------------------------------------------------------------------------------------------------
-5437870 0.00000 cnnlstm_tr tgardos qw 03/14/2024 09:40:24
-```
+| Metric | Score |
+|---------|---------|
+| BLEU-1 | 0.6757 |
+| BLEU-2 | 0.4938 |
+| BLEU-3 | 0.3489 |
+| BLEU-4 | 0.2419 |
+| **CIDEr** | **0.7261** |
-The above is showing the example output from user `tgardos`.
+Here are two examples of the model's predictions:
-## Dataset
+Good example:
-The dataset is downloaded to
-`/projectnb/ds598/materials/datasets/vizwiz/captions`. There is no need to
-download the dataset again and the path has already been defined in the
-accompanying code.
+![good example](https://i.postimg.cc/HWbHNZyJ/good-example.png)
-## Evaluation
+Bad example:
-In the VizWiz challenge evaluation they refer to five different evaluation
-metrics although they use CIDr-D as their primary evaluation.
+![bad example](https://i.postimg.cc/qqcTCqTc/bad-example.png)
-They reference the BLUE metric, but there are limitations to that metric as
-described in [2] below.
+### Test Results
-### Validation Results
+I submitted my test results to the VizWiz Image Captioning [Evaluation Server](https://eval.ai/web/challenges/challenge-page/739/overview). Here are the performance scores obtained:
-Validation set results are reported in the CNN-LSTM example and code for reporting validation results are in the demo model code.
+| Metric | Score |
+|---------|-------|
+| BLEU-1 | 68.49 |
+| BLEU-2 | 50.20 |
+| BLEU-3 | 35.68 |
+| BLEU-4 | 24.89 |
+| ROUGE-L | 48.51 |
+| METEOR | 22.06 |
+| **CIDEr** | **75.37** |
+| SPICE | 17.48 |
-### Test Results
+## Implementation Suggestions
-As is typically the case, the test dataset labels are withheld, and so the only way to get test results is to produce predicted captions and
-then submit them to the VizWiz Image Captioning [Evaluation Server](https://eval.ai/web/challenges/challenge-page/739/overview). There are
-scripts in both model directories to create the test submission file, although the demo model test script will have to be updated with model
-information.
+1. Explore trending image-to-text models on the [huggingface repository](https://huggingface.co/models?pipeline_tag=image-to-text&sort=trending) for alternatives, and feed dataset images into the reference API to evaluate the pre-trained models' outputs.
-Create an account on the [Evaluation Server](https://eval.ai/web/challenges/challenge-page/739/overview) and submit your test predictions
-to get your result.
+2. The default learning rates for optimizers such as SGD, Adam, and AdamW are too high for fine-tuning, potentially leading to similar outputs across different inputs. It is recommended to set the learning rate between 1e-5 and 5e-5.
-Step-by-step instructions will be added here shortly.
+## Limitations and Reflection
+1. Faced with challenges such as debugging empty predictions, CUDA version mismatches, limited computational resources, and long training times, I limited my fine-tuning experiments to a few models: [blip-image-captioning-base model](https://huggingface.co/Salesforce/blip-image-captioning-base), [blip-image-captioning-large model](https://huggingface.co/Salesforce/blip-image-captioning-large), and [git-base](https://huggingface.co/microsoft/git-base).
-State-of-the-art CIDEr-D scores on VizWiz Image Captioning is ~125. We're asking that you get a **minimum CIDEr-D test score of 50**.
+2. I did not try methods such as data augmentation and dropout, which could have improved the model's robustness and generalization.
 ## References
-
 1. [CIDEr: Consensus-based image description evaluation](https://ieeexplore.ieee.org/document/7299087)
 2. [BLEU: A Misunderstood Metric from Another Age](https://towardsdatascience.com/bleu-a-misunderstood-metric-from-another-age-d434e18f1b37), Medium Post
 3. [BLEU Metric](https://huggingface.co/spaces/evaluate-metric/bleu), HuggingFace space
+4. [image-to-text models](https://huggingface.co/models?pipeline_tag=image-to-text&sort=trending)
+5. [image_captioning](https://huggingface.co/docs/transformers/main/en/tasks/image_captioning)
+6. [BlipForConditionalGeneration](https://huggingface.co/docs/transformers/en/model_doc/blip#transformers.BlipForConditionalGeneration)
diff --git a/cnnlstm_test.sh b/cnnlstm_test.sh
index c329dda..6cc54c7 100644
--- a/cnnlstm_test.sh
+++ b/cnnlstm_test.sh
@@ -9,7 +9,7 @@
 module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH"
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH"
 #python -m spacy download en_core_web_sm # download spacy model
 python src/cnn_lstm/test.py
diff --git a/cnnlstm_train.sh b/cnnlstm_train.sh
index 37d48e8..43d500f 100644
--- a/cnnlstm_train.sh
+++ b/cnnlstm_train.sh
@@ -9,9 +9,9 @@
 module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH" # Set this!!!
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH" # Set this!!!
-python -m spacy download en_core_web_sm # download spacy model
+#python -m spacy download en_core_web_sm # download spacy model
 python src/cnn_lstm/train.py
 ### The command below is used to submit the job to the cluster
diff --git a/demo_test.sh b/demo_test.sh
index ec07167..52a6f21 100644
--- a/demo_test.sh
+++ b/demo_test.sh
@@ -9,9 +9,10 @@
 module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH" # Set this!!!
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH" # Set this!!!
 python src/demo_model/test.py
-### The command below is used to submit the job to the cluster
-### qsub -pe omp 4 -P ds598 -l gpus=1 git_test.sh
+### The commands below are used to submit the job to the cluster
+### qsub -pe omp 4 -P ds598 -l gpus=1 demo_test.sh
+### qsub -l gpus=1 -l gpu_c=7.0 -pe omp 8 demo_test.sh
diff --git a/demo_train.sh b/demo_train.sh
index b497ff3..ac4fdcc 100644
--- a/demo_train.sh
+++ b/demo_train.sh
@@ -9,9 +9,11 @@
 module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH"
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH"
+#python -m spacy download en_core_web_sm # download spacy model
 python src/demo_model/train.py
-### The command below is used to submit the job to the cluster
+### The commands below are used to submit the job to the cluster
 ### qsub -pe omp 4 -P ds598 -l gpus=1 demo_train.sh
+### qsub -l gpus=1 -l gpu_c=7.0 -pe omp 8 demo_train.sh
diff --git a/src/base/constants.py b/src/base/constants.py
index a2c80c1..9c6c5bb 100644
--- a/src/base/constants.py
+++ b/src/base/constants.py
@@ -5,7 +5,7 @@
 import spacy
 # set this path to where you want to save results
-BASE_DIR = "/projectnb/ds598/projects/tgardos/sp2024_midterm/"
+BASE_DIR = "/projectnb/ds598/students/lilinj/sp2024_midterm/"
 # Do not edit. This points to the dataset folder
 DATA_BASE_DIR = "/projectnb/ds598/materials/datasets/vizwiz/captions/"
diff --git a/src/demo_model/test.py b/src/demo_model/test.py
index 31c8690..6b7edd7 100644
--- a/src/demo_model/test.py
+++ b/src/demo_model/test.py
@@ -6,8 +6,8 @@
 from src.base.vizwiz_eval_cap.eval import VizWizEvalCap
 from dataset import DemoDataset
 from tqdm import tqdm
-from transformers import AutoProcessor
-from transformers import AutoModelForCausalLM
+from transformers import BlipProcessor
+from transformers import BlipForConditionalGeneration
 from PIL import Image
 import matplotlib.pyplot as plt
 import os
@@ -20,10 +20,11 @@
 create_directory(DEMO_SAVE_PATH + "/examples")
 # The path below points to the location where the model was saved
-MODEL_PATH = f"{DEMO_SAVE_PATH}/best_model"
+MODEL_PATH = f"{DEMO_SAVE_PATH}/best_model_0"
 # Load your fine tuned model
-model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, cache_dir=CACHE_DIR)
+#model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, cache_dir=CACHE_DIR)
+model = BlipForConditionalGeneration.from_pretrained(MODEL_PATH, cache_dir=CACHE_DIR)
 ## TODO
 # You can use the AutoProcessor.from_pretrained() method to load the HuggingFace
@@ -33,7 +34,9 @@
 #
 # Of course you should use the same model you trained with.
 try:
-    processor = AutoProcessor.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    #processor = AutoProcessor.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base", cache_dir=CACHE_DIR)
+
 except Exception as e:
     print("You need to pick a pre-trained model from HuggingFace.")
     print("Exception: ", e)
@@ -70,7 +73,7 @@
             {"image_id": img_id.item(), "caption": caption}
         ) # Used for VizWizEvalCap
-with open(DEMO_SAVE_PATH + "/test_captions.json", "w") as f:
+with open(DEMO_SAVE_PATH + "/test_captions_0.json", "w") as f:
     json.dump(caption_val, f, indent=4)
 print("Test captions saved to disk!!")
diff --git a/src/demo_model/train.py b/src/demo_model/train.py
index 6372bdf..597e8b5 100644
--- a/src/demo_model/train.py
+++ b/src/demo_model/train.py
@@ -1,30 +1,32 @@
 import torch
 from torch.utils.data import DataLoader, Dataset, Subset
 from torchvision import transforms
+import torch.optim as optim
 from src.base.constants import *
 from src.base.helpers import *
 from src.base.vizwiz_eval_cap.eval import VizWizEvalCap
 from dataset import DemoDataset ## This is a local import from dataset.pyA
 from tqdm import tqdm
-from transformers import AutoProcessor
-from transformers import AutoModelForCausalLM
+from transformers import BlipProcessor
+from transformers import BlipForConditionalGeneration
 from PIL import Image
 import matplotlib.pyplot as plt
 import os
 import json
+
 ################################################################################
 # This is template code that will not run as is since a model is not defined but
 # is has much of the infrastructure needed to fine-tune a model on the VizWiz
 # dataset.
-#
+#custom
 # At a minimum you will have to complete code indicated by TODO comments.
 ################################################################################
 CACHE_DIR = os.environ.get("TRANSFORMERS_CACHE")
 create_directory(DEMO_SAVE_PATH)
-create_directory(DEMO_SAVE_PATH + "/examples")
+create_directory(DEMO_SAVE_PATH + "/examples_0")
 ## TODO
 # You can use the AutoProcessor.from_pretrained() method to load the HuggingFace
@@ -32,7 +34,9 @@
 # to encode and decode text and images.
 # https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor
 try:
-    processor = AutoProcessor.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    #processor = AutoProcessor.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base", cache_dir=CACHE_DIR)
+
 except Exception as e:
     print("You need to pick a pre-trained model from HuggingFace.")
     print("Exception: ", e)
@@ -51,8 +55,8 @@
 )
 ### Use the Subset while debugging ###
-# train_dataset = Subset(train_dataset, range(100))
-# val_dataset = Subset(val_dataset, range(10))
+#train_dataset = Subset(train_dataset, range(100))
+#val_dataset = Subset(val_dataset, range(10))
 ### Since, subset is used above, the dataset object needs to be called with a .dataset, to access the original dataset. So while using the full dataset, the below is done. ###
 train_dataset = Subset(train_dataset, range(len(train_dataset)))
@@ -64,7 +68,7 @@
 print("SANITY CHECK DONE!!")
-train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
+train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=6)
 val_dataloader = DataLoader(val_dataset, shuffle=False, batch_size=32)
 ## TODO
@@ -72,17 +76,21 @@
 # model you want to fine-tune. This will allow you to use the model to train and evaluate
 # on the VizWiz dataset.
 try:
-    model = AutoModelForCausalLM.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    #model = AutoModelForCausalLM.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", cache_dir=CACHE_DIR)
+
 except Exception as e:
     print("You need to pick a pre-trained model from HuggingFace.")
     print("Exception: ", e)
 ## TODO Select your model optimizer
 try:
-    raise NotImplementedError("Select your model optimizer")
-    optimizer = None # pick one from torch.optim
+    # raise NotImplementedError("Select your model optimizer")
+    # optimizer = None # pick one from torch.optim
+    optimizer = torch.optim.AdamW(model.parameters(), lr=0.00002, betas=(0.9, 0.999), weight_decay=0.0005)
+
 except Exception as e:
     print("You need to pick an optimizer from torch.optim.")
     print("Exception: ", e)
 # Wrap the model with DataParallel only if more than one GPU is available
@@ -94,7 +102,8 @@
 method = "CIDEr" # method used for comparsions
-logger = Logger(f"{DEMO_SAVE_PATH}/logs.log")
+i="0" # change the logger path
+logger = Logger(f"{DEMO_SAVE_PATH}/logs_{i}.log") # modify for each model
 def train(loger, train_dataloader, model, optimizer, device, processor):
@@ -132,12 +141,16 @@ def evaluate(
     for idx, batch in enumerate(val_dataloader):
         image_ids = batch.pop("image_ids").to(device)
         pixel_values = batch.pop("pixel_values").to(device)
-
+
         with torch.no_grad():
             outputs = model.generate(pixel_values=pixel_values, max_length=50)
+        # debug when prediction is empty
+        # print("Raw Output:", outputs)
         # Decode the generated ids to text
         generated_captions = processor.batch_decode(outputs, skip_special_tokens=True)
+        # debug when prediction is empty
+        # print("Decoded Output:", generated_captions)
         # Store the generated captions
         for img_id, caption in zip(image_ids, generated_captions):
@@ -147,11 +160,13 @@ def evaluate(
             plot_captions_dict[img_id.item()] = caption # Used for plotting
     # Save the generated captions to a json file
-    with open(f"{save_path}/generated_captions.json", "w") as f:
+    # Change the path
+    with open(f"{save_path}/generated_captions_{i}.json", "w") as f:
         json.dump(caption_val, f, indent=4)
+    # Change the path
     vizwizRes = val_dataset.dataset.vizwiz.loadRes(
-        f"{save_path}/generated_captions.json"
+        f"{save_path}/generated_captions_{i}.json"
     )
     vizwizEval = VizWizEvalCap(val_dataset.dataset.vizwiz, vizwizRes)
     vizwizEval.evaluate()
@@ -160,7 +175,7 @@ def evaluate(
     for method in vizwizEval.eval:
         logger.info(f" Method: {method}, Score: {vizwizEval.eval[method]:.4f}")
-    return vizwizEval, vizwizRes, plot_captions_dict
+    return vizwizEval, vizwizRes, plot_captions_dict, model
 def get_val_examples(vizwizEval, vizwizRes, plot_captions_dict, epoch, method="CIDEr"):
@@ -212,18 +227,18 @@ def get_val_examples(vizwizEval, vizwizRes, plot_captions_dict, epoch, method="C
     # Save the images and captions
     save_image_captions(
-        best_img_and_captions, f"{DEMO_SAVE_PATH}/examples/epoch_{epoch}/best/"
+        best_img_and_captions, f"{DEMO_SAVE_PATH}/examples_0/epoch_{epoch}/best/"
     )
     save_image_captions(
-        worst_img_and_captions, f"{DEMO_SAVE_PATH}/examples/epoch_{epoch}/worst/"
+        worst_img_and_captions, f"{DEMO_SAVE_PATH}/examples_0/epoch_{epoch}/worst/"
     )
     save_image_captions(
-        first_3_img_and_captions, f"{DEMO_SAVE_PATH}/examples/epoch_{epoch}/first_3/"
+        first_3_img_and_captions, f"{DEMO_SAVE_PATH}/examples_0/epoch_{epoch}/first_3/"
     )
 best_score = 0
-for epoch in range(3):
+for epoch in range(16):
     print(f"Epoch: {epoch+1}")
     # Wrap the dataloader with tqdm for a progress bar
     progress_bar = tqdm(
@@ -233,10 +248,10 @@ def get_val_examples(vizwizEval, vizwizRes, plot_captions_dict, epoch, method="C
     # Train the model
     loss = train(logger, train_dataloader, model, optimizer, device, processor)
     logger.info(f"Loss at epoch {epoch}: {loss}")
-
+
     # Evaluate the model every 3 epochs
     if epoch % 3 == 0:
-        vizwizEval, vizwizRes, plot_captions_dict = evaluate(
+        vizwizEval, vizwizRes, plot_captions_dict, model = evaluate(
             logger,
             epoch,
             DEMO_SAVE_PATH,
@@ -249,7 +264,9 @@ def get_val_examples(vizwizEval, vizwizRes, plot_captions_dict, epoch, method="C
         score = vizwizEval.eval[method]
         if score > best_score:
             best_score = score
-            model.save_pretrained(f"{DEMO_SAVE_PATH}/best_model")
+            model.save_pretrained(f"{DEMO_SAVE_PATH}/best_model_0")
             logger.info(f"New best score: {best_score}. Model saved")
         get_val_examples(vizwizEval, vizwizRes, plot_captions_dict, epoch, method)
+
+
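Note on the training loop: the README describes stopping training at epoch 7 once the model began to overfit, while the loop in `src/demo_model/train.py` runs a fixed 16 epochs and only saves the best checkpoint. As a rough sketch, and not the author's method, that manual decision could be automated with a patience counter on the validation score. `EarlyStopper`, `patience`, and `min_delta` are hypothetical names introduced here for illustration, and the scores in the usage lines are made up rather than taken from this project.

```python
class EarlyStopper:
    """Signal when a validation score has stopped improving.

    Hypothetical helper for illustration; not part of this repository.
    """

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_evals = 0

    def step(self, epoch, score):
        """Record one validation score; return True when training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_evals = 0
            return False
        self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    # Made-up validation scores, one per evaluation (not results from this project).
    stopper = EarlyStopper(patience=2)
    for epoch, score in enumerate([0.50, 0.62, 0.70, 0.69, 0.68, 0.71]):
        if stopper.step(epoch, score):
            print(f"stop at epoch {epoch}, best epoch {stopper.best_epoch}")
            break
```

In the training loop, `stopper.step(epoch, score)` would be called right after `score = vizwizEval.eval[method]`, breaking out of the `for epoch` loop when it returns `True`. Because evaluation only runs every third epoch there, a patience of 2 would correspond to six training epochs without improvement.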