DL4DS · lilinSTART · Apr 3, 2024 · Apr 3, 2024 · Apr 5, 2024 · Apr 5, 2024
diff --git a/README.md b/README.md
@@ -1,119 +1,76 @@
 # DS598 DL4DS Midterm Project
 
 ## Introduction
-For this project, you will train a network to generate captions for the 
-[VizWiz Image Captioning dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/).
-The images are taken by people who are blind and typically rely on
-human-based image captioning services.  Your objective will be to beat a
-a baseline score on the [test set leaderboard](https://eval.ai/web/challenges/challenge-page/739/leaderboard/2006).
 
-## Developer Setup
+This project provides image captioning for people who are blind, using a Transformer-based model. It fine-tunes the [blip-image-captioning-base model](https://huggingface.co/Salesforce/blip-image-captioning-base) on the [VizWiz Image Captioning dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/). The optimizer is [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) with a learning rate of 2e-5 and a weight decay of 5e-4. Training was configured for up to 16 epochs but was stopped early at epoch 7 because the model began to overfit after that point. The training and validation batch sizes are 6 and 32, respectively. The model achieved a CIDEr-D score of 75.37 on the [test set](https://eval.ai/web/challenges/challenge-page/739/leaderboard/2006).
 
-Clone this repo to your directory on the SCC DS598 project space, e.g.
-`/projectnb/ds598/students/<userid>`.
-
-Once you have a training script setup, create a shell script, e.g. `train.sh`,
-that loads and activates a conda environment and then runs your training
-script. An example shell script is below.
-
-```sh
-#!/bin/bash -l
-
-# Set SCC project
-#$ -P ds598
-
-# load and activate the academic-ml conda environment on SCC
-module load miniconda
-module load academic-ml/spring-2024
-conda activate spring-2024-pyt
-
-# Add the path to your source project directory to the python search path
-# so that the local `import` commands will work.
-export PYTHONPATH="/projectnb/ds598/students/<userid>/<yourdir>:$PYTHONPATH"
-
-# Update this path to point to your training file
-python path/to/train.py
-
-# After updating the two paths above, run the command below from an SCC
-# command prompt in the same directory as this file to submit this as a
-# batch job.
-### qsub -pe omp 4 -P ds598 -l gpus=1 train.sh
-```
-
-Note that there are train and test scripts for the two folders already.
-
-## Run Example Scripts
-
-When you run the example scripts, make sure to add the path to the repo
-folder before running the script. 
+## Dataset
 
-```export PYTHONPATH="/projectnb/ds598/path/to/folder:$PYTHONPATH"```
+The dataset used in this project is the VizWiz-Captions dataset, which includes 39,181 images sourced from individuals who are blind. Each image is accompanied by 5 descriptive captions. 
 
-The example shell scripts include this command.
+Download the dataset from the [VizWiz Image Captioning dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/) page and update the `annotation_file` and `image_folder` paths in `src/base/dataset.py`.
 
+## Evaluation
 
-Set the paths in `src/base/constants.py` to the correct paths on your system.
+The VizWiz challenge evaluation reports five different metrics and uses CIDEr-D as the primary one.
 
-Follow the .sh files to run the code. As an example, to run the `cnnlstm_train.sh`
-script, you would run at the command prompt from the base of your local repo
-folder:
+The challenge also reports the BLEU metric, which has known limitations, as described in [2] below.
 
-```sh
-$ qsub -pe omp 4 -P ds598 -l gpus=1 cnnlstm_train.sh
-Your job 5437870 ("cnnlstm_train.sh") has been submitted
-```
-As shown, you should get notification that your job was submitted and get a 
-job ID number.
+### Validation Results
 
-You can check your job status by typing:
+At epoch 7, the training loss was 1.3944. The validation scores for this epoch are as follows:
 
-```sh
-$ qstat -u <userid>
-ob-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------
-5437870 0.00000 cnnlstm_tr tgardos      qw    03/14/2024 09:40:24 
-```
+| Metric  | Score   |
+|---------|---------|
+| BLEU-1  | 0.6757  |
+| BLEU-2  | 0.4938  |
+| BLEU-3  | 0.3489  |
+| BLEU-4  | 0.2419  |
+| **CIDEr**   | **0.7261**  |
 
-The above is showing the example output from user `tgardos`.
+Here are two examples of the model's predictions:
 
-## Dataset
+Good example:
 
-The dataset is downloaded to 
-`/projectnb/ds598/materials/datasets/vizwiz/captions`. There is no need to 
-download the dataset again and the path has already been defined in the 
-accompanying code.
+![good example](https://i.postimg.cc/HWbHNZyJ/good-example.png)
 
-## Evaluation
+Bad example:
 
-In the VizWiz challenge evaluation they refer to five different evaluation
-metrics although they use CIDr-D as their primary evaluation.
+![bad example](https://i.postimg.cc/qqcTCqTc/bad-example.png)
 
-They reference the BLUE metric, but there are limitations to that metric as
-described in [2] below.
+### Test Results
 
-### Validation Results
+I submitted my test predictions to the VizWiz Image Captioning [Evaluation Server](https://eval.ai/web/challenges/challenge-page/739/overview). The server reports scores scaled by 100 relative to the validation table above. Here are the scores obtained:
 
-Validation set results are reported in the CNN-LSTM example and code for reporting validation results are in the demo model code.
+| Metric  | Score |
+|---------|-------|
+| BLEU-1  | 68.49 |
+| BLEU-2  | 50.20 |
+| BLEU-3  | 35.68 |
+| BLEU-4  | 24.89 |
+| ROUGE-L | 48.51 |
+| METEOR  | 22.06 |
+| **CIDEr**   | **75.37** |
+| SPICE   | 17.48 |
 
-### Test Results
+## Implementation Suggestions
 
-As is typically the case, the test dataset labels are withheld, and so the only way to get test results is to produce predicted captions and
-then submit them to the VizWiz Image Captioning [Evaluation Server](https://eval.ai/web/challenges/challenge-page/739/overview). There are
-scripts in both model directories to create the test submission file, although the demo model test script will have to be updated with model 
-information.
+1. Explore trending image-to-text models on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=image-to-text&sort=trending) for alternatives, and feed dataset images into the reference API to see what the pre-trained models output before fine-tuning.
 
-Create an account on the [Evaluation Server](https://eval.ai/web/challenges/challenge-page/739/overview) and submit your test predictions
-to get your result.
+2. Default or commonly used optimizer learning rates (for example, 1e-3 for Adam and AdamW in PyTorch) are too high for fine-tuning and can cause the model to produce nearly identical captions for different inputs. A learning rate between 1e-5 and 5e-5 is recommended.
 
-Step-by-step instructions will be added here shortly.
+## Limitations and Reflection
+1. Because of challenges such as debugging empty predictions, CUDA version mismatches, limited computational resources, and long training times, my fine-tuning experiments were limited to a few models: [blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base), [blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large), and [git-base](https://huggingface.co/microsoft/git-base).
 
-State-of-the-art CIDEr-D scores on VizWiz Image Captioning is ~125. We're asking that you get a **minimum CIDEr-D test score of 50**.
+2. I did not try techniques such as data augmentation or dropout, which could have improved the model's robustness and generalization.
 
 ## References
-
 1. [CIDEr: Consensus-based image description evaluation](https://ieeexplore.ieee.org/document/7299087)
 2. [BLEU: A Misunderstood Metric from Another Age](https://towardsdatascience.com/bleu-a-misunderstood-metric-from-another-age-d434e18f1b37), Medium Post
 3. [BLEU Metric](https://huggingface.co/spaces/evaluate-metric/bleu), HuggingFace space
+4. [image-to-text models](https://huggingface.co/models?pipeline_tag=image-to-text&sort=trending)
+5. [image_captioning](https://huggingface.co/docs/transformers/main/en/tasks/image_captioning)
+6. [BlipForConditionalGeneration](https://huggingface.co/docs/transformers/en/model_doc/blip#transformers.BlipForConditionalGeneration)
 
 
 
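The README describes training for up to 16 epochs and stopping at epoch 7 once the model started to overfit. One common way to automate that decision is to track the best validation score and stop after a few epochs without improvement. The following is a minimal sketch, not the project's training loop: the `EarlyStopper` class, the `patience` parameter, and the score values are hypothetical and only illustrate the idea.

```python
class EarlyStopper:
    """Track the best validation score and signal when to stop training."""

    def __init__(self, patience=2):
        self.patience = patience          # epochs to wait without improvement
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Record this epoch's score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores for epochs 1..6 (not results from this project).
scores = [0.50, 0.60, 0.70, 0.69, 0.68, 0.67]

stopper = EarlyStopper(patience=2)
for epoch, score in enumerate(scores, start=1):
    # In a real loop, the checkpoint would be saved whenever the score improves.
    if stopper.update(epoch, score):
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {epoch}")
```

With a higher-is-better metric such as CIDEr, the checkpoint from `best_epoch` is the one to keep for test-time inference.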
diff --git a/cnnlstm_test.sh b/cnnlstm_test.sh
@@ -9,7 +9,7 @@ module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH"
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH"
 
 #python -m spacy download en_core_web_sm   # download spacy model
 python src/cnn_lstm/test.py

diff --git a/cnnlstm_train.sh b/cnnlstm_train.sh
@@ -9,9 +9,9 @@ module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH" # Set this!!!
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH" # Set this!!!
 
-python -m spacy download en_core_web_sm   # download spacy model
+#python -m spacy download en_core_web_sm   # download spacy model
 python src/cnn_lstm/train.py
 
 ### The command below is used to submit the job to the cluster

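In a shell script, a plain `PYTHONPATH=...` assignment reaches the Python process only if the variable is already exported in the environment, whereas `export` makes it visible to child processes unconditionally. The sketch below illustrates that difference. It assumes a POSIX `sh` is available and uses a hypothetical variable name, `DEMO_SEARCH_PATH`, so that it does not depend on the caller's real `PYTHONPATH`.

```python
import os
import subprocess

# Child command: print the variable as seen by a new process, or "unset".
CHILD = """sh -c 'echo "${DEMO_SEARCH_PATH-unset}"'"""


def run_script(script):
    """Run a small sh script in a clean environment and return its stdout."""
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}
    result = subprocess.run(["sh", "-c", script], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Plain assignment: the variable stays local to the script's own shell.
plain = run_script("DEMO_SEARCH_PATH=/tmp/proj\n" + CHILD)

# Exported assignment: the variable is passed on to child processes.
exported = run_script("export DEMO_SEARCH_PATH=/tmp/proj\n" + CHILD)

print("plain assignment ->", plain)
print("export           ->", exported)
```

If the conda environment or a loaded module has already exported `PYTHONPATH`, the plain assignment also propagates, which is why the difference can go unnoticed on some systems.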
diff --git a/demo_test.sh b/demo_test.sh
@@ -9,9 +9,10 @@ module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH" # Set this!!!
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH" # Set this!!!
 
 python src/demo_model/test.py
 
-### The command below is used to submit the job to the cluster
-### qsub -pe omp 4 -P ds598 -l gpus=1 git_test.sh
+### The commands below are used to submit the job to the cluster
+### qsub -pe omp 4 -P ds598 -l gpus=1 demo_test.sh
+### qsub -l gpus=1 -l gpu_c=7.0 -pe omp 8 demo_test.sh
diff --git a/demo_train.sh b/demo_train.sh
@@ -9,9 +9,11 @@ module load academic-ml/spring-2024
 conda activate spring-2024-pyt
 
 # Change this path to point to your project directory
-export PYTHONPATH="/projectnb/ds598/admin/tgardos/sp2024_midterm:$PYTHONPATH"
+export PYTHONPATH="/projectnb/ds598/students/lilinj/sp2024_midterm:$PYTHONPATH"
 
+#python -m spacy download en_core_web_sm   # download spacy model
 python src/demo_model/train.py
 
-### The command below is used to submit the job to the cluster
+### The commands below are used to submit the job to the cluster
 ### qsub -pe omp 4 -P ds598 -l gpus=1 demo_train.sh
+### qsub -l gpus=1 -l gpu_c=7.0 -pe omp 8 demo_train.sh
diff --git a/src/base/constants.py b/src/base/constants.py
@@ -5,7 +5,7 @@
 import spacy
 
 # set this path to where you want to save results
-BASE_DIR = "/projectnb/ds598/projects/tgardos/sp2024_midterm/"
+BASE_DIR = "/projectnb/ds598/students/lilinj/sp2024_midterm/"
 
 # Do not edit. This points to the dataset folder
 DATA_BASE_DIR = "/projectnb/ds598/materials/datasets/vizwiz/captions/"

diff --git a/src/demo_model/test.py b/src/demo_model/test.py
@@ -6,8 +6,8 @@
 from src.base.vizwiz_eval_cap.eval import VizWizEvalCap
 from dataset import DemoDataset
 from tqdm import tqdm
-from transformers import AutoProcessor
-from transformers import AutoModelForCausalLM
+from transformers import BlipProcessor
+from transformers import BlipForConditionalGeneration
 from PIL import Image
 import matplotlib.pyplot as plt
 import os
@@ -20,10 +20,11 @@
 create_directory(DEMO_SAVE_PATH + "/examples")
 
 # The path below points to the location where the model was saved
-MODEL_PATH = f"{DEMO_SAVE_PATH}/best_model"
+MODEL_PATH = f"{DEMO_SAVE_PATH}/best_model_0"
 
 # Load your fine tuned model
-model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, cache_dir=CACHE_DIR)
+#model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, cache_dir=CACHE_DIR)
+model = BlipForConditionalGeneration.from_pretrained(MODEL_PATH, cache_dir=CACHE_DIR)
 
 ## TODO
 # You can use the AutoProcessor.from_pretrained() method to load the HuggingFace
@@ -33,7 +34,9 @@
 #
 # Of course you should use the same model you trained with.
 try:
-    processor = AutoProcessor.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    #processor = AutoProcessor.from_pretrained("replace-with-model-choice", cache_dir=CACHE_DIR)
+    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base", cache_dir=CACHE_DIR)
+
 except Exception as e:
     print("You need to pick a pre-trained model from HuggingFace.")
     print("Exception: ", e)
@@ -70,7 +73,7 @@
         {"image_id": img_id.item(), "caption": caption}
     )  # Used for VizWizEvalCap
 
-with open(DEMO_SAVE_PATH + "/test_captions.json", "w") as f:
+with open(DEMO_SAVE_PATH + "/test_captions_0.json", "w") as f:
     json.dump(caption_val, f, indent=4)
 
 print("Test captions saved to disk!!")
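The test script collects one `{"image_id": ..., "caption": ...}` record per image and writes the list to a JSON file. A minimal sketch of that output format follows. The helper name `save_captions`, the example predictions, and the temporary output directory are hypothetical; the real script builds its records from the model's generated captions.

```python
import json
import os
import tempfile


def save_captions(predictions, path):
    """Write (image_id, caption) pairs as a JSON list of records."""
    records = [
        # Cast to built-in types so json can serialize them; the script does
        # the same with img_id.item() for tensor-valued ids.
        {"image_id": int(image_id), "caption": str(caption).strip()}
        for image_id, caption in predictions
    ]
    with open(path, "w") as f:
        json.dump(records, f, indent=4)
    return records


# Hypothetical predictions, for illustration only.
predictions = [(1, "a can of soup on a counter "), (2, "a computer screen")]

out_path = os.path.join(tempfile.mkdtemp(), "test_captions_0.json")
save_captions(predictions, out_path)

# Read the file back to confirm it round-trips as a list of records.
with open(out_path) as f:
    loaded = json.load(f)
print(loaded)
```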