Project for the Trends and Applications of Computer Vision course at the University of Trento, A.Y. 2024/2025
Developed by: Gheser Amir, Roman Simone and Mascherin Matteo
For this project, our goal is to explore advanced methods for open-vocabulary semantic segmentation (OVSS), which aims to segment images into regions defined by arbitrary textual concepts. We investigate two main approaches, SAN (Side Adapter Network) and SAM (Segment Anything Model), and compare their performance and adaptability on OVSS tasks.
The Side Adapter Network (SAN) is a lightweight framework designed for open-vocabulary semantic segmentation, leveraging CLIP's pre-trained vision-language capabilities. SAN models segmentation as a region recognition task by attaching a side network to CLIP with two branches: one for mask proposals and the other for attention bias, enabling CLIP-aware segmentation. Its end-to-end training maximizes adaptation to CLIP, ensuring accurate, efficient predictions. Compared to alternatives, SAN achieves state-of-the-art performance with up to 18x fewer parameters and 19x faster inference. It excels in resource efficiency while delivering high-quality segmentation across diverse datasets.
Segment Anything Model (SAM) is a state-of-the-art framework for object segmentation across diverse domains. To adapt SAM to the OVSS task, we propose a two-stage approach in which SAM acts as a class-agnostic mask generator and Alpha-CLIP classifies the resulting masks. Post-processing techniques, such as BBox filtering and background adjustments, refine the mask proposals to improve segmentation accuracy in open-vocabulary settings.
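The classification stage of the two-stage approach can be illustrated with a minimal sketch. It assumes that each mask proposal has already been encoded into an embedding (by Alpha-CLIP in this project) and each vocabulary word into a text embedding. The vectors below are toy values and the function names are hypothetical; they are not taken from the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_masks(mask_embeddings, text_embeddings, vocabulary):
    """Assign to each mask the vocabulary word whose embedding is most similar."""
    labels = []
    for mask_emb in mask_embeddings:
        scores = [cosine_similarity(mask_emb, t) for t in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append((vocabulary[best], scores[best]))
    return labels


# Toy data (assumption): three words and two mask embeddings in a 3-D space.
vocabulary = ["sky", "tree", "road"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

predictions = classify_masks(mask_embeddings, text_embeddings, vocabulary)
for word, score in predictions:
    print(f"{word}: {score:.2f}")
```

In a real run the embeddings would come from the image and text encoders, and the vocabulary would be one of the caption-generated or label-based word lists.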
We conducted the following experiments to evaluate the performance of our pipeline:
- Pipeline Post-Processing Analysis: We tested the pipeline with several post-processing techniques to determine their impact on overall performance.
- Model Evaluation with Different Datasets and Vocabularies:
  - We explored the effectiveness of SAN (Side Adapter Network) and SAM (Segment Anything Model) across multiple datasets.
  - For each dataset, we used two different vocabulary sources:
    - Caption-generated Vocabulary: Derived from captions generated by the BLIP-2 model.
    - Label-based Vocabulary: Created from predefined dataset labels.
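As an illustration of the kind of post-processing step examined in the first experiment, here is a minimal sketch of a bounding-box filter. This README does not specify the actual filtering criterion, so the rule used here (drop masks whose bounding box covers less than a given fraction of the image) and all function names are assumptions made for the example.

```python
def mask_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of a binary mask, or None if it is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None
    return min(cols), min(rows), max(cols), max(rows)


def filter_masks_by_bbox(masks, min_area_ratio=0.01):
    """Keep masks whose bounding box covers at least min_area_ratio of the image.

    Assumed criterion for illustration; the project's pipeline may use a different one.
    """
    kept = []
    for mask in masks:
        bbox = mask_bbox(mask)
        if bbox is None:
            continue  # empty proposals are always discarded
        x0, y0, x1, y1 = bbox
        bbox_area = (x1 - x0 + 1) * (y1 - y0 + 1)
        image_area = len(mask) * len(mask[0])
        if bbox_area / image_area >= min_area_ratio:
            kept.append(mask)
    return kept


# Toy 4x4 masks: a 3x3 block, a single pixel, and an empty proposal.
large = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
tiny = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
empty = [[0] * 4 for _ in range(4)]

kept = filter_masks_by_bbox([large, tiny, empty], min_area_ratio=0.1)
print(len(kept), "mask(s) kept")
```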
Warning
To run the experiments you need Python 3.10.
To install all the dependencies, launch this command:
sh setup.sh
First, configure the 'datasets.yaml' file appropriately. Download the missing values and then run the following commands:
python download_dataset.py
# After manually getting the missing values
sh preprocess_dataset.sh
Note
Alpha-CLIP is only available through a Google Drive link, so you need to download it manually and place it in the 'models' folder.
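Before launching the experiments, it can help to verify the two manual requirements above. The following is a small optional sketch, not part of the repository: it checks the interpreter version and that the 'models' folder exists and is not empty. The function name is hypothetical, and because the checkpoint file name is not given here, the check looks only for the presence of any file.

```python
import sys
from pathlib import Path


def check_environment(models_dir="models", required_python=(3, 10), version=None):
    """Return a list of problems found; an empty list means all checks passed."""
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version != tuple(required_python):
        problems.append(
            f"Python {required_python[0]}.{required_python[1]} is required, "
            f"found {version[0]}.{version[1]}"
        )
    models_path = Path(models_dir)
    # The Alpha-CLIP checkpoint must be downloaded manually into this folder.
    if not models_path.is_dir() or not any(models_path.iterdir()):
        problems.append(f"'{models_dir}' is missing or empty")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```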
You can find examples of how to use SAN (Side Adapter Network) and SAM (Segment Anything Model) in the notebook directory. These examples demonstrate practical implementations and workflows for applying these models effectively.
To evaluate SAM using our pipeline, follow these steps:
- Browse the configs directory and select the configuration file that suits your dataset and vocabulary requirements.
- Launch the pipeline using the following command:
python sam_pipeline.py
To evaluate SAN on the ADE20K dataset using a custom vocabulary, follow these steps:
- First, you need to slightly change the inference_on_dataset function in the evaluator.py file inside detectron2 so that the SAN model can make predictions with a custom vocabulary. You can find the file at the following path: your_venv/lib/python3.10/site-packages/detectron2/evaluation/evaluator.py. Then change the inference_on_dataset function as shown in the following images:
- Browse the dataset/captions_val directory and select the vocabulary you want to test on (e.g., nouns_ade_filtered.pkl or nouns_coco_filtered.pkl, for nouns extracted and filtered from the ADE20K and COCO datasets, respectively).
- Launch the SAN evaluation from the SAN directory using the following command:
cd SAN
python eval_net.py --eval-only --config-file configs/san_clip_vit_res4_coco.yaml --vocabulary ../datasets/captions_val/nouns_ade_filtered.pkl OUTPUT_DIR ../output/[Name of the output folder] MODEL.WEIGHTS ../checkpoints/san_vit_b_16.pth DATASETS.TEST "('ade20k_full_sem_seg_val',)"
Adjust the --vocabulary parameter to the desired vocabulary file and the OUTPUT_DIR parameter to the desired output folder name. Take a look at the official SAN repository for more information.
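If you want to try a vocabulary of your own, the sketch below shows one way to write and read a .pkl file. It assumes that a vocabulary file is simply a pickled list of lowercase noun strings; this format is not documented here, so inspect one of the provided files (for example with load_vocabulary) before relying on it. The helper names and the output file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_vocabulary(words, path):
    """Normalize, de-duplicate, sort and pickle a list of nouns.

    Assumed format: a plain Python list of strings.
    """
    cleaned = sorted({w.strip().lower() for w in words if w.strip()})
    with open(path, "wb") as f:
        pickle.dump(cleaned, f)
    return cleaned


def load_vocabulary(path):
    """Load a pickled vocabulary so its structure can be inspected."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Example: write a tiny vocabulary to a temporary location and read it back.
vocab_path = Path(tempfile.gettempdir()) / "my_vocabulary.pkl"
save_vocabulary(["Tree", "sky ", "tree", ""], vocab_path)
print(load_vocabulary(vocab_path))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.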
For any inquiries, feel free to contact:
- Simone Roman - simone.roman@studenti.unitn.it
- Amir Gheser - amir.gheser@studenti.unitn.it
- Matteo Mascherin - matteo.mascherin@studenti.unitn.it


