Official PyTorch implementation of "Learning a Universal Attention Refinement Module for CLIP-based Open-Vocabulary Segmentation".
Open-vocabulary segmentation aims to segment novel categories that are not seen during training. This project introduces an Attention Refinement Module (ARM) that significantly enhances CLIP-based open-vocabulary segmentation performance by effectively aggregating multi-level visual features from the CLIP encoder.
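Since the code is not yet released, the sketch below only illustrates the general idea of attention-weighted aggregation of multi-level features; the module name, layer choices, and shapes are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionRefinementModule(nn.Module):
    """Hypothetical sketch: fuse token maps from several CLIP encoder blocks
    by scoring each level per token and softmax-normalizing across levels."""

    def __init__(self, dim: int):
        super().__init__()
        # One scalar score per token per level (assumed gating scheme)
        self.score = nn.Linear(dim, 1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, N, D) token maps from different CLIP blocks
        x = torch.stack(feats, dim=1)   # (B, L, N, D)
        w = self.score(x)               # (B, L, N, 1)
        w = w.softmax(dim=1)            # attention over the L levels
        return (w * x).sum(dim=1)       # (B, N, D) refined features

arm = AttentionRefinementModule(dim=512)
feats = [torch.randn(2, 196, 512) for _ in range(4)]
out = arm(feats)
print(out.shape)  # torch.Size([2, 196, 512])
```

The output keeps the token resolution of a single level, so it can replace the final CLIP feature map in a downstream segmentation head.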
We evaluate our method on standard semantic segmentation benchmarks:
| Dataset | Classes | Type | Download |
|---|---|---|---|
| PASCAL VOC 2012 | 21 (with background) | Indoor/Outdoor | Official |
| ADE20K | 150 | Scene Parsing | Official |
| COCO 2014 | 81 (with background) | Instance/Semantic | Official |
| PASCAL Context | 59/60/459 | Scene Understanding | Link |
| COCO-Stuff | 172 | Stuff Segmentation | GitHub |
| ADE20K-847 | 847 | Fine-grained | Official |
Organize datasets as follows:
```
data/
├── VOCdevkit/
│   └── VOC2012/
│       ├── JPEGImages/
│       ├── SegmentationClass/
│       └── ImageSets/
├── ADE20K/
│   ├── images/
│   └── annotations/
├── coco14/
│   ├── images/
│   └── annotations/
└── ...
```
Alternatively, update the dataset paths in `config.py` to match your own directory structure.
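A hypothetical shape for such a path configuration is shown below; the variable names `DATA_ROOT` and `DATASET_PATHS` are illustrative and may differ from the keys actually used in this repo's `config.py`.

```python
# Illustrative dataset-path config; key names are assumptions.
DATA_ROOT = "data"

DATASET_PATHS = {
    "voc2012": f"{DATA_ROOT}/VOCdevkit/VOC2012",
    "ade20k": f"{DATA_ROOT}/ADE20K",
    "coco14": f"{DATA_ROOT}/coco14",
}

print(DATASET_PATHS["voc2012"])  # data/VOCdevkit/VOC2012
```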
This project builds upon the following excellent open-source works:
- CLIP - Contrastive Language-Image Pre-training by OpenAI
- CLIPer - CLIP-based segmentation framework
- SCLIP - Semantic CLIP segmentation approach
- CAT-Seg - Cost aggregation-based open-vocabulary segmentation method
- MaskCLIP - CLIP-based mask prediction
We sincerely thank the authors for their contributions to the community.
The complete source code and pre-trained weights will be released upon official acceptance of the paper. Stay tuned for updates!