This project explores deep learning architectures for semantic segmentation, including Fully Convolutional Networks (FCNs), U-Net, and SegFormer, and covers the implementation, training, and evaluation of these models on semantic segmentation tasks.
- Fully Convolutional Networks (FCNs): Replace fully connected layers with convolutional layers to output spatial maps of class predictions.
- U-Net: Uses an encoder-decoder structure with skip connections that retain spatial information for improved segmentation.
- SegFormer: A hierarchical transformer-based model that processes image patches and uses a lightweight Multi-Layer Perceptron (MLP) decoder for upscaling and classification.
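The encoder-decoder-with-skip-connections idea behind U-Net can be sketched in a few lines of PyTorch. This `MiniUNet` is illustrative only (one downsampling stage, made-up channel counts), not the project's actual model:

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Toy U-Net-style model: one encoder stage, one decoder stage, and a
    skip connection carrying full-resolution features to the decoder."""
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # halve H and W
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # restore resolution
        # Decoder sees upsampled features concatenated with the skip features.
        self.dec = nn.Conv2d(32, num_classes, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)                 # full-resolution features
        x = self.mid(self.down(skip))
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)    # skip connection
        return self.dec(x)                 # per-pixel class logits

logits = MiniUNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 3, 64, 64])
```

Note that the output keeps the input's spatial size, so each pixel gets a vector of class logits.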
- Downsampling: Reduces spatial dimensions for computational efficiency using pooling and strided convolutions.
- Upsampling: Restores feature maps to original resolution using interpolation or transposed convolutions.
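The four operations above can be compared directly in PyTorch (channel counts and sizes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)

# Downsampling: pooling vs. strided convolution (both halve H and W).
pooled = F.max_pool2d(x, kernel_size=2)                           # (1, 8, 16, 16)
strided = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)(x)  # (1, 8, 16, 16)

# Upsampling: interpolation (fixed rule) vs. transposed convolution (learned).
interp = F.interpolate(pooled, scale_factor=2, mode="bilinear",
                       align_corners=False)                        # (1, 8, 32, 32)
learned = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)(pooled)  # (1, 8, 32, 32)
```

Pooling and interpolation have no parameters; strided and transposed convolutions learn their filters during training.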
- Pre-training: Uses a general dataset to learn initial features.
- Fine-tuning: Adapts features for specific datasets, improving performance with limited data.
Trained on the OxfordIIITPet dataset:
- Pre-trained Encoder vs. Scratch Training:
  - The pre-trained model showed faster convergence and a higher mIoU.
  - Validation plots confirmed the advantage of pre-training.
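The dataset is available as `torchvision.datasets.OxfordIIITPet` with `target_types="segmentation"`. Its trimap masks label pixels 1 (pet), 2 (background), and 3 (border), so one common preprocessing step (sketched here; the helper name is illustrative) is shifting them to 0-based class indices for the loss:

```python
import torch

def trimap_to_target(mask: torch.Tensor) -> torch.Tensor:
    """Shift OxfordIIITPet trimap labels {1, 2, 3} to class indices {0, 1, 2}
    so the mask can feed nn.CrossEntropyLoss directly."""
    return (mask - 1).long()

fake_trimap = torch.tensor([[1, 2], [3, 2]])
print(trimap_to_target(fake_trimap))  # tensor([[0, 1], [2, 1]])
```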
Fine-tuned using the OxfordIIITPet dataset:
- Methods:
  - Freezing Encoder: encoder weights are frozen during training, so only the decoder is updated.
  - No Freezing: encoder weights are updated along with the decoder.
- Observations:
  - The "No Freezing" method showed better convergence.
  - Validation loss and mIoU plots indicated underfitting in the early stages.
- Downsampling and Upsampling: Essential for balancing computational load and prediction accuracy.
- Pre-training and Fine-tuning: Effective for tasks with limited data, though performance varies based on dataset similarity.
- Challenges: Fine-tuning can lead to underfitting when initial and target tasks differ significantly.







