Add Multi-GPU Sigmoid-based Loss Training Pipeline #2
Open
tintindas wants to merge 1 commit into
Summary
This PR integrates a new distributed training script into the NOVUM framework that enables multi-GPU training with a Sigmoid-based loss.
The implementation uses torchrun for process management and torch.distributed for synchronization.
Key Changes
Distributed Training Setup
Added setup() and cleanup() helpers that initialize and destroy the torch.distributed process group using the nccl backend.
Automatically assigns devices per rank for multi-GPU scaling.
The training script is launched with torchrun.
Model & Feature Bank
Integrated NetE2E backbone wrapped in DistributedDataParallel.
Set up FeatureBank with support for SigLIP-based updates (forward_siglip).
Added checkpoint saving that includes model state, FeatureBank memory, and experiment metadata.
Loss Function
Introduced SigmoidLoss as the main criterion for contrastive learning between image features and feature bank embeddings.
Learnable parameters: t_prime (logit scale) and b (logit bias).
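To make the role of t_prime and b concrete, here is a minimal sketch of a SigLIP-style pairwise sigmoid loss between image features and feature bank entries. The class name `SigmoidLossSketch`, the `labels` argument (index of the positive bank entry for each feature), and the normalisation step are assumptions for illustration; the actual SigmoidLoss in this PR may differ. The initial values log(10) and -10 follow the SigLIP paper.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmoidLossSketch(nn.Module):
    """Pairwise sigmoid loss with a learnable logit scale and bias."""

    def __init__(self, init_t_prime: float = math.log(10.0), init_b: float = -10.0):
        super().__init__()
        self.t_prime = nn.Parameter(torch.tensor(init_t_prime))  # log of the logit scale
        self.b = nn.Parameter(torch.tensor(init_b))              # logit bias

    def forward(self, features: torch.Tensor, bank: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # features: [N, D], bank: [K, D], labels: [N] with values in [0, K)
        features = F.normalize(features, dim=-1)
        bank = F.normalize(bank, dim=-1)
        logits = features @ bank.t() * self.t_prime.exp() + self.b
        # +1 for the matching bank entry, -1 for every other (negative) entry.
        targets = -torch.ones_like(logits)
        rows = torch.arange(features.size(0), device=labels.device)
        targets[rows, labels] = 1.0
        # Each (feature, bank entry) pair is an independent binary problem.
        return -F.logsigmoid(targets * logits).sum() / features.size(0)
```

Because every pair is scored independently, the loss needs no softmax normalisation over the whole bank, which is the main practical difference from an InfoNCE-style criterion.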
Data Handling
Uses DistributedSampler to shard the dataset across GPUs.
Calls sampler.set_epoch(epoch) at the start of each epoch so that shuffling is reproducible and changes from one epoch to the next.
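The sharding behaviour can be illustrated without launching multiple processes by passing `num_replicas` and `rank` explicitly. The toy dataset, batch size, seed, and the helper names `make_loader` and `epoch_indices` below are placeholders, not values or functions from this PR.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset whose items are simply their own indices.
dataset = TensorDataset(torch.arange(10))


def make_loader(rank: int, world_size: int) -> DataLoader:
    """Build the loader one rank would use; each rank sees a disjoint shard."""
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
    )
    return DataLoader(dataset, batch_size=5, sampler=sampler)


def epoch_indices(loader: DataLoader, epoch: int) -> list:
    """Collect the items a rank sees in one epoch."""
    # set_epoch reseeds the shuffle; every rank must call it with the same epoch.
    loader.sampler.set_epoch(epoch)
    return [int(x) for (batch,) in loader for x in batch]
```

Calling `epoch_indices(make_loader(0, 2), 0)` and `epoch_indices(make_loader(1, 2), 0)` yields two disjoint halves of the dataset. If set_epoch is never called, the sampler reuses the same ordering in every epoch.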
Feature Bank Synchronization
After each epoch, synchronizes FeatureBank memory across GPUs via all_reduce.
Ensures consistency of negative sample pool across distributed processes.
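One common way to implement this kind of synchronization is to average the memory tensor across ranks. The sketch below assumes averaging is the intended reduction and that the memory is a single tensor; the description above does not say which reduction the PR uses, and `sync_memory` is a hypothetical helper name.

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def sync_memory(memory: torch.Tensor) -> torch.Tensor:
    """Average a feature-bank tensor in place across all ranks."""
    # In a single-process run there is nothing to synchronise.
    if not (dist.is_available() and dist.is_initialized()):
        return memory
    # Sum the copies held by every rank, then divide to obtain the mean.
    # With the nccl backend the tensor must live on the rank's GPU.
    dist.all_reduce(memory, op=dist.ReduceOp.SUM)
    memory /= dist.get_world_size()
    return memory
```

If the bank stores unit-norm embeddings, the averaged vectors are generally no longer unit-norm, so a re-normalisation step after the reduction may be needed.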
Checkpointing
Saves model + FeatureBank state every 5 epochs.
Includes timestamp and args for reproducibility.
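A checkpoint with these contents could be written as in the following sketch. The dictionary keys, the helper names `save_checkpoint` and `should_save`, and the zero-based epoch convention are assumptions for illustration rather than the PR's actual layout.

```python
import argparse
from datetime import datetime

import torch
import torch.nn as nn


def should_save(epoch: int, every: int = 5) -> bool:
    """True on every `every`-th epoch, assuming zero-based epoch indices."""
    return (epoch + 1) % every == 0


def save_checkpoint(model: nn.Module, bank_memory: torch.Tensor,
                    args: argparse.Namespace, epoch: int, path: str) -> None:
    """Write model weights, feature-bank memory and run metadata to `path`."""
    torch.save(
        {
            "epoch": epoch,
            # For a DDP-wrapped model the keys carry a "module." prefix.
            "state": model.state_dict(),
            "memory": bank_memory.detach().cpu(),
            "args": vars(args),
            "timestamp": datetime.now().isoformat(timespec="seconds"),
        },
        path,
    )
```

In a multi-GPU run it is common to call such a helper from rank 0 only, so that the processes do not overwrite the same file.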
Validation
Verified training launches correctly with torchrun using 2 GPUs.
Confirmed metrics written to CSV files with headers.
Ensured FeatureBank synchronization produces consistent embeddings across ranks.
Confirmed that checkpoint files include the DDP-wrapped model state and the FeatureBank memory.