This repository implements a Face Retrieval system that leverages Metric Learning and textual descriptions to find similar faces in a dataset. It builds upon the idea that images and their corresponding textual attributes can be embedded into a shared space, enabling efficient retrieval based on semantic similarities.
- Content-Based Image Retrieval (CBIR): This technique retrieves images based on their content rather than metadata.
- Metric Learning: This approach learns an embedding space where similar images are mapped closer together based on a distance metric.
- Triplet Loss: A loss function commonly used in metric learning, enforcing a margin between a query image's distance to its positive counterpart and its distance to negative examples.
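The triplet objective described above can be sketched in PyTorch as follows (the margin value and Euclidean distance are illustrative; the repository may use a different formulation):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Euclidean distances from the query (anchor) to its counterparts
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy embeddings: the positive sits near the anchor, the negative far away
anchor   = torch.tensor([[1.0, 0.0]])
positive = torch.tensor([[0.9, 0.1]])
negative = torch.tensor([[-1.0, 0.0]])
loss = triplet_loss(anchor, positive, negative)  # zero: margin already satisfied
```

When the negative already lies well outside the margin, the loss is zero and that triplet contributes no gradient, which is why hard-negative mining is common in practice.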
The model consists of three main components:
- Image Encoder: Processes an image and extracts its visual features. (Implemented in the `ImageEncoder` class.)
- Text Encoder: Processes textual attributes and extracts their semantic representations. (Implemented in the `TextEncoder` class.)
- Triplet Loss Function: Calculates the loss based on the distances between the query image embedding, a positive image embedding with matching attributes, and negative image embeddings with different attributes. (Implemented in the `VisualHuntNetwork` class.)
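Assuming the encoders output fixed-size feature vectors, the three components above can be sketched roughly as below. The class name, layer sizes, and normalization are illustrative assumptions, not the repository's actual `VisualHuntNetwork` implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualHuntSketch(nn.Module):
    """Hypothetical sketch: projection layers into a shared embedding
    space plus a triplet loss over image embeddings."""
    def __init__(self, img_dim=512, txt_dim=384, embed_dim=128, margin=0.2):
        super().__init__()
        # Projection layers map encoder outputs into the shared space
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.loss_fn = nn.TripletMarginLoss(margin=margin)

    def embed_image(self, img_feats):
        return F.normalize(self.img_proj(img_feats), dim=-1)

    def embed_text(self, txt_feats):
        return F.normalize(self.txt_proj(txt_feats), dim=-1)

    def forward(self, query_feats, pos_feats, neg_feats):
        # Distance of the query to positive vs. negative images
        return self.loss_fn(
            self.embed_image(query_feats),
            self.embed_image(pos_feats),
            self.embed_image(neg_feats),
        )

model = VisualHuntSketch()
loss = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```

Normalizing embeddings to unit length keeps distances bounded, which tends to make the margin hyperparameter easier to tune.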
- `VisualHuntNetwork.py`: Defines the main network architecture, including the projection layers for image and text embeddings, the distance function, and the triplet loss calculation.
- Potentially other files (`ImageEncoder.py`, `TextEncoder.py`): These might contain the implementations for the image and text encoders, depending on the chosen architecture.
Dependencies:
- PyTorch
- SentenceTransformers
- HuggingFace
- Pandas
- NumPy
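The Pandas dependency suggests attribute tables are turned into the textual descriptions the Text Encoder consumes. A minimal sketch of that step, assuming CelebA-style binary attributes (the column names and phrasing are hypothetical):

```python
import pandas as pd

# Hypothetical attribute table: 1 = attribute present, -1 = absent
attrs = pd.DataFrame(
    {"Smiling": [1, -1], "Eyeglasses": [-1, 1], "Male": [1, 1]},
    index=["img_001.jpg", "img_002.jpg"],
)

def describe(row):
    # Keep only the attributes marked present and join them into a phrase
    present = [name.replace("_", " ").lower() for name, v in row.items() if v == 1]
    return "a face with attributes: " + ", ".join(present)

descriptions = attrs.apply(describe, axis=1)
print(descriptions["img_001.jpg"])  # a face with attributes: smiling, male
```

The resulting strings can then be embedded with a SentenceTransformers model to obtain the text-side vectors.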
Examples 1-3: retrieval result images (not included in this document).
- Clone the repository:

  ```
  git clone git@github.com:Ishan25j/Contrastive-VisualHunt.git
  ```

- Run the Colab notebook: Open the provided Colab notebook and follow the instructions for training and evaluation.
- Experiment with different distance metrics in the triplet loss.
- Explore more advanced image and text encoder architectures.
- Apply the model to larger and more diverse face datasets.
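As an example of the first direction, the Euclidean distance in the triplet loss can be swapped for cosine distance. This is a sketch of one possible variant, not code from the repository:

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    # Cosine distance = 1 - cosine similarity; smaller means more similar.
    # Unlike Euclidean distance, it ignores the embedding magnitudes.
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

a = torch.tensor([[1.0, 0.0]])
p = torch.tensor([[2.0, 0.1]])   # nearly the same direction as the anchor
n = torch.tensor([[0.0, 1.0]])   # orthogonal to the anchor
loss = cosine_triplet_loss(a, p, n)  # zero: the margin is already satisfied
```

PyTorch's `nn.TripletMarginWithDistanceLoss` accepts an arbitrary `distance_function`, so a custom metric can also be plugged into the standard loss module directly.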


