Skip to content

Built a facial image retrieval application using ViT (ImageNet-21k) for image feature extraction and MPNet (base-v2) for text encoding.

Notifications You must be signed in to change notification settings

Ishan25j/Contrastive-VisualHunt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Face Retrieval with Metric Learning and Textual Descriptions

Introduction:

This repository implements a Face Retrieval system that leverages Metric Learning and textual descriptions to find similar faces in a dataset. It builds upon the idea that images and their corresponding textual attributes can be embedded into a shared space, enabling efficient retrieval based on semantic similarities.

Key Concepts:

  • Content-Based Image Retrieval (CBIR): This technique retrieves images based on their content rather than metadata.
  • Metric Learning: This approach learns an embedding space where similar images are mapped closer together based on a distance metric.
  • Triplet Loss: A loss function commonly used in metric learning, enforcing a margin between the distances of a query image to its positive and negative counterparts.

Model Architecture:

The model consists of three main components:

  • Image Encoder: Processes an image and extracts its visual features. (Implementation in ImageEncoder class)
  • Text Encoder: Processes textual attributes and extracts their semantic representations. (Implementation in TextEncoder class)
  • Triplet Loss Function: Calculates the loss based on the distances between the query image embedding, a positive image embedding with matching attributes, and negative image embeddings with different attributes. (Implemented in VisualHuntNetwork class)

Code Structure:

  • VisualHuntNetwork.py: Defines the main network architecture, including the projection layers for image and text embeddings, the distance function, and the triplet loss calculation.
  • Potentially other files (ImageEncoder.py, TextEncoder.py): These might contain the implementations for image and text encoders depending on the chosen architecture.

Dependencies:

  • PyTorch
  • SentenceTransformers
  • HuggingFace
  • Pandas
  • NumPy

Output:

Example 1:

1721683970141

Example 2:

1721683978284

Example 3:

1721683987200

Usage:

  1. Clone the repository:

    git clone git@github.com:Ishan25j/Contrastive-VisualHunt.git
    
  2. Run the Colab notebook: Open the provided Colab notebook and follow the instructions for training and evaluation.

Further Exploration:

  • Experiment with different distance metrics in the triplet loss.
  • Explore more advanced image and text encoder architectures.
  • Apply the model to larger and more diverse face datasets.

About

Built a facial image retrieval application using ViT (ImageNet-21k) for image feature extraction and MPNet (base-v2) for text encoding.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published