Skip to content

seanward/ImageBind

Repository files navigation

This is the output of an experimental paper to code AI for the paper from: https://arxiv.org/pdf/2305.05665.pdf

Here is a condensed summary of the paper:

Problem and Motivation: The goal is to learn a single joint embedding space for multiple modalities (images, text, audio, depth, thermal, IMU) without requiring datasets where all modalities co-occur. Recent methods have aligned image and text embeddings, but are limited to those modalities. Acquiring datasets with all modalities is infeasible.

Approach: The key idea is to use images to bind the modalities together. ImageBind aligns each modality's embedding to image embeddings, using image-text data and naturally paired data like (video, audio). This results in an emergent alignment across modalities, enabling zero-shot tasks without training on that modality pair.

ImageBind:

  • Encodes each modality (images, text, audio, depth, thermal, IMU) with a Transformer.
  • Uses an InfoNCE loss to align image and modality embeddings.
  • Freezes CLIP image and text encoders, trains other encoders.
  • Uses image-text data and naturally paired data (video-audio, image-depth, image-thermal, video-IMU).

Experiments: Evaluates on emergent zero-shot classification, few-shot classification, retrieval, generation, and detection.

  • Emergent zero-shot: Matches/outperforms specialist models. Gains over CLIP show alignment.
  • Few-shot: Outperforms specialist models on audio and depth.
  • Retrieval: Outperforms methods like AudioCLIP, AVFIC. Combines modalities for gains.
  • Generation: Upgrades DALLE-2 to generate images from audio.
  • Detection: Upgrades Detic to detect objects from audio.

Ablations: Studies image encoder size, temperature, projection head, epochs, augmentation, alignment, capacity.

  • Image encoder size: Improves all modalities, especially non-visual.
  • Temperature: Higher better for depth/thermal/IMU, lower for audio.
  • Projection head: Linear better than MLP.
  • Epochs: More improves all modalities.
  • Augmentation: Stronger helps depth, weaker helps audio.
  • Alignment: Temporal/spatial alignment critical.
  • Capacity: Smaller encoder for depth, larger for audio.

Contributions and Limitations: Enables emergent zero-shot and cross-modal tasks. Sets SOTA on zero-shot benchmarks. Enriches vision models for non-visual tasks. Limitations include lack of task specialization and biases from web data/datasets.

Implementation Details (PyTorch):

  • Modality Encoders: Vision Transformer (ViT) for images/videos, ViT for audio/depth/thermal, Transformer for text/IMU.
  • Image/Text Encoder: Pretrained OpenCLIP (ViT-H, text encoder). Freeze during training.
  • Other Encoders: ViT-B for audio/depth/thermal, train.
  • Loss: InfoNCE, temperature 0.07, batch size 512-2000.
  • Train: 16-32 epochs. AdamW, LR 3e-4, weight decay 1e-2.

Missing Details: The paper compared with the open source codebase available at [https://github.com/facebookresearch/ImageBind]:

  • The codebase provides a SimpleTokenizer class to tokenize text inputs based on a BPE vocabulary file. The paper does not mention how text inputs are tokenized or what vocabulary is used.
  • The codebase uses torchaudio and pytorchvideo libraries to process audio and video inputs respectively. The paper does not specify what libraries or tools are used for audio and video processing.
  • The codebase uses various data transformations such as resizing, cropping, normalizing, and waveform2melspec to prepare the inputs for the model. The paper does not describe the details of these data transformations or their parameters.

About

WIP AI Generated Implementation of Imagebind

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages