
[20230413] Weekly VLM2 - Flamingo #4

Paper
Flamingo: a Visual Language Model for Few-Shot Learning

Speaker
@SoongE

Summary
(figure: overview screenshot)

Key Point

  • Powerful connection between a pre-trained Vision model and a pre-trained Language model
  • Trained on web-scale visual-and-text data
  • Handles variable-length visual input via the Perceiver (Resampler)
  • Performs well on many downstream tasks

Methods

  • Freeze the pre-trained Vision and Language models

    • Vision Encoder:
      • Trained with contrastive learning, paired with a BERT text encoder
      • Trained on ALIGN + LTIP using an accumulation scheme
    • Fine-tuning, or training from scratch, instead of freezing results in a very large performance drop. The authors attribute this to catastrophic forgetting that occurs when the learning objective changes.
  • Perceiver Resampler
    (figure: Perceiver Resampler diagram)

    • Returns a fixed output shape regardless of the vision input length
    • Uses a fixed-shape set of learned latent queries
    • Empirically outperforms plain attention
  • Gated Cross-Attention
    (figure: Gated Cross-Attention diagram)

    • Tanh gate, inspired by Long Short-Term Memory (LSTM) gating
      • Has a normalizing, stabilizing effect
  • Train on mixture of datasets

    • Each dataset is weighted according to its quantity and quality (M3W, ALIGN, LTIP and VTP with weights 𝜆𝑚 of 1.0, 0.2, 0.2 and 0.03 respectively.)
    • M3W: interleaved image-text
      • 43M HTML dataset
    • ALIGN and LTIP: image-text pairs
      • ALIGN: large but lower quality
      • LTIP: smaller but higher quality
    • VTP: video-text pair
      • 27M short videos, about 22 seconds each
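The Perceiver Resampler idea above can be sketched in a few lines: a fixed set of learned latent queries cross-attends to however many visual tokens arrive, so the output shape is constant. This is a minimal NumPy sketch, not the paper's implementation; the sizes (64 latents, d = 128) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resampler(visual_feats, latent_queries):
    """Cross-attend fixed learned latent queries to a variable-length
    sequence of visual features; output shape depends only on the
    number of latents, never on the input length."""
    d = latent_queries.shape[-1]
    scores = latent_queries @ visual_feats.T / np.sqrt(d)  # (latents, tokens)
    attn = softmax(scores, axis=-1)
    return attn @ visual_feats  # (latents, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 128))   # hypothetical: 64 learned queries
for n_tokens in (50, 300):             # variable visual sequence lengths
    feats = rng.normal(size=(n_tokens, 128))
    out = perceiver_resampler(feats, latents)
    print(out.shape)                   # always (64, 128)
```

This is why downstream cross-attention layers can assume a fixed number of visual tokens per image or video.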
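The tanh gate in the gated cross-attention layers can be illustrated with a toy residual update: the gate parameter starts at 0, so tanh(0) = 0 and the new layer is initially the identity, leaving the frozen LM untouched. A minimal sketch (the attention output is replaced by a dummy array for illustration):

```python
import numpy as np

def gated_cross_attention_block(text, visual_out, alpha):
    """Residual update gated by tanh(alpha). With alpha initialised
    to 0 the block is the identity, which stabilises early training;
    as alpha grows, visual information is gradually mixed in."""
    return text + np.tanh(alpha) * visual_out

text = np.ones((4, 8))
visual_out = np.full((4, 8), 5.0)  # stand-in for a cross-attention output
# At init (alpha = 0) the gate passes the text through unchanged.
print(np.allclose(gated_cross_attention_block(text, visual_out, 0.0), text))  # True
```

The bounded tanh output is also what gives the "normalizing" effect noted above: the visual contribution can never exceed the magnitude of the gate.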
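The dataset-mixture weighting reduces to a weighted sum of per-dataset losses, L = Σₘ 𝜆ₘ Lₘ, with the 𝜆ₘ values listed above. A small sketch with hypothetical per-dataset loss values:

```python
# Per-dataset mixture weights lambda_m from the notes above.
weights = {"M3W": 1.0, "ALIGN": 0.2, "LTIP": 0.2, "VTP": 0.03}

def mixture_loss(per_dataset_loss):
    """Total training loss as a lambda-weighted sum of per-dataset losses."""
    return sum(weights[name] * loss for name, loss in per_dataset_loss.items())

# Hypothetical loss values, just to show the weighting.
print(mixture_loss({"M3W": 2.0, "ALIGN": 1.0, "LTIP": 1.0, "VTP": 1.0}))  # 2.43
```

The dominant weight on M3W reflects that interleaved image-text data is the most important source for few-shot behaviour.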

Strengths and Weaknesses

  • Strengths
    • Strong performance on many downstream tasks
  • Weaknesses
    • Inherits all the side effects of the LM.
    • Classification performance is worse than CLIP.
    • Outside the few-shot setting, task-specific models can perform better.
    • The training dataset and the model itself are both very large, which makes fair comparison difficult.
