Paper
Flamingo: a Visual Language Model for Few-Shot Learning
Speaker
@SoongE
Key Points
- A powerful connection between a pre-trained vision model and a pre-trained language model
- Trained on interleaved visual and text data
- Handles arbitrary visual inputs via a Perceiver-based resampler
- Performs well on a wide range of tasks
Methods
Freezing the Vision and Language models
- Vision Encoder:
  - Trained with a contrastive objective, using a BERT model as the text encoder
  - Trained on ALIGN + LTIP, combined via gradient accumulation over the datasets
  - Fine-tuning or training from scratch instead of freezing results in a very large performance drop; the authors attribute this to catastrophic forgetting caused by replacing the pre-training objective (a minimal freezing sketch follows this block).
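A minimal sketch of the freezing setup, assuming PyTorch; `freeze_pretrained` is a hypothetical helper, not the authors' code. Only the newly added modules (resampler, gated cross-attention) would receive gradients.

```python
import torch.nn as nn

def freeze_pretrained(vision_encoder: nn.Module, language_model: nn.Module) -> None:
    """Freeze every parameter of the pretrained backbones (hypothetical helper)."""
    for module in (vision_encoder, language_model):
        module.eval()  # also disables dropout / batch-norm statistic updates
        for p in module.parameters():
            p.requires_grad = False
```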
- Perceiver Resampler:
  - Maps vision-encoder features of any input size to a fixed output shape
  - Uses a fixed set of learned latent queries
  - Experimentally outperforms standard attention (see the sketch below)
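A rough sketch of the Perceiver Resampler idea, assuming PyTorch; the class name, dimensions, and single-layer structure are illustrative (the actual module stacks several attention blocks and also lets the latents attend to themselves). The point is that a fixed number of learned latent queries cross-attend to a variable number of visual tokens, so the output shape never changes.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Toy resampler: variable-length visual features -> fixed number of tokens."""

    def __init__(self, dim: int = 512, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        # Fixed set of learned latent queries ("fixed shape of latent query").
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_visual_tokens, dim), where n_visual_tokens may vary
        batch = visual_feats.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.xattn(queries, visual_feats, visual_feats)
        out = attended + self.ff(attended)
        # (batch, num_latents, dim), regardless of how many visual tokens came in
        return out

# e.g. PerceiverResamplerSketch()(torch.randn(2, 197, 512)).shape == (2, 64, 512)
```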
- Gated cross-attention layers:
  - Tanh gating, similar to the gates in an LSTM (sketch below)
  - Has a normalizing, stabilizing effect on training
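A minimal sketch of the tanh gating, assuming PyTorch; names and shapes are illustrative, and the real GATED XATTN-DENSE block also gates a feed-forward path. The scalar gate is initialized to zero, so tanh(0) = 0 and the layer starts as an identity, preserving the frozen LM's behavior at the start of training; tanh keeps the gate bounded, much like LSTM gates.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionSketch(nn.Module):
    """Toy gated cross-attention: text tokens attend to visual tokens, scaled by tanh(alpha)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at 0 -> tanh(0) = 0 -> the block is a no-op at initialization.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.xattn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.alpha) * attended
```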
Training on a mixture of datasets
- Datasets are weighted according to their size and quality: M3W, ALIGN, LTIP, and VTP with weights λ_m of 1.0, 0.2, 0.2, and 0.03 respectively (see the sketch after this list)
- M3W: interleaved image-text
  - 43M HTML pages
- ALIGN and LTIP: image-text pairs
  - ALIGN: large but lower quality
  - LTIP: smaller but higher quality
- VTP: video-text pairs
  - 27M short videos, about 22 seconds each
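A sketch of the weighted training objective, assuming PyTorch-style training code; `compute_loss`, the batch dictionary, and the per-step structure are placeholders, not the authors' implementation. Each dataset contributes its loss scaled by its λ_m, and gradients are accumulated over all datasets before a single optimizer step.

```python
# Per-dataset mixture weights reported in the paper.
LAMBDAS = {"M3W": 1.0, "ALIGN": 0.2, "LTIP": 0.2, "VTP": 0.03}

def training_step(model, batches, compute_loss, optimizer):
    """One update: accumulate weighted gradients over all dataset batches, then step."""
    optimizer.zero_grad()
    for name, batch in batches.items():   # batches: {"M3W": ..., "ALIGN": ..., ...}
        loss = LAMBDAS[name] * compute_loss(model, batch)
        loss.backward()                    # gradients accumulate across datasets
    optimizer.step()
```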
Strengths and Weaknesses
- Strengths
  - Shows strong performance on many downstream tasks
- Weaknesses
  - Inherits all the side effects of the language model
  - Classification performance is worse than CLIP
  - Outside the few-shot setting, dedicated per-task models can perform better
  - The training datasets and the model itself are so large that a fair comparison is difficult


