[Survey, Paper Review] ViT-Adapter, flash attention, ...... #40

@sghong977

Vision Transformer Adapter for Dense Predictions

Info.

Summary

  • plain ViT
    • tends to perform poorly on dense prediction because it lacks inductive bias (weak prior assumptions)
    • for a general-purpose model, the transformer structure is essential, since it supports masked data modeling and multi-modal pre-training
    • but vision-specific models still outperform plain transformers -> an adapter can be a solution
  • adapter: trained on large-scale multi-modal data
    • once training is done, no fine-tuning is needed for downstream tasks
  • applies transformers to a variety of vision tasks
    • such as instance, semantic, and panoptic segmentation, visual grounding, detection...
  • achieves SOTA without using an external dataset
    • 😮 Is the performance simply due to the strong adapter having already been pre-trained on a multi-modal dataset?
    • => The authors address this point: they compare models under a fair pre-training strategy.

Questions before reading the paper

  • Is the "adapter" concept the same as the one in NLP? https://intelligentcm.tistory.com/340

    • Yes! The authors cite the NLP adapter paper in the introduction.
    • e.g., object detection on COCO val2017.
      [image: object detection on COCO val2017]
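  • The GitHub repo mentions applying flash attention, and I keep seeing this keyword lately. What is it?

    • Explanation: link. After reading it, it is simply a technique for computing attention more efficiently. I asked ChatGPT and Bard, and (unsurprisingly) the result is the same as regular attention.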
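
To make the "same result as regular attention" point concrete, here is a minimal pure-Python sketch. It is not the actual FlashAttention kernel (which is a fused GPU implementation); it only illustrates the block-wise "online softmax" idea that lets attention be computed tile by tile without materializing the full score row. The function names `naive_attention` and `tiled_attention` and the `block_size` parameter are my own, chosen for illustration.

```python
import math
import random


def naive_attention(Q, K, V):
    """Regular attention: softmax(q . K^T / sqrt(d)) V, with all scores of a row held at once."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same math, but K/V are visited block by block with a running max and running sum."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # max score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * dv          # unnormalized output, relative to running_max
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old max (factor is 0.0 for the first block).
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Small random example: 3 queries, 5 keys/values, dimension 4.
random.seed(0)
Q = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
K = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
V = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
diff = max_abs_diff(naive_attention(Q, K, V), tiled_attention(Q, K, V, block_size=2))
```

Here `diff` should differ from zero only by floating-point rounding: the tiled version changes the order of computation, not the quantity being computed.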
