Vision Transformer Adapter for Dense Predictions
Info.
Summary
plain ViT
tends to perform poorly on dense prediction because it lacks vision-specific inductive biases (weak prior assumptions)
For a general-purpose model, the transformer structure is essential because it supports masked data modeling and multi-modal pre-training
but vision-specific models are stronger than plain transformers on vision tasks -> an adapter can be a solution
adapter: train on large-scale multi-modal data
after training is done, no need to fine-tune for downstream tasks
transformers for each of the various vision tasks
such as instance, semantic, panoptic segmentation, visual grounding, detection...
achieves SOTA without using an external dataset
😮 Is the strong performance simply due to the adapter already being pre-trained on a multi-modal dataset?
=> The authors address this point: they compare models under a fair pre-training strategy
Questions before reading the paper
is the "adapter" concept the same as NLP's? https://intelligentcm.tistory.com/340
Yes! The authors refer to the NLP adapter paper in the introduction section.
e.g., object detection on COCO val2017.
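The GitHub repo mentions applying flash attention, and I keep seeing this keyword lately. What is it?
Explanation link: after reading it, flash attention is simply a technique for making the attention computation more efficient. I asked ChatGPT and Bard, and (unsurprisingly) the output is identical to standard attention.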
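To make the "same result, cheaper computation" point concrete, here is a minimal sketch in plain Python. It is not the actual FlashAttention kernel, which is a fused GPU implementation that tiles over both queries and keys in on-chip memory. It only illustrates the underlying idea of processing keys and values block by block with a running (online) softmax, so the full score row is never stored at once. The function names `standard_attention` and `tiled_attention`, the block size, and the toy inputs are all assumptions made for illustration.

```python
import math


def standard_attention(q, k, v):
    """Reference softmax attention for a single query vector q."""
    scale = 1.0 / math.sqrt(len(q))
    # All scores are materialized at once.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in k]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(v[0])
    return [sum(w * val[j] for w, val in zip(weights, v)) / total
            for j in range(dim)]


def tiled_attention(q, k, v, block_size=2):
    """Same attention, computed one key/value block at a time (hypothetical helper)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(v[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max
    for start in range(0, len(k), block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, val in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * val[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy inputs (arbitrary values chosen for illustration).
q = [1.0, 0.5]
k = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

out_ref = standard_attention(q, k, v)
out_tiled = tiled_attention(q, k, v, block_size=2)
```

The two outputs agree up to floating-point rounding, which matches the note above: the gain is in memory traffic and speed, not in a different or approximate result.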