RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

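The authors open with two motivations for this work. First, although grid features have been shown to be effective for VQA, flattening them discards spatial information. Second, in the decoder stage of image captioning, prior work treats every word equally, yet some words carry rich visual cues, which raises the question of how to give those words more weight. To address these two problems, the authors propose two modules, both modeled in the spirit of multi-head attention, and reach SOTA on multiple image captioning benchmarks.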
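To make the first motivation concrete, here is a minimal sketch (not the authors' code) of why flattening an H×W grid drops the 2D layout, and how pairwise relative geometry features can carry it again. The helper names `grid_boxes` and `relative_geometry`, the unit-square box layout, the `eps` clamp, and the log-ratio form (the one commonly used for region geometry features) are all assumptions for illustration; the paper should be checked for its exact definition.

```python
import math

def grid_boxes(h, w):
    """Boxes (cx, cy, bw, bh) for each cell of an h x w grid over the unit
    square, listed in row-major order, i.e. the order flattening produces."""
    return [((c + 0.5) / w, (r + 0.5) / h, 1.0 / w, 1.0 / h)
            for r in range(h) for c in range(w)]

def relative_geometry(boxes, eps=1e-3):
    """Pairwise 4-d geometry feature between boxes i and j.

    Assumed form (hypothetical, borrowed from common region-feature practice):
    (log|dx|/w_i, log|dy|/h_i, log w_i/w_j, log h_i/h_j), with the offsets
    clamped to eps so that log(0) never occurs.
    """
    feats = []
    for (cxi, cyi, wi, hi) in boxes:
        row = []
        for (cxj, cyj, wj, hj) in boxes:
            row.append((
                math.log(max(abs(cxi - cxj) / wi, eps)),  # horizontal offset
                math.log(max(abs(cyi - cyj) / hi, eps)),  # vertical offset
                math.log(wi / wj),                        # width ratio
                math.log(hi / hj),                        # height ratio
            ))
        feats.append(row)
    return feats

boxes = grid_boxes(2, 3)          # 6 cells, flattened to a sequence
geo = relative_geometry(boxes)    # 6 x 6 table of 4-d vectors
```

In the flattened sequence, cells 0 and 3 are three positions apart, yet in the image they are vertically adjacent. The sequence index alone cannot express that; the vertical-offset component of `geo[0][3]` can.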

## Info
 - Main authors: Xuying Zhang (first author) and Rongrong Ji (senior author)
 - Affiliations: Xiamen University CV group and Tencent Youtu Lab
 - [Paper link](https://openaccess.thecvf.com/content/CVPR2021/papers/Zhang_RSTNet_Captioning_With_Adaptive_Attention_on_Visual_and_Non-Visual_Words_CVPR_2021_paper.pdf)

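## 1 New things I learned
I now have an intuitive understanding of the basic architecture currently used for image captioning.
Both modules the authors propose are easy to come up with, but the motivation behind them is very good. In my own research, I should consider every aspect carefully and look for a good starting point like this.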

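## 2 What I learned from Related Work
How region features and grid features have been explored in cross-modal tasks, along with their respective strengths and weaknesses.
Also, which ideas the stronger previous image captioning works relied on.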


## 3 Experimental tasks (briefly describe if unfamiliar)
I am not familiar with the details; I only skimmed this part.


## 4 Other tasks worth trying, as far as I know
It would be worth testing how well this approach performs on image relationship text generation.

## 5 Good sentences
1. ***Nevertheless***, grid features are flattened when fed to a transformer model, which ***inevitably*** leads to the loss of spatial information, as shown in Figure 1(a).
2. The representation of visual features have ***gone through*** two main stages after the extensive application of deep learning.
3. ***Typically***, Anderson et al. [2] applied the pre-trained region feature to multimodal task and achieved excellent performance in both image captioning and visual question answering. 
4. ***After that***, region features have been extensively studied and become the ***de-facto*** standard for most vision and language tasks.
5. Recently, Jiang et al. [15] ***revisited*** the grid features for VQA and discovered that grid features extracted from exactly the same layer of a pre-trained detector can ***perform competitively against*** their region-based counterparts and meanwhile solve several ***critical issues*** like time consuming, end-to-end training etc. 
6. Although the ***aforementioned*** transformer-based captioning models have achieved quite promising results, a serious problem still exists: all word sequences are coupled into high dimensional tensor, where visual and non-visual words are treated equally
7. Finally, we ***imitate*** the computation of region geometry features in [12, 9] to obtain the relative geometry features between two grids i and j
8. However, we found through experiments that memory information and hidden information ***are highly coupled*** for transformer decoder, ***resulting in a serious language bias***.
9. The word sequence features is first processed by word embedding and incorporated with word sequence position encoding before used as the input of the first decoder layer of the transformer model. 

