Abstract: Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism to utilize the Contrastive Language ...