On March 2, the paper acceptance results for the international academic conference CVPR 2022 were announced: 5 papers by teachers and students of the Gaoling School of Artificial Intelligence, Renmin University of China were accepted. The Conference on Computer Vision and Pattern Recognition (CVPR), organized by IEEE, is a top-tier conference in the field of computer vision and pattern recognition and is held annually. The 2022 edition is the 40th, and it will be held June 19-24 in New Orleans, Louisiana, in a hybrid online and offline format.
Paper Title: COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Authors: Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, Ji-Rong Wen
Corresponding Author: Zhiwu Lu
Paper Overview: Large-scale single-stream pre-training has achieved remarkable performance in image-text retrieval. Regrettably, it suffers from low inference efficiency due to its heavy attention layers. Recently, two-stream methods such as CLIP and ALIGN, with high inference efficiency, have also shown promising performance; however, they consider only instance-level alignment between the two streams (thus there is still room for improvement). To overcome these limitations, we propose a novel Collaborative Two-Stream vision-language pre-training model, termed COTS, for image-text retrieval by enhancing cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, we leverage two extra levels of cross-modal interaction in COTS: (1) Token-level interaction -- a masked vision-language modeling (MVLM) learning objective is devised without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. (2) Task-level interaction -- a KL-alignment learning objective is devised between the text-to-image and image-to-text retrieval tasks, where the probability distribution per task is computed over the negative queues used in momentum contrastive learning. Under a fair comparison setting, COTS achieves the highest performance among all two-stream methods and performance comparable to the latest single-stream methods while being 10,800x faster in inference. Importantly, COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
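The task-level interaction can be illustrated with a minimal sketch: each retrieval direction induces a probability distribution over the same set of negatives (via a temperature-scaled softmax over similarity scores), and a symmetric KL term pulls the two distributions together. The function names, the symmetric form, and the temperature value below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(scores, temperature=0.07):
    """Turn raw similarity scores into a probability distribution.
    The temperature value is a common contrastive-learning default, assumed here."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def task_level_alignment_loss(i2t_scores, t2i_scores, temperature=0.07):
    """Symmetrized KL between the image-to-text and text-to-image
    retrieval distributions, each computed over the same negatives
    (in COTS, those negatives come from the momentum queues)."""
    p = softmax(i2t_scores, temperature)
    q = softmax(t2i_scores, temperature)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

When the two tasks already rank the negatives identically, the loss vanishes; any disagreement between the two retrieval directions produces a positive penalty that the model is trained to reduce.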
Paper Title: Balanced Audio-visual Learning via On-the-fly Gradient Modulation
Authors: Xiaokang Peng*, Yake Wei*, Andong Deng, Dong Wang, Di Hu
Corresponding Author: Di Hu
Paper Overview: Multimodal learning helps us understand the world comprehensively by integrating different senses. Accordingly, multiple input modalities are expected to boost model performance, but we find that they are not fully exploited even when the multimodal model outperforms its uni-modal counterpart. Specifically, in this paper we point out that existing multimodal discriminative models, which apply a uniform objective to all modalities, can leave uni-modal representations under-optimized when another modality dominates in some scenarios, e.g., sound in a blowing-wind event, or vision in a drawing event. To alleviate this optimization imbalance, we propose on-the-fly gradient modulation, which adaptively controls the optimization of each modality by monitoring the discrepancy between their contributions to the learning objective. Further, dynamically changing Gaussian noise is introduced to avoid a possible generalization drop caused by the gradient modulation. As a result, we achieve considerable improvements over common fusion methods on different multimodal tasks, and this simple strategy can also boost existing multimodal methods, which illustrates its efficacy and versatility.
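The core mechanism can be sketched as follows: the gradient of the currently dominant modality is scaled down by a coefficient derived from the contribution ratio, while the weaker modality keeps its full update, and optional Gaussian noise is added to the modulated gradient. The coefficient shape (a `tanh`-based shrinkage), the function names, and the noise parameterization are simplifying assumptions for illustration, not the paper's exact equations.

```python
import math
import random

def modulation_coefficient(own_score, other_score, alpha=1.0):
    """Coefficient for one modality's gradient. When this modality's
    contribution (own_score) exceeds the other's, its update is shrunk;
    otherwise it is left untouched. alpha controls shrinkage strength."""
    ratio = own_score / other_score
    if ratio > 1.0:
        return 1.0 - math.tanh(alpha * ratio)  # dominant modality: coefficient < 1
    return 1.0  # weaker modality: full gradient

def modulated_update(grad, own_score, other_score, noise_std=0.0):
    """Scale a gradient vector by the modulation coefficient and add
    Gaussian noise (standing in for the paper's dynamically changing
    noise term; noise_std is a free knob in this sketch)."""
    k = modulation_coefficient(own_score, other_score)
    return [k * g + random.gauss(0.0, noise_std) for g in grad]
```

With `noise_std=0.0` the update is deterministic, which makes the modulation effect easy to inspect: only the modality that is currently "winning" has its optimization slowed down.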
Paper Title: Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Authors: Guangyao Li*, Yake Wei*, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu*
Corresponding Author: Di Hu
Paper Overview: In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that the dataset we built has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset are available at http://ayameyao.github.io/st-avqa/.
Paper Title: Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase
Authors: Huayi Tang, Yong Liu
Corresponding Author: Yong Liu
Paper Overview: Multi-view clustering has been shown to boost clustering performance by effectively mining the complementary information from multiple views. However, we observe that learning from data with more views is sometimes not guaranteed to achieve better clustering performance than learning from data with fewer views. To address this issue, we propose a general deep-learning-based framework to reduce the risk of clustering performance degradation caused by view increase. Concretely, the model is required to extract complementary information and discard meaningless noise by automatically selecting features. These two learning procedures are integrated into a unified framework by the proposed optimization objective. In theory, the empirical clustering risk of the proposed framework is no higher than that of learning from the data before the view increase and the data of the newly added single view. Moreover, with high probability, the expected clustering risk of the framework under a divergence-based loss is likewise no higher than that of those counterparts. Comprehensive experiments on benchmark datasets demonstrate the effectiveness and superiority of the proposed framework in achieving safe multi-view clustering.
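The intuition behind the automatic feature selection can be sketched with a simple gated fusion: each view carries a gate in [0, 1], and a gate driven toward zero suppresses a noisy new view so that adding it cannot degrade the fused representation below what the remaining views already provide. This is a toy illustration of the "discard meaningless noise" idea, not the paper's actual deep architecture; the function name and gating scheme are assumptions.

```python
def gated_fusion(view_features, gates):
    """Fuse per-view feature vectors with a per-view gate in [0, 1].
    A gate of 0 removes that view's contribution entirely, so a noisy
    newly added view can be effectively ignored. Returns the
    gate-weighted average of the views."""
    dim = len(view_features[0])
    fused = [0.0] * dim
    for feats, g in zip(view_features, gates):
        for i, f in enumerate(feats):
            fused[i] += g * f
    total = sum(gates)
    # Guard against all gates being zero (degenerate but possible input).
    return [f / total for f in fused] if total > 0 else fused
```

In the sketch, setting the second view's gate to 0 reproduces exactly the result of clustering on the first view alone, which is the "safe" fallback the framework's risk bounds formalize.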
Paper Title: Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Authors: Hongwei Xue*, Tiankai Hang*, Yanhong Zeng*, Yuchong Sun*, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo
Paper Overview: We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, neglecting the fact that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) it is the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) it is the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-training model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases of 38.5% R@1 on the zero-shot MSR-VTT text-to-video retrieval task and 53.6% on the high-resolution LSMDC dataset. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual manipulation and super-resolution tasks.