Overview

Our research interests mainly include multimedia analysis and computing, computer vision, machine learning, and artificial intelligence. Recently, we have been focusing on visual understanding via deep learning, e.g., video/image recognition, detection and segmentation, video/image captioning, and video/image question answering (QA). We also explore the vulnerability of deep learning methods and their robustness to adversarial attacks. Our goal is to further understand the vulnerability and interpretability of deep learning methods, which will provide theoretical evidence and methodological strategies for building safer and more reliable image semantic understanding systems.

Highlights

1. Object Detection in Practical Scenes: Domain Adaptation and Few Samples

1 Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization

Deng Li, Aming Wu, Yaowei Wang, Yahong Han
CVPR 2024, (Preprint), (Project Page)
In this paper, we propose a dynamic object-centric perception network based on prompt learning, aiming to adapt to variations in image complexity. Specifically, we propose an object-centric gating module based on prompt learning to focus attention on object-centric features guided by various scene prompts. Then, with the object-centric gating masks, a dynamic selective module selects highly correlated feature regions in both the spatial and channel dimensions, enabling the model to adaptively perceive object-centric relevant features and thereby enhancing its generalization capability. Experimental results on single-domain generalization tasks in image classification and object detection demonstrate the effectiveness and versatility of the proposed method.
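
To make the idea concrete, the minimal PyTorch sketch below shows a prompt-conditioned gating module that re-weights a backbone feature map along both the channel and spatial dimensions. It is an illustrative assumption of how such a module could look, not the released implementation; names such as PromptObjectGate, num_prompts, and prompt_dim are placeholders.

```python
# Illustrative sketch only: a prompt-conditioned object-centric gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptObjectGate(nn.Module):
    """Gates a feature map over channels and spatial positions, guided by learnable scene prompts."""
    def __init__(self, channels: int, num_prompts: int = 4, prompt_dim: int = 64):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_dim))  # learnable scene prompts
        self.ctx_proj = nn.Linear(channels, prompt_dim)
        self.channel_gate = nn.Linear(prompt_dim, channels)
        self.spatial_gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat.shape
        ctx = self.ctx_proj(feat.mean(dim=(2, 3)))            # global image context, (B, prompt_dim)
        attn = F.softmax(ctx @ self.prompts.t(), dim=-1)      # soft choice among scene prompts
        prompt = attn @ self.prompts                          # mixed prompt per image
        cg = torch.sigmoid(self.channel_gate(prompt)).view(b, c, 1, 1)  # channel-wise gate
        sg = torch.sigmoid(self.spatial_gate(feat))           # spatial gate, (B, 1, H, W)
        return feat * cg * sg                                 # emphasized object-centric features

gated = PromptObjectGate(256)(torch.randn(2, 256, 32, 32))    # same shape as the input
```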

2 Instance-Invariant Domain Adaptive Object Detection via Progressive Disentanglement

Aming Wu, Yahong Han, Linchao Zhu, Yi Yang
IEEE TPAMI, DOI:10.1109/TPAMI.2021.3060446, (Paper), (Project Page)
In this work, a progressive disentangled framework is proposed to solve domain adaptive object detection for the first time. In particular, based on disentangled learning for feature decomposition, we devise two disentangled layers to decompose domain-invariant and domain-specific features; the instance-invariant features are then extracted from the domain-invariant features. Finally, to enhance the disentanglement, a three-stage training mechanism with multiple loss functions is devised to optimize our model. The proposed method achieves excellent detection performance on night and foggy domain adaptive object detection in real road scenes under different weather conditions.
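
A minimal sketch of the core decomposition step is shown below, assuming a single disentangled layer that splits a feature map into domain-invariant and domain-specific parts and keeps the split faithful with a reconstruction term; the layer names and the exact losses of the three-stage training are simplified placeholders, not the paper's implementation.

```python
# Illustrative sketch only: one disentangled layer for feature decomposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleLayer(nn.Module):
    """Splits a backbone feature map into domain-invariant and domain-specific parts."""
    def __init__(self, channels: int):
        super().__init__()
        self.invariant_head = nn.Conv2d(channels, channels, 3, padding=1)
        self.specific_head = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat: torch.Tensor):
        f_di = self.invariant_head(feat)            # domain-invariant: fed to the detection head
        f_ds = self.specific_head(feat)             # domain-specific: dropped at test time
        recon_loss = F.mse_loss(f_di + f_ds, feat)  # keeps the decomposition faithful to the input
        return f_di, f_ds, recon_loss

f_di, f_ds, loss = DisentangleLayer(256)(torch.randn(2, 256, 64, 64))
```

In the actual framework, two such layers are applied progressively and optimized with the three-stage training mechanism described above.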

3 Vector-Decomposed Disentanglement for Domain-Invariant Object Detection

Aming Wu, Rui Liu, Yahong Han, Linchao Zhu, Yi Yang
ICCV 2021, (Preprint), (Project Page)
To improve the generalization of detectors in domain adaptive object detection (DAOD), a novel disentanglement method based on vector decomposition is proposed, for the first time, to separate domain-invariant representations from domain-specific ones. First, an extractor is devised to separate domain-invariant representations from the input, which are used to extract object proposals. Second, domain-specific representations are introduced as the difference between the input and the domain-invariant representations. Through this difference operation, the gap between the domain-specific and domain-invariant representations is enlarged, which encourages the domain-invariant representations to contain more domain-irrelevant information. The proposed method achieves outstanding performance on all-weather cross-domain and mixed severe-weather object detection, such as adaptation to fog, dusk-rain, and night-rain scenes on real road data.
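
The difference operation at the heart of the method can be sketched in a few lines of PyTorch: an extractor predicts the domain-invariant part, and the domain-specific part is defined as the residual between the input and that prediction. The module below is a simplified assumption, not the paper's full architecture.

```python
# Illustrative sketch only: vector-decomposed disentanglement via a residual.
import torch
import torch.nn as nn

class VectorDecompose(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.extractor = nn.Sequential(              # predicts the domain-invariant part
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feat: torch.Tensor):
        f_di = self.extractor(feat)      # domain-invariant representation, used for proposals
        f_ds = feat - f_di               # domain-specific = input minus domain-invariant
        return f_di, f_ds

f_di, f_ds = VectorDecompose(256)(torch.randn(2, 256, 64, 64))
```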

4 Universal-Prototype Enhancing for Few-Shot Object Detection

Aming Wu, Yahong Han, Linchao Zhu, Yi Yang
ICCV 2021, (Preprint), (Project Page)
In this paper, we explore how to enhance object features with intrinsic characteristics that are universal across different object categories in few-shot object detection (FSOD). We propose a new prototype, namely the universal prototype, that is learned from all object categories. Besides characterizing invariant characteristics, the universal prototypes alleviate the impact of unbalanced object categories. After enhancing object features with the universal prototypes, we impose a consistency loss to maximize the agreement between the enhanced features and the original ones, which is beneficial for learning invariant object characteristics. We thus develop a new framework of few-shot object detection with universal prototypes (FSODup) that offers the merit of feature generalization towards novel objects.
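
The sketch below illustrates, under simplified assumptions, how RoI features could be enhanced with a small set of category-agnostic (universal) prototypes and tied back to the originals with a consistency loss; the prototype count, fusion layer, and concrete loss are placeholders rather than the FSODup implementation.

```python
# Illustrative sketch only: universal-prototype enhancement with a consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalPrototypeEnhancer(nn.Module):
    def __init__(self, dim: int, num_prototypes: int = 16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))  # shared by all categories
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, roi_feats: torch.Tensor):
        attn = F.softmax(roi_feats @ self.prototypes.t(), dim=-1)   # (N, K) prototype weights
        proto_ctx = attn @ self.prototypes                          # prototype context per RoI
        enhanced = self.fuse(torch.cat([roi_feats, proto_ctx], dim=-1))
        consistency = F.mse_loss(enhanced, roi_feats.detach())      # agreement with original features
        return enhanced, consistency

enhanced, loss = UniversalPrototypeEnhancer(1024)(torch.randn(128, 1024))
```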

2. Adversarial Vision and Robustness: Towards AI Security

1 Query-efficient Black-box Adversarial Attack with Customized Iteration and Sampling

Yucheng Shi, Yahong Han, Qinghua Hu, Yi Yang, Qi Tian
IEEE TPAMI, DOI:10.1109/TPAMI.2022.3169802, (Project Page)
In this work, a new framework bridging transfer-based and decision-based attacks is proposed for query-efficient black-box adversarial attacks. We reveal the relationship between the current noise and the variance of sampling, the monotonicity of noise compression in decision-based attacks, and the influence of the transition function on the convergence of decision-based attacks. Guided by the new framework and theoretical analysis, we propose a black-box adversarial attack named Customized Iteration and Sampling Attack (CISA). CISA estimates the distance to the nearby decision boundary to set the step size, and uses a dual-direction iterative trajectory to find an intermediate adversarial example. Based on the intermediate adversarial example, CISA conducts customized sampling according to the noise sensitivity of each pixel to further compress the noise, and relaxes the state transition function to achieve higher query efficiency.
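
Two of CISA's ingredients, estimating the distance to the nearby decision boundary and sampling with per-pixel noise sensitivity, can be caricatured as follows. This is a toy NumPy sketch under strong simplifications (bisection along a fixed direction, a user-supplied is_adversarial oracle), not the CISA algorithm itself.

```python
# Toy sketch only: boundary-distance estimate and sensitivity-weighted sampling.
import numpy as np

def boundary_distance(is_adversarial, x, x_adv, tol=1e-3):
    """Bisection estimate of the distance from x to the decision boundary along x_adv - x."""
    lo, hi = 0.0, 1.0                                  # hi = 1 is the current adversarial example
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_adversarial(x + mid * (x_adv - x)):
            hi = mid                                   # still adversarial: boundary is closer to x
        else:
            lo = mid
    return hi * np.linalg.norm(x_adv - x)              # a step size can be set from this estimate

def sensitivity_sample(x_adv, sensitivity, step):
    """Draw a candidate whose per-pixel perturbation scales with estimated noise sensitivity."""
    noise = np.random.randn(*x_adv.shape) * sensitivity
    return x_adv + step * noise / (np.linalg.norm(noise) + 1e-12)
```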

2 Curls & Whey: Boosting Black-Box Adversarial Attacks

Yucheng Shi, Siyu Wang, Yahong Han
CVPR 2019, (Oral), (Paper), (Project Page)
Fourth place in both the Untargeted Attack Track and the Targeted Attack Track of the NIPS 2018 Adversarial Vision Challenge
In this work, we propose the Curls & Whey black-box attack to address two defects of existing iterative black-box attacks. During Curls iteration, by combining gradient ascent and descent, we ‘curl’ up iterative trajectories to integrate more diversity and transferability into adversarial examples. Curls iteration also alleviates the diminishing marginal effect of existing iterative attacks. The Whey optimization further squeezes the ‘whey’ of noise by exploiting the robustness of the adversarial perturbation. Extensive experiments on ImageNet and Tiny-ImageNet demonstrate that our approach achieves an impressive decrease in noise magnitude under the l2 norm. The Curls & Whey attack also shows promising transferability against ensemble models as well as adversarially trained models. In addition, we extend our attack to targeted misclassification, effectively reducing the difficulty of targeted attacks under the black-box condition.
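
A heavily simplified sketch of the two stages is given below: a ‘curled’ trajectory that first descends and then ascends the loss of a surrogate model, followed by a squeezing loop that shrinks the perturbation while the example stays adversarial. The step sizes, loss, and stopping rules are assumptions for illustration, not the paper's exact procedure.

```python
# Toy sketch only: a curled surrogate-gradient trajectory plus noise squeezing.
import torch
import torch.nn.functional as F

def curls_step(surrogate, x, y, eps=2 / 255, steps=10):
    """Bend the iterative trajectory by first descending, then ascending, the surrogate loss."""
    x_adv = x.clone()
    for direction in (-1.0, 1.0):                       # descent first, then ascent
        for _ in range(steps // 2):
            x_adv = x_adv.detach().requires_grad_(True)
            loss = F.cross_entropy(surrogate(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = (x_adv + direction * eps * grad.sign()).clamp(0, 1)
    return x_adv.detach()

def whey_squeeze(is_adversarial, x, x_adv, shrink=0.9, rounds=20):
    """Repeatedly pull the perturbation toward the clean image while it stays adversarial."""
    for _ in range(rounds):
        candidate = x + shrink * (x_adv - x)
        if is_adversarial(candidate):
            x_adv = candidate
    return x_adv
```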

3 Polishing Decision-based Adversarial Noise with a Customized Sampling

Yucheng Shi, Yahong Han, Qi Tian
CVPR 2020, (Paper)
In this paper, we demonstrate the advantage of using the current noise and historical queries to customize the variance and mean of sampling in boundary attacks to polish the adversarial noise. We further reveal the relationship between the initial noise and the compressed noise in boundary attacks. We propose the Customized Adversarial Boundary (CAB) attack, which uses the current noise to model the sensitivity of each pixel and polishes the adversarial noise of each image with a customized sampling setting. On the one hand, CAB uses the current noise as a prior belief to customize the multivariate normal distribution. On the other hand, CAB keeps new samplings away from historical failed queries to avoid similar mistakes. Experimental results on several image classification datasets demonstrate the validity of our method.
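
The sampling customization can be sketched as below: the per-pixel standard deviation follows the magnitude of the current noise, and candidates are nudged away from recent failed queries. The scale factors and history window are illustrative assumptions, not CAB's exact settings.

```python
# Toy sketch only: noise-aware sampling that avoids recent failed queries.
import numpy as np

def cab_sample(x, x_adv, failed_history, scale=0.1, repel=0.01):
    current_noise = x_adv - x
    sigma = scale * np.abs(current_noise)                    # per-pixel scale from the current noise
    candidate = x_adv + np.random.normal(0.0, sigma + 1e-8)  # customized normal sampling
    for bad in failed_history[-5:]:                          # stay away from recent failures
        candidate += repel * np.sign(candidate - bad)
    return np.clip(candidate, 0.0, 1.0)
```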

4 Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal

Yucheng Shi, Yahong Han, Yu-an Tan, Xiaohui Kuang
NeurIPS 2022, (Preprint), (Project Page)
In this paper, we theoretically analyze the limitations of existing decision-based attacks from the perspective of noise sensitivity differences between regions of the image, and propose a new decision-based black-box attack against ViTs, termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. PAR records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise initialization method for other decision-based attacks to improve the noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on three datasets demonstrate that PAR achieves a much lower noise magnitude with the same number of queries.
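
The coarse-to-fine patch removal can be sketched as a short recursive routine: wipe the noise on a patch if the image stays adversarial, otherwise split the patch and retry on its quarters. The patch sizes and the omitted query-value bookkeeping are simplifications of PAR, and is_adversarial stands in for the decision-based oracle.

```python
# Toy sketch only: coarse-to-fine patch-wise noise removal.
import numpy as np

def par_compress(is_adversarial, x, x_adv, patch=32, min_patch=8):
    h, w = x.shape[-2:]

    def compress(i0, j0, size):
        nonlocal x_adv
        trial = x_adv.copy()
        trial[..., i0:i0 + size, j0:j0 + size] = x[..., i0:i0 + size, j0:j0 + size]
        if is_adversarial(trial):
            x_adv = trial                                 # the whole patch's noise can be removed
        elif size > min_patch:                            # refine: split into four sub-patches
            half = size // 2
            for di in (0, half):
                for dj in (0, half):
                    compress(i0 + di, j0 + dj, half)

    for i in range(0, h, patch):
        for j in range(0, w, patch):
            compress(i, j, patch)
    return x_adv
```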

3. Vision-to-Language: Understanding and Reasoning

1 Catching the Temporal Regions-of-Interest for Video Captioning

Ziwei Yang, Yahong Han, Zheng Wang
ACM Multimedia 2017 (Oral Paper and Best Paper Presentation), (Paper), (Project Page)
Inspired by the insight that people tend to focus on certain regions of interest in video content, we propose a novel approach that automatically focuses on regions-of-interest and catches their temporal structures. In our approach, we utilize a specific attention model to adaptively select regions-of-interest for each video frame. A Dual Memory Recurrent Model (DMRM) is then introduced to incorporate the temporal structure of global features and regions-of-interest features in parallel, obtaining a rough understanding of the video content together with the particular information of the regions-of-interest. We evaluate our method for video captioning on the Microsoft Video Description Corpus (MSVD) and the Montreal Video Annotation Dataset (M-VAD). The experiments demonstrate that catching temporal regions-of-interest information substantially enhances the representation of the input videos.
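
A minimal sketch of the encoder side is given below, assuming per-frame region features are already extracted: an attention layer picks regions-of-interest in each frame, and two parallel recurrent memories run over the global and the attended regional features. The layer names and dimensions are placeholders, not the DMRM implementation.

```python
# Illustrative sketch only: region attention plus two parallel recurrent memories.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualMemoryEncoder(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.region_attn = nn.Linear(feat_dim, 1)
        self.global_rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.roi_rnn = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, global_feats, region_feats):
        # global_feats: (B, T, D) frame features; region_feats: (B, T, R, D) region features.
        w = F.softmax(self.region_attn(region_feats), dim=2)   # attend over the R regions per frame
        roi_feats = (w * region_feats).sum(dim=2)              # (B, T, D) regions-of-interest stream
        g_out, _ = self.global_rnn(global_feats)               # global memory
        r_out, _ = self.roi_rnn(roi_feats)                     # regions-of-interest memory
        return torch.cat([g_out, r_out], dim=-1)               # passed to the caption decoder

enc = DualMemoryEncoder(1024)(torch.randn(2, 20, 1024), torch.randn(2, 20, 8, 1024))
```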

2 Video Interactive Captioning with Human Prompts

Aming Wu, Yahong Han, Yi Yang
IJCAI 2019, (Paper), (Project Page)
As a video often includes rich visual content and semantic details, different people may be interested in different views, and a single generated sentence often fails to meet their ad hoc expectations. In this paper, we make a new attempt: we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks for a short prompt from the human as a clue to their expectation. Then, based on the prompt, the agent can generate a more accurate caption. We name this process a new task of video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise the ViCap agent, which consists of a video encoder, an initial caption encoder, and a refined caption generator. Experimental results not only show that the prompts help generate more accurate captions, but also demonstrate the good performance of the proposed method.
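
The interactive loop could be organized roughly as below: encode the video and the initial caption, take the human prompt as the first decoding clue, and generate the refined caption. The vocabulary size, fusion by summation, and greedy decoding are illustrative assumptions rather than the ViCap architecture.

```python
# Illustrative sketch only: a refined-caption generator conditioned on a human prompt.
import torch
import torch.nn as nn

class ViCapSketch(nn.Module):
    def __init__(self, vocab: int = 10000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.video_enc = nn.GRU(dim, dim, batch_first=True)      # video features projected to dim
        self.caption_enc = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, video_feats, init_caption, prompt, max_len: int = 20):
        _, v = self.video_enc(video_feats)                       # (1, B, dim) video summary
        _, c = self.caption_enc(self.embed(init_caption))        # (1, B, dim) initial-caption summary
        state = (v + c).squeeze(0)                               # fuse video and initial caption
        token = self.embed(prompt).mean(dim=1)                   # human prompt as the first clue
        logits = []
        for _ in range(max_len):
            state = self.decoder(token, state)
            step = self.out(state)
            logits.append(step)
            token = self.embed(step.argmax(-1))                  # greedy decoding for the sketch
        return torch.stack(logits, dim=1)                        # (B, max_len, vocab)
```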

3 Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

Bo Wang, Youjiang Xu, Yahong Han, Richang Hong
AAAI 2018, (Paper), (Project Page)
Winner of the MovieQA and The Large Scale Movie Description Challenge (LSMDC) @ ICCV 2017
Understanding movie stories through visual content alone is still a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with a Static Word Memory module and a Dynamic Subtitle Memory module, respectively. In particular, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations, which are learned from the LMN, not only encode the correspondence between words and the visual content inside frames, but also encode the temporal alignment between sentences and frames inside movie clips. We also extend our LMN model into three variant frameworks to illustrate its good extensibility. The strong performance demonstrates that the proposed LMN framework is effective and that the hierarchically formed movie representations hold good potential for movie question answering applications.
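
The two memory layers can be caricatured as two rounds of similarity-based memory addressing: frame features read from a word memory, and the pooled clip representation reads from a subtitle-sentence memory. Dimensions, pooling, and the answer-scoring step are simplified assumptions, not the LMN implementation.

```python
# Illustrative sketch only: frame-level word memory and clip-level subtitle memory.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayeredMemorySketch(nn.Module):
    def __init__(self, dim: int = 300):
        super().__init__()
        self.frame_proj = nn.Linear(dim, dim)
        self.clip_proj = nn.Linear(dim, dim)

    @staticmethod
    def attend(query, memory):
        w = F.softmax(query @ memory.t(), dim=-1)     # similarity-based memory addressing
        return w @ memory

    def forward(self, frame_feats, word_memory, sent_memory):
        # frame_feats: (T, dim); word_memory: (Nw, dim); sent_memory: (Ns, dim).
        frame_repr = frame_feats + self.attend(self.frame_proj(frame_feats), word_memory)
        clip_repr = frame_repr.mean(dim=0, keepdim=True)            # pool frames into a clip
        movie_repr = clip_repr + self.attend(self.clip_proj(clip_repr), sent_memory)
        return movie_repr                                           # matched against QA candidates

out = LayeredMemorySketch(300)(torch.randn(16, 300), torch.randn(500, 300), torch.randn(40, 300))
```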

4 Connective Cognition Network for Directional Visual Commonsense Reasoning

Aming Wu, Linchao Zhu, Yahong Han, Yi Yang
NeurIPS 2019, (Paper), (Project Page)
Recent studies in neuroscience suggest that brain function or cognition can be described as a global and dynamic integration of local neuronal connectivity, which is context-sensitive to specific cognitive tasks. Inspired by this idea, for visual commonsense reasoning (VCR), we propose a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity, contextualized by the meaning of questions and answers. Concretely, we first develop visual neuron connectivity to fully model correlations of visual content. Then, a contextualization process is introduced to fuse the sentence representation with that of visual neurons. Finally, based on the output of contextualized connectivity, we propose directional connectivity to infer answers or rationales. Experimental results on the VCR dataset demonstrate the effectiveness of our method.
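
A rough sketch of contextualized connectivity is shown below: the sentence representation gates the visual 'neurons', edge weights are computed among the gated features, and messages are integrated back into each neuron. The gating, normalization, and update are simplified assumptions and omit the directional inference step.

```python
# Illustrative sketch only: sentence-contextualized connectivity over visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConnectiveCognitionSketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.edge_ctx = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, visual, sentence):
        # visual: (N, dim) region features; sentence: (dim,) pooled question-and-answer vector.
        ctx = torch.tanh(self.edge_ctx(sentence))                         # contextual gate
        keys = visual * ctx                                               # contextualize each visual neuron
        adj = F.softmax(keys @ keys.t() / keys.shape[-1] ** 0.5, dim=-1)  # connectivity weights
        messages = adj @ visual                                           # integrate connected neurons
        return torch.tanh(self.update(torch.cat([visual, messages], dim=-1)))

out = ConnectiveCognitionSketch(512)(torch.randn(36, 512), torch.randn(512))
```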