Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2024
- [ECCV] MambaIR: A Simple Baseline for Image Restoration with State-Space Model. Hang Guo*, Jinmin Li*, Tao Dai, and 3 more authors. Proceedings of the European Conference on Computer Vision (ECCV), 2024
Recent years have witnessed great progress in image restoration thanks to the advancements in modern deep neural networks, e.g., Convolutional Neural Networks and Transformers. However, existing restoration backbones are usually limited due to the inherent local reductive bias or quadratic computational complexity. Recently, the Selective Structured State Space Model, e.g., Mamba, has shown great potential for long-range dependency modeling with linear complexity, but it is still under-explored in low-level computer vision. In this work, we introduce a simple but strong benchmark model, named MambaIR, for image restoration. In detail, we propose the Residual State Space Block as the core component, which employs convolution and channel attention to enhance the capabilities of the vanilla Mamba. In this way, our MambaIR takes advantage of the local patch recurrence prior as well as channel interaction to produce restoration-specific feature representations. Extensive experiments demonstrate the superiority of our method; for example, MambaIR outperforms the Transformer-based baseline SwinIR by up to 0.36 dB, using similar computational cost but with a global receptive field.
- LCM: Locally Constrained Compact Point Cloud Model for Masked Point Modeling. Yaohua Zha, Naiqi Li, Yanzi Wang, and 6 more authors. arXiv, 2024
Pre-trained point cloud models based on Masked Point Modeling (MPM) have exhibited substantial improvements across various tasks. However, these models rely heavily on the Transformer, leading to quadratic complexity and a limited decoder, which hinders their practical application. To address this limitation, we first conduct a comprehensive analysis of existing Transformer-based MPM, emphasizing the idea that redundancy reduction is crucial for point cloud analysis. To this end, we propose a Locally constrained Compact point cloud Model (LCM) consisting of a locally constrained compact encoder and a locally constrained Mamba-based decoder. Our encoder replaces self-attention with our local aggregation layers to achieve an elegant balance between performance and efficiency. Considering the varying information density between masked and unmasked patches in the decoder inputs of MPM, we introduce a locally constrained Mamba-based decoder. This decoder ensures linear complexity while maximizing the perception of point cloud geometry information from unmasked patches with higher information density. Extensive experimental results show that our compact model significantly surpasses existing Transformer-based models in both performance and efficiency. In particular, our LCM-based Point-MAE model, compared to the Transformer-based model, improves performance by 2.24%, 0.87%, and 0.94% on the three variants of ScanObjectNN while reducing parameters by 88% and computation by 73%.
- [IJCAI] FreqFormer: Frequency-aware Transformer for Lightweight Image Super-resolution. Tao Dai, Jianping Wang, Hang Guo, and 3 more authors. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2024
Transformer-based models have been widely and successfully used in various low-level vision tasks, and have achieved remarkable performance in single image super-resolution (SR). Despite the significant progress in SR, Transformer-based SR methods (e.g., SwinIR) still suffer from heavy computation cost and a low-frequency preference, while ignoring the reconstruction of rich high-frequency information, hence hindering the representational power of Transformers. To address these issues, in this paper, we propose a novel Frequency-aware Transformer (FreqFormer) for lightweight image SR. Specifically, a Frequency Division Module (FDM) is first introduced to separately handle high- and low-frequency information in a divide-and-conquer manner. Moreover, we present a Frequency-aware Transformer Block (FTB) to extract both spatial frequency attention and channel transposed attention to recover high-frequency details. Extensive experimental results on public datasets demonstrate the superiority of our FreqFormer over state-of-the-art SR methods in terms of both quantitative metrics and visual quality.
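To make the divide-and-conquer idea in the FreqFormer abstract above concrete, here is a minimal sketch of one common way to split an image into low- and high-frequency parts: blur the image to get the low-frequency component and take the residual as the high-frequency component. This is only an illustration under stated assumptions. It is not the paper's Frequency Division Module, and the names `box_blur` and `split_frequencies` are hypothetical.

```python
def box_blur(image, radius=1):
    """Mean filter over a (2*radius+1)^2 window; windows are truncated at the borders."""
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            total, count = 0.0, 0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        total += image[rr][cc]
                        count += 1
            row.append(total / count)
        blurred.append(row)
    return blurred


def split_frequencies(image, radius=1):
    """Return (low, high) where low is the blurred image and high = image - low.

    Assumption: a box blur stands in for the low-pass filter; the paper's
    actual module may differ.
    """
    low = box_blur(image, radius)
    high = [
        [image[r][c] - low[r][c] for c in range(len(image[0]))]
        for r in range(len(image))
    ]
    return low, high


if __name__ == "__main__":
    img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    low, high = split_frequencies(img)
    print(low[1][1], high[1][1])
```

By construction the two parts sum back to the input, so each branch can be processed separately and recombined without losing information.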
2023
- AdaptIR: Parameter Efficient Multi-task Adaptation for Pre-trained Image Restoration Models. Hang Guo, Tao Dai, Yuanchao Bai, and 3 more authors. arXiv preprint, 2023
Pre-training has shown promising results on various image restoration tasks, and is usually followed by full fine-tuning for each specific downstream task (e.g., image denoising). However, such full fine-tuning usually suffers from heavy computational cost in practice, due to the massive number of parameters in pre-trained restoration models, thus limiting its real-world applications. Recently, Parameter Efficient Transfer Learning (PETL) has offered an efficient alternative to full fine-tuning, yet it still faces great challenges for pre-trained image restoration models, due to the diversity of degradations. To address these issues, we propose AdaptIR, a novel parameter efficient transfer learning method for adapting pre-trained restoration models. Specifically, the proposed method consists of a multi-branch inception structure to orthogonally capture local spatial, global spatial, and channel interactions. In this way, it allows powerful representations under a very low parameter budget. Extensive experiments demonstrate that the proposed method can achieve comparable or even better performance than full fine-tuning, while using only 0.6% of the parameters.
- [IJCAI] Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement. Hang Guo, Tao Dai, Guanghao Meng, and 1 more author. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2023
Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods.
- [ACM MM] One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer. Hang Guo, Tao Dai, Mingyan Zhu, and 4 more authors. In Proceedings of the ACM International Conference on Multimedia (MM), 2023
Recognizing characters from low-resolution (LR) text images poses a significant challenge due to the information deficiency as well as the noise and blur in low-quality images. Current solutions for low-resolution text recognition (LTR) typically rely on a two-stage pipeline that involves super-resolution as the first stage followed by the second-stage recognition. Although this pipeline is straightforward and intuitive, it has to use an additional super-resolution network, which causes inefficiencies during training and testing. Moreover, the recognition accuracy of the second stage heavily depends on the reconstruction quality of the first stage, causing ineffectiveness. In this work, we attempt to address these challenges from a novel perspective: adapting the recognizer to low-resolution inputs by transferring knowledge from high-resolution inputs. Guided by this idea, we propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Specifically, the visual focus loss is proposed to extract the character position knowledge with resolution gap reduction and character region focus, the semantic contrastive loss is employed to exploit the contextual semantic knowledge with contrastive learning, and the soft logits loss facilitates both local word-level and global sequence-level learning from the soft teacher label. Extensive experiments show that the proposed one-stage pipeline significantly outperforms super-resolution based two-stage frameworks in terms of effectiveness and efficiency, accompanied by favorable robustness.
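As a rough illustration of learning from a soft teacher label, as described in the abstract above, the sketch below implements the generic temperature-scaled distillation loss: the KL divergence between the teacher's and the student's softened class distributions for a single prediction. It is a textbook formulation and an assumption on our part, not the paper's soft logits loss (which also covers word-level and sequence-level terms). The function names and the default temperature are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def soft_logits_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: generic knowledge-distillation loss for one prediction;
    the loss used in the paper may be defined differently.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]  # hypothetical high-resolution teacher logits
    student = [3.0, 2.0, 1.0]  # hypothetical low-resolution student logits
    print(soft_logits_loss(student, teacher))
```

The loss is zero when the student matches the teacher's distribution and grows as the two diverge; a higher temperature exposes more of the teacher's relative preferences among non-top classes.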
2022
- [TCSVT] Gaze target estimation inspired by interactive attention. Zhengxi Hu, Kunxu Zhao, Bohan Zhou, and 4 more authors. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022
As an essential nonverbal cue, the human gaze reveals human intentions and plays a crucial role in daily human activities. Therefore, automatic detection of a person’s gaze target has drawn the interest of the computer vision community. This is useful not only for identifying whether children are attentive in class but also for locating items of interest to humans in retail settings. Existing gaze-following methods have only explored and exploited the scene context and the head cues. Considering the significance of human-object interaction in understanding human intentions, we present the Visual-Spatial Graph and introduce a graph attention network to analyze the interaction probability between the human and elements in the scene. The interaction probability, inferred from the visual-spatial information aggregated by the attention mechanism, can then be transformed into an interactive attention map that depicts the areas people care about. In addition, we construct a transformer as an encoder to integrate the features extracted by the scene and head pathways, aiming to decode the gaze target. After introducing interactive attention, our proposed method achieves outstanding performance on two benchmarks: GazeFollow and VideoAttentionTarget.
- [ACCV] MGTR: End-to-End Mutual Gaze Detection with Transformer. Hang Guo, Zhengxi Hu, and Jingtai Liu. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2022
People looking at each other, or mutual gaze, is ubiquitous in our daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods focus on two-stage pipelines, whose inference speed is limited by the two stages and whose second-stage performance is affected by the first stage. In this paper, we propose a novel one-stage mutual gaze detection framework called Mutual Gaze TRansformer (MGTR) to perform mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer the mutual gaze relationship based on global image information, which streamlines the whole process. Experimental results on two mutual gaze datasets show that our method is able to accelerate the mutual gaze detection process without losing performance. An ablation study shows that different components of MGTR can capture different levels of semantic information in images.