Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/43606
Title: | SMVT: Spectrum-Driven Multi-scale Vision Transformer for Referring Image Segmentation
Authors: | Li, Tianxiao; Chen, Junhong; Huang, Yiheng; Huang, Kesi; Xia, Qiqiang; Asim, Muhammad; Liu, Wenyin
Issue Date: | 2024
Source: | Advanced Intelligent Computing Technology and Applications: Proceedings, Part VI, p. 193-206
Series/Report: | Lecture Notes in Computer Science
Series/Report no.: | 14867
Abstract: | Referring image segmentation is a challenging task at the intersection of computer vision and natural language processing, aiming to segment the object referred to by a natural language expression out of an image. Despite significant recent progress on this task, existing methods still struggle to integrate visual and language information effectively and to capture fine-grained details within images. These challenges stem primarily from the lack of a mechanism that can deeply and comprehensively fuse visual features with language features and effectively exploit cross-modal features. To address these problems, we propose the Spectrum-Driven Multi-scale Vision Transformer (SMVT), which incorporates two innovative designs: Spectrum-driven Fusion Attention (SFA) and the Cross-modal Feature Refinement Enhancement (CFRE) module. By guiding the fusion of visual and linguistic features in the spectral domain, SFA effectively captures fine-grained image features and heightens the model's sensitivity to local spectral information, thereby responding more accurately to the detail requirements in language descriptions. By refining and enhancing cross-modal features at different layers, the CFRE module improves their complementarity and the model's ability to capture fine-grained cross-modal features across layers, promoting precise alignment of visual and language features. Together, these two modules enable the SMVT to process visual and language information more effectively. Experiments on three benchmark datasets show that our method surpasses state-of-the-art approaches.
Keywords: | Referring image segmentation; Cross-modal learning; Spectrum-driven fusion
Document URI: | http://hdl.handle.net/1942/43606
ISBN: | 978-981-97-5596-7; 978-981-97-5597-4
DOI: | 10.1007/978-981-97-5597-4_17
ISI #: | 001307339400017
Rights: | © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. D.-S. Huang et al. (Eds.): ICIC 2024, LNCS 14867, pp. 193–206, 2024.
Category: | C1
Type: | Proceedings Paper
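The abstract's core idea of fusing visual and linguistic features "in the spectral domain" can be illustrated with a minimal sketch: transform a visual feature map with a 2-D FFT, modulate its frequency components with a language-conditioned gate, and transform back. This is a hypothetical NumPy illustration only; the function name `spectral_fusion`, the sigmoid gating scheme, and all dimensions are assumptions for illustration, not the paper's actual SFA design.

```python
# Minimal sketch of spectral-domain visual-language fusion (assumed design,
# not the published SFA module).
import numpy as np

def spectral_fusion(visual, lang):
    """Fuse a visual feature map with a language vector in the frequency domain.

    visual: (H, W, C) feature map; lang: (C,) sentence embedding.
    Returns a (H, W, C) fused feature map.
    """
    # 2-D FFT over the spatial dimensions of every channel.
    spec = np.fft.fft2(visual, axes=(0, 1))    # complex (H, W, C)
    # Language-conditioned per-channel gate in (0, 1); a simple stand-in
    # for a learned projection followed by a sigmoid.
    gate = 1.0 / (1.0 + np.exp(-lang))         # (C,)
    # Scale spectral components channel-wise (broadcast over H and W).
    spec = spec * gate
    # Return to the spatial domain; drop the negligible imaginary residue.
    return np.fft.ifft2(spec, axes=(0, 1)).real

# Toy usage with random stand-ins for backbone and text-encoder outputs.
rng = np.random.default_rng(0)
fused = spectral_fusion(rng.standard_normal((32, 32, 64)),
                        rng.standard_normal(64))
print(fused.shape)  # (32, 32, 64)
```

A learned variant would replace the fixed sigmoid gate with projections of the text embedding and could operate per frequency band rather than per channel, which is presumably where the "fine-grained" sensitivity described in the abstract comes from.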
Appears in Collections: | Research publications |
Files in This Item:
File | Description | Size | Format
---|---|---|---
978-981-97-5597-4_17.pdf (Restricted Access) | Published version | 1.82 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.