Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/43606
Title: | SMVT: Spectrum-Driven Multi-scale Vision Transformer for Referring Image Segmentation
Authors: | Li, Tianxiao; Chen, Junhong; Huang, Yiheng; Huang, Kesi; Xia, Qiqiang; Asim, Muhammad; Liu, Wenyin
Issue Date: | 2024
Source: | Advanced Intelligent Computing Technology and Applications: Proceedings, Part VI, p. 193-206
Series/Report: | Lecture Notes in Computer Science
Series/Report no.: | 14867
Abstract: | Referring image segmentation is a challenging task at the intersection of computer vision and natural language processing, aiming to segment the object referred to by a natural language expression out of an image. Despite significant recent progress on this task, existing methods still struggle to integrate visual and language information effectively and to capture fine-grained details within images. These challenges stem primarily from the lack of a mechanism that can deeply and comprehensively fuse visual features with language features and effectively exploit cross-modal features. To address these problems, we propose the Spectrum-Driven Multi-scale Vision Transformer (SMVT), which incorporates two innovative designs: Spectrum-driven Fusion Attention (SFA) and the Cross-modal Feature Refinement Enhancement (CFRE) module. By guiding the fusion of visual and linguistic features in the spectral domain, SFA effectively captures fine-grained image features and heightens the model's sensitivity to local spectral information, thereby responding more accurately to the detail requirements in language descriptions. By refining and enhancing cross-modal features at different layers, the CFRE module improves their complementarity and the model's ability to capture fine-grained cross-modal features across layers, promoting precise alignment of visual and language features. Together, these two modules enable the SMVT to process visual and language information more effectively. Experiments on three benchmark datasets show that our method surpasses state-of-the-art approaches.
Keywords: | Referring image segmentation; Cross-modal learning; Spectrum-driven fusion
Document URI: | http://hdl.handle.net/1942/43606
ISBN: | 978-981-97-5596-7; 978-981-97-5597-4
DOI: | 10.1007/978-981-97-5597-4_17
ISI #: | 001307339400017
Rights: | © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. D.-S. Huang et al. (Eds.): ICIC 2024, LNCS 14867, pp. 193–206, 2024.
Category: | C1
Type: | Proceedings Paper
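The abstract's core idea of fusing visual and linguistic features "in the spectral domain" can be illustrated with a minimal sketch: transform a visual feature map with a 2-D FFT, modulate its frequency components with a language-conditioned gate, and transform back. This is a hypothetical NumPy illustration only; the function name `spectral_fusion`, the sigmoid gating scheme, and all dimensions are assumptions for illustration, not the paper's actual SFA design.

```python
# Minimal sketch of spectral-domain visual-language fusion (assumed design,
# not the published SFA module).
import numpy as np

def spectral_fusion(visual, lang):
    """Fuse a visual feature map with a language vector in the frequency domain.

    visual: (H, W, C) feature map; lang: (C,) sentence embedding.
    Returns a (H, W, C) fused feature map.
    """
    # 2-D FFT over the spatial dimensions of every channel.
    spec = np.fft.fft2(visual, axes=(0, 1))    # complex (H, W, C)
    # Language-conditioned per-channel gate in (0, 1); a simple stand-in
    # for a learned projection followed by a sigmoid.
    gate = 1.0 / (1.0 + np.exp(-lang))         # (C,)
    # Scale spectral components channel-wise (broadcast over H and W).
    spec = spec * gate
    # Return to the spatial domain; drop the negligible imaginary residue.
    return np.fft.ifft2(spec, axes=(0, 1)).real

# Toy usage with random stand-ins for backbone and text-encoder outputs.
rng = np.random.default_rng(0)
fused = spectral_fusion(rng.standard_normal((32, 32, 64)),
                        rng.standard_normal(64))
print(fused.shape)  # (32, 32, 64)
```

A learned variant would replace the fixed sigmoid gate with projections of the text embedding and could operate per frequency band rather than per channel, which is presumably where the "fine-grained" sensitivity described in the abstract comes from.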
Appears in Collections: | Research publications |
Files in This Item:
File | Description | Size | Format
---|---|---|---
978-981-97-5597-4_17.pdf (Restricted Access) | Published version | 1.82 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.