Cross-modal Spectral Fusion Model for Referring Video Object Segmentation

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/43607

Title:	Cross-modal Spectral Fusion Model for Referring Video Object Segmentation
Authors:	Huang, Kesi; Li, Tianxiao; Xia, Qiqiang; CHEN, Junhong; Asim, Muhammad; Liu, Wenyin
Issue Date:	2024
Publisher:	IEEE
Source:	2024 IEEE 14th International Conference on Electronics Information and Emergency Communication (ICEIEC), IEEE, p. 133-137
Abstract:	Referring Video Object Segmentation (R-VOS) demands precise visual comprehension and sophisticated cross-modal reasoning to segment objects in videos based on descriptions from natural language. Addressing this challenge, we introduce the Cross-modal Spectral Fusion Model (CSF). Our model incorporates a Multi-Scale Spectral Fusion Module (MSFM), which facilitates robust global interactions between the modalities, and a Consensus Fusion Module (CFM) that dynamically balances multiple prediction vectors based on text features and spectral cues for accurate mask generation. Additionally, the Dual-stream Mask Decoder (DMD) enhances the segmentation accuracy by capturing both local and global information through parallel processing. Tested on three datasets, CSF surpasses existing methods in R-VOS, proving its efficacy and potential for advanced video understanding tasks.
Keywords:	referring video object segmentation; cross-modal; multi-scale alignment
Document URI:	http://hdl.handle.net/1942/43607
ISBN:	979-8-3503-6189-6
DOI:	10.1109/ICEIEC61773.2024.10561688
Rights:	2024 IEEE
Category:	C1
Type:	Proceedings Paper
Appears in Collections:	Research publications

File	Description	Size	Format
Cross-modal_Spectral_Fusion_Model_for_Referring_Video_Object_Segmentation.pdf (Restricted Access)	Published version	2.34 MB	Adobe PDF
ICEIEC+17_Cross-modal Spectral Fusion Model for Referring Video Object Segmentation.pdf	Peer-reviewed author version	1.39 MB	Adobe PDF
