Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/43607
Title: Cross-modal Spectral Fusion Model for Referring Video Object Segmentation
Authors: Huang, Kesi; Li, Tianxiao; Xia, Qiqiang; CHEN, Junhong; Asim, Muhammad; Liu, Wenyin
Issue Date: 2024
Publisher: IEEE
Source: 2024 IEEE 14th International Conference on Electronics Information and Emergency Communication (ICEIEC), IEEE, p. 133-137
Abstract: Referring Video Object Segmentation (R-VOS) demands precise visual comprehension and sophisticated cross-modal reasoning to segment objects in videos based on descriptions from natural language. Addressing this challenge, we introduce the Cross-modal Spectral Fusion Model (CSF). Our model incorporates a Multi-Scale Spectral Fusion Module (MSFM), which facilitates robust global interactions between the modalities, and a Consensus Fusion Module (CFM) that dynamically balances multiple prediction vectors based on text features and spectral cues for accurate mask generation. Additionally, the Dual-stream Mask Decoder (DMD) enhances the segmentation accuracy by capturing both local and global information through parallel processing. Tested on three datasets, CSF surpasses existing methods in R-VOS, proving its efficacy and potential for advanced video understanding tasks.
Keywords: referring video object segmentation; cross-modal; multi-scale alignment
Document URI: http://hdl.handle.net/1942/43607
ISBN: 979-8-3503-6189-6
DOI: 10.1109/ICEIEC61773.2024.10561688
Rights: 2024 IEEE
Category: C1
Type: Proceedings Paper
Appears in Collections: Research publications

Files in This Item:
File: Cross-modal_Spectral_Fusion_Model_for_Referring_Video_Object_Segmentation.pdf (Restricted Access)
Description: Published version
Size: 2.34 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.