Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/43607
Full metadata record
dc.contributor.author: Huang, Kesi
dc.contributor.author: Li, Tianxiao
dc.contributor.author: Xia, Qiqiang
dc.contributor.author: CHEN, Junhong
dc.contributor.author: Asim, Muhammad
dc.contributor.author: Liu, Wenyin
dc.date.accessioned: 2024-08-29T11:13:05Z
dc.date.available: 2024-08-29T11:13:05Z
dc.date.issued: 2024
dc.date.submitted: 2024-08-14T18:19:50Z
dc.identifier.citation: 2024 IEEE 14th International Conference on Electronics Information and Emergency Communication (ICEIEC), IEEE, p. 133-137
dc.identifier.isbn: 979-8-3503-6189-6
dc.identifier.uri: http://hdl.handle.net/1942/43607
dc.description.abstract: Referring Video Object Segmentation (R-VOS) demands precise visual comprehension and sophisticated cross-modal reasoning to segment objects in videos from natural-language descriptions. To address this challenge, we introduce the Cross-modal Spectral Fusion Model (CSF). Our model incorporates a Multi-Scale Spectral Fusion Module (MSFM), which enables robust global interactions between the visual and linguistic modalities, and a Consensus Fusion Module (CFM), which dynamically balances multiple prediction vectors based on text features and spectral cues for accurate mask generation. Additionally, a Dual-stream Mask Decoder (DMD) improves segmentation accuracy by capturing both local and global information through parallel processing. Evaluated on three datasets, CSF surpasses existing R-VOS methods, demonstrating its efficacy and its potential for advanced video understanding tasks.
dc.description.sponsorship: This work is supported by the National Natural Science Foundation of China (No. 91748107), the Special Research Fund (BOF) of Hasselt University (No. BOF23DOCBL11), and the Guangdong Innovative Research Team Program (No. 2014ZT05G157). Chen Junhong was sponsored by the China Scholarship Council (No. 202208440309).
dc.language.iso: en
dc.publisher: IEEE
dc.rights: © 2024 IEEE
dc.subject.other: referring video object segmentation
dc.subject.other: cross-modal
dc.subject.other: multi-scale alignment
dc.title: Cross-modal Spectral Fusion Model for Referring Video Object Segmentation
dc.type: Proceedings Paper
local.bibliographicCitation.conferencedate: 24-25 May 2024
local.bibliographicCitation.conferencename: 14th International Conference on Electronics Information and Emergency Communication
local.bibliographicCitation.conferenceplace: Beijing, China
dc.identifier.epage: 137
dc.identifier.spage: 133
local.bibliographicCitation.jcat: C1
local.type.refereed: Refereed
local.type.specified: Proceedings Paper
dc.identifier.doi: 10.1109/ICEIEC61773.2024.10561688
local.provider.type: CrossRef
local.bibliographicCitation.btitle: 2024 IEEE 14th International Conference on Electronics Information and Emergency Communication (ICEIEC)
local.uhasselt.international: yes
item.fullcitation: Huang, Kesi; Li, Tianxiao; Xia, Qiqiang; CHEN, Junhong; Asim, Muhammad & Liu, Wenyin (2024) Cross-modal Spectral Fusion Model for Referring Video Object Segmentation. In: 2024 IEEE 14th International Conference on Electronics Information and Emergency Communication (ICEIEC), IEEE, p. 133-137.
item.accessRights: Restricted Access
item.fulltext: With Fulltext
item.contributor: Huang, Kesi
item.contributor: Li, Tianxiao
item.contributor: Xia, Qiqiang
item.contributor: CHEN, Junhong
item.contributor: Asim, Muhammad
item.contributor: Liu, Wenyin
Appears in Collections: Research publications
Files in This Item:
File: Cross-modal_Spectral_Fusion_Model_for_Referring_Video_Object_Segmentation.pdf (Restricted Access)
Description: Published version
Size: 2.34 MB
Format: Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.