LIU Puyuan, ZOU Xiaochun, MA Shibo, TUO Wenyin. Cross-modal enhanced visual grounding based on graph attention[J]. Chinese Journal of Stereology and Image Analysis, 2025, 30(2): 158-166. DOI: 10.13505/j.1007-1482.2025.30.02.003

Cross-modal enhanced visual grounding based on graph attention

  • Visual grounding is a key technology at the intersection of computer vision and natural language interaction, supporting tasks such as image semantic understanding and the execution of human-computer interaction instructions. Its core is to locate the specific region of an image that a given textual description refers to. Existing research shows that directly matching the language and visual modalities is inherently limited in capturing the complex referential relationships within a sentence. Moreover, redundant visual noise in the cross-modal alignment stage further degrades feature matching accuracy, which makes it difficult to improve overall localization accuracy. This paper proposes a graph attention-based architecture for cross-modal alignment and fusion that lets the model dynamically learn attention weights between feature nodes, suppress noise from irrelevant regions, and highlight target-related features. During pre-training, a node-level contrastive learning loss and a graph-text alignment loss are introduced, and cross-modal alignment is achieved through pre-training guided by image semantics and implicit linguistic relationships. In the fine-tuning stage, cross-modal fusion is performed on the RefCOCO/+/g data to adapt the model to the visual grounding task. Experimental results show that the proposed method outperforms the benchmark models on the RefCOCO/+/g datasets.
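
The abstract does not give the layer equations, so the following is only a minimal sketch of how attention weights between feature nodes might be learned, assuming a generic single-head graph attention layer rather than the authors' actual architecture. The function name `graph_attention`, the node counts, the feature sizes, and the adjacency pattern are all hypothetical and chosen for illustration.

```python
import numpy as np

def graph_attention(h, adj, W, a, slope=0.2):
    """Generic single-head graph attention layer (illustrative, not the paper's model).

    h:   (N, F) node features, e.g. image-region and word nodes stacked together
    adj: (N, N) 0/1 adjacency matrix; must include self-loops
    W:   (F, D) shared linear projection
    a:   (2*D,) attention vector
    """
    z = h @ W                                  # project every node to D dimensions
    d = z.shape[1]
    # Score e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into source and target terms.
    e = (z @ a[:d])[:, None] + (z @ a[d:])[None, :]
    e = np.where(e > 0, e, slope * e)          # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)          # non-neighbours get zero weight
    e = e - e.max(axis=1, keepdims=True)       # stabilise the softmax
    w = np.exp(e)
    alpha = w / w.sum(axis=1, keepdims=True)   # attention weights, each row sums to 1
    return alpha @ z, alpha

# Hypothetical toy graph: 4 image-region nodes followed by 3 word nodes.
rng = np.random.default_rng(0)
h = rng.normal(size=(7, 16))
adj = np.ones((7, 7))
adj[0, 3] = adj[3, 0] = 0                      # drop one region-region edge
W = rng.normal(size=(16, 8))
a = rng.normal(size=(16,))
out, alpha = graph_attention(h, adj, W, a)
```

In this sketch, a node pair without an edge receives exactly zero weight, while weights among connected nodes are learned through `W` and `a`. This is one simple way a model could down-weight irrelevant regions and emphasise target-related features.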