Cross-modal enhanced visual grounding based on graph attention
Abstract: Visual grounding is a key technology at the intersection of computer vision and natural-language interaction, supporting tasks such as image semantic understanding and the execution of human-computer interaction instructions. Its core objective is to localize, in an image, the specific region referred to by a given textual description. Current research shows that directly matching the language and visual modalities has inherent limitations in capturing the complex referring relationships within sentences, and that redundant visual noise introduced during cross-modal alignment further degrades feature-matching accuracy, making it difficult to improve overall grounding accuracy. This paper proposes a graph-attention-based architecture for cross-modal alignment and fusion that enables the model to dynamically learn attention weights between feature nodes, effectively suppressing noise from irrelevant regions and highlighting target-related features. In the pre-training stage, node-level contrastive learning and a graph-text alignment loss are introduced, and cross-modal alignment is achieved through pre-training guided by image semantics and the implicit relationships in language. In the fine-tuning stage, cross-modal fusion is performed on the RefCOCO/+/g data to adapt the model to the visual grounding task. Experimental results show that the proposed method outperforms the baseline models on the RefCOCO, RefCOCO+, and RefCOCOg datasets.
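To make the mechanism summarized above concrete, the sketch below illustrates one way a graph-attention layer can treat visual region features and text token features as nodes of a shared cross-modal graph, together with a node-level contrastive objective. This is a minimal illustration under assumed shapes and names (`CrossModalGAT`, `node_contrastive_loss`, and the temperature `tau` are hypothetical), not the paper's actual implementation; the graph-text alignment loss and the full pre-training pipeline are omitted.

```python
# Minimal sketch (assumptions labeled): a single GAT-style layer over the union of
# visual and text nodes, plus an InfoNCE-style node-level contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGAT(nn.Module):
    """One graph-attention layer over visual region nodes and text token nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # scores each node pair (h_i, h_j)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D) region features; text: (B, Nt, D) token features
        nodes = self.proj(torch.cat([visual, text], dim=1))          # (B, N, D), N = Nv + Nt
        B, N, D = nodes.shape
        pairs = torch.cat([nodes.unsqueeze(2).expand(B, N, N, D),
                           nodes.unsqueeze(1).expand(B, N, N, D)], dim=-1)
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1))          # (B, N, N) pairwise scores
        alpha = scores.softmax(dim=-1)                               # learned attention weights
        return alpha @ nodes                                         # aggregated node features

def node_contrastive_loss(v_nodes: torch.Tensor, t_nodes: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss; assumes a one-to-one correspondence between
    visual and text nodes purely for illustration."""
    v = F.normalize(v_nodes.flatten(0, 1), dim=-1)
    t = F.normalize(t_nodes.flatten(0, 1), dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 visual regions and 6 text tokens with 256-d features.
layer = CrossModalGAT(dim=256)
fused = layer(torch.randn(2, 8, 256), torch.randn(2, 6, 256))        # (2, 14, 256)
```

In this reading, softmax-normalized pairwise scores play the role of the dynamically learned attention weights described in the abstract: irrelevant region nodes receive low weight and contribute little to the aggregated target representation.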