GL-MSIN: A Global–Local Hierarchical Multiscale Interaction Network for Referring Remote Sensing Image Segmentation

Abstract

Referring remote sensing image segmentation (RRSIS) aims to segment objects in a remote sensing image based on a natural language expression. Identifying the objects described by an expression, whose number is not known in advance, in large-scale remote sensing images requires strong capabilities for both local detail extraction and global dependency perception. However, current methods pay little attention to these aspects. In addition, existing feature-fusion mechanisms frequently ignore semantic inconsistencies between features at different scales, which makes segmentation particularly challenging. Here, we propose a global–local hierarchical multiscale interaction network (GL-MSIN) to address these issues. To better exploit global–local information in remote sensing images, we design an EN-Trans block as the basic component of GL-MSIN to generate language-aware global–local features. Within each EN-Trans block, we incorporate an enhanced local self-attention module into the visual backbone to capture local details, and we design a gated enhancement pathway to capture and inject global information related to the expression. At the core of this pathway is a global enhancement module (GEM), which employs a lightweight attention mechanism to expand pixel-level interactions across the entire image, enabling effective extraction of global context from locally rich visual features. Furthermore, an iterative refinement network based on learnable fusion features is designed to generate the final fused features and to overcome the inconsistency issue in feature fusion. In addition, we curate an extensive dataset, geographic RefRS (GRefRS), consisting of 70 328 image-expression-mask triplets, which not only presents more complex scenarios but also provides more practical natural language expressions with geographic attributes. Our experimental evaluations demonstrate that GL-MSIN performs strongly on GRefRS and on existing mainstream datasets.
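
As a rough illustration of the gated enhancement idea described above, the following minimal sketch pools all pixel features into a single global context vector with one attention query and adds it back to every pixel through a sigmoid gate. It is not the authors' GEM: the function names (`global_context`, `gated_inject`), the use of a single query vector standing in for the language expression, and the scalar gate are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_context(features, query):
    """Attention-pool all pixel features into one global vector.

    features: list of per-pixel feature vectors (each of length dim).
    query: hypothetical vector standing in for the language expression.
    """
    dim = len(query)
    # Scaled dot-product score between each pixel and the query.
    scores = [
        sum(f[d] * query[d] for d in range(dim)) / math.sqrt(dim)
        for f in features
    ]
    weights = softmax(scores)
    # Weighted sum of pixel features gives the global context.
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def gated_inject(features, query, gate_logit):
    """Add the gated global context to every local feature (assumed scalar gate)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    ctx = global_context(features, query)
    return [[f[d] + gate * ctx[d] for d in range(len(ctx))] for f in features]
```

With a strongly negative `gate_logit` the gate closes and the local features pass through almost unchanged, which is the usual motivation for gating an injected signal; how the paper parameterizes its gate is not stated in the abstract.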

Keywords

  • Dataset construction
  • global–local information enhancement
  • multiscale interaction
  • referring remote sensing image segmentation (RRSIS)
