TY - GEN
T1 - Remote Sensing Image Captioning Using Transformer
AU - Wang, Binze
AU - Xi, Jiangbo
AU - Wang, Xingrun
AU - Fang, Jianwu
AU - Jiang, Wandong
AU - Xie, Dashuai
AU - Xiang, Yaobing
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2022
Y1 - 2022
N2 - Image captioning generates a semantic description for images and, with the development of deep learning, typically combines computer vision and natural language processing. Image captioning must not only recognize the important objects, their attributes, and their spatial relationships with surrounding objects in the image, but also generate text descriptions that conform to human language rules. In this paper, we propose an image captioning model based on the transformer. In the image understanding part, VGG16 is used to extract image features, and a transformer encoder is used to extract relations among different image regions. The text generation part extracts relations among word features in the description and calculates the correlation between text and images from multiple perspectives. The experimental results on the RSICD dataset with the BLEU4, METEOR, ROUGE, and CIDEr metrics are 0.29, 0.34, 0.61, and 2.53, respectively. These results are competitive with, and in some cases better than, the state-of-the-art results. They show that the transformer can alleviate overfitting on small datasets, accelerate training, and generalize better.
AB - Image captioning generates a semantic description for images and, with the development of deep learning, typically combines computer vision and natural language processing. Image captioning must not only recognize the important objects, their attributes, and their spatial relationships with surrounding objects in the image, but also generate text descriptions that conform to human language rules. In this paper, we propose an image captioning model based on the transformer. In the image understanding part, VGG16 is used to extract image features, and a transformer encoder is used to extract relations among different image regions. The text generation part extracts relations among word features in the description and calculates the correlation between text and images from multiple perspectives. The experimental results on the RSICD dataset with the BLEU4, METEOR, ROUGE, and CIDEr metrics are 0.29, 0.34, 0.61, and 2.53, respectively. These results are competitive with, and in some cases better than, the state-of-the-art results. They show that the transformer can alleviate overfitting on small datasets, accelerate training, and generalize better.
KW - Image captioning
KW - Remote sensing image
KW - Transformer
UR - https://www.scopus.com/pages/publications/85130940432
U2 - 10.1007/978-981-16-9492-9_333
DO - 10.1007/978-981-16-9492-9_333
M3 - Conference contribution
AN - SCOPUS:85130940432
SN - 9789811694912
T3 - Lecture Notes in Electrical Engineering
SP - 3388
EP - 3397
BT - Proceedings of 2021 International Conference on Autonomous Unmanned Systems, ICAUS 2021
A2 - Wu, Meiping
A2 - Niu, Yifeng
A2 - Gu, Mancang
A2 - Cheng, Jin
PB - Springer Science and Business Media Deutschland GmbH
T2 - International Conference on Autonomous Unmanned Systems, ICAUS 2021
Y2 - 24 September 2021 through 26 September 2021
ER -