TY - GEN
T1 - Text Grouping Adapter
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Bi, Tianci
AU - Zhang, Xiaoyi
AU - Zhang, Zhizheng
AU - Xie, Wenxuan
AU - Lan, Cuiling
AU - Lu, Yan
AU - Zheng, Nanning
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances into paragraphs, has not kept pace. Previous works either treat text detection and grouping with separate models or train a unified model from scratch. None of them makes full use of already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that enables various pretrained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or simply fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pretrained models, incorporating TGA into various pretrained text detectors and text spotters achieves superior layout analysis performance while inheriting the generalized text detection ability gained from pretraining. With full-parameter fine-tuning, layout analysis performance improves further.
UR - https://www.scopus.com/pages/publications/85218002163
U2 - 10.1109/CVPR52733.2024.02659
DO - 10.1109/CVPR52733.2024.02659
M3 - Conference contribution
AN - SCOPUS:85218002163
SN - 9798350353006
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 28150
EP - 28159
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -