TY - GEN
T1 - CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees
T2 - 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021
AU - Shi, Ensheng
AU - Wang, Yanlin
AU - Du, Lun
AU - Zhang, Hongyu
AU - Han, Shi
AU - Zhang, Dongmei
AU - Sun, Hongbin
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
N2 - Code summarization aims to generate concise natural language descriptions of source code, which can help improve program comprehension and maintenance. Recent studies show that syntactic and structural information extracted from abstract syntax trees (ASTs) is conducive to summary generation. However, existing approaches fail to fully capture the rich information in ASTs because of the large size/depth of ASTs. In this paper, we propose a novel model CAST that hierarchically splits and reconstructs ASTs. First, we hierarchically split a large AST into a set of subtrees and utilize a recursive neural network to encode the subtrees. Then, we aggregate the embeddings of subtrees by reconstructing the split ASTs to get the representation of the complete AST. Finally, AST representation, together with source code embedding obtained by a vanilla code token encoder, is used for code summarization. Extensive experiments, including the ablation study and the human evaluation, on benchmarks have demonstrated the power of CAST. To facilitate reproducibility, our code and data are available at https://github.com/DeepSoftwareAnalytics/CAST.
AB - Code summarization aims to generate concise natural language descriptions of source code, which can help improve program comprehension and maintenance. Recent studies show that syntactic and structural information extracted from abstract syntax trees (ASTs) is conducive to summary generation. However, existing approaches fail to fully capture the rich information in ASTs because of the large size/depth of ASTs. In this paper, we propose a novel model CAST that hierarchically splits and reconstructs ASTs. First, we hierarchically split a large AST into a set of subtrees and utilize a recursive neural network to encode the subtrees. Then, we aggregate the embeddings of subtrees by reconstructing the split ASTs to get the representation of the complete AST. Finally, AST representation, together with source code embedding obtained by a vanilla code token encoder, is used for code summarization. Extensive experiments, including the ablation study and the human evaluation, on benchmarks have demonstrated the power of CAST. To facilitate reproducibility, our code and data are available at https://github.com/DeepSoftwareAnalytics/CAST.
UR - https://www.scopus.com/pages/publications/85126153599
U2 - 10.18653/v1/2021.emnlp-main.332
DO - 10.18653/v1/2021.emnlp-main.332
M3 - Conference contribution
AN - SCOPUS:85126153599
T3 - EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 4053
EP - 4062
BT - EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
PB - Association for Computational Linguistics (ACL)
Y2 - 7 November 2021 through 11 November 2021
ER -
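
Reading aid (not part of the RIS record): the abstract above describes a three-step pipeline: hierarchically split a large AST into subtrees, encode each subtree with a recursive neural network, then aggregate the subtree embeddings by reconstructing the split hierarchy. The following is a minimal, hypothetical Python sketch of that idea. The node labels, the splitting rule, and the toy one-layer tanh combiner are illustrative assumptions, not the authors' implementation (their code is at the repository linked in the abstract).

```python
"""Minimal sketch of a split-encode-aggregate AST pipeline (illustrative only)."""
import numpy as np


class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


# --- Step 1: hierarchical splitting -------------------------------------
# Hypothetical rule: cut the AST at "block-like" nodes so each piece stays
# small; return the subtrees plus a skeleton tree whose placeholder leaves
# remember which subtree was cut out (used later for reconstruction).
SPLIT_LABELS = {"MethodBody", "If", "While", "For"}


def split_ast(root, subtrees):
    kept_children = []
    for child in root.children:
        if child.label in SPLIT_LABELS:
            sub_root = split_ast(child, subtrees)
            subtrees.append(sub_root)
            kept_children.append(Node(f"<SUBTREE_{len(subtrees) - 1}>"))
        else:
            kept_children.append(split_ast(child, subtrees))
    return Node(root.label, kept_children)


# --- Step 2: recursive (bottom-up) subtree encoder -----------------------
DIM = 16
rng = np.random.default_rng(0)
EMBED = {}                                        # label -> toy embedding
W = rng.standard_normal((DIM, 2 * DIM)) * 0.1     # one-layer combiner


def embed(label):
    if label not in EMBED:
        EMBED[label] = rng.standard_normal(DIM) * 0.1
    return EMBED[label]


def encode_tree(node):
    """Encode a (sub)tree bottom-up: combine the node embedding with the
    mean of its children's encodings through a tanh layer."""
    if not node.children:
        return np.tanh(embed(node.label))
    child_vec = np.mean([encode_tree(c) for c in node.children], axis=0)
    return np.tanh(W @ np.concatenate([embed(node.label), child_vec]))


# --- Step 3: reconstruction / aggregation --------------------------------
def encode_ast(root):
    subtrees = []
    skeleton = split_ast(root, subtrees)
    subtrees.append(skeleton)          # the skeleton itself is a subtree
    # Encode every subtree, then pool into one vector for the full AST.
    # (The paper aggregates along the reconstructed split hierarchy; simple
    # mean pooling is used here only to keep the sketch short.)
    return np.mean([encode_tree(t) for t in subtrees], axis=0)


if __name__ == "__main__":
    ast = Node("MethodDecl", [
        Node("Params", [Node("x"), Node("y")]),
        Node("MethodBody", [
            Node("If", [Node("Cond"), Node("Return", [Node("x")])]),
            Node("Return", [Node("y")]),
        ]),
    ])
    print(encode_ast(ast).shape)   # -> (16,)
```

In the sketch, the resulting AST vector would then be concatenated with a code-token encoding and fed to a summary decoder, mirroring the final step the abstract mentions; that part is omitted here.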