TY - GEN
T1 - Unsupervised learning from linked documents
AU - Guo, Zhen
AU - Zhu, Shenghuo
AU - Chi, Yun
AU - Zhang, Zhongfei
AU - Gong, Yihong
PY - 2010
Y1 - 2010
N2 - Documents in many corpora, such as digital libraries and webpages, contain both content and link information. In traditional topic models, which play an important role in unsupervised learning, the link information is either ignored entirely or treated as a feature similar to content. We believe that neither approach accurately captures the relations represented by links. To address this limitation of traditional topic models, in this paper we propose a citation-topic (CT) model that explicitly considers the document relations represented by links. In the CT model, instead of being treated as yet another feature, links are used to form the structure of the generative model. As a result, in the CT model a given document is modeled as a mixture of a set of topic distributions, each of which is borrowed (cited) from a document related to the given document. We apply the CT model to several document collections, and experimental comparisons against state-of-the-art approaches demonstrate very promising performance.
UR - https://www.scopus.com/pages/publications/78149471768
U2 - 10.1109/ICPR.2010.184
DO - 10.1109/ICPR.2010.184
M3 - Conference contribution
AN - SCOPUS:78149471768
SN - 9780769541099
T3 - Proceedings - International Conference on Pattern Recognition
SP - 730
EP - 733
BT - Proceedings - 2010 20th International Conference on Pattern Recognition, ICPR 2010
T2 - 2010 20th International Conference on Pattern Recognition, ICPR 2010
Y2 - 23 August 2010 through 26 August 2010
ER -