Unsupervised learning from linked documents

  • Zhen Guo
  • Shenghuo Zhu
  • Yun Chi
  • Zhongfei Zhang
  • Yihong Gong

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Documents in many corpora, such as digital libraries and webpages, contain both content and link information. Traditional topic models, which play an important role in unsupervised learning, either ignore link information entirely or treat it as one more feature alongside content. We believe that neither approach accurately captures the relations that links represent. To address this limitation, we propose a citation-topic (CT) model that explicitly accounts for the document relations represented by links. In the CT model, links are not treated as yet another feature; instead, they form the structure of the generative model. As a result, a given document is modeled as a mixture of topic distributions, each of which is borrowed (cited) from a document related to the given document. We apply the CT model to several document collections, and experimental comparisons against state-of-the-art approaches show promising performance.
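
The generative idea in the abstract can be illustrated with a minimal sketch. It is an assumption-laden reading of the abstract, not the authors' implementation: the abstract does not give the model's parameterization, so the names `cite_weights` (how strongly a document borrows from each related document), `doc_topics` (each document's own topic distribution), and `topic_words` (each topic's word distribution) are hypothetical, as is the three-step sampling order.

```python
import random

# Toy parameters (all hypothetical): 3 documents, 2 topics, 4 word ids.
# cite_weights[d][c]: probability that document d borrows from document c.
cite_weights = [[0.5, 0.5, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]
# doc_topics[c][k]: topic distribution owned by document c.
doc_topics = [[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]]
# topic_words[k][w]: word distribution of topic k.
topic_words = [[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5]]


def draw(weights, rng):
    """Draw one index from a discrete distribution given as a weight list."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def generate_document(doc, n_words, cite_weights, doc_topics, topic_words, rng):
    """Sample n_words word ids for `doc` under the sketched process."""
    words = []
    for _ in range(n_words):
        cited = draw(cite_weights[doc], rng)   # 1. pick a related document
        topic = draw(doc_topics[cited], rng)   # 2. borrow a topic from it
        words.append(draw(topic_words[topic], rng))  # 3. emit a word
    return words


def effective_topic_mixture(doc, cite_weights, doc_topics):
    """Marginal topic distribution of `doc`: a citation-weighted mixture."""
    n_topics = len(doc_topics[0])
    return [
        sum(cite_weights[doc][c] * doc_topics[c][k] for c in range(len(doc_topics)))
        for k in range(n_topics)
    ]
```

In this toy setup, document 0 borrows equally from itself and from document 1, so `effective_topic_mixture(0, cite_weights, doc_topics)` blends the two borrowed topic distributions. The point of the sketch is that links select which distributions enter the mixture, rather than appearing as extra features.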

Original language: English
Title of host publication: Proceedings - 2010 20th International Conference on Pattern Recognition, ICPR 2010
Pages: 730-733
Number of pages: 4
State: Published - 2010
Externally published: Yes
Event: 2010 20th International Conference on Pattern Recognition, ICPR 2010 - Istanbul, Turkey
Duration: 23 Aug 2010 to 26 Aug 2010

Publication series

Name: Proceedings - International Conference on Pattern Recognition
ISSN (Print): 1051-4651

Conference

Conference: 2010 20th International Conference on Pattern Recognition, ICPR 2010
Country/Territory: Turkey
City: Istanbul
Period: 23/08/10 to 26/08/10
