Trend analysis for large document streams

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

Increasingly powerful computer technology inspires people to investigate the information hidden in huge collections of documents. In this report, we are especially interested in documents with a relative time order, which we call document streams. Examples include TV news, forums, emails from company projects, and call center telephone logs. To gain insight into these document streams, we first need to detect the events within them. We use a time-sensitive Dirichlet process mixture model to find these events. A time-sensitive Dirichlet process mixture model is a generative model that allows a potentially infinite number of mixture components and uses a Dirichlet compound multinomial model for the distribution of words in documents. We consider three different time-sensitive Dirichlet process mixture models: an exponential decay kernel model, a polynomial decay kernel model, and a sliding window kernel model. Experiments on the TDT2 dataset show that the time-sensitive models perform 18-20% better in terms of accuracy than the Dirichlet process mixture model. The sliding window kernel and the polynomial kernel are more promising for detecting events. We use ThemeRiver to visualize the events along the time axis. With the help of ThemeRiver, people can easily get an overall picture of how different events evolve. Besides ThemeRiver, we investigate using top words as a high-level summarization of each event. Experimental results on the TDT2 dataset suggest that the sliding window kernel is the better choice in terms of both capturing the trend of the events and expressibility.
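The abstract names three temporal kernels but does not give their functional forms. The following is a minimal sketch of how such kernels could reweight a Chinese-restaurant-process style cluster prior so that recent documents count more than old ones. The specific formulas, the parameter names (`rate`, `power`, `width`, `alpha`), and the helper `cluster_prior` are illustrative assumptions, not the authors' definitions, and the Dirichlet compound multinomial likelihood over words is left out entirely.

```python
import math


def exponential_kernel(dt, rate=0.5):
    """Assumed form: weight decays exponentially with the time gap dt >= 0."""
    return math.exp(-rate * dt)


def polynomial_kernel(dt, power=2.0):
    """Assumed form: weight decays polynomially with the time gap dt >= 0."""
    return (1.0 + dt) ** (-power)


def sliding_window_kernel(dt, width=3.0):
    """Assumed form: weight is 1 inside the window and 0 outside it."""
    return 1.0 if dt <= width else 0.0


def cluster_prior(t_new, doc_times, doc_clusters, kernel, alpha=1.0):
    """Time-weighted prior over clusters for a document arriving at t_new.

    Each earlier document adds kernel(t_new - t) to its cluster's weight,
    and alpha is the weight for opening a new cluster (key None).
    Returns a dict mapping cluster id (or None) to probability.
    """
    weights = {}
    for t, c in zip(doc_times, doc_clusters):
        if t < t_new:  # only documents earlier in the stream contribute
            weights[c] = weights.get(c, 0.0) + kernel(t_new - t)
    total = alpha + sum(weights.values())
    prior = {c: w / total for c, w in weights.items()}
    prior[None] = alpha / total
    return prior


if __name__ == "__main__":
    times = [0.0, 1.0, 9.0]   # hypothetical arrival times
    clusters = [0, 0, 1]      # hypothetical event assignments
    for kernel in (exponential_kernel, polynomial_kernel, sliding_window_kernel):
        print(kernel.__name__, cluster_prior(10.0, times, clusters, kernel))
```

Under these assumed kernels, an event whose documents are all old receives little or no prior mass, which is the intuition behind a time-sensitive prior: a new document is steered toward currently active events or toward a new one.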

Original language: English
Title of host publication: Proceedings - 5th International Conference on Machine Learning and Applications, ICMLA 2006
Pages: 285-295
Number of pages: 11
Publication status: Published - 2006
Externally published
Event: 5th International Conference on Machine Learning and Applications, ICMLA 2006 - Orlando, FL, United States
Duration: 14 Dec 2006 to 16 Dec 2006

Publication series

Name: Proceedings - 5th International Conference on Machine Learning and Applications, ICMLA 2006

Conference

Conference: 5th International Conference on Machine Learning and Applications, ICMLA 2006
Country/Territory: United States
City: Orlando, FL
Period: 14/12/06 to 16/12/06
