TY - GEN
T1 - Continuously tracking core items in data streams with probabilistic decays
AU - Zhao, Junzhou
AU - Wang, Pinghui
AU - Tao, Jing
AU - Zhang, Shuo
AU - Lui, John C.S.
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/4
Y1 - 2020/4
N2 - The sheer scale of big data causes the information overload issue and there is an urgent need for tools that can draw valuable insights from massive data. This paper investigates the core items tracking (CIT) problem where the goal is to continuously track representative items, called core items, in a data stream so to best represent/summarize the stream. In order to simultaneously satisfy the recency and continuity requirements, we consider CIT over probabilistic-decaying streams where items in the stream are forgotten gradually in a probabilistic manner. We first introduce an algorithm, called PNDCIT, to find core items in a special kind of probabilistic non-decaying streams. Furthermore, using PNDCIT as a building block, we design two novel algorithms, namely PDCIT and PDCIT+, to maintain core items over probabilistic-decaying streams with constant approximation ratios. Finally, extensive experiments on real data demonstrate that PDCIT+ achieves a speedup of up to one order of magnitude over a batch algorithm while providing solutions with comparable quality.
AB - The sheer scale of big data causes the information overload issue and there is an urgent need for tools that can draw valuable insights from massive data. This paper investigates the core items tracking (CIT) problem where the goal is to continuously track representative items, called core items, in a data stream so to best represent/summarize the stream. In order to simultaneously satisfy the recency and continuity requirements, we consider CIT over probabilistic-decaying streams where items in the stream are forgotten gradually in a probabilistic manner. We first introduce an algorithm, called PNDCIT, to find core items in a special kind of probabilistic non-decaying streams. Furthermore, using PNDCIT as a building block, we design two novel algorithms, namely PDCIT and PDCIT+, to maintain core items over probabilistic-decaying streams with constant approximation ratios. Finally, extensive experiments on real data demonstrate that PDCIT+ achieves a speedup of up to one order of magnitude over a batch algorithm while providing solutions with comparable quality.
UR - https://www.scopus.com/pages/publications/85085855811
U2 - 10.1109/ICDE48307.2020.00072
DO - 10.1109/ICDE48307.2020.00072
M3 - 会议稿件
AN - SCOPUS:85085855811
T3 - Proceedings - International Conference on Data Engineering
SP - 769
EP - 780
BT - Proceedings - 2020 IEEE 36th International Conference on Data Engineering, ICDE 2020
PB - IEEE Computer Society
T2 - 36th IEEE International Conference on Data Engineering, ICDE 2020
Y2 - 20 April 2020 through 24 April 2020
ER -