Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables

Research output: Contribution to journalArticlepeer-review

4 Scopus citations

Abstract

Calculating the number of distinct values (i.e., NDV) in a column of a big table is costly yet fundamental to a variety of database applications such as data compression and profiling. To reduce the high time and space cost, a number of sketch methods (e.g., HyperLogLog) have been proposed, which estimate the NDV from a constructed compact data summary of distinct values. However, these methods fail or are costly to manage fully-dynamic scenarios where data is often inserted into and deleted from the table. To solve this issue, we propose a novel sketch method, Half-Xor. Our Half-Xor sketch consists of a compact bit matrix and a small counter array, and it needs to set a few bits and update a counter when handling a data insertion/deletion. Compared with the state-of-the-art mergeable method, our experimental results demonstrate that our method Half-Xor is up to 6.6 times more accurate under the same memory usage and reduces the memory usage by up to 16 times to achieve the same estimation accuracy.

Original languageEnglish
Pages (from-to)3111-3125
Number of pages15
JournalIEEE Transactions on Knowledge and Data Engineering
Volume36
Issue number7
DOIs
StatePublished - 1 Jul 2024

Keywords

  • Big table
  • data mining
  • database
  • dynamic cardinality estimation
  • sketch

Fingerprint

Dive into the research topics of 'Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables'. Together they form a unique fingerprint.

Cite this