Hacker News | oertl's comments

The algorithm is simple, which is why it is also suitable for textbooks. However, it is far from the state of the art. If you are interested in the latest developments, I have recently published a new distinct counting algorithm whose estimate does not depend on the processing order, which supports merging, and which requires only 3.5 kB of memory to guarantee a standard error of 1.13% for counts up to exa-scale: https://arxiv.org/abs/2402.13726
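This is not the algorithm from the paper, but the order-independence and mergeability it mentions can be illustrated with a minimal HyperLogLog-style register array (all names and sizes here are illustrative):

```python
import hashlib

M = 1024  # number of registers; 2**10, so 10 hash bits select the register

def _hash64(item):
    # Stable 64-bit hash of the item's string form (illustrative choice).
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")

def add(registers, item):
    h = _hash64(item)
    j = h % M                                # register index (10 bits)
    w = h // M                               # remaining 54 bits
    rank = 54 - w.bit_length() + 1           # number of leading zeros + 1
    registers[j] = max(registers[j], rank)   # max-update: idempotent

def merge(a, b):
    # Element-wise max merges two sketches regardless of insertion order.
    return [max(x, y) for x, y in zip(a, b)]

r1, r2 = [0] * M, [0] * M
for x in [1, 2, 3, 2, 1]:
    add(r1, x)
for x in [3, 2, 1, 1]:
    add(r2, x)
assert r1 == r2 == merge(r1, r2)  # same distinct items -> identical state
```

Because every update is a max, the final state depends only on the set of distinct items, not on their order or multiplicity.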


FYI, I proposed exactly the same idea as an improvement to HdrHistogram back in 2015. See https://github.com/HdrHistogram/HdrHistogram/issues/54 and the corresponding code https://github.com/oertl/HdrHistogram/blob/memory_efficiency.... Unfortunately, the author of HdrHistogram did not pick up my proposal, as it would have led to major changes and compatibility problems. The mapping to histogram bins is the same as in your code https://github.com/DataDog/sketches-java/blob/master/src/mai.... I am curious: were you inspired by that?


This sucks, but a problem we have in academia is that you can't reference a closed pull request at a top-tier conference like VLDB. If you had written up your proposal as a citable document, you could rightly claim the scholarly recognition you deserve. Morally, you deserve recognition (which you are getting here, at least). It's decent of the author to admit he was inspired by you; others would find it easier to deny it. But there is no recourse if it's only a PR.


I did see that thread and that's what inspired the mapping with the quadratic interpolation.

To clarify though, DDSketch as defined in the paper uses the logarithmic mapping, as our main goal was to make the memory footprint as small as possible. See: https://github.com/DataDog/sketches-java/blob/1650d939f1485f...
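For reference, the logarithmic mapping is easy to sketch (parameter names here are mine; alpha is the relative-accuracy target):

```python
import math

def log_index(x, alpha=0.01):
    # Bucket i covers (gamma**(i-1), gamma**i] with gamma = (1+alpha)/(1-alpha),
    # which bounds the relative error of the bucket representative by alpha.
    gamma = (1 + alpha) / (1 - alpha)
    return math.ceil(math.log(x) / math.log(gamma))

def bucket_value(i, alpha=0.01):
    # Representative value 2*gamma**i / (gamma + 1), the log-scale midpoint.
    gamma = (1 + alpha) / (1 - alpha)
    return 2 * gamma ** i / (gamma + 1)

x = 123.456
assert abs(bucket_value(log_index(x)) - x) / x <= 0.01
```

The worst case occurs at a bucket edge, where the relative error of the representative is exactly (gamma - 1) / (gamma + 1) = alpha.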

The Java implementation abstracts out the index mapping. This let us add alternative ways to map values to indices, and we added the method with quadratic interpolation as it seems to be a good tradeoff between index computation speed and memory footprint.
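The interpolated variant can be sketched roughly like this; the exact polynomial and correction used in sketches-java may differ, this just shows the general idea of replacing the expensive logarithm with the float's exponent bits plus a quadratic over the mantissa:

```python
import math

def quad_log2(x):
    # Exact binary exponent from the float representation, plus a quadratic
    # approximation of log2 over the mantissa (coefficients are illustrative).
    m, e = math.frexp(x)               # x = m * 2**e with m in [0.5, 1)
    s = 2 * m - 1                      # significand fraction in [0, 1)
    return (e - 1) + s * (4 - s) / 3   # approximates (e-1) + log2(1+s)

def quad_index(x, alpha=0.01):
    # A production mapping would tighten alpha slightly to absorb the
    # approximation error of quad_log2 and keep the accuracy guarantee.
    gamma = (1 + alpha) / (1 - alpha)
    return math.ceil(quad_log2(x) / math.log2(gamma))

# The quadratic stays within about 0.01 of the true log2:
for x in [0.1, 1.0, 7.5, 123.456, 1e9]:
    assert abs(quad_log2(x) - math.log2(x)) < 0.01
```

The approximation is continuous and monotone across binade boundaries, so the resulting index is still monotone in the input value.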


The tailcut approach breaks one important property of cardinality estimation algorithms: adding the same value multiple times should change the state of the sketch only once, when adding it for the first time.
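A toy contrast (not the actual tailcut algorithm) makes the difference concrete: a plain max-update absorbs repeats, while a truncated register with an overflow count, as hypothesized here, does not:

```python
def max_update(state, rank):
    # Idempotent: inserting the same element again never changes the state.
    return max(state, rank)

def tailcut_like_update(state, rank, cap=3):
    # Hypothetical truncated register: ranks beyond base+cap only bump an
    # overflow counter, so a repeated element can change the state again.
    base, overflow = state
    if rank > base + cap:
        return (base, overflow + 1)
    return (max(base, rank), overflow)

assert max_update(max_update(0, 5), 5) == max_update(0, 5)  # one change

s = (0, 0)
for _ in range(2):                 # the same element inserted twice
    s = tailcut_like_update(s, 5)
assert s == (0, 2)                 # state changed twice for one element
```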

However, due to the chance of overflows, the same value can change the state multiple times with the tailcut approach. The introduced error depends on the order and also on the frequency distribution of the inserted elements. It would be interesting to know which assumptions about the input data underlie the promised accuracy. I could not find any hint in their paper about what the input data in their experiments looked like. It is fairly easy to obtain good results in the trivial case where all input values are distinct.

The tailcut approach reminds me of the HyperBitBit algorithm, which also drops the property described above and promises a better storage factor (less error with the same amount of memory).

It is true that the traditional 5-bit or 6-bit representations of Hyperloglog registers are suboptimal. Even with lossless compression (see https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b...) a significant amount of memory can be saved.

It is interesting that the standard error is reduced from 1.04/sqrt(m) down to 1.0/sqrt(m) despite the loss of information after the tailcut. I therefore conclude that it must be the estimation method itself that is superior to existing methods. I will need to check.
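For a sense of scale, a quick back-of-the-envelope comparison (m = 4096 is just an example size):

```python
import math

m = 4096                                  # number of registers (example)
err_classic = 1.04 / math.sqrt(m)         # ~1.63% standard error
err_tailcut = 1.00 / math.sqrt(m)         # ~1.56% standard error

# Since the error scales as 1/sqrt(m), matching the smaller constant with
# the classic estimator would require (1.04/1.00)**2 - 1, i.e. about 8.2%,
# more registers.
extra_registers = 1.04 ** 2 - 1
assert abs(err_classic - 0.01625) < 1e-12
assert abs(err_tailcut - 0.015625) < 1e-12
assert abs(extra_registers - 0.0816) < 1e-4
```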


Hey oertl, big fan of your paper https://arxiv.org/pdf/1706.07290.pdf :D I'm going to start working on intersections and some improvements for large-scale cardinalities based on your research. Can I hit you up with an email with some of my questions? (Concerning the intersection work: how does it compare to https://arxiv.org/pdf/1704.03911.pdf?)


Sure, send me an email.


So if a register overflows (and is truncated) "twice" by the same element before the base register (B) is increased, then after B is increased, the overflowing register is not increased and so appears not to have overflowed. On the other hand, if that element shows up in the stream once before and once after B is increased, the overflowing register may overflow again.

This makes even the MLE procedure partially incorrect in the first case, since it is the maximum likelihood estimation of a set of hashes in which the register discussed above did not overflow.

Is that about right?


I have not yet checked their ML approach. Regarding register updates you are right.

I have also seen that they ran only 10,000 simulation runs to determine the standard deviation of 1.00/sqrt(m). I think more simulations would be necessary to obtain a confidence interval for the standard deviation that is small enough to distinguish it from 1.04/sqrt(m).


If you are interested in new cardinality estimation algorithms for HyperLogLog sketches, you could also have a look at the paper I am currently working on:

http://oertl.github.io/hyperloglog-sketch-estimation-paper/

There, I present two different algorithms based on theoretical considerations that are both accurate over the entire cardinality range. Unlike LogLog-Beta or HyperLogLog++, they do not need any empirically determined coefficients or bias correction data.

