The algorithm is simple, which is why it is also suitable for textbooks. However, it is far from representing the state of the art. In case you are interested in the latest developments, I have recently published a new distinct counting algorithm whose estimate does not depend on the processing order, which supports merging, and which requires only 3.5 kB of memory to guarantee a standard error of 1.13% for counts up to exa-scale: https://arxiv.org/abs/2402.13726
This sucks, but a problem we have in academia is that you can't cite a closed PR at a top-tier conference like VLDB. If you had written up your proposal as a citable document, you could rightly claim the scholarly recognition you deserve. Morally, you do deserve recognition (which you are getting here, at least). It's decent of the author to admit he was inspired by you - there are others who would find it easier to deny it. But if it's only a PR, there is no recourse.
The Java implementation abstracts out the index mapping. This lets us add alternative ways to map values to indices, and we added a method based on quadratic interpolation, as it seems to be a good tradeoff between index-computation speed and memory footprint.
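To illustrate the idea, here is a toy sketch of such an abstraction (this is not the actual implementation; the interface name, the bin parameter, and the concrete quadratic are my own assumptions): the mantissa of the double is interpolated with a quadratic that agrees with log2 at the powers of two and is monotone in between, so only cheap arithmetic is needed instead of a call to the logarithm.

```java
// Toy sketch of an index-mapping abstraction with a quadratic
// log2 approximation. Names and constants are illustrative only.
interface IndexMapper {
    int valueToIndex(double value);
}

class QuadraticLogMapper implements IndexMapper {
    private final double binsPerOctave; // resolution: bins per power of two

    QuadraticLogMapper(int binsPerOctave) {
        this.binsPerOctave = binsPerOctave;
    }

    @Override
    public int valueToIndex(double value) {
        // assumes a positive, normal double
        long bits = Double.doubleToLongBits(value);
        int exponent = (int) ((bits >>> 52) & 0x7FF) - 1023;
        // mantissa m in [0, 1): value = 2^exponent * (1 + m)
        double m = (bits & 0x000FFFFFFFFFFFFFL) * 0x1p-52;
        // quadratic approximation of log2(1 + m); exact at m = 0 and m = 1
        // and strictly monotone in between, so bin boundaries stay consistent
        double approxLog2 = exponent + m * (3.0 - m) * 0.5;
        return (int) (binsPerOctave * approxLog2);
    }
}
```

Because the quadratic is exact at the octave boundaries, indices computed this way never disagree with the true log2 about which power-of-two interval a value belongs to.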
The tailcut approach breaks an important property of cardinality estimation algorithms: adding the same value multiple times should change the state of the sketch only once, when it is added for the first time.
However, due to the chance of overflows, the same value can lead to multiple state changes when using the tailcut approach. The introduced error depends on the insertion order and also on the frequency distribution of the inserted elements. It would be interesting to know which assumptions about the input data were made for the promised accuracy. I could not find any hint in their paper about what the input data in their experiments looked like. It is fairly easy to obtain good results for the trivial case where all input values are distinct.
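The broken property can be demonstrated with a deliberately simplified toy that mimics only the mechanism (truncated registers plus a shared base offset), not the actual algorithm from the paper: under the plain max-update rule re-adding an element is a no-op, but once registers are truncated and rebased, the same element can touch the state a second time.

```java
// Classic max-rule sketch: register updates are idempotent.
class MaxRuleSketch {
    int[] reg = new int[4];

    void add(int index, int rank) {
        reg[index] = Math.max(reg[index], rank); // re-adding changes nothing
    }
}

// Simplified toy of a tailcut-style sketch (NOT the paper's algorithm):
// registers are truncated to MAX and a shared base is shifted up once
// every register is positive.
class TailcutToy {
    static final int MAX = 3; // register truncated to 2 bits
    int base = 0;             // shared base offset
    int[] reg = new int[4];

    void add(int index, int rank) {
        int v = Math.min(Math.max(rank - base, 0), MAX); // truncation loses information
        reg[index] = Math.max(reg[index], v);
        int min = Integer.MAX_VALUE;
        for (int r : reg) min = Math.min(min, r);
        if (min > 0) { // rebase: shift the common offset into base
            base += min;
            for (int i = 0; i < reg.length; i++) reg[i] -= min;
        }
    }
}
```

In the toy, an element of rank 10 first saturates its register at 3; after a rebase lowers that register to 2, re-adding the very same element raises it back to 3, i.e. the sketch state changes twice for one distinct value.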
The tailcut approach reminds me of the HyperBitBit algorithm, which also drops the property described above and promises a better storage factor (less error with the same amount of memory).
It is interesting that the standard error is reduced from 1.04/sqrt(m) down to 1.0/sqrt(m) despite the loss of information after the tailcut. Therefore, I conclude that it must be the estimation method itself that is superior to existing methods. I will need to check.
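For a sense of scale, a quick back-of-envelope computation of what that constant buys (m = 4096 is an assumed register count for illustration; the constants are the ones quoted above): since the error scales as c/sqrt(m), matching the old constant's error requires (1.04/1.00)^2 - 1, about 8%, more registers.

```java
// Back-of-envelope comparison of the two error constants quoted above.
// m = 4096 is an illustrative register count, not from the paper.
class ErrorComparison {
    // standard error of the estimate for constant c and m registers
    static double standardError(double c, int m) {
        return c / Math.sqrt(m);
    }

    // error ~ c / sqrt(m)  =>  m ~ (c / error)^2, so for equal error the
    // old constant needs (1.04/1.00)^2 - 1 more registers, about 8 %
    static double extraRegisterFraction() {
        return Math.pow(1.04 / 1.00, 2) - 1.0;
    }
}
```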
Hey oertl, big fan of your paper https://arxiv.org/pdf/1706.07290.pdf :D
I'm going to start working on "intersections" and some improvements for large-scale cardinalities based on your research. Can I hit you up by email with some of my questions? (Concerning the intersection work: how does it compare to https://arxiv.org/pdf/1704.03911.pdf?)
So if a register overflows (and is truncated) "twice" by the same element before the base register (B) is increased, then after B is increased, the overflowing register is not increased and so appears not to have overflowed. On the other hand, if that element shows up in the stream once before and once after B is increased, the overflowing register may overflow again.
This makes even the MLE procedure partially incorrect in the first case, since it then computes the maximum likelihood estimate for a set of hashes in which the register discussed above did not overflow.
I have not yet checked their ML approach. Regarding register updates you are right.
I have also seen that they ran only 10000 simulations to determine the standard deviation to be 1.00/sqrt(m). I think more simulations would be necessary to get a confidence interval for the standard deviation that is narrow enough to allow differentiation from 1.04/sqrt(m).
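A rough rule of thumb for how that uncertainty scales (assuming approximately Gaussian, independent errors across runs; heavier-tailed errors would make it larger): the relative standard error of an empirical standard deviation from n runs is about 1/sqrt(2n), so 10000 runs pin the constant down only to roughly +/- 0.7% at one sigma.

```java
// Rule of thumb: for n independent runs with approximately Gaussian
// estimation errors, the empirical standard deviation itself has a
// relative standard error of about 1/sqrt(2n). Heavier tails widen this.
class SimulationNoise {
    static double relativeSdError(long n) {
        return 1.0 / Math.sqrt(2.0 * n);
    }
}
```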
If you are interested in new cardinality estimation algorithms for HyperLogLog sketches, you could also have a look at the paper I am currently working on:
There, I present two different algorithms based on theoretical considerations that are both accurate over the entire cardinality range. Unlike LogLog-Beta or HyperLogLog++, they do not need any empirically determined coefficients or bias correction data.