Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Hash functions are kinda magic. Bloom filters are one of the fun tricks you can play with them (though Bloom filters themselves are quite old at this point; check the literature for various alternatives that are better in specific situations.) The field of data sketches and streaming algorithms has many others. k-Minimum values is a nice example of a set cardinality estimator that is very easy to understand.


> check the literature for various alternatives that are better in specific situations

For direct replacements, see Xor(+)-Filters: https://arxiv.org/abs/1912.08258 and the subsequent evolution into Binary Fuse Filters: https://arxiv.org/abs/2201.01174


Also quite recent are Ribbon filters: https://arxiv.org/abs/2103.02515

But these, and xor/binary-fuse-filters are only direct replacements for Bloom filters in some problems. There are still problems for which Bloom filters are a better solution.

Bloom filters have the advantage that the filter produced by adding two elements, is the same as the bitwise-or of a pair of filters made by adding each element separately.

    bloom([x, y]) == bloom([x]) | bloom([y])
Also, the filter produced by bitwise-and, `bloom([x]) & bloom([y])`, cannot have any bits set in it which are not also set by `bloom([x, y])`. We can assert that

    bloom([x]) & bloom([x, y]) == bloom([x])
    bloom([y]) & bloom([x, y]) == bloom([y])
    (bloom([x]) & bloom([y])) | bloom([x, y]) == bloom([x, y])
There are practical applications that this can provide which are not satisfied by the "optimized" variants. The bitwise-or does set union (or least upper bound), and the bitwise-and does a set intersection (or greatest lower bound), though the filter produced by `bloom([x, y])` has a higher false positive rate than the filter produced by `bloom([x]) & bloom([y])`.

           bloom([x]) | bloom([y])              == bloom([x, y])
          ๐Ÿกฅ                       ๐Ÿกค
    bloom([x])                   bloom([y])
          ๐Ÿกค                       ๐Ÿกฅ
           bloom([x]) & bloom([y])


Also based on Ribbon, the BuRR filters: https://arxiv.org/pdf/2109.01892 which seem to get very close to the theoretical optimum, 0.1%-0.5% overhead comparatively


Interesting, but not quite the same:

Xor and binary fuse filters require access to the full set of keys at construction time. In this sense, they are immutable. Alternatives have typically a fixed memory usage and a maximal capacity, but they also allow more flexibility such as progressive construction (adding keys one by one).


I may have missed something when skimming the paper, but it sounds like Xor filters are constructed offline and can't be modified efficiently afterwards, whereas Bloom filters can be inserted into efficiently. So they don't seem to be an exact replacement


No, you are correct. I misremembered with cuckoofilters.


Cuckoo filters are more efficient alternative. The algorithm of displacement is also very notable: how we can use random swaps to place data in cells optimally.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: