Hash functions are kinda magic. Bloom filters are one of the fun tricks you can ...

f_devd · 2025-05-02T08:21:41 1746174101

> check the literature for various alternatives that are better in specific situations

For direct replacements, see Xor(+)-Filters: https://arxiv.org/abs/1912.08258 and the subsequent evolution into Binary Fuse Filters: https://arxiv.org/abs/2201.01174

sparkie · 2025-05-02T10:51:18 1746183078

Also quite recent are Ribbon filters: https://arxiv.org/abs/2103.02515

But these, and xor/binary-fuse-filters are only direct replacements for Bloom filters in some problems. There are still problems for which Bloom filters are a better solution.

Bloom filters have the advantage that the filter produced by adding two elements, is the same as the bitwise-or of a pair of filters made by adding each element separately.

    bloom([x, y]) == bloom([x]) | bloom([y])

Also, the filter produced by bitwise-and, `bloom([x]) & bloom([y])`, cannot have any bits set in it which are not also set by `bloom([x, y])`. We can assert that

    bloom([x]) & bloom([x, y]) == bloom([x])
    bloom([y]) & bloom([x, y]) == bloom([y])
    (bloom([x]) & bloom([y])) | bloom([x, y]) == bloom([x, y])

There are practical applications that this can provide which are not satisfied by the "optimized" variants. The bitwise-or does set union (or least upper bound), and the bitwise-and does a set intersection (or greatest lower bound), though the filter produced by `bloom([x, y])` has a higher false positive rate than the filter produced by `bloom([x]) & bloom([y])`.

           bloom([x]) | bloom([y])              == bloom([x, y])
          🡥                       🡤
    bloom([x])                   bloom([y])
          🡤                       🡥
           bloom([x]) & bloom([y])

f_devd · 2025-05-02T13:54:03 1746194043

Also based on Ribbon, the BuRR filters: https://arxiv.org/pdf/2109.01892 which seem to get very close to the theoretical optimum, 0.1%-0.5% overhead comparatively

samuel · 2025-05-02T08:59:20 1746176360

Interesting, but not quite the same:

Xor and binary fuse filters require access to the full set of keys at construction time. In this sense, they are immutable. Alternatives have typically a fixed memory usage and a maximal capacity, but they also allow more flexibility such as progressive construction (adding keys one by one).

virexene · 2025-05-02T09:04:30 1746176670

I may have missed something when skimming the paper, but it sounds like Xor filters are constructed offline and can't be modified efficiently afterwards, whereas Bloom filters can be inserted into efficiently. So they don't seem to be an exact replacement

f_devd · 2025-05-02T10:38:46 1746182326

No, you are correct. I misremembered with cuckoofilters.

Snawoot · 2025-05-02T09:40:18 1746178818

Cuckoo filters are more efficient alternative. The algorithm of displacement is also very notable: how we can use random swaps to place data in cells optimally.