Hint to the uninformed: don't look up "bitmap" and wonder why anyone would index them. Look up "bitmap index" instead. This will save you many clicks in the GitHub project as well as in your favorite search engine. It's about databases, not pixels.
>The key addition with Pilosa is that it's distributed and can scale horizontally :)
Doesn't this also make it several orders of magnitude slower? Every time I find myself using bitmaps, it's when speed is extremely important. What are some use cases for this not covered by in-process bitmaps, Bloom filters, and HyperLogLog?
I'm just not aware of many bitmap-index use cases where you'd be willing to trade that much speed for database-like access. I think this project would be more useful factored into a library than as something external.
The use case for Pilosa is usually as an addition to something like Cassandra or HDFS where you have terabytes (or more) of data that you want to be able to query more flexibly - especially if there are a very large number of attributes that you want to filter and segment on (think tens of millions).
I think you're right though, that there are use cases which would benefit from a library exposing this functionality - you need to have quite a lot of data before compressed bitmaps representing the relationships in that data start overflowing memory on a single machine.
This is exactly something I've been looking for to experiment with user segmentation, but crucially for fast, ad-hoc segmentation. Seems to bear some similarity to what Facebook [1] does for its audience insights.
SlicingDice founder here. We built SlicingDice for exactly this kind of need: very fast user segmentation. We actually developed it because we couldn't find any other solution that supported our requirements.
User segmentation was actually the original use case that prompted Pilosa's development! Segmenting hundreds of millions of users with tens of millions of potential attributes.
Pilosa actually uses Roaring internally for bitmap compression (in fact, one of the Pilosa devs did the first port of Roaring from Java to Go). One difference is that really wide bitmaps are sharded and distributed among nodes in the cluster. This allows Pilosa to parallelize the work of doing bitwise operations on high cardinality data.
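For anyone curious what the single-node building block looks like, here's a minimal sketch using the Go Roaring port (github.com/RoaringBitmap/roaring). The attribute names and user IDs are invented for illustration; this is just the library, not Pilosa's sharding layer:

    package main

    import (
    	"fmt"

    	"github.com/RoaringBitmap/roaring"
    )

    func main() {
    	// One compressed bitmap per attribute; the bit position is the user ID.
    	usesToothpasteX := roaring.BitmapOf(1, 3, 7, 1000000)
    	livesInTexas := roaring.BitmapOf(3, 7, 42)

    	// Segmenting users is a bitwise AND of the two bitmaps.
    	both := roaring.And(usesToothpasteX, livesInTexas)
    	fmt.Println(both.GetCardinality()) // 2 (users 3 and 7)
    	fmt.Println(both.ToArray())        // [3 7]
    }

Pilosa's distribution story is essentially this, but with each very wide bitmap split into fixed-size slices so different nodes can run the AND on their slices in parallel.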
Based on what I can gather from the data model, this would be great for categorical data where there is a limited number of categories for each event? Would this be fast for querying quantitative data where values might be unique?
Pilosa was built for high cardinality data, so queries on unique numeric values would still be fast. Translating quantitative data to a categorical form fits Pilosa’s data model better, so the index would be more powerful. We have explored a few basic techniques for this, which you can read about here: https://www.pilosa.com/use-cases/taming-transportation-data/
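As a rough illustration of that categorical translation (the bucket boundaries below are made up for this sketch, not taken from the linked post), a quantitative field like trip distance can be mapped to a small set of range buckets, each of which becomes its own row (bitmap) in the index:

    // distanceBucket maps a raw trip distance (in miles) to a bucket ID;
    // each bucket gets its own bitmap row. Boundaries are illustrative.
    func distanceBucket(miles float64) int {
    	switch {
    	case miles < 1:
    		return 0 // short hop
    	case miles < 5:
    		return 1 // cross-town
    	case miles < 20:
    		return 2 // long ride
    	default:
    		return 3 // outlier
    	}
    }

A range query like "distance < 5 miles" then becomes a union (OR) of a couple of bucket bitmaps rather than a scan over raw values.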
As I understand it, this sort of index is roughly how Elasticsearch is able to do filters quite quickly after warmup: it stores bitwise mappings of filter results on a document-by-document basis.
(That said, this is one ES thing that's impossible to find documentation for).
Sure. Say you have 40k respondents in a survey who were each asked whether they use toothpaste X (yes/no). Store all 40,000 answers in 5,000 bytes. To look up respondent i, shift i right by 3 to find the byte, then use the remainder (i mod 8) to pick and test the bit within that byte. All fast operations.
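In code, that index arithmetic is just a shift and a mask. A minimal sketch, not tied to any particular library:

    // answers holds one bit per respondent: 40,000 bits = 5,000 bytes.
    var answers [5000]byte

    // setAnswer records a yes (true) or no (false) for respondent i.
    func setAnswer(i int, yes bool) {
    	if yes {
    		answers[i>>3] |= 1 << (i & 7) // byte i/8, bit i%8
    	} else {
    		answers[i>>3] &^= 1 << (i & 7)
    	}
    }

    // usesToothpaste tests respondent i's bit.
    func usesToothpaste(i int) bool {
    	return answers[i>>3]&(1<<(i&7)) != 0
    }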