Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Pilosa: open source, distributed bitmap index in Go (github.com/pilosa)
183 points by jaffee on May 3, 2017 | hide | past | favorite | 22 comments


Hint to the uninformed: don't look up "bitmap" and wonder why anyone would index them. Look up "bitmap index" instead. This will save you many clicks in the github project as well as in your favorite search engine. It's about databases, not pixels.


Good call - wikipedia has a nice overview: https://en.wikipedia.org/wiki/Bitmap_index

The key addition with Pilosa is that it's distributed and can scale horizontally :)


>The key addition with Pilosa is that it's distributed and can scale horizontally :)

Doesn't this also make it several orders of magnitude slower? Every time I find myself using Bitmaps it's when speed is extremely important. What are some use cases for this not covered by in-process Bitmaps, Bloom filters, and HyperLogLog?

I'm just not aware of many use cases for bitmap indexes willing to trade that much speed for database-like access. I think this project would be more usable factored into a library than something external


The use case for Pilosa is usually as an addition to something like Cassandra or HDFS where you have terabytes (or more) of data that you want to be able to query more flexibly - especially if there are a very large number of attributes that you want to filter and segment on (think tens of millions).

I think you're right though, that there are use cases which would benefit from a library exposing this functionality - you need to have quite a lot of data before compressed bitmaps representing the relationships in that data start overflowing memory on a single machine.


Lucene (used by ElasticSearch) uses Bitmaps a lot and it is one of the main reasons for their low storage usage and search/query speed.

At my company we also strongly use Bitmaps on the analytics database engine we developed (S1Search).


This is exactly something I've been looking for to experiment with user segmentation, but crucially for fast, ad-hoc segmentation. Seems to bear some similarity to what Facebook [1] does for its audience insights.

[1] https://code.facebook.com/posts/382299771946304/audience-ins...


SlicingDice founder here. We built SlicingDice exactly for this kind of necessity, very fast user segmentation. Actually, we just developed it because we didn't find any other solution that could support our needs.

https://blog.slicingdice.com/why-we-built-slicingdice-1beffc...


User segmentation was actually the original use case that prompted Pilosa's development! Segmenting hundreds of millions of users with tens of millions of potential attributes.


Very nice job.

Looks like we were trying to solve the same problem (user segmentation), in the same industry (DMP), at the same time (2013-2016). LOL.

I'm the founder of a DMP too.

https://blog.slicingdice.com/why-we-built-slicingdice-1beffc...

I will email you guys.


How does this compare to https://github.com/RoaringBitmap/CRoaring?

Is the difference that this one is distributed?


Pilosa actually uses Roaring internally for bitmap compression (in fact, one of the Pilosa devs did the first port of Roaring from Java to Go). One difference is that really wide bitmaps are sharded and distributed among nodes in the cluster. This allows Pilosa to parallelize the work of doing bitwise operations on high cardinality data.


I'll add to travisturner's response - Pilosa has a few bells and whistles beyond being a straight up bitmap index including:

- associating each bit with a timestamp (at various granularities) and queries over time ranges.

- adding arbitrary key/value metadata to each row or column

- automatic sorting/caching of bitmaps to support "TopN" queries


Neat, thanks for the clarifications. Looks like a cool tool.


Based on what I can gather from the data model, this would be great for categorical type data where there are a limited amount of categories for each event? Would this be fast for querying quantitative data where values might be unique?


Pilosa was built for high cardinality data, so queries on unique numeric values would still be fast. Translating quantitative data to a categorical form fits Pilosa’s data model better, so the index would be more powerful. We have explored a few basic techniques for this, which you can read about here: https://www.pilosa.com/use-cases/taming-transportation-data/


I appreciate how they developed a query language that is more code-like than SQL but still basically declarative and parsimonious.


As I understand, this sort of index is roughly how Elasticsearch is able to do filters quite quickly after warmup. They store bitwise mappings of filter results on a document by document basis.

(That said, this is one ES thing that's impossible to find documentation for).


Great way to encode discrete answers for qualitative survey data. Reads are smoking fast.


could you please elaborate or give some example?


Sure. Say you have 40k respondents in a survey who were each asked if they use toothpaste X (yes/no). Save all answers in 5k bytes, then map to any particular respondents answer by shifting right by 3 and checking remainder. Then test the bit you indexed to using the remainder. All fast operations.


I love the website[0] for the product; very pleasing.

[0]: https://www.pilosa.com/





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: