Merklizing the key/value store for fun and profit (joelgustafson.com)
118 points by joelg on June 11, 2023 | 16 comments


What’s worth mentioning is that IPFS is built on top of a data model named IPLD.

https://ipld.io/

If you are planning on doing your own experiments with Merkle trees and such, I can strongly recommend using IPLD as well. The IPLD project has standard libraries for a bunch of programming languages.

The advantage of using IPLD is that you can use regular IPFS tooling to inspect the data and make backups/restores, regardless of whether you actually want to share the data over IPFS.
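
For a concrete feel, here's a minimal sketch (assuming the JS `multiformats` and `@ipld/dag-cbor` packages, which are the reference IPLD libraries for JavaScript) of encoding a tree node as an IPLD block so ordinary IPFS tooling can address and inspect it:

    import * as Block from 'multiformats/block'
    import { sha256 } from 'multiformats/hashes/sha2'
    import * as dagCbor from '@ipld/dag-cbor'

    // Encode an arbitrary node (here a key/value entry plus child links) as dag-cbor.
    const block = await Block.encode({
      value: { key: 'a', value: 'foo', children: [] },
      codec: dagCbor,
      hasher: sha256,
    })

    // The CID is the node's content address; once the block is in a blockstore,
    // `ipfs dag get <cid>` can inspect it.
    console.log(block.cid.toString())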


As someone following German politics I had to read the headline twice :) I haven't seen "Merkle" used without "tree" before.

(Yes, the spelling is not the same, but we are talking associations, not exact science.)



Exactly... I went there looking for how one can sell out one's key/value store to the Russian state mafia for profit! Isn't that what Merkelizing means? Oh, sorry, it's merklizing. Never mind.


LOL. I also thought of Merkel when looking at the title.


Wouldn't it be simpler to use a trie over the hashes instead? It seems to me like it would have the properties desired here. I think the parent/child rule described here might actually result in some kind of trie.


The difference is that a trie over the hashes doesn't preserve lexicographical key ordering or support range queries, which are typically expected from key/value stores. But you're right - if you just need `get` / `set` / `delete`, you could just do that!
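
To illustrate the trade-off (a toy sketch, not from the article): with keys kept in their natural order, a range query is a contiguous scan, whereas hashing the keys scatters lexicographically adjacent keys, so the same query degenerates into touching everything.

    // Keys stored in sorted order: a range query walks a contiguous slice.
    function rangeScan(sortedKeys: string[], lo: string, hi: string): string[] {
      return sortedKeys.filter((k) => k >= lo && k < hi) // contiguous in a real tree
    }

    // Keys placed by hash(key): adjacent keys ("user:100", "user:101", ...) land
    // in unrelated positions, so rangeScan(..., 'user:100', 'user:200') can only
    // be answered by scanning every entry.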


I read this with great interest. Sometimes technologies come around that make me seriously wonder if I should embrace them or continue on the path I’ve already designed.

They include: CRDTs, PouchDB, hypercore, etc.

What I currently have is my own PHP-based implementation of essentially a history of “collaborative documents that evolve through time” — with each document having one machine as a source of truth:

https://qbix.com/platform/guide/streamsConcepts

So basically, I think that syncing and CRDTs (y.js and automerge) can be excellent for collaborating on documents, but not so great for chatrooms and smart contracts / crypto where order of operations matters. For those, you can implement optimistic interface updates but when conflicts (rarely) arise, your interface may roll back changes it displayed. This is especially true after a netsplit.

I thought about using PouchDB, but the thing is that CouchDB is an extra process people have to run, and we want our stuff to run anywhere WordPress does. Ditto for hypercore. These are great technologies, but it seems what we have is great for the very reason that it's in PHP - still the most widely supported commodity hosting environment in the world by far. What are the stats these days for PHP vs Node, Go, etc.?


It sounds like you've got a working solution already - is that right? If so, the old adage "if it ain't broke, don't fix it" applies.

Beyond that, if you're looking at these technologies to enhance what you've got, or to solve problems you have or confidently expect, then there are clear wins to be had.

Automerge and y.js are absolutely great for collaboration on blocks of text or structured documents (or anything that can be modelled as such). They are purely for conflict resolution, so if you don't have conflicts, or your use case manages to make the impact of conflicts negligible, don't bother with them. They are very easy to implement if you do need them - the barrier most people face is understanding them enough to decide if and how to use them. Just keep it simple.
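
As a rough sketch of how little code that can be, here's the basic Yjs loop (based on its documented API; the transport layer is up to you): each peer makes local edits, updates are exchanged in any order, and concurrent edits merge deterministically.

    import * as Y from 'yjs'

    // Two peers editing the same text concurrently.
    const docA = new Y.Doc()
    const docB = new Y.Doc()

    docA.getText('body').insert(0, 'hello ')
    docB.getText('body').insert(0, 'world')

    // Exchange state as updates (in practice via a websocket/webrtc provider).
    Y.applyUpdate(docB, Y.encodeStateAsUpdate(docA))
    Y.applyUpdate(docA, Y.encodeStateAsUpdate(docB))

    // Both documents converge to the same string regardless of delivery order.
    console.log(docA.getText('body').toString() === docB.getText('body').toString()) // true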

Hypercore is a p2p solution for synchronising streams of data. It can be used raw, for streaming updates, or you can use the abstractions over it to keep filesystems or databases in sync. There's a significant difference between the decision about hypercore vs CRDTs - hypercore can potentially remove the need for you to maintain and operate a specific server, allowing all clients to be servers. This also reduces the dependency of the user on infrastructure - your servers, the Internet, a friendly government, etc. If that might be useful to you or your users, it's worth considering. Unlike PouchDB, it's a low level abstraction that can open up really new ways of thinking about shared information. PouchDB by contrast is 'just' a p2p syncing database. Also important: hypercore doesn't require its own process - you can roll it into your code directly, or have it in a separate process if you want.
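
As a minimal sketch (using the hypercore v10 API as I understand it), the core is just an append-only log that lives in your process and replicates over any duplex stream:

    import Hypercore from 'hypercore'

    // An append-only log stored locally -- no separate server process required.
    const core = new Hypercore('./storage')
    await core.ready()

    await core.append(Buffer.from('first update'))
    await core.append(Buffer.from('second update'))

    console.log(core.length)                    // 2
    console.log((await core.get(0)).toString()) // 'first update'

    // Replication is a duplex stream you can pipe over any transport,
    // e.g. const stream = core.replicate(true) on the initiating side.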

Of all the things you mentioned, I'd say that hypercore is the only one worth diving into regardless of whether you seem to need it. You'll be a better systems designer and engineer once you understand it, and will have an expanded landscape of possibility when you can leverage it.

As for php vs node etc - that hasn't been an issue for a long time. Node is supported everywhere.


>As for php vs node etc - that hasn't been an issue for a long time. Node is supported everywhere.

I would guess what's being referred to for PHP is the ultra-cheap shared hosting where your account runs in the same OS instance as other people's accounts. You can get Node.js to work, but it's harder to deal with since it binds to a specific IP/port, and single IP addresses are shared by many customers. PHP has more options to make shared hosting look like a normal webserver to the outside. Shared hosts often use the LiteSpeed webserver with "lsapi", which makes a shared webserver a little more configurable/performant on a per-account basis.


Question: since doing this requires hashing your entire tree, what are the implications of doing this hashing operation on possibly millions of entries (which I'm estimating is the scale at which comparing linearly really starts being noticeably slow)?

I'm guessing you need to choose the hash function carefully, but is hashing 2n elements and then comparing in log(n) actually that much faster than comparing in n? Evidently, with the right settings, yes, or we wouldn't be here, but I'm just wondering if the hashing step doesn't actually end up costing a lot more than we think by saying "oh, we just hash it".


The Merkle tree is persisted, and updating it is only log(n). You only have to "hash 2n elements" once, and then incrementally maintain it. It's negligible overhead for free log(n) diffing.
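
A toy sketch (my own illustration, not the article's code) of what "incrementally maintain it" means: only the hashes on the path from the changed leaf to the root are recomputed, so a single update costs O(log n) hashes.

    import { createHash } from 'node:crypto'

    const h = (...parts: (string | Buffer)[]) => {
      const hash = createHash('sha256')
      for (const p of parts) hash.update(p)
      return hash.digest()
    }

    // Toy binary Merkle tree over a fixed-size array of leaves.
    class MerkleArray {
      private levels: Buffer[][] = []

      constructor(values: string[]) {
        this.levels = [values.map((v) => h(v))]
        // Build all intermediate levels bottom-up (the one-time "hash 2n elements").
        while (this.levels[this.levels.length - 1].length > 1) {
          const prev = this.levels[this.levels.length - 1]
          const next: Buffer[] = []
          for (let i = 0; i < prev.length; i += 2) {
            next.push(i + 1 < prev.length ? h(prev[i], prev[i + 1]) : prev[i])
          }
          this.levels.push(next)
        }
      }

      // O(log n): rehash only the path from the changed leaf to the root.
      set(index: number, value: string) {
        this.levels[0][index] = h(value)
        let i = index
        for (let level = 0; level + 1 < this.levels.length; level++) {
          const parent = Math.floor(i / 2)
          const left = this.levels[level][2 * parent]
          const right = this.levels[level][2 * parent + 1]
          this.levels[level + 1][parent] = right ? h(left, right) : left
          i = parent
        }
      }

      root(): Buffer {
        return this.levels[this.levels.length - 1][0]
      }
    }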


If you are excited about these data structures, you might be interested in my new database engine, which uses prolly trees and is designed for React developers.

Blog: https://fireproof.storage/posts/from-mlops-to-point-of-sale:...

React hook: https://use-fireproof.com


Does anyone know a good resource on how to efficiently build a Merkle tree from a big k/v store? For example, should you cache intermediate nodes?


Does anyone know how tools like Figma or Miro handle conflict resolution and synchronization so efficiently in real time? For example: the position of a simple colored box being manipulated by two or more people at the same time. Is this article even remotely relevant for such a use case?


For collaborative editing you basically have two approaches to synchronizing changes: CRDTs or OT (operational transformation). Figma chose CRDTs. They wrote two blog posts about it:

https://www.figma.com/blog/how-figmas-multiplayer-technology...

https://www.figma.com/blog/making-multiplayer-more-reliable/




