
> Biggest difference so far is that Minio is just files on disk, Garage chunks all files and has a metadata db.

I'd kind of expect most blob storage solutions to use abstractions other than just the file system, or at least consider doing so.

I recently built a system to handle millions of documents as a proof of concept, and when I was testing it with 10 million files the server ran out of inodes, before I moved the blobs over to attached storage formatted with XFS: https://blog.kronis.dev/tutorials/3-4-pidgeot-a-system-for-m...

With abstracted storage (say, files bunched up into containers X MB large, or chunked into such when too large, with something else keeping track of what is where), that wouldn't be such an issue, though you might run into other issues along the way.
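A toy sketch of that kind of abstraction (hypothetical names; the index isn't persisted and there's no deletion, compaction, or crash recovery - a real system needs all three) could pack small blobs into a handful of big container files, so the filesystem never sees millions of inodes:

```python
import os

class ContainerStore:
    """Toy blob store: appends blobs into large container files and keeps
    an in-memory index of (container, offset, length) per key. A blob
    larger than the container size simply gets a container to itself."""

    def __init__(self, root, container_size=4 * 1024 * 1024):
        self.root = root
        self.container_size = container_size
        self.index = {}    # key -> (container_path, offset, length)
        self._current = 0  # id of the container currently being filled

    def _container_path(self):
        return os.path.join(self.root, f"container-{self._current:06d}.bin")

    def put(self, key, data):
        path = self._container_path()
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        if offset and offset + len(data) > self.container_size:
            self._current += 1  # roll over to a fresh container
            path, offset = self._container_path(), 0
        with open(path, "ab") as f:
            f.write(data)
        self.index[key] = (path, offset, len(data))

    def get(self, key):
        path, offset, length = self.index[key]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

The tradeoff the thread discusses shows up immediately: reads need the index, so backups and inspection now require tooling instead of `ls`.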

It's curious that we don't advocate for storing blobs in relational databases anymore, even though I can also understand the reasoning (or at least why having a separate DB for your system data and your blob data would be a good idea, for backups/test data/deciding where to host what and so on).



> I'd kind of expect most blob storage solutions to use abstractions other than just the file system, or at least consider doing so.

Honestly, I'd expect the exact opposite. Filesystems are really good at storing files. Why not leverage all that work?

> I recently built a system to handle millions of documents as a proof of concept and when I was testing it with 10 million files, the server ran out of inodes, before I went over to storing the blobs in some attached storage that had XFS

That's a misconfiguration issue though, not a reason to not store blobs as files on disk. Ext4 can handle 2^32 files. ZFS can handle 2^128(?).
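As a quick sanity check before committing to millions of small files, the inode numbers `df -i` reports are available from Python via `os.statvfs` (a sketch; on ext4 the inode count is fixed at mkfs time, while XFS and ZFS allocate inodes dynamically, so some filesystems report 0 or very large totals):

```python
import os

# Inspect inode capacity/usage of a mount point - the same numbers
# that `df -i` prints. f_files is the total inode count, f_ffree the
# free count.
st = os.statvfs("/")
print(f"inodes: {st.f_files - st.f_ffree} used / {st.f_files} total")
```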

> With abstracted storage (say, files bunches up into X MB large containers or chunked into such when too large, with something else to keep track of what is where) that wouldn't be such an issue, though you might end up with other issues along the way.

A few issues that come to mind for me:

* This requires tuning to actually reduce the number of inodes used for certain datasets. E.g., if I'm storing large media files, that chunking would _increase_ the number of files on disk, not reduce it. At which point, if inode limits are the issue, we're just making things worse.

* It adds additional complexity. Now you need to account for these chunks, and, if you care about the data, check it periodically.

* You need specific tooling to work with it. Files on a filesystem are.. files on a filesystem. Easy to backup, easy to view. Arbitrary chunking and such requires tooling to perform operations on it. Tooling that may break, or have the wrong versions, or.. etc.

> It's curious that we don't advocate for storing blobs in relational databases anymore, even though I can also understand the reasoning

In my experience, the popular RDBMS out there just aren't good at it. With the way locking semantics and their transaction queueing works, storing and retrieving lots of blobs just isn't performant. You can get away with it for a long time though, and it can be pretty nice when you can.


> Filesystems are really good at storing files. Why not leverage all that work?

As an asterisk, the S3 API is key-value pairs, not files; that distinction comes up a lot when interacting with Amazon S3, and I would expect the same with an S3 API clone. For example, ListObjects[1] takes a "delimiter" that clients conventionally set to "/", making the bucket appear to be a filesystem, but "." or "!" would be perfectly fine delimiters and thus would have no obvious filesystem mapping.

1: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...


Why is it useful?


That's a complicated question, but it lets me highlight what I was bringing up: the Key is any sequence of Unicode characters[1], so while it has become conventional to use "/", imagine if you wanted to store the output of exploded jar files in S3, but be able to "list the directory" of a jar's contents: `PutObject("/some-path/my.jar!/META-INF/MANIFEST.MF", "Manifest-Version: 1.0")`

Now you can `ListObjects(Prefix="/some-path/my.jar", Delimiter="!")` to get the "interior files" back.
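To make the flat-namespace behavior concrete, here's a pure-Python emulation of that prefix/delimiter grouping (not the AWS SDK - just an approximation of ListObjects semantics, ignoring pagination, encoding, and the 1000-key limit):

```python
def list_objects(keys, prefix="", delimiter=""):
    """Emulate S3 ListObjects over a flat key namespace: keys without
    the delimiter after the prefix come back as contents; keys that do
    contain it are rolled up into "common prefixes" - the virtual
    directories."""
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll everything up to the first delimiter into one prefix.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common)

keys = [
    "/some-path/my.jar",
    "/some-path/my.jar!/META-INF/MANIFEST.MF",
    "/some-path/my.jar!/com/example/Main.class",
    "/some-path/other.txt",
]
# The jar shows up as a single "directory entry"...
print(list_objects(keys, prefix="/some-path/my.jar", delimiter="!"))
# ...and listing past the "!" returns the interior files.
print(list_objects(keys, prefix="/some-path/my.jar!", delimiter="!"))
```

Note there is no directory anywhere: the hierarchy is computed at list time from flat keys, which is exactly why an arbitrary delimiter works.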

I'm sure there are others; that's just one I could think of off the top of my head. Mapping a URL and its interior resources would be another: `PutObject("https://example.com\t/script[1]", "console.log('hello, world')")`

Further fun fact that even I didn't know until searching for other examples: "delimiter" is a string and thus can be `Delimiter=unknown` or such: https://github.com/aws/aws-sdk-go/issues/2130

1: see the ListObject page under "encoding-type"


Minio supports virtual ZIP directories for such use cases. In your example, as long as this was enabled and your jar file was properly detected, you could submit a GET for "/some-path/my.jar/META-INF/MANIFEST.MF" and get the contents of that file just fine.


One will observe I said list, not get, although in this case it's likely a non-issue because Minio supports the S3 API https://github.com/minio/minio-go/blob/v7.0.45/api-list.go#L... and thus should support the 2nd example I provided, too


Imagine you have a billion files in a “directory”. Being able to find files that start with “xyz” in constant time is a very, very useful property.


> Honestly, I'd expect the exact opposite. Filesystems are really good at storing files. Why not leverage all that work?

File systems are optimized for a hierarchical organization on a single machine. However, this kind of organization inhibits storing data in a distributed system because of the links between entries. S3 and similar object stores are a distributed, flat “filesystem”. There’s no relationship between files and there’s no grouping (aside from a virtual one you can simulate, but which doesn’t really exist). That’s why S3 doesn’t suffer weird directory traversal attacks that bring your file system to a crawl: such potentially expensive operations simply don’t exist.


> Honestly, I'd expect the exact opposite. Filesystems are really good at storing files. Why not leverage all that work?

There are lots of different file systems out there and you won't always get a say in what your cloud vendor has on offer. However, if you can launch a container on the system that does an abstraction on top of the file system, takes its best parts and makes up for any shortcomings it might have in a mostly standardized way, then you can benefit from it.

That's not always the right way to go about things: it seems to work nicely for relational databases and how they store data, whereas in regards to storing larger bits of binary data, there are advantages and shortcomings to either approach. At the end of the day, it's probably about tradeoffs and what workload you're working with, what you want to achieve and so on.

> That's a misconfiguration issue though, not a reason to not store blobs as files on disk. Ext4 can handle 2^32 files. ZFS can handle 2^128(?).

Modern file systems are pretty good and can support lots of files, but getting a VPS from provider X doesn't mean that they will. Or maybe you have to use a system that your clients/employer gave you - a system that, with such an abstraction, would be capable of doing what you want, but currently doesn't. I agree that it's a misconfiguration in a sense, but not always one that you can rectify yourself.

> * This requires tuning to actually reduce the number of inodes of used for certain datasets. E.g., if I'm storing large media files, that chunking would _increase_ the number of files on disk, not reduce it. At which point, if inode limits are the issue, we're just making it worse.

This is an excellent point, thank you for making it! However, it's not necessarily a dealbreaker: on one hand, you can probably gauge what sorts of data you're working with (e.g. PDF files that are around 100 KB in size, or video files that are around 1 GB each) and tune accordingly, or perhaps let such a system rebalance data into chunks dynamically, as needed.

> * It adds additional complexity. Now you need to account for these chunks, and, if you care about the data, check it periodically.

As long as things keep working, many people won't care (which is not actually the best stance to take, of course) - how many care about what happens inside of their database when they do SQL queries against it, or what happens under the hood of their compatible S3 store of choice? I'll say that I personally like keeping things as simple as possible in most cases, however the popularity of something like Kubernetes shows that it's not always what we go for as an industry.

I could say the same about using PostgreSQL for certain workloads, for which SQLite might also be sufficient, or opting for a huge enterprise framework for a boring CRUD when something that has a codebase one tenth the size would suffice. But hey, as long as people don't constantly get burned by these choices and can solve the problems they need to, to make more money, good for them. Sometimes an abstraction or a piece of functionality that's provided reasonably outweighs the drawbacks and thus makes it a viable choice.

> * You need specific tooling to work with it. Files on a filesystem are.. files on a filesystem. Easy to backup, easy to view. Arbitrary chunking and such requires tooling to perform operations on it. Tooling that may break, or have the wrong versions, or.. etc.

This is actually the only point where I'll disagree.

You're always one directory traversal attack against your system away from having a really bad time. That's not to say it will always happen, or that accessing unintended data cannot happen with other storage solutions - the adjacent example of relational databases will make anyone recall SQL injection, and S3 has its stories of insecure buckets leaking confidential information. But being told that you can just use the file system will have many people using files as an abstraction in the programming language of their choice, without always considering the risks of sub-optimal engineering, like directory traversal attacks or file permissions.

Contrast this to a scenario where you're given a (presumably) black box that exposes an API to you - what's inside the box is code written by other people who are more clever than you (the "you" here being an average engineer) and that nicely handles many of the concerns you might not even have thought of. And if there are ever serious issues or good reasons for peeling back that complexity, look up the source code of that black box on GitHub and start diving in. Of course, in the case of MinIO and many other storage solutions, that's already what you get, and it's good enough.

That's actually why I or others might use something S3 compatible, or something that gives you signed URLs for downloading files - so you don't have to think about (or mess up) how the signing works. That's also why I and many others would be okay with having a system that eases the implications of needing to think about file systems, by at least partially abstracting them away. Edit: removed unnecessary snarky bits about the admittedly leaky abstractions you often get.

Honestly, that's why I like databases letting you pick whatever storage engines are suitable for your workloads, similarly to how object storage solutions might approach the issue - just give the user the freedom to choose how they want to store their blobs at the lower level, giving sane defaults otherwise. Those defaults might as well be just files on a filesystem. In regards to object storage, that's before we get into thinking about file names (especially across different OSes), potential conflicts and file versioning, as well as maximum file size supported by any number of file systems that you might need to support.


To put it pretty bluntly: you were off the rails at "getting a VPS from provider X doesn't mean that they will". You're talking in terms of not having a custom kernel, and that's just the wrong layer of abstraction if we're talking about "cloud"; this whole discussion is really about VM and colo levels of abstraction anyway ("Cloud" advice would be "Just use your vendor's S3 blobstore").

Base Ubuntu has xfs support. If your VPS provider won't run plain old Ubuntu with some cloudinit stuff, get a new VPS provider.


> It's curious that we don't advocate for storing blobs in relational databases anymore

That's exactly what I did recently at a new job: migrated blobs from the DB to S3. It significantly reduced load on the servers (and will reduce it more; right now the implementation is primitive - just proxying S3 - and handing out URLs will let other services talk to S3 directly). It also solved a backup nightmare: those people couldn't do backups because their server ran out of space every month. I'll admit the backup issue is more like admin incompetence, but I work with what I get. Having the database shrink from 200GB to 80MB now allows backing it up and restoring it in seconds rather than hours.

I didn't find any issues with the S3 approach. Even transactions are solved, at the cost of a tiny possibility of leaving junk in S3, which is a non-issue: just upload all data to S3 before the commit and delete it if the commit fails (and if the commit fails and the delete also fails, so be it).
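That upload-before-commit pattern can be sketched like this (hypothetical `store`/`db` stand-ins for an S3 client and a DB connection - not any particular SDK's API):

```python
def save_document(db, store, key, data, insert_row):
    """Write the blob to object storage first, then commit the DB row
    that references it; on failure, best-effort delete the blob."""
    store.put(key, data)       # 1. blob goes to object storage first
    try:
        insert_row(db, key)    # 2. row referencing the blob
        db.commit()
    except Exception:
        db.rollback()
        try:
            store.delete(key)  # 3. best-effort cleanup; an orphaned
        except Exception:      #    blob is harmless junk that a
            pass               #    periodic sweep can collect later
        raise
```

The key property is the failure mode: a crash between steps 1 and 2 leaves an unreferenced blob (junk), never a DB row pointing at a missing blob.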



