I have to say that I'm not hugely convinced. I don't really think that being able to pull out the keys before or after a prefix is particularly impressive. That is the basis for database indices going back to the 1970s after all.
Perhaps the use-cases you're talking about are very different from mine. That's possible of course.
But for me, often the slow speed of listing the bucket gets in the way. Your bucket doesn't have to get very big before listing the keys takes longer than reading them. I seem to remember that listing operations ran at sub-1 Mbps, but admittedly I don't have a big bucket handy right now to test that.
It depends on a few factors. The list objects call hides deleted and noncurrent versions, but it has to skip over them. Grouping prefixes also takes time, if they contain a lot of noncurrent or deleted keys.
A pathological case would be a prefix with 100 million deleted keys, and 1 actual key at the end. Listing the parent prefix takes a long time in this case - I’ve seen it take several minutes.
If your bucket is pretty “normal” and doesn’t have this, or isn’t versioned, then you can do 4-5 thousand list requests a second, at any given key/prefix, in constant time. Or you can explicitly list object versions (and not skip deleted keys), also in constant time.
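For versioned buckets, here's a minimal sketch of that second option (listing versions explicitly rather than having the service skip delete markers for you), assuming boto3 and a hypothetical bucket/prefix:

    import boto3

    s3 = boto3.client("s3")

    # list_object_versions returns noncurrent versions and delete markers
    # explicitly instead of skipping over them, so the cost of a prefix full
    # of deleted keys is visible to the caller rather than hidden server-side.
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket="my-bucket", Prefix="logs/"):  # hypothetical names
        for v in page.get("Versions", []):
            print("version:", v["Key"], v["VersionId"], v["IsLatest"])
        for d in page.get("DeleteMarkers", []):
            print("delete marker:", d["Key"], d["VersionId"])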
It all depends on your data: if you need to list all objects then yeah it’s gonna be slow because you need to paginate through all the objects. But the point is that you don’t have to do that if you don’t want to, unlike a traditional filesystem with a directory hierarchy.
And this enables parallelisation: why list everything sequentially, when you can group the prefixes by some character (e.g. "-") and then process each of those prefixes in parallel.
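A minimal sketch of that idea with boto3 (the bucket name and delimiter are hypothetical): one ListObjectsV2 call with a Delimiter returns the common prefixes, and each prefix can then be listed in its own thread.

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"  # hypothetical

    # Group keys by the chosen delimiter; CommonPrefixes are the "groups".
    top = s3.list_objects_v2(Bucket=BUCKET, Delimiter="-")
    prefixes = [p["Prefix"] for p in top.get("CommonPrefixes", [])]
    root_keys = [o["Key"] for o in top.get("Contents", [])]  # keys containing no delimiter

    def list_prefix(prefix):
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys

    # List each prefix in parallel instead of paginating the bucket sequentially.
    with ThreadPoolExecutor(max_workers=8) as pool:
        groups = list(pool.map(list_prefix, prefixes))
    all_keys = root_keys + [k for group in groups for k in group]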
We and our customers use S3 as a POSIX filesystem, and we generally find it faster than a local filesystem for many benchmarks. For listing directories we find it faster than Lustre (a real high-performance filesystem). Our approach is to first try listing directories with a single ListObjectsV2 (which on AWS S3 returns keys in lexicographic order) and, if it hasn't made much progress, we start listing with parallel ListObjectsV2 calls. Once you start parallelising ListObjectsV2 (rather than sequentially "continuing") you get massive speedups.
> find it faster than a local filesystem for many benchmarks.
What did you measure? How did you compare? This claim seems very contrary to my experience and understanding of how things work...
Let me refine the question: did you measure metadata or data operations? What kind of storage medium is used by the filesystem you use? How much memory (and subsequently the filesystem cache) does your system have?
----
The thing is: you should expect something like 5 ms latency on network calls over the Internet in the best case. Within the datacenter, maybe you can achieve sub-ms latency, but that's hard. AWS within a region but across zones tends to be around 1 ms latency.
Meanwhile, NVMe latency, even on consumer products, is 10-20 microseconds. I.e. we are talking about roughly 100 times faster than anything going over the network can offer.
For AWS, we're comparing against filesystems in the datacenter - so EBS, EFS and FSx Lustre. Compared to these, you can see in the graphs that S3 is much faster for workloads with both big files and small files:
https://cuno.io/technology/
Normally, from someone working in storage, you'd expect tests to be reported in IOPS, and the go-to tool for reproducible tests is fio. I mean, of course "reproducibility" is a very broad subject, but people are so used to this tool that they develop a certain intuition for it and for interpreting its results.
On the other hand, seeing throughput figures is kinda... it tells you very little about how the system performs. Just to give you some reasons: a system can be configured to do compression or deduplication on the client or the server, and this will significantly impact your throughput depending on what you actually measure: the amount of useful information presented to the user, or the amount of information transferred. Also, throughput at the expense of higher latency may or may not be a good thing... Really, if you ask anyone who has ever worked on a storage product how they could crank up throughput numbers, they'd tell you: "write bigger blocks asynchronously". That's the basic recipe, if that's what you want. Whether this makes a good all-around system or not... I'd say, probably not.
Of course, there are many other concerns. Data consistency is a big one, and this is a typical tradeoff when choosing between an object store and a filesystem: a filesystem offers stronger consistency guarantees, whereas an object store can do certain things faster by relaxing them.
BTW, I don't think most readers would understand Lustre and similar to be a "local filesystem", since it operates over the network and network performance will have a significant impact; of course, that also puts it in the same ballpark as other networked systems.
I'd also say that Ceph is kinda missing from this benchmark... Again, if we are talking about a filesystem on top of an object store, it's the prime example...
IOPS is a really lazy benchmark that we believe can diverge greatly from most real-life workloads, except for truly random I/O in applications such as databases. For example, in machine learning, training usually consists of taking large datasets (sometimes many PBs in scale), randomly shuffling them each epoch, and feeding them into the engine as fast as possible. Because of this, we see storage vendors for ML workloads concentrate on IOPS numbers. The GPUs, however, only really care about throughput. Indeed, we find a great many applications only really care about throughput, and IOPS is only relevant insofar as it helps achieve that throughput.
For ML, we realised that the shuffling isn't actually random - there's no real reason for it to be random rather than pseudo-random. And if it's pseudo-random then it is predictable, and if it's predictable then we can exploit that to great effect - yielding a 60x boost in throughput on S3, beating out a bunch of other solutions. S3 is not going to do great for truly random I/O; however, we find that most scientific, media and finance workloads are actually deterministic or semi-deterministic, and this is where cunoFS, by peering inside each process, can better predict intra-file and inter-file access patterns, so that we can hide the latencies present in S3. At the end of the day, the right benchmark is the one that reflects real-world usage of applications, but that's a lot of effort to document one by one.
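As a toy illustration of the pseudo-random point (not the cunoFS implementation, just the underlying observation): a seeded shuffle is fully reproducible, so the "random" epoch order is knowable in advance and can drive prefetching.

    import random

    def epoch_order(num_samples, epoch, seed=1234):
        """Return the sample order for a given epoch. Because the shuffle is
        seeded, the 'random' order is deterministic and known ahead of time."""
        order = list(range(num_samples))
        random.Random(seed + epoch).shuffle(order)
        return order

    order = epoch_order(num_samples=1_000_000, epoch=3)
    # A loader that knows the seed can fetch the next samples from S3
    # before the training loop asks for them, hiding the request latency.
    lookahead = order[:64]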
I agree that things like dedupe and compression can skew results, so in our large-file benchmarks each file is actually random data. The small-file benchmarks aren't affected by "write bigger blocks" because there's nothing bigger than the file itself. Yes, data consistency can be an issue, and we've had to do all sorts of things to provide POSIX consistency guarantees beyond what S3 (or compatible stores) can offer. These come with restrictions (such as on concurrent writes to the same file from multiple nodes), but so does NFS. In practice, we introduced a cunoFS Fusion mode that relies on a traditional high-IOPS filesystem for such workloads and for consistency (automatically migrating data to that tier), and on high-throughput object storage for other workloads that don't need it.
> And if its pseudo-random then it is predictable, and if its predictable then we can exploit that to great effect
This is an interesting hack. However, an IOP is an IOP: no matter how well you predict and prefetch it to hide the latency, it's still going to be translated into a GetObject.
I think what you really exploited here is that even though S3 is built on HDDs (which have very low IOPS per TiB), its scale is so large that even if you milk 1M+ IOPS out of it, AWS still doesn't care and is happy to serve you. But if my back-of-the-envelope calculation is correct, this isn't going to work well if everyone starts doing it.
How do you get around S3's 5.5k GET per second per prefix limit? If I only have ~200 20GiB files can you still get decent IOPS out of it?
and...
> IOPS is a really lazy benchmark that we believe can greatly diverge from most real life workloads
No, it's not. I have a workload training a DL model on time-series data which demands 600k 8KiB IOPS per compute instance. None of the things I tested worked well. I had to build a custom solution on bare-metal NVMes.
Sorry for the late response - I didn't see your comment until now.
Our aim is to unleash all the potential that S3/Object has to offer for file system workloads. Yes, the scale of AWS S3 helps, as does erasure coding (which enhances flexibility for better load balancing of reads).
Is it suitable for every possible workload? No, which is why we have a mode called cunoFS Fusion where we let people combine a regular high-performance filesystem for IOPS, and Object for throughput, with data automatically migrated between the two according to workload behaviour. What we find is that most data/workloads need high throughput rather than high IOPS, and this tends to be the bulk of data. So rather than paying for PBs of ultra-high IOPS storage, they only need to pay for TBs of it instead. Your particular workload might well need high IOPS, but a great many workloads do not. We do have organisations doing large scale workloads on time-series (market) data using cunoFS with S3 for performance reasons.
> EFS is ridiculously slow though. Almost to the point where I fail to see how it’s actually useful for any of the traditional use cases for NFS.
Would you care to elaborate on your experience or use case a bit more? We've made a lot of improvements over the last few years (and are actively working on more), and we have many happy customers. I'd be happy to give a perspective of how well your use case would work with EFS.
Source: PMT turned engineer on EFS, with the team for over 6 years
Unfortunately I can’t say too much publicly on HN. But one of the big shortcomings is dealing with hundreds of files. It doesn’t even matter if those are big or small files (I’ve had experience with both).
Services like DataSync show that the underlying infra can be performant. But it feels almost impossible to replicate that on EFS via standard POSIX APIs. And unfortunately one of our use cases depend upon that.
It feels, to me at least, like EFS isn't where AWS's priorities lie. At least if you compare EFS to FSx Lustre and recent developments in S3, both of which have been the direction our AWS SAs have pushed us.
S3 is really high latency though. I store Parquet files on S3 and querying them through DuckDB is much slower than a local filesystem because of the random access patterns. I can see S3 being decent for bulk access, but definitely not for random access.
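For context, this is roughly the setup being described (the bucket path and region are hypothetical); each column chunk or row group DuckDB touches becomes a ranged GET against S3, which is where the latency shows up.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")                 # enables s3:// paths
    con.execute("SET s3_region = 'us-east-1';")  # hypothetical region
    # Credentials also need configuring, e.g. SET s3_access_key_id / s3_secret_access_key.

    # Selective or random access pays the S3 round-trip latency repeatedly,
    # since each Parquet range read turns into its own GET request.
    result = con.execute(
        "SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')"
    ).fetchall()
    print(result)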
This is why there’s a new S3 Express offering that is low latency (but costs more).
It can't be a POSIX filesystem if it doesn't meet POSIX filesystem guarantees. I worked on an S3-compatible object store at a large storage company, and we also had distributed filesystem products. Those are completely different animals due to the different semantics and requirements. We've also built compliant filesystems over object stores and the other way around. Certain operations, like write-append, are tricky to simulate over object stores (S3 didn't use to support append; I haven't really stayed up to date, does it now?). At least when I worked on this, it wasn't possible to simulate POSIX semantics over S3 at all without adding additional object store primitives.
> Once you start parallelising ListObjectsV2 (rather than sequentially "continuing")
How are you "parallelizing" ListObjectsV2? The continuation token can only be fed in once the previous ListObjectsV2 response has completed, unless you know the name or structure of the keys ahead of time, in which case listing objects isn't necessary.
For example, you can do separate parallel ListObjectsV2 calls for keys starting with a-f, g-k, etc., covering the whole key space. You can parallelize recursively based on what is found in the first 1000 entries, so that the split matches the statistics of the keys. Yes, there may be pathological cases, but in practice we find this works very well.
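A minimal sketch of that key-range splitting with boto3 (bucket name and split points are hypothetical): ListObjectsV2 has StartAfter but no end-of-range parameter, so each worker stops paginating once keys sort past its boundary.

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"  # hypothetical

    def list_key_range(start, end):
        """List keys in (start, end); stop paginating once keys pass 'end'.
        Note StartAfter is exclusive, so a key exactly equal to a boundary
        would need extra handling - this is only a sketch."""
        keys = []
        kwargs = {"Bucket": BUCKET}
        if start:
            kwargs["StartAfter"] = start
        while True:
            resp = s3.list_objects_v2(**kwargs)
            for obj in resp.get("Contents", []):
                if end is not None and obj["Key"] >= end:
                    return keys
                keys.append(obj["Key"])
            if not resp.get("IsTruncated"):
                return keys
            kwargs["ContinuationToken"] = resp["NextContinuationToken"]

    # Hypothetical split points; in practice you'd derive these (recursively)
    # from the first page of results so the ranges match the key distribution.
    bounds = [None, "g", "n", "t", None]
    ranges = list(zip(bounds[:-1], bounds[1:]))
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = list(pool.map(lambda r: list_key_range(*r), ranges))
    all_keys = [k for chunk in chunks for k in chunk]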