There is specifically block storage service (EBS) and flavors of it like EBS mul...

hn72774 · on March 10, 2024

S3 is better for large datasets. It's cheaper and handles large file sizes with ease.

It has become a de facto standard for distributed, data-intensive workloads like those common with Spark.

A key benefit is decoupling the data from the compute so that they can scale independently. EBS is tightly coupled to IOPS, and you pay extra for that.

(Source: a long time working in data engineering)

albert_e · on March 10, 2024

Yes and I also believe:

Experienced Spark / Data Engineering teams would not assume S3 is readily useable as a filesystem.

This [1] seems like a good guide on how to configure Spark for working with cloud object stores, while recognizing the limitations and pitfalls.

[1]: https://spark.apache.org/docs/latest/cloud-integration.html
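As a rough illustration of the kind of settings that guide [1] discusses for S3A, here is a minimal sketch that renders a few of them as `spark-submit` flags. The property names are the ones I recall from that page; the helper `to_submit_args` and the file name `job.py` are made up for illustration, and the committer classes assume the `spark-hadoop-cloud` module is on the classpath. Check everything against the docs for your Spark and Hadoop versions.

```python
# A few S3A-related settings along the lines of the Spark cloud-integration guide.
# Assumption: these suit your Spark/Hadoop versions; verify before relying on them.
S3A_SETTINGS = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.parquet.mergeSchema": "false",
    "spark.sql.parquet.filterPushdown": "true",
}


def to_submit_args(settings):
    """Flatten a settings dict into ['--conf', 'key=value', ...] (hypothetical helper)."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the full command line instead of running it.
    print(" ".join(["spark-submit"] + to_submit_args(S3A_SETTINGS) + ["job.py"]))
```

The point is mostly that writing to S3 safely is a matter of explicit configuration (committers in particular), not something you get by treating the bucket like a local disk.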

---

Amazon EMR offers a managed way to run Hadoop or Spark clusters, and it implements the EMR File System (EMRFS) [2] to interface with S3 as storage.

[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-fs.h...

AWS Glue is another option, offering "serverless" ETL. Source and destination can be S3 data lakes read through a data catalog (Hive or the Glue Data Catalog). During processing, AWS Glue can optionally use S3 [3,4,5] to store shuffle data.

[3]: https://aws.amazon.com/blogs/big-data/introducing-amazon-s3-...

[4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shu...

[5]: https://aws.amazon.com/blogs/big-data/introducing-the-cloud-...
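For the Glue shuffle option, a minimal sketch of the job parameters involved, as I understand them from the Glue documentation in [4]. The function name `glue_s3_shuffle_args` and the bucket URI are placeholders, and the parameter names should be verified against the current docs before use.

```python
def glue_s3_shuffle_args(shuffle_bucket):
    """Build Glue job parameters that send shuffle files to S3 (hypothetical helper).

    shuffle_bucket is a placeholder such as "s3://my-shuffle-bucket/prefix/".
    """
    if not shuffle_bucket.startswith("s3://"):
        raise ValueError("shuffle_bucket must be an s3:// URI")
    return {
        # Assumption: flag name as documented for the Glue S3 shuffle plugin.
        "--write-shuffle-files-to-s3": "true",
        # Assumption: Spark conf key that points the plugin at the bucket.
        "--conf": f"spark.shuffle.glue.s3ShuffleBucket={shuffle_bucket}",
    }


if __name__ == "__main__":
    print(glue_s3_shuffle_args("s3://my-shuffle-bucket/prefix/"))
```

A dict like this would typically be passed as the job's default arguments when the job is defined.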

hn72774 · on March 11, 2024

I think we're talking about two different things. I was addressing a section in the article about running databases backed by S3. It's less about S3 needing to act as a filesystem, and more about all of the RDBMS features that come along with the various types of DB transactions. It's a solved problem with the libraries I mentioned. Not something I'd ever recommend building on your own. Been there, done that, back when those solutions were still nascent. It wasn't worth the effort versus just using an RDBMS.

The problem that EMRFS is trying to solve doesn't cover RDBMS scenarios like row-level updates and deletes.
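To make the row-level point concrete, here is a toy sketch of what a "delete" looks like when the only operations are whole-object put and get. `FakeObjectStore` is an in-memory stand-in, not a real S3 client, and `delete_rows` is a hypothetical helper.

```python
import csv
import io


class FakeObjectStore:
    """In-memory stand-in for an object store: whole-object put/get only."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]


def delete_rows(store, key, column, value):
    """Delete matching rows by rewriting the entire object; return rows removed."""
    text = store.get(key).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    kept = [row for row in rows if row[column] != value]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)

    # No in-place edit is possible: the whole object is replaced.
    store.put(key, out.getvalue().encode("utf-8"))
    return len(rows) - len(kept)


if __name__ == "__main__":
    store = FakeObjectStore()
    store.put("t.csv", b"id,name\n1,a\n2,b\n3,c\n")
    print(delete_rows(store, "t.csv", "id", "2"))
    print(store.get("t.csv").decode("utf-8"))
```

Even this toy version has no isolation between concurrent writers and no atomicity across multiple objects, which is the part that is painful to build yourself.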

albert_e · on March 10, 2024

*flavors

*can be used

*file system

(Apologies for typos. The "noprocrast" setting sometimes locks us out of HN right after submitting a comment. And it is now too late, not editable)