
There is a dedicated block storage service (EBS), with options like EBS Multi-Attach, as well as EFS for shared file storage, that can be used if there is a need to port software/databases to the cloud with low-level filesystem support.

Why would we need to do it on object storage, which addresses a different type of storage need?

Nevertheless, there are projects like EMRFS and S3 file system mount points that try to provide file system interfaces to workloads that need to see S3 as a filesystem.



S3 is better for large datasets. It's cheaper and handles large file sizes with ease.

It has become a de facto standard for distributed, data-intensive workloads like those common with Spark.

A key benefit is decoupling the data from the compute so that they can scale independently. With EBS, performance is tied to provisioned IOPS, and you pay extra for that.

(Source: a long time working in data engineering)


Yes, and I also believe:

Experienced Spark / data engineering teams would not assume S3 is readily usable as a filesystem.

This [1] seems like a good guide on how to configure Spark for working with cloud object stores, while recognizing the limitations and pitfalls.

[1]: https://spark.apache.org/docs/latest/cloud-integration.html
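As a rough illustration of the kind of configuration that guide discusses, here is a minimal Python sketch that keeps S3A committer settings as plain data and renders them as `spark-submit` flags. The three property names are my recollection of the guide's recommended committer setup and assume the `spark-hadoop-cloud` and `hadoop-aws` modules are on the classpath; the helper `to_submit_args` and the script name `job.py` are hypothetical. Check the values against the guide for your Spark version before relying on them.

```python
# Sketch only: committer settings assumed from the Spark cloud-integration guide.
# Verify each key against the documentation for your Spark/Hadoop versions.
S3A_COMMITTER_CONF = {
    # Use an S3A committer instead of the rename-based default.
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):  # sorted for a stable, reviewable command line
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # `job.py` is a placeholder for your own Spark application.
    print(" ".join(["spark-submit"] + to_submit_args(S3A_COMMITTER_CONF) + ["job.py"]))
```

The point of these settings is that committing job output by renaming directories is not safe or cheap on an object store, which is one of the pitfalls the guide calls out.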

---

Amazon EMR offers a managed way to run Hadoop or Spark clusters, and it implements an "EMR FS" [2] system to interface with S3 as storage.

[2]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-fs.h...

AWS Glue is another option, which is "serverless" ETL. Source and destination can be S3 data lakes read through a data catalog (Hive or Glue Data Catalog). During processing, AWS Glue can optionally use S3 [3,4,5] for shuffle partitions.

[3]: https://aws.amazon.com/blogs/big-data/introducing-amazon-s3-...

[4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shu...

[5]: https://aws.amazon.com/blogs/big-data/introducing-the-cloud-...


I think we're talking about two different things. I was addressing a section in the article about running databases backed by S3. It's less about S3 needing to act as a filesystem, and more about all of the RDBMS features that come along with the various types of DB transactions. It's a solved problem with the libraries I mentioned. Not something I'd ever recommend building on your own. Been there, done that, when those solutions were still nascent. It wasn't worth the effort vs. just using an RDBMS.

The problem that EMRFS is trying to solve doesn't cover the RDBMS scenarios like row-level updates and deletes.
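To make the row-level update point concrete, here is a toy Python sketch. `ObjectStore` is a hypothetical in-memory stand-in that only supports whole-object get/put, which is the assumption being illustrated; it is not the S3 API, and the CSV layout is made up for the example.

```python
import csv
import io


class ObjectStore:
    """Toy object store: objects can only be read or replaced as a whole."""

    def __init__(self):
        self._objects = {}
        self.bytes_written = 0  # tracks how much data each put costs

    def put(self, key, data):
        self._objects[key] = data
        self.bytes_written += len(data)

    def get(self, key):
        return self._objects[key]


def update_row(store, key, row_id, new_value):
    """Change one row by reading, rewriting, and re-uploading the whole object."""
    rows = list(csv.reader(io.StringIO(store.get(key).decode())))
    for row in rows:
        if row[0] == row_id:
            row[1] = new_value
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    # No in-place edit is available, so the full object is written again.
    store.put(key, buf.getvalue().encode())
```

Changing a single value costs a rewrite of the entire object, and nothing here handles two writers updating the same object at once. That gap is what a transactional layer has to fill, and a filesystem-style interface on its own does not.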


*flavors

*can be used

*file system

(Apologies for typos. The "noprocrast" setting sometimes locks us out of HN right after submitting a comment. And it is now too late, not editable)



