
The way to work around this issue is to provide a presigned S3 URL.

Have the users upload to S3 directly, and then they can either POST you what they uploaded or you can find some other means of correlating the input (e.g., files in S3 prefixed with the request ID or something).
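
A minimal sketch of that with boto3 (the bucket name and key scheme here are placeholders, not anything prescribed):

    import uuid
    import boto3

    s3 = boto3.client("s3")

    def make_upload_url(request_id: str) -> str:
        # Prefix the key with the request ID so the upload can be
        # correlated with the originating request later.
        key = f"uploads/{request_id}/{uuid.uuid4()}.csv"
        # Presigned PUT URL, valid for 15 minutes; the client uploads
        # straight to S3 and this API never sees the file bytes.
        return s3.generate_presigned_url(
            "put_object",
            Params={"Bucket": "my-upload-bucket", "Key": key},
            ExpiresIn=900,
        )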

I agree this is annoying, and maybe I’ve been in the AWS ecosystem for too long.

However, having an API that accepts an unbounded amount of data is a good recipe for DoS attacks. I suppose the 100 MB limit is outdated now that the internet has gotten faster, but eventually we do need some limit.



Well, I partly agree, and if I were the one building the counterpart, I probably would have used presigned S3 URLs too.

In this specific case I’m getting old-school file upload requests from software that was partly written before the 2000s; no one is going to adjust anything anymore.

And yes, just accepting giant uploads is far from good in terms of "security" like DoS, but we’re talking about CSV files of somewhere between 100 and 300 MB (I called them "huge" because, in terms of product data, 200-300 MB of text covers quite a lot). Not great, but we try to satisfy our customers’ needs.

But yes, like all the other points: everything is solvable somehow. It just requires us to spend more time solving something that technically wasn’t a real problem in the first place.

Edit: Another funny example. In a similar process on another provider, I downloaded files in a similar size range from S3 to parse them, which died again and again. After contacting the hoster (their logs literally just stopped; no error tracing, nothing), they told me that their setup basically only allows for 10 MB of local storage, and the default client (in this case the AWS S3 adapter for PHP) always downloads the whole file even if you tell it to "stream". So I built a solution that used HTTP range requests to "fake stream" the file into memory in smaller chunks, so I could process it afterwards without downloading it completely. Just another example of: yes, it’s solvable, but annoying.
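
Roughly the shape of that workaround, sketched here in Python rather than the original PHP (the URL and chunk size are placeholders):

    import requests

    CHUNK = 8 * 1024 * 1024  # 8 MB per ranged request

    def fake_stream(url: str):
        # Probe with a one-byte ranged GET; Content-Range ends in "/<total>".
        probe = requests.get(url, headers={"Range": "bytes=0-0"})
        size = int(probe.headers["Content-Range"].split("/")[-1])
        # Fetch the file in ranges so only one chunk is ever held in
        # memory and nothing needs to be stored locally.
        for start in range(0, size, CHUNK):
            end = min(start + CHUNK, size) - 1
            resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
            resp.raise_for_status()
            yield resp.content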


I find with these types of customers it’s always easier to just ask them to save files locally and grant me privileges to read the data. Sometimes they’ll be on Google, Dropbox, Microsoft, etc., and I also run an SFTP server for this in case they want to move the files over to my service.

Then I either batch/schedule the processing or give them an endpoint just to trigger it (/data/import?filename=demo.csv).
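
A minimal sketch of such a trigger endpoint in Flask (the enqueue_import helper is hypothetical):

    from flask import Flask, request

    app = Flask(__name__)

    def enqueue_import(filename: str) -> None:
        ...  # hypothetical: hand the job to a queue or scheduler

    @app.get("/data/import")
    def trigger_import():
        # The file already sits in storage I can read; this endpoint
        # only kicks off the processing for it.
        filename = request.args.get("filename", "")
        if not filename:
            return {"error": "filename is required"}, 400
        enqueue_import(filename)
        return {"status": "queued", "filename": filename}, 202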

It’s actually so common that I just have the “data exchange” conversation and let them decide which fits their needs best. Most of it is available for self service configuration.


Yep, I concur. You need to meet them on their (legacy) terrain and get access to their data; then you can do whatever fancy thing you want.


Uploads to an S3 bucket can trigger a lambda… don’t complicate things. The upload trigger can tell the system about the upload and the client can continue on with their day.

The uploader on the client uses a presigned URL. S3 triggers a lambda. The lambda function takes the file path and tells the background workers about it, whether via queue, MQ, REST, gRPC, or by doing the lift in workflow ETL functions.
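
The glue function for the queue option is genuinely short; a sketch, with the queue URL as a placeholder:

    import json
    import boto3

    sqs = boto3.client("sqs")
    # Placeholder queue URL; substitute your own.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads"

    def handler(event, context):
        # S3 put events arrive batched under "Records"; forward each
        # bucket/key pair to the worker queue and return immediately.
        for record in event["Records"]:
            body = {
                "bucket": record["s3"]["bucket"]["name"],
                "key": record["s3"]["object"]["key"],
            }
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body))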

Easy peasy. /s


> Uploads to an S3 bucket can trigger a lambda… don’t complicate things.

I read this and was getting ready to angrily start beating my keyboard. The best satire is hard to detect.


I don't really get the joke. S3 triggering a lambda doesn't sound meaningfully more complicated than using a lambda by itself. What am I missing?


Solving a serverless limitation with more serverless so you can continue doing serverless when you can’t FormUpload a simple 101 MB zip file as an application/octet-stream. Doubling down on it for a triple beat.


I wouldn't really call it "more" serverless to rearrange the order a bit. Which makes it "solving a serverless limitation so you can continue doing serverless". And that's just a deliberately awkward way of saying "solving a serverless limitation", because if you can solve it easily, why would you not continue? Spite?

So I still don't see how it's notably worse than the idea of using serverless at all.


The controversy here is the fact that API Gateway limits the upload size, resulting in having to engineer a workaround workflow using S3 and triggers (even if this is the serverless way) when all you want to do is upload a file. A POST call with an application/octet-stream body. Let HTTP handle resume. But you can’t, and you end up going in through the side door, when all you really want is client_max_body_size.

The sarcasm of being correct while playing down the complexity is entirely my own. We used to be able to do things easily.


It gets really complex in this workflow to even achieve something like “file processed successfully” on the client side with this approach.

How will your client know if your backend lambda crashed or whatever? All it knows is that the upload to S3 succeeded.

Basically you’re turning a synchronous process into an asynchronous one.
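
Concretely, the client ends up polling some job-status endpoint instead of just reading the HTTP response; a sketch where the endpoint and payload shape are entirely hypothetical:

    import time
    import requests

    def wait_for_result(job_id: str, timeout: float = 300.0) -> dict:
        # The S3 upload succeeding says nothing about processing, so poll
        # a (hypothetical) status endpoint until the backend reports a result.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = requests.get(f"https://api.example.com/jobs/{job_id}").json()
            if status["state"] in ("succeeded", "failed"):
                return status
            time.sleep(2)
        raise TimeoutError(f"job {job_id} still pending after {timeout}s")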


unfortunately they ruined it at the end with that /s


Did I? I don’t think I did.


And while you are being sarcastic, this is the Right Way to use queues.

Upload file to S3 -> trigger an SNS message for fanout if you need it -> SNS -> SQS trigger -> SQS to ETL jobs.

The ETL job can then be hosted using Lambda (easiest) or ECS/Docker/Fargate (still easy and scales on demand) or even a set of EC2 instances that scale based on the items in a queue (don’t do this unless you have a legacy app that can’t be containerized).
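
The consuming end of that chain, as an SQS-triggered Lambda, might look like this sketch (assuming SNS raw message delivery is off, so each SQS body is an SNS envelope):

    import json

    def handler(event, context):
        # Each SQS record's body is an SNS envelope whose "Message" field
        # carries the original S3 event; unwrap both layers, then work.
        for record in event["Records"]:
            s3_event = json.loads(json.loads(record["body"])["Message"])
            for rec in s3_event["Records"]:
                run_etl(rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])

    def run_etl(bucket: str, key: str) -> None:
        ...  # hypothetical: the actual ETL job on the uploaded object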

If your client only supports SFTP, there is the SFTP Transfer Service on AWS that will allow them to send the file via SFTP and it is automatically copied to an S3 bucket.

Alternatively, there are products that treat S3 as a mountable directory and they can just use whatever copy commands on their end to copy the file to a “folder”


If I have a user-facing upload button, why can’t I simply have a webserver that receives the data and pushes it into S3 via multipart upload? Something that can be written in a framework of your choice in 10 minutes with zero setup.

For uploads under 50 MB you could also skip the multipart upload and take a naive approach without taking a significant hit.
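
Something like this Flask/boto3 sketch (route and bucket name are made up); upload_fileobj switches to multipart automatically above a configurable threshold:

    import boto3
    from flask import Flask, request

    app = Flask(__name__)
    s3 = boto3.client("s3")

    @app.post("/upload")
    def upload():
        # Stream the raw request body to S3; boto3 uses multipart
        # automatically once the size passes its threshold (8 MB default).
        name = request.args.get("name", "upload.bin")
        s3.upload_fileobj(request.stream, "my-upload-bucket", name)
        return {"status": "stored", "key": name}, 201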


You can: you generate the pre-signed S3 URL and the client uploads to the place your URL tells it to.

https://fullstackdojo.medium.com/s3-upload-with-presigned-ur...
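
The client side is just as small; a sketch with requests (the URL comes from wherever you generate it):

    import requests

    def upload_via_presigned(url: str, path: str) -> None:
        # Stream the local file straight to the presigned URL; no AWS
        # credentials are needed here, since the signature in the URL
        # is the authorization.
        with open(path, "rb") as f:
            requests.put(url, data=f).raise_for_status()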

And before you cry “lock-in”: S3-API-compatible services are a dime a dozen outside of AWS, including GCP and even Backblaze B2.


Every day we stray further from the light


If you don’t do it this way you fail the system design interview.


Nope, you didn’t use terracottax so you failed anyway. 6 months before you can reapply in case the first humiliation wasn’t enough. Boss was looking for AWS Glue in there and you didn’t use it.


> Easy peasy. /s

It actually is though. I don't need to build a custom upload client, I don't need to manage restart behavior, I get automatic restarts if any of the background workers fail, I have a dead letter queue built in to catch unusual failures, I can tie it all together with a common API that's a first class component of the system.

Working in the cloud forces you to address the hard problems first. If you actually take the time to do this everything else becomes _absurdly_ easy.

I want to write programs. I don't want to manage failures and fix bad data in the DB directly. I personally love the cloud and this separation of concerns.


> I don't need to build a custom upload client

GP said this is an app from the 2000s.

For S3 you do need to generate a presigned URL, so you would have to add this logic there somewhere instead of "just having a generic HTTP upload endpoint".

Unless the solution is "don't have the problem in the first place" the cloud limitations are just getting in the way here.


The solution is to use the appropriate tool for the job. If you're locked in to highly crusty legacy software, it's inevitably going to require workarounds. There are good technical reasons why arbitrary-size single-part file uploads are now considered an anti-pattern. If you must support them, then don't be shocked if you wind up needing EC2 or other lower-level service as a point of ingress into your otherwise-serverless ecosystem.

If we want to treat the architectural peculiarities of GP's stack as an indictment of serverless in general, then we could just as well point to the limitations of running LAMP on a single machine as an indictment of servers in general (which obviously would be silly, since LAMP is still useful for some applications, as are bare metal servers).


Generating a signed URL is trivial in itself; it’s only a few lines and a function call to get one. But you then have to send this URL to the client, the client has to use it, then check back with you to see if the file arrived, resulting in a kind of pea-soup architecture unless your application is also entirely event driven. Oh, how we get suckered in…


> Working in the cloud forces you to address the hard problems first.

It also forces you to address all the non-existent problems first, the ones you just wish you had, like all the larger companies that genuinely have to deal with thousands of file uploads per second.

And don’t forget all the new infrastructure you added to do the job of just receiving the file in your app server and putting it into the place it was going to go anyway, but via separate components that always seem to end up with individual repositories and separate deployment pipelines, and that can’t be effectively tested in isolation without going into their target environment.

And all the additional monitoring you need on each of the individual components that were added, particularly on those helpful background workers to make sure they're actually getting triggered (you won't know they're failing if they never got called in the first place due to misconfiguration).

And you're now likely locked into your upload system being directly coupled to your cloud vendor. Oh wait, you used Minio to provide a backend-agnostic intermediate layer? Great, that's another layer that needs managing.

Is a content delivery network better suited than your app server to handling file uploads from millions of concurrent users? I’d honestly hope so; that’s what it’s designed for. Was it necessary? I’d like to see the numbers first.

At the end of the day, every system design decision is a trade off and almost always involves some kind of additional complexity for some benefit. It might be worth the cost, but a lot of these system designs don't need this many moving parts to achieve the same results and this only serves to add complexity without solving a direct problem.

If you're actually that company, good for you and genuinely congratulations on the business success. The problem is that companies that don't currently and may never need that are being sold system designs that, while technically more than capable, are over-designed for the problem they're solving.


> the ones you just wish you had

You will have these problems. Not as often as the larger companies but to imagine that they simply don't exist is the opposite of sound engineering.

> if they never got called in the first place due to misconfiguration

Centralized logging is built into all these platforms. Debugging these issues is one of the things that becomes absurdly easy.

> likely locked into your upload system

The protocol provided by S3 is available through dozens of vendors.

> Was it necessary?

It only matters if it is of equivalent or lesser cost.

> every system design decision is a trade off

Yet you explicitly ignore these.

> are being sold system designs

No, I just read the documentation, and then built it. That's one of those "trade offs" you're willingly ignoring.


> You will have these problems. Not as often as the larger companies but to imagine that they simply don't exist is the opposite of sound engineering.

A lot of those failure mode examples seem well suited to client-side retries and appropriate rate limiting. If we're talking file uploads then sure, there absolutely are going to be cases where the benefits of having clients go to the third-party is more beneficial than costly (high variance in allowed upload size would be one to consider), but for simple upload cases I'm not so convinced that high-level client retries aren't something that would work.

> if they never got called in the first place due to misconfiguration

I find it hard to believe that having more components to monitor will ever be simpler than fewer. If we're being specific about vendors, the AWS console is IMHO the absolute worst place to go for a good centralized logging experience, so you almost certainly end up shipping your logs into a better centralized logging system that has more useful monitoring and visualisation features than CloudWatch and has the added benefit of not being the AWS console. The cost here? Financial, time, and complexity/moving parts for moving data from one to the other. Oh and don't forget to keep monitoring on the log shipping component too, that can also fail (and needs updates).

> The protocol provided by S3 is available through dozens of vendors.

It's become a de facto standard for sure, and is helpful for other vendors to re-implement it but at varying levels of compatibility.

> It only matters if it is of equivalent or lesser cost.

This is precisely the point, I'm saying that adding boxes in the system diagram is a guaranteed cost as much as a potential benefit.

> Yet you explicitly ignore these

I repeatedly mentioned things that to me count as complexity that should be considered. Additional moving parts/independent components, the associated monitoring required, repository sprawl, etc.

> No, I just read the documentation, and then built it.

I also just 'read the documentation and built it', but other comments in the thread allude to vendor-specific training pushing not only vendor-specific solutions (no surprise) but also the use of vendor-specific technology that maybe wasn’t necessary for a reliable system. Why use a simple pull-based API with open standards when you can tie everything up in the world of proprietary vendor solutions that have their own common API?


> The protocol provided by S3 is available through dozens of vendors.

But not all of the S3 API is supported by other vendors; for example, the asynchronous triggers for Lambdas and the CloudTrail logs that you write code to parse.


Enjoyed reading this, thanks for writing it.

People often don’t know how a different approach might be easier in their case.

Following others, or the best practices, when they might not apply in their case can lead to social-proof architecture a little too often.


this kinda proves the point that you have to know a silly workaround



