On Internal Engineering Practices at Amazon (jatins.gitlab.io)
350 points by wheresvic1 on March 26, 2019 | 127 comments


Ex-Amazon SDE here.

A lot of this has changed.

First, there is a movement to build a lot of services in Native AWS instead of MAWS/Apollo.

Apollo doesn't require copying configs anymore; you can have the config exist as part of the package you are deploying. Generally, that's a best practice.

Pipelines can be configured as code too.

There is a centralized log service which requires onboarding. It does require some commandline tools, but it works. The logs get stored on S3, IIRC.

If containers suit your needs, you'd be hard-pressed to find someone telling you not to use them. Generally, though, you would want to use bare metal for Amazon's scale.

There is also a change to how NPM is being used at Amazon. It was a lot easier towards the end of my tenure, and was probably as close as it would get when working with Amazon's build systems.

Amazonians are generally conservative and don't use the latest and greatest unless it solves an actual customer need. Customer Obsession is still the defining leadership principle.


Any take on how this looks for someone who doesn't work at Amazon?

I know Apollo is their deployment tool. No idea how it plays into a Kubernetes or even a container stack.


Conceptually, there is a lot of overlap. Apollo had a few layers to it but, more or less, matched Kubernetes plus additional tooling. Dare I say closer to OpenShift? Brazil, the build tool, infrastructure, and repository, was similar to using nixpkgs. Brazil had some features beyond nixpkgs that were absolutely essential for a business like Amazon, but otherwise the goals and even techniques overlapped.

I onboarded very, very quickly by already knowing nixpkgs and container management. The concepts translated well. The few aspects that didn't translate were either things Amazon handled that other tools have yet to encounter (cool to learn about) or historical bits that didn't matter (fine to ignore).


> Apollo had a few layers to it but, more or less, matched Kubernetes plus additional tooling

Have you looked at the codebase? Apollo and Brazil are tiny compared to Kubernetes & friends. You could reimplement them with a small script on top of traditional Linux package management (modulo the web UI).


As someone who worked on Apollo, I'd challenge this. Without divulging any details, Apollo is a non-trivial system.


Yeah, I work adjacent to the Apollo org (and have worked near Pfrheak, too. Hi!). I would love to see a HackerNews commenter's idea of what it would take to re-implement Apollo. I'll bring popcorn.


> you would want to use bare metal for Amazon's scale

Containers are "bare metal".


> Containers are "bare metal"

Please explain this use of jargon. Are you saying that a Docker container is "bare metal" in some culture? A VM is bare metal? What do you mean by container, and bare metal?


Containers are just namespaces for things within the Linux kernel. Unlike with VMs, you're not running separate instances of the OS; it's all run by just one kernel instance, and that kernel usually runs directly on the hardware, that is, on "bare metal". That "ubuntu" base image you can spin up does not actually run the Ubuntu kernel. As a result, bare-metal containers incur none of the "virtualization penalty" that VMs do.
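To make that concrete, here's a minimal sketch (Linux only, Python 3; purely illustrative, not anything Docker itself ships): every process already belongs to a set of kernel namespaces, and a "container" is just a process whose namespace IDs differ from the host's.

    import os

    # List the kernel namespaces of the current process (Linux only).
    # A Docker container is simply a process whose IDs here differ from the
    # host's; there is no second kernel and no hypervisor involved.
    for ns in sorted(os.listdir("/proc/self/ns")):
        print(ns, "->", os.readlink(f"/proc/self/ns/{ns}"))

Run the same snippet inside a container and you'll see different IDs for namespaces like pid, net, and mnt; that's the whole trick.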

Cloud is, in fact, abnormal in terms of how containers are used in that it wraps them into VMs. That's in part because VMs were already there and they simplify provisioning of something that looks like "machines" from the outside, and in part because containers do not offer much in the way of security guarantees (precisely because all containers running on the same host share the kernel), whereas VMs do.


Ah, you're using it to mean "no virtualization". Got it, thanks.

I tend to use the term to mean single-tenant, user-supplied OS, non-virtualized myself but I see where you're coming from.


All of that may be true as well - if the user picks the underlying OS (i.e. kernel) to run the container engine and installs on bare metal, then literally everything you wrote applies.


My workstation runs multiple processes in multiple namespaces without containers. Is it bare metal? What if I install applications via flatpak? Still bare metal? Now what if I use docker?


It's tempting to say that containers don't have any virtualization costs, but they often do for networking. For some applications, the addition of another networking driver doesn't matter. For others, it does.


There is no "another networking driver" in containers. It's the same networking driver.


For standard Docker container deployments, there is. See: https://docs.docker.com/network/


That's not a "driver" in the traditional sense, meaning it doesn't talk to the actual hardware directly. It's just a bridge or iptables.


*shrug* It's additional software in the networking stack, and it has a performance penalty. I think it's fair to call it a "virtualization cost," since you're paying for the container's networking abstraction.


I think GP meant in the sense that Docker containers are just normal programs whose instructions are being executed directly by the CPU (just like how my laptop is executing Firefox as I type this) without relying on hardware virtualization features.


Yeah, I was confused by this too. Is the parent possibly thinking of virtual machines?


Fun fact: all of Google (including Google Cloud) is run in containers. IIRC even VMs are run in containers. The VMs, of course, could contain customer containers, too. It's literally containers "all the way down".


Based on my experience, the article contains a lot of misinformation. Some of the statements might have been true at one point in the past, but are now out of date by years, while others have never been true in the time I've been around.

Without getting into a point-by-point rebuttal, my reaction to each section/Exhibit is "that's wrong/misleading".


Isn't that a bit disingenuous? Your role at Amazon is nothing like the typical engineer at Amazon. You live in the shiny new world while the majority of engineers are stuck on something that's not too far from the article.


The median tenure of engineers at Amazon is 1 year. That means that the new engineers need senior and principal engineers to guide them on what tools exist. If an engineering organization happens to have strong senior Amazon engineers then they can guide their teams/org to use the tools that exist, because they do exist.

However, everything (and I mean everything) at Amazon depends on the team (and organization) that you land in. Some organizations do not have senior technical leadership; service ownership is handed off to teams without long tenured Amazon engineers so they do not get exposed to the types of tools to use (nor do these teams get time to discover, learn, and on-board to the tools that do exist). This is how an engineer can have the experience written about in the article.

The article is anecdotal, and definitely not the norm for the "majority of engineers".


1 year median, wow - they must lose a lot of institutional knowledge that way.


I was an intern at AWS this past summer (and will be returning as a new grad in May) and this article is not accurate based on my experience. My code (which was part of our team's actual production system, not a toy project) ran on EC2 instances in an autoscaling group. There was no manual provisioning of servers. I worked with a Kinesis Stream and several other resources, and there was a way for us to programmatically retrieve the names of the resources. My impression was that the system we used to find a resource name works for all AWS resources, so I imagine you could get the name of a load balancer this way as well. Finally, I'll just say my team worked entirely in Python. Maybe that was an exception, but it is certainly sufficient to directly falsify the statement that "internally, at Amazon, you can't even use a language other than Java." (How would the author even know this? Was he in on the S-team meeting where it was declared that all non-Java users get a pip?)
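For what it's worth, you don't even need an internal system for that kind of lookup; the public SDK can do it too. A rough boto3 sketch (the load balancer name and region are made-up placeholders):

    import boto3

    # Hypothetical example: resolve a load balancer's DNS name at runtime
    # instead of hard-wiring it in a config file.
    elbv2 = boto3.client("elbv2", region_name="us-east-1")
    resp = elbv2.describe_load_balancers(Names=["my-service-lb"])
    print(resp["LoadBalancers"][0]["DNSName"])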


The Java claim is just factually false.

Java/C++/Python/Ruby/Go/Perl/Scala, even Rust, are used at Amazon for various projects. What might be true is that the more critical your service is, the more people will lean towards conservative/mature/enterprise-ish languages rather than exotic ones, because exotic inherently means risk. And Java is the go-to option under that consideration, just because it is Java, the least controversial choice if you ask me.

But compared with other companies I've worked at (except extremely small startups, where tech guidance is basically non-existent), Amazon is the least restrictive when it comes to languages. It fits Amazon's self-perception: it is pragmatic, non-opinionated, morally agnostic. It cares about the customer, because the customer means the potential of eventual profit. However you achieve that is not Amazon's concern.


The L8 manager of our org declared that all projects must be on the JVM, and if the language wasn't Java, then we would need to submit a six pager explaining why. Perhaps he was just being conservative, but to me there's something off-putting when those types of decisions come from managers rather than from the engineers.

That said, our org did manage to have a few projects that aren't Java, so I guess I would have to concede that the Java claim is factually false as well. But it sure doesn't feel like I am empowered to choose something other than Java.


Plus, you know, https://aws.amazon.com/corretto/

We do actually vend a JDK. Yes, it's conservative. It's also necessary.


I can confirm that the claim "you can't even use a language other than Java" is completely false. Java is often the best tool for the job because of how much institutional inertia there is behind it, but certainly nobody will stop a team from using a better tool.

PHP is banned at Amazon, and that's the exception that proves the rule: a language being banned is so unusual that I'm aware specifically of the example of PHP.


And, yet, there are some places where PHP is used (under explicit conditions and exceptions).


I left Amazon a little over 3 years ago, having worked there for over 3 years. I would agree that this article seems misleading. Some of the things the article complains about were fixed while I was working there. The article compares Apollo to Kubernetes/Docker but Apollo predates Docker by years.

The article complains about CloudWatch and log aggregation, but when I started Amazon internally had a mature log aggregation and monitoring solution that I would say is still unquestionably superior to the public version of CloudWatch.


I worked at a startup and then switched to a bigger product company: Zoho.

If there's one thing I'd take away for the rest of my career from Zoho, it would be frugality in adopting the latest of tech.

When NoSQL was all the rage, the company stood firm that relational databases had a rock-solid mathematical foundation and stayed away from the bandwagon. It paid off.

When every other company wrote blogs about rewriting their software in NodeJS, Ruby & Python, the company stood ground with statically typed languages. It paid off.

My own team, Zoho Writer has a strong policy against incorporating third party libraries without good reasons. This way, the product is nearly a decade old, but the JS size has remained surprisingly small, all through its evolution.

I believe the wisdom of staying frugal in adopting the latest hype can only be judged in hindsight.


I have shared this blog posting more than I care to admit.

https://mcfunley.com/choose-boring-technology

It amazes me that people will put their companies and employees at risk by using the new cool stuff just because everyone else is (looking at you, k8s).


I was in a management discussion at a prior company where people were actually making technology decisions based on how our stack would look in job ads and whether it would excite potential candidates.


Definitely, but, I'm sure that adopting new technologies at the right time can be a huge competitive edge for startups. Sure it can fail horribly in some places where hindsight will show that the trusted stack was the right choice, but it could also mean winning a race to market and continued relevance for a company if it works out.


Startups are complete risk and should jump on the new cool and shiny toy ship.

When you start making money or wish to start making money you need to stabilize the insanity.

I have done both.


Something that's "complete risk" should take even more risk? Doesn't seem logical.


So when you are swimming, do you hold one leg in the air, or do you keep everything in the water so you can swim faster?


This is one position... albeit an extreme one.

Kubernetes has worked perfectly for me in multiple places now. Including where I am now at a large unicorn.

I agree, picking up all the shiny new stuff on Hacker News is generally a bad idea. But forsaking new technologies altogether can backfire.

Part of it is how you handle failures. You could always have an aerospace-grade QA process... but for a consumer web product that's almost always going to vastly delay your progress. If you have the ability to fix issues quickly then it's sometimes better to "move fast and break things". If you're working on autonomous vehicle software then this is a terrible idea (I'm looking at you Uber ATG).


The problem most people forget is that you still have to manage and run the servers, or pay someone to do so.

The experience required to manage k8s isn't typically something a consumer-grade-website admin has.

opex or capex ... pick where you want the spend


One misconception people have, from my perspective, about most FAANG companies is that you get to work with new and shiny things, especially new languages. That is really more applicable to startups, where risk-taking is in the DNA; at a big company you will mostly get really strong pushback, because there is just too much effort involved to support more than two or three languages at scale. There are normally niche languages, but they are essentially statistical anomalies and are usually born out of a real business need (like Swift for iOS). Engineering for engineering's sake is also going to be frowned upon unless it really helps the business.

I do agree that Amazon is the worst in regards to OSS. They really need to fix that, even if just for PR, because they are consuming so much of it for AWS.


Very good comment, +1.

People don't really get the rabbit hole that is necessary in order to introduce a new technology at a large company. Just off the top of my head, a few of the issues:

* Not everyone has followed the broader technology world outside the big company. It's totally possible to imagine a situation where nobody in your management chain or team has even heard of something relatively mainstream like Docker.

* Your company uses custom build, deployment, and dependency management systems. Someone will have to do the work to implement support for the new technology in these.

* If the new technology has its own opinionated ideas about how to do any of the above, like Rust crates, npm, pip, etc., forget about it. You need to interact with previously existing internal code which means you need to fit into the already existing build/deployment/dependency solution.

* Your company has a bunch of custom internal services. You need to create bindings for these services' APIs in the new language.

* Your company might be doing some esoteric stuff that the new technology has no support for, like for example if your network stack does HTTP in some sort of custom-tuned way for performance, and the new language's HTTP library only supports a subset of everything that's possible.

* The new technology may have correctness or performance issues that have never been discovered because it has never been used on many thousands of servers where a 1% difference in CPU makes a huge difference, or it has never been used with a codebase big enough to take >1 day to build, or with binaries larger than several GB, etc.

And the biggest one...

* Every day, many new people are joining the company and beginning to ramp up on the (internally) mainstream solutions that have institutional inertia. This will dwarf the speed at which you can convince people to switch to the new technology.

The pattern I have seen is that something starts off being used for one-off scripts that don't have a lot of dependencies on the existing gargantuan infrastructure, then very, very gradually gaining mindshare internally and becoming more supported.


"...or it has never been used with a codebase big enough to take >1 day to build..."

I've not worked in this scale of enterprise environments so I find this tidbit fascinating. Why does the codebase take over a day to build? Does that time include rebuilding and deploying the underlying infrastructure?

I'm genuinely interested here. I can only imagine a few highly niche cases myself so I'm curious what I'm missing here.


Hard to find updated publicly-available information on this, but it was claimed that in 2003, Windows took 12 hours to build, on a multi-machine build farm: https://www.itprotoday.com/windows-server/supersite-flashbac...


If you look at Buck or Bazel, they both use caching of artifacts to speed up builds. It still takes a long time, because everything depends on everything. But for most projects, a fresh build is going to take tens of minutes for a medium-sized project pulling in a big dependency tree.


As far as I know, Node.js was used inside in a limited capacity, but they had an alternative for npm for security reasons, and you had to get an npm package approved to use it internally.

FWIW, this is a very good thing. A company as large as Amazon should do this with all of their repositories; even a small start-up should be doing this to mitigate suspicious packages / third-party code vulnerabilities.


I worked on third-party package approvals at Google. The reasoning behind reviews was largely due to license compliance. If the license said "you have to display this license to end-users" then we had to make sure that the license was machine-readable and would be automatically bundled into the build to be displayed in that "open source licenses" section of pretty much every app ever. If the license said "by linking this into your code, you have to opensource all code at your company", we had to deny it. That sort of thing.

We suggested that people get security reviews, but it was up to the user of the package to figure out whether or not that was necessary. Often security reviews would be blocking the project's launch and would be done at that time.

The final thing we enforced was a "one version" policy. If everyone was using foobar-1.0, and you wanted to use foobar-2.0, it was on you to update everyone to foobar-2.0. This was the policy that people hated the most, but basically mandatory at the time because none of the languages widely used at Google supported versioned symbols. Having library A depend on foobar-1.0 and library B depending on foobar-2.0 meant that application C could not depend transitively on library A and library B at the same time, which would cause many disasters.
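To illustrate why the conflict is unavoidable (a toy sketch, not Google's actual tooling; "foobar" and the versions are made up): a single process loads exactly one copy of a library, so two dependents pinned to different major versions cannot both be satisfied.

    # Toy illustration of the diamond-dependency problem behind "one version".
    installed = {"foobar": "2.0"}  # a process gets exactly one copy of foobar

    def require(pkg: str, major: int) -> None:
        got = installed[pkg]
        if int(got.split(".")[0]) != major:
            raise RuntimeError(f"wanted {pkg} {major}.x, got {got}")

    require("foobar", 2)  # library B, built against 2.x, is happy
    require("foobar", 1)  # library A, built against 1.x, blows up at runtime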


There's some public documentation on the "one version" policy here: https://opensource.google.com/docs/thirdparty/oneversion/


This all sounds like justified inconvenience in the name of safety at scale. As long as you have a good relationship with the package approval guys this is a fine workflow in my experience.


Yeah, I thought it was fine (I was both a reviewer and a user).

I mostly posted to provide some contrast to the folks that are saying things like "there are too many open-source libraries and they need to be approved to make sure there aren't any security problems", which is not what we did. We did not attempt to limit the use of libraries, nor did we vouch for the security-readiness of packages before they were allowed to be checked in to source control. If you think there are too many npm libraries and are looking for an example of a big tech company saying "no more!", this is not it. Use all the libraries!


Morgan Stanley revealed at a recruiting event that they follow this practice. "You can definitely use Apache Spark! It's been reviewed by our Enterprise Architecture team; some APIs were removed, and we substituted 'Morganized' variants of others, which you won't find code/docs for on Github ;)"

Given the npm debacle, I could totally see even a small org running an internal Maven repo with approved versions of (popular, and especially obscure) libs.


I feel like the whole NPM/JS culture is around "there's a library for that" so I can see how the friction around having to get approval for npm packages would be an absolute velocity killer, but when you're as big as Amazon I really wouldn't want that particular JS flavour of "move fast and break things" powering the company...


This is a good idea in large companies, but the only time I've ever seen it implemented was when a developer who wanted to become lead dev started hogging permissions to things and set up an internal packagist that only he could add packages to. It didn't end well for him really.


That's unfortunate, but how can you be sure this was his motivation or reasoning for adding an internal repo manager? I'm only asking because I learned the practice from an engineer / CTO whom I looked up to greatly early in my career, and if I were to go work somewhere and push leadership to implement Artifactory/Nexus/etc., I hope people wouldn't misunderstand my intentions. Anyone pushing for something like that is naturally likely to be the repo administrator in the beginning, and that may look like hogging permissions even if the developer's intent is something entirely different. Catch my drift?


Node is picking up internally as a build environment for frontend JS which all used to be done in Ruby.


Don't want to go into too much detail but this article is like taking the crappiest parts of the crappiest systems and declaring it representative of an entire product. There is a lot of really good internal tooling not mentioned here, and for the internal tooling mentioned here (like Apollo) absolutely none of its benefits are mentioned.


Well, this article gets some things right, but it gets a lot of stuff wrong as well. What can be confirmed is that the author's exposure to the Amazon tech scene is limited, and he makes a sweeping generalization as if that is how the whole of Amazon works.

Disclaimer: Ex-Amazonian, left like one year ago.


Of course. What else can we expect from someone who has been there for just 2 years? It takes some audacity to speak about the tech scene of a company as large as Amazon with such limited time spent there. Unless he has an overall view of the org (like a VP of Engineering), I would take any assessment with a pinch of salt.


I agree.

Disclaimer: Also ex-amazonian, left like one year ago.


What happened like one year ago?


Well, like 2 amazonians left.


Amazon is a giant company with tens of thousands of employees. People are joining and leaving all the time.


Oh, this is the thread which is reserved for everyone who left one year ago from Amazon. If you left two years ago, you are supposed to comment in a different thread. That's just how large companies work.


I like how Amazon has an MAWS movement internally, meaning "Move to AWS". I think most people assume that they mostly use AWS, but they don't.

It's an interesting look behind the scenes at Amazon and how antiquated they appear to operate. Makes you wonder if Azure and Google have pretty good chances of beating them down the road.

Edit: Interesting, further down one person commented that Amazon doesn't use AWS broadly because it's seen as not secure enough for certain workloads.


Based on my experience, that information and some of the comments about it in the thread are out of date or inaccurate.

'Move to AWS' was a program focused on accelerating AWS adoption that was primarily active something like 5-7 years ago. The program achieved its goals and concluded: virtually all infrastructure was running on AWS. I worked on the program for part of that time, in the last couple of years it was active. Amazon's migration to AWS was covered in a 2012 presentation at AWS re:Invent: "Drinking Our Own Champagne: Amazon's Migration to AWS" [1].

Some more recent efforts around AWS usage were covered in a 2016 talk: "How Amazon.com Uses AWS Management Tools" [2] (which references the earlier talk and discusses some of the changes since then). There are ongoing projects to improve and optimize usage of AWS, as well as to adopt some of the newer services.

[1] https://www.youtube.com/watch?v=f45Uo5rw6YY [2] https://www.youtube.com/watch?v=IBvsizhKtFk&t=13m20s


Until they move off of Sable, their NoSQL backend for retail, calling them mostly on AWS is laughable, especially considering their major Prime Day outage this last year was caused by Sable not being able to dynamically scale up [0].

[0]: https://www.cnbc.com/2018/07/19/amazon-internal-documents-wh...


Depends on what you mean by mostly. Almost all code runs on EC2. All object storage is in S3. Almost all service asynchronous interactions are decoupled via SQS. Almost all notifications are shared via SNS.

Yes, most legacy systems use Sable but most new development for the last 2 years uses DynamoDB.

More and more event-driven applications with highly variable throughput are on Lambda. Even on AWS service teams.

The Sable outage had a particularly laughable interpretation by the armchair quarterbacks from CNBC. The kind of scale involved makes Oracle-based approaches completely infeasible. (The internal Correction of Error document leaked and was wildly misunderstood.)

And the scaling timelines involved are so compressed that no company in their right mind "dynamically scales up as a result" - it's always projected scheduled scaling. There were other cascading effects.

I know it's fun to dunk on Amazon but their commitment to operational excellence is unparalleled as public AWS post mortems after major events should reveal.

If you read the entire Correction of Error document on the Sable outage you'd agree. Of course CNBC would never publish that; one gets more clicks by getting a professor out of touch with the realities of production software engineering to blurb some juicy quotes.


This is very true. I left Amazon in 2014, and by that time I had to explain (at least twice a year) why I had some services that could not feasibly be moved to AWS. At least in the org I was a part of, the fact that they were not running on AWS stood out enough that it raised questions.


MS Internal tooling fucking sucks compared to Amazon.

Just atrocious. Though I wasn't in Azure so that might be better.


Ex-Amazon SDE checking in. The article is quite misleading.

The author confuses "shiny" with "good".

Amazon does package-based deployments because it scales well, allows engineers from many different teams to work on packages, and provides fast security updates.

Amazon used VMs more than a decade before container engines and the latter are still lacking security and stability.

Having worked in many companies, I would take Amazon's engineering practices over the modern shiny devops tool ecosystem every day.

I agree that Apollo is slow (due to the implementation) and has an ugly UI, and that the company has a very poor track record of contributing to OSS.


Context: I spent five years as an engineer at Amazon, the last two as a tech lead on an internal developer tool (think SaaS for performance engineering).

This article is not untrue, but it misses the fact that teams are empowered to own their solutions and are not restricted in how they set up their environments and which tools they use. While it's true that fixing these problems feels like wasted effort, it's by design: Amazon operates as many separate internal entities, and I think replication of effort is an acknowledged downside of operating this way.

> 1. Deployments > Their internal deployment tool at Amazon is Apollo, and it doesn't support auto-scaling.

I had to manually scale up my service once in two years, and we weren't over-provisioning wastefully. Before I left, my product was supporting 40K+ internal applications with an infra+AWS cost of less than $2k/month.

We had good CI with deep integration with Apollo, you could track any change across the pipeline, we had reproducible builds and we had a comprehensive deployment log listing all changes.

Apollo is sloooooow though and the UI is very 90s.

> 2. Logs > Any self respecting company running software on distributed machines should have centralized, searchable logs for their services.

We were using Elasticsearch/Logstash/Kibana powered by AWS Elasticsearch. I wrote a thin wrapper around Logstash that was used in over 1K environments internally, so we weren't the only ones doing this.
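For readers who haven't run this kind of setup, here's a hedged sketch of what "centralized, searchable logs" boils down to; the endpoint and index name are placeholders, it skips the Logstash layer entirely, and a real AWS Elasticsearch domain would also need SigV4-signed requests:

    import datetime, json, urllib.request

    # Push one structured log event into an Elasticsearch index so it becomes
    # searchable in Kibana. Endpoint and index are made-up placeholders.
    event = {
        "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "service": "my-service",
        "level": "INFO",
        "message": "handled request in 42 ms",
    }
    req = urllib.request.Request(
        "https://search-example.us-east-1.es.amazonaws.com/service-logs/_doc",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)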

> 3. Service Discovery > What service discovery? We used to hard wire load balancer host names in config files.

Agree with this one. I will never forget the quality time I spent configuring those load balancers and ticketing people about DNS.

> 4. Containers

As other commenters mentioned, if you want to use containers, you're free to bypass all of this and run your service in AWS where you can use ECR, EKS etc if you want.

> (As far as I know, Node.js was used inside in a limited capacity, but they had an alternative for npm for security reasons, and you had to get an npm package approved to use it internally.)

I built my UI from scratch using create-react-app and yarn offline builds (no mystery meat) and I bypassed all the internal JS tooling, which I thought was very poor. This was changing though.

Finally, my personal anecdote: you could onboard our product in less than an hour (including reading docs), it required no further maintenance and gave you performance stats for free. So not all was bad :)


> They have some amount of Rails, and JavaScript has to be there, but if you want to experiment with, say, Go, Kotlin, or anything else, you are going to get nothing but push back.

I missed this - starting 2018 we were writing all our backend logic in Kotlin and we got no push back from anyone.


Ex-Amazon engineer of several years here.

This is a pretty interesting article, but it's important to know that Amazon's internal tooling changes pretty fast, even if it's mostly several years behind state-of-the-art.

Exhibit A: Apollo

Apollo used to be insane. It was designed for the use case of deploying changes to thousands of C++ CGI servers on thousands of website hosts, worrying about compiling for different architectures, supporting special fleets with overrides to certain shared libraries, etc etc. It had an entire glossary of strange terms which you needed to know in order to operate it. Deployments to our global fleet involved clicking through tens of pages, copy-and-pasting info from page to page, duplicating actions left right and centre, and hoping that you didn't forget something.

When I left, most of that had been swept away and replaced with a continuous deployment tool. Do a bit of setup, commit your code to the internal Git repo, watch it be picked up, automated tests run, then deployments created to each fleet. Monitoring tools automatically rolled back deploys if certain key metrics changed.

Auto scaling became a reality too, once the Move to AWS project completed. You still needed budgetary approval to up your maximum number of servers (because for our team you were talking thousands of servers per region!) but you could keep them in reserve and only deploy them as needed.

Manually copying Apollo config for environment setup was still kind of a thing though. The ideas of CloudFormation hadn't quite filtered down yet.

Exhibit B: logs

My memory's a bit hazy on this one. There certainly was a lot of centralized logging and monitoring infrastructure. Pretty sure that logs got pulled to a central, searchable repository after they'd existed on the hosts for a small amount of time. But, yes, for realtime viewing you'd definitely be looking at using a tool to open a bunch of terminals.

The monitoring tools got a huge revamp about halfway through my tenure, gaining interactive dashboarding and metrics drill-down features which were invaluable when on-call. I'm currently implementing a monitoring system, so my appreciation for just how well that system worked is pretty high!

Exhibit C: service discovery

Amusingly, a centralized service discovery tool was one of the tools that used to exist, and had fallen into disrepair by the time this person was working there.

This was a common pattern in Amazon. Contrary to the 'Amazon doesn't experiment' conclusion, Amazon had a tendency to experiment too well - the Next Big Thing was constantly being released in beta, adopted by a small number of early adopters, and then disappearing for lack of funding/maintenance/headcount.

I can't think of any time I hard-wired load balancer host names though. Usually they would be set up in DNS. We did used to have some custom tooling to discover our webserver hosts and automatically add/remove them from load balancers, but that was made obsolete by the auto-scaling / continuous deployment system years before I left.

As for the question of "can we shut this down? who uses it?" - ha, yes, I seem to remember having that issue. I think that, before my time, it wasn't really a problem: to call a service you needed to consume its client library, so you could just look in the package manager to see which services declared that as a dependency. With the move to HTTP services that got lost. It was somewhat mitigated over the years by services moving to a fully authenticated model, with client services needing to register for access tokens to call their dependencies, but that was still a work in progress a few years ago.

Exhibit D: containers

Almost everything in Amazon ran on a one-host-per-service model, with the packages present on the host dictated by Apollo's dependency resolution mechanism, so containers weren't needed to isolate multiple programs' dependencies on the same host.

Screwups caused by different system binaries and libraries on different generations of host were a thing, though, and were particularly unpleasant to diagnose. Again, that mostly went away once AWS was a thing and we didn't need to hold onto our hard-won bare-metal servers.

'Amazon Does Not Experiment'

Amazon doesn't really do open source very well. The company is dominated by extremely twitchy lawyers. For instance, my original employment contract stated that I could not talk about any of the technology I used at my job - including which programming languages I used! Unsurprisingly, nobody paid attention to that. That meant that for many years, the company gladly consumed open source, but any question of contributing back was practically off the table as it might have risked exposing which open source projects were used internally.

A small group of very motivated engineers, backed up by a lot of open-source-friendly employees, gradually changed that over the years. My first ever Amazon open source contribution took over a year to be approved. The ones I made after that were more on the order of a week.

Other companies might regard open sourcing entire projects as good PR, but Amazon doesn't particularly seem to see it that way. Thus, it's not given much in the way of funding or headcount. AWS is the obvious exception, but that's because AWS's open source libraries allow people to spend more money on AWS.

Instead, engineers within Amazon are pushed to generate ideas and either patent them, or make them into AWS services. The latter is good PR and money.

As for different languages: it really depends on the team. I know a team who happily experimented with languages, including functional programming. But part of the reason for the pushback is that a) Amazon has an incredibly high engineer turnover, both due to expansion and also due to burnout, so you need to choose a language that new engineers can learn in a hurry, and b) you need to be prepared for your project to be taken over by another team, so it better be written in something simple. So you better have a very good justification if you want to choose something non-standard.

Overall, Amazon is a pretty weird place to work as an engineer.

I would definitely not recommend it to anybody whose primary motivation was to work on the newest, shiniest technologies and tooling!

On the other hand, the opportunities within Amazon to work at massive scale are pretty great.

One of the 'fun' consequences of Amazon's massive scale is the "we have special problems" issue. At Amazon's scale, things genuinely start breaking in weird ways. For instance, Amazon pushed so much traffic through its internal load balancers that it started running into LB software scaling issues, to the point where eventually they gave up and began developing their own load balancers! Similarly, source control systems and documentation repositories kept being introduced, becoming overloaded, then replaced with something more performant.

But the problem is that "we have special problems" starts to become the default assumption, and Not Invented Here starts to creep in. Teams either don't bother searching for external software that can do what they need, or dismiss suggestions with "yeah, that won't work at Amazon scale". And because Amazon is so huge, there isn't even a lot of weight given to figuring out how other Amazon teams have solved the same problem.

So you end up with each team reinventing their own particular wheel, hundreds of engineer-hours being logged building, debugging and maintaining that wheel, and burned-out engineers leaving after spending several years in a software parallel universe without any knowledge of the current industry state-of-the-art.

I'm one of them. I'm just teaching myself Docker at the moment. It's pretty great.


Speaking of twitchy lawyers and Move to AWS... one of the weirdest things we had to deal with inside Amazon was that, for many years after AWS launched, we weren't allowed to use it because it "wasn't secure enough".

Given that we were actively shopping it around to major financial institutions at the time, doesn't that strike you as particularly hypocritical? :)


So wait, when I need to convince customers why AWS is secure for their data, I can't say "It's good enough for Amazon!"?


To clarify, an AWS customer has a shared responsibility for the security of their systems, including how they use AWS tools, and in this respect Amazon is no different from other AWS customers.


GP's comment may have been true at one point, but AWS is extremely mainstream within Amazon now and has been for several years.


No, you can say, "Hey, Amazon might use this if they do a security evaluation first."

It's totally cool for your data though, don't worry about it.

Amazon fucking sucks at dogfooding.


This is misleading. This had more to do with internal audit tools and guardrails available than it did the services themselves.


It's still highly restricted for many teams working with customer data


Your comment is better than the original article. Can we push this one to the top, HN?


Someone should probably add (2018) to this post as it's from May 2018.


Email hn@ycombinator.com and they will take care of it.


I will be joining Amazon in about a month.

Is there any chance I'll be able to work on OSS and/or "modern" tech (e.g. containers, Go, etc.) without a ton of push-back?

It also seems Amazon is obsessed with reinventing wheels and keeping their stuff internal, which is worrying. Is there any chance to introduce solid OSS tools to the development process? (whatever they might be)


AWS SDE here.

The short answer is, in order to get your team to adopt something, you need to make the case that it's better for customers (including things like migration costs). If the modern thing is more efficient, offers higher availability, increases velocity, and so on, then the case can be made.

Some specific examples based on things you cite:

* For an example of something OSS or "modern" coming from AWS, checkout Firecracker (written in Rust): https://firecracker-microvm.github.io/

* With regards to "reinventing wheels" Apollo + EC2 solves a lot (not all) of the problems that containers solve, and existed for years before containers became the hotness.

* Docker, which brought containers to the masses launched in 2013.

* EC2 launched in 2006 (7 years before Docker).

* Apollo (and the build system Brazil) predated EC2 by many years.

* Amazon.com was migrating to EC2/AWS before 2012 (https://www.youtube.com/watch?v=f45Uo5rw6YY)

* Another example, Lambda, which launched in 2014 runs on EC2 (https://www.youtube.com/watch?v=QdzV04T_kec&t=1611s).

* New services get to build in AWS and use Lambda, ECS, DynamoDB etc based on their business needs.


Worked for EC2 as an SE ~5 years ago. We used to handle rack-downs and page the relevant team if a large set of instances for said team was impacted. We once had a couple hundred Amazon.com (merchant team) instances impacted. We paged the team and they were like "don't page us for anything less than 10 racks of our instances down". The burgers didn't even feel the impact. Their automation was insane.


My wife has been at amazon (aws) about two years now.

She worries about her customers' problems, and is obsessed with using the right tools to solve them. Unless she is a great actor she has had a fantastic time there.

If things like working on OSS or "modern" tech are what you want, go somewhere that allows it.

Edit: Should say "go somewhere that is known to allow it for all engineers".


No, that's not accurate at all.

Amazon made choices on how to do things, years ago. These choices are being remade just because the new hotness is containers and not using uncool java. They have a lot of tooling that works for them, and a lot of it is quite good.

They aren't hostile to new stuff either. It's just that why would you waste time and money trying to shoehorn some new way of doing things when the old way works just fine?

You will still get to do lots... maybe. It depends on the team.

I would say don't waste your energy. Learn when to pick the battles. Accept that you will get push back that doesn't make sense to you.


> Is there any chance I'll be able to work on OSS and/or "modern" tech (e.g. containers, Go, etc.) without a ton of push-back?

As long as it is the right choice, yes. As far as I know, I built the first service internally that was entirely container based, but did so because it was the right tool for the job. Container based services are getting a ton of traction internally now, especially Fargate-based ones.

You're going to have a hard time making a case for Go though. I have not a single time been convinced that Go was the best tool for the job in 5 years at Amazon.

> It also seems Amazon is obsessed with reinventing wheels and keeping their stuff internal, which is worrying.

I have not found this to be true. But I can see why people might think this. Amazon built some state-of-the-art tooling quite some time ago, and it's starting to show its age. Rather than drop the internal stuff for new OSS alternatives, they've continued to add modern features to the internal tools, which I think is the right choice overall considering the scale at which they're used and integrated.

Again, it is about the right tool for the job. If you present a compelling case to use an OSS alternative then more than likely you'll be able to use it.


I worked there for two years (summer 2013-summer 2015), so long enough to get an idea of the culture. Take this post with the caveat that I have no idea whether it's changed since then and if so to what extent.

> Is there any chance I'll be able to work on OSS and/or "modern" tech (e.g. containers, Go, etc.) without a ton of push-back?

Sure there's a greater chance than zero, but not if only motivated by it being "modern". Amazon is a business. It exists to make money by providing a valuable service to customers. The only reason it's a "tech company" is because writing software serves that goal, but "cool tech" is not a goal in itself. If you have a serious, documentable reason why rewriting your team's service in Go would help you achieve business goals, then you might be able to get the attention of engineering decision makers, but that's a much higher bar than "it's modern".

Personally I don't see what is "modern" about Go or how it would help the business serve customers better and/or make more money than it does using Java. I suspect many of your coworkers would feel the same way and these are the terms that the decision would be framed in within Amazon's culture.

By the way there has been over the last several years a move away from C++ and Perl and towards Java. When I was there the majority of stuff was in Java but there was still plenty of important stuff in those other two languages. I suspect C++ and Perl are even rarer now. I guess that's the modernization you're talking about, but maybe not at the pace you want.

> It also seems Amazon is obsessed with reinventing wheels

In many cases, either these wheels were invented at Amazon before they became available in the OSS world, or the OSS tools do not fit Amazon's needs (especially w.r.t. scale).

There is certainly no "obsession" with reinventing wheels -- any sane manager at Amazon would definitely rather use something off-the-shelf than waste a bunch of money developing it from scratch, assuming it fit their needs well.

> and keeping stuff internal

Well, I guess this is true. Amazon contributes less to open source than a few other famous tech companies.

> Is there any chance to introduce solid OSS tools to the development process?

Amazon uses plenty of very solid OSS tools: for example, Linux, Java, gcc, perl, git (though they were on Perforce when I started), and Tomcat are all core parts of Amazon infrastructure. As well as the same grab bag of common tools and libraries you'd find in use anywhere else. In general things with permissive licenses (BSD/MIT/etc) are fair game, but getting approval for things with copyleft licenses (GPL/LGPL/etc) is an uphill battle. As for whether you could introduce more, it would depend on the value of the tool and your motivation for doing so.


> Amazon contributes less to open source than a few other famous tech companies, though I'd say Apple is on par.

Really? Apple has impactful OSS like LLVM, WebKit, Swift.


Fair enough, I edited out the comment about Apple.


Getting an npm or other package approved for internal use is not an unusual practice.


Yes, but it probably makes Node.js useless in any such company since any non-trivial app will have 1,000 npm dependencies.


Yeah. I did this once a few years ago, and it was quite unpleasant. Did get it done in the end, but it definitely put my team off looking for any other useful NPM packages.

I wonder if it's any more streamlined now?


There are some internal build tools that I can vouch for. If you're still at Amazon, feel free to ping me at dbarsky@ and we can chat.


Yep, can confirm. NPM is very compatible and easy to use in Amazon. It is just not allowed to serve critical traffic, which I think is a wise choice anyway.


The OP needs to put a date on the article, because AFAIK things are very different in 2019.

Also, it's interesting how they equate "experimentation" with "open source".


Considering the constant stream of new services and features, the lack of OSS is insignificant compared to the value they add to the world.

Like the fact that you can create an SSL/TLS certificate for free for load balancers without the usual agony. So easy.
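For the curious, that's AWS Certificate Manager; here's a hedged boto3 sketch of requesting a free public certificate you can then attach to a load balancer listener (the domain name is a placeholder):

    import boto3

    # Request a free public certificate from ACM with DNS validation.
    acm = boto3.client("acm", region_name="us-east-1")
    resp = acm.request_certificate(
        DomainName="example.com",
        ValidationMethod="DNS",
    )
    print(resp["CertificateArn"])  # attach this ARN to an ALB/ELB listener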


I worked at Amazon in the late 90s, so my experience is most likely not relevant anymore, but I will make a few observations.

First, I see that many commenters disagree with the OP; they had a different experience of Amazon, one where they were working with infrastructure that was responsive, modern, easy to use, etc. It is very possible that both observations are correct. In a large company, not all parts of the company will be using the same infrastructure at the same time. Indeed, it would be dangerous for the entire company to upgrade in lockstep to a new technology infrastructure.

Second, in most companies, innovation is not measured by the novelty or newness of the language or framework you use, but by the business impact your product or service makes. Much of Amazon's innovation was, and is, around business models. Indeed, when I worked at AMZN, I was writing C code (to power a website) using beautifully efficient database access code written by Sheldon Kaphan. There was nothing remotely advanced about the language. It took 9 months for me to get a 3-line code change into production, and I was using technology that predated Apollo (it was called Huston). There is nothing particularly wrong about that either (it was a potty mouth filter blocking some obscure swear words, and no one was too worried that the component it was part of didn't ship for the best part of a year).

I now run my own company, and I manage technology and people as well as write code. I find myself exercising the same conservatism with respect to code and infrastructure that I found at Amazon, and for the same reasons. It is expensive and potentially company-destroying to switch languages and core technologies. It is best not done, or if done at all, done with a lot of care and slowly.


I was once pitched a startup founded by some ex-amazonians whose big idea was "Apollo for everyone". They were nonplussed by my spit take.

For the people saying "I worked there and it wasn't like that" I wonder if you worked in retail. It's a very different world from the more modern bits of the company.


What amuses me is that most of the rebuttals come from ex-Amazonians, not from the current staff. This is the only company I know of dealing with this much criticism from engineering.

Even more to add: the article is more or less fresh, and at Amazon's scale I doubt any major changes have happened in the last 10 months.


There is a comment above you from an Amazon Principal Engineer: https://news.ycombinator.com/user?id=jcrites

His profile says "Architect and cofounder of Simple Email Service. Creator of Cloud Desktop, a cloud-based development environment used by most Amazon engineers. Technical lead for Amazon's strategy for using AWS."

Can't get more "from the horse's mouth" than that.

We are generally asked not to comment on stuff like this because of how easy it is to reveal confidential internal details.

For the record, the article is mostly wildly out of date, but others have already corrected the record.


I'm aware current employees comment over here. Two things to mention: principal/architect roles are always based not only on merit of skill but also on politics. So, taken with a grain of salt.

Moreover, if you inspect the person's comments you will notice how "legally" clean they are. Even the phrasing looks the same in both comments: "based on my experience", "this information is wrongful", etc. Looks like those were refined by the legal team before being posted.

I wonder about all that because, since my first engagement with AWS and Amazon recruiting around five years ago, the engineering tone has not changed. That concerns me, in that putting in the effort to get a job there may turn out to be a major disappointment.

The cutthroat approach is nice sometimes, as it adds a taste of competition, but the whole noise makes it seem like you're about to get buried rather than just played or burned out.


translation of the first two paragraphs is "I've decided what I think already and if you say you work at Amazon now and are happy it's prima facie evidence that you can't be trusted to be objective about it"


Sort of.

If it's a liability to say nasty stuff about the employer, and most of the ex-employees throw some heat, the nearest possible conclusion is that the truth leans toward statements that present employees dismiss as outdated and controversial.

It just can't be a smear campaign against just that one company.


Amazon (like many other big companies) strongly discourages employees from commenting publicly on subjects related to the company.

It might seem draconian, but it makes a lot of sense. Talking about a smaller company in a forum like HN might be completely anodyne, but comments about Amazon could easily have repercussions like getting picked up by the press and spun into some crazy story, or alerting people to some strategic information that the commenter isn't even aware is sensitive. For example, posting "hey we're at Amazon and we're moving from system X to system Y" could generate surprised and angry phone calls with the CEO of the vendor of X (I have a real example in mind that happened because of a stackoverflow post), or could cause the stock price of the vendor of Y to jump, causing insider trading concerns... best to just avoid it.

So it's very natural that company policy would strongly discourage such public commentary, and most current employees follow that.


>Amazon could easily have repercussions like getting picked up by the press and spun into some crazy story

Didn't the NY Times exposé help shape the Amazon workplace for the better?


I am a current Sr. SDE at Amazon in the Retail org. I would agree with most of the rebuttals, and I don't have too much to add, since they do a pretty good job summarizing.


The main benefit of Amazon's tools is that once you've been there a while you know how they work, and all the complexity and bugs have been stripped out of them. And because they force engineers to go on-call, everyone has a pretty good idea of how to fix things.

When you have SREs spending all day creating the next new thing (generally after deprecating the previous one with no replacement), you end up in a situation where you forget how to, say, roll back a bad deployment. Or scale a fleet.

The problem with fancy infrastructure as code, containers and logging services is when they break you have no idea how to get out of trouble. SSH and grep almost always work, as does symlinking a directory.


> "It's complicated, so it's gotta be good. I must be dumb to not get it."

Having worked with AWS a lot recently, this article doesn't surprise me at all, actually. When you see the low quality of UI and documentation for most of their tools that users pay for, I wouldn't expect their internal tooling to be any better.

I'm not saying their tech is bad; once things work, they work great. I'm talking about the usability of those tools as I try to use them, and it makes me feel the same way as the OP's quote.


If you are using the UI to leverage AWS, you aren't really leveraging AWS.

AWS is designed to be used via automation and the API.

Disclaimer: I don't work for AWS but have spent many years building relatively large stacks on AWS (thousands of EC2 instances, with monthly spend being a couple of supercars in value).


Yes, but before using the API you need to know which API to use. Say you are looking into using a new service for analytics: which combination of services do I use, between CloudWatch, Firehose, etc.? Or take reading documentation, for example: they have terrible documentation.

Someone using their API/CLI is already familiar with AWS and how it works. The point is, before getting to that point, you need good docs and a good UI to allow people to discover and learn.


The author is wrong about his description of tooling and best practices at Amazon.

I'm also annoyed that the post isn't dated anywhere so there's no way for me to tell if it's just old.


Looking through the rest of the blog, it looks like the author worked at Amazon between 2014 and 2016.


This article is badly out of date compared to the current state of things, and has been for quite a while now.


To me, it sounds like a company that is still working with the mentality of the 90s. It's the same thing I'm facing at work, where new technology is treated as scary and there's a fear of changing something that (somehow, only God knows how) is working, so no one is allowed to touch such things.


New technology is scary. It should be scary. If you don't find it scary you're probably not looking at it right. It's usually full of risk, which, if you are in management (or a remotely responsible employee), needs to be evaluated and if possible contained, and it ultimately needs to pay for itself (i.e. offer a pay-off that more than outweighs the risk).

If we are honest with ourselves, most new tech in this industry totally fails that litmus test.

There are an awful lot of shiny-bauble chasers and snake oil salesmen both inside and outside companies.

I don't know what your precise situation is and obviously it's very much a spectrum rather than black and white, but a healthy aversion to promises made by the authors and salespeople of relatively unproven new things is extremely valuable.


I think there is no serious authenticity associated with this blog post.


This matches my experience using the internal tools pretty well.

But the team I was on was mostly using AWS, which meant the tooling was better.

We also had a bunch of people who wanted python instead of Ruby, so they started using python and eventually we were just a team that used python. Cool.

Apollo is probably the worst thing. Brazil, Pipelines, etc... those are mostly fine.

Amazon has done a ton with the JVM. They don't need something new, so they don't bother. That's fine.

They also do adopt other tech as needed to do things. I know people who worked in Go, because it was container stuff.

This was also one of things that gave me pause when considering interviewing with snap a while back. So many former Amazon people. Seemed like a lot of stuff was going with "just copy amazon, make it a bit better". I don't want to write java. bleh.

Anyway, this article is going to have a lot of people saying "oh no that's not right". It is. There are exceptions but overall it's pretty much bang on.

Oh, and open source. My understanding is Jeff doesn't like contributing back. The company doesn't like contributing back. They keep an iron grip on IP. They are ridiculous about letting employees do side projects. They push back on every single FOSS contribution, and even after some of our senior guys did a whole bunch of work, it still required multiple layers of approvals and a whole bunch of hoops and an extra training course and blah blah blah blah. It's really really dumb. I find it crazy that Amazon doesn't get more flak for being probably the worst company for open source around right now.

But MS has tossed out their old CEOs, has tons of interest in making open source work well for them, and contributes loads... still get shit on.


Working on a side project is super easy. Just file a TT (or I guess SIM now?). As long as it isn't competing and not a game, it's generally quick approval.


What is the issue with games?


Amazon owns game studios and might have many different game concepts being worked on, such that an employee might inadvertently compete with a similar game studio idea.


Games are automatically considered to be competing with the company.


In my career I've encountered a lot of engineers who had a desire to shy away from command-line or command-prompt tools, shell, CMD.exe, batch, scripting, cron, and related 'traditional' automation, in favor of GUI, IDE, HTML, browser, etc.

I've even had some young sexy angular-wizard type engineers that had the ear of mgmt sarcastically respond with statements like "I don't do command line".

This article and my experience with AWS development and Amazon leads me to believe this entire company is led and staffed by such engineers.


It's hardly surprising. The quality of their public products isn't much better than what's described here. It's fine for companies with plenty of engineers, money, and time to base their tools off of, but that's about it. Without building a ton of extra tooling and having specialized information only available through paid support, it's almost impossible to operate anything on aws. The documentation is plentiful, mostly out of date, wrong, incomplete, and difficult to browse, search, and use. Doing devops using aws is a nightmare that never ends. Not to mention the speed of deploying anything is beyond slow, so any work takes many times the amount of time it should. For large companies with plenty of resources, these are minor points. For small and medium sized ones, it's a loss of productivity and money that simply cannot be justified over other methods.


Dang probably true for the products you were using but Redshift has been great for me, including the documentation which I have never found to be out of date.

I wouldn’t say its the economical option though.

Oh and its high time ‘count(distinct) over()’ partitions was supported.



