I mean, not to defend them too strongly, but literally half of this post mortem is addressing the failure of the Service Dashboard. You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.
Off the top of my head, this is the third time they've had a major outage where they've been unable to properly update the status page. First we had the S3 outage, where the yellow and red icons were hosted in S3 and couldn't be accessed. Second we had the Kinesis outage, which snowballed into a Cognito outage, so they couldn't log in to the status page CMS. Now this.
They "own up to it" in their postmortems, but after multiple failures they're still unwilling to implement the obvious solution and what is widely regarded as best practice: host the status page on a different platform.
Firmly agreed. I've heard AWS discuss making the status page better – but they get really quiet about actually doing it. In my experience the best/only way to check for problems is to search Twitter for your AWS region name.
Maybe AWS should host their status checks in Azure and vice versa ... Mutually Assured Monitoring :) Otherwise it becomes a problem of who will monitor the monitor
My company is quite well known for blameless post-mortems, but if someone failed to implement improvements after three successive outages, they would be moved to a position more appropriate for their skills.
That’s not what’s being asked though - in all 3 events, they couldn’t manually update it. It’s clearly not a priority to fix it even for manual updates.
>Be capable of spinning up virtualized instances (including custom drive configurations, network stacks, complex routing schemes, even GPUs) with a simple API call
But,
>Be incapable of querying the status of such things
As others mention, you can do it manually. But it’s also not that hard to do automatically: literally just spin up a “client” of your service and make sure it works.
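Rough sketch of what I mean, in Python, with a made-up endpoint (the point is just that the prober behaves like a real client; this is not how AWS actually does it):

```python
# Minimal synthetic-client canary: act like a real customer and see if the
# service actually works. The endpoint below is a hypothetical placeholder.
import urllib.request

PROBE_URL = "https://my-service.example.com/healthz"

def probe_once(timeout=5):
    """Return True if the service answered the way a real client would expect."""
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

if __name__ == "__main__":
    print("healthy" if probe_once() else "unhealthy")
```

Run a handful of these from outside your own network and you have a status signal that doesn't depend on anyone's approval.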
Eh, the colored icons not loading is not really the same thing as incorrectly reporting that nothing’s wrong. Putting the status page on separate infra would be good practice, though.
The AWS summary says: "As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue"
This seems like bad faith to me based on my experience when I worked for AWS. As they repeated many times at Re:Invent last week, they've been doing this for 15+ years. I distinctly remember seeing banners like "Don't update the dashboard without approval from <importantSVP>" on various service team runbooks. They tried not to say it out loud, but there was very much a top-down mandate for service teams to make the dashboard "look green" by:
1. Actually improving availability (this one is fair).
2. Using the "Green-I" icon rather than the blue, orange, or red icons whenever possible.
3. Building out the "Personal Health Dashboard" so they could post about many issues there without having to acknowledge them publicly.
Eh, I mean, at least when DeSantis was lower on the food chain than he is now, the normal directive was that EC2 status wasn't updated unless a certain X percent of hosts were affected. Which is reasonable, because a single rack going down isn't relevant enough to constitute a massive problem with EC2 as a whole.
Multiple AWS employees have acknowledged it takes VP approval to change the status color of the dashboard. That is absurd and it tells you everything you need to know. The status page isn't about accurate information, it's about plausible deniability and keeping AWS out of the news cycle.
When is the last time they had a single service outage in a single region? How about in a single AZ in a single region? Struggling to find a lot of headline stories? I'm willing to bet it's happened in the last 2 years and yet I don't see many news articles about it... so I'd say if the only thing that hits the front page is a complete region outage for 6+ hours, it's working out pretty well for them.
Um, so you think straight-up lying is good politics?
Any 7-year-old knows that telling a lie when you break something makes you look better superficially, especially if you get away with it.
That does not mean that we should think it is a good idea to tell lies when you break things.
It sure as hell isn't smart politics in my book. To me it's straight-up disqualifying for doing business with them. If they aren't honest about the status or amount of service they're providing, how is that different from lying about your prices?
Would you go to a petrol station that posted $x.00/gallon, but only delivered 3 quarts for each gallon shown on the pump?
We're being shortchanged and lied to. Fascinating that you think it is good politics on their part.
AWS spends a lot of time thinking about this problem in service to their customers.
How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers: a failed memory module, a top-of-rack switch, or a random bit flip in one host for one service.
Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it, as it would have no bearing on their experience.
What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these is mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.
As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.
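To make the "thresholds" point concrete, the disagreement is basically about where lines like these get drawn (the cutoffs below are invented purely for illustration, not AWS's actual policy):

```python
# Toy threshold policy: map "fraction of customers impacted" to a dashboard
# color. The numbers are made up; the argument is about where they should sit.
def dashboard_color(impacted_fraction: float) -> str:
    if impacted_fraction < 0.001:   # a failed rack or bad host: nobody notices
        return "green"
    if impacted_fraction < 0.05:    # "some customers may see elevated errors"
        return "yellow"
    return "red"                    # broad impact: say so

for f in (0.0002, 0.02, 0.6):
    print(f, "->", dashboard_color(f))
```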
>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
A good piece of low-hanging fruit: when the outage is significant enough to have reached the media, turn the dot red.
Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.
>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
There's a limitless variety of options, and multiple books written about it. I can recommend Edward Tufte's books, starting with "The Visual Display of Quantitative Information".
>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...
Of course there are, so a single R/Y/G indicator is obviously a bad choice.
Again, they could at any time easily choose a better way to display this information: graphs, heatmaps, whatever.
More importantly, the things that should NOT be chosen are A) to have a human in the loop for displaying status, as this inserts both delay and errors, and B) worse yet, to make it a VP-level decision, as if it were a $1 million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm, that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.
>>You don’t know what you’re talking about.
Look in the mirror, dude. While I haven't worked inside AWS, I have worked on complex networked software systems and understand well the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those everyday lies are what feed their claims of "99.99+% uptime!!", which are evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.
That's only useful when it's an entire region. There are minor issues in smaller services that cause problems for a lot of people and never get reflected on the status board; and not everyone checks Twitter or HN all the time while at work.
It's a bullshit board used to fudge numbers when negotiating SLAs.
Like, I don't care that much; hell, my company does the same thing. But let's not get defensive over it.
So -- ctrl-f "Dash" only produces four results, and it's hidden away at the bottom of the page. It's false to claim that even 20% of the post mortem is addressing the failure of the dashboard.
The problem is that the dashboard requires VP approval to be updated. Which is broken. The dashboard should be automatic. The dashboard should update before even a single member of the AWS team knows there's something wrong.
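A sketch of what "automatic" could look like: the same signal that pages the on-call also flips the public status, with no approval gate in between. Everything here (URL, payload shape, thresholds) is a hypothetical stand-in, just to show that no human step exists in the path:

```python
# Toy auto-publisher: when enough independent probes fail, push the new status
# to the public page immediately. The API endpoint and payload are made up.
import json
import urllib.request

STATUS_PAGE_API = "https://status.example.com/api/v1/components/ec2-us-east-1"
FAILURES_BEFORE_DEGRADED = 3

def publish_status(state: str) -> None:
    body = json.dumps({"status": state}).encode("utf-8")
    req = urllib.request.Request(
        STATUS_PAGE_API,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)   # no VP in this code path

def on_probe_results(consecutive_failures: int) -> None:
    if consecutive_failures >= FAILURES_BEFORE_DEGRADED:
        publish_status("degraded")           # same trigger that pages on-call
    elif consecutive_failures == 0:
        publish_status("operational")
```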
Is it typical for orgs (the whole spectrum: IT departments everywhere, telecom, SaaS, maybe even status of non-technical services) to have automatic downtime messaging that doesn't need a human set of eyes to approve it first?
> You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.
Let's not act like this is the first time this has happened. It's bad faith that they don't change when their promise is that they hire the best to handle infrastructure so you don't have to. That's clearly not the case. Between this and billing, we can easily lay blame and acknowledge the lies.
AWS as a business has an enormous (multi-billion-dollar) moral hazard: they have a fantastically strong disincentive to update their status dashboard to accurately reflect the true nature of an ongoing outage. They use weasel words like "some customers may be seeing elevated errors", which we all know translates to "almost all customers are seeing 99.99% failure rates."
They have a strong incentive to lie, and they're doing it. This makes people dependent upon the truth for refunds understandably angry.
> Why are we doing this folks? What's making you so angry and contemptful?
Because Amazon kills industries. Takes jobs. They do this because they promise they hire the best people, who can do this better than you and for cheaper. And it's rarely true. And then they lie about it when things hit the fan. If you're going to be the best, you need to act like the best and execute like the best. Not build a walled garden that people can't see into and that's hard to leave.