I mean, not to defend them too strongly, but literally half of this post mortem is addressing the failure of the Service Dashboard. You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.
Off the top of my head, this is the third time they've had a major outage where they've been unable to properly update the status page. First we had the S3 outage, where the yellow and red icons were hosted in S3 and couldn't be accessed. Second we had the Kinesis outage, which snowballed into a Cognito outage, so they couldn't log in to the status page CMS. Now this.
They "own up to it" in their postmortems, but after multiple failures they're still unwilling to implement the obvious solution and what is widely regarded as best practice: host the status page on a different platform.
Firmly agreed. I've heard AWS discuss making the status page better – but they get really quiet about actually doing it. In my experience the best/only way to check for problems is to search Twitter for your AWS region name.
Maybe AWS should host their status checks in Azure and vice versa ... Mutually Assured Monitoring :) Otherwise it becomes a problem of who will monitor the monitor
My company is quite well known for blameless post-mortems, but if someone failed to implement improvements after three successive outages, they would be moved to a position more appropriate for their skills.
That’s not what’s being asked though - in all 3 events, they couldn’t manually update it. It’s clearly not a priority to fix it even for manual updates.
>Be capable of spinning up virtualized instances (including custom drive configurations, network stacks, complex routing schemes, even GPUs) with a simple API call
But,
>Be incapable of querying the status of such things
As others mention, you can do it manually. But it’s also not that hard to do automatically: literally just spin up a “client” of your service and make sure it works.
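Rough sketch of what I mean, in Python, with a made-up endpoint (the point is just that the prober behaves like a real client; this is not how AWS actually does it):

```python
# Minimal synthetic-client canary: act like a real customer and see if the
# service actually works. The endpoint below is a hypothetical placeholder.
import urllib.request

PROBE_URL = "https://my-service.example.com/healthz"

def probe_once(timeout=5):
    """Return True if the service answered the way a real client would expect."""
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

if __name__ == "__main__":
    print("healthy" if probe_once() else "unhealthy")
```

Run a handful of these from outside your own network and you have a status signal that doesn't depend on anyone's approval.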
Eh, the colored icons not loading is not really the same thing as incorrectly reporting that nothing’s wrong. Putting the status page on separate infra would be good practice, though.
The AWS summary says: "As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue"
This seems like bad faith to me based on my experience when I worked for AWS. As they repeated many times at Re:Invent last week, they've been doing this for 15+ years. I distinctly remember seeing banners like "Don't update the dashboard without approval from <importantSVP>" on various service team runbooks. They tried not to say it out loud, but there was very much a top-down mandate for service teams to make the dashboard "look green" by:
1. Actually improving availability (this one is fair).
2. Using the "Green-I" icon rather than the blue, orange, or red icons whenever possible.
3. Building out the "Personal Health Dashboard" so they could post about many issues there without having to acknowledge them publicly.
Eh, I mean, at least when DeSantis was lower on the food chain than he is now, the normal directive was that EC2 status wasn't updated unless a certain X percent of hosts were affected. Which is reasonable, because a single rack going down isn't relevant enough to constitute a massive problem with EC2 as a whole.
Multiple AWS employees have acknowledged it takes VP approval to change the status color of the dashboard. That is absurd and it tells you everything you need to know. The status page isn't about accurate information, it's about plausible deniability and keeping AWS out of the news cycle.
When is the last time they had a single service outage in a single region? How about in a single AZ in a single region? Struggling to find a lot of headline stories? I'm willing to bet it's happened in the last 2 years and yet I don't see many news articles about it... so I'd say if the only thing that hits the front page is a complete region outage for 6+ hours, it's working out pretty well for them.
Um, so you think straight-up lying is good politics?
Any 7-year-old knows that telling a lie when you break something makes you look better superficially, especially if you get away with it.
That does not mean that we should think it is a good idea to tell lies when you break things.
It sure as hell isn't smart politics in my book. To me it's straight-up disqualifying for doing business with them. If they aren't honest about the status or amount of service they're providing, how is that different from lying about your prices?
Would you go to a petrol station that posted $x.00/gallon, but only delivered 3 quarts for each gallon shown on the pump?
We're being shortchanged and lied to. Fascinating that you think it is good politics on their part.
AWS spends a lot of time thinking about this problem in service to their customers.
How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers: a failed memory module, a top-of-rack switch, or a random bit flip in one host for one service.
Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it, as it would have no bearing on their experience.
What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these is mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.
As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.
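To make the "thresholds" point concrete, the disagreement is basically about where lines like these get drawn (the cutoffs below are invented purely for illustration, not AWS's actual policy):

```python
# Toy threshold policy: map "fraction of customers impacted" to a dashboard
# color. The numbers are made up; the argument is about where they should sit.
def dashboard_color(impacted_fraction: float) -> str:
    if impacted_fraction < 0.001:   # a failed rack or bad host: nobody notices
        return "green"
    if impacted_fraction < 0.05:    # "some customers may see elevated errors"
        return "yellow"
    return "red"                    # broad impact: say so

for f in (0.0002, 0.02, 0.6):
    print(f, "->", dashboard_color(f))
```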
>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
A good piece of low-hanging fruit: when the outage is significant enough to have reached the media, turn the dot red.
Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.
>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
There's a limitless variety of options, and multiple books written about it. I can recommend Edward Tufte's books, starting with "The Visual Display of Quantitative Information".
>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...
Of course there are, so a single R/Y/G indicator is obviously a bad choice.
Again, they could at any time easily choose a better way to display this information: graphs, heatmaps, whatever.
More importantly, the things that should NOT be chosen are A) to have a human in the loop for displaying status, as this inserts both delay and errors, and B) worse yet, to make it a VP-level decision, as if it were a $1 million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm, that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.
>>You don’t know what you’re talking about.
Look in the mirror, dude. While I haven't worked inside AWS, I have worked on complex networked software systems and understand well the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those everyday lies are what feed their claims of "99.99+% uptime!!", which are evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.
That's only useful when it's an entire region. There are minor issues in smaller services that cause problems for a lot of people and never get reflected on the status board; and not everyone checks Twitter or HN all the time while at work.
It's a bullshit board used to fudge numbers when negotiating SLAs.
Like, I don't care that much; hell, my company does the same thing. But let's not get defensive over it.
So -- ctrl-f "Dash" only produces four results, and it's hidden away at the bottom of the page. It's false to claim that even 20% of the post mortem is addressing the failure of the dashboard.
The problem is that the dashboard requires VP approval to be updated. Which is broken. The dashboard should be automatic. The dashboard should update before even a single member of the AWS team knows there's something wrong.
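A sketch of what "automatic" could look like: the same signal that pages the on-call also flips the public status, with no approval gate in between. Everything here (URL, payload shape, thresholds) is a hypothetical stand-in, just to show that no human step exists in the path:

```python
# Toy auto-publisher: when enough independent probes fail, push the new status
# to the public page immediately. The API endpoint and payload are made up.
import json
import urllib.request

STATUS_PAGE_API = "https://status.example.com/api/v1/components/ec2-us-east-1"
FAILURES_BEFORE_DEGRADED = 3

def publish_status(state: str) -> None:
    body = json.dumps({"status": state}).encode("utf-8")
    req = urllib.request.Request(
        STATUS_PAGE_API,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)   # no VP in this code path

def on_probe_results(consecutive_failures: int) -> None:
    if consecutive_failures >= FAILURES_BEFORE_DEGRADED:
        publish_status("degraded")           # same trigger that pages on-call
    elif consecutive_failures == 0:
        publish_status("operational")
```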
Is it typical for orgs (the whole spectrum: IT departments everywhere, telecom, SaaS, maybe even status of non-technical services) to have automatic downtime messaging that doesn't need a human set of eyes to approve it first?
> You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.
Let's not act like this is the first time this has happened. It's bad faith that they don't change when their promise is that they hire the best to handle infrastructure so you don't have to. That's clearly not the case. Between this and billing, we can easily lay blame and acknowledge the lies.
AWS as a business has an enormous (multi-billion-dollar) moral hazard: they have a fantastically strong disincentive to update their status dashboard to accurately reflect the true nature of an ongoing outage. They use weasel words like "some customers may be seeing elevated errors", which we all know translates to "almost all customers are seeing 99.99% failure rates."
They have a strong incentive to lie, and they're doing it. This makes people dependent upon the truth for refunds understandably angry.
> Why are we doing this folks? What's making you so angry and contemptful?
Because Amazon kills industries. Takes jobs. They do this because they promise they hire the best people, who can do this better than you and for cheaper. And it's rarely true. And then they lie about it when things hit the fan. If you're going to be the best, you need to act like the best and execute like the best. Not build a walled garden that people can't see into and that's hard to leave.