
I like to compare this to Site C in British Columbia https://www.bchydro.com/energy-in-bc/projects/site_c.html#:~.... If you are lucky enough to be blessed with remote untapped rivers that can be dammed in somewhat unpopulated mountain valleys (that's another story again), hydro seems to be far cheaper and safer. I would imagine that nuclear power's main issue is first and foremost cost: even if you could pull off completely safe nuclear power, it would still wind up costing too dang much.


My company uses AWS. We had significant degradation of many of their APIs for over six hours, which had a substantive impact on our business. The entire time, their outage board was solid green. We were in touch with their support people and knew it was bad, but were under NDA not to discuss it with anyone.

Of course problems and outages are going to happen, but saying they have five nines (99.999%) uptime as measured by their "green board" is meaningless. During the event they were late and reluctant to report it and its significance. My point is that they are wrongly incentivized to keep the board green at all costs.


I mean, not to defend them too strongly, but literally half of this post mortem is addressing the failure of the Service Dashboard. You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.


Off the top of my head, this is the third time they've had a major outage where they've been unable to properly update the status page. First we had the S3 outage, where the yellow and red icons were hosted in S3 and couldn't be accessed. Second we had the Kinesis outage, which snowballed into a Cognito outage, so they were unable to log in to the status page CMS. Now this.

They "own up to it" in their postmortems, but after multiple failures they're still unwilling to implement the obvious solution and what is widely regarded as best practice: host the status page on a different platform.


Firmly agreed. I've heard AWS discuss making the status page better – but they get really quiet about actually doing it. In my experience the best/only way to check for problems is to search Twitter for your AWS region name.


Maybe AWS should host their status checks in Azure and vice versa ... Mutually Assured Monitoring :) Otherwise it becomes a problem of who will monitor the monitor


My company is quite well known for blameless post-mortems, but if someone failed to implement improvements after three successive outages, they would be moved to a position more appropriate for their skills.


This challenge is not specific to Amazon.

Being able to automatically detect system health is a non-trivial effort.


That’s not what’s being asked though - in all 3 events, they couldn’t manually update it. It’s clearly not a priority to fix, even for manual updates.


They had all day to do it manually.


>Be capable of spinning up virtualized instances (including custom drive configurations, network stacks, complex routing schemes, even GPUs) with a simple API call

But,

>Be incapable of querying the status of such things

Yeah, I don't believe it.


As others mention, you can do it manually. But it’s also not that hard to do automatically: literally just spin up a “client” of your service and make sure it works.
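
For what it's worth, a minimal canary along those lines is just a scheduled script that exercises the same API path a real client would and flips a status when the error rate crosses a threshold. A rough sketch (the endpoint, window, and threshold are all made up for illustration):

  # Hypothetical canary: exercise the public API the way a real client would.
  # The endpoint, window, and threshold below are assumptions, not AWS's setup.
  import time
  import urllib.request

  ENDPOINT = "https://api.example.com/critical-path"  # stand-in URL
  WINDOW = 60            # number of recent probes to consider
  ERROR_THRESHOLD = 0.2  # report "degraded" if >20% of recent probes fail

  results = []

  def probe():
      try:
          with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
              return 200 <= resp.status < 300
      except Exception:
          return False

  while True:
      results.append(probe())
      recent = results[-WINDOW:]
      error_rate = 1 - sum(recent) / len(recent)
      status = "degraded" if error_rate > ERROR_THRESHOLD else "ok"
      print(f"error_rate={error_rate:.2f} status={status}")
      time.sleep(30)

The hard part isn't the probe; it's agreeing in advance that the probe's output, not someone's judgment, is what the public page shows.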


Why automatic? Surely someone could have the responsibility to do it manually.


Or override the autogenerated values


Eh, the colored icons not loading is not really the same thing as incorrectly reporting that nothing’s wrong. Putting the status page on separate infra would be good practice, though.


The icons showed green.


The AWS summary says: "As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue"

This seems like bad faith to me, based on my experience when I worked for AWS. As they repeated many times at re:Invent last week, they've been doing this for 15+ years. I distinctly remember seeing banners like "Don't update the dashboard without approval from <importantSVP>" on various service team runbooks. They tried not to say it out loud, but there was very much a top-down mandate for service teams to make the dashboard "look green" by:

1. Actually improving availability (this one is fair).

2. Using the "Green-I" icon rather than the blue, orange, or red icons whenever possible.

3. Building out the "Personal Health Dashboard" so they could post about many issues there without having to acknowledge them publicly.


Eh, I mean, at least when DeSantis was lower on the food chain than he is now, the normal directive was that EC2 status wasn't updated unless a certain X percent of hosts were affected. Which is reasonable, because a single rack going down isn't relevant enough to constitute a massive problem with EC2 as a whole.


Multiple AWS employees have acknowledged it takes VP approval to change the status color of the dashboard. That is absurd and it tells you everything you need to know. The status page isn't about accurate information, it's about plausible deniability and keeping AWS out of the news cycle.


I am so naive. I honestly thought those things were automated.


>it's about plausible deniability and keeping AWS out of the news cycle.

How'd that work out for them?

https://duckduckgo.com/?q=AWS+outage+news+coverage&t=h_&ia=w...


When is the last time they had a single service outage in a single region? How about in a single AZ in a single region? Struggling to find a lot of headline stories? I'm willing to bet it's happened in the last 2 years and yet I don't see many news articles about it... so I'd say if the only thing that hits the front page is a complete region outage for 6+ hours, it's working out pretty well for them.


Last year's Thanksgiving outage and this one are the two biggest. They've been pretty reliable. That's still 99.7% uptime.


>You can take it on bad faith

It's smart politics -- I don't blame them, but I don't trust the dashboard either. There's an established pattern now of the AWS dashboard being useless.

If I want to check if Amazon is down I'm checking Twitter and HN. Not bad faith -- no faith.


>>It's smart politics -- I don't blame them

Um, so you think straight-up lying is good politics?

Any 7-year-old knows that telling a lie when you broke something makes you look better superficially, especially if you get away with it.

That does not mean that we should think it is a good idea to tell lies when you break things.

It sure as hell isn't smart politics in my book. It is straight-up disqualifying to do business with them. If they are not honest about the status or amount of service they are providing, how is that different than lying about your prices?

Would you go to a petrol station that posted $x.00/gallon, but only delivered 3 quarts for each gallon shown on the pump?

We're being shortchanged and lied to. Fascinating that you think it is good politics on their part.


You don’t know what you’re talking about.

AWS spends a lot of time thinking about this problem in service to their customers.

How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers- a failed memory module or top-of-rack switch or a random bit flip in one host for one service.

Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it as it had no bearing on their experience.

What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these is mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.

As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.
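
To make the thresholding point concrete, here is a toy roll-up -- entirely my own illustration, not how AWS actually does it -- where a large fleet of host-level health flags reduces to one dashboard color based on the estimated fraction of customers affected:

  # Toy roll-up: many host-level signals -> one dashboard color.
  # Thresholds and the redundancy discount are invented for illustration.
  def rollup_color(host_healthy, customers_per_host, redundancy=0.9):
      total = sum(customers_per_host)
      # Customers behind unhealthy hosts, discounted by the redundancy that
      # usually hides single-host failures from them entirely.
      impacted = sum(c for ok, c in zip(host_healthy, customers_per_host) if not ok)
      impacted *= (1 - redundancy)
      fraction = impacted / total if total else 0.0

      if fraction >= 0.05:
          return "red"     # widespread customer impact
      if fraction >= 0.005:
          return "yellow"  # noticeable degradation
      return "green"       # background noise: failed DIMMs, single racks, etc.

  # One dead rack out of ten thousand hosts stays green, which is the point above.
  flags = [True] * 9999 + [False]
  print(rollup_color(flags, [100] * 10000))  # -> "green"

The argument in this thread isn't really with that kind of aggregation; it's with who picks the thresholds and whether a human can veto the output.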


>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

A good piece of low-hanging fruit: when the outage is significant enough to have reached the media, you turn the dot red.

Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.


Human-in-the-loop != lying.

Broken dashboard != lying.

The specific charge of “lying” is what I dispute.


>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

There's a limitless variety of options, and multiple books written about it. I can recommend the series "The Visual Display of Quantitative Information" by Edward Tufte, for starters.

>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...

Of course there are, so a single R/Y/G indicator is obviously a bad choice.

Again, they could at any time easily choose a better way to display this information: graphs, heatmaps, whatever.

More importantly, the one thing that should NOT be chosen is to have a human in the loop for displaying status, as this inserts both delay and errors.

Worse yet, to make it so that it is a VP-level decision, as if it were a $1million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.

>>You don’t know what you’re talking about.

Look in the mirror, dude. While I haven't worked inside AWS, I have worked on complex network software systems and well understand the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those everyday lies are what goes into their claims of "99.99+% uptime!!", which are evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.


That's only useful when it's an entire region. There are minor issues in smaller services that cause problems for a lot of people, which they don't reflect on their status board; and not everyone checks Twitter or HN all the time while at work.

It's a bullshit board used to fudge numbers when negotiating SLAs.

Like, I don't care that much -- hell, my company does the same thing -- but let's not get defensive over it.


Once is a mistake.

Twice is a coincidence.

Three times is a pattern.

But this… This is every time.


Four times is a policy.


So -- ctrl-f "Dash" only produces four results, and it's hidden away at the bottom of the page. It's false to claim that even 20% of the post mortem is addressing the failure of the dashboard.

The problem is that the dashboard requires VP approval to be updated. Which is broken. The dashboard should be automatic. The dashboard should update before even a single member of the AWS team knows there's something wrong.
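
As a thought experiment, the "no approval gate" version is not complicated: let the same signal that pages the on-call engineer also write the public status, and make suppressing a post (rather than making one) the action that needs sign-off. A rough sketch, with every name and threshold invented:

  # Sketch of alarm-driven status updates with no human approval gate.
  # Service names, thresholds, and publish() are hypothetical placeholders.
  from dataclasses import dataclass

  @dataclass
  class ServiceMetrics:
      name: str
      error_rate: float       # fraction of API calls failing, last 5 minutes
      latency_p99_ms: float

  def derive_status(m: ServiceMetrics) -> str:
      if m.error_rate > 0.10:
          return "outage"
      if m.error_rate > 0.01 or m.latency_p99_ms > 2000:
          return "degraded"
      return "operational"

  def publish(service: str, status: str) -> None:
      # In real life this would push to a status page hosted on
      # infrastructure independent of the services being reported on.
      print(f"[status-page] {service}: {status}")

  for m in [ServiceMetrics("ec2-api", 0.23, 4100),
            ServiceMetrics("s3", 0.002, 180)]:
      publish(m.name, derive_status(m))

Whether the thresholds are right is debatable; that the update happens without a VP in the loop is the point.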


Is it typical for orgs (the whole spectrum: IT departments everywhere, telecom, SaaS, maybe even status of non-technical services) to have automatic downtime messaging that doesn't need a human set of eyes to approve it first?


> You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.

Let's not act like this is the first time this has happened. It's bad faith that they do not change when their promise is that they hire the best to handle infrastructure so you don't have to. It's clearly not the case. Between this and billing, we can easily lay blame and acknowledge lies.


[flagged]


AWS as a business has an enormous (multi-billion-dollar) moral hazard: they have a fantastically strong disincentive to update their status dashboard to accurately reflect the true nature of an ongoing outage. They use weasel words like "some customers may be seeing elevated errors", which we all know translates to "almost all customers are seeing 99.99% failure rates."

They have a strong incentive to lie, and they're doing it. This makes people dependent upon the truth for refunds understandably angry.


> Why are we doing this folks? What's making you so angry and contemptful?

Because Amazon kills industries. Takes jobs. They do this because they promise they hire the best people, who can do this better than you and for cheaper. And it's rarely true. And then they lie about it when things hit the fan. If you're going to be the best, you need to act like the best and execute like the best. Not build a walled garden that people can't see into and that is hard to leave.


All too often folk conflate frustration with anger or hate.

The comments are frustrated users.

Not hateful.


I think the biggest issue is about the status dashboard that always stays green. I haven't seen much else, no?

It seems that "degraded" really means "down" in most cases, since authorization from managers is required.


"Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST. "

This to me is really bad. Even as a small company, we keep our support infrastructure separate. For a company of Amazon's size, this is a shitty excuse. If I cannot even reach you as a customer for almost 7 hours, that is just nuts. AWS must do better here.

Also, is it true that the outage/status pages are manually updated? If yes, there is no excuse for the page staying green that long. If you are manually updating it, please update ASAP.


I know a few tiny ISPs that host their VoIP server and email server outside of their own ASN so that in the event of a catastrophic network event, communication with customers is still possible... Not saying Amazon should do the same, but the general principle isn't rocket science.

There's such a thing as too much dogfooding.


We moved our company's support call system to Microsoft Teams when lockdowns were happening, and even that was affected by the AWS outage (along with our SaaS product hosted on AWS).

It turned out our call center supplier had something running on AWS, and it took out our entire phone system. After this situation settles, I'm tempted to ask my supplier to see what they're doing to get around this in the future, but I doubt even they knew that AWS was used further downstream.

AWS operates a lot like Amazon.com, the marketplace, now--you can try to escape it, but it's near impossible. If you want to ban usage of Amazon's services, you're going to end up finding some service that runs on AWS, or even a Shopify site that ships from an FBA warehouse.


Wasn't this the Bezos directive early on that created AWS? Anything that was created had to be a service with an API. Not allowed to reinvent the wheel. So AWS depends on AWS.


Dependency loops are such fun!

My favourite is when some company migrates their physical servers to virtual machines, including the AD domain controllers. Then the next step is to use AD LDAP authentication for the VM management software.

When there's a temporary outage and the VMs don't start up as expected, the admins can't log on and troubleshoot the platform because the logon system was running on it... but isn't now.

The loop is closed.

You see this all the time, especially with system-management software. They become dependent on the systems they're managing, and vice-versa.

If you care about availability at all, make sure to have physical servers providing basic services like DNS, NTP, LDAP, RADIUS, etc...


Or even just have some non-federated/"local" accounts stored in a vault somewhere you can use when the centralized auth isn't working


My company isn't big enough for us to have any pull but this communication is _significantly_ downplaying the impact of this issue.

One of our auxiliary services that's basically a pass through to AWS was offline nearly the entire day. Yet, this communication doesn't even mention that fact. In fact, it almost tries to suggest the opposite.

Likewise, AWS is reporting S3 didn't have issues. Yet, for a period of time, S3 was erroring out frequently because it was responding so slowly.


SLAs with self-reported outage periods are worthless.

SLAs that refund only the cost of the individual service that was down are worthless.

SLAs that require separate proof and refund requests for each and every service that was affected are nearly worthless.

There needs to be an independent body, set up by large cloud customers, to monitor availability and enforce refunds.


This. We're under NDA too on internal support. Our customers know we use AWS, and they go and check the AWS status dashboards and tell us there's nothing wrong, so the inevitable vitriol is always directed at us, which we then have to defend against.


I guess you have to hope that every outage that impacts you is big enough to make the news.


> The entire time their outage board was solid green

Unless you're talking about some board other than the Service Health Dashboard, this isn't true. They dropped EC2 down to degraded pretty early on. I bemusedly noted in our corporate Slack that every time I refreshed the SHD, another service was listed as degraded. Then they added the giant banner at the top. Their slight delay in updating the SHD at the beginning of the outage is mentioned in the article. It was absolutely not all green for the duration of the outage.


That is not true. There were hours before they started annotating any kind of service issues. Maybe from when you noticed there was a problem it appeared to be quick, but the board remained green for a large portion of the outage.


No, it was about an hour. We were aware from the very moment EC2 API error rates began to elevate, around 10:30 Eastern. By 11:30 the dashboard was updating. This timing is mentioned in the article, and it all happened in the middle of our workday on the east coast. The outage then continued for about 7 hours with SHD updates. I suspect we actually both agree on how long it took them to start updating, but I conclude that 1 hour wasn't so bad.


At the large platform company where I work, our policy is that if the customer reported the issue before our internal monitoring caught it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update; that adds up to 30 minutes end to end with a healthy buffer at each step.

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.


> our policy is if the customer reported the issue before our internal monitoring caught it

They discovered it right away; the Service Health Dashboard just wasn't updated (source: the linked post).


They don’t explicitly say it was right away, do they? I skimmed it twice.

But yes you’re right, there’s no reason to question their monitoring or alerting specifically.


We saw the timing described where the dashboard updates started about an hour after the problem began (which we noticed immediately since 7:30AM Pacific is in the middle of the day for those of us in Eastern time). I don't know if there was an issue with browser caching or similar but once the updates started everyone here had no trouble seeing them and my RSS feed monitor picked them up around that time as well.
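
For anyone not already doing this, polling the SHD's per-service RSS feeds is a low-effort way to have those updates pushed to you the moment they appear. A bare-bones poller using only the standard library -- the feed URL pattern is from memory, so verify it for your service and region:

  # Minimal poller for an AWS Service Health Dashboard RSS feed.
  # The feed URL pattern is an assumption -- check it for your service/region.
  import time
  import urllib.request
  import xml.etree.ElementTree as ET

  FEED = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"
  seen = set()

  while True:
      with urllib.request.urlopen(FEED, timeout=10) as resp:
          root = ET.fromstring(resp.read())
      for item in root.iter("item"):
          title = item.findtext("title", default="")
          pub = item.findtext("pubDate", default="")
          if (title, pub) not in seen:
              seen.add((title, pub))
              print(f"{pub}  {title}")  # or page/Slack your team here
      time.sleep(60)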


Multiple services I use were totally skunked, and none were ever anything but green.

SageMaker, for example, was down all day. I was dead in the water on a modeling project that required GPUs. It relied on EC2, but nobody there even thought to update the status? WTF. This is clearly a case of executives being incentivized to let a bug persist, because the bug is actually a feature for misleading customers and maximizing profits.


Exactly, we had the same thing almost exactly a year ago - https://www.dailymail.co.uk/sciencetech/article-8994907/Wide...

They are barely doing better than 2 9s.


I worked at Amazon. While my boss was on vacation I took over for him in the "Launch readiness" meeting for our team's component of our project. Basically, you go to this meeting with the big decision makers and business people once a week and tell them what your status is on deliverables. You are supposed to sum up your status as "Green/Yellow/Red" and then write (or update last week's document) to explain your status.

My boss had not given me any special directions here so I assumed I was supposed to do this honestly. I set our status as "Red" and then listed out what were, I felt, quite compelling reasons to think we were Red. The gist of it was that our velocity was negative. More work items were getting created and assigned to us than we closed, and we still had high priority items open from previous dates. There was zero chance, in my estimation, that we would meet our deadlines, so I called us Red.

This did not go over well. Everyone at the Launch Readiness meeting got mad at me for declaring Red. Our VP scolded me in front of the entire meeting and lectured me about how I could not unilaterally declare our team red. Her logic was, if our team was Red, that meant the entire project was Red, and I was in no position to make that call. Other managers at the meeting got mad at me too because they felt my call made them look bad. For the rest of my manager's absence I had to first check in with a different manager and show him my Launch Readiness status and get him to approve my update before I was allowed to show it to the rest of the group.

For the rest of the time that I went to Launch Readiness I was forbidden from declaring Red regardless of what our metrics said. Our team was Yellow or Green, period.

Naturally, we wound up being over a year late on the deadlines, because, despite what they compelled us to say in those meetings, we weren't actually getting the needed work done. Constant "schedule slips" and adjustments. Endless wasted time in meetings trying to rework schedules that would instantly get blown up again. Hugely frustrating. Still slightly bitter about it.

Anyway, I guess all this is to say that it doesn't surprise me that Amazon is bad about declaring Red, Yellow, or Green in other places too. Probably there is a guy in charge of updating those dashboards who is forbidden from changing them unless he gets approval from some high-level person, and that person will categorically refuse regardless of the evidence because they want the indicators to be Green.


I had a good chuckle reading your comment. This is not unique to Amazon. Unfortunately, status indicators are super political almost everywhere, precisely because they are what is being monitored as a proxy for the actual progress. I think your comment should be mandatory reading for any leader who is holding the kinds of meetings you describe and thinks they are getting an accurate picture of things.


No need for it to be mandatory; everyone is fully aware of the game and how to play it.


I worked at AMZN and this perfectly captures my experience there with those weekly reviews. I once set a project I was managing as "Red" and had multiple SDMs excoriate me for apparently "throwing them under the bus" even though we had missed multiple timelines and were essentially not going to deliver anything of quality on time. I don't miss this aspect of AMZN!


How dare you communicate a problem using the color system. It hurts feelings, and feelings are important here.


We have something similar at my big-corp company. I think the issue is that you went from Green to Red at the flip of a switch. A more normal project goes Green... raises a red flag... and if the red flags aren't resolved in the next week or two, goes to Yellow. In these meetings everyone collaborates on ways to keep you Green, or to get you back to Green if you went Yellow.

In essence - what you were saying is your boss lied the whole time, because how does one go from a presumed positive velocity to negative velocity in a week?

Additionally, assuming you're a dev lead, it's a little surprising that this is your first meeting of this sort. As a dev lead, I didn't always attend them, but my input was always sought on the status.

Sounds like you had a bad manager, and Amazon is filled with them.


Exactly this. If you take your team from green to red without raising flags and asking for help, you will be frowned upon. It’s like pulling the fire alarm at the smell of burning toast. It will piss people off.


This is not unique. The reason is simple.

1) If you keep status green for 5 years, while not delivering anything, the reality is the folks at the very top (who can come and go) just look at these colors and don't really get into the project UNLESS you say you are red :)

2) Within 1-2 years there is always going to be some excuse for WHY you are late (people changes, scope tweaks, new things to worry about, covid etc)

3) Finally you are 3 years late, but you are launching. Well, the launch overshadows the lateness. I.e., you were green, then you launched; that's all the VP really sees sometimes.


This explicitly supports what most of us assume is going on. I won't be surprised if someone with an (un)vested interest comes along shortly to say that their experience is the opposite and that on their team, making people look bad by telling the truth is expected and praised.


I once had the inverse happen. I showed up as an architect at a pretty huge e-commerce shop. They had a project that had just kicked off and onboarded me to help with planning. They had estimated two months by total finger in the air guessing. I ran them through a sizing and velocity estimation and the result came back as 10 months. I explained this to management and they said "ok". We delivered in about 10 months. It was actually pretty sad that they just didn't care. Especially since we quintupled the budget and no one was counting.


A punitive culture of "accountability" naturally leads to finger pointing and evasion.


I worked at an Amazon air-shipping warehouse for a couple years, and hearing this confirms my suspicions about the management there. Lower management (supervisors, people actually in the building) were very aware of problems, but the people who ran the building lived out of state, so they only actually went to the building on very rare occasions.

Equipment was constantly breaking down, in ways that ranged from inconvenient to potentially dangerous. Seemingly basic design decisions, like the shape of chutes, were screwed up in mind-boggling ways (they put a right-angle corner partway down each chute, which caused packages to get stuck in the chutes constantly). We were short on equipment almost every day; things like poles to help us un-jam packages were in short supply, even though we could move hundreds of thousands of packages a day. On top of all this, the facility opened with half its sorting equipment, and despite promises that we'd be able to add the rest of the equipment in the summer, during Amazon's slow season...it took them two years to even get started.

And all the while, they demanded ever-increasing package quotas. At first, 120,000 packages/day was enough to raise eyebrows--we broke records on a daily basis in our first holiday rush--but then, they started wanting 200,000, then 400,000. Eventually it came out that the building wouldn't even be breaking even until it hit something like 500,000.

As we scaled up, things got even worse. None of the improvements that workers suggested to management were used, to my knowledge, even simple things like adding an indicator light to freight elevators.

Meanwhile, it eventually became clear that there wasn't enough space to store cargo containers in the building. 737s and the like store packages mostly in these giant curved cargo containers, and we needed them to be locked in place while working around/in them...except that, surprise, the people planning the building hadn't planned any holding areas for containers that weren't in use! We ended up sticking them in the middle of the work area.

Which pissed off the upper management when they visited. Their decision? Stop doing it. Are we getting more storage space for the cans? No. Are we getting more workers on the airplane ramp so we can put these cans outside faster? No. But we're not allowed to store those cans in the middle of the work area anymore, even if there aren't any open stations with working locks. Oh, by the way, the locking mechanisms that hold the cans in place started to break down, and to my knowledge they never actually fixed any of the locks. (A guy from their safety team claims they've fixed like 80 or 90 of the stations since the building opened, but none of the broken locks I've seen were fixed in the 2 years I worked there.)


The problem here sounds like lack of clarity over the meaning of the colours.

In organisations with 100s of in-flight projects, it’s understandable that red is reserved for projects that are causing extremely serious issues right now. Otherwise, so many projects would be red that you’d need a new colour.


I'd be willing to believe they had some elite high level reason to schedule things this way if I thought they were good at scheduling. In my ~10 years there I never saw a major project go even close to schedule.

I think it's more like the planning people get rewarded for creating plans that look good and it doesn't bother them if the plans are unrealistic. Then, levels of middle management don't want to make themselves look bad by saying they're behind. And, ultimately, everyone figures they can play a kind of schedule-chicken where everyone says they're green or yellow until the last possible second, hoping that another group will raise a flag first and give you all more time while you can pretend you didn't need it.


How about orange? Didn't know there was a color shortage these days.


But it's Amazon's color. It should carry a positive meaning. #sarcasm


> While my boss was on vacation I took over for him in the "Launch readiness" meeting....once a week

Jeez, how many meetings did you go to, and how long was this person's vacation? I'm jelly of being allowed to take that much time off continuously.


You might be working at the wrong org? My colleagues routinely take weeks off at a time, sometimes more than a month to travel Europe, go scuba diving in French Polynesia, etc. Work to live, don’t live to work.


Your story reminded me of the Challenger disaster and the "see no evil" bureaucratic shenanigans about the O-rings failing to seal in cold weather.

"How dare you threaten our launch readiness go/no-go?!"


Was Challenger the one where they buried the issue in a hundred-slide-long PowerPoint? Or was that the other shuttle?


Was this Amazon or AWS?


Based on the other comments in this comment thread I would say it's Amazon.


Yes, it's a conflict of interest. They have a guarantee on uptime, and they decide what their actual uptime is. There's a lot of that now. Most insurance comes to mind.


Even in the post mortem, they are reluctant to admit it:

> While AWS customer workloads were not directly impacted from the internal networking issues described above, the networking issues caused impact to a number of AWS Services which in turn impacted customers using these service capabilities. Because the main AWS network was not affected, some customer applications which did not rely on these capabilities only experienced minimal impact from this event.


>some customer applications which did not rely on these capabilities only experienced minimal impact from this event

Yeah, so vanilla LB and EC2 with no autoscaling were fine. Anyone using "serverless" or managed services had a real bad day.


Honestly, they should host that status page on Cloudflare or some completely separate infrastructure that they maintain in colo datacenters or something. The only time it really needs to be up is when their own stuff isn't working.


Second-hand info, but supposedly when an outage hits they go all hands on resolving it, and no one who knows what's going on has time to update the status board, which is why it's always behind.


Not AWS, but Azure: I highly doubt that. At least at Azure, the moment you declare an outage there is an incident manager to handle customer communication.

It’s bullshit that someone at Amazon doesn’t have time to update the status.


If you were deployed in 2 regions, would it alleviate the impact?


Depends. If your failover to another region required changing DNS and your DNS was using Route 53, you would have problems.
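
To make that concrete: the usual "flip DNS to the standby region" runbook boils down to a Route 53 control-plane call like the sketch below (zone ID, record name, and target are invented placeholders), and that control-plane dependency is exactly what the parent comment is warning about.

  # Sketch of a DNS-based regional failover via boto3 / Route 53.
  # Zone ID, record name, and the ELB target are invented placeholders.
  import boto3

  route53 = boto3.client("route53")

  def point_api_at(target_dns_name: str) -> None:
      route53.change_resource_record_sets(
          HostedZoneId="Z0000000EXAMPLE",
          ChangeBatch={
              "Comment": "manual failover to standby region",
              "Changes": [{
                  "Action": "UPSERT",
                  "ResourceRecordSet": {
                      "Name": "api.example.com.",
                      "Type": "CNAME",
                      "TTL": 60,
                      "ResourceRecords": [{"Value": target_dns_name}],
                  },
              }],
          },
      )

  point_api_at("my-app.us-west-2.elb.amazonaws.com")

Route 53's DNS answers themselves are served from a distributed data plane, so failover records and health checks configured ahead of time tend to be more robust than making control-plane changes in the middle of an outage.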


Yes. Exactly. Pay double. That is what all the blogs say. But no, when a region goes down, everything is hosed. Give it a shot! Next time an entire region is down, try out your APIs or give AWS support a call.


No. We don't have an active deployment in that region at all. It killed our build pipeline, as ECR was down globally, so we had nowhere to push images. There was also a massive risk: our target environments are EKS, so any node failures or scaling events would have had nowhere to pull images from while ECR was down.

Edit: not to mention APIGW and CloudWatch APIs were down too.


> The entire time their outage board was solid green. We were in touch with their support people and knew it was bad but were under NDA not to discuss it with anyone.

  if ($pain > $gain) {
    move_your_shit_and_exit_aws();
  }

  sub move_your_shit_and_exit_aws
  {
    printf("Dude. We have too much pain. Start moving\n");
    printf("Yeah. That won't happen, so who cares\n");
    exit(1);
  }


Moving your shit from AWS can be really expensive, depending on how much shit you have. If you're nice, GCP may subsidise - or even cover - the costs!


Obligatory mention to https://stop.lying.cloud


carbon copy of our experience.


This was addressed at least 3 times in the post. I'm not defending them, but you're just gaslighting. If you have something to add about the points they raised regarding the status page, please do so.


I can't find a reference anywhere for this to back it up, but I recall reading that Stephen Wolfram (https://en.wikipedia.org/wiki/Stephen_Wolfram) started Wolfram Research (Mathematica) with 100% remote employees.


Thank you for this post. I'm in nearly the same boat (41 years old, < 1M net worth, on my fifth startup with < 100k to show for those efforts) and watching seemingly complete idiots pass me by. I'm too old to be infuriated by the vagaries of chance; however, it does make me wonder if I've done something wrong. I suspect that the numbers are slightly skewed - that there are far fewer people having wild success than the media would have us believe.


The odds are always against entrepreneurs:

http://chrisyeh.blogspot.com/2010/07/entrepreneurship-is-abo...

"Let me reiterate--you have a less than 50/50 chance of founding a successful startup, even if you manage to raise VC every time (which is not a forgone conclusion) and even if you devote essentially your entire professional life to it."

On the other hand, if you enjoy entrepreneurship, don't let the lack of success get to you. After all, there's always the next company!


The conclusion that I've come to is that VC-istan isn't actually technology, at least not in this current social-media fueled bubble. It's old-fashioned social climbing and self-promotion with a bit of technology in the back end.

Do startups actually succeed based on technical merits, or on how well they market themselves? In this social media bubble, it's the latter. I'm not going to claim that technical skill doesn't matter. I just don't think it matters as much. You can back-fill the technical stuff by hiring the right people (contrary to our overblown claim that non-technical CEOs have no hope of finding technical talent because they can't individually judge it) but if you build great technology and can't sell it, you never get off the ground.

It's exceptionalism that leads people to think that the VC ecosystem is in some way (or should be) morally superior to Wall Street, Hollywood, the fashion industry, or Madison Avenue. Sure, what we do is cerebral, but so was advertising in the Mad Man era. VC-istan isn't worse than these other industries, but it's not better. When you have a "creative" industry, there are a lot of opportunities to do great work and profit by doing so. But there are also smiling-idiot narcissists who pile in and fuck everything up because they think they're "creative"... and of course, what gives them this opportunity is that there are other idiots in power who will put them ahead of the people of substance like us because they don't know any better.

It's the expectation of meritocracy that makes us unhappy, but human organizations and ecosystems and societies all turn to shit over time no matter what so this is an unreasonable expectation.

There is a place for people like us, the virtuous soldiers who get rich slowly, building our skillset until we're just really good at a few things... but there's also a place for smiling idiots. And smiling idiots are always going to be the "cool kids", and it's the cool kids (not people of substance) who get those stupid TechCrunch articles written about their 7-Couric products. It's the expectation of fairness in human structures, which is just unreasonable at scale, that creates the unhappiness.

