
I like to compare this to Site C in British Columbia https://www.bchydro.com/energy-in-bc/projects/site_c.html#:~.... If you are lucky enough to be blessed with remote untapped rivers that can be dammed in somewhat unpopulated mountain valleys (that's another story again), hydro seems to be far cheaper and safer. I would imagine that nuclear power's main issue is first and foremost cost: even if you could pull off completely safe nuclear power, it would still wind up costing too dang much.


My company uses AWS. We had significant degradation of many of their APIs for over six hours, which had a substantive impact on our business. The entire time, their outage board was solid green. We were in touch with their support people and knew it was bad, but were under NDA not to discuss it with anyone.

Of course problems and outages are going to happen, but saying they have five nines (99.999%) uptime as measured by their "green board" is meaningless. During the event they were late and reluctant to report it and its significance. My point is that they are wrongly incentivized to keep the board green at all costs.


I mean, not to defend them too strongly, but literally half of this post mortem is addressing the failure of the Service Dashboard. You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.


Off the top of my head, this is the third time they've had a major outage where they've been unable to properly update the status page. First we had the S3 outage, where the yellow and red icons were hosted in S3 and couldn't be accessed. Second we had the Kinesis outage, which snowballed into a Cognito outage, so they were unable to log in to the status page CMS. Now this.

They "own up to it" in their postmortems, but after multiple failures they're still unwilling to implement the obvious solution and what is widely regarded as best practice: host the status page on a different platform.


Firmly agreed. I've heard AWS discuss making the status page better – but they get really quiet about actually doing it. In my experience the best/only way to check for problems is to search Twitter for your AWS region name.


Maybe AWS should host their status checks in Azure and vice versa ... Mutually Assured Monitoring :) Otherwise it becomes a problem of who will monitor the monitor


My company is quite well known for blameless post-mortems, but if someone failed to implement improvements after three successive outages, they would be moved to a position more appropriate for their skills.


This challenge is not specific to Amazon.

Being able to automatically detect system health is a non-trivial effort.


That’s not what’s being asked though - in all 3 events, they couldn’t manually update it. It’s clearly not a priority to fix, even for manual updates.


They had all day to do it manually.


>Be capable of spinning up virtualized instances (including custom drive configurations, network stacks, complex routing schemes, even GPUs) with a simple API call

But,

>Be incapable of querying the status of such things

Yeah, I don't believe it.


As others mention, you can do it manually. But it’s also not that hard to do automatically: literally just spin up a “client” of your service and make sure it works.
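
For what it's worth, a minimal canary along those lines is just a scheduled script that exercises the same API path a real client would and flips a status when the error rate crosses a threshold. A rough sketch (the endpoint, window, and threshold are all made up for illustration):

  # Hypothetical canary: exercise the public API the way a real client would.
  # The endpoint, window, and threshold below are assumptions, not AWS's setup.
  import time
  import urllib.request

  ENDPOINT = "https://api.example.com/critical-path"  # stand-in URL
  WINDOW = 60            # number of recent probes to consider
  ERROR_THRESHOLD = 0.2  # report "degraded" if >20% of recent probes fail

  results = []

  def probe():
      try:
          with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
              return 200 <= resp.status < 300
      except Exception:
          return False

  while True:
      results.append(probe())
      recent = results[-WINDOW:]
      error_rate = 1 - sum(recent) / len(recent)
      status = "degraded" if error_rate > ERROR_THRESHOLD else "ok"
      print(f"error_rate={error_rate:.2f} status={status}")
      time.sleep(30)

The hard part isn't the probe; it's agreeing in advance that the probe's output, not someone's judgment, is what the public page shows.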


Why automatic? Surely someone could have the responsibility to do it manually.


Or override the autogenerated values


Eh, the colored icons not loading is not really the same thing as incorrectly reporting that nothing’s wrong. Putting the status page on separate infra would be good practice, though.


The icons showed green.


The AWS summary says: "As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue"

This seems like bad faith to me, based on my experience when I worked for AWS. As they repeated many times at re:Invent last week, they've been doing this for 15+ years. I distinctly remember seeing banners like "Don't update the dashboard without approval from <importantSVP>" on various service team runbooks. They tried not to say it out loud, but there was very much a top-down mandate for service teams to make the dashboard "look green" by:

1. Actually improving availability (this one is fair).

2. Using the "Green-I" icon rather than the blue, orange, or red icons whenever possible.

3. Building out the "Personal Health Dashboard" so they could post about many issues there without having to acknowledge them publicly.


Eh, I mean, at least when DeSantis was lower on the food chain than he is now, the normal directive was that EC2 status wasn't updated unless a certain X percent of hosts were affected. Which is reasonable, because a single rack going down isn't relevant enough to constitute a massive problem with EC2 as a whole.


Multiple AWS employees have acknowledged it takes VP approval to change the status color of the dashboard. That is absurd and it tells you everything you need to know. The status page isn't about accurate information, it's about plausible deniability and keeping AWS out of the news cycle.


I am so naive. I honestly thought those things were automated.


>it's about plausible deniability and keeping AWS out of the news cycle.

How'd that work out for them?

https://duckduckgo.com/?q=AWS+outage+news+coverage&t=h_&ia=w...


When is the last time they had a single service outage in a single region? How about in a single AZ in a single region? Struggling to find a lot of headline stories? I'm willing to bet it's happened in the last 2 years and yet I don't see many news articles about it... so I'd say if the only thing that hits the front page is a complete region outage for 6+ hours, it's working out pretty well for them.


Last year's Thanksgiving outage and this one are the two biggest. They've been pretty reliable. That's still 99.7% uptime.


>You can take it on bad faith

It's smart politics -- I don't blame them, but I don't trust the dashboard either. There's an established pattern now of the AWS dashboard being useless.

If I want to check if Amazon is down I'm checking Twitter and HN. Not bad faith -- no faith.


>>It's smart politics -- I don't blame them

Um, so you think straight-up lying is good politics?

Any 7-year-old knows that telling a lie when you broke something makes you look better superficially, especially if you get away with it.

That does not mean that we should think it is a good idea to tell lies when you break things.

It sure as hell isn't smart politics in my book. It is straight-up disqualifying to do business with them. If they are not honest about the status or amount of service they are providing, how is that different than lying about your prices?

Would you go to a petrol station that posted $x.00/gallon, but only delivered 3 quarts for each gallon shown on the pump?

We're being shortchanged and lied to. Fascinating that you think it is good politics on their part.


You don’t know what you’re talking about.

AWS spends a lot of time thinking about this problem in service to their customers.

How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers- a failed memory module or top-of-rack switch or a random bit flip in one host for one service.

Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it as it had no bearing on their experience.

What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these is mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.

As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.
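
To make the thresholding point concrete, here is a toy roll-up -- entirely my own illustration, not how AWS actually does it -- where a large fleet of host-level health flags reduces to one dashboard color based on the estimated fraction of customers affected:

  # Toy roll-up: many host-level signals -> one dashboard color.
  # Thresholds and the redundancy discount are invented for illustration.
  def rollup_color(host_healthy, customers_per_host, redundancy=0.9):
      total = sum(customers_per_host)
      # Customers behind unhealthy hosts, discounted by the redundancy that
      # usually hides single-host failures from them entirely.
      impacted = sum(c for ok, c in zip(host_healthy, customers_per_host) if not ok)
      impacted *= (1 - redundancy)
      fraction = impacted / total if total else 0.0

      if fraction >= 0.05:
          return "red"     # widespread customer impact
      if fraction >= 0.005:
          return "yellow"  # noticeable degradation
      return "green"       # background noise: failed DIMMs, single racks, etc.

  # One dead rack out of ten thousand hosts stays green, which is the point above.
  flags = [True] * 9999 + [False]
  print(rollup_color(flags, [100] * 10000))  # -> "green"

The argument in this thread isn't really with that kind of aggregation; it's with who picks the thresholds and whether a human can veto the output.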


>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

A good piece of low-hanging fruit: when the outage is significant enough to have reached the media, you turn the dot red.

Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.


Human-in-the-loop != lying.

Broken dashboard != lying.

The specific charge of “lying” is what I dispute.


>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

There's a limitless variety of options, and multiple books written about it. I can recommend the series "The Visual Display of Quantitative Information" by Edward Tufte, for starters.

>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...

Of course there are, so a single R/Y/G indicator is obviously a bad choice.

Again, they could at any time easily choose a better way to display this information: graphs, heatmaps, whatever.

More importantly, the one thing that should NOT be chosen is to have a human in the loop for displaying status, as this inserts both delay and errors.

Worse yet, to make it so that it is a VP-level decision, as if it were a $1million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.

>>You don’t know what you’re talking about.

Look in the mirror, dude. While I haven't worked inside AWS, I have worked on complex network software systems and well understand the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those everyday lies are what goes into their claims of "99.99+% uptime!!", which are evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.


That's only useful when it's an entire region. There are minor issues in smaller services that cause problems for a lot of people, which they don't reflect on their status board; and not everyone checks Twitter or HN all the time while at work.

It's a bullshit board used to fudge numbers when negotiating SLAs.

Like, I don't care that much -- hell, my company does the same thing -- but let's not get defensive over it.


Once is a mistake.

Twice is a coincidence.

Three times is a pattern.

But this… This is every time.


Four times is a policy.


So -- ctrl-f "Dash" only produces four results, and it's hidden away at the bottom of the page. It's false to claim that even 20% of the post mortem is addressing the failure of the dashboard.

The problem is that the dashboard requires VP approval to be updated. Which is broken. The dashboard should be automatic. The dashboard should update before even a single member of the AWS team knows there's something wrong.
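
As a thought experiment, the "no approval gate" version is not complicated: let the same signal that pages the on-call engineer also write the public status, and make suppressing a post (rather than making one) the action that needs sign-off. A rough sketch, with every name and threshold invented:

  # Sketch of alarm-driven status updates with no human approval gate.
  # Service names, thresholds, and publish() are hypothetical placeholders.
  from dataclasses import dataclass

  @dataclass
  class ServiceMetrics:
      name: str
      error_rate: float       # fraction of API calls failing, last 5 minutes
      latency_p99_ms: float

  def derive_status(m: ServiceMetrics) -> str:
      if m.error_rate > 0.10:
          return "outage"
      if m.error_rate > 0.01 or m.latency_p99_ms > 2000:
          return "degraded"
      return "operational"

  def publish(service: str, status: str) -> None:
      # In real life this would push to a status page hosted on
      # infrastructure independent of the services being reported on.
      print(f"[status-page] {service}: {status}")

  for m in [ServiceMetrics("ec2-api", 0.23, 4100),
            ServiceMetrics("s3", 0.002, 180)]:
      publish(m.name, derive_status(m))

Whether the thresholds are right is debatable; that the update happens without a VP in the loop is the point.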


Is it typical for orgs (the whole spectrum: IT departments everywhere, telecom, SaaS, maybe even status of non-technical services) to have automatic downtime messaging that doesn't need a human set of eyes to approve it first?


> You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.

Let's not act like this is the first time this has happened. It's bad faith that they do not change when their promise is that they hire the best to handle infrastructure so you don't have to. It's clearly not the case. Between this and billing, we can easily lay blame and acknowledge lies.


[flagged]


AWS as a business has an enormous (multi-billion-dollar) moral hazard: they have a fantastically strong disincentive to update their status dashboard to accurately reflect the true nature of an ongoing outage. They use weasel words like "some customers may be seeing elevated errors", which we all know translates to "almost all customers are seeing 99.99% failure rates."

They have a strong incentive to lie, and they're doing it. This makes people dependent upon the truth for refunds understandably angry.


> Why are we doing this folks? What's making you so angry and contemptful?

Because Amazon kills industries. Takes jobs. They do this because they promise they hire the best people, who can do this better than you and for cheaper. And it's rarely true. And then they lie about it when things hit the fan. If you're going to be the best, you need to act like the best and execute like the best. Not build a walled garden that people can't see into and that is hard to leave.


All too often folk conflate frustration with anger or hate.

The comments are frustrated users.

Not hateful.


I think the biggest issue is about the status dashboard that always stays green. I haven't seen much else, no?

It seems that "degraded" really means "down" in most cases, since authorization from managers is required.


"Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST. "

This to me is really bad. Even as a small company, we keep our support infrastructure separate. For a company of Amazon's size, this is a shitty excuse. If I cannot even reach you as a customer for almost 7 hours, that is just nuts. AWS must do better here.

Also, is it true that the outage/status pages are manually updated? If yes, there is no excuse for the page staying green that long. If you are manually updating it, please update ASAP.


I know a few tiny ISPs that host their VoIP server and email server outside of their own ASN so that in the event of a catastrophic network event, communication with customers is still possible... Not saying Amazon should do the same, but the general principle isn't rocket science.

There's such a thing as too much dogfooding.


We moved our company's support call system to Microsoft Teams when lockdowns were happening, and even that was affected by the AWS outage (along with our SaaS product hosted on AWS).

It turned out our call center supplier had something running on AWS, and it took out our entire phone system. After this situation settles, I'm tempted to ask my supplier to see what they're doing to get around this in the future, but I doubt even they knew that AWS was used further downstream.

AWS operates a lot like Amazon.com, the marketplace, now--you can try to escape it, but it's near impossible. If you want to ban usage of Amazon's services, you're going to end up finding some service that runs on AWS, or even a Shopify site that ships from an FBA warehouse.


Wasn't this the Bezos directive early on that created AWS? Anything that was created had to be a service with an API. Not allowed to reinvent the wheel. So AWS depends on AWS.


Dependency loops are such fun!

My favourite is when some company migrates their physical servers to virtual machines, including the AD domain controllers. Then the next step is to use AD LDAP authentication for the VM management software.

When there's a temporary outage and the VMs don't start up as expected, the admins can't log on and troubleshoot the platform because the logon system was running on it... but isn't now.

The loop is closed.

You see this all the time, especially with system-management software. They become dependent on the systems they're managing, and vice-versa.

If you care about availability at all, make sure to have physical servers providing basic services like DNS, NTP, LDAP, RADIUS, etc...


Or even just have some non-federated/"local" accounts stored in a vault somewhere you can use when the centralized auth isn't working


My company isn't big enough for us to have any pull but this communication is _significantly_ downplaying the impact of this issue.

One of our auxiliary services that's basically a pass through to AWS was offline nearly the entire day. Yet, this communication doesn't even mention that fact. In fact, it almost tries to suggest the opposite.

Likewise, AWS is reporting S3 didn't have issues. Yet, for a period of time, S3 was erroring out frequently because it was responding so slowly.


SLAs with self-reported outage periods are worthless.

SLAs that refund only the cost of the individual service that was down are worthless.

SLAs that require separate proof and refund requests for each and every service that was affected are nearly worthless.

There needs to be an independent body, set up by large cloud customers, to monitor availability and enforce refunds.


This. We're under NDA too on internal support. Our customers know we use AWS, and they go and check the AWS status dashboards and tell us there's nothing wrong, so the inevitable vitriol is always directed at us, which we then have to defend against.


I guess you have to hope that every outage that impacts you is big enough to make the news.


> The entire time their outage board was solid green

Unless you're talking about some board other than the Service Health Dashboard, this isn't true. They dropped EC2 down to degraded pretty early on. I bemusedly noted in our corporate Slack that every time I refreshed the SHD, another service was listed as degraded. Then they added the giant banner at the top. Their slight delay in updating the SHD at the beginning of the outage is mentioned in the article. It was absolutely not all green for the duration of the outage.


That is not true. There were hours before they started annotating any kind of service issues. Maybe from when you noticed there was a problem it appeared to be quick, but the board remained green for a large portion of the outage.


No, it was about an hour. We were aware from the very moment EC2 API error rates began to elevate, around 10:30 Eastern. By 11:30 the dashboard was updating. This timing is mentioned in the article, and it all happened in the middle of our workday on the east coast. The outage then continued for about 7 hours with SHD updates. I suspect we actually both agree on how long it took them to start updating, but I conclude that 1 hour wasn't so bad.


At the large platform company where I work, our policy is that if the customer reported the issue before our internal monitoring caught it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update; that adds up to 30 minutes end to end with a healthy buffer at each step.

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.


> our policy is if the customer reported the issue before our internal monitoring caught it

They discovered it right away; the Service Health Dashboard just wasn't updated (source: the linked post).


They don’t explicitly say it was right away, do they? I skimmed it twice.

But yes you’re right, there’s no reason to question their monitoring or alerting specifically.


We saw the timing described where the dashboard updates started about an hour after the problem began (which we noticed immediately since 7:30AM Pacific is in the middle of the day for those of us in Eastern time). I don't know if there was an issue with browser caching or similar but once the updates started everyone here had no trouble seeing them and my RSS feed monitor picked them up around that time as well.
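
For anyone not already doing this, polling the SHD's per-service RSS feeds is a low-effort way to have those updates pushed to you the moment they appear. A bare-bones poller using only the standard library -- the feed URL pattern is from memory, so verify it for your service and region:

  # Minimal poller for an AWS Service Health Dashboard RSS feed.
  # The feed URL pattern is an assumption -- check it for your service/region.
  import time
  import urllib.request
  import xml.etree.ElementTree as ET

  FEED = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"
  seen = set()

  while True:
      with urllib.request.urlopen(FEED, timeout=10) as resp:
          root = ET.fromstring(resp.read())
      for item in root.iter("item"):
          title = item.findtext("title", default="")
          pub = item.findtext("pubDate", default="")
          if (title, pub) not in seen:
              seen.add((title, pub))
              print(f"{pub}  {title}")  # or page/Slack your team here
      time.sleep(60)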


Multiple services I use were totally skunked, and none were ever anything but green.

SageMaker, for example, was down all day. I was dead in the water on a modeling project that required GPUs. It relied on EC2, but nobody there even thought to update the status? WTF. This is clearly a case of executives being incentivized to let a bug persist, because the bug is actually a feature for misleading customers and maximizing profits.


Exactly, we had the same thing almost exactly a year ago - https://www.dailymail.co.uk/sciencetech/article-8994907/Wide...

They are barely doing better than 2 9s.


I worked at Amazon. While my boss was on vacation I took over for him in the "Launch readiness" meeting for our team's component of our project. Basically, you go to this meeting with the big decision makers and business people once a week and tell them what your status is on deliverables. You are supposed to sum up your status as "Green/Yellow/Red" and then write (or update last week's document) to explain your status.

My boss had not given me any special directions here so I assumed I was supposed to do this honestly. I set our status as "Red" and then listed out what were, I felt, quite compelling reasons to think we were Red. The gist of it was that our velocity was negative. More work items were getting created and assigned to us than we closed, and we still had high priority items open from previous dates. There was zero chance, in my estimation, that we would meet our deadlines, so I called us Red.

This did not go over well. Everyone at the Launch Readiness meeting got mad at me for declaring Red. Our VP scolded me in front of the entire meeting and lectured me about how I could not unilaterally declare our team red. Her logic was, if our team was Red, that meant the entire project was Red, and I was in no position to make that call. Other managers at the meeting got mad at me too because they felt my call made them look bad. For the rest of my manager's absence I had to first check in with a different manager and show him my Launch Readiness status and get him to approve my update before I was allowed to show it to the rest of the group.

For the rest of the time that I went to Launch Readiness I was forbidden from declaring Red regardless of what our metrics said. Our team was Yellow or Green, period.

Naturally, we wound up being over a year late on the deadlines, because, despite what they compelled us to say in those meetings, we weren't actually getting the needed work done. Constant "schedule slips" and adjustments. Endless wasted time in meetings trying to rework schedules that would instantly get blown up again. Hugely frustrating. Still slightly bitter about it.

Anyway, I guess all this is to say that it doesn't surprise me that Amazon is bad about declaring Red, Yellow, or Green in other places too. Probably there is a guy in charge of updating those dashboards who is forbidden from changing them unless he gets approval from some high-level person, and that person will categorically refuse regardless of the evidence because they want the indicators to be Green.


I had a good chuckle reading your comment. This is not unique to Amazon. Unfortunately, status indicators are super political almost everywhere, precisely because they are what is being monitored as a proxy for the actual progress. I think your comment should be mandatory reading for any leader who is holding the kinds of meetings you describe and thinks they are getting an accurate picture of things.


No need for it to be mandatory; everyone is fully aware of the game and how to play it.


I worked at AMZN and this perfectly captures my experience there with those weekly reviews. I once set a project I was managing as "Red" and had multiple SDMs excoriate me for apparently "throwing them under the bus" even though we had missed multiple timelines and were essentially not going to deliver anything of quality on time. I don't miss this aspect of AMZN!


How dare you communicate a problem using the color system. It hurts feelings, and feelings are important here.


We have something similar at my big-corp company. I think the issue is that you went from Green to Red at the flip of a switch. A more normal project goes Green... raises a red flag... and if the red flags aren't resolved in the next week or two, goes to Yellow. In these meetings everyone collaborates on ways to keep you Green, or to get you back to Green if you went Yellow.

In essence - what you were saying is your boss lied the whole time, because how does one go from a presumed positive velocity to negative velocity in a week?

Additionally, assuming you're a dev lead, it's a little surprising that this is your first meeting of this sort. As a dev lead, I didn't always attend them, but my input was always sought on the status.

Sounds like you had a bad manager, and Amazon is filled with them.


Exactly this. If you take your team from green to red without raising flags and asking for help, you will be frowned upon. It’s like pulling the fire alarm at the smell of burning toast. It will piss people off.


This is not unique. The reason is simple.

1) If you keep status green for 5 years, while not delivering anything, the reality is the folks at the very top (who can come and go) just look at these colors and don't really get into the project UNLESS you say you are red :)

2) Within 1-2 years there is always going to be some excuse for WHY you are late (people changes, scope tweaks, new things to worry about, covid etc)

3) Finally you are 3 years late, but you are launching. Well, the launch overshadows the lateness. I.e., you were green, then you launched; that's all the VP really sees sometimes.


This explicitly supports what most of us assume is going on. I won't be surprised if someone with an (un)vested interest comes along shortly to say that their experience is the opposite and that on their team, making people look bad by telling the truth is expected and praised.


I once had the inverse happen. I showed up as an architect at a pretty huge e-commerce shop. They had a project that had just kicked off and onboarded me to help with planning. They had estimated two months by total finger in the air guessing. I ran them through a sizing and velocity estimation and the result came back as 10 months. I explained this to management and they said "ok". We delivered in about 10 months. It was actually pretty sad that they just didn't care. Especially since we quintupled the budget and no one was counting.


A punitive culture of "accountability" naturally leads to finger pointing and evasion.


I worked at an Amazon air-shipping warehouse for a couple years, and hearing this confirms my suspicions about the management there. Lower management (supervisors, people actually in the building) were very aware of problems, but the people who ran the building lived out of state, so they only actually went to the building on very rare occasions.

Equipment was constantly breaking down, in ways that ranged from inconvenient to potentially dangerous. Seemingly basic design decisions, like the shape of chutes, were screwed up in mind-boggling ways (they put a right-angle corner partway down each chute, which caused packages to get stuck in the chutes constantly). We were short on equipment almost every day; things like poles to help us un-jam packages were in short supply, even though we could move hundreds of thousands of packages a day. On top of all this, the facility opened with half its sorting equipment, and despite promises that we'd be able to add the rest of the equipment in the summer, during Amazon's slow season...it took them two years to even get started.

And all the while, they demanded ever-increasing package quotas. At first, 120,000 packages/day was enough to raise eyebrows--we broke records on a daily basis in our first holiday rush--but then, they started wanting 200,000, then 400,000. Eventually it came out that the building wouldn't even be breaking even until it hit something like 500,000.

As we scaled up, things got even worse. None of the improvements that workers suggested to management were used, to my knowledge, even simple things like adding an indicator light to freight elevators.

Meanwhile, it eventually became clear that there wasn't enough space to store cargo containers in the building. 737s and the like store packages mostly in these giant curved cargo containers, and we needed them to be locked in place while working around/in them...except that, surprise, the people planning the building hadn't planned any holding areas for containers that weren't in use! We ended up sticking them in the middle of the work area.

Which pissed off the upper management when they visited. Their decision? Stop doing it. Are we getting more storage space for the cans? No. Are we getting more workers on the airplane ramp so we can put these cans outside faster? No. But we're not allowed to store those cans in the middle of the work area anymore, even if there aren't any open stations with working locks. Oh, by the way, the locking mechanisms that hold the cans in place started to break down, and to my knowledge they never actually fixed any of the locks. (A guy from their safety team claims they've fixed like 80 or 90 of the stations since the building opened, but none of the broken locks I've seen were fixed in the 2 years I worked there.)


The problem here sounds like lack of clarity over the meaning of the colours.

In organisations with 100s of in-flight projects, it’s understandable that red is reserved for projects that are causing extremely serious issues right now. Otherwise, so many projects would be red that you’d need a new colour.


I'd be willing to believe they had some elite high level reason to schedule things this way if I thought they were good at scheduling. In my ~10 years there I never saw a major project go even close to schedule.

I think it's more like the planning people get rewarded for creating plans that look good and it doesn't bother them if the plans are unrealistic. Then, levels of middle management don't want to make themselves look bad by saying they're behind. And, ultimately, everyone figures they can play a kind of schedule-chicken where everyone says they're green or yellow until the last possible second, hoping that another group will raise a flag first and give you all more time while you can pretend you didn't need it.


How about orange? Didn't know there was a color shortage these days.


But it's Amazon's color. It should carry a positive meaning. #sarcasm


> While my boss was on vacation I took over for him in the "Launch readiness" meeting....once a week

Jeez, how many meetings did you go to, and how long was this person's vacation? I'm jelly of being allowed to take that much time off continuously.


You might be working at the wrong org? My colleagues routinely take weeks off at a time, sometimes more than a month to travel Europe, go scuba diving in French Polynesia, etc. Work to live, don’t live to work.


Your story reminded me of the Challenger disaster and the "see no evil" bureaucratic shenanigans about the O-rings failing to seal in cold weather.

"How dare you threaten our launch readiness go/no-go?!"


Was Challenger the one where they buried the issue in a hundred-slide-long PowerPoint? Or was that the other shuttle?


Was this Amazon or AWS?


Based on the other comments in this comment thread I would say it's Amazon.


Yes, it's a conflict of interest. They have a guarantee on uptime, and they decide what their actual uptime is. There's a lot of that now. Most insurance comes to mind.


Even in the post mortem, they are reluctant to admit it:

> While AWS customer workloads were not directly impacted from the internal networking issues described above, the networking issues caused impact to a number of AWS Services which in turn impacted customers using these service capabilities. Because the main AWS network was not affected, some customer applications which did not rely on these capabilities only experienced minimal impact from this event.


>some customer applications which did not rely on these capabilities only experienced minimal impact from this event

Yeah, so vanilla LB and EC2 with no autoscaling were fine. Anyone using "serverless" or managed services had a real bad day.


Honestly, they should host that status page on Cloudflare or some completely separate infrastructure that they maintain in colo datacenters or something. The only time it really needs to be up is when their own stuff isn't working.


Second-hand info, but supposedly when an outage hits they go all hands on resolving it, and no one who knows what's going on has time to update the status board, which is why it's always behind.


Not AWS, but Azure: I highly doubt that. At least at Azure, the moment you declare an outage there is an incident manager to handle customer communication.

It’s bullshit that someone at Amazon doesn’t have time to update the status.


If you were deployed in 2 regions, would it alleviate the impact?


Depends. If your failover to another region required changing DNS and your DNS was using Route 53, you would have problems.
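
To make that concrete: the usual "flip DNS to the standby region" runbook boils down to a Route 53 control-plane call like the sketch below (zone ID, record name, and target are invented placeholders), and that control-plane dependency is exactly what the parent comment is warning about.

  # Sketch of a DNS-based regional failover via boto3 / Route 53.
  # Zone ID, record name, and the ELB target are invented placeholders.
  import boto3

  route53 = boto3.client("route53")

  def point_api_at(target_dns_name: str) -> None:
      route53.change_resource_record_sets(
          HostedZoneId="Z0000000EXAMPLE",
          ChangeBatch={
              "Comment": "manual failover to standby region",
              "Changes": [{
                  "Action": "UPSERT",
                  "ResourceRecordSet": {
                      "Name": "api.example.com.",
                      "Type": "CNAME",
                      "TTL": 60,
                      "ResourceRecords": [{"Value": target_dns_name}],
                  },
              }],
          },
      )

  point_api_at("my-app.us-west-2.elb.amazonaws.com")

Route 53's DNS answers themselves are served from a distributed data plane, so failover records and health checks configured ahead of time tend to be more robust than making control-plane changes in the middle of an outage.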


Yes. Exactly. Pay double. That is what all the blogs say. But no, when a region goes down, everything is hosed. Give it a shot! Next time an entire region is down, try out your APIs or give AWS support a call.


No. We don't have an active deployment in that region at all. It killed our build pipeline, as ECR was down globally, so we had nowhere to push images. There was also a massive risk: our target environments are EKS, so any node failures or scaling events would have had nowhere to pull images from while ECR was down.

Edit: not to mention APIGW and CloudWatch APIs were down too.


> The entire time their outage board was solid green. We were in touch with their support people and knew it was bad but were under NDA not to discuss it with anyone.

  if ($pain > $gain) {
    move_your_shit_and_exit_aws();
  }

  sub move_your_shit_and_exit_aws
  {
    printf("Dude. We have too much pain. Start moving\n");
    printf("Yeah. That won't happen, so who cares\n");
    exit(1);
  }


Moving your shit from AWS can be really expensive, depending on how much shit you have. If you're nice, GCP may subsidise - or even cover - the costs!


Obligatory mention to https://stop.lying.cloud


carbon copy of our experience.


This was addressed at least 3 times in the post. I'm not defending them, but you're just gaslighting. If you have something to add about the points they raised regarding the status page, please do so.


I can't find a reference anywhere for this to back it up, but I recall reading that Stephen Wolfram (https://en.wikipedia.org/wiki/Stephen_Wolfram) started Wolfram Research (Mathematica) with 100% remote employees.


Thank you for this post. I'm in nearly the same boat (41 years old, < 1M net worth, on my fifth startup with < 100k to show for those efforts) and watching seemingly complete idiots pass me by. I'm too old to be infuriated by the vagaries of chance; however, it does make me wonder if I've done something wrong. I suspect that the numbers are slightly skewed - that there are far fewer people having wild success than the media would have us believe.


The odds are always against entrepreneurs:

http://chrisyeh.blogspot.com/2010/07/entrepreneurship-is-abo...

"Let me reiterate--you have a less than 50/50 chance of founding a successful startup, even if you manage to raise VC every time (which is not a forgone conclusion) and even if you devote essentially your entire professional life to it."

On the other hand, if you enjoy entrepreneurship, don't let the lack of success get to you. After all, there's always the next company!


The conclusion that I've come to is that VC-istan isn't actually technology, at least not in this current social-media fueled bubble. It's old-fashioned social climbing and self-promotion with a bit of technology in the back end.

Do startups actually succeed based on technical merits, or on how well they market themselves? In this social media bubble, it's the latter. I'm not going to claim that technical skill doesn't matter. I just don't think it matters as much. You can back-fill the technical stuff by hiring the right people (contrary to our overblown claim that non-technical CEOs have no hope of finding technical talent because they can't individually judge it) but if you build great technology and can't sell it, you never get off the ground.

It's exceptionalism that leads people to think that the VC ecosystem is in some way (or should be) morally superior to Wall Street, Hollywood, the fashion industry, or Madison Avenue. Sure, what we do is cerebral, but so was advertising in the Mad Man era. VC-istan isn't worse than these other industries, but it's not better. When you have a "creative" industry, there are a lot of opportunities to do great work and profit by doing so. But there are also smiling-idiot narcissists who pile in and fuck everything up because they think they're "creative"... and of course, what gives them this opportunity is that there are other idiots in power who will put them ahead of the people of substance like us because they don't know any better.

It's the expectation of meritocracy that makes us unhappy, but human organizations and ecosystems and societies all turn to shit over time no matter what so this is an unreasonable expectation.

There is a place for people like us, the virtuous soldiers who get rich slowly, building our skillset until we're just really good at a few things... but there's also a place for smiling idiots. And smiling idiots are always going to be the "cool kids", and it's the cool kids (not people of substance) who get those stupid TechCrunch articles written about their 7-Couric products. It's the expectation of fairness in human structures, which is just unreasonable at scale, that creates the unhappiness.

