Defcon: Preventing overload with graceful feature degradation (2023) (micahlerner.com)
237 points by mlerner on Feb 29, 2024 | 95 comments


One of the most satisfying feature degradation steps I did with FastComments was making it so that if the DB went offline completely, the app would still function:

1. It auto restarts all workers in the cluster in "maintenance mode".

2. A "maintenance mode" message shows on the homepage.

3. The top 100 pages by comment volume will still render their comment threads, as a job on each edge node recalculates and stores this on disk periodically.

4. Logging in is disabled.

5. All db calls to the driver are stubbed out with mocks to prevent crashes.

6. Comments can still be posted and are added into an on-disk queue on each edge node.

7. When the system is back online the queue is processed (and stuff checked for spam etc like normal).

It's not perfect, but it means that in a lot of cases I can completely turn off the DB for a few minutes without panic. I haven't had to use it in over a year, though, since the DB doesn't really go down, but it's useful for upgrades.
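Roughly, the on-disk queue part looks something like this (a simplified sketch of the idea, not the actual FastComments code; the paths and names are made up):

    import json, os, time, uuid

    QUEUE_DIR = "/var/lib/comments/offline-queue"  # illustrative path

    def enqueue_comment(comment: dict) -> None:
        """While the DB is unreachable, persist the comment locally on the edge node."""
        os.makedirs(QUEUE_DIR, exist_ok=True)
        name = f"{time.time_ns()}-{uuid.uuid4().hex}.json"
        tmp_path = os.path.join(QUEUE_DIR, name + ".tmp")
        with open(tmp_path, "w") as f:
            json.dump(comment, f)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, os.path.join(QUEUE_DIR, name))  # atomic publish

    def drain_queue(save_to_db, check_spam) -> None:
        """Once the DB is back, replay queued comments oldest-first through the normal pipeline."""
        for name in sorted(os.listdir(QUEUE_DIR)):
            if name.endswith(".tmp"):
                continue
            path = os.path.join(QUEUE_DIR, name)
            with open(path) as f:
                comment = json.load(f)
            if not check_spam(comment):
                save_to_db(comment)
            os.remove(path)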

Built it on my couch during a Jurassic Park marathon :P


Sounds like you're just failing over to a custom database.


The abstraction isn't really defined or isolated that way, but kinda.


Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking.

Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.


It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them, which impacts your feature velocity, plus the ongoing maintenance.


Can you be specific about the cost of building these?

I've run into many situations where something was deemed too costly, the need was discovered later, and the team ultimately had to implement it anyway, all while hoping no one groks that it was predicted. "Nobody ever gets credit for fixing problems that never happened" (https://news.ycombinator.com/item?id=39472693) is related.


When I was in Search 15 or so years ago, there was actually a very direct cost: revenue.

The AdMixer was an "optional" response for the search page. If the ads didn't return before the search results did, the search would just not show ads, and Google wouldn't get any revenue for it. That showed the premium the Google of that day put on latency and user experience. I think we lost a few million per year to timeouts, but it was worth it for generating user loyalty, and it put a very big incentive on the ads team to keep the serving stack fast.

No idea if it's still architected like that, I kinda doubt it given recent search experiences, but I thought it was brilliant just for the sake of aligning incentives between different parts of the organization.
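The general shape of that pattern, as a sketch (definitely not the actual serving code; the names and the tiny grace period are made up):

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    pool = ThreadPoolExecutor(max_workers=16)  # long-lived pool, as in a real server

    def render_search_page(query, fetch_results, fetch_ads):
        results_future = pool.submit(fetch_results, query)
        ads_future = pool.submit(fetch_ads, query)

        results = results_future.result()  # organic results are mandatory: wait for them
        try:
            # Ads only make the page if they are ready by (roughly) the time results are.
            ads = ads_future.result(timeout=0.01)
        except TimeoutError:
            ads = []  # late ads are dropped; the page ships without that revenue
        return {"results": results, "ads": ads}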


The developer, tester and devops time required to properly implement graceful degradation could easily accumulate to hundreds of hours.

Those hours are directly expensive when your developers cost hundreds of dollars a day, and they have a material opportunity cost in that committing them to one particular project delays the delivery of other features.

Moreover, any new features would have to be made compatible with the graceful degradation pattern, creating an ongoing cost.


When you hire an engineer to build a dam, you expect them to consider piping and subsurface flows such that the foundation isn't swept out in a decade. No matter whether the engineer was already paid, retired, etc.

My point isn't that we all need to make dams that can hold up for a century. The point is that you hire an engineer because you want someone with the judgement and expertise to apply the correct amount of engineering to any given solution. Over-engineering is on the pathway to correct-sized engineering. It's the experience, discovery, and exploration required to arrive at choosing what things actually do not need to be done.

When your manager asks you, "do we really need to do that?", it's the expert who can explain why it really is necessary, and the professional who accepts "we're not going to do that" as an answer. And if they still feel it would be harmful not to do it, then that's where professional duty kicks in.


There's a lot of levels to the approach.

Just spending a few moments to consider whether queues should grow, block, or spill when adding them makes a big difference, along with choices in error handling. You can get a lot of things to gracefully degrade for free if that's a part of your decision-making process.
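For example, the moment you make a queue bounded you have to pick what happens when it fills up. A sketch of the three choices (illustrative only):

    import queue

    bounded = queue.Queue(maxsize=1000)  # picking a size is the decision that matters
    unbounded = queue.Queue()            # "grow": no limit
    dropped = 0                          # count of spilled items

    def submit(item, policy="spill"):
        global dropped
        if policy == "block":        # backpressure: producers slow to the consumer's pace
            bounded.put(item)
        elif policy == "spill":      # shed load: drop the newest work (and count it) when full
            try:
                bounded.put_nowait(item)
            except queue.Full:
                dropped += 1
        elif policy == "grow":       # hides overload until memory runs out
            unbounded.put(item)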


Could be as simple as just some feature flags with environment variables
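e.g. something like this (a minimal sketch; the flag and function names are made up):

    import os

    def flag(name: str, default: bool = True) -> bool:
        """Read a kill switch from the environment, e.g. FEATURE_RECOMMENDATIONS=off."""
        raw = os.environ.get(f"FEATURE_{name.upper()}")
        if raw is None:
            return default
        return raw.lower() not in ("off", "0", "false")

    def render_page(user, load_posts, compute_recommendations):
        """Skip the expensive path when the flag is flipped off during an incident."""
        page = {"posts": load_posts(user)}
        if flag("recommendations"):
            page["recs"] = compute_recommendations(user)
        return page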


I also found that when building a feature iteratively, with feature flags for rollout, a simple feature degradation path often appears natively.


For one, it potentially multiplies the testing and regression testing requirements to hit all those additional configurations.


Effectively every piece of software written for at most a few thousand people to use concurrently (i.e. 99.99% of software).

Consumer apps that scale to hundreds of thousands of users with five 9s+ uptime requirements are very rare.


So at my previous place we had a monolith with roughly 700 different URL handlers. Most of the problem with things like this was understanding what they all did.

Applying rate limiting, selective dropping of traffic, even just monitoring things by how much they affect the user experience, all require knowing what each one is doing. Figuring that out for one takes very little time. Figuring it out for 700 made it a project we'd never do.

The way I'd start with this is just by tagging things as I go. I'd build a lightweight way to attach a small amount of metadata to URL handlers/RPC handlers/GraphQL resolvers/whatever, and I'd decide a few facts to start with about each one – is it customer facing, is it authenticated, is it read or write, is it critical or nice to have, a few things like that. Then I'd do nothing else. That's probably a few hours of work, and would add almost no overhead.

Now when it comes to needing something like this, you've got a base of understanding of the system to start from. You can incrementally use these, you can incrementally enforce that they are correct through other analysis, but the point is that I think it's low effort as a starting point with a potentially very high payoff.
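Concretely, the tagging could be as small as this (a sketch; the field names are just the ones I'd pick):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class HandlerMeta:
        customer_facing: bool
        authenticated: bool
        writes: bool
        criticality: str  # e.g. "critical", "degraded_ok", "nice_to_have"

    HANDLER_REGISTRY = {}  # handler name -> metadata, for later analysis or enforcement

    def tagged(**facts):
        """Attach a few facts to a handler as you touch it; costs one decorator line."""
        meta = HandlerMeta(**facts)
        def wrap(handler):
            HANDLER_REGISTRY[handler.__name__] = meta
            handler.meta = meta
            return handler
        return wrap

    @tagged(customer_facing=True, authenticated=True, writes=False, criticality="nice_to_have")
    def recommendations_handler(request):
        ...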


Am I reading the second figure right? Facebook can do 130*10^6 queries/second == 130,000,000 queries/second?!


Facebook makes over 300 requests for me just loading the main logged in page while showing me exactly 1 timeline item. Hovering my mouse over that item makes another 100 requests or so. Scrolling down loads another item at the cost of over 100 requests again. It's impressive in a perverse way just how inefficient they can be while managing to make it still work, and somewhat disturbing that their ads bring in enough money to make them extremely profitable despite it.


This is the company that instead of ditching PHP created a full on PHP to C++ transpiler and then deployed their while site on that for a few years.


FB still runs on Hack.


> deployed their while site

??


Obviously "whole".


That's why it's so inefficient!


Wasn't the whole point of GraphQL to mitigate this?


Yeah, that's why you have only 100 requests* when you hover over an item instead of 800.

(* allegedly, didn't verify it myself)


No.

Here's the thing: hypermedia is cacheable. React/GraphQL, not so much.

Facebook is now just an application that runs in the browser.

As a poor, small developer who doesn't want to hemorrhage money, I tend to want things to be more hypermedia and less app. It saves on complexity and bandwidth and costs.


Do you have an ad blocker stopping the requests and causing retries?


I do have ublock origin on everything of mine, so conceivably it's reacting to that somehow. I'm no longer at my computer to be able to look more closely at what it's doing.


Whenever you're running this kind of eyeball test with DevTools, make sure the "Disable Cache" option is _not_ checked. And even then, make sure the requests you're counting were not served from your browser cache. It's possible that only the first page load sends hundreds of requests, and subsequent loads may still initiate those requests, but have their responses served from your browser cache (or your cache is disabled).

I would check myself, but I haven't logged into Facebook in many years. :)


Could someone tell me what these hundreds of requests could do?


A lot of them appear to be that they've split their javascript into a gazillion files for whatever reason (I suppose because they have several MB of it). But someone or lots of people there did seem to get addicted to dynamic loading. Like I've got 100 or so friends, but my friends page loads them 8-16 at a time as I scroll. Just send all 100 and set the profile pictures to deferred fetch. It'd probably be smaller than the js they have to make it do "infinite" scroll.

Similarly, after getting to the bottom of their "infinite" scroll, my friend feed (which is annoyingly hidden away) gives me... 15 items. Just send me all 15. It's like 1-2 kB worth of data. If you're going to end the scroll after a dozen items, why is it using infinite scroll?


Could be so they can track what you're looking at on the back end


Track you, probably with a thousand layers of redundancy, tech bloat, and decades of mold.


I can't comment on the numbers, but think of how many engineers work there and how many users Facebook, Whatsapp, Instagram have. Each engineer is adding new features and queries every day. You're going to get a lot of queries.


We’ve really wasted an incredible amount of talent-hours over the last couple of decades. Imagine if we’d worked on, like, climate change or something instead of ad platforms.


"The best minds of my generation are thinking about how to make people click ads. That sucks." - Jeff Hammerbacher (2011); early Facebook employee, and Cloudera cofounder.

https://www.theatlantic.com/technology/archive/2011/04/quote...


The waste of talent-hours is directly connected to climate change, as is the waste of network bandwidth and of the compute cycles needed to run these "social" platforms.

That all being said, as humans have free will, imagining what we "could have done" if we just _forced_ everyone to do something different is flirting with fascism.


Sure, I wouldn’t suggest forcing everyone to work on something else. We all could have been better, and the government could have tried to incentivize more productive work (they already provide incentives one way or another, after all).


I think most devs would rather work on something good too, but those jobs are rare and pay worse. That's the part that needs fixing.


Exactly. I think it’s misguided to blame Meta or Google here. People respond to incentives, and the masses want to buy stuff using e-commerce, hence it’s profitable.

I’m oversimplifying, but that’s the root of it. If people instead of buying things from ads were looking for the best ways to purchase offsets for CO2 emissions, then the best minds would be working on that problem.


Right, that's not going to happen though for pretty obvious reasons.

Governmental incentives and carbon taxes can create more jobs working on CO2, but expecting individual citizens to solve a collective action problem is doomed.

At the same time, I think it is fair to ask people to look at what they're doing for/to the world and decide whether that's something that fits their values. A 300k salary is nice, but almost certainly not something you need.


By the same token, you are asking people to introspect and unilaterally sacrifice in a way that, for the same reason, is not going to happen for pretty obvious reasons.

I’m on board with solving this at the governmental level FWIW. I support stronger anti-trust enforcement for example (and not just targeting tech).


Oh yeah, totally. I think it's easier to do so for one single, rarely made decision than for a bunch of small ones every day. From, like, a practical perspective, if you wanted to make one sacrifice, that's the one I'd recommend.

But yeah, the govt level would solve it much better.


You make it sound like everybody at Meta works in the ads department.


It is an ad company, everyone there works on ads or indirectly works on making a platform for ads.

The only exception is people who’ve managed to sneak their way into positions where they don’t contribute anything to the company. Those people are doing society a favor by wasting Facebook’s money.


I don’t think that’s true. Google took their ad money and placed a famous number of bets on moonshots, more than any company in a long time (back as far as Bell Labs? Did even they fund their research to the extent Google has?)

The whole Metaverse bet at Meta is also plausibly a 10s-of-$b bet that isn’t going to drive the ad flywheel, though I’m sure they hope it is. My impression is Zuck would be ok with just having a VR platform even if it’s generating revenue from not-ads.


Well, a lot of them do work directly on ads. Obviously not everybody, but a significant fraction. And most of the rest are just building things (products, features) to be able to show more ads.


98% of Meta's revenue is from ads. Meta is an ads department.


They all do.


found the meta employee


I think about 10 years ago when I was working there I checked the trace to load my own homepage. Just one page, just for myself, and there were 100,000 data fetches.


By "homepage" you mean your Facebook profile?


> Am I reading the second figure right? Facebook can do 130*10^6 queries/second == ‭130,000,000‬ queries/second?!

That sounds totally plausible to me.

Also keep in mind they didn't say what system this is. It's often true that 1 request to a frontend system becomes 1 each to 10 different backend services owned by different teams and then 20+ total to some database/storage layer many of them depend on. The qps at the bottom of the stack is in general a lot higher than the qps at the top, though with caching and static file requests and such this isn't a universal truth.
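A very rough back-of-envelope, where the per-user and fan-out numbers are pure guesses, just to show the order of magnitude isn't crazy:

    dau = 2e9                    # ~2 billion daily active users (public figure, rounded)
    requests_per_user_day = 50   # guess: page loads, scrolls, API calls per user per day
    fanout = 30                  # guess: backend/storage queries per frontend request

    frontend_qps = dau * requests_per_user_day / 86_400
    backend_qps = frontend_qps * fanout
    print(f"{frontend_qps:,.0f} frontend req/s -> {backend_qps:,.0f} backend queries/s")
    # ~1.2M frontend req/s -> ~35M backend queries/s with these guesses; a higher
    # request rate or fan-out gets you to 130M/s without much trouble.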


Those queries are probably mostly memcache hits, though of course with distributed cache invalidation and consistency fun


If it doesn't hit the database is it really a query?


Why wouldn't it be?


Sounds plausible. There are probably many queries required to display a page and Facebook has 2 billion daily active users.


This is how information slowly changes. The original numbers from Facebook needed to be taken with a grain of salt, and 2 billion a day raises it more.

Facebook claims to have 2 billion accounts, but nowhere near 2 billion unique accounts. I don't know what Facebook calls an active user, but it used to mean logged in once in the past 30 days.


No, the person you're responding to was correct. Facebook has over 2 billion daily active users [1], and DAU refers to unique users who used the product in a day [2].

1: https://www.statista.com/statistics/346167/facebook-global-d...

2: https://www.innertrends.com/blog/active-users-measuring-busi...


Heh. Gave me a chuckle, because DAU also means "Dümmster Anzunehmender User" in German (dumbest assumed user, in the context of creating idiot-proof software, and a wordplay on GAU, which means grösster anzunehmender Unfall, biggest assumed accident, a term that comes from fission power plants). And that kinda fits the kind of people who are perceived to be left on the likes of Facebook and X.


It's directly in the link you provided.

"For example, an active user can be measured as a user that has logged back into her account to interact with the product in the last 30 days."

Even the marketing material is designed to confuse.


That's a monthly active user. A daily active user would be someone who logged into an account in the last day. Generally monthly active user count will be higher than daily active users, but for something like Facebook the difference is about 50% (which is what the second article linked is explaining, if you read more than just cherry-picking a line that matches your preconceptions)

And yes, that's a claim that if each user is a separate person, >20% of the world's population interacts with Facebook at least minimally each day. You can add your own interpretation about how many of the accounts are bots or otherwise duplicates, but it's a staggering amount either way.


It should be their account not her account (or his). Who writes this garbage.


Alternating or stochastically varying pronouns in your examples used to be a common way to make an effort at inclusive writing, usually preferred aesthetically to constructs like `his/her'. (The style before that was basically to use masculine pronouns for hypothetical people in every single case and deny that there was anything to question about that.) I think I agree that the modern semi-standard of using `they' for examples where gender is irrelevant or unknown is strictly better, but it's hard for me to summon a lot of contempt for someone who goes with a different/older habit.


Different metrics: Daily vs. Monthly Active User (DAU vs. MAU).


What's with people lately writing 10^6 instead of 1 million? It's not that big that we need exponents to get involved.


- The comment is referring to a graph that used 10^6 on the vertical axis, which is a very common way to format graphs with large numbers (not just "lately"). It's also the default for a lot of plotting libraries.

- 10^n is more compact than million/billion/etc, more consistent, easier to translate, and doesn't suffer from regional differences (e.g. "one billion" is a different number in Britain than in the US).

I'm not saying it's clearly better than "million" in this specific case, but it's definitely not clearly worse.


Erosion of education makes basic scientific knowledge very trendy


Yes. And that was 4 years ago. I must add that the figure does NOT include the static asset serving path.


I forgot how to count that low.


A custom JIT + language + web framework + DB + queues + orchestrator + hardware built to your precise specifications + DCs all over the world go a long way ;)


We're close to 1 million servers, not 12 racks in a DC.


iirc Facebook has 3 billion users, so that sounds plausible.


Yeah, they allocated ALL of the ram to their DB servers. lol


Discussed (a tiny bit) at the time:

Defcon: Preventing Overload with Graceful Feature Degradation - https://news.ycombinator.com/item?id=36923049 - July 2023 (1 comment)


Off-topic but: I love the font on the website. At first I thought it was the classic Computer Modern font (used in LaTeX). But nope. Upon inspecting the stylesheet, it's https://edwardtufte.github.io/et-book/ which is a font designed by Dmitry Krasny, Bonnie Scranton, and Edward Tufte. The font was originally designed for his book Beautiful Evidence. But people showed interest in the font; see the bulletin board on ET's website: https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=... Initially he was reluctant to go to the trouble of releasing it digitally, but eventually he did make it available on GitHub.


Anyone interested in load shedding and graceful degradation with request prioritization should check out the Aperture OSS project.

https://github.com/fluxninja/aperture


I'm surprised they don't have automated degradation (or at least the article implies that it must be operator initiated).

We built a similar tool at Netflix but the degradations could be both manual and automatic.


There's definitely automated degradation at smaller scale ("if $random_feature's backend times out, don't show it", etc.).

The manual part of Defcon is more "holy crap, we lost a datacenter and the whole site is melting, turn stuff off to bring the load down ASAP"
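The automated, per-feature flavor is usually just a timeout plus a fallback, something like this (a sketch, not Meta's code):

    def fetch_optional_module(fetch, timeout_s=0.2, default=None):
        """If an optional backend is slow or down, its module silently disappears
        from the page instead of taking the whole response down with it."""
        try:
            return fetch(timeout=timeout_s)
        except (TimeoutError, ConnectionError):
            return default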


Isn't this referred to as load shedding in some circles? If it's not, can someone explain how it's different?


They're the same thing or close to it. "Load shedding" might be a bit more general. A couple possible nuances:

* Perhaps "graceful feature degradation" as a choice of words is a way of noting there's immediate user impact (but less than ungracefully running out of capacity). "Load shedding" could also mean something less impactful, for example some cron job that updates some internal dashboard skipping a run.

* "feature degradation" might focus on how this works at the granularity of features, where load shedding might mean something like dropping request hedges / retries, or individual servers saying they're overloaded and the request should go elsewhere.


This is the other side of the load shedding coin.

The situation is that A depends on B, but B is overloaded; if we allow B to do load shedding, we must also write A to gracefully degrade when B is not available.
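In sketch form, with hypothetical services A and B:

    def expensive_work(request):
        return {"enriched": request}  # placeholder for B's real work

    class Overloaded(Exception):
        pass

    class ServiceB:
        """B sheds load: beyond a concurrency limit it fails fast instead of queueing forever."""
        def __init__(self, max_inflight=100):
            self.max_inflight = max_inflight
            self.inflight = 0

        def handle(self, request):
            if self.inflight >= self.max_inflight:
                raise Overloaded()
            self.inflight += 1
            try:
                return expensive_work(request)
            finally:
                self.inflight -= 1

    def handle_in_a(request, b: ServiceB):
        """A degrades gracefully: when B sheds, serve the core response without B's data."""
        try:
            extra = b.handle(request)
        except Overloaded:
            extra = None
        return {"core": request, "extra": extra}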


Seems like whenever I log into FB lately it's pretty much always in a state of “graceful feature degradation”.

For example, as soon as I log in I see a bell icon in the upper right with a bright red circle containing an exact positive integer number of notifications. It practically screams “click here, you have urgent business”.

I can then leave the web page sitting there for any number of minutes, and no matter how long I wait, if I click on that notification icon it will take a good 20 seconds to load the list of new notifications. (This is on gigabit fiber in a major metro area, so not a plumbing issue.)


This is one of the great challenges of system engineering. Any slack you build into the system has a tendency to get used over time, which means that if you don't exert some human discipline, monitor your slack, and treat the fact that it is being used up as at least a medium-priority issue, your system will rapidly evolve (or devolve, if you prefer) into one that has single points of failure after all.

To give a super simple example, suppose you have a database that can transparently fail over to a backup, but it's so "transparent" that nobody even gets notified. Suppose the team even tests it and it proves to work well. The team will then believe that they are very well protected and tell all their customers and management all about how bulletproof their setup is, but if they don't notice that the primary database corrupted and permanently went down in month six because their systems just handle it so well, they'll actually be operating on a single database after all and just be one hiccup from failure.
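In code terms, the fix is to make the "transparent" part loud (a sketch with hypothetical helpers):

    def query_with_failover(sql, primary, backup, page_oncall):
        """Fail over transparently for users, but never silently for operators:
        running on the backup means the slack is gone, so treat it as an incident."""
        try:
            return primary.execute(sql)
        except ConnectionError:
            page_oncall("primary DB down; serving from backup with no remaining redundancy")
            return backup.execute(sql)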

One of the jobs of an ethical engineer is to make sure management doesn't just say "it's OK, the site is working, forget about it and work on something else" without some appropriate amount of pushback, which you can ground on the fact that sure, they're saying to ignore it now, but when the second DB goes down and the site goes down they sure won't be defending you with "oh, but I told the engineering team to ignore the alerts and keep delivering features so it's really my fault and not theirs the site went down".

At Facebook's scale, something will always be in a state of degradation. It's just a fact of life.


Have you tried navigating the website using a web proxy (Charles, Burp Suite, or similar tool) to intercept the HTTP request(s) in order to replay them yourself multiple times to see if the latency is consistent? It’d be interesting to discover that the delay is fabricated using the front-end code or if the back-end server is really the problem. I don’t use Facebook but I asked a friend just now and the response time for the notifications panel to appear is between 500ms-2000ms, which is relatively fast for web interactions.


Without being able to verify, I would assume it’s designed to behave in this way. The longer you wait the more anticipation builds up, the more gratifying it becomes.


I think there's no chance they intentionally want users to wait 20 seconds to see their latest notifications.


> it will take a good 20 seconds to load the list of new notifications.

Same thing here. Thought it was my (relatively much) slower Internet connection, or maybe that I had something "wrong" (what exactly that might have been, I don't know).


The initial render of Facebook's UI slows dramatically (I suspect but cannot prove intentionally) if you have adblockers/uBlock Origin/etc.


At least once YouTube slowed to a crawl until I cleared the cookie.


> if (disableCommentsRanking.enabled == False)

This could use some light-touch code reviewing


Because the HN crowd likes learning new things: if `enabled` is a nullable boolean in C# (i.e. has type `bool?`) then this check must indeed be written this way, to avoid confusing null with false.
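The same gotcha shows up in Python with an Optional[bool], for what it's worth (illustrative sketch):

    from typing import Optional

    class Flag:
        enabled: Optional[bool] = None  # None = not configured yet

    flag = Flag()
    print(flag.enabled == False)  # False: None is not the same as "explicitly disabled"
    print(not flag.enabled)       # True: treats "unset" the same as disabled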


I thought OP meant to imply that the readability could use some tweaking. You have 'disable', 'enabled' and 'False' used in the same expression so it requires some (more) thinking while reading it and trying to decipher what it's trying to do.


That is a more fundamental and better criticism that I'm embarrassed I overlooked.


Some could argue it would be for illustration purposes, and not actual production code.


It looks funny, but I think it's actually good, and arguably the best possible form of it.



