The Container Throttling Problem (danluu.com)
255 points by rognjen on Dec 26, 2021 | hide | past | favorite | 87 comments


> The gains for doing this for individual large services are significant (in the case of service-1, it's [mid 7 figures per year] for the service and [low 8 figures per year] including services that are clones of it), but tuning every service by hand isn't scalable.

This point seems wrong to me, bound too much by requiring solutions to be done by a small team of engineers who already have a mandate to work on the problem.

With numbers like that Twitter could, profitably, hire dozens of engineers that do literally nothing else. Just tweak thread pool sizes all day, every day, for service after service. Even though it's a boring, manual, "noncomplex" thing, this type of work is clearly valuable and should have happened years ago.

Most likely Twitter's job ladder, promotion process, and hiring pipeline are highly incentivizing people to avoid such work even when it has clear impact. They are very much not alone in that regard.


I solved this same problem for a company also in 2019 (as the CPU quota bug hadn't been fixed yet) and it resulted in something like 8 figures of yearly cost savings.

You are correct in that most companies are not equipped to staff issues like this. Most places just accept their bills as a cost of doing business, not something that can be optimized.


A side effect of deciding systems engineers can be replaced by devops.

In reality, you want both. A good systems person can save you a ton of money.


A nice approach is to staff a DevOps team with people from diverse backgrounds: some more towards the systems side of the spectrum and some more towards the dev side, as long as everybody knows a little bit of the other side. This helps avoid a culture where devs "throw some code over the fence" and sysops people just moan that devs are careless and/or that they should do things differently, but without a clear way of showing exactly how things should be done differently (and also without a clear understanding of why devs ended up choosing what they chose).


Who has DevOps teams anymore? Don’t you know that all devs should also be ops, and you can cut your staff in half?

Seriously, some employers I’ve known seem to take exactly that approach. Just fire all the ops guys, and tell the devs that they’re now going to be doing DevOps.


Oops, I clearly didn't mean a dedicated DevOps team; I meant a team that develops and runs a product, which is what DevOps originally meant (and not the glorified developer-tools team that people often have nowadays under that name).


Devops developed and ran product? Never heard of that. What disconnected manifesto did you read?


Is it the word product rather than service that bothers you, or the fact that DevOps develops and operates?


This. I've never worked anywhere that had dedicated ongoing effort to cost reduction in compute services. It's always a once every couple of years thing to look at the cloud spending and spend a little effort dealing with the low hanging fruit.


I can see how there are diminishing returns when optimizing, but I would never say that server bills are not a metric to be aware of and address. I've always had some idea of what's practically achievable in terms of efficiency within a given architecture, and I aim for something that gets a good amount of the way there without undue effort. I also enjoy thinking of longer-term improvements for efficiency, whether they could improve latency or the bottom line, while knowing that's secondary to providing additional value and gaining customers during a growth period.


> With numbers like that Twitter could, profitably, hire dozens of engineers that do literally nothing else. Just tweak thread pool sizes all day, every day, for service after service. Even though it's a boring, manual, "noncomplex" thing, this type of work is clearly valuable and should have happened years ago.

The issue is that once you hire a dozen engineers to do this (say for $5M a year in total), and they do it for a year, they save mid 8 figures (keep in mind this was the largest service, so the savings across other services will be smaller).

Then can they keep saving mid 8 figures every year?

I'll paraphrase something I previously wrote privately, but imagine you have some team that's able to save 10% of your fleetwide resources this year. They densify and optimize and improve defaults. So you can now grow your services by 10% without any additional cost increase, and you do! The next year, they've already saved the easiest 10%. Can they save 10% again? Can they keep it up every year? How long until they're saving 3% a year, or 1% a year? And that's if you keep the team the same size, where it's clearly losing money! If you could afford a dozen people to save 10%, you can only really afford 1-2 to save 1%, but then you're likely to get an even smaller return.

Unless you expect to be able to maintain the same relative value of optimizations every year, 3 or 5 years out, it's not worth it to hire an FTE to work on them.

I should note that I've experienced this myself: I was working in an area where resource optimization could lead to "significant" savings (not 8 figures, but 6 or maybe 7). My first 6 months working in this area, I found all sorts of low hanging fruit and saved a fair amount. The second six months, the ROI was 5-10x less. By the third six months I gave up even trying; if I come across a thing, I'll fix it, but it's no longer worthwhile to look.


If you're looking in the same area/domain, what you're saying is almost certainly true.

If you're looking across the business as a whole, it seems likely that there is a lot of this kind of work lying around because there is not much incentive for people to tackle it as described in this comment: https://news.ycombinator.com/item?id=29691847.


I'm replying to that comment.

The issue is still the same: what do you do in the second year (or the third), after you've fixed all the low hanging fruit. If it takes a constant amount of time to investigate a particular service, it's not even worth examining the long tail, because the investigation is costlier than the savings. So once you've saved the high 8 figures by fixing the low hanging fruit in the biggest 20 services or whatever, what next?

The broader point is that while it is often very worthwhile to have individual employees work on optimizations, it is much less often worthwhile to task teams entirely with optimizations, especially in the way you describe. Having some group in charge of generalizable fleetwide optimizations is possibly useful. Doing some kind of resource quota thing where you get fewer resources than you're forecast to need (to force you to do some kind of local optimization) may make sense, but having a strike team whose job it is to tune JVM parameters isn't useful. A central team writing a document on best practices, setting better defaults, or using/building some kind of grid-search optimizer are all probably better investments.


The answer is you build a team that builds tools to make optimization easier for everyone in the engineering organization. One individual team can't optimize every application, but if you make it easy for every engineering team to profile their applications and debug performance problems, you've enabled every team to continuously optimize the low-hanging fruit. In effect, you're no longer solving one-off 8-figure problems, you're optimizing the time of other engineers which is something that pays off in perpetuity.

Good large engineering organizations have these teams, and understand their value. For smaller companies, it's not as clear if you can reasonably fund a team for that kind of work.


> The issue is still the same: what do you do in the second year (or the third), after you've fixed all the low hanging fruit.

Then you destaff the team, collect your tens of millions of dollars, and be happy about a successfully finished project? The problem with going too far down this line of reasoning is that you leave even low-hanging fruit lying around forever.


What you seem to be missing is that "destaffing" a team of a dozen people is really expensive and painful, and people don't generally sign on to jobs where we say "yeah you'll be doing this for two years and then you'll have to find a new role".

Like I've said twice now, you can solve this problem without stupid business practices, so you're arguing with a straw man. If you have to say "going too far down this line", you're no longer responding to the argument presented, but to a bad-faith misrepresentation. That's inappropriate on HN.


Certainly, at one place I worked, the higher-ups were very clear that any work on cost reduction was wasted and devs and ops should always work on increasing net income, not decreasing costs. It was consistently claimed that cost reduction can only get you small percentage decreases, whereas increases in income are larger and compound better.


I've had similar messages articulated to me by my manager, and have found myself articulating similar messages to my team.

In my team, the key point is that, for the state of the project/product we manage, cost optimization is likely one of the lowest-ROI activities we could spend much time on. That doesn't mean we don't tackle some clear low hanging fruit when we see it, or use low hanging fruit as training opportunities to onboard new team members, but we need to be conscious of where we make investments, and for the stage we're at, the more important investment is in areas that make our product more appealing to more customers.

I think it's easy to say that someone, like an intern, could pay for themselves with savings. But to me this overlooks that someone has to manage that intern, get them into change management, review the work, investigate mistakes or interruptions from the changes, etc. And then they're still the lowest-earning employee, since most of us aren't hired to pay for ourselves, but actually to turn a profit for the company.

So while I'm not sure I agree with the message that "increases in income are larger and compound better", I certainly understand and have pushed a similar message: that we be conscious of where we're spending our time, and that we select the highest-impact activities based on the resources and team we have. Sometimes that may be fixing high wastage, but very frequently that will be investing in the product. And I think for the stage of the product we manage, that is the best choice for us.


The big difference between cost reduction and income increase, is that one has a hard limit on possible upside, whereas the other does not. You can reduce your costs by more than your total costs, but it’s quite possible to increase your income by many multiples of your existing income.

Result is that maximising income is generally better than reducing cost. Of course, as with all generalisation, there are situations where this approach doesn’t hold true. But as a high level first order strategy, it’s a good one to adopt.


"one has a hard limit on possible upside, whereas the other does not."

That's plain wrong; the global market for cars, bicycles, and what have you has a limited size. Every large company that's a market leader understands that.


Would capturing 100% of the market, granting a monopoly, essentially grant unlimited upside? Because you could just jack the price up to absurdity? Also, wouldn't reducing cost to zero have an infinite upside as well? With basically zero cost you can produce infinite output. Getting real pedantic here, heh.


My experience has been that each additional multiple of revenue has a certain amount of effort increase that is relatively fixed. Likewise, each additional percentage of cost reduction has an effort cost that is relatively fixed.

If you do the math, then increasing your revenue by 2x doesn’t cost 2x the effort. But one additional percentage point of cost reduction might well cost 2x the effort.

Sadly, most employers I’ve known do not seem to understand this thing called math.


You meant to write "You cannot reduce your costs"


This was verified by (what should be) a famous Harvard Business School study. Quality before cost, revenue before expenses, and there is no three.


Could you share a link or give the name of the study, so that I can read it?


The higher ups need some basic economics education, it appears. Certainly you shouldn't invest everything in long term returns, but you should be open to it.

Instead, when something has a payoff of 3 years, executives get antsy in orgs that have a 2-year cycle on exec positions.


You can do similar types of work, but target speed increases instead. Getting all the batch jobs to finish faster can help developer productivity, and is even worthwhile at a new startup.


There was a startup I talked to at one point that had a service where they’d run an agent on your instances that would collect performance data and live tune kernel parameters, and they had some AI to find the best parameters for your workload. No idea how well it worked, but it seems like a potentially good application of AI.


Do you remember the name for it? Sounds really useful.


Was it Granulate?


Yes, that was it.


Log4Shell.com


Afaik, Twitter already has significant and mature infrastructure in place to run a plethora of different instances on shadowed traffic and compare settings. It is used at least by the people working on the optimised JRE.


They could hire contractors or consultants to do the job, no? That class of worker would not be concerned about promotion opportunities. For some reason they haven’t done that either.


I can't imagine the man-hours that went into creating this, and, from here on out, knowing that core contention is still an issue that isn't solved will allow me to waltz into contract jobs and save companies money, e-waste, and power costs - this causes hope, joy, something like that.

In case anyone missed it, the removal of throttling in certain circumstances saved Twitter ~$5mm/year, if I read it correctly. With a naive kernel patch. While it takes dedicated engineers decades of knowledge to know where to aim an intern, an intern still banged out a kernel scheduling patch that made what I assume is a huge difference.

Dan Luu is a gem.


Note that the intern in question was close to finishing their PhD in a related area.


"Low 8 figures" is more like $25 per year, and that's a single service. Across all services it's more.


At Netflix, we're doing a mix of what Dan calls "CPU Pinning and Isolation" (i.e., host-level scheduling controlled by user-space logic) [1] and "Oversubscription at the cluster scheduler level" (through a bunch of custom k8s controllers) to avoid placing unhappy neighbors on the same box in the first place, while oversubscribing the machines based on containers' usage patterns.

[1]: https://netflixtechblog.com/predictive-cpu-isolation-of-cont...
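For a rough idea of what user-space-controlled pinning can look like, here is a minimal sketch that restricts one cgroup to specific cores via the cgroup v2 cpuset controller. The cgroup path and core list are made-up illustrations, not Netflix's actual controller, which (per the linked post) picks cores from predicted usage and CPU topology.

    import java.nio.file.Files;
    import java.nio.file.Path;

    public class PinWorkload {
        public static void main(String[] args) throws Exception {
            // Hypothetical cgroup for one container; requires the cpuset
            // controller to be enabled for this subtree.
            Path group = Path.of("/sys/fs/cgroup/workloads/service-1");
            // Restrict this container to cores 0-3. A real controller would also
            // keep other containers off those cores (via their own cpusets),
            // chosen based on cache/SMT topology and observed usage.
            Files.writeString(group.resolve("cpuset.cpus"), "0-3");
        }
    }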


That's a really terrific article, thanks for sharing. I wonder if Linux will eventually tie the CPU scheduler together with the cgroup cpu affinity functionality, and some awareness of cores, smt, shared cache, etc. Seems a shame that you have to tie all that together yourself, including a solver.


The article mentions “nice values”. What does that mean? Underutilization/under-provisioning?

[p.s. thanks for the replies]


“nice” in Unix is a way to lower the priority of a process, so that others are more likely to be scheduled.

Eg https://man7.org/linux/man-pages/man2/nice.2.html


It’s the kind of value used by things like `nice(2)`: https://linux.die.net/man/2/nice

In short, an offset from the base process priority.
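For illustration, a minimal sketch of applying a nice value from the JVM (assuming a Unix host with the `nice` command on the PATH; the command and niceness value are just examples):

    public class NiceExample {
        public static void main(String[] args) throws Exception {
            // Start a batch job at niceness 10 so the scheduler prefers
            // latency-sensitive processes on the same host when CPU is contended.
            Process p = new ProcessBuilder("nice", "-n", "10", "gzip", "-9", "access.log")
                    .inheritIO()
                    .start();
            System.exit(p.waitFor());
        }
    }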


We had a similar problem, but it exhibited differently

We had two lumps of compute:

1) A huge render farm; at the time it was 36k CPUs. The driving goal was 100% utilisation. It was a shared resource, and when someone wasn't using their share it was aggressively loaned out (both CPU and licenses). Latency wasn't an issue.

2) A much smaller VM fleet. Latency was an issue, even though the contention was much lower, as was the utilisation.

Number two was the biggest issue. We had a number of processes that needed 100% of one CPU all the time, and they were stuttering. Even though the VM thought they were getting 100% of a core, they were in practice getting ~50% according to the hypervisor (this was a 24-core box, with only one CPU-heavy process).

After much graphing, it turned out that it was because we had too many VMs on a machine, each defined with 4-8 CPUs. Because the hypervisor won't allocate only 2 CPUs to a 4-CPU VM, there was lots of spin-locking while waiting for space to schedule the VM. This meant that even though the VMs thought they were getting 100% CPU, the host was actually giving the VM 25%.

The solution was to have more, smaller machines. The more threads you ask to be scheduled at the same time, the less able the host is to share.

We didn't see this on the big farm, because the only thing we constrained was memory. The orchestrator would make sure that a thing configured for 4 threads was put in a 4-thread slot, but we would configure each machine to have 125% of its CPU allocated.


I have been running k8s clusters at utilizations far beyond 50% (up to 90% during incidents). These were web services/microservices, so tail latencies were important.

The way we solved this:
1. Kernel settings. Check e.g. the settings of the Ubuntu low-latency kernel.
2. CFS tuning. Short timeslices. There is good documentation on how to do that.
3. CPU pressure. We cordoned and load-shedded overloaded nodes (k8s-pressurecooker).

By limiting the maximum CPU pressure to 20% you can say "every service will get all the CPU it needs at least 80% of the time on most nodes". This is what you want: a low chance of seeing CPU exhaustion, which is needed for predictable and stable tail latencies.

There are a few more knobs. E.g. scale services such that you request at least one core, as requests effectively become limits under congestion and you can't get half a core continuously.

Very nice to see that people go public about this. We need to drop the footprint of services. It is straight up wasted money and CO2.
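As a rough sketch of the pressure-based cordoning idea above (assuming a kernel with PSI enabled; the 20% threshold mirrors the comment, and the actual k8s-pressurecooker logic may differ):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CpuPressureCheck {
        public static void main(String[] args) throws Exception {
            // /proc/pressure/cpu starts with a line like:
            // some avg10=1.23 avg60=0.87 avg300=0.50 total=12345678
            String psi = Files.readString(Path.of("/proc/pressure/cpu"));
            Matcher m = Pattern.compile("some avg10=([0-9.]+)").matcher(psi);
            if (!m.find()) throw new IllegalStateException("unexpected PSI format");
            double avg10 = Double.parseDouble(m.group(1));
            if (avg10 > 20.0) {
                // In the setup described above, this is where the node would be
                // cordoned / load-shedded; here we only report the decision.
                System.out.println("CPU pressure " + avg10 + "% over budget: shed load");
            } else {
                System.out.println("CPU pressure " + avg10 + "%: ok");
            }
        }
    }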


Quite an interesting problem. It is indeed a contradiction to make a service use all the CPUs on a system and, at the same time, have an upper limit on how much CPU it can use.

The thread pool size negotiation seems a necessary fix - applications shouldn't be pre-calculating their pool sizes on their own anyway. But you get additional (smaller) problems, like giving more or fewer threads to a service depending on its priority.

One of the big problems here, as I understand it, is trying to use a resource whose "size" changes dynamically (max CPU usage on a cgroup, which can change depending on whether another prioritised service is currently running or not) together with a fixed-size resource (the number of threads when a service starts).

As the number of cores per CPU grows, I wonder if this whole approach of scheduling tasks based on their CPU "usage" makes any sense. At some point, the basic scheduling unit should be one core, and tasks should be assigned a number of core units on the system for a given time.
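To make the fixed-pool-vs-dynamic-quota mismatch concrete, here is a minimal sketch that sizes a worker pool from the cgroup v2 CPU quota instead of the host's core count. The path and rounding policy are assumptions, and note that container-aware JVMs already fold the quota into availableProcessors(); this only illustrates the idea.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public final class QuotaAwarePool {
        // cgroup v2 cpu.max contains "<quota_us> <period_us>" or "max <period_us>".
        static int workerCount() {
            try {
                String[] parts = Files.readString(Path.of("/sys/fs/cgroup/cpu.max"))
                                      .trim().split("\\s+");
                if (!"max".equals(parts[0])) {
                    long quota = Long.parseLong(parts[0]);
                    long period = Long.parseLong(parts[1]);
                    // Round down: don't keep more threads runnable than the quota
                    // can actually schedule per period.
                    return Math.max(1, (int) (quota / period));
                }
            } catch (Exception e) {
                // No quota visible; fall back to the host view below.
            }
            return Runtime.getRuntime().availableProcessors();
        }

        public static void main(String[] args) {
            int workers = workerCount();
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            System.out.println("sized pool to " + workers + " workers");
            pool.shutdown();
        }
    }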


I think this problem would have been debugged and solved much quicker if they'd done a CPU scheduling trace. Then they could see, microsecond by microsecond, exactly which processes were doing which work, and which incoming requests were still waiting.

Then, let a human go in and say "How come request#77 hasn't yet been processed at this point, even though CPU#3 is working on printing unused debug data for a low priority request and #77 is well after its deadline!??".

Then you debug deeper and deeper, adjusting parameters and patching algorithms till you can get a CPU trace that a human can look at and think "yeah, I couldn't adjust this schedule by hand to get this work done better".

In this process, most people/teams will find at least 10x performance gains if they've never done it before, and usually still 2x if you limit changes to one layer of the stack (e.g. "I'm just tweaking the application code; we won't touch the runtime, VM, OS, or hypervisor parameters").


That covers almost nothing in the article. It's a long article, so maybe you could quote the bit you're responding to.

A CPU scheduling trace wouldn't easily show you the details of the kernel-level group throttling that was causing a lot of issues, for example. They weren't having an issue with threads fighting other threads, they were having an issue with threads being penalised now for activity from several seconds ago, drastically reducing the amount of available CPU.

The article clearly shows a lot of debugging and diagnostic patching ability, so it's unlikely they missed the simple options. Rather, they probably didn't mention them because they were obvious to try and didn't help.


> threads being penalised now for activity from several seconds ago,

Exactly... They would have found this out much quicker with a trace. They would have seen "how come this application-level request is being handled on thread number X, yet that thread is not running on any core, and many cores are idle?" Then they could quickly see the reason that thread isn't scheduled by enabling extra tracing detail and looking at the internal data structures used by the scheduler to see why something is schedulable or not at that instant.


I think you're suffering hindsight bias, here. A trace is rarely as clear as that, and it's hard to see the details it's not designed to expose.

Your original message would probably be better received if you'd omitted the "I think this problem would have been debugged and solved much quicker [...]" and its insulting implications and instead started with "Sometimes, I find that CPU activity traces can really help with diagnosing this sort of problem".


Please stop advocating for politeness over correctness. Sure, hindsight helps, but regardless, a company such as Twitter should have experts at tracing with tools and knowledge that go beyond the average developer's knowledge of tracing methodologies. Excusing that is an appeal to a lowering of technical excellence worldwide, which is majorly important and matters more than hypothetical feelings.


> a company such as Twitter should have experts at tracing

In a big company, getting the person with the most skills to solve a problem to be the one actually tasked with solving the problem is very hard. This particular problem had many avenues to find a solution - and while I think my proposed route would have been quicker, if you aren't aware of those tools or techniques, then other avenues might be much quicker. When starting an investigation like this, you don't know where you're going to end up either - if it turned out that the performance cliff was caused by CPU thermal throttling, it would be hard to see in a scheduling trace - everything would just seem universally slow all of a sudden.


On Windows, we have the xperf and WPA toolset that makes looking at holistic scheduling performance, including processor power management and device I/O, tractable. Even then, the skillset to analyze an issue like the one presented here takes months to acquire, and only a few engineers can do it. We have dedicated teams to do this performance analysis work, and they're always in high demand.


I completely agree. KUTrace would have been ideal for this and indeed KUTrace was developed to diagnose this exact problem.


What tools would you use to start going down this route? I'm completely unfamiliar but would like to learn more.


I don't know why the negative reaction. I've done the kind of analysis you've described many times and essentially been able to quickly identify such issues over the years. We had a similar problem in Windows when we first implemented the Dynamic Fair Share thread scheduler. It took a couple months to have the right tooling to do a proper scheduler trace, but with that available the problem was better understood in a week. I eventually rewrote the scheduler component and added a control law to give better burstable behavior than the hard cap quota that this article seems to be describing.


I have to wonder why the authors skipped the potential solution of removing containers and mesos from the equation entirely.

If you gave this service a dedicated, non-co-located fleet, running the JVM directly on the OS, and ran basic autoscaling of the number of hosts, you'd eliminate a huge number of the moving parts of the system that are causing these issues.

Yes, that would add to ops costs (edit: human ops costs) for this service, but when you're spending 8 figures per year in it, clearly the budget is available.

To quote the great philosopher Avril Lavigne: "Why'd you have to go and make things so complicated?"


It's not that it was made complicated; it's trading one type of complexity for another. I think you're underestimating the costs of having a one-off team run their own service and hardware. There is also an opportunity cost for those people spending time running hardware, and for the support teams involved in a unique service. They could save a couple million dollars, or they could work on projects that enable much more growth. Twitter has $1.2B in revenue in a quarter.


Isn't the problem then that each host would be underutilized on average by a lot? It has X CPUs and the service can never use more than X CPUs. If a service has any spiky loads then it'd need to be overprovisioned on CPU to handle them at good latency.

That seems significantly more expensive at scale.


Because then you have a snowflake service with a non-standard environment and still haven't solved the problem for all the other services that are still on Mesos.


I suspect it's the temptation of oversubscription. If service A and service B each use 50% of a server, it's so tempting to put them both on one server to maximize efficiency. Even if sometimes you need 4 servers running A and B to serve the load that can be managed with one server each of A and B.

Or if you've broken things up into small pieces that aren't big enough to use a whole server, that can feel inefficient as well.


> that would add to ops costs for this service,

Wouldn't fewer moving parts mean lower operational costs?


Only to the extent that cost is a function of complexity. This isn't always the case. In a case like this, going to bare metal likely brings with it significant drawbacks in organizational complexity, orchestrational complexity, and more while allowing for much better utilization of memory and cpu resources.

Telling someone whose car is making some funny noises that it's simpler to go back to horse-and-buggy times would both increase costs and decrease the number of user-servicable moving parts. There's some significant overhead attached.


Bare metal has nothing to do with this. It isn't even touched upon in the article. It discusses a scheduler, and the parent post suggests exempting these kinds of jobs from the scheduler in question, since they obviously aren't a very good fit for that product.

Should you wish to really stretch that car analogy, maybe a bit more appropriate than a horse would be: if you aren't happy that your travel agency isn't booking your taxi trips in time, try booking with the taxi company directly.


Yes and no.

It would lower the operational costs of the hardware, hopefully (that's the entire goal of this article), but you'd need more people to manage it, I would guess. Mesos and containers automate a lot of thinking work.


Once you move to hosts dedicated to specific services, as seems to be the suggestion here, you also might increase the overall hardware cost across your set of services. The cost per some of the services might decrease, though.


I realise that Twitter is using Mesos, but for those of us on Kubernetes, does guaranteed QoS solve this? https://kubernetes.io/docs/tasks/configure-pod-container/qua...


If you also use the CPU Manager feature and request an integer number of cores, yes. Then for example if you request 3 cores your process will be pinned onto 3 specific cores and nothing else will be scheduled onto those cores, and CFS will not throttle your process.

https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...


QoS classes are only used "to make decisions about scheduling and evicting Pods." It still uses the Completely Fair Scheduler, which is where the problem came from (as far as I understand).



I think there is another solution, not discussed in the article, which lies between CPU isolation and pinning: virtualizing the container’s /proc so as not to let it think the number of available (logical) processors is larger than a certain limit set by the cluster operator, a limit which is actually lower than the physical capacity of a server (so as to allow overbooking and increase their ‘redacted’ savings in $M). This is basically presenting a container/application with a number of vCPUs that it can use in any way it sees fit, but with all the (invisible) control group (quota) limits (i.e., “throttling”) the author discusses in the text, and it avoids the application spawning so many threads that it inevitably overloads the physical server and destroys tail latency.

This is at the kernel level, as opposed to paravirtualization. And I guess this is Twitter’s use case, but it should not be confused with the typical vCPU offers one sees from most cloud providers, which are usually implemented through hypervisors such as QEMU/KVM, VMware, or Xen.

I’m not sure why Mesos (maybe this one tried and didn’t succeed), K8s (available through external Intel code), or even Docker never really thought about that, but I guess they want to keep their internal (operational) overheads up to a limit, and possibly also to maintain the metastability of their services [1]. But now we see where it leads, with all these redacted numbers in the article.

[1] https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

Ps: edits for clarifications.


I wonder if k8s' bin-packing features would help here.

The graphs seem to validate my general assumption that large-load tasks just suck at scaling, whereas small-load tasks can be horizontally scaled more easily without falling over. The general assumption being that, for most applications, if you ignore everything else about an operation and assume a somewhat random distribution of load, smaller-load services use up more of the available resources on average than a single large-load service. That's just been an assumption in my head; I can't remember any data to back it up.

Back in the day when I worked on a large-traffic internet site, we tried jiggering with the scheduler and other kernel tweaks, and in the end we literally just routed certain kinds of requests to certain machines and said "these requests need loads of cache" (memcache, local disk, nfs cache, etc) and "these requests need fast io" and "these requests need a ton of cpu". It was "dumb" engineering but it worked out.


This article is quite old - the kernel patch has been available for a while now, I believe, and CMK is no longer in beta (the article references K8s 1.8 and 1.10, but the current latest version is 1.23).


There are updates from this month at the bottom!


I remember working in Java where we'd have huge threadpools that sat idle 90% of the time.

It feels like you can eliminate most of this problem in other languages by using a much smaller pool and then leveraging userland concurrency/scheduling. You probably don't want to have N + K threads on N cores, but in some languages you don't have much choice. Java has options for userland concurrency but they're pretty ugly and I don't think you'll find a lot of integration.

Containers make this a bit harder, and the Linux kernel sounds like it had a pretty silly default behavior, but how much of this is also just Java?
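As a hedged illustration of the "smaller pool plus userland composition" idea (not a claim about how Twitter's services are written; the backend call is a placeholder):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SmallPoolService {
        public static void main(String[] args) throws Exception {
            // Keep the pool close to the CPUs we can actually use, instead of a
            // few hundred mostly-idle threads, and compose work asynchronously so
            // one request doesn't need a dedicated thread for its whole lifetime.
            ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            CompletableFuture<String> reply =
                CompletableFuture.supplyAsync(SmallPoolService::callBackend, pool)
                                 .thenApply(body -> "rendered: " + body);

            System.out.println(reply.join());
            pool.shutdown();
        }

        // Placeholder; a real service would use a non-blocking client here so the
        // worker thread isn't parked on I/O.
        private static String callBackend() {
            return "payload";
        }
    }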


I don’t think blaming the JVM is productive, but identifying “JVM” as a proxy for “language runtime designed to optimize multi processor machines” is the core element here.

One could imagine a VM or runtime that is async and multiprocess and also enforces quotas on cycles and heap, such that these types of “noisy neighbor” events aren’t a problem.

In this direction there have been solutions that haven’t caught on; a multi-tenant JVM existed at one time, and at least one JS implementation has this ability. I’ve often thought Lua would be ideal for this.


> identifying “JVM” as a proxy for “language runtime designed to optimize multi processor machines”

That'd be the wrong thing to pick as the proxy destination. Go is also a “language runtime designed to optimize multi processor machines” and, as the article explains, doesn't trigger this the same way.

Try something like "total number of threads in all the pools used by the application overwhelms the CPU".


Yeah to be clear I'm not saying all fault lies with the JVM here. But a lack of concurrency primitives exacerbates the problem by encouraging very large threadpools.


I haven't used it in anger, but it looks to me like the C# async compiler and library support helps reduce the need for large threadpools.

But it also looks like the GC was a major contributor, so that would not be as influenced by the differences between dotnet and Java.


Hi, I recently found similar behavior in an app for our company. A simple threaded CPU benchmark shows:

  % numactl -C 0,5 ./ssp 12
  elapsed time: 99943 ms

  cpu.cfs_quota_us = 200000, cpu.cfs_period_us = 100000
  % cgexec -g cpu:cgtestq ./ssp 12
  elapsed time: 420888 ms

  cpu.cfs_quota_us = 2000, cpu.cfs_period_us = 1000
  % cgexec -g cpu:cgtestqx ./ssp 12
  elapsed time: 168104 ms

Also interesting: in our app some RR thread priorities are used, and those do not get controlled via the cgroup cpu.cfs settings.


Dave Chiluk did a great talk covering a similar scheduler throttling problem.

https://m.youtube.com/watch?v=UE7QX98-kO0


CFS quotas have been broken for a long time, with processes being throttled well before they've used their full quota. I think every serious user of k8s discovers this the hard way. Recent changes have been made to improve the scheduler for quotas, but I’m surprised Twitter was using them at all in 2019. Java GC also suffers badly with quotas. Pinning CPUs is probably the best compromise; otherwise just use CPU requests with no limits.


As a newbie developer who hasn't dug into this stuff before but found this post fascinating: does anybody have any good pointers, like books/articles/videos, to learn about low-level details like this?


Computer Systems: A Programmer’s Perspective.

Operating Systems: Three Easy Pieces.

Most important parts of my undergrad. Much more so than Algorithms or anything mathematical.


The way the issue is presented, it sounds to me like context switching should be one of the major considerations, especially when talking about CPU pinning. Yet it’s barely mentioned in passing. How come?


Self-teergrubing by CPU quotas.

Wonder what mechanism could be used to communicate the available timeslice length so that the app/thread could stop taking on a request when throttling is imminent.
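One possible mechanism, sketched below under assumptions: poll the container's cgroup v2 cpu.stat and start rejecting new work once the fraction of recently throttled CFS periods climbs. The path and the 50% threshold are illustrative, not a recommendation from the article.

    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class ThrottleSignal {
        private long lastPeriods, lastThrottled;

        // Returns true when more than half of the CFS periods since the last call
        // were throttled, i.e. the quota is currently the bottleneck. The first
        // call compares against zero, so it effectively uses cumulative counts.
        boolean shouldShedLoad() throws Exception {
            long periods = 0, throttled = 0;
            for (String line : Files.readAllLines(Path.of("/sys/fs/cgroup/cpu.stat"))) {
                String[] kv = line.split(" ");
                if (kv[0].equals("nr_periods"))   periods = Long.parseLong(kv[1]);
                if (kv[0].equals("nr_throttled")) throttled = Long.parseLong(kv[1]);
            }
            long dp = periods - lastPeriods, dt = throttled - lastThrottled;
            lastPeriods = periods;
            lastThrottled = throttled;
            return dp > 0 && (double) dt / dp > 0.5;
        }
    }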


Tl;dr, which is too bad, because normally danluu's stuff is great.

From the bit I had the patience to read, it sounds like "we made a complicated thing and it's doing complicated things wrong in complicated ways".

It is hard to believe that some of these CPU-heavy, latency-sensitive servers should really be in containers. Why are they not on dedicated machines? KISS.


Linux is optimized for desktops and shared servers. When you own the entire machine and want to use it fully, that optimization gets in your way.



