Hacker News

We had a similar problem, but it exhibited differently

We had two lumps of compute:

1) A huge render farm, at the time ~36k CPUs. The driving goal was 100% utilisation. It was a shared resource, and when someone wasn't using their share it was aggressively loaned out (both CPU and licenses). Latency wasn't an issue.

2) A much smaller VM fleet. Latency was an issue, even though both contention and utilisation were much lower.

Number two was the biggest issue. We had a number of processes that needed 100% of one CPU all the time, and they were stuttering. Even though each VM thought it was getting 100% of a core, it was in practice getting ~50% according to the hypervisor. (This was a 24-core box with only one CPU-heavy process.)

After much graphing, it turned out the cause was too many VMs on one machine, each defined with 4-8 vCPUs. Because the hypervisor won't schedule only 2 of a 4-vCPU VM's virtual CPUs (it co-schedules all of them or none), there was lots of spinning while waiting for enough free cores to run the whole VM. This meant that even though the VMs thought they were getting 100% CPU, the host was actually giving each VM ~25%.
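The effect is easy to reproduce with a toy model. The sketch below assumes strict co-scheduling (all of a VM's vCPUs run in a time slice together, or the VM doesn't run at all), which is a deliberate simplification; real hypervisors like ESXi use relaxed variants of this. The core counts and VM counts here are illustrative numbers, not our actual fleet.

```python
def share_per_vm(cores, num_vms, vcpus_per_vm, slices=120):
    """Toy strict co-scheduling (gang scheduling) simulator.

    Each time slice, walk the VMs round-robin (rotating the start
    point for fairness) and run a VM only if ALL of its vCPUs fit
    on currently idle cores at once. Returns the fraction of
    slices each VM actually got to run.
    """
    ran = [0] * num_vms
    start = 0
    for _ in range(slices):
        free = cores
        for i in range(num_vms):
            vm = (start + i) % num_vms
            if free >= vcpus_per_vm:
                free -= vcpus_per_vm
                ran[vm] += 1
        start = (start + 1) % num_vms
    return [r / slices for r in ran]

# 12 four-vCPU VMs on a 24-core host: only 6 whole VMs fit per
# slice, so each VM runs in half the slices -> a single busy
# thread inside the guest sees ~50% of a core.
print(share_per_vm(24, 12, 4))   # [0.5, 0.5, ...]

# Same busy-thread demand as twelve 1-vCPU VMs: everything fits
# every slice, every VM runs 100% of the time.
print(share_per_vm(24, 12, 1))   # [1.0, 1.0, ...]
```

The point of the toy: the guest can't see the slices it never ran in, so from inside the VM the CPU looks 100% busy while the host knows the VM was runnable-but-waiting half the time.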

The solution was more, smaller VMs. The more vCPUs you ask to be scheduled at the same time, the less able the host is to share cores between VMs.

We didn't see this on the big farm, because the only thing we constrained there was memory. The orchestrator would make sure that a job configured for 4 threads was put in a 4-thread slot, but we would configure each machine to allow 125% CPU allocation.


