We've experienced a plethora of platform issues that were exacerbated by how BEAM consumes resources on the system. Here are a few that come to mind:
- CPU utilization can differ wildly between Haswell and Skylake. On Skylake processors our CPU utilization jumped by 20-30% because Skylake uses more cycles to spin. Luckily, all of that extra CPU time was spent spinning, and our actual "scheduler utilization" metric remained roughly the same (on Skylake it was actually lower!).
- Default allocator settings can call malloc/mmap a lot and are sensitive to latency on those calls. Under host memory bandwidth pressure, BEAM can grind to a halt. Tuning BEAM's allocator settings is imperative to avoid this; MHlmbcs, MHsbct, MHsmbcs and MMscs in particular (a sketch follows this list). This was especially noticeable after Meltdown was patched.
- Live host migrations and BEAM sometimes are not friends. Two weeks ago, we discovered a defect in GCP live migrations that would cause an 80-90% performance degradation on one of our services during the source-brownout migration phase.
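For illustration, here's roughly what passing those erts_alloc settings looks like; the values below are placeholders to show the shape of the invocation, not our production numbers:

```
# Illustrative values only — tune against your own workload.
# +MMscs:  mseg super carrier size, in MB (pre-reserves one large mapping)
# +MHsbct: eheap singleblock carrier threshold, in KB
# +MHsmbcs / +MHlmbcs: smallest / largest eheap multiblock carrier size, in KB
erl +MMscs 4096 +MHsbct 2048 +MHsmbcs 1024 +MHlmbcs 10240
```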
GCP Support/Engineering has been excellent in helping us with these issues and taking our reports seriously.
> Live host migrations and BEAM sometimes are not friends. Two weeks ago, we discovered a defect in GCP live migrations that would cause an 80-90% performance degradation on one of our services during the source-brownout migration phase.
I thought that GCP live migrations were completely transparent to the kernel and the processes running in the VM. I'd be happy to read a bit more about the defect that made BEAM unhappy.
> Default allocator settings can call malloc/mmap a lot and are sensitive to latency on those calls. Under host memory bandwidth pressure, BEAM can grind to a halt. Tuning BEAM's allocator settings is imperative to avoid this; MHlmbcs, MHsbct, MHsmbcs and MMscs in particular. This was especially noticeable after Meltdown was patched.
Excessive allocations and memory bandwidth are two very different things. Often they don't overlap, because to max out memory bandwidth you have to write a fairly optimized program.
Also, are the allocations caused by BEAM itself, or by the workload you are running allocating a lot of memory?
BEAM's default allocator settings will allocate often; it's just how the VM works. The Erlang allocation framework (http://erlang.org/doc/man/erts_alloc.html) is a complicated beast.
We were able to simulate this failure condition synthetically by inducing memory bandwidth pressure on the guest VM.
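A minimal sketch of the kind of synthetic memory-bandwidth hog that can reproduce this; the names and sizes here are illustrative assumptions, not our actual tooling. Each thread streams stores through a buffer far larger than the last-level cache, so nearly every access goes to DRAM:

```c
/* Build: cc -O2 -pthread membw.c -o membw */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS  8                /* roughly one per physical core */
#define BUF_BYTES (256UL << 20)    /* 256 MiB per thread, >> LLC size */

static void *hog(void *arg)
{
    (void)arg;
    char *buf = malloc(BUF_BYTES);
    if (buf == NULL)
        return NULL;
    for (;;) {
        /* Stream writes across the whole buffer; with a working set this
         * large, virtually every store misses cache and hits DRAM. */
        memset(buf, 0xa5, BUF_BYTES);
        /* Compiler barrier so the stores are not optimized away. */
        __asm__ __volatile__("" : : "r"(buf) : "memory");
    }
}

int main(void)
{
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, hog, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL); /* hogs never exit; stop with Ctrl-C */
    return 0;
}
```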
We noticed that during certain periods of time, not caused by any workload we ran, the time spent doing malloc/mmap would increase 10-20x, but the number of calls would not.
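For reference, something like strace's summary mode is roughly how numbers of this shape can be gathered; `<pid>` is a placeholder, and since malloc itself is a library call, only the mmap/brk/munmap traffic behind it shows up here:

```
# Attach to a running BEAM node and summarize memory-management syscalls;
# detach with Ctrl-C to print the table of call counts and cumulative time.
strace -c -f -e trace=%memory -p <pid>
```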
> We noticed that during certain periods of time, not caused by any workload we ran, the time spent doing malloc/mmap would increase 10-20x, but the number of calls would not.
Please do!