On the off chance someone associated with this is reading: I'm curious about the networking stack here, specifically TCP. Is it being used? I ask because one limit I've run into in the past with large-scale workloads like this is exhausting the supply of ephemeral ports available for connections from new clients.
Did you run into this? If not I’m curious why not. And if so, how did you manage it?
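For context, the cap comes from the TCP 4-tuple: with the destination IP and port fixed, every new connection from a single source IP needs its own ephemeral source port. A minimal sketch of checking that ceiling on a Linux client (assumes the default /proc layout):

```go
// Minimal sketch (Linux-specific): concurrent outbound connections from one
// source IP to one (destination IP, destination port) pair are capped by the
// ephemeral port range, because each connection needs a unique
// (src IP, src port, dst IP, dst port) tuple.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		fmt.Println("could not read ephemeral port range:", err)
		return
	}
	fields := strings.Fields(string(data)) // typically "32768 60999"
	if len(fields) < 2 {
		return
	}
	lo, _ := strconv.Atoi(fields[0])
	hi, _ := strconv.Atoi(fields[1])
	fmt.Printf("ephemeral range %d-%d: at most ~%d concurrent connections\n", lo, hi, hi-lo+1)
	fmt.Println("to any single (destination IP, destination port) from this source IP")
}
```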
Article author here, interesting question! We didn't explicitly run into that issue.
Our setup was effectively as follows:
- AWS Lambda functions being spawned in us-east-1, from a separate AWS sub account.
- Connections were all made to the public address provisioned for MySQL protocol access to PlanetScale, on port 3306. That infrastructure also resided in us-east-1. (A minimal sketch of the client side follows this list.)
- Between the Vitess components themselves, and once inside our own network boundaries, we use gRPC to communicate.
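Here's roughly what the client side looked like, as a sketch, assuming Go (the article doesn't specify the Lambda language); host, credentials, and database name are placeholders:

```go
// A minimal sketch of a Lambda function speaking the MySQL protocol to a
// public endpoint on port 3306. Host, credentials and database name are
// placeholders, not real values.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	"github.com/aws/aws-lambda-go/lambda"
	_ "github.com/go-sql-driver/mysql" // MySQL wire-protocol driver
)

// Package-level pool: reused across invocations that land on the same warm
// execution environment, so a busy function doesn't redial for every request.
var db *sql.DB

func handler(ctx context.Context) (string, error) {
	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		return "", err
	}
	return fmt.Sprintf("got %d", one), nil
}

func main() {
	var err error
	// Placeholder DSN: a public MySQL-protocol endpoint on port 3306 over TLS.
	db, err = sql.Open("mysql", "user:pass@tcp(placeholder-host.example.com:3306)/benchmark?tls=true")
	if err != nil {
		log.Fatal(err)
	}
	lambda.Start(handler)
}
```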
Since the goal we set was to hit one million, and since we were staying just barely within the default Lambda quotas, we didn't aggressively try to push beyond that. Some members of our infrastructure team did notice what appeared to be some kind of rate limiting when running the tests multiple times consecutively. Many tests before and after succeeded with no such issues, so we attributed it to a temporary load balancer quirk, but it might be worth going back to confirm whether port exhaustion is what we actually saw.
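If we do go back and check, one quick signal on a Linux host is whether sockets pile up in TIME_WAIT toward the ephemeral ceiling. A rough sketch of that check (assumes /proc/net/tcp is readable; we didn't run this for the article):

```go
// Rough sketch of checking for ephemeral-port pressure on a Linux host:
// count TCP sockets by state from /proc/net/tcp (IPv4 only; /proc/net/tcp6
// would need the same treatment for IPv6).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		fmt.Println("cannot read /proc/net/tcp:", err)
		return
	}
	defer f.Close()

	// Hex state codes from the kernel: 01 = ESTABLISHED, 06 = TIME_WAIT.
	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 3 {
			counts[fields[3]]++
		}
	}
	fmt.Printf("ESTABLISHED: %d, TIME_WAIT: %d\n", counts["01"], counts["06"])
}
```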
Two hypotheses, one of which you can falsify easily. Perhaps Vitess is doing port concentration, i.e. dispatching requests made by multiple clients over fewer DB connections? This is quite typical to do.
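By port concentration I mean something like the pattern below; this is an illustration of the general idea using Go's database/sql pool, not Vitess's actual internals:

```go
// Illustration of the general pattern: many concurrent callers share a small,
// fixed pool of backend connections, so the number of TCP connections (and
// ephemeral ports) toward the database stays bounded regardless of how many
// clients are sending requests. Plain database/sql pooling, not Vitess code.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"sync"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN.
	db, err := sql.Open("mysql", "user:pass@tcp(db.example.internal:3306)/app")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(32) // at most 32 TCP connections to the backend
	db.SetMaxIdleConns(32) // keep them warm for reuse

	// 10,000 concurrent "clients" are multiplexed over those 32 connections.
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var one int
			if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
				log.Print(err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```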
The other is that you may have simply had a fast enough query that Little’s Law worked out for you.
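Back-of-envelope version of that, with illustrative numbers rather than anything from the article:

```go
// Little's Law: L = λ × W, i.e. items in flight = arrival rate × time in system.
// Illustrative numbers only: if each connection is held just for the duration
// of its query, the ports in use track this in-flight count, not the total
// number of clients.
package main

import "fmt"

func main() {
	arrivalRate := 1_000_000.0 // queries per second (assumed)
	queryTime := 0.005         // seconds each connection is held (assumed)
	inFlight := arrivalRate * queryTime
	fmt.Printf("connections in flight ≈ %.0f\n", inFlight) // ≈ 5000
}
```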