
I created a gist with these recommendations - certainly an improvement, but I don't think it gets around the `WindowAgg` running across all 10k rows: https://gist.github.com/abelanger5/5c1a75755072239716cb587a2.... Does this accurately implement your suggestions?

Happy to try out other suggestions, but I haven't found a way to get a window function or `JOIN LATERAL` to scale in near-constant time for this queueing strategy.
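For reference, the lateral-join shape I mean is roughly this (table and column names are simplified placeholders, not the exact schema from the gist):

    -- one row per tenant with queued work, then a small ordered
    -- batch per tenant via the lateral
    SELECT t.id
    FROM (
        SELECT DISTINCT group_key
        FROM tasks
        WHERE status = 'QUEUED'
    ) g
    JOIN LATERAL (
        SELECT id, created_at
        FROM tasks
        WHERE tasks.group_key = g.group_key
          AND tasks.status = 'QUEUED'
        ORDER BY created_at
        LIMIT 10          -- per-tenant batch size
    ) t ON true
    ORDER BY t.created_at
    LIMIT 100;            -- overall batch size

The DISTINCT over every queued row is the part that keeps this from being near-constant time as the queue grows.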



It looks like it still only pulls 100 rows out of the sort now (so moving the limit into the second CTE didn't hurt). It isn't processing all 10,000 rows anymore though, which is interesting: by any chance, do you have 9,200 distinct tenants? If so, that would explain it.

What I suggested works well when only a small number of tenants have queued work: it scales n log n with the number of tenants that have queued work, but only log n with the number of tasks a single tenant has queued. So if you're currently testing with many tenants queueing at once, try it with, say, 20 tenants where one has 9k items and the others have ~50 each. Whether the scaling is acceptable then depends on what your tenant distribution looks like in practice.
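If it helps, seeding that skewed test case is just a couple of inserts (column names here are guesses based on what's mentioned in this thread, adjust to your schema):

    -- one heavy tenant with 9k queued items
    INSERT INTO tasks (group_key, status, created_at)
    SELECT 'tenant-1', 'QUEUED', now() - (i || ' seconds')::interval
    FROM generate_series(1, 9000) AS i;

    -- 19 light tenants with ~50 queued items each
    INSERT INTO tasks (group_key, status, created_at)
    SELECT 'tenant-' || t, 'QUEUED', now() - (i || ' seconds')::interval
    FROM generate_series(2, 20) AS t,
         generate_series(1, 50) AS i;

Then compare the row counts on the WindowAgg node in EXPLAIN ANALYZE against your current many-tenants test.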

You could also probably do tricks where individual workers filter to specific tenant IDs in the first CTE (e.g. filter on `group_key mod num_workers = worker_id`) to reduce that cardinality if you expect it to be large. Or you could select, say, 100 random group_keys as a first step and use those to filter the window, but that needs another partial index on just `group_key where status = queued`.
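A sketch of both ideas, assuming `group_key` is text and 4 workers (names and numbers are made up):

    -- worker-sharded first CTE: this worker (id 2 of 4) only sees
    -- its share of tenants
    SELECT DISTINCT group_key
    FROM tasks
    WHERE status = 'QUEUED'
      AND abs(hashtext(group_key)) % 4 = 2;

    -- partial index for the "grab some random group_keys first" variant
    CREATE INDEX tasks_queued_group_key_idx
        ON tasks (group_key)
        WHERE status = 'QUEUED';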

Edit: I see it's still doing a seq scan on `tasks`. There is a PK on `id`, right? The planner knows there are only 100 rows coming out of the rest of the query, so it seems odd that it would decide to scan the whole table. You could try putting a hash index on `id` if it's refusing to use the btree index. It might also change its mind once you add something like 1M SUCCEEDED rows, since a seq scan only looks cheap while the table is small.
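E.g. (index and table names assumed; worth confirming with EXPLAIN ANALYZE that the PK is actually being skipped first):

    CREATE INDEX CONCURRENTLY tasks_id_hash_idx
        ON tasks USING hash (id);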

Another thing to consider: out of the box, Postgres's planner config is tuned for roughly 20-year-old hardware. If you haven't already, tweak the I/O costs for SSDs and tell it how much RAM you actually have. See e.g. https://pgtune.leopard.in.ua/ for better starting values.
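The usual knobs look something like this (values are rough SSD-era starting points for a 16GB box, not recommendations for your hardware; pgtune will give numbers matched to your setup):

    ALTER SYSTEM SET random_page_cost = 1.1;         -- default 4.0 assumes spinning disks
    ALTER SYSTEM SET effective_cache_size = '12GB';  -- rough: ~75% of RAM
    ALTER SYSTEM SET effective_io_concurrency = 200; -- SSDs handle concurrent reads well
    SELECT pg_reload_conf();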



