> It's not entirely clear but this isn't an issue in multi instance upstream Postgres clusters?
No, it isn't an issue with single-instance PostgreSQL clusters. Multi-instance PostgreSQL clusters (single primary, plus streaming/physical replicas) are affected.
What they -too- discovered is that PostgreSQL currently doesn't have consistent snapshot behaviour between the primary and replicas. Presumably, read-only transaction T2 was executed on a secondary (replica) node, while T1, T3, and T4 (all modifying transactions) were executed on the primary.
Some background:
Snapshots on secondary PostgreSQL nodes rely on transaction persistence order (the location of the commit record in the WAL) to determine which transactions are visible, while visibility order on the primary is determined by when the backend that executed the transaction first learned that the transaction was completely committed (and then got around to marking the transaction as committed).
On each of these (primary and secondary) the commit order is consistent across backends that connect to that system, but the commit order may be somewhat different between the primary and the secondary.
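To make the contrast concrete, here's a minimal sketch (my own illustration, not PostgreSQL internals) of how the two ordering rules can disagree for the same pair of transactions:

```python
# Hypothetical sketch: primary and secondary disagreeing on visibility
# order. Names and structures are illustrative, not PostgreSQL code.

# Suppose two transactions commit concurrently. Their commit records
# land in the WAL in this order:
wal_order = ["T3", "T1"]  # T3's commit record was inserted first

# But on the primary, the backends marked themselves committed (under
# ProcArrayLock) in the opposite order, e.g. because T3 waited longer
# for durability:
primary_visibility_order = ["T1", "T3"]

def visible(order, committed_prefix):
    """Transactions visible after the first `committed_prefix` commits
    in the given ordering."""
    return set(order[:committed_prefix])

# A snapshot taken between the two commit events sees different
# transactions depending on which node it was taken on:
print(visible(primary_visibility_order, 1))  # {'T1'} on the primary
print(visible(wal_order, 1))                 # {'T3'} on a replica
```

Both orderings are internally consistent; they just aren't the same ordering, which is exactly the anomaly observed.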
There is some work ongoing to improve this, but that's still very much WIP.
Thank you matashii--this would definitely explain it. I've also received another email suggesting this anomaly is due to the difference in commit/visibility order between primary and secondary. Is there by chance a writeup of this available anywhere that I can link to? It looks like https://postgrespro.com/list/thread-id/1827129 miiight be related, but I'm not certain. If so, I'd like to update the report.
My email is aphyr@jepsen.io, if you'd like to drop me a line. :-)
That thread is indeed about the same issue. I don't think anyone has done a more concise writeup on it.
Core of the issue is that on the primary, commit inserts a WAL record, waits for durability (local and/or replicated), and then grabs a lock (ProcArrayLock) to mark itself as no longer running. Taking a snapshot takes that same lock and builds a list of running transactions. Because the durability wait sits between those two steps, WAL insertion and marking the transaction visible can happen in different orders for concurrent transactions. This causes an issue on the secondary, which has no knowledge of the primary's apparent visibility order; visibility order on the secondary is therefore based strictly on the order of commit records in the WAL.
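The reordering window can be modeled with a toy simulation (my construction; the locks and delays stand in for the real machinery). Each committer inserts a WAL record, waits a per-transaction durability delay, then marks itself visible under a shared lock:

```python
# Illustrative model of the commit sequence described above, showing
# where reordering can occur. Not PostgreSQL source; ProcArrayLock is
# represented by an ordinary mutex.
import threading
import time

wal = []       # commit records, in insertion order (replica's view)
visible = []   # transactions marked committed, in marking order (primary's view)
wal_lock = threading.Lock()
proc_array_lock = threading.Lock()   # stand-in for ProcArrayLock

def commit(xid, durability_delay):
    # 1. Insert the commit record into the WAL.
    with wal_lock:
        wal.append(xid)
    # 2. Wait for durability (local fsync and/or replication). The
    #    delay differs per transaction, so step 3 can reorder.
    time.sleep(durability_delay)
    # 3. Take ProcArrayLock and mark the transaction as committed.
    with proc_array_lock:
        visible.append(xid)

t1 = threading.Thread(target=commit, args=("T1", 0.05))
t2 = threading.Thread(target=commit, args=("T2", 0.0))
t1.start(); time.sleep(0.01); t2.start()
t1.join(); t2.join()

print(wal)      # ['T1', 'T2']  -- order a replica would apply
print(visible)  # ['T2', 'T1']  -- order the primary makes them visible
```

T1's commit record precedes T2's in the WAL, but T2 becomes visible first on the primary because its durability wait was shorter.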
The obvious fix would be to make visibility happen in WAL order on the primary too. However, there is one feature that makes that complicated: clients can change the desired durability on a transaction-by-transaction basis. The settings range from confirming the transaction immediately after its commit record is inserted into the WAL stream, through waiting for local durability, all the way up to waiting for it to be visible on synchronous replicas. If visibility happens in WAL order, then an async transaction either has to wait on every higher-durability transaction that comes before it in the WAL stream, or give up on read-your-writes. That's basically where the discussion got stuck, without reaching consensus on which breakage to accept. This same problem is also the main blocker for adopting a logical (or physical) clock based snapshot mechanism.
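The cost of strict WAL-order visibility can be sketched as follows (again my own illustration; the numbers are made-up durability waits, not measurements):

```python
# Sketch of the dilemma: if visibility must follow WAL order, an
# async-commit transaction behind a sync-commit one in the WAL has to
# wait for the slower transaction's durability before becoming visible.

# Each entry: (xid, durability_wait_seconds). Per-transaction
# durability levels mean these waits can differ wildly.
wal_stream = [
    ("T_sync", 0.200),   # waits for synchronous replica confirmation
    ("T_async", 0.000),  # wants to be confirmed/visible immediately
]

def wal_order_visibility(stream):
    """Make transactions visible strictly in WAL order: each becomes
    visible only once every earlier entry (and itself) is durable.
    Durability waits are assumed to overlap, so the bound is the max
    wait seen so far, not the sum."""
    slowest_so_far = 0.0
    becomes_visible = {}
    for xid, wait in stream:
        slowest_so_far = max(slowest_so_far, wait)
        becomes_visible[xid] = slowest_so_far
    return becomes_visible

print(wal_order_visibility(wal_stream))
# T_async cannot become visible before ~0.2 s even though its own
# durability wait is zero: it inherits the latency of the sync
# transaction ahead of it in the WAL, or must give up read-your-writes.
```

This is the trade-off the thread got stuck on: either the async transaction inherits the sync transaction's latency, or its own writer can't immediately read it back.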
By now I'm partial to the option of giving up on read-your-writes, with an opt-in setting to see non-durable transactions as an escape hatch for backwards compatibility. Re-purposing the SQL READ UNCOMMITTED isolation level for this sounds appealing, but I haven't checked whether there is some language in the standard that would make that a bad idea.
A somewhat related idea is Eventual Durability, where write transactions become visible before they are durable, but read transactions wait for all observed transactions to become durable before committing.
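A minimal sketch of that rule (my own rendering of the idea, with invented names): a reader tracks the newest transaction it observed and may only acknowledge its own commit once durability has caught up to that point.

```python
# Sketch of the Eventual Durability idea: writes are visible before
# they are durable; a read transaction must wait for everything it
# observed to become durable before its own commit is acknowledged.
# Transaction IDs are assumed to be assigned in visibility order.

def reader_must_wait(newest_observed_xid, durable_upto_xid):
    """True if the reader observed a transaction that is visible but
    not yet durable, and so must wait before committing."""
    return newest_observed_xid > durable_upto_xid

durable_upto = 7   # durability has caught up through transaction 7

print(reader_must_wait(9, durable_upto))  # True: saw a non-durable txn
print(reader_must_wait(5, durable_upto))  # False: everything seen is durable
```

The appeal is that writers keep low-latency visibility while readers, not writers, pay the durability wait, and only when they actually observed something non-durable.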