
The other half of the thread was on linux-ext4, starting here: https://lists.openwall.net/linux-ext4/2018/04/10/33

The part I found most interesting was here: https://lists.openwall.net/linux-ext4/2018/04/12/8

where the ext4 maintainer writes:

« The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. »

He later says that the netlink channel stuff was never submitted to the upstream kernel.

It all feels like a situation where the people maintaining this code knew deep down that it wasn't really up to scratch, but were sufficiently used to workarounds ("use direct io", "scrape dmesg") that they no longer thought of it as a problem.
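
For a sense of what the "scrape dmesg" half of that workaround looks like, here is a minimal sketch in C, assuming you tail /dev/kmsg with sufficient privileges; it is not Google's daemon, and the match strings are only illustrative:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Each read() on /dev/kmsg returns exactly one kernel log record. */
        int fd = open("/dev/kmsg", O_RDONLY);
        if (fd < 0) { perror("open /dev/kmsg"); return 1; }

        char rec[8192];
        for (;;) {
            ssize_t n = read(fd, rec, sizeof rec - 1);
            if (n < 0) {
                if (errno == EPIPE)   /* we fell behind and records were overwritten */
                    continue;
                perror("read");
                return 1;
            }
            rec[n] = '\0';
            /* Flag records that look like disk trouble and hand them to
               whatever does machine-health monitoring. */
            if (strstr(rec, "I/O error") || strstr(rec, "EXT4-fs error"))
                fprintf(stderr, "possible disk failure: %s", rec);
        }
    }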



> It all feels like a situation where the people maintaining this code knew deep down that it wasn't really up to scratch, but were sufficiently used to workarounds ("use direct io", "scrape dmesg") that they no longer thought of it as a problem.

It's an ops-level solution. The one thing you never really do, as an ops person, is reach into the black-box components that make up your infrastructure to fix them at an architectural level. That's not your job. The system is already in production; it's your job to make it run, without changing it—to shim around the black boxes to make them work despite their architectural flaws.

Sure, you can file a ticket with the dev team about how dumb you think an architectural choice was—but, in most businesses, most of the black-box components you're working with are third-party, so filing a ticket isn't likely to get you much. (Google is an exception, but even Google has to hire their SREs from somewhere, and they'll come in enculturated to the "it's all black boxes and we're here to ziptie them together" mindset.)

And, even in an environment like Google's where you have access to the dev-teams of every component you're running, dev cycles are still longer than ops SLA deadlines. Ops solutions are chosen because they're quick to get into production. (Component can't handle the load because it's coded poorly? Replicate it and throw a load balancer in front! Five minute job.) So even if you do file that ticket, you've still got to solve the problem in the here-and-now. And once you do solve the immediate problem, it's no longer a hair-on-fire problem, so that ticket isn't going to be very high-priority to fix.


> The one thing you never really do, as an ops person, is reach into the black-box components

I'm sorry, but just no. That's far too strong a general statement. I've done quite a bit of sysadmin/devops-style work over the years, and also worked with many other people who've sent various kinds of fixes upstream. Sure, third-party software isn't always easy to fix, but it depends on what the alternatives are. You probably need to do an immediate workaround as well.

I would say that's an important part of why open source has had such a strong following among sysadmins: the ability to fix things. One could even say that diving into large pieces of software and exploring issues is the most rewarding part of the job. Just don't tell them I said that.


Usually I've ventured into the code and made a bad enough attempt at a fix that the devs are galvanised into doing it right. Filing a ticket usually isn't enough to get your voice heard if things have already been worked around.


I like this concept. I wonder if there’s a word for it. Any Germans want to help out with a fifteen syllable wonder?


Tikettfileninsufficientstadtdelvencodebasefixen.

Or something? More seriously, relevant English idioms:

- A stitch in time saves nine;

- Something worth doing's worth doing well;

- If you want it done right, do it yourself;

and surely more. I'm certain there's a farming-analogy one about fixing something sooner rather than later, but it's escaped me.


So, we've got:

Cunningham's Law: "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer" [1]

"Broke gets fixed, crappy is forever" [2]

I feel like there should be a third, but that's what I've got.

[1] http://fed.wiki.org/journal.hapgood.net/cunninghams-law/fora...

[2] https://dandreamsofcoding.com/2013/05/06/broke-gets-fixed-cr...


Maybe something about the dummy fix polluting the lead dev's code aesthetics?


Threat-in-a-patch


Duct tape ops?


Agreed.

I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

It takes significant time, but after following this practice for a while, things start working reliably.

I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.


> Agreed.

> I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

> It takes significant time, but after following this practice for a while, things start working reliably.

> I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

> In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.

Do you then submit your fixes back to upstream? This is the critical element I think.


> Do you then submit your fixes back to upstream? This is the critical element I think.

It makes sense to; otherwise you're stuck forward-porting your private patches forever. Or worse, you get stranded on an old version because there's no-one around who can do that forward-porting anymore...


I don't think there was anything wrong with Google using this workaround.

But Google were also employing the ext4 maintainer, and fixing those components presumably was part of his job.


It depends quite a bit on the company and the software they are running. I've worked at places that use open source for the majority of services they run on the servers specifically because they can both look at the source as an advanced form of debugging, and provide local patches to software until it's accepted upstream (or indefinitely if it's not accepted or too company specific).

A lot of patches for open source projects come from companies like this. Is there also a lot of working around buggy software? Yes. It's not always that way, though, and if it's gotten to the point that you have devs for that software on staff, it really shouldn't be: if the people who are financially motivated to care don't fix it, who will?


> I've worked at places that use open source for the majority of services they run on the servers specifically because they can both look at the source as an advanced form of debugging, and provide local patches to software until it's accepted upstream (or indefinitely if it's not accepted or too company specific).

Sure, but what you're talking about here is more "the developers of your software, solving problems in the composed release of the software, by vendoring + patching infrastructural dependencies."

Like, for example: my software depends on Postgres. So I have a copy of Postgres running on my workstation, to develop against. It's built from source. If I find a bug in Postgres, I can fix it in the source, submit a PR, and in the meantime package this fork of Postgres into the application's Docker image in place of the mainline one.
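
As a sketch of what that vendor-and-patch step can look like, assuming a Debian base image and made-up paths (this is illustrative, not a recommendation):

    # Build Postgres from your patched fork and ship it inside the app image.
    FROM debian:bookworm AS build
    RUN apt-get update && apt-get install -y build-essential bison flex libreadline-dev zlib1g-dev
    COPY postgres-fork/ /src/postgres/
    RUN cd /src/postgres && ./configure --prefix=/usr/local/pgsql && make && make install

    FROM debian:bookworm
    COPY --from=build /usr/local/pgsql /usr/local/pgsql
    # ...plus the application itself, config, and something to supervise both.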

This is basically the same as saying "my project depends on ecosystem-package libfoo. I can vendor a copy of libfoo to fix a problem with libfoo."

Either way, that's still something only the developer of the software can really do (on any kind of practical time-scale). Changing the software from talking to an external Postgres, to talking to one packaged inside the Docker image? That's an internal architectural change that requires rewriting the software's configuration-handling logic and build scripts at the very least, and probably also adding some sort of init daemon to the resulting Docker image. It's a software change. It's not something an ops person should spend their time getting up to speed on a project to be able to do. It's the thing they file a ticket to ask the developer to do.


As a dev, I hate this sentiment of siloed ops.


It's kind of impossible to have it any other way. Ops is a cost center; you'll only have as many ops staff as you need to ensure your hair isn't on fire. And so ops staff only have enough time to solve problems with the most direct shortcut (usually by adding more components into the mix), rather than doing good Engineering to pay down the technical debt responsible for the problem.

If you have an infinite budget, by all means, hire developers whose jobs are to be prophylactic for ops tickets, by paying down tech-debt. But you'll have to guard that department with your life, because it'll be the first thing to be cut the moment anyone's looking to grow margin by shaving costs.


That is, until your customers start demanding higher SLAs and offering money for them... Maybe only a thing for really big companies, but when that happens, improving reliability and ops becomes someone's job, if not multiple people's.

That doesn't mean no duct-tape solutions get used, just that the ones that cause problems get removed.


I don't think that's siloed.

A good filesystem maintainer is a precious resource; wasting them on running stuff is madness. (I say that as a devops person.) I know a fair amount about running/configuring/tinkering with filesystems (hell, I've bumped into at least three XFS bugs that were new to Red Hat), but I would never presume to be able to make a meaningful patch.

It's the same way that I'd not expect the ext4 maintainer to know how to instrument, measure, and monitor a large company's infrastructure.

Almost certainly they would have flung a ticket over the fence, and it was up to politics to make it work.


It's the fence that I dislike. But I work in a small studio where I can easily work with (and do some) devops.


I have some bad news for you. That's not how ops at Google works. Root-cause solutions are preferred and rabbit-hole explorations are encouraged.


If fixing the thing itself is easier, and also benefits everyone else, why not?


Fixing the thing isn't usually easier. In an ops environment, you're not dealing with Git repos full of source code that you can just patch and run through CI; you're usually dealing with pre-built packages from an apt server, or pre-built Docker images, or pre-built virtual-appliance VMs. And usually these are all created in drastically-varying, heterogeneous ways, many again by third-parties (public Apt packages, public Docker images, etc.), but even if internal, then by different internal teams with different favored approaches toward building and deploying things.

From the position in the pipeline where ops people operate, it's more work to figure out how to fork a build of software X, patch it, build it, and get out the same kind of artifact your deploy pipeline is expecting, hosted in the same place, than it is to just throw more spaghetti at the wall to solve the problem.

Half of the reason behind Google's monorepo, and the focus on Golang, is to empower SREs to patch broken components when possible, by making there be only One True Way to do that, and only one language (without weird per-project DSLs) to know. But that still doesn't completely fix the problem, since you still need to understand the architecture of the software project, and sometimes that's the hardest part.


I dunno, hardware errors really aren't a thing that is well handled by local APIs. I mean, it all feels to me like the PG folks (and you) are expecting that this would all have been fine if they had just called fsync() a second time to be sure the busted write got to disk, and then everything would have been OK.

No, that's not the way it works. In virtually all circumstances, that second write is going to fail too, because the machine done broke! And even if it works, you have a situation where the machine broke and something somewhere worked around it, and you need to be screaming like crazy for an operator to fix it. That's not an API situation, that's a deployment thing, and it's exactly the regime where Google applied the protocols that you don't seem to like.

That's not to say that the Linux fsync behavior is correct here. I think clearing the error after having reported it once is kinda broken too. But... fixing that doesn't actually "fix" the real problem.

Really the best suggestion in that thread was up at the top, where it was pointed out that ext3/4 support remounting the filesystem read-only on errors, which sounds like a great policy to me.
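
For reference, that behaviour is just an ordinary mount option (the device and mount point below are made up); it governs how ext4 reacts to filesystem errors it detects:

    # /etc/fstab -- flip the filesystem to read-only when ext4 hits an error
    /dev/sdb1   /srv/pgdata   ext4   errors=remount-ro   0   2

    # or set it as the default error behaviour in the superblock:
    tune2fs -e remount-ro /dev/sdb1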


It seems pretty clear from the first message in the thread that what the Postgres code (and so the "PG folks") was expecting was that fsync would continue to report errors as long as it hadn't succeeded.

That way the checkpoint wouldn't complete, and the relevant write-ahead log files wouldn't get deleted (and "checkpoints aren't happening and WAL is piling up" is the sort of thing people already monitor).
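
Spelled out as code, the assumed contract looks something like this sketch (checkpoint_fsync and the file name are made up, not PostgreSQL's actual code); the comments mark the assumptions that the Linux behaviour broke:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical helper: don't declare the checkpoint complete (and don't
       recycle WAL segments) until fsync() has reported success. */
    static void checkpoint_fsync(int fd)
    {
        while (fsync(fd) != 0) {   /* assumption: an unresolved error keeps    */
            perror("fsync");       /* being reported, so it's safe to retry -- */
            sleep(1);              /* exactly what the kernel didn't guarantee */
        }
        /* assumption: reaching here means the data is durable */
    }

    int main(void)
    {
        int fd = open("datafile", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }
        checkpoint_fsync(fd);
        return 0;
    }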


> the sort of thing people already monitor

Exactly! So it's a devops solution either way. The choice is between one based on system-level monitoring and one based on tool-level error reporting. In neither case does a non-broken fsync() actually fix anyone's problem.

My point was that I don't see tools like remount-ro and dmesg monitoring as "workarounds" for an underlying problem. They're correct solutions implemented at the right level of abstraction. They just happen not to be shipped with PostgreSQL, which... I mean, yeah, basically I don't care much either way.


> In neither case does a non-broken fsync() actually fix anyone's problem.

The system's broken either way, but failure modes matter!

With fsync() reporting EIO only once (one time, to one caller), callers that don't expect this behavior (and _no_ callers were written to expect this POSIX-violating, undocumented behavior AFAIK) could corrupt data if writes intermittently succeeded. E.g., marking a segment of WAL as fully applied when it wasn't, not only throwing away durability for that transaction but corrupting the whole database. And maybe an error is reported now and then, but that's not so hard to miss.

With fsync() consistently reporting EIO, the system never reports a transaction as fully committed then loses its contents, and the existing database isn't corrupted. And someone will notice the problem. mjw1007 worded it perfectly: "'checkpoints aren't happening and WAL is piling up' is the sort of thing people already monitor".

So yes, a non-broken fsync() doesn't magically fix your damaged platter, but it's worlds better.


That's just a restatement of the PG point in the linked thread. To which the reply was "it's better to do this at the system level", thus the rejoinder "but that's a workaround" and my "no, it's a perfectly valid way of solving a problem at the same level of abstraction", and now we've come full circle because you're picking on my defense (which I did not make!) of the Linux fsync behavior and ignoring all the context.

You have to do system level reporting of errors, full stop. You can do that with a fixed fsync. You can do it with dmesg. You can do it even with the current fsync behavior. Pick one. Arguing about fsync is pontificating over the wrong problem.

FWIW: I'm all but certain you're wrong about the "POSIX-violating" part. My read of the standard says the author was thinking only about the current request. There's certainly no language in there saying that fsync should return EIO because of a previous failure, nor even what the state of the system (cf. remount-ro) should be following the first one.


You have to synchronously report the error to the application depending on the fsync, full stop. None of the other proposed mechanisms are a replacement for that; for one, because they are asynchronous. An application can't be written to "do an fsync, assume it's done...unless some system-wide error reporting mechanism says otherwise arbitrarily later".

> FWIW: I'm all but certain you're wrong about the "POSIX-violating" part. My read of the standard says the author was thinking only about the current request.

An fsync call is not a complete request (it doesn't specify the data to be written), so what on earth does "only about the current request" mean?

I can't reconcile any of these arguments that Linux's behavior was correct with the RATIONALE section here: http://pubs.opengroup.org/onlinepubs/9699919799/functions/fs... and particularly the phrase "all data up to the time of the fsync() call is recorded on the disk". There's no mention of "since a previous fsync() call" or any such thing there.


> You have to synchronously report the error to the application depending on the fsync, full stop.

Why? As reported above, Google itself isn't even doing that. We're not talking about a network timeout here. Hardware failures are hardware failures and not in the general case recoverable by software no matter how strongly held the opinions are about how that software should work.

The fsync behavior on Linux was surprising and wrong. But fixing that won't fix any of the stuff you're yelling about. API hygiene can't fix machines, and the machine broke.


> Why? As reported above, Google itself isn't even doing that.

I work at "Google itself". Typically there's a single process responsible for all the disk writes of significance on the machine, possibly bypassing the in-kernel filesystem entirely in favor of direct block device access. That process's caller waits until three machines (each with its own battery backup, and frequently on a separate PDU from the others) have acked the write before proceeding. Also, that process's caller is Spanner, and those three machines are necessary to say a single replica has accepted the write. Typically there are three replicas in different clusters (often in separate cities) and two of them have to accept the write (in Paxos terminology) before Spanner tells the application it's complete.

Please don't take Google's production environment working successfully most of the time from the customer viewpoint as proof of anything about the Linux virtual filesystem layer or ext4. You can paper over a lot of brokenness when you have six independent battery-backed copies of the data in two cities before you depend on it.

> We're not talking about a network timeout here. Hardware failures are hardware failures and not in the general case recoverable by software no matter how strongly held the opinions are about how that software should work.

Software can't fix the broken hardware, and (without significant redundancy and cross-verification, as Google does) can't correct for hardware simply lying. But correctly written software can absolutely change the likely failure modes when hardware reports an error. And intermittent hardware failures happen a lot more than you might think anyway. Bugs that turn intermittent failure into permanent corruption are nasty.


What about this: you write() some data, and then Linux throws it away because of a writeback error, and later you read() it, and Linux gives you an older version from before your write (because this error condition has somehow been fixed and now it reads that block back into the page cache). I'm pretty sure that violates POSIX, which says:

"The read() function reads data previously written to a file."


I don't think that's the behavior, though. The blocks don't get flushed; they're still in cache and get read back. Probably. Again, the system broke. You've got no guarantees about anything. Fixing an API isn't going to change that.


API guarantees are still important even when a peripheral breaks. It's the difference between knowing how and where performance is degraded and what to work around, versus utter chaos.


Totally agree. Speaking as a database hacker, just tell me it broke. I will deal with the rest.


It did tell you it broke, though. It just didn't repeat the message when you asked again. Really, there is no "correct" behavior from getting an -EIO from a fsync beyond "shut it all down, notify everyone you can and hope for the best". This retry behavior was itself a bad design choice (and again, I'm not defending fsync here).

And again, speaking as a systems hacker: it broke. You can't "deal with" broke. And the system absolutely did tell you it broke. I genuinely don't understand the insistence by folks on this list that OMG THE ERROR MUST BE HANDLED IN THE DATABASE SOFTWARE. It seems kinda ridiculous to me, honestly.


Well, there are/were a whole bunch of related problems and scenarios discussed in that and other threads, so it depends which bit we're discussing:

(1) errors consumed by asynchronous kernel activity that never make it to user space;

(2) write(), write-back error not reported to user yet, buffer page replaced, read() sees old data;

(3) fsync() -> EIO/ENOSPC, then fsync() -> SUCCESS;

(4) write() in one process, then fsync() in another process from a different fd;

(5) write(), close(), open(), fsync().

It seems that the developers of database software and of kernels/filesystems had different understandings of what should happen to error reporting in those cases (and maybe some more that I'm forgetting). PostgreSQL and FreeBSD happened to agree, and all of these scenarios do the right thing there (IMHO, but you could call my opinion not impartial, since I am involved in both of those projects). Linux has had various different behaviours since Jeff Layton realised how terrible it all was and started improving it, starting with (1), which were straight-up bugs. Some problems still exist. OpenBSD also made recent changes due to the noise generated by this stuff. So I genuinely don't understand how anyone can argue that there was nothing wrong, or imply that it's not complicated: the maintainers of the software in question apparently disagreed.

In my humble opinion someone should go and talk to http://www.opengroup.org/austin/ and try to get some increased clarity here for POSIX 2022. Maybe I will find the energy if someone doesn't beat me to it.

By "I will deal with it", I didn't mean I can fix a hosed server, I mean something like "the database will fail over to another node, or enter recovery" or something like that. And, since we (and other DBs who apparently read our mailing list) introduced panic-on-fsync-failure, that's what we do. There are still a few edge cases to fix, though. We're on it...


When a write-back error happens, Linux marks the block clean so it can be replaced by another block at any time. Additionally, some Linux filesystems (I don't recall which right now) proactively throw it away immediately.


That was also suggested on the PG list:

> A crazy idea would be to have a daemon that checks the [kernel] logs and stops Postgres when it seems something [is] wrong.

> [..] you could have stable log messages or implement some kind of "fsync error log notification" via whatever is the most sane way to get this out of [the] kernel.



