
I wonder what ZFS in the iPhone would've looked like. As far as I recall, the iPhone didn't have error correcting memory, and ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk. ZFS' RAM-hungry nature would've also forced Apple to add more memory to their phone.


ZFS detects corruption.

A very long time ago, someone named cyberjock was a prolific and opinionated proponent of ZFS who wrote many things about it during a time when the hobbyist community was tiny and not very familiar with how to use it or how it worked. Unfortunately, some of their most misguided and/or outdated thoughts still haunt modern consciousness like an egregore.

What you are probably thinking of is the proposed doomsday scenario where bad ram could theoretically kill a ZFS pool during a scrub.

This article does a good job of explaining how that might happen, and why being concerned about it is tilting at windmills: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...

I have never once heard of this happening in real life.

Hell, I’ve never even had bad ram. I have had bad sata/sas cables, and a bad disk though. ZFS faithfully informed me there was a problem, which no other file system would have done. I’ve seen other people that start getting corruption when sata/sas controllers go bad or overheat, which again is detected by ZFS.

What actually destroys pools is user error, followed very distantly by plain old fashioned ZFS bugs that someone with an unlucky edge case ran into.


> Hell, I’ve never even had bad ram.

To what degree can you separate this claim from "I've never noticed RAM failures"?


You can take that as meaning “I’ve never had an issue that was detected by extensive RAM testing, or solved by replacing RAM”.

I got into overclocking both regular and ECC DDR4 RAM for a while when AMD’s 1st gen Ryzen stuff came out, thanks to ASRock’s X399 motherboard unofficially supporting ECC, allowing both its function and the reporting of errors (produced when overclocking).

Based on my own testing and issues seen from others, regular memory has quite a bit of leeway before it becomes unstable, and memory that’s generating errors tends to constantly crash the system, or do so under certain workloads.

Of course, without ECC you can’t prove every single operation has been fault free, but at some point you call it close enough.

I am of the opinion that ECC memory is the best memory to overclock, precisely because you can prove stability simply by using the system.

All that said, as things become smaller with tighter specifications to squeeze out faster performance, I do grow more leery of intermittent single errors that occur on the order of weeks or months in newer generations of hardware. I was once able to overclock my memory to the edge of what I thought was stability, as it passed all tests for days, but about every month or two a few corrected errors would show up in my logs. Typically, any sort of instability is caught by manual tests within minutes, or within the hour.


My friends and I spent a lot of our middle and high school days building computers from whatever parts we could find, and went through a lot of sourcing components everywhere from salvaged throwaways to local computer shops, when those were a thing. We hit our fair share of bad RAM, and by that I mean a handful of sticks at best.


It isn't hard to run memtest on all your computers, and that will catch the kind of bad RAM that the aforementioned doomsday scenario requires.


To me, the most implausible thing about ZFS-without-ECC doomsaying is the presumption that the failure mode of RAM is a persistently stuck bit. That's way less common than transient errors, and way more likely to be noticed, since it will destabilize any piece of software that uses that address range. And now that all modern high-density DRAM includes on-die ECC, transient data corruption on the link between DRAM and CPU seems overwhelmingly more likely than a stuck bit.


> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk

ZFS does not need or benefit from ECC memory any more than any other FS does. The bit flip corrupted the data regardless of ZFS. Any other FS is just oblivious; ZFS will at least tell you your data is corrupt, but happily keep operating.

> ZFS' RAM-hungry nature

ZFS is not really RAM-hungry, unless one uses deduplication (which is not enabled by default, nor generally recommended). It can often seem RAM hungry on Linux because the ARC is not counted as “cache” like the page cache is.
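
For anyone who wants to see this for themselves on Linux, the ARC’s size is exposed through the kstat interface rather than showing up as “cached” in free/top. A minimal sketch, assuming OpenZFS is loaded and the standard /proc/spl/kstat/zfs/arcstats file is present:

  #!/usr/bin/env python3
  # Report the current ARC size on a Linux system running OpenZFS.
  # Assumes the standard kstat file /proc/spl/kstat/zfs/arcstats exists.

  def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
      stats = {}
      with open(path) as f:
          for line in f.readlines()[2:]:       # first two lines are kstat headers
              name, _kind, value = line.split()
              stats[name] = int(value)
      return stats

  arc = read_arcstats()
  print(f"ARC size:   {arc['size'] / 2**30:.2f} GiB")
  print(f"ARC target: {arc['c'] / 2**30:.2f} GiB (max {arc['c_max'] / 2**30:.2f} GiB)")

(The ARC also gives memory back under pressure, much like the page cache does.)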

---

ZFS docs say as much as well: https://openzfs.github.io/openzfs-docs/Project%20and%20Commu...


And even dedup was finally rewritten to be significantly more memory efficient, as of the new 2.3 release of ZFS: https://github.com/openzfs/zfs/discussions/15896


It's very amusing that this kind of legend has persisted! ZFS is notorious for *noticing* when bits flip, something APFS designers claimed was rare given the robustness of Apple hardware.[1][2] What would ZFS on iPhone have looked like? Hard to know, and that certainly wasn't the design center.

Neither here nor there, but DTrace was ported to iPhone--it was shown to me in hushed tones in the back of an auditorium once...

[1]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...

[2]: https://ahl.dtrace.org/2016/06/19/apfs-part5/#checksums


I did early ZFSOnLinux development on hardware that did not have ECC memory. I once had a situation where a bit flip happened in the ARC buffer for libpython.so and all python software started crashing. Initially, I thought I had hit some sort of bizarre bug in ZFS, so I started debugging. At that time, opening a ZFS snapshot would fetch a duplicate from disk into a redundant ARC buffer, so while debugging, I ran cmp on libpython.so between the live copy and a snapshot copy. It showed the exact bit that had flipped. After seeing that and convincing myself the bitflip was not actually on stable storage, I did a reboot, and all was well. Soon afterward, I got a new development machine that had ECC so that I would not waste my time chasing phantom bugs caused by bit flips.
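
(For anyone wanting to reproduce that kind of check: the live-vs-snapshot comparison can be done through ZFS’s .zfs/snapshot directory, which is reachable by path even when hidden from listings. A rough sketch; the pool, dataset, snapshot, and file names below are made up:)

  # Byte-compare a live file against the copy visible through .zfs/snapshot.
  # Paths are illustrative only; adjust to your own pool/dataset/snapshot.
  live = "/tank/data/libpython3.so"
  snap = "/tank/data/.zfs/snapshot/before/libpython3.so"

  a = open(live, "rb").read()
  b = open(snap, "rb").read()

  if a == b:
      print("files are identical")
  else:
      for k in range(min(len(a), len(b))):
          if a[k] != b[k]:
              print(f"first difference at byte {k}: {a[k]:#04x} vs {b[k]:#04x} "
                    f"(differing bits: {a[k] ^ b[k]:#010b})")
              break
      else:
          print("files differ only in length")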


Erhm, isn’t ZFS supposed to store the checksum of records stored in the ARC and verify on read? Are you sure this is what happened?


> ZFS is notorious for corrupting itself when bit flips

That is a notorious myth.

https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...


> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk

I don't think it is. I've never heard of that happening, or seen any evidence ZFS is more likely to break than any random filesystem. I've only seen people spreading paranoid rumors based on a couple pages saying ECC memory is important to fully get the benefits of ZFS.


They also insist that you need about 10 TB RAM per TB disk space or something like that.


There is a rule of thumb that you should have at least 1 GB of RAM per TB of disk when using deduplication. That's.... Different.
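
For a rough sense of where that rule of thumb comes from: the figure usually quoted for classic (pre-“fast dedup”) ZFS is on the order of ~320 bytes of in-core dedup table per unique block, so the RAM needed scales with block count rather than raw capacity. A back-of-envelope sketch; the 320-byte figure and 128 KiB recordsize are rough assumptions, not exact values:

  # Back-of-envelope dedup-table (DDT) memory estimate for classic ZFS dedup.
  # ~320 bytes per unique block is the commonly quoted rough figure; the real
  # number depends on recordsize, pool layout, and how much data is unique.
  def ddt_ram_estimate(pool_bytes, avg_block=128 * 1024, bytes_per_entry=320):
      unique_blocks = pool_bytes / avg_block      # worst case: nothing dedups
      return unique_blocks * bytes_per_entry

  TIB = 2**40
  for tib in (1, 10, 50):
      est = ddt_ram_estimate(tib * TIB)
      print(f"{tib:>3} TiB of 128 KiB records -> ~{est / 2**30:.1f} GiB of DDT")

With dedup off (the default), none of this applies; the ARC simply uses whatever spare RAM it can get and gives it back under pressure.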


So you've never seen the people saying you should steer clear of ZFS unless you're going to have an enormous ARC even when talking about personal media servers?


People, especially those on the Internet, say a lot of things.

Some of the things they say aren't credible, even if they're said often.

You don't need an enormous amount of ram to run zfs unless you have dedupe enabled. A lot of people thought they wanted dedupe enabled though. (2024's fast dedupe may help, but probably the right answer for most people is not to use dedupe)

It's the same thing with the "need" for ECC. If your ram is bad, you're going to end up with bad data in your filesystem. With ZFS, you're likely to find out your filesystem is corrupt (although, if the data is corrupted before the checksum is calculated, then the checksum doesn't help); with a non-checksumming filesystem, you may get lucky and not have meta data get corrupted and the OS keeps going, just some of your files are wrong. Having ECC would be better, but there's tradeoffs so it never made sense for me to use it at home; zfs still works and is protecting me from disk contents changing, even if what was written could be wrong.
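
To make that parenthetical concrete: a checksum can only protect data from the moment it is computed. A toy sketch, with plain SHA-256 standing in for ZFS’s block checksums (which it is not):

  import hashlib

  good = b"what the application meant to write"

  # A bit flips in RAM *before* the filesystem checksums the buffer...
  corrupted_early = bytes([good[0] ^ 0x01]) + good[1:]

  # ...so the checksum is computed over the already-wrong bytes.
  stored_data = corrupted_early
  stored_csum = hashlib.sha256(corrupted_early).digest()

  # On read, the checksum verifies cleanly even though the data is wrong.
  assert hashlib.sha256(stored_data).digest() == stored_csum
  print("checksum OK, corruption undetected")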


I have seen people saying that yeah. I've also completely ignored them.

I have a 64TB ZFS pool at home (12x8TB drives in an 11w1s RAID-Z3) on a personal media server. The machine has been up for months. It's using 3 GiB of RAM (including the ARC) out of the 32 I put in it.


Not that I recall? And it's worked fine for me...


I have seen people say such things, and none of it was based on reality. They just misinterpreted the performance cliff that data deduplication had to mean you must have absurd amounts of memory, even though data deduplication is off by default. I suspect few of the people peddling this nonsense ever used ZFS, and the few who did had not looked very deeply into it.


Even then you obviously need L2ARC as well!! /s


But on optane. Because obviously you need an all flash main array for streaming a movie.


Fortunately, this has significantly improved since dedup was rewritten as part of the new ZFS 2.3 release. Search for zfs “fast dedup”.


It’s unfortunate some folks are missing the tongue-in-cheek nature of your comment.


Not this myth again. ZFS does not need ECC RAM. Stop propagating this falsehood.


> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk.

If you have no mirrors and no raidz and no ditto blocks then errors cause problems, yes. Early on they would cause panics.

But this isn't ZFS "corrupting itself", rather, it's ZFS saving itself and you from corruption, and the price you pay for that is that you need to add redundancy (mirrors, raidz, or ditto blocks). It's not a bad deal. Some prefer not to know.


> ZFS is notorious for corrupting itself when bit flips hit it and break the checksum on disk.

What's a bit flip?


Sometimes data on disk and in memory are randomly corrupted. For a pretty amazing example, check out "bitsquatting"[1]--it's like domain name squatting, but instead of typos, you squat on domains that would be looked up in the case of random bit flips. These can occur e.g. due to cosmic rays. On disk, HDDs and SSDs can produce the wrong data. It's uncommon to see actual invalid data rather than have an IO fail on ECC, but it certainly can happen (e.g. due to firmware bugs).

[1]: https://en.wikipedia.org/wiki/Bitsquatting


Basically, it's that memory changes out from under you. As we know, computers use binary, so everything boils down to being a 0 or a 1. A bit flip is when what was, say, a 0 changes into a 1.

Usually attributed to "cosmic rays", but really can happen for any number of less exciting sounding reasons.
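
In code terms it is nothing more than one bit changing state. A tiny illustration:

  value = 0b01100001              # the byte for ASCII 'a' (0x61)
  flipped = value ^ (1 << 4)      # a stray event flips bit 4
  print(chr(value), "->", chr(flipped))   # prints: a -> q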

Basically, there is zero double checking in your computer for almost everything except stuff that goes across the network. Memory and disks are not checked for correctness, basically ever, on any machine anywhere. Many servers (but certainly not all) are the rare exception when it comes to memory safety. They usually have ECC (Error Correction Code) memory, basically a checksum on the memory to ensure that if memory is corrupted, it's noticed and fixed.

Essentially every filesystem everywhere does zero data integrity checking:

  MacOS APFS: Nope
  Windows NTFS: Nope
  Linux EXT4: Nope
  BSD's UFS: Nope
  Your mobile phone: Nope
ZFS is the rare exception among file systems in that it actually double-checks that the data you save to it is the data you get back from it. Every other filesystem is just a big ball of unknown data. You probably get back what you put in, but there are zero promises or guarantees.
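
The principle is simple even if ZFS's actual on-disk machinery is not: record a checksum when data is written, recompute and compare when it is read. A toy sketch of that idea (not ZFS's real format or algorithm):

  import hashlib

  def write_block(data: bytes):
      # Store the data together with a checksum computed at write time.
      return data, hashlib.sha256(data).digest()

  def read_block(data: bytes, stored_csum: bytes) -> bytes:
      # Recompute on read; a mismatch means the bits changed underneath us.
      if hashlib.sha256(data).digest() != stored_csum:
          raise IOError("checksum mismatch: data is corrupt")
      return data

  block, csum = write_block(b"important bytes")
  damaged = bytes([block[0] ^ 0x10]) + block[1:]   # simulate on-disk bit rot
  read_block(block, csum)      # returns the data
  read_block(damaged, csum)    # raises IOError instead of silently returning garbage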


> disks are not checked for correctness, basically ever on any machine anywhere.

I'm not sure that's really accurate -- all modern hard drives and SSD's use error-correcting codes, as far as I know.

That's different from implementing additional integrity checking at the filesystem level. But it's definitely there to begin with.


But SSDs (to my knowledge) only implement a checksum for the data transfer. It's a requirement of the protocol, so you can be sure that the stuff in memory and the checksum computed by the CPU arrive exactly like that at the SSD. In the past this was a common error source with faulty hardware RAID.

But there is ABSOLUTELY NO checksum for the bits stored on an SSD, so bit rot in the cells of the SSD goes undetected.


That is ABSOLUTELY incorrect. SSDs have enormous amounts of error detection and correction builtin explicitly because errors on the raw medium are so common that without it you would never be able to read correct data from the device.

It has been years since I was familiar enough with the insides of SSDs to tell you exactly what they are doing now, but even ~10-15 years ago it was normal for each raw 2k block to actually be ~2176+ bytes and use at least 128 bytes for LDPC codes. Since then the block sizes have gone up (which reduces the number of bytes you need to achieve equivalent protection) and the lithography has shrunk (which increases the raw error rate).

Where exactly the error correction is implemented (individual dies, SSD controller, etc) and how it is reported can vary depending on the application, but I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.


> I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.

While true, there are zero promises that what you meant to save and what gets saved are the same thing. All the drive mostly promises is that if it safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.

There are lots of weasel words there on purpose. There is generally zero guarantee in reality and drives lie all the time about data being safely written to disk, even if it wasn't actually safely written to disk yet. This means on power failure/interruption the outcome of being able to read XYZ back is 100% unknown. Drive Manufacturers make zero promises here.

On most consumer compute, there are no promises or guarantees that what you wrote on day 1 will be there on day 2+. It mostly works, and the chances are better than even that your data will be mostly safe on day 2+, but there are zero promises or guarantees. We know how to guarantee it, we just don't bother (usually).

You can buy laptops and desktops with ECC RAM and use ZFS(or other checksumming FS), but basically nobody does. I'm not aware of any mobile phones that offer either option.


> While true, there is zero promises that what you meant to save and what gets saved are the same things. All the drive mostly promises is that if the drive safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.

I'm not really sure what point you're trying to make. It's using ECC, so they should be the same bytes.

There isn't infinite reliability, but nothing has infinite reliability. File checksums don't provide infinite reliability either, because the checksum itself can be corrupted.

You keep talking about promises and guarantees, but there aren't any. All there is are statistical rates of reliability. Even ECC RAM or file checksums don't offer perfect guarantees.

For daily consumer use, the level of ECC built into disks is generally plenty sufficient. It's chosen to be so.


I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying with consumer-grade hardware without ECC & ZFS. Small images are where people usually notice: they tend to be heavily compressed, and their small size means minor changes can be more noticeable. In larger files, corruption tends to not get noticed as much in my experience.

We have 10k+ consumer devices at work and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC & ZFS.

We had a cloud provider recently move some VMs to new hardware for us; the ones with ZFS filesystems noticed corruption, while the ones with ext4/NTFS/etc filesystems didn't notice any. We made the provider move them all again, and the second time around ZFS came up clean. Without ZFS we would have never known, as none of the ext4/NTFS filesystems complained at all. Who knows if all the ext4/NTFS machines were corruption free; it's anyone's guess.


All MLC SSDs absolutely do data checksums and error recovery, otherwise they would lose your data much more often than they do.

You can see some stats using `smartctl`.


Yes, the disk mostly promises what you write there will be read back correctly, but that's at the disk level only. The OS, Filesystem and Memory generally do no checking, so any errors at those levels will propagate. We know it happens, we just mostly choose to not do anything about it.

My point was, on most consumer compute, there are no promises or guarantees that what you see on day 1 will be there on day 2. It mostly works, and the chances are better than even that your data will be mostly safe on day 2, but there are zero promises or guarantees, even though we know how to do it. Some systems do: those with ECC memory and ZFS, for example. Other filesystems also support checksumming, with BTRFS being the most common counter-example to ZFS, even though parts of BTRFS are still completely broken (see their status page for details).


"Basically, there is zero double checking in your computer for almost everything except stuff that goes across the network."

This is so not true.

All the high speed busses (QPI, UPI, DMI, PCIe, etc.) have "bit flip" protection in multiple layers: differential pair signaling, 8b/10b (or higher) encoding, and packet CRCs.

Hard drives (the old spinning rust kind) store data along with a CRC.

SSD/NVMe drives use strong ECC because raw flash memory flips so many bits that it is unusable without it.

If most filesystems don't do integrity checks it's probably because there's not much need to.
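
To illustrate the packet-CRC layer in particular: the sender appends a CRC over the payload, the receiver recomputes it, and a mismatch triggers a retry rather than silent corruption. A toy model only; real link-layer CRCs (e.g. on PCIe) are computed in hardware per packet:

  import zlib

  # Sender appends a CRC32 to the payload; receiver recomputes and compares.
  payload = b"pretend this is a PCIe packet payload"
  frame = payload + zlib.crc32(payload).to_bytes(4, "little")

  received = bytearray(frame)
  received[5] ^= 0x08                        # one bit flips in flight

  body, crc = bytes(received[:-4]), int.from_bytes(received[-4:], "little")
  print("CRC ok?", zlib.crc32(body) == crc)  # False -> the link retries the transfer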


I agree that transfers (like the high speed buses) have some checks to ensure transfers happen properly, but that doesn't help much if the data is/was corrupted on either side.

> If most filesystems don't do integrity checks it's probably because there's not much need to.

I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying with consumer-grade hardware without ECC & ZFS. Small images are where people usually notice: they tend to be heavily compressed, and their small size means minor changes can be more noticeable. In larger files, corruption tends to not get noticed as much in my experience.

We have 10k+ consumer devices at work and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC & ZFS.

We had a cloud provider recently move some VMs to new hardware for us; the ones with ZFS filesystems noticed corruption, while the ones with ext4/NTFS/etc filesystems didn't notice any. We made the provider move them all again, and the second time around ZFS came up clean. Without ZFS we would have never known, as none of the ext4/NTFS filesystems complained at all. Who knows if all the ext4/NTFS machines were corruption free; it's anyone's guess.


Btrfs and bcachefs both have data checksumming. I think ReFS does as well.


Yes, ZFS is not the only filesystem with data checksumming and guarantees, but it's one of the very rare exceptions that do.

ZFS has been in production workloads since 2005, 20 years now. It's proven to be very safe.

BTRFS has known fundamental issues past one disk. It is, however, improving. I will say BTRFS is fine for a single drive. Even the developers, last I checked (a few years ago), don't really recommend it past a single drive, though hopefully that's changing over time.

I'm not familiar enough with bcachefs to comment.



