I've mentioned this story before, but we had massive drive failures when bringing up multiple disk arrays. We got them racked on a Friday afternoon, and then I wrote a quick and dirty shell script to read/write data back and forth between them over the weekend, set to kick in after the RAID arrays finished striping. By quick and dirty I mean there was no logging, just a bunch of commands saved as a .sh file. Came in on Monday to find massive failures in all of the arrays, but no insight into when they failed: during the stripe or while being stressed. The failure rate was close to 50%. Turned out to be a bad batch from the factory; multiple customers of our vendor were complaining. All the drives were replaced by the manufacturer, so it just delayed the storage becoming available to production. After that, not one of them failed in the next 12 months before I left for another job.
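For comparison, here is a minimal sketch of what a timestamped burn-in script could look like; the mount points, pass count, and chunk size are assumptions, not the original setup. Even this much logging would have shown when each array started throwing errors:

```python
# Hypothetical burn-in sketch with timestamped logging.
# Mount points, pass count, and chunk size are assumptions.
import logging
import os
import time

MOUNTS = ["/mnt/array0", "/mnt/array1"]  # assumed mount points
CHUNK = 64 * 1024 * 1024                 # 64 MiB per test file
PASSES = 100

logging.basicConfig(
    filename="burnin.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def stress(mount: str) -> None:
    """Write and read back a test file, logging any I/O error with a timestamp."""
    path = os.path.join(mount, "burnin.dat")
    data = os.urandom(CHUNK)
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            if f.read() != data:
                logging.error("data mismatch on %s", mount)
    except OSError as exc:
        logging.error("I/O failure on %s: %s", mount, exc)

for i in range(PASSES):
    for mount in MOUNTS:
        stress(mount)
    logging.info("pass %d complete", i)
    time.sleep(1)
```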
Disk failure rates are very low compared to a decade ago. Back then I used to change more than a dozen disks every week. Now a failure is an eyebrow-raising event that I seldom see.
I think following Backblaze's hard disk stats is enough at this point.
Backblaze reports an annual failure rate of 1.36% [0]. Since their cluster uses 2,400 drives, they would likely see ~32 failures a year (extra ~$4,000 annual capex, almost negligible).
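Back-of-the-envelope check of those numbers; the ~$125 per-drive replacement cost is my assumption, not from the post:

```python
# Expected annual failures and rough replacement cost for the cluster.
drives = 2400
afr = 0.0136                       # Backblaze's reported annual failure rate
cost_per_drive = 125               # assumed replacement price in USD

expected_failures = drives * afr                    # ~32.6 drives/year
annual_capex = expected_failures * cost_per_drive   # ~$4,080/year
print(f"{expected_failures:.1f} failures/yr, ~${annual_capex:,.0f}/yr")
```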
I bet you still have a higher early failure rate because of the stress of transportation, even if there's no funny business. And I expect some funny business: used enterprise drives often come with wiped SMART data, and some may have been retired by sophisticated clients who decided they were near failure.
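For what it's worth, here is a rough sketch of the kind of SMART sanity check one might run on a used drive before trusting it. The device path and thresholds are assumptions for a typical ATA drive, and it needs smartmontools installed and root access:

```python
# Rough sketch: pull a couple of SMART attributes from a used drive and eyeball
# them for signs of a reset or a hard life. Device path is an assumption.
import subprocess

DEVICE = "/dev/sda"  # assumed device

# "smartctl -A" prints the vendor attribute table (ID, name, ..., raw value).
out = subprocess.run(
    ["smartctl", "-A", DEVICE], capture_output=True, text=True, check=False
).stdout

for line in out.splitlines():
    fields = line.split()
    if len(fields) >= 10 and fields[1] in ("Power_On_Hours", "Reallocated_Sector_Ct"):
        name, raw = fields[1], fields[9]
        print(f"{name}: {raw}")
        # Near-zero Power_On_Hours on a "used" drive hints the counters were reset;
        # a large Reallocated_Sector_Ct hints the drive was retired for a reason.
```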
They mentioned the cluster uses used enterprise drives. I can see the desire to save money, but I agree: that is going to be one expensive mistake down the road.
I should also note that for my own home cluster use, I quickly learned that used drives didn't seem to make sense: too much performance variability.
We don't have perfect metrics here, but this seems to match our experience: a lot of failures happened shortly after install, before the bulk of the data was downloaded onto the heap, so actual data loss is lower than the hardware failure rate.
Used drives make sense if maintaining your home server is a hobby. It's fun to diagnose and solve problems in home servers, and failing drives give me a reason to work on the server. (I'm only half-joking; it's kind of fun.)