vageli · on July 21, 2019

> Agreed.

> I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

> It takes significant time, but after following this practice for a while, things start working reliably.

> I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

> In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.

Do you then submit your fixes back to upstream? This is the critical element I think.

caf · on July 22, 2019

> Do you then submit your fixes back to upstream? This is the critical element I think.

It makes sense to; otherwise you're stuck forward-porting your private patches forever. Or worse, you get stranded on an old version because there's no one around who can do that forward-porting anymore...