
Agreed.

I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

It takes significant time, but after following this practice for a while, things start working reliably.

I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.



Do you then submit your fixes back to upstream? This is the critical element, I think.


It makes sense to, otherwise you're stuck forward-porting your private patches forever. Or worse, you get stranded on an old version because there's no one around who can do that forward-porting anymore...



