Shorter releases can reduce the difficulty of immediately addressing problems, but it's an error to equate a reduction of risk with an elimination of risk.
There are always going to be failure modes that require extensive time to diagnose and debug, even with small changes being made. Additionally, you want that diagnostic phase to happen without time pressure. If you do not have a sane rollback mechanism to use in those scenarios, you are doing a disservice to your users and your team.
Your users suffer, because the outage or breakage will last as long as it takes for you to address the underlying issue directly, instead of just rolling back to restore service. They will be forced to hear frustrating things like "we're working on it", since you don't know what's wrong yet, when instead you could have just rolled back before most users even noticed there was a problem.
And, more importantly, your team will suffer greatly, because they will be forced to work under pressure when an incident like this arises. And, worse, they will also 'learn' that accidentally pushing breaking changes to production results in an extremely unpleasant and toxic situation for everyone, leading to systemic fear-of-deploys and undermining a blameless culture.
So you should have a rollback mechanism that is solid, tested, and easy to use for scenarios where a non-trivial regression or outage arises in production, even if you are doing continuous delivery of small patches.
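To make "solid, tested, and easy to use" concrete, here's roughly what I mean, sketched against a hypothetical Kubernetes Deployment (the name and timeout are placeholders; the same idea applies to whatever your deploy tooling actually is). The point is that rollback is one rehearsed, scripted step, not something improvised mid-incident:

    # Minimal rollback sketch, assuming a Kubernetes Deployment named "my-app"
    # (the name is a placeholder). Rolling back is one rehearsed, scripted
    # command, not a procedure invented under pressure.
    import subprocess

    def rollback(deployment: str = "my-app") -> None:
        # Revert to the previously deployed revision...
        subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
            check=True,
        )
        # ...and block until the known-good version is actually serving again.
        subprocess.run(
            ["kubectl", "rollout", "status", f"deployment/{deployment}", "--timeout=120s"],
            check=True,
        )

    if __name__ == "__main__":
        rollback()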
Well, it's highly unlikely that such an error will arise. And if it does, and you know you actually need to roll back, then most likely something else is wrong.
But again, I also saw the other comment about how "dogmatic" my approach is. I wouldn't say it's dogmatic; idealistic, yes, but not dogmatic. There is a place and time for everything, and a rollback can STILL be useful when you trust neither the system nor the code base (as I pointed out in my other comment, rollbacks are useful with legacy systems and with systems you have to maintain that were built by outsourced teams).
Rollbacks are also the first thing you think of when you join a large company as a director of engineering and have to support systems you've never touched before.
You didn't really address my points. It's hard to quantify just how "highly unlikely" a failure is, but it's your job as a systems designer to build systems that are robust under a wide variety of unlikely failure scenarios. Not having rollbacks leaves you with a system that is extremely problematic in those unlikely scenarios where a quick fix isn't available. Not to mention, rolling forward under such a regime has its own unique risk: because the 'fix' was made under pressure, with no other way to restore service quickly, it's often the case that simple human errors get introduced while rolling forward. I've seen it a million times.
In my experience, a healthy incident response process has a fork in the decision tree at the very top: do we roll back, or do we attempt to fix live? And in the latter case, we time-box how long we're willing to spend, and default to rolling back for all but the most trivial, obvious fixes. Even if you don't use rollback often, having that top-level fork is a relief valve for all of the toxic implications I mentioned in the scenario where you do actually need it.
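For what it's worth, a rough sketch of that time-boxed fork; the budget, polling interval, and the is_healthy/attempt_quick_fix/roll_back hooks are hypothetical stand-ins for whatever your monitoring and deploy tooling actually expose:

    # Sketch of the top-level incident fork with a hard time box on "fix live".
    # is_healthy(), attempt_quick_fix(), and roll_back() are placeholder hooks.
    import time

    FIX_LIVE_BUDGET = 15 * 60  # seconds we're willing to spend before rolling back

    def respond_to_incident(is_healthy, attempt_quick_fix, roll_back) -> str:
        deadline = time.monotonic() + FIX_LIVE_BUDGET
        # Only stay on this branch for trivial, obvious fixes.
        while not is_healthy():
            if time.monotonic() >= deadline:
                roll_back()          # budget spent: restore the known-good version
                return "rolled back"
            attempt_quick_fix()
            time.sleep(30)           # give the fix a moment to take effect
        return "fixed live"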
Even if you go through several dozen incidents where you didn't need it, a black swan event will eventually show up -- and that event will be the one with a lasting impact on your company's public perception and the morale of your team.
I'm strictly against rollbacks and I'm strictly for everything continuous.
If I need to do a rollback, it means I trust neither the system nor the code base. I will do the rollback, but after that there will be a very productive retro about how we can do better to avoid rollbacks in the future (aka what did we learn).
But again, as I said, there is a place and time for everything! And there are many variables! Even how you structure your teams affects deployments: engineering culture, engineering team types (cross-functional, generalized, specialized, etc.), whether the team that makes the decision about the rollback is the team that introduced the bug.
My approach is not dogmatic (have your standardized rollbacks if those work best for your company, release cycles, and teams); it's idealistic (that's what I aim for, personally).
I suspect we're going to agree to disagree here, but I highly advise you to reconsider the idea of framing a rollback as an unforced failure to your team. The last dynamic you want in a retrospective is one where not only did an unexpected failure happen (a bug pushed to production), but the team then collectively 'let you down' by pulling the rollback lever instead of thinking and working harder on fixing the issue live. In such a scenario you're forcing people to feel they need to "cry uncle" when they can't solve the problem quickly, and putting them in the middle of a conflict of interest between making a well-tested, reviewed change that is sure to fix the problem and rolling the dice on a quick fix in the hope it'll reduce the total outage. That's not the recipe for a positive, blameless culture.
When we roll back on my team, it's uncommon, but when it happens it's considered a success as long as the call was made through a systematic decision-making process. Making a sane decision in the interest of our users to restore service quickly is always a win. I can assure you, having and occasionally using a rollback mechanism does not compromise your ability to do continuous delivery or ship small changes. If you are fearful that having such a mechanism and plan in place will somehow lead to people questioning your principles in a way you cannot defend, that is a separate problem, because the two things you claim are incompatible are in fact compatible and highly defensible.
It is not a legacy from "waterfall" or any of the other things you mention, because your claim can be refuted with a single counterexample, and I've worked on three separate projects where such counterexamples exist: we had a rollback mechanism, it was used once in a while, and we shipped changes to production multiple times a day using continuous delivery. At no point on those projects did having or using rollbacks lead to some hard-to-explain loss in delivery velocity. On the contrary, I suspect that if that mechanism had not existed, several failures that were easy to get back to green would have turned into a toxic hellhole, and my teammates would have been much more fearful around shipping, which is the high-order bit when it comes to velocity and embracing continuous delivery of small changes.
"toxic hellhole", "blame culture" - I don't think we need those dark, marginal extremes to make a point
I also suspect that we're going to agree to disagree here. There are so many nuances, it's impossible to properly communicate most of those without writing a chapter of a book.
Appreciate your points though. Great food for thought right there.
If a deploy causes an outage, I don't trust that code, and we shouldn't assume we fully understand what's wrong with it. I want the known-good version out ASAP. I don't even want to spend the time to discuss what to do, not until after prod is up.
Oh please. You never released a bug to production that couldn't be fixed in five minutes? No way. That's such bullshit if you've ever worked in this industry. Legacy systems or not. Even as a fantasy/lie your claim is not believable.
For an obvious typo in a prod config (something like a wrong addr:port), sure. But if I think I have a 90% chance of fixing a failure I just introduced (in code I just demonstrated I don't fully understand), that's a 10% risk of prolonging a production outage for nothing more than style points. Roll the damn thing back and then fix it with a clear head. That should be a well-practiced reflex. If it ends up being one bad deploy, one rollback, and one healthy deploy all back-to-back, that's totally fine.
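Back-of-the-envelope, with made-up numbers: say rollback takes ~2 minutes, and a live fix takes ~10 minutes when it works and ~45 when it doesn't. Even at 90% confidence in the fix, the expected outage from fixing forward is already several times worse than just rolling back, before you even account for stress-induced mistakes:

    # Expected-outage comparison (all numbers are hypothetical -- plug in your
    # own rollback/fix times and your own estimate of the fix succeeding).
    p_fix_works = 0.90   # chance the live fix is actually correct
    t_rollback = 2       # minutes of outage if we roll back now
    t_fix_good = 10      # minutes of outage if the live fix works
    t_fix_bad = 45       # minutes of outage if it doesn't and we roll back anyway

    expected_fix_forward = p_fix_works * t_fix_good + (1 - p_fix_works) * t_fix_bad
    print(f"roll back now:          ~{t_rollback} min of outage")
    print(f"fix forward (expected): ~{expected_fix_forward:.1f} min of outage")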
In a blue/green world, failing back prod to the cluster with the known-good code should be reflexive. After that maybe you can be lax about whether to roll back or fix forward on the unhealthy cluster (or maybe not, if you only have n=2 clusters).
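Concretely, "reflexive" can be a single scripted switch. A minimal sketch, assuming a Kubernetes-style blue/green setup where the production Service picks the active cluster by a `version` label (the service name and namespace are placeholders):

    # Reflexive blue/green failback sketch: repoint the production Service at
    # the known-good cluster. "prod-web" and "default" are placeholder names.
    import json
    import subprocess

    def fail_back_to(color: str, service: str = "prod-web", namespace: str = "default") -> None:
        """Point the production Service at the cluster labeled version=<color>."""
        patch = json.dumps({"spec": {"selector": {"version": color}}})
        subprocess.run(
            ["kubectl", "patch", "service", service, "-n", namespace, "-p", patch],
            check=True,
        )

    # One command, no debate: restore service first, diagnose afterwards.
    fail_back_to("blue")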