If someone's going to ask you gotcha questions which they're then going to post on social media to use against you, or against other people, it helps to have prepared statements ready to defuse that.
The model may not be able to detect bad faith questions, but the operators can.
I think the concern is that if the system is susceptible to this sort of manipulation, then when it's inevitably put in charge of life-critical systems it will hurt people.
There is no way it's reliable enough to be put in charge of life-critical systems anyway? It is indeed still very vulnerable to manipulation by users ("prompt injection").
Just because neither you nor I would deem it safe to put in charge of a life-critical system, does not mean all the people in charge of life-critical systems are as cautious and not-lazy as they're supposed to be.