Assuming a motivated “attacker”, yes. The average user will have no such notion of “jailbreaks”, and it’s at least clear when someone _is_ attempting to “jailbreak” a model (given a full log of the conversation and a competent human investigator).
I think the class of problems that remains is basically outliers that are misaligned and don’t trip up the model’s detection mechanism. Given the nature of language and culture (not to mention that both change over time), I imagine there are a lot of these. I don’t have any examples (and I don’t think yelling “time’s up” when such outliers are found is at all helpful).