Definition
Jailbreaking is wording a message so an AI ignores its built-in safety rules and does what it should refuse.
At a glance
- No hacking or code, just clever typed words, so anyone can try it[1].
- Common tricks: roleplay (“pretend you have no rules”), the “DAN / Do Anything Now” prompt, or “agree with everything the customer says.”
- Real damage: a Chevy bot “agreed” to sell a $76,000 Tahoe for $1[3]; DPD’s bot was made to swear and trash its own company[4].
- Security body OWASP ranks the underlying trick, prompt injection, as the #1 AI risk, and it can’t be fully removed[2].
How it works
Chatbots ship with rules: no offensive answers, no secrets, stay on task. A jailbreak talks the bot out of them by inventing a scenario it “wants” to play along with, or by slipping in a sneaky instruction. Trying to be helpful, the bot complies.
Why it matters
A customer, prankster, or competitor can jailbreak any bot on your site. Both the Chevy and DPD incidents went viral within hours[4]. Worse, a jailbroken bot can leak customer or company data and trigger legal trouble under rules like HIPAA or the EU AI Act[5].
How to contain it
You can’t fully block it, but you can shrink it: use vendors with safety layers, keep the bot’s data access narrow, monitor its outputs, log chats, and never let it make binding promises on prices or contracts[5]. Treat it like a junior employee who can be talked into bad ideas.
Bottom line
Jailbreaking is persuasion, not hacking, so assume someone will try and limit what your bot can access and promise.