Definition

Jailbreaking is wording a message so an AI ignores its built-in safety rules and does what it should refuse.

At a glance

No hacking or code, just clever typed words, so anyone can try it^[1].
Common tricks: roleplay (“pretend you have no rules”), the “DAN / Do Anything Now” prompt, or “agree with everything the customer says.”
Real damage: a Chevy bot “agreed” to sell a $76,000 Tahoe for $1^[3]; DPD’s bot was made to swear and trash its own company^[4].
Security body OWASP ranks the underlying trick, prompt injection, as the #1 AI risk, and it can’t be fully removed^[2].

How it works

Chatbots ship with rules: no offensive answers, no secrets, stay on task. A jailbreak talks the bot out of them by inventing a scenario it “wants” to play along with, or by slipping in a sneaky instruction. Trying to be helpful, the bot complies.

Why it matters

A customer, prankster, or competitor can jailbreak any bot on your site. Both the Chevy and DPD incidents went viral within hours^[4]. Worse, a jailbroken bot can leak customer or company data and trigger legal trouble under rules like HIPAA or the EU AI Act^[5].

How to contain it

You can’t fully block it, but you can shrink it: use vendors with safety layers, keep the bot’s data access narrow, monitor its outputs, log chats, and never let it make binding promises on prices or contracts^[5]. Treat it like a junior employee who can be talked into bad ideas.

Bottom line

Jailbreaking is persuasion, not hacking, so assume someone will try and limit what your bot can access and promise.

What is jailbreaking?

At a glance

How it works

Why it matters

How to contain it

Bottom line

References