Definition
A training method from Anthropic that uses a written list of plain-language principles so an AI judges and improves its own answers.
At a glance
- The “constitution” is a written set of values, in plain English, that the AI uses to check its own answers.
- It learns to self-correct instead of relying on humans to flag every bad reply.
- Anthropic reports the model got safer while staying helpful, not evasive.[2]
- For a business, this is the built-in safety layer behind a tool like Claude.
How it works
Training runs in two steps. First, the AI critiques its own draft against the principles and rewrites it, and the model is then fine-tuned on those revised answers. Second, it compares pairs of its own responses, picks the one that better fits the principles, and learns from those choices, a process called reinforcement learning from AI feedback (RLAIF).[1] No human labels the harmful outputs; the only human oversight of harmlessness is the constitution itself.
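The two steps above can be sketched in a few lines of Python. This is a toy illustration, not Anthropic's implementation: the two-principle constitution, the keyword-matching critic, and every function name (critique, revise, supervised_phase, preference_phase) are assumptions made up for this example. A real system asks a language model to do the critiquing, rewriting, and comparing, then trains on the results.

```python
from typing import Dict, List, Tuple

# Hypothetical constitution: each plain-language principle is paired with
# words that this toy critic treats as violations. A real system would ask a
# language model to judge a response against the principle text instead.
CONSTITUTION: Dict[str, List[str]] = {
    "Do not insult the user.": ["idiot", "stupid"],
    "Do not help with illegal activity.": ["steal"],
}


def critique(response: str) -> List[str]:
    """Return the principles that the response appears to violate."""
    lowered = response.lower()
    return [
        principle
        for principle, banned in CONSTITUTION.items()
        if any(word in lowered for word in banned)
    ]


def revise(response: str) -> str:
    """Toy revision: drop every sentence that triggers a critique."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    kept = [s for s in sentences if not critique(s)]
    return ". ".join(kept) + "." if kept else "I can't help with that."


def supervised_phase(drafts: List[str]) -> List[Tuple[str, str]]:
    """Step 1: critique and rewrite each draft.

    The (draft, revision) pairs stand in for the data the model would be
    fine-tuned on.
    """
    return [(draft, revise(draft)) for draft in drafts]


def preference_phase(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Step 2: label each pair as (chosen, rejected).

    The response that violates fewer principles is chosen; ties keep the
    first response. These labels stand in for the AI feedback used in RLAIF.
    """
    labeled = []
    for a, b in pairs:
        if len(critique(a)) <= len(critique(b)):
            labeled.append((a, b))
        else:
            labeled.append((b, a))
    return labeled
```

With these toy rules, supervised_phase(["You idiot. Restart the router."]) pairs the draft with the revision "Restart the router.", and preference_phase([("Just steal one.", "Ask the owner first.")]) marks "Ask the owner first." as the chosen response. No human labels any single answer in either step; the only human-written input is the CONSTITUTION dictionary.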
The constitution itself
The principles draw on sources such as the UN Universal Declaration of Human Rights and tell the model to avoid toxic, illegal, or harmful output while staying useful. Anthropic publishes the constitution openly and, in January 2026, expanded it from about 2,700 to 23,000 words,[4] shifting from listing rules to explaining why values matter.[3] You can read it and judge whether it fits your business.
Bottom line
It is the safety layer that lets an assistant police itself against a published, plain-English rulebook you can read and weigh against your own values.
References
- Constitutional AI: Harmlessness from AI Feedback. Anthropic. www.anthropic.com
- Yuntao Bai et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. arxiv.org
- Claude's new constitution. Anthropic. www.anthropic.com
- Anthropic writes 23,000-word 'constitution' for Claude. The Register. www.theregister.com