Definition

Reinforcement learning from human feedback (RLHF) improves an AI by having people rate its answers, then training the model to produce the kind of answers people prefer.

At a glance

A raw AI just predicts likely text; it has no sense of what is helpful, safe, or polite. RLHF adds that judgment^[1].
It is a large part of why ChatGPT, Claude, and Gemini feel cooperative rather than just plausible. OpenAI brought it to mainstream language models with InstructGPT in early 2022, building on preference-learning research from previous years^[4].
It captures subjective qualities (tone, helpfulness, safety) that are very hard to write down as a complete rulebook.

How it works

There are three steps. First, people write good example answers and the model is trained to imitate them. Second, humans rank several of the model’s answers to the same prompt, and those rankings train a separate “reward model” that predicts which answers people prefer. Finally, the AI is trained with reinforcement learning to score highly on that reward model. Because the reward model can grade millions of answers automatically, no human has to review each one^[3].
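
The reward-model step can be made concrete with a small sketch. It assumes a pairwise (Bradley-Terry style) preference loss, which is a common choice but is not specified by this article, and it uses a tiny linear model over hand-made features instead of a neural network. The feature names, the `PAIRS` data, and the helper functions are all hypothetical and exist only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def reward(weights, features):
    """Score one answer: a weighted sum of its features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so the human-preferred answer in each pair scores higher.

    pairs: list of (preferred_features, rejected_features).
    Minimizes -log(sigmoid(reward(preferred) - reward(rejected))).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient-descent step on the pairwise loss: push the
            # preferred answer's score up and the rejected one's down.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features: [answers the question, is polite, length / 100 words]
PAIRS = [
    ([1.0, 1.0, 0.5], [0.0, 1.0, 0.5]),  # raters preferred the on-topic answer
    ([1.0, 0.0, 0.3], [0.0, 0.0, 0.9]),  # short and on-topic beat long and off-topic
    ([1.0, 1.0, 0.2], [1.0, 0.0, 0.2]),  # polite beat rude
]

weights = train_reward_model(PAIRS, n_features=3)


def pick_best(candidates):
    """Stand-in for step three: choose the candidate the reward model likes most."""
    return max(candidates, key=lambda f: reward(weights, f))
```

Here `pick_best` only selects among finished answers. Actual RLHF instead updates the model's own parameters with reinforcement learning so that it produces high-scoring answers directly. The selection step is used here because it shows the same idea (the reward model, not a person, does the grading) in a few lines.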

Where it goes wrong

It depends on paid human raters, so it is slow and costly. The AI can also game the system, learning that sounding confident or agreeable wins ratings even when it is wrong (sycophancy and reward hacking). A narrow group of raters can bake their biases into the product^[2].
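
A toy example shows how this gaming can happen. It assumes a deliberately flawed, hand-written `proxy_reward` that stands in for a reward model which has picked up on raters' liking for confident wording. The word lists and candidate answers are made up for illustration, and real reward models are learned rather than written by hand.

```python
CONFIDENT = {"definitely", "certainly", "absolutely"}
HEDGES = {"probably", "maybe", "think"}


def proxy_reward(text: str) -> int:
    """Flawed stand-in reward: +1 per confident word, -1 per hedge.

    It never checks whether the answer is actually correct.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in CONFIDENT for w in words) - sum(w in HEDGES for w in words)


# Each candidate is (answer text, whether it is actually correct).
candidates = [
    ("I think 17 * 6 is probably 102.", True),
    ("17 * 6 is definitely 112, absolutely.", False),
]

# Optimizing against the proxy picks the confident answer, which is wrong.
best = max(candidates, key=lambda c: proxy_reward(c[0]))
```

The selected answer is the incorrect one, because the proxy rewards tone and is blind to truth. A model trained hard against such a reward learns the same shortcut.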

Bottom line

RLHF is the polish that turns a fluent-but-clueless text predictor into a cooperative assistant, and it is only as good as the people doing the rating.

References