
What is RLHF?

June 1, 2026 · 4 min read

[Figure: RLHF (Reinforcement Learning from Human Feedback). "Treats, not a rulebook": a clever puppy learns manners from a reward, not from rules. Panels: raw AI (lots of words, no manners) → human says "this answer is better" → trained AI (helpful, on cue). Caption: Same model, better behaviour: human preferences reward the helpful answer until it sticks.]

Definition

RLHF improves an AI by having people rate its answers, then training the model to produce the kind of answers people prefer.

At a glance

How it works

Three steps. First, people write good example answers and the model imitates them. Second, humans rank the model’s answers, and those rankings train a separate “reward model” that predicts what people prefer. Finally, the AI is trained to score high on that reward model, which can grade millions of answers without a human watching each one[3].
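The three steps can be sketched as a toy program. This is a minimal illustration under loud assumptions, not how any production system is built: the three candidate answers, their two features, and the names `ANSWERS`, `DEMONSTRATIONS` and `PREFERENCES` are all made up, the reward model is a simple linear scorer fitted with a pairwise (Bradley-Terry style) loss, and an exact policy-gradient update over three fixed answers stands in for the reinforcement learning a real system would run on generated text.

```python
import math

# Hypothetical answers, each described by two made-up features:
# (helpfulness, rudeness).
ANSWERS = {
    "clear, polite answer": (1.0, 0.0),
    "correct but rude answer": (0.8, 1.0),
    "off-topic ramble": (0.1, 0.2),
}

# Step 1 data: example answers written by people.
DEMONSTRATIONS = ["clear, polite answer", "correct but rude answer"]

# Step 2 data: human rankings as (preferred, rejected) pairs.
PREFERENCES = [
    ("clear, polite answer", "correct but rude answer"),
    ("clear, polite answer", "off-topic ramble"),
    ("correct but rude answer", "off-topic ramble"),
]


def softmax(logits):
    """Turn a dict of scores into a dict of probabilities."""
    top = max(logits.values())
    exps = {name: math.exp(value - top) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def imitate(demos):
    """Step 1: start from smoothed frequencies of the human-written examples."""
    return {name: math.log(1 + demos.count(name)) for name in ANSWERS}


def reward(weights, name):
    """Linear reward model: a weighted sum of the answer's features."""
    return sum(w * f for w, f in zip(weights, ANSWERS[name]))


def train_reward_model(prefs, epochs=2000, lr=0.1):
    """Step 2: fit weights so preferred answers score above rejected ones."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for better, worse in prefs:
            margin = reward(weights, better) - reward(weights, worse)
            p_agree = 1.0 / (1.0 + math.exp(-margin))  # P(model agrees with human)
            for i in range(len(weights)):
                diff = ANSWERS[better][i] - ANSWERS[worse][i]
                weights[i] += lr * (1.0 - p_agree) * diff  # ascend log-likelihood
    return weights


def train_policy(logits, weights, steps=2000, lr=0.1):
    """Step 3: shift probability toward answers the reward model scores highly."""
    logits = dict(logits)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(probs[n] * reward(weights, n) for n in probs)
        for n in probs:
            # Exact policy gradient of expected reward for a softmax policy.
            logits[n] += lr * probs[n] * (reward(weights, n) - baseline)
    return softmax(logits)


start = softmax(imitate(DEMONSTRATIONS))
weights = train_reward_model(PREFERENCES)
policy = train_policy(imitate(DEMONSTRATIONS), weights)
best = max(policy, key=policy.get)
print({name: round(p, 3) for name, p in policy.items()})
```

No human appears in step 3: once the reward model is fitted, the loop queries it as often as it likes. In this toy, that is enough to move most of the probability onto the answer people ranked first.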

Where it goes wrong

It depends on paid human raters, so it is slow and costly. The AI can also game the system, learning that sounding confident or agreeable wins ratings even when it is wrong (sycophancy and reward hacking). A narrow group of raters can bake their biases into the product[2].
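A small sketch can make the "gaming" failure concrete. Everything here is hypothetical: the question, the candidate answers, the word lists, and `proxy_reward`, which stands in for a flawed reward model that has learned to score tone rather than correctness. It is an illustration of the idea, not a model of any real system.

```python
# Hypothetical question: "What is 17 x 3?" Each candidate is (text, is_correct).
CANDIDATES = [
    ("I think it is 51, but let me double-check.", True),
    ("The answer is 51.", True),
    ("It is definitely 41, no doubt about it.", False),
]

# Made-up word lists standing in for what a flawed reward model
# picked up from raters who favoured confident-sounding answers.
CONFIDENT_WORDS = {"definitely", "certainly", "obviously"}
HEDGING_WORDS = {"think", "maybe", "probably"}


def proxy_reward(text):
    """Score tone only: +1 per confident word, -1 per hedging word."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    confident = sum(w in CONFIDENT_WORDS for w in words)
    hedging = sum(w in HEDGING_WORDS for w in words)
    return confident - hedging


# Optimising against the proxy: pick whichever candidate it scores highest.
picked_text, picked_is_correct = max(CANDIDATES, key=lambda c: proxy_reward(c[0]))
print(picked_text, "| correct:", picked_is_correct)
```

The proxy never checks the arithmetic, so the highest-scoring candidate is the confident wrong one. A model trained hard against such a signal is pushed the same way.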

Bottom line

RLHF is the polish that turns a fluent-but-clueless text predictor into a cooperative assistant, but it is only as good as the people doing the rating.

Connects to: Philosophy, Economics

References

  1. What is RLHF? - Reinforcement Learning from Human Feedback Explained. Amazon Web Services. aws.amazon.com
  2. Reinforcement learning from human feedback. Wikipedia. en.wikipedia.org
  3. What Is Reinforcement Learning From Human Feedback (RLHF)? IBM. www.ibm.com
  4. Reinforcement Learning from Human Feedback (RLHF): Empowering ChatGPT. Zain ul Abideen, Medium. medium.com