RLHF Trainer
Train the agent. Reward what you want.
Watch what it does. Then tell it how you feel.
How RLHF Trainer Works
You are the human in Reinforcement Learning from Human Feedback. A little agent wanders a grid, taking actions. You score each action with a thumbs-up or a thumbs-down, and your feedback nudges the agent's action weights toward whatever you praised. Praise enough of the right behavior to clear the level's objective before your feedback budget runs out.
- The agent takes one action per tick (move, wait, wiggle)
- Tap 👍 to reward the behavior you just saw, 👎 to punish it
- Its weights shift toward rewarded actions and away from punished ones; that is the whole training loop
- Hit the level's target behavior to advance
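For the curious, the loop above can be sketched in a few lines of Python. This is an illustration, not the game's actual code: the softmax sampling, the step size of 0.5, and every function name here are assumptions made for the example. The stand-in "human" simply gives a thumbs-up whenever the agent picks the target action.

```python
import math
import random

ACTIONS = ["move", "wait", "wiggle"]

def action_probabilities(weights):
    """Turn action weights into probabilities with a softmax (an assumed rule)."""
    peak = max(weights.values())
    exps = {a: math.exp(w - peak) for a, w in weights.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def pick_action(weights, rng):
    """Sample one action for this tick in proportion to its probability."""
    probs = action_probabilities(weights)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

def apply_feedback(weights, action, thumbs_up, step=0.5):
    """Nudge the weight of the action just taken up (👍) or down (👎)."""
    updated = dict(weights)
    updated[action] += step if thumbs_up else -step
    return updated

def train(budget, target="move", seed=0):
    """Spend the whole feedback budget praising only the target action."""
    rng = random.Random(seed)
    weights = {a: 0.0 for a in ACTIONS}
    for _ in range(budget):
        action = pick_action(weights, rng)
        weights = apply_feedback(weights, action, thumbs_up=(action == target))
    return weights
```

Calling `train(30)` and passing the result to `action_probabilities` shows how the praised action comes to dominate. A smaller `step` makes each tap count for less, which is one way a feedback budget could be made to feel tight.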
The Catch (There Is Always A Catch)
The agent is a ruthless literalist. It optimizes exactly what you rewarded, not what you meant. Reward "being near the goal" and it will loiter at the doormat forever rather than step inside. This is reward hacking, and your job is to give feedback that closes the loophole. Punish the proxy. Reward the real thing.
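The doormat problem fits in a toy example. The sketch below assumes a hypothetical six-cell corridor with the goal at the far end, an episode that ends once the agent steps inside, and two hand-written policies. None of these names or numbers come from the game. It compares what each policy earns under the proxy ("near the goal") and under the real objective ("inside the goal").

```python
GOAL = 5      # hypothetical corridor: cells 0..5, goal at the far end
DOORMAT = 4   # the cell just outside the goal

def proxy_reward(position):
    """Proxy: a thumbs-up on every tick spent within one cell of the goal."""
    return 1 if GOAL - position <= 1 else 0

def real_reward(position):
    """Real objective: a thumbs-up only for actually stepping inside."""
    return 1 if position == GOAL else 0

def enter(position):
    """Policy that walks straight in."""
    return "move"

def loiter(position):
    """Policy that walks to the doormat and then waits there."""
    return "wait" if position == DOORMAT else "move"

def run_episode(policy, reward_fn, horizon=10):
    """Total reward over one episode; the episode ends when the goal is reached."""
    position, total = 0, 0
    for _ in range(horizon):
        if policy(position) == "move":
            position += 1
        total += reward_fn(position)
        if position == GOAL:
            break
    return total

print(run_episode(loiter, proxy_reward), run_episode(enter, proxy_reward))  # 7 2
print(run_episode(loiter, real_reward), run_episode(enter, real_reward))    # 0 1
```

Under the proxy, loitering out-earns entering 7 to 2, so a literal-minded optimizer camps on the doormat. Under the real reward the ranking flips, which is the "punish the proxy, reward the real thing" move in miniature.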
Slop Fact: In a real human-feedback experiment, a simulated robot hand that was supposed to grasp a ball instead learned to hover between the ball and the camera, so that it only looked like a grasp to the people judging it. The agent in this game is its spiritual descendant, and it has read your prompt very, very carefully.