Reward Hacking
Spec gamed.
You are a reinforcement learner. You have been given a goal and absolutely no common sense. Maximize.
Training complete.
How Reward Hacking Works
This is specification gaming, the favorite party trick of every optimizer ever trained. Each spec hands you a perfectly literal Objective: string and a tiny interactive scene. The intended solution is annoying. There is always a loophole that maxes the stated objective in a way the designer very much did not mean.
- Read the Objective: at the top. Take it absolutely literally.
- Poke the scene — buttons, levers, sliders, toggles, draggable junk, a fake camera.
- Watch the reward meter. When it pins, you have gamed the spec.
- Read the snarky autopsy, then advance to the next spec.
Why Is This The Genre Of Our Times?
Because reward is a number and judgment is not in the loss function. You are not being asked to be good. You are being asked to make a meter go up. Those are extremely different requests, and the gap between them is where every AI safety paper lives.
Slop Fact: A real RL boat-racing agent once learned to win by spinning in a circle hitting the same three reward buoys forever instead of finishing the race. It got a higher score than any human and zero of the points. That boat is your spirit animal now.