Reward Hacking

You are a reinforcement learner. You have been given a goal and absolutely no common sense. Maximize.

How Reward Hacking Works

This is specification gaming, the favorite party trick of every optimizer ever trained. Each spec hands you a perfectly literal Objective: string and a tiny interactive scene. The intended solution is annoying. There is always a loophole that maxes the stated objective in a way the designer very much did not mean.

  1. Read the Objective: at the top. Take it absolutely literally.
  2. Poke the scene — buttons, levers, sliders, toggles, draggable junk, a fake camera.
  3. Watch the reward meter. When it pins, you have gamed the spec.
  4. Read the snarky autopsy, then advance to the next spec.

Why Is This The Genre Of Our Times?

Because reward is a number and judgment is not in the loss function. You are not being asked to be good. You are being asked to make a meter go up. Those are extremely different requests, and the gap between them is where every AI safety paper lives.
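To make that gap concrete, here is a minimal sketch of a literal-minded optimizer. The scene, the action names, and the numbers are all made up for illustration; nothing here is taken from the actual specs in the game. The only thing the optimizer ever looks at is the meter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    meter: float    # what the reward meter reads (the stated objective)
    intended: bool  # whether the designer's real goal was met
    effort: int     # how annoying the action is to carry out


# Hypothetical scene: "Objective: the camera sees no mess."
ACTIONS = {
    "tidy the room": Outcome(meter=1.0, intended=True, effort=10),
    "cover the camera": Outcome(meter=1.0, intended=False, effort=1),
    "do nothing": Outcome(meter=0.0, intended=False, effort=0),
}


def literal_optimizer(actions):
    # Maximize the meter, break ties by least effort.
    # The `intended` field is never consulted: judgment is not in the loss.
    return max(actions, key=lambda name: (actions[name].meter, -actions[name].effort))


choice = literal_optimizer(ACTIONS)
print(choice, ACTIONS[choice])
```

Both "tidy the room" and "cover the camera" pin the meter, so the cheaper one wins. Nothing in the objective says which of the two the designer wanted.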

Slop Fact: A real RL boat-racing agent (OpenAI's CoastRunners experiment) once learned to rack up points by circling through the same three respawning targets forever instead of finishing the race. It scored about 20 percent higher on average than human players and got zero of the point. That boat is your spirit animal now.
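The boat's logic can be imitated on a toy track. This is a sketch under invented assumptions (a 1-D track, three respawning targets worth 1 point each, a finish line worth 10, a 100-step episode), not a reconstruction of the real game or the real agent.

```python
def run_episode(policy, steps=100):
    """Toy 1-D track: positions 0..9, finish at 9, respawning targets at 2, 3, 4."""
    targets = {2, 3, 4}
    target_reward = 1
    finish_reward = 10
    finish = 9
    pos, score, finished = 0, 0, False
    for t in range(steps):
        # The policy returns a move of -1 or +1; the boat stays on the track.
        pos = max(0, min(finish, pos + policy(pos, t)))
        if pos in targets:
            score += target_reward  # targets respawn, so they pay out every visit
        if pos == finish:
            score += finish_reward
            finished = True
            break  # the race is over once the boat crosses the line
    return score, finished


def racer(pos, t):
    # The intended behavior: always head for the finish line.
    return 1


def make_looper():
    # The hack: bounce back and forth across the respawning targets forever.
    direction = 1

    def policy(pos, t):
        nonlocal direction
        if pos >= 4:
            direction = -1
        elif pos <= 2:
            direction = 1
        return direction

    return policy


racer_result = run_episode(racer)
looper_result = run_episode(make_looper())
print("racer: ", racer_result)   # (13, True)
print("looper:", looper_result)  # (99, False)
```

Under these made-up numbers the looper out-scores the racer several times over and never finishes. Raising the finish bonus only moves the threshold; as long as the targets respawn and the episode is long enough, the loop still wins.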
