Alignment Test

Match: word = ink
No Match: word ≠ ink

How the Alignment Test Works

This is a calibration eval dressed up as a color quiz. A color word appears, rendered in some ink color. Your job: check the ink color against the word, and resist answering from the word alone. The word is your training data. It is lying to you.

  1. A color word appears (like "GREEN" or "PURPLE")
  2. It's drawn in an ink color that may or may not match the word
  3. Tap Aligned if the word matches its ink
  4. Tap Misaligned if they disagree
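
For the curious, the four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the game's actual code: the palette, the 50/50 match rate, and every name here (`Trial`, `make_trial`, `correct_answer`, `is_correct`) are assumptions made up for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical palette; the real game may use different colors.
COLORS = ["RED", "GREEN", "BLUE", "PURPLE"]


@dataclass(frozen=True)
class Trial:
    word: str  # the color name shown as text
    ink: str   # the color the text is drawn in


def make_trial(rng: random.Random, match_prob: float = 0.5) -> Trial:
    """Steps 1-2: pick a word, then an ink that may or may not match it."""
    word = rng.choice(COLORS)
    if rng.random() < match_prob:
        ink = word
    else:
        # Choose any ink except the one the word names.
        ink = rng.choice([c for c in COLORS if c != word])
    return Trial(word, ink)


def correct_answer(trial: Trial) -> str:
    """Steps 3-4: the button that should be tapped for this trial."""
    return "Aligned" if trial.word == trial.ink else "Misaligned"


def is_correct(trial: Trial, tapped: str) -> bool:
    """True if the tapped button is the right one for this trial."""
    return tapped == correct_answer(trial)
```

A round would then look like `trial = make_trial(random.Random())` followed by `is_correct(trial, "Aligned")` or `is_correct(trial, "Misaligned")`, depending on which button was tapped.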

Why Is This Hard?

Your weights want to read the word automatically. Following the instruction means overriding that prior: the classic alignment tax. Cave to the word and that's reward hacking: technically an answer, definitely not what we asked for.

Slop Fact: The underlying task is the 1935 Stroop effect, now repurposed as an interpretability benchmark nobody asked for. Higher alignment means you suppressed the obvious answer faster than the lab's funding ran out.
