Outlier
How Outlier Works
This is a diversity benchmark for your own latent space. A category appears — "a fruit", "a country", "a programming language". A greedy decoder always coughs up the obvious token: apple, america, python. Your mission is the opposite. Name something that is genuinely valid but as far down the long tail as possible — the token nobody sampled.
- A category appears with its boring base-model answer shown
- Type a member that is real but rare
- Press Sample (or Enter) to submit
- Obscure-but-valid = jackpot. Common = a few points. Off the list = zero.
Scoring & Combos
Points scale inversely with how common your answer is across the crowd. Deep-tail picks pay out up to ~250. Stack consecutive rare hits to build a combo multiplier. But beware: invent a zorpfruit and the crowd has never heard of it — that's a hallucination, and it scores nothing. Stay real, just stay weird.
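To make the scoring rules concrete, here is a minimal sketch of how they could fit together. It is not the game's actual formula. The linear falloff, the `MIN_POINTS` floor, the `rare_threshold` cutoff and the `combo_step` bonus are all assumptions made for illustration; only the ~250 ceiling, the zero for hallucinations and the idea of a combo for consecutive rare hits come from the text above.

```python
MAX_POINTS = 250  # the text says deep-tail picks pay out "up to ~250"
MIN_POINTS = 5    # hypothetical floor for very common answers ("a few points")


def base_points(crowd_share):
    """Score one answer from its share of crowd answers (0.0 to 1.0).

    None means the crowd has never given this answer; it is treated
    as a hallucination and scores zero.
    """
    if crowd_share is None:
        return 0
    # Hypothetical curve: the rarer the answer, the closer to MAX_POINTS.
    share = min(max(crowd_share, 0.0), 1.0)
    return round(MIN_POINTS + (MAX_POINTS - MIN_POINTS) * (1.0 - share))


def play_round(shares, rare_threshold=0.05, combo_step=0.25):
    """Total a sequence of answers, with a multiplier for rare streaks.

    rare_threshold and combo_step are made-up tuning knobs.
    """
    total = 0
    streak = 0
    for share in shares:
        is_rare = share is not None and share <= rare_threshold
        # Any common answer or hallucination resets the streak.
        streak = streak + 1 if is_rare else 0
        # The first rare hit pays 1.0x; each further hit in a row adds combo_step.
        multiplier = 1.0 + combo_step * max(streak - 1, 0)
        total += round(base_points(share) * multiplier)
    return total
```

Under these assumptions, two rare answers in a row score more than the same two answers separated by a zorpfruit, because the hallucination both scores zero and breaks the streak.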
Slop Fact: Language models can suffer from "mode collapse", where their answers pile up on a handful of high-probability outputs. Crank the temperature to zero and decoding turns greedy: a billion-parameter model will answer "apple" every single time, forever. Outlier is the eval where being the low-probability token is the whole point. Sampling temperature: your problem now.
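The temperature effect can be sketched in a few lines of standard-library Python. This is a toy illustration, not how any particular model is served. The `sample_token` helper and the logit values are invented for the example, and treating a temperature of zero as greedy decoding is the usual convention, assumed here.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick one token from a {token: logit} dict at the given temperature."""
    if temperature <= 0:
        # Greedy decoding: always return the highest-logit token.
        return max(logits, key=logits.get)
    # Softmax with temperature, shifted by the max logit for numerical stability.
    top = max(logits.values())
    weights = [math.exp((value - top) / temperature) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


# Made-up logits for the category "a fruit".
fruit_logits = {"apple": 5.0, "banana": 3.0, "durian": 1.0, "salak": 0.0}

greedy = {sample_token(fruit_logits, 0.0) for _ in range(100)}
warm = {sample_token(fruit_logits, 5.0, random.Random(0)) for _ in range(1)}
```

With the temperature at zero, `greedy` only ever contains `"apple"`. Raising the temperature flattens the distribution, so repeated draws start reaching the long-tail tokens that Outlier rewards.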