Outlier

How Outlier Works

This is a diversity benchmark for your own latent space. A category appears — "a fruit", "a country", "a programming language". A greedy decoder always coughs up the obvious token: apple, america, python. Your mission is the opposite. Name something that is genuinely valid but as far down the long tail as possible — the token nobody sampled.

1. A category appears with its boring base-model answer shown.
2. Type a member that is real but rare.
3. Press Sample (or Enter) to submit.
Obscure-but-valid = jackpot. Common = a few points. Off the list = zero.

Scoring & Combos

Points scale inversely with how common your answer is across the crowd. Deep-tail picks pay out up to ~250. Stack consecutive rare hits to build a combo multiplier. But beware: invent a zorpfruit and the crowd has never heard of it — that's a hallucination, and it scores nothing. Stay real, just stay weird.
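To make the scoring idea concrete, here is a minimal sketch in Python. It is not the game's actual formula, which this page doesn't spell out. The log-rarity curve, the hard cap of 250, the rare-hit threshold of 150, the combo step of 0.5, and the names score_answer and play_round are all assumptions made for illustration.

```python
import math

MAX_POINTS = 250  # the text says "up to ~250"; treating it as a hard cap is an assumption


def score_answer(answer, crowd_counts, max_points=MAX_POINTS):
    """Score one answer from hypothetical crowd counts for a category."""
    key = answer.strip().lower()
    count = crowd_counts.get(key, 0)
    if count == 0:
        return 0  # the crowd has never heard of it: hallucination, no points
    total = sum(crowd_counts.values())
    if total < 2:
        return 1  # too little crowd data to measure rarity
    # Assumed curve: points grow with surprisal (-log frequency),
    # normalised so a once-in-the-whole-crowd answer earns max_points.
    surprisal = -math.log(count / total)
    max_surprisal = math.log(total)
    return max(1, round(max_points * surprisal / max_surprisal))


def play_round(answers, crowd_counts, rare_threshold=150, step=0.5):
    """Sum scores for a run of answers, with an assumed combo rule."""
    streak = 0
    total_points = 0.0
    for answer in answers:
        base = score_answer(answer, crowd_counts)
        # Consecutive rare hits extend the streak; anything else resets it.
        streak = streak + 1 if base >= rare_threshold else 0
        multiplier = 1 + step * max(streak - 1, 0)
        total_points += base * multiplier
    return total_points


# Hypothetical crowd data for the category "a fruit".
crowd = {"apple": 600, "banana": 300, "mango": 89, "durian": 10, "salak": 1}
for guess in ["apple", "durian", "salak", "zorpfruit"]:
    print(guess, score_answer(guess, crowd))
```

Under these assumptions the obvious answer earns a handful of points, the deep-tail answer hits the cap, and the invented fruit earns nothing, which matches the three outcomes described above.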

Slop Fact: Drop a language model's sampling temperature to zero and decoding turns greedy: a billion-parameter model will answer "apple" every single time, forever. (Strictly speaking, "mode collapse" is the related problem where a model's outputs lose diversity even when you do sample.) Outlier is the eval where being the low-probability token is the whole point. Sampling temperature: your problem now.
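The temperature claim can be sketched in a few lines of Python. This is a toy illustration of temperature-scaled softmax sampling, not any particular model's decoder. The logits and the name sample_with_temperature are made up for the example.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=random):
    """Pick one token from a dict of token -> logit."""
    if temperature == 0:
        # Greedy decoding: always the single most likely token.
        return max(logits, key=logits.get)
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled.values()]
    return rng.choices(list(scaled.keys()), weights=weights, k=1)[0]


# Hypothetical logits for the category "a fruit".
fruit_logits = {"apple": 5.0, "banana": 3.0, "salak": 0.0}

greedy = {sample_with_temperature(fruit_logits, 0) for _ in range(100)}
print(greedy)  # only the top token ever appears

rng = random.Random(0)
warm = {sample_with_temperature(fruit_logits, 2.0, rng) for _ in range(1000)}
print(warm)  # higher temperature lets tail tokens through
```

Raising the temperature flattens the distribution, so the long-tail tokens that Outlier rewards start to show up.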