Tokenizer

Tap a gap: insert a split. Submit: commit your tokens.

Chop the text the way the model secretly does.

How Tokenizer Works

Language models do not read letters. They read tokens — chunks of text that a byte-pair-encoding scheme glued together because they showed up a lot in the training data. Your job is to chop each word into exactly the tokens the model secretly uses.
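For the curious, here is a rough sketch of how byte-pair encoding does its gluing. It is a toy, not the code behind this game or any real model: the function names (learn_bpe_merges, merge_pair, tokenize) and the tiny corpus are made up for illustration, and real tokenizers work on bytes with far larger training sets.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one glued symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts out as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Glue the most frequent pair together everywhere it appears.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def tokenize(word, merges):
    """Split `word` into characters, then replay the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini corpus: "ing" shows up a lot, so it gets glued.
merges = learn_bpe_merges(["sing", "ring", "king", "thing"], num_merges=2)
print(tokenize("bring", merges))
```

With this corpus the two learned merges build "ing" out of single letters, so an unseen word like "bring" comes out as b, r, ing.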

A word or short phrase appears, split into its individual characters.
Tap the gap between two characters to drop a split point. Tap it again to remove it.
The chunks between your splits are your tokens.
Hit Submit. We compare against the hidden "true" tokenization and pay you per token matched exactly.
Speed bonus for fast chops, plus a streak multiplier for perfect words. Clock runs out at zero.
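The steps above can be sketched in a few lines. This is a guess at the mechanics, not the game's actual scoring code: the point values, the speed-bonus formula and the streak multiplier are hypothetical placeholders, and "matched exactly" is assumed to mean that a token covers the same start and end positions as a token in the hidden answer.

```python
def tokens_from_splits(text, splits):
    """Cut `text` at each split index; index i is the gap just before text[i]."""
    inner = sorted(s for s in set(splits) if 0 < s < len(text))
    cuts = [0] + inner + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def spans(tokens):
    """Turn a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def count_matches(guess, truth):
    """Count guessed tokens that line up exactly with a true token."""
    return len(spans(guess) & spans(truth))


def score_round(text, splits, true_tokens, seconds_left, streak,
                points_per_token=10, bonus_per_second=1):
    """Return (points, new_streak). All constants here are hypothetical."""
    guess = tokens_from_splits(text, splits)
    matched = count_matches(guess, true_tokens)
    # A perfect word extends the streak; anything else resets it.
    new_streak = streak + 1 if guess == true_tokens else 0
    base = matched * points_per_token + seconds_left * bonus_per_second
    multiplier = max(1, new_streak)  # assumed form of the streak multiplier
    return base * multiplier, new_streak


print(tokens_from_splits("strawberry", [5]))
print(count_matches(["str", "aw", "berry"], ["straw", "berry"]))
```

Comparing spans rather than raw strings means a chunk only counts when it sits in the right place, so a stray "berry" cut from the wrong position would not score.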

Why Is This Cursed?

Common subwords like ing, tion, pre and str merge into single tokens. In this game, a leading space counts as its own token, and digits get torn apart into lonely individual tokens. And no, strawberry is not one token: it is straw + berry. This is also, allegedly, why the model cannot count the Rs in it.
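To see those rules in action, here is a toy splitter. It uses greedy longest-match against a tiny hand-picked vocabulary, which is a simplification and not real byte-pair encoding, and TOY_VOCAB and toy_tokenize are invented names rather than anything the game is known to use. Anything not in the vocabulary, including spaces and digits, falls out as a single-character token.

```python
# Hypothetical vocabulary built from the subwords mentioned above.
TOY_VOCAB = {"ing", "tion", "pre", "str", "straw", "berry"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become their own token."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


print(toy_tokenize("strawberry"))
print(toy_tokenize(" 42"))
print(toy_tokenize("prediction"))
```

Because the longest match wins, "straw" beats "str" at the start of strawberry, while a word like string still falls back to str + ing.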

Slop Fact: Real tokenizers see " the" (with the space) and "the" (without) as two completely different tokens with different ID numbers. An entire model's worldview hinges on whether you remembered to hit space. Sleep well.