Tokenizer
Chop the text the way the model secretly does.
How Tokenizer Works
Language models do not read letters. They read tokens: chunks of text that a byte-pair-encoding scheme glued together because they showed up a lot in the training data. Your job is to chop each word or phrase into exactly the tokens the model secretly uses.
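For the curious, here is a minimal sketch of how byte-pair encoding does that gluing: count adjacent pairs, merge the most frequent one, repeat. It is a toy written for illustration, not the tokenizer any real model (or this game) uses; the names learn_merges and merge_pair are made up.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one glued symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(words, num_merges):
    """Learn up to `num_merges` BPE-style merges from a toy word list."""
    # Start with every word split into single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Glue the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


merges, corpus = learn_merges(["sing", "ring", "king", "wing"], 2)
print(merges)
print(corpus)
```

On this toy list the first merge is i + n and the second is in + g, so every word ends up as one leading letter plus a single ing token. Real tokenizers run the same loop over bytes and a far larger corpus, for tens of thousands of merges.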
- A word or short phrase appears, split into its individual characters.
- Tap the gap between two characters to drop a split point. Tap it again to remove it.
- The chunks between your splits are your tokens.
- Hit Submit. We compare against the hidden "true" tokenization and pay you per token matched exactly.
- Speed bonus for fast chops, plus a streak multiplier for perfect words. The round ends when the clock hits zero.
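The chop-and-score steps above can be sketched roughly like this. Everything here is an assumption made for illustration: the function names, the payout of 10 per token, and the reading of "matched exactly" as "your chunk covers the same characters as a true token". The speed bonus and streak multiplier are left out because their values are not stated.

```python
def chunks_from_splits(text, splits):
    """Cut `text` at gap indices; gap i sits between text[i-1] and text[i]."""
    # Ignore duplicate or out-of-range gaps so no chunk comes out empty.
    valid = sorted({s for s in splits if 0 < s < len(text)})
    bounds = [0] + valid + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def spans(tokens):
    """Return the set of (start, end) character spans covered by each token."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def score(text, splits, true_tokens, pay_per_token=10):
    """Pay for each guessed chunk whose span equals a true token's span."""
    guess = chunks_from_splits(text, splits)
    matched = len(spans(guess) & spans(true_tokens))
    return matched * pay_per_token


print(chunks_from_splits("strawberry", [5]))
print(score("strawberry", [5], ["straw", "berry"]))
print(score("strawberry", [5, 7], ["straw", "berry"]))
```

Comparing spans rather than strings means a chunk only counts when it sits in the right place, so one stray split costs you the token it lands in but not its neighbors.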
Why Is This Cursed?
Common subwords like ing, tion, pre and str merge into single tokens. A leading space counts as its own token here, even though real tokenizers usually glue it onto the word that follows. Digits get torn apart into lonely individual tokens. And no, strawberry is not one token: it is straw + berry. This is also, allegedly, why the model cannot count the Rs in it.
Slop Fact: Real tokenizers see " the" (with the space) and "the" (without) as two completely different tokens with different ID numbers. An entire model's worldview hinges on whether you remembered to hit space. Sleep well.
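A tiny sketch of what that looks like in practice. The vocabulary and ID numbers below are invented for illustration and do not come from any real tokenizer, and the lookup is a simple greedy longest match standing in for real BPE merging.

```python
# Made-up vocabulary: the IDs are illustrative, not from a real tokenizer.
TOY_VOCAB = {"the": 101, " the": 102, " cat": 103, "cat": 104}


def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match lookup (a stand-in for real BPE merging)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos:]!r}")
    return ids


print(encode("the cat"))
print(encode(" the cat"))
```

The two calls differ by one leading space and return different first IDs (101 versus 102 in this toy vocabulary), which is the whole point: to the model, those are unrelated symbols.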