Model Zoo

Compute: 0 | Wins: 0 | Best Model: 0

Merge two mediocre models into one slightly less mediocre model. Enter benchmarks. Overfit to the leaderboard. This is the entire industry.

How the Model Zoo Works

You run a tiny research lab with a roster of toy models. Each has five stats — reasoning, speed, knowledge, size, and alignment — and frankly none of them are very good. Your job is to make them less bad and shove them up a ladder of benchmarks for compute, the only currency that matters.

Tap a model to select it, then tap a benchmark's Enter button to run a deterministic auto-battle.
Win and you bank compute; the benchmark gets harder and the next one unlocks.
Tap two models, then Merge them: the offspring blends both parents' stats with a little mutation — and a higher size cost, because of course it does.
Spend compute to spawn fresh randos for the gene pool. Run out of compute with no viable plays and the lab shuts down.
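
For the curious, the merge step could look something like the sketch below. It is a guess at the shape of the thing, not the game's actual code: the Model class, the merge function, plain averaging as the "blend", the mutation range, and the size_tax rule are all hypothetical names and numbers chosen for illustration.

```python
import random
from dataclasses import dataclass

# Stats that get blended; size is handled separately below.
BLENDED_STATS = ("reasoning", "speed", "knowledge", "alignment")


@dataclass(frozen=True)
class Model:
    reasoning: float
    speed: float
    knowledge: float
    size: float
    alignment: float


def merge(a: Model, b: Model, rng: random.Random,
          mutation: float = 1.0, size_tax: float = 1.0) -> Model:
    """Blend two parents into one offspring.

    Assumptions (not taken from the game): the blend is a plain average,
    mutation is uniform noise in [-mutation, +mutation], and the child is
    as large as its bigger parent plus a flat tax.
    """
    blended = {}
    for name in BLENDED_STATS:
        mean = (getattr(a, name) + getattr(b, name)) / 2
        # Clamp at zero so a bad mutation cannot produce a negative stat.
        blended[name] = max(0.0, mean + rng.uniform(-mutation, mutation))
    size = max(a.size, b.size) + size_tax
    return Model(size=size, **blended)


# A seeded generator keeps the mutation reproducible between runs.
parent_a = Model(reasoning=4, speed=6, knowledge=2, size=3, alignment=5)
parent_b = Model(reasoning=6, speed=2, knowledge=4, size=5, alignment=3)
child = merge(parent_a, parent_b, random.Random(0))
print(child)
```

Passing in a seeded random.Random rather than using the module-level generator is a design choice in this sketch: it keeps the mutation repeatable, which fits a game whose battles are advertised as deterministic.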

How a battle resolves

Each benchmark demands a weighted mix of stats. We score your model against those demands, add its alignment as a reliability bonus and subtract size as a serving-cost penalty, compare to the benchmark's threshold, and print a verdict. No dice. It's all just linear algebra you could have done yourself, which is also how the real benchmarks work.
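
Here is the linear algebra you could have done yourself, as a minimal sketch. The resolve function, the align_bonus and size_penalty coefficients, the example weights, and the "win on a tie" rule are assumptions made up for illustration; the text above only says that the weighted stats, an alignment bonus, and a size penalty are compared against a threshold.

```python
def resolve(stats, weights, threshold, align_bonus=0.5, size_penalty=0.5):
    """Score one model against one benchmark. No dice involved.

    stats:     the model's five stats, keyed by name
    weights:   the benchmark's demands, keyed by stat name
    threshold: the score needed to win (ties count as wins here, by assumption)
    """
    # Weighted mix of the stats the benchmark cares about.
    demand = sum(weight * stats[name] for name, weight in weights.items())
    # Alignment helps as a reliability bonus; size hurts as a serving cost.
    score = demand + align_bonus * stats["alignment"] - size_penalty * stats["size"]
    return score, score >= threshold


model = {"reasoning": 6, "speed": 4, "knowledge": 5, "size": 4, "alignment": 6}
benchmark_weights = {"reasoning": 0.5, "speed": 0.25, "knowledge": 0.25}

# 0.5*6 + 0.25*4 + 0.25*5 = 5.25, plus 3.0 bonus, minus 2.0 penalty = 6.25
print(resolve(model, benchmark_weights, threshold=6.0))  # (6.25, True)
```

Because the function has no random input, the same model against the same benchmark always yields the same verdict, which is what "deterministic auto-battle" implies.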

Slop Fact: Merging checkpoints by averaging their weights is a real technique ("model soups"). It works disturbingly often, which is either a profound statement about loss landscapes or proof that nobody understands what's going on. The leaderboard doesn't care which.
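
For reference, the simplest version of that averaging fits in a few lines. This is a toy sketch under stated assumptions: uniform_soup is a hypothetical helper, checkpoints are plain dicts mapping parameter names to flat lists of floats, and every checkpoint is assumed to share the same architecture. Real checkpoints are tensors, and real recipes are pickier about which models go into the pot.

```python
def uniform_soup(checkpoints):
    """Average same-shaped checkpoints parameter by parameter.

    Each checkpoint is a dict of {parameter_name: [float, ...]}.
    All checkpoints must have identical names and lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must share parameter names")

    soup = {}
    for name in names:
        lengths = {len(ckpt[name]) for ckpt in checkpoints}
        if len(lengths) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # zip lines up the i-th weight of every checkpoint; then take the mean.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(column) / len(checkpoints) for column in columns]
    return soup


ckpt_a = {"w": [1.0, 3.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 5.0], "b": [1.0]}
print(uniform_soup([ckpt_a, ckpt_b]))  # {'w': [2.0, 4.0], 'b': [0.5]}
```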