Benchmarks

Compare AI model performance across standard benchmarks and datasets.

GSM8K Leaderboard

Grade School Math 8K (GSM8K) is a benchmark of grade-school math word problems used to measure multi-step mathematical reasoning.
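The sketch below illustrates one common way to compute GSM8K accuracy: exact match on the final numeric answer. It relies on the fact that GSM8K reference solutions end with a line of the form `#### <answer>`. The rule for reading a model's answer (take the last number in its output) is an assumption made for illustration, and real evaluation harnesses use their own prompts and extraction rules. All function names here are hypothetical.

```python
from __future__ import annotations

import re

# GSM8K reference solutions end with a line of the form "#### <answer>".
_REF_PATTERN = re.compile(r"####\s*(-?[\d,]*\.?\d+)")
# Any number in free text, allowing a sign, thousands separators, and decimals.
_NUM_PATTERN = re.compile(r"-?[\d,]*\.?\d+")


def _to_number(text: str) -> float:
    """Parse a numeric string, dropping thousands separators such as '1,000'."""
    return float(text.replace(",", ""))


def reference_answer(solution: str) -> float:
    """Extract the final answer from a GSM8K-style reference solution."""
    match = _REF_PATTERN.search(solution)
    if match is None:
        raise ValueError("reference solution has no '####' answer line")
    return _to_number(match.group(1))


def predicted_answer(output: str) -> float | None:
    """Hypothetical heuristic: treat the last number in the model output as its answer."""
    numbers = _NUM_PATTERN.findall(output)
    return _to_number(numbers[-1]) if numbers else None


def exact_match_accuracy(outputs: list[str], solutions: list[str]) -> float:
    """Percentage of problems whose predicted final answer equals the reference answer."""
    if not outputs or len(outputs) != len(solutions):
        raise ValueError("need equal-length, non-empty lists")
    correct = sum(
        predicted_answer(out) == reference_answer(sol)
        for out, sol in zip(outputs, solutions)
    )
    return 100.0 * correct / len(outputs)
```

An output with no number at all yields `None`, which never equals a reference answer, so it counts as incorrect.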

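The leaderboard ranks models (#) and lists each model's provider, parameter count, score, and date. Scores can optionally be normalized by parameter count or shown as deltas from the best-scoring model.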

Data freshness: Last updated Apr 8, 2026
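The two leaderboard display options, normalizing by parameter count and showing deltas versus the best model, can be sketched as follows. The page does not define the normalization, so dividing the score by the parameter count in billions is an assumption. The `Entry` type and the function names are also hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entry:
    model: str
    params_billions: float  # parameter count in billions (assumed unit)
    score: float            # benchmark score as a percentage (0-100)


def deltas_vs_best(entries: list[Entry]) -> dict[str, float]:
    """Percentage-point gap between each model and the top score (best model -> 0.0)."""
    if not entries:
        return {}
    best = max(e.score for e in entries)
    return {e.model: e.score - best for e in entries}


def score_per_billion_params(entries: list[Entry]) -> dict[str, float]:
    """Hypothetical normalization: score divided by parameter count in billions.

    Entries with an unknown or non-positive parameter count are skipped.
    """
    return {
        e.model: e.score / e.params_billions
        for e in entries
        if e.params_billions > 0
    }
```

Because deltas are measured in percentage points from the maximum, the top model always shows 0.0 and every other model shows a negative value.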

Custom Benchmark Set

Coming soon: build your own benchmark comparison by selecting any combination of datasets, so you can compare models across multiple dimensions.

Benchmark Methodology

Learn how benchmarks are conducted and scored.

We collect benchmark data from official model releases, research papers, and community evaluations. All scores are normalized to percentages for easier comparison.
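A minimal sketch of the percentage normalization described above, assuming that sources report scores either as fractions in [0, 1], as percentages, or as raw correct/total counts. The function names and the `scale` labels are hypothetical.

```python
from __future__ import annotations


def to_percentage(value: float, scale: str) -> float:
    """Convert a reported score to a 0-100 percentage.

    scale: "fraction" for scores reported in [0, 1],
           "percent" for scores already reported in [0, 100].
    """
    if scale == "fraction":
        pct = value * 100.0
    elif scale == "percent":
        pct = value
    else:
        raise ValueError(f"unknown scale: {scale!r}")
    if not 0.0 <= pct <= 100.0:
        raise ValueError(f"score out of range after conversion: {pct}")
    return pct


def accuracy_percentage(correct: int, total: int) -> float:
    """Percentage of items answered correctly, from a raw correct/total count."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    return 100.0 * correct / total
```

Rejecting out-of-range values after conversion catches a common mistake, such as a score already given in percent being labeled as a fraction.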