How Do We Know If AI Is Actually Smart?

Person A has memorized every textbook on earth but has never left their basement. Person B has read three books, navigated a job interview, and assembled IKEA furniture without instructions.

Which one is smarter?

Welcome to AI benchmarking in 2026. We’re still arguing about this.

The Alphabet Soup of Smart

No single “intelligence score” exists for AI models. Instead, competing benchmarks each measure a different flavor:

GPQA Diamond — Graduate-level science. GPT-5.4 scores 94.5%, Claude Opus 4.6 hits 93.2%. These are questions even PhDs get wrong.

AIME — Olympiad-level math. Requires chaining logical steps without losing the thread.

SWE-bench — Real-world debugging. Claude leads at 93.2%. Not “write a function” — debug someone else’s broken codebase.

MMLU-Pro — Broad knowledge across dozens of disciplines.

LiveCodeBench — Continuously updated coding challenges resistant to contamination. A good score here means you’re actually good, not just well-prepared.

And then there’s the wildcard that humbles everyone.

The ARC-AGI Humiliation

ARC-AGI measures fluid intelligence: solving genuinely novel puzzles that a model cannot have seen in training. Foundational LLMs score around 0.68%. Even when LLMs are combined with procedural world models and verification, researchers reach only a little over 33%.

Gemini 3.1 Pro scored 84% on ARC-AGI-2, but for most models the gap between “solving problems you’ve seen before” and “solving problems you’ve never seen” is enormous.

Crystallized vs. Fluid Intelligence

A viral LessWrong essay captured what researchers suspect: LLMs excel at crystallized intelligence — accumulated knowledge and learned patterns. That’s why they crush coding benchmarks and ace math.

But they’re weak at fluid intelligence: reasoning about entirely new situations, the kind that lets a five-year-old learn hide-and-seek in seconds.

As one researcher put it: “LLMs are like that friend who’s read every book but can’t parallel park.”

What the Open-Source Community Compares On

MMLU-Pro — broad knowledge
GPQA Diamond — domain expertise
AIME + LiveCodeBench — math and coding
SWE-bench — real-world building
ARC-AGI — actual thinking (spoiler: barely)

No single score tells the whole story. A model can dominate LiveCodeBench and still fail a novel logic puzzle. It can score 90% on MMLU-Pro and hallucinate when asked something genuinely unexpected.
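To see why a single number hides so much, here is a minimal Python sketch. The two models and all of their scores are made up for illustration; they are not real results, and `summarize` is just a hypothetical helper.

```python
from statistics import mean

# Hypothetical scores (percent) for two imaginary models; not real results.
SCORES = {
    "model_a": {"MMLU-Pro": 90.0, "LiveCodeBench": 88.0, "ARC-AGI": 20.0},
    "model_b": {"MMLU-Pro": 68.0, "LiveCodeBench": 66.0, "ARC-AGI": 64.0},
}


def summarize(scores):
    """Return (average score, weakest benchmark, weakest score) for one model."""
    weakest = min(scores, key=scores.get)
    return mean(scores.values()), weakest, scores[weakest]


for name, scores in SCORES.items():
    avg, weakest, low = summarize(scores)
    print(f"{name}: average {avg:.1f}, weakest on {weakest} ({low:.1f})")
```

Both imaginary models average 66.0, yet one of them collapses on the novel-reasoning benchmark. Reporting the full profile, or at least the weakest score, keeps that difference visible.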

Intelligence Per Dollar

Here’s the genuinely interesting part. You don’t need a $50 million cluster anymore.

Phi-4 (14B) — Microsoft tuned this specifically for efficiency. Dominates the small-model category, runs on consumer hardware.

Llama 4 Scout (4B) — The shocker of 2026. Phone-app size, outperforms models five times larger on certain benchmarks.

Qwen 3.6 (32B) — The sweet spot. Competitive with models three times its size. Runs quantized on a single consumer GPU.

DeepSeek V4 Flash — 284B total parameters but only 13B active per token via sparse architecture. Near-frontier performance on multi-GPU setups.

A quantized 14B model on a consumer GPU can outperform a 70B model from two years ago. The efficiency curve has gotten steep.
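The arithmetic behind those claims is easy to rough out. The sketch below is a back-of-the-envelope estimate, not a benchmark: it counts weights only (no KV cache, activations, or quantization overhead), and the bit widths are assumptions, since the text above does not say which quantization is used. The function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def active_fraction(active_billion: float, total_billion: float) -> float:
    """Share of a sparse model's parameters used for each token."""
    return active_billion / total_billion


# Bit widths are assumptions; the parameter counts come from the text above.
print(f"14B at 4-bit:  {weight_memory_gb(14, 4):.0f} GB")
print(f"14B at 16-bit: {weight_memory_gb(14, 16):.0f} GB")
print(f"70B at 16-bit: {weight_memory_gb(70, 16):.0f} GB")
print(f"284B total, 13B active: {active_fraction(13, 284):.1%} of weights per token")
```

Under these assumptions, a 4-bit 14B model needs about 7 GB for its weights, against 140 GB for a 16-bit 70B model, which is why the smaller one fits on a consumer GPU. The sparse model uses only about 4.6% of its weights per token, but all 284B still have to be stored, which fits with the multi-GPU setups mentioned above.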

The Real Question

If a model can ace every coding benchmark and still fail at novel reasoning, are we measuring intelligence or memorization with extra steps?

ARC-AGI suggests the latter dominates. But the gap is closing — and models are getting better at the flexible, adaptive problem-solving that separates a parrot from a person.

The question isn’t whether AI is smart. It’s what kind of smart it is, how much that matters, and whether our benchmarks tell the whole story.

Spoiler: they don’t.