AI · May 30, 2026

How Do We Know If AI Is Actually Smart? The Messy Truth About Measuring Intelligence in 2026

ARC-AGI, GPQA, MMLU-Pro: these are the benchmarks competing to crown the smartest AI. Spoiler: most of them measure crystallized intelligence, not the kind that lets you survive a bad blind date.

Person A has memorized every textbook on earth but has never left their basement. Person B has read three books, navigated a job interview, and assembled IKEA furniture without instructions.

Which one is smarter?

Welcome to AI benchmarking in 2026. We’re still arguing about this.

The Alphabet Soup of Smart

No single “intelligence score” exists for AI models. Instead, competing benchmarks each measure a different flavor:

GPQA Diamond — Graduate-level science. GPT-5.4 scores 94.5%, Claude Opus 4.6 hits 93.2%. These are questions even PhDs get wrong.

AIME — Olympiad-level math. Requires chaining logical steps without losing the thread.

SWE-bench — Real-world debugging. Claude leads at 93.2%. Not “write a function” — debug someone else’s broken codebase.

MMLU-Pro — Broad knowledge across dozens of disciplines.

LiveCodeBench — Continuously updated coding challenges resistant to contamination. A good score here means you’re actually good, not just well-prepared.
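One common way a continuously updated benchmark can resist contamination is to tag every problem with a release date and score a model only on problems published after its training cutoff. Here is a minimal sketch of that idea. The `Problem` record, the IDs, and the dates are all made up for illustration and are not LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; a real benchmark's fields may differ.
    problem_id: str
    released: date


def uncontaminated(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up problems and dates, for illustration only.
problems = [
    Problem("old-1", date(2024, 3, 1)),
    Problem("new-1", date(2025, 11, 15)),
    Problem("new-2", date(2026, 2, 2)),
]

fresh = uncontaminated(problems, training_cutoff=date(2025, 6, 30))
print([p.problem_id for p in fresh])  # ['new-1', 'new-2']
```

A model cannot have memorized a problem that did not exist when its training data was collected, which is why a score on the filtered set says more about skill than preparation.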

And then there’s the wildcard that humbles everyone.

The ARC-AGI Humiliation

ARC-AGI measures fluid intelligence: solving genuinely novel puzzles with no prior exposure to them. Foundational LLMs score around 0.68%. Even when researchers pair LLMs with procedural world models and verification, they get only a little past 33%.

On the earlier ARC-AGI-2, by contrast, Gemini 3.1 Pro scored 84%. The gap between “solving problems you’ve seen before” and “solving problems you’ve never seen” is enormous.
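To make “novel puzzle” concrete: in the public ARC-AGI-1 and ARC-AGI-2 task format, each task shows a few input/output grid pairs and asks for the output of a new input, scored by exact match. The toy task below is invented for illustration and is far simpler than a real one. The hidden rule (mirror each row) and the helper names are hypothetical.

```python
# Toy task in the spirit of ARC: grids are lists of lists of ints (colors).
# The rule (mirror each row left to right) is invented for illustration.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 4, 5], [6, 0, 0]], "output": [[5, 4, 0], [0, 0, 6]]},
    ],
}


def mirror_rows(grid):
    """Candidate rule: reverse every row."""
    return [list(reversed(row)) for row in grid]


def fits_examples(rule, examples):
    """A rule is acceptable only if it reproduces every example exactly."""
    return all(rule(ex["input"]) == ex["output"] for ex in examples)


def solve_rate(rule, examples):
    """Fraction of grids matched cell for cell; there is no partial credit."""
    hits = sum(rule(ex["input"]) == ex["output"] for ex in examples)
    return hits / len(examples)


print(fits_examples(mirror_rows, task["train"]))  # True
print(solve_rate(mirror_rows, task["test"]))      # 1.0
```

The hard part is not applying `mirror_rows`. It is discovering the rule from two examples, when every task uses a different rule that appears nowhere in the training data.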

Crystallized vs. Fluid Intelligence

A viral LessWrong essay captured what researchers suspect: LLMs excel at crystallized intelligence — accumulated knowledge and learned patterns. That’s why they crush coding benchmarks and ace math.

But they’re weak at fluid intelligence — reasoning about entirely new situations. The kind that lets a five-year-old learn hide-and-seek in seconds.

As one researcher put it: “LLMs are like that friend who’s read every book but can’t parallel park.”

What the Open-Source Community Compares On

  • MMLU-Pro — broad knowledge
  • GPQA Diamond — domain expertise
  • AIME + LiveCodeBench — math and coding
  • SWE-bench — real-world building
  • ARC-AGI — actual thinking (spoiler: barely)

No single score tells the whole story. A model can dominate LiveCodeBench and still fail a novel logic puzzle. It can score 90% on MMLU-Pro and hallucinate when asked something genuinely unexpected.
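A quick sketch of why a single number hides so much. The two models and all of their scores below are made up for illustration and are not real benchmark results.

```python
# Made-up scores for two hypothetical models; not real benchmark results.
scores = {
    "model_a": {"MMLU-Pro": 90.0, "LiveCodeBench": 85.0, "ARC-AGI": 5.0},
    "model_b": {"MMLU-Pro": 62.0, "LiveCodeBench": 60.0, "ARC-AGI": 58.0},
}


def mean_score(profile):
    """Collapse a benchmark profile into one average."""
    return sum(profile.values()) / len(profile)


def weakest(profile):
    """Return the benchmark with the lowest score, plus that score."""
    name = min(profile, key=profile.get)
    return name, profile[name]


for model, profile in scores.items():
    print(model, mean_score(profile), weakest(profile))
# model_a 60.0 ('ARC-AGI', 5.0)
# model_b 60.0 ('ARC-AGI', 58.0)
```

Both models average exactly 60, yet one of them collapses on novel puzzles. Reading the whole profile, or at least the weakest entry, shows what the average hides.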

Intelligence Per Dollar

Here’s the genuinely interesting part. You don’t need a $50 million cluster anymore.

Phi-4 (14B) — Microsoft tuned this specifically for efficiency. Dominates the small-model category, runs on consumer hardware.

Llama 4 Scout (4B) — The shocker of 2026. Phone-app size, outperforms models five times larger on certain benchmarks.

Qwen 3.6 (32B) — The sweet spot. Competitive with models three times its size. Runs quantized on a single consumer GPU.

DeepSeek V4 Flash — 284B total parameters but only 13B active per token via sparse architecture. Near-frontier performance on multi-GPU setups.

A quantized 14B model on a consumer GPU can outperform a 70B model from two years ago. The efficiency curve has gotten steep.
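The back-of-the-envelope math behind those claims is simple: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below assumes 4-bit quantization for the small models and 16-bit weights for the older 70B model. The article does not state either bit width, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only estimate in GB (10^9 bytes); ignores KV cache and overhead."""
    return params_billion * bits_per_weight / 8


def active_fraction(active_billion, total_billion):
    """Share of a sparse model's parameters used for each token."""
    return active_billion / total_billion


# Bit widths are assumptions, not figures from the article.
print(weight_memory_gb(14, 4))    # 7.0   (14B model at 4-bit)
print(weight_memory_gb(32, 4))    # 16.0  (32B model at 4-bit)
print(weight_memory_gb(70, 16))   # 140.0 (70B model at 16-bit)

# Sparse model from the text: 13B active out of 284B total.
print(round(100 * active_fraction(13, 284), 1))  # 4.6 (percent)
print(weight_memory_gb(284, 4))   # 142.0 (all weights must still be stored)
```

Under these assumptions a 14B model fits in about 7 GB, within reach of a consumer GPU. The sparse model touches under 5% of its weights per token but still has to hold all of them in memory, which is consistent with it needing a multi-GPU setup.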

The Real Question

If a model can ace every coding benchmark and still fail at novel reasoning, are we measuring intelligence or memorization with extra steps?

ARC-AGI suggests the latter dominates. But the gap is closing — and models are getting better at the flexible, adaptive problem-solving that separates a parrot from a person.

The question isn’t whether AI is smart. It’s what kind of smart it is, how much that matters, and whether our benchmarks tell the whole story.

Spoiler: they don’t.


Quick Quiz 🧠

1. What score do foundational LLMs get on ARC-AGI?

Answer: ~0.68%

2. What’s the difference between crystallized and fluid intelligence?

Answer: Crystallized = learned knowledge and patterns; Fluid = novel problem-solving on the fly

3. Which model dominates the small-model category in 2026?

Answer: Phi-4


Sources: “Modern AI Benchmarks”; “ARC-AGI-3 Testing”; “What If LLMs Are Mostly Crystallized Intelligence?”; “9 Best Open-Source LLMs in 2026”