Mar 4, 2026

Open Source LLMs Now Within Single Digits of Proprietary Models — The Gap is Closing

GLM-5, Qwen3.5, and DeepSeek V3 are within 5-8 points of GPT-4o and o1. Here's the state of open source AI in March 2026.

#LLMs #OpenSource #AI

The gap between open source and proprietary large language models has collapsed to single digits. As of February-March 2026, the best open-weight LLMs are performing within 5–8 Quality Index points of closed alternatives like GPT-4o and o1 — down from over 12 points in early 2025.

This represents a fundamental shift in AI development. Open source is no longer lagging behind; it’s competitive, often superior on specific workloads, and delivering roughly 85% cost savings at similar quality.

The Current Leaders

Three model families dominate the conversation:

GLM-5 (Reasoning)

Z.ai’s GLM-5 leads our rankings across practical applications. It hit 68/70 on the Quality Index in January 2026, up from 58 three months prior. On agentic tasks specifically, it scored 96% on τ²-Bench, beating all proprietary alternatives.

Qwen3/Qwen3.5

Alibaba’s Qwen3 family introduced native multimodality and hybrid architectural innovations including Gated DeltaNet layers. The “Thinking” variant integrates reasoning directly into tool-use — the first in its class to do so. Qwen now has 113,000+ model derivatives on Hugging Face, far exceeding Meta’s 27,000.

DeepSeek V3.2 and R1

DeepSeek continues to push reasoning capabilities: V3.2 Speciale hit 90% on LiveCodeBench, making it the top open-source coding LLM as of February 2026. The "DeepSeek moment" of January 2025, when R1 demonstrated ChatGPT-level reasoning at a fraction of typical training costs, was a watershed.

Other Notable Models

  • Llama 4: Meta’s January 2025 release introduced Mixture-of-Experts (MoE) architecture with two variants: Scout (109B total parameters, 17B active) and Maverick (400B total, 17B active). First open-weight models with native multimodality.
  • Kimi K2.5: Hit 96% on AIME 2025 for math reasoning, outperforming most proprietary alternatives.
  • MiniMax M2: Emerging as a strong contender in S-tier rankings alongside GLM-4.7.

The Architecture Shift: Mixture-of-Experts

MoE has become the dominant architecture since DeepSeek-V3 and R1. The newest models push sparsity further than previous generations without sacrificing performance, not only by activating fewer experts per token but also by adopting sparse attention mechanisms.
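The core MoE idea behind those "total vs. active parameter" counts is top-k expert routing: a gate scores all experts for each token, but only the k highest-scoring experts actually run. Here is a minimal, illustrative sketch of that routing step (the expert count, logits, and k=2 are made-up values, not from any of the models above):

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their
    gate scores with a softmax over just those k. The token is
    processed only by the chosen experts, so most parameters
    stay inactive for any given token."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(gate_logits[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

# 8 experts, only 2 active per token -> sparse activation
weights = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Scaling this up is how a model like Llama 4 Maverick can hold 400B total parameters while activating only 17B per token.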

Performance at Scale:

  • Kimi K2.5 scores 96% on AIME 2025 (math)
  • DeepSeek V3.2 Speciale: 90% LiveCodeBench (coding)
  • GLM-4.7: Quality Index 68/70
  • Llama 4 Maverick: unprecedented context window with MoE efficiency

What This Means for Developers

The practical implications are clear:

  1. Cost reduction: Open source saves ~85% at comparable quality levels
  2. Self-hosting viability: Ollama and local deployment tools have made running frontier models on consumer hardware realistic
  3. No vendor lock-in: Control your data, control your inference stack
  4. Performance parity: For many workloads — particularly coding and tool-use — open source matches or exceeds proprietary models
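Point 2 is more concrete than it sounds: Ollama and similar local servers expose an OpenAI-compatible chat endpoint, so self-hosting often means just pointing existing client code at localhost. A hedged sketch, where the URL, port, and model name are assumptions you would adjust for your own setup:

```python
import json
import urllib.request

# Assumed local endpoint: Ollama's default OpenAI-compatible API.
LOCAL_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, prompt):
    """Build the JSON body for an OpenAI-compatible chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def send(url, payload):
    """POST the payload to a local inference server and parse the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("qwen3", "Summarize MoE routing in one sentence.")
# With a local server running, send(LOCAL_URL, body) returns the
# completion; no API key, and no data leaves your machine.
```

Because the request shape matches the hosted APIs, swapping between a proprietary endpoint and a self-hosted one is a one-line URL change, which is what makes the no-lock-in argument practical rather than theoretical.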

The Bottom Line

Open source LLMs released in 2025–2026 demonstrate competitive performance with top proprietary models through better reasoning, specialized capabilities, and massive ecosystem support. Llama 4, Qwen3.5, DeepSeek R1, and GLM-5 each offer distinct strengths — from edge efficiency to multilingual depth.

The question is no longer whether open source works. It's which model fits your specific workload, and how to leverage increasingly sophisticated tooling to self-host at scale.


This article reflects the state of open-source LLMs through March 2026, based on benchmark data from WhatLLM.org, Hugging Face leaderboards, and industry reports.