AI May 30, 2026 · 4 tags

AI Learner #1: How Language Becomes Numbers

LLMs don't read words. They read numbers. Here's how your text gets chopped into tokens, compressed into IDs, and fed to a model that speaks only in integers.

#AI #LLMs #Education #Tokens

You’re reading this on a screen made of tiny lights, each one on or off. Your phone stores everything as numbers — photos, music, voice. Everything is a number eventually.

But LLMs take this further. They don’t just store numbers. They think in numbers. And the bridge between your sentences and a model’s mathematical brain is a process that’s both clever and a little weird.

Let’s walk through it.

Words Are Not the Right Unit

[Figure: Text being tokenized into numbered fragments]

Suppose you ask an LLM to write about “playing video games.” You might think the model sees three words: “playing,” “video,” “games.”

It doesn’t. It sees tokens — and “video games” might become two tokens, one token, or three, depending on the tokenizer. “Unhappiness” might be one token, or “un” + “happiness.” It depends.

Tokens are the smallest unit a language model works with. In English prose, one token averages roughly 0.75 words, or about four characters. That 1,000-word article? Roughly 1,300 tokens. Compared with reading one character at a time, that makes the sequence about 4x shorter. And it matters, because every token costs compute.
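The rule of thumb above is easy to turn into a quick estimator. This is only a sketch: the two ratios are the rough English averages quoted here, not properties of any particular tokenizer, and the function names are made up for illustration.

```python
# Rule-of-thumb ratios for English prose (rough averages, not exact values).
WORDS_PER_TOKEN = 0.75
CHARS_PER_TOKEN = 4


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate a token count from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_chars(char_count: int) -> int:
    """Estimate a token count from a character count."""
    return round(char_count / CHARS_PER_TOKEN)


print(estimate_tokens_from_words(1000))  # 1333, i.e. "roughly 1,300"
print(estimate_tokens_from_chars(4000))  # 1000
```

For a real count, you would run the text through the actual tokenizer of the model you plan to use.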

The Tokenizer: Your Text’s Translator

Before the model sees your words, a tokenizer chops them up and assigns each piece a number from a fixed vocabulary. Modern LLMs typically have vocabularies ranging from about 32,000 tokens to a few hundred thousand.

The dominant algorithm is Byte Pair Encoding (BPE). It works like this:

  1. Start with every character (or byte) as its own token.
  2. Count every pair of adjacent tokens in your training data and find the most frequent one (early on, pairs like “t” + “h” or “e” + “r”).
  3. Merge that pair into a single new token and add it to the vocabulary.
  4. Repeat until your vocabulary hits the target size.

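This is why common fragments like “ing,” “tion,” “un,” and “ed” become their own tokens. Longer fragments are built up over several rounds: for example, “i” + “n” might merge first, then “in” + “g.” The algorithm discovers these patterns from massive text corpora.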
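The merge loop described above can be sketched in a few dozen lines. This is a minimal teaching version, assuming a tiny made-up word list and character-level starting tokens. Production tokenizers usually work on bytes, pre-split the text, and are heavily optimized. The names train_bpe, merge_pair, and encode are illustrative, not from any library.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: every word starts as a sequence of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Step 3: apply the merge everywhere in the corpus.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            new_corpus[merge_pair(tokens, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    tokens = tuple(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return list(tokens)


merges = train_bpe(["playing", "singing", "ringing", "king"], num_merges=2)
print(merges)                   # [('i', 'n'), ('in', 'g')]
print(encode("bring", merges))  # ['b', 'r', 'ing']
```

Notice that “bring” never appeared in the training words, yet it still picks up the “ing” token. That is the point of subword vocabularies: unseen words fall back to known pieces.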

[Figure: BPE merging character pairs into larger tokens]

That’s why “unhappiness” might split into [“un”, “happiness”] — because “happiness” is frequent, but “unhappiness” might not be.

But here’s the thing: tokenizers differ from model to model. GPT-4, Claude, and Llama 3 can tokenize the same text differently. “Transformers” might be one token in one model and two in another. Each tokenizer comes from its own training corpus and design choices, so each ends up with its own vocabulary.
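To see how two vocabularies can split the same word differently, here is a toy sketch. Both vocabularies are invented for illustration and have nothing to do with the real GPT-4, Claude, or Llama 3 vocabularies. The greedy longest-match splitter is a simplification standing in for a real BPE encoder.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest matching vocabulary entry.

    Falls back to a single character when nothing in the vocabulary matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens


# Two hypothetical vocabularies, as if learned from different corpora.
vocab_a = {"transformers", "un", "happiness"}
vocab_b = {"transform", "ers", "un", "happi", "ness"}

print(greedy_tokenize("transformers", vocab_a))  # ['transformers']
print(greedy_tokenize("transformers", vocab_b))  # ['transform', 'ers']
print(greedy_tokenize("unhappiness", vocab_a))   # ['un', 'happiness']
print(greedy_tokenize("unhappiness", vocab_b))   # ['un', 'happi', 'ness']
```

Same text, different token counts, purely because the vocabularies differ.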

Tokens Are Just Numbers (and That’s Fine)

Once tokenized, each piece becomes an integer ID. “Hello” might be token 12. “World” might be token 8,342. The model doesn’t care that they represent words. It just sees integers and looks up their corresponding embeddings: high-dimensional vectors that encode meaning.
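Here is a minimal sketch of that lookup. The four-entry vocabulary, the IDs, and the four-dimensional random vectors are all made up for illustration. In a real model the table has one row per vocabulary entry and its values are learned during training.

```python
import random

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3}

EMBEDDING_DIM = 4  # toy size; real models use far more dimensions

# One vector per token ID. Random here; learned in a real model.
random.seed(0)
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
    for _ in vocab
]


def to_ids(tokens):
    """Map token strings to their integer IDs."""
    return [vocab[token] for token in tokens]


def embed(ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_table[token_id] for token_id in ids]


ids = to_ids(["Hello", ",", " world", "!"])
vectors = embed(ids)

print(ids)              # [0, 1, 2, 3]
print(len(vectors[0]))  # 4
```

The lookup is nothing more than indexing into a table by ID, which is why the model never needs to see the original characters.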

This is the magic step: turning language into mathematics. Each token maps to a point in a space with hundreds or thousands of dimensions, where tokens with related meanings cluster together. “King” and “queen” are close. “Paris” and “France” are close.
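“Close” usually means a high cosine similarity between two vectors. The sketch below uses hand-picked three-dimensional vectors, invented for illustration and not taken from any real model, just to show how the comparison works.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings; real ones are learned and much longer.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

print(royal > fruit)  # True: "king" sits nearer to "queen" than to "banana"
```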

And the model learns to navigate this space.

Why This Matters

Understanding tokens matters because everything about LLMs flows from this:

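  • Cost: Most LLM APIs charge per token, so text that splits into many small fragments costs more to process.
  • Context: Context windows are measured in tokens, not words. A 128K-token window holds roughly 96K English words. Other languages can have very different ratios; Chinese, for example, often lands near one token per character, depending on the tokenizer.
  • Behavior: Tokenization shapes what the model can easily see. A word may arrive as one opaque token rather than a string of letters, which is one reason tasks like spelling or counting the letters in a word can be surprisingly hard.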
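The cost and context points come down to simple arithmetic. In this sketch the 0.75 words-per-token ratio is the rough English average from earlier, and the price is a placeholder argument, not any provider's real rate.

```python
WORDS_PER_TOKEN = 0.75  # rough English average, not an exact figure


def words_that_fit(context_tokens: int) -> int:
    """Estimate how many English words fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)


def estimate_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Estimate the cost of processing `token_count` tokens."""
    return token_count / 1_000_000 * price_per_million_tokens


print(words_that_fit(128_000))  # 96000

# Hypothetical price of 2.00 per million tokens, for illustration only.
print(round(estimate_cost(1_300, 2.00), 4))  # 0.0026
```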

But tokens are just the beginning. Once your text becomes numbers, the model needs to actually do something with them. That’s where the real magic happens.

Coming up: how transformers read and understand those numbers.


Quick Quiz 🧠

1. Roughly how many tokens is a 1,000-word English article?

Answer: ~1,300 tokens

2. What is BPE?

Answer: Byte Pair Encoding, an algorithm that repeatedly merges the most frequent pair of adjacent tokens in the training data, building subword tokens up from single characters

3. Why does the same text have different token counts across models?

Answer: Each model has its own tokenizer trained on different corpora, producing different vocabularies


Sources: “What Are LLM Tokens? The Complete 2026 Guide”; “From Bytes to BPE: A From-Scratch Tour of LLM Tokenization”; “How LLMs Work: Tokens, Embeddings, and Transformers”