AI May 30, 2026 · 5 tags

AI Learner #2: Embeddings & Vector Spaces — How Words Get Meaning

LLMs don't read words. They read numbers. In Part 1 we covered tokens. Now: how those numbers become meaning through embeddings — the hidden geometry inside every language model.

#AI #LLMs #Education #Embeddings #Vector Spaces

You’ve got numbers now. Tokens, IDs, integers: the output of a tokenizer, the subject of Part 1. But what happens when you feed the number 12 into a model? It doesn’t consult a dictionary. It doesn’t run a spell checker. It looks up a vector, a list of numbers, and that vector carries meaning.

This is where things get interesting. This is where language becomes geometry.

Word embeddings in a vector space — similar meanings cluster together

The Problem: Computers Are Terrible at Words

Let me ask you a question: how are “happy” and “joyful” more similar than “happy” and “refrigerator”?

You know the answer instantly. But a computer? Not so much. To a computer, “happy” is the string h-a-p-p-y and “joyful” is j-o-y-f-u-l. Those share one letter, y. Meanwhile, “happy” and “refrigerator” also share exactly one letter, a. So by a naive character count, “happy” is no closer to “joyful” than it is to “refrigerator.”

But that’s not how meaning works.

What we need is a way to represent words where similar meaning = close together. Enter embeddings.

What Is an Embedding?

An embedding is a dense vector of numbers that represents a word in a multi-dimensional space. Where a one-hot encoding might use a 50,000-dimensional vector with a single 1 and everything else 0, an embedding uses maybe 512 or 1,536 numbers — all of them filled with real values.
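To make the contrast concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the numbers in `embedding_table` are invented for illustration. Real models use vocabularies of tens of thousands of tokens and learn the table during training.

```python
# Toy vocabulary (an assumption for illustration only).
vocab = ["king", "queen", "man", "woman"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: all zeros except a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

# Hypothetical embedding table: one dense row per word. Values are made up.
embedding_table = [
    [0.80, 0.30],   # king
    [0.75, 0.35],   # queen
    [0.60, -0.20],  # man
    [0.55, -0.15],  # woman
]

def embed(word):
    """Dense representation: look up the row that belongs to the word's ID."""
    return embedding_table[word_to_id[word]]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

print(one_hot("queen"))                        # [0, 1, 0, 0]
print(dot(one_hot("king"), one_hot("queen")))  # 0
print(embed("queen"))                          # [0.75, 0.35]
```

The dot product of any two different one-hot vectors is always 0, so a one-hot encoding cannot say that “king” is nearer to “queen” than to anything else. The dense rows can.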

Think of it like a GPS coordinate for a word’s meaning.

In a two-dimensional analogy:

  • “King” might sit at (0.8, 0.3)
  • “Queen” at (0.75, 0.35)
  • “Man” at (0.6, -0.2)
  • “Woman” at (0.55, -0.15)

The coordinates themselves are meaningless — no single number means “royalty” or “gender.” But the relationships between coordinates encode real semantic structure. “King” and “Queen” are close. “Man” and “Woman” are close. And the vector from “Man” to “King” is roughly the same as the vector from “Woman” to “Queen.”
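You can check that last claim with the toy coordinates above. This is a small sketch in plain Python. The helper name `offset` is made up for this example, and the results are rounded only to hide floating-point noise.

```python
# Toy coordinates from the two-dimensional analogy above.
king, queen = (0.8, 0.3), (0.75, 0.35)
man, woman = (0.6, -0.2), (0.55, -0.15)

def offset(start, end):
    """Vector that points from `start` to `end`, rounded to 2 decimals."""
    return tuple(round(e - s, 2) for s, e in zip(start, end))

print(offset(man, king))     # (0.2, 0.5)
print(offset(woman, queen))  # (0.2, 0.5)
```

In this hand-picked example the two offsets match exactly. In a learned embedding space they only match approximately.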

The Geometry of Meaning

This is the core insight: meaning is geometry. In the embedding space, similar concepts cluster together. You can measure similarity with cosine similarity, which is the cosine of the angle between two vectors. A small angle gives a value near 1, which means the words are semantically similar. A right angle gives 0, and vectors pointing in opposite directions give -1.
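Cosine similarity is short enough to write by hand. The sketch below uses only the standard library and the toy two-dimensional coordinates from the analogy above, so the numbers are illustrative rather than taken from a real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king = [0.8, 0.3]
queen = [0.75, 0.35]
man = [0.6, -0.2]

print(round(cosine_similarity(king, queen), 3))  # close to 1: small angle
print(round(cosine_similarity(king, man), 3))    # lower: larger angle
print(cosine_similarity([1, 0], [0, 1]))         # 0.0: perpendicular vectors
```

Because the result depends only on the angle, scaling a vector up or down does not change its similarity to anything. That is one reason cosine similarity is a common default for comparing embeddings.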

This isn’t just theory. It’s how search engines work, how recommendation systems suggest products, how chatbots figure out that “my wifi is down” and “can’t connect to internet” are asking the same thing.

Here’s where it gets wild. Because embeddings are geometric, you can do vector arithmetic:

vec("King") - vec("Man") + vec("Woman") ≈ vec("Queen")

This relationship was famously demonstrated in a 2013 paper by Tomas Mikolov and colleagues, and it became one of the first clues that embeddings capture more than raw word co-occurrence. The model had learned that the offset between gendered words is roughly consistent across concepts such as royalty.
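Here is a minimal sketch of that arithmetic, using the toy coordinates from earlier plus one made-up distractor word, “apple”. The function name `analogy` is hypothetical. Real toolkits expose similar lookups under their own names and run them over learned vectors with hundreds of dimensions.

```python
import math

# Toy vectors from the two-dimensional analogy, plus a made-up distractor.
vectors = {
    "king": (0.8, 0.3),
    "queen": (0.75, 0.35),
    "man": (0.6, -0.2),
    "woman": (0.55, -0.15),
    "apple": (-0.5, 0.7),  # hypothetical unrelated word
}

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Note that the three input words are excluded from the candidates. That is standard practice in analogy lookups, and it is one reason the trick can look cleaner than the underlying geometry really is.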

But Is This Magic? A Quick Reality Check

That “king − man + woman ≈ queen” trick is elegant, but it’s also somewhat overhyped. The relationships aren’t perfectly consistent across all word pairs. As researchers like Mike X Cohen have pointed out, this arithmetic analogy works in some directions but breaks down in others. Still, the underlying principle holds: embeddings encode structured relationships between concepts, even if the math doesn’t always produce poetic equations.

How Are Embeddings Actually Made?

You might be thinking: “Great. But who decided that ‘King’ should be at (0.8, 0.3)?”

Nobody manually set these values. Embeddings are learned.

The most famous early approach was Word2Vec (2013), which learned embeddings by training on massive text corpora.

Word2Vec neural network architecture for learning word embeddings

It used one of two neural network architectures:

  • CBOW (Continuous Bag-of-Words): Given the words around a target word, predict the target.
  • Skip-Gram: Given a word, predict the words that appear near it.

The model trains by scanning through billions of words, making predictions, and nudging its vectors whenever a prediction is wrong. Nobody selects vectors by hand: after training, every word’s vector has been adjusted, step by step, into the position that best explains how words are used in the training data.
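The two setups differ mainly in how they slice a sentence into training examples. The sketch below shows only that slicing step, not the neural network or the weight updates. The function names and the `window` parameter are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram examples: (center word, one nearby word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW examples: (list of surrounding words, target word)."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = "the cat sat on the mat".split()

print(skipgram_pairs(sentence, window=1)[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]

print(cbow_examples(sentence, window=1)[1])
# (['the', 'sat'], 'cat')
```

Each pair becomes one prediction task. Words that keep showing up in the same contexts get pushed toward similar vectors, which is where the clustering comes from.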

A year later, in 2014, GloVe (Global Vectors for Word Representation) took a different approach. Instead of predicting words one window at a time, it built a co-occurrence matrix (how often words appear together across the whole corpus) and factorized it to find dense representations. The resulting vectors behave much like Word2Vec’s, but they come from a different mathematical angle.
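The raw material for this approach is just a table of counts. The sketch below builds those counts for a two-sentence toy corpus. It covers only the counting step, not GloVe’s actual training objective, and the function name and `window` parameter are assumptions.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("sat", "on")])   # 2
print(counts[("cat", "dog")])  # 0
```

“Cat” and “dog” never appear together here, yet both sit next to “the” and “sat”. Shared neighbors like these are what a factorization can turn into nearby vectors.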

Static vs. Contextual Embeddings

The word 'bank' produces different vectors depending on context: river vs. finance

Here’s an important distinction. Word2Vec and GloVe produce static embeddings — the word “bank” always has the same vector, whether you’re talking about a river bank or a financial bank. That’s a problem.

Modern LLMs use contextual embeddings. A transformer still starts from one fixed vector per token, but its attention layers then mix in information from the surrounding tokens, so the representation of each word depends on its context. “Bank” in “river bank” ends up with one vector. “Bank” in “bank vault” ends up with another. This is a fundamental improvement, and it’s one of the reasons transformers beat older architectures so decisively.
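To get a feel for how context can change a vector, here is a deliberately stripped-down sketch. It blends each word’s static vector with its neighbors using softmax-weighted dot products. That is a heavily simplified stand-in for attention, with no learned weights, no layers, and no positional information. The vectors and the function name `contextual_vector` are invented for illustration.

```python
import math

# Hypothetical static vectors: one fixed entry per word.
static = {
    "river": (1.0, 0.0),
    "vault": (0.0, 1.0),
    "bank": (0.5, 0.5),
}

def contextual_vector(tokens, index):
    """Blend the word at `index` with its sentence, weighted by similarity."""
    query = static[tokens[index]]
    # Similarity of the chosen word to every word in the sentence (itself included).
    scores = [sum(q * k for q, k in zip(query, static[t])) for t in tokens]
    # Softmax turns the scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the static vectors.
    return tuple(
        sum(w * static[t][d] for w, t in zip(weights, tokens))
        for d in range(len(query))
    )

print(contextual_vector(["river", "bank"], 1))  # pulled toward "river"
print(contextual_vector(["bank", "vault"], 0))  # pulled toward "vault"
```

The same word, “bank”, comes out with two different vectors because its neighbors differ. A real transformer does this with learned projections, repeated across many layers.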

Why Embeddings Changed Everything

Before embeddings, NLP was dominated by bag-of-words models. These counted how many times each word appeared in a document, creating massive sparse vectors. A document might be represented by a 50,000-dimensional vector in which more than 99% of the entries are zero. The counts were exact, but the representation was clumsy and carried zero understanding of meaning.

Embeddings replaced this with compact, meaningful representations:

Property      Bag-of-Words    Embeddings
Dimensions    50,000+         512–1,536
Sparsity      99%+ zeros      Dense (all values filled)
Meaning       None            Semantic relationships
Similarity    Exact match     Cosine similarity

The shift was revolutionary. Embeddings made it possible for machines to understand that “the cat sat on the mat” and “the feline rested on the rug” are saying the same thing.
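You can see the bag-of-words blind spot on those two sentences directly. The sketch below counts words after dropping a tiny stopword list (an assumption for this example) and compares the counts with cosine similarity.

```python
import math
from collections import Counter

STOPWORDS = {"the", "on"}  # tiny illustrative list, not a real stopword set

def bag_of_words(text):
    """Count content words, ignoring the stopwords above."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

s1 = bag_of_words("the cat sat on the mat")
s2 = bag_of_words("the feline rested on the rug")

print(cosine(s1, s2))  # 0.0
```

No content word is shared, so the score is exactly zero even though the sentences mean nearly the same thing. An embedding-based comparison can score them as close because “cat” and “feline” sit near each other in the vector space.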

Where Embeddings Show Up Everywhere

Embeddings are the hidden infrastructure of modern AI. They power:

  • Semantic search — Google and Perplexity use embeddings to match your query to the most relevant documents, not just keyword hits
  • Recommendation systems — YouTube, Netflix, and Spotify embed your preferences and items in shared vector spaces to find things you’ll like
  • Chatbots — When a chat model retrieves similar past interactions, it’s using embeddings
  • Duplicate detection — A Q&A site such as Stack Overflow can surface “similar questions” by embedding question text and comparing the resulting vectors
  • Document clustering — Grouping articles by topic, automatically, by comparing their embedding vectors

In every case, the same principle applies: turn things into numbers where distance equals meaning.
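Most of these systems share the same core loop: embed the query, compare it against stored vectors, return the closest matches. Here is a minimal sketch of that loop. The three-dimensional vectors are hand-written stand-ins, since a real system would get them from an embedding model and usually store them in a vector index. The names `documents`, `query_vector`, and `search` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed embeddings (values are made up).
documents = {
    "can't connect to internet": [0.9, 0.1, 0.0],
    "how to reset my password": [0.1, 0.9, 0.1],
    "best pizza near me": [0.0, 0.1, 0.9],
}

# Stand-in for the embedding of the query "my wifi is down".
query_vector = [0.85, 0.2, 0.05]

def search(query_vec, docs, top_k=2):
    """Return the top_k document texts ranked by cosine similarity to the query."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, docs[d]), reverse=True)
    return ranked[:top_k]

print(search(query_vector, documents)[0])  # can't connect to internet
```

The query and the top result share no keywords. The match comes entirely from the vectors pointing in nearly the same direction.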

What Comes Next

Embeddings give words meaning. But meaning alone isn’t enough — you need a way to process it. That’s where transformers and attention come in. They’re the architecture that reads these embeddings, weighs their importance, and produces the text you’re reading right now.

Coming up: how attention lets models focus on what matters.


Quick Quiz 🧠

1. What’s the difference between a one-hot encoding and an embedding?

Answer: One-hot uses a huge vector with mostly zeros (one “hot” entry per word, no relationship between words). Embeddings use a dense, lower-dimensional vector where the values encode semantic meaning and relationships.

2. What does “contextual embedding” mean, and why is it better than static embeddings?

Answer: In contextual embeddings (used by transformers), each word gets a different vector depending on its surrounding context. “Bank” in “river bank” and “bank vault” produce different vectors. Static embeddings (Word2Vec, GloVe) give every word one fixed vector regardless of context, so they cannot tell those two senses apart. Contextual embeddings can, which is why they handle ambiguous words better.

3. Name two real-world applications of embeddings beyond language models.

Answer: Recommendation systems (YouTube, Netflix, Spotify), semantic search (Google, Perplexity), duplicate question detection, document clustering — any system that needs to measure similarity between items.


Sources: Embeddings: Meaning, Examples and How To Compute (Arize AI), A Gentle Introduction to Word Embedding and Text Vectorization (Machine Learning Mastery), Demystifying Embedding Spaces using Large Language Models (arXiv)