AI May 31, 2026 · 5 tags

AI Learner #3: Transformers & Attention — How Models Focus on What Matters

Tokens become numbers, numbers become meaning through embeddings. But how does an LLM actually *read* them? Enter the transformer: the architecture behind every modern AI model, built on one elegant idea — attention.

#AI #LLMs #Education #Transformers #Attention

In Part 1, we saw how text becomes numbers. In Part 2, we learned how those numbers encode meaning. But numbers sitting in a vector don’t do anything by themselves. Something has to read them, weigh them, and decide which ones matter most for what comes next.

That something is the transformer — an architecture so powerful it rewired the entire field of AI. And its secret weapon is a mechanism called attention.

Let’s walk through it.

The Problem: How Do You Read a Sentence?

Imagine you’re reading this sentence right now:

“The bank of the river was flooded because it had been raining for days.”

Your brain instantly knows that “it” refers to “the river,” not “the bank.” You don’t have to think about it — your brain naturally weighs which words are most relevant to understanding each other.

Now imagine doing that for every single token in a document of 128,000 tokens. At each step, the model needs to ask: which other tokens should I pay attention to?

This is the attention problem. And the transformer solves it in one brilliant move.

The Transformer: A Different Kind of Neural Network

Before transformers, the dominant architectures were RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These processed text one word at a time, left to right, like reading with a finger.

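The problem? Information degrades over distance. By the time an RNN reaches the end of a long sentence, much of what the beginning said has faded from its hidden state. Training makes it worse: the learning signal shrinks as it is propagated back through many time steps (the vanishing gradient problem), so long-range dependencies are notoriously difficult to learn. LSTMs ease this but do not eliminate it.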

Transformers throw this away entirely. Instead of reading sequentially, they process the entire sequence at once, and every token can look directly at every other token. Because the work for all positions runs in parallel, transformers train much faster than RNNs on modern hardware.

But looking at everything at once creates a new problem: how do you know which relationships matter? That’s where attention comes in.

What Is Attention?

Attention is a mechanism that lets the model dynamically weight the importance of every other token when processing a given token.

Here’s the intuition: when the model reads the word “it” in the example above, attention assigns a high weight to “river” and a low weight to “bank.” It’s not hardcoded — the model learns these weights during training.

The Three Ingredients: Q, K, V

[Figure: the QKV attention mechanism, with query, key, and value vectors]

Every token gets three vectors, each computed from its embedding by a projection matrix learned during training:

  • Query (Q): What am I looking for?
  • Key (K): What do I contain?
  • Value (V): What information do I carry?

The attention score between two tokens is computed as:

Attention(Q, K, V) = softmax(Q · K^T / √d) · V

Let me translate that from math to plain English:

  1. Dot product of Q and K: How much does token A’s query match token B’s key? High score = relevant.
  2. Divide by √d: A scaling factor, where d is the dimension of the query and key vectors. Without it, dot products grow with d and push softmax into a saturated range where gradients become tiny.
  3. Softmax: Converts each token’s raw scores into weights that are all positive and sum to 1.0.
  4. Multiply by V: Blend together the values of all tokens, weighted by their attention scores.

The result: a new representation of the original token that has incorporated information from the most relevant tokens in the sequence.
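The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names are my own, and the Q, K, V matrices are random stand-ins for what a trained model would compute.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # steps 1 and 2: match, then scale
    weights = softmax(scores, axis=-1)  # step 3: each row becomes a distribution
    output = weights @ V                # step 4: weighted blend of the values
    return output, weights

# Toy input: 3 tokens, vectors of dimension 4 (random values, for illustration only)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each row sums to 1, up to rounding
print(output.shape)          # (3, 4)
```

Row i of `weights` says how much token i attends to each token, and row i of `output` is that token’s new representation.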

A Concrete Example

When processing “it” in “The bank of the river was flooded because it had been raining for days”:

  • The query for “it” might be something like [looking_for: river, looking_for: flooded]
  • The key for “river” matches well → high attention score
  • The key for “bank” matches poorly → low attention score
  • The final output for “it” is mostly the value of “river,” slightly mixed with other relevant tokens

The model learns these Q, K, V transformations from data. Nobody hand-designs them.
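To make the example tangible, here is a toy version with hand-picked numbers. Everything in it is an assumption for illustration: the 2-dimensional keys, the one-hot values, and the query for “it” are invented so that the query lines up with the key for “river.” A real model would learn vectors with hundreds or thousands of dimensions.

```python
import numpy as np

tokens = ["bank", "river", "flooded"]

# Hand-picked keys and values, purely illustrative.
# In a real model these come from learned projections of the embeddings.
K = np.array([
    [0.0, 1.0],   # key for "bank"
    [2.0, 0.0],   # key for "river"
    [1.0, 0.5],   # key for "flooded"
])
V = np.array([
    [1.0, 0.0, 0.0],   # value for "bank"
    [0.0, 1.0, 0.0],   # value for "river"
    [0.0, 0.0, 1.0],   # value for "flooded"
])
q_it = np.array([2.0, 0.0])  # hypothetical query for "it"

# Scaled dot product of the query with every key, then softmax
scores = K @ q_it / np.sqrt(len(q_it))
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The output for "it" is a blend of the values, weighted by attention
output_it = weights @ V

for token, w in zip(tokens, weights):
    print(f"{token:>8}: {w:.2f}")
```

Because the values here are one-hot, `output_it` is simply the attention weights themselves, which makes it easy to see that “river” dominates the blend.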

Self-Attention vs. Cross-Attention

There are two main types of attention in transformers:

Self-Attention (Inside the Model)

Within a single sequence, every token attends to every other token. This is self-attention — the model looking inward at its own input. When processing “it,” it looks at “the,” “bank,” “of,” “river,” “was,” “flooded,” etc., and decides which ones matter most.

Self-attention is the core of the transformer, the architecture introduced in the 2017 paper “Attention Is All You Need.”

Cross-Attention (Between Sequences)

In encoder-decoder architectures (like early machine translation models), one sequence attends to another. The decoder’s queries attend to the encoder’s keys and values. This is cross-attention — the model looking outward at a different sequence.
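In code, the only difference between the two is where the queries, keys, and values come from. The sketch below assumes random stand-ins for the encoder and decoder states and for the projection matrices, and it reuses one set of matrices for both cases purely for brevity (real models keep separate weights for each attention layer).

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
encoder_states = rng.normal(size=(5, d))  # e.g. 5 source-language tokens
decoder_states = rng.normal(size=(3, d))  # e.g. 3 target-language tokens so far

# Random stand-ins for learned projection matrices
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Self-attention: queries, keys, and values all come from the same sequence
self_out, self_weights = attention(
    decoder_states @ W_q, decoder_states @ W_k, decoder_states @ W_v
)

# Cross-attention: queries from the decoder, keys and values from the encoder
cross_out, cross_weights = attention(
    decoder_states @ W_q, encoder_states @ W_k, encoder_states @ W_v
)

print(self_weights.shape)   # (3, 3): decoder tokens attend to decoder tokens
print(cross_weights.shape)  # (3, 5): decoder tokens attend to encoder tokens
```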

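Most modern LLMs (GPT, Claude, Llama) are decoder-only, so they rely on self-attention alone, with a causal mask that lets each token attend only to itself and earlier tokens. But understanding cross-attention helps explain how some multimodal models (like ones that process both images and text) connect one input to another.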
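One detail worth seeing in code: decoder-only models generate text left to right, so their self-attention is usually causal, meaning a token may attend to itself and to earlier tokens but not to later ones. A minimal sketch, assuming we skip the learned projections and use the same random matrix for Q, K, and V:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Self-attention where token i can only see tokens 0..i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True above the diagonal marks "future" positions (j > i)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # exp(-inf) = 0 after softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens; projections omitted for brevity

output, weights = causal_self_attention(X, X, X)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

The first token can only attend to itself, so its row of weights puts all of its mass on position 0.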

Multi-Head Attention: Many Lenses at Once

The original transformer didn’t just use one set of Q, K, V. It used multiple heads — each head learns a different perspective on the data.

One head might focus on syntactic relationships (“the” modifies “bank”). Another might focus on semantic relationships (“bank” relates to “river”). A third might capture longer-range dependencies.

Each head produces its own attention-weighted output, and these are concatenated and projected together. It’s like having multiple specialists looking at the same data, each from a different angle, then combining their insights.

This is multi-head attention, and it’s one of the key reasons transformers are so powerful. Different heads can capture different types of relationships simultaneously.
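Here is one way the split, attend, concatenate, and project steps might look in NumPy. Treat it as a sketch under assumptions: the weight matrices are random rather than trained, and the names (`W_q`, `W_o`, `num_heads`, and so on) are my own labels.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Return (output, weights); weights has shape (num_heads, n, n)."""
    n, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head runs its own scaled dot-product attention
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ V  # (num_heads, n, d_head)

    # Concatenate the heads back to (n, d_model), then mix them with W_o
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)   # (4, 8)
print(weights.shape)  # (2, 4, 4): one attention pattern per head
```

Each head works in a smaller subspace (`d_head = d_model / num_heads`), so the total cost stays close to that of a single full-width head.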

[Figure: multi-head attention, with parallel streams of QKV computation converging into a single merged output]

Visualizing Attention

You can actually see what attention heads learn. Some heads consistently attend to nearby words (local syntax). Others attend to distant words (long-range dependencies). Some specialize in subject-verb agreement. Others track pronoun references.

This wasn’t programmed — it emerged from training.

Positional Encoding: Transformers Have No Sense of Order

Here’s a curious fact: the attention mechanism itself has no built-in sense of order. Strictly speaking, it is permutation-equivariant: if you shuffle the order of tokens in the input, attention produces the same outputs, just shuffled in the same way.
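The shuffling claim is easy to check numerically. The sketch below uses random embeddings and random projection matrices (assumptions, not trained weights) and confirms that permuting the input rows permutes the output rows in exactly the same way.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # 5 tokens, dimension 4
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

perm = rng.permutation(5)    # a random reordering of the tokens

out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the input only shuffles the output; each token's vector is unchanged
print(np.allclose(out_shuffled, out[perm]))  # True
```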

But order matters enormously for language. “The dog bit the man” means something very different from “The man bit the dog.”

Transformers solve this with positional encodings — additional vectors added to the token embeddings that encode each token’s position in the sequence. These can be:

  • Fixed sinusoidal functions (the original transformer used sine and cosine waves of different frequencies)
  • Learned embeddings (each position gets a trainable vector)
  • Relative positional encodings (instead of absolute positions, each token encodes distances to other tokens)
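As an illustration of the first option, here is a sketch of the sinusoidal scheme from the original paper, where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies. The function name is my own, and the code assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) matrix of position vectors."""
    positions = np.arange(num_positions)[:, None]  # (n, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=6, d_model=8)
print(pe.shape)  # (6, 8)

# Usage: add the encoding to (here, random stand-in) token embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))
inputs = embeddings + pe
```

Every position gets a distinct pattern of values, so two copies of the same word at different positions no longer look identical to the attention layers.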

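Many modern models, including Llama, use RoPE (Rotary Positional Embeddings), which encodes positions as rotations of the query and key vectors, so that attention scores depend on the relative distance between tokens. It’s elegant and scalable, and with additional scaling techniques it can be stretched to sequences longer than those seen during training.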
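The rotation idea can be sketched in a few lines. This version rotates adjacent pairs of dimensions, which is one common convention (some implementations pair dimensions differently); the function name and the random vectors are assumptions for illustration. The interesting property is that the score between a rotated query and a rotated key depends only on how far apart the two positions are.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate adjacent pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]  # assumes d is even
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # standard 2-d rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same relative distance (2), very different absolute positions
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 10) @ rope(k, 8)
print(np.isclose(score_near, score_far))  # True
```

Rotations also preserve vector length, so RoPE changes the direction of a query or key without changing its magnitude.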

The Transformer Block: Putting It All Together

A single transformer layer (or “block”) does three things:

  1. Multi-head self-attention — Each token gathers information from all other tokens
  2. Feed-forward network — Each token’s updated representation passes through a small neural network that transforms it
  3. Residual connections + Layer normalization — Skip connections that help gradients flow and stabilize training

These blocks are stacked — typically 24 to 128 times in modern models. Each layer refines the representations further, building up increasingly abstract understanding.

Input tokens → [Attention → FFN → Residual + Norm] × N layers → Output tokens
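To see how the pieces fit, here is a deliberately simplified block in NumPy. The assumptions are many: a single attention head, random untrained weights, no biases, layer normalization without its learned scale and shift, and the “post-norm” ordering of the original paper (many newer models normalize before each sublayer instead). The helper names are my own.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p):
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["W_o"]

def feed_forward(x, p):
    hidden = np.maximum(0.0, x @ p["W_1"])  # ReLU, applied to each token separately
    return hidden @ p["W_2"]

def transformer_block(x, p):
    x = layer_norm(x + self_attention(x, p))  # attention + residual + norm
    x = layer_norm(x + feed_forward(x, p))    # FFN + residual + norm
    return x

def init_block(rng, d_model, d_ff):
    """Random stand-ins for the weights a real model would learn."""
    shapes = {
        "W_q": (d_model, d_model), "W_k": (d_model, d_model),
        "W_v": (d_model, d_model), "W_o": (d_model, d_model),
        "W_1": (d_model, d_ff), "W_2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
d_model, d_ff, num_layers = 8, 32, 3
blocks = [init_block(rng, d_model, d_ff) for _ in range(num_layers)]

x = rng.normal(size=(5, d_model))  # 5 token embeddings
for params in blocks:              # stack the blocks: output of one feeds the next
    x = transformer_block(x, params)

print(x.shape)  # (5, 8): same shape in, same shape out
```

Because every block maps a sequence of vectors to a sequence of the same shape, blocks can be stacked as deep as the compute budget allows.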

Why Transformers Changed Everything

Before transformers, the best language models struggled with long-range dependencies, trained slowly, and couldn’t easily scale. Transformers solved all three:

| Capability | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential (slow) | Full parallel (fast) |
| Long-range dependencies | Degrades with distance | Direct connections (attention) |
| Scaling | Limited | Scales well (mainly compute-bound) |
| Multi-modal | Difficult | Natural (cross-attention) |

This is why nearly every major language model since 2017 has been transformer-based: GPT, BERT, Claude, Llama, Gemini, and Grok are all transformers at their core.

What Comes Next

Attention lets transformers focus on what matters. But focus alone doesn’t produce text. The model needs a way to turn those rich representations into actual words, one token at a time.

That’s where decoding strategies come in — how a model generates text from its internal representations. We’ll cover sampling, temperature, top-k, and top-p in the next article.

Coming up: how models actually write text — decoding strategies and the art of prediction.


Quick Quiz 🧠

1. What’s the difference between self-attention and cross-attention?

Answer: Self-attention is when tokens within a single sequence attend to each other. Cross-attention is when one sequence attends to a different sequence (like a decoder attending to an encoder’s output).

2. What are Q, K, and V, and what role do they play in attention?

Answer: Query, Key, and Value are three vectors computed for each token using learned projections. The query asks “what am I looking for?”, the key says “what do I contain?”, and the value carries the actual information. The attention score comes from matching queries to keys, and the values get blended based on those scores.

3. Why do transformers need positional encodings if attention is powerful?

Answer: Attention by itself has no notion of order: it treats its inputs as a set, not a sequence. Without positional encodings, “the dog bit the man” and “the man bit the dog” would look identical to the model. Positional encodings inject order information into the token embeddings.


Sources: “Attention Is All You Need” (Vaswani et al., 2017); The Annotated Transformer; The Illustrated Transformer (Jay Alammar)