I Lack Attention. So I Built 12 Heads of It.

Table of Contents

  1. Introduction
  2. Mechanics of Dot Products
  3. Transformer Architecture (Encoder-Decoder)
  4. Self-Attention in Detail
  5. Role of the Feed-Forward Network in a Transformer
  6. The Decoder, Final Layer & Softmax
  7. GPT-2 Architecture — Decoder Only
  8. Complications with Eigen — The Rowwise & Colwise Mess
  9. A Few Pretty Graphs & Conclusions

1. Introduction

For the past five years, Claude, ChatGPT, and Gemini have been everywhere. Engineering, finance, marketing — pick a field, there’s a model in it. Most people using these tools don’t think too hard about what’s actually happening under the hood. And honestly, fair enough. But all of it — every summarization, every code completion, every agentic workflow where a model picks up tools and starts acting — goes back to a single 2017 paper: Attention is All You Need.

For someone with ADHD, that title hit different. Attention is quite literally what I have been fighting for my entire life. Pun very much intended.

The idea that a mechanism called “attention” could turn next-token prediction into a general-purpose reasoning engine is genuinely strange when you look at it up close. And stranger still: people figured out that if you just gave this thing tools — a search function, a code interpreter, an API call — it would start behaving like an agent. A chatbot became an actor.

This blog is my attempt to understand what “attention” actually means at the implementation level. Not the diagram, not the analogy — the actual matrix multiplications, the shapes, the numbers.

To do that, I followed Andrej Karpathy’s Let’s build GPT from scratch and implemented GPT-2 inference in C++. Why C++? I’ll get to that. There’s a progression here that I think is worth following in order — it’s a story about how the explosive growth of LLMs created an equally explosive demand for GPU compute, which is what eventually pulled me toward GPU programming and hardware architecture. The C++ implementation is the first step in that story.

If you want to understand how attention works from the inside, not just what it does but why it’s built the way it is — this is for you.


2. Mechanics of Dot Products

Before attention makes any sense, dot products need to make sense. Not just the formula — the geometry behind it.

A vector is a list of numbers. Geometrically it’s an arrow pointing somewhere in space. The dot product of two vectors gives you a single number that answers one question: how much do these two arrows point in the same direction?

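Three cases cover everything: arrows pointing the same way give a large positive number, perpendicular arrows give zero, and arrows pointing in opposite directions give a negative number.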

The formal definition makes this precise:

$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\ |\mathbf{b}|\ \cos(\theta)$$

$\cos(0^\circ) = 1$, $\cos(90^\circ) = 0$, $\cos(180^\circ) = -1$. The magnitudes scale the output but $\cos(\theta)$ is doing all the directional work.

In practice you never compute it through the angle. You compute it element-wise:

$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

Concrete example. Take $\mathbf{a} = [3, 1]$ and $\mathbf{b} = [2, 4]$:

$$\mathbf{a} \cdot \mathbf{b} = (3 \times 2) + (1 \times 4) = 6 + 4 = 10$$

Now scale this up to 768 dimensions. You can’t visualize it, but the geometry still holds exactly. Two 768-dimensional vectors with a large dot product are “pointing in the same direction” in that high-dimensional space. That intuition is all you need to carry forward.
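The element-wise form translates directly into a loop. A minimal sketch in plain C++ (standard library only, so it stands alone without Eigen); the function name `dot` is my own choice for illustration:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Dot product: multiply matching elements and accumulate the sum.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("dot: vectors must have the same length");
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];  // one multiply-accumulate per dimension
    }
    return sum;
}
```

Calling `dot({3.0f, 1.0f}, {2.0f, 4.0f})` reproduces the worked example and returns 10. The same loop handles 768 dimensions; only the trip count changes.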

Why this matters for attention

In attention, tokens are represented as vectors. Computing how much one token should attend to another boils down to asking: are these two vectors pointing in the same direction? The dot product answers that in a single cheap operation — one multiply-accumulate per dimension. At the scale of a language model doing this for every token pair, every layer, every head, that efficiency is not incidental. It’s the whole reason the mechanism works in practice.

We’ll get into exactly how that plays out in section 4.


3. Transformer Architecture (Encoder-Decoder)

Let’s use a translation example. Take “நான் வீடு செல்கிறேன்” — Tamil for “I go home.” Tamil is SOV, English is SVO. The verb comes last in Tamil. So this isn’t just word swapping — the model has to actually restructure the sentence. That reordering problem is a good way to understand why the architecture is built the way it is.

From the outside — Tamil sentence goes in, English comes out. Black box.

Transformer black box — Tamil to English

Crack it open and there are two stacks — an encoder on the left and a decoder on the right, each 6 layers deep.

Encoder-Decoder stacks

The encoder reads the entire input at once. All 6 layers process every Tamil token together — every token can see every other token freely. By the time the signal exits the encoder stack, each token’s representation has been shaped by the full context of the sentence. “செல்கிறேன்” (go) knows about “நான்” (I) because attention connected them.

The decoder generates the output one token at a time. It receives two things: the tokens it has already generated, and the encoder’s output. It uses the encoder output via cross-attention — the decoder’s queries reach into the encoder’s keys and values to pull out relevant information from the source sentence.

Inside each encoder layer — two sub-layers. Inside each decoder — three.

Inside one encoder and one decoder

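Encoder sub-layers: self-attention, then a feed-forward network.

Decoder sub-layers: masked self-attention, cross-attention over the encoder output, then a feed-forward network.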

Each sub-layer is wrapped with a residual connection and layer norm. Input gets added back to output — that’s the green dashed bypass in the diagram. This is what makes deep stacks trainable.

Embeddings and why parallelism matters

Before any of this — each input token becomes a vector. “நான்” maps to an integer via the tokenizer, that integer indexes into the embedding matrix, and you get a 512-dimensional vector. That’s the token’s starting representation.

Each token flows through the encoder on its own independent path. No sequential bottleneck — all positions process in parallel. This is the fundamental difference from RNNs which had to process one token at a time and killed themselves trying to carry information across long sequences.

The paths cross only in self-attention — when token $i$ gathers context from all other tokens. After that, in the FFN, paths split again. The feed-forward network has zero connections between positions. Each token’s FFN computation is completely isolated. That’s what lets transformers scale — parallelism all the way down, except for the one operation that actually needs to look across the sequence.


4. Self-Attention in Detail

I’ve been throwing the word “attention” around like you already know what it means. You don’t yet — or at least, not at the level we need. Let’s fix that.

From input to Q, K, V

Every word in the input gets converted to a vector — we covered that. What happens next: that input vector gets used to create three more vectors: a query, a key, and a value.

How? Three weight matrices — $W_Q$, $W_K$, $W_V$ — learned during training. You multiply the input vector with each:

$$q = x \cdot W_Q \qquad k = x \cdot W_K \qquad v = x \cdot W_V$$

These vectors are smaller than the input. The paper used 64 dimensions compared to 512 for the input embedding. Architecture choice — keeping the per-head dimension small means multi-head attention doesn’t blow up in cost.

Input embedding → Q, K, V

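What do these three vectors actually mean? Useful abstractions: the query is what this token is looking for, the key is what each token advertises about itself, and the value is the information a token hands over once it is attended to.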

Calculating the attention score

Take the sentence: “The cat sat on the mat”

“sat” is the word being encoded right now. The model needs to understand what “sat” means in this specific sentence — in the context of “cat”, “mat”, “on.” That’s literally what attention means. Which other words are relevant to understanding “sat” right now?

The attention score answers:

“How much should I pay attention to word X while encoding word Y?”

Four steps. Together they give you the attention formula.

Step 1 — Dot product of Q and K

The score between two words is just the dot product:

$$\text{score}(\text{sat}, \text{other}) = q_{\text{sat}} \cdot k_{\text{other}}$$

If the query for “sat” and the key for “cat” point in similar directions, the score is high and the model pays more attention. If they point away from each other, the score is low and that token is mostly ignored. This is computed for every query against every key (including the token’s own) simultaneously.

Score calculation — Q_sat · K_word for each word in the sentence

Step 2 — Divide by $\sqrt{d_k}$

Raw scores divided by $\sqrt{64} = 8$.

Why? Dot products grow with dimension. Treat the elements of Q and K as independent random values with mean 0 and variance ~1. Each pairwise product then has variance ~1, and summing 64 of them gives a total variance of 64 (a standard deviation of 8). The bigger the dimension, the larger the scores.

Feed a very large number into softmax and you get a near one-hot distribution. One score dominates, everything else goes to zero. Gradients vanish. Training dies.

Dividing by $\sqrt{d_k}$ keeps variance at ~1 regardless of dimension. Stable scores, stable gradients.
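The variance argument can be written out in one line. This is a sketch under an assumption that goes slightly beyond the text: the components $q_i$ and $k_i$ are taken to be independent, with mean 0 and variance 1.

```latex
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right)
  = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
  = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
  = d_k
\qquad\Longrightarrow\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1
```

With $d_k = 64$ the unscaled scores have standard deviation 8, which is exactly the factor the division removes. Trained weights will not satisfy the independence assumption exactly, so read this as the motivation for the scale factor rather than a guarantee.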

Step 3 — Softmax

Plain normalization vs softmax:

$$\hat{x}_i = \frac{x_i}{\sum_j x_j} \quad \text{vs} \quad \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Softmax does three things plain normalization doesn’t: always produces positive scores, amplifies larger scores (sharpens focus), and stays smooth and differentiable everywhere.
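Here is roughly what softmax over one row of scores looks like in code. A minimal sketch in plain C++; the name `softmax` and the use of `std::vector` are assumptions for illustration. It subtracts the row maximum before exponentiating, a standard trick that leaves the result unchanged but avoids overflow.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over one row of scores. Subtracting the row maximum first does not
// change the result but keeps exp() from overflowing on large scores.
std::vector<float> softmax(const std::vector<float>& scores) {
    std::vector<float> out(scores.size());
    if (scores.empty()) return out;
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float total = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        out[i] = std::exp(scores[i] - max_score);
        total += out[i];
    }
    for (float& p : out) p /= total;  // every entry is positive, row sums to 1
    return out;
}
```

Without the max subtraction, two scores of 1000 would overflow `exp()` in float; with it, they come out as 0.5 each.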

Step 4 — Multiply by V and sum

The softmax output tells you how much each token contributes. Multiply each score by the corresponding value vector, sum everything up:

$$\text{output}_{\text{sat}} = \sum_i \text{softmax}_i \cdot v_i$$

“sat” is now encoded with full awareness of its context. It knows it’s connected to “cat” and “mat.” That’s the whole mechanism.

Writing it in matrix form

Every row in the input matrix $X$ is one word. Compute Q, K, V for all words at once:

$$Q = X \cdot W_Q \qquad K = X \cdot W_K \qquad V = X \cdot W_V$$

Then the full attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Four steps, one equation. $QK^T$ is every query dotted against every key simultaneously. Divide by $\sqrt{d_k}$, softmax row-wise, multiply by $V$. Every token gets a context-aware output vector in a single pass.
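The four steps can be sketched as one function. This is an illustrative single-head version in plain C++ with nested `std::vector`s, not the Eigen code from the repository; the `Matrix` alias and the name `attention` are assumptions, and it expects non-empty inputs with K and V holding the same number of rows.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // row-major: one row per token

// Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
// Q is [seq, d_k], K is [n_keys, d_k], V is [n_keys, d_v]; result is [seq, d_v].
Matrix attention(const Matrix& Q, const Matrix& K, const Matrix& V) {
    const std::size_t seq = Q.size();
    const std::size_t n_keys = K.size();
    const std::size_t d_k = Q[0].size();
    const std::size_t d_v = V[0].size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));

    Matrix out(seq, std::vector<float>(d_v, 0.0f));
    for (std::size_t i = 0; i < seq; ++i) {
        // Steps 1 and 2: score query i against every key, then scale.
        std::vector<float> w(n_keys);
        for (std::size_t j = 0; j < n_keys; ++j) {
            float s = 0.0f;
            for (std::size_t c = 0; c < d_k; ++c) s += Q[i][c] * K[j][c];
            w[j] = s * scale;
        }
        // Step 3: softmax over the row (max subtracted for stability).
        const float m = *std::max_element(w.begin(), w.end());
        float total = 0.0f;
        for (float& x : w) { x = std::exp(x - m); total += x; }
        for (float& x : w) x /= total;
        // Step 4: weighted sum of the value vectors.
        for (std::size_t j = 0; j < n_keys; ++j)
            for (std::size_t c = 0; c < d_v; ++c) out[i][c] += w[j] * V[j][c];
    }
    return out;
}
```

A quick sanity check: if every query is all zeros, every score is 0, softmax gives uniform weights, and each output row is just the average of the rows of V.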

Multi-Head Attention

Single attention head so far. In practice — multiple heads in parallel. 8 in the original paper, 12 in GPT-2.

Why not just one big head?

Specialization — one head trying to learn both grammar and semantics gets pulled in different directions and does both poorly. Multiple heads let each one settle into its own pattern.

Different representation subspaces — each head has its own $W_Q^i$, $W_K^i$, $W_V^i$, independently initialized. They start from different random points, see the same input through completely separate projections, and tend to converge to genuinely different solutions. Not the same thing at smaller scale — different things entirely.

Full flow: $X$ fans out to all heads → each head computes Q, K, V and produces output $R_i$ of shape $[\text{seq}, 64]$ → all outputs concatenated → multiplied by $W_O$ → final output $Z$.

Overall self-attention architecture

$W_O$ is not just bookkeeping — it lets the model learn how to mix information across all heads into a single coherent representation.

Interactive — Embedding Inspector

Click any token to inspect its 768-dim embedding vector. Two dimensions are labeled based on probing studies — most dimensions have no clean human-readable interpretation.

Interactive — Attention Arc Visualizer

Click a token to see what it attends to across all four attention heads. Arc thickness = attention weight.


5. Role of the Feed-Forward Network in a Transformer

Residuals & Layer Normalization

Two things that make the whole stack trainable — residual connections and layer normalization. Not glamorous. Without them, none of this works at depth.

Residual connections

Instead of passing the output of a sub-layer directly to the next one, you add the input back:

$$\text{output} = x + \text{sublayer}(x)$$

Deep networks have a gradient problem. By the time gradients flow back through 6, 12, 24 layers, they either vanish or explode. The addition creates a direct highway for gradients to flow backwards without passing through any transformation. If a layer isn’t useful, its output can shrink toward zero and the signal passes through nearly unchanged.

Layer Normalization

After each residual add, the signal gets normalized. For each token’s vector independently, compute the mean $\mu$ and variance $\sigma^2$ across all 512 dimensions and rescale:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$

$\gamma$ and $\beta$ are learned. Without this, activations at different layers drift wildly in scale. LayerNorm keeps every token’s representation in a stable numerical range regardless of depth. In practice, deep transformers without some form of normalization are very hard to train.

GPT-2 uses Pre-LN — layer norm applied before each sub-layer, not after. The original paper used Post-LN. Pre-LN is easier to train at scale because gradient flow through the residual highway stays completely clean.
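For a single token, the normalization formula comes down to a few loops. A minimal sketch in plain C++ rather than Eigen, so it stands alone; the function name `layer_norm` is an assumption, and the default `eps` of 1e-5 is the value used in the Eigen code later in this post.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// LayerNorm for a single token vector x. gamma and beta hold one learned
// value per dimension and must be the same length as x.
std::vector<float> layer_norm(const std::vector<float>& x,
                              const std::vector<float>& gamma,
                              const std::vector<float>& beta,
                              float eps = 1e-5f) {
    const float n = static_cast<float>(x.size());
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= n;

    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;  // biased variance: divide by n, not n - 1

    const float inv_std = 1.0f / std::sqrt(var + eps);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
    }
    return out;
}
```

Note that the statistics come from one token's own dimensions. No other token in the sequence is involved, which is what makes this a per-token operation.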

Encoder block — residuals, LayerNorm, FFN

Feed-Forward Network

After attention figures out where to look, the FFN decides what to do with that information.

Two linear layers with a nonlinearity between them (GELU in GPT-2, ReLU in the original paper):

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2$$

Dimensions expand 512 → 2048 (4×) then collapse back to 512. That 2048-dimensional space is a scratchpad — room for richer per-token transformations than a single linear layer allows.

This happens independently for every token. No cross-token communication.
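A per-token FFN can be sketched in a few lines. This is an illustrative plain C++ version; the helper names `gelu`, `linear`, and `ffn` are my own, and the GELU here is the tanh approximation, which is what the GPT-2 reference code uses.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // [rows][cols]

// tanh approximation of GELU.
float gelu(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// y = x W + b for one token: x is [in], W is [in][out], b is [out].
std::vector<float> linear(const std::vector<float>& x, const Matrix& W,
                          const std::vector<float>& b) {
    std::vector<float> y(b);  // start from the bias
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < y.size(); ++j) y[j] += x[i] * W[i][j];
    return y;
}

// Position-wise FFN: expand, apply GELU, project back. Only one token's
// vector goes in, so nothing here can look at other positions.
std::vector<float> ffn(const std::vector<float>& x,
                       const Matrix& W1, const std::vector<float>& b1,
                       const Matrix& W2, const std::vector<float>& b2) {
    std::vector<float> hidden = linear(x, W1, b1);
    for (float& h : hidden) h = gelu(h);
    return linear(hidden, W2, b2);
}
```

The function signature makes the isolation visible: `ffn` takes one token vector and has no access to the rest of the sequence.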

Geva et al. (2021) — “Transformer Feed-Forward Layers Are Key-Value Memories” (arXiv:2012.14913) — showed something interesting here. First linear layer acts as keys: pattern detectors that activate for specific inputs. Second layer acts as values: information retrieved when a pattern fires. Factual knowledge is stored in these weight matrices.

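Split of labour: attention moves information between tokens; the FFN transforms each token on its own and is where most of the stored knowledge sits.

More FFN parameters = more key-value memory pairs = more room for facts. On this view, scaling the FFN adds storage capacity in a way that scaling attention alone doesn’t.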


6. The Decoder, Final Layer & Softmax

The encoder stack is done. Every input token has a rich context-aware representation. The top encoder’s output — a matrix, one vector per input token — gets handed to every decoder layer.

It doesn’t just get passed across raw.

What actually happens at the “transformed” step

Top encoder’s output gets projected into K and V using learned weight matrices:

$$K_{\text{enc}} = \text{EncoderOutput} \times W_K \qquad V_{\text{enc}} = \text{EncoderOutput} \times W_V$$

These $W_K$ and $W_V$ belong to the cross-attention layer inside the decoder — not the encoder. The encoder provides the raw vectors. The decoder learns how to project them into a useful space.

Q comes from the decoder’s masked self-attention output, projected through the cross-attention layer’s own $W_Q$.

Every decoder layer gets to ask questions of the full encoded input.

Masked Self-Attention in the decoder

Before cross-attention, the decoder runs its own self-attention — with a mask. Upper triangle of the attention score matrix set to $-\infty$ before softmax:

$$\text{mask}[i][j] = -\infty \quad \text{if } j > i$$

During inference the model generates one token at a time. When generating token 3, tokens 4, 5, 6 don’t exist yet. The mask enforces the same constraint during training — without it, the model cheats by attending to future tokens that won’t be available at inference. After softmax, $-\infty$ becomes 0. Future positions invisible.

Decoder — full flow

The final linear layer & softmax

After the decoder stack — a 512-dim vector. Final linear layer projects to full vocabulary:

$$\text{logits} = \text{DecoderOutput} \times W_{\text{vocab}}$$

Shape goes from $[1, 512]$ to $[1, 50257]$ (borrowing GPT-2’s vocabulary size as the example) — one score per possible next token. Softmax converts to probabilities.

How do we pick one word?

Greedy decoding — highest probability token every time. Fast, deterministic, boring. Repetitive output.

Temperature sampling — divide the logits by $T$ before the softmax. $T < 1$ sharpens the distribution, $T > 1$ flattens it:

$$P = \text{softmax}\left(\frac{\text{logits}}{T}\right)$$

Top-k — sample only from top $k$ tokens. Cuts off the long tail of garbage.

Top-p (nucleus) — smallest set of tokens whose cumulative probability exceeds $p$. Dynamic — expands when uncertain, contracts when confident. Many production LLMs use top-p or top-k + temperature.
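Temperature and top-k combine naturally into one helper. The sketch below is plain C++ with illustrative names (`top_k_probs`, `sample`); it assumes a non-empty logit vector, a temperature above 0, and k of at least 1. Top-p is left out to keep it short.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Turn logits into a probability distribution using temperature and top-k.
// Tokens outside the k highest logits get probability 0; the rest are
// renormalised by the softmax.
std::vector<float> top_k_probs(const std::vector<float>& logits,
                               float temperature, std::size_t k) {
    const std::size_t n = logits.size();
    k = std::min(k, n);

    // Indices of the k largest logits, best first.
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return logits[a] > logits[b]; });

    // Softmax over the kept logits only, after dividing by the temperature.
    const float max_logit = logits[idx[0]] / temperature;
    std::vector<float> probs(n, 0.0f);
    float total = 0.0f;
    for (std::size_t r = 0; r < k; ++r) {
        probs[idx[r]] = std::exp(logits[idx[r]] / temperature - max_logit);
        total += probs[idx[r]];
    }
    for (float& p : probs) p /= total;
    return probs;
}

// Draw one token id from the distribution.
std::size_t sample(const std::vector<float>& probs, std::mt19937& rng) {
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

With `k = 1` this collapses to greedy decoding: the single kept token gets probability 1 and `sample` always returns it.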


7. GPT-2 Architecture — Decoder Only

The model family

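| Model | Layers | Heads | d_model | Parameters |
|---|---|---|---|---|
| GPT-2 Small | 12 | 12 | 768 | 117M |
| GPT-2 Medium | 24 | 16 | 1024 | 345M |
| GPT-2 Large | 36 | 20 | 1280 | 762M |
| GPT-2 XL | 48 | 25 | 1600 | 1558M |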

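The one I implemented — GPT-2 Small. 12 layers, 12 heads, 768-dim residual stream. The 117M in the table is the size originally reported for this model; counting the released weights gives about 124M, which is why it shows up as “GPT-2 124M” below.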

GPT-2 124M — forward pass

Input — token + position embeddings

Two lookup tables. That’s it.

Token embedding: wte [50257 × 768]. Token ID in, 768-dim vector out. Row lookup.

Position embedding: wpe [1024 × 768]. Position index in, 768-dim vector out.

$$x = \text{wte}[\text{token\_id}] + \text{wpe}[\text{position}]$$

GPT-2 uses learned positional embeddings — not the fixed sinusoidal ones from the original paper. Sinusoidal embeddings are computed from a formula and never change. GPT-2’s positional embeddings are just another weight matrix trained end-to-end. Empirically about the same, but simpler to implement.
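The two lookups and the add are only a few lines of code. A minimal sketch in plain C++; the function name `embed` and the nested-vector `Matrix` alias are assumptions, and `at()` is used so an out-of-range token id or position throws instead of reading garbage.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // one row per table entry

// Build the input matrix x[t] = wte[token_ids[t]] + wpe[t] for a sequence.
// wte is [vocab, d_model], wpe is [max_positions, d_model].
Matrix embed(const std::vector<int>& token_ids, const Matrix& wte, const Matrix& wpe) {
    Matrix x;
    x.reserve(token_ids.size());
    for (std::size_t t = 0; t < token_ids.size(); ++t) {
        const std::vector<float>& tok = wte.at(static_cast<std::size_t>(token_ids[t]));
        const std::vector<float>& pos = wpe.at(t);  // throws past max_positions
        std::vector<float> row(tok.size());
        for (std::size_t c = 0; c < tok.size(); ++c) row[c] = tok[c] + pos[c];
        x.push_back(row);
    }
    return x;
}
```

The same token id at two different positions produces two different rows, because only the `wpe` term differs. That is the only place position enters the model.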

Weight tying

This confused me the most when I first read the architecture.

GPT-2 needs to produce a probability over 50,257 tokens. Naive approach: multiply the 768-dim hidden state by a [768 × 50257] matrix. That’s 39M extra parameters just for the final projection.

Weight tying says — don’t. Reuse wte, transposed.

$$\text{logits} = \text{hidden\_state} \times \text{wte}^T$$

Same matrix. Same weights. Used twice in one forward pass. Why does this work?

The model wants the hidden state at position $i$ to be geometrically close to the embedding of whatever token comes next. “Close” means high dot product. The dot product of the hidden state with each row of wte is exactly the right similarity score. The model learns to push hidden states toward the embedding vectors of correct next tokens. Embedding space and output space become the same space. One matrix, two jobs.

Saving: 39M parameters gone. For GPT-2 Small at 117M total — roughly a third of the model.
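The tied projection needs no transposed copy of the table: walking `wte` row by row gives the same numbers. A minimal sketch in plain C++, with hypothetical names `tied_logits` and `untied_params`:

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Tied output projection: logits[v] = hidden . wte[v]. The embedding table
// wte ([vocab, d_model]) is reused as-is; reading it row by row is the same
// as multiplying by its transpose.
std::vector<float> tied_logits(const std::vector<float>& hidden, const Matrix& wte) {
    std::vector<float> logits(wte.size(), 0.0f);
    for (std::size_t v = 0; v < wte.size(); ++v)
        for (std::size_t c = 0; c < hidden.size(); ++c)
            logits[v] += hidden[c] * wte[v][c];
    return logits;
}

// Parameters a separate (untied) output matrix would have added.
constexpr long long untied_params(long long vocab, long long d_model) {
    return vocab * d_model;
}
```

`untied_params(50257, 768)` evaluates to 38,597,376, which is the "39M" quoted above after rounding.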

Weight tying — one matrix, two jobs

Why decoder only?

The original transformer had encoder + decoder — encoder reads the source, decoder generates the target. Makes sense for sequence-to-sequence: translate Tamil to English, summarize a document. Clear source, clear target.

GPT-2 doesn’t work that way. There is no source sentence. There is no target sentence. Just a sequence of tokens, and the model predicts what comes next.

For next-token prediction, an encoder is the wrong tool. Bidirectional attention — every token sees every other token, past and future. Powerful for understanding, useless for generation. You can’t attend to tokens that don’t exist yet.

The decoder’s causal mask fixes this — upper triangle is $-\infty$, every token only attends to itself and the positions before it. That single constraint makes autoregressive generation possible.

Autoregression: after each token is produced, append it to the sequence, feed the new longer sequence back in, generate the next token. Repeat. No encoder state. No cross-attention. Just one stack of masked self-attention layers reading a growing sequence. Simple, and it scales.

KV Cache — the inference insight

Done naively, every generation step runs the full forward pass. For sequence length $n$ that means computing Q, K, V for all $n$ positions in every layer.

But K and V for already-generated tokens don’t change. Token 1’s key and value vectors at layer 3 are identical whether you’re generating token 10 or token 500. Only the new token’s Q, K, V need computing.

So you cache them:

$$K_{\text{cache}} = [k_1, k_2, \ldots, k_{t-1}] \qquad V_{\text{cache}} = [v_1, v_2, \ldots, v_{t-1}]$$

Step $t$: compute Q, K, V for the new token only. Concatenate K and V onto the cache. Run attention. Done. No recomputation.

Without KV cache: $O(n^2)$ attention work per generation step. With: $O(n)$ per step.

The tradeoff is memory. Each layer caches $2 \times \text{seq\_len} \times 768$ floats per sequence in the batch. At 128K context windows and large models this becomes the binding constraint — KV cache memory is one of the main reasons inference infrastructure is still a hard engineering problem, and why papers like PagedAttention (vLLM) exist. Which is exactly where this story is heading.
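The cache itself is a simple structure. The sketch below is illustrative plain C++, not the layout used in the repository; `KVCache` and `kv_cache_floats` are hypothetical names, and a real implementation would more likely preallocate one contiguous buffer per layer.

```cpp
#include <cstddef>
#include <vector>

// KV cache for one attention layer. Keys and values of past tokens are
// stored once and reused on every later generation step.
struct KVCache {
    std::vector<std::vector<float>> keys;    // keys[t]   : key vector of token t
    std::vector<std::vector<float>> values;  // values[t] : value vector of token t

    // Called once per generated token with that token's new k and v.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        keys.push_back(k);
        values.push_back(v);
    }

    std::size_t length() const { return keys.size(); }
};

// Floats held by the cache across all layers for one sequence:
// 2 (K and V) x seq_len x d_model per layer.
constexpr std::size_t kv_cache_floats(std::size_t layers, std::size_t seq_len,
                                      std::size_t d_model) {
    return 2 * layers * seq_len * d_model;
}
```

For GPT-2 Small at its full 1024-token context, `kv_cache_floats(12, 1024, 768)` is 18,874,368 floats, or 72 MiB at 4 bytes each. The count grows linearly with sequence length and with layer count, which is why it dominates at long contexts.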


8. Complications with Eigen — The Rowwise & Colwise Mess

This is the section I wish existed when I was debugging at 2am wondering why my LayerNorm outputs looked completely wrong despite the math being correct.

No crash. No assertion failure. Just silently wrong numbers.

The setup

Eigen is a C++ linear algebra library. Matrices, vectors, fast operations. Natural choice for implementing a transformer in C++ — fast, well-documented, handles all the BLAS-level stuff.

The problem isn’t Eigen. The problem is rowwise() and colwise() — two methods whose names are genuinely misleading until you’ve been burned by them.

What I expected vs what Eigen means

Matrix X of shape [seq_len, 768] — one row per token, one column per dimension.

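You want the mean of each token’s vector. Each token is a row. Mean across 768 columns, one number per row — a [seq_len, 1] vector.

Instinct: rowwise(). You’re operating on rows.

For the reduction, that instinct is right: rowwise() reduces each row. The trap is the very next step. Broadcasting that per-row result back across the matrix uses colwise(). Same per-token vector, opposite word.

Eigen::VectorXf    per_token = X.rowwise().mean(); // shape: [seq_len, 1] -- one mean per token
Eigen::RowVectorXf per_dim   = X.colwise().mean(); // shape: [1, 768]     -- one mean per dimension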

The mental model that actually works

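Stop thinking about what you’re iterating over. Think about what dimension gets collapsed. rowwise() reduces each row, so the columns are what collapse.

| Operation | Axis that disappears | Result shape |
|---|---|---|
| X.rowwise().mean() | cols (axis 1) | [seq_len, 1] |
| X.colwise().mean() | rows (axis 0) | [1, 768] |

LayerNorm wants mean per token — collapse the 768 dimensions. Columns disappear, one value per row survives. rowwise() for the reduction, then colwise() to broadcast the result back.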

Where this wrecked my LayerNorm

// X shape: [seq_len, 768]; gamma, beta: Eigen::VectorXf of size 768

// mean per token: reduce each row → [seq_len, 1]
Eigen::VectorXf mean = X.rowwise().mean();  // NOT colwise

// broadcast the per-token mean across all 768 columns
Eigen::MatrixXf deviation = X.colwise() - mean;

// variance per token → [seq_len, 1]
Eigen::VectorXf variance = deviation.array().square().rowwise().mean();

// normalize: divide every column by the per-token standard deviation
float eps = 1e-5f;
Eigen::VectorXf std_dev = (variance.array() + eps).sqrt();
Eigen::MatrixXf normalized = deviation.array().colwise() / std_dev.array();

// scale and shift — gamma, beta are [768], one per dimension → rowwise
Eigen::ArrayXXf scaled = normalized.array().rowwise() * gamma.transpose().array();
Eigen::MatrixXf out = scaled.rowwise() + beta.transpose().array();

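The asymmetry at the end (colwise() when broadcasting mean and std_dev, rowwise() when broadcasting gamma and beta) comes from the shape of the vector being broadcast. mean and std_dev hold one value per token, so they are columns applied to every column of X. gamma and beta hold one value per dimension, so once transposed they are rows applied to every row.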

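The rule — two sentences

When broadcasting: if your vector has shape [seq_len] (one value per token), use colwise(); if it has shape [768] (one value per dimension), use rowwise(). Reductions run the other way: rowwise() produces the [seq_len] vector and colwise() produces the [768] one.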

Write that on a sticky note. Put it next to your monitor. Thank me later.

Why it fails silently

Eigen doesn’t crash. Both rowwise() and colwise() compile. Both produce output. Different shapes, but Eigen’s own size checks are assertions that are compiled out once NDEBUG is defined, and if you’re not asserting shapes yourself (most C++ code doesn’t) the wrong result propagates silently through the entire forward pass.

My LayerNorm was normalizing across tokens instead of dimensions. Output looked reasonable — numbers in a sane range — but every token’s representation was subtly wrong. Attention scores wrong. FFN wrong. Logits wrong. Nothing crashed.

Debugging: generate text, notice it’s incoherent even for GPT-2 standards, add shape assertions everywhere, find LayerNorm producing [1, 768] means instead of [seq_len, 1], fix one word, watch everything work.

The attention softmax had the same problem

// scores: [seq_len, seq_len]

// subtract row max for numerical stability
Eigen::MatrixXf shifted = scores.colwise() - scores.rowwise().maxCoeff();
Eigen::MatrixXf exp_scores = shifted.array().exp();

// normalize each row
Eigen::MatrixXf softmax = exp_scores.array().colwise() / exp_scores.rowwise().sum().array();

The naming stays confusing even when you know the rule. Apply it mechanically every time.

Summary

| What you want | Vector shape | Eigen call |
|---|---|---|
| Mean per token (across dims) | [seq_len, 1] | .rowwise().mean() |
| Mean per dim (across tokens) | [1, 768] | .colwise().mean() |
| Broadcast per-token vec across dims | [seq_len, 1] | .colwise() op vec |
| Broadcast per-dim vec across tokens | [1, 768] | .rowwise() op vec |

Once it clicks it’s mechanical. Until it clicks it will silently cost you hours.


9. A Few Pretty Graphs & Conclusions

I ran benchmarks on my RTX 3050 8GB. Here’s what the numbers say.

Per-layer forward pass time

GPT-2 forward pass time per component

How boring this graph is — and boring is exactly right.

Block 0: 4.57ms. Block 11: 4.46ms. Everything in between: 4.37–4.45ms. Perfectly uniform — same operations, same dimensions, same compute, same time. Layer after layer.

Two things nearly invisible: embedding lookup at the far left, essentially 0ms — it’s a table lookup. Final LayerNorm on the right — a rounding error.

The transformer block is the cost. Everything else is noise.

Attention is O(N²)

Attention O(N²) — forward pass time vs sequence length

This is the one that matters. This is why long context is expensive.

| Sequence length | Forward pass time |
|---|---|
| 4 tokens | 91ms |
| 8 tokens | 118ms |
| 32 tokens | 197ms |
| 64 tokens | 332ms |
| 128 tokens | 640ms |
| 256 tokens | 937ms |
| 512 tokens | 938ms |

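4 tokens to 128 tokens is a 32× increase in sequence length, and the forward pass goes from 91ms to 640ms. Not 32× slower. More like 7×. Over this range that is still sub-linear, most likely because fixed overhead and the weight matmuls that grow only linearly with N (projections, FFN) dominate at short lengths. The quadratic term is just starting to show.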

O(N²) fit overlaid in orange. Measured curve tracks closely up to 256 tokens then flattens — memory bandwidth starts dominating on CPU. But the shape is clear. $QK^T$ is [seq, seq]. Double the sequence, quadruple attention compute. Context windows are not free.

This is why FlashAttention exists. Why linear attention exists. Getting $O(N^2)$ manageable at 128K context is not a solved problem.
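Counting FLOPs shows where the quadratic term sits relative to everything else in a block. This is back-of-the-envelope arithmetic under simplifying assumptions of my own (2 FLOPs per multiply-add; LayerNorm, softmax, biases and any KV cache ignored), not a model of the measured timings. The function names are hypothetical.

```cpp
// Rough FLOP counts for one transformer block processing n tokens with
// model width d. Each multiply-add is counted as 2 FLOPs.

// Weight matmuls: QKV (3 d^2) + attention output (d^2) + FFN (8 d^2)
// = 12 d^2 weights, each used once per token. Linear in n.
constexpr double linear_flops(double n, double d) { return 2.0 * n * 12.0 * d * d; }

// Q K^T and (attention weights x V): two products over an [n, n] matrix
// with inner or outer width d. Quadratic in n.
constexpr double attention_flops(double n, double d) { return 2.0 * 2.0 * n * n * d; }

// Fraction of the block's FLOPs spent in the quadratic part.
constexpr double quadratic_share(double n, double d) {
    return attention_flops(n, d) / (attention_flops(n, d) + linear_flops(n, d));
}
```

Under these assumptions the share simplifies to n / (n + 6d). For d = 768 that is under 3% at 128 tokens, 10% at 512, and reaches half only at n = 4608. Doubling the sequence does quadruple the attention term, but at GPT-2 Small's sizes the linear term is still the larger one.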

Top-k sampling distribution

Sampling distribution — before and after top-k filtering

Prompt: “The meaning of life is”

Left: full distribution over top 100 tokens. Steep drop from rank 0 to near-zero by rank 20, then a flat line to rank 100. Top-k cutoff at k=20 is the red dashed line.

Right: after filtering. “not” jumps to ~22%, “to” to ~18%, “the” and “that” around 10%.

The point is the tail. Without top-k, tokens ranked 40-100 have non-zero probability. They’re garbage. Top-k cuts them off and renormalises. The difference between coherent and incoherent output.

Temperature effect

Effect of temperature on output distribution

At T=0.5 the top token hits ~44%. At T=2.0 it’s at ~7% — almost flat across the top 50. T=1.0 is what the model was trained at. Below that: more conservative than trained. Above: more chaotic. Neither wrong — depends what you want.

Memory breakdown

GPT-2 124M memory breakdown — total 474.4MB

Total: 474.4MB for GPT-2 124M.

| Component | Memory |
|---|---|
| wte (token embeddings) | 147.2MB (31%) |
| FFN fc weights | 108.0MB (22.8%) |
| FFN proj weights | 108.0MB (22.8%) |
| Attn QKV weights | 81.0MB (17.1%) |
| Attn proj weights | 27.0MB (5.7%) |
| wpe (pos embeddings) | 3.0MB |
| LayerNorm + biases | 0.1MB |

Three surprises. First — wte at 147.2MB is 31% of the model. One embedding table. And because of weight tying, this same matrix does double duty as the output classifier — saving another 147MB.

Second — FFN weights (216MB combined) dwarf attention weights (108MB combined). People assume attention is the expensive part. It’s not. The Geva et al. result in practice.

Third — LayerNorm: 0.1MB. The operation that keeps the architecture trainable costs essentially nothing.

Conclusions

Attention is not magic. It’s dot products, softmax, and a weighted sum. The magic is in what the model learns to put in those vectors — query vectors that ask the right questions, key vectors that answer them, value vectors that carry the right information. The mechanism is simple. The learned content is not.

The O(N²) graph is the honest answer to why the field is still actively working on this. A 2017 architecture running on a 2019 pretrained model, on a consumer GPU in 2025, hits a compute wall at a few hundred tokens. Modern systems running 128K context are doing serious engineering to get around that wall — not by fixing the math, but by attacking the memory access patterns.

That’s a GPU problem. Next post: GPU Architecture. Not CUDA yet.


The Code & A Thank You

The full C++ implementation — weight loading, forward pass, tokenization, sampling — is on GitHub: github.com/kailashnagarajan/gpt.cpp. If you’re going to implement this yourself, I’d strongly suggest doing it. Reading about transformers and building one are very different experiences. The Eigen bugs alone will teach you things no blog post can.

I also want to acknowledge something a bit meta. This post was written in genuine collaboration with Claude. Not in the “I asked it to write my blog” way — all the technical substance, the benchmarks, the bugs, the implementations are mine. But Claude was a real partner in drafting and refining the prose, creating the Excalidraw diagrams, building the interactive widgets, and being a sounding board while I was working through the architecture. It’s a strange thing to thank an AI in a blog post about how AI works, but it would feel dishonest not to. The irony of using attention-based models to write about attention-based models is not lost on me.


References

  1. Vaswani et al. (2017) — Attention is All You Need
  2. Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2)
  3. Geva et al. (2021) — Transformer Feed-Forward Layers Are Key-Value Memories
  4. Alammar (2018) — The Illustrated Transformer
  5. Karpathy — Let’s build GPT from scratch