What is the difference between self-attention and multi-head attention?

Self-attention is the fundamental mechanism in which each token computes relevance scores against every token in the sequence (itself included) using Query, Key, and Value projections, producing a contextually informed representation. Multi-head attention runs this mechanism h times in parallel with different learned projection matrices, allowing the model to capture multiple types of relationships simultaneously. For example, one head might focus on syntactic dependencies while another focuses on semantic similarity. The outputs of all heads are concatenated and projected through a learned output matrix. When the per-head dimension is set to d_model / h, multi-head attention has the same total parameter count as a single full-width head, and in practice it usually gives better model quality.

Why is the scaling factor √d_k used in the attention formula?

Dividing by √d_k (the square root of the key dimension) prevents the dot products from becoming very large when the key dimension is large. If the components of the queries and keys are roughly independent with zero mean and unit variance, each dot product in Q K^T has variance d_k, so its typical magnitude grows like √d_k. Without scaling, these large scores push the softmax into regions where it produces very sharp distributions: one token receives nearly all the attention weight, and gradients become extremely small. Dividing by √d_k brings the variance of the scores back to about 1, which keeps the softmax in a range where it produces useful probability distributions and gradients flow well during training. This matters at the per-head dimensions used in practice, where d_k is typically 64 to 128 or higher.
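The effect is easy to check numerically. The sketch below is a small simulation, not part of any library: it assumes query and key components are independent draws from a standard normal distribution (an idealization of real activations), and the helper name dot_product_std is made up for this illustration.

```python
import math
import random


def dot_product_std(d_k, trials=1000, seed=0):
    """Empirical standard deviation of q . k for random d_k-dimensional vectors."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        # Assumption: every component is an independent N(0, 1) sample.
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    mean = sum(dots) / trials
    variance = sum((d - mean) ** 2 for d in dots) / trials
    return math.sqrt(variance)


for d_k in (16, 64, 256):
    raw = dot_product_std(d_k)
    scaled = raw / math.sqrt(d_k)
    print(f"d_k={d_k:4d}  std(q.k)={raw:6.2f}  std(q.k / sqrt(d_k))={scaled:.2f}")
```

Under these assumptions the unscaled standard deviation should track √d_k, while the scaled one stays close to 1 regardless of dimension.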

What is causal masking and why is it needed in decoder-only LLMs?

Causal masking prevents each token in a sequence from attending to future tokens during training. In a decoder-only model trained on next-token prediction, if the token at position i could attend to the token at position i+1, the model would simply copy the answer rather than learn to predict it, and the training objective would be trivially solvable. The causal mask is implemented by setting the scores for all positions j > i to −∞ before the softmax operation, so those positions receive a weight of exactly 0. The mask must be applied whenever a full sequence is processed in parallel, which covers training and the initial pass over the prompt at inference time. During token-by-token generation with cached keys and values, no explicit mask is needed for the newest token, because every key it can see belongs to a position that already exists.

How does Rotary Positional Embedding (RoPE) differ from sinusoidal positional encoding?

Sinusoidal positional encoding adds a fixed position-dependent vector to the token embedding at the input to the Transformer, so position is encoded as an absolute value. RoPE instead applies a position-dependent rotation to the Query and Key vectors inside each attention head before computing dot products. The critical advantage is that the dot product of the query at position m with the key at position n depends on the relative distance (m − n) rather than on the absolute positions m and n. Because attention scores depend on relative offsets, RoPE combines well with context-extension techniques that rescale the rotation frequencies, although extrapolating far beyond the training length without such adjustments remains unreliable. RoPE has performed well empirically and is now the standard positional encoding in most open-source LLMs, including LLaMA 2 and 3, Mistral, and Qwen.

What is Byte Pair Encoding (BPE) and how does it affect LLM cost?

BPE is a subword tokenization algorithm that builds its vocabulary by iteratively merging the most frequent adjacent pair of symbols in a training corpus. Starting from individual characters, it adds common subwords like "ing", "tion", "un" and frequent whole words, producing a vocabulary of typically 32,000–100,000 tokens that balances vocabulary size against sequence length. BPE directly affects LLM cost because API pricing is per token: English prose averages about 1.3 tokens per word, so a 1,000-word document comes to about 1,300 tokens. Code, non-English text, and mathematical notation typically tokenize less efficiently, which increases cost. Understanding tokenization helps engineers optimize prompts and control API costs.


AI & Automation·10 min read·June 19, 2026

🧠 Transformer Architecture Deep Dive: How Self-Attention, Positional Encoding, and Tokenization Work

A thorough technical explanation of the Transformer architecture behind ChatGPT, Claude, and other large language models — covering the encoder-decoder design, multi-head self-attention with Q/K/V matrices, scaled dot-product attention, sinusoidal and RoPE positional encoding, BPE tokenization, and the key differences between GPT and BERT architectures.

What Is a Transformer and Why Did It Change AI?

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., replaced recurrent neural networks (RNNs) as the dominant approach for sequence modeling in natural language processing. RNNs processed tokens sequentially — one at a time — making them slow to train and prone to the vanishing gradient problem that degraded long-range dependencies. Transformers process all tokens in parallel using a mechanism called self-attention, which allows every token to directly attend to every other token in the sequence regardless of distance. This change unlocked the ability to train much larger models on much more data, and ultimately produced the large language models (LLMs) that power ChatGPT, Claude, Gemini, and LLaMA.

Understanding the Transformer at a mechanical level — not just at the conceptual level of diagrams and buzzwords — enables engineers to reason about why LLMs behave as they do: why context length matters, why token count affects cost, why fine-tuning works, and what the computational bottlenecks are during inference. This article builds that mechanical understanding from the ground up.

Encoder vs Decoder: The Two Transformer Variants

The original Transformer architecture had two components: an encoder that processes the input sequence and builds contextual representations, and a decoder that generates output tokens one at a time using both the encoder's representations and its own previously generated tokens. This encoder-decoder design is used for sequence-to-sequence tasks like machine translation — the encoder reads French, the decoder generates English.

Modern LLMs have largely moved to two specialized variants:

Decoder-only (GPT-style): Used by GPT-4, Claude, LLaMA, Mistral, and virtually all modern general-purpose LLMs. The model sees only previous tokens (enforced by causal masking) and generates the next token autoregressively. Decoder-only models scale better, are simpler to train, and are the dominant architecture for text generation.
Encoder-only (BERT-style): Used for classification, named entity recognition, and embedding generation. BERT-style models see the entire input sequence bidirectionally — every token attends to every other token with no causal mask. This makes them excellent for understanding tasks but not suitable for generation.

The data flow through a decoder-only Transformer looks like this: Input tokens → Token embedding → Positional encoding → Stack of N Transformer blocks → Linear projection → Softmax → Next-token probability distribution. Each Transformer block contains two sublayers: masked self-attention and a feed-forward network, with residual connections and layer normalization around each.

Self-Attention: Queries, Keys, and Values

Self-attention is the computational core of every Transformer. Its purpose is to allow each token to dynamically compute its contextual meaning by looking at other tokens in the sequence — which tokens are relevant, and by how much.

The mechanism works through three learned projections of the token embedding matrix X:

Query (Q = X W_Q): represents "what this token is looking for"
Key (K = X W_K): represents "what information this token exposes to others"
Value (V = X W_V): represents "what information this token actually provides when attended to"

W_Q, W_K, and W_V are learned weight matrices. The actual attention computation is the scaled dot-product attention formula:

Attention(Q, K, V) = softmax( Q K^T / √d_k ) V

Breaking this down step by step: Q K^T computes the dot product between every query and every key, producing a matrix of raw similarity scores. Dividing by √d_k (the square root of the key dimension) stabilizes gradients — without this scaling, dot products grow large as dimensionality increases, pushing the softmax into saturation where gradients nearly vanish. The softmax normalizes these scores into a probability distribution over positions. Finally, multiplying by V produces a weighted sum of value vectors — each token's new representation is a blend of all value vectors, weighted by how much attention it pays to each position.

In PyTorch, the core computation looks like this:

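import math

import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # mask == 0 marks blocked positions; each row needs one unmasked entry
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Normalize each row into attention weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of value vectors
    return torch.matmul(weights, V)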

The optional mask is the causal mask for decoder-only models: it sets future token scores to negative infinity before softmax, ensuring that token i can only attend to tokens 1 through i. This enforces autoregressive behavior — the model cannot "cheat" by looking at future tokens during training.
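To see what the mask does to the attention weights, here is a minimal pure-Python sketch, written without PyTorch so it runs anywhere. The function name causal_attention_weights and the toy scores are illustrative assumptions, and positions are 0-indexed here, whereas the text counts from 1.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row.

    scores[i][j] is the raw score of query position i against key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are in the future: replace their scores with -inf.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(row)  # always finite, because j == i is never masked
        exps = [math.exp(s - row_max) for s in row]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy scores: every position looks equally similar to every other.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and gives exactly zero weight to everything after it, which is the lower-triangular pattern the causal mask enforces.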

Multi-Head Attention: Capturing Multiple Relationship Types Simultaneously

A single attention head computes one set of query-key-value relationships — one "view" of how tokens relate to each other. Multi-head attention runs h parallel attention heads, each with its own Q, K, V projection matrices. The outputs are concatenated and projected through an output matrix W_O:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O

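where each head_i = Attention(Q W_Qi, K W_Ki, V W_Vi). In this formula Q, K, and V denote the inputs to the layer before any projection (in self-attention all three are the token matrix X), and W_Qi, W_Ki, and W_Vi are the projection matrices of head i.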

Why does this matter? Different attention heads learn to focus on different types of relationships. In practice, some heads track syntactic dependencies or coreference (which noun does this pronoun refer to?), others track positional proximity, others track semantic similarity, and others specialize in long-range dependencies. GPT-3 uses 96 attention heads per layer; LLaMA-2 70B uses 64. Increasing the number of heads (while keeping total computation constant by reducing the dimension per head) generally improves model quality up to a point, after which each head becomes too narrow to be useful.
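The trade-off between head count and per-head dimension is simple arithmetic. The sketch below assumes the common convention d_head = d_model / h and ignores bias terms; attention_shapes is a hypothetical helper name, and d_model = 768 is used only as an example size.

```python
def attention_shapes(d_model, n_heads):
    """Per-head dimension and weight count for one multi-head attention layer.

    Assumes d_head = d_model / n_heads and ignores bias terms.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    # Each head has W_Q, W_K, and W_V of shape (d_model, d_head).
    per_head = 3 * d_model * d_head
    # W_O maps the concatenated heads (n_heads * d_head == d_model) back to d_model.
    output_proj = d_model * d_model
    return d_head, n_heads * per_head + output_proj


for h in (1, 8, 12):
    d_head, params = attention_shapes(768, h)
    print(f"heads={h:2d}  d_head={d_head:3d}  weights={params:,}")
```

Under this convention the weight count comes out to 4 × d_model² no matter how many heads are used. More heads means more, narrower views of the sequence rather than more parameters.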

Feed-Forward Sublayers and Residual Connections

After self-attention mixes information across tokens, a position-wise feed-forward network (FFN) operates independently on each token's representation:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

In the original design the FFN expands the dimension by a factor of 4 (so a model with d_model = 768 has an inner FFN dimension of 3,072), applies a nonlinearity, then compresses back to d_model. The formula above uses ReLU, written as max(0, ·); many later models use GELU instead. Interpretability research suggests that much of the model's factual knowledge is stored in the FFN weights, and some studies have linked individual facts to specific FFN neurons.
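For a single token, the FFN formula can be worked through by hand. The following sketch uses plain Python lists and made-up weights (nothing here is a trained value), with d_model = 2 and an inner dimension of 4 to keep the arithmetic readable.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 for a single token vector x."""
    # Expand: hidden[j] = ReLU(sum_i x[i] * W1[i][j] + b1[j])
    hidden = [
        max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    # Compress back to the model dimension.
    return [
        sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]


# Toy sizes and hypothetical weights: d_model = 2, inner dimension = 4.
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.0, -1.0]]
b2 = [0.5, 0.5]

print(ffn([1.0, 2.0], W1, b1, W2, b2))  # -> [1.5, -1.5]
```

For the input [1.0, 2.0] the hidden pre-activations are [1.0, 1.0, -0.5, 3.0]. ReLU zeroes the third unit, so that unit contributes nothing to the output. Because the function takes one token at a time, it also shows what "position-wise" means: no information moves between tokens in this sublayer.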

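Both the self-attention and FFN sublayers are wrapped with residual connections and layer normalization. The residual connection adds the input directly to the sublayer output. In the original Transformer this is Output = LayerNorm(x + Sublayer(x)), known as post-norm; many later models, including GPT-2 and LLaMA, instead normalize the sublayer input, giving Output = x + Sublayer(Norm(x)), known as pre-norm. Either way, the identity path gives gradients a direct route to earlier layers, which is what makes very deep networks (dozens to hundreds of layers) trainable. Layer normalization normalizes activations per token across features, computed as: LN(x) = (x − μ) / √(σ² + ε) × γ + β where μ and σ² are the mean and variance of that token's features, ε is a small constant for numerical stability, and γ and β are learned scale and shift parameters.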
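Both formulas fit in a few lines of plain Python. This is an illustrative sketch for a single token vector: layer_norm and post_norm_block are names chosen for this example, γ defaults to ones and β to zeros, and ε = 1e-5 is an assumed default.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LN(x) = (x - mean) / sqrt(var + eps) * gamma + beta over one token's features."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance (divide by n)
    return [(v - mean) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]


def post_norm_block(x, sublayer):
    """Output = LayerNorm(x + Sublayer(x)), the residual form given in the text."""
    residual = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(residual)


x = [1.0, 2.0, 3.0]
print(layer_norm(x))
# A sublayer that outputs zeros leaves only the residual path, so the block reduces to LN(x).
print(post_norm_block(x, lambda v: [0.0] * len(v)))
```

The second call illustrates why residual connections help: even when a sublayer contributes nothing, the input still passes through unchanged apart from normalization.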

Positional Encoding: Teaching Transformers About Order

Self-attention has no inherent notion of sequence order — it treats the input as a set, not a sequence. Without positional information, "the dog bit the man" and "the man bit the dog" would produce identical attention structures. Position must be injected explicitly.

The original Transformer used sinusoidal positional encoding: fixed mathematical functions of position and dimension. Different sine and cosine frequencies encode position at different scales, producing a unique vector for each position that the model can learn to distinguish. The encoding requires no learned parameters and is defined for any position, and the original authors chose it partly in the hope that it would extrapolate to sequence lengths longer than those seen during training. In practice that extrapolation is limited.
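As a concrete reference, here is a small sketch of the sinusoidal scheme. It follows the formulation in the original paper, with sine on even indices, cosine on odd indices, and a base of 10000; the function name is an assumption made for this example.

```python
import math


def sinusoidal_encoding(position, d_model, base=10000.0):
    """Positional vector with sin on even indices and cos on odd indices."""
    pe = []
    for i in range(0, d_model, 2):
        # Lower indices rotate quickly with position; higher indices rotate slowly.
        angle = position / (base ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]


for pos in range(3):
    print(pos, [round(v, 3) for v in sinusoidal_encoding(pos, 4)])
```

Position 0 always maps to alternating 0 and 1 values. Each later position gets a distinct vector because the sine and cosine pairs advance at different rates, and that vector is added to the token embedding before the first Transformer block.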

Modern LLMs use Rotary Positional Embeddings (RoPE), introduced in the RoFormer paper and adopted by LLaMA, Mistral, and most open-source LLMs. RoPE applies a position-dependent rotation to query and key vectors inside the attention computation rather than adding positional information at the input. The key insight is that the dot product of a rotated query at position m with a rotated key at position n depends on the relative distance (m − n) rather than on the absolute positions. This relative structure makes RoPE a good fit for context-extension techniques that rescale the rotation frequencies, although extrapolating far beyond the training length without such adjustments remains unreliable. GPT-style models originally used learned absolute position embeddings (a simple learned lookup table indexed by position), which work well but impose a hard context length limit, because positions beyond the table have no embedding.
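The relative-distance property can be verified directly. The sketch below rotates consecutive pairs of vector components, using the frequency schedule from the RoFormer paper with base 10000. The vectors q and k are arbitrary made-up values, and real implementations operate on batched tensors rather than Python lists.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # each pair has its own frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]   # hypothetical query vector
k = [1.0, 0.4, -0.7, 0.9]   # hypothetical key vector

# Same offset m - n = 3 at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

The two printed scores agree up to floating-point rounding, because rotating both vectors by the same extra angle leaves their dot product unchanged. Changing the offset, for example positions 5 and 4, gives a different score.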

Tokenization: BPE vs WordPiece

Before any of the Transformer machinery operates, text must be converted to tokens. Tokenization is not a preprocessing detail — it is part of the model design and directly affects cost, performance, and bias. The two most common subword tokenization algorithms are:

Byte Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair of tokens, building up a vocabulary of common subwords. GPT-2, GPT-4, and the LLaMA family use BPE-based tokenizers. The GPT-4 tokenizer has a vocabulary of approximately 100,000 tokens. BPE is deterministic and efficient but optimizes for frequency rather than linguistic structure.
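The merge loop is short enough to write out. This is a teaching sketch of character-level BPE training on a toy word-frequency table. The helper names and the corpus are invented for illustration, ties are broken by first occurrence (an arbitrary choice made here), and production tokenizers add details such as byte-level fallback and pre-tokenization.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each occurrence of the adjacent pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(word_counts, num_merges):
    """Learn merge rules from a {word: frequency} dict, starting from characters."""
    corpus = Counter()
    for word, count in word_counts.items():
        corpus[tuple(word)] += count
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair encountered
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            new_corpus[merge_pair(symbols, best)] += count
        corpus = new_corpus
    return merges, corpus


# Toy corpus with made-up word frequencies.
merges, corpus = train_bpe({"walking": 5, "talking": 4, "walked": 3}, num_merges=4)
print(merges)
print(dict(corpus))
```

On this corpus the first merges build up the shared stem, starting with "a" + "l" and then "al" + "k". This is the frequency-driven behavior described above: the algorithm knows nothing about morphology and simply compresses whatever pairs occur most often.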

WordPiece (used by BERT and related models) is similar to BPE but selects merges based on likelihood rather than raw frequency: it merges the pair that most increases the likelihood of the training data given the vocabulary. A key difference: WordPiece prefixes continuation subwords with "##" (e.g., "##ing"), which marks where each word begins and makes the tokenization easier to interpret.
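Once a WordPiece vocabulary exists, text is split with a greedy longest-match-first search. The sketch below shows that encoding step only, not the likelihood-based training. The vocabulary is a hypothetical toy set, and the function name is chosen for this example.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary; a real BERT vocabulary has roughly 30,000 entries.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", vocab))     # -> ['play', '##ing']
print(wordpiece_tokenize("unplayable", vocab))  # -> ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", vocab))         # -> ['[UNK]']
```

The "##" prefix makes it visible which tokens start a word and which continue one, so "play" and "##play" are separate vocabulary entries.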

A practical consequence: English text tokenizes at approximately 0.75 words per token (1.3 tokens per word on average), while non-English languages with larger character sets often tokenize much less efficiently — a Japanese or Arabic sentence may require 2–3× as many tokens as the equivalent English sentence. This has direct cost implications when using token-based API billing.
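For back-of-the-envelope budgeting, the ratio above turns into a two-line estimate. The helper names and the price are placeholders, not real API rates, and actual token counts should come from the provider's tokenizer.

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token estimate for English prose using the average ratio from the text."""
    return round(word_count * tokens_per_word)


def estimate_cost(word_count, price_per_million_tokens, tokens_per_word=1.3):
    """Estimated input cost, in whatever currency the price is quoted in."""
    tokens = estimate_tokens(word_count, tokens_per_word)
    return tokens * price_per_million_tokens / 1_000_000


words = 1_000
print(estimate_tokens(words))                       # English at about 1.3 tokens per word
print(estimate_tokens(words, tokens_per_word=2.6))  # hypothetical text that tokenizes 2x worse
print(estimate_cost(words, price_per_million_tokens=2.0))  # placeholder price, not a real rate
```

The second call shows how a less efficient tokenization scales the bill: the same content at twice the tokens per word costs twice as much under per-token pricing.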

GPT vs BERT: Architectural Choices and Use Cases

The practical distinction between GPT-style (decoder-only) and BERT-style (encoder-only) models comes down to the masking strategy and training objective. BERT uses masked language modeling: random tokens in the input are masked, and the model predicts them using bidirectional context. GPT uses causal language modeling: predict each token from only the preceding tokens. BERT is better at understanding tasks (classification, named entity recognition, question answering over fixed text); GPT is better at generation tasks (text completion, conversation, code generation, summarization).
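The two training objectives differ mainly in how inputs and targets are built from the same token sequence. Here is a minimal sketch with hypothetical helper names. Mask positions are passed in explicitly to keep the example deterministic, whereas BERT samples them at random during pre-training.

```python
def causal_lm_pairs(tokens):
    """Causal LM: the target at each position is the next token."""
    return tokens[:-1], tokens[1:]


def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Masked LM: hide chosen positions; targets are the original tokens there."""
    masked = [mask_token if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return masked, targets


tokens = ["the", "dog", "bit", "the", "man"]

inputs, targets = causal_lm_pairs(tokens)
print(inputs, targets)

masked, mlm_targets = masked_lm_example(tokens, mask_positions={1, 4})
print(masked, mlm_targets)
```

In the causal case every position is a training example, and the model only ever sees tokens to its left. In the masked case only the hidden positions are scored, but the model can use context on both sides of each one.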

For engineers building AI tools, the choice is usually straightforward: use a GPT-style model (via OpenAI API, Anthropic Claude API, or open-source LLaMA derivatives) for generation, drafting, and Q&A tasks. Use a BERT-style model (or sentence-transformer) for semantic search, classification, and generating embeddings for a RAG system. Understanding the architecture helps you choose the right tool for each layer of your AI system.

