What Is a Transformer and Why Did It Change AI?

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., replaced recurrent neural networks (RNNs) as the dominant approach to sequence modeling in natural language processing. RNNs process tokens sequentially, one at a time, which makes them slow to train and prone to the vanishing gradient problem, so long-range dependencies are hard to learn. Transformers process all tokens of a training sequence in parallel using a mechanism called self-attention, which lets every token attend directly to every other token in the sequence regardless of distance. This change made it practical to train much larger models on much more data, and ultimately produced the large language models (LLMs) behind ChatGPT, Claude, and Gemini, as well as open-weight families such as LLaMA.

Understanding the Transformer at a mechanical level — not just at the conceptual level of diagrams and buzzwords — enables engineers to reason about why LLMs behave as they do: why context length matters, why token count affects cost, why fine-tuning works, and what the computational bottlenecks are during inference. This article builds that mechanical understanding from the ground up.

Encoder vs Decoder: The Two Transformer Variants

The original Transformer architecture had two components: an encoder that processes the input sequence and builds contextual representations, and a decoder that generates output tokens one at a time using both the encoder's representations and its own previously generated tokens. This encoder-decoder design is used for sequence-to-sequence tasks like machine translation — the encoder reads French, the decoder generates English.

Most models built since then use one of two specialized variants:

  • Decoder-only (GPT-style): Used by GPT-4, Claude, LLaMA, Mistral, and virtually all modern general-purpose LLMs. The model sees only previous tokens (enforced by causal masking) and generates the next token autoregressively. Decoder-only models are simpler to train (a single next-token objective over plain text) and have scaled well in practice, which has made them the dominant architecture for text generation.
  • Encoder-only (BERT-style): Used for classification, named entity recognition, and embedding generation. BERT-style models see the entire input sequence bidirectionally — every token attends to every other token with no causal mask. This makes them excellent for understanding tasks but not suitable for generation.

The data flow through a decoder-only Transformer looks like this: Input tokens → Token embedding → Positional encoding → Stack of N Transformer blocks → Linear projection → Softmax → Next-token probability distribution. Each Transformer block contains two sublayers: masked self-attention and a feed-forward network, with residual connections and layer normalization around each.
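The last three steps of that flow (linear projection, softmax, next-token distribution) can be sketched in plain Python. The vocabulary, hidden state, and projection weights below are made-up toy values, not taken from any real model; the sketch is only meant to show the shape of the computation.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(hidden, W_out):
    # hidden: output of the final block for the last position (length d_model)
    # W_out: d_model x vocab_size projection matrix, stored as a list of rows
    vocab_size = len(W_out[0])
    logits = [sum(hidden[i] * W_out[i][j] for i in range(len(hidden)))
              for j in range(vocab_size)]
    return softmax(logits)

# Hypothetical toy setup: d_model = 2, vocabulary of 3 tokens.
vocab = ["cat", "dog", "sat"]
hidden = [1.0, 2.0]
W_out = [[0.5, -1.0, 0.0],
         [0.0, 0.5, 1.5]]

probs = next_token_distribution(hidden, W_out)
predicted = vocab[probs.index(max(probs))]  # greedy choice: highest probability
```

Here the logits are 0.5, 0.0, and 3.0, so greedy decoding picks "sat". Real systems usually sample from the distribution rather than always taking the maximum.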

Self-Attention: Queries, Keys, and Values

Self-attention is the computational core of every Transformer. Its purpose is to allow each token to dynamically compute its contextual meaning by looking at other tokens in the sequence — which tokens are relevant, and by how much.

The mechanism works through three learned projections of the token embedding matrix X:

  • Query (Q = X W_Q): represents "what this token is looking for"
  • Key (K = X W_K): represents "what information this token exposes to others"
  • Value (V = X W_V): represents "what information this token actually provides when attended to"

W_Q, W_K, and W_V are learned weight matrices. The actual attention computation is the scaled dot-product attention formula:

Attention(Q, K, V) = softmax( Q K^T / √d_k ) V

Breaking this down step by step: Q K^T computes the dot product between every query and every key, producing a matrix of raw similarity scores. Dividing by √d_k (the square root of the key dimension) stabilizes gradients — without this scaling, dot products grow large as dimensionality increases, pushing the softmax into saturation where gradients nearly vanish. The softmax normalizes these scores into a probability distribution over positions. Finally, multiplying by V produces a weighted sum of value vectors — each token's new representation is a blend of all value vectors, weighted by how much attention it pays to each position.
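The effect of the √d_k scaling can be checked numerically. The sketch below is an illustration under simple assumptions (random unit-variance vectors, d_k = 512, eight keys, a fixed seed), written in plain Python so it runs without any dependencies. It compares the largest softmax weight with and without scaling.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
d_k = 512
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

raw = [dot(query, k) for k in keys]          # spread grows like sqrt(d_k)
scaled = [s / math.sqrt(d_k) for s in raw]   # spread brought back to about 1

max_weight_raw = max(softmax(raw))
max_weight_scaled = max(softmax(scaled))
```

Without scaling, the largest weight sits much closer to 1, which is the saturated regime where softmax gradients become very small. With scaling, the distribution is flatter and the weights remain trainable.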

In PyTorch, the core computation looks like this:

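import math

import torch


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (..., seq_len, d_k); mask broadcasts to (..., seq_len, seq_len), 0 = blocked."""
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Most negative finite value for this dtype, so blocked positions get ~0 weight
        scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    # Each row becomes a probability distribution over key positions
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of value vectors
    return torch.matmul(weights, V)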

The optional mask is most often the causal mask used by decoder-only models: it replaces the scores of future tokens with a very large negative value (effectively negative infinity) before the softmax, so those positions receive essentially zero weight and token i can only attend to tokens 1 through i. This enforces autoregressive behavior: the model cannot "cheat" by looking at future tokens during training.
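To see the causal mask at work without PyTorch, here is a minimal plain-Python sketch that computes only the attention weights for a three-token sequence. The query and key vectors are arbitrary toy values. It uses a true negative infinity, which is safe here because every row keeps at least its own position unmasked.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(Q, K):
    # Q, K: lists of seq_len vectors, each of length d_k.
    d_k = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        row = []
        for j, k in enumerate(K):
            if j > i:
                row.append(float("-inf"))  # future position: blocked
            else:
                row.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights.append(softmax(row))
    return weights

# Toy inputs (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(Q, K)
```

The first row of W is [1.0, 0.0, 0.0]: the first token can only attend to itself. Every entry above the diagonal is exactly zero, and each row still sums to 1.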

Multi-Head Attention: Capturing Multiple Relationship Types Simultaneously

A single attention head computes one set of query-key-value relationships — one "view" of how tokens relate to each other. Multi-head attention runs h parallel attention heads, each with its own Q, K, V projection matrices. The outputs are concatenated and projected through an output matrix W_O:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O

where each head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)

Why does this matter? Different attention heads can learn to focus on different types of relationships. In practice, some heads track syntactic dependencies (which verb does this subject belong to?), others resolve references (which noun does this pronoun refer to?), others track positional proximity or semantic similarity, and others specialize in long-range dependencies. GPT-3 uses 96 attention heads per layer; LLaMA-2 70B uses 64. Increasing the number of heads (while keeping total computation constant by reducing the dimension per head) generally improves model quality up to a point.
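The trade-off between head count and per-head dimension is simple arithmetic. In most implementations the h per-head projections are fused into one large matrix whose output is sliced into h equal chunks, so the sketch below shows only that slicing and the final concatenation, not the attention itself. The d_model values for GPT-3 (12288) and LLaMA-2 70B (8192) are the published model widths and are not stated in this article.

```python
def split_heads(x, num_heads):
    # x: one token's projected representation, length d_model.
    d_model = len(x)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [x[h * d_head:(h + 1) * d_head] for h in range(num_heads)]

def concat_heads(heads):
    # Inverse of split_heads: lay the per-head outputs side by side.
    return [value for head in heads for value in head]

token = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # d_model = 8
heads = split_heads(token, num_heads=4)            # 4 heads, d_head = 2
restored = concat_heads(heads)

# Per-head dimension for the models mentioned above.
gpt3_d_head = 12288 // 96
llama2_70b_d_head = 8192 // 64
```

Both models end up with 128 dimensions per head, which illustrates the point that head count and model width tend to be scaled together.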

Feed-Forward Sublayers and Residual Connections

After self-attention mixes information across tokens, a position-wise feed-forward network (FFN) operates independently on each token's representation:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

The FFN typically expands the dimension by a factor of 4 (so a model with d_model = 768 has an inner FFN dimension of 3,072), applies a nonlinearity, then compresses back. The formula above uses ReLU, written max(0, ·); GPT-2 and BERT use GELU instead. Interpretability research suggests that much of the model's factual knowledge is stored in the FFN weights, and that some facts can be associated with specific FFN neurons.
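A small worked example may make the FFN concrete. The sketch below implements the ReLU formula directly for a single token and counts the parameters of one FFN sublayer. The toy weights are hypothetical, and the 4× expansion is the default assumed in the text.

```python
def ffn(x, W1, b1, W2, b2):
    # x: length d_model; W1: d_model x d_ff; W2: d_ff x d_model.
    hidden = [max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
            for k in range(len(b2))]

def ffn_param_count(d_model, expansion=4):
    d_ff = expansion * d_model
    # W_1 and b_1, then W_2 and b_2
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

# Toy weights (hypothetical): d_model = 2, d_ff = 4.
x = [1.0, -1.0]
W1 = [[1.0, 0.0, -1.0, 0.5],
      [0.0, 1.0, 1.0, 0.5]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
b2 = [0.5, -0.5]

y = ffn(x, W1, b1, W2, b2)
params_768 = ffn_param_count(768)
```

For this input only the first hidden unit survives the ReLU, giving y = [1.5, -0.5]. With d_model = 768 the FFN holds 4,722,432 parameters per block.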

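Both the self-attention and FFN sublayers are wrapped with residual connections and layer normalization. The residual connection adds the input directly to the sublayer output: Output = LayerNorm(x + Sublayer(x)). This is the post-norm arrangement of the original Transformer; GPT-2 and most later LLMs instead normalize the sublayer input, computing x + Sublayer(LayerNorm(x)). Either way, the residual path enables training of very deep networks (dozens to hundreds of layers) because it gives gradients a direct route around every sublayer. Layer normalization normalizes activations per token across features, computed as: LN(x) = (x − μ) / √(σ² + ε) × γ + β, where μ and σ² are the mean and variance of x over the feature dimension, ε is a small constant for numerical stability, and γ and β are learned scale and shift parameters.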
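The layer normalization formula and the residual wrapper translate almost line for line into code. This is a minimal sketch for a single token vector, following the Output = LayerNorm(x + Sublayer(x)) form given above. The toy sublayer is a hypothetical stand-in for attention or the FFN, and γ and β are set to their usual initial values of 1 and 0.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize one token's features to zero mean and unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]

def post_norm_sublayer(x, sublayer, gamma, beta):
    # Output = LayerNorm(x + Sublayer(x))
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed, gamma, beta)

def toy_sublayer(v):
    # Hypothetical stand-in for attention or the FFN.
    return [0.1 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4
beta = [0.0] * 4

out = post_norm_sublayer(x, toy_sublayer, gamma, beta)
mean_out = sum(out) / len(out)
var_out = sum((v - mean_out) ** 2 for v in out) / len(out)
```

After normalization the output has a mean of approximately 0 and a variance of approximately 1, whatever the scale of the input.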

Positional Encoding: Teaching Transformers About Order

Self-attention has no inherent notion of sequence order: it treats the input as a set, not a sequence. Without positional information (and without a causal mask), permuting the input tokens merely permutes the outputs, so each word in "the dog bit the man" and "the man bit the dog" would receive exactly the same representation in both sentences. Position must be injected explicitly.

The original Transformer used sinusoidal positional encoding: fixed mathematical functions of position and dimension. Different sine and cosine frequencies encode position at different scales, producing a unique vector for each position that the model can learn to distinguish. It requires no learned parameters, and the original authors chose it in the hope that it would let the model extrapolate to sequence lengths longer than those seen during training, although in practice such extrapolation is limited.
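The sinusoidal scheme is short enough to write out. In the original paper, even dimensions use PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and odd dimensions use the cosine of the same angle. The sketch below computes the encoding for a single position. The choice of d_model = 8 is only for readability.

```python
import math

def sinusoidal_encoding(position, d_model):
    # Even index i gets sin(angle), index i + 1 gets cos(angle),
    # where angle = position / 10000^(i / d_model) for i = 0, 2, 4, ...
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

pe0 = sinusoidal_encoding(0, 8)
pe1 = sinusoidal_encoding(1, 8)
```

Position 0 always encodes to alternating 0 and 1 values, since sin(0) = 0 and cos(0) = 1. Later positions differ most in the low-index (high-frequency) dimensions and only slowly in the high-index ones.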

Many modern LLMs use Rotary Positional Embeddings (RoPE), introduced in the RoFormer paper and adopted by LLaMA, Mistral, and many other open-source LLMs. RoPE applies a position-dependent rotation to query and key vectors inside the attention computation rather than adding positional information at the input. The key insight is that the dot product of a rotated query at position m with a rotated key at position n depends only on the relative distance (m − n), not on the absolute positions. Out of the box, RoPE does not extrapolate reliably far beyond the context length seen during training, but its rotation frequencies can be rescaled (for example, with position interpolation) to extend the context window with relatively little additional training. It has also generally performed better than learned absolute position embeddings. Early GPT models such as GPT-2 and GPT-3 used learned absolute position embeddings (a simple learned lookup table indexed by position), which work well but impose a hard context length limit.
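The relative-distance property can be verified with a two-dimensional toy case. Real RoPE rotates many pairs of dimensions, each at its own frequency. The sketch below uses a single pair and an arbitrary frequency theta = 0.5, both of which are simplifying assumptions.

```python
import math

def rotate(vec, position, theta=0.5):
    # Rotate a 2-D vector by the angle position * theta.
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def rope_score(q, k, m, n):
    # Attention score between a query at position m and a key at position n.
    qm = rotate(q, m)
    kn = rotate(k, n)
    return qm[0] * kn[0] + qm[1] * kn[1]

# Toy query and key vectors (hypothetical values).
q = (1.0, 2.0)
k = (0.5, -1.0)

score_a = rope_score(q, k, m=5, n=2)      # distance 3
score_b = rope_score(q, k, m=105, n=102)  # distance 3, shifted by 100
score_c = rope_score(q, k, m=5, n=4)      # distance 1
```

score_a and score_b agree up to floating-point error because both pairs are three positions apart, while score_c differs because the distance changed.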

Tokenization: BPE vs WordPiece

Before any of the Transformer machinery operates, text must be converted to tokens. Tokenization is not a preprocessing detail — it is part of the model design and directly affects cost, performance, and bias. The two most common subword tokenization algorithms are:

Byte Pair Encoding (BPE) starts with individual characters (or raw bytes, in the byte-level variant introduced with GPT-2) and iteratively merges the most frequent adjacent pair of tokens, building up a vocabulary of common subwords. GPT-2, GPT-4, and the LLaMA family use BPE-based tokenizers. The GPT-4 tokenizer has a vocabulary of approximately 100,000 tokens. BPE is deterministic and efficient but optimizes for frequency rather than linguistic structure.
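The merge loop is easier to follow in code. This is a minimal sketch of BPE training on a tiny hypothetical word-frequency table. It omits details that production tokenizers add, such as byte-level fallback, end-of-word markers, and pre-tokenization rules.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: dict mapping a tuple of symbols to its corpus frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    # Start from single characters and apply the most frequent merge each round.
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(corpus, num_merges=3)
```

On this corpus the suffix "est" becomes a single token within the first two merges, because "newest" and "widest" together make its character pairs the most frequent ones.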

WordPiece (used by BERT and related models) is similar to BPE but selects merges based on likelihood rather than frequency: it merges the pair that most increases the likelihood of the training data given the vocabulary. A visible difference is that WordPiece prefixes continuation subwords with "##" (e.g., "##ing"), which marks word boundaries so that pieces can be rejoined into words and makes the output easier to interpret.
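Once a WordPiece vocabulary exists, applying it to a word is a greedy longest-match-first scan, which is how BERT's tokenizer segments words. The sketch below covers only that application step, not the likelihood-based training, and the vocabulary is a hypothetical toy set.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest vocabulary
    # entry that matches at the current position.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def detokenize_word(pieces):
    # Strip the "##" markers and join the pieces back into one word.
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)

# Hypothetical toy vocabulary.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
pieces = wordpiece_tokenize("playing", toy_vocab)
```

Here "playing" becomes ["play", "##ing"] and "unplayable" becomes ["un", "##play", "##able"]. A word with no matching pieces collapses to the unknown token, which is one reason the original text cannot always be recovered.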

A practical consequence: English text tokenizes at approximately 0.75 words per token (about 1.3 tokens per word on average), while languages that are underrepresented in the tokenizer's training data, especially those written in non-Latin scripts, often tokenize much less efficiently: a Japanese or Arabic sentence may require 2–3× as many tokens as the equivalent English sentence. This has direct cost implications when using token-based API billing.
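A back-of-the-envelope estimate shows how the token ratio feeds into cost. The price and the 2.5× multiplier below are hypothetical placeholders chosen for illustration; actual prices and ratios depend on the provider, the tokenizer, and the text.

```python
def estimate_tokens(english_word_count, tokens_per_word=1.3, multiplier=1.0):
    # multiplier models a language that needs more tokens for the same content.
    return round(english_word_count * tokens_per_word * multiplier)

def estimate_cost(token_count, price_per_million_tokens):
    return token_count / 1_000_000 * price_per_million_tokens

# A document of 1,000 English words, and the same content in a language
# assumed (hypothetically) to need 2.5x as many tokens.
english_tokens = estimate_tokens(1000)
other_tokens = estimate_tokens(1000, multiplier=2.5)

# Hypothetical price: 2.00 currency units per million input tokens.
english_cost = estimate_cost(english_tokens, 2.00)
other_cost = estimate_cost(other_tokens, 2.00)
```

Under these assumptions the same content costs 2.5 times as much to process, and it also consumes 2.5 times as much of the context window.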

GPT vs BERT: Architectural Choices and Use Cases

The practical distinction between GPT-style (decoder-only) and BERT-style (encoder-only) models comes down to the masking strategy and training objective. BERT uses masked language modeling: random tokens in the input are masked, and the model predicts them using bidirectional context. GPT uses causal language modeling: predict each token from only the preceding tokens. BERT is better at understanding tasks (classification, named entity recognition, question answering over fixed text); GPT is better at generation tasks (text completion, conversation, code generation, summarization).
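The two training objectives differ in how they turn a sentence into prediction targets, which a short sketch can show. The mask positions here are fixed by hand for clarity, whereas BERT chooses them at random; the function names are illustrative only.

```python
def causal_lm_examples(tokens):
    # Each prefix predicts the token that follows it.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    # Replace the chosen positions with a mask; targets are the originals.
    masked = [mask_token if i in mask_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return masked, targets

tokens = ["the", "dog", "bit", "the", "man"]
clm = causal_lm_examples(tokens)
masked, targets = masked_lm_example(tokens, mask_positions={2})
```

The causal objective yields four examples from this sentence, each seeing only the left context. The masked objective yields one example in which the model sees "the dog [MASK] the man" and must recover "bit" using context from both sides.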

For engineers building AI tools, the choice is usually straightforward: use a GPT-style model (via OpenAI API, Anthropic Claude API, or open-source LLaMA derivatives) for generation, drafting, and Q&A tasks. Use a BERT-style model (or sentence-transformer) for semantic search, classification, and generating embeddings for a RAG system. Understanding the architecture helps you choose the right tool for each layer of your AI system.