Why CPUs Cannot Keep Up with AI Workloads

The question of why AI needs special hardware has a precise answer. At the heart of every neural network is one operation: matrix multiplication. A transformer layer performing attention computes Q×K^T, applies softmax, multiplies by V, and passes the result through a feed-forward network. Matrix operations typically account for the large majority of total compute time, with figures around 90% commonly quoted.
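The attention step described above can be sketched in plain Python. This is a minimal illustration, not an optimised implementation: the helper names (`matmul`, `softmax_rows`, `attention`) and the tiny 2×2 inputs are assumptions made for the example, and the division by sqrt(d) is the standard scaled dot-product form, which the text above does not spell out.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))                              # first matmul
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v), weights                            # second matmul

# Hypothetical toy inputs: two tokens, two dimensions each.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(q, k, v)
print(weights)
print(out)
```

Two of the three steps are matrix multiplications, and their cost grows with sequence length and model width, which is why they dominate the runtime.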

CPUs are built for a different job. A CPU is optimised for branch prediction, cache coherency, and low-latency single-thread execution, capabilities that contribute little to dense matrix math. A modern Intel Xeon might run 64 threads. An NVIDIA H100 GPU has 16,896 CUDA cores, plus tensor cores that each perform a small tiled matrix multiply-accumulate per cycle. The same matrix multiplication typically runs orders of magnitude faster on the GPU:

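# CPU (NumPy): often a sizeable fraction of a second or more at this size
import numpy as np
X = np.random.rand(4096, 4096).astype(np.float32)
W = np.random.rand(4096, 4096).astype(np.float32)
Y = X @ W

# GPU (PyTorch): same operation, typically milliseconds (requires a CUDA device)
import torch
X = torch.rand(4096, 4096, device="cuda", dtype=torch.float16)
W = torch.rand(4096, 4096, device="cuda", dtype=torch.float16)
Y = X @ W  # dispatched to tensor cores when dtype and shapes allow
torch.cuda.synchronize()  # kernels run asynchronously; wait before timing or reading Y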

AI hardware design is fundamentally shaped by one workload: repeated matrix math plus memory movement. Every decision — GPU architecture, TPU systolic arrays, custom ASICs, high-bandwidth memory — exists to make this pipeline faster or more efficient.

GPU Architecture: Tensor Cores, VRAM, and Bandwidth

A modern GPU contains on the order of a hundred streaming multiprocessors (the H100 SXM5 enables 132 SMs), each holding CUDA cores for general math, tensor cores for matrix operations, shared memory (on-chip SRAM, typically 64–228 KB per SM), and warp schedulers that execute groups of 32 threads in lockstep.

Tensor cores are the AI engine inside GPUs. Where a CUDA core performs one fused multiply-add per cycle, a first-generation tensor core performs a 4×4 matrix multiply-accumulate per cycle, which amounts to 64 multiply-add operations at once. Later generations process larger tiles and more data types. NVIDIA introduced tensor cores in the Volta architecture (2017). By the Hopper generation (H100, 2022), tensor cores in FP8 mode deliver up to about 4 PFLOPS per GPU with structured sparsity, or roughly 2 PFLOPS on dense matrices.
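The 4×4 multiply-accumulate described above is the operation D = A×B + C on small tiles. The sketch below counts the multiply-adds in one such tile to show where the figure of 64 comes from. The function name `mma_4x4` is hypothetical, and the loop is a software model of the arithmetic, not of how the hardware schedules it.

```python
def mma_4x4(a, b, c):
    """Return (A @ B + C, number of multiply-add operations) for 4x4 inputs."""
    n = 4
    d = [row[:] for row in c]      # start from the accumulator tile C
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]   # one fused multiply-add
                fma_count += 1
    return d, fma_count

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d, fma_count = mma_4x4(identity, ones, ones)
print(fma_count)  # 64
print(d[0])       # [2.0, 2.0, 2.0, 2.0]
```

A large matrix multiplication is broken into many such tiles, so the more tiles a GPU can process per cycle, the higher its matrix throughput.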

Key GPU specifications for AI work:

  • NVIDIA H100 SXM5: 80 GB HBM3, 3.35 TB/s memory bandwidth, up to ~4 PFLOPS FP8 tensor throughput with sparsity (~2 PFLOPS dense), NVLink 4.0 with 900 GB/s GPU-to-GPU bandwidth
  • NVIDIA RTX 4090 (consumer/workstation): 24 GB GDDR6X, 1,008 GB/s bandwidth — excellent for inference and fine-tuning smaller models
  • AMD MI300X: 192 GB HBM3 (among the largest of any single accelerator at launch), 5.3 TB/s bandwidth, strong for models that need extreme VRAM capacity

VRAM determines which models you can run. A 70B parameter model at FP16 precision needs 140 GB of VRAM for the weights alone (70 billion parameters × 2 bytes), more than a single H100 provides. This forces model parallelism across multiple GPUs, introducing inter-GPU communication overhead that makes the interconnect bandwidth (NVLink, InfiniBand) as important as raw compute.
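The sizing arithmetic above can be written as a small helper. This is a rough sketch that counts weights only. It ignores activations, the KV cache, and framework overhead, so real deployments need headroom beyond these numbers. The function names and the bytes-per-parameter table are assumptions for illustration.

```python
import math

# Storage cost per parameter for common precisions (bytes).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus_for_weights(num_params, precision, vram_gb):
    """Smallest GPU count whose combined VRAM holds the weights, with no headroom."""
    return math.ceil(weight_memory_gb(num_params, precision) / vram_gb)

print(weight_memory_gb(70e9, "fp16"))            # 140.0
print(min_gpus_for_weights(70e9, "fp16", 80))    # 2 (80 GB H100)
print(min_gpus_for_weights(70e9, "fp16", 192))   # 1 (192 GB MI300X)
```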

TPUs: Google's Systolic Array Architecture

Google developed the Tensor Processing Unit (TPU) to run neural-network inference at Google scale more efficiently than the general-purpose CPUs and GPUs of the time allowed. Rather than thousands of small cores, a TPU uses a systolic array: a grid of multiply-accumulate units where data flows rhythmically through the array, each unit performing a computation and passing its result to the next. Because operands and partial sums move directly between neighbouring units, intermediate values do not have to be written to memory and read back, which removes much of the memory traffic that a matrix multiplication would otherwise generate.
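To make the data flow concrete, here is a toy cycle-by-cycle simulation of a systolic array. It models an output-stationary layout, in which each unit keeps one output element while values of A move right and values of B move down. That layout is an assumption chosen for readability; real TPUs differ in which operand stays in place and in many other details. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate C = A @ B on an n x m grid of multiply-accumulate units."""
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]     # one accumulator per unit
    a_reg = [[0.0] * m for _ in range(n)]   # A value currently held by each unit
    b_reg = [[0.0] * m for _ in range(n)]   # B value currently held by each unit
    cycles = n + m + k_dim - 2              # time for the last operands to arrive

    for t in range(cycles):
        # Shift A one step right; feed row i from the left edge, delayed by i cycles.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < k_dim else 0.0
        # Shift B one step down; feed column j from the top edge, delayed by j cycles.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < k_dim else 0.0
        # Every unit multiplies the pair it holds and adds it to its accumulator.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles

result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19.0, 22.0], [43.0, 50.0]]
print(cycles)  # 4
```

Each input value is read from outside the grid once and then reused as it passes from unit to unit, which is the source of the memory-traffic savings.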

TPU generations and their focus:

  • TPU v2/v3: Training-focused; first large-scale deployments in Google Cloud
  • TPU v4: Supercomputer-scale training pods; used to train Google's largest models
  • TPU v5e/v5p: Memory-optimised; improved inference economics
  • TPU v6e (Trillium, 2024): Higher per-chip compute and energy efficiency than v5e
  • TPU v7 (Ironwood, announced 2025): Inference-first design; cost-optimised per token

The TPU software stack is XLA (Accelerated Linear Algebra) → JAX or TensorFlow → your code. This vertical integration is powerful for stable, large-scale workloads but creates significant lock-in: although JAX and TensorFlow also target GPUs, code and performance tuning built around TPUs rarely moves over without adaptation. TPUs are the right choice when you run stable workloads at Google-scale through Google Cloud, and the wrong choice when you need flexibility, custom CUDA kernels, or are in active research mode.

NPUs: On-Device AI Inference

Neural Processing Units (NPUs) are purpose-built AI accelerators integrated into consumer devices: smartphones, laptops, and embedded systems. Unlike GPUs (which optimise for throughput) and TPUs (which optimise for datacenter-scale training and serving), NPUs optimise for low-power real-time inference within the thermal and battery constraints of portable devices.

Major NPU implementations in 2026:

  • Apple Neural Engine (ANE): Integrated into the Apple Silicon SoC (M4 generation delivers approximately 38 TOPS). Powers Core ML, on-device Siri, and the on-device inference engine behind Apple Intelligence. Optimised for the specific operation graphs produced by Core ML models.
  • Qualcomm Hexagon NPU: Found in Snapdragon 8 Elite and X Elite chips. Delivers approximately 45–75 TOPS in flagship configurations. Powers on-device AI in Android flagships and Copilot+ PCs. Supports INT4 and INT8 quantised models efficiently.
  • Google Tensor (Pixel phones): Custom SoC including an NPU designed to run Google's on-device ASR, image processing, and now on-device Gemini Nano inference.

As the figures above show, flagship phones and laptops now carry tens of TOPS of dedicated AI compute. This capability, combined with quantisation techniques that shrink models by roughly 4× (INT8) to 8× (INT4) relative to FP32, often with modest accuracy loss, means that on-device models can handle language translation, code completion, voice processing, and photo enhancement without any cloud round-trip.
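The quantisation idea mentioned above can be illustrated with a minimal symmetric INT8 scheme: one scale factor per tensor, values rounded to integers in [-127, 127]. This is a sketch of one common approach, not what any particular NPU toolchain does. The function names and sample weights are made up for the example.

```python
def quantise_int8(values):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]    # hypothetical FP32 weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```

Each weight now occupies one byte where FP32 needed four, which is where the 4× size reduction comes from. INT4 halves that again at the cost of a coarser grid.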

ASICs: Custom Silicon for Maximum Efficiency

Application-Specific Integrated Circuits (ASICs) are chips designed for one specific workload. Unlike GPUs (general-purpose parallel compute) or NPUs (on-device inference), ASICs sacrifice flexibility entirely for efficiency — at hyperscale, even a 10–20% efficiency gain translates to hundreds of millions of dollars in annual savings.

Notable AI ASICs in production:

  • Groq LPU (Language Processing Unit): A dataflow architecture designed specifically for LLM inference. By compiling the model into a deterministic execution schedule at compile time, the LPU eliminates the runtime scheduling overhead that makes GPU inference latency variable. Groq's published benchmarks report inference speeds exceeding 750 tokens/second for Mixtral 8x7B, which the company presents as roughly 10× faster than GPU inference for latency-sensitive applications. Tradeoff: the model must be compiled for Groq's architecture; dynamic control flow in the model is constrained.
  • Cerebras WSE-3 (Wafer Scale Engine): Rather than cutting individual chips from a silicon wafer, Cerebras uses the entire 300mm wafer as a single processor — containing 900,000 AI-optimised cores and 44 GB of on-chip SRAM. The massive on-chip memory eliminates the HBM memory bandwidth bottleneck that limits GPU training performance. WSE-3 delivers 125 PFLOPS of AI compute with no inter-chip communication overhead. Tradeoff: expensive to manufacture, limited software ecosystem, niche use cases.
  • AWS Trainium / Inferentia: Amazon's custom training (Trainium) and inference (Inferentia) chips, programmed via the Neuron SDK. Designed to reduce AWS's dependency on NVIDIA GPUs and lower cost per token for EC2 AI instances.
  • Google TPU (revisited as ASIC): Google's TPUs are the most successful and widely deployed AI ASICs — used to serve trillions of inference requests per day through Google Search, Google Maps, and Gmail.

Memory Bandwidth: The Real Bottleneck

An insight about AI hardware that many engineers miss: AI systems are often limited by memory bandwidth, not compute. Modern GPUs can compute matrix multiplications faster than they can be fed with data from memory. An H100 offers up to about 4 PFLOPS of FP8 tensor throughput but only 3.35 TB/s of memory bandwidth, which works out to roughly 1,200 floating-point operations for every byte fetched. Any operation that does less work than that per byte, single-token LLM decoding included, spends its time waiting for data, not performing math.
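The compute-to-bandwidth ratio above is the basis of a roofline estimate: an operation is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below peak FLOPS divided by bandwidth. The sketch below uses the 4 PFLOPS and 3.35 TB/s figures quoted in this section. The function names are hypothetical, and the byte count assumes each matrix passes through memory exactly once, which is an idealisation.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOPs/byte) at which compute and memory balance."""
    return peak_flops / bandwidth_bytes_per_s

def matmul_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte for an (m x k) @ (k x n) product, each matrix moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)

PEAK = 4e15       # FLOPS, figure quoted in the text
BW = 3.35e12      # bytes per second, figure quoted in the text

ridge = ridge_point(PEAK, BW)                       # about 1,194 FLOPs per byte
decode = matmul_intensity(1, 4096, 4096, 1)         # one token against a weight matrix
training = matmul_intensity(8192, 8192, 8192, 1)    # large batched matmul

print(decode < ridge)    # True: memory-bound
print(training > ridge)  # True: compute-bound
```

The single-token case reaches only a tiny fraction of peak throughput, while the large batched case can saturate the tensor cores. This is why batch size matters so much for GPU utilisation.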

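High Bandwidth Memory (HBM) is the engineering response to this problem. Instead of placing DRAM on the motherboard and connecting it via narrow memory buses, HBM stacks DRAM dies vertically, links them with through-silicon vias, and places each stack beside the GPU die on a silicon interposer, connected through thousands of wires. The per-stack figures below are approximate peak rates for each generation; shipping parts often run somewhat lower: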

  • HBM2 (NVIDIA V100): ~256 GB/s per stack
  • HBM2E (NVIDIA A100): ~460 GB/s per stack
  • HBM3 (NVIDIA H100): ~800 GB/s per stack, 3.35 TB/s total
  • HBM3E (NVIDIA Blackwell B200): >1 TB/s per stack, 8 TB/s total
  • HBM4 (next generation, 2026+): 1.5–2 TB/s per stack

Even with HBM3E, memory is still the bottleneck for the largest models. A 70B parameter model at BF16 occupies 140 GB, so reading every weight once means moving 140 GB through the memory system, and during inference each forward pass must stream those weights again. This is why the AMD MI300X with its 192 GB HBM3 pool is attractive for very large model inference: you can fit the full model on a single device without inter-GPU communication overhead.
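Because every weight must be read for each generated token, memory bandwidth sets a ceiling on single-stream decoding speed. The estimate below is a back-of-the-envelope bound under simplifying assumptions: batch size one, weights read exactly once per token, and no KV-cache traffic or compute time. The function name is hypothetical, and the bandwidth values are the ones quoted in this section.

```python
def max_decode_tokens_per_s(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on tokens/second when each token streams all weights once."""
    model_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes

# 70B model in BF16 (140 GB) on an MI300X at 5.3 TB/s
mi300x_bf16 = max_decode_tokens_per_s(70e9, 2, 5.3e12)
# The same model quantised to INT8 (70 GB) on an H100 at 3.35 TB/s
h100_int8 = max_decode_tokens_per_s(70e9, 1, 3.35e12)

print(round(mi300x_bf16, 1))  # 37.9
print(round(h100_int8, 1))    # 47.9
```

Real systems raise total throughput by batching many requests, so that each weight read is shared across several tokens.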

Choosing Hardware: Training vs Inference

Training and inference have fundamentally different hardware requirements:

  • Training requires maximum compute (FP16/BF16), large VRAM to hold model weights plus gradients plus optimiser state (roughly 16–20 bytes per parameter for Adam), high inter-GPU bandwidth (NVLink for within-node, InfiniBand for cross-node), and long job stability. NVIDIA H100/H200 clusters or Google TPU pods are the standard choices.
  • Inference requires low latency, high throughput per dollar, and often INT8 or INT4 precision. ASICs like Groq LPU excel at low-latency inference. GPUs (A100, H100) handle mixed training and inference workloads. Consumer GPUs (RTX 4090) work for single-user inference or small teams. NPUs handle on-device inference where cloud connectivity is not available or privacy is required.
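The training-side figure of roughly 16 bytes per parameter can be broken down as follows. The split shown here assumes a common mixed-precision Adam setup (16-bit weights and gradients, plus FP32 master weights and two FP32 Adam moments). Other setups differ, and activations are not counted. The function names are illustrative.

```python
import math

def training_bytes_per_param(weights=2, grads=2, master_weights=4, adam_m=4, adam_v=4):
    """Bytes of persistent state per parameter under mixed-precision Adam."""
    return weights + grads + master_weights + adam_m + adam_v

def training_memory_gb(num_params, bytes_per_param):
    """Persistent training state in decimal gigabytes, excluding activations."""
    return num_params * bytes_per_param / 1e9

per_param = training_bytes_per_param()
total_gb = training_memory_gb(70e9, per_param)
min_h100s = math.ceil(total_gb / 80)   # 80 GB per H100, no headroom

print(per_param)   # 16
print(total_gb)    # 1120.0
print(min_h100s)   # 14
```

Compared with the 140 GB needed to hold the same model's weights for FP16 inference, training state is about eight times larger. This is one reason training clusters lean so heavily on fast inter-GPU links.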

The architectural reality is that the future is not one chip type but a hybrid fleet: GPUs for training and research, ASICs for production inference at scale, NPUs for edge deployment, and CPUs for orchestration. Understanding where each excels lets you design AI systems that are both capable and cost-efficient.