What is the difference between a GPU and a TPU for AI?

GPUs are general-purpose parallel processors with thousands of cores, plus tensor cores for matrix math: flexible, programmable, and dominant for training and research. TPUs use a systolic array architecture optimised for Google's AI workloads. They are more energy-efficient for stable, large-scale inference, but they are tightly integrated with Google's XLA/JAX software stack and available only through Google Cloud. GPUs win for flexibility; TPUs win for Google-scale cost efficiency.

What is an NPU and why do phones have one?

A Neural Processing Unit is a dedicated AI inference accelerator built into mobile and laptop chips. Unlike a GPU, which is designed for flexible parallel compute, an NPU is optimised for the specific mathematical operations in neural networks (matrix multiply, convolutions, activation functions) while consuming a fraction of the power. Apple's Neural Engine, Qualcomm's Hexagon, and the NPU in Google's Tensor chips enable real-time AI features (face recognition, translation, computational photography) on-device, without cloud connectivity and with far less battery drain than running the same work on a CPU or GPU.

What makes Groq's LPU different from a GPU for inference?

Groq's Language Processing Unit uses a deterministic dataflow architecture: the model is compiled ahead of deployment into a fixed execution schedule, eliminating the runtime scheduling overhead that causes GPU inference latency to vary. The result is extremely consistent, high-speed inference: Groq-reported benchmarks show over 750 tokens/second for Mixtral 8x7B, roughly 10× faster than GPU inference in latency-sensitive settings, although the gap depends on the GPU baseline and batch size. The tradeoff is inflexibility: models must be compiled for Groq's architecture, and dynamic control flow in models is constrained.

Why is memory bandwidth more important than FLOPS for AI hardware?

Many AI operations, LLM token generation above all, are memory-bandwidth limited, not compute-limited. A GPU can perform matrix multiplications faster than it can load the data from memory. During LLM inference at small batch sizes, the bottleneck is streaming billions of model weights from HBM into the compute units for every token generated. More FLOPS does not help if the memory cannot feed them fast enough. This is why HBM3E (8 TB/s on NVIDIA B200) and the AMD MI300X's 192 GB HBM3 pool matter more than raw compute numbers for many workloads.

Should I use a GPU or ASIC for AI inference in production?

GPUs are the right choice for most teams: they support virtually any model, framework, and precision without recompilation, and the software ecosystem around NVIDIA GPUs (CUDA, TensorRT, and open-source serving engines such as vLLM) is mature and optimised. ASICs like Groq LPU make sense when you have a stable, high-throughput inference workload where latency and cost-per-token dominate, and you can afford the integration cost of compiling your model for the ASIC's architecture. Hyperscalers (Google, Amazon, Microsoft) build custom ASICs because at their scale, even 15% efficiency gains are worth $500M+ in development costs.


AI & Automation·11 min read·June 19, 2026

⚡ GPUs, TPUs, NPUs, and ASICs: The Silicon Powering Modern AI Systems

A systems engineer's guide to AI accelerator hardware: why CPUs fail at AI workloads, how GPU tensor cores work, when to use a TPU vs GPU, what NPUs do on your phone, how custom ASICs like Groq LPU and Cerebras WSE differ, and why memory bandwidth is often the real bottleneck.

Why CPUs Cannot Keep Up with AI Workloads

The question of why AI needs special hardware has a precise answer. At the heart of every neural network is one operation: matrix multiplication. A transformer layer performing attention computes Q×K^T, applies softmax, multiplies by V, and passes through a feed-forward network — with matrix operations constituting roughly 90% of total compute time.
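To make the matrix-multiplication claim concrete, the following rough sketch counts floating-point operations in one transformer layer. The dimensions, the 4× feed-forward expansion, and the cost of about 5 operations per softmax element are illustrative assumptions rather than measurements, and layer norms, residual additions, and activation functions are ignored.

```python
def transformer_layer_flops(seq_len, d_model, n_heads, ffn_mult=4):
    """Rough FLOP count for one transformer layer on a single sequence."""
    # An (m x k) @ (k x n) matmul costs about 2*m*k*n FLOPs (one multiply, one add).
    qkv_proj = 3 * 2 * seq_len * d_model * d_model
    scores = 2 * seq_len * seq_len * d_model      # Q @ K^T, summed over all heads
    weighted = 2 * seq_len * seq_len * d_model    # softmax(...) @ V, summed over all heads
    out_proj = 2 * seq_len * d_model * d_model
    ffn = 2 * (2 * seq_len * d_model * ffn_mult * d_model)  # two linear layers
    matmul = qkv_proj + scores + weighted + out_proj + ffn

    # Assumption: about 5 FLOPs per softmax element (exp, sum, divide, max, subtract).
    softmax = 5 * n_heads * seq_len * seq_len
    return matmul, softmax


matmul, softmax = transformer_layer_flops(seq_len=2048, d_model=4096, n_heads=32)
print(f"matmul share of FLOPs: {matmul / (matmul + softmax):.4f}")
```

The share of FLOPs comes out higher than the share of wall-clock time, because element-wise operations such as softmax are limited by memory traffic rather than arithmetic.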

CPUs are built for the opposite of what this requires. A CPU is optimised for branch prediction, cache coherency, and low-latency single-thread execution, capabilities that do little for dense matrix math. A modern Intel Xeon might run 64 threads. An NVIDIA H100 GPU has 16,896 CUDA cores, plus tensor cores that each perform a small matrix-tile multiply-accumulate per cycle. The same matrix multiplication can run orders of magnitude faster on a GPU:

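# CPU (NumPy): noticeably slower for large matrices
import numpy as np
X = np.random.rand(4096, 4096).astype(np.float32)
W = np.random.rand(4096, 4096).astype(np.float32)
Y = X @ W

# GPU (PyTorch): same operation, typically milliseconds (requires a CUDA device)
import torch
X = torch.rand(4096, 4096, device="cuda", dtype=torch.float16)
W = torch.rand(4096, 4096, device="cuda", dtype=torch.float16)
Y = X @ W  # FP16 matmul is dispatched to tensor cores automatically
torch.cuda.synchronize()  # kernels launch asynchronously; wait before timing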

AI hardware design is fundamentally shaped by one workload: repeated matrix math plus memory movement. Every decision — GPU architecture, TPU systolic arrays, custom ASICs, high-bandwidth memory — exists to make this pipeline faster or more efficient.

GPU Architecture: Tensor Cores, VRAM, and Bandwidth

A modern data-centre GPU contains on the order of a hundred streaming multiprocessors (SMs); the H100 SXM5 has 132. Each SM holds CUDA cores for general math, tensor cores for matrix operations, shared memory (on-chip SRAM, typically 64–228 KB per SM), and warp schedulers that execute threads in lockstep groups of 32.

Tensor cores are the AI engine inside GPUs. Where a CUDA core performs one floating-point operation per cycle, a first-generation tensor core performs a 4×4 matrix multiply-accumulate per cycle, processing 64 multiply-add operations simultaneously; later generations work on larger tiles and lower-precision formats. NVIDIA introduced tensor cores in the Volta architecture (2017). By the Hopper generation (H100, 2022), tensor cores in FP8 mode deliver up to about 4 PFLOPS of throughput per GPU with structured sparsity, or roughly 2 PFLOPS on dense matrices.
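The 64 multiply-add figure follows directly from the shape of the operation. This plain-Python sketch computes D = A×B + C for 4×4 matrices and counts the multiply-adds. It models only the arithmetic, not how the hardware executes it; the function name `mma_4x4` is made up for illustration.

```python
def mma_4x4(a, b, c):
    """Return D = A @ B + C for 4x4 matrices, plus the number of multiply-adds."""
    n = 4
    d = [[c[i][j] for j in range(n)] for i in range(n)]
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]  # one fused multiply-add
                fma_count += 1
    return d, fma_count


a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

d, fma_count = mma_4x4(a, identity, zeros)
print(fma_count)  # 64
```

A tensor core performs all 64 of these in a single cycle, where the loop above performs them one at a time.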

Key GPU specifications for AI work:

NVIDIA H100 SXM5: 80 GB HBM3, 3.35 TB/s memory bandwidth, about 4 PFLOPS FP8 tensor throughput with sparsity (roughly 2 PFLOPS dense), NVLink 4.0 with 900 GB/s GPU-to-GPU bandwidth
NVIDIA RTX 4090 (consumer/workstation): 24 GB GDDR6X, 1,008 GB/s bandwidth; excellent for inference and fine-tuning smaller models
AMD MI300X: 192 GB HBM3 (among the largest capacities of any accelerator at its launch), 5.3 TB/s bandwidth; strong for models that need extreme VRAM capacity

VRAM determines which models you can run. A 70B parameter model at FP16 precision requires 140 GB of VRAM — more than a single H100 provides. This forces model parallelism across multiple GPUs, introducing inter-GPU communication overhead that makes the interconnect bandwidth (NVLink, InfiniBand) as important as raw compute.
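The 140 GB figure is simple arithmetic, sketched below. The helper names and the bytes-per-parameter table are illustrative assumptions, and the estimate covers weights only: the KV cache, activations, and framework overhead add to it in practice.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_params, precision):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(n_params, precision, vram_gb):
    """Smallest number of GPUs whose combined VRAM holds the weights."""
    return math.ceil(weight_memory_gb(n_params, precision) / vram_gb)


print(weight_memory_gb(70e9, "fp16"))          # 140.0
print(min_gpus_for_weights(70e9, "fp16", 80))  # 2
print(weight_memory_gb(70e9, "int4"))          # 35.0
```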

TPUs: Google's Systolic Array Architecture

Google developed the Tensor Processing Unit (TPU) because existing CPUs and GPUs were too inefficient for neural-network inference at Google scale. Rather than thousands of small cores, a TPU uses a systolic array: a grid of multiply-accumulate units where data flows rhythmically through the array, each unit performing a computation and passing its operands to the next. Because operands are reused as they move between neighbouring units, intermediate values rarely travel back to memory, which removes much of the memory traffic a matrix multiplication would otherwise generate.
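The data flow is easier to see in a small simulation. The sketch below models an output-stationary systolic array in plain Python: values of A enter from the left and move right, values of B enter from the top and move down, and each processing element (PE) accumulates one output value. It is a toy model under those assumptions, not a description of the TPU's actual array layout or dataflow.

```python
def systolic_matmul(a, b):
    """Simulate C = A @ B on an n x m grid of multiply-accumulate units."""
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]       # one accumulator per PE
    a_reg = [[None] * m for _ in range(n)]    # A operand held by each PE
    b_reg = [[None] * m for _ in range(n)]    # B operand held by each PE
    cycles = n + m + k_dim - 2                # time for the skewed inputs to drain

    for t in range(cycles):
        new_a = [[None] * m for _ in range(n)]
        new_b = [[None] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:   # left edge: row i of A is fed in, delayed by i cycles
                    k = t - i
                    new_a[i][j] = a[i][k] if 0 <= k < k_dim else None
                else:        # otherwise take the operand from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:   # top edge: column j of B is fed in, delayed by j cycles
                    k = t - j
                    new_b[i][j] = b[k][j] if 0 <= k < k_dim else None
                else:        # otherwise take the operand from the neighbour above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b

        for i in range(n):
            for j in range(m):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


c, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(c)       # [[19.0, 22.0], [43.0, 50.0]]
print(cycles)  # 4
```

Each input value is read from memory once at the edge of the grid and then reused by every PE it passes through, which is where the saving in memory traffic comes from.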

TPU generations and their focus:

TPU v2/v3: Training-focused; first large-scale deployments in Google Cloud
TPU v4: Supercomputer-scale training pods; used to train Google's largest models
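TPU v5e/v5p: v5e is cost-optimised for inference and smaller training jobs; v5p targets large-scale training
TPU v6e (Trillium, 2024): Higher per-chip compute and better energy efficiency than v5e
TPU v7 (Ironwood, announced 2025): Inference-first design; cost-optimised per token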

The TPU software stack is layered: your code, written in JAX or TensorFlow, is compiled for the TPU by XLA (Accelerated Linear Algebra). This vertical integration is powerful for stable, large-scale workloads but creates significant lock-in: TPUs are available only through Google Cloud, and code tuned for TPUs, or code that relies on custom CUDA kernels, does not move between TPUs and GPUs without adaptation. TPUs are the right choice when you run stable workloads at Google-scale through Google Cloud, and the wrong choice when you need flexibility, custom CUDA kernels, or are in active research mode.

NPUs: On-Device AI Inference

Neural Processing Units (NPUs) are purpose-built AI accelerators integrated into consumer devices: smartphones, laptops, and embedded systems. Unlike GPUs (which optimise for throughput) and TPUs (which optimise for Google-scale datacentre training and inference), NPUs optimise for low-power real-time inference within the thermal and battery constraints of portable devices.

Major NPU implementations in 2026:

Apple Neural Engine (ANE): Integrated into the Apple Silicon SoC (M4 generation delivers approximately 38 TOPS). Powers Core ML, on-device Siri, and the on-device inference engine behind Apple Intelligence. Optimised for the specific operation graphs produced by Core ML models.
Qualcomm Hexagon NPU: Found in Snapdragon 8 Elite and X Elite chips. Delivers approximately 45–75 TOPS in flagship configurations. Powers on-device AI in Android flagships and Copilot+ PCs. Supports INT4 and INT8 quantised models efficiently.
Google Tensor (Pixel phones): Custom SoC including an NPU designed to run Google's on-device ASR, image processing, and now on-device Gemini Nano inference.

A flagship smartphone now ships with tens of TOPS of dedicated AI compute. This hardware capability, combined with quantisation techniques (INT8, INT4) that shrink models by 4–8x relative to FP32, often with only modest accuracy loss, means that on-device models can handle language translation, code completion, voice processing, and photo enhancement without any cloud round-trip.
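As a minimal illustration of what INT8 quantisation does, the sketch below applies symmetric per-tensor quantisation to a handful of made-up weights. Production toolchains typically use per-channel scales and calibration data; the function names and sample values here are assumptions for illustration only.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: floats -> int8 values plus one scale."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]


weights = [0.8213, -1.27, 0.0349, 0.5, -0.6401]  # made-up sample values
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [82, -127, 3, 50, -64]
print(max_error <= scale / 2 + 1e-12)  # True: error is at most half a step
```

Each weight now occupies 1 byte instead of the 4 bytes FP32 needs, which is the 4x end of the range; packing two INT4 values per byte gives 8x.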

ASICs: Custom Silicon for Maximum Efficiency

Application-Specific Integrated Circuits (ASICs) are chips designed for one specific workload. Unlike GPUs (general-purpose parallel compute) or NPUs (on-device inference), ASICs give up most of their flexibility in exchange for efficiency. At hyperscale, even a 10–20% efficiency gain translates to hundreds of millions of dollars in annual savings.

Notable AI ASICs in production:

Groq LPU (Language Processing Unit): A dataflow architecture designed specifically for LLM inference. By compiling the model ahead of time into a deterministic execution schedule, the LPU eliminates the runtime scheduling overhead that makes GPU inference latency variable. Groq-reported benchmarks show inference speeds exceeding 750 tokens/second for Mixtral 8x7B, roughly 10× faster than GPU inference for latency-sensitive applications, although the gap depends on the GPU baseline and batch size. Tradeoff: the model must be compiled for Groq's architecture; dynamic control flow in the model is constrained.
Cerebras WSE-3 (Wafer Scale Engine): Rather than cutting individual chips from a silicon wafer, Cerebras uses the entire 300mm wafer as a single processor — containing 900,000 AI-optimised cores and 44 GB of on-chip SRAM. The massive on-chip memory eliminates the HBM memory bandwidth bottleneck that limits GPU training performance. WSE-3 delivers 125 PFLOPS of AI compute with no inter-chip communication overhead. Tradeoff: expensive to manufacture, limited software ecosystem, niche use cases.
AWS Trainium / Inferentia: Amazon's custom training (Trainium) and inference (Inferentia) chips, programmed via the Neuron SDK. Designed to reduce AWS's dependency on NVIDIA GPUs and lower cost per token for EC2 AI instances.
Google TPU (revisited as ASIC): Google's TPUs are the most successful and widely deployed AI ASICs — used to serve trillions of inference requests per day through Google Search, Google Maps, and Gmail.

Memory Bandwidth: The Real Bottleneck

An insight about AI hardware that is easy to miss: many AI workloads are memory-bandwidth limited, not compute-limited. Modern GPUs can compute matrix multiplications faster than they can be fed with data from memory. A GPU might have 4 PFLOPS of peak compute but only 3.35 TB/s of memory bandwidth, so it needs on the order of 1,000 operations per byte fetched to stay busy. Operations with less data reuse than that spend their time waiting for data, not performing math.
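The ratio of peak compute to memory bandwidth can be worked out with a simple roofline-style calculation. The sketch below uses the 4 PFLOPS and 3.35 TB/s figures quoted in the text; the function names are illustrative, and the intensity of 1 FLOP per byte is an approximation for a batch-1 FP16 matrix-vector product, where each 2-byte weight is used for one multiply and one add.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """FLOPs an operation must perform per byte loaded to keep compute busy."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


PEAK = 4e15        # 4 PFLOPS, the peak figure quoted in the text
BANDWIDTH = 3.35e12  # 3.35 TB/s

print(round(ridge_point(PEAK, BANDWIDTH)))  # 1194

# Batch-1 FP16 matrix-vector product: about 1 FLOP per byte of weights read.
decode = attainable_flops(1.0, PEAK, BANDWIDTH)
print(f"fraction of peak at intensity 1: {decode / PEAK:.5f}")
```

Anything below roughly 1,200 FLOPs per byte sits on the bandwidth-limited side of the roofline for this GPU, which is why batching requests (reusing each loaded weight across many tokens) matters so much for inference throughput.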

High Bandwidth Memory (HBM) is the engineering response to this problem. Instead of placing DRAM on the motherboard and connecting it via narrow memory buses, HBM stacks DRAM dies vertically and places each stack beside the GPU die inside the same package, connected through thousands of wires (a 1024-bit interface per stack in HBM2 and HBM3) using advanced packaging technology:

HBM2 (NVIDIA V100): ~256 GB/s per stack
HBM2E (NVIDIA A100): ~460 GB/s per stack
HBM3 (NVIDIA H100): ~800 GB/s per stack, 3.35 TB/s total
HBM3E (NVIDIA Blackwell B200): >1 TB/s per stack, 8 TB/s total
HBM4 (next generation, 2026+): 1.5–2 TB/s per stack

Even with HBM3E, memory is still the bottleneck for the largest models. A 70B parameter model at BF16 occupies 140 GB, and during inference each forward pass must stream those weights from memory, so at batch size 1 every generated token costs roughly 140 GB of memory traffic. This is why the AMD MI300X with its 192 GB HBM3 pool is attractive for very large model inference: you can fit the full model on a single device without inter-GPU communication overhead.
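A back-of-the-envelope bound follows from these numbers: if every generated token has to stream all weights once, memory bandwidth caps the decoding speed. The sketch below assumes batch size 1 and ignores KV-cache reads, kernel overheads, and any overlap of compute with memory access, so real throughput will be lower. The function name is illustrative.

```python
def max_tokens_per_second(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on batch-1 decoding speed when each token reads all weights."""
    return bandwidth_bytes_per_s / weight_bytes


weight_bytes = 70e9 * 2   # 70B parameters at BF16 (2 bytes each) = 140 GB
mi300x_bandwidth = 5.3e12  # 5.3 TB/s, as quoted for the MI300X

bound = max_tokens_per_second(weight_bytes, mi300x_bandwidth)
print(round(bound, 1))  # 37.9
```

Quantising the weights to INT8 or INT4 halves or quarters the bytes streamed per token, which raises this bound proportionally.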

Choosing Hardware: Training vs Inference

Training and inference have fundamentally different hardware requirements:

Training requires maximum compute (FP16/BF16), large VRAM to hold model weights plus gradients plus optimiser state (roughly 16–20 bytes per parameter for Adam), high inter-GPU bandwidth (NVLink for within-node, InfiniBand for cross-node), and long job stability. NVIDIA H100/H200 clusters or Google TPU pods are the standard choices.
Inference requires low latency, high throughput per dollar, and often INT8 or INT4 precision. ASICs like Groq LPU excel at low-latency inference. GPUs (A100, H100) handle mixed training and inference workloads. Consumer GPUs (RTX 4090) work for single-user inference or small teams. NPUs handle on-device inference where cloud connectivity is not available or privacy is required.
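The training memory figure can be turned into a quick sizing estimate. The sketch below uses the 16–20 bytes per parameter quoted above for Adam. It ignores activations, which depend on batch size and sequence length, so treat the result as a lower bound. The helper names are illustrative.

```python
import math


def training_memory_gb(n_params, bytes_per_param=16):
    """Weights + gradients + Adam optimiser state, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def min_gpus_for_training(n_params, vram_gb, bytes_per_param=16):
    """Smallest GPU count whose combined VRAM holds the training state."""
    return math.ceil(training_memory_gb(n_params, bytes_per_param) / vram_gb)


print(training_memory_gb(70e9))                # 1120.0
print(min_gpus_for_training(70e9, 80))         # 14
print(min_gpus_for_training(70e9, 80, 20))     # 18
```

Compared with the 140 GB needed to serve the same 70B model at FP16, training state alone is eight to ten times larger, which is why training clusters lean so heavily on NVLink and InfiniBand.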

The architectural reality is that the future is not one chip type but a hybrid fleet: GPUs for training and research, ASICs for production inference at scale, NPUs for edge deployment, and CPUs for orchestration. Understanding where each excels lets you design AI systems that are both capable and cost-efficient.

