What CUDA Actually Is (And What It Is Not)
CUDA is not a library. It is not a language. It is not a driver. CUDA (Compute Unified Device Architecture) is a complete parallel computing platform — a programming model, compiler toolchain, runtime system, memory model, math libraries, profiler, and hardware abstraction layer rolled into one. NVIDIA introduced CUDA in 2006, and it has become, by a wide margin, the most important software platform in AI.
The full AI software stack looks like this from top to bottom:
Your Python Model Code
↓
AI Frameworks (PyTorch, TensorFlow, JAX)
↓
CUDA Libraries (cuBLAS, cuDNN, NCCL, TensorRT)
↓
CUDA Runtime (manages memory, launches kernels)
↓
GPU Driver (NVIDIA kernel module)
↓
GPU Hardware (SMs, tensor cores, HBM)
Every call to model.forward(x) in PyTorch eventually becomes a sequence of CUDA library calls, which become kernel launches on the GPU. CUDA is the translation layer that makes that possible. Understanding it — even at a conceptual level — is what separates engineers who can debug performance problems from those who cannot.
The CUDA Execution Model: Threads, Blocks, Warps, Grids
CUDA maps software to hardware through a four-level hierarchy:
- Thread — the smallest unit of execution. Each thread runs the same kernel function on a different piece of data. Each thread has its own registers and local memory.
- Warp — 32 threads that execute in lockstep (SIMT: Single Instruction, Multiple Threads). All 32 threads in a warp execute the same instruction simultaneously. This is the actual hardware execution unit inside a streaming multiprocessor (SM).
- Block — a group of threads (up to 1,024) that can share on-chip SRAM (shared memory) and synchronise with __syncthreads(). Each block runs on one SM.
- Grid — the collection of all blocks in a kernel launch. A kernel is launched as a grid of blocks, and CUDA distributes the blocks across available SMs automatically.
// A CUDA kernel for vector addition
__global__ void vector_add(const float* a, const float* b,
                           float* c, int n) {
    // Each thread computes one element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have threads past the end
        c[i] = a[i] + b[i];
    }
}

// Launch 256 threads per block, enough blocks to cover n
// (a, b, and c must be device pointers, e.g. allocated with cudaMalloc)
int blocks = (n + 255) / 256;
vector_add<<<blocks, 256>>>(a, b, c, n);
In practice, you almost never write raw CUDA kernels. PyTorch, JAX, and TensorFlow generate and call optimised CUDA kernels through their internal compiler infrastructure. But understanding the thread/block/warp model explains why certain operations are fast (operations that map cleanly to the warp execution model) and why certain patterns are slow (warp divergence, where threads in the same warp take different branches).
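To make the index arithmetic and the idea of warp divergence concrete, here is a minimal CPU-side sketch in plain Python. It is an emulation, not how the GPU actually schedules work: the function names (launch_config, simulate_vector_add, divergent_warps) are hypothetical, and the divergence count assumes the block size is a multiple of 32 so that warps align with the global thread index.

```python
WARP_SIZE = 32  # threads per warp, as described above


def launch_config(n, threads_per_block=256):
    """Return (blocks, threads_per_block) covering n elements.

    Mirrors the kernel launch arithmetic: blocks = (n + 255) / 256.
    """
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def simulate_vector_add(a, b, threads_per_block=256):
    """Emulate the kernel on the CPU: one loop iteration per (block, thread)."""
    n = len(a)
    c = [0.0] * n
    blocks, tpb = launch_config(n, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(tpb):
            i = block_idx * tpb + thread_idx  # same index formula as the kernel
            if i < n:  # surplus threads in the last block do nothing
                c[i] = a[i] + b[i]
    return c


def divergent_warps(predicate, total_threads):
    """Count warps whose threads disagree on predicate(i).

    A warp "diverges" when some of its threads take a branch and others
    do not; here that means predicate(i) is not constant across the warp.
    """
    count = 0
    for start in range(0, total_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, total_threads)
        outcomes = {bool(predicate(i)) for i in range(start, end)}
        if len(outcomes) > 1:
            count += 1
    return count


if __name__ == "__main__":
    print(launch_config(1000))
    # Branching on even/odd thread index splits every warp.
    print(divergent_warps(lambda i: i % 2 == 0, 256))
    # Branching on the warp index keeps each warp uniform.
    print(divergent_warps(lambda i: (i // WARP_SIZE) % 2 == 0, 256))
```

The two predicates do the same amount of total work, but the first one forces every warp to execute both sides of the branch, which is the pattern the paragraph above describes as slow.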
CUDA Memory Model: Why Performance Lives or Dies Here
CUDA exposes a memory hierarchy with dramatically different performance characteristics at each level:
- Registers — per-thread, fastest possible, managed by the compiler. Local variables in your CUDA kernel live here. Very limited (typically 65,536 registers per SM, shared across all threads on that SM).
- Shared memory — per-block, on-chip SRAM with much lower latency than global memory (on the order of tens of cycles rather than hundreds). Threads in a block can share data through shared memory and synchronise access with barriers. This is where optimised CUDA kernels cache tiles of matrices to minimise global memory traffic.
- L2 cache — device-wide cache, shared across all SMs. Managed by hardware. Sits between the SMs and global memory, so it absorbs repeated accesses before they reach HBM.
- Global memory (HBM) — the GPU's VRAM (HBM2e on the 80 GB A100: roughly 2 TB/s; HBM3 on the H100 SXM: 3.35 TB/s). All tensors you create in PyTorch live here. Large, but relatively slow compared to on-chip memory. Optimised AI kernels minimise global memory traffic by maximising data reuse in shared memory.
- Unified memory — accessible from both CPU and GPU through a shared address space. Convenient but the slowest option due to page-fault overhead. Avoid in performance-critical code.
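The benefit of caching matrix tiles in shared memory can be estimated with simple counting. The sketch below is a back-of-envelope model, not a measurement: it assumes square N×N matrices, a tile size that divides N evenly, and no help from the L1/L2 caches in the naive case. The function names are hypothetical.

```python
def naive_global_loads(n):
    """Global memory loads for a naive N x N matmul.

    Each of the n*n output elements reads one row of A and one column
    of B directly from global memory: 2*n loads per output element.
    """
    return 2 * n ** 3


def tiled_global_loads(n, tile):
    """Global memory loads when each block stages tile x tile tiles in shared memory.

    Each block computes a tile x tile patch of the output. It walks over
    n // tile pairs of tiles, loading 2 * tile * tile elements per step,
    and every loaded element is then reused `tile` times from shared memory.
    """
    if n % tile != 0:
        raise ValueError("this sketch assumes tile divides n evenly")
    blocks = (n // tile) ** 2
    steps_per_block = n // tile
    loads_per_block = steps_per_block * 2 * tile * tile
    return blocks * loads_per_block


if __name__ == "__main__":
    n, tile = 1024, 32
    ratio = naive_global_loads(n) / tiled_global_loads(n, tile)
    print(f"global loads reduced by a factor of {ratio:.0f}")
```

Under these assumptions the reduction factor equals the tile size, which is why tile dimensions are bounded in practice by how much shared memory and how many registers a block can use.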
The most common CUDA performance mistake is unnecessarily reading from and writing to global memory. Unfused kernels do this between every operation:
# Unfused — 3 separate global memory read/write cycles:
y = x @ W # write to global
y = layer_norm(y) # read + write global
y = gelu(y) # read + write global
# Fused kernel (via PyTorch 2.0 torch.compile or Triton):
# One global memory read, one write, all math on-chip
Kernel fusion — combining operations to reduce global memory traffic — can provide 2–5× speedups on memory-bound workloads, which is why PyTorch 2.0's torch.compile() and frameworks like Triton are so valuable.
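A rough traffic count shows where the fusion benefit comes from. This sketch models only a chain of element-wise style operations (such as the layer_norm and gelu steps above, leaving the matmul aside) and assumes each unfused operation reads its whole input from global memory and writes its whole output back. The function name and the tensor shape are illustrative assumptions.

```python
def chain_traffic_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Estimate global memory traffic for a chain of per-element operations.

    Unfused: every operation does one full read and one full write.
    Fused:   the whole chain does one read and one write; intermediates
             stay in registers or shared memory.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elements * bytes_per_element


if __name__ == "__main__":
    elements = 4096 * 4096  # hypothetical activation tensor
    fp16 = 2                # bytes per FP16 element
    unfused = chain_traffic_bytes(elements, fp16, num_ops=2, fused=False)
    fused = chain_traffic_bytes(elements, fp16, num_ops=2, fused=True)
    print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB")
```

For a memory-bound chain, the traffic ratio (here simply the number of fused operations) is an upper bound on the speedup; real gains depend on how much of the runtime was actually spent waiting on memory.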
Key CUDA Libraries: The Invisible Speed Layer
Most AI performance comes not from custom CUDA code but from highly optimised libraries that NVIDIA engineers have tuned over years:
- cuBLAS — NVIDIA's GPU implementation of the Basic Linear Algebra Subprograms. On NVIDIA GPUs, PyTorch dispatches most dense matrix multiplications to cuBLAS (or cuBLASLt for tensor core access). New cuBLAS releases often improve performance on the same hardware, which is why updating CUDA drivers and libraries can speed up your existing code without changing anything.
- cuDNN — CUDA Deep Neural Network library. Provides highly optimised implementations of convolutions, attention operations, batch normalisation, and pooling. PyTorch calls cuDNN for these operations automatically when a CUDA tensor is used.
- NCCL — NVIDIA Collective Communications Library. Implements all-reduce, broadcast, gather, and scatter operations across multiple GPUs. This is what PyTorch's DistributedDataParallel uses to synchronise gradients across GPUs during training. NCCL performance over NVLink determines multi-GPU training scalability.
- TensorRT — NVIDIA's inference optimisation framework. Takes a trained model, applies layer fusion, precision calibration (FP16, INT8), and kernel auto-tuning to produce an optimised inference engine. TensorRT can reduce inference latency by 2–10× compared to running the same model in PyTorch on the same GPU.
- Thrust — CUDA's parallel algorithms library (sort, scan, reduce, transform), analogous to C++ STL but for GPU.
You use these libraries without knowing it. With CUDA tensors, x @ W typically calls cuBLAS, and torch.nn.Conv2d(...)(x) calls cuDNN. They are the reason GPUs are fast for AI, not the raw CUDA execution model.
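Of the operations listed for NCCL, all-reduce is the one that matters most for data-parallel training, so it helps to see what it computes. The sketch below shows only the semantics (every rank ends up with the same reduced result) using plain Python lists as stand-ins for per-GPU gradient tensors. It says nothing about how NCCL actually moves data between GPUs, and the function names are hypothetical.

```python
def all_reduce_sum(per_rank_grads):
    """Return what each rank holds after a sum all-reduce.

    per_rank_grads: list of equal-length lists, one per GPU rank.
    Every rank receives the element-wise sum over all ranks.
    """
    length = len(per_rank_grads[0])
    if any(len(g) != length for g in per_rank_grads):
        raise ValueError("all ranks must contribute tensors of the same size")
    total = [sum(g[i] for g in per_rank_grads) for i in range(length)]
    return [list(total) for _ in per_rank_grads]


def average_gradients(per_rank_grads):
    """Sum all-reduce followed by division by the number of ranks."""
    world_size = len(per_rank_grads)
    return [[v / world_size for v in g] for g in all_reduce_sum(per_rank_grads)]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two parameters each
    print(average_gradients(grads))   # every rank sees the same averaged gradient
```

Because every rank finishes with identical gradients, each replica applies the same optimiser step and the model copies stay in sync.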
How PyTorch Relates to CUDA
PyTorch is a thin Python layer sitting on top of a C++ core (ATen/libtorch), which dispatches operations to CUDA libraries. When you write x = torch.randn(1024, 1024, device="cuda"), PyTorch allocates memory in GPU HBM. When you compute y = x @ w, PyTorch calls cuBLASLt with the appropriate parameters for your data types and tensor sizes.
PyTorch's CUDA memory management is important to understand for production systems. PyTorch maintains a memory cache (the caching allocator) that keeps freed GPU memory for reuse rather than returning it to the CUDA driver. This speeds up allocation but means that torch.cuda.memory_allocated() and torch.cuda.memory_reserved() return different values: allocated is what your tensors are using; reserved is the total GPU memory PyTorch has claimed (including its cache). CUDA out-of-memory errors can appear even when the total free memory looks sufficient, because the cached memory is fragmented into blocks too small for the new request.
import torch

# Check GPU memory
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")  # peak usage so far

# Release cached blocks back to the driver (useful between experiments)
torch.cuda.empty_cache()

# Mixed precision: run the forward pass under autocast (FP16 on CUDA).
# `model` and `input` are assumed to be defined and already on the GPU.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(input)
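The gap between allocated and reserved memory is easier to reason about with a toy model. The class below is a deliberately simplified sketch, not PyTorch's allocator: it ignores size rounding, block splitting, and separate pools, and the name ToyCachingAllocator is hypothetical. It only shows why freeing a tensor lowers "allocated" but not "reserved".

```python
class ToyCachingAllocator:
    """A simplified model of a caching GPU memory allocator."""

    def __init__(self):
        self.allocated = 0        # bytes held by live tensors
        self.reserved = 0         # bytes obtained from the driver (live + cached)
        self._free_blocks = []    # sizes of cached blocks available for reuse

    def malloc(self, size):
        """Return a block of at least `size` bytes, reusing the cache if possible."""
        for idx, block in enumerate(self._free_blocks):
            if block >= size:     # first fit; the whole block is handed out
                self._free_blocks.pop(idx)
                self.allocated += block
                return block
        # Cache miss: ask the "driver" for new memory.
        self.reserved += size
        self.allocated += size
        return size

    def free(self, block):
        """Keep the block in the cache instead of returning it to the driver."""
        self.allocated -= block
        self._free_blocks.append(block)

    def empty_cache(self):
        """Return all cached (unused) blocks to the driver."""
        self.reserved -= sum(self._free_blocks)
        self._free_blocks.clear()


if __name__ == "__main__":
    alloc = ToyCachingAllocator()
    a = alloc.malloc(100)
    b = alloc.malloc(50)
    alloc.free(a)
    print(alloc.allocated, alloc.reserved)  # allocated drops, reserved does not
    alloc.empty_cache()
    print(alloc.allocated, alloc.reserved)  # now the cached block is released
```

In this model, as in the description above, empty_cache() only releases blocks that no live tensor is using, so it cannot rescue a job whose live tensors genuinely exceed GPU memory.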
Monitoring GPU Utilisation: nvidia-smi and PyTorch Profiler
Understanding whether your GPU is being used efficiently requires profiling, not guessing. Two tools cover most production needs:
nvidia-smi is the command-line interface to the NVIDIA Management Library. It shows GPU utilisation (the percentage of time during which at least one kernel was running, not the fraction of compute capacity in use), memory usage, temperature, and power consumption in real time:
# Real-time monitoring, refresh every 1 second
nvidia-smi dmon -s u -d 1
# Detailed snapshot
nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv,noheader
A healthy training job shows GPU utilisation above 90%. Values below 50% typically indicate a data loading bottleneck (the GPU is waiting for the CPU to prepare batches), or small batch sizes that leave SMs underutilised. Consistent high utilisation with low throughput often indicates memory bandwidth saturation.
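For automated checks, the CSV output of the query command above can be parsed with a few lines of Python. This is a sketch under assumptions: the sample line is made up for illustration (it is not captured output), the parser assumes the four fields requested above with their default unit suffixes and no commas in the GPU name, and the thresholds simply restate the rules of thumb from the preceding paragraph.

```python
def parse_gpu_line(line):
    """Parse one line of
    nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv,noheader
    into a dict. Values carry unit suffixes such as '%' and 'MiB'."""
    name, util, mem_used, mem_total = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "utilization_pct": int(util.split()[0]),
        "memory_used_mib": int(mem_used.split()[0]),
        "memory_total_mib": int(mem_total.split()[0]),
    }


def classify_utilization(pct):
    """Apply the rules of thumb from the text to a utilisation percentage."""
    if pct > 90:
        return "healthy"
    if pct < 50:
        return "likely data loading or batch size bottleneck"
    return "worth profiling"


if __name__ == "__main__":
    # Hypothetical sample line, for illustration only.
    sample = "NVIDIA A100-SXM4-80GB, 97 %, 40536 MiB, 81920 MiB"
    info = parse_gpu_line(sample)
    print(info["name"], classify_utilization(info["utilization_pct"]))
```

In a real script the line would come from running the command with subprocess.run(..., capture_output=True, text=True) and splitting its stdout by line, one line per GPU.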
PyTorch Profiler provides operation-level visibility — which CUDA kernels consumed the most time, how much memory each operation used, and where CPU-GPU synchronisation is happening:
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True
) as prof:
    with record_function("model_forward"):
        output = model(input)

# Print top 10 CUDA operations by total time
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))
Common CUDA Memory Errors
Every PyTorch developer encounters these errors. Knowing what they mean saves hours of debugging:
- CUDA out of memory (OOM): The most common error. GPU HBM is exhausted. Solutions: reduce batch size; use gradient checkpointing (torch.utils.checkpoint.checkpoint); switch to mixed precision (FP16/BF16); offload optimiser state to CPU; use model parallelism across multiple GPUs.
- CUDA illegal memory access: A CUDA kernel accessed memory outside its allocated range, the GPU equivalent of a segfault. Usually caused by tensor shape mismatches, off-by-one errors in custom CUDA code, or accessing a tensor that has been freed. Because kernel launches are asynchronous, the reported stack trace may point at a later call; run with CUDA_LAUNCH_BLOCKING=1 to get synchronous error reporting with accurate stack traces.
- Device-side assert triggered: A CUDA assertion failed inside a kernel, often an index out of bounds or a NaN check. For better diagnostics, PyTorch's error message suggests compiling with TORCH_USE_CUDA_DSA to enable device-side assertions.
- Expected all tensors to be on the same device: Mixing CPU and CUDA tensors in an operation. Always check tensor.device when combining tensors from different sources.
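Two of the debugging steps above are easy to script. The sketch below builds an environment with CUDA_LAUNCH_BLOCKING=1 for a debug run, and validates indices on the CPU before they reach a kernel so that an out-of-range value produces a readable Python error instead of a device-side assert. Both helper names are hypothetical, and train.py in the usage note is a placeholder for your own script.

```python
import os


def debug_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA error reporting enabled.

    The variable must be set before the process initialises CUDA, which is
    why it is passed to a child process rather than set mid-run.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def check_indices(indices, size):
    """Raise a clear error if any index falls outside [0, size).

    Intended for index or label values that will later be used in a
    lookup on the GPU (for example, class labels or embedding ids).
    """
    bad = [i for i in indices if not 0 <= i < size]
    if bad:
        raise IndexError(
            f"{len(bad)} index value(s) outside [0, {size}); first offender: {bad[0]}"
        )


if __name__ == "__main__":
    env = debug_cuda_env({"PATH": "/usr/bin"})
    print(env["CUDA_LAUNCH_BLOCKING"])
    check_indices([0, 1, 2], size=3)  # passes silently
```

A debug run could then be started with subprocess.run([sys.executable, "train.py"], env=debug_cuda_env()). Expect it to be noticeably slower, since every kernel launch waits for completion.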
CUDA Alternatives: ROCm, Metal, and OneAPI
CUDA's dominance is not inevitable. Alternatives exist, each with genuine use cases:
- ROCm (AMD) — AMD's open-source GPU compute platform. PyTorch supports ROCm, and AMD's MI300X GPU is competitive with the H100 in memory-bound workloads. ROCm is improving rapidly but has gaps in library coverage and framework optimisation compared to CUDA. Best choice when you need AMD GPUs for cost or supply reasons, or in HPC environments where AMD hardware dominates.
- Metal / Core ML (Apple) — Apple's GPU and NPU programming stack for macOS and iOS. PyTorch supports Metal via the MPS (Metal Performance Shaders) backend: device = torch.device("mps"). Performance on Apple M4 Pro is strong for inference; training support is improving but lags CUDA. Best for: local development on Apple silicon; on-device iOS/macOS inference via Core ML.
- OneAPI / SYCL (Intel) — Intel's cross-architecture programming model targeting Intel GPUs, CPUs, and FPGAs with a unified codebase. Growing in HPC environments with Intel hardware. Weak AI framework ecosystem compared to CUDA.
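Code that should run across these backends usually picks a device once at startup. The sketch below keeps the selection logic as a pure function so it can be tested without a GPU; the function name pick_device and the CUDA-first priority order are assumptions, not a PyTorch convention.

```python
def pick_device(cuda_available, mps_available):
    """Choose a PyTorch device string from backend availability flags.

    Priority is an assumption of this sketch: CUDA first, then Apple's
    MPS backend, then CPU. ROCm builds of PyTorch also expose AMD GPUs
    under the "cuda" device name, so they take the first branch.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    # Without PyTorch installed, the flags can be supplied by hand.
    print(pick_device(cuda_available=False, mps_available=True))
```

With PyTorch installed, the flags would come from torch.cuda.is_available() and torch.backends.mps.is_available(), and the result would be passed to torch.device(...).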
CUDA's real moat is not the programming model — it is 15+ years of optimised kernel libraries, deep framework integration, a massive developer community, and backward compatibility that protects existing code investments. Replacing CUDA means replacing the entire AI software ecosystem. That is why no competitor has succeeded yet, and why understanding CUDA remains a foundational skill for AI engineers working at the systems level.