What is CUDA and why do I need to understand it as an AI engineer?

CUDA is NVIDIA's complete GPU computing platform: programming model, runtime, compiler, memory model, and library ecosystem. Every major AI framework (PyTorch, TensorFlow, JAX) dispatches to CUDA kernels and libraries under the hood when it runs on NVIDIA GPUs. Understanding CUDA explains why certain operations are fast or slow, helps you diagnose performance bottlenecks and memory errors, and is essential for optimising training or inference workloads. You don't need to write raw CUDA kernels, but understanding the execution model (threads/blocks/warps) and memory hierarchy (registers/shared/HBM) is foundational.

What are the most important CUDA libraries for AI?

cuBLAS handles most dense matrix multiplications (the dominant AI operation). cuDNN provides optimised implementations of convolutions, attention, batch normalisation, and pooling. NCCL handles multi-GPU collective communication for distributed training. TensorRT optimises inference workloads through layer fusion and precision calibration. You use the first three without calling them directly, because PyTorch and other frameworks dispatch to them automatically; TensorRT is a separate step you opt into when deploying a model.

How do I check if my GPU is being used efficiently during training?

Start with nvidia-smi: run `nvidia-smi dmon -s u -d 1` and watch GPU utilisation. Above 90% is healthy; below 50% typically means a data loading bottleneck (the CPU cannot prepare batches fast enough for the GPU). For operation-level analysis, use PyTorch Profiler, which shows which CUDA kernels consume the most time, memory allocation per operation, and CPU-GPU synchronisation points. These two tools diagnose most training performance problems.

What causes CUDA out of memory errors and how do I fix them?

CUDA OOM errors occur when GPU VRAM is exhausted. Common causes: batch size too large; keeping activation tensors alive that are not needed for the backward pass; optimiser states (Adam keeps two extra values per parameter, so weights plus Adam state take roughly 3× the memory of the weights alone, before counting gradients); accumulating computation graphs across iterations, for example by summing loss tensors without detaching them. Fixes: reduce batch size and use gradient accumulation to maintain effective batch size; enable gradient checkpointing to recompute activations on the backward pass; use mixed precision (BF16/FP16); offload optimiser states to CPU with ZeRO optimisation.
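A back-of-the-envelope estimate makes the optimiser-state cost concrete. The sketch below assumes plain FP32 training with Adam (weights, gradients, and two moment buffers per parameter) and ignores activations; the function name and the 7-billion-parameter model size are hypothetical, chosen only for illustration.

```python
def training_state_gb(num_params: int, bytes_per_value: int = 4) -> dict:
    """Rough memory for weights, gradients and Adam state, in GB (1e9 bytes).

    Assumption: every tensor is stored at the same precision (FP32 by default)
    and activations are not counted.
    """
    weights = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    adam_state = 2 * num_params * bytes_per_value  # first and second moments
    total = weights + grads + adam_state
    return {
        "weights": weights / 1e9,
        "grads": grads / 1e9,
        "adam_state": adam_state / 1e9,
        "total": total / 1e9,
    }


# Hypothetical 7-billion-parameter model.
estimate = training_state_gb(7_000_000_000)
for name, gigabytes in estimate.items():
    print(f"{name:>10}: {gigabytes:.0f} GB")
```

Under these assumptions the Adam state alone is twice the size of the weights, which is why offloading or sharding it is often the first fix to try once the batch size is already small.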

When should I consider ROCm or Metal instead of CUDA?

ROCm (AMD) makes sense when you have access to AMD MI300X GPUs (which offer 192 GB HBM3, useful for large model inference), when you need open-source hardware infrastructure, or when NVIDIA GPU supply is constrained. The MI300X is competitive with H100 for memory-bound workloads. Metal/MPS (Apple) is the right choice for local development on Apple Silicon and for on-device inference deployed to macOS or iOS via Core ML. For production training and inference in the cloud, CUDA on NVIDIA GPUs remains the default choice due to software maturity.


AI & Automation·10 min read·June 19, 2026

🔧 CUDA and the AI Software Stack: From Hardware to PyTorch — What Engineers Need to Know

CUDA is the operating system of modern AI accelerators. This guide explains the CUDA execution model (threads, blocks, warps, grids), key libraries like cuBLAS and cuDNN, how PyTorch sits on top of CUDA, GPU profiling with nvidia-smi and PyTorch Profiler, common memory errors, and alternatives like ROCm and Metal.

What CUDA Actually Is (And What It Is Not)

CUDA is not a library. It is not a language. It is not a driver. CUDA (Compute Unified Device Architecture) is a complete parallel computing platform — a programming model, compiler toolchain, runtime system, memory model, math libraries, profiler, and hardware abstraction layer rolled into one. NVIDIA introduced CUDA in 2006, and it has become, by a wide margin, the most important software platform in AI.

The full AI software stack looks like this from top to bottom:

Your Python Model Code
        ↓
AI Frameworks (PyTorch, TensorFlow, JAX)
        ↓
CUDA Libraries (cuBLAS, cuDNN, NCCL, TensorRT)
        ↓
CUDA Runtime (manages memory, launches kernels)
        ↓
GPU Driver (NVIDIA kernel module)
        ↓
GPU Hardware (SMs, tensor cores, HBM)

Every call to model.forward(x) in PyTorch eventually becomes a sequence of CUDA library calls, which become kernel launches on the GPU. CUDA is the translation layer that makes that possible. Understanding it — even at a conceptual level — is what separates engineers who can debug performance problems from those who cannot.

The CUDA Execution Model: Threads, Blocks, Warps, Grids

CUDA maps software to hardware through a four-level hierarchy:

Thread — the smallest unit of execution. Each thread runs the same kernel function on a different piece of data. Each thread has its own registers and local memory.
Warp — 32 threads that execute in lockstep (SIMT: Single Instruction, Multiple Threads). All 32 threads in a warp execute the same instruction simultaneously. This is the actual hardware execution unit inside a streaming multiprocessor (SM).
Block — a group of threads (up to 1,024) that can share on-chip SRAM (shared memory) and synchronise with __syncthreads(). Each block runs on one SM.
Grid — the collection of all blocks in a kernel launch. A kernel is launched as a grid of blocks, and CUDA distributes the blocks across available SMs automatically.

// A CUDA kernel for vector addition
__global__ void vector_add(float* a, float* b,
                            float* c, int n) {
    // Each thread computes one element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch 256 threads per block, enough blocks to cover n
// (a, b, c must point to device memory, e.g. from cudaMalloc)
int blocks = (n + 255) / 256;
vector_add<<<blocks, 256>>>(a, b, c, n);
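To see how the index arithmetic and the bounds check in the kernel above behave, the launch can be emulated sequentially on the CPU. This is only a sketch of the indexing logic, not of how a GPU schedules work; the function names are hypothetical.

```python
def blocks_needed(n: int, threads_per_block: int = 256) -> int:
    """Ceiling division: the smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def emulate_vector_add(a, b, threads_per_block: int = 256):
    """Run the kernel body once per (block, thread) pair, one after another."""
    n = len(a)
    c = [0.0] * n
    blocks = blocks_needed(n, threads_per_block)
    idle_threads = 0
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same global index formula as the CUDA kernel.
            i = block_idx * threads_per_block + thread_idx
            if i < n:
                c[i] = a[i] + b[i]
            else:
                idle_threads += 1  # launched, but skipped by the bounds check
    return c, blocks, idle_threads


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
c, blocks, idle = emulate_vector_add(a, b)
print(blocks, idle)  # 4 blocks, 24 threads past the end of the array
```

With n = 1000 the launch rounds up to 1,024 threads, so the `if (i < n)` guard is what stops the last 24 threads from writing out of bounds.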

In practice, you almost never write raw CUDA kernels. PyTorch, JAX, and TensorFlow generate and call optimised CUDA kernels through their internal compiler infrastructure. But understanding the thread/block/warp model explains why certain operations are fast (operations that map cleanly to the warp execution model) and why certain patterns are slow (warp divergence, where threads in the same warp take different branches).
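Warp divergence can be illustrated with a small counting model. The sketch below assumes a warp size of 32 and treats a warp as divergent whenever its threads disagree on a branch condition; the helper name is hypothetical and real hardware behaviour has more nuance.

```python
WARP_SIZE = 32


def divergent_warps(num_threads: int, takes_branch) -> int:
    """Count warps in which threads disagree on the branch condition."""
    count = 0
    for start in range(0, num_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, num_threads)
        outcomes = {takes_branch(i) for i in range(start, end)}
        if len(outcomes) > 1:
            count += 1  # both sides of the branch must be executed in turn
    return count


# Branching on the thread index parity splits every warp in two.
per_thread = divergent_warps(1024, lambda i: i % 2 == 0)

# Branching on the warp index keeps all 32 threads of a warp together.
per_warp = divergent_warps(1024, lambda i: (i // WARP_SIZE) % 2 == 0)

print(per_thread, per_warp)  # 32 divergent warps versus 0
```

Both conditions send half of the threads down each branch, but only the first forces each warp to execute both paths.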

CUDA Memory Model: Why Performance Lives or Dies Here

CUDA exposes a memory hierarchy with dramatically different performance characteristics at each level:

Registers: per-thread, fastest possible, managed by the compiler. Local variables in your CUDA kernel live here. Very limited (typically 65,536 32-bit registers per SM, shared across all threads on that SM).
Shared memory: per-block, on-chip SRAM, with latency an order of magnitude or more below that of global memory. Threads in a block can share data through shared memory and synchronise access with barriers. This is where optimised CUDA kernels cache tiles of matrices to minimise global memory traffic.
L2 cache: device-wide cache, shared across all SMs. Managed by hardware. Sits between the SMs and global memory.
Global memory (HBM): the GPU's VRAM (HBM2e on the 80 GB A100: about 2 TB/s; HBM3 on the H100 SXM: 3.35 TB/s). All tensors you create in PyTorch live here. Large, but relatively slow compared to on-chip memory. Optimised AI kernels minimise global memory traffic by maximising data reuse in shared memory.
Unified memory: accessible from both CPU and GPU through a shared address space. Convenient, but often the slowest option because of page-fault overhead when data migrates on demand. Avoid in performance-critical code.
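The benefit of caching matrix tiles in shared memory can be estimated by counting global memory loads. This sketch assumes square N×N matrices, a tile size that divides N, and no help from hardware caches in the naive case; the function names are hypothetical.

```python
def naive_global_loads(n: int) -> int:
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    loads = 0
    for _row in range(n):
        for _col in range(n):
            loads += 2 * n
    return loads


def tiled_global_loads(n: int, tile: int) -> int:
    """Each output tile loads one tile of A and one tile of B per step along k."""
    if n % tile != 0:
        raise ValueError("tile size must divide n in this simplified model")
    tiles_per_side = n // tile
    loads = 0
    for _tile_row in range(tiles_per_side):
        for _tile_col in range(tiles_per_side):
            for _k_step in range(tiles_per_side):
                loads += 2 * tile * tile  # reused by every thread in the block
    return loads


n, tile = 256, 16
naive = naive_global_loads(n)
tiled = tiled_global_loads(n, tile)
reduction = naive // tiled
print(naive, tiled, reduction)  # traffic drops by a factor equal to the tile size
```

In this model the reduction in global traffic equals the tile width, which is why tile sizes are chosen as large as shared memory and register limits allow.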

The most common CUDA performance mistake is unnecessarily reading from and writing to global memory. Unfused kernels do this between every operation:

# Unfused — 3 separate global memory read/write cycles:
y = x @ W      # write to global
y = layer_norm(y)  # read + write global
y = gelu(y)    # read + write global

# Fused kernel (via PyTorch 2.0 torch.compile or Triton):
# One global memory read, one write, all math on-chip

Kernel fusion, which combines operations to reduce global memory traffic, can provide 2–5× speedups on memory-bound workloads. This is why PyTorch 2.0's torch.compile() and kernel languages like Triton are so valuable.
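The traffic saving from fusion can be estimated with simple arithmetic. The sketch below models only the elementwise part of the chain (layer_norm followed by gelu), assumes each unfused op reads and writes the full tensor once, and uses the 3.35 TB/s H100 figure quoted earlier; the tensor shape and function name are hypothetical.

```python
def elementwise_chain_bytes(num_elements: int, bytes_per_element: int,
                            num_ops: int, fused: bool) -> int:
    """Global memory bytes moved by a chain of elementwise ops on one tensor."""
    tensor_bytes = num_elements * bytes_per_element
    # Unfused: every op reads its input and writes its output.
    # Fused: one read at the start, one write at the end.
    passes = 2 if fused else 2 * num_ops
    return passes * tensor_bytes


HBM_BYTES_PER_SECOND = 3.35e12   # H100 figure from the text above
elements = 8192 * 8192           # hypothetical activation tensor
unfused = elementwise_chain_bytes(elements, 2, num_ops=2, fused=False)  # BF16
fused = elementwise_chain_bytes(elements, 2, num_ops=2, fused=True)

print(f"unfused: {unfused / HBM_BYTES_PER_SECOND * 1e6:.0f} microseconds")
print(f"fused:   {fused / HBM_BYTES_PER_SECOND * 1e6:.0f} microseconds")
print(f"traffic ratio: {unfused / fused:.1f}x")
```

For a chain of k elementwise ops the ratio in this model is k, so longer chains benefit more; real speedups also depend on launch overhead and on whether the workload is truly bandwidth-bound.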

Key CUDA Libraries: The Invisible Speed Layer

Most AI performance comes not from custom CUDA code but from highly optimised libraries that NVIDIA engineers have tuned over years:

cuBLAS: NVIDIA's implementation of the Basic Linear Algebra Subprograms for GPU. Most dense matrix multiplications in PyTorch on NVIDIA GPUs go through cuBLAS (or cuBLASLt for tensor core access). New cuBLAS releases often improve performance on the same hardware, which is why updating the CUDA toolkit, or a framework build that bundles newer libraries, can speed up your existing code without changing anything.
cuDNN: CUDA Deep Neural Network library. Provides highly optimised implementations of convolutions, attention operations, batch normalisation, and pooling. PyTorch calls cuDNN for these operations automatically when a CUDA tensor is used.
NCCL: NVIDIA Collective Communications Library. Implements collectives such as all-reduce, broadcast, all-gather, and reduce-scatter across multiple GPUs. This is what PyTorch's DistributedDataParallel uses to synchronise gradients across GPUs during training. NCCL performance over NVLink determines multi-GPU training scalability.
TensorRT: NVIDIA's inference optimisation framework. Takes a trained model, applies layer fusion, precision calibration (FP16, INT8), and kernel auto-tuning to produce an optimised inference engine. TensorRT can reduce inference latency by 2–10× compared to running the same model in PyTorch on the same GPU.
Thrust: CUDA's parallel algorithms library (sort, scan, reduce, transform), analogous to the C++ STL but for GPU.

You use these libraries without knowing it. With tensors on a CUDA device, X @ W calls cuBLAS and applying a torch.nn.Conv2d layer to x calls cuDNN. These libraries, not the raw CUDA execution model, are the reason GPUs are fast for AI.
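To make the NCCL entry above more concrete, here is what an all-reduce computes when gradients are synchronised across GPUs. This sketch models only the result (every rank ends up with the same averaged gradient), not the ring or tree algorithms a real implementation uses to get there; the function name and gradient values are hypothetical.

```python
def all_reduce_mean(per_rank_grads):
    """Return the post-all-reduce state: every rank holds the element-wise mean."""
    world_size = len(per_rank_grads)
    summed = [sum(values) for values in zip(*per_rank_grads)]
    averaged = [total / world_size for total in summed]
    # Each rank receives its own copy of the same result.
    return [list(averaged) for _ in range(world_size)]


# Four hypothetical GPUs, each with gradients from a different data shard.
grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]
synced = all_reduce_mean(grads)
print(synced[0])  # [4.0, 5.0], identical on every rank
```

Because every rank applies the same averaged gradient, the model replicas stay identical after each optimiser step.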

How PyTorch Relates to CUDA

PyTorch is a thin Python layer sitting on top of a C++ core (ATen/libtorch), which dispatches operations to CUDA libraries. When you write x = torch.randn(1024, 1024, device="cuda"), PyTorch allocates memory in GPU HBM. When you compute y = x @ w, PyTorch calls cuBLASLt with the appropriate parameters for your data types and tensor sizes.

PyTorch's CUDA memory management is important to understand for production systems. PyTorch maintains a memory cache (the caching allocator) that keeps freed GPU memory for reuse rather than returning it to the CUDA driver. This speeds up allocation but means that torch.cuda.memory_allocated() and torch.cuda.memory_reserved() return different values: allocated is what your tensors are using; reserved is the total GPU memory PyTorch has claimed (including its cache). CUDA out-of-memory errors can appear before your tensors actually need the full VRAM when the cache is fragmented, meaning plenty of memory is reserved but no single free block is large enough for the next allocation.

import torch

# Check GPU memory
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved()/1e9:.2f} GB")
print(f"Max alloc: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")

# Release cache (useful between experiments)
torch.cuda.empty_cache()

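# Track peak usage around a forward pass run in mixed precision (FP16).
# `model` and `input` are assumed to be defined and already on the GPU.
torch.cuda.reset_peak_memory_stats()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(input)
print(f"Peak: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")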
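The gap between allocated and reserved memory is easier to reason about with a toy model. The class below is a deliberately simplified sketch, not PyTorch's actual allocator: it never splits or merges blocks, and all names are hypothetical.

```python
class ToyCachingAllocator:
    """Freed blocks stay in a pool instead of going back to the driver."""

    def __init__(self):
        self.allocated = 0       # bytes in blocks held by live tensors
        self.reserved = 0        # bytes claimed from the driver (live + cached)
        self._free_blocks = []   # sizes of cached blocks not currently in use

    def malloc(self, size: int) -> int:
        # Reuse the smallest cached block that is large enough.
        candidates = [block for block in self._free_blocks if block >= size]
        if candidates:
            block = min(candidates)
            self._free_blocks.remove(block)
        else:
            block = size
            self.reserved += block  # only now is new memory requested
        self.allocated += block
        return block

    def free(self, block: int) -> None:
        self.allocated -= block
        self._free_blocks.append(block)  # cached for reuse, not released

    def empty_cache(self) -> None:
        self.reserved -= sum(self._free_blocks)
        self._free_blocks.clear()


alloc = ToyCachingAllocator()
big = alloc.malloc(1000)
small = alloc.malloc(200)
alloc.free(big)
after_free = (alloc.allocated, alloc.reserved)    # (200, 1200)
reused = alloc.malloc(800)                        # served from the cached block
after_reuse = (alloc.allocated, alloc.reserved)   # (1200, 1200)
alloc.free(reused)
alloc.empty_cache()
after_empty = (alloc.allocated, alloc.reserved)   # (200, 200)
print(after_free, after_reuse, after_empty)
```

Freeing a tensor lowers the allocated figure but not the reserved one, and only an explicit cache release brings the two back together.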

Monitoring GPU Utilisation: nvidia-smi and PyTorch Profiler

Understanding whether your GPU is being used efficiently requires profiling, not guessing. Two tools cover most production needs:

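nvidia-smi is the command-line interface to the NVIDIA Management Library. It shows GPU utilisation (the percentage of time in the last sample period during which at least one kernel was running, not the fraction of compute units in use), memory usage, temperature, and power consumption in real time: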

# Real-time monitoring, refresh every 1 second
nvidia-smi dmon -s u -d 1

# Detailed snapshot
nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total --format=csv,noheader

A healthy training job shows GPU utilisation above 90%. Values below 50% typically indicate a data loading bottleneck (the GPU is waiting for the CPU to prepare batches), or small batch sizes that leave SMs underutilised. Consistent high utilisation with low throughput often indicates memory bandwidth saturation.
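The thresholds above can be applied automatically to the CSV snapshot. The sketch below parses one output line of the `--query-gpu` command shown earlier, assuming the default unit suffixes (`%`, `MiB`) and a GPU name without commas. The sample line is illustrative rather than captured from real hardware, and the function names are hypothetical.

```python
def parse_gpu_line(line: str) -> dict:
    """Parse 'name, utilization.gpu, memory.used, memory.total' CSV output."""
    name, util, mem_used, mem_total = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "utilization_pct": int(util.rstrip("%").strip()),
        "memory_used_mib": int(mem_used.split()[0]),
        "memory_total_mib": int(mem_total.split()[0]),
    }


def diagnose(utilization_pct: int) -> str:
    """Apply the rule-of-thumb thresholds from the text."""
    if utilization_pct >= 90:
        return "healthy"
    if utilization_pct < 50:
        return "likely input-bound or batches too small"
    return "inconclusive, profile further"


# Illustrative sample line, not real measurements.
sample = "NVIDIA H100 80GB HBM3, 42 %, 30512 MiB, 81559 MiB"
stats = parse_gpu_line(sample)
verdict = diagnose(stats["utilization_pct"])
print(stats["name"], stats["utilization_pct"], verdict)
```

In a real monitoring script the sample string would come from running the command, for example through `subprocess.run`, once per polling interval.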

PyTorch Profiler provides operation-level visibility — which CUDA kernels consumed the most time, how much memory each operation used, and where CPU-GPU synchronisation is happening:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True
) as prof:
    with record_function("model_forward"):
        output = model(input)

# Print top 10 CUDA operations by total time
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))

Common CUDA Memory Errors

Every PyTorch developer encounters these errors. Knowing what they mean saves hours of debugging:

CUDA out of memory (OOM): The most common error. GPU HBM is exhausted. Solutions: reduce batch size; use gradient checkpointing (torch.utils.checkpoint.checkpoint); switch to mixed precision (FP16/BF16); offload optimiser state to CPU; use model parallelism across multiple GPUs.
CUDA illegal memory access: A CUDA kernel accessed memory outside its allocated range — equivalent to a segfault on GPU. Usually caused by tensor shape mismatches, off-by-one errors in custom CUDA code, or accessing a tensor that has been freed. Run with CUDA_LAUNCH_BLOCKING=1 to get synchronous error reporting with accurate stack traces.
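Device-side assert triggered: A CUDA assertion failed inside a kernel, most often an index out of bounds (for example a class label outside the valid range passed to a loss function or an embedding lookup) or a failed value check. Rerun with CUDA_LAUNCH_BLOCKING=1 so the stack trace points at the call that failed. TORCH_USE_CUDA_DSA is a build-time option: PyTorch builds compiled with it report more detailed device-side assertion information.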
Expected all tensors to be on the same device: Mixing CPU and CUDA tensors in an operation. Always check tensor.device when combining tensors from different sources.
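Reducing the batch size to fix an OOM error is usually paired with gradient accumulation, so that the effective batch size stays the same. The sketch below checks the underlying arithmetic for a one-parameter linear model with a mean-squared-error loss, assuming equally sized micro-batches; the data and function names are hypothetical.

```python
def mse_grad(w: float, xs, ys) -> float:
    """Gradient with respect to w of mean((w * x - y) ** 2) over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs, ys, micro_batch_size: int) -> float:
    """Process the batch in micro-batches and accumulate scaled gradients."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch must split into equal micro-batches")
    num_steps = len(xs) // micro_batch_size
    grad = 0.0
    for step in range(num_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Dividing by num_steps makes the sum equal the full-batch mean.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_steps
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(abs(full - accum) < 1e-9)  # the two gradients agree up to rounding
```

Only one micro-batch of activations needs to be resident at a time, which is where the memory saving comes from. In a framework training loop the same scaling is typically done by dividing each micro-batch loss by the number of accumulation steps before calling backward.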

CUDA Alternatives: ROCm, Metal, and oneAPI

CUDA's dominance is not inevitable. Alternatives exist, each with genuine use cases:

ROCm (AMD): AMD's open-source GPU compute platform. PyTorch supports ROCm, and AMD's MI300X GPU is competitive with the H100 in memory-bound workloads. ROCm is improving rapidly but has gaps in library coverage and framework optimisation compared to CUDA. Best choice when you need AMD GPUs for cost or supply reasons, or in HPC environments where AMD hardware dominates.
Metal / Core ML (Apple): Apple's GPU and NPU programming stack for macOS and iOS. PyTorch supports Metal via the MPS (Metal Performance Shaders) backend: device = torch.device("mps"). Performance on Apple M4 Pro is strong for inference; training support is improving but lags CUDA. Best for: local development on Apple silicon; on-device iOS/macOS inference via Core ML.
oneAPI / SYCL (Intel): Intel's cross-architecture programming model targeting Intel GPUs, CPUs, and FPGAs with a unified codebase. Growing in HPC environments with Intel hardware. Weak AI framework ecosystem compared to CUDA.

CUDA's real moat is not the programming model — it is 15+ years of optimised kernel libraries, deep framework integration, a massive developer community, and backward compatibility that protects existing code investments. Replacing CUDA means replacing the entire AI software ecosystem. That is why no competitor has succeeded yet, and why understanding CUDA remains a foundational skill for AI engineers working at the systems level.
