What is KV caching and why is it essential for LLM inference?

KV caching stores the Key and Value tensors computed for each input token during the prefill phase and reuses them during autoregressive token generation. Without KV caching, the attention computation would need to reprocess all previous tokens at each generation step, making the cost grow as O(n²) with sequence length. With KV caching, each new token requires only computing the attention of the new token against all cached Keys and Values, reducing the amortized cost to O(n). KV caching is considered mandatory for any production LLM deployment — the latency and compute cost without it would make real-time generation with long contexts completely infeasible. The tradeoff is memory: KV cache can consume tens of gigabytes for long contexts and large batch sizes.

What is the difference between GPTQ and AWQ quantization and which should I use?

GPTQ (Gradient-based PTQ) quantizes each layer's weights to minimize the layer's output error, using second-order information from the Hessian. It is effective, well-supported, and widely available for most open-source models. AWQ (Activation-Aware Weight Quantization) identifies the 1% of weights that most strongly influence activations and protects them from quantization, resulting in better accuracy on reasoning and factual tasks at the same bit width. For most production deployments, AWQ is the preferred choice — it generally produces better results than GPTQ at 4-bit precision, especially for instruction-following and reasoning tasks. GPTQ remains a valid option if AWQ-quantized versions of your target model are not available or if you need to quantize a custom fine-tuned model yourself.

When should I use vLLM vs TensorRT-LLM vs llama.cpp for LLM serving?

Use vLLM for most production deployments of open-source LLMs: it offers continuous batching, PagedAttention, OpenAI-compatible API, broad model support, and good throughput on NVIDIA and AMD GPUs. Use TensorRT-LLM when you need maximum throughput on NVIDIA GPUs and can tolerate the model compilation step (which takes minutes to hours) — it outperforms vLLM on raw throughput benchmarks through kernel fusion and hardware-specific optimizations. Use llama.cpp for CPU inference, development environments, edge deployments, or when running small quantized models on GPUs with limited VRAM — it is not suitable for high-throughput production serving but is invaluable for prototyping and local development without GPU access.

How do I estimate GPU memory requirements for deploying an LLM?

Estimate GPU memory as follows: model weights = parameters × bytes_per_parameter (2 for FP16/BF16, 1 for INT8, 0.5 for INT4). A 7B FP16 model needs 14 GB, a 70B FP16 model needs 140 GB. Add 20% for activations and framework overhead. Then add KV cache: 2 × layers × d_model × max_seq_len × batch_size × bytes_per_parameter. For a 7B LLaMA-3 model (32 layers, 4096 d_model, FP16, 8K context, batch size 8), KV cache is approximately 16 GB. Total: 14 + 2.8 + 16 = 33 GB — fits on an A100 40GB. Use AWQ 4-bit quantization to reduce the model weight footprint to ~3.5 GB if memory is tight.

What is speculative decoding and when does it improve LLM inference performance?

Speculative decoding uses a small, fast draft model to generate k candidate tokens simultaneously, which the large target model verifies in a single forward pass. Because GPU hardware can verify k+1 tokens in nearly the same time as generating 1 token (verification is parallelized), each accepted draft token costs very little additional compute. When the draft model's predictions are accepted with high probability (typically 70–90% on English text and code), speculative decoding reduces end-to-end generation latency by 2–4× without changing the target model's output distribution. It works best for domains where a well-chosen draft model closely matches the target model's distribution — it is less effective for highly creative tasks or domains where the draft model is weak. The tradeoff is doubled GPU memory usage (holding both the draft and target model) and added serving complexity.

← AI Engineering Studio

AI & Automation·11 min read·June 19, 2026

⚙️ LLM Inference Optimization: Running AI Models in Production Without Breaking the Bank

A comprehensive technical guide to LLM inference optimization — covering KV cache mechanics, quantization methods (INT8, INT4, GPTQ, AWQ), speculative decoding, continuous batching, tensor parallelism, and comparing vLLM vs TensorRT-LLM vs llama.cpp for production deployments with cost estimation formulas and GPU memory sizing.

Why Inference Optimization Is the Critical Bottleneck in Production LLMs

Training and fine-tuning determine what a model knows and how it behaves. Inference optimization determines whether that model is economically viable in production. For most organizations deploying LLMs, inference costs dominate the total lifetime expense — often by a factor of 10 to 100 over training costs for a successful product. A model that produces excellent results in a research context can be completely infeasible to deploy if its inference cost per request is too high or its latency is too long for user experience requirements.

The challenge is that LLM inference has fundamentally different performance characteristics than traditional software. Cost scales with token count (not request count), latency scales with output length (not just input length), and memory requirements scale with context window size. Understanding these dynamics — and the optimization techniques available to address them — is essential for any engineer responsible for deploying LLM-powered applications.

This article covers the key optimization levers in order of impact: KV caching, quantization, batching strategies, speculative decoding, and parallel inference. It also provides practical guidance on selecting inference frameworks and estimating production costs.

The Inference Loop: Where Time Is Actually Spent

Each generated token requires a complete forward pass through all Transformer layers. The core computations per forward pass are:

Matrix multiplications (GEMMs): The dominant compute operation. Every attention projection (Q, K, V, O matrices), every FFN layer (two large matrix multiplications), and the output vocabulary projection are GEMMs. On modern GPUs with tensor cores, these are highly parallelized.
Memory bandwidth: Loading model weights from GPU HBM (high-bandwidth memory) into compute units is often the binding constraint for single-request inference. For a 7B model in FP16, weights occupy ~14 GB and must be streamed through the GPU memory interface at each token generation step.
Attention computation: O(n²) in sequence length n. For a single 1,000-token context, attention is fast. For a 100,000-token context, attention becomes a bottleneck — though FlashAttention (below) largely resolves this.

The key insight: inference is bounded by memory bandwidth, not compute, for single-batch requests. GPUs are most efficient when processing many tokens simultaneously (large batches). Serving one request at a time dramatically underutilizes GPU compute — you pay for GPU capacity while only using its memory bandwidth.

KV Cache: The Single Biggest Inference Optimization

During autoregressive generation, the model generates tokens one at a time. At each step, it recomputes attention over all previous tokens. Without caching, this means recomputing the Key and Value matrices for every previous token at every new generation step — the computational cost grows as O(n²) with sequence length.

KV caching solves this by storing the Key and Value tensors for previously processed tokens and reusing them. At step t, only the new token's Query, Key, and Value need to be computed; Keys and Values from steps 1 through t−1 are read from cache. This reduces the attention computation from O(n²) over n steps to O(n) total — a fundamental change that makes real-time generation feasible for long sequences.

The memory cost of the KV cache is significant:

KV_Cache_Bytes = 2 × num_layers × d_model × seq_length × precision_bytes × batch_size

For a LLaMA-2 70B model (80 layers, d_model = 8192, FP16 = 2 bytes) with a 4,096-token context and batch size 1: 2 × 80 × 8192 × 4096 × 2 = 10.7 GB. At batch size 8 with 8K context: ~85 GB — which exceeds a single A100 80GB GPU's available VRAM after loading model weights. KV cache memory management is therefore a central concern in LLM inference serving.

PagedAttention, the innovation behind vLLM, addresses KV cache memory fragmentation by managing cache in fixed-size "pages" (similar to virtual memory in operating systems). This allows multiple requests with different sequence lengths to share GPU memory efficiently and enables continuous batching without memory waste from over-allocation.

Quantization: Trading Precision for Speed and Memory

Quantization reduces the numerical precision of model weights (and sometimes activations) to reduce memory footprint and increase inference speed. The tradeoffs:

Precision	Bits	Memory (7B model)	Speed vs FP32	Quality Impact
FP32	32	~28 GB	1×	Reference
FP16 / BF16	16	~14 GB	~2×	Negligible
INT8	8	~7 GB	~2–3×	Slight drop
INT4 / NF4	4	~3.5 GB	~3–4×	Task-dependent

The two most widely deployed post-training quantization methods are:

GPTQ (Gradient-based PTQ): Quantizes weights layer by layer using an approximation derived from the Hessian of the layer's output error. GPTQ produces high-quality 4-bit models with minimal accuracy degradation on most benchmarks and is supported by vLLM, llama.cpp, and ExLlamaV2. GPTQ models can be loaded directly into inference servers without re-quantizing.
AWQ (Activation-Aware Weight Quantization): Identifies the most important weights for a given activation distribution (typically the 1% of weights with the largest activation magnitudes) and protects them from quantization error. AWQ generally outperforms GPTQ on reasoning-intensive tasks and is now the preferred quantization method for production deployments. Models quantized with AWQ are available for most major open-source LLMs on Hugging Face.

For production deployments, FP16 or BF16 provides the best accuracy-speed tradeoff on modern GPUs with hardware support. INT8 quantization using NVIDIA's TensorRT-LLM or bitsandbytes library provides additional speedup with minimal quality degradation. INT4 (GPTQ or AWQ) enables running 70B models on single-GPU hardware and reduces cost by approximately 50% compared to FP16 in API deployments.

Speculative Decoding: Faster Generation Through Draft Models

Speculative decoding is a technique that accelerates inference by using a small, fast "draft" model to propose multiple tokens simultaneously, which the large target model then verifies in a single forward pass. Because verification (checking tokens in parallel) is faster than sequential generation, speculative decoding reduces end-to-end latency when the draft model's proposals are frequently accepted.

The workflow: the small draft model (e.g., a 7B model) generates k candidate tokens (typically k = 4–8). The large target model (e.g., a 70B model) processes all k+1 positions simultaneously in one forward pass. Any tokens that the target model would have produced with equal or higher probability are accepted; the first rejected token triggers a correction. On average, for domains where the draft model performs well (English prose, code), 3–5 tokens can be accepted per speculation cycle, reducing the number of large model forward passes by 3–5×.

Speculative decoding reduces latency (time per token) without reducing throughput or model quality — the target model's output distribution is preserved exactly. It requires maintaining two models in GPU memory simultaneously, which is feasible for large GPU setups but requires careful memory planning.

Continuous Batching and vLLM

Traditional static batching requires all requests in a batch to have the same length and to start and finish together. This wastes GPU resources: when one request in a batch finishes early, the GPU idles while waiting for the remaining requests. Dynamic or continuous batching (the core innovation of vLLM) solves this by treating each token generation step as an independent batch — new requests can be added to the batch dynamically as GPU compute becomes available, and finished requests are removed immediately.

vLLM (from UC Berkeley, 2023) implements continuous batching with PagedAttention and has become the de facto standard for open-source LLM serving. Key capabilities:

Continuous batching for 2–10× higher throughput than static batching
PagedAttention for near-zero KV cache memory waste
Support for GPTQ, AWQ, and FP16/BF16 quantization
OpenAI-compatible API interface for drop-in compatibility
Multi-GPU tensor parallelism via NVIDIA Megatron-style partitioning

# Start vLLM server for a quantized model
# Requires: pip install vllm

python -m vllm.entrypoints.openai.api_server   --model meta-llama/Llama-3-8B-Instruct   --quantization awq   --dtype auto   --tensor-parallel-size 1   --gpu-memory-utilization 0.90   --max-model-len 8192   --port 8000

TensorRT-LLM (NVIDIA) provides an alternative optimized inference stack with kernel fusion, INT8/FP8 quantization, and support for NVIDIA-specific hardware features. TensorRT-LLM typically outperforms vLLM on pure throughput benchmarks on NVIDIA GPUs but requires model compilation (which can take hours) and has less framework flexibility. Use vLLM for rapid deployment and flexibility; use TensorRT-LLM when squeezing maximum throughput from NVIDIA hardware matters most.

llama.cpp enables CPU inference and efficient GPU inference for quantized models, primarily targeting edge deployment and development environments. It supports GGUF-format quantized models (Q2 through Q8) and can run models partially on CPU and partially on GPU. Ideal for local development, smaller models, and CPU-only deployments — not suitable for high-throughput production serving.

Tensor Parallelism for Large Models

Models that do not fit in a single GPU's memory require distributed inference. The two primary parallelism strategies are:

Tensor parallelism: splits individual weight matrices across multiple GPUs. Each GPU holds a fraction of each attention head and FFN layer, and all-reduce operations synchronize results between GPUs during the forward pass. Tensor parallelism reduces per-GPU memory proportionally to the number of GPUs but adds communication overhead. Most effective for models that are slightly too large for one GPU.
Pipeline parallelism: assigns different Transformer layers to different GPUs, with activations passed sequentially between them. Has lower communication overhead than tensor parallelism but introduces pipeline bubbles (idle time between mini-batches). Most effective for very deep models on bandwidth-limited inter-GPU connections.

For a 70B model in FP16 (140 GB), 2× A100 80GB GPUs with tensor parallelism is the standard deployment. For a 7B model (14 GB), a single A100 40GB or A10G 24GB handles inference with room for KV cache.

GPU Memory Sizing and Cost Estimation

Before deploying, estimate memory requirements and cost:

GPU_Memory_Required = Model_Weights + KV_Cache + Activations + Framework_Overhead

Rule of thumb for model weights: parameters × precision_bytes. 7B parameters × 2 bytes (FP16) = 14 GB. Add 20–30% for activations and framework overhead. For KV cache at batch size B and max sequence length L:

KV_Cache_GB = 2 × layers × d_model × L × B × 2 / (1024³)

For inference cost estimation:

Cost_per_1K_tokens ≈ (GPU_hourly_rate × inference_time_per_1K_tokens) / 3600

A rough benchmark: on an A100 80GB, a 7B FP16 model achieves approximately 2,000–4,000 tokens/second at batch size 16. At $3/hour GPU cost, serving cost is approximately $0.0002–0.0004 per 1,000 tokens — far below commercial API pricing, which explains why high-volume applications often self-host open-source models. For lower-volume applications, commercial APIs (which include model updates, maintenance, and reliability SLAs) remain cost-effective.

The most impactful cost reduction levers in order of priority: (1) reduce context length — KV cache and attention cost scale quadratically; (2) use quantization — INT4/AWQ roughly halves memory and cost; (3) use smaller models where quality permits — a 7B fine-tuned model often outperforms a 70B zero-shot model for specific tasks; (4) implement batching — continuous batching provides 5–10× throughput improvement; (5) implement caching — prompt prefix caching in vLLM can eliminate redundant computation for shared system prompts.

Topics covered

LLM inference optimizationKV cache mechanicsquantization INT8 INT4GPTQ quantizationAWQ quantizationspeculative decoding LLMcontinuous batching vLLMtensor parallelism LLMvLLM vs TensorRT-LLMllama.cpp inferenceLLM production deploymentGPU memory LLM sizingLLM cost optimizationtime to first tokenthroughput vs latency LLMFlashAttention optimizationLLM serving infrastructuretoken per second GPUinference cost estimationLLM deployment cost

🛠️ Related Free Tools

Put this knowledge to work on your iPhone

Browse our full catalog of professional iOS apps — from electrical code tools to AI builders.

Browse All 95+ Apps