What is LoRA and how does it reduce the cost of fine-tuning large language models?

LoRA (Low-Rank Adaptation) fine-tunes large models by adding small trainable matrices alongside frozen base model weights, rather than updating all parameters. Instead of training the full weight matrix W, LoRA trains two small matrices A and B such that the effective weight update is ΔW = B × A, where the rank r of these matrices is much smaller than the original dimensions (typically 4–16 vs 4096 or more). This reduces trainable parameters by 99% or more: a 7B parameter model might have only 4–8 million trainable LoRA parameters. Gradient and optimizer-state memory and checkpoint storage shrink dramatically, and training is typically faster as well, which makes fine-tuning feasible on a single GPU rather than a large GPU cluster.

What is QLoRA and what hardware does it require for fine-tuning?

QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters for training. The base model is quantized to NF4 (NormalFloat4) precision, reducing its weight memory by roughly 4× compared to FP16. LoRA adapters remain in higher precision (BF16) during training. This combination allows fine-tuning of 7B models on consumer-grade GPUs with 16–24 GB VRAM (such as RTX 3090/4090), 13B models on GPUs with 24 GB, and 70B models on GPUs with 48 GB or with two 24 GB GPUs. Without QLoRA, fully fine-tuning a 7B model in FP16 would require approximately 60–80 GB of GPU memory for model weights, gradients, and optimizer states.

What is RLHF and what problem does it solve for LLM alignment?

Reinforcement Learning from Human Feedback (RLHF) is the training process that transforms a capable but unaligned base model into a helpful, safe assistant. Base models trained on text prediction do not naturally follow instructions, refuse harmful requests, or produce appropriately formatted responses. RLHF addresses this by: first fine-tuning on human demonstration data (SFT), then training a reward model on human preference comparisons, then using PPO to optimize the policy model to maximize reward while staying close to the SFT model. The result is a model that is significantly more cooperative, helpful, and aligned with human values — at the cost of substantial training infrastructure and human labeling effort.

When should I choose fine-tuning over RAG for an LLM application?

Choose fine-tuning when you need consistent behavior, format, or domain expertise that cannot be reliably achieved through prompting or retrieval. Fine-tuning is the right choice when: the model consistently produces outputs in the wrong format despite clear prompt instructions; you need the model to behave like a domain expert (using domain-specific vocabulary, reasoning patterns, and conventions); you are running the model at high volume where long system prompts significantly increase per-request cost; or you need to run on smaller, faster models. Choose RAG when the primary need is access to current or private knowledge — fine-tuning cannot inject frequently changing facts effectively, and RAG provides traceable citations that fine-tuning cannot.

What is Direct Preference Optimization (DPO) and how does it differ from RLHF?

DPO is an alignment technique that achieves similar results to RLHF without requiring a separate reward model or reinforcement learning optimization. Both RLHF and DPO use preference data: pairs of (chosen, rejected) responses for each prompt. RLHF first trains a reward model to score preferences, then uses PPO to optimize the policy against this reward model. DPO skips both of these steps by deriving a closed-form loss function from the RLHF objective, directly updating the policy model to prefer chosen responses over rejected ones in a single supervised training stage. DPO is simpler, cheaper, and more stable to train than PPO-based RLHF while achieving comparable alignment quality on most tasks. It has become a common default alignment approach in open-source model development.


AI & Automation · 10 min read · June 19, 2026

🎯 Fine-Tuning LLMs: LoRA, QLoRA, RLHF, and Instruction Tuning Explained

A practical guide to adapting large language models for specific tasks — covering the full fine-tuning vs PEFT spectrum, LoRA low-rank decomposition math, QLoRA 4-bit quantization, instruction tuning datasets, the RLHF pipeline with reward models and PPO, DPO as a simpler alignment alternative, and how to decide between fine-tuning, RAG, and prompt engineering.

Why Fine-Tuning Matters: From General Capability to Specific Behavior

Pretraining builds a foundation model: a system trained on vast amounts of text that learns statistical patterns, world knowledge, and general reasoning capabilities. But a pretrained model is not immediately useful for most practical applications. It predicts the next token in sequences similar to its training data — it does not naturally follow instructions, maintain a particular persona, or reliably produce outputs in a specific format.

Fine-tuning is the process of continuing to train a pretrained model on a smaller, task-focused dataset. It reshapes the model's probability distributions — making certain outputs more likely, others less likely — without changing the architecture or tokenizer. Fine-tuned models are more consistent, more task-appropriate, and often significantly more cost-efficient in production than heavily prompted base models. Understanding the different fine-tuning approaches — and when to use each — is a core skill for anyone building LLM-powered applications.

The Fine-Tuning Spectrum: From Prompting to Full Fine-Tuning

Fine-tuning methods exist on a spectrum of increasing control and cost. At one end is prompt engineering — steering the model through carefully designed inputs without changing any weights. At the other end is full fine-tuning — updating all model parameters. Between these extremes are several parameter-efficient fine-tuning (PEFT) methods that update only a small fraction of parameters while achieving most of the performance gains of full fine-tuning.

The practical decision framework:

Prompt engineering: Use when requirements change frequently, iteration speed matters most, and output quality is acceptable with well-crafted prompts. Best for low-volume applications where outputs are reviewed by humans.
Instruction tuning: The first fine-tuning step for any base model. Teaches the model to follow instructions rather than continue text. Produces dramatic usability improvements with thousands of examples.
PEFT (LoRA/QLoRA): The default choice for domain adaptation, style control, and task-specific behavior. Trains adapter parameters amounting to roughly 0.1–1% of the model's size while approaching full fine-tuning performance. Covers the large majority of practical fine-tuning needs.
Full fine-tuning: Used when maximum performance is required, you have large high-quality datasets, and you control the full model lifecycle. Expensive, risks catastrophic forgetting, and rarely justified over LoRA for most use cases.

LoRA: Low-Rank Adaptation Mathematics

LoRA (Low-Rank Adaptation) is the most widely used PEFT method, introduced by Hu et al. in 2021. Its core insight is that the useful updates to a large weight matrix during fine-tuning lie in a low-dimensional subspace — meaning you do not need to update the full matrix to achieve the behavioral change you want.

For a weight matrix W ∈ ℝ^{m×n} in the original model, LoRA introduces a low-rank update:

W' = W + ΔW = W + B × A

where B ∈ ℝ^{m×r} and A ∈ ℝ^{r×n}, and the rank r is much smaller than both m and n (typically r = 4, 8, or 16). The original weight W is frozen during training. Only B and A are trained. During inference, the effective weight W' = W + BA can be computed once and merged into the original weight, so there is zero additional inference latency compared to the base model.

The parameter reduction is significant. For a linear layer with m = 4096 and n = 4096, the original weight has 16.7 million parameters. With r = 8, LoRA adds only 4096 × 8 + 8 × 4096 = 65,536 parameters — about 0.4% of the original. LoRA is typically applied to the Query and Value projection matrices in attention layers, though some implementations also apply it to the Key, output projection, and FFN matrices.
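The arithmetic above can be checked in a few lines. This is a minimal sketch: the helper name `lora_param_counts` and the 32-layer model shape are illustrative assumptions, not values taken from a specific checkpoint.

```python
def lora_param_counts(m, n, r):
    """Return (full, lora, fraction) parameter counts for one m x n layer at rank r."""
    full = m * n              # parameters in the frozen weight W
    lora = m * r + r * n      # parameters in B (m x r) plus A (r x n)
    return full, lora, lora / full

full, lora, fraction = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.2%}")

# Assumption for illustration: 32 transformer layers, with LoRA applied to the
# Query and Value projections only (2 adapted matrices per layer).
num_layers, matrices_per_layer = 32, 2
total_lora = num_layers * matrices_per_layer * lora
print(f"adapter parameters across the model: {total_lora:,}")
```

Under those assumptions the adapters total about 4.2 million parameters, which is in line with the few-million figure usually quoted for 7B-class models.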

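import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight and a trainable low-rank update."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen base weight W (random here as a stand-in for a pretrained weight).
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # A starts small and random, B starts at zero, so the update B @ A is
        # zero at initialization and training begins from the base model.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank  # LoRA scaling factor

    def forward(self, x):
        base = x @ self.weight.T  # frozen path: x W^T
        # Low-rank path: x A^T B^T, scaled by alpha / rank.
        lora = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base + lora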

The scaling factor alpha/rank (where alpha is a hyperparameter, typically 2× the rank value) controls the magnitude of the LoRA update. A higher alpha gives LoRA updates more influence over the base model's outputs.
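The claim that merged adapters add no inference latency rests on a simple identity: applying W and the scaled low-rank path separately gives the same output as applying one pre-merged matrix. The sketch below checks this with plain Python lists so it runs without any dependencies. The toy sizes and the helper names `matmul` and `add_scaled` are assumptions for illustration, and the alpha/rank scaling is folded into the merged weight.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(W, D, scale):
    """Return W + scale * D element-wise."""
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, D)]

random.seed(0)
m, n, r, alpha = 6, 5, 2, 4   # toy sizes; real layers have m, n in the thousands
scaling = alpha / r

W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]  # frozen base (m x n)
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(m)]  # trained factor (m x r)
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(r)]  # trained factor (r x n)
x = [[random.gauss(0, 1)] for _ in range(n)]                    # input column (n x 1)

# Unmerged: base path plus the low-rank side path, as during training.
base = matmul(W, x)
side = matmul(B, matmul(A, x))
unmerged = add_scaled(base, side, scaling)

# Merged: fold the update into W once, then serve with a single matmul.
W_merged = add_scaled(W, matmul(B, A), scaling)
merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - v[0]) for u, v in zip(unmerged, merged))
print(f"max difference between merged and unmerged outputs: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, and the merged matrix has the same shape as W, so the served model is no larger or slower than the base model.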

QLoRA: Fine-Tuning 70B+ Models on a Single GPU

QLoRA (Quantized LoRA) combines quantization of the base model with LoRA adapters for training, enabling fine-tuning of very large models on consumer or single-server GPU hardware. Introduced by Dettmers et al. in 2023, QLoRA made it practical to fine-tune a 65B parameter model on a single 48 GB GPU, which previously required a multi-GPU cluster.

The QLoRA recipe has three key components:

4-bit NF4 (NormalFloat4) quantization: The base model weights are quantized to 4 bits using a data type specifically designed for normally distributed weight values. Unlike standard INT4 quantization, NF4 assigns quantization bins that are optimal for the distribution of neural network weights, minimizing quantization error.
Double quantization: The quantization constants themselves are quantized to 8 bits, saving roughly 0.37 bits per parameter, or about 3 GB for a 65B model.
Paged optimizers: GPU memory for optimizer states (Adam momentum terms) is managed with NVIDIA's unified memory, allowing optimizer states to page to CPU RAM when GPU memory is constrained.
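To make the first component more concrete, here is a simplified sketch of quantile-based 4-bit block quantization. It is not the exact NF4 code table used by QLoRA or bitsandbytes (the real data type, for example, reserves an exact zero). The level construction and the function names are assumptions chosen to show the idea: levels are placed at quantiles of a standard normal distribution, and each block of weights is scaled by its absolute maximum before being mapped to the nearest level.

```python
import random
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Build 2**bits levels from standard-normal quantiles, rescaled into [-1, 1]."""
    k = 2 ** bits
    normal = NormalDist()
    quantiles = [normal.inv_cdf((i + 0.5) / k) for i in range(k)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]

def quantize_block(weights, levels):
    """Return (codes, absmax): one 4-bit code per weight plus one scale per block."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Map codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]

random.seed(0)
levels = normal_quantile_levels()
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # one block of 64 weights
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"levels: {len(levels)}, absmax: {absmax:.4f}, max error: {max_error:.5f}")
```

Because the levels cluster near zero, where most normally distributed weights fall, small weights are represented more finely than they would be on an evenly spaced 4-bit grid. The per-block `absmax` value is the quantization constant that double quantization compresses further.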

The practical result: fine-tuning a 7B parameter model requires approximately 6–10 GB of GPU memory with QLoRA, versus roughly 60–80 GB for full fine-tuning in FP16 once gradients, Adam optimizer states, and activations are counted. A 70B model fits in 48 GB with QLoRA. This has made domain-specific fine-tuning accessible to teams without large GPU clusters.
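A back-of-the-envelope estimate shows where these numbers come from. The sketch below counts only static state and uses decimal gigabytes. Activations, the LoRA adapters and their optimizer states, and quantization constants all come on top, which is why real-world figures sit above the weight-only numbers. The function names are illustrative assumptions.

```python
GB = 1e9  # decimal gigabytes, for round numbers

def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone."""
    return num_params * bits_per_param / 8 / GB

def full_finetune_static_gb(num_params, bytes_per_value=2):
    """Weights + gradients + two Adam moment tensors, all at the same precision."""
    copies = 4  # weights, gradients, Adam first moment, Adam second moment
    return num_params * bytes_per_value * copies / GB

for name, n in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n, 16)
    four_bit = weight_memory_gb(n, 4)
    print(f"{name}: FP16 weights {fp16:.1f} GB, 4-bit weights {four_bit:.1f} GB")

static_7b = full_finetune_static_gb(7e9)
print(f"7B full fine-tuning, static state in FP16: {static_7b:.0f} GB")
```

A 7B model drops from 14 GB of FP16 weights to 3.5 GB at 4 bits, and a 70B model needs 35 GB for its 4-bit weights, which leaves limited but workable headroom on a 48 GB card.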

Instruction Tuning: Teaching Models to Follow Instructions

Base language models are trained to complete text in the style of their training corpus. They do not naturally respond to questions, follow formatting constraints, or exhibit helpful behavior. Instruction tuning is supervised fine-tuning on datasets of (instruction, response) pairs that teaches the model to recognize and follow instructions.

Key instruction tuning datasets include:

FLAN (Fine-tuned LAnguage Net): Google's collection of over 1,800 NLP tasks converted to instruction format. FLAN models showed that instruction tuning on many diverse tasks generalizes better than task-specific fine-tuning.
Alpaca: A dataset of 52,000 instruction-following examples generated by prompting OpenAI's text-davinci-003 with a small set of seed examples. Demonstrated that self-instructed data from a powerful model can produce capable instruction-following behavior in a smaller model.
OpenAssistant (OASST): Human-generated multi-turn conversations with assistant-style responses, focused on helpfulness and safety.

Instruction tuning training applies cross-entropy loss only to the response tokens, not to the instruction or context tokens. This teaches the model to produce responses rather than repeat or continue the input. A typical training format: [INST] Summarize the following text: {text} [/INST] {summary}. The loss is computed only on the tokens after [/INST].
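Response-only loss is usually implemented by masking the labels of prompt tokens. The sketch below shows the idea in plain Python. The token ids and probabilities are made up for illustration, and the one-position label shift used in real causal language model training is left out for clarity. The value -100 matches the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(token_ids, response_start):
    """Copy token ids as labels, masking everything before the response."""
    return [IGNORE_INDEX] * response_start + list(token_ids[response_start:])

def masked_cross_entropy(token_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_probs[i] is the probability the model assigned to the correct token i.
    """
    losses = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# Hypothetical ids: 4 prompt tokens ([INST] ... [/INST]) followed by 3 response tokens.
token_ids = [1, 501, 502, 2, 900, 901, 902]
labels = build_labels(token_ids, response_start=4)
token_probs = [0.1, 0.1, 0.1, 0.1, 0.5, 0.25, 0.5]

loss = masked_cross_entropy(token_probs, labels)
print(labels)
print(f"loss over response tokens only: {loss:.4f}")
```

The low probabilities on the four prompt positions do not affect the loss at all. Only the three response tokens contribute.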

RLHF: The Full Alignment Pipeline

Reinforcement Learning from Human Feedback (RLHF) is the training approach behind the dramatic improvement in helpfulness and safety between base language models and production assistants like ChatGPT and Claude. Despite its name, RLHF is mostly supervised learning and preference modeling, with a reinforcement learning optimization step at the end.

The RLHF pipeline has three stages:

Supervised Fine-Tuning (SFT): Fine-tune the pretrained model on high-quality demonstration data — human-written examples of ideal assistant responses. This creates the SFT model, which follows instructions but may still be inconsistently helpful or occasionally harmful.
Reward Model Training: Collect human preference data: show human labelers a prompt and two model responses (from the SFT model), and have them choose which response is better. Train a reward model — a neural network with the same architecture as the base LLM but with a scalar output head — to predict which response humans prefer. The reward model learns to score responses on a continuous scale of human preference.
Policy Optimization with PPO: Use the reward model as a proxy for human judgment and optimize the SFT model (now called the policy) to maximize reward using Proximal Policy Optimization (PPO). A KL-divergence penalty keeps the policy close to the SFT model. Without it, the policy drifts toward reward hacking: scoring well on the reward model while producing low-quality outputs. The trained policy is the final deployed model.
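Two pieces of this pipeline reduce to short formulas. The sketch below assumes the pairwise (Bradley-Terry style) loss that is the common choice for reward model training, and a sequence-level KL estimate for the penalty. Production implementations typically apply the penalty per token inside the PPO loop. Function names and the value of `beta` are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss: push the chosen score above the rejected score."""
    return -log_sigmoid(r_chosen - r_rejected)

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT model.

    logp_policy and logp_ref are log-probabilities of the same sampled response
    under the current policy and the frozen SFT (reference) model.
    """
    return reward - beta * (logp_policy - logp_ref)

# Reward model: the loss is small when the ranking is right, large when wrong.
print(f"correct ranking: {reward_model_loss(2.0, 0.0):.4f}")
print(f"wrong ranking:   {reward_model_loss(0.0, 2.0):.4f}")

# Policy optimization: the same raw reward is worth less once the policy has
# moved away from the reference on this response.
print(f"no drift:   {kl_shaped_reward(1.0, -5.0, -5.0):.2f}")
print(f"with drift: {kl_shaped_reward(1.0, -2.0, -5.0):.2f}")
```

Raising `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement. Lowering it does the opposite and makes reward hacking more likely.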

RLHF is computationally expensive (requiring three separate model training runs), operationally complex (human labeling pipelines, reward model calibration, PPO stability), and prone to failure modes like reward hacking, over-refusal, and reduced output diversity. Few teams outside major AI labs implement full RLHF.

DPO: Simpler Alignment Without Reinforcement Learning

Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, achieves similar alignment results to RLHF without training a separate reward model or running PPO. DPO reformulates the RLHF objective as a straightforward supervised learning problem on preference data.

The key insight: the optimal policy under the KL-constrained RLHF objective has a closed form in terms of the reward function and the reference model (the SFT model). Inverting that relationship expresses the reward through the policy itself, so the preference loss can be written directly over policy and reference log-probabilities. DPO uses this loss to train the policy model to prefer the chosen response over the rejected response, without requiring the intermediate reward model step.

In practice, DPO requires the same preference data format as RLHF: (prompt, chosen_response, rejected_response) triplets. The training is straightforward supervised fine-tuning on this data with the DPO loss. DPO models perform comparably to RLHF models on most benchmarks, are significantly simpler to train, and are becoming the standard alignment approach for teams that need instruction following and basic safety alignment without the full RLHF infrastructure.
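For a single preference triplet, the DPO loss from Rafailov et al. is a log-sigmoid of the difference between two log-probability ratios. The sketch below computes it from four sequence log-probabilities. The argument names and the default `beta` are assumptions for illustration, and a real training loop would compute these log-probabilities from model outputs and average over a batch.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Each argument is the summed log-probability of a full response given the
    prompt, under either the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
# Policy moved in the wrong direction.
worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)

print(f"start: {start:.4f}  improved: {improved:.4f}  worse: {worse:.4f}")
```

The loss starts at log 2 when the policy equals the reference and falls as the policy shifts probability toward chosen responses. The reference model plays the same anchoring role that the KL penalty plays in PPO-based RLHF.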

Fine-Tuning vs RAG vs Prompt Engineering: The Decision Framework

The most practical question for LLM system builders is: given a specific task, should I change the prompt, fine-tune the model, or add retrieval? The decision should be driven by engineering requirements, not preference:

Use prompt engineering when: the task changes frequently, you need fast iteration, volume is low, and minor inconsistencies are acceptable. Prompting is the fastest path from zero to working.
Use RAG when: the task requires up-to-date or private knowledge that was not in the model's training data, you need traceable citations, or the knowledge base changes frequently. RAG is cheaper than fine-tuning for knowledge-intensive tasks.
Use fine-tuning when: you need consistent output format or style at scale, the prompt is becoming too complex to maintain, response quality is insufficient despite good prompts and retrieval, or you need to run the model on smaller hardware with lower latency. Fine-tuning trades flexibility for reliability.
Use both fine-tuning and RAG in production systems: fine-tune for behavior (tone, format, domain expertise), add RAG for knowledge (current facts, proprietary documents). This hybrid is the dominant pattern in mature LLM applications.
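The rules above can be written as a small function, which is a handy way to make the decision explicit in a design review. This is only a sketch of the framework as described here: the flag names are hypothetical, and real decisions involve cost and data-availability trade-offs that a few booleans cannot capture.

```python
def recommend(*,
              needs_fresh_or_private_knowledge=False,
              needs_citations=False,
              needs_consistent_format_at_scale=False,
              prompt_too_complex_to_maintain=False,
              needs_smaller_faster_model=False):
    """Map task requirements to an adaptation strategy."""
    # Knowledge requirements point toward retrieval.
    wants_rag = needs_fresh_or_private_knowledge or needs_citations
    # Behavior, maintainability, and latency requirements point toward fine-tuning.
    wants_finetune = (
        needs_consistent_format_at_scale
        or prompt_too_complex_to_maintain
        or needs_smaller_faster_model
    )
    if wants_rag and wants_finetune:
        return "fine-tuning + RAG"
    if wants_rag:
        return "RAG"
    if wants_finetune:
        return "fine-tuning"
    return "prompt engineering"

print(recommend())
print(recommend(needs_citations=True))
print(recommend(needs_consistent_format_at_scale=True))
print(recommend(needs_fresh_or_private_knowledge=True, needs_smaller_faster_model=True))
```

Prompt engineering is the fallback when no flag is set, which mirrors the advice to start with prompting and escalate only when a concrete requirement demands it.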

Topics covered

fine-tuning LLMs, LoRA low-rank adaptation, QLoRA 4-bit quantization, RLHF reinforcement learning human feedback, instruction tuning LLM, parameter efficient fine-tuning PEFT, DPO direct preference optimization, supervised fine-tuning SFT, reward model LLM, PPO language model training, FLAN instruction dataset, Alpaca fine-tuning, fine-tuning vs RAG, catastrophic forgetting LLM, LoRA rank decomposition, NF4 quantization, alignment fine-tuning, LLM adaptation strategies, fine-tune GPT
