Why Database Design Decisions Are Hard to Undo

Application code is easy to refactor. Database schema changes in production are painful — they require migrations that may lock tables, break existing queries, and require coordinated deploys. Getting the schema right, the indexes right, and the normalization level right from the start saves enormous effort. This guide covers the core patterns that experienced database engineers apply on every project.

Normalization: The Levels Explained

Normalization is the process of organizing a database to eliminate data redundancy and update anomalies. Each normal form builds on the previous:

First Normal Form (1NF)

A table is in 1NF if every column contains atomic (indivisible) values and every row is unique.

-- Violates 1NF: multiple values in one column
CREATE TABLE orders (
  id      SERIAL PRIMARY KEY,
  items   TEXT  -- "item1, item2, item3" — NOT atomic
);

-- 1NF compliant: separate table for multi-valued attribute
CREATE TABLE orders (id SERIAL PRIMARY KEY);
CREATE TABLE order_items (
  id        SERIAL PRIMARY KEY,
  order_id  INT REFERENCES orders(id),
  item_name TEXT NOT NULL
);

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and every non-key column is fully dependent on the entire primary key (no partial dependencies). This applies only to tables with composite primary keys.

-- Violates 2NF: price depends only on product_id, not on (order_id, product_id)
CREATE TABLE order_items (
  order_id    INT,
  product_id  INT,
  quantity    INT,
  price       DECIMAL,  -- ← partial dependency: depends only on product_id
  PRIMARY KEY (order_id, product_id)
);

-- 2NF compliant: move price to products table
CREATE TABLE products (id SERIAL PRIMARY KEY, name TEXT, price DECIMAL);
CREATE TABLE order_items (
  order_id    INT REFERENCES orders(id),
  product_id  INT REFERENCES products(id),
  quantity    INT NOT NULL,
  PRIMARY KEY (order_id, product_id)
);

Third Normal Form (3NF)

A table is in 3NF if it is in 2NF and every non-key column depends directly on the primary key — not on another non-key column (no transitive dependencies).

-- Violates 3NF: zip_code determines city and state (transitive dependency)
CREATE TABLE employees (
  id        SERIAL PRIMARY KEY,
  name      TEXT,
  zip_code  CHAR(5),
  city      TEXT,    -- ← depends on zip_code, not id
  state     CHAR(2)  -- ← depends on zip_code, not id
);

-- 3NF compliant
CREATE TABLE zip_codes (zip CHAR(5) PRIMARY KEY, city TEXT, state CHAR(2));
CREATE TABLE employees (
  id        SERIAL PRIMARY KEY,
  name      TEXT,
  zip_code  CHAR(5) REFERENCES zip_codes(zip)
);

Boyce-Codd Normal Form (BCNF)

BCNF is a stricter version of 3NF. A table is in BCNF if for every functional dependency X → Y, X is a superkey. BCNF violations are rare in practice but occur when a table has overlapping candidate keys. In most production systems, 3NF is sufficient — over-normalization creates too many JOINs.

Indexing Strategies

B-Tree Indexes (the Default)

PostgreSQL's default index type. A B-tree is a self-balancing tree that supports equality (=), range (<, >, BETWEEN), and prefix matching (LIKE 'prefix%'). Use for most columns used in WHERE clauses, JOIN conditions, and ORDER BY.

-- Single column index
CREATE INDEX idx_users_email ON users(email);

-- Composite index — order matters! Most selective column first
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- This index is used for: WHERE user_id = ? AND status = ?
--   and also for:         WHERE user_id = ?
-- NOT used efficiently for: WHERE status = ? (missing leading column)

Partial Indexes

Index only the rows that match a condition. Dramatically smaller and faster when most rows would never be queried:

-- Only index active users — inactive users are never searched in the app
CREATE INDEX idx_users_email_active ON users(email) WHERE is_active = true;

-- Index only unprocessed jobs — the queue is always queried with status = 'pending'
CREATE INDEX idx_jobs_pending ON jobs(created_at) WHERE status = 'pending';

Covering Indexes

A covering index includes all columns a query needs, so PostgreSQL can satisfy the query from the index alone without reading the heap (the actual table rows). This is called an "index-only scan" and is significantly faster for read-heavy queries:

-- Query: SELECT email, name FROM users WHERE email = ?
-- A standard index on email requires a table row lookup for 'name'
-- A covering index eliminates the row lookup:
CREATE INDEX idx_users_email_covering ON users(email) INCLUDE (name, created_at);

Query Optimization with EXPLAIN ANALYZE

EXPLAIN ANALYZE shows how PostgreSQL actually executes a query, including row estimates, actual row counts, and timing for each operation. It is the primary tool for query optimization.

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.id;

-- Key terms to understand in EXPLAIN output:
-- Seq Scan        — reading entire table, no index used
-- Index Scan      — using an index to find rows, then reading table for data
-- Index Only Scan — satisfied entirely from index (fastest)
-- Hash Join       — join via hash table (good for large result sets)
-- Nested Loop     — join via loop (good for small result sets, indexed join column)
-- rows=XXXX       — PostgreSQL's estimate (if far off actual, run ANALYZE users)
-- cost=0.00..XXXX — arbitrary cost units, lower is better

Common fixes when EXPLAIN reveals problems:

  • Seq Scan on a large table — Add an index on the WHERE column
  • Estimated rows far from actual — Run ANALYZE table_name to update statistics
  • Slow ORDER BY — Add an index on the ORDER BY column(s)
  • Slow aggregate — Add a partial index that pre-filters the GROUP BY subset

The N+1 Problem

The N+1 problem is a common ORM pitfall: you run 1 query to get N records, then 1 query per record to get related data — N+1 queries total. At small scale it is invisible; at production scale it is catastrophic:

// N+1 problem in an ORM (pseudocode)
const posts = await db.posts.findMany()  // 1 query → 100 posts
for (const post of posts) {
  post.author = await db.users.findOne(post.userId)  // 100 queries — N+1!
}

// Fix: JOIN / include the relation in one query
const posts = await db.posts.findMany({
  include: { author: true }  // Prisma: SELECT posts.*, users.* JOIN users
})

// Fix with raw SQL: explicit JOIN
SELECT p.*, u.name as author_name, u.email as author_email
FROM posts p
JOIN users u ON u.id = p.user_id
ORDER BY p.created_at DESC;

ORMs like Prisma and TypeORM handle this with include / eager loading. In raw SQL, always JOINs. In GraphQL, the DataLoader pattern batches N individual lookups into 1 SQL query using SELECT WHERE id IN (...).

PostgreSQL vs MongoDB: Choosing Correctly

CriterionPostgreSQLMongoDB
Data modelRelational (tables, rows, JOINs)Document (JSON collections)
SchemaEnforced (migrations required)Flexible (schema optional)
ACID transactionsFull ACIDMulti-document (4.0+)
JoinsEfficient (first-class)$lookup (less efficient)
Query languageSQLMQL (MongoDB Query Language)
ScalingVertical + read replicas + shardingBuilt-in horizontal sharding
Best forFinancial, ERP, relational dataCatalogs, logs, user content

PostgreSQL's JSONB column type blurs this line — you can store arbitrary JSON documents in PostgreSQL and query them with GIN indexes. For most applications, PostgreSQL with JSONB for semi-structured data is preferable to running both databases.

Connection Pooling

PostgreSQL opens a full OS thread per connection (~5 MB RAM). An application with 100 serverless functions each opening their own connection can exhaust the database's connection limit. Connection poolers like PgBouncer sit between your app and Postgres, reusing a small pool of connections:

# PgBouncer modes:
# Session pooling — one server connection per client session (least efficient)
# Transaction pooling — server connection held only during a transaction (best for serverless)
# Statement pooling — server connection per statement (most efficient, but limited prepared statement support)

# In your connection string (Supabase provides a pooler on port 6543):
# Direct: postgresql://user:pass@db.supabase.co:5432/postgres
# Pooled: postgresql://user:pass@db.supabase.co:6543/postgres?pgbouncer=true

For serverless environments (Vercel Functions, AWS Lambda), always use the pooler. Each function invocation may create a new connection; without pooling, a traffic spike creates thousands of PostgreSQL connections and crashes the database.