How do I decide which columns to index?

Index columns that appear in WHERE clauses, JOIN conditions, and ORDER BY clauses for queries that run frequently and on large tables. Key factors: (1) Selectivity — an index on a column with high cardinality (many distinct values, like email or user_id) is much more useful than one on a low-cardinality column (like status with 3 values — a table scan may be faster). (2) Query frequency — index to benefit your most common queries, not hypothetical ones. (3) Write cost — every index slows down INSERT/UPDATE/DELETE by a small amount. Do not add indexes speculatively. Use EXPLAIN ANALYZE on your actual slow queries to identify missing indexes.

When should I denormalize my database schema?

Denormalization (intentionally adding redundancy) is appropriate when: (1) A JOIN is on the critical path of a very high-traffic query and profiling shows it is a genuine bottleneck. (2) You need to pre-compute aggregates (total_order_count on a users table) because computing them at query time is too slow. (3) Your read-to-write ratio is very high and you can afford to update redundant data on writes. Before denormalizing, exhaust all index optimization options — partial indexes, covering indexes, materialized views. Denormalization trades write complexity and potential data inconsistency for read speed. Always have an explicit strategy for keeping redundant data in sync.

What is the difference between INNER JOIN, LEFT JOIN, and CROSS JOIN?

INNER JOIN returns only rows where the join condition matches in both tables — unmatched rows from either side are excluded. LEFT JOIN (or LEFT OUTER JOIN) returns all rows from the left table plus matching rows from the right table; where there is no match, the right table columns are NULL. This is used to retrieve records with optional related data (all users, including those with no orders). RIGHT JOIN is the mirror. CROSS JOIN returns the Cartesian product — every row from the left table paired with every row from the right (m × n rows). CROSS JOINs are rarely useful in application queries but appear in calendar generation and combination problems.

What is a materialized view and when should I use it?

A materialized view is a pre-computed query result stored as a physical table. Unlike a regular view (which re-executes the query on every access), a materialized view is fast to read but requires explicit refreshing when the underlying data changes. Use materialized views for: expensive aggregation queries (total sales by category, user activity summaries) that are queried frequently but the underlying data changes infrequently. In PostgreSQL: CREATE MATERIALIZED VIEW monthly_revenue AS SELECT DATE_TRUNC("month", created_at), SUM(amount) FROM orders GROUP BY 1. Refresh with REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue (CONCURRENTLY allows reads during refresh). Schedule the refresh in a cron job or trigger it after significant data updates.

How does connection pooling work and why is it critical for serverless?

PostgreSQL creates a dedicated OS process (~5 MB RAM each) for every client connection. Applications with many short-lived requests — especially serverless functions that spin up and down — can create hundreds of simultaneous connections during a traffic spike, exhausting the server limit (default max_connections is 100–200). A connection pooler like PgBouncer maintains a small pool of long-lived database connections and routes incoming application connections to pool members. In transaction mode, a database connection is held only for the duration of a single transaction, then returned to the pool. For Supabase users, use port 6543 (the Supabase Vitest/PgBouncer endpoint) instead of 5432 for all serverless deployments. Without pooling, a moderate traffic spike can cause "too many connections" errors and a database outage.

← Software Engineering Studio

Engineering·9 min read·June 21, 2026

🗄️ Database Design Patterns: SQL, NoSQL, and NewSQL

A practical guide to database design patterns: normalization (1NF through BCNF), indexing strategies (B-tree, hash, composite), query optimization with EXPLAIN, the N+1 problem, PostgreSQL vs MongoDB use cases, and connection pooling.

Why Database Design Decisions Are Hard to Undo

Application code is easy to refactor. Database schema changes in production are painful — they require migrations that may lock tables, break existing queries, and require coordinated deploys. Getting the schema right, the indexes right, and the normalization level right from the start saves enormous effort. This guide covers the core patterns that experienced database engineers apply on every project.

Normalization: The Levels Explained

Normalization is the process of organizing a database to eliminate data redundancy and update anomalies. Each normal form builds on the previous:

First Normal Form (1NF)

A table is in 1NF if every column contains atomic (indivisible) values and every row is unique.

-- Violates 1NF: multiple values in one column
CREATE TABLE orders (
  id      SERIAL PRIMARY KEY,
  items   TEXT  -- "item1, item2, item3" — NOT atomic
);

-- 1NF compliant: separate table for multi-valued attribute
CREATE TABLE orders (id SERIAL PRIMARY KEY);
CREATE TABLE order_items (
  id        SERIAL PRIMARY KEY,
  order_id  INT REFERENCES orders(id),
  item_name TEXT NOT NULL
);

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and every non-key column is fully dependent on the entire primary key (no partial dependencies). This applies only to tables with composite primary keys.

-- Violates 2NF: price depends only on product_id, not on (order_id, product_id)
CREATE TABLE order_items (
  order_id    INT,
  product_id  INT,
  quantity    INT,
  price       DECIMAL,  -- ← partial dependency: depends only on product_id
  PRIMARY KEY (order_id, product_id)
);

-- 2NF compliant: move price to products table
CREATE TABLE products (id SERIAL PRIMARY KEY, name TEXT, price DECIMAL);
CREATE TABLE order_items (
  order_id    INT REFERENCES orders(id),
  product_id  INT REFERENCES products(id),
  quantity    INT NOT NULL,
  PRIMARY KEY (order_id, product_id)
);

Third Normal Form (3NF)

A table is in 3NF if it is in 2NF and every non-key column depends directly on the primary key — not on another non-key column (no transitive dependencies).

-- Violates 3NF: zip_code determines city and state (transitive dependency)
CREATE TABLE employees (
  id        SERIAL PRIMARY KEY,
  name      TEXT,
  zip_code  CHAR(5),
  city      TEXT,    -- ← depends on zip_code, not id
  state     CHAR(2)  -- ← depends on zip_code, not id
);

-- 3NF compliant
CREATE TABLE zip_codes (zip CHAR(5) PRIMARY KEY, city TEXT, state CHAR(2));
CREATE TABLE employees (
  id        SERIAL PRIMARY KEY,
  name      TEXT,
  zip_code  CHAR(5) REFERENCES zip_codes(zip)
);

Boyce-Codd Normal Form (BCNF)

BCNF is a stricter version of 3NF. A table is in BCNF if for every functional dependency X → Y, X is a superkey. BCNF violations are rare in practice but occur when a table has overlapping candidate keys. In most production systems, 3NF is sufficient — over-normalization creates too many JOINs.

Indexing Strategies

B-Tree Indexes (the Default)

PostgreSQL's default index type. A B-tree is a self-balancing tree that supports equality (=), range (<, >, BETWEEN), and prefix matching (LIKE 'prefix%'). Use for most columns used in WHERE clauses, JOIN conditions, and ORDER BY.

-- Single column index
CREATE INDEX idx_users_email ON users(email);

-- Composite index — order matters! Most selective column first
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- This index is used for: WHERE user_id = ? AND status = ?
--   and also for:         WHERE user_id = ?
-- NOT used efficiently for: WHERE status = ? (missing leading column)

Partial Indexes

Index only the rows that match a condition. Dramatically smaller and faster when most rows would never be queried:

-- Only index active users — inactive users are never searched in the app
CREATE INDEX idx_users_email_active ON users(email) WHERE is_active = true;

-- Index only unprocessed jobs — the queue is always queried with status = 'pending'
CREATE INDEX idx_jobs_pending ON jobs(created_at) WHERE status = 'pending';

Covering Indexes

A covering index includes all columns a query needs, so PostgreSQL can satisfy the query from the index alone without reading the heap (the actual table rows). This is called an "index-only scan" and is significantly faster for read-heavy queries:

-- Query: SELECT email, name FROM users WHERE email = ?
-- A standard index on email requires a table row lookup for 'name'
-- A covering index eliminates the row lookup:
CREATE INDEX idx_users_email_covering ON users(email) INCLUDE (name, created_at);

Query Optimization with EXPLAIN ANALYZE

EXPLAIN ANALYZE shows how PostgreSQL actually executes a query, including row estimates, actual row counts, and timing for each operation. It is the primary tool for query optimization.

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.id;

-- Key terms to understand in EXPLAIN output:
-- Seq Scan        — reading entire table, no index used
-- Index Scan      — using an index to find rows, then reading table for data
-- Index Only Scan — satisfied entirely from index (fastest)
-- Hash Join       — join via hash table (good for large result sets)
-- Nested Loop     — join via loop (good for small result sets, indexed join column)
-- rows=XXXX       — PostgreSQL's estimate (if far off actual, run ANALYZE users)
-- cost=0.00..XXXX — arbitrary cost units, lower is better

Common fixes when EXPLAIN reveals problems:

Seq Scan on a large table — Add an index on the WHERE column
Estimated rows far from actual — Run ANALYZE table_name to update statistics
Slow ORDER BY — Add an index on the ORDER BY column(s)
Slow aggregate — Add a partial index that pre-filters the GROUP BY subset

The N+1 Problem

The N+1 problem is a common ORM pitfall: you run 1 query to get N records, then 1 query per record to get related data — N+1 queries total. At small scale it is invisible; at production scale it is catastrophic:

// N+1 problem in an ORM (pseudocode)
const posts = await db.posts.findMany()  // 1 query → 100 posts
for (const post of posts) {
  post.author = await db.users.findOne(post.userId)  // 100 queries — N+1!
}

// Fix: JOIN / include the relation in one query
const posts = await db.posts.findMany({
  include: { author: true }  // Prisma: SELECT posts.*, users.* JOIN users
})

// Fix with raw SQL: explicit JOIN
SELECT p.*, u.name as author_name, u.email as author_email
FROM posts p
JOIN users u ON u.id = p.user_id
ORDER BY p.created_at DESC;

ORMs like Prisma and TypeORM handle this with include / eager loading. In raw SQL, always JOINs. In GraphQL, the DataLoader pattern batches N individual lookups into 1 SQL query using SELECT WHERE id IN (...).

PostgreSQL vs MongoDB: Choosing Correctly

Criterion	PostgreSQL	MongoDB
Data model	Relational (tables, rows, JOINs)	Document (JSON collections)
Schema	Enforced (migrations required)	Flexible (schema optional)
ACID transactions	Full ACID	Multi-document (4.0+)
Joins	Efficient (first-class)	$lookup (less efficient)
Query language	SQL	MQL (MongoDB Query Language)
Scaling	Vertical + read replicas + sharding	Built-in horizontal sharding
Best for	Financial, ERP, relational data	Catalogs, logs, user content

PostgreSQL's JSONB column type blurs this line — you can store arbitrary JSON documents in PostgreSQL and query them with GIN indexes. For most applications, PostgreSQL with JSONB for semi-structured data is preferable to running both databases.

Connection Pooling

PostgreSQL opens a full OS thread per connection (~5 MB RAM). An application with 100 serverless functions each opening their own connection can exhaust the database's connection limit. Connection poolers like PgBouncer sit between your app and Postgres, reusing a small pool of connections:

# PgBouncer modes:
# Session pooling — one server connection per client session (least efficient)
# Transaction pooling — server connection held only during a transaction (best for serverless)
# Statement pooling — server connection per statement (most efficient, but limited prepared statement support)

# In your connection string (Supabase provides a pooler on port 6543):
# Direct: postgresql://user:pass@db.supabase.co:5432/postgres
# Pooled: postgresql://user:pass@db.supabase.co:6543/postgres?pgbouncer=true

For serverless environments (Vercel Functions, AWS Lambda), always use the pooler. Each function invocation may create a new connection; without pooling, a traffic spike creates thousands of PostgreSQL connections and crashes the database.

Topics covered

database design patternsSQL normalization 1NF 2NF 3NFdatabase indexing strategiesB-tree index explainedcomposite index optimizationEXPLAIN query PostgreSQLN+1 problem databasePostgreSQL vs MongoDBconnection pooling PgBouncerdatabase query optimizationBCNF normal formcovering index SQLpartial index PostgreSQLindex selectivity databaseSQL joins performancedatabase schema designforeign key constraintsNewSQL distributed database

🛠️ Related Free Tools

Put this knowledge to work on your iPhone

Browse our full catalog of professional iOS apps — from electrical code tools to AI builders.

Browse All 95+ Apps