How do I calculate QPS (queries per second) in a system design interview?

Use the formula: QPS = (Daily Active Users × average requests per user per day) / 86,400 seconds. Then estimate peak QPS as 2–3x the average (traffic is not uniform — peak usually occurs during evening hours or news events). For example: 10M DAU × 5 reads/day / 86,400 = ~579 QPS average, ~1,750 QPS peak. Always state your assumptions explicitly. Interviewers are evaluating your estimation process as much as the number itself. Round aggressively (86,400 ≈ 10^5) to keep the math fast.

When should I use NoSQL instead of SQL in a system design?

Choose NoSQL when: (1) Write throughput is extreme (100,000+ writes/second) and horizontal scaling is required — SQL sharding is complex while DynamoDB or Cassandra scale horizontally by design. (2) Your access pattern is simple key-value or document lookups with no joins — NoSQL is optimized for these. (3) Schema is highly variable (different documents have different fields). (4) You need global geographic distribution with low-latency reads. Stick with SQL when: you need ACID transactions across multiple records (financial systems), complex queries with JOINs are required, your team is more familiar with relational modeling, or you are at a scale where SQL is sufficient (below ~50,000 QPS on a well-tuned instance with read replicas).

What is the CAP theorem and why does it matter for system design?

CAP theorem states that a distributed data store can guarantee at most two of three properties: Consistency (every read returns the most recent write), Availability (every request receives a non-error response), and Partition Tolerance (the system continues functioning when network partitions occur). Since network partitions are inevitable in any distributed system, the real trade-off is between Consistency and Availability. CP systems (PostgreSQL, HBase, ZooKeeper) prioritize correctness — they may refuse requests during a partition rather than serve stale data. AP systems (Cassandra, DynamoDB, CouchDB) stay available during partitions but may serve stale data. In system design interviews, justify your database choice by stating which trade-off your use case requires: financial systems need CP, social feeds can accept AP.

How would you design a system for 10 million users that needs to scale to 100 million?

Start with the simplest architecture that handles 10M users: a single primary database with read replicas behind a load balancer, Redis cache for hot data, a CDN for static assets. Calculate your QPS at 10M and 100M users. At 10M with typical SaaS usage (10 requests/day/user), that is ~1,160 QPS — manageable with a single well-tuned Postgres instance plus read replicas. At 100M users, you hit ~11,600 QPS, which still fits with aggressive caching and read replicas. If writes become the bottleneck, add database sharding. Introduce message queues for async work early. Plan for 100M from the start in your architecture, but do not over-engineer for it — deploy the simple version first and scale components as actual bottlenecks emerge.

What is consistent hashing and when is it used?

Consistent hashing is an algorithm that minimizes the number of key remappings when nodes are added or removed from a distributed system. In a regular hash (key % num_nodes), adding one node requires remapping ~50% of all keys. With consistent hashing, adding or removing a node remaps only key_count / num_nodes keys on average — typically a small fraction. The algorithm maps both keys and nodes to positions on a circular ring (0 to 2^32). Each key is assigned to the next node clockwise on the ring. Use consistent hashing for: distributed caches (deciding which cache server holds a given key), database sharding with dynamic cluster size, and distributed load balancing (directing user sessions to the same server for sticky sessions). Most distributed caches (Redis Cluster, Memcached) and databases (Cassandra, DynamoDB) use consistent hashing internally.

← Software Engineering Studio

Engineering·14 min read·June 20, 2026

🏗️ System Design Interview Guide: Architecture Patterns and Trade-offs

A complete system design interview guide covering the RESHADED framework, back-of-envelope estimation, load balancing, caching strategies, database scaling, message queues, CDN, and rate limiting — with worked examples for URL shortener and social feed designs.

The Interview Framework: What Interviewers Are Actually Testing

System design interviews are not tests of memorized architectures. They are evaluations of how you think under ambiguity, how you quantify trade-offs, and whether you can have a structured technical conversation. Interviewers want to see you clarify scope, estimate scale, justify decisions, and identify bottlenecks — not recite a pre-memorized "correct answer."

Use this framework for every interview, and make your process explicit as you go:

Clarify requirements (5 minutes) — Ask about scale, features, constraints, and what "success" looks like.
Estimate scale (5 minutes) — DAU, QPS, storage. This determines your architecture choices.
High-level design (10 minutes) — Sketch the major components and data flow.
Deep dive (15 minutes) — Go deep on the interviewer's area of interest (database schema, caching layer, API design).
Identify bottlenecks (5 minutes) — Where does your design fail at 10x scale? What would you change?

Back-of-Envelope Estimation

Scale estimation is not optional fluff — it determines whether you need a single Postgres instance or a globally distributed database. Interviewers are testing whether you can reason quantitatively about systems.

Key numbers to remember:

There are 86,400 seconds in a day (~10⁵ for estimation)
1 KB = 10³ bytes, 1 MB = 10⁶, 1 GB = 10⁹, 1 TB = 10¹²
A typical Postgres instance handles ~5,000–10,000 QPS (simple queries)
A Redis instance handles ~100,000–1,000,000 operations/second

QPS calculation formula:

QPS = (Daily Active Users × Average requests per user per day) / 86,400

Example: Twitter-like feed
- 100M DAU
- Each user makes 10 read requests/day (timeline, profile, etc.)
- Total: 100M × 10 / 86,400 = 11,574 QPS average
- Peak (assume 3x average): ~35,000 QPS

Storage example:
- 1M new posts per day
- Each post: ~1 KB (text + metadata)
- Storage/day: 1M × 1KB = 1 GB/day
- Per year: 365 GB ≈ 400 GB
- With media thumbnails (5 KB avg): 5 GB/day, 1.8 TB/year

Load Balancing

Load balancers distribute incoming requests across multiple server instances to prevent any single server from becoming a bottleneck. Two types:

Layer 4 (L4) load balancers operate at the TCP level — they route connections based on IP and port without reading the HTTP content. Fast and low-overhead, but cannot make routing decisions based on URL path or HTTP headers.

Layer 7 (L7) load balancers (e.g., NGINX, AWS ALB) read the HTTP request and can route based on URL path, host headers, cookies, or request content. Necessary for directing /api/ traffic to API servers and /static/ to a different origin.

Load balancing algorithms:

Round-robin — Each server gets requests in rotation. Simple, works well when servers are homogeneous.
Least connections — Sends each new request to the server with fewest active connections. Better for variable request durations.
IP hash — Routes a given client IP to the same server every time (sticky sessions without cookies). Use when sessions cannot be shared across servers.

Caching Strategies

Caching is the most impactful performance optimization in distributed systems. The three main patterns:

Cache-aside (lazy loading): The application checks the cache first. On a cache miss, it reads from the database and populates the cache for next time. Most common pattern — good for read-heavy workloads. Downside: cold start (first request always misses), potential for stale data.

// Cache-aside in pseudocode
async function getUser(id: string) {
  const cached = await redis.get(`user:${id}`)
  if (cached) return JSON.parse(cached)

  const user = await db.query('SELECT * FROM users WHERE id = $1', [id])
  await redis.setex(`user:${id}`, 3600, JSON.stringify(user)) // TTL: 1 hour
  return user
}

Write-through: Every database write also updates the cache synchronously. Data is always warm — no cold start. Downside: write latency increases (two writes per operation), and caches fill with data that may never be read.

Write-behind (write-back): Writes go to the cache first, and the cache asynchronously flushes to the database. Very fast writes, but risk of data loss if the cache node fails before the flush. Use only when some data loss is acceptable (e.g., view counters).

Cache eviction policies:

LRU (Least Recently Used) — Evict the item not accessed for the longest time. Best for general-purpose caches.
LFU (Least Frequently Used) — Evict the item accessed the fewest times. Better when some items are permanently popular (hot items).

Cache stampede prevention: When a cached item expires, thousands of requests may simultaneously hit the database. Solutions: probabilistic early expiration (refresh the cache before it expires), distributed locks (only one request fetches, others wait), or background refresh (separate process warms the cache before expiry).

Database Scaling

Read Replicas

The first scaling step for read-heavy applications: add read replicas. Writes go to the primary (master), which asynchronously replicates to replicas. Reads are distributed across replicas. Typical ratio: 1 primary to 3–5 replicas for a read-to-write ratio of 10:1. Caveat: replication lag means replicas may serve slightly stale data.

Vertical vs Horizontal Scaling

Vertical scaling (bigger machine: more CPU, RAM, faster disk) is simple — no code changes needed. It has a ceiling (the largest available machine) and has single points of failure. Horizontal scaling (more machines, each handling a subset of traffic) is theoretically unlimited but requires distributed coordination.

Sharding

Database sharding splits data across multiple database instances. Each shard holds a subset of the rows.

Range sharding: Shard 1 holds user IDs 1–1M, Shard 2 holds 1M–2M, etc. Simple to implement, but creates hot spots (new users always go to the latest shard).
Hash sharding: shard_id = hash(user_id) % num_shards. Distributes load evenly but makes range queries (users created this month) require querying all shards.

CAP Theorem

Distributed databases can guarantee at most two of three properties simultaneously:

Consistency — Every read returns the most recent write.
Availability — Every request gets a non-error response (though possibly stale).
Partition Tolerance — The system continues operating despite network partition between nodes.

Since network partitions are inevitable in distributed systems, you must choose between CP (consistent but may reject requests during partition) and AP (always available but may return stale data). PostgreSQL with synchronous replication is CP. Cassandra and DynamoDB are AP. There is no universal right answer — the choice depends on whether your application can tolerate stale reads (most social apps: yes) or not (banking: no).

SQL vs NoSQL: When to Use Each

Use SQL (PostgreSQL, MySQL) when: you have complex relational queries with multiple JOINs, ACID transactions are required (financial records, inventory), your data schema is stable and well-understood, or you need complex reporting and aggregation.

Use NoSQL when: your write throughput exceeds SQL scaling limits (~50,000+ QPS on a single shard), you need schema flexibility (varied document structures), your access pattern is simple key-value or document lookup with no joins, or you need global distribution with low-latency reads (DynamoDB Global Tables, Cosmos DB).

Message Queues

Message queues decouple producers and consumers, enabling asynchronous processing and resilience to downstream failures.

Primary use cases:

Async processing: User signs up → API returns 200 immediately → email welcome message sent in background by a worker consuming from the queue.
Fan-out: One event (post published) → multiple consumers (notification service, feed update service, analytics service) each receive a copy.
Load leveling: A sudden spike of 10,000 requests/second can be absorbed by the queue and processed at 1,000/second — smoothing the load on downstream services.

Kafka is a distributed log optimized for high-throughput streaming (millions of messages/second), durable message retention for replay, and fan-out to many consumer groups. Best for event streaming, analytics pipelines, and audit logs.

RabbitMQ is a traditional message broker optimized for task queues with complex routing, low-latency delivery, and acknowledgment-based reliability. Best for job queues, background tasks, and work distribution among competing workers.

SQS (AWS) is a managed service that requires zero infrastructure management. Good default for AWS-native architectures. Supports standard queues (at-least-once delivery, best-effort ordering) and FIFO queues (exactly-once, ordered).

Worked Example 1: URL Shortener

Requirements: 100M short URLs created per day, 10B redirect reads per day, URLs expire after 1 year, custom aliases optional.

Estimation:

Write QPS: 100M / 86,400 ≈ 1,160 writes/second
Read QPS:  10B / 86,400  ≈ 115,700 reads/second
Read:Write ratio = 100:1 — extremely read-heavy

Storage per URL: ~500 bytes (short URL, long URL, metadata, timestamps)
Storage per day: 100M × 500 bytes = 50 GB/day
Storage for 1 year: 50 GB × 365 ≈ 18 TB

Short URL generation: Use Base62 encoding (a-z, A-Z, 0-9 = 62 characters). A 7-character Base62 code gives 62⁷ = 3.5 trillion possible codes — sufficient for years. Generate a unique integer ID from the database (auto-increment or UUID converted to integer), then Base62 encode it. Avoid random generation to prevent collisions.

Architecture:

Write path: API server → generate Base62 ID → write to Postgres primary → also cache in Redis with TTL.
Read path: DNS → L7 load balancer → redirect service → check Redis cache first (hit rate should be ~95% for popular links) → on miss, query Postgres read replica → Redis cache write-through → return 301 or 302 redirect.
Use 302 (temporary) rather than 301 (permanent) if you want analytics — browsers cache 301 redirects and bypass your redirect service on subsequent visits.
Expiry: background job runs daily to delete expired URL rows and evict them from Redis using TTL.

Worked Example 2: Social Feed (Timeline)

The core design decision for a social feed is pull vs push (fan-out on read vs fan-out on write).

Fan-out on write (push model): When a user posts, the system immediately copies the post ID to every follower's feed table. Reading a timeline is O(1) — just SELECT from your personal feed. Writes are expensive: if a user has 1M followers, one post triggers 1M writes.

Fan-out on read (pull model): When a user requests their timeline, the system queries the posts of everyone they follow and merges/sorts them. No write fan-out cost. Reads are expensive: a user following 500 accounts requires 500 queries merged and sorted at read time.

The celebrity problem and hybrid solution: Pure fan-out-on-write breaks for accounts with millions of followers (Elon Musk posting triggers 100M writes). The production solution used by Twitter/X is a hybrid:

For regular users (below a follower threshold, e.g., 10,000 followers): fan-out on write. Most users, fast reads.
For celebrity accounts (above threshold): fan-out on read. When requesting a timeline, merge the pre-built feed with live fetches from followed celebrities.
The merge at read time only adds a small number of celebrity queries — manageable overhead.

CDN and Static Assets

Serve all static assets (JavaScript bundles, CSS, images, fonts) through a CDN (Content Delivery Network). A CDN caches your assets at edge locations geographically close to your users — a user in Tokyo gets your JS bundle from a Tokyo edge server, not your US-East origin. This reduces latency from 200ms+ to <20ms for asset loading.

Set aggressive cache-control headers for static assets with content-hashed filenames (main.a3f9c2.js): Cache-Control: public, max-age=31536000, immutable. For HTML files, use Cache-Control: no-cache so browsers always check for the latest HTML (which references the latest hashed asset URLs).

Rate Limiting

The token bucket algorithm is the most practical rate limiting approach. Each user has a bucket that holds up to N tokens (the burst limit). Tokens are added at a constant rate (the refill rate). Each request consumes one token. If the bucket is empty, the request is rejected with 429.

Implement in Redis using atomic operations:

-- Redis Lua script for token bucket (atomic)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local now = tonumber(ARGV[3])           -- current unix timestamp in ms

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Calculate tokens to add since last refill
local elapsed = (now - last_refill) / 1000
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= 1 then
  redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
  redis.call('EXPIRE', key, 3600)
  return 1  -- allowed
else
  return 0  -- rate limited
end

Topics covered

system design interviewsystem design guideback-of-envelope estimationCAP theorem explaineddatabase sharding strategiesload balancing algorithmsRedis caching patternsmessage queue Kafka vs RabbitMQURL shortener system designsocial feed architecturefan-out on write vs readconsistent hashingdatabase read replicasCDN architecturerate limiting token bucketsystem design trade-offsSQL vs NoSQL when to useQPS calculationdistributed systems patternsBASE vs ACID database

🛠️ Related Free Tools

Put this knowledge to work on your iPhone

Browse our full catalog of professional iOS apps — from electrical code tools to AI builders.

Browse All 95+ Apps