[Study Notes] Designing Systems that Scale

Jay Kim
4 min read · Apr 15, 2024


Horizontal Scaling

  • Scale in/out
  • More servers to maintain (including load balancer)
  • Practically infinite scalability
  • Web servers need to be stateless
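Because every web server is stateless, a load balancer can spread traffic with something as simple as round-robin. A minimal sketch (server names are made up for illustration):

```python
import itertools

# Hypothetical pool of stateless web servers (names are illustrative).
servers = ["web-1", "web-2", "web-3"]
_pool = itertools.cycle(servers)

def route(request):
    # Round-robin: each request goes to the next server in the cycle.
    # This only works because the servers are stateless -- any server
    # can handle any request.
    return next(_pool)

# The first three requests hit each server once, then the cycle repeats.
assignments = [route(f"req-{i}") for i in range(4)]
```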

Vertical Scaling

  • Scale up/down
  • Fewer servers to maintain
  • Limited resource capacity
  • Single point of failure

Failover Strategies

Cold Standby

  • Cheap
  • Needs periodic backup
  • There is downtime and possible data loss until a new instance is provisioned and the backup is restored

Warm Standby

  • Replication runs constantly, copying from the primary database host
  • Most databases have their own replication mechanisms built-in
  • Minimal downtime and data loss

Hot Standby

  • Web host simultaneously writes to primary and secondary database hosts (no replication)
  • Read operations can be distributed across both hosts (a limited form of horizontal scaling)
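The dual-write pattern above can be sketched like this (plain dicts stand in for the two database hosts):

```python
# Plain dicts stand in for the primary and secondary database hosts.
primary = {}
secondary = {}

def write(key, value):
    # The application itself writes to both hosts -- no replication layer,
    # so the secondary is always as current as the primary.
    primary[key] = value
    secondary[key] = value

def read(key, prefer_secondary=False):
    # Reads can be spread across both hosts, a limited form of
    # horizontal scaling for read traffic.
    db = secondary if prefer_secondary else primary
    return db[key]

write("user:1", "Jay")
```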

Horizontal Scaling of Databases


  • Shard: a horizontal partition of a database
  • Each shard might have its own backup host running replication in the background
  • Requests can be routed to specific partitions (increased scalability and resiliency)
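A simple way to route a request to a partition is to hash the key. A minimal sketch (the shard count is an assumption for illustration):

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only

def shard_for(key: str) -> int:
    # Use a stable hash (not Python's built-in hash(), which is
    # randomized per process) so the same key always maps to the
    # same shard across restarts.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```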


Sharded databases are sometimes called NoSQL databases.


  • Works best with simple key/value lookups
  • Formal schemas may not be needed
  • Tough to do joins across shards
  • Examples: MongoDB, DynamoDB, Cassandra, HBase


MongoDB (mongos)

  • mongos instances route queries and write operations to the shards in a sharded cluster. They provide the only interface to a sharded cluster, so applications should never connect directly to the shards.
  • mongos tracks which data lives on which shard by caching the metadata from the config servers, and uses that metadata to route operations from applications and clients to the mongod instances.


Ring Architecture

  • Eliminates the single point of failure by having a ring of nodes (shards)
  • Any one of these nodes can act as the primary interface point
  • Data needs to be replicated to all nodes (eventual consistency)
  • Nodes can be added or removed with no downtime
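The ring above is commonly built with consistent hashing, so adding or removing a node only remaps the keys adjacent to it. A minimal sketch (no virtual nodes; node names are made up):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: a key belongs to the first node
    at or after its position on the ring (wrapping around)."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _pos(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Adding a node only takes over keys between its predecessor
        # and itself; all other keys keep their owner.
        bisect.insort(self._ring, (self._pos(node), node))

    def remove(self, node):
        # Removing a node only remaps the keys it owned.
        self._ring.remove((self._pos(node), node))

    def node_for(self, key):
        idx = bisect.bisect(self._ring, (self._pos(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```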

Data Lakes

Another strategy is to put raw data into a distributed storage system (e.g., Amazon S3). This is a common approach for "big data" and unstructured data. Another process (e.g., AWS Glue) creates a schema for that data, and cloud services (e.g., Amazon Athena, Amazon Redshift) can then be used to query it. Partitioning the raw data may be required for good performance; when partitioning, think about how the end user will be querying the data.
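Date-based, Hive-style partition prefixes are one common layout that query engines such as Athena can prune. A sketch (the bucket and dataset names are made up):

```python
from datetime import date

def partition_prefix(bucket: str, dataset: str, day: date) -> str:
    # Hive-style year=/month=/day= layout; engines like Athena can
    # prune partitions so a date-filtered query scans only the
    # matching prefixes.
    return (f"s3://{bucket}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

prefix = partition_prefix("my-data-lake", "clickstream", date(2024, 4, 15))
```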

ACID Compliance

  • Atomicity: Either the entire transaction succeeds or fails
  • Consistency: All database rules are enforced or the entire transaction is rolled back
  • Isolation: No transaction is affected by any other transaction that is in progress
  • Durability: Once a transaction is committed, it stays even if the system crashes immediately after
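Atomicity can be demonstrated with SQLite's transaction support: a simulated crash mid-transfer rolls back the debit, so the money never half-moves. A small sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; any exception triggers a rollback
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Simulated crash before the matching credit to bob ever runs.
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass

# The debit was rolled back: the transaction failed as a whole.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```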

The CAP Theorem

You can have at most two of the three, not all of them. Understand the requirements for scalability, consistency, and availability before proposing a specific database solution.

  • Consistency: Every read receives the most recent write or an error (single-master designs favor consistency and partition tolerance)
  • Availability: Every request receives a non-error response without the guarantee that it contains the most recent write
  • Partition Tolerance: System continues to operate despite any number of communication breakdowns between nodes


Caching

Discussion Points

  • Expiration policy
  • Hotspots problem (the “celebrity” problem)
  • Cold-start problem: How to warm up the cache initially to prevent DB failures?

Eviction Policies

  • LRU: Least Recently Used
  • LFU: Least Frequently Used
  • FIFO: First In First Out

LRU Implementation Using a Doubly Linked List
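A sketch of that LRU design: a hash map gives O(1) lookup, while a doubly linked list keeps entries in recency order so eviction from the tail is also O(1):

```python
class Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    """LRU cache: dict for O(1) lookup + doubly linked list in
    recency order (head = most recent, tail = least recent)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.map = {}
        self.head = Node()  # sentinel on the most-recently-used side
        self.tail = Node()  # sentinel on the least-recently-used side
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        if key not in self.map:
            return None
        node = self.map[key]
        self._unlink(node)       # it was just used: move to the front
        self._push_front(node)
        return node.value

    def put(self, key, value):
        if key in self.map:
            self._unlink(self.map.pop(key))
        node = Node(key, value)
        self.map[key] = node
        self._push_front(node)
        if len(self.map) > self.capacity:
            lru = self.tail.prev  # least recently used entry
            self._unlink(lru)
            del self.map[lru.key]
```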

Caching Technologies


Memcached

  • In-memory key/value store
  • Open source


Redis

  • More features: snapshots, replication, transactions, pub/sub
  • Advanced data structures


Amazon ElastiCache

  • Fully managed Redis or Memcached solution on AWS


Others

  • NCache, Ehcache, etc.

Content Delivery Networks (CDNs)

  • Geographically distributed fleet of servers
  • Local hosting of HTML, CSS, JavaScript, Images, etc
  • Useful for static content or some limited computation
  • Costs can grow quickly depending on where the edge locations are deployed (think about what actually needs to be hosted on the CDN)


Resilience

  • Secondaries should be spread across multiple availability zones and regions
  • System needs enough capacity to survive a failure at any reasonable scale (over-provisioning)
  • Think about balance between budget and availability

Distributed Storage Solutions

  • Cloud: Amazon S3, Google Cloud Storage, Microsoft Azure
  • Self-Hosting: Hadoop HDFS

Example of Distributed Storage Solution: HDFS Architecture

  • Files are broken up into blocks
  • Rack-aware replication
  • Writes get replicated across different racks
  • For high availability, there may be three or more NameNodes and a highly available data store for metadata
  • Clients try to read from the nearest replica
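The rack-aware replication above can be sketched roughly along the lines of HDFS's default policy: first replica on the writer's rack, the other two together on a different rack, so a whole-rack failure never loses all copies. The topology here is made up:

```python
import random

# Made-up cluster topology: rack name -> DataNodes on that rack.
topology = {
    "rack-1": ["dn-1", "dn-2", "dn-3"],
    "rack-2": ["dn-4", "dn-5", "dn-6"],
}

def place_replicas(writer_rack, rng=random):
    # First replica on the writer's rack (fast local write); the other
    # two together on one different rack (survives a whole-rack failure).
    local = rng.choice(topology[writer_rack])
    remote_rack = rng.choice([r for r in topology if r != writer_rack])
    remotes = rng.sample(topology[remote_rack], 2)
    return [local] + remotes

replicas = place_replicas("rack-1")
```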