Scalability
Horizontal Scaling
- Scale in/out
- More servers to maintain (including load balancer)
- Practically infinite scalability
- Web servers need to be stateless
Vertical Scaling
- Scale up/down
- Fewer servers to maintain
- Limited resource capacity
- Single point of failure
Failover Strategies
Cold Standby
- Cheap
- Needs periodic backup
- There is downtime and possible data loss until new instance is provisioned and backup is restored
Warm Standby
- Replication is up constantly copying from primary database host
- Most databases have their own replication mechanisms built-in
- Minimal downtime and data loss
Hot Standby
- Web host simultaneously writes to primary and secondary database hosts (no replication)
- Read operations can be distributed (kind of a horizontal scaling)
Horizontal Scaling of Databases
Sharding
- Shard: a horizontal partition of a database
- Each shard might have its own backup host running replication in the background
- Possible to route a given request to a given partition (increased scalability and resiliency)
NoSQL
Shaded databases are sometimes called NoSQL
.
Characteristics
- Works best with simple key/value lookups
- Formal schemas may not be needed
- Tough to do joins across shards
- Examples: MongoDB, DynamoDB, Cassandra, HBase
MongoDB
mongos
instances route queries and write operations to shards in a sharded cluster. it provides only interface to a sharded cluster so applications should never connect directly with the shards.mongos
tracks what data is on which shard by caching the metadata from config servers. It uses metadata to route operations from applications and clients to themongod
instances.
Cassandra
- Solves single point of failure by having a ring of nodes (shards)
- Any one of these nodes can act as the primary interface point
- Data needs to be replicated to all nodes (eventual consistency)
- Nodes can be added or removed with no downtime
Data Lakes
Another strategy is to put raw data into a distributed storage system (e.g Amazon S3). This is a common approach for “big data” and unstructured data. Another process (e.g. Amazon Glue) creates a schema for that data. Some cloud-based features (e.g. Amazon Athena, Amazon Redshift) can be used to query the data. Partitioning the raw data for best performance may be required. For partitioning, think about how the end-user will be using the data or querying data.
ACID Compliance
- Atomicity: Either the entire transaction succeeds or fails
- Consistency: All database rules are enforced or the entire transaction is rolled back
- Isolation: No transaction is affected by any other transaction that is in progress
- Durability: Once a transaction is committed, it stays even if the system crashes immediately after
The CAP Theorem
You can have any two of them but not all three. Understand the requirements about scalability, consistency, and availability before proposing a specific database solution.
- Consistency: Every read receives the most recent write or an error (single-master designs favor consistency and partition tolerance)
- Availability: Every request receives a non-error response without the guarantee that it contains the most recent write
- Partition Tolerance: System continues to operate despite any number of communication breakdowns between nodes
Caching
Discussion points
- Expiration policy
- Hotspots problem (the “celebrity” problem)
- Cold-start problem: How to warm up the cache initially to prvent DB failures?
Eviction Policies
- LRU: Least Recently Used
- LFU: Least Frequently Used
- FIFO: First In First Out
Caching Technologies
Memcached
- In-memory key/value store
- open source
Redis
- More Features: snapshots, replication, transactions, pub/sub
- Advanced data structures
ElasticCache
- Fully-managed Redis or Memcached AWS solution
Others
- Ncache, Ehcache, etc
Content Delivery Networks (CDNs)
- Geographically distributed fleet of servers
- Local hosting of HTML, CSS, JavaScript, Images, etc
- Useful for static content or some limited computation
- Cost can be expensive quickly depending on the where the edge locations are implemented (Think about what needs to be hosted on CDN)
Resiliency
- Secondaries should be spread across multiple availability zones and regions
- System needs enough capacity to survive a failure at any reasonable scale (over-provisioning)
- Think about balance between budget and availability
Distributed Storage Solutions
- Cloud: Amazon S3, Google Cloud Storage, Microsoft Azure
- Self-Hosting: Hadoop HDFS
Example of Distributed Storage Solution: HDFS Architecture
- Files are broken up into blocks
- Rack-aware replication
- Writes get replicated across different racks
- For high availability, there maybe 3 or more NameNodes and highly available data store for metadata
- Clients try to read from the nearest replica