Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service
5 minute read ∼ Filed in: A paper note
Introduction
Amazon DynamoDB is a NoSQL cloud database service that supports fast and predictable performance at any scale.
Key features:
- Fully managed cloud service: resource provisioning, failure recovery, data encryption, software upgrades, and backups are all handled by the service.
- Multi-tenant architecture with high resource utilization.
- Boundless scalability: tables can grow without predefined limits.
- Predictable latency: DynamoDB keeps latencies stable through distributed data placement (automatic repartitioning) and request-routing algorithms.
- High availability: data is replicated across multiple data centers and re-replicated when a failure occurs.
- Flexible use cases: a flexible data model (item attributes can take various types) and a user-chosen consistency model (strong or eventual).
Architecture
Data model:
- A table is a collection of items, and each item has multiple attributes, including the primary-key schema: a partition key (used for hash partitioning) and an optional sort key (used to order items within a partition).
- Tables also support multiple secondary indexes and transactions across items.
- Each partition is replicated; the replication group uses Multi-Paxos for leader election and consensus, providing availability and strong consistency.
- A lease mechanism prevents split-brain scenarios, ensuring that there is only one active leader at a time.
- A replication group contains storage replicas (each holding a B-tree and a write-ahead log) and can also contain log replicas (holding only recent write-ahead log entries); log replicas improve availability and durability (see the sketch after this list).
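A minimal sketch of the two replica roles, using plain Python containers as stand-ins for the real B-tree and write-ahead log (the class and method names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class LogReplica:
    """Holds only recent write-ahead log entries (no B-tree)."""
    wal: list = field(default_factory=list)

    def apply(self, entry):
        self.wal.append(entry)  # durable once the log entry is persisted

@dataclass
class StorageReplica:
    """Holds the write-ahead log plus a B-tree of the items themselves."""
    wal: list = field(default_factory=list)
    btree: dict = field(default_factory=dict)  # dict as a B-tree stand-in

    def apply(self, entry):
        self.wal.append(entry)
        key, value = entry
        self.btree[key] = value  # also apply the write to the index

# A write is acknowledged once a quorum of replicas has persisted the log
# entry; because a log replica only needs the recent WAL, adding one to a
# degraded group is much faster than copying a full B-tree.
```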
The system is composed of many microservices, including a metadata service, a request-routing service, storage nodes, and an autoadmin service.
Partitions
DynamoDB introduced an internal abstraction, partitions, as a way to dynamically scale both the capacity and performance of tables. Customers specify the throughput of each table in terms of read capacity units (RCUs) and write capacity units (WCUs). Each partition has a maximum provisioned throughput (e.g., 1000 WCUs), so a table is split into enough partitions that each partition's allocation stays below this limit.
However, access to data is usually not uniformly distributed, so statically splitting a table's throughput across partitions creates two problems:
- Hot partitions: a single partition can become hot, with its access rate exceeding the maximum throughput allocated to it.
- Throughput dilution: when a partition is split, its provisioned throughput is divided among the children, so each child's throughput is lower than the original partition's (a toy example follows this list).
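A toy calculation of the dilution effect (the numbers are made up for illustration):

```python
# A table provisioned at 1000 WCUs initially sits on one partition.
table_wcu = 1000
per_partition = table_wcu / 1       # 1000 WCUs available to any key

# The partition is split (e.g., because it grew too large); the
# provisioned throughput is divided evenly between the two children.
per_partition = table_wcu / 2       # 500 WCUs per child

# A workload concentrated on keys in one child can now consume at most
# 500 WCUs, even though the table as a whole is still provisioned at 1000.
```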
Solutions:
- Bursting handles short-lived spikes: unused capacity is retained so it can absorb later bursts of throughput. Admission control is token-based: each partition has two token buckets (allocated and burst), and each storage node also has a bucket. Once a partition has exhausted its provisioned tokens, requests may burst only if both burst tokens and node-level tokens are available (see the sketch after this list).
- Adaptive capacity handles long-lived spikes: a proportional control algorithm increases the allocated throughput of a hot partition.
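A hedged sketch of the token-based admission check described above; the token-bucket implementation and parameters are illustrative, not DynamoDB's actual mechanism:

```python
import time

class TokenBucket:
    """Refills at `rate` tokens/second, capped at `cap` tokens."""
    def __init__(self, rate: float, cap: float):
        self.rate, self.cap = rate, cap
        self.tokens = cap
        self.last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        self.tokens = min(self.cap, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def take(self, n: float) -> bool:
        self.refill()
        if self.tokens < n:
            return False
        self.tokens -= n
        return True

def admit(allocated: TokenBucket, burst: TokenBucket,
          node: TokenBucket, cost: float = 1.0) -> bool:
    # Within the partition's provisioned throughput: always admitted.
    if allocated.take(cost):
        return True
    # Bursting: allowed only if BOTH the partition's burst bucket and
    # the storage node's bucket can cover the request.
    burst.refill()
    node.refill()
    if burst.tokens >= cost and node.tokens >= cost:
        burst.take(cost)
        node.take(cost)
        return True
    return False
```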
Later, both mechanisms were subsumed by global admission control (GAC), which handles short- and long-lived spikes alike: the GAC service tracks the total consumption of a table in terms of tokens (a sketch follows).
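A rough sketch of the GAC idea: request routers spend from a small local pool of tokens and replenish it from a GAC service that tracks table-level consumption. The service interface and quota logic below are invented for illustration (a real GAC would also expire old grants over time):

```python
class GacService:
    """Tracks a table's total token grants against its overall rate."""
    def __init__(self, table_rate: float):
        self.table_rate = table_rate
        self.granted = 0.0   # simplification: real grants would expire

    def request_tokens(self, want: float) -> float:
        grant = min(want, max(0.0, self.table_rate - self.granted))
        self.granted += grant
        return grant

class RouterBucket:
    """A request router's local token pool for one table."""
    def __init__(self, gac: GacService, batch: float = 100.0):
        self.gac, self.batch, self.tokens = gac, batch, 0.0

    def admit(self, cost: float = 1.0) -> bool:
        if self.tokens < cost:                      # pool is low: top it up
            self.tokens += self.gac.request_tokens(self.batch)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                                # table-level cap reached
```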
If the workload is skewed toward a specific set of items, the system automatically scales out partitions based on consumed throughput: once a partition's consumed throughput crosses a threshold, it is split for consumption (see the sketch below).
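One plausible way to pick the split point based on consumed throughput rather than the keyspace midpoint; the `usage` input and the balancing heuristic are assumptions, not the paper's exact algorithm:

```python
def choose_split_key(usage: dict):
    """usage maps key -> throughput consumed over a recent window."""
    if not usage:
        return None
    total = sum(usage.values())
    running = 0.0
    for key in sorted(usage):
        running += usage[key]
        if running >= total / 2:
            return key   # split here so each child serves ~half the load
    return max(usage)    # unreachable in practice; keeps the function total
```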
For customers migrating data from other databases to DynamoDB, who often cannot estimate capacity up front, DynamoDB offers on-demand tables, which provision capacity automatically based on the observed read and write signals.
Durability
DynamoDB is designed to handle failures caused by hardware faults and software bugs. Write-ahead logs are central to durability and crash recovery and are replicated across all replicas; for higher durability, they are also periodically archived to S3. When a replica fails due to a memory or disk issue, the leader adds a log replica to the group so it heals quickly (no B-tree needs to be copied).
As for silent data errors (incorrect data stored or returned due to hardware faults), they require detection, which is hard; DynamoDB uses checksums to verify the integrity of every data transfer between two nodes.
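A minimal illustration of checksummed transfer, using CRC32 as a stand-in for whatever checksum the system actually uses:

```python
import zlib

def send_block(payload: bytes) -> tuple[bytes, int]:
    # Sender attaches a checksum computed over the payload.
    return payload, zlib.crc32(payload)

def receive_block(payload: bytes, checksum: int) -> bytes:
    # Receiver verifies before accepting, catching silent corruption
    # introduced anywhere along the transfer path.
    if zlib.crc32(payload) != checksum:
        raise IOError("checksum mismatch: reject the block and re-fetch")
    return payload
```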
As for software bugs, the correctness of the replication protocols is verified with model checking, which is re-run when new features are added.
As for logical corruption caused by bugs in customer applications, DynamoDB supports backup and restore, built from the write-ahead logs archived in S3. It also supports point-in-time restore: the contents of a table as they existed at any time within the previous 35 days can be restored, using the snapshot closest to the requested time plus log replay. The system periodically snapshots the partitions belonging to a table and uploads them to S3 (see the sketch below).
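A hedged sketch of point-in-time restore from periodic snapshots plus archived write-ahead logs; the snapshot and log layout here is invented for illustration:

```python
from bisect import bisect_right

def restore(snapshots, log, t_restore):
    """snapshots: time-ordered list of (timestamp, state-dict);
    log: time-ordered list of (timestamp, key, value) write entries."""
    times = [ts for ts, _ in snapshots]
    i = bisect_right(times, t_restore) - 1
    assert i >= 0, "need at least one snapshot at or before t_restore"
    snap_ts, state = snapshots[i]
    state = dict(state)                  # start from the closest snapshot
    for ts, key, value in log:
        if snap_ts < ts <= t_restore:
            state[key] = value           # replay archived WAL entries
    return state
```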
Availability
DynamoDB runs millions of Paxos groups in a region, with log replicas helping keep these groups available.
To reduce false positives (healthy components mistakenly classified as failed), a follower that loses contact with its leader first asks the other followers whether they can still communicate with the leader. If they respond that the leader is healthy, the follower does not attempt to trigger a leader election. This minimizes the number of false positives in the system (see the sketch below).
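A small sketch of this confirmation step; the peer interface (`leader_healthy`) is hypothetical:

```python
def should_start_election(peers) -> bool:
    """Called by a follower that has lost contact with its leader."""
    for peer in peers:
        try:
            if peer.leader_healthy():   # peer still hears from the leader
                return False            # likely a local issue: don't elect
        except TimeoutError:
            continue                    # an unreachable peer proves nothing
    return True                         # no one can reach the leader: elect
```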
Deployments are not atomic in a distributed system: at any given time, some nodes run the old code while others run the new code. The additional challenge is that the new software might introduce a new message type or change the protocol in a way the old software does not understand.
- DynamoDB uses read-write deployments: in a first phase every node is upgraded to understand the new messages, and only in a second phase do nodes begin to send them (see the sketch after this list).
- All deployments are first done on a small set of nodes before being pushed to the whole fleet; if an error occurs, an automatic rollback is triggered.
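A sketch of the read-write deployment pattern from the first bullet; the version flags are illustrative:

```python
# Phase 1 rollout: every node learns to *read* protocol v2 while still
# *sending* v1, so mixed fleets never see an unknown message.
SUPPORTED_VERSIONS = {1, 2}
SEND_VERSION = 1   # phase 2 (a later rollout) flips this to 2

def encode(body) -> dict:
    return {"v": SEND_VERSION, "body": body}

def decode(frame: dict):
    if frame["v"] not in SUPPORTED_VERSIONS:
        raise ValueError(f"unknown protocol version {frame['v']}")
    return frame["body"]

# If errors spike after either phase, the automatic rollback restores the
# previous setting without stranding messages that old nodes cannot parse.
```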
Metadata store
Request routers need the mapping from a table's primary keys to storage nodes.
At cold start, every request triggers a metadata lookup, so the metadata service would have to scale at the same rate as DynamoDB itself, making the system unstable.
The solution is an in-memory distributed datastore (MemDS) that scales to handle the entire request rate of DynamoDB.
MemDS stores the metadata in a structure that is a hybrid of a Patricia tree and a Merkle tree (see the sketch below).
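A hedged sketch of the request routers' caching behavior in front of MemDS: per the paper, even a cache hit triggers an asynchronous refresh, so MemDS sees a steady load regardless of hit rate and a cold cache cannot cause a request storm. The `memds_lookup` callable stands in for the real service API:

```python
class PartitionMapCache:
    """Router-side cache of key -> storage-node mappings."""
    def __init__(self, memds_lookup):
        self.lookup = memds_lookup
        self.cache = {}

    def route(self, table: str, key: str):
        entry = self.cache.get((table, key))
        if entry is None:
            entry = self.lookup(table, key)    # miss: synchronous fetch
            self.cache[(table, key)] = entry
        else:
            self.refresh_async(table, key)     # hit: still refresh anyway
        return entry

    def refresh_async(self, table: str, key: str):
        # Stand-in: a real router would run this on a background task so
        # the request path is never blocked by the refresh.
        self.cache[(table, key)] = self.lookup(table, key)
```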