Microsecond Consensus for Microsecond Applications
10 minute read ∼ Filed in: A paper note
Some questions about the paper:
- Why are the logs kept in cache instead of main memory? How could they control this?
Abstract & introduction
Background
It is desirable to replicate microsecond apps across many hosts to provide high availability, especially for stateful services like key-value stores.
Traditional State Machine Replication (SMR) systems add hundreds of microseconds of overhead to the replication process.
Recent work exploits modern hardware to reduce replication time to a few microseconds, e.g., Hermes, DARE, HovercRaft. But that is still too high for apps that take only a few microseconds per request.
When these systems encounter a failure, they spend tens of milliseconds to recover.
Problems
How can microsecond apps be made fault-tolerant through replication that itself operates at the microsecond scale?
Contribution
The paper proposes a system called Mu, which leverages RDMA to provide strongly consistent replication for microsecond apps.
Summary of contributions
- It significantly reduces both replication and failover latency: less than 1.3 microseconds to replicate an app request, and 873 microseconds to recover from a failure.
Roughly, the leader replicates a request by simply writing it directly to the logs of the other replicas using RDMA, in parallel and in a single round, without any additional communication.
- Provides a complete correctness proof of Mu.
- Evaluates its raw performance.
Challenges of this system:
- Mechanisms to protect against races of concurrent leaders.
- Changing leaders and garbage collecting logs, as described in the paper.
Experiments: the system is evaluated on a financial exchange app, Redis, Memcached, and HERD.
Background
Microsecond services
Modern distributed systems are composed of hundreds of stateless and stateful microservices, such as key-value stores, web servers, load balancers, and ad services. Each operates as an independent app whose latency requirement is shrinking toward the microsecond level as the number of composed services grows.
State machine replication
A common way to implement SMR is as follows: each replica has a copy of the service software and a log. The log stores client requests.
A consensus protocol must ensure safety and liveness
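A minimal C++ sketch of this log-based structure (the types and the apply() interface are illustrative, not from the paper):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical deterministic state machine (e.g., a key-value store).
struct StateMachine {
  std::string apply(const std::string& request) { return request; }  // placeholder
};

// Each replica holds a copy of the service software and a log of client requests.
struct Replica {
  StateMachine app;
  std::vector<std::string> log;  // client requests, in the order agreed by consensus
  std::size_t applied = 0;       // index of the next log entry to execute

  // Executing decided entries in log order keeps all replicas in the same state.
  void replay(std::size_t decided_up_to) {
    while (applied < decided_up_to) app.apply(log[applied++]);
  }
};
```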
RDMA
Traditional communication process:
- on the sender side: data is copied from user space to the kernel socket buffer
- on the sender side: TCP/IP headers are added to the data
- on the sender side: data is copied from the socket buffer to the NIC buffer
- on the receiver side: data is copied from the NIC buffer to the socket buffer
- on the receiver side: the header information is read and stripped
- on the receiver side: data is copied from the socket buffer to the user-space buffer
- the whole process triggers context switches.
RDMA is based on kernel bypass and zero-copy techniques.
- CPU offload: a process can access a remote host's memory without involving either the sender's or the receiver's CPU.
- Kernel bypass: RDMA provides a Verbs interface instead of the traditional TCP/IP socket interface, so a user process can communicate with another host without system calls (and thus without context switches).
- Zero copy: the user process exchanges data with the RNIC hardware directly, with no intermediate buffer copies.
Programming model:
Applications make local memory available for remote access by registering local virtual memory regions (MRs) with the RDMA driver.
Queue pairs (QP):
- includes a send queue (SQ) and a receive queue (RQ)
- a QP is mapped into the user process's virtual memory space, and the process can access the RNIC directly through that memory.
Work Request (WR):
- the user creates a Work Request describing the operation to perform (see the ibverbs sketch at the end of this section).
- the WR is posted to the QP.
- the RDMA hardware consumes the WR:
- it reads the message to be sent from the buffer specified in the WR,
- performs the operation and posts a work completion to the CQ.
Completion Queue (CQ): informs the user that a specific WR has finished; each CQ element corresponds to either the SQ or the RQ.
Both QPs and MRs can have different access modes (e.g., read-only or read-write).
The same memory can be registered multiple times, yielding multiple MRs, each with its own access mode. In this way, different remote machines can have different access rights to the same memory.
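A minimal ibverbs sketch of this MR/QP/WR/CQ flow. The protection domain pd, queue pair qp, completion queue cq, and the remote address/rkey are assumed to have been created and exchanged during connection setup; a real system would also register the MR once up front rather than per write.

```cpp
#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

// RDMA-Write a local buffer into a remote machine's registered memory and
// wait for the work completion.
bool rdma_write_and_wait(ibv_pd* pd, ibv_qp* qp, ibv_cq* cq,
                         void* buf, std::size_t len,
                         uint64_t remote_addr, uint32_t remote_rkey) {
  // 1. Register the local buffer as a Memory Region (MR).
  ibv_mr* mr = ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
  if (!mr) return false;

  // 2. Build a Work Request describing an RDMA Write of buf to remote_addr.
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<uintptr_t>(buf);
  sge.length = static_cast<uint32_t>(len);
  sge.lkey   = mr->lkey;

  ibv_send_wr wr{}, *bad = nullptr;
  wr.opcode              = IBV_WR_RDMA_WRITE;
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;   // request a work completion
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey        = remote_rkey;

  // 3. Post the WR to the QP's send queue; the RNIC consumes it asynchronously.
  if (ibv_post_send(qp, &wr, &bad)) return false;

  // 4. Poll the Completion Queue until the corresponding work completion appears.
  ibv_wc wc{};
  while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin */ }
  return wc.status == IBV_WC_SUCCESS;
}
```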
System Overview
Replication plane: (a replica only has one of these components active, depending on its role in the system)
- Replicator: replicates requests from the leader to the followers
- Replayer: replays entries from the log
Background plane:
- Leader election: detects failure of the leader and selects another replica to become leader
- Permission management: grants or revokes write access to the local log
Logging: stores the client requests to be replicated.
Leader:
- The leader captures a client request, uses an RDMA Write to append that request to the log of each follower,
- then lets the application continue processing the request (a rough leader-side sketch follows).
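A rough sketch of that leader-side path, assuming hypothetical rdma_write_async()/poll_completions() wrappers around the ibverbs calls sketched earlier (this is not Mu's actual code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical handle to one follower's remotely writable log (QP, remote
// address, rkey, ...).
struct FollowerLog {};

// Assumed wrappers: post an RDMA Write, and collect completions later.
void rdma_write_async(FollowerLog& f, std::size_t log_offset,
                      const void* data, std::size_t len);
std::size_t poll_completions();  // how many posted writes completed since last call

// Leader-side replication: one RDMA Write per follower, issued in parallel,
// completed in a single round, with no other communication.
bool replicate(std::vector<FollowerLog>& followers, std::size_t log_offset,
               const void* request, std::size_t len) {
  for (auto& f : followers)
    rdma_write_async(f, log_offset, request, len);   // fire all writes in parallel

  // Counting the leader itself, a majority of n = followers.size() + 1
  // replicas needs acknowledgements from (followers.size() + 1) / 2 followers.
  const std::size_t needed = (followers.size() + 1) / 2;
  std::size_t acked = 0;
  while (acked < needed) acked += poll_completions();
  return true;                                        // request is replicated
}
```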
Follower:
- Each replica grants RDMA write permission on its log to its current leader only.
- When followers detect a new request in their log, they inject the request into the application, thereby keeping the replica up to date.
- Followers also monitor the leader's state to check that it is still active.
Handle Leader Failures
- Detect failures:
the leader periodically increments a local counter, and the followers periodically check that counter using RDMA Reads. If a follower does not observe an increment of the counter after a few tries, a new leader is elected.
- New leader election:
the new leader revokes write permission from any old leader, so that the old leader cannot interfere with the new one.
Replication Plane
Safety: the fact that only the leader replica communicates over the network helps ensure safety.
Log Structure:
- minProposal: each replica publishes a minProposal, which represents the minimum proposal number it will accept.
- FUO: first undecided offset, the lowest log index that the replica does not yet know to be decided.
- each slot is a (propNr, value) tuple (see the sketch below).
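A minimal sketch of this layout (field names and sizes are illustrative, not the paper's exact definitions; the canary byte is the one the replayer checks later):

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxPayload = 512;   // illustrative
constexpr std::size_t kLogSlots   = 4096;  // illustrative

// One log slot: a (proposal number, value) pair, plus a canary byte placed
// after the payload so a reader can tell the slot was completely written.
struct Slot {
  uint64_t propNr;
  uint64_t length;
  uint8_t  value[kMaxPayload];
  uint8_t  canary;                         // non-zero once the write is complete
};

struct Log {
  uint64_t minProposal;                    // minimum proposal number this replica accepts
  uint64_t firstUndecidedOffset;           // FUO: lowest index not yet known to be decided
  Slot     slots[kLogSlots];
};
```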
Basic Algorithm
Extension Algorithm
After a replica becomes leader and establishes its confirmed followers, but before attempting to replicate new values, the leader brings itself and its followers up to date by copying the contents of the more up-to-date logs into the less up-to-date ones.
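A rough sketch of that catch-up step, with hypothetical rdma_read_log()/rdma_write_log() helpers standing in for the actual RDMA transfers:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical view of one replica's log over RDMA.
struct ReplicaView {
  uint64_t fuo;   // that replica's first undecided offset
  // ... QP, remote address, rkey ...
};

// Assumed helpers: copy decided slots in [begin, end) between logs over RDMA.
void rdma_read_log(ReplicaView& from, uint64_t begin, uint64_t end);
void rdma_write_log(ReplicaView& to, uint64_t begin, uint64_t end);

// Bring the new leader and its confirmed followers up to date before it
// starts proposing new values.
void catch_up(ReplicaView& leader, std::vector<ReplicaView>& confirmed) {
  // Find the most up-to-date log among the leader and its confirmed followers.
  uint64_t max_fuo = leader.fuo;
  ReplicaView* freshest = &leader;
  for (auto& f : confirmed)
    if (f.fuo > max_fuo) { max_fuo = f.fuo; freshest = &f; }

  // Pull the missing entries into the leader's own log ...
  if (freshest != &leader) rdma_read_log(*freshest, leader.fuo, max_fuo);
  leader.fuo = max_fuo;

  // ... then push them to every follower that is behind.
  for (auto& f : confirmed)
    if (f.fuo < max_fuo) { rdma_write_log(f, f.fuo, max_fuo); f.fuo = max_fuo; }
}
```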
Omitting the prepare phase
Once a leader finds only empty slots at a given index at all of its confirmed followers at line 23, the leader may omit the prepare phase for higher indexes. With this optimization, the cost of a Propose call becomes a single RDMA write to a majority in the common case.
Growing confirmed followers
The leader grows its confirmed-followers set by briefly waiting for responses from all replicas during its initial request for permission.
Replayer
Followers continually monitor the log for new entries and check the extra canary byte to make sure an entry has been completely written before executing it.
Followers can monitor their local logs and commit all values up to (but excluding) the highest non-empty log index.
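A minimal sketch of that replay loop, reusing the illustrative Slot/Log layout and StateMachine type from the sketches above (real code would also need the appropriate memory barriers):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Follower-side replayer: poll the local log (written remotely by the leader
// via RDMA), wait for the canary byte, then apply each value in log order.
void replay_loop(Log& log, StateMachine& app) {
  uint64_t next = 0;                                   // next slot to execute
  for (;;) {
    Slot& s = log.slots[next % kLogSlots];
    // Spin until the canary, the last byte of the entry to be written, is set.
    while (*reinterpret_cast<volatile uint8_t*>(&s.canary) == 0) { /* spin */ }
    app.apply(std::string(reinterpret_cast<const char*>(s.value),
                          static_cast<std::size_t>(s.length)));
    ++next;
  }
}
```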
Background Plane
Leader Election
Replica A decides that B is leader if B is the replica with the lowest id, among those that A considers to be alive.
To know whether a replica has failed, we employ a pull-score mechanism based on a local heartbeat counter. A leader election thread continually increments its own counter locally and uses RDMA Reads to read the counters (heartbeats) of other replicas and check whether they have been updated.
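A rough sketch of one iteration of that pull-score loop, combined with the lowest-alive-id rule above; rdma_read_counter() is a hypothetical wrapper and the scoring bounds are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed wrapper: RDMA Read of a peer's heartbeat counter.
uint64_t rdma_read_counter(int replica_id);

struct Peer {
  int      id;
  uint64_t last_seen = 0;   // last counter value observed
  int      score     = 0;   // confidence that the peer is alive
};

// One iteration of the leader-election thread: bump our heartbeat, pull the
// others', score them, and pick the lowest-id replica considered alive.
void election_iteration(uint64_t& my_counter, std::vector<Peer>& peers,
                        int my_id, int& leader_id) {
  ++my_counter;                                        // our own heartbeat
  for (auto& p : peers) {
    uint64_t c = rdma_read_counter(p.id);              // pull, not push
    if (c != p.last_seen) { p.last_seen = c; p.score = std::min(p.score + 1, 10); }
    else                  { p.score = std::max(p.score - 1, 0); }
  }
  leader_id = my_id;                                   // lowest id among those alive
  for (const auto& p : peers)
    if (p.score > 0 && p.id < leader_id) leader_id = p.id;
}
```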
Timeout Mechanism
Two timeouts are involved: a small timeout in the detection algorithm (scoring), and a longer timeout built into the RDMA connection mechanism.
To make sure both the replication and leader election threads are making progress:
every X = 10000 iterations, the leader election thread checks the replication thread for activity; if the replication thread is stuck, a new leader will be elected.
Permission Management
Each replica maintains the invariant that only one replica at a time has write permission on its log.
RDMA provides multiple mechanisms to grant and revoke write access.
However, changing a QP's access flags while RDMA operations to that QP are in flight sometimes causes the QP to go into an error state. Therefore, Mu uses a fast-slow path approach: it first optimistically tries to change permissions using the faster QP access-flag method and, if that leads to an error, switches to the slower but more robust QP state method.
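A rough sketch of that fast/slow path. ibv_modify_qp() with IBV_QP_ACCESS_FLAGS is the fast path; the slow path is shown as a hypothetical recreate_or_cycle_qp() helper, since the full RESET → INIT → RTR → RTS transition needs connection parameters omitted here:

```cpp
#include <infiniband/verbs.h>
#include <cstring>

// Assumed helper: bring the QP back up through its states with the
// desired remote-access flags (slow but robust path).
int recreate_or_cycle_qp(ibv_qp* qp, int access_flags);

// Grant or revoke remote write permission on the log exposed through `qp`.
int set_write_permission(ibv_qp* qp, bool allow_write) {
  int access = IBV_ACCESS_REMOTE_READ | (allow_write ? IBV_ACCESS_REMOTE_WRITE : 0);

  // Fast path: just flip the QP's access flags.
  ibv_qp_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.qp_access_flags = access;
  if (ibv_modify_qp(qp, &attr, IBV_QP_ACCESS_FLAGS) == 0) return 0;

  // Slow path: in-flight operations may have pushed the QP into an error
  // state, so fall back to cycling the QP through its states.
  return recreate_or_cycle_qp(qp, access);
}
```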
Log Recycling
Logs are stored in limited memory, so entries must be recycled (a rough sketch follows this list):
- the leader's background plane reads the log heads of all followers and computes minHead,
- log entries up to minHead can be reused,
- the leader zeroes all follower log entries after the leader's first undecided offset and before minHead, using one RDMA Write per follower.
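A rough sketch of that recycling pass, with hypothetical rdma_read_head()/rdma_zero_range() helpers standing in for the real RDMA Read/Write calls:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed helpers over the RDMA connections to each follower.
uint64_t rdma_read_head(int follower_id);                              // read its log head
void rdma_zero_range(int follower_id, uint64_t begin, uint64_t end);   // one RDMA Write of zeros

// Background-plane log recycling, run on the leader.
void recycle_logs(const std::vector<int>& followers, uint64_t leader_fuo) {
  // 1. Read every follower's log head and take the minimum.
  uint64_t min_head = UINT64_MAX;
  for (int f : followers) min_head = std::min(min_head, rdma_read_head(f));

  // 2. Zero the recyclable region of each follower's (circular) log: the
  //    entries after the leader's first undecided offset and before minHead.
  for (int f : followers) rdma_zero_range(f, leader_fuo, min_head);
}
```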
Adding and removing replicas
Use consensus itself to inform replicas about the change.
Removing a replica: it stops executing, and the other replicas subsequently ignore any communication from it.
Adding a replica: uses the standard approach of checkpointing state.
Implementation
About 7k lines of C++ code, using ibverbs (RDMA library); modular design with a QP exchange layer.
Evaluation
Measurements:
- Replication latency (how it changes with payload size; how it compares with other systems)
- Fail-over time
- Throughput
Evaluated on 4 nodes and compared with APUS, DARE, and Hermes.
Applications: three key-value stores (Redis, Memcached, HERD) and an order matching engine for a financial exchange.
Common case replication latency
Standalone runs
- No leader failure
- Standalone mode: run just the replication layer without applications; the leader generates random payloads and proposes them in a loop.
Analysis
Standalone runs perform best due to processor cache effects.
In standalone runs, logs and queue pairs are always in cache, and requests need not be fetched from memory.
Attached runs:
- with various applications
- Mu supports two ways of attaching to an application: direct or handover
- direct mode: the same thread runs both the application and replication, sharing the L1 and L2 caches.
- handover mode: places the application thread on a core separate from the replication thread, avoiding cache sharing.
Direct mode: used for applications with larger workloads.
Handover mode: used for lighter-weight applications.
Analysis (why are the others slower?):
APUS and Hermes: they need to involve the CPUs of many replicas on the critical path.
DARE and APUS: they sequentialize several RDMA operations, so their variances add up.
End to end latency
Includes the latency incurred by the application and by replication (if enabled).
Failover time (= detection time + permission switch time)
Repeatedly introduce leader failures (1k times) by delaying the leader, making it become unresponsive. This causes other replicas to observe that the leader's heartbeat has stopped changing, and thus to detect a failure.
Analysis:
The time to switch permissions (one operation to revoke write permission from the old leader and one to grant it to the new leader) constitutes about 30% of the total fail-over time.
The rest of the fail-over time is attributed to failure detection (≈600 μs).
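A rough sanity check using the numbers above: 873 μs total − ≈600 μs detection ≈ 270 μs for the permission switch, which is roughly 30% of the fail-over time, consistent with the figure quoted.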
Throughput
Run a standalone microbenchmark.
Increase workloads (throughputs) in two ways:
- by batching requests together before replicating,
- by allowing multiple outstanding requests at a time.
Latency and throughput both increase as the batch size increases
Increasing the number of outstanding requests allowed while keeping the batch size constant substantially increases throughput at a small latency cost.