Chain Replication for Supporting High Throughput and Availability

Posted on February 9, 2022 4 minute read ∼ Filed in : A paper note

Questions

does other server has pending queue?
other server 和head的hist 存的是什么
why can not read from backups

Introduction

Motivation

Storage service is the serice between file systems and database systems, eg,. as for information-intensive services, file system lacks rich semantics while database system is too expensive, so we can use storage service.

Storage service can

store objects
support query ops to return a value of object
support update ops to change a state of object

Two challenges for implementing such service.

How to maintain high availability and high throughput despite failures.
How to guarantee strong consistency.

Contribution

The paper propose chanin replication approach to coordinating fail-stop servers, subjecting to support high throughput, availability and strong consistency.

Roughly

high throughput: read request can return immediately without replication.
availability: failure-recover mechanism.
strong consistency: update request is confirmed by tail and reply by head.

A storage service interface

Chain replicaiton Protocol

Failure model

If an object is replicated at t servers, at most t-1 server can fail without compromising the availability.

Protocol Details

Basic operations

Query Processing

Query request sent to the tail directly.
tail return to client directly.

Update Processing

Update request sent to head directly.
head update local object and send result to next server.
head wait for tail’s reply
head return to client.

Reply Generation

Tail sent back to head through all mid servers.

Failure handler

master service do the following

detect failure
Inform server with its predecessor and new successor.
Inform client which server is head or tail.

Head Failure

Removing H from chain and making the successor the new head.

Tail Failure

Remving tail T from chain and making predecessor T- the new tail.

Other server failure

Add new server to chain

In practise, add new server to T+1 after the tail T.
set Sent queue(T+1) = empty
set Hist(T+1) = Hist(T)
T notified master that it is not tail
T begin to fill Sent queue and forward to T+1
Master notifed client to sent query to T+1

Primary/Backup Protocols

Operation Latency

Compare with Primary/Backup approach where read reach the primary and primary waits ack from ups before replying to client, chain approach has lower latency because tail can return to client directly. (reason behind is in primary/backup primary sync read requests to prevent stale read, but in chain appraoch, read always from tail server. )

Primary/Backup can boardcasts request to backups parallel, delay = max([d1, d2…])

Chain approach sync request sequencelly, delay = sum(d1,d2…)

(di is delay of backup i)

Failure-Recover Latency

Primary/backup

Primary Failure (5 message delay)

master detect failure and broadcasts to all backups
each backup replies to master
master broadcasts new primary’s id to all backups
new primary transfer state.
master broadcasts new primary’s id to clients.

Backup Failure (1 message delay)

Pick idle to be backup and start state transfer

Chain replication

Head failure (2 message delay)

master broadcasts message to new head and it’s successor
master notify clients.

Middle Server Failure (4 message delay)

as showen above

Tail failure (2 msg delay)

master sends a msg to new tail
master notifies all clients.

Compare Conclusion

Transient outage of chain replication is shorter than primary/backups

Simulation Experiments

simulated network with infinite bandwidth but with latencies of 1ms per message.

Single chain, No Failures

Replication factor t = 2, 3, 10, 25 clients, measure throughput of:

chain: Chain replication. (strong consistency)
p/b: Primary/backup. (strong consistency)
weak-chain: Chain replication modified so query requests go to any random server.
weak-p/b: Primary/backup modified so query requests go to any random server.

As for the weak-, the more server, the higher throughput because the random access.

Chain has better performance than p/b.

Multiple chain, No Failures

Sharding objects to different chain. Each processor host multiple chains.

Client send request to dispatcher, which load-balance the requests.

Reply of server send to client directly.

Effects of Failures on Throughput

11 clients, storage service has following properities.

For update, after failure happens fewer server attend updating process, so the latency is low.

For query, load is no longer well-balanced among servers, and aggregate query throughput is lower.

Large Scale Replication of Critical Data

t is the chain length.

Ring: Volume(group of object) are placed using consistent hash.

RndPar: Volume(group of object) are placed Randomly. (used in GFS)

RndSeq: Replicas of volume are placed randomly

Chain Replication for Supporting High Throughput and Availability

Questions

Introduction

Motivation

Contribution

A storage service interface

Chain replicaiton Protocol

Failure model

Protocol Details

Basic operations

Failure handler

Primary/Backup Protocols

Operation Latency

Failure-Recover Latency

Simulation Experiments

Single chain, No Failures

Multiple chain, No Failures

Effects of Failures on Throughput

Large Scale Replication of Critical Data

Tags Cloud

Categories Cloud

It's the niceties that make the difference fate gives us the hand, and we play the cards.

Questions

Introduction

Motivation

Contribution

A storage service interface

Chain replicaiton Protocol

Failure model

Protocol Details

Basic operations

Failure handler

Primary/Backup Protocols

Operation Latency

Failure-Recover Latency

Simulation Experiments

Single chain, No Failures

Multiple chain, No Failures

Effects of Failures on Throughput

Large Scale Replication of Critical Data

END OF POST

Tags Cloud

Categories Cloud

It's the niceties that make the difference fate gives us the hand, and we play the cards.