Chain Replication for Supporting High Throughput and Availability

Posted on February 9, 2022   4 minute read ∼ Filed in  : 

Questions

  1. does other server has pending queue?
  2. other server 和head的hist 存的是什么
  3. why can not read from backups

Introduction

Motivation

Storage service is the serice between file systems and database systems, eg,. as for information-intensive services, file system lacks rich semantics while database system is too expensive, so we can use storage service.

Storage service can

  1. store objects
  2. support query ops to return a value of object
  3. support update ops to change a state of object

Two challenges for implementing such service.

  1. How to maintain high availability and high throughput despite failures.
  2. How to guarantee strong consistency.

Contribution

The paper propose chanin replication approach to coordinating fail-stop servers, subjecting to support high throughput, availability and strong consistency.

Roughly

  • high throughput: read request can return immediately without replication.

  • availability: failure-recover mechanism.

  • strong consistency: update request is confirmed by tail and reply by head.

    A storage service interface

image-20220209144049567

Chain replicaiton Protocol

Failure model

If an object is replicated at t servers, at most t-1 server can fail without compromising the availability.

Protocol Details

Basic operations

Query Processing

  1. Query request sent to the tail directly.
  2. tail return to client directly.

Update Processing

  1. Update request sent to head directly.
  2. head update local object and send result to next server.
  3. head wait for tail’s reply
  4. head return to client.

Reply Generation

Tail sent back to head through all mid servers.

Failure handler

master service do the following

  • detect failure
  • Inform server with its predecessor and new successor.
  • Inform client which server is head or tail.

Head Failure

Removing H from chain and making the successor the new head.

Tail Failure

Remving tail T from chain and making predecessor T- the new tail.

Other server failure

image-20220209170311558

Add new server to chain

  1. In practise, add new server to T+1 after the tail T.
  2. set Sent queue(T+1) = empty
  3. set Hist(T+1) = Hist(T)
  4. T notified master that it is not tail
  5. T begin to fill Sent queue and forward to T+1
  6. Master notifed client to sent query to T+1

Primary/Backup Protocols

Operation Latency

Compare with Primary/Backup approach where read reach the primary and primary waits ack from ups before replying to client, chain approach has lower latency because tail can return to client directly. (reason behind is in primary/backup primary sync read requests to prevent stale read, but in chain appraoch, read always from tail server. )

Primary/Backup can boardcasts request to backups parallel, delay = max([d1, d2…])

Chain approach sync request sequencelly, delay = sum(d1,d2…)

(di is delay of backup i)

Failure-Recover Latency

Primary/backup

Primary Failure (5 message delay)

  • master detect failure and broadcasts to all backups
  • each backup replies to master
  • master broadcasts new primary’s id to all backups
  • new primary transfer state.
  • master broadcasts new primary’s id to clients.

Backup Failure (1 message delay)

  • Pick idle to be backup and start state transfer

Chain replication

Head failure (2 message delay)

  • master broadcasts message to new head and it’s successor
  • master notify clients.

Middle Server Failure (4 message delay)

  • as showen above

Tail failure (2 msg delay)

  • master sends a msg to new tail
  • master notifies all clients.

Compare Conclusion

Transient outage of chain replication is shorter than primary/backups

Simulation Experiments

simulated network with infinite bandwidth but with latencies of 1ms per message.

Single chain, No Failures

Replication factor t = 2, 3, 10, 25 clients, measure throughput of:

  • chain: Chain replication. (strong consistency)
  • p/b: Primary/backup. (strong consistency)
  • weak-chain: Chain replication modified so query requests go to any random server.
  • weak-p/b: Primary/backup modified so query requests go to any random server.

image-20220209211309340

As for the weak-, the more server, the higher throughput because the random access.

Chain has better performance than p/b.

Multiple chain, No Failures

Sharding objects to different chain. Each processor host multiple chains.

Client send request to dispatcher, which load-balance the requests.

Reply of server send to client directly.

image-20220209214232322

Effects of Failures on Throughput

11 clients, storage service has following properities.

image-20220209215724899

image-20220209220240278

For update, after failure happens fewer server attend updating process, so the latency is low.

For query, load is no longer well-balanced among servers, and aggregate query throughput is lower.

Large Scale Replication of Critical Data

image-20220209221339610

t is the chain length.

Ring: Volume(group of object) are placed using consistent hash.

RndPar: Volume(group of object) are placed Randomly. (used in GFS)

RndSeq: Replicas of volume are placed randomly

#





END OF POST




Tags Cloud


Categories Cloud




It's the niceties that make the difference fate gives us the hand, and we play the cards.