The design of a practical system for fault-tolerant virtual machines

Posted on January 28, 2022 9 minute read ∼ Filed in : A paper note

Something get from the paper

Fault-tolerate = primary/backup + state-machine sync method (init state, same ops, same order.)

Receiving ack before returning to client can guarantee state-consistency betwen primary and backups.

Shared storage to make sure only one VM is primary

Slow down the primary is backup is too slow.

Using easy-to-control temp buffer to handle race conditin of disk operation and memory accessing of VM.(virtual memory, some data in memory is stored on disk. )

Introduction

Current Problems

Primary/backup approach

Requires keep the state of backup identical to the state of primary all the time.

But shipping changes to all state of primary,CPU, memory and I/O devices, are very large and time-consuming since it requires much bandwidth.

State-machine Approach

Model the server as deterministic state machine which are kept in sync by starting from same init state, and execute same request in same order.

But mose servers or services don’t have deterministic operations, extra coordination must be used to keep the extra information.

Implementing coordination to ensure deterministic execution of physical servers is difficult, as processor frequenceies increase. VM is excellent platform for state-machine approach, where the hypervisor play the role of coordinator.

Solution

The paper implemented fault-tolerant system using primary/backup approach on VM.

The System has small time lag.
The System use deterministic replay technology to record the execution of primary and replay at backup server, such that the bandwidth usage is not high.
The System automatically restores redundancy after failure by starting a new backup VM on new server.
mainly deal with fail-stop failures.

Basic FT design

Basic ideas

Only primary VM is accessable for client.

Primary sync the request to backup via network connection known as logging channel, and Backup VM return ack to primary VM explicitly.

Output of backup VM is dropped by hypervisor and only primary VM return res to client.

Using heartbeating between server and monitoring of traffic on logging channel to detect VM failure.

Determinsitic Replay Implementation

Input needs to be replicated

determinstic events: Network packets, disk reads, input from devices
No-eterminstic events: virtual interrupts, reading clock cycle counter.

Challenge

How to capture all deter/No-deter events correctly?
How to apply those inputs to backup VM?
Replication process doesn’t degrade performance

Solution - Use VMware determinsitic replay

It records events of VM in stream of log entries written to a log file.
Other VMs can read the file and replay.
It’s efficient in recording and delivering by using various techniques like hardware performance counters. So the system can record and delivered events nearly realtime.
The system use VMware deterministic replay but it send log entries via logging channel instead of writting log entires to disk. So the backup VM can replay in real time.

FT Protocol

Goal/Challenge

if the backup VM takes over after primary VM failure, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to client.

(this requires no data is lost during VM switching if primary fails.)

Current problem

Failure happens after primary VM write successfully and return to client and before this write operation is sent to backup VMs, backup VM can takes over and replay the execution. If client issues a read to backup, then client read an inconsistenct data.

Solution

The primary VM may not send an output to client until backup VM has received and ack the log entry associated with the operation producing the output. With this even primary VM fails after replying to client, backup VMs can still handle read request.

Backup cannot determine if a primary crashed.

Detecting and Responding to Failure

Failure detection

VMware FT use UDP heartbeating between servers to detect when a server may have crashed. eg., VMware FT monitor traffic sent from primary to backup and the ack sent from backup to primary.

Action

Once a backup VM is switched to primary VM, VMware FT automatically advertises the MAC address of the new primary VM on the network, so the network swithes know what server the new primary VM is located.

Problems

If the network has partition, brain-split happens.

Solution

To make sure only one primary exist in the system, the system introduce shared storage to provide the atomic operation.

Practical Implementation of FT

To enhance usable, robust and automatic of system, some new components are needed to add.

Starting and Restarting FT VMs

Problem

How to start a new backup VM or re-start failured VM as a backup in the same state as a primary VM?
How to choose a server for such VM?

Solution

Adopt the VMotion functionality of VMware vSphere, such that the origin VM is copped to a new server instead of migrating to new server.(migrating requires destory original VM)
All VMs can run on any server in the cluster with a shared storage. there is clustering service doing resource mangement and when a failure happens, backup VM become primary VM and then ask clustering service for a new backup VM, the clustering service determine best server based on resource usage and other constraints.

Managing the logging channel

Hypervisors maintain a large buffer for logging entries for the primary and backup VMs. So the VMware FT can monitor the logging channel.

Problem

When primary VM encounters a full log buffer when it needs to write a log entry, it must stop execution until log entries can be flushed out. (pause) This pause could increase clients latency. So how to minimize the possibility that the primary log buffer fills up.

Solution

Backup to slow => Primary log buffer fills up. Luckly, VMware deterministic replay guarantee the executing speed is the roughly the same.

Problem

Backup VM slow is inevitable when the server hosting backup VM is heavily loaded with other VM. If primary fails at this time and backup VM switch to primary VM, then backup VM must replaying all log entries. Which introduce more latency (lag time). How to reduce this?

Solution

Additional mechanism to slow down the primary VM to prevent the backup VM from getting too far behind. eg,. when backup VM has significant execution lag (> 1 second), VMware FT starts slowing down the primary by informing the scheduler to give it smaller CPU.

Operation on FT VMs

Various control operations applied to primary VM also should be applied to backup VMs.eg,.

Explicitly Powered off on primary also applied to backups.
Explicitly resource change on primary also applied to backups

Various operations should only applied on primary VM.

All above requirements are achieved using necessary control entry.

Implementation issues for Disk I/O

Disk operation are non-blocking, and can execute in parallel. Implementation of disk IO uses DMA directly to/from memory of virtual machines.

Problem

Parallelims introduce race condition.
Disk operation can also race with memory access by application in VM since disk operation directly access the memory of a VM via DMA.

Solution

The system detect race IO and make them sequential.

The system also provide page protection on pages that are targets of disk operations. eg,. when race happens, then VM can paused until disk operation completes.

As for the implementation, use fully controlled temporary buffer to store the modified data. (we could provide consistency of this buffer easily compared with directly change MMU protection on pages.)

when disk read operation happens, read the temporary buffer.
when disk write operation happens, copy data to tmp buffer, update the data in buffer.

Implementation issues for Network IO

Two approaches to improve VM network performance while running FT.

Reduce VM interrupts by only posting the interrupt for a group of packets.
Reduce transmit delay (time between sending log message to backup and getting an ack) by remove thread context switch.

Design Alternatives

Shared vs Non-shard Disk

shared disk

Only primary VM does actual writes to disk, and write must be delayed after get ack from backups.

No-shared disk

Challenges includes

sync the disk infor
handle split-brain problem

Executing Disk reads on backup VM

In reading operation, Primary VM read from disk and send read result to backups.

Alternate design is to send read ops to backups, and backups can read from disk themselves.

challenges

backup read disk failes => retry
Primary read disk fails => backup should knows it and revert the data to original
reading disk introduce latency at backup VMs.

Experiments shows executing disk reads on backup can cause some slightly reduced throughput (1-4%) for real applications.

Preformanace Evaluation

Applications

SPECJbb: java application which is CPU-intensive and little IO

Kernel Compile: CPU and MMU intensive. (Race of memory access)

Oracle Swingbench: OLTP workload with substantial disk and network IO

MSSQL DVD Store: many clients.

Basic performance result

3rd column is average bandwidth of daata sent on logging channel when app are running.

Overhead of FT is small with less than 10%

Network Benchmarks

Primary and backup connected by 1GB/10GB channels.

When FT is disabled, Vm can achieve 940Mbit/s

When FT is enabled for receive workloads, the logging bandwidth is large since all incomming packets must be sent on logging channels.

Conclusion

Adding FT to system only introduce less 10% overhead.

logging bandwidth required to keep the primary and backup in sync is small, less then 20Mbit/s. The primary and backup can separated by long distances.

The design of a practical system for fault-tolerant virtual machines

Something get from the paper

Introduction

Current Problems

Solution

Basic FT design

Basic ideas

Determinsitic Replay Implementation

FT Protocol

Detecting and Responding to Failure

Practical Implementation of FT

Starting and Restarting FT VMs

Managing the logging channel

Operation on FT VMs

Implementation issues for Disk I/O

Implementation issues for Network IO

Design Alternatives

Shared vs Non-shard Disk

Executing Disk reads on backup VM

Preformanace Evaluation

Applications

Basic performance result

Network Benchmarks

Conclusion

Tags Cloud

Categories Cloud

It's the niceties that make the difference fate gives us the hand, and we play the cards.

Something get from the paper

Introduction

Current Problems

Solution

Basic FT design

Basic ideas

Determinsitic Replay Implementation

FT Protocol

Detecting and Responding to Failure

Practical Implementation of FT

Starting and Restarting FT VMs

Managing the logging channel

Operation on FT VMs

Implementation issues for Disk I/O

Implementation issues for Network IO

Design Alternatives

Shared vs Non-shard Disk

Executing Disk reads on backup VM

Preformanace Evaluation

Applications

Basic performance result

Network Benchmarks

Conclusion

END OF POST

Tags Cloud

Categories Cloud

It's the niceties that make the difference fate gives us the hand, and we play the cards.