The design of a practical system for fault-tolerant virtual machines
9 minute read ∼ Filed in : A paper noteThe design of a practical system for fault-tolerant virtual machines
Something get from the paper
Fault-tolerate = primary/backup + state-machine sync method (init state, same ops, same order.)
Receiving ack before returning to client can guarantee state-consistency betwen primary and backups.
Shared storage to make sure only one VM is primary
Slow down the primary is backup is too slow.
Using easy-to-control temp buffer to handle race conditin of disk operation and memory accessing of VM.(virtual memory, some data in memory is stored on disk. )
Introduction
Current Problems
Primary/backup approach
Requires keep the state of backup identical to the state of primary all the time.
But shipping changes to all state of primary,CPU, memory and I/O devices, are very large and time-consuming since it requires much bandwidth.
State-machine Approach
Model the server as deterministic state machine which are kept in sync by starting from same init state, and execute same request in same order.
But mose servers or services don’t have deterministic operations, extra coordination must be used to keep the extra information.
Implementing coordination to ensure deterministic execution of physical servers is difficult, as processor frequenceies increase. VM is excellent platform for state-machine approach
, where the hypervisor play the role of coordinator.
Solution
The paper implemented fault-tolerant system using primary/backup approach on VM.
- The System has small time lag.
- The System use
deterministic replay technology
to record the execution of primary and replay at backup server, such that the bandwidth usage is not high. - The System automatically restores redundancy after failure by starting a new backup VM on new server.
- mainly deal with fail-stop failures.
Basic FT design
Basic ideas
Only primary VM is accessable for client.
Primary sync the request to backup via network connection known as logging channel
, and Backup VM return ack to primary VM explicitly.
Output of backup VM is dropped by hypervisor and only primary VM return res to client.
Using heartbeating between server
and monitoring of traffic on logging channel
to detect VM failure.
Determinsitic Replay Implementation
Input needs to be replicated
- determinstic events: Network packets, disk reads, input from devices
- No-eterminstic events: virtual interrupts, reading clock cycle counter.
Challenge
- How to capture all deter/No-deter events correctly?
- How to apply those inputs to backup VM?
- Replication process doesn’t degrade performance
Solution - Use VMware determinsitic replay
- It records events of VM in stream of log entries written to a log file.
- Other VMs can read the file and replay.
- It’s efficient in recording and delivering by using various techniques like hardware performance counters. So the system can record and delivered events nearly realtime.
- The system use
VMware deterministic replay
but it send log entries via logging channel instead of writting log entires to disk. So the backup VM can replay in real time.
FT Protocol
Goal/Challenge
-
if the backup VM takes over after primary VM failure, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to client.
(this requires no data is lost during VM switching if primary fails.)
Current problem
- Failure happens after primary VM write successfully and return to client and before this write operation is sent to backup VMs, backup VM can takes over and replay the execution. If client issues a read to backup, then client read an inconsistenct data.
Solution
-
The primary VM may not send an output to client until backup VM has received and ack the log entry associated with the operation producing the output. With this even primary VM fails after replying to client, backup VMs can still handle read request.
Backup cannot determine if a primary crashed.
Detecting and Responding to Failure
Failure detection
VMware FT
use UDP heartbeating between servers to detect when a server may have crashed. eg., VMware FT
monitor traffic sent from primary to backup and the ack sent from backup to primary.
Action
Once a backup VM is switched to primary VM, VMware FT
automatically advertises the MAC address of the new primary VM on the network, so the network swithes know what server the new primary VM is located.
Problems
If the network has partition, brain-split happens.
Solution
To make sure only one primary exist in the system, the system introduce shared storage to provide the atomic operation.
Practical Implementation of FT
To enhance usable, robust and automatic of system, some new components are needed to add.
Starting and Restarting FT VMs
Problem
- How to start a new backup VM or re-start failured VM as a backup in the same state as a primary VM?
- How to choose a server for such VM?
Solution
- Adopt the VMotion functionality of VMware vSphere, such that the origin VM is copped to a new server instead of migrating to new server.(migrating requires destory original VM)
- All VMs can run on any server in the cluster with a shared storage. there is
clustering service
doingresource mangement
and when a failure happens, backup VM become primary VM and then ask clustering service for a new backup VM, the clustering service determine best server based on resource usage and other constraints.
Managing the logging channel
Hypervisors maintain a large buffer for logging entries for the primary and backup VMs. So the VMware FT
can monitor the logging channel.
Problem
When primary VM encounters a full log buffer when it needs to write a log entry, it must stop execution until log entries can be flushed out. (pause) This pause could increase clients latency. So how to minimize the possibility that the primary log buffer fills up.
Solution
Backup to slow => Primary log buffer fills up. Luckly, VMware deterministic replay guarantee the executing speed is the roughly the same.
Problem
Backup VM slow is inevitable when the server hosting backup VM is heavily loaded with other VM. If primary fails at this time and backup VM switch to primary VM, then backup VM must replaying all log entries. Which introduce more latency (lag time
). How to reduce this?
Solution
Additional mechanism to slow down the primary VM to prevent the backup VM from getting too far behind. eg,. when backup VM has significant execution lag (> 1 second), VMware FT
starts slowing down the primary by informing the scheduler to give it smaller CPU.
Operation on FT VMs
Various control operations applied to primary VM also should be applied to backup VMs.eg,.
Explicitly Powered off
on primary also applied to backups.Explicitly resource change
on primary also applied to backups
Various operations should only applied on primary VM.
All above requirements are achieved using necessary control entry.
Implementation issues for Disk I/O
Disk operation are non-blocking, and can execute in parallel. Implementation of disk IO uses DMA
directly to/from memory of virtual machines.
Problem
- Parallelims introduce race condition.
- Disk operation can also race with memory access by application in VM since disk operation directly access the memory of a VM via DMA.
Solution
The system detect race IO and make them sequential.
The system also provide page protection on pages that are targets of disk operations. eg,. when race happens, then VM can paused until disk operation completes.
As for the implementation, use fully controlled temporary buffer to store the modified data. (we could provide consistency of this buffer easily compared with directly change MMU protection on pages.)
- when disk read operation happens, read the temporary buffer.
- when disk write operation happens, copy data to tmp buffer, update the data in buffer.
Implementation issues for Network IO
Two approaches to improve VM network performance while running FT.
- Reduce VM interrupts by only posting the interrupt for a group of packets.
- Reduce transmit delay (time between sending log message to backup and getting an ack) by remove thread context switch.
Design Alternatives
Shared vs Non-shard Disk
shared disk
Only primary VM does actual writes to disk, and write must be delayed after get ack from backups.
No-shared disk
Challenges includes
- sync the disk infor
- handle split-brain problem
Executing Disk reads on backup VM
In reading operation, Primary VM read from disk and send read result to backups.
Alternate design is to send read ops to backups, and backups can read from disk themselves.
challenges
- backup read disk failes => retry
- Primary read disk fails => backup should knows it and revert the data to original
- reading disk introduce latency at backup VMs.
Experiments shows executing disk reads on backup can cause some slightly reduced throughput (1-4%) for real applications.
Preformanace Evaluation
Applications
SPECJbb: java application which is CPU-intensive and little IO
Kernel Compile: CPU and MMU intensive. (Race of memory access)
Oracle Swingbench: OLTP workload with substantial disk and network IO
MSSQL DVD Store: many clients.
Basic performance result
3rd column is average bandwidth of daata sent on logging channel when app are running.
Overhead of FT is small with less than 10%
Network Benchmarks
Primary and backup connected by 1GB/10GB channels.
When FT is disabled, Vm can achieve 940Mbit/s
When FT is enabled for receive workloads, the logging bandwidth is large since all incomming packets must be sent on logging channels.
Conclusion
Adding FT to system only introduce less 10% overhead.
logging bandwidth required to keep the primary and backup in sync is small, less then 20Mbit/s. The primary and backup can separated by long distances.