A High-Speed Load-Balancer Design with Guaranteed Per-Connection-Consistency
11 minute read ∼ Filed in: A paper note

Introduction
Load balancer requirements:
- Uniform load distribution of the incoming connections across the servers.
- Per-connection consistency (PCC): all packets belonging to the same connection must go to the same server, even when the number of servers and load balancers changes dynamically.
Existing systems:
Rely on simple hash computations. Hash-based load balancers can mitigate PCC violations but may suffer from load imbalances of up to 30%.
Details are discussed at “Background and motivation” below.
Solutions:
Introduce a load balancer that supports:
- dynamicity, the number of LBs and servers can increase or decrease depending on the actual load.
- per-connection-consistency (PCC), packets belonging to the same connection are forwarded to the same server;
- uniform load distribution, by supporting advanced load balancing mechanisms that efficiently utilize the servers;
- efficient packet processing, the LB should have minimal impact on communication latency;
- resilience, it should be hard for a client to “clog” the LB and the servers with spurious traffic
Deployments:
- In a stateless and a stateful manner.
- On both a software switch and a Tofino-based hardware switch.
Result:
Has negligible packet processing overheads, and can support load balancing mechanisms that reduce the flow completion time by a factor of 2-3x.
Contribution summary:
- quantify limitations of existing stateless and stateful LBs through large-scale simulations. We show that the quality of the load distribution of existing LBs is 40 times worse than that of an ideal LB. We also show stateless LBs (such as Beamer and Faild) can reduce such imbalances at the price of increasing PCC violations.
- Present a stateless and a stateful design of CHEETAH, which strike different trade-offs in terms of resilience and performance.
- Implement our stateless and stateful CHEETAH LBs in FastClick. Also implement both versions of CHEETAH with a weighted round-robin LB on a Tofino-based switch.
Background and motivation
Multi-tier load balancing architectures.
VIP => each service => a set of servers providing that service.
DIP => each server's IP.
LB => a server receiving incoming connections for a service and selecting a server to provide the service.
The LB partitions the space of connection identifiers (TCP 5-tuples) across all the servers (DIPs) associated with that VIP.
5-tuple: the source IP address/port number, destination IP address/port number, and the protocol in use.
Limits of Stateless Load Balancers
Traditional stateless LBs cannot guarantee PCC
If the number of servers changes, the indirection table must be updated, which may cause some existing connections to be rerouted to the new (and wrong) server that is now associated with an entry in the table, i.e., a PCC violation
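To make this failure mode concrete, here is a minimal Python sketch (all names are illustrative) of a hash-based indirection lookup; if the two lookups below differ, every in-flight packet of the connection is rerouted mid-stream, i.e., a PCC violation:

```python
import hashlib

def hash_5tuple(five_tuple: tuple) -> int:
    """Deterministically hash a (src_ip, src_port, dst_ip, dst_port, proto) tuple."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pick_dip(five_tuple: tuple, indirection_table: list) -> str:
    # The connection hash indexes a table of DIPs; no per-connection state is kept.
    return indirection_table[hash_5tuple(five_tuple) % len(indirection_table)]

conn = ("10.0.0.1", 12345, "203.0.113.7", 443, "tcp")
before = ["dip-1", "dip-2", "dip-3"]            # original server pool
after = ["dip-1", "dip-2", "dip-3", "dip-4"]    # a server was added

# Adding a server changes the modulus, so many live connections remap.
print(pick_dip(conn, before), "->", pick_dip(conn, after))
```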
Advanced stateless LBs cannot always guarantee PCC.
Experiment step: Performed DIP updates using different frequency distributions.
We define the number of broken connections as the number of connections that were mapped to at least two different servers between their start and end times.
Fig. 2a shows that Beamer and Faild (plotted using the same line) still break almost 1% of the connections at the highest DIP update frequency.
Hash-based LBs cannot uniformly spread the load.
Imbalance of a server: the ratio between the number of connections active on that server and the average number of active connections across all servers.
System imbalance: the maximum imbalance of any server.
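Restating these definitions as formulas, with $c_i$ denoting the number of active connections on server $i$ out of $n$ servers:

$$\mathrm{imbalance}_i = \frac{c_i}{\frac{1}{n}\sum_{j=1}^{n} c_j}, \qquad \text{system imbalance} = \max_i \ \mathrm{imbalance}_i$$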
Experiment step: Vary only the number of connections that are active at the same time in the cluster between 20K and 200K
These results show that a more uniform distribution of load can be achieved by storing the mapping between connections and servers (Round-Robin, Power-of-Two, and Least-Loaded all require storing the connection-to-server mapping).
Beamer can reduce imbalance at the cost of a greater number of PCC violations.
Monitor the servers' load imbalances; once a server reaches a threshold, it is removed from the indirection table, and it is re-added to the table when its number of active connections drops below the average.
Note that, if an entry in the indirection table changes its server mapping twice, Beamer will break those existing connections that were relying on the initial state of the indirection table.
We note that guaranteeing an imbalance of at most 10% would cause 3% of all connections to break.
Limits of Stateful Load Balancers
ConnTable: Store the connection-to-server mapping.
Today’s stateful LBs cannot guarantee PCC
If the number of LBs changes, some existing connections will be routed to an LB that holds no state for them; if the number of servers has also changed, that LB will hash the connection to the wrong server, thus breaking PCC.
Today’s stateful LBs rely on complex and slow data structures.
They use cuckoo-hash tables to keep per-connection mappings, but insertions are slow, impacting throughput.
Service Resilience and Load Balancers
LBs shield servers from targeted bandwidth depletion attacks
Spreading connections across all servers guarantees that the system absorbs sudden bursts due to DDoS attacks with minimal impact on a service's operation.
Stateful LBs provide a per-connection view but at lower resilience.
Incoming spurious connections are added to the connection table, rapidly exhausting the limited LB memory and degrading performance.
The CHEETAH Load Balancer
Encoding information about the connection into a cookie that is added to all the packets of a connection
Guarantee:
- Future packets belonging to the same connection are forwarded to the same server.
- Speed up the forwarding process in a stateful LB, which in turn increases the resilience of the LB.
Limitations of the naive approach.
- Some clients can first establish many connections to the same server and then suddenly increase their load. This is highly undesirable, as it leads to cascade-effect imbalances and service disruptions.
Stateless CHEETAH:
Guarantees PCC:
- moves the server's information into the packet header.
- uses the VIPToServers table only for the first packet of a connection; all subsequent packets use the AllServers table.
And preserves the same resilience:
The server identifier in the cookie is obfuscated with a hash, so it remains resilient to attacks (see the sketch below).
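A minimal Python sketch of this encode/decode path, assuming (as we read the paper) that the cookie is the XOR of a keyed hash of the connection identifier and the server's index in the AllServers table; the key, hash, and table contents here are illustrative:

```python
import hmac, hashlib

SECRET_KEY = b"lb-secret-key"                        # known only to the LB
all_servers = ["dip-1", "dip-2", "dip-3", "dip-4"]   # AllServers table

def keyed_hash(five_tuple: tuple) -> int:
    # Keyed hash of the connection ID, truncated to the cookie width (16 bits here).
    mac = hmac.new(SECRET_KEY, repr(five_tuple).encode(), hashlib.sha256)
    return int.from_bytes(mac.digest()[:2], "big")

def encode_cookie(five_tuple: tuple, server_idx: int) -> int:
    # First packet: any LB policy picks server_idx; the cookie obfuscates it,
    # so a client cannot craft cookies that target a chosen server.
    return keyed_hash(five_tuple) ^ server_idx

def decode_server(five_tuple: tuple, cookie: int) -> str:
    # All later packets: recover the index without any per-connection state.
    return all_servers[(keyed_hash(five_tuple) ^ cookie) % len(all_servers)]

conn = ("10.0.0.1", 12345, "203.0.113.7", 443, "tcp")
cookie = encode_cookie(conn, 2)
assert decode_server(conn, cookie) == "dip-3"
```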
Supports arbitrary load balancing mechanisms:
Since the binding of the connection to the server is stored in the packet header, it can support many LB mechanisms that go well beyond uniform hashing.
Lower bounds on the size of the cookie
- In CHEETAH, the size of the cookie has to be at least log2 k bits, where k is the maximum number of servers stored in the AllServers table.
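This is a simple counting argument: a cookie of $b$ bits can distinguish at most $2^b$ distinct servers, so distinguishing $k$ servers requires

$$b \ \geq\ \lceil \log_2 k \rceil \ \text{bits.}$$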
Stateful CHEETAH:
- A stateful LB can keep track of the behaviour of each individual connection and support complex network functions, such as rate limiters, NATs, detection of heavy hitters, and rerouting to dedicated scrubbing devices
- We store a set of m ConnTable tables that keep per-connection statistics and DIP mappings.
- We also use an equal number of ConnStack stacks of indices, each storing the unused entries in its corresponding ConnTable.
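A minimal Python sketch of one ConnTable paired with its ConnStack of free indices; the real implementation lives in the LB data plane, so this only illustrates why insertions and lookups are O(1) without any hashing (the cookie carries the table index):

```python
class ConnTable:
    """One ConnTable plus its ConnStack of unused entry indices."""

    def __init__(self, size: int):
        self.entries = [None] * size                # per-connection DIP + stats
        self.free = list(range(size - 1, -1, -1))   # ConnStack of unused slots

    def insert(self, dip: str) -> int:
        idx = self.free.pop()                       # O(1) allocation, no hashing
        self.entries[idx] = {"dip": dip, "pkts": 0}
        return idx                                  # this index goes into the cookie

    def lookup(self, idx: int) -> dict:
        entry = self.entries[idx]                   # O(1): the cookie is the index
        entry["pkts"] += 1                          # per-connection statistics
        return entry

    def remove(self, idx: int) -> None:
        self.entries[idx] = None
        self.free.append(idx)                       # slot returns to the ConnStack

table = ConnTable(size=4)
cookie_idx = table.insert("dip-2")
assert table.lookup(cookie_idx)["dip"] == "dip-2"
table.remove(cookie_idx)
```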
Hybrid datacenter architecture:
Propose a 2-tier DC architecture where the first tier consists of stateless CHEETAH LBs and the second tier consists of stateful CHEETAH LBs.
- The stateless LB uses the first bytes of the cookie to encode the identifier of a stateful load balancer, thus guaranteeing a connection always reaches the same LB regardless of the LB pool size.
- The stateful load balancer uses the last bytes of the cookie to encode per-connection information as described above
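A hypothetical sketch of that cookie split; the 4/12 bit widths are our own illustration, not values from the paper:

```python
LB_BITS, CONN_BITS = 4, 12   # illustrative split of a 16-bit cookie

def pack_cookie(lb_id: int, conn_idx: int) -> int:
    # High bits select the stateful LB, low bits index its ConnTable entry.
    return (lb_id << CONN_BITS) | conn_idx

def unpack_cookie(cookie: int) -> tuple:
    return cookie >> CONN_BITS, cookie & ((1 << CONN_BITS) - 1)

assert unpack_cookie(pack_cookie(3, 1000)) == (3, 1000)
```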
Implementation
Stores information about the connection mappings into the connections themselves. When a CHEETAH LB receives the first packet of a connection, it encodes the selected server’s identifier into a cookie that is permanently added to all the packet headers exchanged within this connection.
Decouples the load balancing logic from PCC support.
Cookies that can be processed fast and can only be interpreted by the LB
Both the stateless and stateful versions encode the connection-to-server mappings into the packet headers.
Overall Implementation:
- Built on top of FastClick
- Implement stateless and stateful CHEETAH LB on Tofino-based switch using P4.
- Embed the cookie into some of the bits of the TCP timestamp option
FastClick Implementation:
TCP options: a 1-byte kind, a 1-byte length, and the option data; options can appear in any order.
New problem: how to extract the timestamp option's TSecr field from a packet?
Solution: use different parsing rules for different option patterns.
For other packets (non-SYN/ACK):
99.95% of the packets have the following pattern: NOP (1B) + NOP (1B) + Timestamp (10B), possibly followed by other fields.
Etc…
Finally, we note that we can completely avoid the more complex parsing operations for SYNs and SYN/ACKs if servers use TCP SYN cookies.
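A Python sketch of this parsing strategy: a fast path for the dominant NOP+NOP+Timestamp layout, and a generic kind/length walk as a fallback (option layout per RFC 7323; the function name is ours):

```python
import struct

NOP, EOL, TS_KIND, TS_LEN = 1, 0, 8, 10

def extract_tsecr(options: bytes):
    """Return the TSecr field from raw TCP options, or None if absent."""
    # Fast path: 99.95% of non-SYN packets start with NOP, NOP, Timestamp(10B).
    if (len(options) >= 12 and options[0] == NOP and options[1] == NOP
            and options[2] == TS_KIND and options[3] == TS_LEN):
        _tsval, tsecr = struct.unpack("!II", options[4:12])
        return tsecr
    # Slow path: walk the kind/length-encoded options until Timestamp is found.
    i = 0
    while i + 1 < len(options):
        kind = options[i]
        if kind == EOL:
            break
        if kind == NOP:               # NOP has no length byte
            i += 1
            continue
        length = options[i + 1]
        if kind == TS_KIND and length == TS_LEN:
            return struct.unpack("!II", options[i + 2:i + 10])[1]
        i += max(length, 2)           # guard against malformed zero lengths
    return None
```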
Load balancing mechanisms.
We implemented several load balancing mechanisms that will be evaluated using multiple workloads, including:
- Power-of-2 choices, and
- A weighted round robin (WRR).
The weight of each server changes according to its relative CPU load, e.g., if a server is underutilized (its load is less than the average server load), we increase its weight (see the sketch below).
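A Python sketch of this adaptive WRR; the additive adjustment rule and step size are our assumptions, and the scheduler below is a standard smooth-WRR loop rather than the paper's exact algorithm:

```python
def adjust_weights(weights: dict, cpu_loads: dict, step: float = 0.1) -> dict:
    """Nudge each server's weight according to its relative CPU load."""
    avg = sum(cpu_loads.values()) / len(cpu_loads)
    for server, load in cpu_loads.items():
        if load < avg:                                  # underutilized: more traffic
            weights[server] += step
        elif load > avg:                                # overloaded: less traffic
            weights[server] = max(step, weights[server] - step)
    return weights

def wrr_schedule(weights: dict, n: int) -> list:
    """Produce the next n picks with smooth weighted round-robin."""
    credit = {s: 0.0 for s in weights}
    total = sum(weights.values())
    seq = []
    for _ in range(n):
        for s in credit:
            credit[s] += weights[s]                     # accrue credit each round
        nxt = max(credit, key=credit.get)               # pick the largest credit
        credit[nxt] -= total
        seq.append(nxt)
    return seq

weights = adjust_weights({"s1": 1.0, "s2": 1.0}, {"s1": 0.3, "s2": 0.8})
print(wrr_schedule(weights, 6))   # s1 (less loaded) now carries a higher weight
```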
P4-Tofino prototype
Rely on registers, which provide per-packet transactional memories, to store a counter that implements the weighted-round-robin LB.
Evaluation:
Measurement from three perspectives:
- Cost of packet processing,
- Load imbalances
- PCC support
Experiments steps:
- The LB runs on a dual-socket, 18-core machine, which is connected to a NoviFlow switch.
- 4 machines generate load toward the LB.
- 4 machines run up to 64 NGINX web servers to handle the requests (isolated using Linux network namespaces; each NGINX server has a dedicated virtual NIC using SR-IOV, allowing packets to be switched in hardware and received directly on the correct CPU core).
Experiment result and analysis
1. Packet processing Analysis
Observe:
- Compare packet processing time between stateless CHEETAH and stateful CHEETAH. Also compare with Beamer.
- The main result from this experiment is that stateless CHEETAH consumes almost the same number of CPU cycles per packet as the most optimized, hardware-assisted hash-based mechanism, and significantly fewer cycles than stateful approaches.
- Stateful CHEETAH outperforms cuckoo-hash based LBs.
Analysis:
The operation of obfuscating the cookie adds less than a 4-cycle hit.
We note that our stateless CHEETAH implementation uses server-side TCP timestamp correction (see Sect. 4), which imposes only a 0.2% performance hit on server processing time. If we were to use LB-side timestamp correction instead, stateless CHEETAH would modify the timestamp MSB at the LB at a cost of just 30 cycles per packet.
2. Load Imbalance Analysis
We expect the operator to choose a uniform round-robin LB mechanism to distribute the load, e.g., CHEETAH-RR.
We measure the 99th percentile flow completion time (FCT) tail latency for the increasing average server load
We measure the variance of the servers’ load over the experiment for an average server load of 60% and 16, 32, and 64 servers
Observe:
CHEETAH reduces the 99th percentile FCT by a factor of 2-3x compared to the best performing hash-based mechanism, i.e., Hash RSS.
The variance of RR is considerably smaller than that of hash-based methods.
CHEETAH improves FCT even with non-uniform workloads.
Analysis:
The variance of RR is considerably smaller than that of hash-based methods. This is because the load balancer iteratively spreads the incoming requests over the servers instead of randomly spreading them. In this specific scenario, CHEETAH allows operators to leverage RR, which would otherwise be impossible with today's load balancers.
3. PCC Analysis
We compare CHEETAH against Hash RSS, consistent hashing, and Beamer.
Observe:
Compared to Beamer, CHEETAH not only achieves better load balancing with AWRR (Sect. 5.2), but also does not break any connections.
Conclusions
- We introduced a novel building block for load balancers that guarantees PCC and supports any realizable LB mechanism.
- We implemented CHEETAH on both software switches and programmable Tofino ASIC switches.