Large-scale cluster management at Google with Borg

Posted on May 26, 2022 5 minute read ∼ Filed in : A paper note

Introduction

Contribution

It hides the details of resource management and failure handling, so users can focus on application development.
It provides high reliability and availability.
It supports workloads across tens of thousands of machines effectively.

User perspective

Workload

Production job: Long-running services which handle short-lived latency-sensitive requests.
Non-production job: Batch jobs that take from a few seconds to a few days to complete.

Cluster and cells

A cluster contains one or more cells, and each cell has many machines; e.g., the median cell size is about 10k machines.

The machines inside a cell are heterogeneous in many dimensions: CPU, RAM, disk, and network.

Borg isolates users from these differences by

determining which machine inside a cell should run a task.
allocating resources.
installing dependencies.
monitoring task health, and restarting tasks if they fail.

Job & tasks

Each job includes many tasks.

Each job runs in just one cell, while each task maps to a set of Linux processes.

Each job can have soft or hard scheduling requirements, and each task can override the job’s requirements.

A task can be updated by pushing a new configuration. Depending on the change, the task may:

be restarted.
no longer fit on its current machine, in which case it is rescheduled.
be updated in place.
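The three outcomes of a configuration push can be sketched as a small decision function. This is only an illustration: `TaskSpec`, `classify_update`, and the exact rules (a new binary forces a restart, a larger resource request that exceeds the machine's spare capacity forces a reschedule) are assumptions made for the example, not Borg's actual logic.

```python
from dataclasses import dataclass
from enum import Enum


class UpdateAction(Enum):
    RESTART = "restart"
    RESCHEDULE = "reschedule"
    IN_PLACE = "in-place"


@dataclass(frozen=True)
class TaskSpec:
    binary: str      # hypothetical field: version of the program to run
    cpu: float       # requested CPU cores
    ram_gib: float   # requested RAM in GiB
    priority: int


def classify_update(old, new, spare_cpu, spare_ram_gib):
    """Decide how a config push affects a running task.

    spare_cpu / spare_ram_gib: capacity still free on the task's current
    machine, on top of what the task already holds.
    """
    grows_cpu = new.cpu - old.cpu
    grows_ram = new.ram_gib - old.ram_gib
    if grows_cpu > spare_cpu or grows_ram > spare_ram_gib:
        return UpdateAction.RESCHEDULE   # no longer fits on this machine
    if new.binary != old.binary:
        return UpdateAction.RESTART      # assumed: a new binary needs a restart
    return UpdateAction.IN_PLACE         # e.g. only the priority changed
```

For example, changing only `priority` yields `IN_PLACE`, while raising `cpu` beyond the spare capacity yields `RESCHEDULE`.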

Allocs

An alloc is a reserved set of resources on a machine in which one or more tasks can be run. It can be used to:

set resources aside for future tasks.
retain resources between stopping a task and starting it again.
gather tasks from multiple jobs onto the same machine.

An alloc set reserves resources on multiple machines, and one or more jobs can be submitted to run in it.
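To make the idea concrete, here is a minimal sketch of a single alloc as a fixed reservation that tasks are started and stopped inside. The `Alloc` class, its fields, and the task names are hypothetical; the point is only that the reservation itself stays the same size no matter which tasks are currently running in it.

```python
from dataclasses import dataclass, field


@dataclass
class Alloc:
    cpu: float                                  # reserved CPU cores
    ram_gib: float                              # reserved RAM in GiB
    tasks: dict = field(default_factory=dict)   # task name -> (cpu, ram_gib)

    def free(self):
        """Resources inside the reservation not used by any task."""
        used_cpu = sum(c for c, _ in self.tasks.values())
        used_ram = sum(r for _, r in self.tasks.values())
        return self.cpu - used_cpu, self.ram_gib - used_ram

    def start(self, name, cpu, ram_gib):
        """Start a task inside the alloc if it fits; return success."""
        free_cpu, free_ram = self.free()
        if cpu > free_cpu or ram_gib > free_ram:
            return False
        self.tasks[name] = (cpu, ram_gib)
        return True

    def stop(self, name):
        """Stop a task; its share goes back to the alloc, not the machine."""
        self.tasks.pop(name, None)


alloc = Alloc(cpu=4.0, ram_gib=8.0)
alloc.start("web/0", 2.0, 4.0)        # task from one job
alloc.start("logsaver/0", 1.0, 1.0)   # task from another job, same machine
alloc.stop("web/0")                   # resources are retained for a restart
```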

Priority, quota, admission control

Each job has a priority. A high-priority job can preempt a low-priority job’s resources.

Each user has a quota, which caps the amount of resources the user’s jobs can ask for at a given time. Quota checking is part of admission control: a job with insufficient quota is rejected immediately on submission.
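A quota check at submission time could look roughly like the sketch below. The `Resources` type and the `admit` function are made up for illustration, and the quota is reduced to two dimensions (CPU and RAM); it only shows that the decision is made before any scheduling happens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: float
    ram_gib: float

    def fits_in(self, other):
        return self.cpu <= other.cpu and self.ram_gib <= other.ram_gib


def admit(request, quota, in_use):
    """Return True if the job may enter the pending queue.

    request: resources the new job asks for
    quota:   the user's total quota
    in_use:  quota already consumed by the user's admitted jobs
    """
    remaining = Resources(quota.cpu - in_use.cpu,
                          quota.ram_gib - in_use.ram_gib)
    return request.fits_in(remaining)


quota = Resources(cpu=100.0, ram_gib=400.0)
in_use = Resources(cpu=90.0, ram_gib=100.0)
print(admit(Resources(5.0, 50.0), quota, in_use))    # within quota
print(admit(Resources(20.0, 50.0), quota, in_use))   # rejected immediately
```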

The use of quota reduces the need for policies like Dominant Resource Fairness (DRF).

Borg can assign admin privileges to some users.

Naming and monitoring

To enable service discovery, Borg creates a stable “Borg name service” (BNS) name for each task that includes the cell name, job name, and task number. The task’s hostname and port are then written into Chubby under this name, so the RPC system can find the task endpoint by using the name.
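The sketch below illustrates why a stable name helps: the name is derived only from the cell, job, and task number, so it survives rescheduling while the endpoint behind it changes. The name layout in `bns_name` and the `NameRegistry` class (an in-memory dict standing in for Chubby) are assumptions for the example, not the real format or API.

```python
def bns_name(cell, job, task_index):
    """Build a stable, hypothetical BNS-style name for one task."""
    return f"{task_index}.{job}.{cell}"


class NameRegistry:
    """In-memory stand-in for the Chubby files that hold endpoints."""

    def __init__(self):
        self._endpoints = {}

    def publish(self, name, host, port):
        self._endpoints[name] = (host, port)

    def resolve(self, name):
        return self._endpoints[name]


registry = NameRegistry()
name = bns_name("cc", "jfoo", 50)

registry.publish(name, "machine-17", 8080)   # task starts
registry.publish(name, "machine-42", 9090)   # task is rescheduled elsewhere
print(name, registry.resolve(name))          # clients keep using the same name
```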

Borg also records job size and task health information into Chubby, so load balancers can see where to route requests.

Each task under Borg contains a built-in HTTP server that publishes information about its health and performance metrics. Borg monitors it and restarts failed tasks.

Borg records all job information and task events in Infrastore, a scalable read-only data store with an interactive SQL-like interface via Dremel. This data is used for usage-based charging, debugging job and system failures, and long-term capacity planning.

Borg Architecture.

Borgmaster

The Borgmaster keeps the state of its cell and replicates it with Paxos. It also takes checkpoints of that state and handles interaction with users.

Scheduler

The scheduler scans pending tasks from high to low priority and schedules each one if there are sufficient available resources. The scheduling algorithm has two parts:

Feasibility checking, which finds the machines on which the task could run.
Scoring, which picks one of the feasible machines. It takes into account user-specified preferences, but is mostly driven by built-in criteria such as:
- minimizing the number and priority of preempted tasks,
- picking machines that already have a copy of the task’s packages,
- spreading tasks across power and failure domains,
- packing quality including putting a mix of high and low priority tasks onto a single machine to allow the high-priority ones to expand in a load spike.
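The two phases can be sketched as a filter followed by an argmax. Everything here is a simplification for illustration: the `Machine` and `Task` types are hypothetical, feasibility only looks at CPU and RAM, and the score uses a single criterion from the list above (whether the machine already has the task's package).

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram_gib: float
    packages: set = field(default_factory=set)   # packages already installed


@dataclass
class Task:
    cpu: float
    ram_gib: float
    package: str


def feasible_machines(task, machines):
    """Phase 1: keep only machines with enough free resources."""
    return [m for m in machines
            if m.free_cpu >= task.cpu and m.free_ram_gib >= task.ram_gib]


def score(task, machine):
    """Phase 2 (toy version): prefer machines that already hold the package."""
    return 1.0 if task.package in machine.packages else 0.0


def place(task, machines):
    candidates = feasible_machines(task, machines)
    if not candidates:
        return None   # task stays pending
    return max(candidates, key=lambda m: score(task, m))
```

A real scorer would combine several weighted criteria; the structure (filter first, then rank only the survivors) is the part this sketch is meant to show.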

Some options for scoring:

E-PVM: minimizes the change in cost when placing a task.
Best-fit: fills machines as tightly as possible, which leaves some machines empty of user jobs.
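The difference between a tight-packing policy and a load-spreading policy can be seen in a few lines. Note the assumption: `spread` below is a simple "most free capacity wins" rule used as a stand-in for a spreading policy; it is not E-PVM's actual cost function. Only CPU is considered.

```python
def best_fit(task_cpu, free_cpu_by_machine):
    """Pick the feasible machine left with the least spare CPU."""
    feasible = {m: f for m, f in free_cpu_by_machine.items() if f >= task_cpu}
    if not feasible:
        return None
    return min(feasible, key=lambda m: feasible[m] - task_cpu)


def spread(task_cpu, free_cpu_by_machine):
    """Pick the feasible machine left with the most spare CPU."""
    feasible = {m: f for m, f in free_cpu_by_machine.items() if f >= task_cpu}
    if not feasible:
        return None
    return max(feasible, key=lambda m: feasible[m] - task_cpu)


free = {"m1": 8.0, "m2": 3.0, "m3": 2.0}
print(best_fit(2.0, free))   # fills the nearly full machine, m1 stays empty
print(spread(2.0, free))     # uses the emptiest machine, leaving headroom
```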

The scheduler can run as a separate process that repeatedly:

retrieves state changes from the master.
updates its local copy.
does a scheduling pass to assign tasks.
informs the master of the assignments.
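The four steps above can be sketched as one function call against a fake master. `FakeMaster`, its methods, and the first-fit placement are all invented for the example; the sketch only shows the shape of the loop (read state, work on a local copy, schedule by priority, report back).

```python
class FakeMaster:
    """Hypothetical stand-in for the Borgmaster's scheduler-facing interface."""

    def __init__(self, machines, pending):
        self.machines = dict(machines)   # machine name -> free CPU
        self.pending = list(pending)     # (task name, priority, cpu)
        self.assignments = {}

    def state_changes(self):
        return dict(self.machines), list(self.pending)

    def accept(self, assignments):
        self.assignments.update(assignments)


def scheduling_pass(master):
    machines, pending = master.state_changes()   # 1. retrieve state changes
    local_free = dict(machines)                  # 2. update the local copy
    assignments = {}
    # 3. scheduling pass: highest priority first, first machine that fits
    for task, _priority, cpu in sorted(pending, key=lambda p: -p[1]):
        for name, free in local_free.items():
            if free >= cpu:
                assignments[task] = name
                local_free[name] = free - cpu
                break
    master.accept(assignments)                   # 4. inform the master
    return assignments


master = FakeMaster({"m1": 2.0}, [("batch", 0, 2.0), ("prod", 200, 2.0)])
print(scheduling_pass(master))   # only the higher-priority task gets the machine
```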

Borglet

It starts and stops tasks, and restarts them if they fail.
It manages local resources by manipulating OS kernel settings.
It rolls over debug logs and reports the machine’s state to the master, etc.

The Borgmaster polls each Borglet every few seconds to retrieve the machine’s current state.

The Borgmaster updates the cell’s state after receiving a Borglet’s reply. For performance scalability, each Borgmaster replica runs a stateless link shard to handle the communication with some of the Borglets, and the partitioning is recalculated whenever a Borgmaster election occurs.
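As a rough picture of the link-shard idea, the sketch below splits Borglets across master replicas and recomputes the split when the replica set changes. The round-robin rule and all names are assumptions; the note above does not say how the partitioning is actually computed.

```python
def partition_borglets(borglets, replicas):
    """Assign each Borglet to exactly one replica's link shard (round-robin)."""
    shards = {replica: [] for replica in replicas}
    for i, borglet in enumerate(sorted(borglets)):
        shards[replicas[i % len(replicas)]].append(borglet)
    return shards


borglets = [f"machine-{i}" for i in range(10)]
replicas = ["master-0", "master-1", "master-2", "master-3", "master-4"]
shards = partition_borglets(borglets, replicas)

# After an election (here: one replica is gone) the partition is recalculated.
shards_after = partition_borglets(borglets, replicas[:4])
```

Because the shards are stateless, recomputing the assignment from scratch is enough; nothing has to be migrated between replicas.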

Scalability

A single Borgmaster can manage many thousands of machines in a cell, and several cells have arrival rates above 10 000 tasks per minute. A busy Borgmaster uses 10–14 CPU cores and up to 50 GiB RAM.

In order to achieve better scalability:

Run the scheduler in a separate process so it can operate in parallel with the other Borgmaster functions that are replicated for failure tolerance.
Each scheduler replica operates on a cached copy of the cell state (read from the master).
Add separate threads to the master to talk to the Borglets and respond to read-only RPCs.
Score caching: evaluating feasibility and scoring for each task is expensive, so Borg caches the scores until the properties of the machine or task change.
Equivalence classes: Borg only does feasibility and scoring for one task per equivalence class, a group of tasks with identical requirements.
Relaxed randomization: it is wasteful to calculate feasibility and scores for all the machines in a large cell, so the scheduler examines machines in a random order until it has found “enough” feasible machines to score, and then selects the best within that set.
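Two of the techniques above, equivalence classes and relaxed randomization, can be sketched in a few functions. The data layout (requirements as `(cpu, ram)` tuples), the `enough` parameter, and the best-fit rule used as the final scoring step are assumptions for illustration only.

```python
import random
from collections import defaultdict


def equivalence_classes(tasks):
    """Group task names by identical requirements.

    tasks: dict of task name -> (cpu, ram_gib)
    """
    classes = defaultdict(list)
    for name, requirements in tasks.items():
        classes[requirements].append(name)
    return dict(classes)


def sample_feasible(machines, requirements, enough, rng):
    """Scan machines in random order, stopping once `enough` are feasible.

    machines: dict of machine name -> (free_cpu, free_ram_gib)
    """
    names = list(machines)
    rng.shuffle(names)
    found = []
    for name in names:
        free_cpu, free_ram = machines[name]
        if free_cpu >= requirements[0] and free_ram >= requirements[1]:
            found.append(name)
            if len(found) >= enough:
                break
    return found


def pick_machine(machines, requirements, enough, rng):
    """Score only the sampled candidates (toy score: least leftover CPU)."""
    candidates = sample_feasible(machines, requirements, enough, rng)
    if not candidates:
        return None
    return min(candidates, key=lambda n: machines[n][0] - requirements[0])


tasks = {"a/0": (1.0, 2.0), "a/1": (1.0, 2.0), "b/0": (4.0, 8.0)}
machines = {f"m{i}": (float(i), 16.0) for i in range(10)}
rng = random.Random(0)

# Feasibility and scoring run once per class, not once per task.
for requirements, members in equivalence_classes(tasks).items():
    print(members, "->", pick_machine(machines, requirements, 3, rng))
```

The result is the best machine within the sample, not necessarily the best in the whole cell, which is the trade-off the technique accepts.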

Scheduling a cell’s entire workload from scratch typically took a few hundred seconds, but did not finish after more than 3 days when the above techniques were disabled.
