The Google File System
Introduction
The paper presents a distributed file system designed to support large-scale, data-intensive applications, and discusses many aspects of its design.
The largest GFS clusters:
- Provide hundreds of terabytes of storage
- Across thousands of disks on over a thousand machines
- Are accessed concurrently by hundreds of clients
GFS provides good performance, scalability, reliability, and availability. Its design is driven by the following observations:
- Component failures are the norm rather than the exception, caused by application bugs, operating system bugs, human errors, and hardware failures.
- Files are huge by traditional standards. Multi-GB files are common.
- Most files are mutated by appending new data rather than overwriting existing data.
- Co-designing the applications and the file system API benefits the overall system by increasing flexibility. For example (a small sketch follows this list):
  - GFS relaxes its consistency model to simplify the file system without imposing an onerous burden on applications.
  - GFS provides an atomic append operation, so multiple clients can append to the same file concurrently without extra synchronization.
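The real GFS client library is a C++ API that the paper does not spell out. As a rough illustration of the "no extra synchronization" point, here is a minimal Go sketch assuming a hypothetical client type with a `RecordAppend` method; the `Client` stub, its in-memory record slice, and the path `/results/merged` are invented for illustration and are not the actual GFS API.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a stand-in for a GFS client handle. In real GFS the client
// library talks to the master and chunkservers; this stub just keeps
// records in memory so the example runs on its own.
type Client struct {
	mu   sync.Mutex // internal to the stub, not something callers need
	data [][]byte
}

// RecordAppend appends a record and returns the offset the "file system"
// chose. In GFS the offset is picked by the primary chunkserver and each
// append is applied atomically, so callers need no locking of their own.
func (c *Client) RecordAppend(path string, record []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	offset := 0
	for _, r := range c.data {
		offset += len(r)
	}
	c.data = append(c.data, record)
	return offset, nil
}

func main() {
	client := &Client{}
	var wg sync.WaitGroup

	// Many producers append to the same file concurrently. None of them
	// coordinates with the others; the file system serializes the appends.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			rec := []byte(fmt.Sprintf("result from worker %d\n", id))
			off, err := client.RecordAppend("/results/merged", rec)
			if err != nil {
				fmt.Println("append failed:", err)
				return
			}
			fmt.Printf("worker %d appended at offset %d\n", id, off)
		}(i)
	}
	wg.Wait()
}
```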
Design Overview
Assumptions
- The system is built on inexpensive machines which often fail.
- The system stores a modest number of large files: a few million files, typically 100 MB or larger each.
- The workloads primarily consist of two kinds of reads (see the read sketch after this list):
  - Large streaming reads: individual operations read hundreds of KBs at contiguous offsets.
  - Small random reads: a few KBs at arbitrary offsets.
- The workloads also include many large, sequential writes that append data to files. Small writes at arbitrary positions are supported but need not be efficient.
- The system must efficiently support multiple clients concurrently appending to the same file.
- High sustained bandwidth is more important than low latency. Most target applications process data in bulk at a high rate, and few have stringent response-time requirements for individual reads or writes.
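To make the read-pattern assumption concrete, here is a small sketch of the access pattern GFS is tuned for: sequential reads in large buffers, where sustained throughput matters more than per-call latency. It reads a plain local file with Go's standard library rather than a real GFS client, and the 1 MB buffer size and file name are illustrative only.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("web-crawl.chunks") // illustrative file name
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()

	// Large streaming read: pull data in big chunks (1 MB here) and walk
	// the file sequentially, which favors sustained bandwidth over the
	// latency of any single read.
	buf := make([]byte, 1<<20)
	var total int64
	for {
		n, err := f.Read(buf)
		total += int64(n)
		// process(buf[:n]) would go here
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println("read failed:", err)
			return
		}
	}
	fmt.Printf("streamed %d bytes sequentially\n", total)
}
```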
Interface
GFS supports the usual operations: create, delete, open, close, read, and write.
It also provides two additional operations, snapshot and record append:
- Snapshot: creates a copy of a file or a directory tree at a low cost.
- Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual append (a sketch of such an interface follows this list).
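The paper does not specify the client API signatures, so the following is only a hedged Go sketch of what an interface covering these operations might look like; the `gfsclient` package, interface names, and method signatures are guesses for illustration, not the real GFS client library.

```go
package gfsclient

// FileSystem sketches the namespace-level operations described in the paper.
// The real GFS client is a C++ library; these signatures are illustrative.
type FileSystem interface {
	// Standard operations.
	Create(path string) error
	Delete(path string) error
	Open(path string) (Handle, error)

	// Snapshot copies a file or directory tree at low cost
	// (implemented with copy-on-write in GFS).
	Snapshot(srcPath, dstPath string) error
}

// Handle sketches per-file operations on an open file.
type Handle interface {
	ReadAt(p []byte, offset int64) (n int, err error)
	WriteAt(p []byte, offset int64) (n int, err error)

	// RecordAppend appends data atomically at an offset chosen by the
	// file system and returns that offset; it is safe for concurrent use
	// by many clients without extra synchronization.
	RecordAppend(data []byte) (offset int64, err error)

	Close() error
}
```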
Architecture