The Google File System
Introduction
The paper presents a distributed file system designed to support large-scale, data-intensive applications, and discusses many aspects of its design.
The largest GFS clusters:
- Provide hundreds of terabytes of storage
- Across thousands of disks on over a thousand machines
- Are accessed concurrently by hundreds of clients
GFS provides good performance, scalability, reliability, and availability. Its design is driven by the following observations:
- Component failures are the norm rather than the exception, caused by application bugs, operating system bugs, human errors, and hardware failures.
- Files are huge by traditional standards. Multi-GB files are common.
- Most files are mutated by appending new data rather than overwriting existing data.
- Co-designing the applications and the file system API benefits the overall system by increasing flexibility. For example (a small sketch follows this list):
  - GFS relaxes its consistency model to simplify the file system without imposing an onerous burden on applications.
  - GFS provides an atomic append operation, so multiple clients can append to the same file concurrently without extra synchronization.
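The real GFS client library is a C++ API that the paper does not spell out. As a rough illustration of the "no extra synchronization" point, here is a minimal Go sketch assuming a hypothetical client type with a `RecordAppend` method; the `Client` stub, its in-memory record slice, and the path `/results/merged` are invented for illustration and are not the actual GFS API.

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a stand-in for a GFS client handle. In real GFS the client
// library talks to the master and chunkservers; this stub just keeps
// records in memory so the example runs on its own.
type Client struct {
	mu   sync.Mutex // internal to the stub, not something callers need
	data [][]byte
}

// RecordAppend appends a record and returns the offset the "file system"
// chose. In GFS the offset is picked by the primary chunkserver and each
// append is applied atomically, so callers need no locking of their own.
func (c *Client) RecordAppend(path string, record []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	offset := 0
	for _, r := range c.data {
		offset += len(r)
	}
	c.data = append(c.data, record)
	return offset, nil
}

func main() {
	client := &Client{}
	var wg sync.WaitGroup

	// Many producers append to the same file concurrently. None of them
	// coordinates with the others; the file system serializes the appends.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			rec := []byte(fmt.Sprintf("result from worker %d\n", id))
			off, err := client.RecordAppend("/results/merged", rec)
			if err != nil {
				fmt.Println("append failed:", err)
				return
			}
			fmt.Printf("worker %d appended at offset %d\n", id, off)
		}(i)
	}
	wg.Wait()
}
```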
Design Overview
Assumptions
- The system is built on inexpensive machines which often fail.
- The system stores a modest number of large files: a few million files, typically 100 MB or larger each.
- The workloads primarily consist of two kinds of reads (see the read sketch after this list):
  - Large streaming reads: individual operations read hundreds of KBs at contiguous offsets.
  - Small random reads: a few KBs at arbitrary offsets.
- The workloads also include many large, sequential writes that append data to files. Small writes at arbitrary positions are supported but need not be efficient.
- The system must efficiently support multiple clients concurrently appending to the same file.
- High sustained bandwidth is more important than low latency. Most target applications process data in bulk at a high rate, and few have stringent response-time requirements for individual reads or writes.
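To make the read-pattern assumption concrete, here is a small sketch of the access pattern GFS is tuned for: sequential reads in large buffers, where sustained throughput matters more than per-call latency. It reads a plain local file with Go's standard library rather than a real GFS client, and the 1 MB buffer size and file name are illustrative only.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("web-crawl.chunks") // illustrative file name
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()

	// Large streaming read: pull data in big chunks (1 MB here) and walk
	// the file sequentially, which favors sustained bandwidth over the
	// latency of any single read.
	buf := make([]byte, 1<<20)
	var total int64
	for {
		n, err := f.Read(buf)
		total += int64(n)
		// process(buf[:n]) would go here
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println("read failed:", err)
			return
		}
	}
	fmt.Printf("streamed %d bytes sequentially\n", total)
}
```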
Interface
GFS supports the usual operations: create, delete, open, close, read, and write.
It also provides two additional operations, snapshot and record append:
- Snapshot: creates a copy of a file or a directory tree at a low cost.
- Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual append (a sketch of such an interface follows this list).
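The paper does not specify the client API signatures, so the following is only a hedged Go sketch of what an interface covering these operations might look like; the `gfsclient` package, interface names, and method signatures are guesses for illustration, not the real GFS client library.

```go
package gfsclient

// FileSystem sketches the namespace-level operations described in the paper.
// The real GFS client is a C++ library; these signatures are illustrative.
type FileSystem interface {
	// Standard operations.
	Create(path string) error
	Delete(path string) error
	Open(path string) (Handle, error)

	// Snapshot copies a file or directory tree at low cost
	// (implemented with copy-on-write in GFS).
	Snapshot(srcPath, dstPath string) error
}

// Handle sketches per-file operations on an open file.
type Handle interface {
	ReadAt(p []byte, offset int64) (n int, err error)
	WriteAt(p []byte, offset int64) (n int, err error)

	// RecordAppend appends data atomically at an offset chosen by the
	// file system and returns that offset; it is safe for concurrent use
	// by many clients without extra synchronization.
	RecordAppend(data []byte) (offset int64, err error)

	Close() error
}
```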
Architecture