Google File System (GFS) @Google

27 Mar 2017


This paper describes the basic ideas behind the design of the Google File System (GFS). It is remarkable that, more than a decade after publication, the system is still heavily used and its design widely adopted. Here are some notes on the paper.

Section 1. Background

  1. Scenario: performance, scalability, reliability, and availability are the four main considerations in the design. Since the workload is dominated by large files that grow by appending, the design focuses on append performance and an atomic record-append guarantee.

Section 2. Design Overview

  1. The storage unit is the chunk rather than the disk block, which keeps the amount of metadata the master must index small. A chunk is normally 64 MB. Small files exist but are not the optimization target. The POSIX API is not supported, for simplicity.
  2. High sustained bandwidth matters more than low latency here, since large streaming reads and writes dominate the workload.
  3. All file system metadata is kept on a single master machine, which avoids metadata-coherence issues across nodes.
  4. Work on the master is minimized as much as possible. For instance, clients never read file data through the master; they ask it which chunkserver to contact and then read from that chunkserver directly (see the read-path sketch after this list).
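
To make points 1 and 4 concrete, here is a minimal sketch in Go of the read path under the paper's 64 MB chunk size: the client turns a byte offset into a chunk index, asks the master for the chunk's handle and replica locations, and reads the data from a chunkserver. The Master and Chunkserver interfaces (LookupChunk, ReadChunk) are hypothetical stand-ins, not the real RPC protocol.

    package main

    import "fmt"

    const chunkSize = 64 << 20 // 64 MB, the fixed chunk size from the paper

    // Hypothetical stand-ins for the real RPC interfaces.
    type Master interface {
    	// Returns the chunk handle and chunkserver replica addresses
    	// for (file, chunk index). The client caches this reply.
    	LookupChunk(path string, chunkIndex int64) (handle uint64, replicas []string)
    }

    type Chunkserver interface {
    	ReadChunk(handle uint64, offset int64, length int) []byte
    }

    // Read shows the GFS read path: the master is consulted only for
    // metadata; the file data itself flows from a chunkserver.
    func Read(m Master, connect func(addr string) Chunkserver,
    	path string, offset int64, length int) []byte {

    	chunkIndex := offset / chunkSize  // which chunk holds the byte
    	chunkOffset := offset % chunkSize // where inside that chunk

    	handle, replicas := m.LookupChunk(path, chunkIndex)

    	// Read from any replica, typically the closest one.
    	cs := connect(replicas[0])
    	return cs.ReadChunk(handle, chunkOffset, length)
    }

    func main() {
    	fmt.Println(int64(200<<20) / chunkSize) // the byte at offset 200 MB lives in chunk index 3
    }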

Section 2.6 Metadata

  1. The master keeps less than 64 bytes of metadata per 64 MB chunk, so all metadata fits in the master's memory.
  2. The master keeps its state durable and replicated by flushing an operation log to persistent storage, locally and on remote machines, before acknowledging mutations (see the log/checkpoint sketch after this list).
  3. The master does not store chunk locations persistently; it simply asks chunkservers at startup (and via regular heartbeats). This is the easiest way to keep this information consistent, since each chunkserver has the final word on which chunks it holds.
  4. The master can build a checkpoint of its state while serving mutations at the same time, by switching to a new log file and checkpointing in a background thread. A checkpoint can be created in about a minute for a cluster with a few million files.
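
A minimal sketch of points 2 and 4 together, assuming an invented log-record format: every mutation is flushed to the operation log before the in-memory state changes, and a checkpoint is taken by switching to a fresh log file so that new mutations can proceed while the old state is serialized. The real master also replicates the log to remote machines and writes the checkpoint in a compact B-tree-like form.

    package main

    import (
    	"fmt"
    	"os"
    	"sync"
    )

    // master holds in-memory metadata plus a write-ahead operation log.
    type master struct {
    	mu    sync.Mutex
    	state map[string]string // toy stand-in for namespace metadata
    	log   *os.File
    	seq   int // which log file we are on
    }

    // apply logs the mutation durably before touching in-memory state,
    // so the state can always be rebuilt by replaying the log.
    func (m *master) apply(key, val string) error {
    	m.mu.Lock()
    	defer m.mu.Unlock()
    	if _, err := fmt.Fprintf(m.log, "SET %s %s\n", key, val); err != nil {
    		return err
    	}
    	if err := m.log.Sync(); err != nil { // flush before acknowledging
    		return err
    	}
    	m.state[key] = val
    	return nil
    }

    // checkpoint switches to a new log file, then serializes the old
    // state; new mutations go to the new log while serialization runs.
    func (m *master) checkpoint() error {
    	m.mu.Lock()
    	old := m.log
    	m.seq++
    	next, err := os.Create(fmt.Sprintf("oplog.%d", m.seq))
    	if err != nil {
    		m.mu.Unlock()
    		return err
    	}
    	m.log = next
    	snapshot := make(map[string]string, len(m.state))
    	for k, v := range m.state {
    		snapshot[k] = v // copy under the lock; real GFS checkpoints in a thread
    	}
    	m.mu.Unlock()

    	// Serialize outside the lock; mutations proceed against the new log.
    	ckpt, err := os.Create(fmt.Sprintf("checkpoint.%d", m.seq))
    	if err != nil {
    		return err
    	}
    	defer ckpt.Close()
    	for k, v := range snapshot {
    		fmt.Fprintf(ckpt, "%s %s\n", k, v)
    	}
    	return old.Close()
    }

    func main() {
    	f, err := os.Create("oplog.0")
    	if err != nil {
    		panic(err)
    	}
    	m := &master{state: map[string]string{}, log: f}
    	m.apply("/home/user/foo", "chunk-42")
    	m.checkpoint()
    }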

Section 2.7 Consistency Model

  1. After concurrent successful mutations, a file region is consistent (all clients see the same data) but undefined: the data may be mingled fragments from multiple writers. GFS applications therefore mutate files by appending rather than overwriting, and write self-validating records so that readers can use checksums to detect and skip padding or duplicates (see the sketch below).
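
A sketch of the self-validating records mentioned above. The framing ([length][crc32][payload]) is invented for illustration; the point is that a reader can verify each record independently and skip the padding or partial data that a failed record append may leave behind.

    package main

    import (
    	"bytes"
    	"encoding/binary"
    	"fmt"
    	"hash/crc32"
    )

    // encodeRecord frames a payload as [length][crc32][payload] so a
    // reader can validate it without trusting surrounding bytes.
    func encodeRecord(payload []byte) []byte {
    	buf := new(bytes.Buffer)
    	binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
    	binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(payload))
    	buf.Write(payload)
    	return buf.Bytes()
    }

    // decodeRecord returns the payload and true if the record at the
    // start of b is intact; padding or garbage fails the checksum.
    func decodeRecord(b []byte) ([]byte, bool) {
    	if len(b) < 8 {
    		return nil, false
    	}
    	n := binary.LittleEndian.Uint32(b[0:4])
    	sum := binary.LittleEndian.Uint32(b[4:8])
    	if n == 0 {
    		// All-zero padding would otherwise parse as an empty record
    		// with a matching zero CRC, so empty records are rejected.
    		return nil, false
    	}
    	if len(b) < int(8+n) {
    		return nil, false
    	}
    	payload := b[8 : 8+n]
    	return payload, crc32.ChecksumIEEE(payload) == sum
    }

    func main() {
    	rec := encodeRecord([]byte("append me atomically"))
    	if p, ok := decodeRecord(rec); ok {
    		fmt.Printf("valid record: %q\n", p)
    	}
    	// A region of zero padding (as left by a failed record append)
    	// fails validation and would be skipped by the reader.
    	_, ok := decodeRecord(make([]byte, 32))
    	fmt.Println("padding valid?", ok)
    }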

Section 3. System Interaction

  1. (Similar to the leases in Facebook’s memcache-based system.) The master grants a lease to one of a chunk’s replicas, which becomes the ‘primary’. The lease has an expiration time (initially 60 seconds, extendable via heartbeat messages) and can be revoked.
  2. Data flow is separated from control flow: data is pushed linearly along a chain of chunkservers chosen to exploit the network topology, so each machine’s full bandwidth is used.
  3. The primary assigns a serial order to all mutations on the chunk and replies to the client only after all replicas have applied them in that order (see the lease/ordering sketch after this list).
  4. When making a snapshot, the master first revokes any outstanding leases on the affected chunks, then duplicates the metadata copy-on-write; only when a snapshotted chunk is next written does each chunkserver holding it make a local copy. Copying on local disk instead of over the network saves time (see the copy-on-write sketch below).
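
A sketch of points 1 and 3, assuming invented struct and method names: the master hands out a time-limited lease (the 60-second initial term is from the paper), and the primary assigns serial numbers that fix the order in which every replica applies mutations.

    package main

    import (
    	"fmt"
    	"sync"
    	"time"
    )

    const leaseTerm = 60 * time.Second // initial lease timeout from the paper

    // lease records which replica is primary for a chunk and until when.
    type lease struct {
    	primary string
    	expires time.Time
    }

    type masterState struct {
    	mu     sync.Mutex
    	leases map[uint64]lease // chunk handle -> current lease
    }

    // grantLease makes one replica the primary for a chunk. If a valid
    // lease is already out, it is returned unchanged; real GFS also
    // extends leases via heartbeats and can revoke them (e.g. for snapshots).
    func (m *masterState) grantLease(handle uint64, replica string) lease {
    	m.mu.Lock()
    	defer m.mu.Unlock()
    	if l, ok := m.leases[handle]; ok && time.Now().Before(l.expires) {
    		return l
    	}
    	l := lease{primary: replica, expires: time.Now().Add(leaseTerm)}
    	m.leases[handle] = l
    	return l
    }

    // primary assigns a serial order to mutations; secondaries apply
    // them in exactly this order, so all replicas stay identical.
    type primary struct {
    	mu     sync.Mutex
    	serial int
    }

    func (p *primary) order(mutation string) (int, string) {
    	p.mu.Lock()
    	defer p.mu.Unlock()
    	p.serial++
    	return p.serial, mutation // forwarded to all secondaries
    }

    func main() {
    	m := &masterState{leases: map[uint64]lease{}}
    	l := m.grantLease(42, "chunkserver-a")
    	fmt.Println("primary:", l.primary)

    	p := &primary{}
    	for _, mut := range []string{"append A", "append B"} {
    		n, _ := p.order(mut)
    		fmt.Printf("serial %d: %s\n", n, mut)
    	}
    }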
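
Point 4’s copy-on-write can be sketched with the per-chunk reference counts the paper describes: a snapshot only bumps the counts, and the first write to a shared chunk triggers the local clone. The bookkeeping below is illustrative.

    package main

    import "fmt"

    // snapshotter holds refcounted chunk metadata on the master; a
    // snapshot duplicates metadata only, deferring data copies until
    // a chunk is actually written.
    type snapshotter struct {
    	refs map[uint64]int // chunk handle -> reference count
    	next uint64         // next fresh chunk handle
    }

    // snapshot: after revoking leases (not shown), just bump refcounts.
    func (s *snapshotter) snapshot(chunks []uint64) {
    	for _, h := range chunks {
    		s.refs[h]++
    	}
    }

    // writeChunk is called before the first mutation after a snapshot.
    // If the chunk is shared, the writer gets a fresh handle; real GFS
    // has every chunkserver holding h clone it on local disk under the
    // new handle, so the copy never crosses the network.
    func (s *snapshotter) writeChunk(h uint64) uint64 {
    	if s.refs[h] <= 1 {
    		return h // not shared; write in place
    	}
    	s.refs[h]--
    	s.next++
    	s.refs[s.next] = 1
    	return s.next
    }

    func main() {
    	s := &snapshotter{refs: map[uint64]int{1: 1}, next: 100}
    	s.snapshot([]uint64{1})
    	fmt.Println("write goes to chunk", s.writeChunk(1)) // 101, a local clone
    }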

Section 4. Master Operation

  1. Each path in the namespace (file or directory) has an associated read-write lock. To mutate a file or directory, an operation acquires read locks on every ancestor directory in its path and a write lock on the full path itself (see the locking sketch after this list).
  2. To avoid deadlock, locks are always acquired in a consistent total order: first by level in the namespace tree, then lexicographically within the same level.
  3. When choosing chunkservers for new replicas, the master prefers servers with below-average disk utilization, limits the number of ‘recent’ creations on each server (a creation predicts imminent write traffic), and spreads replicas across racks (see the placement sketch below).
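
A sketch of points 1 and 2, assuming an invented lock-table layout: one read-write lock per path, with read locks on ancestors, a write lock on the target, and all locks taken in (tree level, lexicographic) order so no two operations can deadlock.

    package main

    import (
    	"fmt"
    	"sort"
    	"strings"
    	"sync"
    )

    type nsLocks struct {
    	mu    sync.Mutex
    	locks map[string]*sync.RWMutex // one RW lock per path node
    }

    func (n *nsLocks) lockFor(path string) *sync.RWMutex {
    	n.mu.Lock()
    	defer n.mu.Unlock()
    	if n.locks[path] == nil {
    		n.locks[path] = &sync.RWMutex{}
    	}
    	return n.locks[path]
    }

    // ancestors of /d1/d2/leaf are /d1 and /d1/d2.
    func ancestors(path string) []string {
    	parts := strings.Split(strings.TrimPrefix(path, "/"), "/")
    	out := make([]string, 0, len(parts)-1)
    	for i := 1; i < len(parts); i++ {
    		out = append(out, "/"+strings.Join(parts[:i], "/"))
    	}
    	return out
    }

    // acquire takes read locks on every ancestor and a write lock on the
    // target, always in (depth, lexicographic) order to prevent deadlock.
    func (n *nsLocks) acquire(target string) (release func()) {
    	type req struct {
    		path  string
    		write bool
    	}
    	reqs := []req{{target, true}}
    	for _, a := range ancestors(target) {
    		reqs = append(reqs, req{a, false})
    	}
    	sort.Slice(reqs, func(i, j int) bool {
    		di := strings.Count(reqs[i].path, "/")
    		dj := strings.Count(reqs[j].path, "/")
    		if di != dj {
    			return di < dj // shallower paths first
    		}
    		return reqs[i].path < reqs[j].path // then lexicographic
    	})
    	for _, r := range reqs {
    		if r.write {
    			n.lockFor(r.path).Lock()
    		} else {
    			n.lockFor(r.path).RLock()
    		}
    	}
    	return func() {
    		for i := len(reqs) - 1; i >= 0; i-- { // release in reverse
    			if reqs[i].write {
    				n.lockFor(reqs[i].path).Unlock()
    			} else {
    				n.lockFor(reqs[i].path).RUnlock()
    			}
    		}
    	}
    }

    func main() {
    	ns := &nsLocks{locks: map[string]*sync.RWMutex{}}
    	release := ns.acquire("/home/user/foo") // read /home, /home/user; write /home/user/foo
    	fmt.Println("holding locks for /home/user/foo")
    	release()
    }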
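
A sketch of point 3’s placement heuristic; the paper states the criteria (below-average disk utilization, a cap on recent creations, spreading replicas across racks) but no formula, so the selection logic and the cap value below are invented.

    package main

    import "fmt"

    type chunkserver struct {
    	addr            string
    	diskUtilization float64 // fraction of disk in use
    	recentCreates   int     // creations in the recent window
    	rack            string
    }

    const maxRecentCreates = 10 // hypothetical cap; a creation predicts imminent write traffic

    // pickReplicas chooses up to n servers for a new chunk, favoring
    // below-average utilization and spreading replicas across racks.
    func pickReplicas(servers []chunkserver, n int) []chunkserver {
    	var avg float64
    	for _, s := range servers {
    		avg += s.diskUtilization
    	}
    	avg /= float64(len(servers))

    	var chosen []chunkserver
    	usedRacks := map[string]bool{}
    	for _, s := range servers {
    		if len(chosen) == n {
    			break
    		}
    		if s.diskUtilization > avg { // rule 1: below-average utilization
    			continue
    		}
    		if s.recentCreates >= maxRecentCreates { // rule 2: limit recent creations
    			continue
    		}
    		if usedRacks[s.rack] { // rule 3: spread across racks
    			continue
    		}
    		chosen = append(chosen, s)
    		usedRacks[s.rack] = true
    	}
    	return chosen
    }

    func main() {
    	servers := []chunkserver{
    		{"cs-a", 0.30, 2, "rack1"},
    		{"cs-b", 0.80, 1, "rack1"},  // too full
    		{"cs-c", 0.40, 12, "rack2"}, // too many recent creations
    		{"cs-d", 0.35, 3, "rack2"},
    	}
    	for _, s := range pickReplicas(servers, 3) {
    		fmt.Println("placing replica on", s.addr)
    	}
    }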