The Role of Distributed State

Posted on January 19, 2022 9 minute read ∼ Filed in : A paper note

Introduction

Problems

In distributed systems, the overall state of the system is partitioned among server machines. But there are two potential issues.

Some state must be accessed in a different fashion than other state.
If one machine crashes, some but not all of the overall state is lost.

Solutions

The act of building a distributed system consists of making trade-offs among various alternatives for managing the distributed state.

Finally, the paper concludes that there is no perfect solution to managing distributed state: each system designer must choose a particular approach based on the needs of his or her particular environment.

Why Is Distributed State Good?

Distributed state provides performance, coherency, and reliability in a distributed system.

Performance: Each server reads some state locally and avoids retrieving information from a remote machine.
Coherency: Each party knows something about the other so they can work together effectively, e.g., the sequence number (sequenceNum) in an exactly-once protocol.
Reliability: Replicating state makes it possible to recover when one copy is lost.
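
The sequence-number example can be made concrete with a small sketch. It is only an illustration, not taken from the paper: the `Receiver` class, its fields, and the "add to a counter" operation are hypothetical, and it assumes each sender has at most one request outstanding at a time.

```python
class Receiver:
    """Applies each request once by remembering per-sender sequence numbers.

    Hypothetical sketch: assumes one outstanding request per sender.
    """

    def __init__(self):
        self.last_seq = {}    # sender -> sequence number of the last applied request
        self.last_reply = {}  # sender -> reply that was sent for that request
        self.counter = 0      # the state that requests modify

    def handle(self, sender, seq, amount):
        if self.last_seq.get(sender) == seq:
            # Retransmission of a request we already applied:
            # resend the saved reply instead of applying it again.
            return self.last_reply[sender]
        self.counter += amount
        self.last_seq[sender] = seq
        self.last_reply[sender] = self.counter
        return self.counter
```

The shared knowledge here is the sequence number itself: the sender knows which number it used, and the receiver remembers the last one it applied, so a retransmitted request is recognized rather than executed twice.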

Why Is Distributed State Bad?

Distributed state introduces four problems: consistency, crash sensitivity, time and space overheads, and complexity.

Consistency

Problem

Delays in updating duplicate copies can leave the system inconsistent.

Solutions

Detect stale data on use, and fetch the latest copy.
Prevent inconsistency: wait until the system reaches a consistent state.
Tolerate inconsistency: reading stale data is allowed.

Crash sensitivity

Problem

It is rare for state to be fully replicated, so a backup machine usually cannot recover the full state that existed on the failed machine.

Full replication of state can solve this problem, but the communication protocols must meet the following rules.

After the primary machine fails, senders must be able to redirect message traffic to the replacement machine.
When a failure occurs, the protocols must determine the latest copies and bring out-of-date copies back to consistency without waiting for the failed machine to reboot.
When the failed machine reboots, the protocols must bring its state back to consistency with the others.

Time and Space overheads

Problem

Maintaining the consistency of distributed state incurs time overheads, e.g.,

the state must be checked every time it is used,
some party must track the distributed copies and notify the others about changes.

Keeping distributed copies across the cluster incurs storage overheads.

Solutions

The overhead problems are closely related to the degree of sharing and rate of modification.

If information is not widely shared, then there need not be many copies of the information.
If shared information is updated frequently, the cost of maintaining consistency becomes higher than the cost of communicating with a central server on each use, so a centralized approach to state management is the better choice.
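
To see why the update rate matters, here is a deliberately crude message-count model. It is my own simplification rather than anything from the paper: it assumes reads of a local copy are free, every update costs one message per other copy, and every use of a central server costs one message. The function names and the numbers are made up for illustration.

```python
def replicated_messages(reads, updates, copies):
    """Messages when every sharer keeps a copy.

    Assumption: reads are served locally (so `reads` adds nothing), and each
    update is propagated with one message to each of the other copies.
    """
    return updates * (copies - 1)


def centralized_messages(reads, updates):
    """Messages when a central server holds the only copy.

    Assumption: every use, read or update, costs one message to the server.
    """
    return reads + updates


# Read-mostly sharing favors copies; update-heavy sharing favors the server.
read_mostly = (replicated_messages(1000, 10, 5), centralized_messages(1000, 10))
update_heavy = (replicated_messages(100, 1000, 5), centralized_messages(100, 1000))
```

With five copies, the read-mostly workload needs 40 messages when replicated versus 1010 when centralized, while the update-heavy workload needs 4000 versus 1100, which matches the rule of thumb above.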

Complexity

Dealing with consistency and debugging a distributed system are complex, which also makes it hard to tune system performance.

Case study - NFS

NFS System - `Statelessness and idempotency`

idempotent

The second important characteristic of NFS is that almost all of its operations are idempotent: repeating an operation has the same effect as performing it once.
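
A small sketch of what idempotency means for file operations. The two helper functions are hypothetical and operate on an in-memory `bytearray`, not on the real NFS protocol; the point is only that a write at an explicit offset can be repeated safely, while an append cannot.

```python
def write_at(data, offset, payload):
    """Idempotent: writing the same bytes at the same offset twice
    leaves the file exactly as it was after the first write."""
    end = offset + len(payload)
    if end > len(data):
        data.extend(b"\x00" * (end - len(data)))  # grow the file if needed
    data[offset:end] = payload


def append(data, payload):
    """Not idempotent: every repetition makes the file longer."""
    data.extend(payload)
```

If a reply is lost and the client resends `write_at(f, 0, b"J")`, the file ends up the same; resending `append(f, b"!")` would add a second `!`.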

stateless

In NFS, the distributed state is kept almost exclusively on the clients. Servers do not store any information about their clients except for a list indicating which clients are allowed to access which disk partitions. This distributed state includes the following

Advantages of NFS

Easy handling of server crashes

The clients will detect the timeouts and simply retry their requests until eventually the server reboots and the requests succeed.

All important server state is on disk so nothing is lost during the crash.
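
The client side of this recovery story fits in a few lines. This is a minimal sketch under my own assumptions: `ServerDown` and `call_with_retry` are hypothetical names, a timeout is modeled as an exception, and the retry is only safe because the operations are idempotent.

```python
import time


class ServerDown(Exception):
    """Stands in for a request that timed out (hypothetical)."""


def call_with_retry(request, delay=1.0, sleep=time.sleep):
    """Retry `request` until it succeeds; return (result, attempts).

    Safe only if the request is idempotent, because a retried request
    may already have been executed by the server.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return request(), attempts
        except ServerDown:
            sleep(delay)  # wait, then try again once the server may be back
```

Nothing on the server has to be rebuilt after the reboot: the client just keeps asking until an answer arrives.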

Simple development

Simple interactions between clients and servers make NFS easy to build. NFS has been implemented on variants of the UNIX operating system.

Disadvantages of NFS

Statelessness makes the NFS protocol suffer from three major weaknesses: performance, consistency, and semantics.

Performance

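Every client write must reach the server’s disk before the server replies. A write updates not only the data block but also the file’s descriptor and index blocks, which causes write amplification.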

Optimization:

When descriptors and index blocks are repeatedly written, as described above, the writes are made to a non-volatile memory cache. Only a single disk write is necessary when the cache is full. Because the cache is non-volatile, it survives server reboots just as well as the disk.
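
The effect of the non-volatile cache can be sketched as follows. This is a toy model, not how any real NFS server is built: `NonVolatileWriteCache` is a hypothetical class, a Python dict stands in for both the non-volatile memory and the disk, and the flush policy (write everything out once the cache holds `capacity` blocks) is an assumption.

```python
class NonVolatileWriteCache:
    """Absorbs repeated writes to the same block and counts real disk writes."""

    def __init__(self, capacity, disk):
        self.capacity = capacity  # number of distinct blocks the cache can hold
        self.disk = disk          # dict: block id -> data, stands in for the disk
        self.pending = {}         # dict: block id -> latest data, the cache contents
        self.disk_writes = 0

    def write(self, block_id, data):
        # Overwriting an entry replaces the older version, so repeated
        # writes to a descriptor or index block cost no extra disk I/O.
        self.pending[block_id] = data
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        # One disk write per distinct block, however often it was rewritten.
        for block_id, data in self.pending.items():
            self.disk[block_id] = data
            self.disk_writes += 1
        self.pending.clear()
```

Rewriting the same descriptor and index block a hundred times each leaves two entries in the cache and no disk traffic; only when the cache fills do the latest versions go to disk, once per block.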

Consistency (across the clients)

Servers do not keep track of which clients are using which files. If one client modifies a file, there is no way for the server to notify other clients that have cached the old contents of the file.

Each client manages its own state.

Check fetch time: Whenever a file is accessed on a client, the client checks how recently the file’s attributes were fetched from the server. If the attributes are more than a few seconds old, the client refetches them.
Check last-modified time: If the last-modified time in the new attributes does not match the last-modified time in the client’s old copy of the attributes, the client drops its cached data for the file and loads the latest data into its cache.

This approach ensures that each client eventually receives up-to-date information, but it permits windows of inconsistency in which stale data may be used. (Attributes up to a few seconds old are trusted, so another client can modify the file during those few seconds without being noticed.) Because of this, NFS cannot be used for applications that require consistency.
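
The two checks and the resulting inconsistency window can be sketched like this. It is a simplified model, not the actual NFS client: `FakeServer`, `NfsClientCache`, and the 3-second `max_attr_age` are assumptions made for illustration, and time is passed in explicitly as `now` so the window is easy to see.

```python
class FakeServer:
    """Stand-in for an NFS server: path -> (last-modified time, data)."""

    def __init__(self):
        self.files = {}

    def write(self, path, data, mtime):
        self.files[path] = (mtime, data)

    def get_mtime(self, path):
        return self.files[path][0]

    def read(self, path):
        return self.files[path][1]


class NfsClientCache:
    def __init__(self, server, max_attr_age=3.0):
        self.server = server
        self.max_attr_age = max_attr_age  # "a few seconds" (assumed value)
        self.entries = {}  # path -> {"mtime", "data", "fetched_at"}

    def read(self, path, now):
        entry = self.entries.get(path)
        # Check fetch time: refetch attributes only if they are too old.
        if entry is None or now - entry["fetched_at"] > self.max_attr_age:
            mtime = self.server.get_mtime(path)
            # Check last-modified time: drop cached data if the file changed.
            if entry is None or entry["mtime"] != mtime:
                entry = {"mtime": mtime, "data": self.server.read(path)}
            entry["fetched_at"] = now
            self.entries[path] = entry
        return entry["data"]
```

A read that happens one second after another client’s write still returns the old data, because the cached attributes are considered fresh; a read ten seconds later notices the new last-modified time and reloads the file.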

NFS uses a write-through-on-close policy to reduce windows of inconsistency. Whenever a file is closed on a client machine, the client immediately transmits the file’s modified data back to the server. The close operation does not complete until the data is safely on the server’s disk. However, write-through-on-close delays the close, and it also puts unnecessary load on the server and its disk, e.g., temporary data is transmitted as well.

Semantics

Statelessness and idempotency make it hard to add new features such as file locking.

Case study - Sprite File System

Sprite File System - Stateful

Sprite provides high performance and clean semantics, but it is more complex and faces more difficult crash recovery problems.

Three additional pieces of distributed state are kept in Sprite:

Clients retain modified file blocks in their main memories, and wait 30 seconds, or until the information is needed by some other client, before writing them back to the server.
Servers retain modified file blocks in their main memories, and wait 30 seconds before writing them to disk.
Servers use main memory to record which workstations are reading or writing which files. This requires clients to notify servers whenever files are opened or closed, but allows the servers to enforce consistency.

Operations:

Client opens a file:

The client sends an open request to the server.
If the file has been modified by another client, the server retrieves the data from that client before returning.
The server returns a version number to the client. If the client holds stale data, the version number will not match the version associated with that data, so the file is purged from the client’s cache.
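
The version check on open can be sketched as below. This is a reduced model under my own assumptions: `SpriteServer` and `SpriteClient` are hypothetical names, the server bumps the version on every store, and the step where the server recalls dirty data from another client is left out.

```python
class SpriteServer:
    """Keeps the current data and a version number for each file."""

    def __init__(self):
        self.version = {}  # path -> current version number
        self.data = {}     # path -> current contents

    def store(self, path, data):
        # Assumption: every new version of the file bumps its version number.
        self.version[path] = self.version.get(path, 0) + 1
        self.data[path] = data

    def open(self, path):
        return self.version[path]

    def read(self, path):
        return self.data[path]


class SpriteClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}   # path -> (version, data)
        self.fetches = 0  # how many times data was fetched from the server

    def open(self, path):
        version = self.server.open(path)  # every open goes to the server
        cached = self.cache.get(path)
        if cached is not None and cached[0] != version:
            del self.cache[path]          # stale copy: purge it
        if path not in self.cache:
            self.cache[path] = (version, self.server.read(path))
            self.fetches += 1
        return self.cache[path][1]
```

Reopening an unchanged file costs one small message but no data transfer, while reopening a changed file always discards the old copy first.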

Client closes a file:

The client sends a close request to the server.
The server closes the file, and the client can continue processing immediately without waiting for the close to complete on the server.

Advantages

Consistency

Version numbers ensure that each read operation returns the most recently written data for that file.

Sprite’s stateful approach also allowed file locking to be implemented easily.

If multiple clients try to write the same file, no client may cache it.
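
Because the server knows who has each file open, it can decide when caching is safe. The sketch below is an assumption-laden illustration: `OpenFileTable` is a hypothetical name, and the rule it encodes (caching is disabled while a file is open on more than one client and at least one of them is writing) is my reading of the behavior summarized in these notes.

```python
from collections import defaultdict


class OpenFileTable:
    """Server-side record of which clients are reading or writing which files."""

    def __init__(self):
        self.readers = defaultdict(set)  # path -> clients with the file open for reading
        self.writers = defaultdict(set)  # path -> clients with the file open for writing

    def open(self, path, client, for_write=False):
        table = self.writers if for_write else self.readers
        table[path].add(client)
        return self.cacheable(path)

    def close(self, path, client):
        self.readers[path].discard(client)
        self.writers[path].discard(client)

    def cacheable(self, path):
        users = self.readers[path] | self.writers[path]
        # Assumed rule: a writer plus any other user means everyone must
        # go through the server instead of using a local cache.
        return not (self.writers[path] and len(users) > 1)
```

This is exactly the state an NFS server refuses to keep, which is why NFS cannot offer the same guarantee.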

Performance

The client’s delayed write-back reduces communication overhead.

The server’s delayed write reduces write amplification.

Clients need not wait for information to be written to disk when they close files.
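
A sketch of why the delayed write-back saves traffic. It is a toy model with hypothetical names (`DelayedWriteCache`, `tick`), and it assumes the only triggers for sending data are the 30-second age limit; recalls by other clients are ignored.

```python
class DelayedWriteCache:
    """Client cache that holds dirty files for `delay` seconds before sending them."""

    def __init__(self, delay=30.0):
        self.delay = delay
        self.dirty = {}  # path -> (data, time of the write)
        self.sent = []   # paths actually transmitted to the server

    def write(self, path, data, now):
        self.dirty[path] = (data, now)

    def delete(self, path):
        # A file deleted before the delay expires is never transmitted.
        self.dirty.pop(path, None)

    def tick(self, now):
        # Send every dirty file that has waited at least `delay` seconds.
        for path, (data, written_at) in list(self.dirty.items()):
            if now - written_at >= self.delay:
                self.sent.append(path)
                del self.dirty[path]
```

A temporary file that is created and removed within a few seconds never leaves the client, whereas write-through-on-close would have sent it to the server’s disk.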

Limitations

complexity, recovery, performance, and space overhead.

complexity

Managing the server’s state is complex (e.g., avoiding deadlocks and race conditions).

recovery

State kept in clients and servers can be lost in a crash.

The file caches are flushed (to disk in the case of servers; to servers in the case of clients), but a crash still loses the information about which clients are using which files.

Solution

Clients keep the file state locally, and a new operation was added to the Sprite protocol: reopen. When a server reboots, each client reopens all of its files by passing the server a copy of its state information (including information such as which files are open for reading or writing, which are locked, etc.). The server can then reconstruct its state.

But this solution may cause a recovery storm (overloading the server to the point where it cannot respond to requests in a timely fashion) when hundreds of files are open at the time of the crash.
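
The reopen operation described above can be sketched as follows. The names (`RecoveringServer`, `reopen`, `users_of`) and the shape of the client state (a dict from path to open mode) are assumptions for illustration; the real protocol carries more information, such as locks.

```python
class RecoveringServer:
    """A freshly rebooted server that rebuilds its open-file table from clients."""

    def __init__(self):
        self.open_files = {}  # (client, path) -> mode, empty after a reboot

    def reopen(self, client, client_state):
        # Each client replays its own view: which files it has open and how.
        for path, mode in client_state.items():
            self.open_files[(client, path)] = mode

    def users_of(self, path):
        return sorted(c for (c, p) in self.open_files if p == path)
```

Each reopen is cheap, but every client sends one at about the same moment after the reboot, which is where the recovery storm comes from.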

Slow open

Each open and close must be reflected through to the file’s server.

In contrast, NFS clients never contact servers during closes except to write new data; during opens, an NFS client need not contact the server as long as it has up-to-date attributes cached for the file.

Space overhead

A file server needs several hundred bytes of storage for each open file to keep track of the file’s usage, and may have many thousands of files open at once. As a result, the storage required for file state grossly exceeded the Sprite authors’ initial estimates, causing internal memory limits in the Sprite kernel to be exceeded.

Usage state information typically occupies about 15-20% as much space as the file data (many megabytes in larger configurations).

Conclusion

The stateless model has lower performance than the stateful one (it is limited by disk performance).

Distributed state incurs complexity, so we should keep as little state as possible.

Merge recovery with normal operation to reduce the complexity of recovery and to facilitate testing.

NFS

Performance

Every client write request is written synchronously to the server’s disk.

Consistency

Other clients’ caches are not notified of updates, so clients may read inconsistent data.
NFS’ solution:
- periodic polling
- clients eventually receive updates
- polling affects performance

SPRITE

Advantages:

Consistency: if one client writes a file, the server notifies the other clients that are reading it and makes them read from the server only.
Performance

Disadvantages:

complexity
durability and recovery
