Chapter 8 The Trouble with Distributed Systems

Anything that can go wrong will go wrong.

Faults and partial failures

Differences between single-machine and distributed systems: Software on a single machine behaves fairly predictably. When the hardware is working correctly, the same operation produces the same result, and a hardware problem usually brings down the whole system rather than part of it. Distributed systems, by contrast, suffer partial failures: some components break in unpredictable ways while the rest keep working. This nondeterminism is what makes distributed systems hard to work with.
Fault handling in different kinds of computing systems: Supercomputers in high-performance computing (HPC) are mostly used for scientific computing. Jobs typically checkpoint their computational state to disk from time to time; when a node fails, the whole cluster workload is usually stopped and later restarted from the last checkpoint. In effect, a partial failure is escalated to a total failure. Cloud computing is associated with multi-tenant data centers built from commodity machines, which have higher failure rates. Such systems need built-in fault-tolerance mechanisms so that the service as a whole keeps running without interruption while individual nodes fail.
Building reliable systems from unreliable components: A more reliable system can be built on top of less reliable parts, for example with error-correcting codes or with TCP, but there is always a limit to how much reliability can be added. In a distributed system you must accept that partial failures will happen, build fault tolerance into the software, think through a wide range of failure scenarios and test them, and never simply assume that faults are rare.

Unreliable networks

Network problems and how they are handled: The network is the only way machines in a distributed system can communicate, and most such networks are asynchronous packet networks. A request may be lost, it may wait in a queue, the remote node may have failed, or its response may be delayed. The sender cannot tell these cases apart. The usual remedy is a timeout, but choosing one is a dilemma: a long timeout means a long wait before a dead node is detected, while a short timeout risks declaring a live node dead.
Network faults in practice: Although computer networks have been built for decades, faults remain common. Data center networks and public cloud services are both affected, and human error is a major cause of outages. Network partitions and other faults can have serious consequences if they are not handled properly, so you need to define how your software reacts to network problems and test that behavior.
Detecting faults: It is hard to tell whether a node has failed. Sometimes there is explicit feedback: the operating system closes the TCP connection when the process has crashed, a script notifies other nodes of the crash, a switch can be queried for link status, or a router replies that the destination is unreachable. None of these can be fully relied on, and in most cases there is no feedback at all, so in general you have to retry and use a timeout to decide whether the node is down.
- Why faulty nodes need to be detected automatically
  - A load balancer needs to stop sending requests to a dead node (that is, take it out of rotation).
  - In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader.
Network congestion and queueing: Variability in packet delay mostly comes from queueing, which occurs in switches, in the operating system, and in virtualized environments. TCP flow control adds further delay. Applications differ in how sensitive they are to latency: video conferencing, for example, uses UDP and gives up reliability in exchange for low latency. Most systems have to determine a suitable timeout experimentally or with the help of a failure detector.
- Variability in packet delay on computer networks is usually due to queueing
  - Switch queues fill up
  - The destination machine's CPU is busy
  - Multiple virtual machines compete for the CPU
  - TCP performs flow control
- Dealing with network congestion in multi-tenant data centers
  - Cause: In public clouds and multi-tenant data centers, resources are shared among many customers: network links and switches, and even each machine's network interface and CPU (when running on virtual machines).
  - Solution: Choose timeouts experimentally. Measure the distribution of network round-trip times over an extended period and across many machines to determine the expected variability of delays. Then, taking the application's characteristics into account, pick a trade-off between failure detection delay and the risk of premature timeouts.
  - A better approach: instead of a fixed constant timeout, continuously measure response times and their variability and adjust the timeout automatically.
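
The "better approach" above can be sketched as a timeout that follows the observed round-trip times. This is a minimal sketch, not a prescribed implementation: it borrows the smoothing scheme TCP uses for its retransmission timer (a smoothed mean plus a multiple of the smoothed deviation), and the class name `AdaptiveTimeout`, its parameters, and the default values are assumptions made for illustration.

```python
class AdaptiveTimeout:
    """Derive a timeout from observed round-trip times (TCP-style smoothing)."""

    def __init__(self, initial_timeout=1.0, alpha=0.125, beta=0.25, k=4.0,
                 min_timeout=0.01):
        self.initial_timeout = initial_timeout  # used before any measurement
        self.alpha = alpha    # weight of a new sample in the smoothed mean
        self.beta = beta      # weight of a new sample in the smoothed deviation
        self.k = k            # how many deviations of slack to allow
        self.min_timeout = min_timeout
        self.srtt = None      # smoothed round-trip time, in seconds
        self.rttvar = None    # smoothed deviation of the round-trip time

    def observe(self, rtt):
        """Record one measured round-trip time (seconds)."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the deviation first, using the previous smoothed mean.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

    def timeout(self):
        """Current timeout: smoothed mean plus k smoothed deviations."""
        if self.srtt is None:
            return self.initial_timeout
        return max(self.min_timeout, self.srtt + self.k * self.rttvar)


detector = AdaptiveTimeout()
for sample in [0.10, 0.11, 0.09, 0.10, 0.45]:  # the last sample is a latency spike
    detector.observe(sample)
print(f"current timeout: {detector.timeout():.3f} s")
```

When response times are steady the timeout shrinks toward the typical round-trip time, which keeps failure detection fast; a burst of jitter widens it again, which lowers the risk of declaring a slow node dead.
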
Synchronous vs. asynchronous networks: Compare a data center network with the traditional fixed-line telephone network. The telephone network is synchronous: when a call is set up, a circuit with a fixed amount of bandwidth is allocated, which guarantees a fixed maximum end-to-end latency (called a bounded delay). Data center networks and the Internet instead use packet switching, which is optimized for bursty traffic. They suffer from congestion, queueing, and unbounded delays, and with current technology it is difficult to guarantee network latency or reliability.
Latency and resource utilization: Variable delays can be seen as a consequence of dynamic resource partitioning. A telephone circuit is a static allocation, whereas Internet bandwidth is shared dynamically. Each has trade-offs: static allocation achieves guaranteed latency at the cost of low utilization and higher expense, while dynamic sharing achieves high utilization at the cost of variable delays.

Unreliable clocks

Kinds of clocks and their properties: Modern computers have two kinds of clocks, a time-of-day clock (real-time clock) and a monotonic clock. The time-of-day clock returns the current date and time according to a calendar. It is usually synchronized with NTP, but it can jump and has limited accuracy, so it is unsuitable for measuring elapsed time. The monotonic clock is meant for measuring durations: it is guaranteed to always move forward, but its absolute value is meaningless and values from different computers cannot be compared. It is the better choice for timeouts and other elapsed-time measurements in a distributed system.
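
As a small illustration of the distinction, Python's standard library exposes both kinds of clock: `time.time()` reads the time-of-day clock and `time.monotonic()` reads the monotonic clock. The helper name `timed` below is made up for this sketch.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start  # never negative, even if NTP steps the wall clock
    return result, elapsed


# Time-of-day clock: seconds since the Unix epoch. Useful as a timestamp,
# but it may jump if the system clock is adjusted, so do not subtract two readings
# to measure a duration.
wall_clock_now = time.time()

result, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"wall clock: {wall_clock_now:.0f}, computation took {elapsed:.4f} s")
```

The monotonic reading on its own carries no meaning; only the difference between two readings taken on the same machine does.
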
Clock synchronization and accuracy: The ways a machine obtains the time are less reliable and accurate than one might hope. Quartz clocks drift, NTP accuracy is limited by network delay, NTP servers may be wrong or misconfigured, leap seconds are complicated to handle, and clocks in virtual machines are virtualized. If your software depends on synchronized clocks, monitor the clock offsets between machines carefully; otherwise a drifting clock can cause problems such as data loss.
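
To get a feel for why drift matters, the worst-case error accumulated between two synchronizations can be estimated as drift rate times resynchronization interval. In the sketch below, the drift rate of 200 ppm and the two intervals are illustrative assumptions rather than values from these notes, and `max_drift_seconds` is a hypothetical helper.

```python
def max_drift_seconds(drift_ppm, resync_interval_seconds):
    """Worst-case clock error accumulated between two synchronizations.

    drift_ppm: assumed drift rate in parts per million (microseconds per second).
    """
    return drift_ppm / 1_000_000 * resync_interval_seconds


ASSUMED_DRIFT_PPM = 200  # illustrative assumption

for label, interval in [("every 30 seconds", 30), ("once a day", 24 * 60 * 60)]:
    drift = max_drift_seconds(ASSUMED_DRIFT_PPM, interval)
    print(f"resync {label}: up to {drift:.3f} s of drift")
```

Under these assumptions a clock that resynchronizes every 30 seconds can be off by roughly 6 ms, and one that resynchronizes once a day by roughly 17 seconds, before network delay to the NTP server is even considered.
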
Timestamps for ordering events: Relying on clocks to order events across multiple nodes is risky. With the last write wins (LWW) conflict resolution strategy, for example, database writes can be silently lost when clocks are not synchronized, and limited clock accuracy makes it impossible to distinguish sequential writes from concurrent ones. Logical clocks, which are based on counters rather than physical time, are a safer way to order events.
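
The following sketch shows how LWW can drop a causally later write when one node's clock lags, and how a simple logical clock avoids the problem. The timestamps, the function `lww_merge`, and the use of a Lamport clock as the logical clock are assumptions chosen for illustration.

```python
def lww_merge(current, incoming):
    """Last write wins: keep whichever (timestamp, value) pair has the larger timestamp."""
    return incoming if incoming[0] > current[0] else current


# Node A writes first. Node B sees that write and then overwrites it,
# but B's time-of-day clock runs 1 ms behind A's.
write_a = (42.004, "x = 1")  # stamped by node A
write_b = (42.003, "x = 2")  # stamped by node B, causally later
winner = lww_merge(write_a, write_b)
print("LWW keeps:", winner[1])  # the later write from node B is silently discarded


class LamportClock:
    """Logical clock: a counter that only moves forward and follows causality."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Advance the clock for a local event and return its timestamp."""
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        """Merge a timestamp seen on an incoming message, then advance."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter


node_a, node_b = LamportClock(), LamportClock()
t_a = node_a.tick()   # node A performs its write
node_b.receive(t_a)   # node B learns about A's write
t_b = node_b.tick()   # node B's write now carries a larger timestamp
print("logical timestamps:", t_a, "<", t_b)
```

The logical timestamps say nothing about the time of day, but they do guarantee that an event that causally follows another is ordered after it.
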
Confidence intervals for clock readings: A clock reading is uncertain and is best treated as a range of times rather than a point in time. How the error bound is calculated depends on the time source, but most systems do not expose the uncertainty. An exception is Google's TrueTime API, which explicitly reports the confidence interval of the clock; Spanner uses these intervals to implement snapshot isolation across data centers.
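
Treating a reading as an interval changes how two timestamps are compared: the order is known only if the intervals do not overlap. The sketch below illustrates that idea; `TimeInterval` and `compare` are hypothetical names and are not the TrueTime API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A clock reading with its uncertainty: the true time lies in [earliest, latest]."""
    earliest: float
    latest: float


def definitely_before(a, b):
    """True only if a ended before b began, so a certainly happened first."""
    return a.latest < b.earliest


def compare(a, b):
    """Classify the order of two readings, admitting uncertainty when they overlap."""
    if definitely_before(a, b):
        return "a before b"
    if definitely_before(b, a):
        return "b before a"
    return "uncertain"


print(compare(TimeInterval(10.000, 10.004), TimeInterval(10.010, 10.014)))
print(compare(TimeInterval(10.000, 10.008), TimeInterval(10.005, 10.012)))
```

A system that needs a definite order can wait out the length of the uncertainty interval before proceeding, so that any later reading is guaranteed not to overlap.
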
Process pauses and how to cope with them: A process in a distributed system can be paused for a long time for many reasons, including garbage collection, suspension and resumption of a virtual machine, disk I/O, paging, and signals that stop the process. Such pauses break code that relies on synchronized clocks or assumes that a piece of code executes quickly. Because a distributed system has no shared memory, the usual tools for thread safety on a single machine cannot be applied directly. Some measures can, however, reduce the impact of garbage collection pauses.

Knowledge, truth and lies

Uncertainty in distributed systems: A node can learn about the state of other nodes only through the messages it exchanges with them, and because of network faults, pauses, and similar problems it cannot know for certain what is really going on. Many distributed algorithms therefore rely on a quorum, that is, a vote among the nodes, to make decisions such as declaring a node dead. This reduces the dependence on any single node.
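
A quorum decision can be sketched as a simple majority vote. The function names and the vote format below are assumptions for illustration; real systems also have to collect the votes over the same unreliable network.

```python
def has_quorum(votes_in_favor, cluster_size):
    """True if strictly more than half of all nodes voted in favor."""
    return votes_in_favor > cluster_size // 2


def declared_dead(suspicions, cluster_size):
    """Decide whether a node is declared dead.

    suspicions maps each voting node's id to True if that node
    suspects the target node has failed.
    """
    votes = sum(1 for suspects in suspicions.values() if suspects)
    return has_quorum(votes, cluster_size)


# Five-node cluster: three nodes cannot reach node "e", one still can.
votes = {"a": True, "b": True, "c": True, "d": False}
print(declared_dead(votes, cluster_size=5))
```

Because two majorities of the same cluster always overlap in at least one node, two conflicting decisions cannot both win a vote.
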
Leaders, locks, and fencing: When a system requires that there be only one of something, such as one leader per database partition or one holder of a lock on a resource, remember that a node's belief about its own role is not necessarily shared by a quorum of nodes. Because of a network problem or a process pause, a node may keep writing after its lease has expired and corrupt data. Fencing tokens guard against this: every write request must carry a token that increases each time the lock is granted, and the resource checks the token and rejects any request that carries an older token.
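
The token check can be sketched as follows. This is a minimal in-memory illustration: `LockService`, `FencedStorage`, and `StaleTokenError` are hypothetical names, and lease expiry is simulated simply by granting the lock a second time.

```python
import itertools


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Grants the lock and hands out a strictly increasing fencing token each time."""

    def __init__(self):
        self._tokens = itertools.count(1)

    def acquire(self):
        return next(self._tokens)


class FencedStorage:
    """Resource that remembers the highest token seen and rejects older ones."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} is older than {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


lock_service = LockService()
storage = FencedStorage()

token_1 = lock_service.acquire()   # client 1 gets the lease, then pauses for a long time
token_2 = lock_service.acquire()   # the lease expires and client 2 is granted the lock
storage.write(token_2, "file", "written by client 2")

try:
    # Client 1 wakes up, still believing it holds the lease.
    storage.write(token_1, "file", "written by client 1")
except StaleTokenError as err:
    print("rejected:", err)
```

The important design point is that the check happens on the resource side: the paused client cannot be trusted to notice that its own lease has expired.
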
Byzantine faults: A Byzantine fault is one in which a node deliberately "lies" in a way that undermines the system's guarantees. Byzantine fault tolerance needs to be considered in specific settings such as aerospace systems and systems in which multiple organizations participate. However, Byzantine fault-tolerant protocols are complex and costly, which makes them impractical for most server-side data systems. Simple mechanisms that guard against weak forms of "lying" are still worth adding, because they improve reliability.