Distributed Systems Architecture 3: Service Fault Tolerance

This is the third article in my study notes on distributed system architecture. I know that purely technical articles are not popular and do not get many reads, but for personal growth it is still worth exploring the technology a little more deeply.

1. Why fault tolerance is needed

Distributed systems are unreliable by nature. In a large service cluster, programs may crash, nodes may go down, and the network may be interrupted; all of these "surprises" are actually "expected". Failure is inevitable, so a robust fault tolerance mechanism has to be designed to cope with it.

A fault tolerance strategy answers the question "what should we do when a failure occurs?", while a fault tolerance design pattern answers "how do we implement a given fault tolerance strategy?". The rest of this article introduces seven common fault tolerance strategies.

2. Seven Fault Tolerance Strategies

The 7 common fault tolerance strategies are: failover, fail fast, fail safe, fail silent, failback, parallel calls (forking), and broadcast calls.

Failover

Concept: In a distributed service, multiple replicas of the service exist. If the invoked server fails, the system does not return a failure directly; it switches to another replica so that the call still returns successfully.

Failover requires a limit on the number of retries, and whether to enable it at all depends on the actual business scenario. Example:

Suppose there is a call chain Service A → Service B → Service C. A's timeout threshold is 100ms, and B's call to C takes 60ms. Failover is pointless here: even if B's failover call to C succeeds, the retry adds at least 60ms to the call, by which point A has already timed out, so the failover brings no benefit to the system.

Applicable scenarios: read-heavy, write-light interfaces, such as product queries in e-commerce; interfaces with high success-rate requirements
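The failover behaviour and the timing argument above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: `call_with_failover`, `AllReplicasFailed`, and `worst_case_latency_ms` are hypothetical names made up for this example, each replica is modelled as a plain callable, and a real client would add timeouts and catch narrower exception types.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every attempted replica has failed (hypothetical name)."""


def call_with_failover(replicas: Sequence[Callable[[], T]], max_retries: int = 2) -> T:
    """Call the first replica and switch to the next one on failure.

    max_retries counts the extra attempts after the first call,
    so at most max_retries + 1 replicas are tried.
    """
    errors = []
    for replica in replicas[: max_retries + 1]:
        try:
            return replica()
        except Exception as exc:  # assumption: any exception means "replica failed"
            errors.append(exc)
    raise AllReplicasFailed(errors)


def worst_case_latency_ms(per_call_ms: int, max_retries: int) -> int:
    """Upper bound on elapsed time if every attempt takes per_call_ms."""
    return per_call_ms * (max_retries + 1)


def broken() -> str:
    raise ConnectionError("replica down")


def healthy() -> str:
    return "item-42"


result = call_with_failover([broken, healthy])  # "item-42", served by the second replica
budget = worst_case_latency_ms(60, 1)           # 120, which exceeds A's 100ms timeout
```

With one retry, the 60ms call from the example can take up to 120ms, which is why failover does not help when the upstream timeout is 100ms.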

Failfast

When the business scenario does not allow retries, or the service is non-idempotent so that repeated calls would produce dirty data, failover cannot be used and fail fast is needed instead.

Concept: The service returns an error immediately after the call fails, without any retries.

Example: In a payment scenario, the bank's deduction interface is called and the result is a network exception. At this point it is impossible to tell whether the money has been deducted. To avoid a duplicate deduction, the service can only throw an exception and report the error; it must not retry.

Applicable scenarios: scenarios with high real-time requirements; transaction and payment scenarios

Disadvantage: the caller needs strong failure-handling capability of its own

Failsafe

Services are also divided into main-path and bypass services. A bypass service is one whose failure does not affect the core business, for example logging or debug output in a Spring project. Bypass logic does not affect the final result, so the fault tolerance strategy for it is: even if the bypass logic fails, the call is still treated as having returned correctly.

Concept: When a service call fails, the exception is ignored and a default result is returned, ensuring that the system continues to run.

Applicable scenarios: non-core business scenarios, log processing, monitoring collection

Advantage: maximizes system stability

Examples: a try-catch block in Java; the failsafe policy in Dubbo.
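As a rough illustration of the fail safe idea, the sketch below wraps a bypass call so that any exception is logged and replaced with a default value. The helper name `call_failsafe` and the logger name are assumptions made up for this example; this is not Dubbo's implementation.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("failsafe")  # hypothetical logger name


def call_failsafe(fn: Callable[[], T], default: T) -> T:
    """Run a bypass call; on any failure, log it and return the default."""
    try:
        return fn()
    except Exception:
        # The failure is recorded but deliberately not propagated,
        # so the main business flow continues unaffected.
        logger.warning("bypass call failed, returning default", exc_info=True)
        return default


def report_metrics() -> bool:
    raise ConnectionError("metrics backend unreachable")


reported = call_failsafe(report_metrics, default=False)  # False, no exception raised
```

Swallowing the exception is only appropriate for bypass logic; applying the same wrapper to a main-path call would hide real failures.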

Failsilent

Concept: When a large number of requests all wait for a timeout before failing, they can exhaust the system's threads, memory, and network resources and undermine the stability of the whole service. The strategy for this scenario is: when a request fails, assume by default that the service provider cannot serve requests for a certain period, and stop allocating traffic to it so that the error is isolated.

Practical application: in a distributed system, when a single node fails, the traffic scheduling system stops allocating traffic to that node and automatically checks every 5 minutes whether it has recovered.
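The isolation step described above can be sketched as a small node registry. This is a simplified illustration under assumptions: the class name `SilentIsolation` and its methods are hypothetical, and instead of actively probing a node every 5 minutes it simply makes the node eligible for traffic again once a 300-second cooldown has elapsed.

```python
import time
from typing import Callable, Dict, List


class SilentIsolation:
    """Stops routing traffic to a failed node for a cooldown period."""

    def __init__(self, nodes: List[str], cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._nodes = list(nodes)
        self._cooldown_s = cooldown_s
        self._clock = clock  # injectable so the behaviour can be tested without waiting
        self._isolated_until: Dict[str, float] = {}

    def mark_failed(self, node: str) -> None:
        """Record a failure: the node receives no traffic until the cooldown ends."""
        self._isolated_until[node] = self._clock() + self._cooldown_s

    def available_nodes(self) -> List[str]:
        """Return the nodes that may currently receive traffic."""
        now = self._clock()
        return [
            n for n in self._nodes
            if n not in self._isolated_until or self._isolated_until[n] <= now
        ]


router = SilentIsolation(["node-a", "node-b"])
router.mark_failed("node-a")
candidates = router.available_nodes()  # ["node-b"] while node-a is isolated
```

A production scheduler would normally confirm recovery with a health check before sending real traffic back to the node.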

Failback

Failback is not used on its own; the usual default is a fail fast + failback combination.

Concept: Failback (failure recovery) means that after a service call fails, the failed request is stored asynchronously in a database or message queue and is retried or compensated at regular intervals until the call succeeds. This gives the business a certain ability to make up for failures after the fact. Failback also needs a maximum number of retries.

Applicable scenarios: low real-time requirements but high data consistency requirements, such as inventory updates and order status synchronization

Advantage: improves the eventual consistency of the system

Disadvantage: the system has to work together with a message queue, and the implementation is complex
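The fail fast + failback combination can be sketched with an in-memory queue standing in for the database or message queue. Everything here is an assumption for illustration: `FailbackQueue`, `PendingCall`, and `dead_letters` are hypothetical names, and `retry_pending` is meant to be driven by a periodic scheduler that is not shown.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class PendingCall:
    name: str
    action: Callable[[], None]
    attempts: int = 0  # number of retries already made


class FailbackQueue:
    """Fail fast on the first call, then retry failed calls later."""

    def __init__(self, max_retries: int = 3) -> None:
        self._max_retries = max_retries
        self._pending: Deque[PendingCall] = deque()
        self.dead_letters: List[str] = []  # calls that exhausted their retries

    def submit(self, name: str, action: Callable[[], None]) -> bool:
        """Try once; on failure, record the call for a later retry."""
        try:
            action()
            return True
        except Exception:
            self._pending.append(PendingCall(name, action))
            return False

    def retry_pending(self) -> None:
        """Run one retry round over everything currently pending."""
        for _ in range(len(self._pending)):
            call = self._pending.popleft()
            call.attempts += 1
            try:
                call.action()
            except Exception:
                if call.attempts >= self._max_retries:
                    self.dead_letters.append(call.name)  # give up, keep for manual handling
                else:
                    self._pending.append(call)

    def pending_count(self) -> int:
        return len(self._pending)
```

Calls that still fail after the maximum number of retries are moved to `dead_letters` rather than retried forever, which keeps retry tasks from piling up without bound.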

Summary: The previous 5 fault tolerance strategies all deal with how to compensate after a call fails; the following 2 deal with how to improve the success rate before the call is made.

Parallel calls (Forking)

Concept: Multiple service nodes are called at the same time, and the call is considered successful as soon as any one of them returns a successful result. For service nodes that return the same or similar results, this approach can significantly increase the success rate of the call.

Applicable scenarios: multi-replica deployments; scenarios where calls are time-consuming and high availability is required, such as queries against sharded database storage

Advantages: improves the success rate and reduces waiting time (the result comes from the first node that returns successfully)

Cons: Increased system overhead
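A parallel (forking) call can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration under assumptions: `call_forking` is a hypothetical helper, replicas are modelled as callables, and `cancel_futures=True` requires Python 3.9 or later. Calls that are already running are not interrupted; only their results are ignored.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_forking(replicas: Sequence[Callable[[], T]]) -> T:
    """Call all replicas at once and return the first successful result."""
    if not replicas:
        raise ValueError("at least one replica is required")
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica) for replica in replicas]
        last_error = None
        for future in as_completed(futures):
            try:
                return future.result()  # first success wins
            except Exception as exc:
                last_error = exc  # remember the failure and wait for the others
        raise RuntimeError("all parallel calls failed") from last_error
    finally:
        # Do not wait for the slower replicas; drop any call not yet started.
        pool.shutdown(wait=False, cancel_futures=True)
```

The extra overhead is visible in the code: every replica is invoked for every request, even though only one result is used.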

Broadcast calls (Broadcast)

Concept: The request is sent to all service instances and all returned results are collected; the call counts as successful only if every request succeeds. This approach suits scenarios that require the same operation to be applied to multiple nodes.

Applicable scenarios: refreshing a distributed cache, configuration synchronization

Advantage: every node performs the operation

Cons: High overhead for parallel execution

Implementation: Dubbo's broadcast policy supports broadcast calls
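The "all instances must succeed" rule can be sketched as follows. This is an illustrative sketch, not Dubbo's implementation: `call_broadcast` and `BroadcastError` are hypothetical names, instances are modelled as a mapping from node name to callable, and the calls are issued in parallel with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class BroadcastError(Exception):
    """Raised when at least one instance failed (hypothetical name)."""

    def __init__(self, failures: Dict[str, Exception]) -> None:
        super().__init__(f"broadcast failed on: {sorted(failures)}")
        self.failures = failures


def call_broadcast(instances: Dict[str, Callable[[], T]]) -> Dict[str, T]:
    """Call every instance and succeed only if all of them succeed."""
    if not instances:
        return {}
    with ThreadPoolExecutor(max_workers=len(instances)) as pool:
        futures = {name: pool.submit(fn) for name, fn in instances.items()}
    # Leaving the with-block waits until every call has finished.
    results: Dict[str, T] = {}
    failures: Dict[str, Exception] = {}
    for name, future in futures.items():
        try:
            results[name] = future.result()
        except Exception as exc:
            failures[name] = exc
    if failures:
        raise BroadcastError(failures)
    return results
```

Collecting the failures per node, rather than stopping at the first error, lets the caller see exactly which instances missed the cache refresh or configuration update.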

Comparison of 7 Fault Tolerance Strategies

Strategy	Advantages	Disadvantages	Applicable scenarios
Failover	Handled automatically by the system; failures are invisible to the caller	Can increase call time and cause additional resource overhead	Scenarios that invoke idempotent services and are not sensitive to call time
Fail fast	The caller has full control over failure handling and does not depend on the idempotence of the service	The caller must handle failure logic correctly; mishandling easily leads to avalanches	Scenarios that invoke non-idempotent services or have low timeout thresholds
Fail safe	Does not affect the main logic	Only suitable for bypass calls	Bypass services in the call chain
Fail silent	Contains errors so that they do not affect the whole system	The failed provider is unavailable for a period of time	Services that time out frequently
Failback	Failed calls are retried automatically, also without affecting the main logic	Best suited to bypass service calls; retry tasks may pile up, and retries may still fail	Bypass services in the call chain; main logic with low real-time requirements
Parallel call	Highest success rate in the shortest possible time	Consumes extra machine resources; most of the calls may be wasted	Scenarios with sufficient resources and low tolerance for failure
Broadcast call	Supports invoking a batch of service providers at the same time	High resource consumption and a high probability of failure	Only for batch operation scenarios

Interview question practice

A business system needs to call five third-party interfaces, and the operation counts as successful as long as three of the five return success. How would you design and implement this?

Big Brother Chow's response:

I see this question as a trap. For most architecture design questions, a fixed answer is usually a wrong one, because technical design exists to solve real problems, not to be talked about. The solution should depend on what you want to achieve:

If the goal is to complete this operation as quickly as possible, use the forking strategy: call all 5 together, and count the operation as successful once 3 succeed.

If the goal is to consume as few resources as possible, use the fail fast strategy: first estimate the error probability of each interface, sort them, and call the ones most likely to fail first. Once 3 have failed, the operation counts as failed and the remaining calls need not be executed.

If the goal is to complete this operation with as high a probability as possible, use the failover strategy.
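The first option (forking: call all 5 together, 3 successes suffice) can be sketched as a quorum call. This is a hedged sketch rather than a definitive answer to the interview question: `quorum_call` is a hypothetical helper, the third-party interfaces are modelled as callables, and `cancel_futures=True` requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence


def quorum_call(calls: Sequence[Callable[[], object]], required: int = 3) -> bool:
    """Run all calls in parallel; True once `required` of them have succeeded."""
    if not 0 < required <= len(calls):
        raise ValueError("required must be between 1 and len(calls)")
    max_failures = len(calls) - required  # more failures than this makes success impossible
    successes = 0
    failures = 0
    pool = ThreadPoolExecutor(max_workers=len(calls))
    try:
        futures = [pool.submit(call) for call in calls]
        for future in as_completed(futures):
            if future.exception() is None:
                successes += 1
            else:
                failures += 1
            if successes >= required:
                return True   # quorum reached, no need to wait for the rest
            if failures > max_failures:
                return False  # quorum can no longer be reached
        return False
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

With 5 calls and a quorum of 3, the function returns as soon as the third success or the third failure arrives, whichever comes first.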