★ ServiceMesh Series
1 Background
In complex Internet scenarios, it is inevitable that requests will fail or time out.
From the program's perspective, this usually shows up as a response with a 5xx status code; from the user's perspective, the result of the request does not meet expectations, that is, the operation fails (a transfer fails, an order fails, information cannot be retrieved, and so on).
These occasional, unavoidable 5xx errors arise for a variety of reasons, such as:
- Network latency or jitter
- Insufficient server resources (high CPU or memory usage, an exhausted connection pool)
- Server or host failure
- Bugs in the service that are triggered only under specific conditions (usually not consistently reproducible)
2 System Stability Classification
Most services can tolerate low-frequency, occasional 5xx errors. Availability levels are used to measure a system's robustness: the higher the level, the more robust the system, as shown below:
| Availability level | Downtime per year | Availability |
|---|---|---|
| Basic availability | 87.6 h | 99% |
| High availability | 8.8 h | 99.9% |
| Very high availability (most failures recover automatically) | 52 min | 99.99% |
| Extremely high availability | 5 min | 99.999% |
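For reference, the downtime column follows directly from the availability percentage (a quick back-of-the-envelope calculation, using 8760 hours per year):

$$\text{downtime per year} = (1 - \text{availability}) \times 8760\ \text{h}, \qquad \text{e.g. } (1 - 99.9\%) \times 8760\ \text{h} \approx 8.76\ \text{h}$$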
For systems with strict requirements on reliability and on getting the expected result, such as transfers, order placement, and payments, even a small degradation in availability is unacceptable: users strongly need to receive the correct result.
Think about how panicked you are when a payment fails, or how frustrated you are when you order takeout and the order information cannot be retrieved: these are real user pain points.
3 Means of governing request exceptions
3.1 Fault recovery using exception retries
From the analysis of failure causes above, we know that, setting aside genuine program logic errors, most errors caused by the environment can be recovered from by retrying.
The main governance technique is retrying on exceptions: a failed request is retried and load-balanced onto a healthy instance, which reduces the frequency of failures perceived by users (the more instances there are, the higher the retry success rate).
A walkthrough of the retry process:
- This example shows service Svc-A calling Svc-B.
- After the first attempt fails, a second request is initiated after a 25 ms interval, according to the retry policy.
- Two log entries share the same trace_id, which means they belong to the same call (one call, two requests: the initial attempt plus one retry).
- The caller is the same instance, Svc-A-Instance1, i.e. the request initiator does not change.
- The callee changes, indicating the retry was scheduled onto a new instance (Svc-B-Instance1 to Svc-B-Instance2).
- The retry returns a normal 200.
Because the default load-balancing mode is round-robin (RR), the more instances there are, the higher the probability that a retry succeeds (see the figure below). For example, with 50 instances of which one is faulty and returning 5xx, the second request generally has about a 49/50 chance of success.
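A rough back-of-the-envelope check of that claim, assuming the retried request is scheduled independently of the first attempt and exactly one of the $n$ instances is unhealthy:

$$P(\text{retry lands on a healthy instance}) \approx \frac{n-1}{n}, \qquad n = 50 \;\Rightarrow\; \frac{49}{50} = 98\%$$

The more instances behind the service, the closer this gets to 100%, which is why retries pay off more in larger deployments.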
3.2 Istio Policy Implementation
The comments in the configuration make it clear enough, so I won't explain it line by line here.
```yaml
# VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: xx-svc-b-vs
  namespace: kube-ns-xx
spec:
  hosts:
  - svc_b                       # Governs traffic destined for the svc-b service
  http:
  - match:                      # Govern traffic that matches the conditions below
    - uri:
        prefix: /v1.0/userinfo  # Match routes prefixed with /v1.0/userinfo, e.g. /v1.0/userinfo/1305015
    retries:
      attempts: 1               # Retry once
      perTryTimeout: 1s         # Timeout for the first call and each retry
      retryOn: 5xx              # Retry trigger condition
    timeout: 2.5s               # Overall request timeout of 2.5s; no matter how many retries, the request is cut off after this time
    route:
    - destination:
        host: svc_b
      weight: 100
  - route:                      # Other unmatched traffic is not governed and flows through directly
    - destination:
        host: svc_c
      weight: 100
```
4 Means of governing request timeouts
4.1 Main reasons for request timeouts
- Network latency, jitter, or packet loss, which lengthens response times.
- Resource bottlenecks on containers or even the underlying cloud hosts: high CPU usage, abnormal memory usage, disk I/O pressure, network latency, and other resource anomalies can also lengthen response times.
- Load-balancing problems: traffic distributed unevenly across multiple instances, which is not uncommon in today's cloud-native scenarios.
- Sudden traffic floods: assuming no unexpected external traffic (internal projects are the main focus here), sudden floods are mostly caused by unreasonable calling patterns or program bugs (memory leaks, circular calls, cache breakdown, etc.).
With a single replica, long-running requests easily pile up in queues and waste a great deal of resources. Releasing the request quickly, or rescheduling it elsewhere, is a generally acceptable degradation strategy; otherwise, timeout blocking leads to long periods of service unavailability.
This impact also spreads horizontally, as other features on the same service compete for the same resources.
4.2 Istio's governance tools
4.2.1 Timeout retries
Configure the service's core interfaces at a fine granularity: a specific interface's per-try timeout should be at least its TP99.9 latency (the time within which 99.9% of requests complete) before a retry is considered.
4.2.2 Timeout circuit breaking
Degradation is achieved by specifying a timeout after which the request is cut off. This avoids long queue blocking, which would otherwise propagate up the call chain like an avalanche and bring down the entire link.
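As a minimal sketch of what such a timeout-only degradation looks like on its own (no retries), a VirtualService can simply set `timeout` on the route. The service name `svc_d` and the 2s value below are illustrative assumptions, not part of the configuration discussed in this series:

```yaml
# Minimal sketch: timeout-only degradation (no retry policy).
# svc_d and the 2s value are hypothetical, for illustration only.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: xx-svc-d-vs
  namespace: kube-ns-xx
spec:
  hosts:
  - svc_d
  http:
  - timeout: 2s        # Cut the request off after 2s so callers fail fast instead of queueing
    route:
    - destination:
        host: svc_d
```

The full policy used in this series, combining retries with an overall timeout, is shown in the next subsection.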
4.3 Istio Policy Implementation
Focus on the two attributes marked with ★ in the configuration below:
- perTryTimeout is the timeout for the first call and for each retry. Exceeding it suggests the request has most likely stalled (pending), so a retry is made in the hope of landing on another, healthy instance and getting a result back faster.
- timeout is the overall timeout of the request, 2.5s here: no matter how many retries occur, the request is cut off once this time is exceeded. This is a protection strategy that prevents excessive retries or long-pending requests from degrading the service or even triggering an avalanche; a quick check of how the two values fit together follows.
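With the numbers used here, and assuming one retry plus the roughly 25 ms retry interval mentioned earlier, the two values fit together as:

$$(1 + \text{attempts}) \times \text{perTryTimeout} + \text{back-off} \approx 2 \times 1\,\text{s} + 25\,\text{ms} \approx 2.03\,\text{s} < 2.5\,\text{s} = \text{timeout}$$

so the overall timeout leaves just enough headroom for the first attempt plus one full retry.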
```yaml
# VirtualService
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: xx-svc-b-vs
  namespace: kube-ns-xx
spec:
  hosts:
  - svc_b                       # Governs traffic destined for the svc-b service
  http:
  - match:                      # Govern traffic that matches the conditions below
    - uri:
        prefix: /v1.0/userinfo  # Match routes prefixed with /v1.0/userinfo, e.g. /v1.0/userinfo/1305015
    retries:
      attempts: 1               # Retry once
      perTryTimeout: 1s         # ★ Timeout for the first call and each retry
      retryOn: 5xx              # Retry trigger condition
    timeout: 2.5s               # ★ Overall request timeout of 2.5s; no matter how many retries, the request is cut off after this time
    route:
    - destination:
        host: svc_b
      weight: 100
  - route:                      # Other unmatched traffic is not governed and flows through directly
    - destination:
        host: svc_c
      weight: 100
```
5 Summary
In this article, we introduced how to use a service mesh to govern exception retries and timeout circuit breaking. Istio provides a wealth of governance capabilities; in later chapters of this series we will work through advanced usage one by one, such as fault injection, circuit breaking, rate limiting, and outlier ejection.