ServiceMesh 1: What's so great about the hot cloud-native microservice mesh?

1 About Cloud Native

The official description from the Cloud Native Computing Foundation (CNCF) is:
Cloud native is a collective term for a class of technologies that let us build applications that are more elastic, scalable, and highly distributed. These applications can run in different environments, such as private, public, hybrid, and multi-cloud scenarios. Cloud native covers capabilities such as containers, microservices (including service meshes), Serverless, DevOps, API management, immutable infrastructure, and more. Applications built with cloud-native technologies are loosely coupled to the underlying infrastructure, easy to migrate, and able to make full use of the capabilities the cloud provides, so developing, deploying, and managing them is more efficient and convenient than with traditional applications.

1.1 Microservices

Microservices is an architectural pattern that evolved from the Service Oriented Architecture (SOA) pattern. It promotes dividing a single application into a set of loosely coupled, fine-grained, small services that coordinate and interoperate over lightweight protocols to deliver the final value to the user.
Microservices are characterized by single responsibility, lightweight communication, independence, process isolation, mixed technology stacks and deployment approaches, and simplified governance.

1.2 DevOps

DevOps, as an engineering model, is essentially a way to maximize engineering efficiency and meet business needs by clearly dividing responsibilities among the development, operations, testing, and release management roles.

1.3 Continuous delivery

It is hard to release new features to users frequently without affecting their use of the service. Doing so requires strong traffic management capabilities and dynamic scaling of services in and out for smooth releases, with A/B testing as a safeguard.

1.4 Containerization

The benefit of containerization is that operations staff do not need to care about the technology stack each service uses. Every service is packaged in a container in the same way and can be managed and maintained uniformly. The most popular technologies today are Docker and Kubernetes (k8s).

2 About ServiceMesh

2.1 What is ServiceMesh

ServiceMesh is the latest generation of microservice architecture. It is an infrastructure layer that can be decoupled from the business, and it mainly addresses the complex network topology of microservices and the communication between them. It is generally implemented as a lightweight network proxy deployed alongside the application as a SideCar, and it is transparent to the business application.

Looking at a single call link, we get the following structure diagram:

If we take a global view, with application services in green and SideCar in blue, we get the following deployment diagram:

2.2 Differences Compared to Traditional Microservices

Microservice development frameworks represented by SpringCloud and Dubbo are very popular. However, we found that alongside their excellent service governance capabilities they have obvious pain points:
1. Highly invasive. To integrate the SDK's capabilities, you have to add the relevant dependencies and also add invasive code, annotations, and configuration in the business layer, so the boundary between the business layer and the governance layer is unclear. Think of how Dubbo, SpringCloud, and similar frameworks are used in practice.
2. High upgrade cost. Each upgrade requires the business application to change the SDK version, rerun functional regression tests, and redeploy every service online, which runs counter to rapid iterative development.
3. Severe version fragmentation. Because upgrades are costly and middleware versions move fast, the SDK versions referenced by different online services are not uniform and their capabilities vary, making unified governance difficult.
4. Difficult middleware evolution. Because of the version fragmentation, the middleware has to stay compatible with the logic of many old versions as it evolves, so it moves forward in "shackles" and cannot iterate quickly.
5. Many components and a high barrier to entry. There are many dependent components and the learning cost is high.
6. Incomplete governance. Unlike an RPC framework, SpringCloud is a typical all-in-one governance suite, yet it does not do everything: advanced features such as protocol conversion, multiple authorization mechanisms, dynamic request routing, fault injection, and gray release are not covered.

2.3 The Value of ServiceMesh - Enabling Infrastructure

Harmonize solutions for multilingual frameworks to reduce development costs
Reduce testing costs and improve quality
Centralize control logic in the control plane
Provide support for new architectural evolutions such as Serverless
Move from partial mesh coverage to unified coverage (complementing the service center at first and gradually replacing it)
Complete closed-loop microservices orchestration and management capabilities

2.4 The Value of ServiceMesh - Enabling the Business

The framework is decoupled from the business to reduce business constraints.
Simplify the management of SDK versions on which the service depends.
Rely on hot-scaling capabilities to shorten version recall cycles.
Slim down the SDK and reduce business dependency conflicts.
Provide rich traffic governance, security policies, distributed tracing, and log monitoring, and sink service governance into the infrastructure so that the business can focus on the business.

3 ServiceMesh Core Capabilities

3.1 Traffic governance

The biggest pain point for microservices applications is dealing with inter-service communication, and the core of this problem is really traffic management.

3.1.1 Request routing

Request routing sends a request to a specific version of a service: traffic is routed to different destinations based on values such as an HTTP request header or the URI. Matching rules can use the traffic port, header fields, the URI, and so on.
For details, see the RuleMatch reference.
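The matching logic described above can be sketched in a few lines. This is only an illustration of first-match routing on header, URI prefix, and port. The `MatchRule` type, the `route` function, and the service names are hypothetical and are not Istio's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MatchRule:
    """Hypothetical rule: every field that is set must match the request."""
    destination: str
    headers: Dict[str, str] = field(default_factory=dict)  # exact header matches
    uri_prefix: Optional[str] = None
    port: Optional[int] = None


def route(rules: List[MatchRule], default: str, *,
          headers: Dict[str, str], uri: str, port: int) -> str:
    """Return the destination of the first matching rule, else the default."""
    for rule in rules:
        if rule.port is not None and rule.port != port:
            continue
        if rule.uri_prefix is not None and not uri.startswith(rule.uri_prefix):
            continue
        if any(headers.get(k) != v for k, v in rule.headers.items()):
            continue
        return rule.destination
    return default


# Assumed example rules: one user goes to v2, /beta/ paths go to v3.
rules = [
    MatchRule(destination="reviews-v2", headers={"end-user": "jason"}),
    MatchRule(destination="reviews-v3", uri_prefix="/beta/"),
]
by_header = route(rules, "reviews-v1", headers={"end-user": "jason"}, uri="/", port=80)
by_uri = route(rules, "reviews-v1", headers={}, uri="/beta/x", port=80)
fallback = route(rules, "reviews-v1", headers={}, uri="/", port=80)
print(by_header, by_uri, fallback)  # reviews-v2 reviews-v3 reviews-v1
```

Rules are evaluated in order, so more specific rules should come first.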

3.1.2 Traffic shifting

When a microservice is gradually migrated from one version to another, traffic can be moved from the old version to the new one step by step. As shown in the following figure, the weight parameter assigns a share of traffic to each version.
A very typical application scenario is gray release or A/B testing.
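As a rough sketch of what weight-based splitting means, the snippet below picks a version for each request in proportion to its weight. The 90/10 split, the version names, and the `pick_version` helper are assumptions made for illustration. A real mesh does this inside the sidecar proxy.

```python
import random
from typing import Dict


def pick_version(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


rng = random.Random(42)            # seeded only to make the demo repeatable
weights = {"v1": 90, "v2": 10}     # assumed gray-release split
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[pick_version(weights, rng)] += 1
print(counts)  # roughly 9000 requests to v1 and 1000 to v2
```

Shifting traffic then amounts to changing the weights over time, for example 90/10, 50/50, and finally 0/100.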

3.1.3 Load balancing

As in the diagram in 3.1.2, Service B has multiple instances, so a separate load balancing policy can be configured for it.
Supported policies include simple load balancing policies (ROUND_ROBIN, LEAST_CONN, RANDOM, PASSTHROUGH), consistent hash policies, and locality (regional) load balancing policies.
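To make two of the simple policies concrete, here is a minimal sketch of round-robin and least-connections selection. The instance names and helper functions are hypothetical, and the proxy's actual implementations are more involved.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(instances: List[str]) -> Iterator[str]:
    """ROUND_ROBIN: hand out instances in a fixed rotation."""
    return itertools.cycle(instances)


def least_conn(active: Dict[str, int]) -> str:
    """LEAST_CONN: pick the instance with the fewest active connections."""
    return min(active, key=lambda name: active[name])


rr = round_robin(["b-1", "b-2", "b-3"])
first_four = [next(rr) for _ in range(4)]
print(first_four)                                     # ['b-1', 'b-2', 'b-3', 'b-1']
print(least_conn({"b-1": 7, "b-2": 2, "b-3": 5}))     # b-2
```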

3.1.4 Timeout

Set a timeout of a certain length (for example 0.5s) on requests to the upstream service. If a request gets no response within that time, you can fall back directly. The goal is again overload protection.

3.1.5 Retries

Configure the number of retries for when a request does not return a correct response within a fixed period. The configuration below retries up to 3 times when the service does not return a correct response within 1 second or returns a 5xx code.
Retrying is an important technique for high availability in distributed environments, but retry schemes should be used with caution.

retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: 5xx
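The behavior this configuration asks for can be approximated as follows. It is a simplified sketch and not Envoy's retry logic: `send` is a hypothetical callable returning `(status, elapsed_seconds)`, a try slower than the per-try timeout is treated as a 504, and `attempts` is read as the number of retries after the first try.

```python
from typing import Callable, Iterator, Tuple

Response = Tuple[int, float]  # (HTTP status, elapsed seconds), simulated


def call_with_retries(send: Callable[[], Response], attempts: int = 3,
                      per_try_timeout: float = 1.0) -> Tuple[int, int]:
    """Return (final status, number of tries used)."""
    tries = 0
    status = 504
    for _ in range(1 + attempts):          # first try plus up to `attempts` retries
        tries += 1
        status, elapsed = send()
        if elapsed > per_try_timeout:      # too slow: count the try as a timeout
            status = 504
        if status < 500:                   # only 5xx results are retried
            break
    return status, tries


# Scripted upstream: a 503, then a response that is too slow, then success.
scripted: Iterator[Response] = iter([(503, 0.1), (200, 1.5), (200, 0.2)])
result = call_with_retries(lambda: next(scripted))
print(result)  # (200, 3)
```

Note that in the worst case the caller waits for every try, which is one reason retries should be combined with an overall timeout.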

3.1.6 Circuit breaking / rate limiting / degradation

There are many circuit-breaking options: you can configure the maximum number of connections, connection timeout, maximum number of requests, number of request retries, request timeout, and so on, and then trip the circuit and fall back.
However, as it stands, Istio's support for more flexible and fine-grained capabilities such as rate limiting and degradation is not good enough. Ideally it would offer flexible approaches such as a leaky bucket algorithm (e.g. Sentinel, Alibaba's open-source flow-control framework) or a token bucket algorithm (e.g. RateLimiter, the rate-limiting utility provided by Google Guava).
It can still be handled in other ways. For example, traffic forwarding can send part of the traffic to a default service, which gives you a default fallback, but you need to control the sampling time and the circuit breaker's half-open policy.
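For readers unfamiliar with the token bucket idea mentioned above, here is a generic sketch. It is not Guava's RateLimiter, Sentinel, or an Istio feature. The class name and parameters are assumptions, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Generic token bucket: refill `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=2, capacity=2)
results = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
print(results)  # [True, True, False, True, False]
```

The capacity bounds the burst size, while the rate bounds the sustained throughput. Rejected requests are the ones a degradation policy would send to a fallback.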

3.1.7 Outlier Detection (OD)

When a service instance in a cluster fails, the highest priority is to eject it first, and only then investigate the problem, fix it, and recover. The ability to eject quickly is therefore important for system availability.
Outlier Detection scans the upstream hosts and decides whether to eject a host based on the parameters you set.
The following configuration scans the upstream hosts once per second. Any host that returns 5xx errors twice in a row is removed from the load-balancing pool for 3 minutes, and ejected hosts must not exceed 10% of the cluster.
Regardless of the percentage, as long as the cluster has >= 2 service instances, at least 1 host can still be ejected. The configuration options are very detailed; see the reference documentation.
Note: a host returns to the pool after 3 minutes. If it is ejected again, the ejection time is the previous ejection time plus the base time, i.e. 3+3 minutes. By default, if more than 50% of hosts (an adjustable percentage) have been ejected, the load balancer enters panic mode.

outlierDetection:
      consecutiveErrors: 2
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 10
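The rules described for this configuration can be modeled roughly as below. This is a simplified sketch based only on the description in this section and is not Envoy's implementation. The class and method names are hypothetical, time is passed in as plain seconds, and panic mode is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class HostState:
    consecutive_5xx: int = 0     # 5xx responses in a row
    ejections: int = 0           # how many times this host has been ejected
    ejected_until: float = 0.0   # time (seconds) at which the host returns


class OutlierDetector:
    def __init__(self, hosts: Iterable[str], consecutive_errors: int = 2,
                 base_ejection_time: float = 180.0, max_ejection_percent: int = 10):
        self.hosts: Dict[str, HostState] = {name: HostState() for name in hosts}
        self.consecutive_errors = consecutive_errors
        self.base_ejection_time = base_ejection_time
        self.max_ejection_percent = max_ejection_percent

    def record(self, name: str, status: int) -> None:
        """Track consecutive 5xx responses; any other status resets the count."""
        host = self.hosts[name]
        host.consecutive_5xx = host.consecutive_5xx + 1 if status >= 500 else 0

    def ejected(self, now: float) -> List[str]:
        return [n for n, h in self.hosts.items() if h.ejected_until > now]

    def sweep(self, now: float) -> None:
        """Run once per `interval`: eject failing hosts, respecting the cap."""
        # As noted in the text, at least one host may be ejected regardless of the cap.
        limit = max(1, len(self.hosts) * self.max_ejection_percent // 100)
        for host in self.hosts.values():
            if host.ejected_until > now or host.consecutive_5xx < self.consecutive_errors:
                continue
            if len(self.ejected(now)) >= limit:
                break
            host.ejections += 1
            # Ejection time grows with each ejection: 3 min, then 3+3 min, ...
            host.ejected_until = now + self.base_ejection_time * host.ejections
            host.consecutive_5xx = 0


detector = OutlierDetector(["a", "b", "c", "d", "e"])
for name in ("a", "a", "b", "b"):
    detector.record(name, 503)
detector.sweep(now=1.0)
first = detector.ejected(now=1.0)
print(first)  # ['a']  (b also failed, but the cap allows only one ejection)

detector.record("a", 503)
detector.record("a", 503)
detector.sweep(now=400.0)  # a is back in the pool by now and fails again
print(detector.hosts["a"].ejected_until)  # 760.0, i.e. 400 + 2 * 180
```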

3.1.8 Fault injection

Fault injection simulates the upstream service returning a specified error code, to check whether the current service can handle it. Before the system goes live, you can configure the injected httpStatus and ratio to verify how well the downstream service handles the failure.
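A minimal sketch of the idea follows: wrap the upstream call so that a configured percentage of requests receive an injected status, then check that the caller copes. All names are hypothetical. In a mesh the injection happens in the proxy, not in application code.

```python
import random
from typing import Callable


def with_fault_injection(handler: Callable[[str], int], http_status: int,
                         percentage: float, rng: random.Random) -> Callable[[str], int]:
    """Return a handler that aborts `percentage`% of requests with `http_status`."""
    def wrapped(request: str) -> int:
        if rng.random() * 100 < percentage:
            return http_status            # injected fault, upstream is never called
        return handler(request)
    return wrapped


def upstream(request: str) -> int:
    return 200                            # the real upstream always succeeds here


def caller(call: Callable[[str], int], request: str) -> str:
    """Service under test: it should degrade gracefully on upstream errors."""
    return "fallback" if call(request) >= 500 else "ok"


rng = random.Random(7)
always_fail = with_fault_injection(upstream, http_status=503, percentage=100, rng=rng)
never_fail = with_fault_injection(upstream, http_status=503, percentage=0, rng=rng)
print(caller(always_fail, "GET /items"))  # fallback
print(caller(never_fail, "GET /items"))   # ok
```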

3.1.9 Traffic mirroring (Mirroring)

This is also called shadow traffic. Through configuration, a copy of real production traffic is sent to a mirror service. You can set the percentage of traffic to mirror, and mirrored requests are forwarded only: their responses are discarded.
Personally, I find this quite useful. The benefits are a complete simulation of the production environment for traffic analysis and stress testing, and faithful reproduction of real production problems, which makes troubleshooting easier.
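The forward-only behavior can be sketched like this. It is an illustration with hypothetical names. A real proxy mirrors asynchronously, whereas this sketch calls the mirror inline for simplicity.

```python
import random
from typing import Callable, List


def mirrored(primary: Callable[[str], str], mirror: Callable[[str], str],
             mirror_percent: float, rng: random.Random) -> Callable[[str], str]:
    """Serve from `primary`; copy `mirror_percent`% of requests to `mirror`."""
    def handle(request: str) -> str:
        if rng.random() * 100 < mirror_percent:
            try:
                mirror(request)           # response is deliberately discarded
            except Exception:
                pass                      # mirror failures never affect the caller
        return primary(request)
    return handle


seen_by_mirror: List[str] = []


def primary(request: str) -> str:
    return f"v1 handled {request}"


def mirror(request: str) -> str:
    seen_by_mirror.append(request)
    return "ignored"


handle = mirrored(primary, mirror, mirror_percent=100, rng=random.Random(0))
reply = handle("GET /orders")
print(reply)            # v1 handled GET /orders
print(seen_by_mirror)   # ['GET /orders']
```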

3.2 Observability

3.2.1 Monitoring and Visualization

Prometheus (the standard; scrapes metrics data by default), Kiali (service view and observability of Istio links), and Grafana (BI reporting: health metrics for the data plane, the control plane, and xDS services)
Later chapters cover each of these in turn...

3.2.2 Access log

ELK, EFK (Envoy records the AccessLog, which contains the sidecar's InBound and OutBound logs)
Later chapters cover this in detail...

3.2.3 Distributed tracing

Essentially, one way to correlate multiple HTTP requests is to use a correlation ID. This ID should be passed along with every request so that the tracing platform knows which requests belong to the same call chain. As shown in the figure below:

Although Istio uses Envoy's distributed tracing capabilities to provide out-of-the-box tracing integration, it is a misconception that the application has nothing to do. The application needs to propagate the following headers:

x-request-id
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
x-ot-span-context

The Envoy proxy in the Istio sidecar receives these headers and passes them to the configured tracing system. So in practice, by default, tracing in Istio only links two levels.
For example, for A -> B -> C, Istio will record two separate trace links, A -> B and B -> C, rather than the A -> B -> C we would expect. To chain the services into one trace, the inter-service calls must be modified to forward the headers.
In Istio, applications associate spans with the same trace by propagating these HTTP headers.
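In application code, the propagation step amounts to copying the tracing headers listed above from the incoming request onto every outgoing call. The helper below is a minimal, framework-agnostic sketch. The function name and the sample header values are made up for illustration.

```python
from typing import Dict

# The headers listed above, which the application must forward.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def propagate(incoming: Dict[str, str]) -> Dict[str, str]:
    """Pick the tracing headers out of an incoming request (case-insensitive)."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}


# Service B receives a request from A and prepares its call to C.
incoming = {
    "X-Request-Id": "req-1",          # sample values, not real IDs
    "X-B3-TraceId": "abc123",
    "X-B3-SpanId": "def456",
    "Content-Type": "application/json",
}
outgoing = propagate(incoming)
print(outgoing)
# {'x-request-id': 'req-1', 'x-b3-traceid': 'abc123', 'x-b3-spanid': 'def456'}
```

Service B would merge `outgoing` into the headers of its request to C, so that the sidecars report both hops under the same trace.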

3.3 Security mechanisms

Service Mesh can introduce mutual TLS (mTLS) encryption for inter-service communication to ensure that data is not tampered with or eavesdropped on in transit. The control plane is responsible for managing and distributing certificates, and the Sidecar Proxy performs encryption and decryption during communication.
By introducing authentication and access control policies, it is possible to control at a granular level which services can access other services.

3.4 Strategy Implementation

Service Mesh enables transparent proxying of inter-service communication by deploying Sidecar Proxy next to each service instance. These proxies are responsible for intercepting all incoming and outgoing traffic and performing the appropriate actions according to the configuration and policies issued by the control plane. Here's how it works:

3.4.1 Service Discovery:

When a service instance starts, it registers itself with the service registry. The control plane manages this instance information and distributes the updated list of services to all Sidecar Proxies.

3.4.2 Traffic management:

When a service needs to communicate with another service, the traffic first passes through the local Sidecar Proxy. The proxy forwards the traffic to the target service instance based on the configured routing rules and load balancing policies.
The control plane can dynamically update these routing rules to enable advanced traffic management features such as blue-green deployment and canary releases.

3.4.3 Security certification:

As described in section 3.3, the control plane manages and distributes certificates, while the Sidecar Proxies perform mutual TLS encryption and decryption and enforce the authentication and access control policies that determine which services can access which other services.

3.4.4 Observability:

The proxies in Service Mesh collect logs, monitoring data, and trace information for each request and send this data to the observability components for processing and storage.
Operators can monitor traffic conditions, latency, error rates, and other metrics between services in real time through the interfaces and dashboards provided by the control plane, and perform troubleshooting and performance optimization.

4 Summary

Service Mesh has significant advantages over traditional microservices frameworks in the following areas:

Decoupling application and communication logic
Enhanced service governance capabilities
Improved observability and debuggability
Support for multiple languages and protocols
Improved system reliability and scalability