Kubernetes load-aware scheduling

Background

The native Kubernetes scheduler places Pods based only on their resource requests, which easily leads to uneven load across nodes.
In many cases the business side over-requests resources, so while we relied on the native scheduler we set the request/limit ratio according to each workload's characteristics and its assessed level, in order to improve resource utilization.
Several problems remain with this approach:

Uneven node load: the native Kubernetes scheduler places Pods based on their requests and each node's allocatable resources. It neither considers real-time load nor estimates usage, and this purely static scheduling leads to uneven resource utilization across nodes.
For workloads with volatile traffic, some nodes exceed the safety threshold at traffic peaks while many others sit nearly idle, so the gap in utilization between nodes is very large.
Business cyclicality: because the online and offline clusters are separated, a large amount of resources is wasted during the online cluster's off-peak periods.

This article focuses on the first problem: improving resource utilization within the online cluster.

In the online cluster, the coefficient of variation of CPU utilization is 0.45, and the peak CPU utilization of the whole cluster is only about 25%. The following figure shows the distribution of CPU utilization:

[Figure: distribution of CPU utilization in the online cluster]
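To make the dispersion figure concrete, here is a minimal sketch of how such a coefficient can be computed. It assumes the coefficient is the standard coefficient of variation (standard deviation divided by mean) over per-node CPU utilization; whether the population or sample standard deviation was used is not stated in this article, and the sample values below are hypothetical, not the cluster's real data.

```python
from statistics import mean, pstdev


def coefficient_of_variation(utilizations):
    """Population standard deviation divided by the mean of per-node utilization."""
    avg = mean(utilizations)
    if avg == 0:
        raise ValueError("mean utilization is zero; the coefficient is undefined")
    return pstdev(utilizations) / avg


# Hypothetical per-node peak CPU utilization, as fractions of node capacity.
node_cpu = [0.10, 0.15, 0.20, 0.25, 0.30, 0.50]
cv = coefficient_of_variation(node_cpu)
print(f"mean={mean(node_cpu):.2f} cv={cv:.2f}")
```

A value near 0 means nodes are loaded almost equally; the larger the value, the more unevenly the load is spread.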

Approach

Given the above, a peak CPU utilization of only 25% is clearly not reasonable; well-run clusters in the industry reach 50% or more. To keep raising utilization, we first have to solve the problem of uneven node load:

Sense the real load of each node: to address uneven node load, each node's current real load must be reported.
Load-based scheduling plug-in: add a load-based plug-in on top of the default scheduler so that, at scheduling time, load levels are kept as even as possible across nodes.
Load-based rescheduling component: as business traffic fluctuates, changes in application load cause node loads to diverge, so Pods need to be rescheduled (migrated) to even the load out again.
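The difference between request-based and load-based scoring can be sketched as follows. This is a simplified model under stated assumptions, not the actual scoring code of Kubernetes, Koordinator, or Crane: the `Node` type, the two score functions, and the weights are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: float   # cores
    cpu_requested: float  # sum of Pod CPU requests, cores
    cpu_used: float       # measured CPU usage, cores


def request_score(node: Node) -> float:
    """Score 0-100 from requests only: more unrequested CPU scores higher."""
    return 100.0 * (1.0 - node.cpu_requested / node.cpu_capacity)


def load_score(node: Node) -> float:
    """Score 0-100 from measured usage: lower real utilization scores higher."""
    return 100.0 * (1.0 - min(node.cpu_used / node.cpu_capacity, 1.0))


def pick_node(nodes, load_weight=1.0, request_weight=1.0):
    """Return the node with the highest weighted sum of both scores."""
    return max(
        nodes,
        key=lambda n: load_weight * load_score(n) + request_weight * request_score(n),
    )


nodes = [
    Node("node-a", 32, 16, 24),  # lightly requested, but busy in practice
    Node("node-b", 32, 20, 6),   # more heavily requested, but mostly idle
]
print(pick_node(nodes, load_weight=0.0).name)  # requests only
print(pick_node(nodes, load_weight=3.0).name)  # load-aware
```

With the load weight at zero the sketch prefers the node with fewer requests even though it is the busier one; raising the load weight shifts the choice to the node that is actually idle.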

Implementation

Two open source projects of interest:

Koordinator: /

Crane: /

Koordinator is built specifically for co-location (mixing online and offline workloads), whereas Crane takes FinOps as its starting point. Of the two, Koordinator suits us better, because offline co-location is the next step in our plan.

Research and Testing

After it goes live:
[Figure: results after going live]

Problems encountered

1. Hotspot nodes: at business peaks, node load rises and hotspot nodes appear. At that point the rescheduling component has to step in and move Pods to other nodes.

Hotspot nodes need to be broken up in advance. This requires resource profiling of applications, so that applications of this type are spread out at scheduling time and do not create hotspot nodes at business peaks.
2. In case 1, when nodes are added to relieve pressure on the cluster, the new nodes are quickly filled with hotspot Pods, which drives their load up and triggers rescheduling again.

Adjust the weight of the load-balancing scoring plug-in in the scheduler so that node load is more balanced and hotspot nodes are avoided.
3. Choose the right node size: on small nodes, the more containers there are, the more likely hotspot nodes are to appear.

In our business scenario, 48c nodes are currently less likely to become hotspots than 32c nodes.
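The rescheduling step for hotspot nodes can be sketched roughly as below. This is a hypothetical illustration, not the logic of any real descheduler: the `plan_migrations` function, the 70% and 60% thresholds, and the heaviest-Pod-first order are all assumptions. A real component would also have to respect Pod priority, disruption budgets, and whether the target nodes have room.

```python
def plan_migrations(node_usage, pods_by_node, capacity, high=0.7, target=0.6):
    """Return (pod, source_node) pairs whose eviction brings hot nodes down to target."""
    plan = []
    for node, used in node_usage.items():
        if used / capacity[node] <= high:
            continue  # below the hotspot threshold, leave the node alone
        # Consider the heaviest Pods first so that fewer migrations are needed.
        heaviest_first = sorted(
            pods_by_node[node].items(), key=lambda kv: kv[1], reverse=True
        )
        for pod, cpu in heaviest_first:
            if used / capacity[node] <= target:
                break
            plan.append((pod, node))
            used -= cpu
    return plan


# Hypothetical cluster state: CPU in cores.
capacity = {"node-a": 32, "node-b": 48}
node_usage = {"node-a": 26, "node-b": 20}
pods_by_node = {
    "node-a": {"api-1": 8, "api-2": 6, "job-1": 12},
    "node-b": {"api-3": 20},
}
plan = plan_migrations(node_usage, pods_by_node, capacity)
print(plan)
```

Using a target below the trigger threshold leaves some headroom, so a node does not flip back into the hotspot state immediately after a migration.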