problem phenomenon
After migrating to the new cluster, a user reported that a large number of Flink jobs on their development platform were failing, even though the cluster had sufficient YARN resources.
investigation process
The user submits jobs through their development platform. Checking the failed tasks showed they had been actively killed by the submitting side. After talking with the user, we learned that the platform has a rule for Flink-on-YARN submissions: if the submitted YARN application does not transition to the RUNNING state within 2 minutes of being submitted, the platform terminates the submission and marks the task as a failed submission.
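The platform's check can be illustrated with a minimal sketch. This is not the platform's actual code, just a polling loop against the Hadoop YarnClient API that follows the rule described above; the class name, timeout constant, and polling interval are assumptions made for illustration.

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmissionTimeoutCheck {

    // Hypothetical constant mirroring the platform's 2-minute rule.
    private static final long TIMEOUT_MS = 2 * 60 * 1000L;

    public static void main(String[] args) throws Exception {
        // e.g. "application_1625727384658_7951"
        ApplicationId appId = ApplicationId.fromString(args[0]);

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        long deadline = System.currentTimeMillis() + TIMEOUT_MS;
        try {
            while (System.currentTimeMillis() < deadline) {
                YarnApplicationState state =
                        yarnClient.getApplicationReport(appId).getYarnApplicationState();
                if (state == YarnApplicationState.RUNNING) {
                    System.out.println("Application is RUNNING, submission succeeded");
                    return;
                }
                Thread.sleep(5000L); // poll every 5 seconds
            }
            // Still not RUNNING after the timeout: kill the application and
            // report the task as a failed submission.
            yarnClient.killApplication(appId);
            System.out.println("Timed out waiting for RUNNING, application killed");
        } finally {
            yarnClient.stop();
        }
    }
}
```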
The known background is as follows:
- The old cluster did not have this problem; it appeared only after migrating to the new cluster.
- The problem only occurs during peak submission periods, not during off-peak periods.
- The client-side log on the submitting end shows that it keeps waiting, and then cancels the deployment after more than 2 minutes:
2022-07-26 17:45:32,355 INFO - Submitting application master application_1625727384658_7951
2022-07-26 17:45:32,577 INFO - Submitted application application_1625727384658_7951
2022-07-26 17:45:32,578 INFO - Waiting for the cluster to be allocated
2022-07-26 17:45:32,579 INFO - Deploying cluster, current state ACCEPTED
2022-07-26 17:46:32,801 INFO - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
2022-07-26 17:46:33,052 INFO
......
2022-07-26 17:47:16,655 INFO - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
2022-07-26 17:47:16,690 INFO - Cancelling deployment from Deployment Failure Hook
a doubtful point
Increasing the submission platform's 2-minute timeout to 5 minutes allows the task to be submitted successfully.
According to the YARN application submission process, an application transitions from ACCEPTED to RUNNING after the ApplicationMaster starts up and registers with the ResourceManager. So the question becomes: why can't the ApplicationMaster be started within 2 minutes?
The AM is itself a container, and in general the most time-consuming part of container startup is resource localization, which was confirmed by the NodeManager logs.
But resource localization itself is a very simple operation; if it is slow, the problem is usually not in the application layer but in the underlying network or hardware.
So then we moved on to troubleshooting the underlying operating system and hardware.
Analysis revealed a very high number of threads on the node in question, counted with ps -efL | grep java | wc -l:
The count was about 110,000. With around 50,000 threads, which is also the typical count on a normal node in the old cluster, Flink tasks can still be started in time; but when the number of submitted tasks grows and the thread count exceeds 100,000, the submitted Flink task can no longer be started within 2 minutes.
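For reference, the same count can be taken without ps. This is only a minimal sketch (assuming Linux and a readable /proc; the class name is hypothetical) that sums the Threads: field of /proc/&lt;pid&gt;/status for every Java process:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class CountJavaThreads {
    public static void main(String[] args) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> procs = Files.newDirectoryStream(Paths.get("/proc"))) {
            for (Path proc : procs) {
                String pid = proc.getFileName().toString();
                if (!pid.chars().allMatch(Character::isDigit)) {
                    continue; // skip /proc entries that are not processes
                }
                try {
                    // /proc/<pid>/cmdline is NUL-separated; a simple contains()
                    // is enough to mimic "grep java"
                    String cmdline = new String(Files.readAllBytes(proc.resolve("cmdline")));
                    if (!cmdline.contains("java")) {
                        continue;
                    }
                    // /proc/<pid>/status contains a "Threads:" line with the count
                    List<String> status = Files.readAllLines(proc.resolve("status"));
                    for (String line : status) {
                        if (line.startsWith("Threads:")) {
                            total += Long.parseLong(line.substring("Threads:".length()).trim());
                            break;
                        }
                    }
                } catch (IOException e) {
                    // the process may have exited while we were reading; skip it
                }
            }
        }
        System.out.println("java threads: " + total);
    }
}
```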
We also noticed on the problem node that ps itself executed slowly, which gave us the next lead.
Because so many threads are open, the /proc directory tree contains a huge number of subdirectories (one per thread under each /proc/&lt;pid&gt;/task), and traversing them with the getdents system call takes an excessive amount of time. Any program that accesses the /proc directory is slowed down for the same reason.
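The effect can also be observed without ps. Below is a minimal timing sketch (assumptions: Linux, a readable /proc, and a hypothetical class name) that walks /proc and every /proc/&lt;pid&gt;/task directory, which is roughly the work ps -efL does; on Linux the directory iteration is served by getdents64 under the hood.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ProcScanTiming {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long entries = 0;
        // Walk /proc and each /proc/<pid>/task directory; every thread adds
        // one entry, so the scan time grows with the total thread count.
        try (DirectoryStream<Path> procs = Files.newDirectoryStream(Paths.get("/proc"))) {
            for (Path proc : procs) {
                entries++;
                if (!proc.getFileName().toString().chars().allMatch(Character::isDigit)) {
                    continue;
                }
                try (DirectoryStream<Path> tasks =
                        Files.newDirectoryStream(proc.resolve("task"))) {
                    for (Path ignored : tasks) {
                        entries++;
                    }
                } catch (IOException e) {
                    // the process may have exited mid-scan; skip it
                }
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("scanned " + entries + " entries in " + elapsedMs + " ms");
    }
}
```

Running the same sketch on a node with roughly 50,000 threads and on one with more than 100,000 should make the slowdown visible.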
So I guessed that the Flink startup timeout was caused by the same factor, and used the strace utility to trace the NodeManager process during the startup of the AM container: find the pid of the process/thread that downloads the jar package, then run strace -p pid -f -Ttt 2>
The trace showed that, under the hood, the NodeManager process also operates on the /proc directory.
So the conclusion is that too many threads lead to too many directories under /proc, which slows down the underlying system calls and in turn slows down the upper-level Java program.
However, the user then questioned this: the new cluster's CPU and memory configuration is twice that of the old cluster, so why can't it support twice as many Flink task submissions? Moreover, the CPU load on the problem node was not saturated at the time.
There were no other leads.
At this point, we found that the problem node was running Linux kernel 3.10, a relatively old version, and suspected that the kernel's handling of many threads on many cores was the bottleneck. We therefore built a test environment and measured with strace, and found that newer kernel versions do indeed show a huge difference in getdents call performance.
conclusion
Linux kernel 3.10 is not optimized for handling a large number of threads on many-core machines, resulting in poor performance of some underlying system calls such as getdents on /proc.
Upgrading the kernel to version 4.18 solved the problem.