Oh, my God.
Hello, I'm Fishskin, a programmer. It's time to follow along with Fishskin's experience handling a production incident!
Scene of the accident
On Monday afternoon, our online service suddenly became inaccessible: pages just kept loading without ever responding.
It was really painful to watch, and the developers on our team started troubleshooting right away.
We first looked at the requests the front end was sending to the back end and found that every one of them hung until it timed out. Requests made directly to the back-end interfaces behaved the same way: they waited a long time and never returned data. Crucially, every interface was blocked; even the health-check interface, which simply returns "ok" without querying the database, could not respond.
Our back-end service is deployed on a container hosting platform. Normally, when resource usage (CPU, memory, etc.) exceeds a certain threshold, nodes are scaled out automatically so the service can handle more concurrent requests. So why didn't it scale out this time?
Experienced readers can probably already guess why the interfaces were blocked; let me take you through unraveling the mystery.
Investigating the accident
Based on the symptoms above, the most likely culprit was the interface layer itself, not the business logic, the database, or any other dependency. Our back end uses Spring Boot with an embedded Tomcat server, and the maximum number of threads Tomcat uses to process requests concurrently is fixed (200 by default). When too many requests arrive at once and none of them finish, all the worker threads stay busy, no thread is left to pick up new requests, and new requests queue up waiting to be processed, which is exactly what a blocked (slow-responding) interface looks like.
Here I use a simple program to simulate the interface blocking and troubleshooting process.
First, write a very simple test interface and simulate a time-consuming operation by setting a breakpoint right before the return statement, so the thread handling the request sits in a long wait.
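Here is a minimal sketch of such a test controller (the class name and the /test path are my own placeholders, not necessarily what the original project used):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TestController {

    @GetMapping("/test")
    public String test() {
        // Set an IDE breakpoint on the return line below to hold the
        // request-processing thread; alternatively, a Thread.sleep(...)
        // here would simulate a long-running operation without a debugger.
        return "ok";
    }
}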
Then change Tomcat's maximum number of threads to 5 so we can simulate running out of threads:
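Assuming Spring Boot 2.3 or later, the cap can be set in application.properties like this (older versions use server.tomcat.max-threads instead):

# Cap Tomcat's request-processing threads at 5 to simulate thread exhaustion
server.tomcat.threads.max=5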
Start the project, make sure the breakpoint is set, and then request the interface six times in a row.
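If you'd rather script the six requests than click them by hand, here is a minimal sketch using Java's built-in HttpClient (it assumes the service listens on localhost:8080 and the test endpoint is /test, both placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BurstClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://localhost:8080/test"))
                .build();

        // Fire 6 requests concurrently; with Tomcat capped at 5 threads,
        // the 6th request has no thread to serve it and simply hangs.
        for (int i = 0; i < 6; i++) {
            int id = i;
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(resp -> System.out.println("request " + id + " got: " + resp.body()));
        }

        Thread.sleep(10 * 60 * 1000); // keep the JVM alive while requests hang
    }
}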
Only 5 of those requests reach the breakpoint; the last one just keeps spinning, stuck with no thread available to process it. With that, we've reproduced the scene of the accident.
But so far this is only speculation. In a real online project, how do we confirm that Tomcat's threads really are tied up? And how do we confirm which interface or which piece of code is making those threads block and wait?
It's actually quite simple. First, use the jps -l command to find the process PID of the Java back-end service:
Then use the jstack command to generate a snapshot of all threads and save it to a file. The specific command is as follows:
jstack <Process PID> > thread_dump.txt
Open the thread snapshot file and you can see the status of every thread at a glance. Search for http-nio to find Tomcat's request-processing threads, and sure enough, every one of them was in the TIMED_WAITING state. This state means the thread is waiting for another thread to perform a specific action, with an upper bound on how long it will wait. Even better, the stack trace shows directly which line of code each request was blocked on.
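If you don't want to scroll through the whole file, you can filter the dump on the command line, for example with grep (a convenience sketch; the exact thread-name prefix, such as http-nio-8080-exec-*, depends on your connector and port):

# Show the Tomcat worker threads plus 20 lines of stack context each
grep -A 20 "http-nio" thread_dump.txt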
Using this method, we quickly pinpointed the cause of the Programming Navigator interface blockage: a method that queries users from the database. Because we had sent a mass SMS recall to old users the previous afternoon, a large number of users opened Programming Navigator at the same time and all hit this method. The database query involved was slow, so each request had to wait for its query result before it could respond, and only then could the next request get its turn at the database. As a result, a large number of Tomcat request-processing threads sat blocked waiting on database queries, and new requests piled up in the queue.
The truth comes out!
How is it resolved?
The problem we hit this time is actually a classic "connection pool exhausted in production" scenario, which comes up in interviews all the time. We've covered how to troubleshoot it, so how do we fix it?
First, when you find that the pool is exhausted, preserve the scene, for example by dumping the thread information the way Fishskin did above, and then restart the service or bring up a new instance as quickly as possible so users can get back to normal. Only after that should you dig into analysis and optimization.
How do we optimize away the pool-exhaustion problem? The first step is always to optimize the code that is causing requests to block. For example, if a database query is slow, add indexes to speed it up.
You can also increase the size of the database connection pool. Spring Boot uses HikariCP as the data-source connection pool by default, and HikariCP's maximumPoolSize defaults to only 10, which is clearly not enough for high-concurrency scenarios. You can raise it with the following configuration:
# Increase HikariCP's maximum pool size (the value here is only an example)
spring.datasource.hikari.maximum-pool-size=20
Since back-end requests involve more than just database queries, you can also adjust Tomcat's maximum thread count and minimum idle thread count as needed, for example with the following configuration:
# Set the maximum number of threads for Tomcat
server.tomcat.threads.max=300
# Set the minimum number of idle threads for Tomcat
server.tomcat.threads.min-spare=20
Increasing Tomcat's maximum thread count raises its capacity for concurrent requests, while raising the minimum number of idle threads ensures that at peak times Tomcat can respond to new requests quickly instead of having to create threads on the fly.
In most of our cases, memory utilization on the online servers (containers) isn't high, so you can adjust these settings to match your actual resources and concurrency. Just remember to test thoroughly: setting the thread count too high increases thread-scheduling overhead and can actually reduce performance.
The reality
Well, that's my approach to this kind of problem. But reality isn't always so ideal. In fact, we had already pinpointed this slow SQL during the previous outage and shared it within the group. And yet the developers on our team just posted a pile of monitoring screenshots, and nobody actually fixed the problem, which is how the outage ended up repeating itself many days later!
Once a problem is identified, it's important to follow up with a solution and actually land it as soon as possible; otherwise, what was all that monitoring for?
Why did this incident drag on so long? Partly because the developers on my team lacked experience with handling production issues: they were busy analyzing the situation and forgot to restore the service first, so half an hour later users still couldn't access the site, until I went over to remind them.
So now you see how important the interview "eight-legged essay" material you memorize really is, right? Tomcat's connector configuration and performance tuning is itself a classic piece of that material, as well as our
More
💻 Programming Learning Exchange: Programming Navigation
📃 Quick Resume Maker: Old Fish Resume
✏️ Interview Brushup: interviewer