Online Troubleshooting Guide

Preface

Lately I've been asked a lot how to quickly troubleshoot problems in production.

This really is a test of work experience.

If you have run into a problem before, you can quickly pinpoint the cause when something similar comes up again.

But if a particular problem is new to you, you may feel like you have no idea where to start.

This article summarizes some of the troubleshooting approaches for production problems that I've used before. I hope it will be helpful to you.

1 OOM issues

When OOM issues occur in a production environment, they are generally very serious, and the service may go down.

But OOM problems arise in multiple scenarios, and the causes differ from one scenario to another.

1.1 Heap Memory OOM

The server's logs typically print the following:

java.lang.OutOfMemoryError: Java heap space

This is the most common kind of OOM problem.

The following parameters can be added when the Java service is started:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=

When an OOM occurs, the JVM will automatically dump the memory contents at that moment to the specified file.

Then use MAT (Memory Analyzer Tool) or VisualVM (bundled with the JDK through JDK 8) to analyze the dump file and find the code that is causing the OOM.
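As a quick sanity check, a service can verify at startup that it was really launched with the heap dump flag. The sketch below is only an illustration: the class name HeapDumpFlagCheck and its method are hypothetical, and it only looks for the flag among the arguments the JVM reports through the standard RuntimeMXBean.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: checks whether the JVM was started with
// -XX:+HeapDumpOnOutOfMemoryError, so that a dump is written when an OOM occurs.
final class HeapDumpFlagCheck {

    // Returns true if the given JVM arguments contain the heap dump flag.
    static boolean heapDumpEnabled(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+HeapDumpOnOutOfMemoryError");
    }

    public static void main(String[] args) {
        // Arguments passed to the running JVM (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!heapDumpEnabled(jvmArgs)) {
            System.err.println("Warning: heap dump on OOM is not enabled");
        }
    }
}
```

This is a presence check only; it does not detect a later argument that switches the flag off again.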

1.2 Stack Memory OOM

The exception message for a stack memory OOM problem is as follows:

java.lang.OutOfMemoryError: unable to create new native thread

In practice, this problem is usually caused by creating too many threads, or by each thread taking up too much memory.

When it happens, check the number of threads in the service.

A thread pool is recommended to reduce thread creation and effectively control the number of threads in the service.
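A minimal sketch of that advice, using the JDK's ThreadPoolExecutor. The pool sizes, queue capacity, and the class name BoundedPoolExample are assumptions chosen for illustration, not recommended values.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a bounded thread pool instead of one new thread per task.
final class BoundedPoolExample {

    static ThreadPoolExecutor newBoundedPool() {
        return new ThreadPoolExecutor(
                4,                                          // core threads
                8,                                          // hard upper limit on threads
                60, TimeUnit.SECONDS,                       // idle time before extra threads exit
                new ArrayBlockingQueue<>(1000),             // bounded queue of waiting tasks
                new ThreadPoolExecutor.CallerRunsPolicy()); // caller runs the task when the pool is full
    }

    // Runs `tasks` small jobs on the pool and returns how many completed.
    static int runTasks(int tasks) {
        ThreadPoolExecutor pool = newBoundedPool();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runTasks(500));
    }
}
```

However many tasks are submitted, the service never runs more than the configured maximum number of worker threads.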

1.3 Stack Memory Overflow

The exception for a stack memory overflow problem is as follows:

java.lang.StackOverflowError

The problem is usually caused by recursive calls in the business code: when the recursion depth exceeds the maximum depth the thread's stack allows, a stack overflow occurs.

If this problem occurs in the production environment, check whether the recursive calls behave as expected; infinite recursion is a likely cause.
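To make the recursion-depth point concrete, here is a small sketch with the same computation written recursively and iteratively. The class and method names are hypothetical, and whether the recursive call in main overflows depends on the thread stack size of the JVM it runs on.

```java
// Sketch: deep recursion uses one stack frame per step; a loop does not.
final class RecursionDepthExample {

    // Recursive version: very large n can exhaust the thread stack.
    static long sumRecursive(long n) {
        return n == 0 ? 0 : n + sumRecursive(n - 1);
    }

    // Iterative version: same result, constant stack depth.
    static long sumIterative(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumIterative(1_000_000));
        try {
            System.out.println(sumRecursive(1_000_000));
        } catch (StackOverflowError e) {
            // Whether this branch is reached depends on the stack size (-Xss).
            System.out.println("recursion too deep: " + e);
        }
    }
}
```

When the recursion is legitimate but deep, rewriting it as a loop like this is often the simplest fix.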

1.4 GC OOM

The exception message when a GC OOM problem occurs is as follows:

java.lang.OutOfMemoryError: GC overhead limit exceeded

A GC OOM generally means the JVM is spending almost all of its time in garbage collection while reclaiming very little memory, usually because there are too many live objects in the heap. It is recommended to adjust the GC strategy.

For example, start old-generation GC when the old generation reaches 80% occupancy, and set -XX:SurvivorRatio (e.g. -XX:SurvivorRatio=8) and -XX:NewRatio (e.g. -XX:NewRatio=4) to more reasonable values.
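Before tuning anything, it helps to see how much time the JVM actually spends in GC. The sketch below estimates that with the standard management beans; the class name GcOverheadProbe is hypothetical, and the result is a rough indicator rather than the exact figure the JVM uses for its overhead limit.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: estimate the share of JVM uptime spent in garbage collection.
final class GcOverheadProbe {

    // Total accumulated collection time across all collectors, in milliseconds.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Fraction of uptime spent in GC since the JVM started.
    static double gcTimeFraction() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime <= 0 ? 0.0 : (double) totalGcMillis() / uptime;
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms, fraction of uptime: %.4f%n",
                totalGcMillis(), gcTimeFraction());
    }
}
```

A fraction that keeps climbing over time is a hint that the heap is filling with objects that cannot be reclaimed.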

1.5 Meta-space OOM

The exception message when a meta-space OOM problem occurs is as follows:

java.lang.OutOfMemoryError: Metaspace

Since JDK 8, Metaspace has replaced the permanent generation (PermGen). Metaspace is the method area's implementation in HotSpot.

This problem is usually caused by too many classes being loaded into memory, or by classes that are too large.

If this problem occurs in a production environment, you can adjust the metaspace size with the following parameters (the values shown are only examples):

-XX:MetaspaceSize=10m -XX:MaxMetaspaceSize=10m
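To decide on a sensible size, it is useful to know how much Metaspace the service currently uses. The sketch below reads it from the standard memory pool beans. It assumes a HotSpot JVM, where the pool is named "Metaspace"; the class name MetaspaceProbe is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Sketch: report current Metaspace usage of the running JVM.
final class MetaspaceProbe {

    // Returns used Metaspace in bytes, or -1 if no pool named "Metaspace" is found.
    static long usedMetaspaceBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1 : usage.getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used (bytes): " + usedMetaspaceBytes());
    }
}
```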

I've listed the most common OOM scenarios here. If you want to know more, you can read a previous article of mine, "The 6 most common OOM problems at work", which describes them in more detail.

2 CPU 100% problem

It is also common for online services to have CPU 100% issues.

This problem occurs due to the service hogging CPU resources for a long time.

The main reasons are these below:

To locate this problem, you can use the jstack tool that ships with the JDK, or Arthas, the diagnostic tool open-sourced by Alibaba.
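The same kind of information can also be collected from inside the process. The sketch below lists the threads with the most accumulated CPU time using the standard ThreadMXBean. It is not how jstack or Arthas work internally; the class name BusyThreadProbe is hypothetical, and the numbers are cumulative since each thread started, not an instantaneous CPU percentage.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: find the threads that have consumed the most CPU time.
final class BusyThreadProbe {

    // One thread and its accumulated CPU time in nanoseconds.
    static final class Entry {
        final String name;
        final long cpuNanos;

        Entry(String name, long cpuNanos) {
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    // Returns up to `limit` threads, ordered by CPU time, highest first.
    static List<Entry> topCpuThreads(int limit) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<Entry> result = new ArrayList<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return result; // CPU time measurement is not available on this JVM
        }
        for (long id : threads.getAllThreadIds()) {
            long cpu = threads.getThreadCpuTime(id);     // -1 if the thread is no longer alive
            ThreadInfo info = threads.getThreadInfo(id); // null if the thread is no longer alive
            if (cpu >= 0 && info != null) {
                result.add(new Entry(info.getThreadName(), cpu));
            }
        }
        result.sort(Comparator.comparingLong((Entry e) -> e.cpuNanos).reversed());
        return result.size() > limit ? new ArrayList<>(result.subList(0, limit)) : result;
    }

    public static void main(String[] args) {
        for (Entry e : topCpuThreads(5)) {
            System.out.println(e.name + " cpu(ns)=" + e.cpuNanos);
        }
    }
}
```

The thread names found this way can then be matched against a thread dump to see which code they are running.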

If you are more interested in the CPU 100% issue, check out my other article, "Crap, the CPU is 100%!!!!", which describes it in more detail.

3 Interface Timeout Issues

You may have run into this scenario: we provide an API whose response time has always been fast, and then at some random moment it suddenly times out.

There are many possible causes of an interface timeout, and we need to check them one by one.

The following chart lists the common reasons why interfaces in production environments suddenly time out:

If you want to learn more about the interface timeout issue, you can check out my other article, "Interface Sudden Timeout 10 Deadly Sins".

4 Indexing failures

You may have run into this: an index was clearly created in the production environment, yet when the database executes the SQL, the index unexpectedly is not used.

The index failure hurts the interface's performance, making an operation that used to be fast suddenly slow.

We can use the explain keyword to view the SQL's execution plan and confirm whether the index is being used.
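As a rough sketch of how that check could be automated over JDBC: in MySQL's EXPLAIN output, a type of ALL means a full table scan, and a NULL key means no index was chosen. The class name ExplainCheck and its methods are hypothetical, the rule is only a heuristic, and the SQL passed in is assumed to be a trusted query string.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: flag queries whose MySQL execution plan does not use an index.
final class ExplainCheck {

    // Heuristic on one EXPLAIN row: full scan, or no index chosen.
    static boolean indexLikelyUnused(String type, String key) {
        return "ALL".equalsIgnoreCase(type) || key == null || key.isEmpty();
    }

    // Runs EXPLAIN on the query and returns true if any row looks unindexed.
    static boolean queryLikelySkipsIndex(Connection conn, String sql) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("EXPLAIN " + sql)) {
            while (rs.next()) {
                if (indexLikelyUnused(rs.getString("type"), rs.getString("key"))) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

A check like this could run against slow queries collected from the logs, but the execution plan should still be read by a person before changing any index.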

If the index is failing, what could be causing the problem?

This chart below gives you a list of common causes:

For those who want to learn more about index failures, take a look at my other article, "Talking about 10 scenarios where indexing fails and it's too much of a pitfall", which describes them in great detail.

5 Deadlock Problems

If you use a MySQL database, you have very likely run into deadlocks in the production environment.

A deadlock occurs when two or more transactions, competing for resources during execution, end up waiting for each other; without outside intervention, none of them can make progress.

In Java, when using a MySQL database, if you encounter the exception MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction, it means that the database has detected a deadlock.

MySQL deadlocks are usually caused by the following reasons:

Resource contention: Multiple transactions compete for the same resource at the same time, e.g., they all try to acquire a lock held by the other.
Cyclic Waiting: Transactions form a cyclic relationship where they wait for each other to release resources.
Improper transaction design: transactions are executed in an unreasonable order, take too long to execute, etc.
Concurrent Operation Conflict: In a highly concurrent environment, multiple transactions operating on the same set of data can easily trigger lock conflicts leading to deadlocks.
Inappropriate use of indexes: If the indexes are not well designed, it may lead to problems in acquiring locks for the transaction.

How can deadlocks be minimized?

Set a reasonable transaction isolation level.
Avoid large transactions in business code.
Optimize SQL performance.
Add handling for lock wait timeouts.
Add monitoring and analysis.
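Since the exception message itself says "try restarting transaction", one common complement to the measures above is to retry the whole transaction a limited number of times. The sketch below shows the idea. The DeadlockRetry class and SqlWork interface are hypothetical, the back-off values are arbitrary, and the work passed in is assumed to contain the complete transaction, because MySQL rolls back the transaction it picks as the deadlock victim.

```java
import java.sql.SQLException;

// Sketch: retry a transaction a few times when MySQL reports a deadlock.
final class DeadlockRetry {

    // A unit of transactional work (hypothetical interface for this sketch).
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    // MySQL reports a deadlock as error code 1213 with SQLSTATE 40001.
    static boolean isDeadlock(SQLException e) {
        return e.getErrorCode() == 1213 || "40001".equals(e.getSQLState());
    }

    static <T> T runWithRetry(SqlWork<T> work, int maxAttempts) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isDeadlock(e) || attempt >= maxAttempts) {
                    throw e; // not a deadlock, or out of attempts
                }
                try {
                    Thread.sleep(50L * attempt); // simple linear back-off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

Retrying only hides occasional deadlocks; if they happen frequently, the transaction design still needs to be fixed.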

6 Disk Problems

Server disk problems are the easiest of the many production problems to troubleshoot.

There are two general types of disk problems:

The disk is broken.
Insufficient disk space

If the disk is broken, it is generally hard for Ops to repair it in a short time.

Therefore, the disk needs to be replaced promptly.

If disk space is insufficient, you generally need to log in to that server and run:

df -Hl

This shows the server's current disk usage:

Total size
How much has been used
How much is available

The fastest solution is to delete the files in the /tmp folder, which frees up some disk space.

Then find the log files and delete the logs that are more than 7 days old.

These two steps generally free up quite a bit of disk space and temporarily solve the shortage.
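The log cleanup step can be scripted so that it is not done by hand under pressure. Below is a minimal sketch that deletes files ending in .log that are older than a given age in one directory. The class name OldLogCleaner, the .log suffix, and the idea of passing the directory as an argument are assumptions for illustration; adapt them to where your service really writes its logs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: delete *.log files in a directory that were last modified long ago.
final class OldLogCleaner {

    // Deletes matching files older than maxAge and returns how many were removed.
    static int deleteOldLogs(Path logDir, Duration maxAge) {
        Instant cutoff = Instant.now().minus(maxAge);
        try (Stream<Path> files = Files.list(logDir)) {
            List<Path> old = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".log"))
                    .filter(p -> lastModified(p).isBefore(cutoff))
                    .collect(Collectors.toList());
            for (Path p : old) {
                Files.delete(p);
            }
            return old.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The log directory is supplied by the caller; no default path is assumed.
        if (args.length == 1) {
            int deleted = deleteOldLogs(Paths.get(args[0]), Duration.ofDays(7));
            System.out.println("deleted log files: " + deleted);
        }
    }
}
```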

In the long run, we need to monitor the server's disk usage and raise an alert when a threshold is exceeded.
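A threshold check of this kind can be very small. The sketch below computes the used fraction of a file system with the standard java.io.File methods. The class name DiskUsageCheck and the 85% threshold are illustrative assumptions; in practice the check would normally live in your monitoring system rather than in application code.

```java
import java.io.File;

// Sketch: warn when disk usage on a path exceeds a threshold.
final class DiskUsageCheck {

    // Used fraction in the range [0, 1], given total and usable bytes.
    static double usedFraction(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0; // size unknown, nothing to report
        }
        return 1.0 - (double) usableBytes / totalBytes;
    }

    // True if the file system containing `path` is fuller than `threshold`.
    static boolean exceeds(File path, double threshold) {
        return usedFraction(path.getTotalSpace(), path.getUsableSpace()) > threshold;
    }

    public static void main(String[] args) {
        File root = new File("/");
        if (exceeds(root, 0.85)) {
            System.err.println("Warning: disk usage above 85% on " + root);
        }
    }
}
```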

At the same time, the business systems need a logging standard that specifies which scenarios need to print logs and which do not; not every scenario should print logs.

In particular, some business query interfaces are called very frequently and return a lot of data at once. Logging in such cases makes the logs on the server grow rapidly and take up too much disk space.

7 MQ Message Backlog Problem

If you have used MQ messaging middleware, you must have encountered the MQ message backlog problem in the production environment.

This problem generally occurs when MQ consumers consume messages at a slower rate than MQ producers produce them.

If everything has been fine and then one day a message backlog suddenly appears, it could be caused by the following:

MQ producers are sending messages in bulk.
As data grows, the MQ consumers' business logic hits MySQL index failures or wrong index choices, so messages are processed more slowly.

If the production environment is experiencing an MQ message backlog, first check whether the MQ producer is sending messages in bulk.

If so, increase the core thread count and the maximum thread count of the thread pool in the MQ consumer, so that more threads process the business logic and consumption capacity goes up.

This solution assumes that the MQ consumer already uses a thread pool to consume messages.
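If the consumer's pool is a standard ThreadPoolExecutor, both sizes can be changed at runtime without restarting the service. The sketch below shows one way to do that safely. The class name ConsumerPoolResizer is hypothetical, and how the new values reach the service (for example via a configuration center) is left open.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: adjust the core and maximum size of a running thread pool.
final class ConsumerPoolResizer {

    static void resize(ThreadPoolExecutor pool, int core, int max) {
        if (core < 0 || max <= 0 || core > max) {
            throw new IllegalArgumentException("need 0 <= core <= max and max > 0");
        }
        if (max >= pool.getMaximumPoolSize()) {
            // Growing: raise the maximum first so the core size never exceeds it.
            pool.setMaximumPoolSize(max);
            pool.setCorePoolSize(core);
        } else {
            // Shrinking: lower the core size first for the same reason.
            pool.setCorePoolSize(core);
            pool.setMaximumPoolSize(max);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));
        resize(pool, 8, 16);
        System.out.println(pool.getCorePoolSize() + " / " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

The order of the two setter calls matters because recent JDK versions reject a core size larger than the current maximum. Note also that a ThreadPoolExecutor only creates threads beyond the core size when its queue is full, so with a large queue it is the core size that mainly determines consumption capacity.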

If you are not using a thread pool, you will have to add server nodes temporarily.

If the MQ producer is not sending messages in bulk, you need to find where the MQ consumer's business logic has a performance problem and optimize that code.

Possible optimizations include:

Optimizing indexes
Optimizing SQL statements
Asynchronous processing
Batch processing

And so on; there are others.
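As an illustration of the batch processing item, the sketch below splits a list of messages into fixed-size batches, so that each batch can be handled with one database round trip instead of one per message. The class name Batches is hypothetical and the batch size is up to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a list into consecutive batches of at most batchSize items.
final class Batches {

    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end))); // copy, so batches are independent
        }
        return batches;
    }
}
```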

For those who are more interested in performance optimization tips, you can check out my other article, "I've used these 11 tricks to improve interface performance by a factor of 100", which describes them in great detail.

8 Calling the interface reports an error

In our production programs, it sometimes happens that calls to a certain API have always worked, but suddenly start returning an error, that is, the return code is not 200.

So how do we troubleshoot this kind of problem?

8.1 Return 401

In production environments, this generally happens because the interface was called without login authentication.

A user is normally required to go through some form of authentication (e.g., logging in) before accessing a protected resource. If the necessary authentication information, such as a token or a username and password, is missing or incorrect, the return code will be 401.

8.2 Return 403

If the production environment requests an interface and the return code is 403, it means that the caller currently has no permission to access the resource.

This scenario is different from a return code of 401.

A 401 is about authentication: the user did not provide correct authentication information.

A 403, on the other hand, means that authentication succeeded but the user does not have sufficient privileges to access the requested resource.

To solve this problem, we need to assign the appropriate access rights to the caller of the interface.
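The difference between the two codes can be summed up in a few lines. This is only a sketch of the decision order, with a hypothetical class name; real frameworks make the same decision in their own filters or interceptors.

```java
// Sketch: authentication is checked first (401), then authorization (403).
final class AccessDecision {

    static int statusFor(boolean authenticated, boolean authorized) {
        if (!authenticated) {
            return 401; // who are you? credentials missing or invalid
        }
        if (!authorized) {
            return 403; // identity is known, but permission is missing
        }
        return 200;
    }
}
```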

8.3 Return 404

Without a doubt, a 404 means that the interface address you are requesting does not exist, or no longer exists.

For example, an interface was renamed, or the version number in the path was upgraded and /v1/user/query became /v2/user/query.

If not all interface callers were notified, some of them will get a 404 when they call the interface.

Another possibility that can also lead to a 404: the interface address was registered with the API gateway, but there is a problem with the API gateway's configuration.

First check whether the interface URL was changed, then check whether there is a problem with the gateway or Nginx configuration.

8.4 Return 405

If the interface is requested and the return code is 405, it is usually caused by using the wrong request method.

The most common case: the interface only supports POST, but the caller sends a GET request.

Or the interface only supports GET, but the caller sends a POST request.

This kind of problem is usually very easy to troubleshoot and fix.

8.5 Return 500

If the interface is requested and the return code is 500, there is usually an internal error in the service.

Generally the gateway layer wraps the interface's return value and does not return the real exception message.

We can only look at the interface's error logs to locate and troubleshoot the problem.

It is recommended to log the interface's request parameters when an exception occurs, to make it easier to reproduce the problem later.

There are many possible causes, and we can only work through them one by one based on the server's error logs and the relevant business code.

8.6 Return 502

If the interface is requested and the return code is 502, the service is usually unavailable.

There are two scenarios:

The server is in the process of restarting.
The service has crashed.

At this point, you can check the service's monitoring, or log in to the server to check its running status.

In most cases, restarting the service will quickly fix the problem.

Afterwards, use the logs on the server to pinpoint the specific cause, e.g., an OOM issue.

8.7 Return 504

If the interface is requested and the return code is 504, it is usually due to a gateway or interface timeout.

This happens when the interface takes longer to return data than the timeout configured on the gateway.
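The relationship between the two times can be simulated in plain Java. The sketch below plays the role of a gateway that waits a limited time for a backend call and answers 504 when that limit is exceeded. It is a simplified model, not how any particular gateway is implemented; the class name GatewayTimeoutExample is hypothetical, and creating an executor per call is done only to keep the example short.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: a gateway-style call that gives up after a fixed timeout.
final class GatewayTimeoutExample {

    // Calls the backend and maps the outcome to an HTTP-style status code.
    static int forward(Callable<String> backend, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> call = executor.submit(backend);
        try {
            call.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return 200; // backend answered in time
        } catch (TimeoutException e) {
            call.cancel(true);
            return 504; // backend took longer than the gateway is willing to wait
        } catch (ExecutionException e) {
            return 500; // backend failed with an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 500;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(forward(() -> "fast", 1000));
        System.out.println(forward(() -> { Thread.sleep(1000); return "slow"; }, 100));
    }
}
```

Raising the gateway timeout makes the 504 disappear but leaves the caller waiting just as long, which is why optimizing the slow interface is usually the better fix.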

When this happens, you generally need to optimize the interface-related code.

If you are more interested in interface optimization, you can check out this article of mine, "I've used these 11 tricks to improve interface performance by a factor of 100", which describes them in great detail.

If you're more interested in the pitfalls of day-to-day work, take a look at my tech column, "The 100 Most Common Questions Programmers Ask". It has a lot of practical content and is well worth a look.
