
No, man, who taught you to handle production like that?


Hello, I'm Crooked.

I recently ran into a production issue: a service I'm responsible for triggered a memory utilization alert. By the time I received the alert, memory utilization was already at 80%, yet a quick glance at the GC stats showed that FullGC hadn't been triggered even once.

Based on this, I hypothesized two possibilities at the time: a memory overflow or a memory leak.

Okay, suppose this is an interview. The interviewer has given you only that much information and asks: is it an overflow or a leak? How do you answer?

Before answering, we need to be clear about what is an overflow and what is a leak.

  • OutOfMemoryError: A memory overflow is when a program requests more memory than the maximum amount of memory currently allowed by the JVM. When the JVM tries to allocate memory for an object, an exception is thrown if the currently available heap memory is insufficient to fulfill the request. This is usually because the heap space is too small or is full for some reason.
  • Memory Leak: A memory leak is when memory space that is no longer in use is not freed, resulting in that portion of memory not being available for use again. Although a memory leak will not immediately cause a program to crash, it will gradually consume available memory and may eventually lead to a memory overflow.

While both are memory related, they occur at different times and have different effects. A memory overflow usually happens at runtime when a data structure exceeds a preset limit, most commonly when you allocate a very large object, such as querying far too much data out of the database at once.

A memory leak, on the other hand, has little to do with "too much" at any one moment; it's a gradual process. The impact of a single leak may be tiny, but over time the accumulated leaks can eventually lead to a memory overflow.
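
To make the difference concrete, here's a minimal Java sketch (the class and field names are made up for illustration): the first method asks for more memory than a small heap can provide in a single allocation, while the second keeps adding to a static collection that is never cleared, so memory drains away gradually.

```java
import java.util.ArrayList;
import java.util.List;

public class MemoryDemo {

    // A static collection that is never cleared: a classic source of memory leaks.
    private static final List<byte[]> CACHE = new ArrayList<>();

    // Memory overflow: a single allocation that exceeds what the heap can provide.
    // Run with a small heap (e.g. -Xmx64m) and this throws OutOfMemoryError at once.
    static void overflow() {
        byte[] huge = new byte[1024 * 1024 * 1024]; // ask for 1 GB in one go
        System.out.println(huge.length);
    }

    // Memory leak: each call "uses" a little memory and never releases it.
    // No single call fails, but over time the heap slowly fills up.
    static void leak() {
        CACHE.add(new byte[1024 * 1024]); // 1 MB that can never be reclaimed
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            leak();   // harmless-looking, but it accumulates
        }
        overflow();   // fails immediately if the heap is small enough
    }
}
```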

That's the concept. The two are often confused with each other, which is why I keep spelling it out.

With the concept clear, going back to the very beginning of this question, how do you answer it?

You can't answer.

You can't answer because the information is too incomplete.

Interviewers like to pose questions laced with incomplete information and traps, to throw you off and see what you're really made of.

First of all, the reason you can't tell is the detail mentioned earlier: not a single FullGC has happened yet.

Even though memory usage is at 80% now, if it drops back down after a FullGC, then there's nothing wrong with the program.

If it doesn't go down, then in all likelihood you're looking at a memory overflow, and you need to dig into the code to find where a large object is being allocated.
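
If you don't have a monitoring dashboard handy, one quick way to see whether any FullGC has run at all is to read the counters the JVM itself exposes. Here's a minimal sketch using the standard GarbageCollectorMXBean API; the exact collector names ("G1 Old Generation", "PS MarkSweep", etc.) depend on which GC algorithm you're running.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCounters {
    public static void main(String[] args) {
        // Each collector (young generation / old generation) reports its own
        // collection count and accumulated pause time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // The bean whose name refers to the old generation tells you how many
        // FullGC-style collections have actually run.
    }
}
```

The same numbers are also available from outside the process via `jstat -gcutil <pid>`, which is usually the first thing to check on a live box.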

And if it does go down, does that mean there can't be a memory leak?

Not necessarily either, because, as stated earlier, a memory leak is a gradual process.

As for memory leaks, if your monitoring is in place, you'll recognize this kind of chart:

A slow, steady upward memory trend, then a frenzy of GC at the end that reclaims nothing, and the program simply crashes.

Whether it's a memory leak or not is obvious at a glance.

That chart comes from an article I wrote last year: "It's a tough production problem, but once I write it up, it's yours."

It describes a memory leak that I tracked down by analyzing the dump file, eventually locating the leak and fixing the code.

A memory leak, no matter how complex, has a methodology for dealing with it.

It comes down to analyzing dump files, using the right tools, and having enough patience plus a little bit of luck.
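
The first step of that methodology is getting the dump itself. Besides the usual `jmap -dump` from the command line, you can also trigger one from inside the process; a minimal sketch (the output path is just an example):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostic =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable objects, which keeps the file smaller
        // and is usually what you want when hunting a leak.
        diagnostic.dumpHeap("/tmp/service-heap.hprof", true);
        System.out.println("Heap dump written to /tmp/service-heap.hprof");
    }
}
```

The resulting .hprof file can then be opened in MAT or a similar analyzer to look at the dominator tree and see what is holding on to the memory.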

So instead of going over all of that again, I want to share how I actually responded this time to the memory alert I mentioned at the beginning of the article.

Here's how I handled it: restart the service.

Yes, the conventional move would be to preserve the scene first and then restart the service. But what I actually did was simply execute the contingency plan and restart the service, with no follow-up action at all.

My thinking at the time went roughly like this.

First, this service is an edge service that carries only a small amount of data. Its business has seen nothing new for over a year, and the existing data is slowly dying out. The code has barely changed in the past year or two, apart from horizontal changes like upgrading jar packages and adding log instrumentation.

Second, I checked and saw that the service hadn't been restarted in over four months, there had been no bursty traffic in that time, the amount of data processed per day was trending downward, and the memory trend was indeed a slow climb, so my initial suspicion was a memory leak.

Then, this is a service I took over from another team. Combined with the previous point that the business is dying out, I only know its general functionality, not its internal details, so with that lack of familiarity it would take considerable effort to pinpoint the problem.

Finally, under the company's processes, although I know how to troubleshoot this kind of problem and how to use the relevant commands and tools, as a developer I don't have access to the troubleshooting tools and commands the Ops staff do, so to locate the problem I would have to coordinate with an Ops colleague for help.

So, after mentally weighing the input-output ratio, I decided to just restart the service and not go chasing the problem.

At the current rate, the program might trigger another memory alert after four or five months of normal operation, so worst case I restart it every three months or so. A restart takes about 30 seconds, so four restarts a year add up to only about 2 minutes.

Even if it takes five years for this business to die out completely, that's still only about 10 minutes.

Whereas if I were to determine whether it really is a memory leak and where exactly it is leaking, then given my unfamiliarity with the system and the processes the company requires, that effort would easily add up to three to five business days.

10 minutes versus three to five working days. It's clear which one to choose, isn't it?

My real purpose in sharing this is to illustrate something I've come to realize: not every problem you encounter at work has to be solved; there is also the option of working around it, as long as the end result is good.

If we set aside other factors and look purely at the programmer's craft, then when you hit a problem like a memory leak, you locate it and solve it.

But in the workplace, it actually needs to be analyzed in context.

What is the actual situation?

That "first, second, then, last" I listed earlier is the reality of my problem outside of technology.

These practicalities made me decide that I didn't need to locate the problem.

It's not about dodging the issue, either; it's simply the best option after weighing the pros and cons.

In the same amount of time, I could either have pinpointed this "restart and it's fixed" problem, or done something more valuable, like writing code with more business value.

This is something that needs to be weighed, and an important measure is what was mentioned earlier: the input-output ratio.

On the point that "not every problem has to be solved; there's the option of bypassing it," let me give you another real example I've run into.

A few years ago, our team ran into a problem. Our RPC framework was Dubbo, and several of our core services were being released on a rolling basis during production hours. Traffic could never be drained cleanly, so a service instance would go offline while upstream systems were still calling it.

I was assigned to research a solution.

It was essentially a graceful-shutdown problem, but I was fairly junior at the time, and after studying it seriously for a while I really couldn't come up with a fundamental fix.

The solution we eventually delivered was a fault-tolerance mechanism: if a request fails during the release window because traffic wasn't drained cleanly, we record that data and then resend it once the release is complete.
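
A minimal sketch of that idea is below. Everything here is hypothetical (the class names, the in-memory queue); the real mechanism persisted the failed data and replayed it after the release, but the shape is the same: swallow the failure, record it, resend later.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Rough sketch of the "record and resend" fault-tolerance idea.
// FailedRequest, RemoteService and the in-memory queue are illustrative only;
// a real implementation would persist failures (DB / message queue) so they
// survive restarts during the release itself.
public class ReleaseWindowRetry {

    record FailedRequest(String payload) {}

    interface RemoteService {
        void call(String payload) throws Exception;
    }

    private final Queue<FailedRequest> failed = new ConcurrentLinkedQueue<>();
    private final RemoteService remote;

    ReleaseWindowRetry(RemoteService remote) {
        this.remote = remote;
    }

    // During the release window: if the downstream call fails because the
    // provider went offline with traffic still in flight, record the request
    // instead of letting the error propagate.
    void invoke(String payload) {
        try {
            remote.call(payload);
        } catch (Exception e) {
            failed.add(new FailedRequest(payload));
        }
    }

    // After the release is complete: replay everything that was recorded.
    void replayAfterRelease() throws Exception {
        FailedRequest request;
        while ((request = failed.poll()) != null) {
            remote.call(request.payload());
        }
    }
}
```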

The underlying problem wasn't addressed; we chose to bypass it. But judged by the end result, the problem was solved.

Later, we built a dual-center setup. Outside of release windows, traffic runs in both center A and center B. For every release, we first switch all of center A's traffic over to center B, release the services in center A while they carry no traffic, and then do the same for center B in reverse.

In this way, the release process itself sidesteps the "traffic never drains cleanly" problem: the services being released no longer carry any traffic, so graceful shutdown doesn't even come into play.

The underlying problem still hasn't been solved, but it has been completely bypassed.

Finally, one more example: an answer I read on Zhihu that makes a similar point to mine:

/question/634940930/answer/3336285780

The comments under that answer are also interesting; if you feel like browsing them, go ahead. Here are two I found worth quoting:

In the workplace, or even in life, a problem that doesn't have a solution but can be bypassed is not a problem in my opinion.

But this also depends on the situation; not every problem can be bypassed. If it's a key service, then it certainly can't be ignored, and you have to push through no matter how hard it is.

The thing is, I've met plenty of people at work and in life who, when faced with a problem, only seem to know how to charge at it head-on.

Only knowing how to go at it head-on and knowing when to go at it head-on are two very different stages of a career.

So sometimes, when you hit a problem, don't just grind through it; take a step back and see whether you can go around it.