I. Daily issues
1) Temporary Minor Needs
In day-to-day development it is inevitable that small temporary requests come in: adding a logo, changing a font color, increasing some spacing, and so on.
These requirements are not complex, but they often disrupt one's development rhythm.
I recently received a modification request that went back and forth four times. The requirement was only dictated to me verbally, so I implemented it exactly as dictated.
Testing then revealed some scenarios that needed to be filtered out, which I fixed right away. After going live, the interface effect I had designed did not match what the product manager wanted, because there was no design mock-up.
Two more revisions followed. None of them took much time, but they were a real pain, and in the end the root cause was unclear requirements.
Next time I run into this kind of request, I need to walk through the requirements with the product manager until they are clear, pull in QA if necessary, and pin down everything from the scenarios to the presentation so that nothing is missed.
2) Service call errors
On Tuesday night someone reported that some leaderboard data wasn't updating. After troubleshooting, it turned out that Node's call to the server-side service was failing (service call error: getaddrinfo ENOTFOUND xxx).
The error crashed the Node process, which caused the Pod to restart, so the interface could not return data.
The calls to the server-side interface were already wrapped in a try-catch, but the catch branch did not return anything.
The most straightforward fix was to return a default value in the catch branch, so that callers would not hit an undefined error and the Pod would stop restarting.
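A rough sketch of that fix; the function name, URL, and response shape here are stand-ins, not the project's real code:

```ts
interface RankResponse {
  list: Array<{ uid: string; score: number }>;
}

const EMPTY_RANK: RankResponse = { list: [] };

// Hypothetical upstream call; the real URL and response shape differ.
async function callRankService(): Promise<RankResponse> {
  const res = await fetch('http://rank-service.internal/api/rank');
  return (await res.json()) as RankResponse;
}

async function getRankSafely(): Promise<RankResponse> {
  try {
    return await callRankService();
  } catch (err) {
    // Log the failure (e.g. getaddrinfo ENOTFOUND) instead of letting it
    // crash the process, and return a default object so callers never
    // read properties off undefined and the Pod stops restarting.
    console.error('rank service call failed:', err);
    return EMPTY_RANK;
  }
}
```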
After releasing the change at around 23:00 that night, the Pod stopped restarting and most of the calls to the server-side interface succeeded, though a small number still failed.
The next day at the office, O&M told me that the back-end Pod restarts automatically when its CPU gets too high, and that this happens more often during periods of heavy traffic.
This crude workaround is a last resort: they don't have the resources to optimize the code right now, so restarting is the only way to relieve the overly slow requests in production.
Ops then deployed a separate set of services just for our calls, one that isn't restarted, and after we updated the calling domain the request failures really did stop.
There is in fact a pattern for this called circuit breaking: when calls to an upstream service become slow or time out in large numbers, you stop calling that service, return a fallback response directly, and release resources quickly.
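As an illustration only (not the project's actual code), here is a minimal circuit-breaker sketch; the failure threshold, cooldown, and wrapped call are all assumptions:

```ts
// After `maxFailures` consecutive failures the breaker "opens" and calls are
// short-circuited to a fallback for `cooldownMs`, instead of waiting on a
// slow upstream and tying up resources.
class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly cooldownMs = 30_000,
  ) {}

  async call(fn: () => Promise<T>, fallback: T): Promise<T> {
    const open = this.failures >= this.maxFailures;
    if (open && Date.now() - this.openedAt < this.cooldownMs) {
      return fallback; // breaker open: skip the upstream call entirely
    }
    try {
      const result = await fn();
      this.failures = 0; // a success closes the breaker again
      return result;
    } catch {
      this.failures += 1;
      // Reopen the breaker on every failure past the threshold.
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      return fallback;
    }
  }
}

// Usage sketch: wrap the slow server-side call and fall back to empty data.
// const breaker = new CircuitBreaker<RankResponse>();
// const data = await breaker.call(() => callRankService(), EMPTY_RANK);
```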
The underlying code optimization still needs to happen, and can be followed up on later.
3) Database CPU exception
Starting October 8th, the database pushed an exception alert every day at 3:00 a.m., with CPU utilization exceeding 60%.
At first I assumed it was a one-off, since I'd seen sudden spikes like this before, but an alert that fires every single day is a real problem.
I asked Ops to troubleshoot. They pointed to a particular table, so I forwarded the table name to the relevant group, which investigated and found that the load wasn't coming from their service.
That suggested Ops had inferred incorrectly; since the alert fired at the same time every day, it felt more like a scheduled task was running.
Ops then locked onto a DELETE statement that removes monitoring logs older than seven days. It takes up to 10 minutes to execute, and the CPU spikes for the whole duration.
DELETE FROM `web_monitor` WHERE `ctime` <= '2024-10-08 00:00'
Most likely it's related to the recent rise in log volume: previously around 700,000 entries per day, now roughly 1,000,000.
Ops said he could also configure a scheduled job for the database on his side and add a LIMIT to the statement so that it wouldn't run for so long.
In the end I didn't let him configure it, mainly because if that scheduled job ever misbehaved I would still have to go find Ops to fix it, and without an alert I wouldn't even know it had misbehaved.
This service matters a lot to me, so I decided to optimize it myself in a simple way: add the same LIMIT restriction, just with a few extra loops.
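A minimal sketch of that batched cleanup, reusing the table and column names from the DELETE above; the `db.query` helper, batch size, and pause are assumptions:

```ts
// Hypothetical DB helper; in the real project this wraps the MySQL client.
declare const db: {
  query(sql: string, params: unknown[]): Promise<{ affectedRows: number }>;
};

const BATCH_SIZE = 10_000;

async function purgeOldMonitorLogs(cutoff: string): Promise<void> {
  // Keep deleting in small LIMIT-ed batches until a batch removes fewer
  // rows than the limit, i.e. nothing old is left. Each statement stays
  // short, so the CPU never spikes for 10 minutes at a stretch.
  while (true) {
    const { affectedRows } = await db.query(
      'DELETE FROM `web_monitor` WHERE `ctime` <= ? LIMIT ?',
      [cutoff, BATCH_SIZE],
    );
    if (affectedRows < BATCH_SIZE) break;
    // Short pause between batches so other queries aren't starved.
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
}
```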
Separately, the server-side interface has also been reporting 500 errors recently; on a couple of days there were far more than usual, which affected the performance metrics I monitor, and I've fed that back to them twice.
II. Work optimization
1) Collaborative dependencies
I recently identified the problem of collaborative dependencies while doing 1-on-1s within the group.
It arises when multiple groups collaborate and there is a one-way dependency: the group being depended on doesn't know that anyone depends on it, so when it modifies or removes some logic, it doesn't think to notify the dependents, and problems follow.
Put differently, your code relies on a precondition maintained by another group; when that group updates its code without realizing it affects you, your code stops working and users start reporting it.
We ran into this twice in this two-month cycle: once when we depended on someone else, and once when someone else depended on us.
There is an audit feature where the server side inserts a record into a table, and we query that table to check whether the record exists.
This time, however, a different person on the server side handled the update to that business logic and did not insert the record, which caused a logic exception on our side.
On this one I'm inclined to think their group lacked thorough technical documentation for a routine feature, and a piece of logic was simply missed.
The other case was the data group's statistics, which rely on a field we write into the operation record; this time the product changed the format of that field, and the statistics broke.
For this one, my takeaway is to inform the data team in advance whenever a change touches data they consume, so they don't end up unable to compute their results.
The simplest and most direct solution is to notify the dependents ahead of time; the difficulty is that you often don't know they exist, which is exactly how things get missed in real projects.
And I suspect there are quite a few more latent problems of this kind in our collaborations.
2) Alarms are not just a string of numbers
Before the National Day holiday I received a few scattered 500 errors and didn't take them seriously, assuming they were one-offs.
I didn't expect a flood of 500 alerts to appear during the National Day holiday itself; investigation showed they were gateway forwarding errors (502, 503, 504).
Those gateway error responses aren't valid JSON, so the calling code ended up reading properties off undefined.
Once I knew the cause, I switched the gateway forwarding to a direct internal interface call and added some undefined checks to the code.
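A rough sketch of the kind of undefined guard I mean, using a generic fetch-based caller; the interface names and response shape are stand-ins:

```ts
interface ApiResult<T> {
  code: number;
  data?: T;
}

async function callInternal<T>(url: string): Promise<ApiResult<T>> {
  const res = await fetch(url);
  // 502/503/504 from the gateway come back as HTML, not JSON, so don't
  // blindly call res.json(); bail out with an explicit error code instead.
  if (!res.ok) {
    return { code: res.status };
  }
  try {
    return { code: 0, data: (await res.json()) as T };
  } catch {
    // Body wasn't valid JSON; return a well-formed object so callers
    // never read properties off undefined.
    return { code: -1 };
  }
}

// Callers then guard before using the payload:
// const result = await callInternal<unknown[]>('/internal/rank');
// if (result.code === 0 && result.data) { /* safe to use result.data */ }
```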
The fix went out just after 23:00 on the 3rd, and the metrics on the 4th were back to normal.
During the same period I also saw a lot of slow responses, more than 20 times the usual rate. Checking the interface logs eventually traced it to an anomaly in the server-side interface we depend on.
I contacted Ops and the server-side people. The latter never responded; the former investigated and said that other interfaces, ones we don't even call, were dragging down the whole service.
In the end we were given a separate Pod that only serves the interfaces we access, and the slow-response percentage on the 5th immediately returned to normal.
Being insensitive to the data and ignoring the alarms is what forced the overnight code change during the National Day holiday; that's on me and no one else.
Although it was the upstream that affected the downstream, once the impact lands the downstream still takes the blame, so from now on I'll watch the data more closely instead of treating it as just a string of figures.