
Nothing much was going on; then deleting a folder on the server led to almost two weeks of work

2024-09-05 12:55:56

For most of 2024, the big data service was relatively stable, and I only came over once or twice to maintain it. In August, after taking over the work of a departing colleague, I came over again; there wasn't much going on.

StatHub page service status not refreshing

StatHub is a cluster management / application orchestration / process guardian tool for efficiently managing service clusters, with node process management and application management features.
A big data developer from another company who works on site here told me that the service status on the StatHub page was not refreshing. I asked whether his services were running normally; he said they were. I told him not to worry about it and that I would take a look when I had time.
StatHub consists of two parts: master and agent:
The stat-server, or master, provides the service orchestration interface.
The stat-agent, running on the worker node, guards the worker process.
StatHub source code address: /rm-vrf/stat

I'm screwed. I deleted a folder I shouldn't have.

Once I had some free time, I tried to solve the StatHub problem. There was actually a known fix from before: on each server node, look for zero-size files in the app and proc folders inside the .stat_agent directory and delete them.
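That known fix can be sketched as a few lines of Python; the .stat_agent layout (app/ and proc/ subfolders holding per-service files) is taken from this account and is an assumption about StatHub's on-disk format, not a documented fact:

```python
from pathlib import Path

def remove_zero_byte_files(agent_dir: str) -> list:
    """Delete zero-size files under the app/ and proc/ subfolders
    of a .stat_agent directory; return the paths that were removed."""
    removed = []
    for sub in ("app", "proc"):
        folder = Path(agent_dir) / sub
        if not folder.is_dir():
            continue
        for f in folder.iterdir():
            if f.is_file() and f.stat().st_size == 0:
                f.unlink()
                removed.append(str(f))
    return removed
```

Running it once per node (for example over ssh) would clear the corrupt entries that broke the status refresh.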
But that solution didn't come to mind at the time, so I tried restarting instead: I restarted stat-agent on the misbehaving node and restarted stat-server several times, but it didn't help.
I wondered whether some cache was the cause. The .stat_server folder could not have existed when I first deployed StatHub; it must have been generated automatically. So I stopped stat-server, deleted the folder, and restarted to see what would happen. That is how I deleted .stat_server. StatHub restarted successfully and the .stat_server folder was regenerated automatically. But I soon noticed a serious problem: all 100+ services on the StatHub page were gone. The page was empty!
Run for the hills. I'm going to lose my job. Crap!
Although the 100+ services were now unmanaged, they should all still have been up and running, so as long as none of them crashed it was not an immediate problem.
What to do? Recover the data? That server is important; a lot of critical services run on it, and if it got messed up, it really would be over.

Find a way to slowly restore service management on the StatHub page

The good news: the process information for every service was still present in the proc folder inside .stat_agent on the 20+ servers running stat-agent. Those files hold each service's name and startup command, which could be used to re-enter the services into the StatHub page. The startup parameters mattered most, because some Java and Spark services have fairly complex ones. So I backed up the service names and startup commands from the proc folders on the 20+ servers. Then I restored management of two or three services first, but their status still would not refresh and I could not stop or start them normally; I had to log into each machine and check the services with Linux commands.
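The backup step might look like the sketch below, run on each of the 20+ nodes; backup_proc_folder and the paths are hypothetical names, and the real proc files' format is whatever stat-agent writes:

```python
import shutil
import time
from pathlib import Path

def backup_proc_folder(proc_dir: str, backup_root: str) -> str:
    """Copy a node's .stat_agent/proc folder into a timestamped
    backup directory and return the backup path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"proc-{stamp}"
    shutil.copytree(proc_dir, dest)  # dest must not already exist
    return str(dest)
```

The timestamp in the directory name keeps successive backups from overwriting one another.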

Modify the StatHub source code to fix the service status problem

I opened the StatHub source code and found that the loop traversing each node's information had a try/catch, but it only caught ResourceAccessException; any other exception would break the for loop, and retrieving all node and process information would fail. So I modified the code to add a catch (Exception e) that logs the error, committed, redeployed, and restarted stat-server. From the stat-server log I identified the problem node, deleted the zero-size files on that server, and the service status returned to normal.
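The actual fix was in StatHub's Java code (broadening the catch from ResourceAccessException to Exception). The underlying pattern, one catch per node so a single bad node cannot abort the whole sweep, looks roughly like this Python sketch; collect_node_info and fetch are hypothetical names:

```python
import logging

def collect_node_info(nodes, fetch):
    """Gather info from every node. A failure on one node is logged
    and recorded, and the loop continues to the remaining nodes."""
    results, failures = {}, []
    for node in nodes:
        try:
            results[node] = fetch(node)
        except Exception:
            # Catch broadly: any per-node error must not stop the sweep.
            logging.exception("failed to query node %s", node)
            failures.append(node)
    return results, failures
```

Logging the failures is what made it possible to identify which node had the corrupt files.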

A new situation: node 162's information was missing from the node list on the StatHub page

After I restarted stat-agent on node 162 for some reason, that machine's information disappeared from the node list on the StatHub page. It turned out to be a problem with the server itself: the mount command hung for a while and then printed a pile of mounts, for reasons unknown, and df -hl also hung for a while before printing anything. This caused stat-agent to get stuck while traversing disk information.
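One defensive pattern against this failure mode is to wrap disk commands in a timeout, so a stuck mount cannot hang the agent indefinitely; a minimal sketch, where run_with_timeout is a hypothetical helper and not stat-agent's actual code:

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run a command such as `df -hl`, but give up after timeout_s
    seconds instead of hanging the way a stuck mount would."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return None  # caller can report "disk info unavailable"
```

The caller can then report the node as reachable but with disk information unavailable, instead of dropping it from the list.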

ClickHouse also had problems: one service frequently reported a Too many parts exception when inserting data

It had been solved once before; the idea then was to increase the amount of data inserted in each batch so as to reduce the number of inserts. The service temporarily stabilized, and I thought the problem was fixed; in fact, it was not. The service consumes a Kafka topic with 78 partitions, so the parallelism was 78, which was too high. How to reduce the parallelism? I didn't know at the time. This time I changed the code to coalesce(1).foreachPartition; coalesce reduces the number of partitions, which reduces the parallelism of inserts into ClickHouse, and I set it to 1. By rights the problem should have been solved, but it still reported the Too many parts exception: inserts succeeded a few times and failed a few times.
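The earlier mitigation, inserting more rows per batch so each INSERT creates fewer parts, can be sketched with a small batching helper; the batch size and how each batch is then handed to the ClickHouse client are assumptions here, not the service's actual code:

```python
from itertools import islice

def batched(rows, batch_size):
    """Group an iterable of rows into lists of at most batch_size,
    so each ClickHouse INSERT carries many rows and creates one part
    per partition instead of one part per tiny insert."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk would then be sent as a single INSERT; larger chunks mean fewer parts for the background merges to keep up with.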

Restart ClickHouse

Nothing a reboot can't fix; if one reboot doesn't work, reboot again.
So I then decided to restart the ClickHouse service on all 4 nodes.
When restarting the third node, the server suddenly lost its connection. I had only restarted ClickHouse; had the whole server hung? I was able to connect again after a while.
When I restarted the fourth node, ClickHouse would not come back up. Checking the monitoring page, I found that every service writing to ClickHouse was reporting red. I restarted the dependent ZooKeeper service and restarted ClickHouse multiple times; nothing worked.
Partial error message: DB::Exception: The local set of parts of table 'xxx' doesn't look like the set of parts in ZooKeeper: xxx billion rows of xxx billion total rows in filesystem are suspicious. ... Cannot attach table 'xxx' from metadata file /var/lib/clickhouse/metadata/xxx/ from query ATTACH TABLE ...
A Baidu search turned up a similar problem (/intl/en-us/trouble-mrs/mrs_03_0281.html), but it had too many steps to read through, let alone carry out.

Problem solved, restarted ClickHouse successfully

I noticed the metadata files in the error message and formed a plan: rename the two .sql files mentioned in the error log as a backup, then restart ClickHouse. It worked! Then I renamed the two files back. I watched the services that write to ClickHouse: all of them were normal. A few services had failed without restarting automatically, so I restarted them manually. Then I realized the Too many parts problem was solved as well.
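The rename-then-restore trick can be expressed as a small context manager; sidelined is a hypothetical name, and whether renaming the metadata files back afterwards is safe depends on the table's state, so treat this as a sketch of what was done by hand rather than a recommended procedure:

```python
import os
from contextlib import contextmanager

@contextmanager
def sidelined(paths, suffix=".bak"):
    """Temporarily rename files out of the way (e.g. ClickHouse
    metadata .sql files) and restore them on exit."""
    moved = []
    try:
        for p in paths:
            os.rename(p, p + suffix)
            moved.append(p)
        yield
    finally:
        # Restore in all cases, matching the manual rename-back step.
        for p in moved:
            os.rename(p + suffix, p)
```

With the files sidelined, ClickHouse skips attaching those tables at startup, which is what let it come up at all.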

Server 162 is back up

The big data developer from the other company, after some preparatory work, restarted the machine and solved the problem.

Service management on StatHub page restored for the most part

After a few days of manual entry, the service management on the StatHub page is mostly back.
I backed up the app and choreo folders in the .stat_server folder on the server running stat-server. I had not realized how important that folder was before I deleted it, and I had never backed it up.
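A simple timestamped archive, like the sketch below, would have made the deleted folder recoverable; archive_folder and the paths are hypothetical:

```python
import shutil
import time

def archive_folder(src_dir: str, dest_prefix: str) -> str:
    """Create a timestamped .tar.gz of a folder such as
    .stat_server/app and return the archive path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return shutil.make_archive(f"{dest_prefix}-{stamp}", "gztar", src_dir)
```

Run periodically (e.g. from cron), it keeps a history of the orchestration state instead of a single copy.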
The remaining services can be entered slowly, or whenever one of them has a problem and needs to be restarted.

Was the week or so of work making trouble out of nothing?

Not really.

  1. The service status on the StatHub page really was broken and needed to be fixed. But I made the mistake of deleting a folder that should not have been deleted. After that lesson, I made backups.
  2. A ClickHouse failure was only a matter of time, because the Spark service written earlier had never been optimized for its excessive insert parallelism.
  3. Server 162 had been having problems for a long time, but as long as you didn't restart stat-agent, it was fine.

The problem's almost taken care of.

One remaining issue: only most of the 100+ services on the StatHub page have been restored. Restoring management of a service requires restarting it, and many of the services were neither written nor deployed by me; I am not familiar with them, and if a restart failed and affected the business, it would cause unnecessary trouble. But a service left unmanaged could hang one day without anyone noticing, which would also make troubleshooting harder.