I haven't slept well for days on end since using these monitoring tools!

Hello, I'm programmer Fishskin, and today I'm sharing some very useful system monitoring and alerting tools.

Why do we need monitoring and alerting?

Monitoring and alerting is easy to overlook for students without enterprise development experience. Some even think it is unnecessary: if a bug shows up, so what, you just fix it.

This idea is very wrong!

Think of the system as a human body. A person may look healthy on the surface simply because they have not yet had the chance to discover what is wrong inside, and when something finally does go wrong, the consequences are often much more serious. That is why regular medical checkups are needed to detect and deal with problems in time. System monitoring and alerting plays a similar role: it detects potential anomalies in the system early, and when a problem does occur in production, you find out right away and can handle it as soon as possible, preventing or mitigating failures.

Monitoring systems bring some other benefits as well, so let's move on.

How do you implement monitoring and alerting?

The most straightforward way is to write the code yourself: add some logic to the functionality you care about, and send an SMS, email, or chat message when an exception occurs. That's what we did in the beginning:
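A minimal sketch of this hand-written approach in Python, for illustration only. The names `alert_on_exception`, `create_order`, and `sent_alerts` are hypothetical, and the list stands in for a real SMS or email gateway:

```python
import functools
import traceback
from typing import Callable, List

# Hypothetical notifier type: any callable that accepts a message string.
Notifier = Callable[[str], None]


def alert_on_exception(notify: Notifier):
    """Wrap a function so that any exception sends a notification, then re-raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Include the function name and stack trace so the alert is actionable.
                notify(f"[ALERT] {func.__name__} failed: {exc!r}\n{traceback.format_exc()}")
                raise
        return wrapper
    return decorator


# Stand-in for an SMS/email gateway (assumption: a real one would be called here).
sent_alerts: List[str] = []


@alert_on_exception(sent_alerts.append)
def create_order(quantity: int) -> int:
    """Hypothetical business function that we want to watch."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity * 10
```

The exception is re-raised after the notification, so the alert does not change the business behavior; it only makes the failure visible.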

But in fact, business alerts are only one layer of monitoring and alerting, like checking only the skin on the surface of the body. To monitor the health of the system more comprehensively and accurately, we need a full checkup inside and out, including server monitoring, network monitoring, application monitoring, database monitoring, API monitoring, and so on.

Yes, it is as complicated as it sounds, which is why monitoring goes by a more specialized name in modern operations: "observability". Observability is the ability to understand and diagnose a system's health and performance by monitoring and analyzing its internal state. The concept covers not only traditional monitoring but also data collection, analysis, and response. For example, if monitoring shows that the system's memory utilization is consistently low, we can downgrade the configuration to save costs; if it shows that memory utilization is too high, we can consider upgrading the configuration or scaling out.
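The memory example can be sketched as a small decision rule. This is only an illustration: the function name `sizing_advice` and the 30% / 85% thresholds are assumptions, and real thresholds depend on the workload.

```python
from statistics import mean
from typing import Sequence


def sizing_advice(memory_utilization: Sequence[float],
                  low: float = 0.30, high: float = 0.85) -> str:
    """Suggest 'upgrade', 'downgrade', or 'keep' from utilization samples in [0.0, 1.0].

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not memory_utilization:
        raise ValueError("need at least one sample")
    avg = mean(memory_utilization)
    peak = max(memory_utilization)
    if peak >= high:
        # Even a single peak near the limit is risky, so check it first.
        return "upgrade"
    if avg <= low:
        # Consistently low usage: the machine is probably oversized.
        return "downgrade"
    return "keep"
```

Checking the peak before the average is a deliberate choice here: a machine with low average usage but occasional spikes should not be downgraded.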

Building observability for a system on your own is still very complex. Data collection, data storage, data analysis, alerting mechanisms, availability guarantees, and performance all have to be considered, and the big companies have sizable infrastructure teams dedicated to it.

For individual developers or small companies like us, since it is such an all-encompassing "medical checkup", we usually don't build it ourselves. Instead, we pick specialized tools or services that can be plugged in and used directly. Here are a few that our team is using.

Recommended Monitoring Tools

1、Server monitoring

1) The server comes with monitoring capabilities

As long as you are using a cloud server from a major vendor, it basically comes with server monitoring, and you can set up alerts as well. For example, the monitoring page of the Tencent Cloud lightweight application server in the picture below shows the usage of CPU, memory, network bandwidth, disk, and other resources:

2) Monitoring capabilities of container platforms

If you deploy your projects with containers, the container platform basically also comes with monitoring and alerting. For example, the service monitoring of WeChat cloud hosting shows not only system resource usage but also the number of API calls, the number of failed requests, QPS, and response time, which amounts to part of what API monitoring provides.

The cloud hosting platform also supports receiving alert messages in WeChat, which is very convenient. If a node is attacked, you are notified immediately.
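To make the interface metrics mentioned above (call count, error count, QPS, response time) more concrete, here is a rough sketch of how they could be derived from raw request records. The `RequestRecord` type and `summarize` function are hypothetical, and counting only 5xx responses as errors is an assumption; the platform computes its own figures in its own way.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    timestamp: float    # seconds since epoch
    status: int         # HTTP status code
    duration_ms: float  # time spent handling the request


def summarize(records: List[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Aggregate the requests observed during one time window into basic API metrics."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(records)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in records if r.status >= 500)
    avg_ms = sum(r.duration_ms for r in records) / total if total else 0.0
    return {
        "calls": total,
        "errors": errors,
        "qps": total / window_seconds,
        "avg_response_ms": avg_ms,
    }
```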

2、Database monitoring

In the past, without database monitoring, it was hard to keep an eye on how the database was running: is it working hard, slacking off, or overloaded and working overtime? Now, if you use a cloud database from a third-party cloud provider, you can view its resource utilization directly on the platform. For example, the Tencent Cloud database we use comes with monitoring:

In the past, you could only learn about slow SQL that endangered the system through user feedback or a server failure. Now the intelligent database manager that comes with the cloud database helps you find slow SQL right away, so you can fix problems before they cause an incident.
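The cloud tool does this on the database side. As a rough application-side analogue of the same idea, the sketch below times each query and records the ones that exceed a threshold. It uses Python's built-in `sqlite3` purely for demonstration; `timed_query` and the 200 ms default threshold are assumptions.

```python
import sqlite3
import time
from typing import Any, List, Optional, Sequence, Tuple


def timed_query(conn: sqlite3.Connection, sql: str, params: Sequence[Any] = (),
                threshold_ms: float = 200.0,
                slow_log: Optional[List[Tuple[str, float]]] = None) -> List[tuple]:
    """Run a query and append (sql, elapsed_ms) to slow_log if it meets the threshold."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    if slow_log is not None and elapsed_ms >= threshold_ms:
        # A real setup would send this to a log or alerting channel instead.
        slow_log.append((sql, elapsed_ms))
    return rows
```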

You can also run a health check on your database with one click and fix things promptly if the score is not 100:

3、Application monitoring

The scope of application monitoring is relatively broad. We use ARMS, Alibaba Cloud's Application Real-Time Monitoring Service, mainly because Alibaba is comparatively more specialized in Java application services.

It covers the state of the application server (e.g., Tomcat for Java), API calls, calls to dependent services within the system, scheduled task executions, thread pool state, JVM memory, GC state, and so on.

You can also view application topology, analyze call links, and more:

In addition to the monitoring capabilities, its alerting capabilities are really strong! We've connected it to WeCom (Enterprise WeChat), and whenever there is a problem with a call link, it immediately sends us an alert. You can also quickly view alert details, claim alerts, mute alerts, and more.

To tell the truth, the first few days after we connected it were quite painful. It exposed a lot of previously undiscovered system problems, and WeCom kept going ding, ding, ding late into the night. The developers on our team were miserable.

But I'm used to it now. Er, to be precise, the system has been optimized and has become healthier~

In any case, connecting to monitoring and alerting is still necessary. It feels like turning on X-ray vision: you know the state of the system inside out!

However, once usage of the monitoring service exceeds a certain amount, you have to pay. The free quota is probably a few dozen GB per month, which an enterprise project actually uses up quickly. It is worth trying for learning or for a personal website.

4、Front-end monitoring

Beyond the monitoring above, sometimes we also want to understand user behavior, user attributes, and business metrics: how many users visit the site every day, whether they use a PC or a phone, what brand of phone, how many new users register, and so on. For that you may also need front-end monitoring (of course, you can also collect these statistics with event tracking on the back end). As I shared before, with Baidu Statistics (Baidu Tongji) a single line of code is enough to hook it into a front-end site, which is very convenient.
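For the back-end option, here is a minimal sketch of what such statistics could look like: counting unique daily visitors by device type. The function names and the keyword-based User-Agent check are simplifying assumptions, and this is not how Baidu Statistics works internally.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def device_type(user_agent: str) -> str:
    """Very rough device classification based on User-Agent keywords (an assumption)."""
    ua = user_agent.lower()
    return "mobile" if any(k in ua for k in ("mobile", "android", "iphone")) else "pc"


def daily_stats(visits: Iterable[Tuple[str, str, str]]) -> Dict[str, Dict[str, int]]:
    """Count unique visitors per date and device type.

    Each visit is a (date, user_id, user_agent) tuple; a user is counted
    once per date, using the device of their first visit that day.
    """
    seen: Set[Tuple[str, str]] = set()
    stats: Dict[str, Counter] = {}
    for date, user_id, user_agent in visits:
        if (date, user_id) in seen:
            continue  # already counted this user for this date
        seen.add((date, user_id))
        stats.setdefault(date, Counter())[device_type(user_agent)] += 1
    return {date: dict(counts) for date, counts in stats.items()}
```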

OK, that's it for this issue!
