Scheduled task stability solution: Healthchecks monitoring system

Background

At present, crontab failures go unnoticed when they occur and are not discovered in time. Almost all of them are reported by the business department or by users, and only then does the R&D department investigate, so both discovery and handling lag behind. The stability of crontab therefore needs further work: fault notifications should arrive earlier, and faults should be handled before users report them.
Summary of historical issues:

Human error caused all crontab entries to be cleared
Disaster recovery was difficult after a misoperation on the scheduled task server
With no lock check in place, multiple processes of the same task started, resulting in corrupted data
Tasks were killed by the system OOM killer
After the scheduled task server restarted, cron did not start automatically, so tasks were not executed
After a scheduled task failed, there was no monitoring and no awareness of the problem (delayed discovery and delayed handling)
...................

1. Scheduled task management specifications

Problem description: Currently, scheduled tasks are distributed by operations staff through the salt server. The release method is opaque, manual modifications also happen, and neither path is standardized. Misoperations have already caused all tasks to be cleared, and disaster recovery after a misoperation on the scheduled task server is difficult.
Solution: Standardize on the Jenkins release model, following the same process as code releases.

2. Server selection when publishing scheduled tasks

Problem description: Currently, a server must be selected by hand when releasing code, and selecting the wrong one has caused problems.
Solution: Optimize the publishing method so that no server needs to be selected; the target server is determined automatically.

3. Scheduled tasks cannot be viewed in time

Problem description: Currently, the task list is synchronized to /tmp/work_cron on a schedule, so what is shown there lags behind.
Solution: Developers view the GitLab repository directly.

4. Scheduled task execution interrupted by OOM

Problem description: Large programs consume a lot of memory when they run, which puts them at risk of being killed by the system OOM killer. When that happens, the kill currently goes unnoticed.
Solution: For system OOM problems, collect /var/log/messages and alert on it, so the problem is discovered as early as possible.
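
As a rough illustration of the log-based alarm, the sketch below scans log lines for OOM-killer messages. It assumes the kernel writes lines containing "Out of memory: Kill process <pid> (<name>)" or "Out of memory: Killed process <pid> (<name>)"; the exact wording depends on the kernel version, so the pattern should be checked against real logs first. The function name find_oom_kills is hypothetical.

```python
import re

# Assumed kernel wording: "Kill process" on older kernels,
# "Killed process" on newer ones. Verify against your own logs.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for every OOM-killer line in an iterable of log lines."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Passing an open file object for the system log to find_oom_kills and alerting whenever the result is non-empty would be one way to wire this into an existing alarm channel.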

5. Data safety for scheduled task processes (lock mechanism)

Problem description: When a scheduled task leaves a hung process behind, later runs start additional processes of the same task;
Multiple processes running at the same time lead to corrupted data.
For example, if every run writes to a temporary table with the same name, two processes writing at the same time may produce incorrect results.
Solution: For tasks that must not run as two processes at the same time, the program needs to take a lock and check its state before running, to keep the data reliable.
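
One common way to implement such a lock on a single Linux server is an advisory file lock. The sketch below is a minimal example under that assumption, using the standard fcntl.flock call in non-blocking mode; the helper name single_instance and the lock file path are hypothetical, and tasks spread over several servers would need a shared lock instead.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this run holds the lock, False if another run already holds it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A task would wrap its main logic in "with single_instance(path) as acquired:" and exit early when acquired is False. Because the kernel releases the lock when the process dies, a crashed run does not leave a stale lock behind.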

6. Moving heavyweight scheduled tasks to queues

Problem description: Scheduled tasks should be as lightweight as possible. Ideally the schedule only triggers the task, and the program then processes the data through a queue.
This applies, for example, when a single run of a scheduled task takes tens of minutes or more, or when the amount of data processed reaches the tens of millions.
Solution:

Plan 1

Convert heavyweight tasks into queue-based processing: the code implements the data processing logic and puts the data into the queue in order.
Use cronsun to trigger and manage the scheduled tasks
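
The split between a light trigger and queue-based processing could look roughly like the sketch below. It uses Python's in-process queue.Queue purely as a stand-in for a real message queue, and the names enqueue_batches, drain and handle_batch are hypothetical; in production the worker would be a separate long-running process.

```python
import queue


def enqueue_batches(ids, batch_size, q):
    """Triggered part: split the work into batches and enqueue them in order."""
    count = 0
    for start in range(0, len(ids), batch_size):
        q.put(ids[start:start + batch_size])
        count += 1
    return count


def drain(q, handle_batch):
    """Worker part: process batches in FIFO order until the queue is empty."""
    processed = 0
    while True:
        try:
            batch = q.get_nowait()
        except queue.Empty:
            return processed
        handle_batch(batch)
        processed += len(batch)
```

The scheduled job then finishes as soon as the batches are enqueued, and a failure in one batch no longer forces the whole run to start over.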

Plan 2

Migrate to the big data task platform and use the computing power of the big data cluster to do the same processing

7. Scheduled task status awareness

Problem description: Currently, the execution status of each scheduled task (success/failure/hang/warn) cannot be observed; it can only be checked through logs (if logs exist at all).

How do we know whether a task started executing? (Currently relies on people) [Example: cron service not enabled]
How do we know whether a task succeeded or failed? (Currently relies on people) [Example: the script failed after 80% of its execution]
How do we discover a task failure as early as possible? (Currently relies mostly on feedback from the business side and users)
Solution:

Add a monitoring mechanism that checks whether crontab runs as expected
Add status reporting logic so that task execution is visualized and recorded as data, and add an alarm mechanism (core task)
Add scheduled task logs and alarm on keywords (core task)
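
The status reporting item can be sketched as a small wrapper around the task body. This is an illustration only: run_and_report, the record fields and the report callback are hypothetical names, and where the records end up (a database, a dashboard, an alarm channel) is left open.

```python
import time
import traceback


def run_and_report(task_name, job, report):
    """Run job() and pass start/success/failure records to report(record)."""
    started = time.time()
    report({"task": task_name, "status": "start", "at": started})
    try:
        job()
    except Exception:
        report({
            "task": task_name,
            "status": "failure",
            "duration_s": time.time() - started,
            "error": traceback.format_exc(),
        })
        raise  # keep the non-zero exit so cron and logs still see the failure
    report({
        "task": task_name,
        "status": "success",
        "duration_s": time.time() - started,
    })
```

A "start" record with no matching "success" or "failure" record points to a hung or killed run, which is the case that logs alone tend to miss.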

8. Healthchecks monitoring system

For details, see the official website:/ (open source software)

Healthchecks is a system that monitors whether scheduled jobs run on time. It offers a simple and effective way to discover when a scheduled task is abnormal or has failed.

Main functions

Monitor whether cron jobs, systemd timers, scripts, and similar tasks run on time;
Send a notification (email, Webhook, Slack, DingTalk, etc.) when a task does not "check in" on time;
Provide a simple Web UI that records each task's run history and status.

Healthchecks works as follows:

The system assigns a unique "ping URL" to each task (e.g. /your-uuid);
After each successful run, the task sends an HTTP request (called a "ping") to this URL;
Healthchecks sets a timeout for each task (for example, 1 hour);
If no ping is received within the timeout, the task is considered not executed or failed, and an alarm is triggered.
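
From the task's side, the steps above amount to "run the job, then ping on success". The sketch below shows one way to do that with the Python standard library, plus a one-line version of the timeout check as seen from the server. The function names are hypothetical, the ping URL is whatever Healthchecks assigned to the task, and the real service may apply more detailed rules than this simple timeout comparison.

```python
import time
import urllib.request


def ping(url, timeout=10, opener=urllib.request.urlopen):
    """Send one GET request to a ping URL.

    Returns False instead of raising, so a monitoring outage cannot break the job.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False


def run_with_ping(job, ping_url, send=ping):
    """Run job() and ping only if it returns without raising."""
    job()
    send(ping_url)


def is_overdue(last_ping_at, timeout_s, now=None):
    """Server-side view: True once no ping has arrived within timeout_s seconds."""
    now = time.time() if now is None else now
    return now - last_ping_at > timeout_s
```

If the job raises, hangs, or never starts because cron is down, no ping is sent, and the missing ping is what triggers the alarm.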

Application scenarios

Scheduled script/task monitoring in production environments
For example, MySQL backup scripts, log archiving, and data synchronization.
Kubernetes CronJob monitoring
Add a ping request after the CronJob succeeds; Healthchecks then provides independent status records and alarms.
Small teams without an integrated monitoring system
Provides a simple, ready-to-use Web UI and notification integrations, well suited to quick adoption in small and medium-sized projects.
Complement to Prometheus/Grafana
Provides a more intuitive "did it run" status at the task level, and closes the loop when combined with existing monitoring.

If you have multiple scheduled tasks, you can manage them with tags, project groups, and so on. If you want to run a private instance, Healthchecks also supports one-click deployment with Docker.