Standardizing the Prometheus monitoring system and improving GitOps automation for globalized, large-scale, hybrid-cloud Kubernetes

Background

Current state

Company overview:
1. A PaaS/SaaS company with a global business covering Southeast Asia, South Asia, the Middle East, Europe, Africa, the Americas, East Asia, and more.
2. Dozens of production k8s clusters, and more than 100 clusters in total across production and non-production (multiple cluster types on various public clouds, dedicated clouds, private clouds, data centers, and so on).
3. Since the pandemic, cost optimization has been pushed continuously.
Monitoring overview, shaped by historical reasons and cost considerations:
1. Based on deeply customized native Prometheus plus some self-developed exporters and service discovery (sd); kube-prometheus-stack is not used (it is incompatible and would increase cost).
2. Monitoring coverage: k8s, pods, various middleware, microservices, URLs, and so on.
3. One Prometheus monitoring stack per cluster.
4. The compute, storage, and other resources available to monitoring are limited.
5. Deployment method: monitoring components are installed with Ansible, and subsequent releases are automated through Jenkins CI/CD.

In summary, monitoring can be regarded as:

Globalized
Large-scale
Hybrid cloud
Kubernetes
Low-cost monitoring

Problem

Recently, insufficient monitoring coverage (specifically, one cluster was missing its URL monitoring configuration) prompted an in-depth review. The core issues come down to two points:

No single trusted source of configuration: the monitoring configurations of the clusters are scattered, leading to inconsistent versions and missing rules;
Manual operations cause configuration drift: the state of the global clusters cannot be synchronized in real time, and fault alerting capability is limited.
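The omission behind the recent incident (one cluster missing its URL monitoring configuration) is the kind of gap a simple coverage check can surface. Below is a minimal sketch in Python, assuming a hypothetical inventory that maps each cluster name to the set of monitoring components configured there. The component names and cluster names are illustrative only and are not taken from the company's real setup.

```python
# Hypothetical baseline: components every cluster is expected to monitor.
REQUIRED_COMPONENTS = {"k8s", "pod", "middleware", "microservices", "url"}


def find_coverage_gaps(inventory):
    """Return {cluster: sorted list of missing components} for clusters with gaps."""
    gaps = {}
    for cluster, configured in inventory.items():
        missing = REQUIRED_COMPONENTS - set(configured)
        if missing:
            gaps[cluster] = sorted(missing)
    return gaps


if __name__ == "__main__":
    # Illustrative inventory; real data would come from each cluster's config.
    inventory = {
        "prod-sea-01": {"k8s", "pod", "middleware", "microservices", "url"},
        "prod-eu-02": {"k8s", "pod", "middleware", "microservices"},
    }
    for cluster, missing in find_coverage_gaps(inventory).items():
        print(f"{cluster} is missing: {', '.join(missing)}")
```

A check like this only reports gaps. It does not prevent them, which is why the plan below moves the configuration itself into Git.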

To prevent such problems from recurring, the planned improvements are as follows:

A standardized monitoring architecture with GitOps (Git as the single source of truth) + Prometheus Operator at its core. The specific plan is as follows:

1. The root cause of the problem and the direction of improvement

Current challenges
- Fragmented management: The Prometheus monitoring configuration of hundreds of clusters worldwide still relies on manual maintenance, which is prone to missing rules and inconsistent thresholds.
- Manual management risk: Monitoring components, configurations, and thresholds are managed by hand, with a risk of stale or incorrect configuration (as in the recent failure).
- Monitoring data noise: Because configurations are inconsistent, false alarms and missed alerts occur frequently, which hurts fault response efficiency.
Target plan
- Single Source of Truth: Manage all monitoring configurations (Prometheus rules, ServiceMonitor, AlertManager, etc.) in a Git repository to eliminate manual intervention.
- GitOps automated sync (reconcile) and self-healing: Use a dedicated GitOps tool such as ArgoCD to synchronize configuration in near real time, so that cluster state stays consistent with what Git declares.
- Centralized observability: Standardize deployment through Prometheus Operator. If necessary, consider combining it with Thanos/Cortex/Mimir to aggregate monitoring data across clusters.
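To make "inconsistent thresholds" concrete, here is a small Python sketch that groups clusters by the threshold each one uses for a given alert and reports alerts with more than one distinct value. The input shape (a dictionary of alert name to threshold per cluster) and the alert names are assumptions for illustration. Real Prometheus rules would first have to be parsed into this form.

```python
from collections import defaultdict


def find_inconsistent_thresholds(rules_by_cluster):
    """Return {alert: {threshold: [clusters]}} for alerts with differing thresholds."""
    seen = defaultdict(lambda: defaultdict(list))
    for cluster, rules in rules_by_cluster.items():
        for alert, threshold in rules.items():
            seen[alert][threshold].append(cluster)
    # Keep only alerts that appear with more than one distinct threshold.
    return {
        alert: {t: sorted(clusters) for t, clusters in by_threshold.items()}
        for alert, by_threshold in seen.items()
        if len(by_threshold) > 1
    }


if __name__ == "__main__":
    # Hypothetical alert names and values.
    rules = {
        "cluster-a": {"HighCPU": 0.8, "PodRestarts": 5},
        "cluster-b": {"HighCPU": 0.9, "PodRestarts": 5},
    }
    print(find_inconsistent_thresholds(rules))
```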
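The reconcile and self-healing idea in the target plan can be sketched conceptually as follows. This is a toy model in Python, not how ArgoCD is implemented: `desired` stands for the state declared in Git and `actual` for the state found in a cluster, both represented as hypothetical dictionaries keyed by resource name.

```python
def plan_reconcile(desired, actual):
    """Compare desired (Git) and actual (cluster) state; return the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            # Anything not declared in Git is treated as drift and pruned.
            actions.append(("delete", name))
    return sorted(actions)


def reconcile(desired, actual):
    """Apply the plan to a copy of the actual state and return the converged state."""
    converged = dict(actual)
    for action, name in plan_reconcile(desired, actual):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = desired[name]
    return converged


if __name__ == "__main__":
    desired = {"rule-cpu": {"threshold": 0.9}, "sm-orders": {"port": "metrics"}}
    actual = {"rule-cpu": {"threshold": 0.8}, "tmp-patch": {"note": "manual"}}
    print(plan_reconcile(desired, actual))
    print(reconcile(desired, actual) == desired)
```

Running the loop repeatedly is what gives self-healing: a manual change in the cluster shows up as a difference on the next pass and is reverted to what Git declares.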

2. Technology implementation path

Standardized GitOps process (Git as the single source of truth)
- GitOps: Store all monitoring resources (Prometheus CRDs, Grafana dashboards) in the Git repository; version control plus code review ensures that every change is traceable.
- Automated synchronization (reconcile): A dedicated GitOps tool such as ArgoCD watches the Git repository for changes and automatically applies them to each cluster, avoiding manual mistakes (see the Red Hat OpenShift GitOps recommended practices listed at the end).
- Emergency fix process: Every production change must be submitted through Git. The Git repository is the only permitted entry point for modifications, and "temporary patches" are eliminated.
Strengthening capabilities with Prometheus Operator
- Unified deployment templates: Package the Prometheus stack (AlertManager, BlackBox, etc.) as a Helm chart so that versions and configuration are consistent across clusters.
- Dynamic service discovery: Identify microservice endpoints automatically through ServiceMonitor, avoiding omissions caused by adding exporters manually.
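One way the per-cluster synchronization above might be declared is with one Argo CD Application per cluster. The sketch below builds such manifests as Python dictionaries and prints them as JSON, which kubectl also accepts. The repository URL, the `clusters/<name>` directory layout, the `monitoring` namespace, and the cluster names are assumptions for illustration. The manifest fields follow the Argo CD Application resource.

```python
import json


def build_application(cluster, repo_url, revision="main"):
    """Build an Argo CD Application manifest for one cluster's monitoring config."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"monitoring-{cluster}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                # Assumed repo layout: one directory per cluster.
                "path": f"clusters/{cluster}",
            },
            "destination": {"name": cluster, "namespace": "monitoring"},
            # Automated sync with pruning and self-healing enabled.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    repo = "https://git.example.com/platform/monitoring-config.git"  # hypothetical
    for cluster in ["prod-sea-01", "prod-eu-02"]:
        print(json.dumps(build_application(cluster, repo), indent=2))
```

With `selfHeal` enabled, a manual change made directly in a cluster is reverted to the Git-declared state, which is what makes the "no temporary patches" rule enforceable. Argo CD's ApplicationSet resource can produce the same per-cluster fan-out natively; the loop here only illustrates the shape of the output.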
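The dynamic discovery described above rests on label matching: a ServiceMonitor selects every Service whose labels include all of its `matchLabels`. The Python sketch below builds a ServiceMonitor manifest as a dictionary and mimics that matching rule locally. The label `monitoring: enabled`, the port name `metrics`, and the service names are hypothetical conventions, not ones stated in this plan.

```python
def build_service_monitor(name, match_labels, port="metrics", interval="30s"):
    """Build a Prometheus Operator ServiceMonitor manifest as a dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": "monitoring"},
        "spec": {
            "namespaceSelector": {"any": True},
            "selector": {"matchLabels": dict(match_labels)},
            "endpoints": [{"port": port, "interval": interval, "path": "/metrics"}],
        },
    }


def selects(service_monitor, service_labels):
    """True if the Service labels satisfy every matchLabels entry."""
    wanted = service_monitor["spec"]["selector"]["matchLabels"]
    return all(service_labels.get(key) == value for key, value in wanted.items())


if __name__ == "__main__":
    sm = build_service_monitor("microservices", {"monitoring": "enabled"})
    services = {
        "orders": {"app": "orders", "monitoring": "enabled"},
        "legacy-billing": {"app": "legacy-billing"},
    }
    for svc, labels in services.items():
        print(svc, "scraped" if selects(sm, labels) else "NOT scraped")
```

The flip side is that a Service without the agreed label is silently skipped, so the labeling convention itself needs to be part of the standardized templates.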

3. Expected benefits

Reduce operations risk: Configuration drift is expected to drop by more than 90%, with monitoring components, thresholds, and configurations fully automated.
Improve fault response: With centralized alert views and standardized rules, MTTD (mean time to detect) is expected to be cut by 50%.
(To be determined) Cost optimization: Avoid duplicated development of monitoring components and raise resource utilization by 30% (optimizing data storage through Prometheus federation/clustering solutions such as Thanos/Cortex/Mimir).
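To verify the MTTD target above, it has to be measured the same way before and after the change. A minimal sketch in Python, assuming each incident is recorded as a pair of timestamps (when the fault started, when it was detected). The sample timestamps are made up for illustration and are not real measurements.

```python
from datetime import datetime
from statistics import mean


def mttd_minutes(incidents):
    """Mean time to detect, in minutes, over (started_at, detected_at) pairs."""
    delays = [
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    ]
    return mean(delays)


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # Illustrative incident records only.
    before = [
        (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10)),
        (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 30)),
    ]
    after = [
        (datetime(2024, 6, 1, 10, 0), datetime(2024, 6, 1, 10, 5)),
        (datetime(2024, 6, 2, 10, 0), datetime(2024, 6, 2, 10, 15)),
    ]
    b, a = mttd_minutes(before), mttd_minutes(after)
    print(f"MTTD before: {b} min, after: {a} min, reduction: {reduction_percent(b, a)}%")
```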

4. Follow-up plan

Pilot: Build a temporary environment first, run a PoC for a period of time, and produce standardized templates and automated pipelines.
Global rollout:
1. Build a dedicated monitoring management cluster.
2. Migrate to the GitOps (Git as the single source of truth) + Prometheus Operator system; given its scale, this is expected to require sustained investment.
Training and collaboration: Organize internal team sharing sessions and align on collaboration conventions for GitOps (Git as the single source of truth) + Prometheus Operator (branching strategy, project structure strategy, review process, etc.).

📚️ References

OpenShift GitOps Recommended Practices | Red Hat Developer
Lightning Talk: Best Practices on Organizing GitOps Repositories - Konstantinos Kapelonis, Codefresh

When three walk together, one of them can be my teacher; when knowledge is shared, the world belongs to all. This article was written by the Dongfeng Weiming Technology Blog.