Best practices for managing large-scale monitoring technology stacks

Centrally manage observability data

Centralizing monitoring data breaks down information silos and gives a complete view of the system. Bloomberg found that when teams worked in isolation, an outage could run for a long time before anyone realized that several teams were independently working on the same problem. Centralized data management gave them a more comprehensive view of the infrastructure and made incident triage more efficient (Source: How Bloomberg Tracks Trillions of Data Points Daily with Metrictank and Grafana).

Adopt standardized monitoring methodologies

The following established methodologies can guide monitoring practice:

Four golden signals: monitor latency, traffic (request rate), errors and saturation for each microservice
RED method: focuses on Rate, Errors and Duration; a simplified version of the four golden signals
USE method: tracks Utilization, Saturation and Errors for each resource

These methodologies provide a framework for monitoring, but each needs to be adapted to the specific architecture (Source: "What is observability? Best Practices, Key Indicators and Methodology").
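
As an illustration of the RED method, the sketch below summarizes rate, error ratio and a duration percentile for one service over one time window. The `Request` record shape, the choice of HTTP 5xx as the error condition, and the 95th percentile are assumptions made for this example, not something the cited sources prescribe.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors and Duration for one service over one window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(durations, n=20)[-1] if len(durations) > 1 else durations[0]
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": p95,
    }
```

In practice these values would usually come from a metrics backend rather than from raw request records; the point here is only which three numbers the method asks for.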

Standardize dashboards

A consistent dashboard layout across the organization makes data faster to interpret. For example, Salesforce uses standardized dashboards and builds scalable, dynamic, complex dashboards with features such as repeating rows, pagination and custom pop-ups (Source: How Salesforce implements large-scale service health management through Grafana and Prometheus).
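
One way to keep layouts consistent is to generate dashboards from a shared template instead of building each by hand. The sketch below builds a fragment of a Grafana dashboard definition as a Python dictionary, with one panel that repeats per value of a `service` variable. The function name, panel title and grid size are hypothetical, and the fragment is not a complete importable dashboard; it is not a description of how Salesforce builds theirs.

```python
import json


def service_dashboard(title: str, services: list[str]) -> dict:
    """Build a dashboard fragment with one panel repeated per service."""
    return {
        "title": title,
        "templating": {
            "list": [
                {
                    # Custom variable holding a comma-separated list of values.
                    "name": "service",
                    "type": "custom",
                    "query": ",".join(services),
                    "multi": True,
                    "includeAll": True,
                }
            ]
        },
        "panels": [
            {
                "title": "Request rate: $service",
                "type": "timeseries",
                # Repeat this panel horizontally for each selected service.
                "repeat": "service",
                "repeatDirection": "h",
                "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(service_dashboard("Service health", ["checkout", "search"]), indent=2))
```

Because every team calls the same function, a change to the layout is made once and picked up by all generated dashboards.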

Implement intelligent alerting

Set up proactive alerting. Salesforce deploys a "hyperlocal observability" system that integrates Prometheus, Grafana and Alertmanager to deliver comprehensive, low-latency, highly available alerts (ibid.).
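
To make the idea of an alert rule concrete, the following sketch models a threshold alert that fires only after the condition has held for a minimum time, similar in spirit to the `for` clause of a Prometheus alerting rule. The class name, the state names and the example numbers are assumptions for illustration; this is not Prometheus or Alertmanager code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires once a value has stayed above `threshold` for `hold_seconds`."""
    threshold: float
    hold_seconds: float
    _breach_started: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> str:
        """Record one sample and return 'inactive', 'pending' or 'firing'."""
        if value <= self.threshold:
            # Condition cleared: forget any breach in progress.
            self._breach_started = None
            return "inactive"
        if self._breach_started is None:
            self._breach_started = timestamp
        if timestamp - self._breach_started >= self.hold_seconds:
            return "firing"
        return "pending"
```

For example, with `ThresholdAlert(threshold=0.05, hold_seconds=120)` fed an error ratio of 0.10 at t=0, t=60 and t=120, the first two samples are pending and the third fires. The hold period trades a little alert latency for fewer false alarms on short spikes.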

Choose between a managed service and a self-hosted deployment

Weigh managed services such as Grafana Cloud against self-hosted open source deployments:

Managed service: reduces the operations burden and lets the team focus on application development and strategic projects
Self-hosted deployment: offers more control, but requires more maintenance resources (Source: "Why do enterprises choose Grafana Cloud instead of self-built open source solutions")

Adopt open standards

Instrumenting with open standards such as OpenTelemetry not only avoids vendor lock-in, but also yields telemetry data that shares a unified context across the full stack (Source: "Using OpenTelemetry and Grafana to implement observability, visualization and monitoring of Kubernetes applications.").

Consolidate monitoring tools

A unified view across monitoring tools saves time and money. A Grafana Labs survey found that 80% of respondents had centralized their observability, and that 78% of those saved time or money as a result (Source: "Preview of the 2024 Grafana Labs Observability Survey Report").

Implement process automation

Enforce best practices through automation. Bloomberg automates SRE best practices by establishing company-wide rules for CPU, memory, file system storage and service frameworks, and these rules "act immediately when users create new services or start new machines" (Source: How Bloomberg Tracks Trillions of Data Points Daily with Metrictank and Grafana).
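
A minimal sketch of this kind of automation is a function that every service-creation workflow calls to obtain a default rule set, so that no new service starts without baseline coverage. The metric names, threshold values and rule shape below are hypothetical and are not Bloomberg's actual rules.

```python
from typing import Optional

# Hypothetical company-wide defaults, expressed as percentages of capacity.
DEFAULT_LIMITS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "filesystem_percent": 80.0,
}


def default_rules(service: str, overrides: Optional[dict[str, float]] = None) -> list[dict]:
    """Return the baseline alert rules for a newly created service."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_LIMITS)
    if unknown:
        # Reject typos instead of silently ignoring them.
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    limits = {**DEFAULT_LIMITS, **overrides}
    return [
        {"name": f"{service}-{metric}-high", "service": service,
         "metric": metric, "threshold": threshold}
        for metric, threshold in sorted(limits.items())
    ]
```

Calling `default_rules("checkout", {"cpu_percent": 75.0})` yields three rules for the new service, with only the CPU threshold changed from the shared default.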

Taken together, these practices make for a more efficient monitoring strategy, one that provides end-to-end visibility into the technology stack and speeds up problem identification and resolution.

Among any three people walking together, one can be my teacher; when knowledge is shared, the world belongs to all. This article is from the Dongfeng Weiming Technology Blog.