Building an observability platform at Haida Group

2024-08-30 11:45:48

About Haida Group

Founded in 1998 in Guangzhou, Guangdong Province, Haida Group is a leading technology-oriented agricultural enterprise in China. Its business covers the whole industry chain of modern agriculture and animal husbandry, including feed, seedlings, vaccines, intelligent breeding, and food processing. Haida Group has more than 600 branches and subsidiaries and 40,000 employees around the world. It ranked 238th among China's Top 500 Enterprises in 2023 and 87th among China's Top 500 Private Enterprises in 2023. On the strength of its business performance and brand influence, Haida Group has made the list for five consecutive years and ranked 1,415th in the Forbes Global 2000 in 2024.

[Figure: Haida Group]

Needs and challenges

When we started building the unified observability platform, Haida's IT department set four clear goals:

  1. Cover the different business segments (we have numerous business segments and business systems)
  2. Accommodate heterogeneous IT environments (containers/K8s, physical machines, VMs, and public cloud coexist)
  3. Bridge monitoring from the business perspective to the IT perspective
  4. Provide efficient fault discovery and localization

Before adopting the Flashcat solution, we used Prometheus to collect microservice monitoring data, with Alertmanager sending alerts and Grafana handling visualization. We used Zabbix to monitor the network, machines, and devices. For logs, we used the EFK stack and Alibaba Cloud's logging service to collect and monitor them. For tracing, we used SkyWalking, Elastic APM, and Alibaba Cloud ARMS.

As you can see, with the development of the business and the evolution of the architecture, we have continuously introduced various types of monitoring tools to meet the monitoring needs of different scenarios, environments, and IT architectures. Maintaining and utilizing these monitoring tools brings us a lot of challenges:

  • There are many monitoring tools, so maintenance costs are high; each tool has to be learned separately, which raises the barrier to use.
  • Data is scattered across different systems, which makes analyzing problems and locating faults inefficient.
  • There is no single place to view and dispatch the alarms raised by the various monitoring tools. Alarms are noisy, and the handling process is opaque, so alarms are easy to miss.
  • Even with this many tools, monitoring data collection is still incomplete and needs to be filled in, for example load monitoring for our various models of network equipment, end-to-end network path monitoring, and business metric monitoring.

We wanted to establish a unified observability platform to better ensure system stability and improve the efficiency of the entire technical team.

Solution

Flashcat is an all-in-one observability platform built by Kuaimao Xingyun with the open source Nightingale at its core. It has the following features:

  • Unified collection: the companion collector, Categraf, follows a plug-in design and ships with hundreds of built-in collection plug-ins. GPUs, servers, network devices, middleware, databases, applications, business metrics, and resources both on and off the cloud can all be monitored out of the box.
  • Integration: besides using the collector, Flashcat can integrate with the observability systems an enterprise already runs, on or off the cloud. There is no need to tear everything down and start over; existing investments are reused, results come quickly, and the data is linked together so it can be analyzed collaboratively.
  • Unified alarms: metric alarms, log alarms, and intelligent alarms are supported on a single platform. Flashcat connects to dozens of data sources, collects alarm events from the various monitoring systems, and applies unified alarm convergence, noise reduction, scheduling, claiming, escalation, and collaboration, which significantly improves alarm handling efficiency.
  • Unified observability: Flashcat integrates multiple kinds of observability data, such as Metrics, Logs, Traces, Events, and Profiling, and comes with preset industry best practices. It provides cockpit views from both a global business perspective and a technical perspective, plus the ability to locate faults by drilling down layer by layer, which shortens the time needed to discover and locate faults.
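
The alarm convergence idea in the list above can be illustrated with a minimal sketch. This is not Flashcat's implementation: the `Alert` fields, the source names, and the grouping key (service plus alert name) are all assumptions chosen for illustration. Events from different monitoring systems are normalized into one shape and then grouped, so that repeats collapse into a single incident.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """A normalized alert event (hypothetical shape, for illustration only)."""
    source: str      # originating monitoring system, e.g. "prometheus"
    service: str     # affected service or device
    name: str        # alert rule name
    severity: int    # 1 = critical, 2 = warning, 3 = info


def aggregate(alerts):
    """Group alerts by (service, name) and return one summary per group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.name)].append(alert)

    incidents = []
    for (service, name), members in groups.items():
        incidents.append({
            "service": service,
            "name": name,
            "count": len(members),
            # The most severe member decides the incident severity.
            "severity": min(a.severity for a in members),
            "sources": sorted({a.source for a in members}),
        })
    # Most severe incidents first, then by service name for stable output.
    return sorted(incidents, key=lambda i: (i["severity"], i["service"]))


if __name__ == "__main__":
    raw = [
        Alert("prometheus", "order-api", "HighLatency", 2),
        Alert("prometheus", "order-api", "HighLatency", 2),
        Alert("zabbix", "order-api", "HighLatency", 1),
        Alert("zabbix", "core-switch-01", "PortDown", 1),
    ]
    for incident in aggregate(raw):
        print(incident)
```

In this sketch four raw events become two incidents. A real platform would also need time windows and configurable grouping rules, which are left out here.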

[Figure: Flashcat]

These are some of the features we particularly value in Flashcat:

  • It can monitor business metrics and link them to the health of IT systems.
  • It can connect to the data the enterprise already collects, so it lands quickly, with little resistance and little risk.
  • It comes with a mature fault discovery and localization methodology drawn from Internet-industry practice, which supports us in building a 1-5-10 stability assurance system.
  • It provides alarm aggregation and noise reduction, which effectively reduces the number of alarms.

We therefore worked with the Flashcat technical team to draw up the following rollout roadmap:

[Figure: Roadmap]

Results

First, with reference to Flashcat's stability assurance model, we established North Star (Polaris) metrics, fire maps, and multi-dimensional analysis reports, from the top down, for every business segment as well as for infrastructure, big data, and the group network. This gives us a layered approach to fault discovery, localization, and analysis.

[Figure: Breakdown of segments and layers]

Second, we built on our existing monitoring data by connecting it as data sources, and used Categraf, Flashcat's companion all-in-one collector, to fill the gaps in our observability data. This let us quickly and smoothly reach the goal of meeting all of our observability needs on one platform, which greatly improved the user experience and efficiency.

[Figure: Rollout results]

Finally, we routed all the previously scattered alarms into Flashcat's unified alarm event response platform and put its aggregation and noise reduction, claiming, escalation, scheduling, and dispatch capabilities into practice. This gives us full-lifecycle management of alarm events, comprehensive analysis of alarm data, and data-driven alarm governance, and it has significantly improved on-call efficiency.
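
To make the claiming and escalation steps concrete, here is a minimal sketch of an alarm event lifecycle. It is an illustration only and does not describe how Flashcat works internally: the `Incident` class, its fields, and the rule that an unclaimed event moves up one level per elapsed timeout are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    """An alarm event moving through a simple lifecycle (hypothetical model)."""
    title: str
    created_at: float                  # seconds on any consistent clock
    level: int = 1                     # current escalation level
    claimed_by: Optional[str] = None
    resolved: bool = False
    history: list = field(default_factory=list)

    def claim(self, user: str, now: float) -> None:
        """Record that someone has taken ownership of the incident."""
        if self.claimed_by is None and not self.resolved:
            self.claimed_by = user
            self.history.append((now, f"claimed by {user}"))

    def escalate_if_unclaimed(self, now: float, timeout: float,
                              max_level: int = 3) -> bool:
        """Raise the level to match how many full timeouts passed unclaimed."""
        if self.claimed_by is not None or self.resolved:
            return False
        due_level = min(max_level, 1 + int((now - self.created_at) // timeout))
        if due_level > self.level:
            self.level = due_level
            self.history.append((now, f"escalated to level {due_level}"))
            return True
        return False

    def resolve(self, now: float) -> None:
        """Close the incident and record when it happened."""
        if not self.resolved:
            self.resolved = True
            self.history.append((now, "resolved"))


if __name__ == "__main__":
    incident = Incident("order-api latency high", created_at=0.0)
    incident.escalate_if_unclaimed(now=300.0, timeout=300.0)  # nobody claimed it
    incident.claim("alice", now=320.0)
    incident.resolve(now=900.0)
    for timestamp, event in incident.history:
        print(timestamp, event)
```

The 300-second timeout is an arbitrary value for the example. Keeping a `history` list on each event is what makes later analysis possible, such as measuring how long events wait before being claimed.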

This article was written by Lv Libing, Deputy IT System Manager of Haida Group.