
No need to tear down what you have already built: this observability product takes a novel approach!

2024-09-03 11:09:21

There are already many open source and commercial observability products on the market, such as Zabbix, Prometheus, Nightingale, SigNoz, SkyWalking, and ELK. On top of that, every cloud vendor offers its own observability suite, and vendors with muddled product planning may even offer several products with overlapping functionality, which makes enterprise data silos worse. Take a look at two sets of data:

Two sets of data that give a feel for real enterprise environments

According to incomplete statistics, a medium-sized enterprise uses more than 5 monitoring tools on average, while a large enterprise uses more than 10.

These tools span on-cloud and off-cloud deployments; open source, self-built, and commercial products; networks, servers, databases, middleware, applications, and business; and metrics, logs, traces, and events.

Everyone wants to put an end to this mess, but nobody wants to abandon what has already been built. Abandoning it would mean negating the decisions and efforts of predecessors, paying huge migration costs, and dealing with all the cross-team friction that follows.

Without introducing yet another data silo, is it possible to integrate all these fragmented tools with a single product that offers better data correlation, better observability, and faster fault localization? The answer is yes, and we recommend one such product: Flashcat.

Flashcat product architecture diagram

Flashcat is an all-in-one observability platform built by CryptoCat Nebula with the open source Nightingale at its core. It has the following features:

  • It can connect directly to all kinds of existing monitoring systems as data sources, so there is no need to tear down what has been built, and results come quickly;
  • If some type of data is missing, Flashcat can fill the gap. For example, if you only have a metrics system built on Prometheus and want to add logging, Flashcat can provide it, including log collection, log ETL, log storage, log querying, and log alerting;
  • Flashcat is designed to be "fault location oriented". It has distilled a set of fault location methodologies and scenarios, so users can locate faults faster with all kinds of observability data.

The first two points are easy to understand. The third deserves elaboration.

On most other observability platforms, data is presented flat: the menu is divided bluntly into metrics, logs, and tracing, and metrics may be further subdivided into infrastructure, applications, business, and so on. Such a design is not friendly to fault location, for two reasons:

  • Users lack an overview cockpit that shows the health of the system from a global perspective;
  • The data types are not linked, so users need a global model in their heads to connect them. For example, to check the health of a service, they have to query specific metrics in the metrics system, search specific indexes in the logging system, and look up specific call chains in the tracing system, and each kind of data is retrieved in a different way with different keywords. This increases the cognitive burden. As a result, only senior operations engineers can use these tools skillfully, while operations novices and even developers find them very hard to use, so the observability platform struggles to deliver its proper value.

Flashcat's approach focuses on capturing and reusing knowledge: when users locate a fault, which data they look at first and which data they look at next can all be captured in Flashcat. The various types of data are then presented in an integrated way, for example:

  • First, it provides a global cockpit view, so users can see the health of each business at a glance;
  • When a core business metric goes wrong, such as a sharp drop in orders, users can easily see in Flashcat the health status of the services that affect order volume;
  • If a service's 5xx error rate has increased, users can easily see in Flashcat which dimensions of the service are seeing 5xx requests, and quickly spot the characteristics of the data.

North Star: global health status of business metrics

The figure above is Flashcat's North Star page. It clearly shows the global health status of the company's six major businesses. The e-commerce business is red, meaning something is abnormal; click through to its details to see exactly which core business metrics are abnormal.

Flashcat North Star business details

On the details page, you can see that a key business metric, the real-time product order volume, has plummeted, dropping all the way to 0 at some moments. This is an obvious failure. The user can click on the point of the plunge to see whether the related services are healthy; unhealthy ones are marked in red, so the faulty service can be located quickly. Here, for instance, E-commerce -> Order subsystem is clearly abnormal, and the user can click in and continue drilling down to troubleshoot:

Flashcat firefighting map of the order subsystem

The figure above shows the functions of the order subsystem. Most are green and healthy (the colors do a good job of guiding the eye); only the "order submission" function is red and abnormal. Click through to its details to see historical trend charts of the function's performance data.
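The red and green drill-down walked through so far (business, then subsystem, then function) can be pictured as a worst-status rollup over a tree. The following is only a minimal sketch of that idea, not Flashcat's actual implementation: the `Node` class, the `red_path` helper, and the "Payment subsystem" and "order query" entries are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field

# Assumed two-level severity scale; a real platform may use more states.
SEVERITY = {"green": 0, "red": 1}


@dataclass
class Node:
    name: str
    status: str = "green"  # status measured on this node itself
    children: list["Node"] = field(default_factory=list)

    def rolled_up(self) -> str:
        """Worst status among this node and all of its descendants."""
        statuses = [self.status] + [child.rolled_up() for child in self.children]
        return max(statuses, key=lambda s: SEVERITY[s])


def red_path(node: Node) -> list[str]:
    """Follow the first red child at each level, like clicking through the UI."""
    path = [node.name]
    for child in node.children:
        if child.rolled_up() == "red":
            return path + red_path(child)
    return path


# Hypothetical hierarchy mirroring the walkthrough in the text.
order = Node("Order subsystem", children=[
    Node("order query"),
    Node("order submission", status="red"),
])
ecommerce = Node("E-commerce", children=[Node("Payment subsystem"), order])

path = red_path(ecommerce)
print(" -> ".join(path))
```

Taking the worst status of the children is one simple design choice: a single red leaf turns every ancestor red, which is what lets a user start from the global view and follow the color downward.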

Historical trend charts for the order submission function

The three charts show the key RED metrics: Requests, Success/Error, and Duration. It is clear that the failure is caused by the interface's success rate dropping at certain times. Click on the point of the drop to drill down into the logs, trace data, and so on from that moment, or even automatically compare the abnormal interval with a normal one.
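To make the three RED metrics concrete, here is a minimal sketch of how they might be derived from raw request records for a single time window. It is an illustration under stated assumptions, not how Flashcat computes them: the `Request` record, the `red_metrics` function, and the choice to count only 5xx responses as errors are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Request rate, success rate, and mean duration for one time window."""
    total = len(requests)
    if total == 0:
        # Assumption: an empty window is reported as fully successful.
        return {"rate_per_s": 0.0, "success_rate": 1.0, "avg_duration_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)  # assumed error rule
    return {
        "rate_per_s": total / window_s,
        "success_rate": (total - errors) / total,
        "avg_duration_ms": sum(r.duration_ms for r in requests) / total,
    }


# Hypothetical sample: four requests in a 60-second window, one of them a 500.
sample = [
    Request(200, 10.0),
    Request(200, 20.0),
    Request(500, 30.0),
    Request(200, 40.0),
]
metrics = red_metrics(sample, window_s=60.0)
print(metrics["success_rate"], metrics["avg_duration_ms"])  # 0.75 25.0
```

Computing this per window and plotting the results over time would give trend lines of the kind described above, where a dip in `success_rate` marks the moment worth drilling into.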

Feature analysis

The system analyzes the data for each dimension across the two time periods and flags the dimensions with the largest differences. A more common workflow, of course, is to view the relevant logs starting from the abnormal metric, and then view the trace data using the traceId in the logs.
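The comparison of two time periods described above can be approximated in a few lines. The sketch below scores each dimension by the total variation distance between its value distributions in the normal and abnormal windows. This metric is an assumption chosen for simplicity, since the article does not say which measure Flashcat uses, and the `host` and `region` dimensions are hypothetical.

```python
from collections import Counter


def distribution(records: list[dict], dim: str) -> dict[str, float]:
    """Share of records per value of one dimension."""
    counts = Counter(r[dim] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def rank_dimensions(normal: list[dict], abnormal: list[dict],
                    dims: list[str]) -> list[tuple[str, float]]:
    """Rank dimensions by how much their distribution shifted between windows."""
    scores = []
    for dim in dims:
        p, q = distribution(normal, dim), distribution(abnormal, dim)
        # Total variation distance: 0 means identical, 1 means disjoint.
        distance = 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0))
                             for v in set(p) | set(q))
        scores.append((dim, distance))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical records: traffic is spread evenly in the normal window,
# but every request in the abnormal window hits host "b".
normal = [
    {"host": "a", "region": "x"}, {"host": "b", "region": "y"},
    {"host": "a", "region": "y"}, {"host": "b", "region": "x"},
]
abnormal = [
    {"host": "b", "region": "x"}, {"host": "b", "region": "y"},
    {"host": "b", "region": "x"}, {"host": "b", "region": "y"},
]
ranking = rank_dimensions(normal, abnormal, ["host", "region"])
print(ranking)  # [('host', 0.5), ('region', 0.0)]
```

In this toy data the `host` dimension stands out while `region` is unchanged, which is the kind of signal that points a user toward the dimension worth investigating.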

Metrics linked to logs and traces

In the example above, the exception log contains a traceId, so users can click the trace button to jump directly from the log to the trace data.
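Jumping from a log line to its trace relies on the traceId being recoverable from the log. As a rough sketch, assuming the ID appears in the form `traceId=<hex>` (the article does not show the real log format, and the sample lines below are invented), the extraction step might look like this:

```python
import re

# Assumed log format: the trace ID appears as traceId=<lowercase hex>.
TRACE_ID = re.compile(r"traceId=([0-9a-f]+)")


def trace_ids(log_lines: list[str]) -> list[str]:
    """Distinct trace IDs found in the log lines, in first-seen order."""
    seen: dict[str, None] = {}
    for line in log_lines:
        match = TRACE_ID.search(line)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


# Hypothetical log lines for illustration only.
logs = [
    "ERROR order submit failed traceId=4bf92f3577b34da6 cause=timeout",
    "INFO health check ok",
    "ERROR order submit failed traceId=4bf92f3577b34da6 cause=timeout",
    "ERROR order submit failed traceId=00f067aa0ba902b7 cause=timeout",
]
ids = trace_ids(logs)
print(ids)  # ['4bf92f3577b34da6', '00f067aa0ba902b7']
```

Each extracted ID could then be used as the lookup key in whatever tracing backend stores the call chains.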

Trace data

Looking at the trace data, we finally find that the root cause of this failure is a failing call to the Redis instance at 10.201.0.210:6379, so the next step is to stop the loss right away.

The whole process may seem to involve quite a few clicks, but every step is guided by colors and graphics, so it is not complicated. Even a novice who is not very familiar with the overall system can quickly locate the fault by following the platform's color cues. This is what an observability platform should look like!

A product idea like Flashcat's is rare in the market and worth a try. You can add me on WeChat to book a Flashcat product walkthrough and demo, or visit our official website to learn more. Our official website is:/

WeChat of 巴辉特