Reading notes for Core Technologies and Applications for Data Asset Management -- Chapter 5: Data Services (II)

Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press. The book is divided into 10 chapters. Chapter 1 introduces data assets, the basic concepts related to them, and how they have developed. Chapters 2 to 8 cover the core technologies involved in data asset management in the era of big data, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data permissions and security, and data asset management architecture. Chapters 9 and 10 cover the application of data asset management technology from a practical perspective, including how to manage metadata to realize more of the potential of data assets, and how to model data to extract more value from it.

Book description: Core Technologies and Applications for Data Asset Management

Today, I'm mainly going to share Chapter 5 with you:

Chapter 5 is titled Data Services.

The content mind map is below:

This article continues from

Reading notes for Core Technologies and Applications for Data Asset Management - Chapter 5: Data Services (I)

Moving on.

1.5 Monitoring and alerting of data services

Once a data service has been configured, it also needs to be monitored while it is being invoked, and when monitoring detects a failure, alert notifications need to be sent automatically, so that the stability of the data service is better protected. The book's chapter on data monitoring and alerting notes that monitoring and alerting for data services is designed and implemented mainly by collecting data service call logs asynchronously and then using Prometheus and Grafana, as shown in the figure below. Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press and authored by Zhang Yongqing.

The figure shows that the key to monitoring and alerting for data services is collecting their log data, which means a data service must output a log entry each time it is invoked. To make monitoring more accurate and detailed, the log design should usually include the common fields listed in Table 6 below.

Field name	Field description
appId	ID of the data service being called; identifies a specific data service
requestArgs	Request parameters passed in when the data service is called
cliendIp	IP address of the requester, as obtained on the data service platform side
requestTime	Timestamp at which the requester invoked the data service; millisecond precision is usually recommended
receiveTime	Timestamp at which the data service platform received the request; millisecond precision is usually recommended
responseTime	Timestamp at which the data service platform responded to the requester after processing the request; millisecond precision is usually recommended
queryDataDuration	Time the data service platform spent querying data
responseMessage	Response returned to the requester after the data service platform processed the request
exception	Exception information from processing the request on the data service platform; empty if no exception occurred

The log can be written in JSON format containing the fields in the table. A log collection tool then gathers the JSON logs and sends them to a message queue, a data processing program parses the log data, and the results are sent to the Prometheus Pushgateway component.
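As a minimal sketch of what one such JSON log entry could look like, the snippet below builds a single-line record from the Table 6 fields. The function name `build_call_log`, the sample values, and the choice of milliseconds for `queryDataDuration` are assumptions made for illustration; the book does not prescribe them.

```python
import json


def build_call_log(app_id, request_args, client_ip, request_time,
                   receive_time, response_time, query_ms,
                   response_message, exception=""):
    """Return one data service call as a single-line JSON string."""
    record = {
        "appId": app_id,
        "requestArgs": request_args,
        "cliendIp": client_ip,          # field name spelled as in Table 6
        "requestTime": request_time,    # epoch milliseconds
        "receiveTime": receive_time,    # epoch milliseconds
        "responseTime": response_time,  # epoch milliseconds
        "queryDataDuration": query_ms,  # assumed unit: milliseconds
        "responseMessage": response_message,
        "exception": exception,         # stays empty when nothing failed
    }
    return json.dumps(record, ensure_ascii=False)


line = build_call_log(
    app_id="demo-service",          # hypothetical service ID
    request_args={"userId": 42},    # hypothetical request parameters
    client_ip="10.0.0.8",
    request_time=1700000000000,
    receive_time=1700000000015,
    response_time=1700000000095,
    query_ms=60,
    response_message="OK",
)
print(line)
```

Writing one JSON object per line keeps the entries easy for a log collection tool to pick up and for a downstream program to parse with `json.loads`.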

Common log collection tools are shown in the following table.

Log collection tool	Description and download/deployment address
Flume	Open source log collection tool from the Apache Software Foundation, implemented in Java; GitHub: apache/logging-flume
Logstash	Open source, pipeline-based log collection tool; GitHub: elastic/logstash
Fluentd	Open source, pluggable log data collection tool implemented in C and Ruby; GitHub: fluent/fluentd
Splunk	Closed-source commercial tool for log collection, processing, and storage; available from the official website

Once the JSON log data has been collected and processed, the following core indicators can usually be produced for monitoring, as shown in the figure below.

A long request processing time means the data service is processing slowly; check whether the data service lacks processing capacity or the server lacks resources.
A long network time for a request most likely means network bandwidth is insufficient or the network is jittering frequently; troubleshoot the network link.
A long data query time means the database query is slow; check whether the database has slow queries or lacks resources.
The exception count is the number of requests that failed during processing. If it reaches a set threshold, investigate whether the data service has failed or the requester is sending incorrect request parameters.
The call count is the volume of calls from requesters and is also an important indicator of how much concurrency they generate. If the call volume exceeds the processing capacity of the data service, add resources promptly to scale out, or promptly ask the requester why the call volume is so high; also check whether the data service is under malicious attack from outside.
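One plausible way to derive these indicators from the log fields is sketched below. It assumes that network time can be approximated as `receiveTime - requestTime` (which only holds if the requester's and the platform's clocks are synchronized), that processing time is `responseTime - receiveTime`, and that `queryDataDuration` is in milliseconds. The function name `summarize` and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def summarize(log_lines):
    """Aggregate per-appId indicator totals from JSON call logs."""
    stats = defaultdict(lambda: {
        "calls": 0, "exceptions": 0,
        "network_ms": 0, "processing_ms": 0, "query_ms": 0,
    })
    for line in log_lines:
        rec = json.loads(line)
        s = stats[rec["appId"]]
        s["calls"] += 1
        if rec.get("exception"):            # non-empty means a failure
            s["exceptions"] += 1
        s["network_ms"] += rec["receiveTime"] - rec["requestTime"]
        s["processing_ms"] += rec["responseTime"] - rec["receiveTime"]
        s["query_ms"] += rec["queryDataDuration"]
    return dict(stats)


sample_logs = [
    json.dumps({"appId": "demo", "requestTime": 1000, "receiveTime": 1010,
                "responseTime": 1100, "queryDataDuration": 70,
                "exception": ""}),
    json.dumps({"appId": "demo", "requestTime": 2000, "receiveTime": 2030,
                "responseTime": 2200, "queryDataDuration": 120,
                "exception": "timeout"}),
]
summary = summarize(sample_logs)
print(summary["demo"])
```

Totals like these (or averages derived from them) are the kind of values a processing program could forward to a metrics backend and compare against alert thresholds.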

2. Performance of data services

Besides good design, a good data service also needs good performance. The most direct measure of performance is how fast data can be queried: the faster the queries, the better the data service performs. Performance optimization usually covers SQL optimization, database optimization, architectural design optimization, hardware optimization, and so on, as shown in the figure below.

(i) SQL optimization: This means improving the query performance of SQL statements. The common steps for tracking down a SQL performance problem are shown below.

As the figure shows:

The first step is to find the slow SQL statements as quickly as possible, for example from the database's slow query log or by monitoring database queries. Only once the slow SQL is known can the next step of analysis begin.

The second step is to analyze why the SQL statement is slow by viewing its execution plan in the database. Generally speaking, whatever the type of database, the execution plan of a SQL statement can be viewed.

The third step is to tune the SQL statement according to the cause found. The most common tuning is to add an index if none exists, or, if an index exists but the query does not hit it, to rewrite the SQL statement so that it does.
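The second and third steps can be illustrated with a small, self-contained sketch. It uses SQLite (via Python's standard `sqlite3` module) purely because it needs no server; the `orders` table, its columns, and the index name are made up for the example, and other databases expose execution plans through their own `EXPLAIN` variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def plan(sql, params):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]   # last column holds the description


# Step 2: inspect the plan; without an index SQLite scans the whole table.
before = plan(QUERY, (7,))

# Step 3: add the missing index and confirm the plan now uses it.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(QUERY, (7,))

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of `orders` and the second names `idx_orders_customer`.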

(ii) Database optimization: When the data volume is truly very large and SQL optimization cannot solve the problem, database optimization is needed. Common approaches include using a cache, read-write separation, and splitting databases and tables (sharding), as shown below.

A. Using a cache: this refers to the database's query cache. Frequently used hot data is loaded into the cache in advance, and the database is usually given as much memory as is practical, so that data is loaded into the cache when it is queried and the next query does not need to pull it from physical storage, as shown in the following figure.
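The paragraph above describes the database's own cache. The same idea can be sketched at the application level with a small least-recently-used (LRU) cache in front of a slow lookup; this is a stand-in for illustration, not how any particular database implements its buffer cache. The class name `QueryCache` and the `slow_lookup` function are hypothetical.

```python
from collections import OrderedDict


class QueryCache:
    """Tiny LRU cache that sits in front of a slow lookup function."""

    def __init__(self, load_from_storage, capacity=2):
        self._load = load_from_storage
        self._capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _store(self, key):
        value = self._load(key)               # read from physical storage
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)   # evict least recently used
        return value

    def preload(self, keys):
        """Warm the cache with known hot keys ahead of time."""
        for key in keys:
            self._store(key)

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        return self._store(key)


storage_reads = []


def slow_lookup(key):
    """Stand-in for a query that has to touch physical storage."""
    storage_reads.append(key)
    return f"row-{key}"


cache = QueryCache(slow_lookup, capacity=2)
cache.preload(["a", "b"])   # hot data loaded in advance
cache.get("a")              # served from the cache
cache.get("c")              # not cached: loaded, and "b" is evicted
cache.get("b")              # not cached any more: loaded again
print(cache.hits, cache.misses, storage_reads)
```

The capacity limit is what makes the "allocate more memory" advice matter: with a larger cache, fewer hot entries are evicted and fewer queries fall through to storage.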

B. Read-write separation: this is an architectural optimization at the database level. When a data service reads much more than it writes and the database cannot sustain highly concurrent queries because the data volume is too large, read-write separation lets most queries run against read-only replica nodes, as shown in the following figure.
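A minimal sketch of the routing decision behind read-write separation follows. It assumes a deliberately naive rule (statements starting with `SELECT` are reads, everything else is a write) and round-robin selection among replicas; the class name `ReadWriteRouter` and the node names are hypothetical, and real deployments usually rely on a database proxy or driver feature instead.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary if no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql):
        words = sql.split()
        first_word = words[0].upper() if words else ""
        if first_word == "SELECT":
            return next(self._replicas)   # read: pick the next replica
        return self.primary               # write (or unknown): primary


router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
statements = [
    "SELECT * FROM t",
    "INSERT INTO t VALUES (1)",
    "select 1",
    "UPDATE t SET x = 1",
    "SELECT 2",
]
targets = [router.route(sql) for sql in statements]
print(targets)
```

One design point to keep in mind: replicas typically lag slightly behind the primary, so a read that must see a write made moments earlier may need to go to the primary.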

C. Splitting databases and tables: this is a common solution when a single table holds too much data. When the data volume hits the limit of a single table, the data is redistributed across multiple tables. When it hits the limit of a single database, the data is redistributed across multiple databases, as shown in the figure below.

Common ways of splitting databases and tables are as follows:

(1) By hot and cold data separation: data that is used very frequently is usually called hot data, and data that is queried rarely or almost never is called cold data. Once hot and cold data are separated and hot data is stored on its own, the volume of hot data drops and query performance improves accordingly, as shown in the figure below.

Beyond the hot and cold separation shown in the figure, advances in hardware, such as falling memory prices and the arrival of SSDs, also make it possible to load and separate hot and cold data automatically, as shown in the figure below. Rules determine when data on ordinary hard disks should be preloaded onto SSD or into memory to speed up queries. Because SSD and memory cannot hold large amounts of data, rules are also needed to periodically clear data that is no longer queried from SSD and memory and free up cache space.
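As one example of such a rule, the sketch below classifies rows as hot or cold by how recently they were last queried. The 30-day window, the `last_queried` field, and the function name `split_hot_cold` are assumptions chosen for illustration; the book does not specify a particular rule.

```python
from datetime import datetime, timedelta


def split_hot_cold(rows, now, hot_window=timedelta(days=30)):
    """Partition rows by how recently each one was last queried."""
    hot, cold = [], []
    for row in rows:
        if now - row["last_queried"] <= hot_window:
            hot.append(row)     # keep in the fast store (SSD or memory)
        else:
            cold.append(row)    # candidate for the cold store
    return hot, cold


now = datetime(2024, 6, 1)
rows = [
    {"id": 1, "last_queried": datetime(2024, 5, 25)},
    {"id": 2, "last_queried": datetime(2023, 1, 10)},
    {"id": 3, "last_queried": datetime(2024, 5, 20)},
]
hot, cold = split_hot_cold(rows, now)
print([r["id"] for r in hot], [r["id"] for r in cold])
```

Running the same classification periodically gives both halves of the rule set: rows that turn hot are promoted, and rows that have gone cold are cleared from the fast store.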

(2) By time dimension: databases and tables can be split into real-time data and historical data, or by time intervals such as year or month, as shown in the figure below. The goal is to keep the amount of data in any single database or table as small as possible.
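A rough sketch of month-based splitting: each row's date decides which monthly table it belongs to, and a date-range query only needs to touch the tables for the months it covers. The `orders_YYYYMM` naming scheme and both function names are hypothetical.

```python
from datetime import date


def table_for(day: date, base: str = "orders") -> str:
    """Name of the monthly table that holds rows for the given day."""
    return f"{base}_{day.year:04d}{day.month:02d}"


def tables_for_range(start: date, end: date, base: str = "orders"):
    """List the monthly tables a query over [start, end] must read."""
    tables = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        tables.append(f"{base}_{year:04d}{month:02d}")
        month += 1
        if month > 12:              # roll over into the next year
            year, month = year + 1, 1
    return tables


single = table_for(date(2024, 3, 15))
span = tables_for_range(date(2023, 11, 20), date(2024, 2, 3))
print(single)
print(span)
```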

(3) By algorithmic calculation: sometimes all the data is hot, cannot be separated into hot and cold, is queried frequently, and is very large in volume. In that case, a calculation (for example, a hash) can be applied to one field of the data so that rows fall evenly into different sub-tables. Note that this field should generally be one used as a search condition in queries. At query time, applying the same calculation to the field in the query condition quickly identifies which table to query, as shown in the figure below.
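A minimal sketch of this routing, assuming the calculation is a CRC32 hash taken modulo a fixed number of sub-tables. The shard count, the `user_events` table name, and the `user_id` field are made up for the example. `zlib.crc32` is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would send the same key to different tables on different runs.

```python
import zlib
from collections import Counter

SHARD_COUNT = 4   # assumed number of sub-tables


def shard_table(user_id: str, base: str = "user_events") -> str:
    """Map a query-condition field value to one sub-table."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % SHARD_COUNT
    return f"{base}_{bucket}"


# The same function is used on write (to place a row) and on read
# (to find the table for a given query condition).
write_target = shard_table("user-1001")
read_target = shard_table("user-1001")

# Quick look at how sample keys spread across the sub-tables.
counts = Counter(shard_table(f"user-{i}") for i in range(10000))
print(write_target, read_target)
print(sorted(counts.items()))
```

A fixed modulus keeps the routing trivial, at the cost that changing `SHARD_COUNT` later moves most keys to different tables and therefore requires a data migration.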


(iii) Architectural design optimization: When neither SQL optimization nor database optimization can solve the performance problem, optimization at the architectural design level should be considered. Common approaches are as follows:

(1) Peak shaving and valley filling with a message queue: when call volume peaks, the message queue buffers the incoming requests, the requests are processed asynchronously, and the results are then returned to the callers, as shown in the following figure.
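The buffering idea can be sketched in-process with Python's standard `queue` and `threading` modules. This only stands in for a real message queue; the queue size, the doubling "query", and the `results` dictionary used to hand results back are all assumptions for illustration.

```python
import queue
import threading

pending = queue.Queue(maxsize=100)   # buffer that absorbs the burst
results = {}                         # stand-in for replying to callers


def worker():
    """Drain the queue at the service's own pace."""
    while True:
        item = pending.get()
        if item is None:             # sentinel: no more requests
            pending.task_done()
            break
        request_id, payload = item
        results[request_id] = payload * 2   # stand-in for the real query
        pending.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# A burst of 20 requests arrives at once; callers only enqueue and move on.
for request_id in range(20):
    pending.put((request_id, request_id))

pending.put(None)    # tell the worker to stop after the backlog
pending.join()       # wait until every queued item has been processed
consumer.join()
print(len(results))
```

The bounded `maxsize` matters: when the buffer is full, `put` blocks, which pushes back on callers instead of letting the backlog grow without limit.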

(2) Using a distributed database: the distributed databases meant here are implemented on an MPP (Massively Parallel Processing) architecture. Common examples include Doris and Greenplum; see their official websites to learn more.

(3) Deployment architecture optimization: for example, deploying on Kubernetes. Because Kubernetes supports dynamic scaling up and down, it can safeguard the performance of the data service while also controlling cost through elastic scaling.

(iv) Hardware optimization: This generally means expanding hardware resources or improving their performance. Common approaches are as follows:

(1) Use hardware with faster I/O, such as SSDs instead of ordinary mechanical hard disks.
(2) Scale servers horizontally or vertically, by adding servers or by upgrading server configuration.
(3) Increase network bandwidth, or use higher-bandwidth network equipment, to raise the transmission speed of the network channel.

To be continued ...