Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press, authored by Zhang Yongqing et al. The book is divided into 10 chapters. Chapter 1 introduces readers to data assets: the basic concepts related to data assets and how data assets have developed. Chapters 2 to 8 introduce the core technologies involved in data asset management in the era of big data, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data rights and security, and data asset management architecture. Chapters 9 and 10 introduce the application practice of data asset management technology from a practical perspective, including how to manage metadata to realize the greater potential of data assets, and how to model data to mine greater value from it.
Today, I'm mainly going to share Chapter 4 with you:
Chapter 4 is entitled Technical Realization of Data Quality
The content mind map is below:
This article follows on from the previous installment, "Notes on Reading Core Technologies and Applications for Data Asset Management -- Chapter 4: Technical Realization of Data Quality (I)".
Moving on.
1. Technical realization of quality data collection
Of course, in addition to using Apache DolphinScheduler, we can also implement our own scheduled task execution; the relevant technical architecture is shown in the figure below.
- Since both data lakes and data warehouses support Spark for data reading and processing, the quality data of a data lake or data warehouse can be collected by executing Spark tasks on a Spark cluster. Spark clusters can be deployed in Standalone, Mesos, YARN, or Kubernetes mode; for details, refer to the official Spark website: /docs/latest/#cluster-manager-types. As shown in the figure below, choose the Spark cluster deployment mode that matches the deployment mode of the data lake or data warehouse you are actually using. For example, if your Hive data warehouse is deployed on Hadoop, then Hadoop YARN is the more appropriate deployment mode for the Spark cluster.
- Design a jar package or PySpark script that can be executed on the Spark cluster. The jar package or PySpark script is submitted to the Spark cluster as a task; when it runs, it reads the configured quality rules, and after execution completes, it writes the collected quality result data to the database. For how to submit jar package or PySpark script tasks to a Spark cluster, refer to the official website: /docs/latest/, as shown in the figure below (a minimal submission sketch also follows this list).
- Jar packages or PySpark scripts can execute Spark SQL statements, as well as Scala scripts or Python scripts.
- If the Spark cluster is deployed via Kubernetes, you need to build the jar package or PySpark script into a Docker image and then run it in the Spark cluster via that image, as shown in the following figure. For background on Kubernetes, refer to: /zh-cn/docs/home/.
- Jar packages or Python scripts need to be made generic, rather than creating a separate jar package or Python script for each quality rule. User-defined jar packages or Python scripts can of course be supported as extensions, but the abstract interface that jar packages or Python scripts must implement has to be defined, as shown in the following figure.
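As a reference for the submission step above, here is a minimal sketch of submitting such a generic jar package to a Spark cluster programmatically using Spark's SparkLauncher API; the jar path, main class, and rule-id argument are hypothetical placeholders:

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class QualityTaskSubmitter {
    public static void main(String[] args) throws Exception {
        // Submit the generic quality-collection jar; the rule id is passed as an
        // application argument so that one jar can serve many quality rules.
        SparkAppHandle handle = new SparkLauncher()
                .setMaster("yarn")                                   // match your cluster deployment mode
                .setDeployMode("cluster")
                .setAppResource("/path/to/quality-collector.jar")    // hypothetical jar path
                .setMainClass("com.example.quality.QualityRuleJob")  // hypothetical main class
                .addAppArgs("--ruleId", "1001")
                .startApplication();

        // Wait until the application reaches a terminal state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000L);
        }
        System.out.println("Final state: " + handle.getState());
    }
}
```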
As the figure shows, the abstract interface should at least predefine methods for reading rules, parsing rules, and executing rules. Reference code for defining the abstract interface in Java is as follows:
```java
public interface Example {
    // Read the configured quality rule, e.g. from the rule configuration table.
    void readRule(String rule);

    // Parse the rule content into an executable form.
    void analysisRule(String rule);

    // Execute the parsed rule against the target data and produce quality results.
    void execRule(String data);
}
```
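To make the interface concrete, here is a minimal, hypothetical implementation sketch that runs a null-value check as a Spark SQL statement; the rule format ("table.column"), the Hive setup, and the result handling are assumptions for illustration only:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical rule: counts null values of a column via Spark SQL.
public class NullCheckRule implements Example {
    private String tableName;
    private String columnName;

    @Override
    public void readRule(String rule) {
        // In a real system the rule would be loaded from the rule configuration
        // table; here we assume it is passed in directly.
        analysisRule(rule);
    }

    @Override
    public void analysisRule(String rule) {
        // Assumed rule format: "table.column".
        String[] parts = rule.split("\\.");
        this.tableName = parts[0];
        this.columnName = parts[1];
    }

    @Override
    public void execRule(String data) {
        SparkSession spark = SparkSession.builder()
                .appName("quality-null-check")
                .enableHiveSupport()   // assuming a Hive data warehouse
                .getOrCreate();
        Dataset<Row> result = spark.sql(
                "SELECT COUNT(*) AS null_count FROM " + tableName
                        + " WHERE " + columnName + " IS NULL");
        result.show();   // in practice, write the result into the quality data store
    }
}
```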
To go from the configuration of data quality rules described above to the scheduled execution of quality data collection, the table structure model can be roughly designed as shown below, for reference.
1) t_quality_rule_template is the data quality rule template table. Common, general-purpose rules can be made into templates that rule configurators either use directly or select and then lightly modify.
2) t_quality_rule is the data quality rule configuration table. It stores the actual data quality collection rules, the id of the data table each rule applies to, and the cron expression for scheduled collection, for example 0 */30 * * * ?, which executes every 30 minutes (a validation sketch follows this list).
A cron expression is a string that typically consists of seven space-separated fields, each representing a specific time meaning, as shown in the following table.

| Field | Allowed values |
| --- | --- |
| Second | 0-59 |
| Minute | 0-59 |
| Hour | 0-23 |
| Day of month | 1-31 |
| Month | 1-12 or JAN-DEC |
| Day of week | 1-7 or SUN-SAT |
| Year | 1970-2099 |
3) t_quality_rule_exec is the data quality rule execution table, which stores the execution record of each scheduled collection task. When a scheduled collection task executes, its state change process is roughly as shown in the figure below. To facilitate problem localization, every state change during task execution needs to be written back to the table t_quality_rule_exec.
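As a small aid for the cron expressions stored in t_quality_rule, below is a minimal sketch using the Quartz scheduler's CronExpression class (assuming the org.quartz-scheduler dependency is on the classpath) to validate an expression before saving it and to compute the next collection time:

```java
import java.util.Date;
import org.quartz.CronExpression;

public class CronCheck {
    public static void main(String[] args) throws Exception {
        String cron = "0 */30 * * * ?";   // the every-30-minutes example above

        // Validate the expression before persisting it to t_quality_rule.
        if (!CronExpression.isValidExpression(cron)) {
            throw new IllegalArgumentException("Invalid cron expression: " + cron);
        }

        // Compute the next scheduled collection time after now.
        CronExpression expression = new CronExpression(cron);
        Date next = expression.getNextValidTimeAfter(new Date());
        System.out.println("Next execution: " + next);
    }
}
```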
2. How to handle the collected quality data
Quality data is collected as raw data. Since there are many data quality rules and the raw data collected by each rule may differ, the raw data must be normalized before it can be stored in the database, as shown in the figure below.
Although the raw data collected by each quality rule may differ, we still need to design a unified raw-data message format to facilitate uniform processing. Refer to the following:
```json
[{
    "execId": "",
    "ruleId": "",
    "returnType": "",
    "returnData": [],
    "startExecTime": "",
    "endExecTime": ""
}]
```
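As an illustration, one way to map this message format in Java is a small POJO deserialized with the Jackson library (an assumption; any JSON library would work):

```java
import java.util.List;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

// POJO mirroring the unified raw-data message format above.
public class QualityRawMessage {
    public String execId;          // id of the execution record in t_quality_rule_exec
    public String ruleId;          // id of the rule in t_quality_rule
    public String returnType;      // type of the collected result data
    public List<Object> returnData;
    public String startExecTime;
    public String endExecTime;

    // The message is a JSON array, so deserialize it into a list.
    public static List<QualityRawMessage> parse(String json) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        return mapper.readValue(json, new TypeReference<List<QualityRawMessage>>() {});
    }
}
```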
3. Storage model design for quality data
From an architectural design perspective, data quality storage requires the following.
- Scalability: Supports storing the quality data collected by many different quality rules. For example, it must not happen that when quality rules are extended, or user-defined quality rules are added, their result data cannot be stored without modifying the data storage model.
- Traceability: Change records of quality data need to be kept, so that changes to quality data can be tracked and reviewed later.
- Maintainability: Supports manual operation and maintenance. For example, when there is dirty data or manual intervention is needed, the system administrator can perform routine operations such as cleaning up the relevant historical data or dirty data.
Based on the above design principles, the data quality storage model shown in the figure below is designed for reference; the figure lists the core fields of each table in the model.
If you need to query the quality data of a particular table, you can retrieve it via the correlations shown in the following figure.
Quality data is actually very similar to common monitoring data, so you can also consider storing it in a time-series database: quality data is collected in time order and also changes in time order, which makes a time-series database a natural fit. The following table compares common time-series databases; choose according to your actual scenario.
| Database | InfluxDB | Prometheus | OpenTSDB |
| --- | --- | --- | --- |
| Description | Open-source time-series database for storing time series, events, and metrics | Open-source time-series database, mostly used in monitoring systems | HBase-based scalable open-source time-series database |
| Official website | /products/influxdb/ | / | / |
| Documentation | /influxdb | /docs/ | /docs/build/html/ |
| Implementation language | Go | Go | Java |
| Supported data types | Numbers and strings | Numbers only | Numbers for metrics, strings for tags |
| SQL support | SQL-like query language (similar to SQL syntax) | Not supported | Not supported |
| API type | HTTP API | RESTful HTTP/JSON API | HTTP API |
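If you take the time-series route, below is a minimal sketch of writing one quality result as a point using the InfluxDB 2.x Java client (influxdb-client-java); the URL, token, organization, bucket, measurement, and field names are placeholder assumptions:

```java
import java.time.Instant;
import com.influxdb.client.InfluxDBClient;
import com.influxdb.client.InfluxDBClientFactory;
import com.influxdb.client.WriteApiBlocking;
import com.influxdb.client.domain.WritePrecision;
import com.influxdb.client.write.Point;

public class QualityResultWriter {
    public static void main(String[] args) {
        // Placeholder connection settings.
        try (InfluxDBClient client = InfluxDBClientFactory.create(
                "http://localhost:8086", "my-token".toCharArray(), "my-org", "quality")) {
            // One measurement for quality results; rule and table ids become tags
            // so results can be filtered per rule or per data table.
            Point point = Point.measurement("quality_result")
                    .addTag("ruleId", "1001")
                    .addTag("tableId", "2001")
                    .addField("nullCount", 42L)
                    .time(Instant.now(), WritePrecision.NS);

            WriteApiBlocking writeApi = client.getWriteApiBlocking();
            writeApi.writePoint(point);
        }
    }
}
```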
To be continued ...