Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press. The book is divided into 10 chapters. Chapter 1 introduces readers to data assets, covering the basic concepts related to data assets and how data assets have developed. Chapters 2 to 8 cover the core technologies involved in data asset management in the era of big data, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data rights and security, and data asset management architecture. Chapters 9 and 10 introduce the application practice of data asset management technology from a practical perspective, including how to manage metadata to realize the greater potential of data assets and how to model data to mine greater value from it.
Today, I'm mainly going to share Chapter 4 with you:
Chapter 4 is entitled Technical Realization of Data Quality
The content mind map is below:
In data asset management, data quality is just as important as metadata and data lineage. As shown in the following figure, data quality usually refers to the ability to maintain the integrity, consistency, accuracy, reliability, and timeliness of data throughout its processing lifecycle. Only by knowing the quality of the data can we improve it when the quality turns out to be poor.
- Integrity: whether any data is lost, for example missing data fields or a missing portion of the data volume.
- Consistency: whether data values remain identical, for example whether decimal data loses precision.
- Accuracy: whether the meaning of the data is accurate, for example whether data field annotations are accurate.
- Reliability: for example, whether data storage is reliable and whether disaster recovery is in place.
- Timeliness: whether delays or blockages prevent data from entering the data warehouse or data lake in time.
It is precisely because data quality is so important that there are dedicated international standards defining it. For example, the ISO 8000 series of data quality standards describes in detail how to measure and certify data quality, including the characteristics of data quality and how to carry out data quality management and assessment. A total of 21 parts of ISO 8000 have been published; the quality standards can be queried at the URL /gj/std?op=ISO, as shown in Figure 4-0-2 below.
The main published parts related to data quality include the following:
- 1) ISO 8000-1:2022 Data quality - Part 1: Overview
- 2) ISO 8000-2:2022 Data quality - Part 2: Vocabulary
- 3) ISO 8000-8:2015 Data quality - Part 8: Information and data quality: Concepts and measuring
- 4) ISO/TS 8000-60:2017 Data quality - Part 60: Data quality management: Overview
- 5) ISO 8000-61:2016 Data quality - Part 61: Data quality management: Process reference model
- 6) ISO 8000-62:2018 Data quality - Part 62: Data quality management: Organizational process maturity assessment: Application of standards relating to process assessment
- 7) ISO 8000-63:2019 Data quality - Part 63: Data quality management: Process measurement
- 8) ISO 8000-64:2022 Data quality - Part 64: Data quality management: Organizational process maturity assessment: Application of the Test Process Improvement method
- 9) ISO 8000-65:2020 Data quality - Part 65: Data quality management: Process measurement questionnaire
- 10) ISO 8000-66:2021 Data quality - Part 66: Data quality management: Assessment indicators for data processing in manufacturing operations
- 11) ISO/TS 8000-81:2021 Data quality - Part 81: Data quality assessment: Profiling
- 12) ISO/TS 8000-82:2022 Data quality - Part 82: Data quality assessment: Creating data rules
- 13) ISO 8000-100:2016 Data quality - Part 100: Master data: Exchange of characteristic data: Overview
- 14) ISO 8000-110:2021 Data quality - Part 110: Master data: Exchange of characteristic data: Syntax, semantic encoding, and conformance to data specification
- 15) ISO 8000-115:2018 Data quality - Part 115: Master data: Exchange of quality identifiers: Syntactic, semantic and resolution requirements
- 16) ISO 8000-116:2019 Data quality - Part 116: Master data: Exchange of quality identifiers: Application of ISO 8000-115 to authoritative legal entity identifiers
- 17) ISO 8000-120:2016 Data quality - Part 120: Master data: Exchange of characteristic data: Provenance
- 18) ISO 8000-130:2016 Data quality - Part 130: Master data: Exchange of characteristic data: Accuracy
- 19) ISO 8000-140:2016 Data quality - Part 140: Master data: Exchange of characteristic data: Completeness
- 20) ISO 8000-150:2022 Data quality - Part 150: Data quality management: Roles and responsibilities
- 21) ISO/TS 8000-311:2012 Data quality - Part 311: Guidance for the application of product data quality for shape (PDQ-S)
1. Technical realization of quality data collection
Whether the data sits in a data warehouse or a data lake, we do not know its quality at the outset, so quality data needs to be collected from the data lake or data warehouse on a regular schedule according to certain rules, which users can configure themselves. The usual process is shown in the figure below.
Common rules can be packaged as rule templates so that users can directly select a rule to use for quality data collection. Frequently used rules are listed in the table below; a short sketch of how such a template can be turned into an executable check follows the table.
| Rule | Description |
| --- | --- |
| Null rate of table fields | Collect the rate at which a specified field of a specified table is null. |
| Anomaly rate of table fields | Collect the rate of anomalous values for a specified field of a specified table. For example, a gender field may only take the values male or female; any other value is anomalous, and the anomaly rate can be computed from this rule. Which values count as anomalous must, of course, also support user-defined maintenance. |
| Format anomaly rate of table field data | Collect the rate of data format anomalies for a specified field of a specified table. For example, values that do not match the required time format or mobile phone number format are anomalous, and the rate of such format anomalies can be calculated. |
| Duplication rate of table field data | Collect the duplication rate of specified field values in a specified table. For example, some field values are not allowed to be duplicated, so any duplicate is an exception. |
| Missing-field rate of table fields | Collect whether the number of fields in the specified table matches the expected number of fields; if not, fields are missing, and the missing-field rate can be calculated. |
| Timeliness of table data entry | Collect the difference between the time data enters the specified table and the current system time, and then calculate the timeliness of the data and the timeliness rate. |
| Loss rate of table records | 1) Collect the record count of the specified table and compare it with the expected data volume or with the data volume of the source table to calculate the record loss rate. 2) Collect the record count of the specified table and compare it with the weekly or monthly average to determine whether the record count is below the normal level, and hence whether records have been lost. |
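As a minimal sketch (not from the book), a rule template such as the null-rate rule above could be rendered into an executable SQL statement from the user's configuration; the template text and the `orders` / `customer_id` names are hypothetical placeholders.

```python
# Minimal sketch (not from the book): rendering a configurable rule template into
# an executable SQL statement. The template text and the `orders` / `customer_id`
# names are hypothetical placeholders.
NULL_RATE_TEMPLATE = (
    "SELECT COUNT(*) AS total, "
    "SUM(CASE WHEN {field} IS NULL THEN 1 ELSE 0 END) AS null_cnt, "
    # some engines do integer division, so a cast may be needed in practice
    "SUM(CASE WHEN {field} IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate "
    "FROM {table}"
)

def build_null_rate_sql(table: str, field: str) -> str:
    """Fill the rule template with the table/field chosen by the user."""
    return NULL_RATE_TEMPLATE.format(table=table, field=field)

if __name__ == "__main__":
    print(build_null_rate_sql("orders", "customer_id"))
```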
In addition to generic rules, custom rules certainly need to be supported, allowing users to write their own SQL scripts, Python scripts, or Scala scripts.
- SQL scripts: this generally refers to submitting and running SQL scripts directly over JDBC to obtain data quality results. Common relational databases such as MySQL and SQL Server support JDBC, and Hive also supports JDBC connections. In addition, SQL scripts can be run as SparkSQL jobs, as shown in the figure below.
To summarize: if the database or data warehouse itself supports the JDBC protocol, SQL statements can be run directly over JDBC. If not, a SparkSQL job can bridge the gap, since Spark supports connecting to data warehouses or data lakes such as Hive and Hudi, as well as to other databases via JDBC. This is clearly described on the official website at /docs/latest/, as shown in the figure below.
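Below is a minimal sketch (not from the book) of both paths: running a quality SQL against a Hive table through SparkSQL, and against a relational database through Spark's JDBC data source. The connection details (host, database, credentials) and table names are hypothetical.

```python
# Minimal sketch (not from the book): running the same null-rate SQL either
# directly against a Hive table through SparkSQL, or against a relational
# database through Spark's JDBC data source. Hosts, databases, credentials,
# and table names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("quality-sql-collection")
         .enableHiveSupport()   # assumes Spark is configured with a Hive metastore
         .getOrCreate())

quality_sql = (
    "SELECT COUNT(*) AS total, "
    "SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_cnt "
    "FROM orders"
)

# 1) Data warehouse / data lake tables (Hive, Hudi, ...) via SparkSQL
hive_result = spark.sql(quality_sql).collect()[0]

# 2) Relational databases (MySQL, SQL Server, ...) via the JDBC data source
jdbc_result = (spark.read.format("jdbc")
               .option("url", "jdbc:mysql://mysql-host:3306/sales")
               .option("query", quality_sql)   # pushes the SQL down to the database
               .option("user", "quality_user")
               .option("password", "******")
               .load()
               .collect()[0])

print(hive_result, jdbc_result)
```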
- Python script: Python is a commonly used scripting language. Since SQL scripts only cover results that can be queried directly with SQL statements, Python scripts can be used for complex scenarios or scenarios that SQL cannot express, and Spark also supports the Python language, as shown in the following figure.
An introduction to PySpark can be found at the URL /docs/latest/api/python/, as shown in the figure below.
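Below is a minimal sketch (not from the book) of a check that is awkward to express as a plain SQL script, written with the PySpark DataFrame API; the `users` table, the `phone` column, and the regular expression are hypothetical.

```python
# Minimal sketch (not from the book): a format-anomaly check written with the
# PySpark DataFrame API instead of a plain SQL script. The table name `users`,
# the column `phone`, and the regular expression are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-python-collection").getOrCreate()

df = spark.table("users")
total = df.count()

# Rows whose phone number does not match the expected pattern are anomalous.
# Null values are left to the null-rate rule and are not counted here.
bad = df.filter(~F.col("phone").rlike(r"^1\d{10}$")).count()

print({"total": total, "format_anomaly_rate": bad / total if total else 0.0})
```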
- Scala script: Spark itself is implemented mainly in Scala, and many big data developers prefer the Scala language, so Spark jobs that collect quality data can also be written as Scala scripts, as shown in the following figure.
For the timed jobs that collect quality data, my recommended technology choice is Apache DolphinScheduler, a big data task scheduling platform. Apache DolphinScheduler is a distributed, easily extensible, visual open-source workflow scheduling platform. It solves the problem of complex dependencies between big data tasks and supports arbitrarily orchestrating the associations between task nodes across various big data and DataOps applications. Tasks are assembled in a directed acyclic graph (DAG) flow pattern, the execution status of tasks can be monitored in real time, and operations such as retrying, resuming from a specified failed node, pausing, resuming, and terminating tasks are supported. The official URL is /en-us, as shown in the figure below.
Apache DolphinScheduler supports secondary development, and its GitHub address is /apache/dolphinscheduler.
The address of the relevant deployment documentation is: /en-us/docs/3.2.0/installation_menu
The image below shows the technical implementation architecture diagram provided on the official website at /en-us/docs/3.2.0/architecture/design.
As you can see from the figure, its support for SQL, Python, Spark, and other task node types is exactly what we need. The platform also supports distributed deployment and scheduling, so there is no performance bottleneck, because a distributed system can be scaled horizontally or vertically.
Apache DolphinScheduler also provides API access; the official API documentation address is /en-us/docs/3.2.0/guide/api/open-api.
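As a minimal sketch (not from the book), quality-collection workflows could also be queried or managed programmatically through this open API; the host, port, token value, and the exact endpoint path below are assumptions that should be verified against the linked documentation.

```python
# Minimal sketch (not from the book): querying DolphinScheduler's open API with an
# access token. The host/port, token value, and endpoint path are assumptions and
# should be verified against the open-api documentation linked above.
import requests

BASE_URL = "http://dolphinscheduler-host:12345/dolphinscheduler"  # assumed gateway address
HEADERS = {"token": "your-access-token"}  # token created in the platform's token management page

# Page through the projects visible to this token (assumed endpoint and parameters).
resp = requests.get(
    f"{BASE_URL}/projects",
    headers=HEADERS,
    params={"pageNo": 1, "pageSize": 10, "searchVal": ""},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```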
The overall architecture of the technical implementation for collecting quality data is shown in the figure below.
To be continued ...