Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press. The book is divided into 10 chapters. Chapter 1 mainly introduces readers to data assets: the basic concepts related to data assets and how data assets have developed. Chapters 2 to 8 mainly introduce the core technologies involved in data asset management in the era of big data, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data rights and security, and data asset management architecture. Chapters 9 and 10 introduce the application practice of data asset management technology from a practical perspective, including how to manage metadata to realize the greater potential of data assets, and how to model data to mine greater value from the data.
Today, I'm mainly going to share Chapter 4 with you:
Chapter 4 is entitled Technical Realization of Data Quality
The content mind map is below:
This article continues from
Notes on the reading of Core Technologies and Applications for Data Asset Management -- Chapter 4: Technical realization of data quality (II).
Moving on.
4、Common open source data quality management platforms
4.1、 Apache Griffin
Apache Griffin is an open source big data quality management system. Its underlying implementation is based on Hadoop and Spark, and it supports two data quality inspection modes: batch processing and stream processing. The architecture diagram provided in the /docs/ section of the Apache Griffin official website is shown below.
Apache Griffin's source code is hosted on GitHub at /apache/griffin.
As you can see from the architecture diagram:
- When performing data quality inspection, Apache Griffin is implemented on Spark: the defined quality rules run as Spark tasks that collect quality data from the configured data sources.
- In the architecture diagram, Define is mainly responsible for defining the dimensions of data quality, which is what we call the definition of data quality rules.
- Measure is responsible for executing the data quality tasks and generating the data quality result data.
- Analyze is primarily responsible for storing and presenting the resultant data.
As shown in the figure below, Apache Griffin's architecture corresponds neatly to the data quality collection process described earlier; a minimal sketch of that flow follows the figure.
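To make the Define → Measure → Analyze flow concrete, here is a minimal, illustrative Spark sketch in Scala. It is not Griffin's actual Measure code; the table names, rule, and output path are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a hand-rolled "accuracy" measure in the spirit of
// Griffin's Define -> Measure -> Analyze flow. Table names and the output
// path are hypothetical.
object AccuracyMeasureSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dq-accuracy-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Define: the quality rule -- every order in the source table should
    // also exist in the target table (matched on order_id).
    val source = spark.table("ods.orders")        // hypothetical source table
    val target = spark.table("dwd.orders_clean")  // hypothetical target table

    // Measure: execute the rule as a Spark job and compute the metric.
    val total    = source.count()
    val matched  = source.join(target, Seq("order_id"), "left_semi").count()
    val accuracy = if (total == 0) 1.0 else matched.toDouble / total

    // Analyze: persist the result data for storage and presentation.
    import spark.implicits._
    Seq(("accuracy", "ods.orders", total, matched, accuracy))
      .toDF("metric", "table", "total", "matched", "value")
      .write.mode("append").json("/tmp/dq/metrics")  // hypothetical sink

    spark.stop()
  }
}
```

In Griffin itself, the rule would come from a Define configuration rather than being hard-coded as it is here.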
Apache Griffin also supports containerized deployment; refer to the related deployment guide at /apache/griffin/blob/master/griffin-doc/docker/.
Apache Griffin's main technology stack and development languages include:
- Backend: Java and Scala. The API service is developed mainly in Java and uses the HTTP and gRPC protocols for data communication; task execution is developed mainly in Scala, handling Spark task submission, running, and so on.
- Front-end: TypeScript, HTML, and CSS.
Its core technology architecture is shown in the following figure.
As you can see from the figure, its core technology is realized with Spring Boot + Spark; a sketch of how such a service might submit a Spark task follows.
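To illustrate how a Spring Boot + Spark combination typically hangs together, the sketch below uses Spark's SparkLauncher API (which Spark provides for programmatic job submission) from a service method. The jar path, main class, and master URL are hypothetical, and Griffin's real submission code may be organized differently.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Illustrative only: how an API service might hand a quality task off to
// Spark. The jar path, main class, and master URL are hypothetical.
object MeasureSubmitter {
  def submit(measureJson: String): SparkAppHandle = {
    new SparkLauncher()
      .setMaster("yarn")                                    // assumed cluster manager
      .setDeployMode("cluster")
      .setAppResource("/opt/dq/measure-assembly.jar")       // hypothetical measure jar
      .setMainClass("com.example.dq.AccuracyMeasureSketch") // hypothetical entry point
      .addAppArgs(measureJson)                              // rule definition passed as an argument
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication(new SparkAppHandle.Listener {
        // Track task state so the platform can record success or failure.
        override def stateChanged(handle: SparkAppHandle): Unit =
          println(s"DQ task state: ${handle.getState}")
        override def infoChanged(handle: SparkAppHandle): Unit = ()
      })
  }
}
```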
4.2、 Qualitis
Qualitis is a data quality monitoring platform that supports multiple heterogeneous data sources, and is designed to solve various data quality problems encountered in the process of business system operation, data center construction and data governance.
The figure below shows the architecture diagram provided in the official Qualitis documentation at /WeBankFinTech/Qualitis/blob/master/docs/zh_CN/ch1/%E6%9E%B6%E6%9E%84%E8%AE%BE%E8%AE%A1%E6%96%87%E6%A1%#21-%E6%80%BB%E4%BD%93%E6%9E%B6%E6%9E%84%E8%AE%BE%E8%AE%A1.
From the architecture diagram, we can see that it also contains core modules such as quality rule configuration, quality task management, quality data collection, and quality data storage and analysis.
The official Qualitis site also provides an overall module design diagram, and it too corresponds neatly to the data quality collection process described earlier, as shown in the following figure.
You can see that the process of data quality collection is much the same no matter which open source data quality platform is used, and needs to include:
- Configuration and management of quality rules: mainly configuring and maintaining the rules.
- Scheduled execution: a timed job periodically executes the quality rules to capture the raw quality data.
- Quality data processing and analysis: the captured raw quality data is processed and then analyzed so that the quality rule configuration can be optimized, forming a closed loop, as shown below; a sketch of such a timed rule run follows.
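As a rough illustration of the second and third steps, the sketch below pairs a simple null-rate rule with a timed job. The rule model, table, and threshold are hypothetical; a real platform would persist rules and results in a database rather than printing them.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.SparkSession

// Illustrative only: a generic timed job that evaluates a null-rate rule
// and reports the raw quality data for downstream analysis.
case class QualityRule(table: String, column: String, maxNullRate: Double)

object TimedQualityJob {
  private val spark = SparkSession.builder()
    .appName("dq-timed-job")
    .enableHiveSupport()
    .getOrCreate()

  def runRule(rule: QualityRule): Unit = {
    val df       = spark.table(rule.table)
    val total    = df.count()
    val nulls    = df.filter(df.col(rule.column).isNull).count()
    val nullRate = if (total == 0) 0.0 else nulls.toDouble / total

    // Collection: record the raw quality data; downstream analysis decides
    // whether the rule threshold itself needs tuning (the closed loop).
    val passed = nullRate <= rule.maxNullRate
    println(s"${rule.table}.${rule.column} nullRate=$nullRate passed=$passed")
  }

  def main(args: Array[String]): Unit = {
    val rule      = QualityRule("dwd.orders_clean", "order_id", maxNullRate = 0.01)
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    // Timed job: evaluate the rule once an hour.
    scheduler.scheduleAtFixedRate(() => runRule(rule), 0, 1, TimeUnit.HOURS)
  }
}
```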
- The installation and deployment of Qualitis, as well as the storage of quality result data, are covered in the deployment instructions at /WeBankFinTech/Qualitis/blob/master/docs/zh_CN/ch1/%E5%BF%AB%E9%80%9F%E6%90%AD%E5%BB%BA%E6%89%8B%E5%86%8C%E2%80%94%E2%80%94HA%E7%89%.