
Reading Notes for Core Technologies and Applications for Data Asset Management - Chapter 3: Data Lineage


Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press. The book is divided into 10 chapters. Chapter 1 mainly helps readers recognize data assets, understand the basic concepts related to data assets, and learn how data assets have developed. Chapters 2 to 8 mainly introduce the core technologies involved in data asset management in the era of big data, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data rights and security, and data asset management architecture. Chapters 9 and 10 introduce the application practice of data asset management technology from a practical perspective, including how to manage metadata to realize the greater potential of data assets and how to model data to mine the greater value in the data.


Today, I'm mainly going to share Chapter 3 with you:

Chapter 3 is titled Data Lineage.

The content mind map is below:

1. Technical implementation of obtaining data lineage

1.1. How to get data lineage from Hive

Hive is a typical representative of the data warehouse and of offline, layered data model design in big data, and it supports data processing and computation through HQL. To make it easier for users to track their data, Hive's underlying design already takes lineage tracking into account. In the source code, Hive's own lineage is output mainly through a post-execution hook (the LineageLogger hook); the main process in the code is shown in the figure below. The lineage is output mainly as edges (the flow direction of the DAG graph) and vertices (the nodes of the DAG graph).

The source code defines the four types of SQL operations for which lineage is supported: QUERY (query), CREATETABLE_AS_SELECT (create a table from a SELECT query), ALTERVIEW_AS (modify a view), and CREATEVIEW (create a view).

When parsing and generating the edges (flows of the DAG graph) and vertices (nodes of the DAG graph), it first checks whether the type of the QueryPlan is one of the four supported SQL operation types; if it is not, no edges or vertices are parsed or generated.

LineageLogger itself is a hook provided by Hive; hooks are a mechanism for injecting custom code before or after the execution of a Hive task.
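To make the hook mechanism concrete, here is a minimal Scala sketch (not the book's implementation, and not the built-in LineageLogger itself) of a post-execution hook that only reacts to the four supported operation types and prints the query's input and output entities. It assumes the hive-exec dependency and Scala 2.13 are available; the class name SimpleLineageHook is made up.

```scala
import org.apache.hadoop.hive.ql.QueryPlan
import org.apache.hadoop.hive.ql.hooks.{ExecuteWithHookContext, HookContext}
import scala.jdk.CollectionConverters._

// A minimal post-execution hook sketch: it only reacts to the four operation
// types for which lineage parsing produces edges/vertices, and prints the
// input/output entities of the finished query.
class SimpleLineageHook extends ExecuteWithHookContext {

  private val SupportedOps =
    Set("QUERY", "CREATETABLE_AS_SELECT", "ALTERVIEW_AS", "CREATEVIEW")

  override def run(hookContext: HookContext): Unit = {
    val plan: QueryPlan = hookContext.getQueryPlan
    if (SupportedOps.contains(plan.getOperationName)) {
      val inputs  = plan.getInputs.asScala.map(_.getName)   // upstream tables/partitions
      val outputs = plan.getOutputs.asScala.map(_.getName)  // downstream tables/partitions
      println(s"[lineage] ${plan.getOperationName}: ${inputs.mkString(",")} -> ${outputs.mkString(",")}")
    }
  }
}
```

Such a hook would be registered through the hive.exec.post.hooks configuration property, which is also how Hive's built-in lineage hook is enabled.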

For the specific code implementation, you can refer directly to the paperback; it will not be repeated here.

1.2. Getting data lineage from Spark execution plans

The process by which Spark generates and processes the execution plan at the underlying level is shown in the following figure.

As can be seen in the figure:

1. When a SQL statement or DataFrame is executed, it first becomes an Unresolved Logical Plan, that is, a logical execution plan that has not yet been processed or analyzed; only some basic checks are done from the point of view of SQL syntax.

2. Then, by obtaining Catalog data, the table names and column names in the SQL statement to be executed are further analyzed and validated, so as to generate a resolved logical execution plan that can actually be run.

3. Spark's underlying optimizer then works out a better way to execute the operations, generating an optimized, best logical execution plan.

4. The finalized logical execution plan is converted into a physical execution plan, which is in turn converted into the final code that gets executed. (A small sketch of inspecting these four stages follows right after this list.)
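The following is a minimal, self-contained sketch of how these four stages can be inspected through a DataFrame's QueryExecution object; the table and query are made up purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PlanStagesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("plan-stages").master("local[*]").getOrCreate()
    import spark.implicits._

    // A made-up temporary table for illustration only.
    Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("demo_table")

    val qe = spark.sql("SELECT id, count(*) AS cnt FROM demo_table GROUP BY id").queryExecution

    println(qe.logical.treeString)       // 1. unresolved logical plan (syntax only)
    println(qe.analyzed.treeString)      // 2. resolved logical plan, validated against the Catalog
    println(qe.optimizedPlan.treeString) // 3. optimized logical plan
    println(qe.executedPlan.treeString)  // 4. physical execution plan
    spark.stop()
  }
}
```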

Spark's execution plan is, in effect, the plan for how data will be processed: it parses the SQL statement or DataFrame and combines it with the Catalog to generate the final code for data transformation and processing. You can therefore obtain the data transformation logic from Spark's execution plan and parse the data lineage from it. However, Spark's execution plans are produced automatically inside Spark itself, so how do you get the execution plan information for each Spark task? In fact, Spark's underlying layer has a Listener architecture, and you can use a Spark Listener to obtain a great deal of information about what Spark is executing underneath.

In the Spark source code, a Scala trait (similar to a Java interface) called QueryExecutionListener is provided to listen to the execution of tasks such as Spark SQL. It declares the two methods shown in the table below.

Method: def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit

Description: The method called when execution succeeds; the qe parameter carries the execution plan, which can be either a logical plan or a physical plan.

Method: def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit

Description: The method called when execution fails; it likewise carries the execution plan, which can be either a logical plan or a physical plan.

Therefore, you can use the QueryExecutionListener to have Spark actively push execution plan information to your own system or database while tasks are being executed, and then do further parsing, as follows.
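As a minimal sketch of this approach (the class name LineagePushListener and the sendToLineageSystem method are placeholders for your own transport, for example an HTTP call or a Kafka producer), a listener can be implemented and registered on the SparkSession like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// A minimal QueryExecutionListener sketch: on success it forwards the analyzed
// logical plan and the physical plan to a user-defined sink for lineage parsing.
class LineagePushListener extends QueryExecutionListener {

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    // Hypothetical sink: replace with a push to your own system, message queue, or database.
    sendToLineageSystem(qe.analyzed.treeString, qe.executedPlan.treeString)

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    // Failed executions usually do not contribute lineage; log and move on.
    println(s"[lineage] $funcName failed: ${exception.getMessage}")

  private def sendToLineageSystem(logicalPlan: String, physicalPlan: String): Unit =
    println(s"[lineage] logical plan:\n$logicalPlan\nphysical plan:\n$physicalPlan")
}

object RegisterListenerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-listener").master("local[*]").getOrCreate()
    spark.listenerManager.register(new LineagePushListener) // attach the listener to this session
    spark.sql("SELECT 1 AS id").collect()                   // triggers onSuccess
    spark.stop()
  }
}
```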

For specific code implementations, please refer to the paperback due to the large amount of code.

1.3. Getting data lineage from Spark SQL statements

In Spark task processing, offline ETL data processing through SQL statements is the most common big data application, as shown in the following figure. For this application scenario, you can directly obtain the SQL statements running in Spark, then parse those statements and combine them with the Catalog to work out the input and output tables contained in each SQL statement, and from that the data lineage between those tables.

With the idea of parsing lineage from SQL statements, the problem to solve is how to automatically capture the SQL statements running in Spark. As mentioned above, Spark has a Listener mechanism, and the QueryExecutionListener is just one kind of Listener; the relevant execution information can be obtained automatically through Listeners. Spark also provides the SparkListener abstract class at the underlying level so that upper-level code can listen to the relevant event messages throughout Spark's whole life cycle, as shown in the following figure.

The overall process of getting the SQL statements executed by Spark is summarized in the following figure.
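The full implementation is in the paperback; as a rough sketch of the idea (not the book's code), the listener below extends SparkListener and reacts to the event posted when a SQL execution starts. Note that SparkListenerSQLExecutionStart is a developer API, and whether its description field carries the full SQL text depends on how the statement was submitted (for example, the Spark Thrift Server sets the job description to the statement); otherwise only the call site and the plan description are available.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart

// A minimal SparkListener sketch that reacts to the start of each SQL execution.
class SqlCaptureListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionStart =>
      println(s"[sql-capture] executionId=${e.executionId}")
      println(s"[sql-capture] description=${e.description}")           // SQL text or call site, depending on submission path
      println(s"[sql-capture] physical plan=\n${e.physicalPlanDescription}")
    case _ => // ignore all other lifecycle events
  }
}

object SqlCaptureDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-capture").master("local[*]").getOrCreate()
    spark.sparkContext.addSparkListener(new SqlCaptureListener)
    spark.sql("SELECT 1 AS id").collect() // posts a SparkListenerSQLExecutionStart event
    spark.stop()
  }
}
```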

For specific code implementations, please refer to the paperback due to the large amount of code.

1.4. Getting data lineage from Flink

When Flink SQL is executed at the underlying level, it goes through roughly the following five steps; the underlying execution process is very similar to that of Spark SQL.

  • Step 1: Syntactically parse the SQL statement to be executed. Apache Calcite converts the SQL statement directly into an AST (Abstract Syntax Tree), which in Calcite is the SqlNode node tree. Calcite is an open-source SQL parsing tool; in its underlying technical implementation, Flink integrates Calcite as its SQL statement parser. Calcite's GitHub address is /apache/calcite/, and more information can be found in the official documentation at /docs/. (A minimal parsing sketch follows after this list.)
  • Step 2: Validate the syntax of the SQL statement against the queried metadata. Using the information in the SqlNode node tree obtained in step 1, the tables, fields, functions, etc. contained in the SQL statement are identified and then compared with the metadata to judge whether those tables, fields, and so on actually exist.
  • Step 3: Further parse the SqlNode node tree in combination with the metadata information to obtain a relational expression tree and generate a preliminary logical execution plan.
  • Step 4: Further optimize the logical execution plan generated in step 3 to obtain the optimal logical execution plan; the result of this step is still a relational expression tree.
  • Step 5: Transform the logical execution plan into a physical execution plan and submit it to the Flink cluster for execution.
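As a minimal illustration of step 1 (calling Apache Calcite directly rather than going through Flink, and with a made-up SQL statement), the following sketch parses a statement into its SqlNode tree; it assumes the calcite-core dependency is on the classpath.

```scala
import org.apache.calcite.sql.parser.SqlParser

object CalciteParseDemo {
  def main(args: Array[String]): Unit = {
    // A made-up statement for illustration only.
    val sql = "SELECT o.id, u.name FROM orders AS o JOIN users AS u ON o.user_id = u.id"

    // Step 1 of the Flink SQL flow: turn the SQL text into a SqlNode AST.
    val parser  = SqlParser.create(sql)
    val sqlNode = parser.parseQuery()

    println(sqlNode.getKind)   // e.g. SELECT
    println(sqlNode.toString)  // the statement re-rendered from the AST
  }
}
```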

For specific code implementations, please refer to the paperback due to the large amount of code.

1.5. Obtaining data lineage from the orchestration system of data tasks

A data task orchestration system usually orchestrates the running order of upstream and downstream tasks as well as their dependencies for different data node types, as shown in the following figure.
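As a hypothetical sketch of the idea (the TaskMeta structure and the table names are made up, not taken from the book or any specific orchestration system): if the orchestration system can report, for each task, the tables it reads and the tables it writes, then table-level lineage edges can be derived directly from that metadata.

```scala
// Hypothetical task metadata as an orchestration system might expose it:
// each task declares the tables it reads and the tables it writes.
final case class TaskMeta(taskName: String, inputTables: Seq[String], outputTables: Seq[String])

object OrchestrationLineageDemo {
  def main(args: Array[String]): Unit = {
    val tasks = Seq(
      TaskMeta("ods_load",  Seq("src.orders"),         Seq("ods.orders")),
      TaskMeta("dwd_clean", Seq("ods.orders"),         Seq("dwd.orders_cleaned")),
      TaskMeta("dws_agg",   Seq("dwd.orders_cleaned"), Seq("dws.orders_daily"))
    )

    // Derive table-level lineage edges (upstream table -> downstream table) from task metadata.
    val edges = for {
      task   <- tasks
      input  <- task.inputTables
      output <- task.outputTables
    } yield (input, output, task.taskName)

    edges.foreach { case (from, to, task) => println(s"$from -> $to (via $task)") }
  }
}
```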

For specific code implementations, please refer to the paperback due to the large amount of code.

2. Storage model and display design for data lineage

From an architectural design perspective, the storage of lineage data needs to meet the following requirements.

  • Extensibility: Support extending the sources of data lineage. For example, only Hive, Spark execution plans, Spark SQL, etc. may need to be supported at present, but other sources of data lineage may be added in the future. When new lineage sources appear, they should be supported without changes to the existing design.
  • Traceability: Changes to data lineage need to be recorded so that they can be traced later. For example, when a piece of lineage changes, the change process should be recorded, rather than simply replacing the lineage that has already been collected and stored.
  • Maintainability: Manual maintenance must be supported. For example, it should be possible to maintain data lineage by hand, and to correct it manually when automatically collected lineage is wrong. (A minimal storage-model sketch follows after this list.)
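To make the three requirements above concrete, the following is a minimal sketch of one possible storage model (the field names are illustrative, not the book's schema): the sourceType field keeps the model open to new lineage sources, the version and validFrom fields record changes instead of overwriting them, and manuallyMaintained flags records that were added or corrected by hand.

```scala
import java.time.Instant

// Illustrative lineage storage model (not the book's schema).
final case class LineageNode(
  nodeId: String,   // e.g. "hive.dwd.orders_cleaned" or "hive.dwd.orders_cleaned.amount"
  nodeType: String, // "TABLE" or "COLUMN"
  name: String
)

final case class LineageEdge(
  fromNodeId: String,
  toNodeId: String,
  sourceType: String,         // "HIVE_HOOK", "SPARK_PLAN", "SPARK_SQL", "FLINK_SQL", "MANUAL", ...
  version: Long,              // incremented on every change to this edge (traceability)
  validFrom: Instant,         // when this version of the edge became effective
  manuallyMaintained: Boolean // true if the edge was added or corrected by hand (maintainability)
)
```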

The process of collecting and processing data lineage is usually as shown in the figure below: the raw data is captured in real time and sent to a message queue such as Kafka; the raw data is then parsed to generate lineage data, which is finally written to storage.
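As a rough sketch of the middle of that pipeline (the broker address, topic name, and the parsing step are placeholders), a consumer could poll the raw events from Kafka and hand them to a lineage parser before storage; it assumes the kafka-clients dependency and Scala 2.13.

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object LineageConsumerDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
    props.put("group.id", "lineage-parser")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("raw-lineage-events")) // placeholder topic

    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1))
      for (record <- records.asScala) {
        // Placeholder for the real parsing step: turn a raw event (SQL text,
        // execution plan, hook output, ...) into lineage edges, then store them.
        println(s"parse and store lineage from: ${record.value()}")
      }
    }
  }
}
```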

After the lineage data has been parsed and written to the database, the data lineage can be displayed. A reference design for displaying data lineage is shown in the figure below; in general, the following points need attention:

  • Support table-level lineage and display it by default; when a single table is selected, you can also click to view that table's upstream or downstream lineage.
  • Support field-level lineage in the details view: when you click on a table's field details in the figure, the display can be expanded further to show field-level lineage.
  • Within the lineage view, support clicking through to the parsed source of the lineage, such as the Spark SQL statement, the Spark execution plan, and so on.

From the figure below, you can see a display of the lineage between tables and fields. When a table changes, it is easy to know which downstream or upstream tables and fields are affected, which speeds up the handling and localization of many problems. When using the data in a table, you can also trace back to the table's original source tables and the intermediate tables it was processed through. The data link becomes very clear, which is a great help to the users of the data.