Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press. The book is divided into 10 chapters. Chapter 1 introduces readers to data assets, the basic concepts related to them, and how data assets have developed. Chapters 2 to 8 cover the core technologies involved in data asset management in the era of big data, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data permissions and security, and data asset management architecture. Chapters 9 and 10 present the application of data asset management technology from a practical perspective, including how to manage metadata to realize the full potential of data assets and how to model data to mine greater value from it.
Today, I'm mainly going to share Chapter 2 with you:
Chapter 2 is titled "Metadata Collection and Storage".
It is mainly about how to collect and obtain metadata from common data warehouses, data lakes, and relational databases, such as Apache Hive, Delta Lake, Apache Hudi, Apache Iceberg, and MySQL.
1. Metadata Collection in Hive
1.1 Metadata Collection Based on Hive Meta DB
Hive stores its metadata separately from its data in a dedicated database, so from a technical implementation point of view the required metadata can certainly be obtained directly from the database in which Hive stores it. The type of database used for metadata storage can be specified by the user at deployment time; MySQL, SQL Server, Derby, PostgreSQL, Oracle, and others are commonly supported. The relationships between the common key tables in the Hive metadata database are shown in the figure below; based on these relationships, we can query the required metadata with SQL statements (a sample query is sketched after the table descriptions that follow).
The relevant tables are described below:
DBS: stores basic information about the databases in Hive.
DATABASE_PARAMS: stores the parameters of the databases in Hive.
TBLS: stores basic information about the data tables in the Hive databases.
COLUMNS_V2: stores information about the fields (columns) of the data tables.
TABLE_PARAMS: stores the parameters or attributes of the data tables in Hive.
TBL_PRIVS: stores authorization information for tables or views.
SERDES: stores configuration information related to data serialization.
SERDE_PARAMS: stores the attributes or parameters of data serialization.
SDS: stores information about how the data files of the data tables are stored.
SD_PARAMS: stores the storage-related attributes or parameters of the data tables.
PARTITIONS: stores information about the partitions of the data tables.
PARTITION_KEYS: stores information about the partition fields of the data tables.
PARTITION_PARAMS: stores the attributes or parameters of the partitions.
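As a rough illustration of this approach, the sketch below joins several of these tables over JDBC to pull database, table, column, and storage-location information. It is only a sketch under assumptions: the connection URL, credentials, and the metastore database name (hive_meta) are placeholders, and the MySQL JDBC driver is assumed to be on the classpath.

```scala
import java.sql.DriverManager

// Sketch: query table and column metadata directly from a MySQL-backed
// Hive metastore. Host, credentials and schema name are placeholders.
object HiveMetaDbReader {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://metastore-host:3306/hive_meta", "user", "password")
    val sql =
      """SELECT d.NAME AS db_name, t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME, s.LOCATION
        |FROM DBS d
        |JOIN TBLS t ON t.DB_ID = d.DB_ID
        |JOIN SDS s ON t.SD_ID = s.SD_ID
        |JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
        |ORDER BY d.NAME, t.TBL_NAME, c.INTEGER_IDX""".stripMargin
    val rs = conn.createStatement().executeQuery(sql)
    while (rs.next()) {
      // Print one line per column: db.table column:type @ storage location
      println(s"${rs.getString("db_name")}.${rs.getString("TBL_NAME")} " +
        s"${rs.getString("COLUMN_NAME")}:${rs.getString("TYPE_NAME")} @ ${rs.getString("LOCATION")}")
    }
    conn.close()
  }
}
```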
1.2 Metadata Collection Based on Hive Catalog
Hive Catalog is an important component of Hive dedicated to metadata management. It manages the structure, storage location, partitions, and other related information of all Hive databases and tables, and it provides a RESTful API and a client package through which users can query or modify metadata. Its core JAR package defines the interface that abstracts Hive metadata management; the detailed code implementation can be found in the print edition of the book.
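As a minimal sketch of this client-based approach (not the book's own code), the HiveMetaStoreClient shipped with Hive can enumerate databases and tables through the metastore Thrift service; the Thrift URI below is a placeholder.

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import scala.jdk.CollectionConverters._

// Sketch: list databases, tables and their storage locations through the
// Hive metastore client. The Thrift URI is a placeholder.
object HiveCatalogReader {
  def main(args: Array[String]): Unit = {
    val conf = new HiveConf()
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083")
    val client = new HiveMetaStoreClient(conf)
    for (db <- client.getAllDatabases.asScala; tblName <- client.getAllTables(db).asScala) {
      val tbl = client.getTable(db, tblName)
      println(s"$db.$tblName -> ${tbl.getSd.getLocation}")
    }
    client.close()
  }
}
```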
Before version 2.2.0, Hive also provided the HCatalog REST API as an external way to access the Hive Catalog (this feature was removed in Hive 2.2.0). The REST API addresses take the form http://yourserver/templeton/v1/resource. The page "WebHCat Reference - Apache Hive - Apache Software Foundation" on the Hive wiki lists in detail which interfaces the REST API supports, as follows:
For example, calling the REST API endpoint http://yourserver/templeton/v1/ddl/database returns information about all the databases in the Catalog, as follows:
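A quick way to try such an endpoint is a plain HTTP GET. The sketch below uses the JDK 11 HttpClient; the host name is a placeholder, and the user.name query parameter is an assumption based on WebHCat's usual requirement of a user name.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch: call the WebHCat endpoint that lists databases and print the JSON body.
object WebHCatExample {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(
        URI.create("http://yourserver/templeton/v1/ddl/database?user.name=hive"))
      .GET()
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // e.g. a JSON list of database names
  }
}
```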
1.3 Metadata Collection Based on Spark Catalog
Spark is a distributed big data computing framework. The biggest difference between Spark and Hadoop is that Spark performs its computation mainly in memory, so its computational performance is much higher than Hadoop's, which has made it a favorite of big data developers. Spark provides APIs in several development languages, including Java, Scala, Python, and R.
Spark Catalog is a metadata management component provided by Spark. It is dedicated to reading and storing Spark's metadata and manages the metadata of all data sources that Spark supports. Spark Catalog maps the data tables in external data sources to tables in Spark, so we can also collect the metadata we need through Spark Catalog.
The original Catalog API provides some common metadata queries and operations, but it is not comprehensive, powerful, or flexible enough; for example, it cannot support multiple catalogs. The Catalog Plugin, introduced in Spark 3.0, came into being precisely to solve these problems.
The detailed code implementation can be found in the print edition of the book.
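As a minimal sketch (not the book's code) of the basic Spark Catalog API, the snippet below enumerates databases, tables, and columns; it assumes a session with Hive support configured, and the database and table names ("default", "events") are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enumerate databases, tables and columns through the Spark Catalog API.
object SparkCatalogReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-catalog-metadata")
      .enableHiveSupport() // read metadata from the configured Hive metastore
      .getOrCreate()

    spark.catalog.listDatabases().show(truncate = false)
    spark.catalog.listTables("default").show(truncate = false)
    spark.catalog.listColumns("default", "events").show(truncate = false)

    spark.stop()
  }
}
```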
2. Metadata Collection in Delta Lake
When it comes to Delta Lake, we have to mention the concept of a data lake, because Delta Lake is a kind of data lake. A data lake is a centralized storage concept, in contrast to a data warehouse. Whereas a data warehouse mainly stores structured data, a data lake can store structured data (generally data organized in rows and columns), semi-structured data (e.g., logs, XML, JSON), unstructured data (e.g., Word documents, PDFs), and binary data (e.g., video, audio, images). Generally speaking, a data lake stores raw data, while a data warehouse stores structured data that has been processed from the raw data.
Delta Lake is an open source data lake project. It makes it possible to build a lakehouse architecture on top of a data lake, provides support for ACID data transactions and scalable metadata processing, and relies on Spark underneath for both streaming and batch data processing.
The key features of Delta Lake are listed below:
ACID data transactions on top of Spark, with a serializable transaction isolation level that ensures read and write consistency.
Leveraging Spark's distributed, scalable processing capabilities, it can process and store data beyond the petabyte scale.
Data version control, including support for data rollback and a complete audit trail of historical versions.
Support for high-performance row-level Merge, Insert, Update, and Delete operations, which Hive cannot provide.
Parquet files are used as the data storage format, and a transaction log file records the data change process; the log format is JSON, as follows:
2.1 Capturing metadata based on Delta Lake's own design
Delta Lake manages its metadata itself and does not usually rely on a third-party external metadata component such as the Hive Metastore. In Delta Lake, metadata is stored together with the data in its own file system directory, all operations on the data are abstracted into corresponding Action operations, and table metadata is implemented by subclasses of Action. The structure of the Delta Lake source code (GitHub address: /delta-io/delta) is shown below:
Method calls for metadata are provided in this implementation class; the detailed code implementation can be found in the print edition of the book.
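As a minimal sketch of reading this self-managed metadata without touching Delta's internal classes, the Delta SQL commands DESCRIBE DETAIL and DESCRIBE HISTORY expose the table-level metadata and the change history reconstructed from the transaction log. The session configuration and the table path below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: inspect a Delta table's metadata and transaction history via Delta SQL.
object DeltaMetadataExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-metadata")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Table-level metadata: format, location, schema, partition columns, sizes, ...
    spark.sql("DESCRIBE DETAIL delta.`/data/delta/events`").show(truncate = false)
    // Change history derived from the Action records in the _delta_log directory
    spark.sql("DESCRIBE HISTORY delta.`/data/delta/events`").show(truncate = false)

    spark.stop()
  }
}
```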
2.2 Capturing metadata based on Spark Catalog
Since Delta Lake supports reading and writing data with Spark, its source code also implements the CatalogPlugin interface provided by Spark, so Delta Lake's metadata can also be obtained directly through the Spark Catalog; the detailed code implementation can be found in the print edition of the book.
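A rough sketch of this route is shown below. It reuses the Spark session configured in the previous example (with DeltaCatalog registered as spark_catalog); the table definition is hypothetical and only serves to give the Catalog API something to list.

```scala
// Sketch: with the DeltaCatalog registered (see the previous example),
// metastore-backed Delta tables can be inspected through the Spark Catalog API.
spark.sql(
  "CREATE TABLE IF NOT EXISTS default.events (id BIGINT, ts TIMESTAMP, payload STRING) USING delta")

spark.catalog.listTables("default").show(truncate = false)
spark.catalog.listColumns("default", "events").show(truncate = false)
```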
3. Metadata Collection in MySQL
MySQL is a widely used relational database, and the MySQL database system ships with the information_schema database to provide access to MySQL metadata. INFORMATION_SCHEMA is a built-in database in every MySQL instance that stores information about all the other databases maintained by the MySQL server. The tables in INFORMATION_SCHEMA are actually read-only views rather than real base tables, so INSERT, UPDATE, and DELETE operations cannot be performed on them; there are no data files associated with INFORMATION_SCHEMA, there is no database directory with that name, and triggers cannot be created on it.
The key metadata-related tables in information_schema are as follows (a sample query is sketched after the list):
TABLES table: provides information about the tables and views in the databases.
COLUMNS table: provides information about the columns of the tables in the databases.
VIEWS table: provides information about the views in the databases.
PARTITIONS table: provides information about the partitions of the data tables in the databases.
FILES table: provides information about the files in which MySQL tablespace data is stored.
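As a small sketch of this, the query below reads column-level metadata from information_schema over JDBC; the host, credentials, and the target schema name (sales_db) are placeholders.

```scala
import java.sql.DriverManager

// Sketch: query column-level metadata from MySQL's information_schema.
object MySqlMetadataReader {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://mysql-host:3306/information_schema", "user", "password")
    val rs = conn.createStatement().executeQuery(
      """SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE, COLUMN_COMMENT
        |FROM COLUMNS
        |WHERE TABLE_SCHEMA = 'sales_db'
        |ORDER BY TABLE_NAME, ORDINAL_POSITION""".stripMargin)
    while (rs.next()) {
      println(s"${rs.getString("TABLE_NAME")}.${rs.getString("COLUMN_NAME")}: " +
        s"${rs.getString("DATA_TYPE")} nullable=${rs.getString("IS_NULLABLE")}")
    }
    conn.close()
  }
}
```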
4. Metadata Collection in Apache Hudi
Like Delta Lake, Hudi is an open source data lake project on which a lakehouse data architecture can also be built. You can reach its official home page by visiting the URL: Apache Hudi | An Open Source Data Lake Platform.
The main features of Hudi are as follows:
Supports tables, transactions, and fast Insert, Update, and Delete operations.
Supports indexing and high data storage compression ratios, and supports common open source file storage formats.
Supports distributed streaming data processing based on Spark and Flink.
Supports Apache Spark, Flink, Presto, Trino, Hive, and other SQL query engines.
4.1 Capturing metadata based on Spark Catalog
Since Hudi supports reading and writing data with Spark, its source code, like Delta Lake's, also implements the CatalogPlugin interface provided by Spark, so Hudi's metadata can also be obtained directly through the Spark Catalog; the detailed code implementation can be found in the print edition of the book.
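A minimal sketch of this route is below (not the book's code). The extension and catalog class names follow what recent Hudi releases document for Spark 3, and the database and table names are placeholders; exact class paths and option names may vary slightly by Hudi version.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: register Hudi's Spark session extension and catalog implementation,
// then browse table metadata through Spark SQL / the Catalog API.
object HudiCatalogExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-metadata")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
      .getOrCreate()

    spark.sql("SHOW TABLES IN default").show(truncate = false)
    spark.catalog.listColumns("default", "events").show(truncate = false)

    spark.stop()
  }
}
```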
4.2 Hudi Timeline Meta Server
A data lake usually manages metadata by tracking the data files in the lake; whether it is Delta Lake or Hudi, the underlying layer extracts metadata by tracking file operations. In Hudi, the handling of metadata is very similar to Delta Lake's implementation: the underlying operations are likewise abstracted into corresponding Action operations, and only the Action types differ slightly.
The reason data lakes cannot directly use the Hive Metastore to manage their metadata is that the Hive Metastore cannot provide the data-tracking capabilities a data lake requires. Because a data lake manages files at a very fine granularity, it needs to record and track which file operations are additions and which are invalidations, which data was added and which was updated, and it needs atomic transactionality to support operations such as rollback. To manage metadata well and record the process of data changes, Hudi designed the Timeline Meta Server. The Timeline records a log of all operations performed on a table at different moments, which helps provide an up-to-date view of the table.
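As a small sketch of reading the Timeline programmatically (assuming the hudi-common classes are on the classpath; the table base path is a placeholder, and method names can differ slightly between Hudi versions), HoodieTableMetaClient exposes the completed instants and their actions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.table.HoodieTableMetaClient
import scala.jdk.CollectionConverters._

// Sketch: print the completed commit instants recorded in a Hudi table's Timeline.
object HudiTimelineReader {
  def main(args: Array[String]): Unit = {
    val metaClient = HoodieTableMetaClient.builder()
      .setConf(new Configuration())
      .setBasePath("/data/hudi/events")
      .build()

    val timeline = metaClient.getActiveTimeline.getCommitsTimeline.filterCompletedInstants()
    timeline.getInstants.iterator().asScala.foreach { instant =>
      // Each instant carries a timestamp, an action type (commit, clean, ...) and a state
      println(s"${instant.getTimestamp} ${instant.getAction} ${instant.getState}")
    }
  }
}
```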
Hudi also abstracts the concept of a Marker, which, as the name suggests, is a flag. A data write operation may fail before it completes, leaving a partial or corrupted data file in storage, and markers are used to track and clean up such failed writes. When a write operation starts, a marker is created to indicate that a file write is in progress; the marker is deleted when the write commits successfully. If the write fails midway, the marker is left behind, indicating that the written file is incomplete. Markers serve two main purposes:
Removing duplicate and partial data files: markers make it possible to efficiently identify partially written data files, which contain duplicate records compared with the data files written successfully later; these partial files are cleaned up when the commit completes.
Rolling back failed commits: if a write operation fails, the next write request rolls back the failed commit before continuing with a new write. The rollback is done with the help of markers, which identify the partially written data files belonging to the failed commit.
Without markers to track the files of each commit, Hudi would have to list all the files in the file system, correlate them with the files recorded in the Timeline, and then delete the files belonging to failed writes; in a distributed storage system at the scale Hudi handles, this would carry a very high performance cost.
4.3 Capturing Metadata Based on Hive Meta DB
Although Hudi stores and manages its metadata through the Timeline, it was also designed to support synchronizing its own metadata to the Hive Metastore; in effect, the metadata in Hudi's Timeline is asynchronously updated into the Hive Metastore.
Hudi's source code defines an interface abstraction for synchronizing metadata to third-party external metadata stores such as the Hive Meta DB; the detailed code implementation can be found in the print edition of the book.
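From the user's side, this synchronization is usually switched on through the hoodie.datasource.hive_sync.* write options. The sketch below is an assumption-laden illustration rather than the book's code: the DataFrame, field names, metastore URI, and paths are placeholders, and option names may vary a little across Hudi versions.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: write a Hudi table and sync its metadata to the Hive Metastore (HMS mode).
def writeWithHiveSync(df: DataFrame): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://metastore-host:9083")
    .option("hoodie.datasource.hive_sync.database", "default")
    .option("hoodie.datasource.hive_sync.table", "events")
    .mode(SaveMode.Append)
    .save("/data/hudi/events")
}
```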
5. Metadata Collection in Apache Iceberg
Apache Iceberg is also an open source data lake project. Its emergence has further promoted the development of data lakes and the lakehouse architecture and has made data lake technology richer. You can reach its official home page by visiting the URL: Apache Iceberg - Apache Iceberg.
The key features of Iceberg are listed below:
Supports Apache Spark, Flink, Presto, Trino, Hive, Impala, and many other SQL query engines.
Supports more flexible SQL statements for merging, updating, and deleting data in the data lake.
Changes to the data schema, such as adding new columns or renaming columns, are well supported.
Supports fast data queries: queries can quickly skip unnecessary partitions and files to locate the data matching the specified conditions, and in Iceberg a single table can support fast queries over petabyte-scale data.
The data store supports time-based version control and rollback, allowing snapshots of the data to be queried by time or by version.
Data is stored with compression support out of the box, which effectively reduces storage costs.
5.1 Metadata design for Iceberg
Because Hive determines the state of a table in the data warehouse by directly listing the underlying data files, modifications to table data cannot be made atomically, so transactions and rollback cannot be supported, and a failed write may produce inaccurate results. Iceberg circumvents these shortcomings of the Hive data warehouse by adding a metadata layer to its underlying architectural design, as shown in the figure below. As the figure shows, Iceberg persists data with a two-layer design: a metadata layer and a data layer. The data layer stores the actual data files, in formats such as Apache Parquet, Avro, or ORC. In the metadata layer, it is possible to track effectively which files and folders are deleted during data operations and then, when scanning the data file statistics, to determine whether a file needs to be read for a particular query, thereby speeding up queries. The metadata layer typically contains the following:
Metadata file: usually stores the table schema, partition information, and details of the table snapshots.
Manifest list file: stores the information of all manifest files as an index of the manifest files in a snapshot, and usually contains other details as well, such as how many data files were added or deleted and what the partition boundaries are.
Manifest file: stores the list of data files (e.g., data stored in Parquet/ORC/Avro format), along with column-level metrics and statistics for each data file.
5.2 Capturing Metadata via the Spark Catalog
Like Hudi and Delta Lake, Iceberg supports reading and writing data with Spark, so Iceberg's underlying design also implements the CatalogPlugin interface provided by Spark, and Iceberg's metadata can therefore be obtained directly through the Spark Catalog; the detailed code implementation can be found in the print edition of the book.
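A small sketch of that route is below, registering a Hadoop-type Iceberg catalog and reading one of Iceberg's built-in metadata tables; the catalog name (lake), warehouse path, and database/table names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: register an Iceberg catalog in Spark and query its metadata tables.
object IcebergSparkCatalogExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iceberg-metadata")
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.lake.type", "hadoop")
      .config("spark.sql.catalog.lake.warehouse", "/data/iceberg/warehouse")
      .getOrCreate()

    spark.sql("SHOW TABLES IN lake.db").show(truncate = false)
    // Iceberg exposes metadata tables such as snapshots, manifests, files and history
    spark.sql("SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots")
      .show(truncate = false)

    spark.stop()
  }
}
```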
5.3 Getting Metadata via the Iceberg Java API
Iceberg provides a Java API for obtaining table metadata. You can find the details of the Java API by visiting the official URL: Java API - Apache Iceberg, as follows:
As the figure shows, the Java API can be used to obtain the schema, properties, storage path, snapshots, and much other metadata of an Iceberg data table.
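For example, a minimal sketch of calling the Java API from Scala might look like the following; it assumes a Hadoop catalog, and the warehouse path and table identifier are placeholders rather than anything from the book.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hadoop.HadoopCatalog
import scala.jdk.CollectionConverters._

// Sketch: load an Iceberg table through a Hadoop catalog and print its metadata.
object IcebergJavaApiExample {
  def main(args: Array[String]): Unit = {
    val catalog = new HadoopCatalog(new Configuration(), "/data/iceberg/warehouse")
    val table = catalog.loadTable(TableIdentifier.of("db", "events"))

    println(table.schema())     // column names and types
    println(table.spec())       // partition spec
    println(table.location())   // storage path
    println(table.properties()) // table properties
    table.snapshots().asScala.foreach { s =>
      println(s"snapshot=${s.snapshotId()} at=${s.timestampMillis()} op=${s.operation()}")
    }
  }
}
```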