Common open-source data lake storage formats include Hudi, Delta Lake, and Iceberg.
- Distributed database storage: distributed databases are generally used to store data with high real-time query requirements or data that requires real-time OLAP analysis. Common open-source distributed databases include Apache Doris and Apache Druid; you can learn more about each via its official website.
Based on the above analysis, the data storage layer is usually designed around the popular lakehouse architecture, supplemented with distributed databases or relational databases for special business scenarios, as shown in the figure below.
When the data storage layer stores data, the data is usually also tiered. A typical implementation of the data tiering architecture is shown in the following figure. The main purposes of data tiering are:
- Modular design of data, to achieve decoupling: through layering, very complex data can be decomposed into a number of independent blocks, with each layer completing a specific processing step. This makes the data easier to develop and maintain, and allows it to be reused more effectively.
- Make the data more scalable: when business requirements change, only the processing logic of the corresponding data layer needs to be adjusted, avoiding recomputing everything from the original data (that is, the data in the ODS layer in the figure) and saving development and computation costs.
- Make data queries faster: with massive data volumes, querying business results directly from the original data (that is, the ODS layer in the figure) would scan a very large amount of data. Data layering optimizes the query path and reduces the data scanned, thereby improving query performance.
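The layered flow described above can be sketched with a minimal example. Here an in-memory SQLite database stands in for the warehouse engine, and the table and column names (`ods_orders`, `dws_daily_sales`) are illustrative assumptions, not names from the architecture diagram:

```python
import sqlite3

# In-memory SQLite stands in for the warehouse engine; table and column
# names are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# ODS layer: raw order events, loaded as-is from the source system.
conn.execute("CREATE TABLE ods_orders (order_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ods_orders VALUES (?, ?, ?)",
    [(1, "2024-06-01", 10.0), (2, "2024-06-01", 25.0), (3, "2024-06-02", 7.5)],
)

# DWS layer: a light daily summary derived from ODS. If the business
# logic changes, only this layer is rebuilt; ODS is not re-ingested.
conn.execute(
    """CREATE TABLE dws_daily_sales AS
       SELECT order_date, COUNT(*) AS order_cnt, SUM(amount) AS total_amount
       FROM ods_orders GROUP BY order_date"""
)

# Downstream queries scan the small summary instead of the raw events.
rows = conn.execute("SELECT * FROM dws_daily_sales ORDER BY order_date").fetchall()
print(rows)  # [('2024-06-01', 2, 35.0), ('2024-06-02', 1, 7.5)]
```

The same pattern applies at warehouse scale: each layer is a materialized transformation of the layer below it, so queries and changes touch the smallest layer that can answer them.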
1.4 Data management
The data management layer is mainly responsible for classifying, identifying, and managing data. It mainly includes metadata management, data lineage tracking, data quality management, data permissions and security management, and data monitoring and alerting. Its overall architecture is shown in the following figure.
The technical core of the data management system is the collection and acquisition of metadata, lineage data, quality data, monitoring data, and so on, which we described in detail in the earlier chapters of Core Technologies and Applications for Data Asset Management, published by Tsinghua University Press. After obtaining these data, the main function the data management system implements is to integrate them and display them on the data asset management platform; the data management system is the core of data asset management.
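The integration step mentioned above can be sketched as merging the separately collected feeds into one record per data asset. This is a minimal illustration, not the book's implementation; the record fields and feed names are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical unified record the management layer assembles per data
# asset; the field names are illustrative, not from any specific tool.
@dataclass
class AssetRecord:
    name: str
    metadata: dict = field(default_factory=dict)   # schema, owner, ...
    lineage: list = field(default_factory=list)    # upstream tables
    quality: dict = field(default_factory=dict)    # quality-rule results

def integrate(name, metadata_feed, lineage_feed, quality_feed):
    """Merge the separately collected feeds into one record for the
    data asset management platform to display."""
    return AssetRecord(
        name=name,
        metadata=metadata_feed.get(name, {}),
        lineage=lineage_feed.get(name, []),
        quality=quality_feed.get(name, {}),
    )

record = integrate(
    "dws_daily_sales",
    metadata_feed={"dws_daily_sales": {"owner": "bi-team", "columns": 3}},
    lineage_feed={"dws_daily_sales": ["ods_orders"]},
    quality_feed={"dws_daily_sales": {"null_check": "pass"}},
)
print(record.lineage)  # ['ods_orders']
```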
1.5 Data analysis layer
The architectural design of the data analytics layer mainly includes the following two parts:
- The choice of data analysis tools: with the development of big data analysis technology, many BI tools for data analysis have emerged. Common BI analysis tools are introduced in the following table:
| BI tool name | Description | Applicable scenarios |
| --- | --- | --- |
| Power BI | A BI data analysis tool launched by Microsoft | Higher cost; usually suitable for use with Microsoft cloud-related services |
| Pentaho | An open-source BI analytics tool with data integration, report generation, and data visualization | Open-source product; suitable for teams with their own deployment and operations capabilities |
| Quick BI | A BI data analysis tool launched by Alibaba Cloud | Since it is launched by Alibaba Cloud, it is usually only suitable for use within Alibaba Cloud |
| FineBI | A BI data analysis tool launched by FanRuan | Commercial software that generally needs to be purchased; usually suitable for government or enterprise use |
When choosing a BI data analysis tool, it is generally recommended to weigh business needs, usage cost, management and maintenance cost, and other factors before selecting the most appropriate tool.
- Data pre-processing: this mainly refers to the pre-processing that data analysis requires, so that the analysis tools can quickly obtain the data they want. With massive data volumes, BI analysis tools usually do not analyze the raw data directly.
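The pre-processing step described above usually means rolling raw rows up into small summaries before a BI tool ever touches them. A stdlib-only sketch under assumed field names (`region`, `amount`):

```python
from collections import defaultdict

# Raw event rows as they might land in the ODS layer (illustrative data).
raw_events = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 50.0},
]

def preaggregate(events):
    """Roll raw events up by region so the BI tool scans a handful of
    summary rows instead of the full event stream."""
    totals = defaultdict(float)
    for e in events:
        totals[e["region"]] += e["amount"]
    return dict(totals)

summary = preaggregate(raw_events)
print(summary)  # {'north': 170.0, 'south': 80.0}
```

In production this aggregation would run as a scheduled offline job, with the summary written to a table the BI tool queries.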
Based on the above two points, the overall architectural design of the data analysis layer is usually as shown below.
- For data with high real-time requirements, the data is usually stored in a distributed database without much pre-processing, so that the BI tool can query and access it directly, ensuring real-time performance across the whole analysis chain.
- For data with low real-time requirements, offline processing can be done every day: data is pre-processed offline from the data warehouse or data lake, and the resulting data can, depending on its volume, be placed into an ordinary relational database or into the ADS application layer of the data warehouse or data lake for BI tools to analyze. The DWD detail layer or DWS light summary layer of the data lake or data warehouse can even be opened directly to BI analysis tools.
1.6 Data service layer
The data service layer usually exposes data as services to the outside world, so that the data can serve the business, and it is responsible for managing those services. Its typical architecture is shown in the figure below; for the specific technical implementation details of data services, refer to Chapter 6 of Core Technologies and Applications for Data Asset Management, published by Tsinghua University Press.
When designing the data service layer, it usually needs to include modules for service creation, service publishing, service access, service degradation, service circuit breaking, service monitoring, and permission management. For service access permissions, role-based access control (RBAC) is generally recommended, as shown in the following figure.
- A role can be granted one or more different services and one or more different menus.
- Roles can be assigned to users or to the upstream business systems that invoke the services.
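The RBAC scheme above can be sketched in a few lines: roles bundle service and menu permissions, principals are granted roles, and an access check walks from principal to role to service. The role, user, and service names are illustrative assumptions:

```python
# Minimal RBAC sketch: roles bundle service and menu permissions, and
# users (or upstream callers) are granted roles. All names are illustrative.
ROLE_GRANTS = {
    "analyst": {"services": {"sales_api"}, "menus": {"dashboard"}},
    "admin": {"services": {"sales_api", "user_api"},
              "menus": {"dashboard", "settings"}},
}

# A principal may be an end user or an upstream system calling the service.
PRINCIPAL_ROLES = {"alice": {"analyst"}, "order-system": {"admin"}}

def can_call(principal, service):
    """True if any role granted to the principal includes the service."""
    return any(
        service in ROLE_GRANTS[role]["services"]
        for role in PRINCIPAL_ROLES.get(principal, ())
    )

print(can_call("alice", "sales_api"))  # True
print(can_call("alice", "user_api"))   # False
```

Because permissions attach to roles rather than to individual callers, revoking or extending access is a single change to the role definition.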
Having analyzed and designed the architecture of each layer, we arrive at the final data asset architecture diagram shown in the following figure. This is the most common architectural design in big data processing: it addresses data scalability and allows data of any type or format to be processed, stored, and analyzed.
Core Technologies and Applications for Data Asset Management is a book published by Tsinghua University Press. The book is divided into 10 chapters. Chapter 1 introduces readers to data assets, the basic concepts related to them, and their development. Chapters 2 to 8 introduce the core technologies involved in data asset management in the big data era, including metadata collection and storage, data lineage, data quality, data monitoring and alerting, data services, data permissions and security, and data asset management architecture. Chapters 9 and 10 introduce the application of data asset management technology from a practical perspective, including how to manage metadata to realize the greater potential of data assets and how to model data to mine greater value from it.