
[Mr. Zhao Yuqiang] Platform Architecture Based on Big Data Components

Popularity: 237 · Published 2024-09-28 12:37:17


  After understanding the components in each part of the big data ecosystem and their functional characteristics, you can use these components to build a big data platform that implements data storage and data computation. The following diagram shows the overall architecture of a big data platform.

[Figure: overall architecture of a big data platform]

  The video explanations are below:

Lambda Architecture for Big Data Platforms

Kappa Architecture for Big Data Platforms

  The overall architecture of a big data platform can be divided into five layers, namely: the data source layer, the data collection layer, the big data platform layer, the data warehouse layer, and the application layer.

I. Data source layer

  The data source layer is responsible for providing all of the required business data, such as user order data, transaction data, and system log data. In short, anything that can provide data can be called a data source. Although data sources come in many varieties, in a big data platform they fall into two categories: offline data sources and real-time data sources. As the names suggest, offline data sources feed big data offline (batch) computing, while real-time data sources feed big data real-time computing.

II. Data collection layer

  With data available from the underlying data source layer, ETL tools are needed to collect, transform, and load it. The Hadoop ecosystem provides such components: for example, Sqoop can exchange data between the big data platform and relational databases, and Flume can collect log data. Besides the components provided by the big data ecosystem itself, crawlers are also a typical data collection method. You can, of course, also use third-party data collection tools, such as DataX and CDC, to do the collection work.
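The extract-transform-load flow described above can be sketched in plain Python. This is purely illustrative, not Sqoop, Flume, or DataX; the table contents and field names are invented for the example.

```python
# A minimal ETL sketch: extract rows from a pretend source, transform them,
# and load them into an in-memory "target" store.

def extract():
    # Pretend these rows come from a relational database or a log file.
    return [
        {"order_id": 1, "amount": "19.90", "status": "PAID"},
        {"order_id": 2, "amount": "5.00",  "status": "CANCELLED"},
        {"order_id": 3, "amount": "42.50", "status": "PAID"},
    ]

def transform(rows):
    # Keep only paid orders and convert the amount to a number.
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows if r["status"] == "PAID"
    ]

def load(rows, target):
    # Load into the target store, keyed by order_id.
    for r in rows:
        target[r["order_id"]] = r
    return target

warehouse = load(transform(extract()), {})
print(warehouse)
```

A real collection tool does the same three steps at scale, with connectors for databases, log files, and message queues in place of these toy functions.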
  To address the coupling between the data source layer and the data collection layer, a data bus can be added between the two. The data bus is not mandatory; it is introduced only to reduce the coupling between layers during system architecture design.
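The decoupling role of the data bus can be sketched as a tiny in-memory publish/subscribe hub: producers publish to a topic without knowing who consumes it. This is only an illustration; in practice this role is typically played by a messaging system such as Kafka.

```python
# A minimal in-memory "data bus": the data source layer publishes messages,
# the collection layer subscribes, and neither side knows about the other.

from collections import defaultdict

class DataBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # A downstream layer registers interest in a topic.
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # An upstream layer emits a message; the bus fans it out.
        for handler in self.subscribers[topic]:
            handler(message)

bus = DataBus()
collected = []
bus.subscribe("orders", collected.append)   # collection layer
bus.publish("orders", {"order_id": 1})      # data source layer
```

Swapping either side (a new data source, or a second consumer) requires no change to the other, which is exactly the coupling reduction the data bus is meant to provide.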

III. Big data platform layer

  This is the core layer of the entire big data system, used to accomplish big data storage and computation. Since a big data platform can be seen as one way to implement a data warehouse, it can be further divided into an offline data warehouse and a real-time data warehouse. Each is described below.

3.1 Offline Data Warehouse Implementation Based on Big Data Technology

  After the underlying data collection layer delivers the data, it is usually stored in HDFS or HBase. The analysis and processing of offline data is then done by offline computing engines such as MapReduce, Spark Core, and Flink DataSet. To allow unified management and scheduling of the various computing engines on the platform, these engines can be run on top of Yarn; Java or Scala programs can then be written to analyze and process the data. To simplify application development, the big data ecosystem also supports processing data with SQL statements by providing a variety of data analysis engines. For example, Hive in the Hadoop ecosystem defaults to Hive on MapReduce: standard SQL written in Hive is converted by Hive's engine into MapReduce jobs, which in turn run on top of Yarn to process the big data. Besides Hive, common big data analysis engines include Spark SQL and Flink SQL.
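The map-shuffle-reduce flow that an engine such as Hive compiles SQL into can be sketched in plain Python (no Hadoop involved). The example below is conceptually similar to `SELECT word, COUNT(*) ... GROUP BY word`; the input lines are invented.

```python
# A toy illustration of the map -> shuffle -> reduce phases of MapReduce.

from collections import defaultdict

def map_phase(lines):
    # Map: emit (key, 1) for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data platform", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'platform': 1}
```

On a real cluster the map and reduce phases run as distributed tasks scheduled by Yarn, and the shuffle moves data between nodes, but the logical flow is the same.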

3.2 Real-Time Data Warehouse Implementation Based on Big Data Technology

  After the underlying data collection layer delivers real-time data, the collected data can be stored in the messaging system Kafka in order to persist it while ensuring its reliability; it is then processed by a real-time computing engine such as Storm, Spark Streaming, or Flink DataStream. As with the offline data warehouse, these computing engines can run on top of Yarn and also support SQL statements for real-time data processing.
  In the course of implementation, both the offline and real-time data warehouses may use some common components, such as MySQL for storing metadata, Redis for caching, and ElasticSearch (ES) for data search.
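The stream processing described above often aggregates events over time windows. The sketch below simulates a tumbling-window sum in plain Python, the kind of operation an engine such as Flink DataStream performs over a Kafka stream; the events and window size are invented for illustration.

```python
# A toy tumbling-window aggregation over (timestamp_seconds, value) events:
# values are summed per fixed-size, non-overlapping window.

from collections import defaultdict

def tumbling_window_sum(events, window_size=10):
    windows = defaultdict(int)
    for ts, value in events:
        # Each event falls into exactly one window, keyed by its start time.
        window_start = (ts // window_size) * window_size
        windows[window_start] += value
    return dict(windows)

events = [(1, 5), (4, 3), (12, 7), (19, 1), (25, 2)]
print(tumbling_window_sum(events))  # {0: 8, 10: 8, 20: 2}
```

A real streaming engine additionally handles out-of-order events, checkpointing, and emitting window results as the stream is unbounded, which this batch-style sketch omits.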

IV. Data warehouse layer

  With the support of the big data platform layer, you can further build the data warehouse layer. The warehouse model can be built using either a star schema or a snowflake schema. The previously mentioned data marts and machine learning algorithms can also be placed in this layer.
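The star schema idea can be sketched with a fact table that references dimension tables by key. The tables below are invented for illustration; in a real warehouse this "join" would be an SQL query against Hive, Spark SQL, or similar.

```python
# A minimal star schema: one fact table (sales) pointing at two dimension
# tables (product, date), joined in plain Python.

dim_product = {101: {"name": "phone", "category": "electronics"},
               102: {"name": "mug",   "category": "kitchen"}}
dim_date = {20240928: {"quarter": "Q3"}}

fact_sales = [
    {"product_id": 101, "date_id": 20240928, "amount": 599.0},
    {"product_id": 102, "date_id": 20240928, "amount": 9.9},
]

# Resolve each fact row's foreign keys against the dimensions,
# as a star-schema query would.
report = [
    {"product": dim_product[f["product_id"]]["name"],
     "quarter": dim_date[f["date_id"]]["quarter"],
     "amount": f["amount"]}
    for f in fact_sales
]
print(report)
```

A snowflake schema differs only in that the dimension tables are themselves normalized into further tables (e.g. `category` split out of `dim_product`).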

V. Application layer

  With the data models and data in the data warehouse layer in place, various application scenarios can be implemented on top of them, for example: analyzing popular products in e-commerce, social network analysis with graph computing, recommender systems, risk control, and behavior prediction.
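The "popular products" example can be sketched as a simple top-N count over order records pulled from the warehouse layer. The order data here is invented; in practice this would be an aggregation query feeding a dashboard or API.

```python
# A toy "popular products" analysis: count orders per product, take top N.

from collections import Counter

orders = ["phone", "mug", "phone", "laptop", "phone", "mug"]

def top_products(orders, n=2):
    # Counter.most_common returns (item, count) pairs, highest count first.
    return Counter(orders).most_common(n)

print(top_products(orders))  # [('phone', 3), ('mug', 2)]
```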