The four pieces of the stream-batch unification puzzle

Stream-batch unification is a hot topic in the data field. With the growing demand for real-time data processing and the continued development of streaming compute technologies such as Flink, stream-batch unification is moving from a technological vision toward concrete solutions adapted to the characteristics of different industries.

Personally, I believe a stream-batch unification solution comes down to four areas: data integration, storage engines, compute engines, and metadata management.

  • data integration

Traditional batch data integration moves data in a once-a-day batch transfer, with files as the carrier. Real-time data integration transfers data continuously, either through a CDC tool or by pushing through API calls, with messages as the carrier, relying on message queues such as Kafka. In the Lambda architecture the two coexist and serve as the starting points of the batch and real-time processing chains respectively.
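
As an illustration only (the article does not prescribe a specific toolchain), the sketch below tails CDC-style change events from Kafka; the broker address, the `orders.cdc` topic name, and the string payloads are assumptions, not part of the original text.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CdcTail {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "cdc-demo");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.cdc"));       // hypothetical CDC topic
            while (true) {
                // Each record carries one row-level change emitted by the CDC tool,
                // i.e. the "message as carrier" mode of real-time integration.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
            }
        }
    }
}
```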

In terms of timeliness, the real-time approach has a clear advantage: data latency can drop to seconds or even milliseconds. This is why the Kappa architecture advocates merging the two chains and adopting real-time data integration across the board.

Years after the Kappa architecture was proposed, the idea has still not been universally adopted. I think part of the reason is that real-time data integration is not yet stable enough: a transmission link composed of CDC tools and Kafka carries a certain probability of data loss and duplication. It is therefore unsuitable for scenarios with stringent data-accuracy requirements, especially in the financial industry. Of course, this kind of deviation may be irrelevant in other industries; when counting real-time traffic, for example, losing the records of a few vehicles does not change the judgment of road congestion.

Beyond accuracy, the second issue is data boundaries. Real-time data is viewed as a stream: like a river, it flows continuously with no obvious boundaries. In financial scenarios, however, the business imposes boundaries on the data. An easy example is deposit interest calculation: there is always a cut-off point that decides whether one more day's interest is credited to the customer, the so-called business-date rollover. These are two different ways of understanding the same data, a bit like wave-particle duality.
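
A minimal sketch of the business-date rollover, purely for illustration: the 23:00 cut-off time and the day-counting convention are assumptions, not taken from the article or any real banking system.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.temporal.ChronoUnit;

public class InterestDays {
    // Assumed end-of-day cut-off; a real system takes this from the business calendar.
    static final LocalTime CUTOFF = LocalTime.of(23, 0);

    // Map an event timestamp to its business date: before the cut-off the business
    // date is still "today"; after it, the next day's interest period has begun.
    static LocalDate businessDate(LocalDateTime t) {
        return t.toLocalTime().isBefore(CUTOFF) ? t.toLocalDate() : t.toLocalDate().plusDays(1);
    }

    // Interest-bearing days between deposit time and "now" under this convention.
    static long interestDays(LocalDateTime depositedAt, LocalDateTime now) {
        return ChronoUnit.DAYS.between(businessDate(depositedAt), businessDate(now));
    }

    public static void main(String[] args) {
        LocalDateTime deposited = LocalDateTime.of(2024, 10, 1, 10, 0);
        // Same calendar day (Oct 2), but crossing the cut-off adds one more day of interest.
        System.out.println(interestDays(deposited, LocalDateTime.of(2024, 10, 2, 22, 0)));  // 1
        System.out.println(interestDays(deposited, LocalDateTime.of(2024, 10, 2, 23, 30))); // 2
    }
}
```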

Just because there are accuracy and data boundary issues doesn't mean we're stuck at the Lambda architecture stage.

The technology behind real-time integration links keeps evolving. For example, Kafka added idempotent-producer support in version 0.11, which avoids duplicate message writes caused by network jitter. Compensation can also be done at the application level, for instance by having the producer side of a data stream attach summary information (e.g., record counts or checksums) so that the target side can verify data completeness and act accordingly. I am optimistic about the development of real-time link technology: the data loss and duplication problems can be solved, and it will not take long. As for the second point, although the data-boundary problem does exist in the financial industry, a large share of the data can still be viewed as streams.
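
A minimal producer-side sketch: it enables Kafka's idempotence flag (available since 0.11) and, as an assumed application-level convention, sends a trailing summary record with the batch's record count so the consumer can check completeness. The broker address, topic names, and summary format are illustrative only.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class IdempotentProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("enable.idempotence", "true");             // Kafka >= 0.11: broker de-duplicates producer retries
        props.put("acks", "all");                             // required for idempotent writes
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            int count = 0;
            for (String row : new String[]{"r1", "r2", "r3"}) {   // stand-in for real change records
                producer.send(new ProducerRecord<>("trades", row));
                count++;
            }
            // Application-level compensation (illustrative): publish a summary the
            // target side can use to verify it received the whole batch.
            producer.send(new ProducerRecord<>("trades.summary", "count=" + count));
            producer.flush();
        }
    }
}
```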

If you are interested in idempotence and data-summary techniques, you can look up the earlier articles on my public account: "What is idempotence?" and "Data Summary Techniques in the Lamentations of Nanming".

To summarize: given the characteristics of the financial industry, the staged goal is for real-time data integration to dominate, while batch integration, because of the data-boundary issue, will persist for a while but can be compressed to a small share. Once that goal is reached, the data-boundary problem itself has a chance of being solved by technical means, making fully real-time data integration possible.

  • storage engine

The storage engine can be broken down into three levels: underlying storage, file format, and table format.

HDFS is still the main storage layer in big-data architectures. It was designed for massive data processing and favors larger data files; the default HDFS block size of 128 MB, for example, is far larger than that of ordinary file systems. With the real arrival of massive data, HDFS built on local disks faces significant cost pressure, and object storage, thanks to its cost advantage and friendliness toward small files, has become an important option for many enterprises. Of course, the performance degradation of object storage is still very noticeable, so at this stage it mostly serves as a supplementary means of expanding capacity. As automatic hot/cold data migration matures, object storage will be applied more widely. Object storage is also central to the topic of separating storage from compute; there is much worth exploring there, but I will not expand on it here and will discuss it separately later.

At the file-format level, formats fall into two kinds by technical characteristics: columnar storage, represented by Parquet and ORC, and row storage, represented by Avro. On the real-time link, data is pushed in the order it is generated, so row-oriented formats are mostly used for serialization in transit (Avro is also a serialization framework), while data landed on storage usually uses columnar formats such as Parquet, which are better at improving processing performance.
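
As a small illustration of the row-oriented transport case, the sketch below serializes one record with Avro's binary encoder; the schema and field names are invented for the example.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class AvroRowSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema for a single change record pushed over the wire.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);
        record.put("amount", 199.5);

        // Rows are serialized one after another in generation order,
        // which is why row formats dominate the transport leg.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println("serialized bytes: " + out.size());
    }
}
```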

With the emergence of technologies such as Hudi and Iceberg, the table format has become a new member of the storage engine. Based on Hudi/Iceberg, data can be updated and deleted, whereas previously only full overwrites were possible. This greatly improves the efficiency of incremental data processing in the Hadoop ecosystem and allows data-modeling methods accumulated on MPP databases to be migrated to Hadoop more readily.
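
For a sense of what row-level changes look like with a table format, here is a hedged sketch using Spark SQL on Iceberg; it assumes an Iceberg catalog named `demo` is already configured in the Spark session, and the table and column names are invented.

```java
import org.apache.spark.sql.SparkSession;

public class TableFormatSketch {
    public static void main(String[] args) {
        // Assumes the session is configured with an Iceberg catalog called "demo"
        // (spark.sql.catalog.demo = org.apache.iceberg.spark.SparkCatalog, etc.).
        SparkSession spark = SparkSession.builder()
                .appName("table-format-sketch")
                .getOrCreate();

        // Row-level DELETE/MERGE instead of rewriting whole partitions:
        spark.sql("DELETE FROM demo.db.accounts WHERE closed = true");
        spark.sql(
            "MERGE INTO demo.db.accounts t " +
            "USING demo.db.accounts_changes s ON t.id = s.id " +
            "WHEN MATCHED THEN UPDATE SET t.balance = s.balance " +
            "WHEN NOT MATCHED THEN INSERT *");

        spark.stop();
    }
}
```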

To summarize, the important change on the storage-engine side is the introduction of the table format, which provides more efficient update and delete operations and helps both real-time and batch data processing significantly.

  • compute engine

In the Hadoop ecosystem, Hive, Spark, and Flink are all important compute engines. Strictly speaking, Hive itself is not a compute engine; what is usually meant is Hive plus MapReduce, where MapReduce is Hadoop's original compute engine. As the technology developed, and especially after Cloudera, the main Hadoop vendor, upgraded to CDP and switched the default engine from MapReduce to Tez, MapReduce has gradually left the stage and now exists only as a legacy technology. The Spark community has strong vitality and wide influence; Spark has a significant performance advantage over MapReduce and covers batch, stream computing, AI, and other scenarios. Flink, as a newer compute engine, started with real-time computing as its target scenario and then showed greater ambition, setting its sights on being a unified stream-batch compute engine. It has also spawned a companion open-source storage component, Paimon.
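
A minimal sketch of the stream-batch unified direction in Flink's API: the same DataStream pipeline can be run as a batch job simply by setting the runtime execution mode. The bounded element source and job name are placeholders.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The same pipeline code can execute as a bounded batch job or an
        // unbounded streaming job; only the runtime mode changes.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);   // or STREAMING / AUTOMATIC

        env.fromElements(1, 2, 3, 4, 5)                   // placeholder bounded source
           .map(x -> x * 2)
           .returns(Types.INT)                            // type hint for the lambda
           .print();

        env.execute("unified-mode-sketch");
    }
}
```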

To summarize, at the compute-engine level both Spark and Flink have the potential to achieve stream-batch unification. Programmers in China are more heavily involved in the Flink community, which may become an advantage for Flink and affect the future market share of stream-batch unified compute engines.

  • metadata management

Metadata has always been an essential yet lukewarm topic in the data space. It is critical for building automated data pipelines, and at the same time it is an important link connecting streaming and batch data.

In the Hadoop ecosystem, HMS (Hive Metastore) is the de facto standard for metadata management. Although HMS was born as a mere subsidiary component of Hive, Hive's once-huge market share meant that other compute engines had to adapt to HMS in order to fit into the ecosystem, and being open source, HMS did not stand in the way of such adaptation. As a result, HMS has gradually come to be seen as a neutral metadata-management component. So even as Hive declines along with MapReduce, HMS retains a strong, independent vitality.
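
For a sense of how engines lean on HMS, here is a hedged sketch that reads a table's column definitions through the Hive Metastore client; the metastore URI, database, and table names are assumptions.

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class HmsSketch {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Assumed metastore endpoint; engines such as Spark, Flink, and Trino
        // point at the same service to share one view of table metadata.
        conf.set("hive.metastore.uris", "thrift://localhost:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            Table table = client.getTable("demo_db", "accounts");   // hypothetical table
            for (FieldSchema col : table.getSd().getCols()) {
                System.out.println(col.getName() + " : " + col.getType());
            }
        } finally {
            client.close();
        }
    }
}
```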

Under the vision of stream-batch unification, the goal of metadata management is to make up for an inherent defect of the batch data-integration link and improve the automation and integration capabilities of the unified architecture. Traditional batch integration moves only the data itself, without the accompanying metadata (e.g., table-structure information); keeping metadata consistent between source and target relies entirely on manual offline communication followed by separate deployment on each side. The risk of this manual mechanism is self-evident, and against a backdrop of massive data, the absolute number of problems it causes will not be small.

Relying on the real-time integration link built on CDC tools, there is an opportunity to automate metadata hand-off between systems. Tools such as OGG (Oracle GoldenGate) already provide the ability to push table-structure changes in real time, which can be received through a Schema Registry (SR), but the coordination between SR and HMS still needs further work. Seen this way, what HMS provides at this stage is still a metadata-management capability aimed mainly at batch scenarios.
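
To make the SR side concrete, a hedged sketch that pulls the latest registered schema for a subject over Confluent Schema Registry's REST API; the registry URL and subject name are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaRegistrySketch {
    public static void main(String[] args) throws Exception {
        // Assumed registry endpoint and subject; a CDC pipeline would register a new
        // schema version here whenever the source table structure changes.
        String url = "http://localhost:8081/subjects/demo_db.accounts-value/versions/latest";

        HttpClient http = HttpClient.newHttpClient();
        HttpResponse<String> resp = http.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // The JSON body contains the subject, version, and schema definition; keeping
        // this in sync with HMS is the coordination gap mentioned above.
        System.out.println(resp.body());
    }
}
```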

Based on our conclusions in the Data Integration section, batch data and streaming data will continue to coexist for quite some time, and the two are not completely separated in terms of usage scenarios, so the metadata management module should be compatible with both forms of data.

To summarize, HMS is the de facto standard for metadata management and is capable of supporting a stream-batch unified architecture, but there is still room for improvement in automation and integration, and metadata management for real-time data in particular needs to be strengthened. Other metadata-management components may also seize this opportunity to challenge HMS's position, such as Unity Catalog, which Databricks open-sourced this year.

Finally, to summarize: this article divides stream-batch unification into four aspects: data integration, storage engine, compute engine, and metadata management. Unified specifications for metadata management and the storage engine are hard constraints, and the technology products chosen there are exclusive; at the data-integration level, real-time links will dominate, supplemented by batch links; the compute engine can be split and combined, so a single all-in-one engine is not inevitable, though from the perspective of community participation Flink is relatively better placed.

These are just my own opinions, and viewed from the standpoint of one industry they inevitably carry some bias; you are welcome to leave a comment with criticisms and corrections.