Author | Telecom Yikang Engineer Dai Lai
Editor | Debra Chen
I. Introduction
Apache SeaTunnel, a high-performance, easy-to-use data integration framework, is the cornerstone of our rapidly deployed data integration platform. This article explains in detail how we quickly built a data integration platform on Apache SeaTunnel, covering the background of our data middleware strategy, technology selection for the platform, how we lowered the barrier to using Apache SeaTunnel, and our outlook for the future.
II. Background to the data middleware strategy
With the increasing demand for data-driven decision-making in the healthcare industry, it is urgent to tap the value of healthcare data elements and unlock the potential of new productivity. Through its self-developed data middleware, China Telecom Yikang manages medical and healthcare data elements end to end and empowers them in one stop, building an operation base for these data elements and supporting the value mining of medical data and the application of AI models. Against this strategic background, the data integration platform, as the "artery" of our data middleware, must be able to land quickly and meet the demands of the middleware's complex data integration scenarios.
III. Data integration platform technology selection
3.1 Key considerations
There are several key factors to consider when making technology selections for the underlying data integration platform:
- Performance: The data integration engine needs to have high throughput and low latency to be able to efficiently process massive amounts of data.
- Scalability: The data integration engine should have good scalability and be able to dynamically expand processing capabilities according to business needs.
- Ease of use: The data integration platform should be easy to use and maintain, reducing reliance on specialized technical staff.
- Ecological support: The data integration engine should support multiple data sources and targets with good ecosystem support.
3.2 Advantages of choosing Apache SeaTunnel
Currently, the mainstream data integration technologies on the market include Sqoop, DataX, Kettle, Flink CDC, Canal, and Airbyte. Apache SeaTunnel has the following advantages that make it the ideal choice for our data integration platform:
- Performance
According to the latest official benchmarks, Apache SeaTunnel is 40%-80% faster than DataX and up to 30 times faster than Airbyte in the same test scenarios, a substantial performance advantage. We also benchmarked jdbc-source to jdbc-sink synchronization on 8C32G servers against the same database: on average, our data integration platform was nearly 20,000 records per second faster than third-party platforms. This performance comes from SeaTunnel's design. Taking the JDBC connector as an example, SeaTunnel reuses database connections and shards data dynamically, and its Zeta engine even implements dynamic thread sharing, completing data synchronization while minimizing resource usage.
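As an illustration, a minimal SeaTunnel job configuration using the JDBC connector's sharded parallel read might look like the sketch below. The connection URLs, table names, and credentials are placeholders, and option names may differ slightly between SeaTunnel versions:

```hocon
env {
  # batch job running on the Zeta engine
  job.mode = "BATCH"
  parallelism = 4
}

source {
  Jdbc {
    url = "jdbc:mysql://127.0.0.1:3306/demo"   # placeholder address
    driver = "com.mysql.cj.jdbc.Driver"
    user = "root"
    password = "******"
    query = "SELECT id, name, age FROM patient"
    # dynamic sharding: split the read by this column across parallel readers
    partition_column = "id"
    partition_num = 4
  }
}

sink {
  Jdbc {
    url = "jdbc:postgresql://127.0.0.1:5432/dw"   # placeholder address
    driver = "org.postgresql.Driver"
    user = "dw"
    password = "******"
    database = "dw"
    table = "ods_patient"
    # generate INSERT statements from the upstream schema
    generate_sink_sql = true
  }
}
```

With `partition_column` and `partition_num` set, the source splits the query range into shards that the parallel readers consume concurrently, which is where much of the throughput advantage comes from.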
- Flexible deployment
In our customer scenarios, most hospitals can only provide a physical machine as a front-end collection node for us to deploy the collection service, while the platform itself is deployed at the central site. Only the collection node can reach the hospital database, and only the central site can reach the collection node; full communication among all parties is impossible. Only a few customers can deploy all services in a single environment. This requires very flexible deployment. SeaTunnel supports both distributed and standalone deployment, and its decentralized design provides high availability and easy scaling. Each node can act as both Master and Worker at the same time, or the Master and Worker roles can be deployed separately; the former suits small and medium deployments, the latter large-scale deployments.
- Fault tolerance
SeaTunnel's fault tolerance is also excellent.
From the cluster perspective, when a cluster node goes down, its tasks automatically fail over to other nodes. When IMAP persistence is enabled for the cluster, even if every node goes down, the persisted data is automatically recovered when the cluster restarts. Note that the first node to start loads the persisted IMAP data, so cluster nodes should start within a short window of each other; otherwise all recovered tasks may land on the first node that starts.
From the job perspective, SeaTunnel also has a checkpoint mechanism: when a job fails unexpectedly, it can be recovered from the latest checkpoint, so expensive data synchronization tasks do not have to be resynchronized from scratch. In addition, because network delays, node failures, and other issues can cause consistency problems in a distributed system, SeaTunnel implements two-phase commit in the relevant connectors to guarantee data consistency.
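Checkpointing can be tuned per job in the config's `env` block; the intervals below are illustrative values, not recommendations:

```hocon
env {
  job.mode = "STREAMING"
  parallelism = 2
  # take a checkpoint every 10 s so a failed job can resume
  # from the last checkpoint instead of resynchronizing
  checkpoint.interval = 10000
  # fail the checkpoint attempt if it does not complete within 60 s
  checkpoint.timeout = 60000
}
```

A shorter interval reduces the amount of data replayed after a failure at the cost of more frequent checkpoint overhead.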
- Rich ecosystem
SeaTunnel already supports 100+ data sources and makes it easy to extend the ecosystem with your own connectors. It supports whole-database synchronization, multi-table synchronization, and resumable transfer. It also supports automatic table creation, a feature that reflects SeaTunnel's design intent: it is easy to implement on the platform side and very friendly to users, and its advantages are especially apparent when synchronizing a very large number of tables.
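Automatic table creation is typically switched on through the sink's save-mode options. A sketch with placeholder connection details (option names may vary by SeaTunnel version and connector):

```hocon
sink {
  Jdbc {
    url = "jdbc:mysql://127.0.0.1:3306/dw"   # placeholder address
    driver = "com.mysql.cj.jdbc.Driver"
    user = "dw"
    password = "******"
    database = "dw"
    table = "ods_patient"
    generate_sink_sql = true
    # create the target table automatically if it does not exist
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    # keep existing rows when the table already contains data
    data_save_mode = "APPEND_DATA"
  }
}
```

When synchronizing hundreds of tables, this removes the need to pre-create DDL on the target side by hand.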
- EtLT engine architecture
SeaTunnel's EtLT architecture fits data middleware scenarios very well. In these scenarios, about 90% of the work is moving data from source to target, possibly with transformation (Transform), but the T is a lowercase t: mainly column copying, column filtering, field splitting, and similar operations rather than joins, group-bys, or the like. This is very common in a data middleware, where the heavy T happens later, in the data development stage after data lands in the warehouse. EtLT can be seen as an upgraded version of ETL, and in many scenarios its synchronization rate is far higher than that of the ETL architecture.
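A "small t" is expressed in SeaTunnel as a transform block between source and sink, for example copying and filtering columns. The table and field names below are illustrative, and plugin option names may vary by version:

```hocon
transform {
  # copy an existing column into a new one
  Copy {
    source_table_name = "source_table"
    result_table_name = "copied"
    fields {
      patient_name = name   # new_field = existing_field
    }
  }

  # keep only the columns the target needs
  Filter {
    source_table_name = "copied"
    result_table_name = "filtered"
    fields = [id, patient_name, age]
  }
}
```

Because these transforms operate row by row during transfer, they add little overhead compared with a full ETL stage that stages data for joins or aggregations.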
- Platform architecture
If we hadn't chosen SeaTunnel as our data integration engine, this is what our platform architecture might have looked like:
The disadvantage of this architecture is that it uses multiple data integration engines, so maintenance costs are high; in addition, it needs a Flink execution environment to run real-time synchronization tasks. From the perspective of quickly launching a data integration platform this is unfriendly, because the team must study several integration engines in depth. With SeaTunnel, the platform architecture can be optimized as follows:
We only need to study Apache SeaTunnel and can quickly build the data integration platform on top of it; if some requirement is not met, we can do secondary development on it. Development, operations, and maintenance costs are much lower than with the former architecture.
IV. How to Lower the Barrier to Using Apache SeaTunnel
1. User-friendly functional interface
To lower the barrier to use, we provide a visual configuration interface that lets users configure data integration tasks graphically without writing complex configuration files.
When configuring the field mapping of a synchronization task, users can flexibly adjust field order, customize field values, add default fields, and delete redundant fields.
Supports linked queries against complex SQL.
Supports periodic scheduling of batch tasks to meet the need for timed full or incremental synchronization.
Supports global parameter settings
The above are sample screenshots of our product features; the full product offers much more. These samples are meant to spark ideas and guide users in quickly building their own data integration platform.
2. Provide rich documentation and examples
An excellent data integration platform needs rich, high-quality documentation. Detailed usage documentation and plentiful sample code help users get started quickly, covering installation, configuration, and debugging as well as solutions to common problems.
The main documents cover environment requirements, project configuration, configuration file details, running tests, and solutions to common problems. Taking solutions to common problems as an example:
Solutions to Common Problems
- Data source connectivity issues:
  - Ensure that the data source address, port, and authentication information are correct.
  - Check network connections and firewall settings.
- Data conversion errors:
  - Check that the conversion rules are correct.
  - Make sure all fields and types match.
- Performance issues:
  - Adjust connector parameters and other configurations to improve throughput.
  - Optimize data conversion logic.
- Plugin issues:
  - Make sure that all necessary plugins are installed and configured correctly.
  - Check the version compatibility of the plugins.
3. Integration of automated deployment tools
We automated the deployment and management of SeaTunnel to further reduce the difficulty of use and maintenance, implementing one-click deployment of the SeaTunnel service based on server address information.
The following is an implementation of real-time monitoring of the deployed SeaTunnel service:
4. Community support
During the development and implementation of the platform, we inevitably ran into problems. For some of them the community already had experience, for example with fault tolerance and recovery of SeaTunnel clusters, and it actively gave answers and help. Other features could not meet our actual business needs: in our lakehouse data middleware architecture we use Apache Paimon as the data lake, but the community's Paimon connector could not fully satisfy our requirements, so we fixed bugs in it and added many new features:
- Support CDC writes to Paimon
- Support automatic table creation in the Paimon sink, including specifying partition keys and primary keys, and support specifying multiple buckets (which improves write performance for large data volumes)
- Support multi-table sink with Paimon
- Support writing Paimon in a specified format (default is ORC; Parquet or Avro can be specified)
- Fix incorrect writing of date fields; support the timestamp(n) type
- Support Kerberos authentication and HA-mode HDFS clusters
- Support the Hive catalog
- Support type conversion before writing to the sink table
- Fix data loss when writing in batches
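Putting several of these features together, a Paimon sink configuration might look like the following sketch. The warehouse path, database, and table names are placeholders, and option names follow the connector documentation but may vary by version:

```hocon
sink {
  Paimon {
    warehouse = "hdfs://nameservice1/paimon"   # placeholder HA-mode HDFS path
    database = "medical"
    table = "patient_record"
    # automatic table creation with explicit partition and primary keys
    paimon.table.partition-keys = "dt"
    paimon.table.primary-keys = "dt,id"
    paimon.table.write-props = {
      file.format = "parquet"   # default is orc; parquet or avro can be chosen
      bucket = 4                # multiple buckets to speed up large writes
    }
  }
}
```

For a multi-table CDC pipeline, one such sink handles all upstream tables, with partitioning and bucketing chosen per workload.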
The above is only a snapshot of our contributions to the community; there are more. Since choosing Apache SeaTunnel as our data integration engine we have enjoyed the benefits the community brings, and in return we actively contribute and give feedback so that everyone can improve together.
V. Future and outlook
With the growing demand for big data and data-driven decision-making in the healthcare industry, SeaTunnel, as an efficient and flexible data integration tool, will play an important role in healthcare informatization, especially in data integration and processing. Its functions and features are well suited to the needs of healthcare big data platforms. Here is our outlook for SeaTunnel in the healthcare industry:
1. Integration of multiple data sources
Integrate the hospital's electronic medical record system, imaging information system (PACS), laboratory information system (LIS), etc. to realize cross-system data sharing.
2. Data standards
Supports healthcare industry standards such as HL7 FHIR (Fast Healthcare Interoperability Resources) to improve data standardization and interoperability.
3. Security and privacy protection
- Data encryption: encryption technology is used to protect data security, especially during transmission.
- Anonymization and desensitization: anonymize and desensitize data to protect patient privacy.
4. AI and machine learning integration
The data integration platform will introduce more intelligent features, such as intelligent recommendation configuration, to help users integrate and process data more efficiently.
VI. Summary
Apache SeaTunnel, as an efficient and flexible data integration platform, plays an important role in the data middleware strategy. Through this article, readers can learn how to quickly build a data integration platform based on SeaTunnel and use it flexibly in practice. In the future, as the technology continues to develop, SeaTunnel will keep playing an important role in the field of data integration, helping enterprises realize data-driven business change.