The Daph source code is hosted on Gitee at gitee.com/dasea96/daph
Overview
The Chinese name of Daph is 大副 (First Mate): the ship's officer ranking second only to the captain, head of the deck department, and the captain's chief assistant.
The English name Daph is taken from the first letter and the last three letters of "Directed Acyclic Graph".
Daph is a general-purpose, platform-level tool for data integration and data processing, used to build visually configurable data integration and data processing platforms.
Daph follows the principle that the greatest way is the simplest.
The core concept of Daph is the node: a node has input and output lines, each line carries data, and a node can carry arbitrary data-processing logic.
The core building block of Daph is a self-developed, general-purpose DAG data flow engine that can stream any Java/Scala data structure and can adopt, as its underlying computation engine, any data computation component that runs on the Java platform or provides a Java client.
The core function of Daph is to link multiple nodes into a DAG graph and stream data through it.
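To make the node concept concrete, here is a minimal Scala sketch; the trait and method names below are assumptions for illustration and are not Daph's actual API:

```scala
// Conceptual sketch only; Daph's real node API is not shown in this
// document, so Node and process are illustrative assumptions.
trait Node[I, O] {
  def name: String
  // Input lines deliver data of type I; the node emits data of type O
  // on its output line, applying arbitrary processing logic in between.
  def process(inputs: Seq[I]): O
}

// Example: a node streaming a plain Scala List, one of the JVM data
// structures a Daph DAG can carry.
class FilterNode(val name: String, keep: Int => Boolean)
    extends Node[List[Int], List[Int]] {
  def process(inputs: Seq[List[Int]]): List[Int] =
    inputs.flatten.filter(keep).toList
}
```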
Functionality
- Full and incremental, whole-database and single-table data integration: complete full and incremental synchronization of entire databases and individual tables with minimal configuration.
- Supports full-table synchronization for more than 50 data source types.
- Supports CDC table synchronization for all data source types supported by Flink CDC.
- Supports full and incremental whole-database synchronization from mysql/postgresql/oracle/sqlserver/doris/starrocks to mysql/postgresql/oracle/sqlserver/doris/starrocks/hive/iceberg/kafka.
- Complex streaming and batch data processing: complete arbitrarily complex multi-table SQL processing logic, in both streaming and batch modes, with minimal configuration.
Value
- A unified data development view: Daph offers both rich data integration and powerful data processing capabilities.
- A lower barrier to data development: data development is completed through configuration files.
- A shorter data development cycle: out-of-the-box, extensive data integration and data processing capabilities, a minimal installation and deployment procedure, and a minimal secondary development process.
Characteristics
- General: nodes of any JVM type can be linked into a DAG graph and can stream any Java/Scala data structure. Daph can therefore not only build DAG data flows today, but also has the potential for DAG task scheduling at arbitrary granularity, using a single daph-core to unify task development and task scheduling and to realize an integrated, visual platform for both.
- Simple: simple concepts, simple configuration
- Based on open-source computing engines, without introducing new complex concepts
- Node configuration is simple; for example, daph-spark node configuration items are almost identical to Spark's own configuration items, adding no learning overhead.
- Powerful: powerful architecture and functionality
- At the architecture level, a multi-layer wrapping execution system supports custom functionality at the job, DAG, node, and method levels, such as node data preview, node monitoring, and pre/post-SQL execution. Currently, all nodes support pre/post table creation, and all daph-spark nodes support pre/post-SQL.
- daph-spark has only 5 connectors and 6 converters, yet already supports streaming and batch reads and writes for 44 data source types, with more data sources easy to add at any time; it supports single-table map, filter, and SQL processing as well as multi-table joins and arbitrarily complex SQL; and it supports any catalog that Spark supports.
- daph-flink has only 2 connectors and 1 converter, yet already supports streaming and batch reads and writes for any data source that Flink SQL supports; it supports arbitrarily complex single-table and multi-table SQL processing; and it supports any catalog that Flink supports.
- Focused: focuses on visually configurable data integration and data processing, and on simplifying the use of open-source computing engines without adding learning overhead.
- Streams arbitrary data structures: can stream any JVM data structure, such as a Java/Scala List, a Spark DataFrame, or a Flink DataStream.
- Supports multiple computing engines: any data computation component that runs on the Java platform or provides a Java client can be adopted as the underlying data computation engine, e.g., Java/Scala/Spark/Flink.
- Fast node extension: nodes with arbitrary logic can be easily extended and deployed, e.g., extending new connector nodes to read and write new database types, or new converter nodes that apply specific data-processing logic. Only the following three steps are needed (a minimal sketch of step 1 follows this list):
1) Implement a configuration interface and a functional interface
2) Place the extended node's corresponding jar in the server directory
3) Configure the extended node's information in the JSON file.
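A minimal Scala sketch of what step 1 might look like is shown below; the interface names (and the idea that a connector is a config class plus a processing class) are illustrative assumptions, not Daph's actual API:

```scala
// Hypothetical sketch of extension step 1; NodeProcessor and the config
// class are invented names for illustration, not Daph's actual interfaces.
case class JdbcSourceConfig(url: String, table: String) // "configuration interface"

trait NodeProcessor[I, O] {                             // "functional interface"
  def process(input: I): O
}

// A new connector node that would read rows from a database table.
class JdbcSourceNode(cfg: JdbcSourceConfig) extends NodeProcessor[Unit, Seq[String]] {
  def process(input: Unit): Seq[String] =
    Seq(s"row from ${cfg.table} at ${cfg.url}") // real logic would use a JDBC client
}
```

After compiling such a node, steps 2 and 3 above would make it available: the jar goes into the server directory, and the node is registered in the JSON configuration.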
Comparison with similar software in the industry
Daph:
- Can be used for both data integration and complex data processing
- An extremely versatile DAG model, capable of streaming any JVM object and of adopting any compute engine that conforms to the Spark/Flink programming model
- Does not reinvent the wheel; focuses on simplifying the use of open-source computing engines, with configuration items mapping almost one-to-one to those engines' own
- Makes full use of the capabilities of open-source computing engines, including but not limited to streaming/batch processing, catalogs, and SQL
- Benefits promptly from the ecosystems of open-source computing engines
- With Spark, for example, as soon as a new database connector becomes available, it can be used in Daph by simply adding the dependency
Comparison dimension | Daph | SeaTunnel | StreamSets | StreamX | Kettle | Chunjun |
---|---|---|---|---|---|---|
Generality | High | Low | Low | Low | Low | Low |
Ease of use | High | Medium | High | High | High | Medium |
Open source | Yes | Yes | No | Yes | Yes | Yes |
Data structure streaming capability | Any JVM object | Dataset[Row]/DataStream[Row]/Zeta data structures | None | None | None | None |
Compute engine integration capability | Any compute engine conforming to the Spark/Flink programming model | Spark/Flink/Zeta | Spark | Spark/Flink | Java | Flink |
Pipeline model | DAG | Linear | DAG | Point | DAG | Linear |
Functional extensibility | High | Medium | Low | Medium | Low | Medium |
Learning cost | Low | High | High | Medium | Medium | Medium |
Development cost | Low | High | High | Medium | High | Medium |
O&M cost | Low | High | Low | Medium | Low | Medium |
Architecture Model
Data Flow Model
Daph's data flow model is the DAG data flow model as shown below:
An example of a comprehensive data integration and data processing scenario is shown below:
- The inputs are a MySQL table, a Hive table, and an Oracle table
- The processing logic includes map, join, SQL, and custom complex logic
- The outputs are a Hudi table, a Doris table, and an HBase table
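For illustration, the sketch below expresses part of this example (MySQL and Hive inputs, map/join/SQL processing, a Hudi output) as direct Spark code. In Daph each step would instead be a configured node in the DAG; all table names, column names, credentials, and paths here are invented:

```scala
// Illustration only: what the example DAG computes, written as plain
// Spark code. Requires the MySQL JDBC driver and the hudi-spark bundle
// on the classpath. Every identifier below is hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ExampleDag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("example-dag").enableHiveSupport().getOrCreate()

    val users = spark.table("dw.users")                  // Hive input node
    val orders = spark.read.format("jdbc")               // MySQL input node
      .option("url", "jdbc:mysql://mysql-host:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader").option("password", "***")
      .load()

    val joined = orders.join(users, Seq("user_id"))      // join node
      .withColumn("amount_usd", col("amount") * 0.14)    // map node

    joined.createOrReplaceTempView("joined")
    val result = spark.sql(                              // sql node
      "SELECT user_id, SUM(amount_usd) AS total FROM joined GROUP BY user_id")

    result.write.format("hudi")                          // Hudi output node
      .option("hoodie.table.name", "orders_agg")
      .option("hoodie.datasource.write.recordkey.field", "user_id")
      .option("hoodie.datasource.write.precombine.field", "total")
      .mode("overwrite")
      .save("/warehouse/hudi/orders_agg")

    spark.stop()
  }
}
```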
Operational Model
Daph wraps code fragments in nodes, links nodes into a DAG graph, and ultimately turns the DAG graph into a complete application.
- A DAG graph is one complete unit of runtime logic; for example, when Spark is the underlying compute engine, a DAG graph is a complete Spark application.
- A DAG graph can mix Java, Scala, and Spark nodes, or mix Java, Scala, and Flink nodes, but cannot contain both Spark and Flink nodes.
- The underlying compute engine determines the application type:
the JVM engine corresponds to a native Java/Scala application;
the Spark engine corresponds to a Spark application;
the Flink engine corresponds to a Flink application.
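As a rough illustration of "one DAG graph = one complete Spark application", the sketch below chains three hypothetical nodes into a single Spark application; Daph's real entry point and wiring mechanism are not shown in this document, so everything here is assumed:

```scala
// Hypothetical illustration; not Daph's actual entry point.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DagAsOneSparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daph-style-dag").master("local[*]").getOrCreate()

    // Three "nodes": a source, a map, and a sink. Chaining them builds
    // one Spark job graph, i.e. a single Spark application.
    val sourceNode: () => DataFrame        = () => spark.range(100).toDF("id")
    val mapNode:  DataFrame => DataFrame   = _.withColumn("doubled", col("id") * 2)
    val sinkNode: DataFrame => Unit        = _.write.mode("overwrite").parquet("/tmp/dag-demo")

    sinkNode(mapNode(sourceNode()))
    spark.stop()
  }
}
```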
The Daph run model is shown below:
Deployment Model
Daph's current deployment model is very simple.
- daph-jvm deploys native Java/Scala applications
- daph-spark deploys Spark applications
- daph-flink deploys Flink applications