The Daph source code is hosted on Gitee at gitee.com/dasea96/daph
Overview
The Chinese name of Daph is 大副 (First Mate): the ship's officer ranking second only to the captain, head of the deck department, and the captain's chief assistant.
The English name Daph is taken from the first letter and the last three letters of "Directed Acyclic Graph".
Daph is a general-purpose, platform-level tool for data integration and data processing, used to build visually configurable data integration and data processing platforms.
Daph follows the principle that the greatest way is the simplest.
The core concept of Daph is the node: a node has input and output lines, each line carries data, and a node can carry arbitrary data-processing logic.
The core building block of Daph is a self-developed, general-purpose DAG data flow engine that can stream any Java/Scala data structure and can adopt, as its underlying computation engine, any data computation component that runs on the Java platform or provides a Java client.
The core function of Daph is to link multiple nodes into a DAG graph and stream data through it.
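To make the node concept concrete, here is a minimal Scala sketch; the trait and method names below are assumptions for illustration and are not Daph's actual API:

```scala
// Conceptual sketch only; Daph's real node API is not shown in this
// document, so Node and process are illustrative assumptions.
trait Node[I, O] {
  def name: String
  // Input lines deliver data of type I; the node emits data of type O
  // on its output line, applying arbitrary processing logic in between.
  def process(inputs: Seq[I]): O
}

// Example: a node streaming a plain Scala List, one of the JVM data
// structures a Daph DAG can carry.
class FilterNode(val name: String, keep: Int => Boolean)
    extends Node[List[Int], List[Int]] {
  def process(inputs: Seq[List[Int]]): List[Int] =
    inputs.flatten.filter(keep).toList
}
```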
Functionality
- Full and incremental, whole-database and single-table data integration: complete full and incremental synchronization of entire databases and individual tables with minimal configuration.
- Supports full-table synchronization for more than 50 data source types.
- Supports CDC table synchronization for all data source types supported by Flink CDC.
- Supports full and incremental whole-database synchronization from mysql/postgresql/oracle/sqlserver/doris/starrocks to mysql/postgresql/oracle/sqlserver/doris/starrocks/hive/iceberg/kafka.
- Complex streaming and batch data processing: complete arbitrarily complex multi-table SQL processing logic, in both streaming and batch modes, with minimal configuration.
Value
- A unified data development view: Daph offers both rich data integration and powerful data processing capabilities.
- A lower barrier to data development: data development is completed through configuration files.
- A shorter data development cycle: out-of-the-box, extensive data integration and data processing capabilities, a minimal installation and deployment procedure, and a minimal secondary development process.
Characteristics
- General: nodes of any JVM type can be linked into a DAG graph and can stream any Java/Scala data structure. Daph can therefore not only build DAG data flows today, but also has the potential for DAG task scheduling at arbitrary granularity, using a single daph-core to unify task development and task scheduling and to realize an integrated, visual platform for both.
- Simple: simple concepts, simple configuration
- Based on open-source computing engines, without introducing new complex concepts
- Node configuration is simple; for example, daph-spark node configuration items are almost identical to Spark's own configuration items, adding no learning overhead.
- Powerful: powerful architecture and functionality
- At the architecture level, a multi-layer wrapping execution system supports custom functionality at the job, DAG, node, and method levels, such as node data preview, node monitoring, and pre/post-SQL execution. Currently, all nodes support pre/post table creation, and all daph-spark nodes support pre/post-SQL.
- daph-spark has only 5 connectors and 6 converters, yet already supports streaming and batch reads and writes for 44 data source types, with more data sources easy to add at any time; it supports single-table map, filter, and SQL processing as well as multi-table joins and arbitrarily complex SQL; and it supports any catalog that Spark supports.
- daph-flink has only 2 connectors and 1 converter, yet already supports streaming and batch reads and writes for any data source that Flink SQL supports; it supports arbitrarily complex single-table and multi-table SQL processing; and it supports any catalog that Flink supports.
- Focused: focuses on visually configurable data integration and data processing, and on simplifying the use of open-source computing engines without adding learning overhead.
- Streams arbitrary data structures: can stream any JVM data structure, such as a Java/Scala List, a Spark DataFrame, or a Flink DataStream.
- Supports multiple computing engines: any data computation component that runs on the Java platform or provides a Java client can be adopted as the underlying data computation engine, e.g., Java/Scala/Spark/Flink.
- Fast node extension: nodes with arbitrary logic can be easily extended and deployed, e.g., extending new connector nodes to read and write new database types, or new converter nodes that apply specific data-processing logic. Only the following three steps are needed (a minimal sketch of step 1 follows this list):
1) Implement a configuration interface and a functional interface
2) Place the extended node's corresponding jar in the server directory
3) Configure the extended node's information in the JSON file.
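A minimal Scala sketch of what step 1 might look like is shown below; the interface names (and the idea that a connector is a config class plus a processing class) are illustrative assumptions, not Daph's actual API:

```scala
// Hypothetical sketch of extension step 1; NodeProcessor and the config
// class are invented names for illustration, not Daph's actual interfaces.
case class JdbcSourceConfig(url: String, table: String) // "configuration interface"

trait NodeProcessor[I, O] {                             // "functional interface"
  def process(input: I): O
}

// A new connector node that would read rows from a database table.
class JdbcSourceNode(cfg: JdbcSourceConfig) extends NodeProcessor[Unit, Seq[String]] {
  def process(input: Unit): Seq[String] =
    Seq(s"row from ${cfg.table} at ${cfg.url}") // real logic would use a JDBC client
}
```

After compiling such a node, steps 2 and 3 above would make it available: the jar goes into the server directory, and the node is registered in the JSON configuration.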
Comparison with similar software in the industry
Daph:
- Can be used for both data integration and complex data processing
- An extremely versatile DAG model, capable of streaming any JVM object and of adopting any compute engine that conforms to the Spark/Flink programming model
- Does not reinvent the wheel; focuses on simplifying the use of open-source computing engines, with configuration items mapping almost one-to-one to those engines' own
- Makes full use of the capabilities of open-source computing engines, including but not limited to streaming/batch processing, catalogs, and SQL
- Benefits promptly from the ecosystems of open-source computing engines
- With Spark, for example, as soon as a new database connector becomes available, it can be used in Daph by simply adding the dependency
Comparison dimension | Daph | SeaTunnel | StreamSets | StreamX | Kettle | Chunjun |
---|---|---|---|---|---|---|
Generality | High | Low | Low | Low | Low | Low |
Ease of use | High | Medium | High | High | High | Medium |
Open source | Yes | Yes | No | Yes | Yes | Yes |
Data structure streaming capability | Any JVM object | Dataset[Row]/DataStream[Row]/Zeta data structures | None | None | None | None |
Compute engine integration capability | Any compute engine conforming to the Spark/Flink programming model | Spark/Flink/Zeta | Spark | Spark/Flink | Java | Flink |
Pipeline model | DAG | Linear | DAG | Point | DAG | Linear |
Functional extensibility | High | Medium | Low | Medium | Low | Medium |
Learning cost | Low | High | High | Medium | Medium | Medium |
Development cost | Low | High | High | Medium | High | Medium |
O&M cost | Low | High | Low | Medium | Low | Medium |
Architecture Model
Data Flow Model
Daph's data flow model is the DAG data flow model as shown below:
An example of a comprehensive data integration and data processing scenario is shown below:
- The inputs are a MySQL table, a Hive table, and an Oracle table
- The processing logic includes map, join, SQL, and custom complex logic
- The outputs are a Hudi table, a Doris table, and an HBase table
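For illustration, the sketch below expresses part of this example (MySQL and Hive inputs, map/join/SQL processing, a Hudi output) as direct Spark code. In Daph each step would instead be a configured node in the DAG; all table names, column names, credentials, and paths here are invented:

```scala
// Illustration only: what the example DAG computes, written as plain
// Spark code. Requires the MySQL JDBC driver and the hudi-spark bundle
// on the classpath. Every identifier below is hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ExampleDag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("example-dag").enableHiveSupport().getOrCreate()

    val users = spark.table("dw.users")                  // Hive input node
    val orders = spark.read.format("jdbc")               // MySQL input node
      .option("url", "jdbc:mysql://mysql-host:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader").option("password", "***")
      .load()

    val joined = orders.join(users, Seq("user_id"))      // join node
      .withColumn("amount_usd", col("amount") * 0.14)    // map node

    joined.createOrReplaceTempView("joined")
    val result = spark.sql(                              // sql node
      "SELECT user_id, SUM(amount_usd) AS total FROM joined GROUP BY user_id")

    result.write.format("hudi")                          // Hudi output node
      .option("hoodie.table.name", "orders_agg")
      .option("hoodie.datasource.write.recordkey.field", "user_id")
      .option("hoodie.datasource.write.precombine.field", "total")
      .mode("overwrite")
      .save("/warehouse/hudi/orders_agg")

    spark.stop()
  }
}
```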
Operational Model
Daph wraps code fragments in nodes, links nodes into a DAG graph, and ultimately turns the DAG graph into a complete application.
- A DAG graph is one complete unit of runtime logic; for example, when Spark is the underlying compute engine, a DAG graph is a complete Spark application.
- A DAG graph can mix Java, Scala, and Spark nodes, or mix Java, Scala, and Flink nodes, but cannot contain both Spark and Flink nodes.
- The underlying compute engine determines the application type:
the JVM engine corresponds to a native Java/Scala application;
the Spark engine corresponds to a Spark application;
the Flink engine corresponds to a Flink application.
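As a rough illustration of "one DAG graph = one complete Spark application", the sketch below chains three hypothetical nodes into a single Spark application; Daph's real entry point and wiring mechanism are not shown in this document, so everything here is assumed:

```scala
// Hypothetical illustration; not Daph's actual entry point.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DagAsOneSparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daph-style-dag").master("local[*]").getOrCreate()

    // Three "nodes": a source, a map, and a sink. Chaining them builds
    // one Spark job graph, i.e. a single Spark application.
    val sourceNode: () => DataFrame        = () => spark.range(100).toDF("id")
    val mapNode:  DataFrame => DataFrame   = _.withColumn("doubled", col("id") * 2)
    val sinkNode: DataFrame => Unit        = _.write.mode("overwrite").parquet("/tmp/dag-demo")

    sinkNode(mapNode(sourceNode()))
    spark.stop()
  }
}
```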
The Daph run model is shown below:
Deployment Model
Daph's current deployment model is very simple.
- daph-jvm deploys native Java/Scala applications
- daph-spark deploys Spark applications
- daph-flink deploys Flink applications