
[ETL tools] DataX + DataXWeb first-use process notes

Published: 2024-09-02 09:28:45

Versions: DataX v202309, DataXWeb 2.1.3 (pre-release)

DataX:

Github:/alibaba/DataX

Function introduction document: /alibaba/DataX/blob/master/

Although the documentation only mentions Linux systems, DataX can in fact also be deployed on Windows.

JDK version 1.8 is sufficient

For Python, 2.6 or 2.7 will do; since the environment version was up to me, I'm using 3.12.5.

Maven is required for compilation

I initially downloaded the v202308 release; installation package download path: /202308/

Because I'm running Python 3, I replaced the .py files in DataX's /bin directory (the replacement files are in DataXWeb at doc/datax-web/datax-python3/).

Since the DataX release only supports older MySQL versions, but the MySQL DB on my side is newer,

I downloaded the v202309 source code and tweaked it to support my MySQL.

(Steps to change the code: /weixin_41640312/article/details/132019719)

Then just follow the packaging steps in the GitHub README.

Question:

During packaging, oceanbasev10writer reported an error: a specific jar file was missing under the project's libs directory.

Solution:

Go to the master branch, find this jar, download it, and copy it in; packaging then succeeds (the packaging process is very slow; I'm not sure whether it's a network problem).

Question:

I created a Job to migrate data between MySQL data sources (the docs don't say which MySQL versions are supported, and it didn't occur to me that the supported version was so low).

The configuration was correct, but DataX kept reporting errors.

Solution:

After some searching I discovered the version restriction, so I switched versions.

Question:

After packaging the latest version, running the MySQL Job still reported an error: "the bps value of a single channel cannot be null or a non-positive number when there is a total bps speed limit".

Solution:

In the packaged datax\conf\ configuration, change the value from -1 to 2000000.
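The fix above can also be scripted. This is a sketch under two assumptions not stated in the post (whose file name is truncated): the file is conf/core.json, and the setting is core.transport.channel.speed.byte.

```python
import json

# Assumed location; the post only says datax\conf\ with the file name cut off.
CONF = "datax/conf/core.json"

def patch_channel_bps(conf_path, bps=2000000):
    """Replace a non-positive per-channel byte speed with a positive limit.

    When a total bps limit is set, DataX divides it by the per-channel
    value to size the channel count, so leaving it at -1 (unlimited)
    triggers the error quoted above. Key path is an assumption.
    """
    with open(conf_path) as f:
        conf = json.load(f)
    speed = conf["core"]["transport"]["channel"]["speed"]
    if speed.get("byte", -1) <= 0:
        speed["byte"] = bps
    with open(conf_path, "w") as f:
        json.dump(conf, f, indent=2)
    return speed["byte"]
```

Patching programmatically keeps the change reproducible when you re-package DataX and the conf directory is regenerated.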

 

DataXWeb:

As you know, DataX runs a Job's JSON configuration file from the Python command line to synchronize data sources.
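For reference, such a Job file can be generated from Python. The sketch below follows the general shape of a MySQL-to-MySQL DataX job; the host names, credentials, and table/column names are placeholders, not values from this post.

```python
import json

# Minimal MySQL-to-MySQL DataX job sketch. All connection details below
# are placeholders for illustration only.
job = {
    "job": {
        "setting": {"speed": {"channel": 3}},
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "src_user",
                        "password": "src_pass",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "table": ["source_table"],
                                "jdbcUrl": ["jdbc:mysql://source-host:3306/source_db"],
                            }
                        ],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "dst_user",
                        "password": "dst_pass",
                        "column": ["id", "name"],
                        "connection": [
                            {
                                "table": ["target_table"],
                                "jdbcUrl": "jdbc:mysql://target-host:3306/target_db",
                            }
                        ],
                    },
                },
            }
        ],
    }
}

with open("job.json", "w") as f:
    json.dump(job, f, indent=2)

# The job would then be launched with something like:
#   python bin/datax.py job.json
```

Hand-editing these files for every job is exactly the pain point that DataXWeb addresses.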

So use DataXWeb with it!

At first I used DataXWeb v2.1.2, but its field-mapping configuration was a bit hard to understand, so I switched to the latest version, the 2.1.3 pre-release.

1. Downloading the source code

2. Run datax-admin & datax-executor (modify the configuration files as needed)

The configuration files contain instructions; follow them to configure the DB, paths, and so on.

Relatively speaking, the new version's configuration is easier to understand than the old version's. However, the data on the page doesn't always update after an operation and still needs a manual refresh; I don't know whether this will be adjusted in the future!

As for the steps to create a Job in DataX itself, I won't go into them; creating a Job with DataXWeb is easy.

Other:

Attached are the data sources supported by DataX (the full list is available on GitHub).

DataX's Core Architecture

The Job is split into multiple Tasks by the source-side splitting strategy, and then the Schedule module is called, which groups the Tasks into TaskGroups based on the configured concurrency and other parameters (by default, a TaskGroup holds 5 Tasks).

Each Task starts a thread to complete the Reader -> Channel -> Writer process.
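The split-and-group flow above can be sketched in Python. This is only an illustration of the described architecture, not DataX's actual implementation: the group size of 5 comes from the text, and the queue stands in for DataX's real Channel.

```python
import queue
import threading

def split_into_task_groups(tasks, tasks_per_group=5):
    """Mimic the Schedule step: pack Tasks into TaskGroups (default 5 per group)."""
    return [tasks[i:i + tasks_per_group]
            for i in range(0, len(tasks), tasks_per_group)]

def run_task(records, sink, lock):
    """One Task's thread: Reader -> Channel -> Writer, greatly simplified."""
    channel = queue.Queue()
    for rec in records:       # Reader: push source records into the channel
        channel.put(rec)
    channel.put(None)         # end-of-stream marker
    while True:               # Writer: drain the channel into the target
        rec = channel.get()
        if rec is None:
            break
        with lock:
            sink.append(rec)

def run_job(task_slices):
    """Run each TaskGroup in turn, with one thread per Task inside a group."""
    sink, lock = [], threading.Lock()
    for group in split_into_task_groups(task_slices):
        threads = [threading.Thread(target=run_task, args=(t, sink, lock))
                   for t in group]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return sink
```

So with 12 Task slices, the Schedule step would produce three TaskGroups of 5, 5, and 2 Tasks.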