DataX is Alibaba's offline data synchronization tool for heterogeneous data sources. For a detailed introduction and usage instructions, see the official website and the Quick Start guide. This DataX series walks through how the whole thing runs in more detail.

Configuration

Parsing DataX configuration involves three files: job.json, core.json, and plugin.json. All three are multi-level JSON. Take a JSON like {'a': {'b': {'c': 'd'}}}: with an ordinary JSON library, you would first fetch the sub-object under key a, then the sub-object under key b, and finally read d via key c, so the code is very cumbersome to write.

DataX provides a Configuration class that flattens the JSON so that values can be read directly by a dotted path. Let's look at the following example.

public static String JSON = "{'a': {'b': {'c': 'd'}}}";

public static void main(String[] args) {
    Configuration configuration = Configuration.from(JSON);
    System.out.println(configuration.get("a.b"));
    System.out.println(configuration.get("a.b.c"));
    System.out.println(configuration.get("a.b.d"));
}

The output is shown below. As you can see, multi-level JSON values are easy to retrieve through Configuration. Besides get, there are methods such as merge for merging configurations, getString for String-typed values, and getNecessaryValue for required entries, which are not covered here.

{"c":"d"}
d
null
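
The flattening idea behind those dotted paths can be sketched in plain Java. This is an illustrative re-implementation over nested Maps, not DataX's actual Configuration source; the class and method names here are made up for the example.

```java
import java.util.Map;

// Minimal sketch of dotted-path lookup over nested maps, mimicking
// Configuration.get("a.b.c"). Names are illustrative, not DataX code.
public class FlatLookup {
    // Returns the value at a dotted path, or null if any segment is missing.
    static Object get(Map<String, Object> root, String path) {
        Object current = root;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // path descends into a scalar: no such key
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    public static void main(String[] args) {
        // Equivalent of {'a': {'b': {'c': 'd'}}}
        Map<String, Object> json = Map.of("a", Map.of("b", Map.of("c", "d")));
        System.out.println(get(json, "a.b"));   // the {c=d} sub-map
        System.out.println(get(json, "a.b.c")); // d
        System.out.println(get(json, "a.b.d")); // null
    }
}
```

The walk consumes one path segment per level, which is exactly why callers can skip the tedious get-by-get descent shown earlier.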

job.json

job.json is the job's configuration file. Before the task runs, the full path of this file is passed in as a command-line argument, so its name can be anything you like.

The main contents are job.content.reader, job.content.writer, and job.setting.speed. For reader and writer, refer to the resources/plugin_job_template.json file in each plugin's module, or generate a template directly from the command line (there are examples in Quick Start). These entries specify which reader to read data with, which writer to write it with, and the configuration of each.

setting.speed mainly controls the flow rate, which will be explained in detail later.

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}

core.json

The full path is DATAX_HOME/conf/core.json. It configures global information, such as the number of channels per taskGroup; type conversion is also configured here.

{
    "entry": {
        "jvm": "-Xms1G -Xmx1G",
        "environment": {}
    },
    "common": {
        "column": {
            "datetimeFormat": "yyyy-MM-dd HH:mm:ss",
            "timeFormat": "HH:mm:ss",
            "dateFormat": "yyyy-MM-dd",
            "extraFormats":["yyyyMMdd"],
            "timeZone": "GMT+8",
            "encoding": "utf-8"
        }
    },
    "core": {
        "dataXServer": {
            "address": "http://localhost:7001/api",
            "timeout": 10000,
            "reportDataxLog": false,
            "reportPerfLog": false
        },
        "transport": {
            "channel": {
                "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel",
                "speed": {
                    "byte": -1,
                    "record": -1
                },
                "flowControlInterval": 20,
                "capacity": 512,
                "byteCapacity": 67108864
            },
            "exchanger": {
                "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger",
                "bufferSize": 32
            }
        },
        "container": {
            "job": {
                "reportInterval": 10000
            },
            "taskGroup": {
                "channel": 5
            },
            "trace": {
                "enable": "false"
            }

        },
        "statistics": {
            "collector": {
                "plugin": {
                    "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector",
                    "maxDirtyNumber": 10
                }
            }
        }
    }
}
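
The transport.channel settings above bound the in-memory queue that sits between reader and writer: capacity caps it at 512 records and byteCapacity at 64 MB, so a fast reader blocks instead of exhausting memory. The core of that backpressure idea can be sketched with a standard bounded queue; this is a toy illustration, not the actual MemoryChannel implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the bounded channel idea behind MemoryChannel: the queue's
// capacity mirrors core.transport.channel.capacity, and a full queue
// blocks the producing (reader) side, applying backpressure.
public class ChannelSketch {
    public static void main(String[] args) throws InterruptedException {
        int capacity = 512; // mirrors "capacity": 512 in core.json
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(capacity);

        // Reader side: push records (put blocks when the queue is full).
        channel.put("record-1");
        channel.put("record-2");

        // Writer side: pull records in arrival order.
        System.out.println(channel.take()); // record-1
        System.out.println(channel.take()); // record-2
    }
}
```

Tracking byteCapacity additionally requires summing record sizes, which this toy version omits.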

plugin.json

The full path of plugin.json is DATAX_HOME/plugin/reader/streamreader/plugin.json; this streamreader corresponds to the one configured in job.json above. After loading, a path attribute is added, holding the plugin's actual location on disk.

The key entries in this file are name and class, where class is the plugin class to instantiate at runtime. Since a job has both a reader and a writer, two plugin.json files are loaded here.

{
    "name": "streamreader",
    "class": "com.alibaba.datax.plugin.reader.streamreader.StreamReader",
    "description": {
        "useScene": "only for developer test.",
        "mechanism": "use datax framework to transport data from stream.",
        "warn": "Never use it in your real job."
    },
    "developer": "alibaba"
}
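
Turning that class entry into a live plugin object boils down to reflective instantiation. The sketch below shows the principle with a hypothetical DummyReader standing in for a real plugin class such as com.alibaba.datax.plugin.reader.streamreader.StreamReader; DataX's actual loading additionally involves per-plugin classloaders, which are omitted here.

```java
// Sketch: instantiate a plugin from the "class" string in plugin.json
// via reflection. DummyReader is a stand-in for a real plugin class.
public class PluginLoaderSketch {
    public static class DummyReader {
        public String name() {
            return "dummyreader";
        }
    }

    public static void main(String[] args) throws Exception {
        // In DataX this class name would come from the loaded plugin.json.
        String className = "PluginLoaderSketch$DummyReader";
        Object plugin = Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
        System.out.println(((DummyReader) plugin).name()); // dummyreader
    }
}
```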

Once the three files above, job.json, core.json, and plugin.json, have been loaded, they are combined through the merge method, so the final Configuration holds the merged information from all of them; the plugin is then started with this Configuration.
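
Because all three files are flattened to dotted key paths, merging reduces to combining key-value maps, with a flag deciding which side wins on conflicts. The sketch below illustrates that idea on plain maps; it is not DataX's Configuration.merge source, and the key names are only examples.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of merging flattened configurations: copy keys from "other"
// into "base", overwriting only when updateWhenConflict is set.
public class MergeSketch {
    static void merge(Map<String, Object> base, Map<String, Object> other,
                      boolean updateWhenConflict) {
        for (Map.Entry<String, Object> e : other.entrySet()) {
            if (updateWhenConflict) {
                base.put(e.getKey(), e.getValue());         // other wins
            } else {
                base.putIfAbsent(e.getKey(), e.getValue()); // base wins
            }
        }
    }

    public static void main(String[] args) {
        // Example: a job-level override of a core.json default
        Map<String, Object> job = new HashMap<>(
                Map.of("core.container.taskGroup.channel", 7));
        Map<String, Object> core = Map.of(
                "core.container.taskGroup.channel", 5,
                "core.transport.channel.capacity", 512);
        merge(job, core, false); // keep job.json values on conflict
        System.out.println(job.get("core.container.taskGroup.channel")); // 7
        System.out.println(job.get("core.transport.channel.capacity"));  // 512
    }
}
```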


大军