
Author: Yao Hui (Thousand Xi)

Background

Let's start with what distributed batch processing is. Literally, it means that a large amount of business data needs to be computed and processed by applications in batches. Running such a job on a single machine takes a long time and leaves the processing capacity of the other application nodes in the business cluster unused. With a common distributed batch processing solution, all application nodes in the business cluster can cooperate to complete a large batch data processing task, improving overall processing efficiency and reliability.


Batch model

In a simple single-machine scenario, multiple threads can be started to process a large task concurrently; across multiple machines, the same task can be processed in parallel by several machines at once. A distributed batch processing solution therefore needs to shield developers, at the code level, from the distributed coordination logic between business application clusters, such as task splitting, distribution, parallel execution, result aggregation, failure tolerance, and dynamic scaling, so that users only need to focus on the sharding rules and the processing logic of their own business.
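
The split, parallel execution, and aggregation phases named above can be sketched on a single machine with a thread pool. This is a minimal illustration under stated assumptions, not any framework's implementation: the class name BatchModelSketch, its method names, and the "sum the items" business logic are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical single-machine illustration of the batch model: split, run in parallel, aggregate. */
final class BatchModelSketch {

    /** Split: cut the full data set into subtasks of at most batchSize items. */
    static List<List<Integer>> split(List<Integer> data, int batchSize) {
        List<List<Integer>> subtasks = new ArrayList<>();
        for (int from = 0; from < data.size(); from += batchSize) {
            int to = Math.min(from + batchSize, data.size());
            subtasks.add(new ArrayList<>(data.subList(from, to)));
        }
        return subtasks;
    }

    /** Business logic for one subtask; in this sketch it simply sums the items. */
    static long processSubtask(List<Integer> subtask) {
        long sum = 0;
        for (int item : subtask) {
            sum += item;
        }
        return sum;
    }

    /** Distribute the subtasks to a thread pool and aggregate the partial results. */
    static long runBatch(List<Integer> data, int batchSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<Integer> subtask : split(data, batchSize)) {
                Callable<Long> work = () -> processSubtask(subtask);
                futures.add(pool.submit(work));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // a failed subtask surfaces here
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("batch interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("subtask failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            data.add(i);
        }
        System.out.println(runBatch(data, 10, 4)); // prints 5050
    }
}
```

A distributed solution keeps the same three phases but replaces the thread pool with cluster nodes, which is where the coordination concerns listed above come in.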

Big data batch comparison

In big data processing scenarios we also use the MapReduce model, whose processing logic is essentially the same as the business batch logic discussed here. Batch processing in big data scenarios is mainly oriented toward processing the data itself, and requires deploying a dedicated big data platform cluster to support data storage and batch processing; the main purpose in that scenario is to build a complete data platform. By contrast, this article focuses on distributed business batch processing, which builds distributed batch logic on top of the existing business application service cluster. A distributed batch solution can address the following requirements:

  • Decouple time-consuming business logic to ensure fast response to core link business processing
  • Fully schedule all application nodes in the business cluster to cooperate to complete business processing in batches
  • Unlike big data processing, subtasks call other online business services as part of the batch process

Open source batch solution

ElasticJob

ElasticJob is a distributed task scheduling framework. It implements scheduled triggering on top of Quartz and coordinates task processing across the business cluster. The architecture relies on Zookeeper for sharded task execution, dynamic elastic scheduling across the application cluster, and high availability of subtask execution. Its sharding scheduling model distributes large-volume business data processing evenly across the nodes of the business cluster, which effectively improves task processing efficiency.


  • SimpleJob

Spring Boot projects can configure task definitions through YAML, specifying the following: task implementation class, timing scheduling period, and sharding information.

 elasticjob:
  regCenter:
    serverLists: 127.0.0.1:2181
    namespace: elasticjob-lite-springboot
  jobs:
    simpleJob:
      elasticJobClass: org.example.job.SpringBootSimpleJob
      cron: 0/5 * * * * ?
      overwrite: true
      shardingTotalCount: 3
      shardingItemParameters: 0=Beijing,1=Shanghai,2=Guangzhou

The configured org.example.job.SpringBootSimpleJob class needs to implement the execute method of the SimpleJob interface, and obtain the corresponding business shard data through the ShardingContext parameter for business logic processing.

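import org.apache.shardingsphere.elasticjob.api.ShardingContext;
import org.apache.shardingsphere.elasticjob.simple.job.SimpleJob;
import org.springframework.stereotype.Component;

@Component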
public class SpringBootSimpleJob implements SimpleJob {
    @Override
    public void execute(final ShardingContext shardingContext) {
        String value = shardingContext.getShardingParameter();
        System.out.println("simple.process->"+value);
    }
}

We deploy 3 application services as the scheduling cluster for the task above. When the task is triggered, ElasticJob assigns the 3 shards to the 3 application services, which together complete the data processing for the whole task.
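
How shard items might be spread over the running instances can be sketched as follows. This is a simplified even-allocation rule written for illustration only; it is not ElasticJob's source code, and the class name ShardAllocationSketch and the instance names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of spreading shard items evenly over instances; not ElasticJob's source. */
final class ShardAllocationSketch {

    /** Returns, for each instance, the shard item numbers (0-based) it should process. */
    static Map<String, List<Integer>> allocate(List<String> instances, int shardingTotalCount) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        if (instances.isEmpty()) {
            return result;
        }
        int perInstance = shardingTotalCount / instances.size();
        int next = 0;
        // Every instance first receives the same number of consecutive items.
        for (String instance : instances) {
            List<Integer> items = new ArrayList<>();
            for (int i = 0; i < perInstance; i++) {
                items.add(next++);
            }
            result.put(instance, items);
        }
        // Leftover items are handed out one by one, starting from the first instance.
        for (String instance : instances) {
            if (next >= shardingTotalCount) {
                break;
            }
            result.get(instance).add(next++);
        }
        return result;
    }
}
```

With 3 instances and shardingTotalCount set to 3, each instance receives exactly one shard, matching the deployment described above. With 5 instances and 3 shards, two instances receive nothing, which illustrates why a fixed shard count limits how far the cluster can scale out.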


  • DataflowJob

DataflowJob has essentially the same overall structure as SimpleJob. As the interface below shows, compared with SimpleJob it adds a fetchData method that the business side implements to load the data to be processed; in effect, the execute method of SimpleJob is split logically into two steps. The only real difference is that DataflowJob can run as a resident data processing task (streaming process), which keeps the task running until fetchData returns no data.

 public interface DataflowJob<T> extends ElasticJob {

    /**
     * Fetch to be processed data.
     *
     * @param shardingContext sharding context
     * @return to be processed data
     */
    List<T> fetchData(ShardingContext shardingContext);

    /**
     * Process data.
     *
     * @param shardingContext sharding context
     * @param data to be processed data
     */
    void processData(ShardingContext shardingContext, List<T> data);
}

Adding props: streaming.process=true to the YAML configuration of a DataflowJob task enables streaming processing. When the task is triggered, each shard runs in a loop, fetchData->processData->fetchData, until fetchData returns no data. This mode suits the following scenarios:

  • When a single shard has a large amount of data, fetchData reads one page of the shard's data at a time for processing, until all data has been processed.
  • When data for a shard is generated continuously, the task can keep obtaining data through fetchData, so it stays resident and processes business data continuously.
 elasticjob:
  regCenter:
    serverLists: 127.0.0.1:2181
    namespace: elasticjob-lite-springboot
  jobs:
    dataflowJob:
      elasticJobClass: org.example.job.SpringBootDataflowJob
      cron: 0/5 * * * * ?
      overwrite: true
      shardingTotalCount: 3
      shardingItemParameters: 0=Beijing,1=Shanghai,2=Guangzhou
      props:
        # Enable streaming process
        streaming.process: true
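
The fetchData->processData loop described above can be sketched without the framework. In this illustration, PagedJob is a hypothetical stand-in for a dataflow-style job on a single shard (it is not the ElasticJob interface), and InMemoryJob assumes records are held in a list and read one page at a time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the streaming loop: fetchData -> processData until no data is left. */
final class StreamingLoopSketch {

    /** Minimal stand-in for a dataflow-style job on one shard (not the ElasticJob interface). */
    interface PagedJob<T> {
        List<T> fetchData();

        void processData(List<T> data);
    }

    /** Runs the loop for one shard and returns how many fetch rounds produced data. */
    static <T> int runStreaming(PagedJob<T> job) {
        int rounds = 0;
        List<T> page = job.fetchData();
        while (page != null && !page.isEmpty()) {
            job.processData(page);
            rounds++;
            page = job.fetchData();
        }
        return rounds;
    }

    /** In-memory job: pending records are read pageSize at a time and moved to "done". */
    static final class InMemoryJob implements PagedJob<String> {
        private final List<String> pending;
        private final int pageSize;
        final List<String> done = new ArrayList<>();

        InMemoryJob(List<String> records, int pageSize) {
            this.pending = new ArrayList<>(records);
            this.pageSize = pageSize;
        }

        @Override
        public List<String> fetchData() {
            int to = Math.min(pageSize, pending.size());
            return new ArrayList<>(pending.subList(0, to));
        }

        @Override
        public void processData(List<String> data) {
            done.addAll(data);
            // Drop the processed page so the next fetch returns the following one.
            pending.subList(0, data.size()).clear();
        }
    }
}
```

The important design point is that processData must change what the next fetchData sees (here by removing the processed page); otherwise the loop would fetch the same page forever.
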
  • Characteristic Analysis

ElasticJob's distributed sharding scheduling model is very convenient for common, simple batch processing scenarios, and it handles the whole coordination process of sharding and executing a large batch of business data across the cluster. However, it may fall short in the following respects:

  • The core of the entire architecture depends on ZK stability
    • Additional operation and maintenance deployment is required and its high availability must be guaranteed
    • Storing, triggering, and running a large number of tasks all rely on ZK. When the task volume is large, the ZK cluster can easily become a scheduling performance bottleneck.
  • The number of sharding configurations is fixed, and dynamic sharding is not supported
    • For example, when the amount of data to be processed by each shard varies greatly, it is easy to break the balance of cluster processing capacity
    • If the definition of sharding is unreasonable, the cluster elasticity will lose its effect when the cluster size is much larger than the number of shards
    • Sharding definition and business logic are relatively separate, and it is more troublesome to maintain the connection between the two artificially
  • Weak console capability

Spring Batch batch framework

Spring Batch is a batch processing framework that provides lightweight yet complete batch processing capabilities. It offers two main modes: single-process multi-threaded processing and distributed multi-process processing. In the single-process multi-threaded mode, the user defines a Job as the batch task unit. A Job consists of one or more Steps arranged in series or in parallel, and each Step completes the reading, processing, and output of its part of the task. The rest of this section analyzes the case where a Job contains only one Step.


In my opinion, Spring Batch's single-process multi-threading mode has limited practical value: using the framework for small batch tasks is rather heavyweight, and a self-managed thread pool can solve the problem just as well. This discussion focuses on business clusters of a certain scale that complete batch processing tasks collaboratively in a distributed way. Spring Batch provides remote chunking and remote partitioning for this purpose: within a Step of a Job, the task can be split into multiple subtasks according to specific rules and distributed to other workers in the cluster for processing, achieving distributed parallel batch processing. The remote interaction usually relies on third-party message middleware to distribute subtasks and aggregate execution results.

  • Remote Chunking

Remote chunking is the distributed batch solution Spring Batch provides for processing large batches of data. Within a Step, the ItemReader loads the data and groups it into multiple chunks, and the ItemWriter distributes these chunks to the cluster nodes through message middleware or another channel; the cluster application nodes then perform the business processing on each chunk.
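
The chunk-and-distribute idea can be sketched with the standard library alone. In this illustration an in-memory BlockingQueue stands in for the message middleware, and the class RemoteChunkingSketch, its method names, and the "double each item" business logic are hypothetical; none of this is Spring Batch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of remote chunking; an in-memory queue stands in for the message middleware. */
final class RemoteChunkingSketch {

    /** Master side: read items one by one and group them into chunks of chunkSize. */
    static List<List<Integer>> buildChunks(List<Integer> items, int chunkSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (Integer item : items) {
            current.add(item);
            if (current.size() == chunkSize) {
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current); // last, possibly smaller, chunk
        }
        return chunks;
    }

    /** Master side: "write" every chunk to the request channel. */
    static void distribute(List<List<Integer>> chunks, BlockingQueue<List<Integer>> requests) {
        requests.addAll(chunks);
    }

    /** Worker side: take chunks from the channel and apply the business logic (doubling here). */
    static List<Integer> drainAndProcess(BlockingQueue<List<Integer>> requests) {
        List<Integer> output = new ArrayList<>();
        List<Integer> chunk;
        while ((chunk = requests.poll()) != null) {
            for (int item : chunk) {
                output.add(item * 2);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        BlockingQueue<List<Integer>> requests = new LinkedBlockingQueue<>();
        distribute(buildChunks(Arrays.asList(1, 2, 3, 4, 5), 2), requests);
        System.out.println(drainAndProcess(requests)); // prints [2, 4, 6, 8, 10]
    }
}
```

Note that in this model the master still reads every record itself, so the master's read throughput bounds the whole job.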


Remote Chunking example

The ItemReader and ItemWriter on the master node map to the "task splitting" phase of the batch model discussed in this article. For the ItemWriter, the master node can use the ChunkMessageChannelItemWriter provided by Spring Batch Integration, which uses channels provided by Spring Integration (such as AMQP or JMS) to load the batch task data and distribute the chunks.

 @Bean
    public Job remoteChunkingJob() {
         return jobBuilderFactory.get("remoteChunkingJob")
             .start(stepBuilderFactory.get("step2")
                     .<Integer, Integer>chunk(2) // number of records loaded by the reader per chunk
                     .reader(itemReader())
                     // Use ChunkMessageChannelItemWriter to distribute the chunks
                     .writer(itemWriter())
                     .build())
             .build();
     }

    @Bean
    public ItemReader<Integer> itemReader() {
        return new ListItemReader<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
    }

    @Bean
    public ItemWriter<Integer> itemWriter() {
        MessagingTemplate messagingTemplate = new MessagingTemplate();
        messagingTemplate.setDefaultChannel(requests());
        ChunkMessageChannelItemWriter<Integer> chunkMessageChannelItemWriter = new ChunkMessageChannelItemWriter<>();
        chunkMessageChannelItemWriter.setMessagingOperations(messagingTemplate);
        chunkMessageChannelItemWriter.setReplyChannel(replies());
        return chunkMessageChannelItemWriter;
    }
    // Channel configuration for the message middleware is omitted

The Slave node mainly performs corresponding business logic processing and data result output on the distributed Chunk block data (which can be understood as subtasks). Therefore, on the subtask processing side, you need to configure the ChunkProcessorChunkHandler provided by Spring Batch Integration to complete related actions such as subtask reception, actual business processing, and feedback processing results.

    // Channel configuration for the message middleware is omitted

    // Receive chunk tasks and send back the execution results
    @Bean
    @ServiceActivator(inputChannel = "slaveRequests", outputChannel = "slaveReplies")
    public ChunkProcessorChunkHandler<Integer> chunkProcessorChunkHandler() {
        ChunkProcessor<Integer> chunkProcessor = new SimpleChunkProcessor(slaveItemProcessor(), slaveItemWriter());
        ChunkProcessorChunkHandler<Integer> chunkProcessorChunkHandler = new ChunkProcessorChunkHandler<>();
        chunkProcessorChunkHandler.setChunkProcessor(chunkProcessor);
        return chunkProcessorChunkHandler;
    }

    // Task processing logic (processor) to be developed for the actual business
    @Bean
    public SlaveItemProcessor slaveItemProcessor(){ return new SlaveItemProcessor();}

    // Task processing logic (writer) to be developed for the actual business
    @Bean
    public SlaveItemWriter slaveItemWriter(){ return new SlaveItemWriter();}
  • Remote Partitioning

The main difference between remote partitioning and remote chunking is that the master node is not responsible for loading data. The current Step is split into multiple sub-steps (which can also be understood as subtasks) by the Partitioner, and the PartitionHandler then distributes these subtasks to the Slave nodes for processing. For this purpose, Spring Batch Integration provides MessageChannelPartitionHandler to distribute the subtasks; underneath, it also relies on message middleware. Each Slave node reads the context information of its sub-step and, based on that information, performs the complete ItemReader, ItemProcessor, and ItemWriter processing.
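
What a partitioner produces can be sketched as follows. Plain maps stand in for the per-sub-step context here, and the class RangePartitionSketch, the "minValue"/"maxValue" keys, and the ID-range splitting rule are assumptions made for illustration rather than Spring Batch code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a range partitioner; plain maps stand in for the sub-step contexts. */
final class RangePartitionSketch {

    /** Splits the ID range [minId, maxId] into at most gridSize contiguous partitions. */
    static Map<String, Map<String, Long>> partition(long minId, long maxId, int gridSize) {
        Map<String, Map<String, Long>> partitions = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        long start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long end = Math.min(start + size - 1, maxId);
            Map<String, Long> context = new LinkedHashMap<>();
            context.put("minValue", start); // each worker later reads only its own range
            context.put("maxValue", end);
            partitions.put("partition" + i, context);
            start = end + 1;
        }
        return partitions;
    }
}
```

Only these small range descriptions travel to the workers; each worker then loads, processes, and writes the records in its own range, which is what removes the master's data-loading bottleneck.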


  • Characteristic Analysis

Spring Batch framework, comprehensive feature analysis:

  • It has complete batch processing capabilities: supports single-machine multi-threading, distributed multi-process collaborative batch processing, and supports custom sharding models.
  • Lack of scheduled triggering support: Spring Batch has no native scheduling capability and needs to integrate a third-party scheduling framework (for example, with Spring Task you must handle repeated triggering across the cluster yourself).
  • Weak visual management and control capabilities: Spring Batch often uses programs or files to configure tasks, and the console needs to be additionally built and the management and control capabilities are weak.
  • High integration difficulty: its distributed batch processing capability requires additional third-party middleware integration and development, or self-expanding development based on its interface; completing enterprise-level use based on the officially provided method requires relatively complex planning and integration.

Enterprise Batch Solution - SchedulerX Visual MapReduce Tasks

The SchedulerX task scheduling platform provides a complete solution for enterprise-level batch processing needs. Users can use the public cloud service directly to give their business application clusters distributed batch processing capabilities (business applications deployed outside Alibaba Cloud can also connect to it), with no need to deploy, integrate, or maintain other middleware.

Principle analysis

Throughout the solution, the task scheduling platform provides comprehensive visual management and control, highly reliable scheduled triggering, and visual query capabilities for the tasks users register. By integrating the SchedulerX SDK into the business application, users get fast access to distributed batch processing. Users then only need to care about the subtask splitting rules of the batch model and the processing logic of each subtask. This distributed batch process has the following characteristics:

  • High availability of subtasks: When the cluster execution node is down, it supports automatic failover to redistribute subtasks on the offline machine to other nodes
  • Automatic elastic scaling: when new application nodes are deployed in the cluster, they automatically participate in the execution of subsequent tasks
  • Visualization capability: Provide various monitoring operation and maintenance and business log query capabilities for the execution process of tasks and subtasks
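
The failover behavior in the first item can be pictured with a small sketch. It assumes a simple round-robin reassignment of the offline node's subtasks to the surviving nodes; the class FailoverSketch and the node and subtask names are hypothetical, and this is not how SchedulerX is necessarily implemented internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of subtask failover; not SchedulerX's implementation. */
final class FailoverSketch {

    /** Moves the subtasks of an offline node to the surviving nodes in round-robin order. */
    static Map<String, List<String>> failover(Map<String, List<String>> assignment, String offlineNode) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : assignment.entrySet()) {
            if (!entry.getKey().equals(offlineNode)) {
                result.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        List<String> orphaned = assignment.getOrDefault(offlineNode, new ArrayList<>());
        if (orphaned.isEmpty()) {
            return result; // nothing to redistribute
        }
        if (result.isEmpty()) {
            throw new IllegalStateException("no surviving node to take over subtasks");
        }
        List<String> survivors = new ArrayList<>(result.keySet());
        for (int i = 0; i < orphaned.size(); i++) {
            result.get(survivors.get(i % survivors.size())).add(orphaned.get(i));
        }
        return result;
    }
}
```

For example, if node-b goes offline while holding subtasks t3 and t4, the sketch hands t3 to the first surviving node and t4 to the next one, leaving their existing subtasks untouched.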


The general principle process is described below:

  • After the platform creates a MapReduce task, the timing scheduling service will enable highly reliable timing trigger execution for it
  • When a MapReduce task is triggered to execute, the scheduling service will select a node among the connected business worker nodes as the master node for this task to run.
  • The master node runs the user-developed logic that splits and loads subtasks, and distributes subtask processing requests evenly to the other worker nodes in the cluster through the map method
  • The master node monitors the processing of the entire distributed batch task and the health of each worker node to ensure high availability of the overall run
  • After receiving a subtask processing request, each worker node calls back the user-defined business logic to complete that subtask; the number of threads a single application node uses to process subtasks in parallel is configurable.
  • After all subtasks are completed, the master node will aggregate the execution results of all subtasks, call back the reduce method, and feed back to the scheduling platform to record the execution results.
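
The flow above can be walked through with a sequential, in-memory sketch. It only mirrors the order of events (map, per-subtask processing, reduce); the class MapReduceFlowSketch, the round-robin dispatch, and all names in it are assumptions for illustration and do not come from the SchedulerX SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical in-memory walk-through of the flow above; names do not come from the SchedulerX SDK. */
final class MapReduceFlowSketch {

    /** Outcome of one run: the dispatch log, the subtask results, and the reduced summary. */
    static final class RunResult {
        final List<String> dispatchLog = new ArrayList<>();
        final List<String> subtaskResults = new ArrayList<>();
        String summary;
    }

    /**
     * Simulates one triggered run: the master "maps" the subtasks to workers in turn,
     * each worker applies the user logic, and the master reduces the collected results.
     */
    static RunResult run(List<String> workers, List<String> subtasks, Function<String, String> subtaskLogic) {
        if (workers.isEmpty()) {
            throw new IllegalArgumentException("at least one worker is required");
        }
        RunResult run = new RunResult();
        for (int i = 0; i < subtasks.size(); i++) {
            String worker = workers.get(i % workers.size()); // balanced, round-robin dispatch
            String subtask = subtasks.get(i);
            run.dispatchLog.add(worker + "<-" + subtask);
            run.subtaskResults.add(subtaskLogic.apply(subtask)); // worker-side callback
        }
        // Reduce: runs once, after every subtask has reported a result.
        run.summary = "processed " + run.subtaskResults.size() + " subtasks";
        return run;
    }
}
```

Calling run with workers w1 and w2 and subtasks a, b, c dispatches a and c to w1 and b to w2, then produces the summary "processed 3 subtasks". A real platform runs the subtasks concurrently on separate nodes and adds the monitoring and failover described above.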

Developers only need to extend the MapReduceJobProcessor abstract class in the business application. In the isRootTask branch, load the list of business subtask objects to be processed in this run; in non-root requests, obtain the single subtask object through jobContext.getTask() and run the business processing logic on it. After the business application is deployed to the cluster nodes, every node in the cluster takes part in coordinating and executing the distributed batch task each time it is triggered, until it completes.

 public class MapReduceHelloProcessor extends MapReduceJobProcessor {

    @Override
    public ProcessResult reduce(JobContext jobContext) throws Exception {
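        // Optional callback: aggregation logic that runs after all subtasks are complete
        jobContext.getTaskResults();
        return new ProcessResult(true, "Number of results: " + jobContext.getTaskResults().size());
    }

    @Override
    public ProcessResult process(JobContext jobContext) throws Exception {
        if (isRootTask(jobContext)) {
            // loadSubTasks() is a placeholder for business code that loads the subtask list
            List<String> list = loadSubTasks();
            // Call the map method provided by the SDK to distribute the subtasks automatically
            ProcessResult result = map(list, "SecondDataProcess");
            return result;
        } else {
            // Get the data of a single subtask and run its business processing
            String data = (String) jobContext.getTask();
            // ... business logic goes here ...
            return new ProcessResult(true, "Data processed successfully!");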
        }
    }
}

Functional advantage

  • Subtask visualization capabilities

User dashboard: Provides all task triggering and running visual record information.


Visualize subtask details: By querying the task execution record details, you can get the execution status and node of each subtask.


  • Subtask business log

Click "Log" in the subtask list to get the log record information during the processing of the current subtask.


  • Execute stack view

The execution stack view function can be used to check the stack information of the corresponding execution thread in the scenario where the subtask is stuck and has not been completed.


  • Custom business labels

The subtask business tag capability lets users view and query subtask business information quickly and visually. For example, with "account name" as the business tag of each subtask, users can quickly see the processing status of the corresponding business subtask and query the processing status of subtasks by a specific tag value.


To configure custom labels for subtasks, have the subtask objects distributed by map implement the BizSubTask interface, and implement its labelMap method to attach business labels to each subtask for visual querying.

 public class AccountTransferProcessor extends MapReduceJobProcessor {

    private static final Logger logger = LoggerFactory.getLogger("schedulerxLog");

    @Override
    public ProcessResult reduce(JobContext context) throws Exception {
        return new ProcessResult(true);
    }

    @Override
    public ProcessResult process(JobContext context) throws Exception {
        if(isRootTask(context)){
            logger.info("split task list size:20");
            List<AccountInfo> list = new LinkedList<>();
            for(int i=0; i < 20; i++){
                list.add(new AccountInfo(i, "CUS"+StringUtils.leftPad(i+"", 4, "0"),
                        "AC"+StringUtils.leftPad(i+"", 12, "0")));
            }
            return map(list, "transfer");
        }else {
            logger.info("start biz process...");
            logger.info("task info:"+context.getTask().toString());
            TimeUnit.SECONDS.sleep(30L);
            logger.info("start biz process end.");
            return new ProcessResult(true);
        }
    }
}

public class AccountInfo implements BizSubTask {

        private long id;

        private String name;

        private String accountId;

        public AccountInfo(long id, String name, String accountId) {
            this.id = id;
            this.name = name;
            this.accountId = accountId;
        }

        // Set the label information of the subtask
        @Override
        public Map<String, String> labelMap() {
            Map<String, String> labelMap = new HashMap<>();
            labelMap.put("Account name", name);
            return labelMap;
        }
    }
  • Compatible with open source

SchedulerX supports executors written with common open source frameworks, including XXL-Job and ElasticJob. Support for scheduling Spring Batch tasks is also planned.

Case scenarios

The distributed batch model (visualized MapReduce model) has a large number of demand scenarios in actual enterprise-level applications. Some common usage scenarios are:

  • For batch parallel processing of data in sharded databases and tables, the database or table information is distributed among cluster nodes as subtask objects to achieve parallel processing
  • For processing logistics order data by city or region, the cities and regions are distributed among the cluster nodes as subtask objects to achieve parallel processing
  • Taking advantage of the subtask visualization of Visual MapReduce, key customers or orders can be used as subtask objects for generating data reports or pushing information, so that important subtasks can be tracked and handled visually
  • Fund sales business case

The following fund sales business case is provided for reference; users can adapt the distributed batch model freely to their own business scenarios. Case description: every day, a fund company and each fund sales company (such as Ant Fortune) synchronize investors' account and transaction application data, usually by exchanging data files. A fund company typically works with many vendors (and vice versa), and the data files provided by each vendor are completely independent. Each vendor's data files go through the same fixed steps: file verification, interface file parsing, data verification, and data import. Distributed batch processing is well suited to speeding up these steps for the fund company: each vendor is distributed to the cluster as a subtask object, and all application nodes take part in parsing and processing the data files of the vendors assigned to them.

 @Component
public class FileImportJob extends MapReduceJobProcessor {

    private static final Logger logger = LoggerFactory.getLogger("schedulerx");

    @Override
    public ProcessResult reduce(JobContext context) throws Exception {
        return new ProcessResult(true);
    }

    @Override
    public ProcessResult process(JobContext context) throws Exception {
        if(isRootTask(context)){
            // ---------------------------------------------------------
            // Step 1. Read the list of connected vendors
            // ---------------------------------------------------------
            logger.info("Building the subtask list, one subtask per vendor...");

            // Pseudocode: read the vendor list from the database. The Agency class should implement
            // BizSubTask and can use the vendor name/code as the subtask label for console tracking
            List<Agency> agencylist = getAgencyListFromDb();
            return map(agencylist, "fileImport");
        }else {
            // ---------------------------------------------------------
            // Step 2. Process the data file of a single vendor
            // ---------------------------------------------------------
            Agency agency = (Agency)context.getTask();
            File file = loadFile(agency);
            logger.info("File loaded.");
            validFile(file);
            logger.info("File verification passed.");
            List<Request> requests = resolveRequest(file);
            logger.info("File data parsed.");
            // Use a new variable name; redeclaring the same local variable would not compile
            List<Request> checkedRequests = checkRequest(requests);
            logger.info("Application data check passed.");
            importRequest(checkedRequests);
            logger.info("Application data imported.");
            return new ProcessResult(true);
        }
    }
}


This case uses parallel batch processing for one business step in fund transaction clearing; each subsequent step can be handled in a similar way. In addition, visual MapReduce task nodes can be orchestrated through DAG dependencies to build a complete automated clearing process.

Summary

The distributed task scheduling platform SchedulerX provides a complete solution for enterprise-level distributed batch processing. It offers a fast, easy-to-use access mode and supports scheduled triggering, visual execution tracking, manageable and simple operation and maintenance, and highly available scheduling services, together with enterprise-level monitoring, log services, and alerting capabilities.

References:

Spring Batch Integration:

https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#springBatchIntegration

ElasticJob:

https://shardingsphere.apache.org/elasticjob/current/cn/overview/

Distributed task scheduling SchedulerX manual:

https://help.aliyun.com/document_detail/161998.html

How SchedulerX helps users solve distributed task scheduling:

https://mp.weixin.qq.com/s/EgyfS1Vuv4itnuxbiT7KwA

