Spring Batch is a powerful batch processing framework. Do you know how to use it?

Introduction to Spring Batch

Spring Batch is a data processing framework provided by Spring. Many applications in the enterprise domain require bulk processing to perform business operations in mission-critical environments. These business operations include:

Automated, complex processing of large volumes of information that is most efficiently handled without user interaction. These operations typically include time-based events (such as month-end calculations, notices, or correspondence).
Periodic application of complex business rules, processed repetitively across very large data sets (for example, insurance benefit determination or rate adjustments).
Integration of information received from internal and external systems, which typically needs to be formatted, validated, and processed in a transactional manner into the system of record.

Batch processing is used to process billions of transactions every day for enterprises.

Spring Batch is a lightweight, comprehensive batch processing framework designed for developing the robust batch applications that are vital to the daily operation of enterprise systems. It builds on the characteristics people expect of the Spring Framework (productivity, a POJO-based development approach, and general ease of use), while making it easy for developers to access and use more advanced enterprise services when necessary. Spring Batch is not a scheduling framework.

Spring Batch provides reusable functions that are essential for processing large volumes of records, including logging and tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that enable extremely high-volume, high-performance batch jobs through optimization and partitioning techniques.

Spring Batch can be used both for simple use cases (such as reading a file into a database or running a stored procedure) and for complex, high-volume use cases (such as moving large amounts of data between databases and transforming it along the way). High-volume batch jobs can use the framework to process significant volumes of information in a highly scalable manner.

Introduction to Spring Batch architecture

A typical batch processing application is roughly as follows:

Read a large number of records from a database, file or queue.
Process the data in a certain way.
Write back the data in the modified form.

The corresponding schematic diagram is as follows:

The overall architecture of Spring Batch is as follows:

In Spring Batch, a job can define many steps. Each step can define its own ItemReader for reading data, ItemProcessor for processing data, and ItemWriter for writing data. Every defined job is stored in the JobRepository, and a job is started through the JobLauncher.

Introduction to Spring Batch core concepts

The following are the core concepts of the Spring Batch framework.

What is Job

Job and Step are the two core concepts Spring Batch uses to carry out batch processing tasks.

A Job encapsulates an entire batch process. It is the top-level abstraction in Spring Batch, and in code it is simply the top-level interface, defined as follows:

/**
 * Batch domain object representing a job. Job is an explicit abstraction
 * representing the configuration of a job specified by a developer. It should
 * be noted that restart policy is applied to the job as a whole and not to a
 * step.
 */
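public interface Job {

    // Name of the job.
    String getName();

    // Whether the job as a whole may be restarted.
    boolean isRestartable();

    // Runs the job, recording progress and status in the given JobExecution.
    void execute(JobExecution execution);

    // Supplies the parameters for the next instance in a sequence; may be null.
    JobParametersIncrementer getJobParametersIncrementer();

    // Validates the JobParameters before the job is executed.
    JobParametersValidator getJobParametersValidator();

}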

The Job interface defines five methods. It has two main implementations: SimpleJob and FlowJob. Job is the top-level abstraction in Spring Batch; below it are two lower-level abstractions, JobInstance and JobExecution.

A job is the basic unit we run, and internally it is composed of steps. In essence, a job is a container for steps. It combines steps in a specified logical order and lets us configure properties that apply to all of its steps, such as event listeners and skip policies.

Spring Batch provides a default simple implementation of the Job interface, the SimpleJob class, which adds some standard functionality on top of Job. An example using Java config is as follows:

@Bean
public Job footballJob() {
    // The steps run sequentially, in the order in which they are chained.
    return this.jobBuilderFactory.get("footballJob")
                     .start(playerLoad())
                     .next(gameLoad())
                     .next(playerSummarization())
                     .build();
}

The meaning of this configuration is: first give the job a name called footballJob, and then specify the three steps of this job, which are implemented by the methods, playerLoad, gameLoad, and playerSummarization.

What is JobInstance

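We have already mentioned JobInstance above. It is a lower-level abstraction than Job. The listing below is the JSR-352 interface javax.batch.runtime.JobInstance, which Spring Batch's own JobInstance class implements in the 3.x and 4.x releases: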

public interface JobInstance {
 /**
  * Get unique id for this JobInstance.
  * @return instance id
  */
 public long getInstanceId();
 /**
  * Get job name.
  * @return value of 'id' attribute from <job>
  */
 public String getJobName(); 
}

Its methods are very simple: one returns the id of the job instance, and the other returns the name of the job.

A JobInstance represents one logical run of a job.

For example, suppose there is a batch job that runs once at the end of every day, and that it is named 'EndOfDay'. In this case there is one logical JobInstance per day, and each run of the job must be tracked separately.

What is JobParameters

As mentioned above, if the same job runs once a day, there is one JobInstance per day, yet all of them share the same job definition. How, then, do we tell the JobInstances of a job apart? A reasonable guess is that, although the job definition is the same, something else differs between them, such as the run date.

What Spring Batch provides to identify a JobInstance is JobParameters. A JobParameters object holds the set of parameters used to start a batch job; they can be used for identification or even as reference data during the run. The run date from our guess can serve as one such parameter.

For example, our 'EndOfDay' job now has two instances, one created on January 1st and the other on January 2nd. We can define two JobParameters objects, one with the parameter 01-01 and the other with 01-02. The way a JobInstance is identified can therefore be written as:

JobInstance = Job + identifying JobParameters

Therefore, we can select the correct JobInstance through its JobParameters.
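The identification rule can be modeled in a few lines of plain Java. This is only a sketch of the idea, not how Spring Batch implements it: JobInstanceRegistry and getOrCreateInstance are hypothetical names, and the model treats every parameter as identifying.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: a job instance is looked up by (job name + parameters).
class JobInstanceRegistry {
    private final Map<List<Object>, Long> instances = new HashMap<>();
    private long nextId = 1;

    // Returns the id of the instance registered for this job name and
    // parameter set, registering a new instance on first sight.
    long getOrCreateInstance(String jobName, Map<String, String> parameters) {
        // Copy the parameters so the key does not change if the caller's map does.
        List<Object> key = Arrays.<Object>asList(jobName, new TreeMap<String, String>(parameters));
        return instances.computeIfAbsent(key, k -> nextId++);
    }
}
```

Launching 'EndOfDay' twice with the parameter 01-01 yields the same instance id, while launching it with 01-02 yields a new one.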

What is JobExecution

A JobExecution is the code-level concept of a single attempt to run a job. An execution may fail or succeed, and the JobInstance that corresponds to it is considered complete only when an execution completes successfully.

Taking the EndOfDay job described earlier as an example, suppose the first run of the JobInstance for 01-01-2019 fails. If the job is run again with the same JobParameters as the first run (that is, 01-01-2019), a new JobExecution is created for the same JobInstance, and there is still only one JobInstance.
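The relationship between one JobInstance and its JobExecutions can be sketched as follows. ExecutionHistory is a hypothetical plain-Java model, not a Spring Batch class; it assumes a single date parameter and uses IllegalStateException where Spring Batch would raise its own exception for an already completed instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "one JobInstance, many JobExecutions".
class ExecutionHistory {
    // Instance key (job name + date parameter) -> outcome of each attempt.
    private final Map<String, List<Boolean>> attempts = new HashMap<>();

    // Records one execution attempt. An instance that has already completed
    // successfully cannot be run again.
    void run(String jobName, String dateParameter, boolean succeeded) {
        String instanceKey = jobName + "/" + dateParameter;
        List<Boolean> history = attempts.computeIfAbsent(instanceKey, k -> new ArrayList<>());
        if (history.contains(Boolean.TRUE)) {
            throw new IllegalStateException("instance already complete: " + instanceKey);
        }
        history.add(succeeded);
    }

    // Number of distinct instances seen so far.
    int instanceCount() {
        return attempts.size();
    }

    // Number of execution attempts recorded for one instance.
    int executionCount(String jobName, String dateParameter) {
        List<Boolean> history = attempts.get(jobName + "/" + dateParameter);
        return history == null ? 0 : history.size();
    }
}
```

A failed run followed by a successful rerun with the same date leaves one instance with two executions, and a third attempt is rejected.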

The listing below is the JSR-352 interface javax.batch.runtime.JobExecution; Spring Batch's own JobExecution class exposes similar properties:

public interface JobExecution {
 /**
  * Get unique id for this JobExecution.
  * @return execution id
  */
 public long getExecutionId();
 /**
  * Get job name.
  * @return value of 'id' attribute from <job>
  */
 public String getJobName(); 
 /**
  * Get batch status of this execution.
  * @return batch status value.
  */
 public BatchStatus getBatchStatus();
 /**
  * Get time execution entered STARTED status. 
  * @return date (time)
  */
 public Date getStartTime();
 /**
  * Get time execution entered end status: COMPLETED, STOPPED, FAILED 
  * @return date (time)
  */
 public Date getEndTime();
 /**
  * Get execution exit status.
  * @return exit status.
  */
 public String getExitStatus();
 /**
  * Get time execution was created.
  * @return date (time)
  */
 public Date getCreateTime();
 /**
  * Get time execution was last updated.
  * @return date (time)
  */
 public Date getLastUpdatedTime();
 /**
  * Get job parameters for this execution.
  * @return job parameters  
  */
 public Properties getJobParameters();
 
}

The comments on each method are self-explanatory, so only BatchStatus needs a mention here. JobExecution provides the getBatchStatus method to obtain the status of a particular execution of a job. BatchStatus is an enumeration representing that status, defined as follows:

public enum BatchStatus {STARTING, STARTED, STOPPING, 
   STOPPED, FAILED, COMPLETED, ABANDONED }

These attributes are key information about the execution of a job, and Spring Batch persists them to the database. Spring Batch can create the tables it needs for job-related metadata automatically; the table that stores JobExecution is batch_job_execution. The following is an example screenshot from the database:

What is Step

Each Step object encapsulates an independent phase of a batch job. Every Job is composed of one or more steps, and each step contains all the information needed to define and control the actual batch processing. What a step actually does is at the discretion of the developer who writes the job.

A step can be very simple or very complex. A simple step might load data from a file into the database, which requires little or no code thanks to Spring Batch's built-in support. A more complex step might apply complicated business rules as part of its processing.

Like Job, Step has a StepExecution that plays the same role as JobExecution, as shown in the following figure:

What is StepExecution

A StepExecution represents a single attempt to execute a Step. As with JobExecution, a new StepExecution is created each time a Step is run. However, if a step never runs because the step before it failed, no StepExecution is created for it; a StepExecution is created only when its Step actually starts.

Each step execution is represented by an object of the StepExecution class. It holds a reference to its corresponding Step and JobExecution, along with transaction-related data such as commit and rollback counts and start and end times.

In addition, each step execution contains an ExecutionContext, which holds any data the developer needs to persist across batch runs, such as statistics or the state needed to restart. The following is an example screenshot from the database:

What is ExecutionContext

An ExecutionContext is a collection of key-value pairs scoped to a StepExecution or a JobExecution. We can obtain each ExecutionContext with the following code:

ExecutionContext ecStep = stepExecution.getExecutionContext();
ExecutionContext ecJob = jobExecution.getExecutionContext();
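To see why keeping state in a key-value context helps with restarts, here is a minimal plain-Java sketch. It uses an ordinary Map in place of the real ExecutionContext, and RestartableListReader and the key "reader.position" are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical reader that saves its position in a key-value context so that
// a restarted run can continue where the previous run stopped.
class RestartableListReader {
    static final String POSITION_KEY = "reader.position"; // assumed key name

    private final List<String> items;
    private final Map<String, Object> context;
    private int position;

    RestartableListReader(List<String> items, Map<String, Object> context) {
        this.items = items;
        this.context = context;
        // Resume from the saved position, or start at 0 on a first run.
        Object saved = context.get(POSITION_KEY);
        this.position = saved == null ? 0 : (Integer) saved;
    }

    // Returns the next item, or null when the input is exhausted.
    String read() {
        if (position >= items.size()) {
            return null;
        }
        String item = items.get(position++);
        context.put(POSITION_KEY, position); // record progress after each read
        return item;
    }
}
```

If a first reader stops after two items, a second reader built over the same context continues with the third item instead of starting over.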

What is JobRepository

JobRepository is the persistence mechanism for all of the concepts above. It provides CRUD operations for the JobLauncher, Job, and Step implementations.

When a job is first launched, a JobExecution is obtained from the repository, and during execution the StepExecution and JobExecution are persisted by passing them to the repository.

The @EnableBatchProcessing annotation provides automatic configuration for the JobRepository.

What is JobLauncher

The JobLauncher interface is very simple: it starts a Job with a given set of JobParameters. Why must the JobParameters be specified? As mentioned earlier, a Job together with its JobParameters identifies the JobInstance to run. The interface is defined as follows:

public interface JobLauncher {
 
public JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException, JobRestartException,
                   JobInstanceAlreadyCompleteException, JobParametersInvalidException;
}

The run method above obtains a JobExecution from the JobRepository for the given Job and JobParameters, and then executes the Job.

What is ItemReader

ItemReader is an abstraction for reading data; it supplies the input for a Step, one item at a time. When an ItemReader has exhausted its data, it returns null to signal to the rest of the step that there is nothing left to read. Spring Batch provides many useful ItemReader implementations, such as JdbcPagingItemReader and JdbcCursorItemReader.

The data sources supported by ItemReader are also very rich, including many kinds of databases, files, and data streams, which covers almost all common scenarios.

The following is an example code of JdbcPagingItemReader:

@Bean
public JdbcPagingItemReader<CustomerCredit> itemReader(DataSource dataSource, PagingQueryProvider queryProvider) {
        Map<String, Object> parameterValues = new HashMap<>();
        parameterValues.put("status", "NEW");
 
        return new JdbcPagingItemReaderBuilder<CustomerCredit>()
                                           .name("creditReader")
                                           .dataSource(dataSource)
                                           .queryProvider(queryProvider)
                                           .parameterValues(parameterValues)
                                           .rowMapper(customerCreditMapper())
                                           .pageSize(1000)
                                           .build();
}
 
@Bean
public SqlPagingQueryProviderFactoryBean queryProvider() {
        SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
 
        provider.setSelectClause("select id, name, credit");
        provider.setFromClause("from customer");
        provider.setWhereClause("where status=:status");
        provider.setSortKey("id");
 
        return provider;
}

JdbcPagingItemReader must specify a PagingQueryProvider, which is responsible for providing SQL query statements to return data by page.

The following is an example code of JdbcCursorItemReader:

private JdbcCursorItemReader<Map<String, Object>> buildItemReader(final DataSource dataSource, String tableName,
        String tenant) {

    JdbcCursorItemReader<Map<String, Object>> itemReader = new JdbcCursorItemReader<>();
    itemReader.setDataSource(dataSource);
    itemReader.setSql("sql here");
    // RowMapper is an interface and cannot be instantiated directly;
    // ColumnMapRowMapper (from spring-jdbc) maps each row to a Map<String, Object>.
    itemReader.setRowMapper(new ColumnMapRowMapper());
    return itemReader;
}

What is ItemWriter

Just as ItemReader is an abstraction for reading data, ItemWriter is an abstraction for writing data. It provides the output for each step. The unit of writing is configurable: we can write one item at a time or one chunk of items at a time (chunks are covered below). An ItemWriter performs no processing on the data that was read; it only writes out the items it is given.

Spring Batch also provides many useful ItemWriter implementations, and of course we can also implement our own writer.

What is ItemProcessor

ItemProcessor is an abstraction for the business processing of an item. After the ItemReader reads a record and before the ItemWriter writes it, an ItemProcessor can apply business logic and transform the data. If the ItemProcessor determines that an item should not be written, it signals this by returning null. ItemProcessor works well together with ItemReader and ItemWriter, and passing data between them is straightforward.
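The "return null to drop an item" convention can be illustrated without any framework code. In this sketch FilteringPipeline and processAll are hypothetical names, and a java.util.function.Function stands in for the processor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class FilteringPipeline {
    // Applies the processor to every input item and keeps only the non-null
    // results, mirroring the "return null to filter an item out" convention.
    static <I, O> List<O> processAll(List<I> input, Function<I, O> processor) {
        List<O> toWrite = new ArrayList<>();
        for (I item : input) {
            O result = processor.apply(item);
            if (result != null) {
                toWrite.add(result);
            }
        }
        return toWrite;
    }
}
```

A processor that returns null for negative numbers, for example, means that only the non-negative items ever reach the writer.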

Chunk processing flow

Spring Batch can process data in chunks. The schematic diagram of a chunk is as follows:

As the figure shows, a batch task may perform a large number of read and write operations, so processing items and committing them to the database one by one is not efficient. Spring Batch therefore provides the concept of a chunk. We set a chunk size, and Spring Batch processes items one at a time without committing them; only when the number of processed items reaches the chunk size are they committed together.

The Java configuration for such a step is as follows:
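A chunk-oriented step with a chunk size of 10 might be configured roughly like this. It is a sketch that assumes the Spring Batch 4.x StepBuilderFactory style used elsewhere in this article; the class name ChunkStepConfig, the step name sampleStep, and the itemReader and itemWriter beans are hypothetical and assumed to be defined elsewhere.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ChunkStepConfig {

    // The reader and writer beans are assumed to exist in the application context.
    @Bean
    public Step sampleStep(StepBuilderFactory stepBuilderFactory,
                           ItemReader<String> itemReader,
                           ItemWriter<String> itemWriter) {
        return stepBuilderFactory.get("sampleStep")
                .<String, String>chunk(10)   // commit interval: 10 items per transaction
                .reader(itemReader)
                .writer(itemWriter)
                .build();
    }
}
```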

In the step above, the chunk size is set to 10. When the number of items read by the ItemReader reaches 10, the whole batch is passed to the ItemWriter together, and the transaction is committed.
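The mechanics behind the chunk size can be modeled in plain Java. This is a simplified sketch rather than Spring Batch's actual loop: ChunkLoop is a hypothetical name, an Iterator stands in for the reader, and one call to the writer stands in for one committed transaction.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class ChunkLoop {
    // Reads items one at a time, buffers them, and hands each full buffer to
    // the writer in a single call (one call stands in for one commit).
    // Returns the number of writer calls made.
    static <T> int run(Iterator<T> reader, int chunkSize, Consumer<List<T>> writer) {
        int commits = 0;
        List<T> chunk = new ArrayList<>();
        while (reader.hasNext()) {
            chunk.add(reader.next());
            if (chunk.size() == chunkSize) {
                writer.accept(new ArrayList<T>(chunk));
                chunk.clear();
                commits++;
            }
        }
        if (!chunk.isEmpty()) { // final, partially filled chunk
            writer.accept(new ArrayList<T>(chunk));
            commits++;
        }
        return commits;
    }
}
```

With 25 items and a chunk size of 10, the writer is called three times, with 10, 10, and 5 items.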

Skip strategy and failure handling

A step in a batch job may process a very large amount of data, so errors are inevitable. Even though the probability of any single error is small, we have to plan for it, because the most important thing in a data migration is to guarantee the final consistency of the data. Spring Batch takes this into account and provides the necessary support. Consider the following bean configuration:
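A fault-tolerant step matching the description in this section could look roughly like the following. It is a sketch that assumes the Spring Batch 4.x builder API; the class name SkipStepConfig, the step name step1, the skip limit of 10, and the flatFileItemReader and itemWriter beans are illustrative assumptions.

```java
import java.io.FileNotFoundException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class SkipStepConfig {

    // The reader and writer beans are assumed to exist in the application context.
    @Bean
    public Step step1(StepBuilderFactory stepBuilderFactory,
                      ItemReader<String> flatFileItemReader,
                      ItemWriter<String> itemWriter) {
        return stepBuilderFactory.get("step1")
                .<String, String>chunk(10)
                .reader(flatFileItemReader)
                .writer(itemWriter)
                .faultTolerant()                       // enables skip handling
                .skipLimit(10)                         // tolerate at most 10 skipped items
                .skip(Exception.class)                 // exceptions that may be skipped
                .noSkip(FileNotFoundException.class)   // except this one, which is fatal
                .build();
    }
}
```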

There are three methods to pay attention to: skipLimit(), skip(), and noSkip().

The skipLimit method sets the number of exceptions this step is allowed to skip. If it is set to 10, the step as a whole does not fail while it is running as long as the number of skipped exceptions does not exceed 10. Note that if skipLimit is not set, its default value is 0.

The skip method specifies which exceptions may be skipped, because some exceptions can safely be ignored.

The noSkip method marks an exception that must not be skipped, that is, it excludes that exception from the set of skippable exceptions. In the example above, this means skipping every exception except FileNotFoundException.

For this step, FileNotFoundException is therefore a fatal exception: when it is thrown, the step fails immediately.
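How the three settings interact can be sketched in plain Java. SkipModel is a hypothetical stand-in, not the framework's skip policy, and it only handles unchecked exceptions so that a plain Consumer can be used as the unit of work.

```java
import java.util.List;
import java.util.function.Consumer;

class SkipModel {
    // Runs the work for every item. An exception of a skippable type is
    // counted and the item is dropped, as long as the skip limit has not been
    // reached. An exception of the fatal type always fails the run.
    // Returns the number of items skipped.
    static <T> int run(List<T> items, Consumer<T> work, int skipLimit,
                       Class<? extends Exception> skippable,
                       Class<? extends Exception> fatal) {
        int skipped = 0;
        for (T item : items) {
            try {
                work.accept(item);
            } catch (RuntimeException e) {
                boolean canSkip = skippable.isInstance(e) && !fatal.isInstance(e);
                if (!canSkip || skipped >= skipLimit) {
                    throw e; // the step fails
                }
                skipped++;
            }
        }
        return skipped;
    }
}
```

With a limit of 10, two bad items are skipped and the run finishes; with a limit of 1, the second bad item fails the run; and a bad item that raises the fatal type fails the run regardless of the limit.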

Batch operation guide

This part covers some points worth noting when using Spring Batch.

Batch processing principles

When building a batch processing solution, consider the following key principles and considerations.

A batch architecture usually affects the online architecture, and vice versa.
Simplify as much as possible and avoid building complex logical structures in a single batch application.
Keep the processing and storage of data physically close together (in other words, keep the data where the processing occurs).
Minimize the use of system resources, especially I/O. Perform as many operations as possible in internal memory.
Review application I/O (analyze SQL statements) to ensure that unnecessary physical I/O is avoided. In particular, look for the following four common flaws:
Reading data for every transaction when the data could be read once and cached or kept in working storage.
Rereading data for a transaction when the data was already read earlier in the same transaction.
Causing unnecessary table or index scans.
Not specifying key values in the WHERE clause of an SQL statement.
Do not do the same thing twice in a batch run. For example, if you need data summarization for reporting purposes, you should (if possible) increment stored totals when the data is initially processed, so that your reporting application does not have to reprocess the same data.
Allocate enough memory at the beginning of a batch application to avoid time-consuming reallocation during the process.
Always assume the worst with regard to data integrity. Insert adequate checks and record validation to maintain data integrity.
Implement checksums for internal validation where possible. For example, a flat file should have a trailer record stating the total number of records in the file and an aggregate of the key fields.
Plan and execute stress tests as early as possible in a production-like environment with realistic data volumes.
In high-volume systems, backups can be challenging, especially if the system runs online around the clock. Database backups are usually well taken care of in the online design, but file backups should be considered just as important. If the system depends on files, the file backup procedures should not only be in place and documented but also tested regularly.
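The checksum principle in the list above can be made concrete with a small validation routine. The file layout here is an assumption made for the example: each data line is "<id>,<amount>" and the last line is a trailer of the form "TRAILER,<record count>,<sum of amounts>". TrailerCheck is a hypothetical name.

```java
import java.util.List;

class TrailerCheck {
    // Validates a file whose last line is a trailer "TRAILER,<count>,<sum>"
    // and whose other lines are "<id>,<amount>" (a hypothetical layout).
    static boolean isConsistent(List<String> lines) {
        if (lines.isEmpty()) {
            return false;
        }
        String[] trailer = lines.get(lines.size() - 1).split(",");
        if (trailer.length != 3 || !"TRAILER".equals(trailer[0])) {
            return false;
        }
        long expectedCount = Long.parseLong(trailer[1]);
        long expectedSum = Long.parseLong(trailer[2]);

        long count = 0;
        long sum = 0;
        for (String line : lines.subList(0, lines.size() - 1)) {
            sum += Long.parseLong(line.split(",")[1]);
            count++;
        }
        return count == expectedCount && sum == expectedSum;
    }
}
```

A batch job could run a check like this before loading the file and refuse to proceed when the counts or totals disagree.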

How to prevent jobs from starting by default

When Spring Batch jobs are defined with Java config in a Spring Boot project and nothing else is configured, the project runs the defined batch jobs by default as soon as it starts. How do we stop the project from running jobs automatically at startup?

To keep jobs from running at startup, add the following property to application.properties:

spring.batch.job.enabled=false

Insufficient memory when reading data

When using Spring Batch for a data migration, I found that some time after the job started it hung at one point and stopped printing logs. After waiting for a while, I got the following error:

The message highlighted in red is: Resource exhaustion event: the JVM was unable to allocate memory from the heap.

In other words, the project raised a resource exhaustion event because the Java virtual machine could no longer allocate memory from the heap.

The cause of this error is that the reader of the batch job fetched all the data from the database at once, without paging. When the amount of data is too large, memory runs out. There are two solutions:

Change the reader's logic to read page by page, although this is more work to implement and may reduce throughput.
Increase the memory available to the service.
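The first solution, reading page by page, can be sketched as follows. PagedReader is a hypothetical plain-Java model of the idea, not JdbcPagingItemReader itself; the fetchPage function stands in for a query that takes an offset and a limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical reader that holds at most one page of rows in memory.
class PagedReader<T> {
    private final BiFunction<Integer, Integer, List<T>> fetchPage; // (offset, limit) -> rows
    private final int pageSize;
    private List<T> page = new ArrayList<>();
    private int indexInPage = 0;
    private int offset = 0;
    private boolean exhausted = false;

    PagedReader(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    // Returns the next row, fetching a new page only when the current one is
    // used up; returns null once the source has no more rows.
    T read() {
        if (indexInPage >= page.size()) {
            if (exhausted) {
                return null;
            }
            page = fetchPage.apply(offset, pageSize);
            offset += page.size();
            indexInPage = 0;
            exhausted = page.size() < pageSize; // a short page means the end of the data
            if (page.isEmpty()) {
                return null;
            }
        }
        return page.get(indexInPage++);
    }
}
```

Memory use is bounded by the page size rather than by the size of the table, at the cost of one query per page.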

(Thanks for reading. I hope this helps.)

Source: blog.csdn.net/topdeveloperr/article/details/84337956