Spring Batch introduction

Spring Batch is a data processing framework provided by Spring. Many applications in the enterprise domain require batch processing to perform business operations in a mission-critical environment. These business operations include:

  • Automated, complex processing of large volumes of information that is most efficiently done without user interaction. These operations typically include time-based events (such as month-end calculations, notifications, or communications).

  • Periodic application of complex business rules across very large data sets (for example, insurance benefit determination or rate adjustment).

  • Integration of information received from internal and external systems, which typically needs to be formatted, validated, and processed transactionally into the system of record. Batch processing is used to process billions of transactions for businesses every day.

Spring Batch is a lightweight, comprehensive batch framework designed for developing robust batch applications that are critical to the daily operations of enterprise systems. Spring Batch builds on the expected features of the Spring Framework (productivity, POJO-based development, and general ease of use) while making it easy for developers to access and leverage more advanced enterprise services when necessary. Spring Batch is not a scheduling framework; it is meant to work together with a scheduler, not to replace one.

Spring Batch provides reusable capabilities that are critical for processing large amounts of data, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides advanced technical services and capabilities that enable extremely high-volume and high-performance batch jobs through optimization and partitioning techniques.

Spring Batch can be used for both simple use cases (such as reading a file into a database or running a stored procedure) and complex, large-volume use cases (such as moving large amounts of data between databases, transforming it, and so on). High-volume batch jobs can leverage the framework to process large amounts of information in a highly extensible manner.

Spring Batch Architecture introduction

A typical batch application looks like this:

  • Read a large number of records from a database, file, or queue.

  • Process the data in a certain way.

  • Write back the data in modified form.

The corresponding schematic diagram is as follows:

The overall architecture of Spring Batch is as follows:

In Spring Batch, a job can define many steps. Each step can define its own ItemReader for reading data, ItemProcessor for processing data, and ItemWriter for writing data. Job metadata is persisted in the JobRepository, and a job is started through the JobLauncher.

This section describes core concepts of Spring Batch

Here are some concepts that are central to the Spring Batch framework.

What is a Job

Job and Step are the two concepts that Spring Batch uses to execute batch tasks.

A Job is the concept that encapsulates an entire batch process. In Spring Batch, Job is the top-level abstraction, and in code it is simply a top-level interface, defined as follows:

/**
 * Batch domain object representing a job. Job is an explicit abstraction
 * representing the configuration of a job specified by a developer. It should
 * be noted that restart policy is applied to the job as a whole and not to a
 * step.
 */
public interface Job {

    String getName();

    boolean isRestartable();

    void execute(JobExecution execution);

    JobParametersIncrementer getJobParametersIncrementer();

    JobParametersValidator getJobParametersValidator();
}

The Job interface defines five methods, and it has two main implementations: SimpleJob and FlowJob. Besides Job itself, there are two lower-level abstractions: JobInstance and JobExecution.

A job is the basic unit of execution and is composed of steps; it is essentially a container for steps. A job combines steps in a specified logical order and provides a way to set properties that apply to all of its steps, such as event listeners and skip policies.

Spring Batch provides a simple default implementation of the Job interface in the form of the SimpleJob class, which creates standard functionality on top of the Job. An example using Java Config looks like this:

@Bean
public Job footballJob() {
    // start(Step) and next(Step) build a SimpleJob; that builder has no end() call
    return this.jobBuilderFactory.get("footballJob")
            .start(playerLoad())
            .next(gameLoad())
            .next(playerSummarization())
            .build();
}

footballJob specifies the job's three steps, which are provided by the methods playerLoad, gameLoad, and playerSummarization.

What is a JobInstance

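We have already mentioned JobInstance, which is a lower-level abstraction of Job. The listing below is the JSR-352 interface (javax.batch.runtime.JobInstance); Spring Batch's own JobInstance class carries the same information: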

public interface JobInstance {
 /**
  * Get unique id for this JobInstance.
  * @return instance id
  */
 public long getInstanceId();
 /**
  * Get job name.
  * @return value of 'id' attribute from <job>
  */
 public String getJobName(); 
}




Its methods are simple: one returns the ID of the JobInstance, the other returns the name of the Job.

A JobInstance represents one logical run of a Job.

Suppose you have a batch job that must run once at the end of every day, and assume the job is named 'EndOfDay'. In this case there is one logical JobInstance per day, and each individual run of the job must be tracked separately.

What are JobParameters

If the same job runs once every day, there is a new JobInstance every day, yet the job definition stays the same. So although all of these JobInstances share one job definition, something must distinguish them from each other, for example the run date.

Spring Batch provides exactly that: JobParameters. A JobParameters object holds the set of parameters used to start a batch job. The parameters can be used for identification or simply as reference data during the run. Here we assume that the run date is used as a job parameter.

For example, our 'EndOfDay' job now has two instances, one for January 1 and one for January 2, so we can define two JobParameters objects: one with the value 01-01 and the other with the value 01-02. Thus, a JobInstance can be identified as: JobInstance = Job + identifying JobParameters.

Therefore, we can address the correct JobInstance through its JobParameters.

What is a JobExecution

A JobExecution is the technical concept of a single attempt to run a defined Job. An execution may fail or succeed, and the JobInstance it belongs to is considered complete only when an execution completes successfully.

Using the EndOfDay job as an example, suppose the first run of the JobInstance for 01-01-2019 fails. If you run the job again with the same JobParameters as the first run (that is, 01-01-2019), a new JobExecution is created for the existing JobInstance, and there is still only one JobInstance.
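The relationship between JobInstance and JobExecution can be sketched in plain Java. This is only an illustration under stated assumptions: the class JobInstanceSketch, its methods, and the "name|date" key format are hypothetical and are not part of the Spring Batch API; an in-memory map stands in for the JobRepository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the repository bookkeeping; not Spring Batch API.
class JobInstanceSketch {
    // One entry per JobInstance; the value lists the outcome of every execution attempt.
    private final Map<String, List<String>> executionsByInstance = new HashMap<>();

    // A JobInstance is identified by the job name plus its identifying parameter.
    static String instanceKey(String jobName, String runDate) {
        return jobName + "|" + runDate;
    }

    // Records one execution attempt and returns how many attempts this instance now has.
    int recordExecution(String jobName, String runDate, String status) {
        List<String> attempts = executionsByInstance
                .computeIfAbsent(instanceKey(jobName, runDate), key -> new ArrayList<>());
        attempts.add(status);
        return attempts.size();
    }

    // Number of distinct JobInstances seen so far.
    int instanceCount() {
        return executionsByInstance.size();
    }

    public static void main(String[] args) {
        JobInstanceSketch repository = new JobInstanceSketch();
        repository.recordExecution("EndOfDay", "01-01-2019", "FAILED");
        // Same name and parameter: same instance, second execution.
        repository.recordExecution("EndOfDay", "01-01-2019", "COMPLETED");
        // Different parameter: a new instance.
        repository.recordExecution("EndOfDay", "01-02-2019", "COMPLETED");
        System.out.println("instances = " + repository.instanceCount());
    }
}
```

Rerunning with the same job name and date adds an execution to the existing entry instead of creating a new one, which is the behavior described for the failed 01-01-2019 run.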

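The listing below is the JSR-352 JobExecution interface (javax.batch.runtime.JobExecution); Spring Batch's own JobExecution class exposes the same kind of information: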

public interface JobExecution {
 /**
  * Get unique id for this JobExecution.
  * @return execution id
  */
 public long getExecutionId();
 /**
  * Get job name.
  * @return value of 'id' attribute from <job>
  */
 public String getJobName(); 
 /**
  * Get batch status of this execution.
  * @return batch status value.
  */
 public BatchStatus getBatchStatus();
 /**
  * Get time execution entered STARTED status. 
  * @return date (time)
  */
 public Date getStartTime();
 /**
  * Get time execution entered end status: COMPLETED, STOPPED, FAILED 
  * @return date (time)
  */
 public Date getEndTime();
 /**
  * Get execution exit status.
  * @return exit status.
  */
 public String getExitStatus();
 /**
  * Get time execution was created.
  * @return date (time)
  */
 public Date getCreateTime();
 /**
  * Get time execution was last updated.
  * @return date (time)
  */
 public Date getLastUpdatedTime();
 /**
  * Get job parameters for this execution.
  * @return job parameters  
  */
 public Properties getJobParameters();
 
}




The comment on each method explains it clearly, so there is no need to go through them again here. One thing worth mentioning is BatchStatus: JobExecution provides a getBatchStatus method that returns the state of a particular execution of a job. BatchStatus is an enumeration that represents that state, defined as follows:

public enum BatchStatus {STARTING, STARTED, STOPPING, 
   STOPPED, FAILED, COMPLETED, ABANDONED }




These attributes are critical information about a job's execution, and Spring Batch persists them to the database. Spring Batch stores job-related information in a set of metadata tables (with Spring Boot they can be created automatically); the BATCH_JOB_EXECUTION table stores JobExecution records.

What is a Step

Each Step object encapsulates a separate stage of the batch job. In fact, each Job is essentially composed of one or more steps. Each step contains all the information needed to define and control the actual batch. Any particular content is at the discretion of the developer who wrote the Job.

A step can be very simple or very complex. For example, a step that only loads data from a file into a database needs very little code with Spring Batch's built-in support. More complex steps may contain involved business logic as part of the processing.

Like Job, Step has a StepExecution that corresponds to JobExecution, as shown in the figure below:

What is a StepExecution

A StepExecution represents a single attempt to execute a Step. Each time a Step is run, a new StepExecution is created, similar to JobExecution. However, if a step is never run because the step before it failed, no execution is persisted for it: a StepExecution is created only when its Step actually starts.

An instance of a step execution is represented by an object of the StepExecution class. Each StepExecution contains a reference to its corresponding step as well as data related to JobExecution and transaction, such as commit and rollback counts and start and end times.

In addition, each step execution contains an ExecutionContext that contains any data the developer needs to keep in the batch run, such as statistics or status information needed for a restart. Here is an example screenshot from the database:

What is an ExecutionContext

An ExecutionContext is a collection of key-value pairs scoped to a StepExecution or a JobExecution. We can obtain it with the following code:

ExecutionContext ecStep = stepExecution.getExecutionContext();
ExecutionContext ecJob = jobExecution.getExecutionContext();
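
To make the restart use case concrete, here is a minimal plain-Java sketch in which a Map stands in for the ExecutionContext. The class name, the key "reader.position", and the failAt parameter are all hypothetical and exist only for illustration; this is not the Spring Batch API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the ExecutionContext; not Spring Batch API.
class ExecutionContextSketch {
    static final String POSITION_KEY = "reader.position"; // illustrative key name

    // Processes items starting at the saved position; stops at index failAt to
    // simulate a crash. Returns the number of items processed in this run.
    static int run(List<String> items, Map<String, Object> context, int failAt) {
        int start = (Integer) context.getOrDefault(POSITION_KEY, 0);
        int processed = 0;
        for (int i = start; i < items.size(); i++) {
            if (i == failAt) {
                return processed; // simulated failure; the context keeps the saved position
            }
            processed++;
            context.put(POSITION_KEY, i + 1); // save progress after each item
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d", "e");
        Map<String, Object> context = new HashMap<>();
        int firstRun = run(items, context, 3);   // fails before the item at index 3
        int secondRun = run(items, context, -1); // restart resumes from the saved position
        System.out.println(firstRun + " items, then " + secondRun + " items");
    }
}
```

Because the position survives the failed run, the restart continues where the first attempt stopped instead of reprocessing everything.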




What is a JobRepository

JobRepository is the persistence mechanism for the concepts described above, such as Job and Step. It provides CRUD operations for JobLauncher, Job, and Step implementations.

When a Job is first launched, a JobExecution is obtained from the repository, and during execution the StepExecution and JobExecution instances are persisted by passing them to the repository.

The @EnableBatchProcessing annotation provides automatic configuration for JobRepository.

What is a JobLauncher

JobLauncher is a simple interface for launching a Job with a given set of JobParameters. The JobParameters and the Job together constitute a job execution. The interface is defined as follows:

public interface JobLauncher {
 
public JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException, JobRestartException,
                   JobInstanceAlreadyCompleteException, JobParametersInvalidException;
}




The run method above obtains a JobExecution from the JobRepository and executes the job based on the Job and JobParameters passed in.

What is an ItemReader

ItemReader is an abstraction for reading data; it provides the input for each Step. When the ItemReader has read all the data, it returns null to signal that the input is exhausted. Spring Batch provides a number of useful ItemReader implementations, such as JdbcPagingItemReader and JdbcCursorItemReader.

ItemReader implementations support a wide variety of data sources, including various types of databases, files, and data streams, which covers most common scenarios.
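
The read-until-null contract can be sketched in a few lines of plain Java. The interface SimpleItemReader and the class ListReaderSketch below are hypothetical stand-ins written for this illustration; they are not the Spring Batch types.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical reader interface mirroring the read-until-null contract; not the Spring Batch type.
interface SimpleItemReader<T> {
    T read(); // returns the next item, or null when the input is exhausted
}

// A reader backed by an in-memory list.
class ListReaderSketch<T> implements SimpleItemReader<T> {
    private final Iterator<T> iterator;

    ListReaderSketch(List<T> items) {
        this.iterator = items.iterator();
    }

    @Override
    public T read() {
        return iterator.hasNext() ? iterator.next() : null;
    }

    // Drains a reader the way a step would: keep reading until null comes back.
    static <T> int countItems(SimpleItemReader<T> reader) {
        int count = 0;
        while (reader.read() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        SimpleItemReader<String> reader = new ListReaderSketch<>(Arrays.asList("a", "b", "c"));
        System.out.println("items read = " + countItems(reader));
    }
}
```

One consequence of this contract is that null cannot be used as a legitimate item value, since the caller treats it as the end of the input.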

Here is an example of JdbcPagingItemReader:

@Bean
public JdbcPagingItemReader<CustomerCredit> itemReader(DataSource dataSource, PagingQueryProvider queryProvider) {
        // Value bound to the :status named parameter in the WHERE clause below
        Map<String, Object> parameterValues = new HashMap<>();
        parameterValues.put("status", "NEW");
 
        return new JdbcPagingItemReaderBuilder<CustomerCredit>()
                                           .name("creditReader")
                                           .dataSource(dataSource)
                                           .queryProvider(queryProvider)
                                           .parameterValues(parameterValues)
                                           .rowMapper(customerCreditMapper())
                                           .pageSize(1000) // rows fetched per page
                                           .build();
}
 
@Bean
public SqlPagingQueryProviderFactoryBean queryProvider() {
        SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
 
        provider.setSelectClause("select id, name, credit");
        provider.setFromClause("from customer");
        provider.setWhereClause("where status=:status");
        provider.setSortKey("id"); // paging requires a sort key
 
        return provider;
}

JdbcPagingItemReader must specify a PagingQueryProvider that provides SQL queries to return data in pages.

Here is an example of JdbcCursorItemReader:

 private JdbcCursorItemReader<Map<String, Object>> buildItemReader(final DataSource dataSource, String tableName,
            String tenant) {
 
        JdbcCursorItemReader<Map<String, Object>> itemReader = new JdbcCursorItemReader<>();
        itemReader.setDataSource(dataSource);
        // Placeholder: supply the actual query here
        itemReader.setSql("sql here");
        // RowMapper is an interface; ColumnMapRowMapper maps each row to a Map<String, Object>
        itemReader.setRowMapper(new ColumnMapRowMapper());
        return itemReader;
    }

What is an ItemWriter

Just as ItemReader is an abstraction for reading data, ItemWriter is an abstraction for writing data in each step. The unit of writing is configurable: we can write one item at a time or one chunk of items at a time. Chunks are introduced in more detail later. An ItemWriter knows nothing about the input side; it only sees the items that are passed to it.

Spring Batch also provides a number of useful ItemWriter implementations, and we can of course implement our own writer.

What is an ItemProcessor

ItemProcessor is an abstraction for the business logic applied to an item. After the ItemReader reads a record, but before the ItemWriter writes it, we can use an ItemProcessor to apply business logic and transform the data. If the ItemProcessor decides that an item should not be written, it returns null to indicate that. ItemProcessor, ItemReader, and ItemWriter are designed to work together, and passing data between them requires no extra plumbing.
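
The null-means-filter convention can be sketched as follows. This is a plain-Java illustration, not Spring Batch code: the class ProcessorFilterSketch and the use of java.util.function.Function as the processor type are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the filtering convention; not Spring Batch API.
class ProcessorFilterSketch {
    // Applies the processor to each item; a null result means "do not write this item".
    static <I, O> List<O> processAll(List<I> items, Function<I, O> processor) {
        List<O> toWrite = new ArrayList<>();
        for (I item : items) {
            O result = processor.apply(item);
            if (result != null) {
                toWrite.add(result);
            }
        }
        return toWrite;
    }

    public static void main(String[] args) {
        // Illustrative business rule: drop blank names, normalize the rest.
        Function<String, String> processor =
                name -> name.trim().isEmpty() ? null : name.trim().toUpperCase();
        List<String> written = processAll(Arrays.asList("alice", "  ", "bob"), processor);
        System.out.println(written);
    }
}
```

Only the items for which the processor returned a non-null value reach the list handed to the writer.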

Chunk processing

Spring Batch provides the ability to process data by chunk. A chunk diagram is shown below:

A batch task may involve a large number of read and write operations, so processing items one at a time and committing each one to the database individually would be inefficient. Spring Batch therefore provides the concept of a chunk with a configurable chunk size: items are read and processed one by one, but nothing is committed until the number of processed items reaches the chunk size, at which point the whole chunk is written and committed together.

For example, a step can be defined with a chunk size of 10: once the number of items read by the ItemReader reaches 10, the whole chunk is sent to the ItemWriter together and the transaction is committed.
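
The chunk loop itself can be sketched in plain Java. This is a simplified illustration and not the Spring Batch implementation: the class ChunkLoopSketch is hypothetical, an Iterator stands in for the ItemReader, a Consumer stands in for the ItemWriter, and each writer call stands in for one write-and-commit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of chunk-oriented processing; not Spring Batch API.
class ChunkLoopSketch {
    // Reads items one at a time, buffers them, and hands each full chunk to the writer.
    // Each writer call stands in for one write-and-commit. Returns the number of commits.
    static <T> int runStep(Iterator<T> reader, int chunkSize, Consumer<List<T>> writer) {
        int commits = 0;
        List<T> chunk = new ArrayList<>();
        while (reader.hasNext()) {
            chunk.add(reader.next());
            if (chunk.size() == chunkSize) {
                writer.accept(new ArrayList<>(chunk)); // write the full chunk, then "commit"
                commits++;
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            writer.accept(new ArrayList<>(chunk)); // final, partially filled chunk
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            items.add(i);
        }
        int commits = runStep(items.iterator(), 10,
                chunk -> System.out.println("writing " + chunk.size() + " items"));
        System.out.println("commits = " + commits);
    }
}
```

With a chunk size of 10, the number of commits grows with the number of chunks rather than the number of items, which is where the efficiency gain comes from.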

Skip policy and failure handling

A batch job step may process a large amount of data, and errors will inevitably occur. Although the probability of any single error is small, these cases must be considered, because the most important thing in a data migration is the final consistency of the data. Spring Batch takes this into account and provides support for it in the step configuration.

We need to pay attention to three methods: skipLimit(), skip(), and noSkip().

The skipLimit method means that we can set the number of exceptions that we allow the step to skip. If we set it to 10, then when the step runs, as long as the number of exceptions does not exceed 10, the whole step will not fail. Note that if skipLimit is not set, the default value is 0.

The skip method allows us to specify exceptions that we can skip, because some exceptions occur that we can ignore.

The noSkip method marks an exception that must not be skipped, that is, it excludes that exception from the set of skippable exceptions. For example, combining skip(Exception.class) with noSkip(FileNotFoundException.class) skips every exception except FileNotFoundException.

For such a step, FileNotFoundException is a fatal exception: if it is thrown, the step fails.
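
The interplay of the three settings can be sketched as a small decision function. This is a deliberately simplified plain-Java illustration, not how Spring Batch implements its skip policies: the class SkipPolicySketch is hypothetical, it supports only one skippable type and one fatal type, and it simply lets the fatal type win.

```java
import java.io.FileNotFoundException;

// Hypothetical, simplified skip decision; not the Spring Batch implementation.
class SkipPolicySketch {
    private final int skipLimit;
    private final Class<? extends Throwable> skippable; // plays the role of skip(...)
    private final Class<? extends Throwable> fatal;     // plays the role of noSkip(...)
    private int skipCount = 0;

    SkipPolicySketch(int skipLimit,
                     Class<? extends Throwable> skippable,
                     Class<? extends Throwable> fatal) {
        this.skipLimit = skipLimit;
        this.skippable = skippable;
        this.fatal = fatal;
    }

    // Returns true if the failing item may be skipped; false means the step fails.
    boolean shouldSkip(Throwable error) {
        if (fatal.isInstance(error)) {
            return false; // excluded from skipping
        }
        if (!skippable.isInstance(error)) {
            return false; // not declared as skippable
        }
        if (skipCount >= skipLimit) {
            return false; // skip limit exhausted
        }
        skipCount++;
        return true;
    }

    public static void main(String[] args) {
        SkipPolicySketch policy =
                new SkipPolicySketch(10, Exception.class, FileNotFoundException.class);
        System.out.println(policy.shouldSkip(new IllegalStateException("bad record")));
        System.out.println(policy.shouldSkip(new FileNotFoundException("input.csv")));
    }
}
```

Note that with a limit of 0 this sketch never allows a skip, which matches the observation above that a step without a usable skip limit fails on the first error.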

Batch operation guide

This section lists some points worth noting when using Spring Batch.

Batch processing principle

The following key principles and considerations should be considered when building a batch solution.

  • A batch architecture typically affects the online architecture and vice versa, so design with both in mind.

  • Simplify as much as possible and avoid building complex logical structures in a single batch application.

  • Keep the processing and storage of data physically close together (in other words, keep your data where your processing occurs).

  • Minimize the use of system resources, especially I/O. Perform as many operations as possible in internal memory.

  • Review application I/O (analyze SQL statements) to ensure that unnecessary physical I/O is avoided. In particular, look for the following four common flaws:

      • Reading data for every transaction when it could be read once and cached or kept in working storage.

      • Rereading data for a transaction when the data was already read earlier in the same transaction.

      • Causing unnecessary table or index scans.

      • Not specifying key values in the WHERE clause of an SQL statement.

  • Do not do the same thing twice in a batch run. For example, if you need data summarization for reporting purposes, you should (if possible) incrementally store the total when you initially process the data, so your reporting application does not have to reprocess the same data.

  • Allocate enough memory at the start of a batch application to avoid time-consuming reallocation along the way.

  • Always assume the worst data integrity. Insert appropriate checks and record validation to maintain data integrity.

  • Perform checksums for internal validation whenever possible. For example, a file should carry a trailer record that states the total number of records in the file and an aggregate of the key fields.

  • Plan and perform stress tests as early as possible in a production-like environment with realistic data volumes.

  • In high-volume systems, data backup can be challenging, especially if the system is running 24-7 online. Database backups are generally well handled in online design, but file backups should be considered equally important. If the system depends on files, the file backup process should not only be in place and documented, but also be tested regularly.

How to keep a job from starting by default

When Spring Batch jobs are defined with Java Config and nothing else is configured, the project runs the defined batch jobs automatically when it starts. So how do we start the project without automatically running the jobs?

If we don't want a job to run at startup, we can add the following property to the application.properties file:

spring.batch.job.enabled=false




Out of memory while reading data

While using Spring Batch to perform a data migration, the job got stuck at a certain point and stopped printing log output. After a while, the following error appeared:

Resource exhaustion event: The JVM was unable to allocate memory from the heap.

The project issues a resource exhaustion event, telling us that the Java virtual machine can no longer allocate memory for the heap.

The cause is that the batch job's reader in this project retrieved all the data from the database at once, without pagination. When the amount of data is too large, memory runs out. There are two solutions:

  • Adjust the reader logic to read data in pages, although this is harder to implement and less efficient

  • Increase service memory
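
The first option can be sketched in plain Java to show why paging bounds memory use. This is an illustration under stated assumptions: the class PagingReaderSketch is hypothetical, an in-memory list stands in for the database table, and subList stands in for a page query. In a real job, the JdbcPagingItemReader shown earlier (with its pageSize setting) plays this role.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a paging reader; not Spring Batch API.
class PagingReaderSketch {
    private final List<String> table; // stands in for the database table
    private final int pageSize;
    private List<String> page = new ArrayList<>();
    private int nextOffset = 0;
    private int indexInPage = 0;
    private int largestPageHeld = 0; // for demonstration only

    PagingReaderSketch(List<String> table, int pageSize) {
        this.table = table;
        this.pageSize = pageSize;
    }

    // Returns the next row, fetching a new page only when the current one is used up.
    // Returns null when there are no more rows.
    String read() {
        if (indexInPage >= page.size()) {
            if (nextOffset >= table.size()) {
                return null;
            }
            int end = Math.min(nextOffset + pageSize, table.size());
            page = new ArrayList<>(table.subList(nextOffset, end)); // one "page query"
            nextOffset = end;
            indexInPage = 0;
            largestPageHeld = Math.max(largestPageHeld, page.size());
        }
        return page.get(indexInPage++);
    }

    int largestPageHeld() {
        return largestPageHeld;
    }

    public static void main(String[] args) {
        List<String> table = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            table.add("row-" + i);
        }
        PagingReaderSketch reader = new PagingReaderSketch(table, 1000);
        int rows = 0;
        while (reader.read() != null) {
            rows++;
        }
        System.out.println(rows + " rows read, at most " + reader.largestPageHeld() + " held at once");
    }
}
```

The reader never holds more than one page of rows, so its memory footprint depends on the page size rather than on the size of the table.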


Source: blog.csdn.net/topdeveloperr/article/details/84337956