Extracting arbitrary pages from a PDF file with itextpdf

Suppose you have the following requirement:

There is a PDF file with dozens of pages, and you need to split out specified pages from it and generate a new PDF file from them.

The open-source itextpdf library can do this. Its official GitHub repository is https://github.com/itext/itextpdf .

The code below shows how.

1. Introduce dependencies

At the time of writing, the latest version of itextpdf is 5.5.13.3, which you can look up at https://search.maven.org/ .

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>

2. Code implementation

2.1 Extracting specified page numbers

package com.magic.itextpdf;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Objects;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;

/**
 * PDF utility class
 */
public class PdfUtils {

    /**
     * Extracts pages from a PDF file
     * @param sourceFile path of the source PDF file
     * @param targetFile path of the target PDF file
     * @param extractedPageNums page numbers to extract
     */
    public static void extract(String sourceFile, String targetFile, List<Integer> extractedPageNums) {
        Objects.requireNonNull(sourceFile);
        Objects.requireNonNull(targetFile);
        Objects.requireNonNull(extractedPageNums);
        PdfReader reader = null;
        Document document = null;
        FileOutputStream outputStream = null;
        try {
            // Read the source file
            reader = new PdfReader(sourceFile);
            // Create a new document
            document = new Document();
            // Create the target PDF file
            outputStream = new FileOutputStream(targetFile);
            PdfCopy pdfCopy = new PdfSmartCopy(document, outputStream);

            // Get the number of pages in the source file
            int pages = reader.getNumberOfPages();
            document.open();

            // Note that page numbers start at 1
            for (int page = 1; page <= pages; page++) {
                // Copy the page if it is one of the specified page numbers
                if (extractedPageNums.contains(page)) {
                    pdfCopy.addPage(pdfCopy.getImportedPage(reader, page));
                }
            }
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                reader.close();
            }

            if (document != null) {
                document.close();
            }

            if (outputStream != null) {
                try {
                    outputStream.flush();
                    outputStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

The extract() method takes three parameters: the source PDF file path, the target PDF file path, and the page numbers to extract. The page numbers are passed as a List. For example, to extract the first page, call it like this:

 PdfUtils.extract("D:\\Test\\test.pdf", "D:\\Test\\test_out.pdf", Collections.singletonList(1));

To extract several pages at once, such as pages 1, 3, and 5, call it like this:

 PdfUtils.extract("D:\\Test\\test.pdf", "D:\\Test\\test_out.pdf", Arrays.asList(1, 3, 5));
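
Because extract() walks the source pages from 1 to the last page and copies a page only when the list contains it, the order of the list and any duplicates in it do not affect the output, and page numbers outside the document are silently skipped. The following is a minimal sketch of that behavior in plain Java, without itextpdf. The class PageSelection and the method effectivePages are hypothetical names introduced here for illustration; they are not part of PdfUtils or of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class PageSelection {

    /**
     * Returns the pages that the loop in extract() would copy:
     * ascending order, no duplicates, limited to 1..pageCount.
     * (Hypothetical helper, not part of itextpdf.)
     */
    static List<Integer> effectivePages(List<Integer> requested, int pageCount) {
        TreeSet<Integer> unique = new TreeSet<>();
        for (Integer page : requested) {
            // Skip nulls and page numbers that do not exist in the source
            if (page != null && page >= 1 && page <= pageCount) {
                unique.add(page);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // A 6-page source: 99 is out of range, 3 is requested twice
        System.out.println(effectivePages(Arrays.asList(5, 1, 3, 3, 99), 6)); // prints [1, 3, 5]
    }
}
```

A helper like this could be called before extract() to check that the result is not empty, since an empty selection leaves no page to write to the target file.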

Now suppose a PDF has more than 100 pages and pages 10 to 60 need to be extracted. Listing every page number as above would be tedious. In that case, you can add an overload that takes the starting and ending page numbers.
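
For comparison, a contiguous range can also be generated as a List and handed to the list-based extract(). This is only a sketch using the standard Java streams API; the class PageRanges and the method closedRange are hypothetical names, not something the article or itextpdf defines.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PageRanges {

    /**
     * Builds the list [from, from + 1, ..., to], both ends inclusive.
     * Returns an empty list when from is greater than to.
     * (Hypothetical helper, not part of itextpdf.)
     */
    static List<Integer> closedRange(int from, int to) {
        return IntStream.rangeClosed(from, to)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(closedRange(10, 13)); // prints [10, 11, 12, 13]
    }
}
```

The call would then look like PdfUtils.extract(source, target, PageRanges.closedRange(10, 60)). A dedicated overload is still more direct for this case.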

2.2 Extracting by start and end page number

Overload the extract method as follows:

/**
 * Extracts pages from a PDF file
 * @param sourceFile path of the source PDF file
 * @param targetFile path of the target PDF file
 * @param fromPageNum starting page number (inclusive)
 * @param toPageNum ending page number (inclusive)
 */
public static void extract(String sourceFile, String targetFile, int fromPageNum, int toPageNum) {
    Objects.requireNonNull(sourceFile);
    Objects.requireNonNull(targetFile);
    PdfReader reader = null;
    Document document = null;
    FileOutputStream outputStream = null;
    try {
        // Read the source file
        reader = new PdfReader(sourceFile);
        // Create a new document
        document = new Document();
        // Create the target PDF file
        outputStream = new FileOutputStream(targetFile);
        PdfCopy pdfCopy = new PdfSmartCopy(document, outputStream);

        // Get the number of pages in the source file
        int pages = reader.getNumberOfPages();
        document.open();

        // Note that page numbers start at 1
        for (int page = 1; page <= pages; page++) {
            // Copy the page if it falls within the requested range
            if (page >= fromPageNum && page <= toPageNum) {
                pdfCopy.addPage(pdfCopy.getImportedPage(reader, page));
            }
        }
    } catch (IOException | DocumentException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            reader.close();
        }

        if (document != null) {
            document.close();
        }

        if (outputStream != null) {
            try {
                outputStream.flush();
                outputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

For a contiguous range of pages, this method is simpler. For example, to extract pages 10 to 60, call it like this:

 PdfUtils.extract("D:\\Test\\test.pdf", "D:\\Test\\test_out.pdf", 10, 60);
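
When single pages and ranges need to be mixed (for example pages 1, 3, and 10 to 60), one option is to expand a short text specification into a List and pass it to the list-based extract(). The sketch below is an assumption about how such a convenience layer could look. The class PageSpec, the method parse, and the "1,3,10-60" syntax are all hypothetical and are not defined by the article or by itextpdf.

```java
import java.util.ArrayList;
import java.util.List;

final class PageSpec {

    /**
     * Expands a specification such as "1,3,10-12" into [1, 3, 10, 11, 12].
     * Tokens are separated by commas; "a-b" means pages a to b inclusive.
     * (Hypothetical helper and syntax, not part of itextpdf.)
     */
    static List<Integer> parse(String spec) {
        List<Integer> pages = new ArrayList<>();
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue; // tolerate stray commas
            }
            int dash = token.indexOf('-');
            if (dash < 0) {
                // Single page number
                pages.add(Integer.parseInt(token));
            } else {
                // Inclusive range "from-to"
                int from = Integer.parseInt(token.substring(0, dash).trim());
                int to = Integer.parseInt(token.substring(dash + 1).trim());
                if (from > to) {
                    throw new IllegalArgumentException("Invalid range: " + token);
                }
                for (int page = from; page <= to; page++) {
                    pages.add(page);
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(parse("1, 3,10-12")); // prints [1, 3, 10, 11, 12]
    }
}
```

Malformed tokens surface as a NumberFormatException from Integer.parseInt, so the caller finds out about a bad specification before any file is touched.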

3. Test verification

Suppose there is a PDF file with 2 pages in total. Use each of the two methods above to extract its first page. The code is as follows:

package com.magic.itextpdf;

import java.util.Collections;

public class Test {

    public static void main(String[] args) {
        PdfUtils.extract("D:\\Test\\test.pdf", "D:\\Test\\test_out_1.pdf", Collections.singletonList(1));
        PdfUtils.extract("D:\\Test\\test.pdf", "D:\\Test\\test_out_2.pdf", 1, 1);
    }
}

After running it, two new files, test_out_1.pdf and test_out_2.pdf, are generated, and each contains the first page of the source file.
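
Since extract() catches its exceptions and only prints the stack trace, the caller gets no signal when something goes wrong. A quick programmatic check of the output can complement opening the files by hand. The sketch below only tests that a file exists and begins with the "%PDF-" header that PDF files start with; it does not validate the page content. The class OutputCheck and the method looksLikePdf are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class OutputCheck {

    /**
     * Returns true if the file exists and starts with the "%PDF-" header.
     * Any I/O problem is treated as "not a usable PDF".
     * (Hypothetical helper, not part of itextpdf.)
     */
    static boolean looksLikePdf(Path file) {
        byte[] head = new byte[5];
        if (!Files.isRegularFile(file)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(file)) {
            int off = 0;
            // read() may return fewer bytes than requested, so loop
            while (off < head.length) {
                int n = in.read(head, off, head.length - off);
                if (n < 0) {
                    return false; // file is shorter than the header
                }
                off += n;
            }
        } catch (IOException e) {
            return false;
        }
        return "%PDF-".equals(new String(head, StandardCharsets.US_ASCII));
    }
}
```

For the test above, this could be called as OutputCheck.looksLikePdf(Paths.get("D:\\Test\\test_out_1.pdf")) after each extract() call.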

4. Other methods

If you only need to process a single PDF file, the print function of WPS or of the Chrome browser is a convenient alternative.

十方
