Practice Guide: Generating a PDF from a Web Page

1. Background

During development, we needed to generate a PDF from a web page. The generated PDF has to be uploaded to the server, and its address is then passed as a parameter in a request to an external interface. Neither the conversion process nor the resulting PDF needs to be shown to the user on the front end.

2. Technology selection

Because this feature does not need to be shown to users on the front end, the conversion from web page to PDF is implemented on the server side to save client resources.

1. Puppeteer

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

Most operations performed manually in the browser can be done with Puppeteer, for example:

Generate screenshots and PDFs of pages;
Crawl a SPA and generate pre-rendered content (that is, SSR);
Automate form submission, UI testing, keyboard input, and so on;
Create an up-to-date automated test environment and run tests directly in the latest version of Chrome, using the latest JavaScript and browser features;
Capture a timeline trace of a site to help diagnose performance issues;
Test Chrome extensions.

As the list shows, Puppeteer can be used in Node to generate a PDF from a page.

3. Implementation steps

1. Installation

Go to the project directory and install puppeteer locally.

$ npm install -g cnpm --registry=https://registry.npm.taobao.org
$ cnpm i puppeteer --save

Note that installing puppeteer also downloads a recent version of Chromium that is guaranteed to work with the API. There are two ways to change this default and skip the browser download:

Set the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD environment variable;
Replace puppeteer with puppeteer-core.

puppeteer-core is a lightweight version of puppeteer. It does not download a browser by default; instead it launches an existing browser or connects to a remote one. When using puppeteer-core, make sure that a browser is available to connect to and that the installed version of puppeteer-core is compatible with the browser you intend to connect to. A local browser is launched as follows:

const browser = await puppeteer.launch({ 
  executablePath: '/path/to/Chrome' 
});

This project is deployed to a server where no browser is available to connect to, so we chose to install puppeteer.

2. Launch the browser

const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--font-render-hinting=medium']
  })

headless enables headless mode: the browser is started in the background and no window is displayed.

Tips: When debugging locally, it is recommended to set headless: false. This starts the full browser, so you can view the content directly in the browser window.
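
A minimal sketch of switching between the local debugging setup and the server setup described above. The helper name buildLaunchOptions and its debug parameter are hypothetical; headless, args, and executablePath are the launch options already used in this article.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// `debug` and `executablePath` are illustrative parameters, not part of the original project.
function buildLaunchOptions({ debug = false, executablePath } = {}) {
  const options = {
    // Show a real browser window only while debugging locally.
    headless: !debug,
    args: ['--no-sandbox', '--font-render-hinting=medium'],
  }
  // Only needed with puppeteer-core, which does not download a browser itself.
  if (executablePath) {
    options.executablePath = executablePath
  }
  return options
}

// Usage (assumes puppeteer is installed):
// const browser = await puppeteer.launch(buildLaunchOptions({ debug: true }))
```

Keeping the options in one function means the debugging switch lives in a single place instead of being edited by hand before each deployment.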

3. Open a new page

After the browser has started, open a new page in it.

const page = await browser.newPage()

4. Navigate to the specified page

Navigate to the page from which the PDF will be generated.

await page.goto(`${baseURL}/article/${id}`, {
    timeout: 60000,
    waitUntil: 'networkidle2', // networkidle2 waits until there have been no more than 2 network connections for at least 500 ms
  })

timeout is the maximum navigation time. The default is 30 s. If the page takes a long time to load, it is recommended to increase the timeout value to prevent timeout errors.

waitUntil determines how far the page must have loaded before PDF generation or other operations start. When the page has many image resources to load, it is recommended to set it to networkidle2. The following values are available:

load: when the load event is fired;
domcontentloaded: when the DOMContentLoaded event is fired;
networkidle0: when there have been no more than 0 network connections for at least 500 ms;
networkidle2: when there have been no more than 2 network connections for at least 500 ms.
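
The navigation step can also check that the page actually loaded before a PDF is generated. The following is a sketch under assumptions: the helper name openPage is hypothetical, and it relies on page.goto() resolving to the main resource response, whose ok() and status() methods report the HTTP result.

```javascript
// Hypothetical helper: navigates to the page and fails early if the server
// did not return a successful response, so an error page is never printed.
async function openPage(page, url, timeoutMs = 60000) {
  const response = await page.goto(url, {
    timeout: timeoutMs,
    waitUntil: 'networkidle2',
  })
  // page.goto resolves to the main resource response (it can be null, for
  // example when navigating to about:blank).
  if (response && !response.ok()) {
    throw new Error(`Failed to load ${url}: HTTP ${response.status()}`)
  }
  return response
}

// Usage:
// await openPage(page, `${baseURL}/article/${id}`)
```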

5. Specify the path and generate the PDF

Once the page specified above has loaded, generate a PDF from it.

  const ext = '.pdf'
  const key = randomFilename(title, ext)
  const _path = path.resolve(config.uploadDir, key)
  await page.pdf({ path: _path, format: 'a4' })

path is the file path to save the PDF to. If no path is provided, the PDF is not saved to disk.

Tips: Whether or not the PDF needs to be saved locally, it is recommended to set a path when debugging, so that you can easily inspect the style of the generated PDF and check for problems.

format is the paper format of the PDF. A4 is 8.27 inches x 11.7 inches, a standard printing size.

Note: at the time of writing, PDF generation is only supported in headless mode (headless: true).
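
The snippet above calls randomFilename(title, ext), whose implementation is not shown in this article. The following is only a hypothetical stand-in that illustrates one way to build a file name that is safe on disk and unlikely to collide; the sanitising rule and the suffix format are assumptions.

```javascript
// Hypothetical stand-in for the randomFilename(title, ext) helper used above;
// the original implementation is not shown in this article.
function randomFilename(title, ext) {
  // Replace characters that are unsafe in file names and cap the length.
  const safeTitle = String(title)
    .replace(/[\\\/:*?"<>|\s]+/g, '_')
    .slice(0, 50)
  // Timestamp plus a short random suffix to make collisions unlikely.
  const suffix = `${Date.now()}_${Math.random().toString(36).slice(2, 8)}`
  return `${safeTitle}_${suffix}${ext}`
}

// Usage:
// const key = randomFilename(title, '.pdf')
```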

6. Close the browser

After all operations are complete, close the browser to free its resources.

  await browser.close()
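
Steps 2 to 6 can be chained into one function. This is a minimal sketch rather than the project's actual code: the function name generatePdf and its parameters are illustrative, and puppeteer is passed in as an argument so that either puppeteer or puppeteer-core can be used. The try/finally block is a design choice that closes the browser even when navigation or PDF generation fails.

```javascript
// Sketch that chains steps 2 to 6. `puppeteer` is passed in so the function
// works with either puppeteer or puppeteer-core; the function name and
// parameters are illustrative.
async function generatePdf(puppeteer, url, outputPath) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--font-render-hinting=medium'],
  })
  try {
    const page = await browser.newPage()
    await page.goto(url, { timeout: 60000, waitUntil: 'networkidle2' })
    await page.pdf({ path: outputPath, format: 'a4' })
    return outputPath
  } finally {
    // Runs on success and on failure, so no browser process is left behind.
    await browser.close()
  }
}

// Usage (assumes puppeteer is installed):
// const puppeteer = require('puppeteer')
// await generatePdf(puppeteer, `${baseURL}/article/${id}`, _path)
```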

4. Difficulties

1. Image lazy loading

Because the page to be converted is an article page, it contains many pictures, and those pictures are lazy-loaded. As a result, the generated PDF contained many lazy-loading placeholder images instead of the real pictures.

The solution is to scroll to the bottom of the page after navigating to it, so that all image resources are requested. With waitUntil set to networkidle2, the images then load successfully.

await autoScroll(page) // The article images are lazy-loaded, so scroll the page to the very bottom to make sure every image is loaded

/**
 * Scrolls the page automatically
 * */
function autoScroll (page) {
  return page.evaluate(() => {
    return new Promise<void>(resolve => {
      let totalHeight = 0
      const distance = 100
      // Scroll the page down by 100 pixels every 200 ms
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight
        window.scrollBy(0, distance)
        totalHeight += distance
        if (totalHeight >= scrollHeight) {
          clearInterval(timer)
          resolve()
        }
      }, 200)
    })
  })
}

The page.evaluate() method is used here to run code in the page context, for example to use the built-in DOM selectors or the window methods.
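
page.evaluate() can also return a value from the page to Node. As a sketch, the hypothetical helper below counts images that have not finished loading, which can serve as a sanity check after scrolling. It is an assumption-laden check: it cannot tell a successfully loaded placeholder apart from the real picture, so it only catches images that are still loading or failed to load.

```javascript
// Hypothetical check: returns how many <img> elements have not finished
// loading. The callback runs inside the page, so it can use `document`.
function countPendingImages(page) {
  return page.evaluate(() => {
    return Array.from(document.images)
      .filter(img => !(img.complete && img.naturalWidth > 0))
      .length
  })
}

// Usage after scrolling:
// await autoScroll(page)
// const pending = await countPendingImages(page)
// if (pending > 0) console.warn(`${pending} images are still loading`)
```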

2. CSS print style

According to the official documentation, page.pdf() generates the PDF using the print CSS media type, so the style of the generated PDF can be changed with CSS. In this project, for example, the generated PDF needed to hide the header, the footer, and other parts unrelated to the article body. The code is as follows:

@media print {
  .other_info,
  .authors,
  .textDetail_comment,
  .detail_recTitle,
  .detail_rec,
  .SuspensePanel {
    display: none !important;
  }

  .Footer,
  .HeaderSuctionTop {
    display: none;
  }
}
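
When the page's own stylesheet cannot be changed, the same kind of rule can be injected from the Node side. The sketch below assumes this alternative is acceptable for your project; buildPrintCss and hideForPrint are hypothetical names, while page.addStyleTag() is an existing Puppeteer method that adds a style element to the page.

```javascript
// Builds a print-only rule that hides the given selectors.
function buildPrintCss(selectors) {
  return `@media print {\n  ${selectors.join(',\n  ')} {\n    display: none !important;\n  }\n}`
}

// Injects the rule into an already loaded page. page.addStyleTag is a
// Puppeteer API; the selector list is whatever the target page needs.
async function hideForPrint(page, selectors) {
  await page.addStyleTag({ content: buildPrintCss(selectors) })
}

// Usage:
// await hideForPrint(page, ['.Footer', '.HeaderSuctionTop'])
// await page.pdf({ path: _path, format: 'a4' })
```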

3. Login state

Because some articles are not open to external users, users must be authenticated, and only users who meet the requirements can see the article content. Therefore, after navigating to the specified article page, the login state of a qualifying user must be injected into the browser window so that this part of the article content is visible.

The login state is obtained by injecting cookies: page.evaluate() is used to set them. The code is as follows:


async function simulateLogin (page, cookies, domain) {
  return await page.evaluate((sig, sess, domain) => {
    let date = new Date()
    date = new Date(date.setDate(date.getDate() + 1))
    let expires = ''
    expires = `; expires=${date.toUTCString()}`
    document.cookie = `koa:sess.sig=${sig}${expires}; domain=${domain}; path=/`
    document.cookie = `koa:sess=${sess}=${expires}; domain=${domain}; path=/` // the trailing = is part of this cookie's value
    document.cookie = `is_login=true${expires}; domain=${domain}; path=/`
  }, cookies['koa:sess.sig'], cookies['koa:sess'], domain)
}


await simulateLogin(page, cookies, config.domain.split('//')[1])

Tips: Puppeteer also has its own API for cookie injection, such as page.setCookie({name: name, value: value}), but I could not obtain the login state with that method and have not found the specific reason. It is recommended to inject the cookies directly with the method above. Note that in addition to name and value, expires, domain, and path also need to be configured.
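
For readers who still want to try page.setCookie(), the sketch below builds cookie objects with domain, path, and expires filled in. It has not been verified against the site in this article and may not resolve the problem described above. The helper name buildLoginCookies is hypothetical; the cookie names and the trailing = on the koa:sess value mirror the document.cookie version.

```javascript
// Hypothetical helper: converts the session cookies into the objects that
// page.setCookie() accepts. `expires` is a Unix timestamp in seconds.
function buildLoginCookies(cookies, domain, now = Date.now()) {
  const expires = Math.floor(now / 1000) + 24 * 60 * 60 // one day from now
  return [
    { name: 'koa:sess.sig', value: cookies['koa:sess.sig'] },
    // The trailing = mirrors the document.cookie version above.
    { name: 'koa:sess', value: `${cookies['koa:sess']}=` },
    { name: 'is_login', value: 'true' },
  ].map(cookie => ({ ...cookie, domain, path: '/', expires }))
}

// Usage, before page.goto():
// await page.setCookie(...buildLoginCookies(cookies, config.domain.split('//')[1]))
```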

4. Deploying Puppeteer with Docker

With the steps above, the page can be converted to a PDF locally. Once it works locally, it needs to be deployed to the server for testing and release.

With the Dockerfile left unmodified, errors appeared after deployment.

Following the Docker configuration instructions in the official documentation, the Dockerfile that finally worked in practice on an Ubuntu system is as follows:

# ...omitted...

# Install puppeteer dependencies
RUN apt-get update && \
    apt-get install -y libgbm-dev && \
    apt-get install gconf-service libasound2 libatk1.0-0 libatk-bridge2.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev -y && \
    apt-get install -y fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf --no-install-recommends

# ...omitted...

You only need to focus on the part that installs the puppeteer dependencies.

Note: Before v1.18.1, Puppeteer required at least Node v6.4.0. Versions v1.18.1 through v2.1.0 depend on Node 8.9.0+. Starting with v3.0.0, Puppeteer relies on Node 10.18.1+. Pay attention to the server's Node version when configuring the Dockerfile.
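
A start-up check can make a version mismatch fail fast instead of surfacing as an obscure runtime error. This is a small sketch: the helper name meetsMinimum is hypothetical, and it compares plain dotted version numbers only (no pre-release tags).

```javascript
// Compares two dotted version strings such as '10.18.1' numerically.
function meetsMinimum(current, minimum) {
  const a = current.split('.').map(Number)
  const b = minimum.split('.').map(Number)
  for (let i = 0; i < 3; i++) {
    if ((a[i] || 0) > (b[i] || 0)) return true
    if ((a[i] || 0) < (b[i] || 0)) return false
  }
  return true // versions are equal
}

// Usage: fail fast when the server's Node is too old for Puppeteer v3+.
// if (!meetsMinimum(process.versions.node, '10.18.1')) {
//   throw new Error(`Node ${process.versions.node} is too old for Puppeteer 3`)
// }
```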

5. Summary

This article describes the complete process of generating PDF files from web pages with Node. It comes down to the following points:

Technology selection: choose an appropriate approach according to the requirements;
Read the official documentation: skim it first to avoid some pitfalls;
Tackle the difficulties: with an unfamiliar tool you will run into problems you have not solved before, so deal with them as they come ^ ^.

Refer to the demo source code to quickly get started with the functionality above. I hope this article is helpful. Thanks for reading ❤️

Welcome to follow the blog of the lab: aotu.io

Or follow the AOTULabs official account (AOTULabs), which publishes articles from time to time.
