1. Background
During development, we needed to generate a PDF from a web page. The generated PDF is uploaded to the server, and its address is passed as a parameter in a request to an external interface. Neither the conversion process nor the resulting PDF needs to be displayed to the user on the front end.
2. Technical selection
Because nothing needs to be displayed to the user on the front end, the web-page-to-PDF conversion is implemented on the server side to save client resources.
1. Puppeteer
Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools protocol.
Most operations performed manually in the browser can be done with Puppeteer, such as:
- Generating screenshots and PDFs of pages;
- Crawling an SPA and generating pre-rendered content (that is, SSR);
- Automating form submission, UI testing, keyboard input, etc.;
- Creating an up-to-date automated test environment, running tests directly in the latest version of Chrome with the latest JavaScript and browser features;
- Capturing a timeline trace of a website to help diagnose performance issues;
- Testing Chrome extensions.
As the list shows, Puppeteer can generate a PDF from a page on the Node side.
3. Implementation steps
1. Installation
In the project directory, install puppeteer locally.
$ npm install -g cnpm --registry=https://registry.npm.taobao.org
$ cnpm i puppeteer --save
Note that when puppeteer is installed, it downloads a recent version of Chromium that is guaranteed to work with the API. The browser download can be skipped in either of the following ways:
- Set PUPPETEER_SKIP_CHROMIUM_DOWNLOAD in the environment variables;
- Replace puppeteer with puppeteer-core.
puppeteer-core is a lightweight version of puppeteer. It does not download a browser by default; instead it launches an existing browser or connects to a remote one. When using puppeteer-core, make sure that a browser is available locally to connect to and that the installed version of puppeteer-core is compatible with the browser you intend to connect to. A local browser is connected as follows:
const browser = await puppeteer.launch({
executablePath: '/path/to/Chrome'
});
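As a minimal sketch of the puppeteer-core route, the executable path can be read from configuration instead of being hard-coded. The environment variable name CHROME_PATH and the helper buildLaunchOptions are assumptions made for illustration; neither is part of Puppeteer.

```javascript
// Build launch options for puppeteer-core from an environment-like object.
// CHROME_PATH is a hypothetical variable name chosen for this sketch.
function buildLaunchOptions (env) {
  const executablePath = env.CHROME_PATH
  if (!executablePath) {
    // puppeteer-core ships no browser, so a path to an existing one is required.
    throw new Error('CHROME_PATH is not set; puppeteer-core needs an existing browser')
  }
  return { executablePath, headless: true }
}

// Usage (requires puppeteer-core and a locally installed browser):
// const puppeteer = require('puppeteer-core')
// const browser = await puppeteer.launch(buildLaunchOptions(process.env))
```

Failing early with a clear message is a design choice of this sketch; it avoids a harder-to-read launch error later.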
This project is deployed to a server that has no browser to connect to, so puppeteer was chosen.
2. Launch the browser
const browser = await puppeteer.launch({
headless: true,
args: ['--no-sandbox', '--font-render-hinting=medium']
})
headless sets headless mode: the browser starts in the background without displaying a window.
Tips: When debugging locally, it is recommended to set headless: false, which starts the full browser so that the content can be viewed directly in the browser window.
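The tip above can be sketched as a small options builder that switches between the two modes. The debug flag and the function name launchOptions are hypothetical; slowMo is an existing Puppeteer launch option that delays each operation by the given number of milliseconds.

```javascript
// Launch options for the setup above, with a debug switch.
// `debug` is a hypothetical flag; how it is set (CLI, env, config) is up to you.
function launchOptions (debug = false) {
  return {
    // Show a real browser window only while debugging locally.
    headless: !debug,
    // Slow each Puppeteer operation down in debug mode so it can be followed by eye.
    slowMo: debug ? 100 : 0,
    args: ['--no-sandbox', '--font-render-hinting=medium']
  }
}

// Usage:
// const browser = await puppeteer.launch(launchOptions(true))
```

Debug mode is meant for inspecting the page; the PDF itself is produced in headless mode.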
3. Open a new page
After the browser is launched, open a new page in it.
const page = await browser.newPage()
4. Navigate to the specified page
Navigate to the page from which the PDF is to be generated.
await page.goto(`${baseURL}/article/${id}`, {
timeout: 60000,
waitUntil: 'networkidle2', // networkidle2 waits until there have been no more than 2 network connections for at least 500 ms
})
timeout is the maximum loading time, 30 s by default. If the page takes a long time to load, it is recommended to increase the timeout value to prevent timeout errors.
waitUntil specifies how far the page must have loaded before PDF generation or other operations start. When the page has many image resources to load, it is recommended to set it to networkidle2. The following values are available:
- load: when the load event is fired;
- domcontentloaded: when the DOMContentLoaded event is fired;
- networkidle0: when there have been no network connections for at least 500 ms;
- networkidle2: when there have been no more than 2 network connections for at least 500 ms.
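If slow pages still time out occasionally, one option is to retry the navigation once with a longer timeout. The following is only a sketch: the helper name gotoWithRetry and the retry policy (one retry, doubled timeout) are assumptions, not something the article's project does.

```javascript
// Navigate with the options described above; on failure, retry once with a longer timeout.
// The retry policy (one retry, doubled timeout) is an assumption of this sketch.
async function gotoWithRetry (page, url, timeout = 60000) {
  const options = { timeout, waitUntil: 'networkidle2' }
  try {
    return await page.goto(url, options)
  } catch (err) {
    // A navigation timeout rejects the promise; for simplicity,
    // any other navigation error is retried here as well.
    return page.goto(url, { ...options, timeout: timeout * 2 })
  }
}

// Usage:
// await gotoWithRetry(page, `${baseURL}/article/${id}`)
```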
5. Generate the PDF at the specified path
After the page specified above has loaded, generate a PDF from it.
const ext = '.pdf'
const key = randomFilename(title, ext)
const _path = path.resolve(config.uploadDir, key)
await page.pdf({ path: _path, format: 'a4' })
path is the file path to save the PDF to. If path is not provided, the PDF is not saved to disk.
Tips: Whether or not the PDF needs to be saved locally, it is recommended to set a path while debugging, which makes it easy to inspect the style of the generated PDF and check for problems.
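When the file is not needed on disk, the PDF can be kept in memory and handed straight to the upload step. This is a sketch under assumptions: upload is a hypothetical function that sends bytes to your file server and resolves with the stored file's address.

```javascript
// Render the PDF into memory instead of writing it to disk.
// `upload` is a hypothetical function supplied by the caller.
async function renderAndUpload (page, upload) {
  // Without `path`, page.pdf() resolves with the PDF bytes.
  const bytes = await page.pdf({ format: 'a4' })
  return upload(bytes)
}

// Usage:
// const pdfUrl = await renderAndUpload(page, bytes => uploadToServer(bytes))
```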
format is the paper format of the PDF. The a4 size is 8.27 inches x 11.7 inches, a common print size.
Note: at the time of writing, PDF generation is supported only in headless mode (headless: true).
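The randomFilename helper used in the snippet above is not shown in the article. The following is a hypothetical stand-in, assuming its job is to turn a title into a safe, collision-resistant file name; the real helper may behave differently.

```javascript
const crypto = require('crypto')

// Hypothetical stand-in for randomFilename(title, ext).
function randomFilename (title, ext) {
  // Keep ASCII letters, digits, underscore and dash; replace each run of
  // other characters (spaces, slashes, non-ASCII text) with one underscore.
  const safeTitle = String(title).replace(/[^\w-]+/g, '_').slice(0, 50) || 'untitled'
  // 8 random bytes give 16 hex characters, which keeps names from colliding.
  const suffix = crypto.randomBytes(8).toString('hex')
  return `${safeTitle}-${suffix}${ext}`
}
```

Replacing path separators matters here because the result is passed to path.resolve; a title containing a slash would otherwise point into another directory.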
6. Close the browser
After all operations are complete, close the browser to free resources.
await browser.close()
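Steps 2 to 6 can be combined into one function. The sketch below wraps the work in try/finally so that the browser is closed even when navigation or PDF generation fails. The function name pageToPdf and its parameters are assumptions; puppeteer is passed in so the caller can supply either puppeteer or puppeteer-core.

```javascript
// Steps 2 to 6 combined. `url` and `outPath` are placeholders for the
// article URL and the target file path.
async function pageToPdf (puppeteer, url, outPath) {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--font-render-hinting=medium']
  })
  try {
    const page = await browser.newPage()
    await page.goto(url, { timeout: 60000, waitUntil: 'networkidle2' })
    await page.pdf({ path: outPath, format: 'a4' })
    return outPath
  } finally {
    // Runs on success and on error, so no browser process is left behind.
    await browser.close()
  }
}

// Usage:
// const puppeteer = require('puppeteer')
// await pageToPdf(puppeteer, `${baseURL}/article/${id}`, _path)
```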
4. Difficulties
1. Image lazy loading
The page to be converted is an article page with many images, and the images are lazy-loaded, so the generated PDF contains many lazy-loading placeholder images instead of the real ones.
The solution is to scroll to the bottom of the page after navigating to it, so that all image resources are requested. With waitUntil set to networkidle2, the images then load successfully.
await autoScroll(page) // The article images are lazy-loaded, so scroll to the very bottom of the page to make sure every image is loaded
/**
* Scroll the page automatically
* */
function autoScroll (page) {
return page.evaluate(() => {
return new Promise<void>(resolve => {
let totalHeight = 0
const distance = 100
// Scroll the page down by 100 pixels every 200 milliseconds
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight
window.scrollBy(0, distance)
totalHeight += distance
if (totalHeight >= scrollHeight) {
clearInterval(timer)
resolve()
}
}, 200)
})
})
}
The page.evaluate() method is used here to run code in the page context, for example to use the built-in DOM selectors or call methods on window.
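page.evaluate() can also return a value from the page to Node. As a rough sketch, the helper below counts images that have not finished loading after scrolling. The name countPendingImages is an assumption, and the check is approximate: an image that shows a fully loaded placeholder also counts as complete.

```javascript
// Count images that have not finished loading. The callback runs inside the
// page, so it can use `document`; only its return value comes back to Node.
function countPendingImages (page) {
  return page.evaluate(() => {
    return Array.from(document.images).filter(img => !img.complete).length
  })
}

// Usage, after autoScroll(page):
// const pending = await countPendingImages(page)
// if (pending > 0) console.warn(`${pending} images still loading`)
```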
2. CSS print style
According to the official documentation, page.pdf() generates the PDF with the print CSS media type, so CSS can be used to modify the style of the generated PDF. In this article's case, the requirement was to hide the header, the footer, and other parts unrelated to the article body in the generated PDF. The code is as follows:
@media print {
.other_info,
.authors,
.textDetail_comment,
.detail_recTitle,
.detail_rec,
.SuspensePanel {
display: none !important;
}
.Footer,
.HeaderSuctionTop {
display: none;
}
}
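When the site's own stylesheet cannot be changed, a similar rule can be injected from the Puppeteer side with page.addStyleTag() before page.pdf() is called. This is a sketch, not what the article's project does; buildPrintHideCss and hideInPrint are hypothetical helper names.

```javascript
// Build an @media print rule that hides the given selectors.
function buildPrintHideCss (selectors) {
  if (selectors.length === 0) return ''
  return `@media print {\n  ${selectors.join(',\n  ')} {\n    display: none !important;\n  }\n}`
}

// Inject the rule into the loaded page; call this before page.pdf().
function hideInPrint (page, selectors) {
  return page.addStyleTag({ content: buildPrintHideCss(selectors) })
}

// Usage:
// await hideInPrint(page, ['.Footer', '.HeaderSuctionTop'])
```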
3. Login state
Some articles are not open to external users: users must be authenticated, and only users who meet the requirements can see the article content. Therefore, after navigating to the specified article page, the login state needs to be injected into the browser window so that a qualifying logged-in user can see this part of the article.
The login state is obtained by injecting cookies, which are set through page.evaluate(). The code is as follows:
async function simulateLogin (page, cookies, domain) {
return await page.evaluate((sig, sess, domain) => {
let date = new Date()
date = new Date(date.setDate(date.getDate() + 1))
let expires = ''
expires = `; expires=${date.toUTCString()}`
document.cookie = `koa:sess.sig=${sig}${expires}; domain=${domain}; path=/`
document.cookie = `koa:sess=${sess}=${expires}; domain=${domain}; path=/` // the = after ${sess} is part of this cookie's value
document.cookie = `is_login=true${expires}; domain=${domain}; path=/`
}, cookies['koa:sess.sig'], cookies['koa:sess'], domain)
}
await simulateLogin(page, cookies, config.domain.split('//')[1])
Tips: Puppeteer also has its own API for cookie injection, such as page.setCookie({name: name, value: value}), but I could not obtain the login state when injecting this way and have not found the specific reason. It is recommended to inject the cookies directly with the method above. Note that in addition to name and value, expires, domain, and path also need to be configured.
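For reference, a page.setCookie() call that sets domain, path, and expires explicitly might look like the sketch below. The helper toPuppeteerCookies is hypothetical, and whether this avoids the problem described in the tip above has not been verified. Note that page.setCookie() takes expires as Unix time in seconds.

```javascript
// Convert a { name: value } map into the objects page.setCookie() accepts.
// `now` is injectable for testing; the one-day expiry mirrors simulateLogin.
function toPuppeteerCookies (cookies, domain, now = Date.now()) {
  const expires = Math.floor(now / 1000) + 24 * 60 * 60 // Unix time in seconds
  return Object.entries(cookies).map(([name, value]) => ({
    name, value, domain, path: '/', expires
  }))
}

// Usage, before page.goto() so the cookies are sent with the first request
// (assumes config.domain is a full URL such as 'https://example.com'):
// const host = new URL(config.domain).hostname
// await page.setCookie(...toPuppeteerCookies(cookies, host))
```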
4. Deploying Puppeteer with Docker
With the steps above, the page can be converted to PDF successfully on a local machine. Once it works locally, it needs to be deployed to the server for testing and release.
When the Dockerfile was left unmodified, errors were reported after deployment.
With reference to the Docker configuration instructions on the official website, the Dockerfile that finally worked in practice on an ubuntu system is as follows:
# ...omitted...
# Install puppeteer dependencies
RUN apt-get update && \
apt-get install -y libgbm-dev && \
apt-get install gconf-service libasound2 libatk1.0-0 libatk-bridge2.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget build-essential libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev librsvg2-dev -y && \
apt-get install -y fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst fonts-freefont-ttf --no-install-recommends
# ...omitted...
Only the part that installs the puppeteer dependencies needs attention.
Note: Before v1.18.1, Puppeteer required at least Node v6.4.0. Versions from v1.18.1 to v2.1.0 depend on Node 8.9.0+. Starting from v3.0.0, Puppeteer depends on Node 10.18.1+. Pay attention to the Node version on the server as well when configuring the Dockerfile.
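To catch a version mismatch early, the service can check the Node version at startup. A minimal sketch, assuming Puppeteer v3.0.0 or later is installed (minimum Node 10.18.1, as stated above); meetsMinimum is a hypothetical helper.

```javascript
// Compare dotted version strings numerically, e.g. '10.18.1' vs '8.9.0'.
function meetsMinimum (current, required) {
  const a = current.split('.').map(Number)
  const b = required.split('.').map(Number)
  for (let i = 0; i < 3; i++) {
    const x = a[i] || 0
    const y = b[i] || 0
    if (x !== y) return x > y
  }
  return true // the versions are equal
}

// Fail fast at startup if the server's Node is too old for Puppeteer v3+.
if (!meetsMinimum(process.versions.node, '10.18.1')) {
  throw new Error(`Node ${process.versions.node} is too old for Puppeteer v3`)
}
```

A numeric comparison is used because comparing the strings directly would rank '10.9.0' above '10.18.1'.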
5. Summary
This article describes the complete process of generating PDF files from web pages on the Node side. The main points are:
- Technology selection: choose an appropriate way to implement the function according to the requirement scenario;
- Read the official documentation: going through it quickly helps avoid some pitfalls;
- Work through the difficulties: with a tool you have not used before you will run into unfamiliar problems, so deal with them one by one ^ ^.
Refer to the Demo source code to quickly get started with the features above. I hope this article is helpful to you. Thanks for reading ❤️
Welcome to follow the lab's blog: aotu.io
Or follow the AOTULabs official account (AOTULabs), which publishes articles from time to time.