
Why PDF to Word is a long-standing problem

Converting PDF to Word is an extremely common requirement, one that almost everyone runs into sooner or later. Why is such a common requirement so hard to meet? The answer starts with why the requirement exists at all:

PDF is a document format introduced by Adobe and standardized as ISO 32000. It is so widely used because a PDF records the exact coordinates of every character and draws shapes at given coordinates, so a document transmitted or printed as PDF keeps a consistent layout. As a result, PDF files are well suited to reading, display, and printing but are very hard to edit: adjusting the format, changing text, restyling, and so on. That is where the long-standing demand for PDF to Word conversion comes from. However, the two formats use completely different encoding standards and layout mechanisms, so the conversion is very complicated. General-purpose tools tend to produce either scrambled layouts or garbled text, and rarely meet customers' expectations.

The difficulty lies in mapping a format based on element positions (PDF) onto a format based on content (Word). A PDF document has no real concept of paragraphs or tables. A PDF to Word converter has to recognize "horizontal and vertical lines surrounding text" in the PDF as a Word table, recognize "text with a horizontal line just below it" as underlined text, and so on.
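To make the mapping concrete, here is a minimal sketch of the underline rule just described. The input shapes (`textRuns`, `hLines`), the `maxGap` tolerance, and the 90% overlap threshold are all assumptions made for illustration; they do not come from any real PDF library, and a real converter would need many more rules than this one.

```javascript
// Hypothetical input: text runs and horizontal rules in page coordinates,
// with y growing downward. A run counts as underlined when a rule sits just
// below its box and covers most of its width.
function markUnderlines(textRuns, hLines, maxGap = 3) {
  return textRuns.map(function (run) {
    const bottom = run.y + run.height;
    const underlined = hLines.some(function (line) {
      const gap = line.y - bottom; // vertical distance from text box to rule
      const overlap =
        Math.min(run.x + run.width, line.x2) - Math.max(run.x, line.x1);
      return gap >= 0 && gap <= maxGap && overlap >= run.width * 0.9;
    });
    return { text: run.text, underline: underlined };
  });
}

// Two runs, one rule drawn 1 unit below the first run.
const runs = [
  { text: 'Total', x: 10, y: 100, width: 40, height: 12 },
  { text: 'Notes', x: 10, y: 140, width: 40, height: 12 }
];
const rules = [{ x1: 10, x2: 50, y: 113 }];
const underlineResult = markUnderlines(runs, rules);
```

Even this single rule needs tolerances chosen by hand, which hints at why a full position-to-content mapping is so hard to get right.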

Two tools, two sets of rules. Conversion between two tools has only ever worked well when both belong to the same vendor, which can reserve common standards and interfaces for compatibility. Adobe and Microsoft, however, are separate technology giants, and both products are powerful and feature-rich, so perfectly matching every rule is very difficult.

Many report users think of a report as a written document, and a written document naturally suggests Word, so they hope that whatever is displayed on the page can be saved as a Word file for archiving, editing, and similar purposes.

ActiveReportsJS is a front-end report development tool with no dependency on a back end. After some research, the R&D team found that generating Word from the displayed HTML would therefore be very complicated and difficult. As they put it: "This is not a problem that can be solved in one sprint." PDF.js has strong Mozilla support behind it, and Word documents are generated by relying on Microsoft's Office development components.

But in actual conversations with customers, many users ask how to use the reporting tool to design very common Word-style reports such as approval forms, résumés, and test reports. They are satisfied with the results; their one complaint is that the output can only be generated as PDF. Word output is a tradition, a core requirement, and a pain point.

This Grape was a little worried but refused to give up. With such a wealth of front-end tools around, is there really nothing available?

I opened Google, wrung every relevant word out of my brain to use as keywords, and found the following results.

At first glance, the first result was exactly what I wanted. It targets Node.js, which runs on the server, but that is acceptable as long as there is a solution.

Use cloudmersive-convert-api-client to convert any file format

https://cloudmersive.medium.com/how-to-convert-pdf-to-word-docx-format-in-node-js-30291f7c446b

It looks very interesting.

The code is simple (see the article linked above).

But take a closer look at the code. Sure enough, every gift comes with a price tag attached: the API is a paid service.

I thought: if it works, pay for it. After all, we make paid commercial software for a living ourselves, so we should respect copyright.

Click Login, and after logging in with a Google account, you can reference the cloudmersive-convert-api-client package in your project.

The JS library provides dozens of APIs and classes for processing and converting files in different formats. Besides PDF to Word there are many other file format conversions, all very simple to use.

Conversion result evaluation:

It can read a local PDF file. The conversion result:

  1. About 90% of the formatting and styles are preserved as required
  2. Images are imported directly
  3. Background colors are not preserved
  4. Tables are not imported as Word tables, only as plain text
  5. Headers and footers are not imported as Word headers and footers, only as text
  6. Some content is missing

  • Product pricing

The conversion API is just one of Cloudmersive's API features; the product also includes security checks and other functions, so it is billed monthly and by concurrency. You can look up the pricing yourself. Their website also offers several very easy-to-use file conversion tools that return results directly without logging in.

https://cloudmersive.com/tools

Can a PDF stream be brute-force converted straight into a Word document?

My searching showed that converting a PDF object stream directly into a Word file with JS is very difficult. However, I had verified that a PDF exported by ARJS can be opened in Word, so it occurred to me to look for some middleware that would save the PDF stream directly in doc or docx format. After searching and experimenting, all I got was a file named document.docx.pdf: the new extension had simply been inserted in front of .pdf.

The attempt failed.

After chatting with technical colleagues, I learned that PDF and Word files are both binary streams whose internal declarations are specific to each format, so one cannot simply be saved as the other. In short, a stream of a given file type can only be saved as that type. Moreover, each format is backed by a major technology company, and direct conversion requires professional tools, so this is not an easy route.
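The failed renaming attempt can be illustrated by inspecting a file's leading bytes instead of its name. The sketch below is only an illustration: `sniffFormat` is a hypothetical helper, and it checks three well-known signatures, "%PDF-" for PDF, the ZIP header used by docx packages, and the OLE compound-file header used by legacy doc files.

```javascript
// Identify a file's real format from its leading bytes, ignoring its name.
function sniffFormat(bytes) {
  const startsWith = function (signature) {
    return signature.every(function (b, i) { return bytes[i] === b; });
  };
  // "%PDF-"
  if (startsWith([0x25, 0x50, 0x44, 0x46, 0x2d])) return 'pdf';
  // "PK\x03\x04": a ZIP container, which is what a .docx file is
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return 'zip';
  // OLE compound file, used by legacy .doc
  if (startsWith([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1])) return 'ole';
  return 'unknown';
}

// A PDF stream saved under a Word file name is still a PDF.
const renamedFile = {
  name: 'document.docx',
  bytes: new TextEncoder().encode('%PDF-1.7\n')
};
const detectedFormat = sniffFormat(renamedFile.bytes);
```

Changing the extension changes only the name; the bytes, and therefore the format, stay the same.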

A roundabout rescue: will HTML to Word conversion do the job?

So let's settle for second best. HTML is versatile and can be converted to almost anything: HTML to PDF, HTML to image, HTML to Excel, and so on. ActiveReportsJS can export a report to an HTML file with exactly the same layout, so here is a way forward: wouldn't it be more convenient to convert the HTML to Word directly? Sure enough, a Google search turns up a hundred times more material on this than on PDF to Word, and the code looks very simple:

https://jscodemine.grapecity.com/share/Itym7G5fAUSWY4ffuu2cJw/

Just 3 steps:

1. Export the report to HTML
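```javascript
// ARJS and HTMLExport are assumed to refer to the ActiveReportsJS core and
// HTML export modules already loaded on the page.
var pageReport = new ARJS.PageReport();

pageReport.load('./BandedReport.rdlx-json')
    // Render the report definition into a page document
    .then(function () { return pageReport.run(); })
    // Export the rendered document to HTML
    .then(function (pageDocument) { return HTMLExport.exportDocument(pageDocument); });
```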

2. Process the HTML to add the Office markup

3. Create an a tag and download the result directly as a .doc file

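```javascript
// sourceHTML is assumed to hold a data: URI built from the HTML of step 2
var fileDownload = document.createElement("a");

document.body.appendChild(fileDownload);
fileDownload.href = sourceHTML;
fileDownload.download = 'document.doc';
fileDownload.click(); // trigger the browser download
document.body.removeChild(fileDownload);
```
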
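Step 2 has no code in the text. A minimal sketch of one common way to do it follows, assuming the exported report is available as an HTML string. The helper name `buildWordDataUri` is hypothetical, and the Office namespaces and the `application/vnd.ms-word` MIME type are the values commonly used for this trick, not something confirmed by the ActiveReportsJS documentation.

```javascript
// Wrap an HTML fragment in Office namespaces so Word treats it as a document,
// then return it as a data: URI suitable for the href of a download link.
function buildWordDataUri(bodyHtml) {
  const header =
    "<html xmlns:o='urn:schemas-microsoft-com:office:office' " +
    "xmlns:w='urn:schemas-microsoft-com:office:word' " +
    "xmlns='http://www.w3.org/TR/REC-html40'>" +
    "<head><meta charset='utf-8'></head><body>";
  const footer = '</body></html>';
  return 'data:application/vnd.ms-word;charset=utf-8,' +
    encodeURIComponent(header + bodyHtml + footer);
}

// Example with a tiny table standing in for the exported report HTML.
const wordDataUri = buildWordDataUri('<table><tr><td>Cell</td></tr></table>');
```

The result is still HTML under a .doc name, which Word opens and renders. That is probably why tables survive well while some styles and images do not.
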
Look at the result: it looks very nice.

Conversion result evaluation:

  1. Styles are lost, including font color, background color, and shapes
  2. Images are lost
  3. Tables are imported directly as Word tables
  4. Icons are retained

Summary

The two conversion results are summarized as follows:

After these attempts, the HTML route can be considered a workaround. Reports are generally text-based and simple in style, and most users want Word output for archiving or further editing, so converting HTML to Word is a quick and concise method. This Grape is still working on a way to preserve styles in HTML to Word conversion and will post a second part when there is progress.

Please indicate the source for reprinting: Grape City official website, Grape City provides developers with professional development tools, solutions and services, and empowers developers.

Grape City Technical Team (葡萄城技术团队)