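Updating Word table-of-contents page numbers from the LibreOffice command line

Background

When a program edits a Word file, the page numbers can change, but the program does not refresh the page numbers in the table of contents when it saves the file.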

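Implementation

Choose either of the following:

Write a macro in an xba file and name it when invoking libreoffice; the document is opened, its table of contents is updated, and it is then saved
Rely on the unoconv/unoserver project and call its Python-based command-line tools to update the file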

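`xba` macro implementation

Edit a macro, then call libreoffice from the command line so that the macro updates the table of contents.

Locate the Module1.xba macro file. It is one of LibreOffice's default macro modules, and a function needs to be added to it.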

$ sudo find / -name "Module1.xba" 2>/dev/null |grep Standard

The file is usually located at

~/.config/libreoffice/4/user/basic/Standard/Module1.xba
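
Scanning the whole filesystem with find can be slow, and a narrower search below the user's LibreOffice profile directory is often enough. The following is a minimal Python sketch of that idea. `find_module1` is a hypothetical helper, and the default root `~/.config/libreoffice` is an assumption taken from the path shown above; other installations may keep the profile elsewhere.

```python
from pathlib import Path


def find_module1(config_root=None):
    """Return every Standard/Module1.xba found below config_root, sorted."""
    if config_root is None:
        # Assumed default profile location, taken from the path in the article.
        config_root = Path.home() / ".config" / "libreoffice"
    root = Path(config_root)
    if not root.is_dir():
        return []
    # Keep only the modules that live in a "Standard" library directory.
    return sorted(
        p for p in root.rglob("Module1.xba") if p.parent.name == "Standard"
    )
```

Calling `find_module1()` with no argument searches the assumed default location and returns an empty list when nothing is found.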

Its contents look like this

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">Sub AutoOpen()
&apos;
&apos; AutoOpen Macro
&apos;
&apos;
    &apos; Update the entire first table of contents.
    &apos; The TOC must exist or this produces an error.
    On Error GoTo DontUpdate
   ActiveDocument.TablesOfContents(1).Update
   ActiveDocument.TablesOfContents(1).UpdatePageNumbers
    
DontUpdate:
    On Error GoTo 0
    
End Sub
</script:module>

Then add a new subroutine after End Sub

The file now looks like this

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">Sub AutoOpen()
&apos;
&apos; AutoOpen Macro
&apos;
&apos;
    &apos; Update the entire first table of contents.
    &apos; The TOC must exist or this produces an error.
    On Error GoTo DontUpdate
   ActiveDocument.TablesOfContents(1).Update
   ActiveDocument.TablesOfContents(1).UpdatePageNumbers
    
DontUpdate:
    On Error GoTo 0
    
End Sub

Sub UpdateIndexes(path As String)
     &apos;&apos;&apos;Update indexes, such as for the table of contents&apos;&apos;&apos; 
     Dim doc As Object
     Dim args()

     doc = StarDesktop.loadComponentFromUrl(convertToUrl(path), &quot;_default&quot;, 0, args())

     Dim i As Integer

     With doc &apos; Only process Writer documents
         If .supportsService(&quot;com.sun.star.text.GenericTextDocument&quot;) Then
             For i = 0 To .getDocumentIndexes().count - 1
                 .getDocumentIndexes().getByIndex(i).update()
             Next i
         End If
     End With &apos; ThisComponent

     doc.store()
     doc.close(True)

 End Sub &apos; UpdateIndexes  
</script:module>
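
Because the module is an XML file, the Basic source inside it has to keep its quotes escaped (`&apos;`, `&quot;`), and a hand edit can easily leave the file malformed. A small check along the following lines may help before running the macro. It is only a sketch: `module_defines_sub` is a hypothetical helper, and it assumes the whole Basic source sits in the text of the root element, as in the listing above.

```python
import xml.etree.ElementTree as ET


def module_defines_sub(xba_path, sub_name="UpdateIndexes"):
    """Return True if the .xba file parses as XML and declares the given Sub.

    ET.parse raises xml.etree.ElementTree.ParseError when the file is
    not well-formed, which is the most likely result of a bad hand edit.
    """
    root = ET.parse(xba_path).getroot()
    # Assumption: the Basic source is the text content of <script:module>.
    source = root.text or ""
    return f"Sub {sub_name}(" in source
```

The function only looks for the declaration line; it does not check that the Basic code itself is valid.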

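Next, create a docx file that has a table of contents and change its page numbers to some other values. Then run the command below, which opens the file at the given location, updates the page numbers, and saves it.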

$ libreoffice --headless "macro:///Standard.Module1.UpdateIndexes(/path/to/my.docx)"

The argument passed to UpdateIndexes must be an absolute path.
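
When the macro is driven from a script, the absolute-path requirement is easy to enforce in code. The sketch below builds the same command line as above from Python. `build_macro_command` and `update_indexes` are hypothetical helper names, and the 120-second timeout is an arbitrary assumption. Paths containing commas or parentheses may need extra quoting inside the macro URL; that case is not handled here and has not been verified.

```python
import os
import subprocess


def build_macro_command(doc_path, binary="libreoffice"):
    """Return the argv list that calls Standard.Module1.UpdateIndexes."""
    # The macro needs an absolute path, so expand "~" and relative input first.
    absolute = os.path.abspath(os.path.expanduser(doc_path))
    macro_url = f"macro:///Standard.Module1.UpdateIndexes({absolute})"
    return [binary, "--headless", macro_url]


def update_indexes(doc_path, binary="libreoffice", timeout=120):
    """Run the macro; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(
        build_macro_command(doc_path, binary), check=True, timeout=timeout
    )
```

Keeping the command construction separate from the call makes it possible to log or inspect the exact command before anything is executed.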

`unoconv/unoserver` implementation

Project URL

https://github.com/unoconv/unoserver

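This project is a rewrite of https://github.com/unoconv/unoconv and is actively maintained.

First check the latest release tag. As of 2023-10-08 it is 2.0b1; updating table-of-contents page numbers is only available in the 2.0* versions.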

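Components

unoserver: starts a LibreOffice listener bound to a fixed IP address and port
unoconvert: connects to the listener and processes documents

How the components fit together

Consider running the following command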

$ libreoffice --headless --convert-to pdf MyDocument.odf

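This loads LibreOffice into memory, converts the file, and then exits LibreOffice, so LibreOffice has to be loaded into memory again for the next conversion.

To avoid this, LibreOffice has a listener mode: it listens for commands on a port and loads and converts documents without exiting and reloading the application.

Advantages

This lowers the CPU load when many documents are converted, which means a listener can convert roughly two to four times as many documents in the same amount of time.

Disadvantages

Because the process stays resident, it occupies memory continuously.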

Installation

$ sudo pip3 install unoserver

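sudo is required.

LibreOffice uses the system Python when it is installed, and it ships a file, /usr/lib/python3/dist-packages/uno.py, that contains the toolkit and imports the *.so files under /usr/lib/libreoffice/program.
Running sudo pip3 generally ensures that the system Python is the one targeted.

The program is split into a server and a client.

Start the server first. The following command starts LibreOffice and listens on port 2002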
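
Since everything depends on the interpreter being able to see the `uno` module, it can be worth confirming that before installing anything. A minimal sketch, to be run with the same Python that pip3 installs into; `uno_available` is a hypothetical helper name:

```python
import importlib.util
import sys


def uno_available():
    """Return True if the running interpreter can locate the `uno` module."""
    return importlib.util.find_spec("uno") is not None


def describe_interpreter():
    """Return (path of the running interpreter, whether `uno` is visible)."""
    return sys.executable, uno_available()
```

If `describe_interpreter()` reports `False`, the interpreter in use is probably not the system Python that LibreOffice was installed against, for example because a virtual environment is active.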

$ unoserver
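
A script that starts the server and then immediately calls the client can fail if the listener is not ready yet. One way to guard against that is to probe the port first. The sketch below uses only the standard library. `listener_is_up` is a hypothetical helper, and the default host and port (127.0.0.1 and 2002) are assumptions based on the port mentioned above; adjust them if the server was started with different options.

```python
import socket


def listener_is_up(host="127.0.0.1", port=2002, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A successful connection only shows that some process is listening on the port; it does not prove that the process is LibreOffice.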

The unoserver process uses about 72 MB of memory.

It spawns two child processes:

# uses about 4 MB of memory
/usr/lib/libreoffice/program/oosplash --headless ....
# uses about 100 MB before any conversion, growing to 200 MB to 300 MB once conversion starts
/usr/lib/libreoffice/program/soffice.bin ...

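So listener mode steadily consumes roughly 300 MB to 400 MB of memory, in exchange for a considerable performance gain: LibreOffice no longer has to be read from disk and loaded for every conversion.

The listener's memory usage keeps growing with the number of documents converted; restarting the listener periodically avoids this problem.

Open a new terminal and call the client program to update the table of contents. For example, given a file test-toc.docx whose table-of-contents page numbers are wrong, the update below writes the result to a new file, test-toc-a.docx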
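
One possible way to automate the periodic restart is to count conversions and restart the listener once a threshold is reached. The class below is a sketch under several assumptions: `ListenerSupervisor` is a hypothetical name, the threshold of 200 conversions is arbitrary, and it assumes that terminating the unoserver process is enough to stop the listener. Whether the child soffice.bin process exits along with it should be checked on the target system.

```python
import subprocess


class ListenerSupervisor:
    """Start a listener process and restart it after N conversions."""

    def __init__(self, command=("unoserver",), max_conversions=200):
        self.command = list(command)
        self.max_conversions = max_conversions  # assumed threshold
        self.count = 0
        self.restarts = 0
        self.process = None

    def start(self):
        self.process = subprocess.Popen(self.command)
        self.count = 0

    def stop(self):
        if self.process is not None:
            self.process.terminate()
            self.process.wait()
            self.process = None

    def record_conversion(self):
        """Call after each conversion; returns True if a restart happened."""
        self.count += 1
        if self.count >= self.max_conversions:
            self.stop()
            self.start()
            self.restarts += 1
            return True
        return False
```

After a restart the listener needs a moment to come up again, so the caller should wait before sending the next document.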

usage: unoconvert [-h] [--convert-to CONVERT_TO] [--filter FILTER] [--filter-options FILTER_OPTIONS] [--update-index]
                  [--dont-update-index] [--host HOST] [--port PORT] [--host-location {auto,remote,local}]
                  infile outfile

$ unoconvert --convert-to docx test-toc.docx test-toc-a.docx

Alternatively, convert to PDF

$ unoconvert --convert-to pdf test-toc.docx test-toc.pdf
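
For batch jobs the client can be called from Python. The sketch below builds the unoconvert command line using only options that appear in the usage text above, and passes `--update-index` explicitly so that the intent is visible in the command. `build_unoconvert_command` and `convert` are hypothetical helper names.

```python
import subprocess


def build_unoconvert_command(infile, outfile, convert_to=None, update_index=True):
    """Return the argv list for one unoconvert call."""
    cmd = ["unoconvert"]
    if convert_to:
        cmd += ["--convert-to", convert_to]
    # Both flags are listed in the usage output shown in the article.
    cmd.append("--update-index" if update_index else "--dont-update-index")
    cmd += [infile, outfile]
    return cmd


def convert(infile, outfile, **options):
    """Run unoconvert; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_unoconvert_command(infile, outfile, **options), check=True)
```

For example, `convert("test-toc.docx", "test-toc-a.docx", convert_to="docx")` would mirror the first command above, provided the unoserver listener is already running.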

Further reading

libreoffice can also convert document types directly. For example, for a file named my.docx:

Convert to PDF

$ libreoffice --headless --convert-to pdf:writer_pdf_Export my.docx --outdir .

Convert to HTML

$ libreoffice --headless --convert-to "html:XHTML Writer File:UTF8" my.docx --outdir .

Running LibreOffice from a Docker image is recommended.

The image maintained by the unoserver project:

https://github.com/unoconv/unoserver-docker

Docker image registry

https://github.com/unoconv/unoserver-docker/pkgs/container/unoserver-docker

linuxserver/libreoffice

https://hub.docker.com/r/linuxserver/libreoffice

References

Update TOC via command line

龚正阳
