
Preface

I am currently working on a fairly innovative project: using pattern-recognition (induction-based) techniques, I extract structured information from every HTML and XHTML file in an ePub, generate a standardized, marked-up intermediate form for each of them, and then render that intermediate form into target HTML files that all share a consistent structural representation. Once the structure is consistent, various visual themes can be designed specifically for it.

Because the project is fairly large, at this early stage I rely mainly on open-source projects to handle or take over several of its stages. So far I have used the documentation generator Sphinx to implement the idea above fairly quickly, and a similar project, MkDocs, is also being evaluated.

In short, I wrote a program that generates reStructuredText or Markdown (MyST-based) documents from HTML, then let Sphinx build standardized static HTML pages from them, and finally pack the generated static pages. The overall workflow, and how much of each step is handled by me versus by Sphinx, is sketched in the journey diagram below.

journey
    title Automated typesetting pipeline
    section Write Sphinx extensions
        Write role extensions: 1: Me
        Write directive extensions: 1: Me
        Write templates: 3: Me
        Write other injected code: 3: Me
    section Take apart the ePub
        Analyze the OPF file: 5: Me
        Extract resources: 5: Me
    section Convert (x)html to rst or md
        Parse the html: 5: Me
        Preprocess the html: 3: Me
        Convert html to rst: 3: Me
        Convert html to md: 1: Me
    section Convert rst or md to html
        Build html with Sphinx: 5: Sphinx
        Post-process the html: 4: Me
    section Pack and use
        Pack into an ePub: 4: Me
        Serve the ePub as a website: 3: Me

This post mainly shares my thoughts on the packing step and the Python module I wrote specifically for it.

About packing

Once the static pages have been generated with Sphinx or MkDocs, they can be browsed locally in a web browser. But if you want to read them as an ePub, they need to be packed.

An ePub file is really just a zip archive with the .epub extension that follows certain rules. It mainly consists of the following three parts:

  1. Open Publication Structure (OPS): the format of the content documents themselves
  2. Open Packaging Format (OPF): an XML file that describes the structure of the ePub
  3. OEBPS Container Format (OCF): collects all the files into a ZIP archive

For details, you can refer to the EPUB specifications, for example the W3C EPUB spec: https://www.w3.org/TR/epub/
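To see this structure for yourself, any ePub can be opened with Python's zipfile module. A minimal sketch ("book.epub" is just a placeholder path):

    from zipfile import ZipFile

    # Inspect an existing ePub as a plain zip archive ("book.epub" is a placeholder).
    with ZipFile("book.epub") as zf:
        print(zf.read("mimetype"))                # b'application/epub+zip'
        print(zf.read("META-INF/container.xml"))  # tells the reader where the OPF file is
        for name in zf.namelist():                # every resource in the package
            print(name)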

To this end, I implemented a dedicated pack_epub function (a short usage sketch follows the list). When it runs, it does the following:

  1. First, create a zip archive and write a mimetype entry whose content is application/epub+zip
  2. Write a file at the path META-INF/container.xml into the archive; it declares the path of the OPF file, which is fixed here as OEBPS/content.opf

    <?xml version="1.0" encoding="utf-8"?>
    <container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
        <rootfiles>
            <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
        </rootfiles>
    </container>
  3. Create a skeleton OPF file (used when no content.opf is provided)

    <?xml version="1.0" encoding="utf-8"?>
    <package version="3.0" unique-identifier="BookId" xmlns="http://www.idpf.org/2007/opf">
      <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:identifier id="BookId">urn:uuid:%(uuid)s</dc:identifier>
        <dc:language>en</dc:language>
        <dc:title>untitled</dc:title>
        <meta property="dcterms:modified">%(mtime)s</meta>
      </metadata>
      <manifest />
      <spine />
    </package>
  4. Write the files from the folder being packed into the archive under the OEBPS/ directory, preserving their relative directory structure, and update the OPF file accordingly
  5. Write the OPF into the archive at the path OEBPS/content.opf
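For reference, here is a minimal usage sketch (the directory name is only an example; passing a callable as spine_files puts every matching (X)HTML file into the spine):

    # Pack everything under "build/html" into "build/html.epub".
    pack_epub(
        "build/html",
        spine_files=lambda href: href.endswith((".htm", ".html", ".xhtm", ".xhtml")),
    )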

Since information such as the table of contents and the cover can be extracted from the pages Sphinx generates, I also wrote a pack_sphinx_epub function specifically for packing Sphinx output.

Python implementation

#!/usr/bin/env python
# coding: utf-8

__author__  = "ChenyangGao <https://chenyanggao.github.io/>"
__version__ = (0, 0, 1)
__all__ = ["pack_epub", "pack_sphinx_epub"]
__requirements__ = ["lxml", "lxml-stubs", "cssselect"]  # cssselect is required by lxml's .cssselect()

import os.path as ospath
import posixpath

from collections import deque
from datetime import datetime
from functools import partial
from glob import iglob
from itertools import count
from mimetypes import guess_type
from os import fsdecode, listdir, PathLike
from pathlib import Path
from re import compile as re_compile
from typing import (
    cast, Callable, Container, Final, Iterator, Optional, Sequence, MutableSequence
)
from urllib.parse import urlsplit
from uuid import uuid4
from zipfile import ZipFile

from lxml.etree import fromstring, tostring, Element, _Comment, _Element, _ElementTree
from lxml.html import fromstring as html_fromstring


CRE_XML_ENCODING: Final = re_compile(r'(?<=\bencoding=")[^"]+|(?<=\bencoding=\')[^\']+')


def xml_fromstring(doc: bytes | str, parser=None, *, base_url=None) -> _Element:
    doc = doc.lstrip()
    if isinstance(doc, str):
        index = doc.find("\n")
        if index != -1:
            m = CRE_XML_ENCODING.search(doc[:index])
            if m is not None:
                doc = bytes(doc, m[0])
    return fromstring(doc, parser, base_url=base_url)

xml_fromstring.__doc__ = fromstring.__doc__


def to_xhtml(
    etree: _Element | _ElementTree, 
    ensure_epub: bool = False, 
) -> Callable[..., bytes]:
    """Convert an element node and its child nodes into XHTML format.

    :param etree: Element node or node tree that to be processed.
    :param ensure_epub: Determine whether to add epub namespaces to the root element, 
                        but it must itself be a root element or node tree.

    :return: A helper function (ignore if not needed), used to serialize the current 
             element node or node tree (depending on what is provided).
    """
    if isinstance(etree, _ElementTree):
        root = etree.getroot()
        is_root = True
    else:
        root = etree.getroottree().getroot()
        is_root = etree is root
    # NOTE: Because in Sigil editor, double hyphen (--) within comment will 
    #       issue an error, so I just escape all the double hyphens 😂.
    comments: list[_Comment] = etree.xpath(".//comment()") # type: ignore
    if comments:
        for comment in comments:
            if comment.text and "--" in comment.text:
                comment.text = comment.text.replace("--", "&#45;&#45;")
    # NOTE: Because if you want to convert HTML to XHTML, you may need to use 
    #       `lxml.etree.tostring`. When encountering an element node without 
    #       children, it will form a self closing tag, but there is no such 
    #       thing in HTML. However there is a concept of void element in HTML:
    #
    #           - https://html.spec.whatwg.org/multipage/syntax.html#void-elements
    #           - https://developer.mozilla.org/en-US/docs/Glossary/Void_element
    #
    #       A void element is an element in HTML that cannot have any child nodes 
    #       (i.e., nested elements or text nodes). Void elements only have a start 
    #       tag; end tags must not be specified for void elements.
    #       To make sure that all non-void elements do not form self closing tags, 
    #       it is possible to replace the text node with "" by checking that their 
    #       text node is None.
    for el in etree.iter("*"):
        # NOTE: In the past, there were other elements that were void elements, such as 
        #       <param> and <keygen>, but they have all been deprecated and removed. 
        #       An obsoleted element is not occupied by the HTML standard and is not 
        #       considered as a void element. So they can be given new meanings by users, 
        #       and cannot be directly considered as void elements.
        #
        #           - https://developer.mozilla.org/en-US/docs/Web/HTML/Element/param
        #           - https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Releases/69
        if el.tag.lower() not in (
            "area", "base", "br", "col", "embed", "hr", "img", "input", "link", 
            "meta", "source", "track", "wbr", 
        ):
            if el.text is None:
                el.text = ""
    # NOTE: You need to use epub:type to perform `Expression Structural Semantics`, 
    #       so there are two namespaces that need to be defined.
    #
    #           - https://www.w3.org/TR/epub/#app-structural-semantics
    #           - https://www.w3.org/TR/xml-names/
    if is_root:
        if "xmlns" not in root.attrib:
            root.attrib["xmlns"] = "http://www.w3.org/1999/xhtml"
        if ensure_epub and "xmlns:epub" not in root.attrib:
            root.attrib["xmlns:epub"] = "http://www.idpf.org/2007/ops"
    # NOTE: Because UTF-8 is currently the most recommended encoding
    kwargs = {"encoding": "utf-8"}
    if is_root:
        # NOTE: Specify the DOCTYPE as HTML5 (<!DOCTYPE html>), ignoring the original.
        kwargs["doctype"] = "<!DOCTYPE html>"
        # NOTE: Sigil editor does not support XML version > 1.0 🫠
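        # NOTE: lxml's tostring() treats `xml_declaration` as a boolean flag, so any
        #       truthy value simply enables the standard declaration for the chosen encoding.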
        kwargs["xml_declaration"] = '<?xml version="1.0" encoding="UTF-8"?>'
    return partial(tostring, etree, **kwargs) # type: ignore


def oebps_iter(
    top: bytes | str | PathLike, 
    filter: Optional[Callable[[Path], bool]] = None, 
    follow_symlinks: bool = True, 
    on_error: None | bool | Callable = None, 
) -> Iterator[tuple[str, Path]]:
    """Iterate over the directory structure starting from `top` (exclusive), 
    yield a tuple of two paths each time, one is a relative path based on `top`, 
    the other is the corresponding actual path object (does not include a directory).

    Note: This function uses breadth-first search (bfs) to iterate over the directory structure.

    :param top: The directory path to start the iteration from.
    :param filter: A callable that takes a Path object as input and returns True 
                   if the path should be included, or False otherwise.
    :param follow_symlinks: If True, symbolic links will be followed during iteration.
    :param on_error: A callable to handle any error encountered during iteration.

    :yield: A tuple containing the href (a relative path based on `top`) and the corresponding Path object.
    """
    dq: deque[tuple[str, Path]] = deque()
    put, get = dq.append, dq.popleft
    put(("", Path(fsdecode(top))))
    while dq:
        dir_, top = get()
        try:
            path_iterable = top.iterdir()
        except OSError as e:
            if callable(on_error):
                on_error(e)
            elif on_error:
                raise
            continue
        for path in path_iterable:
            if path.is_symlink() and not follow_symlinks or filter and not filter(path):
                continue
            href = dir_ + "/" + path.name if dir_ else path.name
            if path.is_dir():
                put((href, path))
            else:
                yield href, path


def pack_epub(
    source_dir: bytes | str | PathLike, 
    save_path: None | bytes | str | PathLike = None, 
    generate_id: Callable[[str], str] = lambda href: str(uuid4()), 
    content_opf: None | bytes | str | _Element = None, 
    spine_files: None | Sequence | Container | Callable = None, 
    filter: Optional[Callable[[Path], bool]] = lambda path: (
        path.name not in (".DS_Store", "Thumbs.db") and
        not path.name.startswith("._")
    ), 
    follow_symlinks: bool = True, 
    sort: Optional[Callable] = None, 
    finalize: Optional[Callable] = None, 
) -> bytes | str | PathLike:
    """This function is used to pack a directory of files into an ePub format e-book. 

    :param source_dir: The source directory containing the files to be packaged.
    :param save_path: The path where the ePub file will be saved. If not provided, it will be 
                      saved in the same directory as the source with the .epub extension.
    :param generate_id: A function to generate unique identifiers for items in the ePub file.
    :param content_opf: An optional parameter representing the original content.opf file of the ePub.
    :param spine_files: An optional parameter to determine which (HTML or XHTML) files of the ePub 
                        should be included in the spine. 
    :param filter: An optional function used to filter the files to be included in the ePub.
    :param follow_symlinks: A boolean indicating whether to follow symbolic links.
    :param sort: An optional function to sort the files before packaging.
    :param finalize: An optional function to perform a finalization step at end of packaging.

    :return: The path where the ePub file is saved.

    Note:
        - The spine_files parameter serves as a predicate to determine which files should be included 
          in the linear reading order (spine) of the ePub. 
        - The spine_files is a sequence, container, or callable, it is used to specify the inclusion 
          criteria for files in the spine.
        - If spine_files is a sequence, it also determines the order of the spine.
    """
    source_dir = ospath.abspath(fsdecode(source_dir))
    if not save_path:
        save_path = source_dir + ".epub"
    if isinstance(content_opf, _Element):
        opf_etree = content_opf.getroottree().getroot()
    elif content_opf:
        opf_etree = xml_fromstring(content_opf)
    elif ospath.isfile(ospath.join(source_dir, "content.opf")):
        opf_etree = fromstring(open(ospath.join(source_dir, "content.opf"), "rb").read())
    else:
        opf_etree = fromstring(b'''\
<?xml version="1.0" encoding="utf-8"?>
<package version="3.0" unique-identifier="BookId" xmlns="http://www.idpf.org/2007/opf">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="BookId">urn:uuid:%(uuid)s</dc:identifier>
    <dc:language>en</dc:language>
    <dc:title>untitled</dc:title>
    <meta property="dcterms:modified">%(mtime)s</meta>
  </metadata>
  <manifest />
  <spine />
</package>''' % {
    b"uuid": bytes(str(uuid4()), "utf-8"), 
    b"mtime": bytes(datetime.now().strftime("%FT%XZ"), "utf-8"), 
})
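    # NOTE: Assumes the OPF's children are, in order: <metadata>, <manifest>, <spine>
    #       (as in the default template above).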
    opf_manifest = opf_etree[1]
    opf_spine = opf_etree[2]
    id_2_item: dict[str, _Element] = {el.attrib["id"]: el for el in opf_manifest if "href" in el.attrib} # type: ignore
    href_2_id_cache = {el.attrib["href"]: id for id, el in id_2_item.items()}
    spine_map: dict[str, Optional[_Element]] = {
        id_2_item[el.attrib["idref"]].attrib["href"]: el # type: ignore
        for el in opf_spine if el.attrib["idref"] in id_2_item
    }
    is_spine: Optional[Callable] = None
    if isinstance(spine_files, Sequence):
        # NOTE: If spine_files is a sequence and immutable, then the spine will 
        #       consist of at most the items in this sequence.
        for href in spine_files:
            if href not in spine_map:
                spine_map[href] = None
        if isinstance(spine_files, MutableSequence):
            is_spine = lambda href: href.endswith((".htm", ".html", ".xhtm", ".xhtml"))
    elif isinstance(spine_files, Container):
        is_spine = lambda href: href in spine_files # type: ignore
    elif callable(spine_files):
        is_spine = spine_files
    it = oebps_iter(source_dir, filter=filter, follow_symlinks=follow_symlinks)
    if sort:
        it = sort(it)
    with ZipFile(save_path, "w") as book: # type: ignore
        book.writestr("mimetype", "application/epub+zip")
        book.writestr("META-INF/container.xml", '''\
<?xml version="1.0" encoding="utf-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
   </rootfiles>
</container>''')
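        # Walk the source tree: register every file in the manifest, convert
        # HTML files to XHTML on the fly, and copy everything into OEBPS/ inside the zip.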
        for href, path in it:
            if href == "content.opf":
                continue
            media_type = guess_type(href)[0] or "application/octet-stream"
            if href in href_2_id_cache:
                uid  = href_2_id_cache.pop(href)
                item = id_2_item[uid]
                if "media-type" not in item.attrib:
                    item.attrib["media-type"] = media_type
            else:
                uid = str(generate_id(href))
                if uid in id_2_item:
                    nuid = str(generate_id(href))
                    if uid == nuid:
                        for i in count(1):
                            nuid = f"{i}_{uid}"
                            if nuid not in id_2_item:
                                uid = nuid
                                break
                    else:
                        uid = nuid
                        while uid in id_2_item:
                            uid = str(generate_id(href))
                id_2_item[uid] = Element("item", attrib={"id": uid, "href": href, "media-type": media_type})
            book_path = "OEBPS/" + href
            if href.endswith((".htm", ".html")):
                etree = html_fromstring(open(path, "rb").read())
                tostr = to_xhtml(etree, ensure_epub=True)
                book.writestr(book_path, tostr())
            else:
                book.write(path, book_path)
            if href in spine_map and spine_map[href] is None or is_spine and is_spine(href):
                spine_map[href] = Element("itemref", attrib={"idref": uid})
        opf_manifest.clear()
        for uid in href_2_id_cache.values():
            item = id_2_item.pop(uid)
            print("\x1b[38;5;6m\x1b[1mIGNORE\x1b[0m: item has been ignored because file not found: "
                f"\n    |_ href={item.attrib['href']!r}"
                f"\n    |_ item={tostring(item).decode('utf-8')!r}")
        opf_manifest.extend(el for el in id_2_item.values())
        opf_spine.clear()
        opf_spine.extend(el for el in spine_map.values() if el is not None)
        if finalize:
            finalize(book, opf_etree)
        book.writestr("OEBPS/content.opf", 
            b'<?xml version="1.0" encoding="UTF-8"?>\n'+tostring(opf_etree, encoding="utf-8"))
    return save_path


def pack_sphinx_epub(
    source_dir: bytes | str | PathLike, 
    save_path: None | bytes | str | PathLike = None, 
    follow_symlinks: bool = True, 
    sort: Optional[Callable] = None, 
) -> bytes | str | PathLike:
    """Pack a Sphinx documentation into ePub format.

    NOTE: If there are references to online resources, please localize them in advance.

    :param source_dir: Path to the source directory.
    :param save_path: Path where the ePub file will be saved. If not provided, it will be 
                      saved in the same directory as the source with the .epub extension.
    :param follow_symlinks: A boolean indicating whether to follow symbolic links.
    :param sort: An optional function to sort the files before packaging.

    :return: Path to the saved ePub file.
    """
    def clean_toc(el):
        tag = el.tag.lower()
        if tag == "ul":
            el.tag = "ol"
        if tag == "a":
            href = el.attrib.get("href", "")
            el.attrib.clear()
            el.attrib["href"] = href
        else:
            el.attrib.clear()
        # Iterate over a snapshot of the children: removing nodes while
        # iterating the element directly would skip siblings.
        for sel in list(el):
            if sel.tag.lower() in ("ul", "li", "a", "ol"):
                clean_toc(sel)
            else:
                el.remove(sel)
        return el
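    # Called by pack_epub() right before content.opf is written: builds an EPUB 3
    # nav document from Sphinx's nav.html and registers a cover image if one exists.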
    def finalize(book, opf_etree):
        opf_metadata = opf_etree[0]
        opf_manifest = opf_etree[1]
        opf_spine = opf_etree[2]
        # add nav.xhtml
        if "OEBPS/nav.html" in book.NameToInfo and "OEBPS/nav.xhtml" not in book.NameToInfo:
            etree = html_fromstring(book.read("OEBPS/nav.html"))
            # Pass a default, since get_element_by_id() raises KeyError when the id is absent.
            toc_org = etree.get_element_by_id("toc", None)
            if toc_org is None or len(toc_org) == 0:
                toc = None
            else:
                toc = clean_toc(toc_org[0])
            nav = fromstring(b'''\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
  <title>ePub NAV</title>
  <meta charset="utf-8" />
</head>
<body epub:type="frontmatter">
  <nav epub:type="toc" id="toc" role="doc-toc">
    <h1>Table of Contents</h1>
  </nav>
  <nav epub:type="landmarks" id="landmarks" hidden="">
    <h2>Landmarks</h2>
    <ol>
      <li>
        <a epub:type="toc" href="#toc">Table of Contents</a>
      </li>
    </ol>
  </nav>
</body>
</html>''')
            if toc is not None:
                toc_tgt = nav.xpath('//*[@id="toc"]')[0]
                toc_tgt.append(toc)
            book.writestr("OEBPS/nav.xhtml", tostring(
                nav, encoding="utf-8", xml_declaration='<?xml version="1.0" encoding="utf-8"?>'))
            opf_manifest.append(Element("item", attrib={
                "id": "nav.xhtml", "href": "nav.xhtml", "media-type": "application/xhtml+xml", "properties": "nav"}))
            opf_spine.append(Element("itemref", attrib={"idref": "nav.xhtml", "linear": "no"}))
        # set cover
        for item in opf_manifest:
            if not item.attrib["media-type"].startswith("image/"):
                continue
            href = item.attrib["href"]
            name = posixpath.splitext(posixpath.basename(href))[0]
            if name != "cover":
                continue
            uid = item.attrib["id"]
            try:
                cover_meta = opf_etree.xpath('//*[local-name()="meta"][@name="cover"]')[0]
                cover_meta.attrib["content"] = uid
            except IndexError:
                cover_meta = Element("meta", attrib={"name": "cover", "content": uid})
                opf_metadata.append(cover_meta)
    source_dir = ospath.abspath(fsdecode(source_dir))
    if not save_path:
        save_path = source_dir + ".epub"
    if "index.html" in listdir(source_dir):
        index_html_path = ospath.join(source_dir, "index.html")
    else:
        index_html_path = ospath.join(source_dir, 
            next(iglob("**/index.html", root_dir=source_dir, recursive=True)))
        source_dir = ospath.dirname(index_html_path)
    etree = html_fromstring(open(index_html_path, "rb").read())
    spine_files = ["index.html", "nav.html"]
    seen = set(spine_files)
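    # Collect local links from the toctree in index.html; their order becomes
    # the linear reading order (spine) of the ePub.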
    for el in etree.cssselect('li[class^="toctree-l"] > a[href]'):
        href: str = el.attrib["href"] # type: ignore
        urlp = urlsplit(href)
        if urlp.scheme or urlp.netloc:
            continue
        href = urlp.path
        if href in seen:
            continue
        spine_files.append(href)
        seen.add(href)
    pack_epub(
        source_dir, 
        save_path, 
        generate_id=posixpath.basename, 
        spine_files=spine_files, 
        filter=lambda path: path.name not in ("Thumbs.db", "objects.inv") and \
                            not path.name.startswith(".") and \
                            not path.name.endswith((".js.map", ".css.map")), 
        follow_symlinks=follow_symlinks, 
        sort=sort, 
        finalize=finalize, 
    )
    return save_path
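A typical call, assuming the HTML output directory of a Sphinx build (the path below is just an example):

    # Pack a Sphinx HTML build directory into "docs/_build/html.epub".
    pack_sphinx_epub("docs/_build/html")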
