[python] An EPUB Web Server

Preface

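Hello everyone, I'm back. Let me ask you a question: have you always read EPUBs with a dedicated reader app?

Admittedly, a reader lets you take notes, track your reading time, and do all sorts of other things. But to some extent you may also be constrained by it. That constraint not only narrows what you are aware of and hardens your habits of thought; it also breeds dependence.

Of course you cannot do entirely without a reader, but the less a reader constrains you, the more open your thinking can be. It is like being handed a hammer 🔨: if the hammer has often solved your problems in the past, the habits and fixed mindset it cultivates may distort your understanding of what a problem really is, and lead you to reject the fact that problems can be solved in more than one way.

One basic requirement for keeping your thinking open is to regularly consider questions that lie outside your usual boundaries.

They act without understanding what they do, grow accustomed without examining why, and follow the way all their lives without knowing it: such are the many.
-- Mencius, "Jin Xin I" (《孟子·尽心上》)

So today, join me in loosening a few fixed ideas and seeing what we can use to read an EPUB when the device has no dedicated reader 😂.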

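A New Way to Read: In the Browser

Almost any electronic device that can go online ships with a browser. In my view, the browser is close to the most powerful reading tool available today. Using a browser will of course also shape how you think, but it leaves you with a comparatively high degree of openness.

Below are some advantages of the browser as a tool for reading EPUBs:

1. Built-in translation

Modern browsers usually include a translation feature that works between many languages. If you are reading English material, you can have the browser translate the whole page into Chinese, or just look up a word when you need to.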

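2. It has many extensions, and you can write your own

Chrome, Firefox, Safari, Edge, Opera, Vivaldi, Brave, and others all put real effort into supporting extensions, although there are exceptions that have to be treated separately 😄.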

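3. When something goes wrong, the developer tools show you what happened

Almost every modern desktop browser includes developer tools. When the page you are viewing misbehaves, you can use them to find out what is going on, and when something on a page interests you, they can show you information about it as well.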

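4. Powerful rendering

An EPUB is essentially a bundle of web pages. CHM worked the same way in the past, but EPUB is far more capable. Reading an EPUB in a browser is therefore a reasonable option in itself, all the more so if you agree with the following views of mine:

An EPUB can essentially be understood as a website that runs locally: a package of static resources that needs nothing further downloaded from the network;
An EPUB can do anything a static web page can do; just like a static blog built with hexo, Jekyll, Hugo, Gatsby, or similar tools, it can have an attractive appearance and slick interactions;
EPUB 3 is today a technical standard maintained by the W3C, the same body that standardized HTML5 and CSS3 (ES6 comes from Ecma International rather than the W3C), so EPUB 3 is likewise a first-class standard for web content

If so, you will find that reading EPUBs in a browser is close to the natural first choice 😂.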
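
To make the "bundle of web pages" claim concrete, here is a minimal sketch that builds a tiny EPUB-like archive in memory and lists what is inside using only the standard `zipfile` module. The file names and contents are hypothetical demo data of my own, not taken from a real book, and the archive is deliberately incomplete (its `container.xml` is just a stub).

```python
import io
import zipfile


def build_demo_epub() -> bytes:
    """Build a tiny EPUB-like ZIP archive in memory (hypothetical demo content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The EPUB container format expects "mimetype" first and uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", "<container/>")  # stub only
        zf.writestr("OEBPS/chapter1.xhtml",
                    "<html><body><p>Hello</p></body></html>")
        zf.writestr("OEBPS/style.css", "p { color: navy; }")
    return buf.getvalue()


def list_entries(epub_bytes: bytes) -> list[str]:
    """Return the archive member names in the order they were stored."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.namelist()


entries = list_entries(build_demo_epub())
print(entries)
```

Apart from the `mimetype` marker and the `META-INF` directory, everything in the listing is the kind of file a static website would contain.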

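My Implementation: An EPUB Reading Server

Having rambled on this long 👆, I would be a mere armchair theorist if I had nothing concrete to offer. So today I bring you an original Python tool: serve_epub.py, version 0.0.1.

The tool treats any EPUB file as the root directory of a website backend so that you can read the book in a browser, turning pages with the left and right arrow keys. Many more features, even a bookshelf, will be added later; look forward to version 0.0.2.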
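
Before serving anything, the server has to find out which pages the book contains and in what order. The sketch below shows one way to do that, assuming a well-formed book: read `META-INF/container.xml` to locate the OPF package file, then map the spine's `idref` values to manifest `href` values. It uses `xml.etree.ElementTree` for all parsing, which is a simplification of my own; the script in the appendix extracts `<item>` and `<itemref>` with regular expressions instead. The XML samples and the `reading_order` helper are hypothetical.

```python
import io
import posixpath
import zipfile
from xml.etree import ElementTree as ET

# Hypothetical minimal container and package documents for the demo.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

CONTENT_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="c2" href="ch2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="c1"/>
    <itemref idref="c2"/>
  </spine>
</package>"""


def local_name(tag):
    """Strip an XML namespace: '{urn:...}rootfile' becomes 'rootfile'."""
    return tag.rsplit("}", 1)[-1]


def reading_order(zf):
    """Return archive paths of the spine documents in reading order."""
    container = ET.fromstring(zf.read("META-INF/container.xml"))
    opf_path = next(
        el.attrib["full-path"]
        for el in container.iter()
        if local_name(el.tag) == "rootfile"
    )
    opf_root = posixpath.dirname(opf_path)
    package = ET.fromstring(zf.read(opf_path))
    # Manifest: id -> href (relative to the directory holding the OPF file).
    id_to_href = {
        el.attrib["id"]: el.attrib["href"]
        for el in package.iter()
        if local_name(el.tag) == "item"
    }
    # Spine: ordered list of manifest ids.
    return [
        posixpath.join(opf_root, id_to_href[el.attrib["idref"]])
        for el in package.iter()
        if local_name(el.tag) == "itemref"
    ]


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("META-INF/container.xml", CONTAINER_XML)
    zf.writestr("OEBPS/content.opf", CONTENT_OPF)

with zipfile.ZipFile(buf) as zf:
    order = reading_order(zf)
print(order)  # ['OEBPS/ch1.xhtml', 'OEBPS/ch2.xhtml']
```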

$ python serve_epub.py -h
usage: serve_epub.py [-h] [-H HOST] [-p PORT] [-o] path

    📖 EPub reader server 📚

🖍️ TIPS: Press the left ⬅️ and right ➡️ keys to turn pages

positional arguments:
  path                  epub book path

options:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  hostname, default to '0.0.0.0'
  -p PORT, -P PORT, --port PORT
                        port, default to 8080
  -o, --open-browser    open browser to read book
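
The page-turning tip above comes down to a small piece of arithmetic: for the page being served, pick the previous and next entries of the spine, wrapping around at both ends. A minimal sketch of that idea follows; the `neighbors` helper and the file names are hypothetical and only mirror the modulo logic used in the source below.

```python
def neighbors(spine, current):
    """Return (previous, next) spine entries for `current`, wrapping around."""
    count = len(spine)
    index = spine.index(current)  # raises ValueError if the page is not in the spine
    return spine[(index - 1) % count], spine[(index + 1) % count]


spine = ["cover.xhtml", "ch1.xhtml", "ch2.xhtml"]
first_page = neighbors(spine, "cover.xhtml")
last_page = neighbors(spine, "ch2.xhtml")
print(first_page)  # ('ch2.xhtml', 'ch1.xhtml')
print(last_page)   # ('ch1.xhtml', 'cover.xhtml')
```

Because of the wrap-around, pressing left on the first page jumps to the last page, and pressing right on the last page returns to the first.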

https://www.bilibili.com/video/BV1AC4y1E7N8/?aid=748309408&ci...

Appendix: Source Code

#!/usr/bin/env python3
# coding: utf-8

__author__ = "ChenyangGao <https://chenyanggao.github.io/>"
__version__ = (0, 0, 1)

if __name__ != "__main__":
    print("must run as a main module")
    raise SystemExit(1)

from argparse import ArgumentParser, RawDescriptionHelpFormatter

parser = ArgumentParser(description="""\
    📖 \x1b[38;5;4m\x1b[1mEPub reader server\x1b[0m 📚

🖍️ \x1b[38;5;1m\x1b[1mTIPS\x1b[0m: Press the left ⬅️ and right ➡️ keys to turn pages
""", formatter_class=RawDescriptionHelpFormatter)
parser.add_argument("path", help="epub book path")
parser.add_argument("-H", "--host", default="0.0.0.0", help="hostname, default to '0.0.0.0'")
parser.add_argument("-p", "-P", "--port", type=int, default=8080, help="port, default to 8080")
parser.add_argument("-o", "--open-browser", action="store_true", help="open browser to read book")
args = parser.parse_args()

import posixpath

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from io import BytesIO
from re import compile as re_compile
from urllib.parse import quote, unquote, urlsplit
from xml.etree.ElementTree import fromstring
from zipfile import ZipFile


CREB_XML_ENC = re_compile(br"(?<=\bencoding=\")[^\"]+|(?<=\bencoding=')[^']+")
CRE_OPF_ITEM = re_compile(r"<item\s[^>]+?/>")
CRE_OPF_ITEMREF = re_compile(r"<itemref\s[^>]+?/>")


def get_xml_encoding(content, /, default="utf-8"):
    if isinstance(content, str):
        content = bytes(content, "utf-8")
    encoding = default
    for xml_dec in BytesIO(content):  # iterate over the argument, not the global content_opf
        xml_dec = xml_dec.strip()
        if not xml_dec:
            continue
        if not xml_dec.startswith(b"<?"):
            break
        match = CREB_XML_ENC.search(xml_dec)
        if match is None:
            break
        encoding = match[0].decode("ascii")
    return encoding


def get_opf_path(container_xml):
    etree = fromstring(container_xml)
    for el in etree.iter():
        if (
            (el.tag == 'rootfile' or el.tag.endswith('}rootfile')) 
            and el.attrib.get('media-type') == 'application/oebps-package+xml'
        ):
            return unquote(el.attrib['full-path'])
    raise FileNotFoundError('OPF file path not found.')


def opf_item_iter(content_opf):
    if isinstance(content_opf, bytes):
        encoding = get_xml_encoding(content_opf)
        content_opf = content_opf.decode(encoding)
    for m in CRE_OPF_ITEM.finditer(content_opf):
        yield fromstring(m[0])


def opf_itemref_iter(content_opf):
    if isinstance(content_opf, bytes):
        encoding = get_xml_encoding(content_opf)
        content_opf = content_opf.decode(encoding)
    for m in CRE_OPF_ITEMREF.finditer(content_opf):
        yield fromstring(m[0])


class EpubHandler(BaseHTTPRequestHandler):

    def do_HEAD(self):
        path = urlsplit(unquote(self.path)).path.lstrip("/")
        if path == "":
            path = index_file
        elif path.endswith("/"):
            path = path.rstrip("/") + "/index.html"
        if path not in href_2_attr:
            self.send_error(404, "Not Found")  # unlike a bare send_response, this also ends the headers
            return
        fullpath = posixpath.join(opf_root, path)
        filesize = zfile.NameToInfo[fullpath].file_size
        self.send_response(200)
        self.send_header("Content-Length", str(filesize))
        self.send_header("Content-Type", href_2_attr[path].get("media-type", "application/octet-stream"))
        self.send_header("Accept-Ranges", "bytes")
        self.end_headers()

    def do_GET(self):
        path = urlsplit(unquote(self.path)).path.lstrip("/")
        if path == "":
            path = index_file
        elif path.endswith("/"):
            path = path.rstrip("/") + "/index.html"
        if path not in href_2_attr:
            self.send_error(404, "Not Found")  # unlike a bare send_response, this also ends the headers
            return
        fullpath = posixpath.join(opf_root, path)
        filesize = zfile.NameToInfo[fullpath].file_size
        prev_path = next_path = None
        if "Range" in self.headers:
            if filesize == 0:
                self.send_response(206)
                self.send_header("Content-Range", "bytes 0-0/0")
                start = size = 0
            else:
                try:
                    rng = self.get_range(filesize)
                except Exception:
                    rng = None
                if rng is None:
                    self.send_response(416, "Range Not Satisfiable")
                    self.send_header("Content-Range", f"bytes */{filesize}")
                    self.end_headers()
                    return
                start, size = rng
                self.send_response(206)
                self.send_header("Content-Range", f"bytes {start}-{start+size-1}/{filesize}")
        else:
            self.send_response(200)
            start, size = 0, filesize
            count_spines = len(spine_files)
            if count_spines > 1:
                if path.endswith((".html", ".xhtml")):
                    try:
                        index = spine_files.index(path)
                    except ValueError:
                        pass
                    else:
                        prev_path = spine_files[(index-1)%count_spines]
                        next_path = spine_files[(index+1)%count_spines]
        if prev_path is not None:
            inject_code = b'''
<script>
document.addEventListener('keydown', function(e) {
    if (e.keyCode == 37) { // press left key
    window.location.href = "/%s";
    } else if (e.keyCode == 39) { // press right key 
    window.location.href = "/%s";
    }
});
</script>''' % (bytes(quote(prev_path), "utf-8"), bytes(quote(next_path), "utf-8"))
            content = zfile.read(fullpath)  # read the whole member without leaving a handle open
            index = content.rfind(b"</body>")
            if index == -1:
                content += inject_code
            else:
                content = content[:index] + inject_code + content[index:]
            self.send_header("Content-Length", str(len(content)))
        else:
            self.send_header("Content-Length", str(size))
        self.send_header("Content-Type", href_2_attr[path].get("media-type", "application/octet-stream"))
        self.send_header("Accept-Ranges", "bytes")
        self.end_headers()
        if prev_path is not None:
            self.wfile.write(content)
            return
        if size > 0:
            chunk_size = 1 << 16
            write = self.wfile.write
            with zfile.open(fullpath) as f:
                read = f.read
                if start:
                    f.seek(start)
                while size > chunk_size:
                    write(read(chunk_size))
                    size -= chunk_size
                write(read(size))

    def get_range(self, file_size):
        # NOTE: Content-Type "multipart/byteranges" is currently not supported
        # Reference: 
        #   - https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
        #   - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range
        #   - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Range
        #   - https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/206
        #   - https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/416
        range_header = self.headers.get("Range")
        if not range_header:
            return 0, file_size
        unit, rng = range_header.strip().split("=", 1)
        if unit != "bytes":
            return None
        start, end = rng.strip().split("-")
        if not end:
            start = int(start)
            if start >= file_size:
                return None
            return start, file_size - start
        if not start:
            size = int(end)
            if size < 0:
                return None
            elif size >= file_size:
                size = file_size
            return file_size - size, size
        start, end = int(start), int(end)
        if end < 0 or end < start or start >= file_size:
            return None
        if end >= file_size:
            size = file_size - start
        else:
            size = end - start + 1
        return start, size


path = args.path
host = args.host
port = args.port
open_browser = args.open_browser

with ZipFile(path) as zfile:
    opf_path = get_opf_path(zfile.read("META-INF/container.xml"))
    opf_root = posixpath.dirname(opf_path)
    content_opf = zfile.read(opf_path)

    itemlist = list(item.attrib for item in opf_item_iter(content_opf))
    href_2_attr = {unquote(item["href"]): item for item in itemlist}
    id_2_href = {item["id"]: unquote(item["href"]) for item in itemlist}
    spine_files = [id_2_href[itemref.attrib["idref"]] for itemref in opf_itemref_iter(content_opf)]

    for itemref in opf_itemref_iter(content_opf):
        if itemref.attrib.get("linear") != "no":
            index_file = id_2_href[itemref.attrib["idref"]]
            break
    else:
        if posixpath.join(opf_root, "index.html") in zfile.NameToInfo:
            index_file = "index.html"
        elif posixpath.join(opf_root, "index.xhtml") in zfile.NameToInfo:
            index_file = "index.xhtml"
        elif any(((file:=href).endswith((".html", ".xhtml"))) for href in id_2_href.values()):
            index_file = file
        else:
            raise RuntimeError("no mainpage found")

    if open_browser:
        import webbrowser
        from time import sleep
        from threading import Thread

        def open_browser_later():
            # Runs in a background thread; wait briefly so the server is listening first.
            url = f"http://localhost:{port}"
            sleep(1)
            webbrowser.open(url)

        Thread(target=open_browser_later).start()

    with ThreadingHTTPServer((host, port), EpubHandler) as httpd:
        host, port = httpd.socket.getsockname()[:2]
        url_host = f'[{host}]' if ':' in host else host
        print(
            f"Serving HTTP on {host} port {port} "
            f"(http://{url_host}:{port}/) ..."
        )
        try:
            httpd.serve_forever()
        except KeyboardInterrupt:
            print("\nKeyboard interrupt received, exiting.")

# TODO: support for injecting code (append to body): css, js, html
# TODO: Before injecting all code, inject some environment variables: e.g. item-list, spine-list
# TODO: Enhance fault tolerance, only 404 will be reported when encountering non-existent files, without errors such as IndexError or KeyError
# TODO: Check the opf file, if there are any files that do not exist, ignore them
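
The `Range` handling in `get_range` is easier to check in isolation. Below is a standalone sketch of the same idea: turn a single `bytes=` range into a `(start, size)` pair, or `None` when the range cannot be satisfied. `parse_range` is a hypothetical variant of my own rather than a copy of the method above; in particular it rejects multi-range requests and malformed numbers explicitly instead of relying on an exception being caught by the caller.

```python
def parse_range(header, file_size):
    """Parse a single-range 'Range' header value into (start, size), or None."""
    unit, _, spec = header.strip().partition("=")
    if unit.strip() != "bytes" or "," in spec:
        return None  # unknown unit, or multiple ranges (not supported here)
    first, sep, last = spec.strip().partition("-")
    if not sep:
        return None
    try:
        if not first:
            # Suffix form "bytes=-N": the last N bytes of the file.
            if not last:
                return None
            size = min(int(last), file_size)
            if size <= 0:
                return None
            return file_size - size, size
        start = int(first)
        # Open-ended form "bytes=N-" runs to the end of the file.
        end = int(last) if last else file_size - 1
    except ValueError:
        return None
    if start >= file_size or end < start:
        return None
    end = min(end, file_size - 1)  # clamp an over-long range to the file
    return start, end - start + 1


print(parse_range("bytes=0-99", 1000))      # (0, 100)
print(parse_range("bytes=500-", 1000))      # (500, 500)
print(parse_range("bytes=-200", 1000))      # (800, 200)
print(parse_range("bytes=900-2000", 1000))  # (900, 100)
print(parse_range("bytes=1000-", 1000))     # None
```

A `None` result corresponds to the `416 Range Not Satisfiable` branch in `do_GET`.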
