[python] Converting HTML to XHTML

A bit of history

Sigil was designed to make it easy to create great ebooks using the EPUB format.

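When we build an ebook in the Sigil editor, we are always required to use XHTML (eXtensible HyperText Markup Language), a stricter variant of HTML (HyperText Markup Language).

The most common markup language today is HTML. It was designed by Tim Berners-Lee, the inventor of the World Wide Web, and it was the first markup language used to build and display web pages.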

Tim Berners-Lee, a British scientist, invented the World Wide Web (WWW) in 1989, while working at CERN. The web was originally conceived and developed to meet the demand for automated information-sharing between scientists in universities and institutes around the world.

Screenshot of the recreated page of the first website (Image: CERN)

The first page of Tim Berners-Lee's proposal for the World Wide Web, written in March 1989 (Image: CERN)

proportions of World Wide Web content constituting the surface web, deep web, and dark web

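Soon after its creation HTML evolved rapidly, and its later versions became the standard for markup languages. It kept evolving through a series of major versions, yet up to HTML4 all of these versions were usually just called "HTML". This draws a line between the earlier versions and the current version, HTML5, and underlines how different they are.

Notably, before HTML5 was released, the World Wide Web Consortium (W3C) started an effort to develop an extension of basic HTML and merge it with the XML format. The goal was to solve some of the compatibility problems browsers were running into at the time.

From the late 1990s to the early 2000s the W3C was very fond of XML and believed it should replace the HTML syntax.

That made a lot of sense. There was no HTML parsing specification at the time, so for anything complex you would often find four browser engines interpreting the same HTML document in four different ways. XML, by contrast, had a fully defined parser.

But making such a big change all at once was unrealistic, so in 2000 XHTML 1.0 became a Recommendation, and with it came the suggestion to write HTML in a way that was compatible with both existing HTML parsers and XML parsers.

HTML and XML are both standard markup languages and are functionally very similar. XML, however, is much stricter about error handling and formatting. The resulting language was still very similar to HTML4 but introduced some additional, stricter rules. It was named XHTML, and a new MIME type, application/xhtml+xml, was defined for it.

Differences between XHTML and HTML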

Comparison chart

	HTML	XHTML
Introduction (from Wikipedia)	HTML or HyperText Markup Language is the main markup language for creating web pages and other information that can be displayed in a web browser.	XHTML (Extensible HyperText Markup Language) is a family of XML markup languages that mirror or extend versions of the widely used Hypertext Markup Language (HTML), the language in which web pages are written.
Filename extension	.html, .htm	.xhtml, .xht, .xml, .html, .htm
Internet media type	text/html	application/xhtml+xml
Developed by	W3C & WHATWG	World Wide Web Consortium
Type of format	Document file format	Markup language
Extended from	SGML	XML, HTML
Stands for	HyperText Markup Language	Extensible HyperText Markup Language
Application	Application of Standard Generalized Markup Language (SGML).	Application of XML
Function	Web pages are written in HTML.	Extended version of HTML that is stricter and XML-based.
Nature	Flexible framework requiring lenient HTML-specific parser.	Restrictive subset of XML and needs to be parsed with standard XML parsers.
Origin	First publicly described by Tim Berners-Lee in 1991.	World Wide Web Consortium Recommendation in 2000.
Versions	HTML 2, HTML 3.2, HTML 4.0, HTML 5.	XHTML 1, XHTML 1.1, XHTML 2, XHTML 5.

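Document structure

The XHTML DOCTYPE is mandatory
The XML namespace attribute on <html> is mandatory
<html>, <head>, <title> and <body> are mandatory as well

Element syntax

XHTML elements must be properly nested
XHTML elements must always be closed
XHTML element names must be lowercase
An XHTML document must have exactly one root element

Attribute syntax

XHTML attribute names must be lowercase
XHTML attribute values must be quoted
Attribute minimization is forbidden in XHTML

This means: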

<!-- Instead of: -->
<HTML LANG="en">

<!-- You'd write: -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

Because we have to scare off newcomers who don't play by the rules 😂.

<!-- Instead of: -->
<option value=foo selected>…</option>

<!-- You'd write: -->
<option value="foo" selected="selected">…</option>

Because in XML attributes need a value, and the value must be enclosed in quotes.

<!-- Instead of: -->
<img src="…">

<!-- You'd write: -->
<img src="…" />

Because in XML tags must be closed explicitly, and XML has a shorthand for self-closing tags: />.
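The attribute and closing-tag rules can be checked quickly by feeding both spellings to a strict XML parser. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the file name `cover.png` are made up for illustration and stand in for the elided values in the examples.

```python
import xml.etree.ElementTree as ET

# (HTML-style spelling, XHTML-style spelling) pairs modelled on the examples.
# "cover.png" is a hypothetical placeholder value.
PAIRS = [
    ('<option value=foo selected>x</option>',
     '<option value="foo" selected="selected">x</option>'),
    ('<img src="cover.png">',
     '<img src="cover.png" />'),
]


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


for html_style, xhtml_style in PAIRS:
    print(f"{html_style!r}: {is_well_formed(html_style)}")
    print(f"{xhtml_style!r}: {is_well_formed(xhtml_style)}")
```

The HTML-style spellings are rejected by the XML parser, while the XHTML-style spellings are accepted.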

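These rules exist purely for the benefit of XML parsers. Because the documents were served as HTML, the syntactic "extras" were simply ignored.

For <option selected="selected"> the value is ignored, so <option selected=""> works too, and so does <option selected="false"> (the false is ignored). Repeating the attribute name was chosen as the convention for the sake of "consistency".

If you forget to quote an attribute value, the browser will not complain; it just carries on rendering the page.

If a tag ends with />, the browser treats it as a parse error and ignores it. This is where I start to take issue.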

<br /> The br is closed. This text is not inside the br.

But also:

<br> The br is closed. This text is not inside the br.

And this is where it gets confusing:

<div /> The div is open. This text is inside the div.

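In XML, <div /> would be a self-closing div, but not in HTML. In HTML it is not the /> that closes the br; it is the fact that it is a br. br belongs to a special list of elements (void elements) that can never have children, so they close themselves. <div /> does not self-close because div is not on that list.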
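The void-element rule can be imitated with Python's standard `html.parser`. This is only a rough sketch of the rule described above, not the full HTML tree-construction algorithm that browsers implement; the class `TextParentTracker` and the helper `text_parents` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Void elements as listed in the current HTML standard.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class TextParentTracker(HTMLParser):
    """Record, for each piece of text, the element it ends up inside."""

    def __init__(self) -> None:
        super().__init__()
        self.open_elements: List[str] = []
        self.found: List[Tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        # A void element never stays open, so it is never pushed.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # In HTML the trailing "/" is ignored: treat "<x />" like "<x>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.open_elements[-1] if self.open_elements else None
            self.found.append((text, parent))


def text_parents(markup: str) -> List[Tuple[str, Optional[str]]]:
    """Return (text, enclosing element or None) pairs for the markup."""
    parser = TextParentTracker()
    parser.feed(markup)
    parser.close()
    return parser.found


print(text_parents("<br /> after br"))    # [('after br', None)]
print(text_parents("<br> after br"))      # [('after br', None)]
print(text_parents("<div /> after div"))  # [('after div', 'div')]
```

Only the membership test against `VOID_ELEMENTS` decides whether an element stays open; the trailing slash plays no part.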

Complete List of Self-Closing Tags for HTML5

Tag	Description
`<area>`	The HTML `<area>` tag specifies an area within an image map with predetermined clickable zones based on coordinates, which subsequently accepts a URL and behaves as a hyperlink. This element can only be used inside a `<map>` element.
`<base>`	The HTML `<base>` tag specifies a base URI, often known as a base URL, for relative links in a document. A document can only include one `<base>` element. For example, you can specify the base URL once in the header area of your page, and all subsequent relative links will utilize that URL as a starting point.
`<br>`	The HTML `<br>` tag is used to create a line break in the text. It is typically employed in poems or addresses where line division is required. It is an empty tag, which means it contains no content and is referred to as a void element. Including the `<br>` tag in the HTML code functions similarly to pressing the enter key in a word processor.
`<col>`	The HTML `<col>` tag specifies the attributes for columns contained within the `<colgroup>` tag. This allows you to format or add a class to a column or group of columns rather than each individual cell. It is most commonly found within a `<colgroup>` element. This element specifies the style property for each column.
`<embed>`	The HTML `<embed>` tag is used to embed external applications, which are typically multimedia elements such as audio or video, at the specified place in an HTML document. It serves as a container for plug-ins such as flash animations. This is a new tag in HTML 5, and it just requires the beginning tag.
`<hr>`	The HTML `<hr>` tag is used to insert a horizontal rule or a paragraph-level thematic break in an HTML document to split or separate document sections. It is used when the topic of your HTML content abruptly changes, for example at a change of scene in a story or a switch of topic within a section, and it divides the sections by drawing a horizontal line. The `<hr>` tag is an empty tag that does not require a closing tag.
`<img>`	The HTML `<img>` tag is used to display or embed an image on the web page. The HTML image element is an inline and empty element that only includes attributes; closing tags are not used in the image element.
`<input>`	The HTML `<input>` tag is used to create interactive controls for web-based forms to accept data from the user; depending on the device and user agent, a wide variety of input data and control widgets are accessible. The element is among the most powerful and complex in all HTML tags due to the vast amount of input types and attribute combinations. It is used inside the `<form>` element to declare input controls that allow users to enter data. `<label>` can be used to define labels for the input element.
`<link>`	The HTML `<link>` tag is used to establish a connection between a current document and an external resource. The link tag is mainly used to connect to external sheets and establish site icons (both "favicon" style icons and icons for the home screen and apps on mobile devices), among other things. This element can appear more than once, but it only appears in the head section. The link element's values indicate how the item is linked to and is related to the containing page.
`<meta>`	The HTML `<meta>` tag allows you to add metadata, that is, extra essential information about a document, in a number of ways. The `<meta>` elements can be used to incorporate name/value pairs specifying HTML document features such as expiry date, author, a list of keywords, etc. You can include more than one meta tag in your document depending on the information you wish to maintain. In general, meta tags do not affect the physical appearance of the document, so including or omitting them makes no visible difference.
`<param>`	The HTML `<param>` tag is used to pass a parameter to the object associated with the `<object>` element for plug-ins. We can use several `<param>` tags within an `<object>` element in any order, but each tag must have a name and value attribute and should be inserted at the beginning of the content. The parameter tag governs the behavior of the `<object>` element by specifying a distinct pair of name and value attributes, such as autoplay, controller, etc.
`<source>`	The HTML `<source>` tag is used as a child element to define multiple media resources for the `<audio>`, `<video>`, and `<picture>` elements. It is widely used to provide the same media material in several file formats, such as mp3, mp4, and so on, in order to enable compatibility with a wide range of browsers due to their varying support for image and media file formats. Basically, it is used to attach multimedia assets such as audio, video, and images.
`<track>`	The HTML `<track>` tag is used as a child element of `<audio>` and `<video>` elements in order to define time-based text tracks for a media file. It is used to include a subtitle, caption, or any other type of text that will be rendered when a media file gets displayed. For example, it allows you to set timed text tracks (or time-based data) to handle subtitles automatically. WebVTT (Web Video Text Tracks) format (.vtt files) is used for the tracks.
`<wbr>`	The HTML `<wbr>` tag stands for word break opportunity. This tag denotes a spot within the text where the browser may optionally break a line, even though its line-breaking rules would not otherwise cause a break at that location. It is typically used when the employed term is too long, and there is a risk that the browser would break lines at the incorrect location in order to fit the content.

Converting HTML to XHTML with Python

from functools import partial
from typing import Callable

from lxml.etree import tostring, _Element, _ElementTree


def to_xhtml(
    etree: _Element | _ElementTree, 
    ensure_epub: bool = False, 
) -> Callable[..., bytes]:
    """Convert an element node and its child nodes into XHTML format.

    :param etree: Element node or node tree to be processed.
    :param ensure_epub: Whether to add the XHTML and EPUB namespaces to the root 
                        element. Only takes effect when `etree` is itself a root 
                        element or a node tree.

    :return: A helper function (ignore it if not needed) that serializes the 
             current element node or node tree (whichever was provided).
    """
    if isinstance(etree, _ElementTree):
        root = etree.getroot()
        is_root = True
    else:
        root = etree.getroottree().getroot()
        is_root = etree is root
    # NOTE: The Sigil editor reports an error for a double hyphen (--) inside 
    #       a comment, so every double hyphen is escaped 😂.
    for comment in etree.xpath(".//comment()"):
        if comment.text and "--" in comment.text:
            comment.text = comment.text.replace("--", "&#45;&#45;")
    # NOTE: Converting HTML to XHTML relies on `lxml.etree.tostring`, which 
    #       serializes an element that has neither children nor text as a 
    #       self-closing tag. HTML has no such thing, but it does have the 
    #       concept of void elements:
    #
    #           - https://html.spec.whatwg.org/multipage/syntax.html#void-elements
    #           - https://developer.mozilla.org/en-US/docs/Glossary/Void_element
    #
    #       A void element is an element in HTML that cannot have any child nodes 
    #       (i.e., nested elements or text nodes). Void elements only have a start 
    #       tag; end tags must not be specified for void elements.
    #       To keep non-void elements from being serialized as self-closing 
    #       tags, their text is set to "" whenever it is None.
    for el in etree.iter("*"):
        # NOTE: Other elements used to be void elements, such as <param> and 
        #       <keygen>, but they have since been deprecated or removed. 
        #       An obsolete element is no longer reserved by the HTML standard, 
        #       so users may give it a new meaning, and it cannot simply be 
        #       treated as a void element.
        #
        #           - https://developer.mozilla.org/en-US/docs/Web/HTML/Element/param
        #           - https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Releases/69
        if el.tag.lower() not in (
            "area", "base", "br", "col", "embed", "hr", "img", "input", "link", 
            "meta", "source", "track", "wbr", 
        ):
            if el.text is None:
                el.text = ""
    # NOTE: epub:type is needed to express structural semantics, so two 
    #       namespaces have to be declared.
    #
    #           - https://www.w3.org/TR/epub/#app-structural-semantics
    #           - https://www.w3.org/TR/xml-names/
    if ensure_epub and is_root:
        if "xmlns" not in root.attrib:
            root.attrib["xmlns"] = "http://www.w3.org/1999/xhtml"
        if "xmlns:epub" not in root.attrib:
            root.attrib["xmlns:epub"] = "http://www.idpf.org/2007/ops"
    # NOTE: UTF-8 is currently the most widely recommended encoding.
    kwargs = {"encoding": "utf-8"}
    if is_root:
        # NOTE: Specify the DOCTYPE as HTML5 (<!DOCTYPE html>), ignoring the original.
        kwargs["doctype"] = "<!DOCTYPE html>"
        # NOTE: Sigil editor does not support XML version > 1.0 🫠
        #       `xml_declaration` is a flag; lxml writes the declaration 
        #       (version 1.0, with the encoding above) itself.
        kwargs["xml_declaration"] = True
    return partial(tostring, etree, **kwargs)
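The key trick in the function above, giving non-void elements an empty text node so they are not serialized as self-closing tags, can be tried in isolation. The snippet below is a minimal sketch that assumes `lxml` is installed; the sample fragment is invented for illustration, and the void-element set simply mirrors the tuple used in `to_xhtml`.

```python
from lxml import etree, html

# Same void elements that to_xhtml() leaves untouched.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})

# Hypothetical sample fragment, parsed with lxml's HTML parser.
fragment = html.fromstring("<div><p></p><br><span>hi</span></div>")

# Serialized as XML, the empty <p> collapses into a self-closing tag.
before = etree.tostring(fragment, encoding="utf-8")

# Give every non-void element without text an empty text node.
for el in fragment.iter("*"):
    if el.tag.lower() not in VOID_ELEMENTS and el.text is None:
        el.text = ""

# Now the empty <p> keeps an explicit end tag; <br> is still self-closed.
after = etree.tostring(fragment, encoding="utf-8")

print(before)
print(after)
```

With `to_xhtml` itself, the same effect is obtained by calling the returned helper, for example `to_xhtml(tree)()`, which yields the serialized bytes.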
