An Analysis of How markdown-it Works

Foreword

In "One Article to Take You Through Building a Blog with VuePress + GitHub Pages", we used VuePress to build a blog. The final result can be seen here: TypeScript Chinese documentation.

While building the blog, practical needs led us to explain how to write a markdown-it plugin in "VuePress Blog Optimization: Extending Markdown Syntax". In this article, we dig into the markdown-it source code and explain how it works, with the goal of giving everyone a deeper understanding of markdown-it.

Introduction

Quoting the introduction from the markdown-it GitHub repository:

Markdown parser done right. Fast and easy to extend.

In other words, markdown-it is a Markdown parser that is fast and easy to extend.

The demo address is: https://markdown-it.github.io/

markdown-it has several advantages:

Follows the CommonMark spec and adds syntax extensions and syntactic sugar (such as automatic URL recognition and typographic replacements)
Configurable syntax: you can add new rules or replace existing ones
Fast
Safe by default
Many plugins and other packages are available from the community

Usage

// Install
npm install markdown-it --save

// node.js, "classic" way:
var MarkdownIt = require('markdown-it'),
    md = new MarkdownIt();
var result = md.render('# markdown-it rulezz!');

// browser without AMD, added to "window" on script load
// Note, there is no dash in "markdownit".
var md = window.markdownit();
var result = md.render('# markdown-it rulezz!');

Source code analysis

Looking at markdown-it's entry code, we can see that its logic is very clear:

// ...
var Renderer     = require('./renderer');
var ParserCore   = require('./parser_core');
var ParserBlock  = require('./parser_block');
var ParserInline = require('./parser_inline');

function MarkdownIt(presetName, options) {
  // ...
  this.inline = new ParserInline();
  this.block = new ParserBlock();
  this.core = new ParserCore();
  this.renderer = new Renderer();
  // ...
}

MarkdownIt.prototype.parse = function (src, env) {
  // ...
  var state = new this.core.State(src, this, env);
  this.core.process(state);
  return state.tokens;
};

MarkdownIt.prototype.render = function (src, env) {
  env = env || {};
  return this.renderer.render(this.parse(src, env), this.options, env);
};

The render method also shows that rendering is split into two stages:

Parse: parse the Markdown source into Tokens
Render: traverse the Tokens to generate HTML

This is similar to Babel, except that Babel converts source code into an abstract syntax tree (AST). markdown-it chose not to use an AST, mainly to follow the KISS principle (Keep It Simple, Stupid).
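To make the two stages concrete, here is a minimal sketch of a Parse and Render pipeline. It is not markdown-it's code: the state object, the rule list and the names coreRules, parse and render are assumptions made for illustration, and the sketch only understands ATX headings and single-line paragraphs.

```javascript
// Hypothetical miniature of the Parse -> Render pipeline; not markdown-it's code.

// Parse stage: a chain of rules, each of which reads and mutates `state`.
var coreRules = [
  function normalize(state) {
    // Unify line endings before any other rule looks at the source.
    state.src = state.src.replace(/\r\n?|\n/g, '\n');
  },
  function block(state) {
    state.src.split('\n').forEach(function (line) {
      var match = /^(#{1,6})\s+(.*)$/.exec(line);
      if (match) {
        var tag = 'h' + match[1].length;
        state.tokens.push({ type: 'heading_open', tag: tag });
        state.tokens.push({ type: 'text', content: match[2] });
        state.tokens.push({ type: 'heading_close', tag: tag });
      } else if (line.trim() !== '') {
        state.tokens.push({ type: 'paragraph_open', tag: 'p' });
        state.tokens.push({ type: 'text', content: line });
        state.tokens.push({ type: 'paragraph_close', tag: 'p' });
      }
    });
  }
];

function parse(src) {
  var state = { src: src, tokens: [] };
  coreRules.forEach(function (rule) { rule(state); });
  return state.tokens;
}

// Render stage: walk the flat token array and emit HTML.
function render(src) {
  return parse(src).map(function (token) {
    if (/_open$/.test(token.type)) { return '<' + token.tag + '>'; }
    if (/_close$/.test(token.type)) { return '</' + token.tag + '>\n'; }
    return token.content; // text token (HTML escaping omitted in this sketch)
  }).join('');
}
```

Calling render('# Title\r\nhello') in this sketch yields '<h1>Title</h1>\n<p>hello</p>\n'. The point is only the shape: rules fill state.tokens, and rendering is a separate pass over that array.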

Tokens

So what do Tokens look like? Let's try it out on the demo page.

For the input # header, the generated Tokens are as follows (note: simplified for ease of display):

[
  {
    "type": "heading_open",
    "tag": "h1"
  },
  {
    "type": "inline",
    "tag": "",
    "children": [
      {
        "type": "text",
        "tag": "",
        "content": "header"
      }
    ]
  },
  {
    "type": "heading_close",
    "tag": "h1"
  }
]

For the meaning of each field in a Token, see the Token class documentation.

The difference between Tokens and AST can also be seen through this simple Tokens example:

Tokens are just a simple array
Opening and closing tags are separated
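To see how the flat array relates to a tree, the sketch below rebuilds a nested structure from the Tokens. It relies on the nesting field of markdown-it's Token (1 for an opening tag, 0 for a self-contained token, -1 for a closing tag), which the simplified listing above leaves out. The helper tokensToTree is hypothetical and not part of markdown-it.

```javascript
// Simplified tokens for "# header", with the `nesting` field added back.
var tokens = [
  { type: 'heading_open', tag: 'h1', nesting: 1 },
  {
    type: 'inline', tag: '', nesting: 0,
    children: [{ type: 'text', tag: '', nesting: 0, content: 'header' }]
  },
  { type: 'heading_close', tag: 'h1', nesting: -1 }
];

// Hypothetical helper: rebuild a nested tree from the flat token array.
function tokensToTree(flatTokens) {
  var root = { type: 'root', children: [] };
  var stack = [root];

  flatTokens.forEach(function (token) {
    var parent = stack[stack.length - 1];

    if (token.nesting === 1) {
      // Opening token: start a new node and descend into it.
      var node = {
        type: token.type.replace(/_open$/, ''),
        tag: token.tag,
        children: []
      };
      parent.children.push(node);
      stack.push(node);
    } else if (token.nesting === -1) {
      // Closing token: go back up to the parent node.
      stack.pop();
    } else {
      // Self-contained token: attach it as a leaf (inline children are kept).
      parent.children.push(token);
    }
  });

  return root;
}

var tree = tokensToTree(tokens);
// tree.children[0] is { type: 'heading', tag: 'h1', children: [ <inline token> ] }
```

Because the open and close tokens already carry the nesting information, a tree can be derived when needed, while rules and renderers get to work on a plain array.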

Parse

Check out the code related to the parse method:

// ...
var ParserCore   = require('./parser_core');

function MarkdownIt(presetName, options) {
  // ...
  this.core = new ParserCore();
  // ...
}

MarkdownIt.prototype.parse = function (src, env) {
  // ...
  var state = new this.core.State(src, this, env);
  this.core.process(state);
  return state.tokens;
};

The actual work is done in ./parser_core, so let's look at the code in parser_core.js:

var _rules = [
  [ 'normalize',      require('./rules_core/normalize')      ],
  [ 'block',          require('./rules_core/block')          ],
  [ 'inline',         require('./rules_core/inline')         ],
  [ 'linkify',        require('./rules_core/linkify')        ],
  [ 'replacements',   require('./rules_core/replacements')   ],
  [ 'smartquotes',    require('./rules_core/smartquotes')    ]
];

function Core() {
    // ...
}

Core.prototype.process = function (state) {
    // ...
  for (i = 0, l = rules.length; i < l; i++) {
    rules[i](state);
  }
};

As we can see, the Parse process runs 6 rules by default. Their main functions are as follows:

1. normalize

In CSS, we use normalize.css to smooth out the differences between browsers. The same idea applies here. The code of normalize is actually very simple:

// https://spec.commonmark.org/0.29/#line-ending
var NEWLINES_RE  = /\r\n?|\n/g;
var NULL_RE      = /\0/g;


module.exports = function normalize(state) {
  var str;

  // Normalize newlines
  str = state.src.replace(NEWLINES_RE, '\n');

  // Replace NULL characters
  str = str.replace(NULL_RE, '\uFFFD');

  state.src = str;
};

We know that \n matches a line feed and \r matches a carriage return, so why replace \r\n with \n?

The history of "carriage return" and "line feed" explains where \r\n comes from:

Before computers, there was a device called the Teletype Model 33, which could print 10 characters per second. It had one problem: moving to a new line took 0.2 seconds, which is enough time to print two characters. If a new character arrived during those 0.2 seconds, it would be lost.
To solve this, the developers added two characters to mark the end of each line. One is called "carriage return" and tells the machine to move the print head to the left margin; the other is called "line feed" and tells the machine to move the paper down one line.
This is the origin of "line feed" and "carriage return", as their English names suggest.
Later, when computers were invented, these two concepts carried over. Back then, memory was expensive, and some scientists felt that two characters at the end of every line was wasteful and that one would do. So opinions diverged.
On Unix systems, each line ends with only "<line feed>", that is, "\n"; on Windows systems, each line ends with "<carriage return><line feed>", that is, "\r\n"; on early Mac systems, each line ends with "<carriage return>", that is, "\r". One direct consequence is that a file from a Unix/Mac system opened on Windows may appear as a single line, while a file from Windows opened on Unix/Mac may show an extra ^M symbol at the end of each line.

The reason \r\n is replaced with \n is to follow the specification:

A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.

Among them, U+000A means line feed (LF), and U+000D means carriage return (CR).
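As a quick check of the three cases named in the spec, the sketch below runs the same regular expression over text with each kind of line ending. The function name normalizeNewlines and the sample strings are made up for illustration.

```javascript
var NEWLINES_RE = /\r\n?|\n/g;

// The same two lines saved with Windows, early Mac, and Unix line endings.
var windowsText = 'line 1\r\nline 2';
var oldMacText = 'line 1\rline 2';
var unixText = 'line 1\nline 2';

function normalizeNewlines(src) {
  // `\r\n?` matches CRLF or a lone CR; `\n` matches a lone LF.
  // CRLF is consumed as one match, so it becomes a single "\n", not two.
  return src.replace(NEWLINES_RE, '\n');
}

var results = [windowsText, oldMacText, unixText].map(normalizeNewlines);
// All three entries of `results` are now 'line 1\nline 2'.
```

After this step, every later rule can split on \n alone without caring where the file came from.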

In addition to normalizing line endings, the source code also replaces null characters. In a regular expression, \0 matches the NULL (U+0000) character. According to Wikipedia:

The null character, also known as the null terminator and abbreviated NUL, is a control character with the value 0.
The null character is included in many character encodings, including ISO/IEC 646 (ASCII), the C0 control codes, the Universal Character Set (Unicode), and EBCDIC. Almost all mainstream programming languages support the null character.
The original meaning of this character was similar to a NOP instruction: when sent to a printer or terminal, the device takes no action (although some devices incorrectly print it or display a blank).

We replace the null character with \uFFFD, which in Unicode is the replacement character.

This replacement also follows the specification. See section 2.3 of the CommonMark spec:

For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).

Let's test this effect:

md.render('foo\u0000bar'); // '<p>foo\uFFFDbar</p>\n'

In the output, the originally invisible null character has been replaced with the replacement character, which is visible.
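Since the null character is invisible, one way to confirm the replacement without rendering anything is to print code points before and after. This is a small sketch using the same regular expression as normalize; replaceNull and toCodePoints are hypothetical helper names.

```javascript
var NULL_RE = /\0/g;

function replaceNull(src) {
  return src.replace(NULL_RE, '\uFFFD');
}

// Hypothetical helper: list each character as a "U+XXXX" code point.
function toCodePoints(str) {
  return str.split('').map(function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return 'U+' + ('0000' + hex).slice(-4);
  });
}

var inputPoints = toCodePoints('foo\u0000bar');
var outputPoints = toCodePoints(replaceNull('foo\u0000bar'));
// inputPoints[3] is 'U+0000' and outputPoints[3] is 'U+FFFD';
// all other positions are unchanged.
```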

2. block

The block rule finds blocks and generates tokens. What is a block, and what is inline? The answer can be found in the "Blocks and inlines" section of the CommonMark spec:

We can think of a document as a sequence of blocks—structural elements like paragraphs, block quotations, lists, headings, rules, and code blocks. Some blocks (like block quotes and list items) contain other blocks; others (like headings and paragraphs) contain inline content—text, links, emphasized text, images, code spans, and so on.

In other words, a document is a sequence of blocks. Some blocks (like block quotes and list items) can contain other blocks, while others (like headings and paragraphs) contain inline content such as text, links, emphasized text, images, and code spans.

To see what markdown-it recognizes as blocks, check parser_block.js, which defines the recognition and parsing rules.

I will pick a few of the less common rules to explain:

The code rule identifies indented code blocks (lines indented by 4 spaces).

The fence rule identifies fenced code blocks (delimited by lines of backticks or tildes).

The hr rule recognizes thematic breaks (horizontal rules), such as a line containing *** or ---.

The reference rule identifies link reference definitions, such as [foo]: /url "title".

The html_block rule identifies HTML block-level tags in Markdown, such as div.

The lheading rule identifies Setext headings, that is, text underlined with a line of = or - characters.
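To give a feel for what a block rule has to decide, here is a rough sketch of the check behind the hr rule, written as a single regular expression. It is a simplification and an assumption on my part: markdown-it's real rule walks character codes and works with the block state rather than a regex, and isThematicBreak is a made-up name.

```javascript
// Hypothetical, simplified test for a thematic break line.
function isThematicBreak(line) {
  // Up to 3 leading spaces, then 3 or more of the same marker (*, - or _),
  // optionally separated by spaces or tabs, and nothing else on the line.
  return /^ {0,3}([*\-_])[ \t]*(?:\1[ \t]*){2,}$/.test(line);
}

var checks = {
  stars: isThematicBreak('***'),        // true
  spaced: isThematicBreak('- - -'),     // true
  tooShort: isThematicBreak('--'),      // false: only two markers
  mixed: isThematicBreak('**-'),        // false: markers must match
  indented: isThematicBreak('    ***')  // false: 4 spaces start an indented code block
};
```

The last case shows why rule order and indentation matter: the same characters can belong to a different block depending on context.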

3. inline

The inline rule parses the inline content in Markdown and generates tokens. Block runs first because blocks can contain inline content. The parsing rules can be viewed in parser_inline.js.

Again, I will pick a few of the less common rules to explain:

The newline rule identifies \n and replaces it with a softbreak token, or with a hardbreak token when the line ends with two or more spaces.

The backticks rule recognizes inline code spans delimited by backticks.

The entity rule processes HTML entities, such as &#123; ({), &macr; (¯), and &quot; (").
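The newline rule is easy to imitate. The sketch below splits inline text on \n and emits a softbreak token for an ordinary line ending, or a hardbreak token when the line ends with two or more spaces. It is an illustration under those assumptions only; tokenizeBreaks is a hypothetical name, and the real rule works incrementally on the inline state.

```javascript
// Hypothetical sketch of softbreak vs hardbreak detection.
function tokenizeBreaks(src) {
  var tokens = [];
  var lines = src.split('\n');

  lines.forEach(function (line, i) {
    var isLast = i === lines.length - 1;
    // Two or more trailing spaces before "\n" mean a hard line break.
    var isHard = !isLast && / {2,}$/.test(line);
    // Trailing spaces before a break are not part of the text.
    var text = isLast ? line : line.replace(/ +$/, '');

    if (text) { tokens.push({ type: 'text', content: text }); }
    if (!isLast) { tokens.push({ type: isHard ? 'hardbreak' : 'softbreak' }); }
  });

  return tokens;
}

var breakTypes = tokenizeBreaks('a  \nb\nc').map(function (t) { return t.type; });
// breakTypes is ['text', 'hardbreak', 'text', 'softbreak', 'text']
```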

4. linkify

Automatically recognizes URL-like text and turns it into links (when the linkify option is enabled).

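5. replacements

Performs typographic replacements, such as turning (c) into © and (tm) into ™ (when the typographer option is enabled).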
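A reduced sketch of this kind of typographic replacement is shown below. It covers only three abbreviations, and the names SCOPED_ABBR and replaceScoped are assumptions for illustration; the real rule handles more patterns and skips text inside links.

```javascript
// Hypothetical, reduced version of a typographic replacement pass.
var SCOPED_ABBR = {
  c: '\u00A9',  // copyright sign
  r: '\u00AE',  // registered sign
  tm: '\u2122'  // trade mark sign
};

function replaceScoped(text) {
  // Case-insensitive: (c), (C), (tm), (TM) and so on are all replaced.
  return text.replace(/\((c|tm|r)\)/ig, function (match, name) {
    return SCOPED_ABBR[name.toLowerCase()];
  });
}

var replaced = replaceScoped('(c) 2024 Example(TM)');
// replaced is '\u00A9 2024 Example\u2122'
```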

6. smartquotes

For better typography, straight quotation marks are converted into curly ("smart") quotes (when the typographer option is enabled).

Render

The Render process is actually relatively simple. Looking at the renderer.js file, you can see that some default rendering rules are built in:

default_rules.code_inline
default_rules.code_block
default_rules.fence
default_rules.image
default_rules.hardbreak
default_rules.softbreak
default_rules.text
default_rules.html_block
default_rules.html_inline

These names are also token types. When the renderer traverses the tokens, each token's type is matched against the rules here. Let's look at the code_inline rule, which is very simple:

default_rules.code_inline = function (tokens, idx, options, env, slf) {
  var token = tokens[idx];

  return  '<code' + slf.renderAttrs(token) + '>' +
          escapeHtml(tokens[idx].content) +
          '</code>';
};
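The dispatch itself can be sketched in a few lines: look up a rule by token type, and fall back to printing the tag when no rule exists. The code below is a self-contained imitation, not renderer.js. The names rules, renderToken and renderTokens, and this small escapeHtml, are written here for illustration, and attributes and nested inline tokens are left out.

```javascript
function escapeHtml(str) {
  var map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return str.replace(/[&<>"]/g, function (ch) { return map[ch]; });
}

// Rules are keyed by token type, mirroring the idea of default_rules.
var rules = {
  text: function (tokens, idx) {
    return escapeHtml(tokens[idx].content);
  },
  code_inline: function (tokens, idx) {
    return '<code>' + escapeHtml(tokens[idx].content) + '</code>';
  }
};

// Fallback for token types without a dedicated rule: print the tag itself.
function renderToken(token) {
  return (token.nesting === -1 ? '</' : '<') + token.tag + '>';
}

function renderTokens(tokens) {
  var html = '';
  for (var i = 0; i < tokens.length; i++) {
    var rule = rules[tokens[i].type];
    html += rule ? rule(tokens, i) : renderToken(tokens[i]);
  }
  return html;
}

var html = renderTokens([
  { type: 'paragraph_open', tag: 'p', nesting: 1 },
  { type: 'text', content: 'a < b ' },
  { type: 'code_inline', content: 'x && y' },
  { type: 'paragraph_close', tag: 'p', nesting: -1 }
]);
// html is '<p>a &lt; b <code>x &amp;&amp; y</code></p>'
```

Keeping the rules in a plain object is what makes them easy to override: replacing one entry changes how that token type is rendered without touching the loop.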

Custom Rules

So far, we have a basic understanding of how markdown-it renders. For the Rules in both the Parse and Render stages, markdown-it provides ways to customize them, and this is the key to writing markdown-it plugins. We will cover this in a later article.
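As a small preview, a markdown-it plugin is just a function that receives the parser instance. The sketch below overrides the code_inline render rule through md.renderer.rules and uses md.utils.escapeHtml for escaping. The plugin name inlineCodeClassPlugin and the class name inline-code are made up for this example, and unlike the default rule it ignores any other attributes on the token.

```javascript
// Hypothetical plugin: render inline code with an extra CSS class.
function inlineCodeClassPlugin(md) {
  md.renderer.rules.code_inline = function (tokens, idx, options, env, slf) {
    var token = tokens[idx];

    // Escape the content so that "<" and "&" in code cannot inject HTML.
    return '<code class="inline-code">' +
           md.utils.escapeHtml(token.content) +
           '</code>';
  };
}
```

With markdown-it installed, the plugin would be registered with md.use(inlineCodeClassPlugin). Rules in the Parse stage are customized in a similar spirit through the rulers on md.core, md.block and md.inline.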

Series of articles

The blog building series is the only series of practical tutorials I have written so far. It explains how to use VuePress to build a blog and deploy it to GitHub, Gitee, personal servers, and other platforms.

WeChat: "mqyqingfeng". Add me to join Yayu's only reader group.

If there are any mistakes or inaccuracies, please point them out; thank you very much. If you enjoyed this article or found it inspiring, a star is welcome and is an encouragement to the author.

冴羽
