This article runs to about 5,000 words and explains the implementation principles behind code highlighting in vscode. Likes, follows, and shares are welcome.

The language features of vscode, such as code highlighting, code completion, error diagnosis, and jump-to-definition, are implemented by two extension schemes:

  • Based on lexical analysis: recognize tokens and apply highlight styles
  • Based on the programmable language feature interface: identify code semantics and apply highlight styles; this scheme can also implement features such as error diagnosis, smart prompts, and formatting

The functional scope of the two schemes increases step by step, and so do the technical complexity and implementation cost. This article outlines the workflow and characteristics of the two schemes, what each of them does, and how they cooperate, and uses real cases to uncover step by step how the vscode code highlighting function is implemented:

Vscode plugin basics

Before introducing how vscode code highlighting works, it is worth getting familiar with the underlying architecture of vscode. Similar to Webpack, vscode itself only implements a skeleton. The commands, styles, status, debugging, and other functions inside that skeleton are provided in the form of plug-ins. vscode provides five external extension capabilities:

Among them, the code highlighting function is provided by language extension plug-ins, which can be subdivided into:

  • declarative: Declare a set of lexical matching regular expressions in a specific JSON structure. Without writing any logic code, you can add language features such as block-level matching, automatic indentation, and syntax highlighting. The built-in extensions/css and extensions/html plug-ins of vscode are implemented with the declarative interface
  • programmatic: vscode monitors user behavior at runtime and triggers event callbacks after specific behaviors occur. A programmatic language extension listens to these events, dynamically analyzes the text content, and returns code information in a specific format

The declarative approach is fast but limited; the programmatic approach is slower but more capable. Language plug-in developers usually mix the two: use the declarative interface to identify lexical tokens in the shortest time and provide basic syntax highlighting, then use the programmatic interface to analyze the content dynamically and provide more advanced features such as error diagnosis and intelligent prompts.

The declarative language extension in vscode is implemented based on the TextMate lexical analysis engine; the programmatic language extension is implemented based on three things: the semantic analysis interface, the vscode.languages.* interface, and the Language Server Protocol. The basic logic of each technical solution is introduced below.

Lexical highlighting

Lexical analysis is the process of converting a character sequence into a token sequence, where a token is the smallest unit that makes up the source code. Lexical analysis is widely used in compilers, IDEs, and other fields.

For example, the lexical engine of vscode produces a token sequence and then applies highlight styles according to token type. This process can be simply divided into two steps: tokenization and style application.

Tokenization

Tokenization is essentially the process of recursively splitting a long string of code into string fragments that each have a specific meaning and category, such as operators like +-*/%, keywords like var/const, and constant values like 1234 or "tecvan". Put simply, it identifies where the words are in a piece of text.

The lexical analysis of vscode is based on the TextMate engine. Its functionality is fairly complex and can be roughly divided into three aspects: regular-expression tokenization, compound tokenization rules, and nested tokenization rules.

Basic rules

The TextMate engine is built on regular expression matching. At runtime it scans the text line by line and tests each line against a predefined rule set to see whether it contains content matching a specific regular expression, for example:

{
    "patterns": [
        {
            "name": "keyword.control",
            "match": "\\b(if|while|for|return)\\b"
        }
    ]
}

In the example, patterns defines a set of rules, the match attribute sets the regular expression used to match the token, and the name attribute declares the token's scope. When the TextMate tokenization process encounters content matching the match expression, it treats that content as a separate token and classifies it under the keyword.control scope declared by name.

The above example recognizes the if/while/for/return keywords as keyword.control, but cannot recognize other keywords:

In the context of TextMate, a scope is a hierarchical structure separated by dots. For example, keyword and keyword.control form a parent-child hierarchy. This hierarchy makes CSS-selector-like matching possible in the style processing logic; details are discussed later.
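The match rule described above can be imitated in a few lines. The following is only a minimal sketch, not the real TextMate engine: the MatchRule and Token shapes and the tokenizeLine function are hypothetical names introduced for illustration, and the sketch ignores rule priority and overlapping matches.

```typescript
// Hypothetical shapes for illustration; not the real TextMate or vscode API.
interface MatchRule {
  name: string;  // scope assigned to the matched text
  match: string; // regular expression source, as in the JSON grammar
}

interface Token {
  text: string;
  scope: string;
  start: number; // column where the token starts
}

// Scan one line and emit a token for every match of every rule.
function tokenizeLine(line: string, rules: MatchRule[]): Token[] {
  const tokens: Token[] = [];
  for (const rule of rules) {
    const re = new RegExp(rule.match, 'g');
    let m: RegExpExecArray | null;
    while ((m = re.exec(line)) !== null) {
      tokens.push({ text: m[0], scope: rule.name, start: m.index });
      // Guard against zero-length matches looping forever.
      if (m[0].length === 0) re.lastIndex++;
    }
  }
  // Order tokens by column so they read left to right.
  return tokens.sort((a, b) => a.start - b.start);
}

const keywordRules: MatchRule[] = [
  { name: 'keyword.control', match: '\\b(if|while|for|return)\\b' },
];

// "else" is not listed in the rule, so it is not recognized.
const keywordTokens = tokenizeLine('if (ok) return x; else y;', keywordRules);
```

Here only if and return come back as keyword.control tokens, which mirrors the limitation mentioned above: a keyword that no rule lists is simply left unclassified.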

Compound tokenization

The configuration object in the example above is called a Language Rule in the context of TextMate. In addition to match for matching single-line content, you can use a begin + end attribute pair to match more complex cross-line scenarios. The range recognized from begin to end is treated as a token of the type given by name. For example, syntaxes/vue.tmLanguage.json in vuejs/vetur contains this configuration:

{
    "name": "Vue",
    "scopeName": "source.vue",
    "patterns": [
        {
          "begin": "(<)(style)(?![^/>]*/>\\s*$)",
          // Fictional field, added to make the explanation easier
          "name": "tag.style.vue",
          "beginCaptures": {
            "1": {
              "name": "punctuation.definition.tag.begin.html"
            },
            "2": {
              "name": "entity.name.tag.style.html"
            }
          },
          "end": "(</)(style)(>)",
          "endCaptures": {
            "1": {
              "name": "punctuation.definition.tag.begin.html"
            },
            "2": {
              "name": "entity.name.tag.style.html"
            },
            "3": {
              "name": "punctuation.definition.tag.end.html"
            }
          }
        }
    ]
}

In this configuration, begin matches the <style> statement, end matches the </style> statement, and the whole <style></style> range is assigned the scope tag.style.vue. In addition, beginCaptures and endCaptures assign the characters inside those statements to different scope types.

Here begin with beginCaptures, and end with endCaptures, form a compound structure to some degree, so that multiple lines of content can be matched at once.

Rule nesting

On top of begin + end, TextMate also supports defining nested language rules through a patterns attribute:

{
    "name": "lng",
    "patterns": [
        {
            "begin": "^lng`",
            "end": "`",
            "name": "tecvan.lng.outline",
            "patterns": [
                {
                    "match": "tec",
                    "name": "tecvan.lng.prefix"
                },
                {
                    "match": "van",
                    "name": "tecvan.lng.name"
                }
            ]
        }
    ],
    "scopeName": "tecvan"
}

This configuration recognizes the string between lng` and ` as tecvan.lng.outline. After that, the content between the two is processed recursively with the patterns rules, for example, for:

lng`awesome tecvan`

The recognized tokens are:

  • lng`awesome tecvan` , scope is tecvan.lng.outline
  • tec , scope is tecvan.lng.prefix
  • van , scope is tecvan.lng.name
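To make the begin + end + patterns behaviour concrete, here is a rough sketch that reproduces the three tokens listed above. It is an assumption-laden simplification, not how TextMate is actually implemented: BlockRule, NestedRule, ScopedToken, and tokenizeBlock are hypothetical names, and the sketch works on a single string instead of scanning line by line.

```typescript
// Hypothetical shapes for illustration; not the real TextMate API.
interface NestedRule {
  match: string;
  name: string;
}

interface BlockRule {
  begin: string;
  end: string;
  name: string;
  patterns: NestedRule[];
}

interface ScopedToken {
  text: string;
  scopes: string[]; // outermost scope first
}

function tokenizeBlock(text: string, rule: BlockRule): ScopedToken[] {
  const begin = new RegExp(rule.begin).exec(text);
  if (begin === null) return [];
  const contentStart = begin.index + begin[0].length;

  // Look for the end marker only after the begin marker.
  const end = new RegExp(rule.end).exec(text.slice(contentStart));
  if (end === null) return [];
  const contentEnd = contentStart + end.index;
  const blockEnd = contentEnd + end[0].length;

  // The whole begin..end range gets the outer scope.
  const tokens: ScopedToken[] = [
    { text: text.slice(begin.index, blockEnd), scopes: [rule.name] },
  ];

  // Nested patterns are applied only to the content between begin and end.
  const content = text.slice(contentStart, contentEnd);
  for (const nested of rule.patterns) {
    const re = new RegExp(nested.match, 'g');
    let m: RegExpExecArray | null;
    while ((m = re.exec(content)) !== null) {
      tokens.push({ text: m[0], scopes: [rule.name, nested.name] });
      if (m[0].length === 0) re.lastIndex++;
    }
  }
  return tokens;
}

const lngRule: BlockRule = {
  begin: '^lng`',
  end: '`',
  name: 'tecvan.lng.outline',
  patterns: [
    { match: 'tec', name: 'tecvan.lng.prefix' },
    { match: 'van', name: 'tecvan.lng.name' },
  ],
};

const lngTokens = tokenizeBlock('lng`awesome tecvan`', lngRule);
```

Each nested token keeps the outer scope in its scope list, which is what later allows a style rule written for tecvan to also reach tec and van.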

TextMate also supports language-level nesting, for example:

{
    "name": "lng",
    "patterns": [
        {
            "begin": "^lng`",
            "end": "`",
            "name": "tecvan.lng.outline",
            "contentName": "source.js"
        }
    ],
    "scopeName": "tecvan"
}

Based on the above configuration, the content between lng` and ` is identified as the source.js scope specified by contentName.

Style

Lexical highlighting essentially first splits the original text into a token sequence according to the rules above, and then applies different styles according to token type. On top of tokenization, TextMate provides a structure for configuring styles by the token type field, scope, for example:

{
    "tokenColors": [
        {
            "scope": "tecvan",
            "settings": {
                "foreground": "#eee"
            }
        },
        {
            "scope": "tecvan.lng.prefix",
            "settings": {
                "foreground": "#F44747"
            }
        },
        {
            "scope": "tecvan.lng.name",
            "settings": {
                "foreground": "#007acc"
            }
        }
    ]
}

In the example, the scope attribute supports Scope Selectors, a matching pattern similar to CSS selectors that supports:

  • Element selection: for example, scope = tecvan.lng.prefix matches tokens of type tecvan.lng.prefix; in particular, scope = tecvan matches tokens of sub-types such as tecvan.lng and tecvan.lng.prefix
  • Descendant selection: for example, scope = text.html source.js matches JavaScript code in html documents
  • Group selection: for example, scope = string, comment matches strings or comments
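The three selector forms above can be approximated as follows. This is a simplified sketch under stated assumptions rather than the matching algorithm vscode really uses: matchesScope and matchesSelector are hypothetical helper names, a token is represented by its scope stack from outermost to innermost, and selector specificity and exclusion operators are ignored.

```typescript
// Element selection: a selector matches the same scope or any dot-separated child.
function matchesScope(selector: string, scope: string): boolean {
  return scope === selector || scope.startsWith(selector + '.');
}

// `scopeStack` lists a token's scopes from outermost to innermost.
function matchesSelector(selector: string, scopeStack: string[]): boolean {
  // Group selection: any comma-separated alternative may match.
  return selector.split(',').some((alternative) => {
    const parts = alternative.trim().split(/\s+/);
    // Descendant selection: every part must match a scope, in outer-to-inner order.
    let from = 0;
    for (const part of parts) {
      const at = scopeStack.findIndex(
        (scope, index) => index >= from && matchesScope(part, scope)
      );
      if (at === -1) return false;
      from = at + 1;
    }
    return true;
  });
}

const prefixToken = ['tecvan.lng.outline', 'tecvan.lng.prefix'];
const scriptString = ['text.html.basic', 'source.js', 'string.quoted.double'];

const elementMatch = matchesSelector('tecvan', prefixToken);
const descendantMatch = matchesSelector('text.html source.js', scriptString);
const wrongOrder = matchesSelector('source.js text.html', scriptString);
const groupMatch = matchesSelector('string, comment', scriptString);
```

Note that the prefix test appends a dot before comparing, so a selector such as keyword does not accidentally match an unrelated scope like keywords.x.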

Plug-in developers can define custom scopes or reuse the many built-in TextMate scopes, including comment, constant, entity, invalid, keyword, etc. For a complete list, please refer to the official website.

The settings attribute sets the presentation style of the token and supports style attributes such as foreground, background, bold, italic, and underline.

Example analysis

After the principles, let's take apart an actual case: https://github.com/mrmlnc/vscode-json5 . JSON5 is an extension of JSON designed to be easier for humans to write and maintain by hand. It supports features such as comments, single quotes, and hexadecimal numbers. Highlighting these extended features requires the vscode-json5 plug-in:

In the picture above, the left side shows the effect without vscode-json5 enabled, and the right side shows the effect with it enabled.

The source code of the vscode-json5 plug-in is very simple, with two key points:

  • Declare the contributes attribute of the plug-in in the package.json file, which can be understood as the entrance of the plug-in:
  "contributes": {
    // Language configuration
    "languages": [{
      "id": "json5",
      "aliases": ["JSON5", "json5"],
      "extensions": [".json5"],
      "configuration": "./json5.configuration.json"
    }],
    // Grammar configuration
    "grammars": [{
      "language": "json5",
      "scopeName": "source.json5",
      "path": "./syntaxes/json5.json"
    }]
  }
  • In the syntax configuration file ./syntaxes/json5.json , define the Language Rule in accordance with the requirements of TextMate:
{
    "scopeName": "source.json5",
    "fileTypes": ["json5"],
    "name": "JSON5",
    "patterns": [
        { "include": "#array" },
        { "include": "#constant" }
        // ...
    ],
    "repository": {
        "array": {
            "begin": "\\[",
            "beginCaptures": {
                "0": { "name": "punctuation.definition.array.begin.json5" }
            },
            "end": "\\]",
            "endCaptures": {
                "0": { "name": "punctuation.definition.array.end.json5" }
            },
            "name": "meta.structure.array.json5"
            // ...
        },
        "constant": {
            "match": "\\b(?:true|false|null|Infinity|NaN)\\b",
            "name": "constant.language.json5"
        } 
        // ...
    }
}

That's all there is to it. With this configuration, vscode can apply the syntax highlighting rules of json5.
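The grammar file above uses "include": "#array" style entries that point into repository. As a hedged illustration of how such references might be resolved, the sketch below looks each one up by key. The Rule and Grammar shapes and the resolveIncludes function are hypothetical, and only the local "#name" form of include is handled.

```typescript
// Hypothetical, trimmed-down grammar shapes for illustration.
interface Rule {
  include?: string;
  match?: string;
  begin?: string;
  end?: string;
  name?: string;
}

interface Grammar {
  patterns: Rule[];
  repository: Record<string, Rule>;
}

// Replace every "#key" include in the top-level patterns with the repository rule.
function resolveIncludes(grammar: Grammar): Rule[] {
  return grammar.patterns.map((rule) => {
    if (rule.include === undefined || !rule.include.startsWith('#')) {
      return rule;
    }
    const key = rule.include.slice(1);
    const target: Rule | undefined = grammar.repository[key];
    if (target === undefined) {
      throw new Error(`Unknown repository rule: ${key}`);
    }
    return target;
  });
}

const json5Grammar: Grammar = {
  patterns: [{ include: '#array' }, { include: '#constant' }],
  repository: {
    array: { begin: '\\[', end: '\\]', name: 'meta.structure.array.json5' },
    constant: {
      match: '\\b(?:true|false|null|Infinity|NaN)\\b',
      name: 'constant.language.json5',
    },
  },
};

const resolvedRules = resolveIncludes(json5Grammar);
```

Keeping rules in repository and referring to them by name lets the same rule be reused from several places, for example inside the nested patterns of another rule.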

Debugging tools

Vscode has a built-in scope inspect tool for debugging the token and scope information detected by TextMate. To use it, place the editor cursor on a specific token, press ctrl + shift + p to open the vscode command panel, type the Developer: Inspect Editor Tokens and Scopes command, and press Enter:

After the command is run, you can see the language, scope, style and other information of the token.

Programmatic language extension

The lexical analysis engine TextMate is essentially a regex-based static lexical analyzer. Its advantages are a standardized integration method, low cost, and high runtime efficiency. Its disadvantage is that static lexical analysis struggles to implement certain context-sensitive IDE features, as in the following code:

Note that the function parameter languageModes on the first line and languageModes inside the function body on the second line are the same entity, but they are not rendered in the same style, so no visual association is formed between them.

To this end, vscode provides three more powerful and complex language feature extension mechanisms in addition to the TextMate engine:

  • Use DocumentSemanticTokensProvider to implement programmable semantic analysis
  • Use vscode.languages.* to listen to various editing events and perform semantic analysis at specific points in time
  • Follow the Language Server Protocol to implement a complete language feature analysis server

Compared with the declarative lexical highlighting introduced above, the language feature interface is more flexible and can implement advanced functions such as error diagnosis, candidate words, intelligent prompts, and definition jumps.

DocumentSemanticTokensProvider tokenization

Introduction

Semantic Tokens Provider is a built-in object protocol of vscode. The provider scans the content of the code file itself and returns the semantic token sequence as an integer array, telling vscode which line and column each token is on, how long it is, and what type of token it is.

Note the distinction: scanning in TextMate is engine-driven and matches regular expressions line by line, whereas in the Semantic Tokens Provider scenario both the scanning rules and the matching rules are implemented by the plug-in developer. Flexibility increases, but so does development cost.

In terms of implementation, a Semantic Tokens Provider is defined by the vscode.DocumentSemanticTokensProvider interface. Developers can implement two methods as needed:

  • provideDocumentSemanticTokens : Full analysis of code file semantics
  • provideDocumentSemanticTokensEdits : Incremental analysis of the semantics of the module being edited

Let's look at a complete example:

import * as vscode from 'vscode';

const tokenTypes = ['class', 'interface', 'enum', 'function', 'variable'];
const tokenModifiers = ['declaration', 'documentation'];
const legend = new vscode.SemanticTokensLegend(tokenTypes, tokenModifiers);

const provider: vscode.DocumentSemanticTokensProvider = {
  provideDocumentSemanticTokens(
    document: vscode.TextDocument
  ): vscode.ProviderResult<vscode.SemanticTokens> {
    const tokensBuilder = new vscode.SemanticTokensBuilder(legend);
    tokensBuilder.push(      
      new vscode.Range(new vscode.Position(0, 3), new vscode.Position(0, 8)),
      tokenTypes[0],
      [tokenModifiers[0]]
    );
    return tokensBuilder.build();
  }
};

const selector = { language: 'javascript', scheme: 'file' };

vscode.languages.registerDocumentSemanticTokensProvider(selector, provider, legend);

I believe most readers will find this code unfamiliar. After thinking about it for a long time, I think it is easier to understand from the perspective of the function's output, which is line 17 of the code in the above example, tokensBuilder.build().

Output structure

The provideDocumentSemanticTokens function must return an integer array whose items are read in groups of 5:

  • 5 * i , the row offset of the token relative to the previous token
  • 5 * i + 1 , the column offset of the token relative to the start of the previous token if both are on the same row, otherwise relative to column 0
  • 5 * i + 2 , the token length
  • 5 * i + 3 , the type value of the token
  • 5 * i + 4 , the modifier value of the token, encoded as a set of bit flags

Keep in mind that this is an integer array in which position carries meaning. Every 5 items describe the position and type of one token. The token position consists of three numbers: row, column, and length. To compress the data, vscode deliberately uses relative offsets. For example, for this code:

const name as

If it is simply divided by spaces, then three tokens can be parsed here: const , name , as , and the corresponding description array is:

[
// First token: const
0, 0, 5, x, x,
// Second token: name
0, 6, 4, x, x,
// Third token: as
0, 5, 2, x, x
]

Note that each position is described relative to the previous token. For example, the as token means: offset 0 rows and 5 columns from the start of the previous token, with length 2 and type xx.
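The relative encoding can be sketched as a small function that turns absolute positions into the 5-integer groups described above. This is an illustrative sketch only: AbsoluteToken and encodeRelative are hypothetical names, not part of the vscode API, and the type and modifier values are filled with 0 as placeholders for the x entries in the example.

```typescript
// Hypothetical shape: a token described by absolute position.
interface AbsoluteToken {
  line: number;
  char: number;
  length: number;
  type: number;
  modifiers: number;
}

// Convert absolute positions into groups of
// [deltaLine, deltaStartChar, length, type, modifiers].
function encodeRelative(tokens: AbsoluteToken[]): number[] {
  // Tokens must be ordered by position before computing deltas.
  const sorted = tokens
    .slice()
    .sort((a, b) => a.line - b.line || a.char - b.char);

  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const token of sorted) {
    const deltaLine = token.line - prevLine;
    // On a new line the column is counted from 0 again.
    const deltaChar = deltaLine === 0 ? token.char - prevChar : token.char;
    data.push(deltaLine, deltaChar, token.length, token.type, token.modifiers);
    prevLine = token.line;
    prevChar = token.char;
  }
  return data;
}

// "const name as" on line 0, plus one extra token on line 1 for contrast.
const encoded = encodeRelative([
  { line: 0, char: 0, length: 5, type: 0, modifiers: 0 },  // const
  { line: 0, char: 6, length: 4, type: 0, modifiers: 0 },  // name
  { line: 0, char: 11, length: 2, type: 0, modifiers: 0 }, // as
  { line: 1, char: 2, length: 3, type: 0, modifiers: 0 },  // token on the next line
]);
```

The last token shows the one special case: once the row changes, the column value is no longer a difference from the previous token but the plain column on the new row.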

The remaining 5 * i + 3 and 5 * i + 4 describe the type and modifier of the token. type indicates the token's type, such as comment, class, function, or namespace; modifier refines the type and can be roughly understood as a subtype. For example, a class may be abstract, or may come from the standard library (defaultLibrary).

The specific values of type and modifier need to be defined by the developer. For example, in the above example:

const tokenTypes = ['class', 'interface', 'enum', 'function', 'variable'];
const tokenModifiers = ['declaration', 'documentation'];
const legend = new vscode.SemanticTokensLegend(tokenTypes, tokenModifiers);

// ...

vscode.languages.registerDocumentSemanticTokensProvider(selector, provider, legend);

First, build the legend object, the internal representation of types and modifiers, with the vscode.SemanticTokensLegend class, and then register it with vscode together with the provider through the vscode.languages.registerDocumentSemanticTokensProvider interface.
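How the legend turns names into the integers at 5 * i + 3 and 5 * i + 4 can be sketched as follows. The type is the index into the type list, and the modifiers are combined into one number by setting one bit per modifier index. The names legendTypes, legendModifiers, encodeType, and encodeModifiers are hypothetical and only mimic what the legend does conceptually.

```typescript
// Same lists as in the example above, under hypothetical names.
const legendTypes = ['class', 'interface', 'enum', 'function', 'variable'];
const legendModifiers = ['declaration', 'documentation'];

// A type is encoded as its index in the legend.
function encodeType(type: string): number {
  const index = legendTypes.indexOf(type);
  if (index === -1) throw new Error(`Unknown token type: ${type}`);
  return index;
}

// Modifiers are encoded as bit flags: bit n is set when modifier n applies.
function encodeModifiers(modifiers: string[]): number {
  let bits = 0;
  for (const modifier of modifiers) {
    const index = legendModifiers.indexOf(modifier);
    if (index === -1) throw new Error(`Unknown token modifier: ${modifier}`);
    bits |= 1 << index;
  }
  return bits;
}

const classType = encodeType('class');                                  // index 0
const declarationOnly = encodeModifiers(['declaration']);               // bit 0 set
const bothModifiers = encodeModifiers(['declaration', 'documentation']); // bits 0 and 1 set
```

Because modifiers are flags rather than indices, a token can carry several modifiers at once, and a token without modifiers is encoded as 0.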

Semantic Analysis

The main job of the provider in the above example is to traverse and analyze the file content and return an integer array that follows the rules above. vscode does not limit the analysis method, but it provides SemanticTokensBuilder, a tool for constructing the token description array. For example, in the above example:

const provider: vscode.DocumentSemanticTokensProvider = {
  provideDocumentSemanticTokens(
    document: vscode.TextDocument
  ): vscode.ProviderResult<vscode.SemanticTokens> {
    const tokensBuilder = new vscode.SemanticTokensBuilder(legend);
    tokensBuilder.push(      
      new vscode.Range(new vscode.Position(0, 3), new vscode.Position(0, 8)),
      tokenTypes[0],
      [tokenModifiers[0]]
    );
    return tokensBuilder.build();
  }
};

The code uses the SemanticTokensBuilder interface to build and return the array [0, 3, 5, 0, 1], that is, row 0, column 3, a string of length 5, type = 0 (class), and modifier = 1 (modifiers are encoded as bit flags, and bit 0 stands for declaration). Running effect:

Apart from the token recognized here, all other characters are treated as unrecognized.
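The example above hard-codes a single range. As a hedged sketch of what a slightly more realistic analysis step might look like, the function below scans a source string for class declarations and reports where each class name sits. FoundToken and findClassNames are hypothetical names; a real provider would pass each result to SemanticTokensBuilder.push, and a real language would need a proper parser rather than a regular expression.

```typescript
// Hypothetical shape for one analysis result.
interface FoundToken {
  line: number;
  char: number;
  length: number;
}

// Find the name in every `class Name` declaration, line by line.
function findClassNames(source: string): FoundToken[] {
  const found: FoundToken[] = [];
  source.split('\n').forEach((text, line) => {
    const re = /\bclass\s+([A-Za-z_$][\w$]*)/g;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      const name = m[1];
      if (name === undefined) continue;
      // The name is the last part of the match, so its column is derived from the match end.
      const char = m.index + m[0].length - name.length;
      found.push({ line, char, length: name.length });
    }
  });
  return found;
}

const classNames = findClassNames('class Foo {}\n  class Bar {}');
```

Each result carries exactly the three position numbers (row, column, length) that the output array needs; the type would be the index of class in the legend.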

Summary

In essence, DocumentSemanticTokensProvider only provides a rather rough IoC interface, and what developers can do with it is fairly limited, so most plug-ins today do not adopt this scheme. Readers only need a general understanding and do not need to dig deeper.

Language API

Introduction

Relatively speaking, the vscode.languages.* series of APIs may be more in line with the thinking habits of front-end developers. vscode.languages.* hosts the handling and categorization logic for a series of user interactions and exposes it in the form of event interfaces. Plug-in developers only need to listen to these events, infer language features from the parameters, and return results in the required format.

Vscode Language API provides many event interfaces, for example:

  • registerCompletionItemProvider: provides code completion suggestions
  • registerHoverProvider: triggered when the cursor stays on a token
  • registerSignatureHelpProvider: provides function signature hints

For a complete list, please refer to https://code.visualstudio.com/api/language-extensions/programmatic-language-features#show-hovers .

Hover example

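The Hover function is implemented in two steps. First, make sure package.json declares the plug-in entry. Note that the capabilities.hoverProvider flag shown in the official documentation belongs to a language server's response to the LSP initialize request, so it is not needed in package.json when vscode.languages.* is used directly:

{
    ...
    "main": "out/extensions.js",
    ...
}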

After that, call registerHoverProvider in the activate function to register the hover callback:

export function activate(ctx: vscode.ExtensionContext): void {
    // ...
    vscode.languages.registerHoverProvider('language name', {
        // Called by vscode when the cursor stays on a token
        provideHover(document, position, token) {
            return { contents: ['awesome tecvan'] };
        }
    });
    // ...
}

Running result:

Other features are written in a similar way. Interested readers are encouraged to consult the official website.

Language Server Protocol

Introduction

The language-extension-based code highlighting methods above share a problem: they are difficult to reuse across editors. For the same language, support plug-ins with similar functionality have to be written again for each editor environment. For n languages and m editors, the development cost is n * m.

In order to solve this problem, Microsoft proposed a standard protocol called Language Server Protocol. The language function plug-in and the editor no longer communicate directly, but are isolated through LSP:

Adding the LSP layer brings two benefits:

  • The development language and environment of the LSP layer are decoupled from the host environment provided by the specific IDE
  • The core functions of the language plug-in only need to be written once, and then it can be reused in the IDE that supports the LSP protocol

Although the capabilities of LSP and the aforementioned Language API are almost the same, these two advantages greatly improve the development efficiency of plugins. At present, many vscode language plugins have been migrated to LSP implementation, including well-known plugins such as vetur, eslint, and Python for VSCode.

The LSP architecture in Vscode consists of two parts:

  • Language Client: A standard vscode plug-in that handles the interaction with the vscode environment. For example, hover events are first passed to the client, which then passes them on to the server behind it
  • Language Server: The core implementation of the language features. It communicates with the Language Client through the LSP protocol. Note that the Server instance runs as a separate process
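To give a feel for what travels between the two parts, here is a hedged sketch of how a hover request could be framed on the wire. LSP messages are JSON-RPC bodies preceded by a Content-Length header; the helper names utf8Length, frame, and unframe are hypothetical, the file URI is made up, and real clients and servers leave this work to libraries such as vscode-languageclient.

```typescript
interface RequestMessage {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params: unknown;
}

// Content-Length counts bytes of the UTF-8 encoded body, not characters.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair: one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

// Header part and body are separated by an empty line (\r\n\r\n).
function frame(message: RequestMessage): string {
  const body = JSON.stringify(message);
  return `Content-Length: ${utf8Length(body)}\r\n\r\n${body}`;
}

function unframe(raw: string): RequestMessage {
  const separator = raw.indexOf('\r\n\r\n');
  if (separator === -1) throw new Error('Missing header separator');
  return JSON.parse(raw.slice(separator + 4)) as RequestMessage;
}

// A hover request for line 0, character 3 of a made-up document.
const hoverRequest: RequestMessage = {
  jsonrpc: '2.0',
  id: 1,
  method: 'textDocument/hover',
  params: {
    textDocument: { uri: 'file:///example/demo.txt' },
    position: { line: 0, character: 3 },
  },
};

const wireMessage = frame(hoverRequest);
const decoded = unframe(wireMessage);
```

Because the exchange is plain JSON over a byte stream, the server can be written in any language, which is what makes the client and server independent of each other.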

To make an analogy, LSP is the Language API with an optimized architecture. A function that used to be implemented by a single provider is split into a Client + Server architecture whose two ends can be written in different languages. The Client interacts with vscode and forwards requests; the Server performs code analysis and provides highlighting, completion, prompts, and other functions, as shown below:

Simple example

LSP is a bit more complicated. Readers are advised to first pull down the official vscode examples and study them side by side:

git clone https://github.com/microsoft/vscode-extension-samples.git
cd vscode-extension-samples/lsp-sample
yarn
yarn compile
code .

The main code files of vscode-extension-samples/lsp-sample are:

.
├── client // Language Client
│   ├── src
│   │   └── extension.ts // Language Client entry file
├── package.json 
└── server // Language Server
    └── src
        └── server.ts // Language Server entry file

There are several key points in the sample code:

  1. Declare activation conditions and plug-in entry in package.json
  2. Write entry file client/src/extension.ts , start LSP service
  3. Write the LSP service, namely server/src/server.ts , to implement the LSP protocol

vscode reads the configuration in package.json when loading the plug-in, then loads and runs the plug-in entry and starts the LSP server. Once the plug-in has started, subsequent user interactions in vscode trigger the plug-in's client with standard events such as hover, completion, and signature help, and the client forwards them to the server layer according to the LSP protocol.

Let's take a look at the details of the three modules.

Entry configuration

The package.json of the vscode-extension-samples/lsp-sample example contains two key configuration items:

{
    "activationEvents": [
        "onLanguage:plaintext"
    ],
    "main": "./client/out/extension"
}

Where:

  • activationEvents : Declares the activation conditions of the plug-in. onLanguage:plaintext in the code means the plug-in is activated when a plain text file is opened
  • main : The entry file of the plug-in

Client example

The key parts of the client entry code in the vscode-extension-samples/lsp-sample example are as follows:

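import * as path from 'path';
import { ExtensionContext } from 'vscode';
import {
    LanguageClient,
    LanguageClientOptions,
    ServerOptions,
    TransportKind
} from 'vscode-languageclient/node';

export function activate(context: ExtensionContext) {
    // Entry file of the Server module
    const serverModule = context.asAbsolutePath(
        path.join('server', 'out', 'server.js')
    );

    // Server configuration
    const serverOptions: ServerOptions = {
        run: {
            module: serverModule,
            // Transport; stdio, ipc, pipe, and socket are supported
            transport: TransportKind.ipc
        },
        // ServerOptions also expects a debug entry; the same module is reused here
        debug: { module: serverModule, transport: TransportKind.ipc }
    };

    // Client configuration
    const clientOptions: LanguageClientOptions = {
        // Similar to activationEvents in package.json:
        // selects the documents this client serves
        documentSelector: [{ scheme: 'file', language: 'plaintext' }],
        // ...
    };

    // Create the proxy object from the Server and Client configuration
    const client = new LanguageClient(
        'languageServerExample',
        'Language Server Example',
        serverOptions,
        clientOptions
    );

    client.start();
}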

The code context is very clear. First, the Server and Client configuration objects are defined, and then the LanguageClient instance is created and started. From the examples, we can see that the client layer can be very thin. In the Node environment, most of the forwarding logic is encapsulated in the LanguageClient class, and the developer does not need to care about the details.

Server example

The Server code in the vscode-extension-samples/lsp-sample example implements error diagnosis and code completion. As a learning sample it is a little complicated, so only the error diagnosis part is extracted here:

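import {
    createConnection,
    Diagnostic,
    DiagnosticSeverity,
    ProposedFeatures,
    TextDocuments
} from 'vscode-languageserver/node';
import { TextDocument } from 'vscode-languageserver-textdocument';

// All communication in the Server layer goes through the connection object
// created by createConnection
const connection = createConnection(ProposedFeatures.all);

// Document manager that provides document operations and event listeners.
// Documents matching the Client activation rules are added to it automatically
const documents: TextDocuments<TextDocument> = new TextDocuments(TextDocument);

// Listen for document content changes
documents.onDidChangeContent(change => {
    validateTextDocument(change.document);
});

// Validation
async function validateTextDocument(textDocument: TextDocument): Promise<void> {
    const text = textDocument.getText();
    // Match words written entirely in uppercase
    const pattern = /\b[A-Z]{2,}\b/g;
    let m: RegExpExecArray | null;

    // If a word consists entirely of uppercase letters, report it
    const diagnostics: Diagnostic[] = [];
    while ((m = pattern.exec(text))) {
        const diagnostic: Diagnostic = {
            severity: DiagnosticSeverity.Warning,
            range: {
                start: textDocument.positionAt(m.index),
                end: textDocument.positionAt(m.index + m[0].length)
            },
            message: `${m[0]} is all uppercase.`,
            source: 'ex'
        };
        diagnostics.push(diagnostic);
    }

    // Send the diagnostics;
    // vscode renders the error hints automatically
    connection.sendDiagnostics({ uri: textDocument.uri, diagnostics });
}

// Start listening (the initialize handler of the full sample is omitted in this excerpt)
documents.listen(connection);
connection.listen();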

The main flow of the LSP Server code:

  • Call createConnection to establish a communication link with the vscode main process; all subsequent information exchange goes through the connection object
  • Create a documents object and listen to document events as needed, such as onDidChangeContent
  • Analyze the code content in the event callback and return error diagnosis information according to the language rules. In the example, a regular expression checks whether a word consists entirely of uppercase letters and, if so, the connection.sendDiagnostics interface sends the error message
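The analysis step in the last bullet can be tried without any LSP dependency. The sketch below is a standalone approximation, not the sample's code: SimpleDiagnostic, positionAt, and findUppercaseWords are hypothetical names, and positionAt only imitates what textDocument.positionAt does for text with \n line breaks.

```typescript
interface Position {
  line: number;
  character: number;
}

interface SimpleDiagnostic {
  message: string;
  start: Position;
  end: Position;
}

// Convert a character offset into a zero-based line/character position.
function positionAt(text: string, offset: number): Position {
  const head = text.slice(0, offset);
  const lineStart = head.lastIndexOf('\n') + 1;
  return { line: head.split('\n').length - 1, character: offset - lineStart };
}

// Report every word of two or more letters written entirely in uppercase.
function findUppercaseWords(text: string): SimpleDiagnostic[] {
  const pattern = /\b[A-Z]{2,}\b/g;
  const diagnostics: SimpleDiagnostic[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    diagnostics.push({
      message: `${m[0]} is all uppercase.`,
      start: positionAt(text, m.index),
      end: positionAt(text, m.index + m[0].length),
    });
  }
  return diagnostics;
}

const sampleDiagnostics = findUppercaseWords('hello WORLD\nfoo BAR');
```

Keeping the analysis in a pure function like this makes it easy to unit test, with the connection.sendDiagnostics call left as a thin wrapper around it.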

Running result:

Summary

As the sample code shows, the communication process between the LSP client and server is encapsulated in objects such as LanguageClient and connection. Plug-in developers do not need to care about the underlying implementation details, nor do they need a deep understanding of the LSP protocol: the interfaces and events exposed by these objects are enough to implement simple code highlighting effects.

To sum up

Vscode provides language extension interfaces in the form of plug-ins, divided into two types: declarative and programmatic. In actual projects the two are usually mixed: the TextMate-based declarative interface quickly identifies the lexical tokens in the code, and programmatic interfaces such as LSP supplement it with advanced functions such as error prompts, code completion, and jump-to-definition.

During this period I read through a lot of open source vscode plugins. Among them, Vetur, the plugin officially provided by Vue, is a typical case in this area and has very high learning value. Readers interested in this topic are encouraged to analyze it to learn how vscode language extension plugins are written.


范文杰