This article (about 5,000 words) explains the implementation principles behind VS Code's code highlighting.

VS Code's language features, such as code highlighting, code completion, error diagnostics, and go-to-definition, are realized through two extension schemes:

  • Lexical analysis: recognize tokens through word segmentation and apply highlighting styles
  • Programmable language feature interfaces: recognize code semantics and apply highlighting styles; beyond that, they can implement error diagnostics, intelligent hints, formatting, and other features

The functional scope of the two schemes increases step by step, and so do the technical complexity and implementation cost. This article briefly introduces the working process and characteristics of each scheme — what it accomplishes and how the two complement each other — and then reveals, step by step, the implementation principles of VS Code's highlighting through real cases:

VS Code Plug-in Foundation

Before introducing the principles of VS Code's code highlighting, it helps to be familiar with VS Code's underlying architecture. Similar to Webpack, VS Code itself only implements a shell; the commands, styles, states, debugging, and other functions inside the shell are all provided in the form of plugins. VS Code exposes five kinds of extension capabilities:

Among them, the code highlighting function is realized by the language extension class plug-in, which can be further subdivided into:

  • Declarative: declare a set of regular-expression matches in a specific JSON structure. Without writing any logic, language features such as bracket matching, automatic indentation, and syntax highlighting can be added. Plugins built into VS Code, such as extensions/css and extensions/html, are implemented with declarative interfaces
  • Programmatic: VS Code listens for user behavior at run time and triggers event callbacks when certain behaviors occur. Programmatic language extensions need to listen for these events, dynamically analyze the text content, and return code information in a specific format

Declarative interfaces have high performance but weak capability; programmatic interfaces have lower performance but strong capability. Language plugin developers often mix the two: declarative interfaces recognize lexical tokens in the shortest possible time and provide basic syntax highlighting, while programmatic interfaces dynamically analyze the content to provide more advanced features such as error diagnostics and intelligent hints.

The declarative language extension in VS Code is realized with the TextMate lexical analysis engine. Programmatic language extensions are realized in three ways: the semantic analysis interface, the vscode.languages.* interfaces, and the Language Server Protocol. The following sections introduce the basic logic of each scheme.

Lexical highlight

Lexical analysis is, in computer science, the process of transforming a character sequence into a token sequence, where a token is the smallest unit of source code. Lexical analysis is widely used in compilers, IDEs, and related fields.

VS Code's lexical engine analyzes the token sequence and then applies highlighting styles according to the token types. The process can be divided into two steps: word segmentation and style application.

References:

  • https://macromates.com/manual… \_grammars
  • https://code.visualstudio.com…

Word segmentation

The word segmentation process essentially breaks a long code string, recursively, into string fragments with specific meanings and categories: operators such as +-*/%, keywords such as var/const, and constant values such as 1234 or "Tecvan". Put simply, it identifies that a certain range of text is a word of a certain kind.
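To make this concrete, here is a toy tokenizer sketch — an illustration only, not VS Code's actual implementation — that greedily matches categorized regular expressions against the start of the remaining text:

```typescript
// Toy tokenizer: split a code string into categorized tokens.
// The rule names loosely follow TextMate scope conventions.
type Token = { text: string; type: string };

const rules: Array<[RegExp, string]> = [
    [/^\s+/, 'whitespace'],
    [/^(?:var|const)\b/, 'keyword'],
    [/^[A-Za-z_]\w*/, 'identifier'],
    [/^\d+/, 'constant.numeric'],
    [/^"[^"]*"/, 'constant.string'],
    [/^[+\-*/%=]/, 'operator'],
];

function tokenize(code: string): Token[] {
    const tokens: Token[] = [];
    let rest = code;
    while (rest.length > 0) {
        let matched = false;
        for (const [re, type] of rules) {
            const m = re.exec(rest);
            if (m) {
                // Whitespace is consumed but not emitted as a token
                if (type !== 'whitespace') tokens.push({ text: m[0], type });
                rest = rest.slice(m[0].length);
                matched = true;
                break;
            }
        }
        // Skip characters no rule recognizes
        if (!matched) rest = rest.slice(1);
    }
    return tokens;
}
```

For example, `tokenize('const a = 1234')` yields four tokens: a keyword, an identifier, an operator, and a numeric constant — exactly the kind of classified fragments the highlighter later styles.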

VS Code's lexical analysis is implemented with the TextMate engine, which is feature-rich; its behavior can be divided into three aspects: regex-based word segmentation, compound segmentation rules, and nested segmentation rules.

Basic rules

The TextMate engine underneath VS Code performs word segmentation with regular expressions. At run time it scans the text line by line, testing each line against a predefined rule set. For example, take the following rule configuration:

{
    "patterns": [
        {
            "name": "keyword.control",
            "match": "\\b(if|while|for|return)\\b"
        }
    ]
}

In the example, patterns defines a set of rules: the match attribute specifies the regular expression used to match a token, and the name attribute declares the category (scope) of that token. When the TextMate tokenizer encounters content matching the match regex, it treats it as a separate token and classifies it under the keyword.control scope declared by name.

The above example identifies the if/while/for/return keywords as the keyword.control type, but recognizes no other keywords:
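The rule's behavior can be checked outside VS Code in a few lines (a sketch only; line scanning and scope bookkeeping are omitted):

```typescript
// The rule's regex, as it reads after JSON unescaping
const keywordPattern = /\b(if|while|for|return)\b/g;

// Only the four listed keywords are matched; e.g. `switch` is not
const line = 'if (done) return total; switch (x) {}';
const tokens = line.match(keywordPattern); // → ['if', 'return']
console.log(tokens);
```

Here `switch` gets no `keyword.control` scope because it is simply absent from the alternation, which is exactly why such declarative grammars must enumerate every keyword they want highlighted.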

In the TextMate context, a scope is a dot-separated hierarchical structure. For example, keyword and keyword.control form a parent-child hierarchy, which enables CSS-selector-like matching in the style processing logic.

Compound segmentation rules

The configuration object in the example above is called a Language Rule in TextMate terms. Besides match for single-line content, you can use the begin + end attribute pair to match more complex cross-line scenarios: everything recognized from begin to end is considered a token of the name type. For example, the vuejs/vetur plugin's syntaxes/vue.tmLanguage.json contains a configuration like this:

{
    "name": "Vue",
    "scopeName": "source.vue",
    "patterns": [
        {
            // fictional rule, simplified for explanation
            "begin": "(<)(style)(?![^/>]*/>\\s*$)",
            "name": "tag.style.vue",
            "beginCaptures": {
                "1": { "name": "punctuation.definition.tag.begin.html" },
                "2": { "name": "entity.name.tag.style.html" }
            },
            "end": "(</)(style)(>)",
            "endCaptures": {
                "1": { "name": "punctuation.definition.tag.begin.html" },
                "2": { "name": "entity.name.tag.style.html" },
                "3": { "name": "punctuation.definition.tag.end.html" }
            }
        }
    ]
}

In the configuration, begin matches the opening <style> tag, and everything through the closing </style> is given the scope tag.style.vue. In addition, individual characters within the opening and closing statements are assigned different scope types by the beginCaptures and endCaptures attributes:

Here begin + beginCaptures and end + endCaptures form a kind of composition that lets multiple lines be matched at once.

Nested rules

In addition to begin + end above, TextMate also supports defining nested language rules through sub-patterns, for example:

{
    "name": "lng",
    "patterns": [
        {
            "begin": "^lng`",
            "end": "`",
            "name": "tecvan.lng.outline",
            "patterns": [
                {
                    "match": "tec",
                    "name": "tecvan.lng.prefix"
                },
                {
                    "match": "van",
                    "name": "tecvan.lng.name"
                }
            ]
        }
    ],
    "scopeName": "tecvan"
}

This configuration identifies the string between lng` and the closing backtick and classifies it as tecvan.lng.outline. The content in between is then processed recursively, matching more specific tokens according to the sub-patterns rules. For example, for:

lng`awesome tecvan`

The recognized tokens are:

  • lng`awesome tecvan`, with scope tecvan.lng.outline
  • tec, with scope tecvan.lng.prefix
  • van, with scope tecvan.lng.name

TextMate also supports language-level nesting, for example:

{
    "name": "lng",
    "patterns": [
        {
            "begin": "^lng`",
            "end": "`",
            "name": "tecvan.lng.outline",
            "contentName": "source.js"
        }
    ],
    "scopeName": "tecvan"
}

Based on the above configuration, the content between lng` and the closing backtick is recognized as source.js content, as specified by contentName.

Style application

Lexical highlighting essentially breaks the original text into a sequence of classified tokens according to the above rules, and then applies different styles according to the token type. TextMate provides a declarative structure for configuring styles against a token's scope field, for example:

{
    "tokenColors": [
        {
            "scope": "tecvan",
            "settings": {
                "foreground": "#eee"
            }
        },
        {
            "scope": "tecvan.lng.prefix",
            "settings": {
                "foreground": "#F44747"
            }
        },
        {
            "scope": "tecvan.lng.name",
            "settings": {
                "foreground": "#007acc"
            }
        }
    ]
}

In the example, the scope attribute supports a matching pattern called scope selectors, which are similar to CSS selectors and support:

  • Element selection: for example, scope = tecvan.lng.prefix matches tokens of the tecvan.lng.prefix type; as a special case, scope = tecvan also matches subtypes such as tecvan.lng and tecvan.lng.prefix
  • Descendant selection: for example, scope = text.html source.js matches JavaScript code inside an HTML document
  • Group selection: for example, scope = string, comment matches strings or comments
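The three selector forms above can be approximated in a few lines. This is a simplified sketch of the matching semantics, not TextMate's actual implementation (which also handles specificity ranking):

```typescript
// Does a scope selector match a token's scope path?
// - group parts are comma-separated, any one may match
// - segments within a part must prefix-match scopes in order
// - "tecvan" prefix-matches "tecvan.lng.prefix" (subtype match)
function matchesSelector(selector: string, scopePath: string[]): boolean {
    return selector.split(',').some(part => {
        const segments = part.trim().split(/\s+/);
        let from = 0; // descendant matching: keep moving right
        return segments.every(seg => {
            for (let i = from; i < scopePath.length; i++) {
                if (scopePath[i] === seg || scopePath[i].startsWith(seg + '.')) {
                    from = i + 1;
                    return true;
                }
            }
            return false;
        });
    });
}
```

With this sketch, `matchesSelector('tecvan', ['tecvan.lng.prefix'])` is true (element selection matching a subtype), `matchesSelector('text.html source.js', ['text.html.basic', 'source.js'])` is true (descendant selection), and `matchesSelector('string, comment', ['comment.line'])` is true (group selection).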

Plugin developers can customize the scope or reuse many of TextMate’s built-in scopes, including Comment, Constant, Entity, Invalid, Keyword, etc. For a complete list, please refer to the official website.

The settings property sets the presentation style of the token, supporting foreground, background, bold, italic, underline, and other style properties.

Case study

With the principles covered, let's dissect a real case: https://github.com/mrmlnc/vsc… . JSON5 is a JSON extension protocol designed to make files easier for humans to write and maintain by hand; it supports features such as comments, single quotes, and hexadecimal numbers. These extensions need the vscode-json5 plugin to be highlighted:

In the image above, on the left is the effect without starting vscode-json5, and on the right is the effect after starting.

The vscode-json5 plugin source is very simple, with two key points:

  • In package.json, the plugin is declared through the contributes property, which can be understood as the plugin's entry point:
"contributes": {
    // language configuration
    "languages": [{
        "id": "json5",
        "aliases": ["JSON5", "json5"],
        "extensions": [".json5"],
        "configuration": "./json5.configuration.json"
    }],
    // grammar configuration
    "grammars": [{
        "language": "json5",
        "scopeName": "source.json5",
        "path": "./syntaxes/json5.json"
    }]
}
  • In the grammar configuration file ./syntaxes/json5.json, Language Rules are defined according to TextMate's requirements:
{
    "scopeName": "source.json5",
    "fileTypes": ["json5"],
    "name": "JSON5",
    "patterns": [
        { "include": "#array" },
        { "include": "#constant" }
        // ...
    ],
    "repository": {
        "array": {
            "begin": "\\[",
            "beginCaptures": {
                "0": { "name": "punctuation.definition.array.begin.json5" }
            },
            "end": "\\]",
            "endCaptures": {
                "0": { "name": "punctuation.definition.array.end.json5" }
            },
            "name": "meta.structure.array.json5"
            // ...
        },
        "constant": {
            "match": "\\b(?:true|false|null|Infinity|NaN)\\b",
            "name": "constant.language.json5"
        }
        // ...
    }
}

And that's it — it's as simple as that. VS Code can now apply JSON5 syntax highlighting according to this configuration.

Debugging tool

VS Code has a built-in scope inspection tool for debugging the token and scope information detected by TextMate. To use it, place the editor cursor on a specific token and run the command Developer: Inspect Editor Tokens and Scopes.

After the command is run, you can see the language, scope, style and other information of the participle token.

Programmatic language extensions

The TextMate lexical analysis engine is essentially a static, regex-based lexical analyzer. Its advantages are a standardized access mode, low cost, and high runtime efficiency; its disadvantage is that static code analysis struggles to implement context-dependent IDE features, as in the following code:

Note that the function parameter languageNumber on the first line and the languageNumber reference in the function body on the second line are the same entity, yet they are not styled the same — there is no visual linkage.

To this end, VS Code provides three more powerful and complex language feature extension mechanisms in addition to the TextMate engine:

  • Use DocumentSemanticTokensProvider to implement programmable semantic analysis
  • Use the interfaces under vscode.languages.* to listen for various editing-behavior events and perform semantic analysis at specific points
  • Implement a complete language analysis server according to the Language Server Protocol

These language feature interfaces are more flexible than the declarative lexical highlighting described above and enable advanced features such as error diagnostics, completion candidates, intelligent hints, and go-to-definition.

References:

  • https://code.visualstudio.com…
  • https://code.visualstudio.com…
  • https://code.visualstudio.com…

DocumentSemanticTokensProvider tokenization

Introduction

The Semantic Tokens Provider is an object protocol built into VS Code. It scans code files and returns a sequence of semantic tokens encoded as an integer array, telling VS Code which type of token sits at which line and column of the file, and how long it is.

Note that TextMate's scanning is engine-driven, matching regular expressions line by line, while the Semantic Tokens Provider's scanning and matching rules are implemented by plugin developers themselves — more flexible, but with relatively higher development cost.

In terms of implementation, a Semantic Tokens Provider follows the vscode.DocumentSemanticTokensProvider interface definition, in which developers can implement two methods as needed:

  • provideDocumentSemanticTokens: fully analyzes the semantics of a code file
  • provideDocumentSemanticTokensEdits: incrementally analyzes the semantics of the module being edited

Let’s take a look at the complete example:

import * as vscode from 'vscode';

const tokenTypes = ['class', 'interface', 'enum', 'function', 'variable'];
const tokenModifiers = ['declaration', 'documentation'];
const legend = new vscode.SemanticTokensLegend(tokenTypes, tokenModifiers);

const provider: vscode.DocumentSemanticTokensProvider = {
    provideDocumentSemanticTokens(
        document: vscode.TextDocument
    ): vscode.ProviderResult<vscode.SemanticTokens> {
        const tokensBuilder = new vscode.SemanticTokensBuilder(legend);
        tokensBuilder.push(
            new vscode.Range(new vscode.Position(0, 3), new vscode.Position(0, 8)),
            tokenTypes[0],
            [tokenModifiers[0]]
        );
        return tokensBuilder.build();
    }
};

const selector = { language: 'javascript', scheme: 'file' };

vscode.languages.registerDocumentSemanticTokensProvider(selector, provider, legend);

This code may look unfamiliar to most readers. Having puzzled over it for a while, I think it is easiest to start from the function's output — the value returned by provideDocumentSemanticTokens in the example above.

Output structure

The provideDocumentSemanticTokens function returns an integer array whose items are grouped in sets of five:

  • Item 5 * i: the offset of the token's line relative to the previous token
  • Item 5 * i + 1: the offset of the token's starting column relative to the previous token
  • Item 5 * i + 2: the token's length
  • Item 5 * i + 3: the token's type value
  • Item 5 * i + 4: the token's modifier value

The key is that this is a position-correlated integer array: every 5 items describe the position and type of one token. The position consists of three numbers — line, column, and length — and to compress the data, VS Code encodes positions as offsets relative to the previous token. For example, for code like this:

const name as

If we simply split the token by space, we can resolve three tokens: const, name, and as. The corresponding description array is:

[
    // first token: const
    0, 0, 5, x, x,
    // second token: name
    0, 6, 4, x, x,
    // third token: as
    0, 5, 2, x, x
]

Note that positions are described relative to the previous token. For example, the five numbers for the as token mean: offset by 0 lines and 5 columns from the previous token, length 2, type x.
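The delta encoding above can be sketched as a small helper (hypothetical names; VS Code's SemanticTokensBuilder performs this encoding internally):

```typescript
// A token at an absolute position, with numeric type/modifier values
type AbsToken = {
    line: number;
    char: number;
    length: number;
    type: number;
    modifiers: number;
};

// Encode absolute positions into the relative 5-tuple format
function encodeTokens(tokens: AbsToken[]): number[] {
    const data: number[] = [];
    let prevLine = 0;
    let prevChar = 0;
    for (const t of tokens) {
        const deltaLine = t.line - prevLine;
        // The column is relative only when tokens share a line
        const deltaChar = deltaLine === 0 ? t.char - prevChar : t.char;
        data.push(deltaLine, deltaChar, t.length, t.type, t.modifiers);
        prevLine = t.line;
        prevChar = t.char;
    }
    return data;
}
```

Feeding in the three tokens of `const name as` (columns 0, 6, and 11) reproduces the deltas from the array above: `[0, 0, 5, …, 0, 6, 4, …, 0, 5, 2, …]`.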

The remaining items, 5 * i + 3 and 5 * i + 4, describe the token's type and modifier respectively. The type indicates the kind of token — comment, class, function, namespace, and so on. A modifier qualifies the type, approximately like a subtype: for example, abstract for a class, or defaultLibrary for a symbol exported from the standard library.
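For the modifier value, each legend index contributes one bit of an integer bit mask (this follows the semantic tokens specification; the helper below is a hypothetical sketch, not VS Code API):

```typescript
// The modifier legend, as declared in the earlier example
const tokenModifiers = ['declaration', 'documentation'];

// Encode a modifier list as the 5 * i + 4 integer: index n sets bit n
function encodeModifiers(modifiers: string[]): number {
    let mask = 0;
    for (const m of modifiers) {
        const index = tokenModifiers.indexOf(m);
        if (index >= 0) mask |= 1 << index;
    }
    return mask;
}
```

So `['declaration']` encodes to 1, `['declaration', 'documentation']` to 3, and an empty list to 0 — which is why a single integer can carry several modifiers at once.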

The specific values of type and modifier need to be defined by the developer, for example in the above example:

const tokenTypes = ['class', 'interface', 'enum', 'function', 'variable'];
const tokenModifiers = ['declaration', 'documentation'];
const legend = new vscode.SemanticTokensLegend(tokenTypes, tokenModifiers);

// ...

vscode.languages.registerDocumentSemanticTokensProvider(selector, provider, legend);

First, the vscode.SemanticTokensLegend class builds a legend object that internally represents the type and modifier values. Then the provider is registered with VS Code through the vscode.languages.registerDocumentSemanticTokensProvider interface.

Semantic analysis

In the example above, the provider's main job is to traverse and analyze the file content, returning an integer array that satisfies the rules above. VS Code does not constrain the analysis method; it only provides the SemanticTokensBuilder tool for building the token description array, as in the example:

const provider: vscode.DocumentSemanticTokensProvider = {
    provideDocumentSemanticTokens(
        document: vscode.TextDocument
    ): vscode.ProviderResult<vscode.SemanticTokens> {
        const tokensBuilder = new vscode.SemanticTokensBuilder(legend);
        tokensBuilder.push(
            new vscode.Range(new vscode.Position(0, 3), new vscode.Position(0, 8)),
            tokenTypes[0],
            [tokenModifiers[0]]
        );
        return tokensBuilder.build();
    }
};

The code uses the SemanticTokensBuilder interface to build and return the description of a single token: line 0, column 3, length 5, with type index 0 and modifier index 0.

Except for this recognized token, all other characters are considered unrecognized.

summary

In essence, DocumentSemanticTokensProvider only provides a rough IoC interface; what developers can do with it is limited, so most plugins today do not use this scheme. Readers only need to understand it — there is no need to dig deep.

Language API

Introduction

Relatively speaking, the language extension capabilities provided by the vscode.languages.* series of APIs may better match front-end developers' habits of thought. vscode.languages.* hosts a series of user-interaction processing and classification logic, exposed as event interfaces. Plugin developers only need to listen to these events, infer language characteristics from the parameters, and return results according to the rules.

The VS Code Language API provides a number of event interfaces, such as:

  • registerCompletionItemProvider: provides code completion candidates

  • registerHoverProvider: triggered when the cursor hovers over a token

  • registerSignatureHelpProvider: provides function signature help

For a complete list, please refer to https://code.visualstudio.com… .

Hover example

The hover feature is implemented in two steps. First, declare the hover capability in package.json:

{
    ...
    "main": "out/extensions.js",
    "capabilities": {
        "hoverProvider": "true",
        ...
    }
}

Then register the hover callback by calling registerHoverProvider in the activate function:

export function activate(ctx: vscode.ExtensionContext): void {
    // ...
    vscode.languages.registerHoverProvider('language name', {
        provideHover(document, position, token) {
            return {
                contents: ['awesome tecvan']
            };
        }
    });
    // ...
}

Running results:

Other features are written in a similar way; interested readers are advised to explore them on the official website.

Language Server Protocol

Introduction

The plugin-based language extension approaches above share a problem: they are hard to reuse across editors. For the same language, plugins with similar functionality must be rewritten for each editor environment and language, so supporting N languages across M editors costs N * M.

To solve this problem, Microsoft proposed a standard protocol called the Language Server Protocol: the language feature implementation no longer communicates directly with the editor; instead, LSP provides a layer of isolation:

Adding an LSP layer brings two benefits:

  • The development language, environment, etc. of the LSP layer are decoupled from the host environment provided by the specific IDE
  • The core functionality of a language plug-in needs to be written once and reused into an IDE that supports the LSP protocol

Although LSP's capabilities are almost the same as those of the Language API above, these two advantages greatly improve plugin development efficiency. Many VS Code language plugins have already migrated to LSP implementations, including well-known ones such as Vetur, ESLint, and Python for VSCode.

The LSP architecture in VS Code consists of two parts:

  • Language Client: A standard VSCODE plugin that interacts with the VSCODE environment, such as hover events passed first to the Client and then to the underlying server
  • Language Server: A core implementation of the Language feature that communicates with the Language Client via the LSP protocol. Note that the Server instance runs as a separate process

By analogy, LSP is the Language API restructured: the functionality originally realized by a single provider function is split into a cross-language Client + Server architecture. The Client interacts with VS Code and forwards requests; the Server performs the code analysis and provides highlighting, completion, hints, and other features, as shown in the figure below:

A simple example

LSP is a little more complicated, so I suggest pulling down the official VS Code example to follow along:

git clone https://github.com/microsoft/vscode-extension-samples.git
cd vscode-extension-samples/lsp-sample
yarn
yarn compile
code .

The main code files in vscode-extension-samples/lsp-sample are:

.
├── client                    // Language Client
│   └── src
│       └── extension.ts      // Language Client entry file
├── package.json
└── server                    // Language Server
    └── src
        └── server.ts         // Language Server entry file

There are a few key points in the sample code:

  1. Declare the activation conditions and the plugin entry in package.json
  2. Write the entry file client/src/extension.ts to start the LSP service
  3. Write the LSP service server/src/server.ts, implementing the LSP protocol

Logically, when loading the plugin, VS Code determines the activation conditions from the package.json configuration, then loads and runs the plugin entry and starts the LSP server. After the plugin starts, subsequent user interactions in VS Code trigger the plugin's Client with standard events such as hover, completion, and signature help, and the Client forwards them to the Server layer according to the LSP protocol.

Let’s take a look at the details of the three modules.

Entry configuration

package.json in vscode-extension-samples/lsp-sample has two key configurations:

{
    "activationEvents": [
        "onLanguage:plaintext"
    ],
    "main": "./client/out/extension"
}

Among them:

  • activationEvents: declares the plugin's activation conditions; in the code, onLanguage:plaintext activates the plugin when a plaintext (.txt) file is opened
  • main: The plug-in’s entry file

The Client sample

The key parts of the Client entry code in vscode-extension-samples/lsp-sample are as follows:

export function activate(context: vscode.ExtensionContext) {
    // Server configuration
    const serverOptions: ServerOptions = {
        run: {
            // Server module entry file
            module: context.asAbsolutePath(path.join('server', 'out', 'server.js')),
            // Supported transports: stdio, ipc, pipe, socket
            transport: TransportKind.ipc
        }
    };
    // Client configuration
    const clientOptions: LanguageClientOptions = {
        // Similar to activationEvents in package.json:
        // the plugin's activation criteria
        documentSelector: [{ scheme: 'file', language: 'plaintext' }]
        // ...
    };
    const client = new LanguageClient(
        'languageServerExample',
        'Language Server Example',
        serverOptions,
        clientOptions
    );
    client.start();
}

The code flow is clear: first define the Server and Client configuration objects, then create and start a LanguageClient instance. As the example shows, the Client layer can be very thin: in a Node environment, most of the forwarding logic is encapsulated in the LanguageClient class, so developers need not worry about the details.

Server sample

The Server code in vscode-extension-samples/lsp-sample implements error diagnostics and code completion, which is a bit much for a learning sample, so only the error diagnostics part is excerpted here:

// Establish a communication link with the VS Code main process
const connection = createConnection(ProposedFeatures.all);

// Documents matching the Client activation rules are automatically
// added to the documents manager
const documents: TextDocuments<TextDocument> = new TextDocuments(TextDocument);

// Listen for document content change events
documents.onDidChangeContent(change => {
    validateTextDocument(change.document);
});

// Analyze the document content and generate diagnostics
async function validateTextDocument(textDocument: TextDocument): Promise<void> {
    const text = textDocument.getText();
    // Match words of two or more uppercase letters
    const pattern = /\b[A-Z]{2,}\b/g;
    let m: RegExpExecArray | null;

    const diagnostics: Diagnostic[] = [];
    while ((m = pattern.exec(text))) {
        const diagnostic: Diagnostic = {
            severity: DiagnosticSeverity.Warning,
            range: {
                start: textDocument.positionAt(m.index),
                end: textDocument.positionAt(m.index + m[0].length)
            },
            message: `${m[0]} is all uppercase.`,
            source: 'ex'
        };
        diagnostics.push(diagnostic);
    }
    // Send the diagnostics; VS Code automatically renders the errors
    connection.sendDiagnostics({ uri: textDocument.uri, diagnostics });
}

The main flow of the LSP Server code:

  • Call createConnection to establish the communication link with the VS Code main process; all subsequent information interaction is realized through the connection object
  • Create the documents object and listen for document events as needed, such as onDidChangeContent in the example above
  • Analyze the code content in the event callback and return diagnostics according to the language rules; the example uses a regex to check whether a word is all uppercase and, if so, sends an error message through the connection.sendDiagnostics interface

The result:

summary

Looking through the sample code, the communication between the LSP Client and Server is encapsulated in objects such as LanguageClient and connection, so plugin developers do not need to care about the underlying implementation details, nor understand the LSP protocol in depth — simple code highlighting effects can be achieved through the interfaces and events these objects expose.

conclusion

VS Code provides a variety of language extension interfaces through its plugin system, both declarative and programmatic. In real projects the two are often mixed: the TextMate-based declarative interface quickly recognizes the lexical tokens in the code, supplemented by a programmatic interface such as LSP to provide advanced features like error hints, code completion, and go-to-definition.

I have read many open-source VS Code plugins recently, and Vue's official Vetur plugin is a typical, highly instructive case in this area. Readers interested in the topic are encouraged to analyze it and learn how VS Code language extension plugins are written.