Let’s start with a bug

One day a problem came in through product feedback: the rich text display on business page 1 was broken. The content entered in the admin console was A < B < C, but only A appeared on the final page. Investigation traced it to the rich text rendering logic, which decodes the HTML entities entered in the admin console. When the input contains characters that resemble HTML tags, the browser interprets them as tags when displaying the page, and that part of the text vanishes into thin air.

The rich text rendering flow is:

HTML input –> Entity decode –> dangerouslySetInnerHTML –> DOM –> final UI

When the HTML input looks like this:

a&lt;b&lt;c

it becomes

a<b<c

after decoding, so the browser treats everything from <b onward as a tag and only a is displayed.

The most straightforward fix would be to remove the entity decode step, but in practice that is awkward: this logic has been around for years and is referenced in many places, so changing it would have too wide an impact and require too much testing.

To play it safe, since the decode step is hard to remove before we have a chance to do large-scale testing, can we add a simple hedge first? The idea is to neutralize the entity decode so that the desired UI is still produced. The change stays inside the problem module and does not affect other modules that reference the rendering logic, which keeps the testing workload manageable.

The modified rich text rendering flow looks like this:

HTML input –> [special processing] –> Entity decode –> dangerouslySetInnerHTML –> DOM –> final UI

Experiments show that this works!

Learning the theory

The simplest and crudest way to implement this is a direct substitution for the exact scenario in which the problem arose. But that only treats the symptom, and it risks creating new problems. Can we deal with all possible cases at once?

What matters in HTML is not its odd syntax but the DOM tree you want the browser to construct, so our focus is on how the browser parses HTML. According to the WHATWG documentation, and with a little knowledge of compiler principles, the basic working pattern is clear: read the stream of input characters, analyze the lexical structure with a tokenizer, and build up the DOM.

The tokenizer's analysis relies on the concept of a finite state machine. The tokenizer maintains a state machine that defines all the states involved in HTML parsing; it reads characters from front to back and moves from state to state step by step. The browser engine also maintains a DOM tree internally. When the state machine reaches certain states it emits new tokens, which trigger tree construction steps and fill in the DOM tree. When the character stream has been read to the end, the DOM tree is final.
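
To make the state machine idea concrete, here is a heavily simplified sketch of such a tokenizer. It is not the WHATWG algorithm: it models only three states (data, tag open, tag name), has no attributes, end tags or comments, and the function name tokenize is made up for illustration. It is just enough to show why a<b<c ends up displaying only a.

```javascript
// Simplified sketch of a tokenizer state machine (NOT the full WHATWG algorithm).
// Assumptions: only the data, tag open and tag name states are modelled, and the
// tag name state ends only at ">" (the real parser also handles whitespace and "/").
function tokenize(input) {
  const tokens = [];
  let state = 'data';
  let text = '';
  let tagName = '';

  // Emit the pending text as one text token.
  const flushText = () => {
    if (text) {
      tokens.push({ type: 'text', value: text });
      text = '';
    }
  };

  for (const ch of input) {
    if (state === 'data') {
      if (ch === '<') state = 'tagOpen';
      else text += ch;
    } else if (state === 'tagOpen') {
      if (/[a-zA-Z]/.test(ch)) {
        // A letter after "<" starts a tag: everything that follows is tag name.
        flushText();
        tagName = ch.toLowerCase();
        state = 'tagName';
      } else if (ch === '!' || ch === '/' || ch === '?') {
        throw new Error('"<' + ch + '" is not modelled in this sketch');
      } else {
        // Anything else: "<" is emitted as text and ch is reconsumed as data.
        text += '<';
        if (ch === '<') {
          state = 'tagOpen';
        } else {
          text += ch;
          state = 'data';
        }
      }
    } else if (state === 'tagName') {
      if (ch === '>') {
        tokens.push({ type: 'startTag', name: tagName });
        state = 'data';
      } else {
        tagName += ch.toLowerCase();
      }
    }
  }

  // End of input: a pending "<" becomes text, an unfinished tag is dropped.
  if (state === 'tagOpen') text += '<';
  flushText();
  return tokens;
}

console.log(tokenize('a<b<c')); // only the text token for "a" survives
console.log(tokenize('a < b')); // "<" followed by a space stays plain text
```

In the first call, "b<c" is swallowed as an unfinished tag name, which matches the symptom described at the start.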

The WHATWG HTML5 documentation gives us the complete parsing process and the state machine definition. The states most relevant to tag parsing here are the tag open state and the end tag open state. The parser starts in the data state and enters the tag open state when the next character is <. The character immediately after that decides whether the parser starts parsing a tag.

According to the specification, the next input character in the tag open state leads to one of six cases.

The specification names characters by hexadecimal Unicode code points of the form U+XXXX. Giving up a little of that rigor, the cases can be expanded intuitively as follows:

  1. "!" (exclamation mark): switch to the markup declaration open state (for <!DOCTYPE html>, <!-- comment -->, <![CDATA[ and so on)
  2. "/" (slash): switch to the end tag open state (for closing tags such as </div>)
  3. ASCII uppercase or lowercase letter: enter the tag name state and parse as a tag
  4. "?" (question mark): parse error; an empty comment node is created and the character is reconsumed in the bogus comment state, so the comment goes on to include the "?"
  5. EOF (end of input): emit "<" as a character (there is no more input at this point)
  6. Anything else: emit "<" as a character (an invalid-first-character-of-tag-name parse error is raised, but parsing carries on)

As long as we ensure that < is never followed by a character from cases 1-4, the tokenizer will not enter a tag parsing state and will not generate new nodes, which avoids the incomplete display caused by mismatched tags.
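
The six cases can be written down as a small lookup. This is only an illustrative sketch: the function names stateAfterLessThan and isRiskyAfterLessThan are invented here, and the returned strings simply echo the state names from the list above.

```javascript
// Hypothetical helper: which branch the tag open state takes for the character
// that follows "<". Pass undefined (or an empty string) to represent EOF.
function stateAfterLessThan(nextChar) {
  if (nextChar === undefined || nextChar === '') return 'EOF: emit "<"';
  if (nextChar === '!') return 'markup declaration open state';
  if (nextChar === '/') return 'end tag open state';
  if (/^[a-zA-Z]$/.test(nextChar)) return 'tag name state';
  if (nextChar === '?') return 'bogus comment state';
  return 'anything else: emit "<"';
}

// Cases 1-4 are the ones that make text disappear, so these are the
// characters that must never directly follow a decoded "<".
function isRiskyAfterLessThan(nextChar) {
  return typeof nextChar === 'string' && /^[!\/?a-zA-Z]$/.test(nextChar);
}

console.log(stateAfterLessThan('b')); // tag name state
console.log(stateAfterLessThan(' ')); // anything else: emit "<"
console.log(isRiskyAfterLessThan('/')); // true
console.log(isRiskyAfterLessThan('1')); // false
```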

The hack in practice

It works in theory, but how can we quickly test the idea and find suitable string replacements in the real world? Besides writing some HTML and checking it in an edit-refresh loop, you can use the Edit as HTML feature in the DevTools Elements panel to edit and debug in real time.

But that is not very intuitive, and relying on observation and guesswork alone is still cumbersome. What I really wanted was to poke at a parser that follows the WHATWG HTML5 standard. Since the WHATWG HTML5 spec is a standard, perhaps someone has built a browser-independent HTML5 parser on top of it? With that in mind, I did a quick search and found an HTML5 parser for Node.js called parse5. Following the Online Playground link on the project homepage also led me to a website called AST Explorer. Now you can directly inspect the DOM tree generated from a piece of HTML text, thanks to the community.

Take this code as an example:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Test</title>
  </head>
  <body>
    <h1>My first headline</h1>
    <p>My first paragraph.</p>
  </body>
</html>

The details of the DOM tree generated by this code can be seen in AST Explorer.

Enter the problematic decoded string

a<b<c

and observe that it produces a node named b.

Next, work out the replacement strategy. In principle it must not affect the UI, so we insert an empty comment node, which produces nothing in the render tree, right after each < character that would otherwise start a tag.

HTML input:

a<<!---->b<<!---->c

The resulting tree and the rendered display were both as expected.

For reasons of space, the other kinds of generated nodes are not listed here.

Adding a few details about entities gives the final solution: apply the following regular-expression string filter before the entity decode step (because this is hack code, it needs enough comments to avoid tripping up future maintainers):

/**
 * According to the WHATWG state machine, the tokenizer enters the tag open
 * state when it reads "<". If one of cases 1-4 follows, the text is not
 * displayed completely:
 * 1. "!" causes the following characters to be parsed as a comment
 * 2. "/" causes the following characters to be parsed as an end tag: an
 *    ASCII alpha starts an end tag, ">" produces nothing, anything else
 *    becomes a comment
 * 3. ASCII alpha enters the tag name state; the following characters become
 *    the tag name and are not displayed
 * 4. "?" causes the following characters to be parsed as a bogus comment
 * 5. Other cases: a normal "<"
 *
 * Replacement strategy: when cases 1-4 match, add an empty comment node <!---->
 * Details: https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
 *
 * @param {string} input The input HTML string with entities
 */
function replaceHTMLEntityTagStr(input = '') {
  // The "<" entity has three forms: named, hexadecimal, and decimal
  return input.replace(
    /((&lt;)|(&#x3c;)|(&#60;))(!|\/|\?|[a-zA-Z])/gi,
    (all, p1, p2, p3, p4, p5) => `${p1}<!---->${p5}`
  );
}
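
As a rough end-to-end check, the sketch below runs the filter and then a decode step on the original problem input. Both function names are hypothetical, and decodeLessThanEntities is only a stand-in for the legacy entity decoder, which is not shown in this article; it handles nothing but the three "<" entity forms.

```javascript
// Hypothetical compact version of the filter: insert an empty comment after an
// encoded "<" when the next character would start a tag, end tag or comment.
function neutralizeTagLikeEntities(input = '') {
  return input.replace(
    /(&lt;|&#x3c;|&#60;)([!\/?a-zA-Z])/gi,
    (all, entity, next) => `${entity}<!---->${next}`
  );
}

// Assumed stand-in for the legacy decoder: decodes only the "<" entities.
function decodeLessThanEntities(html) {
  return html.replace(/&lt;|&#x3c;|&#60;/gi, '<');
}

// Without the filter, the decoded string contains tag-like text.
console.log(decodeLessThanEntities('a&lt;b&lt;c'));
// → a<b<c

// With the filter, every risky "<" is followed by an empty comment.
console.log(decodeLessThanEntities(neutralizeTagLikeEntities('a&lt;b&lt;c')));
// → a<<!---->b<<!---->c

// Input that was already safe is left untouched.
console.log(neutralizeTagLikeEntities('1 &lt; 2'));
// → 1 &lt; 2
```

In the second output, each "<" is followed by another "<", which falls under case 6, so it is emitted as text; the <!----> that follows parses as an empty comment and renders nothing.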

Compatibility considerations:

  1. HTML tag parsing is one of the most basic browser features, so in theory there should be no compatibility issues (the tokenizer code in the earliest WebKit versions behind Safari/Chromium is consistent with this behavior and does not differ significantly from the latest code)
  2. The problem page is used on mobile (the system WebView on iOS; the X5 kernel of QQ Browser on Android), where a relatively recent WebKit/Chromium kernel can be expected
  3. The replacement logic was tested successfully in IE 7 and above, and in Safari/Chromium 44/Firefox 91.01 on iOS 11 (other versions have not been verified yet)

Hack tips

I learned a lot from wrestling with this bug. Here are a few “life lessons”:

1. The end matters more than the means. There is only one Rome, but there can be thousands of roads leading to it. Conventionally, the DOM determines how the UI is presented, so we habitually focus on the DOM.

A DOM tree can come from two sources:

  • The browser parses HTML and builds the DOM tree through lexical analysis
  • At the JavaScript level, through createElement, appendChild and so on. GUI frameworks such as React/Vue wrap these methods: they build an object tree and then render the corresponding DOM nodes in one pass

In this hack scenario the focus is not only on the DOM level; our “Rome” is the final UI. The same UI may correspond to different DOM trees, much like an n –> 1 mapping in mathematical terms. When one road is hard to follow, we can take another to reach the same goal. Of course, leaving the beaten path also means facing risks that few people have faced.

2. Even when the job is writing front-end pages (commonly known as “slicing images”), basic computer science knowledge keeps turning up when you least expect it. Try to grasp the general principles, and they can give you an outsized advantage at the right moment. There is no need to go very deep: some basic knowledge of compiler principles is enough. The standard already defines all the terms and concepts and spells out the algorithms; all you need is the patience to sit down and read it.

3. Things develop gradually, and the ideal is not reached overnight: stabilize the situation first, then optimize when there is time. It is hard to get everything perfect at the start; it is more a matter of finishing first and perfecting later. This also suggests that, beyond code quality, what matters even more for an engineer is the ability to find the most suitable solution within limited time, weighing the quality and efficiency that different business requirements call for. It is a flexible balance between “quickly reaching 60 points” and “carefully polishing to 90+ points”.