… The 猹 twisted away and escaped between his legs. **This youth was Runtu.**I knew him when I was in my teens, nearly thirty years ago now;

As above, Markdown rendering failed…

The conclusion first: this is not a Juejin (Nuggets) problem; Markdown itself specifies this behavior. More specifically, it comes from differences between natural languages. Although the impact is real, I don't expect the Markdown specification to change its rules in this area in the foreseeable future.

CommonMark specification

Markdown uses * and _ as emphasis indicators. Text wrapped in a single * or _ is wrapped in the HTML <em> tag (that is, italics); text wrapped in two is wrapped in the HTML <strong> tag (that is, bold).

Before we get to the question, let’s talk about what some of the key terms mean:

Delimiter Run

A delimiter run is either:

  • a sequence of one or more unescaped * characters; or
  • a sequence of one or more unescaped _ characters.

Left-Flanking Delimiter Run

A left-flanking delimiter run is a delimiter run that satisfies both of the following:

  1. It is not followed by whitespace;
  2. If it is not preceded by whitespace or punctuation, it is not followed by punctuation either.

Right-Flanking Delimiter Run

A right-flanking delimiter run is a delimiter run that satisfies both of the following:

  1. It is not preceded by whitespace;
  2. If it is not followed by whitespace or punctuation, it is not preceded by punctuation either.

For bold:

  1. When ** is part of a left-flanking delimiter run, it can mark the beginning of bold;
  2. When ** is part of a right-flanking delimiter run, it can mark the end of bold.

The same goes for italics.

(There are 17 rules, which I won’t go into here, but you can check out the CommonMark specification if you’re interested.)


In the example above, the closing ** is preceded by a punctuation mark but is not followed by whitespace or punctuation, so it is not a right-flanking delimiter run and is not treated as the end of bold. Naturally, the bold does not take effect.

… Runtu.**I knew…
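The two flanking rules above can be sketched in Java as follows. This is a simplified illustration, not the CommonMark reference implementation: the class and method names are made up for this example, punctuation is approximated as ASCII punctuation plus the Unicode P* categories, and line edges are treated as whitespace, as the specification prescribes.

```java
final class FlankingDemo {
    // Whitespace in the sense used by the rules: Zs characters, tab, LF, FF, CR.
    static boolean isWhitespace(int cp) {
        return cp == '\t' || cp == '\n' || cp == '\f' || cp == '\r'
                || Character.getType(cp) == Character.SPACE_SEPARATOR;
    }

    // Punctuation: ASCII punctuation, or any Unicode P* general category.
    static boolean isPunctuation(int cp) {
        if (cp < 128) {
            return (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40)
                    || (cp >= 0x5B && cp <= 0x60) || (cp >= 0x7B && cp <= 0x7E);
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // The delimiter run occupies the half-open range [start, end) of text.
    static boolean isLeftFlanking(String text, int start, int end) {
        int before = start > 0 ? text.codePointBefore(start) : ' ';
        int after = end < text.length() ? text.codePointAt(end) : ' ';
        if (isWhitespace(after)) {
            return false; // rule 1
        }
        // rule 2
        return !isPunctuation(after) || isWhitespace(before) || isPunctuation(before);
    }

    static boolean isRightFlanking(String text, int start, int end) {
        int before = start > 0 ? text.codePointBefore(start) : ' ';
        int after = end < text.length() ? text.codePointAt(end) : ' ';
        if (isWhitespace(before)) {
            return false; // rule 1
        }
        // rule 2
        return !isPunctuation(before) || isWhitespace(after) || isPunctuation(after);
    }

    public static void main(String[] args) {
        String broken = "**This youth was Runtu.**I knew him";
        int close = broken.lastIndexOf("**");
        // The intended closing run fails the right-flanking test...
        System.out.println(isRightFlanking(broken, close, close + 2)); // false
        // ...and instead looks like a possible opening run.
        System.out.println(isLeftFlanking(broken, close, close + 2));  // true

        String working = "**bold.** next";
        int close2 = working.lastIndexOf("**");
        System.out.println(isRightFlanking(working, close2, close2 + 2)); // true
    }
}
```

The sketch only classifies a single run; pairing openers with closers involves the remaining emphasis rules in the specification.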

Why is it defined that way?

To support nested delimiter runs, for example **one **two two** three **two two** one**:

one two two three two two one

I won’t go into details here (and it’s hard to do so), but you’ll get a good idea of what it means by following the rules above.

The impact of this definition is obvious: users of languages that do not separate words with spaces can only grumble… and keep using it anyway.

How to solve it?

# 1

The solution is actually very simple: add a space after the closing **.

… The 猹 twisted away and escaped between his legs. This youth was Runtu. I knew him when I was in my teens, nearly thirty years ago now;

But people like me, who care a great deal about consistency, are not so happy with that: Chinese text does not use spaces.

GitHub user Haqer1 offers a better solution: the zero-width space (ZWSP).

# 2

This is ZWSP:

What? You said you couldn’t see it?

That's right, ZWSP is a non-printable Unicode character (U+200B) used to mark places where a line break is allowed.

We can use ZWSP to mark line-break positions in long text; it only takes effect when the text is too wide to fit on a single line.

For example, without ZWSP it looks like this:

LongLongLongLongLongLongLongLongBreakBeforeHereLongLongLongLongLongLongLongLongLongLongLongLongLongLongText

With ZWSP it looks like this:

LongLongLongLongLongLongLongLong​BreakBeforeHereLongLongLongLongLongLongLongLongLongLongLongLongLongLongText

Because of its special properties, ZWSP is also used to bypass sensitive word checks, create non-replicable pseudo-links, and so on.
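To make the "invisible but present" property concrete, here is a small Java sketch of how a ZWSP can slip past a naive keyword check. The filter and its method names are hypothetical and exist only for this illustration.

```java
final class ZwspFilterDemo {
    static final String ZWSP = "\u200B";

    // Hypothetical naive filter: a plain substring check.
    static boolean containsBlockedWord(String text, String blocked) {
        return text.contains(blocked);
    }

    // Remove every zero-width space before checking.
    static String stripZwsp(String text) {
        return text.replace(ZWSP, "");
    }

    public static void main(String[] args) {
        String plain = "forbidden";
        String disguised = "for" + ZWSP + "bidden"; // looks identical when displayed

        System.out.println(containsBlockedWord(plain, "forbidden"));                // true
        System.out.println(containsBlockedWord(disguised, "forbidden"));            // false
        System.out.println(containsBlockedWord(stripZwsp(disguised), "forbidden")); // true
        System.out.println(disguised.length() - plain.length());                    // 1
    }
}
```

The two strings differ by exactly one character, which is why normalizing the input before matching is the usual countermeasure.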


Going back to the original problem, we just need to add a ZWSP before the closing delimiter:

… The 猹 twisted away and escaped between his legs. This youth was Runtu. I knew him when I was in my teens, nearly thirty years ago now;

(The ZWSP goes after the period and before the **.)

This character can be copied from the Unicode character list website: unicode-table.com/en/200B/
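Instead of pasting the character by hand, the insertion could be automated. The following Java sketch is only an assumption of how such a helper might look (the class, method, and pattern are invented here): it inserts U+200B between a punctuation mark and a ** that is directly followed by ordinary text. It is deliberately naive, and it will also add a ZWSP before an opening ** that follows punctuation.

```java
import java.util.regex.Pattern;

final class ZwspBoldFix {
    // Group 1: a punctuation character other than * or _.
    // Group 2: the ** run.
    // Lookahead: the next character is neither whitespace nor punctuation.
    private static final Pattern BOLD_AFTER_PUNCTUATION =
            Pattern.compile("([\\p{P}&&[^*_]])(\\*\\*)(?=[^\\s\\p{Zs}\\p{P}])");

    // Returns the text with a zero-width space inserted before each matching ** run.
    static String insertZwsp(String markdown) {
        return BOLD_AFTER_PUNCTUATION.matcher(markdown).replaceAll("$1\u200B$2");
    }

    public static void main(String[] args) {
        String original = "**This youth was Runtu.**I knew him";
        String fixed = insertZwsp(original);

        System.out.println(fixed.length() - original.length()); // 1
        System.out.println(fixed.contains(".\u200B**I"));       // true
        System.out.println(insertZwsp("**bold** text"));        // unchanged
    }
}
```

A regular expression cannot tell opening and closing runs apart reliably, so a real tool would more likely work on the parser's token stream.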


Wait, isn't a zero-width space also a space? That would contradict the earlier rule that a right-flanking delimiter run cannot be preceded by whitespace.

Which brings us to the definition of space in the rules.

What is space?

Spaces here refer to Unicode space characters.

Unicode space characters include tab (U+0009), line feed (U+000A), form feed (U+000C), carriage return (U+000D), and any character in the Unicode Zs category.

Note: line feed (LF, \n) and carriage return (CR, \r) are newline characters. On Windows a line break is CR+LF; on Unix systems it is LF.

Characters in Unicode are classified by category, with the first (uppercase) letter being the major category and the second (lowercase) letter being the minor category.

The Zs mentioned above stands for the “Separator, Space” class. The Zs class contains 17 characters, as follows:

(picture in the website: www.compart.com/en/unicode/.)

As you can see above, there is no U+200B in the list, so ZWSP is not a space (although it is called a zero-width space). ZWSP belongs to the Cf class, which is the “Other, Format” class.
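The category lookup can also be done programmatically. The sketch below uses Java's Character.getType to report the general category of a few sample characters and applies the whitespace definition given above; the class and helper names are illustrative, and the results depend on the Unicode version bundled with the JDK (a recent JDK is assumed).

```java
final class UnicodeCategoryDemo {
    // Whitespace as defined in the text: Zs, or tab, line feed, form feed, carriage return.
    static boolean isSpecWhitespace(int cp) {
        return cp == 0x09 || cp == 0x0A || cp == 0x0C || cp == 0x0D
                || Character.getType(cp) == Character.SPACE_SEPARATOR;
    }

    // Maps the two categories of interest to their Unicode abbreviations.
    static String category(int cp) {
        switch (Character.getType(cp)) {
            case Character.SPACE_SEPARATOR:
                return "Zs";
            case Character.FORMAT:
                return "Cf";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // Space, ideographic (full-width) space, zero-width space.
        int[] samples = {0x0020, 0x3000, 0x200B};
        for (int cp : samples) {
            System.out.printf("U+%04X category=%s whitespace=%b%n",
                    cp, category(cp), isSpecWhitespace(cp));
        }
    }
}
```

On a current JDK, U+0020 and U+3000 should be reported as Zs, while U+200B should be reported as Cf and therefore not as whitespace.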


The same applies to Java’s definition of whitespace:

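public class WhitespaceCheck {
    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(' '));      // regular space
        System.out.println(Character.isWhitespace('\t'));     // tab
        System.out.println(Character.isWhitespace('\n'));     // line feed
        System.out.println(Character.isWhitespace('\u200b')); // ZWSP
    }
}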

Results:

true
true
true
false

Digressions

1. This isn’t a bug, it’s a feature!

(I’m just saying this for myself.)

Why did I say at the beginning of the article that this is unlikely to be fixed?

That leads to another story. Take a look at one of the discussions on the CommonMark forum:

181 replies. Nearly eight years of discussion. Guess what it is about?

Markdown syntax support for tables:

| Title 1 | Title 2 |
| ------- | ------- |
| Content 1 | Content 2 |

Yes, in the CommonMark specification, tables still have to be written in HTML.

Next door, GitHub Flavored Markdown (GFM, the Markdown specification defined by GitHub) has long since moved on to all sorts of fancy features, while they are still debating among themselves. One side keeps proposing new syntax, with all kinds of suggestions and layouts. The other side objects that it would be hard to read and hard to maintain…

Well, I agree, using HTML for tables… That’s great:

<table>
  <thead>
    <tr>
      <th>Title 1</th>
      <th>Title 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Content 1</td>
      <td>Content 2</td>
    </tr>
  </tbody>
</table>

2. Why put “Juejin” (Nuggets) in the title when this is clearly a Markdown problem?

(Indeed, I’ve had this problem on GitHub, too.)

In short, there are two reasons for this:

# 1

The article was inspired by an issue in the Juejin-Markdown-Themes repository:

Reading it, I realized I had run into the same thing without knowing why, and my interest was piqued. I started with Juejin's Markdown editor, ByteMD, found that its Markdown parsing relies on the remark and rehype libraries, and from there traced the behavior down to the CommonMark specification…

# 2

Well, a title written this way makes you think: hey, I believe I have run into this too…

(And, by the way, it might catch the attention of the Juejin operations team.)

References

  1. Talk.commonmark.org/t/bold-fail…
  2. Ld246.com/article/159…
  3. Github.com/commonmark/…
  4. Spec.commonmark.org/0.30/#empha…
  5. Blog.meathill.com/tech/fe/zer…
  6. En.wikipedia.org/wiki/Unicod…
  7. Talk.commonmark.org/t/tables-in…