Why subset fonts?

As we all know, Chinese fonts are inherently “huge” compared to English fonts. An English font of two or three hundred kilobytes counts as large, while a Chinese font of several megabytes counts as small. On the one hand, Chinese fonts contain a very large number of glyphs, often thousands or even tens of thousands. English fonts need only a few dozen basic letters and symbols; even with support for multiple languages and character variants, a font with more than 3,000 glyphs is already very large. On the other hand, the strokes of Chinese characters are complex. In contour-based vector font design, a Chinese glyph generally needs more control points to describe its curves than an English glyph, which requires more data and bloats the font file.

In front-end development, achieving certain visual effects often requires special fonts, and we can hardly expect users to have these fonts installed on their computers. In this case we usually turn to Web font technology and let the browser download our custom fonts dynamically. However, Chinese fonts are very large, and loading a font file “in full” is often impractical. This is especially true for dynamic pages with only a small number of characters per page. Besides, no page uses every character in a font file, so full loading is inherently wasteful.

Research shows that the 3,500 most commonly used Chinese characters (the number of characters students are required to master by grade 9 of compulsory education in China) cover 99.8% of the Chinese characters in daily use:

  • 500 characters (78.53202%)
  • 1000 characters (91.91527%)
  • 1500 characters (96.47563%)
  • 2000 characters (98.38765%)
  • 2500 characters (99.24388%)
  • 3000 characters (99.63322%)
  • 3500 characters (99.82015%)

As you can see, the 500 most commonly used Chinese characters alone already cover 78%. Therefore, loading a font “in full”, especially a Chinese font, is not only a waste of traffic and time under current network conditions, but also completely unnecessary. Instead, we can subset a font based on the characters a web page actually uses.

This article first briefly reviews the technical specifications of Web custom fonts, and then introduces two commonly used front-end font subsetting techniques through examples. The first is the unicode-range descriptor in CSS, which we call “soft subsetting” because it merely creates a “soft link” to a subset of an existing font, or of a font the browser has already downloaded, and does not actually reduce the size of the files the browser downloads. The second is Glyphhanger, a Node command-line tool, which we call “hard subsetting”: on the server side it extracts a relatively small subset from the “full” font, turns it into a Web font, and delivers it to the browser through a Web server or CDN.

Both “soft” and “hard” subsetting rely on Web fonts and the @font-face rule, so we first need to look at the basic Web standard syntax.

Web fonts with @font-face

Microsoft pioneered the @font-face rule to go beyond “web-safe fonts” and allow Web pages to use fonts that users are unlikely to have installed on their computers. The rule was later incorporated into the W3C’s CSS Fonts Module Level 3, giving rise to the custom Web font technique that is now common on the front end:

@font-face {
  font-family: 'MyWebFont';
  src: url('webfont.eot'); /* IE9 compatibility mode */
  src: url('webfont.eot?#iefix') format('embedded-opentype'), /* IE6-IE8 */
       url('webfont.woff2') format('woff2'), /* Newest browsers */
       url('webfont.woff') format('woff'), /* Newer browsers */
       url('webfont.ttf') format('truetype'), /* Safari, Android, iOS */
       url('webfont.svg#svgFontName') format('svg'); /* Early iOS */
}

Sample code reference: https://css-tricks.com/snippets/css/using-font-face/

Of course, the code above is a scheme that works with almost any browser. Thanks to the rapid turnover of browser versions, by 2016, about two years ago, it was already realistic to write just this:

@font-face {
  font-family: 'MyWebFont';
  src:  url('myfont.woff2') format('woff2'),
        url('myfont.woff') format('woff');
}

If you want to work with more browsers, it seems safe to add a TTF format that almost all browsers support:

@font-face {
  font-family: 'MyWebFont';
  src: url('myfont.woff2') format('woff2'),
       url('myfont.woff') format('woff'),
       url('myfont.ttf') format('truetype');
}

However, our ultimate goal is to use only WOFF2, a format with built-in compression:

@font-face {
  font-family: 'MyWebFont';
  src: url('myfont.woff2') format('woff2');
}

Technically, in addition to using @font-face directly, you can use the @import rule or the link element to import or load external files containing @font-face declarations:

// Import
@import url(//fonts.googleapis.com/css?family=Open+Sans);

// Or reference
<link href='//fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>

// Actual use
body {
  font-family: 'Open Sans', sans-serif;
}

Open the Google Fonts stylesheet to see what it contains: https://fonts.googleapis.com/css?family=Open+Sans

These are all technical specifications, and it should only be a matter of time before we can move to using only WOFF2, a compressed format optimized specifically for Web fonts.

Having reviewed the basic technical specifications and syntax, and identified the direction for the future, let’s move on to practice. First, through a few examples, we look at unicode-range, a descriptor of the @font-face rule defined in CSS Fonts Module Level 3. Then we introduce Glyphhanger, a font subsetting tool from Filament Group, Inc. (a Web development company).

unicode-range

The unicode-range descriptor is a font subsetting technique, but it is “soft subsetting”, not “hard subsetting”. It acts like a shortcut and does not actually reduce the size of the font file the browser needs to download.

As the name implies, unicode-range specifies the range of Unicode code points of the characters that a custom font covers. The syntax is as follows:

// CSS
@font-face {
  font-family: 'Ampersand';
  src: local('Times New Roman');
  unicode-range: U+26;
}
div {
  font-size: 4em;
  font-family: Ampersand, Helvetica, sans-serif;    
}
// HTML
<div>Me & You = Us</div>

Sample code reference: https://developer.mozilla.org/en-US/docs/Web/CSS/@font-face/unicode-range

This @font-face rule defines a font named “Ampersand” that is “cut” from the local font Times New Roman. The font contains only one character: U+26 (26 is the hexadecimal Unicode code point of the ampersand, 38 in decimal).

The div’s font-family declaration then applies, in order, the custom font Ampersand (Times New Roman), Helvetica, and the generic sans-serif family. As a result, only the ampersand is rendered in Times New Roman, while the rest of the text falls through to Helvetica or the sans-serif fallback.

A side note on code points: the Unicode code space has been expanded to 17 planes of 65,536 code points each, for a total capacity of 1,114,112 code points, of which only 128,237, or about 12%, have actually been allocated. There is therefore ample room for Unicode to include characters from all civilizations on Earth for the foreseeable future.
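The figures above can be checked with a few lines of JavaScript. This is only a worked calculation; the allocation count is the one quoted in the text, not something the code derives.

```javascript
// Code space size: 17 planes of 65,536 (0x10000) code points each.
const PLANES = 17;
const POINTS_PER_PLANE = 0x10000;
const totalCodePoints = PLANES * POINTS_PER_PLANE;

// Allocation figure quoted in the text above (an input, not computed here).
const allocated = 128237;
const allocatedShare = (allocated / totalCodePoints) * 100;

console.log(totalCodePoints);                     // 1114112
console.log((totalCodePoints - 1).toString(16));  // 10ffff
console.log(allocatedShare.toFixed(1) + '%');     // 11.5%
```

The highest code point, 10ffff, is also the upper bound of the default unicode-range value, U+0-10FFFF.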

Now look at an example with a Chinese font. Suppose we want to use a special font to highlight the most famous line from the Preface to the Pavilion of Prince Teng by Wang Bo, one of the “Four Paragons of the Early Tang”: “落霞与孤鹜齐飞,秋水共长天一色。” (“Sunset clouds and a lone duck fly together; the autumn water and the vast sky share one color.”)

We can first convert this line (including punctuation) to Unicode code points. To convert a string to code points, you can use the following JavaScript function:

function text2point(t) {
  // Array.from splits by code point, so characters outside the BMP are not broken apart
  return Array.from(t).map(c => 'u+' + c.codePointAt(0).toString(16)).join(',')
}

Use Libian SC as the source font and define a custom font named custom based on it:

// CSS
@font-face {
  font-family: custom;
  src: local(Libian SC);
  unicode-range: u+843d,u+971e,u+4e0e,u+5b64,u+9e5c,u+9f50,u+98de,u+ff0c,
    u+79cb,u+6c34,u+5171,u+957f,u+5929,u+4e00,u+8272,u+3002;
  font-weight: 500;
}
.emphasis {
  font-family: custom;
}
// HTML
<!-- Other sentences -->
<span class="emphasis">落霞与孤鹜齐飞,秋水共长天一色。</span>
<!-- Other sentences -->

Note that the code point list in the code above is wrapped for typesetting and readability. If you wrap it yourself, break only after a comma; a break inside a value (for example between U+ and the hexadecimal digits) is a syntax error. The same applies to the code examples below.

The results are as follows:

At this point, we notice that the punctuation (comma and period) does not match the style of the rest of the text, which uses the PingFang SC font (on the Mac). Can we simply delete the code points u+ff0c and u+3002 for the comma and period from the list above? This works in Safari 12 and Firefox 62, where the comma and period inherit the PingFang font once their code points are removed, but it does not work in Chrome 69.

Chrome also seems to have a bug: if you type a character that is not included in the custom font to the left of the punctuation mark, Chrome still renders that character in the custom font. Browser implementations seem to be somewhat inconsistent here. I have not tested IE or Edge on Windows; you can test them yourself.

However, we can solve this problem by defining a custom font that contains only commas and periods:

@font-face {
  font-family: punc;
  src: local(PingFang SC);
  unicode-range: u+ff0c,u+3002;
}
.emphasis {
  font-family:punc, custom;
}

This way, Chrome, Safari, and Firefox all display the comma and period in the PingFang font, without removing their code points from the custom declaration.

Note that you should not base the punc font on an English font, because English fonts do not contain glyphs for the code points of Chinese punctuation marks.

Although this example is obviously contrived, applying a special font to some of the Chinese characters in Chinese content, or to some of the characters in English text, is exactly the scenario that the “soft subsetting” of unicode-range suits best. For more on unicode-range, see Zhang Xinxu’s article on using unicode-range to apply @font-face custom fonts to specific characters: https://www.zhangxinxu.com/wordpress/2016/11/css-unicode-range-character-font-face/

Notes on using unicode-range:

  • unicode-range accepts:

    • A single code point: U+26 (or u+26)
    • A code point range: U+0-7F, U+0025-00FF
    • A wildcard range: U+4??, which is equivalent to U+400-4FF
    • Multiple comma-separated values: U+0025-00FF, U+4??
  • unicode-range defaults to U+0-10FFFF, that is, all Unicode code points

  • unicode-range values are code point literals or lists of such literals, not strings

    • Correct: unicode-range: u+ff0c,u+3002;
    • Wrong: unicode-range: "u+ff0c,u+3002";
  • unicode-range values must not contain syntax errors, such as a string value or a trailing comma (u+ff0c,u+3002,;). With a syntax error, the custom font becomes an alias for the whole source font rather than a subset of it. (Of course, using @font-face to define an alias for an entire existing font is itself a practical CSS technique; see Zhang Xinxu’s article mentioned above.)

  • Make sure to use the correct characters when converting to code points. In the previous example the character is “鹜” (u+9e5c); do not mistake it for the look-alike “骛” (u+9a9b).
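The value forms listed above can be sketched as a small JavaScript parser that expands each token into an inclusive [start, end] pair. The helper names parseUnicodeRange and parseUnicodeRangeList are hypothetical, and the checks are a simplified reading of the rules in this list rather than a full CSS parser.

```javascript
// Hypothetical helper: expands one unicode-range token into [start, end] (inclusive).
function parseUnicodeRange(token) {
  const match = /^u\+([0-9a-f?]{1,6})(?:-([0-9a-f]{1,6}))?$/i.exec(token.trim());
  if (!match) throw new SyntaxError(`Invalid unicode-range token: "${token}"`);
  const [, first, last] = match;

  let start;
  let end;
  if (first.includes('?')) {
    // Wildcards must be trailing and cannot be combined with an explicit end.
    if (last !== undefined || !/^[0-9a-f]*\?+$/i.test(first)) {
      throw new SyntaxError(`Invalid wildcard range: "${token}"`);
    }
    start = parseInt(first.replace(/\?/g, '0'), 16);
    end = parseInt(first.replace(/\?/g, 'f'), 16);
  } else {
    start = parseInt(first, 16);
    end = last === undefined ? start : parseInt(last, 16);
  }
  if (start > end || end > 0x10ffff) {
    throw new RangeError(`Range out of bounds: "${token}"`);
  }
  return [start, end];
}

// Splits a full descriptor value on commas and parses each token.
const parseUnicodeRangeList = (value) => value.split(',').map(parseUnicodeRange);

console.log(parseUnicodeRangeList('U+26, U+0025-00FF, U+4??'));
// [ [ 38, 38 ], [ 37, 255 ], [ 1024, 1279 ] ]
```

In this sketch a trailing comma or a quoted string throws a SyntaxError, mirroring the two mistakes called out in the notes.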

So much for unicode-range, the “soft subsetting” technique. Next we introduce the “hard subsetting” tool: Glyphhanger.

glyphhanger

Glyphhanger is a command-line tool written by Zach Leatherman (https://www.zachleat.com/web/) for Filament Group (https://www.filamentgroup.com). It turns a .ttf font into Web font formats such as WOFF/WOFF2, and it can:

  • Fetch remote or local files and analyze the text they contain
  • Deduplicate and sort the results and convert them to Unicode code points
  • Generate a font subset in the requested formats from a specified source font (this requires another tool, described later)
  • Generate a CSS file containing the @font-face rule

This tool is very useful, so let’s demonstrate some typical uses of Glyphhanger in Web font creation.

First, global installation:

npm install -g glyphhanger

Usage 1: Set Wang Bo’s famous line in Source Han Serif

Go to the directory that contains the example page, fontSubsetInAction, and run the following command:

➜  fontSubsetInAction glyphhanger http://127.0.0.1:8080/index.html --family='custom' --subset=SourceHanSerifCN-Light.ttf --formats=woff2
U+3002,U+4E00,U+4E0E,U+5171,U+5929,U+5B64,U+6C34,U+79CB,U+8272,U+843D,U+957F,U+971E,U+98DE,U+9E5C,U+9F50,U+FF0C
Writing CSS file: SourceHanSerifCN-Light.css
Subsetting SourceHanSerifCN-Light.ttf to SourceHanSerifCN-Light-subset.woff2 (was 12.44 MB, now 3.57 KB)

Four arguments are passed to Glyphhanger:

  • The remote file to analyze (in this case served by a local Web server): http://127.0.0.1:8080/index.html
  • --family='custom' restricts the analysis to elements on that page to which the font-family: custom; rule applies
  • --subset=SourceHanSerifCN-Light.ttf specifies the source font to use, in this case Source Han Serif
  • --formats=woff2 specifies the target format of the font subset to generate, in this case WOFF2
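If you want to drive this command from a Node build script, the four arguments can be assembled programmatically. The sketch below is only an illustration: buildGlyphhangerArgs is a hypothetical helper, not part of Glyphhanger, and it uses nothing beyond the flags shown in the command above.

```javascript
// Hypothetical helper (not part of glyphhanger): builds the argument list
// for the four options described above.
function buildGlyphhangerArgs({ url, family, subset, formats }) {
  if (!url) throw new Error('A URL or file to analyze is required');
  const args = [url];
  if (family) args.push(`--family=${family}`);
  if (subset) args.push(`--subset=${subset}`);
  if (formats && formats.length > 0) args.push(`--formats=${formats.join(',')}`);
  return args;
}

const args = buildGlyphhangerArgs({
  url: 'http://127.0.0.1:8080/index.html',
  family: 'custom',
  subset: 'SourceHanSerifCN-Light.ttf',
  formats: ['woff2'],
});

console.log(['glyphhanger', ...args].join(' '));
// glyphhanger http://127.0.0.1:8080/index.html --family=custom --subset=SourceHanSerifCN-Light.ttf --formats=woff2

// To actually run it (assumes glyphhanger and pyftsubset are installed):
// require('child_process').execFileSync('glyphhanger', args, { stdio: 'inherit' });
```

Passing the arguments as an array to execFileSync avoids shell quoting, which is why the quotes around custom are not needed here.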

Glyphhanger first outputs the Unicode code points for “落霞与孤鹜齐飞,秋水共长天一色。” (including the comma and period). It then reports that a file named “SourceHanSerifCN-Light.css” was written to the current directory. The last line of output shows that the subset font is named “SourceHanSerifCN-Light-subset.woff2”, and that the source font file is 12.44 MB while the subset file is 3.57 KB. 16 characters take up 3.57 KB, or about 228 bytes each on average. Isn’t that scary?!

However, compared to 12.44 MB, 3.57 KB is tiny. Here, take a look at the CSS file that Glyphhanger helped us generate:

/* This file was automatically generated by GlyphHanger 3.0.3 */
@font-face {
  font-family: custom;
  src: url(SourceHanSerifCN-Light-subset.woff2) format("woff2");
  unicode-range: U+3002,U+4E00,U+4E0E,U+5171,U+5929,U+5B64,U+6C34,U+79CB,
    U+8272,U+843D,U+957F,U+971E,U+98DE,U+9E5C,U+9F50,U+FF0C;
}

Using this file directly is far less trouble than generating the code points by hand as before.

Usage 2: Analyze which Chinese characters a web page uses

In case you haven’t noticed, the code points Glyphhanger printed in the example above are sorted in ascending order of Unicode code point, and they are deduplicated before sorting.

“落霞与孤鹜齐飞,秋水共长天一色。” contains no repeated characters, so let’s look at another example:

➜  fontSubsetInAction glyphhanger https://lisongfeng.cn/post/dive-into-async-function.html --string
[Output: one long string of the distinct characters used on the page (ASCII symbols, digits, and letters first, then Chinese characters, then full-width punctuation), sorted by code point]

The first argument is a “real” remote URL, https://lisongfeng.cn/post/dive-into-async-function.html, an article I wrote earlier about async functions as syntactic sugar. The article is nearly 5,000 characters long, but after analysis and deduplication only 604 distinct characters are actually used. In addition, another option, --string, makes Glyphhanger output the characters themselves instead of Unicode code points, sorted by code point in ascending order.

The deduplication and sorting can also be done with a simple JavaScript function:

function textEliminateDuplicationAndSorting(text) {
    // A Set removes duplicates; spreading a string yields whole code points
    return [...new Set(text)]
      .sort((a, b) => a.codePointAt(0) - b.codePointAt(0))
      .join('')
}
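Going one step further, the same idea can reproduce the code point list that Glyphhanger printed in Usage 1. The function name textToUnicodeRange is hypothetical, and this is a sketch of the deduplicate, sort, and format steps described above, not Glyphhanger’s actual implementation.

```javascript
// Deduplicate the characters of a text, sort them by code point,
// and format them as an uppercase U+XXXX list suitable for unicode-range.
function textToUnicodeRange(text) {
  const codePoints = new Set(Array.from(text, (ch) => ch.codePointAt(0)));
  return [...codePoints]
    .sort((a, b) => a - b)
    .map((cp) => 'U+' + cp.toString(16).toUpperCase())
    .join(',');
}

console.log(textToUnicodeRange('落霞与孤鹜齐飞,秋水共长天一色。'));
// U+3002,U+4E00,U+4E0E,U+5171,U+5929,U+5B64,U+6C34,U+79CB,U+8272,U+843D,U+957F,U+971E,U+98DE,U+9E5C,U+9F50,U+FF0C
```

The output matches the first line Glyphhanger printed for the same sentence.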

Usage 3: Generate a font subset from specified text or code points

Of course, if you already have the text or code points, you can have Glyphhanger generate the font subset and CSS file directly. For example, suppose I also want to display “渔舟唱晚,响穷彭蠡之滨,雁阵惊寒,声断衡阳之浦。” (“The fishing boats sing at dusk, their echoes reaching the shores of Lake Pengli; the wild geese, startled by the cold, cry until their calls fade at the banks of Hengyang.”) in Source Han Serif:

➜  fontSubsetInAction glyphhanger --whitelist="落霞与孤鹜齐飞,秋水共长天一色。渔舟唱晚,响穷彭蠡之滨,雁阵惊寒,声断衡阳之浦。" --subset=SourceHanSerifCN-Light.ttf --css
U+3002,U+4E00,U+4E0E,U+4E4B,U+5171,U+54CD,U+5531,U+58F0,U+5929,U+5B64,U+5BD2,U+5F6D,U+60CA,U+65AD,U+665A,U+6C34,U+6D66,U+6E14,U+6EE8,U+79CB,U+7A77,U+821F,U+8272,U+843D,U+8821,U+8861,U+957F,U+9633,U+9635,U+96C1,U+971E,U+98DE,U+9E5C,U+9F50,U+FF0C
Subsetting SourceHanSerifCN-Light.ttf to SourceHanSerifCN-Light-subset.ttf (was 12.44 MB, now …)
Subsetting SourceHanSerifCN-Light.ttf to SourceHanSerifCN-Light-subset.zopfli.woff (was 12.44 MB, now …)
Subsetting SourceHanSerifCN-Light.ttf to SourceHanSerifCN-Light-subset.woff2 (was 12.44 MB, now 7.45 KB)
Writing CSS file: SourceHanSerifCN-Light.css

@font-face {
  src: url(SourceHanSerifCN-Light-subset.woff2) format("woff2"),
       url(SourceHanSerifCN-Light-subset.zopfli.woff) format("woff"),
       url(SourceHanSerifCN-Light-subset.ttf) format("truetype");
  unicode-range: U+3002,U+4E00,U+4E0E,U+4E4B,U+5171,U+54CD,U+5531,U+58F0,U+5929,
    U+5B64,U+5BD2,U+5F6D,U+60CA,U+65AD,U+665A,U+6C34,U+6D66,U+6E14,
    U+6EE8,U+79CB,U+7A77,U+821F,U+8272,U+843D,U+8821,U+8861,U+957F,
    U+9633,U+9635,U+96C1,U+971E,U+98DE,U+9E5C,U+9F50,U+FF0C;
}

This time the --whitelist option passes in the characters to subset, the --formats option is omitted, and the --css option is added.

As the output shows, Glyphhanger still deduplicates the text, converts it to code points, and sorts them. And with no --formats specified, it generates subsets in three formats, TTF, WOFF (compressed with Zopfli), and WOFF2, to improve browser compatibility. Finally, besides writing the CSS file as usual, the --css option also makes Glyphhanger print the contents of the CSS file to the console for easy copying.

Note, however, that neither the CSS file nor the console output contains a font-family descriptor, so there is no custom font name (custom); you must add it yourself.
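Adding the missing descriptor is easy to script. The following JavaScript sketch inserts a font-family line into generated CSS text; addFontFamily is a hypothetical helper, and it assumes the CSS contains a single @font-face block that has no font-family yet.

```javascript
// Hypothetical helper: inserts a font-family descriptor right after the
// opening brace of the first @font-face block in the given CSS text.
function addFontFamily(css, family) {
  return css.replace(/@font-face\s*\{/, (open) => `${open}\n  font-family: ${family};`);
}

const generated = [
  '@font-face {',
  '  src: url(SourceHanSerifCN-Light-subset.woff2) format("woff2");',
  '}',
].join('\n');

console.log(addFontFamily(generated, 'custom'));
// @font-face {
//   font-family: custom;
//   src: url(SourceHanSerifCN-Light-subset.woff2) format("woff2");
// }
```

In a build script you would read the generated CSS file, pass its contents through this function, and write the result back.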

Install pyftsubset

Glyphhanger itself only does web scraping and analysis; the actual font subsetting is done by pyftsubset, a tool from a well-known Python package called FontTools: github.com/fonttools/f… . Install it as follows:

pip install fonttools

# Additional installation for --flavor=woff2
git clone https://github.com/google/brotli
cd brotli
python setup.py install

# Additional installation for --flavor=woff --with-zopfli
git clone https://github.com/anthrotype/py-zopfli
cd py-zopfli
git submodule update --init --recursive
python setup.py install

To close, here is Glyphhanger’s help output for reference, so you can explore more uses on your own:

➜  fontSubsetInAction glyphhanger -h
glyphhanger error: requires at least one URL or whitelist.

usage: glyphhanger ./test.html
       glyphhanger http://example.com
       glyphhanger https://google.com https://www.filamentgroup.com
       glyphhanger http://example.com --subset=*.ttf
       glyphhanger --whitelist=abcdef --subset=*.ttf

arguments:
  --version
  --whitelist=abcdef
       A list of whitelist characters (optionally also --US_ASCII).
  --string
       Output the actual characters instead of Unicode code point values.
  --family='Lato,monospace'
       Show only results matching one or more font-family names (comma separated, case insensitive).
  --json
       Show detailed JSON results (including per font-family glyphs for results).
  --css
       Output a @font-face block for the current data.
  --subset=*.ttf
       Automatically subsets one or more font files using fonttools `pyftsubset`.
  --formats=ttf,woff,woff2,woff-zopfli
       woff2 requires brotli, woff-zopfli requires zopfli, installation instructions: https://github.com/filamentgroup/glyphhanger#installing-pyftsubset

  --spider
       Gather local URLs from the main page and navigate those URLs.
  --spider-limit=10
       Maximum number of URLs gathered from the spider (default: 10, use 0 to ignore).
  --timeout
       Maximum navigation time for a single URL.

[End of article]