The original article is on my blog:
mengkang.net/1129.html



Requirement

Split a string that may contain Chinese characters into an array of characters. UTF-8 encoding is used as the example throughout.

Solution 1

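My usual approach would be:

$str = "你好, world"; // sample input; any UTF-8 string works
$array = [];
for ($i = 0, $l = mb_strlen($str, "utf-8"); $i < $l; $i++) {
    // Pass the encoding explicitly so the result does not depend on the default setting
    array_push($array, mb_substr($str, $i, 1, "utf-8"));
}
var_export($array);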

But what if the mbstring extension is not installed?

Solution 2

Today I came across this code written by someone else:

function str_split_utf8($str)
{
    $array = array();
    $length = strlen($str); // length in bytes, not in characters
    for ($i = 0; $i < $length;) {
        $value = ord($str[$i]); // numeric value of the leading byte
        $split = 1; // 0 ~ 127 (ASCII) and any unexpected byte: take 1 byte
        if ($value >= 192 && $value <= 223) {
            $split = 2;
        } elseif ($value >= 224 && $value <= 239) {
            $split = 3;
        } elseif ($value >= 240 && $value <= 247) {
            $split = 4;
        }
        $key = '';
        // Copy the bytes of this character, stopping at the end of the string
        for ($j = 0; $j < $split && $i < $length; $j++, $i++) {
            $key .= $str[$i];
        }
        array_push($array, $key);
    }
    return $array;
}

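Reading the code

In PHP, $str[x] returns the single byte at offset x of the string, not a whole character. ord() then gives the numeric value of that byte (0 to 255); for values 0 to 127 this is the ASCII code.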
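To make the byte-level view concrete, here is a small sketch. The sample character 中 is my own choice for illustration, and it is written with byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "中" (U+4E2D) written as its three UTF-8 bytes.
$str = "\xE4\xB8\xAD";

// strlen() counts bytes, not characters.
$byteCount = strlen($str);

// $str[$i] yields one byte; ord() turns it into a number from 0 to 255.
$bytes = [];
for ($i = 0; $i < $byteCount; $i++) {
    $bytes[] = ord($str[$i]);
}

echo $byteCount, "\n";           // 3
echo implode(' ', $bytes), "\n"; // 228 184 173
```

The first byte is 228, which falls in the 224 ~ 239 range, so the function above takes 3 bytes for this character.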

The cutting rules are as follows:

First-byte value    Bytes to take
0 ~ 127             1 byte
192 ~ 223           2 bytes
224 ~ 239           3 bytes
240 ~ 247           4 bytes

Why is that?

www.ruanyifeng.com/blo…
segmentfault.com/a/11… (a plain-language history of UTF-8)

Unicode

Unicode is just a character set: it assigns a number (a code point) to each symbol, but it does not say how that number should be stored as bytes.

UTF-8

UTF-8 is the most widely used implementation of Unicode on the Internet. Its defining feature is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, and the number of bytes depends on the symbol.
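A quick way to see the variable length is to count bytes with strlen(). This is only a sketch: the four sample characters are my own picks, and the "\u{...}" escape syntax assumes PHP 7 or later.

```php
<?php
// Sample characters chosen for illustration. "\u{...}" escapes (PHP 7+)
// always produce the UTF-8 bytes of the given code point.
$samples = [
    'U+0041'  => "\u{41}",    // Latin letter A
    'U+00E9'  => "\u{E9}",    // é
    'U+4E2D'  => "\u{4E2D}",  // 中
    'U+1F600' => "\u{1F600}", // an emoji
];

$lengths = [];
foreach ($samples as $label => $char) {
    $lengths[$label] = strlen($char); // bytes used by the UTF-8 encoding
    echo $label, ': ', $lengths[$label], " byte(s)\n";
}
```

The four samples take 1, 2, 3 and 4 bytes respectively.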

The UTF-8 encoding rules are simple. There are only two:

  1. For a single-byte symbol, the first bit is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII (0 to 127).
  2. For a symbol that takes n bytes (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code point.
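The two rules can be sketched as a hand-written encoder. The function name encode_code_point is hypothetical (it is not a PHP built-in), and the sketch skips validation of surrogates and out-of-range values; it is only meant to show where the bits go.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 by hand.
// Illustration only: no validation of surrogates or values above 0x10FFFF.
function encode_code_point(int $cp): string
{
    if ($cp <= 0x7F) {
        // Rule 1: 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // Rule 2, n = 2: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // Rule 2, n = 3: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // Rule 2, n = 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encode_code_point(0x4E2D)), "\n"; // e4b8ad
```

For U+4E2D (中) this prints e4b8ad: the lead byte 0xE4 starts with 1110, and the two following bytes each start with 10.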

The following table summarizes the encoding rules, with the letter x marking the bits available for the code point:

Unicode range (hex)      UTF-8 encoding (binary)                First-byte range
0000 0000 - 0000 007F    0xxxxxxx                               0 ~ 127
0000 0080 - 0000 07FF    110xxxxx 10xxxxxx                      (128+64) ~ (255-32), i.e. 192 ~ 223
0000 0800 - 0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx             (128+64+32) ~ (255-16), i.e. 224 ~ 239
0001 0000 - 0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    (128+64+32+16) ~ (255-8), i.e. 240 ~ 247

With this table in hand, the byte ranges used in the code should be clear.
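If you would rather check the first-byte ranges than take them on faith, they can be recomputed from the bit patterns. A minimal sketch, assuming PHP 5.4+ for binary literals; the array layout is just my own way of writing the patterns down.

```php
<?php
// For each sequence length: the fixed prefix bits of the first byte
// and the number of free (x) bits that follow the prefix.
$patterns = [
    1 => ['prefix' => 0b00000000, 'free' => 7], // 0xxxxxxx
    2 => ['prefix' => 0b11000000, 'free' => 5], // 110xxxxx
    3 => ['prefix' => 0b11100000, 'free' => 4], // 1110xxxx
    4 => ['prefix' => 0b11110000, 'free' => 3], // 11110xxx
];

$ranges = [];
foreach ($patterns as $length => $p) {
    $min = $p['prefix'];                           // all x bits set to 0
    $max = $p['prefix'] | ((1 << $p['free']) - 1); // all x bits set to 1
    $ranges[$length] = [$min, $max];
    echo "$length byte(s): $min ~ $max\n";
}
```

The smallest value comes from setting every x bit to 0 and the largest from setting every x bit to 1, which reproduces 0 ~ 127, 192 ~ 223, 224 ~ 239 and 240 ~ 247.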