File name length limit in Linux

Occurrence scenario: During a development iteration, a requirement came up to rename a PDF file to the combined names of all the classes it covers. With many classes, the resulting file name was too long, and creating the file and folder failed under Linux.

Solution: On Linux, the maximum length of a file name is 255 bytes (NAME_MAX), and the maximum length of a file path is 4096 bytes (PATH_MAX). The class names therefore need to be truncated, and not simply by string length: each letter or Chinese character occupies a different number of bytes once encoded, so the name must be cut so that its encoded length stays within 255 bytes.
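
To make the fix concrete, here is a minimal sketch in Node.js, assuming the file system stores names as UTF-8 (the common Linux default); `truncateFileName` is a hypothetical helper name, not part of any library:

```js
// Minimal sketch: trim a string so its UTF-8 byte length fits within NAME_MAX.
function truncateFileName(name, maxBytes = 255) {
  const encoder = new TextEncoder();
  let result = name;
  // Drop one character at a time until the encoded name fits.
  // Spreading the string iterates by code point, so surrogate pairs
  // (e.g. emoji) are never split in half.
  while (encoder.encode(result).length > maxBytes) {
    result = [...result].slice(0, -1).join('');
  }
  return result;
}

// "严" is 3 bytes in UTF-8, so 100 of them (300 bytes) are cut to 85 (255 bytes).
console.log(truncateFileName('严'.repeat(100)).length); // 85
```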

Prompted by this scenario, I dug a little deeper into character encoding; the notes below summarize what I learned.

Character

Brief introduction: Characters are simply the symbols we use in daily life, such as digits, Chinese characters, and punctuation marks. For a more formal treatment, see an encyclopedia entry such as Baidu Encyclopedia.

Byte

Description: A unit of storage capacity in a computer; one byte typically represents an eight-bit binary number.

Encoding

Introduction: An encoding is, in essence, the set of rules for converting characters into binary. A computer represents 1 and 0 through high and low voltage levels, so for it to understand our characters there must be a mapping between binary numbers and characters; that mapping is called an encoding. Anyone could define their own encoding rules, but that would be chaos, so standards organizations established uniform ones. This is where common standards such as ASCII and Unicode come from.
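
As a quick illustration in JavaScript (the language this article returns to later), the mapping runs character to number to bits, and back again:

```js
// A character maps to a number (its code point); that number is what
// the machine ultimately stores as bits.
const code = 'A'.codePointAt(0);          // character → number: 65
console.log(code.toString(2));            // number → bits: "1000001"
console.log(String.fromCodePoint(code));  // number → character: "A"
```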

Character set

ASCII (as a character set): Maps 128 characters to numbers; it covers only English letters, some punctuation marks, and some non-printable control characters.

Unicode: Unicode is just a standard for mapping characters to numbers. It places no fixed limit on the number of characters it can support, nor does it require a character to occupy two, three, or any other particular number of bytes. Unicode does not deal with how characters are represented as bytes; it only assigns each character a number. After all, Unicode's goal is to map every character in the world to a single number; how that number is represented in a computer is not its concern.

Common misconceptions about Unicode include that it supports at most 65536 characters, or that every Unicode character must occupy two bytes; both are incorrect.

How Unicode characters are encoded into bytes in memory is a separate topic, defined by the UTF (Unicode Transformation Format) schemes.

A problem with Unicode: the Chinese character 严, for example, has the Unicode code point 4E25 in hexadecimal, which converted to binary is a full 15 bits (100111000100101). Representing that character therefore requires at least two bytes. Other, larger code points might need three or four bytes, or even more.
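
This claim is easy to verify in JavaScript:

```js
// 严 has code point U+4E25; its binary form needs 15 bits,
// so a single 8-bit byte cannot hold it.
const code = '严'.codePointAt(0);
console.log(code.toString(16));       // "4e25"
console.log(code.toString(2));        // "100111000100101"
console.log(code.toString(2).length); // 15, so at least two bytes are required
```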

This raises two serious questions. First, how do you distinguish Unicode from ASCII? How does the computer know that three bytes represent one character rather than three separate characters? Second, we already know that a single byte is enough for English letters; if Unicode uniformly required three or four bytes per character, every English letter would have to be padded with two or three bytes of zeros. That is a huge waste of storage: a text file would grow to two or three times its size, which is unacceptable. This is why different encodings of the Unicode character set emerged.

Common encoding rules

ASCII encoding rules: Each bit has two states, 0 and 1, so eight bits can form 256 different states (00000000 through 11111111). Since the ASCII character set defines only 128 characters and eight bits can represent 256 states, one byte is more than enough. In ASCII, therefore, each character occupies exactly one byte (1 byte = 8 bits).
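
A one-line check of this in JavaScript (a sketch; `padStart` just makes the full byte visible):

```js
// Every ASCII character fits in a single 8-bit byte.
console.log('a'.charCodeAt(0).toString(2).padStart(8, '0')); // "01100001"
console.log(new TextEncoder().encode('a').length);           // 1 byte
```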

UTF-8: A Unicode encoding scheme. In UTF-8, characters 0 through 127 are stored in a single byte, using the same encoding as US-ASCII, which means documents written in the 1980s open without any problem in UTF-8. Only characters 128 and above are stored in two, three, or four bytes. UTF-8 is therefore known as a variable-length encoding: it uses 1 to 4 bytes per character, with the byte length varying by character.
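
The variable lengths are easy to observe with the standard TextEncoder (available in browsers and modern Node.js); the specific characters below are just illustrative picks:

```js
// UTF-8 byte length varies from 1 to 4 depending on the character.
const enc = new TextEncoder();
console.log(enc.encode('A').length);   // 1 byte, same as US-ASCII
console.log(enc.encode('é').length);   // 2 bytes
console.log(enc.encode('严').length);  // 3 bytes
console.log(enc.encode('😀').length);  // 4 bytes
```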

UTF-16: Another popular variable-length encoding scheme, which stores each character in either two or four bytes.
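
A small sketch, assuming Node.js (whose Buffer supports the 'utf16le' encoding):

```js
// UTF-16 stores BMP characters in two bytes and everything else in four
// (as a surrogate pair).
console.log(Buffer.byteLength('严', 'utf16le'));  // 2 bytes (inside the BMP)
console.log(Buffer.byteLength('😀', 'utf16le')); // 4 bytes (a surrogate pair)
```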

Character encoding in JavaScript

The encoding JS uses is neither UTF-8 nor, strictly speaking, UTF-16: it is UCS-2, simply because UTF-16 did not yet exist when JavaScript was created. Luckily, UCS-2 code points coincide with Unicode code points, so the two are compatible. The relationship, roughly, is that UTF-16 superseded UCS-2 (or UCS-2 was absorbed into UTF-16), so today there is only UTF-16 and no separate UCS-2.
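
The UCS-2 legacy is visible in JavaScript's string APIs: the older ones count 16-bit code units, while the newer ones are code-point aware. A quick demonstration:

```js
// '😀' (U+1F600) lies outside the BMP, so UTF-16 stores it as two
// 16-bit code units (a surrogate pair), and the old APIs see both halves.
const s = '😀';
console.log(s.length);                      // 2: counts UTF-16 code units
console.log(s.charCodeAt(0).toString(16));  // "d83d": just the high surrogate
console.log(s.codePointAt(0).toString(16)); // "1f600": the full code point
console.log([...s].length);                 // 1: spread iterates by code point
```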

Conclusion:

  1. Unicode is simply a standard for mapping characters to numbers. The Unicode Consortium takes care of all the behind-the-scenes work, including assigning numbers to new characters.
  2. Unicode does not tell you how characters are encoded into bytes; that is determined by an encoding scheme, specified by the UTF formats.
  3. There is no such thing as plain text: to read a string, you have to know its encoding.
  4. A character set is different from an encoding. A character set is a mapping between characters and assigned numbers, while an encoding is the process of storing those numbers in a computer.
  5. Encoding is a complex subject; concepts such as code points, the basic multilingual plane, and the supplementary planes are not covered in this article. For the details of the actual schemes, refer to Ruan Yifeng's articles listed below.

References:

  • Character Encoding Notes: ASCII, Unicode and UTF-8 (Ruan Yifeng)

  • Learning about encodings won't kill you: Unicode mythbusters and encoding debunked

  • The basic character set and encoding of JavaScript

  • Unicode and JavaScript in Detail (Ruan Yifeng)