
Bytes problem in Python

SimonJoe246

What is bytes?

The previous article on character encodings covered how ASCII, Unicode, and UTF-8 relate to one another. A computer only understands zeros and ones, so files can only be stored in binary form. How, then, does character encoding work inside a computer?

In memory, text is handled as Unicode (opening a file means reading it from the hard drive into memory). When text has to be saved to disk or transmitted, it is converted to UTF-8, which, as the previous article explained, saves space and speeds up transfer.

For example, when you edit a file in Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory. When you save, the Unicode characters in memory are converted back to UTF-8 and written to the file.

When you browse a web page, the server converts the dynamically generated Unicode text to UTF-8 and sends it to the browser.

That is why the source of many web pages contains a declaration such as <meta charset="UTF-8" />, indicating that the page is encoded in UTF-8.

In Python, a string has type str and is represented in memory as Unicode. To send it over the network or save it to disk, the str must be converted to bytes, a sequence of raw byte values.

To get the bytes representation of a str, use the encode() method:

>>> 'ABC'.encode('ascii')
b'ABC'
>>> 'ABC'.encode('utf-8')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

A str containing only English (ASCII) characters can be encoded to bytes with ASCII, and the result is identical to its UTF-8 encoding. A str containing Chinese characters cannot be encoded with ASCII: any character outside the ASCII range raises an error.

Bytes that cannot be displayed as ASCII characters are shown as \x## escapes, where ## is the byte value as two hexadecimal digits.

Use type() to check the data type of b'ABC' or b'\xe4\xb8\xad\xe6\x96\x87'; both are instances of the bytes class:

>>> type(b'\xe4\xb8\xad\xe6\x96\x87')
<class 'bytes'>

Conversely, what you read from the network or from disk is bytes, which is turned back into a str with decode():

>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
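The two directions can be put together in a short round trip. This is only a minimal sketch; the variable names are illustrative, and the errors='replace' handler is shown as one option for dealing with characters a codec cannot represent.

```python
text = '中文'

# str -> bytes: encode before writing to disk or sending over the network
data = text.encode('utf-8')
print(data)                # b'\xe4\xb8\xad\xe6\x96\x87'

# bytes -> str: decode after reading
restored = data.decode('utf-8')
print(restored == text)    # True

# ASCII cannot represent these characters; with errors='replace'
# each one becomes '?' instead of raising UnicodeEncodeError.
lossy = text.encode('ascii', errors='replace')
print(lossy)               # b'??'
```

Note that the lossy version cannot be decoded back to the original text, so it is only suitable when losing characters is acceptable.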

Byte strings and text strings

b'ABC' is different from 'ABC'. The former is bytes, also called a byte string; the latter is str, also called a text string. In a byte string every element is a single byte, so an ASCII character takes one byte and a Chinese character encoded in UTF-8 takes three. A str is held in memory as Unicode characters, each of which may occupy several bytes once encoded.

>>> len('中'.encode('utf-8'))
3

Confusing byte strings with text strings can cause errors when reading binary data. Note in particular that indexing and iterating over a byte string return integer byte values, not one-byte byte strings.

>>> # Text string
>>> t = 'Hello world'
>>> for x in t:
...     print(x)
...     
H
e
l
l
o
w
o
r
l
d
>>> # Byte string
>>> b = b'Hello world'
>>> for x in b:
...     print(x)
...     
72
101
108
108
111
32
119
111
114
108
100
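Indexing behaves the same way as the iteration shown above, while slicing keeps the bytes type. The following sketch illustrates the difference; the variable names are illustrative only.

```python
b = b'Hello world'

first = b[0]            # indexing gives an int
first_slice = b[0:1]    # slicing gives a bytes object of length 1
print(first)            # 72
print(first_slice)      # b'H'

# Going back from integers to bytes or to characters
rebuilt = bytes([72, 101])
chars = [chr(x) for x in b[:5]]
print(rebuilt)          # b'He'
print(chars)            # ['H', 'e', 'l', 'l', 'o']
```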

Base64: displaying and printing binary data

Base64 is a way of representing arbitrary binary data using 64 printable characters.

When you open a BMP, EXE, or JPG file in Notepad, you see a lot of gibberish.

This is because they are binary files rather than text files, and binary files contain many bytes that cannot be displayed or printed as characters. For text-processing software such as Notepad to handle binary data, the data must first be converted to characters. Base64 is one of the most common ways to do this.

Note: encoding usually means turning characters into binary, and decoding means turning binary back into characters. With Base64 the direction is reversed: encoding turns binary data into text, and decoding turns that text back into binary. Do not confuse the two; at first I could not tell whether I was encoding or decoding.

The method:

Prepare an array of 64 characters:

['A', 'B', 'C', ... 'a', 'b', 'c', ... '0', '1', ... '+', '/']

Binary data is processed in groups of three bytes, 3 * 8 = 24 bits in total, and each group is split into four 6-bit units.

Each 6-bit unit is a number from 0 to 63 that serves as an index into the table. Looking up the four indexes gives four characters, which form the encoded string.

In this way every 3 bytes of binary data become 4 bytes of text, so the length grows by about 33%. The advantage is that the encoded text can be displayed directly in an email body or a web page.
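The table lookup for a single 3-byte group can be sketched by hand and compared with the standard library. The helper encode_group is hypothetical and written only for illustration; it does not handle padding or inputs of other lengths.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def encode_group(three_bytes):
    """Encode exactly three bytes into four Base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError('expected exactly 3 bytes')
    n = int.from_bytes(three_bytes, 'big')    # the 24 bits as one integer
    # Cut the 24 bits into four 6-bit indexes, most significant first
    indexes = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return ''.join(ALPHABET[i] for i in indexes)

print(encode_group(b'i\xb7\x1d'))        # abcd
print(base64.b64encode(b'i\xb7\x1d'))    # b'abcd'
```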

Python's built-in base64 module provides Base64 encoding and decoding:

>>> import base64
>>> base64.b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd++//'
>>> base64.b64decode(b'abcd++//')
b'i\xb7\x1d\xfb\xef\xff'

What if the length of the binary data is not a multiple of three, so that 1 or 2 bytes are left over? Base64 pads the end of the data with \x00 bytes and then appends 1 or 2 = signs to the encoded output to indicate how many \x00 bytes were added.

However, = can be ambiguous in URLs and cookies, so many Base64 variants strip it after encoding:

# Standard Base64:
'abcd' -> 'YWJjZA=='
# With = removed:
'abcd' -> 'YWJjZA'

How do you decode once the = has been removed? The length of standard Base64 output is always a multiple of 4, so before decoding, append enough = signs to bring the length back up to a multiple of 4.
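Restoring the padding takes only a couple of lines. The helper b64decode_unpadded below is a hypothetical name used for this sketch; it assumes the input is otherwise valid Base64 whose trailing = signs were stripped.

```python
import base64

def b64decode_unpadded(text):
    """Decode Base64 text whose trailing '=' padding was stripped."""
    missing = -len(text) % 4          # number of '=' needed to reach a multiple of 4
    return base64.b64decode(text + '=' * missing)

print(b64decode_unpadded('YWJjZA'))      # b'abcd'
print(b64decode_unpadded('YWJjZA=='))    # b'abcd' (already padded input still works)
```

For URLs specifically, the base64 module also offers urlsafe_b64encode and urlsafe_b64decode, which replace + and / with - and _.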

You might wonder why binary data has to be Base64-encoded before it is displayed, when it could simply be decoded with decode().

The answer is that binary data such as JPG, BMP, or MP3 files cannot in general be decoded into characters, because arbitrary bytes are often not valid in any text encoding:

>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> b'\xe4\xb8\xad\xe6\x96\x56\x87'.decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 3-4: invalid continuation byte

Many file formats, however, store attribute data in a header. A JPG file, for instance, carries header fields describing the image size, resolution, color, and so on, and that information has to be read as binary data rather than as text.

struct: converting bytes to and from other data types

bytes objects can be concatenated with + (subtraction is not supported) to form a new bytes object:

>>> m = b'hello '
>>> b = b'world'
>>> m+b
b'hello world'
>>> m+b'world'
b'hello world'

A str becomes bytes through encode(), but other data types, such as integers, need a different mechanism.

Python provides a struct module for converting bytes to and from other data types:

>>> import struct
>>> struct.pack('>I', 10240099)
b'\x00\x9c@c'

The first argument to pack is the format string. In '>I', > means big-endian byte order (network order), and I means a 4-byte unsigned int.
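The effect of the byte-order character is easiest to see by packing the same number both ways. A short sketch, using the same value as the example above; the variable names are illustrative.

```python
import struct

n = 10240099                      # 0x009C4063

big = struct.pack('>I', n)        # most significant byte first
little = struct.pack('<I', n)     # least significant byte first
network = struct.pack('!I', n)    # '!' is network order, the same as big-endian

print(big)       # b'\x00\x9c@c'
print(little)    # b'c@\x9c\x00'

# int.to_bytes produces the same bytes for a single integer
same = n.to_bytes(4, 'big') == big
print(same)      # True
```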

The number of values that follow must match the format string, and each value must fit within the range of its type code. Passing a single value to a format that expects two fails regardless of how large the value is:

>>> struct.pack('>2H', 10245599)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
struct.error: pack expected 2 items for packing (got 1)

>>> struct.pack('>2H', 102456565599)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
struct.error: pack expected 2 items for packing (got 1)

H denotes an unsigned short, a 2-byte unsigned integer, so '>2H' expects two values, each between 0 and 65535.
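The two kinds of failure, a wrong number of values and a value out of range, can be told apart with a small wrapper. try_pack is a hypothetical helper written for this sketch; the exact wording of the error messages varies between Python versions, so it is printed rather than relied upon.

```python
import struct

def try_pack(fmt, *values):
    """Return the packed bytes, or None if struct rejects the values."""
    try:
        return struct.pack(fmt, *values)
    except struct.error as exc:
        print(f'{fmt!r} with {values}: {exc}')
        return None

print(try_pack('>2H', 1024, 5599))   # b'\x04\x00\x15\xdf'
try_pack('>2H', 10245599)            # wrong number of values for two H codes
try_pack('>H', 70000)                # value too large for an unsigned short
```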

The format characters defined by the struct module are listed in the official Python documentation.

Conversely, unpack converts bytes into values according to the given format string:

>>> struct.unpack('>I', b'\x00\x9c@c')
(10240099,)

This converts the given byte stream into a 4-byte unsigned integer (unsigned int). With a different format string, unpack can also extract character data from a byte stream.

unpack always returns a tuple, even when the format describes a single value.
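Because the result is always a tuple, tuple unpacking is a convenient way to get at the values. The sketch below uses made-up sample values and also shows that the s code yields bytes, which still has to be decoded to obtain a str.

```python
import struct

data = struct.pack('>IH', 10240099, 513)
print(struct.calcsize('>IH'))            # 6 bytes: 4 for I, 2 for H

# Several values come back as one tuple
value, flag = struct.unpack('>IH', data)
print(value, flag)                       # 10240099 513

# A single value still comes back inside a tuple
(single,) = struct.unpack('>I', data[:4])
print(single)                            # 10240099

# Character data is returned as bytes, not str
(tag,) = struct.unpack('4s', b'ABCD')
print(tag, tag.decode('ascii'))          # b'ABCD' ABCD
```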

Application Scenarios:

Python sometimes has to process binary data, for example when reading files or working with sockets. The struct module handles this, including data laid out as C structures.

Let’s say we have a structure:

struct Header
{
    unsigned short id;
    char tag[4];
    unsigned int version;
    unsigned int count;
};

Suppose recv() on a socket returned the struct data above as a bytes object s. It can be parsed with unpack:

import struct
id, tag, version, count = struct.unpack('!H4s2I', s)

! means network byte order, used here because the data arrives over the network and network transmission uses network byte order. In H4s2I, H is one unsigned short, 4s is a 4-byte string, and 2I is two unsigned ints.

A single unpack call thus yields id, tag, version, and count.

Likewise, pack turns local values into the struct layout:

ss = struct.pack('!H4s2I', id, tag, version, count)

pack lays the values out in the format of the Header structure, and ss is now a byte stream that can be sent over a socket.
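A complete round trip for the Header layout might look like the sketch below. The helper names and sample values are assumptions made for illustration. It also assumes the sender transmits the fields back to back with no padding: with ! the struct module inserts no alignment bytes, whereas a C compiler may pad the in-memory structure.

```python
import struct

HEADER_FORMAT = '!H4s2I'     # id, tag, version, count in network byte order
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

def pack_header(id_, tag, version, count):
    """Build the wire representation of a Header."""
    return struct.pack(HEADER_FORMAT, id_, tag, version, count)

def unpack_header(data):
    """Parse the first HEADER_SIZE bytes of data as a Header."""
    return struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])

s = pack_header(7, b'DATA', 1, 42)
print(HEADER_SIZE)           # 14
print(unpack_header(s))      # (7, b'DATA', 1, 42)
```

Slicing to HEADER_SIZE lets the same function be used on a buffer that contains a payload after the header.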
