In Python 3.5 and earlier, dictionaries were not guaranteed to keep their insertion order: if you inserted key-value pair A first and key-value pair B second, printing the list of keys might show B before A.

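But starting with Python 3.6, dictionaries are ordered. You insert key-value pair A first and then B, and when you print the list of keys, you’ll find that B comes after A. (In 3.6 this was a CPython implementation detail; from Python 3.7 on it is guaranteed by the language.)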
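A quick way to check this yourself is the minimal sketch below. It assumes Python 3.7 or later (or CPython 3.6); the keys 'A' and 'B' are just placeholders for the two pairs mentioned above.

```python
# Assumes Python 3.7+ (or CPython 3.6), where dicts keep insertion order.
pairs = {}
pairs['A'] = 1  # inserted first
pairs['B'] = 2  # inserted second

keys = list(pairs.keys())
print(keys)  # ['A', 'B']
```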

Not only that, but since Python 3.6, the following three traversals are more efficient than they were in Python 3.5 and earlier:

for key in dictionary
for value in dictionary.values()
for key, value in dictionary.items()

Since Python 3.6, dictionaries take up only 30% to 95% of the memory they used to, depending on the number of key-value pairs in the dictionary.

What did Python 3.6 change to make dictionaries better? To explain, we first need to look at how dictionaries worked under the hood in Python 3.5 and earlier.

When we initialize an empty dictionary, CPython allocates, under the hood, a two-dimensional array with eight rows and three columns, as shown below:


my_dict = {}

'''
The memory map now looks like this:
[[---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---]]
'''

Now, let’s add a key-value pair to the dictionary:

my_dict['name'] = 'kingname'

'''
The memory map now looks like this:
[[---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [1278649844881305901, pointer to name, pointer to kingname],
 [---, ---, ---],
 [---, ---, ---]]
'''

Here’s why memory looks like this when a key-value pair is added:

First we call Python’s hash function to calculate the hash value of the string name at the current runtime:

>>> hash('name')
1278649844881305901

Notice that I’m emphasizing “current runtime”, because Python’s built-in hash function is not quite what we traditionally think of as a hash function. Within a single run of the interpreter, hash() returns the same value for the same string every time, but the value can change once you close Python and start it again.
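One way to observe this behavior is the sketch below. It starts fresh interpreters with CPython’s PYTHONHASHSEED environment variable set to different values; the helper name hash_in_fresh_interpreter is made up for this illustration, and the seeds 1 and 2 are arbitrary.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(text, seed):
    """Start a new Python process and return hash(text) as computed there."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# Inside one process the value never changes.
assert hash("name") == hash("name")

# Separate processes agree only when they share the same hash seed.
first = hash_in_fresh_interpreter("name", seed=1)
second = hash_in_fresh_interpreter("name", seed=2)
again = hash_in_fresh_interpreter("name", seed=1)
print(first, second, first == again)
```

Without PYTHONHASHSEED set, CPython picks a random seed at startup, which is why the numbers in this article will not match the ones you see on your machine.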

Suppose that in the current run, hash('name') is 1278649844881305901. Now we take the remainder of this value when divided by 8:

>>> 1278649844881305901 % 8
5

The remainder is 5, so the entry goes into the row with index 5 of the two-dimensional array we just initialized. Since name and kingname are both strings, the underlying C code stores them as two string objects and keeps pointers to them. So in row 5 of the two-dimensional array, the first value is the hash of name, the second is the memory address of the name string (a pointer is a memory address), and the third is the memory address of the kingname string.

Now, let’s insert two more key-value pairs:

my_dict['age'] = 26
my_dict['salary'] = 999999

'''
The memory map now looks like this:
[[-4234469173262486640, pointer to salary, pointer to 999999],
 [1545085610920597121, pointer to age, pointer to 26],
 [---, ---, ---],
 [---, ---, ---],
 [---, ---, ---],
 [1278649844881305901, pointer to name, pointer to kingname],
 [---, ---, ---],
 [---, ---, ---]]
'''
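The placement rule used so far can be sketched in a few lines of Python. This is only a simplified model, not CPython’s actual C code: the hash values are copied from the session above (a real run produces different ones), the names ARTICLE_HASHES and insert are made up for this illustration, and collisions are ignored.

```python
# Hash values copied from the article's session; a real run gives different ones.
ARTICLE_HASHES = {
    'name': 1278649844881305901,
    'age': 1545085610920597121,
    'salary': -4234469173262486640,
}

def insert(table, key, value):
    """Store [hash, key, value] in the row chosen by hash % len(table)."""
    key_hash = ARTICLE_HASHES[key]
    row = key_hash % len(table)
    table[row] = [key_hash, key, value]  # collisions are ignored in this sketch
    return row

table = [None] * 8  # eight empty rows
rows = [insert(table, key, value)
        for key, value in [('name', 'kingname'), ('age', 26), ('salary', 999999)]]
print(rows)  # [5, 1, 0]
```

The three keys land in rows 5, 1 and 0, which matches the memory map above and has nothing to do with the order in which they were inserted.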

So how does a dictionary read the data? Let’s say we want to read the value of age.

At this point, Python computes the hash value of age in the current run:

>>> hash('age')
1545085610920597121

Now take the remainder of this hash value when divided by 8:

>>> 1545085610920597121 % 8
1

The remainder is 1, so the row with index 1 of the two-dimensional array holds the key-value pair we need. Python returns the value that the third pointer in that row points to, which is 26, the value for age.

When you iterate over a dictionary’s keys, Python walks through the whole two-dimensional array under the hood. If the current row holds data, it returns the key that the key pointer refers to; if the row is empty, it skips it. So iteration always visits every row of the array, occupied or not.
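The lookup and the iteration just described can be sketched as follows. Again this is a simplified model with made-up helper names (lookup, iter_keys), using the hash values from the session above and None for an empty row.

```python
# Sparse table as in the memory map above; None marks an empty row.
table = [
    [-4234469173262486640, 'salary', 999999],
    [1545085610920597121, 'age', 26],
    None, None, None,
    [1278649844881305901, 'name', 'kingname'],
    None, None,
]

def lookup(table, key, key_hash):
    """Jump straight to row key_hash % len(table) and return its value."""
    row = table[key_hash % len(table)]
    if row is None or row[1] != key:
        raise KeyError(key)
    return row[2]

def iter_keys(table):
    """Visit all rows, yielding keys only from the occupied ones."""
    for row in table:
        if row is not None:
            yield row[1]

age = lookup(table, 'age', 1545085610920597121)
keys = list(iter_keys(table))
print(age)   # 26
print(keys)  # ['salary', 'age', 'name']
```

Note that the loop touches all eight rows to produce three keys, and that the keys come out in row order rather than insertion order.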

Each row has three columns, and each column occupies 8 bytes of memory space, so each row occupies 24 bytes of memory space.

Since the remainder of a hash value has nothing to do with when the key was inserted, dictionary keys are not stored in the order in which they were inserted.

Note that I have omitted two points that are not very relevant to this article:

  1. Open addressing. When you hash two different keys and take the remainder modulo 8, the remainders may be the same. To avoid overwriting the existing value, Python uses a technique called open addressing to find a new location for the new key-value pair.
  2. Resizing. When the number of key-value pairs exceeds two-thirds of the current array length, the array is expanded: 8 rows become 16, 16 rows become 32, and so on. When the length changes, the remainders change too, so the existing rows have to be moved to new positions, which makes insertion slow.
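For readers who want to see these two points in action, here is a toy sketch. It uses linear probing (try the next row) as a stand-in; CPython’s real probe sequence is different, so treat this only as an illustration of the idea. The function names and the small integer hash values are invented for the example.

```python
def find_slot(table, key, key_hash):
    """Linear probing: start at key_hash % size and step forward until
    the key itself or a free row is found."""
    size = len(table)
    slot = key_hash % size
    while table[slot] is not None and table[slot][1] != key:
        slot = (slot + 1) % size
    return slot

def insert(table, key, key_hash, value):
    """Insert a row, then double the table if it is more than 2/3 full."""
    table[find_slot(table, key, key_hash)] = [key_hash, key, value]
    used = sum(row is not None for row in table)
    if used * 3 > len(table) * 2:
        bigger = [None] * (len(table) * 2)
        for row in table:  # every existing row must be placed again
            if row is not None:
                bigger[find_slot(bigger, row[1], row[0])] = row
        table = bigger
    return table

table = [None] * 8
table = insert(table, 'a', 5, 'first')    # 5 % 8 == 5
table = insert(table, 'b', 13, 'second')  # 13 % 8 == 5 too, so it probes to row 6
print(find_slot(table, 'a', 5), find_slot(table, 'b', 13))  # 5 6

for i in range(4):  # four more keys: six of eight rows are now used
    table = insert(table, f'k{i}', i, i)
print(len(table))                 # 16
print(find_slot(table, 'b', 13))  # 13
```

After the resize, key 'b' no longer collides with 'a' because 13 % 16 is 13, which shows why every row has to be moved when the length changes.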

Starting with Python 3.6, the underlying data structure of dictionaries changed. When you initialize an empty dictionary, it now looks like this:

my_dict = {}

'''
The memory map now looks like this:
indices = [None, None, None, None, None, None, None, None]

entries = []
'''

When you initialize a dictionary, Python creates a one-dimensional array of length 8, called indices here, and an empty two-dimensional array, called entries.

Now let’s add a key-value pair to the dictionary:

my_dict['name'] = 'kingname'

'''
The memory map now looks like this:
indices = [None, 0, None, None, None, None, None, None]

entries = [[-5954193068542476671, pointer to name, pointer to kingname]]
'''

Why is memory like this? Let’s look at it step by step:

In the current run, the hash value of the string name is -5954193068542476671, and its remainder modulo 8 is 1:

>>> hash('name')
-5954193068542476671
>>> hash('name') % 8
1

So we set the element at index 1 of the one-dimensional indices array to 0.

What does this 0 mean? It is a row index into the two-dimensional entries array. At the moment entries has only one row, the key-value pair we just added: the hash value of name, the pointer to name, and the pointer to kingname. So the 0 stored in indices is the row number, within entries, of the key-value pair we just inserted.

OK, now let’s insert two more key-value pairs:

my_dict['address'] = 'xxx'
my_dict['salary'] = 999999

'''
The memory map now looks like this:
indices = [1, 0, None, None, None, None, 2, None]

entries = [[-5954193068542476671, pointer to name, pointer to kingname],
           [9043074951938101872, pointer to address, pointer to xxx],
           [7324055671294268046, pointer to salary, pointer to 999999]]
'''

Now what if I want to read the data? To read salary, I first compute the hash value of salary and its remainder modulo 8:

>>> hash('salary')
7324055671294268046
>>> hash('salary') % 8
6

So I read the element at index 6 of indices. Its value is 2.

Then I read the row with index 2 in entries, which is the data for salary.

With this layout, new data is always appended to the end of entries, which preserves insertion order. When we iterate over the dictionary’s keys and values, we simply walk through entries. Every row in entries holds useful data, so nothing has to be skipped and fewer rows are visited.
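The whole scheme fits in a short sketch. As before, this is a simplified Python model rather than CPython’s implementation: the hash values are the ones from the session above, the names ARTICLE_HASHES, insert and lookup are made up for the example, and collisions and resizing are left out.

```python
# Hash values copied from the article's session; a real run gives different ones.
ARTICLE_HASHES = {
    'name': -5954193068542476671,
    'address': 9043074951938101872,
    'salary': 7324055671294268046,
}

indices = [None] * 8  # sparse: slot -> row number in entries
entries = []          # dense: rows are appended in insertion order

def insert(key, value):
    key_hash = ARTICLE_HASHES[key]
    # Record where the new row will live, then append it (collisions ignored).
    indices[key_hash % len(indices)] = len(entries)
    entries.append([key_hash, key, value])

def lookup(key):
    row_number = indices[ARTICLE_HASHES[key] % len(indices)]
    if row_number is None:
        raise KeyError(key)
    return entries[row_number][2]

insert('name', 'kingname')
insert('address', 'xxx')
insert('salary', 999999)

salary = lookup('salary')
keys = [row[1] for row in entries]  # iteration only touches real rows
print(indices)  # [1, 0, None, None, None, None, 2, None]
print(salary)   # 999999
print(keys)     # ['name', 'address', 'salary']
```

Iteration reads entries from top to bottom, so the keys come back in the order they were inserted.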

With the old layout, a two-dimensional array of eight rows took up 8 * 24 = 192 bytes of memory even when only three rows held valid data. With the new layout, three rows of valid data mean entries has only three rows and occupies 3 * 24 = 72 bytes. The indices array is one-dimensional and, for a table this small, needs only one byte per element, so it occupies just 8 bytes, for a total of 80 bytes. Memory usage is about 42% of what it used to be.
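The arithmetic can be checked directly. The constant names below are made up for the example, and the one byte per indices element is an assumption that holds for small tables like this one.

```python
ROW_BYTES = 3 * 8  # three 8-byte columns per row
TABLE_ROWS = 8     # rows in the old table, and slots in indices
USED_ROWS = 3      # key-value pairs actually stored
INDEX_BYTES = 1    # assumption: one byte per indices slot for a small table

old_size = TABLE_ROWS * ROW_BYTES
new_size = USED_ROWS * ROW_BYTES + TABLE_ROWS * INDEX_BYTES
ratio = new_size / old_size
print(old_size, new_size, f"{ratio:.1%}")  # 192 80 41.7%
```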

Reference: [Python-Dev] More compact dictionaries with faster iteration