Moment For Technology

How does Immutable. Js persist data structures?

Posted on Aug. 6, 2022, 12:05 p.m. by Gillian Patterson
Category: The front end Tag: Immutable.js

It has been seven months since I wrote my last article, and NOW I have decided to go ahead and write a more in-depth article.

Immutable.js is a source of persistent immutable.js. It is a source of persistent immutable.js.

The relevant PPT and brain map of the article have been uploaded to GitHub: reading_notes. Due to time limitations, the article will not be described in too much detail as before, and many links may only describe the basic principles of implementation. For details, please refer to PPT, brain map or source code.

Let's get down to business.

1 basic Introduction

What is Immutable?

Immutable data is data that cannot be modified after being created. The corresponding data is mutable and can be modified after creation.

In JavaScript Immutable

All primitive values in JavaScript are immutable;

All complex data types in JavaScript are mutable.

Such as:

const foo = 'string';
const bar = {};

let foo1 = foo;
const bar1 = bar;

foo1 = 'new-string';
bar1['CUSTOM_VALUE'] = [];

console.log(foo, bar, foo1, bar1);

// Output: string {CUSTOM_VALUE: Array(0)} new-string {CUSTOM_VALUE: Array(0)}
Copy the code

As you can see, a change to foo1 does not apply to Foo, while a change to bar1 affects bar.

This also explains why strings can index characters by [], but cannot assign to them.

Immutable Mutable

The fundamental reason for Immutable and Mutable is that data is stored in different ways. For simple data (which can be understood as the original type of data), the storage space is small, the operation is simple, the replication cost is small, so it can be directly stored in the stack, and the value can be directly copied every time the assignment. For complex data (which can be understood as complex data), the storage space is large, the operation is more complex, and the cost of replication is high. Therefore, the data is stored in the heap outside the stack, and only the corresponding Pointers are stored in the stack.

When we index complex type data, we operate according to the pointer stored in the stack. Therefore, when we copy complex type data, we actually copy only the pointer in the stack. The two Pointers will point to the same memory area and share complex type data. This has the benefit of reducing the amount of time spent manipulating complex data and the amount of extra space required to copy complex data.

The two types of data have their own advantages and disadvantages, so the use of which data needs to be selected according to the actual situation. It is not true that one data is better than the other.

The pros and cons of the two are compared as follows:

advantages disadvantages
Mutable Make full use of memory, improve operation efficiency, reduce value copy and memory allocation time. Most languages are native to the way, easy to operate, low learning costs. Data manipulation is more dangerous, and concurrent programming is more complex.
Immutable Ensure the security of concurrent programming. Atomicity of data manipulation can be achieved. Take up more memory, lower operation efficiency, high learning cost

Why Immutable?

The previous section discussed which types of data should be considered for different scenarios, so this section explains why Immutable data should be used.

Immutable is a concept inherent in functional programming, which emphasizes the immutability of data, the transparency of function references, and the fact that a function is always a closed functional unit that does not modify external data. In this way, we can simply think about the function implementation of the current function, and when using the function, we can safely call it without worrying that it will affect external or global data and cause unpredictable problems.

So the starting point for using Immutable data is that it should be used when we are doing functional programming and need to be as transparent as possible about the reference of a function. This is why React uses Immutable and Vue does not care. Because Vue development does not follow the functional programming paradigm, changes to the data can be automatically resolved through data hijacking and dependency collection to update the UI. React follows the functional programming paradigm, whose core idea is UI = F (state), so the purest we write the React component, the better.

The above description may seem abstract, so here is a simple example:

There are two child components and a parent component. Clicking the button updates the child property, but after clicking the button, you can see that the other component rerender even if the data has not changed.

In this case, we need to use the react. memo to compare the data. By passing a comparison function to the react. Memo, we can prevent the sub-components from rerender when the data has not changed. Using Immutable data, however, speeds up the comparison process, which is why it is necessary to use it. In addition, immutable is used in React to change data, so that a new data object is always returned instead of changing the old data. In this way, React cannot receive data change messages by directly changing the old state.

Application of the Immutable idea in practical development

  • Most languages already implement Immutable data structures themselves or through third-party packages.

    JavaScript/Elm/Python/Clojure/Java/...

  • React uses Immutable to speed up comparison.

  • When copying data to a specified range, the COW mechanism of file systems copies the old data first to avoid data loss caused by accidents.

Ii. Implementation Mode

There are many ways to implement Immutable, and the following are some common ways.

A shallow copy

It is possible to implement Immutable data by using shallow copies. We know how to do this with React:

const a = {
  name: 'Russ'.gender: 1.favoriteFoods: ['banana'.'watermelon'.'grape']};constb = { ... a,favoriteFoods: [
    ...a.favoriteFoods,
    'litchi']};Copy the code

In this way, we reused the name and gender attributes in A, as well as other attributes in A. Flaoritefoods, and separately added a food.

This enables Immutable data because unchanged data is reused directly, and the ancestors of the changed nodes become new values.

But this approach has several drawbacks:

  • Writing is not elegant, and when the data hierarchy is deep and the data structure is complex, the code becomes bloated.
  • There is no real reuse of primitive data, and in fact new primitive data is recreated when extended operations are performed.
  • Adding data at the beginning and end of an array is fine, but changing or inserting values at a specified index becomes more complicated, requiring multiple slicing of the original array.
  • The operation is inefficient. For an array, the number of nodes to be copied is O(N). The larger the index, the more nodes to be copied.

2 list

Sharing data structures through linked lists is probably familiar, as it is very similar to git's branching implementation. Data structures can be persisted by using different header Pointers to different linked list nodes:

As shown in the figure above, if we want to shift the node that HEAD1 points to, we simply create a new pointer HEAD2 to the second node, so that the nodes are shared between 2-3 and 4.

If we want to unshift a new node, we simply create a new node, make the new head pointer HEAD3 point to it, and then point the new node to the original HEAD2 node, so that we can reuse 2, 3, and 4 nodes.

If we want to remove 2 nodes from the HEAD3 list, we simply copy the 1 'node and point it to the 3 node, and create a new HEAD4 pointer pointing to the 1 "node after copying.

Therefore, Immutable List is implemented through linked lists. The nodes that may need to be copied during each operation are all the nodes to the left of the modified node, and the nodes to the right can be shared.

But the drawbacks of this approach are also obvious:

  • Storage efficiency is low (50%), each node can only store one data and one pointer.
  • Random indexing is inefficient and requires a search through the node from scratch.

To address the first flaw, a new data structure was introduced: strings.

3 series

A string is a combination of an array and a linked list, turning each linked list node from one to multiple.

The implementation principle is the same as linked list, but the space storage efficiency is improved.

But the disadvantage is still that random indexing is inefficient. To solve this problem, we consider using balanced binary trees to implement persistent data structures.

Balanced binary tree

The height difference between the left and right nodes of a balanced binary tree is no more than 1. Therefore, the time complexity of data operation/search and the number of nodes to be copied are very stable for indexes of different lengths, both O(logN). This greatly improves the stability of the time complexity of the data structure, and the random index ability of the tree is much better than the linked list.

However, balanced binary tree also has its own defects, that is, the space storage efficiency is very low, each node can only store one data, but need to store two Pointers. But balanced binary tree gives us a good inspiration, through the tree to achieve persistent data structure, can be as few as possible and stable replication of a certain number of nodes.

Combining arrays and trees, you can implement a Vector Trie that is best suited for persistent data structures.

5 Vector Trie

Vector Trie, also known as a prefix tree, is a type of tree structure in which only the leaf nodes are used to store values, whereas the non-leaf nodes are used to store indexes.

Here's an example of a prefix tree:

On the left is a JavaScript object and on the right is the implementation of the corresponding prefix tree structure. As you can see, we split the key characters. We index each character along the HEAD pointer to find the value of the key. This is a prefix tree.

This is an implementation of JavaScript objects, but how is a JavaScript array implemented?

We know that JavaScript has the concept of array-like objects, so we can use an array-like object to simulate array implementation (ignoring the length attribute for the moment) :

The second implementation is a standard number partition prefix tree. Number partition means that a number is assigned to the prefix tree as a key.

Here is a more normalized prefix tree for decimal number partitions:

This prefix tree has four levels, and each node has 10 children, so the amount of data that can be stored in the whole prefix tree is 10 ^ 4 = 10000.

The figure above shows the process of taking the prefix tree as a list and setting position 2433 of the list to 13. We can quickly get the formula for each index as follows:

// levelIndex is the index of the current level
// level is the current level, 4/3/2/1 from left to right
const levelIndex = index / (10 ** (level - 1)) % 10
Copy the code

So the result is 2/4/3/3, which looks like splitting string 2433 by character, so it makes sense.

The number partition corresponds to the bit partition, which is the bit. When the memory of each node in the number partition is 2 to the power of the exponent, we get a bit partition prefix tree. For example, the simplest bit partition prefix tree is binary:

What are the benefits of using bit partitioning? The biggest advantage is that the calculation formula of each layer index can be simplified, and the use of bit partition can be more in line with the constitution of the computer and the implementation rules of the language.

So how do we understand this benefit? Let's take an example.

The previous decimal partition index for each level is calculated as follows:

// levelIndex is the index of the current level
// level is the current level
const levelIndex = index / (10 ** (level - 1)) % 10
Copy the code

If our base is 2 to the exponent, as in 2, then the above formula can be simplified as:

// levelIndex is the index of the current level
// level is the current level
const levelIndex = index / (2 ** (level - 1)) % 2
Copy the code

However, for binary data operation, we can simplify it by bit operation, such as:

If x = 2 ** n, then y/x = y n, y % x = y(x-1).

Therefore, when we use bit partition, the formula of index operation for each layer can be simplified as:

// levelIndex is the index of the current level
// level is the current level
const levelIndex = (index  (level - 1)) % 1
Copy the code

This is one of the biggest benefits of using bit-partitioning, plus we know that all numbers in JavaScript are stored in 64-bit doubles, and that many data structures have a maximum length of 2 ** 32-1; So using bitpartitioning to simulate JavaScript data objects fits the language design better and is easier to implement.

So how do prefix trees persist data structures? It's easy to understand by looking at this picture:

Each time we change the structure of the tree, we copy all the nodes in the index path, while the side nodes keep the original reference unchanged, thus making the data structure persistent. We get a completely new tree structure each time we set it up, but for unmodified nodes, the references are unchanged, so it looks like the data structure is shared.

In conclusion, there is no doubt that Immutable. Js implements the List structure and its data structure persistence properties through 32-bit partitioning. Since JavaScript arrays typically store a maximum length of 2 ** 32-1, Immutable List requires up to 7 layers to store 32 ** 7 = 2 ** 35 data. The time complexity of data operation is O(7).

Now that we know how to persist Immutable. Js data structures, we can implement the List structure ourselves. I've already written a basic implementation and done some unit tests. The code is on GitHub for those who want to take a look. I won't go into this article.

Three optimization means

1 Tail optimization

The bitpartition prefix tree mentioned above already meets our basic requirements. Is there anything that needs to be optimized?

Immutable. Js uses the 32-bit partition prefix tree to implement the List structure, so the time complexity of the operation node is O(7). Although it seems to be a constant level of complexity, compared with JavaScript Native Array horizontally, the speed of data operation is still several orders of magnitude lower, which needs to be optimized.

Without optimization, multiple calculations and node copies are required for each modification of data:

The operation of List is usually carried out in the tail, so we can optimize for this feature. As we know that database/DOM has batch insert optimization scheme, we can also consider using batch modification scheme for optimization.

Batch optimization scheme:

We add a tail attribute to the HEAD pointer, and the value of tail is a common node in the prefix tree. All tail operations can be performed directly on the tail node. When we need to mount the tail node to the tree, we can save a lot of time. Compared to the original scheme, if we were implementing the 32-bit partition-prefix tree, then all of our tail operations could save 31/32 of the work and greatly speed up the operation.

For 32-bit partition prefix trees (suppose capacity), the starting index of Tail is: (Capacity-1) 5 5, that is, (Capacity-1) / 32 * 32. In addition, if the total length is less than 32, then the index at the beginning of tail is 0, because only one tail node is needed to implement the entire List.

The related calculation function Immutable. Js source code:

2 Transient optimization

Before we talk about Transient optimization, we can first look at the following code problems, which will help us understand why Transient optimization is needed:

The above code will repeatedly assign to l to override the value of index 0/1/2. At first glance, this code will work perfectly.

However, we know that every time we call the set method to set the node value, we get a new List instance, as shown in the following figure: HEAD' and HEAD"

All nodes are copied and created, but we don't use any List instances (such as HEAD'). We only use the List instance (HEAD") returned from the last call to the set method, which increases the overhead of the code run time. So we need to optimize for that.

The optimal solution is to make the intermediate trees mutable, allowing us to modify some of the nodes directly without having to copy new nodes each time.

For example, the red node in the figure below was created by modifying the blue node directly, and only one new Trie was created in the whole process.

So how do we realize that mutable exists in the process of immutable? The key is what are the criteria for determining whether a node can be modified? .

Through the understanding of prefix tree, it is not difficult to find the following two key points:

  1. The current node can only be accessed by the current trie. no other tries share the current node (you cannot modify nodes shared from other Tries).
  2. The current Trie will not be used after this update (make sure no new tries are created during the update).

But how do we implement these two critical judgments?

The first one is easy to implement. We can identify each node and the root node of the tree by ID. If the node is newly created when the current trie is created, it should have the same ID as the trie.

The second point is more difficult to implement. How do you know it won't be used after this update? It's hard. There's no way to predict what the future will bring. So what is difficult to achieve through code is constrained by a specification, either written or written, which Immutable. Js creates by code.

Look at the official website, we can see a withMutations method, which achieves this optimization solution:

In the function passed in, we can assume that the trie we create will not be used later and can be modified directly, so that we do not have to continuously assign to the same variable, but just chain call it.

The key code for Transient optimization in Immutable. Js is as follows:

This function is responsible for creating a node and takes two arguments: node and ownerID. Node is the node to be modified, and ownerID is the ID of the newly generated trie.

If the current node and the current trie have the same ownerID, then the node is considered mutable and can be modified directly.

If they are not equal, then the node is shared from another TRIe, so it cannot be modified.

The realization of withMutations is as follows:

export function withMutations(fn) {
  // asMutable returns a new data structure via ensureOwner, and that data structure must have an ownerID.
  const mutable = this.asMutable();
  fn(mutable);
  // The ownerID needs to be restored to the original ownerID.
  return mutable.wasAltered() ? mutable.__ensureOwner(this.__ownerID) : this;
}
Copy the code

This is the implementation of the Transient optimization, and the memory savings and GC optimization are not obvious because the amount of memory reclaimed is nearly the same throughout the process. More optimization effect is reflected in the execution speed, because there is no need to frequently allocate memory, new objects, so the execution speed is much faster. Here is a comparison of trie optimizations for 1,000,000 elements:

You can see that code with Transient optimization executes twice as fast.

For the implementation of these two optimization schemes, I also put on GitHub, which is equivalent to the V2 version, you can see the code.

For my own implementation of v1 and v2, as well as the Immutable. Js version, I did a speed comparison:

You can see that the Immutable. Js version is twice as fast as the v1 version, and the v2 version is twice as fast as the Immutable.

The former is easier to understand than the latter, and the next section explains why the V2 version is twice as fast as the immutable.js version.

3. Spatial Optimization

Tail optimization and Transient optimization speed up the code, but we have not optimized the storage space, this section will show how to compress the prefix tree space.

In the previous section, we found that our version of V2 runs twice as fast as Immutable. I think Immutable runs twice as fast because of compression of prefix trees (scalable prefix trees). The v2 implementation is a non-scalable prefix tree. When we create a Trie, it has a capacity of 2 ** 25. We do not need to expand the tree space later, so it will run much faster.

However, such a disadvantage is the waste of storage space. We may only need a Trie with 5 capacity, but it takes up the space of 2 ** 25 capacity Trie, resulting in a waste of resources, so we need to find ways to optimize. Let's take a look at a few animations to see how Immutable. Js optimizes storage efficiency.

There is space at the end, so push directly:

If there is not enough space in the tail, but there is enough space in the Trie, create all nodes:

When there is not enough room in the entire Trie, increase the height of the tree to double the capacity:

When the capacity is sufficient, the direct POP node:

When a node is deleted and there is no other value for the current path, the entire path is deleted, and if only one path exists on the root node, the tree can be compressed and degraded:

Quick offsets of elements via cursors to implement header operations (shift/unshift) :

Each time you perform an operation on an element, you need to evaluate the current offset, not index.

The above is the Immutable. Js spatial optimization scheme, more details need to refer to the code, link at the end of the article.

Four summarizes

Due to time reasons, this article is more about the implementation principle, ignoring most of the implementation details, requiring readers to understand the principle before looking at the code.

Here are some of the information I summarized and some of the information I refer to, for the convenience of further development and in-depth.

1 Summary data

It is highly recommended that you take a look at the slides and code notes in the materials below, as they are much more informative than this article and include a hash algorithm/IS algorithm /Map data structure implementation.

  1. Learn TrieList, code address: immutable, you can switch v1 tag to view v1 version
  2. Reading_notes/IMmutable
  3. Some annotations to the source code during the learning process: immutation-js, note the master branch

2 Reference Materials

The following article is highly recommended and will definitely help improve your knowledge of the Immutable. Js implementation.

英 文 名 称 :

  1. Functional Go: An introduction to persistent data structures
  2. Functional Go: Implementation of Vector Trie
  3. Functional Go: Transient and persistent
  4. The Implementation of IMmutable.js
  5. The Implementation mechanism of IMmutable. Js
  6. Immutable. Js source code parsing --List type
  7. Immutable. Js source code parsing -Map type
  8. Read Map data structures in IMMUTABLE -js
  9. JavaScript Engine Principles (part 3)

Foreign Language Materials:

  1. Persistent Vector Performance
  2. Persistent Vector Performance Summarised
  3. Understanding Clojure's Persistent Vectors, pt. 1
  4. Understanding Clojure's Persistent Vectors, pt. 2
  5. Understanding Clojure's Persistent Vectors, pt. 3
  6. Understanding Clojure's Transients
  7. Data Structures in ImmutableJS
  8. Question: use of smi() function when computing hashes
  9. Efficiently Representing Values and Tagging
Search
About
mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.