Ask questions

Let’s start with an interview question I often use, which is also a real need:

You need to display 5,000 or more pieces of data on the front end, each piece of data as an object with various properties in it. The value of each property can be a primitive type, an object, or even an array. The object or the element inside the array can continue to contain the object or the array and allow infinite nesting. Such as

{
  "name": {
    "firstName": "yi"."lastName": "li"
  },
  "age": 23."roles": ['developer'.'admin']."projects": [{
    "name": "demo"."repo": ""}}]Copy the code

A search box is provided on the page, and the user can find the data containing the content by entering the search content. Note that any property value of any data object (for example, in the above data structure, any property value of name, age, roles) contains this keyword. If the property value is an array or object, then the elements of the array or the value of the object continue to match the input, recursively, and as long as there is a hit, the data is considered a match

How to design this feature so that the search function is as fast as possible?

solution

If you have any sense as a programmer, you should have two things in mind:

  • Traversal and depth-first traversal are the most straightforward
  • If I want to go through it fast enough, I lose

True, traversal is the easiest and slowest. So one of the usual ways to optimize is to trade space for time; And the other way… I’ll come out later.

Here we try to optimize the search by building a dictionary tree (Trie).

If you don’t know what a dictionary tree is, let’s say we have a simple object with keys and values that correspond as follows:

We build a tree according to the appearance of the letters of the “key”, and the value of the leaf node may be the value of a “key”

At this time, no matter the user wants to access the value of any attribute, as long as the user starts from the root node of the tree and accesses the leaf node of the tree according to the order in which the attribute letters appear, the value of the attribute can be obtained. For example, when we want to visit Tea:

But in the scenario we need to solve, we don’t need to care about “attributes”, we only care about whether “values” match the search content. So we just need to build a dictionary tree for values.

Suppose you have the following object values

const o = {
  message: 'ack'
  fruit: 'apple'.unit: 'an'.name: 'anna',}Copy the code

The established tree structure is as follows:

root--a
      |--c
         |--k
      |--p
         |--p
            |--l
               |--e    
      |--n
         |--n
            |--a
Copy the code

When the user searches for apple, the search starts from A and ends with the letter E. If there is a node in the tree, it indicates a hit. When the user searches aha, the corresponding node cannot be found in the tree when accessing H, indicating that the object does not meet the search conditions

But in practice we will have a lot of object values, multiple object values may have duplicate values, so when we match, we want to return all possible match results. Such as

[{id: 1.message: 'ack'
    fruit: 'apple'.unit: 'an'.name: 'anna'}, {id: 2.message: 'ack'
    fruit: 'banana'.unit: 'an'.name: 'lee',},]Copy the code

The above two objects have the same values ack and AN, so we need to add the id of the object to the leaf node in the tree

root--a| | -- - c k (ids: [1, 2]) | -- - | -- - | p p --l
               |--e (ids: [1])
      |--n (ids: [1, 2])
         |--n
            |--a (ids: [1])
Copy the code

So when the user searches for an, we can return all the matches

OK, so with that in mind let’s start implementing the code.

Code implementation

False data

So one of the first questions that we have to solve is how do you quickly forge 5000 pieces of data? Here we use randomuser.me/ API/open source API. For simplicity, let’s make it return only the values of the gender, email, phone, cell, and NAT basic data types and not the nested structures (objects and arrays). Note that complex structures have been omitted for the sake of easy code presentation and understanding, thus avoiding complex code. There are no major changes to the code, just traversal logic and recursive logic.

Request randomuser. Me/API /? The result… The results are as follows:

{
  "results": [{"gender": "male"."email": "[email protected]"."phone": "00 02-65-13-26 -"."cell": "The 06-09-02-19-99"."nat": "FR"
    },
    {
      "gender": "male"."email": "[email protected]"."phone": "011-376-3811"."cell": "081-697-1414"."nat": "IE"
    }
    / /...]}Copy the code

Leaf node data structure

The data structure is described as follows:

class Leaf {
  constructor(id = "", value = "") {
    this.ids = id ? [id] : [];
    this.value = value;
    this.children = {};
  }
  share(id) {
    this.ids.push(id); }}Copy the code

The share method is used to add multiple identical matching ids to the leaf node

Help function

We need some helper functions during coding, such as:

  • isEmptyObject: Checks whether the object is empty
  • distinct: Removes duplicate elements from an array

These two functions can be implemented by borrowing from the LoDash library and are easy to implement manually, so I won’t go into details here

Another important method is normalize, which I prefer to translate as “flattening” (rather than “standardizing”) because it’s more graphic. This method is used to split an array of objects into a mapping between ids and objects.

Such as the

[{id: 1.message: 'ack'
    fruit: 'apple'.unit: 'an'.name: 'anna'}, {id: 2.message: 'ack'
    fruit: 'banana'.unit: 'an'.name: 'lee',},]Copy the code

And then it flattens out

{
  '1': {
    id: 1.message: 'ack'
    fruit: 'apple'.unit: 'an'.name: 'anna',},'2': {
    id: 2.message: 'ack'
    fruit: 'banana'.unit: 'an'.name: 'lee',}}Copy the code

The reason for this is that when the retrieval results return an array of ids: [1, 2, 3], we simply iterate over the results and immediately find the corresponding data in the flattened Map by id. Otherwise, you have to keep iterating through the raw data array to find the corresponding data.

Since randomUser. me does not return an ID, we’ll use email as the only identifier for now. The implementation of Normalize is as follows:

function normalize(identify, data) {
  const id2Value = {};
  data.forEach(item= > {
    const idValue = item[identify];
    id2Value[idValue] = item;
  });
  return id2Value;
}
Copy the code

Build a Tree

There’s no secret in this part of the code, it’s just building a tree recursively, right

fetch("https://randomuser.me/api/? results=5000&inc=gender,email,phone,cell,nat")
  .then(response= > {
    return response.json();
  })
  .then(data= > {
    const { results } = data;
    const root = new Leaf();
    const identifyKey = "email";

    results.forEach(item= > {
      const identifyValue = item[identifyKey];
      Object.values(item).forEach(itemValue= > {
        // Notice that the Number and Boolean types are stringed as well
        const stringifiedValue = String(itemValue);
        let tempRoot = root;
        const arraiedStringifiedValue = Array.from(stringifiedValue);
        arraiedStringifiedValue.forEach((character, characterIndex) = > {
          const reachEnd =
            characterIndex === arraiedStringifiedValue.length - 1;
          if(! tempRoot.children[character]) { tempRoot.children[character] =new Leaf(
              reachEnd ? identifyValue : "",
              character
            );
            tempRoot = tempRoot.children[character];
          } else {
            if(reachEnd) { tempRoot.children[character].share(identifyValue); } tempRoot = tempRoot.children[character]; }}); }); });Copy the code

Fuzzy search

Searching parts of the code is no secret:

function searchBlurry(root, keyword, userMap) {
  const keywordArr = Array.from(String(keyword));
  let tempRoot = root;
  let result = [];

  for (let i = 0; i < keywordArr.length; i++) {
    const character = keywordArr[i];
    if(! tempRoot.children[character]) {break;
    } else {
      tempRoot = tempRoot.children[character];
    }
    if (keywordArr.length - 1=== i) { result = [ ...tempRoot.ids, ...collectChildrenInsideIds(tempRoot.children) ]; }}return distinct(result).map(id= > {
    return userMap[id];
  });
}
Copy the code

Notice that there is a collectChildrenInsideIds method that collects the ids of all the children under the leaf node. The reason for this is that the current operation is a fuzzy match, so when you search for a, apple, Anna, ACK all count as matches.

Conventional search methods and dictionary tree defects

To compare efficiency, and to test the correctness of the search results, we still need to write a regular traversal search method:

function regularSearch(searchKeyword) {
  const regularSearchResults = [];
  results.forEach(item= > {
    for (const key in item) {
      const value = item[key];
      if (String(value).startsWith(searchKeyword)) {
        regularSearchResults.push(item);
        break; }}});return regularSearchResults
}
Copy the code

Note that we used startsWith instead of indexOf to test if the object value matches the search term, ** this is because the dictionary tree has the defect of only matching words that begin with the search term! ** For example, when you search for a, you can only match Apple and Anna, but not Banana. For comparison purposes, we have to use startsWith

Performance comparison

The performance comparisons are interesting:

  • When the amount of data is small, there is no big difference in search efficiency
  • When you have a lot of data, like 5,000, when you have a very short search term, likea, the search efficiency of dictionary tree will be lower than traversal search, that is, it will take longer; When the search term becomes specific, for exampleali, dictionary tree search efficiency will be higher than traversal search

It’s not hard to see why: when you search for simple terms, you visit fewer leaf nodes, so you can only scan children to collect all possible ids of the child nodes, which takes up most of the traversal process

But we still need to satisfy this part of the query requirements, so we need to make some optimizations for this scenario

Optimize short search scenarios

Recalling the simple search scenario, the performance bottleneck is that we need to traverse all the children under the leaf node. Easy to do. Since there will be no change after the tree is built, we just need to calculate all the sub-ids of each leaf node in advance, which is the second type of optimization scheme mentioned at the beginning of the article, namely predictive calculation.

I wrote a new method to recursively add the ids of all its children to each leaf node:

function decorateWithChildrenIds(root) {
  const { children } = root;
  root.childrenIds = collectChildrenInsideIds(root.children);
  for (const character in children) {
    constcharacterLeaf = children[character]; characterLeaf.childrenIds = collectChildrenInsideIds( characterLeaf.children ); decorateWithChildrenIds(characterLeaf); }}Copy the code

After you build the tree, use this method to “decorate” all the leaf nodes

conclusion

After passing the prediction calculation, in the case of 5000 pieces of data, the search efficiency of dictionary tree is basically about 1ms regardless of short search or long search, while the conventional traversal search is about 10ms, which is indeed a ten-fold improvement. But the cost of this upgrade is based on the sacrifice of space, and the time spent calculating ahead of time. It is believed that if the data structure becomes more complex, the efficiency gains will be more obvious

The source code for this article is github.com/hh54188/sea…

Finally, I leave a problem for you: when the amount of data to be searched is larger, such as 1000, dictionary tree search results and traversal search results will occasionally be inconsistent, but when the amount of data to be searched is larger, such as 5000, then this “problem” will be stable. It’s not exactly a bug, but what’s the problem?


This article is also published in my Zhihu column front-end technology Hitchhiker’s Guide, welcome to subscribe