Author Information: Entropy Simple Technology's Airworks team. The team is committed to building a high-performance, low-code platform for integrated big-data analysis, providing organizations' data teams and business departments with an "all-in-one intelligent data link" platform and solutions spanning data cleaning, data fusion, data modeling, data visualization, data services, and other dimensions, to fully help customers achieve business intelligence.

Takeaway: In Entropy Simple Technology's data middle-platform solution, the file system serves as the storage foundation of the entire architecture. It provides operation and management for resources such as databases and data tables, ensures that the platform's various resources can be called flexibly and efficiently, and provides the underlying model management for data governance, resource security, data warehouse construction, and the upper application layers.

1. File system requirements

Entropy Simple has long been exploring a reliable, high-performance file system for the Airworks platform. This article introduces the file system scheme in detail from three aspects: requirements, comparison of technical schemes, and architecture design.

A data middle platform is the core of intelligent data solutions for asset management institutions, whose data sources are often heterogeneous and whose data volumes are huge. Introducing a data middle platform connects the data resources of departments and systems across the enterprise and establishes the data standard system and data index system. It integrates data services, unifies external data service interfaces, provides enterprise-level data asset management, realizes data lineage tracking through metadata management, and offers a visual, drag-and-drop self-service development and analysis platform to unify the data development process and project lifecycle management.

As the basis of the asset management data middle platform, the file system manages resources such as data tables, databases, data query scripts, ETL workflows, data models, and interface services, supporting reading, writing, moving, copying, and deleting them. It maintains a personal space for each account and supports a public collaborative enterprise directory, so that the platform remains highly flexible in how resources are used and each functional module's resource invocation stays clear and reasonable. Based on the permission system, different accounts can be given different permissions on each resource. The file system is therefore the underlying foundation of applications including data governance, resource security, data warehouse construction, projects, workflows, and data models. Its requirements include the following:

Point 1: The file system is a tree structure in which folders can contain sub-files and sub-folders; in addition, non-folder files may themselves contain an internal tree structure.

Point 2: Multiple file types must be supported. Different file types support different operations, but common operations such as create and delete must work for all of them.

Point 3: Different file types may need different filters, and the file list may show different display fields depending on the file type.

Point 4: The permission system must be integrated with the file system.

1.1 Overall design of file system architecture

The file system requirements consist mainly of the following parts:

First: the database design of the file system's tree structure and its create, read, update, and delete operations;

Second: the framework design of the file system, so that each file type can define its own behavior for the various operations;

Third: the interface that returns the file list must live in the file system and support different display fields and filters per file type. We therefore abstract the tree structure and common methods out of the various file types, and split requests into common external requests and personalized requests for a particular file type, as shown in the figure below:

2. File system-database solution

Most SQL databases have no direct support for tree data structures, so some additional design is required. Chapter 3 of SQL Antipatterns describes several common database designs for tree structures, two of which are the most representative:

(1) Adjacency list: each tree node gets a parent_id foreign key pointing to its parent;

(2) Closure table: a mapping table stores one row for every ancestor-descendant relationship.

The nested set scheme also presented in the book is difficult to maintain and inefficient, and is now rarely used. For the adjacency list there are two optimizations, recursive query and path enumeration, so it can be improved incrementally when the plain adjacency list fails to meet performance requirements.

2.1 Adjacency list scheme

The storage form of the adjacency list is simple and direct: each child stores an association pointing to its parent, so you can follow the link forward to find a node's parent, or query in reverse to find all children of a given parent, as shown in the corresponding figure from SQL Antipatterns:
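In Django, the Node model used by the code in this section might be declared roughly like this (a sketch; everything beyond the name, update_time, and parent fields actually referenced later is our assumption):

```python
from django.db import models

class Node(models.Model):
    # adjacency list: each node stores only a link to its parent
    name = models.CharField(max_length=255)
    update_time = models.DateTimeField(auto_now=True)
    parent = models.ForeignKey(
        'self', null=True, blank=True,
        on_delete=models.CASCADE, related_name='children')
```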

  • The efficiency trap

With this scheme, adding, deleting, or modifying a node is easy: each node is associated only with its parent, so only the parent field needs to change. The problem lies in queries. Fetching direct children is easy; fetching all descendants of a node is harder, since the code needs recursion, and might be written as follows:

def get_sons(self, root):
    return {'name': root.name, 'update_time': root.update_time.timestamp(),
            'sons': [self.get_sons(x) for x in Node.objects.filter(parent=root)]}

This code looks fine at first glance: it pulls all files recursively. But as the number of files grows, so does the running time. In our system, once a single user's file count reached four or five hundred, this code already took about 500 ms. The surge comes from writing the query as a depth-first search, which makes the number of database requests proportional to the number of nodes in the tree. As the data grows, the number of database requests climbs rapidly, which can be reproduced with a unit test.

from django.db import connection
from django.test import TestCase, override_settings

class TreeTest(TestCase):

    def setUp(self) -> None:
        super(TreeTest, self).setUp()
        self.root_node = Node.objects.create(name='root')
        for i in range(3):
            parent = self.root_node
            for j in range(3):
                parent = Node.objects.create(name='sons_{}_{}'.format(i, j), parent=parent)

    @override_settings(DEBUG=True)
    def test_scan(self):
        self.get_sons(self.root_node)
        for item in connection.queries:
            print(item)

In this test case we first create a root node with a three-layer structure beneath it, three nodes per layer, then use the get_sons function to fetch the whole tree, and finally print every database query via the framework's built-in connection.queries. Running it shows a total of 10 SELECT queries (including the root node), exactly the number of nodes. If the file tree you need has four or five hundred nodes, you will hit the SQL database four or five hundred times, which is very time-consuming.

  • The improved scheme

The improvement is to reduce the number of database queries by turning the depth-first search into a breadth-first search, which makes the number of SQL queries proportional to the depth of the tree instead of the number of nodes. Since the files are all created by users, a directory tree 10-20 layers deep is already close to the practical limit, so this meets the performance requirements.
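A minimal sketch of the breadth-first version (function and parameter names here are ours; in Django, fetch_children would be a single Node.objects.filter(parent_id__in=parent_ids) query per level):

```python
def get_tree_bfs(root_id, root_name, fetch_children):
    """Build the tree level by level. `fetch_children(parent_ids)` returns
    (id, parent_id, name) rows for all children of the given parents, so
    it is called once per tree level rather than once per node."""
    tree = {'name': root_name, 'sons': []}
    node_map = {root_id: tree}      # node id -> the dict built for it
    frontier = [root_id]            # ids of the current level
    while frontier:
        next_frontier = []
        for child_id, parent_id, name in fetch_children(frontier):
            son = {'name': name, 'sons': []}
            node_map[parent_id]['sons'].append(son)
            node_map[child_id] = son
            next_frontier.append(child_id)
        frontier = next_frontier
    return tree
```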

Running the same test case again, the database is requested only four times, which matches the depth of the tree and greatly reduces the time wasted on database requests.

2.2 Closure table

The closure table adds an association table to the data structure: every pair of nodes with an ancestor-descendant relationship gets a record in the table. It is also useful to record the distance between the two nodes (a node's distance to itself defaults to 0) for easier querying.

As can be seen from the figure, the closure table materializes the association between every pair of nodes that have an ancestor-descendant relationship; nodes without a common ancestor share no association. In the adjacency list, finding the root of the tree containing a given node requires recursion, but in the closure table the operation is trivial: a single database request finds all nodes associated with it. Correspondingly, insert, delete, and move become more complex. Adding, deleting, and modifying a closure table is not as simple as with the adjacency list and requires some extra steps:

(1) Adding a node: its associations with all of its ancestors must be added in a batch;

(2) Deleting a node: the node and all of its descendants must be deleted. In the Django framework, the association rows can be removed automatically through foreign key constraints;

(3) Moving a node: its associations must be handled carefully. Treat the node and all its descendants as a set, delete every association between the set and the outside, then rebuild all of their associations with the new ancestors.

Note that when deleting a node we cannot delete only its association rows; the node itself must be deleted too. Otherwise it becomes an orphan node that can never be found through the associations, and extra cleanup code would have to be written to deal with it.

  • Code design

When writing closure table code, you can follow Django's recommended style and encapsulate these complex operations in the Model layer (a custom Manager), so that NodeRelation is never touched directly by external code. This guarantees the data is modified in only one place, which makes the code easy to extend and errors easy to track down. Without further ado, here is the code; create and delete are relatively simple:
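For reference, the models assumed by the manager code below might look like this sketch (the related_name on the descendant foreign key must match the get_tree query shown later; the remaining details are our assumptions):

```python
from django.db import models

class Node(models.Model):
    name = models.CharField(max_length=255)

class NodeRelation(models.Model):
    # one row per ancestor-descendant pair, including (self, self, level=0)
    ancestor = models.ForeignKey(
        Node, on_delete=models.CASCADE, related_name='ancestor')
    descendant = models.ForeignKey(
        Node, on_delete=models.CASCADE, related_name='descendant')
    level = models.IntegerField()   # distance between the two nodes

    class Meta:
        unique_together = ('ancestor', 'descendant')
```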

class NodeManager(models.Manager):

    def create_with_relation(self, parent_id=None, **kwargs):
        node = self.create(**kwargs)
        relations = [] if parent_id is None else [NodeRelation(
            ancestor_id=x.ancestor_id, descendant=node, level=x.level + 1
        ) for x in NodeRelation.objects.filter(descendant=parent_id)]
        relations.append(NodeRelation(ancestor=node, descendant=node, level=0))
        NodeRelation.objects.bulk_create(relations)
        return node

    def delete_with_relation(self, node_id):
        node_ids = [x['descendant_id'] for x in NodeRelation.objects.filter(
            ancestor_id=node_id).values('descendant_id')]
        self.filter(id__in=node_ids).delete()

Fetching the whole tree under a root node takes two requests: first fetch all descendant nodes, then fetch all relations among them with distance 1, and build the tree structure from that association information:

class NodeManager(models.Manager):

    def get_tree(self, node_id):
        node_dict = defaultdict(lambda: {'name': '', 'sons': []})
        for node in Node.objects.filter(descendant__ancestor_id=node_id):
            node_dict[node.id]['name'] = node.name
        for relation in NodeRelation.objects.filter(level=1, descendant_id__in=node_dict.keys()):
            node_dict[relation.ancestor_id]['sons'].append(node_dict[relation.descendant_id])
        return node_dict[node_id]

The most complicated part is the move operation, which must delete the old associations before adding the new ones. The alternative is to do it directly in SQL, but such statements get extremely complex and unreadable, so we do not recommend it.
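To make the bookkeeping concrete, here is the move logic modeled on plain (ancestor, descendant, level) tuples rather than NodeRelation rows (a sketch; in Django each step would become a filter/exclude delete plus a bulk_create on NodeRelation):

```python
def move_subtree(relations, node, new_parent):
    """Move `node` and its whole subtree under `new_parent` in a closure
    table held as a set of (ancestor, descendant, level) tuples."""
    # the moved set: node itself plus all of its descendants
    subtree = {d for a, d, l in relations if a == node}
    # 1. delete links coming from old ancestors outside the subtree
    kept = {(a, d, l) for a, d, l in relations
            if a in subtree or d not in subtree}
    # 2. rebuild: Cartesian product of new_parent's ancestor chain
    #    (including new_parent itself) with every node of the subtree
    ups = [(a, l) for a, d, l in kept if d == new_parent]
    downs = [(d, l) for a, d, l in kept if a == node]
    added = {(a, d, up + down + 1) for a, up in ups for d, down in downs}
    return kept | added
```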

  • The efficiency trap

Closure tables look like the better storage scheme for trees, with better performance both at fetching the entire subtree rooted at any node and at fetching all of a node's ancestors.

However, there are two obvious problems with closure tables:

In terms of space, the worst case requires on the order of n^2 rows. Everyday trees are far from that limit, but the closure table still uses considerably more extra space than an adjacency list.

The other, fatal problem is that moving nodes involves a Cartesian product: moving a subtree with N nodes under a target at depth D creates on the order of D * N associations. Moving a directory with tens of thousands of sub-files under a directory five levels deep touches hundreds of thousands of association rows; if such moves are frequent, the database will be overwhelmed.

2.3 Adjacency list optimization

If your project already uses adjacency lists, or you cannot afford the costs of the closure table, is there a better way? Yes: the following two schemes can serve as improvements and optimizations of the adjacency list.

  • Recursive query

Recursive queries here do not mean recursion in application code, but the recursive query operations supported by some databases. Oracle has long supported them, but even Alibaba's Taobao teams have moved away from Oracle, so Oracle is not a panacea. On the MySQL side, recursive queries via the WITH clause are supported starting with MySQL 8.0.

With recursive queries, the original problem becomes simple: the recursion that used to live in code is expressed directly in the SQL statement. You can refer to:

[Stack Overflow: MySQL recursive query](stackoverflow.com/questions/2…)
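For example, a recursive CTE pulls a whole subtree in one statement. MySQL 8.0 uses the same WITH RECURSIVE syntax; the sketch below demonstrates it with the stdlib sqlite3 module (which also supports WITH RECURSIVE), and the table and column names are illustrative:

```python
import sqlite3

# an in-memory adjacency-list table: root -> a, b; a -> a1 -> a11
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO node VALUES
        (1, NULL, 'root'),
        (2, 1, 'a'), (3, 1, 'b'),
        (4, 2, 'a1'), (5, 4, 'a11');
""")

SUBTREE_SQL = """
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM node WHERE id = ?
        UNION ALL
        SELECT n.id, n.name
        FROM node n JOIN subtree s ON n.parent_id = s.id
    )
    SELECT id, name FROM subtree
"""

# one round trip fetches node 2 and every descendant
rows = conn.execute(SUBTREE_SQL, (2,)).fetchall()
```

In Django the same statement could be issued through connection.cursor() or Node.objects.raw().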

But what if your MySQL is 5.6 or 5.7? One workaround is to limit the depth of the file system so the query can be written with a bounded number of JOIN operations, but that is inelegant and inefficient with so many joins. Another approach, also mentioned in the Stack Overflow thread above, is to emulate the recursion directly in SQL, but that does not work well either: efficiency drops sharply as the number of nodes grows, mainly because testing membership in the growing set is very inefficient.

In general, native database support is a must for recursive queries; otherwise they may not even beat a breadth-first search implemented in code. In our actual project, considering database support, we initially used breadth-first search to handle file tree lookups.

  • Path enumeration

This optimization is simple and intuitive: add a field on each node that records the node's path, much like a file path. To fetch the subtree rooted at a node, use the SQL string-prefix feature, i.e. a LIKE clause. For example, if the current node's path is 1/2/3, all of its descendants can be pulled in one query by matching paths that start with 1/2/3.
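A sketch of the prefix match on plain path strings (in Django this would be a Node.objects.filter(path__startswith=...) query, in raw SQL a LIKE clause); note the trailing separator, which keeps sibling paths like 1/2/34 from matching:

```python
def subtree_paths(all_paths, root_path):
    """Return root_path plus every path beneath it. Matching the raw
    prefix '1/2/3' alone would wrongly include a sibling like '1/2/34',
    so the prefix used for descendants must end with the separator."""
    prefix = root_path + '/'
    return [p for p in all_paths if p == root_path or p.startswith(prefix)]
```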

This is somewhat similar to the closure table scheme: moving a node requires updating the path information of all nodes in the subtree, an update whose cost is proportional to the number of nodes moved. Inserting or deleting, however, requires no path updates on other nodes.

As an upgrade of the adjacency list, path enumeration essentially trades move-operation time for faster subtree pulls. It is recommended as an alternative when recursive queries cannot be used. Since the closure table can also coexist with the adjacency list, path enumeration should be weighed against the closure table and chosen according to the project's actual situation.

3. File system architecture design

The main purpose of the framework design here is to separate the file system-related operations from the custom operations for each file type, and to easily add file types without modifying the file system-related code, which increases the ease of use and extensibility of the file system.

3.1 Splitting modules with callbacks

From the earlier requirement analysis, file system operations may affect the contents of specific files. The overall flow runs roughly in the following sequence (taking file creation as an example):

In this flow, a request first enters the file system, which creates the file and maintains its relationships. It then passes the newly created file model, together with all request parameters, to the module for that specific file type. After processing, the type module returns the data that should be shown in the file list back to the file system, which updates the database and responds to the client, closing the loop.
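The sequence above can be sketched as follows (every name here is illustrative, not the platform's real API):

```python
def create_file(file_type, name, params, repository):
    # 1. the file system creates the node and maintains the tree
    record = {'name': name, 'type': file_type}
    # 2-3. the type-specific module processes the request parameters
    callback = repository[file_type]['create']
    extra = callback(record, params)
    # 4. merge the fields the type module wants shown in the file list
    record.update(extra or {})
    # 5. the file system would now persist `record` and answer the client
    return record

# a toy repository with one registered file type
repo = {'table': {'create': lambda rec, p: {'row_count': p.get('rows', 0)}}}
```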

3.2 Automatic registration of callback functions

To make file types easy to extend, we create an abstract base class for callbacks; each type module defines its callbacks and registers them with the file system, so the file system can invoke the corresponding callback based on the file type the client operates on.

To register callbacks automatically, the callback base class is defined with Python's metaclass feature, so that any class inheriting from the base class is automatically registered in the callback repository. Here is the metaclass code:

from abc import ABCMeta

class CallbackMeta(ABCMeta):

    def __new__(mcs, what, bases=None, attrs=None):
        new_class = super().__new__(mcs, what, bases, attrs)
        if what != "FileCallback":  # skip the abstract base itself
            file_type = new_class.file_type
            for callback_type in ("create", "update", "delete", "dump", "load"):
                CallbackRepository.register(
                    file_type,
                    callback_type,
                    getattr(new_class, callback_type),
                    getattr(new_class, "batch_{}".format(callback_type)),
                )
        return new_class

Finally, when a callback needs to be invoked, it can be looked up and called directly through the callback repository.
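A minimal runnable variant of the idea, simplified to register the class itself per file type (CallbackRepository and the batch callbacks from the snippet above are left out):

```python
from abc import ABCMeta, abstractmethod

REGISTRY = {}  # file_type -> callback class, filled in automatically

class CallbackMeta(ABCMeta):

    def __new__(mcs, what, bases=None, attrs=None):
        new_class = super().__new__(mcs, what, bases, attrs)
        if what != "FileCallback":  # skip the abstract base itself
            REGISTRY[new_class.file_type] = new_class
        return new_class

class FileCallback(metaclass=CallbackMeta):

    @abstractmethod
    def create(self, node, params): ...

class TableCallback(FileCallback):
    file_type = "table"

    def create(self, node, params):
        return {"columns": params.get("columns", [])}

# dispatch: the file system only needs the client's file type
handler = REGISTRY["table"]()
```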

4. Endnotes

Tree-structure design problems come up when designing systems for files, categories, roles, and so on. In my previous work I once ran into code with nine levels of nested IF statements. Which design to use depends on the actual requirements and usage scenarios, which can be divided into several cases:

(1) If you never need to pull the whole tree and only fetch the direct children of the current parent each time (as in the front end of many file systems), the adjacency list is optimal and every operation is fast enough;

(2) If you frequently pull whole trees but rarely move nodes, or moves are mostly performed by the system or background jobs (as in a hierarchical classification system), the closure table is probably the best choice. Sometimes an existing adjacency list simply no longer meets performance requirements; you can then choose recursive queries or path enumeration depending on your database, or migrate directly to a closure table structure.

In the end, whatever solution you choose, the architecture will evolve with the requirements. If early on you can pick a solution that both meets the requirements and leaves room for upgrades, later refactoring will be all the more elegant.