GraphQL is a data query language proposed by Facebook. Its core features are data aggregation and on-demand querying, and it is widely used between front ends and back ends to let clients consume data flexibly. This article introduces a different GraphQL practice: we push GraphQL down below the back-end BFF layer and combine it with metadata techniques to achieve on-demand querying and execution of both data and processing logic. This not only lets the back-end BFF layer use data flexibly, but also allows the processing logic behind those fields to be reused directly, which greatly improves development efficiency. The approach described here has been rolled out in several Meituan business scenarios with good results, and we hope these experiences are helpful to you.

1. The origin of BFF

The term BFF comes from Sam Newman's blog post Pattern: Backends For Frontends and refers to a back end that serves the front end. What problem does BFF solve? According to the article, with the rise of the mobile Internet, server-side functionality originally built for the desktop Web was expected to serve mobile apps as well, but this ran into several problems:

  • The UI of mobile apps differs from that of the desktop Web.
  • Mobile apps span multiple ends, not only iOS but also Android, and their UIs differ as well.
  • Considerable coupling already existed between the original back-end functionality and the desktop Web UI.

Because the ends differ, server-side functionality has to be adapted and tailored to each of them, yet the server itself offers a single, uniform set of capabilities. This creates a contradiction between the uniformity of the server and the divergent demands of the ends. How can it be resolved? As the article's subtitle, "single-purpose Edge Services for UIs and external parties", suggests: by introducing a BFF that performs the adaptation for each differing end. This is now a widely used pattern in the industry.

In real business practice, many factors, both technical and business, produce such end differences. For example: whether the user's client is Android or iOS, a large screen or a small one, which app version it runs; which industry the business belongs to, what form the product takes, which scene the feature appears in, which user group it targets, and so on. All of these lead to differences in the functional logic of each end.

On this issue, our team's product display business has plenty of firsthand experience. For the same product business, the consumer-facing display logic is deeply influenced by commodity type, industry, transaction form, place of fulfillment, target user group, and other factors. Meanwhile, frequent iteration of consumer-facing features intensifies this tension, turning it into a contradiction between the uniform stability of the server and the diverse, fast-changing demands of the ends. This is why BFF business systems inevitably exist. This article mainly describes problems and solutions in the context of Meituan's in-store product display scenarios.

2. Core contradictions in the context of BFF

The BFF layer is introduced to resolve the contradiction between the uniform stability of the server and the diverse, flexible demands of the ends. The contradiction does not disappear; it is transferred, from between the back end and the front end to between the BFF and the front end. Our team's main job is to manage this conflict. Taking a concrete business scenario as an example, and combining it with current business characteristics, the following describes the specific problems we face under the BFF production mode. The figure below shows group-buy shelf display modules from two different industries. We treat them as two product display scenes: two sets of independently defined product logic that iterate separately.

In the early days of the business there were not many such scenes. The BFF layer was built in a "chimney" (siloed) fashion, and features were developed and shipped quickly to meet business demands, so the contradiction was not obvious. As the business and the industries it serves grew, many such product display features accumulated, and the contradiction gradually intensified, mainly in two respects:

  • Business support efficiency: as product display scenarios multiplied, APIs tended to explode. Support efficiency scaled linearly with headcount, and the system could not keep up with the scale expansion of business scenarios.
  • High system complexity: core features kept iterating, internal logic was riddled with if...else branches, and the code was written procedurally, making the system complex and hard to modify and maintain.

So how did these problems arise? They are best understood in light of the "chimney" style of system construction, the business context of the product display scene, and the characteristics of the system itself.

Feature 1: Many external dependencies, differing data sources between scenes, and high demands on user experience

The figure shows group-buy shelf modules from two different industries. First, even for such a seemingly small module, the BFF layer has to call more than 20 downstream services to collect all the data. Second, the two scenes depend on different sets of data sources, and such differences are common: a foot-massage group-buy shelf needs a data source that the beauty shelf does not, and vice versa. Third, despite the heavy reliance on downstream services, the consumer-side user experience must still be guaranteed.

These characteristics pose real technical problems. 1) The scope of aggregation is hard to control: should aggregation be built per scene or centrally? Building per scene means repeatedly writing similar aggregation logic for different scenes; building centrally means a large, all-encompassing aggregation that inevitably makes useless calls. 2) The complexity of the aggregation logic must be controlled: with so many data sources, we must consider not only how to write the business logic but also how to orchestrate the asynchronous calls. If code complexity is not kept in check, later changes to the aggregation become very difficult.

Feature 2: Diverse display logic, differences between scenes, and coupling between common and scene-specific logic

We can clearly see commonalities in the logic of a given class of scenes, such as those that display group-buy deal information. Intuitively they all display one-dimensional information about the deal, but that is only the surface. In fact the module-generation process contains many differences, for example:

  • Field-splicing differences: take the deal title on the two group-buy shelves above. Both are titles, but on the beauty shelf the display rule is [type] + deal title, while on the foot-massage shelf it is simply deal title.
  • Sorting and filtering differences: for the same list of deals, scene A sorts by sales volume in descending order while scene B sorts by price. Different scenes have different sorting logic.

There are many other differences in display logic. Superficially similar scenes actually differ internally in many ways, and handling these differences is the hard part. The most common approach is to route the logic by testing a condition field, like this:

if ("Beauty".equals(category)) {
  title = "[" + category + "]" + productTitle;
} else if ("Foot massage".equals(category)) {
  title = productTitle;
}

This approach implements the feature and reuses the common logic. But with a large number of scenes, many such branch conditions pile up on top of each other while features keep iterating; it is easy to imagine the system growing ever more complex and ever harder to modify and maintain.

Conclusion: at the BFF layer, product display scenes differ from one another. Early in the business, building each scene independently supported rapid trial and error, and the problems caused by these differences were not obvious. As the business grew, more and more scenes had to be built and operated, trending toward scale, and the business began demanding higher technical efficiency. Given many scenes with differences among them, how to expand to new scenes efficiently while keeping system complexity under control is the core problem we face.

3. BFF application pattern analysis

There are currently two main patterns in the industry for this kind of solution: the back-end BFF pattern and the front-end BFF pattern.

3.1 Back-end BFF mode

In the back-end BFF pattern, the BFF is owned by back-end engineers. Its most widespread incarnation today is a back-end BFF built on GraphQL: the back end encapsulates display fields as display services, which are orchestrated by GraphQL and then exposed to the front end, as shown below:

The biggest advantage of this pattern is that when a display field already exists, the back end does not need to care about the front end's differing requirements; on-demand querying is handled by GraphQL. This copes well with scenes that need different display fields: the front end simply queries data on demand via GraphQL, and the back end does not change. Meanwhile, GraphQL's orchestration and aggregated-query capabilities let the back end decompose logic into separate display services, which resolves some of the BFF layer's complexity.

However, this pattern still leaves several problems: display service granularity, data graph partitioning, and field proliferation. Let us look at each through a concrete case:

1) Display service granularity design issues

This pattern requires encapsulating the display logic and the data-fetching logic together in a single module called a display service, as shown above. Yet the relationship between display logic and data-fetching logic is in fact many-to-many.

Background: suppose there are two display services encapsulating queries for the product title and the product label respectively. Scenario: a PM now asks that in a certain scene the product title be displayed as "[type] + product title". The title splicing now depends on type data, which is already fetched inside the product label display service. Question: should the product title display service fetch the type data itself, or should the two display services be merged into one?

The problem just described is one of controlling display service granularity. One might suspect the example above fails because the display services are too small. But the flip side is that merging the two services inevitably creates redundancy. This is the difficulty of display service design, and the root cause is that display logic and data-fetching logic are inherently many-to-many, yet the design binds them together.

2) Data graph partitioning

With GraphQL, the data of multiple display services is aggregated into one GraphQL Schema, forming a data view. Whenever data is needed, as long as it is in the graph, it can be queried on demand via a Query. The question then is how this graph should be organized: one graph or several? Too large a graph brings complex data-relationship maintenance problems; too small a graph diminishes the value of the approach itself.

3) Intra-display-service complexity and model proliferation

As mentioned above, product titles have different splicing logic, and this is extremely common in product display. For example, for the same price field, industry A shows the discounted price while industry B shows the pre-discount price; for the same highlight field, industry C shows the service duration while industry D shows product characteristics. The question then is how to design the display model. Take the title field: should the model carry just one title field, or separate title and titleWithCategory fields? With the former, there is bound to be if...else logic inside the service to distinguish the splicing rules, which drives up the complexity inside the display service. With the latter, the display service's model fields will clearly proliferate.

Conclusion: the back-end BFF pattern resolves some back-end complexity and provides a reuse mechanism for display fields, but outstanding issues remain: display service granularity design, data graph partitioning, and complexity and field proliferation inside display services. Notable practitioners of this pattern include Facebook, Airbnb, eBay, iQiyi, Ctrip, and Qunar.

3.2 Front-end BFF mode

The front-end BFF pattern is described in the "And Autonomy" section of Sam Newman's article: the BFF is owned by the front-end team itself, as sketched below:

The rationale is that requirements deliverable by one team should not be split across two, since two teams carry higher communication and collaboration costs; in essence it converts a cross-team conflict into an intra-team one. The front end fully takes over BFF development, becomes self-sufficient in data querying, and greatly reduces front-end/back-end collaboration costs. But this pattern does not address the core issues we care about: how to handle complexity, how to handle differences, how to design display models, and so on. It also has preconditions and drawbacks: it requires a fairly complete front-end infrastructure, and front-end engineers must understand business logic in addition to rendering.

Conclusion: the front-end BFF pattern reduces cross-team collaboration costs and improves BFF development efficiency by letting the front end query and use data independently. The best-known practitioner of this pattern is Alibaba.

4. Information aggregation architecture design based on GraphQL and metadata

4.1 Overall Thinking

Having analyzed both the back-end BFF and front-end BFF patterns, we ultimately chose the back-end BFF pattern. The front-end BFF approach would greatly disrupt our current development model, demanding not only substantial front-end staffing but also a complete front-end infrastructure; its implementation cost is comparatively high.

Although the back-end GraphQL BFF pattern described above has problems in our specific scenario, it is broadly instructive, for instance in its ideas of display-field reuse and on-demand data querying. In the product display scenario, 80% of the work goes into aggregating and integrating data, and that part has strong reuse value, so information querying and aggregation is the main contradiction we face. Our idea, therefore, is to improve on the GraphQL + back-end BFF scheme so that both fetch logic and display logic can be accumulated, composed, and reused. The overall architecture is sketched below:

As the figure shows, the biggest difference from the traditional GraphQL BFF scheme is that we push GraphQL down into the data aggregation part. Since the data comes from the commodity domain, which is relatively stable, the size of the data graph is controllable and stable. Beyond that, the core of the architecture lies in three designs: 1) separating data fetching from display; 2) normalizing the query model; 3) a metadata-driven architecture.

Separating fetching from display solves the display service granularity problem and lets fetch logic and display logic be accumulated and reused independently. Normalizing the query model solves the proliferation of display fields. The metadata-driven architecture makes capabilities visible and automates the orchestration and execution of business components, so that business developers can focus on the business logic itself. The next sections introduce each design in turn.

4.2 Core Design

4.2.1 Separating data fetching from display

As noted, in the product display scenario display logic and fetch logic are many-to-many, yet the traditional GraphQL-based back-end BFF practice encapsulates them together; this is the root cause of the display service granularity problem. Consider what each kind of logic cares about: fetch logic cares about how to query and aggregate data, while display logic cares about how to process data into the required display fields. Their concerns differ, and binding them together only adds complexity to the display service. Our idea is therefore to separate them into distinct logical units, called fetch units and display units. With fetching separated from display, GraphQL is also pushed down to perform on-demand data aggregation, as shown below:

So what is the right encapsulation granularity for fetch and display logic? Neither too small nor too large. We have two core considerations: 1) reuse — display logic and fetch logic are reusable assets in the product display scenario, and we want them accumulated independently and used on demand; 2) simplicity, so that units stay easy to modify and maintain. Based on these, we define granularity as follows:

  • Fetch unit: encapsulates, as far as possible, exactly one external data source, while simplifying the model the data source returns. The resulting model is called the fetch model.
  • Display unit: encapsulates, as far as possible, the processing logic of exactly one display field.

The benefit of separation is simplicity plus composability — so how is composition achieved? Our approach is to describe the relationships between units with metadata and have a unified execution framework link them up at runtime based on that metadata; the detailed design is introduced in the sections that follow. Separating fetch from display, associating units via metadata, and composing calls at runtime keeps each logical unit simple while meeting the demand for reuse, which solves the display service granularity problem of the traditional scheme.
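As a rough illustration of this split, a fetch unit wraps exactly one data source and a display unit produces exactly one field from the fetch models. All names below are hypothetical sketches, not our framework's actual API:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical fetch unit: wraps exactly one external data source
// and returns a simplified fetch model asynchronously.
interface FetchUnit<T> {
    CompletableFuture<T> fetch(long productId);
}

// Hypothetical display unit: produces exactly one display field
// from the fetch models it declares (via metadata) that it needs.
interface DisplayUnit {
    String fieldName();                              // the display field this unit produces
    String render(Map<String, Object> fetchModels);  // fetch models keyed by fetch-unit name
}

// Example: the "[category] + title" variant of the product title.
class CategoryTitleUnit implements DisplayUnit {
    public String fieldName() { return "title"; }
    public String render(Map<String, Object> models) {
        String category = (String) models.get("categoryFetcher");
        String title = (String) models.get("titleFetcher");
        return "[" + category + "]" + title;
    }
}
```

The plain-title variant would be a second, independent DisplayUnit; metadata decides which one a given scene uses, so neither unit needs an if...else on the scene.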

4.2.2 Query Model Normalization

Through what interface are the display units' results exposed? Next we look at the query interface design.

1) Difficulties in query interface design

Common query interface design modes are as follows:

  • Strongly typed mode: the query interface returns a POJO; each query result maps to a concrete, business-meaningful field of the POJO.
  • Weakly typed mode: query results are returned as K-V pairs or JSON, with no explicit static fields.

Both modes are widely used in the industry, and each has clear pros and cons. The strongly typed mode is developer-friendly, but the business iterates constantly and the system keeps accumulating display units, so the DTO returned by the interface gains more and more fields; every new feature entails modifying the query model and bumping the JAR version. A JAR upgrade involves both the data provider and the data consumer, an obvious efficiency drag. Moreover, one can imagine the query model iterating until it contains hundreds or thousands of fields, which would be very hard to maintain.

The weakly typed mode compensates for those shortcomings but is very unfriendly to developers: during development they cannot see what model the query will return, yet programmers naturally prefer to understand logic by reading code rather than configuration and documentation. In truth both interface design patterns share a common flaw — a lack of abstraction. The next two parts introduce the abstraction behind our query model design and the framework capabilities that support it.

2) Query model normalized design

Back to the product display scene: a display field can have multiple implementations. The product title, for instance, has two: 1) product title; 2) [category] + product title. The relationship between a display field and its display logics is essentially abstract-to-concrete. Once this is recognized, the idea becomes clear: abstract the query model. The query model carries abstract display fields, and one display field corresponds to multiple display units, as shown below:

At the implementation level, the relationship between display fields and display units is likewise described with metadata. This design slows model proliferation but cannot eliminate it. Besides standard attributes that every product has, such as price, inventory, and sales volume, each product type usually has attributes of its own: an escape-room product, say, has a descriptive attribute like "party size". Abstracting such a field makes little sense, yet adding it to the query model as a standalone field bloats the model. For this class of problem our solution is to introduce extended attributes, which carry these non-standard special fields. A query model built from standard fields plus extended attributes solves the field proliferation problem.
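A minimal sketch of such a query model (field and method names here are hypothetical): stable standard fields for attributes every product has, plus an open-ended extension map for category-specific display fields.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical normalized query model: standard fields stay stable,
// while category-specific fields live in the extension map instead of
// proliferating as new top-level fields.
class ProductDisplayModel {
    String title;   // abstract display field; the concrete splicing rule
    String price;   // (e.g. "[category] + title") is chosen per scene via metadata
    String sales;

    private final Map<String, String> ext = new HashMap<>();

    void putExt(String key, String value) { ext.put(key, value); }
    String getExt(String key) { return ext.get(key); }
}
```

An escape-room scene would put its "party size" text under an extension key rather than forcing a new field onto every consumer of the model.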

4.2.3 Metadata-driven architecture

So far we have defined how to decompose business logic into units and how to design the query model, with metadata describing the relationships between them. The logic and models built to these definitions have strong reuse value and can be accumulated as business assets. But why use metadata to describe the relationships between business functions and models at all?

Introducing metadata serves two main purposes. 1) Automatic orchestration of code logic: with the associations between business logic described as metadata, the runtime can wire the logic together automatically, eliminating a great deal of hand-written orchestration code. 2) Visibility of business capabilities: the metadata itself describes what each piece of business logic provides, as in the two examples below:

Display of a deal's base price string, for example: 30 yuan. Display of a deal's original list price, for example: 100 yuan.

Once reported to the system, this metadata can be used to show the capabilities the system currently provides. Metadata describes the components and their relationships, and the framework resolves it to invoke the business components automatically, yielding the following metadata-driven architecture:

The overall architecture consists of three core components:

  • Business capabilities: standardized business logic units — fetch units, display units, and query models — which are the key reusable assets.
  • Metadata: describes the business capabilities (display units, fetch units) and the relationships between them, e.g. which data a display unit depends on and which display field it maps to.
  • Execution engine: consumes the metadata and schedules and executes the business logic accordingly.

Combining these three parts organically yields a metadata-driven architecture.

5. Optimization practices for GraphQL

5.1 Simplification

1) Direct use of GraphQL

Introducing GraphQL brings extra complexity in the form of concepts such as Schema and RuntimeWiring. Below is how development looks with the native graphql-java framework:

For developers who have never used GraphQL, these concepts add learning and comprehension cost, and they usually have nothing to do with the business domain. We only want GraphQL's on-demand query feature, yet GraphQL itself becomes a burden. Business developers should focus on the business logic itself. How do we solve this?

As the computer scientist David Wheeler famously said, "All problems in computer science can be solved by another level of indirection." There is no problem another layer cannot solve — in essence, something must take responsibility for the complexity. So we added an execution engine layer on top of native GraphQL whose goal is to hide GraphQL's complexity and let developers focus only on business logic.

2) Standardizing the data-fetching interface

First, data fetching needs simplifying. Both the native DataFetcher and DataLoader are highly abstract and lack business semantics. In query scenarios we observe that all queries fall into three patterns:

  • 1-to-1: query a single result by a single condition.
  • 1-to-N: query multiple results by a single condition.
  • N-to-N: the batch version of 1-to-1 or 1-to-N.

We therefore standardized the query interfaces; business developers judge which pattern a scene fits and pick the matching interface. The standardized design of the query interfaces is as follows:

Business developers choose the fetcher they need and specify the result type via generics. The 1-to-1 and 1-to-N cases are straightforward. We define N-to-N as a batch query interface for the "N+1" scenario: the batchSize field specifies the partition size and batchKey specifies the query key; the developer only supplies the parameters, and the framework handles the rest automatically. We also impose one constraint: the return value must be a CompletableFuture, so the aggregate query stays asynchronous along the entire link.
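The three patterns above might be expressed as interfaces like these. This is a sketch under assumptions — the interface and method names are illustrative, not our framework's actual signatures:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical 1-to-1 fetcher: one condition, one result.
interface OneToOneFetcher<Q, R> {
    CompletableFuture<R> fetch(Q query);
}

// Hypothetical 1-to-N fetcher: one condition, multiple results.
interface OneToManyFetcher<Q, R> {
    CompletableFuture<List<R>> fetch(Q query);
}

// Hypothetical batch fetcher for the "N+1" case: the framework splits
// the keys into chunks of batchSize, calls fetch per chunk, and merges
// the per-key results back into one map.
interface BatchFetcher<K, R> {
    int batchSize();                                     // chunk size per downstream call
    CompletableFuture<Map<K, R>> fetch(List<K> batchKeys);
}
```

Returning CompletableFuture everywhere is what lets the engine run all fetchers concurrently and keep the whole aggregation link asynchronous.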

3) Aggregation orchestration automation

Standardizing the fetch interface clarifies the semantics of data sources and lets developers pick what they need, simplifying business development. But after writing a Fetcher, a developer still had to go somewhere else to write the Schema, and then write the mapping between the Schema and the Fetcher. Developers enjoy writing code, not chasing configuration afterwards, and maintaining code and configuration in parallel multiplies the chances of error. Can these tedious steps be removed?

Schema and RuntimeWiring are, in essence, just descriptions of information — and the same information can be described another way. Our optimization is to add annotations during business development, describe the information with annotation metadata, and let the framework do the rest. The approach looks like this:
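For illustration, such an annotation might look like the following; the annotation name and its attributes are hypothetical, not the framework's actual API. The framework would scan annotated classes at startup and derive the Schema and RuntimeWiring from them instead of requiring hand-written configuration:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation carrying the metadata that would otherwise
// be written as Schema text plus RuntimeWiring configuration.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Fetcher {
    String name();          // field name to expose in the generated Schema
    String desc() default ""; // human-readable capability description
}

// The developer only writes the fetch logic and tags it; the framework
// reads the annotation reflectively and registers the component.
@Fetcher(name = "productTitle", desc = "base title of a deal")
class ProductTitleFetcher {
    // fetch logic elided
}
```

Because the metadata now lives next to the code, code and "configuration" can no longer drift apart.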

5.2 Performance Optimization

5.2.1 GraphQL Performance Problems

Although GraphQL is open source, Facebook open-sourced only the specification, not a solution; the graphql-java framework is a community contribution. Using open-source graphql-java as our on-demand query engine, we found a number of problems in practice: some stemmed from improper usage, others from graphql-java's own implementation. Typical problems we hit include:

  • CPU-intensive query parsing, including Schema parsing and Query parsing.
  • Latency when the query model is complex, especially when it contains large lists.
  • CPU cost of reflection-based model transformation.
  • DataLoader's layer-by-layer scheduling problem.

We therefore optimized and modified both our usage and the framework itself to address the problems listed above. This chapter focuses on our ideas for optimizing graphql-java.

5.2.2 GraphQL compilation optimization

1) Overview of GraphQL language principle

GraphQL is a query language designed for building client applications, with an intuitive, flexible syntax for describing their data requirements and interactions. GraphQL is a domain-specific language (DSL), and the graphql-java framework we use builds its parser with ANTLR 4, a Java-based language definition and recognition tool whose grammar notation is itself a meta-language. Their relationship is as follows:

Both the Schema and the Query accepted by the GraphQL execution engine are expressed in the language GraphQL defines. The execution engine cannot understand them directly; before execution they must be translated by the GraphQL compiler into document objects the engine understands. This compiler is written in Java, and experience shows that under high traffic, real-time interpretation makes this code a CPU hot spot and adds latency; the more complex the Schema or Query, the greater the performance cost.

2) Schema and Query compilation cache

The Schema expresses the data view and the shape of the fetch models; it is relatively stable and few in number — in our business scenario there is only one. Our approach is therefore to build the Schema-based GraphQL execution engine at startup and cache it as a singleton. Queries, by contrast, differ from scene to scene, so their parse results cannot be treated as a singleton; instead we implement the PreparsedDocumentProvider interface and cache each Query's compilation result keyed by the Query itself, as shown below:
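Stripped of graphql-java's types, the caching idea can be sketched as a compile cache keyed by the raw query string; in the real setup this role is played by a PreparsedDocumentProvider implementation registered on the GraphQL instance, and the document type D would be graphql-java's parsed document entry:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the Query compile cache: each distinct query string is
// compiled once, and later executions reuse the cached document.
class QueryDocumentCache<D> {
    private final Map<String, D> cache = new ConcurrentHashMap<>();
    private final Function<String, D> compiler; // parse + validate step

    QueryDocumentCache(Function<String, D> compiler) {
        this.compiler = compiler;
    }

    D getDocument(String query) {
        // computeIfAbsent guarantees at most one compilation per key
        return cache.computeIfAbsent(query, compiler);
    }
}
```

Since the set of distinct Queries per service is small and fixed by the scenes it serves, the cache stays bounded in practice.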

5.2.3 GraphQL execution engine optimization

1) GraphQL execution mechanism and problems

Let’s first take a look at how the GraphQL-Java execution engine works. Assuming AsyncExecutionStrategy is chosen as the execution strategy, the execution process of the GraphQL execution engine is as follows:

AsyncExecutionStrategy’s execute method is the asynchronous implementation of the object execution strategy. It is the starting point of query execution and the entry point for querying the root node. AsyncExecutionStrategy queries the multiple fields of an object with a loop plus asynchrony. Starting from AsyncExecutionStrategy’s execute method, the GraphQL query process is as follows:

  1. Call the get method of the DataFetcher bound to the current field; if the field has no bound DataFetcher, query the field through the default PropertyDataFetcher, whose implementation reads the query field from the source object by reflection.
  2. Wrap the result obtained from the DataFetcher as a CompletableFuture; if the result is already a CompletableFuture, it is not wrapped again.
  3. When the result’s CompletableFuture completes, call completeValue on it, branching by result type:
    • If the result is a list type, traverse the list and recursively call completeValue on each element.
    • If the result is an object type, execute the object, which brings us back to where we started: AsyncExecutionStrategy’s execute.
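The loop-plus-async pattern of the steps above can be sketched roughly as follows. This is not the engine's actual code: DataFetcher is reduced to a Supplier, and list/nested-object recursion is omitted, so only the "fetch each field asynchronously, then assemble" shape remains:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Rough sketch of AsyncExecutionStrategy: kick off one (possibly async)
// fetch per field, then combine the futures into one result object.
public class AsyncStrategySketch {
    public static CompletableFuture<Map<String, Object>> execute(
            Map<String, Supplier<Object>> fetchers) {
        Map<String, CompletableFuture<Object>> pending = new LinkedHashMap<>();
        // Steps 1+2: call each field's "DataFetcher" and wrap the result
        // in a CompletableFuture (an already-async result would be kept as-is).
        for (Map.Entry<String, Supplier<Object>> e : fetchers.entrySet()) {
            pending.put(e.getKey(), CompletableFuture.supplyAsync(e.getValue()));
        }
        // Step 3: when all fields complete, assemble the object result.
        return CompletableFuture.allOf(pending.values().toArray(new CompletableFuture[0]))
                .thenApply(v -> {
                    Map<String, Object> result = new LinkedHashMap<>();
                    pending.forEach((k, f) -> result.put(k, f.join()));
                    return result;
                });
    }
}
```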

This is the execution process of GraphQL. What problems does this process have? Below, following the order of the labels in the figure, are the problems GraphQL ran into in our business scenarios. They are not necessarily problems in other scenarios, and are offered for reference only:

Problem 1: PropertyDataFetcher CPU hotspot. PropertyDataFetcher is hot code throughout the query process, and its implementation leaves room for optimization; at runtime it becomes a CPU hotspot. (For details, see the commit and conversation on GitHub: github.com/graphql-jav…)

Problem 2: list computation takes time. Lists are computed in a loop; if a query result contains a large list, the loop adds significant latency to the overall query. For example, suppose a query result contains a list of 1,000 elements and each element takes 0.01ms to process: the total is 10ms, and under GraphQL's default execution these 10ms block the whole link.

2) Type conversion optimization

The GraphQL model obtained through the GraphQL query engine is isomorphic to the fetching model returned by the business DataFetcher implementation, but the types of all fields are converted to GraphQL internal types. The reason PropertyDataFetcher becomes a CPU hotspot lies in this model transformation. The transformation from the business-defined model to the GraphQL type model is shown in the following figure:
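For reference, the per-field reflective read that PropertyDataFetcher performs can be sketched like this (a simplification: the real class also handles map sources and direct field access; the Item class here is a hypothetical demo model). Even with the getter lookup cached, one reflective invoke per field adds up quickly when a result model has thousands of fields:

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified PropertyDataFetcher: read a named property from the source
// object by reflection. Done once per field per query, this is where
// the CPU goes when a result model has many fields.
public class ReflectivePropertyFetcher {
    // Cache "Class#property" -> getter Method to avoid repeated lookups.
    private static final Map<String, Method> GETTER_CACHE = new ConcurrentHashMap<>();

    public static Object fetch(Object source, String property) {
        String key = source.getClass().getName() + "#" + property;
        Method getter = GETTER_CACHE.computeIfAbsent(key, k -> {
            try {
                String name = "get" + Character.toUpperCase(property.charAt(0))
                        + property.substring(1);
                return source.getClass().getMethod(name);
            } catch (NoSuchMethodException e) {
                throw new IllegalStateException(e);
            }
        });
        try {
            return getter.invoke(source);   // reflective read, per field, per query
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical business model used only for illustration.
    public static class Item {
        private final String title;
        public Item(String title) { this.title = title; }
        public String getTitle() { return title; }
    }
}
```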

When the query result model has many fields, say tens of thousands, each query implies tens of thousands of PropertyDataFetcher operations, which in practice shows up as a CPU hotspot. Our solution to this problem is to keep the original business model unchanged and populate the results of non-PropertyDataFetcher queries back onto the business model, as shown in the following schematic diagram:

Based on this idea, the result we obtain through the GraphQL execution engine is the object model returned by the business Fetcher, which not only resolves the CPU hotspot caused by reflective field transformation but is also friendlier for business development: the GraphQL model resembles a JSON model, lacks business types, and is cumbersome to use directly in business code. The optimization was tested in a pilot scenario; results show the scenario's average response time shortened by 1.457ms, the average 99th percentile shortened by 5.82ms, and average CPU utilization reduced by about 12%.

3) List calculation optimization

When a list has many elements, the default single-threaded traversal of element computation is very costly, and optimizing this latency matters for response-time-sensitive scenarios. Our solution is to make full use of the CPU's multi-core computing capability: split the list into tasks and execute the tasks in parallel across threads. The implementation mechanism is as follows:
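The split-and-parallelize idea can be sketched with CompletableFuture over list partitions. This is a sketch, not the engine's code: the chunk size is a tunable assumption, and in the real engine this logic hooks into completeValue for list types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Sketch: split a large list into chunks and complete each chunk's
// elements on a separate task, instead of the default single-threaded loop.
public class ParallelListCompletion {
    public static <T, R> List<R> completeList(List<T> items, int chunkSize,
                                              Function<T, R> completeValue) {
        List<CompletableFuture<List<R>>> tasks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            List<T> chunk = items.subList(from, Math.min(from + chunkSize, items.size()));
            tasks.add(CompletableFuture.supplyAsync(
                    () -> chunk.stream().map(completeValue).toList()));
        }
        List<R> result = new ArrayList<>(items.size());
        for (CompletableFuture<List<R>> t : tasks) {
            result.addAll(t.join());   // join in task order to preserve list ordering
        }
        return result;
    }
}
```

With 1,000 elements at 0.01ms each, splitting into chunks lets the 10ms of sequential work overlap across cores instead of blocking the link end to end.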

5.2.4 Scheduling optimization of GraphQL-Dataloader

1) DataLoader fundamentals

A DataLoader has two methods: load and dispatch. In an N+1 problem scenario, DataLoader is used as follows:

The whole is divided into two phases: in the first phase, load is called N times; in the second phase, dispatch is called. Only when dispatch is called is the data query actually executed, which achieves the effect of batched querying.
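The two-phase contract can be sketched as a minimal batching loader. The real org.dataloader.DataLoader used by graphql-java has the same load/dispatch shape; this stripped-down version just shows that load only queues a key, and one dispatch serves all N keys with a single batch call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Minimal DataLoader: load() only queues a key and returns a future;
// dispatch() runs one batch query for all queued keys and completes them.
public class MiniDataLoader<K, V> {
    private final Function<List<K>, Map<K, V>> batchFn;
    private final List<K> queuedKeys = new ArrayList<>();
    private final List<CompletableFuture<V>> queuedFutures = new ArrayList<>();

    public MiniDataLoader(Function<List<K>, Map<K, V>> batchFn) {
        this.batchFn = batchFn;
    }

    // Phase 1: called N times; no I/O happens here.
    public CompletableFuture<V> load(K key) {
        CompletableFuture<V> f = new CompletableFuture<>();
        queuedKeys.add(key);
        queuedFutures.add(f);
        return f;
    }

    // Phase 2: one batched query serves all N queued keys.
    public void dispatch() {
        Map<K, V> values = batchFn.apply(new ArrayList<>(queuedKeys));
        for (int i = 0; i < queuedKeys.size(); i++) {
            queuedFutures.get(i).complete(values.get(queuedKeys.get(i)));
        }
        queuedKeys.clear();
        queuedFutures.clear();
    }
}
```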

2) DataLoader scheduling problem

GraphQL-Java's integrated DataLoader support is implemented in FieldLevelTrackingApproach. What problems does the FieldLevelTrackingApproach implementation cause? The figure below illustrates the problems of the native DataLoader scheduling mechanism:

The problem is obvious: under the FieldLevelTrackingApproach implementation, the dispatch of the next level's DataLoaders must wait until all results of the current level have come back. With this implementation, the total query time is TOTAL = MAX(level 1 latency) + MAX(level 2 latency) + MAX(level 3 latency) + …, i.e. the sum of the maximum latencies of each level. If instead the link orchestration were written by the business developers themselves, the theoretically achievable total is the latency of the longest chain among all chains, which is what one would expect. The behavior of FieldLevelTrackingApproach runs counter to this; our current understanding is that its designers chose it for simplicity and generality.
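A small calculation makes the gap concrete. Suppose (hypothetical numbers) two independent two-level chains, A→B with latencies 10ms then 50ms, and C→D with 40ms then 5ms. Level-synchronized dispatch costs MAX(10, 40) + MAX(50, 5) = 90ms, while letting each chain proceed independently costs MAX(10+50, 40+5) = 60ms:

```java
// Level-by-level dispatch (FieldLevelTrackingApproach behavior) vs.
// per-chain dispatch (the theoretically optimal orchestration).
public class DispatchLatency {
    // TOTAL = sum over levels of MAX(level latency):
    // each level waits for the slowest loader of that level.
    public static long levelSynchronized(long[][] chains) {
        int levels = chains[0].length;
        long total = 0;
        for (int level = 0; level < levels; level++) {
            long max = 0;
            for (long[] chain : chains) max = Math.max(max, chain[level]);
            total += max;
        }
        return total;
    }

    // TOTAL = MAX over chains of (sum of that chain's latencies):
    // each chain proceeds without waiting for its siblings.
    public static long perChain(long[][] chains) {
        long total = 0;
        for (long[] chain : chains) {
            long sum = 0;
            for (long step : chain) sum += step;
            total = Math.max(total, sum);
        }
        return total;
    }
}
```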

The above implementation is unacceptable in some business scenarios, such as our list scenario, where the total response time budget is under 100ms and dozens of milliseconds are lost for this reason alone. One way to solve it is to orchestrate the latency-sensitive scenarios independently, without GraphQL; another is to solve the problem at the GraphQL level and preserve architectural unity. Next, let's see how we extended the GraphQL-Java execution engine to solve it.

3) DataLoader scheduling optimization

The way to solve the DataLoader scheduling performance problem is to call the dispatch method immediately after the last load of a DataLoader is called. But how do we know which load is the last? This is exactly the hard part of the DataLoader scheduling problem. Here is an example to explain our solution:

Suppose the model structure we query is as follows: the root node is a field under Query named subjects; subjects is a list with two elements, both object instances of ModelA; ModelA has two fields, fieldA and fieldB; the fieldA of subjects[0] is associated with one ModelB instance, and the fieldB of subjects[0] is associated with multiple ModelC instances.

To make this easier to follow, we define the concepts of field, field instance, field instance execution, field instance value size, and so on:

  • Field: has a unique path, is static, and is unrelated to the size of runtime objects. Examples: subjects and subjects/fieldA.
  • Field instance: an instance of a field, with a unique path; it is dynamic and depends on the size of runtime objects. For example, subjects[0]/fieldA and subjects[1]/fieldA are both instances of the field subjects/fieldA.
  • Field instance execution: all object instances associated with the field instance have been executed by GraphQL.
  • Field instance value size: the number of object instances referenced by the field instance. In the example above, the value size of subjects[0]/fieldA is 1 and that of subjects[0]/fieldB is 3.

In addition to the above definitions, our business scenario meets the following criteria:

  • There is only one root node, and the root node is a list.
  • A DataLoader must belong to some field, and the number of times a field's DataLoader should be executed equals the number of object instances under that field.

Based on the above information, we can draw the following analysis:

  • When a field instance is executed, we know its value size, which equals the number of times the field's associated DataLoader needs to call load under the current instance; so when load executes, we can tell whether the current object instance is the last object of the field instance it belongs to.
  • An object instance may hang under different field instances, so being the last object instance of its own field instance does not make it the last of all object instances; it is the last if and only if the field instance it belongs to is itself the last instance of that field.
  • We can infer the number of field instances from field instance value sizes. For example, knowing that the size of subjects is 2, we know the field subjects has two field instances, subjects[0] and subjects[1], and likewise that the field subjects/fieldA has two instances, subjects[0]/fieldA and subjects[1]/fieldA. So, starting from the root node, we can infer whether a field instance has finished executing.

Through the above analysis, we can conclude the condition for triggering dispatch: the field instance the object belongs to, and all parent field instances of that field, have finished executing, and the currently executed object instance is the last object instance of its field instance. Based on this logic, our implementation checks on every DataFetcher call whether to dispatch, and dispatches if so. One caveat: if the current object instance is not the last, but all the remaining sibling instances have size 0, the load of the DataLoader associated with them will never be triggered; therefore, whenever an instance's size is 0, the check must be run again.
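The "dispatch after the last load" rule can be sketched with a per-field-instance countdown. This is an illustrative simplification of the idea, not our production code: the field paths are hypothetical, parent-instance tracking is omitted, and the size-0 re-check is reduced to an immediate check:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of early dispatch: when a field instance starts executing we learn
// how many load() calls its DataLoader will receive; the call that brings
// the counter to zero is the last one and triggers dispatch immediately,
// instead of waiting for the whole level to finish.
public class EarlyDispatchTracker {
    private final Map<String, AtomicInteger> remainingLoads = new ConcurrentHashMap<>();
    final AtomicInteger dispatchCount = new AtomicInteger();

    // Called when a field instance (e.g. "subjects[0]/fieldB") starts
    // executing and its value size becomes known.
    public void onFieldInstanceStart(String fieldInstancePath, int size) {
        if (size == 0) {
            // Size 0 means load() will never fire for this instance:
            // run the check again so pending siblings are not blocked forever.
            dispatch();
            return;
        }
        remainingLoads.put(fieldInstancePath, new AtomicInteger(size));
    }

    // Called on every DataLoader.load() under the field instance.
    public void onLoad(String fieldInstancePath) {
        if (remainingLoads.get(fieldInstancePath).decrementAndGet() == 0) {
            dispatch();   // this was the last load of the instance
        }
    }

    private void dispatch() {
        dispatchCount.incrementAndGet();   // stands in for DataLoader.dispatch()
    }
}
```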

Following this logic, we optimized the DataLoader call link and achieved the theoretically optimal effect.

6 The impact of the new architecture on the R&D model

Productive forces determine relations of production. The metadata-driven information aggregation architecture is the core productive force in building display scenarios, while the business development model and process are the relations of production, which change accordingly. We discuss the new architecture's impact on R&D in terms of development patterns and processes.

6.1 Focused Business Development Mode

The new architecture provides a set of standardized code decomposition constraints based on business abstractions. Whereas previously a developer's understanding of the system was likely "query the services, stitch the data together", developers now share a consistent understanding of the business and of how code is decomposed: the display unit represents display logic, and the fetching unit represents data-fetching logic. At the same time, much scattered and error-prone logic is now shielded by the framework, so developers can focus on the business logic itself: understanding and encapsulating business data, understanding and writing display logic, and abstracting and building the query model. As shown in the following schematic diagram:

6.2 Upgrading the R&D process

The new architecture affects not only how R&D is coded but also the R&D process itself. With the visualization and configuration capabilities of the metadata architecture, the new R&D process differs significantly from the old one, as shown in the figure below:

In the past, under the "one pole poked to the bottom" development mode, building each display scenario required the whole process from interface communication to API development. Under the new architecture, the system natively supports multi-layer reuse, visualization, and configuration.

Case 1: this is the best situation; the data access and display functions are already in place. All developers need to do is create a query plan on the operations platform, selecting the needed display units on demand, and then query the required display information through the query interface using the query plan ID. The visualization and configuration interface is shown in the diagram below:

Case 2: the display function may not exist yet, but the operations platform shows that the data source is already connected. This is not difficult: just write a piece of processing logic on top of the existing data source. This processing logic is pure logic and pleasant to write. The data source list is shown in the following schematic diagram:

Case 3: the worst case is that the system cannot satisfy the current query requirements. This situation is rare, because the back-end services are relatively stable, so there is no need to panic: just bring the data source in according to the access standard, then write the processing logic on top of it. These capabilities can be reused afterwards.

7 Summary

The complexity of commodity display scenarios shows in several aspects: many scenarios, many dependencies, much logic, and differences between scenarios. Against this background, for an early-stage business there is little doubt that fast, "chimney-style" bespoke construction is the way to go. But as the business develops, functions keep iterating, and scenarios grow in scale, the drawbacks of chimney-style bespoke construction gradually emerge, including high code complexity and a lack of capability accumulation.

Based on an analysis of the core contradictions faced by Meituan's in-store commodity display scenarios, this article has introduced:

  • The BFF application modes found in the industry, and the pros and cons of each.
  • The design of an improved, metadata-driven architecture based on the GraphQL BFF pattern.
  • The problems we encountered in GraphQL practice, and their solutions.
  • The impact of the new architecture on the R&D model.

At present, the core commodity display scenarios of the author's team have been migrated to the new architecture. Under the new R&D model, we have achieved reuse of more than 50% of display logic and more than doubled efficiency. I hope this article is helpful to you.

8 References

  • [1] samnewman.io/patterns/ar…
  • [2] www.thoughtworks.com/cn/radar/te…
  • [3] To understand the back-end systems of e-commerce, this article is enough
  • [4] Framework definition – Baidu Encyclopedia
  • [5] Efficient R&D – Xianyu's exploration and practice in data aggregation
  • [6] System Architecture – Product Design and Development for Complex Systems

9 Recruitment Information

Meituan's in-store comprehensive R&D center is recruiting front-end, back-end, data warehouse, and machine learning/data mining algorithm engineers on a long-term basis, based in Shanghai. Interested candidates are welcome to send a resume to [email protected] (subject line: Meituan In-store Comprehensive R&D Center — Shanghai).


This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, provided you note "Reprinted from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to apply for authorization.