GraphQL is often described as a "good thing, but hard to land," and many teams that try it go from "getting started" straight to "giving up." The fact that we have kept GraphQL running in our project for over a year is an interesting story of technical decisions, engineering management, and code.

In this article, I will tell the story in chronological order:

  • Enlightenment: why we needed GraphQL
  • Prototype phase: getting familiar with GraphQL and its community
  • Honeymoon phase: applying GraphQL to internal systems and reaping quick results
  • Integration phase: building a GraphQL gateway that runs in the browser
  • Pain points: setbacks on both the front end and the back end
  • Landing phase: seizing the opportunity to do real GraphQL

Why do we need GraphQL

Before we get to the “why,” let’s talk about who “we” are.

We are the front-end team at SmartX. As in many companies, the push to introduce GraphQL came from the front-end team. The difference is that, from the beginning, we did not want to impose any extra work on the back-end team. We will talk more about how we did this later, but it was undoubtedly an important foundation for our success.

So why did we need it? In January 2019, GraphQL appeared on the whiteboard for the first time during a team discussion. At the time, no one in the group knew much about GraphQL; we only brought it up because we wanted to solve the complicated data-stitching problem in our front-end project, and our impression was that GraphQL was a technology good at stitching data together.

How complex is data stitching in our front-end project? You can get a feel for it from this schematic, which is not entirely realistic but comes close in complexity:

A seemingly simple UI uses multiple service APIs, and those APIs may be developed by several teams, with noticeable differences in format, semantics, and error handling. So a "specification that can stitch data together and fetch it on demand" became our initial goal.

Beyond data stitching: GraphQL's engineering value

As we worked through the subsequent phases, we quickly realized that GraphQL could provide much more engineering value than simply reducing the number of network requests and getting data on demand.

Strongly typed interface definitions avoid inconsistency between the front and back ends

The GraphQL specification requires both the front end and the back end to statically declare their interfaces and request structures. For example, the back end provides a User type and declares that it contains two fields, name and email.

If the front end queries User and mistakenly requests an age field that does not exist, we can detect the error at the static stage (compile, lint, etc.).

Similarly, if the back end introduces a breaking change in a release, renaming the email field to email_address, any request that uses user.email in the front-end project will also surface the problem at the static stage, instead of waiting for it to be exposed by tests at run time.

These problems can also be avoided through human communication and joint debugging, but the efficiency and reliability are obviously not on the same level.

Moreover, if the front-end project uses a strongly typed language such as TypeScript, type declarations can be generated from the GraphQL schema through the toolchain, making the entire data-request chain type-safe with no extra maintenance cost.
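As a rough sketch of what that looks like (the fields are the same hypothetical User example, and the exact output depends on the code-generation tool used), the generated declarations might be:

// Schema on the server (SDL):
//   type User { name: String! email: String! }
//   type Query { users: [User!]! }

// Declarations a codegen tool might emit for the front end:
interface User {
  name: string;
  email: string;
}

interface UsersQueryResult {
  users: User[];
}

// Hypothetical request helper typed against the generated result shape.
declare function runUsersQuery(): Promise<UsersQueryResult>;

async function showEmails(): Promise<void> {
  const { users } = await runUsersQuery();
  // Requesting a non-existent field (age), or keeping user.email after the
  // back end renames it to email_address, now fails at compile/lint time.
  users.forEach((u) => console.log(u.email));
}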

Reactive cascade caching

Data management in the UI is a more complex topic that deserves a separate article. Here I’ll try to use a simple example to show why it’s easier to implement a nice client cache based on GraphQL.

The figure above shows a tab bar, a User table, and a User form. If we change Adam's email address from the form, we expect the API to send the latest email address to the back end, and we expect both the table and the form to display Adam's email address as the updated value.

If we want the UI to update automatically (rather than through additional business logic), we need a cache with the following properties:

  1. Reactive. The cache needs to know both that Adam's email address belongs to the User with ID 1 and that a table and a form are "depending" on this data. When the data changes, the cache "tells" the UIs that depend on it to re-render and update to the latest state. When a UI no longer "depends" on the data (for example, after switching to the Todo tab the User table is no longer displayed), changes to the data no longer trigger rendering.
  2. Cascading. The data used by the UI may come from different entry points, such as an API for fetching all users, an API for fetching a single user, and an API for fetching the creator of a Todo. But no matter which entry point the data came through, the cache can trace it back, layer by layer, to the User with ID 1 and correlate them together.

GraphQL makes this easier to achieve because:

  • The relationships between data are fully declared in the static schema.
  • Responses can be traced down to the field level.
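As a minimal illustration of the idea (the fragment shape and the ID are hypothetical; the API shown is Apollo Client's public writeFragment), updating the normalized entry for the User with ID 1 is enough to refresh every piece of UI that depends on it:

import { ApolloClient, InMemoryCache, gql } from "@apollo/client";

const client = new ApolloClient({ cache: new InMemoryCache() });

// The table (all users) and the form (single user) both read from the
// same normalized cache entry, keyed roughly as "User:1".
const USER_FIELDS = gql`
  fragment UserFields on User {
    id
    email
  }
`;

// After the update mutation succeeds, writing the new email into "User:1"
// re-renders every watched query that still depends on that entry.
client.writeFragment({
  id: "User:1",
  fragment: USER_FIELDS,
  data: { __typename: "User", id: "1", email: "adam@example.com" },
});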

The benefits are all front-end, and most of the work is back-end

As you can see from the description above, a lot of the engineering value is in the front-end’s ability to better organize code based on GraphQL, but the bulk of the work to provide a GraphQL API is on the back end.

We recognized this fact early on and made a decision: GraphQL's optimization of network requests (request count, data volume) was not the most important value for us. We could implement a GraphQL gateway that runs in the browser, inside the front-end project, and still get the other benefits of GraphQL. This approach could proceed independently, without support from the back end, and could be quickly converted into a standalone NodeJS service once mature enough to enjoy all the benefits of GraphQL.

But we still had a lot of research to do before we could get there.

Get results quickly

After the 2019 Spring Festival, we officially started investigating GraphQL. To speed up the research, we carried out two tracks of work in parallel:

  1. Get familiar with GraphQL itself and its community.
  2. Use GraphQL in less critical but sufficiently complex projects.

GraphQL community

Although GraphQL was originally open sourced by Facebook, it has evolved into a typical community-driven project. Community-led projects tend to have the following characteristics:

  • Directions change frequently; new star projects appear every so often, but projects also frequently stop being maintained or are abandoned outright.
  • There is little incentive to fix long-standing design flaws, which often only get resolved in the next major-version rewrite.

To make the most of the community's efforts while avoiding "pitfalls," we first categorized the projects and their maintainers. Functionally, we divided the projects into the following layers:

  • The specification level. The GraphQL Spec is the core, language-independent description of GraphQL's standard behavior. It is maintained by the GraphQL Working Group, releases a new version every year or two, and is very stable; when we evaluate other projects, we judge them by how faithfully they implement the Spec.
  • The implementation level. Many programming languages now implement the GraphQL Spec. graphql-js, the JavaScript implementation, is officially maintained and currently has essentially no substitute.
  • The application level. In contrast to the stability of the two levels above, the community produces an endless stream of application-level projects, covering back-end servers, front-end clients, database/ORM wrappers, and many other directions.
  • The toolchain level. There are also many projects here, covering linting, compilation, testing, and so on.

During the research, every time we came across a new project we placed it at the corresponding level and compared it with other projects at the same level with the same function. Comparing this way made it much easier to see each project's strengths and weaknesses.

On the other hand, we also formed a rough profile of the maintainers behind the projects. Taking a few prominent maintainers as examples:

  • Apollo is an important member of the community and has its own commercial GraphQL product. It maintains many projects at the application and toolchain levels, but lacks the motivation to fix long-standing design defects in them.
  • Prisma, like Apollo, has its own commercial products. By contrast, Prisma is more aggressive and keeps launching new versions and new projects to replace its own older ones.
  • The Guild is a community-organized group that is very active and maintains several high-quality projects.

By familiarizing ourselves with each maintainer's style and reading each project's long-standing issues, we could know in advance which "pits" existed in each project and roughly when they might be fixed. Combined with our own adoption plan, this greatly reduced the risk of actual use later on.

Practice in an internal system

Like many companies, SmartX needs some internal systems. Most of them take the form of a web service plus a web UI and do not involve our core low-level technology (virtualization, distributed storage), so they are developed by the front-end team.

Compared with the product mainline, these internal systems are an excellent testing ground for verifying a new technology stack, because:

  • The delivery cycle is more flexible, with some room for trial and error.
  • There are fewer automated tests, code review focuses more on overall design, and code is integrated faster.
  • There is no need to worry too much about upgrades, compatibility, and so on.
  • There is real business logic, and in some systems it is quite complex, which makes it easier to verify how the technology stack performs in various scenarios.

Since the first half of 2019, GraphQL has been adopted as the API specification for all of our internal systems. The most notable result is that we completed the development of an after-sales system in two months with 1.5 people, with satisfying results in both development efficiency and user experience.

After this round of work on internal systems, by June 2019 we had accomplished two things:

  1. We were familiar with the state of the GraphQL community and had picked out a few projects suited to our usage scenarios for long-term investment.
  2. We had two team members who were deeply familiar with GraphQL.

Data Layer: GraphQL gateway in the browser

In May 2019, armed with the experience accumulated on internal systems, we returned to the product line's front-end project and began to design a pure front-end GraphQL solution code-named Data Layer.

The GraphQL Spec never restricts the runtime environment of the server side, so it is perfectly possible to run a GraphQL server inside a browser.

How GraphQL is executed

Take a GraphQL Schema as an example:

# [] means an array; ! means the value cannot be null

type Query {
  users: [User!]!
}

type User {
  name: String!
  posts: [Post!]!
}

type Post {
  title: String!
}

If we wanted to query the names of all users and the titles of the articles each of them wrote, we could issue a query like this:

query {
  users {
    name
    posts {
      title
    }
  }
}

According to the GraphQL spec, the server will process the query in the following order:

  1. The entry point is query, which corresponds to type Query in the schema.
  2. Resolve query.users next; "resolving" means executing the resolver function defined on the server for the corresponding schema node.
  3. Continue resolving downwards: query.users.name, query.users.posts, and query.users.posts.title.

Each schema node's resolver function receives the return value of its parent node, the variables passed in the query, the current context, and some extra information as parameters, and returns a result that conforms to the schema's type definition for the child nodes to continue with. After all nodes have been executed, the complete result is returned to the client.

Resolver functions are very flexible. For example, we might perform a database query in query.users, issue an HTTP request in query.users.posts, and read from the file system in query.users.posts.title, as long as each return value matches the GraphQL schema definition.
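As a minimal sketch (the data sources and the user.id field are hypothetical), resolvers for the schema above could look like this in the resolver-map style used by libraries such as graphql-tools, following the standard (parent, args, context, info) signature:

// A hypothetical helper that reads a post title from the file system.
declare function readTitleFromFile(path: string): Promise<string>;

const resolvers = {
  Query: {
    // A database query in query.users...
    users: (_parent: unknown, _args: unknown, context: { db: { findAllUsers(): unknown[] } }) =>
      context.db.findAllUsers(),
  },
  User: {
    // ...an HTTP request in query.users.posts...
    posts: async (user: { id: string }) => {
      const res = await fetch(`/api/users/${user.id}/posts`);
      return res.json();
    },
  },
  Post: {
    // ...and a file-system read in query.users.posts.title, as long as the
    // return value matches the schema type (String! here).
    title: (post: { titlePath: string }) => readTitleFromFile(post.titlePath),
  },
};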

Data Layer design

The Data Layer runs in the browser, and we want the front-end UI code to interact with it through the standard GraphQL API, so that we can migrate the Data Layer into a separate NodeJS service at any time. Therefore, the design of the Data Layer must always meet the following requirements:

  1. Use only standard JavaScript APIs and no browser-specific APIs, so that it is guaranteed to work in the NodeJS runtime.
  2. Communication between the UI and the Data Layer must be serializable, rather than passing data directly through memory, to preserve the ability to turn it into a remote call.

When the data layer runs in the browser, the overall shape is as shown in the figure below:

When a GraphQL request is executed in the Data Layer, the corresponding resolver functions issue HTTP requests to obtain data from the existing back-end services.
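A minimal sketch of this setup using graphql-js (the REST endpoint is hypothetical, and the response is assumed to already contain each user's name and posts):

import { buildSchema, graphql } from "graphql";

// The schema the UI talks to; the UI never needs to know whether it is
// served from the browser or from a NodeJS process.
const schema = buildSchema(`
  type Query { users: [User!]! }
  type User { name: String! posts: [Post!]! }
  type Post { title: String! }
`);

// Root resolvers fetch from the existing back-end services over HTTP.
const rootValue = {
  users: async () => {
    const res = await fetch("/api/v1/users"); // hypothetical REST endpoint
    return res.json();
  },
};

// The UI sends a plain GraphQL query string (serializable), and the
// Data Layer executes it locally with graphql-js.
export function executeInBrowser(source: string) {
  return graphql({ schema, source, rootValue });
}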

With this implementation, our front-end project already gets all the other benefits of GraphQL, except that there are no savings in request count or data volume compared with the previous RESTful APIs (because the Data Layer still issues roughly the same requests).

It is worth noting that even without the network-request advantage, the Data Layer helps our front-end code solve long-standing problems such as data-stitching abstraction, cache consistency, and correct UI updates. This shows that data tailoring is by no means the only value of GraphQL.

In our plan, once the Data Layer stably carries all of the data-request code in our UI, we can migrate it into a separate NodeJS service, which looks like this:

This brings further benefits:

  • Data tailoring reduces the number of requests and the data volume between the front end and the back end.
  • Overhead in the browser is reduced, improving the user experience.
  • The Data Layer sits closer to the data sources (back-end services), shortening overall response time.

Growing pains: predictable setbacks

By August 2019, GraphQL had been used in our internal systems for more than half a year, and the Data Layer had been applied in the product line's front-end project.

At this stage we started to run into problems. Fortunately, we had already learned about most of them during the earlier research, so although they were not easy to solve, we were not left without a clue.

Apollo Client cache management

The concept of "reactive cascading caching" mentioned above is not unique to us; the most popular client in the GraphQL community, Apollo Client, is built heavily around this kind of implementation.

However, in the Apollo Client v2 implementation this cache layer has always had a fatal flaw: the lack of an easy-to-use cache invalidation mechanism. The approaches described in the official documentation, such as refetching and writing to the cache, are hard to maintain once a project grows, and the issue had been discussed in the community for two years. The cache invalidation scheme presented in the Apollo Client v3 RC was still not a complete solution to the problem.
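For context, the invalidation primitives that Apollo Client v3 eventually shipped look roughly like this (a sketch only; the entity ID is illustrative, and this is not the solution we ended up building):

import { InMemoryCache } from "@apollo/client";

const cache = new InMemoryCache();

// Evict one field of a normalized entity, or the entire entity.
cache.evict({ id: "User:1", fieldName: "email" });
cache.evict({ id: "User:1" });

// Garbage-collect entries no longer reachable from any root query.
cache.gc();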

After half a year of iteration on our internal systems, this problem began to surface for us as well. We had two options:

  1. Wait for the officially promised v3 release in October 2019 to address the issue.
  2. Design and implement our own cache invalidation scheme on top of Apollo Client.

In the end, we chose to spend half a month thoroughly understanding the Apollo Client codebase, plus another week of design, to complete a cache invalidation solution with dependency tracing and automatic UI response. We could solve the problem faster than the community because:

  • We could design a solution that serves only our own usage scenarios, without worrying about backward or forward compatibility.
  • We could concentrate enough manpower on this one problem for a short period of time.

This process also made it clear that using an open-source solution does not mean you can only wait indefinitely for upstream to fix things; you can always strike out on your own.

In fact, we did not even use a forked version of Apollo Client; we just wrapped it in our upper-layer code and used a few private APIs to fix the problem.

GraphQL Schema also needs to be designed

While rolling out the Data Layer, we ran into another problem: how to design a "good" GraphQL schema.

Like any API, a GraphQL schema needs to be designed, and compared with, say, RESTful APIs, there is very little material on GraphQL best practices.

In our Data Layer, the data source behind the schema is not a flexible database query but a set of fairly complex HTTP APIs. So in day-to-day development, while exploring best practices, we had to trade off ease of use, request count, performance, and other aspects of the schema, and we applied fairly complex batching and caching strategies to optimize it. As a result, working on the Data Layer later became quite a specialized job.
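As an example of the kind of batching this involves (a sketch using the community's DataLoader library; the endpoint and response shape are hypothetical), per-user lookups in a resolver can be collapsed into a single HTTP request:

import DataLoader from "dataloader";

// Collect all post lookups triggered during one execution tick and resolve
// them with a single batched request against the existing REST API.
const postsByUserLoader = new DataLoader(async (userIds: readonly string[]) => {
  const res = await fetch(`/api/v1/posts?userIds=${userIds.join(",")}`);
  const postsByUser: Record<string, unknown[]> = await res.json();
  // DataLoader requires results in the same order as the requested keys.
  return userIds.map((id) => postsByUser[id] ?? []);
});

// In the User.posts resolver, N users on a page become one upstream request.
const userResolvers = {
  posts: (user: { id: string }) => postsByUserLoader.load(user.id),
};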

This is not strictly an issue that hinders development, but it does limit our productivity. This prompted us to make improvements in two areas in the next phase:

  1. Run the GraphQL gateway as a standalone back-end service, preferably with the ability to use a database as a data source.
  2. Borrow a set of GraphQL Schema design best practices from mature projects.

Seizing the opportunity to do real GraphQL

In the fourth quarter of 2019, while we were thinking about the Data Layer's next direction, a better opportunity suddenly appeared: a new management-side product would be added to the product line. Since the back-end engineers needed to focus on other, more challenging tasks, the front-end team was asked to take charge of both the front end and the web back end of this new management product.

This opportunity not only fit perfectly with the improvement directions we had summarized in the previous phase, but also gave us more room to maneuver. Over the following six months, we achieved the following:

  • The required data is cached in a database, and the GraphQL gateway runs as an independent back-end service, which completely solves the performance problems that were unavoidable in the in-browser Data Layer and significantly improves the user experience.
  • A complete and expressive GraphQL schema is built on top of the Prisma ORM, which reduces the cost of designing and implementing the API and improves the experience of using it (see the sketch after this list).
  • Type safety across the database, gateway, and UI chain exposes most problems at compile time. We also built code-generation tools so that some complex data-interaction UI components can be generated directly from the database schema.
  • Back-end engineers can devote themselves to low-level development work; they no longer need to design APIs for the UI, and a lot of joint debugging work is eliminated.
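As a rough sketch of what a database-backed gateway resolver looks like in this kind of setup (the model and relation names are hypothetical; the API shown is Prisma Client's generated query interface, which may differ from the exact version we used):

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// The gateway resolves directly against the database instead of upstream
// HTTP APIs, and the generated Prisma Client keeps the query typed end to end.
const resolvers = {
  Query: {
    users: () => prisma.user.findMany({ include: { posts: true } }),
  },
};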

In January 2019, not a single member of our front-end team knew anything about GraphQL.

By May 2020, every member of the management-product team was familiar with using GraphQL, and two members were proficient in its details.

Chance and luck played a part in this process, but none of this would have happened if we had just pushed the back-end team to help us with the GraphQL transformation from the start.

If you agree with our approach to engineering problems, welcome to join us.