Practice of GraphQL and metadata-driven architecture in back-end BFF

GraphQL is a data query language proposed by Facebook. Its core features are data aggregation and on-demand retrieval, and it is widely used between front end and back end to let clients use data flexibly. This article introduces a different practice: we push GraphQL down beneath the back-end BFF layer and combine it with metadata techniques, so that both data and processing logic are queried and executed on demand. This not only lets the back-end BFF layer use data flexibly; the field-processing logic can also be reused directly, which greatly improves R&D efficiency. The solutions introduced here have been applied in some business scenarios at Meituan with good results, and we hope the experience is helpful.

1 The origin of BFF

The term BFF comes from Sam Newman's blog post "Pattern: Backends For Frontends" and refers to a back end that serves the front end. What problem does BFF solve? According to the original post, with the rise of the mobile Internet, server-side functions originally built for the desktop Web were expected to serve mobile apps as well. This raised several problems:

Mobile App and desktop Web have differences in the UI part.
Mobile App involves different terminals, not only iOS, but also Android. There are differences between the UIs of these different terminals.
The original back-end functions are tightly coupled to the desktop Web UI.

Because the clients differ, server-side functions must be adapted and tailored for each client, yet the server's own business functions are relatively uniform. This creates a contradiction between the uniform business functions of the server and the differing demands of the clients. How can this be resolved? The answer is what the subtitle of the article describes, "Single-purpose Edge Services for UIs and external parties": introduce a BFF and let it adapt to the differences among clients. This pattern is now widely used in the industry.

Figure 1: BFF schematic

In actual business practice, these differences between clients have many causes, both technical and business-related. For example: whether the user's client is Android or iOS, whether the screen is large or small, and which version is installed. Or: which industry the business belongs to, what the product form is, in which scenario the function is placed, who the target user group is, and so on. All these factors lead to client-specific differences in functional logic.

On this issue, the product display business that the author's team is responsible for has some experience to share. For the same product business, the display logic on the consumer (C) side is deeply affected by factors such as product type, industry, transaction form, placement location, and target user group. At the same time, the frequent iteration typical of consumer-facing functions intensifies this contradiction, which becomes one between the uniformity and stability of the server and the flexibility of the clients. This is why a product display business system (the product display BFF) must exist. This article mainly introduces some of the problems and solutions in the product display scenarios of Meituan's in-store business.

2 The core contradiction in the context of BFF

The BFF layer is introduced to resolve the contradiction between the uniformity and stability of the server and the differing, flexible demands of the clients. The contradiction does not disappear; it is transferred: what used to lie between the back end and the front end now lies between the BFF and the front end. The main job of the author's team is to deal with this contradiction. Below we take a specific business scenario, together with current business characteristics, to illustrate the problems we face in the BFF production mode. The following figure shows two display modules of group-buying shelves in different industries. We treat them as two product display scenes: two independently defined sets of product logic that iterate separately.

Figure 2: Display scenes

In the early stage of business development there were few such scenarios. The BFF layer was built "chimney-style" (one silo per scenario) so that functions could be developed and launched quickly to meet business demands, and the contradiction was not obvious. As the business developed and expanded into more industries, many such product display functions accumulated and the contradiction gradually intensified, mainly in two respects:

Business support efficiency: As product display scenarios multiply, the number of APIs explodes, business support capacity grows only linearly with headcount, and the system struggles to support large-scale expansion of business scenarios.
High system complexity: Core functions keep iterating, the internal logic is riddled with if…else… branches, and the code is written procedurally. The system is complex and hard to modify and maintain.

So how did these problems arise? This should be understood in combination with the background of the "chimney" system construction, the business faced by the commodity display scene, and the characteristics of the system.

Feature 1: Many external dependencies, differences in access between scenes, high user experience requirements

The figure shows two group-buying shelf modules in different industries. First, for such a seemingly small module, the BFF layer has to call more than 20 downstream services to get all the data. Second, the two scenarios need different sets of data sources, and such differences are common: a data source required by the pedicure group-buying shelf may not be needed by the beauty group-buying shelf, and vice versa. Third, despite the heavy dependence on downstream services, the user experience on the C side must still be guaranteed.

These characteristics bring many technical problems. 1) The aggregation scope is hard to control: should aggregation be built per scene or in a unified way? If it is built per scene, similar aggregation logic is inevitably rewritten in different scenes. If it is built in a unified way, a large, all-inclusive aggregation inevitably makes unnecessary calls. 2) The complexity of the aggregation logic is hard to control: with so many data sources, one has to consider not only how to write the business logic but also how to orchestrate the asynchronous calls. If code complexity is not kept under control, later changes to the aggregation become difficult.
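To make the orchestration concern concrete, here is a minimal sketch of per-scene aggregation with the JDK's CompletableFuture. The class name `ShelfAggregator` and the source names used with it are hypothetical and are not the article's actual services; the sketch only shows that a scene should call just the sources it needs, and that the asynchronous join has to be written somewhere. It assumes every requested source has been registered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical aggregator: each downstream source is modeled as a Supplier.
final class ShelfAggregator {
    private final Map<String, Supplier<Object>> sources = new LinkedHashMap<>();

    void register(String name, Supplier<Object> source) {
        sources.put(name, source);
    }

    // Calls only the sources a scene asks for, in parallel, then joins the results.
    Map<String, Object> aggregate(List<String> needed) {
        Map<String, CompletableFuture<Object>> futures = new LinkedHashMap<>();
        for (String name : needed) {
            futures.put(name, CompletableFuture.supplyAsync(sources.get(name)));
        }
        // Wait until every requested source has answered.
        CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0])).join();
        Map<String, Object> result = new LinkedHashMap<>();
        futures.forEach((name, future) -> result.put(name, future.join()));
        return result;
    }
}
```

Even in this toy form, the list of needed sources is hand-maintained per scene, which is the kind of manual orchestration that becomes costly as scenes multiply.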

Feature 2: Many kinds of display logic, differences between scenes, coupling of common and scene-specific logic

We can clearly see that a given class of scenes shares common logic, for example the display scenes related to group-buy deals. Intuitively they all just display deal information along a single dimension, but that is only the surface. In fact there are many differences in how the module is generated, such as the following two:

Field splicing logic: For example, for the same underlying title, the display rule on the beauty group-buying shelf in the figure above is "[type] + group-buying title", while the rule on the pedicure group-buying shelf is just "group-buying title".
Sorting and filtering logic: For example, for the same list of group-buy deals, scene A sorts by sales volume and scene B sorts by price.

Display logic differs in many ways: scenes that look similar contain a lot of differing logic internally, and handling these differences in the back end is difficult. The most common approach is shown below: logic is routed by reading specific condition fields and branching on them.

// "丽人" = beauty, "足疗" = pedicure
if ("丽人".equals(category)) {
  title = "[" + category + "]" + productTitle;
} else if ("足疗".equals(category)) {
  title = productTitle;
}

This approach works functionally and can reuse the common logic. But with many scenarios, a great deal of differing judgment logic piles up, and the functions keep iterating. The system then grows ever more complex and ever harder to modify and maintain.

Summary: At the BFF layer, product display scenes differ from one another. In the early stage of business development, the system supported rapid trial and error by building each scene independently, and the problems caused by these differences were not obvious. As the business develops, more and more scenes need to be built and operated, and the business demands higher technical efficiency. With many scenes that differ from one another, how to expand scenes efficiently while keeping system complexity under control is the core problem in our business.

3 BFF application mode analysis

At present, the industry has mainly two modes for such solutions, one is the back-end BFF mode, and the other is the front-end BFF mode.

3.1 Backend BFF mode

In the back-end BFF mode, the BFF is owned by back-end engineers. The most widespread practice of this mode today is a back-end BFF built on GraphQL: the back end encapsulates display fields into display services, orchestrates them through GraphQL, and exposes them for the front end to use, as shown below:

Figure 3: Back-end BFF mode

The biggest feature and advantage of this mode is that when the display field already exists, the back end does not need to care about the front-end differential requirements, and the ability to query on demand is supported by GraphQL. This feature can well deal with the problem of display field differences in different scenarios. The front-end can directly query data based on GraphQL on demand, and the back-end does not need to be changed. At the same time, with the help of GraphQL's orchestration and aggregation query capabilities, the back-end can decompose logic into different display services, so the complexity of BFF can be resolved to a certain extent.
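The on-demand idea can be illustrated without GraphQL itself. The following plain-Java sketch is only an analogy, under the assumption that each display field has a resolver and the caller names the fields it wants, much like a GraphQL selection set; `OnDemandView` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical on-demand view over a product: field name -> resolver.
final class OnDemandView {
    private final Map<String, Function<Map<String, Object>, Object>> resolvers = new LinkedHashMap<>();

    OnDemandView field(String name, Function<Map<String, Object>, Object> resolver) {
        resolvers.put(name, resolver);
        return this;
    }

    // Resolves only the fields named in the selection; other resolvers never run.
    Map<String, Object> query(Map<String, Object> product, List<String> selection) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : selection) {
            Function<Map<String, Object>, Object> resolver = resolvers.get(name);
            if (resolver == null) {
                throw new IllegalArgumentException("Unknown field: " + name);
            }
            out.put(name, resolver.apply(product));
        }
        return out;
    }
}
```

A new scene that needs a different subset of existing fields changes only its selection, not the back-end code, which is the property this mode relies on.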

However, based on this model, there are still several problems: display service granularity issues, data graph division issues, and field diffusion issues. The following figure is a specific case based on the current model:

Figure 4: Back-end BFF mode (case)

1) Display service granularity design problem

This solution requires display logic and access (data-fetching) logic to be encapsulated together in one module, forming a display service (Presentation Service), as shown in the figure above. In fact, display logic and access logic are in a many-to-many relationship. Take the example mentioned earlier:

Background: There are two display services, which encapsulate the query capabilities for product titles and product labels respectively.
Scenario: The PM now requires that the product title in a certain scene be displayed as "[type] + product title". Splicing the title now depends on the type data, and that type data is already being fetched inside the product label display service.
Question: Should the product title display service fetch the type data itself, or should the two display services be merged?

This is a problem of controlling display service granularity. One might suspect that the display services in the example are simply too fine-grained. But look at it the other way around: if the two services are merged, redundancy is inevitable. This is the difficulty of display service design. The root cause is that display logic and access logic are in a many-to-many relationship, yet they are designed together.

2) Data graph division problem

GraphQL aggregates the data of multiple display services into one graph (a GraphQL Schema) to form a data view. When data is needed, as long as it is in the graph it can be queried on demand with a Query. So the question is: how should this graph be organized? As one graph or several? If the graph is too large, maintaining the data relationships becomes complex; if it is too small, the value of the solution itself is reduced.

3) Display service internal complexity + model diffusion problem

As mentioned above, one product title can have different splicing logic, and this is especially common in product display scenes. For example, for the price, industry A displays the discounted price while industry B displays the price before the discount; for the label position, industry C displays the service time while industry D displays the product characteristics. So the question is: how should the display model be designed? Take the title as an example: should the display model have a single title field, or two fields, title and titleWithCategory? With a single field, there must be if…else… logic inside the service to distinguish the variants of title, which makes the display service complex. With multiple fields, the model fields of the display service keep spreading.

Summary: The back-end BFF mode can reduce the complexity of back-end logic to some extent and provides a reuse mechanism for display fields. However, some issues remain unresolved, such as the granularity design of display services, the division of data graphs, and the internal complexity and field diffusion of display services. Representatives of this model currently include Facebook, Airbnb, eBay, iQiyi, Ctrip, Qunar, and others.

3.2 Front-end BFF mode

The front-end BFF mode has a special introduction in the "And Autonomy" part of Sam Newman's article, which means that BFF itself is the responsibility of the front-end team itself, as shown in the following diagram:

Figure 5: Front-end BFF mode

The idea of this model is that a requirement one team can deliver need not be split across two teams, since two teams incur greater communication and collaboration costs. In essence, it turns "contradictions between ourselves and the enemy" into "contradictions among the people". The front end takes over BFF development entirely, becomes self-sufficient in data queries, and greatly reduces the cost of front-end and back-end collaboration. But this model does not address some of the core issues we care about, such as how to handle complexity, how to handle differences, and how to design the display model. It also has prerequisites and drawbacks: it needs a fairly complete front-end infrastructure, and the front end must understand business logic in addition to rendering.

Summary: In the front-end BFF mode, the front end queries and uses data independently, which reduces the cost of cross-team collaboration and improves BFF R&D efficiency. The current representative of this model is Alibaba.

4 Information aggregation architecture design based on GraphQL and metadata

4.1 Overall thinking

After analyzing the back-end and front-end BFF models, we chose the back-end BFF model. The front-end BFF solution would have a large impact on our current R&D model: it requires substantial front-end resources as well as a complete front-end infrastructure, so its implementation cost is relatively high.

Although the back-end GraphQL BFF mode has the problems described above in our scenarios, it offers much of value overall, such as reuse of display fields and on-demand data queries. In the product display scenario, most of the work is concentrated on data aggregation and integration, and this part has strong reuse value, so querying and aggregating information is the main contradiction we face. Our idea is therefore to improve on the GraphQL + back-end BFF solution so that access logic and display logic can each be accumulated, combined, and reused. The overall architecture is shown in the following schematic diagram:

Figure 6: Improvement approach based on GraphQL BFF

As the figure shows, the biggest difference from the traditional GraphQL BFF solution is that we push GraphQL down to the data aggregation part. Since the data comes from the product domain, which is relatively stable, the scale of the data graph is controllable and relatively stable. In addition, the core design of the overall architecture includes three aspects: 1) separation of access and display; 2) query model normalization; 3) metadata-driven architecture.

Separating access from display solves the display service granularity problem and lets display logic and access logic be accumulated and reused. Normalizing the query model solves the problem of display field diffusion. The metadata-driven architecture makes capabilities visible and automates the orchestration and execution of business components, so that developers can focus on the business logic itself. The following sections introduce these three parts one by one.

4.2 Core design

4.2.1 Separation of access and display

As mentioned above, in the product display scenario display logic and access logic are in a many-to-many relationship, and the traditional GraphQL-based back-end BFF practice encapsulates them together. This is the root cause of the difficulty in designing display service granularity. Consider what each kind of logic is concerned with: access logic is concerned with how to query and aggregate data, while display logic is concerned with how to process data into the required display fields. Their concerns differ, and putting them together also increases the complexity of the display service. Our idea is therefore to separate access logic from display logic and encapsulate each into its own logical unit, called the access unit and the display unit. Once access is separated out, GraphQL is pushed down as well and is used to aggregate data on demand, as shown in the following figure:

Figure 7: Separation of access and display + metadata description

So what is the right encapsulation granularity for access and display logic? It can be neither too small nor too large. We have two core considerations in the granularity design: 1) Reuse: display logic and access logic are reusable assets in the product display scenario, and we want them to accumulate and be usable separately on demand. 2) Simplicity: units should stay simple so that they are easy to modify and maintain. Based on these two considerations, granularity is defined as follows:

Access unit: Encapsulate only one external data source where possible, and simplify the model returned by that data source. The resulting model is called the access model.
Display unit: Encapsulate the processing logic of only one display field where possible.

The advantage of separation is that the units are simple and can be combined. How is the combination achieved? Our idea is to describe the relationships between units with metadata, and to let a unified execution framework associate and run them based on that metadata. The specific design is introduced below. With access separated from display, associated through metadata, and combined at run time, each logic unit stays simple while meeting the need for reuse, which also solves the display service granularity problem of the traditional solution.
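The two kinds of unit might look roughly like the following sketch. The interface names, method names, and the access unit names "productBase" and "productCategory" are all assumptions made for illustration; the article does not show its real interfaces. The display unit implements the "[type] + title" rule from the beauty shelf example and declares, rather than performs, the data access it depends on.

```java
import java.util.Map;

// Hypothetical access unit: wraps exactly one external data source and
// returns a simplified access model (here just a Map).
interface AccessUnit {
    String name();
    Map<String, Object> fetch(long productId);
}

// Hypothetical display unit: produces exactly one display field from access models.
interface DisplayUnit {
    String field();          // the display field it produces
    String[] dependsOn();    // names of the access units it needs
    Object render(Map<String, Map<String, Object>> accessModels);
}

// The "[type] + title" splicing rule, isolated in its own unit.
final class CategoryTitleUnit implements DisplayUnit {
    public String field() {
        return "title";
    }

    public String[] dependsOn() {
        return new String[] {"productBase", "productCategory"};
    }

    public Object render(Map<String, Map<String, Object>> accessModels) {
        Object category = accessModels.get("productCategory").get("category");
        Object title = accessModels.get("productBase").get("title");
        return "[" + category + "]" + title;
    }
}
```

Because the unit only names its dependencies, a framework can decide which access units to run and when, and the same access unit can serve any number of display units.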

4.2.2 Query model unification

Through what kind of interface are the results of the display units exposed? Next we look at the design of the query interface.

1) Difficulties in query interface design

There are two design patterns for common query interfaces:

Strong Type Mode : The strong type mode refers to the POJO object returned by the query interface, and each query result corresponds to a clear field with specific business meaning in the POJO.
Weak Type Mode : Weak Type Mode means that the query result is returned in KV or JSON mode without clear static fields.

Both modes are widely used in the industry, and each has clear advantages and disadvantages. The strong type mode is friendly to developers, but the business keeps iterating and the display units accumulated in the system keep growing, so the DTO returned by the interface gains more and more fields. Every new function requires a change to the interface's query model and a new JAR version. Upgrading the JAR involves both the data provider and the data consumer, which is clearly inefficient. Moreover, after continuous iteration the query model may end up with hundreds or thousands of fields, which is hard to maintain.

The weak type mode makes up for this shortcoming, but it is very unfriendly to developers: during development they have no visibility into which query results the interface's query model contains, and programmers by nature prefer to understand logic through code rather than through configuration and documentation. In fact, both interface design patterns share one problem: a lack of abstraction. The next two sections introduce the abstraction we apply to the query model returned by the interface, and the framework capabilities that support it.

2) Query model normalization design

Back to the product display scenario: one display field can have many implementations. The product title, for example, has two: 1) product title; 2) [category] + product title. The relationship between the product title and these two display logics is essentially one of abstract to concrete. Once this is recognized the approach becomes clear: abstract the query model. The query model consists entirely of abstract display fields, and one display field corresponds to multiple display units, as shown in the following figure:

Figure 8: Query model normalization + metadata description

At the implementation level, the relationship between display fields and display units is also described with metadata. This design slows the growth of the model to some extent, but cannot prevent it entirely. For example, besides standard attributes such as price, inventory, and sales volume, different product types usually have attributes of their own: only escape-room themed game products, for instance, have a description attribute like "how many players share a session". Such a field has little meaning as an abstraction, and adding it as a separate field to the product query model would bloat the model. Our solution is to introduce extended attributes that carry such non-standard fields. Building the query model from standard fields plus extended attributes handles the field diffusion problem well.
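A query model of "standard fields + extended attributes" could be sketched as follows. `ProductView`, its field names, and the attribute key used in the usage note are hypothetical; the sketch only shows the shape of the idea: a small set of abstract, strongly typed fields plus one open-ended map for type-specific attributes.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical normalized query model: a few abstract standard fields
// plus an open-ended bag of extended attributes.
final class ProductView {
    private String title;   // abstract field; the concrete rule lives in a display unit
    private String price;
    private final Map<String, String> extAttrs = new LinkedHashMap<>();

    String getTitle() { return title; }
    void setTitle(String title) { this.title = title; }

    String getPrice() { return price; }
    void setPrice(String price) { this.price = price; }

    // Non-standard, type-specific attributes go here instead of becoming new fields.
    void putExtAttr(String key, String value) { extAttrs.put(key, value); }
    String getExtAttr(String key) { return extAttrs.get(key); }

    // Read-only view so callers cannot modify the model behind its back.
    Map<String, String> getExtAttrs() { return Collections.unmodifiableMap(extAttrs); }
}
```

For example, an escape-room product could carry `putExtAttr("playersPerSession", "4")` without the model gaining a field that is meaningless for every other product type. The trade-off is that extended attributes are weakly typed, so they are best reserved for fields with no cross-type meaning.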

4.2.3 Metadata Driven Architecture

So far, we have defined how to decompose business logic into units and how to design the query model, and mentioned that metadata describes the relationships between them. The business logic and models built on these definitions have strong reuse value and can be accumulated as business assets. So why use metadata to describe the relationships between business functions and models?

We introduce metadata descriptions for two main purposes: 1) Automatic orchestration of code logic: metadata describes the associations between pieces of business logic, and the runtime can execute them together based on that metadata, which eliminates a lot of hand-written orchestration code. 2) Visibility of business functions: the metadata itself describes the functions that the business logic provides, as in the following two examples:

The basic selling price of the group order is displayed as a string, for example: 30 yuan.
Group order market price display field, for example: 100 yuan.

These metadata are reported to the system and can be used to display the functions provided by the current system. Metadata is used to describe components and their association relationships, and metadata is parsed through the framework to automatically call and execute business components, forming the following metadata architecture:

Figure 9: Metadata-driven architecture

The overall architecture consists of three core parts:

Business capabilities: standard business logic units, including access units, display units, and query models, which are all key reusable assets.
Metadata: Describe the business functions (such as display unit, access unit) and the relationship between the business functions, such as the data that the display unit depends on, and the display fields mapped by the display unit.
Execution engine: responsible for consuming metadata, and scheduling and executing business logic based on metadata.

Through the organic combination of the above three parts, a metadata-driven style architecture is formed.
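How the three parts fit together can be sketched in a few dozen lines. Everything here is an assumption made for illustration (`MetadataEngine`, `DisplayMeta`, the registration methods); it is not the article's engine. In particular, access units run sequentially for brevity, whereas in the architecture described above that step is delegated to GraphQL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.LongFunction;

// Hypothetical metadata-driven engine: all names are illustrative.
final class MetadataEngine {
    // Metadata for one display unit: target field, required access units, and its logic.
    static final class DisplayMeta {
        final String field;
        final List<String> dependsOn;
        final Function<Map<String, Object>, Object> logic;

        DisplayMeta(String field, List<String> dependsOn,
                    Function<Map<String, Object>, Object> logic) {
            this.field = field;
            this.dependsOn = dependsOn;
            this.logic = logic;
        }
    }

    private final Map<String, LongFunction<Object>> accessUnits = new LinkedHashMap<>();
    private final Map<String, DisplayMeta> displayUnits = new LinkedHashMap<>();
    final List<String> fetchLog = new ArrayList<>();   // which access units actually ran

    void registerAccess(String name, LongFunction<Object> unit) { accessUnits.put(name, unit); }
    void registerDisplay(String unitId, DisplayMeta meta) { displayUnits.put(unitId, meta); }

    // Runs the selected display units; the access plan is derived from metadata.
    Map<String, Object> execute(long productId, List<String> unitIds) {
        Set<String> needed = new LinkedHashSet<>();
        for (String id : unitIds) {
            needed.addAll(displayUnits.get(id).dependsOn);
        }
        Map<String, Object> data = new LinkedHashMap<>();
        for (String name : needed) {
            fetchLog.add(name);
            data.put(name, accessUnits.get(name).apply(productId));
        }
        Map<String, Object> view = new LinkedHashMap<>();
        for (String id : unitIds) {
            DisplayMeta meta = displayUnits.get(id);
            view.put(meta.field, meta.logic.apply(data));
        }
        return view;
    }
}
```

A scene is then just a list of display unit ids: choosing a "category title" unit instead of a "plain title" unit changes the rendered title field and the set of access units that run, with no hand-written orchestration code.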

5 Optimization practice for GraphQL

5.1 Simplified use

1) GraphQL direct use problem

Introducing GraphQL brings some additional complexity, such as GraphQL concepts like Schema and RuntimeWiring. The following is the development process based on GraphQL's native Java framework:

Figure 10: Workflow of using native GraphQL

These concepts raise the cost of learning and understanding for developers who have not worked with GraphQL, and they usually have nothing to do with the business domain. We only want GraphQL's on-demand query capability, yet we are weighed down by GraphQL itself. Developers should focus on the business logic. How can this be solved?

The computer scientist David Wheeler famously said, "All problems in computer science can be solved by another level of indirection." No problem is beyond the reach of one more layer; in essence, someone has to take responsibility for the matter. We therefore added an execution engine layer on top of native GraphQL. Its goal is to hide the complexity of GraphQL so that developers only need to focus on business logic.

2) Access interface standardization

First, we need to simplify data access. The native DataFetcher and DataLoader are at a relatively high level of abstraction and lack business semantics. In the query scenario, we can conclude that all queries belong to the following three modes:

1-to-1 query: query one result by one condition.
1-to-N query: query multiple results by one condition.
N-to-N query: the batch version of the 1-to-1 or 1-to-N query.

Therefore, we have standardized the query interface. Developers can judge which type fits their scenario and choose accordingly. The standard design of the access interface is as follows:

Figure 11: Query interface standardization

Developers select the accessor they need and specify the result type through generics. The 1-to-1 and 1-to-N cases are relatively simple. The N-to-N case is defined as a batch query interface for the "N+1" scenario, where the batchSize field specifies the shard size and batchKey specifies the query key. Developers only need to specify these parameters and the framework handles the rest automatically. In addition, we require that the return value be a CompletableFuture, so that aggregated queries are asynchronous along the full call chain.
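Since the actual interfaces appear only in the figure, the following is a guess at their general shape, not a reproduction. The names `SingleFetcher`, `ListFetcher`, `BatchFetcher`, and `BatchRunner` are hypothetical, and batchKey is left out because its exact semantics are not described. The `BatchRunner` shows what "the framework handles the rest" could mean for sharding: split the keys by batchSize, call the fetcher per shard, and merge the asynchronous results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical 1-to-1 accessor: one key in, one result out.
interface SingleFetcher<K, V> {
    CompletableFuture<V> fetch(K key);
}

// Hypothetical 1-to-N accessor: one key in, many results out.
interface ListFetcher<K, V> {
    CompletableFuture<List<V>> fetch(K key);
}

// Hypothetical N-to-N accessor: the framework collects keys and calls it in shards.
interface BatchFetcher<K, V> {
    int batchSize();   // maximum number of keys per downstream call
    CompletableFuture<Map<K, V>> fetchBatch(List<K> keys);
}

final class BatchRunner {
    // Splits keys into shards of batchSize, fetches each shard, and merges the results.
    static <K, V> CompletableFuture<Map<K, V>> run(BatchFetcher<K, V> fetcher, List<K> keys) {
        int size = fetcher.batchSize();
        if (size <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<CompletableFuture<Map<K, V>>> shards = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += size) {
            int end = Math.min(i + size, keys.size());
            shards.add(fetcher.fetchBatch(keys.subList(i, end)));
        }
        return CompletableFuture.allOf(shards.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    Map<K, V> merged = new LinkedHashMap<>();
                    for (CompletableFuture<Map<K, V>> shard : shards) {
                        merged.putAll(shard.join());   // already complete at this point
                    }
                    return merged;
                });
    }
}
```

Returning a CompletableFuture from every accessor means the runner never blocks a thread while waiting for a shard, which matches the full-chain asynchrony requirement stated above.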

3) Aggregation orchestration automation

Standardizing the access interface makes the semantics of data sources clearer and lets developers choose what they need, which simplifies business development. But at this point, after writing a Fetcher, a developer still has to go elsewhere to write the Schema, and then write the binding between the Schema and the Fetcher. Developers enjoy writing code and are reluctant to go somewhere else afterwards to write configuration; maintaining code and its matching configuration side by side also increases the chance of errors. Can these tedious steps be removed?

Schema and RuntimeWiring are essentially descriptions of some information. Could that information be described in another way? Our optimization is to have developers add annotations while writing business code and to describe the information through annotation metadata; the rest is left to the framework. The solution is sketched in the following diagram:

Figure 12: Describing Schema and RuntimeWiring with annotation metadata
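The annotation approach can be illustrated with plain Java reflection. The annotation `FetcherMeta`, the two empty fetcher classes, and `WiringScanner` are hypothetical stand-ins; a real framework would generate the GraphQL Schema and RuntimeWiring from such metadata, whereas this sketch only builds a "Type.field" to fetcher-class table to show that the information can live next to the code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical annotation: declares which schema type and field a fetcher serves.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface FetcherMeta {
    String type();
    String field();
}

@FetcherMeta(type = "Product", field = "price")
final class PriceFetcher { }

@FetcherMeta(type = "Product", field = "stock")
final class StockFetcher { }

final class WiringScanner {
    // Builds "Type.field" -> fetcher class from annotations; classes without
    // the annotation are ignored.
    static Map<String, Class<?>> scan(Class<?>... candidates) {
        Map<String, Class<?>> wiring = new TreeMap<>();
        for (Class<?> candidate : candidates) {
            FetcherMeta meta = candidate.getAnnotation(FetcherMeta.class);
            if (meta != null) {
                wiring.put(meta.type() + "." + meta.field(), candidate);
            }
        }
        return wiring;
    }
}
```

With this in place, adding a data source means writing one annotated class; there is no separate configuration file to keep in sync.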

5.2 Performance optimization

5.2.1 GraphQL performance issues

Although GraphQL is open source, Facebook open-sourced the specification rather than a ready-made solution for Java; the GraphQL-Java framework is contributed by the community. Using the open-source GraphQL-Java as our on-demand query engine, we found several problems. Some were caused by improper usage, and some lie in the GraphQL implementation itself. Typical problems we encountered include:

CPU-intensive parsing, including Schema parsing and Query parsing.
Latency when the query model is complex, especially when it contains large lists.
CPU consumption of reflection-based model conversion.
Level-by-level scheduling of DataLoader.

Therefore, we have made some optimizations and modifications to the usage and framework to solve the problems listed above. This chapter focuses on our optimization and transformation ideas in GraphQL-Java.

5.2.2 GraphQL compilation optimization

1) Overview

GraphQL is a query language intended to let client applications describe their data requirements and interactions with an intuitive, flexible syntax. GraphQL is a domain-specific language (DSL), and the GraphQL-Java library we use implements its language compilation layer on ANTLR 4. ANTLR 4 is a Java-based tool for defining and recognizing languages, and ANTLR itself is a meta-language. Their relationship is as follows:

Figure 13: Basic principles of the GraphQL language

The Schema and Query accepted by the GraphQL execution engine are written in the language that GraphQL defines. The execution engine cannot understand that text directly; the GraphQL compiler must first translate it into a Document object that the engine understands. The GraphQL compiler runs in Java. Experience shows that when queries are interpreted in real time under heavy traffic, this part of the code becomes a CPU hot spot and also adds response latency. The more complex the Schema or Query, the more obvious the performance loss.

2) Schema and Query compilation cache

The Schema expresses the data view and the access model. Schemas are homogeneous, relatively stable, and few in number. In our business scenario, each service has only one. Therefore, our approach is to build the Schema at startup and cache it as a singleton. The Query differs somewhat from scenario to scenario, so it cannot be a singleton. Our approach there is to implement the PreparsedDocumentProvider interface and cache the compiled result with the Query text as the key, as shown below:

Figure 14: Schematic of the Query cache implementation
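The caching idea can be sketched without depending on a particular GraphQL-Java version. The following is a minimal sketch, not the implementation described in this article: `ParsedQuery` and `QueryParseCache` are hypothetical names, and the parser is passed in as a plain function that stands in for the expensive ANTLR-based compilation. In a real service the same logic would sit behind GraphQL-Java's PreparsedDocumentProvider interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class QueryCacheDemo {
    public static void main(String[] args) {
        AtomicInteger parseCount = new AtomicInteger();
        // The lambda stands in for the expensive parse and validation step.
        QueryParseCache cache = new QueryParseCache(text -> {
            parseCount.incrementAndGet();
            return new ParsedQuery(text);
        });
        String query = "{ subjects { fieldA fieldB } }";
        ParsedQuery first = cache.get(query);
        ParsedQuery second = cache.get(query);
        System.out.println("same instance: " + (first == second));
        System.out.println("parse calls: " + parseCount.get());
    }
}

// Hypothetical stand-in for a parsed and validated query document.
final class ParsedQuery {
    final String source;

    ParsedQuery(String source) {
        this.source = source;
    }
}

// Caches parsed queries with the query text as the key, so that each distinct
// query string is parsed only once per process.
final class QueryParseCache {
    private final Map<String, ParsedQuery> cache = new ConcurrentHashMap<>();
    private final Function<String, ParsedQuery> parser;

    QueryParseCache(Function<String, ParsedQuery> parser) {
        this.parser = parser;
    }

    ParsedQuery get(String queryText) {
        // computeIfAbsent invokes the parser only when the key is missing.
        return cache.computeIfAbsent(queryText, parser);
    }
}
```

This sketch never evicts entries. That is acceptable only when the set of distinct query texts is small and fixed, as with per-scenario queries. If clients can send arbitrary query text, a bounded cache would be the safer choice.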

5.2.3 GraphQL execution engine optimization

1) GraphQL execution mechanism and problems

Let's first look at how the GraphQL-Java execution engine works. Taking AsyncExecutionStrategy as the execution strategy, the execution process of the GraphQL execution engine is as follows:

Figure 15: Execution process of the GraphQL execution engine

The sequence diagram above has been simplified, and some information irrelevant to the key points has been omitted. The execute method of AsyncExecutionStrategy is the asynchronous implementation of the object execution strategy. It is the starting point of query execution and the entry point of the root node query. AsyncExecutionStrategy handles the multiple fields of an object with a loop plus asynchronous execution. Starting from the execute method of AsyncExecutionStrategy, the GraphQL query process is as follows:

Call the get method of the DataFetcher bound to the current field. If no DataFetcher is bound to the field, the default PropertyDataFetcher is used, which reads the field from the source object by reflection.
Wrap the DataFetcher query result in a CompletableFuture. If the result is already a CompletableFuture, it is not wrapped again.
After the result CompletableFuture completes, call completeValue, which handles the result according to its type.
- If the result is a list type, the list is traversed and completeValue is executed recursively for each element.
- If the result is an object type, execute is called on the object, which brings us back to the starting point, the execute method of AsyncExecutionStrategy.
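The steps above can be modelled in a few lines. The following is a simplified sketch, not GraphQL-Java code: `Node`, `wrap`, and the static `execute` and `completeValue` methods are hypothetical names that mirror the roles described, and the sketch ignores selection sets, error handling, and instrumentation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class ExecutionSketch {
    // Hypothetical object type: each field name maps to the fetcher bound to it.
    static final class Node {
        final Map<String, Supplier<Object>> fetchers = new LinkedHashMap<>();
    }

    // Entry point for an object: loop over its fields, complete each one asynchronously.
    static CompletableFuture<Map<String, Object>> execute(Node node) {
        Map<String, CompletableFuture<Object>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Object>> field : node.fetchers.entrySet()) {
            Object raw = field.getValue().get(); // step 1: call the fetcher
            pending.put(field.getKey(), wrap(raw).thenCompose(ExecutionSketch::completeValue));
        }
        return CompletableFuture.allOf(pending.values().toArray(new CompletableFuture[0]))
                .thenApply(done -> {
                    Map<String, Object> result = new LinkedHashMap<>();
                    pending.forEach((name, future) -> result.put(name, future.join()));
                    return result;
                });
    }

    // Step 2: wrap plain values; a value that is already a future is used as is.
    @SuppressWarnings("unchecked")
    static CompletableFuture<Object> wrap(Object raw) {
        return raw instanceof CompletableFuture
                ? (CompletableFuture<Object>) raw
                : CompletableFuture.completedFuture(raw);
    }

    // Step 3: handle the completed value according to its type.
    static CompletableFuture<Object> completeValue(Object value) {
        if (value instanceof List) {
            // List: complete every element recursively, one after another in a loop.
            List<CompletableFuture<Object>> items = new ArrayList<>();
            for (Object item : (List<?>) value) {
                items.add(completeValue(item));
            }
            return CompletableFuture.allOf(items.toArray(new CompletableFuture[0]))
                    .thenApply(done -> {
                        List<Object> out = new ArrayList<>();
                        for (CompletableFuture<Object> item : items) {
                            out.add(item.join());
                        }
                        return (Object) out;
                    });
        }
        if (value instanceof Node) {
            // Object: back to the starting point.
            return execute((Node) value).thenApply(map -> (Object) map);
        }
        return CompletableFuture.completedFuture(value); // scalar
    }

    public static void main(String[] args) {
        Node child = new Node();
        child.fetchers.put("id", () -> 7);
        Node root = new Node();
        root.fetchers.put("name", () -> "root");
        root.fetchers.put("children", () -> List.of(child, child));
        System.out.println(execute(root).join());
    }
}
```

Note that the list branch walks the elements in a plain loop on the calling thread, which is where large lists add latency.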

The above is the execution process of GraphQL. What's wrong with this process? Following the numbered marks on the diagram, let's look at the problems we encountered when applying GraphQL in our business scenarios. They are not necessarily problems in other scenarios and are for reference only:

Problem 1: PropertyDataFetcher is a CPU hot spot. PropertyDataFetcher is hot code in the whole query process, and its implementation leaves some room for optimization, so at runtime its execution becomes a CPU hot spot. (For details, see the commit and conversation on GitHub: https://github.com/graphql-java/graphql-java/pull/1815 )

Problem 2: List computation is time-consuming. Lists are processed in a loop, so when the query result contains a large list, the loop noticeably delays the whole query. As a concrete example, assume the query result contains a list of 1000 elements and each element takes 0.01ms to process. The total is then 1000 × 0.01ms = 10ms, and under the GraphQL query mechanism these 10ms block the entire call chain.

2) Type conversion optimization

The model obtained through the GraphQL query engine is built from the data returned by the DataFetcher, but the types of all fields are converted to GraphQL internal types. This model conversion process is the reason PropertyDataFetcher becomes a CPU hot spot. The conversion from the business-defined model to the GraphQL type model is shown in the following figure:

When the query result model contains many fields, for example tens of thousands, every query performs tens of thousands of PropertyDataFetcher operations, which shows up as the CPU hot spot. Our solution is to keep the original business model unchanged and, instead of converting it field by field through PropertyDataFetcher, fill the query results back into the business model, as shown in the following schematic diagram:

With this approach, the result we get from the GraphQL execution engine is the object model returned by the business DataFetcher. This not only removes the CPU hot spot caused by reflection-based field conversion but also makes business development friendlier. The GraphQL model is similar to a JSON model: it carries no business types and is very troublesome to use directly in business code. We tested this optimization in a pilot scenario. The average response time of that scenario dropped by 1.457ms, the average 99th-percentile response time dropped by 5.82ms, and the average CPU utilization dropped by about 12%.

3) List calculation optimization

When there are many list elements, the delay caused by the default single-threaded calculation method of traversing the list elements is very obvious. This delay optimization is necessary for scenarios where the response time is more sensitive. Our solution to this problem is to make full use of the CPU's multi-core computing capabilities, split the list into tasks, and execute them in parallel through multiple threads. The implementation mechanism is as follows:
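A minimal sketch of the split-and-parallelize idea follows. It is an illustration under assumptions, not the modified execution engine itself: `mapInParallel`, the chunk size, and the pool size are hypothetical choices, and the per-element function stands in for the per-element completion work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class ParallelListSketch {
    // Splits the list into chunks of at most chunkSize elements, processes each
    // chunk on the pool, and joins the chunk results back in the original order.
    static <T, R> List<R> mapInParallel(List<T> items, int chunkSize,
                                        Function<T, R> perElement, ExecutorService pool) {
        List<CompletableFuture<List<R>>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            List<T> chunk = items.subList(start, Math.min(start + chunkSize, items.size()));
            chunks.add(CompletableFuture.supplyAsync(() -> {
                List<R> out = new ArrayList<>(chunk.size());
                for (T item : chunk) {
                    out.add(perElement.apply(item));
                }
                return out;
            }, pool));
        }
        List<R> result = new ArrayList<>(items.size());
        for (CompletableFuture<List<R>> chunk : chunks) {
            result.addAll(chunk.join()); // joining in submission order keeps element order
        }
        return result;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Integer> ids = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                ids.add(i);
            }
            List<String> rendered = mapInParallel(ids, 250, id -> "item-" + id, pool);
            System.out.println(rendered.size() + " elements, last = " + rendered.get(999));
        } finally {
            pool.shutdown();
        }
    }
}
```

Splitting only pays off when the per-element work outweighs the cost of handing tasks to other threads, so the chunk size is something to measure rather than guess.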

5.2.4 GraphQL-DataLoader scheduling optimization

1) Basic Principles of DataLoader

Let me briefly introduce the basic principles of DataLoader. DataLoader has two methods, one is load and the other is dispatch . In the scenario of solving the N+1 problem, DataLoader is used like this:

The whole is divided into two stages. The first stage calls load , which is called N times, and the second stage calls dispatch . When dispatch is called, the data query will be executed, so as to achieve the effect of batch query + sharding.
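The two stages can be illustrated with a stripped-down loader. This is a minimal single-threaded sketch with hypothetical names (`MiniDataLoader` and its batch function), not the java-dataloader library. It covers only batching and leaves out sharding and caching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class MiniDataLoaderDemo {
    public static void main(String[] args) {
        // Hypothetical batch function: one remote call that resolves many keys at once.
        MiniDataLoader<Integer, String> loader = new MiniDataLoader<Integer, String>(keys -> {
            System.out.println("batch query for keys " + keys);
            Map<Integer, String> rows = new HashMap<>();
            for (Integer key : keys) {
                rows.put(key, "product-" + key);
            }
            return rows;
        });
        // Stage 1: N load calls, no query is sent yet.
        CompletableFuture<String> a = loader.load(1);
        CompletableFuture<String> b = loader.load(2);
        CompletableFuture<String> c = loader.load(3);
        // Stage 2: a single dispatch sends one batched query.
        loader.dispatch();
        System.out.println(a.join() + ", " + b.join() + ", " + c.join());
    }
}

final class MiniDataLoader<K, V> {
    private final Function<List<K>, Map<K, V>> batchFunction;
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    MiniDataLoader(Function<List<K>, Map<K, V>> batchFunction) {
        this.batchFunction = batchFunction;
    }

    // Records the key and hands back a future; repeated keys share one future.
    CompletableFuture<V> load(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    // Sends one batched query for all recorded keys and completes their futures.
    void dispatch() {
        if (pending.isEmpty()) {
            return;
        }
        Map<K, CompletableFuture<V>> batch = new LinkedHashMap<>(pending);
        pending.clear();
        Map<K, V> values = batchFunction.apply(new ArrayList<>(batch.keySet()));
        batch.forEach((key, future) -> future.complete(values.get(key)));
    }
}
```

Because nothing is fetched until dispatch is called, the moment at which dispatch is triggered decides how long every waiting field is blocked.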

2) DataLoader scheduling problem

GraphQL-Java's integrated support for DataLoader is implemented in FieldLevelTrackingApproach. What problems does the FieldLevelTrackingApproach implementation cause? The following diagram illustrates the problems caused by the native DataLoader scheduling mechanism:

The problem is obvious. With the FieldLevelTrackingApproach implementation, the dispatch of the next level's DataLoader has to wait until all results of the current level have come back. Under this implementation, the total query time is: TOTAL = MAX(Level 1 Latency) + MAX(Level 2 Latency) + MAX(Level 3 Latency) + …, that is, the sum of the maximum latency of each level. In fact, if the call chain were orchestrated by the business developers themselves, the theoretical total time would equal the time consumed by the longest chain, which is the reasonable result. The behavior of FieldLevelTrackingApproach runs against that expectation. As for why it is implemented this way, our current understanding is that the designers may have favored simplicity and generality.
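To make the difference concrete, the two totals can be computed side by side. The sketch below uses made-up latencies for two hypothetical call chains of two levels each. The method names are illustrative only.

```java
class SchedulingLatency {
    // chains[i][level] is the latency in ms of chain i at that level.
    // Level-by-level scheduling: every level waits for its slowest member.
    static long levelByLevelTotal(long[][] chains) {
        int levels = 0;
        for (long[] chain : chains) {
            levels = Math.max(levels, chain.length);
        }
        long total = 0;
        for (int level = 0; level < levels; level++) {
            long slowest = 0;
            for (long[] chain : chains) {
                if (level < chain.length) {
                    slowest = Math.max(slowest, chain[level]);
                }
            }
            total += slowest;
        }
        return total;
    }

    // Independent chains: the total is the latency of the slowest chain.
    static long longestChainTotal(long[][] chains) {
        long longest = 0;
        for (long[] chain : chains) {
            long sum = 0;
            for (long latency : chain) {
                sum += latency;
            }
            longest = Math.max(longest, sum);
        }
        return longest;
    }

    public static void main(String[] args) {
        // Chain A is slow at level 1, chain B is slow at level 2.
        long[][] chains = { {30, 5}, {5, 30} };
        System.out.println(levelByLevelTotal(chains)); // prints 60
        System.out.println(longestChainTotal(chains)); // prints 35
    }
}
```

With these numbers, level-by-level scheduling costs 30 + 30 = 60ms, while the longest chain takes only 35ms. The gap grows when slow calls sit at different levels of different chains.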

The problem is that this behavior is unacceptable in some business scenarios. For example, the response time budget of our list scenario is less than 100ms in total, and tens of milliseconds are lost for this reason. One way to solve this is to orchestrate the scenarios with especially strict response time requirements by hand, without using GraphQL. The other way is to solve the problem at the GraphQL level and keep the architecture unified. Next, we describe how we extended the GraphQL-Java execution engine to solve this problem.

3) DataLoader scheduling optimization

To address the DataLoader scheduling performance problem, our idea is to call the dispatch method immediately after the last load call on a DataLoader, so that the query is sent right away. The question is how to know which load call is the last one. This is also the difficult point in solving the DataLoader scheduling problem. The following example explains our solution:

Suppose the model we query is structured as follows. The root is Query, with a field named subjects. subjects references a list with two elements, both of which are instances of ModelA. ModelA has two fields, fieldA and fieldB. The fieldA of subjects[0] is associated with one instance of ModelB, and the fieldB of subjects[0] is associated with multiple instances of ModelC.

To make this easier to understand, we define some concepts, such as field, field instance, field instance execution completion, field instance value size, field instance value object execution size, field instance value object execution completion, etc.:

Field: has a unique path, is static, and is independent of the size of the runtime objects, for example subjects and subjects/fieldA.
Field instance: an instance of a field. It has a unique path, is dynamic, and depends on the size of the runtime objects. For example, subjects[0]/fieldA and subjects[1]/fieldA are instances of the field subjects/fieldA.
Field instance execution completion: all object instances associated with the field instance have been executed by GraphQL.
Field instance value size: the number of object instances referenced by the field instance. In the example above, the value size of the field instance subjects[0]/fieldA is 1, and the value size of the field instance subjects[0]/fieldB is 3.

In addition to the above definitions, our business scenarios also meet the following conditions:

There is only 1 root node, and the root node is a list.
A DataLoader must belong to a certain field, and the number of times the DataLoader under a field should be executed equals the number of object instances under that field.

Based on the above information, we can get the following problem analysis:

When executing a field instance, we know its size. The size of the field instance equals the number of times the field's DataLoader needs to call load under the current instance. So after each load, we can tell whether the current object instance is the last object of the field instance where it is located.
An object instance may hang under different field instances. So the fact that the current object instance is the last object of its own field instance does not mean that it is the last of all object instances. That is true if and only if the node instance where the object instance is located is also the last instance of its node.
We can infer the number of field instances from the field instance size. For example, if we know that the size of subjects is 2, then we know that the subjects field has two field instances, subjects[0] and subjects[1], which means that the field subjects/fieldA has two instances, subjects[0]/fieldA and subjects[1]/fieldA. Therefore, we can infer from the root node downward whether a certain field instance has been executed.

From the analysis above, we can conclude that an object has finished executing when the field instance it belongs to and all parent field instances of its field have been executed, and the object instance currently being executed is the last object instance of its field instance. Based on this logic, our implementation checks, every time a DataFetcher has been called, whether dispatch needs to be initiated, and initiates it if so. In addition, with the timing and conditions above, a dispatch can be missed. There is a special case: when the current object instance is not the last one but the remaining object sizes are all 0, the load of the DataLoader associated with the current object will never be triggered again. So when the object size is 0, the check has to be performed once more.
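One way to read this logic is as a set of counters kept per field. The sketch below is an assumption-laden simplification, not the extension described in this article: `DispatchTracker` and its methods are hypothetical, it tracks a single level below a root list of known size, and it is single-threaded.

```java
class DispatchTrackerDemo {
    public static void main(String[] args) {
        // Field subjects/fieldB: subjects has size 2, so two field instances are expected.
        DispatchTracker tracker = new DispatchTracker(2, () -> System.out.println("dispatch"));
        tracker.onFieldInstance(3); // subjects[0]/fieldB references three objects
        tracker.onLoad();
        tracker.onLoad();
        tracker.onLoad();           // last load of this instance, but subjects[1] is unknown
        tracker.onFieldInstance(0); // subjects[1]/fieldB is empty: no load will ever follow
    }
}

// Decides when the last load of a field's DataLoader has happened.
final class DispatchTracker {
    private final int expectedFieldInstances; // inferred from the parent field's size
    private final Runnable dispatch;
    private int reportedFieldInstances = 0;
    private int expectedLoads = 0;
    private int observedLoads = 0;
    private boolean dispatched = false;

    DispatchTracker(int expectedFieldInstances, Runnable dispatch) {
        this.expectedFieldInstances = expectedFieldInstances;
        this.dispatch = dispatch;
    }

    // Called when a field instance is executed and its value size becomes known.
    void onFieldInstance(int valueSize) {
        reportedFieldInstances++;
        expectedLoads += valueSize;
        maybeDispatch(); // extra check for the size 0 case described above
    }

    // Called after each load issued under this field.
    void onLoad() {
        observedLoads++;
        maybeDispatch();
    }

    private void maybeDispatch() {
        boolean allInstancesKnown = reportedFieldInstances == expectedFieldInstances;
        boolean allLoadsSeen = observedLoads == expectedLoads;
        if (!dispatched && allInstancesKnown && allLoadsSeen) {
            dispatched = true;
            dispatch.run();
        }
    }
}
```

In the demo, dispatch fires only on the final call. Without the check inside onFieldInstance, the empty subjects[1]/fieldB would leave the three earlier loads waiting forever, which is the missed-dispatch case.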

Following the logic above, we optimized the DataLoader call chain and achieved the theoretically optimal effect.

6 The impact of the new architecture on the R&D model

Productive forces determine production relations. The metadata-driven information aggregation architecture is the core productive force for display scenarios, while the business development model and process are the production relations, so they change accordingly. Below we describe the impact of the new architecture on R&D from the perspective of the development model and the development process.

6.1 Business-focused development model

The new architecture provides a set of standardized code decomposition constraints based on business abstractions. In the past, developers were likely to understand the system as "query the services and glue the data together". Now, developers share a consistent understanding of the business and of how the code is decomposed. For example, a display unit represents display logic, and an access unit represents data access logic. At the same time, a lot of complicated and error-prone logic is shielded by the framework. Developers have more energy to focus on the business logic itself, such as understanding and encapsulating business data, understanding and writing display logic, and abstracting and building the query model, as shown in the following schematic diagram:

6.2 R&D process upgrade

The new architecture affects not only how code is written but also the R&D process. Thanks to the visualization and configuration capabilities of the metadata architecture, the current R&D process differs significantly from the previous one, as shown in the figure below:

In the past, development followed a "one shot to the end" model. Building each display scenario meant going through the entire process from interface communication to API development. With the new architecture, the system has multi-layer reuse, visualization, and configuration capabilities built in.

Case one: This is the best case. Both the data access function and the display function have already been accumulated in the system. All developers need to do is create a query plan, select the required display units on the operating platform, and then use the query plan ID to retrieve the required display information through the query interface. The visualization and configuration interface is shown in the following schematic diagram:

Case two: The display function may not exist yet, but the data source has already been connected through the operating platform, so this is not difficult either. You only need to write a piece of processing logic on top of the existing data source. This processing logic is pure logic, and writing it is quite pleasant. The data source list is shown in the following diagram:

Case three: In the worst case, the system cannot yet satisfy the required query capability. This situation is relatively rare, because the back-end services are relatively stable, so there is no need to panic. Just connect the data source according to the standard specification and then write the processing logic. After that, these capabilities can be reused continuously.

7 Summary

The complexity of commodity display scenarios shows in four ways: many scenarios, many dependencies, a lot of logic, and differences between scenarios. In this context, if the business is at an early stage and speed is what matters most, adopting the "chimney-style" personalized construction approach is nothing to second-guess. However, as the business keeps developing, functions keep iterating, and scenarios grow in scale, the drawbacks of "chimney-style" construction gradually become prominent, including high code complexity and a lack of accumulated, reusable capabilities.

Based on the analysis of the core contradictions faced by the Meituan to-store merchandise display scene, this article introduces:

Different BFF application modes in the industry, as well as the advantages and disadvantages of different modes.
Improved metadata-driven architecture scheme design based on GraphQL BFF mode.
The problems and solutions we encountered in the process of GraphQL practice.
The impact of the new architecture on the R&D model.

At present, the core commodity display scenarios that the author's team is responsible for have been migrated to the new architecture. With the new R&D model, we have achieved more than 50% reuse of display logic and more than doubled our efficiency. I hope this article is helpful to everyone.

8 References

[1] https://samnewman.io/patterns/architectural/bff/
[2] https://www.thoughtworks.com/cn/radar/techniques/graphql-for-server-side-resource-aggregation
[3] understand the back-end system of e-commerce, this is enough
[4] frame definition-Baidu Encyclopedia
[5] Efficient R&D-Exploration and Practice of
[6] "System Architecture-Product Design and Development of Complex Systems"

This article is produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication. Please indicate "the content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial activity, please send an email to tech@meituan.com to apply for authorization.