This article was first published on the Nebula Graph Community public account
For every user query, the Ctrip hotel technical team aims to match the corresponding scenes and products to the user's intent. On the way to that goal, the team ran into three major problems. This article describes how they built the relationships between scenes and information and used Nebula Graph to handle those relationships, so that customized, recommended information can be returned quickly to hotel users.
Background
The requirement of the Ctrip Yugong project can be summarized in one sentence: "for every user query, match the corresponding scenes and products to the user's intent".
In hotel ranking, Ctrip Hotels already provides customized services for different user groups and usage scenarios: the hotel list page is sorted differently in different scenarios, which has greatly improved user value. Ctrip Hotels also reduces the user's decision-making effort by displaying scene-based content on the front end.
If the current user scenario is a ski vacation, then as a service provider we naturally want every front-end display position to be associated with skiing. As shown on the hotel list page above, the hotel's short label position displays a "Ski Package" label, the list ranking shows the hotel's rank in the local ski list, and the hotel's display picture shows the hotel in a snow scene.
On the hotel details page, the hotel album also shows snow scenes, and its first category is [Ski Picks]. The other short-tag lists follow logic similar to the list page and show ski-related content. The geographic information position shows the distance to the nearest ski area.
The two pictures above show the hotel details and the room type details respectively. The hotel facilities and the room type atmosphere pictures have also been adjusted for the ski scene.
The four pictures above briefly show what Ctrip Hotels can achieve at each front-end display position by linking it to the scene.
To implement these functions, we first need to establish associations between the various scenes and the content information, so that when a user enters a scene, the relevant information can be retrieved directly and quickly for display. This requires solving the following problems:
- Lack of mapping of scenes and information
- The richness of information is not enough to support sceneization
- Operations and systems are not efficient enough
Specifically, scenes and information lacked associations because, before the Yugong project, the front-end display contents of Ctrip Hotels were independent of each other, with no unified integration system. Take the featured tags "traveling with kids" and "love staying here" mentioned above and the "parent-child paradise" tag among the facility tags: semantically they can all be tied directly to the parent-child scene, but one kind is a feature and the other is a facility, and apart from sharing the same display position the tags have nothing else in common.
Another typical example is reviews. The app screenshots on the right of the picture above show three reviews of the Donghu Hotel in Wuhan. At present, such content is displayed only as hotel reviews and serves no other purpose. In the Yugong project, if we define three scenes (parent-child, cherry blossom viewing, and tourism) and associate reviews with scene tags, we can do much more. For example, if 30% of a hotel's reviews mention parent-child travel, we may be able to judge that the hotel has parent-child characteristics. As another example, a short, well-written sentence can be extracted from a review about cherry blossoms and associated with the cherry blossom scene; when the front end passes in the cherry blossom scene, that sentence can be displayed as supplementary content. Many similar uses can be imagined, but the key to realizing all of them is establishing the relationship between scenes and information.
The second problem is that the information is not rich enough to support scene-based requirements. As a simple example, after a user filters by skiing on the list page, there are no ski-related pictures and no geographic location information related to the tag. On the details page there is no ski-related album and no ski atmosphere picture. The hotel facilities do include ski facilities, but they are not given higher display priority in the skiing scene, and the second-level facility "ski facilities" has no related pictures. As a result, hotel recommendations lack vividness.
The third problem is that systems and operations are not efficient enough. Specifically:
New scenes and new data are expensive to develop
- Each scene that goes online has to be developed one by one for every linked display position
- Each position that takes in new information has to be released with a version
Data timeliness: there is no unified standard, and timeliness is insufficient
- For example, featured tags pass through multiple serial jobs and take effect at T+x
Data sorting: no unified owner
- List page short labels: static information, front-end logic plus the operations back end
- Troubleshooting links are long, and routine maintenance is inefficient
Data standards: multiple sets of standards exist
- Which hotels are family hotels?
- …
To expand on these points: the front-end display positions of Ctrip Hotels are served by multiple services. The more than ten display positions marked in the picture, for example, are backed by services such as reviews, lists, and labels. If the linkage logic of these services has no unified owner, every new scene requires separate development for each display position, so development cost is high. The second point is data timeliness. The update jobs are currently partly serial, and some data operations and modifications take effect only after T+x, so the data is far from real time. The third point is that data sorting and display logic have no single home: both the front end and the back end maintain parts of the logic, which leads to redundant data, longer troubleshooting links, and low maintenance efficiency. The last point is the lack of unified data standards. How do we judge whether a hotel has parent-child characteristics? If a hotel has three featured tags that mention parent-child, is it a parent-child hotel? Or if 20% of its reviews mention parent-child, is it a parent-child hotel? Such standards need precise definitions before scene-based rules can be implemented.
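One way to make such a standard explicit is to write it down as a small rule object. The sketch below is only an illustration: the `SceneRule` type, the keyword matching, and the thresholds (three featured tags, 20% of reviews, taken from the questions above) are assumptions, not the rules Ctrip actually adopted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneRule:
    """One explicit standard for deciding that a hotel belongs to a scene."""
    keyword: str             # term that signals the scene, e.g. "parent-child"
    min_tag_hits: int        # featured tags that must mention the keyword
    min_review_share: float  # share of reviews that must mention the keyword


def matches_scene(rule: SceneRule, tags: list[str], reviews: list[str]) -> bool:
    """Return True if either threshold of the rule is met."""
    tag_hits = sum(rule.keyword in tag.lower() for tag in tags)
    review_hits = sum(rule.keyword in review.lower() for review in reviews)
    review_share = review_hits / len(reviews) if reviews else 0.0
    return tag_hits >= rule.min_tag_hits or review_share >= rule.min_review_share


# Hypothetical thresholds, borrowed from the two questions in the text.
PARENT_CHILD = SceneRule(keyword="parent-child", min_tag_hits=3, min_review_share=0.20)
```

With a rule like this, every service that asks "is this a parent-child hotel?" gets the same answer, because the thresholds live in one place rather than in each display position.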
To sum up, making the content of every front-end display position scene-based requires solving three problems: scenes and information lack associations; the information is not rich enough to support the scenes; and systems and operations are not efficient enough under the original architecture. The Yugong project was created to solve these problems.
Yugong Project: scene-based information throughout the user booking process
The figure below shows the overall framework of the Yugong project.
Project framework
The project involves the improvement of multiple systems:
- Unified intent recognition
- Display-position logic sinking
- Relationship matching
- Information mining
- Rich data sources
The Yugong project is divided into the five parts shown in the figure above. The first part, from top to bottom, is intent recognition, which uses the user's historical preferences and real-time data to identify the user's specific intent. For example, if the check-in filters a user has just selected include a child-related item, we judge that their current intent is very likely parent-child travel.
The second part is the sinking (consolidation) of the information display logic. As mentioned earlier, the front end has more than ten display positions, and their display logic is maintained by different information services. In this module we unify the logic of these display positions so that the information can easily be linked together. The module can also de-duplicate and resolve conflicts, and it allows sorting and recall logic to be planned as a whole.
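To make the de-duplication idea concrete, here is a minimal sketch. It assumes each candidate item already carries a normalized `semantic` key and a `priority`; both fields and the `Item` type are hypothetical and not taken from Ctrip's implementation.

```python
from typing import NamedTuple


class Item(NamedTuple):
    source: str    # e.g. "theme" or "facility" (hypothetical source names)
    text: str      # label text shown to the user
    semantic: str  # normalized meaning used to detect duplicates
    priority: int  # larger means more important


def dedupe(items: list[Item]) -> list[Item]:
    """Keep one item per semantic key: the one with the highest priority."""
    best: dict[str, Item] = {}
    for item in items:
        current = best.get(item.semantic)
        if current is None or item.priority > current.priority:
            best[item.semantic] = item
    # Sort the survivors so the most important label is recalled first.
    return sorted(best.values(), key=lambda i: i.priority, reverse=True)
```

Because all display positions go through one function, two sources that say the same thing in different words can no longer both reach the front end.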
The third part is relationship matching, which establishes the matching relationships between scenes and data. The data includes existing basic information and incremental data obtained through mining. Early on, the data was associated with scenes mainly through manual operations; later, NLP methods were used to establish the associations automatically. The relationships between data and data, and between data and scenes, are stored in and retrieved from Nebula Graph.
The fourth part is information mining: extracting scene-related information, or a hotel's highlights, from the various data sources. In the review example mentioned earlier, we find sentences related to a scene in the reviews, extract fluent, well-written short sentences from them, and display them on the front end as a data source associated with the scene, which enriches the data sources. The overall process in the information mining module involves NLP-related data annotation, model training, bad-case feedback, and re-annotation for retraining.
The last part is the data sources. First, we must obtain information from as many data sources as possible and add it to the graph. The accuracy of the data should of course be ensured as far as possible, and data should be obtained directly from the source rather than from a third party. To ensure accuracy, a reverse-check mechanism may be added in later development. For example, if a hotel displays a swimming pool facility but data mining finds that the pool has been closed, we can pass this finding to the operations staff to make the data more accurate.
Of these five parts, the sinking of the display logic and the establishment of relationship matching are the foundation of the project, and the other modules are the high-rise buildings built on it. Without a foundation, there can be no high-rise.
Next, I will explain how the display logic sinking and relationship matching were built.
Front-end information display logic sinking & relationship matching
The focus of this module is to consolidate the data recall logic of the multiple front-end display positions and to establish the relationships between display positions and scenes.
Specifically, when a scene comes in, the system needs to know which display positions should be linked to it, and the data recall logic of those positions changes accordingly. For example, in the parent-child scene, short tags such as "parent-child travel" and "parent-child paradise" express the parent-child meaning well, so the short tag position is suitable for direct association with the parent-child scene. In the food scene, if a hotel ranks near the top of a food list, the corresponding food label is more convincing, so the food scene is better associated with the food list. Once these associations are established, there is a unified standard for whether each display position needs to change with the scene, and the data recall logic is unified as well.
The table and figure above mark the 10+ front-end display positions and their linkage levels. The higher the linkage level, the more scene-oriented the position's display needs to be, and the more scene-related information can be found for it.
The above describes how to establish the relationship between scenes and display positions. Next is how to establish the relationship between scenes and data, which the picture above shows more intuitively. First, we mine the scene to find the points that highlight a hotel's characteristics or attract users. Then, based on the scene, we find the display positions that link to it. As mentioned earlier, different display positions highlight different features, so associating a scene with suitable display positions gives a better display effect. Finally, we work outward from each display position to its data, which may be existing data or data newly added for the specific scene.
Take this table as an example. Suppose we need to set up a new ski scene. First, we determine which front-end display positions can be associated with it, and find that positions such as the quick-filter short labels on the hotel list page and the head picture and reviews on the hotel details page can be linked to this scene. Then we only need to find the corresponding data sources and data types through these display positions and configure the relationships.
Take short tags as an example. Short tags include topic tags, facility tags, and so on. These data sources may already contain ski-related content, in which case we can associate it with the scene directly in the configuration back end. If there is no relevant content, operators can add it, or new data can be generated through data mining and then associated. The whole set of associations is displayed intuitively in the back end. Once configuration is complete, the data is written to Nebula Graph in real time, and the front end can filter the relevant display content directly by the ski scene.
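The configuration described above can be pictured as a nested mapping from scene to display position to data sources. The following sketch uses invented position names, linkage levels, and source names purely for illustration; the real configuration lives in Ctrip's back end and in Nebula Graph.

```python
# Hypothetical configuration: scene -> display position -> linkage level and data sources.
SCENE_CONFIG = {
    "ski": {
        "list_short_tag": {"level": 3, "sources": ["topic_tag", "facility_tag"]},
        "detail_head_picture": {"level": 2, "sources": ["hotel_picture"]},
        "detail_review": {"level": 1, "sources": ["review_clause"]},
    },
}


def linked_positions(scene: str, min_level: int = 1) -> list[str]:
    """Display positions linked to a scene, strongest linkage first."""
    positions = SCENE_CONFIG.get(scene, {})
    chosen = [(cfg["level"], name) for name, cfg in positions.items() if cfg["level"] >= min_level]
    return [name for _, name in sorted(chosen, reverse=True)]


def sources_for(scene: str, position: str) -> list[str]:
    """Data sources to recall from when a position is shown in a scene."""
    return SCENE_CONFIG.get(scene, {}).get(position, {}).get("sources", [])
```

Adding a new scene then means adding one entry to the configuration rather than changing the code of each display position.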
Technical architecture supporting scene-based information
Let's start with Nebula's architecture and cluster deployment.
This diagram will probably look familiar. Nebula consists of three services: the graph service, the meta service, and the storage service. The graph service handles client requests; the meta service stores metadata such as shards, schemas, and user accounts; and the storage service stores the vertex, edge, and index data of the graph.
Data consistency in a Nebula cluster relies on the Raft protocol, and both the meta service and the storage service are Raft-based clusters. The storage service is structured in a more complex way than the meta service: all replicas of each shard together form one Raft group. In other words, the storage service is not a single Raft cluster; it has as many Raft groups as it has shards. In a Raft group, the leader handles requests, while followers vote, replicate data and logs, and act as backups for the leader. The leader sends heartbeats regularly. If followers receive no heartbeat from the leader for a period of time, they automatically start an election to produce a new leader. Requests from the graph service, whether reads or writes, are basically all answered by the leader.
Because of how Raft works, a leader can be elected only with votes from more than half of the members. The usual deployment strategy is therefore 2n+1 machines, which can tolerate the failure of n machines.
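The 2n+1 rule follows directly from the majority requirement. The short sketch below (function names are illustrative) computes the quorum size and the number of failures a group can tolerate.

```python
def quorum(replicas: int) -> int:
    """Smallest number of votes that is a strict majority of the replicas."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority can still be formed."""
    return replicas - quorum(replicas)
```

Note that an even group size adds no fault tolerance: four replicas tolerate one failure, the same as three, which is why odd sizes are the usual choice.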
Machine deployment
Ctrip Hotels deploys the cluster across three data centers in proportion: there are 5n storage services, distributed 1n:2n:2n, so that if any one data center has a problem, the other two can continue to serve. This deployment has some drawbacks, however. Although there are three data centers, there is still only one cluster, so reads inevitably suffer latency from cross-data-center access. In addition, this single-cluster deployment model cannot support blue-green style releases, nor can it route traffic to the nearest data center.
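Why the 1n:2n:2n split survives the loss of any one data center can be checked with a few lines of arithmetic. This sketch makes a simplifying assumption that is not stated in the article: it treats the replicas as spread across data centers in the same proportion as the machines, whereas in practice the answer depends on how each shard's replicas are placed.

```python
def survives_any_single_room_loss(replicas_per_room: list[int]) -> bool:
    """True if losing any one room still leaves a strict majority of replicas."""
    total = sum(replicas_per_room)
    majority = total // 2 + 1
    return all(total - lost >= majority for lost in replicas_per_room)
```

For example, `survives_any_single_room_loss([1, 2, 2])` holds because losing the largest room still leaves 3 of 5 replicas, while a 1:1:3 split would lose its majority together with the largest room.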
For this reason, Ctrip Hotels hopes that in the future each data center can host an independent cluster, so that traffic can be controlled per cluster and nearest-access routing can be supported. When one cluster fails, traffic can be switched directly to another cluster through the domain name at little switching cost, and blue-green deployment can also be supported when writing data. This model also makes serving overseas users more feasible: deploying an independent cluster overseas, for example, can greatly reduce the latency of reads between domestic and overseas sites.
This raises another problem: data synchronization. Ctrip Hotels writes data from domestic sites to overseas sites, and in this setup the timeliness and stability of synchronization cannot be well guaranteed. Ctrip Hotels is discussing a solution for direct server-to-server data synchronization with the Ctrip system R&D team. If it can be realized, it will further reduce the latency introduced by data synchronization.
In short, Raft lets Nebula cope easily with the unavailability of individual machines. The new cluster deployment described above builds on Raft to make the whole cluster more available and fault-tolerant, which better matches the actual needs of an online application such as Ctrip Hotels.
Having covered cluster deployment, let's turn to the architecture of the project.
Technology Architecture
First, compare the project architecture before and after the change. As the picture above shows, in the original architecture each data source requested by the client had its own scene configuration, and some data sources could not support scenes at all because they lacked clear semantics. This led to three main problems:
- Cumbersome development
- Difficult to implement
- Information silos
First, because the information display logic is scattered across multiple independent services, no unified scene-based standard or data recall logic can be formed, and each scene-based module needs its own configuration file and implementation logic. Second, content types such as hotel pictures and hotel Q&A cannot be associated with scenes directly; the associations can only be established manually, which is costly. Third, the various information sources are maintained without a uniform standard and are independent of each other, which easily leads to duplicated or conflicting information across sources. For example, hotel themes and hotel facilities contain tags with very similar semantics; if two such tags are displayed on the front end at the same time, the information is redundant.
So how can these problems be solved? The answer is to build a system that stores the relationships between pieces of information, the relationships between information and scenes, and the information recall logic. We call it the information center system.
Compared with the previous architecture, the picture above replaces the independent scene configuration of each data source with a knowledge graph module and a semantic annotation module. The knowledge graph module stores the relationships between pieces of information and between information and scenes; here we use Nebula. With an appropriate schema and mapping relationships in Nebula, most data can be found within two hops. The semantic annotation module mines and annotates information whose semantics are unclear and then connects it to scenes. For example, useful data can be mined from hotel descriptions and hotel Q&A and added to the knowledge graph.
The figure above is a little abstract, so next we describe in detail the composition and function of each module in the information center system.
From right to left, the overall architecture of the information center can be divided into four modules: the information center API module, the knowledge graph module, the information annotation module, and the data module.
The information center API module is the service that the information center exposes to clients. Its functions include scene acquisition, data query, data packaging, and integration of the data display logic. With the information center API, services such as albums, tags, and facilities no longer maintain independent display logic but obtain it from a unified source.
The knowledge graph module integrates the data recall logic and uses Nebula as the storage engine for the relationships between pieces of information. For basic data such as facility label data, the data, hotels, and scenes are abstracted into points and the relationships between them into edges. Data generated by annotation is likewise abstracted into points, with its relationships to hotels and scenes abstracted into edges. Given a scene and a hotel ID, a Nebula query quickly retrieves all of the hotel's information points that match the scene. In addition, the Nebula data is updated by consuming real-time messages, so updates take effect in real time rather than at T+x, which greatly improves timeliness.
The information annotation module greatly increases data richness. For the large volume of UGC and hotel descriptions, finding scene-related information is a resource-intensive task, and NLP techniques can make it more efficient. The middle part of the figure above shows how data flows through the information annotation module. First, reviews, pictures, text, and other information are preprocessed into clauses and short sentences, which are sent to the annotation module for labeling. Labeling uses semantic tags, which come from channels such as hot search terms and manual definitions. A semantic tag is linked directly to a scene; the relationship is usually one-to-one, but one-to-many cases also exist. The labeled data is used for model training, and the trained models enter the task scheduling module together with static rules to label data in batches. Part of the generated data can be used after sampling and manual checking. By integrating the data generation process, the information annotation module greatly improves the efficiency of data generation.
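The preprocessing and static-rule part of this flow can be sketched in a few lines. The example below only illustrates clause splitting and keyword-based labeling; the tag names, keywords, and scene names are made up, and the model-training step described above is not covered.

```python
import re

# Hypothetical semantic tags, each with trigger keywords and the scenes it maps to.
SEMANTIC_TAGS = {
    "cherry_blossom": {"keywords": ["cherry blossom", "sakura"], "scenes": ["cherry blossom viewing"]},
    "kids": {"keywords": ["kids", "children"], "scenes": ["parent-child"]},
}


def split_clauses(text: str) -> list[str]:
    """Split a review into short clauses on common punctuation."""
    parts = re.split(r"[.!?;,]+", text)
    return [p.strip() for p in parts if p.strip()]


def label_clauses(text: str) -> list[tuple[str, str, list[str]]]:
    """Return (clause, semantic tag, scenes) for every clause a rule matches."""
    labeled = []
    for clause in split_clauses(text):
        lowered = clause.lower()
        for tag, rule in SEMANTIC_TAGS.items():
            if any(keyword in lowered for keyword in rule["keywords"]):
                labeled.append((clause, tag, rule["scenes"]))
    return labeled
```

Each labeled clause carries its scenes with it, so it can be written to the graph as a point with an edge to the matching semantic tag.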
Finally, the data module is responsible for data integration and transmission. Its data sources include the information generated by the information annotation module as well as other data synchronized from outside. All of this data is finally written into the Nebula Graph database through the data synchronization framework.
Next, consider how data flows through the data module. Featured data, facility data, hotel data, and other types are divided into full and incremental data. Full data comes from DBs, message queues, interfaces, Hive tables, and so on. Hive tables are synchronized to Nebula directly by Hive jobs, while DBs, message queues, and interfaces are synchronized through the nebula-java client. On top of the Nebula Java client, Ctrip Hotels has implemented a configurable synchronization framework and built an information synchronization service. When the data source is an interface or a message queue, the data is assembled by implementing the corresponding interface, and the mapping between message fields and Nebula fields is configured to achieve synchronization. When the data source is a DB, data is fetched with SQL statements, and the mapping between DB fields and Nebula fields is configured to achieve synchronization. Incremental data comes mainly from message queues. Incremental jobs and full jobs share the same mapping relationships, so for incremental messages only the data-assembly interface needs to be implemented.
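The field-mapping idea at the heart of such a framework can be illustrated as follows. This is a sketch in Python rather than the nebula-java based framework the team built, and the field names and the `map_record` helper are hypothetical.

```python
# Hypothetical mapping from source message fields to graph property names.
FIELD_MAPPING = {
    "hotelId": "hotel_id",
    "tagName": "name",
    "updateTime": "updated_at",
}


def map_record(message: dict, mapping: dict[str, str], vid_field: str) -> tuple[str, dict]:
    """Turn one source record into (vertex id, properties) using the mapping."""
    if vid_field not in message:
        raise KeyError(f"missing id field: {vid_field}")
    # Fields that are not in the mapping are ignored; missing fields are skipped.
    props = {target: message[source] for source, target in mapping.items() if source in message}
    return str(message[vid_field]), props
```

Because the mapping is data rather than code, the same configuration can serve both the full job and the incremental job, which matches the shared-mapping design described above.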
In addition to data synchronization, Ctrip Hotels also has a data operation platform with the following four functions:
- Schema: consolidates Nebula schema operations. Nebula has its own graphical interface, but Ctrip Hotels routes schema operations through its own back end, which makes it convenient to configure permissions and record operation logs.
- Monitoring: during data import, Ctrip Hotels records the data, source, operation type, and topic information in a ClickHouse (CK) dashboard, which makes data statistics easy to view. When an error occurs, the Nebula query log in the error message combined with the data in ClickHouse makes it possible to quickly locate the specific point or edge, error type, data source, message ID, and other information.
- Dependency configuration: data assembly relies on the mapping configuration between data sources and Nebula fields mentioned above. The mapping configuration is read every time a job starts, and the Ctrip hotel technical team keeps it in the configuration back end, so that configuration changes take effect in real time.
- Retry mechanism: retries are applied when a data transmission error occurs. Message queues already have a redelivery mechanism, so the consumer only needs to return the corresponding flag when the data is abnormal. Interface and DB data sources have no retry mechanism for the time being; one will be added in the future.
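For sources without built-in redelivery, a retry wrapper is one possible shape for the planned mechanism. The sketch below is a generic pattern with exponential backoff, not the team's design; the function name and default values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.1) -> T:
    """Run operation, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

In a real job the wrapped operation would be the write to the graph, and the final failure would be recorded in monitoring rather than silently dropped.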
Schema definition and stress testing
Let's talk about the schema definition and stress test results.
The knowledge graph data model of the information center is divided into four large blocks: label information, basic information, UGC information, and GEO information. Label information includes facility labels, feature labels, long labels, UGC labels, and so on; basic information includes room type information, facility information, policy information, and so on; UGC information includes mined clause points, user points, and so on; GEO information includes POI, province, and city points, and so on. The hotel is the central point surrounded by these four blocks, and it has edges to each of the points mentioned above.
In addition, semantic label points are abstracted at the top layer, and every point associated with a scene has an edge directly to the corresponding semantic label. In this way, when a hotel ID and a scene ID are entered, all the relevant points can be filtered out quickly. This schema fits the business logic well, but some problems still appeared when it actually went online, the most typical being hot data.
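The lookup this schema enables can be mimicked with two in-memory adjacency maps. The sketch below is not nGQL and does not use Nebula; the edge types, IDs, and label names are invented to show how a hotel ID plus a scene label narrows down the information points.

```python
from collections import defaultdict

# Adjacency sets for two hypothetical edge types.
hotel_to_info: dict[str, set[str]] = defaultdict(set)  # hotel -> information points
info_to_label: dict[str, set[str]] = defaultdict(set)  # information point -> semantic labels


def add_info(hotel_id: str, info_id: str, labels: list[str]) -> None:
    """Attach an information point to a hotel and to its semantic labels."""
    hotel_to_info[hotel_id].add(info_id)
    info_to_label[info_id].update(labels)


def recall(hotel_id: str, scene_label: str) -> list[str]:
    """Information points of a hotel that carry the scene's semantic label."""
    return sorted(
        info for info in hotel_to_info.get(hotel_id, set())
        if scene_label in info_to_label.get(info, set())
    )


add_info("hotel:1", "facility:ski_storage", ["ski"])
add_info("hotel:1", "clause:kids_pool", ["parent-child"])
add_info("hotel:1", "picture:snow_view", ["ski"])
add_info("hotel:2", "facility:ski_rental", ["ski"])
```

Here `recall("hotel:1", "ski")` walks one hop from the hotel to its information points and a second hop to their labels, which is the two-hop pattern the schema is designed around.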
For example, type A may have millions of data points while type B has only thousands. If every A is associated with some B, the number of associations may reach hundreds of thousands or millions. At query time, a B point may then become a super node and hurt performance.
Several solutions were considered. The first is to increase the number of shards so that hot data is no longer concentrated on one or two machines, but this only spreads the hotspots out and does not eliminate them. The second is to add logical points: a hot point is split into multiple points with the same properties but different VIDs, and the edges that connect to the hot point are distributed among them by hashing or a similar method. This, however, makes the data import logic more complex and in some cases also affects the correctness of the original relationships between data. The third is follower reads. Under the Raft protocol only the leader handles requests; if followers could serve read requests when consistency requirements are low, then together with a larger number of shards the load on each machine in the cluster could be balanced much better. This would require technical support from the Nebula R&D team.
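The second option, splitting a hot point into logical copies, can be sketched as below. This only illustrates the hashing idea that was considered, not something the team ultimately shipped; the `#bucket` VID suffix and the function names are assumptions.

```python
import hashlib


def shadow_vid(hot_vid: str, neighbor_vid: str, buckets: int) -> str:
    """Pick one of `buckets` logical copies of a hot vertex for an edge."""
    # md5 gives a hash that is stable across processes, unlike built-in hash().
    digest = hashlib.md5(neighbor_vid.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{hot_vid}#{bucket}"


def all_shadow_vids(hot_vid: str, buckets: int) -> list[str]:
    """Every logical copy that a query for the hot vertex has to visit."""
    return [f"{hot_vid}#{b}" for b in range(buckets)]
```

The trade-off mentioned above is visible here: writes must compute the bucket for every edge, and any query that starts from the hot point has to fan out over all of its copies.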
In the end, with the support of the Ctrip system R&D team, the hotspot problem (excessive load on individual machines caused by hot data) was solved through configuration parameters: enabling the prefix-matching Bloom filter and increasing the block cache.
The table above shows the performance test results before going online. The cluster consisted of 6 graph services, 5 meta services, and 10 storage services. With 2.5 million+ points and more than 200 million edges, queries at about 10,000 QPS took about 20 ms, and about 40 ms while 100,000+ records were being written at the same time; queries at about 7,000 QPS took about 32 ms, and about 50 ms while 100,000+ incremental records were being written at the same time... Overall, the performance meets business needs and expectations.
In addition, Nebula's natively distributed architecture and active Chinese community were among the reasons Ctrip Hotels finally chose it as the infrastructure for building the information graph platform.
The above is Ctrip Hotels' practice in building its information knowledge graph.