This article was first published on DataFunTalk
Introduction: In recent years, graph data has become widely used in computing. As the volume of Internet data has grown exponentially, big data technology and graph data applications have grown quickly with it, and major Internet companies have invested heavily in graph data analysis and applications. Tencent Music also uses a knowledge graph to make search more intelligent. Today I will share Tencent Music's exploration of graph retrieval and its use in our business, covering the following parts:
- Introduction to Music Knowledge Graph
- Graph database selection
- Project Architecture Introduction
- Application example of knowledge graph search function
- Summary and Outlook
Introduction to Music Knowledge Graph
First, let me introduce the music knowledge graph.
1. Music data classification
Graph data is everywhere. Music-related business data falls mainly into the following three categories:
- Content: songs, variety shows, movies, albums, etc.;
- Singers: singer information and the relationships between singers, such as group membership and similarity;
- Relationships between singers and content: singing, writing lyrics, composing, etc.
2. Application Scenarios of Music Knowledge Graph
(1) Implementing complex search requirements
The music knowledge graph supports not only simple searches but also complex search requirements. For example, suppose you want to find Jay Chou's male-female duets. The query has to filter Jay Chou's songs so that the number of singers equals 2 and the other singer is female, and it may also take singer weights and similar factors into account. This is very complicated to implement in a traditional relational database but relatively simple with a knowledge graph: start from the singer Jay Chou, find all of his songs that are sung by exactly two people, and keep those where the other singer is female. The complex query takes only two hops.
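The two-hop duet query can be illustrated with a minimal in-memory sketch. This is not the production query and not graph query language; the data layout, the function names, and the placeholder singers ("Singer A" and so on) are all assumptions made for illustration.

```python
# Toy in-memory graph. All names and data below are illustrative assumptions.
SINGERS = {
    "Jay Chou": {"gender": "male"},
    "Singer A": {"gender": "female"},
    "Singer B": {"gender": "male"},
    "Singer C": {"gender": "female"},
}

# "sings" edges grouped per song: song title -> list of singers.
SONG_SINGERS = {
    "Solo Song": ["Jay Chou"],
    "Duet One": ["Jay Chou", "Singer A"],
    "Duet Two": ["Jay Chou", "Singer B"],
    "Trio Song": ["Jay Chou", "Singer A", "Singer C"],
    "Other Duet": ["Singer B", "Singer C"],
}


def songs_of(singer):
    """Hop 1: follow 'sings' edges from a singer to that singer's songs."""
    return [song for song, singers in SONG_SINGERS.items() if singer in singers]


def male_female_duets(singer):
    """Hop 2: from each song to its other singer, then filter on attributes."""
    results = []
    for song in songs_of(singer):
        singers = SONG_SINGERS[song]
        if len(singers) != 2:
            continue  # keep only songs sung by exactly two people
        other = singers[0] if singers[1] == singer else singers[1]
        if SINGERS[other]["gender"] == "female":
            results.append(song)
    return results


print(male_female_duets("Jay Chou"))
```

The same filter in a relational schema would typically need joins across song, singer, and song-singer tables plus a count per song, which is the complexity the graph traversal avoids.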
(2) Related recommendations for search results
Entity nodes in the graph can be found from the search keywords, associated nodes can be found from those entity nodes, and recommendations can be produced from the associated nodes. For example, when a user searches for Zhou Huajian, we can recommend Li Zongsheng through related information. This is hard to do with a search engine, but with the knowledge graph it takes only two hops: from the singer Zhou Huajian to his group (Superband), and from the group to another member, Li Zongsheng.
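As a rough sketch of this two-hop recommendation, the membership edges can be indexed in both directions and walked from singer to group to fellow members. The group name "Group X", the extra singer, and the function names are placeholders assumed for the example.

```python
from collections import defaultdict

# Illustrative membership edges (singer -> group). "Group X" is a placeholder.
MEMBER_OF = [
    ("Zhou Huajian", "Group X"),
    ("Li Zongsheng", "Group X"),
    ("Singer D", "Group Y"),
]

groups_of = defaultdict(set)   # hop 1 index: singer -> groups
members_of = defaultdict(set)  # hop 2 index: group -> singers
for singer, group in MEMBER_OF:
    groups_of[singer].add(group)
    members_of[group].add(singer)


def recommend(singer):
    """Two hops: singer -> group -> other members of that group."""
    related = set()
    for group in groups_of.get(singer, ()):
        related |= members_of[group]
    related.discard(singer)  # do not recommend the singer that was searched
    return sorted(related)


print(recommend("Zhou Huajian"))
```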
(3) Answers based on knowledge computation
Some answers can be computed directly from the knowledge graph, using the graph's association information, entity hypernym and hyponym information, and entity attribute information. For example, when a user searches for Andy Lau's songs from the 1990s, the knowledge graph only needs two conditions, the singer Andy Lau and songs from the 1990s, and combines them to get the result.
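Combining a relation constraint (the singer) with an attribute constraint (the decade) amounts to intersecting two result sets. The sketch below shows the idea on made-up data; the song titles, years, and function names are assumptions, not real catalogue entries.

```python
# Illustrative data only: song titles and years are made up for the sketch.
SUNG_BY = {
    "Song P": "Andy Lau",
    "Song Q": "Andy Lau",
    "Song R": "Andy Lau",
    "Song S": "Another Singer",
}
RELEASE_YEAR = {"Song P": 1988, "Song Q": 1993, "Song R": 1999, "Song S": 1995}


def songs_by_singer(singer):
    """Relation constraint: songs connected to the singer by a 'sings' edge."""
    return {song for song, who in SUNG_BY.items() if who == singer}


def songs_in_decade(start_year):
    """Attribute constraint: songs whose release year falls in the decade."""
    return {s for s, y in RELEASE_YEAR.items() if start_year <= y < start_year + 10}


def answer(singer, decade_start):
    # Combine the two constraints by intersecting their result sets.
    return sorted(songs_by_singer(singer) & songs_in_decade(decade_start))


print(answer("Andy Lau", 1990))
```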
3. Advantages and disadvantages of search recall and knowledge graph recall
Search recall is based on text matching. After recall it requires relevance ranking, so both the recall process and the ranking strategy are relatively complex. It also lacks precision and may over-recall.
Knowledge graph recall queries the relationships between entities. It can recall precisely, and the recall process can be very short: just a few graph query statements. In addition, a knowledge graph has some reasoning ability.
Graph database selection
To implement graph queries, we first had to select a graph database.
When selecting a graph database, the following factors need to be considered:
- Open source and free of charge: considering cost and control over the source code, we chose to embrace open source;
- A scalable distributed architecture: the backend must scale as the data volume grows and shrinks;
- High performance: multi-hop queries at millisecond level, so that online responses are at millisecond level;
- Support for data volumes in the hundreds of billions;
- Support for batch import and export of data.
We compared 8 databases, analyzed their advantages and disadvantages, and grouped them into three categories:
- The first category, represented by Neo4j: the open-source version is stand-alone only. Its performance is excellent, but it does not meet the distributed scalability requirement. The commercial version of Neo4j supports distributed deployment, but it is paid.
- The second category, including JanusGraph and HugeGraph: these support distributed scaling. What they have in common is that a general graph semantic interpretation layer is added on top of an existing storage system. Because the storage layer is implemented by an external database, they are limited by its architecture and do not support computation pushdown, which results in poor performance.
- The third category, represented by Nebula Graph: it implements its own storage layer and supports computation pushdown, which optimizes efficiency and improves performance considerably.
The figure above shows the overall performance test data. We tested database performance with 1-degree neighbor queries (vertices directly connected to a given vertex), 2-degree neighbor queries, and common neighbor queries. Nebula Graph was far ahead of the competing products in both stand-alone and cluster performance. Considering performance, community activity, release cadence, and language versatility, we finally chose Nebula Graph as the graph database for our project.
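For readers unfamiliar with the three benchmark queries, here is a small sketch of what each one computes on an undirected adjacency structure. It assumes that "2-degree neighbors" means vertices reachable in two hops that are neither the start vertex nor its direct neighbors; a benchmark may count them differently. The function names and the sample edges are illustrative.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) edge pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def one_degree(adj, v):
    """Vertices directly connected to v."""
    return set(adj.get(v, ()))


def two_degree(adj, v):
    """Vertices two hops from v, excluding v and its direct neighbors."""
    first = one_degree(adj, v)
    second = set()
    for n in first:
        second |= adj[n]
    return second - first - {v}


def common_neighbors(adj, a, b):
    """Vertices directly connected to both a and b."""
    return one_degree(adj, a) & one_degree(adj, b)


adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")])
print(one_degree(adj, "a"), two_degree(adj, "a"), common_neighbors(adj, "a", "c"))
```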
Project Architecture Introduction
1. Online layer
The online layer contains the following modules:
- storaged stores the actual data, including vertex data, edge data, and the related indexes;
- metad stores the meta information of the graph data, such as the database schema and additions to it;
- Nebula graphd is the logic layer for data computation. It is stateless and can be scaled out horizontally. Its internal execution engine carries out the whole query process;
- Nebula proxy is a module we added. As the proxy layer for the whole Nebula cluster, it accepts external commands and operates on graph data, including graph queries, updates, and deletions. It is also responsible for protocol conversion, cluster heartbeat, and route registration.
Because a single cluster needs to rebuild its data from time to time, and to guard against data center failures, we use dual clusters to keep the whole service available.
Online requests are processed as follows. After cgi receives a user request, it passes the request to the broker module. The broker matches the request against templates to generate the corresponding graph query statement, gets the available clusters from ZooKeeper, and sends the query statement to nebula proxy. For graph recall, nebula proxy passes the query statement to nebula graphd, which executes it, and the result is returned to the broker layer. The broker layer adds the summaries needed for front-end display and returns the data to the front end.
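The dual-cluster part of this request flow can be sketched as a broker that walks the registered clusters in order and falls through to the next one on failure. Everything here is a simplified stand-in: the `Cluster` and `Broker` classes, the in-memory registry used in place of ZooKeeper, and the `execute` method used in place of the proxy and graphd are assumptions, not the actual service interfaces.

```python
class Cluster:
    """Stand-in for one Nebula cluster reached through its proxy (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def execute(self, statement):
        # Stand-in for proxy -> graphd execution of a graph query statement.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return {"cluster": self.name, "statement": statement, "rows": []}


class Broker:
    """Picks a cluster from the registry and decorates the result for display."""

    def __init__(self, registry):
        # registry: ordered list of registered clusters; in the real system
        # this information would come from ZooKeeper.
        self.registry = registry

    def handle(self, statement):
        for cluster in self.registry:
            try:
                result = cluster.execute(statement)
            except ConnectionError:
                continue  # this cluster failed, try the other one
            # Add a display summary before returning to the front end.
            result["summary"] = f"{len(result['rows'])} result(s)"
            return result
        raise RuntimeError("no cluster available")


broker = Broker([Cluster("cluster-a", healthy=False), Cluster("cluster-b")])
result = broker.handle("EXAMPLE QUERY")
print(result["cluster"], result["summary"])
```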
2. Offline layer
Music data includes both real-time additions, such as newly released albums, and full data refreshes, so we chose a solution with a full data pipeline plus an incremental data pipeline.
(1) Full data generation
Much of the music data is stored in databases. After the data is dumped from the DB, the IndexBuilder module converts it into the required format to form the full source data. The full source data is uploaded to HDFS, and a Spark job converts it into the data files required by the Nebula Graph storage layer. When IndexMgr detects that new full data has been generated, it downloads the data files and loads the full data into NebulaProxy. This completes full data generation.
(2) Real-time data generation
At regular intervals, usually every few minutes, the business data modified during that interval is dumped and converted into a specific format to form the incremental source data. The incremental source data is stored in Kafka, which allows data to be resent and recovered. DataSender pulls the latest data from the Kafka queue and sends it to the cluster through NebulaProxy, at which point the incremental data takes effect.
This raises the problem of incremental replay. Dumping the full data takes a long time, possibly several hours, and new incremental data keeps arriving during the dump. The incremental data from this period may be missing from the full data, so a replay of historical increments is required: all data added after T0 (the start time of the full synchronization) needs to be resent.
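The replay after T0 can be sketched as loading the full snapshot and then re-applying every increment whose timestamp is at or after T0, in time order. The event layout (`ts`, `op`, `key`, `value`) and the function name are assumptions for illustration; the sketch also assumes that applying an increment twice is harmless, since the dump may already contain some of the changes made after T0.

```python
def rebuild(full_snapshot, increments, t0):
    """Load the full snapshot, then replay all increments with ts >= t0."""
    state = dict(full_snapshot)
    replay = sorted(
        (event for event in increments if event["ts"] >= t0),
        key=lambda event: event["ts"],
    )
    for event in replay:
        if event["op"] == "upsert":
            state[event["key"]] = event["value"]  # idempotent overwrite
        elif event["op"] == "delete":
            state.pop(event["key"], None)         # idempotent delete
    return state


# The dump started at T0 = 100, so changes made from then on may be missing.
snapshot = {"album1": "v1", "album2": "v1"}
increments = [
    {"ts": 90, "op": "upsert", "key": "album2", "value": "v0"},   # before T0
    {"ts": 105, "op": "upsert", "key": "album3", "value": "v1"},  # during dump
    {"ts": 120, "op": "upsert", "key": "album1", "value": "v2"},  # during dump
]
state = rebuild(snapshot, increments, t0=100)
print(state)
```

Increments older than T0 are skipped because the snapshot already reflects them; replaying them could overwrite newer values with stale ones.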
Application example of knowledge graph search function
1. Configured recall
The conventional recall method has four steps: generate a query statement from the query, obtain the recall results, shuffle them according to a strategy, and display the results.
The problem is that all four steps have to be repeated every time a new recall strategy is added, so recall is not flexible enough and every change means a lot of work for the business.
We added a new recall method based on query templates: the corresponding query statement is generated from a template, and some common shuffling strategies are preset. For example, we configure a template for school songs. When a user queries a school song, we extract the school name and fill it into the template to form a complete graph query statement. Some shuffling and insertion strategies are preset as well, so going online only requires filling in the corresponding shuffling parameters. The advantage is that recall is more flexible, and the cost of bringing a new recall online is small compared with search.
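A configured recall of this kind might look like the following sketch: each template pairs a query pattern with a query skeleton and preset shuffling parameters, so adding a recall is a configuration change rather than new code. The template fields, the pattern, the edge name `has_school_song`, and the shuffling parameters are all hypothetical, and the query is kept as a plain dictionary rather than a real graph query statement.

```python
import re

# Hypothetical template configuration. Field names are assumptions.
TEMPLATES = [
    {
        "name": "school_song",
        "pattern": re.compile(r"^(?P<school>.+?)\s+school song$", re.IGNORECASE),
        "query": {"start_tag": "school", "edge": "has_school_song"},
        "shuffle": {"insert_position": 1, "max_results": 3},
    },
]


def build_recall(user_query):
    """Match the query against the templates and fill in the extracted slot."""
    for template in TEMPLATES:
        match = template["pattern"].match(user_query.strip())
        if match:
            query = dict(template["query"], start_name=match.group("school"))
            return {
                "template": template["name"],
                "query": query,
                "shuffle": template["shuffle"],
            }
    return None  # no template matched; fall back to ordinary search


print(build_recall("Example University school song"))
```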
2. Business applications
We finally launched the services shown above, supporting various search scenarios.
- School song search: when a user searches for a university name together with "school song", the school song of that school is recalled;
- Singer scenario: when a user searches for a singer's name, the singer's group is returned, along with singers who have collaborated with them on well-known songs, etc.;
- Film and TV scenario: when a user searches for film and TV theme songs, ending songs, insert songs, etc., the corresponding film and TV songs are returned.
Summary and Outlook
Today's talk went from graph database selection, through schema classification and definition and the design of the project architecture layers, to knowledge graph search. The conclusion is that graph data lets expert experience be integrated into the graph intelligently, and a knowledge base built with graph data technology enhances functions such as retrieval, recommendation, and visualization. Tencent Music has put knowledge graph technology to good use, which has greatly improved the search experience and increased user stickiness. Let's embrace AI technology and use it to make life better.
Q&A highlights
Q: Is audio information considered in the search process?
A: Yes, it is considered. We can use audio recognition technology to first classify songs into broad categories, such as folk, rock, and pop, and then use this audio information for recall when searching online. In addition, we cooperate with the QQ Music Tianjin Laboratory, for example on listening to the current Jinkeshi music, where our search is also used in the background, again recalling through audio information.
Q: Where do semantic search results rank? How are they ordered relative to keyword search results?
A: First, we use an algorithm to mine the similarity between each semantic tag and each song. A semantic search recalls by semantic tag, and results with high semantic similarity are ranked first. Of course, there are some odd cases. For example, Zhao Lei has a song called Folk Ballad, so "folk ballad" is both a song and a semantic tag. In the ranking we put the song first, because it is after all a song by a relatively well-known singer, and put the semantic results after it. On top of that, an algorithm-based ranking model in the upper layer moves results with high click counts toward the front for users.
Q: Does memory double because of double buffering when the full index version is switched?
A: In fact, we do not use double buffering during index switching. We switch the replicas under each shard one by one. During the switch the old index is unloaded dynamically, so no additional memory is occupied.
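The replica-by-replica switch in this answer can be sketched as follows. It assumes that each replica unloads its old index before loading the new one, so it never holds two copies, and that the other replicas of the shard keep serving in the meantime. The data layout and function name are hypothetical.

```python
def rolling_switch(shards, new_version):
    """Switch replicas one at a time, unloading the old index before loading."""
    log = []
    for shard_id, replicas in shards.items():
        for replica in replicas:
            replica["index"] = None  # unload the old index first (frees memory)
            # Count replicas of this shard that can still serve queries.
            serving = sum(1 for r in replicas if r["index"] is not None)
            log.append((shard_id, replica["name"], serving))
            replica["index"] = new_version  # then load the new index
    return log


shards = {
    "shard-0": [{"name": "r0", "index": "v1"}, {"name": "r1", "index": "v1"}],
    "shard-1": [{"name": "r0", "index": "v1"}, {"name": "r1", "index": "v1"}],
}
log = rolling_switch(shards, "v2")
print(log)
```

With two replicas per shard, one replica is always serving during the switch, at the cost of reduced capacity for that shard while the other replica reloads.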
Q: For cross truncation, is it better to truncate when building the index or to truncate online?
A: We truncate online. Truncating offline would lose data, and there would be no way to backtrack. Truncation is also done per shard, and vector retrieval can likewise be sharded for parallel retrieval.
That's all for today's sharing, thank you all.