
Preface

Full-text search over local data plays an important role in client-side IM scenarios. Full-text search is a technique for locating the positions of a given word within a large number of documents. In a relational database this could traditionally only be done with LIKE, which has several drawbacks:

  1. It cannot use a database index, so the whole table must be scanned, which performs poorly
  2. The matching is crude: only fuzzy matching at the beginning or end of the text is possible, and complex search requirements cannot be expressed
  3. There is no relevance score between a document and the search criteria

NetEase Yunxin IM implements full-text search of local data on iOS, Android, and desktop based on libraries such as SQLite, but this feature has been missing on Web and Electron. On the Web, due to browser environment restrictions, the only usable local storage database is IndexedDB, which is out of scope for now. On Electron, although the Chromium kernel is also built in, Node.js is available, so there are more options.

Let's first look specifically at how to implement full-text retrieval.

Basic technical knowledge points

To implement full-text retrieval, two concepts are essential:

  • inverted index
  • word segmentation

These two techniques are the core of full-text retrieval and also its main difficulty, and their implementation is relatively involved. Before discussing the implementation of full-text indexing, let's look at each of them in detail.

Inverted index

First, a brief introduction to the inverted index, contrasted with the forward index:

  • Forward index: the document's unique ID is the key, and the document's content is the record
  • Inverted index: each word in the document content is the key, and the IDs of the documents containing that word are the record

[Figure 1: forward index vs. inverted index]

Take the inverted index library search-index as a concrete example. Each message object in NetEase Yunxin IM has idClient as its unique ID. Suppose we input "今天天气真好" ("the weather is really nice today") and split it into the individual characters "今", "天", "天", "气", "真", "好" (word segmentation is covered in detail below), then write it into the library through search-index's PUT method. The stored structure looks like this:

[Figure 2: inverted index storage structure in search-index]

As shown in the figure, you can see the structure of the inverted index: the key is a single Chinese character after segmentation, and the value is an array of the idClients of the message objects containing that character. Besides the above, search-index also stores other data such as weights, counts, and forward-index data, which support sorting, paging, per-field search, and other features; this article will not elaborate on them.
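The structure in the figure can be sketched in a few lines of plain JavaScript. This is a toy stand-in for what search-index maintains internally (the function name `buildInvertedIndex` is ours, and the real library additionally tracks weights, counts, and forward-index data):

```javascript
// Toy stand-in for the structure search-index maintains: each segmented
// character maps to the idClients of the messages that contain it.
function buildInvertedIndex(messages, tokenize) {
  const index = new Map(); // token -> Set of idClient
  for (const { idClient, text } of messages) {
    for (const token of tokenize(text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(idClient);
    }
  }
  return index;
}

const byChar = (text) => Array.from(text); // single-character split, as above

const index = buildInvertedIndex(
  [
    { idClient: 'msg-1', text: '今天天气真好' },
    { idClient: 'msg-2', text: '天气预报' },
  ],
  byChar
);
console.log([...index.get('天')]); // ['msg-1', 'msg-2']
```

Querying a character is then a single Map lookup instead of a scan over all messages.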

Word segmentation

Word segmentation divides the original message content into words or phrases based on semantics. Considering the quality of Chinese word segmentation and the need to run on Node.js, we chose Nodejieba as the basic segmentation library. The following is the flowchart of jieba word segmentation:

[Figure 3: jieba word segmentation flowchart]

Take "去北京大学玩" ("go to Peking University to play") as an example, and analyze the most important modules:

Load dictionary

When initialized, jieba first loads its dictionary, the general content of which is as follows:

[Figure 4: excerpt of the jieba dictionary]

Building a prefix dictionary

Next, a prefix dictionary is constructed from the dictionary, structured as follows:

[Figure 5: prefix dictionary structure]

Among them, "北京大" is a prefix of "北京大学"; its word frequency is 0, which facilitates the later construction of the DAG.
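A minimal sketch of prefix-dictionary construction. The word frequencies below are illustrative, not the real jieba dictionary values; every proper prefix of a dictionary word that is not itself a word is inserted with frequency 0:

```javascript
// Build a jieba-style prefix dictionary from { word: frequency } entries.
function buildPrefixDict(wordFreqs) {
  const prefixDict = new Map(Object.entries(wordFreqs));
  for (const word of Object.keys(wordFreqs)) {
    for (let i = 1; i < word.length; i++) {
      const prefix = word.slice(0, i);
      // Only add prefixes that are not already real dictionary words.
      if (!prefixDict.has(prefix)) prefixDict.set(prefix, 0);
    }
  }
  return prefixDict;
}

const prefixDict = buildPrefixDict({
  '去': 123402, '北': 17860, '京': 6583, '大': 144099, '学': 17482, '玩': 4207,
  '北京': 34488, '大学': 20025, '北京大学': 9268,
});
console.log(prefixDict.get('北京大')); // 0: prefix of 北京大学, not a word itself
console.log(prefixDict.get('北京'));   // 34488: a real dictionary word
```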

Build a DAG graph

DAG is short for Directed Acyclic Graph: a graph with directed edges and no cycles.

Based on the prefix dictionary, the input is scanned for candidate segmentations. "去" has no multi-character match, so there is only one way to segment it; starting from "北" there are three candidates: "北", "北京", and "北京大学"; from "京" there is again only one; from "大" there are two: "大" and "大学"; and "学" and "玩" each have only one. In this way the candidate words starting at each character are obtained, and the DAG is shown below:

[Figure 6: DAG of 去北京大学玩]
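The scan that produces this DAG can be sketched as follows (illustrative frequencies again; `dag[i]` holds the end index of every dictionary word starting at position `i`):

```javascript
// Prefix dictionary with made-up frequencies; '北京大' is a prefix-only entry.
const prefixDict = new Map(Object.entries({
  '去': 123402, '北': 17860, '京': 6583, '大': 144099, '学': 17482, '玩': 4207,
  '北京': 34488, '大学': 20025, '北京大学': 9268,
  '北京大': 0,
}));

function buildDAG(sentence, prefixDict) {
  const dag = [];
  for (let i = 0; i < sentence.length; i++) {
    const ends = [];
    let j = i;
    let fragment = sentence[i];
    // Extend the fragment while it is still a known prefix or word.
    while (j < sentence.length && prefixDict.has(fragment)) {
      if (prefixDict.get(fragment) > 0) ends.push(j); // real word, not prefix-only
      j += 1;
      fragment = sentence.slice(i, j + 1);
    }
    if (ends.length === 0) ends.push(i); // out-of-vocabulary char stands alone
    dag.push(ends);
  }
  return dag;
}

const dag = buildDAG('去北京大学玩', prefixDict);
console.log(dag[1]); // candidates starting at '北': [1, 2, 4] → 北 / 北京 / 北京大学
```

The frequency-0 prefix entry is what lets the scan keep extending past "北京大" to discover "北京大学" without treating "北京大" itself as a word.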

Maximum probability path calculation

All the paths through the above DAG are as follows:

  1. 去/北/京/大/学/玩
  2. 去/北京/大/学/玩
  3. 去/北京/大学/玩
  4. 去/北京大学/玩

Each node carries a weight; for words in the prefix dictionary, the weight is the word frequency. So the problem becomes finding the path that maximizes the total weight of the whole sentence.

This is a typical dynamic programming problem. First of all, we confirm the two conditions of dynamic programming:

  • Overlapping subproblems: for a node i and its possible successors j and k:
The weight of any path that reaches j through i = the weight of the path up to i + the weight of j, i.e. R(i -> j) = R(i) + W(j)
The weight of any path that reaches k through i = the weight of the path up to i + the weight of k, i.e. R(i -> k) = R(i) + W(k)

That is, for j and k sharing the common predecessor i, the weight of the path up to i would otherwise be computed repeatedly.

  • Optimal substructure: suppose the optimal path of the whole sentence is Rmax, the end node is x, and its possible predecessor nodes are i, j, and k. Then:
Rmax = max(Rmax_i, Rmax_j, Rmax_k) + W(x)

So the problem reduces to solving Rmax_i, Rmax_j, and Rmax_k: the optimal solution of each substructure is part of the global optimum.

Following this, the optimal path is finally computed as "去/北京大学/玩".
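The dynamic programming above can be sketched like this. jieba computes the route right to left, and using log(freq / total) turns products of probabilities into sums; the frequencies and the hardcoded DAG here are illustrative, not the real jieba values:

```javascript
const prefixDict = new Map(Object.entries({
  '去': 123402, '北': 17860, '京': 6583, '大': 144099, '学': 17482, '玩': 4207,
  '北京': 34488, '大学': 20025, '北京大学': 9268,
}));
const total = [...prefixDict.values()].reduce((a, b) => a + b, 0);

// DAG of '去北京大学玩': dag[i] = end indices of candidate words starting at i.
const dag = [[0], [1, 2, 4], [2], [3, 4], [4], [5]];

function calcRoute(sentence, dag, prefixDict, total) {
  const n = sentence.length;
  const logTotal = Math.log(total);
  // route[i] = [best log-prob from position i to the end, end index of first word]
  const route = new Array(n + 1);
  route[n] = [0, 0];
  for (let i = n - 1; i >= 0; i--) {
    let best = [-Infinity, i];
    for (const j of dag[i]) {
      const word = sentence.slice(i, j + 1);
      const freq = prefixDict.get(word) || 1; // smooth unseen words
      const score = Math.log(freq) - logTotal + route[j + 1][0];
      if (score > best[0]) best = [score, j];
    }
    route[i] = best;
  }
  // Walk the route left to right to emit the segmentation.
  const words = [];
  for (let i = 0; i < n; ) {
    const j = route[i][1];
    words.push(sentence.slice(i, j + 1));
    i = j + 1;
  }
  return words;
}

const words = calcRoute('去北京大学玩', dag, prefixDict, total);
console.log(words.join('/')); // 去/北京大学/玩
```

Because route[i] is computed once and reused by every predecessor, the overlapping-subproblem cost disappears, exactly as the two conditions above require.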

HMM Hidden Markov Model

For out-of-vocabulary words, jieba falls back to an HMM (Hidden Markov Model). It treats segmentation as a sequence labeling problem: the sentence is the observation sequence, and the segmentation result is the state sequence. The jieba author mentioned in an issue that the HMM parameters were trained on the 1998 People's Daily segmentation corpus available online, an MSR corpus, and TXT novels he collected himself and segmented with ICTCLAS, with word frequencies finally counted by a Python script.

The model consists of a five-tuple and has two basic assumptions.

Quintuple:

  1. Set of state values
  2. Set of observations
  3. Initial state probabilities
  4. State transition probabilities
  5. State emission probabilities

Basic assumptions:

  1. Homogeneity assumption: the state of the hidden Markov chain at any time t depends only on its state at time t-1, and is independent of the states and observations at any other time, and of t itself.
  2. Observation independence assumption: the observation at any time depends only on the state of the Markov chain at that time, and is independent of other observations and states.

The state value set is {B: begin, M: middle, E: end, S: single}, representing the position of each character within a word: B is the beginning, M the middle, E the end, and S a single character that forms a word by itself.

The observation set is the set of characters in the input sentence.

The initial state probability gives the probability that the first character of a sentence belongs to each of the four states B, M, E, S. The probabilities of E and M are 0, since the first character can only be B or S, which matches reality.

The state transition probability is the probability of transitioning from one state to the next, satisfying the homogeneity assumption. The structure can be represented by a nested object:

P = {
    B: {E: -0.510825623765990, M: -0.916290731874155},
    E: {B: -0.5897149736854513, S: -0.8085250474669937},
    M: {E: -0.33344856811948514, M: -1.2603623820268226},
    S: {B: -0.7211965654669841, S: -0.6658631448798212},
}

P['B']['E'] is the probability of transitioning from state B to state E (the structure stores the natural logarithm of the probability, which is convenient for computation), about 0.6. Similarly, P['B']['M'] says the probability of the next state being M is about 0.4. In other words, when a character begins a word, the next character is more likely to end the word than to sit in its middle, which is intuitive: two-character words are more common than longer words.
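A quick sanity check that the stored numbers really are natural-log probabilities, and that the two transitions out of B (the only ones possible) sum to 1:

```javascript
const pBE = Math.exp(-0.510825623765990); // P['B']['E'] back to a probability
const pBM = Math.exp(-0.916290731874155); // P['B']['M'] back to a probability
console.log(pBE.toFixed(2)); // "0.60"
console.log(pBM.toFixed(2)); // "0.40"
console.log((pBE + pBM).toFixed(2)); // "1.00" — B can only be followed by E or M
```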

The state emission probability is the probability of observing a given character while in a given state, satisfying the observation independence assumption. It can likewise be represented by a nested object:

P = {
    B: {'突': -2.70366861046, '肃': -10.2782270947, '适': -5.57547658034},
    M: {'要': -4.26625051239, '合': -2.1517176509, '成': -5.11354837278},
    S: {……},
    E: {……},
}

P['B']['突'] means that, in state B, the log probability of observing the character "突" is -2.70366861046.

Finally, the Viterbi algorithm takes the observation sequence as input, uses the initial state probabilities, transition probabilities, and emission probabilities as parameters, and outputs the state sequence with the maximum probability (i.e., the segmentation result). This article will not expand on the Viterbi algorithm; interested readers can look it up on their own.
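For the curious, here is a compact Viterbi sketch over the BMES states. The probabilities below are made up for illustration and are not the real jieba-trained parameters:

```javascript
const MIN = -3.14e100; // finite stand-in for log(0), as jieba's source uses

function viterbi(obs, states, startP, transP, emitP) {
  const V = [{}]; // V[t][s] = best log-prob of any path ending in state s at step t
  let path = {};
  for (const s of states) {
    V[0][s] = (startP[s] ?? MIN) + (emitP[s]?.[obs[0]] ?? MIN);
    path[s] = [s];
  }
  for (let t = 1; t < obs.length; t++) {
    V.push({});
    const newPath = {};
    for (const s of states) {
      let best = -Infinity, bestPrev = states[0];
      for (const prev of states) {
        const score = V[t - 1][prev] + (transP[prev]?.[s] ?? MIN) + (emitP[s]?.[obs[t]] ?? MIN);
        if (score > best) { best = score; bestPrev = prev; }
      }
      V[t][s] = best;
      newPath[s] = [...path[bestPrev], s];
    }
    path = newPath;
  }
  const t = obs.length - 1;
  const last = V[t].E >= V[t].S ? 'E' : 'S'; // a word must end in E or S
  return path[last];
}

const L = Math.log;
const states = ['B', 'M', 'E', 'S'];
const startP = { B: L(0.6), S: L(0.4) }; // a sentence starts with B or S only
const transP = {
  B: { E: L(0.6), M: L(0.4) },
  M: { E: L(0.5), M: L(0.5) },
  E: { B: L(0.5), S: L(0.5) },
  S: { B: L(0.5), S: L(0.5) },
};
const emitP = { // toy emission table
  B: { '天': L(0.6), '好': L(0.2) },
  M: { '气': L(0.1) },
  E: { '气': L(0.6) },
  S: { '好': L(0.5), '天': L(0.2) },
};

const tags = viterbi('天气好', states, startP, transP, emitP);
console.log(tags.join('')); // "BES" → segmented as 天气 / 好
```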

These two techniques are the technical core of our architecture. Based on them, we improved the Electron-side architecture of NetEase Yunxin IM.

NetEase Yunxin IM Electron end architecture

Detailed architecture diagram

Considering that full-text search is only one feature of IM, in order not to affect other IM features and to allow faster iteration, we adopted the following architecture:

[Figure 7: previous architecture (right) vs. new architecture (left)]

On the right is the previous architecture. The underlying store is IndexedDB, with a read module and a write module above it:

  • When the user sends a message, syncs messages, deletes a message, or receives a message, the message object is written to IndexedDB;
  • When the user queries a keyword, all message objects are read from IndexedDB and traversed, and indexOf is used to check whether each one contains the keyword (similar to LIKE).

As a result, when the data volume is large, queries are very slow.

On the left is the new architecture, which adds word segmentation and an inverted index library. It has no impact on the previous scheme; it simply adds a layer in front of it:

  • When the user sends a message, syncs messages, deletes a message, or receives a message, the content of each message object is segmented and then written to the inverted index library;
  • When the user queries a keyword, the idClients of the matching messages are first looked up in the inverted index library, and the corresponding message objects are then fetched from IndexedDB by idClient and returned.
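The two paths above can be sketched with in-memory Maps standing in for the inverted index library and for IndexedDB; the function names and the per-character tokenizer are illustrative, not the real SDK surface:

```javascript
const invertedIndex = new Map(); // token -> Set of idClient
const messageStore = new Map();  // idClient -> message object (the "IndexedDB" side)

const byChar = (s) => Array.from(s); // stand-in for the real segmenter

// Write path: segment the message content, then index its idClient per token.
function onMessage(message) {
  messageStore.set(message.idClient, message);
  for (const token of byChar(message.text)) {
    if (!invertedIndex.has(token)) invertedIndex.set(token, new Set());
    invertedIndex.get(token).add(message.idClient);
  }
}

// Read path: keyword -> idClients via the inverted index -> full messages.
// (No adjacency check here; a real implementation would also verify that
// the matched characters actually occur as a phrase.)
function search(keyword) {
  const postings = byChar(keyword).map((t) => invertedIndex.get(t) ?? new Set());
  if (postings.length === 0) return [];
  let ids = [...postings[0]];
  for (const set of postings.slice(1)) ids = ids.filter((id) => set.has(id));
  return ids.map((id) => messageStore.get(id));
}

onMessage({ idClient: 'a', text: '今天天气真好' });
onMessage({ idClient: 'b', text: '天气预报' });
console.log(search('天气').map((m) => m.idClient)); // ['a', 'b']
console.log(search('真好').map((m) => m.idClient)); // ['a']
```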

Architecture advantages

The scheme has the following four advantages:

  • Fast: the inverted index implemented via search-index greatly speeds up search.
  • Cross-platform: since search-index, like Chromium's IndexedDB implementation, is based on LevelDB, search-index also supports the browser environment, which opens the possibility of full-text search on the Web.
  • Independent: the inverted index library is separated from the IM main business database, IndexedDB. When IndexedDB writes data, it automatically notifies the write module of the inverted index library, which segments the message content, pushes it into a write queue, and inserts it into the inverted index library in order. When a full-text search is needed, the read module of the inverted index library quickly finds the idClients of the message objects matching the keyword, and the message objects are then fetched from IndexedDB by idClient and returned.
  • Flexible: full-text search is provided as a plug-in that exposes a higher-order function: it wraps IM and returns a new IM extended by inheritance. Thanks to JavaScript's prototype mechanism, any method missing on the new IM is automatically looked up along the prototype chain (i.e., on the old IM), so the plug-in can focus on implementing its own methods without caring about the specific IM version. The plug-in also supports custom segmentation functions to cover different users' segmentation needs.
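The plug-in shape described in the last bullet can be sketched as a higher-order function; the class and method names here are hypothetical, not the real Yunxin SDK API:

```javascript
// Takes an IM class, returns a subclass extended with full-text search.
function withFullTextSearch(IM, { tokenize = (s) => Array.from(s) } = {}) {
  return class extends IM {
    constructor(...args) {
      super(...args);
      this._index = new Map(); // token -> Set of idClient
    }
    indexMessage({ idClient, text }) {
      for (const token of tokenize(text)) {
        if (!this._index.has(token)) this._index.set(token, new Set());
        this._index.get(token).add(idClient);
      }
    }
    searchIds(keyword) {
      const sets = tokenize(keyword).map((t) => this._index.get(t) ?? new Set());
      return [...(sets[0] ?? [])].filter((id) => sets.every((s) => s.has(id)));
    }
  };
}

// Any method missing on the subclass resolves up the prototype chain to IM.
class IM {
  sendText(text) { return `sent: ${text}`; } // stands in for the existing IM API
}
const SearchableIM = withFullTextSearch(IM);
const im = new SearchableIM();
im.indexMessage({ idClient: 'x1', text: '今天天气真好' });
console.log(im.searchIds('天气')); // ['x1']
console.log(im.sendText('hi'));   // 'sent: hi' — inherited from IM
```

Because the wrapper only adds methods and defers everything else to the prototype chain, it stays decoupled from the IM version it wraps, and `tokenize` is injectable to support custom segmenters.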

Effect

With the above architecture in place, our tests show that at a data volume of 200,000 messages, search time dropped from over ten seconds to under one second, roughly a 20x speedup.

Summary

Above, we implemented full-text search of chat messages for the NetEase Yunxin IM SDK on Electron based on Nodejieba and search-index, speeding up chat history search. Of course, we will continue to optimize in several directions, for example:

  • Write performance: in practice, when the data volume is large, LevelDB, which search-index depends on, hits a write performance bottleneck, with high CPU and memory consumption. Our investigation suggests SQLite's write performance is notably better: write speed appears to scale only with the data volume, and CPU and memory stay relatively stable. We may therefore consider compiling SQLite into a Node native module to replace search-index in the future.
  • Scalability: the current decoupling from the business logic is not thorough enough, and some business fields are stored in the inverted index library. In the future, the inverted index library could look up a message object's idClient by keyword alone, with business-specific filtering done in IndexedDB, completely decoupling the inverted index library from the main business database.

That's all for this article. Follow us for more technical content; I hope this sharing is helpful to you.

About the author

Li Ning is a senior front-end development engineer at NetEase Yunxin. He is responsible for the application development, component development and solution development of NetEase Yunxin's audio and video IM SDK. He has rich practical experience in React, PaaS component design, and multi-platform development and compilation. If you have any questions, please leave a message for exchange.

For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.

