IM development dry goods sharing: NetEase Yunxin IM client's chat message full-text retrieval technology practice

1. Introduction

In an IM client, full-text search over local data plays an important role. Its most common uses are searching chat history and contacts, as shown in the figure below.

▲ WeChat chat history search function

With full-text search, features such as chat history search and contact search let users find content far more efficiently. Without it, users have to scroll through their history by hand, which badly hurts the user experience.

This article describes how NetEase Yunxin implemented full-text search in its IM client. We hope it gives you some ideas.

Learning and exchange:

5 groups for instant messaging/push technology development and communication: 215477170 [recommended]
Introduction to Mobile IM Development: "One entry is enough for novices: Develop mobile IM from scratch"
Open source IM framework source code: https://github.com/JackJiang2011/MobileIMSDK

(This article was published simultaneously at: http://www.52im.net/thread-3651-1-1.html)

2. About the author

Li Ning: NetEase Yunxin senior front-end development engineer, responsible for the application development, component development and solution development of the audio and video IM SDK. He has rich practical experience in React, PaaS component design, and multi-platform development and compilation.

3. Related articles

IM client full-text search related articles:

"WeChat mobile phone end of the local data full-text search optimization road"
"WeChat team sharing: Full-text search of multi- characters on WeChat mobile terminal"

Other articles shared by the NetEase technical team:

"NetEase Video Cloud Technology Sharing: Quick Start of Audio Processing and Compression Technology"
"Some optimization ideas for NetEase Yunxin real-time video live broadcast at the TCP data transmission layer"
"NetEase Yunxin Technology Sharing: Summary of the Practice of the Ten Thousands of People Chatting Technical Solutions in IM"
"Web-side instant messaging practice dry goods: How to make your WebSocket disconnect and reconnect faster?"
"Behind the glamorous bullet message: the chief architect of NetEase Yunxin shares the technical practice of the billion-level IM platform"

4. What is full-text search

Full-text search means finding where a given word occurs within a large body of content.

In a traditional relational database, this can only be done with a LIKE query, which has several disadvantages:

1) A LIKE query with a leading wildcard cannot use the database index, so the whole table has to be scanned and performance is poor;
2) Matching is crude: only simple wildcard matching at the start or end of the pattern is available, so complex search requirements cannot be met;
3) There is no way to score how relevant a piece of content is to the search terms.

On iOS, Android and the desktop client, we have already implemented full-text search over local data on top of libraries such as SQLite, but the feature is still missing on the Web and Electron.

On the Web, browser restrictions mean the only local database available is IndexedDB, so the Web is out of scope for now. Electron also embeds the Chromium engine, but because it can use Node.js there are more options. In this article we take an Electron-based IM client as the example for implementing full-text search (the technical ideas are the same and are not limited to a particular platform).

PS: If you don't know what Electron technology is, read this "Quick Understanding of Electron: A New Generation of Web-based Cross-platform Desktop Technology".

Let's first look at what implementing full-text search involves.

Full-text search rests on two concepts:

1) Inverted index;
2) Word segmentation.

These two techniques are both the core of full-text search and its main difficulty, and they are fairly complicated to implement. Before describing our implementation, let's look at how each of them works.

5. Knowledge point 1: Inverted index

First, a brief introduction. An inverted index is the counterpart of a forward index:

1) Forward index: the unique ID of a document is the key, and the document content is the stored record;
2) Inverted index: a word from the document content is the key, and the IDs of the documents containing that word are the stored record.

Take the inverted index library search-index as a practical example:

In our IM, every message object has an idClient as its unique ID. Suppose we enter "今天天气真好" ("The weather is really nice today") and split it into individual Chinese characters (word segmentation is covered in detail below), so the input becomes "今", "天", "天", "气", "真" and "好". We then write it into the library through the PUT method of search-index.

Finally, look at how the example above is stored:

As the figure shows, in the inverted index structure each key is a single Chinese character produced by segmentation, and each value is an array of the idClients of the message objects that contain that character.
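To make the key/value structure concrete, here is a minimal in-memory sketch of an inverted index. It is not the search-index API or its on-disk format; the class name `InvertedIndex` and its `put`/`search` methods are hypothetical names chosen for illustration, and the per-character split mirrors the example above.

```javascript
// Minimal in-memory inverted index: token -> Set of message ids.
// Illustrative only; search-index persists a richer structure on disk.
class InvertedIndex {
  constructor() {
    this.postings = new Map();
  }

  // Index one message under each of its tokens.
  put(idClient, tokens) {
    for (const token of tokens) {
      if (!this.postings.has(token)) this.postings.set(token, new Set());
      this.postings.get(token).add(idClient);
    }
  }

  // Return the ids of messages that contain every query token (AND semantics).
  search(tokens) {
    if (tokens.length === 0) return [];
    const sets = tokens.map((t) => this.postings.get(t) || new Set());
    return [...sets[0]].filter((id) => sets.every((s) => s.has(id)));
  }
}

// Per-character split, as in the example above.
const splitChars = (text) => Array.from(text);

const index = new InvertedIndex();
index.put("id-1", splitChars("今天天气真好"));
index.put("id-2", splitChars("明天见"));

const hitsTian = index.search(splitChars("天"));     // both messages contain 天
const hitsTianqi = index.search(splitChars("天气")); // only id-1 contains both 天 and 气
```

Note that the AND lookup only checks that all characters occur in a message, not that they are adjacent, so a real implementation still needs ranking or a final substring check.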

Of course, search-index stores more than this, such as Weight, Count and forward-index data, which support sorting, paging, searching by field and other features. This article does not go into those details.

6. Knowledge point 2: Word segmentation

6.1 Basic concepts

Word segmentation splits the content of the original message into words or phrases based on semantics. Considering the quality of Chinese segmentation and the need to run on Node, we chose Nodejieba as the underlying word segmentation library.

The following is the flowchart of jieba word segmentation:

Taking "去北京大学玩" ("Go to Peking University to play") as the example, we analyze the most important modules below.

6.2 Load dictionary

When it is initialized, jieba first loads the dictionary. Its content looks roughly as follows:

6.3 Building a prefix dictionary

Next, a prefix dictionary is built from the dictionary, with the following structure:

Here, "北京大" is a prefix of "北京大学" ("Peking University") and is not itself a word, so its frequency is 0. Keeping such entries makes it easier to build the DAG later.
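The construction of the prefix dictionary can be sketched as follows. This is a simplified illustration rather than jieba's actual code: the function name `buildPrefixDict` is hypothetical, and the word frequencies in the toy dictionary are invented for the example.

```javascript
// Toy dictionary: word -> frequency. The frequencies are invented for illustration.
const dict = new Map([
  ["去", 100],
  ["北", 50],
  ["北京", 300],
  ["北京大学", 200],
  ["京", 20],
  ["大", 150],
  ["大学", 250],
  ["学", 80],
  ["玩", 90],
]);

// Build the prefix dictionary: every proper prefix of every word gets an entry.
// Prefixes that are not words themselves are stored with frequency 0.
function buildPrefixDict(dict) {
  const prefixDict = new Map();
  for (const [word, freq] of dict) {
    const chars = Array.from(word);
    for (let end = 1; end < chars.length; end++) {
      const prefix = chars.slice(0, end).join("");
      if (!prefixDict.has(prefix)) prefixDict.set(prefix, 0);
    }
    // The word itself always keeps its real frequency.
    prefixDict.set(word, freq);
  }
  return prefixDict;
}

const prefixDict = buildPrefixDict(dict);
```

With this toy dictionary, the only entry added purely as a prefix is "北京大", which gets frequency 0, while "北京" and "北京大学" keep their own frequencies.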

6.4 Building a DAG graph

DAG stands for Directed Acyclic Graph.

Based on the prefix dictionary, the input is segmented in every possible way.

Where:

1) "去" (go) is not the prefix of any longer word, so there is only one way to segment it;
2) For "北", there are three segmentations: "北", "北京" and "北京大学";
3) For "京", there is only one segmentation;
4) For "大", there are two segmentations: "大" and "大学";
5) "学" and "玩" again have only one segmentation each.

In this way we obtain, for each character, every word that starts at that character.
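A minimal sketch of this step is shown here. It follows the idea described above (for each start position, collect every end position that forms a dictionary word), but it is not jieba's source code; `buildDAG` is a hypothetical name and the frequencies are invented.

```javascript
// Prefix dictionary for the example (frequencies invented for illustration).
// "北京大" is a pure prefix, so its frequency is 0.
const prefixDict = new Map([
  ["去", 100], ["北", 50], ["北京", 300], ["北京大", 0], ["北京大学", 200],
  ["京", 20], ["大", 150], ["大学", 250], ["学", 80], ["玩", 90],
]);

// For each start position i, list every end position j such that
// the characters i..j form a real word (frequency > 0).
function buildDAG(sentence, prefixDict) {
  const chars = Array.from(sentence);
  const dag = new Map();
  for (let i = 0; i < chars.length; i++) {
    const ends = [];
    let fragment = "";
    for (let j = i; j < chars.length; j++) {
      fragment += chars[j];
      if (!prefixDict.has(fragment)) break; // no dictionary word starts with this fragment
      if (prefixDict.get(fragment) > 0) ends.push(j);
    }
    // Unknown character: fall back to a single-character segment.
    if (ends.length === 0) ends.push(i);
    dag.set(i, ends);
  }
  return dag;
}

const dag = buildDAG("去北京大学玩", prefixDict);
```

For the example sentence, position 1 ("北") maps to the end positions 1, 2 and 4, which correspond to "北", "北京" and "北京大学". The zero-frequency entry "北京大" is what lets the scan continue past "北京" to reach "北京大学".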

The DAG diagram is shown below:

6.5 Maximum probability path calculation

All the paths through the DAG for "去北京大学玩" are as follows:

去/北/京/大/学/玩
去/北/京/大学/玩
去/北京/大/学/玩
去/北京/大学/玩
去/北京大学/玩

Every node carries a weight; for a word in the prefix dictionary, the weight is its word frequency. The problem is therefore to find the path that maximizes the total weight of the whole sentence.

This is a typical dynamic programming problem. First we confirm that the two conditions for dynamic programming hold.

1) Overlapping subproblems:

For a node i and its possible successor nodes j and k:

1) The weight of any path that reaches j through i = the weight of the path to i + the weight of j, that is, R(i -> j) = R(i) + W(j);
2) The weight of any path that reaches k through i = the weight of the path to i + the weight of k, that is, R(i -> k) = R(i) + W(k).

In other words, j and k share the predecessor node i, so without caching, the weight of the path to i would be computed repeatedly.

2) Optimal substructure:

Suppose the optimal path for the whole sentence is Rmax, its end node is x, and the possible predecessor nodes of x are i, j and k.

The formula is as follows:

Rmax = max(Rmaxi, Rmaxj, Rmaxk) + W(x)

So the problem reduces to solving Rmaxi, Rmaxj and Rmaxk; the optimal solution of each substructure is part of the global optimal solution.

Following this procedure, the optimal path is finally calculated as "去/北京大学/玩" (Go/Peking University/Play).
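The dynamic programming step can be sketched as below. Several things here are assumptions of the sketch rather than facts from the article: the function name `segment` is hypothetical, the frequencies are invented, the weight of a word is taken as log(frequency / total) so that weights can be added, and the table is filled from the end of the sentence backwards. Instead of a prebuilt DAG, the inner loop simply tries every substring that is a dictionary word, which yields the same edges.

```javascript
// Toy dictionary: word -> frequency (invented values).
const freq = new Map([
  ["去", 100], ["北", 50], ["北京", 300], ["北京大学", 200],
  ["京", 20], ["大", 150], ["大学", 250], ["学", 80], ["玩", 90],
]);

function segment(sentence, freq) {
  const chars = Array.from(sentence);
  const n = chars.length;
  const total = [...freq.values()].reduce((a, b) => a + b, 0);
  const logTotal = Math.log(total);

  // route[i] = best score for the suffix starting at i, plus the end of its first word.
  const route = new Array(n + 1);
  route[n] = { score: 0, end: n };

  for (let i = n - 1; i >= 0; i--) {
    let best = null;
    for (let j = i; j < n; j++) {
      const word = chars.slice(i, j + 1).join("");
      let f = freq.get(word) || 0;
      // Unknown single characters get frequency 1 so that a path always exists.
      if (f === 0 && j === i) f = 1;
      if (f === 0) continue;
      // Each suffix score route[j + 1] is computed once and reused (overlapping subproblems).
      const score = Math.log(f) - logTotal + route[j + 1].score;
      if (best === null || score > best.score) best = { score, end: j };
    }
    route[i] = best;
  }

  // Walk the stored choices from the start to recover the words.
  const words = [];
  for (let i = 0; i < n; i = route[i].end + 1) {
    words.push(chars.slice(i, route[i].end + 1).join(""));
  }
  return words;
}

const best = segment("去北京大学玩", freq);
```

With these toy frequencies, `best` is the three-word path 去/北京大学/玩, because one reasonably frequent long word scores higher than the sum of several shorter ones.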

6.6 HMM Hidden Markov Model

For out-of-vocabulary words (words that are not in the dictionary), jieba uses an HMM (Hidden Markov Model) for segmentation.

It treats word segmentation as a sequence labeling problem: the sentence is the observation sequence, and the segmentation result is the state sequence.

The author of jieba mentioned in an issue that the HMM parameters were derived from the 1998 People's Daily segmented corpus available online, an MSR corpus, and TXT novels he collected himself, which were segmented with ICTCLAS; the parameters were then generated by counting word frequencies with a Python script.

The model consists of a five-tuple and rests on two basic assumptions.

Five-tuple:

1) Set of state values;
2) Set of observation values;
3) Initial state probabilities;
4) State transition probabilities;
5) State emission probabilities.

Basic assumptions:

1) Homogeneity assumption: the state of the hidden Markov chain at any time t depends only on its state at the previous time t-1; it is independent of the states and observations at other times, and independent of t itself;
2) Observation independence assumption: the observation at any time depends only on the state of the Markov chain at that time, and is independent of all other observations and states.

The set of state values is {B: begin, E: end, M: middle, S: single}, which describes the position of each character within a word: B is the first character of a word, E is the last character, M is a character in the middle, and S is a single character that forms a word on its own.

The set of observation values is the set of characters in the input sentence.

The initial state probabilities give the probability that the first character of the sentence is in each of the four states B, M, E and S. The probabilities of E and M are both 0, because the first character can only be B or S, which matches reality.

The state transition probabilities give the probability of moving from one state to the next, and satisfy the homogeneity assumption. The structure can be represented by a nested object:

P = {
  B: {E: -0.510825623765990, M: -0.916290731874155},
  E: {B: -0.5897149736854513, S: -0.8085250474669937},
  M: {E: -0.33344856811948514, M: -1.2603623820268226},
  S: {B: -0.7211965654669841, S: -0.6658631448798212},
}

P['B']['E'] describes the transition from state B to state E. The structure stores the logarithm of the probability, which is convenient for calculation; the probability itself is 0.6. Similarly, P['B']['M'] says that the probability of the next state being M is 0.4. In other words, when a character is at the beginning of a word, the next character is more likely to be the end of the word than to be in the middle. This is intuitive, because two-character words are more common than longer words.
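As a quick check of these numbers, the stored values are natural logarithms, so applying `Math.exp` recovers the probabilities. The variable names below are chosen for illustration; the values are the ones listed above.

```javascript
// Log transition probabilities, copied from the table above.
const transLog = {
  B: { E: -0.510825623765990, M: -0.916290731874155 },
  E: { B: -0.5897149736854513, S: -0.8085250474669937 },
  M: { E: -0.33344856811948514, M: -1.2603623820268226 },
  S: { B: -0.7211965654669841, S: -0.6658631448798212 },
};

// Convert every entry back to a plain probability.
const trans = {};
for (const [from, row] of Object.entries(transLog)) {
  trans[from] = {};
  for (const [to, logP] of Object.entries(row)) {
    trans[from][to] = Math.exp(logP);
  }
}

// Multiplying probabilities along a path equals adding their logarithms,
// which avoids underflow when many small probabilities are chained.
const logOfPath = transLog.B.M + transLog.M.E; // B -> M -> E
const probOfPath = trans.B.M * trans.M.E;
```

Here `trans.B.E` comes out at about 0.6 and `trans.B.M` at about 0.4, and the outgoing probabilities of each state sum to roughly 1.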

The state emission probabilities give the probability of observing a particular character in a given state, and satisfy the observation independence assumption. The structure is the same as above and can also be represented by a nested object:

P = {
  B: {'突': -2.70366861046, '肃': -10.2782270947, '适': -5.57547658034},
  M: {'要': -4.26625051239, '合': -2.1517176509, '成': -5.11354837278},
  S: {……},
  E: {……},
}

P['B']['突'] means that, when the state is B, the logarithm of the probability that the observed character is "突" is -2.70366861046.

Finally, the Viterbi algorithm takes the observation sequence as input, uses the initial state probabilities, state transition probabilities and state emission probabilities as parameters, and outputs the state sequence (that is, the segmentation with the maximum probability). This article does not cover the Viterbi algorithm in detail; interested readers can look it up themselves.
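For readers who want a concrete picture, here is a compact sketch of Viterbi decoding over the four states. The transition values are the ones listed earlier in this section; the initial and emission probabilities are invented toy numbers, not jieba's trained model, and the function names `viterbi` and `cut` are hypothetical.

```javascript
const MIN = -3.14e100; // stands in for log(0): an impossible event
const STATES = ["B", "M", "E", "S"];

// Initial probabilities (invented): a sentence can only start with B or S.
const start = { B: Math.log(0.6), M: MIN, E: MIN, S: Math.log(0.4) };

// Log transition probabilities from the article; missing pairs are impossible.
const trans = {
  B: { E: -0.510825623765990, M: -0.916290731874155 },
  E: { B: -0.5897149736854513, S: -0.8085250474669937 },
  M: { E: -0.33344856811948514, M: -1.2603623820268226 },
  S: { B: -0.7211965654669841, S: -0.6658631448798212 },
};

// Emission probabilities (invented toy values).
const emit = {
  B: { "北": Math.log(0.6), "去": Math.log(0.1), "玩": Math.log(0.1) },
  M: { "京": Math.log(0.2) },
  E: { "京": Math.log(0.7), "玩": Math.log(0.1) },
  S: { "去": Math.log(0.5), "玩": Math.log(0.5), "北": Math.log(0.05) },
};

const lookup = (table, a, b) =>
  table[a] && table[a][b] !== undefined ? table[a][b] : MIN;

// Return the most probable state sequence for a non-empty character array.
function viterbi(chars) {
  let scores = {};
  let paths = {};
  for (const s of STATES) {
    scores[s] = start[s] + lookup(emit, s, chars[0]);
    paths[s] = [s];
  }
  for (let t = 1; t < chars.length; t++) {
    const nextScores = {};
    const nextPaths = {};
    for (const s of STATES) {
      // Pick the best previous state for landing in state s at time t.
      let bestPrev = STATES[0];
      let bestScore = -Infinity;
      for (const p of STATES) {
        const score = scores[p] + lookup(trans, p, s);
        if (score > bestScore) {
          bestScore = score;
          bestPrev = p;
        }
      }
      nextScores[s] = bestScore + lookup(emit, s, chars[t]);
      nextPaths[s] = paths[bestPrev].concat(s);
    }
    scores = nextScores;
    paths = nextPaths;
  }
  // A sentence has to end on E (end of a word) or S (single-character word).
  return scores.E >= scores.S ? paths.E : paths.S;
}

// Turn the state sequence into words: a word closes on E or S.
function cut(sentence) {
  const chars = Array.from(sentence);
  if (chars.length === 0) return [];
  const tags = viterbi(chars);
  const words = [];
  let current = "";
  chars.forEach((ch, i) => {
    current += ch;
    if (tags[i] === "E" || tags[i] === "S") {
      words.push(current);
      current = "";
    }
  });
  if (current) words.push(current);
  return words;
}

const tags = viterbi(Array.from("去北京玩"));
const words = cut("去北京玩");
```

With these toy parameters the decoded states are S, B, E, S, so the sentence is cut into 去 / 北京 / 玩. Real parameters come from a trained model, so actual results depend on that model.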

7. Technical realization

The two techniques introduced in the previous sections are the technical core of our architecture. Building on them, we improved the technical architecture of the IM on the Electron side, as described in detail below.

7.1 Detailed structure diagram

Full-text search is only one feature of IM. So that it does not affect the other IM features and can be iterated on quickly, we adopted the following architecture.

The architecture diagram is as follows:

As shown in the figure above, the previous technical architecture is on the right: the underlying storage is IndexedDB, and above it sit two modules, one for reading and one for writing.

The read and write modules work as follows:

1) When the user sends a message, syncs messages, deletes a message, or receives a message, the message object is synchronized to IndexedDB;
2) When the user searches for a keyword, all message objects in IndexedDB are traversed, and indexOf is used to check whether each message object contains the keyword (similar to LIKE).

As a result, queries become very slow once the amount of data is large.
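The old query path amounts to a linear scan, which the following sketch illustrates. A plain array stands in for the IndexedDB object store, and the function name `scanSearch` and the `text` field are hypothetical.

```javascript
// Stand-in for the message store; in the real client these objects live in IndexedDB.
const messages = [
  { idClient: "id-1", text: "今天天气真好" },
  { idClient: "id-2", text: "明天见" },
  { idClient: "id-3", text: "去北京大学玩" },
];

// Every query visits every message, so the cost grows linearly with the number of messages.
function scanSearch(messages, keyword) {
  return messages.filter((msg) => msg.text.indexOf(keyword) !== -1);
}

const hits = scanSearch(messages, "天气").map((msg) => msg.idClient);
```

Nothing is precomputed here, so a search over a large history repeats the same full traversal every time.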

On the left is the new architecture, which adds word segmentation and an inverted index database. It has no impact on the previous design; it simply adds a layer in front of it.

The read and write modules now work as follows:

1) When the user sends a message, syncs messages, deletes a message, or receives a message, the text of each message object is segmented and then synchronized to the inverted index database;
2) When the user searches for a keyword, the idClient of each matching message is first looked up in the inverted index database, and the corresponding message objects are then fetched from IndexedDB by idClient and returned to the user.
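The new read and write flow can be sketched with two in-memory stand-ins: a `Map` in place of IndexedDB and another `Map` in place of the inverted index database. All names here (`writeMessage`, `deleteMessage`, `search`, `tokenize`) are hypothetical, and the per-character tokenizer is only a placeholder for a real segmenter such as Nodejieba.

```javascript
const messageStore = new Map();  // idClient -> message object (IndexedDB in the real client)
const invertedIndex = new Map(); // token -> Set of idClient (search-index in the real client)

// Placeholder tokenizer: one token per character.
const tokenize = (text) => Array.from(text);

// Write path: store the message, then index its tokens.
function writeMessage(msg) {
  messageStore.set(msg.idClient, msg);
  for (const token of tokenize(msg.text)) {
    if (!invertedIndex.has(token)) invertedIndex.set(token, new Set());
    invertedIndex.get(token).add(msg.idClient);
  }
}

// Delete path: remove the message and its postings.
function deleteMessage(idClient) {
  const msg = messageStore.get(idClient);
  if (!msg) return;
  messageStore.delete(idClient);
  for (const token of tokenize(msg.text)) {
    const ids = invertedIndex.get(token);
    if (!ids) continue;
    ids.delete(idClient);
    if (ids.size === 0) invertedIndex.delete(token);
  }
}

// Read path: step 1 finds candidate ids in the inverted index,
// step 2 fetches the message objects from the main store by id.
function search(keyword) {
  const tokens = tokenize(keyword);
  if (tokens.length === 0) return [];
  const sets = tokens.map((t) => invertedIndex.get(t) || new Set());
  const ids = [...sets[0]].filter((id) => sets.every((s) => s.has(id)));
  return ids
    .map((id) => messageStore.get(id))
    // Per-character matching can over-match, so confirm the substring on the few candidates.
    .filter((msg) => msg && msg.text.includes(keyword));
}

writeMessage({ idClient: "id-1", text: "今天天气真好" });
writeMessage({ idClient: "id-2", text: "明天见" });
writeMessage({ idClient: "id-3", text: "气天" });

const found = search("天气").map((msg) => msg.idClient);
```

Only the candidate messages returned by the index are fetched and checked, instead of every message in the store. Here "id-3" contains both characters but not the phrase, so the final substring check drops it.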

7.2 Architecture advantages

This design has the following four advantages:

1) Speed: the inverted index implemented through search-index makes searches faster;
2) Cross-platform: search-index and IndexedDB are both built on LevelDB, so search-index also supports the browser environment, which makes full-text search on the Web possible;
3) Independence: the inverted index database is separate from IndexedDB, the main business database of IM;
4) Flexibility: full-text search is integrated as a plug-in.

Regarding point 3): when IndexedDB writes data, it automatically notifies the write module of the inverted index library. The message content is segmented, placed in a storage queue, and then inserted into the inverted index database in order. When a full-text search is needed, the read module of the inverted index library quickly finds the idClients of the message objects matching the keyword, and the message objects are then fetched from IndexedDB by idClient and returned.

Regarding point 4): the plug-in exposes a higher-order function that wraps IM and returns a new IM that inherits from and extends it. Because JS is prototype-based, any method that does not exist on the new IM is automatically looked up along the prototype chain (that is, on the old IM). The plug-in can therefore focus on implementing its own methods without caring about the specific IM version. The plug-in also supports custom word segmentation functions, to meet different users' needs in different segmentation scenarios.
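The plug-in pattern described in point 4) might look roughly like the sketch below. `BaseIM` is a hypothetical stand-in for the IM SDK class, and `withFullTextSearch`, `queryFts` and the `tokenize` option are illustrative names, not the actual plug-in API.

```javascript
// Hypothetical stand-in for the existing IM SDK class.
class BaseIM {
  constructor() {
    this.messages = [];
  }
  sendText(idClient, text) {
    const msg = { idClient, text };
    this.messages.push(msg);
    return msg;
  }
  getMessage(idClient) {
    return this.messages.find((m) => m.idClient === idClient);
  }
}

// Higher-order function: takes an IM class and returns an extended subclass.
function withFullTextSearch(IM, options = {}) {
  // Callers may inject their own word segmentation function.
  const tokenize = options.tokenize || ((text) => Array.from(text));

  return class SearchableIM extends IM {
    constructor(...args) {
      super(...args);
      this.postings = new Map(); // token -> Set of idClient
    }

    // Extend the write path: call the original method, then index the text.
    sendText(idClient, text) {
      const msg = super.sendText(idClient, text);
      for (const token of tokenize(text)) {
        if (!this.postings.has(token)) this.postings.set(token, new Set());
        this.postings.get(token).add(idClient);
      }
      return msg;
    }

    // New method added by the plug-in.
    queryFts(keyword) {
      const sets = tokenize(keyword).map((t) => this.postings.get(t) || new Set());
      if (sets.length === 0) return [];
      return [...sets[0]]
        .filter((id) => sets.every((s) => s.has(id)))
        .map((id) => this.getMessage(id)); // getMessage is found on the prototype chain
    }
  };
}

const SearchableIM = withFullTextSearch(BaseIM);
const im = new SearchableIM();
im.sendText("id-1", "今天天气真好");
im.sendText("id-2", "明天见");
const found = im.queryFts("天气").map((msg) => msg.idClient);

// A custom tokenizer, for example whitespace splitting for English text.
const EnglishIM = withFullTextSearch(BaseIM, { tokenize: (text) => text.split(/\s+/).filter(Boolean) });
```

Because the returned class only overrides what it needs, every other method of the wrapped IM remains available unchanged, and the same wrapper can be applied to different IM versions.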

7.3 Effect of use

In our tests with the above architecture, at a data volume of 200,000 records, search time dropped from over ten seconds to under one second, making search roughly 20 times faster.

8. Summary of this article

In this article, we implemented full-text search of IM chat messages on Electron based on Nodejieba and search-index, which made searching chat history much faster.

We still plan further optimizations, for example in the following two areas:

1) Write performance: in practice we found that when the amount of data is large, LevelDB, the underlying database that search-index depends on, hits a write performance bottleneck and consumes a lot of CPU and memory. Our investigation showed that SQLite's write performance is considerably better: from our observations, write time grows only in proportion to the amount of data, and CPU and memory usage stay relatively stable. We may therefore compile SQLite into a Node native module to replace search-index in the future.

2) Scalability: business logic is not yet fully decoupled, and some business fields are stored in the inverted index library. In the future, the inverted index library could be responsible only for finding the idClient of message objects by keyword, with searches on business attributes handled in IndexedDB, so that the inverted index library is completely decoupled from the main business library.

That is all for this article. I hope it is helpful to everyone.
