This article was shared by WeChat development team engineer "qiuwenchen". Its original title was "iOS WeChat Full-Text Search Technology Optimization"; it has been lightly revised here.
1. Introduction
Full-text search is a search technique built on an inverted index. An inverted index (also called a reverse index) maps every Token in the input content to the exact positions where that Token occurs in the content. Full-text search is mainly used for searching large amounts of text.
The WeChat client features that involve searching large amounts of text are mainly contact search, chat history search, and Favorites search.
Since these features were launched in 2014, the underlying technology had not been updated for many years:
1) The full-text search engine used for chat history was still SQLite FTS3, even though SQLite FTS5 is now available;
2) Search on the Favorites home page still matched text with a simple LIKE statement;
3) Contact search was even done entirely in memory, by traversing all attributes of every contact and matching them one by one.
As the chat data users accumulate in WeChat keeps growing, upgrading WeChat's underlying search technology has become increasingly urgent. In 2021 we therefore carried out a comprehensive upgrade of full-text search on iOS WeChat, and this article records the technical practice of that upgrade.
(This article is simultaneously published at: http://www.52im.net/thread-3839-1-1.html )
2. Series of articles
This article is the fourth in a series on the topic:
"IM Full-Text Retrieval Technology Topic (1): The Road to Full-Text Retrieval Optimization of WeChat Mobile Terminal"
"IM Full-Text Retrieval Technology Topic (2): Solution to the Problem of Multi-Phonetic Words in Full-Text Retrieval on WeChat Mobile Terminal"
"IM Full Text Retrieval Technology Topic (3): Practice of Chat Message Full Text Retrieval Technology of NetEase Yunxin Web-side IM"
"IM Full-Text Retrieval Technology Topic (4): The Latest Full-Text Retrieval Technology Optimization Practice on WeChat iOS" (* this article)
3. Selection of full-text search engine
There are not many full-text search engines available to an iOS client. The main candidates are:
1) The three versions of SQLite's FTS component (FTS3, FTS4 and FTS5);
2) CLucene, a C++ implementation of Lucene;
3) Lucy, a C-language bridge of Lucene.
Below these engines are compared in terms of transaction capability, technical risk, search capability, and read/write performance (see the figure below).
1) In terms of transaction capabilities:
Lucene does not provide complete transaction support, because its multi-file storage structure cannot guarantee the atomicity of a transaction.
SQLite's FTS component fully inherits SQLite's transaction capability, because underneath it is still implemented with ordinary tables.
2) In terms of technical risks:
Lucene is mainly used on the server side and has never been applied at scale on the client. Moreover, both CLucene and Lucy were officially abandoned in 2013, so their technical risk is high.
SQLite's FTS3 and FTS4 are older versions of the engine that receive little official maintenance, and both store all the index entries of a single token in one record, which in extreme cases risks exceeding SQLite's maximum record length.
SQLite FTS5, the latest version of the engine, has been available for more than six years and has been fully deployed in WeChat for Android, so its technical risk is the lowest.
3) In terms of search capability:
Lucene has a much longer development history than SQLite's FTS component, and its search capability is the richest. In particular, Lucene offers a rich scoring and ranking mechanism for search results, but this has no application in the WeChat client, because our search results are sorted either by time or by simple custom rules.
Among SQLite's FTS versions, FTS5 has the most complete and rigorous query syntax and provides many interfaces for customizing search behavior, so its search capability is comparatively strong.
4) In terms of read and write performance:
The following three figures show the performance of the different engines when building an index, in the Optimize state, over 1 million randomly generated Chinese sentences of 10 characters each, with character frequencies following the real-world frequency of Chinese characters.
As the three figures show: Lucene reads the number of hits much faster than SQLite, which indicates that Lucene's index file format has real advantages, but WeChat has no scenario that only reads a hit count, and Lucene's other figures are not significantly better than SQLite's. SQLite FTS3 and FTS5 perform very similarly in most respects; FTS5 takes somewhat longer to build its index, but there are ways to optimize that (discussed below).
Weighing all of these factors, we chose SQLite FTS5 as the engine for iOS WeChat's full-text search.
4. Engine layer optimization 1: an automatic segment merge mechanism for FTS5
SQLite FTS5 saves the content written by each transaction as an independent b-tree, called a segment. A segment stores, for every token in the newly written content, its row number (rowid), its column number, and the position offsets of each occurrence within the field; in other words, a segment is the inverted index of that content.
Multiple writes produce multiple segments. A query has to search every segment separately and then aggregate the results, so the more segments there are, the slower the query.
To reduce the number of segments, SQLite FTS5 introduces a merge mechanism: a newly written segment has level 0, and a merge operation combines existing segments of level i into a new segment of level i+1.
An example of merge is as follows:
There are two default merge operations in FTS5:
1) automerge: when the number of segments at some level reaches 4, FTS5 starts performing part of the merge work during each subsequent write; this is called automerge. The amount of merge work per write is proportional to the size of that write, and several automerges are needed before the segments are completely merged into a new one. Until the new segment is fully generated, automerge also has to repeatedly trim the already-merged content of the old segments, which introduces redundant writes;
2) crisismerge: when, after a write, the number of segments at some level reaches 16, all segments at that level are merged in one go; this is called crisismerge.
Both default merge operations run synchronously inside the write, which affects the performance of business logic. Crisismerge in particular occasionally makes a single write take very long, so business performance becomes unpredictable (in the earlier benchmark, FTS5's longer indexing time comes mainly from its merge operations being more expensive than those of the other two engines).
We therefore implemented an automatic segment merge mechanism for FTS5 in WCDB: all merge operations are concentrated in a single background thread, and the execution parameters are tuned.
The specific methods are as follows:
1) Monitor which FTS5 index tables each transaction of an FTS5-indexed database modifies, and notify the merge thread to trigger WCDB's automatic merge;
2) The merge thread checks every level of the FTS5 index table and performs a merge for each level that has more than one segment;
3) While merging, after every 16 pages written, check whether another thread is blocked waiting to write because of the merge; if so, commit immediately, minimizing the impact of the merge on business performance (a minimal sketch of this merge loop appears below the flow chart).
The flow chart of automatic merge logic execution is as follows:
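The following is a minimal sketch, not WCDB's actual implementation, of how such a background merge loop can be driven with FTS5's built-in commands ('automerge', 'crisismerge', 'usermerge' and the incremental 'merge'). The table name msg_fts, the configuration values, and the stop heuristic are illustrative assumptions; the step size of 16 mirrors the "commit every 16 pages" idea described above.

```c
#include <sqlite3.h>

/* Runs on a dedicated merge thread with its own connection to the database
 * holding the FTS5 table (assumed here to be named msg_fts). */
static int background_merge(sqlite3 *db)
{
    /* Keep the default merges out of business writes (illustrative values):
     * automerge 0 disables automatic merging, crisismerge is raised so it
     * effectively never fires, usermerge 2 lets a 'merge' step combine as
     * few as two segments, matching the "more than 1 segment" trigger. */
    int rc = sqlite3_exec(db,
        "INSERT INTO msg_fts(msg_fts, rank) VALUES('automerge', 0);"
        "INSERT INTO msg_fts(msg_fts, rank) VALUES('crisismerge', 99999);"
        "INSERT INTO msg_fts(msg_fts, rank) VALUES('usermerge', 2);",
        NULL, NULL, NULL);

    /* Merge incrementally in small units of work, each in its own short
     * transaction, so that a blocked business write only waits for one step. */
    while (rc == SQLITE_OK) {
        int before = sqlite3_total_changes(db);
        rc = sqlite3_exec(db,
            "BEGIN IMMEDIATE;"
            "INSERT INTO msg_fts(msg_fts, rank) VALUES('merge', 16);"
            "COMMIT;",
            NULL, NULL, NULL);
        /* Heuristic stop condition: a merge step that finds nothing left to
         * merge changes (almost) nothing in the database. */
        if (rc == SQLITE_OK && sqlite3_total_changes(db) - before <= 1) break;
    }
    return rc;
}
```

Running each 'merge' step in its own short transaction is what keeps a waiting business write from being blocked for long.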
Limiting the number of segments at each level to 1 brings FTS5's query performance closest to the fully optimized state (all segments merged into one), and the extra write volume it introduces is acceptable. Assuming each business write has volume M and there are N writes, the total volume actually written to the database after merging is M·N·(log2(N)+1). Since businesses write in batches, increasing M (and thus decreasing N for the same amount of data) also reduces the total write volume.
As for performance: querying three characters in an FTS5 table containing 1 million pieces of Chinese content, each 100 characters long, takes 2.9 ms in the Optimize state; with the number of segments per level limited to 2, 3, and 4, the query takes 4.7 ms, 8.9 ms, and 15 ms respectively. When the 1 million pieces of content are written 100 at a time, the merges under the WCDB scheme take less than 10 s in total.
With this automatic merge mechanism, the FTS5 index stays in a state close to Optimize without hurting index update performance, which improves search speed.
5. Engine layer optimization 2: tokenizer optimization
5.1 Tokenizer performance optimization
The tokenizer is a key module of full-text search: it splits the input content into Tokens and reports their positions, and the search engine then builds the index on those Tokens. SQLite's FTS component supports custom tokenizers, so we can implement our own according to business requirements.
A tokenizer can split text character by character or word by word. The former simply indexes the input one character at a time, while the latter has to understand the semantics of the input and index meaningful phrases. Compared with character-based tokenization, word-based tokenization reduces both the number of Tokens to index and the number of Tokens matched during a search, but it needs semantic understanding, and inaccurate segmentation can make content impossible to find.
To keep the client logic simple and avoid content becoming unfindable, the FTS3 tokenizer previously used by iOS WeChat, OneOrBinaryTokenizer, adopts a clever variant of character-based tokenization: the content is indexed not only character by character but also by every two adjacent characters, while the search text is split into pairs of adjacent characters.
The following example shows how a search for "Beijing welcomes you" is tokenized against identical content:
Compared with plain character-by-character tokenization, this method nearly halves the number of Tokens matched during a search, which improves search speed and, to some extent, search accuracy (for example, searching for "welcome you to Beijing" will not match "Beijing welcomes you"). Its drawback is that it stores a great deal of extra index content: essentially every character of the input is stored three times in the index, trading space for time.
Because OneOrBinaryTokenizer trades nearly a threefold increase in index size for less than a twofold improvement in search performance, it is not very cost-effective. For FTS5 we therefore developed a new tokenizer, VerbatimTokenizer, which performs only plain character-by-character tokenization and stores no redundant index content. At search time, every two adjacent characters of the query are wrapped in quotation marks to form a phrase; under FTS5's query syntax, the characters inside a phrase only match when they appear adjacent and in order, which gives the same search accuracy as OneOrBinaryTokenizer.
The schematic diagram of the word segmentation rules of VerbatimTokenizer is as follows:
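As a concrete illustration, here is a minimal sketch of a character-by-character FTS5 tokenizer built on the standard fts5_tokenizer interface. It is not WeChat's VerbatimTokenizer: the registration name "verbatim", the handling of separators, and the omitted transforms and phrase-building logic are simplifications and assumptions.

```c
#include <sqlite3.h>
#include "fts5.h"   /* fts5_api / fts5_tokenizer declarations, shipped with FTS5-enabled SQLite builds */

/* Minimal character-by-character tokenizer (illustrative only). */

static int vt_create(void *pCtx, const char **azArg, int nArg, Fts5Tokenizer **ppOut){
  (void)pCtx; (void)azArg; (void)nArg;
  *ppOut = (Fts5Tokenizer*)1;               /* dummy handle; no per-instance state */
  return SQLITE_OK;
}

static void vt_delete(Fts5Tokenizer *p){ (void)p; }

/* Length in bytes of the UTF-8 character whose first byte is c. */
static int utf8_len(unsigned char c){
  if (c < 0x80) return 1;
  if (c < 0xE0) return 2;
  if (c < 0xF0) return 3;
  return 4;
}

static int vt_tokenize(Fts5Tokenizer *p, void *pCtx, int flags,
                       const char *pText, int nText,
                       int (*xToken)(void*, int, const char*, int, int, int)){
  int i = 0, rc = SQLITE_OK;
  (void)p; (void)flags;
  while (i < nText && rc == SQLITE_OK){
    int n = utf8_len((unsigned char)pText[i]);
    if (i + n > nText) n = nText - i;
    /* Real code would skip whitespace/separators and apply the transforms
     * described in the next subsection before emitting the token. */
    rc = xToken(pCtx, 0, &pText[i], n, i, i + n);
    i += n;
  }
  return rc;
}

/* Registration: obtain the fts5_api pointer for this connection and register
 * the tokenizer under the name "verbatim" (name chosen for this sketch). */
static int register_verbatim(sqlite3 *db){
  fts5_api *pApi = 0;
  sqlite3_stmt *pStmt = 0;
  static fts5_tokenizer tok = { vt_create, vt_delete, vt_tokenize };

  if (sqlite3_prepare_v2(db, "SELECT fts5(?1)", -1, &pStmt, 0) != SQLITE_OK) return SQLITE_ERROR;
  sqlite3_bind_pointer(pStmt, 1, (void*)&pApi, "fts5_api_ptr", NULL);
  sqlite3_step(pStmt);
  sqlite3_finalize(pStmt);
  if (pApi == 0) return SQLITE_ERROR;

  return pApi->xCreateTokenizer(pApi, "verbatim", NULL, &tok, NULL);
}
```

The tokenizer is registered per connection; an FTS5 table then selects it with the tokenize option (shown in a later sketch).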
5.2 Tokenizer Capability Extension
According to WeChat's actual business needs, VerbatimTokenizer also implements five extensions that improve the fault tolerance of search:
1) Conversion of traditional Chinese characters to simplified ones during tokenization: users can find traditional content with simplified input and vice versa, avoiding misses caused by the similarity of simplified and traditional forms.
2) Unicode normalization: Unicode allows characters with the same glyph to be represented by different code sequences; for example, é encoded as \ue9 and é encoded as \u65\u301 look identical, so a user can fail to find content that looks exactly like what they typed. Unicode normalization maps characters with the same glyph to a single canonical encoding.
3) Symbol filtering: in most cases we do not need to index symbols, since they repeat heavily and users rarely search with them. Contact search, however, does need symbol search, because emoji and other symbols appear frequently in user nicknames.
4) Stemming of English words with the Porter stemming algorithm: stemming lets users find content whose singular/plural form or tense differs from what they typed. It also has drawbacks: if the content is "happyday" and the user types "happy" as a prefix search, nothing is found, because "happyday" is stemmed to "happydai" while "happy" is stemmed to "happi", and the latter is not a prefix of the former. This kind of bad case appears easily when several English words are concatenated, which is very common in contact nicknames, so stemming is disabled in the contact index and enabled in the other business scenarios.
5) Conversion of all letters to lowercase: users can find uppercase content with lowercase input and vice versa.
All of these extensions transform each token of both the indexed content and the search text. The transformation could in principle be done in the business layer; in fact, Unicode normalization and simplified/traditional conversion used to be implemented there.
But doing so has two drawbacks:
1) Every conversion done in the business layer requires an extra pass over the content, introducing redundant computation;
2) The content written into the index is the transformed content, so the search results come back transformed as well; they no longer match the original text, and the business layer easily makes mistakes when processing them.
For these two reasons, VerbatimTokenizer concentrates all of these transformations inside the tokenizer.
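A minimal sketch of the idea, in which only ASCII case folding is actually implemented: every token passes through one transform step inside the tokenizer before it is handed to FTS5, so indexed text and query text always go through exactly the same pipeline, while the original column text stays untouched.

```c
#include <ctype.h>

/* Sketch: per-token transform step run inside the tokenizer, right before
 * xToken() is invoked.  Only lower-casing is implemented here; in the real
 * tokenizer, simplified/traditional conversion, Unicode normalization, symbol
 * filtering and optional Porter stemming are applied at this same point,
 * switched per business scenario (e.g. contacts keep symbols, skip stemming). */
int emit_token(void *pCtx, const char *tok, int nTok, int iStart, int iEnd,
               int (*xToken)(void*, int, const char*, int, int, int))
{
    char buf[256];
    int i, n = nTok < (int)sizeof(buf) ? nTok : (int)sizeof(buf);

    for (i = 0; i < n; i++)
        buf[i] = (char)tolower((unsigned char)tok[i]);   /* extension 5): lowercase */

    /* extensions 1)-4) would transform buf in place here; the original byte
     * offsets iStart/iEnd are kept so results stay tied to the original text. */
    return xToken(pCtx, 0, buf, n, iStart, iEnd);
}
```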
6. Engine layer optimization 3: Index content supports multi-level separators
SQLite's FTS index tables do not support adding columns after the table is created. As the business grows, however, more and more attributes of the business data need to become searchable. How do we support searching newly added attributes?
This is especially true for contact search, where a single contact has many searchable fields.
A straightforward idea is to concatenate the old and new attributes with a separator and index the combined string.
But that introduces a new problem: FTS5 matches against the whole field as one piece of content, so a query whose Tokens match across different attributes will still hit the row. That is obviously not what the user wants, and it lowers the accuracy of the results.
What we need is to require that no separator appears between the matched Tokens, which guarantees that they all fall within a single attribute. To support flexible business growth we also need multi-level separators, and the search results must report the level and position of the match as well as the original text of the matched content and the matched words.
FTS5 does not provide this capability out of the box, but its custom auxiliary functions can obtain the position of every hit Token in each matching row. From that information we can infer whether there is a separator between the matched Tokens and at which level they sit, so we developed a new FTS5 auxiliary function, SubstringMatchInfo, to implement this capability.
The general execution flow of this function is as follows:
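Below is a heavily simplified sketch in the spirit of SubstringMatchInfo, not its real implementation, built on the documented FTS5 auxiliary-function API (xInstCount/xInst/xColumnText); the function name, table name and output format are assumptions.

```c
#include <sqlite3.h>
#include "fts5.h"   /* FTS5 extension API declarations */

/* Simplified auxiliary function: for each matched row it reports the column
 * and token offset of the first hit.  The real SubstringMatchInfo additionally
 * scans the column text for the multi-level separators around the hit to work
 * out which attribute (and which level) the match falls in, and returns the
 * matched substring for sorting and display. */
static void substring_match_info(const Fts5ExtensionApi *pApi, Fts5Context *pFts,
                                 sqlite3_context *pCtx, int nVal, sqlite3_value **apVal)
{
    int nInst = 0;
    int iPhrase = 0, iCol = 0, iOff = 0;
    const char *zText = 0;
    int nText = 0;
    char zOut[64];
    (void)nVal; (void)apVal;

    if (pApi->xInstCount(pFts, &nInst) != SQLITE_OK || nInst == 0) {
        sqlite3_result_null(pCtx);
        return;
    }
    pApi->xInst(pFts, 0, &iPhrase, &iCol, &iOff);    /* first hit: phrase/column/offset */
    pApi->xColumnText(pFts, iCol, &zText, &nText);   /* original text of the hit column */
    (void)zText; (void)nText;  /* the real function scans this text for separators around iOff */

    sqlite3_snprintf(sizeof(zOut), zOut, "%d:%d", iCol, iOff);
    sqlite3_result_text(pCtx, zOut, -1, SQLITE_TRANSIENT);
}

/* Registration, given an fts5_api pointer obtained as in the tokenizer sketch:
 *   pApi->xCreateFunction(pApi, "substring_match_info", NULL,
 *                         substring_match_info, NULL);
 * Illustrative usage:
 *   SELECT rowid, substring_match_info(contact_fts)
 *     FROM contact_fts WHERE contact_fts MATCH '"zhang" "san"';
 */
```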
7. Application layer optimization 1: database table format optimization
7.1 Where non-search content is stored
In practice, besides the FTS index of the searchable text, we also need to store the id of the business data the text belongs to, the attributes used to sort results (typically the creation time of the business data), and any other content that has to be read out together with the search results. None of this content takes part in the text search itself.
Depending on where this non-search content is stored, there are two possible layouts for the FTS index table:
1) The first layout stores the non-search content in a separate ordinary table that maps the rowid of the FTS index to the non-search content, while each row of the FTS index table stores only the searchable text.
The table format looks like this:
The advantages and disadvantages of this layout are obvious:
a) Advantages: the FTS index table stays very simple, so developers unfamiliar with FTS configuration are unlikely to make mistakes, and the ordinary table extends well, since new columns can be added;
b) Disadvantages: a search must first use the rowid from the FTS index to look up the corresponding row in the ordinary table before the other content can be read, so searching is a little slower, needs a cross-table query, and the search SQL is slightly more complex.
2) The second layout stores the non-search content directly in the FTS index table, alongside the searchable text.
The table format looks like this:
The advantages and disadvantages of this layout are exactly the opposite of the first:
a) Advantages: searching is fast and simple;
b) Disadvantages: poor extensibility, and a more careful configuration is required.
Because iOS WeChat already used the second layout, WeChat's search business is stable and unlikely to change much, and search speed is what we care about most, we continue to use the second layout to store full-text search data.
7.2 Avoid redundant index content
By default, an FTS index table builds an inverted index for every column of the table, and even numeric content is treated as text. The non-search content we store in the FTS index table would therefore also be indexed, inflating the index file and increasing both index-update time and search time, which is obviously not what we want.
FTS5 allows a column of an index table to be declared UNINDEXED, in which case FTS5 does not index that column. Adding this constraint to every column other than the searchable text avoids the redundant index content.
7.3 Reduce the size of index content
As mentioned earlier, the inverted index mainly stores, for each Token in the text, the row number (rowid), the column number, and the position offsets of its occurrences within the field. The rowid is assigned automatically by SQLite and the position offsets depend on the actual business content, so neither is under our control; the column number, however, can be adjusted.
In the FTS5 index, the format of the index content of a Token in a row is as follows:
It follows that if we put the searchable text in the first column (and, when there are several searchable text columns, put the one with the most content first), we can omit most of the column separators 0x01 and column numbers, which significantly reduces the index file size.
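Putting the previous two subsections together, a schema following these rules might look like the minimal sketch below; the table and column names are illustrative, not WeChat's real ones, and 'verbatim' refers to the custom tokenizer registered earlier.

```c
#include <sqlite3.h>

/* Illustrative schema only: the searchable text sits in the first column,
 * every non-search column is declared UNINDEXED, and the custom tokenizer is
 * selected per table (it must be registered on the connection beforehand). */
static int create_msg_fts(sqlite3 *db)
{
    static const char *zSchema =
        "CREATE VIRTUAL TABLE IF NOT EXISTS msg_fts USING fts5("
        "  content,"                      /* searchable text, first column */
        "  doc_id      UNINDEXED,"        /* business data id              */
        "  create_time UNINDEXED,"        /* sort key read with results    */
        "  tokenize = 'verbatim'"         /* custom tokenizer from above   */
        ");";
    return sqlite3_exec(db, zSchema, NULL, NULL, NULL);
}
```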
So our final table format looks like this:
7.4 Effect comparison before and after optimization
The following is a comparison of the average index file size per user before and after iOS WeChat optimization:
8. Application layer optimization 2: Index update logic optimization
8.1 Overview
To decouple full-text search logic from business logic, iOS WeChat does not store FTS indexes in each business's own database; they are stored centrally in a dedicated full-text search database. After business data is updated, the full-text search module is notified asynchronously and updates the index.
The overall process is as follows:
This not only keeps index updates from slowing down business data updates, but also prevents index update errors or even index corruption from affecting the business, making the full-text search module fully independent.
8.2 Ensure index and data consistency
Storing business data and index data separately and synchronizing them asynchronously brings many benefits, but it is also hard to get right.
The hardest problem is keeping business data and index data consistent, that is, maintaining an exact one-to-one correspondence between them, with nothing missing and nothing extra.
iOS WeChat has stumbled into many pitfalls here in the past, and patch after patch failed to solve the problem completely, so we needed a more systematic approach.
To simplify the problem, we split consistency into two requirements:
1) Every piece of business data must be indexed, so the user's search results have no omissions;
2) Every index entry must correspond to valid business data, so the user never sees invalid results.
To guarantee that all business data is indexed, we first need to find or construct a monotonically growing value that describes the progress of business data updates, such that the progress value and the business data are updated atomically, and such that an interval of progress values tells us exactly which business data was updated. The index can then be updated by following this progress.
Different WeChat businesses use different progress data:
1) Chat history uses the rowid of the message;
2) Favorites uses the updateSequence that Favorites already uses to synchronize with the backend;
3) Contacts have no such growing value, so we record the WeChat IDs of newly added or modified contacts in the contact database and use them as the index update progress.
For point 3) above, the progress data is used as follows:
No matter whether the business data is saved successfully, whether the update notification reaches the full-text search module, or whether the index data is saved successfully, this update logic guarantees that every successfully saved piece of business data is eventually indexed.
One key point is that the data and the progress must be updated in the same transaction and stored in the same database, which guarantees that data and progress are updated atomically (databases created with WCDB use WAL mode, which cannot guarantee atomicity across transactions in different databases).
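Here is a minimal sketch of that rule using the contact case, assuming illustrative table names (contact, contact_index_pending) that are not WeChat's real schema: the business row and the progress record are written in one transaction of the same database.

```c
#include <sqlite3.h>

/* Write the business data and the index-update progress atomically. */
static int save_contact(sqlite3 *bizDb, const char *zUserName, const char *zNickname)
{
    sqlite3_stmt *pStmt = NULL;
    int rc = sqlite3_exec(bizDb, "BEGIN;", NULL, NULL, NULL);
    if (rc != SQLITE_OK) return rc;

    /* 1. Write the business data. */
    rc = sqlite3_prepare_v2(bizDb,
            "INSERT OR REPLACE INTO contact(username, nickname) VALUES(?1, ?2);",
            -1, &pStmt, NULL);
    if (rc == SQLITE_OK) {
        sqlite3_bind_text(pStmt, 1, zUserName, -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(pStmt, 2, zNickname, -1, SQLITE_TRANSIENT);
        rc = (sqlite3_step(pStmt) == SQLITE_DONE) ? SQLITE_OK : SQLITE_ERROR;
        sqlite3_finalize(pStmt);
    }

    /* 2. Record the progress in the SAME transaction and the SAME database,
     *    so data and progress can never diverge. */
    if (rc == SQLITE_OK) {
        pStmt = NULL;
        rc = sqlite3_prepare_v2(bizDb,
                "INSERT OR IGNORE INTO contact_index_pending(username) VALUES(?1);",
                -1, &pStmt, NULL);
        if (rc == SQLITE_OK) {
            sqlite3_bind_text(pStmt, 1, zUserName, -1, SQLITE_TRANSIENT);
            rc = (sqlite3_step(pStmt) == SQLITE_DONE) ? SQLITE_OK : SQLITE_ERROR;
            sqlite3_finalize(pStmt);
        }
    }

    /* The full-text search module later consumes contact_index_pending
     * asynchronously, builds the index, and only then removes the entries. */
    if (rc == SQLITE_OK) {
        rc = sqlite3_exec(bizDb, "COMMIT;", NULL, NULL, NULL);
    } else {
        sqlite3_exec(bizDb, "ROLLBACK;", NULL, NULL, NULL);
    }
    return rc;
}
```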
One operation not shown in the diagram: when WeChat starts, if the business progress is found to be smaller than the index progress, it usually means the business data was damaged and reset. In that case we delete the index and reset the index progress.
Making every index entry correspond to valid business data requires that an index entry be deleted once its business data is deleted. At present, however, business data deletion and index deletion are asynchronous, so the index is not removed immediately when the business data is deleted.
This situation causes two problems:
1) The redundant index entries slow down searches slightly, but the probability of this is very low and the impact is negligible;
2) Users may find invalid data in their results, which must be avoided.
For point 2): completely purging all invalid index entries would be expensive, so we use lazy checking instead. Only when a search result is about to be displayed to the user do we check whether the corresponding data is still valid; if it is not, the result is not shown and the corresponding index entry is deleted asynchronously. Because the user can only see a screenful of results at a time, the cost of this check is negligible. The check is not really extra work either: to display search results flexibly, we already read the business data at display time, which validates it at the same time.
8.3 Indexing speed optimization
The index is only used when searching, so updating it has lower priority than saving business data. We can therefore let unindexed business data accumulate and build the index in batches.
Batch indexing has three benefits:
1) It reduces the number of disk writes and improves the average indexing speed;
2) Within a single transaction the parsed indexing SQL statement can be reused, which cuts down the number of times the SQL is parsed and further improves the average indexing speed (see the sketch after this list);
3) It produces fewer segments, which reduces the read/write cost of merging segments.
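A minimal sketch of such a batch, assuming the illustrative msg_fts table from before: one transaction per batch, with the INSERT prepared once and reused.

```c
#include <sqlite3.h>

typedef struct { sqlite3_int64 docId; const char *zText; sqlite3_int64 createTime; } PendingDoc;

/* Index up to one batch of pending items inside a single transaction,
 * reusing one prepared statement so the SQL is parsed only once. */
static int index_batch(sqlite3 *ftsDb, const PendingDoc *aDoc, int nDoc)
{
    sqlite3_stmt *pStmt = NULL;
    int i, rc = sqlite3_exec(ftsDb, "BEGIN;", NULL, NULL, NULL);
    if (rc != SQLITE_OK) return rc;

    rc = sqlite3_prepare_v2(ftsDb,
            "INSERT INTO msg_fts(content, doc_id, create_time) VALUES(?1, ?2, ?3);",
            -1, &pStmt, NULL);

    for (i = 0; rc == SQLITE_OK && i < nDoc; i++) {
        sqlite3_bind_text (pStmt, 1, aDoc[i].zText, -1, SQLITE_TRANSIENT);
        sqlite3_bind_int64(pStmt, 2, aDoc[i].docId);
        sqlite3_bind_int64(pStmt, 3, aDoc[i].createTime);
        if (sqlite3_step(pStmt) != SQLITE_DONE) rc = SQLITE_ERROR;
        sqlite3_reset(pStmt);              /* reuse the parsed statement */
        sqlite3_clear_bindings(pStmt);
    }
    sqlite3_finalize(pStmt);

    if (rc == SQLITE_OK) {
        rc = sqlite3_exec(ftsDb, "COMMIT;", NULL, NULL, NULL);
    } else {
        sqlite3_exec(ftsDb, "ROLLBACK;", NULL, NULL, NULL);
    }
    return rc;
}
```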
Of course, we cannot let too much business data go unindexed; otherwise, when the user wants to search, there is no time left to build the index and the results come back incomplete.
With the automatic segment merge mechanism described earlier, index write speed is very controllable, so as long as the batch size is kept in check, batch index building does not cause long stalls.
Weighing the indexing speed of low-end devices against the time it takes to bring up the search page, we set the maximum batch size to 100 items.
We also cache in memory the unindexed business data generated during the current WeChat session, and in extreme cases fall back to an in-memory search over that data, so that search results stay complete even for data that has not been indexed yet. Caching the unindexed data left over from the previous session would require extra disk I/O, so instead an indexing pass is triggered after WeChat starts to index whatever unindexed business data already exists.
To summarize, indexing is triggered at three moments:
1) the amount of unindexed business data reaches 100 items;
2) the user enters the search UI;
3) WeChat starts.
8.4 Index deletion speed optimization
Index deletion speed is easy to overlook when designing an index update mechanism, because the volume of deleted business data is easy to underestimate and deletion tends to be dismissed as a low-probability scenario.
In reality, users may delete as much as 50% of their business data, so it is a major scenario that cannot be ignored. Moreover, SQLite does not support parallel writes, so the performance of index deletion also indirectly affects index write speed, introducing uncontrollable factors into index updates.
Indexes are deleted by the id of the business data, so there are two ways to speed up index deletion:
1) Build an ordinary (non-FTS) index mapping the business data id to the rowid of the FTS index (a sketch of this approach appears after the next list);
2) Remove the UNINDEXED constraint from the business data id column of the FTS index table, so that the id itself gets an inverted index.
The inverted index in the second approach is actually less efficient than an ordinary index, for two reasons:
1) Compared with an ordinary index, an inverted index carries a lot of extra information, so lookups are slower;
2) If several business fields are needed to identify an index entry, an inverted index cannot act as a composite index: only one of the fields can be matched through the index, and the remaining fields have to be checked by traversal, which makes lookups very slow.
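A minimal sketch of the first approach, with the mapping table fts_mapping and its columns being illustrative assumptions:

```c
#include <sqlite3.h>

/* Assumed auxiliary schema (ordinary B-tree index on the business id):
 *   CREATE TABLE fts_mapping(doc_id INTEGER, fts_rowid INTEGER);
 *   CREATE INDEX fts_mapping_doc ON fts_mapping(doc_id);               */
static int delete_index_for_doc(sqlite3 *ftsDb, sqlite3_int64 docId)
{
    const char *zSql =
        "DELETE FROM msg_fts WHERE rowid IN "
        "  (SELECT fts_rowid FROM fts_mapping WHERE doc_id = ?1);"
        "DELETE FROM fts_mapping WHERE doc_id = ?1;";
    sqlite3_stmt *pStmt = NULL;
    const char *zTail = zSql;
    int rc = SQLITE_OK;

    /* Prepare and run the two statements one after the other. */
    while (rc == SQLITE_OK && zTail && zTail[0]) {
        rc = sqlite3_prepare_v2(ftsDb, zTail, -1, &pStmt, &zTail);
        if (rc != SQLITE_OK || pStmt == NULL) break;
        sqlite3_bind_int64(pStmt, 1, docId);
        if (sqlite3_step(pStmt) != SQLITE_DONE) rc = SQLITE_ERROR;
        sqlite3_finalize(pStmt);
    }
    return rc;
}
```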
8.5 Effect comparison before and after optimization
The index performance data of chat history before and after optimization is as follows:
The index performance data of Favorites before and after optimization is as follows:
9. Application layer optimization 3: search logic optimization
9.1 The problem
When a user searches for content on the homepage of iOS WeChat, the interaction logic is as follows:
As shown above: whenever the user changes the text in the search box, search tasks for all businesses are launched in parallel, and as each task finishes, its results are returned to the main thread for display. This cycle repeats each time the user changes the search text.
9.2 A single search task should support parallel execution
Although different search tasks already run in parallel, the data volume and search logic of each business differ greatly. A task with a large data set or complex search logic takes a long time on its own, so the phone's parallel processing power is not fully exploited.
We can bring parallelism inside a single search task as well, in two ways:
1) For businesses with a large amount of data (such as chat history search): the index data can be split evenly across several FTS index tables (note that an uneven split still leaves a long-tail effect). At search time the tables are queried in parallel, and the per-table results are then merged and sorted together. The number of tables must be chosen carefully: too many exceeds the phone's real parallel capacity and also degrades other search tasks; too few fails to exploit that capacity. WeChat used to spread the chat history index over ten FTS tables and now uses four.
2) For businesses with complex search logic (such as contact search): the independently executable parts of the search can run in parallel. In the contact search task, for example, plain text search, pinyin search, tag and region search, and group member search are executed in parallel, and the results are merged and sorted once they all finish. Why not split tables here as well? Because in this scenario the number of results is small and the time is dominated by looking up the index itself; the index can be viewed as a B-tree, and splitting one B-tree into several does not reduce lookup time proportionally.
9.3 Search tasks should support interruption
As the user keeps typing in the search box, search tasks may be launched again and again. If a new task starts before the previous one has finished, the two tasks interfere with each other and performance suffers.
This happens easily as the user types from a short query to a longer one: a short query matches many results and takes longer, so it is more likely to overlap with the tasks that follow, and running too many tasks at once can overheat the phone and exhaust memory.
Therefore, a search task must be interruptible at any time, so that when the next search task is launched, the previous one can be interrupted and tasks never pile up.
We implement this by giving each search task a CancelFlag. While the search runs, it checks the flag each time it produces a result and exits immediately if the flag is set; outside logic interrupts the task simply by setting its CancelFlag.
The logic flow is shown in the following figure:
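A minimal sketch of this pattern, with the table, column names and callback types being illustrative assumptions:

```c
#include <sqlite3.h>
#include <stdatomic.h>

/* Results are read row by row with sqlite3_step(), and an atomic CancelFlag
 * is checked after every row, so a newer search can stop this one almost
 * immediately.  No ORDER BY is used; sorting happens afterwards on the
 * collected results (see the next paragraphs). */
typedef struct {
    atomic_bool cancelled;      /* set from another thread to interrupt */
} SearchTask;

typedef void (*ResultSink)(sqlite3_int64 docId, sqlite3_int64 sortKey, void *pUser);

static int run_search(sqlite3 *ftsDb, SearchTask *pTask, const char *zQuery,
                      ResultSink xSink, void *pUser)
{
    sqlite3_stmt *pStmt = NULL;
    int rc = sqlite3_prepare_v2(ftsDb,
        "SELECT doc_id, create_time FROM msg_fts WHERE msg_fts MATCH ?1;",
        -1, &pStmt, NULL);
    if (rc != SQLITE_OK) return rc;

    sqlite3_bind_text(pStmt, 1, zQuery, -1, SQLITE_TRANSIENT);

    while (sqlite3_step(pStmt) == SQLITE_ROW) {
        xSink(sqlite3_column_int64(pStmt, 0), sqlite3_column_int64(pStmt, 1), pUser);
        if (atomic_load(&pTask->cancelled)) {   /* check CancelFlag per result */
            rc = SQLITE_INTERRUPT;
            break;
        }
    }
    sqlite3_finalize(pStmt);
    return rc;
}

/* Another thread interrupts with:  atomic_store(&task->cancelled, true); */
```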
For the task to be interrupted promptly, the CancelFlag checks need to be spaced as evenly as possible, which means we must avoid using an ORDER BY clause to sort the results inside the query.
Because FTS5 does not support composite indexes, with an ORDER BY clause SQLite traverses and sorts all matching rows before returning the first one. The time to produce the first result then nearly equals the time to produce all results, and the interruption logic becomes meaningless.
Dropping ORDER BY imposes two constraints on the search logic:
1) Sorting happens only after all results have been read from the database: we read the sort fields along with each result, then sort the full result set once reading is finished. Sorting accounts for a tiny fraction of the total search time, and sorting it ourselves performs about as well as letting SQLite sort, so the impact on search speed is negligible.
2) Paged queries cannot be used: in full-text search, paging is of little use anyway, because paging requires sorted results, and sorting requires traversing all results, so paging does not reduce search time (unless the paging is done by the FTS index's rowid, which carries no business meaning).
9.4 Search reads should be minimized
The amount of content read while searching is also a key factor in determining how long a search takes.
An FTS index table is actually made up of several ordinary SQLite tables: some store the inverted index itself, and one stores all the original text written into the FTS index table. Whenever a search reads anything other than the rowid, it has to use the rowid to look up the table that holds the original text.
The internal execution process of the output result of the index table is as follows:
Therefore, the less content is read, the faster results are produced; reading too much content also risks excessive memory use.
Our approach: during the search we read only the business data id and the business attributes used for sorting. After sorting, when results actually need to be displayed, we use the business data id to load the concrete business content on demand and display it. This is also very extensible: what the search results display can keep evolving with each business's needs without changing what is stored.
One more point worth emphasizing: avoid reading highlight information during the search (SQLite's highlight function can produce it). Building a highlighted field requires not only reading the original text but also re-tokenizing it to locate the hit positions, and with many results the tokenization cost becomes very noticeable.
So how do we show highlighted matches when displaying results? We tokenize the user's search text, and when rendering each result we find the position of each Token within the displayed text and highlight it (again, because the user sees only a screenful of results at a time, the cost of this highlighting logic is negligible).
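A minimal sketch of that display-time step; it only does a raw byte-level substring scan, whereas the real logic applies the same tokenizer transforms (case folding, normalization, and so on) to both sides before matching.

```c
#include <string.h>

/* The user's query is tokenized once; for each result row we locate every
 * token inside the text about to be displayed and record the byte ranges to
 * highlight.  This runs only for the screenful of results the user can see. */
typedef struct { int iStart; int iEnd; } HighlightRange;

/* Returns the number of ranges written into aOut (at most nOut). */
static int highlight_ranges(const char *zDisplay,
                            const char *const *azToken, int nToken,
                            HighlightRange *aOut, int nOut)
{
    int n = 0;
    for (int i = 0; i < nToken && n < nOut; i++) {
        const char *p = strstr(zDisplay, azToken[i]);   /* first occurrence only */
        if (p) {
            aOut[n].iStart = (int)(p - zDisplay);
            aOut[n].iEnd   = aOut[n].iStart + (int)strlen(azToken[i]);
            n++;
        }
    }
    return n;
}
```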
Of course, when the matching rules are complex, it is more convenient to read the highlight information directly (for contact search, for instance, we use the SubstringMatchInfo function described earlier to read the highlighted content). The main reason is that we need the level and position of the match for sorting anyway, so re-tokenizing each result one by one is unavoidable.
9.5 Effect comparison before and after optimization
The following is a comparison of the search time before and after the optimization of each search business of WeChat:
10. Summary of this article
iOS WeChat has now fully applied this new full-text search solution to chat history, contact, and Favorites search.
With the new solution, the full-text search index files take up less space, index updates take less time, and searches are faster; full-text search performance has improved across the board.