Big guys use Redis to implement a lightweight search engine, awesome!

Scenes

If you are doing back-end development, you must have implemented the list query interface. Of course, some query conditions are very simple, and one SQL can be done, but some query conditions are extremely complicated, plus various design in the library table Unreasonable, making the query interface particularly difficult to write, and then work overtime or something needless to say (I wonder if you feel this way~).

Let’s start with an example. This is the search condition of a shopping website. If you were to implement such a search interface, how would you implement it? (Of course you said that with the help of search engines, like Elasticsearch, you can do it. But what I want to say here is, what if you want to do it yourself?)

As can be seen from the above figure, the search is divided into 6 categories, and each category is divided into subcategories. In the middle, the conditions of the major categories are the intersections. There are single-selection, multiple-selection, and custom cases in each sub-category, and the result set that meets the conditions is finally output.

Well, now that the requirements are clear, we will start to implement them.

Achieve 1

The first to appear on the stage is the small A classmate, he is an "expert" in writing SQL. Little A confidently said: "Isn't it just a query interface? Looking at the conditions, but with my rich SQL experience, this is still not difficult for me."

So I wrote the following code (here is MYSQL as an example):

select ... from table_1
left join table_2
left join table_3
left join (select ... from table_x where ...) tmp_1
...
where ...
order by ...
limit m,n

The code ran in the test environment, and the results seemed to match, so I was ready to pre-release. With this pre-release, the problem began to be exposed. The pre-launch is to make the online environment as realistic as possible, so the amount of data is naturally much larger than the test. So for such a complex SQL, its execution efficiency can be imagined. The test students decisively typed the code of Little A back.

Realization 2

Summarizing the lessons of Xiao A's failure, Xiao B began to optimize SQL, first passed the explain keyword for SQL performance analysis, and added indexes to the places where the indexes were added. At the same time, a complex SQL is split into multiple SQLs, and the calculation results are calculated in the program memory.

The pseudo code is as follows:

$result_1 = query('select ... from table_1 where ...');
$result_2 = query('select ... from table_2 where ...');
$result_3 = query('select ... from table_3 where ...');
...

$result = array_intersect($result_1, $result_2, $result_3, ...);

This kind of scheme is obviously much better than the first kind in performance, but when the function is checked and accepted, the product manager still feels that the query speed is not fast enough. Little B himself knows that each query will query the database multiple times, and for some historical reasons, some conditions cannot be single-table query, so the query waiting time is unavoidable.

Realize 3

Little C sees the room for optimization from the above scheme. He found that Little B has no problem in thinking, split the complex conditions, calculate the result set of each sub-dimension, and finally merge all the sub-result sets to get the final desired result.

So he suddenly wondered whether the result set of each sub-dimension can be cached in advance, so that when the query is to be performed, the desired subset can be directly retrieved instead of the database calculation every time.

Here, Little C uses Redis to store cached data. The main reason for using it is that it provides a variety of data structures, and it is easy to perform set intersection operations in Redis.

The specific scheme is shown in the figure:

Here, for each condition, the calculated result set ID is stored in the corresponding key in advance, and the selected data structure is set. Query operations include:

Sub-category single selection: get the corresponding result set directly according to the condition key;
Multiple selection of sub-categories: According to multiple condition keys, perform a union operation to obtain the corresponding result set;
Final result: Intersecting all the sub-categories obtained to obtain the final result;

This is actually the so-called reverse index.

Here you will find that a price condition is missing. It can be seen from the demand that the price condition is an interval, and it is infinite. Therefore, the above-mentioned Key-Value method with exhaustive conditions cannot be achieved. Here we use another data structure of Redis to implement, ordered set (Sorted Set):

Add all products to the ordered set with Key as the price, the value is the product ID, and the score corresponding to each value is the value of the product price. In this way, in the ordered set of Redis, the ZRANGEBYSCORE command can be used to obtain the corresponding result set according to the score (price) range.

At this point, the optimization of Scheme 3 has all ended, and the data query and calculation are separated by means of caching. In each search, you only need to search Redis a few times to get the result. The query speed meets the acceptance requirements.

Expand

Pagination

Here you may have found a serious functional flaw, how can list queries have no pagination. Yes, let's take a look at how Redis implements paging right away.

Paging mainly involves sorting. For the sake of simplicity, take the creation time as an example.

as the picture shows:

The blue part in the figure is an ordered collection of commodities with creation time as the score. The result set below the blue is the result of conditional calculation. Through the ZINTERSTORE command, the weight of the result set is 0, and the commodity time result is 1. The result set obtained by taking the intersection is assigned a new ordered set of creation time points. The operation on the new result set can get all the data needed for paging:

The total number of pages is: ZCOUNT command
Current page content: ZRANGE command
In reverse order: ZREVRANGE command

Data Update

There are two ways to update the index data. One is to immediately trigger the update operation through the modification of commodity data, and the other is to perform batch updates through timing scripts.

It should be noted here that regarding the update of the index content, if the Key is deleted violently, the Key shall be reset again. Because the two operations in Redis will not be performed atomically, there may be a gap between them. It is recommended to remove only the invalid elements from the collection and add new elements.

Performance optimization

Redis is a memory-level operation, so a single query will be fast. But if there are multiple Redis operations in our implementation, the multiple connection time of Redis may be unnecessary time consumption. By using the MULTI command, open a transaction, put multiple Redis operations in one transaction, and finally execute atomic execution through EXEC (note: the so-called transaction here is just to execute multiple operations in one connection, if you execute If there is a failure in the process, it will not be rolled back).

to sum up

Here is just a simple demo that uses Redis to optimize query and search. Compared with the existing open source search engine, it is lighter and the learning cost page is correspondingly lower. Secondly, some of its ideas are similar to open source search engines. If you add word analysis, you can also achieve similar full-text search functions.

Source: https://github.com/jasonGeng88/blog/blob/master/201706/redis-search.md

Big guys use Redis to implement a lightweight search engine, awesome!

Scenes

Achieve 1

Realization 2

Realize 3

Expand

Pagination

Data Update

Performance optimization

to sum up

民工哥

引用和评论

早知道有这么个吊炸天的 CI&CD 工具，我就不用 Jenkins 了！

70k star，取代Postman！这款轻量级API工具，太香了！

嘎嘎好用！推荐三款开源的 Redis 桌面客户端！

MySQL慢查询日志：性能优化的终极指南

C++ 中 VS 项目引入公共配置文件

做到真正0丢失、0重复：Apache SeaTunnel 实现万亿级数据一致性全解密

大数据从业者必知必会的Hive SQL调优技巧