This is the final article in the full-text indexing series. Let's talk about how MySQL full-text indexing handles Chinese. Before that, let's first look at what word segmentation is.
By default, the MySQL full-text parser processes text as a single-byte stream: the input is split into keywords at stop characters (by default, spaces and punctuation), and each keyword's document ID and position are saved in auxiliary tables for later retrieval. For single-byte text such as English words and numbers this works very well. In "I am a boy!", each word is clearly separated by a space, and later queries simply look up words using the space as a separator. I have covered all of this in detail in the previous three articles.

However, this segmentation method is not at all friendly to multi-byte text such as Chinese. In Chinese each character stands on its own, characters combine into words of varying length, and the words are not separated by spaces. For example, the sentence 为中国人自豪 ("proud to be Chinese") contains the three words 为 ("for"), 中国人 ("Chinese"), and 自豪 ("proud"), with no delimiter between them. With the default full-text index processing, searching for any of these words returns nothing. This indirectly leads people to say that MySQL's full-text search is inaccurate and unreliable. In fact, that is not the case; the real cause is how the MySQL full-text index defines tokens and stop characters. Take the following table ft_ch with three records: no matter what you search for, nothing is found.
mysql> create table ft_ch(id int unsigned auto_increment primary key, s1 varchar(200), fulltext ft_s1(s1));
Query OK, 0 rows affected (0.39 sec)
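(The original INSERT statements are not shown; reconstructed from the SELECT output further down, the three rows can be loaded like this:)
mysql> insert into ft_ch(s1) values('我是中国人你是哪里人?'),('我是中国人,你是哪里人?'),('我是中国人 你是哪里人?');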
mysql> select * from ft_ch where match(s1) against ('我是');
Empty set (0.00 sec)
mysql> select * from ft_ch where match(s1) against ('中国');
Empty set (0.00 sec)
mysql> select * from ft_ch where match(s1) against ('我是中');
Empty set (0.01 sec)
But this table actually has records:
mysql> select * from ft_ch;
+----+--------------------------------------+
| id | s1 |
+----+--------------------------------------+
| 1 | 我是中国人你是哪里人? |
| 2 | 我是中国人,你是哪里人? |
| 3 | 我是中国人 你是哪里人? |
+----+--------------------------------------+
3 rows in set (0.00 sec)
So what's the problem? Thinking back to the earlier articles on full-text indexing, you might suspect that the token length limits are wrong or that the stop characters are at fault. The token length limits are set by the parameters below; the default stop characters are spaces and punctuation.
mysql> show variables like '%innodb_ft_%token%';
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| innodb_ft_max_token_size | 84 |
| innodb_ft_min_token_size | 3 |
+--------------------------+-------+
2 rows in set (0.00 sec)
These two parameters define the minimum and maximum token lengths; only keywords within this range, cut at stop-character boundaries, end up in the index and can be retrieved. The earlier queries returned nothing because the keywords in the SQL never lined up with a stop-character boundary. So let's change the keyword and query again, this time searching for 我是中国人 ("I am Chinese"); in the data, this keyword happens to be followed by a stop character (a space or a comma).
mysql> select * from ft_ch where match(s1) against ('我是中国人');
+----+--------------------------------------+
| id | s1 |
+----+--------------------------------------+
| 2 | 我是中国人,你是哪里人? |
| 3 | 我是中国人 你是哪里人? |
+----+--------------------------------------+
2 rows in set (0.00 sec)
The results are there, but incomplete: the record with ID 1 was not found. What's the reason? Record 1 contains no stop character at all, so the default parser indexed the entire string as one long token, and our keyword does not match a whole token. Could shrinking the token size help, say lowering innodb_ft_min_token_size to 2? Only if the data were also re-entered with explicit separators, similar to the following:
insert into ft_ch(s1) values ("I am Chinese, where are you from?");
This data looks bizarre: to satisfy the configured token size, the sentence has to be pre-split with spaces as it is entered, which mangles the stored text, and anything queried back later would have to be re-joined and post-processed. Clearly this is not feasible.
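For reference, the parameter change itself would look roughly like this (a sketch; innodb_ft_min_token_size is read-only at runtime, so it has to be set in the configuration file, the server restarted, and the full-text index rebuilt):

# my.cnf (hypothetical location; adjust for your installation)
[mysqld]
innodb_ft_min_token_size = 2

mysql> alter table ft_ch drop key ft_s1, add fulltext ft_s1(s1);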
Is there a way for MySQL to accept data written the way Chinese is actually written and still query it normally? There is: MySQL ships with a native plugin for this, the ngram full-text parser. Let me introduce it.
First check whether the ngram plugin is loaded; LOAD_OPTION showing ON means it loaded successfully.
mysql> select * from information_schema.plugins where plugin_name = 'ngram'\G
*************************** 1. row ***************************
PLUGIN_NAME: ngram
PLUGIN_VERSION: 0.1
PLUGIN_STATUS: ACTIVE
PLUGIN_TYPE: FTPARSER
PLUGIN_TYPE_VERSION: 1.1
PLUGIN_LIBRARY: NULL
PLUGIN_LIBRARY_VERSION: NULL
PLUGIN_AUTHOR: Oracle Corp
PLUGIN_DESCRIPTION: Ngram Full-Text Parser
PLUGIN_LICENSE: GPL
LOAD_OPTION: ON
1 row in set (0.00 sec)
The ngram plugin has a single system parameter, which sets the token size. The default is 2, meaning every 2 consecutive characters form a token.
mysql> show variables like '%ngram%';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| ngram_token_size | 2 |
+------------------+-------+
1 row in set (0.00 sec)
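To make the bigram behavior concrete: with a token size of 2, the parser emits overlapping two-character tokens, so a string such as 我是中国人 is tokenized as 我是, 是中, 中国, 国人. You will see exactly this in the index cache output later in this article.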
For the table ft_ch, switching the full-text index from the default parser to ngram only requires the with parser ngram clause.
mysql> alter table ft_ch drop key ft_s1, add fulltext ft_s1_n(s1) with parser ngram;
Query OK, 0 rows affected (0.35 sec)
Records: 0 Duplicates: 0 Warnings: 0
To verify the ngram plugin, I insert another record, this time without any punctuation.
mysql> insert into ft_ch(s1) values('我是中国人你是哪里人');
Query OK, 1 row affected (0.01 sec)
Now re-run one of the earlier queries, and this time there are results.
mysql> select * from ft_ch where match(s1) against ('中国');
+----+--------------------------------------+
| id | s1 |
+----+--------------------------------------+
| 1 | 我是中国人你是哪里人? |
| 2 | 我是中国人,你是哪里人? |
| 3 | 我是中国人 你是哪里人? |
| 4 | 我是中国人你是哪里人 |
+----+--------------------------------------+
4 rows in set (0.00 sec)
Next, let's look at how these records are tokenized and how that differs from the default full-text tokenization. Clone the table ft_ch as a new table ft_en.
mysql> create table ft_en like ft_ch;
Query OK, 0 rows affected (0.40 sec)
mysql> alter table ft_en drop key ft_s1_n, add fulltext ft_s1(s1);
Query OK, 0 rows affected (0.34 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> insert into ft_en select * from ft_ch;
Query OK, 4 rows affected (0.02 sec)
Records: 4 Duplicates: 0 Warnings: 0
Point the full-text index monitor at table ft_en.
mysql> set global innodb_ft_aux_table = 'ytt/ft_en';
Query OK, 0 rows affected (0.00 sec)
Looking at the full-text index cache table, you can see that the tokens were cut only at stop characters. For Chinese, such an index is very incomplete.
mysql> select word,doc_id,position,doc_count from information_schema.innodb_ft_index_cache;
+--------------------------------+--------+----------+-----------+
| word | doc_id | position | doc_count |
+--------------------------------+--------+----------+-----------+
| 你是哪里人 | 3 | 18 | 2 |
| 你是哪里人 | 4 | 16 | 2 |
| 我是中国人 | 3 | 0 | 2 |
| 我是中国人 | 4 | 0 | 2 |
| 我是中国人你是哪里人 | 2 | 0 | 2 |
| 我是中国人你是哪里人 | 5 | 0 | 2 |
+--------------------------------+--------+----------+-----------+
6 rows in set (0.00 sec)
Now switch the monitor to the ngram-indexed table.
mysql> set global innodb_ft_aux_table = 'ytt/ft_ch';
Query OK, 0 rows affected (0.00 sec)
Checking the full-text index cache table again, you can see the tokens are cut strictly at the configured size, with no oversized entries and no dependence on stop characters. This is far better suited to Chinese.
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+--------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 中国 | 6 | 6 | 1 | 6 | 6 |
| 人你 | 6 | 6 | 1 | 6 | 12 |
| 你是 | 6 | 6 | 1 | 6 | 15 |
| 哪里 | 6 | 6 | 1 | 6 | 21 |
| 国人 | 6 | 6 | 1 | 6 | 9 |
| 我是 | 6 | 6 | 1 | 6 | 0 |
| 是中 | 6 | 6 | 1 | 6 | 3 |
| 是哪 | 6 | 6 | 1 | 6 | 18 |
| 里人 | 6 | 6 | 1 | 6 | 24 |
+--------+--------------+-------------+-----------+--------+----------+
9 rows in set (0.00 sec)
From the output above we can also see that punctuation never appears inside the tokens: the ngram parser filters out such tokens by default (which is also part of why the default token size is 2).
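You can check this directly against the cache (a quick query I have added; given the cache contents shown above, it should come back empty):

mysql> select word from information_schema.innodb_ft_index_cache where word like '%,%' or word like '%?%';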
Apart from how the tokens themselves are stored, the ngram index is otherwise no different from the default full-text index. For example, checking the internal index tables shows the same storage layout as with the default parser.
mysql> select table_id from information_schema.innodb_tables where name = 'ytt/ft_ch' into @tid;
Query OK, 1 row affected (0.00 sec)
mysql> select table_id,name from information_schema.innodb_tables where name like concat('ytt/fts_',lpad(hex(@tid),16,'0'),'%');
+----------+---------------------------------------------------+
| table_id | name |
+----------+---------------------------------------------------+
| 1431 | ytt/fts_0000000000000596_being_deleted |
| 1432 | ytt/fts_0000000000000596_being_deleted_cache |
| 1433 | ytt/fts_0000000000000596_config |
| 1434 | ytt/fts_0000000000000596_deleted |
| 1435 | ytt/fts_0000000000000596_deleted_cache |
| 1442 | ytt/fts_0000000000000596_00000000000002be_index_1 |
| 1443 | ytt/fts_0000000000000596_00000000000002be_index_2 |
| 1444 | ytt/fts_0000000000000596_00000000000002be_index_3 |
| 1445 | ytt/fts_0000000000000596_00000000000002be_index_4 |
| 1446 | ytt/fts_0000000000000596_00000000000002be_index_5 |
| 1447 | ytt/fts_0000000000000596_00000000000002be_index_6 |
+----------+---------------------------------------------------+
11 rows in set (0.00 sec)
So far, after switching the index to ngram, we have only checked whether queries return anything, not whether what they return is correct. To better illustrate the accuracy of the results, I truncate the table and insert two fresh rows:
mysql> truncate ft_ch;
Query OK, 0 rows affected (0.52 sec)
mysql> insert into ft_ch(s1) values('我是中国人,你呢?');
Query OK, 1 row affected (0.02 sec)
mysql> insert into ft_ch(s1) values('我是外国人,你呢?');
Query OK, 1 row affected (0.01 sec)
Now let's look at how the ngram plugin affects search results. match ... against defaults to natural language mode. Searching for the keyword 中国人 ("Chinese") matches both rows, even though the record with ID 2 does not contain the keyword. Why does MySQL return an irrelevant record? In natural language mode, the search keyword is itself split into tokens of the configured size and the results are unioned: 中国人 becomes the two bigrams 中国 and 国人, which MySQL combines with OR, so every record containing either 中国 or 国人 is returned. Record 2 contains 国人 (inside 外国人), hence the two results. But this is not the result we expected.
mysql> select * from ft_ch where match(s1) against('中国人' in natural language mode);
+----+-----------------------------+
| id | s1 |
+----+-----------------------------+
| 1 | 我是中国人,你呢? |
| 2 | 我是外国人,你呢? |
+----+-----------------------------+
2 rows in set (0.00 sec)
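To see where the extra row comes from, search for the second bigram directly (a check I have added; both rows should match, because 外国人 also contains the bigram 国人):

mysql> select * from ft_ch where match(s1) against('国人' in boolean mode);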
To make the results exact, you must use Boolean mode: there, the tokens of the keyword must match as a phrase, so only records actually containing the whole keyword are returned. The following result is correct.
mysql> select * from ft_ch where match(s1) against('中国人' in boolean mode);
+----+-----------------------------+
| id | s1 |
+----+-----------------------------+
| 1 | 我是中国人,你呢? |
+----+-----------------------------+
1 row in set (0.00 sec)
But what if you want looser matching, for example matching a single character? Boolean mode cannot deliver that here. It is not that the result is wrong; the token size is simply too large. The search keyword below is a single character, while the token size is 2, so no indexed token can ever match it.
mysql> select * from ft_ch where match(s1) against('国' in boolean mode);
Empty set (0.01 sec)
In that case, reduce the token size to 1: set ngram_token_size=1 in the configuration file and restart the MySQL service (the variable is read-only at runtime).
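A sketch of the change (file location and restart command depend on your installation):

# my.cnf
[mysqld]
ngram_token_size = 1

# then restart the server, for example:
$ systemctl restart mysqld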
Point the monitor at table ft_ch again.
mysql> set global innodb_ft_aux_table='ytt/ft_ch';
Query OK, 0 rows affected (0.01 sec)
After changing the token size, the index must be rebuilt. As the output below shows, the indexed tokens now include the punctuation marks themselves; this kind of index noise is why the ngram plugin's token size defaults to 2.
mysql> alter table ft_ch drop key ft_s1_n;
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table ft_ch add fulltext ft_s1_n(s1) with parser ngram;
Query OK, 0 rows affected (0.25 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> select * from information_schema.innodb_ft_index_table;
+------+--------------+-------------+-----------+--------+----------+
| WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+------+--------------+-------------+-----------+--------+----------+
| , | 2 | 3 | 2 | 2 | 15 |
| , | 2 | 3 | 2 | 3 | 15 |
| ? | 2 | 3 | 2 | 2 | 24 |
| ? | 2 | 3 | 2 | 3 | 24 |
| 中 | 2 | 2 | 1 | 2 | 6 |
| 人 | 2 | 3 | 2 | 2 | 12 |
| 人 | 2 | 3 | 2 | 3 | 12 |
| 你 | 2 | 3 | 2 | 2 | 18 |
| 你 | 2 | 3 | 2 | 3 | 18 |
| 呢 | 2 | 3 | 2 | 2 | 21 |
| 呢 | 2 | 3 | 2 | 3 | 21 |
| 国 | 2 | 3 | 2 | 2 | 9 |
| 国 | 2 | 3 | 2 | 3 | 9 |
| 外 | 3 | 3 | 1 | 3 | 6 |
| 我 | 2 | 3 | 2 | 2 | 0 |
| 我 | 2 | 3 | 2 | 3 | 0 |
| 是 | 2 | 3 | 2 | 2 | 3 |
| 是 | 2 | 3 | 2 | 3 | 3 |
+------+--------------+-------------+-----------+--------+----------+
18 rows in set (0.00 sec)
Now re-run the earlier SQL, and the results are there.
mysql> select * from ft_ch where match(s1) against('国' in boolean mode);
+----+-----------------------------+
| id | s1 |
+----+-----------------------------+
| 1 | 我是中国人,你呢? |
| 2 | 我是外国人,你呢? |
+----+-----------------------------+
2 rows in set (0.00 sec)
mysql> select * from ft_ch where match(s1) against('中国人' in boolean mode);
+----+-----------------------------+
| id | s1 |
+----+-----------------------------+
| 1 | 我是中国人,你呢? |
+----+-----------------------------+
1 row in set (0.00 sec)
mysql> select * from ft_ch where match(s1) against('人,' in boolean mode);
+----+-----------------------------+
| id | s1 |
+----+-----------------------------+
| 1 | 我是中国人,你呢? |
| 2 | 我是外国人,你呢? |
+----+-----------------------------+
2 rows in set (0.00 sec)
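As a final note, Boolean-mode operators also work with the ngram parser. A hedged example of my own (not from the original test run): with ngram_token_size=1, the following should return only the row with ID 1, since + requires the character 国 and - excludes any row containing 外.

mysql> select * from ft_ch where match(s1) against('+国 -外' in boolean mode);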
What other MySQL technical topics would you like to read about? Hurry up and leave a comment to let the editor know!