Issue 28: Index design (using full-text index)

The previous article introduced the basic principles of full-text indexing. This article will talk about how to better use full-text indexing.

The syntax of the full-text index search is different from that of the ordinary search, and the ordinary search is generally similar to the following SQL:

select * from tb1 where id in (1,2);
select * from tb1 where id < 10;

The filter condition is after the WHERE clause, and the SQL is spliced in a certain way. The full-text index has a specific syntax:

MATCH (col1,col2,...) AGAINST (expr [search_modifier])
search_modifier:
{
IN NATURAL LANGUAGE MODE
| IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
| IN BOOLEAN MODE
| WITH QUERY EXPANSION
}

The sample table used in this article is as follows, the number of records in the table is 10W rows:

 CREATE TABLE `fx` (
  `id` int NOT NULL AUTO_INCREMENT,
  `s1` varchar(200) DEFAULT NULL,
  `log_time` datetime DEFAULT NULL,
  `s2` varchar(200) DEFAULT NULL,
  `s3` text,
  PRIMARY KEY (`id`),
  KEY `idx_log_time` (`log_time`)
)

Now create a full-text index for the table fx field s1, and then do a simple search with the keyword "cluster".

mysql> alter table fx add fulltext ft_s1(s1);
Query OK, 0 rows affected, 1 warning (5.94 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> select count(*) from fx where match(s1) against ('cluster');
+----------+
| count(*) |
+----------+
|     9396 |
+----------+
1 row in set (0.01 sec)

There are three modes of full-text indexing: natural language mode; query expansion mode; and Boolean mode.

Natural language model:

This is the default mode, so the statement above is actually written like this, named SQL1, which is the most optimized way of writing.

# SQL 1
select count(*) from fx where match(s1) against ('cluster' in natural language mode);

To be more detailed, change SQL1 to the following wording and name it SQL 2.

# SQL 2
select count(*) from fx where match(s1) against ('cluster' in natural language mode) >0;

Change the SQL 2 slightly, adjust the filter Match against after the where clause to the select clause, and name it SQL 3.

# SQL 3
select count(if(match(s1) against ('cluster' in natural language mode),1,null)) as "count(*)" from fx;

Look at the three SQL execution plans:

SQL 1 is the best, the EXTRA column results show that this is an optimized query;

SQL is second, followed by full-text index ft_s1;

SQL 3 has no WHERE filter condition, so full table scan has the worst performance.

# SQL 1 执行计划
mysql> explain select count(*) from fx where match(s1) against ('cluster' in natural language mode)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
...
        Extra: Select tables optimized away
1 row in set, 1 warning (0.01 sec)

# SQL 2 执行计划
mysql> explain select count(*) from fx where match(s1) against ('cluster' in natural language mode)>0\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: fx
   partitions: NULL
         type: fulltext
possible_keys: ft_s1
          key: ft_s1
      key_len: 0
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Ft_hints: rank > 0
1 row in set, 1 warning (0.01 sec)

# SQL 3 执行计划
mysql> explain select count(if(match(s1) against ('cluster' in natural language mode),1,null)) as "count(*)" from fx\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: fx
   partitions: NULL
         type: ALL
...
         rows: 99763
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.01 sec)

There is a keyword in the full-text index called "relevance" , which refers to the degree of matching of the searched keyword in each record. The higher the relevance, the more criterion the searched data. The following SQL 4 is output in reverse order according to the relevance of the keyword'mysql'.

# SQL 4
SELECT 
    MATCH (s1) AGAINST ('mysql' IN NATURAL LANGUAGE MODE) AS relevance
FROM
    fx
WHERE
    MATCH (s1) AGAINST ('mysql' IN NATURAL LANGUAGE MODE) > 0

Looking at the execution plan of SQL 4 at this time, it is found that it is a little different from the previous SQLs, with an additional sorting process.

Extra: Using where; Ft_hints: sorted, rank > 0

Then print out the correlation and execute SQL 5.

#SQL 5
SELECT 
    s1, relevance
FROM
    (SELECT 
        s1,
            MATCH (s1) AGAINST ('mysql' IN NATURAL LANGUAGE MODE) AS relevance
    FROM
        fx
    WHERE
        MATCH (s1) AGAINST ('mysql' IN NATURAL LANGUAGE MODE)) outer_table
GROUP BY relevance;

The following is a list of the relevance field values in the SQL5 execution result:

+--------------------+
| relevance          |
+--------------------+
|  0.550973117351532 |
| 0.4591442346572876 |
|  0.367315411567688 |
|  0.275486558675766 |
|  0.183657705783844 |
|  0.091828852891922 |
+--------------------+
6 rows in set (0.54 sec)

The above few SQLs are all searched for a single keyword. What if there are multiple keywords now? For example, search for both "mysql" and "oracle" keywords at the same time? Just connect the two retrieval clauses, see SQL 6:

# SQL 6
SELECT 
    s1
FROM
    fx
WHERE
    MATCH (s1) AGAINST ('mysql' IN NATURAL LANGUAGE MODE)
        AND MATCH (s1) AGAINST ('oracle' IN NATURAL LANGUAGE MODE)

Or written as SQL 7:

# SQL 7
SELECT 
    *

FROM
    (SELECT 
        s1
    FROM
        fx
    WHERE
        MATCH (s1) AGAINST ('mysql' IN NATURAL LANGUAGE MODE)) outer_table
WHERE
    MATCH (s1) AGAINST ('oracle' IN NATURAL LANGUAGE MODE)

In addition to the above writing method, is there a simpler and more optimized writing method? This is the problem solved by another mode, the Boolean mode.

Boolean mode:

The Boolean mode has native operators that can handle filtering of multiple keywords, such as changing the previous SQL 6 and SQL 7 to the Boolean mode, named SQL 8.

# SQL 8
select s1 from fx where match(s1) against ('+mysql +oracle' in boolean mode);

It is very simple to refine it a little bit and add some conditions. For example, the search keywords contain'mysql','oracle' but do not contain ("postgresql", "mongodb", "sqlserver") records, named SQL 9.

# SQL 9
SELECT 
    s1
FROM
    fx
WHERE
    MATCH (s1) AGAINST ('+mysql +oracle -postgresql -mongodb -sqlserver' IN BOOLEAN MODE);

Looking at another piece of SQL, the keyword does not have any operation symbols. It does not refer to the search keyword "mysql oracle" but to search for records that contain "mysql" or contain the keyword "oracle".

# SQL 10
select s1 from fx where match(s1) against ('mysql oracle' in boolean mode);

The operators involved in the above SQL 8, SQL 9, SQL 10 are "+", "-", "", which represent "and", "not", and "or" respectively.

The difference between the Boolean mode and the natural language mode is that the results are not sorted according to the relevance of the keywords. For requirements that must be ranked according to relevance, consider using natural language models.

Boolean mode can not only filter multiple keywords, but also actively intervene in the relevance ranking of search keywords. The following SQL search contains both the keywords "mysql" and "oracle", and the relevance of the keyword "postgresql" is improved, which means it ranks high.

# SQL 11
select s1 from fx where match(s1) against ('+mysql +oracle >postgresql' in boolean mode);

Of course, the Boolean mode has many magical uses, so I won’t expand them one by one. Let’s see the query expansion mode

Query extended mode:

The significance of this mode is to expand the output of common-sense keywords implied by search results. For example, if you want to search for the keyword "database", you will not only search for "database", but also search records containing the keywords "mysql" and "oracle". The first time is "database", and the other time is "mysql", "oracle". After that, the two results are merged together to output the result.

In order to view the results clearly, create another table, the table structure is the same as fx, but there are only 8 records

mysql> select s1 from fx_few;
+----------------------------+
| s1                         |
+----------------------------+
| mysql oracle               |
| mysql postgresql sqlserver |
| windows unix linux         |
| man woman                  |
| lucy lily                  |
| sqlserver postgresql       |
| oracle postgresql          |
| mysql mongodb              |
+----------------------------+
8 rows in set (0.00 sec)

Use the query expansion mode to query the records containing the "sqlserver" keyword, and the results containing other database keywords will also be output.

mysql> select * from fx_few where match(s1) against ('sqlserver' with query expansion);
+----+----------------------------+----------+------+------+
| id | s1                         | log_time | s2   | s3   |
+----+----------------------------+----------+------+------+
|  2 | mysql postgresql sqlserver | NULL     | NULL | NULL |
|  6 | sqlserver postgresql       | NULL     | NULL | NULL |
|  1 | mysql oracle               | NULL     | NULL | NULL |
|  7 | oracle postgresql          | NULL     | NULL | NULL |
|  8 | mysql mongodb              | NULL     | NULL | NULL |
+----+----------------------------+----------+------+------+
5 rows in set (0.00 sec)

This is the end of this article about the simple usage of MySQL full-text indexing. Later, I will continue to explain how to improve the accuracy of full-text indexing results, as well as the optimization of full-text indexing and the use of Chinese plugins.

What else do you want to know about the technical content of MySQL? Hurry up and leave a message to tell the editor!

Issue 28: Index design (using full-text index)

爱可生开源社区

引用和评论

SQLShift 重大更新：Oracle→PostgreSQL 存储过程转换功能上线！

Issue 28: Index design (using full-text index)

爱可生开源社区

引用和评论

SQLShift 重大更新：Oracle→PostgreSQL 存储过程转换功能上线！​​

SQLShift 重大更新：Oracle→PostgreSQL 存储过程转换功能上线！