Issue 27: Index Design (Principle of Full-Text Indexing)

The B-tree index and hash index were introduced earlier, and then let's look at the MySQL full-text index.

Before talking about full-text indexing, you can look at the following very common types of SQL statements:

 select count(*) from fx where s1 like '%cluster%'

This statement retrieves the field s1 from the table fx, and the filter condition is'%cluster%'. Such a fuzzy search statement has poor performance. Even if there is an index on the field s1, the entry point cannot be found, so a full table scan of the table fx is performed , Especially for a large table, the performance of this type of SQL is undoubtedly fatal.

Full-text indexing is a good solution to the performance problems of this type of inefficient SQL. The concept of full-text indexing is just the opposite of the concept of ordinary B-tree indexing. The construction of B-tree indexing is based on all or part of a certain field value; full-text indexing is to divide all data of a certain field value according to a certain separator ( stop word ) and character length ( is also called participle length ) together to form various arrangements, and then record the position of these characters, the number of times and other static information in the index. I simply drew a picture as follows:

As you can see from this picture, the full-text index (also called the inverted index) is somewhat similar to the storage of the HASH index, except that KEY is a word, and VALUE is the document ID and corresponding location information to which the keyword belongs. For example, the word "YTT" appears in a certain position in 4 documents, that is, a certain position in 4 rows of records. FTS_DOC_ID refers to the ID of the document, and each record corresponds to an ID, which is similar to the primary key of a table.

Next, I will elaborate on the full-text index from several aspects. The examples shown in this article are based on the following table:

CREATE TABLE ft_sample (
    id INT PRIMARY KEY,
    s1 VARCHAR(200),
    log_time DATETIME,
    s2 TEXT,
    KEY idx_log_time (log_time)
);

Auxiliary table

First add a full-text index to the table ft_sample

mysql> alter table ft_sample add fulltext ft_s1(s1);
Query OK, 0 rows affected (0.35 sec)
Records: 0  Duplicates: 0  Warnings: 0

After creating a full-text index on the table, MySQL uses some auxiliary tables to store the relevant data points of the full-text index fields. If the table ft_sample does not belong to the shared table space, these tables can also be seen on the corresponding disk directory. as follows:

mysql> SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_TABLES  WHERE name LIKE 'ytt/fts%';
+----------+---------------------------------------------------+-------+
| table_id | name                                              | space |
+----------+---------------------------------------------------+-------+
|     1219 | ytt/fts_00000000000004c2_being_deleted            |   162 |
|     1220 | ytt/fts_00000000000004c2_being_deleted_cache      |   163 |
|     1221 | ytt/fts_00000000000004c2_config                   |   164 |
|     1222 | ytt/fts_00000000000004c2_deleted                  |   165 |
|     1223 | ytt/fts_00000000000004c2_deleted_cache            |   166 |
|     1230 | ytt/fts_00000000000004c2_00000000000001ba_index_1 |   173 |
|     1231 | ytt/fts_00000000000004c2_00000000000001ba_index_2 |   174 |
|     1232 | ytt/fts_00000000000004c2_00000000000001ba_index_3 |   175 |
|     1233 | ytt/fts_00000000000004c2_00000000000001ba_index_4 |   176 |
|     1234 | ytt/fts_00000000000004c2_00000000000001ba_index_5 |   177 |
|     1235 | ytt/fts_00000000000004c2_00000000000001ba_index_6 |   178 |
+----------+---------------------------------------------------+-------+
11 rows in set (0.00 sec)

Let's introduce these tables in detail below:

The suffix \_index_1-6 is called an auxiliary table, which stores the real data of the inverted index in order. As for the reason for the division of six tables, it can be understood as adding a full-text index to the field and parallelizing the word segmentation of the data. Refer to the parameter innodb_ft_sort_pll_degree to control the number of concurrency.

For example, the table name is: ytt/fts_00000000000004c2_00000000000001ba_index_1, where ytt represents the database name, the beginning of fts_ and the end of \_index_1 represent the auxiliary table, 00000000000004c2 represents the hexadecimal value of the corresponding table ID, and 00000000000001ba represents the hexadecimal value corresponding to the fulltext index field ID.制值。 System value.

Check the ID corresponding to the table ft_sample,

mysql> SELECT 
    a.table_id,
    HEX(a.table_id),
    a.index_id,
    HEX(a.index_id),
    a.name
FROM
    information_schema.innodb_indexes a,
    information_schema.innodb_tables b
WHERE
    a.table_id = b.table_id
        AND b.name = 'ytt/ft_sample'
        AND a.name = 'ft_s1';

+----------+-----------------+----------+-----------------+-------+
| table_id | hex(a.table_id) | index_id | hex(a.index_id) | name  |
+----------+-----------------+----------+-----------------+-------+
|     1218 | 4C2             |      442 | 1BA          | ft_s1 |
+----------+-----------------+----------+-----------------+-------+
1 row in set (0.00 sec)

The remaining tables that do not contain the full-text index field ID are general auxiliary tables, which record the configuration information of the index table and information about index deletion.

ytt/fts_00000000000004c2_deleted

ytt/fts_00000000000004c2_deleted_cache

The contents of these two tables are the same, and both contain a list of document IDs (DOC_ID) that are marked for deletion, but have not actually been deleted from the previous six index tables; the difference is that ytt/fts_00000000000004c2_deleted_cache is one of ytt/fts_00000000000004c2_deleted in memory copy.

ytt/fts_00000000000004c2_being_deleted

ytt/fts_00000000000004c2_being_deleted_cache

The contents of the two tables are also the same, and both contain the DOC_ID marked as deleted, and the corresponding DOC_ID is being deleted from the previous six index tables. Similarly, the table tt/fts_00000000000004c2_being_deleted_cache is a memory copy of the table ytt/fts_00000000000004c2_being_deleted.

The significance of the existence of the above four tables is to avoid frequent write operations in the full-text index field causing the corresponding six disk index tables to become hotspots. The problem caused by this is that the deleted records are stored in multiple copies, which are not deleted in time, which takes up extra disk space. However, you can use the MySQL statement "optimize table" to manually release these spaces in advance. The optimize table statement only organizes the B+ tree clustered index by default, not the full-text index. Here MySQL provides a parameter innodb_optimize_fulltext_only, which is closed by default. After this parameter is turned on, the statement optimize table will only defragment the disk space for the full-text index.

ytt/fts_00000000000004c2_config

This table contains the internal status information of the full-text index. The field FTS_SYNCED_DOC_ID is different from FTS_DOC_ID, indicating that the index record has been parsed and refreshed.

Full-text index buffer pool

The full-text index has a buffer pool: information_schema.innodb_ft_index_cache. Used to cache the write operations (insert/update) of the full-text index field, mark word segmentation and other related information, like other caches in MySQL, the purpose is to merge multiple frequent flashes into the defined buffer pool size after it is full One-time refresh (refresh to the previous six auxiliary tables). After flushing the table information_schema.innodb_ft_index_cache is cleared, next time you filter according to the full-text index field, directly query the corresponding disk index table; if there is an update to the full-text index field value at this time but the flushing has not been triggered, MySQL will set the buffer pool The data and the data in the disk index table are returned to the client together.

The variable that controls the size of the single-table buffer pool is: innodb_ft_cache_size, the default is 8MB, the minimum is 1.6MB, and the maximum is 80MB.

The variable that controls the buffer pool size of the entire MySQL instance is: innodb_ft_total_cache_size, the default is 640M, the minimum is 32MB, and the maximum is 1.6GB.

Document ID, DOC_ID

DOC_ID is the record ID of the index table for keyword mapping. Each record is treated as a document and mapped to a field FTS_DOC_ID in the MySQL full-text index table. If the full-text index table does not explicitly specify this field, MySQL creates a hidden field by default. In order to avoid the overhead of adding columns later, this field will not be deleted with the destruction of the full-text index. In other words, this field will always exist, unless the table is deleted.

In the sample table ft_sample at the beginning of this article, use the show extended columns statement to view the hidden fields:

mysql> show extended columns from fx;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int          | NO   | PRI | NULL    | auto_increment |
| s1          | varchar(200) | YES  | MUL | NULL    |                |
| log_time    | datetime     | YES  | MUL | NULL    |                |
| s2          | varchar(200) | YES  |     | NULL    |                |
| s3          | text         | YES  |     | NULL    |                |
| FTS_DOC_ID  |              | NO   |     | NULL    |                |
| DB_TRX_ID   |              | NO   |     | NULL    |                |
| DB_ROLL_PTR |              | NO   |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+
8 rows in set (0.01 sec)

If you want to explicitly customize this field and manually maintain the uniqueness of the value, you can specify a name as the FTS_DOC_ID field and the type as unsigned INT64 when the table is built, or before the full-text index is established. (Note that this field Must be uppercase). such as:

mysql> alter table ft_sample add FTS_DOC_ID bigint unsigned not null, add  unique key idx_FTS_DOC_ID (FTS_DOC_ID);
Query OK, 0 rows affected (0.13 sec)
Records: 0  Duplicates: 0  Warnings: 0

Full-text index transaction processing

The transaction processing of full-text indexing is a bit special, which is a bit different from the transaction processing of INNODB. For example, for INSERT/UPDATE operations on a full-text index table, you must wait for all COMMITs before you can retrieve the data that was just updated. Even in a transaction, you cannot see the data that was just updated but has not been COMMIT. for example:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into ft_sample values (1,'mysql oracle postgresql','2020-01-16 09:32:58','');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ft_sample values (2,'mysql oracle postgresql','2020-04-20 09:32:58','');
Query OK, 1 row affected (0.00 sec)

mysql> insert into ft_sample values (3,'mysql oracle postgresql','2020-09-30 09:32:58','');
Query OK, 1 row affected (0.01 sec)

mysql> insert into ft_sample values (4,'xfs ntfs','2020-10-30 09:32:58','');
Query OK, 1 row affected (0.00 sec)

mysql> select * from ft_sample where match (s1) against ('mysql');
Empty set (0.00 sec)

mysql> commit;
Query OK, 0 rows affected (0.02 sec)

mysql> select * from ft_sample where match (s1) against ('mysql');
+----+-------------------------+---------------------+------+
| id | s1                      | log_time            | s2   |
+----+-------------------------+---------------------+------+
|  1 | mysql oracle postgresql | 2020-01-16 09:32:58 |      |
|  2 | mysql oracle postgresql | 2020-04-20 09:32:58 |      |
|  3 | mysql oracle postgresql | 2020-09-30 09:32:58 |      |
+----+-------------------------+---------------------+------+
3 rows in set (0.00 sec)

As can be seen from the above example, before commit, the record of query keyword'mysql' does not exist, and after commit, it can be queried normally.

Through this introduction, I give a general introduction to the structure of the full-text index and its manifestation in MySQL, and the next one will talk about how to better use the full-text index.

Regarding the technical content of MySQL, what else do you want to know? Hurry up and leave a message to tell the editor!

Issue 27: Index Design (Principle of Full-Text Indexing)

Auxiliary table

Full-text index buffer pool

Document ID, DOC_ID

Full-text index transaction processing

爱可生开源社区

引用和评论

如何巧妙解决 Too many connections 报错？