What is an index? Why does MySQL have indexes, what problems do they solve, and what is the underlying principle? Why is the B+ tree the chosen solution? Couldn't a hash index or a B-tree be used instead?

A simple understanding of indexes

First of all, what is an index? If I told you outright that an index is an ordered data structure maintained by the database management system, you might be a little confused.

To avoid that, I will give a few examples to help you understand indexes more easily.

When we look up a character in a dictionary, we can locate it by its radical and stroke count, and so quickly find the page it is on. The lookup table at the front of the dictionary is an index.

A book's table of contents helps us jump quickly to different chapters. The table of contents is also an index.

Even the map of a scenic area tells you where you are now and where the other attractions are. That map is an index too.

Combined with the more formal definition at the beginning, you should now have a sense of what an index is.

Why do you need an index

Now that we know what an index is, why do we need one? As the examples above show, the purpose of an index is:

  • The index in the dictionary helps us find the corresponding word quickly
  • The catalog of the book helps us quickly jump to the chapters we need to read
  • The map of the scenic spot helps us quickly find the way to the scenic spot we want to go

In a database, an index helps us quickly locate the matching data row and then fetch the data of all its columns. This is much faster than scanning the whole table row by row. For today's web applications, a slow database response directly lengthens the response time of the entire request, which is disastrous for the user experience.

If users click a button and have to wait several seconds for a response, there is a good chance they will stop using the application you built.
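To make the difference concrete, here is a minimal Python sketch that compares checking rows one by one with a lookup through a sorted index. The table, the `full_scan` and `index_lookup` helpers, and the row layout are all hypothetical; a real database index is a B+ tree on disk, not a Python list.

```python
import bisect

# Hypothetical table: each row is (id, name), stored in no useful order.
rows = [(7, "grace"), (3, "carol"), (5, "erin"), (1, "alice"),
        (6, "frank"), (2, "bob"), (4, "dave")]

def full_scan(rows, target_id):
    """Check rows one by one; returns (row, rows_examined)."""
    for examined, row in enumerate(rows, start=1):
        if row[0] == target_id:
            return row, examined
    return None, len(rows)

# A toy "index": sorted ids, each paired with the row's position in the table.
index = sorted((row[0], pos) for pos, row in enumerate(rows))
index_keys = [key for key, _ in index]

def index_lookup(rows, target_id):
    """Binary-search the sorted keys, then jump straight to the row."""
    i = bisect.bisect_left(index_keys, target_id)
    if i < len(index_keys) and index_keys[i] == target_id:
        return rows[index[i][1]]
    return None

print(full_scan(rows, 4))     # ((4, 'dave'), 7): all 7 rows were examined
print(index_lookup(rows, 4))  # (4, 'dave'): found via binary search
```

With 7 rows the gap is small, but the scan grows linearly with the table while the binary search grows logarithmically.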

Indexes in MySQL

First of all, indexes are not implemented by the MySQL server layer itself; they are implemented by the storage engine. Here we are talking about InnoDB, MySQL's default storage engine. In InnoDB, indexes are divided into:

  • Clustered index
  • Non-clustered index

The clustered index is the index InnoDB builds on the primary key. For now, you can think of the key as the primary key and the value as the entire row of data. A table can have only one clustered index.

Of course, you are not required to define a primary key. Under normal circumstances, though, we create a monotonically increasing primary key, or generate one through a unified ID-generation algorithm. If no primary key is defined, InnoDB has its own fallback strategy: it picks the first unique index we defined whose columns are all NOT NULL and uses it as the clustered index.

In real production environments, though, corner cases like this do occur. If no unique index meets the requirement either, InnoDB creates a hidden 6-byte row ID as the primary key by itself and builds the clustered index on that hidden key.
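The fallback order described above can be sketched as a small Python function. This only illustrates the selection rule and is not InnoDB's code; `UniqueIndex`, `choose_clustered_key`, and the returned labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UniqueIndex:
    name: str
    all_not_null: bool  # True when every column in the index is declared NOT NULL

def choose_clustered_key(primary_key: Optional[str], unique_indexes: list) -> str:
    """Return a label for the key the clustered index would be built on."""
    # 1. An explicitly defined primary key always wins.
    if primary_key is not None:
        return "primary key: " + primary_key
    # 2. Otherwise, the first unique index whose columns are all NOT NULL.
    for idx in unique_indexes:
        if idx.all_not_null:
            return "unique index: " + idx.name
    # 3. Otherwise, a hidden 6-byte row ID generated by the engine.
    return "hidden row ID"

indexes = [UniqueIndex("uk_email", all_not_null=False),
           UniqueIndex("uk_phone", all_not_null=True)]
print(choose_clustered_key("id", indexes))   # primary key: id
print(choose_clustered_key(None, indexes))   # unique index: uk_phone
print(choose_clustered_key(None, []))        # hidden row ID
```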

A non-clustered index is an index built on the columns you specify, and is also called a secondary index. A table can have up to 64 secondary indexes. The key is the value of the indexed column, and the value is the primary key. In other words, a query through a non-clustered index only yields the indexed column's value plus the primary key value. To get the complete row, you have to look up the clustered index again using that primary key. This process is called going back to the table.
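Here is a minimal Python sketch of the two lookups, using plain dictionaries as stand-ins for the two B+ trees. The table contents, the `find_by_name` helper, and the assumption that names are unique are all made up for illustration.

```python
# Clustered index stand-in: primary key -> the entire row.
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
    3: {"id": 3, "name": "carol", "age": 35},
}

# Secondary index stand-in on `name`: indexed value -> primary key only.
# (Assumes names are unique, to keep the sketch short.)
secondary_name = {"alice": 1, "bob": 2, "carol": 3}

def find_by_name(name, columns):
    """Return (result, went_back_to_table) for a lookup by name."""
    pk = secondary_name.get(name)
    if pk is None:
        return None, False
    # The secondary index itself only knows the indexed value and the primary key.
    available = {"name": name, "id": pk}
    if all(col in available for col in columns):
        return {col: available[col] for col in columns}, False
    # Any other column requires a second lookup in the clustered index.
    row = clustered[pk]
    return {col: row[col] for col in columns}, True

print(find_by_name("bob", ["id", "name"]))  # ({'id': 2, 'name': 'bob'}, False)
print(find_by_name("bob", ["age"]))         # ({'age': 25}, True)
```

The first call is answered from the secondary index alone; the second one needs the extra trip to the clustered index.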

A note here: many blogs claim that with InnoDB a MySQL table can have at most 16 indexes. This is wrong. Evidently those authors copied the claim from elsewhere without verifying it themselves.

The official MySQL documentation clearly states that a table can have up to 64 non-clustered indexes, and that a single non-clustered index can contain at most 16 columns.

Note: it is the number of columns in one non-clustered index that cannot exceed 16!

This is a digression, but what does so-called technical rigor mean? Knowledge you pick up through other channels is at most the author's opinion. Stay skeptical and try to verify it yourself. Only after verification does it become fact.

New things keep emerging endlessly, but rather than memorizing terms by rote, trace them back to their roots and you will find they come down to the same few ideas.

The underlying principle of the index

Earlier I covered the types of indexes in InnoDB. With their classification and differences in mind, how do InnoDB indexes actually speed up queries? What is the underlying principle? The underlying structure of an InnoDB index is the B+ tree, a variant of the B-tree.

Let me first show you what a B+ tree looks like. The picture below is a B+ tree that stores the numbers 1 to 7.

<img src="https://tva1.sinaimg.cn/large/008i3skNgy1gqhyirkodxj30uh0akt9h.jpg" style="zoom:67%;" />

As you can see, each node in a B+ tree can have multiple child nodes, whereas in the binary trees we are more familiar with, each node has at most two. Moreover, the data stored in a B+ tree's nodes is ordered, and this ordering lets us perform fast exact-match and range queries. In addition, each leaf node of a B+ tree holds a pointer to the next leaf node, while the leaf nodes of a B-tree have no such pointers.

In MySQL InnoDB's actual implementation, the leaf nodes form a doubly linked list, each storing pointers to both the previous and the next node.

The figure below is a B-tree that contains integers "1-7". This figure should help you deepen your understanding of the difference between the two.

Moreover, in a B+ tree only the leaf nodes store the actual data; the other nodes store only keys and pointers to their child nodes. In other words, all the data lives in the leaf nodes. In a B-tree, every node can store data, and this is the main difference.

Now that we know the basic structure of B-trees and B+ trees, we need to look at how InnoDB uses B+ trees to store data. First, MySQL does not keep its data in memory; memory only serves as a runtime optimization on top of the stored data. I have previously written an article about InnoDB's memory architecture, so take a look if you are interested.

InnoDB stores data on disk, and when we query it, the data is loaded from disk into memory page by page. A page here is the unit in which the OS manages memory: when data is loaded into memory, the contents of a disk block are read in page-sized units. You can think of each node in a B-tree or B+ tree as one disk block.

Since both B-trees and B+ trees need I/O operations to load the required nodes into memory during a search, what advantages does the B+ tree have over the B-tree?
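As a rough illustration of this layout, the Python sketch below hand-builds a two-level B+ tree over the keys 1 to 7. The `Leaf` and `Internal` classes, the `search` function, and the way the keys are split across leaves are assumptions made for the example; it does not implement insertion or node splitting, and it is not InnoDB's page format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    keys: list
    values: list                    # the actual row data lives only here
    next: Optional["Leaf"] = None   # pointer to the next leaf

@dataclass
class Internal:
    keys: list       # separator keys only, no row data
    children: list   # always len(keys) + 1 children

def search(node, key):
    """Descend from the root to a leaf; data is only ever found in a leaf."""
    while isinstance(node, Internal):
        i = 0
        # Move right past every separator that is <= key.
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v
    return None

# Hand-built tree: one root with two separators and three linked leaves.
leaves = [Leaf([1, 2], ["row1", "row2"]),
          Leaf([3, 4], ["row3", "row4"]),
          Leaf([5, 6, 7], ["row5", "row6", "row7"])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal(keys=[3, 5], children=leaves)

print(search(root, 4))  # row4
print(search(root, 8))  # None
```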

In my view, there are three main ones.

First, the B+ tree reduces the number of I/O operations. Why? The two structures look so similar, so how can the B+ tree need fewer I/Os? As mentioned earlier, a single node corresponds to one disk block, and the size of a disk block is fixed. A B+ tree stores row data only in its leaf nodes, so compared with a B-tree, where every node stores complete data, a single non-leaf disk block in a B+ tree can hold many more keys.

For a single disk block of fixed capacity, the smaller each stored element is, the more elements it can hold. In other words, one I/O loads more elements into memory, and those extra elements are likely to be the ones you need next, which reduces the number of I/Os to some extent.

In addition, a single node can store more elements, which can also reduce the height of the tree.
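A back-of-the-envelope calculation shows how much the fan-out matters. InnoDB's default page size is 16 KB; every other number below (an 8-byte key, a 6-byte child pointer, 1 KB rows) is an assumption chosen for illustration, and page headers and other overhead are ignored.

```python
PAGE_SIZE = 16 * 1024   # InnoDB's default page size
KEY_SIZE = 8            # assumption: a BIGINT key
POINTER_SIZE = 6        # assumption: size of a child pointer
ROW_SIZE = 1024         # assumption: 1 KB per row

# B+ tree: internal nodes hold only key + pointer, leaves hold full rows.
bplus_fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)
rows_per_leaf = PAGE_SIZE // ROW_SIZE

# B-tree: every node holds full rows plus child pointers.
btree_entries = PAGE_SIZE // (ROW_SIZE + POINTER_SIZE)

def bplus_capacity(height):
    """Rows in a full B+ tree of the given height (height 1 = a single leaf)."""
    return bplus_fanout ** (height - 1) * rows_per_leaf

def btree_capacity(height):
    """Rows in a full B-tree of the given height, each node holding btree_entries rows."""
    children = btree_entries + 1
    nodes = (children ** height - 1) // (children - 1)
    return nodes * btree_entries

def min_height(capacity_fn, rows):
    """Smallest height whose full tree can hold the given number of rows."""
    height = 1
    while capacity_fn(height) < rows:
        height += 1
    return height

print(bplus_fanout, rows_per_leaf)               # 1170 16
print(bplus_capacity(3))                         # 21902400
print(min_height(bplus_capacity, 20_000_000))    # 3
print(min_height(btree_capacity, 20_000_000))    # 7
```

Under these assumptions, about 20 million rows fit in a B+ tree of height 3, while a B-tree that stores whole rows in every node needs height 7, and each extra level is potentially one more disk read.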

Second, query efficiency is more stable. What does more stable mean? For the same amount of data, the query time does not vary much with the ID you happen to query. Otherwise, one request might take 10 ms and the next, nearly identical one 20 ms, which is hard to accept. So the performance of your interface depends on your database's mood?

Why does a B+ tree give stable query efficiency? Because its non-leaf nodes store no data, every lookup has to reach a leaf node to get the final data. In other words, every query costs the same number of I/Os. In a B-tree, every node can store data, so some rows are found after a single I/O while others are only found at the leaf level, which makes query efficiency unstable.
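The sketch below counts how many nodes a lookup touches in each structure, treating one node visit as one I/O. Both trees are hand-built over the keys 1 to 7 as nested dictionaries; their shapes and the helper names are assumptions for illustration only.

```python
# B-tree: every node carries data, so a search can stop at any level.
btree = {"keys": [4], "children": [
    {"keys": [2], "children": [{"keys": [1], "children": []},
                               {"keys": [3], "children": []}]},
    {"keys": [6], "children": [{"keys": [5], "children": []},
                               {"keys": [7], "children": []}]},
]}

# B+ tree: the root holds separators only; data is in the leaves.
bplus = {"keys": [3, 5], "children": [
    {"keys": [1, 2], "children": []},
    {"keys": [3, 4], "children": []},
    {"keys": [5, 6, 7], "children": []},
]}

def btree_nodes_visited(node, key):
    visited = 0
    while True:
        visited += 1
        if key in node["keys"] or not node["children"]:
            return visited  # found here, or ran out of levels
        # Descend into the child between the neighbouring keys.
        node = node["children"][sum(1 for k in node["keys"] if key > k)]

def bplus_nodes_visited(node, key):
    visited = 1
    while node["children"]:
        # Never stop early: separators are not data.
        node = node["children"][sum(1 for k in node["keys"] if key >= k)]
        visited += 1
    return visited

print([btree_nodes_visited(btree, k) for k in range(1, 8)])  # [3, 2, 3, 1, 3, 2, 3]
print([bplus_nodes_visited(bplus, k) for k in range(1, 8)])  # [2, 2, 2, 2, 2, 2, 2]
```

The B-tree cost swings between 1 and 3 visits depending on the key, while every B+ tree lookup costs the same.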

Third, the B+ tree supports range queries better. Why can't the B-tree support them well? Let us return to the B-tree picture.

Suppose we need to query the data in the interval [3, 5]. What happens? Without further ado, here is the picture.

As you can see, if reaching a leaf node still has not produced the complete result, the search has to go back up toward the root and traverse down again. In a B+ tree, by contrast, once the first matching leaf node is found, the linked list can be walked directly through the pointers between leaf nodes, which greatly improves the efficiency of range queries.

Knowing this, we can reason by analogy about why InnoDB does not use a hash table as the underlying data structure, even though a hash lookup can be O(1): a hash table keeps its keys in no particular order, so it cannot serve range queries efficiently.
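The B+ tree side of this comparison can be sketched in a few lines of Python: descend from the root once, then follow the leaf pointers. The `Leaf` class, the hand-built leaves, and `range_query` are hypothetical and only cover a two-level tree.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    keys: list
    next: Optional["Leaf"] = None  # pointer to the next leaf in key order

# Hand-built two-level B+ tree over the keys 1 to 7.
leaf_a, leaf_b, leaf_c = Leaf([1, 2]), Leaf([3, 4]), Leaf([5, 6, 7])
leaf_a.next, leaf_b.next = leaf_b, leaf_c
root_keys = [3, 5]
root_children = [leaf_a, leaf_b, leaf_c]

def range_query(low, high):
    """Return all keys in [low, high] using one descent plus a leaf walk."""
    # Single descent from the root to the leaf that may contain `low`.
    leaf = root_children[sum(1 for k in root_keys if low >= k)]
    result = []
    # From here on, only the leaf-level linked list is followed.
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next
    return result

print(range_query(3, 5))  # [3, 4, 5]
```

The query for [3, 5] never returns to the root: it starts in the leaf holding 3 and stops as soon as it sees 6.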
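A small Python sketch makes the trade-off visible. A `dict` stands in for a hash index and a sorted list for an ordered index; the data and both helper functions are made up for illustration.

```python
import bisect

ids = [5, 1, 7, 3, 2, 6, 4]
hash_index = {i: "row%d" % i for i in ids}   # point lookups are O(1) on average
sorted_keys = sorted(ids)                    # stand-in for an ordered index

def hash_range(low, high):
    """A hash index has no key order, so every key must be examined."""
    examined = 0
    hits = []
    for key in hash_index:
        examined += 1
        if low <= key <= high:
            hits.append(key)
    return sorted(hits), examined

def ordered_range(low, high):
    """An ordered index finds both ends by binary search and reads only the slice."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end], end - start

print(hash_index[4])        # row4
print(hash_range(3, 5))     # ([3, 4, 5], 7)
print(ordered_range(3, 5))  # ([3, 4, 5], 3)
```

Both return the same rows, but the hash version had to look at all 7 keys while the ordered version touched only the 3 that match.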

Finally, a word about I/O

This article has mentioned I/O many times, and MySQL's index design tries to minimize the number of I/Os. Why? Because I/O is expensive. What happens when we perform an I/O?

Originally I planned to cover disk structure in detail, but the article is already long, so here is just a brief overview.

In a mechanical hard disk, an I/O operation consists of three steps:

First, the seek. Seeking means moving the disk head to the target track. This generally takes 3-15 ms.

Then the platter rotates until the sector holding the data is under the head, which takes about 2 ms. The exact delay depends on the rotational speed of the disk.

Finally, the data transfer.

All told, this costs about 10 ms. Don't think 10 ms is acceptable: compared with the microseconds of an SSD (Solid State Drive) and the nanoseconds of memory, it is a world of difference.

This is why random I/O has such a large impact on query performance in MySQL.
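The three steps can be added up in a short Python sketch. The function names and the sample seek, spindle-speed, and transfer values are assumptions picked for illustration; the only formula used is that, on average, the platter has to turn half a revolution before the data arrives under the head.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector: half of one full revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

def io_time_ms(seek_ms, rpm, transfer_ms):
    """Seek + rotational latency + transfer for one random I/O."""
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms

# A 15,000 RPM disk gives the roughly 2 ms rotation delay mentioned above.
print(avg_rotational_latency_ms(15_000))          # 2.0
# A 7,200 RPM disk waits about twice as long.
print(round(avg_rotational_latency_ms(7_200), 2)) # 4.17

# Assumed values: 8 ms seek, 15,000 RPM, 0.1 ms transfer.
one_io = io_time_ms(seek_ms=8, rpm=15_000, transfer_ms=0.1)
print(round(one_io, 1))       # 10.1
# A lookup that reads 3 pages at random pays this cost three times.
print(round(3 * one_io, 1))   # 30.3
```

This is why each level removed from the tree, and each trip back to the table avoided, is worth so much on a mechanical disk.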



SH的全栈笔记