English version published on Flexport Engineering Blog
Introduction
It is generally accepted that indexes improve SQL performance, but this is not always true: only efficient indexes really do, and too many indexes can even slow down write operations. Our previous article discussed how an inefficient index can drag down system performance, but the example there was an infrequently used GIN index. This time, our example is an inefficient BTREE index. In both cases, the common key to high-performance SQL is index selectivity.
Note: An index is said to have good selectivity if it filters out most of the rows during an index scan. Indexes with poor selectivity often degrade index scans to the level of full table scans.
Problem
The Flexport platform enables Flexport operators (customer operations and supply chain operations), customers, and freight partners to send messages to each other across our applications. When a message is sent, its recipients are notified. A Flexport operator may receive thousands of notifications per week. To help them manage notifications, we built tools such as the "Notifications" drop-down menu in the navigation bar and an "Inbox" page that lets users query relevant notifications by various conditions (time range, shipment information, customer information).
These tools face severe performance challenges: due to rapid business growth, some users have accumulated millions of notifications, and their queries can take longer than a minute. It is also worth noting that the system design should accommodate future business scale without performance degradation.
Table Structure
The structure of the notification table is roughly as follows:
The meaning of each column is:
- id: primary key
- user_id: foreign key, pointing to the recipient user
- notification_type_id: foreign key, pointing to NotificationType
- subject_id: foreign key, pointing to the user who performed the action that caused this notification
- object_id: foreign key, pointing to the object on which that action was performed
- created_at: notification creation time
- updated_at: notification modification time
- read_at: the time the notification was marked "read"
- details: the text content of the notification
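Putting the columns together, the table's DDL can be sketched roughly as follows (the exact types and constraints are assumptions, not the production schema):

```sql
CREATE TABLE notifications (
    id                   bigserial PRIMARY KEY,
    user_id              bigint NOT NULL,  -- recipient user
    notification_type_id bigint NOT NULL,  -- points to NotificationType
    subject_id           bigint,           -- user who performed the action
    object_id            bigint,           -- object the action was performed on
    created_at           timestamp NOT NULL,
    updated_at           timestamp NOT NULL,
    read_at              timestamp,        -- NULL means "unread"
    details              text
);
```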
Next, we walk through several slow SQL examples and their solutions in detail (both the SQL and the query plans below have been modified to hide commercially sensitive information).
Example 1: Unread notification count
SQL
SELECT COUNT(*)
FROM notifications
WHERE notifications.user_id = ? AND read_at IS NULL
"read_at IS NULL" means "unread", so this query counts the number of unread notifications for a user. There is a BTREE index on the user_id column serving this query.
The number of notifications for a user ranges from thousands to millions. For most users it is in the thousands, so this query is fast (milliseconds). But a few users on the customer operations team have millions of notifications, so for them the query is slow (seconds or even minutes), because the index scan has to traverse millions of index entries. This is known as a "data skew" problem.
Solution
We found that less than 10% of notifications were unread, so the index scan can be sped up by skipping read notifications. PostgreSQL supports a feature called "partial indexes": we replaced the user_id index with a partial index on "user_id WHERE read_at IS NULL". The new index only records rows satisfying the "read_at IS NULL" condition, so most rows are excluded, reducing the index size by 90%. As a result, the query now always completes in milliseconds.
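The index replacement can be sketched as follows (the index names here are illustrative, not the production ones):

```sql
-- Old index: covers every notification a user has ever received
DROP INDEX index_notifications_on_user_id;

-- New partial index: only rows with read_at IS NULL are indexed,
-- so read notifications (over 90% of rows) are excluded entirely
CREATE INDEX index_notifications_on_user_id_unread
    ON notifications (user_id)
    WHERE read_at IS NULL;
```

Note that the planner will only use a partial index when the query's WHERE clause implies the index predicate, which the unread-count query ("read_at IS NULL") does.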
Example 2: List of recent notifications
SQL
SELECT *
FROM notifications
WHERE notifications.user_id = ?
ORDER BY notifications.created_at desc
LIMIT ? OFFSET ?
This is a paginated query that fetches a user's most recent notifications (in descending order of creation time). Similar to the count example, this query is fast for most users and slow for a few: another "data skew" problem.
Solution
We found that a "multi-column index" on (user_id, created_at) performed better than a single-column index on user_id. Index entries are stored sorted by value. When scanning the multi-column index, PostgreSQL first locates the subset of entries matching the user_id value, then traverses that subset in created_at order and stops as soon as it has found enough rows to satisfy the LIMIT clause. Note that there was already a (user_id, object_id, created_at) index, but it is less efficient than the (user_id, created_at) index, because the object_id column is not just useless but harmful to this query: within the subset of entries matching the user_id value, entries are sorted by the combination (object_id, created_at) rather than by created_at alone, so a traversal expecting created_at order has to jump around among different object_id values.
A multi-column index (also known as a compound or composite index) consists of a list of columns, and the order in which the columns are listed matters. Proper index design places the columns from left to right according to the intended search pattern. In general, if a query uses some columns to filter and others to sort, you can create a multi-column index with the filtering columns on the left and the sorting columns on the right.
A rule of thumb is to put the primary filter condition first in a multi-column index, with the columns after it used for secondary filtering. For example, a table of user names tends to use a multi-column index of the form (last_name, first_name).
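Applying this rule to the pagination query gives an index like the following (the index name is illustrative):

```sql
-- Filter column (user_id) on the left, sort column (created_at) on the right
CREATE INDEX index_notifications_on_user_id_and_created_at
    ON notifications (user_id, created_at);
```

Even though the query sorts by created_at DESC, an ascending BTREE index works here because PostgreSQL can scan a BTREE in either direction.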
Example 3: Query related notifications by various conditions (time range, shipping information, customer information)
SQL
SELECT notifications.*
FROM notifications
INNER JOIN notification_types ON notification_types.id = notifications.notification_type_id
JOIN messages ON messages.id = notifications.object_id AND notification_types.object_type = 'message'
JOIN shipments ON messages.messageable_id = shipments.id AND messages.messageable_type = 'Shipment'
WHERE notifications.user_id = ?
AND notification_types.domain = ?
AND shipments.status = ?
AND shipments.client_id IN (?)
AND (messages.assignee_id = ? OR messages.assignee_id IS NULL)
AND messages.created_at >= ?
AND notifications.created_at >= ?
ORDER BY notifications.created_at DESC, notifications.id DESC
This query is complicated. Unlike the previous queries, it is sometimes fast and sometimes slow, even when called with the same parameters (and therefore for the same user, since user_id is among them).
Slow query plan
The following is a slow query plan:
QUERY PLAN
----------------------------------------
-> Sort (cost=58696.04..58696.04 rows=1 width=553) (actual time=34338.192..34338.192 rows=0 loops=1)
Sort Key: notifications.created_at DESC, notifications.id DESC
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=3800.49..58660.11 rows=1 width=20) (actual time=34338.162..34338.162 rows=0 loops=1)
-> Gather (cost=3800.35..58659.94 rows=1 width=24) (actual time=34338.161..34339.487 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=2800.35..57659.84 rows=1 width=24) (actual time=34328.555..34328.555 rows=0 loops=3)
-> Nested Loop (cost=2799.78..57401.65 rows=31 width=20) (actual time=1684.157..34318.009 rows=1581 loops=3)
-> Parallel Bitmap Heap Scan on shipments (cost=2799.21..46196.17 rows=138 width=12) (actual time=1365.688..1385.022 rows=7047 loops=3)
Recheck Cond: (client_id = ANY (?::integer[]))
Filter: (status = ?)
Heap Blocks: exact=5882
-> Bitmap Index Scan on index_shipments_on_client_id (cost=0.00..2799.13 rows=22114 width=0) (actual time=1364.729..1364.729 rows=22367 loops=1)
Index Cond: (client_id = ANY (?::integer[]))
-> Index Scan using index_messages_on_messageable_type_and_messageable_id on messages (cost=0.56..81.19 rows=1 width=12) (actual time=4.341..4.672 rows=0 loops=21142)
Index Cond: (((messageable_type)::text = 'Shipment'::text) AND (messageable_id = shipments.id))
Filter: (((assignee_id = ?) OR (assignee_id IS NULL)) AND (created_at >= ?::timestamp without time zone))
Rows Removed by Filter: 8
-> Index Scan using index_notifications_on_user_id_and_object_id_and_created_at on notifications notifications_1 (cost=0.57..8.32 rows=1 width=12) (actual time=0.006..0.006 rows=0 loops=4743)
Index Cond: ((user_id = ?) AND (object_id = messages.id) AND (created_at >= ?::timestamp without time zone))
-> Index Scan using notification_types_pkey on notification_types (cost=0.14..0.17 rows=1 width=4) (never executed)
Index Cond: (id = notifications_1.notification_type_id)
Filter: (((object_type)::text = 'message'::text) AND (domain = ?))
Planning Time: 116.579 ms
Execution Time: 34339.810 ms
A key step in this is
-> Index Scan using index_messages_on_messageable_type_and_messageable_id on messages (cost=0.56..81.19 rows=1 width=12) (actual time=4.341..4.672 rows=0 loops=21142)
Index Cond: (((messageable_type)::text = 'Shipment'::text) AND (messageable_id = shipments.id))
Filter: (((assignee_id = ?) OR (assignee_id IS NULL)) AND (created_at >= ?::timestamp without time zone))
Rows Removed by Filter: 8
This step is the inner side of a Nested Loop Join; each loop performs an index scan on the messages table. The average time per loop is about 4.5ms (4.341..4.672). From "loops=21142" we can see there were 21142 loops in total, so this step alone should take 4.5 * 21142 = 95139 milliseconds, yet the entire query execution took only 34338 milliseconds. Why?
We can see this information from the query plan:
Workers Planned: 2
Workers Launched: 2
This means the degree of parallelism is 3: the query is processed in parallel by 1 leader process (the process executing the query) and 2 additional worker processes. The actual total time for this step is therefore divided by the degree of parallelism: 4.5 * 21142 / 3 = 31713 milliseconds, which matches the total query execution time. Since this step consumes most of the time, it is what we will try to optimize.
Fast query plan
When we execute the same query again with the same parameters, we get a similar but much faster query plan:
QUERY PLAN
----------------------------------------
......
-> Index Scan using index_messages_on_messageable_type_and_messageable_id on messages (cost=0.56..76.04 rows=1 width=12)
(actual time=0.012..0.012 rows=0 loops=20463)
Index Cond: (((messageable_type)::text = 'Shipment'::text) AND (messageable_id = shipments.id))
Filter: (((assignee_id = ?) OR (assignee_id IS NULL)) AND (created_at >= ?::timestamp without time zone))
Rows Removed by Filter: 8
......
Planning Time: 4.771 ms
Execution Time: 105.937 ms
The only difference is the time per loop (0.012..0.012), which is 375 times faster than before (4.5 / 0.012 = 375). If we then execute the query with a different set of parameters, we get a slow plan again. In short, the first execution with a given set of parameters is slow, and subsequent executions are fast. The fast plan is almost identical to the slow plan; only the timing of the critical step differs. What can be inferred from this phenomenon? Yes: there is a cache.
Analysis
There are two main problems with this slow query:
One problem is that each loop of the index scan on the messages table is too slow on a cache miss. The scan condition is a simple value comparison, yet it took 4.5 milliseconds, which is unacceptable. We noticed that the index scans on the notifications table were fast (between 0.002ms and 0.006ms), which is reasonable. So what causes the gap?
The index scan on the messages table has both an "Index Cond" and a "Filter", while the index scan on the notifications table has only an "Index Cond". The index on the messages table covers only the messageable_type and messageable_id columns, so the scan must first read index pages to filter by messageable_type and messageable_id, then read table (heap) pages to filter by assignee_id and created_at. Reading the table pages requires multiple time-consuming disk reads, but benefits from caching (which is why the query is faster when executed again with the same parameters). To make the query plan include cache information, execute "EXPLAIN (ANALYZE, BUFFERS) #{sql}". This index scan time can be improved through better index selectivity.
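For instance, using the count query from Example 1 (the user_id value here is a placeholder):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM notifications
WHERE notifications.user_id = 12345 AND read_at IS NULL;
```

With BUFFERS, each plan node gains a line like "Buffers: shared hit=N read=M", where "hit" counts pages found in PostgreSQL's buffer cache and "read" counts pages that had to be fetched from outside it, making cache misses visible in the plan.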
Another problem is that the query planner underestimates the number of rows in the shipments table matching the client_id filter. The plan claims an expected row count of 138 (rows=138), while the actual count is 21141 (rows=7047 loops=3):
-> Parallel Bitmap Heap Scan on shipments (cost=2799.21..46196.17 rows=138 width=12) (actual time=1365.688..1385.022 rows=7047 loops=3)
A Hash Join builds a hash map, which is faster for larger row counts but consumes more space; a Nested Loop Join is faster only for small row counts, but saves space. The query planner chooses Hash Join when the expected row count is large and Nested Loop Join when it is small. Here it expects only 138 rows, so it chooses Nested Loop Join. If a Hash Join were used instead, this step should be faster.
Solution
A SQL query should not rely on the benevolence of the cache. Caching is great, but before counting on it, a query should be able to perform well on its own.
First, we improved index selectivity. If an index on the messages table covers more columns for better selectivity, it can reduce or even eliminate the table page reads. Because the "assignee_id = ? OR assignee_id IS NULL" condition is too broad to be selective, we chose to append only the created_at column to the existing index. As a small optimization, we also moved the more selective messageable_id forward to be the first column. The new index is on the (messageable_id, messageable_type, created_at) column group of the messages table. The average time per loop dropped from 4.5ms to 0.338ms, and the total query execution time dropped from 34339ms to 3893ms, about 9 times faster.
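The index change can be sketched as follows (index names are illustrative):

```sql
-- CONCURRENTLY builds the index without blocking writes to the table
CREATE INDEX CONCURRENTLY index_messages_on_messageable_and_created_at
    ON messages (messageable_id, messageable_type, created_at);

-- The old, less selective index is then redundant
DROP INDEX index_messages_on_messageable_type_and_messageable_id;
```

With created_at in the index, the "created_at >= ?" filter is evaluated against index entries, so far fewer heap pages need to be read per loop.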
Then, can we tell the query planner "please choose Hash Join"? PostgreSQL does not support SQL hints, so we can only set the "SET enable_nestloop = off" option. This option forbids PostgreSQL from choosing a Nested Loop Join, so it chooses a Hash Join instead. With it, the total query execution time dropped from 34339ms to 5568ms, about 6 times faster. Notably, for similar reasons MySQL 8.0 replaced its Block Nested Loop algorithm with Hash Join.
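Since enable_nestloop affects the whole session, one way to limit its blast radius is to scope it to a single transaction with SET LOCAL (a sketch, not what we ultimately shipped):

```sql
BEGIN;
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK,
-- so other queries on this connection are unaffected
SET LOCAL enable_nestloop = off;
-- ... run the slow query here ...
COMMIT;
```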
The index selectivity optimization alone is good enough. The Hash Join optimization requires changing a database setting, which we did not dare to apply in production, and its benefit does not stack linearly with the index selectivity optimization (1+1<2). We therefore decided to ship only the index selectivity optimization.
Real improvement in production environment
The chart below shows the dramatic improvement in production after optimizing the Inbox.
Due to high load from business growth, the error rate of the API endpoint (before optimization) was as high as 11.8%.
The error rate of the API endpoint (after optimization) is only 0.8%, a 15x improvement (we are refactoring the code to improve it further).
Conclusion
We discussed three examples of BTREE index performance, all of which benefited from improved index selectivity. Indexes with poor selectivity not only consume more memory and storage, but also degrade SQL performance. As a principle, when designing business-critical database queries, pay attention to index selectivity and improve it.
Thanks
Thanks to Joker Ye for his contributions to the related engineering work, and to Dylan Irlbeck, Vinod Kumar, David Goussev, and other Flexport colleagues for reviewing the article.