Author: Yang Taotao

Senior database expert with more than ten years of MySQL experience. Specializes in backup and recovery, SQL tuning, monitoring and operations, and high-availability architecture design for open source databases such as MySQL, PostgreSQL, and MongoDB. Currently works at Aikesheng, providing MySQL technical support and training for major telecom operators and banking and financial companies.

Source of this article: original submission

* Produced by the Aikesheng open source community. Original content may not be used without authorization; to reprint, please contact the editor and cite the source.


For a long time, MySQL development guidelines have carried one rule: no large transactions! Moving to TiDB, the rule stays the same: no large transactions!

Because transactions in TiDB are distributed by nature, and Raft replication in the background causes write amplification, large transactions are strongly discouraged in TiDB.

Versions of TiDB prior to 4.0 impose rather strict limits on a single transaction, for example:
  1. A single transaction contains no more than 5000 SQL statements
  2. A single KV entry does not exceed 6 MB
  3. The total number of KV entries does not exceed 300,000
  4. The total size of all KV entries does not exceed 100 MB
These limits cause some DML statements to fail, typically the following three classic unfiltered statements:
  1. insert ... select ... where 1
  2. update ... where 1
  3. delete from ... where 1
Such statements very easily hit the "transaction too large" error: ERROR 8004 (HY000): transaction too large, len:300001. There are generally two ways to work around it:
  1. For insert and delete statements, enable the batch DML feature, which carries no atomicity guarantee: tidb_batch_insert, tidb_batch_delete (a sketch follows this list).
  2. Split the update statement into chunks manually.
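
A rough sketch of the first workaround, assuming a pre-4.0 TiDB, the same --login-path used by the scripts later in this article, and a hypothetical table t_old; tidb_dml_batch_size is an extra session variable controlling the batch size, and the availability and exact behavior of all three variables depend on the TiDB version:

root@ytt-ubuntu:~/scripts# cat batch_delete_demo
#!/bin/sh
# Sketch only: these legacy variables silently split one statement into many
# smaller transactions, so atomicity is NOT guaranteed if a batch fails halfway.
mysql --login-path=TiDB_login -D ytt -e " \
    set session tidb_batch_delete = 1;    \
    set session tidb_dml_batch_size = 20000; \
    delete from t_old where 1;"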

In TiDB 4.0 and later, apart from the limit on a single KV entry (still 6 MB at most), the other restrictions have been removed.

You only need to add the following option to the configuration file to cover most transactions:

performance.txn-total-size-limit: 10737418240 (range 1 GB to 10 GB)
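
As a sketch of where this lands in the TiDB server configuration file (commonly tidb.toml), the dotted name maps to the [performance] section:

[performance]
# maximum size of a single transaction: 10737418240 bytes = 10 GB
txn-total-size-limit = 10737418240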

Does that mean transactions can be written however you like after 4.0? Of course not! Because of TiDB's write amplification, a large transaction also makes memory usage multiply, which can severely affect other workloads, so TiDB puts a hard limit of 10 GB on transaction size. For example, when DM is used to replicate MySQL data into TiDB, large transactions increase memory usage and write latency and hurt other write traffic.

Therefore, large transactions should still be avoided: split them into small transactions and process them in batches.

So how do you split a large transaction? Looking at the business alone, every business type calls for its own splitting method, and a whole book could not cover them all. Here I only break it down from the database side: first from the perspective of the table, and then further into how to split the DML statements themselves.

The three classic statements listed above look very simple, but they carry no filter condition. If the table has a large number of indexes, then even with a modest number of rows the statement becomes a large transaction, even though that transaction contains only one DML statement. How to split this kind of statement actually depends on how the table is defined, which falls into three cases:

  1. The table has a primary key, and the primary key is continuous
  2. The table has a primary key, but the primary key is not continuous
  3. The table has no primary key (similar to the first case)
The first case is the easiest to split: just divide the rows into ranges by primary key.

For example: table t1 has 1 million rows and 6 indexes besides the primary key. Update table t1:

update ytt.t1 set log_date = current_date() - interval ceil(rand()*1000) day where 1;

Under the default autocommit setting, this statement is actually an implicit large transaction, internally equivalent to:

begin

update ytt.t1 set log_date = current_date() - interval ceil(rand()*1000) day where 1;

commit;
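
For reference, a hypothetical definition of t1 that matches the description above: a continuous auto-increment primary key plus 6 secondary indexes, with log_date indexed so that the update also rewrites index entries (column and index names are invented for illustration). This is why a table with many indexes turns even a modest update into a large transaction, as noted earlier:

create table ytt.t1 (
    id bigint not null auto_increment primary key,
    log_date date,
    c1 int, c2 int, c3 int, c4 int, c5 int,
    key idx_log_date (log_date),
    key idx_c1 (c1), key idx_c2 (c2), key idx_c3 (c3),
    key idx_c4 (c4), key idx_c5 (c5)
);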

Assuming the primary key of t1 is auto-increment and continuous, the split is very simple: break the transaction into 10 small transactions that each update 100,000 rows instead of updating 1 million rows at once. The script is roughly as follows:

root@ytt-ubuntu:~/scripts# cat update_table_batch
#!/bin/sh
# TiDB: split one big update into 10 batches by primary key range
for i in `seq 1 10`;do
    min_id=$(((i-1)*100000+1))
    max_id=$((i*100000))
    queries="update t1 set log_date = date_sub(current_date(), interval ceil(rand()*1000) day)  \
        where id >=$min_id and id<=$max_id;"
    # each range is committed as its own small transaction, run in parallel
    mysql --login-path=TiDB_login -D ytt -e "$queries" &
done
The second case covers primary keys that are not continuous.

Although a continuous auto-increment primary key is the most common habit, it is strongly discouraged in TiDB because it leads to a potential single-Region write hotspot. Instead, the auto_random feature is recommended so that primary key values are written randomly and are no longer continuous, which is exactly what produces this second case.
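
A hypothetical definition of t2 using auto_random (column names are invented; in early 4.0 releases auto_random is an experimental feature and may need to be enabled in the server configuration before this DDL is accepted):

create table ytt.t2 (
    id bigint primary key auto_random,
    log_date date
);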

With a non-continuous primary key, the script above no longer fits. How should such a table be split? With a small modification: use the window function row_number() to generate a substitute for the missing continuous key. The table being updated is now t2. The rewritten script is roughly as follows:

root@ytt-ubuntu:~/scripts# cat update_table_batch
#!/bin/sh
# TiDB: split the update by a simulated row number instead of the primary key
for i in `seq 1 10`;do
    min_id=$(((i-1)*100000+1))
    max_id=$((i*100000))
    # row_number() over id provides a continuous sequence to range over
    queries="update t2 a, (select *,row_number() over(order by id) rn from t2) b set a.log_date =  \
        date_sub(current_date(), interval ceil(rand()*1000) day)  \
        where a.id = b.id and (b.rn>=$min_id and b.rn<=$max_id);"
    mysql --login-path=TiDB_login -D ytt -e "$queries" &
done

In fact, the two approaches above already cover most splitting scenarios. For a table without a primary key, MySQL and TiDB maintain an implicit auto-increment row ID by default to tell rows apart. Even so, to avoid piling complicated split logic onto the DML layer, it is strongly recommended to define an explicit primary key!
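
As a sketch for the third case, TiDB exposes the implicit row ID of a table without a primary key as the hidden _tidb_rowid column, which can be ranged over much like the continuous primary key in the first script (t3 and the script name are hypothetical; _tidb_rowid values may contain gaps, so each batch can touch a different number of rows):

root@ytt-ubuntu:~/scripts# cat update_table_batch_rowid
#!/bin/sh
# TiDB: split the update of a table without a primary key by _tidb_rowid range
for i in `seq 1 10`;do
    min_id=$(((i-1)*100000+1))
    max_id=$((i*100000))
    queries="update t3 set log_date = date_sub(current_date(), interval ceil(rand()*1000) day)  \
        where _tidb_rowid >=$min_id and _tidb_rowid<=$max_id;"
    mysql --login-path=TiDB_login -D ytt -e "$queries" &
done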

Concluding remarks

Although TiDB 4.0 supports large transactions much better, that is no reason to use them casually. You still need to prepare splitting strategies in advance, from table design to how table data is retrieved and updated, so that the database can better serve the business.


Aikesheng open source community

Founded in 2017, the community takes open-sourcing high-quality operations tools, sharing practical technical content, and running ongoing nationwide community events as its mission. Its currently open-sourced products include the SQL audit tool SQLE, the distributed middleware DBLE, and the data transfer component DTLE.