
In day-to-day business systems, user operation records and log entries keep accumulating. Queries against the older data are rare, so it can be archived: exported and stored as cold data, then deleted from the online database.

Archiving an ordinary data table

For an ordinary table, archiving is easy to implement. Take a table like this as an example:

 CREATE TABLE `operation_log` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `create_date` date DEFAULT NULL,
  `user_id` int(11) DEFAULT NULL,
  `type` varchar(20) NOT NULL,
  `amount` decimal(20,8) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `ix_date` (`create_date`),
  KEY `ix_type_date` (`type`,`create_date`)
)

You can query by date: select all rows earlier than the cutoff date, save them somewhere, and then delete them from the database. The sample code is as follows (Python + SQLAlchemy):

 # a database connection (or session) has already been obtained

# fetch all of the old rows
records = connection.execute('select * from operation_log where create_date < "2022-06-01"')

# save the fetched rows to a file, or to a cold-backup database
# ......

# delete the exported rows
connection.execute('delete from operation_log where create_date < "2022-06-01"')
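
The "save" step above is left open on purpose. As a minimal sketch of one possible option, assuming a SQLAlchemy engine built with create_engine (the connection URL and the file name operation_log_archive.csv are assumptions, not part of the original), the fetched rows could be written to a CSV file with the standard csv module before the delete runs:

 # a minimal sketch; the engine URL and the CSV file name are assumptions
import csv
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/mydb')  # hypothetical credentials

with engine.connect() as connection:
    records = connection.execute('select * from operation_log where create_date < "2022-06-01"')

    # dump every fetched row to a CSV file as cold data
    with open('operation_log_archive.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'create_date', 'user_id', 'type', 'amount'])
        for row in records:
            writer.writerow([row['id'], row['create_date'], row['user_id'], row['type'], row['amount']])

    # delete the archived rows only after the export has succeeded
    connection.execute('delete from operation_log where create_date < "2022-06-01"')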

Archiving a massive data table

However, if the table holds a huge amount of data, say 1 billion rows in total with 10 million rows matching the condition, the naive approach above obviously cannot be run as-is:

  1. Loading a 10-million-row result set into the application server is dangerous and will usually exhaust its memory.
  2. Although the 'create_date' column is indexed, searching by it is still noticeably slower than a lookup on the primary key.

For problem 1, the solution is to query in batches: fetch one batch per iteration, process it, and repeat until all matching rows have been handled. For problem 2, we can exploit the fact that the primary key id auto-increments (it grows in the order the records were created), which converts most of the queries into primary-key conditions. The code example is as follows:

 # a database connection (or session) has already been obtained
import datetime

BATCH_SIZE = 100000  # process 100,000 rows per batch
FINAL_DATE = datetime.date(2022, 6, 1)

row = connection.execute('select id from operation_log order by id limit 1').first()
batch_end_id = row['id']  # initialize the batch boundary with the smallest id in the table
batch_end_date = datetime.date(1980, 1, 1)  # start from a date far in the past

while batch_end_date < FINAL_DATE:
    batch_end_id += BATCH_SIZE
    records = connection.execute('select * from operation_log where id<{}'.format(batch_end_id))

    for record in records:
        record_date = record['create_date']
        if batch_end_date < record_date:
            # remember the latest create_date seen in this batch
            batch_end_date = record_date
        if record_date < FINAL_DATE:
            # save the fetched row to a file, or to a cold-backup database
            # ......
            pass

    # delete the exported rows
    if batch_end_date >= FINAL_DATE:
        # only in the final iteration is the date added as a condition,
        # so that no rows on or after FINAL_DATE get deleted
        connection.execute('delete from operation_log where id<{} and create_date < "{}"'.format(batch_end_id, FINAL_DATE.strftime('%Y-%m-%d')))
        break
    else:
        # in ordinary iterations the only condition is the id, which is much faster
        connection.execute('delete from operation_log where id<{}'.format(batch_end_id))
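
After the loop finishes, it is worth sanity-checking that nothing older than the cutoff date is left behind. This check is not part of the original procedure; it is a small sketch that reuses the same connection and the FINAL_DATE from the example above:

 # sanity check after archiving: count the rows that should have been removed
remaining = connection.execute(
    'select count(*) from operation_log where create_date < "{}"'.format(FINAL_DATE.strftime('%Y-%m-%d'))
).scalar()
print('rows still matching the archive condition:', remaining)  # expected to be 0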
