5
头图
Working overtime in the company until eight o'clock, this is the background.

The reason for working overtime is to go online to solve the problem of duplicate data in the online database. A bug in the program is found and it is easy to solve. The problem is to correct the duplicate data on the line.

There are 6 tables in the online library with duplicate data, 2 of which are relatively large, one is 960,000+, and the other is 300,000+. Because the same problem has been dealt with before, I directly used the last Python deduplication script and script. It's very simple. It connects to the database, finds out the duplicate data, and deletes it cyclically.

emmmm, but this efficiency is too low, one per second, repeated data is about 20,000+, and the estimated time is about 8 hours. . .

It is problematic to blindly rely on the things of the predecessors instead of thinking for yourself! Always think about how it was possible before and how it is not now, this is also a problem! I found that my state was really not quite right recently, and I lost my desire to explore and seek knowledge. Today is a vigilance, and I feel quite ashamed.

Closer to home, the steps to remove duplication are described in detail below.

CREATE TABLE `animal` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;


INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('1', 'cat', '12');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('2', 'dog', '13');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('3', 'camel', '25');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('4', 'cat', '32');
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('5', 'dog', '42');

Goal: We want to remove the data with the same name.

Let’s take a look at which data is duplicated

SELECT name,count( 1 ) 
FROM
 student 
GROUP BY
NAME 
HAVING
 count( 1 ) > 1;

Output:

name count(1) cat 2 dog 2

The data whose name is cat and dog are duplicated, and there are two duplicate data for each;

Select * From 表 Where 重复字段 In (Select 重复字段 From 表 Group By 重复字段 Having Count(1)>1)

Delete all duplicate data, leave no one

Delete directly will report an error

DELETE 
FROM
 student 
WHERE
 NAME IN (
 SELECT NAME 
 FROM
  student 
 GROUP BY
 NAME 
HAVING
 count( 1 ) > 1)

Error:

1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s

reason for is 16111e4b324c17: The table is queried while updating this table, and the table is updated while querying this table, which can be understood as a deadlock. mysql does not support this update query operation of the same table

solution : query the columns of data to be updated as a third-party table, and then filter and update.

DELETE 
FROM
 student 
WHERE
 NAME IN (
 SELECT
  t.NAME 
FROM
 ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1 ) t)

Delete duplicate data in the delete table, only keep one

Before deleting, we can check first, what kind of duplicate data we want to delete

SELECT
 * 
FROM
 student 
WHERE
 id NOT IN (
 SELECT
  t.id 
 FROM
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t 
 )

What do you mean, it is to first group by name to find out the data with the smallest id. These data are the kind of fire we want to leave, and then query to find out that the id is not in it, which is the duplicate data we want to delete.

Start to delete duplicate data, leaving only one

Very simple, just replace the select with delete.

DELETE 
FROM
 student 
WHERE
 id NOT IN (
 SELECT
  t.id 
 FROM
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t 
 )

900,000+ watches execute super fast.

All done 👏👏👏👏~

(Thanks for reading, I hope for all your help)
Source: telami.cn/2019/mysql-removes-duplicate
-data-and-keeping-only-one/


民工哥
26.4k 声望56.7k 粉丝

10多年IT职场老司机的经验分享,坚持自学一路从技术小白成长为互联网企业信息技术部门的负责人。2019/2020/2021年度 思否Top Writer