Recently, I have been working for a subject storehouse system. Since duplicate subject are added to the storehouse, it is necessary to query the duplicate subject, and delete all duplicate subject leave only one, so that the duplicate subject cannot be taken when the test is taken.
First wrote a small example:
Single field operation
This is the table in the database:
Group By:
Select Repeat Field From Table Group By Repeat Field Having Count(*)>1
See if there is duplicate data:
GROUP BY <column name sequence>
HAVING <group conditional expression>
Query out: those groups that satisfy the group conditional expression (the number of repetitions is greater than 1) in the having clause according to the dname grouping
There is no difference between count(*) and count(1).
Query all duplicate data:
Select * From Table Where Repeat Field In (Select Repeat Field From Table Group By Repeat Field Having Count(*)>1)
Delete all duplicate subject:
Change the above query select to delete (this will cause an error)
DELETE
FROM
dept
WHERE
dname IN (
SELECT
dname
FROM
dept
GROUP BY
dname
HAVING
count(1) > 1
The following error will occur: [Err] 1093 – You can’t specify target table ‘dept’ for update in FROM clause
The reason is: update this table and query this table at the same time, query this table and update the table, can be understood as deadlock. Mysql does not support this update to query the same table operation
Solution: Query the columns of data to be updated as a third-party table, and then filter the updates.
query duplicate subject in the table (according to depno, except for the one with the smallest rowid)
the first method:
SELECT
*
FROM
dept
WHERE
dname IN (
SELECT
dname
FROM
dept
GROUP BY
dname
HAVING
COUNT(1) > 1
)
AND deptno NOT IN (
SELECT
MIN(deptno)
FROM
dept
GROUP BY
dname
HAVING
COUNT(1) > 1
)
The above is correct, but the query is too slow, you can try the following method:
The second method:
According to the dname grouping, find out the deptno minimum. Then query deptno and do not contain the one just found out. This will query all the duplicate data (except the smallest line of deptno)
SELECT *
FROM
dept
WHERE
deptno NOT IN (
SELECT
dt.minno
FROM
(
SELECT
MIN(deptno) AS minno
FROM
dept
GROUP BY
dname
) dt
)
Delete redundant duplicate subject in the table and leave only one:
the first method:
DELETE
FROM
dept
WHERE
dname IN (
SELECT
t.dname
FROM
(
SELECT
dname
FROM
dept
GROUP BY
dname
HAVING
count(1) > 1
) t
)
AND deptno NOT IN (
SELECT
dt.mindeptno
FROM
(
SELECT
min(deptno) AS mindeptno
FROM
dept
GROUP BY
dname
HAVING
count(1) > 1
) dt
)
The second method (corresponding to the second method of query above, just change select to delete):
DELETE
FROM
dept
WHERE
deptno NOT IN (
SELECT
dt.minno
FROM
(
SELECT
MIN(deptno) AS minno
FROM
dept
GROUP BY
dname
) dt
)
Operation of multiple fields:
If you already have been learn to delete a single field is there, multiple fields are also very simple.
DELETE
FROM
dept
WHERE
(dname, db_source) IN (
SELECT
t.dname,
t.db_source
FROM
(
SELECT
dname,
db_source
FROM
dept
GROUP BY
dname,
db_source
HAVING
count(1) > 1
) t
)
AND deptno NOT IN (
SELECT
dt.mindeptno
FROM
(
SELECT
min(deptno) AS mindeptno
FROM
dept
GROUP BY
dname,
db_source
HAVING
count(1) > 1
)
to sum up:
In fact, there are still a lot of things to optimize in the above methods. If the amount of data is too large, the implementation is very slow. You can consider adding optimization:
Add an index to the frequently queried field
Change * to the field you need to query, don’t query it all out.
Small tables drive large tables with IN, large tables drive small tables with EXISTS. The case where IN is suitable is the case where the amount of external data is small, not the case where the external data is large, because IN will traverse all the data of the outer table, assuming 100 tables of a, and 10000 of b, the number of traversals is 100*10000 times, and exists Then, it is executed 100 times to judge whether the data in the a table exists in the b table, and it only executes the number of a.length. As for which efficiency is high, it depends on the situation, because in is compared in memory, and exists is the database query operation.