將MySQL去重操作優(yōu)化到極致之三彈連發(fā)

發(fā)布時間：2020-08-07 22:53:21 來源：ITPUB博客閱讀：130 作者：壹頁書欄目：MySQL數(shù)據(jù)庫

將MySQL去重操作優(yōu)化到極致之三彈連發(fā)（一）：巧用索引與變量

http://blog.csdn.net/wzy0623/article/details/54377986

實驗準備:
MySQL 5.6.14

create table t_source
(
item_id int,
created_time datetime,
modified_time datetime,
item_name varchar(20),
other varchar(20)
);
create table t_target like t_source;
delimiter //
create procedure sp_generate_data()
begin
set @i := 1;
while @i<=500000 do
set @created_time := date_add('2017-01-01',interval @i second);
set @modified_time := @created_time;
set @item_name := concat('a',@i);
insert into t_source
values (@i,@created_time,@modified_time,@item_name,'other');
set @i:=@i+1;
end while;
commit;
set @last_insert_id := 500000;
insert into t_source
select item_id + @last_insert_id,
created_time,
date_add(modified_time,interval @last_insert_id second),
item_name,
'other'
from t_source;
commit;
end
//
delimiter ;
call sp_generate_data();
insert into t_source
select * from t_source where item_id=1;
commit;
select count(*),count(distinct created_time,item_name) from t_source;

1.使用表連接查重

truncate t_target;
insert into t_target
select distinct t1.* from t_source t1,
(select min(item_id) item_id,created_time,item_name from t_source t3 group by created_time,item_name) t2
where t1.item_id = t2.item_id;
commit;

由于機器性能的差異,使用表連接方式,我的環(huán)境耗時14s
執(zhí)行計劃如下:
將MySQL去重操作優(yōu)化到極致之三彈連發(fā)

可以看到MySQL 給 t1表的item_id自動創(chuàng)建了一個索引.

2.使用MySQL特性

truncate t_target;
insert into t_target
select min(item_id),created_time,modified_time,item_name,other
from t_source
group by created_time,item_name;
commit;

耗時10s左右.
效率尚可,省時省力.

3.使用自定義變量

set @a:='0000-00-00 00:00:00';
set @b:=' ';
set @f:=0;
truncate t_target;
insert into t_target
select
item_id, created_time, modified_time, item_name, other
from
(
select
t0 . *,
if(@a = created_time and @b = item_name, @f:=0, @f:=1) f,
@a:=created_time,
@b:=item_name
from
(
select
*
from
t_source
order by created_time , item_name
) t0
) t1
where
f = 1;
commit;

耗時18s
執(zhí)行計劃如下:
將MySQL去重操作優(yōu)化到極致之三彈連發(fā)

以上都是沒有添加任何索引的情況.

添加索引如下:
create index idx_sort on t_source(created_time,item_name,item_id);
analyze table t_source;

創(chuàng)建索引之后,
使用表連接查詢方式耗時11s,小幅提升.
使用MySQL特性的方式,耗時11-12s,反而更慢.
使用MySQL自定義變量的方式,耗時還是18s.

很顯然,MySQL自定義變量的方式,其實沒有利用索引.

最終改進SQL

set @a:='0000-00-00 00:00:00';
set @b:=' ';
truncate t_target;
insert into t_target
select * from t_source force index (idx_sort)
where (@a!=created_time or @b!=item_name) and (@a:=created_time) is not null and (@b:=item_name) is not null
order by created_time,item_name;
commit;

耗時11s.
該語句具有以下特點。
（1）消除了嵌套子查詢，只需要對t_source表進行一次全索引掃描，查詢計劃已達最優(yōu)。
（2）無需distinct二次查重。
（3）變量判斷與賦值只出現(xiàn)在where子句中。
（4）利用索引消除了filesort。

強制通過索引idx_sort查找數(shù)據(jù)行 -> 應用where篩選器 -> 處理select列表 -> 應用order by子句。

為了使變量能夠按照created_time和item_name的排序順序進行賦值和比較，必須按照索引順序查找數(shù)據(jù)行。這里的force index (idx_sort)提示就起到了這個作用，必須這樣寫才能使整條查重語句成立。否則，因為先掃描表才處理排序，因此不能保證變量賦值的順序，也就不能確保查詢結果的正確性。order by子句同樣不可忽略，否則即使有force index提示，MySQL也會使用全表掃描而不是全索引掃描，從而使結果錯誤。
索引同時保證了created_time,item_name的順序，避免了文件排序。force index (idx_sort)提示和order by子句缺一不可，索引idx_sort在這里可謂恰到好處、一舉兩得。
查詢語句開始前，先給變量初始化為數(shù)據(jù)中不可能出現(xiàn)的值，然后進入where子句從左向右判斷。先比較變量和字段的值，再將本行created_time和item_name的值賦給變量，按created_time,item_name的順序逐行處理。item_name是字符串類型，(@b:=item_name)不是有效的布爾表達式，因此要寫成(@b:=item_name) is not null。

“insert into t_target select * from t_source group by created_time,item_name;”的寫法，它受“sql_mode='ONLY_FULL_GROUP_BY'”的限制。

運行耗時和原文有出入,可能是因為我的環(huán)境是SSD的緣故.
另外,避免回表的開銷,可以增加索引的字段

drop index idx_sort on t_source;
create index idx_sort on t_source(created_time,item_name,item_id,modified_time,other);
analyze table t_source;

使用上述索引,終極改進的SQL 耗時可以降到 9.5s

參考:
http://blog.csdn.net/wzy0623/article/details/54378367
http://blog.csdn.net/wzy0623/article/details/54378575

向AI問一下細節(jié)

將MySQL去重操作優(yōu)化到極致之三彈連發(fā)

將MySQL去重操作優(yōu)化到極致之三彈連發(fā)（一）：巧用索引與變量

猜你喜歡

最新資訊

相關推薦

相關標簽