This article walks through Hive optimization with concrete examples. I found them quite practical, so I am sharing them here as a reference; hopefully you will take something away from reading it.
1. When can MapReduce be avoided? Hive can answer a query without launching a MapReduce job in two cases: a plain select *, or a query whose where clause filters only on partition columns.
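A minimal sketch of the two cases, assuming a table wlan partitioned by a hypothetical dt column:

```sql
-- Plain select *: Hive just reads the files and returns rows, no MR job.
SELECT * FROM wlan LIMIT 10;

-- Predicate only on the partition column: resolved by partition pruning
-- plus a file scan, again without launching MapReduce.
SELECT * FROM wlan WHERE dt = '2015-06-11' LIMIT 10;
```

In newer Hive releases the same behaviour is governed by hive.fetch.task.conversion (for example, set hive.fetch.task.conversion=more; extends fetch-only execution to simple filters and projections).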
2. Join optimization. The driving table should be the rightmost one in the join: order the tables so their size increases from left to right. There is also a hint mechanism that tells the query optimizer explicitly which table is the big one: /*+ streamtable(table_name) */
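A sketch of the hint, with hypothetical tables small_dim and big_fact:

```sql
-- Hive streams the table named in the hint and buffers the others in memory,
-- so the hint (or rightmost position) should name the largest table.
SELECT /*+ STREAMTABLE(big_fact) */ f.id, d.name
FROM small_dim d
JOIN big_fact f ON f.dim_id = d.id;
```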
3. Map-side aggregation. set hive.map.aggr=true; This setting performs the top-level aggregation in the map phase, which cuts the amount of data shuffled between phases and shortens the reduce phase, improving overall performance. Drawback: it uses more memory. Try it by running: select count(1) from wlan;
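The setting can be combined with two related knobs; both are standard Hive parameters, shown here with their usual defaults:

```sql
set hive.map.aggr=true;
-- After this many rows, check whether the map-side hash aggregation
-- is actually shrinking the row count:
set hive.groupby.mapaggr.checkinterval=100000;
-- Abandon map-side aggregation when it keeps more than this fraction of rows:
set hive.map.aggr.hash.min.reduction=0.5;

select count(1) from wlan;
```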
4. Local Hadoop mode. set mapred.job.tracker=local; To test: select 1 from wlan limit 5;
The following two parameters are commonly used to control local-mode MR:
1. hive.exec.mode.local.auto.inputbytes.max sets the maximum input size for local MR; when the input data is smaller than this value, the query runs as a local MR job.
2. hive.exec.mode.local.auto.tasks.max sets the maximum number of input files for local MR; when there are fewer input files than this value, the query runs as a local MR job. Default (cluster) execution mode:
hive (default)> select count(1) from t1;
Query ID = root_20150611185656_333185b7-e8b3-40b5-bc4c-2f11978f9822
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1433931422330_0001, Tracking URL = http://crxy176:8088/proxy/application_1433931422330_0001/
Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job -kill job_1433931422330_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-06-11 18:56:44,749 Stage-1 map = 0%, reduce = 0%
2015-06-11 18:56:57,029 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
2015-06-11 18:57:11,050 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.2 sec
MapReduce Total cumulative CPU time: 4 seconds 200 msec
Ended Job = job_1433931422330_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.2 sec HDFS Read: 312 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 200 msec
OK
t1
100
Time taken: 46.573 seconds, Fetched: 1 row(s)
For comparison, the same query with local Hadoop mode enabled:
hive (default)> select count(1) from t1;
Automatically selecting local only mode for query
Query ID = root_20150611185555_97e1a1d0-1958-4f35-8ea7-8face4cda85f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-06-11 18:55:25,123 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1510342541_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 12 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t1
100
Time taken: 1.721 seconds, Fetched: 1 row(s)
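Putting the local-mode parameters from above together, a minimal sketch that lets Hive choose local mode automatically (the threshold values shown are the usual defaults):

```sql
-- Let Hive decide per query whether local mode applies:
set hive.exec.mode.local.auto=true;
-- Inputs up to 128 MB qualify for local mode:
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- At most this many input files qualify:
set hive.exec.mode.local.auto.tasks.max=4;

select count(1) from t1;
```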
5. Indexes. Hive's index architecture exposes an interface that allows you to implement your own index against it. Hive currently ships one reference implementation (CompactIndex), and bitmap indexes were added later in version 0.8. Here we cover CompactIndex.
/* Create an index on the id column of index_test_table */
create index idx on table index_test_table(id)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
alter index idx on index_test_table rebuild;
/* Prune the index: take the index table built above and trim it down with the predicate your final query will use. */
/* Unlike an RDBMS, you cannot use the index right after building it; querying immediately simply errors out, which is the inconvenient part. */
create table my_index
as select _bucketname, `_offsets`
from default__index_test_table_idx__ where id = 10;
/* Now the index can be used; note the final query predicate must match the pruning predicate above */
set hive.index.compact.file=/user/hive/warehouse/my_index;
set hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
select count(*) from index_test_table where id = 10;
6. Data skew. Data skew means that because the data is unevenly distributed, a handful of values account for most of the rows; combined with Hadoop's computation model, this leaves the work unevenly spread across tasks and drags down performance.
Skew falls into two kinds: skew caused by group by and skew caused by join.
One parameter is hive.map.aggr, whose default is already true, meaning a map-side combiner is applied. So if your group by query only computes count(*), you will barely see any skew effect; if you compute count(distinct), some skew will still show.
The other parameter is hive.groupby.skewindata. With this set, the reduce step does not route all rows with the same key to one reducer; instead keys are distributed randomly, the reducers do a partial aggregation, and then one more MR round runs over the partial aggregates to compute the final result. It therefore does much the same thing as hive.map.aggr, only on the reduce side, and at the cost of launching an extra job, so it is not really recommended; the gain is usually modest.
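The two group-by skew settings described above can be sketched as:

```sql
-- Map-side combiner (default true in most Hive versions):
set hive.map.aggr=true;
-- Two-round aggregation: the first job spreads skewed keys over random
-- reducers for partial aggregation; a second job merges the partials:
set hive.groupby.skewindata=true;

select a, count(distinct b) from tbl group by a;
```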
Rewriting the SQL to optimize:
/* before */
select a, count(distinct b) as c from tbl group by a;
/* after */
select a, count(*) as c
from (select distinct a, b from tbl) t group by a;
7. Inter-job parallelism
First, among the multiple jobs Hive generates, some jobs can run in parallel in certain situations; the typical case is subqueries. When a query needs to execute several subqueries combined with union all, or several join operations, inter-job parallelism can be used. For example, the code below sketches a scenario that can run in parallel:
hive> FROM t4
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...;
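For the independent stages to actually run concurrently, Hive's parallel-execution switch must also be on; both parameters below are standard Hive settings:

```sql
-- Allow independent stages of a single query to run at the same time:
set hive.exec.parallel=true;
-- Upper bound on the number of concurrently running stages (default 8):
set hive.exec.parallel.thread.number=8;
```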
That is all for this look at Hive optimization examples. Hopefully the content above has been of some help and you have picked up some new knowledge; if you thought the article was good, please share it so more people can see it.