
Example Analysis of Hive Optimization

Published: 2021-12-10 10:39:07  Source: 億速云  Author: 小新  Category: Cloud Computing

This article walks through example analyses of Hive optimization in detail. The editor found them quite practical and shares them here for reference; we hope you take something away after reading.

1. When can MapReduce be avoided? Two cases: a plain select *, and a where clause that references only partition columns. In both, Hive can answer the query by fetching files directly instead of launching a MapReduce job. A minimal sketch follows.
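A minimal sketch of both cases, assuming a hypothetical table logs partitioned by dt:

/* Both queries can be served as a plain file fetch, with no MapReduce job. */
select * from logs;
select * from logs where dt = '2015-06-11';  -- filter touches only the partition column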
2. Join optimization: make the driving table the rightmost one, ordering the queried tables so their sizes increase from left to right. A hint mechanism can explicitly tell the query optimizer which table is the big one: /*+ streamtable(table_name) */. A sketch follows.
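A sketch of the hint, with hypothetical tables small_dim (small) and big_fact (large):

/* streamtable tells Hive to stream big_fact through the join
   instead of buffering it in reducer memory. */
select /*+ streamtable(f) */ f.user_id, d.name
from small_dim d join big_fact f on d.id = f.dim_id;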
3. Map-side aggregation: set hive.map.aggr=true; This setting pushes the top-level aggregation into the map phase, cutting the data shuffled to the reducers and shortening the reduce phase, which improves overall performance. Drawback: the setting consumes more memory. A minimal test follows.
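Using the wlan table from the article:

/* Enable the map-side combiner, then run a simple aggregate to compare timings. */
set hive.map.aggr=true;
select count(1) from wlan;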
4. Local Hadoop mode: SET mapred.job.tracker=local; Test with select 1 from wlan limit 5;
The following two parameters are the usual controls for local MR:
1. hive.exec.mode.local.auto.inputbytes.max sets the maximum input volume for local MR; when the input is smaller than this value, the local MR path is taken.
2. hive.exec.mode.local.auto.tasks.max sets the maximum number of input files for local MR; when there are fewer input files than this value, the local MR path is taken.
Default execution mode:

hive (default)> select count(1) from t1;

Query ID = root_20150611185656_333185b7-e8b3-40b5-bc4c-2f11978f9822

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1433931422330_0001, Tracking URL = http://crxy176:8088/proxy/application_1433931422330_0001/

Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job  -kill job_1433931422330_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2015-06-11 18:56:44,749 Stage-1 map = 0%,  reduce = 0%

2015-06-11 18:56:57,029 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec

2015-06-11 18:57:11,050 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.2 sec

MapReduce Total cumulative CPU time: 4 seconds 200 msec

Ended Job = job_1433931422330_0001

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.2 sec   HDFS Read: 312 HDFS Write: 2 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 200 msec

OK

t1

100

Time taken: 46.573 seconds, Fetched: 1 row(s)

For comparison, with local Hadoop mode enabled:

hive (default)> select count(1) from t1;

Automatically selecting local only mode for query

Query ID = root_20150611185555_97e1a1d0-1958-4f35-8ea7-8face4cda85f

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0

2015-06-11 18:55:25,123 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_local1510342541_0004

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 12 HDFS Write: 22 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

t1

100

Time taken: 1.721 seconds, Fetched: 1 row(s)
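Beyond setting mapred.job.tracker by hand, Hive can also pick local mode automatically using the two thresholds described above; a minimal sketch (the values shown are illustrative, not prescriptive):

/* Let Hive fall back to local mode whenever the input is small enough. */
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- input ceiling: 128 MB
set hive.exec.mode.local.auto.tasks.max=4;               -- input file ceiling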

5. Indexes. Hive's index architecture exposes an interface that lets you implement an index of your own against it. Hive ships one reference implementation (CompactIndex), and bitmap indexes were added later in version 0.8. Here we walk through CompactIndex.

/* Create an index on the id column of index_test_table */
create index idx on table index_test_table(id)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

alter index idx on index_test_table rebuild;

/* Prune the index: take the index table built above and filter it
   with the predicate your final query will use. */
/* Unlike an RDBMS, you cannot use the index right after building it;
   doing so fails outright, which is the inconvenient part. */
create table my_index
as select _bucketname, `_offsets`
from default__index_test_table_idx__ where id = 10;

/* Now the index can be used; note the final predicate must match the
   pruning predicate above. */
set hive.index.compact.file=/user/hive/warehouse/my_index;
set hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
select count(*) from index_test_table where id = 10;

6. Data skew. Data skew means that because values are unevenly distributed, a handful of values account for most of the rows; combined with Hadoop's computation model, this leaves the work unevenly spread across compute resources and drags performance down.

Skew divides into skew caused by group by and skew caused by join; the join side is sketched just below, and two parameters matter on the group by side.
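For the join side, which the article does not elaborate further, Hive exposes a dedicated skew-join switch; a brief sketch with illustrative values:

/* Keys with more rows than the threshold are split off into a follow-up job. */
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;  -- rows per key beyond which a key counts as skewed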

One is hive.map.aggr, whose default is already true, meaning a map-side combiner is applied. So if your group by query only computes count(*), you will hardly see the skew; with count(distinct) some skew still shows through.

The other parameter is hive.groupby.skewindata. It means that during the reduce, rows with the same key are not all sent to one reducer but are distributed randomly; the reducers do a partial aggregation, and one more MR round then takes those partial aggregates and computes the final result. So it does much the same thing as hive.map.aggr, only moved to the reduce side, and it has to launch an extra round of jobs, so it is not really recommended; the benefit is not obvious.
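The two group-by settings side by side, as a sketch:

set hive.map.aggr=true;           -- map-side combiner (already the default)
set hive.groupby.skewindata=true; -- randomize keys, aggregate again in a second job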

Rewriting the SQL is another way to optimize:

/* before the rewrite */
select a, count(distinct b) as c from tbl group by a;

/* after the rewrite: the distinct moves into a subquery whose (a, b) keys
   spread the load across reducers; Hive requires the subquery alias t */
select a, count(*) as c
from (select distinct a, b from tbl) t
group by a;


7. Inter-job parallelism

First, among the multiple jobs Hive generates, there are cases where the jobs can run in parallel; the typical case is subqueries. When several subqueries have to be executed for a union all or a join, inter-job parallelism can kick in. For example, the code below sketches a parallelizable scenario:

hive> FROM t4
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
INSERT OVERWRITE TABLE t3 PARTITION (...) SELECT ... WHERE ...
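Producing parallelizable jobs is only half the story; parallel stage execution also has to be switched on. A minimal sketch (the thread count is illustrative):

set hive.exec.parallel=true;             -- allow independent stages to run concurrently
set hive.exec.parallel.thread.number=8;  -- cap on stages running at once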

That is all for "Example Analysis of Hive Optimization". We hope the content above helps you and teaches you a bit more; if you think the article is good, please share it so more people can see it.
