A Detailed Look at Data Locality in Hadoop

Published: 2020-10-12 13:05:59 Source: 腳本之家 Views: 284 Author: csguo007 Category: Programming Languages

Data Locality in Hadoop refers to the "proximity" of the data with respect to the Mapper tasks working on that data.

1. Why is data locality important?

When a dataset is stored in HDFS, it is divided into blocks, and the blocks are stored across the DataNodes in the Hadoop cluster. When a MapReduce job runs against the dataset, the individual Mappers process these blocks as input splits. If a Mapper cannot read its data from the node it is executing on, the data has to be copied over the network from the DataNode that holds it to the node that is executing the Mapper task. Suppose a MapReduce job has more than 1000 Mappers and every one of them tries to copy its data from another DataNode in the cluster at the same time. The result would be serious network congestion, which is far from ideal. For this reason it is generally more effective and cheaper to move the computation closer to the data than to move the data closer to the computation.
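To get a feel for the scale of the problem, the following rough sketch estimates how much data crosses the network for a job with 1000 Mappers. The 128 MB block size, the one-block-per-Mapper simplification, and the function name `network_traffic_mb` are assumptions made for illustration; they are not taken from the article.

```python
# Back-of-the-envelope estimate of network traffic caused by non-local Mappers.
# Assumptions (illustrative only): each Mapper reads exactly one HDFS block,
# and every block is 128 MB.

BLOCK_SIZE_MB = 128  # assumed block size
NUM_MAPPERS = 1000   # the job size used as an example in the text


def network_traffic_mb(num_mappers: int,
                       local_fraction: float,
                       block_size_mb: int = BLOCK_SIZE_MB) -> float:
    """Return the MB copied over the network when only `local_fraction`
    of the Mappers run on a node that already holds their block."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_mappers = num_mappers * (1.0 - local_fraction)
    return remote_mappers * block_size_mb


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        traffic = network_traffic_mb(NUM_MAPPERS, fraction)
        print(f"{fraction:.0%} local -> {traffic:,.0f} MB over the network")
```

Under these assumptions, a job in which no Mapper is local moves every block across the network, while a fully local job moves none, which is why schedulers try to place Mappers next to their data.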

2. How is data proximity defined?

When the JobTracker (MRv1) or ApplicationMaster (MRv2) receives a request to run a job, it checks which nodes in the cluster have enough resources to execute the job's Mappers and Reducers. At the same time, it gives serious consideration to where each Mapper's data is located when deciding which node that Mapper will run on.
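One way to make "proximity" concrete is to classify a candidate node against the nodes that hold the Mapper's data, using the three levels that the following sections describe. The sketch below is a simplified model, not Hadoop's actual scheduler code: the `Node` class, the `locality_level` function, and the host and rack names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Node:
    """A cluster node identified by host name and rack (hypothetical model)."""
    host: str
    rack: str


def locality_level(mapper_node: Node, data_nodes: Iterable[Node]) -> str:
    """Classify how close `mapper_node` is to the nodes holding the data."""
    holders = list(data_nodes)
    # Same node as the data: the best case.
    if any(d.host == mapper_node.host for d in holders):
        return "data-local"
    # Different node, but on the same rack as a node holding the data.
    if any(d.rack == mapper_node.rack for d in holders):
        return "rack-local"
    # Data must be copied from a node on another rack.
    return "off-rack"


if __name__ == "__main__":
    holders = [Node("node1", "rack1")]
    for candidate in (Node("node1", "rack1"),
                      Node("node2", "rack1"),
                      Node("node9", "rack3")):
        print(candidate.host, "->", locality_level(candidate, holders))
```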

3. Data Local

When the data is located on the same node as the Mapper working on it, this is referred to as Data Local. In this case the data is as close to the computation as it can be. The JobTracker (MRv1) or ApplicationMaster (MRv2) prefers to execute a Mapper on the node that holds the data the Mapper needs.

4. Rack Local

Although Data Local is the ideal choice, it is not always possible to execute the Mapper on the same node as the data because of resource constraints on a busy cluster. In such instances it is preferred to run the Mapper on a different node that sits on the same rack as the node holding the data. The data is then moved between nodes, from the node that holds it to the node executing the Mapper within the same rack. This is referred to as Rack Local.

5. Different Rack

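On a busy cluster, sometimes even Rack Local is not possible. In that case a node on a different rack is chosen to execute the Mapper, and the data is copied from the node that holds it to the node executing the Mapper on the other rack. This is the least preferred scenario.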
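The order of preference described in sections 3 to 5 can be sketched as a simple fallback search. This is only an illustration of the idea under stated assumptions: the `pick_node` function, its parameters, and the host and rack names are hypothetical, and a real JobTracker or ApplicationMaster weighs many more factors.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(data_hosts: List[str],
              free_hosts: List[str],
              rack_of: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Choose a host for one Mapper, preferring Data Local, then Rack Local,
    then a different rack. Returns (host, level), or None if nothing is free.

    data_hosts: hosts that hold the Mapper's data
    free_hosts: hosts that currently have resources for a Mapper
    rack_of:    mapping from host name to rack name
    """
    # 1. Data Local: a free host that already holds the data.
    for host in free_hosts:
        if host in data_hosts:
            return host, "data-local"

    # 2. Rack Local: a free host on the same rack as a host holding the data.
    data_racks = {rack_of[h] for h in data_hosts}
    for host in free_hosts:
        if rack_of[host] in data_racks:
            return host, "rack-local"

    # 3. Different rack: the least preferred option, taken only as a last resort.
    if free_hosts:
        return free_hosts[0], "off-rack"

    return None


if __name__ == "__main__":
    racks = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}
    print(pick_node(["node1"], ["node3", "node2"], racks))
```

In the usage example, the only host holding the data (`node1`) has no free resources, so the search falls back to `node2`, which shares a rack with it, rather than `node3` on another rack.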
