What Are the MapReduce Design Patterns?

Published: 2022-01-04 10:59:32 | Source: Yisu Cloud | Author: iii | Category: Cloud Computing

This article walks through the main MapReduce design patterns, with a short note on each.

1 Summarization Patterns

1.1 Numerical Summarizations

This one is effectively built in, because it is the basic MapReduce model itself. It corresponds to COUNT/MAX in SQL, and WordCount is an implementation of it.
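
As a rough illustration of the map/reduce split behind WordCount, here is a minimal sketch in plain Java with no Hadoop dependency. The class name WordCountSketch and its methods are made up for this example; a real job would use Hadoop's Mapper and Reducer classes instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a WordCount job (not Hadoop API).
final class WordCountSketch {
    /** Map phase: emit (word, 1) for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** Reduce phase: sum the values emitted for each key, like COUNT ... GROUP BY. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return counts;
    }
}
```

Swapping Integer::sum for Math::max would turn the same skeleton into a per-key MAX.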

1.2 Inverted Index Summarizations

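The name sounds mysterious, but to me this is hardly a pattern. It is more of an application, and it does not really touch on MapReduce design. At its core it builds an index over list(V3), where V3 is a referencing id. The expected result of this pattern is:
url -> list of id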
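
The url -> list of id result can be sketched in plain Java as follows. This is only an in-memory illustration: the class InvertedIndexSketch and the (documentId, url) input shape are assumptions of this example, not something the original article specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an inverted-index job (not Hadoop API).
final class InvertedIndexSketch {
    /**
     * Input: (documentId, url) pairs, one per reference found in a document.
     * Map emits (url, documentId); reduce collects the ids seen for each url.
     */
    static Map<String, List<Integer>> build(List<Map.Entry<Integer, String>> references) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> ref : references) {
            index.computeIfAbsent(ref.getValue(), url -> new ArrayList<>()).add(ref.getKey());
        }
        return index;
    }
}
```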

1.3 Counting with Counters

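Counters are fast, simple, and easy to use. The cost is memory on the TaskTracker and, more importantly, on the JobTracker. In the 1.0 era the advice was therefore to use tens of counters, and certainly fewer than 100. In 2.0 the job-tracking role became per job, so I think more counters can be used, but they are still best suited to counting-style algorithms that compute totals.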
context.getCounter(STATE_COUNTER_GROUP, UNKNOWN_COUNTER).increment(1);

2 Filtering Patterns

2.1 Filtering

This one is also effectively built in: if the Mapper does not write a record, that record is filtered out. It corresponds to WHERE in SQL.

map(key, record):
    if we want to keep record then
        emit key, record

2.2 Bloom Filtering

I never knew where the name Bloom filter came from until I read the Wikipedia entry, quoted here:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate.
For how it works, see this article:

http://blog.csdn.net/jiaomeng/article/details/1495500
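If I had to put it in one sentence: take the contents of the set and use several hash functions to build a bitmap. If a word's hashes all land on set bits in the map, the word may or may not be in the set. If any of its hashes does not, the word is definitely not in the set. This filter is quite interesting and is widely used in HBase. Next I need to try it out in practice.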

Note: need to write a program to play with this.
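
For anyone who wants something to play with, below is a minimal Bloom filter sketch in plain Java built on java.util.BitSet. The class BloomFilterSketch, its double-hashing scheme based on String.hashCode(), and the sizes in the usage note are choices made for this example. It is not the Bloom filter implementation that ships with Hadoop or HBase.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter, for illustration only.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** i-th bit position for an item, derived from two base hashes (double hashing). */
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step so the positions vary with i
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Set every bit the item hashes to. */
    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    /** false means definitely absent; true means possibly present. */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A typical use is to add every key of the small set once, for example with new BloomFilterSketch(1024, 3), and then call mightContain in the mapper to discard records that are certainly not members. Records that pass may still be false positives.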

2.3 Top Ten (Top N)

This is a typical top-N computation, similar to TOP or LIMIT in SQL, and it usually comes with some selection condition.
Implementation: I like pseudocode because it is clear at a glance:

class mapper:
    setup():
        initialize top ten sorted list

    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10 then
            truncate list to a length of 10

    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list

    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
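
The two-stage idea in the pseudocode (each mapper keeps a local top N, then the reducer merges the candidates and truncates again) might look like this in plain Java. TopNSketch and its method names are invented for this sketch, and records are simplified to integers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical in-memory stand-in for the top-N mapper and reducer (not Hadoop API).
final class TopNSketch {
    /** Mapper side: keep only the n largest values seen in one input split. */
    static List<Integer> localTop(List<Integer> split, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int value : split) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest
            }
        }
        List<Integer> top = new ArrayList<>(heap);
        top.sort(Collections.reverseOrder());
        return top;
    }

    /** Reducer side: merge the per-split candidates and truncate again. */
    static List<Integer> globalTop(List<List<Integer>> splits, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTop(split, n));
        }
        return localTop(candidates, n);
    }
}
```

Because every mapper emits under the same null key, all candidates reach a single reduce call, which is what makes the final truncation correct.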

2.4 Distinct

This pattern is simple too. It relies on the reduce phase grouping identical keys together. The structure makes it clear at a glance:

map(key, record):
    emit record,null

reduce(key, records):
    emit key

3 Data Organization Patterns

3.1 Structured to Hierarchical

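Algorithmically this is a join, and at the application level it has the effect of denormalization. The key to the program is MultipleInputs, which lets you plug in several Mappers. That makes it easy to bring structured content, or content in any other format, together on the reducer side, where it is aggregated into a hierarchical format.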
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, PostMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, CommentMapper.class);
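There is a small trick in the map output: each value gets a prefix according to its category, so that the reduce phase knows where the data came from and can better perform a Cartesian product or tell the records apart. Technically this saves you from writing your own Writable, and in principle you can define a format that carries more information. Of course, if you have special sorting or grouping requirements, you still have to write a custom Writable.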
outkey.set(post.getAttribute("ParentId"));
outvalue.set("A" + value.toString());
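
To show how the reducer can use such a prefix, here is a small plain-Java sketch. The tags 'P' (post) and 'C' (comment), the class TaggedReduceSketch, and the string output format are all assumptions made for this example. They are not taken from the original program.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reducer logic for prefix-tagged values (not Hadoop API).
final class TaggedReduceSketch {
    /**
     * Reduce call for one post id. Each value starts with a one-character
     * source tag: 'P' = post record, 'C' = comment record (assumed tags).
     * Returns the post followed by its comments, or null if no post arrived.
     */
    static String reduce(String postId, List<String> taggedValues) {
        String post = null;
        List<String> comments = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.isEmpty()) {
                continue; // nothing to classify
            }
            char tag = value.charAt(0);
            String body = value.substring(1);
            if (tag == 'P') {
                post = body;
            } else if (tag == 'C') {
                comments.add(body);
            }
        }
        if (post == null) {
            return null; // comments without a post: nothing to nest under
        }
        return "post[" + postId + "]=" + post + " comments=" + comments;
    }
}
```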

3.2 Partitioning

Here we go again: this one is built in. You write your own partitioner to route records to specific reducers.
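
The routing decision itself is just a function from key to reducer index. The plain-Java sketch below shows a hash-based default next to a custom rule. PartitionSketch, yearPartition, and the one-reducer-per-year rule are hypothetical choices for this example. In a real job the logic would live in a subclass of Hadoop's Partitioner, inside getPartition, registered with job.setPartitionerClass.

```java
// Hypothetical partitioning functions, for illustration only.
final class PartitionSketch {
    /** Default-style hash partitioning: non-negative hash modulo the reducer count. */
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Custom partitioning: send each year to its own reducer, counted from minYear. */
    static int yearPartition(int year, int minYear, int numReducers) {
        int index = year - minYear;
        if (index < 0 || index >= numReducers) {
            throw new IllegalArgumentException("year out of configured range: " + year);
        }
        return index;
    }
}
```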

3.3 Binning

This one is interesting. It resembles partitioning but is map-side only, so it is more efficient, although the results may need a further merge.
The SPLIT operation in Pig implements this pattern.
The implementation relies on MultipleOutputs.addNamedOutput().

// Configure the MultipleOutputs by adding an output called "bins"
// With the proper output format and mapper key/value pairs

MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class,Text.class, NullWritable.class);

// Enable the counters for the job
// If there are a significant number of different named outputs, this
// should be disabled

MultipleOutputs.setCountersEnabled(job, true);

// Map-only job
job.setNumReduceTasks(0);

3.4 Total Order Sorting

This was already described in detail in the Hadoop part, so it is skipped here.

3.5 Shuffling

The essence of this pattern is generating a random key for each record.
outkey.set(rndm.nextInt());
context.write(outkey, outvalue);
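
Why a random key shuffles the data can be seen in a tiny plain-Java simulation: the framework sorts by key between map and reduce, so random keys produce a random order. ShuffleSketch and the seed parameter are inventions of this sketch. The seed is only there to make runs repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical in-memory stand-in for the shuffling job (not Hadoop API).
final class ShuffleSketch {
    /** Map: tag each record with a random int key. Sort: order records by that key. */
    static List<String> shuffle(List<String> records, long seed) {
        Random rndm = new Random(seed);
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (String record : records) {
            keyed.add(Map.entry(rndm.nextInt(), record));
        }
        keyed.sort(Map.Entry.comparingByKey()); // stands in for the framework's sort phase
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> kv : keyed) {
            out.add(kv.getValue());
        }
        return out;
    }
}
```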

4 Join Patterns

4.1 Reduce Side Join

This one is fairly simple and was already covered under Structured to Hierarchical.

4.2 Replicated Join (map-side join)

A map-side join is more efficient, but the smaller data set has to be placed in the DistributedCache. The other data set is then fed into map, and the join is completed inside the mapper.
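
The core of the mapper is a hash lookup against the small data set. Here is a plain-Java sketch of that step, assuming the small side has already been loaded into a Map (in a real job that loading would happen in setup() from the cached file). ReplicatedJoinSketch and the user/comment field layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a replicated join mapper (not Hadoop API).
final class ReplicatedJoinSketch {
    /**
     * smallTable: userId -> user name, small enough to hold in memory on every mapper.
     * bigRecords: (userId, comment) pairs streamed through map().
     * Inner join: records with no match in the small table are dropped.
     */
    static List<String> join(Map<String, String> smallTable,
                             List<Map.Entry<String, String>> bigRecords) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> record : bigRecords) {
            String user = smallTable.get(record.getKey());
            if (user != null) {
                joined.add(user + "\t" + record.getValue());
            }
        }
        return joined;
    }
}
```

No reduce phase is needed, which is where the efficiency comes from. The trade-off is that the small table must fit in each mapper's memory.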

4.3 Composite Join

This pattern is also a map-side join, and it can join two large data sets. Its restriction is that the data sets must be organized in the same way. What does "organized in the same way" mean?
- An inner or full outer join is desired.
- All the data sets are sufficiently large.
- All data sets can be read with the foreign key as the input key to the mapper.
- All data sets have the same number of partitions.
- Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set. That is, partition X of data sets A and B contain the same foreign keys and these foreign keys are present only in partition X. For a visualization of this partitioning and sorting key, refer to Figure 5-3.
- The data sets do not change often (if they have to be prepared).

// The composite input format join expression will set how the records
// are going to be read in, and in what input format.
conf.set("mapred.join.expr", CompositeInputFormat.compose(joinType,
KeyValueTextInputFormat.class, userPath, commentPath));

4.4 Cartesian Product

This requires writing a custom InputFormat so that the two data sets can be paired up at the record level. Sample omitted.

5 MetaPatterns

5.1 Job Chaining

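There are several ways to do this: chain the jobs in the driver, or invoke them from a bash script. Hadoop also provides JobControl, which offers useful features such as tracking failed jobs.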

5.2 Chain Folding

The ChainMapper and ChainReducer approach: a chain of the form M+ R M* (one or more mappers, a single reducer, then zero or more mappers).

5.3 Job Merging

Job merging means combining, at the code level, the mappers and reducers of two jobs that read the same data, which saves I/O and parsing time.

6 Input and Output Patterns

6.1 Customizing Input and Output in Hadoop

InputFormat
  getSplits
  createRecordReader
InputSplit
  getLength
  getLocations
RecordReader
  initialize
  getCurrentKey and getCurrentValue
  nextKeyValue
  getProgress
  close
OutputFormat
  checkOutputSpecs
  getRecordWriter
  getOutputCommitter
RecordWriter
  write
  close

6.2 Generating Data (random data)

Key point: construct fake InputSplits. Unlike a file-based split (FileSplit), these are not backed by blocks, so the only option is to trick Hadoop.
