This article walks through how Apache Nutch handles the generate.max.count parameter.
The parameter is processed in Selector, an inner class of org.apache.nutch.crawl.Generator.
The relevant variable declarations in org.apache.nutch.crawl.Generator:
private HashMap<String, int[]> hostCounts = new HashMap<String, int[]>();
private int maxCount;
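In the configure method of the inner class Selector, the value is read from the job configuration, defaulting to -1 when the property is not set: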
maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
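To make the default concrete, here is a minimal sketch of the same lookup-with-default behavior. It uses java.util.Properties as a stand-in for the Hadoop job configuration, so the class MaxCountConfigSketch and its getInt helper are hypothetical illustrations, not Nutch or Hadoop code. The key string is assumed to be the property name from this article's title.

```java
import java.util.Properties;

// Hypothetical stand-in for the job configuration lookup; not Nutch code.
class MaxCountConfigSketch {
    // Assumption: the constant resolves to the property name discussed in this article.
    static final String GENERATOR_MAX_COUNT = "generate.max.count";

    // Returns the integer value stored under key, or defaultValue when the key is absent.
    static int getInt(Properties props, String key, int defaultValue) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Property not set: the default is returned.
        System.out.println(getInt(props, GENERATOR_MAX_COUNT, -1));
        // Property set: the configured value is returned.
        props.setProperty(GENERATOR_MAX_COUNT, "100");
        System.out.println(getInt(props, GENERATOR_MAX_COUNT, -1));
    }
}
```

With an empty configuration the lookup yields -1, so maxCount only becomes positive when the property is set explicitly.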
Handling in the reduce method:
/*
 * 1. Look up the int[] for the current host or domain. If there is none yet,
 *    create one and put it in the map. Element 0 is the current segment
 *    number (1-based), element 1 is the URL count for that segment.
 *    Then increment element 1.
 */
int[] hostCount = hostCounts.get(hostordomain);
if (hostCount == null) {
  hostCount = new int[] { 1, 0 };
  hostCounts.put(hostordomain, hostCount);
}
hostCount[1]++; // increment hostCount

// 2. Check whether topN has been reached for the current segment
//    (segCounts holds the number of entries per segment);
//    if so, move on to the next segment.
while (segCounts[hostCount[0] - 1] >= limit
    && hostCount[0] < maxNumSegments) {
  hostCount[0]++;
  hostCount[1] = 0;
}

// reached the limit of allowed URLs per host / domain
// see if we can put it in the next segment?
if (hostCount[1] >= maxCount) {
  if (hostCount[0] < maxNumSegments) {
    hostCount[0]++;
    hostCount[1] = 0;
  } else {
    if (hostCount[1] == maxCount + 1 && LOG.isInfoEnabled()) {
      LOG.info("Host or domain " + hostordomain + " has more than "
          + maxCount + " URLs for all " + maxNumSegments
          + " segments. Additional URLs won't be included in the fetchlist.");
    }
    // skip this entry
    continue;
  }
}
entry.segnum = new IntWritable(hostCount[0]);
segCounts[hostCount[0] - 1]++;
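The per-host bookkeeping above can be exercised outside Hadoop. The following is a minimal sketch that copies the counting logic into a standalone class; the class name SegmentAssignSketch, the assign method, and the convention of returning -1 for a skipped entry are assumptions made for illustration, standing in for the continue statement and the entry.segnum assignment in the real reduce loop.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone model of the per-host segment assignment; not Nutch code.
class SegmentAssignSketch {
    // host or domain -> { current segment (1-based), URL count in that segment }
    private final Map<String, int[]> hostCounts = new HashMap<>();
    private final int[] segCounts;      // entries assigned to each segment
    private final int maxCount;         // per-host limit (generate.max.count)
    private final long limit;           // per-segment limit (topN)
    private final int maxNumSegments;

    SegmentAssignSketch(int maxCount, long limit, int maxNumSegments) {
        this.maxCount = maxCount;
        this.limit = limit;
        this.maxNumSegments = maxNumSegments;
        this.segCounts = new int[maxNumSegments];
    }

    // Returns the 1-based segment number for the next URL of this host,
    // or -1 when the URL is skipped (the "continue" branch in the original).
    int assign(String hostOrDomain) {
        int[] hostCount = hostCounts.get(hostOrDomain);
        if (hostCount == null) {
            hostCount = new int[] { 1, 0 };
            hostCounts.put(hostOrDomain, hostCount);
        }
        hostCount[1]++;

        // Current segment is full: advance while another segment is available.
        while (segCounts[hostCount[0] - 1] >= limit && hostCount[0] < maxNumSegments) {
            hostCount[0]++;
            hostCount[1] = 0;
        }

        // Per-host limit reached: try the next segment, otherwise skip.
        if (hostCount[1] >= maxCount) {
            if (hostCount[0] < maxNumSegments) {
                hostCount[0]++;
                hostCount[1] = 0;
            } else {
                return -1;
            }
        }
        segCounts[hostCount[0] - 1]++;
        return hostCount[0];
    }

    public static void main(String[] args) {
        SegmentAssignSketch s = new SegmentAssignSketch(2, 100, 2);
        for (int i = 0; i < 5; i++) {
            System.out.println(s.assign("example.org"));
        }
    }
}
```

With maxCount = 2, limit = 100 and two segments, five consecutive URLs from the same host are assigned 1, 2, 2, -1, -1: once the counter reaches maxCount the host moves to the next segment with its counter reset to 0, and after the last segment is exhausted further URLs are skipped.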
That is how the generate.max.count parameter is handled: the value is read once in Selector, and the reduce method uses it to cap the number of URLs per host or domain in each segment.