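Consider this scenario: in HBase you need a paginated query that also filters on the value of a particular column.
Unlike an RDBMS, which supports pagination natively, HBase leaves paging for you to implement. As far as I know there are currently two approaches. The first, described in "HBase: The Definitive Guide", uses PageFilter in a loop and sets startRow dynamically for each page. This approach is relatively inefficient and performs redundant queries. For that reason JD.com developed a second approach that stores row sequence numbers in an additional table. It is more efficient but harder to implement, because the extra table has to be maintained.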
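To make the first approach concrete, here is a minimal in-memory sketch of the paging loop. It is not HBase client code: a TreeMap stands in for the table, and the names PagingSketch, fetchPage, fetchAllPages and sampleTable are hypothetical, made up for illustration. What it shows is the shape of the loop: fetch one page, remember its last row key, and start the next page just after that key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of "PageFilter + moving startRow" paging; not HBase client code.
class PagingSketch {

    // Returns up to pageSize row keys, starting at startRow (inclusive).
    static List<String> fetchPage(TreeMap<String, String> table, String startRow, int pageSize) {
        List<String> page = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(startRow, true).entrySet()) {
            if (page.size() >= pageSize) {
                break;
            }
            page.add(e.getKey());
        }
        return page;
    }

    // Walks the whole table page by page and returns the pages in order.
    static List<List<String>> fetchAllPages(TreeMap<String, String> table, int pageSize) {
        List<List<String>> pages = new ArrayList<>();
        String startRow = "";
        while (true) {
            List<String> page = fetchPage(table, startRow, pageSize);
            if (page.isEmpty()) {
                break; // this last query returned nothing
            }
            pages.add(page);
            // Next start key: the last key plus the smallest possible suffix,
            // so the last row of this page is not returned a second time.
            startRow = page.get(page.size() - 1) + "\u0000";
        }
        return pages;
    }

    // Hypothetical sample data: row1 .. row5.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put("row" + i, "value" + i);
        }
        return table;
    }
}
```

With the five sample rows and a page size of 2, fetchAllPages returns three pages: row1 and row2, row3 and row4, then row5. A fourth query comes back empty and ends the loop, which is plausibly one source of the redundant work mentioned above.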
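Whether it is a solution or a person, there is no "best", only "most suitable".
In our company's use case the performance requirements are modest, so we went with the first approach. It worked nicely until one day we needed to filter on a column value while paging. The natural tool for filtering on a column value is SingleColumnValueFilter (SCVFilter for short below). The code looked roughly like this; only the logic relevant to this post is shown: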
    Scan scan = initScan(xxx);
    FilterList filterList = new FilterList();
    scan.setFilter(filterList);
    filterList.addFilter(new PageFilter(1));
    filterList.addFilter(new SingleColumnValueFilter(FAMILY, ISDELETED,
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes(false)));
The data is as follows:
    row1    column=f:content, timestamp=1513953705613, value=content1
    row1    column=f:isDel, timestamp=1513953705613, value=1
    row1    column=f:name, timestamp=1513953725029, value=name1
    row2    column=f:content, timestamp=1513953705613, value=content2
    row2    column=f:isDel, timestamp=1513953744613, value=0
    row2    column=f:name, timestamp=1513953730348, value=name2
    row3    column=f:content, timestamp=1513953705613, value=content3
    row3    column=f:isDel, timestamp=1513953751332, value=0
    row3    column=f:name, timestamp=1513953734698, value=name3
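The code above adds two filters to the scan: first a PageFilter that limits the query to 1 row, then an SCVFilter that returns only rows where isDeleted=false (the f:isDel column in the data above).
The code looks airtight, yet at runtime it returned no data!
I happened to be reading the HBase source at the time, so I debugged the server-side Filter query flow locally.
First, the overall flow of an HBase Filter, shown in the figure:
Next, the implementation of PageFilter.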
    public class PageFilter extends FilterBase {
      private long pageSize = Long.MAX_VALUE;
      private int rowsAccepted = 0;

      /**
       * Constructor that takes a maximum page size.
       *
       * @param pageSize Maximum result size.
       */
      public PageFilter(final long pageSize) {
        Preconditions.checkArgument(pageSize >= 0, "must be positive %s", pageSize);
        this.pageSize = pageSize;
      }

      public long getPageSize() {
        return pageSize;
      }

      @Override
      public ReturnCode filterKeyValue(Cell ignored) throws IOException {
        return ReturnCode.INCLUDE;
      }

      public boolean filterAllRemaining() {
        return this.rowsAccepted >= this.pageSize;
      }

      public boolean filterRow() {
        this.rowsAccepted++;
        return this.rowsAccepted > this.pageSize;
      }

    }
It is quite simple. There is an internal counter, and every call to filterRow increments it by 1. Once the counter exceeds pageSize, filterRow returns true and all subsequent rows are filtered out.
Now the implementation of SCVFilter.
    public class SingleColumnValueFilter extends FilterBase {
      private static final Log LOG = LogFactory.getLog(SingleColumnValueFilter.class);

      protected byte [] columnFamily;
      protected byte [] columnQualifier;
      protected CompareOp compareOp;
      protected ByteArrayComparable comparator;
      protected boolean foundColumn = false;
      protected boolean matchedColumn = false;
      protected boolean filterIfMissing = false;
      protected boolean latestVersionOnly = true;

      /**
       * Constructor for binary compare of the value of a single column.  If the
       * column is found and the condition passes, all columns of the row will be
       * emitted.  If the condition fails, the row will not be emitted.
       * <p>
       * Use the filterIfColumnMissing flag to set whether the rest of the columns
       * in a row will be emitted if the specified column to check is not found in
       * the row.
       *
       * @param family name of column family
       * @param qualifier name of column qualifier
       * @param compareOp operator
       * @param comparator Comparator to use.
       */
      public SingleColumnValueFilter(final byte [] family, final byte [] qualifier,
          final CompareOp compareOp, final ByteArrayComparable comparator) {
        this.columnFamily = family;
        this.columnQualifier = qualifier;
        this.compareOp = compareOp;
        this.comparator = comparator;
      }

      @Override
      public ReturnCode filterKeyValue(Cell c) {
        if (this.matchedColumn) {
          // We already found and matched the single column, all keys now pass
          return ReturnCode.INCLUDE;
        } else if (this.latestVersionOnly && this.foundColumn) {
          // We found but did not match the single column, skip to next row
          return ReturnCode.NEXT_ROW;
        }
        if (!CellUtil.matchingColumn(c, this.columnFamily, this.columnQualifier)) {
          return ReturnCode.INCLUDE;
        }
        foundColumn = true;
        if (filterColumnValue(c.getValueArray(), c.getValueOffset(), c.getValueLength())) {
          return this.latestVersionOnly? ReturnCode.NEXT_ROW: ReturnCode.INCLUDE;
        }
        this.matchedColumn = true;
        return ReturnCode.INCLUDE;
      }

      private boolean filterColumnValue(final byte [] data, final int offset,
          final int length) {
        int compareResult = this.comparator.compareTo(data, offset, length);
        switch (this.compareOp) {
        case LESS:
          return compareResult <= 0;
        case LESS_OR_EQUAL:
          return compareResult < 0;
        case EQUAL:
          return compareResult != 0;
        case NOT_EQUAL:
          return compareResult == 0;
        case GREATER_OR_EQUAL:
          return compareResult > 0;
        case GREATER:
          return compareResult >= 0;
        default:
          throw new RuntimeException("Unknown Compare op " + compareOp.name());
        }
      }

      public boolean filterRow() {
        // If column was found, return false if it was matched, true if it was not
        // If column not found, return true if we filter if missing, false if not
        return this.foundColumn? !this.matchedColumn: this.filterIfMissing;
      }

    }
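In HBase, filterKeyValue is called for every column (cell) of every row. SCVFilter handles each call as follows:
1. If the target column has already been found and its value satisfied the condition (matchedColumn is true), return INCLUDE immediately, meaning this cell of the row goes into the result set.
2. Otherwise, if latestVersionOnly is true (it controls whether only the latest version is checked, and is normally true) and the target column has already been found but its value did not satisfy the condition, return NEXT_ROW, which discards the rest of the row.
3. If the current column is not the target column, return INCLUDE. Otherwise set foundColumn to true, meaning the target column has been found.
4. If the current column's value does not satisfy the condition and latestVersionOnly is true, return NEXT_ROW, which skips the remaining columns of the current row and jumps to the next row.
5. If the current column's value satisfies the condition, set matchedColumn to true, meaning the target column was found and its value matched. When the next column of the same row enters this method, step 1 returns right away, which speeds up matching.
Now the filterRow method. It is called after filterKeyValue, and only once per row.
In SCVFilter its logic is simple:
1. If the target column was found: return false when its value satisfied the condition, meaning the row is added to the result set; return true when it did not, meaning the row is filtered out.
2. If the target column was not found, return the value of filterIfMissing.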
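The steps above can be traced with a small model. This is a sketch, not the HBase class: ScvTraceSketch and trace are hypothetical names, values are plain strings, the comparison is fixed to EQUAL, the target column is hard-coded to isDel with expected value "0", and latestVersionOnly is assumed to be true.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of SingleColumnValueFilter's per-row state machine.
// Not the HBase class; see the assumptions in the text above.
class ScvTraceSketch {
    enum Code { INCLUDE, NEXT_ROW }

    private final String targetColumn;
    private final String expectedValue;
    private boolean foundColumn = false;
    private boolean matchedColumn = false;
    private final boolean filterIfMissing = false;

    ScvTraceSketch(String targetColumn, String expectedValue) {
        this.targetColumn = targetColumn;
        this.expectedValue = expectedValue;
    }

    Code filterKeyValue(String column, String value) {
        if (matchedColumn) return Code.INCLUDE;                  // step 1
        if (foundColumn) return Code.NEXT_ROW;                   // step 2
        if (!column.equals(targetColumn)) return Code.INCLUDE;   // step 3
        foundColumn = true;
        if (!value.equals(expectedValue)) return Code.NEXT_ROW;  // step 4
        matchedColumn = true;                                    // step 5
        return Code.INCLUDE;
    }

    boolean filterRow() {
        return foundColumn ? !matchedColumn : filterIfMissing;
    }

    // Runs one row's cells ({column, value}, in stored order) through a fresh
    // filter and records each decision, stopping at the first NEXT_ROW.
    static List<String> trace(String[][] cells) {
        ScvTraceSketch filter = new ScvTraceSketch("isDel", "0");
        List<String> out = new ArrayList<>();
        for (String[] cell : cells) {
            Code code = filter.filterKeyValue(cell[0], cell[1]);
            out.add(cell[0] + "=" + code);
            if (code == Code.NEXT_ROW) {
                break;
            }
        }
        out.add("filterRow=" + filter.filterRow());
        return out;
    }
}
```

For row1 of the data above (content, isDel=1, name) the trace is content=INCLUDE, isDel=NEXT_ROW, filterRow=true. The content cell is accepted before the filter ever sees isDel, and the row is only rejected as a whole in filterRow.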
Could the cause be that PageFilter was added before SCVFilter? When the first row is evaluated, PageFilter's filterRow is called and its counter goes up by 1, but then SCVFilter's filterRow filters that row out. When the next row is checked, PageFilter's counter has already reached the pageSize we set, so every remaining row is filtered out and the result is empty.
To test this, add SCVFilter to the FilterList first and PageFilter second:
    Scan scan = initScan(xxx);
    FilterList filterList = new FilterList();
    scan.setFilter(filterList);
    filterList.addFilter(new SingleColumnValueFilter(FAMILY, ISDELETED,
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes(false)));
    filterList.addFilter(new PageFilter(1));
The result is row 2, as expected.
When combining PageFilter with other filters, it is best to add PageFilter at the end of the FilterList. Otherwise you may get fewer results than you expect.
(In fact, under normal conditions PageFilter can return more results than the configured page size, because the PageFilter instances on the different servers of a cluster are isolated from one another.)
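The parenthetical remark can be illustrated with a toy model. It is a sketch under assumptions, not HBase code: each "region" is just a list of row keys, every region applies the page limit with its own independent counter, and the client concatenates what the regions return. RegionPagingSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: the page limit is enforced separately inside each region,
// so the merged result can exceed pageSize unless the client caps it.
class RegionPagingSketch {

    // One region applies the limit on its own, starting from a fresh counter.
    static List<String> scanRegion(List<String> regionRows, int pageSize) {
        int end = Math.min(pageSize, regionRows.size());
        return new ArrayList<>(regionRows.subList(0, end));
    }

    // The client visits the regions in order and concatenates their results.
    static List<String> scanAllRegions(List<List<String>> regions, int pageSize) {
        List<String> merged = new ArrayList<>();
        for (List<String> region : regions) {
            merged.addAll(scanRegion(region, pageSize));
        }
        return merged;
    }

    // Client-side cap that restores the intended page size.
    static List<String> capOnClient(List<String> rows, int pageSize) {
        int end = Math.min(pageSize, rows.size());
        return new ArrayList<>(rows.subList(0, end));
    }
}
```

With two regions holding a1, a2, a3 and b1, b2 and a page size of 2, the merged scan yields four rows (a1, a2, b1, b2). Truncating on the client, or stopping the scan once enough rows have arrived, brings this back to two.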
In reality the investigation did not go that smoothly. The problem occurred in production, so to debug it locally I made up some test data. To my surprise, even with PageFilter added first and SCVFilter second, the result came back as expected.
The test data:
    row1    column=f:isDel, timestamp=1513953705613, value=1
    row1    column=f:name, timestamp=1513953725029, value=name1
    row2    column=f:isDel, timestamp=1513953744613, value=0
    row2    column=f:name, timestamp=1513953730348, value=name2
    row3    column=f:isDel, timestamp=1513953751332, value=0
    row3    column=f:name, timestamp=1513953734698, value=name3
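For a while I simply could not reproduce the problem locally, which was frustrating. In the end it turned out that the result of a query using SCVFilter also depends on the order of the columns in the data.
On the server side, HBase wraps the filter passed in by the client in a FilterWrapper.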
    class RegionScannerImpl implements RegionScanner {
        RegionScannerImpl(Scan scan, List<KeyValueScanner> additionalScanners, HRegion region)
            throws IOException {
          this.region = region;
          this.maxResultSize = scan.getMaxResultSize();
          if (scan.hasFilter()) {
            this.filter = new FilterWrapper(scan.getFilter());
          } else {
            this.filter = null;
          }
        }
        ....
    }
When data is queried, the nextInternal method in HRegion calls FilterWrapper's filterRowCellsWithRet method.
The relevant FilterWrapper code is:
    /**
     * This is a Filter wrapper class which is used in the server side. Some filter
     * related hooks can be defined in this wrapper. The only way to create a
     * FilterWrapper instance is passing a client side Filter instance through
     * {@link org.apache.hadoop.hbase.client.Scan#getFilter()}.
     *
     */
    final public class FilterWrapper extends Filter {
      Filter filter = null;

      public FilterWrapper( Filter filter ) {
        if (null == filter) {
          // ensure the filter instance is not null
          throw new NullPointerException("Cannot create FilterWrapper with null Filter");
        }
        this.filter = filter;
      }

      public enum FilterRowRetCode {
        NOT_CALLED,
        INCLUDE,     // corresponds to filter.filterRow() returning false
        EXCLUDE      // corresponds to filter.filterRow() returning true
      }

      public FilterRowRetCode filterRowCellsWithRet(List<Cell> kvs) throws IOException {
        this.filter.filterRowCells(kvs);
        if (!kvs.isEmpty()) {
          if (this.filter.filterRow()) {
            kvs.clear();
            return FilterRowRetCode.EXCLUDE;
          }
          return FilterRowRetCode.INCLUDE;
        }
        return FilterRowRetCode.NOT_CALLED;
      }

    }
Here kvs holds the columns of one row that were not filtered out by filterKeyValue.
As you can see, when kvs is not empty, filterRowCellsWithRet calls the filterRow method of the wrapped filter. As explained above, PageFilter's counter is incremented in exactly that filterRow method.
When kvs is empty, PageFilter's counter is not incremented. Now look at the test data again: the first column of each row is isDeleted, the target column of the SCVFilter. Recall from the SCVFilter walkthrough that when the target column's value does not satisfy the condition, the remaining columns of that row are all skipped.
So for the first row of the test data, kvs is empty by the time filterRowCellsWithRet runs. PageFilter's counter is not incremented, the scan continues with the remaining rows, and the returned result looks correct.
In the data that triggered the problem, the content column comes before isDeleted. When a row's isDeleted value does not satisfy the condition, kvs is still not empty, because the content cell has already been added to it (that data is only filtered out later, when SCVFilter's filterRow is called).
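Both effects, filter order and column order, can be reproduced with a small self-contained model. It is a sketch under stated assumptions, not HBase code. All class and method names are hypothetical, values are plain strings, and the model assumes that a MUST_PASS_ALL FilterList calls filterRow on its filters in insertion order and stops at the first filter that rejects the row. As in the FilterWrapper code above, filterRow is skipped entirely when kvs is empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the server-side flow described in this post.
// These are NOT the HBase classes, only enough logic to show both effects.
class FilterOrderSketch {
    enum Code { INCLUDE, NEXT_ROW }

    interface RowFilter {
        void reset();                                   // called before each row
        boolean filterAllRemaining();                   // true = stop the scan
        Code filterKeyValue(String column, String value);
        boolean filterRow();                            // true = drop the row
    }

    static class PageModel implements RowFilter {
        final long pageSize;
        int rowsAccepted = 0;
        PageModel(long pageSize) { this.pageSize = pageSize; }
        public void reset() { }
        public boolean filterAllRemaining() { return rowsAccepted >= pageSize; }
        public Code filterKeyValue(String column, String value) { return Code.INCLUDE; }
        public boolean filterRow() { rowsAccepted++; return rowsAccepted > pageSize; }
    }

    static class ScvModel implements RowFilter {
        final String target, expected;
        boolean found, matched;
        ScvModel(String target, String expected) { this.target = target; this.expected = expected; }
        public void reset() { found = false; matched = false; }
        public boolean filterAllRemaining() { return false; }
        public Code filterKeyValue(String column, String value) {
            if (matched) return Code.INCLUDE;
            if (found) return Code.NEXT_ROW;
            if (!column.equals(target)) return Code.INCLUDE;
            found = true;
            if (!value.equals(expected)) return Code.NEXT_ROW;
            matched = true;
            return Code.INCLUDE;
        }
        public boolean filterRow() { return found && !matched; } // filterIfMissing = false
    }

    static List<String> scan(Map<String, String[][]> table, List<RowFilter> filters) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String[][]> row : table.entrySet()) {
            for (RowFilter f : filters) if (f.filterAllRemaining()) return result;
            for (RowFilter f : filters) f.reset();
            List<String> kvs = new ArrayList<>();
            cells:
            for (String[] cell : row.getValue()) {
                for (RowFilter f : filters) {
                    if (f.filterKeyValue(cell[0], cell[1]) == Code.NEXT_ROW) break cells;
                }
                kvs.add(cell[0]);
            }
            if (kvs.isEmpty()) continue;          // filterRow() is never called for this row
            boolean dropped = false;
            for (RowFilter f : filters) {
                if (f.filterRow()) { dropped = true; break; }  // later filters are skipped
            }
            if (!dropped) result.add(row.getKey());
        }
        return result;
    }

    // Three rows with isDel = 1, 0, 0; optionally with a content column in front.
    static Map<String, String[][]> table(boolean withContentColumn) {
        Map<String, String[][]> t = new LinkedHashMap<>();
        String[] isDel = {"1", "0", "0"};
        for (int i = 1; i <= 3; i++) {
            List<String[]> cells = new ArrayList<>();
            if (withContentColumn) cells.add(new String[] {"content", "content" + i});
            cells.add(new String[] {"isDel", isDel[i - 1]});
            cells.add(new String[] {"name", "name" + i});
            t.put("row" + i, cells.toArray(new String[0][]));
        }
        return t;
    }

    static List<String> run(boolean pageFilterFirst, boolean withContentColumn) {
        RowFilter page = new PageModel(1);
        RowFilter scv = new ScvModel("isDel", "0");
        List<RowFilter> filters = new ArrayList<>();
        if (pageFilterFirst) { filters.add(page); filters.add(scv); }
        else { filters.add(scv); filters.add(page); }
        return scan(table(withContentColumn), filters);
    }
}
```

In this model, run(true, true) corresponds to the production case (PageFilter first, content column present) and returns an empty list. run(false, true) swaps the filter order and returns row2. run(true, false) is the local test data without the content column, and it also returns row2, which is why the problem could not be reproduced locally.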
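Judging from the implementation, HBase's Filter mechanism is fairly crude. Its efficiency is not impressive either: leaving aside network transfer and client memory usage, it is basically on par with filtering on the client.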