HBase: Basic Principles and Usage

Published: 2021-09-17 15:25:45 | Source: Yisu Cloud | Author: chen | Category: Big Data

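This article introduces the basic principles and usage of HBase. Many people run into questions on this topic in day-to-day work, so the material below collects a simple, practical set of explanations and procedures: the architecture, the storage model, the read and write paths, deployment, the shell, the Java API, and MapReduce integration.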

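1. HBase Overview

1.1 Introduction to HBase

HBase is a distributed, column-oriented, open-source database that is well suited to storing unstructured data. Unlike a typical relational database, it organizes data by column (family) rather than by row. Its main characteristics are:
Large: a table can hold hundreds of millions of rows and millions of columns
Column-oriented: storage and access control are organized by column family, and each column family can be retrieved independently
Sparse: columns that are empty (null) take up no storage space, so tables can be designed to be very sparse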

1.2 HBase architecture

Figure 1.1 HBase architecture diagram

As the figure shows, HBase uses HDFS as its storage layer. HBase itself sits on top of HDFS and has three main components: HMaster, HRegionServer, and ZooKeeper. The role of each component is described below.

1.2.1 HMaster

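HMaster does not store any actual data; it only manages the cluster. A cluster normally has a single active HMaster, plus a standby if high availability is configured. Its responsibilities are:
1) Monitor the RegionServers to check that they are working normally, and keep the corresponding state in ZooKeeper up to date
2) Handle RegionServer failover by reassigning the failed server's regions to other RegionServers
3) Handle metadata changes
4) Handle the assignment and removal of regions
5) Balance the data load during idle time
6) Publish its own location to clients through ZooKeeper
7) Assign the new regions after a region split
8) Manage users' create, delete, alter, and query operations on tables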

1.2.2 HRegionServer

A cluster has multiple HRegionServers, each managing a number of regions. Each region contains several stores, and each store contains one MemStore and one or more StoreFiles (once data has been flushed), each of which is backed by an HFile. The MemStore buffers writes in memory; when its size reaches a threshold, it is flushed to a StoreFile, which ultimately lives on HDFS.
Because MemStore data is not written to StoreFiles synchronously but only when certain conditions are met, a RegionServer crash would lose whatever is still in memory. Writing every update synchronously to HDFS, on the other hand, would trigger I/O constantly and perform poorly. This is why the HLog exists: each HRegionServer has a single HLog that records every update and is written to HDFS in real time, so nothing is lost. The mechanism is similar to MySQL's binary log.
The responsibilities of an HRegionServer are:
1) Store the actual HBase data
2) Serve the regions assigned to it
3) Flush the in-memory cache to HDFS
4) Maintain the HLog
5) Run compactions
6) Handle region splits

This raises plenty of questions, for example: what exactly is a region? The following sections explain in detail.

1.3 HBase data storage model

1.3.1 The HBase data model

In an ordinary RDBMS, a table is stored as follows. Take a student table with the columns id, name, sex, and pwd:

id	name	sex	pwd
1	Zhang San	male	123
2	Li Si	female	456

In HBase, the logical structure of the same table is roughly:

row-key	column family 1	column family 2
1	info: {name: "Zhang San", sex: "male"}	password: {pwd: 123}
2	info: {name: "Li Si", sex: "female"}	password: {pwd: 456}

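If this looks confusing, take it step by step, starting with some terminology.
rowkey:

This is the key of a row. Its design deserves care: it is not necessarily the id field of the original student table (in production a single field like that is rarely used on its own).

column family:

    This is a new concept, abbreviated CF: a collection of columns. A table can have several CFs. When creating a table you must name its CFs but not its columns; a CF can contain any number of columns, and which columns exist, and under which CF, is decided only when data is inserted. In other words, a column family can grow dynamically: neither the number nor the types of its columns need to be defined in advance. All column values are stored as raw bytes, so the application has to convert types itself.
    In the example above, info and password are CFs. The info CF contains two columns, name and sex, each with a value; the password CF contains only the pwd column. In HBase a column is also called a qualifier.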
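Because every value is stored as raw bytes, the writer and the reader must agree on an encoding. The sketch below illustrates that point with the Java standard library only; the class and method names are hypothetical, and the real HBase client ships its own utility class for this purpose (org.apache.hadoop.hbase.util.Bytes), which is not used here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing how typed values become the raw bytes a cell stores. */
final class CellBytes {
    /** Encodes text as UTF-8 bytes. */
    static byte[] fromString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes back into text. */
    static String toText(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    /** Encodes a long as 8 big-endian bytes. */
    static byte[] fromLong(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Decodes 8 big-endian bytes back into a long. */
    static long toLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        byte[] name = fromString("Thomas");
        byte[] pwd = fromLong(123L);
        // The store only sees byte arrays; the reader must know which decoder to apply.
        System.out.println(toText(name) + " / " + toLong(pwd) + " / " + pwd.length + " bytes");
    }
}
```

A value written with fromLong has to be read with toLong; decoding it as text would produce unreadable characters, which is why the application has to track the type of each column itself.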

As noted, the table above is only the logical structure. How does HBase actually store the table? This is where its column-oriented nature shows.
The actual storage layout looks like this:

rowkey	cf:column	cell value	timestamp
1	info:name	Zhang San	1564393837300
1	info:sex	male	1564393810196
1	password:pwd	123	1564393788068
2	info:name	Li Si	1564393837300
2	info:sex	female	1564393810196
2	password:pwd	456	1564393788068

As you can see, HBase stores each cell as a separate entry. If a column of a CF has no value, no entry is stored for it at all and it takes no space. This is the typical layout of HBase as a column-oriented store.
Locating a cell's value takes four coordinates: rowkey + cf + column + timestamp. The first three are intuitive, so why the timestamp? Because a cell in HBase can hold multiple versions of its value, and rowkey + cf + column alone cannot tell those versions apart. The timestamp, which records when the value was written, makes each version unique. By default only one version is kept (strictly speaking the version count is a property of the CF, and its columns inherit it), and even when several versions exist only the newest is returned by default. The number of versions to keep is configurable; when more values are written than the configured number, the oldest versions are discarded first.
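The versioning rule can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not HBase code: the class VersionedCell and its methods are hypothetical, and the oldest version is evicted immediately on write, whereas a real store may only drop surplus versions later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

/** Toy model of one cell (rowkey + cf + column) holding several timestamped versions. */
final class VersionedCell {
    private final int maxVersions;
    // Sorted with the newest timestamp first, since the newest version is what a read returns by default.
    private final TreeMap<Long, String> versions = new TreeMap<>(Collections.reverseOrder());

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Stores a value under a timestamp and drops the oldest versions beyond the limit. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry = smallest (oldest) timestamp
        }
    }

    /** Returns the newest value, or null if nothing was written. */
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** Returns up to n values, newest first. */
    List<String> get(int n) {
        List<String> out = new ArrayList<>();
        for (String v : versions.values()) {
            if (out.size() == n) {
                break;
            }
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(2);
        cell.put(100L, "a");
        cell.put(200L, "b");
        cell.put(300L, "c"); // exceeds two versions, so the value written at 100 is dropped
        System.out.println(cell.latest());
        System.out.println(cell.get(5));
    }
}
```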

Is that the whole physical storage picture? Not quite; read on.

1.3.2 HBase data storage internals

(1) Region

As described, HRegionServers are responsible for storing data. Each HRegionServer manages a set of HRegion objects, and each HRegion corresponds to one region of a table (from here on, HRegion and region mean the same thing). A table has at least one region, and a region is defined by the range [startkey, endkey); note that the start is inclusive and the end is exclusive. Typically the table is pre-split into several partitions based on the characteristics of the data, with one region per partition. Which RegionServer hosts which region is decided by HMaster. An HRegion consists of multiple HStores, each of which stores one column family of the table. A column family is therefore a unit of physical storage, so it is most efficient to group columns with similar I/O characteristics into the same column family.
Each HStore in turn consists of two parts: one MemStore and one or more StoreFiles. Both are described next.
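The half-open range [startkey, endkey) means a rowkey belongs to the region with the greatest start key that is less than or equal to it. A minimal sketch of that lookup follows, assuming string rowkeys and hypothetical region names; HBase itself compares raw bytes, which agrees with String ordering only for plain ASCII keys.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy region lookup: each region covers [its start key, the next region's start key). */
final class ToyRegionLocator {
    // Maps each region's start key to a hypothetical region name; "" starts the first region.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    /** Returns the region whose start key is the greatest one not exceeding rowKey. */
    String locate(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        ToyRegionLocator locator = new ToyRegionLocator();
        locator.addRegion("", "region-1");     // covers ["", "1002")
        locator.addRegion("1002", "region-2"); // covers ["1002", "2000")
        locator.addRegion("2000", "region-3"); // covers ["2000", end of table)
        System.out.println(locator.locate("1001"));
        System.out.println(locator.locate("1002")); // a start key is inclusive
        System.out.println(locator.locate("1999"));
    }
}
```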

(2) MemStore & StoreFiles

The HStore is the core of HBase storage and consists of two parts: the MemStore and the StoreFiles. The MemStore is a sorted memory buffer. Data written by a client first goes into the MemStore; when the MemStore is full it is flushed into a StoreFile (implemented as an HFile). When the number of StoreFiles grows past a threshold, a compaction is triggered that merges several StoreFiles into one, merging versions and removing deleted data along the way. HBase therefore only ever appends data; updates and deletes are actually applied later, during compaction. This lets a write return as soon as it has reached memory, which keeps HBase I/O fast. As compactions produce larger and larger StoreFiles, a single StoreFile eventually exceeds a size threshold, which triggers a split: the current region is split into two, the parent region goes offline, and HMaster assigns the two child regions to RegionServers, so that the load on the original region is spread over two regions.
Consider what compaction accomplishes:

A StoreFile holds data flushed from the MemStore. Suppose a cell's value is initially 1, and the MemStore fills up and is flushed to a StoreFile. The cell's value is then changed to 2, and the MemStore fills up and is flushed again. The StoreFiles now contain more than one value for that cell (this is not the same thing as the cell's configured versions). From the client's point of view the data was modified, but for the underlying StoreFiles it was just another append, because appending is much cheaper than modifying in place. The drawback is that several copies of the same cell take up storage space, so this is a space-for-time trade-off. When the number of StoreFiles reaches a threshold they are merged, and the redundant data is removed (only the most recent value is kept, within the configured number of versions, and older ones are dropped), which frees storage and leaves the current data. The merge is where updates and deletes are actually carried out.
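The merge step can be sketched as follows. This is an illustration only, under stated assumptions: each cell keeps a single version, delete markers and TTLs are ignored, and the names ToyCompaction and KeyValue are hypothetical rather than HBase classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy compaction: merge several store files and keep only the newest entry per cell. */
final class ToyCompaction {
    /** One stored entry: cell coordinates, write timestamp, and value. */
    static final class KeyValue {
        final String cellKey; // "rowkey/cf:column"
        final long timestamp;
        final String value;

        KeyValue(String cellKey, long timestamp, String value) {
            this.cellKey = cellKey;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Returns one merged file, sorted by cell key, holding the newest entry of each cell. */
    static List<KeyValue> compact(List<List<KeyValue>> storeFiles) {
        Map<String, KeyValue> newest = new TreeMap<>();
        for (List<KeyValue> file : storeFiles) {
            for (KeyValue kv : file) {
                KeyValue current = newest.get(kv.cellKey);
                if (current == null || kv.timestamp > current.timestamp) {
                    newest.put(kv.cellKey, kv);
                }
            }
        }
        return new ArrayList<>(newest.values());
    }

    public static void main(String[] args) {
        List<KeyValue> firstFlush = new ArrayList<>();
        firstFlush.add(new KeyValue("1/info:name", 100L, "1"));
        List<KeyValue> secondFlush = new ArrayList<>();
        secondFlush.add(new KeyValue("1/info:name", 200L, "2")); // the later "update" is just another append
        List<List<KeyValue>> files = new ArrayList<>();
        files.add(firstFlush);
        files.add(secondFlush);
        for (KeyValue kv : compact(files)) {
            System.out.println(kv.cellKey + " = " + kv.value + " @ " + kv.timestamp);
        }
    }
}
```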

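As noted, MemStore data is flushed to a StoreFile only once it reaches a threshold. If the RegionServer crashes before the flush, the in-memory data is lost. What then? This is what the HLog is for.

(3) HLog

Each HRegionServer has exactly one HLog object, shared by all its regions. HLog is a class that implements a write-ahead log (WAL): every write operation is first recorded in the HLog, which is synced to disk in real time, so the log itself survives a crash. Only after the HLog reports that the record is written does the data go into the MemStore. This guarantees that every in-memory change is also recorded in the HLog. HLog files are rolled periodically, and old files whose data has already been persisted to StoreFiles are deleted.
The HLog plays a crucial role in recovering data after a RegionServer failure. When an HRegionServer terminates unexpectedly, HMaster learns of it through ZooKeeper. HMaster first processes the leftover HLog files, splitting the log data by region and placing each part under the corresponding region's directory, and then reassigns the failed regions. A RegionServer that picks up one of these regions notices, while loading it, that there is HLog history to process; it replays the HLog data into the MemStore and then flushes to StoreFiles, completing the recovery.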
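The log-first ordering and the replay on recovery can be sketched in a few lines. This is a toy model, not the HLog implementation: the durable log is represented by an in-memory list that is assumed to survive the crash, and the class ToyWalStore is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy write-ahead log: record every write in the log first, then apply it in memory. */
final class ToyWalStore {
    private final List<String[]> wal;                               // stands in for the durable log
    private final Map<String, String> memStore = new HashMap<>();   // lost if the process dies

    ToyWalStore(List<String[]> wal) {
        this.wal = wal;
    }

    void put(String key, String value) {
        wal.add(new String[] {key, value}); // 1. append to the log
        memStore.put(key, value);           // 2. only then update the in-memory buffer
    }

    String get(String key) {
        return memStore.get(key);
    }

    /** Rebuilds the in-memory state by replaying the log records in order. */
    static ToyWalStore recover(List<String[]> survivingWal) {
        ToyWalStore store = new ToyWalStore(new ArrayList<>(survivingWal));
        for (String[] record : survivingWal) {
            store.memStore.put(record[0], record[1]);
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        ToyWalStore server = new ToyWalStore(log);
        server.put("1001/info:name", "Thomas");
        // Simulate a crash: the server object is discarded, only the log is left.
        ToyWalStore recovered = ToyWalStore.recover(log);
        System.out.println(recovered.get("1001/info:name"));
    }
}
```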

1.3.3 HBase physical storage files

Every table has at least one region, and within each region each CF corresponds to one store with its own StoreFiles. Different CFs are therefore stored in separate physical files. For the student table from the data model section, the physical layout actually looks like this:

Format: rowkey:cf:column:value:timestamp
hfile for cf--info:
1:info:name:Zhang San:1564393837300
1:info:sex:male:1564393837300
2:info:name:Li Si:1564393837302
2:info:sex:female:1564393837302

hfile for cf--password:
1:password:pwd:123:1564393837300
2:password:pwd:456:1564393837300

To read one row, HBase looks through the HFiles of the region that owns the row and collects the entries with that rowkey. HFiles are stored on HDFS directly in binary form, which makes access fairly fast.
HLog files, for their part, are stored as Sequence Files.

1.4 HBase read and write paths

HBase has had two special tables, -ROOT- and .META.: the former recorded the region information of the meta table, and the latter records the region information of user tables. -ROOT- has since been removed as redundant, so the meta table is used directly. The meta table is the entry point to the whole cluster: reads and writes both start by consulting it.

1.4.1 Read path

1) The .META. table and its data are hosted on an HRegionServer. To read table data, the client first contacts ZooKeeper to find out where .META. is located, that is, which HRegionServer hosts it.
2) Using the address it just obtained, the client contacts that HRegionServer, reads .META., and obtains the metadata stored in it.
3) Using that metadata, the client contacts the HRegionServer that hosts the target region, which scans its MemStore and StoreFiles for the requested data.
4) Finally, the HRegionServer returns the data to the client.

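1.4.2 Write path

1) As with reads, the client first contacts ZooKeeper, finds the RegionServer hosting .META., and reads the .META. information.
2) From that information the client determines which RegionServer and region the data belongs to. (HMaster decides which RegionServer hosts each region, but it does not take part in individual write requests.)
3) The client sends the write request to that RegionServer, which receives and handles it.
4) The RegionServer first writes the data to the HLog to guard against data loss.
5) It then writes the data to the MemStore.
6) Once both the HLog and the MemStore writes succeed, the write is successful. If the MemStore reaches its threshold along the way, its data is flushed to a StoreFile.
7) As StoreFiles accumulate, a compaction is triggered that merges them into one large StoreFile. As StoreFiles grow, the region grows too, and when it reaches a threshold a split is triggered that divides the region in two.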
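The interplay of MemStore, flush, and reads across MemStore and StoreFiles can be sketched as below. This is a simplified illustration with hypothetical names: the flush threshold counts entries rather than bytes, the log is left out, and a read simply prefers the MemStore and then newer files, whereas HBase merges by timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy store: writes go to a sorted in-memory buffer that is flushed to immutable files. */
final class ToyStore {
    private final int flushThreshold; // assumption: number of entries, not bytes
    private TreeMap<String, String> memStore = new TreeMap<>();
    private final List<TreeMap<String, String>> storeFiles = new ArrayList<>();

    ToyStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        memStore.put(key, value);
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    /** Turns the current buffer into a new store file and starts an empty buffer. */
    private void flush() {
        storeFiles.add(memStore);
        memStore = new TreeMap<>();
    }

    /** Reads check the MemStore first, then the store files from newest to oldest. */
    String get(String key) {
        String value = memStore.get(key);
        if (value != null) {
            return value;
        }
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            value = storeFiles.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int storeFileCount() {
        return storeFiles.size();
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore(2);
        store.put("a", "1");
        store.put("b", "2"); // the buffer reaches the threshold and is flushed
        store.put("a", "3"); // the newer value stays in memory and shadows the flushed one
        System.out.println(store.get("a") + " " + store.get("b") + " files=" + store.storeFileCount());
    }
}
```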

2. HBase Deployment

2.1 Environment

Software	Version	Hosts (192.168.50.x/24)
zookeeper (already deployed)	3.4.10	bigdata121 (50.121), bigdata122 (50.122), bigdata123 (50.123)
hadoop (already deployed)	2.8.4	bigdata121 (50.121, NameNode), bigdata122 (50.122), bigdata123 (50.123)
hbase	1.3.1	bigdata121 (50.121, HMaster), bigdata122 (50.122), bigdata123 (50.123)

2.2 Deploying HBase

On bigdata121:

Extract hbase-1.3.1-bin.tar.gz

tar zxf hbase-1.3.1-bin.tar.gz -C /opt/modules/

Edit /opt/modules/hbase-1.3.1/conf/hbase-env.sh

export JAVA_HOME=/opt/modules/jdk1.8.0_144
# Disable the ZooKeeper bundled with HBase and use the separately installed one
export HBASE_MANAGES_ZK=false

Edit /opt/modules/hbase-1.3.1/conf/hbase-site.xml

<configuration>
  <!-- Directory in HDFS where HBase stores its data -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://bigdata121:9000/hbase</value>
  </property>

  <!-- Whether to run in distributed (cluster) mode -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <!-- HMaster port -->
  <property>
    <name>hbase.master.port</name>
    <value>16000</value>
  </property>

  <!-- Servers of the ZooKeeper ensemble -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>bigdata121:2181,bigdata122:2181,bigdata123:2181</value>
  </property>

  <!-- ZooKeeper data directory -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/modules/zookeeper-3.4.10/zkData</value>
  </property>

  <!-- Maximum clock skew allowed between HMaster and a RegionServer, in milliseconds -->
  <property>
    <name>hbase.master.maxclockskew</name>
    <value>180000</value>
  </property>
</configuration>

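Configure the environment variables:

vim /etc/profile.d/hbase.sh
#!/bin/bash
export HBASE_HOME=/opt/modules/hbase-1.3.1
export PATH=$PATH:${HBASE_HOME}/bin

Then run:
source /etc/profile.d/hbase.sh

When the configuration is complete, scp the whole hbase directory to /opt/modules on bigdata122 and bigdata123.
Don't forget to configure the environment variables there as well.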

2.3 Starting the HBase cluster

Start/stop the whole cluster:

start-hbase.sh
stop-hbase.sh

Note:
If the HMaster has died, stop-hbase.sh can no longer shut down the cluster; in that case stop the remaining RegionServers manually.

Start or stop an individual master or RegionServer daemon:

hbase-daemon.sh start/stop master
hbase-daemon.sh start/stop regionserver

After a successful start, the HBase web UI is available at <HMaster IP>:16010.

2.4 ZooKeeper znodes used by HBase

Once the cluster is started, HBase creates a /hbase znode and, under it, several child nodes that hold cluster state:

[zk: localhost:2181(CONNECTED) 1] ls /hbase
[replication, meta-region-server, rs, splitWAL, backup-masters, table-lock, flush-table-proc, region-in-transition, online-snapshot, switch, master, running, recovering-regions, draining, namespace, hbaseid, table]

Among them:
rs
Holds one child node per RegionServer, named in the form "hostname,port,id":
[zk: localhost:2181(CONNECTED) 6] ls /hbase/rs
[bigdata122,16020,1564564185736, bigdata121,16020,1564564193102, bigdata123,16020,1564564178848]
Each child node here represents a RegionServer that is currently running. When a RegionServer goes offline its node is deleted, because these are ephemeral nodes; see the ZooKeeper articles in this series for the properties of ephemeral nodes.

meta-region-server
The value of this node identifies the RegionServer that hosts the meta table

master
Host information of the current HMaster

backup-masters
Information about standby masters; empty if none are configured

namespace
Each child node corresponds to a namespace, which is comparable to a database in an RDBMS

hbaseid
The value is the unique ID of the HBase cluster

table
Each child node corresponds to a table

2.5 Managing RegionServer nodes

2.5.1 Adding a node

A new node can be started with:

hbase-daemon.sh start regionserver

Initially the new node holds no data. If the balancer (balance_switch) is on, HMaster moves regions from other nodes onto the new node; this is what is meant by data balancing.
After the node has started, check the balancer state in the hbase shell:

balancer_enabled   # returns the balancer's current state (reported here as false, i.e. off, by default)

Turn the balancer on or off:

balance_switch true/false

A small pitfall:

The command form balance_switch status looks, judging by its name, as though it queries the current state of the balancer. It does not. After repeated experiments, the conclusions are:
Whatever state the balancer was in, running this command sets it to false, i.e. off.
The value the command returns is the balancer's previous state, not its current state.
That combination is what makes this command a trap.

So never run it casually: it simply turns the balancer off.

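2.5.2 Taking a node offline

The usual steps to take a node offline are as follows.
First stop the balancer:

balance_switch false

Then stop the RegionServer on the node:

hbase-daemon.sh stop regionserver

Once the node is stopped, all the regions it hosted become inaccessible and are in a maintenance state. Its ephemeral node under /hbase/rs/ in ZooKeeper disappears (a property of ZooKeeper ephemeral nodes; see the earlier ZooKeeper articles in this series). When the master notices the change in ZooKeeper, it detects that the RegionServer is offline, automatically turns the balancer on, and migrates the regions of the offline server to other servers.

The main drawback of this approach is that the regions become unavailable as soon as the server stops. Moreover, because the data lives in HDFS plus the HLog, migrating a region means reading its data from HDFS and replaying the operations in the HLog before the region is complete again. Both steps are slow, so the regions stay inaccessible for a long time. HBase therefore provides another way to take a node offline more smoothly.

In HBase's bin directory, run:

graceful_stop.sh <RegionServer-hostname>

This command first turns off the balancer, then reassigns the regions directly, and shuts down the server only after all regions have been moved. This makes full use of the region data in memory, reduces the amount read from HDFS, and avoids replaying the HLog, so it is much faster and the regions are unavailable for a much shorter time.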

3. Using HBase

Enter the shell:

hbase shell

Show the command help:

hbase(main)> help

3.1 Basic namespace commands

List tables (a table without a namespace prefix belongs to the default namespace)

hbase(main)> list

List namespaces; a namespace is comparable to a database in an RDBMS

hbase(main)> list_namespace

List the tables in a given namespace

hbase(main)> list_namespace_tables 'namespace_name'

Create a namespace

hbase(main)> create_namespace 'namespace'

Show information about a namespace:

hbase(main)> describe_namespace 'namespace'

3.2 Basic table commands

Create a table

create 'namespace:table','CF1','CF2','CFX',{para1=>value,para2=>value}
If no namespace is given, the table is created in the default namespace.
Examples:
hbase(main)> create 'student','info'   # create table student with column family info
hbase(main)> create 'student','info',{VERSIONS=>3}   # create table student with column family info, keeping 3 versions

To give different column families different attributes, create the table like this:
create 'teacher_2',{NAME=>'info',VERSIONS=>3},{NAME=>'password',VERSIONS=>2}
This creates table teacher_2 with column families info and password, keeping 3 and 2 versions respectively.

Insert data (updating data uses exactly the same command)

put 'namespace:table','rowkey','cf:column','value',[timestamp]
If [timestamp] is omitted, the current time is used. Each put writes a single cell.

Example:
hbase(main)> put 'student','1001','info:name','Thomas'

View table data

scan 'namespace:table',{param1=>value}

Examples:
Scan the whole table: scan 'student'
Scan specific columns: scan 'student',{COLUMNS=>['info:name','info:sex']}
Limit the number of rows returned: scan 'student',{LIMIT=>1}  (in practice n+1 rows may be returned)
Return rows in a rowkey range: scan 'student',{STARTROW => '1001', STOPROW  => '1002'}; STARTROW and STOPROW can also be used on their own
Return data in a timestamp range: scan 'student', {TIMERANGE => [1303668804, 1303668904]}

View the table structure

desc 'namespace:table'

Example:
desc 'student'

The output looks like this:
Table student is ENABLED
student
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
It shows the table's column family settings.

View a specific row or column (scan can do this as well)

get 'namespace:table','rowkey','cf:column',{para1=>value, ...}

Examples:
get 'student','1001','info:name'
get 'student','1001','info:name',{VERSIONS=>2}   # show the two most recent versions

Note: get can only query a single row (the data of one rowkey).

Delete data

delete 'namespace:table','rowkey','cf:column'
Deletes the data of the specified column

deleteall 'namespace:table','rowkey'
Deletes all data of the given rowkey

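Disable / enable / check table state

Check whether a table is enabled: is_enabled 'namespace:table'
Enable a table: enable 'namespace:table'
Disable a table: disable 'namespace:table'   (a disabled table cannot be read or written)

Truncate a table

The table has to be disabled before its data is cleared; the shell's truncate command takes care of disabling it
truncate 'namespace:table'

Drop a table

Make sure the table is disabled first; an enabled table cannot be dropped
disable 'namespace:table'
drop 'namespace:table'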

Count rows

count 'namespace:table'

Alter a table

alter 'namespace:table',{param1=>value, ...}
Examples:
alter 'student',{NAME=>'info',VERSIONS=>5}     # set the version count of column family info to 5
alter 'student',{NAME=>'address_info'}     # add column family address_info
alter 'student',{NAME=>'address_info',METHOD=>'delete'}     # delete column family address_info (alter operates on column families, not on individual columns such as info:name)

Check whether a table exists

exists 'namespace:table'

Show the state of the nodes in the cluster

status
The output looks like this:
1 active master, 1 backup masters, 3 servers, 0 dead, 17.0000 average load
That is: the active master, the backup masters, the number of live and dead RegionServers, and the average load.

3.3 Using the HBase Java API

Create a new Maven project and add the following dependencies to pom.xml:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.3.1</version>
</dependency>

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.3.1</version>
</dependency>

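3.3.1 Check whether a table exists

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;

public class HbaseTest01 {
    public static Configuration conf;
    static {
        // Build the configuration with HBaseConfiguration's factory method, then set
        // the ZooKeeper quorum, client port, and parent znode
        conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "bigdata121");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("zookeeper.znode.parent", "/hbase");
    }

    public static boolean isTableExist(String tableName) throws IOException {
        // Create a connection from conf and get an Admin object, which is used to manage tables;
        // try-with-resources closes both once the check is done
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            return admin.tableExists(TableName.valueOf(tableName));
        }
    }
}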

3.3.2 Create a table

    public static void createTable(String tableName, String... columnFamily) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        HBaseAdmin admin = (HBaseAdmin)connection.getAdmin();
        // Create the table descriptor
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf(tableName));
        for (String cf: columnFamily) {
            // Create a column family descriptor for each cf and add it to the table descriptor
            hTableDescriptor.addFamily(new HColumnDescriptor(cf));
        }
        // Create the table
        admin.createTable(hTableDescriptor);
        // Release the admin handle and the connection
        admin.close();
        connection.close();
    }

Note: if no namespace is given when creating a table, it is placed in the default namespace. To choose a namespace, name the table in the form "namespace:tableName", with a colon as the separator.
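The naming rule can be illustrated with a small parser. This is only a sketch of the convention using plain Java; the class QualifiedTableName is hypothetical, and in real client code the HBase class TableName (used in the snippets of this section via TableName.valueOf) handles the parsing.

```java
/** Hypothetical illustration of the "namespace:tableName" convention. */
final class QualifiedTableName {
    final String namespace;
    final String table;

    private QualifiedTableName(String namespace, String table) {
        this.namespace = namespace;
        this.table = table;
    }

    /** Splits at the first colon; a name without a colon falls into the "default" namespace. */
    static QualifiedTableName parse(String name) {
        int idx = name.indexOf(':');
        if (idx < 0) {
            return new QualifiedTableName("default", name);
        }
        return new QualifiedTableName(name.substring(0, idx), name.substring(idx + 1));
    }

    public static void main(String[] args) {
        QualifiedTableName a = parse("student");
        QualifiedTableName b = parse("school:student");
        System.out.println(a.namespace + " / " + a.table);
        System.out.println(b.namespace + " / " + b.table);
    }
}
```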

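3.3.3 Delete a table

    public static void deleteTable(String tableName) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        HBaseAdmin admin = (HBaseAdmin)connection.getAdmin();
        // A table must be disabled before it can be deleted
        admin.disableTable(tableName);
        // Delete the table
        admin.deleteTable(tableName);
        // Release the admin handle and the connection
        admin.close();
        connection.close();
    }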

3.3.4 Insert data

    public static void putData(String tableName, String rowKey, String columnFamily, String column, String value) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        // Get a Table handle from the connection
        Table table = connection.getTable(TableName.valueOf(tableName));
        // Create a Put for the row
        Put put = new Put(rowKey.getBytes());
        // Add the column and its value to the row
        put.addColumn(columnFamily.getBytes(), column.getBytes(), value.getBytes());
        // Submit the Put to the table to apply the change
        table.put(put);
        table.close();
        connection.close();
    }
}

3.3.5 Delete rows

    public static void deleteData(String tableName, String... rowKey) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        // Get a Table handle from the connection
        Table table = connection.getTable(TableName.valueOf(tableName));
        // Build one Delete per rowkey
        ArrayList<Delete> deleteList = new ArrayList<>();
        for (String row: rowKey) {
            deleteList.add(new Delete(row.getBytes()));
        }
        // Submit the deletes to the table
        table.delete(deleteList);
        table.close();
        connection.close();
    }

3.3.6 Scan data, optionally restricted to a CF or a "CF:COLUMN"

    public static void scanData(String tableName) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        // Get a Table handle from the connection
        Table table = connection.getTable(TableName.valueOf(tableName));
        // Create a Scan; startRow and stopRow can be set to read only a range of keys
        Scan scan = new Scan();
        // Scan the table
        ResultScanner scanner = table.getScanner(scan);
        for (Result result: scanner) {
            Cell[] cells = result.rawCells();
            for (Cell cell:cells) {
                // Row key
                System.out.println("Row key: " + Bytes.toString(CellUtil.cloneRow(cell)));
                // Column family, column qualifier, and value
                System.out.println("Column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
                System.out.println("Column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
                System.out.println("Value: " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
        scanner.close();
        table.close();
        connection.close();
    }

To restrict the scan to a particular CF or "CF:COLUMN", add the column or the CF to the Scan before scanning:
scan.addColumn(family.getBytes(), column.getBytes());
scan.addFamily(cf.getBytes());

3.3.7 Get one row

    public static void getRow(String tableName, String rowKey) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        // Get a Table handle from the connection
        Table table = connection.getTable(TableName.valueOf(tableName));
        Get get = new Get(rowKey.getBytes());
        Result result = table.get(get);
        for (Cell cell:result.rawCells()) {
            // Row key
            System.out.println("Row key: " + Bytes.toString(CellUtil.cloneRow(cell)));
            // Column family, column qualifier, value, and timestamp
            System.out.println("Column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
            System.out.println("Column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
            System.out.println("Value: " + Bytes.toString(CellUtil.cloneValue(cell)));
            System.out.println("Timestamp: " + cell.getTimestamp());
        }

        table.close();
        connection.close();
    }

3.3.8 Get a specific "CF:COLUMN" of one row

    public static void getRowCF(String tableName, String rowKey, String family, String column) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        // Get a Table handle from the connection
        Table table = connection.getTable(TableName.valueOf(tableName));
        Get get = new Get(rowKey.getBytes());
        // Restrict the Get to a single column of a single column family
        get.addColumn(family.getBytes(),column.getBytes());
        Result result = table.get(get);
        for (Cell cell:result.rawCells()) {
            // Row key
            System.out.println("Row key: " + Bytes.toString(CellUtil.cloneRow(cell)));
            // Column family, column qualifier, value, and timestamp
            System.out.println("Column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
            System.out.println("Column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
            System.out.println("Value: " + Bytes.toString(CellUtil.cloneValue(cell)));
            System.out.println("Timestamp: " + cell.getTimestamp());
        }
        table.close();
        connection.close();
    }

3.3.9 Create a namespace

    public static void createNamespace(String namespace) throws IOException {
        Connection connection = ConnectionFactory.createConnection(conf);
        Admin admin = connection.getAdmin();
        // Create the namespace descriptor
        NamespaceDescriptor province = NamespaceDescriptor.create(namespace).build();
        // Create the namespace
        admin.createNamespace(province);
        admin.close();
        connection.close();
    }

3.4 Using MapReduce with HBase

3.4.1 Environment preparation

Show the dependencies that HBase needs to run MapReduce jobs:

hbase mapredcp

Add the dependency paths to the environment:

export HADOOP_CLASSPATH=`hbase mapredcp`

3.4.2 Bundled MapReduce examples

(1) Count the rows in a table

cd /opt/modules/hbase-1.3.1/lib
yarn jar hbase-server-1.3.1.jar rowcounter student

The output includes:
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
ROWS=3

(2) Use MapReduce to import data from HDFS into HBase

vim /tmp/fruit_input.txt
1001    apple   red
1002    pear    yellow
1003    orange  orange

Upload the file to HDFS:
hdfs dfs -mkdir /input_fruit
hdfs dfs -put /tmp/fruit_input.txt /input_fruit/

Create the target table in HBase:
hbase(main)> create 'fruit_input','info'

yarn jar hbase-server-1.3.1.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color fruit_input hdfs://bigdata121:9000/input_fruit
Explanation: -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:color maps the fields of each input line, in order, to the row key and to columns; the entries are separated by commas.
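The positional mapping can be sketched as follows. This is not the ImportTsv implementation, only an illustration of how a column specification lines up with the tab-separated fields of one input line; the class TsvColumnMapping and its method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy version of the importtsv column mapping: the i-th field goes to the i-th entry of the spec. */
final class TsvColumnMapping {
    /** Returns spec entry -> field value, in input order; the row key is stored under "HBASE_ROW_KEY". */
    static Map<String, String> map(String columnsSpec, String line) {
        String[] columns = columnsSpec.split(",");
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (columns.length != fields.length) {
            throw new IllegalArgumentException(
                    "expected " + columns.length + " fields but found " + fields.length);
        }
        Map<String, String> mapped = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            mapped.put(columns[i], fields[i]);
        }
        return mapped;
    }

    public static void main(String[] args) {
        Map<String, String> row = map("HBASE_ROW_KEY,info:name,info:color", "1001\tapple\tred");
        System.out.println(row);
    }
}
```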

Check the data:
hbase(main):002:0> scan 'fruit_input'
ROW        COLUMN+CELL
1001      column=info:color, timestamp=1564710439420, value=red
1001      column=info:name, timestamp=1564710439420, value=apple
1002      column=info:color, timestamp=1564710439420, value=yellow
1002      column=info:name, timestamp=1564710439420, value=pear
1003      column=info:color, timestamp=1564710439420, value=orange
1003      column=info:name, timestamp=1564710439420, value=orange

3.4.3 Read from HBase, process, and write the result back to HBase

Requirement: use MapReduce to copy part of the fruit table into the fruit_mr table, extracting the name and color columns of the info column family.
The fruit table contains:

ROW        COLUMN+CELL
1001      column=account:sells, timestamp=1564393837300, value=20   
1001      column=info:color, timestamp=1564393810196, value=red       
1001      column=info:name, timestamp=1564393788068, value=apple     
1001      column=info:price, timestamp=1564393864714, value=10       
1002      column=account:sells, timestamp=1564393937058, value=100   
1002      column=info:color, timestamp=1564393908332, value=orange   
1002      column=info:name, timestamp=1564393897787, value=orange     
1002      column=info:price, timestamp=1564393918141, value=8

Create the output table beforehand:

hbase(main):002:0> create 'fruit_mr','info'

mapper:

package HBaseMR;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

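/**
 * TableMapper<ImmutableBytesWritable, Put> specifies the mapper's output key/value types.
 * The input comes from an HBase table, so the input key/value types are fixed and need not be specified.
 *
 * When the rowkey is used as the key, its type is ImmutableBytesWritable.
 */
public class HBaseMrMapper extends TableMapper<ImmutableBytesWritable, Put> {
    /**
     * Each Cell holds the value stored for one column of the row in HBase's physical storage.
     * @param key     the rowkey of the current row
     * @param value   the Result holding the cells of the current row
     * @param context the MapReduce context used to emit output
     * @throws IOException
     * @throws InterruptedException
     */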
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
        Put put = new Put(key.get());

        // Keep only the name and color columns of column family info and add them to the Put
        for (Cell cell : value.rawCells()) {
            if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
                if ("name".equals(Bytes.toString(CellUtil.cloneQualifier(cell))) || "color".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
                    put.add(cell);
                }
            }
        }

        // Write to the context only if the Put is non-empty; otherwise the final write to HBase fails because an empty Put cannot be written
        if (! put.isEmpty()) {
            context.write(key, put);
        }

    }
}

reducer:

package HBaseMR;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

/**
 * Extends TableReducer<keyin, valuein, keyout>.
 * The reducer's output value type is not specified because it is always a mutation such as Put.
 */
public class HBaseMrReducer extends TableReducer<ImmutableBytesWritable, Put, NullWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context) throws IOException, InterruptedException {

        // Write every Put that shares this key to the context
        for (Put p : values) {
            context.write(NullWritable.get(), p);
        }

    }
}

runner:

package HBaseMR;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HBaseMrRunner extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        // Create the job object
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(HBaseMrRunner.class);

        // Create the Scan used to read the HBase table
        Scan scan = new Scan();
        scan.setCacheBlocks(false);
        scan.setCaching(500);

        // Configure the job, both the map side and the reduce side
        // Map side: input table, mapper class, and output key/value classes
        TableMapReduceUtil.initTableMapperJob(
                "fruit",
                scan,
                HBaseMrMapper.class,
                ImmutableBytesWritable.class,
                Put.class,
                job
        );

        // Reduce side: output table and reducer class
        TableMapReduceUtil.initTableReducerJob(
                "fruit_mr",
                HBaseMrReducer.class,
                job
        );

        job.setNumReduceTasks(1);

        // Submit the job and wait for it to finish
        boolean isSuccess = job.waitForCompletion(true);

        // By convention, exit code 0 means success
        return isSuccess ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Invoke the runner's run method through ToolRunner
        int status = ToolRunner.run(conf, new HBaseMrRunner(), args);
        System.exit(status);

    }
}

Package with Maven and run on the cluster:

yarn jar hbasetest-1.0-SNAPSHOT.jar HBaseMR.HBaseMrRunner

3.4.4 Import text data from HDFS into HBase

Import the data in /input_fruit/fruit_input.txt on HDFS into the HBase table fruit_hdfs_mr.
The text format is:

1001    apple   red
1002    pear    yellow
1003    orange  orange
Fields are separated by "\t".

Create the table first:

create 'fruit_hdfs_mr','info'

mapper:

package HDFSToHBase;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class ToHBaseMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    ImmutableBytesWritable keyOut = new ImmutableBytesWritable();
    //Put value = new Put();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\t");

        keyOut.set(fields[0].getBytes());
        Put put = new Put(fields[0].getBytes());
        put.addColumn("info".getBytes(), "name".getBytes(), fields[1].getBytes());
        put.addColumn("info".getBytes(), "color".getBytes(), fields[2].getBytes());

        context.write(keyOut, put);
    }
}
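
The field-splitting step in the mapper above can be tried off-cluster. The following is a minimal sketch in plain Java with no Hadoop or HBase dependency; the class name FruitLineParser and its methods are hypothetical and not part of the original job. It shows one way to validate a tab-separated record before a Put is built from it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of the original job.
final class FruitLineParser {

    /** Splits one tab-separated record into exactly three fields, or returns null if it is malformed. */
    static String[] parse(String line) {
        if (line == null) {
            return null;
        }
        // The -1 limit keeps trailing empty fields, so "a\tb\t" still yields three entries
        String[] fields = line.split("\t", -1);
        if (fields.length != 3 || fields[0].isEmpty()) {
            return null; // wrong column count or empty row key
        }
        return fields;
    }

    /** Encodes a field as UTF-8 explicitly instead of relying on the platform default charset. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("1001\tapple\tred")));
        System.out.println(parse("a line without tabs"));
    }
}
```

Returning null for a bad record lets the caller decide whether to skip it or count it; in a real mapper a counter would probably be the more useful choice.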

reducer:

package HDFSToHBase;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.NullWritable;

import java.io.IOException;

public class ToHBaseReducer extends TableReducer<ImmutableBytesWritable, Put, NullWritable> {
    @Override
    protected void reduce(ImmutableBytesWritable key, Iterable<Put> values, Context context) throws IOException, InterruptedException {
        // Identity reducer: forward every Put unchanged; the table output format writes it to HBase
        for (Put p : values) {
            context.write(NullWritable.get(), p);
        }
    }
}

runner:

package HDFSToHBase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToHBaseRunner extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        // Create the job object
        Configuration conf = this.getConf();
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(ToHBaseRunner.class);

        // Set the input path
        Path inPath = new Path("/input_fruit/fruit_input.txt");
        FileInputFormat.addInputPath(job, inPath);

        // Set the mapper class and its output key/value types
        job.setMapperClass(ToHBaseMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        // Set the reducer class and the output table
        TableMapReduceUtil.initTableReducerJob(
                "fruit_hdfs_mr",
                ToHBaseReducer.class,
                job
        );

        job.setNumReduceTasks(1);

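        // Submit the job and wait for it to finish
        boolean isSuccess = job.waitForCompletion(true);

        // Exit code 0 means success, non-zero means failure
        return isSuccess ? 0 : 1;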

    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        int status = ToolRunner.run(configuration, new ToHBaseRunner(), args);
        System.exit(status);

    }
}

Package and run the jar:

yarn jar hbasetest-1.0-SNAPSHOT.jar HDFSToHBase.ToHBaseRunner

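3.5 Using Hive together with HBase

The Hive version used here is 1.2; for Hive deployment, see the earlier Hive articles.

3.5.1 Environment configuration

Hive needs to operate on HBase, so some of the dependency jars in HBase's lib directory must be copied into Hive's lib directory. Hive also has to reach the ZooKeeper cluster in order to access HBase, so the corresponding ZooKeeper jar must be copied as well.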

HBase dependencies:
cp /opt/modules/hbase-1.3.1/lib/hbase-* /opt/modules/hive-1.2.1-bin/lib/
cp /opt/modules/hbase-1.3.1/lib/htrace-core-3.1.0-incubating.jar /opt/modules/hive-1.2.1-bin/lib/

ZooKeeper dependency:
cp /opt/modules/hbase-1.3.1/lib/zookeeper-3.4.6.jar /opt/modules/hive-1.2.1-bin/lib/

Next, edit Hive's configuration file conf/hive-site.xml and add the following properties:

<!-- Addresses and client port of the ZooKeeper cluster -->
<property>
    <name>hive.zookeeper.quorum</name>
    <value>bigdata121,bigdata122,bigdata123</value>
    <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>

<property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
    <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>
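
A ZooKeeper client is normally handed a comma-separated list of host:port pairs. As a rough illustration of how the two properties above fit together, the sketch below joins a quorum list and a client port into such a string. The class ZkConnectString is hypothetical and is not Hive's actual implementation.

```java
// Hypothetical helper, for illustration only; not Hive's own code.
final class ZkConnectString {

    /** Joins a comma-separated host list and a default port into "host:port,host:port,...". */
    static String build(String quorum, int defaultPort) {
        StringBuilder sb = new StringBuilder();
        for (String host : quorum.split(",")) {
            host = host.trim();
            if (host.isEmpty()) {
                continue; // tolerate stray commas
            }
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(host);
            // Only append the default port when the entry does not already carry one
            if (!host.contains(":")) {
                sb.append(':').append(defaultPort);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values taken from the hive-site.xml fragment above
        System.out.println(build("bigdata121,bigdata122,bigdata123", 2181));
    }
}
```

If the printed hosts and port are not reachable from the Hive node, the HBase-backed tables will not be reachable either, so this string is a convenient first thing to check.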

3.5.2 Linking Hive with HBase, and the problems encountered

(1) Creating a linked table in Hive:

create table student_hbase_hive(
id int,
name string,
sex string,
score double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:sex,info:score")
TBLPROPERTIES("hbase.table.name"="hbase_hive_student");

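Explanation of the statement:
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
The table's data is stored through the HBase storage handler class.

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:sex,info:score")
Defines the mapping between the Hive table's columns and the HBase columns. Entries are matched by position, and :key stands for the HBase row key.

TBLPROPERTIES("hbase.table.name"="hbase_hive_student");
Properties of the table created in HBase; here the table name is set to hbase_hive_student.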
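
To make the positional matching concrete, here is a small sketch in plain Java that pairs the Hive column names from the statement above with the entries of hbase.columns.mapping, in order. The class ColumnMappingSketch is hypothetical and does not use any Hive or HBase API; it only mimics the pairing rule described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; it does not call Hive or HBase.
final class ColumnMappingSketch {

    /** Pairs each Hive column with the mapping entry at the same position. */
    static Map<String, String> pair(String[] hiveColumns, String mapping) {
        String[] targets = mapping.split(",");
        if (targets.length != hiveColumns.length) {
            throw new IllegalArgumentException(
                    "mapping has " + targets.length + " entries but the table has "
                            + hiveColumns.length + " columns");
        }
        // LinkedHashMap keeps the declaration order of the columns
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < hiveColumns.length; i++) {
            result.put(hiveColumns[i], targets[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] hiveColumns = {"id", "name", "sex", "score"};
        String mapping = ":key,info:name,info:sex,info:score";
        System.out.println(pair(hiveColumns, mapping));
    }
}
```

The sketch rejects a mapping whose entry count differs from the column count, which is the most common way such a definition goes wrong.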

A detour: an error during creation
While creating the table, the following error appeared:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hbase.HTableDescriptor.addFamily(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V

This is not very detailed. To get more information, start hive as follows so that debug output is printed to the console:

hive -hiveconf hive.root.logger=DEBUG,console

Then run the create statement again. A lot of output appears; scrolling down, one line turns out to be the key:

ERROR exec.DDLTask: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HTableDescriptor.addFamily......

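In other words, at runtime the JVM could not find a method org.apache.hadoop.hbase.HTableDescriptor.addFamily with the signature Hive was compiled against (the trailing V in the descriptor of the first error message means a void return type).
I then pulled the matching HBase dependency with Maven in IntelliJ IDEA and confirmed that HTableDescriptor there has no addFamily method with that signature. The cause is therefore clear: some HBase-facing jar used by Hive is incompatible with the HBase version in use. The key jar linking Hive and HBase is the one that contains the org.apache.hadoop.hive.hbase.HBaseStorageHandler class used above, namely hive-hbase-handler-1.2.1.jar. My guess was that this jar is too old for the current HBase, so I downloaded a newer one from the Maven repository, hive-hbase-handler-2.3.2.jar, replaced the original jar in Hive's lib directory with it, and restarted hive. Running the create-table statement again then succeeded.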
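
The descriptor at the end of the first error message, (Lorg/apache/hadoop/hbase/HColumnDescriptor;)V, is in JVM method-descriptor notation: the parameter types in parentheses, followed by the return type. The sketch below decodes only the return-type part, which is handy when comparing the signature a caller was compiled against with the one a library actually offers. The class name DescriptorReturnType is hypothetical, and the second descriptor in main is an illustrative example of a mismatching signature, not output taken from the original session.

```java
// Hypothetical helper, for illustration only.
final class DescriptorReturnType {

    /** Returns a readable name for the return type encoded in a JVM method descriptor. */
    static String returnTypeOf(String descriptor) {
        int close = descriptor.lastIndexOf(')');
        if (close < 0 || close == descriptor.length() - 1) {
            throw new IllegalArgumentException("not a method descriptor: " + descriptor);
        }
        String ret = descriptor.substring(close + 1);

        // Leading '[' characters denote array dimensions
        int dims = 0;
        while (dims < ret.length() && ret.charAt(dims) == '[') {
            dims++;
        }
        String base = ret.substring(dims);
        if (base.isEmpty()) {
            throw new IllegalArgumentException("missing return type: " + descriptor);
        }

        String name;
        switch (base.charAt(0)) {
            case 'V': name = "void"; break;
            case 'Z': name = "boolean"; break;
            case 'B': name = "byte"; break;
            case 'C': name = "char"; break;
            case 'S': name = "short"; break;
            case 'I': name = "int"; break;
            case 'J': name = "long"; break;
            case 'F': name = "float"; break;
            case 'D': name = "double"; break;
            case 'L':
                // Object types are written as Lpkg/Class;
                if (base.length() < 3 || !base.endsWith(";")) {
                    throw new IllegalArgumentException("bad object type: " + base);
                }
                name = base.substring(1, base.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("unknown type code: " + base);
        }

        StringBuilder sb = new StringBuilder(name);
        for (int i = 0; i < dims; i++) {
            sb.append("[]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Descriptor from the error message: the caller expects a void method
        System.out.println(returnTypeOf("(Lorg/apache/hadoop/hbase/HColumnDescriptor;)V"));
        // Illustrative descriptor of a same-named method that returns an object instead
        System.out.println(returnTypeOf(
                "(Lorg/apache/hadoop/hbase/HColumnDescriptor;)Lorg/apache/hadoop/hbase/HTableDescriptor;"));
    }
}
```

Because the JVM links methods by name plus full descriptor, two methods that differ only in return type are different methods at link time, which is how a NoSuchMethodError can occur even when a method of that name is present.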

After the create statement has run, the new table exists in both Hive and HBase. Data inserted into it from either HBase or Hive is then visible on both sides.

(2) Importing data into the linked table

In Hive, data cannot be put into the linked table directly with the load command. It can only be inserted from another table with

insert into table TABLE_NAME select * from ANOTHER_TABLE

or row by row with plain insert statements. This is not covered further here.

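(3) Linking Hive to a table that already exists in HBase

HBase itself offers little in the way of SQL, so running SQL-style statistics over its data is cumbersome. An HBase table that already holds data can instead be linked into Hive, and the analysis done there with the more complete HQL. The linked table is defined in the same way as above, except that it should be declared with CREATE EXTERNAL TABLE, because creating a managed table fails when the HBase table already exists.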

(4) What the Hive-HBase link really is

In essence the data is stored in HBase; Hive merely operates on the data in the HBase table through an interface. There is one pitfall, though: Hive columns are typed (int, for example), whereas HBase cells have no types at all; everything is stored as raw bytes. When a typed value has been stored in binary form and is then queried directly in HBase, it shows up as unreadable characters, because HBase has no way of knowing the data type. Keep this in mind.
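
The difference between a value stored as text and the same value stored in binary form can be seen with the JDK alone. The sketch below encodes the integer 1001 both ways; the class CellEncodingSketch and its method names are hypothetical, and which of the two encodings a given Hive-HBase mapping actually uses depends on its configuration, which this sketch does not model.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only; it does not call Hive or HBase.
final class CellEncodingSketch {

    /** Encodes an int as 4 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] asBinaryInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Encodes an int as the UTF-8 bytes of its decimal string form. */
    static byte[] asText(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Binary form: not printable as text, which is why a shell shows escape sequences
        System.out.println(Arrays.toString(asBinaryInt(1001))); // [0, 0, 3, -23]
        // Text form: the ASCII codes of '1', '0', '0', '1'
        System.out.println(Arrays.toString(asText(1001)));      // [49, 48, 48, 49]
    }
}
```

A reader that assumes the wrong encoding sees nonsense in both directions: binary bytes look like garbage when printed as text, and text bytes decode to an unrelated number when read as a binary int.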

3.6 sqoop--MysqlToHbase

Sqoop was deployed together with Hive earlier, so for its deployment see my earlier Hive article.
Edit the Sqoop configuration file conf/sqoop-env.sh:

export HBASE_HOME=/opt/modules/hbase-1.3.1

Requirement: extract the data of a MySQL table into HBase.
Create the MySQL table and insert some data:

CREATE DATABASE db_library;
CREATE TABLE db_library.book(
id int(4) PRIMARY KEY NOT NULL AUTO_INCREMENT, 
name VARCHAR(255) NOT NULL, 
price VARCHAR(255) NOT NULL);

INSERT INTO db_library.book (name, price) VALUES('Lie Sporting', '30');  
INSERT INTO db_library.book (name, price) VALUES('Pride & Prejudice', '70');  
INSERT INTO db_library.book (name, price) VALUES('Fall of Giants', '50');

Create the target table in HBase:

create 'hbase_book','info'

Import with Sqoop:

# --column-family: the target column family
# --hbase-row-key: the source column that is mapped to the row key
# --hbase-table:   the name of the target table
sqoop import \
--connect jdbc:mysql://bigdata11:3306/db_library \
--username root \
--password 000000 \
--table book \
--columns "id,name,price" \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "hbase_book" \
--num-mappers 1 \
--split-by id
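
As a mental model of what the import above produces, the following sketch maps one source row to HBase-style cells: the column named by --hbase-row-key becomes the row key, and the remaining columns become cells in the chosen column family, with all values kept as strings. It is plain Java with hypothetical names (RowToCellsSketch, toCells) and does not call Sqoop or HBase. Whether the row key column is additionally stored as a cell, and how null values are treated, depends on Sqoop's configuration; this sketch assumes the row key is not duplicated and nulls are skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; it does not call Sqoop or HBase.
final class RowToCellsSketch {

    /** Turns one relational row into "family:qualifier" -> value pairs, leaving out the row key column. */
    static Map<String, String> toCells(Map<String, Object> row, String rowKeyColumn, String family) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            // Assumption: the row key column is not repeated as a cell, and null values are skipped
            if (e.getKey().equals(rowKeyColumn) || e.getValue() == null) {
                continue;
            }
            cells.put(family + ":" + e.getKey(), String.valueOf(e.getValue()));
        }
        return cells;
    }

    public static void main(String[] args) {
        // First row of db_library.book from the SQL above
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "Lie Sporting");
        row.put("price", "30");

        String rowKey = String.valueOf(row.get("id"));
        System.out.println(rowKey + " -> " + toCells(row, "id", "info"));
    }
}
```

After a real import, running scan 'hbase_book' in the HBase shell is the way to check what was actually written.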
