1. Obtaining the Data
Sogou Labs provides a log of user queries submitted to the Sogou search engine. It can be downloaded from
http://download.labs.sogou.com/dl/q.html
This article uses the lite version of the dataset.
The data format is as follows: each line is one record containing the query time, the user session ID, the query keyword, the rank of the clicked URL in the search results, the user's click sequence number, and the clicked URL, separated by tabs.
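For reference, a single record looks like this (taken from one of the rows queried later in this article):
20111230000005  57375476989eea12893c0c3811607bcf  奇藝高清  1  1  http://www.qiyi.com/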
2. Uploading the Data to HDFS
Create the HDFS directory:
root@spark-master:~# hdfs dfs -mkdir -p /library/sougou
Upload the files:
root@spark-master:~# hdfs dfs -put SogouQ1.txt /library/sougou
root@spark-master:~# hdfs dfs -put SogouQ2.txt /library/sougou
root@spark-master:~# hdfs dfs -put SogouQ3.txt /library/sougou
3. Creating a Table with Hive
root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/hive/apache-hive-1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> CREATE TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
OK
Time taken: 1.995 seconds
hive>
4. Loading the Data
hive> LOAD DATA INPATH '/library/sougou/SogouQ1.txt' INTO TABLE sougou;
Loading data to table default.sougou
Table default.sougou stats: [numFiles=1, totalSize=108750574]
OK
Time taken: 1.101 seconds
Now look at the source directory again: SogouQ1.txt is no longer there. Where did the file go?
It has been moved into the table's directory under the Hive warehouse, /user/hive/warehouse/sougou/. In other words, loading data into the table simply moves the file to a different location on HDFS, which is why the LOAD statement finishes so quickly.
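You can confirm this from the command line (listings omitted here; the exact output depends on your environment):
root@spark-master:~# hdfs dfs -ls /library/sougou                  # SogouQ1.txt has disappeared from the source directory
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/sougou/     # the file now sits in the table's warehouse directory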
So could we skip the LOAD statement and simply put SogouQ1.txt directly into /user/hive/warehouse/sougou/ on HDFS?
For a plain-text table Hive will in fact read any file that sits in the table's directory (we do exactly that below when fixing the file encoding), but LOAD DATA is still the cleaner route: Hive performs the move for you and records table statistics such as numFiles and totalSize in the metastore, as seen in the output above.
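As a side note, when the source file lives on the local filesystem rather than on HDFS, LOAD DATA LOCAL INPATH copies it into the warehouse instead of moving it. A minimal sketch (the local path here is illustrative):
hive> LOAD DATA LOCAL INPATH '/root/SogouQ2.txt' INTO TABLE sougou;  -- LOCAL copies the file; without LOCAL, the HDFS file is moved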
5. Working with the Data
5.1 Counting the rows
hive> select count(*) from sougou;
Query ID = root_20160314192407_792e772a-c225-4b37-b948-7050d6b529b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0002, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0002/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 19:24:29,014 Stage-1 map = 0%, reduce = 0%
2016-03-14 19:24:47,137 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.14 sec
2016-03-14 19:25:04,563 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1457942575478_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.38 sec   HDFS Read: 108757501 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
1000000
Time taken: 58.603 seconds, Fetched: 1 row(s)
5.2 Viewing the data
hive> select * from sougou limit 5;
OK
20111230000005 57375476989eea12893c0c3811607bcf  1 1 http://www.qiyi.com/
20111230000005 66c5bb7774e31d0a22278249b26bc83a  3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007 b97920521c78de70ac38e3713f524b50  1 1 http://www.bblianmeng.com/
20111230000008 6961d0c97fe93701fc9c0d861d096cd9  1 1 http://lib.scnu.edu.cn/
20111230000008 f2f5a21c764aebde1e8afcc2871e086f  2 1 http://proxyie.cn/
Time taken: 0.246 seconds, Fetched: 5 row(s)
The query keywords come out garbled because the source file is encoded in GB2312, while Hadoop and Hive both work with UTF-8. Convert the file to UTF-8 and upload it to HDFS again:
root@spark-master:~# iconv -t utf-8 -f gb2312 -c SogouQ1.txt > SogouQ1.txt.utf8
root@spark-master:~# rm SogouQ1.txt ; mv SogouQ1.txt.utf8 SogouQ1.txt
root@spark-master:~# hdfs dfs -rm /user/hive/warehouse/sougou/SogouQ1.txt
16/03/14 19:44:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/sougou/SogouQ1.txt
root@spark-master:~# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/sougou/
Check again:
hive> select * from sougou limit 5;
OK
20111230000005 57375476989eea12893c0c3811607bcf 奇藝高清 1 1 http://www.qiyi.com/
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙傳 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007 b97920521c78de70ac38e3713f524b50 本本聯盟 1 1 http://www.bblianmeng.com/
20111230000008 6961d0c97fe93701fc9c0d861d096cd9 華南師範大學圖書館 1 1 http://lib.scnu.edu.cn/
20111230000008 f2f5a21c764aebde1e8afcc2871e086f 在線代理 2 1 http://proxyie.cn/
Time taken: 0.151 seconds, Fetched: 5 row(s)
Now the query keywords display correctly.
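An alternative to transcoding the file is to tell Hive's LazySimpleSerDe what encoding the raw bytes are in via the serialization.encoding serde property (available from Hive 0.14 onward, so it should apply to the 1.2.1 build used here); a hedged sketch:
hive> ALTER TABLE sougou SET SERDEPROPERTIES ('serialization.encoding'='GBK');
With that property set, Hive decodes the GBK bytes at read time and the downloaded file can be used as-is.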
5.3 A more complex query
hive> select count(*) from sougou where s_seq=1 and c_seq=1 and website like '%baidu%';
Query ID = root_20160314194855_8c9aa844-e088-4695-942f-3579718962f6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0003, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0003/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 19:49:12,041 Stage-1 map = 0%, reduce = 0%
2016-03-14 19:49:33,174 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.94 sec
2016-03-14 19:49:48,672 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.55 sec
MapReduce Total cumulative CPU time: 11 seconds 550 msec
Ended Job = job_1457942575478_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.55 sec   HDFS Read: 114834192 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 550 msec
OK
59090
Time taken: 55.022 seconds, Fetched: 1 row(s)
Rank the query keywords by how often they appear:
hive> select word,count(*) cnt from sougou group by word order by cnt desc limit 5;
Query ID = root_20160314202108_58aeca03-8ed6-4626-b15e-af6643c94107
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0007, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0007/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 20:21:29,040 Stage-1 map = 0%, reduce = 0%
2016-03-14 20:21:57,425 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.98 sec
2016-03-14 20:22:16,021 Stage-1 map = 100%, reduce = 68%, Cumulative CPU 20.27 sec
2016-03-14 20:22:19,268 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 23.16 sec
2016-03-14 20:22:22,593 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 25.9 sec
2016-03-14 20:22:23,721 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.9 sec
MapReduce Total cumulative CPU time: 26 seconds 900 msec
Ended Job = job_1457942575478_0007
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0008, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0008/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0008
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2016-03-14 20:22:44,377 Stage-2 map = 0%, reduce = 0%
2016-03-14 20:23:07,303 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 9.95 sec
2016-03-14 20:23:25,482 Stage-2 map = 100%, reduce = 82%, Cumulative CPU 15.54 sec
2016-03-14 20:23:26,563 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 16.88 sec
MapReduce Total cumulative CPU time: 16 seconds 880 msec
Ended Job = job_1457942575478_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 26.9 sec   HDFS Read: 114832713 HDFS Write: 15044297 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 16.88 sec   HDFS Read: 15048892 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 780 msec
OK
百度    7564
baidu   3652
人體藝術    2786
館陶縣縣長閆寧的父親    2388
4399小游戲    2119
Time taken: 140.18 seconds, Fetched: 5 row(s)
6. External Tables
The table we created in step 3 is a managed (internal) table. When an internal table is created, Hive creates a directory with the same name under /user/hive/warehouse, and when data is loaded the source file is placed into that directory. When the table is dropped, the directory and its files are deleted along with it:
hive> drop table sougou;
OK
Time taken: 0.983 seconds
Check HDFS: the sougou directory is gone as well.
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-03-14 17:10 /user/hive/warehouse/t1
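If you are unsure whether a table is managed or external before dropping it, DESCRIBE FORMATTED reports this in its Table Type field (MANAGED_TABLE or EXTERNAL_TABLE):
hive> DESCRIBE FORMATTED sougou;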
Hive also offers another kind of table, the external table.
It is created as follows:
hive> CREATE EXTERNAL TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE LOCATION '/library/sougou/sougouExternal';
OK
Time taken: 0.123 seconds
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/
16/03/14 20:02:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-03-14 17:10 /user/hive/warehouse/t1
root@spark-master:~# hdfs dfs -ls /library/sougou/
16/03/14 20:02:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   3 root supergroup  217441417 2016-03-14 18:55 /library/sougou/SogouQ2.txt
-rw-r--r--   3 root supergroup 1086552775 2016-03-14 18:56 /library/sougou/SogouQ3.txt
drwxr-xr-x   - root supergroup          0 2016-03-14 20:01 /library/sougou/sougouExternal
The table's directory is created directly at the LOCATION we specified rather than under /user/hive/warehouse.
Upload the file:
root@spark-master:~# hdfs dfs -put SogouQ1.txt /library/sougou/sougouExternal
Query the data in Hive:
hive> select count(*) from sougou;
Query ID = root_20160314200414_b514251b-58d3-40aa-a9ee-4a9cf5eef8f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0004, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0004/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 20:04:27,514 Stage-1 map = 0%, reduce = 0%
2016-03-14 20:04:41,458 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.66 sec
2016-03-14 20:04:52,341 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.26 sec
MapReduce Total cumulative CPU time: 7 seconds 260 msec
Ended Job = job_1457942575478_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.26 sec   HDFS Read: 114832746 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 260 msec
OK
1000000
Time taken: 39.823 seconds, Fetched: 1 row(s)
When an external table is dropped, the files on HDFS are left untouched:
hive> drop table sougou;
OK
Time taken: 0.363 seconds
root@spark-master:~# hdfs dfs -ls /library/sougou/sougouExternal/
16/03/14 20:16:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   3 root supergroup  114825752 2016-03-14 20:03 /library/sougou/sougouExternal/SogouQ1.txt
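Because the data files survive the drop, the table can be re-attached at any time by re-running the same CREATE EXTERNAL TABLE statement against the same LOCATION; a minimal sketch reusing the DDL from above:
hive> CREATE EXTERNAL TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE LOCATION '/library/sougou/sougouExternal';
hive> select count(*) from sougou;  -- should again return 1000000, since SogouQ1.txt is still in place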