Hive SQL匯總

發(fā)布時(shí)間：2020-07-20 04:43:19 來(lái)源：網(wǎng)絡(luò) 閱讀：6520 作者：jethai 欄目：大數(shù)據(jù)

創(chuàng)建數(shù)據(jù)庫(kù)
create database if not exists sopdm
comment 'this is test database'
with dbproperties('creator'='gxw','date'='2014-11-12')
--數(shù)據(jù)庫(kù)鍵值對(duì)屬性信息
location '/my/preferred/directory';

顯示所有表
show tables ;
顯示表的描述信息
desc [extended,formatted] tablename;
顯示建表語(yǔ)句
show create table tablename;
刪除表
drop table tablename;

由一個(gè)表創(chuàng)建另一個(gè)表，相當(dāng)于復(fù)制，表結(jié)構(gòu)復(fù)制，數(shù)據(jù)沒(méi)復(fù)制
create table test3 like test2;

由其他表查詢創(chuàng)建表
create table test4 as select name,addr from test5;

stored as textfile
可以直接查看
stored as sequencefile
必須用hadoop fs -text查看
stored as rcfile
hive -service rcfilecat path 查看

stored as inputformat 'class'（自定義的）

加載jar包
shell窗口add jar path（作用范圍本shell）
加載到分布式緩存中供各個(gè)節(jié)點(diǎn)使用

或者直接拷貝到hive安裝目錄下的lib目錄

SerDe（hive使用SerDe讀、寫表的行）

讀寫順序：
HDFS文件-->InputFileFormat--> <key,value>-->Deserializer-->Row對(duì)象（供hive使用）
Row對(duì)象-->Serializer--> <key,value>-->OutputFileFormat-->HDFS文件

hive自帶RegexSerDe.class 正則表達(dá)式匹配每一行的數(shù)據(jù)

create table apachelog(
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
refer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex="([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([0-9]*) ([^ ]*) ([^ ]*)"
) stored AS TEXTFILE；

分區(qū)表（相當(dāng)于表的子目錄）

create table tablename (name string) partitioned by (key type,...)

create external table employees(
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int)
)
partitioned by (dt string,type string)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/data';
數(shù)據(jù)格式：
wang  123  a1,a2,a3  k1:1,k2:2,k3:3  s1,s2,s3,4

查看分區(qū)：
show partitions employees
增加分區(qū)

alter table employees add if not exists partition(country='xxx'[,state='yyy'])
刪除分區(qū)
alter table employees drop if  exists partition(country='xxx'[,state='yyy'])

動(dòng)態(tài)分區(qū)：
1.不需要為不同的分區(qū)添加不同的插入語(yǔ)句
2.分區(qū)不確定，需要從數(shù)據(jù)中獲取
參數(shù)：(動(dòng)態(tài)分區(qū)前兩個(gè)必須開(kāi)啟）
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrick;
//無(wú)限制模式，如果是strict，則必須有一個(gè)靜態(tài)分區(qū)，且放在最前面
set hive.exec.max.dynamic.partitions.pernode=10000;
//每個(gè)節(jié)點(diǎn)生成動(dòng)態(tài)分區(qū)的最大個(gè)數(shù)
set hive.exec.max.dynamic.partitions=100000;
//每次sql查詢生成動(dòng)態(tài)分區(qū)的最大個(gè)數(shù)
set hive.exec.max.created.files=150000;
//一個(gè)任務(wù)最多可以創(chuàng)建的文件數(shù)目
set dfs.datanode.max.xcievers=8182;
//限定一次最多打開(kāi)的文件數(shù)
1. 創(chuàng)建分區(qū)表
create table d_part (
name string
)
partitioned by (value string)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
2. 插入動(dòng)態(tài)分區(qū)
insert overwrite table d_part partition(value)
select name,addr as value
from testtext;

分桶
set hive.enforce.bucketing=true;

按id分桶
create table bucketed_user
(id string,
name string)
clustered by (id) sorted by(name)  into 4 buckets
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
分桶抽樣
select * from bucketed_user tablesample(bucket 1 out of 2 on id )
取一半的桶
優(yōu)化
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Hive SQL匯總

beeline底層使用的jdbc，命令行使用jdbc(可以遠(yuǎn)程訪問(wèn)）

hive -help
hive --help
hive --service -help

查詢結(jié)果寫到文件
hive -V -e "select name from testtext" > /home/data/result

hive命令行輸入list jar
顯示當(dāng)前分布式緩存有哪些jar包也就是add jar命令加載的jar包）

命令行執(zhí)行hql文件（類似于hive -f)，常用于設(shè)置初始化參數(shù)
source /home/data/hql/select_hql

配置變量
set val='';
hql使用hive變量
${hiveconf:val}
select * from testtext where name ='${hiveconf:val}';
env查看linux環(huán)境變量
HOME=/root
hql使用linux環(huán)境變量
select '${env:HOME}' from testtext;

1.內(nèi)表數(shù)據(jù)加載(overwrite和into不能同時(shí)存在，只能存在一個(gè)）
(1) 創(chuàng)建表時(shí)加載
create table newtable as select col1,col2 from oldtable;
(2)創(chuàng)建表時(shí)指定數(shù)據(jù)位置(對(duì)location下的數(shù)據(jù)具有擁有權(quán)，刪除內(nèi)表時(shí)也會(huì)刪除數(shù)據(jù)）

create table tablename() location '';
(3)本地?cái)?shù)據(jù)加載
load data local inpath 'localpath' [overwrite] into table tablename;
(4)加載hdfs數(shù)據(jù)（移動(dòng)數(shù)據(jù)，原來(lái)的數(shù)據(jù)移動(dòng)到表的位置下）
load data  inpath 'hdfspath' [overwrite] into table tablename;

hive命令行執(zhí)行l(wèi)inux shell命令在前面加個(gè)！
!ls /home/data

(5)通過(guò)查詢語(yǔ)句加載數(shù)據(jù)

insert  into table test_m select name,addr from testtext where name ='wer';
或者
from testtext insert  into table test_m select name,addr  where name ='wer';
或者
select name,addr from testtext  where name ='wer' insert  into table test_m ;

2.外部表數(shù)據(jù)加載
(1)創(chuàng)建表時(shí)指定數(shù)據(jù)位置
create external table tablename() location '';
(2)查詢插入，同內(nèi)表
(3)使用hadoop命令拷貝數(shù)據(jù)到指定位置（hive的shell中執(zhí)行和linux的shell執(zhí)行）

3.分區(qū)表數(shù)據(jù)加載
（1）內(nèi)部分區(qū)表數(shù)據(jù)加載類似于內(nèi)表

（2）外部分區(qū)表數(shù)據(jù)加載方式類似于外表
注意：數(shù)據(jù)存放的路徑層次要和表的分區(qū)一致,并且表要增加相應(yīng)分區(qū)才能查到數(shù)據(jù)

load data local inpath 'localpath' [overwrite] into table tablename partition(dt='20140905');

4.數(shù)據(jù)類型對(duì)應(yīng)問(wèn)題
Load數(shù)據(jù)，字段類型不能相互轉(zhuǎn)化時(shí)，查詢返回NULL；
select查詢輸入，字段類型不能相互轉(zhuǎn)化時(shí)，插入數(shù)據(jù)為NULL(文件保存是 \N)；
select查詢輸入數(shù)據(jù)，字段名稱可不一致，數(shù)據(jù)加載不做檢查，查詢時(shí)檢查

數(shù)據(jù)導(dǎo)出

導(dǎo)出到本地，默認(rèn)分隔符^A
insert overwrite local directory '/home/data3'
row format delimited fields terminated by '\t'
select name,addr from testtext;

導(dǎo)出到hdfs，不支持row format delimited fields terminated by '\t'，只能采用默認(rèn)分隔符是I
insert overwrite directory '/home/data3'
select name,addr from testtext;

表屬性操作

1.修改表名
alter table table_name rename to new_table_name；

2.修改列名
alter table tablename change column c1 c2 int comment 'xxx' after severity;
c1舊列，c2新列，int代表新列數(shù)據(jù)類型
after severity;可以把該列放到指定列的后面，或者使用'first'放到第一位

3.增加列(默認(rèn)新增列放到最末尾）
alter  table tablename add column (c1 string comment 'xxxx',c2 string comment 'yyyy')；

4.修改tblproperties
alter table test set tblproperties('comment'='xxxx');

5.修改分隔符(分區(qū)表比較特殊）
方式一
alter table city set serdeproperties('field.delim'='\t');（對(duì)分區(qū)表原始數(shù)據(jù)無(wú)效，對(duì)新加分區(qū)有效）

方式二對(duì)分區(qū)表原始數(shù)據(jù)也使用最新的分隔符
alter table city partitin(dt='20140908') set serdeproperties('field.delim'='\t');

6.修改location
alter table city [partition(...)] set location 'hdfs://master:9000/location' ;

7.內(nèi)部表和外部表轉(zhuǎn)換
alter table test set tblproperties('EXTERNAL'='TRUE');內(nèi)部表轉(zhuǎn)外部表
alter table test set tblproperties('EXTERNAL'='FALSE');外部表轉(zhuǎn)內(nèi)部表

聚合操作
1.count計(jì)數(shù)
count(*)所有的字段不全為null，全為null不加1
count(1)不管記錄是啥，只要有這條記錄都加1
count(col)列不為空加1

2.sum求和
sum(可轉(zhuǎn)成數(shù)字的值)  返回bigint
sum(col)+cast(1 as bigint)

3.avg
avg(可轉(zhuǎn)成數(shù)字的值) 返回double

where條件在map端執(zhí)行

group by是在reduce端執(zhí)行分組的列組合為key

having字句聚合操作之后執(zhí)行判斷，也是在reduce端執(zhí)行

groupby數(shù)據(jù)傾斜優(yōu)化
hive.groupby.skewindata=true;（多起一個(gè)job）

join操作（普通join不支持不等值鏈接）
優(yōu)化參數(shù)set hive.optimize.skewjoin=true;
樣例
select m.col as col,m.col2 as col2,n.col3 as col3
from
(select col,col2
from test
where ...(map端執(zhí)行)
) m
[left outer|right outer|left semi] join
n (右表)
on m.col=n.col
where condition (reduce端執(zhí)行)

LEFT SEMI JOIN 是 IN/EXISTS 子查詢的一種更高效的實(shí)現(xiàn)

Hive 當(dāng)前沒(méi)有實(shí)現(xiàn) IN/EXISTS 子查詢，所以你可以用 LEFT SEMI JOIN 重寫你的子查詢語(yǔ)句。LEFT SEMI JOIN 的限制是， JOIN 子句中右邊的表只能在

ON 子句中設(shè)置過(guò)濾條件，在 WHERE 子句、SELECT 子句或其他地方過(guò)濾都不行。

SELECT a.key, a.value

FROM a

WHERE a.key in

(SELECT b.key

FROM b);

可以被重寫為：

SELECT a.key, a.val

FROM a LEFT SEMI JOIN b on (a.key = b.key)

order by全局排序，只有一個(gè)reduce
distribute by col分散數(shù)據(jù)，按col分散到不同的reduce
和sort by結(jié)合保證每個(gè)reduce輸出是有序的

union all（不去重）和union（去重）

hive只支持union all，子查詢不允許起別名
select col from
(select a as col from t1
union all
select b as col from t2
) tmp;

要求：
1.字段名字一樣
2.字段類型一樣
3.字段個(gè)數(shù)一樣
4.子表不能有別名
5.如果需要從合并之后的表中查詢數(shù)據(jù)，那么合并之后的表必須要有別名

hive創(chuàng)建索引表

hive> create index user_index on table user(id)     
        > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'    
        > with deferred rebuild    
        > IN TABLE user_index_table;    
    hive> alter index user_index on user rebuild;    
    hive> select * from user_index_table limit 5;

直接用hadoop命令復(fù)制刪除hive存儲(chǔ)數(shù)據(jù)后，需要add partition或alter來(lái)同步源數(shù)據(jù)信息，否則drop表等操作時(shí)會(huì)查詢元數(shù)據(jù)metastore，查到metastore信息和hdfs信息不一致，會(huì)報(bào)錯(cuò)。這個(gè)也可算是hive的bug，尚未修復(fù)，但也可以理解為初衷不建議直接操作hdfs數(shù)據(jù)。
目前，可以采用命令：

MSCK REPAIR TABLE table_name;

該命令會(huì)把沒(méi)添加進(jìn)partition的數(shù)據(jù)，都增加對(duì)應(yīng)的partition。同步源數(shù)據(jù)信息metadata。

Recover Partitions

Hive在metastore中存儲(chǔ)每個(gè)表的分區(qū)列表，如果新的分區(qū)加入HDFS后，metastore不會(huì)注意這些分區(qū)，除非

ALTER TABLE table_name ADD PARTITION

當(dāng)然可以通過(guò)

MSCK REPAIR TABLE table_name;

類似EMR版本中的如下命令

ALTER TABLE table_name RECOVER PARTITIONS;

向AI問(wèn)一下細(xì)節(jié)

Hive SQL匯總

猜你喜歡

最新資訊

相關(guān)推薦

相關(guān)標(biāo)簽