溫馨提示×

溫馨提示×

您好,登錄后才能下訂單哦!

密碼登錄×
登錄注冊×
其他方式登錄
點(diǎn)擊 登錄注冊 即表示同意《億速云用戶服務(wù)條款》

Hive的底層執(zhí)行流程

發(fā)布時(shí)間:2021-09-03 14:16:09 來源:億速云 閱讀:226 作者:chen 欄目:大數(shù)據(jù)

本篇內(nèi)容主要講解“Hive的底層執(zhí)行流程”,感興趣的朋友不妨來看看。本文介紹的方法操作簡單快捷,實(shí)用性強(qiáng)。下面就讓小編來帶大家學(xué)習(xí)“Hive的底層執(zhí)行流程”吧!

Hive 底層執(zhí)行流程

我們以下面這個SQL為例

FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

整個編譯過程分為六個階段:
1.Antlr定義SQL的語法規(guī)則,完成SQL詞法,語法解析,將SQL
轉(zhuǎn)化為抽象語法樹AST Tree

HiveLexerX,HiveParser分別是Antlr對SQL編譯后自動生成的詞法解析和語法解析類,在這兩個類中進(jìn)行復(fù)雜的解析。
例子中的AST tree為

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF src))
(TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) 
(TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) 
(TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4))))
(TOK_GROUPBY (TOK_COLREF src key))))

2.遍歷AST Tree,抽象出查詢的基本組成單元QueryBlock
AST Tree 仍然非常復(fù)雜,不夠結(jié)構(gòu)化,不方便直接翻譯為 MapReduce 程序, AST
Tree 轉(zhuǎn)化為 QueryBlock(QB)就是將 SQL 進(jìn)一部抽象和結(jié)構(gòu)化。
AST Tree 生成 QueryBlock 的過程是一個遞歸的過程,先序遍歷 AST Tree ,遇到不
同的Token 節(jié)點(diǎn)(理解為特殊標(biāo)記),保存到相應(yīng)的屬性中,主要包含以下幾個過程

TOK_QUERY => 創(chuàng)建 QB 對象,循環(huán)遞歸子節(jié)點(diǎn)
TOK_FROM => 將表名語法部分保存到 QB 對象的 aliasToTabs 等屬性中
TOK_INSERT => 循環(huán)遞歸子節(jié)點(diǎn)
TOK_DESTINATION => 將輸出目標(biāo)的語法部分保存在 QBParseInfo 對象的nameToDest 屬性中
TOK_SELECT => 分別將查詢表達(dá)式的語法部分保存在 destToSelExpr 、destToAggregationExprs 、 destToDistinctFuncExprs 三個屬性中
TOK_WHERE => 將 Where 部分的語法保存在 QBParseInfo 對象的destToWhereExpr 屬性中

3.遍歷QueryBlock,翻譯為執(zhí)行操作樹OperatorTree
Hive 最終生成的 MapReduce 任務(wù), Map 階段和 Reduce 階段均由 Operator Tree
組成。邏輯操作符,就是在 Map 階段或者 Reduce 階段完成單一特定的操作。
基本的操作符包括
TableScanOperator、SelectOperator、FilterOperator、JoinOperator、GroupByOperator、ReduceSinkOperator
QueryBlock 生成 Operator Tree 就是遍歷上一個過程中生成的 QB 和 QBParseInfo
對象的保存
語法的屬性,包含如下幾個步驟:

QB#aliasToSubq => 有子查詢,遞歸調(diào)用
QB#aliasToTabs => TableScanOperator
QBParseInfo#joinExpr => QBJoinTree => ReduceSinkOperator + JoinOperator
QBParseInfo#destToWhereExpr => FilterOperator
QBParseInfo#destToGroupby => ReduceSinkOperator +GroupByOperator
QBParseInfo#destToOrderby => ReduceSinkOperator + ExtractOperator

由于 Join/GroupBy/OrderBy 均需要在 Reduce 階段完成,所以在生成相應(yīng)操作的Operator 之前都會先生成一個 ReduceSinkOperator ,將字段組合并序列化為 Reduce Key/value,Partition Key
SQL例子翻譯成OperatorTree

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
            Reduce Output Operator
              key expressions:
                    expr: key
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: rand()
                    type: double
              tag: -1
              value expressions:
                    expr: substr(value, 4)
                    type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(UDFToDouble(VALUE.0))
          keys:
                expr: KEY.0
                type: string
          mode: partial1
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
                name: binary_table

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        /tmp/hive-zshao/67494501/106593589.10001
          Reduce Output Operator
            key expressions:
                  expr: 0
                  type: string
            sort order: +
            Map-reduce partition columns:
                  expr: 0
                  type: string
            tag: -1
            value expressions:
                  expr: 1
                  type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE.0)
          keys:
                expr: KEY.0
                type: string
          mode: final
          Select Operator
            expressions:
                  expr: 0
                  type: string
                  expr: 1
                  type: double
            Select Operator
              expressions:
                    expr: UDFToInteger(0)
                    type: int
                    expr: 1
                    type: double
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                    name: dest_g1

  Stage: Stage-0
    Move Operator
      tables:
            replace: true
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                name: dest_g1

4.Logical Optimizer進(jìn)行OperatorTree變換,合并不必要的
使用ReduceSinkOperator,減少shuffle數(shù)據(jù)量。大部分邏輯層優(yōu)化器通過變換 OperatorTree ,合并操作符,達(dá)到減少 MapReduce Job ,減少 shuffle 數(shù)據(jù)量的目的。
5.遍歷OperatorTree,翻譯為Task tree
OperatorTree 轉(zhuǎn)化為 Task tree的過程分為下面幾個階段

  • 對輸出表生成 MoveTask

  • 從 OperatorTree 的其中一個根節(jié)點(diǎn)向下深度優(yōu)先遍歷

  • ReduceSinkOperator 標(biāo)示 Map/Reduce 的界限,多個 Job 間的界限

  • 遍歷其他根節(jié)點(diǎn),遇過碰到 JoinOperator 合并 MapReduceTask

  • 生成 StatTask 更新元數(shù)據(jù)

  • 剪斷 Map 與 Reduce 間的 Operator 的關(guān)系

6.PhysicalOptimizer 對Task tree優(yōu)化,生成最終的執(zhí)行計(jì)劃

7、執(zhí)行

以上就是HiveSQL的底層執(zhí)行流程

打印SQL運(yùn)行相關(guān)信息

我們在開發(fā)中,可以使用下面這個語句來打印SQL語句的相關(guān)運(yùn)行信息

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

注:我的版本是hive-1.1.0-cdh6.7.0,所以只可用三個可選屬性,如果您版本比較高的話,可以去官網(wǎng)查閱對應(yīng)屬性
下面我對三種可選屬性進(jìn)行簡單介紹

EXTENDED

EXTENDED:打印SQL解析成AST&Operator Tree最全面的信息

hive (g6_hadoop)> explain EXTENDED insert  OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
OK
Explain
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            g6_access_orc
   TOK_INSERT
      TOK_DESTINATION
         TOK_TAB
            TOK_TABNAME
               g6_access_orc_explain
      TOK_SELECT
         TOK_SELEXPR
            TOK_TABLE_OR_COL
               domain
         TOK_SELEXPR
            TOK_FUNCTION
               count
               1
            num
      TOK_WHERE
         >
            TOK_TABLE_OR_COL
               traffic
            '99900'
      TOK_GROUPBY
         TOK_TABLE_OR_COL
            domain

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: g6_access_orc
            Statistics: Num rows: 260326 Data size: 188215698 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (traffic > 99900) (type: boolean)
              Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: domain (type: string)
                outputColumnNames: domain
                Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count(1)
                  keys: domain (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: string)
                    Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
                    tag: -1
                    value expressions: _col1 (type: bigint)
                    auto parallelism: false
      Path -> Alias:
        hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc [g6_access_orc]
      Path -> Partition:
        hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc 
          Partition
            base file name: g6_access_orc
            input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
            output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns cdn,region,level,time,ip,domain,url,traffic
              columns.comments 
              columns.types string:string:string:string:string:string:string:bigint
              field.delim   
              file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
              name g6_hadoop.g6_access_orc
              numFiles 1
              numRows 260326
              rawDataSize 188215698
              serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}
              serialization.format  
              serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
              totalSize 8567798
              transient_lastDdlTime 1557676635
            serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde

              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns cdn,region,level,time,ip,domain,url,traffic
                columns.comments 
                columns.types string:string:string:string:string:string:string:bigint
                field.delim     
                file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
                name g6_hadoop.g6_access_orc
                numFiles 1
                numRows 260326
                rawDataSize 188215698
                serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}
                serialization.format    
                serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                totalSize 8567798
                transient_lastDdlTime 1557676635
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: g6_hadoop.g6_access_orc
            name: g6_hadoop.g6_access_orc
      Truncated Path -> Alias:
        /g6_hadoop.db/g6_access_orc [g6_access_orc]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            GlobalTableId: 1
            directory: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000
            NumFilesPerFileSink: 1
            Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE
            Stats Publishing Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                properties:
                  COLUMN_STATS_ACCURATE true
                  bucket_count -1
                  columns domain,num
                  columns.comments 
                  columns.types string:bigint
                  field.delim |
                  file.inputformat org.apache.hadoop.mapred.TextInputFormat
                  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain
                  name g6_hadoop.g6_access_orc_explain
                  numFiles 1
                  numRows 7
                  rawDataSize 149
                  serialization.ddl struct g6_access_orc_explain { string domain, i64 num}
                  serialization.format |
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  totalSize 156
                  transient_lastDdlTime 1558661108
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: g6_hadoop.g6_access_orc_explain
            TotalFiles: 1
            GatherStats: true
            MultiFileSpray: false

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          source: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns domain,num
                columns.comments 
                columns.types string:bigint
                field.delim |
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain
                name g6_hadoop.g6_access_orc_explain
                numFiles 1
                numRows 7
                rawDataSize 149
                serialization.ddl struct g6_access_orc_explain { string domain, i64 num}
                serialization.format |
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 156
                transient_lastDdlTime 1558661108
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: g6_hadoop.g6_access_orc_explain

  Stage: Stage-2
    Stats-Aggr Operator
      Stats Aggregation Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/

Time taken: 1.359 seconds, Fetched: 198 row(s)

AUTHORIZATION

AUTHORIZATION :打印SQL運(yùn)行相關(guān)權(quán)限

hive (g6_hadoop)> explain AUTHORIZATION insert  OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
OK
Explain
INPUTS: 
  g6_hadoop@g6_access_orc
OUTPUTS: 
  g6_hadoop@g6_access_orc_explain
CURRENT_USER: 
  hadoop
OPERATION: 
  QUERY
AUTHORIZATION_FAILURES: 
  No privilege 'Update' found for outputs { database:g6_hadoop, table:g6_access_orc_explain}
  No privilege 'Select' found for inputs { database:g6_hadoop, table:g6_access_orc, columnName:domain}
Time taken: 0.599 seconds, Fetched: 11 row(s)

DEPENDENCY

DEPENDENCY:打印SQL輸入表的相關(guān)信息

hive (g6_hadoop)> explain DEPENDENCY insert  OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
Explain
{"input_partitions":[],"input_tables":[{"tablename":"g6_hadoop@g6_access_orc","tabletype":"MANAGED_TABLE"}]}
Time taken: 0.135 seconds, Fetched: 1 row(s)

到此,相信大家對“Hive的底層執(zhí)行流程”有了更深的了解,不妨來實(shí)際操作一番吧!這里是億速云網(wǎng)站,更多相關(guān)內(nèi)容可以進(jìn)入相關(guān)頻道進(jìn)行查詢,關(guān)注我們,繼續(xù)學(xué)習(xí)!

向AI問一下細(xì)節(jié)

免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場,如果涉及侵權(quán)請聯(lián)系站長郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。

AI