This section gives an overview of the initial_cost_hashjoin and final_cost_hashjoin functions, reached through join_search_one_level->make_join_rel->populate_joinrel_with_paths->add_paths_to_joinrel->hash_inner_and_outer in the dynamic-programming join search (standard_join_search). These functions compute the cost of a hash join.
Cost definitions
Note: the values actually used are taken from the server configuration (GUC parameters), not from the constant definitions below!
typedef double Cost; /* execution cost (in page-access units) */
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants! */
/* If you change these, update backend/utils/misc/postgresql.sample.conf */
#define DEFAULT_SEQ_PAGE_COST 1.0 //cost of a sequentially fetched page
#define DEFAULT_RANDOM_PAGE_COST 4.0 //cost of a randomly fetched page
#define DEFAULT_CPU_TUPLE_COST 0.01 //CPU cost of processing one tuple
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005 //CPU cost of processing one index tuple
#define DEFAULT_CPU_OPERATOR_COST 0.0025 //CPU cost of executing one operator or function call
#define DEFAULT_PARALLEL_TUPLE_COST 0.1 //cost of passing one tuple from a parallel worker to the leader
#define DEFAULT_PARALLEL_SETUP_COST 1000.0 //cost of setting up the parallel execution environment
#define DEFAULT_EFFECTIVE_CACHE_SIZE 524288 /* introduced earlier; measured in pages */
double seq_page_cost = DEFAULT_SEQ_PAGE_COST;
double random_page_cost = DEFAULT_RANDOM_PAGE_COST;
double cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
double cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
double cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
double parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST;
double parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
int effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
Cost disable_cost = 1.0e10;//a huge cost (1 followed by 10 zeros) that makes the planner effectively abandon this path
int max_parallel_workers_per_gather = 2;//number of workers used per Gather
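To show how these parameters enter a cost formula, here is a minimal stand-in that prices a sequential scan as pages read at seq_page_cost each plus one cpu_tuple_cost per tuple. This is a simplified sketch for intuition, not PostgreSQL's actual cost_seqscan from costsize.c; the values used are the defaults listed above.
/* Minimal sketch of a planner-style cost formula, using the default
 * parameter values above. Illustrative only -- not the real cost_seqscan. */
#include <stdio.h>

typedef double Cost;

static Cost
simple_seqscan_cost(double pages, double tuples)
{
    const Cost seq_page_cost = 1.0;     /* DEFAULT_SEQ_PAGE_COST */
    const Cost cpu_tuple_cost = 0.01;   /* DEFAULT_CPU_TUPLE_COST */

    /* sequential page fetches plus per-tuple CPU work */
    return pages * seq_page_cost + tuples * cpu_tuple_cost;
}

int
main(void)
{
    /* e.g. a 1000-page relation holding 100000 tuples */
    printf("total cost: %.2f\n", simple_seqscan_cost(1000.0, 100000.0)); /* 2000.00 */
    return 0;
}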
The hash join algorithm in pseudocode:
Step 1: build -- insert each row of the small table into a hash table
FOR small_table_row IN (SELECT * FROM small_table)
LOOP
slot := HASH(small_table_row.join_key);
INSERT_HASH_TABLE(slot,small_table_row);
END LOOP;
Step 2: probe -- hash each row of the large table and look for matches
FOR large_table_row IN (SELECT * FROM large_table) LOOP
slot := HASH(large_table_row.join_key);
small_table_row := LOOKUP_HASH_TABLE(slot,large_table_row.join_key);
IF small_table_row FOUND THEN
output small_table_row + large_table_row;
END IF;
END LOOP;
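As a concrete (and deliberately simplified) illustration of the two phases, the self-contained C sketch below builds a chained hash table from a small array and probes it with a larger one. The bucket layout, hash function, and data are assumptions for the example only, not the executor's implementation.
/* Minimal in-memory hash join over integer keys, mirroring the two-phase
 * pseudocode above. Illustrative sketch, not PostgreSQL executor code. */
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024            /* power of 2, as the executor also requires */

typedef struct HashEntry
{
    int key;                     /* join key of a small-table row */
    int payload;                 /* stand-in for the rest of the row */
    struct HashEntry *next;      /* chain for collisions */
} HashEntry;

static HashEntry *buckets[NBUCKETS];

static unsigned int
hash_key(int key)
{
    return ((unsigned int) key * 2654435761u) & (NBUCKETS - 1);
}

int
main(void)
{
    int small_keys[] = {1, 3, 5, 7};
    int large_keys[] = {2, 3, 3, 5, 8};
    size_t i;

    /* Step 1: build -- hash every small-table row into its bucket */
    for (i = 0; i < sizeof(small_keys) / sizeof(small_keys[0]); i++)
    {
        HashEntry *e = malloc(sizeof(HashEntry));
        unsigned int slot = hash_key(small_keys[i]);

        e->key = small_keys[i];
        e->payload = (int) i;
        e->next = buckets[slot];
        buckets[slot] = e;
    }

    /* Step 2: probe -- look up each large-table row and emit matches */
    for (i = 0; i < sizeof(large_keys) / sizeof(large_keys[0]); i++)
    {
        unsigned int slot = hash_key(large_keys[i]);
        HashEntry *e;

        for (e = buckets[slot]; e != NULL; e = e->next)
            if (e->key == large_keys[i])
                printf("match: key=%d payload=%d\n", large_keys[i], e->payload);
    }
    return 0;
}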
initial_cost_hashjoin produces a preliminary (lower-bound) estimate of the cost of a hash join path; final_cost_hashjoin then computes the final cost.
initial_cost_hashjoin
//----------------------------------------------------------------------------- initial_cost_hashjoin
/*
* initial_cost_hashjoin
* Preliminary estimate of the cost of a hashjoin path.
 * (For the general commentary, compare the merge join and nestloop join cost functions.)
* This must quickly produce lower-bound estimates of the path's startup and
* total costs. If we are unable to eliminate the proposed path from
* consideration using the lower bounds, final_cost_hashjoin will be called
* to obtain the final estimates.
*
* The exact division of labor between this function and final_cost_hashjoin
* is private to them, and represents a tradeoff between speed of the initial
* estimate and getting a tight lower bound. We choose to not examine the
* join quals here (other than by counting the number of hash clauses),
* so we can't do much with CPU costs. We do assume that
* ExecChooseHashTableSize is cheap enough to use here.
*
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
* other data to be used by final_cost_hashjoin
* 'jointype' is the type of join to be performed
* 'hashclauses' is the list of joinclauses to be used as hash clauses
* 'outer_path' is the outer input to the join
* 'inner_path' is the inner input to the join
* 'extra' contains miscellaneous information about the join
* 'parallel_hash' indicates that inner_path is partial and that a shared
* hash table will be built in parallel
*/
void
initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
JoinType jointype,
List *hashclauses,
Path *outer_path, Path *inner_path,
JoinPathExtraData *extra,
bool parallel_hash)
{
Cost startup_cost = 0;
Cost run_cost = 0;
double outer_path_rows = outer_path->rows;
double inner_path_rows = inner_path->rows;
double inner_path_rows_total = inner_path_rows;
int num_hashclauses = list_length(hashclauses);
int numbuckets;
int numbatches;
int num_skew_mcvs;
size_t space_allowed; /* unused */
/* cost of source data */
startup_cost += outer_path->startup_cost;
run_cost += outer_path->total_cost - outer_path->startup_cost;
startup_cost += inner_path->total_cost;
/*
* Cost of computing hash function: must do it once per input tuple. We
* charge one cpu_operator_cost for each column's hash function. Also,
* tack on one cpu_tuple_cost per inner row, to model the costs of
* inserting the row into the hashtable.
*
* XXX when a hashclause is more complex than a single operator, we really
* should charge the extra eval costs of the left or right side, as
* appropriate, here. This seems more work than it's worth at the moment.
*/
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
* inner_path_rows;
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;
/*
* If this is a parallel hash build, then the value we have for
* inner_rows_total currently refers only to the rows returned by each
* participant. For shared hash table size estimation, we need the total
* number, so we need to undo the division.
*/
if (parallel_hash)
inner_path_rows_total *= get_parallel_divisor(inner_path);
/*
* Get hash table size that executor would use for inner relation.
*
* XXX for the moment, always assume that skew optimization will be
* performed. As long as SKEW_WORK_MEM_PERCENT is small, it's not worth
* trying to determine that for sure.
*
* XXX at some point it might be interesting to try to account for skew
* optimization in the cost estimate, but for now, we don't.
*/
ExecChooseHashTableSize(inner_path_rows_total,
inner_path->pathtarget->width,
true, /* useskew */
parallel_hash, /* try_combined_work_mem */
outer_path->parallel_workers,
&space_allowed,
&numbuckets,
&numbatches,
&num_skew_mcvs);
/*
* If inner relation is too big then we will need to "batch" the join,
* which implies writing and reading most of the tuples to disk an extra
* time. Charge seq_page_cost per page, since the I/O should be nice and
* sequential. Writing the inner rel counts as startup cost, all the rest
* as run cost.
*/
if (numbatches > 1)
{
double outerpages = page_size(outer_path_rows,
outer_path->pathtarget->width);
double innerpages = page_size(inner_path_rows,
inner_path->pathtarget->width);
startup_cost += seq_page_cost * innerpages;
run_cost += seq_page_cost * (innerpages + 2 * outerpages);
}
/* CPU costs left for later */
/* Public result fields */
workspace->startup_cost = startup_cost;
workspace->total_cost = startup_cost + run_cost;
/* Save private data for final_cost_hashjoin */
workspace->run_cost = run_cost;
workspace->numbuckets = numbuckets;
workspace->numbatches = numbatches;
workspace->inner_rows_total = inner_path_rows_total;
}
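Plugging the gdb-session values below into the build-side term above, (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows = (0.0025 * 1 + 0.01) * 100000 = 1250, which is the charge added to startup_cost for hashing and inserting the inner rows. A standalone sketch of just that term (not PostgreSQL code):
/* Standalone check of the build-side charge in initial_cost_hashjoin,
 * using the values observed in the gdb session below. */
#include <stdio.h>

int
main(void)
{
    double cpu_operator_cost = 0.0025;  /* DEFAULT_CPU_OPERATOR_COST */
    double cpu_tuple_cost = 0.01;       /* DEFAULT_CPU_TUPLE_COST */
    int num_hashclauses = 1;            /* t_dwxx.dwbh = t_grxx.dwbh */
    double inner_path_rows = 100000;

    /* hash each inner tuple once per clause, plus one cpu_tuple_cost
     * to insert it into the hash table */
    double build_charge =
        (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows;

    printf("build charge: %.1f\n", build_charge);   /* 1250.0 */
    return 0;
}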
//----------------------------------------------------------- ExecChooseHashTableSize
/*
* Compute appropriate size for hashtable given the estimated size of the
* relation to be hashed (number of rows and average row width).
*
* This is exported so that the planner's costsize.c can use it.
*/
/* Target bucket loading (tuples per bucket) */
#define NTUP_PER_BUCKET 1
void
ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
bool try_combined_work_mem,
int parallel_workers,
size_t *space_allowed,
int *numbuckets,
int *numbatches,
int *num_skew_mcvs)
{
int tupsize;
double inner_rel_bytes;
long bucket_bytes;
long hash_table_bytes;
long skew_table_bytes;
long max_pointers;
long mppow2;
int nbatch = 1;
int nbuckets;
double dbuckets;
/* Force a plausible relation size if no info */
//no row-count statistics: assume 1000 tuples
if (ntuples <= 0.0)
ntuples = 1000.0;
/*
* Estimate tupsize based on footprint of tuple in hashtable... note this
* does not allow for any palloc overhead. The manipulations of spaceUsed
* don't count palloc overhead either.
*/
tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(SizeofMinimalTupleHeader) +
MAXALIGN(tupwidth);
inner_rel_bytes = ntuples * tupsize;//number of tuples * tuple size
/*
* Target in-memory hashtable size is work_mem kilobytes.
*/
hash_table_bytes = work_mem * 1024L;
/*
* Parallel Hash tries to use the combined work_mem of all workers to
* avoid the need to batch. If that won't work, it falls back to work_mem
* per worker and tries to process batches in parallel.
*/
if (try_combined_work_mem)
hash_table_bytes += hash_table_bytes * parallel_workers;
*space_allowed = hash_table_bytes;
/*
* If skew optimization is possible, estimate the number of skew buckets
* that will fit in the memory allowed, and decrement the assumed space
* available for the main hash table accordingly.
*
* We make the optimistic assumption that each skew bucket will contain
* one inner-relation tuple. If that turns out to be low, we will recover
* at runtime by reducing the number of skew buckets.
*
* hashtable->skewBucket will have up to 8 times as many HashSkewBucket
* pointers as the number of MCVs we allow, since ExecHashBuildSkewHash
* will round up to the next power of 2 and then multiply by 4 to reduce
* collisions.
*/
if (useskew)//use skew optimization (join key values are non-uniformly distributed)
{
skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;
/*----------
* Divisor is:
* size of a hash tuple +
* worst-case size of skewBucket[] per MCV +
* size of skewBucketNums[] entry +
* size of skew bucket struct itself
*----------
*/
*num_skew_mcvs = skew_table_bytes / (tupsize +
(8 * sizeof(HashSkewBucket *)) +
sizeof(int) +
SKEW_BUCKET_OVERHEAD);
if (*num_skew_mcvs > 0)
hash_table_bytes -= skew_table_bytes;
}
else
*num_skew_mcvs = 0;
/*
* Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
* memory is filled, assuming a single batch; but limit the value so that
* the pointer arrays we'll try to allocate do not exceed work_mem nor
* MaxAllocSize.
*
* Note that both nbuckets and nbatch must be powers of 2 to make
* ExecHashGetBucketAndBatch fast.
*/
max_pointers = *space_allowed / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
/* If max_pointers isn't a power of 2, must round it down to one */
//round max_pointers down to a power of 2
mppow2 = 1L << my_log2(max_pointers);
if (max_pointers != mppow2)
max_pointers = mppow2 / 2;
/* Also ensure we avoid integer overflow in nbatch and nbuckets */
/* (this step is redundant given the current value of MaxAllocSize) */
max_pointers = Min(max_pointers, INT_MAX / 2);
dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
dbuckets = Min(dbuckets, max_pointers);
nbuckets = (int) dbuckets;
/* don't let nbuckets be really small, though ... */
//nbuckets is at least 1024
nbuckets = Max(nbuckets, 1024);
/* ... and force it to be a power of 2. */
//round nbuckets up to a power of 2
nbuckets = 1 << my_log2(nbuckets);
/*
* If there's not enough space to store the projected number of tuples and
* the required bucket headers, we will need multiple batches.
*/
bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{
//inner relation bytes + bucket array bytes exceed the hash table budget: split into batches
/* We'll need multiple batches */
long lbuckets;
double dbatch;
int minbatch;
long bucket_size;
/*
* If Parallel Hash with combined work_mem would still need multiple
* batches, we'll have to fall back to regular work_mem budget.
*/
if (try_combined_work_mem)
{
ExecChooseHashTableSize(ntuples, tupwidth, useskew,
false, parallel_workers,
space_allowed,
numbuckets,
numbatches,
num_skew_mcvs);
return;
}
/*
* Estimate the number of buckets we'll want to have when work_mem is
* entirely full. Each bucket will contain a bucket pointer plus
* NTUP_PER_BUCKET tuples, whose projected size already includes
* overhead for the hash code, pointer to the next tuple, etc.
*/
bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);
lbuckets = Min(lbuckets, max_pointers);
nbuckets = (int) lbuckets;
nbuckets = 1 << my_log2(nbuckets);
bucket_bytes = nbuckets * sizeof(HashJoinTuple);
/*
* Buckets are simple pointers to hashjoin tuples, while tupsize
* includes the pointer, hash code, and MinimalTupleData. So buckets
* should never really exceed 25% of work_mem (even for
* NTUP_PER_BUCKET=1); except maybe for work_mem values that are not
* 2^N bytes, where we might get more because of doubling. So let's
* look for 50% here.
*/
Assert(bucket_bytes <= hash_table_bytes / 2);
/* Calculate required number of batches. */
dbatch = ceil(inner_rel_bytes / (hash_table_bytes - bucket_bytes));
dbatch = Min(dbatch, max_pointers);
minbatch = (int) dbatch;
nbatch = 2;
while (nbatch < minbatch)
nbatch <<= 1;
}
Assert(nbuckets > 0);
Assert(nbatch > 0);
*numbuckets = nbuckets;
*numbatches = nbatch;
}
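To make the sizing arithmetic concrete, the sketch below replays the main path of ExecChooseHashTableSize with the values observed in the gdb session further down (ntuples=100000, tupsize=48, hash_table_bytes=4,110,418 after the skew reservation, max_pointers=524,288). my_log2 is re-implemented here on the assumption that it rounds up to the next power of two, 8-byte bucket pointers are assumed, and the skew computation is omitted; this is a sketch, not the executor code.
/* Replays the bucket/batch sizing of ExecChooseHashTableSize for the
 * values observed in the gdb session below. Standalone sketch only. */
#include <math.h>
#include <stdio.h>

static int
my_log2(long num)
{
    /* assumed behavior: smallest i such that (1 << i) >= num */
    int i = 0;

    while ((1L << i) < num)
        i++;
    return i;
}

int
main(void)
{
    double ntuples = 100000.0;
    int tupsize = 48;                   /* from the gdb session */
    long hash_table_bytes = 4110418L;   /* work_mem share left after skew */
    long max_pointers = 524288L;
    double inner_rel_bytes = ntuples * tupsize;     /* 4800000 */
    int nbuckets;
    long bucket_bytes;
    int nbatch = 1;

    /* single-batch attempt: one bucket per tuple (NTUP_PER_BUCKET=1),
     * rounded up to a power of 2 */
    nbuckets = 1 << my_log2((long) ceil(ntuples));          /* 131072 */
    bucket_bytes = (long) sizeof(void *) * nbuckets;        /* 1048576 */

    if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
    {
        /* won't fit: recompute buckets for a completely full work_mem ... */
        long bucket_size = tupsize + (long) sizeof(void *); /* 56 */
        long lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);

        if (lbuckets > max_pointers)
            lbuckets = max_pointers;
        nbuckets = (int) lbuckets;                          /* 131072 */
        bucket_bytes = nbuckets * (long) sizeof(void *);

        /* ... and round the required batch count up to a power of 2 */
        nbatch = 2;
        while (nbatch < (int) ceil(inner_rel_bytes /
                                   (double) (hash_table_bytes - bucket_bytes)))
            nbatch <<= 1;
    }
    printf("nbuckets=%d nbatch=%d\n", nbuckets, nbatch);    /* 131072 2 */
    return 0;
}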
final_cost_hashjoin
//----------------------------------------------------------------------------- final_cost_hashjoin
/*
* final_cost_hashjoin
* Final estimate of the cost and result size of a hashjoin path.
*
* Note: the numbatches estimate is also saved into 'path' for use later
*
* 'path' is already filled in except for the rows and cost fields and
* num_batches
* 'workspace' is the result from initial_cost_hashjoin
* 'extra' contains miscellaneous information about the join
*/
void
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
JoinCostWorkspace *workspace,
JoinPathExtraData *extra)
{
Path *outer_path = path->jpath.outerjoinpath;
Path *inner_path = path->jpath.innerjoinpath;
double outer_path_rows = outer_path->rows;
double inner_path_rows = inner_path->rows;
double inner_path_rows_total = workspace->inner_rows_total;
List *hashclauses = path->path_hashclauses;
Cost startup_cost = workspace->startup_cost;
Cost run_cost = workspace->run_cost;
int numbuckets = workspace->numbuckets;
int numbatches = workspace->numbatches;
Cost cpu_per_tuple;
QualCost hash_qual_cost;
QualCost qp_qual_cost;
double hashjointuples;
double virtualbuckets;
Selectivity innerbucketsize;
Selectivity innermcvfreq;
ListCell *hcl;
/* Mark the path with the correct row estimate */
if (path->jpath.path.param_info)
path->jpath.path.rows = path->jpath.path.param_info->ppi_rows;
else
path->jpath.path.rows = path->jpath.path.parent->rows;
/* For partial paths, scale row estimate. */
if (path->jpath.path.parallel_workers > 0)
{
double parallel_divisor = get_parallel_divisor(&path->jpath.path);
path->jpath.path.rows =
clamp_row_est(path->jpath.path.rows / parallel_divisor);
}
/*
* We could include disable_cost in the preliminary estimate, but that
* would amount to optimizing for the case where the join method is
* disabled, which doesn't seem like the way to bet.
*/
if (!enable_hashjoin)
startup_cost += disable_cost;//hash join disabled: add the huge disable_cost
/* mark the path with estimated # of batches */
path->num_batches = numbatches;//number of batches
/* store the total number of tuples (sum of partial row estimates) */
path->inner_rows_total = inner_path_rows_total;
/* and compute the number of "virtual" buckets in the whole join */
virtualbuckets = (double) numbuckets * (double) numbatches;
/*
* Determine bucketsize fraction and MCV frequency for the inner relation.
* We use the smallest bucketsize or MCV frequency estimated for any
* individual hashclause; this is undoubtedly conservative.
*
* BUT: if inner relation has been unique-ified, we can assume it's good
* for hashing. This is important both because it's the right answer, and
* because we avoid contaminating the cache with a value that's wrong for
* non-unique-ified paths.
*/
if (IsA(inner_path, UniquePath))
{
//inner relation known to be unique
innerbucketsize = 1.0 / virtualbuckets;
innermcvfreq = 0.0;
}
else//inner relation not known unique
{
innerbucketsize = 1.0;
innermcvfreq = 1.0;
foreach(hcl, hashclauses)//iterate over the hash clauses
{
RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
Selectivity thisbucketsize;
Selectivity thismcvfreq;
/*
* First we have to figure out which side of the hashjoin clause
* is the inner side.
*
* Since we tend to visit the same clauses over and over when
* planning a large query, we cache the bucket stats estimates in
* the RestrictInfo node to avoid repeated lookups of statistics.
*/
if (bms_is_subset(restrictinfo->right_relids,
inner_path->parent->relids))
{
/* righthand side is inner */
thisbucketsize = restrictinfo->right_bucketsize;
if (thisbucketsize < 0)
{
/* not cached yet */
estimate_hash_bucket_stats(root,
get_rightop(restrictinfo->clause),
virtualbuckets,
&restrictinfo->right_mcvfreq,
&restrictinfo->right_bucketsize);
thisbucketsize = restrictinfo->right_bucketsize;
}
thismcvfreq = restrictinfo->right_mcvfreq;
}
else
{
Assert(bms_is_subset(restrictinfo->left_relids,
inner_path->parent->relids));
/* lefthand side is inner */
thisbucketsize = restrictinfo->left_bucketsize;
if (thisbucketsize < 0)
{
/* not cached yet */
estimate_hash_bucket_stats(root,
get_leftop(restrictinfo->clause),
virtualbuckets,
&restrictinfo->left_mcvfreq,
&restrictinfo->left_bucketsize);
thisbucketsize = restrictinfo->left_bucketsize;
}
thismcvfreq = restrictinfo->left_mcvfreq;
}
if (innerbucketsize > thisbucketsize)
innerbucketsize = thisbucketsize;
if (innermcvfreq > thismcvfreq)
innermcvfreq = thismcvfreq;
}
}
/*
* If the bucket holding the inner MCV would exceed work_mem, we don't
* want to hash unless there is really no other alternative, so apply
* disable_cost. (The executor normally copes with excessive memory usage
* by splitting batches, but obviously it cannot separate equal values
* that way, so it will be unable to drive the batch size below work_mem
* when this is true.)
*/
if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
inner_path->pathtarget->width) >
(work_mem * 1024L))
startup_cost += disable_cost;//if the bucket holding the inner MCV would exceed work_mem, effectively disable hash join
/*
* Compute cost of the hashquals and qpquals (other restriction clauses)
* separately.
*/
cost_qual_eval(&hash_qual_cost, hashclauses, root);
cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
qp_qual_cost.startup -= hash_qual_cost.startup;
qp_qual_cost.per_tuple -= hash_qual_cost.per_tuple;
/* CPU costs */
if (path->jpath.jointype == JOIN_SEMI ||
path->jpath.jointype == JOIN_ANTI ||
extra->inner_unique)
{
double outer_matched_rows;
Selectivity inner_scan_frac;
/*
* With a SEMI or ANTI join, or if the innerrel is known unique, the
* executor will stop after the first match.
*
* For an outer-rel row that has at least one match, we can expect the
* bucket scan to stop after a fraction 1/(match_count+1) of the
* bucket's rows, if the matches are evenly distributed. Since they
* probably aren't quite evenly distributed, we apply a fuzz factor of
* 2.0 to that fraction. (If we used a larger fuzz factor, we'd have
* to clamp inner_scan_frac to at most 1.0; but since match_count is
* at least 1, no such clamp is needed now.)
*/
outer_matched_rows = rint(outer_path_rows * extra->semifactors.outer_match_frac);
inner_scan_frac = 2.0 / (extra->semifactors.match_count + 1.0);
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple * outer_matched_rows *
clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;
/*
* For unmatched outer-rel rows, the picture is quite a lot different.
* In the first place, there is no reason to assume that these rows
* preferentially hit heavily-populated buckets; instead assume they
* are uncorrelated with the inner distribution and so they see an
* average bucket size of inner_path_rows / virtualbuckets. In the
* second place, it seems likely that they will have few if any exact
* hash-code matches and so very few of the tuples in the bucket will
* actually require eval of the hash quals. We don't have any good
* way to estimate how many will, but for the moment assume that the
* effective cost per bucket entry is one-tenth what it is for
* matchable tuples.
*/
run_cost += hash_qual_cost.per_tuple *
(outer_path_rows - outer_matched_rows) *
clamp_row_est(inner_path_rows / virtualbuckets) * 0.05;
/* Get # of tuples that will pass the basic join */
if (path->jpath.jointype == JOIN_ANTI)
hashjointuples = outer_path_rows - outer_matched_rows;
else
hashjointuples = outer_matched_rows;
}
else
{
/*
* The number of tuple comparisons needed is the number of outer
* tuples times the typical number of tuples in a hash bucket, which
* is the inner relation size times its bucketsize fraction. At each
* one, we need to evaluate the hashjoin quals. But actually,
* charging the full qual eval cost at each tuple is pessimistic,
* since we don't evaluate the quals unless the hash values match
* exactly. For lack of a better idea, halve the cost estimate to
* allow for that.
*/
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
/*
* Get approx # tuples passing the hashquals. We use
* approx_tuple_count here because we need an estimate done with
* JOIN_INNER semantics.
*/
hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
}
/*
* For each tuple that gets through the hashjoin proper, we charge
* cpu_tuple_cost plus the cost of evaluating additional restriction
* clauses that are to be applied at the join. (This is pessimistic since
* not all of the quals may get evaluated at each tuple.)
*/
startup_cost += qp_qual_cost.startup;
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
run_cost += cpu_per_tuple * hashjointuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->jpath.path.pathtarget->cost.startup;
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;
path->jpath.path.startup_cost = startup_cost;
path->jpath.path.total_cost = startup_cost + run_cost;
}
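As a rough worked example of the probe-side charge in the plain inner-join branch above, the sketch below uses the innerbucketsize cached in the RestrictInfo dump from the gdb session (about 0.0001) together with hypothetical values for the outer row count and the per-tuple qual cost; those two inputs are assumptions, not session data, and clamp_row_est is re-implemented here on the assumption that it mirrors the planner's clamp.
/* Hypothetical illustration of the probe-side CPU charge in the
 * non-semi branch of final_cost_hashjoin. innerbucketsize comes from
 * the RestrictInfo dump in the gdb session; outer_path_rows and the
 * per-tuple qual cost are assumed values. */
#include <math.h>
#include <stdio.h>

static double
clamp_row_est(double nrows)
{
    /* assumed to mirror the planner's clamp: at least 1 row, integral */
    return (nrows <= 1.0) ? 1.0 : rint(nrows);
}

int
main(void)
{
    double hash_qual_per_tuple = 0.0025;    /* one operator per comparison (assumed) */
    double outer_path_rows = 10000.0;       /* hypothetical outer size */
    double inner_path_rows = 100000.0;
    double innerbucketsize = 0.00010013016921998598;    /* right_bucketsize */

    /* comparisons = outer rows * typical bucket population, halved since
     * the quals are evaluated only on exact hash-code matches */
    double run_charge = hash_qual_per_tuple * outer_path_rows *
        clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

    printf("probe-side qual charge: %.2f\n", run_charge);   /* 125.00 */
    return 0;
}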
The test script is as follows:
select a.*,b.grbh,b.je
from t_dwxx a,
lateral (select t1.dwbh,t1.grbh,t2.je
from t_grxx t1
inner join t_jfxx t2 on t1.dwbh = a.dwbh and t1.grbh = t2.grbh) b
order by b.dwbh;
Start gdb and set a breakpoint:
(gdb) b try_hashjoin_path
Breakpoint 1 at 0x7af2a4: file joinpath.c, line 737.
Enter try_hashjoin_path; the join type is JOIN_INNER:
(gdb) c
Continuing.
Breakpoint 1, try_hashjoin_path (root=0x1b73980, joinrel=0x1b91d30, outer_path=0x1b866c0, inner_path=0x1b8e780,
hashclauses=0x1b93200, jointype=JOIN_INNER, extra=0x7ffc09955e80) at joinpath.c:737
737 required_outer = calc_non_nestloop_required_outer(outer_path,
Step into initial_cost_hashjoin:
(gdb) n
739 if (required_outer &&
(gdb)
751 initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
(gdb) step
initial_cost_hashjoin (root=0x1b73980, workspace=0x7ffc09955d00, jointype=JOIN_INNER, hashclauses=0x1b93200,
outer_path=0x1b866c0, inner_path=0x1b8e780, extra=0x7ffc09955e80, parallel_hash=false) at costsize.c:3160
3160 Cost startup_cost = 0;
Initial assignments. There is only one hash clause, t_dwxx.dwbh = t_grxx.dwbh (see the previous section for the join details):
3160 Cost startup_cost = 0;
(gdb) n
3161 Cost run_cost = 0;
(gdb)
3162 double outer_path_rows = outer_path->rows;
(gdb)
3163 double inner_path_rows = inner_path->rows;
(gdb)
3164 double inner_path_rows_total = inner_path_rows;
(gdb)
3165 int num_hashclauses = list_length(hashclauses);
(gdb)
3172 startup_cost += outer_path->startup_cost;
(gdb) p num_hashclauses
$1 = 1
Step into ExecChooseHashTableSize, which computes an appropriate hash table size from the estimated size of the relation to be hashed (row count and average row width):
...
(gdb) step
ExecChooseHashTableSize (ntuples=100000, tupwidth=9, useskew=true, try_combined_work_mem=false, parallel_workers=0,
space_allowed=0x7ffc09955c38, numbuckets=0x7ffc09955c4c, numbatches=0x7ffc09955c48, num_skew_mcvs=0x7ffc09955c44)
at nodeHash.c:677
Tuple size -> 48 bytes, inner relation size -> 4,800,000 bytes (about 4.58 MB), hash table size -> 4,194,304 bytes (4 MB):
(gdb) n
690 tupsize = HJTUPLE_OVERHEAD +
(gdb)
693 inner_rel_bytes = ntuples * tupsize;
(gdb)
698 hash_table_bytes = work_mem * 1024L;
(gdb)
705 if (try_combined_work_mem)
(gdb) p tupsize
$2 = 48
(gdb) p inner_rel_bytes
$3 = 4800000
(gdb) p hash_table_bytes
$4 = 4194304
Skew optimization is used:
(gdb) n
708 *space_allowed = hash_table_bytes;
(gdb)
724 if (useskew)
(gdb)
726 skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;
(gdb) p useskew
$5 = true
The number of skew MCV buckets is 635, the adjusted hash table size is 4,110,418 bytes, and the maximum pointer count is 524,288:
...
(gdb) p *num_skew_mcvs
$7 = 635
(gdb) p hash_table_bytes
$8 = 4110418
...
(gdb) n
756 max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
(gdb)
758 mppow2 = 1L << my_log2(max_pointers);
(gdb)
759 if (max_pointers != mppow2)
(gdb)
764 max_pointers = Min(max_pointers, INT_MAX / 2);
(gdb)
766 dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
(gdb) p max_pointers
$10 = 524288
The computed bucket count is 131,072:
...
(gdb) n
767 dbuckets = Min(dbuckets, max_pointers);
(gdb)
768 nbuckets = (int) dbuckets;
(gdb)
770 nbuckets = Max(nbuckets, 1024);
(gdb)
772 nbuckets = 1 << my_log2(nbuckets);
(gdb)
778 bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
(gdb)
779 if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
(gdb) p nbuckets
$11 = 131072
There is not enough space for a single batch, so multiple batches are needed:
779 if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
(gdb) n
791 if (try_combined_work_mem)
The bucket count (131,072), the number of batches (2), and related values are computed:
...
(gdb)
837 *numbuckets = nbuckets;
(gdb)
838 *numbatches = nbatch;
(gdb) p nbuckets
$12 = 131072
(gdb) p nbatch
$13 = 2
Back in initial_cost_hashjoin:
(gdb)
initial_cost_hashjoin (root=0x1b73980, workspace=0x7ffc09955d00, jointype=JOIN_INNER, hashclauses=0x1b93200,
outer_path=0x1b866c0, inner_path=0x1b8e780, extra=0x7ffc09955e80, parallel_hash=false) at costsize.c:3226
3226 if (numbatches > 1)
initial_cost_hashjoin finishes; inspect the results:
(gdb)
3246 workspace->inner_rows_total = inner_path_rows_total;
(gdb)
3247 }
(gdb) p workspace
$14 = (JoinCostWorkspace *) 0x7ffc09955d00
(gdb) p *workspace
$15 = {startup_cost = 3465, total_cost = 4261, run_cost = 796, inner_run_cost = 0,
inner_rescan_run_cost = 6.9525149533036983e-310, outer_rows = 3.7882102964330281e-317,
inner_rows = 1.4285501039407471e-316, outer_skip_rows = 1.4285501039407471e-316,
inner_skip_rows = 6.9525149532894692e-310, numbuckets = 131072, numbatches = 2, inner_rows_total = 100000}
(gdb)
Set a breakpoint and enter final_cost_hashjoin:
(gdb) b final_cost_hashjoin
Breakpoint 2 at 0x7a0b49: file costsize.c, line 3265.
(gdb) c
Continuing.
Breakpoint 2, final_cost_hashjoin (root=0x1b73980, path=0x1b93238, workspace=0x7ffc09955d00, extra=0x7ffc09955e80)
at costsize.c:3265
3265 Path *outer_path = path->jpath.outerjoinpath;
Assign the local variables and compute the number of virtual buckets:
3265 Path *outer_path = path->jpath.outerjoinpath;
(gdb) n
3266 Path *inner_path = path->jpath.innerjoinpath;
...
(gdb)
3314 virtualbuckets = (double) numbuckets * (double) numbatches;
(gdb)
3326 if (IsA(inner_path, UniquePath))
(gdb) p virtualbuckets
$16 = 262144
(gdb)
Iterate over the hash clauses to determine the bucketsize fraction and MCV frequency for the inner relation, caching the estimates in each clause's RestrictInfo:
3335 foreach(hcl, hashclauses)
(gdb)
3337 RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
The inner relation is on the right-hand side (RTE=3, i.e. t_grxx); the cached stats are updated:
3349 if (bms_is_subset(restrictinfo->right_relids,
(gdb)
3353 thisbucketsize = restrictinfo->right_bucketsize;
(gdb)
3354 if (thisbucketsize < 0)
(gdb)
3357 estimate_hash_bucket_stats(root,
(gdb)
3358 get_rightop(restrictinfo->clause),
(gdb)
3357 estimate_hash_bucket_stats(root,
(gdb)
3362 thisbucketsize = restrictinfo->right_bucketsize;
(gdb) p *restrictinfo
$17 = {type = T_RestrictInfo, clause = 0x1b87260, is_pushed_down = true, outerjoin_delayed = false, can_join = true,
pseudoconstant = false, leakproof = false, security_level = 0, clause_relids = 0x1b87380, required_relids = 0x1b86c68,
outer_relids = 0x0, nullable_relids = 0x0, left_relids = 0x1b87340, right_relids = 0x1b87360, orclause = 0x0,
parent_ec = 0x1b85b88, eval_cost = {startup = 0, per_tuple = 0.0025000000000000001}, norm_selec = 0.0001,
outer_selec = -1, mergeopfamilies = 0x1b873c8, left_ec = 0x1b85b88, right_ec = 0x1b85b88, left_em = 0x1b85d58,
right_em = 0x1b85c80, scansel_cache = 0x1b92e50, outer_is_left = true, hashjoinoperator = 98, left_bucketsize = -1,
right_bucketsize = 0.00010013016921998598, left_mcvfreq = -1, right_mcvfreq = 0}
The rest of the computation is similar to what was described earlier (evaluating the qual costs and so on); it can be stepped through in the debugger.
The final result:
3505 }
(gdb) p *path
$19 = {jpath = {path = {type = T_HashPath, pathtype = T_HashJoin, parent = 0x1b91d30, pathtarget = 0x1b91f68,
param_info = 0x0, parallel_aware = false, parallel_safe = true, parallel_workers = 0, rows = 100000,
startup_cost = 3465, total_cost = 5386, pathkeys = 0x0}, jointype = JOIN_INNER, inner_unique = false,
outerjoinpath = 0x1b866c0, innerjoinpath = 0x1b8e780, joinrestrictinfo = 0x1b92208}, path_hashclauses = 0x1b93200,
num_batches = 2, inner_rows_total = 100000}
DONE!
References:
allpaths.c
cost.h
costsize.c
PostgreSQL documentation: Query Planning