This section gives an overview of the initial_cost_hashjoin and final_cost_hashjoin functions, reached through join_search_one_level->make_join_rel->populate_joinrel_with_paths->add_paths_to_joinrel->hash_inner_and_outer in the dynamic-programming join search (standard_join_search). These functions compute the cost of a hash join.
Cost definitions
Note: the values actually used are taken from the server configuration (GUC parameters), not from the constant definitions below!
typedef double Cost; /* execution cost (in page-access units) */
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants! */
/* If you change these, update backend/utils/misc/postgresql.sample.conf */
#define DEFAULT_SEQ_PAGE_COST 1.0 //cost of a sequentially fetched page
#define DEFAULT_RANDOM_PAGE_COST 4.0 //cost of a randomly fetched page
#define DEFAULT_CPU_TUPLE_COST 0.01 //CPU cost of processing one tuple
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005 //CPU cost of processing one index tuple
#define DEFAULT_CPU_OPERATOR_COST 0.0025 //CPU cost of executing one operator or function call
#define DEFAULT_PARALLEL_TUPLE_COST 0.1 //cost of passing one tuple from a parallel worker to the leader
#define DEFAULT_PARALLEL_SETUP_COST 1000.0 //cost of setting up the parallel execution environment
#define DEFAULT_EFFECTIVE_CACHE_SIZE 524288 /* introduced earlier; measured in pages */
double seq_page_cost = DEFAULT_SEQ_PAGE_COST;
double random_page_cost = DEFAULT_RANDOM_PAGE_COST;
double cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
double cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
double cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
double parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST;
double parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
int effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
Cost disable_cost = 1.0e10;//a huge cost (1 followed by 10 zeros) that makes the planner effectively abandon this path
int max_parallel_workers_per_gather = 2;//number of workers used per Gather
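To show how these parameters enter a cost formula, here is a minimal stand-in that prices a sequential scan as pages read at seq_page_cost each plus one cpu_tuple_cost per tuple. This is a simplified sketch for intuition, not PostgreSQL's actual cost_seqscan from costsize.c; the values used are the defaults listed above.
/* Minimal sketch of a planner-style cost formula, using the default
 * parameter values above. Illustrative only -- not the real cost_seqscan. */
#include <stdio.h>

typedef double Cost;

static Cost
simple_seqscan_cost(double pages, double tuples)
{
    const Cost seq_page_cost = 1.0;     /* DEFAULT_SEQ_PAGE_COST */
    const Cost cpu_tuple_cost = 0.01;   /* DEFAULT_CPU_TUPLE_COST */

    /* sequential page fetches plus per-tuple CPU work */
    return pages * seq_page_cost + tuples * cpu_tuple_cost;
}

int
main(void)
{
    /* e.g. a 1000-page relation holding 100000 tuples */
    printf("total cost: %.2f\n", simple_seqscan_cost(1000.0, 100000.0)); /* 2000.00 */
    return 0;
}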
The hash join algorithm in pseudocode:
Step 1: build -- insert each row of the small table into a hash table
FOR small_table_row IN (SELECT * FROM small_table)
LOOP
slot := HASH(small_table_row.join_key);
INSERT_HASH_TABLE(slot,small_table_row);
END LOOP;
Step 2: probe -- hash each row of the large table and look for matches
FOR large_table_row IN (SELECT * FROM large_table) LOOP
slot := HASH(large_table_row.join_key);
small_table_row := LOOKUP_HASH_TABLE(slot,large_table_row.join_key);
IF small_table_row FOUND THEN
output small_table_row + large_table_row;
END IF;
END LOOP;
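As a concrete (and deliberately simplified) illustration of the two phases, the self-contained C sketch below builds a chained hash table from a small array and probes it with a larger one. The bucket layout, hash function, and data are assumptions for the example only, not the executor's implementation.
/* Minimal in-memory hash join over integer keys, mirroring the two-phase
 * pseudocode above. Illustrative sketch, not PostgreSQL executor code. */
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024            /* power of 2, as the executor also requires */

typedef struct HashEntry
{
    int key;                     /* join key of a small-table row */
    int payload;                 /* stand-in for the rest of the row */
    struct HashEntry *next;      /* chain for collisions */
} HashEntry;

static HashEntry *buckets[NBUCKETS];

static unsigned int
hash_key(int key)
{
    return ((unsigned int) key * 2654435761u) & (NBUCKETS - 1);
}

int
main(void)
{
    int small_keys[] = {1, 3, 5, 7};
    int large_keys[] = {2, 3, 3, 5, 8};
    size_t i;

    /* Step 1: build -- hash every small-table row into its bucket */
    for (i = 0; i < sizeof(small_keys) / sizeof(small_keys[0]); i++)
    {
        HashEntry *e = malloc(sizeof(HashEntry));
        unsigned int slot = hash_key(small_keys[i]);

        e->key = small_keys[i];
        e->payload = (int) i;
        e->next = buckets[slot];
        buckets[slot] = e;
    }

    /* Step 2: probe -- look up each large-table row and emit matches */
    for (i = 0; i < sizeof(large_keys) / sizeof(large_keys[0]); i++)
    {
        unsigned int slot = hash_key(large_keys[i]);
        HashEntry *e;

        for (e = buckets[slot]; e != NULL; e = e->next)
            if (e->key == large_keys[i])
                printf("match: key=%d payload=%d\n", large_keys[i], e->payload);
    }
    return 0;
}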
initial_cost_hashjoin produces a preliminary (lower-bound) estimate of the cost of a hash join path; final_cost_hashjoin then computes the final cost.
initial_cost_hashjoin
//----------------------------------------------------------------------------- initial_cost_hashjoin
/*
* initial_cost_hashjoin
* Preliminary estimate of the cost of a hashjoin path.
 * (For the general commentary, compare the merge join and nestloop join cost functions.)
* This must quickly produce lower-bound estimates of the path's startup and
* total costs. If we are unable to eliminate the proposed path from
* consideration using the lower bounds, final_cost_hashjoin will be called
* to obtain the final estimates.
*
* The exact division of labor between this function and final_cost_hashjoin
* is private to them, and represents a tradeoff between speed of the initial
* estimate and getting a tight lower bound. We choose to not examine the
* join quals here (other than by counting the number of hash clauses),
* so we can't do much with CPU costs. We do assume that
* ExecChooseHashTableSize is cheap enough to use here.
*
* 'workspace' is to be filled with startup_cost, total_cost, and perhaps
* other data to be used by final_cost_hashjoin
* 'jointype' is the type of join to be performed
* 'hashclauses' is the list of joinclauses to be used as hash clauses
* 'outer_path' is the outer input to the join
* 'inner_path' is the inner input to the join
* 'extra' contains miscellaneous information about the join
* 'parallel_hash' indicates that inner_path is partial and that a shared
* hash table will be built in parallel
*/
void
initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
JoinType jointype,
List *hashclauses,
Path *outer_path, Path *inner_path,
JoinPathExtraData *extra,
bool parallel_hash)
{
Cost startup_cost = 0;
Cost run_cost = 0;
double outer_path_rows = outer_path->rows;
double inner_path_rows = inner_path->rows;
double inner_path_rows_total = inner_path_rows;
int num_hashclauses = list_length(hashclauses);
int numbuckets;
int numbatches;
int num_skew_mcvs;
size_t space_allowed; /* unused */
/* cost of source data */
startup_cost += outer_path->startup_cost;
run_cost += outer_path->total_cost - outer_path->startup_cost;
startup_cost += inner_path->total_cost;
/*
* Cost of computing hash function: must do it once per input tuple. We
* charge one cpu_operator_cost for each column's hash function. Also,
* tack on one cpu_tuple_cost per inner row, to model the costs of
* inserting the row into the hashtable.
*
* XXX when a hashclause is more complex than a single operator, we really
* should charge the extra eval costs of the left or right side, as
* appropriate, here. This seems more work than it's worth at the moment.
*/
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
* inner_path_rows;
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;
/*
* If this is a parallel hash build, then the value we have for
* inner_rows_total currently refers only to the rows returned by each
* participant. For shared hash table size estimation, we need the total
* number, so we need to undo the division.
*/
if (parallel_hash)
inner_path_rows_total *= get_parallel_divisor(inner_path);
/*
* Get hash table size that executor would use for inner relation.
*
* XXX for the moment, always assume that skew optimization will be
* performed. As long as SKEW_WORK_MEM_PERCENT is small, it's not worth
* trying to determine that for sure.
*
* XXX at some point it might be interesting to try to account for skew
* optimization in the cost estimate, but for now, we don't.
*/
ExecChooseHashTableSize(inner_path_rows_total,
inner_path->pathtarget->width,
true, /* useskew */
parallel_hash, /* try_combined_work_mem */
outer_path->parallel_workers,
&space_allowed,
&numbuckets,
&numbatches,
&num_skew_mcvs);
/*
* If inner relation is too big then we will need to "batch" the join,
* which implies writing and reading most of the tuples to disk an extra
* time. Charge seq_page_cost per page, since the I/O should be nice and
* sequential. Writing the inner rel counts as startup cost, all the rest
* as run cost.
*/
if (numbatches > 1)
{
double outerpages = page_size(outer_path_rows,
outer_path->pathtarget->width);
double innerpages = page_size(inner_path_rows,
inner_path->pathtarget->width);
startup_cost += seq_page_cost * innerpages;
run_cost += seq_page_cost * (innerpages + 2 * outerpages);
}
/* CPU costs left for later */
/* Public result fields */
workspace->startup_cost = startup_cost;
workspace->total_cost = startup_cost + run_cost;
/* Save private data for final_cost_hashjoin */
workspace->run_cost = run_cost;
workspace->numbuckets = numbuckets;
workspace->numbatches = numbatches;
workspace->inner_rows_total = inner_path_rows_total;
}
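Plugging the gdb-session values below into the build-side term above, (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows = (0.0025 * 1 + 0.01) * 100000 = 1250, which is the charge added to startup_cost for hashing and inserting the inner rows. A standalone sketch of just that term (not PostgreSQL code):
/* Standalone check of the build-side charge in initial_cost_hashjoin,
 * using the values observed in the gdb session below. */
#include <stdio.h>

int
main(void)
{
    double cpu_operator_cost = 0.0025;  /* DEFAULT_CPU_OPERATOR_COST */
    double cpu_tuple_cost = 0.01;       /* DEFAULT_CPU_TUPLE_COST */
    int num_hashclauses = 1;            /* t_dwxx.dwbh = t_grxx.dwbh */
    double inner_path_rows = 100000;

    /* hash each inner tuple once per clause, plus one cpu_tuple_cost
     * to insert it into the hash table */
    double build_charge =
        (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows;

    printf("build charge: %.1f\n", build_charge);   /* 1250.0 */
    return 0;
}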
//----------------------------------------------------------- ExecChooseHashTableSize
/*
* Compute appropriate size for hashtable given the estimated size of the
* relation to be hashed (number of rows and average row width).
*
* This is exported so that the planner's costsize.c can use it.
*/
/* Target bucket loading (tuples per bucket) */
#define NTUP_PER_BUCKET 1
void
ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
bool try_combined_work_mem,
int parallel_workers,
size_t *space_allowed,
int *numbuckets,
int *numbatches,
int *num_skew_mcvs)
{
int tupsize;
double inner_rel_bytes;
long bucket_bytes;
long hash_table_bytes;
long skew_table_bytes;
long max_pointers;
long mppow2;
int nbatch = 1;
int nbuckets;
double dbuckets;
/* Force a plausible relation size if no info */
//no row-count statistics: assume 1000 tuples
if (ntuples <= 0.0)
ntuples = 1000.0;
/*
* Estimate tupsize based on footprint of tuple in hashtable... note this
* does not allow for any palloc overhead. The manipulations of spaceUsed
* don't count palloc overhead either.
*/
tupsize = HJTUPLE_OVERHEAD +
MAXALIGN(SizeofMinimalTupleHeader) +
MAXALIGN(tupwidth);
inner_rel_bytes = ntuples * tupsize;//number of tuples * tuple size
/*
* Target in-memory hashtable size is work_mem kilobytes.
*/
hash_table_bytes = work_mem * 1024L;
/*
* Parallel Hash tries to use the combined work_mem of all workers to
* avoid the need to batch. If that won't work, it falls back to work_mem
* per worker and tries to process batches in parallel.
*/
if (try_combined_work_mem)
hash_table_bytes += hash_table_bytes * parallel_workers;
*space_allowed = hash_table_bytes;
/*
* If skew optimization is possible, estimate the number of skew buckets
* that will fit in the memory allowed, and decrement the assumed space
* available for the main hash table accordingly.
*
* We make the optimistic assumption that each skew bucket will contain
* one inner-relation tuple. If that turns out to be low, we will recover
* at runtime by reducing the number of skew buckets.
*
* hashtable->skewBucket will have up to 8 times as many HashSkewBucket
* pointers as the number of MCVs we allow, since ExecHashBuildSkewHash
* will round up to the next power of 2 and then multiply by 4 to reduce
* collisions.
*/
if (useskew)//use skew optimization (join key values are non-uniformly distributed)
{
skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;
/*----------
* Divisor is:
* size of a hash tuple +
* worst-case size of skewBucket[] per MCV +
* size of skewBucketNums[] entry +
* size of skew bucket struct itself
*----------
*/
*num_skew_mcvs = skew_table_bytes / (tupsize +
(8 * sizeof(HashSkewBucket *)) +
sizeof(int) +
SKEW_BUCKET_OVERHEAD);
if (*num_skew_mcvs > 0)
hash_table_bytes -= skew_table_bytes;
}
else
*num_skew_mcvs = 0;
/*
* Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
* memory is filled, assuming a single batch; but limit the value so that
* the pointer arrays we'll try to allocate do not exceed work_mem nor
* MaxAllocSize.
*
* Note that both nbuckets and nbatch must be powers of 2 to make
* ExecHashGetBucketAndBatch fast.
*/
max_pointers = *space_allowed / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
/* If max_pointers isn't a power of 2, must round it down to one */
//round max_pointers down to a power of 2
mppow2 = 1L << my_log2(max_pointers);
if (max_pointers != mppow2)
max_pointers = mppow2 / 2;
/* Also ensure we avoid integer overflow in nbatch and nbuckets */
/* (this step is redundant given the current value of MaxAllocSize) */
max_pointers = Min(max_pointers, INT_MAX / 2);
dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
dbuckets = Min(dbuckets, max_pointers);
nbuckets = (int) dbuckets;
/* don't let nbuckets be really small, though ... */
//nbuckets is at least 1024
nbuckets = Max(nbuckets, 1024);
/* ... and force it to be a power of 2. */
//round nbuckets up to a power of 2
nbuckets = 1 << my_log2(nbuckets);
/*
* If there's not enough space to store the projected number of tuples and
* the required bucket headers, we will need multiple batches.
*/
bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
{
//inner relation bytes + bucket array bytes exceed the hash table budget: split into batches
/* We'll need multiple batches */
long lbuckets;
double dbatch;
int minbatch;
long bucket_size;
/*
* If Parallel Hash with combined work_mem would still need multiple
* batches, we'll have to fall back to regular work_mem budget.
*/
if (try_combined_work_mem)
{
ExecChooseHashTableSize(ntuples, tupwidth, useskew,
false, parallel_workers,
space_allowed,
numbuckets,
numbatches,
num_skew_mcvs);
return;
}
/*
* Estimate the number of buckets we'll want to have when work_mem is
* entirely full. Each bucket will contain a bucket pointer plus
* NTUP_PER_BUCKET tuples, whose projected size already includes
* overhead for the hash code, pointer to the next tuple, etc.
*/
bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);
lbuckets = Min(lbuckets, max_pointers);
nbuckets = (int) lbuckets;
nbuckets = 1 << my_log2(nbuckets);
bucket_bytes = nbuckets * sizeof(HashJoinTuple);
/*
* Buckets are simple pointers to hashjoin tuples, while tupsize
* includes the pointer, hash code, and MinimalTupleData. So buckets
* should never really exceed 25% of work_mem (even for
* NTUP_PER_BUCKET=1); except maybe for work_mem values that are not
* 2^N bytes, where we might get more because of doubling. So let's
* look for 50% here.
*/
Assert(bucket_bytes <= hash_table_bytes / 2);
/* Calculate required number of batches. */
dbatch = ceil(inner_rel_bytes / (hash_table_bytes - bucket_bytes));
dbatch = Min(dbatch, max_pointers);
minbatch = (int) dbatch;
nbatch = 2;
while (nbatch < minbatch)
nbatch <<= 1;
}
Assert(nbuckets > 0);
Assert(nbatch > 0);
*numbuckets = nbuckets;
*numbatches = nbatch;
}
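To make the sizing arithmetic concrete, the sketch below replays the main path of ExecChooseHashTableSize with the values observed in the gdb session further down (ntuples=100000, tupsize=48, hash_table_bytes=4,110,418 after the skew reservation, max_pointers=524,288). my_log2 is re-implemented here on the assumption that it rounds up to the next power of two, 8-byte bucket pointers are assumed, and the skew computation is omitted; this is a sketch, not the executor code.
/* Replays the bucket/batch sizing of ExecChooseHashTableSize for the
 * values observed in the gdb session below. Standalone sketch only. */
#include <math.h>
#include <stdio.h>

static int
my_log2(long num)
{
    /* assumed behavior: smallest i such that (1 << i) >= num */
    int i = 0;

    while ((1L << i) < num)
        i++;
    return i;
}

int
main(void)
{
    double ntuples = 100000.0;
    int tupsize = 48;                   /* from the gdb session */
    long hash_table_bytes = 4110418L;   /* work_mem share left after skew */
    long max_pointers = 524288L;
    double inner_rel_bytes = ntuples * tupsize;     /* 4800000 */
    int nbuckets;
    long bucket_bytes;
    int nbatch = 1;

    /* single-batch attempt: one bucket per tuple (NTUP_PER_BUCKET=1),
     * rounded up to a power of 2 */
    nbuckets = 1 << my_log2((long) ceil(ntuples));          /* 131072 */
    bucket_bytes = (long) sizeof(void *) * nbuckets;        /* 1048576 */

    if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
    {
        /* won't fit: recompute buckets for a completely full work_mem ... */
        long bucket_size = tupsize + (long) sizeof(void *); /* 56 */
        long lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);

        if (lbuckets > max_pointers)
            lbuckets = max_pointers;
        nbuckets = (int) lbuckets;                          /* 131072 */
        bucket_bytes = nbuckets * (long) sizeof(void *);

        /* ... and round the required batch count up to a power of 2 */
        nbatch = 2;
        while (nbatch < (int) ceil(inner_rel_bytes /
                                   (double) (hash_table_bytes - bucket_bytes)))
            nbatch <<= 1;
    }
    printf("nbuckets=%d nbatch=%d\n", nbuckets, nbatch);    /* 131072 2 */
    return 0;
}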
final_cost_hashjoin
//----------------------------------------------------------------------------- final_cost_hashjoin
/*
* final_cost_hashjoin
* Final estimate of the cost and result size of a hashjoin path.
*
* Note: the numbatches estimate is also saved into 'path' for use later
*
* 'path' is already filled in except for the rows and cost fields and
* num_batches
* 'workspace' is the result from initial_cost_hashjoin
* 'extra' contains miscellaneous information about the join
*/
void
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
JoinCostWorkspace *workspace,
JoinPathExtraData *extra)
{
Path *outer_path = path->jpath.outerjoinpath;
Path *inner_path = path->jpath.innerjoinpath;
double outer_path_rows = outer_path->rows;
double inner_path_rows = inner_path->rows;
double inner_path_rows_total = workspace->inner_rows_total;
List *hashclauses = path->path_hashclauses;
Cost startup_cost = workspace->startup_cost;
Cost run_cost = workspace->run_cost;
int numbuckets = workspace->numbuckets;
int numbatches = workspace->numbatches;
Cost cpu_per_tuple;
QualCost hash_qual_cost;
QualCost qp_qual_cost;
double hashjointuples;
double virtualbuckets;
Selectivity innerbucketsize;
Selectivity innermcvfreq;
ListCell *hcl;
/* Mark the path with the correct row estimate */
if (path->jpath.path.param_info)
path->jpath.path.rows = path->jpath.path.param_info->ppi_rows;
else
path->jpath.path.rows = path->jpath.path.parent->rows;
/* For partial paths, scale row estimate. */
if (path->jpath.path.parallel_workers > 0)
{
double parallel_divisor = get_parallel_divisor(&path->jpath.path);
path->jpath.path.rows =
clamp_row_est(path->jpath.path.rows / parallel_divisor);
}
/*
* We could include disable_cost in the preliminary estimate, but that
* would amount to optimizing for the case where the join method is
* disabled, which doesn't seem like the way to bet.
*/
if (!enable_hashjoin)
startup_cost += disable_cost;//hash join disabled: add the huge disable_cost
/* mark the path with estimated # of batches */
path->num_batches = numbatches;//number of batches
/* store the total number of tuples (sum of partial row estimates) */
path->inner_rows_total = inner_path_rows_total;
/* and compute the number of "virtual" buckets in the whole join */
virtualbuckets = (double) numbuckets * (double) numbatches;
/*
* Determine bucketsize fraction and MCV frequency for the inner relation.
* We use the smallest bucketsize or MCV frequency estimated for any
* individual hashclause; this is undoubtedly conservative.
*
* BUT: if inner relation has been unique-ified, we can assume it's good
* for hashing. This is important both because it's the right answer, and
* because we avoid contaminating the cache with a value that's wrong for
* non-unique-ified paths.
*/
if (IsA(inner_path, UniquePath))
{
//inner relation known to be unique
innerbucketsize = 1.0 / virtualbuckets;
innermcvfreq = 0.0;
}
else//inner relation not known unique
{
innerbucketsize = 1.0;
innermcvfreq = 1.0;
foreach(hcl, hashclauses)//iterate over the hash clauses
{
RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
Selectivity thisbucketsize;
Selectivity thismcvfreq;
/*
* First we have to figure out which side of the hashjoin clause
* is the inner side.
*
* Since we tend to visit the same clauses over and over when
* planning a large query, we cache the bucket stats estimates in
* the RestrictInfo node to avoid repeated lookups of statistics.
*/
if (bms_is_subset(restrictinfo->right_relids,
inner_path->parent->relids))
{
/* righthand side is inner */
thisbucketsize = restrictinfo->right_bucketsize;
if (thisbucketsize < 0)
{
/* not cached yet */
estimate_hash_bucket_stats(root,
get_rightop(restrictinfo->clause),
virtualbuckets,
&restrictinfo->right_mcvfreq,
&restrictinfo->right_bucketsize);
thisbucketsize = restrictinfo->right_bucketsize;
}
thismcvfreq = restrictinfo->right_mcvfreq;
}
else
{
Assert(bms_is_subset(restrictinfo->left_relids,
inner_path->parent->relids));
/* lefthand side is inner */
thisbucketsize = restrictinfo->left_bucketsize;
if (thisbucketsize < 0)
{
/* not cached yet */
estimate_hash_bucket_stats(root,
get_leftop(restrictinfo->clause),
virtualbuckets,
&restrictinfo->left_mcvfreq,
&restrictinfo->left_bucketsize);
thisbucketsize = restrictinfo->left_bucketsize;
}
thismcvfreq = restrictinfo->left_mcvfreq;
}
if (innerbucketsize > thisbucketsize)
innerbucketsize = thisbucketsize;
if (innermcvfreq > thismcvfreq)
innermcvfreq = thismcvfreq;
}
}
/*
* If the bucket holding the inner MCV would exceed work_mem, we don't
* want to hash unless there is really no other alternative, so apply
* disable_cost. (The executor normally copes with excessive memory usage
* by splitting batches, but obviously it cannot separate equal values
* that way, so it will be unable to drive the batch size below work_mem
* when this is true.)
*/
if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
inner_path->pathtarget->width) >
(work_mem * 1024L))
startup_cost += disable_cost;//if the bucket holding the inner MCV would exceed work_mem, effectively disable hash join
/*
* Compute cost of the hashquals and qpquals (other restriction clauses)
* separately.
*/
cost_qual_eval(&hash_qual_cost, hashclauses, root);
cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
qp_qual_cost.startup -= hash_qual_cost.startup;
qp_qual_cost.per_tuple -= hash_qual_cost.per_tuple;
/* CPU costs */
if (path->jpath.jointype == JOIN_SEMI ||
path->jpath.jointype == JOIN_ANTI ||
extra->inner_unique)
{
double outer_matched_rows;
Selectivity inner_scan_frac;
/*
* With a SEMI or ANTI join, or if the innerrel is known unique, the
* executor will stop after the first match.
*
* For an outer-rel row that has at least one match, we can expect the
* bucket scan to stop after a fraction 1/(match_count+1) of the
* bucket's rows, if the matches are evenly distributed. Since they
* probably aren't quite evenly distributed, we apply a fuzz factor of
* 2.0 to that fraction. (If we used a larger fuzz factor, we'd have
* to clamp inner_scan_frac to at most 1.0; but since match_count is
* at least 1, no such clamp is needed now.)
*/
outer_matched_rows = rint(outer_path_rows * extra->semifactors.outer_match_frac);
inner_scan_frac = 2.0 / (extra->semifactors.match_count + 1.0);
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple * outer_matched_rows *
clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;
/*
* For unmatched outer-rel rows, the picture is quite a lot different.
* In the first place, there is no reason to assume that these rows
* preferentially hit heavily-populated buckets; instead assume they
* are uncorrelated with the inner distribution and so they see an
* average bucket size of inner_path_rows / virtualbuckets. In the
* second place, it seems likely that they will have few if any exact
* hash-code matches and so very few of the tuples in the bucket will
* actually require eval of the hash quals. We don't have any good
* way to estimate how many will, but for the moment assume that the
* effective cost per bucket entry is one-tenth what it is for
* matchable tuples.
*/
run_cost += hash_qual_cost.per_tuple *
(outer_path_rows - outer_matched_rows) *
clamp_row_est(inner_path_rows / virtualbuckets) * 0.05;
/* Get # of tuples that will pass the basic join */
if (path->jpath.jointype == JOIN_ANTI)
hashjointuples = outer_path_rows - outer_matched_rows;
else
hashjointuples = outer_matched_rows;
}
else
{
/*
* The number of tuple comparisons needed is the number of outer
* tuples times the typical number of tuples in a hash bucket, which
* is the inner relation size times its bucketsize fraction. At each
* one, we need to evaluate the hashjoin quals. But actually,
* charging the full qual eval cost at each tuple is pessimistic,
* since we don't evaluate the quals unless the hash values match
* exactly. For lack of a better idea, halve the cost estimate to
* allow for that.
*/
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
/*
* Get approx # tuples passing the hashquals. We use
* approx_tuple_count here because we need an estimate done with
* JOIN_INNER semantics.
*/
hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
}
/*
* For each tuple that gets through the hashjoin proper, we charge
* cpu_tuple_cost plus the cost of evaluating additional restriction
* clauses that are to be applied at the join. (This is pessimistic since
* not all of the quals may get evaluated at each tuple.)
*/
startup_cost += qp_qual_cost.startup;
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
run_cost += cpu_per_tuple * hashjointuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->jpath.path.pathtarget->cost.startup;
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;
path->jpath.path.startup_cost = startup_cost;
path->jpath.path.total_cost = startup_cost + run_cost;
}
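As a rough worked example of the probe-side charge in the plain inner-join branch above, the sketch below uses the innerbucketsize cached in the RestrictInfo dump from the gdb session (about 0.0001) together with hypothetical values for the outer row count and the per-tuple qual cost; those two inputs are assumptions, not session data, and clamp_row_est is re-implemented here on the assumption that it mirrors the planner's clamp.
/* Hypothetical illustration of the probe-side CPU charge in the
 * non-semi branch of final_cost_hashjoin. innerbucketsize comes from
 * the RestrictInfo dump in the gdb session; outer_path_rows and the
 * per-tuple qual cost are assumed values. */
#include <math.h>
#include <stdio.h>

static double
clamp_row_est(double nrows)
{
    /* assumed to mirror the planner's clamp: at least 1 row, integral */
    return (nrows <= 1.0) ? 1.0 : rint(nrows);
}

int
main(void)
{
    double hash_qual_per_tuple = 0.0025;    /* one operator per comparison (assumed) */
    double outer_path_rows = 10000.0;       /* hypothetical outer size */
    double inner_path_rows = 100000.0;
    double innerbucketsize = 0.00010013016921998598;    /* right_bucketsize */

    /* comparisons = outer rows * typical bucket population, halved since
     * the quals are evaluated only on exact hash-code matches */
    double run_charge = hash_qual_per_tuple * outer_path_rows *
        clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

    printf("probe-side qual charge: %.2f\n", run_charge);   /* 125.00 */
    return 0;
}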
The test script is as follows:
select a.*,b.grbh,b.je
from t_dwxx a,
lateral (select t1.dwbh,t1.grbh,t2.je
from t_grxx t1
inner join t_jfxx t2 on t1.dwbh = a.dwbh and t1.grbh = t2.grbh) b
order by b.dwbh;
Start gdb and set a breakpoint:
(gdb) b try_hashjoin_path
Breakpoint 1 at 0x7af2a4: file joinpath.c, line 737.
Enter try_hashjoin_path; the join type is JOIN_INNER:
(gdb) c
Continuing.
Breakpoint 1, try_hashjoin_path (root=0x1b73980, joinrel=0x1b91d30, outer_path=0x1b866c0, inner_path=0x1b8e780,
hashclauses=0x1b93200, jointype=JOIN_INNER, extra=0x7ffc09955e80) at joinpath.c:737
737 required_outer = calc_non_nestloop_required_outer(outer_path,
Step into initial_cost_hashjoin:
(gdb) n
739 if (required_outer &&
(gdb)
751 initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
(gdb) step
initial_cost_hashjoin (root=0x1b73980, workspace=0x7ffc09955d00, jointype=JOIN_INNER, hashclauses=0x1b93200,
outer_path=0x1b866c0, inner_path=0x1b8e780, extra=0x7ffc09955e80, parallel_hash=false) at costsize.c:3160
3160 Cost startup_cost = 0;
Initial assignments. There is only one hash clause, t_dwxx.dwbh = t_grxx.dwbh (see the previous section for the join details):
3160 Cost startup_cost = 0;
(gdb) n
3161 Cost run_cost = 0;
(gdb)
3162 double outer_path_rows = outer_path->rows;
(gdb)
3163 double inner_path_rows = inner_path->rows;
(gdb)
3164 double inner_path_rows_total = inner_path_rows;
(gdb)
3165 int num_hashclauses = list_length(hashclauses);
(gdb)
3172 startup_cost += outer_path->startup_cost;
(gdb) p num_hashclauses
$1 = 1
Step into ExecChooseHashTableSize, which computes an appropriate hash table size from the estimated size of the relation to be hashed (row count and average row width):
...
(gdb) step
ExecChooseHashTableSize (ntuples=100000, tupwidth=9, useskew=true, try_combined_work_mem=false, parallel_workers=0,
space_allowed=0x7ffc09955c38, numbuckets=0x7ffc09955c4c, numbatches=0x7ffc09955c48, num_skew_mcvs=0x7ffc09955c44)
at nodeHash.c:677
Tuple size -> 48 bytes, inner relation size -> 4,800,000 bytes (about 4.58 MB), hash table size -> 4,194,304 bytes (4 MB):
(gdb) n
690 tupsize = HJTUPLE_OVERHEAD +
(gdb)
693 inner_rel_bytes = ntuples * tupsize;
(gdb)
698 hash_table_bytes = work_mem * 1024L;
(gdb)
705 if (try_combined_work_mem)
(gdb) p tupsize
$2 = 48
(gdb) p inner_rel_bytes
$3 = 4800000
(gdb) p hash_table_bytes
$4 = 4194304
Skew optimization is used:
(gdb) n
708 *space_allowed = hash_table_bytes;
(gdb)
724 if (useskew)
(gdb)
726 skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;
(gdb) p useskew
$5 = true
The number of skew MCV buckets is 635, the adjusted hash table size is 4,110,418 bytes, and the maximum pointer count is 524,288:
...
(gdb) p *num_skew_mcvs
$7 = 635
(gdb) p hash_table_bytes
$8 = 4110418
...
(gdb) n
756 max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
(gdb)
758 mppow2 = 1L << my_log2(max_pointers);
(gdb)
759 if (max_pointers != mppow2)
(gdb)
764 max_pointers = Min(max_pointers, INT_MAX / 2);
(gdb)
766 dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
(gdb) p max_pointers
$10 = 524288
The computed bucket count is 131,072:
...
(gdb) n
767 dbuckets = Min(dbuckets, max_pointers);
(gdb)
768 nbuckets = (int) dbuckets;
(gdb)
770 nbuckets = Max(nbuckets, 1024);
(gdb)
772 nbuckets = 1 << my_log2(nbuckets);
(gdb)
778 bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
(gdb)
779 if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
(gdb) p nbuckets
$11 = 131072
There is not enough space for a single batch, so multiple batches are needed:
779 if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
(gdb) n
791 if (try_combined_work_mem)
The bucket count (131,072), the number of batches (2), and related values are computed:
...
(gdb)
837 *numbuckets = nbuckets;
(gdb)
838 *numbatches = nbatch;
(gdb) p nbuckets
$12 = 131072
(gdb) p nbatch
$13 = 2
Back in initial_cost_hashjoin:
(gdb)
initial_cost_hashjoin (root=0x1b73980, workspace=0x7ffc09955d00, jointype=JOIN_INNER, hashclauses=0x1b93200,
outer_path=0x1b866c0, inner_path=0x1b8e780, extra=0x7ffc09955e80, parallel_hash=false) at costsize.c:3226
3226 if (numbatches > 1)
initial_cost_hashjoin finishes; inspect the results:
(gdb)
3246 workspace->inner_rows_total = inner_path_rows_total;
(gdb)
3247 }
(gdb) p workspace
$14 = (JoinCostWorkspace *) 0x7ffc09955d00
(gdb) p *workspace
$15 = {startup_cost = 3465, total_cost = 4261, run_cost = 796, inner_run_cost = 0,
inner_rescan_run_cost = 6.9525149533036983e-310, outer_rows = 3.7882102964330281e-317,
inner_rows = 1.4285501039407471e-316, outer_skip_rows = 1.4285501039407471e-316,
inner_skip_rows = 6.9525149532894692e-310, numbuckets = 131072, numbatches = 2, inner_rows_total = 100000}
(gdb)
Set a breakpoint and enter final_cost_hashjoin:
(gdb) b final_cost_hashjoin
Breakpoint 2 at 0x7a0b49: file costsize.c, line 3265.
(gdb) c
Continuing.
Breakpoint 2, final_cost_hashjoin (root=0x1b73980, path=0x1b93238, workspace=0x7ffc09955d00, extra=0x7ffc09955e80)
at costsize.c:3265
3265 Path *outer_path = path->jpath.outerjoinpath;
Assign the local variables and compute the number of virtual buckets:
3265 Path *outer_path = path->jpath.outerjoinpath;
(gdb) n
3266 Path *inner_path = path->jpath.innerjoinpath;
...
(gdb)
3314 virtualbuckets = (double) numbuckets * (double) numbatches;
(gdb)
3326 if (IsA(inner_path, UniquePath))
(gdb) p virtualbuckets
$16 = 262144
(gdb)
Iterate over the hash clauses to determine the bucketsize fraction and MCV frequency for the inner relation, caching the estimates in each clause's RestrictInfo:
3335 foreach(hcl, hashclauses)
(gdb)
3337 RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
The inner relation is on the right-hand side (RTE=3, i.e. t_grxx); the cached stats are updated:
3349 if (bms_is_subset(restrictinfo->right_relids,
(gdb)
3353 thisbucketsize = restrictinfo->right_bucketsize;
(gdb)
3354 if (thisbucketsize < 0)
(gdb)
3357 estimate_hash_bucket_stats(root,
(gdb)
3358 get_rightop(restrictinfo->clause),
(gdb)
3357 estimate_hash_bucket_stats(root,
(gdb)
3362 thisbucketsize = restrictinfo->right_bucketsize;
(gdb) p *restrictinfo
$17 = {type = T_RestrictInfo, clause = 0x1b87260, is_pushed_down = true, outerjoin_delayed = false, can_join = true,
pseudoconstant = false, leakproof = false, security_level = 0, clause_relids = 0x1b87380, required_relids = 0x1b86c68,
outer_relids = 0x0, nullable_relids = 0x0, left_relids = 0x1b87340, right_relids = 0x1b87360, orclause = 0x0,
parent_ec = 0x1b85b88, eval_cost = {startup = 0, per_tuple = 0.0025000000000000001}, norm_selec = 0.0001,
outer_selec = -1, mergeopfamilies = 0x1b873c8, left_ec = 0x1b85b88, right_ec = 0x1b85b88, left_em = 0x1b85d58,
right_em = 0x1b85c80, scansel_cache = 0x1b92e50, outer_is_left = true, hashjoinoperator = 98, left_bucketsize = -1,
right_bucketsize = 0.00010013016921998598, left_mcvfreq = -1, right_mcvfreq = 0}
The rest of the computation is similar to what was described earlier (evaluating the qual costs and so on); it can be stepped through in the debugger.
The final result:
3505 }
(gdb) p *path
$19 = {jpath = {path = {type = T_HashPath, pathtype = T_HashJoin, parent = 0x1b91d30, pathtarget = 0x1b91f68,
param_info = 0x0, parallel_aware = false, parallel_safe = true, parallel_workers = 0, rows = 100000,
startup_cost = 3465, total_cost = 5386, pathkeys = 0x0}, jointype = JOIN_INNER, inner_unique = false,
outerjoinpath = 0x1b866c0, innerjoinpath = 0x1b8e780, joinrestrictinfo = 0x1b92208}, path_hashclauses = 0x1b93200,
num_batches = 2, inner_rows_total = 100000}
DONE!
References:
allpaths.c
cost.h
costsize.c
PostgreSQL documentation: Query Planning