PostgreSQL Source Code Walkthrough (133) - MVCC#17 (the vacuum process: the lazy_vacuum_index function #2)

Published: 2020-08-11 16:11:39 | Source: ITPUB blog | Views: 167 | Author: husthxd | Category: Relational databases

This section briefly describes how PostgreSQL processes a manually issued VACUUM, focusing on how a B-Tree index is vacuumed. The function that implements this is btvacuumscan.

I. Data Structures

IndexVacuumInfo
The struct of input arguments passed to ambulkdelete and amvacuumcleanup.


/*
 * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
 *
 * num_heap_tuples is accurate only when estimated_count is false;
 * otherwise it's just an estimate (currently, the estimate is the
 * prior value of the relation's pg_class.reltuples field).  It will
 * always just be an estimate during ambulkdelete.
 */
typedef struct IndexVacuumInfo
{
    Relation    index;          /* the index being vacuumed */
    bool        analyze_only;   /* ANALYZE (without any actual vacuum) */
    bool        estimated_count;    /* num_heap_tuples is an estimate */
    int         message_level;  /* ereport level for progress messages */
    double      num_heap_tuples;    /* tuples remaining in heap */
    BufferAccessStrategy strategy;  /* access strategy for reads */
} IndexVacuumInfo;

IndexBulkDeleteResult
The statistics struct returned by ambulkdelete and amvacuumcleanup.


/*
 * Struct for statistics returned by ambulkdelete and amvacuumcleanup
 * 
 * This struct is normally allocated by the first ambulkdelete call and then
 * passed along through subsequent ones until amvacuumcleanup; however,
 * amvacuumcleanup must be prepared to allocate it in the case where no
 * ambulkdelete calls were made (because no tuples needed deletion).
 * Note that an index AM could choose to return a larger struct
 * of which this is just the first field; this provides a way for ambulkdelete
 * to communicate additional private data to amvacuumcleanup.
 *
 * Note: pages_removed is the amount by which the index physically shrank,
 * if any (ie the change in its total size on disk).  pages_deleted and
 * pages_free refer to free space within the index file.  Some index AMs
 * may compute num_index_tuples by reference to num_heap_tuples, in which
 * case they should copy the estimated_count field from IndexVacuumInfo.
 */
typedef struct IndexBulkDeleteResult
{
    BlockNumber num_pages;      /* pages remaining in index */
    BlockNumber pages_removed;  /* # removed during vacuum operation */
    bool        estimated_count;    /* num_index_tuples is an estimate */
    double      num_index_tuples;   /* tuples remaining */
    double      tuples_removed; /* # removed during vacuum operation */
    BlockNumber pages_deleted;  /* # unused pages in index */
    BlockNumber pages_free;     /* # pages available for reuse */
} IndexBulkDeleteResult;
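
The comment above says an index AM may return a larger struct of which IndexBulkDeleteResult is only the first field. The following is a minimal sketch of that idiom using simplified stand-in types, not the real PostgreSQL headers. `DemoBulkDeleteResult`, `DemoAmStats`, `demo_bulkdelete`, and `demo_vacuumcleanup` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for IndexBulkDeleteResult (not the real definition). */
typedef struct DemoBulkDeleteResult
{
    unsigned    num_pages;
    double      num_index_tuples;
    double      tuples_removed;
} DemoBulkDeleteResult;

/* Hypothetical AM-private struct: the public stats must be the first field. */
typedef struct DemoAmStats
{
    DemoBulkDeleteResult std;       /* must come first */
    unsigned    private_counter;    /* extra data only this AM understands */
} DemoAmStats;

/* "bulkdelete": allocate on the first call, then pass the same struct along. */
static DemoBulkDeleteResult *
demo_bulkdelete(DemoBulkDeleteResult *stats, double removed)
{
    DemoAmStats *am = (DemoAmStats *) stats;

    if (am == NULL)
    {
        am = calloc(1, sizeof(DemoAmStats));
        assert(am != NULL);
    }
    am->std.tuples_removed += removed;
    am->private_counter++;
    return &am->std;
}

/* "vacuumcleanup": must cope with stats == NULL (no bulkdelete call happened). */
static unsigned
demo_vacuumcleanup(DemoBulkDeleteResult *stats)
{
    if (stats == NULL)
        return 0;
    /* Valid because a pointer to a struct also points to its first member. */
    return ((DemoAmStats *) stats)->private_counter;
}
```

The caller only ever sees the generic pointer, so the private field travels from one call to the next without the caller knowing it exists.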

BTPageOpaque
At the end of every page, nbtree stores pointers to both of the page's siblings in the tree.


/*
 *  BTPageOpaqueData -- At the end of every page, we store a pointer
 *  to both siblings in the tree.  This is used to do forward/backward
 *  index scans.  The next-page link is also critical for recovery when
 *  a search has navigated to the wrong page due to concurrent page splits
 *  or deletions; see src/backend/access/nbtree/README for more info.
 *
 *  In addition, we store the page's btree level (counting upwards from
 *  zero at a leaf page) as well as some flag bits indicating the page type
 *  and status.  If the page is deleted, we replace the level with the
 *  next-transaction-ID value indicating when it is safe to reclaim the page.
 *
 *  We also store a "vacuum cycle ID".  When a page is split while VACUUM is
 *  processing the index, a nonzero value associated with the VACUUM run is
 *  stored into both halves of the split page.  (If VACUUM is not running,
 *  both pages receive zero cycleids.)  This allows VACUUM to detect whether
 *  a page was split since it started, with a small probability of false match
 *  if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
 *  ago.  Also, during a split, the BTP_SPLIT_END flag is cleared in the left
 *  (original) page, and set in the right page, but only if the next page
 *  to its right has a different cycleid.
 *
 *  NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
 *  instead.
 */
typedef struct BTPageOpaqueData
{
    BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
    BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
    union
    {
        uint32      level;      /* tree level --- zero for leaf pages */
        TransactionId xact;     /* next transaction ID, if deleted */
    }           btpo;
    uint16      btpo_flags;     /* flag bits, see below */
    BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
typedef BTPageOpaqueData *BTPageOpaque;
/* Bits defined in btpo_flags */
#define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
#define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
#define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
#define BTP_META        (1 << 3)    /* meta-page */
#define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
#define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

II. Source Code Walkthrough

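lazy_vacuum_index->index_bulk_delete->…btbulkdelete->btvacuumscan
btvacuumscan scans the index and performs the vacuum.
The function combines three tasks:
A. finding leaf tuples that are deletable according to the vacuum callback;
B. finding empty pages that can be deleted;
C. finding old deleted pages that can be recycled.
Both btbulkdelete and btvacuumcleanup call it (the latter only if no btbulkdelete call occurred).

Its main processing logic is:
1. Reset the statistics (the IndexBulkDeleteResult struct)
2. Initialize the vstate scan state (the BTVacState struct)
3. Create a temporary memory context
4. Loop over the pages
4.1 Take the relation-extension lock if needed, read the current relation length, and release the lock
4.2 Iterate over the blocks, calling btvacuumpage on each
4.3 Recheck the relation length and repeat if the relation has grown
5. Issue a final WAL record if needed
6. Delete the temporary memory context
7. Vacuum the free space map if any pages were freed
8. Update the statistics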


/*
 * btvacuumscan --- scan the index for VACUUMing purposes
 *
 * This combines the functions of looking for leaf tuples that are deletable
 * according to the vacuum callback, looking for empty pages that can be
 * deleted, and looking for old deleted pages that can be recycled.  Both
 * btbulkdelete and btvacuumcleanup invoke this (the latter only if no
 * btbulkdelete call occurred).
 *
 * The caller is responsible for initially allocating/zeroing a stats struct
 * and for obtaining a vacuum cycle ID if necessary.
 */
static void
btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
             IndexBulkDeleteCallback callback, void *callback_state,
             BTCycleId cycleid, TransactionId *oldestBtpoXact)
{
    Relation    rel = info->index;
    BTVacState  vstate;
    BlockNumber num_pages;
    BlockNumber blkno;
    bool        needLock;
    /*
     * Reset counts that will be incremented during the scan; needed in case
     * of multiple scans during a single VACUUM command
     */
    stats->estimated_count = false;
    stats->num_index_tuples = 0;
    stats->pages_deleted = 0;
    /* Set up info to pass down to btvacuumpage */
    vstate.info = info;
    vstate.stats = stats;
    vstate.callback = callback;
    vstate.callback_state = callback_state;
    vstate.cycleid = cycleid;
    vstate.lastBlockVacuumed = BTREE_METAPAGE;  /* Initialise at first block */
    vstate.lastBlockLocked = BTREE_METAPAGE;
    vstate.totFreePages = 0;
    vstate.oldestBtpoXact = InvalidTransactionId;
    /* Create a temporary memory context to run _bt_pagedel in */
    vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
                                                  "_bt_pagedel",
                                                  ALLOCSET_DEFAULT_SIZES);
    /*
     * The outer loop iterates over all index pages except the metapage, in
     * physical order (we hope the kernel will cooperate in providing
     * read-ahead for speed).  It is critical that we visit all leaf pages,
     * including ones added after we start the scan, else we might fail to
     * delete some deletable tuples.  Hence, we must repeatedly check the
     * relation length.  We must acquire the relation-extension lock while
     * doing so to avoid a race condition: if someone else is extending the
     * relation, there is a window where bufmgr/smgr have created a new
     * all-zero page but it hasn't yet been write-locked by _bt_getbuf(). If
     * we manage to scan such a page here, we'll improperly assume it can be
     * recycled.  Taking the lock synchronizes things enough to prevent a
     * problem: either num_pages won't include the new page, or _bt_getbuf
     * already has write lock on the buffer and it will be fully initialized
     * before we can examine it.  (See also vacuumlazy.c, which has the same
     * issue.)  Also, we need not worry if a page is added immediately after
     * we look; the page splitting code already has write-lock on the left
     * page before it adds a right page, so we must already have processed any
     * tuples due to be moved into such a page.
     *
     * We can skip locking for new or temp relations, however, since no one
     * else could be accessing them.
     */
    needLock = !RELATION_IS_LOCAL(rel);
    /* block 0 of the index is the metapage, so start just past it */
    blkno = BTREE_METAPAGE + 1;
    for (;;)
    {
        /* Get the current relation length */
        if (needLock)
            LockRelationForExtension(rel, ExclusiveLock);
        num_pages = RelationGetNumberOfBlocks(rel);
        if (needLock)
            UnlockRelationForExtension(rel, ExclusiveLock);
        /* Quit if we've scanned the whole relation */
        if (blkno >= num_pages)
            break;
        /* Iterate over pages, then loop back to recheck length */
        for (; blkno < num_pages; blkno++)
        {
            btvacuumpage(&vstate, blkno, blkno);
        }
    }
    /*
     * Check to see if we need to issue one final WAL record for this index,
     * which may be needed for correctness on a hot standby node when non-MVCC
     * index scans could take place.
     *
     * If the WAL is replayed in hot standby, the replay process needs to get
     * cleanup locks on all index leaf pages, just as we've been doing here.
     * However, we won't issue any WAL records about pages that have no items
     * to be deleted.  For pages between pages we've vacuumed, the replay code
     * will take locks under the direction of the lastBlockVacuumed fields in
     * the XLOG_BTREE_VACUUM WAL records.  To cover pages after the last one
     * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
     * against the last leaf page in the index, if that one wasn't vacuumed.
     */
    if (XLogStandbyInfoActive() &&
        vstate.lastBlockVacuumed < vstate.lastBlockLocked)
    {
        Buffer      buf;
        /*
         * The page should be valid, but we can't use _bt_getbuf() because we
         * want to use a nondefault buffer access strategy.  Since we aren't
         * going to delete any items, getting cleanup lock again is probably
         * overkill, but for consistency do that anyway.
         */
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
                                 RBM_NORMAL, info->strategy);
        LockBufferForCleanup(buf);
        _bt_checkpage(rel, buf);
        _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
        _bt_relbuf(rel, buf);
    }
    /* delete the temporary memory context */
    MemoryContextDelete(vstate.pagedelcontext);
    /*
     * If we found any recyclable pages (and recorded them in the FSM), then
     * forcibly update the upper-level FSM pages to ensure that searchers can
     * find them.  It's possible that the pages were also found during
     * previous scans and so this is a waste of time, but it's cheap enough
     * relative to scanning the index that it shouldn't matter much, and
     * making sure that free pages are available sooner not later seems
     * worthwhile.
     *
     * Note that if no recyclable pages exist, we don't bother vacuuming the
     * FSM at all.
     */
    if (vstate.totFreePages > 0)
        IndexFreeSpaceMapVacuum(rel);
    /* update statistics */
    stats->num_pages = num_pages;
    stats->pages_free = vstate.totFreePages;
    if (oldestBtpoXact)
        *oldestBtpoXact = vstate.oldestBtpoXact;
}

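lazy_vacuum_index->index_bulk_delete->…btbulkdelete->btvacuumscan->btvacuumpage
btvacuumpage --- VACUUM one page
btvacuumscan() calls this routine to process a single page. In some cases it must go back and re-examine previously scanned pages; the routine recurses when necessary to handle that case.

Its main processing logic is:
1. Initialize the local variables
2. Call ReadBufferExtended to read the block into a buffer, lock the buffer, and get the page
3. If the page is not new, check it and fetch its BTPageOpaque
4. If blkno differs from orig_blkno (we are recursing) and the page is recyclable, is to be ignored, is not a leaf, or carries a different cycleid, call _bt_relbuf and return; otherwise continue
5. Decide what to do with the page
5.1 If the page is recyclable, record it as a free page
5.2 If the page is deleted but cannot be recycled yet, update the statistics
5.3 If the page is half-dead, try to delete it (set delete_now to true)
5.4 If the page is a leaf:
5.4.1 Declare the local variables
5.4.2 Trade the read lock for a cleanup lock on the buffer
5.4.3 Remember the highest leaf page number on which a cleanup lock has been taken
5.4.4 Check whether we need to recurse back to an earlier page
5.4.5 Scan all items and collect the ones the callback reports as deletable (into the deletable array)
5.4.6 If the array is non-empty, call _bt_delitems_vacuum and record the results; if it is empty and the page was split during this vacuum cycle, clear btpo_cycleid and mark the buffer dirty as a hint
5.4.7 If the page is now empty, try to delete it (set delete_now to true, unless we are recursing); otherwise count the live tuples
6. If delete_now is true, call _bt_pagedel and update the statistics; otherwise call _bt_relbuf
7. If recurse_to != P_NONE, restart with that block; otherwise return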

/*
 * btvacuumpage --- VACUUM one page
 * 
 * This processes a single page for btvacuumscan().  In some cases we
 * must go back and re-examine previously-scanned pages; this routine
 * recurses when necessary to handle that case.
 *
 * blkno is the page to process.  orig_blkno is the highest block number
 * reached by the outer btvacuumscan loop (the same as blkno, unless we
 * are recursing to re-examine a previous page).
 */
static void
btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
{
    IndexVacuumInfo *info = vstate->info;
    IndexBulkDeleteResult *stats = vstate->stats;
    /* typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state); */
    IndexBulkDeleteCallback callback = vstate->callback;
    void       *callback_state = vstate->callback_state;
    Relation    rel = info->index;
    bool        delete_now;     /* delete this page now? */
    BlockNumber recurse_to;     /* earlier block to re-examine, or P_NONE */
    Buffer      buf;
    Page        page;
    BTPageOpaque opaque = NULL;
restart:
    delete_now = false;
    recurse_to = P_NONE;
    /* call vacuum_delay_point while not holding any buffer lock */
    vacuum_delay_point();
    /*
     * We can't use _bt_getbuf() here because it always applies
     * _bt_checkpage(), which will barf on an all-zero page. We want to
     * recycle all-zero pages, not fail.  Also, we want to use a nondefault
     * buffer access strategy.
     */
    buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                             info->strategy);
    LockBuffer(buf, BT_READ);
    page = BufferGetPage(buf);
    if (!PageIsNew(page))
    {
        _bt_checkpage(rel, buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    }
    /*
     * If we are recursing, the only case we want to do anything with is a
     * live leaf page having the current vacuum cycle ID.  Any other state
     * implies we already saw the page (eg, deleted it as being empty).
     */
    if (blkno != orig_blkno)
    {
        /* blkno differs from orig_blkno, so we are recursing */
        if (_bt_page_recyclable(page) ||
            P_IGNORE(opaque) ||
            !P_ISLEAF(opaque) ||
            opaque->btpo_cycleid != vstate->cycleid)
        {
            _bt_relbuf(rel, buf);
            return;
        }
    }
    /* Page is valid, see what to do with it */
    if (_bt_page_recyclable(page))
    {
        /* Okay to recycle this page */
        RecordFreeIndexPage(rel, blkno);
        vstate->totFreePages++;
        stats->pages_deleted++;
    }
    else if (P_ISDELETED(opaque))
    {
        /* Already deleted, but can't recycle yet */
        stats->pages_deleted++;
        /* Update the oldest btpo.xact */
        if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
            TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
            vstate->oldestBtpoXact = opaque->btpo.xact;
    }
    else if (P_ISHALFDEAD(opaque))
    {
        /* Half-dead, try to delete */
        delete_now = true;
    }
    else if (P_ISLEAF(opaque))
    {
        /* leaf page: offsets of the items to delete, and how many there are */
        OffsetNumber deletable[MaxOffsetNumber];
        int         ndeletable;
        OffsetNumber offnum,
                    minoff,
                    maxoff;
        /*
         * Trade in the initial read lock for a super-exclusive write lock on
         * this page.  We must get such a lock on every leaf page over the
         * course of the vacuum scan, whether or not it actually contains any
         * deletable tuples --- see nbtree/README.
         */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        LockBufferForCleanup(buf);
        /*
         * Remember highest leaf page number we've taken cleanup lock on; see
         * notes in btvacuumscan
         */
        if (blkno > vstate->lastBlockLocked)
            vstate->lastBlockLocked = blkno;
        /*
         * Check whether we need to recurse back to earlier pages.  What we
         * are concerned about is a page split that happened since we started
         * the vacuum scan.  If the split moved some tuples to a lower page
         * then we might have missed 'em.  If so, set up for tail recursion.
         * (Must do this before possibly clearing btpo_cycleid below!)
         */
        if (vstate->cycleid != 0 &&
            opaque->btpo_cycleid == vstate->cycleid &&
            !(opaque->btpo_flags & BTP_SPLIT_END) &&
            !P_RIGHTMOST(opaque) &&
            opaque->btpo_next < orig_blkno)
            recurse_to = opaque->btpo_next;
        /*
         * Scan over all items to see which ones need deleted according to the
         * callback function.
         */
        ndeletable = 0;
        minoff = P_FIRSTDATAKEY(opaque);
        maxoff = PageGetMaxOffsetNumber(page);
        if (callback)
        {
            for (offnum = minoff;
                 offnum <= maxoff;
                 offnum = OffsetNumberNext(offnum))
            {
                IndexTuple  itup;
                ItemPointer htup;   /* heap TID the index tuple points to */
                itup = (IndexTuple) PageGetItem(page,
                                                PageGetItemId(page, offnum));
                htup = &(itup->t_tid);
                /*
                 * During Hot Standby we currently assume that
                 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
                 * only true as long as the callback function depends only
                 * upon whether the index tuple refers to heap tuples removed
                 * in the initial heap scan. When vacuum starts it derives a
                 * value of OldestXmin. Backends taking later snapshots could
                 * have a RecentGlobalXmin with a later xid than the vacuum's
                 * OldestXmin, so it is possible that row versions deleted
                 * after OldestXmin could be marked as killed by other
                 * backends. The callback function *could* look at the index
                 * tuple state in isolation and decide to delete the index
                 * tuple, though currently it does not. If it ever did, we
                 * would need to reconsider whether XLOG_BTREE_VACUUM records
                 * should cause conflicts. If they did cause conflicts they
                 * would be fairly harsh conflicts, since we haven't yet
                 * worked out a way to pass a useful value for
                 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
                 * applies to *any* type of index that marks index tuples as
                 * killed.
                 */
                if (callback(htup, callback_state))
                    deletable[ndeletable++] = offnum;
            }
        }
        /*
         * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
         * call per page, so as to minimize WAL traffic.
         */
        if (ndeletable > 0)
        {
            /* the deletable array is non-empty */
            /*
             * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
             * all information to the replay code to allow it to get a cleanup
             * lock on all pages between the previous lastBlockVacuumed and
             * this page. This ensures that WAL replay locks all leaf pages at
             * some point, which is important should non-MVCC scans be
             * requested. This is currently unused on standby, but we record
             * it anyway, so that the WAL contains the required information.
             *
             * Since we can visit leaf pages out-of-order when recursing,
             * replay might end up locking such pages an extra time, but it
             * doesn't seem worth the amount of bookkeeping it'd take to avoid
             * that.
             */
            _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
                                vstate->lastBlockVacuumed);
            /*
             * Remember highest leaf page number we've issued a
             * XLOG_BTREE_VACUUM WAL record for.
             */
            if (blkno > vstate->lastBlockVacuumed)
                vstate->lastBlockVacuumed = blkno;
            stats->tuples_removed += ndeletable;
            /* must recompute maxoff */
            maxoff = PageGetMaxOffsetNumber(page);
        }
        else
        {
            /*
             * If the page has been split during this vacuum cycle, it seems
             * worth expending a write to clear btpo_cycleid even if we don't
             * have any deletions to do.  (If we do, _bt_delitems_vacuum takes
             * care of this.)  This ensures we won't process the page again.
             *
             * We treat this like a hint-bit update because there's no need to
             * WAL-log it.
             */
            if (vstate->cycleid != 0 &&
                opaque->btpo_cycleid == vstate->cycleid)
            {
                opaque->btpo_cycleid = 0;
                MarkBufferDirtyHint(buf, true);
            }
        }
        /*
         * If it's now empty, try to delete; else count the live tuples. We
         * don't delete when recursing, though, to avoid putting entries into
         * freePages out-of-order (doesn't seem worth any extra code to handle
         * the case).
         */
        if (minoff > maxoff)
            delete_now = (blkno == orig_blkno);
        else
            stats->num_index_tuples += maxoff - minoff + 1;
    }
    if (delete_now)
    {
        MemoryContext oldcontext;
        int         ndel;
        /* Run pagedel in a temp context to avoid memory leakage */
        MemoryContextReset(vstate->pagedelcontext);
        oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
        ndel = _bt_pagedel(rel, buf);
        /* count only this page, else may double-count parent */
        if (ndel)
        {
            stats->pages_deleted++;
            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
                TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
                vstate->oldestBtpoXact = opaque->btpo.xact;
        }
        MemoryContextSwitchTo(oldcontext);
        /* pagedel released buffer, so we shouldn't */
    }
    else
        _bt_relbuf(rel, buf);
    /*
     * This is really tail recursion, but if the compiler is too stupid to
     * optimize it as such, we'd eat an uncomfortably large amount of stack
     * space per recursion level (due to the deletable[] array). A failure is
     * improbable since the number of levels isn't likely to be large ... but
     * just in case, let's hand-optimize into a loop.
     */
    if (recurse_to != P_NONE)
    {
        blkno = recurse_to;
        goto restart;
    }
}

lazy_tid_reaped
The callback function. It uses the C library function bsearch to check whether a given TID can be deleted.


/*
 *  lazy_tid_reaped() -- is a particular tid deletable?
 *
 *      This has the right signature to be an IndexBulkDeleteCallback.
 *
 *      Assumes dead_tuples array is in sorted order.
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVRelStats *vacrelstats = (LVRelStats *) state;
    ItemPointer res;
    //vac_cmp_itemptr is the comparison function
    res = (ItemPointer) bsearch((void *) itemptr,
                                (void *) vacrelstats->dead_tuples,
                                vacrelstats->num_dead_tuples,
                                sizeof(ItemPointerData),
                                vac_cmp_itemptr);
    return (res != NULL);
}
/*
 * Comparator routines for use with qsort() and bsearch().
 * Compares block number first, then offset within the block:
 * returns 0 if equal, -1 if left < right, 1 if left > right.
 */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
    BlockNumber lblk,
                rblk;
    OffsetNumber loff,
                roff;
    lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    rblk = ItemPointerGetBlockNumber((ItemPointer) right);
    if (lblk < rblk)
        return -1;
    if (lblk > rblk)
        return 1;
    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);
    if (loff < roff)
        return -1;
    if (loff > roff)
        return 1;
    return 0;
}
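
The lookup above can be tried outside PostgreSQL with a simplified TID type. This is a sketch under stated assumptions: DemoTid, demo_cmp_tid, and demo_tid_reaped are hypothetical stand-ins for ItemPointerData, vac_cmp_itemptr, and lazy_tid_reaped, and the array must already be sorted in comparator order, as the original comment requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for ItemPointerData: (block, offset) */
typedef struct DemoTid
{
    unsigned int   block;
    unsigned short offset;
} DemoTid;

/* Order by block number first, then by offset within the block */
static int
demo_cmp_tid(const void *left, const void *right)
{
    const DemoTid *l = (const DemoTid *) left;
    const DemoTid *r = (const DemoTid *) right;

    if (l->block != r->block)
        return (l->block < r->block) ? -1 : 1;
    if (l->offset != r->offset)
        return (l->offset < r->offset) ? -1 : 1;
    return 0;
}

/* True if tid is present in dead[], which must be sorted by demo_cmp_tid */
static bool
demo_tid_reaped(const DemoTid *tid, const DemoTid *dead, size_t ndead)
{
    return bsearch(tid, dead, ndead, sizeof(DemoTid), demo_cmp_tid) != NULL;
}
```

Because the dead-tuple array is sorted, each index entry costs one binary search (logarithmic in the number of dead tuples) instead of a linear scan.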

III. Trace Analysis

Test script: delete some rows, then run vacuum


14:24:12 (xdb@[local]:5432)testdb=# delete from t1 where id < 1300;
DELETE 100
14:24:23 (xdb@[local]:5432)testdb=# checkpoint;
CHECKPOINT
14:24:26 (xdb@[local]:5432)testdb=# 
14:25:28 (xdb@[local]:5432)testdb=# vacuum verbose t1;

btvacuumscan
Start gdb and set a breakpoint


(gdb) b btvacuumscan
Breakpoint 1 at 0x509951: file nbtree.c, line 959.
(gdb) c
Continuing.
Breakpoint 1, btvacuumscan (info=0x7ffd33d29b70, stats=0x23ea988, callback=0x6bf507 <lazy_tid_reaped>, 
    callback_state=0x23eaaf8, cycleid=37964, oldestBtpoXact=0x7ffd33d29a40) at nbtree.c:959
959     Relation    rel = info->index;
(gdb)

輸入?yún)?shù)


(gdb) p *info
$1 = {index = 0x7f6b76bcc688, analyze_only = false, estimated_count = true, message_level = 17, num_heap_tuples = 14444, 
  strategy = 0x2413708}
(gdb) p *stats
$2 = {num_pages = 0, pages_removed = 0, estimated_count = false, num_index_tuples = 0, tuples_removed = 0, 
  pages_deleted = 0, pages_free = 0}
(gdb) p *oldestBtpoXact
$3 = 869440096
(gdb) 
(gdb) p (LVRelStats *)callback_state
$4 = (LVRelStats *) 0x23eaaf8
(gdb) p *(LVRelStats *)callback_state
$5 = {hasindex = true, old_rel_pages = 124, rel_pages = 124, scanned_pages = 52, pinskipped_pages = 0, 
  frozenskipped_pages = 1, tupcount_pages = 52, old_live_tuples = 14444, new_rel_tuples = 14840, new_live_tuples = 14840, 
  new_dead_tuples = 0, pages_removed = 0, tuples_deleted = 100, nonempty_pages = 124, num_dead_tuples = 100, 
  max_dead_tuples = 36084, dead_tuples = 0x7f6b76ad7050, num_index_scans = 0, latestRemovedXid = 397077, 
  lock_waiter_detected = false}

1. Initialize the statistics (IndexBulkDeleteResult struct)


(gdb) n
969     stats->estimated_count = false;
(gdb) 
970     stats->num_index_tuples = 0;
(gdb) 
971     stats->pages_deleted = 0;
(gdb)

2. Initialize the vstate state information (BTVacState struct)


974     vstate.info = info;
(gdb) 
975     vstate.stats = stats;
(gdb) 
976     vstate.callback = callback;
(gdb) 
977     vstate.callback_state = callback_state;
(gdb) 
978     vstate.cycleid = cycleid;
(gdb) 
979     vstate.lastBlockVacuumed = BTREE_METAPAGE;  /* Initialise at first block */
(gdb) 
980     vstate.lastBlockLocked = BTREE_METAPAGE;
(gdb) 
981     vstate.totFreePages = 0;
(gdb) 
982     vstate.oldestBtpoXact = InvalidTransactionId;
(gdb) 
(gdb) p vstate
$6 = {info = 0x7ffd33d29b70, stats = 0x23ea988, callback = 0x6bf507 <lazy_tid_reaped>, callback_state = 0x23eaaf8, 
  cycleid = 37964, lastBlockVacuumed = 0, lastBlockLocked = 0, totFreePages = 0, oldestBtpoXact = 0, 
  pagedelcontext = 0x23c1d00}

3. Create the temporary memory context


985     vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
(gdb)

4. Loop over the pages
4.1 Take the relation extension lock and read the current page count


1012        needLock = !RELATION_IS_LOCAL(rel);
(gdb) n
1014        blkno = BTREE_METAPAGE + 1;
(gdb) 
1018            if (needLock)
(gdb) p needLock
$7 = true
(gdb) n
1019                LockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1020            num_pages = RelationGetNumberOfBlocks(rel);
(gdb) 
1021            if (needLock)
(gdb) p num_pages
$8 = 60
(gdb) n
1022                UnlockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1025            if (blkno >= num_pages)
(gdb) p blkno
$9 = 1
(gdb) n
1028            for (; blkno < num_pages; blkno++)
(gdb)

4.2 Iterate over the blocks, calling btvacuumpage on each


(gdb) 
1030                btvacuumpage(&vstate, blkno, blkno);
(gdb) 
1028            for (; blkno < num_pages; blkno++)
(gdb) 
1030                btvacuumpage(&vstate, blkno, blkno);
(gdb)

4.3 If needed, make further passes over the relation (here blkno has reached num_pages, so the loop exits)


(gdb) b nbtree.c:1018
Breakpoint 2 at 0x509a1f: file nbtree.c, line 1018.
(gdb) c
Continuing.
Breakpoint 2, btvacuumscan (info=0x7ffd33d29b70, stats=0x23ea988, callback=0x6bf507 <lazy_tid_reaped>, 
    callback_state=0x23eaaf8, cycleid=37964, oldestBtpoXact=0x7ffd33d29a40) at nbtree.c:1018
1018            if (needLock)
(gdb) n
1019                LockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1020            num_pages = RelationGetNumberOfBlocks(rel);
(gdb) 
1021            if (needLock)
(gdb) 
1022                UnlockRelationForExtension(rel, ExclusiveLock);
(gdb) 
1025            if (blkno >= num_pages)
(gdb) p blkno
$11 = 60
(gdb) n
1026                break;
(gdb)
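
The control flow traced in step 4 is an outer loop that re-reads the relation size and an inner loop that visits the pages not yet seen. A minimal sketch, assuming a hypothetical DemoRel whose size can jump once to simulate a concurrent extension; demo_num_pages only stands in for RelationGetNumberOfBlocks, and the locking is omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for a relation whose size may grow mid-scan */
typedef struct DemoRel
{
    unsigned npages;        /* current size in pages */
    unsigned grow_once_to;  /* if larger, size jumps to this after a pass */
} DemoRel;

/* Stand-in for RelationGetNumberOfBlocks(); not the real function */
static unsigned
demo_num_pages(const DemoRel *rel)
{
    return rel->npages;
}

/* Returns pages visited; *passes counts size checks that found new pages */
static unsigned
demo_scan(DemoRel *rel, unsigned *passes)
{
    unsigned blkno = 1;     /* page 0 is the metapage, so start at 1 */
    unsigned visited = 0;

    *passes = 0;
    for (;;)
    {
        unsigned num_pages = demo_num_pages(rel);

        if (blkno >= num_pages)
            break;          /* nothing new since the last check */
        (*passes)++;
        for (; blkno < num_pages; blkno++)
            visited++;      /* btvacuumpage() would run here */

        /* simulate another backend extending the index during the pass */
        if (rel->grow_once_to > rel->npages)
            rel->npages = rel->grow_once_to;
    }
    return visited;
}
```

With a fixed size of 60 pages the sketch visits pages 1 through 59 in one pass and then breaks, which matches the trace where blkno is 60 at the second size check.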

5. WAL record handling


(gdb) n
1048        if (XLogStandbyInfoActive() &&
(gdb)

6. Delete the temporary memory context


(gdb) 
1067        MemoryContextDelete(vstate.pagedelcontext);
(gdb)

7. Handle free space


(gdb) 
1081        if (vstate.totFreePages > 0)
(gdb) 
1082            IndexFreeSpaceMapVacuum(rel);
(gdb)

8. Update the statistics


(gdb) 
1085        stats->num_pages = num_pages;
(gdb) 
1086        stats->pages_free = vstate.totFreePages;
(gdb) 
1088        if (oldestBtpoXact)
(gdb) 
1089            *oldestBtpoXact = vstate.oldestBtpoXact;
(gdb) p oldestBtpoXact
$12 = (TransactionId *) 0x7ffd33d29a40
(gdb) p *oldestBtpoXact
$13 = 869440096
(gdb) p vstate.oldestBtpoXact
$14 = 397078
(gdb) n
1090    }
(gdb) p *stats
$15 = {num_pages = 60, pages_removed = 0, estimated_count = false, num_index_tuples = 8701, tuples_removed = 100, 
  pages_deleted = 7, pages_free = 6}
(gdb)
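
The oldestBtpoXact value written back in step 8 is the minimum, in transaction-ID order, of the btpo.xact values of the pages deleted during the scan. The sketch below models that bookkeeping. DemoXid and the demo_ functions are hypothetical names; demo_xid_precedes follows the wraparound-aware comparison used by TransactionIdPrecedes, and the XIDs other than 397078 in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t DemoXid;

#define DEMO_INVALID_XID       0u   /* mirrors InvalidTransactionId */
#define DEMO_FIRST_NORMAL_XID  3u   /* mirrors FirstNormalTransactionId */

/* Wraparound-aware "a is older than b" for 32-bit transaction IDs */
static bool
demo_xid_precedes(DemoXid a, DemoXid b)
{
    /* special (non-normal) XIDs compare as plain unsigned values */
    if (a < DEMO_FIRST_NORMAL_XID || b < DEMO_FIRST_NORMAL_XID)
        return a < b;
    /* normal XIDs compare modulo 2^32 */
    return (int32_t) (a - b) < 0;
}

/* Fold one deleted page's xact into the running oldest value */
static DemoXid
demo_track_oldest(DemoXid oldest, DemoXid page_xact)
{
    if (oldest == DEMO_INVALID_XID || demo_xid_precedes(page_xact, oldest))
        return page_xact;
    return oldest;
}
```

Starting from DEMO_INVALID_XID and folding in 397080, 397078, and 397079 leaves 397078, the same kind of result the trace shows in vstate.oldestBtpoXact.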

The call completes


(gdb) n
btbulkdelete (info=0x7ffd33d29b70, stats=0x23ea988, callback=0x6bf507 <lazy_tid_reaped>, callback_state=0x23eaaf8)
    at nbtree.c:880
880         _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
(gdb)

btvacuumpage


14:50:45 (xdb@[local]:5432)testdb=# vacuum verbose t1;
...............
(gdb) b btvacuumpage
Breakpoint 3 at 0x509b82: file nbtree.c, line 1106.
(gdb) 
(gdb) c
Continuing.
Breakpoint 3, btvacuumpage (vstate=0x7ffd33d298d0, blkno=1, orig_blkno=1) at nbtree.c:1106
1106        IndexVacuumInfo *info = vstate->info;
(gdb)

輸入?yún)?shù)


(gdb) p *vstate
$16 = {info = 0x7ffd33d29b70, stats = 0x24157e8, callback = 0x6bf507 <lazy_tid_reaped>, callback_state = 0x2415958, 
  cycleid = 37965, lastBlockVacuumed = 0, lastBlockLocked = 0, totFreePages = 0, oldestBtpoXact = 0, 
  pagedelcontext = 0x23ea7a0}
(gdb)

1. Initialize the local variables


(gdb) n
1107        IndexBulkDeleteResult *stats = vstate->stats;
(gdb) 
1108        IndexBulkDeleteCallback callback = vstate->callback;
(gdb) 
1109        void       *callback_state = vstate->callback_state;
(gdb) 
1110        Relation    rel = info->index;
(gdb) 
1115        BTPageOpaque opaque = NULL;
(gdb) 
1118        delete_now = false;
(gdb) 
1119        recurse_to = P_NONE;
(gdb) 
1122        vacuum_delay_point();
(gdb) p *info
$17 = {index = 0x7f6b76b0c268, analyze_only = false, estimated_count = true, message_level = 17, num_heap_tuples = 14840, 
  strategy = 0x2403478}
(gdb) p *stats
$18 = {num_pages = 0, pages_removed = 0, estimated_count = false, num_index_tuples = 0, tuples_removed = 0, 
  pages_deleted = 0, pages_free = 0}
(gdb) p rel
$19 = (Relation) 0x7f6b76b0c268
(gdb) p *rel
$20 = {rd_node = {spcNode = 1663, dbNode = 16402, relNode = 50823}, rd_smgr = 0x23d0270, rd_refcnt = 1, rd_backend = -1, 
  rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000', rd_statvalid = false, 
  rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7f6b76bccd20, rd_att = 0x7f6b76bcc9b8, rd_id = 50823, 
  rd_lockInfo = {lockRelId = {relId = 50823, dbId = 16402}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, 
  rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, 
  rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, 
  rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, 
  rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x7f6b76bcc8d8, rd_indextuple = 0x7f6b76bcc8a0, 
  rd_amhandler = 330, rd_indexcxt = 0x236b340, rd_amroutine = 0x236b480, rd_opfamily = 0x236b598, rd_opcintype = 0x236b5b8, 
  rd_support = 0x236b5d8, rd_supportinfo = 0x236b600, rd_indoption = 0x236b738, rd_indexprs = 0x0, rd_indpred = 0x0, 
  rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x236b718, 
  rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x23c4198}
(gdb)

2. Call ReadBufferExtended to read the block into a buffer, lock the buffer, and get the page


(gdb) 
1130        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
(gdb) n
1132        LockBuffer(buf, BT_READ);
(gdb) 
1133        page = BufferGetPage(buf);
(gdb) 
1134        if (!PageIsNew(page))
(gdb) p *page
$21 = 1 '\001'
(gdb) p page
$22 = (Page) 0x7f6b4add8380 "\001"
(gdb) p *(PageHeader)page
$23 = {pd_lsn = {xlogid = 1, xrecoff = 1320733408}, pd_checksum = 0, pd_flags = 0, pd_lower = 28, pd_upper = 8168, 
  pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp = 0x7f6b4add8398}
(gdb)

3. If it is not a new page, check the page and get its BTPageOpaque


(gdb) n
1136            _bt_checkpage(rel, buf);
(gdb) 
1137            opaque = (BTPageOpaque) PageGetSpecialPointer(page);
(gdb) p buf
$24 = 224
(gdb) n
1145        if (blkno != orig_blkno)
(gdb) p opaque
$25 = (BTPageOpaque) 0x7f6b4adda370
(gdb) p *opaque
$26 = {btpo_prev = 0, btpo_next = 33, btpo = {level = 397073, xact = 397073}, btpo_flags = 5, btpo_cycleid = 0}
(gdb)
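
The btpo_flags value of 5 printed above can be decoded by hand. The following minimal sketch assumes the flag bit values from PostgreSQL's nbtree.h for this version; the demo_ helpers are hypothetical names (the real macros are P_ISLEAF, P_ISDELETED, and P_ISHALFDEAD).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits copied from nbtree.h for this sketch */
#define BTP_LEAF        (1 << 0)    /* leaf page */
#define BTP_ROOT        (1 << 1)    /* root page */
#define BTP_DELETED     (1 << 2)    /* page has been deleted from the tree */
#define BTP_META        (1 << 3)    /* meta page */
#define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in the tree */

static bool demo_is_leaf(uint16_t flags)      { return (flags & BTP_LEAF) != 0; }
static bool demo_is_deleted(uint16_t flags)   { return (flags & BTP_DELETED) != 0; }
static bool demo_is_half_dead(uint16_t flags) { return (flags & BTP_HALF_DEAD) != 0; }
```

With flags = 5 (BTP_LEAF | BTP_DELETED) the page is a deleted leaf, which is consistent with the trace taking the recyclable branch in step 5.1. Since btpo is a union, gdb prints the same 397073 for both level and xact; for a deleted page it is the xact reading that applies.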

4. If the block number differs from the original one, we are recursing: if the page is recyclable, ignorable, not a leaf, or has a different cycleid, call _bt_relbuf and return; otherwise continue


(gdb) n
1145        if (blkno != orig_blkno)

5. Check the page state
5.1 If the page is recyclable, recycle it


(gdb) n
1158        if (_bt_page_recyclable(page))
(gdb) 
1161            RecordFreeIndexPage(rel, blkno);
(gdb) 
1162            vstate->totFreePages++;
(gdb) p blkno
$27 = 1
(gdb) n
1163            stats->pages_deleted++;
(gdb)

5.2 If the page is already deleted but cannot be recycled yet, update the statistics


N/A

5.3 If the page is half-dead, try to delete it (set delete_now to true)


N/A

5.5 If deletion is to be attempted (delete_now is true), call _bt_pagedel and update the statistics;
otherwise call _bt_relbuf


1329        if (delete_now)
(gdb) 
1353            _bt_relbuf(rel, buf);

5.6 Check recurse_to != P_NONE: if true, restart; otherwise exit


1362        if (recurse_to != P_NONE)
(gdb) p recurse_to
$29 = 0
(gdb) p P_NONE
$30 = 0
(gdb) n
1367    }
(gdb)

Entering the logic for leaf pages
5.4 If the page is a leaf


(gdb) del 
Delete all breakpoints? (y or n) y
(gdb) b nbtree.c:1182
Breakpoint 5 at 0x509e61: file nbtree.c, line 1182.
(gdb) c
Continuing.
Breakpoint 5, btvacuumpage (vstate=0x7ffd33d298d0, blkno=6, orig_blkno=6) at nbtree.c:1194
1194            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
(gdb)

5.4.1 Initialize variables


N/A

5.4.2 Lock the buffer (release the read lock, then take a cleanup lock)


1194            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
(gdb) N
1195            LockBufferForCleanup(buf);
(gdb)

5.4.3 Record the highest leaf page number on which a cleanup lock has been taken


(gdb) 
1201            if (blkno > vstate->lastBlockLocked)
(gdb) p blkno
$31 = 6
(gdb) p vstate->lastBlockLocked
$32 = 0
(gdb) n
1202                vstate->lastBlockLocked = blkno;
(gdb)

5.4.4 Check whether we need to recurse back to an earlier page


(gdb) 
1211            if (vstate->cycleid != 0 &&
(gdb) p vstate->cycleid
$33 = 37965
(gdb) p opaque->btpo_cycleid
$34 = 0
(gdb) p vstate->cycleid
$35 = 37965
(gdb) 
(gdb) n
1212                opaque->btpo_cycleid == vstate->cycleid &&
(gdb) 
1211            if (vstate->cycleid != 0 &&
(gdb) 
1222            ndeletable = 0;
(gdb) 
1223            minoff = P_FIRSTDATAKEY(opaque);
(gdb) 
1224            maxoff = PageGetMaxOffsetNumber(page);
(gdb) 
1225            if (callback)
(gdb) p minoff
$36 = 2
(gdb) p maxoff
$37 = 174
(gdb)

5.4.5 Scan all items and collect those the callback reports as deletable (written to the deletable array)


(gdb) n
1227                for (offnum = minoff;
(gdb) 
1234                    itup = (IndexTuple) PageGetItem(page,
(gdb) 
1236                    htup = &(itup->t_tid);
(gdb) p *itup
$38 = {t_tid = {ip_blkid = {bi_hi = 0, bi_lo = 103}, ip_posid = 138}, t_info = 16}
(gdb) n
1259                    if (callback(htup, callback_state))
(gdb) p *htup
$41 = {ip_blkid = {bi_hi = 0, bi_lo = 103}, ip_posid = 138}
(gdb)
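
The heap TID printed above stores its block number as two 16-bit halves. A small sketch of how bi_hi and bi_lo combine into a 32-bit block number, using simplified copies of the struct layouts; the Demo names are hypothetical, and PostgreSQL does this with ItemPointerGetBlockNumber.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified copies of the PostgreSQL layouts, for illustration only */
typedef struct DemoBlockId
{
    uint16_t bi_hi;     /* high 16 bits of the block number */
    uint16_t bi_lo;     /* low 16 bits of the block number */
} DemoBlockId;

typedef struct DemoItemPointer
{
    DemoBlockId ip_blkid;   /* heap block */
    uint16_t    ip_posid;   /* line pointer offset within the block */
} DemoItemPointer;

/* Combine the two halves into one 32-bit block number */
static uint32_t
demo_block_number(const DemoItemPointer *tid)
{
    return ((uint32_t) tid->ip_blkid.bi_hi << 16) | tid->ip_blkid.bi_lo;
}
```

For the entry in the trace (bi_hi = 0, bi_lo = 103, ip_posid = 138) this yields heap block 103, offset 138, which is the key handed to the callback.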

Step into the callback function lazy_tid_reaped


(gdb) step
lazy_tid_reaped (itemptr=0x7f6b4addfd40, state=0x2415958) at vacuumlazy.c:2140
2140        LVRelStats *vacrelstats = (LVRelStats *) state;
(gdb)

Call bsearch to test the TID; it returns NULL, so this entry is not deletable


(gdb) n
2145                                    vacrelstats->num_dead_tuples,
(gdb) 
2143        res = (ItemPointer) bsearch((void *) itemptr,
(gdb) 
2144                                    (void *) vacrelstats->dead_tuples,
(gdb) 
2143        res = (ItemPointer) bsearch((void *) itemptr,
(gdb) 
2149        return (res != NULL);
(gdb) p res
$42 = (ItemPointer) 0x0
(gdb) 
(gdb) n
2150    }
(gdb) 
btvacuumpage (vstate=0x7ffd33d298d0, blkno=6, orig_blkno=6) at nbtree.c:1229
1229                     offnum = OffsetNumberNext(offnum))
(gdb)

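5.4.6 If the array is not empty, call _bt_delitems_vacuum and record the related information.
If the array is empty, check whether the page was split during this vacuum cycle; if so, clear the btpo_cycleid flag and mark the buffer dirty.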


(gdb) del
Delete all breakpoints? (y or n) y
(gdb) b nbtree.c:1284
Breakpoint 7 at 0x50a035: file nbtree.c, line 1284.
(gdb) c
Continuing.
Breakpoint 7, btvacuumpage (vstate=0x7ffd33d298d0, blkno=48, orig_blkno=48) at nbtree.c:1284
1284                _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
(gdb) 
(gdb) n
1291                if (blkno > vstate->lastBlockVacuumed)
(gdb) p blkno
$43 = 48
(gdb) p vstate->lastBlockVacuumed
$44 = 0
(gdb) n
1292                    vstate->lastBlockVacuumed = blkno;
(gdb) 
1294                stats->tuples_removed += ndeletable;
(gdb) 
1296                maxoff = PageGetMaxOffsetNumber(page);
(gdb) 
1323            if (minoff > maxoff)
(gdb) p minoff
$45 = 2
(gdb) p maxoff
$46 = 67
(gdb) n
1326                stats->num_index_tuples += maxoff - minoff + 1;
(gdb)
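
The final branch either schedules an empty page for deletion or adds its remaining entries to the live-tuple count. A minimal sketch of that decision; demo_count_or_delete is a hypothetical helper, not a PostgreSQL function, and the recursing flag stands for blkno != orig_blkno.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if the page should be deleted now; otherwise adds the
 * page's remaining item count to *num_index_tuples and returns false.
 */
static bool
demo_count_or_delete(unsigned minoff, unsigned maxoff,
                     bool recursing, double *num_index_tuples)
{
    if (minoff > maxoff)
        return !recursing;      /* empty page: delete unless recursing */
    *num_index_tuples += maxoff - minoff + 1;
    return false;
}
```

With the traced values minoff = 2 and maxoff = 67, the page contributes 67 - 2 + 1 = 66 entries. minoff is 2 here because on a page that is not the rightmost at its level, offset 1 holds the high key rather than a data item.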

DONE!

IV. References

PG Source Code
