What Does the RelationGetBufferForTuple Function Do in PostgreSQL?

Published: 2021-11-09 15:36:16  Source: 億速云  Views: 244  Author: iii  Category: Relational Databases


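This section briefly introduces RelationGetBufferForTuple, a buffer-related function that PostgreSQL calls while executing an insert. It returns a page whose free space is >= the requested size; the buffer holding that page is returned pinned and exclusive-locked.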

I. Data Structures

BufferDesc
Shared descriptor (state) data for a single shared buffer.

/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */
/*
 *  BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    // zero-based index that identifies the corresponding buffer pool slot
    int         buf_id;         /* buffer's index number (from 0) */
    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;
    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */
    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
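
The header comment says the spinlock bit, flags, refcount and usagecount share one atomic word, and that updates made without the header lock must go through CAS so they never touch the word while BM_LOCKED is set. The following is a minimal sketch of that idea using C11 atomics rather than PostgreSQL's pg_atomic API. The flag position matches the listing above; the 18-bit refcount and 4-bit usagecount split, and every `demo_`/`DEMO_` name, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 0..17 refcount, bits 18..21 usagecount, bits 22..31 flags. */
#define DEMO_REFCOUNT_ONE    1U
#define DEMO_REFCOUNT_MASK   ((1U << 18) - 1)
#define DEMO_USAGECOUNT_ONE  (1U << 18)
#define DEMO_USAGECOUNT_MASK 0x003C0000U
#define DEMO_BM_LOCKED       (1U << 22)

static inline uint32_t demo_refcount(uint32_t state)
{
    return state & DEMO_REFCOUNT_MASK;
}

static inline uint32_t demo_usagecount(uint32_t state)
{
    return (state & DEMO_USAGECOUNT_MASK) >> 18;
}

/*
 * Increment the refcount without taking the header spinlock.
 * The CAS is only attempted on a value whose lock bit is clear, so a
 * concurrent lock holder's single-write update is never overwritten.
 * Returns the new state word.
 */
static uint32_t demo_pin(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    for (;;)
    {
        if (old & DEMO_BM_LOCKED)
        {
            /* header is locked by someone else: re-read and try again */
            old = atomic_load(state);
            continue;
        }
        assert(demo_refcount(old) < DEMO_REFCOUNT_MASK);
        /* on failure, 'old' is refreshed with the current value */
        if (atomic_compare_exchange_weak(state, &old, old + DEMO_REFCOUNT_ONE))
            return old + DEMO_REFCOUNT_ONE;
    }
}
```

A plain atomic fetch-and-add would be simpler, but it could modify the word while another backend holds the header lock, which is exactly what the comment forbids.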

BufferTag
The buffer tag identifies which disk block the buffer contains.

/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
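
The note about pad bytes matters because a hash key is typically hashed and compared as raw bytes. The sketch below illustrates the concern with a hypothetical `DemoTag` struct (not PostgreSQL's BufferTag) that deliberately contains padding: zeroing the whole struct before assigning fields makes two logically equal tags byte-identical on common compilers. Strictly speaking the C standard leaves padding values unspecified after a member store, so treat this as an illustration of the practice, not a portability guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag; the narrow 'fork' member forces padding before 'block'. */
typedef struct DemoTag
{
    uint32_t spc;    /* stand-ins for the relation identifier fields */
    uint32_t db;
    uint32_t rel;
    int8_t   fork;
    uint32_t block;
} DemoTag;

/* Zero the whole struct first so padding bytes do not carry stale data. */
static void demo_init_tag(DemoTag *tag, uint32_t spc, uint32_t db,
                          uint32_t rel, int8_t fork, uint32_t block)
{
    assert(tag != NULL);
    memset(tag, 0, sizeof(*tag));
    tag->spc = spc;
    tag->db = db;
    tag->rel = rel;
    tag->fork = fork;
    tag->block = block;
}

/* Byte-wise comparison, the way a generic hash table compares keys. */
static bool demo_tags_equal(const DemoTag *a, const DemoTag *b)
{
    return memcmp(a, b, sizeof(DemoTag)) == 0;
}
```

Without the `memset`, two tags with identical fields could differ in their padding bytes and land in different hash buckets.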

II. Source Code Walkthrough

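RelationGetBufferForTuple returns a page with free space >= the given size; the buffer for that page is pinned and exclusive-locked on return.
Input:
relation - the target table
len - the amount of space needed
otherBuffer - used for updates: the previously pinned buffer (the one holding the tuple being updated)
options - processing options
bistate - BulkInsert state
vmbuffer - the first visibility map (VM) buffer
vmbuffer_other - used for updates: the VM buffer that corresponds to otherBuffer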
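Note:
The otherBuffer parameter can be confusing; it exists because of how PG implements updates.
An update is not done in place: the old tuple is kept (its xmax is set) and a new tuple is inserted.
If the old and new tuples live in different blocks, locking those blocks can deadlock.
For example: Session A updates the first row of table T, which is in Block 0, and the new version goes to Block 2.
Session B updates the second row of table T, which is also in Block 0, and its new version also goes to Block 2.
Both Block 0 and Block 2 must be locked to complete either update:
if Session A locks Block 2 first while Session B locks Block 0 first,
and then Session A tries to lock Block 0 while Session B tries to lock Block 2, the result is a deadlock.
To avoid this, PG requires that blocks of the same relation be locked in increasing block-number order:
to lock Blocks 0 and 2, Block 0 must be locked first, then Block 2.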
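
The ordering rule in the note can be sketched as a small pure function. This is an illustration only: `DemoLockPlan`, `demo_plan_locks` and `DEMO_INVALID_BLOCK` are hypothetical names, and the function merely computes the order in which the block locks would be taken instead of acquiring real buffer locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INVALID_BLOCK 0xFFFFFFFFu   /* stand-in for "no other block" */

typedef struct DemoLockPlan
{
    int      nlocks;     /* 1 or 2 */
    uint32_t order[2];   /* block numbers, in the order they are locked */
} DemoLockPlan;

/*
 * Decide the locking order for an insert target and an optional other block.
 * Two distinct blocks are always locked lowest-numbered first, so two
 * sessions working on the same pair can never wait on each other in a cycle.
 */
static DemoLockPlan demo_plan_locks(uint32_t targetBlock, uint32_t otherBlock)
{
    DemoLockPlan plan = {0, {DEMO_INVALID_BLOCK, DEMO_INVALID_BLOCK}};

    assert(targetBlock != DEMO_INVALID_BLOCK);

    if (otherBlock == DEMO_INVALID_BLOCK || otherBlock == targetBlock)
    {
        /* plain insert, or old and new tuple share a page: one lock only */
        plan.nlocks = 1;
        plan.order[0] = targetBlock;
    }
    else
    {
        bool otherFirst = otherBlock < targetBlock;

        plan.nlocks = 2;
        plan.order[0] = otherFirst ? otherBlock : targetBlock;
        plan.order[1] = otherFirst ? targetBlock : otherBlock;
    }
    return plan;
}
```

In the example from the note, both sessions end up with the order Block 0 then Block 2, so whichever session gets Block 0 first can always proceed to Block 2.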
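Output:
The buffer allocated for the tuple.
The main logic is as follows:
1. Initialize the relevant variables
2. Compute the space to reserve on each page (from fillfactor)
3. For an update, get the block of the previously pinned buffer
4. Get the target page: targetBlock
5. If targetBlock is invalid and the FSM is in use, ask the FSM for a page (falling back to the last page of the relation)
6. While targetBlock is valid, loop over candidate pages looking for a suitable block
6.1. Read and exclusive-lock the target block, plus otherBuffer if one was given
6.2. Pin the visibility map page(s) if needed
6.3. Check whether the page has enough free space; if so, return the buffer
6.4. If not, call RecordAndGetPageWithFreeSpace to get the next targetBlock and loop again
7. If the loop ends without finding a block, extend the relation
8. After extending, read the buffer in P_NEW mode and lock it
9. Get the page for that buffer and run sanity checks
10. Raise an error if a check fails; otherwise return the buffer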
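
Steps 5 to 7 form a retry loop: the FSM may be stale, so each rejected page has its recorded free space corrected before the FSM is asked again, which is what guarantees termination. The toy model below sketches that loop. The `DemoRel` struct and the linear `demo_fsm_search` are assumptions for illustration; the real free space map is a tree structure with its own API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_BLOCK 0xFFFFFFFFu

typedef struct DemoRel
{
    size_t  nblocks;
    size_t *fsm;      /* what the free space map believes (may be stale) */
    size_t *actual;   /* real free space on each page */
} DemoRel;

/* First block the map believes has at least 'need' bytes free. */
static uint32_t demo_fsm_search(const DemoRel *rel, size_t need)
{
    for (size_t i = 0; i < rel->nblocks; i++)
        if (rel->fsm[i] >= need)
            return (uint32_t) i;
    return DEMO_INVALID_BLOCK;
}

/*
 * Returns a block with enough real free space, or DEMO_INVALID_BLOCK when
 * the caller would have to extend the relation.
 */
static uint32_t demo_find_block(DemoRel *rel, size_t need)
{
    uint32_t target = demo_fsm_search(rel, need);

    while (target != DEMO_INVALID_BLOCK)
    {
        assert(target < rel->nblocks);
        if (rel->actual[target] >= need)
            return target;      /* the page really has room */

        /* Record the true free space so this page is not offered again. */
        rel->fsm[target] = rel->actual[target];
        target = demo_fsm_search(rel, need);
    }
    return DEMO_INVALID_BLOCK;
}
```

Each failed iteration lowers one map entry below `need`, so the loop runs at most once per block before giving up.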

/*
 * RelationGetBufferForTuple
 *
 *  Returns pinned and exclusive-locked buffer of a page in given relation
 *  with free space >= given len.
 *
 *  If otherBuffer is not InvalidBuffer, then it references a previously
 *  pinned buffer of another page in the same relation; on return, this
 *  buffer will also be exclusive-locked.  (This case is used by heap_update;
 *  the otherBuffer contains the tuple being updated.)
 *
 *  The reason for passing otherBuffer is that if two backends are doing
 *  concurrent heap_update operations, a deadlock could occur if they try
 *  to lock the same two buffers in opposite orders.  To ensure that this
 *  can't happen, we impose the rule that buffers of a relation must be
 *  locked in increasing page number order.  This is most conveniently done
 *  by having RelationGetBufferForTuple lock them both, with suitable care
 *  for ordering.
 *
 *  NOTE: it is unlikely, but not quite impossible, for otherBuffer to be the
 *  same buffer we select for insertion of the new tuple (this could only
 *  happen if space is freed in that page after heap_update finds there's not
 *  enough there).  In that case, the page will be pinned and locked only once.
 *
 *  For the vmbuffer and vmbuffer_other arguments, we avoid deadlock by
 *  locking them only after locking the corresponding heap page, and taking
 *  no further lwlocks while they are locked.
 *
 *  We normally use FSM to help us find free space.  However,
 *  if HEAP_INSERT_SKIP_FSM is specified, we just append a new empty page to
 *  the end of the relation if the tuple won't fit on the current target page.
 *  This can save some cycles when we know the relation is new and doesn't
 *  contain useful amounts of free space.
 *
 *  HEAP_INSERT_SKIP_FSM is also useful for non-WAL-logged additions to a
 *  relation, if the caller holds exclusive lock and is careful to invalidate
 *  relation's smgr_targblock before the first insertion --- that ensures that
 *  all insertions will occur into newly added pages and not be intermixed
 *  with tuples from other transactions.  That way, a crash can't risk losing
 *  any committed data of other transactions.  (See heap_insert's comments
 *  for additional constraints needed for safe usage of this behavior.)
 *
 *  The caller can also provide a BulkInsertState object to optimize many
 *  insertions into the same relation.  This keeps a pin on the current
 *  insertion target page (to save pin/unpin cycles) and also passes a
 *  BULKWRITE buffer selection strategy object to the buffer manager.
 *  Passing NULL for bistate selects the default behavior.
 *
 *  We always try to avoid filling existing pages further than the fillfactor.
 *  This is OK since this routine is not consulted when updating a tuple and
 *  keeping it on the same page, which is the scenario fillfactor is meant
 *  to reserve space for.
 *
 *  ereport(ERROR) is allowed here, so this routine *must* be called
 *  before any (unlogged) changes are made in buffer pool.
 */
/*
 * The inputs, the deadlock example for otherBuffer and the output are as
 * described before this listing.
 * Note: a pinned buffer is one that is currently in use, so it must not be
 * evicted from the buffer pool.
 */
Buffer
RelationGetBufferForTuple(Relation relation, Size len,
                          Buffer otherBuffer, int options,
                          BulkInsertState bistate,
                          Buffer *vmbuffer, Buffer *vmbuffer_other)
{
    bool        use_fsm = !(options & HEAP_INSERT_SKIP_FSM); // whether to use the FSM to find free space
    Buffer      buffer = InvalidBuffer;
    Page        page;
    Size        pageFreeSpace = 0,  // free space on the page
                saveFreeSpace = 0;  // space to keep reserved on the page
    BlockNumber targetBlock,        // target block
                otherBlock;         // block of the previously pinned buffer
    bool        needLock;           // whether the extension lock is needed
    // align the requested size
    len = MAXALIGN(len);        /* be conservative */
    /* Bulk insert is not supported for updates, only inserts. */
    // a valid otherBuffer means this is an update, which cannot use BulkInsert
    Assert(otherBuffer == InvalidBuffer || !bistate);
    /*
     * If we're gonna fail for oversize tuple, do it right away
     */
    //#define MaxHeapTupleSize  (BLCKSZ - MAXALIGN(SizeOfPageHeaderData + sizeof(ItemIdData)))
    //#define MinHeapTupleSize  MAXALIGN(SizeofHeapTupleHeader)
    if (len > MaxHeapTupleSize)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("row is too big: size %zu, maximum size %zu",
                        len, MaxHeapTupleSize)));
    /* Compute desired extra freespace due to fillfactor option */
    /*
     * #define RelationGetTargetPageFreeSpace(relation, defaultff) \
     *     (BLCKSZ * (100 - RelationGetFillFactor(relation, defaultff)) / 100)
     */
    saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                   HEAP_DEFAULT_FILLFACTOR);
    // for an update, get the block of the previously pinned buffer
    if (otherBuffer != InvalidBuffer)
        otherBlock = BufferGetBlockNumber(otherBuffer);
    else
        otherBlock = InvalidBlockNumber;    /* just to keep compiler quiet */
    /*
     * We first try to put the tuple on the same page we last inserted a tuple
     * on, as cached in the BulkInsertState or relcache entry.  If that
     * doesn't work, we ask the Free Space Map to locate a suitable page.
     * Since the FSM's info might be out of date, we have to be prepared to
     * loop around and retry multiple times. (To insure this isn't an infinite
     * loop, we must update the FSM with the correct amount of free space on
     * each page that proves not to be suitable.)  If the FSM has no record of
     * a page with enough free space, we give up and extend the relation.
     *
     * When use_fsm is false, we either put the tuple onto the existing target
     * page or extend the relation.
     */
    if (len + saveFreeSpace > MaxHeapTupleSize)
    {
        // requested size plus reserved space exceeds the largest tuple a page
        // can hold: skip the FSM and go straight to extending the relation
        /* can't fit, don't bother asking FSM */
        targetBlock = InvalidBlockNumber;
        use_fsm = false;
    }
    else if (bistate && bistate->current_buf != InvalidBuffer) // BulkInsert mode
        targetBlock = BufferGetBlockNumber(bistate->current_buf);
    else
        targetBlock = RelationGetTargetBlock(relation); // ordinary insert
    if (targetBlock == InvalidBlockNumber && use_fsm)
    {
        /*
         * We have no cached target page, so ask the FSM for an initial
         * target.
         */
        // ask the FSM for a block with at least len + saveFreeSpace bytes free
        targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
        /*
         * If the FSM knows nothing of the rel, try the last page before we
         * give up and extend.  This avoids one-tuple-per-page syndrome during
         * bootstrapping or in a recently-started system.
         */
        if (targetBlock == InvalidBlockNumber)
        {
            BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
            if (nblocks > 0)
                targetBlock = nblocks - 1;
        }
    }
loop:
    while (targetBlock != InvalidBlockNumber)
    {
        // ---------- loop until a block that can take the tuple is found
        /*
         * Read and exclusive-lock the target block, as well as the other
         * block if one was given, taking suitable care with lock ordering and
         * the possibility they are the same block.
         *
         * If the page-level all-visible flag is set, caller will need to
         * clear both that and the corresponding visibility map bit.  However,
         * by the time we return, we'll have x-locked the buffer, and we don't
         * want to do any I/O while in that state.  So we check the bit here
         * before taking the lock, and pin the page if it appears necessary.
         * Checking without the lock creates a risk of getting the wrong
         * answer, so we'll have to recheck after acquiring the lock.
         */
        if (otherBuffer == InvalidBuffer)
        {
            // ----------- not an update
            /* easy case */
            buffer = ReadBufferBI(relation, targetBlock, bistate);
            if (PageIsAllVisible(BufferGetPage(buffer)))
                // the page is marked all-visible: pin the matching visibility map page
                visibilitymap_pin(relation, targetBlock, vmbuffer);
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        }
        else if (otherBlock == targetBlock)
        {
            // ----------- update, new tuple goes to the same block as the old one
            /* also easy case */
            buffer = otherBuffer;
            if (PageIsAllVisible(BufferGetPage(buffer)))
                visibilitymap_pin(relation, targetBlock, vmbuffer);
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        }
        else if (otherBlock < targetBlock)
        {
            // ----------- update, old tuple's block < new tuple's block
            /* lock other buffer first */
            buffer = ReadBuffer(relation, targetBlock);
            if (PageIsAllVisible(BufferGetPage(buffer)))
                visibilitymap_pin(relation, targetBlock, vmbuffer);
            // lock the lower-numbered block first
            LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        }
        else
        {
            // ------------ update, old tuple's block > new tuple's block
            /* lock target buffer first */
            buffer = ReadBuffer(relation, targetBlock);
            if (PageIsAllVisible(BufferGetPage(buffer)))
                visibilitymap_pin(relation, targetBlock, vmbuffer);
            // lock the lower-numbered block first
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
            LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
        }
        /*
         * We now have the target page (and the other buffer, if any) pinned
         * and locked.  However, since our initial PageIsAllVisible checks
         * were performed before acquiring the lock, the results might now be
         * out of date, either for the selected victim buffer, or for the
         * other buffer passed by the caller.  In that case, we'll need to
         * give up our locks, go get the pin(s) we failed to get earlier, and
         * re-lock.  That's pretty painful, but hopefully shouldn't happen
         * often.
         *
         * Note that there's a small possibility that we didn't pin the page
         * above but still have the correct page pinned anyway, either because
         * we've already made a previous pass through this loop, or because
         * caller passed us the right page anyway.
         *
         * Note also that it's possible that by the time we get the pin and
         * retake the buffer locks, the visibility map bit will have been
         * cleared by some other backend anyway.  In that case, we'll have
         * done a bit of extra work for no gain, but there's no real harm
         * done.
         */
        if (otherBuffer == InvalidBuffer || buffer <= otherBuffer)
            GetVisibilityMapPins(relation, buffer, otherBuffer,
                                 targetBlock, otherBlock, vmbuffer,
                                 vmbuffer_other); // pin the VM pages
        else
            GetVisibilityMapPins(relation, otherBuffer, buffer,
                                 otherBlock, targetBlock, vmbuffer_other,
                                 vmbuffer); // pin the VM pages
        /*
         * Now we can check to see if there's enough free space here. If so,
         * we're done.
         */
        page = BufferGetPage(buffer);
        pageFreeSpace = PageGetHeapFreeSpace(page);
        if (len + saveFreeSpace <= pageFreeSpace)
        {
            // enough room for the tuple: return this buffer
            /* use this page as future insert target, too */
            /*
            #define RelationSetTargetBlock(relation, targblock) \
            do { \
                RelationOpenSmgr(relation); \
                (relation)->rd_smgr->smgr_targblock = (targblock); \
            } while (0)
            */
            RelationSetTargetBlock(relation, targetBlock);
            return buffer;
        }
        /*
         * Not enough space, so we must give up our page locks and pin (if
         * any) and prepare to look elsewhere.  We don't care which order we
         * unlock the two buffers in, so this can be slightly simpler than the
         * code above.
         */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        if (otherBuffer == InvalidBuffer)
            ReleaseBuffer(buffer);
        else if (otherBlock != targetBlock)
        {
            LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
            ReleaseBuffer(buffer);
        }
        /* Without FSM, always fall out of the loop and extend */
        if (!use_fsm)
            break;
        /*
         * Update FSM as to condition of this page, and ask for another page
         * to try.
         */
        // note: when the FSM has no further candidate, targetBlock becomes
        // InvalidBlockNumber and the loop exits
        targetBlock = RecordAndGetPageWithFreeSpace(relation,
                                                    targetBlock,
                                                    pageFreeSpace,
                                                    len + saveFreeSpace);
    }
    // --------- no suitable block was found: extend the relation
    /*
     * Have to extend the relation.
     *
     * We have to use a lock to ensure no one else is extending the rel at the
     * same time, else we will both try to initialize the same new page.  We
     * can skip locking for new or temp relations, however, since no one else
     * could be accessing them.
     */
    needLock = !RELATION_IS_LOCAL(relation);
    /*
     * If we need the lock but are not able to acquire it immediately, we'll
     * consider extending the relation by multiple blocks at a time to manage
     * contention on the relation extension lock.  However, this only makes
     * sense if we're using the FSM; otherwise, there's no point.
     */
    if (needLock)
    {
        if (!use_fsm)
            LockRelationForExtension(relation, ExclusiveLock);
        else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
        {
            /* Couldn't get the lock immediately; wait for it. */
            LockRelationForExtension(relation, ExclusiveLock);
            /*
             * Check if some other backend has extended a block for us while
             * we were waiting on the lock.
             */
            targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
            /*
             * If some other waiter has already extended the relation, we
             * don't need to do so; just use the existing freespace.
             */
            if (targetBlock != InvalidBlockNumber)
            {
                UnlockRelationForExtension(relation, ExclusiveLock);
                goto loop;
            }
            /* Time to bulk-extend. */
            // nobody else extended the relation, so extend it ourselves
            RelationAddExtraBlocks(relation, bistate);
        }
    }
    /*
     * In addition to whatever extension we performed above, we always add at
     * least one block to satisfy our own request.
     *
     * XXX This does an lseek - rather expensive - but at the moment it is the
     * only way to accurately determine how many blocks are in a relation.  Is
     * it worth keeping an accurate file length in shared memory someplace,
     * rather than relying on the kernel to do it for us?
     */
    // P_NEW asks the buffer manager for a brand-new page at the end of the relation
    buffer = ReadBufferBI(relation, P_NEW, bistate);
    /*
     * We can be certain that locking the otherBuffer first is OK, since it
     * must have a lower page number.
     */
    if (otherBuffer != InvalidBuffer)
        // otherBuffer necessarily precedes the newly added block, so lock it first
        LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
    /*
     * Now acquire lock on the new page.
     */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    /*
     * Release the file-extension lock; it's now OK for someone else to extend
     * the relation some more.  Note that we cannot release this lock before
     * we have buffer lock on the new page, or we risk a race condition
     * against vacuumlazy.c --- see comments therein.
     */
    if (needLock)
        UnlockRelationForExtension(relation, ExclusiveLock);
    /*
     * We need to initialize the empty new page.  Double-check that it really
     * is empty (this should never happen, but if it does we don't want to
     * risk wiping out valid data).
     */
    page = BufferGetPage(buffer);
    if (!PageIsNew(page))
        // not a new page: something has gone wrong somewhere
        elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
             BufferGetBlockNumber(buffer),
             RelationGetRelationName(relation));
    // initialize the new page
    PageInit(page, BufferGetPageSize(buffer), 0);
    // even an empty page cannot hold the tuple: report it
    if (len > PageGetHeapFreeSpace(page))
    {
        /* We should not get here given the test at the top */
        elog(PANIC, "tuple is too big: size %zu", len);
    }
    /*
     * Remember the new page as our target for future insertions.
     *
     * XXX should we enter the new page into the free space map immediately,
     * or just keep it for this backend's exclusive use in the short run
     * (until VACUUM sees it)?  Seems to depend on whether you expect the
     * current backend to make more insertions or not, which is probably a
     * good bet most of the time.  So for now, don't add it to FSM yet.
     */
    RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
    return buffer;
}
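
The size checks in the function reduce to a little arithmetic: `len` is rounded up by MAXALIGN, `saveFreeSpace` is the share of the block that fillfactor keeps in reserve, and a page qualifies only if `len + saveFreeSpace <= pageFreeSpace`. The sketch below works through that arithmetic. The 8192-byte block size, the 8-byte maximum alignment and all `demo_` names are assumptions; actual values depend on how PostgreSQL was built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BLCKSZ        8192   /* assumed block size */
#define DEMO_MAXIMUM_ALIGN 8      /* assumed maximum alignment */

/* Round len up to the next multiple of the maximum alignment. */
static size_t demo_maxalign(size_t len)
{
    return (len + (DEMO_MAXIMUM_ALIGN - 1)) &
           ~((size_t) (DEMO_MAXIMUM_ALIGN - 1));
}

/* Bytes per page that fillfactor asks inserts to leave unused. */
static size_t demo_target_free_space(int fillfactor)
{
    assert(fillfactor >= 10 && fillfactor <= 100);
    return (size_t) DEMO_BLCKSZ * (size_t) (100 - fillfactor) / 100;
}

/* The acceptance test applied to each candidate page inside the loop. */
static bool demo_page_fits(size_t len, int fillfactor, size_t pageFreeSpace)
{
    return demo_maxalign(len) + demo_target_free_space(fillfactor)
           <= pageFreeSpace;
}
```

With fillfactor 90 under these assumptions, 819 bytes per page stay reserved, so a 32-byte tuple needs a page with at least 851 bytes free. With the default fillfactor of 100 nothing is reserved and only the tuple size matters.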

III. Trace Analysis

Test script

15:54:13 (xdb@[local]:5432)testdb=# insert into t1 values (1,'1','1');

Call stack

(gdb) b RelationGetBufferForTuple
Breakpoint 1 at 0x4ef179: file hio.c, line 318.
(gdb) c
Continuing.
Breakpoint 1, RelationGetBufferForTuple (relation=0x7f4f51fe39b8, len=32, otherBuffer=0, options=0, bistate=0x0, 
    vmbuffer=0x7ffea95dbf6c, vmbuffer_other=0x0) at hio.c:318
318     bool        use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
(gdb) bt
#0  RelationGetBufferForTuple (relation=0x7f4f51fe39b8, len=32, otherBuffer=0, options=0, bistate=0x0, 
    vmbuffer=0x7ffea95dbf6c, vmbuffer_other=0x0) at hio.c:318
#1  0x00000000004df1f8 in heap_insert (relation=0x7f4f51fe39b8, tup=0x178a478, cid=0, options=0, bistate=0x0)
    at heapam.c:2468
#2  0x0000000000709dda in ExecInsert (mtstate=0x178a220, slot=0x178a680, planSlot=0x178a680, estate=0x1789eb8, 
    canSetTag=true) at nodeModifyTable.c:529
#3  0x000000000070c475 in ExecModifyTable (pstate=0x178a220) at nodeModifyTable.c:2159
#4  0x00000000006e05cb in ExecProcNodeFirst (node=0x178a220) at execProcnode.c:445
#5  0x00000000006d552e in ExecProcNode (node=0x178a220) at ../../../src/include/executor/executor.h:247
#6  0x00000000006d7d66 in ExecutePlan (estate=0x1789eb8, planstate=0x178a220, use_parallel_mode=false, 
    operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x17a7688, 
    execute_once=true) at execMain.c:1723
#7  0x00000000006d5af8 in standard_ExecutorRun (queryDesc=0x178e458, direction=ForwardScanDirection, count=0, 
    execute_once=true) at execMain.c:364
#8  0x00000000006d5920 in ExecutorRun (queryDesc=0x178e458, direction=ForwardScanDirection, count=0, execute_once=true)
    at execMain.c:307
#9  0x00000000008c1092 in ProcessQuery (plan=0x16b3ac0, sourceText=0x16b1ec8 "insert into t1 values (1,'1','1');", 
    params=0x0, queryEnv=0x0, dest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:161
#10 0x00000000008c29a1 in PortalRunMulti (portal=0x1717488, isTopLevel=true, setHoldSnapshot=false, dest=0x17a7688, 
    altdest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:1286
#11 0x00000000008c1f7a in PortalRun (portal=0x1717488, count=9223372036854775807, isTopLevel=true, run_once=true, 
    dest=0x17a7688, altdest=0x17a7688, completionTag=0x7ffea95dc500 "") at pquery.c:799
#12 0x00000000008bbf16 in exec_simple_query (query_string=0x16b1ec8 "insert into t1 values (1,'1','1');") at postgres.c:1145
#13 0x00000000008c01a1 in PostgresMain (argc=1, argv=0x16dbaf8, dbname=0x16db960 "testdb", username=0x16aeba8 "xdb")
    at postgres.c:4182
#14 0x000000000081e07c in BackendRun (port=0x16d3940) at postmaster.c:4361
#15 0x000000000081d7ef in BackendStartup (port=0x16d3940) at postmaster.c:4033
---Type <return> to continue, or q <return> to quit---
#16 0x0000000000819be9 in ServerLoop () at postmaster.c:1706
#17 0x000000000081949f in PostmasterMain (argc=1, argv=0x16acb60) at postmaster.c:1379
#18 0x0000000000742941 in main (argc=1, argv=0x16acb60) at main.c:228
(gdb)
