怎么理解PostgreSQL中Clock Sweep算法

發(fā)布時間：2021-11-09 15:55:40 來源：億速云閱讀：517 作者：iii 欄目：關(guān)系型數(shù)據(jù)庫

本篇內(nèi)容介紹了“怎么理解PostgreSQL中Clock Sweep算法”的有關(guān)知識，在實際案例的操作過程中，不少人都會遇到這樣的困境，接下來就讓小編帶領(lǐng)大家學習一下如何處理這些情況吧！希望大家仔細閱讀，能夠?qū)W有所成！

一、數(shù)據(jù)結(jié)構(gòu)

BufferDesc
共享緩沖區(qū)的共享描述符(狀態(tài))數(shù)據(jù)

/*
 * Flags for buffer descriptors
 * buffer描述器標記
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 * 注意:TAG_VALID本質(zhì)上意味著有一個與緩沖區(qū)的標記相關(guān)聯(lián)的緩沖區(qū)散列表條目。
 */
//buffer header鎖定
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
//數(shù)據(jù)需要寫入(標記為DIRTY)
#define BM_DIRTY                (1U << 23)  /* data needs writing */
//數(shù)據(jù)是有效的
#define BM_VALID                (1U << 24)  /* data is valid */
//已分配buffer tag
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
//正在R/W
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
//上一個I/O出現(xiàn)錯誤
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
//開始寫則變DIRTY
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
//存在等待sole pin的其他進程
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
//checkpoint發(fā)生,必須刷到磁盤上
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
//持久化buffer(不是unlogged或者初始化fork)
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */
/*
 *  BufferDesc -- shared descriptor/state data for a single shared buffer.
 *  BufferDesc -- 共享緩沖區(qū)的共享描述符(狀態(tài))數(shù)據(jù)
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 * 注意:必須持有Buffer header鎖(BM_LOCKED標記)才能檢查或修改tag/state/wait_backend_pid字段.
 * 通常來說,buffer header lock是spinlock,它與標記位/參考計數(shù)/使用計數(shù)組合到單個原子變量中.
 * 這個布局設(shè)計允許我們執(zhí)行原子操作,而不需要實際獲得或者釋放spinlock(比如,增加或者減少參考計數(shù)).
 * buf_id字段在初始化后不會出現(xiàn)變化,因此不需要鎖定.
 * freeNext通過buffer_strategy_lock鎖而不是buffer header lock保護.
 * LWLock可以很好的處理自己的狀態(tài).
 * 務(wù)請注意的是:buffer header lock不用于控制buffer中的數(shù)據(jù)訪問!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 * 假定在持有buffer header lock的情況下,沒有人改變狀態(tài)字段.
 * 持有buffer header lock的進程可以執(zhí)行在單個寫操作中執(zhí)行復雜的狀態(tài)變量更新,
 *   同步的釋放鎖(清除BM_LOCKED標記).
 * 換句話說,如果沒有持有buffer header lock的狀態(tài)更新,會受限于CAS,
 *   這種情況下確保BM_LOCKED沒有被設(shè)置.
 * 比如原子的增加/減少(AND/OR)等操作是不允許的.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 * 一種例外情況是如果我們已有buffer pinned,該buffer的tag不能改變(在本進程之下),
 *   因此不需要鎖定buffer header就可以檢查tag了.
 * 同時,在執(zhí)行一次性的flags讀取時不需要鎖定buffer header.
 * 這種情況通常用于我們不希望正在測試的flag bit將被改變.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 * 如果其他進程有buffer pinned,那么進程不能物理的從磁盤頁面中刪除items.
 * 因此,后臺進程需要等待其他pins清除.這可以通過存儲它自己的PID到wait_backend_pid中,
 *   并設(shè)置標記位BM_PIN_COUNT_WAITER.
 * 目前,每個緩沖區(qū)只能由一個等待進程.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 * 本地緩沖頭部使用同樣的結(jié)構(gòu),但并不需要使用locks,而且并不是所有的標記位都使用.
 * 為了避免不必要的負載,狀態(tài)域的維護不需要實際的原子操作
 * (比如只有pg_atomic_read_u32() and pg_atomic_unlocked_write_u32())
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 * 在增加或者記錄成員變量時,小心避免增加結(jié)構(gòu)體的大小.
 * 保持結(jié)構(gòu)體大小在64字節(jié)內(nèi)(通常的CPU緩存線大小)對于性能是非常重要的.
 */
typedef struct BufferDesc
{
    //buffer tag
    BufferTag   tag;            /* ID of page contained in buffer */
    //buffer索引編號(0開始),指向相應(yīng)的buffer pool slot
    int         buf_id;         /* buffer's index number (from 0) */
    /* state of the tag, containing flags, refcount and usagecount */
    //tag狀態(tài),包括flags/refcount和usagecount
    pg_atomic_uint32 state;
    //pin-count等待進程ID
    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    //空閑鏈表鏈中下一個空閑的buffer
    int         freeNext;       /* link in freelist chain */
    //緩沖區(qū)內(nèi)容鎖
    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;

BufferTag
Buffer tag標記了buffer存儲的是磁盤中哪個block

/*
 * Buffer tag identifies which disk block the buffer contains.
 * Buffer tag標記了buffer存儲的是磁盤中哪個block
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 * 注意:BufferTag必須足以確定如何寫block而不需要參照pg_class或者pg_tablespace數(shù)據(jù)字典信息.
 * 有可能后臺進程在刷新緩沖區(qū)的時候深圳不相信關(guān)系是可見的(事務(wù)可能在創(chuàng)建rel的事務(wù)之前).
 * 存儲管理器必須可以處理這些事情.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 * 注意:如果在結(jié)構(gòu)體中有填充的字節(jié),INIT_BUFFERTAG必須將它們固定為零，因為這個結(jié)構(gòu)體用作散列鍵.
 */
typedef struct buftag
{
    //物理relation標識符
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    //相對于relation起始的塊號
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;

二、源碼解讀

算法思想,在 Buffer Manager 中已有詳細介紹,摘錄如下:

WHILE true
(1)     Obtain the candidate buffer descriptor pointed by the nextVictimBuffer
(2)     IF the candidate descriptor is unpinned THEN
(3)        IF the candidate descriptor's usage_count == 0 THEN
                BREAK WHILE LOOP  /* the corresponding slot of this descriptor is victim slot. */
           ELSE
            Decrease the candidate descriptpor's usage_count by 1
               END IF
         END IF
(4)     Advance nextVictimBuffer to the next one
      END WHILE 
(5) RETURN buffer_id of the victim

算法思想樸素且簡單,比起高大上的LRU算法要簡單多了.
nextVictimBuffer可視為時鐘的時針,把緩沖區(qū)視為環(huán)形緩沖區(qū),時針循環(huán)周而往復的循環(huán),如緩沖區(qū)滿足unpinned(ref_count == 0) && usage_count == 0的條件,則選中該緩沖,否則,usage_count減一,繼續(xù)循環(huán),直至找到滿足條件的buffer為止.選中的buffer一定是buffers中較少使用的那個.

StrategyGetBuffer中的代碼片段:

...
    /* Nothing on the freelist, so run the "clock sweep" algorithm */
    //空閑鏈表中找不到或者滿足不了條件,則執(zhí)行"clock sweep"算法
    //int NBuffers = 1000;
    trycounter = NBuffers;//嘗試次數(shù)
    for (;;)
    {
        //------- 循環(huán)
        //獲取buffer描述符
        buf = GetBufferDescriptor(ClockSweepTick());
        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         * 如果buffer已pinned,或者有一個非零值的usage_count,不能使用這個buffer.
         * 減少usage_count(除非已pinned)繼續(xù)掃描.
         */
        local_buf_state = LockBufHdr(buf);
        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            //----- refcount == 0
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                //usage_count <> 0
                //usage_count - 1
                local_buf_state -= BUF_USAGECOUNT_ONE;
                //重置嘗試次數(shù)
                trycounter = NBuffers;
            }
            else
            {
                //usage_count = 0
                /* Found a usable buffer */
                //發(fā)現(xiàn)一個可用的buffer
                if (strategy != NULL)
                    //添加到該策略的環(huán)形緩沖區(qū)中
                    AddBufferToRing(strategy, buf);
                //輸出參數(shù)賦值
                *buf_state = local_buf_state;
                //返回buf
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            //----- refcount <> 0 && --trycounter == 0
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             * 在沒有改變?nèi)魏螤顟B(tài)的情況,我們已經(jīng)完成了所有buffers的遍歷,
             *   因此所有的buffers已pinned(或者在搜索的時候pinned).
             * 我們希望某些進程會周期性的釋放buffer,但如果實在拿不到,那報錯總比傻傻的死循環(huán)要好.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        //解鎖buffer header
        UnlockBufHdr(buf, local_buf_state);
    }

ClockSweepTick()函數(shù)是StrategyGetBuffer()的輔助過程,移動時針(當前位置往前一格),返回時針指向的buffer.
其實現(xiàn)邏輯如下:

/*
 * ClockSweepTick - Helper routine for StrategyGetBuffer()
 * ClockSweepTick - StrategyGetBuffer()的輔助過程
 *
 * Move the clock hand one buffer ahead of its current position and return the
 * id of the buffer now under the hand.
 * 移動時針(當前位置往前一格),返回時針指向的buffer.
 */
static inline uint32
ClockSweepTick(void)
{
    uint32      victim;
    /*
     * Atomically move hand ahead one buffer - if there's several processes
     * doing this, this can lead to buffers being returned slightly out of
     * apparent order.
     * 原子移動時針一格 
     *   - 如果有多個進程執(zhí)行這個操作,這可能會導致緩沖返回的順序稍微有些混亂.
     *   
     */
    victim =
        pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
    if (victim >= NBuffers)
    {
        //-------- 候選buffer大于NBuffers
        //記錄元素的victim
        uint32      originalVictim = victim;
        /* always wrap what we look up in BufferDescriptors */
        //回卷BufferDescriptors中查找的內(nèi)容
        victim = victim % NBuffers;
        /*
         * If we're the one that just caused a wraparound, force
         * completePasses to be incremented while holding the spinlock. We
         * need the spinlock so StrategySyncStart() can return a consistent
         * value consisting of nextVictimBuffer and completePasses.
         * 如果我們正好導致了wraparound,在持有自旋鎖的情況下強制completePasses增加.
         * 之所以需要自旋鎖是因為StrategySyncStart()才能返回nextVictimBuffer和completePasses的一致性值.
         */
        if (victim == 0)
        {
            //出現(xiàn)回卷
            uint32      expected;//期望值(不考慮回卷)
            uint32      wrapped;//回卷后的值
            bool        success = false;//成功標記
            //期望值
            expected = originalVictim + 1;
            while (!success)
            {
                /*
                 * Acquire the spinlock while increasing completePasses. That
                 * allows other readers to read nextVictimBuffer and
                 * completePasses in a consistent manner which is required for
                 * StrategySyncStart().  In theory delaying the increment
                 * could lead to an overflow of nextVictimBuffers, but that's
                 * highly unlikely and wouldn't be particularly harmful.
                 * 在增加completePasses時請求獲取自旋鎖.
                 * 這樣可以讓其他讀取進程以一致性的方式讀取nextVictimBuffer和completePasses,
                 *   這種一致性的方式在StrategySyncStart()中是需要的.
                 * 理論上來說,延遲增加可能會導致nextVictimBuffers溢出,
                 *   但但這是非常不可能的，也不會特別有害。
                 */
                SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
                //獲取回卷數(shù)
                wrapped = expected % NBuffers;
                //原子比較并交換數(shù)字
                success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                                         &expected, wrapped);
                if (success)
                    //如成功,則增加計數(shù)
                    StrategyControl->completePasses++;
                //釋放自旋鎖
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
            }
        }
    }
    //返回victim
    return victim;
}
/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *pg_atomic_compare_exchange_u32 - CAS操作
 * 
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 * 原子比較ptr當前值與*expected,如果ptr和*expected值一致,則存儲newval到ptr中.
 * *ptr的當前值通常會存儲在*expected中.
 *
 * Return true if values have been exchanged, false otherwise.
 * 如值已交換,則返回T,否則返回F.
 *
 * Full barrier semantics.
 * 完整的屏障語義.
 */
static inline bool
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
                               uint32 *expected, uint32 newval)
{
    AssertPointerAlignment(ptr, 4);
    AssertPointerAlignment(expected, 4);
    return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}
bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                    uint32 *expected, uint32 newval)
{
    bool        ret;
    /*
     * Do atomic op under a spinlock. It might look like we could just skip
     * the cmpxchg if the lock isn't available, but that'd just emulate a
     * 'weak' compare and swap. I.e. one that allows spurious failures. Since
     * several algorithms rely on a strong variant and that is efficiently
     * implementable on most major architectures let's emulate it here as
     * well.
     * 在自旋鎖保護下執(zhí)行原子操作.
     * 這看起來如果鎖不可能的話,我們可以跳過cmpxchg,但這只是模擬了一個"淺"比較和交換.
     * 比如,這會引起spurious failures.
     * 由于有幾種算法依賴于一種強大的變體，而且這種變體可以在大多數(shù)主要架構(gòu)上有效地實現(xiàn)，
     *   因此我們在這里也對其進行模擬。
     */
    SpinLockAcquire((slock_t *) &ptr->sema);
    /* perform compare/exchange logic */
    //執(zhí)行比較/交換邏輯
    ret = ptr->value == *expected;//ptr與*expected是否一致?
    *expected = ptr->value;//*expected賦值為ptr
    if (ret)
        ptr->value = newval;//值一致,則ptr設(shè)置為新值
    /* and release lock */
    //釋放自旋鎖
    SpinLockRelease((slock_t *) &ptr->sema);
    //返回結(jié)果
    return ret;
}

“怎么理解PostgreSQL中Clock Sweep算法”的內(nèi)容就介紹到這里了，感謝大家的閱讀。如果想了解更多行業(yè)相關(guān)的知識可以關(guān)注億速云網(wǎng)站，小編將為大家輸出更多高質(zhì)量的實用文章！

向AI問一下細節(jié)

怎么理解PostgreSQL中Clock Sweep算法

一、數(shù)據(jù)結(jié)構(gòu)

二、源碼解讀

猜你喜歡

最新資訊

相關(guān)推薦

相關(guān)標簽

一、數(shù)據(jù)結(jié)構(gòu)

二、源碼解讀