
PostgreSQL Source Code Walkthrough (136) - Buffer Manager #1 (the ReadBufferExtended function)

Published: 2020-08-11 10:20:37 | Source: ITPUB Blog | Views: 160 | Author: husthxd | Category: Relational Databases

This section briefly introduces ReadBufferExtended, one of the functions that implement the PostgreSQL buffer manager. It returns the buffer that holds the requested block of the requested relation.

1. Data Structures

Relation
The in-memory structure of a relation (a relation cache entry).


/*
 * Here are the contents of a relation cache entry.
 */
typedef struct RelationData
{
    RelFileNode rd_node;        /* relation physical identifier */
    /* use "struct" here to avoid needing to include smgr.h: */
    struct SMgrRelationData *rd_smgr;   /* cached file handle, or NULL */
    int         rd_refcnt;      /* reference count */
    BackendId   rd_backend;     /* owning backend id, if temporary relation */
    bool        rd_islocaltemp; /* rel is a temp rel of this session */
    bool        rd_isnailed;    /* rel is nailed in cache */
    bool        rd_isvalid;     /* relcache entry is valid */
    char        rd_indexvalid;  /* state of rd_indexlist: 0 = not valid, 1 =
                                 * valid, 2 = temporarily forced */
    bool        rd_statvalid;   /* is rd_statlist valid? */
    /*
     * rd_createSubid is the ID of the highest subtransaction the rel has
     * survived into; or zero if the rel was not created in the current top
     * transaction.  This can be now be relied on, whereas previously it could
     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
     * the ID of the highest subtransaction the relfilenode change has
     * survived into, or zero if not changed in the current transaction (or we
     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
     * when a relation has multiple new relfilenodes within a single
     * transaction, with one of them occurring in a subsequently aborted
     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
     * ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
     */
    SubTransactionId rd_createSubid;    /* rel was created in current xact */
    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                 * current xact */
    Form_pg_class rd_rel;       /* RELATION tuple */
    TupleDesc   rd_att;         /* tuple descriptor */
    Oid         rd_id;          /* relation's object id */
    LockInfoData rd_lockInfo;   /* lock mgr's info for locking relation */
    RuleLock   *rd_rules;       /* rewrite rules */
    MemoryContext rd_rulescxt;  /* private memory cxt for rd_rules, if any */
    TriggerDesc *trigdesc;      /* Trigger info, or NULL if rel has none */
    /* use "struct" here to avoid needing to include rowsecurity.h: */
    struct RowSecurityDesc *rd_rsdesc;  /* row security policies, or NULL */
    /* data managed by RelationGetFKeyList: */
    List       *rd_fkeylist;    /* list of ForeignKeyCacheInfo (see below) */
    bool        rd_fkeyvalid;   /* true if list has been computed */
    MemoryContext rd_partkeycxt;    /* private memory cxt for the below */
    struct PartitionKeyData *rd_partkey;    /* partition key, or NULL */
    MemoryContext rd_pdcxt;     /* private context for partdesc */
    struct PartitionDescData *rd_partdesc;  /* partitions, or NULL */
    List       *rd_partcheck;   /* partition CHECK quals */
    /* data managed by RelationGetIndexList: */
    List       *rd_indexlist;   /* list of OIDs of indexes on relation */
    Oid         rd_oidindex;    /* OID of unique index on OID, if any */
    Oid         rd_pkindex;     /* OID of primary key, if any */
    Oid         rd_replidindex; /* OID of replica identity index, if any */
    /* data managed by RelationGetStatExtList: */
    List       *rd_statlist;    /* list of OIDs of extended stats */
    /* data managed by RelationGetIndexAttrBitmap: */
    Bitmapset  *rd_indexattr;   /* columns used in non-projection indexes */
    Bitmapset  *rd_projindexattr;   /* columns used in projection indexes */
    Bitmapset  *rd_keyattr;     /* cols that can be ref'd by foreign keys */
    Bitmapset  *rd_pkattr;      /* cols included in primary key */
    Bitmapset  *rd_idattr;      /* included in replica identity index */
    Bitmapset  *rd_projidx;     /* Oids of projection indexes */
    PublicationActions *rd_pubactions;  /* publication actions */
    /*
     * rd_options is set whenever rd_rel is loaded into the relcache entry.
     * Note that you can NOT look into rd_rel for this data.  NULL means "use
     * defaults".
     */
    bytea      *rd_options;     /* parsed pg_class.reloptions */
    /* These are non-NULL only for an index relation: */
    Form_pg_index rd_index;     /* pg_index tuple describing this index */
    /* use "struct" here to avoid needing to include htup.h: */
    struct HeapTupleData *rd_indextuple;    /* all of pg_index tuple */
    /*
     * index access support info (used only for an index relation)
     *
     * Note: only default support procs for each opclass are cached, namely
     * those with lefttype and righttype equal to the opclass's opcintype. The
     * arrays are indexed by support function number, which is a sufficient
     * identifier given that restriction.
     *
     * Note: rd_amcache is available for index AMs to cache private data about
     * an index.  This must be just a cache since it may get reset at any time
     * (in particular, it will get reset by a relcache inval message for the
     * index).  If used, it must point to a single memory chunk palloc'd in
     * rd_indexcxt.  A relcache reset will include freeing that chunk and
     * setting rd_amcache = NULL.
     */
    Oid         rd_amhandler;   /* OID of index AM's handler function */
    MemoryContext rd_indexcxt;  /* private memory cxt for this stuff */
    /* use "struct" here to avoid needing to include amapi.h: */
    struct IndexAmRoutine *rd_amroutine;    /* index AM's API struct */
    Oid        *rd_opfamily;    /* OIDs of op families for each index col */
    Oid        *rd_opcintype;   /* OIDs of opclass declared input data types */
    RegProcedure *rd_support;   /* OIDs of support procedures */
    FmgrInfo   *rd_supportinfo; /* lookup info for support procedures */
    int16      *rd_indoption;   /* per-column AM-specific flags */
    List       *rd_indexprs;    /* index expression trees, if any */
    List       *rd_indpred;     /* index predicate tree, if any */
    Oid        *rd_exclops;     /* OIDs of exclusion operators, if any */
    Oid        *rd_exclprocs;   /* OIDs of exclusion ops' procs, if any */
    uint16     *rd_exclstrats;  /* exclusion ops' strategy numbers, if any */
    void       *rd_amcache;     /* available for use by index AM */
    Oid        *rd_indcollation;    /* OIDs of index collations */
    /*
     * foreign-table support
     *
     * rd_fdwroutine must point to a single memory chunk palloc'd in
     * CacheMemoryContext.  It will be freed and reset to NULL on a relcache
     * reset.
     */
    /* use "struct" here to avoid needing to include fdwapi.h: */
    struct FdwRoutine *rd_fdwroutine;   /* cached function pointers, or NULL */
    /*
     * Hack for CLUSTER, rewriting ALTER TABLE, etc: when writing a new
     * version of a table, we need to make any toast pointers inserted into it
     * have the existing toast table's OID, not the OID of the transient toast
     * table.  If rd_toastoid isn't InvalidOid, it is the OID to place in
     * toast pointers inserted into this rel.  (Note it's set on the new
     * version of the main heap, not the toast table itself.)  This also
     * causes toast_save_datum() to try to preserve toast value OIDs.
     */
    Oid         rd_toastoid;    /* Real TOAST table's OID, or InvalidOid */
    /* use "struct" here to avoid needing to include pgstat.h: */
    struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
} RelationData;
typedef struct RelationData *Relation;

BufferAccessStrategy
The buffer access strategy.


/*
 * Buffer identifiers.
 *
 * Zero is invalid, positive is the index of a shared buffer (1..NBuffers),
 * negative is the index of a local buffer (-1 .. -NLocBuffer).
 */
typedef int Buffer;
#define InvalidBuffer   0
/*
 * Buffer access strategy objects.
 *
 * BufferAccessStrategyData is private to freelist.c
 */
typedef struct BufferAccessStrategyData *BufferAccessStrategy;
/*
 * Private (non-shared) state for managing a ring of shared buffers to re-use.
 * This is currently the only kind of BufferAccessStrategy object, but someday
 * we might have more kinds.
 */
typedef struct BufferAccessStrategyData
{
    /* Overall strategy type */
    BufferAccessStrategyType btype;
    /* Number of elements in buffers[] array */
    int         ring_size;
    /*
     * Index of the "current" slot in the ring, ie, the one most recently
     * returned by GetBufferFromRing.
     */
    int         current;
    /*
     * True if the buffer just returned by StrategyGetBuffer had been in the
     * ring already.
     */
    bool        current_was_in_ring;
    /*
     * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
     * have not yet selected a buffer for this ring slot.  For allocation
     * simplicity this is palloc'd together with the fixed fields of the
     * struct.
     */
    Buffer      buffers[FLEXIBLE_ARRAY_MEMBER];
}           BufferAccessStrategyData;
/* Block pointer type */
typedef void *Block;
/* Possible arguments for GetAccessStrategy() */
typedef enum BufferAccessStrategyType
{
    BAS_NORMAL,                 /* Normal random access */
    BAS_BULKREAD,               /* Large read-only scan (hint bit updates are
                                 * ok) */
    BAS_BULKWRITE,              /* Large multi-block write (e.g. COPY IN) */
    BAS_VACUUM                  /* VACUUM */
} BufferAccessStrategyType;
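
To make the sign convention of Buffer and the role of the ring fields more concrete, here is a minimal standalone sketch. It is not PostgreSQL code: Ring, ring_create, ring_next, ring_add and the buffer_is_* helpers are hypothetical names invented for illustration, and the slot-advance logic is only loosely modeled on what the comments above say about GetBufferFromRing. The real code also checks whether a buffer found in the ring can actually be reused, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef int Buffer;
#define InvalidBuffer 0

/* Hypothetical helpers for the sign convention: 0 invalid, >0 shared, <0 local. */
bool buffer_is_valid(Buffer b)  { return b != InvalidBuffer; }
bool buffer_is_shared(Buffer b) { return b > 0; }
bool buffer_is_local(Buffer b)  { return b < 0; }

/* Simplified ring: the fields of BufferAccessStrategyData without btype. */
typedef struct Ring
{
    int     ring_size;            /* number of elements in buffers[] */
    int     current;              /* slot most recently returned */
    bool    current_was_in_ring;  /* did the last slot already hold a buffer? */
    Buffer  buffers[];            /* allocated together with the fixed fields */
} Ring;

/* Allocate the fixed fields and the array in one zeroed chunk. */
Ring *ring_create(int ring_size)
{
    assert(ring_size > 0);
    Ring *r = calloc(1, sizeof(Ring) + (size_t) ring_size * sizeof(Buffer));
    if (r != NULL)
        r->ring_size = ring_size;   /* every slot starts as InvalidBuffer */
    return r;
}

/* Advance to the next slot (wrapping around) and return what it holds. */
Buffer ring_next(Ring *r)
{
    assert(r != NULL);
    if (++r->current >= r->ring_size)
        r->current = 0;
    Buffer b = r->buffers[r->current];
    r->current_was_in_ring = buffer_is_valid(b);
    return b;
}

/* Remember a freshly chosen buffer in the current slot. */
void ring_add(Ring *r, Buffer b)
{
    assert(r != NULL && buffer_is_valid(b));
    r->buffers[r->current] = b;
}
```

A caller in this sketch would call ring_next first; if it returns InvalidBuffer, the caller picks a buffer some other way and records it with ring_add, so that the same small set of buffers is reused on later passes around the ring.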

ReadBufferMode
The read modes that ReadBufferExtended may be called with.


/*
 * In RBM_NORMAL mode, the page is read from disk, and the page header is
 * validated.  An error is thrown if the page header is not valid.  (But
 * note that an all-zero page is considered "valid"; see PageIsVerified().)
 *
 * RBM_ZERO_ON_ERROR is like the normal mode, but if the page header is not
 * valid, the page is zeroed instead of throwing an error. This is intended
 * for non-critical data, where the caller is prepared to repair errors.
 *
 * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
 * filled with zeros instead of reading it from disk.  Useful when the caller
 * is going to fill the page from scratch, since this saves I/O and avoids
 * unnecessary failure if the page-on-disk has corrupt page headers.
 * The page is returned locked to ensure that the caller has a chance to
 * initialize the page before it's made visible to others.
 * Caution: do not use this mode to read a page that is beyond the relation's
 * current physical EOF; that is likely to cause problems in md.c when
 * the page is modified and written out. P_NEW is OK, though.
 *
 * RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
 * a cleanup-strength lock on the page.
 *
 * RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
 */
/* Possible modes for ReadBufferExtended() */
typedef enum
{
    RBM_NORMAL,                 /* Normal read */
    RBM_ZERO_AND_LOCK,          /* Don't read from disk, caller will
                                 * initialize. Also locks the page. */
    RBM_ZERO_AND_CLEANUP_LOCK,  /* Like RBM_ZERO_AND_LOCK, but locks the page
                                 * in "cleanup" mode */
    RBM_ZERO_ON_ERROR,          /* Read, but return an all-zeros page on error */
    RBM_NORMAL_NO_LOG           /* Don't log page as invalid during WAL
                                 * replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;
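
The differences between the modes can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the PostgreSQL implementation: the page size, TOY_MAGIC, ToyLock, toy_page_is_verified and toy_load_page are all made up for this example, and only the case where the page is not yet in the buffer cache is modeled. Where the real code would raise an error, the sketch returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_MAGIC     0x5A    /* hypothetical "valid header" marker */

typedef enum
{
    RBM_NORMAL,
    RBM_ZERO_AND_LOCK,
    RBM_ZERO_AND_CLEANUP_LOCK,
    RBM_ZERO_ON_ERROR,
    RBM_NORMAL_NO_LOG
} ReadBufferMode;

typedef enum { TOY_UNLOCKED, TOY_LOCKED, TOY_CLEANUP_LOCKED } ToyLock;

/* Toy header check: an all-zero page counts as valid, as the comment notes. */
bool toy_page_is_verified(const unsigned char *page)
{
    bool all_zero = true;
    for (int i = 0; i < TOY_PAGE_SIZE; i++)
        if (page[i] != 0)
            all_zero = false;
    return all_zero || page[0] == TOY_MAGIC;
}

/*
 * Load a page on a cache miss. Returns false where the real code would
 * throw an error; *lock reports whether the page comes back locked.
 */
bool toy_load_page(const unsigned char *disk, ReadBufferMode mode,
                   unsigned char *page, ToyLock *lock)
{
    assert(disk != NULL && page != NULL && lock != NULL);
    *lock = TOY_UNLOCKED;

    switch (mode)
    {
        case RBM_ZERO_AND_LOCK:
        case RBM_ZERO_AND_CLEANUP_LOCK:
            /* No disk read at all: zero-fill and hand back a locked page. */
            memset(page, 0, TOY_PAGE_SIZE);
            *lock = (mode == RBM_ZERO_AND_LOCK) ? TOY_LOCKED
                                                : TOY_CLEANUP_LOCKED;
            return true;

        case RBM_ZERO_ON_ERROR:
            /* Read, but replace a bad page with zeros instead of failing. */
            memcpy(page, disk, TOY_PAGE_SIZE);
            if (!toy_page_is_verified(page))
                memset(page, 0, TOY_PAGE_SIZE);
            return true;

        case RBM_NORMAL:
        case RBM_NORMAL_NO_LOG:
            /* Read and validate; a bad header is an error. */
            memcpy(page, disk, TOY_PAGE_SIZE);
            return toy_page_is_verified(page);
    }
    return false;
}
```

In this model a corrupt on-disk page fails only in the two normal modes; the zeroing modes never fail, which is why they suit callers that will rebuild or repair the page contents themselves.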

2. Source Code Walkthrough

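ReadBufferExtended returns the buffer that holds the requested block of the requested relation. Its own logic is simple, as the code shows: it opens the relation at the smgr level, rejects temporary relations that belong to other sessions, and updates the pgstat counters.
The main work is done in ReadBuffer_common, which will be covered in a later section.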


/*
 * ReadBufferExtended -- returns a buffer containing the requested
 *      block of the requested relation.  If the blknum
 *      requested is P_NEW, extend the relation file and
 *      allocate a new block.  (Caller is responsible for
 *      ensuring that only one backend tries to extend a
 *      relation at the same time!)
 *
 * Returns: the buffer number for the buffer containing
 *      the block read.  The returned buffer has been pinned.
 *      Does not return on error --- elog's instead.
 *
 * Assume when this function is called, that reln has been opened already.
 *
 * In RBM_NORMAL mode, the page is read from disk, and the page header is
 * validated.  An error is thrown if the page header is not valid.  (But
 * note that an all-zero page is considered "valid"; see PageIsVerified().)
 *
 * RBM_ZERO_ON_ERROR is like the normal mode, but if the page header is not
 * valid, the page is zeroed instead of throwing an error. This is intended
 * for non-critical data, where the caller is prepared to repair errors.
 *
 * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
 * filled with zeros instead of reading it from disk.  Useful when the caller
 * is going to fill the page from scratch, since this saves I/O and avoids
 * unnecessary failure if the page-on-disk has corrupt page headers.
 * The page is returned locked to ensure that the caller has a chance to
 * initialize the page before it's made visible to others.
 * Caution: do not use this mode to read a page that is beyond the relation's
 * current physical EOF; that is likely to cause problems in md.c when
 * the page is modified and written out. P_NEW is OK, though.
 *
 * RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
 * a cleanup-strength lock on the page.
 *
 * RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
 *
 * If strategy is not NULL, a nondefault buffer access strategy is used.
 * See buffer/README for details.
 */
Buffer
ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
                   ReadBufferMode mode, BufferAccessStrategy strategy)
{
    bool        hit;
    Buffer      buf;
    /* Open it at the smgr level if not already done */
    RelationOpenSmgr(reln);
    /*
     * Reject attempts to read non-local temporary relations; we would be
     * likely to get wrong data since we have no visibility into the owning
     * session's local buffers.
     */
    if (RELATION_IS_OTHER_TEMP(reln))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot access temporary tables of other sessions")));
    /*
     * Read the buffer, and update pgstat counters to reflect a cache hit or
     * miss.
     */
    pgstat_count_buffer_read(reln);
    buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
                            forkNum, blockNum, mode, strategy, &hit);
    if (hit)
        pgstat_count_buffer_hit(reln);
    return buf;
}

3. Trace Analysis

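Use gdb's bt command to view the call stack. The query being executed is select * from t1 limit 10;, which reaches ReadBufferExtended through a sequential scan: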


(gdb) bt
#0  ReadBufferExtended (reln=0x7f497fe72788, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0)
    at bufmgr.c:647
#1  0x00000000004d974f in heapgetpage (scan=0x1d969d8, page=0) at heapam.c:379
#2  0x00000000004daeb2 in heapgettup_pagemode (scan=0x1d969d8, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:837
#3  0x00000000004dcf2b in heap_getnext (scan=0x1d969d8, direction=ForwardScanDirection) at heapam.c:1842
#4  0x000000000070ec39 in SeqNext (node=0x1d95890) at nodeSeqscan.c:80
#5  0x00000000006e0ab0 in ExecScanFetch (node=0x1d95890, accessMtd=0x70eba9 <SeqNext>, recheckMtd=0x70ec74 <SeqRecheck>)
    at execScan.c:95
#6  0x00000000006e0b22 in ExecScan (node=0x1d95890, accessMtd=0x70eba9 <SeqNext>, recheckMtd=0x70ec74 <SeqRecheck>)
    at execScan.c:145
#7  0x000000000070ecbe in ExecSeqScan (pstate=0x1d95890) at nodeSeqscan.c:129
#8  0x00000000006dee2a in ExecProcNodeFirst (node=0x1d95890) at execProcnode.c:445
#9  0x00000000007021b8 in ExecProcNode (node=0x1d95890) at ../../../src/include/executor/executor.h:237
#10 0x00000000007022dd in ExecLimit (pstate=0x1d95680) at nodeLimit.c:95
#11 0x00000000006dee2a in ExecProcNodeFirst (node=0x1d95680) at execProcnode.c:445
#12 0x00000000006d3d8d in ExecProcNode (node=0x1d95680) at ../../../src/include/executor/executor.h:237
#13 0x00000000006d65c5 in ExecutePlan (estate=0x1d95468, planstate=0x1d95680, use_parallel_mode=false, 
    operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x1d00ea8, 
    execute_once=true) at execMain.c:1723
#14 0x00000000006d4357 in standard_ExecutorRun (queryDesc=0x1cfdc28, direction=ForwardScanDirection, count=0, 
    execute_once=true) at execMain.c:364
#15 0x00000000006d417f in ExecutorRun (queryDesc=0x1cfdc28, direction=ForwardScanDirection, count=0, execute_once=true)
    at execMain.c:307
#16 0x00000000008bffd4 in PortalRunSelect (portal=0x1d3ebf8, forward=true, count=0, dest=0x1d00ea8) at pquery.c:932
#17 0x00000000008bfc72 in PortalRun (portal=0x1d3ebf8, count=9223372036854775807, isTopLevel=true, run_once=true, 
    dest=0x1d00ea8, altdest=0x1d00ea8, completionTag=0x7ffc1fc513d0 "") at pquery.c:773
#18 0x00000000008b9cd4 in exec_simple_query (query_string=0x1cd8ec8 "select * from t1 limit 10;") at postgres.c:1145
#19 0x00000000008bdf5f in PostgresMain (argc=1, argv=0x1d05278, dbname=0x1d050e0 "testdb", username=0x1cd5ba8 "xdb")
    at postgres.c:4182
#20 0x000000000081c16d in BackendRun (port=0x1cfae00) at postmaster.c:4361
#21 0x000000000081b8e0 in BackendStartup (port=0x1cfae00) at postmaster.c:4033
#22 0x0000000000817cda in ServerLoop () at postmaster.c:1706
#23 0x0000000000817590 in PostmasterMain (argc=1, argv=0x1cd3b60) at postmaster.c:1379
#24 0x0000000000741003 in main (argc=1, argv=0x1cd3b60) at main.c:228
(gdb)

The logic is straightforward, so it is not traced step by step here.

4. References

PG Source Code
