如何理解MYSQL-GroupCommit 和 2pc提交

發(fā)布時(shí)間：2021-11-16 11:34:56 來源：億速云閱讀：341 作者：柒染欄目：MySQL數(shù)據(jù)庫

這篇文章將為大家詳細(xì)講解有關(guān)如何理解MYSQL-GroupCommit 和 2pc提交，文章內(nèi)容質(zhì)量較高，因此小編分享給大家做個(gè)參考，希望大家閱讀完這篇文章后對(duì)相關(guān)知識(shí)有一定的了解。

組提交(group commit)是MYSQL處理日志的一種優(yōu)化方式，主要為了解決寫日志時(shí)頻繁刷磁盤的問題。組提交伴隨著MYSQL的發(fā)展不斷優(yōu)化，從最初只支持redo log 組提交，到目前5.6官方版本同時(shí)支持redo log 和binlog組提交。組提交的實(shí)現(xiàn)大大提高了mysql的事務(wù)處理性能，將以innodb 存儲(chǔ)引擎為例，詳細(xì)介紹組提交在各個(gè)階段的實(shí)現(xiàn)原理。

redo log的組提交

WAL(Write-Ahead-Logging)是實(shí)現(xiàn)事務(wù)持久性的一個(gè)常用技術(shù)，基本原理是在提交事務(wù)時(shí)，為了避免磁盤頁面的隨機(jī)寫，只需要保證事務(wù)的redo log寫入磁盤即可，這樣可以通過redo log的順序?qū)懘骓撁娴碾S機(jī)寫，并且可以保證事務(wù)的持久性，提高了數(shù)據(jù)庫系統(tǒng)的性能。雖然WAL使用順序?qū)懱娲穗S機(jī)寫，但是，每次事務(wù)提交，仍然需要有一次日志刷盤動(dòng)作，受限于磁盤IO，這個(gè)操作仍然是事務(wù)并發(fā)的瓶頸。

組提交思想是，將多個(gè)事務(wù)redo log的刷盤動(dòng)作合并，減少磁盤順序?qū)?。Innodb的日志系統(tǒng)里面，每條redo log都有一個(gè)LSN(Log Sequence Number)，LSN是單調(diào)遞增的。每個(gè)事務(wù)執(zhí)行更新操作都會(huì)包含一條或多條redo log，各個(gè)事務(wù)將日志拷貝到log_sys_buffer時(shí)(log_sys_buffer 通過log_mutex

保護(hù))，都會(huì)獲取當(dāng)前最大的LSN，因此可以保證不同事務(wù)的LSN不會(huì)重復(fù)。那么假設(shè)三個(gè)事務(wù)Trx1,Trx2和Trx3的日志的最大LSN分別為LSN1,LSN2,LSN3(LSN1<lsn2<lsn3)，它們同時(shí)進(jìn)行提交，那么如果trx3日志先獲取到log_mutex進(jìn)行落盤，它就可以順便把[lsn1---lsn3]這段日志也刷了，這樣trx1和trx2就不用再次請(qǐng)求磁盤io。組提交的基本流程如下： </lsn2<lsn3)，它們同時(shí)進(jìn)行提交，那么如果trx3日志先獲取到log_mutex進(jìn)行落盤，它就可以順便把[lsn1---lsn3]這段日志也刷了，這樣trx1和trx2就不用再次請(qǐng)求磁盤io。組提交的基本流程如下：<>

獲取 log_mutex
若flushed_to_disk_lsn>=lsn，表示日志已經(jīng)被刷盤,跳轉(zhuǎn)5
若 current_flush_lsn>=lsn，表示日志正在刷盤中，跳轉(zhuǎn)5后進(jìn)入等待狀態(tài)
將小于LSN的日志刷盤(flush and sync)
退出log_mutex

備注：lsn表示事務(wù)的lsn，flushed_to_disk_lsn和current_flush_lsn分別表示已刷盤的LSN和正在刷盤的LSN。

redo log 組提交優(yōu)化

我們知道，在開啟binlog的情況下，prepare階段，會(huì)對(duì)redo log進(jìn)行一次刷盤操作(innodb_flush_log_at_trx_commit=1)，確保對(duì)data頁和undo 頁的更新已經(jīng)刷新到磁盤；commit階段，會(huì)進(jìn)行刷binlog操作(sync_binlog=1),并且會(huì)對(duì)事務(wù)的undo log從prepare狀態(tài)設(shè)置為提交狀態(tài)(可清理狀態(tài))。通過兩階段提交方式(innodb_support_xa=1)，可以保證事務(wù)的binlog和redo log順序一致。二階段提交過程中，mysql_binlog作為協(xié)調(diào)者，各個(gè)存儲(chǔ)引擎和mysql_binlog作為參與者。故障恢復(fù)時(shí)，掃描最后一個(gè)binlog文件(進(jìn)行rotate binlog文件時(shí)，確保老的binlog文件對(duì)應(yīng)的事務(wù)已經(jīng)提交)，提取其中的xid；重做檢查點(diǎn)以后的redo日志，讀取事務(wù)的undo段信息，搜集處于prepare階段的事務(wù)鏈表，將事務(wù)的xid與binlog中的xid對(duì)比，若存在，則提交，否則就回滾。

通過上述的描述可知，每個(gè)事務(wù)提交時(shí)，都會(huì)觸發(fā)一次redo flush動(dòng)作，由于磁盤讀寫比較慢，因此很影響系統(tǒng)的吞吐量。淘寶童鞋做了一個(gè)優(yōu)化，將prepare階段的刷redo動(dòng)作移到了commit(flush-sync-commit)的flush階段之前，保證刷binlog之前，一定會(huì)刷redo。這樣就不會(huì)違背原有的故障恢復(fù)邏輯。移到commit階段的好處是，可以不用每個(gè)事務(wù)都刷盤，而是leader線程幫助刷一批redo。如何實(shí)現(xiàn)，很簡單，因?yàn)閘og_sys->lsn始終保持了當(dāng)前最大的lsn，只要我們刷redo刷到當(dāng)前的log_sys->lsn，就一定能保證，將要刷binlog的事務(wù)redo日志一定已經(jīng)落盤。通過延遲寫redo方式，實(shí)現(xiàn)了redo log組提交的目的，而且減少了log_sys->mutex的競爭。目前這種策略已經(jīng)被官方mysql5.7.6引入。

兩階段提交

在單機(jī)情況下，redo log組提交很好地解決了日志落盤問題，那么開啟binlog后，binlog能否和redo log一樣也開啟組提交？首先開啟binlog后，我們要解決的一個(gè)問題是，如何保證binlog和redo log的一致性。因?yàn)閎inlog是Master-Slave的橋梁，如果順序不一致，意味著Master-Slave可能不一致。MYSQL通過兩階段提交很好地解決了這一問題。Prepare階段，innodb刷redo log，并將回滾段設(shè)置為Prepared狀態(tài)，binlog不作任何操作；commit階段，innodb釋放鎖，釋放回滾段，設(shè)置提交狀態(tài)，binlog刷binlog日志。出現(xiàn)異常，需要故障恢復(fù)時(shí)，若發(fā)現(xiàn)事務(wù)處于Prepare階段，并且binlog存在則提交，否則回滾。通過兩階段提交，保證了redo log和binlog在任何情況下的一致性。

binlog的組提交

回到上節(jié)的問題，開啟binlog后，如何在保證redo log-binlog一致的基礎(chǔ)上，實(shí)現(xiàn)組提交。因?yàn)檫@個(gè)問題，5.6以前，mysql在開啟binlog的情況下，無法實(shí)現(xiàn)組提交，通過一個(gè)臭名昭著的prepare_commit_mutex，將redo log和binlog刷盤串行化，串行化的目的也僅僅是為了保證redo log-Binlog一致，但這種實(shí)現(xiàn)方式犧牲了性能。這個(gè)情況顯然是不能容忍的，因此各個(gè)mysql分支，mariadb，facebook，perconal等相繼出了補(bǔ)丁改進(jìn)這一問題，mysql官方版本5.6也終于解決了這一問題。由于各個(gè)分支版本解決方法類似，我主要通過分析5.6的實(shí)現(xiàn)來說明實(shí)現(xiàn)方法。

binlog組提交的基本思想是，引入隊(duì)列機(jī)制保證innodb commit順序與binlog落盤順序一致，并將事務(wù)分組，組內(nèi)的binlog刷盤動(dòng)作交給一個(gè)事務(wù)進(jìn)行，實(shí)現(xiàn)組提交目的。binlog提交將提交分為了3個(gè)階段，F(xiàn)LUSH階段，SYNC階段和COMMIT階段。每個(gè)階段都有一個(gè)隊(duì)列，每個(gè)隊(duì)列有一個(gè)mutex保護(hù)，約定進(jìn)入隊(duì)列第一個(gè)線程為leader，其他線程為follower，所有事情交由leader去做，leader做完所有動(dòng)作后，通知follower刷盤結(jié)束。binlog組提交基本流程如下：

FLUSH 階段

1) 持有Lock_log mutex [leader持有，follower等待]

2) 獲取隊(duì)列中的一組binlog(隊(duì)列中的所有事務(wù))

3) 將binlog buffer到I/O cache

4) 通知dump線程dump binlog

SYNC階段

1) 釋放Lock_log mutex，持有Lock_sync mutex[leader持有，follower等待]

2) 將一組binlog 落盤(sync動(dòng)作，最耗時(shí)，假設(shè)sync_binlog為1)

COMMIT階段

1) 釋放Lock_sync mutex，持有Lock_commit mutex[leader持有，follower等待]

2) 遍歷隊(duì)列中的事務(wù)，逐一進(jìn)行innodb commit

3) 釋放Lock_commit mutex

4) 喚醒隊(duì)列中等待的線程

說明：由于有多個(gè)隊(duì)列，每個(gè)隊(duì)列各自有mutex保護(hù)，隊(duì)列之間是順序的，約定進(jìn)入隊(duì)列的一個(gè)線程為leader，因此FLUSH階段的leader可能是SYNC階段的follower，但是follower永遠(yuǎn)是follower。

通過上文分析，我們知道MYSQL目前的組提交方式解決了一致性和性能的問題。通過二階段提交解決一致性，通過redo log和binlog的組提交解決磁盤IO的性能。下面我整理了Prepare階段和Commit階段的框架圖供各位參考。

如何理解MYSQL-GroupCommit 和 2pc提交

參考文檔

http://mysqlmusings.blogspot.com/2012/06/binary-log-group-commit-in-mysql-56.html

http://www.lupaworld.com/portal.php?mod=view&aid=250169&page=all

http://www.oschina.net/question/12_89981

http://kristiannielsen.livejournal.com/12254.html

http://blog.chinaunix.net/uid-26896862-id-3432594.html

http://www.csdn.net/article/2015-01-16/2823591

MySQL的事務(wù)提交邏輯主要在函數(shù)ha_commit_trans中完成。事務(wù)的提交涉及到binlog及具體的存儲(chǔ)的引擎的事務(wù)提交。所以MySQL用2PC來保證的事務(wù)的完整性。MySQL的2PC過程如下：

如何理解MYSQL-GroupCommit 和 2pc提交

T@4 : | | | | >trans_commit
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:  T@4 : | | | | | debug: stmt.unsafe_rollback_flags: 
T@4 : | | | | | debug: all.unsafe_rollback_flags: 
T@4 : | | | | | >trans_check
T@4 : | | | | | <trans_check 49 T@4 : | | | | | info: clearing SERVER_STATUS_IN_TRANS
T@4    : | | | | | >ha_commit_trans T@4 : | | | | | | info: all=1 thd->in_sub_stmt=0 ha_info=0x0 is_real_trans=1 T@4 : | | | | | | >MYSQL_BIN_LOG::commit T@4 : | | | | | | | enter: thd: 0x2b9f4c07beb0, all: yes, xid: 0, cache_mngr: 0x0 T@4 : | | | | | | | >ha_commit_low
T@4 : | | | | | | | | >THD::st_transaction::cleanup
T@4 : | | | | | | | | | >free_root
T@4 : | | | | | | | | | | enter: root: 0x2b9f4c07d660 flags: 1 T@4 : | | | | | | | | | <free_root 396 T@4 : | | | | | | | | <thd::st_transaction::cleanup 2521 T@4 : | | | | | | | <ha_commit_low 1535 T@4 : | | | | | | <mysql_bin_log::commit 6383 T@4 : | | | | | | >THD::st_transaction::cleanup
T@4 : | | | | | | | >free_root
T@4 : | | | | | | | | enter: root: 0x2b9f4c07d660 flags: 1 T@4 : | | | | | | | <free_root 396 T@4 : | | | | | | <thd::st_transaction::cleanup 2521 T@4 : | | | | | <ha_commit_trans 1458 T@4 : | | | | | debug: reset_unsafe_rollback_flags
T@4 : | | | | <trans_commit 233</ha_commit_trans<> T@4 : | | | | >MDL_context::release_transactional_locks
T@4 : | | | | | >MDL_context::release_locks_stored_before
T@4 : | | | | | <mdl_context::release_locks_stored_before 2771 T@4 : | | | | | >MDL_context::release_locks_stored_before
T@4 : | | | | | <mdl_context::release_locks_stored_before 2771 T@4 : | | | | <mdl_context::release_transactional_locks 2926 T@4 : | | | | >set_ok_status
T@4 : | | | | <set_ok_status 446 T@4 : | | | | THD::enter_stage: /usr/src/mysql-5.6.28/sql/sql_parse.cc:4996 T@4 : | | | | >PROFILING::status_change
T@4 : | | | | <profiling::status_change 354 T@4 : | | | | >trans_commit_stmt
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:  T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:  T@4 : | | | | | debug: stmt.unsafe_rollback_flags: 
T@4 : | | | | | debug: all.unsafe_rollback_flags: 
T@4 : | | | | | debug: add_unsafe_rollback_flags: 0 T@4 : | | | | | >MYSQL_BIN_LOG::commit

如何理解MYSQL-GroupCommit 和 2pc提交

(1)先調(diào)用binglog_hton和innobase_hton的prepare方法完成第一階段，binlog_hton的papare方法實(shí)際上什么也沒做，innodb的prepare將事務(wù)狀態(tài)設(shè)為TRX_PREPARED，并將redo log刷磁盤 (innobase_xa_prepare à trx_prepare_for_mysql à trx_prepare_off_kernel)。

(2)如果事務(wù)涉及的所有存儲(chǔ)引擎的prepare都執(zhí)行成功，則調(diào)用TC_LOG_BINLOG::log_xid將SQL語句寫到binlog，此時(shí)，事務(wù)已經(jīng)鐵定要提交了。否則，調(diào)用ha_rollback_trans回滾事務(wù)，而SQL語句實(shí)際上也不會(huì)寫到binlog。

(3)最后，調(diào)用引擎的commit完成事務(wù)的提交。實(shí)際上binlog_hton->commit什么也不會(huì)做(因?yàn)?2)已經(jīng)將binlog寫入磁盤)，innobase_hton->commit則清除undo信息，刷redo日志，將事務(wù)設(shè)為TRX_NOT_STARTED狀態(tài)(innobase_commit à innobase_commit_low à trx_commit_for_mysql à trx_commit_off_kernel)。

//ha_innodb.cc

static

int

innobase_commit(

/*============*/

/* out: 0 */

THD* thd, /* in: MySQL thread handle of the user for whom

the transaction should be committed */

bool all) /* in: TRUE - commit transaction

FALSE - the current SQL statement ended */

{

...

trx->mysql_log_file_name = mysql_bin_log.get_log_fname();

trx->mysql_log_offset =

(ib_longlong)mysql_bin_log.get_log_file()->pos_in_file;

...

}

函數(shù)innobase_commit提交事務(wù)，先得到當(dāng)前的binlog的位置，然后再寫入事務(wù)系統(tǒng)PAGE(trx_commit_off_kernel à trx_sys_update_mysql_binlog_offset)。

InnoDB將MySQL binlog的位置記錄到trx system header中：

//trx0sys.h

/* The offset of the MySQL binlog offset info in the trx system header */

#define TRX_SYS_MYSQL_LOG_INFO (UNIV_PAGE_SIZE - 1000)

#define TRX_SYS_MYSQL_LOG_MAGIC_N_FLD 0 /* magic number which shows

if we have valid data in the

MySQL binlog info; the value

is ..._MAGIC_N if yes */

#define TRX_SYS_MYSQL_LOG_OFFSET_HIGH 4 /* high 4 bytes of the offset

within that file */

#define TRX_SYS_MYSQL_LOG_OFFSET_LOW 8 /* low 4 bytes of the offset

within that file */

#define TRX_SYS_MYSQL_LOG_NAME 12 /* MySQL log file name */

5.3.2 事務(wù)恢復(fù)流程

Innodb在恢復(fù)的時(shí)候，不同狀態(tài)的事務(wù)，會(huì)進(jìn)行不同的處理(見trx_rollback_or_clean_all_without_sess函數(shù))：

<1>對(duì)于TRX_COMMITTED_IN_MEMORY的事務(wù)，清除回滾段，然后將事務(wù)設(shè)為TRX_NOT_STARTED；

<2>對(duì)于TRX_NOT_STARTED的事務(wù)，表示事務(wù)已經(jīng)提交，跳過；

<3>對(duì)于TRX_PREPARED的事務(wù)，要根據(jù)binlog來決定事務(wù)的命運(yùn)，暫時(shí)跳過;

<4>對(duì)于TRX_ACTIVE的事務(wù)，回滾。

MySQL在打開binlog時(shí)，會(huì)檢查binlog的狀態(tài)(TC_LOG_BINLOG::open)。如果binlog沒有正常關(guān)閉(LOG_EVENT_BINLOG_IN_USE_F為1)，則進(jìn)行恢復(fù)操作，基本流程如下：

如何理解MYSQL-GroupCommit 和 2pc提交

<1>掃描binlog，讀取XID_EVENT事務(wù)，得到所有已經(jīng)提交的XA事務(wù)列表(實(shí)際上事務(wù)在innodb可能處于prepare或者commit)；

<2>對(duì)每個(gè)XA事務(wù)，調(diào)用handlerton::recover，檢查存儲(chǔ)引擎是否存在處于prepare狀態(tài)的該事務(wù)(見innobase_xa_recover)，也就是檢查該XA事務(wù)在存儲(chǔ)引擎中的狀態(tài)；

<3>如果存在處于prepare狀態(tài)的該XA事務(wù)，則調(diào)用handlerton::commit_by_xid提交事務(wù)；

<4>否則，調(diào)用handlerton::rollback_by_xid回滾XA事務(wù)。

5.3.3 幾個(gè)參數(shù)討論

(1)sync_binlog

Mysql在提交事務(wù)時(shí)調(diào)用MYSQL_LOG::write完成寫binlog，并根據(jù)sync_binlog決定是否進(jìn)行刷盤。默認(rèn)值是0，即不刷盤，從而把控制權(quán)讓給OS。如果設(shè)為1，則每次提交事務(wù)，就會(huì)進(jìn)行一次刷盤；這對(duì)性能有影響(5.6已經(jīng)支持binlog group)，所以很多人將其設(shè)置為100。

bool MYSQL_LOG::flush_and_sync()

{

int err=0, fd=log_file.file;

safe_mutex_assert_owner(&LOCK_log);

if (flush_io_cache(&log_file))

return 1;

if (++sync_binlog_counter >= sync_binlog_period && sync_binlog_period)

{

sync_binlog_counter= 0;

err=my_sync(fd, MYF(MY_WME));

}

return err;

}

(2) innodb_flush_log_at_trx_commit

該參數(shù)控制innodb在提交事務(wù)時(shí)刷redo log的行為。默認(rèn)值為1，即每次提交事務(wù)，都進(jìn)行刷盤操作。為了降低對(duì)性能的影響，在很多生產(chǎn)環(huán)境設(shè)置為2，甚至0。

如何理解MYSQL-GroupCommit 和 2pc提交

 trx_flush_log_if_needed_low( /*========================*/ lsn_t    lsn) /*!< in: lsn up to which logs are to be
            flushed. */ { switch (srv_flush_log_at_trx_commit) { case 0: /* Do nothing */ break; case 1: /* Write the log and optionally flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP,
                srv_unix_file_flush_method != SRV_UNIX_NOSYNC); break; case 2: /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE); break; default:
        ut_error;
    }
}

如何理解MYSQL-GroupCommit 和 2pc提交

If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues.

The default value of 1 is required for full ACID compliance. You can achieve better performance by setting the value different from 1, but then you can lose up to one second worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, only an operating system crash or a power outage can erase the last second of transactions.

(3) innodb_support_xa

用于控制innodb是否支持XA事務(wù)的2PC，默認(rèn)是TRUE。如果關(guān)閉，則innodb在prepare階段就什么也不做；這可能會(huì)導(dǎo)致binlog的順序與innodb提交的順序不一致(比如A事務(wù)比B事務(wù)先寫binlog，但是在innodb內(nèi)部卻可能A事務(wù)比B事務(wù)后提交)，這會(huì)導(dǎo)致在恢復(fù)或者slave產(chǎn)生不同的數(shù)據(jù)。

int

innobase_xa_prepare(

/*================*/

/* out: 0 or error number */

THD* thd, /* in: handle to the MySQL thread of the user

whose XA transaction should be prepared */

bool all) /* in: TRUE - commit transaction

FALSE - the current SQL statement ended */

{

…

if (!thd->variables.innodb_support_xa) {

return(0);

}

如何理解MYSQL-GroupCommit 和 2pc提交

ver  mysql 5.7

bool trans_xa_commit(THD *thd)
{ bool res= TRUE; enum xa_states xa_state= thd->transaction.xid_state.xa_state;
  DBUG_ENTER("trans_xa_commit"); if (!thd->transaction.xid_state.xid.eq(thd->lex->xid))
  { /* xid_state.in_thd is always true beside of xa recovery procedure.
      Note, that there is no race condition here between xid_cache_search
      and xid_cache_delete, since we always delete our own XID
      (thd->lex->xid == thd->transaction.xid_state.xid).
      The only case when thd->lex->xid != thd->transaction.xid_state.xid
      and xid_state->in_thd == 0 is in the function
      xa_cache_insert(XID, xa_states), which is called before starting
      client connections, and thus is always single-threaded. */ XID_STATE *xs= xid_cache_search(thd->lex->xid);
    res= !xs || xs->in_thd; if (res)
      my_error(ER_XAER_NOTA, MYF(0)); else {
      res= xa_trans_rolled_back(xs);
      ha_commit_or_rollback_by_xid(thd, thd->lex->xid, !res);
      xid_cache_delete(xs);
    }
    DBUG_RETURN(res);
  } if (xa_trans_rolled_back(&thd->transaction.xid_state))
  {
    xa_trans_force_rollback(thd);
    res= thd->is_error();
  } else if (xa_state == XA_IDLE && thd->lex->xa_opt == XA_ONE_PHASE)
  { int r= ha_commit_trans(thd, TRUE); if ((res= MY_TEST(r)))
      my_error(r == 1 ? ER_XA_RBROLLBACK : ER_XAER_RMERR, MYF(0));
  } else if (xa_state == XA_PREPARED && thd->lex->xa_opt == XA_NONE)
  {
    MDL_request mdl_request; /* Acquire metadata lock which will ensure that COMMIT is blocked
      by active FLUSH TABLES WITH READ LOCK (and vice versa COMMIT in
      progress blocks FTWRL).

      We allow FLUSHer to COMMIT; we assume FLUSHer knows what it does. */ mdl_request.init(MDL_key::COMMIT, "", "", MDL_INTENTION_EXCLUSIVE,
                     MDL_TRANSACTION); if (thd->mdl_context.acquire_lock(&mdl_request,
                                      thd->variables.lock_wait_timeout))
    {
      ha_rollback_trans(thd, TRUE);
      my_error(ER_XAER_RMERR, MYF(0));
    } else {
      DEBUG_SYNC(thd, "trans_xa_commit_after_acquire_commit_lock"); if (tc_log)
        res= MY_TEST(tc_log->commit(thd, /* all */ true)); else res= MY_TEST(ha_commit_low(thd, /* all */ true)); if (res)
        my_error(ER_XAER_RMERR, MYF(0));
    }
  } else {
    my_error(ER_XAER_RMFAIL, MYF(0), xa_state_names[xa_state]);
    DBUG_RETURN(TRUE);
  }

  thd->variables.option_bits&= ~OPTION_BEGIN;
  thd->transaction.all.reset_unsafe_rollback_flags();
  thd->server_status&=
    ~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
  DBUG_PRINT("info", ("clearing SERVER_STATUS_IN_TRANS"));
  xid_cache_delete(&thd->transaction.xid_state);
  thd->transaction.xid_state.xa_state= XA_NOTR;

  DBUG_RETURN(res);
}

如何理解MYSQL-GroupCommit 和 2pc提交

5.3.4 安全性/性能討論

上面3個(gè)參數(shù)不同的值會(huì)帶來不同的效果。三者都設(shè)置為1(TRUE)，數(shù)據(jù)才能真正安全。sync_binlog非1，可能導(dǎo)致binlog丟失(OS掛掉)，從而與innodb層面的數(shù)據(jù)不一致。innodb_flush_log_at_trx_commit非1，可能會(huì)導(dǎo)致innodb層面的數(shù)據(jù)丟失(OS掛掉)，從而與binlog不一致。

關(guān)于性能分析，可以參考

http://www.mysqlperformanceblog.com/2011/03/02/what-is-innodb_support_xa/

http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/

如何理解MYSQL-GroupCommit 和 2pc提交

在事務(wù)提交時(shí)innobase會(huì)調(diào)用ha_innodb.cc 中的innobase_commit，而innobase_commit通過調(diào)用trx_commit_complete_for_mysql（trx0trx.c)來調(diào)用log_write_up_to（log0log.c),也就是當(dāng)innobase提交事務(wù)的時(shí)候就會(huì)調(diào)用log_write_up_to來寫redo log
innobase_commit中 if (all              # 如果是事務(wù)提交 || (!thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) {
通過下面的代碼實(shí)現(xiàn)事務(wù)的commit串行化 if (innobase_commit_concurrency > 0) {
                        pthread_mutex_lock(&commit_cond_m);
                        commit_threads++; if (commit_threads > innobase_commit_concurrency) {
                                commit_threads--;
                                pthread_cond_wait(&commit_cond, &commit_cond_m);
                                pthread_mutex_unlock(&commit_cond_m); goto retry;
                        } else {
                                pthread_mutex_unlock(&commit_cond_m);
                        }
                }

                trx->flush_log_later = TRUE;   # 在做提交操作時(shí)禁止flush binlog 到磁盤
                innobase_commit_low(trx);
                trx->flush_log_later = FALSE;
先略過innobase_commit_low調(diào)用 ,下面開始調(diào)用trx_commit_complete_for_mysql做write日志操作
trx_commit_complete_for_mysql(trx);  #開始flush  log
trx->active_trans = 0;
在trx_commit_complete_for_mysql中，主要做的是對(duì)系統(tǒng)參數(shù)srv_flush_log_at_trx_commit值做判斷來調(diào)用
log_write_up_to，或者write redo log file或者write&&flush to disk if (!trx->must_flush_log_later) { /* Do nothing */ } else if (srv_flush_log_at_trx_commit == 0) {  #flush_log_at_trx_commit=0，事務(wù)提交不寫redo log /* Do nothing */ } else if (srv_flush_log_at_trx_commit == 1) { #flush_log_at_trx_commit=1,事務(wù)提交寫log并flush磁盤,如果flush方式不是SRV_UNIX_NOSYNC （這個(gè)不是很熟悉） if (srv_unix_file_flush_method == SRV_UNIX_NOSYNC) { /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
                } else { /* Write the log to the log files AND flush them to
                        disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, TRUE);
                }
        } else if (srv_flush_log_at_trx_commit == 2) {   #如果是2，則只write到redo log /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
        } else {
                ut_error;
        }
那么下面看log_write_up_to if (flush_to_disk              #如果flush到磁盤，則比較當(dāng)前commit的lsn是否大于已經(jīng)flush到磁盤的lsn && ut_dulint_cmp(log_sys->flushed_to_disk_lsn, lsn) >= 0) {

                mutex_exit(&(log_sys->mutex)); return;
        } if (!flush_to_disk       #如果不flush磁盤則比較當(dāng)前commit的lsn是否大于已經(jīng)寫到所有redo log file的lsn,或者在只等一個(gè)group完成條件下是否大于已經(jīng)寫到某個(gè)redo file的lsn && (ut_dulint_cmp(log_sys->written_to_all_lsn, lsn) >= 0 || (ut_dulint_cmp(log_sys->written_to_some_lsn, lsn) >= 0 && wait != LOG_WAIT_ALL_GROUPS))) {

                mutex_exit(&(log_sys->mutex)); return;
        }
#下面的代碼判斷是否log在write,有的話等待其完成 if (log_sys->n_pending_writes > 0) { if (flush_to_disk    # 如果需要刷新到磁盤，如果正在flush的lsn包括了commit的lsn，只要等待操作完成就可以了 && ut_dulint_cmp(log_sys->current_flush_lsn, lsn) >= 0) { goto do_waits;
                } if (!flush_to_disk  # 如果是刷到redo log file的那么如果在write的lsn包括了commit的lsn,也只要等待就可以了 && ut_dulint_cmp(log_sys->write_lsn, lsn) >= 0) { goto do_waits;
                }
      ...... if (!flush_to_disk  # 如果在當(dāng)前IO空閑情況下 ，而且不需要flush到磁盤，那么 如果下次寫的位置已經(jīng)到達(dá)buf_free位置說明wirte操作都已經(jīng)完成了，直接返回 && log_sys->buf_free == log_sys->buf_next_to_write) {
                             mutex_exit(&(log_sys->mutex)); return;
               }
下面取到group,設(shè)置相關(guān)write or flush相關(guān)字段，并且得到起始和結(jié)束位置的block號(hào)
log_sys->n_pending_writes++;

        group = UT_LIST_GET_FIRST(log_sys->log_groups);
        group->n_pending_writes++; /* We assume here that we have only
                                        one log group! */ os_event_reset(log_sys->no_flush_event);
        os_event_reset(log_sys->one_flushed_event);

        start_offset = log_sys->buf_next_to_write;
        end_offset = log_sys->buf_free;

        area_start = ut_calc_align_down(start_offset, OS_FILE_LOG_BLOCK_SIZE);
        area_end = ut_calc_align(end_offset, OS_FILE_LOG_BLOCK_SIZE);

        ut_ad(area_end - area_start > 0);

        log_sys->write_lsn = log_sys->lsn; if (flush_to_disk) {
                log_sys->current_flush_lsn = log_sys->lsn;
        }
log_block_set_checkpoint_no調(diào)用設(shè)置end_offset所在block的LOG_BLOCK_CHECKPOINT_NO為log_sys中下個(gè)檢查點(diǎn)號(hào)
        log_block_set_flush_bit(log_sys->buf + area_start, TRUE);   # 這個(gè)沒看明白
        log_block_set_checkpoint_no(
                log_sys->buf + area_end - OS_FILE_LOG_BLOCK_SIZE,
                log_sys->next_checkpoint_no);
保存不屬于end_offset但在其所在的block中的數(shù)據(jù)到下一個(gè)空閑的block
ut_memcpy(log_sys->buf + area_end,
                  log_sys->buf + area_end - OS_FILE_LOG_BLOCK_SIZE,
                  OS_FILE_LOG_BLOCK_SIZE);
對(duì)于每個(gè)group調(diào)用log_group_write_buf寫redo log buffer while (group) {
                log_group_write_buf(
                        group, log_sys->buf + area_start,
                        area_end - area_start,
                        ut_dulint_align_down(log_sys->written_to_all_lsn,
                                             OS_FILE_LOG_BLOCK_SIZE),
                        start_offset - area_start);

                log_group_set_fields(group, log_sys->write_lsn);   # 計(jì)算這次寫的lsn和offset來設(shè)置group->lsn和group->lsn_offset

                group = UT_LIST_GET_NEXT(log_groups, group);
        }
...... if (srv_unix_file_flush_method == SRV_UNIX_O_DSYNC) {  # 這個(gè)是什么東西 /* O_DSYNC means the OS did not buffer the log file at all:
                so we have also flushed to disk what we have written */ log_sys->flushed_to_disk_lsn = log_sys->write_lsn;

        } else if (flush_to_disk) {

                group = UT_LIST_GET_FIRST(log_sys->log_groups);

                fil_flush(group->space_id);         # 最后調(diào)用fil_flush執(zhí)行flush到磁盤
                log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
        }

接下來看log_group_write_buf做了點(diǎn)什么

在log_group_calc_size_offset中,從group中取到上次記錄的lsn位置（注意是log files組成的1個(gè)環(huán)狀buffer),并計(jì)算這次的lsn相對(duì)于上次的差值
# 調(diào)用log_group_calc_size_offset計(jì)算group->lsn_offset除去多個(gè)LOG_FILE頭部長度后的大小，比如lsn_offset落在第3個(gè)log file上，那么需要減掉3*LOG_FILE_HDR_SIZE的大小
gr_lsn_size_offset = (ib_longlong)
       log_group_calc_size_offset(group->lsn_offset, group);
group_size = (ib_longlong) log_group_get_capacity(group); # 計(jì)算group除去所有LOG_FILE_HDR_SIZE長度后的DATA部分大小

# 下面是典型的環(huán)狀結(jié)構(gòu)差值計(jì)算 if (ut_dulint_cmp(lsn, gr_lsn) >= 0) {

                difference = (ib_longlong) ut_dulint_minus(lsn, gr_lsn);
        } else {
                difference = (ib_longlong) ut_dulint_minus(gr_lsn, lsn);

                difference = difference % group_size;

                difference = group_size - difference;
        }

        offset = (gr_lsn_size_offset + difference) % group_size;
# 最后算上每個(gè)log file 頭部大小，返回真實(shí)的offset return(log_group_calc_real_offset((ulint)offset, group));
接著看

# 如果需要寫的內(nèi)容超過一個(gè)文件大小 if ((next_offset % group->file_size) + len > group->file_size) {

                write_len = group->file_size                # 寫到file末尾 - (next_offset % group->file_size);
        } else {
                write_len = len;                              # 否者寫len個(gè)block
        }

# 最后真正的內(nèi)容就是寫buffer了，如果跨越file的話另外需要寫file log file head部分 if ((next_offset % group->file_size == LOG_FILE_HDR_SIZE) && write_header) { /* We start to write a new log file instance in the group */ log_group_file_header_flush(group,
                                            next_offset / group->file_size,
                                            start_lsn);
                srv_os_log_written+= OS_FILE_LOG_BLOCK_SIZE;
                srv_log_writes++;
        }

# 調(diào)用fil_io來執(zhí)行buffer寫 if (log_do_write) {
                log_sys->n_log_ios++;

                srv_os_log_pending_writes++;

                fil_io(OS_FILE_WRITE | OS_FILE_LOG, TRUE, group->space_id,
                       next_offset / UNIV_PAGE_SIZE,
                       next_offset % UNIV_PAGE_SIZE, write_len, buf, group);

                srv_os_log_pending_writes--;

                srv_os_log_written+= write_len;
                srv_log_writes++;

如何理解MYSQL-GroupCommit 和 2pc提交

然而我們考慮如下序列（Copy from worklog…）

    Trx1 ------------P----------C-------------------------------->
                                |
    Trx2 ----------------P------+---C---------------------------->
                                |   |
    Trx3 -------------------P---+---+-----C---------------------->
                                |   |     |
    Trx4 -----------------------+-P-+-----+----C----------------->
                                |   |     |    |
    Trx5 -----------------------+---+-P---+----+---C------------->
                                |   |     |    |   |
    Trx6 -----------------------+---+---P-+----+---+---C---------->
                                |   |     |    |   |   |
    Trx7 -----------------------+---+-----+----+---+-P-+--C------->
                                |   |     |    |   |   |  |

在之前的邏輯中，trx5 和 trx6是可以并發(fā)執(zhí)行的，因?yàn)樗麄儞碛邢嗤男蛄刑?hào)；Trx4無法和Trx5并行，因?yàn)樗麄兊男蛄刑?hào)不同。同樣的trx6和trx7也無法并行。當(dāng)發(fā)現(xiàn)一個(gè)無法并發(fā)的事務(wù)時(shí)，就需要等待前面的事務(wù)執(zhí)行完成才能繼續(xù)下去，這會(huì)影響到備庫的TPS。

但是理論上trx4應(yīng)該可以和trx5和trx6并行，因?yàn)閠rx4先于trx5和trx6 prepare，如果trx5 和trx6能進(jìn)入Prepare階段，證明其和trx4是沒有沖突的。

解決方案：

0.增加兩個(gè)全局變量：

/* Committed transactions timestamp */
Logical_clock max_committed_transaction;
/* "Prepared" transactions timestamp */
Logical_clock transaction_counter;

每個(gè)事務(wù)對(duì)應(yīng)兩個(gè)counter：last_committed 及 sequence_number

1.每次rotate或打開新的binlog時(shí)

MYSQL_BIN_LOG::open_binlog:

max_committed_transaction.update_offset(transaction_counter.get_timestamp());
transaction_counter.update_offset(transaction_counter.get_timestamp());

—>更新max_committed_transaction和transaction_counter的offset為當(dāng)前的state值（或者說，為上個(gè)Binlog文件最大的transaction counter值）

2.每執(zhí)行一條DML語句完成時(shí)，更新當(dāng)前會(huì)話的last_committed= mysql_bin_log.max_committed_transaction

參考函數(shù)： binlog_prepare （參數(shù)all為false）

3. 事務(wù)提交時(shí)，寫入binlog之前

binlog_cache_data::flush:

trn_ctx->sequence_number= mysql_bin_log.transaction_counter.step();

其中transaction_counter遞增1

4.寫入binlog

將sequence_number 和 last_committed寫入binlog

MYSQL_BIN_LOG::write_gtid

記錄binlog文件的seq number 和last committed會(huì)減去max_committed_transaction.get_offset()，也就是說，每個(gè)Binlog文件的序列號(hào)總是從(last_committed, sequence_number)=(0,1)開始

5.引擎層提交每個(gè)事務(wù)前更新max_committed_transaction

如果當(dāng)前事務(wù)的sequence_number大于max_committed_transaction，則更新max_committed_transaction的值

MYSQL_BIN_LOG::process_commit_stage_queue –> MYSQL_BIN_LOG::update_max_committed

6.備庫并發(fā)檢查

函數(shù)：Mts_submode_logical_clock::schedule_next_event

假設(shè)初始狀態(tài)下transaction_counter=1, max_committed_transaction=1，以上述流程為例，每個(gè)事務(wù)的<last_committed, sequence_number>序列為:

Trx1 prepare: last_commited = max_committed_transaction = 1;

Trx2 prepare: last_commited = max_committed_transaction = 1;

Trx3 prepare: last_commited = max_committed_transaction = 1;

Trx1 commit: sequence_number=++transaction_counter = 2, (transaction_counter=2, max_committed_transaction=2), write(1,2)

Trx4 prepare: last_commited =max_committed_transaction = 2;

Trx2 commit: sequence_number=++transaction_counter= 3, (transaction_counter=3, max_committed_transaction=3), write(1,3)

Trx5 prepare: last_commited = max_committed_transaction = 3;

Trx6 prepare: last_commited = max_committed_transaction = 3;

Trx3 commit: sequence_number=++transaction_counter= 4, (transaction_counter=4, max_committed_transaction=4), write(1,4)

Trx4 commit: sequence_number=++transaction_counter= 5, (transaction_counter=5, max_committed_transaction=5), write(2, 5)

Trx5 commit: sequence_number=++transaction_counter= 6, (transaction_counter=6, max_committed_transaction=6), write(3, 6)

Trx7 prepare: last_commited = max_committed_transaction = 6;

Trx6 commit: sequence_number=++transaction_counter= 7, (transaction_counter=7, max_committed_transaction=7), write(3, 7)

Trx7 commit: sequence_number=++transaction_counter= 8, (transaction_counter=8, max_committed_transaction=8), write(6, 8)

并發(fā)規(guī)則：

因此上述序列中可以并發(fā)的序列為：

trx1 1…..2

trx2 1………….3

trx3 1…………………….4

trx4 2………………………….5

trx5 3………………………………..6

trx6 3………………………………………………7

trx7 6……………………..8

備庫并行規(guī)則：當(dāng)分發(fā)一個(gè)事務(wù)時(shí)，其last_committed 序列號(hào)比當(dāng)前正在執(zhí)行的事務(wù)的最小sequence_number要小時(shí)，則允許執(zhí)行。

因此，

a)trx1執(zhí)行，last_commit<2的可并發(fā)，trx2, trx3可繼續(xù)分發(fā)執(zhí)行

b)trx1執(zhí)行完成后，last_commit < 3的可以執(zhí)行， trx4可分發(fā)

c)trx2執(zhí)行完成后，last_commit < 4的可以執(zhí)行， trx5, trx6可分發(fā)

d)trx3、trx4、trx5完成后，last_commit < 7的可以執(zhí)行，trx7可分發(fā)

關(guān)于如何理解MYSQL-GroupCommit 和 2pc提交就分享到這里了，希望以上內(nèi)容可以對(duì)大家有一定的幫助，可以學(xué)到更多知識(shí)。如果覺得文章不錯(cuò)，可以把它分享出去讓更多的人看到。

向AI問一下細(xì)節(jié)

如何理解MYSQL-GroupCommit 和 2pc提交

猜你喜歡

最新資訊

相關(guān)推薦

相關(guān)標(biāo)簽