A Source-Level Walkthrough of the InnoDB I/O Path

Published: 2021-12-22 11:49:39. Source: 億速云. Author: 小新. Column: Big Data.

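This article analyzes the source code of the InnoDB I/O path in detail, with a focus on where and how files are flushed to disk. It is meant as a reference for readers who want to follow the calls themselves.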

InnoDB funnels every I/O flush through the os_file_flush macro, which expands to os_file_flush_func.
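To make the bottom of this path concrete, here is a minimal sketch of what a flush wrapper of this kind does on a POSIX system. It is not the InnoDB implementation: flush_file and write_and_flush are hypothetical names, and retrying on EINTR is an assumption about reasonable behavior. The only real APIs used are POSIX open, write, fsync and close.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical stand-in for a function like os_file_flush_func():
// force the OS cache for `fd` to stable storage.
// Returns true on success and retries when fsync() is interrupted.
bool flush_file(int fd) {
    for (;;) {
        if (fsync(fd) == 0) {
            return true;
        }
        if (errno != EINTR) {
            return false;  // a real error, e.g. EBADF or EIO
        }
    }
}

// Hypothetical helper: write a string to `path`, then flush it.
bool write_and_flush(const char* path, const char* data) {
    assert(path != nullptr && data != nullptr);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return false;
    }
    const std::size_t len = std::strlen(data);
    bool ok = write(fd, data, len) == static_cast<ssize_t>(len);
    ok = ok && flush_file(fd);  // the data is durable only after this returns true
    close(fd);
    return ok;
}
```

A caller would use it as write_and_flush("/tmp/example.dat", "redo"). The point is that the write alone only reaches the OS cache, and the fsync is the expensive step the rest of this article tracks.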

Next, let's look at which code paths call os_file_flush_func directly:

1. buf_dblwr_init_or_load_pages: during crash recovery of the doublewrite buffer, when reset_space_ids is set to true.

2. fil_create_new_single_table_tablespace: when the data file for a table is created.

3. fil_user_tablespace_restore_page: when a page is copied back from the doublewrite buffer.

4. fil_tablespace_iterate: if iterating over the pages fails, it performs one sync.

5. os_file_set_size: when a newly created file is initialized to its target size.

As the list shows, direct calls to os_file_flush_func are rare. The remaining flushes go through fil_flush, the higher-level function that calls os_file_flush_func. Its callers are:

1. buf_dblwr_flush_buffered_writes: flushes the doublewrite buffer to disk.

2. buf_dblwr_write_single_page: flushes a single page to disk.

3. buf_flush_write_block_low: flushes a block to disk, usually after the doublewrite buffer call buf_dblwr_flush_buffered_writes.

4. buf_LRU_remove_pages: performs a sync when the pages of a tablespace are deleted.

5. fil_rename_tablespace: called when a tablespace is renamed.

6. fil_extend_space_to_desired_size: called when a space is extended.

7. fil_flush_file_spaces: flushes the pages of a batch of tablespaces. It is called for the forced checkpoint in ha_innodb, at database shutdown, and in similar situations.

8. log_io_complete: called when a log I/O completes, for checkpoint writes and others. Whether it flushes depends on the configured sync method, and commit does not depend on this path.

9. log_write_up_to: called when a transaction commits. This is the important one: every flush of the InnoDB log files goes through it, and the call is synchronous.

10. create_log_files_rename: called when the log files are renamed.

11. innobase_start_or_create_for_mysql: called when new files are created.

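Inside fil_flush, the flush bookkeeping for every file is protected by fil_system->mutex. As a result, flushes are serialized per file, whether the file is a data file or a log file. How exactly is this enforced?

1. Check n_pending_flushes in the file node. If it is greater than 0, another flush is in progress, so wait on the node's sync_event and retry.

2. If it is 0, enter the flush phase. Before the flush actually starts, n_pending_flushes is incremented by 1. This update is protected by the fil_system->mutex mentioned above.

3. Call os_file_flush. Once the flush completes, n_pending_flushes is decremented by 1, again under fil_system->mutex.

Here is the retry code in MySQL: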

    retry:
        if (node->n_pending_flushes > 0) {
            /* We want to avoid calling os_file_flush() on
            the file twice at the same time, because we do
            not know what bugs OS's may contain in file
            i/o */

            ib_int64_t sig_count =
                os_event_reset(node->sync_event);

            mutex_exit(&fil_system->mutex);

            os_event_wait_low(node->sync_event, sig_count);

            mutex_enter(&fil_system->mutex);

            if (node->flush_counter >= old_mod_counter) {
                goto skip_flush;
            }

            goto retry;
        }

Here is the flush code in MySQL:

    ut_a(node->open);
    file = node->handle;
    node->n_pending_flushes++;

    mutex_exit(&fil_system->mutex);

    os_file_flush(file);

    mutex_enter(&fil_system->mutex);

    os_event_set(node->sync_event);

    node->n_pending_flushes--;
    node->flush_size = node->size;
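The two fragments above can be read as one pattern: take the mutex, wait while another flush is pending, mark a flush as pending, drop the mutex for the slow I/O, then retake it to update the counters and wake the waiters. The following is a minimal sketch of that pattern, not InnoDB code. FileNode and FlushGate are hypothetical types, std::mutex and std::condition_variable are swapped in for fil_system->mutex and the os_event calls, and the early return when nothing has been modified is an assumption.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>

// Simplified model of a file node. Field names mirror the article,
// but this is not the InnoDB struct.
struct FileNode {
    int          n_pending_flushes = 0;     // flushes currently in progress
    std::int64_t modification_counter = 0;  // bumped on every write
    std::int64_t flush_counter = 0;         // modifications known to be flushed
};

class FlushGate {
public:
    // Record one write to the file.
    void note_write(FileNode& node) {
        std::lock_guard<std::mutex> guard(mutex_);
        ++node.modification_counter;
    }

    // Flush `node`, allowing only one flush at a time.
    // Returns true if this call ran `do_flush`, false if there was
    // nothing new to flush or another flush already covered it.
    bool flush(FileNode& node, const std::function<void()>& do_flush) {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::int64_t old_mod_counter = node.modification_counter;

        // Assumption: skip entirely when nothing was modified since
        // the last flush.
        if (node.flush_counter >= old_mod_counter) {
            return false;
        }
        // Same idea as the retry loop: wait while a flush is pending,
        // and skip if that flush already covered our modifications.
        while (node.n_pending_flushes > 0) {
            cv_.wait(lock);
            if (node.flush_counter >= old_mod_counter) {
                return false;
            }
        }
        ++node.n_pending_flushes;
        lock.unlock();  // do the slow I/O without holding the mutex
        do_flush();
        lock.lock();
        --node.n_pending_flushes;
        assert(node.n_pending_flushes == 0);
        if (node.flush_counter < old_mod_counter) {
            node.flush_counter = old_mod_counter;
        }
        cv_.notify_all();  // wake threads waiting in the loop above
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

The design point worth noticing is that the mutex is not held during the flush itself. The n_pending_flushes counter is what keeps a second thread out, which is why a slow fsync on one file makes other flushers of the same file wait.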

The logic above applies to log files in the same way. The tablespace that holds the log files is the log space. Log files are additionally protected by log_sys->mutex, inside log_write_up_to.

Log file flushing converges in log_write_up_to, which calls fil_flush and finally reaches os_file_flush_func.

When MySQL commits with innodb_flush_log_at_trx_commit=1, it makes two synchronous calls to log_write_up_to: one on the path that flushes the InnoDB log file, and one on the binlog flush path.
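For context on why the value 1 matters here, the sketch below models what the committing thread does with the redo log under each value of innodb_flush_log_at_trx_commit. The mapping follows the documented meaning of the setting (0: write and flush about once per second, 1: write and flush at every commit, 2: write at commit, flush about once per second). CommitLogAction and action_for_setting are hypothetical names, and this is not how InnoDB structures the code.

```cpp
#include <cassert>

// What the committing thread does with the redo log, as a model.
struct CommitLogAction {
    bool write_at_commit;  // copy the log buffer to the log file (OS cache)
    bool fsync_at_commit;  // force the log file to disk
};

// Hypothetical helper: map the setting to the work done at commit time.
CommitLogAction action_for_setting(int flush_log_at_trx_commit) {
    assert(flush_log_at_trx_commit >= 0 && flush_log_at_trx_commit <= 2);
    switch (flush_log_at_trx_commit) {
    case 0:
        return {false, false};  // left to the periodic background flush
    case 2:
        return {true, false};   // written at commit, flushed periodically
    default:
        return {true, true};    // value 1: written and flushed at commit
    }
}
```

Only the value 1 puts an fsync on the commit path itself, which is the case the stacks in this article were captured under.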

If the binlog is turned off, the commit does not take the binlog sync branch in ordered_commit. The two stacks below show the execution path that flushes the InnoDB log file with the binlog off and on.

The pstack output for InnoDB without the binlog:

fsync,os_file_fsync,os_file_flush_func,fil_flush,log_write_up_to,trx_flush_log_if_needed_low,trx_flush_log_if_needed,trx_commit_complete_for_mysql,innobase_commit,ha_commit_low,TC_LOG_DUMMY::commit,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone

The pstack output for the flush with the binlog enabled:

fsync,os_file_fsync,os_file_flush_func,fil_flush,log_write_up_to,innobase_flush_logs,flush_handlerton,plugin_foreach_with_mask,ha_flush_logs,MYSQL_BIN_LOG::process_flush_stage_queue,MYSQL_BIN_LOG::ordered_commit,MYSQL_BIN_LOG::commit,ha_commit_trans,trans_commit_stmt,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone

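Judging from these call paths, log_write_up_to() and log_io_complete() are the functions to focus on. Their flushes are serialized at the file level, and that serialization accounts for most of the response time (RT) of a commit.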

Next, let's look at the callers of log_write_up_to:

1. buf_flush_write_block_low: called during a forced flush to guarantee that the log reaches disk before the data. When dirty pages are flushed, the call chain that reaches this function starts in page_cleaner_do_flush_batch().

2. innobase_flush_logs: called at commit, mainly from the binlog flush branch.

3. trx_flush_log_if_needed_low: called at commit, mainly when the InnoDB log file is flushed.

4. log_buffer_flush_to_disk: flushes the log buffer to disk.

5. log_buffer_sync_in_background: a background thread syncs the log buffer to disk. It is reached once per second from srv_master_thread through srv_sync_log_buffer_in_background().

6. log_flush_margin: called mainly to free up space in the log buffer.

7. log_checkpoint: called mainly when a checkpoint is made. The srv_master thread reaches it once per second through srv_master_do_idle_tasks().
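Not every call in this list ends in an fsync. A caller only pays for a flush when the LSN it asks for is not yet on disk, so a request that an earlier flush already covered returns immediately. The toy model below illustrates that check. ToyLog and its members are hypothetical, the names only echo InnoDB's flushed_to_disk_lsn, and the real function does considerably more (writing the buffer, handling log groups, waiting on events).

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

using lsn_t = std::uint64_t;

// Toy model of a redo log. This is not InnoDB code.
class ToyLog {
public:
    // Append `bytes` of log records and return the new end LSN.
    lsn_t append(lsn_t bytes) {
        std::lock_guard<std::mutex> guard(mutex_);
        lsn_ += bytes;
        return lsn_;
    }

    // Make everything up to `lsn` durable.
    // Returns true if this call had to flush, false if an earlier
    // flush already covered `lsn`.
    bool write_up_to(lsn_t lsn) {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(lsn <= lsn_);  // cannot flush log that was never written
        if (flushed_to_disk_lsn_ >= lsn) {
            return false;  // already durable, no fsync needed
        }
        // A real implementation would write the buffer and fsync here.
        flushed_to_disk_lsn_ = lsn_;  // one flush covers all buffered log
        ++flush_count_;
        return true;
    }

    int flush_count() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return flush_count_;
    }

private:
    mutable std::mutex mutex_;
    lsn_t lsn_ = 0;
    lsn_t flushed_to_disk_lsn_ = 0;
    int   flush_count_ = 0;
};
```

In this model, two transactions that appended at LSN 100 and 150 need only one flush if the second one gets there first. The first caller then returns without touching the disk.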

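The callers of log_io_complete():

1. fil_aio_wait: when the AIO it waited for is a log I/O, it calls this function.

From the analysis above, the heaviest impact on RT comes from contention on log_sys->mutex caused by flushing dirty pages. In addition, log_buffer_sync_in_background and log_checkpoint are both called once per second by the background srv_master_thread.

These two functions do not necessarily execute fil_flush, however, so they are not the main cause. With gdb attached, the stacks that contend in fil_flush look roughly like this: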

    Breakpoint 13, fil_flush (space_id=4294967280, from=FLUSH_FROM_LOG_WRITE_UP_TO) at /u01/mysql-5.6/storage/innobase/fil/fil0fil.cc:6478
    6478 {
    (gdb) bt
    #0 fil_flush (space_id=4294967280, from=FLUSH_FROM_LOG_WRITE_UP_TO) at /u01/mysql-5.6/storage/innobase/fil/fil0fil.cc:6478
    #1 0x0000000000c890e5 in log_write_up_to (lsn=, wait=, flush_to_disk=1, caller=) at /u01/mysql-5.6/storage/innobase/log/log0log.cc:1674
    #2 0x0000000000d79d26 in buf_flush_write_block_low (sync=false, flush_type=BUF_FLUSH_LIST, bpage=0x7fd706332330) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:902
    #3 buf_flush_page (buf_pool=, bpage=0x7fd706332330, flush_type=BUF_FLUSH_LIST, sync=) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:1061
    #4 0x0000000000d7a43e in buf_flush_try_neighbors (space=0, offset=offset@entry=5, flush_type=flush_type@entry=BUF_FLUSH_LIST, n_flushed=n_flushed@entry=1, n_to_flush=n_to_flush@entry=250) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:1271
    #5 0x0000000000d7b1b1 in buf_flush_page_and_try_neighbors (flush_type=BUF_FLUSH_LIST, count=, n_to_flush=250, bpage=) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:1355
    #6 buf_do_flush_list_batch (buf_pool=buf_pool@entry=0x1f817d8, min_n=min_n@entry=250, lsn_limit=lsn_limit@entry=18446744073709551615) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:1623
    #7 0x0000000000d7b308 in buf_flush_batch (flush_type=BUF_FLUSH_LIST, lsn_limit=18446744073709551615, min_n=, buf_pool=0x1f817d8) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:1693
    #8 buf_flush_list (min_n=, n_processed=n_processed@entry=0x7fc6dd7f9bd8, lsn_limit=18446744073709551615) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:1939
    #9 0x0000000000d7c79b in page_cleaner_do_flush_batch (lsn_limit=18446744073709551615, n_to_flush=) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:2216
    #10 buf_flush_page_cleaner_thread (arg=) at /u01/mysql-5.6/storage/innobase/buf/buf0flu.cc:2588
    #11 0x00007fd950ef9dc5 in start_thread () from /lib64/libpthread.so.0
    #12 0x00007fd94ef4a28d in clone () from /lib64/libc.so.6

    Breakpoint 5, fil_flush (space_id=4294967280, from=FLUSH_FROM_LOG_IO_COMPLETE) at /u01/mysql-5.6/storage/innobase/fil/fil0fil.cc:6478
    6478 {
    (gdb) bt
    #0 fil_flush (space_id=4294967280, from=FLUSH_FROM_LOG_IO_COMPLETE) at /u01/mysql-5.6/storage/innobase/fil/fil0fil.cc:6478
    #1 0x0000000000c86c78 in log_io_complete (group=) at /u01/mysql-5.6/storage/innobase/log/log0log.cc:1239
    #2 0x0000000000db5a4b in fil_aio_wait (segment=segment@entry=1) at /u01/mysql-5.6/storage/innobase/fil/fil0fil.cc:6463
    #3 0x0000000000d09ba0 in io_handler_thread (arg=) at /u01/mysql-5.6/storage/innobase/srv/srv0start.cc:498
    #4 0x00007fb818efddc5 in start_thread () from /lib64/libpthread.so.0
    #5 0x00007fb816f4e28d in clone () from /lib64/libc.so.6
    1: *node = {space = 0x30aa218, name = 0x30aa948 "/mnt/mysql-redo/my3308/data/ib_logfile0", open = 1, handle = 10, sync_event = 0x30aa980, is_raw_disk = 0, size = 262144, n_pending = 0, n_pending_flushes = 0, being_extended = 0, modification_counter = 49, flush_counter = 41, flush_size = 262144, chain = {prev = 0x0, next = 0x30aaa78}, LRU = {prev = 0x0, next = 0x0}, magic_n = 89389}
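Two of the large decimal numbers in these traces are easier to read in another form. The space_id 4294967280 is 0xFFFFFFF0, which, as far as I recall, is the value MySQL 5.6 defines as SRV_LOG_SPACE_FIRST_ID, so both breakpoints are flushes of the log space. The lsn_limit 18446744073709551615 is the largest 64-bit unsigned value, meaning no LSN limit. The short check below verifies the arithmetic. The constant names and has_unflushed_writes are hypothetical, and reading the two counters in the node dump this way is an interpretation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Values copied from the gdb output above.
constexpr std::uint32_t kTraceSpaceId  = 4294967280u;
constexpr std::uint64_t kTraceLsnLimit = 18446744073709551615ull;

static_assert(kTraceSpaceId == 0xFFFFFFF0u,
              "space_id in the trace is 0xFFFFFFF0");
static_assert(kTraceLsnLimit == std::numeric_limits<std::uint64_t>::max(),
              "lsn_limit in the trace is the maximum 64-bit value");

// Hypothetical helper: a node whose modification_counter is ahead of
// its flush_counter has writes that no completed flush has covered.
constexpr bool has_unflushed_writes(std::int64_t modification_counter,
                                    std::int64_t flush_counter) {
    return modification_counter > flush_counter;
}
```

With the counters from the node dump, has_unflushed_writes(49, 41) is true, which is consistent with fil_flush being entered for ib_logfile0 at that moment.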

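To summarize: a commit performs two fsyncs, and the page cleaner adds one more through log_write_up_to(). On top of that, when fil_aio_wait completes a log I/O, it performs a log_io_complete().

All four of these fsyncs affect user-visible RT. The two at commit cannot be avoided, and for the other two the most that can be done is to adjust how often they happen. Could the way fsync is performed be changed instead? That question is left for the reader.

In this respect InnoDB compares poorly with Oracle. As I recall, Oracle does not force the flush. It only sets a flag, and the flush still follows the redo log's own policy. Oracle's implementation appears to have taken this problem into account.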

