溫馨提示×

您好,登錄后才能下訂單哦!

密碼登錄×
登錄注冊(cè)×
其他方式登錄
點(diǎn)擊 登錄注冊(cè) 即表示同意《億速云用戶服務(wù)條款》

Latch導(dǎo)致MySQL Crash

發(fā)布時(shí)間:2020-08-09 21:04:13 來源:ITPUB博客 閱讀:208 作者:沃趣科技 欄目:MySQL數(shù)據(jù)庫



沃趣科技高級(jí)數(shù)據(jù)庫專家  董 紅 禹

一、問題概述

最近我們遇到一個(gè)MySQL的問題,分析后很有代表意義,特地寫出來供大家參考。 
出現(xiàn)問題是,數(shù)據(jù)庫先是被置為只讀,然后過了一段時(shí)間,MySQL直接Crash掉了 
發(fā)生Crash時(shí)MySQL的error日志中打印了以下內(nèi)容:

		

  1. ----------
  2. SEMAPHORES
  3. ----------
  4. OS WAIT ARRAY INFO: reservation count 1246555
  5. --Thread 140363572082432 has waited at row0upd.cc line 2354 for 253.00 seconds the semaphore:
  6. X-lock (wait_ex) on RW-latch at 0x7fa949340740 created in file buf0buf.cc line 1069
  7. a writer (thread id 140363572082432) has reserved it in mode wait exclusive
  8. number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
  9. Last time read locked in file btr0sea.cc line 931
  10. Last time write locked in file /export/home/pb2/build/sb_0-17068951-1447697721.44/mysql-5.6.28/storage/innobase/row/row0upd.cc line 2354

根據(jù)日志中我們可以看到,線程140363572082432要對(duì)記錄上一個(gè)X鎖,但是等待0x7fa949340740線程的RW-latch的釋放

我們?cè)谙蛳驴床樵兊饺缦滦畔?涉及到用戶信息 謂詞就用xxx代替):

		

  1. 173 lock struct(s), heap size 30248, 7925 row lock(s), undo log entries 7924
  2. MySQL thread id 5709783, OS thread handle 0x7fa8f0da7700, query id 92213034 10.23.163.54 citicqyh updating
  3. update TB_DEPARTMENT_INFO set TOTAL_USER=1 where ID='ac84f17e-82d3-4519-a1da-0d5a5a835d44'
  4. ---TRANSACTION 53065242, ACTIVE 313 sec fetching rows, thread declared inside InnoDB 2081
  5. mysql tables in use 2, locked 0
  6. MySQL thread id 5428690, OS thread handle 0x7fa8f0136700, query id 92213061 10.23.163.55 citicqyh Sending data
  7. SELECT COUNT(DISTINCT r.user_id) FROM TB_DEPARTMENT_INFO d left join tb_qy_user_department_ref r on r.department_id = d.id WHERE d.org_id = 'xxx' AND (d.dept_full_name LIKE 'xxx%' or d.dept_full_name = 'xxx
  8. Trx read view will not see trx with id >= xxx, sees < xxx
  9. ......還有很多select語句省略

根據(jù)上面信息我們?nèi)?shù)據(jù)庫中查看了這些select語句,發(fā)現(xiàn)執(zhí)行計(jì)劃都是全表掃描。 
首先數(shù)據(jù)庫變成了只讀,最后數(shù)據(jù)庫crash了,crash輸出的信息如下:

		

  1. ----------------------------
  2. END OF INNODB MONITOR OUTPUT
  3. ============================
  4. InnoDB: ###### Diagnostic info printed to the standard error stream
  5. InnoDB: Error: semaphore wait has lasted > 600 seconds
  6. InnoDB: We intentionally crash the server, because it appears to be hung.
  7. 7fa8f9580700 InnoDB: Assertion failure in thread 140363714529024 in file srv0srv.cc line 1754
  8. InnoDB: We intentionally generate a memory trap.
  9. InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
  10. InnoDB: If you get repeated assertion failures or crashes, even
  11. InnoDB: immediately after the mysqld startup, there may be
  12. InnoDB: corruption in the InnoDB tablespace. Please refer to
  13. InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
  14. InnoDB: about forcing recovery.

InnoDB: Error: semaphore wait has lasted > 600 seconds 提示600秒沒有響應(yīng) 數(shù)據(jù)庫選擇了Crash 強(qiáng)制重啟

從報(bào)錯(cuò)信息來看:

  1. update語句需要在記錄上面加X鎖,但是必須等待RW-Latch的釋放
  2. 由于有大量的select語句是全表掃描,一直占用latch沒有釋放,update遲遲競(jìng)爭(zhēng)不到RW-latch
  3. Innodb 的Diagnostic線程檢查到RW-Latch等待超過了600秒還沒有返回,認(rèn)為系統(tǒng)出現(xiàn)了嚴(yán)重問題,于是觸發(fā)了MySQL服務(wù)的Crash。

二、進(jìn)一步分析

這里首先需要補(bǔ)充一下Latch的概念:Latch在MySQL中是用于保護(hù)高速緩沖區(qū)中共享數(shù)據(jù)的,舉個(gè)例子: 
當(dāng)我們執(zhí)行select時(shí),數(shù)據(jù)是緩存在buffer pool中的,多個(gè)線程并發(fā)訪問或者修改這個(gè)數(shù)據(jù)必然需要一個(gè)并發(fā)控制機(jī)制,這個(gè)就是Latch

大家知道,數(shù)據(jù)庫要訪問的數(shù)據(jù)都必須先存在緩存中,而緩存一般比磁盤空間要小,數(shù)據(jù)緩存使用hash表來記錄數(shù)據(jù)頁是否在內(nèi)存中。在Oracle中的并發(fā)控制比較精細(xì):首先會(huì)對(duì)hash桶加latch,并根據(jù)hash桶查找對(duì)應(yīng)的數(shù)據(jù)并加上pin,然后釋放Latch。而MySQL相對(duì)沒有控制得這么精細(xì),對(duì)應(yīng)的RW-Latch在errlog中說的很清楚,該RW-Latch是在buf0buf.cc的1069行創(chuàng)建的 
RW-latch at 0x7fa949340740 created in file buf0buf.cc line 1069

對(duì)應(yīng)的代碼摘錄如下:

		

  1. rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);

跟蹤源碼,知道這個(gè)Latch是MySQL在數(shù)據(jù)庫啟動(dòng),初始化 innodb_buffer_pool時(shí),將Latch創(chuàng)建好的。對(duì)應(yīng)的函數(shù)調(diào)用過·程:

		

  1. buf_pool_init_instance()->buf_chunk_init()->buf_block_init()

正是由于這個(gè)RW-Latch被長(zhǎng)時(shí)間占用了,其他的線程一直競(jìng)爭(zhēng)不到,才導(dǎo)致了這個(gè)問題

三、修復(fù)建議

這類問題的發(fā)生多數(shù)都是因?yàn)镾QL寫的不好,在表上面進(jìn)行了大量的全表掃描占用了大量的Latch,解決方案就是避免SQL長(zhǎng)時(shí)間占用latch:

  1. 修改select查詢避免全表掃描,避免latch長(zhǎng)期被占用
  2. 適當(dāng)?shù)募铀饕宻elect執(zhí)行更快,也避免一個(gè)select鎖的數(shù)據(jù)更少
  3. 適當(dāng)加大buffer pool instance,每個(gè)buffer pool都有自己獨(dú)立的Latch,避免latch競(jìng)爭(zhēng)。



向AI問一下細(xì)節(jié)

免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。

AI