Quick Emergency Fixes for Major Production MySQL Incidents

Published: 2020-07-16 04:28:27  Source: Internet  Views: 20983  Author: 拎壺沖沖沖  Category: MySQL Database

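To save costs, many companies have no dedicated DBA, so developers or operations staff end up running the databases. Yet the database is the most critical component: once a company reaches a certain size, a single careless database incident can set it back years. A company should therefore establish its own database practices and a sound database architecture while it is still small, so that operators can later concentrate on tuning and performance instead of reworking the overall architecture.
Below I summarize three database incidents and their fixes for operators to use in an emergency. Treat them as analogies: error types vary endlessly, but the troubleshooting approach stays the same.
Errors: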

1. InnoDB: Error: Table "mysql"."innodb_table_stats" not found.
2. Error 'Cannot add or update a child row: a foreign key constraint fails'
3. SQL_ERROR 1032

With the problems identified, let's work through them one by one.
1. InnoDB: Error: Table "mysql"."innodb_table_stats" not found.

The MySQL 5.6.30 error log shows the following errors and warnings:

2016-05-27 12:25:27 7fabf86f7700 InnoDB: Error: Table "mysql"."innodb_table_stats" not found.
2016-05-27 12:25:27 7fabf86f7700 InnoDB: Error: Fetch of persistent statistics requested for table "hj_web"."wechat_res" but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.

2016-05-27 14:03:52 28585 [Warning] InnoDB: Cannot open table mysql/slave_master_info from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.

2016-05-27 14:03:52 28585 [Warning] InnoDB: Cannot open table mysql/slave_relay_log_info from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.

2016-05-27 14:03:52 28585 [Warning] InnoDB: Cannot open table mysql/slave_worker_info from the internal data dictionary of InnoDB though the .frm file for the table exists. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html for how you can resolve the problem.

These five tables were indeed newly introduced in MySQL 5.6:

innodb_index_stats,
innodb_table_stats,
slave_master_info,
slave_relay_log_info,
slave_worker_info

Fix:
Log in to the database, switch to the mysql schema, and run the following SQL to drop the five tables.
Be sure to use drop table if exists:

mysql> use mysql;
mysql> drop table if exists innodb_index_stats;
mysql> drop table if exists innodb_table_stats;
mysql> drop table if exists slave_master_info;
mysql> drop table if exists slave_relay_log_info;
mysql> drop table if exists slave_worker_info;

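When this is done, run show tables and check whether the number of tables is lower than before the drop; if it is, the drop succeeded.

After that step, stop the database, go to the directory that holds the database data files, and delete the .ibd files belonging to the five tables above, as shown below: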

# /etc/init.d/mysqld stop
# cd /data/mysql/data/mysql/
# ls -l *.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:17 innodb_index_stats.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:17 innodb_table_stats.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:14 slave_master_info.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:14 slave_relay_log_info.ibd
-rw-rw---- 1 mysql mysql 98304 May 27 14:14 slave_worker_info.ibd
# /bin/rm -rf *.ibd

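Restart the database, switch to the mysql schema, and recreate the tables that were dropped above. The mysql_system_tables.sql file used here was backed up from another MySQL instance with: mysqldump -uUSER -pPASSWORD mysql > mysql_system_tables.sql. Note that such a dump contains every table in the mysql schema, including the account and privilege tables, so only load a dump from an instance whose accounts you are willing to adopt.

# /etc/init.d/mysqld start
mysql> use mysql;
mysql> source /data/mysql/share/mysql_system_tables.sql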
mysql> show tables;
+---------------------------+
| Tables_in_mysql |
+---------------------------+
| columns_priv |
| db |
| event |
| func |
| general_log |
| help_category |
| help_keyword |
| help_relation |
| help_topic |
| innodb_index_stats |
| innodb_table_stats |
| ndb_binlog_index |
| plugin |
| proc |
| procs_priv |
| proxies_priv |
| servers |
| slave_master_info |
| slave_relay_log_info |
| slave_worker_info |
| slow_log |
| tables_priv |
| time_zone |
| time_zone_leap_second |
| time_zone_name |
| time_zone_transition |
| time_zone_transition_type |
| user |
+---------------------------+
28 rows in set (0.00 sec)

This shows the tables are back to normal. Check the MySQL error log again and the messages about these five tables no longer appear.
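The final log check can be automated. The following is a minimal Python sketch, assuming the error log text has already been read into a string; the function name tables_mentioned and the way the log is loaded are illustrative assumptions, not part of the original procedure. It reports which of the five system tables still appear in InnoDB messages.

```python
import re

# The five system tables named in this section.
SYSTEM_TABLES = (
    "innodb_index_stats",
    "innodb_table_stats",
    "slave_master_info",
    "slave_relay_log_info",
    "slave_worker_info",
)


def tables_mentioned(log_text):
    """Return the system tables that appear in InnoDB lines of the log text."""
    found = set()
    for line in log_text.splitlines():
        if "InnoDB:" not in line:
            continue
        for table in SYSTEM_TABLES:
            # The log uses three spellings: "mysql"."t", mysql.t and mysql/t.
            if re.search(r'mysql"?[./]"?' + table + r"\b", line):
                found.add(table)
    return sorted(found)
```

Run it on the portion of the log written after the restart; an empty list suggests the messages are gone.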

2. Fixing Error 'Cannot add or update a child row: a foreign key constraint fails'

Early one morning a slave suddenly reported a failure: its SQL thread had stopped.
Check the slave status:

mysql> show slave status\G
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin.026023
Read_Master_Log_Pos: 230415889
Relay_Log_File: relay-bin.058946
Relay_Log_Pos: 54632056
Relay_Master_Log_File: mysql-bin.026002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1452
Last_Error: Error 'Cannot add or update a child row: a foreign key constraint fails (zabbix.trigger_discovery, CONSTRAINT c_trigger_discovery_2 FOREIGN KEY (parent_triggerid) REFERENCES triggers (triggerid) ON DELETE CASCADE)' on query. Default database: 'zabbix'. Query: 'insert into trigger_discovery (triggerdiscoveryid,triggerid,parent_triggerid,name) values (1677,26249,22532,'Free inodes is less than 20% on volume {#FSNAME}'), (1678,26250,22532,'Free inodes is less than 20% on volume {#FSNAME}'), (1679,26251,22532,'Free inodes is less than 20% on volume {#FSNAME}')'
Exec_Master_Log_Pos: 54631910

Focus on the error message to locate the problem. The error is: Cannot add or update a child row: a foreign key constraint fails, and the foreign key involved is c_trigger_discovery_2.

So how is this foreign key defined?

The error message spells that out too:

trigger_discovery, CONSTRAINT c_trigger_discovery_2 FOREIGN KEY (parent_triggerid) REFERENCES triggers (triggerid) ON DELETE CASCADE

That makes it clear: column parent_triggerid in table trigger_discovery is linked by a foreign key to column triggerid in table triggers, and the insert here violated that constraint.
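Reading the constraint out of Last_Error by eye works, but it can also be scripted. The following is a rough Python sketch, assuming the message follows the shape shown in this incident (with or without backticks around identifiers); the names parse_fk_error and parent_lookup_sql are hypothetical helpers introduced for illustration.

```python
import re

# Matches the "(db.child, CONSTRAINT name FOREIGN KEY (col) REFERENCES parent (col)"
# part of error 1452; backticks around identifiers are optional.
FK_ERROR = re.compile(
    r"foreign key constraint fails \("
    r"`?(?P<child_db>\w+)`?\.`?(?P<child_table>\w+)`?, "
    r"CONSTRAINT `?(?P<constraint>\w+)`? "
    r"FOREIGN KEY \(`?(?P<child_col>\w+)`?\) "
    r"REFERENCES `?(?P<parent_table>\w+)`? \(`?(?P<parent_col>\w+)`?\)"
)


def parse_fk_error(last_error):
    """Return the tables and columns named in the error, or None if it does not match."""
    match = FK_ERROR.search(last_error)
    return match.groupdict() if match else None


def parent_lookup_sql(info, value):
    """Build a query that checks whether the referenced parent row exists."""
    # int() keeps this sketch limited to numeric keys such as triggerid.
    return "SELECT * FROM {parent_table} WHERE {parent_col} = {v};".format(
        v=int(value), **info
    )
```

Running the generated query on both master and slave shows quickly whether the parent row is missing on one side.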
So why did it fail?

Read further into the error; the failing statement begins here:

insert into trigger_discovery (triggerdiscoveryid,triggerid,parent_triggerid,name) values (1677,26249,22532,'Free inodes is less than 20% on volume {#FSNAME}')

The foreign key column parent_triggerid has the value 22532. Could something be wrong with that value in table triggers?
Let's look it up in table triggers.

On the slave:

mysql> select * from triggers where triggerid=22532;
Empty set (0.00 sec)

On the master:

mysql> select * from triggers where triggerid=22532;
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
| triggerid | expression | description | url | status | value | priority | lastchange | comments | error | templateid | type | value_flags | flags |
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
| 22532 | {23251}<20 | Free inodes is less than 20% on volume {#FSNAME} | | 0 | 0 | 2 | 0 | | | 13272 | 0 | 1 | 2 |
+-----------+------------+--------------------------------------------------+-----+--------+-------+----------+------------+----------+-------+------------+------+-------------+-------+
1 row in set (0.00 sec)

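Sure enough, the slave has no row for this value while the master does. The cause is master/slave inconsistency: the row is missing on the slave, so the insert succeeded on the master, but when the event was replicated, the foreign key constraint on the slave rejected the insert and the SQL thread stopped.

The cause is found, so how do we fix it?

First, to get the slave running again as quickly as possible, skip this error: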

mysql>SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; # skip one event group (one transaction)
mysql>start slave;

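Next comes the master/slave data inconsistency itself. Use pt-table-checksum to find the rows that differ, then synchronize them. The steps are as follows.

On the master, run:

mysql>GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE,CREATE,DELETE,INSERT,UPDATE ON *.* TO 'USER'@'MASTER_HOST' identified by 'PASSWORD';

Note: this creates the user; all of these privileges are required, otherwise the tool reports errors.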

shell> ./pt-table-checksum --host='master_host' --user='user' --password='password' --port='port' --databases=zabbix --ignore-tables=ignore_table --recursion-method=processlist

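Notes: (1) Many tables are involved, and inspection showed that a lot of them are linked by a tangled web of foreign keys. Because these are monitoring tables, losing a little data does no real harm, so the large tables that have no foreign keys were excluded with the --ignore-tables option and the remaining tables were compared. If only a few tables are involved, name them directly with --tables.

(2) If recursion-method is not set, the tool reports: Diffs cannot be detected because no slaves were found. Its values include processlist, hosts, dsn and none, which decide whether slaves are discovered through show full processlist, through show slave hosts, or from slave information supplied directly. Detailed usage is covered in a separate note introducing pt-table-checksum.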

shell>./pt-table-sync --print --replicate=percona.checksums h=master_host,u=user,p=password,P=port h=slave_host,u=user,p=password,P=port --recursion-method=processlist >pt.log

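Note: it is best to use --print rather than --execute directly, because a mistake would make matters even worse. Print the statements that would be executed, then run them on the slave.

Copy pt.log to the slave and execute it, then rerun the consistency check on the master. If inconsistent data remains, remember to log in to MySQL and empty the checksums table, then check and sync again until no inconsistencies are left.
Of course, if master and slave keep drifting apart, first find out what is causing the inconsistency; fixing the root cause is what really counts.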

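3. Fixing SQL_ERROR 1032

Cause:
In a master-master replication test environment, the application did not follow the rule of writing to only one node at a time, so a row was deleted on node A while node B was updating the same row.
Because replication is asynchronous and the network adds latency, B's update event reached A and was applied there, but A could no longer find the row, so the SQL_THREAD failed with error 1032 and replication stopped.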

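About the error:
In MySQL replication, error 1032 generally means that the row to be changed does not exist, so the SQL_THREAD cannot apply the event it read from the log and replication fails (for example, an update or delete of a row that has already been deleted).
Error 1032 in itself does little harm to data consistency; its biggest impact is that replication fails and stops.
If master-master (or master-slave) replication fails, investigate and start fixing it immediately, because without replication reads return inconsistent data. Restore replication first to minimize the impact on the business, then analyze why the nodes diverged, repair the data manually or automatically, and run a pt-table-checksum consistency check.
Applications today commonly use master-master replication. Because updates are asynchronous, conflicting updates can occur and easily trigger SQL ERROR 1032. This should be solved on the application side by making sure only one node is updated at any given time, much like single-node writes. Our solution is a low-level database access library that every operation with a potential update conflict must call. Its configuration file lists two database nodes, A and B; the library always updates A, and only updates B if A is unavailable.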
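The single-writer access library described above might look roughly like the following Python sketch. It is not the author's actual library: the class name SingleWriterRouter and the connect_a / connect_b callables are assumptions made for illustration, and the connections are assumed to follow the usual DB-API shape (cursor, commit, close).

```python
class SingleWriterRouter:
    """Send every write to node A; fall back to node B only if A is unreachable."""

    def __init__(self, connect_a, connect_b):
        # connect_a / connect_b are hypothetical zero-argument callables that
        # return a DB-API style connection, for example from a MySQL driver.
        self._connectors = [("A", connect_a), ("B", connect_b)]

    def execute_write(self, sql, params=()):
        """Run one write statement and return the name of the node that took it."""
        last_error = None
        for name, connect in self._connectors:
            try:
                conn = connect()
            except Exception as exc:
                # Node unreachable: remember why and try the next one.
                last_error = exc
                continue
            try:
                cursor = conn.cursor()
                cursor.execute(sql, params)
                conn.commit()
                return name
            finally:
                conn.close()
        raise RuntimeError("no database node available") from last_error
```

Only connection failures trigger the fallback; an SQL error on A is raised to the caller rather than retried on B, since retrying a failed statement on the other node could itself create a conflict.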
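In addition, where strong data consistency is required, for example when money is involved, PXC (strong consistency, truly synchronous replication) is recommended.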

Fix:
The environment is MySQL 5.6.30 with binlog format ROW.

show slave status\G reports the following error:
Slave_SQL_Running: NO
Last_SQL_Errno: 1032
Last_SQL_Error: Worker 3 failed executing transaction '' at master log mysql-bin.000003, end_log_pos 440267874;
　　　　　　　　 Could not execute Delete_rows event on table db_test.tbuservcbgolog; Can't find record in 'tbuservcbgolog', Error_code: 1032;
　　　　　　　　 handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000003, end_log_pos 440267874

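This shows that the SQL_THREAD failed with error number 1032 while applying an event that deletes a row from table db_test.tbuservcbgolog, because that row does not exist. In the master's binlog the event is located at mysql-bin.000003, end_log_pos 440267874. (It can also be looked up in the slave's relay log; see the note at the end for how.)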
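Pulling the binlog coordinates out of Last_SQL_Error can be scripted as well. The following is a small Python sketch, assuming the message has the shape shown above; locate_failed_event is a hypothetical helper name, not a MySQL facility.

```python
import re

# "Could not execute Delete_rows event on table db.tbl;"
FAILED_EVENT = re.compile(
    r"Could not execute (?P<event>\w+) event on table (?P<table>[\w.]+);"
)
# "the event's master log mysql-bin.000003, end_log_pos 440267874"
EVENT_POS = re.compile(
    r"the event's master log (?P<log_file>[^,\s]+), end_log_pos (?P<end_log_pos>\d+)"
)


def locate_failed_event(last_sql_error):
    """Return event type, table, binlog file and end position, or None."""
    event = FAILED_EVENT.search(last_sql_error)
    position = EVENT_POS.search(last_sql_error)
    if not event or not position:
        return None
    return {
        "event": event.group("event"),
        "table": event.group("table"),
        "log_file": position.group("log_file"),
        "end_log_pos": int(position.group("end_log_pos")),
    }
```

The returned file name and position are the values to search for with mysqlbinlog on the master.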

Method 1: skip the failing event
Skip this one failing event so that replication returns to normal (or skip N events, one at a time).

　　stop slave;
　　set global sql_slave_skip_counter=1;
　　start slave;

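Method 2: skip all 1032 errors
Edit my.cnf and add the following under the replication settings:

slave-skip-errors = 1032
Then restart the database and run start slave.
Note: because this requires a database restart it is not recommended, unless there are too many failing events.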

Method 3: restore the deleted row
Using the error message, find the event's SQL with mysqlbinlog and manually execute its inverse, for example turning the delete into an insert.
In this example the event is located in the master's binlog at mysql-bin.000003, end_log_pos 440267874.
1) Use mysqlbinlog to find the event at 440267874:
/usr/local/mysql-5.6.30/bin/mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000003 |grep -A 20 '440267874'
or: /usr/local/mysql-5.6.30/bin/mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000003 --stop-position=440267874 | tail -20
or: /usr/local/mysql-5.6.30/bin/mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000003 > decode.log
(or add the -d, --database=name option to filter further)
#160923 20:01:27 server id 1223307 end_log_pos 440267874 CRC32 0x134b2cbc Delete_rows: table id 319 flags: STMT_END_F
### DELETE FROM db_99ducj.tbuservcbgolog
### WHERE
###   @1=10561502 /* INT meta=0 nullable=0 is_null=0 */
###   @2=1683955 /* INT meta=0 nullable=0 is_null=0 */
###   @3=90003 /* INT meta=0 nullable=0 is_null=0 */
###   @4=0 /* INT meta=0 nullable=0 is_null=0 */
###   @5='2016-09-23 17:02:24' /* DATETIME(0) meta=0 nullable=1 is_null=0 */
###   @6=NULL /* DATETIME(0) meta=0 nullable=1 is_null=1 */
# at 440267874

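The output above is the search result. The statement is: delete from db_99ducj.tbuservcbgolog where @1=10561502 and @2=1683955 ...
Here @1, @2, @3 ... stand for the columns of table tbuservcbgolog in order; substitute the real column names.
Reverse this SQL by turning the delete into an insert, run the insert manually on the slave, and then restart the slave threads.
Note: finding the event SQL through the relay log: http://www.tuicool.com/articles/6RvUnqV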
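Turning the decoded delete into an insert by hand is error-prone when a table has many columns. The following is a minimal Python sketch of that reversal, assuming a single-row Delete_rows event decoded with -vv (values followed by /* ... */ comments) and a column list that you supply yourself, for example from show create table. The function name delete_event_to_insert is hypothetical, and values that themselves contain comment markers are not handled.

```python
import re

TABLE_LINE = re.compile(r"^###\s*DELETE FROM\s+(\S+)")
# "###   @1=10561502 /* INT meta=0 ... */" -> column index and literal value
ROW_VALUE = re.compile(r"^###\s*@(\d+)=(.*?)(?:\s*/\*.*\*/)?\s*$")


def delete_event_to_insert(decoded_lines, column_names):
    """Build the INSERT that restores the row removed by one decoded delete."""
    table = None
    values = {}
    for line in decoded_lines:
        match = TABLE_LINE.match(line)
        if match:
            table = match.group(1)
            continue
        match = ROW_VALUE.match(line)
        if match:
            values[int(match.group(1))] = match.group(2)
    expected = set(range(1, len(column_names) + 1))
    if table is None or set(values) != expected:
        raise ValueError("event does not match the supplied column list")
    columns = ", ".join(column_names)
    literals = ", ".join(values[i] for i in sorted(values))
    return "INSERT INTO {} ({}) VALUES ({});".format(table, columns, literals)
```

Review the generated statement before running it on the slave; the column order must match the table definition exactly.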
