溫馨提示×

您好,登錄后才能下訂單哦!

密碼登錄×
登錄注冊(cè)×
其他方式登錄
點(diǎn)擊 登錄注冊(cè) 即表示同意《億速云用戶服務(wù)條款》

galera mysql cluster 故障節(jié)點(diǎn)再次接入集群遇到問(wèn)題該怎么辦

發(fā)布時(shí)間:2021-11-16 11:10:17 來(lái)源:億速云 閱讀:440 作者:柒染 欄目:MySQL數(shù)據(jù)庫(kù)

這期內(nèi)容當(dāng)中小編將會(huì)給大家?guī)?lái)有關(guān)galera  mysql  cluster 故障節(jié)點(diǎn)再次接入集群遇到問(wèn)題該怎么辦,文章內(nèi)容豐富且以專業(yè)的角度為大家分析和敘述,閱讀完這篇文章希望大家可以有所收獲。

galera cluster  是mysql 的多主集群. 
我們目前搭建了3個(gè)節(jié)點(diǎn)的測(cè)試集群. 
第一輪測(cè)試的時(shí)候, 發(fā)現(xiàn)一個(gè)問(wèn)題,節(jié)點(diǎn)故障了, 下線,然后重新加入集群,無(wú)法加入. 

然后直接整個(gè)節(jié)點(diǎn)內(nèi)容 作為一個(gè)新節(jié)點(diǎn)加入, 也是失敗的. 搞了兩天, 頭大了. 失敗告終. 

報(bào)錯(cuò)信息如下: 


170609 16:55:59 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:55:59 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-3/libgalera_smm.so'
170609 16:55:59 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy <info@codership.com> loaded successfully.
170609 16:55:59 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:55:59 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, safe_to_bootsrap: 0
170609 16:55:59 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:55:59 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:55:59 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:55:59 [Note] WSREP: wsrep_sst_grab()
170609 16:55:59 [Note] WSREP: Start replication
170609 16:55:59 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:55:59 [Note] WSREP: protonet asio version 0
170609 16:55:59 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:55:59 [Note] WSREP: backend: asio
170609 16:55:59 [Note] WSREP: gcomm thread scheduling priority set to other:0 
170609 16:55:59 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:55:59 [Note] WSREP: restore pc from disk failed
170609 16:55:59 [Note] WSREP: GMCast version 0
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.11.98:4567
170609 16:55:59 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75 :4567
170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
170609 16:55:59 [Note] WSREP: EVS version 0
170609 16:55:59 [Note] WSREP: gcomm: connecting to group 'mycluster', peer '192.168.11.152:, 192.168.11.98:, 192.168.12.75 :'
170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') connection established to 753e6ee4 tcp://192.168.11.152:4567
170609 16:55:59 [Warning] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') address 'tcp://192.168.11.152:4567' points to own listening address, blacklisting
170609 16:56:02 [Warning] WSREP: no nodes coming from prim view, prim not possible
170609 16:56:02 [Note] WSREP: view(view_id(NON_PRIM,753e6ee4,1) memb {
        753e6ee4,0
} joined {
} left {
} partitioned {
})
170609 16:56:02 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') connection to peer 753e6ee4 with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:56:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50193S), skipping check
170609 16:56:32 [Note] WSREP: view((empty))
170609 16:56:32 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'mycluster' at 'gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 & evs.max_install_timeouts=1': -110 (Connection timed out)
170609 16:56:32 [ERROR] WSREP: gcs connect failed: Connection timed out
170609 16:56:32 [ERROR] WSREP: wsrep::connect(gcomm://192.168.11.152, 192.168.11.98, 192.168.12.75 ? gmcast.segment=0 & evs.max_install_timeouts=1) failed: 7
170609 16:56:32 [ERROR] Aborting

170609 16:56:32 [Note] WSREP: Service disconnected.
170609 16:56:33 [Note] WSREP: Some threads may fail to exit.
170609 16:56:33 [Note] /usr/sbin/mysqld: Shutdown complete



然后 就在也加入不了集群了. 

人都蒙了, 一度懷疑國(guó)內(nèi)的最大的集群是怎么維護(hù)這個(gè)問(wèn)題的? 


刪除所有的測(cè)試vm , 從新安裝os .從新來(lái)過(guò). 


這兩天重新開(kāi)始測(cè)試這個(gè)問(wèn)題.

繼續(xù)重復(fù)測(cè)試這個(gè)案例. 

節(jié)點(diǎn)刪除后,   重現(xiàn)了相同的問(wèn)題. 

幾點(diǎn)不管是清空所有數(shù)據(jù),重新加入, 還是保留原數(shù)據(jù)加入集群. 都是失敗, 報(bào)錯(cuò)信息跟上面是一樣的. 


又無(wú)解了. 

又開(kāi)始郁悶了. 按理說(shuō)不應(yīng)該. 開(kāi)始分析報(bào)錯(cuò)信息. 從信息上了. 似乎總是讀了第一個(gè)節(jié)點(diǎn), 也就是本身這個(gè)節(jié)點(diǎn). 
報(bào)錯(cuò)無(wú)法連接. 然后重復(fù)7次, 然后timeout 退出. 


我們集群有3個(gè)節(jié)點(diǎn), 不應(yīng)該啊, 第一個(gè)無(wú)法連接, 應(yīng)該會(huì)roundrobin  嘗試后面的節(jié)點(diǎn)連接啊. 
但是從日志里,沒(méi)有體現(xiàn)出來(lái)這個(gè)問(wèn)題. 

我突然開(kāi)始懷疑這部門(mén)軟件代碼的設(shè)計(jì)上是不是有問(wèn)題呢? 

源代碼就不用看了, 我們可以修改下配嘛. 

于是我修改了 wsrep_cluster_address 的配置 把第一個(gè)節(jié)點(diǎn)的ip 的位置拿到了最后面. 

然后重新啟動(dòng)數(shù)據(jù)庫(kù), 奇跡發(fā)生了.


170609 16:57:09 [Note] WSREP: Read nil XID from storage engines, skipping position init
170609 16:57:09 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-3/libgalera_smm.so'
170609 16:57:09 [Note] WSREP: wsrep_load(): Galera 3.20(r7e383f7) by Codership Oy <info@codership.com> loaded successfully.
170609 16:57:09 [Note] WSREP: CRC-32C: using hardware acceleration.
170609 16:57:09 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:-1, safe_to_bootsrap: 0
170609 16:57:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300M; gcache.recover = no; gcache.size = 300M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc
170609 16:57:09 [Note] WSREP: GCache history reset: old(51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new(51391c6d-4bff-11e7-a1c3-b797743e8629:824276)
170609 16:57:09 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1
170609 16:57:09 [Note] WSREP: wsrep_sst_grab()
170609 16:57:09 [Note] WSREP: Start replication
170609 16:57:09 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276
170609 16:57:09 [Note] WSREP: protonet asio version 0
170609 16:57:09 [Note] WSREP: Using CRC-32C for message checksums.
170609 16:57:09 [Note] WSREP: backend: asio
170609 16:57:09 [Note] WSREP: gcomm thread scheduling priority set to other:0 
170609 16:57:09 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
170609 16:57:09 [Note] WSREP: restore pc from disk failed
170609 16:57:09 [Note] WSREP: GMCast version 0
170609 16:57:09 [Warning] WSREP: Failed to resolve tcp:// 192.168.12.75:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
170609 16:57:09 [Note] WSREP: EVS version 0
170609 16:57:09 [Note] WSREP: gcomm: connecting to group 'mycluster', peer '192.168.11.98:, 192.168.12.75:,192.168.11.152 :'
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 9f2dfc7e tcp://192.168.11.152:4567
170609 16:57:09 [Warning] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') address 'tcp://192.168.11.152:4567' points to own listening address, blacklisting
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 017c00ff tcp://192.168.11.98:4567
170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.12.75:4567 
170609 16:57:10 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 325d47d6 tcp://192.168.12.75:4567
170609 16:57:10 [Note] WSREP: declaring 017c00ff at tcp://192.168.11.98:4567 stable
170609 16:57:10 [Note] WSREP: declaring 325d47d6 at tcp://192.168.12.75:4567 stable
170609 16:57:10 [Note] WSREP: Node 017c00ff state prim
170609 16:57:10 [Note] WSREP: view(view_id(PRIM,017c00ff,13) memb {
        017c00ff,0
        325d47d6,0
        9f2dfc7e,0
} joined {
} left {
} partitioned {
})
170609 16:57:10 [Note] WSREP: save pc into disk
170609 16:57:10 [Note] WSREP: gcomm: connected
170609 16:57:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
170609 16:57:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
170609 16:57:10 [Note] WSREP: Opened channel 'mycluster'
170609 16:57:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
170609 16:57:10 [Note] WSREP: Waiting for SST to complete.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 0 (11_98)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 1 (12_75)
170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 2 (11_152)
170609 16:57:10 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 12,
        members    = 3/3 (joined/total),
        act_id     = 824276,
        last_appl. = -1,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:57:10 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:57:10 [Note] WSREP: Restored state OPEN -> JOINED (824276)
170609 16:57:10 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3
170609 16:57:10 [Note] WSREP: SST complete, seqno: 824276
170609 16:57:10 [Note] WSREP: Member 2.0 (11_152) synced with group.
170609 16:57:10 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 824276)
170609 16:57:10 [Note] Plugin 'FEDERATED' is disabled.
170609 16:57:10 InnoDB: The InnoDB memory heap is disabled
170609 16:57:10 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
170609 16:57:10 InnoDB: Compressed tables use zlib 1.2.3
170609 16:57:10 InnoDB: Using Linux native AIO
170609 16:57:10 InnoDB: Initializing buffer pool, size = 122.0M
170609 16:57:10 InnoDB: Completed initialization of buffer pool
170609 16:57:10 InnoDB: highest supported file format is Barracuda.
170609 16:57:11  InnoDB: Waiting for the background threads to start
170609 16:57:12 InnoDB: 5.5.54 started; log sequence number 6024720364
170609 16:57:12 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
170609 16:57:12 [Note]   - '0.0.0.0' resolves to '0.0.0.0';
170609 16:57:12 [Note] Server socket created on IP: '0.0.0.0'.
170609 16:57:12 [Note] Event Scheduler: Loaded 0 events
170609 16:57:12 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.54'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL), wsrep_25.19.20170106.aa7e07d
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:12 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:57:12 [Note] WSREP: Assign initial position for certification: 824276, protocol version: 3
170609 16:57:12 [Note] WSREP: Service thread queue flushed.
170609 16:57:12 [Note] WSREP: Synchronized with group, ready for connections
170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:57:13 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection to peer 9f2dfc7e with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S
170609 16:57:13 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') turning message relay requesting off


節(jié)點(diǎn) 順利的連接并加入了集群.  

然后我又測(cè)試了, 把數(shù)據(jù)文件都清空的情況, 也是順利的加入了集群,并自動(dòng)完成了數(shù)據(jù)同步,

從另個(gè)一個(gè)幾點(diǎn)的日志可以看到 數(shù)據(jù)同步的情況: 

170608 12:05:48 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
170608 12:05:48 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
"/var/log/mysqld.log" 744L, 62162C                                                                                112,1         13%
170609 16:42:43 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 4,
        members    = 2/3 (joined/total),
        act_id     = 824275,
        last_appl. = 824274,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629
170609 16:42:43 [Note] WSREP: Flow-control interval: [28, 28]
170609 16:42:43 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824275, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: REPL Protocols: 7 (3, 2)
170609 16:42:43 [Note] WSREP: Assign initial position for certification: 824275, protocol version: 3
170609 16:42:43 [Note] WSREP: Service thread queue flushed.
170609 16:42:43 [Note] WSREP: Member 1.0 (11_152) requested state transfer from '*any*'. Selected 0.0 (11_98)(SYNCED) as donor.
170609 16:42:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 824275)
170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
170609 16:42:43 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '192.168.11.152:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '51391c6d-4bff-11e7-a1c3-b797743e8629:824275''
170609 16:42:43 [Note] WSREP: sst_donor_thread signaled with 0
170609 16:42:43 [Note] WSREP: Flushing tables for SST...
170609 16:42:43 [Note] WSREP: Provider paused at 51391c6d-4bff-11e7-a1c3-b797743e8629:824275 (831018)
170609 16:42:43 [Note] WSREP: Tables flushed.



通過(guò)這一點(diǎn), 也基本上驗(yàn)證了我的猜測(cè).  

節(jié)點(diǎn)在退出集群后, 從新加入的時(shí)候, 如果這個(gè)故障節(jié)點(diǎn)的ip 在自己的配置文件 wsrep_cluster_address 的選項(xiàng)中的第一個(gè)ip .

那么這個(gè)節(jié)點(diǎn)是永遠(yuǎn)都無(wú)法再加入這個(gè)集群了. 

怎么辦呢, 把他的ip 從這個(gè)配置項(xiàng)里面,換一下位置. 這個(gè)問(wèn)題就完美解決了. 

通過(guò)進(jìn)一步的測(cè)試. 如果這個(gè)節(jié)點(diǎn)是master , 通過(guò)--wsrep-new-cluster 啟動(dòng)的節(jié)點(diǎn),如果ip 排在第一位會(huì)有這個(gè)問(wèn)題. 
如果這個(gè)節(jié)點(diǎn) 經(jīng)過(guò)上述的步驟能夠重新加入解群了.  那么這個(gè)節(jié)點(diǎn)應(yīng)該就拿不到這個(gè)master 的角色了. 

這個(gè)時(shí)候, 就不會(huì)發(fā)生上述的問(wèn)題, 即便ip 排在第一個(gè)的位置,也是可以加入集群的. 


這個(gè)應(yīng)該是一個(gè)bug 了. 

再進(jìn)一步驗(yàn)證后, 可以提交bug 記錄了. 

規(guī)避這個(gè)問(wèn)題的方案就是節(jié)點(diǎn)的機(jī)器上的配置  wsrep-cluster-address  的配置選項(xiàng)里, 本機(jī)的ip 不要放在第一位. 

上述就是小編為大家分享的galera  mysql  cluster 故障節(jié)點(diǎn)再次接入集群遇到問(wèn)題該怎么辦了,如果剛好有類(lèi)似的疑惑,不妨參照上述分析進(jìn)行理解。如果想知道更多相關(guān)知識(shí),歡迎關(guān)注億速云行業(yè)資訊頻道。

向AI問(wèn)一下細(xì)節(jié)

免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。

AI