您好,登錄后才能下訂單哦!
Oracle Study之--DataGuard 最大保護(hù)模式故障(ORA-16198)
系統(tǒng)環(huán)境:
操作系統(tǒng):RedHat EL5
Oracle: Oracle 11gR2 (11.2.0.1.0)
故障現(xiàn)象:
Physical Standby在從Maximum Performance轉(zhuǎn)換到Maximum Protection時(shí),出現(xiàn)以下故障:
10:13:06 SYS@ prod1>startup force mount; ORACLE instance started. Total System Global Area 418484224 bytes Fixed Size 1336932 bytes Variable Size 281020828 bytes Database Buffers 130023424 bytes Redo Buffers 6103040 bytes Database mounted.
10:13:30 SYS@ prod1>select name,protection_mode from v$database;
NAME PROTECTION_MODE
--------- --------------------
PROD1 MAXIMUM PROTECTION
Open DataBase失?。?/span>
10:07:04 SYS@ prod1>alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 4612
Session ID: 1 Serial number: 5
查看告警日志:
alter database open
Thu Jun 11 10:07:10 2015
LGWR: STARTING ARCH PROCESSES
Thu Jun 11 10:07:10 2015
ARC0 started with pid=19, OS id=4614
ARC0: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Thu Jun 11 10:07:10 2015
ARC1 started with pid=20, OS id=4616
Thu Jun 11 10:07:10 2015
ARC2 started with pid=21, OS id=4618
ARC1: Archival started
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
ARC1: Becoming the heartbeat ARCH
LGWR: Primary database is in MAXIMUM PROTECTION mode
LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
Thu Jun 11 10:07:11 2015
NSS2 started with pid=18, OS id=4620
Thu Jun 11 10:07:40 2015
ORA-16198: LGWR received timedout error from KSR
Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Error 16198 verifying archivelog destination LOG_ARCHIVE_DEST_2
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Continuing...
LGWR: Minimum of 1 applicable standby database required
Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:
ORA-16072: a minimum of one standby database destination is required
Errors in file /u01/app/oracle/diag/rdbms/bjdb/prod1/trace/prod1_lgwr_4565.trc:
ORA-16072: a minimum of one standby database destination is required
LGWR (ospid: 4565): terminating the instance due to error 16072
Instance terminated by LGWR, pid = 4565
---從日志信息可以看出,所有的歸檔路徑都失敗,本地歸檔及遠(yuǎn)程歸檔均失??!
解決方法:
依據(jù)Oracle官方建議,修改net_timeout值:(主備庫(kù))
10:10:23 SYS@ prod1>alter system set log_archive_dest_2='service=shdb lgwr sync affirm VALID_FOR=(online_logfiles,primary_role) net_timeout=30 DB_UNIQUE_NAME=shdb'; 10:11:25 SYS@ shdb>alter system set log_archive_dest_2='service=bjdb lgwr sync affirm VALID_FOR=(online_logfiles,primary_role) net_timeout=30 DB_UNIQUE_NAME=bjdb';
增加standby redo log:
主庫(kù):(在Maximum Performance下添加Standby redo logs)
10:20:35 SYS@ prod1>select group#,status ,bytes from v$log; GROUP# STATUS BYTES ---------- ---------------- ---------- 1 INACTIVE 52428800 2 CURRENT 52428800 3 INACTIVE 52428800 10:20:54 SYS@ prod1>select member from v$logfile; MEMBER ------------------------------------------------------------------------------------------------------------------------ /u01/app/oracle/oradata/prod1/redo03.log /u01/app/oracle/oradata/prod1/redo02.log /u01/app/oracle/oradata/prod1/redo01.log /u01/app/oracle/oradata/prod1/std_redo01.log /u01/app/oracle/oradata/prod1/std_redo02.log 6 rows selected. 10:21:03 SYS@ prod1>alter database add standby logfile 10:21:25 2 '/u01/app/oracle/oradata/prod1/std_redo03.log' size 50m; Database altered. 10:21:46 SYS@ prod1>alter database add standby logfile 10:21:51 2 '/u01/app/oracle/oradata/prod1/std_redo04.log' size 50m; Database altered. 10:01:48 SYS@ shdb>select member from v$logfile; MEMBER ------------------------------------------------------------------------------------------------------------------------ /u01/app/oracle/oradata/shdb/redo03.log /u01/app/oracle/oradata/shdb/redo02.log /u01/app/oracle/oradata/shdb/redo01.log /disk2/arch_prod11_0_881851982.dbf /u01/app/oracle/oradata/shdb/std_redo01.log /u01/app/oracle/oradata/shdb/std_redo02.log 6 rows selected.
備庫(kù):
10:18:17 SYS@ shdb>alter database open; Database altered. 10:20:21 SYS@ shdb>select group#,status ,bytes from v$log; GROUP# STATUS BYTES ---------- ---------------- ---------- 1 CLEARING 52428800 2 CLEARING 52428800 3 CLEARING_CURRENT 52428800 10:20:45 SYS@ shdb>alter database add standby logfile 10:22:41 2 '/u01/app/oracle/oradata/shdb/std_redo03.log' size 50m; Database altered. 10:22:57 SYS@ shdb>alter database add standby logfile 10:23:02 2 '/u01/app/oracle/oradata/shdb/std_redo04.log' size 50m; Database altered. 10:23:14 SYS@ shdb>col member for a50 10:23:23 SYS@ shdb>select group#,member from v$logfile; GROUP# MEMBER ---------- -------------------------------------------------- 3 /u01/app/oracle/oradata/shdb/redo03.log 2 /u01/app/oracle/oradata/shdb/redo02.log 1 /u01/app/oracle/oradata/shdb/redo01.log 5 /u01/app/oracle/oradata/shdb/std_redo01.log 6 /u01/app/oracle/oradata/shdb/std_redo02.log 7 /u01/app/oracle/oradata/shdb/std_redo03.log 8 /u01/app/oracle/oradata/shdb/std_redo04.log 8 rows selected.
-----經(jīng)過(guò)以上方式處理后,問(wèn)題依舊,在Maximum Protection模式下主庫(kù)依然不能被Open ;但在Maximum Availablity 和 Maximum Performance下主庫(kù)可以O(shè)pen 。出錯(cuò)原因依舊在探索。。。
參考文檔:
數(shù)據(jù)庫(kù)報(bào)ORA-16198故障的解決方法分析
--------http://blog.itpub.net/28546804/viewspace-1260003/
1. 首先看官方文檔關(guān)于ORA-16198報(bào)錯(cuò)的說(shuō)明
.......................
報(bào)錯(cuò)可能原因是因?yàn)閚et_timeout設(shè)置低,在以前老版本默認(rèn)是10,建議更改為30
……………………………
The net_timeout attribute in the log_archive_dest_2 on the primary is
set too low so that
LNS couldn't finish sending redo block in 10 seconds in this example.
…………………………….
如果設(shè)置30還不行,請(qǐng)檢查磁盤(pán)的IO使用情況或者網(wǎng)絡(luò)傳輸情況
…………………………..
Note: If NET_TIMEOUT attribute has already been set to 30, and you still get ORA-16198, that means LNS couldn't finish sending redo block in 30 seconds.
The slowness may caused by:
1. Operating System. Please keep track of OS usage (like iostat).
2. Network. Please keep track network flow (like tcpdump).
……………………………
也有可能是BUG,受影響的版本為11.2.0.1或10.2.0.4,建議升級(jí)到11.2.0.2以上的版本
…………………………..
Bug 9259587 Multiple LGWR reconnect attempts in Data Guard MAXIMUM_AVAILABILITY
This note gives a brief overview bug 9259587.
Affects:
Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected Versions BELOW 12.1
Versions confirmed as being affected 11.2.0.1 10.2.0.4
Platforms affected Generic (all / most platforms affected)
Fixed:
This issue is fixed in 12.1 (Future Release) 11.2.0.2 (Server Patch Set)
Symptoms:
Related To:
Hang (Process Spins)
Active Dataguard (ADG)
Physical Standby Database / Dataguard
Description
…………………………………………………
發(fā)生的報(bào)錯(cuò),大概類似于下面的顯示
…………………………………………………
Rediscovery Notes:
Alert log contains messages like:
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Errors in file
/app/oracle/diag/rdbms/ora11g_dga/ora11g/trace/ora11g_lgwr_290838.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Network asynch I/O wait error 16198 log 2 service 'ora11g_DGb'
LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby
host 'ora11g_DGb'
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Failed to archive log 2 thread 1 sequence 1422 (16198)
…………………………………………………
In a Data Guard configuration using LGWR SYNC transport on one or more LOG_ARCHIVE_DEST_n parameters, and using a protection mode of MAXIMUM_AVAILABILITY, then if the primary database becomes disconnected from the standby database, LGWR continues to attempt to reconnect to the standby database. It should instead avoid attempts to reconnect until an ARCH process has re-established communication with the standby database.
所以可以確定的是:
報(bào)這種錯(cuò)誤主要發(fā)生在DATAGUARD這種架構(gòu)上,原因就是主機(jī)的日志向備機(jī)傳輸時(shí)沒(méi)在規(guī)定時(shí)間完成,或無(wú)法向備機(jī)傳送日志,那么我們就下面主要的兩種故障原因來(lái)進(jìn)行說(shuō)明:
2. 參數(shù)設(shè)置過(guò)低導(dǎo)致的故障
可能由于設(shè)置的LOG_ARCHIVE_DEST_2的NET_TIMEOUT值過(guò)低,導(dǎo)致的日志無(wú)法在規(guī)定時(shí)間傳輸完成,建議設(shè)置成30。
查詢NET_TIMEOUT:
SQL> select DEST_NAME,NET_TIMEOUT FROM V$ARCHIVE_DEST;
DEST_NAME NET_TIMEOUT
------------------------- -----------
LOG_ARCHIVE_DEST_1 0
LOG_ARCHIVE_DEST_2 30
……………輸出省略
查看LOG_ARCHIVE_DEST_2參數(shù):
SQL> show parameter log_archive_dest_2
值為'service=orcl_std reopen=120 lgwr sync valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std'
我沒(méi)有設(shè)置NET_TIMEOUT參數(shù),默認(rèn)卻是30,因?yàn)槲业陌姹臼?1.2.0.3的。
如果你的參數(shù)不是30,請(qǐng)進(jìn)行修改,參考如下:
SQL>ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='service=orcl_std reopen=120 lgwr sync net_timeout=30 valid_for=(online_logfiles,primary_role) db_unique_name=orcl_std';
然后觀察一下是否還報(bào)此類問(wèn)題。
3. 由于網(wǎng)絡(luò)不通暢或存儲(chǔ)IO繁忙等其他原因?qū)е碌墓收?/span>
如果是由于網(wǎng)絡(luò)不通暢和存儲(chǔ)繁忙的原因?qū)е碌膱?bào)錯(cuò),請(qǐng)用操作系統(tǒng)命令類似于,tcpdump或IOSTAT,VMSTAT來(lái)查看相關(guān)資源使用情況,或聯(lián)系網(wǎng)絡(luò),存儲(chǔ)管理員來(lái)協(xié)助分析。
如果以上都沒(méi)問(wèn)題,還有一種可能性就是你主機(jī)或備機(jī)單獨(dú)改sys密碼了,但是相關(guān)的備機(jī)或主機(jī)沒(méi)有同時(shí)改,造成主機(jī)向備機(jī)驗(yàn)證時(shí)失效也是很有可能的。
4. 數(shù)據(jù)庫(kù)的BUG
如果以上方法還沒(méi)有解決問(wèn)題,你也分析不出具體的原因,恰好你的數(shù)據(jù)庫(kù)版本是11.2.0.1或10.2.0.4,那么升級(jí)吧少年。。
5. 總結(jié)
考慮此類問(wèn)題,要從多角度分析,比如:參數(shù)值低,存儲(chǔ)使用情況,網(wǎng)絡(luò)傳輸情況,sys密碼改了,數(shù)據(jù)庫(kù)的BUG等。
免責(zé)聲明:本站發(fā)布的內(nèi)容(圖片、視頻和文字)以原創(chuàng)、轉(zhuǎn)載和分享為主,文章觀點(diǎn)不代表本網(wǎng)站立場(chǎng),如果涉及侵權(quán)請(qǐng)聯(lián)系站長(zhǎng)郵箱:is@yisu.com進(jìn)行舉報(bào),并提供相關(guān)證據(jù),一經(jīng)查實(shí),將立刻刪除涉嫌侵權(quán)內(nèi)容。