Tuesday, 1 August 2017

ORA-01555: snapshot too old along with ORA-00704: bootstrap process failure and ORA-00604

Recently, we had a situation where we continuously  got ORA-01555 on one of our active data guard.

ORA-00604: error occurred at recursive SQL level 3

ORA-01555: snapshot too old: rollback segment number 1176 with name "_SYSSMU1176_1510088119$" too small




Environment Setup - 
2 Node RAC Primary database
2 Node RAC Active Data guard

We tried to check what was causing issue related to undo tablespace. we checked free space %, undo_retention parameter but everything looks fine.

TABLESPACE                     TOTAL SIZE FREE SPACE     % FREE
------------------------------ ---------- ---------- ----------
UNDOTBS01                           92160      87723         95

SQL> show parameter undo

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
undo_management                      string      AUTO
undo_retention                       integer     6500
undo_tablespace                      string      UNDOTBS01


There were no undo blocks used in database at the time of issue.

SQL> SELECT undoblks/((end_time-begin_time)*86400) "Peak Undo Block Generation" FROM v$undostat WHERE undoblks=(SELECT MAX(undoblks) FROM v$undostat);

Peak Undo Block Generation
--------------------------
                         0


SQL> SELECT (UR * (UPS * DBS)) AS "Bytes"
FROM (select max(tuned_undoretention) AS UR from v$undostat),
  2    3  (SELECT undoblks/((end_time-begin_time)*86400) AS UPS
  4  FROM v$undostat
  5  WHERE undoblks = (SELECT MAX(undoblks) FROM v$undostat)),
(SELECT block_size AS DBS
  6    7  FROM dba_tablespaces
  8  WHERE tablespace_name = (SELECT UPPER(value) FROM v$parameter WHERE name = 'undo_tablespace'));

     Bytes
----------
         0

SQL> SELECT TO_CHAR(BEGIN_TIME, 'MM/DD/YYYY HH24:MI:SS') BEGIN_TIME,
  TO_CHAR(END_TIME, 'MM/DD/YYYY HH24:MI:SS') END_TIME,
  UNDOTSN, UNDOBLKS, TXNCOUNT, MAXCONCURRENCY AS "MAXCON",
  MAXQUERYLEN, TUNED_UNDORETENTION
  FROM v$UNDOSTAT;  2    3    4    5

BEGIN_TIME          END_TIME               UNDOTSN   UNDOBLKS   TXNCOUNT     MAXCON MAXQUERYLEN TUNED_UNDORETENTION
------------------- ------------------- ---------- ---------- ---------- ---------- ----------- -------------------
07/24/2017 10:39:42 07/31/2017 07:48:39 2147483647          0          0          0           0                6500





SQL> SELECT A.SID, A.USERNAME, B.XIDUSN, B.USED_UREC, B.USED_UBLK
 FROM V$SESSION A, V$TRANSACTION B
 WHERE A.SADDR=B.SES_ADDR;   2    3
 FROM V$SESSION A, V$TRANSACTION B
                   *
ERROR at line 2:
ORA-00604: error occurred at recursive SQL level 3
ORA-01555: snapshot too old: rollback segment number 1176 with name "_SYSSMU1176_1510088119$" too small


Even data files were in auto extend

SQL> SELECT FILE_ID, AUTOEXTENSIBLE FROM DBA_DATA_FILES WHERE
   TABLESPACE_NAME='&UNDOTBS';  2
Enter value for undotbs: UNDOTBS01
old   2:    TABLESPACE_NAME='&UNDOTBS'
new   2:    TABLESPACE_NAME='UNDOTBS01'

   FILE_ID AUT
---------- ---
      1164 YES
      1166 YES
      1167 YES

We observed active undo extents but numbers were very small.

SQL> SELECT DISTINCT STATUS, SUM(BYTES), COUNT(*)
   FROM DBA_UNDO_EXTENTS GROUP BY STATUS;  2

STATUS       SUM(BYTES)   COUNT(*)
------------ ---------- ----------
EXPIRED      4843438080      10521
UNEXPIRED    4809162752       5178
ACTIVE       3482451968        215

SQL> select owner, segment_name, tablespace_name, status from dba_rollback_segs where tablespace_name='UNDOTBS01';
select owner, segment_name, tablespace_name, status from dba_rollback_segs where tablespace_name='UNDOTBS01'
                                                         *
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 1205 with name "_SYSSMU1205_975967709$" too small


Below are the steps we carried out to resolve the issue.

We had 2 node RAC active data guard setup. We decided to take a chance and try to bounce one of the instance from which recovery was not performed (recovery was done using current log files). Before doing restart, we check for sync between primary and standby db (both were in sync), also we check for ORA-01555 on primary db (did not found ORA-01555).

SQL> SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", (ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference"
  2  FROM
  3  (SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH,
  4  (SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL
  5  WHERE ARCH.THREAD# = APPL.THREAD# ORDER BY 1;

    Thread Last Sequence Received Last Sequence Applied Difference                                                                                                      
---------- ---------------------- --------------------- ----------                                                                                                      
         1                 103675                103675          0                                                                                                      
         2                 100310                100310          0                                                                                                      

SQL>


Step 1 - Shut down instance on node 1.

SQL> shut immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL>

Step 2 - Start up instance on node 1 to check if error has gone.

SQL> startup;
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 5.3463E+10 bytes
Fixed Size                  2194184 bytes
Variable Size            1.3959E+10 bytes
Database Buffers         3.9460E+10 bytes
Redo Buffers               42532864 bytes
Database mounted.
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 1198 with name
"_SYSSMU1198_1178130323$" too small
Process ID: 9503
Session ID: 1892 Serial number: 3



We shut down the instance on node 1 and tried to startup instance manually but unfortunately we got above error. We did multiple restarts but we got same error. After searching on google I found few blogs related to same errors. 


There was slight difference between above two scenarios, we did not performed any recovery nor upgraded database. We thought may be recovery running now instance 2 is blocking instance 1 from coming up. So we decided to cancel recovery on node 2.

Step 3 - Cancel recovery from instance on node 2 but it hanged.
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;



Step 4 - Shut down instance on node 2.

As we were unable to cancel recovery as it was hanged, we decided to shut down instance to release recovery process.

SQL> shut immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL>

Step 5 - Shutdown and Startup instance on node 1.

Then we tried to shutdown instance on node 1 and but as it was not properly opened so we killed pmon process. Data guard bring back the instance soon we killed pmon process.

-bash-3.2$ sqlplus / as sysdba
Connected.
SQL> !ps -ef| grep pmon
  oracle 12149     1   0 09:33:24 ?           0:01 ora_pmon_TESTSTB1
  oracle 10145 10143   0 09:46:56 pts/7       0:00 grep pmon
  oracle 10143  7221   0 09:46:55 pts/7       0:00 /usr/bin/bash -c ps -ef| grep pmon
  oracle 11363     1   0 08:55:05 ?           0:01 asm_pmon_+ASM1

SQL> select name, open_mode from gv$database;
select name, open_mode from gv$database
*
ERROR at line 1:
ORA-01012: not logged on
Process ID: 0
Session ID: 0 Serial number: 0


SQL> exit
Disconnected
-bash-3.2$ sqlplus / as sysdba

Connected.
SQL> shut immediate;
ORA-24324: service handle not initialized
ORA-24323: value not allowed
ORA-01089: immediate shutdown in progress - no operations are permitted
SQL>
SQL>
SQL> exit
Disconnected
-bash-3.2$
-bash-3.2$ kill -9 12149
-bash-3.2$
-bash-3.2$
-bash-3.2$ ps -ef| grep pmon
  oracle 12395  7498   0 09:48:41 pts/7       0:00 grep pmon
  oracle 11363     1   0 08:55:05 ?           0:01 asm_pmon_+ASM1
-bash-3.2$


Step 6 - Startup instance on node 1.
-bash-3.2$ sqlplus / as sysdba

Connected to an idle instance.

SQL> startup mount;
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORA-10997: another startup/shutdown operation of this instance inprogress
ORA-09968: unable to lock file
SVR4 Error: 11: Resource temporarily unavailable
Additional information: 12338
SQL>
SQL> exit
Disconnected
-bash-3.2$
-bash-3.2$ ps -ef| grep pmon
  oracle 12440     1   0 09:48:50 ?           0:00 ora_pmon_TESTSTB1
  oracle 12767  7498   0 09:49:09 pts/7       0:00 grep pmon
  oracle 11363     1   0 08:55:05 ?           0:01 asm_pmon_+ASM1
-bash-3.2$


-bash-3.2$ sqlplus / as sysdba

Connected to:

OPEN_MODE            NAME      DB_UNIQUE_NAME                 DATABASE_ROLE
-------------------- --------- ------------------------------ ----------------
READ ONLY WITH APPLY TEST   TESTSTB                      PHYSICAL STANDBY


SQL> select process,status from v$managed_standby;

PROCESS   STATUS
--------- ------------
ARCH      CONNECTED
ARCH      CONNECTED
ARCH      CONNECTED
ARCH      CLOSING
ARCH      CLOSING
ARCH      CLOSING
ARCH      CONNECTED
ARCH      CONNECTED
ARCH      CONNECTED
ARCH      CONNECTED
RFS       IDLE

PROCESS   STATUS
--------- ------------
RFS       IDLE
RFS       IDLE
RFS       IDLE
RFS       IDLE
RFS       RECEIVING
RFS       IDLE
MRP0      WAIT_FOR_LOG

18 rows selected.

Step 7 - Startup instance on node 2.

As instance on node 1 was up finally with "read only with apply" status. so we started instance on node 2 and it also came up without any issue.


Step 8 - Check for sync between primary and standby db.

Switch few log files on both primary instances and check for sync on standby instance.

SQL> SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", (ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference"
  2  FROM
  3  (SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH,
  4  (SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL
WHERE ARCH.THREAD# = APPL.THREAD# ORDER BY 1;  5

    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                 103729                103729          0
         2                 100361                100361          0

No comments:

Post a Comment