ORA-29702 during Flashback or reverting database upgrade
Mike Dietrich blogged about this issue but I received some feedback/questions that I thought it is best to provide additional details on this error.
and
As mentioned earlier, Mike has covered this topic in depth and this is my attempt to provide additional information along with snippets from the trace files to explain the root cause analysis process, so in future DBA's can perhaps reuse this information to better understand ORA-29702 errors
ORA-29702
An ORA-29702 error is typically signaled when the Database instance attempts to talk to the Clusterware (ocssd process). The ocssd process is the Oracle Cluster Synchronization Services daemon responsible for Node management services. The most common cause of 29702 error is typically some misconfiguration that breaks the authentication required by the database process. A simple orachk will check for required permission and provide hints on what is broken (hint: check oradism configuration). I have also covered some more details in this slide share "Smart Monitoring: How does Oracle RAC manage Resources and State".
Ocssd process has steadily added more and more features over the years. One such new feature is the increase in maximum number of members that CSSD can support to 2048 from 512). In order to benefit from this change, all entities that were earlier (prior to the upgrade) connected to the CSSD needs to be register with the new maximum members value.
So any GI and database release greater than 19.3 would work without signaling the ORA-29702 error as the 19.3 database would now use the new 2048 value for MaxMembers during registration. The value of maximum members that can connect to the CSSD is persisted to storage so that future database start up can reuse that information. The problem occurs when a customer either reverts the upgrade or performs flashback operation
Lets understand this with an example
- GI version 11.2.0.4 running with 11.2.0.4 release of RDBMS with one database called Sales
- Maxmembers used by Sales is 512
- GI and Database is subsequently upgraded to 19.3
- When database sales starts, it requests MaxMembers to be set to 2048
- This value is persistently stored in the GI
- Now either Sales DB is flashed back to 11204 or the upgrade was reverted to 11204
- When Sales DB in 11204 version attempts to connect to the 19.3 GI, its Maxmember requested is 512 whereas the persistent value in the GI is 2048
Lmon trace file snippet
Batching factor: cache replay 114 size per lock 72 kjxggin: CGS tickets = 1000 kjxgrdmpcpu: CPU Total 14 Core 7 Socket 2 OCPU 14 kjxgrdmpcpu: High load threshold 17920 2020-06-03 13:52:41.714: [ CSSCLNT]clssgsGroupJoin: bad server response(-24) kgxgnreg: error: status 1 (0 ) kjxgmjoin: can not join the group (DAALL_DB_XXXXXX) with id 0 (inst 1) kjxgmjoin: kgxgn error 16 2020-06-03 13:52:41.714585 : IMR recording device closed, terminating IMR kjfmreg: Joining the cluster failed with err code 16
Cssd trace file snippet
2020-06-15 14:52:31.537 : CSSD:487659264: [ INFO] clssgmCheckGrpAttrCompat: Change was rejected due to incompatible MaxMembers value grock DAALL_DB_XXXXX, ID 219071, current value 2048, requested value 512
As seen above, CSSD rejects the request by the RDBMS instance as the MaxMembers current value is 2048 and the requested value by the older version of the RDBMS is 512
The solution as Mike suggested is already in the latest RU's. Additionally it is possible to bypass this error by shutting the GI on all the nodes.
Database Management Specialist-Google
4yAnil Nair- good to know.
Senior Manager at Oracle
4yThis is addressed in lastest RU, workaround is cold crs restart
DBA Architect | Senior DBA | Administrador de base de datos Oracle | Database Administrator | SQL SERVER DBA | Azure Database Administrator |53X OCI Certified|8X AZURE Certified|2X GCP|2X AWS
4yMolto bene. Grazie mille. Benedizioni.