
kjfspseudorcfg and kjxgrrcfgchk: some reason codes

# lmon trace

*** 2015-07-02 01:58:11.753
2015-07-02 01:58:11.751867 : kjfspseudorcfg: requested with reason 1(readable standby redo apply switch)

*** 2015-07-02 01:58:11.784
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 601 sec (Total CSS waittime = 1205)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 1225 sec
kjxgmcs: Setting state to 4 0.

*** 2015-07-02 01:58:11.805
     Name Service frozen
kjxgmcs: Setting state to 4 1.
kjxgrdecidever: No old version members in the cluster
kjxgrssvote: reconfig bitmap chksum 0x1969e71 cnt 2 master 1 ret 0
kjxgrpropmsg: SSMEMI: inst 1 - no disk vote
kjxggpoll: change db group poll time to 50 ms
2015-07-02 01:58:11.815684 : kjxgrcomerr: Communications reconfig: instance 2 (4,4)
2015-07-02 01:58:11.855752 : kjxgrrcfg: done - ret = 3  hist 0x1679a (initial rsn: 3)
kjxgrrcfgchk: Initiating reconfig, reason=3                       #######<<<<<<<<<
kjxgrrcfgchk: COMM rcfg - Disk Vote Required
kjfmReceiverHealthCB_CheckAll: Recievers are healthy.
2015-07-02 01:58:11.856090 : kjxgrnetchk: start 0x496effc3, end 0x496fb60d
2015-07-02 01:58:11.856106 : kjxgrnetchk: Network Validation wait: 46 sec          
kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0xcb7da6
kjxgrrcfgchk: prev pstate 3  mapsz 512
kjxgrrcfgchk: new  bmp: 1 2 
kjxgrrcfgchk: work bmp: 1 2 
kjxgrrcfgchk: rr  bmp: 1 2 


*** 2015-07-02 01:58:11.857
kjxgmrcfg: Reconfiguration started, type 3
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 601 sec (Total CSS waittime = 1205)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 1225 sec
kjxgmcs: Setting state to 4 0.
kjxgrs0h: disable CGS timeout

The LMON trace above contains several important messages, each carrying a reason code. Here I try to gather some information about them.

kjfspseudorcfg: requested with reason #

kjfspseudorcfg = kjfs pseudo reconfiguration

kjfs is DLM (Distributed Lock Manager) related functionality, associated with RAC or Parallel Server operation.

DLM functionality in the Global Enqueue Service Daemon (LMD0):
• Performing periodic scanning for move-scan convert operations
• Performing periodic scanning of the timer queue for locks with expired timers
• Performing deadlock detection
• Processing incoming messages for non-PCM locks

The lock db is either in a frozen or a running state. In the frozen state, it is not possible to get any locks from the DLM or to create any new resources. The DLM is frozen during reconfiguration so that the node failure can be recovered from.

kjfspseudorcfg: requested with reason 1(readable standby redo apply switch)
kjfspseudorcfg: requested with reason 5(DRM Quiesce step stall)
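
To pull these messages out of a large trace file, a small parser helps. The following is a minimal sketch, assuming the line format is exactly what the two samples above show (a numeric code followed by a description in parentheses); the function name `parse_pseudo_rcfg` is my own and other Oracle versions may format the line differently.

```python
import re

# Pattern inferred from the two sample lines above (an assumption, not an
# official format description).
PSEUDO_RCFG = re.compile(
    r"kjfspseudorcfg: requested with reason (?P<code>\d+)\((?P<text>[^)]*)\)"
)


def parse_pseudo_rcfg(line):
    """Return (code, text) for a kjfspseudorcfg line, or None if no match."""
    m = PSEUDO_RCFG.search(line)
    if m is None:
        return None
    return int(m.group("code")), m.group("text")


sample = (
    "2015-07-02 01:58:11.751867 : kjfspseudorcfg: requested with "
    "reason 1(readable standby redo apply switch)"
)
print(parse_pseudo_rcfg(sample))
```

Lines that are not kjfspseudorcfg messages simply return None, so the function can be applied to every line of the trace.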

kjxgrrcfgchk: Initiating reconfig, reason #

*********************************************************
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
**********************************************************
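
The table can be turned into a lookup for annotating trace lines. This is only a sketch: the dictionary is copied from the table above, while the regular expression and the name `explain_reconfig` are my own assumptions. It accepts both the `reason=3` form printed by kjxgrrcfgchk and the `reason 1` form printed by kjxgmrcfg in some traces.

```python
import re

# Reason codes copied from the table above.
RECONFIG_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications Failure",
    4: "Reconfiguration after suspend",
}

# Assumed line shapes: "kjxgrrcfgchk: ... reason=N" or "kjxgmrcfg: ... reason N".
REASON_LINE = re.compile(r"(kjxgrrcfgchk|kjxgmrcfg): .*?reason[= ](\d+)")


def explain_reconfig(line):
    """Return (code, description) for a reconfig line, or None if no match."""
    m = REASON_LINE.search(line)
    if m is None:
        return None
    code = int(m.group(2))
    return code, RECONFIG_REASONS.get(code, "unknown reason")


print(explain_reconfig("kjxgrrcfgchk: Initiating reconfig, reason=3"))
```

Codes outside the table are reported as "unknown reason" rather than raising an error, since I cannot confirm the table is complete for every release.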

When an instance starts up, it’s the Lock Monitor’s (LMON) job to register with the Node Monitor (NM). That’s what we see in the alert.log with the instance ID that is getting registered. When any node joins or leaves a cluster, the global resource directory undergoes a reconfiguration event. We see the start of the reconfiguration event along with the old and new incarnation. Next, we see the number of nodes that have joined the cluster. As this was the first node to be started up, in the list of nodes we see only one node listed, and the number starts with 0. A reconfiguration event is a seven-step procedure and upon completion the “reconfiguration complete” message is logged into the alert.log.

The messages logged in the alert.log are summaries of the reconfiguration event. The LMON trace file would have more information about the reconfiguration. Following are the contents of the LMON trace file:
*** 2005-08-29 07:25:11.235
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.

Here, you can see the reason for the reconfiguration event. The most common reasons would be 1, 2, or 3.

Reason 1 means that the NM initiated the reconfiguration event, as typically seen when a node joins or leaves a cluster.

A reconfiguration event is initiated with reason 2 when an instance death is detected. How is an instance death detected? Every instance updates the control file with a heartbeat through its Checkpoint (CKPT) process. If heartbeat information is not present for x amount of time, the instance is considered to be dead and the Instance Membership Recovery (IMR) process initiates reconfiguration. This type of reconfiguration is commonly seen when significant time changes occur across nodes, the node is starved for CPU or I/O times, or some problems occur with the shared storage.
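
The heartbeat idea can be illustrated with a toy model. This is not how Oracle implements IMR; it is only a sketch of the "no heartbeat for x amount of time" rule, with a hypothetical 30-second timeout and made-up timestamps.

```python
from datetime import datetime, timedelta


def stale_instances(last_heartbeat, now, timeout):
    """Return the ids of instances whose last heartbeat is older than timeout."""
    return sorted(
        inst for inst, seen in last_heartbeat.items() if now - seen > timeout
    )


# Hypothetical data: instance 1 checked in 2 seconds ago, instance 2 90 seconds ago.
now = datetime(2015, 7, 2, 1, 58, 11)
heartbeats = {
    1: now - timedelta(seconds=2),
    2: now - timedelta(seconds=90),
}

# The 30-second timeout is an illustrative value, not an Oracle default.
print(stale_instances(heartbeats, now, timedelta(seconds=30)))
```

In this model, a clock jump on one node or a CKPT process that cannot get CPU or I/O time would make that instance's heartbeat look stale, which matches the causes listed above.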

A reason 3 reconfiguration event is due to a communication failure. Communication channels are established between the Oracle processes across the nodes. This communication occurs over the interconnect. Every message sender expects an acknowledgment message from the receiver. If a message is not received for a timeout period, then a “communication failure” is assumed. This is more relevant for UDP, as Reliable Shared Memory (RSM), Reliable DataGram protocol (RDG), and Hyper Messaging Protocol (HMP) do not need it, since the acknowledgment mechanisms are built into the cluster communication and protocol itself.

When a block is sent from one instance to another over the wire, especially when unreliable protocols such as UDP are used, it is best to get an acknowledgment message from the receiver. The acknowledgment is a simple side channel message that is normally required for most of the UNIX systems where UDP is used as the default IPC protocol. When user-mode IPC protocols such as RDG (on HP Tru64 UNIX TruCluster) or HP HMP are used, the additional messaging can be disabled by setting _reliable_block_sends=TRUE. For Windows-based systems, it is always recommended to leave the default value as is.

Document 219361.1 lists the following as the likely causes of the ORA-29740 error with reason 3:

a) Network Problems.
b) Resource Starvation (CPU, I/O, etc.)
c) Severe Contention in Database.
d) An Oracle bug.

Reference: “Oracle Database 10g Real Application Clusters Handbook”
