Sunday, March 13, 2011

Validation of WebSEAL instance restarts

Never, never, never rely on the 'pdweb status' command to confirm that a WebSEAL instance has restarted correctly.  A status of 'Yes' does not mean the instance is functional.  Even a successful authentication and access test can be followed 15 min. later by customer calls to your help desk.
After running into problems and poring over logs, I have found the following procedure to be reliable.
After a WebSEAL instance restart, validate performance of the instance as follows:
1.  On the WebSEAL server, logged in as (or sudo'd to) ivmgr or an equivalent account, run 'pdweb status' and confirm that the WebSEAL instances are started.

2.  Tail the instance logs to confirm that traffic is flowing through the WebSEAL instance.  A sample command is

tail -f /var/ibm/tivoli/common/DPW/logs/<instance>/log/combined.log

Check the log for each instance that was restarted.
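
When several instances were restarted, watching each log by hand gets tedious, so the same check can be scripted. The following is a minimal Python sketch, not part of any IBM tooling. The function name 'log_is_growing' and the wait interval are assumptions made for illustration, and a growing file is only a rough proxy for traffic flowing.

```python
import os
import time


def log_is_growing(path, wait_seconds=5.0):
    """Return True if the file at `path` is larger after `wait_seconds`.

    A log that was rotated or truncated during the wait reports False,
    so a False result means "look at it by hand", not "instance is down".
    """
    size_before = os.path.getsize(path)
    time.sleep(wait_seconds)
    return os.path.getsize(path) > size_before
```

Call it once per restarted instance with the path to that instance's combined.log. On a quiet system a longer wait may be needed before concluding that no traffic is arriving.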

3.  Confirm that the instance has registered properly with SMS.  (This is the 15-minute delay mentioned above.  IBM was not able to explain it.)  Review the following log on the SMS servers for events indicating whether a WebSEAL instance is properly registered with SMS, or that there is an ObjectGrid problem.

SMS server sample location:  /var/logs/<path>/SMSServer<1 or 2>/SystemOut.log

Sample registration with SMS:

001489cb DSess         I ClientStore addClientReplicaSet() CTGSM0313I   The previous instance of client, <instance>-webseald-<servername>, has been replaced. The previous instance ID was 2d009f74-3435-11e0-8c64-001125c5fec9, and the new instance ID is 844c4f58-2378-11e0-ac3e-001125c5fec9.

Sample error that I have seen start showing up 12 minutes after the restart:

0014922b DSess         E ClientStore storeNewClient() CTGSM0301E   The new instance, 2d009f74-3435-11e0-8c64-001125c5fec9, of the client, <instance>-webseald-<servername>, could not be stored.

Note that the "new" instance ID in this error is the same as the previous instance ID in the first event.

If ObjectGrid is having a problem, ObjectGrid errors will be in the log and the WebSEAL instances will not start.
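
The comparison of instance IDs described above can be automated. This is a minimal Python sketch that assumes the SystemOut.log lines look like the two samples shown. The regular expressions are written against those samples only, and the function name 'check_sms_registration' and the problem labels are hypothetical.

```python
import re

# Patterns written against the two sample events above (an assumption:
# other SMS versions may word these messages differently).
REPLACED = re.compile(
    r"CTGSM0313I.*?client, (?P<client>[^,]+), has been replaced\.\s+"
    r"The previous instance ID was (?P<old>[0-9a-f-]+), "
    r"and the new instance ID is (?P<new>[0-9a-f-]+)"
)
FAILED = re.compile(
    r"CTGSM0301E.*?new instance, (?P<id>[0-9a-f-]+), of the client, "
    r"(?P<client>[^,]+), could not be stored"
)


def check_sms_registration(lines):
    """Return (client, problem) tuples for every CTGSM0301E event found."""
    previous_ids = {}  # client name -> instance ID that was replaced
    problems = []
    for line in lines:
        m = REPLACED.search(line)
        if m:
            previous_ids[m["client"]] = m["old"]
            continue
        m = FAILED.search(line)
        if m:
            # The failure is "stale" when SMS is being handed the ID that
            # it already reported as replaced for this client.
            stale = previous_ids.get(m["client"]) == m["id"]
            problems.append(
                (m["client"], "stale instance ID" if stale else "store failed")
            )
    return problems
```

Feed it the lines of SystemOut.log (for example, from an open file object). An empty list means no CTGSM0301E events were seen, which is necessary but not sufficient: the registration event for each restarted instance should still be present.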

4.  Confirm that the WebSEAL instance agrees that it is registered with SMS by reviewing the instance msg log on the WebSEAL server:
/var/ibm/tivoli/common/DPW/logs/msg__webseald-<instance>.log

Sample error:
0x38A0A135 webseald ERROR wds client AMWSMSSOAPCall.cpp 104 0x00019697
DPWDS0309E   An error was returned from the SOAP server in cluster dsess when calling the getSession interface: CTGSI0302W   The client is not registered with the session management server. (pd / wsi) (code: 0x38c5812e).

Note that IBM has previously confirmed that the following is an expected and harmless error:
0x38CF0131 webseald WARNING wwa server WsTcpListener.cpp 397 0x00004647
DPWWA0305E   The 'pd_tcp_write' routine failed for 'WsTcpConnector::write', errno = -1
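
To separate the registration errors from the harmless noise, the msg log can be filtered by message ID. The sketch below is one possible way to do that in Python. It uses only the message IDs quoted in the samples above, and the function name 'triage_msg_log' is hypothetical.

```python
# Message IDs taken from the samples above.
NOT_REGISTERED = ("DPWDS0309E", "CTGSI0302W")
KNOWN_HARMLESS = ("DPWWA0305E",)


def triage_msg_log(lines):
    """Split msg log lines into registration errors and ignored noise.

    Returns (registration_errors, ignored_count), where
    registration_errors is a list of the matching lines.
    """
    registration_errors = []
    ignored_count = 0
    for line in lines:
        if any(code in line for code in KNOWN_HARMLESS):
            ignored_count += 1
        elif any(code in line for code in NOT_REGISTERED):
            registration_errors.append(line.strip())
    return registration_errors, ignored_count
```

Any entry in the first list after a restart means the instance does not consider itself registered with SMS, whatever 'pdweb status' reports.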

5.  Log in to the Policy Server and observe CPU usage.  It might spike to high levels if the SSL session cache is full.  (IBM does not have a command or monitor that shows how much of the SSL session cache is in use.  IBM therefore advises tuning the SSL session timeout and SSL session cache size for your environment.)

6.  On the Policy Server, review the msg__pdmgrd_utf8.log.  A few of the following events are expected in normal operation.  A stream of them indicates the SSL session cache may be full.
Sample event:
0x106520EB pdmgrd NOTICE bas mts e:\am610\src\mts\mtsserver.cpp 1886 0x000008dc HPDBA0235I   The server lost the client's authentication, probably because of session expiration.
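
Telling "a few" from "a stream" can be approximated by counting HPDBA0235I events in the most recent part of the log. The Python sketch below does that. The window of 500 lines and the threshold of 20 events are arbitrary assumptions, not IBM guidance, and the function names are hypothetical. Tune both numbers against what your own log looks like on a normal day.

```python
from collections import deque


def session_loss_count(lines, window=500):
    """Count HPDBA0235I events in the last `window` lines."""
    recent = deque(lines, maxlen=window)  # keeps only the newest lines
    return sum("HPDBA0235I" in line for line in recent)


def cache_may_be_full(lines, window=500, threshold=20):
    """True when recent HPDBA0235I events reach the (assumed) threshold."""
    return session_loss_count(lines, window) >= threshold
```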

The SSL session cache can be cleared by restarting the pdmgrd process or, on Windows, the Access Manager Policy Server service.  If the SSL session cache is full, new connections must wait for existing connections to time out.  The default timeout is 7200 seconds (2 hours).  IBM support recommends tuning this parameter, beginning with a value of 1800 seconds (30 min.).  The SSL session cache size can also be increased from the default value of 1024 to as high as 4095 (larger values are not recognized, and 1024 is then used instead).
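
The fallback for the cache size is easy to trip over, because a value that is too large silently shrinks the cache back to the default. A small Python sketch of the rule as described above follows. It models only the upper limit, and the function name 'effective_cache_size' is hypothetical.

```python
DEFAULT_CACHE_SIZE = 1024  # default stated above
MAX_CACHE_SIZE = 4095      # largest value that is recognized


def effective_cache_size(requested):
    """Return the cache size that would actually be used for `requested`."""
    if requested > MAX_CACHE_SIZE:
        # Larger values are not recognized, so the default applies.
        return DEFAULT_CACHE_SIZE
    return requested
```

For example, asking for 4096 yields 1024, while asking for 4095 yields 4095.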