What to do if an Alarm Occurs

When the HPDAMON detects that a failure is about to occur, or has already occurred, in an HP Disk Array system, it issues an alert to the management client and logs a descriptive message in the Event Log.

For instance, if a hard disk fails, multiple alarms might be generated depending on the particular HP Disk Array configuration. The Event Log might display the following:

Critical 02/17/94 23:12:56 HPDA Logical Drive: Not Available MY_SERVER E-SL1 LD2

Minor 02/17/94 23:12:55 HPDA Logical Drive: Critical MY_SERVER E-SL1 LD2

Warning 02/17/94 23:12:54 HPDA Hard Disk: Failed MY_SERVER E-SL1 CH0 SCSI-ID3

Typically, the most recent alarm is displayed at the top, so the sequence of events reads from the bottom up. In this case, the earliest alarm from server MY_SERVER (the bottom line) indicates that the hard disk in the bay assigned to SCSI ID 3, connected to Channel 0 of the HP Disk Array Controller in EISA slot 1, has failed (E-SL1 CH0 SCSI-ID3). The hard disk failure causes two other events to occur. First, logical drive 2 on this controller (E-SL1 LD2) becomes critical (HPDA Logical Drive: Critical), and then the same logical drive becomes unavailable (HPDA Logical Drive: Not Available).

In the above example, if your HP Disk Array Controller had been installed in a PCI slot, the slot identifier would begin with "P" (e.g., P-SL1) instead of "E."
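The field layout described above can be illustrated with a short Python sketch. The pattern is inferred only from the three sample lines shown earlier, so it is an assumption about the log layout rather than a documented format; the names parse_alarm and Alarm are hypothetical, and the date is assumed to be month/day/year.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Pattern inferred from the three sample lines only (an assumption, not a
# documented format). Server names containing spaces are not handled.
LINE_RE = re.compile(
    r"^(?P<severity>\w+)\s+"
    r"(?P<stamp>\d{2}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"HPDA\s+(?P<subject>[^:]+):\s+"
    r"(?P<state>.+?)\s+"
    r"(?P<server>\S+)\s+"
    r"(?P<bus>[EP])-SL(?P<slot>\d+)\s+"
    r"(?P<target>.+)$"
)


class Alarm(NamedTuple):
    severity: str
    when: datetime
    subject: str   # e.g. "Hard Disk" or "Logical Drive"
    state: str     # e.g. "Failed", "Critical", "Not Available"
    server: str
    bus: str       # "EISA" for an E- prefix, "PCI" for a P- prefix
    slot: int
    target: str    # e.g. "CH0 SCSI-ID3" or "LD2"


def parse_alarm(line: str) -> Alarm:
    """Split one Event Log line into its fields."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized alarm line: {line!r}")
    return Alarm(
        severity=m["severity"],
        # Assumes month/day/two-digit-year, as the sample date 02/17/94 suggests.
        when=datetime.strptime(m["stamp"], "%m/%d/%y %H:%M:%S"),
        subject=m["subject"].strip(),
        state=m["state"],
        server=m["server"],
        bus={"E": "EISA", "P": "PCI"}[m["bus"]],
        slot=int(m["slot"]),
        target=m["target"],
    )


SAMPLE = [
    "Critical 02/17/94 23:12:56 HPDA Logical Drive: Not Available MY_SERVER E-SL1 LD2",
    "Minor 02/17/94 23:12:55 HPDA Logical Drive: Critical MY_SERVER E-SL1 LD2",
    "Warning 02/17/94 23:12:54 HPDA Hard Disk: Failed MY_SERVER E-SL1 CH0 SCSI-ID3",
]

# Sort oldest first so the root cause (the disk failure) comes before its effects.
alarms = sorted((parse_alarm(s) for s in SAMPLE), key=lambda a: a.when)
```

Sorting by timestamp puts the hard disk failure ahead of the logical drive alarms it triggered, which is usually the order in which you want to read them.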

If one or more alarms appear on your server, it is important to consider the following generic troubleshooting steps:

  1. Capture the information in the Event Log by printing the Log. If the Event Log has already been cleared, use the separate log that the HPDAMON keeps (on a NetWare system, the default log file is SYS:\PUBLIC\HPDA.LOG).
  2. If more than one disk drive has failed on one HP Disk Array system, the cause might be as simple as a cable (e.g. the SCSI cable) or power problem.
  3. If the alarm(s) indicate a problem with a single Hard Disk, make sure the disk module is properly seated. There might not be anything wrong with the disk drive.
  4. Contact your local HP Dealer or the HP Customer Support Center.

Keep in mind that the suggested cause of the problem the HPDAMON software reported may not be the real cause of the problem. Your careful troubleshooting may reveal a different cause and save you valuable time!
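Steps 2 and 3 above amount to a simple rule of thumb, which the following Python sketch illustrates. It is not part of the HPDAMON software; the function name triage and the tuple layout (server, controller slot, disk location) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (server, controller slot, disk location)
FailedDisk = Tuple[str, str, str]


def triage(failed: Iterable[FailedDisk]) -> Dict[Tuple[str, str], str]:
    """Suggest a first check for each array, following steps 2 and 3."""
    by_array: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for server, slot, disk in failed:
        by_array[(server, slot)].add(disk)

    advice = {}
    for array, disks in by_array.items():
        if len(disks) > 1:
            # Several disks failing on one array points to a shared cause.
            advice[array] = "check the SCSI cable and power first"
        else:
            # A single failure may only be a badly seated module.
            advice[array] = "check that the disk module is properly seated"
    return advice


result = triage([
    ("MY_SERVER", "E-SL1", "CH0 SCSI-ID3"),
    ("MY_SERVER", "E-SL1", "CH0 SCSI-ID4"),
    ("MY_SERVER", "E-SL2", "CH1 SCSI-ID0"),
])
```

The suggestion is only a starting point; as noted above, the real cause may differ from the reported one.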

Write Cache Error

Write cache errors may occur when cached data is not successfully written to the hard disk(s). This is generally an indication that the cache memory on the controller should be replaced.

General Instructions on What to Do if an Alarm Occurs

NVRAM Age

The NVRAM chip on the HP Disk Array controller is used for storing configuration information. However, the NVRAM has a finite life span. A warning is issued when the NVRAM chip on the controller reaches 10% of its full life expectancy. This warning indicates that it is time to replace the controller, as the chip itself is not replaceable.

Hardware Error

The Hardware Error alarm is generally an indication of a serious problem with the disk array controller hardware. This alarm indicates that an error occurred with the SCSI controller on one of the HPDA channels. Typical errors would be an illegal SCSI phase sequence or an illegal command/address. The maximum count for this error is 48. Normally, no hardware errors should occur. The controller should be replaced to prevent data corruption.

Hard Disk Failure

If the disk array controller is unable to communicate with a configured hard disk after multiple retries, it considers the hard disk failed. A hard disk failure causes the configured logical drives (or volumes) to become either Critical or Not Available. Separate alarms are issued for the state of the logical drives.

If the Hard Disk Failure alarm appears, the failed hard disk must be replaced with a proper replacement module. If the array was configured with a Hot Spare disk, the controller will start the rebuild operation immediately upon the Hard Disk Failure unless the /MR (manual rebuild) option was specified for HPDAMON (the HP Disk Array Monitor program). Otherwise, the rebuild operation must be manually started via the JetSet utility after the failed disk module has been replaced.

For more information on HPDAMON and its command line options, refer to your HP Disk Array Controller NOS Guide.
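The rebuild rule described above can be written as a small decision function. This is only an illustrative Python sketch; the function and parameter names are hypothetical and do not correspond to any HPDAMON or JetSet interface.

```python
def rebuild_mode(has_hot_spare: bool, manual_rebuild_option: bool) -> str:
    """Return how the rebuild starts after a Hard Disk Failure.

    has_hot_spare: the array was configured with a Hot Spare disk.
    manual_rebuild_option: HPDAMON was started with the /MR option.
    """
    if has_hot_spare and not manual_rebuild_option:
        # The controller starts rebuilding onto the Hot Spare immediately.
        return "automatic"
    # In every other case the rebuild is started by hand via JetSet.
    return "manual"


modes = {
    (spare, mr): rebuild_mode(spare, mr)
    for spare in (True, False)
    for mr in (True, False)
}
```

Only one of the four combinations, a Hot Spare present and /MR not specified, leads to an automatic rebuild.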

Hot Spare Failure

If a hard disk is configured as a Hot Spare, the disk array controller does not use it until another configured hard disk fails. Since hard disks are generally very reliable, a Hot Spare disk may be idle for a very long time before it is needed. However, since the disk array controller may need to use a Hot Spare disk as a replacement disk at any moment, the HPDAMON verifies its operation via routine checks.

If the Hot Spare Failure alarm appears, it must be treated as a failed hard disk. First, make sure the disk module has been properly seated. If the seating seems normal, the disk module must be replaced.

JetSet lets you configure a new Hot Spare disk without bringing the server down.

Hot Spare Succeeded

The Hot Spare Succeeded alarm is only issued upon successful completion of an automatic Hot Spare replacement to indicate that the array system has successfully replaced a failed hard disk. It also means that your one and only Hot Spare drive has been used and you no longer have the added protection of a Hot Spare replacement disk.

Replace the failed hard disk (check the event log to find which hard disk failed), and configure its replacement as the new Hot Spare disk via the JetSet utility.

Without a Hot Spare, a disk failure will put the array into a critical state, leaving it vulnerable to data loss should another physical disk failure occur.

Logical Drive Critical

Data is considered redundant on a logical drive of RAID levels 1, 5, or 6. A redundant logical drive becomes critical if one configured hard disk fails. The data is still fully available, but the data on this logical drive is no longer redundant.

If the array has been configured with a Hot Spare disk, the failed hard disk will be replaced automatically. Otherwise, the rebuild operation must be started manually via the JetSet utility.

Upon successful completion of the rebuild operation, all critical logical drives on that array will again become redundant.

It is important to replace and rebuild the failed disk immediately to prevent data loss in the event that another disk fails.

Logical Drive Not Available

Data is considered non-redundant on a logical drive of RAID level 0 (or a critical logical drive of RAID levels 1, 5, or 6). Data on a non-redundant logical drive becomes unavailable if one configured hard disk fails.

Look for other messages indicating physical disk failure(s).

The Logical Drive Not Available alarm indicates that (in most cases) the data on this logical drive must be restored from backup after the failed hard disk(s) has been replaced and the array has been restored to its previous configuration.
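The way a logical drive moves between states, as described for the Logical Drive Critical and Logical Drive Not Available alarms, can be modelled as a small state machine. The Python sketch below follows only the rules stated in this document (RAID levels 1, 5, and 6 are redundant, RAID level 0 is not). The state labels REDUNDANT and NON_REDUNDANT and the function names are hypothetical.

```python
from enum import Enum


class DriveState(Enum):
    REDUNDANT = "redundant"          # healthy RAID 1, 5, or 6 (hypothetical label)
    NON_REDUNDANT = "non-redundant"  # healthy RAID 0 (hypothetical label)
    CRITICAL = "Critical"
    NOT_AVAILABLE = "Not Available"


REDUNDANT_LEVELS = {1, 5, 6}


def initial_state(raid_level: int) -> DriveState:
    """State of a healthy logical drive for the RAID levels named in this document."""
    if raid_level in REDUNDANT_LEVELS:
        return DriveState.REDUNDANT
    if raid_level == 0:
        return DriveState.NON_REDUNDANT
    raise ValueError(f"RAID level {raid_level} is not covered by this sketch")


def after_disk_failure(state: DriveState) -> DriveState:
    """Apply one configured hard disk failure to a logical drive."""
    if state is DriveState.REDUNDANT:
        # Data is still available but no longer redundant.
        return DriveState.CRITICAL
    # RAID 0 drives and already-critical drives lose their data.
    return DriveState.NOT_AVAILABLE


def after_rebuild(state: DriveState) -> DriveState:
    """A successful rebuild makes a critical drive redundant again."""
    if state is DriveState.CRITICAL:
        return DriveState.REDUNDANT
    return state


raid5 = initial_state(5)
one_failure = after_disk_failure(raid5)
two_failures = after_disk_failure(one_failure)
raid0_failure = after_disk_failure(initial_state(0))
```

A rebuild only helps from the Critical state; once a drive is Not Available, the data generally has to be restored from backup.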

Parity Errors Exceeded

The Parity Errors Exceeded alarm is generally indicative of hardware problems with a hard disk. The parity error is a count of SCSI bus parity errors. Normally there should be no parity errors occurring. The maximum count for this error is 48.

The cause of this problem should be identified to keep the disk subsystem running normally.

Note that if the alarm is given for one disk drive only, the alarm indicates a problem with that disk drive. However, if there are alarms from multiple hard disks on the same HP Disk Array, the cause may be the cabling or the disk array controller.

Soft Errors Exceeded

The Soft Errors Exceeded alarm is generally indicative of hardware problems with a hard disk. This error is for SCSI check condition errors indicating a bad sector on a drive. Soft errors will normally be corrected when encountered, and also by performing regular consistency checks. The maximum count for this error is 48.

Normally there should be very few soft errors. Many errors over a short time (hours or a few days) may indicate a problem with a disk drive.
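One way to judge "many errors over a short time" is to count how many errors fall inside a sliding time window. The Python sketch below illustrates the idea. The window of two days and the threshold of five errors are arbitrary assumptions made for this example, not values used by the HPDAMON, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def has_error_burst(
    timestamps: Sequence[datetime],
    window: timedelta = timedelta(days=2),  # assumed value, not from HPDAMON
    threshold: int = 5,                     # assumed value, not from HPDAMON
) -> bool:
    """Return True if at least `threshold` errors fall within any `window`."""
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Drop errors that are too old to share a window with the current one.
        while current - ordered[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


base = datetime(1994, 2, 17, 23, 0, 0)
burst = [base + timedelta(minutes=10 * i) for i in range(5)]  # 5 errors in 40 minutes
sparse = [base + timedelta(days=10 * i) for i in range(5)]    # 5 errors in 40 days
```

With these inputs, has_error_burst(burst) reports a burst and has_error_burst(sparse) does not.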

Note that if the alarm is given for one disk drive only, the alarm indicates a problem with that disk drive. However, if there are alarms from multiple hard disks on the same HP Disk Array, the cause may be the cabling or the disk array controller.

Miscellaneous (Misc) Errors Exceeded

The Miscellaneous Errors Exceeded alarm is generally indicative of hardware problems with a hard disk. This alarm is for errors which do not fall under Parity, Hardware, or Soft errors. A typical miscellaneous error occurs when a device (drive) fails to respond to a SCSI command from the controller (e.g., a read or write) before the timeout, which is about 6 seconds. The maximum count for this class of error is 48.

Normally there should be no miscellaneous errors, so this alarm could indicate a problem with cabling or enclosures or, most likely, with a specific disk. Continuing errors would indicate that a drive replacement is needed.

Note that if the alarm is given for one disk drive only, the alarm indicates a problem with that disk drive. However, if there are alarms from multiple hard disks on the same HP Disk Array, the cause may be the cabling or the disk array controller.
