How Data ONTAP reduces disk failures using Rapid RAID Recovery

When Data ONTAP determines that a disk has exceeded its error thresholds, it can perform Rapid RAID Recovery by removing the disk from its RAID group for testing and, if necessary, failing the disk. Spotting disk errors quickly helps prevent multiple disk failures and allows problem disks to be replaced.

By performing the Rapid RAID Recovery process on a suspect disk, Data ONTAP avoids three problems that occur during sudden disk failure and the subsequent RAID reconstruction process: rebuild time, performance degradation, and potential data loss due to additional disk failure during reconstruction.

During Rapid RAID Recovery, Data ONTAP performs the following tasks:
  1. Places the suspect disk in pre-fail mode.
  2. Selects a hot spare replacement disk.
    Note: If no appropriate hot spare is available, the suspect disk remains in pre-fail mode and data continues to be served. However, a suspect disk performs less efficiently. Impact on performance ranges from negligible to worse than degraded mode. For this reason, make sure hot spares are always available.
  3. Copies the suspect disk’s contents to the spare disk on the storage system before an actual failure occurs.
  4. After the copy is complete, attempts to put the suspect disk into the maintenance center, or else fails the disk.
Note: Tasks 2 through 4 can occur only when the RAID group is in normal (not degraded) mode.

If the suspect disk fails on its own before copying to a hot spare disk is complete, Data ONTAP starts the normal RAID reconstruction process.
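
The task sequence and its fallback paths can be summarized as a small decision function. The following Python sketch is an illustrative model only, not Data ONTAP code: the names Conditions, Outcome, and rapid_raid_recovery are hypothetical, and the sketch assumes that a suspect disk in a degraded RAID group simply stays in pre-fail mode, which the text above implies but does not state outright.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    STAYS_PREFAIL = auto()       # disk remains in pre-fail mode and keeps serving data
    RECONSTRUCTION = auto()      # disk failed mid-copy: normal RAID reconstruction starts
    MAINTENANCE_CENTER = auto()  # copy finished, disk handed to the maintenance center
    FAILED = auto()              # copy finished, disk is failed


@dataclass
class Conditions:
    """Hypothetical inputs describing the situation when a disk becomes suspect."""
    raid_group_degraded: bool
    hot_spare_available: bool
    disk_fails_during_copy: bool
    maintenance_center_accepts: bool


def rapid_raid_recovery(c: Conditions) -> Outcome:
    # Task 1: the suspect disk is placed in pre-fail mode (always happens).
    # Tasks 2 through 4 run only when the RAID group is in normal mode
    # (assumption: otherwise the disk just stays in pre-fail mode).
    if c.raid_group_degraded:
        return Outcome.STAYS_PREFAIL
    # Task 2: select a hot spare; without one the disk stays in pre-fail mode.
    if not c.hot_spare_available:
        return Outcome.STAYS_PREFAIL
    # Task 3: copy to the spare; a failure mid-copy triggers reconstruction.
    if c.disk_fails_during_copy:
        return Outcome.RECONSTRUCTION
    # Task 4: try the maintenance center, or else fail the disk.
    if c.maintenance_center_accepts:
        return Outcome.MAINTENANCE_CENTER
    return Outcome.FAILED
```

For example, rapid_raid_recovery(Conditions(False, False, False, True)) returns Outcome.STAYS_PREFAIL, which reflects why keeping hot spares available matters: without one, the copy never starts.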

A message is sent to the log file when the Rapid RAID Recovery process is started and when it is complete. The messages are tagged "raid.rg.diskcopy.start:notice" and "raid.rg.diskcopy.done:notice".
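
To check whether a disk copy that started has also finished, the two tags can be counted in a saved copy of the log. The following Python sketch assumes only that each log line contains the tag text verbatim; the overall log line layout is not specified here, so the function summarize_diskcopy (a hypothetical helper) does a plain substring match and makes no attempt to parse timestamps or disk names.

```python
START_TAG = "raid.rg.diskcopy.start:notice"
DONE_TAG = "raid.rg.diskcopy.done:notice"


def summarize_diskcopy(lines):
    """Count Rapid RAID Recovery start and done messages in an iterable of log lines."""
    started = 0
    completed = 0
    for line in lines:
        if START_TAG in line:
            started += 1
        elif DONE_TAG in line:
            completed += 1
    # A start without a matching done suggests a copy still in progress
    # (or one that was interrupted, for example by the disk failing mid-copy).
    pending = max(started - completed, 0)
    return {"started": started, "completed": completed, "pending": pending}
```

A typical use would be to pass an open file object, for example summarize_diskcopy(open(path)) where path points to a local copy of the log, and to investigate further if the pending count is not zero.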