Using S.M.A.R.T. to monitor hard drive health


SMART stands for Self-Monitoring, Analysis and Reporting Technology. It is a monitoring system for computer hard disks that detects and reports on various indicators of reliability, in the hope of anticipating failures. In effect, SMART can be used to monitor the health of the hard drive.

Fundamentally, hard drives can suffer one of two classes of failures: predictable failures, which develop gradually, and unpredictable failures, which happen suddenly and without warning.

Mechanical failures, which are usually predictable failures, account for 60 percent of drive failures. The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventative action, such as copying the data to a replacement device. Approximately 30 percent of failures can be predicted by S.M.A.R.T.

Work at Google on over 100,000 drives has shown little overall predictive value of S.M.A.R.T. status as a whole, but certain sub-categories of information that S.M.A.R.T. implementations might track do correlate with actual failure rates. Specifically, following the first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated with higher failure probabilities. (Source: Wikipedia, SMART)

The most basic information that SMART provides is the SMART status. It provides only two values, "threshold not exceeded" or "threshold exceeded". Often these are represented as "drive OK" or "drive fail" respectively. A "threshold exceeded" value is intended to indicate that there is a relatively high probability that the drive will not be able to honor its specification in the future: that is, it's "about to fail". The predicted failure may be catastrophic or may be something as subtle as inability to write to certain sectors or slower performance than the manufacturer's minimum.

The SMART status does not necessarily indicate the drive's reliability now or in the past. If the drive has already failed catastrophically, the SMART status may be inaccessible. If the drive was experiencing problems in the past, but now the sensors indicate that the problems no longer exist, the SMART status may indicate the drive is OK, depending on the manufacturer's programming.

Let's take a look at what SMART can do through the "atactl" binary in OpenBSD.

Using SMART to see the drive's attributes

You can look at the attributes of our example drive by using the "readattr" argument, as seen below. These values cover most of the common functions of the hard drive, including retry and failure counts. This drive has not reported any errors.

root@machine: /sbin/atactl /dev/wd0c readattr

Attributes table revision: 16
ID      Attribute name                  Threshold   Value
  3     Spin Up Time                      63        176
  4     Start/Stop Count                   0        253
  5     Reallocated Sector Count          63        253
  6     Unknown                          100        253
  7     Seek Error Rate                    0        253
  8     Seek Time Performance            187        240
  9     Power-on Hours Count               0        235
 10     Spin Retry Count                 157        253
 11     Unknown                          223        253
 12     Device Power Cycle Count           0        253 
192     Power-off Retract Count            0        253 
193     Load Cycle Count                   0        253  
194     Temperature                        0         25
195     Unknown                            0        253
196     Reallocation Event Count           0        253
197     Current Pending Sector Count       0        253
198     Off-line Scan Uncorrectable Sect   0        253
199     Ultra DMA CRC Error Count          0        199
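
Reading the table by eye gets tedious, so the comparison can be scripted. By the usual SMART convention, an attribute is considered failing once its Value drops to or below its Threshold. The following is a minimal sketch, assuming the "readattr" output keeps the column layout shown above; "check_attrs" is a hypothetical helper name, not part of atactl.

```shell
#!/bin/sh
# check_attrs: read "atactl readattr"-style text on stdin and print one line
# for every attribute whose Value is at or below its Threshold.
# (Hypothetical helper for illustration, not part of atactl.)
check_attrs() {
    awk '
        # Attribute rows start with a numeric ID. The last two fields are
        # Threshold and Value; the fields in between form the name.
        $1 ~ /^[0-9]+$/ && NF >= 4 {
            threshold = $(NF - 1) + 0
            value = $NF + 0
            if (value <= threshold) {
                name = $2
                for (i = 3; i <= NF - 2; i++) name = name " " $i
                printf "%s %s: value %d <= threshold %d\n", $1, name, value, threshold
            }
        }
    '
}
```

With the function loaded in your shell, "/sbin/atactl /dev/wd0c readattr | check_attrs" should print nothing for the example drive above, because every Value is still above its Threshold.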

Using SMART to monitor the drive

You can use the binary "atactl" to monitor the SMART values of your hard drive too. In this example we are going to use the binary to check the primary hard drive of an OpenBSD system and notify us by email of any errors.

Using one or both of the following options, "atactl" will work with the drive's SMART data: the "smartenable" argument turns SMART on for the drive, and the "smartstatus" argument reads the drive's reliability status. When the check runs from cron, cron mails any output the command produces to root, so an email is only sent when an error is reported, which keeps the spam down.

Option 1: Run Once - The following runs once when the system boots. Put these lines into your /etc/rc.local to enable SMART on the primary hard drive at boot:

## SMART hard drive boot check
if [ -x /sbin/atactl ]; then
   echo -n ' smartenable'; /sbin/atactl /dev/wd0c smartenable
fi

Option 2: Run Periodically through Cron - By running the SMART check through cron you get advance warning of any serious problems with the drive. The following will run the command every morning at 5:30am. You will only receive an email to root if there is a problem with the drive.

#minute (0-59)
#|   hour (0-23)
#|   |    day of the month (1-31)
#|   |    |   month of the year (1-12 or Jan-Dec)
#|   |    |   |   day of the week (0-6 with 0=Sun or Sun-Sat)
#|   |    |   |   |   commands
#|   |    |   |   |   |
#### SMART Hard Drive Status
# Only stdout is discarded: cron mails whatever is left (the warning on stderr) to root.
30   5    *   *   *   /sbin/atactl /dev/wd0c smartstatus > /dev/null
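
If you would rather not depend on what the command prints, a small wrapper can key off the exit status instead. The following is a minimal sketch, assuming that "smartstatus" exits with status 2 when a threshold is exceeded; that is my reading of atactl(8), so verify it on your release. "smart_check" is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# smart_check CMD [ARGS...]: run CMD quietly and print a message only when
# it fails, so that cron has something to mail to root only on a problem.
# (Hypothetical wrapper; the meaning of exit status 2 is an assumption.)
smart_check() {
    rc=0
    "$@" > /dev/null 2>&1 || rc=$?
    case $rc in
        0) ;;   # healthy: stay silent so cron sends no mail
        2) echo "SMART threshold exceeded: back up this drive now" ;;
        *) echo "SMART check could not be completed (exit status $rc)" ;;
    esac
    return "$rc"
}
```

Saved in a script that ends with "smart_check /sbin/atactl /dev/wd0c smartstatus", it can be called from the crontab line in place of the raw command. A healthy drive produces no output and therefore no mail.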
