fcstat fcal_stats [ channel_name ]
fcstat device_map [ channel_name ]
link failure count
The drive will note a link failure event if it
cannot synchronize its receiver PLL for a time
greater than R_T_TOV, usually on the order of
milliseconds. A link failure is therefore a loss of
sync that lasted long enough to make the drive
initiate a Loop Initialization Primitive (LIP).
Refer to loss of sync count below.
underrun count
Underruns are detected by the Host Adapter (HA)
during a read request. The disk sends data to the
HA through the loop. If any frames are corrupted in
transit, the HA discards them and so receives less
data than expected. The driver reports the underrun
condition and retries the read. The cause of the
underrun is downstream in the loop, after the disk
being read and before the HA.
loss of sync count
The drive will note a loss of sync event if it
loses PLL synchronization for a time period less
than R_T_TOV and thereafter manages to
resynchronize. This event generally occurs when a
component, before the disk, reports loss of sync up
to and including the previous active component in
the loop. Disks that are on the shelf borders are
subject to seeing higher loss of sync counts than
disks that are not on a border.
invalid CRC count
Every frame received by a drive contains a checksum
that covers all data in the frame. If upon
receiving the frame, the checksum does not match,
the invalid CRC counter is incremented and the
frame is "dropped". Generally, the disk that
reports the CRC error is not at fault; rather, a
component between the Host Adapter (which
originated the write request) and the reporting
drive has corrupted the frame.
frame in count / frame out count
These counts represent the total number of frames
received and transmitted by a device on the loop.
The number of frames received by the Host Adapter
is equal to the sum of all of the frames
transmitted from all of the disks. Similarly, the
number of frames transmitted by the Host Adapter is
equal to the sum of all frames received by all of
the disks.
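The frame count relationship above can be checked mechanically. The following is a minimal Python sketch, not part of fcstat: the function name `frame_balance` and the sample counters are hypothetical, and the counters are assumed to have been copied by hand from link_stats output.

```python
def frame_balance(ha_in, ha_out, disks):
    """Return (in_delta, out_delta) for one loop.

    ha_in / ha_out are the Host Adapter frame in / frame out counts.
    disks maps a loop ID to a (frame_in, frame_out) pair.
    Both deltas are 0 when the counts agree with the rule above.
    """
    disk_in = sum(frame_in for frame_in, _ in disks.values())
    disk_out = sum(frame_out for _, frame_out in disks.values())
    # Frames received by the HA should equal frames sent by the disks,
    # and frames sent by the HA should equal frames received by the disks.
    return ha_in - disk_out, ha_out - disk_in


# Hypothetical counters, not taken from a real system.
disks = {"4.16": (100, 250), "4.17": (120, 300)}
print(frame_balance(ha_in=550, ha_out=220, disks=disks))  # (0, 0)
```

In practice the counts rarely balance exactly, because drive counters persist across reboots while host adapter counters do not, so a nonzero delta is not by itself a sign of trouble.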
The occurrence of any of the error events may result in loop disruption. A link failure is considered the most serious since it may indicate a transmitter problem that is affecting loop signal integrity upstream of the drive. These events will typically result in frames being dropped and may result in data underruns or SCSI command timeouts.
Note that loop disruptions of this type, even though potentially resulting in data underruns and/or SCSI command timeouts, will not result in data corruption. The host adapter driver will detect such events and will retry the associated commands. The worst-case effect is a negligible drop in performance.
All drive counters are persistent across node reboots and drive resets and can only be cleared by power-cycling the drives. Host adapter counters, for example underruns, are reset with each reboot.
Counts are not persistent across node reboots.
The relative physical position of drives on a loop is not necessarily directly related to their loop IDs (which are in turn determined by the drive shelf IDs). The device_map sub-command is helpful therefore in determining relative physical position on the loop.
Two pieces of information are displayed, (a) the physical relative position on the loop as if the loop was one flat space, and (b) the mapping of devices to shelves, to aid in quick correlation of disk ID with shelf tenancy.
Suppose a running node shows symptoms of loop signal integrity problems. For example, the syslog shows SCSI commands being aborted (and retried) due to frame parity/CRC errors.
To isolate the faulty component on this loop, we collect the output of link_stats and device_map.
toaster> fcstat link_stats 4
Loop      Link  Underrun  Loss of  Invalid   Frame In  Frame Out
ID     Failure     count     sync      CRC      count      count
         count              count    count
4.29         0         0      180        0        787       2277
4.28         0         0       26        0        787       2277
4.27         0         0        3        0        787       2277
4.26         0         0       13        0        788       2274
4.25         0         0       27        0        779       2269
4.24         0         0        2        0        787       2277
4.23         0         0       11        0        786       2274
4.22         0         0       83        0        786       2274
4.21         0         0        3        0        786       2274
4.20         0         0       11        0        786       2274
4.19         0         0       14        0        779       2277
4.18         0         0       26        0        786       2274
4.17         0         0       10        0        787       2274
4.16         0         0       90        0        779       2269
4.45         0         0       12        0     183015     179886
4.44         0         0       16        0    1830107   17990797
4.43         0         0        7       11    1829974   17988806
4.42         0         0       13       33    1968944   18123526
4.41         0         0       14       23    1843636   17989836
4.40         0         0       13       11    1828782   17990036
4.39         0         0       14      138    4740596   18459648
4.38         0         0       11       27    1832428   17133866
4.37         0         0       43       22    1839572   17994200
4.36         0         0       13      130    4740446   18468932
4.35         0         0       11       23    1844301   17994200
4.34         0         0       14       25    1832428   17133866
4.33         0         0       26       29    1839572   17894220
4.32         0         0      110       31    1740446   18268912
4.61         0         0       50       23    1844301   17994200
4.60         0         0       12       21    1830150   18188148
4.59         0         0       16       19    1830107   17990997
4.58         0         0        7       27    1829974   17988904
4.57         0         0       13       25    1968944   18123526
4.50         0         0       14       19    1843636   17889830
4.49         0         0       13       22    1828782   18090042
4.48         0         0      114      130    4740596   18459648
4.ha         0         0        1        0  396255820   51468458
Loop Map for channel 4:
Translated Map: Port Count 37
    7  29  28  27  26  25  24  23  22  21  20  19  18  17  16  45
   44  43  42  41  40  39  38  37  36  35  34  33  32  61  60  59
   58  57  50  49  48
Shelf mapping:
   Shelf 1:  29  28  27  26  25  24  23  22  21  20  19  18  17  16
   Shelf 2:  45  44  43  42  41  40  39  38  37  36  35  34  33  32
   Shelf 3:  61  60  59  58  57 XXX XXX XXX XXX XXX XXX  50  49  48
From the output of device_map we see the following:
Drive 29 is the first component on the loop immediately downstream from the host adapter. (Note that the host adapter port (7) will always appear first on the position map.)
Shelf 3 has 6 slots that do not have any disks, which are represented by `XXX'. If a slot shows `BYP', the slot is bypassed by an embedded switched hub (ESH).
Shelf 1 is connected to shelf 2 between drives 16 and 45. Shelf 2 is connected to shelf 3 between drives 32 and 61.
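The shelf connection points listed above can also be derived from the two parts of the device_map output. The following Python sketch is illustrative only: `shelf_links` is a hypothetical helper, and the translated map and shelf mapping are assumed to have been transcribed by hand into Python lists.

```python
def shelf_links(loop_order, shelves):
    """Return (upstream, downstream) drive pairs where the loop changes shelf.

    loop_order is the translated map in loop order.
    shelves maps a shelf number to the drive IDs it holds.
    IDs that belong to no shelf (such as the host adapter port) are skipped.
    """
    shelf_of = {drive: shelf for shelf, drives in shelves.items() for drive in drives}
    links = []
    for upstream, downstream in zip(loop_order, loop_order[1:]):
        if upstream in shelf_of and downstream in shelf_of:
            if shelf_of[upstream] != shelf_of[downstream]:
                links.append((upstream, downstream))
    return links


# Values transcribed from the device_map output for channel 4.
shelf3 = [61, 60, 59, 58, 57, 50, 49, 48]
loop_order = [7] + list(range(29, 15, -1)) + list(range(45, 31, -1)) + shelf3
shelves = {1: range(16, 30), 2: range(32, 46), 3: shelf3}
print(shelf_links(loop_order, shelves))  # [(16, 45), (32, 61)]
```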
From the output of link_stats we can see the following:
There is a higher loss of sync count for the drive connected to the host adapter. Since every node reboot involves re-initialization of the host adapters, we expect the first drive on the loop to see a higher loss of sync count.
Disks 4.16 through 4.29 are probably spares as they have relatively small frame counts.
CRC errors are first reported by drive 4.43. Assuming that there is only one cause of all the CRC errors, then the failing component is located between the Host Adapter and drive 4.43.
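The reasoning about the first drive to report CRC errors can be expressed as a short search in loop order. This is a minimal Python sketch under stated assumptions: `first_crc_reporter` is a hypothetical helper, the loop order comes from the device_map output (host adapter port omitted), and the CRC counts are a hand-copied excerpt of the link_stats output.

```python
def first_crc_reporter(loop_order, crc_counts):
    """Return the first drive in loop order with a nonzero invalid CRC count.

    Drives missing from crc_counts are treated as having a count of 0.
    Returns None if no drive reports CRC errors.
    """
    for drive in loop_order:
        if crc_counts.get(drive, 0) > 0:
            return drive
    return None


# Loop order from the device_map output, starting just after the host adapter.
loop_order = (list(range(29, 15, -1)) + list(range(45, 31, -1))
              + [61, 60, 59, 58, 57, 50, 49, 48])
# Excerpt of invalid CRC counts from the link_stats output (shelf 1 is all 0).
crc_counts = {45: 0, 44: 0, 43: 11, 42: 33, 41: 23, 40: 11}
print(first_crc_reporter(loop_order, crc_counts))  # 43
```

If there is a single cause, the suspect region is everything upstream of the returned drive, back to the host adapter.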
Since drive 4.43 is in shelf 2, it is possible that the errors are being caused by faulty components connecting the shelves. To isolate the problem, we want to see whether it is related to any of the shelf connection points. We can do this by running a disk write test on the first shelf of disks with the following command (available only in maintenance mode, so a reboot is necessary):
*> disktest -W -s 4:1
where:
   -W       Write workload, since CRC errors only occur on writes
   -s 4:1   Test only shelf 1 on adapter 4
If errors are seen testing shelf 1, then it is likely that the faulty component is either the cable or the I/O module between the host adapter and the first drive. If no errors are seen testing shelf 1, then the test should be run on shelf 2. If errors are seen testing shelf 2, the faulty component could be the connection between shelves 1 and 2. A plan of action would involve (a) replacing the cables between shelves 1 and 2, or between the HA and shelf 1, and (b) replacing the I/O modules at the faulty connection point.
Example of a link status for Shared Storage configurations
The following link status shows a Shared Storage configuration:
ferris> fcstat link_stats
Targets on channel 4a:
Loop                Link  Underrun  Loss of  Invalid   Frame In  Frame Out
ID               Failure     count     sync      CRC      count      count
                   count              count    count
4a.80                  1         0        9        0          0          0
4a.81                  1         0        3        0          0          0
4a.82                  1         0       13        0          0          0
4a.83                  1         0        3        0          0          0
4a.84                  1         0        3        0          0          0
4a.86                  1         0        3        0          0          0
4a.87                  1         0        3        0          0          0
4a.88                  1         0        3        0          0          0
4a.89                  1         0        3        0          0          0
4a.91                  1         0       10        0          0          0
4a.92                  1         0        3        0          0          0
4a.93                  1         0      264        0          0          0

Initiators on channel 4a:
Loop                Link  Underrun  Loss of  Invalid   Frame In  Frame Out
ID               Failure     count     sync      CRC      count      count
                   count              count    count
4a.0 (self)            0         0        0        0          0          0
4a.7 (toaster)         0         0        0        0          0          0
The local node has a loop id of 0 on this loop, and the node named toaster has a loop id of 7 on this loop.
Example of a device map for Shared Storage configurations
The following device map shows a Shared Storage configuration:
ferris> fcstat device_map
Loop Map for channel 4a:
Translated Map: Port Count 14
    0  80  81  82  83  84  86  87  88  89  91  92  93   7
Shelf mapping:
   Shelf 5:  93  92  91 XXX  89  88  87  86 XXX  84  83  82  81  80
Initiators on this loop:
    0 (self)    7 (toaster)
From the output of device_map we see the following:
Both slot 6a and 6b are attached to Shelves 1 and 6.
Each loop has four nodes connected to it. On both loops, the loop id of node `ha15' is 0, the loop id of the local node, `ha16', is 1, the loop id of node `ha17' is 2, the loop id of the local node, `ha18', is 7.
Example of a device map for switch attached drives
The following device map shows a configuration where a set of shelves is connected via a switch:
toaster> fcstat device_map
Loop Map for channel 9:
Translated Map: Port Count 43
    7  32  33  34  35  36  37  38  39  40  41  42  43  44  45  16
   17  18  19  20  21  22  23  24  25  26  27  28  29  64  65  66
   67  68  69  70  71  72  73  74  75  76  77
Shelf mapping:
   Shelf 1:  29  28  27  26  25  24  23  22  21  20  19  18  17  16
   Shelf 2:  45  44  43  42  41  40  39  38  37  36  35  34  33  32
   Shelf 4:  77  76  75  74  73  72  71  70  69  68  67  66  65  64

Loop Map for channel sw2:0:
Translated Map: Port Count 15
  126  93  92  89  91  90  88  87  86  85  84  83  80  82  81
Shelf mapping:
   Shelf 5:  93  92  91  90  89  88  87  86  85  84  83  82  81  80
From the output of device_map we see the following:
The first set of shelves is connected to a host adapter in slot 9.
The disks of shelf 5 are connected via a switch `sw2' at its port 0. The switch port is 126 and appears first in the translated map.