Disk I/O errors on an Adaptec ASR8805 RAID controller

We have an Adaptec ASR8805 controller on one of the servers we manage. For various reasons we need to shrink a logical volume sitting on a RAID 6 logical device created and exposed by this controller, but we can’t, because we’re getting read errors:

Buffer I/O error on device dm-2, logical block 3330419721
sd 6:0:1:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 6:0:1:0: [sdb]  Sense Key : Hardware Error [current] 
sd 6:0:1:0: [sdb]  Add. Sense: Internal target failure
sd 6:0:1:0: [sdb] CDB: Read(16): 88 00 00 00 00 06 34 11 69 00 00 00 01 00 00 00
end_request: critical target error, dev sdb, sector 26643360000
sd 6:0:1:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

From what the controller reports, the RAID 6 is healthy, and the SMART data of all the physical drives seems OK(ish).
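In case it’s useful, this is more or less how I checked: arcconf getconfig shows the logical and physical device state, and smartctl can query the drives behind the controller (the aacraid H,L,ID values below are placeholders and depend on how the kernel enumerated the controller):

# arcconf getconfig 1 ld
# arcconf getconfig 1 pd
# smartctl -d aacraid,0,0,0 -a /dev/sdb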

It turns out that no background checking of the RAID 6 parity had been enabled, and that is probably the problem, as reported in this article.

To get a “quick” fix (it’s a 24T array), I started:

# arcconf task start 1 logicaldrive 1 verify_fix
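The progress of the verify can be followed with:

# arcconf getstatus 1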

When it finishes, I’ll enable the background check with:

# arcconf consistencycheck 1 on
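To double-check that it actually took effect, the controller-wide settings should then list the background consistency check as enabled (the exact wording of the output may vary between arcconf/firmware versions):

# arcconf getconfig 1 ad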

I really hope this saves some time for a fellow admin out there :)

How to display IOwait percentage in Prometheus

Prometheus has a few quirks, and dealing with CPU time is one of them. This article explains how to deal with CPU time, and these are the rules I made for my own Prometheus/Grafana dashboard:

avg by (instance) (irate(node_cpu{mode="iowait"}[1m])) * 100

This rule averages the iowait percentage over all CPUs of a system, grouped by instance.
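If you reuse the query in several panels, you can also save it as a recording rule so Grafana queries a precomputed series; here is a minimal sketch for a Prometheus 2.x rules file (the rule name instance:node_cpu_iowait:percent is my own choice):

groups:
  - name: cpu
    rules:
      # precompute the per-instance iowait percentage
      - record: instance:node_cpu_iowait:percent
        expr: avg by (instance) (irate(node_cpu{mode="iowait"}[1m])) * 100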

avg by (instance) (irate(node_cpu{mode="iowait", instance=~"hostname.*"}[1m])) * 100

This rule is like the one above, except that it lets you filter which systems are reported, by matching the hostname.
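In a Grafana dashboard, that hostname filter works nicely as a template variable; a sketch, assuming a dashboard variable named $instance (my own name, not a Grafana built-in):

avg by (instance) (irate(node_cpu{mode="iowait", instance=~"$instance.*"}[1m])) * 100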

Hopefully this will be useful to someone out there :)