2005-10-12 00:02:12

by Hubert Tonneau

Subject: MPT fusion driver, better but still buggy at errors handling under 2.6

The MPT Fusion driver under Linux 2.6 still fails to recover properly from
minor SCSI trouble by resetting and then resending the pending commands, as
any good Linux SCSI driver does transparently.

Here is the behaviour of the MPT Fusion driver I noticed on various Linux kernels:
2.4.xx recovers gracefully (only reading the kernel log reveals
that a minor problem happened).
2.6.xx, xx <= 11, enters an infinite loop attempting to reset the SCSI bus,
so the box becomes unusable and requires a hard power cycle to recover.
2.6.12 untested.
2.6.13 puts the disk offline, so the box continues to run, but Linux
software RAID removes the disk. A software reboot is required
to get the disk back online, and a 'raidhotadd' is also needed to get
the whole service back online. Also, this is a remote server, and the
box did not come up after the remote software reboot request :-(

I reported this bug against Linux 2.6.5 on Fri, 23 Apr 2004,
and against Linux 2.6.9 on Wed, 26 Jan 2005,
and the bug is still there.


Here is the 2.6.13 kernel console:

<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1adb00)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1adb00)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad980)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad980)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad800)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad800)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad680)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad680)
<6>mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad500)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad500)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad380)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad380)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad200)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad200)
<6>mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad080)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1ad080)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39ebe00)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39ebe00)
<6>mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39ebc80)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39ebc80)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39ebb00)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39ebb00)
<6>mptbase: ioc0: IOCStatus(0x0048): SCSI Task Terminated
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39eb980)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting task abort! (sc=f39eb980)
<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed
<4>mptscsih: ioc0: >> Attempting target reset! (sc=da1adb00)
<6>mptbase: Initiating ioc0 recovery
<6>mptbase: ioc0: IOCStatus(0x0047): SCSI Protocol Error
<4>mptscsih: ioc0: >> Attempting target reset! (sc=da1ad500)
<4>mptscsih: ioc0: >> Attempting bus reset! (sc=da1adb00)
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<4>mptscsih: ioc0: >> Attempting host reset! (sc=da1adb00)
<6>mptbase: Initiating ioc0 recovery
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
<3>scsi0 (0:0): rejecting I/O to offline device
<1>raid1: Disk failure on sda1, disabling device.
<4> Operation continuing on 1 devices
<3>scsi0 (0:0): rejecting I/O to offline device
<3>scsi0 (0:0): rejecting I/O to offline device
<3>scsi0 (0:0): rejecting I/O to offline device
<3>scsi0 (0:0): rejecting I/O to offline device
<3>scsi0 (0:0): rejecting I/O to offline device
<4>RAID1 conf printout:
<4> --- wd:1 rd:2
<4> disk 0, wo:1, o:0, dev:sda1
<4> disk 1, wo:0, o:1, dev:sdb1
<4>RAID1 conf printout:
<4> --- wd:1 rd:2
<4> disk 1, wo:0, o:1, dev:sdb1

and after a further attempt to read the disk without rebooting:

<3>scsi0 (0:0): rejecting I/O to offline device
<3>Buffer I/O error on device sda1, logical block 17649664
<3>Buffer I/O error on device sda1, logical block 17649665
<3>Buffer I/O error on device sda1, logical block 17649666
<3>Buffer I/O error on device sda1, logical block 17649667
<3>Buffer I/O error on device sda1, logical block 17649668
<3>scsi0 (0:0): rejecting I/O to offline device
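
The escalation in the 2.6.13 console output above (task abort, target reset, bus reset, host reset, then offline) is easier to see once the repeated lines are tallied. The following is a minimal sketch, assuming only the two message shapes that appear in that log; `summarize` and the regular expressions are illustrative names and are not part of the driver or of any kernel tool.

```python
import re
from collections import Counter

# Patterns for the two message shapes seen in the 2.6.13 console output.
IOC_STATUS = re.compile(r"IOCStatus\((0x[0-9a-fA-F]{4})\): (.+)$")
ATTEMPT = re.compile(
    r"Attempting (task abort|target reset|bus reset|host reset)! "
    r"\(sc=([0-9a-f]+)\)"
)

def summarize(lines):
    """Tally IOCStatus codes and recovery attempts found in log lines."""
    statuses = Counter()   # (code, description) -> occurrences
    attempts = Counter()   # recovery step -> occurrences
    commands = set()       # distinct sc= pointers that were acted on
    for line in lines:
        line = line.rstrip()
        m = IOC_STATUS.search(line)
        if m:
            statuses[(m.group(1), m.group(2))] += 1
            continue
        m = ATTEMPT.search(line)
        if m:
            attempts[m.group(1)] += 1
            commands.add(m.group(2))
    return statuses, attempts, commands

# A few lines copied from the log above, used as a small demonstration.
sample = [
    "<4>mptscsih: ioc0: >> Attempting task abort! (sc=da1adb00)",
    "<6>mptbase: ioc0: IOCStatus(0x004a): SCSI Task Management Failed",
    "<4>mptscsih: ioc0: >> Attempting bus reset! (sc=da1adb00)",
    "<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There",
    "<6>mptbase: ioc0: IOCStatus(0x0043): SCSI Device Not There",
]

statuses, attempts, commands = summarize(sample)
for (code, text), count in sorted(statuses.items()):
    print(code, text, count)
for step, count in attempts.items():
    print(step, count)
```

Passing the lines of a saved kernel log to `summarize` instead of `sample` gives the same tallies for a whole incident.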


2005-10-12 06:26:12

by Hubert Tonneau

Subject: Re: MPT fusion driver, better but still buggy at errors handling under 2.6

Hubert Tonneau wrote:
>
> Also, this is a remote server, and the
> box did not come up after the remote software reboot request :-(

The Linux 2.6.13 MPT Fusion driver left the SCSI controller in such a bad
state that it locked up in its own BIOS startup code as a result of the software
reboot request. A power cycle was required to reset everything properly.

2005-10-12 17:34:55

by Hubert Tonneau

Subject: Re: MPT fusion driver, better but still buggy at errors handling under 2.6

Hubert Tonneau wrote:
>
> 2.4.xx recovers gracefully (only reading the kernel log reveals
> that a minor problem happened).

Here is the log of the graceful recovery that Linux 2.4.31 performs, as
opposed to Linux 2.6:

<4>scsi : aborting command due to timeout : pid 2232433, scsi0, channel 0, id 0, lun 0 0x2a 00 05 b6 3d 3f 00 00 80 00
<4>mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f79af800)
<4>scsi : aborting command due to timeout : pid 2232436, scsi0, channel 0, id 0, lun 0 0x2a 00 05 b6 3d bf 00 00 38 00
<4>mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f79af600)
<4>scsi : aborting command due to timeout : pid 2232437, scsi0, channel 0, id 0, lun 0 0x2a 00 05 b6 3d f7 00 00 48 00
<4>mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f79afa00)
<4>SCSI host 0 abort (pid 2232433) timed out - resetting
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af800)
<4>SCSI host 0 abort (pid 2232436) timed out - resetting
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af600)
<4>SCSI host 0 abort (pid 2232437) timed out - resetting
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79afa00)
<4>SCSI host 0 channel 0 reset (pid 2232433) timed out - trying harder
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af800)
<4>SCSI host 0 channel 0 reset (pid 2232437) timed out - trying harder
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79afa00)
<4>SCSI host 0 reset (pid 2232433) timed out again -
<4>probably an unrecoverable SCSI bus or device hang.
<4>SCSI host 0 reset (pid 2232437) timed out again -
<4>probably an unrecoverable SCSI bus or device hang.
<4>SCSI Error: (0:0:0) Status=02h (CHECK CONDITION)
<4> Key=6h (UNIT ATTENTION); FRU=00h
<4> ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
<4> CDB: 2A 00 05 B6 3D BF 00 00 38 00
<4>
<4>SCSI Error: (0:1:0) Status=02h (CHECK CONDITION)
<4> Key=6h (UNIT ATTENTION); FRU=00h
<4> ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
<4> CDB: 28 00 05 B6 3E 3F 00 00 80 00
<4>
<4>scsi : aborting command due to timeout : pid 2232577, scsi0, channel 0, id 0, lun 0 0x2a 00 05 b6 41 3f 00 00 80 00
<4>mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f79af600)
<4>scsi : aborting command due to timeout : pid 2232580, scsi0, channel 0, id 0, lun 0 0x2a 00 05 b6 41 bf 00 00 18 00
<4>mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f79af800)
<4>scsi : aborting command due to timeout : pid 2232581, scsi0, channel 0, id 0, lun 0 0x2a 00 05 b6 41 d7 00 00 68 00
<4>mptscsih: OldAbort scheduling ABORT SCSI IO (sc=f79afa00)
<4>SCSI host 0 abort (pid 2232577) timed out - resetting
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af600)
<4>SCSI host 0 abort (pid 2232580) timed out - resetting
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af800)
<4>SCSI host 0 abort (pid 2232581) timed out - resetting
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79afa00)
<4>SCSI host 0 channel 0 reset (pid 2232577) timed out - trying harder
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af600)
<4>SCSI host 0 channel 0 reset (pid 2232580) timed out - trying harder
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79af800)
<4>SCSI host 0 channel 0 reset (pid 2232581) timed out - trying harder
<4>SCSI bus is being reset for host 0 channel 0.
<4>mptscsih: OldReset scheduling BUS_RESET (sc=f79afa00)
<4>SCSI host 0 reset (pid 2232577) timed out again -
<4>probably an unrecoverable SCSI bus or device hang.
<4>SCSI host 0 reset (pid 2232580) timed out again -
<4>probably an unrecoverable SCSI bus or device hang.
<4>SCSI host 0 reset (pid 2232581) timed out again -
<4>probably an unrecoverable SCSI bus or device hang.
<4>SCSI Error: (0:0:0) Status=02h (CHECK CONDITION)
<4> Key=6h (UNIT ATTENTION); FRU=00h
<4> ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
<4> CDB: 2A 00 0D 38 DB 87 00 00 08 00
<4>
<4>SCSI Error: (0:1:0) Status=02h (CHECK CONDITION)
<4> Key=6h (UNIT ATTENTION); FRU=00h
<4> ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
<4> CDB: 2A 00 0D 38 E2 C7 00 00 10 00
<4>
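
The CDB bytes in the 2.4.31 log above show which commands were in flight when the timeouts hit. The following is a minimal sketch for decoding them, assuming the standard 10-byte READ(10)/WRITE(10) layout: opcode in byte 0, big-endian logical block address in bytes 2 to 5, transfer length in bytes 7 to 8. `decode_cdb10` and `OPCODES` are illustrative names, not anything from the driver.

```python
# Opcode names for the two 10-byte commands that appear in the log.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

def decode_cdb10(text):
    """Decode a 10-byte CDB written as space-separated hex bytes.

    Accepts both spellings used in the log, e.g.
    "2A 00 05 B6 3D BF 00 00 38 00" and "0x2a 00 05 b6 3d 3f 00 00 80 00".
    Returns (command name, logical block address, length in blocks).
    """
    data = bytes(int(tok, 16) for tok in text.split())
    if len(data) != 10:
        raise ValueError("expected 10 bytes, got %d" % len(data))
    opcode = data[0]
    lba = int.from_bytes(data[2:6], "big")      # bytes 2..5
    length = int.from_bytes(data[7:9], "big")   # bytes 7..8
    name = OPCODES.get(opcode, "unknown opcode 0x%02x" % opcode)
    return name, lba, length

# CDBs copied from the log above.
for cdb in ("2A 00 05 B6 3D BF 00 00 38 00",
            "28 00 05 B6 3E 3F 00 00 80 00"):
    name, lba, length = decode_cdb10(cdb)
    print("%s lba=%d blocks=%d" % (name, lba, length))
```

Read this way, the timed-out commands reported for id 0 in this log are all writes (opcode 0x2a).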