Hello,
just upgraded a server running 3.2.54-2 to 3.2.57-3 (Debian Wheezy) and it
does not boot anymore because of isci driver breakage.
A (partial) log transcription:
sas: DOING DISCOVERY on port 0, pid:5
sas: Enter sas_scsi_recover_host
ata1: sas eh calling libata port error handler
sas: sas_ata_hard_reset: Unable to reset I T nexus?
sas: sas_ata_hard_reset: Found ATA device.
sas: sas_ata_hard_reset: Unable to soft reset
sas: sas_ata_hard_reset: Found ATA device.
ata1: reset failed (errno=-11), retrying in 10 secs
sas: sas_ata_hard_reset: Unable to reset I T nexus?
sas: sas_ata_hard_reset: Found ATA device.
sas: sas_ata_hard_reset: Unable to soft reset
sas: sas_ata_hard_reset: Found ATA device.
ata1: reset failed (errno=-11), retrying in 35 secs
ata1: reset failed, giving up
sas: --- Exit sas_scsi_recover_host
sas: DONE DISCOVERY on port 0, pid: 5, result:0
sas: phy-0:1 added to port-0:1, phy_mask:0x2 (5fcfffff00000002)
sas: DOING DISCOVERY on port 1, pid:5
sas: Enter sas_scsi_recover_host
ata1: sas eh calling libata port error handler
sas: sas_ata_hard_reset: Unable to reset I T nexus?
sas: sas_ata_hard_reset: Found ATA device.
sas: sas_ata_hard_reset: Unable to soft reset
sas: sas_ata_hard_reset: Found ATA device.
ata2: reset failed (errno=-11), retrying in 10 secs
sas: sas_ata_hard_reset: Unable to reset I T nexus?
sas: sas_ata_hard_reset: Found ATA device.
sas: sas_ata_hard_reset: Unable to soft reset
sas: sas_ata_hard_reset: Found ATA device.
ata2: reset failed (errno=-11), retrying in 35 secs
ata2: reset failed, giving up
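For anyone reading the retries above: errno=-11 is -EAGAIN on Linux, i.e. the reset path is reporting "try again" rather than a hard I/O error. A minimal user-space sketch for decoding such codes follows; the helper name reset_errno_name is made up for illustration and is not part of libsas, libata or isci.

```c
#include <errno.h>

/* Hypothetical helper (illustration only, not kernel code): map a
 * negative kernel-style return code, as printed in "errno=-11", to
 * its symbolic name. Only a few codes relevant to this thread are
 * covered; everything else is reported as "unknown". */
static const char *reset_errno_name(int rc)
{
	switch (-rc) {
	case EAGAIN:
		return "EAGAIN";
	case ENODEV:
		return "ENODEV";
	case EIO:
		return "EIO";
	default:
		return "unknown";
	}
}
```

On Linux, reset_errno_name(-11) gives "EAGAIN", which matches the two timed retries (10 secs, 35 secs) before libata gives up.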
It should look like this (v3.2.54-2):
isci: Intel(R) C600 SAS Controller Driver - version 1.0.0
isci 0000:03:00.0: driver configured for rev: 6 silicon
isci 0000:03:00.0: firmware: agent loaded isci/isci_firmware.bin into memory
isci 0000:03:00.0: OEM SAS parameters (version: 1.3) loaded (firmware)
isci 0000:03:00.0: setting latency timer to 64
scsi0 : isci
scsi1 : isci
isci 0000:03:00.0: irq 81 for MSI/MSI-X
isci 0000:03:00.0: irq 82 for MSI/MSI-X
isci 0000:03:00.0: irq 83 for MSI/MSI-X
isci 0000:03:00.0: irq 84 for MSI/MSI-X
sas: phy-0:0 added to port-0:0, phy_mask:0x1 (5fcfffff00000001)
sas: DOING DISCOVERY on port 0, pid:5
sas: Enter sas_scsi_recover_host
ata1: sas eh calling libata port error handler
sas: sas_ata_hard_reset: Found ATA device.
ata1.00: ATA-8: ST9500620NS, CC02, max UDMA/133
ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
sas: --- Exit sas_scsi_recover_host
scsi 0:0:0:0: Direct-Access ATA ST9500620NS CC02 PQ: 0 ANSI: 5
sas: DONE DISCOVERY on port 0, pid:5, result:0
sas: phy-0:1 added to port-0:1, phy_mask:0x2 (5fcfffff00000002)
sas: DOING DISCOVERY on port 1, pid:5
sas: Enter sas_scsi_recover_host
ata1: sas eh calling libata port error handler
ata2: sas eh calling libata port error handler
sas: sas_ata_hard_reset: Found ATA device.
ata2.00: ATA-8: ST9500620NS, CC02, max UDMA/133
ata2.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata2.00: configured for UDMA/133
sas: --- Exit sas_scsi_recover_host
scsi 0:0:1:0: Direct-Access ATA ST9500620NS CC02 PQ: 0 ANSI: 5
sas: DONE DISCOVERY on port 1, pid:5, result:0
--
Ondrej Zary
On Monday 28 April 2014 17:50:29 Jiang, Dave wrote:
> On Mon, 2014-04-28 at 13:03 +0200, Ondrej Zary wrote:
> > Hello,
> > just upgraded a server running 3.2.54-2 to 3.2.57-3 (Debian Wheezy) and
> > it does not boot anymore because of isci driver breakage.
>
> I would not run anything less than 3.8 for the isci controller. 3.2 is
> VERY old for that particular driver and likely very unstable. The
> product version of that driver plus libsas started with 3.8. Also I'm
> concerned that you aren't using the platform OEM parameters. You need to
> turn your OROM or EFI driver on for the SAS controller.
It's a Cisco UCS C22 M3 server with a crappy LSI fakeraid that cannot even be
disabled. It was a pain to make it boot properly - had to use dmraid. But it
has been working fine since then (2012). Until now.
I guess that it could be caused by the following commit but haven't tested it:
commit 584ec12265192bf49dfa270d517380f6723a6956
Author: Dan Williams <[email protected]>
Date: Thu Feb 6 12:23:01 2014 -0800
> > A (partial) log transcription:
> > sas: DOING DISCOVERY on port 0, pid:5
> > sas: Enter sas_scsi_recover_host
> > ata1: sas eh calling libata port error handler
> > sas: sas_ata_hard_reset: Unable to reset I T nexus?
> > sas: sas_ata_hard_reset: Found ATA device.
> > sas: sas_ata_hard_reset: Unable to soft reset
> > sas: sas_ata_hard_reset: Found ATA device.
> > ata1: reset failed (errno=-11), retrying in 10 secs
> > sas: sas_ata_hard_reset: Unable to reset I T nexus?
> > sas: sas_ata_hard_reset: Found ATA device.
> > sas: sas_ata_hard_reset: Unable to soft reset
> > sas: sas_ata_hard_reset: Found ATA device.
> > ata1: reset failed (errno=-11), retrying in 35 secs
> > ata1: reset failed, giving up
> > sas: --- Exit sas_scsi_recover_host
> > sas: DONE DISCOVERY on port 0, pid: 5, result:0
> > sas: phy-0:1 added to port-0:1, phy_mask:0x2 (5fcfffff00000002)
> > sas: DOING DISCOVERY on port 1, pid:5
> > sas: Enter sas_scsi_recover_host
> > ata1: sas eh calling libata port error handler
> > sas: sas_ata_hard_reset: Unable to reset I T nexus?
> > sas: sas_ata_hard_reset: Found ATA device.
> > sas: sas_ata_hard_reset: Unable to soft reset
> > sas: sas_ata_hard_reset: Found ATA device.
> > ata2: reset failed (errno=-11), retrying in 10 secs
> > sas: sas_ata_hard_reset: Unable to reset I T nexus?
> > sas: sas_ata_hard_reset: Found ATA device.
> > sas: sas_ata_hard_reset: Unable to soft reset
> > sas: sas_ata_hard_reset: Found ATA device.
> > ata2: reset failed (errno=-11), retrying in 35 secs
> > ata2: reset failed, giving up
> >
> >
> > It should look like this (v3.2.54-2):
> > isci: Intel(R) C600 SAS Controller Driver - version 1.0.0
> > isci 0000:03:00.0: driver configured for rev: 6 silicon
> > isci 0000:03:00.0: firmware: agent loaded isci/isci_firmware.bin into memory
> > isci 0000:03:00.0: OEM SAS parameters (version: 1.3) loaded (firmware)
> > isci 0000:03:00.0: setting latency timer to 64
> > scsi0 : isci
> > scsi1 : isci
> > isci 0000:03:00.0: irq 81 for MSI/MSI-X
> > isci 0000:03:00.0: irq 82 for MSI/MSI-X
> > isci 0000:03:00.0: irq 83 for MSI/MSI-X
> > isci 0000:03:00.0: irq 84 for MSI/MSI-X
> > sas: phy-0:0 added to port-0:0, phy_mask:0x1 (5fcfffff00000001)
> > sas: DOING DISCOVERY on port 0, pid:5
> > sas: Enter sas_scsi_recover_host
> > ata1: sas eh calling libata port error handler
> > sas: sas_ata_hard_reset: Found ATA device.
> > ata1.00: ATA-8: ST9500620NS, CC02, max UDMA/133
> > ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> > ata1.00: configured for UDMA/133
> > sas: --- Exit sas_scsi_recover_host
> > scsi 0:0:0:0: Direct-Access ATA ST9500620NS CC02 PQ: 0 ANSI: 5
> > sas: DONE DISCOVERY on port 0, pid:5, result:0
> > sas: phy-0:1 added to port-0:1, phy_mask:0x2 (5fcfffff00000002)
> > sas: DOING DISCOVERY on port 1, pid:5
> > sas: Enter sas_scsi_recover_host
> > ata1: sas eh calling libata port error handler
> > ata2: sas eh calling libata port error handler
> > sas: sas_ata_hard_reset: Found ATA device.
> > ata2.00: ATA-8: ST9500620NS, CC02, max UDMA/133
> > ata2.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> > ata2.00: configured for UDMA/133
> > sas: --- Exit sas_scsi_recover_host
> > scsi 0:0:1:0: Direct-Access ATA ST9500620NS CC02 PQ: 0 ANSI: 5
> > sas: DONE DISCOVERY on port 1, pid:5, result:0
--
Ondrej Zary
[ adding Ben ]
On Mon, Apr 28, 2014 at 10:22 AM, Ondrej Zary
<[email protected]> wrote:
> On Monday 28 April 2014 18:51:44 Jiang, Dave wrote:
>> On Mon, 2014-04-28 at 16:28 +0000, Ondrej Zary wrote:
>> > On Monday 28 April 2014 17:50:29 Jiang, Dave wrote:
>> > > On Mon, 2014-04-28 at 13:03 +0200, Ondrej Zary wrote:
>> > > > Hello,
>> > > > just upgraded a server running 3.2.54-2 to 3.2.57-3 (Debian Wheezy)
>> > > > and it does not boot anymore because of isci driver breakage.
>> > >
>> > > I would not run anything less than 3.8 for the isci controller. 3.2 is
>> > > VERY old for that particular driver and likely very unstable. The
>> > > product version of that driver plus libsas started with 3.8. Also I'm
>> > > concerned that you aren't using the platform OEM parameters. You need
>> > > to turn your OROM or EFI driver on for the SAS controller.
>> >
>> > It's a Cisco UCS C22 M3 server with a crappy LSI fakeraid that cannot
>> > even be disabled. It was a pain to make it boot properly - had to use
>> > dmraid. But it has been working fine since then (2012). Until now.
>>
>> Yes but just because it has been working doesn't mean it is a good idea
>> to run unstable code.... You need the driver updates and the libsas
>> updates for it to function properly. Does this fail on 3.14? If it is
>> that patch I have a feeling it may be interacting badly with whatever
>> was in 3.2 libsas that may not be a problem with latest kernels.... It
>> is odd to see all those hard resets however.... Did you have them when
>> it was working for you?
>
> Didn't know that it was unstable - it worked with no problems, better than
> some products marked as stable :)
> 3.13 works fine - I've installed it from wheezy-backports to work around the
> bug.
>
> The log from working 3.2.54 is below (at the end) - there's one reset for each
> port.
>
I think the right answer for 3.2 is to drop commit 584ec1226519 "isci:
fix reset timeout handling".
libsas and its libata interaction went through significant overhaul
after 3.2 so it's not surprising that a change to reset handling
regresses like this.
Ideally there would be a backport of latest libsas available for 3.2,
but no one to my knowledge is working on that.
--
Dan
I'm adding this revert to 3.2.58, taking your 'drop commit 584ec1226519'
as an ack.
Ben.
---
From: Ben Hutchings <[email protected]>
Date: Wed, 30 Apr 2014 13:22:22 +0100
Subject: Revert "isci: fix reset timeout handling"
This reverts commit 584ec12265192bf49dfa270d517380f6723a6956, which
was commit ddfadd7736b677de2d4ca2cd5b4b655368c85a7a upstream. It
causes boot failure on 3.2 although no such problem occurs upstream.
Reported-by: Ondrej Zary <[email protected]>
Signed-off-by: Ben Hutchings <[email protected]>
Acked-by: Dan Williams <[email protected]>
---
--- a/drivers/scsi/isci/port_config.c
+++ b/drivers/scsi/isci/port_config.c
@@ -610,6 +610,13 @@ static void sci_apc_agent_link_up(struct
 		sci_apc_agent_configure_ports(ihost, port_agent, iphy, true);
 	} else {
 		/* the phy is already the part of the port */
+		u32 port_state = iport->sm.current_state_id;
+
+		/* if the PORT'S state is resetting then the link up is from
+		 * port hard reset in this case, we need to tell the port
+		 * that link up is recieved
+		 */
+		BUG_ON(port_state != SCI_PORT_RESETTING);
 		port_agent->phy_ready_mask |= 1 << phy_index;
 		sci_port_link_up(iport, iphy);
 	}
--- a/drivers/scsi/isci/task.c
+++ b/drivers/scsi/isci/task.c
@@ -1390,7 +1390,7 @@ int isci_task_I_T_nexus_reset(struct dom
 	spin_unlock_irqrestore(&ihost->scic_lock, flags);
 	if (!idev || !test_bit(IDEV_EH, &idev->flags)) {
-		ret = -ENODEV;
+		ret = TMF_RESP_FUNC_COMPLETE;
 		goto out;
 	}
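
One way to read the task.c hunk, sketched as a user-space model rather than kernel code. It assumes the libsas caller treats any result other than TMF_RESP_FUNC_COMPLETE from the I_T nexus reset hook as a failed reset and reports -EAGAIN to libata. That assumption is inferred from the "Unable to reset I T nexus?" and "errno=-11" lines in the report and has not been checked against the 3.2 source. The model_* functions and the local TMF_RESP_FUNC_COMPLETE definition are stand-ins written for this example.

```c
#include <errno.h>

/* Re-declared locally for the model; the value 0x00 is an assumption
 * based on the SAS task-management response codes. */
#define TMF_RESP_FUNC_COMPLETE 0x00

/* Hypothetical stand-in for the early-exit branch touched by the
 * patch: with the revert applied it reports completion, without the
 * revert it reports -ENODEV. */
static int model_nexus_reset_early_exit(int reverted)
{
	return reverted ? TMF_RESP_FUNC_COMPLETE : -ENODEV;
}

/* Assumed caller behaviour (not verified against 3.2 libsas): only
 * TMF_RESP_FUNC_COMPLETE counts as success, anything else becomes
 * -EAGAIN and triggers another retry. */
static int model_hard_reset(int reverted)
{
	int res = model_nexus_reset_early_exit(reverted);

	return (res == TMF_RESP_FUNC_COMPLETE) ? 0 : -EAGAIN;
}
```

Under these assumptions model_hard_reset(0) returns -EAGAIN, matching the retry loop in the failing log, while model_hard_reset(1) returns 0, matching the single clean reset per port seen on 3.2.54.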
--
Ben Hutchings
Life would be so much easier if we could look at the source code.