2006-09-06 06:23:47

by Brett G. Durrett

[permalink] [raw]
Subject: megaraid_sas waiting for command and then offline


I have the same or a similar issue running 2.6.17 SMP x86_64 - the
megaraid_sas driver hangs waiting for commands and then the filesystem
unmounts, leaving the machine in an unusable state until there is a hard
reboot (the machine is responsive but any access, shell or otherwise, is
impossible without the filesystem). While I do not have much debugging
information available, this happens to me about once every 6-7 days in
my pool of seven machines, so I can probably get debugging info. Since
the disk is offline and I can't get remote console, I don't have any
details except something similar to Dave Lloyd's post, below.

The only thing that the machines with these failures seem to have in
common is the fact that they are almost exclusively writes - they are
slave database machines with large memory and pretty much just
replicate. The read/write machines seem to have less failures.

I am happy to help provide debugging information in any reasonable way.
In the mean time, if there are any known suggestions or workarounds for
the problem, I would be grateful for the guidance.

Here are what details on the controller. If you want additional info,
let me know exactly what you need and I will do what I can to get it to
you.:

Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.0.1-0030
FW Version : 1.00.01-0088
BIOS Version : MT23
Ctrl-R Version :1.02-007

B-


Subject RE: MegaRaid 8408E goes out to lunch with nr_requests > 8
Date Thu, 13 Jul 2006 09:25:09 -0600
>From "Patro, Sumant" <>

Hello Dave,

I tried to duplicate the issue with 2.6.18rc1 but did not see
the issue. From the message it looks like the Firmware has stopped
processing cmds. Could you please let us know the Firmware version of
the controller ?

Thanks,

Sumant

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Dave Lloyd
Sent: Wednesday, July 12, 2006 7:47 AM
To: [email protected]; Berkley Shands
Subject: MegaRaid 8408E goes out to lunch with nr_requests > 8
This happens both on 2.6.17 and 2.6.18rc1 using the megaraid, mptsas and
mptscsih drivers supplied with the kernel.

While writing data to raid0 devs on a LSI MegaRaid 8408E controller, the
devices will hang after somewhere between 4-7gb of data written. If I
dial the nr_requests back from the default down to 8, the hang will not
occur. The hang does occur at 16. I haven't tested values between the
two, but I'm not too optimistic. From what I can see, it looks like 8
should be a magic number to make the queue look congested more often
than not.

Here are the messages I get when the devices go out to lunch:
Jul 11 14:13:34 systemname kernel: sd 4:2:0:0: megasas: RESET -40213
cmd=2a
Jul 11 14:13:34 systemname kernel: megasas: [ 0]waiting for 256 commands
to complete
Jul 11 14:13:39 systemname kernel: megasas: [ 5]waiting for 256 commands
to complete
Jul 11 14:13:44 systemname kernel: megasas: [10]waiting for 256 commands
to complete
Jul 11 14:13:49 systemname kernel: megasas: [15]waiting for 256 commands
to complete

[...]

Jul 11 14:16:35 systemname kernel: megasas: [175]waiting for 256
commands to complete
Jul 11 14:16:35 systemname kernel: megasas: failed to do reset
Jul 11 14:16:35 systemname kernel: sd 4:2:1:0: megasas: RESET -40216
cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous
reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: megasas: RESET -40213
cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous
reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: megasas: RESET -40213
cmd=2a
Jul 11 14:16:35 systemname kernel: megasas: cannot recover from previous
reset failures
Jul 11 14:16:35 systemname kernel: sd 4:2:0:0: scsi: Device offlined -
not ready after error recovery
Jul 11 14:16:36 systemname last message repeated 13 times

Interestingly, the machine will hang on shutdown and requires a hard
reset to reboot. Bummer!

My next step is to try and reproduce and dig into this some in KDB.

Has anyone else seen this and/or does anyone have some suggestions for
further debugging info?

--
Dave Lloyd
Test Engineer, Exegy, Inc.
314.450.5342
[email protected]



2006-09-06 14:11:57

by Dave Lloyd

[permalink] [raw]
Subject: Re: megaraid_sas waiting for command and then offline

Brett G. Durrett wrote:
>
> I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> megaraid_sas driver hangs waiting for commands and then the filesystem
> unmounts, leaving the machine in an unusable state until there is a hard
> reboot (the machine is responsive but any access, shell or otherwise, is
> impossible without the filesystem). While I do not have much debugging
> information available, this happens to me about once every 6-7 days in
> my pool of seven machines, so I can probably get debugging info. Since
> the disk is offline and I can't get remote console, I don't have any
> details except something similar to Dave Lloyd's post, below.
>
> The only thing that the machines with these failures seem to have in
> common is the fact that they are almost exclusively writes - they are
> slave database machines with large memory and pretty much just
> replicate. The read/write machines seem to have less failures.
>
> I am happy to help provide debugging information in any reasonable way.
> In the mean time, if there are any known suggestions or workarounds for
> the problem, I would be grateful for the guidance.
>
> Here are what details on the controller. If you want additional info,
> let me know exactly what you need and I will do what I can to get it to
> you.:
>
> Product Name : PERC 5/i Integrated
> Serial No : 12345
> FW Package Build: 5.0.1-0030
> FW Version : 1.00.01-0088
> BIOS Version : MT23
> Ctrl-R Version :1.02-007
>
> B-

Which motherboard are you using? We believe that this may be a
motherboard specific issue. It appears to happen on a SuperMicro
motherboard but not a Tyan motherboard.

--
Dave Lloyd
Test Engineer, Exegy, Inc.
314.450.5342
[email protected]

2006-09-06 16:04:33

by Brett G. Durrett

[permalink] [raw]
Subject: Re: megaraid_sas waiting for command and then offline


The machines are Dell 2900s, so the mobo is custom. From a Dell SE,
"Dell uses a custom mobo that is Dell branded with the Intel chipset
Greencreek.".

B-




Dave Lloyd wrote:

> Brett G. Durrett wrote:
> >
> > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> > megaraid_sas driver hangs waiting for commands and then the filesystem
> > unmounts, leaving the machine in an unusable state until there is a
> hard
> > reboot (the machine is responsive but any access, shell or
> otherwise, is
> > impossible without the filesystem). While I do not have much debugging
> > information available, this happens to me about once every 6-7 days in
> > my pool of seven machines, so I can probably get debugging info. Since
> > the disk is offline and I can't get remote console, I don't have any
> > details except something similar to Dave Lloyd's post, below.
> >
> > The only thing that the machines with these failures seem to have in
> > common is the fact that they are almost exclusively writes - they are
> > slave database machines with large memory and pretty much just
> > replicate. The read/write machines seem to have less failures.
> >
> > I am happy to help provide debugging information in any reasonable way.
> > In the mean time, if there are any known suggestions or workarounds for
> > the problem, I would be grateful for the guidance.
> >
> > Here are what details on the controller. If you want additional info,
> > let me know exactly what you need and I will do what I can to get it to
> > you.:
> >
> > Product Name : PERC 5/i Integrated
> > Serial No : 12345
> > FW Package Build: 5.0.1-0030
> > FW Version : 1.00.01-0088
> > BIOS Version : MT23
> > Ctrl-R Version :1.02-007
> >
> > B-
>
> Which motherboard are you using? We believe that this may be a
> motherboard specific issue. It appears to happen on a SuperMicro
> motherboard but not a Tyan motherboard.
>

2006-09-06 17:14:27

by Patro, Sumant

[permalink] [raw]
Subject: RE: megaraid_sas waiting for command and then offline

Hello Brett,

A DMA related bug was fixed in FW ver *.0095 that was causing
the FW to stop responding.

Please upgrade the FW version to >= *.0095 and let me know if
you still see the issue.

Regards,

Sumant


-----Original Message-----
From: Brett G. Durrett [mailto:[email protected]]
Sent: Wednesday, September 06, 2006 9:04 AM
To: Dave Lloyd
Cc: Patro, Sumant; Bagalkote, Sreenivas; lkml; Berkley Shands
Subject: Re: megaraid_sas waiting for command and then offline


The machines are Dell 2900s, so the mobo is custom. From a Dell SE,
"Dell uses a custom mobo that is Dell branded with the Intel chipset
Greencreek.".

B-




Dave Lloyd wrote:

> Brett G. Durrett wrote:
> >
> > I have the same or a similar issue running 2.6.17 SMP x86_64 - the
> > megaraid_sas driver hangs waiting for commands and then the
filesystem
> > unmounts, leaving the machine in an unusable state until there is a
> hard
> > reboot (the machine is responsive but any access, shell or
> otherwise, is
> > impossible without the filesystem). While I do not have much
debugging
> > information available, this happens to me about once every 6-7 days
in
> > my pool of seven machines, so I can probably get debugging info.
Since
> > the disk is offline and I can't get remote console, I don't have any
> > details except something similar to Dave Lloyd's post, below.
> >
> > The only thing that the machines with these failures seem to have in
> > common is the fact that they are almost exclusively writes - they
are
> > slave database machines with large memory and pretty much just
> > replicate. The read/write machines seem to have less failures.
> >
> > I am happy to help provide debugging information in any reasonable
way.
> > In the mean time, if there are any known suggestions or workarounds
for
> > the problem, I would be grateful for the guidance.
> >
> > Here are what details on the controller. If you want additional
info,
> > let me know exactly what you need and I will do what I can to get it
to
> > you.:
> >
> > Product Name : PERC 5/i Integrated
> > Serial No : 12345
> > FW Package Build: 5.0.1-0030
> > FW Version : 1.00.01-0088
> > BIOS Version : MT23
> > Ctrl-R Version :1.02-007
> >
> > B-
>
> Which motherboard are you using? We believe that this may be a
> motherboard specific issue. It appears to happen on a SuperMicro
> motherboard but not a Tyan motherboard.
>

2006-09-06 20:44:19

by Brett G. Durrett

[permalink] [raw]
Subject: Re: megaraid_sas waiting for command and then offline


Sumant,

Not sure if I am missing something - I appear to be running the latest
FW available:

Versions
================
Product Name : PERC 5/i Integrated
Serial No : 12345
FW Package Build: 5.0.1-0030
FW Version : 1.00.01-0088
BIOS Version : MT23
Ctrl-R Version :1.02-007



Patro, Sumant wrote:

>Hello Brett,
>
> A DMA related bug was fixed in FW ver *.0095 that was causing
>the FW to stop responding.
>
> Please upgrade the FW version to >= *.0095 and let me know if
>you still see the issue.
>
>Regards,
>
>Sumant
>
>
>-----Original Message-----
>From: Brett G. Durrett [mailto:[email protected]]
>Sent: Wednesday, September 06, 2006 9:04 AM
>To: Dave Lloyd
>Cc: Patro, Sumant; Bagalkote, Sreenivas; lkml; Berkley Shands
>Subject: Re: megaraid_sas waiting for command and then offline
>
>
>The machines are Dell 2900s, so the mobo is custom. From a Dell SE,
>"Dell uses a custom mobo that is Dell branded with the Intel chipset
>Greencreek.".
>
>B-
>
>
>
>
>Dave Lloyd wrote:
>
>
>
>>Brett G. Durrett wrote:
>>
>>
>>>I have the same or a similar issue running 2.6.17 SMP x86_64 - the
>>>megaraid_sas driver hangs waiting for commands and then the
>>>
>>>
>filesystem
>
>
>>>unmounts, leaving the machine in an unusable state until there is a
>>>
>>>
>>hard
>>
>>
>>>reboot (the machine is responsive but any access, shell or
>>>
>>>
>>otherwise, is
>>
>>
>>>impossible without the filesystem). While I do not have much
>>>
>>>
>debugging
>
>
>>>information available, this happens to me about once every 6-7 days
>>>
>>>
>in
>
>
>>>my pool of seven machines, so I can probably get debugging info.
>>>
>>>
>Since
>
>
>>>the disk is offline and I can't get remote console, I don't have any
>>>details except something similar to Dave Lloyd's post, below.
>>>
>>>The only thing that the machines with these failures seem to have in
>>>common is the fact that they are almost exclusively writes - they
>>>
>>>
>are
>
>
>>>slave database machines with large memory and pretty much just
>>>replicate. The read/write machines seem to have less failures.
>>>
>>>I am happy to help provide debugging information in any reasonable
>>>
>>>
>way.
>
>
>>>In the mean time, if there are any known suggestions or workarounds
>>>
>>>
>for
>
>
>>>the problem, I would be grateful for the guidance.
>>>
>>>Here are what details on the controller. If you want additional
>>>
>>>
>info,
>
>
>>>let me know exactly what you need and I will do what I can to get it
>>>
>>>
>to
>
>
>>>you.:
>>>
>>>Product Name : PERC 5/i Integrated
>>>Serial No : 12345
>>>FW Package Build: 5.0.1-0030
>>>FW Version : 1.00.01-0088
>>>BIOS Version : MT23
>>>Ctrl-R Version :1.02-007
>>>
>>>B-
>>>
>>>
>>Which motherboard are you using? We believe that this may be a
>>motherboard specific issue. It appears to happen on a SuperMicro
>>motherboard but not a Tyan motherboard.
>>
>>
>>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>