2005-11-20 20:47:11

by Alexander Inyukhin

Subject: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441


I've got soft lockups during IDE probes with the 2.6.15-rc1 kernel.
The box is a dual Athlon MP with an ASUS A7M266-D board.
Full dmesg, config and lspci -vvv output are in the attachments.
Disabling the second channel with "hdc=noprobe hdd=noprobe" did not help.

CC me, please.


Attachments:
(No filename) (267.00 B)
dmesg (14.33 kB)
pci (4.24 kB)
config (29.99 kB)

2005-11-20 22:35:05

by Alan

Subject: Re: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441

On Sul, 2005-11-20 at 23:46 +0300, Alexander V. Inyukhin wrote:
> I've got soft lockups during IDE probes with 2.6.15-rc1 kernel.
> The box is dual Athlon MP with ASUS A7M266-D board.
> Full dmesg, config and lspci -vvv output are in the attachment.
> Disabling second channel with "hdc=noprobe hdd=noprobe" did not help.

Quite normal. The old IDE probe code takes a long time and it makes the
soft lockup code believe a lockup occurred - remember it's a debugging
tool, not a 100% reliable detector of failures.

Alan

2005-11-21 01:29:28

by Andrew Morton

Subject: Re: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441

Alan Cox <[email protected]> wrote:
>
> On Sul, 2005-11-20 at 23:46 +0300, Alexander V. Inyukhin wrote:
> > I've got soft lockups during IDE probes with 2.6.15-rc1 kernel.
> > The box is dual Athlon MP with ASUS A7M266-D board.
> > Full dmesg, config and lspci -vvv output are in the attachment.
> > Disabling second channel with "hdc=noprobe hdd=noprobe" did not help.
>
> Quite normal. The old IDE probe code takes a long time and it makes the
> soft lockup code believe a lockup occurred - remember it's a debugging
> tool, not a 100% reliable detector of failures.
>

We could put a touch_softlockup_watchdog() in there.

2005-11-21 20:06:32

by Alan

Subject: Re: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441

On Sul, 2005-11-20 at 17:29 -0800, Andrew Morton wrote:
> Alan Cox <[email protected]> wrote:
> > Quite normal. The old IDE probe code takes a long time and it makes the
> > soft lockup code believe a lockup occurred - remember it's a debugging
> > tool, not a 100% reliable detector of failures.
> >
>
> We could put a touch_softlockup_watchdog() in there.

Would make sense. Spin-up and probe can take over 30 seconds worst case
and are polled in the IDE world. The loop will eventually exit, and a true
lockup caused by a stuck IORDY line will hang forever in an inb/outb, so
neither softlockup nor even the NMI lockup detector would save you.

2005-11-23 19:13:14

by Jesper Juhl

Subject: Re: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441

On Monday 21 November 2005 21:38, Alan Cox wrote:
> On Sul, 2005-11-20 at 17:29 -0800, Andrew Morton wrote:
> > Alan Cox <[email protected]> wrote:
> > > > Quite normal. The old IDE probe code takes a long time and it makes the
> > > > soft lockup code believe a lockup occurred - remember it's a debugging
> > > > tool, not a 100% reliable detector of failures.
> > > >
> >
> > > We could put a touch_softlockup_watchdog() in there.
>
> > Would make sense. Spin-up and probe can take over 30 seconds worst case
> > and are polled in the IDE world. The loop will eventually exit, and a true
> > lockup caused by a stuck IORDY line will hang forever in an inb/outb, so
> > neither softlockup nor even the NMI lockup detector would save you.
>

How about something like the patch below?

The if (!(timeout % 128)) bit is a guess that since
touch_softlockup_watchdog() is a per_cpu thing it will be cheaper to do the
modulo calculation than to call the function every time through the loop,
especially as the number of CPUs goes up. But it's purely a guess, so I may
very well be wrong - also, 128 is an arbitrarily chosen value, it's just a
nice number that'll give us <10 function calls per second.

<disclaimer>patch is completely un-tested</disclaimer>


From: Jesper Juhl <[email protected]>
Subject: touch softlockup watchdog in ide_wait_not_busy

Touch the softlockup watchdog in ide_wait_not_busy(). The polling loop
there can run long enough to trigger the watchdog, but there's really no
point in that since the loop will eventually return, and triggering the
watchdog won't do us any good anyway.

Signed-off-by: Jesper Juhl <[email protected]>
---

 drivers/ide/ide-iops.c |    8 ++++++++
 1 files changed, 8 insertions(+)

--- linux-2.6.15-rc2-orig/drivers/ide/ide-iops.c 2005-11-20 22:25:24.000000000 +0100
+++ linux-2.6.15-rc2/drivers/ide/ide-iops.c 2005-11-23 19:46:11.000000000 +0100
@@ -24,6 +24,7 @@
 #include <linux/hdreg.h>
 #include <linux/ide.h>
 #include <linux/bitops.h>
+#include <linux/sched.h>
 
 #include <asm/byteorder.h>
 #include <asm/irq.h>
@@ -1243,6 +1244,13 @@ int ide_wait_not_busy(ide_hwif_t *hwif,
 		 */
 		if (stat == 0xff)
 			return -ENODEV;
+
+		/*
+		 * We risk triggering the soft lockup detector, but we don't
+		 * want that, so better poke it a bit once in a while.
+		 */
+		if (!(timeout % 128))
+			touch_softlockup_watchdog();
 	}
 	return -EBUSY;
 }


2005-11-30 11:00:06

by Alexander Inyukhin

Subject: Re: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441

On Wed, Nov 23, 2005 at 08:17:51PM +0100, Jesper Juhl wrote:
> On Monday 21 November 2005 21:38, Alan Cox wrote:
> > On Sul, 2005-11-20 at 17:29 -0800, Andrew Morton wrote:
> > > Alan Cox <[email protected]> wrote:
> > > > > Quite normal. The old IDE probe code takes a long time and it makes the
> > > > > soft lockup code believe a lockup occurred - remember it's a debugging
> > > > > tool, not a 100% reliable detector of failures.
> > > >
> > > > We could put a touch_softlockup_watchdog() in there.
> > >
> > > Would make sense. Spin-up and probe can take over 30 seconds worst case
> > > and are polled in the IDE world. The loop will eventually exit, and a true
> > > lockup caused by a stuck IORDY line will hang forever in an inb/outb, so
> > > neither softlockup nor even the NMI lockup detector would save you.
> >
> > How about something like the patch below?
> >
> > The if (!(timeout % 128)) bit is a guess that since
> > touch_softlockup_watchdog() is a per_cpu thing it will be cheaper to do the
> > modulo calculation than to call the function every time through the loop,
> > especially as the number of CPUs goes up. But it's purely a guess, so I may
> > very well be wrong - also, 128 is an arbitrarily chosen value, it's just a
> > nice number that'll give us <10 function calls per second.

It seems to work.
I have no BUG messages during boot with this patch.

2005-11-30 12:00:06

by Jesper Juhl

Subject: Re: [BUG] 2.6.15-rc1, soft lockup detected while probing IDE devices on AMD7441

On 11/30/05, Alexander V. Inyukhin <[email protected]> wrote:
> On Wed, Nov 23, 2005 at 08:17:51PM +0100, Jesper Juhl wrote:
> > On Monday 21 November 2005 21:38, Alan Cox wrote:
> > > On Sul, 2005-11-20 at 17:29 -0800, Andrew Morton wrote:
> > > > Alan Cox <[email protected]> wrote:
> > > > > > Quite normal. The old IDE probe code takes a long time and it makes the
> > > > > > soft lockup code believe a lockup occurred - remember it's a debugging
> > > > > > tool, not a 100% reliable detector of failures.
> > > > >
> > > > > We could put a touch_softlockup_watchdog() in there.
> > > >
> > > > Would make sense. Spin-up and probe can take over 30 seconds worst case
> > > > and are polled in the IDE world. The loop will eventually exit, and a true
> > > > lockup caused by a stuck IORDY line will hang forever in an inb/outb, so
> > > > neither softlockup nor even the NMI lockup detector would save you.
> > >
> > > How about something like the patch below?
> > >
> > > The if (!(timeout % 128)) bit is a guess that since
> > > touch_softlockup_watchdog() is a per_cpu thing it will be cheaper to do the
> > > modulo calculation than to call the function every time through the loop,
> > > especially as the number of CPUs goes up. But it's purely a guess, so I may
> > > very well be wrong - also, 128 is an arbitrarily chosen value, it's just a
> > > nice number that'll give us <10 function calls per second.
>
> It seems to work.
> I have no BUG messages during boot with this patch.
>
Great.
Thank you for testing.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html