2005-05-13 18:56:27

by Dave Jones

[permalink] [raw]
Subject: tickle nmi watchdog whilst doing serial writes.

This was fun. I inserted a music CD with some obnoxious copy-protection
on it into the drive, and lots of SCSI errors went zipping over to
the serial console. Unfortunatly, the box was also compiling a kernel,
playing oggs, and doing a number of other things at the same time,
so this happened..

NMI Watchdog detected LOCKUP on CPU2CPU 2
Modules linked in: loop usb_storage md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core rfcomm l2cap bluetooth sunrpc pcdPid: 3138, comm: gpm Not tainted 2.6.11-1.1290_FC4smp
RIP: 0010:[<ffffffff80273b8a>] <ffffffff80273b8a>{serial_in+106}
RSP: 0018:ffff81003afc3d50 EFLAGS: 00000002
RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804dcd60
RBP: 00000000000024fc R08: 000000000000000a R09: 0000000000000033
R10: ffff81001beb7c20 R11: 0000000000000020 R12: ffffffff804dcd60
R13: ffffffff804ade76 R14: 000000000000002b R15: 000000000000002c
FS: 00002aaaaaac4920(0000) GS:ffffffff804fca00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaabcb000 CR3: 000000003c0d0000 CR4: 00000000000006e0
Process gpm (pid: 3138, threadinfo ffff81003afc2000, task ffff81003eb63780)
Stack: ffffffff80275f2e 0000000000000000 ffffffff80448380 0000000000007d6b
000000000000002c fffffffffffffbbf 0000000000000292 0000000000008000
ffffffff80138e8c 0000000000007d97
Call Trace:<ffffffff80275f2e>{serial8250_console_write+270} <ffffffff80138e8c>{__call_console_drivers+76}
<ffffffff8013914b>{release_console_sem+315} <ffffffff80260325>{con_open+149}
<ffffffff80254e99>{tty_open+537} <ffffffff80192713>{chrdev_open+387}
<ffffffff80188824>{dentry_open+260} <ffffffff80188994>{filp_open+68}
<ffffffff80187b73>{get_unused_fd+227} <ffffffff80188a6c>{sys_open+76}
<ffffffff8010ebc6>{tracesys+209}

Code: 0f b6 c0 c3 66 90 41 57 49 89 f7 41 56 41 be 00 01 00 00 41
console shuts up ...


Signed-off-by: Dave Jones <[email protected]>

--- linux-2.6.11/drivers/serial/8250.c~ 2005-05-13 14:18:44.000000000 -0400
+++ linux-2.6.11/drivers/serial/8250.c 2005-05-13 14:30:56.000000000 -0400
@@ -40,6 +40,7 @@
#include <linux/serial_core.h>
#include <linux/serial.h>
#include <linux/serial_8250.h>
+#include <linux/nmi.h>

#include <asm/io.h>
#include <asm/irq.h>
@@ -2099,8 +2100,10 @@ static inline void wait_for_xmitr(struct
if (up->port.flags & UPF_CONS_FLOW) {
tmout = 1000000;
while (--tmout &&
- ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
+ ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
udelay(1);
+ touch_nmi_watchdog();
+ }
}
}


We *could* tickle it less often, but given we're busy waiting anyway
it probably doesnt make sense to not favour the more simple approach.
Hmm, maybe we want a cpu_relax() in there too. opinions?

Dave


2005-05-13 19:20:29

by Arjan van de Ven

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Fri, 2005-05-13 at 14:48 -0400, Dave Jones wrote:
> if (up->port.flags & UPF_CONS_FLOW) {
> tmout = 1000000;
> while (--tmout &&
> - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> udelay(1);
> + touch_nmi_watchdog();
> + }
> }
> }
>
>
> We *could* tickle it less often, but given we're busy waiting anyway
> it probably doesnt make sense to not favour the more simple approach.
> Hmm, maybe we want a cpu_relax() in there too. opinions?

udelay() includes cpu_relax() already so that is futile.

However.. this is a hack. Do we really need to do busy waiting here for
this long??


2005-05-13 19:30:00

by Dave Jones

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Fri, May 13, 2005 at 09:14:52PM +0200, Arjan van de Ven wrote:
> On Fri, 2005-05-13 at 14:48 -0400, Dave Jones wrote:
> > if (up->port.flags & UPF_CONS_FLOW) {
> > tmout = 1000000;
> > while (--tmout &&
> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> > udelay(1);
> > + touch_nmi_watchdog();
> > + }
> > }
> > }
> >
> >
> > We *could* tickle it less often, but given we're busy waiting anyway
> > it probably doesnt make sense to not favour the more simple approach.
> > Hmm, maybe we want a cpu_relax() in there too. opinions?
>
> udelay() includes cpu_relax() already so that is futile.
>
> However.. this is a hack. Do we really need to do busy waiting here for
> this long??

Ohhhhh no. I've fallen into this trap before.
I'm not looking any further into serial code than I have to :)

Russell / dwmw2 may have a more definitive answers as to why
we have such a long wait here, but every time I learn something
about the serial layer I end up regretting it, so mine is a
drive-by patching only :-)

Dave

2005-05-13 19:44:53

by Russell King

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Fri, May 13, 2005 at 09:14:52PM +0200, Arjan van de Ven wrote:
> On Fri, 2005-05-13 at 14:48 -0400, Dave Jones wrote:
> > if (up->port.flags & UPF_CONS_FLOW) {
> > tmout = 1000000;
> > while (--tmout &&
> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> > udelay(1);
> > + touch_nmi_watchdog();
> > + }
> > }
> > }
> >
> >
> > We *could* tickle it less often, but given we're busy waiting anyway
> > it probably doesnt make sense to not favour the more simple approach.
> > Hmm, maybe we want a cpu_relax() in there too. opinions?
>
> udelay() includes cpu_relax() already so that is futile.
>
> However.. this is a hack. Do we really need to do busy waiting here for
> this long??

Of course you do. Think about CTS flow control where the other end
is also busy and has to drain its tty buffers before it can allow
the remote end to proceed. (And who says its another Linux box
anyway?)

However, if people are using CTS flow control, they want reliable
logging. Therefore, it's even questionable whether we should even
time out (but we do to keep the system "running".)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-05-14 06:44:22

by Andrew Morton

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Dave Jones <[email protected]> wrote:
>
> This was fun. I inserted a music CD with some obnoxious copy-protection
> on it into the drive, and lots of SCSI errors went zipping over to
> the serial console. Unfortunatly, the box was also compiling a kernel,
> playing oggs, and doing a number of other things at the same time,
> so this happened..
>
> NMI Watchdog detected LOCKUP on CPU2CPU 2

OK.. But calling touch_nmi_watchdog() at 1MHz seems a bit excessive, and
might perturb the finely-tuned timing in there.

How's about this?

--- 25/drivers/serial/8250.c~tickle-nmi-watchdog-whilst-doing-serial-writes 2005-05-13 23:37:57.000000000 -0700
+++ 25-akpm/drivers/serial/8250.c 2005-05-13 23:41:36.000000000 -0700
@@ -40,6 +40,7 @@
#include <linux/serial_core.h>
#include <linux/serial.h>
#include <linux/serial_8250.h>
+#include <linux/nmi.h>

#include <asm/io.h>
#include <asm/irq.h>
@@ -2098,9 +2099,10 @@ static inline void wait_for_xmitr(struct
/* Wait up to 1s for flow control if necessary */
if (up->port.flags & UPF_CONS_FLOW) {
tmout = 1000000;
- while (--tmout &&
- ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
+ while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout)
udelay(1);
+ if (tmout < 1000000)
+ touch_nmi_watchdog();
}
}

_

2005-05-14 06:58:02

by Dave Jones

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Fri, May 13, 2005 at 11:43:31PM -0700, Andrew Morton wrote:
> Dave Jones <[email protected]> wrote:
> >
> > This was fun. I inserted a music CD with some obnoxious copy-protection
> > on it into the drive, and lots of SCSI errors went zipping over to
> > the serial console. Unfortunatly, the box was also compiling a kernel,
> > playing oggs, and doing a number of other things at the same time,
> > so this happened..
> >
> > NMI Watchdog detected LOCKUP on CPU2CPU 2
>
> OK.. But calling touch_nmi_watchdog() at 1MHz seems a bit excessive, and
> might perturb the finely-tuned timing in there.
>
> How's about this?

Umm.. Despite it being past my bedtime, I'm pretty sure I'm
missing something here...

> + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout)
> udelay(1);

I don't see how this is any better than the current code.
We're doing 1000000 udelays. Whilst we're doing that,
the nmi watchdog goes bonkers.

> + if (tmout < 1000000)
> + touch_nmi_watchdog();

So by the time we do this, its already triggered.

How about..

--- linux-2.6.11/drivers/serial/8250.c~ 2005-05-14 02:49:02.000000000 -0400
+++ linux-2.6.11/drivers/serial/8250.c 2005-05-14 02:54:30.000000000 -0400
@@ -40,6 +40,7 @@
#include <linux/serial_core.h>
#include <linux/serial.h>
#include <linux/serial_8250.h>
+#include <linux/nmi.h>

#include <asm/io.h>
#include <asm/irq.h>
@@ -2099,8 +2100,15 @@ static inline void wait_for_xmitr(struct
if (up->port.flags & UPF_CONS_FLOW) {
tmout = 1000000;
while (--tmout &&
- ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
+ ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
+ int cnt=0;
udelay(1);
+ cnt++;
+ if (cnt==100) {
+ touch_nmi_watchdog();
+ cnt=0;
+ }
+ }
}
}

Signed-off-by: Dave Jones <[email protected]>

totally uncompiled & untested. I'm not even sure about
that magic '100'. I'll let someone less sleep-deprived
than myself choose a better number.

Dave


2005-05-14 07:08:39

by Andrew Morton

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Dave Jones <[email protected]> wrote:
>
> On Fri, May 13, 2005 at 11:43:31PM -0700, Andrew Morton wrote:
> > Dave Jones <[email protected]> wrote:
> > >
> > > This was fun. I inserted a music CD with some obnoxious copy-protection
> > > on it into the drive, and lots of SCSI errors went zipping over to
> > > the serial console. Unfortunatly, the box was also compiling a kernel,
> > > playing oggs, and doing a number of other things at the same time,
> > > so this happened..
> > >
> > > NMI Watchdog detected LOCKUP on CPU2CPU 2
> >
> > OK.. But calling touch_nmi_watchdog() at 1MHz seems a bit excessive, and
> > might perturb the finely-tuned timing in there.
> >
> > How's about this?
>
> Umm.. Despite it being past my bedtime, I'm pretty sure I'm
> missing something here...
>
> > + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout)
> > udelay(1);
>
> I don't see how this is any better than the current code.
> We're doing 1000000 udelays. Whilst we're doing that,
> the nmi watchdog goes bonkers.
>
> > + if (tmout < 1000000)
> > + touch_nmi_watchdog();
>
> So by the time we do this, its already triggered.

But the NMI watchdog won't expire after one second - normally it's set to
fixe seconds.

> How about..
>
> --- linux-2.6.11/drivers/serial/8250.c~ 2005-05-14 02:49:02.000000000 -0400
> +++ linux-2.6.11/drivers/serial/8250.c 2005-05-14 02:54:30.000000000 -0400
> @@ -40,6 +40,7 @@
> #include <linux/serial_core.h>
> #include <linux/serial.h>
> #include <linux/serial_8250.h>
> +#include <linux/nmi.h>
>
> #include <asm/io.h>
> #include <asm/irq.h>
> @@ -2099,8 +2100,15 @@ static inline void wait_for_xmitr(struct
> if (up->port.flags & UPF_CONS_FLOW) {
> tmout = 1000000;
> while (--tmout &&
> - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> + int cnt=0;
> udelay(1);
> + cnt++;
> + if (cnt==100) {
> + touch_nmi_watchdog();
> + cnt=0;
> + }
> + }

<obwhitespacewhine> spose so.

--- 25/drivers/serial/8250.c~tickle-nmi-watchdog-whilst-doing-serial-writes 2005-05-14 00:03:09.000000000 -0700
+++ 25-akpm/drivers/serial/8250.c 2005-05-14 00:06:53.000000000 -0700
@@ -40,6 +40,7 @@
#include <linux/serial_core.h>
#include <linux/serial.h>
#include <linux/serial_8250.h>
+#include <linux/nmi.h>

#include <asm/io.h>
#include <asm/irq.h>
@@ -2098,9 +2099,11 @@ static inline void wait_for_xmitr(struct
/* Wait up to 1s for flow control if necessary */
if (up->port.flags & UPF_CONS_FLOW) {
tmout = 1000000;
- while (--tmout &&
- ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
+ while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout) {
udelay(1);
+ if ((tmout % 1000) == 0)
+ touch_nmi_watchdog();
+ }
}
}

_

2005-05-14 10:32:16

by Ed Tomlinson

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Hi,

Suspect this can be triggered using alt+sysreq+T on a busy system with a slow
serial console. Might be a easy way to see if this patch fixes the issue?

Ed Tomlinson

On Saturday 14 May 2005 03:07, Andrew Morton wrote:
> Dave Jones <[email protected]> wrote:
> >
> > On Fri, May 13, 2005 at 11:43:31PM -0700, Andrew Morton wrote:
> > > Dave Jones <[email protected]> wrote:
> > > >
> > > > This was fun. I inserted a music CD with some obnoxious copy-protection
> > > > on it into the drive, and lots of SCSI errors went zipping over to
> > > > the serial console. Unfortunatly, the box was also compiling a kernel,
> > > > playing oggs, and doing a number of other things at the same time,
> > > > so this happened..
> > > >
> > > > NMI Watchdog detected LOCKUP on CPU2CPU 2
> > >
> > > OK.. But calling touch_nmi_watchdog() at 1MHz seems a bit excessive, and
> > > might perturb the finely-tuned timing in there.
> > >
> > > How's about this?
> >
> > Umm.. Despite it being past my bedtime, I'm pretty sure I'm
> > missing something here...
> >
> > > + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout)
> > > udelay(1);
> >
> > I don't see how this is any better than the current code.
> > We're doing 1000000 udelays. Whilst we're doing that,
> > the nmi watchdog goes bonkers.
> >
> > > + if (tmout < 1000000)
> > > + touch_nmi_watchdog();
> >
> > So by the time we do this, its already triggered.
>
> But the NMI watchdog won't expire after one second - normally it's set to
> fixe seconds.
>
> > How about..
> >
> > --- linux-2.6.11/drivers/serial/8250.c~ 2005-05-14 02:49:02.000000000 -0400
> > +++ linux-2.6.11/drivers/serial/8250.c 2005-05-14 02:54:30.000000000 -0400
> > @@ -40,6 +40,7 @@
> > #include <linux/serial_core.h>
> > #include <linux/serial.h>
> > #include <linux/serial_8250.h>
> > +#include <linux/nmi.h>
> >
> > #include <asm/io.h>
> > #include <asm/irq.h>
> > @@ -2099,8 +2100,15 @@ static inline void wait_for_xmitr(struct
> > if (up->port.flags & UPF_CONS_FLOW) {
> > tmout = 1000000;
> > while (--tmout &&
> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> > + int cnt=0;
> > udelay(1);
> > + cnt++;
> > + if (cnt==100) {
> > + touch_nmi_watchdog();
> > + cnt=0;
> > + }
> > + }
>
> <obwhitespacewhine> spose so.
>
> --- 25/drivers/serial/8250.c~tickle-nmi-watchdog-whilst-doing-serial-writes 2005-05-14 00:03:09.000000000 -0700
> +++ 25-akpm/drivers/serial/8250.c 2005-05-14 00:06:53.000000000 -0700
> @@ -40,6 +40,7 @@
> #include <linux/serial_core.h>
> #include <linux/serial.h>
> #include <linux/serial_8250.h>
> +#include <linux/nmi.h>
>
> #include <asm/io.h>
> #include <asm/irq.h>
> @@ -2098,9 +2099,11 @@ static inline void wait_for_xmitr(struct
> /* Wait up to 1s for flow control if necessary */
> if (up->port.flags & UPF_CONS_FLOW) {
> tmout = 1000000;
> - while (--tmout &&
> - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout) {
> udelay(1);
> + if ((tmout % 1000) == 0)
> + touch_nmi_watchdog();
> + }
> }
> }
>
> _
>
> -

2005-05-14 10:53:16

by Alexander Nyberg

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

> Hi,
>
> Suspect this can be triggered using alt+sysreq+T on a busy system with a slow
> serial console. Might be a easy way to see if this patch fixes the issue?
>

Sysrq-t already tickles the watchdog between the printing of each task
so you'll have to remove the nmi tickling in show_state() if you want to
use it as a test case.

But uhm, it should take at least 5 seconds of no-interrupts before the
NMI watchdog decides the box is dead so this is kind of weird.


> > <obwhitespacewhine> spose so.
> >
> > --- 25/drivers/serial/8250.c~tickle-nmi-watchdog-whilst-doing-serial-writes 2005-05-14 00:03:09.000000000 -0700
> > +++ 25-akpm/drivers/serial/8250.c 2005-05-14 00:06:53.000000000 -0700
> > @@ -40,6 +40,7 @@
> > #include <linux/serial_core.h>
> > #include <linux/serial.h>
> > #include <linux/serial_8250.h>
> > +#include <linux/nmi.h>
> >
> > #include <asm/io.h>
> > #include <asm/irq.h>
> > @@ -2098,9 +2099,11 @@ static inline void wait_for_xmitr(struct
> > /* Wait up to 1s for flow control if necessary */
> > if (up->port.flags & UPF_CONS_FLOW) {
> > tmout = 1000000;
> > - while (--tmout &&
> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout) {
> > udelay(1);
> > + if ((tmout % 1000) == 0)
> > + touch_nmi_watchdog();
> > + }
> > }
> > }
> >

2005-05-14 10:54:52

by Ed Tomlinson

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Hi,

The message from a+s+T on a serial console (x86_64) is:
BUG: soft lockup detected on cpu#0
Its not changed by this patch. Probably would be a good idea to correct this?

Ed

On Saturday 14 May 2005 06:31, Ed Tomlinson wrote:
> Suspect this can be triggered using alt+sysreq+T on a busy system with a slow
> serial console. Might be a easy way to see if this patch fixes the issue?
>
> Ed Tomlinson
>
> On Saturday 14 May 2005 03:07, Andrew Morton wrote:
> > Dave Jones <[email protected]> wrote:
> > >
> > > On Fri, May 13, 2005 at 11:43:31PM -0700, Andrew Morton wrote:
> > > > Dave Jones <[email protected]> wrote:
> > > > >
> > > > > This was fun. I inserted a music CD with some obnoxious copy-protection
> > > > > on it into the drive, and lots of SCSI errors went zipping over to
> > > > > the serial console. Unfortunatly, the box was also compiling a kernel,
> > > > > playing oggs, and doing a number of other things at the same time,
> > > > > so this happened..
> > > > >
> > > > > NMI Watchdog detected LOCKUP on CPU2CPU 2
> > > >
> > > > OK.. But calling touch_nmi_watchdog() at 1MHz seems a bit excessive, and
> > > > might perturb the finely-tuned timing in there.
> > > >
> > > > How's about this?
> > >
> > > Umm.. Despite it being past my bedtime, I'm pretty sure I'm
> > > missing something here...
> > >
> > > > + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout)
> > > > udelay(1);
> > >
> > > I don't see how this is any better than the current code.
> > > We're doing 1000000 udelays. Whilst we're doing that,
> > > the nmi watchdog goes bonkers.
> > >
> > > > + if (tmout < 1000000)
> > > > + touch_nmi_watchdog();
> > >
> > > So by the time we do this, its already triggered.
> >
> > But the NMI watchdog won't expire after one second - normally it's set to
> > fixe seconds.
> >
> > > How about..
> > >
> > > --- linux-2.6.11/drivers/serial/8250.c~ 2005-05-14 02:49:02.000000000 -0400
> > > +++ linux-2.6.11/drivers/serial/8250.c 2005-05-14 02:54:30.000000000 -0400
> > > @@ -40,6 +40,7 @@
> > > #include <linux/serial_core.h>
> > > #include <linux/serial.h>
> > > #include <linux/serial_8250.h>
> > > +#include <linux/nmi.h>
> > >
> > > #include <asm/io.h>
> > > #include <asm/irq.h>
> > > @@ -2099,8 +2100,15 @@ static inline void wait_for_xmitr(struct
> > > if (up->port.flags & UPF_CONS_FLOW) {
> > > tmout = 1000000;
> > > while (--tmout &&
> > > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > > + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> > > + int cnt=0;
> > > udelay(1);
> > > + cnt++;
> > > + if (cnt==100) {
> > > + touch_nmi_watchdog();
> > > + cnt=0;
> > > + }
> > > + }
> >
> > <obwhitespacewhine> spose so.
> >
> > --- 25/drivers/serial/8250.c~tickle-nmi-watchdog-whilst-doing-serial-writes 2005-05-14 00:03:09.000000000 -0700
> > +++ 25-akpm/drivers/serial/8250.c 2005-05-14 00:06:53.000000000 -0700
> > @@ -40,6 +40,7 @@
> > #include <linux/serial_core.h>
> > #include <linux/serial.h>
> > #include <linux/serial_8250.h>
> > +#include <linux/nmi.h>
> >
> > #include <asm/io.h>
> > #include <asm/irq.h>
> > @@ -2098,9 +2099,11 @@ static inline void wait_for_xmitr(struct
> > /* Wait up to 1s for flow control if necessary */
> > if (up->port.flags & UPF_CONS_FLOW) {
> > tmout = 1000000;
> > - while (--tmout &&
> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > + while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout) {
> > udelay(1);
> > + if ((tmout % 1000) == 0)
> > + touch_nmi_watchdog();
> > + }
> > }
> > }
> >
> > _
> >
> > -
>

2005-05-15 11:38:05

by Andi Kleen

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Dave Jones <[email protected]> writes:
>
> #include <asm/io.h>
> #include <asm/irq.h>
> @@ -2099,8 +2100,10 @@ static inline void wait_for_xmitr(struct
> if (up->port.flags & UPF_CONS_FLOW) {
> tmout = 1000000;
> while (--tmout &&
> - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> udelay(1);
> + touch_nmi_watchdog();

Note that touch_nmi_watchdog is not exported on i386 - Linus vetoed
that some time ago. The real fix of course is to use schedule_timeout(),
but that might break printk() with interrupts off :/

-Andi

2005-05-15 11:40:16

by Andi Kleen

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Alexander Nyberg <[email protected]> writes:
>
> But uhm, it should take at least 5 seconds of no-interrupts before the
> NMI watchdog decides the box is dead so this is kind of weird.

There is a bug somewhere that makes it sometimes expire faster.
I found only one bug so far - touch_nmi_watchdog has a race
(fixed now in -mm*). You can try if it still happens with the next -mm.

Somehow i doubt it is the race though because it should
be too unlikely to trigger this. Could be some other bug
too.

-Andi

2005-05-15 12:08:06

by Russell King

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Sun, May 15, 2005 at 01:38:02PM +0200, Andi Kleen wrote:
> Dave Jones <[email protected]> writes:
> >
> > #include <asm/io.h>
> > #include <asm/irq.h>
> > @@ -2099,8 +2100,10 @@ static inline void wait_for_xmitr(struct
> > if (up->port.flags & UPF_CONS_FLOW) {
> > tmout = 1000000;
> > while (--tmout &&
> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
> > + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
> > udelay(1);
> > + touch_nmi_watchdog();
>
> Note that touch_nmi_watchdog is not exported on i386 - Linus vetoed
> that some time ago. The real fix of course is to use schedule_timeout(),
> but that might break printk() with interrupts off :/

Not to mention printk() from atomic contexts and panic(). No,
schedule_timeout() is _not_ a "real fix" but a kludge.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-05-15 12:20:21

by Andi Kleen

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Russell King <[email protected]> writes:

> On Sun, May 15, 2005 at 01:38:02PM +0200, Andi Kleen wrote:
>> Dave Jones <[email protected]> writes:
>> >
>> > #include <asm/io.h>
>> > #include <asm/irq.h>
>> > @@ -2099,8 +2100,10 @@ static inline void wait_for_xmitr(struct
>> > if (up->port.flags & UPF_CONS_FLOW) {
>> > tmout = 1000000;
>> > while (--tmout &&
>> > - ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
>> > + ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
>> > udelay(1);
>> > + touch_nmi_watchdog();
>>
>> Note that touch_nmi_watchdog is not exported on i386 - Linus vetoed
>> that some time ago. The real fix of course is to use schedule_timeout(),
>> but that might break printk() with interrupts off :/
>
> Not to mention printk() from atomic contexts and panic(). No,
> schedule_timeout() is _not_ a "real fix" but a kludge.

Then someone needs to convince Linus to export touch_nmi_watchdog
again.

Or how about checking if interrupts are off here (iirc we have
a generic function for that now) and then using
a smaller timeout and otherwise schedule_timeout() ?

-Andi

2005-05-15 14:01:16

by Russell King

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Sun, May 15, 2005 at 02:20:14PM +0200, Andi Kleen wrote:
> Then someone needs to convince Linus to export touch_nmi_watchdog
> again.
>
> Or how about checking if interrupts are off here (iirc we have
> a generic function for that now) and then using
> a smaller timeout and otherwise schedule_timeout() ?

The interrupt state doesn't tell us whether we can schedule. It
tells us when we can't schedule, which is different from when we
can. For example:

spin_lock(foo_lock);
...
printk("blah blah blah\n");
...
spin_unlock(foo_lock);

This context is non-preemptable, but doesn't have IRQs disabled.
The solution would be to keep a "spinlock depth" counter, but
obviously that's not a possibility.

I would agree that the most correct thing to do would be to export
touch_nmi_watchdog()... if only Linus would accept the arguments
_for_ exporting it.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-05-15 14:03:43

by Andi Kleen

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Sun, May 15, 2005 at 03:01:03PM +0100, Russell King wrote:
> On Sun, May 15, 2005 at 02:20:14PM +0200, Andi Kleen wrote:
> > Then someone needs to convince Linus to export touch_nmi_watchdog
> > again.
> >
> > Or how about checking if interrupts are off here (iirc we have
> > a generic function for that now) and then using
> > a smaller timeout and otherwise schedule_timeout() ?
>
> The interrupt state doesn't tell us whether we can schedule. It
> tells us when we can't schedule, which is different from when we
> can. For example:

Yes, you're right of course. Must have not been thinking clearly.

-Andi

2005-05-19 00:24:43

by George Anzinger

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Andi Kleen wrote:
> Russell King <[email protected]> writes:
>
>
>>On Sun, May 15, 2005 at 01:38:02PM +0200, Andi Kleen wrote:
>>
>>>Dave Jones <[email protected]> writes:
>>>
>>>>
>>>> #include <asm/io.h>
>>>> #include <asm/irq.h>
>>>>@@ -2099,8 +2100,10 @@ static inline void wait_for_xmitr(struct
>>>> if (up->port.flags & UPF_CONS_FLOW) {
>>>> tmout = 1000000;
>>>> while (--tmout &&
>>>>- ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
>>>>+ ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0)) {
>>>> udelay(1);
>>>>+ touch_nmi_watchdog();
>>>
>>>Note that touch_nmi_watchdog is not exported on i386 - Linus vetoed
>>>that some time ago. The real fix of course is to use schedule_timeout(),
>>>but that might break printk() with interrupts off :/
>>
>>Not to mention printk() from atomic contexts and panic(). No,
>>schedule_timeout() is _not_ a "real fix" but a kludge.

Um... I would think the real fix is to set the UART up to generate the modem
status interrupt and eliminate the pole loop. Why can't this be done? I, for
one, don't want my cpu looping in the serial driver, even more so with the
interrupt system off. This, in my mind, is a real bug in the serial driver and
should be so handled.


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-05-19 07:33:47

by Russell King

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

On Wed, May 18, 2005 at 05:24:13PM -0700, George Anzinger wrote:
> Um... I would think the real fix is to set the UART up to generate the modem
> status interrupt and eliminate the pole loop. Why can't this be done? I, for
> one, don't want my cpu looping in the serial driver, even more so with the
> interrupt system off. This, in my mind, is a real bug in the serial driver and
> should be so handled.

Because printk is *synchronous*. It never returns until it's written
the entire message. There is no buffering.

Extra complexity, adding reliance on interrupts, etc all means that
you reduce the probability that you'll get the panic or oops message
out of the system.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-05-23 20:20:13

by Bill Davidsen

[permalink] [raw]
Subject: Re: tickle nmi watchdog whilst doing serial writes.

Russell King wrote:
> On Wed, May 18, 2005 at 05:24:13PM -0700, George Anzinger wrote:
>
>>Um... I would think the real fix is to set the UART up to generate the modem
>>status interrupt and eliminate the pole loop. Why can't this be done? I, for
>>one, don't want my cpu looping in the serial driver, even more so with the
>>interrupt system off. This, in my mind, is a real bug in the serial driver and
>>should be so handled.
>
>
> Because printk is *synchronous*. It never returns until it's written
> the entire message. There is no buffering.
>
> Extra complexity, adding reliance on interrupts, etc all means that
> you reduce the probability that you'll get the panic or oops message
> out of the system.
>
This is a fine reason to loop in the serial code, I guess, but what's
the reason for allowing the NMI oops? Having use of the serial console
to catch an oops actually CAUSE an oops doesn't seem desirable, and is
probably more likely than a hardlock in the serial driver.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me