2006-02-20 21:18:36

by David Miller

[permalink] [raw]
Subject: softlockup interaction with slow consoles


When register_console() runs, it dumps the entire log buffer
to the console device with interrupts off.

If you're going over a slow console, this can easily take more
than 10 seconds especially on SMP with many cpus brought up.

This makes the softlock fire, which is terribly annoying :-)

I think the bug is in the console registry code, I think it
should capture chunks of the existing console buffer into some
local memory and push things piece by piece with interrupts
enabled to the console driver(s).

Any better ideas?


2006-02-21 09:10:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles


On Mon, 20 Feb 2006, David S. Miller wrote:

> When register_console() runs, it dumps the entire log buffer to the
> console device with interrupts off.
>
> If you're going over a slow console, this can easily take more than 10
> seconds especially on SMP with many cpus brought up.
>
> This makes the softlock fire, which is terribly annoying :-)
>
> I think the bug is in the console registry code, I think it should
> capture chunks of the existing console buffer into some local memory and
> push things piece by piece with interrupts enabled to the console
> driver(s).
>
> Any better ideas?

is this a boot-time-only problem? In recent updates to the softlockup
code in -mm:

timer-irq-driven-soft-watchdog-cleanups.patch
timer-irq-driven-soft-watchdog-percpu-race-fix.patch
timer-irq-driven-soft-watchdog-percpu-fix.patch
timer-irq-driven-soft-watchdog-boot-fix.patch

i changed soft lockup detection to be turned off during bootup. That
should work around any boot-time warnings.

(if this can happen on a booted up system then the real fix would indeed
be to split up register_console()'s workload - that would also make it
more preemption-friendly. But at first sight it looks quite complex to
do.)

Ingo

2006-02-21 09:16:37

by David Miller

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

From: Ingo Molnar <[email protected]>
Date: Tue, 21 Feb 2006 04:09:58 -0500 (EST)

> i changed soft lockup detection to be turned off during bootup. That
> should work around any boot-time warnings.

Excellent.

> (if this can happen on a booted up system then the real fix would indeed
> be to split up register_console()'s workload - that would also make it
> more preemption-friendly. But at first sight it looks quite complex to
> do.)

Agreed. I thought about buffering in the slow console driver itself
but that's bad because if it's a crash message we might not get the
events (interrupts, or whatever) in order to make forward progress
printing out the buffer, and thus we'd lose the valuable messages.

2006-02-21 13:50:01

by Andi Kleen

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

"David S. Miller" <[email protected]> writes:

> From: Ingo Molnar <[email protected]>
> Date: Tue, 21 Feb 2006 04:09:58 -0500 (EST)
>
> > i changed soft lockup detection to be turned off during bootup. That
> > should work around any boot-time warnings.
>
> Excellent.


Still you could probably see problems with very slow consoles even after
bootup, couldn't you?

(for some reason some people still use 9600 baud serial consoles, which
tend to trigger all kind of interesting races)


-Andi

2006-02-21 19:23:50

by Matt Mackall

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

On Tue, Feb 21, 2006 at 01:16:50AM -0800, David S. Miller wrote:
> From: Ingo Molnar <[email protected]>
> Date: Tue, 21 Feb 2006 04:09:58 -0500 (EST)
>
> > i changed soft lockup detection to be turned off during bootup. That
> > should work around any boot-time warnings.
>
> Excellent.

I don't like it. We should instead just have printk tickle the watchdog.

> > (if this can happen on a booted up system then the real fix would indeed
> > be to split up register_console()'s workload - that would also make it
> > more preemption-friendly. But at first sight it looks quite complex to
> > do.)
>
> Agreed. I thought about buffering in the slow console driver itself
> but that's bad because if it's a crash message we might not get the
> events (interrupts, or whatever) in order to make forward progress
> printing out the buffer, and thus we'd lose the valuable messages.

Absolutely.

Also note that sysrq-t can be substantially worse than the the
boot-time dump if you've got a lot of processes. So the boot-time hack
is insufficient. Even at 115kbps, this can be several seconds.

--
Mathematics is the supreme nostalgia of our time.

2006-02-21 20:01:46

by David Miller

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

From: Matt Mackall <[email protected]>
Date: Tue, 21 Feb 2006 13:23:40 -0600

> I don't like it. We should instead just have printk tickle the watchdog.

You can't, interrupts are disabled the entire time and thus
jiffies aren't advancing. The "touch" just sets the local
cpu timestamp to whatever jiffies is.

That's the problem.

I agree that it would be nice to fix this, but the desire for
synchronous console output makes this very hard to solve.

Like I said, maybe some local chunking in release_console_sem()
but since interrupts will come in we can trigger more and more
lengthy console output and eventually overflow the printk log
buffer before the chunking loop can catch up and we'll thus
lose messages.

2006-02-21 20:04:11

by David Miller

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

From: Andi Kleen <[email protected]>
Date: 21 Feb 2006 14:49:49 +0100

> Still you could probably see problems with very slow consoles even
> after bootup, couldn't you?

Yes, others have mentioned this too, good point.

Depending upon what we're really trying to achieve with the
softlockup stuff, we can perhaps look into increasing the
timeout or making it configurable. We could even do this
dynamically, so when we register a serial console running
at some low baud rate, we scale up the softlockup timeout
or something like that.

2006-02-21 20:05:46

by Andi Kleen

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

On Tuesday 21 February 2006 21:01, David S. Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: 21 Feb 2006 14:49:49 +0100
>
> > Still you could probably see problems with very slow consoles even
> > after bootup, couldn't you?
>
> Yes, others have mentioned this too, good point.
>
> Depending upon what we're really trying to achieve with the
> softlockup stuff, we can perhaps look into increasing the
> timeout or making it configurable. We could even do this
> dynamically, so when we register a serial console running
> at some low baud rate, we scale up the softlockup timeout
> or something like that.

The classic way is to just use touch_nmi_watchdog() somewhere
in the loop that does work. That touches the softwatchdog too
these days.

-Andi


2006-02-21 20:22:16

by David Miller

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

From: Andi Kleen <[email protected]>
Date: Tue, 21 Feb 2006 21:05:37 +0100

> The classic way is to just use touch_nmi_watchdog() somewhere
> in the loop that does work. That touches the softwatchdog too
> these days.

"jiffies" aren't advancing, since interrupts are disabled by
release_console_sem(), so that doesn't work.

I tried that already :-)

2006-02-21 20:28:30

by Dave Jones

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

On Tue, Feb 21, 2006 at 12:19:48PM -0800, David S. Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Tue, 21 Feb 2006 21:05:37 +0100
>
> > The classic way is to just use touch_nmi_watchdog() somewhere
> > in the loop that does work. That touches the softwatchdog too
> > these days.
>
> "jiffies" aren't advancing, since interrupts are disabled by
> release_console_sem(), so that doesn't work.
>
> I tried that already :-)

Where did you put it? I hit a similar problem a few months back,
and this patch did the trick for me..

Dave

NMI Watchdog detected LOCKUP on CPU2CPU 2
Modules linked in: loop usb_storage md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core rfcomm l2cap bluetooth sunrpc pcdPid: 3138, comm: gpm Not tainted 2.6.11-1.1290_FC4smp
RIP: 0010:[<ffffffff80273b8a>] <ffffffff80273b8a>{serial_in+106}
RSP: 0018:ffff81003afc3d50 EFLAGS: 00000002
RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff804dcd60
RBP: 00000000000024fc R08: 000000000000000a R09: 0000000000000033
R10: ffff81001beb7c20 R11: 0000000000000020 R12: ffffffff804dcd60
R13: ffffffff804ade76 R14: 000000000000002b R15: 000000000000002c
FS: 00002aaaaaac4920(0000) GS:ffffffff804fca00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaabcb000 CR3: 000000003c0d0000 CR4: 00000000000006e0
Process gpm (pid: 3138, threadinfo ffff81003afc2000, task ffff81003eb63780)
Stack: ffffffff80275f2e 0000000000000000 ffffffff80448380 0000000000007d6b
000000000000002c fffffffffffffbbf 0000000000000292 0000000000008000
ffffffff80138e8c 0000000000007d97
Call Trace:<ffffffff80275f2e>{serial8250_console_write+270} <ffffffff80138e8c>{__call_console_drivers+76}
<ffffffff8013914b>{release_console_sem+315} <ffffffff80260325>{con_open+149}
<ffffffff80254e99>{tty_open+537} <ffffffff80192713>{chrdev_open+387}
<ffffffff80188824>{dentry_open+260} <ffffffff80188994>{filp_open+68}
<ffffffff80187b73>{get_unused_fd+227} <ffffffff80188a6c>{sys_open+76}
<ffffffff8010ebc6>{tracesys+209}

Code: 0f b6 c0 c3 66 90 41 57 49 89 f7 41 56 41 be 00 01 00 00 41
console shuts up ...


--- linux-2.6.11/drivers/serial/8250.c~ 2005-05-14 02:49:02.000000000 -0400
+++ linux-2.6.11/drivers/serial/8250.c 2005-05-14 02:54:30.000000000 -0400
@@ -2098,9 +2098,11 @@ static inline void wait_for_xmitr(struct
/* Wait up to 1s for flow control if necessary */
if (up->port.flags & UPF_CONS_FLOW) {
tmout = 1000000;
- while (--tmout &&
- ((serial_in(up, UART_MSR) & UART_MSR_CTS) == 0))
+ while (!(serial_in(up, UART_MSR) & UART_MSR_CTS) && --tmout) {
udelay(1);
+ if ((tmout % 1000) == 0)
+ touch_nmi_watchdog();
+ }
}
}





2006-02-21 20:53:56

by David Miller

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

From: Dave Jones <[email protected]>
Date: Tue, 21 Feb 2006 15:27:57 -0500

> On Tue, Feb 21, 2006 at 12:19:48PM -0800, David S. Miller wrote:
> > From: Andi Kleen <[email protected]>
> > Date: Tue, 21 Feb 2006 21:05:37 +0100
> >
> > > The classic way is to just use touch_nmi_watchdog() somewhere
> > > in the loop that does work. That touches the softwatchdog too
> > > these days.
> >
> > "jiffies" aren't advancing, since interrupts are disabled by
> > release_console_sem(), so that doesn't work.
> >
> > I tried that already :-)
>
> Where did you put it? I hit a similar problem a few months back,
> and this patch did the trick for me..

Not the NMI watchdog, the softlockup watchdog.

2006-02-21 21:00:05

by Andi Kleen

[permalink] [raw]
Subject: Re: softlockup interaction with slow consoles

On Tuesday 21 February 2006 21:19, David S. Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Tue, 21 Feb 2006 21:05:37 +0100
>
> > The classic way is to just use touch_nmi_watchdog() somewhere
> > in the loop that does work. That touches the softwatchdog too
> > these days.
>
> "jiffies" aren't advancing, since interrupts are disabled by
> release_console_sem(), so that doesn't work.
>
> I tried that already :-)

Ah I see the problem I guess. When you restart then your
timer interrupt catches up the missing jiffies very quickly
and that triggers the softwatchdog.

Nasty. Perhaps it needs to look at xtime instead and
let touch_softlockup_watchdog() update that. But that could
break with NTP then. Perhaps using monotonic_clock() ?

-Andi