2005-04-12 22:22:26

by William Weston

[permalink] [raw]
Subject: BUGs in 2.6.12-rc2-RT-V0.7.45-01

Hello,

These two BUGs showed up in 2.6.12-rc2-RT-V0.7.45-01 on my P4/HT desktop
machine at work:


During boot, after unused kernel memory freed:

BUG: kstopmachine:1072 RT task yield()-ing!
[<c0103dba>] dump_stack+0x23/0x25 (20)
[<c0311d3f>] yield+0x67/0x69 (20)
[<c01439d4>] stop_machine+0xa4/0x15e (40)
[<c0143abd>] do_stop+0x15/0x77 (20)
[<c01332cb>] kthread+0xab/0xd3 (48)
[<c0100fe9>] kernel_thread_helper+0x5/0xb (567844884)
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<c013cc26>] .... print_traces+0x1b/0x52
.....[<c0103dba>] .. ( <= dump_stack+0x23/0x25)


Triggered when running 'yum update', followed by system lockup within one
minute:

BUG: sleeping function called from invalid context python(4302) at kernel/rt.c:1613
in_atomic():1 [00000001], irqs_disabled():1
[<c0103dba>] dump_stack+0x23/0x25 (20)
[<c0119ed0>] __might_sleep+0xd8/0xed (36)
[<c0139bbc>] __spin_lock+0x34/0x50 (24)
[<c0139bf5>] _spin_lock+0x1d/0x1f (16)
[<c01465cf>] lock_kprobes+0x17/0x23 (12)
[<c01122c9>] kprobe_handler+0xa9/0x209 (32)
[<c011251c>] kprobe_exceptions_notify+0x48/0x13c (28)
[<c012b65b>] notifier_call_chain+0x25/0x3f (32)
[<c0104de0>] do_int3+0x47/0x9b (76)
[<c0103b4e>] int3+0x1e/0x2c (-478200156)
---------------------------
| preempt count: 00000002 ]
| 2-level deep critical section nesting:
----------------------------------------
.. [<c0112244>] .... kprobe_handler+0x24/0x209
.....[<c011251c>] .. ( <= kprobe_exceptions_notify+0x48/0x13c)
.. [<c013cc26>] .... print_traces+0x1b/0x52
.....[<c0103dba>] .. ( <= dump_stack+0x23/0x25)


System will lock up at least once a day, starting somewhere around
2.6.12-rc1-V0.7.43-05 on the P4/HT box. My Athlon box at home is running
fine on the latest -RT kernels.

Please let me know if there's any more debugging I can do.


Regards,
--William Weston


--
/* William Weston <[email protected]> */


Attachments:
config-2.6.12-rc2-RT-V0.7.45-01-debug (11.44 kB)

2005-04-12 23:33:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01


* William Weston <[email protected]> wrote:

> > what are you using kprobes for? Do you get lockups even if you disable
> > kprobes?
>
> I'm not using kprobes currently. I'll recompile and see if the
> lockups go away.

ah - it's an 'empty' kprobes check, and python triggered a userspace
int3. I think kprobes should really get some RCU locking to check its
handler list, to not slow down normal int3 processing. Meanwhile,
disabling kprobes in the .config will probably work around this
particular problem.

Ingo

2005-04-12 23:33:05

by William Weston

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01

> > Triggered when running 'yum update', followed by system lockup within
> > one minute:
> >
> > BUG: sleeping function called from invalid context python(4302) at kernel/rt.c:1613
> > in_atomic():1 [00000001], irqs_disabled():1
> > [<c0103dba>] dump_stack+0x23/0x25 (20)
> > [<c0119ed0>] __might_sleep+0xd8/0xed (36)
> > [<c0139bbc>] __spin_lock+0x34/0x50 (24)
> > [<c0139bf5>] _spin_lock+0x1d/0x1f (16)
> > [<c01465cf>] lock_kprobes+0x17/0x23 (12)
> > [<c01122c9>] kprobe_handler+0xa9/0x209 (32)
> > [<c011251c>] kprobe_exceptions_notify+0x48/0x13c (28)
>
> what are you using kprobes for? Do you get lockups even if you disable
> kprobes?
>
> Ingo

I'm not using kprobes currently. I'll recompile and see if the lockups go
away.

--ww

--
/* William Weston <[email protected]> */

2005-04-12 23:49:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01


* William Weston <[email protected]> wrote:

> Triggered when running 'yum update', followed by system lockup within
> one minute:
>
> BUG: sleeping function called from invalid context python(4302) at kernel/rt.c:1613
> in_atomic():1 [00000001], irqs_disabled():1
> [<c0103dba>] dump_stack+0x23/0x25 (20)
> [<c0119ed0>] __might_sleep+0xd8/0xed (36)
> [<c0139bbc>] __spin_lock+0x34/0x50 (24)
> [<c0139bf5>] _spin_lock+0x1d/0x1f (16)
> [<c01465cf>] lock_kprobes+0x17/0x23 (12)
> [<c01122c9>] kprobe_handler+0xa9/0x209 (32)
> [<c011251c>] kprobe_exceptions_notify+0x48/0x13c (28)

what are you using kprobes for? Do you get lockups even if you disable
kprobes?

Ingo

2005-04-14 21:08:00

by William Weston

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01

On Wed, 13 Apr 2005, Ingo Molnar wrote:

> what are you using kprobes for? Do you get lockups even if you disable
> kprobes?

Various processes will lockup on the P4/HT system, usually while under
some load. The processes cannot be killed. X will lockup once or twice a
day (which means my console, and thus sysrq, are toast), but I can still
ssh in. Nothing is logged by the kernel. Are there any post-lockup
forensics that can be performed before I reboot?

Regards,
--William Weston

--
/* William Weston <[email protected]> */

2005-04-15 08:02:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01


* William Weston <[email protected]> wrote:

> On Wed, 13 Apr 2005, Ingo Molnar wrote:
>
> > what are you using kprobes for? Do you get lockups even if you disable
> > kprobes?
>
> Various processes will lockup on the P4/HT system, usually while under
> some load. The processes cannot be killed. X will lockup once or
> twice a day (which means my console, and thus sysrq, are toast), but I
> can still ssh in. Nothing is logged by the kernel. Are there any
> post-lockup forensics that can be performed before I reboot?

could you try the -44-03 patch:

http://redhat.com/~mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc2-V0.7.44-03

this doesnt have the plist changes yet. Is this one more stable?

Ingo

2005-04-16 01:26:27

by William Weston

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01

On Fri, 15 Apr 2005, Ingo Molnar wrote:

> * William Weston <[email protected]> wrote:
>
> > On Wed, 13 Apr 2005, Ingo Molnar wrote:
> >
> > > what are you using kprobes for? Do you get lockups even if you disable
> > > kprobes?
> >
> > Various processes will lockup on the P4/HT system, usually while under
> > some load. The processes cannot be killed. X will lockup once or
> > twice a day (which means my console, and thus sysrq, are toast), but I
> > can still ssh in. Nothing is logged by the kernel. Are there any
> > post-lockup forensics that can be performed before I reboot?
>
> could you try the -44-03 patch:
>
> http://redhat.com/~mingo/realtime-preempt/older/realtime-preempt-2.6.12-rc2-V0.7.44-03
>
> this doesnt have the plist changes yet. Is this one more stable?

With 2.6.12-rc2-V0.7.44-03, the P4/HT box has been stable all day. I'm
seeing another BUG on boot, not kprobes related this time:

Freeing unused kernel memory: 192k freed
logips2pp: Detected unknown logitech mouse model 86
input: ImExPS/2 Logitech Explorer Mouse on isa0060/serio1
BUG: kstopmachine:1054 RT task yield()-ing!
[<c0103dba>] dump_stack+0x23/0x25 (20)
[<c030ff1f>] yield+0x67/0x69 (20)
[<c0142a44>] stop_machine+0xa4/0x15e (40)
[<c0142b2d>] do_stop+0x15/0x77 (20)
[<c0132c5b>] kthread+0xab/0xd3 (48)
[<c0100fe9>] kernel_thread_helper+0x5/0xb (563617812)
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<c013bca6>] .... print_traces+0x1b/0x52
.....[<c0103dba>] .. ( <= dump_stack+0x23/0x25)

ns83820.c: National Semiconductor DP83820 10/100/1000 driver.
eth0: ns83820.c: 0x22c: 49001186, subsystem: 1186:4900
eth0: ns83820 v0.20: DP83820 v1.2: 00:50:ba:37:d4:bc io=0xff8ff000 irq=18 f=sg


I'm also seeing an unlikely wakeup time:

( X-3581 |#1): new 2533412143 us maximum-latency wakeup.


Otherwise, this on seems very stable.


Best Regards,
--William Weston


--
/* William Weston <[email protected]> */

2005-05-11 19:30:36

by Steven Rostedt

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01

Hi Ingo,

I've seen lots of complaining about the yield BUG produced by
kstopmachine, and since I'm now starting to test this on an SMP machine,
I'm seeing it too. So I've looked further into this, and here's what
I've found.

The kstopmachine creates one thread per CPU to run on each CPU. It sets
this thread to the lowest RT priority and then spins on yield, each
thread expects to be on its designated CPU and spin until all threads
check in (all threads are on their expected CPU). The yield is only to
allow one of the other threads (of same priority) to migrate to their
expected CPU if it started on the wrong CPU. So the use of yield here
is actually correct!

So for this special case, I've included a patch here (attached) to allow
for a call of yield when it is actually OK for a RT task to call yield.
It's called rt_yield. Take a look and see what you think. I patched
this against 45-01 since that's what I'm currently working with.

-- Steve

On Fri, 2005-04-15 at 18:24 -0700, William Weston wrote:
> On Fri, 15 Apr 2005, Ingo Molnar wrote:
[...]
>
> With 2.6.12-rc2-V0.7.44-03, the P4/HT box has been stable all day. I'm
> seeing another BUG on boot, not kprobes related this time:
>
> Freeing unused kernel memory: 192k freed
> logips2pp: Detected unknown logitech mouse model 86
> input: ImExPS/2 Logitech Explorer Mouse on isa0060/serio1
> BUG: kstopmachine:1054 RT task yield()-ing!
> [<c0103dba>] dump_stack+0x23/0x25 (20)
> [<c030ff1f>] yield+0x67/0x69 (20)
> [<c0142a44>] stop_machine+0xa4/0x15e (40)
> [<c0142b2d>] do_stop+0x15/0x77 (20)
> [<c0132c5b>] kthread+0xab/0xd3 (48)
> [<c0100fe9>] kernel_thread_helper+0x5/0xb (563617812)
> ---------------------------
> | preempt count: 00000001 ]
> | 1-level deep critical section nesting:
> ----------------------------------------
> .. [<c013bca6>] .... print_traces+0x1b/0x52
> .....[<c0103dba>] .. ( <= dump_stack+0x23/0x25)


Attachments:
rt-yeild.patch (1.86 kB)

2005-05-23 07:39:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: BUGs in 2.6.12-rc2-RT-V0.7.45-01


* Steven Rostedt <[email protected]> wrote:

> Hi Ingo,
>
> I've seen lots of complaining about the yield BUG produced by
> kstopmachine, and since I'm now starting to test this on an SMP
> machine, I'm seeing it too. So I've looked further into this, and
> here's what I've found.
>
> The kstopmachine creates one thread per CPU to run on each CPU. It
> sets this thread to the lowest RT priority and then spins on yield,
> each thread expects to be on its designated CPU and spin until all
> threads check in (all threads are on their expected CPU). The yield
> is only to allow one of the other threads (of same priority) to
> migrate to their expected CPU if it started on the wrong CPU. So the
> use of yield here is actually correct!
>
> So for this special case, I've included a patch here (attached) to
> allow for a call of yield when it is actually OK for a RT task to call
> yield. It's called rt_yield. Take a look and see what you think. I
> patched this against 45-01 since that's what I'm currently working
> with.

agreed - i've applied your patch and i've reworked it to be
yield()/__yield(). (making it more of an internal interface - this is a
valid but still quite unrobust use of scheduling features.)

Ingo