2006-03-17 11:30:39

by Holger Eitzenberger

Subject: Strange kernel bug

Hi,

I see a kernel bug with kernel 2.6.10. The hardware is a UP (uniprocessor)
Pentium 4 CPU at 2.40GHz; the output from lspci is attached.

This all happens on customers' machines, so I am unable to easily
switch to a newer kernel version, sorry.

Every few hours the machines deadlock with the following messages:

<3>scheduling while atomic: swapper/0x00000100/0
<4> [<c010333e>] dump_stack+0x1e/0x20
<4> [<c02de278>] schedule+0x458/0x510
<4> [<c01006bc>] cpu_idle+0x1c/0x50
<4> [<c0100406>] rest_init+0x26/0x30
<4> [<c03ee99a>] start_kernel+0x1ba/0x200
<4> [<c010019f>] L6+0x0/0x2
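
For readers decoding the first line of the trace: the middle field (0x00000100) appears to be the preempt count of the task, printed between its name and its PID. The sketch below splits that value into its sub-counters. It assumes the 2.6-era bit layout from include/linux/hardirq.h (8 preempt bits, 8 softirq bits, hardirq bits above those); the helper names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed 2.6-era layout of preempt_count (not verified against this
 * exact patched tree):
 *   bits  0-7  : preemption disable depth
 *   bits  8-15 : softirq / local_bh_disable() depth
 *   bits 16+   : hardirq nesting (width differs between versions)
 */
#define SKETCH_PREEMPT_BITS  8
#define SKETCH_SOFTIRQ_BITS  8
#define SKETCH_SOFTIRQ_SHIFT SKETCH_PREEMPT_BITS
#define SKETCH_HARDIRQ_SHIFT (SKETCH_PREEMPT_BITS + SKETCH_SOFTIRQ_BITS)

/* Hypothetical helpers, named for this example only. */
static unsigned preempt_depth(uint32_t pc)
{
    return pc & ((1u << SKETCH_PREEMPT_BITS) - 1);
}

static unsigned softirq_depth(uint32_t pc)
{
    return (pc >> SKETCH_SOFTIRQ_SHIFT) & ((1u << SKETCH_SOFTIRQ_BITS) - 1);
}

static unsigned hardirq_bits(uint32_t pc)
{
    return pc >> SKETCH_HARDIRQ_SHIFT;
}

/* Print the three fields of a preempt count taken from a log line. */
static void decode_preempt_count(uint32_t pc)
{
    printf("preempt=%u softirq=%u hardirq=%u\n",
           preempt_depth(pc), softirq_depth(pc), hardirq_bits(pc));
}
```

Under that assumption, calling decode_preempt_count(0x00000100) reports a softirq depth of 1 and zero for the other two fields, which is consistent with a single local_bh_disable() that was never paired with an enable.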

It seemed quite clear what happened here, so I started to search for
a missing unlock in some error path, which is quite a daunting
task. So I modified the kernel to record which code called
local_bh_disable() before the failure. The patch is attached.
This is the output:

<3>scheduling while atomic: swapper/0x00000100/0
<4>bh_users: c011b499
<4>bh_users: 00000000
<4> [<c010333e>] dump_stack+0x1e/0x20
<4> [<c02de278>] schedule+0x458/0x510
<4> [<c01006bc>] cpu_idle+0x1c/0x50
<4> [<c0100406>] rest_init+0x26/0x30
<4> [<c03ee99a>] start_kernel+0x1ba/0x200
<4> [<c010019f>] L6+0x0/0x2

c011b499 is an address inside __do_softirq, and this is the point I do
not currently understand.
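
The attached bh_user.diff is not reproduced here, so the following is only a guess at the general idea, written as a userspace sketch: every "disable" call records its caller's return address in a small per-depth array, and every "enable" call clears the slot again. All names (sketch_bh_disable, bh_users, BH_USERS_MAX) are hypothetical and do not come from the patch; __builtin_return_address() is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical depth limit; two slots mirror the two bh_users lines
 * in the log, but the real patch may differ. */
#define BH_USERS_MAX 2

static int   bh_depth;                 /* current nesting depth       */
static void *bh_users[BH_USERS_MAX];   /* caller address per depth    */

/* Stand-in for local_bh_disable(): remember who called us. */
__attribute__((noinline)) static void sketch_bh_disable(void)
{
    if (bh_depth < BH_USERS_MAX)
        bh_users[bh_depth] = __builtin_return_address(0);
    bh_depth++;
}

/* Stand-in for local_bh_enable(): forget the matching caller. */
__attribute__((noinline)) static void sketch_bh_enable(void)
{
    if (bh_depth <= 0)
        return;                        /* unbalanced enable, ignore   */
    bh_depth--;
    if (bh_depth < BH_USERS_MAX)
        bh_users[bh_depth] = NULL;
}

/* Dump the recorded callers, similar in spirit to the log lines. */
static void sketch_dump_bh_users(void)
{
    for (int i = 0; i < BH_USERS_MAX; i++)
        printf("bh_users: %p\n", bh_users[i]);
}
```

With this scheme, a slot that is still non-NULL at the time of the "scheduling while atomic" message points at the call site whose disable was never undone. The resulting address can then be looked up in System.map or a disassembly.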

Note that the kernel is patched with some very intrusive patches like
LKCD, KDB and Xen 2. So I will disable all but KDB and see what
happens.

/holger

--


Attachments:
(No filename) (1.41 kB)
lspci (1.73 kB)
__do_softirq.s (3.08 kB)
bh_user.diff (2.47 kB)