2003-02-03 20:32:55

by John Goerzen

Subject: Kernel 2.4.20 panic in scheduler

Hello,

Today I experienced a kernel panic running kernel 2.4.20 (plus the ctx
vserver patch; otherwise vanilla) with a bcm5700 module added in. It's
running on a dual Xeon Dell PowerEdge 2650. Those CPUs both feature
hyperthreading, which is enabled, so Linux sees four virtual CPUs.

Text on screen (from hurriedly handwritten notes):

Scheduling in interrupt
kernel BUG at sched.c:570!
Invalid operand: 0000
CPU: 0
EIP: 0010:[<c011a201>]

...

Stack: c02b910a 000001c5 00000286 00000000 2088a8e
.... some intermediate lines skipped ....
ffffffff c02fbf00 c02fbf00

Call trace: c02a510c c02a4f99 c012c12a c02882ef c027ad7d
.... some intermediate lines skipped ....
c012b419 c01208c8 c01218dc c01093ef
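A 2.4 kernel on 32-bit x86 prints each stack and call-trace word as eight hex digits, so hand-copied values can be sanity-checked before they are fed to ksymoops. The following is only a minimal sketch of such a check; the helper name suspicious_words is made up for illustration, and the skipped lines are left out rather than guessed:

```python
import re

# Words copied from the handwritten notes above. The "...." lines were
# skipped in the notes and are not filled in here.
STACK = "c02b910a 000001c5 00000286 00000000 2088a8e ffffffff c02fbf00 c02fbf00"
CALL_TRACE = ("c02a510c c02a4f99 c012c12a c02882ef c027ad7d "
              "c012b419 c01208c8 c01218dc c01093ef")

# One 32-bit word as the oops prints it: exactly eight lowercase hex digits.
HEX_WORD = re.compile(r"[0-9a-f]{8}")


def suspicious_words(line):
    """Return the words that are not exactly eight hex digits (hypothetical helper)."""
    return [word for word in line.split() if not HEX_WORD.fullmatch(word)]


print(suspicious_words(STACK))       # ['2088a8e']
print(suspicious_words(CALL_TRACE))  # []
```

The one flagged word has seven digits, which suggests a leading zero was dropped while copying from the screen.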

Running each of these addresses through ksymoops yields the following:

Adhoc 00000000 Before first symbol
Adhoc 000001c5 Before first symbol
Adhoc 00000286 Before first symbol
Adhoc 02088a8e Before first symbol
Adhoc c01093ef <system_call+33/38>
Adhoc c011a201 <schedule+501/530>
Adhoc c01208c8 <release_task+e8/110>
Adhoc c01218dc <sys_wait4+39c/410>
Adhoc c01218de <sys_wait4+39e/410>
Adhoc c012b419 <sys_release_ip_info+29/60>
Adhoc c012c12a <.text.lock.sys+ea/190>
Adhoc c026910a <tcp_sendpage+11a/160>
Adhoc c027ad7d <tcp_v4_do_rcv+10d/1c0>
Adhoc c02882ef <inet_sock_destruct+ff/1c0>
Adhoc c02a4f99 <rwsem_down_write_failed+29/40>
Adhoc c02a510c <rwsem_down_failed_common+5c/7e>
Adhoc c02b910a <timer_bug_msg+912a/37b20>
Adhoc c02fbc00 <fs_table+40/220>
Adhoc c02fbf00 <uts_sem+8/28>
Adhoc ffffffff <END_OF_CODE+76b8dc8/????>
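For readers unfamiliar with the notation: each entry reads <symbol+offset/length>, i.e. the address lies "offset" bytes into a symbol that is "length" bytes long. The sketch below illustrates that lookup under a loud assumption: instead of reading System.map as ksymoops does, it reconstructs a tiny symbol table from four of the lines above (start = address - offset). The function names symbol_table and resolve are hypothetical.

```python
import bisect
import re

# A few lines copied from the ksymoops listing above.
KSYMOOPS_LINES = """\
Adhoc c01093ef <system_call+33/38>
Adhoc c011a201 <schedule+501/530>
Adhoc c01208c8 <release_task+e8/110>
Adhoc c01218dc <sys_wait4+39c/410>
"""

# address, symbol name, offset into the symbol, symbol length (all hex)
LINE = re.compile(r"Adhoc ([0-9a-f]{8}) <([^+>]+)\+([0-9a-f]+)/([0-9a-f]+)>")


def symbol_table(text):
    """Recover sorted (start, length, name) tuples; start = address - offset."""
    table = set()
    for addr, name, off, length in LINE.findall(text):
        table.add((int(addr, 16) - int(off, 16), int(length, 16), name))
    return sorted(table)


def resolve(table, addr):
    """Find the symbol containing addr and format it as symbol+offset/length."""
    starts = [start for start, _, _ in table]
    i = bisect.bisect_right(starts, addr) - 1  # last symbol starting at or below addr
    if i < 0:
        return "Before first symbol"
    start, length, name = table[i]
    if addr - start >= length:
        return "not covered by the sample table"
    return f"{name}+{addr - start:x}/{length:x}"


table = symbol_table(KSYMOOPS_LINES)
print(resolve(table, 0xc01218de))  # sys_wait4+39e/410
print(resolve(table, 0x000001c5))  # Before first symbol
```

Small values such as 000001c5 and 00000286 are ordinary stack data rather than kernel text addresses, which is why they resolve to "Before first symbol".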


2003-02-03 21:25:58

by Chris Wright

Subject: Re: Kernel 2.4.20 panic in scheduler

* John Goerzen wrote:
>
> Today I experienced a kernel panic running kernel 2.4.20 (plus the ctx
> vserver patch; otherwise vanilla) with a bcm5700 module added in. It's

Have you tried this without the vserver patch? Last I looked, it touched
many of the code paths in your trace below. Also, if possible, set up a
serial console; it'll be a lot easier to catch the full trace.

> Adhoc c01093ef <system_call+33/38>
> Adhoc c011a201 <schedule+501/530>

vserver touches this code

> Adhoc c01208c8 <release_task+e8/110>

vserver touches this code

> Adhoc c01218dc <sys_wait4+39c/410>
> Adhoc c01218de <sys_wait4+39e/410>
> Adhoc c012b419 <sys_release_ip_info+29/60>

this is vserver code which could be called from either release_task() or
inet_sock_destruct() (both are in this trace).

you get the idea...
cheers,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2003-02-03 21:39:41

by John Goerzen

Subject: Re: Kernel 2.4.20 panic in scheduler

Chris Wright writes:

>> Today I experienced a kernel panic running kernel 2.4.20 (plus the ctx
>> vserver patch; otherwise vanilla) with a bcm5700 module added in. It's
>
> Have you tried this without the vserver patch? Last I looked it touched
> many of the code paths in your trace below. Also, if possible, set up a
> serial console, it'll be a lot easier to catch the full trace.

Unfortunately, this is on a production server, and such a drastic
change to the configuration is not really possible at the moment.
However, I have gone ahead and sent this info to the vserver
developers. We will see.

I'm already working on the serial console option. Hope to have it soon.

I saw a lot of TCP-related symbols. Is there any chance that this is
a bug in the bcm5700 module? Or in the TCP stack?

-- John

2003-02-04 06:48:27

by Paul Rolland

Subject: Re: Kernel 2.4.20 panic in scheduler

Hello,

Maybe unrelated, maybe not...
I too have a Dell 2650, Perc 3/Di and bcm5700, running 2.4.20...

What I see is the machine hanging (really hanging: nothing on the console,
still answering pings but nothing else) while doing two or three
simultaneous copies of a 2 GB file between the three 75 GB disks I have...

Another question: bcm5700 is supported in RedHat, as a module
only. At the same time, the kernel includes support for Broadcom
Tigon3... Is it safe to use the Tigon3 driver with bcm5700 hardware?

Regards,
Paul


2003-02-04 14:09:38

by John Goerzen

Subject: Re: Kernel 2.4.20 panic in scheduler

Paul Rolland writes:

> Maybe unrelated, maybe not...
> I too have a Dell 2650, Perc 3/Di and bcm5700, running 2.4.20...

Are you running the RedHat kernel?

If so, check out:

http://lists.us.dell.com/pipermail/linux-poweredge/2002-November/010486.html

There have been many reports on the PowerEdge list of trouble with the
tg3 driver in various RedHat kernels. To date, it seems that the tg3
driver in the stock Linus kernels does not cause the hang, though
there may be some other bugs, especially wrt IPv6. Specifically,
there are serious incompatibilities with Apache when running IPv6:

http://lists.us.dell.com/pipermail/linux-poweredge/2003-February/011532.html

> What I see is the machine hanging (really hanging: nothing on the console,
> still answering pings but nothing else) while doing two or three
> simultaneous copies of a 2 GB file between the three 75 GB disks I have...

Sounds like the tg3 bug people have been complaining of.

> Another question: bcm5700 is supported in RedHat, as a module
> only. At the same time, the kernel includes support for Broadcom
> Tigon3... Is it safe to use the Tigon3 driver with bcm5700 hardware?

As above, it seems it is OK to do so if you are NOT running a RedHat
kernel and you are NOT using IPv6.

The PowerEdge list is a good one. A couple of Dell and RedHat hackers
hang out there, and although I run Debian exclusively, if you want
info about your particular hardware, it's a good place to check.

-- John