2000-12-13 23:53:44

by Mohammad A. Haque

[permalink] [raw]
Subject: test12 lockups -- need feedback

At first I thought it was just me when I reported the lockups I was
having with test12 earlier this week. Now the reports are flooding. Of
course, now my machine isn't locking up anymore after recompiling from a
clean source tree (test5 w/ patches through test12)

Now, I'm trying to determine what the common element is.

Those of you who are having lockups, was test12 compiled from a patched
tree that you've previously compiled?

Those that are locking up in X. Do you have a second machine you can
hook up via serial port to grab Oops output?

I've got KDB compiled in my current kernel. I'll compile a fresh kernel
without KDB and see how long I last when I get a chance.
--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================


2000-12-14 00:53:57

by Mikael Djurfeldt

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

"Mohammad A. Haque" <[email protected]> writes:

> Those of you who are having lockups, was test12 compiled from a patched
> tree that you've previously compiled?

I downloaded the full test12 and have lockups after using X (upstream
version 4.0.1Z) 15-45 mins. For me, SysRq+u works, but if I then
press SysRq+b, nothing happens. There are no signs in the syslog.

I'm using the latest Debian packages from the Woody release on an
Mobile Pentium II, 333 MHz, 160 Mb ram.

2000-12-14 01:00:27

by Mikael Djurfeldt

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

Mikael Djurfeldt <[email protected]> writes:

> "Mohammad A. Haque" <[email protected]> writes:
>
> > Those of you who are having lockups, was test12 compiled from a patched
> > tree that you've previously compiled?
>
> I downloaded the full test12 and have lockups after using X (upstream
> version 4.0.1Z) 15-45 mins. For me, SysRq+u works, but if I then
> press SysRq+b, nothing happens. There are no signs in the syslog.

I should add that I didn't have these lockups in test12-pre8.

2000-12-14 01:56:05

by dep

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Wednesday 13 December 2000 19:29, Mikael Djurfeldt wrote:

| > I downloaded the full test12 and have lockups after using X
| > (upstream version 4.0.1Z) 15-45 mins. For me, SysRq+u works, but
| > if I then press SysRq+b, nothing happens. There are no signs in
| > the syslog.
|
| I should add that I didn't have these lockups in test12-pre8.

just for statistical purposes, test12 has been running problem-free
here on a k6-2-550 (running at 500), glibc-2.2, built with
gcc-2.95-2, since about an hour after it was announced. no anomalies
at all, and the cd reader has become reliable again. in X the entire
time, and heavy system activity with a wide variety of applications.
--
dep
--
bipartisanship: an illogical construct not unlike the idea that
if half the people like red and half the people like blue, the
country's favorite color is purple.

2000-12-14 03:28:54

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

Ok, got locked up. Dropped me into kdb and I was able to write down the
oops after doing a ss on btp 0.

I'll try to have something posted in an hour.

On Wed, 13 Dec 2000, Mohammad A. Haque wrote:

> At first I thought it was just me when I reported the lockups I was
> having with test12 earlier this week. Now the reports are flooding. Of
> course, now my machine isn't locking up anymore after recompiling from a
> clean source tree (test5 w/ patches through test12)
>
> Now, I'm trying to determine what the common element is.
>
> Those of you who are having lockups, was test12 compiled from a patched
> tree that you've previously compiled?
>
> Those that are locking up in X. Do you have a second machine you can
> hook up via serial port to grab Oops output?
>
> I've got KDB compiled in my current kernel. I'll compile a fresh kernel
> without KDB and see how long I last when I get a chance.
>

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-14 04:19:47

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

Here we go folks. I hope I got everything right. The only place I have a
doubt is the 0010: part of EIP. I couldn't read what I wrote there.
Looks like it's ip fragment related?

ksymoops 0.7c on i686 2.4.0-test11. Options used
-V (default)
-K (specified)
-L (specified)
-o /lib/modules/2.4.0-test12 (specified)
-m /usr/src/linux/System.map (default)

No modules in ksyms, skipping objects
invalid operand: 0000
CPU: 0
EIP: 0010:[<c01e610e>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: d15c83e0 ecx: d1f4aa60 edx: d1f4aa60
esi: 000003d8 edi: d15c8660 ebp: 000003d8 esp: c0303c1c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0303000)
Stack: d1f4aa60 00000000 0000625b d957accf 00000014 00000000 c01e6493 d1f4aa60
d15c8660 d3fc9680 d15c8660 00000008 c0303d28 011e51be 00000000 d58ce1bf
d15c8660 d58d0008 c0303018 00000003 d58cd3ed d15c8660 d58d0d08 c0303018
Call Trace: [<d957accf>] [<c01e6493>] [<d58ce1bf>] [<d5800008>] [<d58cd3ed>] [<d5800008>] [<c012e146>]
[<d58cf370>] [<c01e88a4>] [<c01d925c>] [<c01e88a4>] [<c01e88a4>] [<c01d94b7>] [<c01e88a4>] [<d58d0d08>]
[<c01e7faf>] [<c01e88a4>] [<c01fdf2c>] [<c01e80be>] [<c01fdf2c>] [<c01fe122>] [<c01fdf2c>] [<d957accf>]
[<d957accf>] [<c01fe64b>] [<d58cc945>] [<d58d0d38>] [<d58cf2bf>] [<c01fe89a>] [<c01e59f3>] [<c01e5a68>]
[<c01d94fa>] [<c01e5845>] [<c01e5970>] [<c01e5c0f>] [<c01e5a68>] [<c01d94fa>] [<c01e593d>] [<c01e5a68>]
[<c01dce3d>] [<c011ef4f>] [<c010c891>] [<c0109420>] [<c0109420>] [<c010b128>] [<c0109420>] [<c0109420>]
[<c0100018>] [<c0109443>] [<c01094a9>] [<c0105000>] [<c0100191>]
Code: 8b 40 3c 89 41 3c 8b 47 5c c7 47 18 00 00 00 00 01 41 18 8b

>>EIP; c01e610e <ip_frag_queue+20a/254> <=====
Trace; d957accf <END_OF_CODE+19209c2b/????>
Trace; c01e6493 <ip_defrag+b3/130>
Trace; d58ce1bf <END_OF_CODE+1555d11b/????>
Trace; d5800008 <END_OF_CODE+1548ef64/????>
Trace; d58cd3ed <END_OF_CODE+1555c349/????>
Trace; d5800008 <END_OF_CODE+1548ef64/????>
Trace; c012e146 <__alloc_pages+de/2d0>
Trace; d58cf370 <END_OF_CODE+1555e2cc/????>
Trace; c01e88a4 <output_maybe_reroute+0/14>
Trace; c01d925c <nf_iterate+30/8c>
Trace; c01e88a4 <output_maybe_reroute+0/14>
Trace; c01e88a4 <output_maybe_reroute+0/14>
Trace; c01d94b7 <nf_hook_slow+7f/100>
Trace; c01e88a4 <output_maybe_reroute+0/14>
Trace; d58d0d08 <END_OF_CODE+1555fc64/????>
Trace; c01e7faf <ip_build_xmit_slow+3b7/478>
Trace; c01e88a4 <output_maybe_reroute+0/14>
Trace; c01fdf2c <icmp_glue_bits+0/88>
Trace; c01e80be <ip_build_xmit+4e/2fc>
Trace; c01fdf2c <icmp_glue_bits+0/88>
Trace; c01fe122 <icmp_reply+16e/18c>
Trace; c01fdf2c <icmp_glue_bits+0/88>
Trace; d957accf <END_OF_CODE+19209c2b/????>
Trace; d957accf <END_OF_CODE+19209c2b/????>
Trace; c01fe64b <icmp_echo+3f/48>
Trace; d58cc945 <END_OF_CODE+1555b8a1/????>
Trace; d58d0d38 <END_OF_CODE+1555fc94/????>
Trace; d58cf2bf <END_OF_CODE+1555e21b/????>
Trace; c01fe89a <icmp_rcv+9a/d0>
Trace; c01e59f3 <ip_local_deliver_finish+83/f8>
Trace; c01e5a68 <ip_rcv_finish+0/1d8>
Trace; c01d94fa <nf_hook_slow+c2/100>
Trace; c01e5845 <ip_local_deliver+39/40>
Trace; c01e5970 <ip_local_deliver_finish+0/f8>
Trace; c01e5c0f <ip_rcv_finish+1a7/1d8>
Trace; c01e5a68 <ip_rcv_finish+0/1d8>
Trace; c01d94fa <nf_hook_slow+c2/100>
Trace; c01e593d <ip_rcv+f1/124>
Trace; c01e5a68 <ip_rcv_finish+0/1d8>
Trace; c01dce3d <net_rx_action+19d/278>
Trace; c011ef4f <do_softirq+3f/64>
Trace; c010c891 <do_IRQ+a1/b0>
Trace; c0109420 <default_idle+0/28>
Trace; c0109420 <default_idle+0/28>
Trace; c010b128 <ret_from_intr+0/20>
Trace; c0109420 <default_idle+0/28>
Trace; c0109420 <default_idle+0/28>
Trace; c0100018 <startup_32+18/139>
Trace; c0109443 <default_idle+23/28>
Trace; c01094a9 <cpu_idle+41/54>
Trace; c0105000 <empty_bad_page+0/1000>
Trace; c0100191 <L6+0/2>
Code; c01e610e <ip_frag_queue+20a/254>
00000000 <_EIP>:
Code; c01e610e <ip_frag_queue+20a/254> <=====
0: 8b 40 3c mov 0x3c(%eax),%eax <=====
Code; c01e6111 <ip_frag_queue+20d/254>
3: 89 41 3c mov %eax,0x3c(%ecx)
Code; c01e6114 <ip_frag_queue+210/254>
6: 8b 47 5c mov 0x5c(%edi),%eax
Code; c01e6117 <ip_frag_queue+213/254>
9: c7 47 18 00 00 00 00 movl $0x0,0x18(%edi)
Code; c01e611e <ip_frag_queue+21a/254>
10: 01 41 18 add %eax,0x18(%ecx)
Code; c01e6121 <ip_frag_queue+21d/254>
13: 8b 00 mov (%eax),%eax



On Wed, 13 Dec 2000, Mohammad A. Haque wrote:

> Ok, got locked up. Dropped me into kdb and I was able to write down the
> oops after doing a ss on btp 0.
>
> I'll try to have something posted in an hour.
>

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-14 10:42:45

by Martin Bahlinger

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

In article <[email protected]> you wrote:
> At first I thought it was just me when I reported the lockups I was
> having with test12 earlier this week. Now the reports are flooding. Of
> course, now my machine isn't locking up anymore after recompiling from a
> clean source tree (test5 w/ patches through test12)

> Now, I'm trying to determine what the common element is.

> Those of you who are having lockups, was test12 compiled from a patched
> tree that you've previously compiled?

I compiled from a clean source tree test7 with patches through test12.
My machine gets locked up directly after starting the xfree-3.3.6 mach64
server. I'm running Debian2.3 woody here on a P90 w/ 32MB Ram.

> Those that are locking up in X. Do you have a second machine you can
> hook up via serial port to grab Oops output?

If it's still necessary, contact me via email.

--
[email protected] (PGP-ID: 0x0506D9B7)

2000-12-14 11:52:53

by Ingo Oeser

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Wed, Dec 13, 2000 at 10:48:56PM -0500, Mohammad A. Haque wrote:
> Trace; c0105000 <empty_bad_page+0/1000>
> Trace; c0100191 <L6+0/2>

I locked a Cyrix III machine up on boot and hat these both
elements in my trace, too.

It Oopsed and locked up after the Message: "CPU: Before vendor
init".

I locked up too with another machine (Pentium Classic) but like
all others by using X.

I have no oops yet of this lockup, because of X, but I'll ask a
friend of mine, whether the remote logging made it to him and
send you the results.

PS: I tried test12-pre8, so its inside test12-pre8 already.

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2000-12-14 12:13:52

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

Hmmm, does syslog sending to another machine catch oops? I guess we'll
find out.

Ingo Oeser wrote:
> I have no oops yet of this lockup, because of X, but I'll ask a
> friend of mine, whether the remote logging made it to him and
> send you the results.

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-14 12:38:09

by dep

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

okay. got it here this morning, too. solid lock -- no dumping out of
x, no changing terminals, no mouse, no keyboard.

k6-2-550 @ 500; 256mb memory, fic 503a mb with via chipset. kernel
built with gcc-2.95-2 against glibc-2.2. nothing remarkable underway
-- was composing a message in kmail, which i've done successfully
multiple times since upgrading to test12.
--
dep
--
bipartisanship: an illogical construct not unlike the idea that
if half the people like red and half the people like blue, the
country's favorite color is purple.

2000-12-14 12:46:00

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

Were you connected to a network or receiving/sending anything?

dep wrote:
>
> okay. got it here this morning, too. solid lock -- no dumping out of
> x, no changing terminals, no mouse, no keyboard.
>
> k6-2-550 @ 500; 256mb memory, fic 503a mb with via chipset. kernel
> built with gcc-2.95-2 against glibc-2.2. nothing remarkable underway
> -- was composing a message in kmail, which i've done successfully
> multiple times since upgrading to test12.
> --
> dep

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-14 13:44:45

by dep

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Thursday 14 December 2000 07:15, Mohammad A. Haque wrote:
| Were you connected to a network or receiving/sending anything?

a conditional yes -- little lan here, d-link dfe-530tx+ (rtl8139) to
dlink hub, di-701 gateway, cable modem. so far as i know, i was
neither sending nor receiving at the time, and i've done both things
extensively with test12 without a lockup.

--
dep
--
bipartisanship: an illogical construct not unlike the idea that
if half the people like red and half the people like blue, the
country's favorite color is purple.

2000-12-14 15:12:18

by rct

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

Mohammad A. Haque wrote:
> Were you connected to a network or receiving/sending anything?
>
> dep wrote:
> >
> > okay. got it here this morning, too. solid lock -- no dumping out of
> > x, no changing terminals, no mouse, no keyboard.
> >
> > k6-2-550 @ 500; 256mb memory, fic 503a mb with via chipset.

This one is going to be fun to track down. So far, with a personal
sample size of three machines, 2.4.0-test12 is stable on two, locks
up in a predictable and repeatable manner on one. First, the stable
machines:

(1) P150 MMX Toshiba Tecra 730XCDT notebook, egcs-2.91.66, openwin
with XFree86 3.3.6.

(2) Cyrix 6x86 MII 233, egcs-2.91.66, AfterStep with XFree86 4.0.1,
NVIDIA-0.9-5 video driver.

The unstable machine:

Gateway PII 333, egcs-2.91.66, AfterStep with XFree86 3.3.6.
Running "startx" as "root" --> ok: no problem.
Running "startx" as normal user --> I get as far as the grey moire
pattern with the black "X" cursor in the center of the screen, and
the machine locks up solid. Cannot switch consoles, machine doesn't
respond to pings (much less remote access attempts), no disk activity,
no "oops" messages in any of the logfiles. Absolutely repeatable.
Machine works fine with earlier kernels.

Does anyone have a feeling one way or the other as far as this being
related to the CPU type? I can try building a pre-PII CPU kernel on
the unstable machine and see if that makes any difference.

--
Bob Tracy [email protected]
-----------------------------------------------------------------
"We might not be in hell, but we can see the gates from here."
--Phoenix resident, Summer of 2000

2000-12-14 18:21:47

by Eckhard Jokisch

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback


Subject: Re: test12 lockups -- need feedback
Date: Thu, 14 Dec 2000 15:31:38 +0000
From: Eckhard Jokisch <[email protected]>
To: dep <[email protected]>


On Don, 14 Dez 2000, dep wrote:
> On Thursday 14 December 2000 07:15, Mohammad A. Haque wrote:
> | Were you connected to a network or receiving/sending anything?
>
> a conditional yes -- little lan here, d-link dfe-530tx+ (rtl8139) to
> dlink hub, di-701 gateway, cable modem. so far as i know, i was
> neither sending nor receiving at the time, and i've done both things
> extensively with test12 without a lockup.

Is it possible that there is something wrong with the 8139too driver?
( I also use a card with 8139 chip )
Or do you use the "old" rtl8139 ? With that I don't have any problems.
I have an extra machine here where I can do all testing - how can I help?

Eckhard

-------------------------------------------------------

2000-12-14 19:08:59

by Ion Badulescu

[permalink] [raw]
Subject: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

On Thu, 14 Dec 2000 07:15:04 -0500, Mohammad A. Haque <[email protected]> wrote:
> Were you connected to a network or receiving/sending anything?

ip_defrag is broken -- there is an obvious NULL pointer dereference
in it, introduced in test12. It doesn't hit normally, because of
path MTU discovery, but using NFS causes instant death.

I won't venture a fix, as I don't know the networking code well
enough. So far, no networking maintainer has had anything to say
about this bug on the list...

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2000-12-14 20:14:59

by David Miller

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Date: Thu, 14 Dec 2000 10:38:01 -0800
From: Ion Badulescu <[email protected]>

I won't venture a fix, as I don't know the networking code well
enough. So far, no networking maintainer has had anything to say
about this bug on the list...

Because this is the first most of us have heard of the issue, much
less seen any ksymoops processed OOPS logs of the bug so we can even
start thinking about what might be wrong.

Later,
David S. Miller
[email protected]

2000-12-14 20:23:19

by Ion Badulescu

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

On Thu, 14 Dec 2000, David S. Miller wrote:

> Date: Thu, 14 Dec 2000 10:38:01 -0800
> From: Ion Badulescu <[email protected]>
>
> I won't venture a fix, as I don't know the networking code well
> enough. So far, no networking maintainer has had anything to say
> about this bug on the list...
>
> Because this is the first most of us have heard of the issue, much
> less seen any ksymoops processed OOPS logs of the bug so we can even
> start thinking about what might be wrong.

Oh, there have been at least two ksymoops'ed traces posted on the list, I
thought you'd seen them already.. But never mind, the problem is that
skb->dev can be NULL and the code changed in test12 dereferences it to get
skb->dev->iif.

The oops looks something like this. It was caught on serial console, and
decoded on test11, so it doesn't have translation for module symbols. It
if helps, this box is running ip_conntrack and the oops occurred basically
as soon as an NFS request came in.

Unable to handle kernel NULL pointer dereference at virtual address 0000003c
c01917a6
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c01917a6>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: 00000000 ecx: c21d8f20 edx: 000003a0
esi: c3e73760 edi: 00000000 ebp: 00001ce8 esp: c16e9c80
ds: 0018 es: 0018 ss: 0018
Process nfsd (pid: 670, stackpage=c16e9000)
Stack: c21d8f20 00000000 c01912cf 01011eac 00002088 c21d8f20 005aac10 c0191b43
c21d8f20 c3e73760 c1786680 c3e73760 c0194718 c16e9d9c 030011cf 1121e260
00000000 c48c02d0 c3e73760 c16e9d8c c02358f8 c48bfb4e c3e73760 c16e9d8c
Call Trace: [<c01912cf>] [<c0191b43>] [<c0194718>] [<c48c02d0>] [<c48bfb4e>] [<c0194718>] [<c017b0f8>]
[<c017f6f4>] [<c017f717>] [<c48c1082>] [<c0194718>] [<c0184388>] [<c0194718>] [<c0194718>] [<c0184597>]
[<c0194718>] [<c48c2188>] [<c0193cea>] [<c0194718>] [<c0140e85>] [<c0193e0a>] [<c01a834c>] [<c01a878d>]
[<c01a834c>] [<c01ad918>] [<c01ad956>] [<c0182aed>] [<c01ad918>] [<c487f346>] [<c487f7d5>] [<c4880516>]
[<c48a7c00>] [<c487ef44>] [<c48a7ae0>] [<c48a75f8>] [<c4897331>] [<c48a75e0>] [<c0107457>]
Code: 8b 40 3c 89 41 3c 8b 46 5c c7 46 18 00 00 00 00 01 41 18 8b

>>EIP; c01917a6 <ip_frag_queue+242/298> <=====
Trace; c01912cf <ip_frag_destroy+2f/8c>
Trace; c0191b43 <ip_defrag+c3/140>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c48c02d0 <END_OF_CODE+4689b60/????>
Trace; c48bfb4e <END_OF_CODE+46893de/????>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c017b0f8 <dma_timer_expiry+0/70>
Trace; c017f6f4 <via82cxxx_dmaproc+0/2c>
Trace; c017f717 <via82cxxx_dmaproc+23/2c>
Trace; c48c1082 <END_OF_CODE+468a912/????>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c0184388 <nf_iterate+34/88>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c0184597 <nf_hook_slow+3f/b4>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c48c2188 <END_OF_CODE+468ba18/????>
Trace; c0193cea <ip_build_xmit_slow+3c6/498>
Trace; c0194718 <output_maybe_reroute+0/14>
Trace; c0140e85 <update_atime+4d/54>
Trace; c0193e0a <ip_build_xmit+4e/31c>
Trace; c01a834c <udp_getfrag+0/c4>
Trace; c01a878d <udp_sendmsg+339/3b4>
Trace; c01a834c <udp_getfrag+0/c4>
Trace; c01ad918 <inet_sendmsg+0/44>
Trace; c01ad956 <inet_sendmsg+3e/44>
Trace; c0182aed <sock_sendmsg+81/a4>
Trace; c01ad918 <inet_sendmsg+0/44>
Trace; c487f346 <END_OF_CODE+4648bd6/????>
Trace; c487f7d5 <END_OF_CODE+4649065/????>
Trace; c4880516 <END_OF_CODE+4649da6/????>
Trace; c48a7c00 <END_OF_CODE+4671490/????>
Trace; c487ef44 <END_OF_CODE+46487d4/????>
Trace; c48a7ae0 <END_OF_CODE+4671370/????>
Trace; c48a75f8 <END_OF_CODE+4670e88/????>
Trace; c4897331 <END_OF_CODE+4660bc1/????>
Trace; c48a75e0 <END_OF_CODE+4670e70/????>
Trace; c0107457 <kernel_thread+23/30>
Code; c01917a6 <ip_frag_queue+242/298>
00000000 <_EIP>:
Code; c01917a6 <ip_frag_queue+242/298> <=====
0: 8b 40 3c mov 0x3c(%eax),%eax <=====
Code; c01917a9 <ip_frag_queue+245/298>
3: 89 41 3c mov %eax,0x3c(%ecx)
Code; c01917ac <ip_frag_queue+248/298>
6: 8b 46 5c mov 0x5c(%esi),%eax
Code; c01917af <ip_frag_queue+24b/298>
9: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi)
Code; c01917b6 <ip_frag_queue+252/298>
10: 01 41 18 add %eax,0x18(%ecx)
Code; c01917b9 <ip_frag_queue+255/298>
13: 8b 00 mov (%eax),%eax


Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2000-12-14 20:31:00

by David Miller

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Date: Thu, 14 Dec 2000 11:52:29 -0800 (PST)
From: Ion Badulescu <[email protected]>

The oops looks something like this. It was caught on serial
console, and decoded on test11, so it doesn't have translation for
module symbols. It if helps, this box is running ip_conntrack and
the oops occurred basically as soon as an NFS request came in.

If you turn off netfilter, ip_conntrack, etc. does the OOPS still
occur?

Later,
David S. Miller
[email protected]

2000-12-14 20:38:24

by Ion Badulescu

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

On Thu, 14 Dec 2000, David S. Miller wrote:

> If you turn off netfilter, ip_conntrack, etc. does the OOPS still
> occur?

I'm afraid I won't be able to answer this question, since I'm leaving for
a 3-week vacation in about 50 minutes and I need my firewall functional
until then. :-) Maybe other people who have seen this problem can
experiment further.

If I get a few moments, I'll do a quick test before leaving and will let
you know. The chance of that happening is extremely slim, though.

Thanks,
Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

2000-12-14 20:43:24

by David Miller

[permalink] [raw]
Subject: Netfilter is broken (was Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback))

Date: Thu, 14 Dec 2000 12:07:38 -0800 (PST)
From: Ion Badulescu <[email protected]>

I'm afraid I won't be able to answer this question, since I'm
leaving for a 3-week vacation in about 50 minutes and I need my
firewall functional until then. :-) Maybe other people who have
seen this problem can experiment further.

Ok, regardless I'm very confident netfilter is doing something
very bad.

Essentially it is feeding SKBs into IPv4 receive processing which
have a NULL skb->dev, that has always been illegal. Now it OOPSs
so we can spot such violations.

Later,
David S. Miller
[email protected]

2000-12-14 21:06:35

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

I'll be trying in a few hours.

On Thu, 14 Dec 2000, Ion Badulescu wrote:

> On Thu, 14 Dec 2000, David S. Miller wrote:
>
> > If you turn off netfilter, ip_conntrack, etc. does the OOPS still
> > occur?
>
> I'm afraid I won't be able to answer this question, since I'm leaving for
> a 3-week vacation in about 50 minutes and I need my firewall functional
> until then. :-) Maybe other people who have seen this problem can
> experiment further.
>
> If I get a few moments, I'll do a quick test before leaving and will let
> you know. The chance of that happening is extremely slim, though.
>
> Thanks,
> Ion
>
>

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-14 21:10:15

by David Miller

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Date: Thu, 14 Dec 2000 15:35:48 -0500 (EST)
From: "Mohammad A. Haque" <[email protected]>

I'll be trying in a few hours.

Meanwhile for people wanting the crashes to be fixed, please
apply this patch.

This was _always_ broken, and really what netfilter is doing
should have never worked. The only theory I have right now
is that people using netfilter never had IP fragments timeout.
:-)

So the patch below restores previous behavior exactly.
Ie. netfilter sources fragments cannot send ICMP errors
on frag queue timeout :-)

(The line numbers may be off a bit, but "patch" should still
eat it).

--- net/ipv4/ip_fragment.c.~1~ Wed Dec 13 10:31:48 2000
+++ net/ipv4/ip_fragment.c Thu Dec 14 12:20:09 2000
@@ -258,7 +258,8 @@
if ((qp->last_in&FIRST_IN) && qp->fragments != NULL) {
struct sk_buff *head = qp->fragments;
/* Send an ICMP "Fragment Reassembly Timeout" message. */
- if ((head->dev = dev_get_by_index(qp->iif)) != NULL) {
+ if (qp->iif != -1 &&
+ (head->dev = dev_get_by_index(qp->iif)) != NULL) {
icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
dev_put(head->dev);
}
@@ -487,8 +488,12 @@
else
qp->fragments = skb;

- qp->iif = skb->dev->ifindex;
- skb->dev = NULL;
+ if (skb->dev != NULL) {
+ qp->iif = skb->dev->ifindex;
+ skb->dev = NULL;
+ } else
+ qp->iif = -1;
+
qp->stamp = skb->stamp;
qp->meat += skb->len;
atomic_add(skb->truesize, &ip_frag_mem);

2000-12-14 21:20:48

by rct

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Ion Badulescu wrote:
> On Thu, 14 Dec 2000 07:15:04 -0500, Mohammad A. Haque <[email protected]> wrote:
> > Were you connected to a network or receiving/sending anything?
>
> ip_defrag is broken -- there is an obvious NULL pointer dereference
> in it, introduced in test12. It doesn't hit normally, because of
> path MTU discovery, but using NFS causes instant death.

This is consistent with the lockup I reported several hours ago.
In the case of my "unstable" 2.4.0-test12 machine where "startx"
worked fine for "root" but not for a normal user, the "root"
account is local. The normal user account home directories are
NFS mounted :-(.

Good spot! I've done a little mucking around with the networking
code, i.e., no promises, but maybe I can come up with a fix.

--
Bob Tracy [email protected]
-----------------------------------------------------------------
"We might not be in hell, but we can see the gates from here."
--Phoenix resident, Summer of 2000

2000-12-14 21:55:51

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Just quick feedback.

Test 1:
Netfilter compiled into kernel. Netfilter configuration options
as modules. Modules loaded. Using NFS, I got Oops (in fact I've
never seen an Oops output infinitely before. Maybe it would have
stopped if I waited.)

Test 2:
Netfilter compiled into kernel. Netfilter configuration options
as modules. Modules _NOT_ loaded. Can use NFS just fine. Did a
couple of 100 MB transfers w/o problems.


I'll continue narrowing it down.


#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_NETLINK=y
CONFIG_RTNETLINK=y
CONFIG_NETLINK_DEV=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
CONFIG_FILTER=y
....

#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_CONNTRACK=m
CONFIG_IP_NF_FTP=m
# CONFIG_IP_NF_QUEUE is not set
CONFIG_IP_NF_IPTABLES=m
# CONFIG_IP_NF_MATCH_LIMIT is not set
# CONFIG_IP_NF_MATCH_MAC is not set
# CONFIG_IP_NF_MATCH_MARK is not set
# CONFIG_IP_NF_MATCH_MULTIPORT is not set
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_STATE=m
# CONFIG_IP_NF_MATCH_UNCLEAN is not set
# CONFIG_IP_NF_MATCH_OWNER is not set
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_MIRROR=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
# CONFIG_IP_NF_MANGLE is not set
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_COMPAT_IPCHAINS=m
CONFIG_IP_NF_NAT_NEEDED=y
# CONFIG_IP_NF_COMPAT_IPFWADM is not set


MODULES LOADED:
Module Size Used by
ipt_state 800 13 (autoclean)
ipt_tos 720 6 (autoclean)
ipt_LOG 3248 4 (autoclean)
iptable_filter 1920 0 (autoclean) (unused)
ipt_MASQUERADE 1808 1
ip_nat_ftp 3520 0 (unused)
ip_conntrack_ftp 2336 0 [ip_nat_ftp]
iptable_nat 17440 1 [ipt_MASQUERADE ip_nat_ftp]
ip_conntrack 19808 3 [ipt_state ipt_MASQUERADE ip_nat_ftp ip_conntrack_ftp iptable_nat]
ip_tables 12320 8 [ipt_state ipt_tos ipt_LOG iptable_filter ipt_MASQUERADE iptable_nat]


On Thu, 14 Dec 2000, David S. Miller wrote:

> Meanwhile for people wanting the crashes to be fixed, please
> apply this patch.
>
> This was _always_ broken, and really what netfilter is doing
> should have never worked. The only theory I have right now
> is that people using netfilter never had IP fragments timeout.
> :-)
>
> So the patch below restores previous behavior exactly.
> Ie. netfilter sources fragments cannot send ICMP errors
> on frag queue timeout :-)
>
> (The line numbers may be off a bit, but "patch" should still
> eat it).
>

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-14 23:10:22

by rct

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Ion Badulescu wrote:
> On Thu, 14 Dec 2000 07:15:04 -0500, Mohammad A. Haque <[email protected]> wrote:
> > Were you connected to a network or receiving/sending anything?
>
> ip_defrag is broken -- there is an obvious NULL pointer dereference
> in it, introduced in test12. It doesn't hit normally, because of
> path MTU discovery, but using NFS causes instant death.

...and then I wrote:
> This is consistent with the lockup I reported several hours ago.
> In the case of my "unstable" 2.4.0-test12 machine where "startx"
> worked fine for "root" but not for a normal user, the "root"
> account is local. The normal user account home directories are
> NFS mounted :-(.

I tried the submitted patch for ip_fragment.c, and there's still
no joy on that one unstable machine in my sample set. At this
point, I should probably go back through all the pre-12 patches
and see if the problem scope can be narrowed a bit.

--
Bob Tracy
[email protected]

2000-12-14 23:21:23

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

I do the following....

sudo modprobe iptable_nat

Module Size Used by
iptable_nat 17440 0 (unused)
ip_conntrack 19808 1 [iptable_nat]
ip_tables 12320 3 [iptable_nat]


Oops start flying by when I access via NFS.

If you need the actual Oops messages we're gonna have to get someone
who can setup a serial console.

On Thu, 14 Dec 2000, Mohammad A. Haque wrote:

> Just quick feedback.
>
> Test 1:
> Netfilter compiled into kernel. Netfilter configuration options
> as modules. Modules loaded. Using NFS, I got Oops (in fact I've
> never seen an Oops output infinitely before. Maybe it would have
> stopped if I waited.)
>
> Test 2:
> Netfilter compiled into kernel. Netfilter configuration options
> as modules. Modules _NOT_ loaded. Can use NFS just fine. Did a
> couple of 100 MB transfers w/o problems.
>
>
> I'll continue narrowing it down.

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-15 00:29:09

by Mohammad A. Haque

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

Problem only happens when ip_conntrack is loaded.

On Thu, 14 Dec 2000, Mohammad A. Haque wrote:

> I do the following....
>
> sudo modprobe iptable_nat
>
> Module Size Used by
> iptable_nat 17440 0 (unused)
> ip_conntrack 19808 1 [iptable_nat]
> ip_tables 12320 3 [iptable_nat]
>
>
> Oops start flying by when I access via NFS.
>
> If you need the actual Oops messages we're gonna have to get someone
> who can setup a serial console.
>

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2000-12-15 00:52:28

by Harald Welte

[permalink] [raw]
Subject: Re: Netfilter is broken (was Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback))

On Thu, Dec 14, 2000 at 11:55:43AM -0800, David S. Miller wrote:
> Date: Thu, 14 Dec 2000 12:07:38 -0800 (PST)
> From: Ion Badulescu <[email protected]>
>
> I'm afraid I won't be able to answer this question, since I'm
> leaving for a 3-week vacation in about 50 minutes and I need my
> firewall functional until then. :-) Maybe other people who have
> seen this problem can experiment further.
>
> Ok, regardless I'm very confident netfilter is doing something
> very bad.
>
> Essentially it is feeding SKBs into IPv4 receive processing which
> have a NULL skb->dev, that has always been illegal. Now it OOPSs
> so we can spot such violations.

mmh... After checking some of my assumptions with the code again, I don't
think that netfilter does something wrong.

Referring to some of the other messages in this thread, ip_conntrack seems
to be blamed.

Conntrack is registered at the NF_IP_PRE_ROUTING hook and calls ip_defrag
for all skb's it receives. But we don't touch the dev member of the skb
at all...

Or is there something wrong with:

- packet arrives in net/ipv4/ip_input.c:ip_rcv()
- netfilter hook NF_IP_PRE_ROUTING is called
- net/ipv4/netfilter/ip_conntrack_core.c:ip_conntrack_in() is called
- net/ipv4/netfilter/ip_conntrack_core.c:ip_ct_gather_frags() is called
- net/ipv4/ip_input.c:ip_defrag() is called

Isn't the skb->dev member supposed to still point to the receiving
device?


> David S. Miller

--
Live long and prosper
- Harald Welte / [email protected] http://www.gnumonks.org
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M-
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

2000-12-15 00:59:18

by David Miller

[permalink] [raw]
Subject: Re: Netfilter is broken (was Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback))

Date: Fri, 15 Dec 2000 01:20:00 +0100
From: Harald Welte <[email protected]>

Or is there something wrong with:

- packet arrives in net/ipv4/ip_input.c:ip_rcv()
- netfilter hook NF_IP_PRE_ROUTING is called
- net/ipv4/netfilter/ip_conntrack_core.c:ip_conntrack_in() is called
- net/ipv4/netfilter/ip_conntrack_core.c:ip_ct_gather_frags() is called
- net/ipv4/ip_input.c:ip_defrag() is called

Isn't the skb->dev member supposed to still point to the receiving
device?

No, once you submit the packet to the defrag layer, that SKB
instance is owned by the defrag layer.

One way to do what netfilter wants to do, but legally, is to
simply skb_clone() the SKB before passing it into the
defragmentation code.

I'm still deciding whether this is the best fix.

Later,
David S. Miller
[email protected]

2000-12-15 01:19:30

by Andi Kleen

[permalink] [raw]
Subject: Re: Netfilter is broken (was Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback))

On Thu, Dec 14, 2000 at 04:11:10PM -0800, David S. Miller wrote:
> Date: Fri, 15 Dec 2000 01:20:00 +0100
> From: Harald Welte <[email protected]>
>
> Or is there something wrong with:
>
> - packet arrives in net/ipv4/ip_input.c:ip_rcv()
> - netfilter hook NF_IP_PRE_ROUTING is called
> - net/ipv4/netfilter/ip_conntrack_core.c:ip_conntrack_in() is called
> - net/ipv4/netfilter/ip_conntrack_core.c:ip_ct_gather_frags() is called
> - net/ipv4/ip_input.c:ip_defrag() is called
>
> Isn't the skb->dev member supposed to still point to the receiving
> device?
>
> No, once you submit the packet to the defrag layer, that SKB
> instance is owned by the defrag layer.
>
> One way to do what netfilter wants to do, but legally, is to
> simply skb_clone() the SKB before passing it into the
> defragmentation code.

What should it do with the uncloned, not defragmented copy ?
It makes not much sense to clone it.

Also is it sure that the backtrace involves ip_rcv ? A more likely
guess is that it happens during the IP_LOCAL_OUT hook, when skb->dev
isn't set yet, but conntrack already has to already reassemble fragments.


-Andi

2000-12-15 01:57:34

by Harald Welte

[permalink] [raw]
Subject: Re: Netfilter is broken (was Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback))

On Fri, Dec 15, 2000 at 01:48:32AM +0100, Andi Kleen wrote:
>
> Also is it sure that the backtrace involves ip_rcv ? A more likely
> guess is that it happens during the IP_LOCAL_OUT hook, when skb->dev
> isn't set yet, but conntrack already has to already reassemble fragments.

Oh, thanks Andi. This is the key, of course. I'm always way too focused
on forwarded packets ;)

This is definitely the problem.

We could set skb->dev to skb->dst->dev, but this sounds more like a
hack than a real solution...

> -Andi

--
Live long and prosper
- Harald Welte / [email protected] http://www.gnumonks.org
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M-
V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)

2000-12-15 02:57:10

by Tom Leete

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

"David S. Miller" wrote:
>
> Date: Thu, 14 Dec 2000 15:35:48 -0500 (EST)
> From: "Mohammad A. Haque" <[email protected]>
>
> I'll be trying in a few hours.
>
> Meanwhile for people wanting the crashes to be fixed, please
> apply this patch.
>
> This was _always_ broken, and really what netfilter is doing
> should have never worked. The only theory I have right now
> is that people using netfilter never had IP fragments timeout.
> :-)
>
> So the patch below restores previous behavior exactly.
> Ie. netfilter sources fragments cannot send ICMP errors
> on frag queue timeout :-)
>

Hello,

I posted one of these generated by nfs earlier. This one is from
$ ping -c 1 -s 1478 <2.4.0-t12-host>
from peer.

kdb over serial console -- the module addresses are accurate. Lightly edited
for readability.

Hope this helps,
Tom


Unable to handle kernel NULL pointer dereference at virtual address 0000003c
printing eip:
c01c0c32
*pde = 00000000

Entering kdb (current=0xc02c0000, pid 0) Panic: Oops
due to panic @ 0xc01c0c32
eax = 0x00000000 ebx = 0x00000000 ecx = 0xc11a6fa0 edx = 0x00000006
esi = 0xc1376be0 edi = 0x00000000 esp = 0xc02c1bac eip = 0xc01c0c32
ebp = 0xc02c1bc8 xss = 0x00000018 xcs = 0xc11a0010 eflags = 0x00010246
xds = 0x31010018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc02c1b78
kdb> bt
EBP EIP Function(args)
0xc02c1bc8 0xc01c0c32 ip_frag_queue+0x222 (0xc11a6fa0, 0xc1376be0)
kernel .text 0xc0100000 0xc01c0a10 0xc01c0c90
0xc02c1bf4 0xc01c1004 ip_defrag+0xc4 (0xc1376be0)
kernel .text 0xc0100000 0xc01c0f40 0xc01c1070
0xc02c1c0c 0xc4093365 [ip_conntrack]ip_ct_gather_frags+0x25 (0xc1376be0)
ip_conntrack .text 0xc4091060 0xc4093340
0xc40933e0
0xc02c1c54 0xc40924cd [ip_conntrack]ip_conntrack_in+0x3d (0x3, 0xc02c1cdc,
0x0, 0xc3104800, 0xc01c3560)
ip_conntrack .text 0xc4091060 0xc4092490
0xc40927b0
0xc02c1c70 0xc4094666 [ip_conntrack]ip_conntrack_local+0x56 (0x3,
0xc02c1cdc, 0x0, 0xc3104800, 0xc01c3560)
ip_conntrack .text 0xc4091060 0xc4094610
0xc4094670
0xc02c1c98 0xc01b2d98 nf_iterate+0x28 (0xc0320cd8, 0xc02c1cdc, 0x3, 0x0,
0xc3104800)
kernel .text 0xc0100000 0xc01b2d70 0xc01b2e00
0xc02c1ccc 0xc01b3001 nf_hook_slow+0x71 (0x2, 0x3, 0xc1376be0, 0x0,
0xc3104800)
kernel .text 0xc0100000 0xc01b2f90 0xc01b3080
0xc02c1d3c 0xc01c2c27 ip_build_xmit_slow+0x387 (0xc11d2730, 0xc01d9a00,
0xc02c1dfc, 0x5e2, 0xc02c1de0)
kernel .text 0xc0100000 0xc01c28a0 0xc01c2d00
0xc02c1d7c 0xc01c2d4b ip_build_xmit+0x4b (0xc11d2730, 0xc01d9a00,
0xc02c1dfc, 0x5e2, 0xc02c1de0)
kernel .text 0xc0100000 0xc01c2d00 0xc01c2ff0
0xc02c1dec 0xc01d9c03 icmp_reply+0x173 (0xc02c1dfc, 0xc136aab0)
kernel .text 0xc0100000 0xc01d9a90 0xc01d9c20
0xc02c1e44 0xc01da1aa icmp_echo+0x3a (0xc0aad824, 0xc136aab0, 0x5c6)
more>
kernel .text 0xc0100000 0xc01da170 0xc01da1b0
0xc02c1e68 0xc01da459 icmp_rcv+0xa9 (0xc136aab0, 0x5ce)
kernel .text 0xc0100000 0xc01da3b0 0xc01da490
0xc02c1e88 0xc01c04a4 ip_local_deliver_finish+0x94 (0xc136aab0, 0xc136aab0)
kernel .text 0xc0100000 0xc01c0410 0xc01c0520
0xc02c1ea4 0xc01b3048 nf_hook_slow+0xb8 (0x2, 0x1, 0xc136aab0, 0xc3104800,
0x0)
kernel .text 0xc0100000 0xc01b2f90 0xc01b3080
0xc02c1ec4 0xc01c02d5 ip_local_deliver+0x45 (0xc136aab0)
kernel .text 0xc0100000 0xc01c0290 0xc01c02e0
0xc02c1ee8 0xc01c06dc ip_rcv_finish+0x1bc (0xc136aab0, 0xc08bd210)
kernel .text 0xc0100000 0xc01c0520 0xc01c0710
0xc02c1f04 0xc01b3048 nf_hook_slow+0xb8 (0x2, 0x0, 0xc136aab0, 0xc3104800,
0x0)
kernel .text 0xc0100000 0xc01b2f90 0xc01b3080
0xc02c1f38 0xc01c03dc ip_rcv+0xfc (0xc08bd210, 0xc3104800, 0xc02bca84)
kernel .text 0xc0100000 0xc01c02e0 0xc01c0410
0xc02c1f68 0xc01b703d net_rx_action+0x12d (0xc02facf0)
kernel .text 0xc0100000 0xc01b6f10 0xc01b7160
0xc02c1f80 0xc011bd7e do_softirq+0x4e
kernel .text 0xc0100000 0xc011bd30 0xc011bdb0
0xc02c1f98 0xc010ad13 do_IRQ+0xa3 (0xc01074f0, 0xc2532260, 0xc02c0000,
0xc02c0000, 0xc02c0000)
kernel .text 0xc0100000 0xc010ac70 0xc010ad30
0xc01093f0 ret_from_intr
kernel .text 0xc0100000 0xc01093f0 0xc0109410
Interrupt registers:
eax = 0x00000000 ebx = 0xc01074f0 ecx = 0xc2532260 edx = 0xc02c0000
esi = 0xc02c0000 edi = 0xc02c0000 esp = 0xc02c1fd4 eip = 0xc0107516
ebp = 0xc02c1fd4 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00000246
xds = 0xc0100018 xes = 0xc02c0018 origeax = 0xffffff0c &regs = 0xc02c1fa0
0xc0107516 default_idle+0x26
kernel .text 0xc0100000 0xc01074f0 0xc0107520
0xc02c1fe8 0xc0107585 cpu_idle+0x35
kernel .text 0xc0100000 0xc0107550 0xc01075a0
#
#
kdb> mds 0xc11a6fa0
0xc11a6fa0 00000000 ....
0xc11a6fa4 0101a8c0 ??..
0xc11a6fa8 3101a8c0 ??.1
0xc11a6fac 0101cc28 (?..
0xc11a6fb0 c1376be0 ?k7?
0xc11a6fb4 000005ce ?...
0xc11a6fb8 00000000 ....
0xc11a6fbc 00000000 ....
#
#
kdb> mds 0xc1376be0
0xc1376be0 00000000 ....
0xc1376be4 00000000 ....
0xc1376be8 00000000 ....
0xc1376bec c11d2730 0'.?
0xc1376bf0 00000000 ....
0xc1376bf4 0009bfa7 ??..
0xc1376bf8 00000000 ....
0xc1376bfc c3063f50 P?.?
#
#
kdb> mds 0xc02c1cdc
0xc02c1cdc c1376be0 ?k7?
0xc02c1ce0 00000000 ....
0xc02c1ce4 c3104800 .H.?
0xc02c1ce8 c01c3560 output_maybe_reroute
kernel .text 0xc0100000 0xc01c3560 0xc01c3580
0xc02c1cec 00000000 ....
0xc02c1cf0 c02c1dfc init_task_union+0x1dfc
kernel .data.init_task 0xc02c0000 0xc02c0000
0xc02c2000
0xc02c1cf4 00000040 @...
0xc02c1cf8 c3063f40 @?.?
#
#
kdb> mds 0xc0320cd8
0xc0320cd8 c4095f08 [ip_conntrack]ip_conntrack_local_out_ops
ip_conntrack .data 0xc4095a40 0xc4095f08 0xc4095f20
0xc0320cdc c40ae668 [iptable_filter]ipt_ops+0x30
iptable_filter .data 0xc40ae320 0xc40ae638 0xc40ae680
0xc0320ce0 c409ec98 [iptable_nat]ip_nat_out_ops
iptable_nat .data 0xc409ec80 0xc409ec98 0xc409ecb0
0xc0320ce4 c4095f20 [ip_conntrack]ip_conntrack_out_ops
ip_conntrack .data 0xc4095a40 0xc4095f20 0xc4095f38
0xc0320ce8 c0320ce8 nf_hooks+0xa8
kernel .bss 0xc02f4620 0xc0320c40 0xc0321440
0xc0320cec c0320ce8 nf_hooks+0xa8
kernel .bss 0xc02f4620 0xc0320c40 0xc0321440
0xc0320cf0 c0320cf0 nf_hooks+0xb0
kernel .bss 0xc02f4620 0xc0320c40 0xc0321440
0xc0320cf4 c0320cf0 nf_hooks+0xb0
kernel .bss 0xc02f4620 0xc0320c40 0xc0321440
#
#
kdb> mds 0xc3104800
0xc3104800 30687465 eth0
0xc3104804 00000000 ....
0xc3104808 00000000 ....
0xc310480c 00000000 ....
0xc3104810 00000000 ....
0xc3104814 00000000 ....
0xc3104818 00000000 ....
0xc310481c 00000000 ....
#
#
kdb> mds 0xc11d2730
0xc11d2730 00000000 ....
0xc11d2734 00000000 ....
0xc11d2738 00010000 ....
0xc11d273c 00000000 ....
0xc11d2740 00000000 ....
0xc11d2744 00000000 ....
0xc11d2748 00000000 ....
0xc11d274c 00000000 ....
#
#
kdb> mds 0xc40927b0
0xc40927b0 56e58955 U.?V
0xc40927b4 8b53c031 1?S.
0xc40927b8 758b0c5d ]..u
0xc40927bc 0e438a08 ..C.
0xc40927c0 e93ae850 P?:?
0xc40927c4 5350ffff ??PS
0xc40927c8 e9e2e856 V???
0xc40927cc 658dffff ??.e
#
#
kdb> mds 0xc4094670
0xc4094670 53e58955 U.?S
0xc4094674 7d83db31 1?.}
0xc4094678 840f0008 ....
0xc409467c 000000b0 ?...
0xc4094680 fff16be8 ?k??
0xc4094684 85c389ff ?.?.
0xc4094688 ed8c0fdb ?..?
0xc409468c a1000000 ...?
#
#
kdb> md ip_frag_queue
0xc01c0a10 83e58955 565710ec 0c4d8b53 8b08758b U.?.?.WVS.M..u..
0xc01c0a20 4d892049 0f5e8af0 f6fb5d88 850f04c3 I .M?.^..]???...
0xc01c0a30 0000022c 06418b66 c931c486 89c18966 ,...f.A..?1?f.?.
0xc01c0a40 ca89fc4d e000e281 e181ffff 00001fff M?.?.?.???.??...
0xc01c0a50 8b03e1c1 4d89f075 24068afc 00ff250f ??..u?.M?..$.%?.
0xc01c0a60 3c8d0000 00000085 468b6600 25c48602 ...<.....f.F..?%
0xc01c0a70 0000ffff c801f829 f6f04589 307520c6 ??..)?.?.E??? u0
0xc01c0a80 8b084d8b 45391441 d18c0ff0 f6000001 .M..A.9E?..?...?
#
#
kdb> mds 0xc11d2730
0xc11d2730 00000000 ....
0xc11d2734 00000000 ....
0xc11d2738 00010000 ....
0xc11d273c 00000000 ....
0xc11d2740 00000000 ....
0xc11d2744 00000000 ....
0xc11d2748 00000000 ....
0xc11d274c 00000000 ....
#
#
kdb> mds 0xc02c1dfc
0xc02c1dfc c0aad82c ,ت?
0xc02c1e00 000005c6 ?...
0xc02c1e04 00000000 ....
0xc02c1e08 000069d6 ?i..
0xc02c1e0c c3c38784 ..??
0xc02c1e10 00000000 ....
0xc02c1e14 00000000 ....
0xc02c1e18 00000002 ....
#
#
kdb> mds 0xc01d9a00
0xc01d9a00 57e58955 U.?W
0xc01d9a04 758b5356 VS.u
0xc01d9a08 0c7d8b08 ..}.
0xc01d9a0c 8510458b .E..
0xc01d9a10 8b4d75c0 ?uM.
0xc01d9a14 006a1046 F.j.
0xc01d9a18 6a50006a j.Pj
0xc01d9a1c 568d5708 .W.V
#
#
kdb> mds 0xc02c1de0
0xc02c1de0 3101a8c0 ??.1
0xc02c1de4 c02c1df4 init_task_union+0x1df4
kernel .data.init_task 0xc02c0000 0xc02c0000
0xc02c2000
0xc02c1de8 00000000 ....
0xc02c1dec c02c1e44 init_task_union+0x1e44
kernel .data.init_task 0xc02c0000 0xc02c0000
0xc02c2000
0xc02c1df0 c01da1aa icmp_echo+0x3a
kernel .text 0xc0100000 0xc01da170 0xc01da1b0
0xc02c1df4 c02c1dfc init_task_union+0x1dfc
kernel .data.init_task 0xc02c0000 0xc02c0000
0xc02c2000
0xc02c1df8 c136aab0 ??6?
0xc02c1dfc c0aad82c ,ت?
#
#
kdb> mds 0xc136aab0
0xc136aab0 00000000 ....
0xc136aab4 00000000 ....
0xc136aab8 00000000 ....
0xc136aabc 00000000 ....
0xc136aac0 00000000 ....
0xc136aac4 000c30a7 ?0..
0xc136aac8 c3104800 .H.?
0xc136aacc c0aad824 $ت?
#
# Let it die now
#
kdb> go
Oops: 0000
CPU: 0
EIP: 0010:[<c01c0c32>]
EFLAGS: 00010246
eax: 00000000 ebx: 00000000 ecx: c11a6fa0 edx: 00000006
esi: c1376be0 edi: 00000000 ebp: c02c1bc8 esp: c02c1bac
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c02c1000)
Stack: c11a6fa0 00000000 0000cc28 000005ce 00000015 001a6fa0 000005c8
c02c1bf4
c01c1004 c11a6fa0 c1376be0 c11d2730 c1376be0 00000008 3000fc28
0117158a
0101a8c0 00000000 c02c1c0c c4093365 c1376be0 c4095f08 c02c1cdc
00000003
Call Trace: [<c01c1004>] [<c4093365>] [<c4095f08>] [<c40924cd>] [<c4095f08>]
[<c409b2ac>] [<c4094666>]
[<c01c3560>] [<c01b2d98>] [<c01c3560>] [<c01b3001>] [<c01c3560>]
[<c4095f08>] [<c01c2c27>] [<c01c3560>]
[<c403de76>] [<cc281d80>] [<c01c2d4b>] [<c01d9a00>] [<c01d9c03>]
[<c01d9a00>] [<c01da1aa>] [<c409197c>]
[<c4095f38>] [<c01da459>] [<c01c04a4>] [<c01b3048>] [<c01c02d5>]
[<c01c0410>] [<c01c06dc>] [<c01b3048>]
[<c01c03dc>] [<c01c0520>] [<c01b703d>] [<c011bd7e>] [<c010ad13>]
[<c01074f0>] [<c01093f0>] [<c01074f0>]
[<c0100018>] [<c0107516>] [<c0107585>] [<c0105000>] [<c0100191>]
Code: 8b 40 3c 89 41 3c c7 46 18 00 00 00 00 8b 46 5c 01 41 18 8b
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In interrupt handler - not syncing
# DOA

2000-12-15 09:23:07

by Jasper Spaans

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

On Thu, Dec 14, 2000 at 05:50:35PM -0500, Mohammad A. Haque wrote:

[zap]

> Oops start flying by when I access via NFS.
>
> If you need the actual Oops messages we're gonna have to get someone
> who can setup a serial console.

I captured one on my console, anyone interested please drop me a note.

Regards,
--
Jasper Spaans <[email protected]>

2000-12-15 09:53:39

by Tom Leete

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

"Mohammad A. Haque" wrote:
>
> I do the following....
>
> sudo modprobe iptable_nat
>
> Module Size Used by
> iptable_nat 17440 0 (unused)
> ip_conntrack 19808 1 [iptable_nat]
> ip_tables 12320 3 [iptable_nat]
>
> Oops start flying by when I access via NFS.
>
> If you need the actual Oops messages we're gonna have to get someone
> who can setup a serial console.
>

see my post of day before yesterday under the nfs thread for serial
console+kdb of this.

I also posted a simpler one under this thread of a fragmented ping attack
which is executable by any user on a peer.
# ping -c 100 -s 1470 -f <t12-host>
works fine;
$ ping -c 1 -s 1478 <t12-host>
crashes the target every time.

Tom

2000-12-15 14:52:23

by Ingo Oeser

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Thu, Dec 14, 2000 at 06:42:58AM -0500, Mohammad A. Haque wrote:
> Hmmm, does syslog sending to another machine catch oops? I guess we'll
> find out.

No, I asked for the logs and he didn't receive any of them :-(

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2000-12-17 08:45:48

by Rusty Russell

[permalink] [raw]
Subject: Re: ip_defrag is broken (was: Re: test12 lockups -- need feedback)

In message <[email protected]> you write:
> Date: Thu, 14 Dec 2000 15:35:48 -0500 (EST)
> From: "Mohammad A. Haque" <[email protected]>
>
> I'll be trying in a few hours.
>
> Meanwhile for people wanting the crashes to be fixed, please
> apply this patch.
>
> This was _always_ broken, and really what netfilter is doing
> should have never worked. The only theory I have right now
> is that people using netfilter never had IP fragments timeout.
> :-)

Ick, we've previously had issues with using the defrag routine from
PRE_ROUTING (Andi fixed the `called without bh disabled' problem). 8(

Good news is that it's all done from one place:

net/ipv4/ip_conntrack_core.c:910:ip_ct_gather_frags(struct sk_buff *skb)

You can fix it to obey the rules there, rather than hacking fragment
code.

Cheers,
Rusty.
--
Hacking time.

2000-12-15 18:19:18

by Ingo Oeser

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Thu, Dec 14, 2000 at 06:52:34PM +0000, Eckhard Jokisch wrote:
> Is it possible that there is something wrong with the 8139too driver?
> ( I also use a card with 8139 chip )
> Or do you use the "old" rtl8139 ? With that I don't have any problems.
> I have an extra machine here where I can do all testing - how can I help?

I have no Realtek-Card and have the same lockup.

I also got a hard lockup (but with Oops) while calling the
"vendor CPU init" function during system boot.

This was on Cyrix III.

PS: CC'ed hpa, because he is cpu-detection maintainer and davej,
because he added Cyrix III support and might know details ;-)

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2000-12-15 18:23:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

>
> I have no Realtek-Card and have the same lockup.
>
> I also got a hard lockup (but with Oops) while calling the
> "vendor CPU init" function during system boot.
>
> This was on Cyrix III.
>
> PS: CC'ed hpa, because he is cpu-detection maintainer and davej,
> because he added Cyrix III support and might know details ;-)
>

Please include the oops information, as well as the /proc/cpuinfo output.

-hpa


2000-12-15 18:37:19

by Alan

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

> > I also got a hard lockup (but with Oops) while calling the
> > "vendor CPU init" function during system boot.
> >
> > This was on Cyrix III.
> > PS: CC'ed hpa, because he is cpu-detection maintainer and davej,
> > because he added Cyrix III support and might know details ;-)
>
> Please include the oops information, as well as the /proc/cpuinfo output.

Also be sure you built Pentium/TSC kernels as Cyrix III is a 686 core without
the cmov instruction it seems

2000-12-15 18:47:30

by Ingo Oeser

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Fri, Dec 15, 2000 at 09:52:22AM -0800, H. Peter Anvin wrote:
> > This was on Cyrix III.
>
> Please include the oops information, as well as the /proc/cpuinfo output.

processor : 0
vendor_id : CentaurHauls
cpu family : 6
model : 6
model name : WinChip ??
stepping : 0
cpu MHz : 501.000148
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de tsc msr mce cx8 mtrr pge mmx
bogomips : 999.42
processor : 0
vendor_id : CentaurHauls
cpu family : 6
model : 6
model name : WinChip ??
stepping : 0
cpu MHz : 501.000148
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de tsc msr mce cx8 mtrr pge mmx
bogomips : 999.42

Oops not available, because this machine is in a frozen state (in
project management context) running a specially patched test9.

It oopsed after this message:
CPU: Before vendor init, caps: <the actual caps>

The only symbols on stack where "empty_bad_page" and "L6" without
any offset.

I'll get access to a clone of this machine on monday and oops it
again ;-)

But perhaps this is helpful in any matter.

Regards & Thanks

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2000-12-15 18:52:30

by Ingo Oeser

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback

On Fri, Dec 15, 2000 at 06:06:58PM +0000, Alan Cox wrote:
> > > This was on Cyrix III.
> > Please include the oops information, as well as the /proc/cpuinfo output.
> Also be sure you built Pentium/TSC kernels as Cyrix III is a 686 core without
> the cmov instruction it seems

I did. And built with gcc 2.95.2 (debian potato) if that matters.

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2000-12-15 19:22:14

by Mike Elmore

[permalink] [raw]
Subject: Re: test12 lockups -- need feedback


I have a DLink DFE-530TX+ with a RTL8139 and I lock up cold
every once in a while too. 2.4.0-test12-pre3 is the latest
kernel I've tried. The machine is a dual PII450 on a Tyan
Tiger 100 BX board w/ 128MB.

Locks up cold meaning "It's dead Jim". Non sysrq facilities
available and no Oops trail.

I don't see the old Becker 8139 driver in the 2.4 tree so
I don't know if it happens with 2.4 and the old driver.

I can provide what ever info that is available and would
be useful.

NOTE also: I have an old Dell P133 48MB masquerading machine
with 2 of these same boards that Panic's on current 2.4
kernels with the "Aieee killing interrupt handler" message
to the console but doesn't get around to writing the console
to the log before going toe up. 2.4.0-test12-pre3. Before
that I get a bunch of the RxFIFOOwv interrupt sending it
into the rtl8139_weird_interrupt routine, but it says
in the driver code that this could be related to CPU speed
and the machine's a P133. Should the machine panic though?

I can't get the console off to the serial port cause the
ports are dead on this machine for some reason. The BIOS
allocates irq 4 to the second of the 8139 cards and neither
serial port is recognised so I'm not sure how to get any
major chunk of the Panic info off teh 14" screen. Note
that this machine runs 2.2.18 fine albiet my OnStream
drive doesn't function right so maybe the old Becker driver
does solve some of the problems. Arg! =)

-mwe


On Fri, Dec 15, 2000 at 07:47:35PM +0100, Ingo Oeser wrote:
> On Thu, Dec 14, 2000 at 06:52:34PM +0000, Eckhard Jokisch wrote:
> > Is it possible that there is something wrong with the 8139too driver?
> > ( I also use a card with 8139 chip )
> > Or do you use the "old" rtl8139 ? With that I don't have any problems.
> > I have an extra machine here where I can do all testing - how can I help?
>
> I have no Realtek-Card and have the same lockup.
>
> I also got a hard lockup (but with Oops) while calling the
> "vendor CPU init" function during system boot.
>
> This was on Cyrix III.
>
> PS: CC'ed hpa, because he is cpu-detection maintainer and davej,
> because he added Cyrix III support and might know details ;-)
>
> Regards
>
> Ingo Oeser

--
Mike Elmore
[email protected]

"Never confuse activity with accomplishment."
-unknown