2001-07-16 09:03:24

by Joern Nettingsmeier

Subject: kernel lockup in 2.4.5-ac3 and 2.4.6-pre7 (netfilter ?)

hello everyone !

i have had reproducible lockups in 2.4.5-ac3.
the box is a cyrix@120 mhz with a via apollo chipset and two
ethercards.
it's used as a masquerading firewall/dsl router.

now when i have an ftp session from a machine on the private network
to the internet and it gets stuck or i ctrl-c out of it, this causes
the box to lock up hard. i was able to reproduce this a few times.
no syslog entries survive. alt-sysrq-sync seems to work, but
-killall and -umount don't, so after alt-sysrq-boot i have to go
through 20gigs of fsck.

i have upgraded to 2.4.7-pre6, and the problem has reappeared, this
time when closing a stuck ssh connection. same sysrq behaviour, no
logs.

when not forwarding ftp or ssh sessions, the box has had uptimes of
more than a week, so i think the problem may be netfilter related.
see below for my netfilter setting. (the same setting has run w/o
problems in earlier 2.4 kernels.)

i'd welcome hints to nail down the problem. if you want me to run
further tests or need more info, let me know.

yours,

jörn

ps: if possible, cc: me on followups, because i only read lkml
through the archive.


----

#iptables -L -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in    out source   destination
67171   27M block  all  --  any   any anywhere anywhere

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in    out source   destination
  285 16924 TCPMSS tcp  --  any   any anywhere anywhere    tcp flags:SYN,RST/SYN TCPMSS clamp to PMTU
 6690 3628K block  all  --  any   any anywhere anywhere

Chain OUTPUT (policy ACCEPT 67143 packets, 47732223 bytes)
 pkts bytes target prot opt in    out source   destination

Chain block (2 references)
 pkts bytes target prot opt in    out source   destination
71441   30M ACCEPT all  --  any   any anywhere anywhere    state RELATED,ESTABLISHED
 1848  152K ACCEPT all  --  !ppp0 any anywhere anywhere    state NEW
    0     0 ACCEPT tcp  --  any   any anywhere anywhere    tcp dpt:ssh
    0     0 ACCEPT tcp  --  any   any anywhere anywhere    tcp dpt:http
   38  1792 ACCEPT tcp  --  any   any anywhere anywhere    tcp dpts:1024:65535
  534 31915 DROP   all  --  any   any anywhere anywhere
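
For anyone trying to recreate a comparable setup, the listing above could plausibly be produced by commands along these lines. This is only a reconstruction from the `iptables -L -v` output, not the poster's actual script. The `build_rules` function name and the `IPT` variable are made up for illustration, and the nat-table MASQUERADE rule is an assumption, since `iptables -L` shows only the filter table. The sketch uses the current `! -i ppp0` negation syntax; iptables releases from 2001 wrote it as `-i ! ppp0`. By default it only prints the commands (dry run).

```shell
#!/bin/sh
# Dry run by default: prints each command instead of changing the live ruleset.
# To apply for real, run as root with IPT=iptables in the environment.
IPT="${IPT:-echo iptables}"

build_rules() {
    # User-defined chain referenced from both INPUT and FORWARD.
    $IPT -N block
    $IPT -A block -m state --state RELATED,ESTABLISHED -j ACCEPT
    # New connections are accepted from any interface except ppp0.
    $IPT -A block -m state --state NEW ! -i ppp0 -j ACCEPT
    $IPT -A block -p tcp --dport ssh -j ACCEPT
    $IPT -A block -p tcp --dport http -j ACCEPT
    $IPT -A block -p tcp --dport 1024:65535 -j ACCEPT
    $IPT -A block -j DROP

    $IPT -A INPUT -j block
    # Clamp the MSS on forwarded SYN packets to the path MTU (usual for PPPoE).
    $IPT -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
    $IPT -A FORWARD -j block

    # ASSUMPTION: not visible in the listing; typical masquerading rule for ppp0.
    $IPT -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
}

build_rules
```

Running the script unchanged prints eleven `iptables ...` lines, which can be reviewed before anything is applied.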

--
Jörn Nettingsmeier
home://Kurfürstenstr.49.45138.Essen.Germany
phone://+49.201.491621
http://icem-www.folkwang-hochschule.de/~nettings/
http://www.linuxdj.com/audio/lad/


2001-07-19 13:37:43

by Joern Nettingsmeier

Subject: Re: kernel lockup in 2.4.5-ac3 and 2.4.6-pre7 (netfilter ?)

Steven Walter wrote:
>
> Just out of curiosity, are you using the kernel PPPoE driver? If so,
> there are several others of us on the net experiencing similar problems.

yes.
wouldn't have expected the pppoe driver to be the problem.
seems i need to google for related posts a bit..

thanks for the hint,
regards,

jörn

> --
> -Steven
> In a time of universal deceit, telling the truth is a revolutionary act.
> -- George Orwell


2001-07-19 16:32:44

by Joern Nettingsmeier

Subject: Re: kernel lockup in 2.4.5-ac3 and 2.4.6-pre7 (netfilter ?)

[christian, i'm quoting a message of yours below. maybe this is of
interest to you, so i'm cc:ing]
Thomas wrote:
>
> Jörn Nettingsmeier wrote:
>
> > hello brad, hello netfilter people !
> > Brad Chapman wrote:
> >
> >> Were you able to rescue any console output from the hard
> >> lockup;
> >> i.e. you did a klogd -c 7 to capture _everything_ the kernel
> >> spit out?
> >> If you were able to, could you send them to the list, please?
> >>
> > i have just reproduced the lockup with the klogd setting you
> > suggested, but no entries at all have survived.
> > however, it has been pointed out to me that my lockups might be
> > caused by a faulty pppoe module (i'm using a dsl connection)
> > rather
> > than netfilter.
> > it looks like i need to investigate a little further on pppoe...
> > let's see what the lkml archive has to say.
> > thanks for all the helpful replies. i will keep you posted if i
> > can
> > solve this problem.
> >
> Hi,
> i can't help you exactly, but i also use T-DSL and i also get
> hangs (sometimes a kernel panic).
> For the moment i have got rid of it: i switched debug mode on and
> logged everything to a file (hoping to find the error),
> but once debug was on there were no more hangs. I'm sure it is a
> bug in the pppoe system, which is triggered
> when T* has trouble and sends defective packets. The reason i
> think it has to do with defective packets is
> that friends with dsl under windoff told me that they also
> get many disconnects at the same time.
>
> Cu thomas

i found someone else's oops report on lkml, and this is exactly the
one i'm seeing, although i can't save it for lack of a serial
console:

> christian wrote on lkml:
>
> PROBLEM: Kernel Panics since i switched to T-DSL, using
> masquadering. Supposed to be
> fixed in 2.4.5pre9 ?

>
> virtual address 00008ba7
> *pde = 00000000
> Oops = 0000
> CPU = 0
> EIP = 0010:[<c01c96c9>]
> EFLAGS: 00010202
>
> eax: c1569940 ebx: 00008ba7 ecx: 00000000 edx: 00068ba7
> esi: c1b5ce80 edi: c15697e0 ebp: 00000060 esp: c0e41dd4
> ds: 0018 es: 0018 ss: 0018
>
> Process dnetc (pid: 2152, stackpage=c0e41000)
>
> Stack: ffffff00 c01c976b c1b5ce80 ffffff00 c1b5ce80 c01c9d53
> c1b5ce80 c11fa800
> c1b5ce80 0000000e c1b5ce80 ffffffe6 c01cc667 c1b5ce80
> 00000020 00000004
> c1979b20 0000000e c01d0cdd c1b5ce80 00000001 00000000
> c1b5ce80 c01dabf0
>
> Call Trace: [<c01c976b>] [<c01c9d53>] [<c01ccdd7>]
> [<c01d0cdd>] [<c01dabf0>]
> [<c01dacb0>] [<c01d1ef8>]
> [<c01d8240>] [<c01dabd2>] [<c01dabf0>]
> [<c01d829a>] [<c01d1ef8>]
> [<c01d8fd6>] [<c01d8240>] [<c01d7290>]
>
> ^ f or 1 ?
> (that's the f in the third entry, for those not using fixed
> width fonts :)
> [<c01d742d>] [<c01d7290>] [<c01d1ef8>]
> [<c01d70d6>] [<c01d7290>]
> [<c01cd59e>] [<c0116b8a>] [<c01085cb>]
> [<c0106d04>]
>
> Code: 8b 1b 8b 42 70 83 f8 01 74 0b f0 ff 4a 70 0f 94 c0 84 c0
> 74
> Kernel panic: Aiee, killing interrupt handler!
> In interrupt handler - not syncing.
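
A raw trace like the one above only becomes readable once its addresses are mapped to symbol names, which needs the System.map of the exact kernel that crashed. The usual tool for that is ksymoops. The following is only a minimal stand-in sketch of the lookup it performs, assuming 8-digit lowercase hex addresses as found in a 32-bit 2.4 System.map. The function names `trace_addrs` and `resolve_addrs` and the file names in the usage comment are made up for illustration, and the sketch ignores addresses that belong to loadable modules.

```shell
#!/bin/sh
# Pull the bracketed addresses out of an oops pasted on stdin,
# e.g. "[<c01c976b>]" becomes "c01c976b" (one address per output line).
trace_addrs() {
    grep -o '\[<[0-9a-f]*>\]' | tr -d '[<>]'
}

# resolve_addrs MAPFILE
# Reads one address per line on stdin and prints "address symbol", where
# symbol is the System.map entry with the highest start address that is
# still <= the address. Equal-width lowercase hex compares correctly as text.
resolve_addrs() {
    awk '
        BEGIN { n = 0 }
        # First input: System.map lines of the form "address type symbol".
        NR == FNR { addr[n] = $1 ""; name[n] = $3; n++; next }
        # Second input (stdin): the addresses to look up.
        {
            best = "?"; best_addr = ""
            for (i = 0; i < n; i++)
                if (addr[i] <= ($1 "") && addr[i] >= best_addr) {
                    best = name[i]; best_addr = addr[i]
                }
            print $1, best
        }
    ' "$1" -
}

# Usage (file names are examples only):
#   trace_addrs < oops.txt | resolve_addrs /boot/System.map
```

The output is one `address symbol` pair per trace entry, which is usually enough to see whether the crashing path runs through ppp, pppoe or netfilter code.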

i can trigger this bug by simply typing ctrl-c in an ftp or ssh
session from a machine on the local (masqueraded) network to the
internet.

some folks have blamed the ppp/pppoe code, but after further testing
it does seem to be netfilter-related somehow, since i cannot
reproduce the oops from sessions started on the router itself with
the iptables modules unloaded. the router only oopses when the
session comes from a machine on the local network and is
masqueraded via the router.

does this assumption make sense ?

i was pointed to a ppp patch from lkml, but it seems to be relevant
only to starting/stopping a ppp device.
(message "[PATCH] ppp_generic.c - kfree(ppp) called twice")

right now, i'm trying 2.4.7-pre8, it has a load of ppp related
patches. it's still compiling atm.

getting confused...

jörn



2001-07-19 20:46:25

by Joern Nettingsmeier

Subject: Re: kernel lockup in 2.4.5-ac3 and 2.4.6-pre7 (netfilter ?)

just fyi, 2.4.7-pre8 did not cure the problem.
i was able to reproduce the problem like before.
this time, i switched to the log console before locking the machine
up, and the oops is in fact identical to the one christian was
seeing.
the last line says "In interrupt handler - not syncing." which seems
to explain why no syslog messages survive.

regards,

jörn
