2004-03-11 15:30:32

by Ron Peterson

Subject: Re: network/performance problem

I didn't reboot sam like I said I would. I decided I'd let it spiral
down. I'm still collecting profile data every fifteen minutes. I
haven't posted any more graphs. They look the same as all the others: a
monotonically increasing ping latency (w/ a corresponding slow increase
in system load averages - which I'm logging, if anyone wants more data).

http://depot.mtholyoke.edu:8080/tmp/sam-profile/

I've been perusing fa.linux.kernel, and saw Brad Laue's thread. FWIW,
it smells similar. When my machines finally go down, ksoftirqd is
always at the top of the process list.

Any ideas at all about what might be happening?

--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso


2004-03-11 17:32:54

by Ron Peterson

Subject: Re: network/performance problem

On Thu, Mar 11, 2004 at 10:27:28AM -0500, rpeterso wrote:

> I've been perusing fa.linux.kernel, and saw Brad Laue's thread. FWIW,
> it smells similar. When my machines finally go down, ksoftirqd is
> always at the top of the process list.
>
> Any ideas at all about what might be happening?

I put my latest user.log file up (16M):

http://depot.mtholyoke.edu:8080/tmp/sam-profile/user.log

If you 'grep PSTOPCPU user.log | less', you can see that ksoftirqd_CPU0
slowly but steadily consumes a higher and higher CPU percentage. What
this means, I have no idea.

--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso

2004-03-11 23:14:09

by Andrew Morton

Subject: Re: network/performance problem

Ron Peterson wrote:
>
> I didn't reboot sam like I said I would. I decided I'd let it spiral
> down. I'm still collecting profile data every fifteen minutes. I
> haven't posted any more graphs. They look the same as all the others: a
> monotonically increasing ping latency (w/ a corresponding slow increase
> in system load averages - which I'm logging, if anyone wants more data).
>
> http://depot.mtholyoke.edu:8080/tmp/sam-profile/
>
> I've been perusing fa.linux.kernel, and saw Brad Laue's thread. FWIW,
> it smells similar. When my machines finally go down, ksoftirqd is
> always at the top of the process list.
>
> Any ideas at all about what might be happening?

The profiles tell a story:

c0217fb0 wait_for_packet                    2   0.0063
c0256660 arpt_do_table                      2   0.0019
c0265ca0 __generic_copy_to_user             2   0.0278
c0106bd0 system_call                        3   0.0536
c0107e8c handle_IRQ_event                   3   0.0326
c014bf10 statm_pgd_range                    3   0.0077
c0120ed4 do_wp_page                         5   0.0101
c024c0d4 ip_conntrack_expect_related       47   0.0368
c0105250 default_idle                    2817  70.4250
c024bae0 init_conntrack                  3053   3.7232
00000000 total                           5962   0.0041

It appears that netfilter has gone berserk and is taking your machine out.

Are you really sure that nothing is sitting there injecting new rules all
the time?
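
A minimal sketch of how a profile excerpt like the one above can be summarized. It assumes the four-column layout shown (address, symbol, ticks, normalized load) and takes the busy tick count to be `total` minus `default_idle`; the names `parse_profile` and `non_idle_share` are made up for illustration. On the excerpt above, init_conntrack comes out at roughly 97% of the non-idle ticks.

```python
from collections import namedtuple

# One row of the profile listing: address, symbol, tick count, normalized load.
ProfileRow = namedtuple("ProfileRow", "address symbol ticks load")

SAMPLE = """\
c0217fb0 wait_for_packet 2 0.0063
c0256660 arpt_do_table 2 0.0019
c0265ca0 __generic_copy_to_user 2 0.0278
c0106bd0 system_call 3 0.0536
c0107e8c handle_IRQ_event 3 0.0326
c014bf10 statm_pgd_range 3 0.0077
c0120ed4 do_wp_page 5 0.0101
c024c0d4 ip_conntrack_expect_related 47 0.0368
c0105250 default_idle 2817 70.4250
c024bae0 init_conntrack 3053 3.7232
00000000 total 5962 0.0041
"""


def parse_profile(text):
    """Parse 'address symbol ticks load' lines, ignoring mail quote markers."""
    rows = []
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) != 4:
            continue  # not a profile line
        address, symbol, ticks, load = fields
        try:
            rows.append(ProfileRow(address, symbol, int(ticks), float(load)))
        except ValueError:
            continue  # four words, but not numeric where expected
    return rows


def non_idle_share(rows, symbol):
    """Fraction of busy ticks (total minus default_idle) spent in `symbol`."""
    by_name = {row.symbol: row for row in rows}
    busy = by_name["total"].ticks - by_name["default_idle"].ticks
    return by_name[symbol].ticks / busy


if __name__ == "__main__":
    rows = parse_profile(SAMPLE)
    print("init_conntrack: %.1f%% of non-idle ticks"
          % (100 * non_idle_share(rows, "init_conntrack")))
```

Because the parser strips leading `>` markers, the same function also works on the quoted copies of the listing in later replies.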

2004-03-11 23:36:04

by Ron Peterson

Subject: Re: network/performance problem

On Thu, Mar 11, 2004 at 03:15:59PM -0800, Andrew Morton wrote:
> Ron Peterson wrote:
> >
> > I didn't reboot sam like I said I would. I decided I'd let it spiral
> > down. I'm still collecting profile data every fifteen minutes. I
> > haven't posted any more graphs. They look the same as all the others: a
> > monotonically increasing ping latency (w/ a corresponding slow increase
> > in system load averages - which I'm logging, if anyone wants more data).
> >
> > http://depot.mtholyoke.edu:8080/tmp/sam-profile/
> >
> > I've been perusing fa.linux.kernel, and saw Brad Laue's thread. FWIW,
> > it smells similar. When my machines finally go down, ksoftirqd is
> > always at the top of the process list.
> >
> > Any ideas at all about what might be happening?
>
> The profiles tell a story:
>
> c0217fb0 wait_for_packet 2 0.0063
> c0256660 arpt_do_table 2 0.0019
> c0265ca0 __generic_copy_to_user 2 0.0278
> c0106bd0 system_call 3 0.0536
> c0107e8c handle_IRQ_event 3 0.0326
> c014bf10 statm_pgd_range 3 0.0077
> c0120ed4 do_wp_page 5 0.0101
> c024c0d4 ip_conntrack_expect_related 47 0.0368
> c0105250 default_idle 2817 70.4250
> c024bae0 init_conntrack 3053 3.7232
> 00000000 total 5962 0.0041
>
> It appears that netfilter has gone berserk and is taking your machine out.
>
> Are you really sure that nothing is sitting there injecting new rules all
> the time?

You mean a script calling 'iptables' to dynamically add rules? Nothing
like that at all. I dumped the current rules below.

Are you looking at the init_conntrack numbers? While they seem, in the
long run, to be getting larger, they're not increasing monotonically.
My ping latencies, and the CPU percentage consumed by ksoftirqd_CPU0
just go up and up (albeit slowly).

The graph below shows what happened when I flushed the rules, and set
the default policy to ACCEPT. So the ping latencies, at least, seem
to have something to do with iptables.

http://depot.mtholyoke.edu:8080/tmp/tap-sam/2004-03-06_9:30/sam_last_108000.png

1003# iptables -v -L
Chain INPUT (policy DROP 9910K packets, 1296M bytes)
 pkts bytes target     prot opt in     out     source               destination
1899K 2581M ACCEPT     all  --  any    any     anywhere             anywhere           state RELATED,ESTABLISHED
28774 2396K ACCEPT     icmp --  any    any     138.110.0.0/16       anywhere           icmp echo-request
   12   672 ACCEPT     tcp  --  any    any     anywhere             anywhere           tcp dpt:ssh
    0     0 ACCEPT     tcp  --  any    any     anywhere             anywhere           tcp dpt:https
  127  8713 ACCEPT     all  --  lo     any     anywhere             localhost

Chain FORWARD (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy DROP 137 packets, 9042 bytes)
 pkts bytes target     prot opt in     out     source               destination
1433K  287M ACCEPT     all  --  any    any     anywhere             anywhere           state NEW,RELATED,ESTABLISHED

Thu Mar 11 06:26:55 root@sam ~
1004# iptables -v -L -t nat
Chain PREROUTING (policy ACCEPT 21M packets, 2512M bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 676K packets, 27M bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 676K packets, 27M bytes)
 pkts bytes target     prot opt in     out     source               destination

2004-03-12 10:17:27

by Patrick McHardy

Subject: Re: network/performance problem

Ron Peterson wrote:
> On Thu, Mar 11, 2004 at 03:15:59PM -0800, Andrew Morton wrote:
>>The profiles tell a story:
>>
>>c0217fb0 wait_for_packet 2 0.0063
>>c0256660 arpt_do_table 2 0.0019
>>c0265ca0 __generic_copy_to_user 2 0.0278
>>c0106bd0 system_call 3 0.0536
>>c0107e8c handle_IRQ_event 3 0.0326
>>c014bf10 statm_pgd_range 3 0.0077
>>c0120ed4 do_wp_page 5 0.0101
>>c024c0d4 ip_conntrack_expect_related 47 0.0368
>>c0105250 default_idle 2817 70.4250
>>c024bae0 init_conntrack 3053 3.7232
>>00000000 total 5962 0.0041
>>
>>It appears that netfilter has gone berserk and is taking your machine out.
>>
>>Are you really sure that nothing is sitting there injecting new rules all
>>the time?
>
>
> You mean a script calling 'iptables' to dynamically add rules? Nothing
> like that at all. I dumped the current rules below.
>
> Are you looking at the init_conntrack numbers? While they seem, in the
> long run, to be getting larger, they're not increasing monotonically.
> My ping latencies, and the CPU percentage consumed by ksoftirqd_CPU0
> just go up and up (albeit slowly).
>

The size-128 slab keeps growing over time; I suspect something is
registering lots of expectations. init_conntrack has to walk the
entire list for each new connection. Which helpers are you using?
Please also post the contents of /proc/net/ip_conntrack and your
config.

Regards
Patrick
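
A toy model of the cost described above, offered only as an illustration and not as kernel code. It assumes expectations sit in a plain list that is scanned from the front for every new connection; the class and function names are hypothetical. When nothing matches, each new connection pays one comparison per entry, so a list that keeps growing makes every new connection steadily more expensive.

```python
class ExpectationList:
    """Toy stand-in for a linked list of expectations (not kernel code)."""

    def __init__(self):
        self.entries = []
        self.comparisons = 0  # total entries examined across all lookups

    def register(self, key):
        self.entries.append(key)

    def find(self, key):
        # Linear walk: cost is proportional to the current list length.
        for entry in self.entries:
            self.comparisons += 1
            if entry == key:
                return True
        return False


def cost_per_connection(stale_expectations, connections=100):
    """Average comparisons per new connection when nothing in the list matches."""
    expectations = ExpectationList()
    for i in range(stale_expectations):
        expectations.register(("stale", i))
    for i in range(connections):
        expectations.find(("new", i))  # never matches, so the whole list is walked
    return expectations.comparisons / connections


if __name__ == "__main__":
    for size in (10, 1000, 100000):
        print(size, cost_per_connection(size))
```

The model says nothing about why the list would grow; it only shows why a slow, steady rise in per-packet work is what one would expect if it does.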

2004-03-12 16:11:18

by Martin Josefsson

Subject: Re: network/performance problem

On Fri, 12 Mar 2004, Patrick McHardy wrote:

> >>c024c0d4 ip_conntrack_expect_related 47 0.0368
> >>c0105250 default_idle 2817 70.4250
> >>c024bae0 init_conntrack 3053 3.7232
> >>00000000 total 5962 0.0041
> >>
> >>It appears that netfilter has gone berserk and is taking your machine out.
> >>
> >>Are you really sure that nothing is sitting there injecting new rules all
> >>the time?
> >
> >
> > You mean a script calling 'iptables' to dynamically add rules? Nothing
> > like that at all. I dumped the current rules below.
> >
> > Are you looking at the init_conntrack numbers? While they seem, in the
> > long run, to be getting larger, they're not increasing monotonically.
> > My ping latencies, and the CPU percentage consumed by ksoftirqd_CPU0
> > just go up and up (albeit slowly).
> >
>
> The size-128 slab keeps growing over time; I suspect something is
> registering lots of expectations. init_conntrack has to walk the
> entire list for each new connection. Which helpers are you using?
> Please also post the content of /proc/net/ip_conntrack and your
> config.

If you want to see the number of expectations registered per second, you
can apply the ctstat patch from patch-o-matic and download the small
utility mentioned in the helpfile.

I can prepare a regular patch for you if it sounds interesting.
We can add a counter for the number of expectations in the linked-list as
well in order to debug this. (the ctstat patch only adds counters for
new/deleted expectations)

/Martin

2004-03-12 16:47:31

by Ron Peterson

Subject: Re: network/performance problem

Hi Patrick.

(I'm all set now. Someone kindly sent me a critical network patch via
email... :)

I'm not subscribed to lkml, but am following along in fa.linux.kernel.
I'm replying to my own mail to keep the thread somewhat intact...

Anyway, sam's .config can be found here:

http://depot.mtholyoke.edu:8080/tmp/sam-profile/sam-config-2.4.21

On sam, I just did:

1002# cat /proc/net/ip_conntrack > ip_conntrack

..and it wiped the machine out. I can't ping it, ssh to it, nothing. I
need to go walk over to the machine room... :(

After lunch I'm stuck in meetings for a while...

Thanks.

--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso

2004-03-12 17:23:32

by Ron Peterson

Subject: Re: network/performance problem


On Fri, Mar 12, 2004 at 11:47:04AM -0500, rpeterso wrote:

> On sam, I just did:
>
> 1002# cat /proc/net/ip_conntrack > ip_conntrack
>
> ..and it wiped the machine out. I can't ping it, ssh to it, nothing. I
> need to go walk over to the machine room... :(

I rebooted, and did the exact same thing as above. Here's what the
console says:

Unable to handle kernel NULL pointer dereference at virtual address 00000018
printing eip:
c024aae5
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c024aae5>] Not tainted
EFLAGS: 00010286
eax: 00000000 ebx: deb00440 ecx: ddad71d1 edx: e089b000
esi: deb00440 edi: ddad71d2 ebp: 0000002d esp: ddb4df3c
ds: 0018 es: 0018 ss: 0018
Process cat (pid: 365, stackpage=ddb4d000)
Stack: deb00440 000001d2 000001d2 c024ad1a ddad71d2 deb00440 00000000 00000c00
ddad7000 00001000 00000ff6 c014af9f ddad7000 ddb4df98 00000029 00000c00
00000000 ddafe3c0 ffffffea 00001000 c196dce0 00000000 00000000 00000000
Call Trace: [<c024ad1a>] [<c014af9f>] [<c012f936>] [<c0106c03>]

Code: 83 78 18 00 74 3a 83 7e 2c 00 74 1f a1 44 3c 32 c0 8b 56 34
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing


...whew. Hopefully not too many typos.. ;) After I reboot again, I'll
probably find this all got syslogged..


--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso

2004-03-12 22:56:34

by Ron Peterson

Subject: Re: network/performance problem

On Fri, Mar 12, 2004 at 11:47:04AM -0500, rpeterso wrote:

> (I'm all set now. Someone kindly sent me a critical network patch via
> email... :)

...just in case ...since my sense of humor is suspect ...that was a
joke. Same problem persists after reboot. I haven't installed a
different kernel or otherwise changed anything on 'sam' yet. Not sure
what would be good to try next.

--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso

2004-03-14 06:33:58

by David Miller

Subject: Re: network/performance problem

On Fri, 12 Mar 2004 17:56:06 -0500
Ron Peterson wrote:

> ...just in case ...since my sense of humor is suspect ...that was a
> joke. Same problem persists after reboot. I haven't installed a
> different kernel or otherwise changed anything on 'sam' yet. Not sure
> what would be good to try next.

Find out what's adding all of the netfilter rules like crazy.

It is obvious this is happening, from your profiles. I know you
say that you have no idea what might be doing it, but your description
matches every other one that was reported in the past of gradual
networking slowdown, and in each of those cases it was something
poking netfilter in some way, and your profiles basically
confirm that this is what is happening somehow on your box.
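
One way to check whether rules really are being added over time is to compare rule counts between two saved rule dumps. The following is a minimal sketch, assuming the dumps are in the text format written by `iptables-save` (a `*table` line, `-A CHAIN ...` rule lines, `COMMIT`); the function names and the sample dump are hand-written for illustration and are not taken from the machine discussed in this thread.

```python
from collections import Counter

# Hand-written example in iptables-save style; not a dump from a real host.
SAMPLE_DUMP = """\
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT DROP [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -s 138.110.0.0/16 -p icmp --icmp-type echo-request -j ACCEPT
-A INPUT -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 443 -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A OUTPUT -m state --state NEW,RELATED,ESTABLISHED -j ACCEPT
COMMIT
"""


def count_rules(dump):
    """Count '-A' rule lines per (table, chain) in an iptables-save style dump."""
    counts = Counter()
    table = None
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("*"):
            table = line[1:]  # e.g. "*filter" -> "filter"
        elif line.startswith("-A ") and table is not None:
            chain = line.split()[1]
            counts[(table, chain)] += 1
    return counts


def grown_chains(before, after):
    """Return {(table, chain): added_rules} for chains with more rules in `after`."""
    return {key: after[key] - before[key]
            for key in after if after[key] > before[key]}


if __name__ == "__main__":
    earlier = count_rules(SAMPLE_DUMP)
    later = count_rules(SAMPLE_DUMP)  # in practice, a dump taken some hours later
    print(grown_chains(earlier, later) or "no chain gained rules")
```

If two dumps taken hours apart show identical counts while latency keeps climbing, the growth is more likely in connection-tracking state than in the rule set itself.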

2004-03-14 13:24:18

by Ron Peterson

Subject: Re: network/performance problem

On Sat, Mar 13, 2004 at 10:33:49PM -0800, David S. Miller wrote:
> On Fri, 12 Mar 2004 17:56:06 -0500
> Ron Peterson wrote:
>
> > ...just in case ...since my sense of humor is suspect ...that was a
> > joke. Same problem persists after reboot. I haven't installed a
> > different kernel or otherwise changed anything on 'sam' yet. Not sure
> > what would be good to try next.
>
> Find out what's adding all of the netfilter rules like crazy.
>
> It is obvious this is happening, from your profiles. I know you
> say that you have no idea what might be doing it, but your description
> matches every other one that was reported in the past of gradual
> networking slowdown, and in each of those cases it was something
> poking netfilter in some way, and your profiles basically
> confirm that this is what is happening somehow on your box.

Don't think so. If I revert to 2.4.20 from 2.4.21, and change nothing
else, this problem goes away.

--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso

2004-03-14 17:34:08

by David Miller

Subject: Re: network/performance problem

On Sun, 14 Mar 2004 08:23:40 -0500
Ron Peterson wrote:

> Don't think so. If I revert to 2.4.20 from 2.4.21, and change nothing
> else, this problem goes away.

That's right, because a netfilter change during that time period
makes certain auto-rule-adding setups go berserk, and it's a bug
in the netfilter userland bits, not the kernel.

2004-03-14 18:14:33

by Ron Peterson

Subject: Re: network/performance problem

On Sun, Mar 14, 2004 at 09:33:58AM -0800, David S. Miller wrote:
> On Sun, 14 Mar 2004 08:23:40 -0500
> Ron Peterson wrote:
>
> > Don't think so. If I revert to 2.4.20 from 2.4.21, and change nothing
> > else, this problem goes away.
>
> That's right, because a netfilter change during that time period
> makes certain auto-rule-adding setups go berserk, and it's a bug
> in the netfilter userland bits, not the kernel.

I may indeed be completely dense. That's not unheard of around these
parts. I'd certainly accept an explanation of my denseness in lieu of
any other explanation, as long as I can make this stop happening.

What is the nature of the "auto-rule-adding setups going berserk" problem?

Below are my current iptables rules on 'sam' (the only machine not
currently running 2.4.20). There are no jumps to user defined chains.
I have not installed any scripts that dynamically add/alter iptables
rules. I can't imagine what package I may have installed that might do
such a thing either. Even if there were such a script somehow, since
nothing below ever jumps anywhere else, it wouldn't be getting called,
right?

If I flush and expunge my rules as follows, the problem goes away. If
this was because a jump to user defined chain was being deleted, then I'd
understand. But there are no jumps out of INPUT, OUTPUT, FORWARD,
PREROUTING, or POSTROUTING, so I'm confused.

$IPTABLES -F
$IPTABLES -t nat -F
$IPTABLES -X
$IPTABLES -P INPUT ACCEPT
$IPTABLES -P OUTPUT ACCEPT
$IPTABLES -P FORWARD ACCEPT

FWIW, I compiled the latest 'iptables' code against my current running
2.4.21 kernel also..

1052# iptables -V
iptables v1.2.9

1045# iptables -L
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere           state RELATED,ESTABLISHED
ACCEPT     icmp --  138.110.0.0/16       anywhere           icmp echo-request
ACCEPT     tcp  --  anywhere             anywhere           tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere           tcp dpt:https
ACCEPT     all  --  anywhere             localhost

Chain FORWARD (policy DROP)
target     prot opt source               destination

Chain OUTPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere           state NEW,RELATED,ESTABLISHED

Sun Mar 14 12:57:25 root@sam /usr/src
1046# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

--
Ron Peterson
Network & Systems Manager
Mount Holyoke College
http://www.mtholyoke.edu/~rpeterso