2005-03-05 17:20:03

by Christian Schmid

Subject: BUG: Slowdown on 3000 socket-machines tracked down

Hello.

After weeks of work, I can now give a detailed report about the bug and when it appears:

Attached is another traffic image. This one is with 2.6.10 and a 3/1 split, preemptive kernel, so all
defaults.

The first part is where I throttled the whole thing to 100 MBit in order to build up a traffic-jam ;)

When I released it, it jumped up immediately, but then it suddenly goes down (each pixel is one second).
Playing around with min_free_kbytes didn't help. Where it goes up again I set lower_zone_protection
to 1024000, where it goes down I set it to 0 again, and where it goes up the last time... guess..

This test was with 3500 sockets.

Today I tested with 5000 sockets. The problem is the same as above, but as more sockets come in, it
just doesn't claim more bandwidth as it of course SHOULD. It seems it doesn't slow down, it just
doesn't scale anymore. The bandwidth doesn't go over 80 MB/sec, no matter what I do. Then I did the
following: I raised lower_zone_protection to 1024 (above I did 1024000, which is bullshit, but it
doesn't matter as it seems to just protect the whole low mem, which is what I want) and it was at
80 MB. Then I lowered it to 0 again and suddenly it peaked up to full bandwidth (100 MB) for about 5
seconds until the whole protected area was in use. Then it slowed down drastically again.

My theory:

I suppose when the blocks come in fast enough and the load is high enough, the kernel can't free the
required low memory as fast as would be needed to NOT slow everything down. So the VM is basically
busy freeing low memory. What do you think? The interesting part is that it slows down painfully
with lower_zone_protection set to 0, it peaks at 80 MB/sec with lower_zone_protection set to max
(1024, the whole low mem), and there are plenty of CPU resources free..... When set to 0 it speeds
up without limit AS LONG AS there is memory left. When that is consumed, it slows down painfully
again, because it's set to 0 of course.

Chris


Attachments:
traffic2.png (2.45 kB)

2005-03-07 00:45:35

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:

> Today I tested with 5000 sockets. The problem is the same like above but
> the more sockets there come, it just doesnt claim more bandwidth as it
> SHOULD of course do. It seems it doesn't slow down but it just doesnt
> scale anymore. The badwidth doesnt go over 80 MB/Sec, no matter what I
> do. Then I did the following: I raised lower_zone_protection to 1024
> (above I did 1024000 which is bullshit but it doesnt matter as it seems
> to just protect the whole low-mem which is what I want) and it was at 80
> MB. then I lowered to 0 again and suddenly it peaked up to full
> bandwidth (100 MB) for about 5 seconds until the whole protected area
> was in use. Then it slowed down drastically again.

This confirms my suspicion that lowmem / highmem scanning is not
properly balanced. When you raise lower_zone_protection a great
deal, lowmem is no longer used for pagecache, and your problem
goes away.

I gave you a patch to try for this - unfortunately I can't make
much more progress than that if I don't have a test case and you
can't test patches :\

Nick

2005-03-07 01:14:56

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
> Hello.
>
> After weeks of work, I can now give a detailed report about the bug and
> when it appears:
>
> Attached is another traffic-image. This one is with 2.6.10 and a 3/1
> split, preemtive kernel, so all defaults.

What are the units on your graph? You say "MB" in several places, but
do you mean Mb (i.e., megabit) instead?

I have a tool that can also generate TCP traffic on a large number of
sockets. If I can understand what you are trying to do, I may be able
to reproduce the problem. My biggest machine at present has only
2GB of RAM, however...not sure if that matters or not.

Are you sending traffic in only one direction, or more of a full-duplex
configuration? Is each socket running the same bandwidth? What is this
bandwidth? Are you setting the send & rcv buffers in the socket creation
code? (To what values if so?) How many bytes are you sending with each
call to write()/sendto() whatever?

Is there any significant latency between your sender and receiver machine?
If so, how much?

What is the physical transport...GigE? 1500 MTU?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-07 01:58:44

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Christian Schmid wrote:
>
>> Hello.
>>
>> After weeks of work, I can now give a detailed report about the bug
>> and when it appears:
>>
>> Attached is another traffic-image. This one is with 2.6.10 and a 3/1
>> split, preemtive kernel, so all defaults.
>
>
> What are the units on your graph. You say "MB" several places, but
> do you mean Mb (ie, Mega-bit) instead?

The unit on this graph is kilobytes. So 80000 there means 80 megabytes per second.

> I have a tool that can also generate TCP traffic on a large number of
> sockets. If I can understand what you are trying to do, I may be able
> to reproduce the problem. My biggest machine at present has only
> 2GB of RAM, however...not sure if that matters or not.

It should not matter. Low memory is just 1 GB in both cases if you have the default 32-bit 3/1 split.

> Are you sending traffic in only one direction, or more of a full-duplex
> configuration?

It's full-duplex. It's a download service with 3000 downloaders all over the world.

> Is each socket running the same bandwidth?

No. It ranges from 3 kb/sec to 100 kb/sec. 100 kb/sec is the limit because of the send-buffer limits.

> What is this bandwidth?

1000 MBit

> Are you setting the send & rcv buffers in the socket creation
> code? (To what values if so?)

Yes. Send buffer to 64 kbytes and receive buffer to 16 kbytes.

> How many bytes are you sending with each call to write()/sendto() whatever?

I am using a sendfile call every 100 ms per socket with the poll API. So basically around 40 kB per round.
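
A minimal sketch of such a send loop (illustrative only, not my actual code; the descriptor bookkeeping, the chunk size and the 100 ms pacing are assumptions taken from the description above):

/* One pass of a poll + sendfile send loop: every writable socket gets one
 * sendfile() call of at most ~40 KB; short writes or EAGAIN simply wait
 * for the next round. */
#include <poll.h>
#include <sys/sendfile.h>
#include <sys/types.h>

#define CHUNK (40 * 1024)

/* pfds[i].fd is a connected TCP socket, files[i] the file served on it,
 * offsets[i] the current file offset; n is the number of connections. */
void send_round(struct pollfd *pfds, int *files, off_t *offsets, int n)
{
    for (int i = 0; i < n; i++)
        pfds[i].events = POLLOUT;

    if (poll(pfds, n, 100) <= 0)      /* wait up to 100 ms for writable sockets */
        return;

    for (int i = 0; i < n; i++) {
        if (!(pfds[i].revents & POLLOUT))
            continue;
        /* sendfile() advances offsets[i]; -1/EAGAIN just means "retry later" */
        sendfile(pfds[i].fd, files[i], &offsets[i], CHUNK);
    }
}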

> Is there any significant latency between your sender and receiver machine?
> If so, how much?

3000 different downloaders, 3000 different locations, 3000 different machines ;)

> What is the physical transport...GigE? 1500 MTU?

Yes.

Chris

2005-03-07 02:08:48

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> I have a tool that can also generate TCP traffic on a large number of
> sockets. If I can understand what you are trying to do, I may be able
> to reproduce the problem. My biggest machine at present has only
> 2GB of RAM, however...not sure if that matters or not.

But if the problem is what I think it is, you should be able to reproduce it by doing the following.

Best use 2.6.11 since the problem got even worse there compared to 2.6.10.

Create a server on one machine. This server should wait for incoming sockets and when they come,
just send out bytes ("x" or whatever, it just doesn't matter) to those sockets. Please use a
send buffer of 64 kbytes.

On the other machine you just create clients, which connect to the server and read the data. They
just need to read it, nothing more. Please limit the reading to once per 300 ms, so they only read
around 200 kB/sec each. Then watch your traffic as you create more sockets. When you reach 2000
sockets on 2.6.11, it should slow down more and more. You should see the same thing I see on the
attached graph.
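
A rough sketch of the sender half of this test (not my actual code; only the 64 kbyte send buffer and the "just push junk bytes" behaviour come from the description above, the port number and the rest of the scaffolding are assumptions). The client side just connect()s, sleeps ~300 ms, and read()s, so it stays around 200 kB/sec per socket:

/* Crude test sender: accept connections, give each a 64 KB send buffer, and
 * keep pushing 'x' bytes with non-blocking writes.  A real test would use
 * poll()/epoll instead of this busy loop. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    char junk[16384];
    memset(junk, 'x', sizeof(junk));

    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(4001);                    /* assumed test port */
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 1024);
    fcntl(lsock, F_SETFL, O_NONBLOCK);

    int clients[8192];
    int nclients = 0;

    for (;;) {
        int c = accept(lsock, NULL, NULL);
        if (c >= 0 && nclients < 8192) {
            int sndbuf = 64 * 1024;                 /* 64 kbyte send buffer */
            setsockopt(c, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
            fcntl(c, F_SETFL, O_NONBLOCK);
            clients[nclients++] = c;
        }
        for (int i = 0; i < nclients; i++)
            write(clients[i], junk, sizeof(junk));  /* EAGAIN is fine; retry next pass */
        usleep(1000);
    }
}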

First one 2.6.11, second one 2.6.10

Chris


Attachments:
traffic3.png (2.52 kB)

2005-03-07 02:57:19

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
> Ben Greear wrote:
>

>> I have a tool that can also generate TCP traffic on a large number of
>> sockets. If I can understand what you are trying to do, I may be able
>> to reproduce the problem. My biggest machine at present has only
>> 2GB of RAM, however...not sure if that matters or not.
>
> It should not matter. Low-memory is both just 1 GB if you have default
> 32 bit with 3/1 split.
>
>> Are you sending traffic in only one direction, or more of a full-duplex
>> configuration?
>
> Its a full-duplex. Its a download-service with 3000 downloaders all over
> the world.

So actually it's really mostly one-way traffic, i.e. in the download direction.
Anything significant at all going upstream, other than ACKs, etc?

>> Is each socket running the same bandwidth?
>
> No. It ranges from 3 kb/sec to 100 kb/sec. 100 kb/sec is the limit
> because of the send-buffer limits.
>
>> What is this bandwidth?
>
> 1000 MBit
>
>> Are you setting the send & rcv buffers in the socket creation
>> code? (To what values if so?)
>
> Yes. send-buffer to 64 kbytes and receive buffer to 16 kbytes.

With regard to this note in the 'man 7 socket' man page:

NOTES
Linux assumes that half of the send/receive buffer is used for internal kernel structures;
thus the sysctls are twice what can be observed on the wire.

What value are you using for the sockopt call?

>> How many bytes are you sending with each call to write()/sendto()
>> whatever?
>
> I am using sendfile-call every 100 ms per socket with the poll-api. So
> basically around 40 kb per round.

My application is single-threaded, uses non-blocking IO, and sends/rcvs from/to memory.
It will be a good test of the TCP stack, but will not use the sendfile logic,
nor will it touch the HD.

>> Is there any significant latency between your sender and receiver
>> machine?
>> If so, how much?
>
> 3000 different downloaders, 3000 different locations, 3000 different
> machines ;)

I can emulate delay if I need to, but I'd rather just stick with one
delay setting and not have to set up a separate delay for each connection.

Maybe 30ms is average for round-trip time?

Have you tried benchmarking your app in a controlled manner, or are you just
letting a random 3000 machines hit it and start downloading? If the latter,
then I'd suggest getting more control over your testing environment, otherwise
it may be impossible to really figure out where the problem lies.

I'll set up a configuration similar to the values discussed above and see
what I can see. Will probably be late tomorrow before I can do the
test though...

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-07 05:14:57

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Christian Schmid wrote:
>
>> Ben Greear wrote:

>>> How many bytes are you sending with each call to write()/sendto()
>>> whatever?
>>
>>
>> I am using sendfile-call every 100 ms per socket with the poll-api. So
>> basically around 40 kb per round.
>
>
> My application is single-threaded, uses non-blocking IO, and sends/rcvs
> from/to memory.
> It will be a good test of the TCP stack, but will not use the sendfile
> logic,
> nor will it touch the HD.
>

I think you would have better luck in reproducing this problem if you
did the full sendfile thing.

I think it is becoming disk bound due to page reclaim problems, which
is causing the slowdown.

In that case, writing the network-only test would still help to confirm the
problem is not a networking one - so not useless by any means.

2005-03-07 05:30:43

by Willy Tarreau

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Mon, Mar 07, 2005 at 04:14:37PM +1100, Nick Piggin wrote:

> I think you would have better luck in reproducing this problem if you
> did the full sendfile thing.
>
> I think it is becoming disk bound due to page reclaim problems, which
> is causing the slowdown.
>
> In that case, writing the network only test would help to confirm the
> problem is not a networking one - so not useless by any means.

Not necessarily, Nick. I have written an HTTP testing tool which matches
the description of Ben's: non-blocking, single-threaded, no disk I/O,
etc. It works flawlessly under 2.4, but gives me random numbers on 2.6;
especially if I start some CPU activity on the system, I can get pauses
of up to 13 seconds without this tool doing anything!!! At first I
believed it was because of the scheduler, but it might also be related
to what is described here, since I had somewhat the same setup (GigE, 1500,
thousands of sockets). I never had enough time to investigate more, so I
went back to 2.4.

It makes me think that for the problem described here, we have no
indication of CPU & I/O activity, which might help Ben try to reproduce.

Cheers,
Willy

2005-03-07 05:41:06

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Willy Tarreau wrote:
> On Mon, Mar 07, 2005 at 04:14:37PM +1100, Nick Piggin wrote:
>
>
>>I think you would have better luck in reproducing this problem if you
>>did the full sendfile thing.
>>
>>I think it is becoming disk bound due to page reclaim problems, which
>>is causing the slowdown.
>>
>>In that case, writing the network only test would help to confirm the
>>problem is not a networking one - so not useless by any means.
>
>
> Not necessarily, Nick. I have written an HTTP testing tool which matches
> the description of Ben's : non-blocking, single-threaded, no disk I/O,
> etc... It works flawlessly under 2.4, and gives me random numbers in 2.6,

No, you're right - I'm not 100% sure, so I'm definitely not saying
Ben's test will be useless. Just that if it is not too hard to
make one with sendfile, I think he should.

If he makes a network-only version and cannot reproduce the problems,
that *doesn't* mean it is *not* a network problem. However if he
reproduces the problem with a full sendfile version and not the network
only one, then that is a better indicator... but I'm rambling.

> especially if I start some CPU activity on the system, I can get pauses
> of up to 13 seconds without this tool doing anything !!! At first I
> believed it was because of the scheduler, but it might also be related
> to what is described here since I had somewhat the same setup (gigE, 1500,
> thousands of sockets). I never had enough time to investigate more, so I
> went back to 2.4.
>

I have heard other complaints about this, and they are definitely
related to the scheduler (not saying yours is, but it is very possible).

2005-03-07 05:42:25

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Nick Piggin wrote:
> Willy Tarreau wrote:


>> thousands of sockets). I never had enough time to investigate more, so I
>> went back to 2.4.
>>
>
> I have heard other complaints about this, and they are definitely
> related to the scheduler (not saying yours is, but it is very possible).
>

Oh, and if you could dig this thing up too, that might be
good: someone else may have time to investigate more.

Thanks.

2005-03-07 05:46:24

by Willy Tarreau

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Mon, Mar 07, 2005 at 04:42:10PM +1100, Nick Piggin wrote:
> Nick Piggin wrote:
> >Willy Tarreau wrote:
>
>
> >>thousands of sockets). I never had enough time to investigate more, so I
> >>went back to 2.4.
> >>
> >
> >I have heard other complaints about this, and they are definitely
> >related to the scheduler (not saying yours is, but it is very possible).
> >
>
> Oh, and if you could dig this thing up too, that might be
> good: someone else may have time to investigate more.

I would love to, since my major concern with 2.6 has always been the
scheduler (but I'm not telling you anything you don't already know). At the
moment I really don't have time for this; I promised that I would send a full
reproducible report, but it takes a lot of time.

Cheers,
Willy

2005-03-07 09:23:51

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Nick Piggin wrote:
> Ben Greear wrote:
>
>> Christian Schmid wrote:
>>
>>> Ben Greear wrote:
>
>
>>>> How many bytes are you sending with each call to write()/sendto()
>>>> whatever?
>>>
>>>
>>>
>>> I am using sendfile-call every 100 ms per socket with the poll-api.
>>> So basically around 40 kb per round.
>>
>>
>>
>> My application is single-threaded, uses non-blocking IO, and
>> sends/rcvs from/to memory.
>> It will be a good test of the TCP stack, but will not use the sendfile
>> logic,
>> nor will it touch the HD.
>>
>
> I think you would have better luck in reproducing this problem if you
> did the full sendfile thing.
>
> I think it is becoming disk bound due to page reclaim problems, which
> is causing the slowdown.
>
> In that case, writing the network only test would help to confirm the
> problem is not a networking one - so not useless by any means.

It's not trivial to write something like this :)

I'll be using something I already have. If I can't reproduce the problem,
then perhaps it is due to sendfile and someone can write a customized
test. The main reason I offered is because people are ignoring the
bug report for the most part and asking for a test case. I may be able
to offer an independent verification of the problem which might convince
someone to write up a dedicated test case...

Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-07 09:31:19

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Nick Piggin wrote:
>
>> Ben Greear wrote:
>>

>> In that case, writing the network only test would help to confirm the
>> problem is not a networking one - so not useless by any means.
>
>
> It's not trivial to write something like this :)
>
> I'll be using something I already have. If I can't reproduce the problem,
> then perhaps it is due to sendfile and someone can write a customized
> test. The main reason I offered is because people are ignoring the
> bug report for the most part and asking for a test case. I may be able
> to offer an independent verification of the problem which might convince
> someone to write up a dedicated test case...
>

OK, no that sounds good, please do make the test case.

I have actually been following up with Christian regarding
the disk IO / memory management side of things but the thread
has gone offline for some reason :\

2005-03-07 14:35:51

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
>> Its a full-duplex. Its a download-service with 3000 downloaders all
>> over the world.
>
>
> So actually it's really mostly one-way traffic, ie in the download
> direction.
> Anything significant at all going upstream, other than ACKs, etc?

Not much. See the graph. The red is the downstream ;)

>> Yes. send-buffer to 64 kbytes and receive buffer to 16 kbytes.
>
>
> With regard to this note in the 'man 7 socket' man page:
>
> NOTES
> Linux assumes that half of the send/receive buffer is used for
> internal kernel struc-
> tures; thus the sysctls are twice what can be observed on the wire.
>
> What value are you using for the sockopt call?

First I used 64 * 1024, but some months ago I checked with getsockopt and realized that it always
gives back twice the value. So now I just use 64 * 512 ;)
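
A small illustration of that doubling (assumed, not from this thread): on Linux the kernel doubles the SO_SNDBUF value passed to setsockopt() to leave room for its own bookkeeping, and getsockopt() reports the doubled value, so asking for 64 * 512 = 32768 reads back as 65536.

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int asked = 64 * 512;
    socklen_t len = sizeof(asked);

    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &asked, sizeof(asked));
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &asked, &len);
    printf("effective SO_SNDBUF: %d\n", asked);   /* typically 65536 here */
    return 0;
}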

Chris

2005-03-07 23:43:55

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

I started trying to reproduce this, and hit a bug in either
my code or perhaps the tcp stack.

I have a control TCP socket on machine A connected to machine B.

Currently, server A is stuck spinning trying very hard to send commands to
server B. The interesting thing is that netstat shows the SendQ to have
data on both machines (they are trying to send to each other on the same
socket connection), but the receive queues are empty on both machines as well:

machine A:
FC3 x86-64, kernel: 2.6.10-1.766_FC3smp, dual Opteron, 2GB RAM, SMP kernel

netstat:
tcp 0 93440 192.168.1.5:57228 192.168.1.165:4002 ESTABLISHED

Strace of this server:
socketcall(0x9, 0xffffb780) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({42949672960000000, 597879105668495392}, NULL) = 0
gettimeofday({2058282582467209, 597879105668495392}, NULL) = 0
gettimeofday({2058737849000585, 597879101513232728}, NULL) = 0
write(3, "1110237833479: iohandler.cc 383"..., 103) = 103
socketcall(0x9, 0xffffb780) = -1 EAGAIN (Resource temporarily unavailable)
.....



machine B:

2.6.11 + my patches, dual xeon, SMP kernel, 1GB RAM

netstat:
tcp 0 202940 192.168.1.165:4002 192.168.1.5:57228 ESTABLISHED

# Machine B is not trying to send so much stuff to A, so it is not busy-spinning,
# at least it won't until it finally fills up its 8MB user-space send buffer.

Any ideas??

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-08 06:31:41

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Nick Piggin wrote:
> Ben Greear wrote:
>
>> Nick Piggin wrote:
>>
>>> Ben Greear wrote:
>>>
>
>>> In that case, writing the network only test would help to confirm the
>>> problem is not a networking one - so not useless by any means.
>>
>>
>>
>> It's not trivial to write something like this :)
>>
>> I'll be using something I already have. If I can't reproduce the
>> problem,
>> then perhaps it is due to sendfile and someone can write a customized
>> test. The main reason I offered is because people are ignoring the
>> bug report for the most part and asking for a test case. I may be able
>> to offer an independent verification of the problem which might convince
>> someone to write up a dedicated test case...
>>
>
> OK, no that sounds good, please do make the test case.
>
> I have actually been following up with Christian regarding
> the disk IO / memory management side of things but the thread
> has gone offline for some reason :\

Initial test setup: two machines, running connections between them.
Mostly asymmetric (about 50Mbps in one direction,
GigE in the other). Each connection is trying some random rate between 128kbps
and 3Mbps in one direction, and 1kbps in the other direction.

Sending machine is dual 3.0GHz Xeons, 1MB cache, HT, and EM64T (running 32-bit
kernel & user space though). 1GB of RAM.

Receiving machine is dual 2.8GHz Xeons, 512KB cache, HT, 32-bit. 2GB of RAM
(but only about 850MB of low memory of course... saw the thing OOM kill me with 1GB of
free high memory :( )


Zero latency:

2000 TCP connections: When I first start, I see errors indicating I'm out of low
memory... but it quickly recovers. Probably because my program takes a small
bit of time before it starts reading the sockets.
986Mbps of ethernet traffic (counting all ethernet headers)

3000 TCP connections: Same memory issue
986Mbps of ethernet traffic, about 82kpps

4000 TCP connections: Had to drop max_backlog to 5000 from 10000 to keep
the machine from going OOM and killing my traffic generator (on
the receiving side).
986Mbps of ethernet traffic

I will work on some numbers with latency tomorrow (had to stop and
re-write some of my code to better handle managing the 8000 endpoints
that 4000 connections require!)

I think we can assume that the problem is either related to latency,
or sendfile, since 4000 connections with no latency rocks along just
fine...

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-08 16:41:37

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> Initial test setup: two machines, running connections between them.
> Mostly asymetric (about 50Mbps in one direction,
> GigE in the other). Each connection is trying some random rate between
> 128kbps
> and 3Mbps in one direction, and 1kbps in the other direction.
>
> Sending machine is dual 3.0Ghz xeons, 1MB cache, HT, and emt64 (running
> 32-bit
> kernel & user space though). 1GB of RAM
>
> Receiving machine is dual 2.8Ghz xeons, 512 MB cache, HT, 32-bit. 2GB
> of RAM
> (but only 850Mbps of low memory of course...saw the thing OOM kill me
> with 1GB of
> free high memory :( )
>
>
> Zero latency:
>
> 2000 TCP connections: When I first start, I see errors indicating I'm
> out of low
> memory..but it quickly recovers. Probably because my program
> takes a small
> bit of time before it starts reading the sockets.
> 986Mbps of ethernet traffic (counting all ethernet headers)
>
> 3000 TCP connections: Same memory issue
> 986Mbps of ethernet traffic, about 82kpps
>
> 4000 TCP connections: Had to drop max_backlog to 5000 from 10000 to keep
> the machine from going OOM and killing my traffic generator (on
> the receiving side).
> 986Mbps of ethernet traffic
>
> I will work on some numbers with latency tomorrow (had to stop and
> re-write some of my code to better handle managing the 8000 endpoints
> that 4000 connections requires!)
>
> I think we can assume that the problem is either related to latency,
> or sendfile, since 4000 connections with no latency rocks along just
> fine...

Hmmmm.... can you try the following just to exclude some theories:

Run it with 4000 sockets and then do the following on the server-machine:

dd if=/dev/zero of=file1 bs=1M count=1024
dd if=/dev/zero of=file2 bs=1M count=1024
dd if=/dev/zero of=file3 bs=1M count=1024
cat file1 > /dev/zero & cat file2 > /dev/zero & cat file3 > /dev/zero &

I THINK it might have something to do with cache pressure or so. See if there is a slow-down in
the sending when the page cache gets full and has to be reclaimed again.

You are running 2.6.11?

Chris

2005-03-09 23:59:02

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> Yes, 2.6.11. I have tuned max_backlog and some other TCP and networking
> related settings to give more buffers etc to networking tasks. I have not
> tried any significant disk-IO while doing these tests.
>
> I finally got my systems set up so I can run my WAN emulator at full 1Gbps:
>
> I am getting right at 986Mbps throughput with 30ms round-trip latency
> (15ms in both directions).
>
> So, latency does not seem to be the problem either.
>
> I think the problem can be narrowed down to:
>
> 1) Non-optimal kernel network tunings on your server.

I used all the default-settings on 2.6.11

> 2) Disk-IO (my disk is small and slow compared to a 'real' server, not
> sure I can
> really test this side of things, and I have not tried as of yet.)

This doesn't explain the speed-up when I change lower_zone_protection from 0 to 1024. It also doesn't
explain the slowdown on 2.6.11 compared to 2.6.10.

> 3) Your clients have much more latency and/or don't have enough bandwidth
> to fully load your server. Since you didn't answer before: I
> assume you
> do not have a reliable test bed and are just hoping that enough
> clients connect
> to do your benchmarking.

Yes, I just wait until they connect. On the graph it only takes about 2 minutes until 3000 sockets
are created again.

> 4) There is something strange with sendfile and/or your application's
> coding.

I am not doing more than calling sendfile. There is nothing one can do wrong.

> My suggestion would be to eliminate these variables by coming up with a
> repeatable
> test bed, alternative traffic generators, WAN/Network emulators for
> latency, etc.

The problem still is that 1) it speeds up immediately when lower_zone_protection is raised to 1024,
which proves it is NOT a disk bottleneck, and 2) it got much worse with 2.6.11, and
lower_zone_protection disappeared in 2.6.11.

Chris

2005-03-10 00:44:40

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> So, maybe a VM problem? That would be a good place to focus since
> I think we can be fairly certain it isn't a problem in just the
> networking code. Otherwise, my tests would show lower bandwidth.

Thanks to your tests I am really sure that it's not a network-code problem anymore. But what I THINK it
is: the network is allocating buffers dynamically, and if the VM doesn't provide those buffers fast
enough, it locks up as well. Addendum: if I throttle to 100 MBit it doesn't slow down even with 5000
sockets. What do you think? I think it's about having to free cache faster than is possible. But
then, why is CPU still at 30%? Might there be some limit per cycle? For example, if that "cleaner"
wakes up every 10 ms and cleans at most XXXXX pages, it would explain an artificial limit.
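
Back-of-the-envelope version of that theory (purely hypothetical numbers, just to show the shape of the limit): if some reclaim pass ran every 10 ms and could free at most N pages per pass, page turnover would be capped at N * 100 * 4 KB per second no matter how much CPU is idle. With N around 200 that cap lands near 80 MB/sec, i.e. in the ballpark of the ceiling described above.

#include <stdio.h>

int main(void)
{
    const int passes_per_sec = 100;   /* one pass every 10 ms (assumed)   */
    const int pages_per_pass = 200;   /* hypothetical per-pass page limit */
    const int page_size_kb   = 4;

    long cap_kb = (long)passes_per_sec * pages_per_pass * page_size_kb;
    printf("reclaim cap: %ld KB/sec (~%ld MB/sec)\n", cap_kb, cap_kb / 1024);
    return 0;
}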

Chris

2005-03-10 00:44:48

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
>> Yes, 2.6.11. I have tuned max_backlog and some other TCP and networking
>> related settings to give more buffers etc to networking tasks. I have
>> not
>> tried any significant disk-IO while doing these tests.
>>
>> I finally got my systems set up so I can run my WAN emulator at full
>> 1Gbps:
>>
>> I am getting right at 986Mbps throughput with 30ms round-trip latency
>> (15ms in both directions).
>>
>> So, latency does not seem to be the problem either.
>>
>> I think the problem can be narrowed down to:
>>
>> 1) Non-optimal kernel network tunings on your server.
>
>
> I used all the default-settings on 2.6.11

Here are my settings. Hopefully it will be clear what I'm
talking about... yell if you need details. Please note that I explicitly
set the send buffers to 128k and the rcv to 16k in my test so the min and max
socket queue lengths do not matter here.

my $dflt_tx_queue_len = 2000; # Ethernet driver transmit-queue length. Might be worth making
# it bigger for GigE nics.

my $netdev_max_backlog = 5000; # Maximum number of packets, queued on the INPUT side, when
# the interface receives pkts faster than it can process them.

my $wmem_max = 4096000; # Write memory buffer. This is probably fine for any setup,
# and could be smaller (256000) for < 5Mbps connections.

my $wmem_default = 128000; # Write memory buffer. This is probably fine for any setup,
# and could be smaller (256000) for < 5Mbps connections.

my $rmem_max = 8096000; # Receive memory (packet) buffer. If you are running
# lots of very fast traffic,
# you may want to make this larger if you are running over
# fast, high-latency networks.
# For < 5Mbps of traffic, 512000 should be fine.

my $rmem_default = 128000; # Receive memory (packet) buffer.


# If this is not 1, then the tcp_* settings below will not be applied.
my $modify_tcp_settings = 1;

# See the kernel documentation: Documentation/networking/ip-sysctl.txt
my $tcp_rmem_min = 4096;
my $tcp_rmem_default = 256000; # TCP specific receive memory pool size.
my $tcp_rmem_max = 30000000; # TCP specific receive memory pool size.

my $tcp_wmem_min = 4096;
my $tcp_wmem_default = 256000; # TCP specific write memory pool size.
my $tcp_wmem_max = 30000000; # TCP specific write memory pool size.

my $tcp_mem_lo = 20000000; # Below here there is no memory pressure.
my $tcp_mem_pressure = 30000000; # Can use up to 30MB for TCP buffers.
my $tcp_mem_high = 60000000; # Can use up to 60MB for TCP buffers.
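
For reference, a sketch of where values like these land under /proc (assumed standard 2.6 paths; this is not Ben's actual script). The net.core values take a single number, the tcp_rmem/tcp_wmem/tcp_mem sysctls take a "min default max" triple, and the transmit queue length is per-interface (e.g. set with ifconfig's txqueuelen) rather than a sysctl:

#include <stdio.h>

static void set_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");            /* needs root */
    if (f) { fputs(value, f); fclose(f); }
}

int main(void)
{
    set_sysctl("/proc/sys/net/core/netdev_max_backlog", "5000");
    set_sysctl("/proc/sys/net/core/wmem_max", "4096000");
    set_sysctl("/proc/sys/net/core/wmem_default", "128000");
    set_sysctl("/proc/sys/net/core/rmem_max", "8096000");
    set_sysctl("/proc/sys/net/core/rmem_default", "128000");
    set_sysctl("/proc/sys/net/ipv4/tcp_wmem", "4096 256000 30000000");
    set_sysctl("/proc/sys/net/ipv4/tcp_rmem", "4096 256000 30000000");
    set_sysctl("/proc/sys/net/ipv4/tcp_mem", "20000000 30000000 60000000");
    return 0;
}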


>
>> 2) Disk-IO (my disk is small and slow compared to a 'real' server,
>> not sure I can
>> really test this side of things, and I have not tried as of yet.)
>
>
> This doesnt explain the speed-up when I change lower_zone_protection
> from 0 to 1024. It also doesnt explain the slowdown on 2.6.11 compared
> to 2.6.10

Disk-IO uses buffers, so a change here could easily starve the rest
of your system. I'm just saying I can't reliably test this. To be honest,
my machines are already throwing allocation failures in the ethernet drivers
and I've had the OOM killer kill my main process several times. So, my machines
are running right at their memory limit, even w/out any disk IO.

>> 3) Your clients have much more latency and/or don't have enough
>> bandwidth
>> to fully load your server. Since you didn't answer before: I
>> assume you
>> do not have a reliable test bed and are just hoping that enough
>> clients connect
>> to do your benchmarking.
>
>
> Yes I just wait until they connect. On the graph it only takes about 2
> minutes until 3000 sockets are created again.

But, you could get unlucky and have 3000 people on a shitty dialup
connection connect to you. That does not make it easy to reliably
test the system.


>> 4) There is something strange with sendfile and/or your application's
>> coding.
>
>
> I am not doing more than calling sendfile. There is nothing one can do
> wrong.
>
>> My suggestion would be to eliminate these variables by coming up with
>> a repeatable
>> test bed, alternative traffic generators, WAN/Network emulators for
>> latency, etc.
>
>
> The problem still is that 1) it speeds up immediately when
> lower_zone_protection is raised to 1024. This proves it is NOT a
> disk-bottleneck. And second: it got much worse with 2.6.11 and
> lower_zone_protection disappeared on 2.6.11

So, maybe a VM problem? That would be a good place to focus since
I think we can be fairly certain it isn't a problem in just the
networking code. Otherwise, my tests would show lower bandwidth.

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-10 04:29:54

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:

> Hmmmm.... can you try to following just to exclude some theories:
>
> Run it with 4000 sockets and then do the following on the server-machine:
>
> dd if=/dev/zero of=file1 bs=1M count=1024
> dd if=/dev/zero of=file2 bs=1M count=1024
> dd if=/dev/zero of=file3 bs=1M count=1024
> cat file1 > /dev/zero & cat file2 > /dev/zero & cat file3 > /dev/zero &
>
> I THINK it might have something to do with caching-pressure or so. See
> if there is a slow-down on the sending if the page-cache gets full and
> has to be cleared again.
>
> You are running 2.6.11?

Yes, 2.6.11. I have tuned max_backlog and some other TCP and networking
related settings to give more buffers etc to networking tasks. I have not
tried any significant disk-IO while doing these tests.

I finally got my systems set up so I can run my WAN emulator at full 1Gbps:

I am getting right at 986Mbps throughput with 30ms round-trip latency
(15ms in both directions).

So, latency does not seem to be the problem either.

I think the problem can be narrowed down to:

1) Non-optimal kernel network tunings on your server.
2) Disk-IO (my disk is small and slow compared to a 'real' server, not sure I can
really test this side of things, and I have not tried as of yet.)
3) Your clients have much more latency and/or don't have enough bandwidth
to fully load your server. Since you didn't answer before: I assume you
do not have a reliable test bed and are just hoping that enough clients connect
to do your benchmarking.
4) There is something strange with sendfile and/or your application's coding.

My suggestion would be to eliminate these variables by coming up with a repeatable
test bed, alternative traffic generators, WAN/Network emulators for latency, etc.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-10 05:24:42

by Andrew Morton

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid <[email protected]> wrote:
>
> > So, maybe a VM problem? That would be a good place to focus since
> > I think we can be fairly certain it isn't a problem in just the
> > networking code. Otherwise, my tests would show lower bandwidth.
>
> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
> enough, it locks as well.

Did anyone have a 100-liner which demonstrates this problem?

The output of `vmstat 1' when the thing starts happening would be interesting.

2005-03-10 09:01:01

by Andi Kleen

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andrew Morton <[email protected]> writes:

> Christian Schmid <[email protected]> wrote:
>>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.

If he had a lot of RX traffic (it is hard to figure out because his
bug reports are more or less useless and mostly consist of rants):
The packets are allocated with GFP_ATOMIC and a lot of traffic
overwhelms the free memory.

Some drivers work around this by doing the RX ring refill in process
context (easier with NAPI), but not all do.

In general to solve it one has to increase /proc/sys/vm/freepages
a lot.

It would be nice though if the VM tuned itself dynamically to a lot
of GFP_ATOMIC requests. And maybe if GFP_ATOMIC was a bit more aggressive
and did some simple-minded reclaiming, that would be helpful too.
E.g. there could be an "easy to free" list in the VM for clean pages
where freeing is simple enough that it could be made interrupt safe.

-Andi

2005-03-10 09:10:48

by Andrew Morton

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andi Kleen <[email protected]> wrote:
>
> In general to solve it one has to increase /proc/sys/vm/freepages
> a lot.

/proc/sys/vm/min_free_kbytes

> It would be nice though if the VM tuned itself dynamically to a lot
> of GFP_ATOMIC requests. And maybe if GFP_ATOMIC was a bit more aggressive
> and did some simple minded reclaiming that would be helpful too.
> e.g. there could be a "easy to free" list in the VM for clean pages
> where freeing is simple enough that it could be made interrupt safe.

I spose we could autotune the free memory thresholds somehow, if there is
good reason and a testcase.

Or we could run page reclaim from hard IRQ context - that could be a bit
expensive in terms of CPU consumption and latency though.

2005-03-10 09:15:03

by Andi Kleen

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Thu, Mar 10, 2005 at 01:09:55AM -0800, Andrew Morton wrote:
> Andi Kleen <[email protected]> wrote:
> >
> > In general to solve it one has to increase /proc/sys/vm/freepages
> > a lot.
>
> /proc/sys/vm/min_free_kbytes

Oh yes, I still have the old 2.2 name in my fingertips.

(never understood why these things need to be always renamed; I guess
keeping the old name would have made it too easy on administrators)

-Andi

2005-03-10 09:39:05

by Andrew Morton

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andi Kleen <[email protected]> wrote:
>
> On Thu, Mar 10, 2005 at 01:09:55AM -0800, Andrew Morton wrote:
> > Andi Kleen <[email protected]> wrote:
> > >
> > > In general to solve it one has to increase /proc/sys/vm/freepages
> > > a lot.
> >
> > /proc/sys/vm/min_free_kbytes
>
> Oh yes, I still have the old 2.2 name in my finger tips
>
> (never understood why these things need to be always renamed; I guess
> keeping the old name would have made it too easy on administrators)
>

Page sizes vary. kbytes do not. So scripts and documentation will work
correctly cross-platform, and when you change PAGE_SIZE.

2005-03-10 19:01:18

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andrew Morton wrote:
> Christian Schmid <[email protected]> wrote:
>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.

There you go. As you can see, free is rather high when lower_zone_protection is set to 1024. When I
set it to 0, free goes down and the slow-down starts when the memory is full. The slow-down goes
away after I set lower_zone_protection to 1024, but only AFTER the free memory stops rising.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 34 0 588944 9120 6925452 0 0 29 14 10 12 15 25 27 33
3 30 0 589648 9060 6924628 0 0 68176 1488 4770 6545 12 34 0 53
1 32 0 589712 9152 6924944 0 0 66152 2824 5606 6621 10 34 0 55
3 29 0 590352 8992 6924220 0 0 68260 28 4123 7809 17 36 0 47
1 33 0 601744 8684 6899096 0 0 57776 28 4015 6267 16 38 0 45
6 31 0 602960 8960 6911604 0 0 56148 124 4659 6013 17 36 0 48
7 31 0 590736 8776 6903220 0 0 56460 824 4521 5940 17 35 0 48
0 32 0 589264 9064 6923536 0 0 67376 96 5135 6918 15 34 0 51
0 33 0 590928 8912 6923620 0 0 69504 108 4604 6487 13 34 0 53
3 30 0 589008 9080 6924472 0 0 66904 72 4300 7336 14 35 0 51
1 29 0 590544 9156 6924124 0 0 67684 28 4535 7298 16 34 0 50
2 32 0 589968 9052 6923956 0 0 61000 88 4293 6898 14 35 0 51
1 29 0 591120 8876 6923384 0 0 67940 176 4455 7259 12 34 0 54
3 31 0 589520 9024 6925616 0 0 66980 20 4909 7037 12 32 0 56
4 30 0 590096 8972 6924444 0 0 63924 80 4308 6203 14 33 0 53
9 29 0 590096 8876 6915836 0 0 59860 32 4507 6268 18 36 0 47
0 28 0 588688 9120 6923276 0 0 66844 76 4280 6025 14 32 0 54
3 31 0 581072 9336 6930744 0 0 47744 20788 5307 5195 10 31 0 59
5 28 0 500432 10456 6980352 0 0 62032 20 4749 6779 22 36 0 42
8 31 0 432720 11620 7056504 0 0 64420 4 4758 6480 29 40 0 31
10 30 0 383120 12808 7129232 0 0 72844 0 5197 7040 15 30 0 55
3 30 0 313552 13840 7198716 0 0 69516 0 4479 6216 16 32 0 51
3 28 0 245016 14912 7265508 0 0 67028 192 5111 6295 15 30 0 55
5 26 0 158744 15788 7344396 0 0 78576 0 4361 5937 19 33 0 49
1 28 0 67544 16652 7428668 0 0 84284 0 4299 5252 17 32 0 51
0 32 0 16572 17244 7463436 0 0 83744 2632 4900 6126 16 35 0 49
3 22 0 24196 17616 7453476 0 0 68804 5644 5148 5522 17 34 0 49
3 25 0 24480 17960 7453812 0 0 64872 1668 5043 5568 14 33 0 53
3 26 0 20780 18328 7456300 0 0 64692 0 4854 7081 14 33 0 53
3 32 0 20640 18432 7456672 0 0 60496 28 4882 8156 16 32 0 52

SLOWDOWN START:

2 29 0 21056 18660 7454880 0 0 58432 28 4615 8019 16 33 0 51
7 30 0 24088 18904 7451780 0 0 57920 0 5571 9829 13 32 0 55
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 30 0 28340 19228 7448736 0 0 61208 216 5602 9912 12 32 0 56
2 29 0 22680 19416 7453852 0 0 62096 8 6029 10302 13 33 0 54
0 31 0 23136 19616 7454128 0 0 55776 0 5986 10845 14 33 0 53
1 32 0 25240 19136 7452364 0 0 47172 48 6609 10681 18 36 0 46
2 30 0 16988 19004 7461064 0 0 62988 2352 8611 12934 10 29 0 61
0 35 0 26432 18884 7449488 0 0 46032 12920 8253 9289 9 27 0 63
0 35 0 23636 19000 7456104 0 0 44456 164 9576 11174 8 23 0 70
3 29 0 16932 18996 7465152 0 0 60108 12 9062 13320 14 28 1 57
3 31 0 24720 18768 7458648 0 0 60744 24 9926 15865 14 27 0 59
1 33 0 24284 18724 7459848 0 0 63152 0 10028 16689 13 27 0 61
1 31 0 24856 18472 7462208 0 0 59384 0 10157 16561 12 26 1 60
0 34 0 24272 18192 7462556 0 0 60276 184 10946 18029 10 25 0 64
0 34 0 25604 18312 7461756 0 0 58244 0 10217 16344 11 27 0 62
1 29 0 20816 18416 7467296 0 0 61928 0 10796 16894 10 26 0 65
2 33 0 23388 18764 7466744 0 0 47620 0 8889 15021 19 30 0 51
0 34 0 16612 18972 7473540 0 0 54648 12 10644 16752 13 26 1 59
0 35 0 22436 19024 7469068 0 0 55192 2864 11080 17519 12 26 1 61
1 33 0 16548 19192 7474204 0 0 52756 796 11412 19072 12 26 0 62
1 35 0 21352 18904 7470140 0 0 50104 4400 11999 16810 9 23 1 68
1 33 0 24412 18824 7468452 0 0 55132 80 11441 17418 9 25 2 64
1 31 0 24384 18812 7468736 0 0 49860 128 11745 19884 10 26 3 62
5 32 0 27660 19232 7465460 0 0 47396 80 10758 16619 14 27 2 57
2 31 0 31832 19596 7461628 0 0 53480 0 11355 18973 12 27 0 61
2 33 0 31264 19792 7463268 0 0 53236 88 11552 18697 10 26 0 64
2 31 0 31596 20008 7461692 0 0 52624 136 11300 19832 15 27 0 58
2 30 0 22596 20260 7473204 0 0 51864 12 11993 21202 12 25 1 62
0 33 0 20824 20460 7475044 0 0 52488 236 12395 19848 8 24 1 67
2 30 0 21204 20508 7474520 0 0 56616 72 12258 22120 10 25 4 60
3 32 0 16720 20772 7480512 0 0 53972 40 11942 20908 13 26 3 58
3 28 0 31884 20732 7464776 0 0 47236 92 11602 19614 13 26 0 61
0 12 0 16700 20776 7479828 0 0 46936 1416 11557 18017 9 24 10 56
1 31 0 25040 20896 7474268 0 0 38788 6752 10727 14073 12 23 4 61
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
4 31 0 16748 21080 7481972 0 0 49472 8 12436 19864 10 22 11 57
0 35 0 23588 21240 7475828 0 0 58944 84 12935 22227 8 25 0 66
1 30 0 24516 21260 7475128 0 0 51456 124 12701 21756 9 25 0 66
1 32 0 23772 21332 7475600 0 0 52388 116 11749 19039 11 26 1 62
1 34 0 22460 21548 7476268 0 0 50240 224 12103 21552 10 26 1 63
4 32 0 24040 21568 7476792 0 0 47648 48 11124 19446 16 28 0 56
1 33 0 34148 21724 7466504 0 0 49376 176 11903 20681 10 25 0 65
2 30 0 26528 21812 7457576 0 0 47952 140 12054 20289 12 27 0 61
2 33 0 21932 21176 7362808 0 0 46452 152 10675 18546 20 44 0 35
1 32 0 83112 20880 7331484 0 0 31612 212 8110 13900 25 44 0 31
1 33 0 93380 21172 7408304 0 0 47200 8 11808 19313 16 29 1 54
0 33 0 44540 21620 7458652 0 0 49012 0 11591 19500 16 26 0 59
0 34 0 18808 21720 7484392 0 0 55748 72 12333 22001 14 26 0 61
2 33 0 49372 21792 7445084 0 0 51660 72 11300 19226 14 29 1 56
1 29 0 29428 21984 7474608 0 0 47012 100 11435 19647 12 26 1 61
0 34 0 48368 22056 7455700 0 0 51828 92 11687 17129 9 22 6 62
1 32 0 22856 22192 7481132 0 0 50236 56 12020 18495 7 23 10 60
1 32 0 31588 21856 7446720 0 0 42816 148 8546 14873 20 34 0 46
2 34 0 29140 21976 7475568 0 0 54300 80 12145 18508 9 24 0 67
0 31 0 56328 21916 7448088 0 0 54240 308 11338 19211 13 26 1 59
2 34 0 82520 22012 7422084 0 0 48188 2500 11085 17522 12 27 1 61
1 34 0 111192 22236 7395136 0 0 53464 148 10976 18427 14 27 2 57
0 34 0 138372 22236 7366372 0 0 54852 68 11892 20766 10 27 0 63
0 29 0 162528 21772 7343852 0 0 53272 152 11218 20261 13 30 0 57
2 32 0 182976 20816 7325088 0 0 54884 320 11411 20328 12 29 0 59
4 30 0 198720 21188 7310028 0 0 51832 64 11346 20141 12 28 0 60
2 32 0 224296 21412 7283760 0 0 57440 80 12182 21054 7 25 0 68
0 33 0 257388 21688 7230920 0 0 43468 76 8279 14542 17 36 1 46
0 32 0 283164 21948 7215088 0 0 59424 312 11404 18576 11 27 0 62
0 31 0 304912 22192 7202876 0 0 57208 332 11106 19843 12 30 1 58
5 33 0 324412 22476 7183688 0 0 54904 3372 12113 17509 10 28 0 62
0 31 0 342648 22764 7165584 0 0 52332 764 11325 18572 15 29 0 56
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 33 0 360796 22944 7147248 0 0 56184 0 11082 18153 13 28 0 59
4 32 0 389932 23160 7110380 0 0 52580 0 10303 16708 14 32 0 54
1 32 0 397500 23448 7110772 0 0 61384 704 11328 17227 11 27 0 62
0 34 0 417404 23636 7091068 0 0 64840 88 11162 17765 8 28 0 64
1 33 0 432128 23840 7075904 0 0 63924 0 11492 19978 9 29 0 63
4 30 0 445324 24096 7063000 0 0 56508 28 10265 18108 14 30 0 55
3 29 0 459872 24292 7048728 0 0 57696 28 10308 18048 15 29 0 56
4 35 0 479552 23604 7022828 0 0 61804 128 9398 15858 16 35 0 49
2 35 0 476912 23216 7033620 0 0 53160 4292 9876 15157 16 28 0 56
4 32 0 481204 23372 7027616 0 0 49936 5360 10565 14389 11 24 3 62
0 33 0 482680 23476 7026424 0 0 41828 7284 10270 13548 11 23 11 56
3 25 0 492028 23704 7018240 0 0 52380 28 11229 16098 9 24 2 64
1 29 0 493972 23852 7016800 0 0 43804 8276 10168 11859 11 20 16 53
0 15 0 499740 24152 7009836 0 0 59168 352 10580 14203 10 24 0 65
0 34 0 501368 24292 6998544 0 0 43720 16184 9313 8995 13 27 8 53
3 32 0 508156 24412 7002368 0 0 55380 16 10477 11689 9 22 1 68
1 32 0 512184 24100 6998600 0 0 59068 64 9879 13955 13 26 3 58

SLOWDOWN END:

2 33 0 514884 23976 6992604 0 0 65064 36 9387 13410 17 29 2 52
1 33 0 517184 23956 6991128 0 0 61928 4312 9447 13799 14 29 0 57
6 29 0 516760 24088 6993172 0 0 62488 48 9345 15244 15 31 0 54
2 32 0 519196 24164 6992144 0 0 58348 52 8889 14019 13 29 0 57
0 27 0 523164 24140 6987884 0 0 64236 48 9732 13547 10 27 0 63
0 32 0 526752 24280 6984548 0 0 69872 3628 9800 13413 10 27 0 63
2 30 0 531428 22372 6981968 0 0 70148 584 8541 13988 12 32 0 56
0 33 0 535408 21476 6978036 0 0 68468 28 8438 11965 13 30 0 58
4 32 0 536868 20264 6976120 0 0 61804 56 8579 12447 12 28 0 60
6 29 0 540132 19908 6976136 0 0 68492 0 8319 11009 11 29 4 56
3 30 0 541564 18328 6976900 0 0 62148 40 8110 12429 12 30 0 58
1 35 0 547948 17876 6943896 0 0 49368 4900 6412 9800 19 38 0 44
2 31 0 543504 18548 6972600 0 0 63916 8 8427 12450 16 30 0 54
0 31 0 544084 18840 6972920 0 0 66656 0 7090 10378 14 30 0 56
0 32 0 543832 18824 6972528 0 0 63120 28 7035 10658 18 31 0 51
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 30 0 544664 19044 6972648 0 0 66036 28 7686 12178 14 32 0 54
3 32 0 544600 19032 6971708 0 0 73316 1868 8394 10431 11 30 0 59
5 26 0 544344 19184 6972508 0 0 63624 20 6577 11294 16 33 0 50
4 30 0 543580 19324 6972368 0 0 74900 20 6303 10768 18 35 0 47
1 32 0 545632 18504 6971488 0 0 75020 0 6737 10601 14 35 0 52
2 32 0 546848 18580 6971276 0 0 67288 0 6572 9537 13 33 0 55
4 33 0 543584 18640 6969924 0 0 61624 9936 6660 9273 13 34 0 53
2 28 0 544736 18244 6970116 0 0 67992 2708 8948 8232 6 29 0 65
2 30 0 546020 18640 6970400 0 0 64668 16 6114 8186 19 33 0 48
2 30 0 546788 18636 6970812 0 0 69240 8 6064 9482 17 34 0 49
5 32 0 544292 18352 6969804 0 0 74688 16 5957 9517 18 34 0 47
4 30 0 547524 18420 6969464 0 0 69756 276 6042 8357 12 33 0 56
6 29 0 547172 18240 6970188 0 0 70352 4 5210 8836 20 35 0 45
3 34 0 548836 18276 6971172 0 0 69536 0 5356 8443 14 35 0 52
6 28 0 548132 18228 6970268 0 0 70156 0 5428 9326 16 35 0 49
14 32 0 550672 18260 6940452 0 0 64460 12 4352 8160 19 41 0 41
10 31 0 601424 17488 6858060 0 0 62296 296 4154 6203 22 47 0 31
2 31 0 578960 17024 6851724 0 0 49492 2608 4892 6413 25 41 0 35
3 30 0 636752 17156 6881784 0 0 65036 1656 5695 7811 14 36 0 51
4 29 0 572068 17856 6944868 0 0 62252 2412 4594 6801 19 35 0 47
4 27 0 549092 17948 6967216 0 0 74172 36 5121 7671 16 35 0 50
2 34 0 550948 17980 6965552 0 0 73168 100 4792 7743 15 35 0 50
1 30 0 551736 17880 6906628 0 0 64548 208 4575 7705 20 43 0 37
5 33 0 548824 17732 6871484 0 0 61272 2844 3969 6323 22 43 0 35
0 30 0 545432 17980 6891432 0 0 67960 24 4655 6606 22 38 0 41
2 33 0 577176 18008 6932612 0 0 70592 4 4881 7156 13 36 0 51
1 29 0 544784 18328 6965068 0 0 70356 16 4246 6638 15 34 0 51
6 28 0 542672 18312 6964812 0 0 61576 4288 4333 6364 13 35 0 52
1 32 0 545744 18208 6965052 0 0 67540 24 4870 7232 18 34 0 48
2 29 0 546384 18168 6965500 0 0 68172 4 4588 7651 14 36 0 51
4 28 0 545296 18236 6964344 0 0 67020 0 4207 6546 15 35 0 51
3 30 0 544720 18536 6962548 0 0 74644 12 4417 6375 17 34 0 49
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 35 0 538448 18744 6964720 0 0 78184 248 4145 5987 15 35 0 50
6 25 0 538960 18284 6966540 0 0 71532 24 3885 7001 17 35 0 48
4 29 0 543280 17416 6955780 0 0 60540 20 4040 6582 20 38 0 42
5 28 0 540528 17324 6962332 0 0 79480 12 4147 5974 13 36 0 52
3 33 0 538672 17536 6960964 0 0 67944 8 4066 5822 12 34 0 53
1 33 0 535220 17696 6963796 0 0 77724 1752 4156 6577 15 35 0 51
3 32 0 538548 17800 6964032 0 0 65540 0 3890 6294 17 36 0 47
0 29 0 538740 17688 6964280 0 0 60444 12 3860 6419 21 37 0 42
5 29 0 536244 16988 6964640 0 0 67836 8 4336 5963 15 35 0 51
6 28 0 534964 17092 6965148 0 0 68276 0 3934 5697 15 35 0 50
2 30 0 534132 17184 6963492 0 0 62960 412 3883 5663 16 35 0 49
3 26 0 534900 17368 6963036 0 0 61040 8 3685 7448 24 36 0 40

2005-03-10 19:16:34

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andi Kleen wrote:

> If he had a lot of RX traffic (it is hard to figure out because his
> bug reports are more or less useless and mostly consists of rants):
> The packets are allocated with GFP_ATOMIC and a lot of traffic
> overwhelms the free memory.
>
> Some drivers work around this by doing the RX ring refill in process
> context (easier with NAPI), but not all do.

I think his traffic is mostly 'send' from his server's perspective.

He's reading from disk with sendfile too, I believe, so maybe that
would be consuming lots of pages of memory?

However, in my case, I would definitely welcome something that auto-tuned
the VM to give me lots and lots of GFP_ATOMIC pages. As it is now, I
end up setting the /proc/sys/vm/freepages much higher. Since it appears
the name has changed and I didn't notice, I guess my script to set
this has not actually been doing anything useful in the 2.6 kernel series :P

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-10 19:16:31

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Attached is an image so you can see what's happening. One pixel is 2 seconds. You can see a small
speed-up before the slow-down; this is where I changed lower_zone_protection from 1024 to 0. So it
seems it's speeding up until the memory is full. Then it drastically slows down until I set it to
1024 again. Then it goes up slowly but not linearly (interestingly, smoothly) until it peaks again at 82
MB/sec.

PS: 82 MB/sec is not our bandwidth limit. It still peaks there. Don't know why. Certainly not the
drives; they work up to 200 MB/sec (10 drives there).

Chris


Andrew Morton wrote:
> Christian Schmid <[email protected]> wrote:
>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.
>
>


Attachments:
traffic8.png (1.48 kB)

2005-03-11 15:29:46

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

OHGAWD I GOT IT!!!!!!!!

I admit, totally coincidentally, but it's really FIXED. Today I went to the puter scanning the
servers by routine and wondered why the bandwidth is at 100% without any holes.

The only thing I have done is I switched off hyper-threading, because the server is at only 20% CPU
anyway, so I just disabled it.

So it's something with Linux dealing with hyper-threading. YAY :)

Andrew Morton wrote:
> Christian Schmid <[email protected]> wrote:
>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.
>
>

2005-03-11 19:13:40

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
> OHGAWD I GOT IT!!!!!!!!
>
> I admit, totally coincidentially but its really FIXED. Today I went to
> the puter scanning the servers by routine and wondered why the bandwidth
> is at 100% without any holes.
>
> The only thing I have done is I switched off hyper-threading because the
> server is at only 20% CPU anyway so I just disabled it.
>
> So its something with linux dealing with hyper-threading. YAY :)

For what it's worth, I was running dual-Xeon systems with HT turned on.

But I have a single-process, single-threaded application, so there is not much
scheduling to be done. If you have a large number of threads or processes,
it would make more sense for turning off HT to have an effect.

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-11 19:31:46

by Christian Schmid

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Christian Schmid wrote:
>
>> OHGAWD I GOT IT!!!!!!!!
>>
>> I admit, totally coincidentially but its really FIXED. Today I went to
>> the puter scanning the servers by routine and wondered why the
>> bandwidth is at 100% without any holes.
>>
>> The only thing I have done is I switched off hyper-threading because
>> the server is at only 20% CPU anyway so I just disabled it.
>>
>> So its something with linux dealing with hyper-threading. YAY :)
>
>
> For what it's worth, I was running dual-xeon systems with HT turned on.
>
> But, I have a single process, single-threaded application, so there is
> not much
> scheduling to be done. If you have a large number of threads or processes,
> then it would make more sense for turning off HT to have an affect.

This effect appeared on 1 task and on 200 tasks. I don't know what it is, but with HT off it doesn't
appear anymore. The slow-down still appears when lower_zone_protection is set to 0, but the peak at
80 MB disappeared when set to 1024. I am now running at 95 MB/sec smoothly.

Chris

2005-03-14 04:40:27

by Nick Piggin

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Fri, 2005-03-11 at 20:27 +0100, Christian Schmid wrote:
> Ben Greear wrote:
> >
> > For what it's worth, I was running dual-xeon systems with HT turned on.
> >
> > But, I have a single process, single-threaded application, so there is
> > not much
> > scheduling to be done. If you have a large number of threads or processes,
> > then it would make more sense for turning off HT to have an affect.
>
> This effect appeared on 1 task and on 200 tasks. I dont know what it is, but with HT off it doesnt
> appear anymore. The slow-down still appears when lower_zone_protection is set to 0 but the peak at
> 80 MB disappeared when set to 1024. I am now running at 95 MB/Sec smoothly.
>

OK well that is a good result for you. Thanks for sticking with it.
Unfortunately you'll probably not want to test any patches on your
production system, so the cause of the problem will be difficult to
fix.

I am working on patches which improve HT performance in some
situations though, so with luck they will cure your problems too.
Basically I think SMP "balancing" is too aggressive - and this may
explain why 2.6.10 was worse for you: it had patches to *increase*
the aggressiveness of balancing.

The other thing that worries me is your need for lower_zone_protection.
I think this may be due to unbalanced highmem vs lowmem reclaim. It
would be interesting to know if those patches I sent you improve this.
They certainly improve reclaim balancing for me... but again I guess
you'll be reluctant to do much experimentation :\

Thanks,
Nick



2005-03-14 04:54:07

by Christian Schmid

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

>>This effect appeared on 1 task and on 200 tasks. I dont know what it is, but with HT off it doesnt
>>appear anymore. The slow-down still appears when lower_zone_protection is set to 0 but the peak at
>>80 MB disappeared when set to 1024. I am now running at 95 MB/Sec smoothly.
>>
>
> OK well that is a good result for you. Thanks for sticking with it.
> Unfortunately you'll probably not want to test any patches on your
> production system, so the cause of the problem will be difficult to
> fix.
>
> I am working on patches which improve HT performance in some
> situations though, so with luck they will cure your problems too.
> Basically I think SMP "balancing" is too aggressive - and this may
> explain why 2.6.10 was worse for you, it had patches to *increase*
> the aggressiveness of balancing.
>
> The other thing that worries me is your need for lower_zone_protection.
> I think this may be due to unbalanced highmem vs lowmem reclaim. It
> would be interesting to know if those patches I sent you improve this.
> They certainly improve reclaim balancing for me... but again I guess
> you'll be reluctant to do much experimentation :\

I have tested your patch and unfortunately on 2.6.11 it didn't change anything :( I reported this
before, or do you mean something else? I am of course willing to test patches, as I do not want to
stick with 2.6.10 forever.

2005-03-14 05:08:34

by Nick Piggin

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Mon, 2005-03-14 at 05:53 +0100, Christian Schmid wrote:

> > The other thing that worries me is your need for lower_zone_protection.
> > I think this may be due to unbalanced highmem vs lowmem reclaim. It
> > would be interesting to know if those patches I sent you improve this.
> > They certainly improve reclaim balancing for me... but again I guess
> > you'll be reluctant to do much experimentation :\
>
> I have tested your patch and unfortunately on 2.6.11 it didnt change anything :( I reported this
> before, or do you mean something else? I am of course willing to test patches as I do not want to
> stick with 2.6.10 forever.

Well I hope that scheduler developments in progress will put future
kernels at least on par with 2.6.10 again (and hopefully better).

Yes you did report that my patch didn't help 2.6.11, but could those
results have been influenced by the suboptimal HT scheduling? If so,
I was interested in the results with HT turned off.

Nick



2005-05-28 03:19:06

by Christian Schmid

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Hi.

I want to give the newest report on the vm-lock problem. It seems the problem is getting less
critical with every new release. I am currently using 2.6.12-rc5. The problem with the massive vm-lock
appears, as always, once 3500 sockets are reached, as reported in earlier mails. The problem suddenly
disappears when I set lowmem_reserve_ratio to "1 1" AND min_free_kbytes to 1024000. It only starts
to appear again when reaching around 7000 sockets. -rc3, for example, slowed down again at 4500 sockets.
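
For anyone wanting to reproduce this, a minimal sketch that applies the two settings together and
reads them back; the values are the ones from this report, not general recommendations:

settings = {
    "/proc/sys/vm/lowmem_reserve_ratio": "1 1",
    "/proc/sys/vm/min_free_kbytes": "1024000",
}
for path, value in settings.items():
    with open(path, "w") as f:     # needs root
        f.write(value + "\n")
    with open(path) as f:          # read back to confirm the kernel accepted it
        print(path, "=", f.read().strip())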

I am very sure it's a vm-lock because, for example, reading /proc/sys/vm/lowmem_reserve_ratio takes no
time with < 3500 sockets. While testing with 7000 sockets, I had to wait 20-30 seconds until the
"file" was opened.

Any suggestions? Dual Xeon 3.6 GHz with 8 GB RAM.

Nick Piggin wrote:
> On Mon, 2005-03-14 at 05:53 +0100, Christian Schmid wrote:
>
>
>>>The other thing that worries me is your need for lower_zone_protection.
>>>I think this may be due to unbalanced highmem vs lowmem reclaim. It
>>>would be interesting to know if those patches I sent you improve this.
>>>They certainly improve reclaim balancing for me... but again I guess
>>>you'll be reluctant to do much experimentation :\
>>
>>I have tested your patch and unfortunately on 2.6.11 it didnt change anything :( I reported this
>>before, or do you mean something else? I am of course willing to test patches as I do not want to
>>stick with 2.6.10 forever.
>
>
> Well I hope that scheduler developments in progress will put future
> kernels at least on par with 2.6.10 again (and hopefully better).
>
> Yes you did report that my patch didn't help 2.6.11, but could those
> results have been influenced by the suboptimal HT scheduling? If so,
> I was interested in the results with HT turned off.
>
> Nick
>