2005-03-05 17:20:03

by Christian Schmid

Subject: BUG: Slowdown on 3000 socket-machines tracked down

Hello.

After weeks of work, I can now give a detailed report about the bug and when it appears:

Attached is another traffic image. This one is with 2.6.10 and a 3/1 split, preemptive kernel, so all
defaults.

The first part is where I throttled the whole thing to 100 MBit in order to build up a traffic-jam ;)

When I released it, it jumped up immediately, but then it suddenly goes down (each pixel is one second).
Playing around with min_free_kbytes didn't help. Where it goes up again I set lower_zone_protection
to 1024000, where it goes down I set it to 0 again, and where it goes up the last time... guess..

This test was with 3500 sockets.

Today I tested with 5000 sockets. The problem is the same as above, but as more sockets come in, it
just doesn't claim more bandwidth as it of course SHOULD. It seems it doesn't slow down, it just
doesn't scale anymore. The bandwidth doesn't go over 80 MB/sec, no matter what I do. Then I did the
following: I raised lower_zone_protection to 1024 (above I did 1024000, which is bullshit, but it
doesn't matter as it seems to just protect the whole low mem, which is what I want) and it was at
80 MB. Then I lowered it to 0 again and suddenly it peaked up to full bandwidth (100 MB) for about 5
seconds until the whole protected area was in use. Then it slowed down drastically again.

My theory:

I suppose when the blocks come in fast enough and the load is high enough, the kernel can't free the
required low memory as fast as would be needed to NOT slow everything down. So the VM is basically
busy freeing low memory. What do you think? The interesting part is that it slows down painfully
with lower_zone_protection set to 0, it peaks at 80 MB/sec with lower_zone_protection set to max
(1024, the whole low mem), and there are plenty of CPU resources free..... When set to 0 it speeds
up without limit AS LONG AS there is memory left. When that is consumed, it slows down painfully
again, because it's set to 0 of course.

Chris


Attachments:
traffic2.png (2.45 kB)

2005-03-07 00:45:35

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:

> Today I tested with 5000 sockets. The problem is the same like above but
> the more sockets there come, it just doesnt claim more bandwidth as it
> SHOULD of course do. It seems it doesn't slow down but it just doesnt
> scale anymore. The badwidth doesnt go over 80 MB/Sec, no matter what I
> do. Then I did the following: I raised lower_zone_protection to 1024
> (above I did 1024000 which is bullshit but it doesnt matter as it seems
> to just protect the whole low-mem which is what I want) and it was at 80
> MB. then I lowered to 0 again and suddenly it peaked up to full
> bandwidth (100 MB) for about 5 seconds until the whole protected area
> was in use. Then it slowed down drastically again.

This confirms my suspicion that lowmem / highmem scanning is not
properly balanced. When you raise lower_zone_protection a great
deal, lowmem is no longer used for pagecache, and your problem
goes away.

I gave you a patch to try for this - unfortunately I can't make
much more progress than that if I don't have a test case and you
can't test patches :\

Nick

2005-03-07 01:14:56

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
> Hello.
>
> After weeks of work, I can now give a detailed report about the bug and
> when it appears:
>
> Attached is another traffic-image. This one is with 2.6.10 and a 3/1
> split, preemtive kernel, so all defaults.

What are the units on your graph? You say "MB" in several places, but
do you mean Mb (i.e., megabit) instead?

I have a tool that can also generate TCP traffic on a large number of
sockets. If I can understand what you are trying to do, I may be able
to reproduce the problem. My biggest machine at present has only
2GB of RAM, however...not sure if that matters or not.

Are you sending traffic in only one direction, or more of a full-duplex
configuration? Is each socket running the same bandwidth? What is this
bandwidth? Are you setting the send & rcv buffers in the socket creation
code? (To what values if so?) How many bytes are you sending with each
call to write()/sendto() whatever?

Is there any significant latency between your sender and receiver machine?
If so, how much?

What is the physical transport...GigE? 1500 MTU?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-07 01:58:44

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Christian Schmid wrote:
>
>> Hello.
>>
>> After weeks of work, I can now give a detailed report about the bug
>> and when it appears:
>>
>> Attached is another traffic-image. This one is with 2.6.10 and a 3/1
>> split, preemtive kernel, so all defaults.
>
>
> What are the units on your graph. You say "MB" several places, but
> do you mean Mb (ie, Mega-bit) instead?

The unit on this graph is kilobytes. So 80000 there means 80 megabytes per second.

> I have a tool that can also generate TCP traffic on a large number of
> sockets. If I can understand what you are trying to do, I may be able
> to reproduce the problem. My biggest machine at present has only
> 2GB of RAM, however...not sure if that matters or not.

It should not matter. Low memory is just 1 GB in both cases if you have the default 32-bit 3/1 split.

> Are you sending traffic in only one direction, or more of a full-duplex
> configuration?

It's full-duplex. It's a download service with 3000 downloaders all over the world.

> Is each socket running the same bandwidth?

No. It ranges from 3 kb/sec to 100 kb/sec. 100 kb/sec is the limit because of the send-buffer limits.

> What is this bandwidth?

1000 MBit

> Are you setting the send & rcv buffers in the socket creation
> code? (To what values if so?)

Yes. Send buffer to 64 kbytes and receive buffer to 16 kbytes.

> How many bytes are you sending with each call to write()/sendto() whatever?

I am using a sendfile call every 100 ms per socket with the poll API. So basically around 40 kB per round.
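
A minimal sketch of such a send loop (illustrative only, not my actual code; the descriptor bookkeeping, the chunk size and the 100 ms pacing are assumptions taken from the description above):

/* One pass of a poll + sendfile send loop: every writable socket gets one
 * sendfile() call of at most ~40 KB; short writes or EAGAIN simply wait
 * for the next round. */
#include <poll.h>
#include <sys/sendfile.h>
#include <sys/types.h>

#define CHUNK (40 * 1024)

/* pfds[i].fd is a connected TCP socket, files[i] the file served on it,
 * offsets[i] the current file offset; n is the number of connections. */
void send_round(struct pollfd *pfds, int *files, off_t *offsets, int n)
{
    for (int i = 0; i < n; i++)
        pfds[i].events = POLLOUT;

    if (poll(pfds, n, 100) <= 0)      /* wait up to 100 ms for writable sockets */
        return;

    for (int i = 0; i < n; i++) {
        if (!(pfds[i].revents & POLLOUT))
            continue;
        /* sendfile() advances offsets[i]; -1/EAGAIN just means "retry later" */
        sendfile(pfds[i].fd, files[i], &offsets[i], CHUNK);
    }
}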

> Is there any significant latency between your sender and receiver machine?
> If so, how much?

3000 different downloaders, 3000 different locations, 3000 different machines ;)

> What is the physical transport...GigE? 1500 MTU?

Yes.

Chris

2005-03-07 02:08:48

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> I have a tool that can also generate TCP traffic on a large number of
> sockets. If I can understand what you are trying to do, I may be able
> to reproduce the problem. My biggest machine at present has only
> 2GB of RAM, however...not sure if that matters or not.

But if the problem is what I think it is, you should be able to reproduce it by doing the following.

Best use 2.6.11 since the problem got even worse there compared to 2.6.10.

Create a server on one machine. This server should wait for incoming sockets and when they come,
just send out bytes ("x" or whatever, it just doesn't matter) to those sockets. Please use a
send buffer of 64 kbytes.

On the other machine you just create clients, which connect to the server and read the data. They
just need to read it, nothing more. Please limit the reading to once per 300 ms, so they only read
around 200 kB/sec each. Then watch your traffic as you create more sockets. When you reach 2000
sockets on 2.6.11, it should slow down more and more. You should see the same thing I see on the
attached graph.
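
A rough sketch of the sender half of this test (not my actual code; only the 64 kbyte send buffer and the "just push junk bytes" behaviour come from the description above, the port number and the rest of the scaffolding are assumptions). The client side just connect()s, sleeps ~300 ms, and read()s, so it stays around 200 kB/sec per socket:

/* Crude test sender: accept connections, give each a 64 KB send buffer, and
 * keep pushing 'x' bytes with non-blocking writes.  A real test would use
 * poll()/epoll instead of this busy loop. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    char junk[16384];
    memset(junk, 'x', sizeof(junk));

    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(4001);                    /* assumed test port */
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 1024);
    fcntl(lsock, F_SETFL, O_NONBLOCK);

    int clients[8192];
    int nclients = 0;

    for (;;) {
        int c = accept(lsock, NULL, NULL);
        if (c >= 0 && nclients < 8192) {
            int sndbuf = 64 * 1024;                 /* 64 kbyte send buffer */
            setsockopt(c, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
            fcntl(c, F_SETFL, O_NONBLOCK);
            clients[nclients++] = c;
        }
        for (int i = 0; i < nclients; i++)
            write(clients[i], junk, sizeof(junk));  /* EAGAIN is fine; retry next pass */
        usleep(1000);
    }
}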

First one 2.6.11, second one 2.6.10

Chris


Attachments:
traffic3.png (2.52 kB)

2005-03-07 02:57:19

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
> Ben Greear wrote:
>

>> I have a tool that can also generate TCP traffic on a large number of
>> sockets. If I can understand what you are trying to do, I may be able
>> to reproduce the problem. My biggest machine at present has only
>> 2GB of RAM, however...not sure if that matters or not.
>
> It should not matter. Low-memory is both just 1 GB if you have default
> 32 bit with 3/1 split.
>
>> Are you sending traffic in only one direction, or more of a full-duplex
>> configuration?
>
> Its a full-duplex. Its a download-service with 3000 downloaders all over
> the world.

So actually it's really mostly one-way traffic, i.e. in the download direction.
Anything significant at all going upstream, other than ACKs, etc?

>> Is each socket running the same bandwidth?
>
> No. It ranges from 3 kb/sec to 100 kb/sec. 100 kb/sec is the limit
> because of the send-buffer limits.
>
>> What is this bandwidth?
>
> 1000 MBit
>
>> Are you setting the send & rcv buffers in the socket creation
>> code? (To what values if so?)
>
> Yes. send-buffer to 64 kbytes and receive buffer to 16 kbytes.

With regard to this note in the 'man 7 socket' man page:

NOTES
Linux assumes that half of the send/receive buffer is used for internal kernel structures;
thus the sysctls are twice what can be observed on the wire.

What value are you using for the sockopt call?

>> How many bytes are you sending with each call to write()/sendto()
>> whatever?
>
> I am using sendfile-call every 100 ms per socket with the poll-api. So
> basically around 40 kb per round.

My application is single-threaded, uses non-blocking IO, and sends/rcvs from/to memory.
It will be a good test of the TCP stack, but will not use the sendfile logic,
nor will it touch the HD.

>> Is there any significant latency between your sender and receiver
>> machine?
>> If so, how much?
>
> 3000 different downloaders, 3000 different locations, 3000 different
> machines ;)

I can emulate delay if I need to, but I'd rather just stick with one
delay setting and not have to set up a separate delay for each connection.

Maybe 30ms is average for round-trip time?

Have you tried benchmarking your app in a controlled manner, or are you just
letting a random 3000 machines hit it and start downloading? If the latter,
then I'd suggest getting more control over your testing environment, otherwise
it may be impossible to really figure out where the problem lies.

I'll set up a configuration similar to the values discussed above and see
what I can see. Will probably be late tomorrow before I can do the
test though...

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-07 05:14:57

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Christian Schmid wrote:
>
>> Ben Greear wrote:

>>> How many bytes are you sending with each call to write()/sendto()
>>> whatever?
>>
>>
>> I am using sendfile-call every 100 ms per socket with the poll-api. So
>> basically around 40 kb per round.
>
>
> My application is single-threaded, uses non-blocking IO, and sends/rcvs
> from/to memory.
> It will be a good test of the TCP stack, but will not use the sendfile
> logic,
> nor will it touch the HD.
>

I think you would have better luck in reproducing this problem if you
did the full sendfile thing.

I think it is becoming disk bound due to page reclaim problems, which
is causing the slowdown.

In that case, writing the network-only test would still help to confirm the
problem is not a networking one - so not useless by any means.

2005-03-07 05:30:43

by Willy Tarreau

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Mon, Mar 07, 2005 at 04:14:37PM +1100, Nick Piggin wrote:

> I think you would have better luck in reproducing this problem if you
> did the full sendfile thing.
>
> I think it is becoming disk bound due to page reclaim problems, which
> is causing the slowdown.
>
> In that case, writing the network only test would help to confirm the
> problem is not a networking one - so not useless by any means.

Not necessarily, Nick. I have written an HTTP testing tool which matches
the description of Ben's: non-blocking, single-threaded, no disk I/O,
etc. It works flawlessly under 2.4, but gives me random numbers on 2.6;
especially if I start some CPU activity on the system, I can get pauses
of up to 13 seconds without this tool doing anything!!! At first I
believed it was because of the scheduler, but it might also be related
to what is described here, since I had somewhat the same setup (GigE, 1500,
thousands of sockets). I never had enough time to investigate more, so I
went back to 2.4.

It makes me think that for the problem described here, we have no
indication of CPU & I/O activity, which might help Ben try to reproduce.

Cheers,
Willy

2005-03-07 05:41:06

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Willy Tarreau wrote:
> On Mon, Mar 07, 2005 at 04:14:37PM +1100, Nick Piggin wrote:
>
>
>>I think you would have better luck in reproducing this problem if you
>>did the full sendfile thing.
>>
>>I think it is becoming disk bound due to page reclaim problems, which
>>is causing the slowdown.
>>
>>In that case, writing the network only test would help to confirm the
>>problem is not a networking one - so not useless by any means.
>
>
> Not necessarily, Nick. I have written an HTTP testing tool which matches
> the description of Ben's : non-blocking, single-threaded, no disk I/O,
> etc... It works flawlessly under 2.4, and gives me random numbers in 2.6,

No, you're right - I'm not 100% sure, so I'm definitely not saying
Ben's test will be useless. Just that if it is not too hard to
make one with sendfile, I think he should.

If he makes a network-only version and cannot reproduce the problems,
that *doesn't* mean it is *not* a network problem. However if he
reproduces the problem with a full sendfile version and not the network
only one, then that is a better indicator... but I'm rambling.

> especially if I start some CPU activity on the system, I can get pauses
> of up to 13 seconds without this tool doing anything !!! At first I
> believed it was because of the scheduler, but it might also be related
> to what is described here since I had somewhat the same setup (gigE, 1500,
> thousands of sockets). I never had enough time to investigate more, so I
> went back to 2.4.
>

I have heard other complaints about this, and they are definitely
related to the scheduler (not saying yours is, but it is very possible).

2005-03-07 05:42:25

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Nick Piggin wrote:
> Willy Tarreau wrote:


>> thousands of sockets). I never had enough time to investigate more, so I
>> went back to 2.4.
>>
>
> I have heard other complaints about this, and they are definitely
> related to the scheduler (not saying yours is, but it is very possible).
>

Oh, and if you could dig this thing up too, that might be
good: someone else may have time to investigate more.

Thanks.

2005-03-07 05:46:24

by Willy Tarreau

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Mon, Mar 07, 2005 at 04:42:10PM +1100, Nick Piggin wrote:
> Nick Piggin wrote:
> >Willy Tarreau wrote:
>
>
> >>thousands of sockets). I never had enough time to investigate more, so I
> >>went back to 2.4.
> >>
> >
> >I have heard other complaints about this, and they are definitely
> >related to the scheduler (not saying yours is, but it is very possible).
> >
>
> Oh, and if you could dig this thing up too, that might be
> good: someone else may have time to investigate more.

I would love to, since my major concern with 2.6 has always been the
scheduler (but I'm not telling you anything you don't already know). At the
moment I really don't have time for this; I promised that I would send a full
reproducible report, but it takes a lot of time.

Cheers,
Willy

2005-03-07 09:23:51

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Nick Piggin wrote:
> Ben Greear wrote:
>
>> Christian Schmid wrote:
>>
>>> Ben Greear wrote:
>
>
>>>> How many bytes are you sending with each call to write()/sendto()
>>>> whatever?
>>>
>>>
>>>
>>> I am using sendfile-call every 100 ms per socket with the poll-api.
>>> So basically around 40 kb per round.
>>
>>
>>
>> My application is single-threaded, uses non-blocking IO, and
>> sends/rcvs from/to memory.
>> It will be a good test of the TCP stack, but will not use the sendfile
>> logic,
>> nor will it touch the HD.
>>
>
> I think you would have better luck in reproducing this problem if you
> did the full sendfile thing.
>
> I think it is becoming disk bound due to page reclaim problems, which
> is causing the slowdown.
>
> In that case, writing the network only test would help to confirm the
> problem is not a networking one - so not useless by any means.

It's not trivial to write something like this :)

I'll be using something I already have. If I can't reproduce the problem,
then perhaps it is due to sendfile and someone can write a customized
test. The main reason I offered is because people are ignoring the
bug report for the most part and asking for a test case. I may be able
to offer an independent verification of the problem which might convince
someone to write up a dedicated test case...

Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-07 09:31:19

by Nick Piggin

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Nick Piggin wrote:
>
>> Ben Greear wrote:
>>

>> In that case, writing the network only test would help to confirm the
>> problem is not a networking one - so not useless by any means.
>
>
> It's not trivial to write something like this :)
>
> I'll be using something I already have. If I can't reproduce the problem,
> then perhaps it is due to sendfile and someone can write a customized
> test. The main reason I offered is because people are ignoring the
> bug report for the most part and asking for a test case. I may be able
> to offer an independent verification of the problem which might convince
> someone to write up a dedicated test case...
>

OK, no that sounds good, please do make the test case.

I have actually been following up with Christian regarding
the disk IO / memory management side of things but the thread
has gone offline for some reason :\

2005-03-07 14:35:51

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
>> Its a full-duplex. Its a download-service with 3000 downloaders all
>> over the world.
>
>
> So actually it's really mostly one-way traffic, ie in the download
> direction.
> Anything significant at all going upstream, other than ACKs, etc?

Not much. See the graph. The red is the downstream ;)

>> Yes. send-buffer to 64 kbytes and receive buffer to 16 kbytes.
>
>
> With regard to this note in the 'man 7 socket' man page:
>
> NOTES
> Linux assumes that half of the send/receive buffer is used for
> internal kernel struc-
> tures; thus the sysctls are twice what can be observed on the wire.
>
> What value are you using for the sockopt call?

First I used 64 * 1024, but some months ago I checked with getsockopt and realized that it always
gives back twice the value. So now I just use 64 * 512 ;)
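
A small illustration of that doubling (assumed, not from this thread): on Linux the kernel doubles the SO_SNDBUF value passed to setsockopt() to leave room for its own bookkeeping, and getsockopt() reports the doubled value, so asking for 64 * 512 = 32768 reads back as 65536.

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int asked = 64 * 512;
    socklen_t len = sizeof(asked);

    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &asked, sizeof(asked));
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &asked, &len);
    printf("effective SO_SNDBUF: %d\n", asked);   /* typically 65536 here */
    return 0;
}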

Chris

2005-03-07 23:43:55

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

I started trying to reproduce this, and hit a bug in either
my code or perhaps the tcp stack.

I have a control TCP socket on machine A connected to machine B.

Currently, server A is stuck spinning trying very hard to send commands to
server B. The interesting thing is that netstat shows the SendQ to have
data on both machines (they are trying to send to each other on the same
socket connection), but the receive queues are empty on both machines as well:

machine A:
FC3 x86-64, kernel: 2.6.10-1.766_FC3smp, dual Opteron, 2GB RAM, SMP kernel

netstat:
tcp 0 93440 192.168.1.5:57228 192.168.1.165:4002 ESTABLISHED

Strace of this server:
socketcall(0x9, 0xffffb780) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({42949672960000000, 597879105668495392}, NULL) = 0
gettimeofday({2058282582467209, 597879105668495392}, NULL) = 0
gettimeofday({2058737849000585, 597879101513232728}, NULL) = 0
write(3, "1110237833479: iohandler.cc 383"..., 103) = 103
socketcall(0x9, 0xffffb780) = -1 EAGAIN (Resource temporarily unavailable)
.....



machine B:

2.6.11 + my patches, dual xeon, SMP kernel, 1GB RAM

netstat:
tcp 0 202940 192.168.1.165:4002 192.168.1.5:57228 ESTABLISHED

# Machine B is not trying to send so much stuff to A, so it is not busy-spinning,
# at least it won't until it finally fills up its 8MB user-space send buffer.

Any ideas??

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-08 06:31:41

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Nick Piggin wrote:
> Ben Greear wrote:
>
>> Nick Piggin wrote:
>>
>>> Ben Greear wrote:
>>>
>
>>> In that case, writing the network only test would help to confirm the
>>> problem is not a networking one - so not useless by any means.
>>
>>
>>
>> It's not trivial to write something like this :)
>>
>> I'll be using something I already have. If I can't reproduce the
>> problem,
>> then perhaps it is due to sendfile and someone can write a customized
>> test. The main reason I offered is because people are ignoring the
>> bug report for the most part and asking for a test case. I may be able
>> to offer an independent verification of the problem which might convince
>> someone to write up a dedicated test case...
>>
>
> OK, no that sounds good, please do make the test case.
>
> I have actually been following up with Christian regarding
> the disk IO / memory management side of things but the thread
> has gone offline for some reason :\

Initial test setup: two machines, running connections between them.
Mostly asymmetric (about 50Mbps in one direction,
GigE in the other). Each connection is trying some random rate between 128kbps
and 3Mbps in one direction, and 1kbps in the other direction.

Sending machine is dual 3.0GHz Xeons, 1MB cache, HT, and EM64T (running 32-bit
kernel & user space though). 1GB of RAM.

Receiving machine is dual 2.8GHz Xeons, 512KB cache, HT, 32-bit. 2GB of RAM
(but only about 850MB of low memory of course... saw the thing OOM kill me with 1GB of
free high memory :( )


Zero latency:

2000 TCP connections: When I first start, I see errors indicating I'm out of low
memory... but it quickly recovers. Probably because my program takes a small
bit of time before it starts reading the sockets.
986Mbps of ethernet traffic (counting all ethernet headers)

3000 TCP connections: Same memory issue
986Mbps of ethernet traffic, about 82kpps

4000 TCP connections: Had to drop max_backlog to 5000 from 10000 to keep
the machine from going OOM and killing my traffic generator (on
the receiving side).
986Mbps of ethernet traffic

I will work on some numbers with latency tomorrow (had to stop and
re-write some of my code to better handle managing the 8000 endpoints
that 4000 connections require!)

I think we can assume that the problem is either related to latency,
or sendfile, since 4000 connections with no latency rocks along just
fine...

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-08 16:41:37

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> Initial test setup: two machines, running connections between them.
> Mostly asymetric (about 50Mbps in one direction,
> GigE in the other). Each connection is trying some random rate between
> 128kbps
> and 3Mbps in one direction, and 1kbps in the other direction.
>
> Sending machine is dual 3.0Ghz xeons, 1MB cache, HT, and emt64 (running
> 32-bit
> kernel & user space though). 1GB of RAM
>
> Receiving machine is dual 2.8Ghz xeons, 512 MB cache, HT, 32-bit. 2GB
> of RAM
> (but only 850Mbps of low memory of course...saw the thing OOM kill me
> with 1GB of
> free high memory :( )
>
>
> Zero latency:
>
> 2000 TCP connections: When I first start, I see errors indicating I'm
> out of low
> memory..but it quickly recovers. Probably because my program
> takes a small
> bit of time before it starts reading the sockets.
> 986Mbps of ethernet traffic (counting all ethernet headers)
>
> 3000 TCP connections: Same memory issue
> 986Mbps of ethernet traffic, about 82kpps
>
> 4000 TCP connections: Had to drop max_backlog to 5000 from 10000 to keep
> the machine from going OOM and killing my traffic generator (on
> the receiving side).
> 986Mbps of ethernet traffic
>
> I will work on some numbers with latency tomorrow (had to stop and
> re-write some of my code to better handle managing the 8000 endpoints
> that 4000 connections requires!)
>
> I think we can assume that the problem is either related to latency,
> or sendfile, since 4000 connections with no latency rocks along just
> fine...

Hmmmm.... can you try the following just to exclude some theories:

Run it with 4000 sockets and then do the following on the server-machine:

dd if=/dev/zero of=file1 bs=1M count=1024
dd if=/dev/zero of=file2 bs=1M count=1024
dd if=/dev/zero of=file3 bs=1M count=1024
cat file1 > /dev/zero & cat file2 > /dev/zero & cat file3 > /dev/zero &

I THINK it might have something to do with cache pressure or so. See if there is a slow-down in
the sending when the page cache gets full and has to be reclaimed again.

You are running 2.6.11?

Chris

2005-03-09 23:59:02

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> Yes, 2.6.11. I have tuned max_backlog and some other TCP and networking
> related settings to give more buffers etc to networking tasks. I have not
> tried any significant disk-IO while doing these tests.
>
> I finally got my systems set up so I can run my WAN emulator at full 1Gbps:
>
> I am getting right at 986Mbps throughput with 30ms round-trip latency
> (15ms in both directions).
>
> So, latency does not seem to be the problem either.
>
> I think the problem can be narrowed down to:
>
> 1) Non-optimal kernel network tunings on your server.

I used all the default-settings on 2.6.11

> 2) Disk-IO (my disk is small and slow compared to a 'real' server, not
> sure I can
> really test this side of things, and I have not tried as of yet.)

This doesn't explain the speed-up when I change lower_zone_protection from 0 to 1024. It also doesn't
explain the slowdown on 2.6.11 compared to 2.6.10.

> 3) Your clients have much more latency and/or don't have enough bandwidth
> to fully load your server. Since you didn't answer before: I
> assume you
> do not have a reliable test bed and are just hoping that enough
> clients connect
> to do your benchmarking.

Yes, I just wait until they connect. On the graph it only takes about 2 minutes until 3000 sockets
are created again.

> 4) There is something strange with sendfile and/or your application's
> coding.

I am not doing more than calling sendfile. There is nothing one can do wrong.

> My suggestion would be to eliminate these variables by coming up with a
> repeatable
> test bed, alternative traffic generators, WAN/Network emulators for
> latency, etc.

The problem still is that 1) it speeds up immediately when lower_zone_protection is raised to 1024,
which proves it is NOT a disk bottleneck, and 2) it got much worse with 2.6.11, and
lower_zone_protection disappeared in 2.6.11.

Chris

2005-03-10 00:44:40

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

> So, maybe a VM problem? That would be a good place to focus since
> I think we can be fairly certain it isn't a problem in just the
> networking code. Otherwise, my tests would show lower bandwidth.

Thanks to your tests I am really sure that it's not a network-code problem anymore. But what I THINK it
is: the network is allocating buffers dynamically, and if the VM doesn't provide those buffers fast
enough, it locks up as well. Addendum: if I throttle to 100 MBit it doesn't slow down even with 5000
sockets. What do you think? I think it's about having to free cache faster than is possible. But
then, why is CPU still at 30%? Might there be some limit per cycle? For example, if that "cleaner"
wakes up every 10 ms and cleans at most XXXXX pages, it would explain an artificial limit.
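
Back-of-the-envelope version of that theory (purely hypothetical numbers, just to show the shape of the limit): if some reclaim pass ran every 10 ms and could free at most N pages per pass, page turnover would be capped at N * 100 * 4 KB per second no matter how much CPU is idle. With N around 200 that cap lands near 80 MB/sec, i.e. in the ballpark of the ceiling described above.

#include <stdio.h>

int main(void)
{
    const int passes_per_sec = 100;   /* one pass every 10 ms (assumed)   */
    const int pages_per_pass = 200;   /* hypothetical per-pass page limit */
    const int page_size_kb   = 4;

    long cap_kb = (long)passes_per_sec * pages_per_pass * page_size_kb;
    printf("reclaim cap: %ld KB/sec (~%ld MB/sec)\n", cap_kb, cap_kb / 1024);
    return 0;
}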

Chris

2005-03-10 00:44:48

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
>> Yes, 2.6.11. I have tuned max_backlog and some other TCP and networking
>> related settings to give more buffers etc to networking tasks. I have
>> not
>> tried any significant disk-IO while doing these tests.
>>
>> I finally got my systems set up so I can run my WAN emulator at full
>> 1Gbps:
>>
>> I am getting right at 986Mbps throughput with 30ms round-trip latency
>> (15ms in both directions).
>>
>> So, latency does not seem to be the problem either.
>>
>> I think the problem can be narrowed down to:
>>
>> 1) Non-optimal kernel network tunings on your server.
>
>
> I used all the default-settings on 2.6.11

Here are my settings. Hopefully it will be clear what I'm
talking about... yell if you need details. Please note that I explicitly
set the send buffers to 128k and the rcv to 16k in my test so the min and max
socket queue lengths do not matter here.

my $dflt_tx_queue_len = 2000; # Ethernet driver transmit-queue length. Might be worth making
# it bigger for GigE nics.

my $netdev_max_backlog = 5000; # Maximum number of packets, queued on the INPUT side, when
# the interface receives pkts faster than it can process them.

my $wmem_max = 4096000; # Write memory buffer. This is probably fine for any setup,
# and could be smaller (256000) for < 5Mbps connections.

my $wmem_default = 128000; # Write memory buffer. This is probably fine for any setup,
# and could be smaller (256000) for < 5Mbps connections.

my $rmem_max = 8096000; # Receive memory (packet) buffer. If you are running
# lots of very fast traffic,
# you may want to make this larger if you are running over
# fast, high-latency networks.
# For < 5Mbps of traffic, 512000 should be fine.

my $rmem_default = 128000; # Receive memory (packet) buffer.


# If this is not 1, then the tcp_* settings below will not be applied.
my $modify_tcp_settings = 1;

# See the kernel documentation: Documentation/networking/ip-sysctl.txt
my $tcp_rmem_min = 4096;
my $tcp_rmem_default = 256000; # TCP specific receive memory pool size.
my $tcp_rmem_max = 30000000; # TCP specific receive memory pool size.

my $tcp_wmem_min = 4096;
my $tcp_wmem_default = 256000; # TCP specific write memory pool size.
my $tcp_wmem_max = 30000000; # TCP specific write memory pool size.

my $tcp_mem_lo = 20000000; # Below here there is no memory pressure.
my $tcp_mem_pressure = 30000000; # Can use up to 30MB for TCP buffers.
my $tcp_mem_high = 60000000; # Can use up to 60MB for TCP buffers.
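
For reference, a sketch of where values like these land under /proc (assumed standard 2.6 paths; this is not Ben's actual script). The net.core values take a single number, the tcp_rmem/tcp_wmem/tcp_mem sysctls take a "min default max" triple, and the transmit queue length is per-interface (e.g. set with ifconfig's txqueuelen) rather than a sysctl:

#include <stdio.h>

static void set_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");            /* needs root */
    if (f) { fputs(value, f); fclose(f); }
}

int main(void)
{
    set_sysctl("/proc/sys/net/core/netdev_max_backlog", "5000");
    set_sysctl("/proc/sys/net/core/wmem_max", "4096000");
    set_sysctl("/proc/sys/net/core/wmem_default", "128000");
    set_sysctl("/proc/sys/net/core/rmem_max", "8096000");
    set_sysctl("/proc/sys/net/core/rmem_default", "128000");
    set_sysctl("/proc/sys/net/ipv4/tcp_wmem", "4096 256000 30000000");
    set_sysctl("/proc/sys/net/ipv4/tcp_rmem", "4096 256000 30000000");
    set_sysctl("/proc/sys/net/ipv4/tcp_mem", "20000000 30000000 60000000");
    return 0;
}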


>
>> 2) Disk-IO (my disk is small and slow compared to a 'real' server,
>> not sure I can
>> really test this side of things, and I have not tried as of yet.)
>
>
> This doesnt explain the speed-up when I change lower_zone_protection
> from 0 to 1024. It also doesnt explain the slowdown on 2.6.11 compared
> to 2.6.10

Disk-IO uses buffers, so a change here could easily starve the rest
of your system. I'm just saying I can't reliably test this. To be honest,
my machines are already throwing allocation failures in the ethernet drivers
and I've had the OOM killer kill my main process several times. So, my machines
are running right at their memory limit, even w/out any disk IO.

>> 3) Your clients have much more latency and/or don't have enough
>> bandwidth
>> to fully load your server. Since you didn't answer before: I
>> assume you
>> do not have a reliable test bed and are just hoping that enough
>> clients connect
>> to do your benchmarking.
>
>
> Yes I just wait until they connect. On the graph it only takes about 2
> minutes until 3000 sockets are created again.

But, you could get unlucky and have 3000 people on a shitty dialup
connection connect to you. That does not make it easy to reliably
test the system.


>> 4) There is something strange with sendfile and/or your application's
>> coding.
>
>
> I am not doing more than calling sendfile. There is nothing one can do
> wrong.
>
>> My suggestion would be to eliminate these variables by coming up with
>> a repeatable
>> test bed, alternative traffic generators, WAN/Network emulators for
>> latency, etc.
>
>
> The problem still is that 1) it speeds up immediately when
> lower_zone_protection is raised to 1024. This proves it is NOT a
> disk-bottleneck. And second: it got much worse with 2.6.11 and
> lower_zone_protection disappeared on 2.6.11

So, maybe a VM problem? That would be a good place to focus since
I think we can be fairly certain it isn't a problem in just the
networking code. Otherwise, my tests would show lower bandwidth.

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-10 04:29:54

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:

> Hmmmm.... can you try to following just to exclude some theories:
>
> Run it with 4000 sockets and then do the following on the server-machine:
>
> dd if=/dev/zero of=file1 bs=1M count=1024
> dd if=/dev/zero of=file2 bs=1M count=1024
> dd if=/dev/zero of=file3 bs=1M count=1024
> cat file1 > /dev/zero & cat file2 > /dev/zero & cat file3 > /dev/zero &
>
> I THINK it might have something to do with caching-pressure or so. See
> if there is a slow-down on the sending if the page-cache gets full and
> has to be cleared again.
>
> You are running 2.6.11?

Yes, 2.6.11. I have tuned max_backlog and some other TCP and networking
related settings to give more buffers etc to networking tasks. I have not
tried any significant disk-IO while doing these tests.

I finally got my systems set up so I can run my WAN emulator at full 1Gbps:

I am getting right at 986Mbps throughput with 30ms round-trip latency
(15ms in both directions).

So, latency does not seem to be the problem either.

I think the problem can be narrowed down to:

1) Non-optimal kernel network tunings on your server.
2) Disk-IO (my disk is small and slow compared to a 'real' server, not sure I can
really test this side of things, and I have not tried as of yet.)
3) Your clients have much more latency and/or don't have enough bandwidth
to fully load your server. Since you didn't answer before: I assume you
do not have a reliable test bed and are just hoping that enough clients connect
to do your benchmarking.
4) There is something strange with sendfile and/or your application's coding.

My suggestion would be to eliminate these variables by coming up with a repeatable
test bed, alternative traffic generators, WAN/Network emulators for latency, etc.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-10 05:24:42

by Andrew Morton

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid <[email protected]> wrote:
>
> > So, maybe a VM problem? That would be a good place to focus since
> > I think we can be fairly certain it isn't a problem in just the
> > networking code. Otherwise, my tests would show lower bandwidth.
>
> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
> enough, it locks as well.

Did anyone have a 100-liner which demonstrates this problem?

The output of `vmstat 1' when the thing starts happening would be interesting.

2005-03-10 09:01:01

by Andi Kleen

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andrew Morton <[email protected]> writes:

> Christian Schmid <[email protected]> wrote:
>>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.

If he had a lot of RX traffic (it is hard to figure out because his
bug reports are more or less useless and mostly consist of rants):
The packets are allocated with GFP_ATOMIC and a lot of traffic
overwhelms the free memory.

Some drivers work around this by doing the RX ring refill in process
context (easier with NAPI), but not all do.

In general to solve it one has to increase /proc/sys/vm/freepages
a lot.

It would be nice though if the VM tuned itself dynamically to a lot
of GFP_ATOMIC requests. And maybe if GFP_ATOMIC was a bit more aggressive
and did some simple-minded reclaiming, that would be helpful too.
E.g. there could be an "easy to free" list in the VM for clean pages
where freeing is simple enough that it could be made interrupt safe.

-Andi

2005-03-10 09:10:48

by Andrew Morton

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andi Kleen <[email protected]> wrote:
>
> In general to solve it one has to increase /proc/sys/vm/freepages
> a lot.

/proc/sys/vm/min_free_kbytes

> It would be nice though if the VM tuned itself dynamically to a lot
> of GFP_ATOMIC requests. And maybe if GFP_ATOMIC was a bit more aggressive
> and did some simple minded reclaiming that would be helpful too.
> e.g. there could be a "easy to free" list in the VM for clean pages
> where freeing is simple enough that it could be made interrupt safe.

I spose we could autotune the free memory thresholds somehow, if there is
good reason and a testcase.

Or we could run page reclaim from hard IRQ context - that could be a bit
expensive in terms of CPU consumption and latency though.

2005-03-10 09:15:03

by Andi Kleen

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Thu, Mar 10, 2005 at 01:09:55AM -0800, Andrew Morton wrote:
> Andi Kleen <[email protected]> wrote:
> >
> > In general to solve it one has to increase /proc/sys/vm/freepages
> > a lot.
>
> /proc/sys/vm/min_free_kbytes

Oh yes, I still have the old 2.2 name in my fingertips.

(never understood why these things need to be always renamed; I guess
keeping the old name would have made it too easy on administrators)

-Andi

2005-03-10 09:39:05

by Andrew Morton

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andi Kleen <[email protected]> wrote:
>
> On Thu, Mar 10, 2005 at 01:09:55AM -0800, Andrew Morton wrote:
> > Andi Kleen <[email protected]> wrote:
> > >
> > > In general to solve it one has to increase /proc/sys/vm/freepages
> > > a lot.
> >
> > /proc/sys/vm/min_free_kbytes
>
> Oh yes, I still have the old 2.2 name in my finger tips
>
> (never understood why these things need to be always renamed; I guess
> keeping the old name would have made it too easy on administrators)
>

Page sizes vary. kbytes do not. So scripts and documentation will work
correctly cross-platform, and when you change PAGE_SIZE.

2005-03-10 19:01:18

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andrew Morton wrote:
> Christian Schmid <[email protected]> wrote:
>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.

There you go. As you can see, free is rather high when lower_zone_protection is set to 1024. When I
set it to 0, free goes down and the slow-down starts when the memory is full. The slow-down goes
away after I set lower_zone_protection to 1024, but only AFTER the free memory stops rising.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 34 0 588944 9120 6925452 0 0 29 14 10 12 15 25 27 33
3 30 0 589648 9060 6924628 0 0 68176 1488 4770 6545 12 34 0 53
1 32 0 589712 9152 6924944 0 0 66152 2824 5606 6621 10 34 0 55
3 29 0 590352 8992 6924220 0 0 68260 28 4123 7809 17 36 0 47
1 33 0 601744 8684 6899096 0 0 57776 28 4015 6267 16 38 0 45
6 31 0 602960 8960 6911604 0 0 56148 124 4659 6013 17 36 0 48
7 31 0 590736 8776 6903220 0 0 56460 824 4521 5940 17 35 0 48
0 32 0 589264 9064 6923536 0 0 67376 96 5135 6918 15 34 0 51
0 33 0 590928 8912 6923620 0 0 69504 108 4604 6487 13 34 0 53
3 30 0 589008 9080 6924472 0 0 66904 72 4300 7336 14 35 0 51
1 29 0 590544 9156 6924124 0 0 67684 28 4535 7298 16 34 0 50
2 32 0 589968 9052 6923956 0 0 61000 88 4293 6898 14 35 0 51
1 29 0 591120 8876 6923384 0 0 67940 176 4455 7259 12 34 0 54
3 31 0 589520 9024 6925616 0 0 66980 20 4909 7037 12 32 0 56
4 30 0 590096 8972 6924444 0 0 63924 80 4308 6203 14 33 0 53
9 29 0 590096 8876 6915836 0 0 59860 32 4507 6268 18 36 0 47
0 28 0 588688 9120 6923276 0 0 66844 76 4280 6025 14 32 0 54
3 31 0 581072 9336 6930744 0 0 47744 20788 5307 5195 10 31 0 59
5 28 0 500432 10456 6980352 0 0 62032 20 4749 6779 22 36 0 42
8 31 0 432720 11620 7056504 0 0 64420 4 4758 6480 29 40 0 31
10 30 0 383120 12808 7129232 0 0 72844 0 5197 7040 15 30 0 55
3 30 0 313552 13840 7198716 0 0 69516 0 4479 6216 16 32 0 51
3 28 0 245016 14912 7265508 0 0 67028 192 5111 6295 15 30 0 55
5 26 0 158744 15788 7344396 0 0 78576 0 4361 5937 19 33 0 49
1 28 0 67544 16652 7428668 0 0 84284 0 4299 5252 17 32 0 51
0 32 0 16572 17244 7463436 0 0 83744 2632 4900 6126 16 35 0 49
3 22 0 24196 17616 7453476 0 0 68804 5644 5148 5522 17 34 0 49
3 25 0 24480 17960 7453812 0 0 64872 1668 5043 5568 14 33 0 53
3 26 0 20780 18328 7456300 0 0 64692 0 4854 7081 14 33 0 53
3 32 0 20640 18432 7456672 0 0 60496 28 4882 8156 16 32 0 52

SLOWDOWN START:

2 29 0 21056 18660 7454880 0 0 58432 28 4615 8019 16 33 0 51
7 30 0 24088 18904 7451780 0 0 57920 0 5571 9829 13 32 0 55
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
5 30 0 28340 19228 7448736 0 0 61208 216 5602 9912 12 32 0 56
2 29 0 22680 19416 7453852 0 0 62096 8 6029 10302 13 33 0 54
0 31 0 23136 19616 7454128 0 0 55776 0 5986 10845 14 33 0 53
1 32 0 25240 19136 7452364 0 0 47172 48 6609 10681 18 36 0 46
2 30 0 16988 19004 7461064 0 0 62988 2352 8611 12934 10 29 0 61
0 35 0 26432 18884 7449488 0 0 46032 12920 8253 9289 9 27 0 63
0 35 0 23636 19000 7456104 0 0 44456 164 9576 11174 8 23 0 70
3 29 0 16932 18996 7465152 0 0 60108 12 9062 13320 14 28 1 57
3 31 0 24720 18768 7458648 0 0 60744 24 9926 15865 14 27 0 59
1 33 0 24284 18724 7459848 0 0 63152 0 10028 16689 13 27 0 61
1 31 0 24856 18472 7462208 0 0 59384 0 10157 16561 12 26 1 60
0 34 0 24272 18192 7462556 0 0 60276 184 10946 18029 10 25 0 64
0 34 0 25604 18312 7461756 0 0 58244 0 10217 16344 11 27 0 62
1 29 0 20816 18416 7467296 0 0 61928 0 10796 16894 10 26 0 65
2 33 0 23388 18764 7466744 0 0 47620 0 8889 15021 19 30 0 51
0 34 0 16612 18972 7473540 0 0 54648 12 10644 16752 13 26 1 59
0 35 0 22436 19024 7469068 0 0 55192 2864 11080 17519 12 26 1 61
1 33 0 16548 19192 7474204 0 0 52756 796 11412 19072 12 26 0 62
1 35 0 21352 18904 7470140 0 0 50104 4400 11999 16810 9 23 1 68
1 33 0 24412 18824 7468452 0 0 55132 80 11441 17418 9 25 2 64
1 31 0 24384 18812 7468736 0 0 49860 128 11745 19884 10 26 3 62
5 32 0 27660 19232 7465460 0 0 47396 80 10758 16619 14 27 2 57
2 31 0 31832 19596 7461628 0 0 53480 0 11355 18973 12 27 0 61
2 33 0 31264 19792 7463268 0 0 53236 88 11552 18697 10 26 0 64
2 31 0 31596 20008 7461692 0 0 52624 136 11300 19832 15 27 0 58
2 30 0 22596 20260 7473204 0 0 51864 12 11993 21202 12 25 1 62
0 33 0 20824 20460 7475044 0 0 52488 236 12395 19848 8 24 1 67
2 30 0 21204 20508 7474520 0 0 56616 72 12258 22120 10 25 4 60
3 32 0 16720 20772 7480512 0 0 53972 40 11942 20908 13 26 3 58
3 28 0 31884 20732 7464776 0 0 47236 92 11602 19614 13 26 0 61
0 12 0 16700 20776 7479828 0 0 46936 1416 11557 18017 9 24 10 56
1 31 0 25040 20896 7474268 0 0 38788 6752 10727 14073 12 23 4 61
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
4 31 0 16748 21080 7481972 0 0 49472 8 12436 19864 10 22 11 57
0 35 0 23588 21240 7475828 0 0 58944 84 12935 22227 8 25 0 66
1 30 0 24516 21260 7475128 0 0 51456 124 12701 21756 9 25 0 66
1 32 0 23772 21332 7475600 0 0 52388 116 11749 19039 11 26 1 62
1 34 0 22460 21548 7476268 0 0 50240 224 12103 21552 10 26 1 63
4 32 0 24040 21568 7476792 0 0 47648 48 11124 19446 16 28 0 56
1 33 0 34148 21724 7466504 0 0 49376 176 11903 20681 10 25 0 65
2 30 0 26528 21812 7457576 0 0 47952 140 12054 20289 12 27 0 61
2 33 0 21932 21176 7362808 0 0 46452 152 10675 18546 20 44 0 35
1 32 0 83112 20880 7331484 0 0 31612 212 8110 13900 25 44 0 31
1 33 0 93380 21172 7408304 0 0 47200 8 11808 19313 16 29 1 54
0 33 0 44540 21620 7458652 0 0 49012 0 11591 19500 16 26 0 59
0 34 0 18808 21720 7484392 0 0 55748 72 12333 22001 14 26 0 61
2 33 0 49372 21792 7445084 0 0 51660 72 11300 19226 14 29 1 56
1 29 0 29428 21984 7474608 0 0 47012 100 11435 19647 12 26 1 61
0 34 0 48368 22056 7455700 0 0 51828 92 11687 17129 9 22 6 62
1 32 0 22856 22192 7481132 0 0 50236 56 12020 18495 7 23 10 60
1 32 0 31588 21856 7446720 0 0 42816 148 8546 14873 20 34 0 46
2 34 0 29140 21976 7475568 0 0 54300 80 12145 18508 9 24 0 67
0 31 0 56328 21916 7448088 0 0 54240 308 11338 19211 13 26 1 59
2 34 0 82520 22012 7422084 0 0 48188 2500 11085 17522 12 27 1 61
1 34 0 111192 22236 7395136 0 0 53464 148 10976 18427 14 27 2 57
0 34 0 138372 22236 7366372 0 0 54852 68 11892 20766 10 27 0 63
0 29 0 162528 21772 7343852 0 0 53272 152 11218 20261 13 30 0 57
2 32 0 182976 20816 7325088 0 0 54884 320 11411 20328 12 29 0 59
4 30 0 198720 21188 7310028 0 0 51832 64 11346 20141 12 28 0 60
2 32 0 224296 21412 7283760 0 0 57440 80 12182 21054 7 25 0 68
0 33 0 257388 21688 7230920 0 0 43468 76 8279 14542 17 36 1 46
0 32 0 283164 21948 7215088 0 0 59424 312 11404 18576 11 27 0 62
0 31 0 304912 22192 7202876 0 0 57208 332 11106 19843 12 30 1 58
5 33 0 324412 22476 7183688 0 0 54904 3372 12113 17509 10 28 0 62
0 31 0 342648 22764 7165584 0 0 52332 764 11325 18572 15 29 0 56
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 33 0 360796 22944 7147248 0 0 56184 0 11082 18153 13 28 0 59
4 32 0 389932 23160 7110380 0 0 52580 0 10303 16708 14 32 0 54
1 32 0 397500 23448 7110772 0 0 61384 704 11328 17227 11 27 0 62
0 34 0 417404 23636 7091068 0 0 64840 88 11162 17765 8 28 0 64
1 33 0 432128 23840 7075904 0 0 63924 0 11492 19978 9 29 0 63
4 30 0 445324 24096 7063000 0 0 56508 28 10265 18108 14 30 0 55
3 29 0 459872 24292 7048728 0 0 57696 28 10308 18048 15 29 0 56
4 35 0 479552 23604 7022828 0 0 61804 128 9398 15858 16 35 0 49
2 35 0 476912 23216 7033620 0 0 53160 4292 9876 15157 16 28 0 56
4 32 0 481204 23372 7027616 0 0 49936 5360 10565 14389 11 24 3 62
0 33 0 482680 23476 7026424 0 0 41828 7284 10270 13548 11 23 11 56
3 25 0 492028 23704 7018240 0 0 52380 28 11229 16098 9 24 2 64
1 29 0 493972 23852 7016800 0 0 43804 8276 10168 11859 11 20 16 53
0 15 0 499740 24152 7009836 0 0 59168 352 10580 14203 10 24 0 65
0 34 0 501368 24292 6998544 0 0 43720 16184 9313 8995 13 27 8 53
3 32 0 508156 24412 7002368 0 0 55380 16 10477 11689 9 22 1 68
1 32 0 512184 24100 6998600 0 0 59068 64 9879 13955 13 26 3 58

SLOWDOWN END:

2 33 0 514884 23976 6992604 0 0 65064 36 9387 13410 17 29 2 52
1 33 0 517184 23956 6991128 0 0 61928 4312 9447 13799 14 29 0 57
6 29 0 516760 24088 6993172 0 0 62488 48 9345 15244 15 31 0 54
2 32 0 519196 24164 6992144 0 0 58348 52 8889 14019 13 29 0 57
0 27 0 523164 24140 6987884 0 0 64236 48 9732 13547 10 27 0 63
0 32 0 526752 24280 6984548 0 0 69872 3628 9800 13413 10 27 0 63
2 30 0 531428 22372 6981968 0 0 70148 584 8541 13988 12 32 0 56
0 33 0 535408 21476 6978036 0 0 68468 28 8438 11965 13 30 0 58
4 32 0 536868 20264 6976120 0 0 61804 56 8579 12447 12 28 0 60
6 29 0 540132 19908 6976136 0 0 68492 0 8319 11009 11 29 4 56
3 30 0 541564 18328 6976900 0 0 62148 40 8110 12429 12 30 0 58
1 35 0 547948 17876 6943896 0 0 49368 4900 6412 9800 19 38 0 44
2 31 0 543504 18548 6972600 0 0 63916 8 8427 12450 16 30 0 54
0 31 0 544084 18840 6972920 0 0 66656 0 7090 10378 14 30 0 56
0 32 0 543832 18824 6972528 0 0 63120 28 7035 10658 18 31 0 51
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 30 0 544664 19044 6972648 0 0 66036 28 7686 12178 14 32 0 54
3 32 0 544600 19032 6971708 0 0 73316 1868 8394 10431 11 30 0 59
5 26 0 544344 19184 6972508 0 0 63624 20 6577 11294 16 33 0 50
4 30 0 543580 19324 6972368 0 0 74900 20 6303 10768 18 35 0 47
1 32 0 545632 18504 6971488 0 0 75020 0 6737 10601 14 35 0 52
2 32 0 546848 18580 6971276 0 0 67288 0 6572 9537 13 33 0 55
4 33 0 543584 18640 6969924 0 0 61624 9936 6660 9273 13 34 0 53
2 28 0 544736 18244 6970116 0 0 67992 2708 8948 8232 6 29 0 65
2 30 0 546020 18640 6970400 0 0 64668 16 6114 8186 19 33 0 48
2 30 0 546788 18636 6970812 0 0 69240 8 6064 9482 17 34 0 49
5 32 0 544292 18352 6969804 0 0 74688 16 5957 9517 18 34 0 47
4 30 0 547524 18420 6969464 0 0 69756 276 6042 8357 12 33 0 56
6 29 0 547172 18240 6970188 0 0 70352 4 5210 8836 20 35 0 45
3 34 0 548836 18276 6971172 0 0 69536 0 5356 8443 14 35 0 52
6 28 0 548132 18228 6970268 0 0 70156 0 5428 9326 16 35 0 49
14 32 0 550672 18260 6940452 0 0 64460 12 4352 8160 19 41 0 41
10 31 0 601424 17488 6858060 0 0 62296 296 4154 6203 22 47 0 31
2 31 0 578960 17024 6851724 0 0 49492 2608 4892 6413 25 41 0 35
3 30 0 636752 17156 6881784 0 0 65036 1656 5695 7811 14 36 0 51
4 29 0 572068 17856 6944868 0 0 62252 2412 4594 6801 19 35 0 47
4 27 0 549092 17948 6967216 0 0 74172 36 5121 7671 16 35 0 50
2 34 0 550948 17980 6965552 0 0 73168 100 4792 7743 15 35 0 50
1 30 0 551736 17880 6906628 0 0 64548 208 4575 7705 20 43 0 37
5 33 0 548824 17732 6871484 0 0 61272 2844 3969 6323 22 43 0 35
0 30 0 545432 17980 6891432 0 0 67960 24 4655 6606 22 38 0 41
2 33 0 577176 18008 6932612 0 0 70592 4 4881 7156 13 36 0 51
1 29 0 544784 18328 6965068 0 0 70356 16 4246 6638 15 34 0 51
6 28 0 542672 18312 6964812 0 0 61576 4288 4333 6364 13 35 0 52
1 32 0 545744 18208 6965052 0 0 67540 24 4870 7232 18 34 0 48
2 29 0 546384 18168 6965500 0 0 68172 4 4588 7651 14 36 0 51
4 28 0 545296 18236 6964344 0 0 67020 0 4207 6546 15 35 0 51
3 30 0 544720 18536 6962548 0 0 74644 12 4417 6375 17 34 0 49
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 35 0 538448 18744 6964720 0 0 78184 248 4145 5987 15 35 0 50
6 25 0 538960 18284 6966540 0 0 71532 24 3885 7001 17 35 0 48
4 29 0 543280 17416 6955780 0 0 60540 20 4040 6582 20 38 0 42
5 28 0 540528 17324 6962332 0 0 79480 12 4147 5974 13 36 0 52
3 33 0 538672 17536 6960964 0 0 67944 8 4066 5822 12 34 0 53
1 33 0 535220 17696 6963796 0 0 77724 1752 4156 6577 15 35 0 51
3 32 0 538548 17800 6964032 0 0 65540 0 3890 6294 17 36 0 47
0 29 0 538740 17688 6964280 0 0 60444 12 3860 6419 21 37 0 42
5 29 0 536244 16988 6964640 0 0 67836 8 4336 5963 15 35 0 51
6 28 0 534964 17092 6965148 0 0 68276 0 3934 5697 15 35 0 50
2 30 0 534132 17184 6963492 0 0 62960 412 3883 5663 16 35 0 49
3 26 0 534900 17368 6963036 0 0 61040 8 3685 7448 24 36 0 40

2005-03-10 19:16:34

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Andi Kleen wrote:

> If he had a lot of RX traffic (it is hard to figure out because his
> bug reports are more or less useless and mostly consists of rants):
> The packets are allocated with GFP_ATOMIC and a lot of traffic
> overwhelms the free memory.
>
> Some drivers work around this by doing the RX ring refill in process
> context (easier with NAPI), but not all do.

I think his traffic is mostly 'send' from his server's perspective.

He's reading from disk with sendfile too, I believe, so maybe that
would be consuming lots of pages of memory?

However, in my case, I would definitely welcome something that auto-tuned
the VM to give me lots and lots of GFP_ATOMIC pages. As it is now, I
end up setting the /proc/sys/vm/freepages much higher. Since it appears
the name has changed and I didn't notice, I guess my script to set
this has not actually been doing anything useful in the 2.6 kernel series :P

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-10 19:16:31

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Attached is an image so you can see what's happening. One pixel is 2 seconds. You can see a small
speed-up before the slow-down; this is where I changed lower_zone_protection from 1024 to 0. So it
seems it's speeding up until the memory is full. Then it drastically slows down until I set it to
1024 again. Then it goes up slowly but not linearly (interestingly, smoothly) until it peaks again at 82
MB/sec.

PS: 82 MB/sec is not our bandwidth limit. It still peaks there. Don't know why. Certainly not the
drives; they work up to 200 MB/sec (10 drives there).

Chris


Andrew Morton wrote:
> Christian Schmid <[email protected]> wrote:
>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.
>
>


Attachments:
traffic8.png (1.48 kB)

2005-03-11 15:29:46

by Christian Schmid

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

OHGAWD I GOT IT!!!!!!!!

I admit, totally coincidentally, but it's really FIXED. Today I went to the puter scanning the
servers by routine and wondered why the bandwidth is at 100% without any holes.

The only thing I have done is I switched off hyper-threading, because the server is at only 20% CPU
anyway, so I just disabled it.

So it's something with Linux dealing with hyper-threading. YAY :)

Andrew Morton wrote:
> Christian Schmid <[email protected]> wrote:
>
>> > So, maybe a VM problem? That would be a good place to focus since
>> > I think we can be fairly certain it isn't a problem in just the
>> > networking code. Otherwise, my tests would show lower bandwidth.
>>
>> Thanks to your tests I am really sure that its no network-code problem anymore. But what I THINK it
>> is: The network is allocating buffers dynamically and if the vm doesnt provide that buffers fast
>> enough, it locks as well.
>
>
> Did anyone have a 100-liner which demonstrates this problem?
>
> The output of `vmstat 1' when the thing starts happening would be interesting.
>
>

2005-03-11 19:13:40

by Ben Greear

Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Christian Schmid wrote:
> OHGAWD I GOT IT!!!!!!!!
>
> I admit, totally coincidentially but its really FIXED. Today I went to
> the puter scanning the servers by routine and wondered why the bandwidth
> is at 100% without any holes.
>
> The only thing I have done is I switched off hyper-threading because the
> server is at only 20% CPU anyway so I just disabled it.
>
> So its something with linux dealing with hyper-threading. YAY :)

For what it's worth, I was running dual-Xeon systems with HT turned on.

But I have a single-process, single-threaded application, so there is not much
scheduling to be done. If you have a large number of threads or processes,
it would make more sense for turning off HT to have an effect.

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2005-03-11 19:31:46

by Christian Schmid

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Ben Greear wrote:
> Christian Schmid wrote:
>
>> OHGAWD I GOT IT!!!!!!!!
>>
>> I admit, totally coincidentially but its really FIXED. Today I went to
>> the puter scanning the servers by routine and wondered why the
>> bandwidth is at 100% without any holes.
>>
>> The only thing I have done is I switched off hyper-threading because
>> the server is at only 20% CPU anyway so I just disabled it.
>>
>> So its something with linux dealing with hyper-threading. YAY :)
>
>
> For what it's worth, I was running dual-xeon systems with HT turned on.
>
> But, I have a single process, single-threaded application, so there is
> not much
> scheduling to be done. If you have a large number of threads or processes,
> then it would make more sense for turning off HT to have an affect.

This effect appeared on 1 task and on 200 tasks. I don't know what it is, but with HT off it doesn't
appear anymore. The slow-down still appears when lower_zone_protection is set to 0, but the peak at
80 MB disappeared when set to 1024. I am now running at 95 MB/sec smoothly.

Chris

2005-03-14 04:40:27

by Nick Piggin

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Fri, 2005-03-11 at 20:27 +0100, Christian Schmid wrote:
> Ben Greear wrote:
> >
> > For what it's worth, I was running dual-xeon systems with HT turned on.
> >
> > But, I have a single process, single-threaded application, so there is
> > not much
> > scheduling to be done. If you have a large number of threads or processes,
> > then it would make more sense for turning off HT to have an affect.
>
> This effect appeared on 1 task and on 200 tasks. I dont know what it is, but with HT off it doesnt
> appear anymore. The slow-down still appears when lower_zone_protection is set to 0 but the peak at
> 80 MB disappeared when set to 1024. I am now running at 95 MB/Sec smoothly.
>

OK well that is a good result for you. Thanks for sticking with it.
Unfortunately you'll probably not want to test any patches on your
production system, so the cause of the problem will be difficult to
fix.

I am working on patches which improve HT performance in some
situations though, so with luck they will cure your problems too.
Basically I think SMP "balancing" is too aggressive - and this may
explain why 2.6.10 was worse for you: it had patches to *increase*
the aggressiveness of balancing.

The other thing that worries me is your need for lower_zone_protection.
I think this may be due to unbalanced highmem vs lowmem reclaim. It
would be interesting to know if those patches I sent you improve this.
They certainly improve reclaim balancing for me... but again I guess
you'll be reluctant to do much experimentation :\

Thanks,
Nick



2005-03-14 04:54:07

by Christian Schmid

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

>>This effect appeared on 1 task and on 200 tasks. I dont know what it is, but with HT off it doesnt
>>appear anymore. The slow-down still appears when lower_zone_protection is set to 0 but the peak at
>>80 MB disappeared when set to 1024. I am now running at 95 MB/Sec smoothly.
>>
>
> OK well that is a good result for you. Thanks for sticking with it.
> Unfortunately you'll probably not want to test any patches on your
> production system, so the cause of the problem will be difficult to
> fix.
>
> I am working on patches which improve HT performance in some
> situations though, so with luck they will cure your problems too.
> Basically I think SMP "balancing" is too aggressive - and this may
> explain why 2.6.10 was worse for you, it had patches to *increase*
> the aggressiveness of balancing.
>
> The other thing that worries me is your need for lower_zone_protection.
> I think this may be due to unbalanced highmem vs lowmem reclaim. It
> would be interesting to know if those patches I sent you improve this.
> They certainly improve reclaim balancing for me... but again I guess
> you'll be reluctant to do much experimentation :\

I have tested your patch and unfortunately on 2.6.11 it didn't change anything :( I reported this
before, or do you mean something else? I am of course willing to test patches, as I do not want to
stick with 2.6.10 forever.

2005-03-14 05:08:34

by Nick Piggin

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

On Mon, 2005-03-14 at 05:53 +0100, Christian Schmid wrote:

> > The other thing that worries me is your need for lower_zone_protection.
> > I think this may be due to unbalanced highmem vs lowmem reclaim. It
> > would be interesting to know if those patches I sent you improve this.
> > They certainly improve reclaim balancing for me... but again I guess
> > you'll be reluctant to do much experimentation :\
>
> I have tested your patch and unfortunately on 2.6.11 it didnt change anything :( I reported this
> before, or do you mean something else? I am of course willing to test patches as I do not want to
> stick with 2.6.10 forever.

Well I hope that scheduler developments in progress will put future
kernels at least on par with 2.6.10 again (and hopefully better).

Yes you did report that my patch didn't help 2.6.11, but could those
results have been influenced by the suboptimal HT scheduling? If so,
I was interested in the results with HT turned off.

Nick



2005-05-28 03:19:06

by Christian Schmid

[permalink] [raw]
Subject: Re: BUG: Slowdown on 3000 socket-machines tracked down

Hi.

I want to give the newest report on the vm-lock problem. It seems the problem is getting less
critical with every new release. I am currently using 2.6.12-rc5. The problem with the massive vm-lock
appears, as always, once 3500 sockets are reached, as reported in earlier mails. The problem suddenly
disappears when I set lowmem_reserve_ratio to "1 1" AND min_free_kbytes to 1024000. It only starts
to appear again when reaching around 7000 sockets. -rc3, for example, slowed down again at 4500 sockets.
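
For anyone wanting to reproduce this, a minimal sketch that applies the two settings together and
reads them back; the values are the ones from this report, not general recommendations:

settings = {
    "/proc/sys/vm/lowmem_reserve_ratio": "1 1",
    "/proc/sys/vm/min_free_kbytes": "1024000",
}
for path, value in settings.items():
    with open(path, "w") as f:     # needs root
        f.write(value + "\n")
    with open(path) as f:          # read back to confirm the kernel accepted it
        print(path, "=", f.read().strip())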

I am very sure it's a vm-lock because, for example, reading /proc/sys/vm/lowmem_reserve_ratio takes no
time with < 3500 sockets. While testing with 7000 sockets, I had to wait 20-30 seconds until the
"file" was opened.

Any suggestions? Dual Xeon 3.6 GHz with 8 GB RAM.

Nick Piggin wrote:
> On Mon, 2005-03-14 at 05:53 +0100, Christian Schmid wrote:
>
>
>>>The other thing that worries me is your need for lower_zone_protection.
>>>I think this may be due to unbalanced highmem vs lowmem reclaim. It
>>>would be interesting to know if those patches I sent you improve this.
>>>They certainly improve reclaim balancing for me... but again I guess
>>>you'll be reluctant to do much experimentation :\
>>
>>I have tested your patch and unfortunately on 2.6.11 it didnt change anything :( I reported this
>>before, or do you mean something else? I am of course willing to test patches as I do not want to
>>stick with 2.6.10 forever.
>
>
> Well I hope that scheduler developments in progress will put future
> kernels at least on par with 2.6.10 again (and hopefully better).
>
> Yes you did report that my patch didn't help 2.6.11, but could those
> results have been influenced by the suboptimal HT scheduling? If so,
> I was interested in the results with HT turned off.
>
> Nick
>