2005-02-27 16:53:46

by Christian Schmid

Subject: Slowdown on high-load machines with 3000 sockets

Hello.

The problem is that starting at around 3000 sockets, syswrite blocks more and more on the sockets,
even though the sockets are non-blocking. It appears quite suddenly at that point. I raised
min_free_kbytes to 1024000 and the blocking suddenly stopped. I changed it back down to 16000 and it
instantly started blocking again. Back up to 1024000 and no blocking. Now it starts blocking again
at 4000 sockets even with min_free_kbytes at 1024000, slowing everything down... What could this be?
It is not a network problem; I have discussed this issue with the netdev people for two weeks. I
don't think it is a memory problem either: the machine has 8 GB of RAM with a 2/2 split...
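
For illustration, this is roughly what the write path boils down to (a simplified C sketch with
made-up names; the real application uses Perl's syswrite, so this is only the equivalent pattern):

#include <errno.h>
#include <unistd.h>

/* Minimal sketch: a non-blocking socket write that should never sleep.
 * The socket is assumed to have been made non-blocking earlier with
 * fcntl(sock, F_SETFL, ... | O_NONBLOCK). */
static ssize_t send_nonblocking(int sock, const char *buf, size_t len)
{
    ssize_t n = write(sock, buf, len);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;       /* send buffer full: wait for POLLOUT instead */
    return n;           /* bytes written, or -1 on a real error */
}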

This problem has been observed on a 2.6.10 kernel.

Please help. Thank you in advance.


Best regards,
Chris


2005-02-27 17:18:55

by Rik van Riel

Subject: Re: Slowdown on high-load machines with 3000 sockets

On Sun, 27 Feb 2005, Christian Schmid wrote:

> The problem is that starting at around 3000 sockets, syswrite blocks
> more and more on the sockets, even though the sockets are non-blocking.
> It appears quite suddenly at that point. I raised min_free_kbytes to
> 1024000 and the blocking suddenly stopped. I changed it back down to
> 16000 and it instantly started blocking again. Back up to 1024000 and
> no blocking. Now it starts blocking again at 4000 sockets even with
> min_free_kbytes at 1024000, slowing everything down... What could this
> be?

Is it possible to detect when the write system call blocks?

Maybe alt-sysrq-p can be used to find out where the process
is spending its time; there may be some code path left where
the write system call blocks, even with non-blocking writes...

> It is not a network problem; I have discussed this issue with the
> netdev people for two weeks. I don't think it is a memory problem
> either: the machine has 8 GB of RAM with a 2/2 split...

It could be an interaction between the network subsystem
and the memory management subsystem, e.g. the TCP stack not
allocating more than a certain amount of buffer memory and
stalling until some previously sent data has been received.
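
If that is the suspicion, the global TCP memory sysctls are one thing to check. A minimal sketch of
dumping them (standard procfs paths; purely illustrative):

#include <stdio.h>

/* Print the TCP memory sysctls that bound how much buffer memory the
 * stack will use (tcp_mem is in pages, tcp_wmem in bytes). */
static void dump(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");

    if (f && fgets(line, sizeof(line), f))
        printf("%s: %s", path, line);
    if (f)
        fclose(f);
}

int main(void)
{
    dump("/proc/sys/net/ipv4/tcp_mem");   /* low, pressure, high (pages) */
    dump("/proc/sys/net/ipv4/tcp_wmem");  /* min, default, max per socket (bytes) */
    return 0;
}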

Getting backtraces of when the process is "stuck" will be
very helpful.

> This problem has been observed on a 2.6.10 kernel.

Did things work right in earlier kernels (is this a regression) ?
Or have things always worked this way ?

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-02-27 21:10:10

by Rik van Riel

Subject: Re: Slowdown on high-load machines with 3000 sockets

On Sun, 27 Feb 2005, Rik van Riel wrote:

> Is it possible to detect when the write system call blocks?

OK, we got that on #kernelnewbies. Initially the trace looks
innocent enough, but there is something interesting going on
with the VM. Something we may want to investigate...

<Dragony> Feb 27 21:12:34 s02 kernel: Call Trace:
<Dragony> Feb 27 21:12:34 s02 kernel: [<803e1148>] io_schedule+0x28/0x40
<Dragony> Feb 27 21:12:34 s02 kernel: [<8015ac26>] sync_buffer+0x36/0x50
<Dragony> Feb 27 21:12:34 s02 kernel: [<803e1396>] __wait_on_bit+0x66/0x70
<Dragony> Feb 27 21:12:34 s02 kernel: [<803e1428>] out_of_line_wait_on_bit+0x88/0x90
<Dragony> Feb 27 21:12:34 s02 kernel: [<8015c2a5>] __bread_slow+0x65/0x90
<Dragony> Feb 27 21:12:34 s02 kernel: [<8015c5bd>] __bread+0x3d/0x50
<Dragony> Feb 27 21:12:34 s02 kernel: [<801dcdc7>] ext2_get_branch+0x77/0x120
<Dragony> Feb 27 21:12:34 s02 kernel: [<801dd187>] ext2_get_block+0x77/0x3c0
<Dragony> Feb 27 21:12:34 s02 kernel: [<8017d258>] do_mpage_readpage+0x1a8/0x4f0
<Dragony> Feb 27 21:12:34 s02 kernel: [<8017d64f>] mpage_readpages+0xaf/0x130
<Dragony> Feb 27 21:12:34 s02 kernel: [<80140c56>] read_pages+0xf6/0x100
<Dragony> Feb 27 21:12:34 s02 kernel: [<80140f47>] do_page_cache_readahead+0x117/0x1a0
<Dragony> Feb 27 21:12:34 s02 kernel: [<801411ab>] page_cache_readahead+0x1db/0x240
<Dragony> Feb 27 21:12:34 s02 kernel: [<80139b10>] do_generic_mapping_read+0x100/0x3f0
<Dragony> Feb 27 21:12:34 s02 kernel: [<8013a31b>] generic_file_sendfile+0x5b/0x70
<Dragony> Feb 27 21:12:34 s02 kernel: [<8015a4b0>] do_sendfile+0x1f0/0x2c0

Looks like his program is doing a sendfile, which pauses
on reading things from disk. So far, so good.

However, the big question is: why do the slowdowns go away
when min_free_kbytes is larger?

As an additional hint, Christian is using sys_readahead.

Christian, how big are the data blocks you read ahead with
sys_readahead, and how many do you have outstanding at a time?
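
For reference, the pattern being discussed is roughly readahead(2) followed by sendfile(2); a
sketch with made-up names and sizes, not Christian's actual code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Ask the kernel to start pulling the next chunk into the page cache,
 * then ship whatever is cached to the (non-blocking) socket. If the
 * pages are not in the cache yet, sendfile() can still end up waiting
 * on disk I/O, which is what the backtrace above shows. */
ssize_t push_chunk(int file_fd, int sock_fd, off_t *file_off, size_t chunk)
{
    readahead(file_fd, *file_off, chunk);
    return sendfile(sock_fd, file_fd, file_off, chunk);
}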

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-02-27 21:20:54

by Christian Schmid

Subject: Re: Slowdown on high-load machines with 3000 sockets



Rik van Riel wrote:
> On Sun, 27 Feb 2005, Rik van Riel wrote:
>
>
>>Is it possible to detect when the write system call blocks?
>
>
> OK, we got that on #kernelnewbies. Initially the trace looks
> innocent enough, but there is something interesting going on
> with the VM. Something we may want to investigate...
>
> <Dragony> Feb 27 21:12:34 s02 kernel: Call Trace:
> <Dragony> Feb 27 21:12:34 s02 kernel: [<803e1148>] io_schedule+0x28/0x40
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8015ac26>] sync_buffer+0x36/0x50
> <Dragony> Feb 27 21:12:34 s02 kernel: [<803e1396>] __wait_on_bit+0x66/0x70
> <Dragony> Feb 27 21:12:34 s02 kernel: [<803e1428>] out_of_line_wait_on_bit+0x88/0x90
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8015c2a5>] __bread_slow+0x65/0x90
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8015c5bd>] __bread+0x3d/0x50
> <Dragony> Feb 27 21:12:34 s02 kernel: [<801dcdc7>] ext2_get_branch+0x77/0x120
> <Dragony> Feb 27 21:12:34 s02 kernel: [<801dd187>] ext2_get_block+0x77/0x3c0
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8017d258>] do_mpage_readpage+0x1a8/0x4f0
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8017d64f>] mpage_readpages+0xaf/0x130
> <Dragony> Feb 27 21:12:34 s02 kernel: [<80140c56>] read_pages+0xf6/0x100
> <Dragony> Feb 27 21:12:34 s02 kernel: [<80140f47>] do_page_cache_readahead+0x117/0x1a0
> <Dragony> Feb 27 21:12:34 s02 kernel: [<801411ab>] page_cache_readahead+0x1db/0x240
> <Dragony> Feb 27 21:12:34 s02 kernel: [<80139b10>] do_generic_mapping_read+0x100/0x3f0
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8013a31b>] generic_file_sendfile+0x5b/0x70
> <Dragony> Feb 27 21:12:34 s02 kernel: [<8015a4b0>] do_sendfile+0x1f0/0x2c0
>
> Looks like his program is doing a sendfile, which pauses
> on reading things from disk. So far, so good.
>
> However, the big question is why the slowdowns go away
> when min_free_kbytes is larger ?
>
> As an additional hint, Christian is using sys_readahead.
>
> Christian, how big are the data blocks you sys_readahead, and
> how many do you have outstanding at a time ?

I am reading 800000 bytes of data ahead as soon as there are less than 200000 bytes left in the
cache. I assume the cache doesn't kill itself. It is an 8 GB machine with 7.5 GB free for caching.
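
Roughly, per connection the policy is the following watermark scheme (a reconstruction for
illustration; the variable names and bookkeeping are made up):

#define _GNU_SOURCE
#include <fcntl.h>      /* readahead() */
#include <sys/types.h>  /* off_t */

#define RA_CHUNK     800000   /* bytes requested per readahead call */
#define RA_LOW_WATER 200000   /* refill threshold */

struct conn {
    int   file_fd;
    off_t sent_off;   /* how far sendfile has progressed */
    off_t ra_off;     /* how far readahead has been issued */
};

/* Top up with ~800000 bytes whenever fewer than ~200000 readahead
 * bytes remain unsent for this connection. */
static void maybe_readahead(struct conn *c)
{
    if (c->ra_off - c->sent_off < RA_LOW_WATER) {
        readahead(c->file_fd, c->ra_off, RA_CHUNK);
        c->ra_off += RA_CHUNK;
    }
}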

2005-02-27 23:32:28

by Rik van Riel

Subject: Re: Slowdown on high-load machines with 3000 sockets

On Sun, 27 Feb 2005, Christian Schmid wrote:

> > Christian, how big are the data blocks you read ahead with
> > sys_readahead, and how many do you have outstanding at a time?
>
> I am reading 800000 bytes of data ahead as soon as there are less than
> 200000 bytes left in the cache. I assume the cache doesn't kill itself.
> It is an 8 GB machine with 7.5 GB free for caching.

OK, so you read 800kB with 4000 threads, or 3.2GB of
readahead data. The inactive list is quite possibly
smaller than that, at 1/3 of memory or around 2.6 GB.

It looks like the cache might be killing itself, through
readahead thrashing.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-02-28 00:09:01

by Christian Schmid

Subject: Re: Slowdown on high-load machines with 3000 sockets

No, I am only using 4 tasks with the poll API and non-blocking sockets. Every socket gets a 1 MB
read-ahead. That is 4000 MB max on an 8 GB machine... It shouldn't thrash.
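
Schematically, each of the 4 tasks runs a loop like this (a simplified sketch, not the real code):

#include <poll.h>
#include <unistd.h>

/* One pass over this task's share of the non-blocking sockets: wait
 * for POLLOUT and write only when the send buffer has room. */
void service_sockets(struct pollfd *fds, int nfds)
{
    static const char payload[] = "data";
    int i;

    if (poll(fds, (nfds_t)nfds, 1000) <= 0)   /* 1s timeout, nothing ready */
        return;

    for (i = 0; i < nfds; i++) {
        if (fds[i].revents & POLLOUT)
            (void)write(fds[i].fd, payload, sizeof(payload) - 1);  /* EAGAIN is fine */
    }
}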

Rik van Riel wrote:
> On Sun, 27 Feb 2005, Christian Schmid wrote:
>
>
>>>Christian, how big are the data blocks you read ahead with
>>>sys_readahead, and how many do you have outstanding at a time?
>>
>>I am reading 800000 bytes of data ahead as soon as there are less than
>>200000 bytes left in the cache. I assume the cache doesn't kill itself.
>>It is an 8 GB machine with 7.5 GB free for caching.
>
>
> OK, so you read 800kB with 4000 threads, or 3.2GB of
> readahead data. The inactive list is quite possibly
> smaller than that, at 1/3 of memory or around 2.6 GB.
>
> It looks like the cache might be killing itself, through
> readahead thrashing.
>

2005-02-28 00:09:25

by Rik van Riel

Subject: Re: Slowdown on high-load machines with 3000 sockets

On Mon, 28 Feb 2005, Christian Schmid wrote:

> No, I am only using 4 tasks with the poll API and non-blocking sockets.
> Every socket gets a 1 MB read-ahead. That is 4000 MB max on an 8 GB
> machine... It shouldn't thrash.

Only if nothing else on the system used any memory, and there
were no memory zones and no division into active and
inactive memory.

You may want to try a smaller readahead window and see if
your system still has trouble with the load.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-02-28 00:30:13

by Christian Schmid

Subject: Re: Slowdown on high-load machines with 3000 sockets

I already tried 300 KB, and even used a Perl hash as a horribly slow buffer to replace readahead.
It still slowed down on the syswrite to the socket. That's the strange thing.

Rik van Riel wrote:
> On Mon, 28 Feb 2005, Christian Schmid wrote:
>
>
>>No, I am only using 4 tasks with the poll API and non-blocking sockets.
>>Every socket gets a 1 MB read-ahead. That is 4000 MB max on an 8 GB
>>machine... It shouldn't thrash.
>
>
> If nothing else on the system uses any memory, and there
> were no memory zones and no division into active and
> inactive memory.
>
> You may want to try a smaller readahead window and see if
> your system still has trouble with the load.
>

2005-02-28 05:13:57

by Nick Piggin

Subject: Re: Slowdown on high-load machines with 3000 sockets

Christian Schmid wrote:
> I already tried 300 KB, and even used a Perl hash as a horribly slow
> buffer to replace readahead. It still slowed down on the syswrite to
> the socket. That's the strange thing.
>

Do you have to use manual readahead though? What is the performance
like if you just let the kernel do its own thing? The kernel's
readahead provides things like automatic scaling and thrashing
control, so if possible you should just stick to that.

Although you may want to experiment with the maximum readahead on your
working disks:
/sys/block/???/queue/read_ahead_kb
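
For what it's worth, a minimal sketch of setting that value programmatically ("sda" and 512 are
example values only; echoing a number into the file from a shell does the same thing):

#include <stdio.h>

/* Set the block-layer readahead for one disk via sysfs. */
int main(void)
{
    FILE *f = fopen("/sys/block/sda/queue/read_ahead_kb", "w");

    if (!f) {
        perror("open read_ahead_kb");
        return 1;
    }
    fprintf(f, "512\n");   /* value is in kilobytes */
    fclose(f);
    return 0;
}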

Also, can we get a testcase (ie. minimal compilable code) to reproduce
this problem?

Thanks,
Nick

2005-02-28 11:37:10

by Christian Schmid

Subject: Re: Slowdown on high-load machines with 3000 sockets

Nick Piggin wrote:
> Christian Schmid wrote:
>
>> I already tried 300 KB, and even used a Perl hash as a horribly slow
>> buffer to replace readahead. It still slowed down on the syswrite to
>> the socket. That's the strange thing.
>>
>
> Do you have to use manual readahead though? What is the performance
> like if you just let the kernel do its own thing? The kernel's
> readahead provides things like automatic scaling and thrashing
> control, so if possible you should just stick to that.

Actually, that's why I changed to manual readahead: I thought the slowdown came from some weird
readahead settings. But with manual readahead the problem appears as well. It is still a mystery why
the slowdown definitely disappears when I raise /proc/sys/vm/min_free_kbytes to 1024000 (until it
hits around 4500 sockets), yet it does not help with smaller or larger values...

> Also, can we get a testcase (ie. minimal compilable code) to reproduce
> this problem?

This is complex code, which is why I would really like to avoid that... Feel free to ask for traces
or anything else you want to know, though.

Thanks,
Chris

2005-02-28 19:06:17

by Christian Schmid

Subject: Re: Slowdown on high-load machines with 3000 sockets

This issue has been tracked down further: the bug does NOT appear if I disable kernel preemption.
Maybe this helps.

Nick Piggin wrote:
> Christian Schmid wrote:
>
>> I already tried 300 KB, and even used a Perl hash as a horribly slow
>> buffer to replace readahead. It still slowed down on the syswrite to
>> the socket. That's the strange thing.
>>
>
> Do you have to use manual readahead though? What is the performance
> like if you just let the kernel do its own thing? The kernel's
> readahead provides things like automatic scaling and thrashing
> control, so if possible you should just stick to that.
>
> Although you may want to experiment with the maximum readahead on your
> working disks:
> /sys/block/???/queue/read_ahead_kb
>
> Also, can we get a testcase (ie. minimal compilable code) to reproduce
> this problem?
>
> Thanks,
> Nick
>
>
>

2005-02-28 23:46:48

by Nick Piggin

Subject: Re: Slowdown on high-load machines with 3000 sockets

Christian Schmid wrote:
> This issue has been tracked down further: the bug does NOT appear if I
> disable kernel preemption.
> Maybe this helps.
>

Yes, it may help - can you boot with profile=schedule and get
the results for say, a 30 second period while the application
is experiencing problems?

So:

start application
wait till it hits slowdown
readprofile -r ; sleep 30 ; readprofile > schedprof.out

and send schedprof.out, please?

Thanks,
Nick

2005-03-05 16:46:48

by Christian Schmid

Subject: Re: Slowdown on high-load machines with 3000 sockets

Attached is another traffic graph. This one is with 2.6.10 and a 3/1 split, preemptible kernel, so
all defaults.

The first part is where I throttled the whole thing to 100 Mbit in order to build up a traffic jam ;)

When I released it, throughput jumped up immediately, but then it suddenly goes down again (each
pixel is one second). Playing around with min_free_kbytes didn't help. Where it goes up again I set
lower_zone_protection to 1024000, where it goes down I set it to 0 again, and where it goes up the
last time... guess.

Well, I was only able to test this with 3500 sockets. I will try to jam things up tomorrow to reach
5000 sockets to see whether that is really the reason.

Nick Piggin wrote:
> Christian Schmid wrote:
>
>> This issue has been tracked down further: the bug does NOT appear if I
>> disable kernel preemption.
>> Maybe this helps.
>>
>
> Yes, it may help - can you boot with profile=schedule and get
> the results for say, a 30 second period while the application
> is experiencing problems?
>
> So:
>
> start application
> wait till it hits slowdown
> readprofile -r ; sleep 30 ; readprofile > schedprof.out
>
> and send schedprof.out, please?
>
> Thanks,
> Nick
>
>
>


Attachments:
traffic2.png (2.45 kB)