2009-05-13 03:50:21

by Andrew Morton

Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS

(obvious cc's added...)

It's an iozone performance regression.

On Tue, 12 May 2009 23:29:30 -0400 Jeff Moyer <[email protected]> wrote:

> Jens Axboe <[email protected]> writes:
>
> > On Mon, May 11 2009, Jeff Moyer wrote:
> >> Jens Axboe <[email protected]> writes:
> >>
> >> > On Fri, May 08 2009, Andrew Morton wrote:
> >> >> On Thu, 23 Apr 2009 10:01:58 -0400
> >> >> Jeff Moyer <[email protected]> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > I've been working on CFQ improvements for interleaved I/Os between
> >> >> > processes, and noticed a regression in performance when using the
> >> >> > deadline I/O scheduler. The test uses a server configured with a cciss
> >> >> > array and 1Gb/s ethernet.
> >> >> >
> >> >> > The iozone command line was:
> >> >> > iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
> >> >> >
> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
> >> >> > These numbers (in MB/s) represent the average of 5 runs.
> >> >> >
> >> >> > v2.6.29
> >> >> >
> >> >> > nfsd's  |   1   |   2   |   4   |   8
> >> >> > --------+-------+-------+-------+--------
> >> >> > deadline| 43207 | 67436 | 96289 | 107590
> >> >> >
> >> >> > 2.6.30-rc1
> >> >> >
> >> >> > nfsd's  |   1   |   2   |   4   |   8
> >> >> > --------+-------+-------+-------+--------
> >> >> > deadline| 43732 | 68059 | 76659 | 83231
> >> >> >
> >> >> > 2.6.30-rc3.block-for-linus
> >> >> >
> >> >> > nfsd's  |   1   |   2   |   4   |   8
> >> >> > --------+-------+-------+-------+--------
> >> >> > deadline| 46102 | 71151 | 83120 | 82330
> >> >> >
> >> >> >
> >> >> > Notice the drop for 4 and 8 threads. It may be worth noting that the
> >> >> > default number of NFSD threads is 8.
> >> >> >
> >> >>
> >> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
> >> >> list.
> >> >
> >> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
> >> > change has caused this writeout regression.
> >>
> >> It's not writeout, it's a read test.
> >
> > Doh sorry, I even ran these tests as well a few weeks back. So perhaps
> > some read-ahead change, I didn't look into it. FWIW, on a single SATA
> > drive here, it didn't show any difference.
>
> OK, I bisected this to the following commit. The mount is done using
> NFSv3, by the way.
>
> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> Author: Olga Kornievskaia <[email protected]>
> Date: Tue Oct 21 14:13:47 2008 -0400
>
> svcrpc: take advantage of tcp autotuning
>
> Allow the NFSv4 server to make use of TCP autotuning behaviour, which
> was previously disabled by setting the sk_userlocks variable.
>
> Set the receive buffers to be big enough to receive the whole RPC
> request, and set this for the listening socket, not the accept socket.
>
> Remove the code that readjusts the receive/send buffer sizes for the
> accepted socket. Previously this code was used to influence the TCP
> window management behaviour, which is no longer needed when autotuning
> is enabled.
>
> This can improve IO bandwidth on networks with high bandwidth-delay
> products, where a large tcp window is required. It also simplifies
> performance tuning, since getting adequate tcp buffers previously
> required increasing the number of nfsd threads.
>
> Signed-off-by: Olga Kornievskaia <[email protected]>
> Cc: Jim Rees <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 5763e64..7a2a90f 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
> lock_sock(sock->sk);
> sock->sk->sk_sndbuf = snd * 2;
> sock->sk->sk_rcvbuf = rcv * 2;
> - sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
> release_sock(sock->sk);
> #endif
> }
> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
> test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>
> - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
> - /* sndbuf needs to have room for one request
> - * per thread, otherwise we can stall even when the
> - * network isn't a bottleneck.
> - *
> - * We count all threads rather than threads in a
> - * particular pool, which provides an upper bound
> - * on the number of threads which will access the socket.
> - *
> - * rcvbuf just needs to be able to hold a few requests.
> - * Normally they will be removed from the queue
> - * as soon a a complete request arrives.
> - */
> - svc_sock_setbufsize(svsk->sk_sock,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
> - 3 * serv->sv_max_mesg);
> -
> clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>
> /* Receive data. If we haven't got the record length yet, get
> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>
> tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>
> - /* initialise setting must have enough space to
> - * receive and respond to one request.
> - * svc_tcp_recvfrom will re-adjust if necessary
> - */
> - svc_sock_setbufsize(svsk->sk_sock,
> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
> -
> - set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> if (sk->sk_state != TCP_ESTABLISHED)
> set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
> /* Initialize the socket */
> if (sock->type == SOCK_DGRAM)
> svc_udp_init(svsk, serv);
> - else
> + else {
> + /* initialise setting must have enough space to
> + * receive and respond to one request.
> + */
> + svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> + 4 * serv->sv_max_mesg);
> svc_tcp_init(svsk, serv);
> + }
>
> /*
> * We start one listener per sv_serv. We want AF_INET


2009-05-13 14:59:56

by Jeff Moyer

Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS

Andrew Morton <[email protected]> writes:

> (obvious cc's added...)
>
> It's an iozone performance regression.
>
> On Tue, 12 May 2009 23:29:30 -0400 Jeff Moyer <[email protected]> wrote:
>
>> Jens Axboe <[email protected]> writes:
>>
>> > On Mon, May 11 2009, Jeff Moyer wrote:
>> >> Jens Axboe <[email protected]> writes:
>> >>
>> >> > On Fri, May 08 2009, Andrew Morton wrote:
>> >> >> On Thu, 23 Apr 2009 10:01:58 -0400
>> >> >> Jeff Moyer <[email protected]> wrote:
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>> >> >> > processes, and noticed a regression in performance when using the
>> >> >> > deadline I/O scheduler. The test uses a server configured with a cciss
>> >> >> > array and 1Gb/s ethernet.
>> >> >> >
>> >> >> > The iozone command line was:
>> >> >> > iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>> >> >> >
>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>> >> >> > These numbers (in MB/s) represent the average of 5 runs.
>> >> >> >
>> >> >> > v2.6.29
>> >> >> >
>> >> >> > nfsd's | 1 | 2 | 4 | 8
>> >> >> > --------+---------------+-------+------
>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>> >> >> >
>> >> >> > 2.6.30-rc1
>> >> >> >
>> >> >> > nfsd's | 1 | 2 | 4 | 8
>> >> >> > --------+---------------+-------+------
>> >> >> > deadline| 43732 | 68059 | 76659 | 83231
>> >> >> >
>> >> >> > 2.6.30-rc3.block-for-linus
>> >> >> >
>> >> >> > nfsd's | 1 | 2 | 4 | 8
>> >> >> > --------+---------------+-------+------
>> >> >> > deadline| 46102 | 71151 | 83120 | 82330
>> >> >> >
>> >> >> >
>> >> >> > Notice the drop for 4 and 8 threads. It may be worth noting that the
>> >> >> > default number of NFSD threads is 8.

Just following up with numbers:

2.6.30-rc4

nfsd's | 8
--------+------
cfq | 51632 (49791 52436 52308 51488 52141)
deadline| 65558 (41675 42559 74820 87518 81221)

2.6.30-rc4 reverting the sunrpc "fix"

nfsd's | 8
--------+------
cfq | 82513 (81650 82762 83147 82935 82073)
deadline| 107827 (109730 106077 107175 108524 107632)

The numbers in parentheses are the individual runs. Notice how
2.6.30-rc4 has some pretty wide variations for deadline.

Cheers,
Jeff

>> >> >> I guess we should ask Rafael to add this to the post-2.6.29 regression
>> >> >> list.
>> >> >
>> >> > I agree. It'd be nice to bisect this one down, I'm guessing some mm
>> >> > change has caused this writeout regression.
>> >>
>> >> It's not writeout, it's a read test.
>> >
>> > Doh sorry, I even ran these tests as well a few weeks back. So perhaps
>> > some read-ahead change, I didn't look into it. FWIW, on a single SATA
>> > drive here, it didn't show any difference.
>>
>> OK, I bisected this to the following commit. The mount is done using
>> NFSv3, by the way.
>>
>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>> Author: Olga Kornievskaia <[email protected]>
>> Date: Tue Oct 21 14:13:47 2008 -0400
>>
>> svcrpc: take advantage of tcp autotuning
>>
>> Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>> was previously disabled by setting the sk_userlocks variable.
>>
>> Set the receive buffers to be big enough to receive the whole RPC
>> request, and set this for the listening socket, not the accept socket.
>>
>> Remove the code that readjusts the receive/send buffer sizes for the
>> accepted socket. Previously this code was used to influence the TCP
>> window management behaviour, which is no longer needed when autotuning
>> is enabled.
>>
>> This can improve IO bandwidth on networks with high bandwidth-delay
>> products, where a large tcp window is required. It also simplifies
>> performance tuning, since getting adequate tcp buffers previously
>> required increasing the number of nfsd threads.
>>
>> Signed-off-by: Olga Kornievskaia <[email protected]>
>> Cc: Jim Rees <[email protected]>
>> Signed-off-by: J. Bruce Fields <[email protected]>
>>
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 5763e64..7a2a90f 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>> lock_sock(sock->sk);
>> sock->sk->sk_sndbuf = snd * 2;
>> sock->sk->sk_rcvbuf = rcv * 2;
>> - sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>> release_sock(sock->sk);
>> #endif
>> }
>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>> test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>> test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>
>> - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>> - /* sndbuf needs to have room for one request
>> - * per thread, otherwise we can stall even when the
>> - * network isn't a bottleneck.
>> - *
>> - * We count all threads rather than threads in a
>> - * particular pool, which provides an upper bound
>> - * on the number of threads which will access the socket.
>> - *
>> - * rcvbuf just needs to be able to hold a few requests.
>> - * Normally they will be removed from the queue
>> - * as soon a a complete request arrives.
>> - */
>> - svc_sock_setbufsize(svsk->sk_sock,
>> - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>> - 3 * serv->sv_max_mesg);
>> -
>> clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>
>> /* Receive data. If we haven't got the record length yet, get
>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>
>> tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>
>> - /* initialise setting must have enough space to
>> - * receive and respond to one request.
>> - * svc_tcp_recvfrom will re-adjust if necessary
>> - */
>> - svc_sock_setbufsize(svsk->sk_sock,
>> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>> -
>> - set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>> if (sk->sk_state != TCP_ESTABLISHED)
>> set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>> /* Initialize the socket */
>> if (sock->type == SOCK_DGRAM)
>> svc_udp_init(svsk, serv);
>> - else
>> + else {
>> + /* initialise setting must have enough space to
>> + * receive and respond to one request.
>> + */
>> + svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>> + 4 * serv->sv_max_mesg);
>> svc_tcp_init(svsk, serv);
>> + }
>>
>> /*
>> * We start one listener per sv_serv. We want AF_INET
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-05-13 19:29:09

by Jeff Moyer

Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS

Hi, netdev folks. The summary here is:

A patch added in the 2.6.30 development cycle caused a performance
regression in my NFS iozone testing. The patch in question is the
following:

commit 47a14ef1af48c696b214ac168f056ddc79793d0e
Author: Olga Kornievskaia <[email protected]>
Date: Tue Oct 21 14:13:47 2008 -0400

svcrpc: take advantage of tcp autotuning

which is also quoted below. Using 8 nfsd threads, a single client doing
2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
KB/s under 2.6.30-rc4. I also see more run to run variation under
2.6.30-rc4 using the deadline I/O scheduler on the server. That
variation disappears (as does the performance regression) when reverting
the above commit.

Olga had the following to say about the regression:

>> On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <[email protected]> wrote:
>>
>>> I believe what you are seeing is how well TCP autotuning performs.
>>> What old NFS code was doing is disabling autotuning and instead using
>>> #nfsd thread to scale TCP recv window. You are providing an example of
>>> where setting TCP buffer sizes outperforms TCP autotuning. While this
>>> is a valid example, there is also an alternative example of where old
>>> NFS design hurts performance.

> We realize that decreased performance is a problem and understand that
> reverting the patch might be the appropriate course of action!

> But we are curious why this is happening. Jeff, if it's not too much trouble,
> could you generate tcpdumps for both cases? We are curious what the max
> window sizes are in both cases. Also could you give us your tcp and
> network sysctl values for the testing environment (both client and server
> values) that you can get with "sysctl -a | grep tcp" and also
> " | grep net.core".

The requested data can be found here:
http://people.redhat.com/jmoyer/iozone-regression.tar

> Poor performance using TCP autotuning can be demonstrated outside
> of NFS using iperf. It can be shown that iperf performs better if the "-w"
> flag is used. When this flag is set, iperf makes a setsockopt() call, which
> in the kernel turns off autotuning.
>
> As for fixing this it would be great if we could get some help from the
> TCP kernel folks?

And so I've CC'd netdev.

Jim Rees had the following to add:

> TCP autotuning can reduce performance by up to about 10% in some cases.
> Jeff found one of these cases. While the autotuning penalty never exceeds
> 10% as far as I know, I can provide examples of other cases where autotuning
> improves nfsd performance by more than a factor of 100.
>
> The right thing is to fix autotuning. If autotuning is considered too
> broken to use, it should be turned off everywhere, not just in nfsd, as it
> hurts/benefits all TCP clients, not just nfs.

> This topic has been discussed before on netdev:
> http://www.spinics.net/lists/netdev/msg68650.html
> http://www.spinics.net/lists/netdev/msg68155.html

I'd like to point out that the penalty is much greater than 10% in my
case.

Here are the data points:

Jeff Moyer <[email protected]> writes:

>>> >> >> Jeff Moyer <[email protected]> wrote:
>>> >> >>
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>>> >> >> > processes, and noticed a regression in performance when using the
>>> >> >> > deadline I/O scheduler. The test uses a server configured with a cciss
>>> >> >> > array and 1Gb/s ethernet.
>>> >> >> >
>>> >> >> > The iozone command line was:
>>> >> >> > iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>>> >> >> >
>>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>>> >> >> > These numbers (in MB/s) represent the average of 5 runs.
>>> >> >> >
>>> >> >> > v2.6.29
>>> >> >> >
>>> >> >> > nfsd's | 1 | 2 | 4 | 8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>>> >> >> >
>>> >> >> > 2.6.30-rc1
>>> >> >> >
>>> >> >> > nfsd's | 1 | 2 | 4 | 8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 43732 | 68059 | 76659 | 83231
>>> >> >> >
>>> >> >> > 2.6.30-rc3.block-for-linus
>>> >> >> >
>>> >> >> > nfsd's | 1 | 2 | 4 | 8
>>> >> >> > --------+---------------+-------+------
>>> >> >> > deadline| 46102 | 71151 | 83120 | 82330
>>> >> >> >
>>> >> >> >
>>> >> >> > Notice the drop for 4 and 8 threads. It may be worth noting that the
>>> >> >> > default number of NFSD threads is 8.
>
> Just following up with numbers:
>
> 2.6.30-rc4
>
> nfsd's | 8
> --------+------
> cfq | 51632 (49791 52436 52308 51488 52141)
> deadline| 65558 (41675 42559 74820 87518 81221)
>
> 2.6.30-rc4 reverting the sunrpc "fix"
>
> nfsd's | 8
> --------+------
> cfq | 82513 (81650 82762 83147 82935 82073)
> deadline| 107827 (109730 106077 107175 108524 107632)
>
> The numbers in parenthesis are the individual runs. Notice how
> 2.6.30-rc4 has some pretty wide variations for deadline.
>
>>> OK, I bisected this to the following commit. The mount is done using
>>> NFSv3, by the way.
>>>
>>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>>> Author: Olga Kornievskaia <[email protected]>
>>> Date: Tue Oct 21 14:13:47 2008 -0400
>>>
>>> svcrpc: take advantage of tcp autotuning
>>>
>>> Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>>> was previously disabled by setting the sk_userlocks variable.
>>>
>>> Set the receive buffers to be big enough to receive the whole RPC
>>> request, and set this for the listening socket, not the accept socket.
>>>
>>> Remove the code that readjusts the receive/send buffer sizes for the
>>> accepted socket. Previously this code was used to influence the TCP
>>> window management behaviour, which is no longer needed when autotuning
>>> is enabled.
>>>
>>> This can improve IO bandwidth on networks with high bandwidth-delay
>>> products, where a large tcp window is required. It also simplifies
>>> performance tuning, since getting adequate tcp buffers previously
>>> required increasing the number of nfsd threads.
>>>
>>> Signed-off-by: Olga Kornievskaia <[email protected]>
>>> Cc: Jim Rees <[email protected]>
>>> Signed-off-by: J. Bruce Fields <[email protected]>
>>>
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 5763e64..7a2a90f 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>>> lock_sock(sock->sk);
>>> sock->sk->sk_sndbuf = snd * 2;
>>> sock->sk->sk_rcvbuf = rcv * 2;
>>> - sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>>> release_sock(sock->sk);
>>> #endif
>>> }
>>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>> test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>>> test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>>
>>> - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>>> - /* sndbuf needs to have room for one request
>>> - * per thread, otherwise we can stall even when the
>>> - * network isn't a bottleneck.
>>> - *
>>> - * We count all threads rather than threads in a
>>> - * particular pool, which provides an upper bound
>>> - * on the number of threads which will access the socket.
>>> - *
>>> - * rcvbuf just needs to be able to hold a few requests.
>>> - * Normally they will be removed from the queue
>>> - * as soon a a complete request arrives.
>>> - */
>>> - svc_sock_setbufsize(svsk->sk_sock,
>>> - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>>> - 3 * serv->sv_max_mesg);
>>> -
>>> clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>
>>> /* Receive data. If we haven't got the record length yet, get
>>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>>
>>> tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>
>>> - /* initialise setting must have enough space to
>>> - * receive and respond to one request.
>>> - * svc_tcp_recvfrom will re-adjust if necessary
>>> - */
>>> - svc_sock_setbufsize(svsk->sk_sock,
>>> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>>> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>>> -
>>> - set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>>> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>> if (sk->sk_state != TCP_ESTABLISHED)
>>> set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>>> /* Initialize the socket */
>>> if (sock->type == SOCK_DGRAM)
>>> svc_udp_init(svsk, serv);
>>> - else
>>> + else {
>>> + /* initialise setting must have enough space to
>>> + * receive and respond to one request.
>>> + */
>>> + svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>>> + 4 * serv->sv_max_mesg);
>>> svc_tcp_init(svsk, serv);
>>> + }
>>>
>>> /*
>>> * We start one listener per sv_serv. We want AF_INET

2009-05-13 23:45:38

by Trond Myklebust

Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS

On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> Hi, netdev folks. The summary here is:
>
> A patch added in the 2.6.30 development cycle caused a performance
> regression in my NFS iozone testing. The patch in question is the
> following:
>
> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> Author: Olga Kornievskaia <[email protected]>
> Date: Tue Oct 21 14:13:47 2008 -0400
>
> svcrpc: take advantage of tcp autotuning
>
> which is also quoted below. Using 8 nfsd threads, a single client doing
> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> KB/s under 2.6.30-rc4. I also see more run to run variation under
> 2.6.30-rc4 using the deadline I/O scheduler on the server. That
> variation disappears (as does the performance regression) when reverting
> the above commit.

It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
function. I can see no reason why we should stop processing new incoming
RPC requests just because the send buffer happens to be 2/3 full. If we
see that we have space for another reply, then we should just go for it.
OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
that the TCP layer knows that we're congested, and that we'd like it to
increase the send window size, please.

Could you therefore please see if the following (untested) patch helps?

Cheers
Trond
---------------------------------------------------------------------
From 1545cbda1b1cda2500cb9db3c760a05fc4f6ed4d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Wed, 13 May 2009 19:44:58 -0400
Subject: [PATCH] SUNRPC: Fix the TCP server's send buffer accounting

Currently, the sunrpc server is refusing to allow us to process new RPC
calls if the TCP send buffer is 2/3 full, even if we do actually have
enough free space to guarantee that we can send another request.
The following patch fixes svc_tcp_has_wspace() so that we only stop
processing requests if we know that the socket buffer cannot possibly fit
another reply.

It also fixes the tcp write_space() callback so that we only clear the
SOCK_NOSPACE flag when the TCP send buffer is less than 2/3 full.
This should ensure that the send window will grow as per the standard TCP
socket code.

Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/svcsock.c | 32 ++++++++++++++++----------------
1 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index af31988..8962355 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -386,6 +386,15 @@ static void svc_write_space(struct sock *sk)
}
}

+static void svc_tcp_write_space(struct sock *sk)
+{
+ struct socket *sock = sk->sk_socket;
+
+ if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
+ clear_bit(SOCK_NOSPACE, &sock->flags);
+ svc_write_space(sk);
+}
+
/*
* Copy the UDP datagram's destination address to the rqstp structure.
* The 'destination' address in this case is the address to which the
@@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
struct svc_serv *serv = svsk->sk_xprt.xpt_server;
int required;
- int wspace;
-
- /*
- * Set the SOCK_NOSPACE flag before checking the available
- * sock space.
- */
- set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
- required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
- wspace = sk_stream_wspace(svsk->sk_sk);
-
- if (wspace < sk_stream_min_wspace(svsk->sk_sk))
- return 0;
- if (required * 2 > wspace)
- return 0;

- clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+ required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
+ if (sk_stream_wspace(svsk->sk_sk) < required)
+ goto out_nospace;
return 1;
+out_nospace:
+ set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+ return 0;
}

static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
@@ -1036,7 +1036,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
dprintk("setting up TCP socket for reading\n");
sk->sk_state_change = svc_tcp_state_change;
sk->sk_data_ready = svc_tcp_data_ready;
- sk->sk_write_space = svc_write_space;
+ sk->sk_write_space = svc_tcp_write_space;

svsk->sk_reclen = 0;
svsk->sk_tcplen = 0;
--
1.6.0.4




2009-05-14 17:55:00

by J. Bruce Fields

Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS

On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > Hi, netdev folks. The summary here is:
> >
> > A patch added in the 2.6.30 development cycle caused a performance
> > regression in my NFS iozone testing. The patch in question is the
> > following:
> >
> > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> > which is also quoted below. Using 8 nfsd threads, a single client doing
> > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > KB/s under 2.6.30-rc4. I also see more run to run variation under
> > 2.6.30-rc4 using the deadline I/O scheduler on the server. That
> > variation disappears (as does the performance regression) when reverting
> > the above commit.
>
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function. I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full. If we

I agree, the calculation doesn't look right. But where do you get the
2/3 number from?

...
> @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
> struct svc_serv *serv = svsk->sk_xprt.xpt_server;
> int required;
> - int wspace;
> -
> - /*
> - * Set the SOCK_NOSPACE flag before checking the available
> - * sock space.
> - */
> - set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> - required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> - wspace = sk_stream_wspace(svsk->sk_sk);
> -
> - if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> - return 0;
> - if (required * 2 > wspace)
> - return 0;
>
> - clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> + required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> + if (sk_stream_wspace(svsk->sk_sk) < required)

This calculation looks the same before and after--you've just moved the
"*2" into the calculation of "required". Am I missing something? Maybe
you meant to write:

required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;

without the parentheses?

That looks closer, assuming the calculation is meant to be:

atomic_read(..) == amount of buffer space we think we
already need
serv->sv_max_mesg * 2 == space for worst-case request
and reply?

--b.

> + goto out_nospace;
> return 1;
> +out_nospace:
> + set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> + return 0;
> }

2009-05-14 18:26:16

by Trond Myklebust

Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS

On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
> On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> > On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> > > Hi, netdev folks. The summary here is:
> > >
> > > A patch added in the 2.6.30 development cycle caused a performance
> > > regression in my NFS iozone testing. The patch in question is the
> > > following:
> > >
> > > commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> > > Author: Olga Kornievskaia <[email protected]>
> > > Date: Tue Oct 21 14:13:47 2008 -0400
> > >
> > > svcrpc: take advantage of tcp autotuning
> > >
> > > which is also quoted below. Using 8 nfsd threads, a single client doing
> > > 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> > > KB/s under 2.6.30-rc4. I also see more run to run variation under
> > > 2.6.30-rc4 using the deadline I/O scheduler on the server. That
> > > variation disappears (as does the performance regression) when reverting
> > > the above commit.
> >
> > It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> > function. I can see no reason why we should stop processing new incoming
> > RPC requests just because the send buffer happens to be 2/3 full. If we
>
> I agree, the calculation doesn't look right. But where do you get the
> 2/3 number from?

That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.

> ...
> > @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> > struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
> > struct svc_serv *serv = svsk->sk_xprt.xpt_server;
> > int required;
> > - int wspace;
> > -
> > - /*
> > - * Set the SOCK_NOSPACE flag before checking the available
> > - * sock space.
> > - */
> > - set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > - required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> > - wspace = sk_stream_wspace(svsk->sk_sk);
> > -
> > - if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> > - return 0;
> > - if (required * 2 > wspace)
> > - return 0;
> >
> > - clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > + required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> > + if (sk_stream_wspace(svsk->sk_sk) < required)
>
> This calculation looks the same before and after--you've just moved the
> "*2" into the calculation of "required". Am I missing something? Maybe
> you meant to write:
>
> required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;
>
> without the parentheses?

I wasn't trying to change that part of the calculation. I'm just
splitting out the stuff which has to do with TCP congestion (i.e. the
window size), and stuff which has to do with remaining socket buffer
space. I do, however, agree that we should probably drop that *2.

However there is (as usual) 'interesting behaviour' when it comes to
deferred requests. Their buffer space is already accounted for in the
'xpt_reserved' calculation, but they cannot get re-scheduled unless
svc_tcp_has_wspace() thinks it has enough free socket space for yet
another reply. Can you spell 'deadlock', children?

Trond