From: minoura@valinux.co.jp (MINOURA Makoto)
Subject: Re: NFSD Flow Control Using the TCP Transport
Date: Thu, 20 Mar 2003 13:24:54 +0900
To: Steve Dickson
Cc: nfs@lists.sourceforge.net, taka@valinux.co.jp, yamamoto@valinux.co.jp
Message-ID: <20030320042454.496801DA3E6@brer.local.valinux.co.jp>
In-Reply-To: <3E78872B.5020702@RedHat.com> (Steve Dickson's message of
 "Wed, 19 Mar 2003 10:05:15 -0500")
References: <3E78872B.5020702@RedHat.com>

|> In <3E78872B.5020702@RedHat.com>
|>   Steve Dickson wrote:

> There seems to be some issues (probably known) with the flow control
> over TCP connections (on an SMP machine) to NFSD. Unfortunately,
> the fstress benchmark brings these issues out fairly nicely :-(
> This is occurring in a 2.4.20 kernel.

We ran into the same problem a few weeks ago and have already addressed it.

> The problem seems to stem from the fact that the queued memory in
> the TCP send buffer (i.e. sk->wmem_queued) is not being released (i.e.
> tcp_wspace(sk) becomes negative and never recovers).

First, sock_wspace() does not return the free space we actually need:
it only accounts for the segments currently in flight, not for data
still queued in the TCP send buffer.  Second, TCP's write_space
callback is invoked only when the SOCK_NOSPACE flag is set on the
socket.  The attached patch should solve both problems.  Could someone
who knows the network stack tell us what impact it would have on the
rest of the networking code if we changed sock_wspace() itself, rather
than adding svc_sock_wspace() as the patch does?

> with -EAGAINs. Initially, this caused an oops because svc_delete_socket()
> was being called twice for the same socket [which was easily fixed by
> checking for the SK_DEAD bit in svsk->sk_flags], but now the tests just fail.

The current implementation closes the socket on an EAGAIN or partial
write, but only after some delay.  During that delay, messages still on
the queue may be sent unintentionally (when NFS traffic is heavy).  We
addressed this by removing the delay and closing the socket
immediately, although sending the remaining data correctly would be the
better fix.

Since we use a backport of the 2.5.x zero-copy NFSd, our patch for that
problem cannot be applied to plain 2.4.x without some modifications.
We think the socket lock mechanism (sk_sem) should also be backported
to serialize the output: now that we have removed the MSG_DONTWAIT
flag, the TCP layer may sleep with the socket lock released, and
records from other nfsd threads could become interleaved.
Alternatively, some other locking protocol in the TCP (or socket) layer
could be introduced to prevent the interleaving.
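Roughly the kind of serialization we have in mind (an untested sketch
only; sk_sem is the per-socket semaphore we would add to struct
svc_sock when backporting, svc_send_serialized is just an illustrative
name, and sk_sendto is the existing per-transport send hook):

#include <asm/semaphore.h>
#include <linux/sunrpc/svcsock.h>

/*
 * Sketch: serialize the whole reply transmission on one svc_sock.
 * Without MSG_DONTWAIT the TCP layer may sleep during the send, and
 * two nfsd threads replying on the same TCP socket could otherwise
 * interleave their records.  sk_sem would be initialized with
 * init_MUTEX() when the socket is set up.
 */
static int
svc_send_serialized(struct svc_rqst *rqstp)
{
	struct svc_sock	*svsk = rqstp->rq_sock;
	int		len;

	down(&svsk->sk_sem);		/* one record on the wire at a time */
	len = svsk->sk_sendto(rqstp);	/* may now block inside TCP */
	up(&svsk->sk_sem);

	return len;
}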
--
Minoura Makoto
Engineering Dept., VA Linux Systems Japan


diff -urpN linux-2.4.20.orig/net/sunrpc/svcsock.c linux-2.4.20/net/sunrpc/svcsock.c
--- linux-2.4.20.orig/net/sunrpc/svcsock.c	Fri Nov 29 08:53:16 2002
+++ linux-2.4.20/net/sunrpc/svcsock.c	Thu Mar 20 11:49:41 2003
@@ -42,6 +42,8 @@
 #include <linux/sunrpc/svcsock.h>
 #include <linux/sunrpc/stats.h>
 
+#include <net/tcp.h>
+
 /* SMP locking strategy:
  *
  * 	svc_serv->sv_lock protects most stuff for that service.
@@ -106,6 +108,19 @@ svc_release_skb(struct svc_rqst *rqstp)
 	skb_free_datagram(rqstp->rq_sock->sk_sk, skb);
 }
 
+static int
+svc_sock_wspace(struct svc_sock *svsk)
+{
+	int wspace;
+
+	if (svsk->sk_sock->type == SOCK_STREAM)
+		wspace = tcp_wspace(svsk->sk_sk);
+	else
+		wspace = sock_wspace(svsk->sk_sk);
+
+	return wspace;
+}
+
 /*
  * Queue up a socket with data pending. If there are idle nfsd
  * processes, wake 'em up.
@@ -134,16 +149,18 @@ svc_sock_enqueue(struct svc_sock *svsk)
 		goto out_unlock;
 	}
 
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 	if (((svsk->sk_reserved + serv->sv_bufsz)*2
-	     > sock_wspace(svsk->sk_sk))
+	     > svc_sock_wspace(svsk))
 	    && !test_bit(SK_CLOSE, &svsk->sk_flags)
 	    && !test_bit(SK_CONN, &svsk->sk_flags)) {
 		/* Don't enqueue while not enough space for reply */
 		dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
 			svsk->sk_sk, svsk->sk_reserved+serv->sv_bufsz,
-			sock_wspace(svsk->sk_sk));
+			svc_sock_wspace(svsk));
 		goto out_unlock;
 	}
+	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 
 	/* Mark socket as busy. It will remain in this state until the
 	 * server has processed all pending data and put the socket back
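For reference, this is why the patch introduces svc_sock_wspace()
instead of relying on sock_wspace() alone.  The comparison below is
paraphrased from our reading of the 2.4 sources, not a literal copy,
and the example_* names are only for illustration:

#include <net/sock.h>
#include <net/tcp.h>

/* Generic socket helper (roughly what sock_wspace() computes): only
 * skb memory already allocated for transmission (wmem_alloc) is
 * charged, so a TCP socket with a long, not-yet-sent send queue can
 * still report plenty of space. */
static inline int example_generic_wspace(struct sock *sk)
{
	int amt = sk->sndbuf - atomic_read(&sk->wmem_alloc);
	return amt < 0 ? 0 : amt;
}

/* TCP helper (roughly what tcp_wspace() computes): queued-but-unsent
 * data (wmem_queued) is charged as well, so the result goes negative
 * when the send queue backs up -- the condition svc_sock_enqueue()
 * has to notice before reserving reply space. */
static inline int example_tcp_wspace(struct sock *sk)
{
	return sk->sndbuf - sk->wmem_queued;
}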