From: minoura@valinux.co.jp (MINOURA Makoto)
Subject: Re: NFSD Flow Control Using the TCP Transport
Date: Thu, 20 Mar 2003 13:24:54 +0900
To: Steve Dickson
Cc: nfs@lists.sourceforge.net, taka@valinux.co.jp, yamamoto@valinux.co.jp
Message-ID: <20030320042454.496801DA3E6@brer.local.valinux.co.jp>
In-Reply-To: <3E78872B.5020702@RedHat.com> (Steve Dickson's message of
 "Wed, 19 Mar 2003 10:05:15 -0500")
References: <3E78872B.5020702@RedHat.com>

|> In <3E78872B.5020702@RedHat.com>
|>   Steve Dickson wrote:

> There seems to be some issues (probably known) with the flow control
> over TCP connections (on an SMP machine) to NFSD. Unfortunately,
> the fstress benchmark brings these issues out fairly nicely :-(
> This is occurring in a 2.4.20 kernel.

We ran into the same problem a few weeks ago and have already addressed it.

> The problem seems to stem from the fact that the queued memory in
> the TCP send buffer (i.e. sk->wmem_queued) is not being released (i.e.
> tcp_wspace(sk) becomes negative and never recovers).

First, sock_wspace() does not return the free space we actually need:
it only accounts for the segments currently in flight, not for data
still queued in the TCP send buffer.  Second, TCP's write_space
callback is invoked only when the SOCK_NOSPACE flag is set on the
socket.  The attached patch should solve both problems.  Could someone
who knows the network stack tell us what impact it would have on the
rest of the networking code if we changed sock_wspace() itself, rather
than adding svc_sock_wspace() as the patch does?

> with -EAGAINs. Initially, this caused an oops because svc_delete_socket()
> was being called twice for the same socket [which was easily fixed by
> checking for the SK_DEAD bit in svsk->sk_flags], but now the tests just fail.

The current implementation closes the socket on an EAGAIN or partial
write, but only after some delay.  During that delay, messages still on
the queue may be sent unintentionally (when NFS traffic is heavy).  We
addressed this by removing the delay and closing the socket
immediately, although sending the remaining data correctly would be the
better fix.

Since we use a backport of the 2.5.x zero-copy NFSd, our patch for that
problem cannot be applied to plain 2.4.x without some modifications.
We think the socket lock mechanism (sk_sem) should also be backported
to serialize the output: now that we have removed the MSG_DONTWAIT
flag, the TCP layer may sleep with the socket lock released, and
records from other nfsd threads could become interleaved.
Alternatively, some other locking protocol in the TCP (or socket) layer
could be introduced to prevent the interleaving.
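Roughly the kind of serialization we have in mind (an untested sketch
only; sk_sem is the per-socket semaphore we would add to struct
svc_sock when backporting, svc_send_serialized is just an illustrative
name, and sk_sendto is the existing per-transport send hook):

#include <asm/semaphore.h>
#include <linux/sunrpc/svcsock.h>

/*
 * Sketch: serialize the whole reply transmission on one svc_sock.
 * Without MSG_DONTWAIT the TCP layer may sleep during the send, and
 * two nfsd threads replying on the same TCP socket could otherwise
 * interleave their records.  sk_sem would be initialized with
 * init_MUTEX() when the socket is set up.
 */
static int
svc_send_serialized(struct svc_rqst *rqstp)
{
	struct svc_sock	*svsk = rqstp->rq_sock;
	int		len;

	down(&svsk->sk_sem);		/* one record on the wire at a time */
	len = svsk->sk_sendto(rqstp);	/* may now block inside TCP */
	up(&svsk->sk_sem);

	return len;
}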
--
Minoura Makoto
Engineering Dept., VA Linux Systems Japan


diff -urpN linux-2.4.20.orig/net/sunrpc/svcsock.c linux-2.4.20/net/sunrpc/svcsock.c
--- linux-2.4.20.orig/net/sunrpc/svcsock.c	Fri Nov 29 08:53:16 2002
+++ linux-2.4.20/net/sunrpc/svcsock.c	Thu Mar 20 11:49:41 2003
@@ -42,6 +42,8 @@
 #include <linux/sunrpc/svcsock.h>
 #include <linux/sunrpc/stats.h>
 
+#include <net/tcp.h>
+
 /* SMP locking strategy:
  *
  * 	svc_serv->sv_lock protects most stuff for that service.
@@ -106,6 +108,19 @@ svc_release_skb(struct svc_rqst *rqstp)
 	skb_free_datagram(rqstp->rq_sock->sk_sk, skb);
 }
 
+static int
+svc_sock_wspace(struct svc_sock *svsk)
+{
+	int wspace;
+
+	if (svsk->sk_sock->type == SOCK_STREAM)
+		wspace = tcp_wspace(svsk->sk_sk);
+	else
+		wspace = sock_wspace(svsk->sk_sk);
+
+	return wspace;
+}
+
 /*
  * Queue up a socket with data pending. If there are idle nfsd
  * processes, wake 'em up.
@@ -134,16 +149,18 @@ svc_sock_enqueue(struct svc_sock *svsk)
 		goto out_unlock;
 	}
 
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 	if (((svsk->sk_reserved + serv->sv_bufsz)*2
-	     > sock_wspace(svsk->sk_sk))
+	     > svc_sock_wspace(svsk))
 	    && !test_bit(SK_CLOSE, &svsk->sk_flags)
 	    && !test_bit(SK_CONN, &svsk->sk_flags)) {
 		/* Don't enqueue while not enough space for reply */
 		dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
 			svsk->sk_sk, svsk->sk_reserved+serv->sv_bufsz,
-			sock_wspace(svsk->sk_sk));
+			svc_sock_wspace(svsk));
 		goto out_unlock;
 	}
+	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 
 	/* Mark socket as busy. It will remain in this state until the
 	 * server has processed all pending data and put the socket back
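For reference, this is why the patch introduces svc_sock_wspace()
instead of relying on sock_wspace() alone.  The comparison below is
paraphrased from our reading of the 2.4 sources, not a literal copy,
and the example_* names are only for illustration:

#include <net/sock.h>
#include <net/tcp.h>

/* Generic socket helper (roughly what sock_wspace() computes): only
 * skb memory already allocated for transmission (wmem_alloc) is
 * charged, so a TCP socket with a long, not-yet-sent send queue can
 * still report plenty of space. */
static inline int example_generic_wspace(struct sock *sk)
{
	int amt = sk->sndbuf - atomic_read(&sk->wmem_alloc);
	return amt < 0 ? 0 : amt;
}

/* TCP helper (roughly what tcp_wspace() computes): queued-but-unsent
 * data (wmem_queued) is charged as well, so the result goes negative
 * when the send queue backs up -- the condition svc_sock_enqueue()
 * has to notice before reserving reply space. */
static inline int example_tcp_wspace(struct sock *sk)
{
	return sk->sndbuf - sk->wmem_queued;
}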