Hello,
There seem to be some (probably known) issues with the flow control
over TCP connections (on an SMP machine) to NFSD. Unfortunately,
the fstress benchmark brings these issues out fairly nicely :-(
This is occurring in a 2.4.20 kernel.
When fstress starts its stress tests, svc_tcp_sendto() immediately
starts failing with -EAGAINs. Initially this caused an oops because
svc_delete_socket() was being called twice for the same socket [which
was easily fixed by checking for the SK_DEAD bit in svsk->sk_flags],
but now the tests just fail.
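(The workaround amounts to something like the following -- just a
sketch from memory, not the exact diff:)

    /* sketch of the workaround: skip the teardown if this svc_sock
     * has already been marked dead, so svc_delete_socket() cannot
     * run twice on the same socket */
    if (!test_bit(SK_DEAD, &svsk->sk_flags))
        svc_delete_socket(svsk);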
The problem seems to stem from the fact that the queued memory in
the TCP send buffer (i.e. sk->wmem_queued) is not being released
(i.e. tcp_wspace(sk) becomes negative and never recovers).
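(For context, tcp_wspace() in 2.4 is essentially the configured send
buffer size minus what is queued, so it goes negative once wmem_queued
overshoots sndbuf:)

    /* 2.4 include/net/tcp.h, roughly */
    static inline int tcp_wspace(struct sock *sk)
    {
        return sk->sndbuf - sk->wmem_queued;
    }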
Here is what appears to be happening:
Fstress opens one TCP connection and then starts sending multiple NFS
ops with different fhandles. The problems start when an NFS op with a
large response (like a read) gets 'stuck' in the NFS code for a few
microseconds while other NFS ops with smaller responses are being
processed. With every smaller response, the sk->wmem_queued value is
incremented. When the 'stuck' NFS read finally tries to send its
response, the send buffer is full (i.e. tcp_memory_free(sk) in
tcp_sendmsg() fails), and after a 30 second sleep (in tcp_sendmsg())
-EAGAIN is returned and the show is over...
I _guess_ what is supposed to happen is that the queued memory will be
freed (or reclaimed) when a socket buffer is freed (via kfree_skb()),
which in turn should cause the threads waiting for memory (i.e.
sleeping in tcp_sendmsg()) to be woken up via a call to
sk->write_space(). But this does not seem to be happening even when
the smaller replies are processed...
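(If I read the 2.4 code right, the release side looks roughly like
this -- paraphrasing from memory, so details may be off:)

    /* 2.4 tcp_free_skb(), roughly: freeing an acked skb gives its
     * space back to wmem_queued and notes that the queue shrank */
    static inline void tcp_free_skb(struct sock *sk, struct sk_buff *skb)
    {
        sk->tp_pinfo.af_tcp.queue_shrunk = 1;
        sk->wmem_queued -= skb->truesize;
        sk->forward_alloc += skb->truesize;
        __kfree_skb(skb);
    }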
Can anyone shed some light on what the heck is going on here, and
whether there are any patches, solutions, or ideas addressing this
problem?
TIA,
SteveD.
|> In <[email protected]>
|> Steve Dickson <[email protected]> wrote:
> So if I understand what you're saying, EAGAINs or partial writes are
> interpreted as fatal errors. This confuses me. EAGAINs or partial
> writes are flow-control issues, not fatal errors. Just like on the
> client side, shouldn't the thread sleep until there is room? Closing
> down the socket seems a bit drastic... Or am I missing something?
In general you are right, but in the Linux kNFSd something like flow
control is done in the RPC layer.
In kNFSd multiple nfsd threads share a single socket, and to avoid
mixing data a sendto call must be atomic. To guarantee this, the
threads sleep in the RPC layer until there is enough TCP write space.
Given that, EAGAIN can be interpreted as a sign that something has
gone wrong.
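(The flow control I mean is the check in svc_sock_enqueue(); quoting
roughly from memory, so the details may differ from your tree:)

    /* a socket is not handed to an nfsd thread unless the send buffer
     * has room for a worst-case reply, so the thread effectively waits
     * until there is enough write space.  Note this uses sock_wspace(),
     * which is exactly the calculation the patch replaces. */
    if (((svsk->sk_reserved + serv->sv_bufsz) * 2
         > sock_wspace(svsk->sk_sk))
        && !test_bit(SK_CLOSE, &svsk->sk_flags)
        && !test_bit(SK_CONN, &svsk->sk_flags)) {
            /* Don't enqueue while not enough space for reply */
            goto out_unlock;
    }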
But since the free-space calculation was wrong, this assumption was
broken. The patch attached to my previous mail corrects this.
--
Minoura Makoto <[email protected]>
Engineering Dept., VA Linux Systems Japan
|> In <[email protected]>
|> Steve Dickson <[email protected]> wrote:
> There seem to be some (probably known) issues with the flow control
> over TCP connections (on an SMP machine) to NFSD. Unfortunately,
> the fstress benchmark brings these issues out fairly nicely :-(
> This is occurring in a 2.4.20 kernel.
We found the same problem a few weeks ago and already
addressed it.
> The problem seems to stem from the fact that the queued memory in
> the TCP send buffer (i.e. sk->wmem_queued) is not being released
> (i.e. tcp_wspace(sk) becomes negative and never recovers).
First, sock_wspace() does not return the free space we actually
expect; it only counts the segments currently in flight.
Second, the TCP write_space callback is called only when the
SOCK_NOSPACE flag is asserted.
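(The 2.4 path is roughly the following, quoted from memory: freed
write-queue skbs mark the queue as shrunk, but the callback only
fires if SOCK_NOSPACE was set when space ran out:)

    /* 2.4 tcp_check_space(), roughly */
    static __inline__ void tcp_check_space(struct sock *sk)
    {
        struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);

        if (tp->queue_shrunk) {
            tp->queue_shrunk = 0;
            if (sk->socket &&
                test_bit(SOCK_NOSPACE, &sk->socket->flags))
                    tcp_new_space(sk);  /* ends up in sk->write_space(sk) */
        }
    }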
These two problems should be solved by the attached patch.
Could someone who knows the network stack tell us the impact on other
parts of the networking code if we change sock_wspace() itself to the
svc_sock_wspace() in the patch?
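(The core of the patch is along these lines; names are as in the
attached patch, but take this as a sketch rather than the diff itself:)

    /* for stream sockets, base the free-space calculation on
     * wmem_queued (via tcp_wspace()) instead of sock_wspace(), which
     * only reflects wmem_alloc, i.e. the segments currently in flight */
    static inline int
    svc_sock_wspace(struct svc_sock *svsk)
    {
        int wspace;

        if (svsk->sk_sock->type == SOCK_STREAM)
            wspace = tcp_wspace(svsk->sk_sk); /* sndbuf - wmem_queued */
        else
            wspace = sock_wspace(svsk->sk_sk);
        return wspace;
    }

The enqueue check then uses svc_sock_wspace() and sets SOCK_NOSPACE
when there is not enough room, so that the TCP write_space callback is
actually invoked once space becomes available.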
> with -EAGAINs. Initially this caused an oops because
> svc_delete_socket() was being called twice for the same socket [which
> was easily fixed by checking for the SK_DEAD bit in svsk->sk_flags],
> but now the tests just fail.
The current implementation closes the socket on EAGAIN/partial write,
but only after some delay. During this delay, queued messages are
unintentionally sent (when the NFS traffic is busy). We addressed this
by removing the delay (closing the socket immediately), but sending
the remaining data correctly would be better.
Since we use a backported version of the 2.5.x zero-copy NFSd, the
patch that addresses this cannot be applied to 2.4.x without some
modifications.
We think the socket lock mechanism (sk_sem) should also be backported
to serialize the output. Now that we have removed the MSG_DONTWAIT
flag, the TCP layer might sleep with the socket lock released, and
messages from other NFSd threads might get mixed in. Alternatively,
some other locking protocol in the TCP (or socket) layer could be
introduced to prevent the mixing.
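(Roughly the shape we have in mind, assuming a 2.5-style sk_sem in
struct svc_sock; a sketch, not a tested patch:)

    /* serialize all output on one svc_sock so that replies from
     * different nfsd threads cannot interleave, even if the TCP layer
     * sleeps waiting for write space */
    int
    svc_send(struct svc_rqst *rqstp)
    {
        struct svc_sock *svsk = rqstp->rq_sock;
        int len;

        down(&svsk->sk_sem);                /* one writer at a time */
        if (test_bit(SK_DEAD, &svsk->sk_flags))
            len = -ENOTCONN;
        else
            len = svsk->sk_sendto(rqstp);   /* svc_tcp_sendto() for TCP */
        up(&svsk->sk_sem);
        svc_sock_release(rqstp);
        return len;
    }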
--
Minoura Makoto <[email protected]>
Engineering Dept., VA Linux Systems Japan
|> In <[email protected]>
|> [email protected] (MINOURA Makoto) wrote:
> The current implementation closes the socket on EAGAIN/partial write,
> but only after some delay. During this delay, queued messages are
> unintentionally sent (when the NFS traffic is busy). We addressed this
> by removing the delay (closing the socket immediately), but sending
> the remaining data correctly would be better.
Forgot to note: I suppose the cause of the test failure is that
someone (another NFSd thread) sent something on the same socket before
the socket was actually closed. This breaks the `recordmark'-separated
message stream, and the peer (client) would be confused.
To address this,
> We think the socket lock mechanism (sk_sem) should also be backported
> to serialize the output.
should be enough as well.
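(For clarity, by `recordmark' I mean the 4-byte RPC record marking
used over TCP; names below are as in svc_tcp_sendto(), quoted roughly:)

    /* each reply begins with a 4-byte record mark: the top bit means
     * "last fragment" and the low 31 bits give the fragment length.
     * If another thread's bytes get interleaved with a partly sent
     * reply, the client can no longer locate these marks and loses
     * track of the record boundaries. */
    reclen = htonl(0x80000000 | (xbufp->len - 4));
    memcpy(xbufp->head[0].iov_base, &reclen, 4);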
--
Minoura Makoto <[email protected]>
Engineering Dept., VA Linux Systems Japan
MINOURA Makoto wrote:
>The current implementation closes the socket on EAGAIN/partial write,
>but only after some delay. During this delay, queued messages are
>unintentionally sent (when the NFS traffic is busy). We addressed this
>by removing the delay (closing the socket immediately), but sending
>the remaining data correctly would be better.
>
So if I understand what you're saying, EAGAINs or partial writes are
interpreted as fatal errors. This confuses me. EAGAINs or partial
writes are flow-control issues, not fatal errors. Just like on the
client side, shouldn't the thread sleep until there is room? Closing
down the socket seems a bit drastic... Or am I missing something?
SteveD.