Hi David,
I have a question about the case of non-blocking sends in
ip_build_xmit_slow(). While investigating a problem with the RH7.3 kernel
causing the Netapp filer IP stack to blow up, we've observed that use of the
MSG_DONTWAIT flag causes some pretty nasty behaviour.
The fact that fragments are immediately queued for sending means that if
sock_alloc_send_skb() fails at some point in the middle of building the
message, you end up having sent off a bunch of fragments for which there is
not even a header (which can be a large source of wasted bandwidth under
heavy NFS traffic).
The appended patch, which was originally designed purely to test inverting the
sending order of fragments (on the hypothesis that the receiving devices were
making buffer-management assumptions based on ordering), removes this effect
because it delays sending off the fragments until the entire message has been
built.
Would such a patch be acceptable, or is there a better way of doing this?
Cheers,
Trond
Hello!
Did you not solve this problem using the right write_space?
I remember we have already discussed this and, as far as I remember,
it came to a happy end. :-)
> Would such a patch be acceptable,
No, of course not. :-)
> or is there a better way of doing this?
A better way exists: just use a forced sock_wmalloc instead of
sock_alloc_send_skb on a non-blocking send for all the fragments
but the first.
But I still hope you will start by tuning your write_space
to send only when ~2*msg_size of space is available.
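Something like the following sketch of a write_space callback is what I mean;
my_udp_write_space(), MSG_SIZE and rpc_wake_up_sender() are only placeholder
names here, not existing kernel or RPC-layer identifiers:

static void my_udp_write_space(struct sock *sk)
{
	/* Only wake the sender once roughly twice the message size is
	 * free in the send buffer, so that the next sendmsg() has a
	 * good chance of building the whole datagram in one go. */
	int free = sk->sndbuf - atomic_read(&sk->wmem_alloc);

	if (free < 2 * MSG_SIZE)
		return;

	rpc_wake_up_sender(sk);	/* placeholder for the real wakeup call */
}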
Alexey
On Thursday 27 June 2002 18:34, [email protected] wrote:
> Did you not solve this problem using the right write_space?
Sure, I can add specific checks for (atomic_read(&sk->wmem_alloc) <
sk->sndbuf) in the RPC layer, however, I don't see why such a check couldn't
be put into ip_build_xmit() itself. Sending partial messages isn't a feature
that sounds like it would be particularly useful for any other applications
either.
However, what if the actual call to alloc_skb() fails?
Cheers,
Trond
Hello!
> > Did you not solve this problem using the right write_space?
>
> Sure, I can add specific checks for (atomic_read(&sk->wmem_alloc) <
> sk->sndbuf) in the RPC layer,
But it is there now.
static void
udp_write_space(struct sock *sk)
{
	struct rpc_xprt *xprt;

	if (!(xprt = xprt_from_sock(sk)))
		return;
	if (xprt->shutdown)
		return;

	/* Wait until we have enough socket memory. */
	if (sock_writeable(sk))
		return;
So I do not understand what you are talking about.
> Sending partial messages isn't a feature
> that sounds like it would be particularly useful for any other applications
> either.
The thing which is really useless is that your patch prepares skbs
and then drops them in the next line. With the same success you could
trigger BUG() there. :-) A right application just should not reach
this condition.
Anyway, I have to repeat:
>>A better way exists: just use a forced sock_wmalloc instead of
>>sock_alloc_send_skb on a non-blocking send for all the fragments
>>but the first.
> However, what if the actual call to alloc_skb() fails?
The same as if it had been lost by the network.
Alexey
On Thursday 27 June 2002 22:05, [email protected] wrote:
> static void
> udp_write_space(struct sock *sk)
> {
<snip>
> 	/* Wait until we have enough socket memory. */
> 	if (sock_writeable(sk))
> 		return;
You misunderstand the code. The above is in the write_space() callback. It
tells you when it is safe to wake up again *after* a send has failed. It
doesn't test the buffer size on the first sendmsg() call (the one that
fails).
> The thing which is really useless is that your patch prepares skbs
> and then drops them in the next line. With the same success you could
> trigger BUG() there. :-) A right application just should not reach
> this condition.
Are you seriously saying that all 'right' user applications should be testing
the socket buffer congestion before sending a non-blocking UDP message rather
than just testing sendmsg() for an EWOULDBLOCK return value???
According to the manpage, ioctl(SIOCOUTQ) didn't even work prior to 2.4.x...
The normal behaviour of the patch is to collect the fragments, then to send
off all the skbs to the device queue as soon as it is clear that no errors
have occurred.
The patch only drops the skbs if some socket error occurs that would force
you to exit the loop and return EAGAIN. Since the loop will be exited before
the fragment containing the UDP header (offset 0) gets sent, sending off all
the other fragments in the other skbs would serve merely to eat up bandwidth.
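In other words, the patch does roughly the following (an untested sketch of
the idea only; build_and_send(), more_fragments_to_build(), build_fragment()
and queue_to_device() are made-up stand-ins for the real per-fragment code in
ip_build_xmit_slow() and for output()):

static int build_and_send(struct sock *sk)	/* made-up helper */
{
	struct sk_buff_head frags;
	struct sk_buff *skb;
	int err = 0;

	skb_queue_head_init(&frags);
	while (more_fragments_to_build()) {
		skb = build_fragment(sk, &err);	/* allocate and fill one fragment */
		if (skb == NULL)
			goto drop;		/* nothing has hit the wire yet */
		__skb_queue_tail(&frags, skb);
	}

	/* Whole datagram built: hand every fragment to the device. */
	while ((skb = __skb_dequeue(&frags)) != NULL)
		queue_to_device(skb);
	return 0;

drop:
	skb_queue_purge(&frags);		/* free them all, send nothing */
	return err;
}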
I agree that for blocking calls, it is useful to send off each skb as it gets
allocated (and the patch could be amended to take this into account), but for
nonblocking I/O, it is definitely bad form...
Cheers,
Trond
Hello!
> Are you seriously saying that all 'right' user applications should be testing
> the socket buffer congestion before sending a non-blocking UDP message rather
> than just testing sendmsg() for an EWOULDBLOCK return value???
I am saying absolutely seriously that there is nothing more stupid
than preparing skbs only to drop them and returning EAGAIN to you.
_Nothing_, do you hear this?
Repeating for the third time in the hope that you eventually read the mail to the end:
>>>A better way exists: just use a forced sock_wmalloc instead of
>>>sock_alloc_send_skb on a non-blocking send for all the fragments
>>>but the first.
And, yes, until this is done, I have to be serious when saying
that any application using nonblocking sockets has to use select()
or even SIOCOUTQ. Your patch does not change anything in this.
Alexey
On Friday 28 June 2002 00:07, [email protected] wrote:
> I am saying absolutely seriously that there is nothing more stupid
> than preparing skbs only to drop them and returning EAGAIN to you.
> _Nothing_, do you hear this?
I hear you, but you still haven't explained why. What exactly would trigger
the BUG() you mentioned?
> Repeating for the third time in the hope that you eventually read the mail to the end:
> >>>A better way exists: just use a forced sock_wmalloc instead of
> >>>sock_alloc_send_skb on a non-blocking send for all the fragments
> >>>but the first.
I heard you the first time, and have been looking into it. However, that still
does not address the problem of memory allocation failure (yes - I know you
said it is the same as network loss - I still don't think we need to
simulate that).
If I understand correctly, it also means that you have to estimate socket
buffer memory usage prior to actually entering the loop (something which is
impossible to do accurately). Attached is a patch that I hope you will
comment on (without too many expletives please ;-))...
> And, yes, until this is done, I have to be serious when saying
> that any application using nonblocking sockets has to use select()
> or even SIOCOUTQ. Your patch does not change anything in this.
sendmsg() + select() alone should suffice. Anybody using those two should be
able to expect optimal message output without the socket buffer getting
filled with junk data.
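To be concrete, I mean a pattern roughly like this (a user-space sketch only;
send_datagram() is not a real API, just an illustration):

#include <sys/select.h>
#include <sys/socket.h>
#include <errno.h>

static ssize_t send_datagram(int fd, const void *buf, size_t len)
{
	fd_set wfds;
	ssize_t n;

	for (;;) {
		n = send(fd, buf, len, MSG_DONTWAIT);
		if (n >= 0 || errno != EAGAIN)
			return n;	/* sent, or a real error */

		/* The socket buffer was full: wait until write_space
		 * reports room again, then retry the whole datagram. */
		FD_ZERO(&wfds);
		FD_SET(fd, &wfds);
		if (select(fd + 1, NULL, &wfds, NULL, NULL) < 0)
			return -1;
	}
}

With the current code, the EAGAIN can arrive after a bunch of headerless
fragments have already been queued, which is exactly the junk data I mean.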
Cheers,
Trond
On Friday 28 June 2002 10:22, Trond Myklebust wrote:
> > Repeating for the third time in the hope that you eventually read the mail to the end:
> > >>>A better way exists: just use a forced sock_wmalloc instead of
> > >>>sock_alloc_send_skb on a non-blocking send for all the fragments
> > >>>but the first.
> Attached is a patch that I hope you will comment on (without too many
> expletives please ;-))...
After fixing the missing brackets around (flags&MSG_DONTWAIT), I did some
tests: NFS writes to a Solaris server, write size = 32k, UDP, over a switched
100Mbit/s network. The tests were done using the iozone program (see
http://www.iozone.org): 'iozone -c -e -t1 -s 120m -r128k'.
- With an ordinary kernel without the patch, I saw average write speeds of
~2MB/s (peaking at ~2.5MB/s in one trial).
- With the same kernel, but applying the sock_wmalloc() patch, write speeds
suddenly jump to ~4.5MB/s (peak was 5MB/s).
... and yes, I did reboot several times to switch between the two kernels in
order to check repeatability.
I did not expect the effect to be so large, and I certainly can't explain it;
however I hope you agree that it shows that fixing this bug *is* worth the
effort.
Cheers,
Trond
Hello!
> suddenly jump to ~4.5MB/s (peak was 5MB/s).
Hmm.. it is funny that you were satisfied with the previous value,
and it is funny that it still does not saturate the link.
> however I hope you agree that it shows that fixing this bug *is* worth the
> effort.
Of course. If you had noticed this a year or two or three ago, it would even
have been an urgent problem. But until now it was a problem with the status of
"well-known bogosity which requires some sane solution but can wait for some
good idea for an infinite time because of the absence of any real applications
sensing it". :-)
Alexey
On Friday 28 June 2002 20:21, Alexey Kuznetsov wrote:
> Hello!
>
> > suddenly jump to ~4.5MB/s (peak was 5MB/s).
>
> Hmm.. it is funny that you were satisfied with the previous value,
> and it is funny that it still does not saturate the link.
It is only recently that we have come far enough with implementing things like
round trip timing, congestion control, etc. to really start noticing these
effects. Without the RTT scheme on the UDP link, you simply don't see it (you
only notice a large difference with TCP).
Note that this covers the discrepancy between NFS over TCP and NFS over UDP
against that particular machine, so I do not expect further improvements.
The main reason why I don't expect to saturate the link is that these are NFS
*writes*, so random things like file semaphore contention, disk access and
write speeds on the server come into play.
> Of course. If you had noticed this a year or two or three ago, it would even
> have been an urgent problem. But until now it was a problem with the status of
> "well-known bogosity which requires some sane solution but can wait for some
> good idea for an infinite time because of the absence of any real applications
> sensing it". :-)
I've now got the application and a demonstration of what kind of fix is
needed. I hope you and Dave can work out a better patch for 2.4.20-pre
;-)...
Cheers,
Trond
From: Trond Myklebust <[email protected]>
Date: Mon, 1 Jul 2002 14:14:50 +0200
I've now got the application and a demonstration of what kind of fix is
needed. I hope you and Dave can work out a better patch for 2.4.20-pre
;-)...
Trond, does this patch fix your problem? It is exactly how Alexey
described the fix, and it should work.
I think the memory failure issue is totally moot. In fact, your
"accumulate skbs till we have them all, then send the whole bunch"
version of the fix is very bad because it defers the transmit
on the device until the whole set of fragments has been created.
--- net/ipv4/ip_output.c.~1~	Sat Aug 3 03:14:35 2002
+++ net/ipv4/ip_output.c	Sat Aug 3 03:20:49 2002
@@ -520,8 +520,18 @@
 		/*
 		 *	Get the memory we require with some space left for alignment.
 		 */
-
-		skb = sock_alloc_send_skb(sk, fraglen+hh_len+15, flags&MSG_DONTWAIT, &err);
+		if (!(flags & MSG_DONTWAIT) || nfrags == 0) {
+			skb = sock_alloc_send_skb(sk, fraglen + hh_len + 15,
+						  (flags & MSG_DONTWAIT), &err);
+		} else {
+			/* On a non-blocking write, we check for send buffer
+			 * usage on the first fragment only.
+			 */
+			skb = sock_wmalloc(sk, fraglen + hh_len + 15, 1,
+					   sk->allocation);
+			if (!skb)
+				err = -ENOBUFS;
+		}
 		if (skb == NULL)
 			goto error;
>>>>> " " == David S Miller <[email protected]> writes:
> Trond, does this patch fix your problem? It is exactly how
> Alexey described the fix, and it should work.
That looks good. In the 2.5.x kernel I've already worked around this
bug by increasing the socket buffer size. However, the problem isn't
just limited to NFS.
Concerning the 2.4.x kernel: it would be very nice if this fix made it
into 2.4.19, as the bug has already been known to crash a few
servers...
Cheers,
Trond
>>>>> " " == Trond Myklebust <[email protected]> writes:
> Concerning the 2.4.x kernel: it would be very nice if this fix
> made it into 2.4.19, as the bug has already been known to crash
> a few servers...
Bump the above request to read '...made it into 2.4.20...'. I replied
before I saw Marcelo's announcement.
Cheers,
Trond
From: Trond Myklebust <[email protected]>
Date: Mon, 5 Aug 2002 15:43:51 +0200
Concerning the 2.4.x kernel: it would be very nice if this fix made it
into 2.4.19, as the bug has already been known to crash a few
servers...
I plan to push this to Marcelo if I haven't already done
so :-)
From: [email protected]
Date: Tue, 6 Aug 2002 03:30:53 +0400 (MSD)
> the bug has already been known to crash a few
> servers...
Sorry? What crash do you speak about?
If you partial-fragment bomb NFS servers from a certain
vendor, they hard hang :-)
Hello!
> the bug has already been known to crash a few
> servers...
Sorry? What crash do you speak about?
Alexey
>>>>> " " == kuznet <[email protected]> writes:
> Hello!
>> the bug has already been known to crash a few servers...
> Sorry? What crash do you speak about?
You'll find it documented on RedHat's Bugzilla (can't remember the
exact reference - sorry). Basically the first RH-7.3 kernels were
causing a DOS on a couple of Netapps w/ Gigabit connections.
The DOS was traced to a combination of Linux flooding the server with
all these fragments w/o headers (our bug) and inadequate garbage
collection of said fragments on the Netapp side (their bug).
Cheers,
Trond
Trond Myklebust <[email protected]> writes:
>>>>>> " " == kuznet <[email protected]> writes:
> > Hello!
> >> the bug has already been known to crash a few servers...
> > Sorry? What crash do you speak about?
>You'll find it documented on RedHat's Bugzilla (can't remember the
>exact reference - sorry). Basically the first RH-7.3 kernels were
>causing a DOS on a couple of Netapps w/ Gigabit connections.
You didn't exactly need a NetApp for this. A RH 7.3 NFS client with a
Solaris 2.6 NFSv3 server box and a switched, trunked 100 MBit network
was very very sufficient. I have the mrtg printouts still on the wall
in my office. 46 hours of solid 93 Mbits/sec of fragmented NFS packets
chewing off traffic on its VLAN and dropping everything else out of
the backbone trunks. Every service and their grandmothers died around
here. :-)
Ah, the joys of NFS.
Regards
Henning
--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]
Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20