My apologies for the multiple posts. I got bitten the first time around by
my MUA's configuration.
----
Greetings.
For some time now, the kernel and I have been having an argument over what
the MTU should be for serving NFS over Infiniband. I say 65520, the
documented maximum for connected mode. But, so far, I've been unable to have
anything over 32192 remain stable.
Back in the 2.6.14 -> .15 period, sunrpc's sk_buff allocations were changed
from GFP_KERNEL to GFP_ATOMIC (mainline commit
b079fa7baa86b47579f3f60f86d03d21c76159b8). Understandably, this was to
prevent recursion through
the NFS and sunrpc code. This is fine for the most common MTU out there, as
the kernel is almost certain to find a free page. But, as one increases the
MTU, memory fragmentation starts to play a role in nixing these allocations.
These allocation failures ultimately result in sparse files being written
through NFS. Granted, many of my users' applications are oblivious to
this because they don't check for such errors. But it would be nice if the
kernel were more resilient in this regard.
For a few months now, I've been running with sunrpc sk_buff allocations using
GFP_NOFS instead, which allows for dirty data to be flushed out and still
avoids recursion through sunrpc. With this, I've been able to increase the
stable MTU to 32192, but no further: eventually there is no dirty data
left, and memory fragmentation becomes mostly due to yet-to-be-sync'ed
filesystem data. There's also the matter that using GFP_NOFS for this can
slow down NFS quite a bit.
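For reference, the knob involved here is the per-socket allocation mode:
sunrpc sets sk->sk_allocation on its transport socket, and the networking
core later passes that to its sk_buff allocators. A minimal sketch of the
two variants discussed above (illustrative only, not the exact sunrpc code):

/* 2.6.15+ behaviour: never sleep, fail fast.  Avoids recursing into
 * NFS/sunrpc from reclaim, but gives up under memory fragmentation. */
sk->sk_allocation = GFP_ATOMIC;

/* My experiment: allow reclaim, so dirty data can be flushed, while
 * still forbidding re-entry into filesystem code. */
sk->sk_allocation = GFP_NOFS;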
In regrouping for my next tack at this, I noticed that all stack traces go
through ip_append_data(). This would be ipv6_append_data() in the IPv6 case.
A _very_ rough draft that would have ip_append_data() temporarily drop down
to a smaller fake MTU follows ...
diff -adNpru linux-2.6.35.2/net/ipv4/ip_output.c devel-2.6.35.2/net/ipv4/ip_output.c
--- linux-2.6.35.2/net/ipv4/ip_output.c 2010-08-13 14:44:56.000000000 -0600
+++ devel-2.6.35.2/net/ipv4/ip_output.c 2010-08-14 17:09:46.000000000 -0600
@@ -801,10 +801,10 @@ int ip_append_data(struct sock *sk,
         int exthdrlen;
         int mtu;
         int copy;
-        int err;
+        int err = 0;
         int offset = 0;
         unsigned int maxfraglen, fragheaderlen;
-        int csummode = CHECKSUM_NONE;
+        int csummode;
         struct rtable *rt;
 
         if (flags&MSG_PROBE)
@@ -852,10 +852,9 @@ int ip_append_data(struct sock *sk,
                 exthdrlen = 0;
                 mtu = inet->cork.fragsize;
         }
-        hh_len = LL_RESERVED_SPACE(rt->u.dst.dev);
+        hh_len = LL_RESERVED_SPACE(rt->u.dst.dev);
 
         fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
-        maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
         if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
                 ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport,
@@ -863,6 +862,12 @@ int ip_append_data(struct sock *sk,
                 return -EMSGSIZE;
         }
 
+        inet->cork.length += length;
+
+retry_with_smaller_mtu_data:
+        csummode = CHECKSUM_NONE;
+        maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
+
         /*
          * transhdrlen > 0 means that this is the first fragment and we wish
          * it won't be fragmented in the future.
@@ -875,15 +880,19 @@ int ip_append_data(struct sock *sk,
         skb = skb_peek_tail(&sk->sk_write_queue);
 
-        inet->cork.length += length;
-        if (((length > mtu) || (skb && skb_is_gso(skb))) &&
+        if ((err == 0) && ((length > mtu) || (skb && skb_is_gso(skb))) &&
             (sk->sk_protocol == IPPROTO_UDP) &&
             (rt->u.dst.dev->features & NETIF_F_UFO)) {
                 err = ip_ufo_append_data(sk, getfrag, from, length, hh_len,
                                          fragheaderlen, transhdrlen, mtu,
                                          flags);
-                if (err)
-                        goto error;
+                if (err) {
+                        if (mtu == ETH_DATA_LEN || err != -ENOBUFS)
+                                goto error;
+                        mtu = ETH_DATA_LEN;
+                        goto retry_with_smaller_mtu_data;
+                }
+
                 return 0;
         }
@@ -957,8 +966,12 @@ alloc_new_skb:
                                            time stamped */
                                         ipc->shtx.flags = 0;
                                 }
-                        if (skb == NULL)
-                                goto error;
+                        if (skb == NULL) {
+                                if (mtu == ETH_DATA_LEN || err != -ENOBUFS)
+                                        goto error;
+                                mtu = ETH_DATA_LEN;
+                                goto retry_with_smaller_mtu_data;
+                        }
 
                         /*
                          * Fill in the control structures
@@ -1112,7 +1125,6 @@ ssize_t ip_append_page(struct sock *sk,
         mtu = inet->cork.fragsize;
 
         fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
-        maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
         if (inet->cork.length + size > 0xFFFF - fragheaderlen) {
                 ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport, mtu);
@@ -1123,6 +1135,7 @@ ssize_t ip_append_page(struct sock *sk,
                 return -EINVAL;
 
         inet->cork.length += size;
+
         if ((size + skb->len > mtu) &&
             (sk->sk_protocol == IPPROTO_UDP) &&
             (rt->u.dst.dev->features & NETIF_F_UFO)) {
@@ -1130,6 +1143,8 @@ ssize_t ip_append_page(struct sock *sk,
                 skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
         }
 
+retry_with_smaller_mtu_page:
+        maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
         while (size > 0) {
                 int i;
@@ -1153,8 +1168,13 @@ ssize_t ip_append_page(struct sock *sk,
                 alloclen = fragheaderlen + hh_len + fraggap + 15;
                 skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
                 if (unlikely(!skb)) {
-                        err = -ENOBUFS;
-                        goto error;
+                        if (mtu == ETH_DATA_LEN) {
+                                err = -ENOBUFS;
+                                goto error;
+                        }
+
+                        mtu = ETH_DATA_LEN;
+                        goto retry_with_smaller_mtu_page;
                 }
 
                 /*
Now, I don't have this working quite right yet, but in the meantime, I'd
appreciate comments on whether this is an appropriate path to follow and/or
ideas on other avenues I should be exploring instead.
Thanks.
Marc.
+----------------------------------+----------------------------------+
| Marc Aurele La France | work: 1-780-492-9310 |
| Academic Information and | fax: 1-780-492-1729 |
| Communications Technologies | email: [email protected] |
| 352 General Services Building +----------------------------------+
| University of Alberta | |
| Edmonton, Alberta | Standard disclaimers apply |
| T6G 2H1 | |
| CANADA | |
+----------------------------------+----------------------------------+
On Mon, 23 Aug 2010 08:44:37 -0600 (MDT)
Marc Aurele La France <[email protected]> wrote:
> [...]
> In regrouping for my next tack at this, I noticed that all stack traces go
> through ip_append_data(). This would be ipv6_append_data() in the IPv6 case.
> A _very_ rough draft that would have ip_append_data() temporarily drop down
> to a smaller fake MTU follows ...
Why doesn't NFS generate page-size fragments? Does Infiniband or your
device not support this? Anything that requires higher-order allocation
is going to be unstable under load. Let's fix the cause, not apply a
band-aid to the symptom.
On Mon, 2010-08-23 at 08:44 -0600, Marc Aurele La France wrote:
> For some time now, the kernel and I have been having an argument over what
> the MTU should be for serving NFS over Infiniband. I say 65520, the
> documented maximum for connected mode. But, so far, I've been unable to have
> anything over 32192 remain stable.
>
> Back in the 2.6.14 -> .15 period, sunrpc's sk_buff allocations were changed
> from GFP_KERNEL to GFP_ATOMIC (mainline commit
> b079fa7baa86b47579f3f60f86d03d21c76159b8). Understandably, this was to
> prevent recursion through
> the NFS and sunrpc code. This is fine for the most common MTU out there, as
> the kernel is almost certain to find a free page. But, as one increases the
> MTU, memory fragmentation starts to play a role in nixing these allocations.
[...]
I'm not familiar with the NFS server, but what you're saying suggests
that this code needs a more radical rethink.
Firstly, I don't see why NFS should require each packet's payload to be
contiguous. It could use page fragments and then leave it to the
networking core to linearize the buffer if necessary for stupid
hardware.
Secondly, if it's doing its own segmentation it can't take advantage of
TSO. This is likely to be a real drag on performance. If it were
taking advantage of TSO then the effective MTU over TCP/IP could be
about 64K and it would already have hit this problem on Ethernet.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
On Mon, 23 Aug 2010, Stephen Hemminger wrote:
> On Mon, 23 Aug 2010 08:44:37 -0600 (MDT)
> Marc Aurele La France <[email protected]> wrote:
>> [...]
> Why doesn't NFS generate page-size fragments? Does Infiniband or your
> device not support this? Anything that requires higher-order allocation
> is going to be unstable under load. Let's fix the cause, not apply a
> band-aid to the symptom.
From what I can tell, IP fragmentation is done centrally.
The MTU is a device attribute, yes. But, here, it is ip_append_data(),
not NFS nor the device driver, whose responsibility it is to break up the
payload into fragments, either by itself or using any facility supported
by the adapter. What I'm saying is that there's no reason to require all
fragments, except the last, to be MTU-sized. The RFCs I've looked at
allow them to be shorter, which can be used to advantage when MTU-sized
fragments cannot be allocated in a memory-fragmentation scenario, instead
of reporting an error.
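The 8-byte granularity is the only hard constraint: RFC 791's
fragment-offset field counts in units of 8 octets, which is exactly what
the maxfraglen computation in ip_append_data() encodes. A worked example
with my target MTU:

/* Every fragment except the last must carry a payload that is a
 * multiple of 8 octets; nothing says it must be the largest such
 * multiple that fits in the MTU. */
fragheaderlen = sizeof(struct iphdr);  /* 20 bytes, no IP options */
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
/* mtu = 65520: (65500 & ~7) + 20 = 65496 + 20 = 65516 */

Any smaller multiple of 8 is an equally legal fragment payload for the
same datagram.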
Marc.
On Tue, 2010-08-24 at 09:14 -0600, Marc Aurele La France wrote:
> [...]
> From what I can tell, IP fragmentation is done centrally.
[...]
Stephen and I are not talking about IP fragmentation, but about the
ability to append 'fragments' to an skb rather than putting the entire
packet payload in a linear buffer. See
<http://vger.kernel.org/~davem/skb_data.html>.
Ben.
On Tue, 24 Aug 2010, Ben Hutchings wrote:
> [...]
> Stephen and I are not talking about IP fragmentation, but about the
> ability to append 'fragments' to an skb rather than putting the entire
> packet payload in a linear buffer. See
> <http://vger.kernel.org/~davem/skb_data.html>.
Any payload has to either fit in the MTU, or has to be broken up into
MTU-sized (or less) fragments, come hell or high water. That this is done
centrally is a good thing. It is the "(or less)" part that I am working
towards here.
Marc.
On Tue, 24 Aug 2010 at 13:49 -0600, Marc Aurele La France wrote:
>
> Any payload has to either fit in the MTU, or has to be broken up into
> MTU-sized (or less) fragments, come hell or high water. That this is done
> centrally is a good thing. It is the "(or less)" part that I am working
> towards here.
>
Could you post a full stack trace, to help me understand the path from
NFS to ip_append_data?
I suspect this is UDP transport?
This reminds me of a patch I wrote for IPv6: we were allocating a huge
(MTU-sized) buffer, just to fill a few bytes in it...
commit 72e09ad107e78d69ff4d3b97a69f0aad2b77280f
Author: Eric Dumazet <[email protected]>
Date: Sat Jun 5 03:03:30 2010 -0700
ipv6: avoid high order allocations
With mtu=9000, mld_newpack() use order-2 GFP_ATOMIC allocations, that
are very unreliable, on machines where PAGE_SIZE=4K
Limit allocated skbs to be at most one page. (order-0 allocations)
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
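The heart of that change, paraphrased from memory rather than quoted
verbatim, was to clamp the allocation in mld_newpack() to a single page
no matter what the device MTU says:

/* net/ipv6/mcast.c, mld_newpack(), roughly: instead of allocating an
 * MTU-sized buffer for a report of a few bytes, cap it at order-0. */
size += LL_ALLOCATED_SPACE(dev);
size = min_t(int, size, SKB_MAX_ORDER(0, 0));
skb = sock_alloc_send_skb(sk, size, 1, &err);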
On Tue, 24 Aug 2010, Eric Dumazet wrote:
> [...]
> Could you post a full stack trace, to help me understand the path from
> NFS to ip_append_data?
[<ffffffff810a5abe>] __alloc_pages_nodemask+0x617/0x692
[<ffffffff81061688>] ? mark_held_locks+0x49/0x64
[<ffffffff810d018b>] kmalloc_large_node+0x61/0x9e
[<ffffffff810d3050>] __kmalloc_node_track_caller+0x32/0x159
[<ffffffff812612da>] ? sock_alloc_send_pskb+0xc9/0x2ea
[<ffffffff81265cc6>] __alloc_skb+0x74/0x163
[<ffffffff812612da>] sock_alloc_send_pskb+0xc9/0x2ea
[<ffffffff81061688>] ? mark_held_locks+0x49/0x64
[<ffffffff81261510>] sock_alloc_send_skb+0x15/0x17
[<ffffffff81299317>] ip_append_data+0x500/0x9d0
[<ffffffff8103feae>] ? local_bh_enable+0xb7/0xbd
[<ffffffff8129a804>] ? ip_generic_getfrag+0x0/0x92
[<ffffffff81292bcd>] ? ip_route_output_flow+0x82/0x1f9
[<ffffffff812b8990>] udp_sendmsg+0x4ec/0x60c
[<ffffffff812bf2ac>] inet_sendmsg+0x4b/0x58
[<ffffffff8125dd89>] sock_sendmsg+0xd9/0xfa
[<ffffffff81063fb0>] ? __lock_acquire+0x787/0x7f5
[<ffffffff81063fb0>] ? __lock_acquire+0x787/0x7f5
[<ffffffff8125fcf5>] kernel_sendmsg+0x37/0x43
[<ffffffffa0267cd2>] xs_send_kvec+0x88/0x93 [sunrpc]
[<ffffffff812f08dc>] ? _raw_spin_unlock_irqrestore+0x44/0x4c
[<ffffffffa0267d5c>] xs_sendpages+0x7f/0x1be [sunrpc]
[<ffffffffa026952f>] xs_udp_send_request+0x5b/0x103 [sunrpc]
[<ffffffffa0266c0a>] xprt_transmit+0x11f/0x1f5 [sunrpc]
[<ffffffffa02ea140>] ? nfs3_xdr_writeargs+0x0/0x82 [nfs]
[<ffffffffa02648b9>] call_transmit+0x218/0x25e [sunrpc]
[<ffffffffa026aced>] __rpc_execute+0x9b/0x288 [sunrpc]
[<ffffffffa026aeef>] rpc_async_schedule+0x15/0x17 [sunrpc]
[<ffffffff81051137>] worker_thread+0x1ed/0x2e6
[<ffffffff810510e1>] ? worker_thread+0x197/0x2e6
[<ffffffffa026aeda>] ? rpc_async_schedule+0x0/0x17 [sunrpc]
[<ffffffff8105450f>] ? autoremove_wake_function+0x0/0x3d
[<ffffffff81050f4a>] ? worker_thread+0x0/0x2e6
[<ffffffff810541b2>] kthread+0x82/0x8a
[<ffffffff81002f14>] kernel_thread_helper+0x4/0x10
[<ffffffff81030d20>] ? finish_task_switch+0x0/0xd6
[<ffffffff81002f10>] ? kernel_thread_helper+0x0/0x10
There are many other variations as well.
> I suspect this is UDP transport?
Yes.
> This reminds me of a patch I wrote for IPv6: we were allocating a huge
> (MTU-sized) buffer, just to fill a few bytes in it...
Humm. Interesting. Thanks for the pointer.
Marc.
On Tue, 2010-08-24 at 13:49 -0600, Marc Aurele La France wrote:
> [...]
> Any payload has to either fit in the MTU, or has to be broken up into
> MTU-sized (or less) fragments, come hell or high water. That this is done
> centrally is a good thing.
Not necessarily. Offloading it to hardware, where possible, is usually
a performance win.
> It is the "(or less)" part that I am working towards here.
The inability to allocate large linear buffers is not a good reason to
generate packets smaller than the MTU. You are working around the real
problem.
Ben.
On Tue, 24 Aug 2010 23:20:41 +0100
Ben Hutchings <[email protected]> wrote:
> [...]
> The inability to allocate large linear buffers is not a good reason to
> generate packets smaller than the MTU. You are working around the real
> problem.
If the NFS server is smart enough to generate:
    header (skb) + one or more pages in the fragment list
then IP fragmentation could be done by allocating new (small) header skbs
and assigning the same pages to multiple skbs using the page refcount.
It obviously isn't working that way.
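In skb terms, the sharing would look something like this (a sketch with
generic helpers, not code from any existing path):

/* Attach an existing payload page to one fragment's skb without
 * copying it; the elevated refcount keeps the page alive until every
 * skb referencing it has been freed. */
get_page(page);
skb_fill_page_desc(skb, i, page, offset, len);
skb->len      += len;
skb->data_len += len;
skb->truesize += len;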
The whole problem is moot because NFS over UDP has known data-corruption
issues in the face of packet loss. The IP ID used to reassemble fragments
can easily wrap around, causing old data to be grouped with new data, and
the UDP checksum is so weak that the resulting UDP packet will be consumed
by the NFS client and passed to the user application as a corrupted disk
block.
DON'T USE NFS OVER UDP!
On Tue, 24 Aug 2010 at 15:39 -0700, Stephen Hemminger wrote:
> If the NFS server is smart enough to generate:
>     header (skb) + one or more pages in the fragment list
> then IP fragmentation could be done by allocating new (small) header skbs
> and assigning the same pages to multiple skbs using the page refcount.
>
> It obviously isn't working that way.
>
It is, but ip_append_data() allocates a huge head if the MTU is huge.
NFS is trying to build a paged skb, to avoid order-X allocations (X > 0).
> [...]
> DON'T USE NFS OVER UDP!
But Marc's point is to use a big MTU, so that no IP fragmentation is
needed.
All UDP applications using MSG_MORE will hit the order-2 allocations if
MTU=9000, for example...
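A trivial userspace sketch of the pattern that lands in ip_append_data():

/* Each MSG_MORE send is appended to the pending datagram by
 * ip_append_data(); with MTU=9000 the linear head of that datagram
 * needs an order-2 (16KB) contiguous allocation on 4K pages. */
send(fd, hdr, hdr_len, MSG_MORE);   /* buffered; datagram stays open */
send(fd, payload, payload_len, 0);  /* final piece flushes it */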
On Wed, 25 Aug 2010 at 16:10 +0400, Alexey Kuznetsov wrote:
> Hello!
>
> > It is, but ip_append_data() allocates a huge head if the MTU is huge.
>
> Hmm, strange; as I remember, it was supposed to work right.
>
> If the device supports SG (which is required to accept non-linear skbs
> anyway), then ip_append_* should allocate skbs not rounded up to the mtu,
> and we should allocate a small skb with the NFS header only. Does it not
> work?
>
> I can only guess one possible trap: people could do _one_ huge
> ip_append_data() (instead of the "planned" scenario, where the header is
> sent with ip_append_data() and the following payload is appended with
> ip_append_page()). A huge ip_append_data() will generate a huge skb
> indeed. Is this the problem?
>
> BTW, this issue could be revisited and the "will generate huge" behaviour
> reconsidered. Automatic generation of fragmented skbs was deliberately
> suppressed, because it was found that all devices existing when this code
> was written were strongly biased against SG. The current code tries to
> _avoid_ generating non-linear skbs, unless it is intended for zero-copy,
> which compensated for the bias against SG. Modern hardware should work
> better.
>
> Alexey
Hi Alexey,
A few hours ago, I privately asked Marc Aurele whether his infiniband
device supports NETIF_F_SG ;)
Thanks!
On Tue, 24 Aug 2010, Stephen Hemminger wrote:
> On Tue, 24 Aug 2010 23:20:41 +0100
> Ben Hutchings <[email protected]> wrote:
>> [...]
>> Not necessarily. Offloading it to hardware, where possible, is usually
>> a performance win.
ip_append_data() deals with that already.
>>> It is the "(or less)" part that I am working towards here.
>> The inability to allocate large linear buffers is not a good reason to
>> generate packets smaller than the MTU.
Generating smaller-than-MTU fragments is better than giving up and
returning an error in my book.
> If the NFS server is smart enough to generate:
>     header (skb) + one or more pages in the fragment list
> then IP fragmentation could be done by allocating new (small) header skbs
> and assigning the same pages to multiple skbs using the page refcount.
> It obviously isn't working that way.
Point of clarification: we're talking about the client here, not the
server. But, yes, it doesn't work that way.
> [...]
> DON'T USE NFS OVER UDP!
Steady now. There's no need to YELL or be arrogant. You and I both know
there's a place for NFS over UDP. That's not changing any time soon. While
I'm aware of the issue you brought up, it is separate from the one at hand
in this discussion.
I do want to thank you, however, for reminding me of TCP. It's something
20/20 hindsight says I should have checked out before starting this thread.
Logistically, it'll be a few days before I can do so though. If that allows
me to increase the MTU all the way up to 65520, then this UDP thing will
likely remain unresolved.
Thanks.
Marc.
On Thu, 26 Aug 2010 at 05:40 -0600, Marc Aurele La France wrote:
> [...]
> I do want to thank you, however, for reminding me of TCP. It's something
> 20/20 hindsight says I should have checked out before starting this thread.
> Logistically, it'll be a few days before I can do so though. If that allows
> me to increase the MTU all the way up to 65520, then this UDP thing will
> likely remain unresolved.
>
Unfortunately, your infiniband device lacks NETIF_F_SG support.
So an MTU even a bit larger than PAGE_SIZE minus overhead will need
high-order allocations?
On Thu, 26 Aug 2010, Eric Dumazet wrote:
> [...]
> Unfortunately, your infiniband device lacks NETIF_F_SG support.
Oh, the device itself probably has something similar, but ipoib
(IP-over-Infiniband) doesn't export that capability.
> So an MTU even a bit larger than PAGE_SIZE minus overhead will need
> high-order allocations?
Right. And a 65520 MTU allocates sk_buff's with 128K contiguous payloads.
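To spell out the arithmetic:

/* SKB_DATA_ALIGN(65520 + headroom) + sizeof(struct skb_shared_info)
 * already exceeds 65536, so kmalloc() rounds the request up to its
 * next power-of-two size class: 131072 bytes, i.e. an order-5,
 * 32-contiguous-page allocation on a 4K-page machine. */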
Marc.
On Aug 26, 2010, at 7:40 AM, Marc Aurele La France wrote:
> [...]
> I do want to thank you, however, for reminding me of TCP. It's something 20/20 hindsight says I should have checked out before starting this thread. Logistically, it'll be a few days before I can do so though. If that allows me to increase the MTU all the way up to 65520, then this UDP thing will likely remain unresolved.
On advanced cluster-area networks with large MTUs, the ACK packets in TCP will probably kill your performance. That's one of the main reasons we keep NFS over UDP on life support! :-)
--
chuck[dot]lever[at]oracle[dot]com
On Thu, 26 Aug 2010 08:43:42 -0600 (Mountain Daylight Time)
Marc Aurele La France <[email protected]> wrote:
> On Thu, 26 Aug 2010, Eric Dumazet wrote:
> [...]
> > Unfortunately, your infiniband device lacks NETIF_F_SG support.
>
> Oh, the device itself probably has something similar, but ipoib
> (IP-over-Infiniband) doesn't export that capability.
>
> > So an MTU even a bit larger than PAGE_SIZE minus overhead will need
> > high-order allocations?
>
> Right. And a 65520 MTU allocates sk_buff's with 128K contiguous payloads.
The Infiniband device driver needs to be fixed to do SG and checksum
offload. Otherwise it is insane to try to run a large MTU over it. I even
wonder if dev_change_mtu() should reject an MTU > PAGE_SIZE for devices
that don't do scatter/gather, or at least raise a warning.
From: Stephen Hemminger <[email protected]>
Date: Thu, 26 Aug 2010 16:53:59 -0700
> On Thu, 26 Aug 2010 08:43:42 -0600 (Mountain Daylight Time)
> Marc Aurele La France <[email protected]> wrote:
>
>> Right. And a 65520 MTU allocates sk_buff's with 128K contiguous payloads.
>
> The Infiniband device driver needs to be fixed to do SG and checksum
> offload.
Agreed, this problem is in the infiniband layer and should be fixed
there.
But I fear there is a real potential blocker for this: if the
infiniband layer can't checksum transmit packets in hardware, we cannot
legitimately add SG support.
Paged SKBs can have references to page cache pages and similar. These
can be updated asynchronously to the transmit; there is no locking at
all to freeze the contents, and therefore full checksum offload is
required to support SG correctly.
So don't get the idea to compute the checksum in software in the
infiniband layer and advertise hw checksumming support to get around
this :-)
> The Infiniband device driver needs to be fixed to do SG and checksum
> offload. Otherwise it is insane to try to run a large MTU over it. I even
> wonder if dev_change_mtu() should reject an MTU > PAGE_SIZE for devices
> that don't do scatter/gather, or at least raise a warning.
It's not possible to "fix" the driver to do checksum offload, since the
underlying hardware does not support it. Theoretically we could handle
SG but of course there's no point in that without checksum offload.
I think there is some confusion about what IPoIB is in this thread, so
let me try to give some basic background to help the discussion. There
are two "modes" that an IPoIB interface can operate in: datagram mode
and connected mode.
In datagram mode, packets given to the IPoIB driver are sent as IB
unreliable datagram messages, which means each skb turns into one packet
on the wire -- very much like the ethernet case. In this mode, the MTU
is limited by the MTU on the IB side, which is typically either 2K or 4K
depending on the adapter and the switches involved. Modern IB adapters
do support checksum offload and large send offload for datagrams, so we
can and do enable SG and IP_CSUM.
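(In driver terms that is just a matter of setting feature flags when the
HCA reports the offloads; roughly, from memory, along the lines of:

if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM)
        dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;

where hca_caps mirrors the device attribute flags.)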
In connected mode, the IPoIB driver actually makes a reliable connection
to each peer. For reliable connections, IB adapters can actually send
messages up to 4GB, with the adapter handling all the segmentation and
transport level acks etc. -- the host system simply queues one work
request for each message of any size. These work requests do support
gather/scatter, but no existing adapter supports checksum offload for
messages on reliable connections.
However, since reliable connections support arbitrary-sized messages, in
connected mode the IPoIB driver allows an MTU up to roughly the maximum
64K IP message size. (I don't think anyone has tried it with bigger
IPv6 jumbograms ;)
It does seem that, even with all the horrible memory-allocation problems
caused by requiring huge linear skbs, connected mode offers very good
performance for at least some real-world uses (although apparently NFS is
not one such use). In fact, as far as I know, connected mode with a huge
MTU continues to outperform datagram mode even with LSO and LRO (although
I don't have any particularly recent numbers). So I don't think we want
to completely disallow such uses.
- R.
By the way, for the original poster: is using NFS/RDMA a possibility?
That might give even better performance than any config of IPoIB if you
have an InfiniBand fabric anyway.
- R.
On Fri, 27 Aug 2010, Roland Dreier wrote:
> By the way, for the original poster: is using NFS/RDMA a possibility?
> That might give even better performance than any config of IPoIB if you
> have an InfiniBand fabric anyway.
Yes, NFS/RDMA is a possibility I need to look at as well.
Thanks.
Marc.