2014-02-11 12:42:23

by McAninley, Jason

Subject: Question regarding NFS 4.0 buffer sizes

I'm looking for detailed documentation regarding some of the innards of NFS 4.0, without necessarily having to read through the source code (if such documentation exists).

Specifically, I'm interested in the relationship between NFS's rsize/wsize mount options and some of the lower-level networking buffers. The buffer parameters I have come across (and my current settings) include:

- sysctl's net.core.{r,w}mem_default: 229376
- sysctl's net.core.{r,w}mem_max: 131071
- sysctl's net.ipv4.tcp_{r,w}mem 4096 87380 4194304
- #define RPCSVC_MAXPAYLOAD (1*1024*1024u)
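
For reference, the sysctls can be inspected and changed at runtime with sysctl; the 4MB figure below is only an example, chosen to match the ceiling already present in tcp_rmem/tcp_wmem:

# current values
sysctl net.core.rmem_default net.core.wmem_default net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# raise the socket buffer ceilings to 4MB
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304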

When I run Wireshark during an NFS transfer, I see MAX{READ,WRITE} attributes returned from GETATTR with the value of 1MB. I'm guessing this corresponds to the limit set by RPCSVC_MAXPAYLOAD? However, the maximum packet size I'm recording in practice is ~32K.

In fact, it seems like regardless of the change to {r,w}size (I've tried 32K, 64K, 128K) I am not seeing changes in the max packet size.
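
(For reference, a typical mount invocation for these tests looks something like the following; the server export and mount point are just placeholders:)

mount -t nfs4 -o rsize=131072,wsize=131072 server:/export /mnt/nfs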

This is leading me to investigate the buffer sizes on the client/server. My thought is that if a buffer exists that is too small, NFS/Kernel will ship a packet prior to reaching the MAXWRITE size.

Any input is appreciated.

-Jason


2014-02-11 21:54:41

by J. Bruce Fields

Subject: Re: Question regarding NFS 4.0 buffer sizes

On Tue, Feb 11, 2014 at 09:17:03PM +0000, McAninley, Jason wrote:
> > > My understanding is that setting {r,w}size doesn't guarantee that
> > > will be the agreed-upon value. Apparently one must check the value in
> > > /proc. I have verified this by checking the value of /proc/XXXX/mounts,
> > > where XXXX is the pid for nfsv4.0-svc on the client. It is set to a
> > > value >32K.
> >
> > I don't think that actually takes into account the value returned from
> > the server. If you watch the mount in wireshark early on you should see
> > it query the server's rsize and wsize, and you may find that's less.
>
> I have seen the GETATTR return MAXREAD and MAXWRITE attribute values set to 1MB during testing with Wireshark. My educated guess is that this corresponds to RPCSVC_MAXPAYLOAD defined in linux/nfsd/const.h. Would anyone agree with this?

That's an upper limit and a server without a lot of memory may default
to something smaller. The GETATTR shows that it isn't, though.

> > If you haven't already I'd first recommend measuring your NFS read
> > and write throughput and comparing it to what you can get from the
> > network and the server's disk. No point tuning something if it
> > turns out it's already working.
>
> I have measured sequential writes using dd with 4k block size.

What's your dd commandline?

> The NFS
> share maps to a large SSD drive on the server. My understanding is
> that we have jumbo frames enabled (i.e. MTU 8k). The share is mounted
> with rsize/wsize of 32k. We're seeing write speeds of 200 MB/sec
> (mega-bytes). We have 10 GigE connections between the server and
> client with a single switch + multipathing from the client.

So both network and disk should be able to do more than that, but it
would still be worth testing both (with e.g. tcpperf and dd) just to
make sure there's nothing wrong with either.
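
Something along these lines would do; iperf here is only an example of a raw TCP benchmark, and the hostnames/paths are placeholders:

# raw TCP throughput (run "iperf -s" on the server first)
iperf -c server.example.com -t 30

# raw disk throughput on the server, bypassing the page cache
dd if=/dev/zero of=/path/on/ssd/testfile bs=1M count=5120 oflag=direct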

> I will admit I have a weak networking background, but it seems like we could achieve speeds much greater than 200 MB/sec, considering the pipes are very wide and the MTU is large. Again, I'm concerned there is a buffer somewhere in the Kernel that is flushing prematurely (32k, instead of wsize).
>
> If there is detailed documentation online that I have overlooked, I would much appreciate a pointer in that direction!

Also, what kernel versions are you on?

--b.

2014-02-11 21:18:18

by McAninley, Jason

Subject: RE: Question regarding NFS 4.0 buffer sizes

> > My understanding is that setting {r,w}size doesn't guarantee that
> > will be the agreed-upon value. Apparently one must check the value in
> > /proc. I have verified this by checking the value of /proc/XXXX/mounts,
> > where XXXX is the pid for nfsv4.0-svc on the client. It is set to a
> > value >32K.
>
> I don't think that actually takes into account the value returned from
> the server. If you watch the mount in wireshark early on you should see
> it query the server's rsize and wsize, and you may find that's less.

I have seen the GETATTR return MAXREAD and MAXWRITE attribute values set to 1MB during testing with Wireshark. My educated guess is that this corresponds to RPCSVC_MAXPAYLOAD defined in linux/nfsd/const.h. Would anyone agree with this?


> If you haven't already I'd first recommend measuring your NFS read and
> write throughput and comparing it to what you can get from the network
> and the server's disk. No point tuning something if it turns out it's
> already working.

I have measured sequential writes using dd with 4k block size. The NFS share maps to a large SSD drive on the server. My understanding is that we have jumbo frames enabled (i.e. MTU 8k). The share is mounted with rsize/wsize of 32k. We're seeing write speeds of 200 MB/sec (mega-bytes). We have 10 GigE connections between the server and client with a single switch + multipathing from the client.

I will admit I have a weak networking background, but it seems like we could achieve speeds much greater than 200 MB/sec, considering the pipes are very wide and the MTU is large. Again, I'm concerned there is a buffer somewhere in the Kernel that is flushing prematurely (32k, instead of wsize).

If there is detailed documentation online that I have overlooked, I would much appreciate a pointer in that direction!

Thanks,
Jason


2014-02-11 15:03:02

by McAninley, Jason

Subject: RE: Question regarding NFS 4.0 buffer sizes

Thanks for the reply, Bruce.

> Are you using UDP or TCP?

TCP.

> And what do you mean by "maximum packet size"?

I'm generally referring to the frame size (e.g. 32,626) and/or the TCP packet size (e.g. 32,560), the former being the latter plus the Ethernet/IP headers.

> To see if the maximum rsize/wsize is being used you'd need to look for
> the length of the data in a READ reply or WRITE call.

Right. When I check the contents of a WRITE RPC, I see "Data" length of 32768 (32k).

My understanding is that setting {r,w}size doesn't guarantee that will be the agreed-upon value. Apparently one must check the value in /proc. I have verified this by checking the value of /proc/XXXX/mounts, where XXXX is the pid for nfsv4.0-svc on the client. It is set to a value >32K.
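
For what it's worth, the same mount-option values show up in a couple of other places on the client, e.g.:

nfsstat -m
grep nfs4 /proc/mounts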

> What actual problem are you trying to solve? (Is your read or write
> bandwidth lower than you expected?)

I am trying to maximize throughput within a parallel processing cluster. We have GigE connections within our closed network and I would like to ensure we are fully utilizing our bandwidth. Additionally, a lot of the information I find online (even information that is not outdated) suggests various kernel/OS/NFS settings without giving details for why the settings should be modified.

Upon changing the rsize/wsize, I would have expected to see a change in the packet/payload size, but I do not.

-Jason

2014-02-11 22:51:06

by McAninley, Jason

Subject: RE: Question regarding NFS 4.0 buffer sizes

> > I have seen the GETATTR return MAXREAD and MAXWRITE attribute values
> > set to 1MB during testing with Wireshark. My educated guess is that
> > this corresponds to RPCSVC_MAXPAYLOAD defined in linux/nfsd/const.h.
> > Would anyone agree with this?
>
> That's an upper limit and a server without a lot of memory may default
> to something smaller. The GETATTR shows that it isn't, though.

Memory shouldn't be a limit. I have the system isolated for testing - the server has ~126GB memory and the client has ~94GB.


> > > If you haven't already I'd first recommend measuring your NFS read
> > > and write throughput and comparing it to what you can get from the
> > > network and the server's disk. No point tuning something if it
> > > turns out it's already working.
> >
> > I have measured sequential writes using dd with 4k block size.
>
> What's your dd commandline?

dd if=/dev/zero of=[nfs_dir]/foo bs=4096 count=1310720

Should result in a 5 GB file.
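
(If it helps, I can also rerun it with a larger block size and direct I/O to take the client cache out of the picture, e.g.:)

dd if=/dev/zero of=[nfs_dir]/foo bs=1M count=5120 oflag=direct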


> > The NFS
> > share maps to a large SSD drive on the server. My understanding is
> > that we have jumbo frames enabled (i.e. MTU 8k). The share is mounted
> > with rsize/wsize of 32k. We're seeing write speeds of 200 MB/sec
> > (mega-bytes). We have 10 GigE connections between the server and
> > client with a single switch + multipathing from the client.
>
> So both network and disk should be able to do more than that, but it
> would still be worth testing both (with e.g. tcpperf and dd) just to
> make sure there's nothing wrong with either.
>
> > I will admit I have a weak networking background, but it seems like
> > we could achieve speeds much greater than 200 MB/sec, considering the
> > pipes are very wide and the MTU is large. Again, I'm concerned there is
> > a buffer somewhere in the Kernel that is flushing prematurely (32k,
> > instead of wsize).
> >
> > If there is detailed documentation online that I have overlooked, I
> > would much appreciate a pointer in that direction!
>
> Also, what kernel versions are you on?

RH6.3, 2.6.32-279.el6.x86_64

-Jason


2014-02-11 16:32:16

by J. Bruce Fields

Subject: Re: Question regarding NFS 4.0 buffer sizes

On Tue, Feb 11, 2014 at 03:01:39PM +0000, McAninley, Jason wrote:
> Thanks for the reply, Bruce.
>
> > Are you using UDP or TCP?
>
> TCP.
>
> > And what do you mean by "maximum packet size"?
>
> I'm generally referring to the frame size (e.g. 32,626) and/or the TCP packet size (e.g. 32,560), the former being the latter plus the Ethernet/IP headers.
>
> > To see if the maximum rsize/wsize is being used you'd need to look for
> > the length of the data in a READ reply or WRITE call.
>
> Right. When I check the contents of a WRITE RPC, I see "Data" length of 32768 (32k).
>
> My understanding is that setting {r,w}size doesn't guarantee that will be the agreed-upon value. Apparently one must check the value in /proc. I have verified this by checking the value of /proc/XXXX/mounts, where XXXX is the pid for nfsv4.0-svc on the client. It is set to a value >32K.

I don't think that actually takes into account the value returned from
the server. If you watch the mount in wireshark early on you should see
it query the server's rsize and wsize, and you may find that's less.

> > What actual problem are you trying to solve? (Is your read or write
> > bandwidth lower than you expected?)
>
> I am trying to maximize throughput within a parallel processing cluster. We have GigE connections within our closed network and I would like to ensure we are fully utilizing our bandwidth. Additionally, a lot of the information I find online (even information that is not outdated) suggests various kernel/OS/NFS settings without giving details for why the settings should be modified.

If you haven't already I'd first recommend measuring your NFS read and
write throughput and comparing it to what you can get from the network
and the server's disk. No point tuning something if it turns out it's
already working.

--b.

>
> Upon changing the rsize/wsize, I would have expected to see a change in the packet/payload size, but I do not.
>
> -Jason

2014-02-13 12:30:45

by McAninley, Jason

Subject: RE: Question regarding NFS 4.0 buffer sizes

Sorry for the delay.

> This ends up being cached, and the writeback should happen with larger
> sizes. Is this an issue with write size only or read size as well? Did
> you test the read size with something like below?
>
> dd if=[nfs_dir]/foo bs=1M count=500 of=/dev/null
>
> You can create a sparse "foo" file using the truncate command.

I have not tested read speeds yet, since avoiding the client cache makes that a bit trickier. I would suspect similar results, since we have mirrored the read/write settings in all the locations we're aware of.
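
(The plan for the read test is something along these lines, with [nfs_dir] as before:)

# drop the client page cache first (as root), then read back the file
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=[nfs_dir]/foo bs=1M of=/dev/null

# or bypass the client cache entirely
dd if=[nfs_dir]/foo bs=1M of=/dev/null iflag=direct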


> >
> >
> > > Also, what kernel versions are you on?
> >
> > RH6.3, 2.6.32-279.el6.x86_64
>
> NFS client and NFS server both using the same distro/kernel?

Yes - identical.


Would multipath play any role here? I would suspect it would only help, not hinder. I have run Wireshark against the slave and the master ports with the same result - a max of ~32K packet size, regardless of the settings I listed in my original post.

-Jason


2014-02-11 17:23:01

by Chuck Lever

Subject: Re: Question regarding NFS 4.0 buffer sizes


On Feb 11, 2014, at 10:01 AM, McAninley, Jason <[email protected]> wrote:

> Thanks for the reply, Bruce.
>
>> Are you using UDP or TCP?
>
> TCP.
>
>> And what do you mean by "maximum packet size"?
>
> I'm generally referring to the frame size (e.g. 32,626) and/or the TCP packet size (e.g. 32,560), the former being the latter plus the Ethernet/IP headers.
>
>> To see if the maximum rsize/wsize is being used you'd need to look for
>> the length of the data in a READ reply or WRITE call.
>
> Right. When I check the contents of a WRITE RPC, I see "Data" length of 32768 (32k).
>
> My understanding is that setting {r,w}size doesn't guarantee that will be the agreed-upon value. Apparently one must check the value in /proc. I have verified this by checking the value of /proc/XXXX/mounts, where XXXX is the pid for nfsv4.0-svc on the client. It is set to a value >32K.
>
>> What actual problem are you trying to solve? (Is your read or write
>> bandwidth lower than you expected?)
>
> I am trying to maximize throughput within a parallel processing cluster. We have GigE connections within our closed network and I would like to ensure we are fully utilizing our bandwidth. Additionally, a lot of the information I find online (even information that is not outdated) suggests various kernel/OS/NFS settings without giving details for why the settings should be modified.

A closed network introduces the opportunity to use jumbo Ethernet frames. But this assumes your server NICs and switches can support it.
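
You can sanity-check what is actually in effect end to end with something like the following (eth0, the hostname and the sizes are only examples -- adjust for your interface and MTU):

ip link show dev eth0        # check the mtu value on both client and server
ping -M do -s 7972 server    # 7972 + 28 bytes of headers = an 8000-byte unfragmented packet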

> Upon changing the rsize/wsize, I would have expected to see a change in the packet/payload size, but I do not.

The application itself may play a significant role. If it is writing and flushing, or using O_SYNC, for example, the NFS client may have no choice but to use WRITE operations smaller than wsize.
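
A quick way to see that effect on the wire is to compare a buffered write with a synchronous one while capturing; the path below is just a placeholder for your NFS mount:

# buffered: the client is free to coalesce dirty pages into wsize-sized WRITEs
dd if=/dev/zero of=/mnt/nfs/foo bs=4k count=10000

# O_SYNC: each 4k write() must reach the server before dd continues,
# so the WRITE operations on the wire stay small
dd if=/dev/zero of=/mnt/nfs/foo bs=4k count=10000 oflag=sync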

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2014-02-11 14:36:34

by J. Bruce Fields

Subject: Re: Question regarding NFS 4.0 buffer sizes

On Tue, Feb 11, 2014 at 12:32:33PM +0000, McAninley, Jason wrote:
> I'm looking for detailed documentation regarding some of the innards of NFS 4.0, without necessarily having to read through the source code (if such documentation exists).
>
> Specifically, I'm interested in the relationship between NFS's rsize/wsize mount options and some of the lower-level networking buffers. The buffer parameters I have come across (and my current settings) include:
>
> - sysctl's net.core.{r,w}mem_default: 229376
> - sysctl's net.core.{r,w}mem_max: 131071
> - sysctl's net.ipv4.tcp_{r,w}mem 4096 87380 4194304
> - #define RPCSVC_MAXPAYLOAD (1*1024*1024u)
>
> When I run Wireshark during an NFS transfer, I see MAX{READ,WRITE} attributes returned from GETATTR with the value of 1MB. I'm guessing this corresponds to the limit set by RPCSVC_MAXPAYLOAD? However, the maximum packet size I'm recording in practice is ~32K.

Are you using UDP or TCP?

And what do you mean by "maximum packet size"?

To see if the maximum rsize/wsize is being used you'd need to look for
the length of the data in a READ reply or WRITE call.

What actual problem are you trying to solve? (Is your read or write
bandwidth lower than you expected?)

--b.

>
> In fact, it seems like regardless of the change to {r,w}size (I've tried 32K, 64K, 128K) I am not seeing changes in the max packet size.
>
> This is leading me to investigate the buffer sizes on the client/server. My thought is that if a buffer exists that is too small, NFS/Kernel will ship a packet prior to reaching the MAXWRITE size.
>
> Any input is appreciated.
>
> -Jason

2014-02-13 18:21:51

by J. Bruce Fields

Subject: Re: Question regarding NFS 4.0 buffer sizes

On Thu, Feb 13, 2014 at 12:21:13PM +0000, McAninley, Jason wrote:
> Sorry for the delay.
>
> > This ends up being cached, and the writeback should happen with
> > larger sizes. Is this an issue with write size only or read size as
> > well? Did you test the read size with something like below?
> >
> > dd if=[nfs_dir]/foo bs=1M count=500 of=/dev/null
> >
> > You can create a sparse "foo" file using the truncate command.
>
> I have not tested read speeds yet, since avoiding the client cache makes that a bit trickier. I would suspect similar results, since we have mirrored the read/write settings in all the locations we're aware of.
>
>
> > >
> > >
> > > > Also, what kernel versions are you on?
> > >
> > > RH6.3, 2.6.32-279.el6.x86_64
> >
> > NFS client and NFS server both using the same distro/kernel?
>
> Yes - identical.
>
>
> Would multipath play any role here? I would suspect it would only help, not hinder. I have run Wireshark against the slave and the master ports with the same result - a max of ~32K packet size, regardless of the settings I listed in my original post.

I doubt it. I don't know what's going on there.

The write size might actually be too small to keep the necessary amount
of write data in flight; increasing tcp_slot_table_entries might work
around that?
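
That's the sunrpc.tcp_slot_table_entries sysctl on the client; something along these lines, with 128 just a value to experiment with (you may need to remount, or set it before the mount, for it to take effect):

sysctl -w sunrpc.tcp_slot_table_entries=128
# equivalently:
echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries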

Of course, since this is a Red Hat kernel that'd be a place to ask for
support, unless the problem's also reproducible on upstream kernels.

--b.

2014-02-11 23:18:56

by Malahal Naineni

Subject: Re: Question regarding NFS 4.0 buffer sizes

McAninley, Jason [[email protected]] wrote:
> > > I have seen the GETATTR return MAXREAD and MAXWRITE attribute values
> > > set to 1MB during testing with Wireshark. My educated guess is that
> > > this corresponds to RPCSVC_MAXPAYLOAD defined in linux/nfsd/const.h.
> > > Would anyone agree with this?
> >
> > That's an upper limit and a server without a lot of memory may default
> > to something smaller. The GETATTR shows that it isn't, though.
>
> Memory shouldn't be a limit. I have the system isolated for testing - the server has ~126GB memory and the client has ~94GB.
>
>
> > > > If you haven't already I'd first recommend measuring your NFS read
> > > > and write throughput and comparing it to what you can get from the
> > > > network and the server's disk. No point tuning something if it
> > > > turns out it's already working.
> > >
> > > I have measured sequential writes using dd with 4k block size.
> >
> > What's your dd commandline?
>
> dd if=/dev/zero of=[nfs_dir]/foo bs=4096 count=1310720

This ends up being cached, and the writeback should happen with larger
sizes. Is this an issue with write size only or read size as well? Did
you test the read size with something like below?

dd if=[nfs_dir]/foo bs=1M count=500 of=/dev/null

You can create a sparse "foo" file using the truncate command.
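E.g. (the size is only an example):

truncate -s 5G [nfs_dir]/foo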

>
> Should result in a 5 GB file.
>
>
> > > The NFS
> > > share maps to a large SSD drive on the server. My understanding is
> > > that we have jumbo frames enabled (i.e. MTU 8k). The share is mounted
> > > with rsize/wsize of 32k. We're seeing write speeds of 200 MB/sec
> > > (mega-bytes). We have 10 GigE connections between the server and
> > > client with a single switch + multipathing from the client.
> >
> > So both network and disk should be able to do more than that, but it
> > would still be worth testing both (with e.g. tcpperf and dd) just to
> > make sure there's nothing wrong with either.
> >
> > > I will admit I have a weak networking background, but it seems like
> > > we could achieve speeds much greater than 200 MB/sec, considering the
> > > pipes are very wide and the MTU is large. Again, I'm concerned there is
> > > a buffer somewhere in the Kernel that is flushing prematurely (32k,
> > > instead of wsize).
> > >
> > > If there is detailed documentation online that I have overlooked, I
> > > would much appreciate a pointer in that direction!
> >
> > Also, what kernel versions are you on?
>
> RH6.3, 2.6.32-279.el6.x86_64

NFS client and NFS server both using the same distro/kernel?

Regards, Malahal.