2013-05-14 11:50:56

by Jim Vanns

Subject: Where in the server code is fsinfo rtpref calculated?

Hi. I'm struggling to locate the region in the NFSv3 server-side code where it figures out the block sizes for the FSINFO reply. We have servers that do not specify r/wsizes in their exports and so I need to find where this negotiated
value between server->client actually comes from. How does the server reach the preferred block size for a given export?

Cheers,

Jim

--
Jim Vanns
Senior Software Developer
Framestore



2013-05-14 22:01:24

by J. Bruce Fields

Subject: Re: Where in the server code is fsinfo rtpref calculated?

On Tue, May 14, 2013 at 12:17:10PM +0100, James Vanns wrote:
> Hi. I'm struggling to locate the region in the NFSv3 server-side code
> where it figures out the block sizes for the FSINFO reply. We have
> servers that do not specify r/wsizes in their exports

There's no way to specify that as an export option. You can configure
it server-wide using /proc/fs/nfsd/max_block_size.

> and so I need to
> find where this negotiated value between server->client actually comes
> from. How does the server reach the preferred block size for a given
> export?

fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
starting point. Its caller, nfsd_create_serv(), calls
svc_create_pooled() with the result that's calculated.

For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
svc_max_payload().
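
svc_max_payload() just takes the smaller of the transport's maximum
payload and the per-server limit; roughly (paraphrasing net/sunrpc/svc.c
from memory, not quoting it exactly):

	u32 svc_max_payload(const struct svc_rqst *rqstp)
	{
		u32 max = rqstp->rq_xprt->xpt_class->xcl_max_payload;

		if (rqstp->rq_server->sv_max_payload < max)
			max = rqstp->rq_server->sv_max_payload;
		return max;
	}

sv_max_payload is the value nfsd_create_serv() passed to
svc_create_pooled(), i.e. either your max_block_size setting or the
computed default.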

--b.

2013-05-15 17:42:47

by J. Bruce Fields

Subject: Re: Where in the server code is fsinfo rtpref calculated?

On Wed, May 15, 2013 at 05:32:15PM +0100, James Vanns wrote:
> <snip>
>
> > > I've just returned from nfsd3_proc_fsinfo() and found what I would
> > > consider an odd decision - perhaps nothing better was suggested at
> > > the time. It seems to me that in response to an FSINFO call the
> > > reply stuffs the max_block_size value in both the maximum *and*
> > > preferred block sizes for both read and write. A 1MB block size
> > > for a preferred default is a little high! If a disk is reading at
> > > 33MB/s and we have just a single server running 64 knfsd and each
> > > READ call is requesting 1MB of data then all of a sudden we have
> > > an aggregate read speed of ~512k/s
> >
> > I lost you here.
>
> OK, so what we're seeing is the large majority of our ~700 clients
> (all Linux 2.6.32 based NFS clients) issuing READ requests of 1MB in
> size.

Knowing nothing about your situation, I'd assume the clients are doing
that because they actually want that 1MB of data.

Would you prefer they each send 1024 1k READs? I don't understand why
it's the read size you're focused on here.

--b.

>
> After the initial MOUNT request has been granted an FSINFO call is
> made. The contents of the REPLY from the server (another Linux 2.6.32
> server) include rtmax, rtpref, wtmax and wtpref all of which are set
> to 1MB. This 1MB appears to come from that code/explanation I
> described earlier - all values are basically getting set to whatever
> comes out of nfsd_get_default_max_blksize().



2013-05-15 14:15:09

by J. Bruce Fields

Subject: Re: Where in the server code is fsinfo rtpref calculated?

On Wed, May 15, 2013 at 02:42:42PM +0100, James Vanns wrote:
> > fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
> > starting point. Its caller, nfsd_create_serv(), calls
> > svc_create_pooled() with the result that's calculated.
>
> Hmm. If I've read this section of code correctly, it seems to me
> that on most modern NFS servers (using TCP as the transport) the default
> and preferred blocksize negotiated with clients will almost always be
> 1MB - the maximum RPC payload. The nfsd_get_default_maxblksize() function
> seems obsolete for modern 64-bit servers with at least 4G of RAM as it'll
> always prefer this upper bound instead of any value calculated according to
> available RAM.

Well, "obsolete" is an odd way to put it--the code is still expected to
work on smaller machines.

Arguments welcome about the defaults, though I wonder whether it would
be better to be doing this sort of calculation in user space.

> For what it's worth (not sure if I specified this) I'm running kernel 2.6.32.
>
> Anyway, this file/function appears to set the default *max* blocksize. I haven't
> read all the related code yet, but does the preferred block size derive
> from this maximum too?

See:

> > For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
> > svc_max_payload().

I'm not sure what the history is behind that logic, though.

--b.

2013-05-15 14:47:50

by J. Bruce Fields

Subject: Re: Where in the server code is fsinfo rtpref calculated?

On Wed, May 15, 2013 at 03:34:27PM +0100, James Vanns wrote:
>
> > On Wed, May 15, 2013 at 02:42:42PM +0100, James Vanns wrote:
> > > > fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
> > > > starting point. Its caller, nfsd_create_serv(), calls
> > > > svc_create_pooled() with the result that's calculated.
> > >
> > > Hmm. If I've read this section of code correctly, it seems to me
> > > that on most modern NFS servers (using TCP as the transport) the
> > > default
> > > and preferred blocksize negotiated with clients will almost always
> > > be
> > > 1MB - the maximum RPC payload. The nfsd_get_default_maxblksize()
> > > function
> > > seems obsolete for modern 64-bit servers with at least 4G of RAM as
> > > it'll
> > > always prefer this upper bound instead of any value calculated
> > > according to
> > > available RAM.
> >
> > Well, "obsolete" is an odd way to put it--the code is still expected
> > to work on smaller machines.
>
> Poor choice of words perhaps. I guess I'm just used to NFS servers being
> pretty hefty pieces of kit and 'small' workstations having a couple of GB
> of RAM too.
>
> > Arguments welcome about the defaults, though I wonder whether it
> > would be better to be doing this sort of calculation in user space.
>
> See below.
>
> > > For what it's worth (not sure if I specified this) I'm running
> > > kernel 2.6.32.
> > >
> > > Anyway, this file/function appears to set the default *max*
> > > blocksize. I haven't
> > > read all the related code yet, but does the preferred block size
> > > derive
> > > from this maximum too?
> >
> > See
> > > > For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
> > > > svc_max_payload().
>
> I've just returned from nfsd3_proc_fsinfo() and found what I would
> consider an odd decision - perhaps nothing better was suggested at
> the time. It seems to me that in response to an FSINFO call the reply
> stuffs the max_block_size value in both the maximum *and* preferred
> block sizes for both read and write. A 1MB block size for a preferred
> default is a little high! If a disk is reading at 33MB/s and we have just
> a single server running 64 knfsd and each READ call is requesting 1MB of
> data then all of a sudden we have an aggregate read speed of ~512k/s

I lost you here.

> and
> that is without network latencies. And of course we will probably have 100s of
> requests queued behind each knfsd waiting for these 512k reads to finish. All of a
> sudden our user experience is rather poor :(

Note the preferred size is not a minimum--the client isn't forced to do
1MB reads if it really only wants 1 page, for example, if that's what
you mean.

(I haven't actually looked at how typical clients use rt/wtpref.)

--b.

> Perhaps a better suggestion would be to at least expose the maximum and preferred
> block sizes (for both read and write) via a sysctl key so an administrator can set
> it to the underlying block sizes of the file system or physical device?
>
> Perhaps the defaults should at least be a smaller multiple of the page size or somewhere
> between that and the PDU of the network layer the service is bound to.
>
> Just my tuppence - and my maths might be flawed ;)
>
> Jim
>
> > I'm not sure what the history is behind that logic, though.
> >
> > --b.
> >
>
> --
> Jim Vanns
> Senior Software Developer
> Framestore

2013-05-17 13:56:51

by J. Bruce Fields

Subject: Re: Where in the server code is fsinfo rtpref calculated?

On Fri, May 17, 2013 at 12:43:02PM +0100, James Vanns wrote:
> > Knowing nothing about your situation, I'd assume the clients are
> > doing that because they actually want that 1MB of data.
>
> Possibly. But we have no control over that (the application read size,
> I mean).
>
> > Would you prefer they each send 1024 1k READs?  I don't understand
> > why it's the read size you're focused on here.
>
> No. But 32x 32k reads is reasonable (because it gives other RPCs a
> look-in).

Maybe. In any case I'd want to see data before changing our defaults.

> I'm focused on reads because they make up the majority of
> our NFS traffic. I'm concerned because, as it stands (out of the box),
> if the majority of our n knfsd threads are waiting for a 1MB read to
> return then no other RPC requests will be serviced; they will just
> contribute to the backlog. This backlog itself will probably also contain
> a hefty no. of 1MB read requests too. In short, a lot of other RPC calls
> that are not reads will just be blocking and this will appear to an end
> user as poor performance.

Do you have a performance problem that you've actually measured, and if
so could you share the details?

> We deal with a great number of fairly large files - 10s of GBs in
> size. We just don't want others to suffer because of large request
> sizes coming in (writes end up being of the same size too but there
> are fewer of them). Our use cases are varied but they all have to share
> the same resource (the array of NFS servers).
>
> We've only really seen this since our upgrade to SL6/kernel 2.6.32. I
> guess previously that 32k was some sort of default or limit?
>
> Related to this was my query on when/how the (Linux) client may honour
> the preferred or optimal block size given in the FSINFO reply. Any
> ideas? Is it that if a read of less than the preferred block size is
> requested, the preferred size is used anyway because it comes at the
> same cost?

I'm not terribly familiar with the client logic, but would expect this
to vary depending on kernel version, read-ahead policy, application
behavior and a number of other factors, so I'd recommend testing with
your workload and finding out.

--b.

2013-05-15 14:38:25

by Jim Vanns

Subject: Re: Where in the server code is fsinfo rtpref calculated?


> On Wed, May 15, 2013 at 02:42:42PM +0100, James Vanns wrote:
> > > fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
> > > starting point. Its caller, nfsd_create_serv(), calls
> > > svc_create_pooled() with the result that's calculated.
> >
> > Hmm. If I've read this section of code correctly, it seems to me
> > that on most modern NFS servers (using TCP as the transport) the
> > default
> > and preferred blocksize negotiated with clients will almost always
> > be
> > 1MB - the maximum RPC payload. The nfsd_get_default_maxblksize()
> > function
> > seems obsolete for modern 64-bit servers with at least 4G of RAM as
> > it'll
> > always prefer this upper bound instead of any value calculated
> > according to
> > available RAM.
>
> Well, "obsolete" is an odd way to put it--the code is still expected
> to work on smaller machines.

Poor choice of words perhaps. I guess I'm just used to NFS servers being
pretty hefty pieces of kit and 'small' workstations having a couple of GB
of RAM too.

> Arguments welcome about the defaults, though I wonder whether it
> would be better to be doing this sort of calculation in user space.

See below.

> > For what it's worth (not sure if I specified this) I'm running
> > kernel 2.6.32.
> >
> > Anyway, this file/function appears to set the default *max*
> > blocksize. I haven't
> > read all the related code yet, but does the preferred block size
> > derive
> > from this maximum too?
>
> See
> > > For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
> > > svc_max_payload().

I've just returned from nfsd3_proc_fsinfo() and found what I would
consider an odd decision - perhaps nothing better was suggested at
the time. It seems to me that in response to an FSINFO call the reply
stuffs the max_block_size value in both the maximum *and* preferred
block sizes for both read and write. A 1MB block size for a preferred
default is a little high! If a disk is reading at 33MB/s and we have just
a single server running 64 knfsd and each READ call is requesting 1MB of
data then all of a sudden we have an aggregate read speed of ~512k/s and
that is without network latencies. And of course we will probably have 100s of
requests queued behind each knfsd waiting for these 512k reads to finish. All of a
sudden our user experience is rather poor :(
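
For reference, the bit of nfsd3_proc_fsinfo() I mean is roughly this
(paraphrased, not the exact source):

	u32 max_blocksize = svc_max_payload(rqstp);

	resp->f_rtmax  = max_blocksize;
	resp->f_rtpref = max_blocksize;
	resp->f_rtmult = PAGE_SIZE;
	resp->f_wtmax  = max_blocksize;
	resp->f_wtpref = max_blocksize;
	resp->f_wtmult = PAGE_SIZE;
	resp->f_dtpref = PAGE_SIZE;

i.e. the preferred sizes are just aliases for the maximum.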

Perhaps a better suggestion would be to at least expose the maximum and preferred
block sizes (for both read and write) via a sysctl key so an administrator can set
it to the underlying block sizes of the file system or physical device?

Perhaps the defaults should at least be a smaller multiple of the page size or somewhere
between that and the PDU of the network layer the service is bound to.
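
Something along these lines, purely illustrative (nfsd_pref_blksize is a
made-up tunable here, it doesn't exist today):

	/* hypothetical knob, e.g. exposed next to max_block_size in /proc/fs/nfsd */
	u32 pref_blocksize = nfsd_pref_blksize ?: max_blocksize;

	resp->f_rtmax  = max_blocksize;
	resp->f_rtpref = min_t(u32, pref_blocksize, max_blocksize);
	resp->f_wtmax  = max_blocksize;
	resp->f_wtpref = min_t(u32, pref_blocksize, max_blocksize);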

Just my tuppence - and my maths might be flawed ;)

Jim

> I'm not sure what the history is behind that logic, though.
>
> --b.
>

--
Jim Vanns
Senior Software Developer
Framestore

2013-05-15 09:25:03

by Jim Vanns

Subject: Re: Where in the server code is fsinfo rtpref calculated?

> There's no way to specify that as an export option. You can
> configure
> it server-wide using /proc/fs/nfsd/max_block_size.

Ah ha! Bingo. There it is. I can see it on our SL6 (2.6.32)
servers but not on older RHEL5 (2.6.18) servers so I guess
at some point this was hardcoded to 32k?

> > and so I need to
> > find where this negotiated value between server->client actually
> > comes
> > from. How does the server reach the preferred block size for a
> > given
> > export?
>
> fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
> starting point. Its caller, nfsd_create_serv(), calls
> svc_create_pooled() with the result that's calculated.

Thanks. I shall look there.

Jim

> For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
> svc_max_payload().
>
> --b.

--
Jim Vanns
Senior Software Developer
Framestore


2013-05-15 15:21:23

by Myklebust, Trond

Subject: Re: Where in the server code is fsinfo rtpref calculated?

On Wed, 2013-05-15 at 10:47 -0400, J. Bruce Fields wrote:
> On Wed, May 15, 2013 at 03:34:27PM +0100, James Vanns wrote:
> >
> > > On Wed, May 15, 2013 at 02:42:42PM +0100, James Vanns wrote:
> > > > > fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
> > > > > starting point. Its caller, nfsd_create_serv(), calls
> > > > > svc_create_pooled() with the result that's calculated.
> > > >
> > > > Hmm. If I've read this section of code correctly, it seems to me
> > > > that on most modern NFS servers (using TCP as the transport) the
> > > > default
> > > > and preferred blocksize negotiated with clients will almost always
> > > > be
> > > > 1MB - the maximum RPC payload. The nfsd_get_default_maxblksize()
> > > > function
> > > > seems obsolete for modern 64-bit servers with at least 4G of RAM as
> > > > it'll
> > > > always prefer this upper bound instead of any value calculated
> > > > according to
> > > > available RAM.
> > >
> > > Well, "obsolete" is an odd way to put it--the code is still expected
> > > to work on smaller machines.
> >
> > Poor choice of words perhaps. I guess I'm just used to NFS servers being
> > pretty hefty pieces of kit and 'small' workstations having a couple of GB
> > of RAM too.
> >
> > > Arguments welcome about the defaults, though I wonder whether it
> > > would be better to be doing this sort of calculation in user space.
> >
> > See below.
> >
> > > > For what it's worth (not sure if I specified this) I'm running
> > > > kernel 2.6.32.
> > > >
> > > > Anyway, this file/function appears to set the default *max*
> > > > blocksize. I haven't
> > > > read all the related code yet, but does the preferred block size
> > > > derive
> > > > from this maximum too?
> > >
> > > See
> > > > > For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
> > > > > svc_max_payload().
> >
> > I've just returned from nfsd3_proc_fsinfo() and found what I would
> > consider an odd decision - perhaps nothing better was suggested at
> > the time. It seems to me that in response to an FSINFO call the reply
> > stuffs the max_block_size value in both the maximum *and* preferred
> > block sizes for both read and write. A 1MB block size for a preferred
> > default is a little high! If a disk is reading at 33MB/s and we have just
> > a single server running 64 knfsd and each READ call is requesting 1MB of
> > data then all of a sudden we have an aggregate read speed of ~512k/s
>
> I lost you here.
>
> > and
> > that is without network latencies. And of course we will probably have 100s of
> > requests queued behind each knfsd waiting for these 512k reads to finish. All of a
> > sudden our user experience is rather poor :(
>
> Note the preferred size is not a minimum--the client isn't forced to do
> 1MB reads if it really only wants 1 page, for example, if that's what
> you mean.
>
> (I haven't actually looked at how typical clients use rt/wtpref.)

For our client, the answer is:

rtpref == default rsize
wtpref == default wsize and default f_bsize
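
That happens in nfs_server_set_fsinfo() in fs/nfs/client.c; from memory
it is roughly:

	if (server->rsize == 0)
		server->rsize = nfs_block_size(fsinfo->rtpref, NULL);
	if (server->wsize == 0)
		server->wsize = nfs_block_size(fsinfo->wtpref, NULL);

	if (fsinfo->rtmax >= 512 && server->rsize > fsinfo->rtmax)
		server->rsize = nfs_block_size(fsinfo->rtmax, NULL);
	if (fsinfo->wtmax >= 512 && server->wsize > fsinfo->wtmax)
		server->wsize = nfs_block_size(fsinfo->wtmax, NULL);

so rsize/wsize mount options win if set; otherwise rtpref/wtpref become
the defaults, clamped to rtmax/wtmax.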


--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-05-15 13:46:39

by Jim Vanns

Subject: Re: Where in the server code is fsinfo rtpref calculated?

> fs/nfsd/nfssvc.c:nfsd_get_default_maxblksize() is probably a good
> starting point. Its caller, nfsd_create_serv(), calls
> svc_create_pooled() with the result that's calculated.

Hmm. If I've read this section of code correctly, it seems to me
that on most modern NFS servers (using TCP as the transport) the default
and preferred blocksize negotiated with clients will almost always be
1MB - the maximum RPC payload. The nfsd_get_default_maxblksize() function
seems obsolete for modern 64-bit servers with at least 4G of RAM as it'll
always prefer this upper bound instead of any value calculated according to
available RAM.
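
As far as I can tell the calculation goes something like this
(paraphrasing fs/nfsd/nfssvc.c from memory, not quoting it):

	struct sysinfo i;
	unsigned long long target;
	unsigned long ret;

	si_meminfo(&i);
	/* aim for roughly 1/4096 of low memory per nfsd thread */
	target = (i.totalram - i.totalhigh) << PAGE_SHIFT;
	target >>= 12;

	/* start at the 1MB ceiling and halve while over the target */
	ret = NFSSVC_MAXBLKSIZE;
	while (ret > target && ret > 8*1024)
		ret /= 2;

so on anything with 4G+ of RAM the target is over 1MB, the loop never
runs, and the NFSSVC_MAXBLKSIZE ceiling always wins.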

For what it's worth (not sure if I specified this) I'm running kernel 2.6.32.

Anyway, this file/function appears to set the default *max* blocksize. I haven't
read all the related code yet, but does the preferred block size derive
from this maximum too?

Thanks,

Jim

> For fsinfo see fs/nfsd/nfs3proc.c:nfsd3_proc_fsinfo, which uses
> svc_max_payload().
>
> --b.

--
Jim Vanns
Senior Software Developer
Framestore

2013-05-15 16:36:15

by Jim Vanns

Subject: Re: Where in the server code is fsinfo rtpref calculated?

<snip>

> > I've just returned from nfsd3_proc_fsinfo() and found what I would
> > consider an odd decision - perhaps nothing better was suggested at
> > the time. It seems to me that in response to an FSINFO call the
> > reply
> > stuffs the max_block_size value in both the maximum *and*
> > preferred
> > block sizes for both read and write. A 1MB block size for a
> > preferred
> > default is a little high! If a disk is reading at 33MB/s and we
> > have just
> > a single server running 64 knfsd and each READ call is requesting
> > 1MB of
> > data then all of a sudden we have an aggregate read speed of
> > ~512k/s
>
> I lost you here.

OK, so what we're seeing is the large majority of our ~700 clients
(all Linux 2.6.32 based NFS clients) issuing READ requests of 1MB in size.

After the initial MOUNT request has been granted an FSINFO call is made. The
contents of the REPLY from the server (another Linux 2.6.32 server) include
rtmax, rtpref, wtmax and wtpref all of which are set to 1MB. This 1MB appears
to come from that code/explanation I described earlier - all values are basically
getting set to whatever comes out of nfsd_get_default_max_blksize().

> > that is without network latencies. And of course we will probably
> > have 100s of
> > requests queued behind each knfsd waiting for these 512k reads to
> > finish. All of a
> > sudden our user experience is rather poor :(
>
> Note the preferred size is not a minimum--the client isn't forced to
> do
> 1MB reads if it really only wants 1 page, for example, if that's what
> you mean.

If no r/wsize has been specified on the client mount then the negotiated
values above will be used by the client for any application read() that
exceeds that maximum. That maximum (the default 1MB) is still quite
large, I reckon.

I'm not sure at which point the preferred or optimal block size will
be used by the client - because they're set as the same on the server
side, I can't tell which is being used ;)

> (I haven't actually looked at how typical clients use rt/wtpref.)
>
> --b.
>
> > Perhaps a better suggestion would be to at least expose the maximum
> > and preferred
> > block sizes (for both read and write) via a sysctl key so an
> > administrator can set
> > it to the underlying block sizes of the file system or physical
> > device?
> >
> > Perhaps the defaults should at least be a smaller multiple of the
> > page size or somewhere
> > between that and the PDU of the network layer the service is bound
> > to.
> >
> > Just my tuppence - and my maths might be flawed ;)
> >
> > Jim
> >
> > > I'm not sure what the history is behind that logic, though.
> > >
> > > --b.
> > >
> >
> > --
> > Jim Vanns
> > Senior Software Developer
> > Framestore
>

--
Jim Vanns
Senior Software Developer
Framestore

2013-05-17 11:47:08

by Jim Vanns

Subject: Re: Where in the server code is fsinfo rtpref calculated?

> Knowing nothing about your situation, I'd assume the clients are
> doing that because they actually want that 1MB of data.

Possibly. But we have no control over that (the application read size,
I mean).

> Would you prefer they each send 1024 1k READs?  I don't understand
> why it's the read size you're focused on here.

No. But 32x 32k reads is reasonable (because it gives other RPCs a
look-in). I'm focused on reads because they make up the majority of
our NFS traffic. I'm concerned because, as it stands (out of the box),
if the majority of our n knfsd threads are waiting for a 1MB read to
return then no other RPC requests will be serviced; they will just
contribute to the backlog. This backlog itself will probably also contain
a hefty no. of 1MB read requests too. In short, a lot of other RPC calls
that are not reads will just be blocking and this will appear to an end
user as poor performance.

We deal with a great number of fairly large files - 10s of GBs in size. We just
don't want others to suffer because of large request sizes coming in (writes
end up being of the same size too but there are fewer of them). Our use cases are
varied but they all have to share the same resource (the array of NFS servers).

We've only really seen this since our upgrade to SL6/kernel 2.6.32. I guess
previously that 32k was some sort of default or limit?

Related to this was my query on when/how the (Linux) client may honour the preferred
or optimal block size given in the FSINFO reply. Any ideas? Is it that if a read of
less than the preferred block size is requested, the preferred size is used anyway
because it comes at the same cost?

Thanks,

Jim

> --b.
>
> >
> > After the initial MOUNT request has been granted an FSINFO call is
> > made. The contents of the REPLY from the server (another Linux
> > 2.6.32
> > server) include rtmax, rtpref, wtmax and wtpref all of which are
> > set
> > to 1MB. This 1MB appears to come from that code/explanation I
> > described earlier  - all values are basically getting set to
> > whatever
> > comes out of nfsd_get_default_max_blksize().
>
>

--
Jim Vanns
Senior Software Developer
Framestore