2008-07-15 15:34:54

by Andrew Bell

Subject: Performance Diagnosis

Hi,

I have a RHEL 5 system that exhibits less than wonderful performance
when copying large files from/to an NFS filesystem. When the copy is
taking place, other access to the filesystem is painfully slow. I
would like to have the filesystem react well to small requests while a
large request is taking place.

A couple of questions:

Is this a reasonable expectation?

Is this perhaps an I/O scheduling issue that isn't specific to NFS,
but shows up there because of the latency of my NFS setup?

Is this most likely a client issue, a server issue or a combination?

Do you have recommendations on the best way to determine what is
happening? Are there existing tools to monitor active IO/NFS
requests/responses and any relevant queues?


Thanks for any info/ideas before I get in too deep :)

--
Andrew Bell
[email protected]


2008-07-15 15:49:21

by Chuck Lever

Subject: Re: Performance Diagnosis

On Tue, Jul 15, 2008 at 11:34 AM, Andrew Bell <[email protected]> wrote:
> Hi,
>
> I have a RHEL 5 system that exhibits less than wonderful performance
> when copying large files from/to an NFS filesystem. When the copy is
> taking place, other access to the filesystem is painfully slow. I
> would like to have the filesystem react well to small requests while a
> large request is taking place.
>
> A couple of questions:
>
> Is this a reasonable expectation?

Yes, but Linux NFS can't fulfill it. :-)

There is currently only one RPC transport socket between client and
server for each mount point. Large file copies (or similar
operations) will queue a lot of I/O, so your small requests will take
a while to get through the queued up writes or reads ahead of them.

> Is this perhaps an I/O scheduling issue that isn't specific to NFS,
> but shows up there because of the latency of my NFS setup?
>
> Is this most likely a client issue, a server issue or a combination?

Well, if your server or network is slow, this kind of thing is more
likely to happen.

> Do you have recommendations on the best way to determine what is
> happening? Are there existing tools to monitor active IO/NFS
> requests/responses and any relevant queues?

Yes, I wrote some Python tools that are still undocumented (i.e., you
will likely have to read the Python source to figure out what they
do). They were recently included in nfs-utils, but you can download
them from:

http://oss.oracle.com/~cel/linux-2.6/2.6.25/nfs-iostat

and

http://oss.oracle.com/~cel/linux-2.6/2.6.25/mountstats

>
>
> Thanks for any info/ideas before I get in too deep :)
>
> --
> Andrew Bell
> [email protected]

--
Chuck Lever

2008-07-15 15:58:36

by Peter Staubach

Subject: Re: Performance Diagnosis

Andrew Bell wrote:
> Hi,
>
> I have a RHEL 5 system that exhibits less than wonderful performance
> when copying large files from/to an NFS filesystem. When the copy is
> taking place, other access to the filesystem is painfully slow. I
> would like to have the filesystem react well to small requests while a
> large request is taking place.
>
> A couple of questions:
>
> Is this a reasonable expectation?
>
>

Well, yes, I think that it would be a reasonable expectation.
I know that I would certainly like for it to be true. :-)

That said, this is a common situation, but not one that we've
had/made the time to resolve yet.

> Is this perhaps an I/O scheduling issue that isn't specific to NFS,
> but shows up there because of the latency of my NFS setup?
>
>

Could be. Nothing is impossible. That said...

> Is this most likely a client issue, a server issue or a combination?
>
>

It could be either one, both, or the network even.

It could easily just be the architecture of the NFS client
solution, in that it is sharing a single TCP connection for
both data operations and also metadata operations. The
metadata operations can get behind the larger data operations
in the TCP stream, thus increasing their latencies.

> Do you have recommendations on the best way to determine what is
> happening? Are there existing tools to monitor active IO/NFS
> requests/responses and any relevant queues?
>
>

Perhaps ensure that the local file system on the server is
performing well, that there are no obvious hot spots, and that
the activity is not causing the file system to thrash.

Some file systems, such as ext3, tend to bottleneck in the
journaling code, so that might be an area of the local file
system to consider.

> Thanks for any info/ideas before I get in too deep :)

We could use some idea of the activities that are occurring
when you encounter the slowness that you are concerned about.

If it is the notion described above, sometimes called head
of line blocking, then we could think about ways to duplex
operations over multiple TCP connections, perhaps with one
connection for small, low latency operations, and another
connection for larger, higher latency operations.
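
To put rough numbers on head of line blocking, here is a toy model,
not a measurement. It assumes requests are sent strictly first-in,
first-out at a fixed bandwidth, and that a second connection for small
operations would get half of the link. The function name, the request
sizes and the link speed are all made up for illustration.

```python
def completion_times(requests, bandwidth):
    """Return {name: finish time in seconds} for requests sent FIFO.

    requests is a list of (name, size_in_bytes) in queue order and
    bandwidth is in bytes per second.
    """
    clock = 0.0
    done = {}
    for name, size in requests:
        clock += size / bandwidth  # each request waits for all earlier ones
        done[name] = clock
    return done


MB = 1024 * 1024
# Hypothetical workload: 64 one-megabyte WRITEs queued ahead of one GETATTR.
writes = [("WRITE%d" % i, MB) for i in range(64)]
small = [("GETATTR", 200)]
link = 100 * MB  # assumed link speed, bytes per second

# One shared connection: the GETATTR sits behind every queued WRITE.
shared = completion_times(writes + small, link)
# Separate connection for small operations, assumed to get half the link.
split = completion_times(small, link / 2)

print("GETATTR behind the writes:     %.3f s" % shared["GETATTR"])
print("GETATTR on its own connection: %.6f s" % split["GETATTR"])
```

The model ignores server processing time and TCP behaviour entirely;
it only shows how the wait for a small request grows with the amount
of data queued ahead of it.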

Thanx...

ps

2008-07-15 16:24:11

by Chuck Lever

Subject: Re: Performance Diagnosis

On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]> wrote:
> If it is the notion described above, sometimes called head
> of line blocking, then we could think about ways to duplex
> operations over multiple TCP connections, perhaps with one
> connection for small, low latency operations, and another
> connection for larger, higher latency operations.

I've dreamed about that for years. I don't think it would be too
difficult, but one thing that has held it back is that the shortage of
ephemeral ports on the client may reduce the number of concurrent
mount points we can support.

One way to avoid the port issue is to construct an SCTP transport for
NFS. SCTP allows multiple streams on the same connection, effectively
eliminating head of line blocking.

--
Chuck Lever

2008-07-15 16:34:30

by Andrew Bell

Subject: Re: Performance Diagnosis

On Tue, Jul 15, 2008 at 11:23 AM, Chuck Lever <[email protected]> wrote:
> On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]> wrote:
>> If it is the notion described above, sometimes called head
>> of line blocking, then we could think about ways to duplex
>> operations over multiple TCP connections, perhaps with one
>> connection for small, low latency operations, and another
>> connection for larger, higher latency operations.
>
> I've dreamed about that for years. I don't think it would be too
> difficult, but one thing that has held it back is the shortage of
> ephemeral ports on the client may reduce the number of concurrent
> mount points we can support.

Could one come up with a way to insert "small" ops somewhere in middle
of the existing queue, or are the TCP send buffers typically too deep
for this to do much good? Seems like more than one connection would
allow "good" servers to handle requests simultaneously anyway.

Is there really that big a shortage of ephemeral ports? I guess one
could do active connection management.

> One way to avoid the port issue is to construct an SCTP transport for
> NFS. SCTP allows multiple streams on the same connection, effectively
> eliminating head of line blocking.

Waiting for SCTP sounds like a long-term solution, as server vendors
probably have little incentive.

Thanks for the ideas. I'll have to see what kind of time I can get to
investigate this stuff.

--
Andrew Bell
[email protected]

2008-07-15 17:20:51

by Chuck Lever

Subject: Re: Performance Diagnosis

On Tue, Jul 15, 2008 at 12:34 PM, Andrew Bell <[email protected]> wrote:
> On Tue, Jul 15, 2008 at 11:23 AM, Chuck Lever <[email protected]> wrote:
>> On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]> wrote:
>>> If it is the notion described above, sometimes called head
>>> of line blocking, then we could think about ways to duplex
>>> operations over multiple TCP connections, perhaps with one
>>> connection for small, low latency operations, and another
>>> connection for larger, higher latency operations.
>>
>> I've dreamed about that for years. I don't think it would be too
>> difficult, but one thing that has held it back is the shortage of
>> ephemeral ports on the client may reduce the number of concurrent
>> mount points we can support.
>
> Could one come up with a way to insert "small" ops somewhere in middle
> of the existing queue, or are the TCP send buffers typically too deep
> for this to do much good? Seems like more than one connection would
> allow "good" servers to handle requests simultaneously anyway.

There are several queues inside the NFS client stack.

The underlying RPC client manages a slot table. Each slot contains
one pending RPC request; i.e., an RPC has been sent and the slot is held
while waiting for the reply. The table contains 16 slots by default. You
can adjust the size (up to 128 slots) via a sysctl, and that may help
your situation by allowing more reads or writes to go to the server at
once.
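
A small sketch for checking the current slot table size follows. The
message above does not name the sysctl; the path used here
(/proc/sys/sunrpc/tcp_slot_table_entries) is an assumption on my part
and should be checked against your kernel. The helper name is made up.

```python
# Assumed location of the RPC slot table sysctl; not named in this thread.
SLOT_TABLE_PATH = "/proc/sys/sunrpc/tcp_slot_table_entries"


def read_int_sysctl(path):
    """Return the integer stored in a /proc/sys file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


slots = read_int_sysctl(SLOT_TABLE_PATH)
if slots is None:
    print("slot table sysctl not readable (is the sunrpc module loaded?)")
else:
    print("RPC slot table entries: %d" % slots)
```

If the value printed is the default of 16, raising it toward the
128-slot limit mentioned above is the experiment to try.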

The RPC client allows a single RPC to be sent on the socket at a time.
(Waiting for the reply is asynchronous, so the next request can be
sent on the socket as soon as this one is done being sent).

Especially for large requests, this may mean waiting for the socket
buffer to be emptied before more data can be sent. The socket is held
for each request until it is entirely sent so that data for
different requests are not intermingled. If the network is not
congested, this is generally not a problem, but if the server is
backed up, it can take a while before the buffer is ready for more
data from a single large request.

Before an RPC gets into a slot, though, it waits on a backlog queue.
This queue can grow quite long in situations where there are a lot of
reads or writes and the server or network is slow.

The Python scripts I mentioned before have information about the
backlog queue size, slot table utilization, and per-operation average
latency. So you can clearly determine what the client is waiting for.
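
For anyone who wants a feel for the numbers before reading the
scripts, here is a minimal sketch. It is not the nfs-iostat or
mountstats code. It assumes that each per-op line in
/proc/self/mountstats carries, in order: operations, transmissions,
timeouts, bytes sent, bytes received, cumulative queue time (ms),
cumulative round-trip time (ms) and cumulative execute time (ms).
Verify that layout against your kernel. The sample text is invented.

```python
# Invented sample in the assumed per-op layout (see the note above).
SAMPLE = (
    "\tper-op statistics\n"
    "\t        NULL: 0 0 0 0 0 0 0 0\n"
    "\t     GETATTR: 200 200 0 26400 22400 9000 400 9500\n"
    "\t       WRITE: 1000 1000 0 1049000000 136000 250000 90000 345000\n"
)


def per_op_latency(text):
    """Map op name -> (ops, average queue ms, average RTT ms)."""
    stats = {}
    in_per_op = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "per-op statistics":
            in_per_op = True
            continue
        if not in_per_op or ":" not in stripped:
            continue
        name, _, rest = stripped.partition(":")
        fields = rest.split()
        try:
            ops = int(fields[0])
            queue_ms = int(fields[5])
            rtt_ms = int(fields[6])
        except (IndexError, ValueError):
            continue  # not a per-op line in the assumed layout
        if ops > 0:
            stats[name] = (ops, queue_ms / ops, rtt_ms / ops)
    return stats


for op, (ops, queue, rtt) in sorted(per_op_latency(SAMPLE).items()):
    print("%-8s ops=%d avg_queue=%.1f ms avg_rtt=%.1f ms" % (op, ops, queue, rtt))
```

A large average queue time next to a small average RTT suggests the
request spent most of its life waiting on the client rather than on the
server. To try it on live data, pass the contents of
/proc/self/mountstats instead of SAMPLE; with several NFS mounts the
later sections overwrite the earlier ones in this simple version.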

> Is there really that big a shortage of ephemeral ports?

Yes. The NFS client uses only privileged ports (although you can
optionally tell it to use non-privileged ports as well). For
long-lived sockets (such as transport sockets for NFS), it is careful
to choose privileged ports that are not a "well known" service (e.g.,
port 22 is the standard ssh service port). So the default port range
is roughly between 670 and 1023.
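
As a back-of-the-envelope check on why ports run short: if every
mount's transport socket holds one privileged port, the range quoted
above is an upper bound on the number of simultaneously connected
mounts. The helper name is made up, and the bounds are the approximate
ones from this message, not values read from a kernel.

```python
def ports_in_range(low, high):
    """Count the ports in the inclusive range [low, high]."""
    if high < low:
        return 0
    return high - low + 1


# Approximate default range quoted above.
print("privileged ports available:", ports_in_range(670, 1023))
```

That comes to 354 ports, and any other RPC traffic that also wants a
privileged port draws from the same range.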

>> One way to avoid the port issue is to construct an SCTP transport for
>> NFS. SCTP allows multiple streams on the same connection, effectively
>> eliminating head of line blocking.
>
> Waiting for SCTP sounds like a long-term solution, as server vendors
> probably have little incentive.

Yep.

> Thanks for the ideas. I'll have to see what kind of time I can get to
> investigate this stuff.

We neglected to mention that you can also increase the number of NFSD
threads on your server. I think eight is the default, and often that
isn't enough.

--
Chuck Lever

2008-07-15 17:44:36

by Peter Staubach

Subject: Re: Performance Diagnosis

Chuck Lever wrote:
> On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]> wrote:
>
>> If it is the notion described above, sometimes called head
>> of line blocking, then we could think about ways to duplex
>> operations over multiple TCP connections, perhaps with one
>> connection for small, low latency operations, and another
>> connection for larger, higher latency operations.
>>
>
> I've dreamed about that for years. I don't think it would be too
> difficult, but one thing that has held it back is the shortage of
> ephemeral ports on the client may reduce the number of concurrent
> mount points we can support.
>
> One way to avoid the port issue is to construct an SCTP transport for
> NFS. SCTP allows multiple streams on the same connection, effectively
> eliminating head of line blocking.

I like the idea of combining this work with implementing a proper
connection manager so that we don't need a connection per mount.
We really only need one connection per client and server, no matter
how many individual mounts there might be from that single server.
(Or two connections, if we want to do something like this...)

We could also manage the connection space and thus, never run into
the shortage of ports ever again. When the port space is full or
we've run into some other artificial limit, then we simply close
down some other connection to make space.

ps

2008-07-15 18:17:36

by Chuck Lever

Subject: Re: Performance Diagnosis

On Tue, Jul 15, 2008 at 1:44 PM, Peter Staubach <[email protected]> wrote:
> Chuck Lever wrote:
>>
>> On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]>
>> wrote:
>>
>>>
>>> If it is the notion described above, sometimes called head
>>> of line blocking, then we could think about ways to duplex
>>> operations over multiple TCP connections, perhaps with one
>>> connection for small, low latency operations, and another
>>> connection for larger, higher latency operations.
>>>
>>
>> I've dreamed about that for years. I don't think it would be too
>> difficult, but one thing that has held it back is the shortage of
>> ephemeral ports on the client may reduce the number of concurrent
>> mount points we can support.
>>
>> One way to avoid the port issue is to construct an SCTP transport for
>> NFS. SCTP allows multiple streams on the same connection, effectively
>> eliminating head of line blocking.
>
> I like the idea of combining this work with implementing a proper
> connection manager so that we don't need a connection per mount.
> We really only need one connection per client and server, no matter
> how many individual mounts there might be from that single server.
> (Or two connections, if we want to do something like this...)
>
> We could also manage the connection space and thus, never run into
> the shortage of ports ever again. When the port space is full or
> we've run into some other artificial limit, then we simply close
> down some other connection to make space.

I think we should do this for text-based mounts; however this would
mean the connection management would happen in the kernel, which (only
slightly) complicates things.

I was thinking about this a little last week when Trond mentioned
implementing a connected UDP socket transport...

It would be nice if all the kernel RPC services that needed to send a
single RPC request (like mount, rpcbind, and so on) could share a
small managed pool of sockets (a pool of TCP sockets, or a pool of
connected UDP sockets). Connected sockets have the ostensible
advantage that they can quickly detect the absence of a remote
listener. But such a pool would be a good idea because multiple mount
requests to the same server could all flow over the same set of
connections.

But we might be able to get away with something nearly as efficient if
the RPC client would always invoke a connect(AF_UNSPEC) before
destroying the socket. Wouldn't that free the ephemeral port
immediately? What are the risks of trying something like this?

--
"Alright guard, begin the unnecessarily slow-moving dipping mechanism."
--Dr. Evil

2008-07-15 18:51:25

by Trond Myklebust

Subject: Re: Performance Diagnosis

On Tue, 2008-07-15 at 14:17 -0400, Chuck Lever wrote:
> On Tue, Jul 15, 2008 at 1:44 PM, Peter Staubach <[email protected]> wrote:
> > Chuck Lever wrote:
> >>
> >> On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]>
> >> wrote:
> >>
> >>>
> >>> If it is the notion described above, sometimes called head
> >>> of line blocking, then we could think about ways to duplex
> >>> operations over multiple TCP connections, perhaps with one
> >>> connection for small, low latency operations, and another
> >>> connection for larger, higher latency operations.
> >>>
> >>
> >> I've dreamed about that for years. I don't think it would be too
> >> difficult, but one thing that has held it back is the shortage of
> >> ephemeral ports on the client may reduce the number of concurrent
> >> mount points we can support.
> >>
> >> One way to avoid the port issue is to construct an SCTP transport for
> >> NFS. SCTP allows multiple streams on the same connection, effectively
> >> eliminating head of line blocking.
> >
> > I like the idea of combining this work with implementing a proper
> > connection manager so that we don't need a connection per mount.
> > We really only need one connection per client and server, no matter
> > how many individual mounts there might be from that single server.
> > (Or two connections, if we want to do something like this...)
> >
> > We could also manage the connection space and thus, never run into
> > the shortage of ports ever again. When the port space is full or
> > we've run into some other artificial limit, then we simply close
> > down some other connection to make space.
>
> I think we should do this for text-based mounts; however this would
> mean the connection management would happen in the kernel, which (only
> slightly) complicates things.
>
> I was thinking about this a little last week when Trond mentioned
> implementing a connected UDP socket transport...
>
> It would be nice if all the kernel RPC services that needed to send a
> single RPC request (like mount, rpcbind, and so on) could share a
> small managed pool of sockets (a pool of TCP sockets, or a pool of
> connected UDP sockets). Connected sockets have the ostensible
> advantage that they can quickly detect the absence of a remote
> listener. But such a pool would be a good idea because multiple mount
> requests to the same server could all flow over the same set of
> connections.
>
> But we might be able to get away with something nearly as efficient if
> the RPC client would always invoke a connect(AF_UNSPEC) before
> destroying the socket. Wouldn't that free the ephemeral port
> immediately? What are the risks of trying something like this?


Why is all the talk here only about RPC level solutions?

Newer kernels already have a good deal of extra throttling of writes at
the NFS superblock level, and there is even a sysctl to control the
amount of outstanding writes before the VM congestion control sets in.
Please see /proc/sys/fs/nfs/nfs_congestion_kb

Cheers
Trond


2008-07-15 19:22:05

by Peter Staubach

Subject: Re: Performance Diagnosis

Trond Myklebust wrote:
> On Tue, 2008-07-15 at 14:17 -0400, Chuck Lever wrote:
>
>> On Tue, Jul 15, 2008 at 1:44 PM, Peter Staubach <[email protected]> wrote:
>>
>>> Chuck Lever wrote:
>>>
>>>> On Tue, Jul 15, 2008 at 11:58 AM, Peter Staubach <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>>> If it is the notion described above, sometimes called head
>>>>> of line blocking, then we could think about ways to duplex
>>>>> operations over multiple TCP connections, perhaps with one
>>>>> connection for small, low latency operations, and another
>>>>> connection for larger, higher latency operations.
>>>>>
>>>>>
>>>> I've dreamed about that for years. I don't think it would be too
>>>> difficult, but one thing that has held it back is the shortage of
>>>> ephemeral ports on the client may reduce the number of concurrent
>>>> mount points we can support.
>>>>
>>>> One way to avoid the port issue is to construct an SCTP transport for
>>>> NFS. SCTP allows multiple streams on the same connection, effectively
>>>> eliminating head of line blocking.
>>>>
>>> I like the idea of combining this work with implementing a proper
>>> connection manager so that we don't need a connection per mount.
>>> We really only need one connection per client and server, no matter
>>> how many individual mounts there might be from that single server.
>>> (Or two connections, if we want to do something like this...)
>>>
>>> We could also manage the connection space and thus, never run into
>>> the shortage of ports ever again. When the port space is full or
>>> we've run into some other artificial limit, then we simply close
>>> down some other connection to make space.
>>>
>> I think we should do this for text-based mounts; however this would
>> mean the connection management would happen in the kernel, which (only
>> slightly) complicates things.
>>
>> I was thinking about this a little last week when Trond mentioned
>> implementing a connected UDP socket transport...
>>
>> It would be nice if all the kernel RPC services that needed to send a
>> single RPC request (like mount, rpcbind, and so on) could share a
>> small managed pool of sockets (a pool of TCP sockets, or a pool of
>> connected UDP sockets). Connected sockets have the ostensible
>> advantage that they can quickly detect the absence of a remote
>> listener. But such a pool would be a good idea because multiple mount
>> requests to the same server could all flow over the same set of
>> connections.
>>
>> But we might be able to get away with something nearly as efficient if
>> the RPC client would always invoke a connect(AF_UNSPEC) before
>> destroying the socket. Wouldn't that free the ephemeral port
>> immediately? What are the risks of trying something like this?
>>
>
>
> Why is all the talk here only about RPC level solutions?
>
> Newer kernels already have a good deal of extra throttling of writes at
> the NFS superblock level, and there is even a sysctl to control the
> amount of outstanding writes before the VM congestion control sets in.
> Please see /proc/sys/fs/nfs/nfs_congestion_kb

The throttling of writes definitely seems like a NFS level issue,
so that's a good thing. (RHEL-5 might be far enough behind that it
can't take advantage of all of these modern things... :-))

The connection manager would seem to be a RPC level thing, although
I haven't sufficiently thought through the ramifications of the
NFSv4.1 stuff and how it might impact a connection manager.

ps

2008-07-15 19:35:57

by Trond Myklebust

Subject: Re: Performance Diagnosis

On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
> The connection manager would seem to be a RPC level thing, although
> I haven't thought through the ramifications of the NFSv4.1 stuff
> and how it might impact a connection manager sufficiently.

We already have the scheme that shuts down connections on inactive RPC
clients after a suitable timeout period, so the only gains I can see
would have to involve shutting down connections on active clients.

At that point, the danger isn't with NFSv4.1, it is rather with
NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
means that you have to be very careful about schemes that involve
shutting down connections on active RPC clients.

Cheers
Trond


2008-07-15 19:56:05

by Peter Staubach

Subject: Re: Performance Diagnosis

Trond Myklebust wrote:
> On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
>
>> The connection manager would seem to be a RPC level thing, although
>> I haven't thought through the ramifications of the NFSv4.1 stuff
>> and how it might impact a connection manager sufficiently.
>>
>
> We already have the scheme that shuts down connections on inactive RPC
> clients after a suitable timeout period, so the only gains I can see
> would have to involve shutting down connections on active clients.
>
> At that point, the danger isn't with NFSv4.1, it is rather with
> NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
> mean that you have to be very careful about schemes that involve
> shutting down connections on active RPC clients.

It seems to me that as long as we don't shut down a connection
which is actively being used for an outstanding request, then
we shouldn't have any larger problems with the duplicate caches
on servers than we do now.

We can do this easily enough by reference counting the connection
state and then only closing connections which are not being
referenced.

I definitely agree, shutting down a connection which is being used
is just inviting trouble.

A gain would be that we could reduce the numbers of connections on
active clients if we could disassociate a connection from a
particular mounted file system. As long as we can achieve maximum
network bandwidth through a single connection, then we don't need
more than one connection per server.

We could handle the case where the client was talking to more
servers than it had connection space for by forcibly, but safely
closing connections to servers and then using the space for a
new connection to a server. We could do this in the connection
manager by checking to see if there was an available connection
which was not marked as in the process of being closed. If so,
then it just enters the fray as needing a connection and works
like all of the others.

The algorithm could look something like:

top:
    Look for a connection to the right server which is not marked
    as being closed.
        If one was found, then increment its reference count and
        return it.
    Attempt to create a new connection.
        If this works, then increment its reference count and
        return it.
    Find a connection to be closed, either one not being currently
    used or via some heuristic like round-robin.
        If this connection is not actively being used, then close it
        and go to top.
    Mark the connection as being closed, wait until it is closed,
    and then go to top.

I know that this is rough and there are several races that I
glossed over, but hopefully, this will outline the general bones
of a solution.
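
The outline above can be sketched in Python as follows. This is a toy
model, not kernel code: a "connection" is just an object with a
reference count, "closing" one only removes it from a list, and a
single condition variable stands in for whatever locking a real
implementation would need. The class and method names are made up, and
the victim selection (first connection not already closing) is an
arbitrary stand-in for the round-robin heuristic.

```python
import threading


class Connection:
    def __init__(self, server):
        self.server = server
        self.refs = 0         # callers currently holding this connection
        self.closing = False  # marked as being closed


class ConnectionManager:
    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.conns = []
        self.cond = threading.Condition()

    def get(self, server):
        with self.cond:
            while True:  # "top:"
                # Reuse a connection to this server that is not closing.
                for c in self.conns:
                    if c.server == server and not c.closing:
                        c.refs += 1
                        return c
                # Attempt to create a new connection.
                if len(self.conns) < self.max_connections:
                    c = Connection(server)
                    c.refs = 1
                    self.conns.append(c)
                    return c
                # No room: close an unused connection and go to top.
                idle = [c for c in self.conns if c.refs == 0]
                if idle:
                    self.conns.remove(idle[0])
                    continue
                # Everything is busy: mark a victim and wait for it to drain.
                victim = next((c for c in self.conns if not c.closing),
                              self.conns[0])
                victim.closing = True
                self.cond.wait()

    def put(self, conn):
        with self.cond:
            conn.refs -= 1
            if conn.refs == 0:
                if conn.closing:
                    self.conns.remove(conn)  # the deferred close happens here
                self.cond.notify_all()       # wake anyone waiting in get()


mgr = ConnectionManager(max_connections=2)
a = mgr.get("server1")
b = mgr.get("server2")
mgr.put(b)               # server2 is now idle
c = mgr.get("server3")   # evicts the idle server2 connection
print([conn.server for conn in mgr.conns])
```

Note that get() blocks when every connection is busy, so the last case
only makes progress if another thread eventually calls put().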

When the system is having to recycle connections, it may slow down,
but at least it will work and not have things just fail.

Thanx...

ps

2008-07-15 20:27:37

by Trond Myklebust

Subject: Re: Performance Diagnosis

On Tue, 2008-07-15 at 15:55 -0400, Peter Staubach wrote:
> It seems to me that as long as we don't shut down a connection
> which is actively being used for an outstanding request, then
> we shouldn't have any larger problems with the duplicate caches
> on servers than we do now.
>
> We can do this easily enough by reference counting the connection
> state and then only closing connections which are not being
> referenced.

Agreed.

> A gain would be that we could reduce the numbers of connections on
> active clients if we could disassociate a connection with a
> particular mounted file system. As long as we can achieve maximum
> network bandwidth through a single connection, then we don't need
> more than one connection per server.

Isn't that pretty much the norm today anyway? The only call to
rpc_create() that I can find is made when creating the nfs_client
structure. All other NFS-related rpc connections are created as clones
of the above shared structure, and thus share the same rpc_xprt.

I'm not sure that we want to share connections in the cases where we
can't share the same nfs_client, since that usually means that RPC level
parameters such as timeout values or NFS protocol versions differ.

> We could handle the case where the client was talking to more
> servers than it had connection space for by forcibly, but safely
> closing connections to servers and then using the space for a
> new connection to a server. We could do this in the connection
> manager by checking to see if there was an available connection
> which was not marked as in the process of being closed. If so,
> then it just enters the fray as needing a connection and am
> working like all of the others.
>
> The algorithm could look something like:
>
> top:
> Look for a connection to the right server which is not marked
> as being closed.
> If one was found, then increment its reference count and
> return it.
> Attempt to create a new connect,
> If this works, then increment its reference count and
> return it.
> Find a connection to be closed, either one not being currently
> used or via some heuristic like round-robin.
> If this connection is not actively being used, then close it
> and go to top.
> Mark the connection as being closed, wait until it is closed,
> and then go to top.

Actually, what you really want to do is look at whether any of the
rpc slots are in use. If none are, then you are free to close the
connection; otherwise, go to the next.

Unfortunately, you still can't get rid of the 2 minute TIME_WAIT state
in the case of a TCP connection, so I'm not sure how useful this will
turn out to be...

Cheers
Trond


2008-07-15 20:49:24

by Peter Staubach

Subject: Re: Performance Diagnosis

Trond Myklebust wrote:
> On Tue, 2008-07-15 at 15:55 -0400, Peter Staubach wrote:
>
>> It seems to me that as long as we don't shut down a connection
>> which is actively being used for an outstanding request, then
>> we shouldn't have any larger problems with the duplicate caches
>> on servers than we do now.
>>
>> We can do this easily enough by reference counting the connection
>> state and then only closing connections which are not being
>> referenced.
>>
>
> Agreed.
>
>
>> A gain would be that we could reduce the numbers of connections on
>> active clients if we could disassociate a connection with a
>> particular mounted file system. As long as we can achieve maximum
>> network bandwidth through a single connection, then we don't need
>> more than one connection per server.
>>
>
> Isn't that pretty much the norm today anyway? The only call to
> rpc_create() that I can find is made when creating the nfs_client
> structure. All other NFS-related rpc connections are created as clones
> of the above shared structure, and thus share the same rpc_xprt.
>
>

Well, it is the norm for the shared superblock situation, yes.


> I'm not sure that we want to share connections in the cases where we
> can't share the same nfs_client, since that usually means that RPC level
> parameters such as timeout values, NFS protocol versions differ.
>
>

I think the TCP connection can be managed independent of these
things.

>> We could handle the case where the client was talking to more
>> servers than it had connection space for by forcibly, but safely
>> closing connections to servers and then using the space for a
>> new connection to a server. We could do this in the connection
>> manager by checking to see if there was an available connection
>> which was not marked as in the process of being closed. If so,
>> then it just enters the fray as needing a connection and am
>> working like all of the others.
>>
>> The algorithm could look something like:
>>
>> top:
>> Look for a connection to the right server which is not marked
>> as being closed.
>> If one was found, then increment its reference count and
>> return it.
>> Attempt to create a new connect,
>> If this works, then increment its reference count and
>> return it.
>> Find a connection to be closed, either one not being currently
>> used or via some heuristic like round-robin.
>> If this connection is not actively being used, then close it
>> and go to top.
>> Mark the connection as being closed, wait until it is closed,
>> and then go to top.
>>
>
> Actually, what you really want to do is look at whether or not any of
> the rpc slots are in use or not. If they aren't, then you are free to
> close the connection, if not, go to the next.
>
>

I think that we would still need to be able to handle the
situation where we needed a connection and all connections
appeared to be in use. I think that the ability to force
a connection to become unused would be required.


> Unfortunately, you still can't get rid of the 2 minute TIME_WAIT state
> in the case of a TCP connection, so I'm not sure how useful this will
> turn out to be...

Well, there is that... :-)

It sure seems like there has to be some way of dealing with that
thing.

Thanx...

ps


2008-07-15 21:15:30

by Talpey, Thomas

Subject: Re: Performance Diagnosis

At 03:55 PM 7/15/2008, Peter Staubach wrote:
>Trond Myklebust wrote:
>> On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
>>
>>> The connection manager would seem to be a RPC level thing, although
>>> I haven't thought through the ramifications of the NFSv4.1 stuff
>>> and how it might impact a connection manager sufficiently.
>>>
>>
>> We already have the scheme that shuts down connections on inactive RPC
>> clients after a suitable timeout period, so the only gains I can see
>> would have to involve shutting down connections on active clients.
>>
>> At that point, the danger isn't with NFSv4.1, it is rather with
>> NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
>> mean that you have to be very careful about schemes that involve
>> shutting down connections on active RPC clients.
>
>It seems to me that as long as we don't shut down a connection
>which is actively being used for an outstanding request, then
>we shouldn't have any larger problems with the duplicate caches
>on servers than we do now.
>
>We can do this easily enough by reference counting the connection
>state and then only closing connections which are not being
>referenced.
>
>I definitely agree, shutting down a connection which is being used
>is just inviting trouble.
>
>A gain would be that we could reduce the numbers of connections on
>active clients if we could disassociate a connection with a
>particular mounted file system. As long as we can achieve maximum
>network bandwidth through a single connection, then we don't need
>more than one connection per server.

Not quite!

Getting full network bandwidth is one requirement, but having the slots to
use it is another! The problem with sharing a mount currently is that the
slot table is preallocated at mount time; each time the mount is shared,
the slots become less and less adequate to the task.

If we include growing the slot table with sharing the connection, and
having some sort of non-starvation so readaheads and deep random
read workloads don't hog the slots and block out getattrs, then I agree.
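
A rough Python sketch of those two conditions, purely as an illustration:
the table grows each time another mount shares the connection, and a few
slots are held back so bulk I/O cannot crowd out small requests such as
getattrs. The SlotTable class, its method names and the numbers used are
all hypothetical and are not the kernel's slot table.

```python
class SlotTable:
    def __init__(self, slots_per_mount=16, reserved_small=2):
        self.slots_per_mount = slots_per_mount  # illustrative value
        self.reserved_small = reserved_small    # slots bulk I/O may not take
        self.mounts = 0
        self.in_use = 0

    @property
    def capacity(self):
        # The table grows with the number of mounts sharing the connection.
        return self.mounts * self.slots_per_mount

    def add_mount(self):
        self.mounts += 1

    def try_acquire(self, kind):
        """kind is "bulk" (reads, writes, readahead) or "small" (getattr)."""
        limit = self.capacity
        if kind == "bulk":
            limit -= self.reserved_small  # leave room for small requests
        if self.in_use >= limit:
            return False
        self.in_use += 1
        return True

    def release(self):
        self.in_use -= 1
```

With one mount, 4 slots and 1 reserved, a bulk workload gets 3 slots and
the fourth stays available for a getattr; adding a second mount raises the
capacity to 8 instead of leaving both mounts to fight over the original 4.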

The v4.1 session brings this to the top-level btw, by explicitly negotiating
these limits end to end.

>
>We could handle the case where the client was talking to more
>servers than it had connection space for by forcibly, but safely
>closing connections to servers and then using the space for a
>new connection to a server. We could do this in the connection
>manager by checking to see if there was an available connection
>which was not marked as in the process of being closed. If so,
>then it just enters the fray as needing a connection and am
>working like all of the others.
>
>The algorithm could look something like:
>
>top:
> Look for a connection to the right server which is not marked
> as being closed.
> If one was found, then increment its reference count and

...increase its slot count and...

> return it.
> Attempt to create a new connection.
> If this works, then increment its reference count and
> return it.
> Find a connection to be closed, either one not being currently
> used or via some heuristic like round-robin.
> If this connection is not actively being used, then close it
> and go to top.
> Mark the connection as being closed, wait until it is closed,
> and then go to top.
>
>I know that this is rough and there are several races that I
>glossed over, but hopefully, this will outline the general bones
>of a solution.

There is one other *very* important thing to note. The RPC XID is
managed on a per-mount basis in Linux, so two different mount points
can have duplicate XIDs. There is no ambiguity at the server, because
the two mounts, with two connections, have different IP 5-tuples.

But if mounts are shared, then we need to be sure that XIDs are also
shared, to avoid reply cache collisions.
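
A toy Python model of that collision, as a sketch only: it assumes a
server reply cache keyed on nothing but (connection 5-tuple, XID), which
is simpler than what real servers check, and it starts both per-mount XID
counters at the same value to force the duplicate. The addresses, ports
and helper names are made up for the example.

```python
import itertools


class ReplyCache:
    """Toy duplicate-request cache keyed by (connection 5-tuple, XID)."""

    def __init__(self):
        self.entries = {}

    def handle(self, five_tuple, xid, request):
        key = (five_tuple, xid)
        if key in self.entries:
            # Looks like a retransmission, so the cached reply is replayed.
            return self.entries[key]
        reply = "reply-to-" + request
        self.entries[key] = reply
        return reply


def run(tuple_a, tuple_b, xids_a, xids_b):
    # One request from each mount, sent to a fresh server cache.
    cache = ReplyCache()
    first = cache.handle(tuple_a, next(xids_a), "WRITE from mount A")
    second = cache.handle(tuple_b, next(xids_b), "GETATTR from mount B")
    return first, second


conn_a = ("tcp", "10.0.0.1", 700, "10.0.0.9", 2049)
conn_b = ("tcp", "10.0.0.1", 701, "10.0.0.9", 2049)

# Two mounts, two connections, per-mount XIDs: no ambiguity.
separate = run(conn_a, conn_b, itertools.count(1), itertools.count(1))

# Two mounts sharing one connection but keeping per-mount XIDs: collision.
collided = run(conn_a, conn_a, itertools.count(1), itertools.count(1))

# Shared connection with a shared XID allocator: unique keys again.
shared_xids = itertools.count(1)
fixed = run(conn_a, conn_a, shared_xids, shared_xids)
```

In the collided case, mount B's GETATTR is answered with the cached reply
to mount A's WRITE, because both requests present the same 5-tuple and
the same XID.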

Tom.

I'd like to add that sharing a connection when only one is
>
>When the system is having to recycle connections, it may slow down,
>but at least it will work and not have things just fail.
>
> Thanx...
>
> ps
>--
>To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2008-07-16 07:35:59

by Benny Halevy

[permalink] [raw]
Subject: Re: Performance Diagnosis

On Jul. 15, 2008, 22:35 +0300, Trond Myklebust <[email protected]> wrote:
> On Tue, 2008-07-15 at 15:21 -0400, Peter Staubach wrote:
>> The connection manager would seem to be a RPC level thing, although
>> I haven't thought through the ramifications of the NFSv4.1 stuff
>> and how it might impact a connection manager sufficiently.
>
> We already have the scheme that shuts down connections on inactive RPC
> clients after a suitable timeout period, so the only gains I can see
> would have to involve shutting down connections on active clients.
>
> At that point, the danger isn't with NFSv4.1, it is rather with
> NFSv2/3/4.0... Specifically, their lack of good replay cache semantics
> mean that you have to be very careful about schemes that involve
> shutting down connections on active RPC clients.

One more thing to consider about nfsv4.1 is the back channel which
uses one of the forward going connections. You may need to keep it
alive while you hold state on the client (data/dir delegations, layouts,
etc.)

Benny

>
> Cheers
> Trond
>