2013-06-23 13:27:53

by Nadav Shemer

Subject: LAYOUTGET and NFS4ERR_DELAY: a few questions

Background: I'm working on a pnfs-exported filesystem implementation
(using objects-based storage)
In my ->layout_get() implementation, I use mutex_trylock() and return
NFS4ERR_DELAY in the contended case.
In a real-world test, I discovered the client always waits 15 seconds
when receiving this error for LAYOUTGET.
This occurs in nfs4_async_handle_error, which always waits for
NFS4_POLL_RETRY_MAX when getting DELAY, GRACE, or EKEYEXPIRED.

This is in contrast to nfs4_handle_exception, which calls nfs4_delay.
In this path, the wait begins at NFS4_POLL_RETRY_MIN (0.1 seconds) and
doubles on each retry, up to RETRY_MAX.
It is used by many nfs4_proc operations - the caller creates an
nfs4_exception structure and retries the operation until success (or a
permanent error).

When nfs4_async_handle_error is used, OTOH, the RPC task is restarted
in the ->rpc_call_done callback and the sleeping is done with
rpc_delay.

nfs4_async_handle_error is used in:
CLOSE, UNLINK, RENAME, READ, WRITE, COMMIT, DELEGRETURN, LOCKU,
LAYOUTGET, LAYOUTRETURN and LAYOUTCOMMIT.
A similar behavior (waiting RETRY_MAX) is also used in the
nfs4*_sequence_* functions (in which case it refers to the status of
the SEQUENCE operation itself) and by RECLAIM_COMPLETE.
GET_LEASE_TIME also has such a code structure, but it always waits
RETRY_MIN, not MAX.


The first question, raised at the beginning of this mail:
Is it better to wait for the mutex in the nfsd thread (with the risk
of blocking that nfsd thread), or to return DELAY (with its 15-second
delay and the risk of repeatedly landing on a contended mutex even if
it is not held the whole time)?
Is there some other solution?


The second question(s):
Why are there several different implementations of the same
restart/retry behavior? Why do some operations use one mechanism and
others another?
Why isn't the exponential back-off mechanism used in these operations?


2013-06-25 14:00:20

by Nadav Shemer

Subject: Re: LAYOUTGET and NFS4ERR_DELAY: a few questions

On Tue, Jun 25, 2013 at 4:43 PM, J. Bruce Fields <[email protected]> wrote:
> On Tue, Jun 25, 2013 at 02:51:48PM +0300, Nadav Shemer wrote:
>> On Mon, Jun 24, 2013 at 10:31 PM, J. Bruce Fields <[email protected]> wrote:
>> > Attempting a summary: the constant delay is traditional behavior going
>> > back to NFSv3, and the exponential backoff was added to handle DELAY
>> > returns on OPEN due to delegation conflicts.
>> >
>> > And it would likely be tough to justify another client change here
>> > without a similar case where the spec clearly has the server returning
>> > DELAY to something that needs to be retried quickly.
>> >
>> > Not understanding your case, it doesn't sound like the result of any
>> > real requirement but rather an implementation detail that you probably
>> > want to fix in the server.
>> Well, a LAYOUTGET may cause a conflicting layout to be recalled (e.g.
>> RAID in object storage - RFC 5664, section 11).
>> Is that not similar to the
>> OPEN case?
>
> I'd expect there to be more options in the LAYOUTGET case, since a
> client can always fall back to MDS IO in the case of LAYOUTGET failure,
> whereas a failed OPEN is fatal.
Yes, but the Linux client only does so on a permanent failure.


>> This makes me ponder. If the server blocks while waiting for
>> conflicting layouts to be recalled, I think we can theoretically reach
>> a deadlock (if we take up all the nfsd threads or all the clients'
>> session slots): client A holds layout to file X, and requests layout to
>> file Y, while client B holds layout to file Y and requests layout to
>> file X.
>> To avoid this, we pretty much have to return DELAY for LAYOUTGET
>
> I agree that you wouldn't want to block waiting for a client to return a
> layout. Is this a case for NFS4ERR_LAYOUTTRYLATER?
Yes, I believe it is.
Specifically, the Linux client treats them all the same (LAYOUTTRYLATER
and RECALLCONFLICT are both mapped to DELAY before being passed to
nfs4_async_handle_error).
Do you think there is a case for exponential backoff here with a
specific (non-DELAY) error code?

>
> --b.

2013-06-25 13:43:26

by J. Bruce Fields

Subject: Re: LAYOUTGET and NFS4ERR_DELAY: a few questions

On Tue, Jun 25, 2013 at 02:51:48PM +0300, Nadav Shemer wrote:
> On Mon, Jun 24, 2013 at 10:31 PM, J. Bruce Fields <[email protected]> wrote:
> > Attempting a summary: the constant delay is traditional behavior going
> > back to NFSv3, and the exponential backoff was added to handle DELAY
> > returns on OPEN due to delegation conflicts.
> >
> > And it would likely be tough to justify another client change here
> > without a similar case where the spec clearly has the server returning
> > DELAY to something that needs to be retried quickly.
> >
> > Not understanding your case, it doesn't sound like the result of any
> > real requirement but rather an implementation detail that you probably
> > want to fix in the server.
> Well, a LAYOUTGET may cause a conflicting layout to be recalled (e.g.
> RAID in object storage - RFC 5664, section 11).
> Is that not similar to the
> OPEN case?

I'd expect there to be more options in the LAYOUTGET case, since a
client can always fall back to MDS IO in the case of LAYOUTGET failure,
whereas a failed OPEN is fatal.

> This makes me ponder. If the server blocks while waiting for
> conflicting layouts to be recalled, I think we can theoretically reach
> a deadlock (if we take up all the nfsd threads or all the clients'
> session slots): client A holds layout to file X, and requests layout to
> file Y, while client B holds layout to file Y and requests layout to
> file X.
> To avoid this, we pretty much have to return DELAY for LAYOUTGET

I agree that you wouldn't want to block waiting for a client to return a
layout. Is this a case for NFS4ERR_LAYOUTTRYLATER?

--b.

2013-06-25 14:14:50

by Myklebust, Trond

Subject: Re: LAYOUTGET and NFS4ERR_DELAY: a few questions

On Tue, 2013-06-25 at 17:00 +0300, Nadav Shemer wrote:
> On Tue, Jun 25, 2013 at 4:43 PM, J. Bruce Fields <[email protected]> wrote:
> > On Tue, Jun 25, 2013 at 02:51:48PM +0300, Nadav Shemer wrote:
> >> This makes me ponder. If the server blocks while waiting for
> >> conflicting layouts to be recalled, I think we can theoretically reach
> >> a deadlock (if we take up all the nfsd threads or all the clients'
> >> session slots): client A holds layout to file X, and requests layout to
> >> file Y, while client B holds layout to file Y and requests layout to
> >> file X.
> >> To avoid this, we pretty much have to return DELAY for LAYOUTGET
> >
> > I agree that you wouldn't want to block waiting for a client to return a
> > layout. Is this a case for NFS4ERR_LAYOUTTRYLATER?
> Yes, I believe it is.
> Specifically, the Linux client treats them all the same (LAYOUTTRYLATER
> and RECALLCONFLICT are both mapped to DELAY before being passed to
> nfs4_async_handle_error).
> Do you think there is a case for an exponential backoff in this case
> for a specific (non-DELAY) error code?

Why do we care? If this is about passing some unit test somewhere, then
I frankly don't give a damn.

If, OTOH, there is a valid use case where 2 clients must request
conflicting layouts for the same file, then let's discuss that specific
case.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-06-24 19:31:53

by J. Bruce Fields

Subject: Re: LAYOUTGET and NFS4ERR_DELAY: a few questions

On Sun, Jun 23, 2013 at 04:27:52PM +0300, Nadav Shemer wrote:
> Background: I'm working on a pnfs-exported filesystem implementation
> (using objects-based storage)
> In my ->layout_get() implementation, I use mutex_trylock() and return
> NFS4ERR_DELAY in the contended case
> In a real-world test, I discovered the client always waits 15 seconds
> when receiving this error for LAYOUTGET.
> This occurs in nfs4_async_handle_error, which always waits for
> NFS4_POLL_RETRY_MAX when getting DELAY, GRACE, or EKEYEXPIRED
>
> This is in contrast to nfs4_handle_exception, which calls nfs4_delay.
> In this path, the wait begins at NFS4_POLL_RETRY_MIN (0.1 seconds) and
> increases two-fold each time up to RETRY_MAX.
> It is used by many nfs4_proc operations - the caller creates an
> nfs4_exception structure, and retries the operation until success (or
> permanent error).
>
> when nfs4_async_handle_error is used, OTOH, the RPC task is restarted
> in the ->rpc_call_done callback and the sleeping is done with
> rpc_delay
>
> nfs4_async_handle_error is used in:
> CLOSE, UNLINK, RENAME, READ, WRITE, COMMIT, DELEGRETURN, LOCKU,
> LAYOUTGET, LAYOUTRETURN and LAYOUTCOMMIT.
> A similar behavior (waiting RETRY_MAX) is also used in
> nfs4*_sequence_* functions (in which case it refers to the status of
> the SEQUENCE operation itself) and by RECLAIM_COMPLETE
> GET_LEASE_TIME also has such a code structure, but it always waits
> RETRY_MIN, not MAX
>
>
> The first question, raised in the beginning of this mail:
> Is it better to wait for the mutex in the NFSd thread (with the risk
> of blocking that nfsd thread)

nfsd threads block on mutexes all the time, and it's not necessarily a
problem--depends on exactly what it's blocking on. You wouldn't want to
block waiting for the client to do something, as that might lead to
deadlock if the client can't make progress until the server responds to
some rpc. If you're blocking waiting for a disk or some internal
cluster communication--it may be fine?

> or to return DELAY(with its 15s delay
> and risk of repeatedly landing on a contended mutex even if it is not
> kept locked the whole time)?
> Is there some other solution?
>
>
> The second question(s):
> Why are there several different implementations of the same
> restart/retry behaviors? why do some operations use one mechanism and
> others use another?
> Why isn't the exponential back-off mechanism used in these operations?

Here's a previous thread on the subject:

http://comments.gmane.org/gmane.linux.nfs/56193

Attempting a summary: the constant delay is traditional behavior going
back to NFSv3, and the exponential backoff was added to handle DELAY
returns on OPEN due to delegation conflicts.

And it would likely be tough to justify another client change here
without a similar case where the spec clearly has the server returning
DELAY to something that needs to be retried quickly.

Not understanding your case, it doesn't sound like the result of any
real requirement but rather an implementation detail that you probably
want to fix in the server.

--b.

2013-06-25 11:51:50

by Nadav Shemer

Subject: Re: LAYOUTGET and NFS4ERR_DELAY: a few questions

On Mon, Jun 24, 2013 at 10:31 PM, J. Bruce Fields <[email protected]> wrote:
> On Sun, Jun 23, 2013 at 04:27:52PM +0300, Nadav Shemer wrote:
>> Background: I'm working on a pnfs-exported filesystem implementation
>> (using objects-based storage)
>> In my ->layout_get() implementation, I use mutex_trylock() and return
>> NFS4ERR_DELAY in the contended case
>> In a real-world test, I discovered the client always waits 15 seconds
>> when receiving this error for LAYOUTGET.
>> This occurs in nfs4_async_handle_error, which always waits for
>> NFS4_POLL_RETRY_MAX when getting DELAY, GRACE, or EKEYEXPIRED
>>
>> This is in contrast to nfs4_handle_exception, which calls nfs4_delay.
>> In this path, the wait begins at NFS4_POLL_RETRY_MIN (0.1 seconds) and
>> increases two-fold each time up to RETRY_MAX.
>> It is used by many nfs4_proc operations - the caller creates an
>> nfs4_exception structure, and retries the operation until success (or
>> permanent error).
>>
>> when nfs4_async_handle_error is used, OTOH, the RPC task is restarted
>> in the ->rpc_call_done callback and the sleeping is done with
>> rpc_delay
>>
>> nfs4_async_handle_error is used in:
>> CLOSE, UNLINK, RENAME, READ, WRITE, COMMIT, DELEGRETURN, LOCKU,
>> LAYOUTGET, LAYOUTRETURN and LAYOUTCOMMIT.
>> A similar behavior (waiting RETRY_MAX) is also used in
>> nfs4*_sequence_* functions (in which case it refers to the status of
>> the SEQUENCE operation itself) and by RECLAIM_COMPLETE
>> GET_LEASE_TIME also has such a code structure, but it always waits
>> RETRY_MIN, not MAX
>>
>>
>> The first question, raised in the beginning of this mail:
>> Is it better to wait for the mutex in the NFSd thread (with the risk
>> of blocking that nfsd thread)
>
> nfsd threads block on mutexes all the time, and it's not necessarily a
> problem--depends on exactly what it's blocking on. You wouldn't want to
> block waiting for the client to do something, as that might lead to
> deadlock if the client can't make progress until the server responds to
> some rpc. If you're blocking waiting for a disk or some internal
> cluster communication--it may be fine?
Internal cluster communication - I may be blocking on a DS operation
(so the mutex is being held by another nfsd thread).
Does it make sense, then, to have many more nfsd threads than CPUs,
if they spend their days waiting on other hosts?


>> or to return DELAY(with its 15s delay
>> and risk of repeatedly landing on a contended mutex even if it is not
>> kept locked the whole time)?
>> Is there some other solution?
>>
>>
>> The second question(s):
>> Why are there several different implementations of the same
>> restart/retry behaviors? why do some operations use one mechanism and
>> others use another?
>> Why isn't the exponential back-off mechanism used in these operations?
>
> Here's a previous thread on the subject:
>
> http://comments.gmane.org/gmane.linux.nfs/56193
Thanks!

>
> Attempting a summary: the constant delay is traditional behavior going
> back to NFSv3, and the exponential backoff was added to handle DELAY
> returns on OPEN due to delegation conflicts.
>
> And it would likely be tough to justify another client change here
> without a similar case where the spec clearly has the server returning
> DELAY to something that needs to be retried quickly.
>
> Not understanding your case, it doesn't sound like the result of any
> real requirement but rather an implementation detail that you probably
> want to fix in the server.
Well, a LAYOUTGET may cause a conflicting layout to be recalled (e.g.
RAID in object storage - RFC 5664, section 11). Is that not similar to
the OPEN case?
This makes me ponder: if the server blocks while waiting for
conflicting layouts to be recalled, I think we can theoretically reach
a deadlock (if we take up all the nfsd threads or all the clients'
session slots): client A holds a layout to file X and requests a
layout to file Y, while client B holds a layout to file Y and requests
a layout to file X.
To avoid this, we pretty much have to return DELAY for LAYOUTGET.

>
> --b.