LinuxLists.cc - [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

2008-06-13 11:43:08

Subject: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

OCFS2 can return -ERESTARTSYS from write requests (and possibly
elsewhere) if there is a signal pending.

If nfsd is shutdown (by sending a signal to each thread) while there
is still an IO load from the client, each thread could handle one last
request with a signal pending. This can result in -ERESTARTSYS
which is not understood by nfserrno() and so is reflected back to
the client as nfserr_io aka -EIO. This is wrong.

Instead, interpret ERESTARTSYS to mean "don't send a reply".
The client will resend and - if the server is restarted - the write will
(hopefully) be successful and everyone will be happy.

Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/nfsd/nfsproc.c | 1 +
1 file changed, 1 insertion(+)

----
Funny how the shortest patches sometimes have the longest
descriptions.

The symptom that I narrowed down to this was:
copy a large file via NFS to an OCFS2 filesystem, and restart
the nfs server during the copy.
The 'cp' might get an -EIO, and the file will be corrupted -
presumably holes in the middle were writes appeared to fail.

diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
--- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
+++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
@@ -614,6 +614,7 @@ nfserrno (int errno)
#endif
{ nfserr_stale, -ESTALE },
{ nfserr_jukebox, -ETIMEDOUT },
+ { nfserr_dropit, -ERESTARTSYS },
{ nfserr_dropit, -EAGAIN },
{ nfserr_dropit, -ENOMEM },
{ nfserr_badname, -ESRCH },

### Diffstat output
./fs/nfsd/nfsproc.c | 1 +
1 file changed, 1 insertion(+)

diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
--- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
+++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
@@ -614,6 +614,7 @@ nfserrno (int errno)
#endif
{ nfserr_stale, -ESTALE },
{ nfserr_jukebox, -ETIMEDOUT },
+ { nfserr_dropit, -ERESTARTSYS },
{ nfserr_dropit, -EAGAIN },
{ nfserr_dropit, -ENOMEM },
{ nfserr_badname, -ESRCH },

2008-06-14 17:11:23

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Fri, Jun 13, 2008 at 09:42:15PM +1000, NeilBrown wrote:
>
> OCFS2 can return -ERESTARTSYS from write requests (and possibly
> elsewhere) if there is a signal pending.
>
> If nfsd is shutdown (by sending a signal to each thread) while there
> is still an IO load from the client, each thread could handle one last
> request with a signal pending. This can result in -ERESTARTSYS
> which is not understood by nfserrno() and so is reflected back to
> the client as nfserr_io aka -EIO. This is wrong.
>
> Instead, interpret ERESTARTSYS to mean "don't send a reply".
> The client will resend and - if the server is restarted - the write will
> (hopefully) be successful and everyone will be happy.

Thanks, applied--with a trivial change:

>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/nfsd/nfsproc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> ----
> Funny how the shortest patches sometimes have the longest
> descriptions.

I added this to the body of the patch description:
>
> The symptom that I narrowed down to this was:
> copy a large file via NFS to an OCFS2 filesystem, and restart
> the nfs server during the copy.
> The 'cp' might get an -EIO, and the file will be corrupted -
> presumably holes in the middle were writes appeared to fail.

On the grounds that someone might run across the same problem or need to
reproduce it for some reason, so a good, simple description of the case
that found it might help.

--b.

>
> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
> --- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
> +++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
> @@ -614,6 +614,7 @@ nfserrno (int errno)
> #endif
> { nfserr_stale, -ESTALE },
> { nfserr_jukebox, -ETIMEDOUT },
> + { nfserr_dropit, -ERESTARTSYS },
> { nfserr_dropit, -EAGAIN },
> { nfserr_dropit, -ENOMEM },
> { nfserr_badname, -ESRCH },
>
> ### Diffstat output
> ./fs/nfsd/nfsproc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
> --- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
> +++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
> @@ -614,6 +614,7 @@ nfserrno (int errno)
> #endif
> { nfserr_stale, -ESTALE },
> { nfserr_jukebox, -ETIMEDOUT },
> + { nfserr_dropit, -ERESTARTSYS },
> { nfserr_dropit, -EAGAIN },
> { nfserr_dropit, -ENOMEM },
> { nfserr_badname, -ESRCH },

2008-06-16 12:40:11

by Peter Staubach

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

NeilBrown wrote:
> OCFS2 can return -ERESTARTSYS from write requests (and possibly
> elsewhere) if there is a signal pending.
>
> If nfsd is shutdown (by sending a signal to each thread) while there
> is still an IO load from the client, each thread could handle one last
> request with a signal pending. This can result in -ERESTARTSYS
> which is not understood by nfserrno() and so is reflected back to
> the client as nfserr_io aka -EIO. This is wrong.
>
> Instead, interpret ERESTARTSYS to mean "don't send a reply".
> The client will resend and - if the server is restarted - the write will
> (hopefully) be successful and everyone will be happy.
>
>

Why not handle -ERESTARTSYS in the same fashion as -ETIMEDOUT, ie.
leading to a EJUKEBOX sort of error being returned if possible?

Simply not returning is a bad thing to do for anything other than
NFSv2. It is especially bad for NFSv4.

ps

> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/nfsd/nfsproc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> ----
> Funny how the shortest patches sometimes have the longest
> descriptions.
>
> The symptom that I narrowed down to this was:
> copy a large file via NFS to an OCFS2 filesystem, and restart
> the nfs server during the copy.
> The 'cp' might get an -EIO, and the file will be corrupted -
> presumably holes in the middle were writes appeared to fail.
>
> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
> --- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
> +++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
> @@ -614,6 +614,7 @@ nfserrno (int errno)
> #endif
> { nfserr_stale, -ESTALE },
> { nfserr_jukebox, -ETIMEDOUT },
> + { nfserr_dropit, -ERESTARTSYS },
> { nfserr_dropit, -EAGAIN },
> { nfserr_dropit, -ENOMEM },
> { nfserr_badname, -ESRCH },
>
> ### Diffstat output
> ./fs/nfsd/nfsproc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
> --- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
> +++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
> @@ -614,6 +614,7 @@ nfserrno (int errno)
> #endif
> { nfserr_stale, -ESTALE },
> { nfserr_jukebox, -ETIMEDOUT },
> + { nfserr_dropit, -ERESTARTSYS },
> { nfserr_dropit, -EAGAIN },
> { nfserr_dropit, -ENOMEM },
> { nfserr_badname, -ESRCH },
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2008-06-16 15:10:20

by Chuck Lever

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Mon, Jun 16, 2008 at 8:39 AM, Peter Staubach <[email protected]> wrote:
> NeilBrown wrote:
>>
>> OCFS2 can return -ERESTARTSYS from write requests (and possibly
>> elsewhere) if there is a signal pending.
>>
>> If nfsd is shutdown (by sending a signal to each thread) while there
>> is still an IO load from the client, each thread could handle one last
>> request with a signal pending. This can result in -ERESTARTSYS
>> which is not understood by nfserrno() and so is reflected back to
>> the client as nfserr_io aka -EIO. This is wrong.
>>
>> Instead, interpret ERESTARTSYS to mean "don't send a reply".
>> The client will resend and - if the server is restarted - the write will
>> (hopefully) be successful and everyone will be happy.
>>
>>
>
> Why not handle -ERESTARTSYS in the same fashion as -ETIMEDOUT, ie.
> leading to a EJUKEBOX sort of error being returned if possible?
>
> Simply not returning is a bad thing to do for anything other than
> NFSv2. It is especially bad for NFSv4.

Actually, the NFSv4 spec *requires* the server to reply to every request.

Not replying means an NFSv4 client connected via NFSv4 will have to
disconnect and retransmit. That should be avoided if at all possible.

I think an error reply is much better than no reply in nearly every
case. NFS3ERR_JUKEBOX/NFS4ERR_DELAY is an interesting idea, but
something else again will probably be required for v4.1 with sessions.

>> Signed-off-by: Neil Brown <[email protected]>
>>
>> ### Diffstat output
>> ./fs/nfsd/nfsproc.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> ----
>> Funny how the shortest patches sometimes have the longest
>> descriptions.
>>
>> The symptom that I narrowed down to this was:
>> copy a large file via NFS to an OCFS2 filesystem, and restart
>> the nfs server during the copy.
>> The 'cp' might get an -EIO, and the file will be corrupted -
>> presumably holes in the middle were writes appeared to fail.
>>
>> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
>> --- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
>> +++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
>> @@ -614,6 +614,7 @@ nfserrno (int errno)
>> #endif
>> { nfserr_stale, -ESTALE },
>> { nfserr_jukebox, -ETIMEDOUT },
>> + { nfserr_dropit, -ERESTARTSYS },
>> { nfserr_dropit, -EAGAIN },
>> { nfserr_dropit, -ENOMEM },
>> { nfserr_badname, -ESRCH },
>>
>> ### Diffstat output
>> ./fs/nfsd/nfsproc.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
>> --- .prev/fs/nfsd/nfsproc.c 2008-06-13 21:31:53.000000000 +1000
>> +++ ./fs/nfsd/nfsproc.c 2008-06-13 21:31:57.000000000 +1000
>> @@ -614,6 +614,7 @@ nfserrno (int errno)
>> #endif
>> { nfserr_stale, -ESTALE },
>> { nfserr_jukebox, -ETIMEDOUT },
>> + { nfserr_dropit, -ERESTARTSYS },
>> { nfserr_dropit, -EAGAIN },
>> { nfserr_dropit, -ENOMEM },
>> { nfserr_badname, -ESRCH },
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
I am certain that these presidents will understand the cry of the
people of Bolivia, of the people of Latin America and the whole world,
which wants to have more food and not more cars. First food, then if
something's left over, more cars, more automobiles. I think that life
has to come first.
-- Evo Morales

2008-06-16 16:55:10

by Trond Myklebust

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Mon, 2008-06-16 at 11:09 -0400, Chuck Lever wrote:

> I think an error reply is much better than no reply in nearly every
> case. NFS3ERR_JUKEBOX/NFS4ERR_DELAY is an interesting idea, but
> something else again will probably be required for v4.1 with sessions.

NFS3ERR_JUKEBOX/NFS4ERR_DELAY may be inappropriate if the nfs daemon has
already started handling the RPC call, since you may be interrupting a
non-idempotent operation.

The only complete solution to this problem is NFSv4.1 with persistent
sessions.

Cheers
Trond

2008-06-16 18:14:19

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Mon, 16 Jun 2008 12:54:46 -0400
Trond Myklebust <[email protected]> wrote:

> On Mon, 2008-06-16 at 11:09 -0400, Chuck Lever wrote:
>
> > I think an error reply is much better than no reply in nearly every
> > case. NFS3ERR_JUKEBOX/NFS4ERR_DELAY is an interesting idea, but
> > something else again will probably be required for v4.1 with sessions.
>
> NFS3ERR_JUKEBOX/NFS4ERR_DELAY may be inappropriate if the nfs daemon has
> already started handling the RPC call, since you may be interrupting a
> non-idempotent operation.
>

But if you drop the reply, the client will probably still end up
retransmitting the request. It seems like a JUKEBOX/DELAY error
is at least a defined error to the client instead of leaving it
guessing. Either way, the client could still end up on the wrong
side of a non-idempotent op.

Also, am I right that this should really only be happening if nfsd
catches a SIGKILL? All other signals should be masked off when
doing the local file operation. Or do we have a potential race if
we catch a signal just after svc_recv returns but before the new
sigmask is set and svc_process is called?

> The only complete solution to this problem is NFSv4.1 with persistent
> sessions.
>

That's probably the case, though we should probably try to do
best-effort for earlier versions.

--
Jeff Layton <[email protected]>

2008-06-17 07:04:15

by NeilBrown

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Monday June 16, [email protected] wrote:
> On Mon, 2008-06-16 at 11:09 -0400, Chuck Lever wrote:
>
> > I think an error reply is much better than no reply in nearly every
> > case. NFS3ERR_JUKEBOX/NFS4ERR_DELAY is an interesting idea, but
> > something else again will probably be required for v4.1 with sessions.
>
> NFS3ERR_JUKEBOX/NFS4ERR_DELAY may be inappropriate if the nfs daemon has
> already started handling the RPC call, since you may be interrupting a
> non-idempotent operation.

If the filesystem allows you to interrupt a non-idempotent operation
part way through, then the filesystem is doing something very wrong.

The observed behaviour is that multiple 32K writes are outstanding
(in different nfsd threads) when a signal is delivered to each nfsd.

OCFS2 appears to be serialising these writes.

One of the writes completes returning a length that is less than 32K.
This length is returned to the client. A quick look at the client
code suggests that it complains with a printk, and tries to write the
remainder, which seems correct.

The other writes all complete with ERESTARTSYS. Presumably they
haven't started at all. If they had, you might expect a partial
return from them too.

So far, what OCFS2 is doing seems credible and doesn't leave us in an
awkward position with respect to incomplete idempotent operations.

I cannot be certain, but I'm willing to believe that OCFS2 only
returns ERESTARTSYS when the operation hasn't been performed at all
(or has been wound-back to the starting condition).

I agree that NFS3ERR_JUKEBOX is more appropriate than no reply, but I
don't think there is any reason to suspect that will not be
sufficient.

Thanks,
NeilBrown

2008-06-18 15:59:47

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Tue, Jun 17, 2008 at 05:03:01PM +1000, Neil Brown wrote:
> On Monday June 16, [email protected] wrote:
> > On Mon, 2008-06-16 at 11:09 -0400, Chuck Lever wrote:
> >
> > > I think an error reply is much better than no reply in nearly every
> > > case. NFS3ERR_JUKEBOX/NFS4ERR_DELAY is an interesting idea, but
> > > something else again will probably be required for v4.1 with sessions.
> >
> > NFS3ERR_JUKEBOX/NFS4ERR_DELAY may be inappropriate if the nfs daemon has
> > already started handling the RPC call, since you may be interrupting a
> > non-idempotent operation.
>
> If the filesystem allows you to interrupt a non-idempotent operation
> part way through, then the filesystem is doing something very wrong.
>
> The observed behaviour is that multiple 32K writes are outstanding
> (in different nfsd threads) when a signal is delivered to each nfsd.
>
> OCFS2 appears to be serialising these writes.
>
> One of the writes completes returning a length that is less than 32K.
> This length is returned to the client. A quick look at the client
> code suggests that it complains with a printk, and tries to write the
> remainder, which seems correct.
>
> The other writes all complete with ERESTARTSYS. Presumably they
> haven't started at all. If they had, you might expect a partial
> return from them too.
>
> So far, what OCFS2 is doing seems credible and doesn't leave us in an
> awkward position with respect to incomplete idempotent operations.
>
> I cannot be certain, but I'm willing to believe that OCFS2 only
> returns ERESTARTSYS when the operation hasn't been performed at all
> (or has been wound-back to the starting condition).
>
> I agree that NFS3ERR_JUKEBOX is more appropriate than no reply, but I
> don't think there is any reason to suspect that will not be
> sufficient.

OK. Want to send a replacement patch?

--b.