2008-06-19 00:11:16

by NeilBrown

Subject: [PATCH - take 2] knfsd: nfsd: Handle ERESTARTSYS from syscalls.


OCFS2 can return -ERESTARTSYS from write requests (and possibly
elsewhere) if there is a signal pending.

If nfsd is shut down (by sending a signal to each thread) while there
is still an IO load from the client, each thread could handle one last
request with a signal pending. This can result in -ERESTARTSYS
which is not understood by nfserrno() and so is reflected back to
the client as nfserr_io aka -EIO. This is wrong.

Instead, interpret ERESTARTSYS to mean "try again later" by returning
nfserr_jukebox. The client will resend and - if the server is
restarted - the write will (hopefully) be successful and everyone will
be happy.

The symptom that I narrowed down to this was:
copy a large file via NFS to an OCFS2 filesystem, and restart
the nfs server during the copy.
The 'cp' might get an -EIO, and the file will be corrupted -
presumably holes in the middle where writes appeared to fail.


Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./fs/nfsd/nfsproc.c | 1 +
1 file changed, 1 insertion(+)

diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
--- .prev/fs/nfsd/nfsproc.c 2008-06-19 10:06:36.000000000 +1000
+++ ./fs/nfsd/nfsproc.c 2008-06-19 10:07:58.000000000 +1000
@@ -614,6 +614,7 @@ nfserrno (int errno)
#endif
{ nfserr_stale, -ESTALE },
{ nfserr_jukebox, -ETIMEDOUT },
+ { nfserr_jukebox, -ERESTARTSYS },
{ nfserr_dropit, -EAGAIN },
{ nfserr_dropit, -ENOMEM },
{ nfserr_badname, -ESRCH },
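
For reference, here is a simplified sketch of the lookup nfserrno() does
(paraphrased, not an exact copy of fs/nfsd/nfsproc.c): the caller passes in a
negative errno, it is matched against a static table of { nfs status, errno }
pairs, and anything not in the table falls through to nfserr_io, which is why
an unmapped -ERESTARTSYS was reaching the client as EIO.

/*
 * Simplified sketch of nfserrno(); the table is trimmed to a few entries.
 * The patch above adds the -ERESTARTSYS row so it maps to nfserr_jukebox
 * instead of hitting the nfserr_io default.
 */
static __be32 nfserrno_sketch(int errno)
{
	static const struct {
		__be32	nfserr;
		int	syserr;
	} nfs_errtbl[] = {
		{ nfserr_perm,		-EPERM },
		{ nfserr_noent,		-ENOENT },
		{ nfserr_stale,		-ESTALE },
		{ nfserr_jukebox,	-ETIMEDOUT },
		{ nfserr_jukebox,	-ERESTARTSYS },	/* new entry */
		{ nfserr_dropit,	-EAGAIN },
		{ nfserr_dropit,	-ENOMEM },
		/* ... remaining entries elided ... */
	};
	int	i;

	for (i = 0; i < ARRAY_SIZE(nfs_errtbl); i++)
		if (nfs_errtbl[i].syserr == errno)
			return nfs_errtbl[i].nfserr;

	/* Unknown errnos end up here; the client sees EIO. */
	return nfserr_io;
}

With the extra row, a signalled OCFS2 write that returns -ERESTARTSYS is
reported as nfserr_jukebox, which the client treats as "try again later"
and retries instead of failing the copy with EIO.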


2008-06-19 01:09:52

by Jeff Layton

Subject: Re: [PATCH - take 2] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Thu, 19 Jun 2008 10:11:09 +1000
NeilBrown <[email protected]> wrote:

>
> OCFS2 can return -ERESTARTSYS from write requests (and possibly
> elsewhere) if there is a signal pending.
>
> If nfsd is shutdown (by sending a signal to each thread) while there
> is still an IO load from the client, each thread could handle one last
> request with a signal pending. This can result in -ERESTARTSYS
> which is not understood by nfserrno() and so is reflected back to
> the client as nfserr_io aka -EIO. This is wrong.
>
> Instead, interpret ERESTARTSYS to mean "try again later" by returning
> nfserr_jukebox. The client will resend and - if the server is
> restarted - the write will (hopefully) be successful and everyone will
> be happy.
>
> The symptom that I narrowed down to this was:
> copy a large file via NFS to an OCFS2 filesystem, and restart
> the nfs server during the copy.
> The 'cp' might get an -EIO, and the file will be corrupted -
> presumably holes in the middle where writes appeared to fail.
>
>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/nfsd/nfsproc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
> --- .prev/fs/nfsd/nfsproc.c 2008-06-19 10:06:36.000000000 +1000
> +++ ./fs/nfsd/nfsproc.c 2008-06-19 10:07:58.000000000 +1000
> @@ -614,6 +614,7 @@ nfserrno (int errno)
> #endif
> { nfserr_stale, -ESTALE },
> { nfserr_jukebox, -ETIMEDOUT },
> + { nfserr_jukebox, -ERESTARTSYS },
> { nfserr_dropit, -EAGAIN },
> { nfserr_dropit, -ENOMEM },
> { nfserr_badname, -ESRCH },

No objection to the patch, but what signal was being sent to nfsd when
you saw this? If it's anything but a SIGKILL, then I wonder if we have
a race that we need to deal with. My understanding is that we have nfsd
flip between 2 sigmasks to prevent anything but a SIGKILL from being
delivered while we're handling the local filesystem operation.

From nfsd():

----------[snip]-----------
	sigprocmask(SIG_SETMASK, &shutdown_mask, NULL);

	/*
	 * Find a socket with data available and call its
	 * recvfrom routine.
	 */
	while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
		;
	if (err < 0)
		break;
	update_thread_usage(atomic_read(&nfsd_busy));
	atomic_inc(&nfsd_busy);

	/* Lock the export hash tables for reading. */
	exp_readlock();

	/* Process request with signals blocked. */
	sigprocmask(SIG_SETMASK, &allowed_mask, NULL);

	svc_process(rqstp);

----------[snip]-----------

What happens if this catches a SIGINT after the err<0 check, but before
the mask is set to allowed_mask? Does svc_process() then get called with
a signal pending?

--
Jeff Layton <[email protected]>

2008-06-19 02:29:24

by NeilBrown

Subject: Re: [PATCH - take 2] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Wednesday June 18, [email protected] wrote:
>
> No objection to the patch, but what signal was being sent to nfsd when
> you saw this? If it's anything but a SIGKILL, then I wonder if we have
> a race that we need to deal with. My understanding is that we have nfsd
> flip between 2 sigmasks to prevent anything but a SIGKILL from being
> delivered while we're handling the local filesystem operation.

SuSE /etc/init.d/nfsserver does

killproc -n -KILL nfsd

so it looks like a SIGKILL.


>
> From nfsd():
>
> ----------[snip]-----------
> sigprocmask(SIG_SETMASK, &shutdown_mask, NULL);
>
> /*
> * Find a socket with data available and call its
> * recvfrom routine.
> */
> while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> ;
> if (err < 0)
> break;
> update_thread_usage(atomic_read(&nfsd_busy));
> atomic_inc(&nfsd_busy);
>
> /* Lock the export hash tables for reading. */
> exp_readlock();
>
> /* Process request with signals blocked. */
> sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
>
> svc_process(rqstp);
>
> ----------[snip]-----------
>
> What happens if this catches a SIGINT after the err<0 check, but before
> the mask is set to allowed_mask? Does svc_process() then get called with
> a signal pending?

Yes, I suspect it does.

I wonder why we have all this mucking about with signal masks anyway.
Anyone have any ideas about what it actually achieves?

NeilBrown

2008-06-19 10:38:24

by Jeff Layton

Subject: Re: [PATCH - take 2] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Thu, 19 Jun 2008 12:29:16 +1000
Neil Brown <[email protected]> wrote:

> On Wednesday June 18, [email protected] wrote:
> >
> > No objection to the patch, but what signal was being sent to nfsd when
> > you saw this? If it's anything but a SIGKILL, then I wonder if we have
> > a race that we need to deal with. My understanding is that we have nfsd
> > flip between 2 sigmasks to prevent anything but a SIGKILL from being
> > delivered while we're handling the local filesystem operation.
>
> SuSE /etc/init.d/nfsserver does
>
> killproc -n -KILL nfsd
>
> so it looks like a SIGKILL.
>
>
> >
> > From nfsd():
> >
> > ----------[snip]-----------
> > sigprocmask(SIG_SETMASK, &shutdown_mask, NULL);
> >
> > /*
> > * Find a socket with data available and call its
> > * recvfrom routine.
> > */
> > while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> > ;
> > if (err < 0)
> > break;
> > update_thread_usage(atomic_read(&nfsd_busy));
> > atomic_inc(&nfsd_busy);
> >
> > /* Lock the export hash tables for reading. */
> > exp_readlock();
> >
> > /* Process request with signals blocked. */
> > sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
> >
> > svc_process(rqstp);
> >
> > ----------[snip]-----------
> >
> > What happens if this catches a SIGINT after the err<0 check, but before
> > the mask is set to allowed_mask? Does svc_process() then get called with
> > a signal pending?
>
> Yes, I suspect it does.
>
> I wonder why we have all this mucking about with signal masks anyway.
> Anyone have any ideas about what it actually achieves?
>

HCH asked me the same question when I did the conversion to kthreads.
My interpretation (based on guesswork here) was that we wanted to
distinguish between SIGKILL and other allowed signals. A SIGKILL is
allowed to interrupt the underlying I/O, but other signals should not.

The question to answer here, I suppose, is whether masking a pending
signal is sufficient to make signal_pending() return false. If I'm
reading the code correctly, the answer should be "yes". So I don't think we
have a race here after all. I suspect that if SuSE used a different
signal here, that would prevent this from happening. For the record,
both RHEL and Fedora's init scripts use SIGINT for this.

--
Jeff Layton <[email protected]>

2008-06-20 17:34:19

by Bruce Fields

Subject: Re: [PATCH - take 2] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

Thanks, applied (and replaced the earlier patch in the for-2.6.27 branch
at

git://linux-nfs.org/~bfields/linux.git for-2.6.27

I was sort of curious whether I could convince myself to go through this
cycle only appending to that branch. I guess not. Maybe next time.)

--b.

On Thu, Jun 19, 2008 at 10:11:09AM +1000, NeilBrown wrote:
>
> OCFS2 can return -ERESTARTSYS from write requests (and possibly
> elsewhere) if there is a signal pending.
>
> If nfsd is shutdown (by sending a signal to each thread) while there
> is still an IO load from the client, each thread could handle one last
> request with a signal pending. This can result in -ERESTARTSYS
> which is not understood by nfserrno() and so is reflected back to
> the client as nfserr_io aka -EIO. This is wrong.
>
> Instead, interpret ERESTARTSYS to mean "try again later" by returning
> nfserr_jukebox. The client will resend and - if the server is
> restarted - the write will (hopefully) be successful and everyone will
> be happy.
>
> The symptom that I narrowed down to this was:
> copy a large file via NFS to an OCFS2 filesystem, and restart
> the nfs server during the copy.
> The 'cp' might get an -EIO, and the file will be corrupted -
> presumably holes in the middle where writes appeared to fail.
>
>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./fs/nfsd/nfsproc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff .prev/fs/nfsd/nfsproc.c ./fs/nfsd/nfsproc.c
> --- .prev/fs/nfsd/nfsproc.c 2008-06-19 10:06:36.000000000 +1000
> +++ ./fs/nfsd/nfsproc.c 2008-06-19 10:07:58.000000000 +1000
> @@ -614,6 +614,7 @@ nfserrno (int errno)
> #endif
> { nfserr_stale, -ESTALE },
> { nfserr_jukebox, -ETIMEDOUT },
> + { nfserr_jukebox, -ERESTARTSYS },
> { nfserr_dropit, -EAGAIN },
> { nfserr_dropit, -ENOMEM },
> { nfserr_badname, -ESRCH },

2008-06-20 17:50:39

by Bruce Fields

Subject: Re: [PATCH - take 2] knfsd: nfsd: Handle ERESTARTSYS from syscalls.

On Thu, Jun 19, 2008 at 06:38:24AM -0400, Jeff Layton wrote:
> On Thu, 19 Jun 2008 12:29:16 +1000
> Neil Brown <[email protected]> wrote:
>
> > On Wednesday June 18, [email protected] wrote:
> > >
> > > No objection to the patch, but what signal was being sent to nfsd when
> > > you saw this? If it's anything but a SIGKILL, then I wonder if we have
> > > a race that we need to deal with. My understanding is that we have nfsd
> > > flip between 2 sigmasks to prevent anything but a SIGKILL from being
> > > delivered while we're handling the local filesystem operation.
> >
> > SuSE /etc/init.d/nfsserver does
> >
> > killproc -n -KILL nfsd
> >
> > so it looks like a SIGKILL.
> >
> >
> > >
> > > From nfsd():
> > >
> > > ----------[snip]-----------
> > > sigprocmask(SIG_SETMASK, &shutdown_mask, NULL);
> > >
> > > /*
> > > * Find a socket with data available and call its
> > > * recvfrom routine.
> > > */
> > > while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
> > > ;
> > > if (err < 0)
> > > break;
> > > update_thread_usage(atomic_read(&nfsd_busy));
> > > atomic_inc(&nfsd_busy);
> > >
> > > /* Lock the export hash tables for reading. */
> > > exp_readlock();
> > >
> > > /* Process request with signals blocked. */
> > > sigprocmask(SIG_SETMASK, &allowed_mask, NULL);
> > >
> > > svc_process(rqstp);
> > >
> > > ----------[snip]-----------
> > >
> > > What happens if this catches a SIGINT after the err<0 check, but before
> > > the mask is set to allowed_mask? Does svc_process() then get called with
> > > a signal pending?
> >
> > Yes, I suspect it does.
> >
> > I wonder why we have all this mucking about with signal masks anyway.
> > Anyone have any ideas about what it actually achieves?
> >
>
> HCH asked me the same question when I did the conversion to kthreads.
> My interpretation (based on guesswork here) was that we wanted to
> distinguish between SIGKILL and other allowed signals. A SIGKILL is
> allowed to interrupt the underlying I/O, but other signals should not.
>
> The question to answer here, I suppose, is whether masking a pending
> signal is sufficient to make signal_pending() return false. If I'm
> looking correctly then the answer should be "yes".

Just looking out of curiosity: signal_pending() checks whether the
task's thread_info->flags has TIF_SIGPENDING set.

sigprocmask() sets current->blocked to the given set and then calls
recalc_sigpending(), which (ignoring some freezer and SIGSTOP code that
I don't understand) clears TIF_SIGPENDING when every pending signal is in
the newly blocked set. So, yes.
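
To make that concrete, here is a toy model of the behaviour described above
(an illustration only: single-word signal sets instead of the kernel's
sigset_t, and none of the freezer or group-stop handling):

#include <stdbool.h>

/* Stand-ins for the bits of task state that matter here. */
struct toy_task {
	unsigned long pending;	/* signals raised but not yet handled */
	unsigned long blocked;	/* current signal mask */
	bool sigpending;	/* plays the role of TIF_SIGPENDING */
};

/* Analogue of recalc_sigpending(): the flag stays set only while some
 * pending signal is not blocked. */
static void toy_recalc_sigpending(struct toy_task *t)
{
	t->sigpending = (t->pending & ~t->blocked) != 0;
}

/* Analogue of sigprocmask(SIG_SETMASK, ...): install the new mask,
 * then recalculate the pending flag. */
static void toy_set_sigmask(struct toy_task *t, unsigned long mask)
{
	t->blocked = mask;
	toy_recalc_sigpending(t);
}

So if a SIGINT sneaks in during the window Jeff described, switching to
allowed_mask (which blocks everything but SIGKILL) recalculates the flag,
and signal_pending() is false again by the time svc_process() runs.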

--b.

> So I don't think we
> have a race here after all. I suspect that if SuSE used a different
> signal here, that would prevent this from happening. For the record,
> both RHEL and Fedora's init scripts use SIGINT for this.