2024-02-27 23:56:51

by NeilBrown

[permalink] [raw]
Subject: Re: PROBLEM: NFS client IO fails with ERESTARTSYS when another mount point with the same export is unmounted with force [NFS] [SUNRPC]

On Fri, 23 Feb 2024, Zhitao Li wrote:
> Thanks for Jeff's reply.
>
> I did see ERESTARTSYS in userland. As described in the above
> "Reproduction" chapter, "dd" fails with "dd: error writing
> '/mnt/test1/1G': Unknown error 512".
>
> After strace "dd", it turns out that syscall WRITE fails with:
> write(1, "4\303\31\211\316\237\333\r-\275g\370\233\374X\277\374Tb\202\24\365\220\320\16\27o3\331q\344\364"...,
> 1048576) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
>
> In fact, other syscalls related to file systems can also fail with
> ERESTARTSYS in our cases, for example: mount, open, read, write and so
> on.
>
> Maybe the reason is that on forced unmount, rpc_killall_tasks() in
> net/sunrpc/clnt.c will set all inflight IO with ERESTARTSYS, while no
> signal gets involved. So ERESTARTSYS is not handled before entering
> userspace.
>
> Best regards,
> Zhitao Li at SmartX.
>
> On Thu, Feb 22, 2024 at 7:05 PM Jeff Layton <[email protected]> wrote:
> >
> > On Wed, 2024-02-21 at 13:48 +0000, Trond Myklebust wrote:
> > > On Wed, 2024-02-21 at 16:20 +0800, Zhitao Li wrote:
> > > > [You don't often get email from [email protected]. Learn why this
> > > > is important at https://aka.ms/LearnAboutSenderIdentification ]
> > > >
> > > > Hi, everyone,
> > > >
> > > > - Facts:
> > > > I have a remote NFS export and I mount the same export on two
> > > > different directories in my OS with the same options. There is an
> > > > inflight IO under one mounted directory. And then I unmount another
> > > > mounted directory with force. The inflight IO ends up with "Unknown
> > > > error 512", which is ERESTARTSYS.
> > > >
> > >
> > > All of the above is well known. That's because forced umount affects
> > > the entire filesystem. Why are you using it here in the first place? It
> > > is not intended for casual use.
> > >
> >
> > While I agree Trond's above statement, the kernel is not supposed to
> > leak error codes that high into userland. Are you seeing ERESTARTSYS
> > being returned to system calls? If so, which ones?
> > --
> > Jeff Layton <[email protected]>
>

I think this bug was introduced by
Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
in Linux v5.2.

Prior to that commit, rpc_killall_tasks set the error to -EIO.
After that commit it calls rpc_signal_task which always uses
-ERESTARTSYS.

This might be an appropriate fix. Can you please test and confirm?

diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 2d61987b3545..ed3a116efd5d 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -222,7 +222,7 @@ void rpc_put_task(struct rpc_task *);
void rpc_put_task_async(struct rpc_task *);
bool rpc_task_set_rpc_status(struct rpc_task *task, int rpc_status);
void rpc_task_try_cancel(struct rpc_task *task, int error);
-void rpc_signal_task(struct rpc_task *);
+void rpc_signal_task(struct rpc_task *, int);
void rpc_exit_task(struct rpc_task *);
void rpc_exit(struct rpc_task *, int);
void rpc_release_calldata(const struct rpc_call_ops *, void *);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index cda0935a68c9..cdbdfae13030 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -895,7 +895,7 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
trace_rpc_clnt_killall(clnt);
spin_lock(&clnt->cl_lock);
list_for_each_entry(rovr, &clnt->cl_tasks, tk_task)
- rpc_signal_task(rovr);
+ rpc_signal_task(rovr, -EIO);
spin_unlock(&clnt->cl_lock);
}
EXPORT_SYMBOL_GPL(rpc_killall_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..e88621881036 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -852,14 +852,14 @@ void rpc_exit_task(struct rpc_task *task)
}
}

-void rpc_signal_task(struct rpc_task *task)
+void rpc_signal_task(struct rpc_task *task, int sig)
{
struct rpc_wait_queue *queue;

if (!RPC_IS_ACTIVATED(task))
return;

- if (!rpc_task_set_rpc_status(task, -ERESTARTSYS))
+ if (!rpc_task_set_rpc_status(task, sig))
return;
trace_rpc_task_signalled(task, task->tk_action);
set_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
@@ -992,7 +992,7 @@ static void __rpc_execute(struct rpc_task *task)
* clean up after sleeping on some queue, we don't
* break the loop here, but go around once more.
*/
- rpc_signal_task(task);
+ rpc_signal_task(task, -ERESTARTSYS);
}
trace_rpc_task_sync_wake(task, task->tk_action);
}