LinuxLists.cc - Hang due to nfs letting tasks freeze with locked inodes

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> We're seeing a hang when freezing a container with an nfs bind mount while
> running iozone. Two iozone processes were hung with this stack trace.
>
> [] schedule+0x35/0x80
> [] schedule_preempt_disabled+0xe/0x10
> [] __mutex_lock_slowpath+0xb9/0x130
> [] mutex_lock+0x1f/0x30
> [] do_unlinkat+0x12b/0x2d0
> [] SyS_unlink+0x16/0x20
> [] entry_SYSCALL_64_fastpath+0x16/0x71
>
> This seems to be due to another iozone thread frozen during unlink with
> this stack trace:
>
> [] __refrigerator+0x7a/0x140
> [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> [] nfs_unlink+0x149/0x350 [nfs]
> [] vfs_unlink+0xf1/0x1a0
> [] do_unlinkat+0x279/0x2d0
> [] SyS_unlink+0x16/0x20
> [] entry_SYSCALL_64_fastpath+0x16/0x71
>
> Since nfs is allowing the thread to be frozen with the inode locked it's
> preventing other threads trying to lock the same inode from freezing. It
> seems like a bad idea for nfs to be doing this.
>

Yeah, known problem. Not a simple one to fix though.

> Can nfs do something different here to prevent this? Maybe use a
> non-freezable sleep and let the operation complete, or else abort the
> operation and return ERESTARTSYS?

The problem with letting the op complete is that often by the time you
get to the point of trying to freeze processes, the network interfaces
are already shut down. So the operation you're waiting on might never
complete. Stuff like suspend operations on your laptop fail, leading to
fun bug reports like: "Oh, my laptop burned to crisp inside my bag
because the suspend never completed."

You could (in principle) return something like -ERESTARTSYS iff the
call has not yet been transmitted. If it has already been transmitted,
then you might end up sending the call a second time (but not as an RPC
retransmission of course). If that call was non-idempotent then you end
up with all of _those_ sorts of problems.

Also, -ERESTARTSYS is not quite right as it doesn't always cause the
call to be restarted. It depends on the syscall. I think this would
probably need some other sort of syscall-restart machinery plumbed in.

--
Jeff Layton <[email protected]>

2016-07-07 03:55:37

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> >
> > [] schedule+0x35/0x80
> > [] schedule_preempt_disabled+0xe/0x10
> > [] __mutex_lock_slowpath+0xb9/0x130
> > [] mutex_lock+0x1f/0x30
> > [] do_unlinkat+0x12b/0x2d0
> > [] SyS_unlink+0x16/0x20
> > [] entry_SYSCALL_64_fastpath+0x16/0x71
> >
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> >
> > [] __refrigerator+0x7a/0x140
> > [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > [] nfs_unlink+0x149/0x350 [nfs]
> > [] vfs_unlink+0xf1/0x1a0
> > [] do_unlinkat+0x279/0x2d0
> > [] SyS_unlink+0x16/0x20
> > [] entry_SYSCALL_64_fastpath+0x16/0x71
> >
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> >
>
> Yeah, known problem. Not a simple one to fix though.
>
> > Can nfs do something different here to prevent this? Maybe use a
> > non-freezable sleep and let the operation complete, or else abort the
> > operation and return ERESTARTSYS?
>
> The problem with letting the op complete is that often by the time you
> get to the point of trying to freeze processes, the network interfaces
> are already shut down. So the operation you're waiting on might never
> complete. Stuff like suspend operations on your laptop fail, leading to
> fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> because the suspend never completed."
>
> You could (in principle) return something like -ERESTARTSYS iff the
> call has not yet been transmitted. If it has already been transmitted,
> then you might end up sending the call a second time (but not as an RPC
> retransmission of course). If that call was non-idempotent then you end
> up with all of _those_ sorts of problems.
>
> Also, -ERESTARTSYS is not quite right as it doesn't always cause the
> call to be restarted. It depends on the syscall. I think this would
> probably need some other sort of syscall-restart machinery plumbed in.

I don't really know much at all about how NFS works, so I hope you don't
mind indulging me in some questions.

What happens then if you suspend waiting for an op to complete and then
resume an hour later? Will it actually succeed or end up returning some
sort of "timed out" error?

If it's going to be an error (or even likely to be one) could the op
just be aborted immediately with an error code? It just seems like there
must be something better than potentially deadlocking the kernel.

Thanks,
Seth

2016-07-07 10:29:39

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Wed, 2016-07-06 at 22:55 -0500, Seth Forshee wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> >
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > >
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > >
> > > [] schedule+0x35/0x80
> > > [] schedule_preempt_disabled+0xe/0x10
> > > [] __mutex_lock_slowpath+0xb9/0x130
> > > [] mutex_lock+0x1f/0x30
> > > [] do_unlinkat+0x12b/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > >
> > > [] __refrigerator+0x7a/0x140
> > > [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > > [] nfs_unlink+0x149/0x350 [nfs]
> > > [] vfs_unlink+0xf1/0x1a0
> > > [] do_unlinkat+0x279/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > >
> > Yeah, known problem. Not a simple one to fix though.
> >
> > >
> > > Can nfs do something different here to prevent this? Maybe use a
> > > non-freezable sleep and let the operation complete, or else abort the
> > > operation and return ERESTARTSYS?
> > The problem with letting the op complete is that often by the time you
> > get to the point of trying to freeze processes, the network interfaces
> > are already shut down. So the operation you're waiting on might never
> > complete. Stuff like suspend operations on your laptop fail, leading to
> > fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> > because the suspend never completed."
> >
> > You could (in principle) return something like -ERESTARTSYS iff the
> > call has not yet been transmitted. If it has already been transmitted,
> > then you might end up sending the call a second time (but not as an RPC
> > retransmission of course). If that call was non-idempotent then you end
> > up with all of _those_ sorts of problems.
> >
> > Also, -ERESTARTSYS is not quite right as it doesn't always cause the
> > call to be restarted. It depends on the syscall. I think this would
> > probably need some other sort of syscall-restart machinery plumbed in.
> I don't really know much at all about how NFS works, so I hope you don't
> mind indulging me in some questions.
>
> What happens then if you suspend waiting for an op to complete and then
> resume an hour later? Will it actually succeed or end up returning some
> sort of "timed out" error?
>

Well, the RPC would likely time out. The RPC engine would then likely
end up retransmitting it. What happens at that point depends on a lot
of different factors -- what sort of call it was and how the server
behaves, whether it's NFSv3 or v4, etc...

If it was an idempotent call or the server still has the reply in its
duplicate reply cache, then everything "just works". If it's non-
idempotent or relies on some now-expired state, then you might get an
error because the same call ended up getting retransmitted or the state
that it relies on is now gone.

> If it's going to be an error (or even likely to be one) could the op
> just be aborted immediately with an error code? It just seems like there
> must be something better than potentially deadlocking the kernel.
>

Not without breaking "hard" retry semantics. We had discussed at one
point adding a 3rd alternative to hard vs. soft mount options
(squishy?) that would do more or less what you suggest: allow syscalls
to return an error when the task is being frozen. You'd only really
want to do that though if you've already transmitted the call, waited
for a while (several seconds) and didn't get a reply. If the call
hasn't been transmitted yet, then you'd either want to restart the call
from scratch after unfreezing (a'la something like ERESTARTSYS).

--
Jeff Layton <[email protected]>

2016-07-07 23:53:51

by Dave Chinner

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> >
> > ?[] schedule+0x35/0x80
> > ?[] schedule_preempt_disabled+0xe/0x10
> > ?[] __mutex_lock_slowpath+0xb9/0x130
> > ?[] mutex_lock+0x1f/0x30
> > ?[] do_unlinkat+0x12b/0x2d0
> > ?[] SyS_unlink+0x16/0x20
> > ?[] entry_SYSCALL_64_fastpath+0x16/0x71
> >
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> >
> > ?[] __refrigerator+0x7a/0x140
> > ?[] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > ?[] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > ?[] nfs_unlink+0x149/0x350 [nfs]
> > ?[] vfs_unlink+0xf1/0x1a0
> > ?[] do_unlinkat+0x279/0x2d0
> > ?[] SyS_unlink+0x16/0x20
> > ?[] entry_SYSCALL_64_fastpath+0x16/0x71
> >
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> >
>
> Yeah, known problem. Not a simple one to fix though.

Actually, it is simple to fix.

<insert broken record about suspend should be using freeze_super(),
not sys_sync(), to suspend filesystem operations>

i.e. the VFS blocks new operations from starting, and then then the
NFS client simply needs to implement ->freeze_fs to drain all it's
active operations before returning. Problem solved.

> > Can nfs do something different here to prevent this? Maybe use a
> > non-freezable sleep and let the operation complete, or else abort the
> > operation and return ERESTARTSYS?
>
> The problem with letting the op complete is that often by the time you
> get to the point of trying to freeze processes, the network interfaces
> are already shut down. So the operation you're waiting on might never
> complete. Stuff like suspend operations on your laptop fail, leading to
> fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> because the suspend never completed."

Yup, precisely the sort of problems we've had over the past 10 years
with XFS because we do lots of stuff aynchronously in the background
(just like NFS) and hence sys_sync() isn't sufficient to quiesce a
filesystem's operations.

But I'm used to being ignored on this topic (for almost 10 years,
now!). Indeed, it's been made clear in the past that I know
absolutely nothing about what is needed to be done to safely
suspend filesystem operations... :/

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-07-08 11:33:23

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri, 2016-07-08 at 09:53 +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind
> > > mount while
> > > running iozone. Two iozone processes were hung with this stack
> > > trace.
> > >
> > > [] schedule+0x35/0x80
> > > [] schedule_preempt_disabled+0xe/0x10
> > > [] __mutex_lock_slowpath+0xb9/0x130
> > > [] mutex_lock+0x1f/0x30
> > > [] do_unlinkat+0x12b/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > This seems to be due to another iozone thread frozen during
> > > unlink with
> > > this stack trace:
> > >
> > > [] __refrigerator+0x7a/0x140
> > > [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > > [] nfs_unlink+0x149/0x350 [nfs]
> > > [] vfs_unlink+0xf1/0x1a0
> > > [] do_unlinkat+0x279/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > Since nfs is allowing the thread to be frozen with the inode
> > > locked it's
> > > preventing other threads trying to lock the same inode from
> > > freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > >
> >
> > Yeah, known problem. Not a simple one to fix though.
>
> Actually, it is simple to fix.
>
> <insert broken record about suspend should be using freeze_super(),
> not sys_sync(), to suspend filesystem operations>
>
> i.e. the VFS blocks new operations from starting, and then then the
> NFS client simply needs to implement ->freeze_fs to drain all it's
> active operations before returning. Problem solved.
>

Not a bad idea. In the case of NFS though, I'm not sure we'd actually
do anything different than what we're doing though. Part of the problem
is that by the time

FWIW, we already have CONFIG_SUSPEND_SKIP_SYNC. It might be worth
experimenting with a CONFIG_SUSPEND_FREEZE_FS that does what you
suggest?

> > > Can nfs do something different here to prevent this? Maybe use a
> > > non-freezable sleep and let the operation complete, or else abort
> > > the
> > > operation and return ERESTARTSYS?
> >
> > The problem with letting the op complete is that often by the time
> > you
> > get to the point of trying to freeze processes, the network
> > interfaces
> > are already shut down. So the operation you're waiting on might
> > never
> > complete. Stuff like suspend operations on your laptop fail,
> > leading to
> > fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> > because the suspend never completed."
>
> Yup, precisely the sort of problems we've had over the past 10 years
> with XFS because we do lots of stuff aynchronously in the background
> (just like NFS) and hence sys_sync() isn't sufficient to quiesce a
> filesystem's operations.
>

Yeah, adding a freeze_fs operation for NFS (and using that during
suspend) sounds reasonable at first blush. I can probably trawl the
archives to better understand, but what are the arguments against doing
that? Is it just that freeze_fs is relatively new and the
suspend/resume subsystems haven't caught up?

> But I'm used to being ignored on this topic (for almost 10 years,
> now!). Indeed, it's been made clear in the past that I know
> absolutely nothing about what is needed to be done to safely
> suspend filesystem operations... :/
>
> Cheers,
>
> Dave.
--

Jeff Layton <[email protected]>

2016-07-08 12:22:28

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> >
> > ?[] schedule+0x35/0x80
> > ?[] schedule_preempt_disabled+0xe/0x10
> > ?[] __mutex_lock_slowpath+0xb9/0x130
> > ?[] mutex_lock+0x1f/0x30
> > ?[] do_unlinkat+0x12b/0x2d0
> > ?[] SyS_unlink+0x16/0x20
> > ?[] entry_SYSCALL_64_fastpath+0x16/0x71
> >
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> >
> > ?[] __refrigerator+0x7a/0x140
> > ?[] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > ?[] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > ?[] nfs_unlink+0x149/0x350 [nfs]
> > ?[] vfs_unlink+0xf1/0x1a0
> > ?[] do_unlinkat+0x279/0x2d0
> > ?[] SyS_unlink+0x16/0x20
> > ?[] entry_SYSCALL_64_fastpath+0x16/0x71
> >
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> >
>
> Yeah, known problem. Not a simple one to fix though.

Apart from alternative Dave was mentioning in other email, what is the
point to use freezable wait from this path in the first place?

nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
different modes from the same path AFAICS. There do not seem to be other
callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
something is not quite right here to me. If the nfs4_delay did regular
wait then the freezing would fail as well but at least it would be clear
who is the culrprit rather than having an indirect dependency.
--
Michal Hocko
SUSE Labs

2016-07-08 12:47:11

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri, Jul 08, 2016 at 02:22:24PM +0200, Michal Hocko wrote:
> On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > >
> > > [] schedule+0x35/0x80
> > > [] schedule_preempt_disabled+0xe/0x10
> > > [] __mutex_lock_slowpath+0xb9/0x130
> > > [] mutex_lock+0x1f/0x30
> > > [] do_unlinkat+0x12b/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > >
> > > [] __refrigerator+0x7a/0x140
> > > [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > > [] nfs_unlink+0x149/0x350 [nfs]
> > > [] vfs_unlink+0xf1/0x1a0
> > > [] do_unlinkat+0x279/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > >
> >
> > Yeah, known problem. Not a simple one to fix though.
>
> Apart from alternative Dave was mentioning in other email, what is the
> point to use freezable wait from this path in the first place?
>
> nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> different modes from the same path AFAICS. There do not seem to be other
> callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> something is not quite right here to me. If the nfs4_delay did regular
> wait then the freezing would fail as well but at least it would be clear
> who is the culrprit rather than having an indirect dependency.

It turns out there are more paths than this one doing a freezable wait,
and they're all also killable. This leads me to a slightly different
question than yours, why nfs can give up waiting in the case of a signal
but not when the task is frozen.

I know the changes below aren't "correct," but I've been experimenting
with them anyway to see what would happen. So far things seem to be
fine, and the deadlock is gone. That should give you an idea of all the
places I found using a freezable wait.

Seth

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f714b98..62dbe59 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -77,8 +77,8 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
*/
int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
{
- freezable_schedule_unsafe();
- if (signal_pending_state(mode, current))
+ schedule();
+ if (signal_pending_state(mode, current) || freezing(current))
return -ERESTARTSYS;
return 0;
}
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index cb28cce..2315183 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -35,9 +35,9 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message *msg, int flags)
res = rpc_call_sync(clnt, msg, flags);
if (res != -EJUKEBOX)
break;
- freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+ schedule_timeout_killable(NFS_JUKEBOX_RETRY_TIME);
res = -ERESTARTSYS;
- } while (!fatal_signal_pending(current));
+ } while (!fatal_signal_pending(current) && !freezing(current));
return res;
}

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 98a4415..0dad2fb 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -334,9 +334,8 @@ static int nfs4_delay(struct rpc_clnt *clnt, long *timeout)

might_sleep();

- freezable_schedule_timeout_killable_unsafe(
- nfs4_update_delay(timeout));
- if (fatal_signal_pending(current))
+ schedule_timeout_killable(nfs4_update_delay(timeout));
+ if (fatal_signal_pending(current) || freezing(current))
res = -ERESTARTSYS;
return res;
}
@@ -5447,7 +5446,7 @@ int nfs4_proc_delegreturn(struct inode *inode, struct rpc_cred *cred, const nfs4
static unsigned long
nfs4_set_lock_task_retry(unsigned long timeout)
{
- freezable_schedule_timeout_killable_unsafe(timeout);
+ schedule_timeout_killable(timeout);
timeout <<= 1;
if (timeout > NFS4_LOCK_MAXTIMEOUT)
return NFS4_LOCK_MAXTIMEOUT;
@@ -6148,7 +6147,7 @@ nfs4_proc_lock(struct file *filp, int cmd, struct file_lock *request)
break;
timeout = nfs4_set_lock_task_retry(timeout);
status = -ERESTARTSYS;
- if (signalled())
+ if (signalled() || freezing(current))
break;
} while(status < 0);
return status;
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 73ad57a..0218dc2 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -252,8 +252,8 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);

static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
{
- freezable_schedule_unsafe();
- if (signal_pending_state(mode, current))
+ schedule();
+ if (signal_pending_state(mode, current) || freezing(current))
return -ERESTARTSYS;
return 0;
}

2016-07-08 12:48:55

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri, Jul 08, 2016 at 09:53:30AM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > >
> > > [] schedule+0x35/0x80
> > > [] schedule_preempt_disabled+0xe/0x10
> > > [] __mutex_lock_slowpath+0xb9/0x130
> > > [] mutex_lock+0x1f/0x30
> > > [] do_unlinkat+0x12b/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > >
> > > [] __refrigerator+0x7a/0x140
> > > [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > > [] nfs_unlink+0x149/0x350 [nfs]
> > > [] vfs_unlink+0xf1/0x1a0
> > > [] do_unlinkat+0x279/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > >
> >
> > Yeah, known problem. Not a simple one to fix though.
>
> Actually, it is simple to fix.
>
> <insert broken record about suspend should be using freeze_super(),
> not sys_sync(), to suspend filesystem operations>
>
> i.e. the VFS blocks new operations from starting, and then then the
> NFS client simply needs to implement ->freeze_fs to drain all it's
> active operations before returning. Problem solved.

No, this won't solve my problem. We're not doing a full suspend, rather
using a freezer cgroup to freeze a subset of processes. We don't want to
want to fully freeze the filesystem.

Thanks,
Seth

2016-07-08 12:51:57

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> >
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > >
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > >
> > > [] schedule+0x35/0x80
> > > [] schedule_preempt_disabled+0xe/0x10
> > > [] __mutex_lock_slowpath+0xb9/0x130
> > > [] mutex_lock+0x1f/0x30
> > > [] do_unlinkat+0x12b/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > >
> > > [] __refrigerator+0x7a/0x140
> > > [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > > [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > > [] nfs_unlink+0x149/0x350 [nfs]
> > > [] vfs_unlink+0xf1/0x1a0
> > > [] do_unlinkat+0x279/0x2d0
> > > [] SyS_unlink+0x16/0x20
> > > [] entry_SYSCALL_64_fastpath+0x16/0x71
> > >
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > >
> > Yeah, known problem. Not a simple one to fix though.
> Apart from alternative Dave was mentioning in other email, what is the
> point to use freezable wait from this path in the first place?
>
> nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> different modes from the same path AFAICS. There do not seem to be other
> callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> something is not quite right here to me. If the nfs4_delay did regular
> wait then the freezing would fail as well but at least it would be clear
> who is the culrprit rather than having an indirect dependency.

The codepaths involved there are a lot more complex than that
unfortunately.

nfs4_delay is the function that we use to handle the case where the
server returns NFS4ERR_DELAY. Basically telling us that it's too busy
right now or has some transient error and the client should retry after
a small, sliding delay.

That codepath could probably be made more freezer-safe. The typical
case however, is that we've sent a call and just haven't gotten a
reply. That's the trickier one to handle.

--
Jeff Layton <[email protected]>

2016-07-08 12:55:48

by Trond Myklebust

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

DQo+IE9uIEp1bCA4LCAyMDE2LCBhdCAwODo0OCwgU2V0aCBGb3JzaGVlIDxzZXRoLmZvcnNoZWVA
Y2Fub25pY2FsLmNvbT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIEp1bCAwOCwgMjAxNiBhdCAwOTo1
MzozMEFNICsxMDAwLCBEYXZlIENoaW5uZXIgd3JvdGU6DQo+PiBPbiBXZWQsIEp1bCAwNiwgMjAx
NiBhdCAwNjowNzoxOFBNIC0wNDAwLCBKZWZmIExheXRvbiB3cm90ZToNCj4+PiBPbiBXZWQsIDIw
MTYtMDctMDYgYXQgMTI6NDYgLTA1MDAsIFNldGggRm9yc2hlZSB3cm90ZToNCj4+Pj4gV2UncmUg
c2VlaW5nIGEgaGFuZyB3aGVuIGZyZWV6aW5nIGEgY29udGFpbmVyIHdpdGggYW4gbmZzIGJpbmQg
bW91bnQgd2hpbGUNCj4+Pj4gcnVubmluZyBpb3pvbmUuIFR3byBpb3pvbmUgcHJvY2Vzc2VzIHdl
cmUgaHVuZyB3aXRoIHRoaXMgc3RhY2sgdHJhY2UuDQo+Pj4+IA0KPj4+PiAgW10gc2NoZWR1bGUr
MHgzNS8weDgwDQo+Pj4+ICBbXSBzY2hlZHVsZV9wcmVlbXB0X2Rpc2FibGVkKzB4ZS8weDEwDQo+
Pj4+ICBbXSBfX211dGV4X2xvY2tfc2xvd3BhdGgrMHhiOS8weDEzMA0KPj4+PiAgW10gbXV0ZXhf
bG9jaysweDFmLzB4MzANCj4+Pj4gIFtdIGRvX3VubGlua2F0KzB4MTJiLzB4MmQwDQo+Pj4+ICBb
XSBTeVNfdW5saW5rKzB4MTYvMHgyMA0KPj4+PiAgW10gZW50cnlfU1lTQ0FMTF82NF9mYXN0cGF0
aCsweDE2LzB4NzENCj4+Pj4gDQo+Pj4+IFRoaXMgc2VlbXMgdG8gYmUgZHVlIHRvIGFub3RoZXIg
aW96b25lIHRocmVhZCBmcm96ZW4gZHVyaW5nIHVubGluayB3aXRoDQo+Pj4+IHRoaXMgc3RhY2sg
dHJhY2U6DQo+Pj4+IA0KPj4+PiAgW10gX19yZWZyaWdlcmF0b3IrMHg3YS8weDE0MA0KPj4+PiAg
W10gbmZzNF9oYW5kbGVfZXhjZXB0aW9uKzB4MTE4LzB4MTMwIFtuZnN2NF0NCj4+Pj4gIFtdIG5m
czRfcHJvY19yZW1vdmUrMHg3ZC8weGYwIFtuZnN2NF0NCj4+Pj4gIFtdIG5mc191bmxpbmsrMHgx
NDkvMHgzNTAgW25mc10NCj4+Pj4gIFtdIHZmc191bmxpbmsrMHhmMS8weDFhMA0KPj4+PiAgW10g
ZG9fdW5saW5rYXQrMHgyNzkvMHgyZDANCj4+Pj4gIFtdIFN5U191bmxpbmsrMHgxNi8weDIwDQo+
Pj4+ICBbXSBlbnRyeV9TWVNDQUxMXzY0X2Zhc3RwYXRoKzB4MTYvMHg3MQ0KPj4+PiANCj4+Pj4g
U2luY2UgbmZzIGlzIGFsbG93aW5nIHRoZSB0aHJlYWQgdG8gYmUgZnJvemVuIHdpdGggdGhlIGlu
b2RlIGxvY2tlZCBpdCdzDQo+Pj4+IHByZXZlbnRpbmcgb3RoZXIgdGhyZWFkcyB0cnlpbmcgdG8g
bG9jayB0aGUgc2FtZSBpbm9kZSBmcm9tIGZyZWV6aW5nLiBJdA0KPj4+PiBzZWVtcyBsaWtlIGEg
YmFkIGlkZWEgZm9yIG5mcyB0byBiZSBkb2luZyB0aGlzLg0KPj4+PiANCj4+PiANCj4+PiBZZWFo
LCBrbm93biBwcm9ibGVtLiBOb3QgYSBzaW1wbGUgb25lIHRvIGZpeCB0aG91Z2guDQo+PiANCj4+
IEFjdHVhbGx5LCBpdCBpcyBzaW1wbGUgdG8gZml4Lg0KPj4gDQo+PiA8aW5zZXJ0IGJyb2tlbiBy
ZWNvcmQgYWJvdXQgc3VzcGVuZCBzaG91bGQgYmUgdXNpbmcgZnJlZXplX3N1cGVyKCksDQo+PiBu
b3Qgc3lzX3N5bmMoKSwgdG8gc3VzcGVuZCBmaWxlc3lzdGVtIG9wZXJhdGlvbnM+DQo+PiANCj4+
IGkuZS4gdGhlIFZGUyBibG9ja3MgbmV3IG9wZXJhdGlvbnMgZnJvbSBzdGFydGluZywgYW5kIHRo
ZW4gdGhlbiB0aGUNCj4+IE5GUyBjbGllbnQgc2ltcGx5IG5lZWRzIHRvIGltcGxlbWVudCAtPmZy
ZWV6ZV9mcyB0byBkcmFpbiBhbGwgaXQncw0KPj4gYWN0aXZlIG9wZXJhdGlvbnMgYmVmb3JlIHJl
dHVybmluZy4gUHJvYmxlbSBzb2x2ZWQuDQo+IA0KPiBObywgdGhpcyB3b24ndCBzb2x2ZSBteSBw
cm9ibGVtLiBXZSdyZSBub3QgZG9pbmcgYSBmdWxsIHN1c3BlbmQsIHJhdGhlcg0KPiB1c2luZyBh
IGZyZWV6ZXIgY2dyb3VwIHRvIGZyZWV6ZSBhIHN1YnNldCBvZiBwcm9jZXNzZXMuIFdlIGRvbid0
IHdhbnQgdG8NCj4gd2FudCB0byBmdWxseSBmcmVlemUgdGhlIGZpbGVzeXN0ZW0uDQoNCuKApmFu
ZCB0aGVyZWluIGxpZXMgdGhlIHJ1Yi4gVGhlIHdob2xlIGNncm91cCBmcmVlemVyIHN0dWZmIGFz
c3VtZXMgdGhhdCB5b3UgY2FuIHNhZmVseSBkZWFjdGl2YXRlIGEgYnVuY2ggb2YgcHJvY2Vzc2Vz
IHRoYXQgbWF5IG9yIG1heSBub3QgaG9sZCBzdGF0ZSBpbiB0aGUgZmlsZXN5c3RlbS4gVGhhdOKA
mXMgZGVmaW5pdGVseSBub3QgT0sgd2hlbiB5b3UgaG9sZCBsb2NrcyBldGMgdGhhdCBjYW4gYWZm
ZWN0IHByb2Nlc3NlcyB0aGF0IGxpZXMgb3V0c2lkZSB0aGUgY2dyb3VwIChhbmQvb3Igb3V0c2lk
ZSB0aGUgTkZTIGNsaWVudCBpdHNlbGYpLg0KDQpDaGVlcnMNCiAgVHJvbmQNCg0K

2016-07-08 13:05:46

by Trond Myklebust

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

DQo+IE9uIEp1bCA4LCAyMDE2LCBhdCAwODo1NSwgVHJvbmQgTXlrbGVidXN0IDx0cm9uZG15QHBy
aW1hcnlkYXRhLmNvbT4gd3JvdGU6DQo+IA0KPiANCj4+IE9uIEp1bCA4LCAyMDE2LCBhdCAwODo0
OCwgU2V0aCBGb3JzaGVlIDxzZXRoLmZvcnNoZWVAY2Fub25pY2FsLmNvbT4gd3JvdGU6DQo+PiAN
Cj4+IE9uIEZyaSwgSnVsIDA4LCAyMDE2IGF0IDA5OjUzOjMwQU0gKzEwMDAsIERhdmUgQ2hpbm5l
ciB3cm90ZToNCj4+PiBPbiBXZWQsIEp1bCAwNiwgMjAxNiBhdCAwNjowNzoxOFBNIC0wNDAwLCBK
ZWZmIExheXRvbiB3cm90ZToNCj4+Pj4gT24gV2VkLCAyMDE2LTA3LTA2IGF0IDEyOjQ2IC0wNTAw
LCBTZXRoIEZvcnNoZWUgd3JvdGU6DQo+Pj4+PiBXZSdyZSBzZWVpbmcgYSBoYW5nIHdoZW4gZnJl
ZXppbmcgYSBjb250YWluZXIgd2l0aCBhbiBuZnMgYmluZCBtb3VudCB3aGlsZQ0KPj4+Pj4gcnVu
bmluZyBpb3pvbmUuIFR3byBpb3pvbmUgcHJvY2Vzc2VzIHdlcmUgaHVuZyB3aXRoIHRoaXMgc3Rh
Y2sgdHJhY2UuDQo+Pj4+PiANCj4+Pj4+IFtdIHNjaGVkdWxlKzB4MzUvMHg4MA0KPj4+Pj4gW10g
c2NoZWR1bGVfcHJlZW1wdF9kaXNhYmxlZCsweGUvMHgxMA0KPj4+Pj4gW10gX19tdXRleF9sb2Nr
X3Nsb3dwYXRoKzB4YjkvMHgxMzANCj4+Pj4+IFtdIG11dGV4X2xvY2srMHgxZi8weDMwDQo+Pj4+
PiBbXSBkb191bmxpbmthdCsweDEyYi8weDJkMA0KPj4+Pj4gW10gU3lTX3VubGluaysweDE2LzB4
MjANCj4+Pj4+IFtdIGVudHJ5X1NZU0NBTExfNjRfZmFzdHBhdGgrMHgxNi8weDcxDQo+Pj4+PiAN
Cj4+Pj4+IFRoaXMgc2VlbXMgdG8gYmUgZHVlIHRvIGFub3RoZXIgaW96b25lIHRocmVhZCBmcm96
ZW4gZHVyaW5nIHVubGluayB3aXRoDQo+Pj4+PiB0aGlzIHN0YWNrIHRyYWNlOg0KPj4+Pj4gDQo+
Pj4+PiBbXSBfX3JlZnJpZ2VyYXRvcisweDdhLzB4MTQwDQo+Pj4+PiBbXSBuZnM0X2hhbmRsZV9l
eGNlcHRpb24rMHgxMTgvMHgxMzAgW25mc3Y0XQ0KPj4+Pj4gW10gbmZzNF9wcm9jX3JlbW92ZSsw
eDdkLzB4ZjAgW25mc3Y0XQ0KPj4+Pj4gW10gbmZzX3VubGluaysweDE0OS8weDM1MCBbbmZzXQ0K
Pj4+Pj4gW10gdmZzX3VubGluaysweGYxLzB4MWEwDQo+Pj4+PiBbXSBkb191bmxpbmthdCsweDI3
OS8weDJkMA0KPj4+Pj4gW10gU3lTX3VubGluaysweDE2LzB4MjANCj4+Pj4+IFtdIGVudHJ5X1NZ
U0NBTExfNjRfZmFzdHBhdGgrMHgxNi8weDcxDQo+Pj4+PiANCj4+Pj4+IFNpbmNlIG5mcyBpcyBh
bGxvd2luZyB0aGUgdGhyZWFkIHRvIGJlIGZyb3plbiB3aXRoIHRoZSBpbm9kZSBsb2NrZWQgaXQn
cw0KPj4+Pj4gcHJldmVudGluZyBvdGhlciB0aHJlYWRzIHRyeWluZyB0byBsb2NrIHRoZSBzYW1l
IGlub2RlIGZyb20gZnJlZXppbmcuIEl0DQo+Pj4+PiBzZWVtcyBsaWtlIGEgYmFkIGlkZWEgZm9y
IG5mcyB0byBiZSBkb2luZyB0aGlzLg0KPj4+Pj4gDQo+Pj4+IA0KPj4+PiBZZWFoLCBrbm93biBw
cm9ibGVtLiBOb3QgYSBzaW1wbGUgb25lIHRvIGZpeCB0aG91Z2guDQo+Pj4gDQo+Pj4gQWN0dWFs
bHksIGl0IGlzIHNpbXBsZSB0byBmaXguDQo+Pj4gDQo+Pj4gPGluc2VydCBicm9rZW4gcmVjb3Jk
IGFib3V0IHN1c3BlbmQgc2hvdWxkIGJlIHVzaW5nIGZyZWV6ZV9zdXBlcigpLA0KPj4+IG5vdCBz
eXNfc3luYygpLCB0byBzdXNwZW5kIGZpbGVzeXN0ZW0gb3BlcmF0aW9ucz4NCj4+PiANCj4+PiBp
LmUuIHRoZSBWRlMgYmxvY2tzIG5ldyBvcGVyYXRpb25zIGZyb20gc3RhcnRpbmcsIGFuZCB0aGVu
IHRoZW4gdGhlDQo+Pj4gTkZTIGNsaWVudCBzaW1wbHkgbmVlZHMgdG8gaW1wbGVtZW50IC0+ZnJl
ZXplX2ZzIHRvIGRyYWluIGFsbCBpdCdzDQo+Pj4gYWN0aXZlIG9wZXJhdGlvbnMgYmVmb3JlIHJl
dHVybmluZy4gUHJvYmxlbSBzb2x2ZWQuDQo+PiANCj4+IE5vLCB0aGlzIHdvbid0IHNvbHZlIG15
IHByb2JsZW0uIFdlJ3JlIG5vdCBkb2luZyBhIGZ1bGwgc3VzcGVuZCwgcmF0aGVyDQo+PiB1c2lu
ZyBhIGZyZWV6ZXIgY2dyb3VwIHRvIGZyZWV6ZSBhIHN1YnNldCBvZiBwcm9jZXNzZXMuIFdlIGRv
bid0IHdhbnQgdG8NCj4+IHdhbnQgdG8gZnVsbHkgZnJlZXplIHRoZSBmaWxlc3lzdGVtLg0KPiAN
Cj4g4oCmYW5kIHRoZXJlaW4gbGllcyB0aGUgcnViLiBUaGUgd2hvbGUgY2dyb3VwIGZyZWV6ZXIg
c3R1ZmYgYXNzdW1lcyB0aGF0IHlvdSBjYW4gc2FmZWx5IGRlYWN0aXZhdGUgYSBidW5jaCBvZiBw
cm9jZXNzZXMgdGhhdCBtYXkgb3IgbWF5IG5vdCBob2xkIHN0YXRlIGluIHRoZSBmaWxlc3lzdGVt
LiBUaGF04oCZcyBkZWZpbml0ZWx5IG5vdCBPSyB3aGVuIHlvdSBob2xkIGxvY2tzIGV0YyB0aGF0
IGNhbiBhZmZlY3QgcHJvY2Vzc2VzIHRoYXQgbGllcyBvdXRzaWRlIHRoZSBjZ3JvdXAgKGFuZC9v
ciBvdXRzaWRlIHRoZSBORlMgY2xpZW50IGl0c2VsZikuDQo+IA0KDQpJbiBjYXNlIGl0IHdhc27i
gJl0IGNsZWFyLCBJ4oCZbSBub3QganVzdCB0YWxraW5nIGFib3V0IFZGUyBtdXRleGVzIGhlcmUu
IEnigJltIGFsc28gdGFsa2luZyBhYm91dCBhbGwgdGhlIG90aGVyIHN0dWZmLCBhIGxvdCBvZiB3
aGljaCB0aGUga2VybmVsIGhhcyBubyBjb250cm9sIG92ZXIsIGluY2x1ZGluZyBQT1NJWCBmaWxl
IGxvY2tpbmcsIHNoYXJlIGxvY2tzLCBsZWFzZXMvZGVsZWdhdGlvbnMsIGV0Yy4NCg0KVHJvbmQN
Cg==

2016-07-08 14:23:11

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
[...]
> > Apart from alternative Dave was mentioning in other email, what is the
> > point to use freezable wait from this path in the first place?
> >
> > nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> > that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> > different modes from the same path AFAICS. There do not seem to be other
> > callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> > something is not quite right here to me. If the nfs4_delay did regular
> > wait then the freezing would fail as well but at least it would be clear
> > who is the culrprit rather than having an indirect dependency.
>
> The codepaths involved there are a lot more complex than that
> unfortunately.
>
> nfs4_delay is the function that we use to handle the case where the
> server returns NFS4ERR_DELAY. Basically telling us that it's too busy
> right now or has some transient error and the client should retry after
> a small, sliding delay.
>
> That codepath could probably be made more freezer-safe. The typical
> case however, is that we've sent a call and just haven't gotten a
> reply. That's the trickier one to handle.

Why using a regular non-freezable wait would be a problem?
--
Michal Hocko
SUSE Labs

2016-07-08 14:27:42

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> >
> > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> [...]
> >
> > >
> > > Apart from alternative Dave was mentioning in other email, what
> > > is the
> > > point to use freezable wait from this path in the first place?
> > >
> > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > path and
> > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > in two
> > > different modes from the same path AFAICS. There do not seem to
> > > be other
> > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > like
> > > something is not quite right here to me. If the nfs4_delay did
> > > regular
> > > wait then the freezing would fail as well but at least it would
> > > be clear
> > > who is the culrprit rather than having an indirect dependency.
> > The codepaths involved there are a lot more complex than that
> > unfortunately.
> >
> > nfs4_delay is the function that we use to handle the case where the
> > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > busy
> > right now or has some transient error and the client should retry
> > after
> > a small, sliding delay.
> >
> > That codepath could probably be made more freezer-safe. The typical
> > case however, is that we've sent a call and just haven't gotten a
> > reply. That's the trickier one to handle.
> Why using a regular non-freezable wait would be a problem?

It has been a while since I looked at that code, but IIRC, that could
block the freezer for up to 15s, which is a significant portion of the
20s that you get before the freezer gives up.

--
Jeff Layton <[email protected]>

2016-07-11 01:20:57

by Dave Chinner

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri, Jul 08, 2016 at 01:05:40PM +0000, Trond Myklebust wrote:
> > On Jul 8, 2016, at 08:55, Trond Myklebust
> > <[email protected]> wrote:
> >> On Jul 8, 2016, at 08:48, Seth Forshee
> >> <[email protected]> wrote: On Fri, Jul 08, 2016 at
> >> 09:53:30AM +1000, Dave Chinner wrote:
> >>> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> >>>> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> >>>>> We're seeing a hang when freezing a container with an nfs
> >>>>> bind mount while running iozone. Two iozone processes were
> >>>>> hung with this stack trace.
> >>>>>
> >>>>> [] schedule+0x35/0x80 [] schedule_preempt_disabled+0xe/0x10
> >>>>> [] __mutex_lock_slowpath+0xb9/0x130 [] mutex_lock+0x1f/0x30
> >>>>> [] do_unlinkat+0x12b/0x2d0 [] SyS_unlink+0x16/0x20 []
> >>>>> entry_SYSCALL_64_fastpath+0x16/0x71
> >>>>>
> >>>>> This seems to be due to another iozone thread frozen during
> >>>>> unlink with this stack trace:
> >>>>>
> >>>>> [] __refrigerator+0x7a/0x140 []
> >>>>> nfs4_handle_exception+0x118/0x130 [nfsv4] []
> >>>>> nfs4_proc_remove+0x7d/0xf0 [nfsv4] [] nfs_unlink+0x149/0x350
> >>>>> [nfs] [] vfs_unlink+0xf1/0x1a0 [] do_unlinkat+0x279/0x2d0 []
> >>>>> SyS_unlink+0x16/0x20 [] entry_SYSCALL_64_fastpath+0x16/0x71
> >>>>>
> >>>>> Since nfs is allowing the thread to be frozen with the inode
> >>>>> locked it's preventing other threads trying to lock the same
> >>>>> inode from freezing. It seems like a bad idea for nfs to be
> >>>>> doing this.
> >>>>>
> >>>>
> >>>> Yeah, known problem. Not a simple one to fix though.
> >>>
> >>> Actually, it is simple to fix.
> >>>
> >>> <insert broken record about suspend should be using
> >>> freeze_super(), not sys_sync(), to suspend filesystem
> >>> operations>
> >>>
> >>> i.e. the VFS blocks new operations from starting, and then
> >>> then the NFS client simply needs to implement ->freeze_fs to
> >>> drain all it's active operations before returning. Problem
> >>> solved.
> >>
> >> No, this won't solve my problem. We're not doing a full
> >> suspend, rather using a freezer cgroup to freeze a subset of
> >> processes. We don't want to want to fully freeze the
> >> filesystem.
> >
> > …and therein lies the rub. The whole cgroup freezer stuff
> > assumes that you can safely deactivate a bunch of processes that
> > may or may not hold state in the filesystem. That’s
> > definitely not OK when you hold locks etc that can affect
> > processes that lies outside the cgroup (and/or outside the NFS
> > client itself).

Not just locks, but even just reference counts are bad. e.g. just
being suspended with an active write reference to the superblock
will cause the next filesystem freeze to hang waiting for that
reference to drain. In essence, that's a filesystem-wide DOS vector
for anyone using snapshots....

> In case it wasn’t clear, I’m not just talking about VFS
> mutexes here. I’m also talking about all the other stuff, a
> lot of which the kernel has no control over, including POSIX file
> locking, share locks, leases/delegations, etc.

Yeah, freezer base process-granularity suspend just seems like a bad
idea to me...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-07-11 07:23:28

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > >
> > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > [...]
> > >
> > > >
> > > > Apart from alternative Dave was mentioning in other email, what
> > > > is the
> > > > point to use freezable wait from this path in the first place?
> > > >
> > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > path and
> > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > in two
> > > > different modes from the same path AFAICS. There do not seem to
> > > > be other
> > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > like
> > > > something is not quite right here to me. If the nfs4_delay did
> > > > regular
> > > > wait then the freezing would fail as well but at least it would
> > > > be clear
> > > > who is the culrprit rather than having an indirect dependency.
> > > The codepaths involved there are a lot more complex than that
> > > unfortunately.
> > >
> > > nfs4_delay is the function that we use to handle the case where the
> > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > busy
> > > right now or has some transient error and the client should retry
> > > after
> > > a small, sliding delay.
> > >
> > > That codepath could probably be made more freezer-safe. The typical
> > > case however, is that we've sent a call and just haven't gotten a
> > > reply. That's the trickier one to handle.
> > Why using a regular non-freezable wait would be a problem?
>
> It has been a while since I looked at that code, but IIRC, that could
> block the freezer for up to 15s, which is a significant portion of the
> 20s that you get before the freezer gives up.

But how does that differ from the situation when the freezer has to give
up on the timeout because another task fails due to lock dependency.

As Trond and Dave have written in other emails. It is really danngerous
to freeze a task while it is holding locks and other resources.
--
Michal Hocko
SUSE Labs

2016-07-11 11:03:35

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote:
> On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > >
> > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > > [...]
> > > >
> > > > >
> > > > > Apart from alternative Dave was mentioning in other email, what
> > > > > is the
> > > > > point to use freezable wait from this path in the first place?
> > > > >
> > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > > path and
> > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > > in two
> > > > > different modes from the same path AFAICS. There do not seem to
> > > > > be other
> > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > > like
> > > > > something is not quite right here to me. If the nfs4_delay did
> > > > > regular
> > > > > wait then the freezing would fail as well but at least it would
> > > > > be clear
> > > > > who is the culrprit rather than having an indirect dependency.
> > > > The codepaths involved there are a lot more complex than that
> > > > unfortunately.
> > > >
> > > > nfs4_delay is the function that we use to handle the case where the
> > > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > > busy
> > > > right now or has some transient error and the client should retry
> > > > after
> > > > a small, sliding delay.
> > > >
> > > > That codepath could probably be made more freezer-safe. The typical
> > > > case however, is that we've sent a call and just haven't gotten a
> > > > reply. That's the trickier one to handle.
> > > Why using a regular non-freezable wait would be a problem?
> >
> > It has been a while since I looked at that code, but IIRC, that could
> > block the freezer for up to 15s, which is a significant portion of the
> > 20s that you get before the freezer gives up.
>
> But how does that differ from the situation when the freezer has to give
> up on the timeout because another task fails due to lock dependency.
>
> As Trond and Dave have written in other emails. It is really danngerous
> to freeze a task while it is holding locks and other resources.

It's not really dangerous if you're freezing every task on the host.
Sure, you're freezing with locks held, but everything else is freezing
too, so nothing will be contending for those locks.

I'm not at all opposed to changing how all of that works. My only
stipulation is that we not break the ability to reliably suspend a host
that is actively using an NFS mount. If you can come up with a way to
do that that also works for freezing cgroups, then I'm all for it.

--

Jeff Layton <[email protected]>

2016-07-11 11:43:39

[permalink] [raw]

Subject: Re: Hang due to nfs letting tasks freeze with locked inodes

On Mon 11-07-16 07:03:31, Jeff Layton wrote:
> On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote:
> > On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> > > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > > > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > > >
> > > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > > > [...]
> > > > >
> > > > > >
> > > > > > Apart from alternative Dave was mentioning in other email, what
> > > > > > is the
> > > > > > point to use freezable wait from this path in the first place?
> > > > > >
> > > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > > > path and
> > > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > > > in two
> > > > > > different modes from the same path AFAICS. There do not seem to
> > > > > > be other
> > > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > > > like
> > > > > > something is not quite right here to me. If the nfs4_delay did
> > > > > > regular
> > > > > > wait then the freezing would fail as well but at least it would
> > > > > > be clear
> > > > > > who is the culrprit rather than having an indirect dependency.
> > > > > The codepaths involved there are a lot more complex than that
> > > > > unfortunately.
> > > > >
> > > > > nfs4_delay is the function that we use to handle the case where the
> > > > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > > > busy
> > > > > right now or has some transient error and the client should retry
> > > > > after
> > > > > a small, sliding delay.
> > > > >
> > > > > That codepath could probably be made more freezer-safe. The typical
> > > > > case however, is that we've sent a call and just haven't gotten a
> > > > > reply. That's the trickier one to handle.
> > > > Why using a regular non-freezable wait would be a problem?
> > >
> > > It has been a while since I looked at that code, but IIRC, that could
> > > block the freezer for up to 15s, which is a significant portion of the
> > > 20s that you get before the freezer gives up.
> >
> > But how does that differ from the situation when the freezer has to give
> > up on the timeout because another task fails due to lock dependency.
> >
> > As Trond and Dave have written in other emails. It is really danngerous
> > to freeze a task while it is holding locks and other resources.
>
> It's not really dangerous if you're freezing every task on the host.
> Sure, you're freezing with locks held, but everything else is freezing
> too, so nothing will be contending for those locks.

But the very same path is used also for cgroup freezer so you can end up
freezing a task while it holds locks which might block tasks from
unrelated cgroups, right?

> I'm not at all opposed to changing how all of that works. My only
> stipulation is that we not break the ability to reliably suspend a host
> that is actively using an NFS mount. If you can come up with a way to
> do that that also works for freezing cgroups, then I'm all for it.

My knowledge of NFS is too limited to help you out here but I guess it
would be a good start to stop using unsafe freezer APIs. Or use it only
when you are sure you cannot block any resources.

--
Michal Hocko
SUSE Labs

2016-07-11 12:50:45