Hi,
The warning below can be triggered every time mount.nfs is run on 3.9-rc1.

Not sure if freezable_schedule() inside rpc_wait_bit_killable() should
be changed to schedule(), since nfs_clid_init_mutex is held in the path.
[ 41.387939] =====================================
[ 41.392913] [ BUG: mount.nfs/643 still has locks held! ]
[ 41.398559] 3.9.0-rc1+ #1740 Not tainted
[ 41.402709] -------------------------------------
[ 41.407714] 1 lock held by mount.nfs/643:
[ 41.411956] #0: (nfs_clid_init_mutex){+.+...}, at: [<c0226d6c>]
nfs4_discover_server_trunking+0x60/0x1d4
[ 41.422363]
[ 41.422363] stack backtrace:
[ 41.427032] [<c0014dd4>] (unwind_backtrace+0x0/0xe0) from
[<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8)
[ 41.437103] [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8) from
[<c054e454>] (__wait_on_bit+0x54/0x9c)
[ 41.446990] [<c054e454>] (__wait_on_bit+0x54/0x9c) from
[<c054e514>] (out_of_line_wait_on_bit+0x78/0x84)
[ 41.457061] [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84) from
[<c04a8f88>] (__rpc_execute+0x170/0x348)
[ 41.467407] [<c04a8f88>] (__rpc_execute+0x170/0x348) from
[<c04a250c>] (rpc_run_task+0x9c/0xa4)
[ 41.476715] [<c04a250c>] (rpc_run_task+0x9c/0xa4) from [<c04a265c>]
(rpc_call_sync+0x70/0xb0)
[ 41.485778] [<c04a265c>] (rpc_call_sync+0x70/0xb0) from
[<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8)
[ 41.495819] [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8) from
[<c0224eb4>] (nfs40_discover_server_trunking+0xec/0x148)
[ 41.507507] [<c0224eb4>] (nfs40_discover_server_trunking+0xec/0x148) from
[<c0226da0>] (nfs4_discover_server_trunking+0x94/0x1d4)
[ 41.519866] [<c0226da0>] (nfs4_discover_server_trunking+0x94/0x1d4) from
[<c022c664>] (nfs4_init_client+0x150/0x1b0)
[ 41.531036] [<c022c664>] (nfs4_init_client+0x150/0x1b0) from
[<c01fe954>] (nfs_get_client+0x2cc/0x320)
[ 41.540863] [<c01fe954>] (nfs_get_client+0x2cc/0x320) from
[<c022c080>] (nfs4_set_client+0x80/0xb0)
[ 41.550476] [<c022c080>] (nfs4_set_client+0x80/0xb0) from
[<c022cef8>] (nfs4_create_server+0xb0/0x21c)
[ 41.560333] [<c022cef8>] (nfs4_create_server+0xb0/0x21c) from
[<c0227524>] (nfs4_remote_mount+0x28/0x54)
[ 41.570373] [<c0227524>] (nfs4_remote_mount+0x28/0x54) from
[<c0113a8c>] (mount_fs+0x6c/0x160)
[ 41.579498] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
(vfs_kern_mount+0x4c/0xc0)
[ 41.588378] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
[<c022734c>] (nfs_do_root_mount+0x74/0x90)
[ 41.597961] [<c022734c>] (nfs_do_root_mount+0x74/0x90) from
[<c0227574>] (nfs4_try_mount+0x24/0x3c)
[ 41.607513] [<c0227574>] (nfs4_try_mount+0x24/0x3c) from
[<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0)
[ 41.616821] [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0) from
[<c0113a8c>] (mount_fs+0x6c/0x160)
[ 41.625701] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
(vfs_kern_mount+0x4c/0xc0)
[ 41.634582] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
[<c012c330>] (do_mount+0x710/0x81c)
[ 41.643524] [<c012c330>] (do_mount+0x710/0x81c) from [<c012c4c0>]
(sys_mount+0x84/0xb8)
[ 41.652008] [<c012c4c0>] (sys_mount+0x84/0xb8) from [<c000d6c0>]
(ret_fast_syscall+0x0/0x48)
[ 41.715911] device: '0:28': device_add
[ 41.720062] PM: Adding info for No Bus:0:28
[ 41.746887] device: '0:29': device_add
[ 41.751037] PM: Adding info for No Bus:0:29
[ 41.780700] device: '0:28': device_unregister
[ 41.785400] PM: Removing info for No Bus:0:28
[ 41.790344] device: '0:28': device_create_release
Thanks,
--
Ming Lei
On Mon, 2013-03-04 at 21:57 +0800, Ming Lei wrote:
> Hi,
>
> The below warning can be triggered each time when mount.nfs is
> running on 3.9-rc1.
>
> Not sure if freezable_schedule() inside rpc_wait_bit_killable should
> be changed to schedule() since nfs_clid_init_mutex is held in the path.
Cc:ing Jeff, who added freezable_schedule(), and applied it to
rpc_wait_bit_killable.
So this is occurring when the kernel enters the freeze state?
Why does it occur only with nfs_clid_init_mutex, and not with all the
other mutexes that we hold across RPC calls? We hold inode->i_mutex
across RPC calls all the time when doing renames, unlinks, file
creation,...
> [ 41.387939] =====================================
> [ 41.392913] [ BUG: mount.nfs/643 still has locks held! ]
> [ 41.398559] 3.9.0-rc1+ #1740 Not tainted
> [ 41.402709] -------------------------------------
> [ 41.407714] 1 lock held by mount.nfs/643:
> [ 41.411956] #0: (nfs_clid_init_mutex){+.+...}, at: [<c0226d6c>]
> nfs4_discover_server_trunking+0x60/0x1d4
> [ 41.422363]
> [ 41.422363] stack backtrace:
> [ 41.427032] [<c0014dd4>] (unwind_backtrace+0x0/0xe0) from
> [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8)
> [ 41.437103] [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8) from
> [<c054e454>] (__wait_on_bit+0x54/0x9c)
> [ 41.446990] [<c054e454>] (__wait_on_bit+0x54/0x9c) from
> [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84)
> [ 41.457061] [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84) from
> [<c04a8f88>] (__rpc_execute+0x170/0x348)
> [ 41.467407] [<c04a8f88>] (__rpc_execute+0x170/0x348) from
> [<c04a250c>] (rpc_run_task+0x9c/0xa4)
> [ 41.476715] [<c04a250c>] (rpc_run_task+0x9c/0xa4) from [<c04a265c>]
> (rpc_call_sync+0x70/0xb0)
> [ 41.485778] [<c04a265c>] (rpc_call_sync+0x70/0xb0) from
> [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8)
> [ 41.495819] [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8) from
> [<c0224eb4>] (nfs40_discover_server_trunki
> ng+0xec/0x148)
> [ 41.507507] [<c0224eb4>]
> (nfs40_discover_server_trunking+0xec/0x148) from [<c0226da0>]
> (nfs4_discover_server
> _trunking+0x94/0x1d4)
> [ 41.519866] [<c0226da0>] (nfs4_discover_server_trunking+0x94/0x1d4)
> from [<c022c664>] (nfs4_init_client+0x15
> 0/0x1b0)
> [ 41.531036] [<c022c664>] (nfs4_init_client+0x150/0x1b0) from
> [<c01fe954>] (nfs_get_client+0x2cc/0x320)
> [ 41.540863] [<c01fe954>] (nfs_get_client+0x2cc/0x320) from
> [<c022c080>] (nfs4_set_client+0x80/0xb0)
> [ 41.550476] [<c022c080>] (nfs4_set_client+0x80/0xb0) from
> [<c022cef8>] (nfs4_create_server+0xb0/0x21c)
> [ 41.560333] [<c022cef8>] (nfs4_create_server+0xb0/0x21c) from
> [<c0227524>] (nfs4_remote_mount+0x28/0x54)
> [ 41.570373] [<c0227524>] (nfs4_remote_mount+0x28/0x54) from
> [<c0113a8c>] (mount_fs+0x6c/0x160)
> [ 41.579498] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
> (vfs_kern_mount+0x4c/0xc0)
> [ 41.588378] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
> [<c022734c>] (nfs_do_root_mount+0x74/0x90)
> [ 41.597961] [<c022734c>] (nfs_do_root_mount+0x74/0x90) from
> [<c0227574>] (nfs4_try_mount+0x24/0x3c)
> [ 41.607513] [<c0227574>] (nfs4_try_mount+0x24/0x3c) from
> [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0)
> [ 41.616821] [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0) from
> [<c0113a8c>] (mount_fs+0x6c/0x160)
> [ 41.625701] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
> (vfs_kern_mount+0x4c/0xc0)
> [ 41.634582] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
> [<c012c330>] (do_mount+0x710/0x81c)
> [ 41.643524] [<c012c330>] (do_mount+0x710/0x81c) from [<c012c4c0>]
> (sys_mount+0x84/0xb8)
> [ 41.652008] [<c012c4c0>] (sys_mount+0x84/0xb8) from [<c000d6c0>]
> (ret_fast_syscall+0x0/0x48)
> [ 41.715911] device: '0:28': device_add
> [ 41.720062] PM: Adding info for No Bus:0:28
> [ 41.746887] device: '0:29': device_add
> [ 41.751037] PM: Adding info for No Bus:0:29
> [ 41.780700] device: '0:28': device_unregister
> [ 41.785400] PM: Removing info for No Bus:0:28
> [ 41.790344] device: '0:28': device_create_release
>
>
> Thanks,
> --
> Ming Lei
On Mon, 4 Mar 2013 14:14:23 +0000
"Myklebust, Trond" <[email protected]> wrote:
> On Mon, 2013-03-04 at 21:57 +0800, Ming Lei wrote:
> > Hi,
> >
> > The below warning can be triggered each time when mount.nfs is
> > running on 3.9-rc1.
> >
> > Not sure if freezable_schedule() inside rpc_wait_bit_killable should
> > be changed to schedule() since nfs_clid_init_mutex is held in the path.
>
> Cc:ing Jeff, who added freezable_schedule(), and applied it to
> rpc_wait_bit_killable.
>
> So this is occurring when the kernel enters the freeze state?
> Why does it occur only with nfs_clid_init_mutex, and not with all the
> other mutexes that we hold across RPC calls? We hold inode->i_mutex
> across RPC calls all the time when doing renames, unlinks, file
> creation,...
>
cc'ing Mandeep as well since his patch caused this to start popping up...
We've also gotten some similar lockdep pops in the nfsd code recently.
I responded to an email about it here on Friday, and I'll re-post what
I said there below:
-------------------------[snip]----------------------
Ok, I see...
rpc_wait_bit_killable() calls freezable_schedule(). That calls
freezer_count() which calls try_to_freeze(). try_to_freeze does this
lockdep check now as of commit 6aa9707099.
The assumption seems to be that freezing a thread while holding any
sort of lock is bad. The rationale in that patch seems a bit sketchy to
me though. We can be fairly certain that we're not going to deadlock by
holding these locks, but I guess there could be something I've missed.
Mandeep, can you elaborate on whether there's really a deadlock
scenario here? If not, then is there some way to annotate these locks
so this lockdep pop goes away?
-------------------------[snip]----------------------
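For reference, the pieces involved look roughly like this in 3.9-rc1 (a
paraphrase of net/sunrpc/sched.c and include/linux/freezer.h, not an
exact quote):

static int rpc_wait_bit_killable(void *word)
{
	if (fatal_signal_pending(current))
		return -ERESTARTSYS;
	freezable_schedule();		/* sleeps, and freezes if asked to */
	return 0;
}

#define freezable_schedule()				\
({							\
	freezer_do_not_count();				\
	schedule();					\
	freezer_count();	/* calls try_to_freeze() */	\
})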
> > [ 41.387939] =====================================
> > [ 41.392913] [ BUG: mount.nfs/643 still has locks held! ]
> > [ 41.398559] 3.9.0-rc1+ #1740 Not tainted
> > [ 41.402709] -------------------------------------
> > [ 41.407714] 1 lock held by mount.nfs/643:
> > [ 41.411956] #0: (nfs_clid_init_mutex){+.+...}, at: [<c0226d6c>]
> > nfs4_discover_server_trunking+0x60/0x1d4
> > [ 41.422363]
> > [ 41.422363] stack backtrace:
> > [ 41.427032] [<c0014dd4>] (unwind_backtrace+0x0/0xe0) from
> > [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8)
> > [ 41.437103] [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8) from
> > [<c054e454>] (__wait_on_bit+0x54/0x9c)
> > [ 41.446990] [<c054e454>] (__wait_on_bit+0x54/0x9c) from
> > [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84)
> > [ 41.457061] [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84) from
> > [<c04a8f88>] (__rpc_execute+0x170/0x348)
> > [ 41.467407] [<c04a8f88>] (__rpc_execute+0x170/0x348) from
> > [<c04a250c>] (rpc_run_task+0x9c/0xa4)
> > [ 41.476715] [<c04a250c>] (rpc_run_task+0x9c/0xa4) from [<c04a265c>]
> > (rpc_call_sync+0x70/0xb0)
> > [ 41.485778] [<c04a265c>] (rpc_call_sync+0x70/0xb0) from
> > [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8)
> > [ 41.495819] [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8) from
> > [<c0224eb4>] (nfs40_discover_server_trunki
> > ng+0xec/0x148)
> > [ 41.507507] [<c0224eb4>]
> > (nfs40_discover_server_trunking+0xec/0x148) from [<c0226da0>]
> > (nfs4_discover_server
> > _trunking+0x94/0x1d4)
> > [ 41.519866] [<c0226da0>] (nfs4_discover_server_trunking+0x94/0x1d4)
> > from [<c022c664>] (nfs4_init_client+0x15
> > 0/0x1b0)
> > [ 41.531036] [<c022c664>] (nfs4_init_client+0x150/0x1b0) from
> > [<c01fe954>] (nfs_get_client+0x2cc/0x320)
> > [ 41.540863] [<c01fe954>] (nfs_get_client+0x2cc/0x320) from
> > [<c022c080>] (nfs4_set_client+0x80/0xb0)
> > [ 41.550476] [<c022c080>] (nfs4_set_client+0x80/0xb0) from
> > [<c022cef8>] (nfs4_create_server+0xb0/0x21c)
> > [ 41.560333] [<c022cef8>] (nfs4_create_server+0xb0/0x21c) from
> > [<c0227524>] (nfs4_remote_mount+0x28/0x54)
> > [ 41.570373] [<c0227524>] (nfs4_remote_mount+0x28/0x54) from
> > [<c0113a8c>] (mount_fs+0x6c/0x160)
> > [ 41.579498] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
> > (vfs_kern_mount+0x4c/0xc0)
> > [ 41.588378] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
> > [<c022734c>] (nfs_do_root_mount+0x74/0x90)
> > [ 41.597961] [<c022734c>] (nfs_do_root_mount+0x74/0x90) from
> > [<c0227574>] (nfs4_try_mount+0x24/0x3c)
> > [ 41.607513] [<c0227574>] (nfs4_try_mount+0x24/0x3c) from
> > [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0)
> > [ 41.616821] [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0) from
> > [<c0113a8c>] (mount_fs+0x6c/0x160)
> > [ 41.625701] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
> > (vfs_kern_mount+0x4c/0xc0)
> > [ 41.634582] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
> > [<c012c330>] (do_mount+0x710/0x81c)
> > [ 41.643524] [<c012c330>] (do_mount+0x710/0x81c) from [<c012c4c0>]
> > (sys_mount+0x84/0xb8)
> > [ 41.652008] [<c012c4c0>] (sys_mount+0x84/0xb8) from [<c000d6c0>]
> > (ret_fast_syscall+0x0/0x48)
> > [ 41.715911] device: '0:28': device_add
> > [ 41.720062] PM: Adding info for No Bus:0:28
> > [ 41.746887] device: '0:29': device_add
> > [ 41.751037] PM: Adding info for No Bus:0:29
> > [ 41.780700] device: '0:28': device_unregister
> > [ 41.785400] PM: Removing info for No Bus:0:28
> > [ 41.790344] device: '0:28': device_create_release
> >
> >
> > Thanks,
> > --
> > Ming Lei
>
--
Jeff Layton <[email protected]>
On Mon, Mar 4, 2013 at 10:14 PM, Myklebust, Trond
<[email protected]> wrote:
> On Mon, 2013-03-04 at 21:57 +0800, Ming Lei wrote:
>> Hi,
>>
>> The below warning can be triggered each time when mount.nfs is
>> running on 3.9-rc1.
>>
>> Not sure if freezable_schedule() inside rpc_wait_bit_killable should
>> be changed to schedule() since nfs_clid_init_mutex is held in the path.
>
> Cc:ing Jeff, who added freezable_schedule(), and applied it to
> rpc_wait_bit_killable.
>
> So this is occurring when the kernel enters the freeze state?
No, but the situation can really be triggered in the freeze case, so
lockdep forecasts the problem correctly, :-)
> Why does it occur only with nfs_clid_init_mutex, and not with all the
> other mutexes that we hold across RPC calls? We hold inode->i_mutex
> across RPC calls all the time when doing renames, unlinks, file
> creation,...
At least in the mount.nfs context, only nfs_clid_init_mutex is held.
IMO, if locks might be held in the path, it isn't wise to call
freezable_schedule
inside rpc_wait_bit_killable().
Thanks,
--
Ming Lei
On Mon, 4 Mar 2013 22:40:02 +0800
Ming Lei <[email protected]> wrote:
> On Mon, Mar 4, 2013 at 10:14 PM, Myklebust, Trond
> <[email protected]> wrote:
> > On Mon, 2013-03-04 at 21:57 +0800, Ming Lei wrote:
> >> Hi,
> >>
> >> The below warning can be triggered each time when mount.nfs is
> >> running on 3.9-rc1.
> >>
> >> Not sure if freezable_schedule() inside rpc_wait_bit_killable should
> >> be changed to schedule() since nfs_clid_init_mutex is held in the path.
> >
> > Cc:ing Jeff, who added freezable_schedule(), and applied it to
> > rpc_wait_bit_killable.
> >
> > So this is occurring when the kernel enters the freeze state?
>
> No, but the situation can really be triggered in freeze case, so
> lockdep forecasts the problem correctly, :-)
>
> > Why does it occur only with nfs_clid_init_mutex, and not with all the
> > other mutexes that we hold across RPC calls? We hold inode->i_mutex
> > across RPC calls all the time when doing renames, unlinks, file
> > creation,...
>
> At least in the mount.nfs context, only nfs_clid_init_mutex is held.
>
> IMO, if locks might be held in the path, it isn't wise to call
> freezable_schedule
> inside rpc_wait_bit_killable().
>
I don't get it -- why is it bad to hold a lock across a freeze event?
The problem that we have is that we must often hold locks across
long-running syscalls (consider something like sync()). In the event
that there is a lot of dirty data, it might take a long time for that
to finish.
There's also the problem that it's not uncommon for the freezer to take
down userland processes (such as NetworkManager) which in turn take
down network interfaces that we need to talk to the server.
The fix from a couple of years ago (which admittedly needs more work)
was to allow the freezing of tasks that are waiting on a reply from the
server. That sort of necessitates that we are allowed to hold our locks
across the try_to_freeze call though.
If that's no longer allowed then we're back to square one with laptops
that fail to suspend when they have NFS mounts. Is there some other
solution we should pursue instead?
--
Jeff Layton <[email protected]>
Hi,
CC guys who introduced the lockdep change.
On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton <[email protected]> wrote:
>
> I don't get it -- why is it bad to hold a lock across a freeze event?
At least this may deadlock another mount.nfs during freezing, :-)
See detailed explanation in the commit log:
commit 6aa9707099c4b25700940eb3d016f16c4434360d
Author: Mandeep Singh Baines <[email protected]>
Date: Wed Feb 27 17:03:18 2013 -0800
lockdep: check that no locks held at freeze time
We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g. by dpm). Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.
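The check that commit adds boils down to this (a rough paraphrase of
include/linux/freezer.h after 6aa9707099):

static inline bool try_to_freeze(void)
{
	if (!(current->flags & PF_NOFREEZE))
		debug_check_no_locks_held();	/* the new lockdep complaint */
	might_sleep();
	if (likely(!freezing(current)))
		return false;
	return __refrigerator(false);
}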
> The problem that we have is that we must often hold locks across
> long-running syscalls (consider something like sync()). In the event
> that there is a lot of dirty data, it might take a long time for that
> to finish.
>
> There's also the problem that it's not uncommon for the freezer to take
> down userland processes (such as NetworkManager) which in turn take
> down network interfaces that we need to talk to the server.
>
> The fix from a couple of years ago (which admittedly needs more work)
> was to allow the freezing of tasks that are waiting on a reply from the
> server. That sort of necessitates that we are allowed to hold our locks
> across the try_to_freeze call though.
>
> If that's no longer allowed then we're back to square one with laptops
> that fail to suspend when they have NFS mounts. Is there some other
> solution we should pursue instead?
Thanks,
--
Ming Lei
On Mon, 2013-03-04 at 23:33 +0800, Ming Lei wrote:
> Hi,
>
> CC guys who introduced the lockdep change.
>
> On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton <[email protected]> wrote:
>
> >
> > I don't get it -- why is it bad to hold a lock across a freeze event?
>
> At least this may deadlock another mount.nfs during freezing, :-)
>
> See detailed explanation in the commit log:
>
> commit 6aa9707099c4b25700940eb3d016f16c4434360d
> Author: Mandeep Singh Baines <[email protected]>
> Date: Wed Feb 27 17:03:18 2013 -0800
>
> lockdep: check that no locks held at freeze time
>
> We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
> deadlock if the lock is later acquired in the suspend or hibernate path
> (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
> cgroup_freezer if a lock is held inside a frozen cgroup that is later
> acquired by a process outside that group.
>
This is bloody ridiculous... If you want to add functionality to
implement cgroup or per-process freezing, then do it through some other
api instead of trying to push your problems onto others by adding new
global locking rules.
Filesystems are a shared resource that have _nothing_ to do with process
cgroups. They need to be suspended when the network goes down or other
resources that they depend on are suspended. At that point, there is no
"what if I launch a new mount command?" scenario.
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
+ rjw, akpm, tejun, mingo, oleg
On Mon, Mar 4, 2013 at 6:23 AM, Jeff Layton <[email protected]> wrote:
> On Mon, 4 Mar 2013 14:14:23 +0000
> "Myklebust, Trond" <[email protected]> wrote:
>
>> On Mon, 2013-03-04 at 21:57 +0800, Ming Lei wrote:
>> > Hi,
>> >
>> > The below warning can be triggered each time when mount.nfs is
>> > running on 3.9-rc1.
>> >
>> > Not sure if freezable_schedule() inside rpc_wait_bit_killable should
>> > be changed to schedule() since nfs_clid_init_mutex is held in the path.
>>
>> Cc:ing Jeff, who added freezable_schedule(), and applied it to
>> rpc_wait_bit_killable.
>>
>> So this is occurring when the kernel enters the freeze state?
>> Why does it occur only with nfs_clid_init_mutex, and not with all the
>> other mutexes that we hold across RPC calls? We hold inode->i_mutex
>> across RPC calls all the time when doing renames, unlinks, file
>> creation,...
>>
>
> cc'ing Mandeep as well since his patch caused this to start popping up...
>
> We've also gotten some similar lockdep pops in the nfsd code recently.
> I responded to an email about it here on Friday, and I'll re-post what
> I said there below:
>
> -------------------------[snip]----------------------
> Ok, I see...
>
> rpc_wait_bit_killable() calls freezable_schedule(). That calls
> freezer_count() which calls try_to_freeze(). try_to_freeze does this
> lockdep check now as of commit 6aa9707099.
>
> The assumption seems to be that freezing a thread while holding any
> sort of lock is bad. The rationale in that patch seems a bit sketchy to
> me though. We can be fairly certain that we're not going to deadlock by
> holding these locks, but I guess there could be something I've missed.
>
> Mandeep, can you elaborate on whether there's really a deadlock
> scenario here? If not, then is there some way to annotate these locks
> so this lockdep pop goes away?
We were seeing deadlocks on suspend that were root caused to locks
held at freeze time. In this case, the locks were device locks. I
wanted a way to detect if a process tried to freeze with a device_lock
held.
Then I thought about the cgroup_freezer subsystem (which I recently
turned on in our system). If you were to freeze a cgroup and a process
were holding a lock, you could deadlock. Seems reasonable that this
code path could cause a deadlock that way.
The problem is that freezer_count() calls try_to_freeze(). In this
case, try_to_freeze() is not really adding any value.
If we didn't try_to_freeze() in this instance of freezer_count() you'd
avoid the potential deadlock and not block suspend. I think a lot of
call sites are like that. They want to avoid blocking suspend but
aren't really interested in trying to freeze at the moment they call
freezer_count().
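For context, the pair in question looks roughly like this (paraphrasing
include/linux/freezer.h around 3.9, memory-barrier details omitted):

static inline void freezer_do_not_count(void)
{
	/* tell the freezer it may skip this task while it sleeps */
	current->flags |= PF_FREEZER_SKIP;
}

static inline void freezer_count(void)
{
	current->flags &= ~PF_FREEZER_SKIP;
	/* this is the try_to_freeze() that isn't adding any value here */
	try_to_freeze();
}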
Regards,
Mandeep
> -------------------------[snip]----------------------
>
>
>> > [ 41.387939] =====================================
>> > [ 41.392913] [ BUG: mount.nfs/643 still has locks held! ]
>> > [ 41.398559] 3.9.0-rc1+ #1740 Not tainted
>> > [ 41.402709] -------------------------------------
>> > [ 41.407714] 1 lock held by mount.nfs/643:
>> > [ 41.411956] #0: (nfs_clid_init_mutex){+.+...}, at: [<c0226d6c>]
>> > nfs4_discover_server_trunking+0x60/0x1d4
>> > [ 41.422363]
>> > [ 41.422363] stack backtrace:
>> > [ 41.427032] [<c0014dd4>] (unwind_backtrace+0x0/0xe0) from
>> > [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8)
>> > [ 41.437103] [<c04a8300>] (rpc_wait_bit_killable+0x38/0xc8) from
>> > [<c054e454>] (__wait_on_bit+0x54/0x9c)
>> > [ 41.446990] [<c054e454>] (__wait_on_bit+0x54/0x9c) from
>> > [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84)
>> > [ 41.457061] [<c054e514>] (out_of_line_wait_on_bit+0x78/0x84) from
>> > [<c04a8f88>] (__rpc_execute+0x170/0x348)
>> > [ 41.467407] [<c04a8f88>] (__rpc_execute+0x170/0x348) from
>> > [<c04a250c>] (rpc_run_task+0x9c/0xa4)
>> > [ 41.476715] [<c04a250c>] (rpc_run_task+0x9c/0xa4) from [<c04a265c>]
>> > (rpc_call_sync+0x70/0xb0)
>> > [ 41.485778] [<c04a265c>] (rpc_call_sync+0x70/0xb0) from
>> > [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8)
>> > [ 41.495819] [<c021af54>] (nfs4_proc_setclientid+0x1a0/0x1c8) from
>> > [<c0224eb4>] (nfs40_discover_server_trunki
>> > ng+0xec/0x148)
>> > [ 41.507507] [<c0224eb4>]
>> > (nfs40_discover_server_trunking+0xec/0x148) from [<c0226da0>]
>> > (nfs4_discover_server
>> > _trunking+0x94/0x1d4)
>> > [ 41.519866] [<c0226da0>] (nfs4_discover_server_trunking+0x94/0x1d4)
>> > from [<c022c664>] (nfs4_init_client+0x15
>> > 0/0x1b0)
>> > [ 41.531036] [<c022c664>] (nfs4_init_client+0x150/0x1b0) from
>> > [<c01fe954>] (nfs_get_client+0x2cc/0x320)
>> > [ 41.540863] [<c01fe954>] (nfs_get_client+0x2cc/0x320) from
>> > [<c022c080>] (nfs4_set_client+0x80/0xb0)
>> > [ 41.550476] [<c022c080>] (nfs4_set_client+0x80/0xb0) from
>> > [<c022cef8>] (nfs4_create_server+0xb0/0x21c)
>> > [ 41.560333] [<c022cef8>] (nfs4_create_server+0xb0/0x21c) from
>> > [<c0227524>] (nfs4_remote_mount+0x28/0x54)
>> > [ 41.570373] [<c0227524>] (nfs4_remote_mount+0x28/0x54) from
>> > [<c0113a8c>] (mount_fs+0x6c/0x160)
>> > [ 41.579498] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
>> > (vfs_kern_mount+0x4c/0xc0)
>> > [ 41.588378] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
>> > [<c022734c>] (nfs_do_root_mount+0x74/0x90)
>> > [ 41.597961] [<c022734c>] (nfs_do_root_mount+0x74/0x90) from
>> > [<c0227574>] (nfs4_try_mount+0x24/0x3c)
>> > [ 41.607513] [<c0227574>] (nfs4_try_mount+0x24/0x3c) from
>> > [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0)
>> > [ 41.616821] [<c02070f8>] (nfs_fs_mount+0x6dc/0x7a0) from
>> > [<c0113a8c>] (mount_fs+0x6c/0x160)
>> > [ 41.625701] [<c0113a8c>] (mount_fs+0x6c/0x160) from [<c012a47c>]
>> > (vfs_kern_mount+0x4c/0xc0)
>> > [ 41.634582] [<c012a47c>] (vfs_kern_mount+0x4c/0xc0) from
>> > [<c012c330>] (do_mount+0x710/0x81c)
>> > [ 41.643524] [<c012c330>] (do_mount+0x710/0x81c) from [<c012c4c0>]
>> > (sys_mount+0x84/0xb8)
>> > [ 41.652008] [<c012c4c0>] (sys_mount+0x84/0xb8) from [<c000d6c0>]
>> > (ret_fast_syscall+0x0/0x48)
>> > [ 41.715911] device: '0:28': device_add
>> > [ 41.720062] PM: Adding info for No Bus:0:28
>> > [ 41.746887] device: '0:29': device_add
>> > [ 41.751037] PM: Adding info for No Bus:0:29
>> > [ 41.780700] device: '0:28': device_unregister
>> > [ 41.785400] PM: Removing info for No Bus:0:28
>> > [ 41.790344] device: '0:28': device_create_release
>> >
>> >
>> > Thanks,
>> > --
>> > Ming Lei
>>
>
>
> --
> Jeff Layton <[email protected]>
On Mon, Mar 4, 2013 at 7:53 AM, Myklebust, Trond
<[email protected]> wrote:
> On Mon, 2013-03-04 at 23:33 +0800, Ming Lei wrote:
>> Hi,
>>
>> CC guys who introduced the lockdep change.
>>
>> On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton <[email protected]> wrote:
>>
>> >
>> > I don't get it -- why is it bad to hold a lock across a freeze event?
>>
>> At least this may deadlock another mount.nfs during freezing, :-)
>>
>> See detailed explanation in the commit log:
>>
>> commit 6aa9707099c4b25700940eb3d016f16c4434360d
>> Author: Mandeep Singh Baines <[email protected]>
>> Date: Wed Feb 27 17:03:18 2013 -0800
>>
>> lockdep: check that no locks held at freeze time
>>
>> We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
>> deadlock if the lock is later acquired in the suspend or hibernate path
>> (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
>> cgroup_freezer if a lock is held inside a frozen cgroup that is later
>> acquired by a process outside that group.
>>
>
> This is bloody ridiculous... If you want to add functionality to
> implement cgroup or per-process freezing, then do it through some other
> api instead of trying to push your problems onto others by adding new
> global locking rules.
>
> Filesystems are a shared resource that have _nothing_ to do with process
> cgroups. They need to be suspended when the network goes down or other
> resources that they depend on are suspended. At that point, there is no
> "what if I launch a new mount command?" scenario.
>
Hi Trond,
My intention was to introduce new rules. My change simply introduces a
check for a deadlock case that can already happen.
I think a deadlock could happen under the following scenario:
1) An administrator wants to freeze a container, perhaps to checkpoint
it and migrate it some place else.
2) An nfs mount was in progress so we hit this code path and freeze
with a lock held.
3) Another container tries to nfs mount.
4) Deadlock.
Regards,
Mandeep
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
On Mon, Mar 4, 2013 at 12:09 PM, Mandeep Singh Baines <[email protected]> wrote:
> On Mon, Mar 4, 2013 at 7:53 AM, Myklebust, Trond
> <[email protected]> wrote:
>> On Mon, 2013-03-04 at 23:33 +0800, Ming Lei wrote:
>>> Hi,
>>>
>>> CC guys who introduced the lockdep change.
>>>
>>> On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton <[email protected]> wrote:
>>>
>>> >
>>> > I don't get it -- why is it bad to hold a lock across a freeze event?
>>>
>>> At least this may deadlock another mount.nfs during freezing, :-)
>>>
>>> See detailed explanation in the commit log:
>>>
>>> commit 6aa9707099c4b25700940eb3d016f16c4434360d
>>> Author: Mandeep Singh Baines <[email protected]>
>>> Date: Wed Feb 27 17:03:18 2013 -0800
>>>
>>> lockdep: check that no locks held at freeze time
>>>
>>> We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
>>> deadlock if the lock is later acquired in the suspend or hibernate path
>>> (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
>>> cgroup_freezer if a lock is held inside a frozen cgroup that is later
>>> acquired by a process outside that group.
>>>
>>
>> This is bloody ridiculous... If you want to add functionality to
>> implement cgroup or per-process freezing, then do it through some other
>> api instead of trying to push your problems onto others by adding new
>> global locking rules.
>>
>> Filesystems are a shared resource that have _nothing_ to do with process
>> cgroups. They need to be suspended when the network goes down or other
>> resources that they depend on are suspended. At that point, there is no
>> "what if I launch a new mount command?" scenario.
>>
>
> Hi Trond,
>
> My intention was to introduce new rules. My change simply introduces a
D'oh.
s/was/was not/
Regards,
Mandeep
> check for a deadlock case that can already happen.
>
> I think a deadlock could happen under the following scenario:
>
> 1) An administrator wants to freeze a container. Perhaps to checkpoint
> it and it migrate it some place else.
> 2) An nfs mount was in progress so we hit this code path and freeze
> with a lock held.
> 3) Another container tries to nfs mount.
> 4) Deadlock.
>
> Regards,
> Mandeep
>
>> Trond
>> --
>> Trond Myklebust
>> Linux NFS client maintainer
>>
>> NetApp
>> [email protected]
>> http://www.netapp.com
On 03/04, Mandeep Singh Baines wrote:
>
> The problem is that freezer_count() calls try_to_freeze(). In this
> case, try_to_freeze() is not really adding any value.
Well, I tend to agree.
If a task calls __refrigerator() holding a lock which another freezable
task can wait for, this is not freezer-friendly.
freezable_schedule/freezer_do_not_count/etc not only means "I won't be
active if freezing()", it should also mean "I won't block suspend/etc".
OTOH, I understand that probably it is not trivial to change this code
to make it freezer-friendly. But at least I disagree with "push your
problems onto others".
Oleg.
On Mon, 2013-03-04 at 21:53 +0100, Oleg Nesterov wrote:
> On 03/04, Mandeep Singh Baines wrote:
> >
> > The problem is that freezer_count() calls try_to_freeze(). In this
> > case, try_to_freeze() is not really adding any value.
>
> Well, I tend to agree.
>
> If a task calls __refrigerator() holding a lock which another freezable
> task can wait for, this is not freezer-friendly.
>
> freezable_schedule/freezer_do_not_count/etc not only means "I won't be
> active if freezing()", it should also mean "I won't block suspend/etc".
If suspend for some reason requires a re-entrant mount, then yes, I can
see how it might be a problem that we're holding a mount-related lock.
The question is why is that necessary?
> OTOH, I understand that probably it is not trivial to change this code
> to make it freezer-friendly. But at least I disagree with "push your
> problems onto others".
That code can't be made freezer-friendly if it isn't allowed to hold
basic filesystem-related locks across RPC calls. A number of those RPC
calls do things that need to be protected by VFS or MM-level locks.
i.e.: lookups, file creation/deletion, page fault in/out, ...
IOW: the problem would need to be solved differently, possibly by adding
a new FIFREEZE-like call to allow the filesystem to quiesce itself
_before_ NetworkManager pulls the rug out from underneath it. There
would still be plenty of corner cases to keep people entertained (e.g.
the server goes down before the quiesce call is invoked) but at least
the top 90% of cases would be solved.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Mon, 4 Mar 2013 22:08:34 +0000
"Myklebust, Trond" <[email protected]> wrote:
> On Mon, 2013-03-04 at 21:53 +0100, Oleg Nesterov wrote:
> > On 03/04, Mandeep Singh Baines wrote:
> > >
> > > The problem is that freezer_count() calls try_to_freeze(). In this
> > > case, try_to_freeze() is not really adding any value.
> >
> > Well, I tend to agree.
> >
> > If a task calls __refrigerator() holding a lock which another freezable
> > task can wait for, this is not freezer-friendly.
> >
> > freezable_schedule/freezer_do_not_count/etc not only means "I won't be
> > active if freezing()", it should also mean "I won't block suspend/etc".
>
> If suspend for some reason requires a re-entrant mount, then yes, I can
> see how it might be a problem that we're holding a mount-related lock.
> The question is why is that necessary?
>
> > OTOH, I understand that probably it is not trivial to change this code
> > to make it freezer-friendly. But at least I disagree with "push your
> > problems onto others".
>
> That code can't be made freezer-friendly if it isn't allowed to hold
> basic filesystem-related locks across RPC calls. A number of those RPC
> calls do things that need to be protected by VFS or MM-level locks.
> i.e.: lookups, file creation/deletion, page fault in/out, ...
>
> IOW: the problem would need to be solved differently, possibly by adding
> a new FIFREEZE-like call to allow the filesystem to quiesce itself
> _before_ NetworkManager pulls the rug out from underneath it. There
> would still be plenty of corner cases to keep people entertained (e.g.
> the server goes down before the quiesce call is invoked) but at least
> the top 90% of cases would be solved.
>
Ok, I think I'm starting to get it. It doesn't necessarily need a
reentrant mount or anything like that. Consider this case (which is not
even that unlikely):
Suppose there are two tasks calling unlink() on files in the same NFS
directory.
First task takes the i_mutex on the parent directory and goes to ask
the server to remove the file. Second task calls unlink just
afterward and blocks on the parent's i_mutex.
Now, a suspend event comes in and freezes the first task while it's
waiting on the response. It still holds the parent's i_mutex. Freezer
now gets to the second task and can't freeze it because the sleep on
that mutex isn't freezable.
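Roughly, the VFS side of that scenario looks like this (a loose
paraphrase of do_unlinkat() in fs/namei.c, variable names simplified):

	/* parent directory's i_mutex is held across the UNLINK RPC */
	mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
	dentry = lookup_hash(&nd);
	if (!IS_ERR(dentry)) {
		error = vfs_unlink(dir->d_inode, dentry); /* -> nfs_unlink() -> RPC wait */
		dput(dentry);
	}
	mutex_unlock(&dir->d_inode->i_mutex);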
So, not a deadlock per se in this case, but it does prevent the freezer
from running to completion. I don't see any way to solve it though w/o
making all mutexes freezable. Note that I don't think this is really
limited to NFS either -- a lot of other filesystems will have similar
problems: CIFS, some FUSE variants, etc...
--
Jeff Layton <[email protected]>
Hello, guys.
On Tue, Mar 05, 2013 at 08:23:08AM -0500, Jeff Layton wrote:
> So, not a deadlock per-se in this case but it does prevent the freezer
> from running to completion. I don't see any way to solve it though w/o
> making all mutexes freezable. Note that I don't think this is really
> limited to NFS either -- a lot of other filesystems will have similar
> problems: CIFS, some FUSE variants, etc...
So, I think this is why implementing freezer as a separate blocking
mechanism isn't such a good idea. We're effectively introducing a
completely new waiting state to a lot of unsuspecting paths which
generates a lot of risks and eventually extra complexity to work
around those. I think we really should update freezer to re-use the
blocking points we already have - the ones used for signal delivery
and ptracing. That way, other code paths don't have to worry about an
extra stop state and we can confine most complexities to freezer
proper.
Thanks.
--
tejun
On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> So, I think this is why implementing freezer as a separate blocking
> mechanism isn't such a good idea. We're effectively introducing a
> completely new waiting state to a lot of unsuspecting paths which
> generates a lot of risks and eventually extra complexity to work
> around those. I think we really should update freezer to re-use the
> blocking points we already have - the ones used for signal delivery
> and ptracing. That way, other code paths don't have to worry about an
> extra stop state and we can confine most complexities to freezer
> proper.
Also, consolidating those wait states means that we can solve the
event-to-response latency problem for all three cases - signal, ptrace
and freezer, rather than adding separate backing-out strategy for
freezer.
--
tejun
On Tue, 5 Mar 2013 09:49:54 -0800
Tejun Heo <[email protected]> wrote:
> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > So, I think this is why implementing freezer as a separate blocking
> > mechanism isn't such a good idea. We're effectively introducing a
> > completely new waiting state to a lot of unsuspecting paths which
> > generates a lot of risks and eventually extra complexity to work
> > around those. I think we really should update freezer to re-use the
> > blocking points we already have - the ones used for signal delivery
> > and ptracing. That way, other code paths don't have to worry about an
> > extra stop state and we can confine most complexities to freezer
> > proper.
>
> Also, consolidating those wait states means that we can solve the
> event-to-response latency problem for all three cases - signal, ptrace
> and freezer, rather than adding separate backing-out strategy for
> freezer.
>
Sounds intriguing...
I'm not sure what this really means for something like NFS though. How
would you envision this working when we have long running syscalls that
might sit waiting in the kernel indefinitely?
Here's my blue-sky, poorly-thought-out idea...
We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
NFS/RPC layer to be interrupted. Those would return back toward
userland with a particular type of error (sort of like ERESTARTSYS).
Before returning from the kernel though, we could freeze the process.
When it wakes up, then we could go back down and retry the call again
(much like an ERESTARTSYS kind of thing).
The tricky part here is that we'd need to distinguish between the case
where we caught SIGFREEZE before sending an RPC vs. after. If we sent
the call before freezing, then we don't want to resend it again. It
might be a non-idempotent operation.
Sounds horrific to code up though... :)
--
Jeff Layton <[email protected]>
Hello, Jeff.
On Tue, Mar 05, 2013 at 02:03:12PM -0500, Jeff Layton wrote:
> Sounds intriguing...
>
> I'm not sure what this really means for something like NFS though. How
> would you envision this working when we have long running syscalls that
> might sit waiting in the kernel indefinitely?
I think it is the same problem as being able to handle SIGKILL in a
responsive manner. It could be tricky to implement for nfs but it at
least doesn't have to solve the problem twice.
> Here's my blue-sky, poorly-thought-out idea...
>
> We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> NFS/RPC layer to be interrupted. Those would return back toward
> userland with a particular type of error (sort of like ERESTARTSYS).
>
> Before returning from the kernel though, we could freeze the process.
> When it wakes up, then we could go back down and retry the call again
> (much like an ERESTARTSYS kind of thing).
>
> The tricky part here is that we'd need to distinguish between the case
> where we caught SIGFREEZE before sending an RPC vs. after. If we sent
> the call before freezing, then we don't want to resend it again. It
> might be a non-idempotent operation.
So, yeah, you are thinking pretty much the same as I am.
> Sounds horrific to code up though... :)
I don't know the details of nfs but those events could essentially be
signaling that the system is gonna lose power. I think it would be a
good idea to solve it.
Thanks.
--
tejun
On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > So, I think this is why implementing freezer as a separate blocking
> > mechanism isn't such a good idea. We're effectively introducing a
> > completely new waiting state to a lot of unsuspecting paths which
> > generates a lot of risks and eventually extra complexity to work
> > around those. I think we really should update freezer to re-use the
> > blocking points we already have - the ones used for signal delivery
> > and ptracing. That way, other code paths don't have to worry about an
> > extra stop state and we can confine most complexities to freezer
> > proper.
>
> Also, consolidating those wait states means that we can solve the
> event-to-response latency problem for all three cases - signal, ptrace
> and freezer, rather than adding separate backing-out strategy for
> freezer.
Meanwhile, as none of this sounds likely to be done this time
around--are we backing out the new lockdep warnings?
--b.
On Tue, 5 Mar 2013 11:09:23 -0800
Tejun Heo <[email protected]> wrote:
> Hello, Jeff.
>
> On Tue, Mar 05, 2013 at 02:03:12PM -0500, Jeff Layton wrote:
> > Sounds intriguing...
> >
> > I'm not sure what this really means for something like NFS though. How
> > would you envision this working when we have long running syscalls that
> > might sit waiting in the kernel indefinitely?
>
> I think it is the same problem as being able to handle SIGKILL in
> responsive manner. It could be tricky to implement for nfs but it at
> least doesn't have to solve the problem twice.
>
> > Here's my blue-sky, poorly-thought-out idea...
> >
> > We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> > NFS/RPC layer to be interrupted. Those would return back toward
> > userland with a particular type of error (sort of like ERESTARTSYS).
> >
> > Before returning from the kernel though, we could freeze the process.
> > When it wakes up, then we could go back down and retry the call again
> > (much like an ERESTARTSYS kind of thing).
> >
> > The tricky part here is that we'd need to distinguish between the case
> > where we caught SIGFREEZE before sending an RPC vs. after. If we sent
> > the call before freezing, then we don't want to resend it again. It
> > might be a non-idempotent operation.
>
> So, yeah, you are thinking pretty much the same as I'm.
>
> > Sounds horrific to code up though... :)
>
> I don't know the details of nfs but those events could essentially be
> signaling that the system is gonna lose power. I think it would be a
> good idea to solve it.
>
> Thanks.
>
It would be...
So, I briefly considered a similar approach when I was working on the
retry-on-ESTALE error patches. It occurred to me that it was somewhat
similar to the ERESTARTSYS case, so handling it at a higher level than
in the syscall handlers themselves might make sense...
Al was in the middle of his signal handling/execve rework though and I
ran the idea past him. He pointedly told me that I was crazy for even
considering it. This is rather non-trivial to handle since it means
mucking around in a bunch of arch-specific code and dealing with all of
the weirdo corner cases.
Anyone up for working out how to handle a freeze event on a process
that already has a pending signal, while it's being ptraced? It's
probably best to chat with Al (cc'ed here) before you embark on this
plan since he was just in that code recently.
In any case, maybe there's also some code consolidation opportunity
here too. I suspect at least some of this logic is in arch-specific
code when it really needn't be....
--
Jeff Layton <[email protected]>
Hello, Jeff.
On Tue, Mar 05, 2013 at 06:39:41PM -0500, Jeff Layton wrote:
> Al was in the middle of his signal handling/execve rework though and I
> ran the idea past him. He pointedly told me that I was crazy for even
> considering it. This is rather non-trivial to handle since it means
> mucking around in a bunch of arch-specific code and dealing with all of
> the weirdo corner cases.
Hmmm... I tried it a couple years ago while I was working on ptrace
improvements to support checkpoint-restart in userland (PTRACE_SEIZE
and friends) and had a half-working code. At the time, Oleg was
against the idea and something else came up so I didn't push it all
the way but IIRC it didn't need to touch any arch code. It should be
able to just share the existing trap path with ptrace.
> Anyone up for working out how to handle a freeze event on a process
> that already has a pending signal, while it's being ptraced? It's
> probably best to chat with Al (cc'ed here) before you embark on this
> plan since he was just in that code recently.
Maybe Al sees something that I don't but AFAICS it should be doable in
generic code proper and I don't think it's gonna be that hard.
Oleg, are you still opposed to the idea of making freezer share trap
points with ptrace?
Thanks.
--
tejun
On Tuesday, March 05, 2013 06:11:10 PM J. Bruce Fields wrote:
> On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea. We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those. I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing. That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> >
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
>
> Meanwhile, as none of this sounds likely to be done this time
> around--are we backing out the new lockdep warnings?
I think we should do that.
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
This check is turning up a lot of code paths which need to be fixed,
so while those paths are being fixed, let's make this check optional
so that folks can still use lockdep.
CC: Tejun Heo <[email protected]>
CC: Jeff Layton <[email protected]>
CC: "Myklebust, Trond" <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Ming Lei <[email protected]>
CC: "Rafael J. Wysocki" <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Ingo Molnar <[email protected]>
---
include/linux/freezer.h | 2 ++
lib/Kconfig.debug | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 043a5cf..03bdc54 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -49,8 +49,10 @@ extern void thaw_kernel_threads(void);
 
 static inline bool try_to_freeze(void)
 {
+#ifdef CONFIG_DEBUG_LOCK_HELD_FREEZING
 	if (!(current->flags & PF_NOFREEZE))
 		debug_check_no_locks_held();
+#endif
 	might_sleep();
 	if (likely(!freezing(current)))
 		return false;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 28be08c..bddda5f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -547,6 +547,18 @@ config DEBUG_MUTEXES
 	  This feature allows mutex semantics violations to be detected and
 	  reported.
 
+config DEBUG_LOCK_HELD_FREEZING
+	bool "Lock debugging: detect when locks are held during freeze"
+	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
+	select DEBUG_SPINLOCK
+	select DEBUG_MUTEXES
+	select LOCKDEP
+	help
+	  This feature will check whether any lock is incorrectly held
+	  while freezing. If a task freezes with a lock held it will
+	  block any other task that is waiting on that lock from freezing.
+	  In the case of cgroup_freezer, this can cause a deadlock.
+
 config DEBUG_LOCK_ALLOC
 	bool "Lock debugging: detect incorrect freeing of live locks"
 	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
--
1.7.12.4
On Tue, Mar 5, 2013 at 3:11 PM, J. Bruce Fields <[email protected]> wrote:
> On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
>> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
>> > So, I think this is why implementing freezer as a separate blocking
>> > mechanism isn't such a good idea. We're effectively introducing a
>> > completely new waiting state to a lot of unsuspecting paths which
>> > generates a lot of risks and eventually extra complexity to work
>> > around those. I think we really should update freezer to re-use the
>> > blocking points we already have - the ones used for signal delivery
>> > and ptracing. That way, other code paths don't have to worry about an
>> > extra stop state and we can confine most complexities to freezer
>> > proper.
>>
>> Also, consolidating those wait states means that we can solve the
>> event-to-response latency problem for all three cases - signal, ptrace
>> and freezer, rather than adding separate backing-out strategy for
>> freezer.
>
> Meanwhile, as none of this sounds likely to be done this time
> around--are we backing out the new lockdep warnings?
>
> --b.
What if we hide it behind a Kconfig? It's finding real bugs.
http://lkml.org/lkml/2013/3/5/583
On Tue, Mar 05, 2013 at 04:59:00PM -0800, Mandeep Singh Baines wrote:
> On Tue, Mar 5, 2013 at 3:11 PM, J. Bruce Fields <[email protected]> wrote:
> > On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> >> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> >> > So, I think this is why implementing freezer as a separate blocking
> >> > mechanism isn't such a good idea. We're effectively introducing a
> >> > completely new waiting state to a lot of unsuspecting paths which
> >> > generates a lot of risks and eventually extra complexity to work
> >> > around those. I think we really should update freezer to re-use the
> >> > blocking points we already have - the ones used for signal delivery
> >> > and ptracing. That way, other code paths don't have to worry about an
> >> > extra stop state and we can confine most complexities to freezer
> >> > proper.
> >>
> >> Also, consolidating those wait states means that we can solve the
> >> event-to-response latency problem for all three cases - signal, ptrace
> >> and freezer, rather than adding separate backing-out strategy for
> >> freezer.
> >
> > Meanwhile, as none of this sounds likely to be done this time
> > around--are we backing out the new lockdep warnings?
> >
> > --b.
>
> What if we hide it behind a Kconfig? Its finding real bugs.
>
> http://lkml.org/lkml/2013/3/5/583
If it's really just a 2-line patch to try_to_freeze(), could it just be
carried out-of-tree by people that are specifically working on tracking
down these problems?
But I don't have strong feelings about it--as long as it doesn't result
in the same known issues getting reported again and again....
--b.
On Tue, 2013-03-05 at 14:03 -0500, Jeff Layton wrote:
> On Tue, 5 Mar 2013 09:49:54 -0800
> Tejun Heo <[email protected]> wrote:
>
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea. We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those. I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing. That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> >
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
> >
>
> Sounds intriguing...
>
> I'm not sure what this really means for something like NFS though. How
> would you envision this working when we have long running syscalls that
> might sit waiting in the kernel indefinitely?
>
> Here's my blue-sky, poorly-thought-out idea...
>
> We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> NFS/RPC layer to be interrupted. Those would return back toward
> userland with a particular type of error (sort of like ERESTARTSYS).
>
> Before returning from the kernel though, we could freeze the process.
> When it wakes up, then we could go back down and retry the call again
> (much like an ERESTARTSYS kind of thing).
Two (three?) show stopper words for you: "non-idempotent operations".
Not all RPC calls can just be interrupted and restarted. Something like
an exclusive file create, unlink, file locking attempt, etc may give
rise to different results when replayed in the above scenario.
Interrupting an RPC call is not equivalent to cancelling its effects...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
Hello, Trond.
On Wed, Mar 06, 2013 at 01:10:07AM +0000, Myklebust, Trond wrote:
> Not all RPC calls can just be interrupted and restarted. Something like
> an exclusive file create, unlink, file locking attempt, etc may give
> rise to different results when replayed in the above scenario.
> Interrupting an RPC call is not equivalent to cancelling its effects...
Then, the operation simply isn't freezable while in progress and
should be on the receiving end of a failed-to-freeze error message, and
users who depend on suspend/hibernation working properly should be
advised away from using nfs.

It doesn't really change anything. The current code is buggy. We're
just not enforcing the rules strictly and people aren't hitting the
bug often enough.
Thanks.
--
tejun
On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> If it's really just a 2-line patch to try_to_freeze(), could it just be
> carried out-of-tree by people that are specifically working on tracking
> down these problems?
>
> But I don't have strong feelings about it--as long as it doesn't result
> in the same known issues getting reported again and again....
Agreed, I don't think a Kconfig option is justified for this. If this
is really important, annotate broken paths so that it doesn't trigger
spuriously; otherwise, please just remove it.
Thanks.
--
tejun
On Tue, Mar 05, 2013 at 05:14:23PM -0800, Tejun Heo wrote:
> Then, the operation simply isn't freezable while in progress and
> should be on the receiving end of failed-to-freeze error message and
> users who depend on suspend/hibernation working properly should be
> advised away from using nfs.
>
> It doesn't really changes anything. The current code is buggy. We're
> just not enforcing the rules strictly and people aren't hitting the
> bug often enough.
Just one more thing. Not being able to do retries without
side effects doesn't mean it can't do retries at all. There are
syscalls which need to do things differently on re-entry (I forget
which ones, but there are several). They record the necessary state in
the restart block and resume from where they left off on re-entry. It
sure is hairy but is doable and we already have supporting
infrastructure. Not sure whether that would be worthwhile tho.
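As an entirely hypothetical sketch of that pattern (the nfs-specific
names below don't exist, they're only there to show the shape; compare
the nanosleep restart path):

static long nfs_op_restart(struct restart_block *restart)
{
	/* resume using the state stashed before we stopped */
	return nfs_continue_op(restart->nfs.saved_state);	/* hypothetical */
}

	/* in the interrupted syscall path: */
	struct restart_block *restart = &current_thread_info()->restart_block;

	restart->fn = nfs_op_restart;
	restart->nfs.saved_state = state;	/* hypothetical union member */
	return -ERESTART_RESTARTBLOCK;	/* re-entered via sys_restart_syscall() */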
Thanks.
--
tejun
On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo <[email protected]> wrote:
> On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
>> If it's really just a 2-line patch to try_to_freeze(), could it just be
>> carried out-of-tree by people that are specifically working on tracking
>> down these problems?
>>
>> But I don't have strong feelings about it--as long as it doesn't result
>> in the same known issues getting reported again and again....
>
> Agreed, I don't think a Kconfig option is justified for this. If this
> is really important, annotate broken paths so that it doesn't trigger
> spuriously; otherwise, please just remove it.
>
Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
Maybe, add a new lockdep API:
lockdep_set_held_during_freeze(lock);
Then when we do the check, ignore any locks that set this bit.
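Something like this, as a purely hypothetical sketch (field and
function names invented for illustration):

void lockdep_set_held_during_freeze(struct lockdep_map *lock)
{
	lock->held_during_freeze = 1;	/* hypothetical flag */
}

/* and the freeze-time check would skip any lock marked that way: */
for (i = 0; i < current->lockdep_depth; i++) {
	struct held_lock *hlock = current->held_locks + i;

	if (hlock->instance->held_during_freeze)	/* hypothetical */
		continue;
	/* otherwise warn as debug_check_no_locks_held() does today */
}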
Ingo, does this seem like a reasonable design to you?
Regards,
Mandeep
> Thanks.
>
> --
> tejun
* Mandeep Singh Baines <[email protected]> wrote:
> On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo <[email protected]> wrote:
> > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> >> If it's really just a 2-line patch to try_to_freeze(), could it just be
> >> carried out-of-tree by people that are specifically working on tracking
> >> down these problems?
> >>
> >> But I don't have strong feelings about it--as long as it doesn't result
> >> in the same known issues getting reported again and again....
> >
> > Agreed, I don't think a Kconfig option is justified for this. If this
> > is really important, annotate broken paths so that it doesn't trigger
> > spuriously; otherwise, please just remove it.
> >
>
> Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
>
> Maybe, add a new lockdep API:
>
> lockdep_set_held_during_freeze(lock);
>
> Then when we do the check, ignore any locks that set this bit.
>
> Ingo, does this seem like a reasonable design to you?
Am I reading the discussion correctly that the new warnings show REAL potential
deadlock scenarios, which can hit real users and can lock their box up in entirely
real usage scenarios?
If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_ bugs.
We typically use them to teach lockdep about things it does not know about.
How about fixing the deadlocks instead?
Thanks,
Ingo
On Wed, 6 Mar 2013 01:10:07 +0000
"Myklebust, Trond" <[email protected]> wrote:
> On Tue, 2013-03-05 at 14:03 -0500, Jeff Layton wrote:
> > On Tue, 5 Mar 2013 09:49:54 -0800
> > Tejun Heo <[email protected]> wrote:
> >
> > > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > > So, I think this is why implementing freezer as a separate blocking
> > > > mechanism isn't such a good idea. We're effectively introducing a
> > > > completely new waiting state to a lot of unsuspecting paths which
> > > > generates a lot of risks and eventually extra complexity to work
> > > > around those. I think we really should update freezer to re-use the
> > > > blocking points we already have - the ones used for signal delivery
> > > > and ptracing. That way, other code paths don't have to worry about an
> > > > extra stop state and we can confine most complexities to freezer
> > > > proper.
> > >
> > > Also, consolidating those wait states means that we can solve the
> > > event-to-response latency problem for all three cases - signal, ptrace
> > > and freezer, rather than adding separate backing-out strategy for
> > > freezer.
> > >
> >
> > Sounds intriguing...
> >
> > I'm not sure what this really means for something like NFS though. How
> > would you envision this working when we have long running syscalls that
> > might sit waiting in the kernel indefinitely?
> >
> > Here's my blue-sky, poorly-thought-out idea...
> >
> > We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> > NFS/RPC layer to be interrupted. Those would return back toward
> > userland with a particular type of error (sort of like ERESTARTSYS).
> >
> > Before returning from the kernel though, we could freeze the process.
> > When it wakes up, then we could go back down and retry the call again
> > (much like an ERESTARTSYS kind of thing).
>
> Two (three?) show stopper words for you: "non-idempotent operations".
>
> Not all RPC calls can just be interrupted and restarted. Something like
> an exclusive file create, unlink, file locking attempt, etc may give
> rise to different results when replayed in the above scenario.
> Interrupting an RPC call is not equivalent to cancelling its effects...
>
Right -- that's the part where we have to take great care to save the
state of the syscall at the time we returned back up toward userland on
a freeze event. I suppose we'd need to hang something off the
task_struct to keep track of that.
In most cases, it would be sufficient to keep track of whether an RPC
had been sent during the call for non-idempotent operations. If it was
sent, then we'd just re-enter the wait for the reply. If it wasn't then
we'd go ahead and send the call.
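As a very rough sketch of that bookkeeping (nothing like this exists
today; all names are invented), the state hung off the task_struct
might be no more than:

/* Hypothetical per-task state for restarting an interrupted NFS syscall. */
struct nfs_freeze_restart_state {
        bool    rpc_sent;       /* non-idempotent RPC already transmitted? */
        __be32  xid;            /* XID of the outstanding call, if any */
};

On re-entry, rpc_sent set would mean "wait for the reply matching xid"
rather than building and transmitting a new request.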
Still, I'm sure there are details I'm overlooking here. The whole point
of holding these mutexes is to ensure that the directories don't change
while we're doing these operations. If we release the locks in order to
go to sleep, then there's no guarantee that things haven't changed when
we reacquire them.
Maybe it's best to give up and just tell people that suspending your
laptop with an NFS mount is not allowed :P
--
Jeff Layton <[email protected]>
On Wed, 6 Mar 2013 10:09:14 +0100
Ingo Molnar <[email protected]> wrote:
>
> * Mandeep Singh Baines <[email protected]> wrote:
>
> > On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo <[email protected]> wrote:
> > > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> > >> If it's really just a 2-line patch to try_to_freeze(), could it just be
> > >> carried out-of-tree by people that are specifically working on tracking
> > >> down these problems?
> > >>
> > >> But I don't have strong feelings about it--as long as it doesn't result
> > >> in the same known issues getting reported again and again....
> > >
> > > Agreed, I don't think a Kconfig option is justified for this. If this
> > > is really important, annotate broken paths so that it doesn't trigger
> > > spuriously; otherwise, please just remove it.
> > >
> >
> > Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
> >
> > Maybe, add a new lockdep API:
> >
> > lockdep_set_held_during_freeze(lock);
> >
> > Then when we do the check, ignore any locks that set this bit.
> >
> > Ingo, does this seem like a reasonable design to you?
>
> Am I reading the discussion correctly that the new warnings show REAL potential
> deadlock scenarios, which can hit real users and can lock their box up in entirely
> real usage scenarios?
>
> If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_ bugs.
> We typically use them to teach lockdep about things it does not know about.
>
> How about fixing the deadlocks instead?
>
I do see how the freezer might fail to suspend certain tasks, but I
don't see the deadlock scenario here in the NFS/RPC case. Can someone
outline a situation where this might end up deadlocking? If not, then
I'd be inclined to say that while this may be a problem, the warning is
excessive...
--
Jeff Layton <[email protected]>
On Wed, Mar 6, 2013 at 4:06 AM, Jeff Layton <[email protected]> wrote:
> On Wed, 6 Mar 2013 10:09:14 +0100
> Ingo Molnar <[email protected]> wrote:
>
>>
>> * Mandeep Singh Baines <[email protected]> wrote:
>>
>> > On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo <[email protected]> wrote:
>> > > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
>> > >> If it's really just a 2-line patch to try_to_freeze(), could it just be
>> > >> carried out-of-tree by people that are specifically working on tracking
>> > >> down these problems?
>> > >>
>> > >> But I don't have strong feelings about it--as long as it doesn't result
>> > >> in the same known issues getting reported again and again....
>> > >
>> > > Agreed, I don't think a Kconfig option is justified for this. If this
>> > > is really important, annotate broken paths so that it doesn't trigger
>> > > spuriously; otherwise, please just remove it.
>> > >
>> >
>> > Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
>> >
>> > Maybe, add a new lockdep API:
>> >
>> > lockdep_set_held_during_freeze(lock);
>> >
>> > Then when we do the check, ignore any locks that set this bit.
>> >
>> > Ingo, does this seem like a reasonable design to you?
>>
>> Am I reading the discussion correctly that the new warnings show REAL potential
>> deadlock scenarios, which can hit real users and can lock their box up in entirely
>> real usage scenarios?
>>
>> If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_ bugs.
>> We typically use them to teach lockdep about things it does not know about.
>>
>> How about fixing the deadlocks instead?
>>
>
> I do see how the freezer might fail to suspend certain tasks, but I
> don't see the deadlock scenario here in the NFS/RPC case. Can someone
> outline a situation where this might end up deadlocking? If not, then
> I'd be inclined to say that while this may be a problem, the warning is
> excessive...
>
In general, holding a lock and freezing can cause a deadlock if:
1) you froze via the cgroup_freezer subsystem and a task in another
cgroup tried to acquire the same lock
2) the lock was needed later in suspend/hibernate. For example, if the
lock was needed in dpm_suspend by one of the device callbacks. For
hibernate, you also need to worry about any locks that need to be
acquired in order to write to the swap device.
3) another freezing task blocked on this lock and held other locks
needed later in suspend. If that task were skipped by the freezer, you
would deadlock.
You will block/prevent suspend if:
4) another freezing task blocked on this lock and was unable to freeze.
I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires the
cgroup freezer. Case 4), while not causing a deadlock, could prevent
your laptop/phone from sleeping and end up burning all your battery.
If suspend is initiated via lid close you won't even know about the
failure.
Regards,
Mandeep
> --
> Jeff Layton <[email protected]>
On 03/05, Tejun Heo wrote:
>
> Oleg, are you still opposed to the idea of making freezer share trap
> points with ptrace?
My memory can fool me, but iirc I wasn't actually opposed... I guess
you mean the previous discussion about vfork/ptrace/etc which I forgot
completely.
But I can recall the main problem with your idea (with me): I simply
wasn't able to understand it ;)
Likewise, I can't really understand the ideas discussed in this thread.
At least when it comes to this particular problem, rpc_wait_bit_killable()
is not interruptible...
And how can SIGFREEZE help? If we want to interrupt the sleeps in the
NFS/RPC layer, we can simply add TASK_WAKEFREEZE (which can be used with
TASK_KILLABLE) and change freeze_task() to do
signal_wake_up_state(TASK_WAKEFREEZE).
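Purely as an illustration of that wakeup path (TASK_WAKEFREEZE is made
up, and the real freeze_task() differs in detail), it might look like:

bool freeze_task(struct task_struct *p)
{
        unsigned long flags;

        spin_lock_irqsave(&freezer_lock, flags);
        if (!freezing(p) || frozen(p)) {
                spin_unlock_irqrestore(&freezer_lock, flags);
                return false;
        }

        if (!(p->flags & PF_KTHREAD))
                /* wake TASK_KILLABLE | TASK_WAKEFREEZE sleepers in NFS/RPC */
                signal_wake_up_state(p, TASK_WAKEFREEZE);
        else
                wake_up_state(p, TASK_INTERRUPTIBLE);

        spin_unlock_irqrestore(&freezer_lock, flags);
        return true;
}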
But if we can do this, then it should be possible to simply make these
sleeps TASK_INTERRUPTIBLE? But it seems that we can't, simply because we
can't always restart, so I am totally confused.
Oleg.
On 03/05, Jeff Layton wrote:
>
> Anyone up for working out how to handle a freeze event on a process
> that already has a pending signal, while it's being ptraced?
Could you explain the problem?
Oleg.
On Wed, 6 Mar 2013 07:59:01 -0800
Mandeep Singh Baines <[email protected]> wrote:
> On Wed, Mar 6, 2013 at 4:06 AM, Jeff Layton <[email protected]> wrote:
> > On Wed, 6 Mar 2013 10:09:14 +0100
> > Ingo Molnar <[email protected]> wrote:
> >
> >>
> >> * Mandeep Singh Baines <[email protected]> wrote:
> >>
> >> > On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo <[email protected]> wrote:
> >> > > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> >> > >> If it's really just a 2-line patch to try_to_freeze(), could it just be
> >> > >> carried out-of-tree by people that are specifically working on tracking
> >> > >> down these problems?
> >> > >>
> >> > >> But I don't have strong feelings about it--as long as it doesn't result
> >> > >> in the same known issues getting reported again and again....
> >> > >
> >> > > Agreed, I don't think a Kconfig option is justified for this. If this
> >> > > is really important, annotate broken paths so that it doesn't trigger
> >> > > spuriously; otherwise, please just remove it.
> >> > >
> >> >
> >> > Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
> >> >
> >> > Maybe, add a new lockdep API:
> >> >
> >> > lockdep_set_held_during_freeze(lock);
> >> >
> >> > Then when we do the check, ignore any locks that set this bit.
> >> >
> >> > Ingo, does this seem like a reasonable design to you?
> >>
> >> Am I reading the discussion correctly that the new warnings show REAL potential
> >> deadlock scenarios, which can hit real users and can lock their box up in entirely
> >> real usage scenarios?
> >>
> >> If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_ bugs.
> >> We typically use them to teach lockdep about things it does not know about.
> >>
> >> How about fixing the deadlocks instead?
> >>
> >
> > I do see how the freezer might fail to suspend certain tasks, but I
> > don't see the deadlock scenario here in the NFS/RPC case. Can someone
> > outline a situation where this might end up deadlocking? If not, then
> > I'd be inclined to say that while this may be a problem, the warning is
> > excessive...
> >
>
> In general, holding a lock and freezing can cause a deadlock if:
>
> 1) you froze via the cgroup_freezer subsystem and a task in another
> cgroup tried to acquire the same lock
> 2) the lock was needed later is suspend/hibernate. For example, if the
> lock was needed in dpm_suspend by one of the device callbacks. For
> hibernate, you also need to worry about any locks that need to be
> acquired in order to write to the swap device.
> 3) another freezing task blocked on this lock and held other locks
> needed later in suspend. If that task were skipped by the freezer, you
> would deadlock
>
> You will block/prevent suspend if:
>
> 4) another freezing task blocked on this lock and was unable to freeze
>
> I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
> cgroup freezer. Case 4) while not causing a deadlock could prevent
> your laptop/phone from sleeping and end up burning all your battery.
> If suspend is initiated via lid close you won't even know about the
> failure.
>
We're aware of #4. That was the intent of adding try_to_freeze() into
this codepath in the first place. It's not a great solution for obvious
reasons, but we don't have another at the moment.
For #1, I'm not sure what to do. I'm not that familiar with cgroups or
how the freezer works.
The bottom line is that we have a choice -- we can either rip out this
new lockdep warning, or rip out the code that causes it to fire.
If we rip out the warning we may miss some legit cases where we might
possibly have hit a deadlock.
If we rip out the code that causes it to fire, then we exacerbate the
#4 problem above. That will effectively make it so that you can't
suspend the host whenever NFS is doing anything moderately active.
Ripping out the warning seems like the best course of action in the
near term, but it's not my call...
--
Jeff Layton <[email protected]>
On Wed, 2013-03-06 at 13:23 -0500, Jeff Layton wrote:
> On Wed, 6 Mar 2013 07:59:01 -0800
> Mandeep Singh Baines <[email protected]> wrote:
> > In general, holding a lock and freezing can cause a deadlock if:
> >
> > 1) you froze via the cgroup_freezer subsystem and a task in another
> > cgroup tried to acquire the same lock
> > 2) the lock was needed later is suspend/hibernate. For example, if the
> > lock was needed in dpm_suspend by one of the device callbacks. For
> > hibernate, you also need to worry about any locks that need to be
> > acquired in order to write to the swap device.
> > 3) another freezing task blocked on this lock and held other locks
> > needed later in suspend. If that task were skipped by the freezer, you
> > would deadlock
> >
> > You will block/prevent suspend if:
> >
> > 4) another freezing task blocked on this lock and was unable to freeze
> >
> > I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
> > cgroup freezer. Case 4) while not causing a deadlock could prevent
> > your laptop/phone from sleeping and end up burning all your battery.
> > If suspend is initiated via lid close you won't even know about the
> > failure.
> >
>
> We're aware of #4. That was the intent of adding try_to_freeze() into
> this codepath in the first place. It's not a great solution for obvious
> reasons, but we don't have another at the moment.
>
> For #1 I'm not sure what to do. I'm that familiar with cgroups or how
> the freezer works.
>
> The bottom line is that we have a choice -- we can either rip out this
> new lockdep warning, or rip out the code that causes it to fire.
>
> If we rip out the warning we may miss some legit cases where we might
> possibly have hit a deadlock.
>
> If we rip out the code that causes it to fire, then we exacerbate the
> #4 problem above. That will effectively make it so that you can't
> suspend the host whenever NFS is doing anything moderately active.
#4 is probably the only case where we might want to freeze.
Unless we're in a situation where the network is going down, we can
almost always make progress with completing the RPC call and finishing
the system call. So in the case of cgroup_freezer, we only care if the
freezing cgroup also owns the network device (or net namespace) that NFS
is using to talk to the server.
As I said, the alternative is to notify NFS that the device is going
down, and to give it a chance to quiesce itself before that happens.
This is also the only way to ensure that processes which own locks on
the server (e.g. posix file locks) have a chance to release them before
being suspended.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, 6 Mar 2013 19:17:35 +0100
Oleg Nesterov <[email protected]> wrote:
> On 03/05, Jeff Layton wrote:
> >
> > Anyone up for working out how to handle a freeze event on a process
> > that already has a pending signal, while it's being ptraced?
>
> Could you explain the problem?
>
Not very well. I was just saying that the signal/ptrace stuff is very
complicated already. Now we're looking at mixing in the freezer too,
which further increases the complexity.
Though when I said that, it was before Tejun mentioned hooking this up
to ptrace. I'll confess that I don't fully understand what he's
proposing either though...
--
Jeff Layton <[email protected]>
Hello,
On Wed, Mar 06, 2013 at 01:40:04PM -0500, Jeff Layton wrote:
> Though when I said that, it was before Tejun mentioned hooking this up
> to ptrace. I'll confess that I don't fully understand what he's
> proposing either though...
Oh, they're all just pretty closely related. All signal and ptrace
traps run off get_signal_to_deliver(). It is still a bit hairy but in
much better shape after cleanups which accompanied PTRACE_SEIZE and I
don't really think it'll be too difficult to extend it so that
freezing is implemented as a type of jobctl trap.
Thanks.
--
tejun
Hello, Oleg.
On Wed, Mar 06, 2013 at 07:16:08PM +0100, Oleg Nesterov wrote:
> And how SIGFREEZE can help? If we want to interrupt the sleeps in NFS/RPC
> layer we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
> and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).
Oh yeah, we don't need another signal. We just need sigpending state
and a wakeup. I wasn't really going into details. The important
point is that for code paths outside signal/ptrace, freezing could
look and behave about the same as signal delivery.
> But if we can do this, then it should be possible so simply make these
> sleeps TASK_INTERRUPTIBLE ? But it seems that we can't just because we
> can't always restart, so I am totally confused.
Of course, switching to another freezing mechanism doesn't make issues
like this go away by itself, but it would at least allow handling such
issues in an easier-to-grasp way rather than dealing with this giant
pseudo lock, and it would let code paths outside jobctl deal with a
single issue - signal delivery - rather than having to deal with
freezing separately. Also, while not completely identical, the frozen
state would at least be an extension of the jobctl trap, and it would
be possible to allow things like killing frozen tasks or even ptracing
them in a well-defined manner for cgroup_freezer use cases. As it
currently stands, there is no well-defined abstraction for the frozen
state which we can expose in any way, which sucks.
Thanks.
--
tejun
On Wed, Mar 6, 2013 at 10:37 AM, Myklebust, Trond
<[email protected]> wrote:
> On Wed, 2013-03-06 at 13:23 -0500, Jeff Layton wrote:
>> On Wed, 6 Mar 2013 07:59:01 -0800
>> Mandeep Singh Baines <[email protected]> wrote:
>> > In general, holding a lock and freezing can cause a deadlock if:
>> >
>> > 1) you froze via the cgroup_freezer subsystem and a task in another
>> > cgroup tried to acquire the same lock
>> > 2) the lock was needed later is suspend/hibernate. For example, if the
>> > lock was needed in dpm_suspend by one of the device callbacks. For
>> > hibernate, you also need to worry about any locks that need to be
>> > acquired in order to write to the swap device.
>> > 3) another freezing task blocked on this lock and held other locks
>> > needed later in suspend. If that task were skipped by the freezer, you
>> > would deadlock
>> >
>> > You will block/prevent suspend if:
>> >
>> > 4) another freezing task blocked on this lock and was unable to freeze
>> >
>> > I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
>> > cgroup freezer. Case 4) while not causing a deadlock could prevent
>> > your laptop/phone from sleeping and end up burning all your battery.
>> > If suspend is initiated via lid close you won't even know about the
>> > failure.
>> >
>>
>> We're aware of #4. That was the intent of adding try_to_freeze() into
>> this codepath in the first place. It's not a great solution for obvious
>> reasons, but we don't have another at the moment.
>>
>> For #1 I'm not sure what to do. I'm that familiar with cgroups or how
>> the freezer works.
>>
>> The bottom line is that we have a choice -- we can either rip out this
>> new lockdep warning, or rip out the code that causes it to fire.
>>
>> If we rip out the warning we may miss some legit cases where we might
>> possibly have hit a deadlock.
>>
>> If we rip out the code that causes it to fire, then we exacerbate the
>> #4 problem above. That will effectively make it so that you can't
>> suspend the host whenever NFS is doing anything moderately active.
>
> #4 is probably the only case where we might want to freeze.
>
> Unless we're in a situation where the network is going down, we can
> usually always make progress with completing the RPC call and finishing
> the system call. So in the case of cgroup_freezer, we only care if the
> freezing cgroup also owns the network device (or net namespace) that NFS
> is using to talk to the server.
>
> As I said, the alternative is to notify NFS that the device is going
> down, and to give it a chance to quiesce itself before that happens.
> This is also the only way to ensure that processes which own locks on
> the server (e.g. posix file locks) have a chance to release them before
> being suspended.
>
So if I'm understanding correctly, the locks are held for a bounded
time, so under normal circumstances you don't really need to freeze
here. But if the task that is going to wake you were to freeze, then
this code path could block suspend if PF_FREEZER_SKIP were not set.
For this particular case (not in general), instead of calling
try_to_freeze(), what if you only cleared PF_FREEZER_SKIP? For case #1,
you wouldn't block suspend if you're stuck waiting. If you received the
event while processes were freezing, you'd just try_to_freeze()
somewhere else, probably on the way back up from the system call. For
case #4, you'd eventually complete the RPC and then freeze somewhere
else. If you're stuck waiting for the bit because the process that was
going to wake you also got frozen, then the container would be stuck in
the THAWING state. That's not catastrophic, and something the
administrator could work around by sequencing things right.
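Concretely, against the 3.9-era rpc_wait_bit_killable() the sketch
would be something like the below (the open-coded flag clear is only
illustrative):

static int rpc_wait_bit_killable(void *word)
{
        if (fatal_signal_pending(current))
                return -ERESTARTSYS;
        freezer_do_not_count();         /* sets PF_FREEZER_SKIP */
        schedule();
        /*
         * Drop PF_FREEZER_SKIP but don't call try_to_freeze() here, so
         * we never freeze with nfs_clid_init_mutex (or other locks)
         * held; the task can freeze later, e.g. on its way back to
         * userspace.
         */
        current->flags &= ~PF_FREEZER_SKIP;
        return 0;
}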
Does this sound like a reasonable workaround?
Regards,
Mandeep
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
On Wed, Mar 6, 2013 at 10:53 AM, Tejun Heo <[email protected]> wrote:
> Hello, Oleg.
>
> On Wed, Mar 06, 2013 at 07:16:08PM +0100, Oleg Nesterov wrote:
>> And how SIGFREEZE can help? If we want to interrupt the sleeps in NFS/RPC
>> layer we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
>> and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).
>
> Oh yeah, we don't need another signal. We just need sigpending state
> and a wakeup. I wasn't really going into details. The important
> point is that for code paths outside signal/ptrace, freezing could
> look and behave about the same as signal delivery.
Don't we already do that? The whole "try_to_freeze()" in
get_signal_to_deliver() is about exactly this. See
fake_signal_wake_up().
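For reference, fake_signal_wake_up() is roughly the following
(paraphrasing kernel/freezer.c of that era):

static void fake_signal_wake_up(struct task_struct *p)
{
        unsigned long flags;

        if (lock_task_sighand(p, &flags)) {
                signal_wake_up(p, 0);   /* set TIF_SIGPENDING and kick the task */
                unlock_task_sighand(p, &flags);
        }
}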
You still have kernel threads (that don't do signals) to worry about,
so it doesn't make things go away. And you still have issues with
latency of disk wait, which is, I think, the reason for that
"freezable_schedule()" in the NFS code to begin with.
Linus
Hello, Linus.
On Wed, Mar 06, 2013 at 01:00:02PM -0800, Linus Torvalds wrote:
> > Oh yeah, we don't need another signal. We just need sigpending state
> > and a wakeup. I wasn't really going into details. The important
> > point is that for code paths outside signal/ptrace, freezing could
> > look and behave about the same as signal delivery.
>
> Don't we already do that? The whole "try_to_freeze()" in
> get_signal_to_deliver() is about exactly this. See
> fake_signal_wake_up().
Yeap, that was what I had in mind too. Maybe we'll need to modify it
slightly but we already have most of the basic stuff.
> You still have kernel threads (that don't do signals) to worry about,
> so it doesn't make things go away. And you still have issues with
> latency of disk wait, which is, I think, the reason for that
> "freezable_schedule()" in the NFS code to begin with.
I haven't thought about it for quite some time so things are hazy, but
here's what I can recall now.
With syscall paths out of the way, the surface is reduced a lot.
Another part is converting most freezable kthread users to freezable
workqueues, which provide natural resource boundaries (the duration of
work item execution). kthreads are already difficult to get completely
right synchronization-wise, and a significant number of freezable +
should_stop users were subtly broken the last time I went over the
freezer users. I think we would be much better off converting most of
them over to freezable workqueues, which are easier to get right and
likely to be less expensive. Freezing happens at work item boundaries,
which in most cases could be made to coincide with the original freezer
check points.
There could be kthreads which can't be converted to a workqueue for
whatever reason (there shouldn't be many at this point), but most
freezer usages in kthreads are pretty simple. It's usually a single
freezer check point, or a couple of them, in the main loop. While we
may still need special handling for them, I don't think they're likely
to have implications for issues like this.
We probably would want to handle restart for freezable kthreads
calling syscalls. Haven't thought about this one too much yet. Maybe
freezable kthreads doing syscalls just need to be ready for
-ERESTARTSYS?
I'm not sure I follow the disk wait latency part. Are you saying that
switching to a jobctl-trap-based freezer implementation wouldn't help
them? If so, right, it doesn't by itself. It just changes the
infrastructure used for freezing and can't make the underlying
synchronization issues disappear, but at least it becomes the same
problem as being responsive to SIGKILL, rather than a completely
separate problem.
Thanks.
--
tejun
On Wed, Mar 6, 2013 at 1:24 PM, Tejun Heo <[email protected]> wrote:
>
> With syscall paths out of the way, the surface is reduced a lot.
So the issue is syscalls that don't react to signals, and that can
potentially wait a long time.
Like NFS with a network hiccup. Which is not at all unlikely. Think
wireless network, somebody trying to get on a network share, things
not working, and closing the damn lid because you give up.
So I do agree that we probably have *too* many of the stupid "let's
check if we can freeze", and I suspect that the NFS code should get
rid of the "freezable_schedule()" that is causing this warning
(because I also agree that you should *not* freeze while holding
locks, because it really can cause deadlocks), but I do suspect that
network filesystems do need to have a few places where they check for
freezing on their own... Exactly because freezing isn't *quite* like a
signal.
Linus
On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> So I do agree that we probably have *too* many of the stupid "let's
> check if we can freeze", and I suspect that the NFS code should get
> rid of the "freezable_schedule()" that is causing this warning
> (because I also agree that you should *not* freeze while holding
> locks, because it really can cause deadlocks), but I do suspect that
> network filesystems do need to have a few places where they check for
> freezing on their own... Exactly because freezing isn't *quite* like a
> signal.
Well, I don't really know much about nfs so I can't really tell, but
for most other cases, dealing with freezing like a signal should work
fine from what I've seen although I can't be sure before actually
trying. Trond, Bruce, can you guys please chime in?
Thanks.
--
tejun
On Wed, Mar 06, 2013 at 01:36:36PM -0800, Tejun Heo wrote:
> On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > So I do agree that we probably have *too* many of the stupid "let's
> > check if we can freeze", and I suspect that the NFS code should get
> > rid of the "freezable_schedule()" that is causing this warning
> > (because I also agree that you should *not* freeze while holding
> > locks, because it really can cause deadlocks), but I do suspect that
> > network filesystems do need to have a few places where they check for
> > freezing on their own... Exactly because freezing isn't *quite* like a
> > signal.
>
> Well, I don't really know much about nfs so I can't really tell, but
> for most other cases, dealing with freezing like a signal should work
> fine from what I've seen although I can't be sure before actually
> trying. Trond, Bruce, can you guys please chime in?
So, I think the question here would be: in nfs, how many of the
current freezer check points would be difficult to convert to the
signal handling model, after excluding the ones which are performed
while holding locks that we need to get rid of anyway?
Thanks.
--
tejun
On Wed, 6 Mar 2013 13:36:36 -0800
Tejun Heo <[email protected]> wrote:
> On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > So I do agree that we probably have *too* many of the stupid "let's
> > check if we can freeze", and I suspect that the NFS code should get
> > rid of the "freezable_schedule()" that is causing this warning
> > (because I also agree that you should *not* freeze while holding
> > locks, because it really can cause deadlocks), but I do suspect that
> > network filesystems do need to have a few places where they check for
> > freezing on their own... Exactly because freezing isn't *quite* like a
> > signal.
>
> Well, I don't really know much about nfs so I can't really tell, but
> for most other cases, dealing with freezing like a signal should work
> fine from what I've seen although I can't be sure before actually
> trying. Trond, Bruce, can you guys please chime in?
>
> Thanks.
>
(hopefully this isn't tl;dr)
It's not quite that simple...
The problem (as Trond already mentioned) is non-idempotent operations.
You can't just restart certain operations from scratch once you reach a
certain point. Here's an example:
Suppose I call unlink("somefile"); on an NFS mount. We take all of the
VFS locks, go down into the NFS layer. That marshals up the UNLINK
call, sends it off to the server, and waits for the reply. While we're
waiting, a freeze event comes in and we start returning from the
kernel with our new -EFREEZE return code that works sort of like
-ERESTARTSYS. Meanwhile, the server is processing the UNLINK call and
removes the file. A little while later we wake up the machine and it
goes to try and pick up where it left off.
What do we do now?
Suppose we pretend we never sent the call in the first place, marshal
up a new RPC and send it again. This is problematic -- the server will
probably send back the equivalent of ENOENT. How do we know whether the
file never existed in the first place, or whether the server processed
the original call and removed the file then?
Do we instead try to keep track of whether the RPC has been sent and
just wait for the reply on the original call? That's tricky too -- it
means adding an extra codepath to check for these sorts of restarts in
a bunch of different ops vectors into the filesystem. We'd also have to
keep track of this state somehow (I guess by hanging something off the
task_struct).
Note too that the above is the simple case. We're dropping the parent's
i_mutex during the freeze. Suppose that when we restart the call, the
parent directory has changed in such a way that the lookup we did for
the original RPC is no longer valid?
I think Trond may be on the right track. We probably need some
mechanism to quiesce the filesystem ahead of any sort of freezer
event. That quiesce could simply wait for any in-flight RPCs to come
back, and not allow any new ones to go out. On syscalls where the RPC
didn't go out, we'd just return -EFREEZE or whatever and let the upper
layers restart the call after waking back up. Writeback would be
tricky, but that can be handled too.
The catch here is that it's quite possible that when we need to
quiesce, we've lost communications with the server. We don't want to
hold up the freezer at that point, so the wait for replies has to be
bounded in time somehow. If that times out, we probably just have to
return all calls with our new -EFREEZE return and hope for the best
when the machine wakes back up.
--
Jeff Layton <[email protected]>
On 06-03-13 01:30, Mandeep Singh Baines wrote:
> This check is turning up a lot of code paths which need to be
> fixed so while those paths are fixed, let's make this check
> optional so that folks can still use lockdep.
I think the config option should be inverted, to make it clearer that
by disabling a check that should be on by default you are choosing not
to report some real bugs.
> CC: Tejun Heo <[email protected]>
> CC: Jeff Layton <[email protected]>
> CC: "Myklebust, Trond" <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Ming Lei <[email protected]>
> CC: "Rafael J. Wysocki" <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Ingo Molnar <[email protected]>
> ---
> include/linux/freezer.h | 2 ++
> lib/Kconfig.debug | 12 ++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/linux/freezer.h b/include/linux/freezer.h
> index 043a5cf..03bdc54 100644
> --- a/include/linux/freezer.h
> +++ b/include/linux/freezer.h
> @@ -49,8 +49,10 @@ extern void thaw_kernel_threads(void);
>
> static inline bool try_to_freeze(void)
> {
> +#ifdef CONFIG_DEBUG_LOCK_HELD_FREEZING
> if (!(current->flags & PF_NOFREEZE))
> debug_check_no_locks_held();
> +#endif
> might_sleep();
> if (likely(!freezing(current)))
> return false;
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 28be08c..bddda5f 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -547,6 +547,18 @@ config DEBUG_MUTEXES
> This feature allows mutex semantics violations to be detected and
> reported.
>
> +config DEBUG_LOCK_HELD_FREEZING
> + bool "Lock debugging: detect when locks are held during freeze"
> + depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
> + select DEBUG_SPINLOCK
> + select DEBUG_MUTEXES
> + select LOCKDEP
> + help
> + This feature will check whether any lock is incorrectly held
> + while freezing. If a task freezes with a lock held it will
> + block any other task that is waiting on that lock from freezing.
> + In the case of cgroup_freezer, this can cause a deadlock.
> +
> config DEBUG_LOCK_ALLOC
> bool "Lock debugging: detect incorrect freeing of live locks"
> depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
Hello, Jeff.
On Thu, Mar 07, 2013 at 06:41:40AM -0500, Jeff Layton wrote:
> Suppose I call unlink("somefile"); on an NFS mount. We take all of the
> VFS locks, go down into the NFS layer. That marshals up the UNLINK
> call, sends it off to the server, and waits for the reply. While we're
> waiting, a freeze event comes in and we start returning from the
> kernel with our new -EFREEZE return code that works sort of like
> -ERESTARTSYS. Meanwhile, the server is processing the UNLINK call and
> removes the file. A little while later we wake up the machine and it
> goes to try and pick up where it left off.
But you can't freeze in such cases regardless of the freezing
mechanism, right? The current code which allows freezing while such
operations are in progress is broken, as it can lead to freezer
deadlocks. Those should go away no matter how we implement the freezer,
so the question is not whether we can move all the existing freezing
points to the signal mechanism, but rather how many would be difficult
to convert after removing the deadlock-prone ones. I'm fully
speculating, but my suspicion is: not too many, if you remove (or
somehow update) the ones which are done with locks held.
> The catch here is that it's quite possible that when we need to quiesce
> that we've lost communications with the server. We don't want to hold
> up the freezer at that point so the wait for replies has to be bounded
> in time somehow. If that times out, we probably just have to return all
> calls with our new -EFREEZE return and hope for the best when the
> machine wakes back up.
Sure, a separate prep step may be helpful, but for a user nfs-mounting
stuff on a laptop I'm not sure how reliable that can be made. People
move around with laptops, wifi can be iffy, and the lid can be shut at
any moment. I don't think it's possible for nfs to be laptop-friendly
while staying completely correct. Designing such a network filesystem
is probably possible with transactions and whatnot, but AFAIU nfs isn't
designed that way. If such a use case is something nfs wants to
support, I think it should just make do with some middle ground - i.e.
implement a mount switch which says "retry operations across network /
power problems" and explain the implications.
Thanks.
--
tejun
On Thu, Mar 7, 2013 at 3:41 AM, Jeff Layton <[email protected]> wrote:
>
> I think Trond may be on the right track. We probably need some
> mechanism to quiesce the filesystem ahead of any sort of freezer
> event.
No, guys. That cannot work. It's a completely moronic idea. Let me
count the ways:
(a) it's just another form of saying "lock". But since other things
are (by definition) going on when it happens, it will just cause
deadlocks.
(b) the freeze event might not even be system-global. So *some*
processes (a cgroup) might freeze, others would not. You can't shut
off the filesystem just because some processes might freeze.
(c) it just moves the same issue somewhere else. If you have some
operation that must be done under the lock, then such an operation
must be completed before you've quiesced the filesystem, which is your
whole point of that "quiesce" event. BUT THAT'S THE EXACT SAME ISSUE
AS NOT ALLOWING THE FREEZE TO HAPPEN DURING THAT TIME.
In other words, that suggestion not only introduces new problems (a),
it's fundamentally broken anyway (b) *AND* it doesn't even solve
anything, it just moves it around.
The solution is damn simple: if you're in some kind of "atomic
region", then you cannot freeze. Seriously. SO DON'T CALL
"freezable_schedule()", FOR CHRISSAKE! You clearly aren't freezable!
Which is exactly what the new lockdep warning was all about. Don't try
to move the problem around, when it's quite clear where the problem
is. If you need to do something uninterruptible, you do not say "oh,
I'm freezable". Because freezing is by definition an interruption.
Seriously, it's that simple.
Linus
On Thu, 2013-03-07 at 07:55 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 3:41 AM, Jeff Layton <[email protected]> wrote:
> >
> > I think Trond may be on the right track. We probably need some
> > mechanism to quiesce the filesystem ahead of any sort of freezer
> > event.
>
> No, guys. That cannot work. It's a completely moronic idea. Let me
> count the way:
>
> (a) it's just another form of saying "lock". But since other things
> are (by definition) going on when it happens, it will just cause
> deadlocks.
>
> (b) the freeze event might not even be system-global. So *some*
> processes (a cgroup) might freeze, others would not. You can't shut
> off the filesystem just because some processes migth freeze.
That's the whole bloody problem in a nutshell.
We only want to freeze the filesystem when the network goes down, and in
that case we want time to clean up first. That's why it needs to be
initiated by something like NetworkManager _before_ the network is shut
down.
> (c) it just moves the same issue somewhere else. If you have some
> operation that must be done under the lock, then such an operation
> must be completed before you've quiesced the filesystem, which is your
> whole point of that "quiesce" event. BUT THAT'S THE EXACT SAME ISSUE
> AS NOT ALLOWING THE FREEZE TO HAPPEN DURING THAT TIME.
>
> In other words, that suggestion not only introduces new problems (a),
> it's fundamentally broken anyway (b) *AND* it doesn't even solve
> anything, it just moves it around.
>
> The solution is damn simple: if you're in some kind of "atomic
> region", then you cannot freeze. Seriously. SO DON'T CALL
> "freezable_schedule()", FOR CHRISSAKE! You clearly aren't freezable!
>
> Which is exactly what the new lockdep warning was all about. Don't try
> to move the problem around, when it's quite clear where the problem
> is. If you need to do something uninterruptible, you do not say "oh,
> I'm freezable". Because freezing is by definition an interruption.
> Seriously, it's that simple.
It _shouldn't_ be an interruption unless the filesystem can't make
progress.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
Hello, Linus.
On Thu, Mar 07, 2013 at 07:55:39AM -0800, Linus Torvalds wrote:
> In other words, that suggestion not only introduces new problems (a),
> it's fundamentally broken anyway (b) *AND* it doesn't even solve
> anything, it just moves it around.
I don't think it's gonna solve the problems we're talking about but
the "moving around" part could be useful nonetheless - e.g. we don't
want to shut down the network before nfs while suspending.
Thanks.
--
tejun
On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
<[email protected]> wrote:
>
> It _shouldn't_ be an interruption unless the filesystem can't make
> progress.
So how can we tell? Calling "freezable_schedule()" if you're not ready
to be frozen is not good. And nobody but the NFS code can know.
You might want to introduce some counter that counts the number of
outstanding non-interruptible events, and only call the "freezable"
version if that counter is zero.
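A sketch of that counter idea (everything below is made up, not
existing NFS code, other than freezable_schedule() itself):

static atomic_t nfs_uninterruptible_ops = ATOMIC_INIT(0);

static void nfs_sleep_on_rpc(void)
{
        /* caller has already set the sleep state, as in rpc_wait_bit_killable() */
        if (atomic_read(&nfs_uninterruptible_ops) == 0)
                freezable_schedule();   /* nothing non-restartable in flight */
        else
                schedule();             /* must not be frozen mid-operation */
}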
A better alternative might be to *never* call the freezable version.
Because those freezable_*() things are really quite disgusting, and
are wrong - they don't actually freeze the process, they say "I don't
care if you freeze me while I sleep", and you might actually wake up
*while* the system is being frozen. I think the whole concept is
broken. Rafael - comments? The function really is crap, regardless of
any unrelated NFS problems.
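For reference, freezable_schedule() at the time was roughly this (from
include/linux/freezer.h, paraphrased):

#define freezable_schedule()                                            \
({                                                                      \
        freezer_do_not_count();         /* set PF_FREEZER_SKIP */      \
        schedule();                                                     \
        freezer_count();                /* clear it, then try_to_freeze() */ \
})

i.e. the sleeper isn't frozen at all; the freezer just skips it while
it sleeps, and the actual freeze only happens after the wakeup.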
So what NFS could do instead is actually check the "do I need to
freeze" flag, and if you need to freeze you consider it an abort - and
do *not* try to continue. Just freeze, and then act as if the machine
got rebooted as far as NFS was concerned. That should work anyway, no?
That does sound a lot more complex, though.
Linus
On Thu, 2013-03-07 at 08:25 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
> <[email protected]> wrote:
> >
> > It _shouldn't_ be an interruption unless the filesystem can't make
> > progress.
>
> So how can we tell? Calling "freezable_schedule()" if you're not ready
> to be frozen is not good. And nobody but the NFS code can know.
>
> You might want to introduce some counter that counts number of
> outstanding non-interruptible events, and only call the "freezable"
> version if that counter is zero.
>
> A better alternative might be to *never* call the freezable version.
> Because those freezable_*() things are really quite disgusting, and
> are wrong - they don't actually freeze the process, they say "I don't
> care if you freeze me while I sleep", and you might actually wake up
> *while* the system is being frozen. I think the whole concept is
> broken. Rafaei - comments? The function really is crap, regardless of
> any unrelated NFS problems.
>
> So what NFS could do instead is actually check the "do I need to
> freeze" flag, and if you need to freeze you consider it an abort - and
> do *not* try to continue. Just freeze, and then act as if the machine
> got rebooted as far as NFS was concerned. That should work anyway, no?
>
> That does sound a lot more complex, though.
The problem there is that we get into the whole 'hard' vs 'soft' mount
problem. We're supposed to guarantee data integrity for 'hard' mounts,
so no funny business is allowed. OTOH, 'soft' mounts time out and return
EIO to the application anyway, and so shouldn't be a problem.
Perhaps we could add a '-oslushy' mount option :-) that guarantees data
integrity for all situations _except_ ENETDOWN/ENETUNREACH?
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
<[email protected]> wrote:
>
> The problem there is that we get into the whole 'hard' vs 'soft' mount
> problem. We're supposed to guarantee data integrity for 'hard' mounts,
> so no funny business is allowed. OTOH, 'soft' mounts time out and return
> EIO to the application anyway, and so shouldn't be a problem.
>
> Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> integrity for all situations _except_ ENETDOWN/ENETUNREACH?
I do think we are probably over-analyzing this. It's not like people
who want freezing to work usually use flaky NFS. There's really two
main groups:
- the "freezer as a snapshot mechanism" that might use NFS because
they are in a server environment.
- the "freeezer for suspend/resume on a laptop"
The first one does use NFS, and cares about it, and probably would
prefer the freeze event to take longer and finish for all ongoing IO
operations. End result: just ditch the "freezable_schedule()"
entirely.
The second one is unlikely to really use NFS anyway. End result:
ditching the freezable_schedule() is probably perfectly fine, even if
it would cause suspend failures if the network is being troublesome.
So for now, why not just replace freezable_schedule() with plain
schedule() in the NFS code, and ignore it until somebody actually
complains about it, and then aim to try to do something more targeted
for that particular complaint?
Linus
On Thu, 2013-03-07 at 09:03 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
> <[email protected]> wrote:
> >
> > The problem there is that we get into the whole 'hard' vs 'soft' mount
> > problem. We're supposed to guarantee data integrity for 'hard' mounts,
> > so no funny business is allowed. OTOH, 'soft' mounts time out and return
> > EIO to the application anyway, and so shouldn't be a problem.
> >
> > Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> > integrity for all situations _except_ ENETDOWN/ENETUNREACH?
>
> I do think we are probably over-analyzing this. It's not like people
> who want freezing to work usually use flaky NFS. There's really two
> main groups:
>
> - the "freezer as a snapshot mechanism" that might use NFS because
> they are in a server environment.
>
> - the "freeezer for suspend/resume on a laptop"
>
> The first one does use NFS, and cares about it, and probably would
> prefer the freeze event to take longer and finish for all ongoing IO
> operations. End result: just ditch the "freezable_schedule()"
> entirely.
>
> The second one is unlikely to really use NFS anyway. End result:
> ditching the freezable_schedule() is probably perfectly fine, even if
> it would cause suspend failures if the network is being troublesome.
>
> So for now, why not just replace freezable_schedule() with plain
> schedule() in the NFS code, and ignore it until somebody actually
> complains about it, and then aim to try to do something more targeted
> for that particular complaint?
We _have_ had complaints about the laptop suspension problem; that was
why Jeff introduced freezable_schedule() in the first place. We've never
had complaints about any problems involving cgroup_freeze. This is why
our focus tends to be on the former, and why I'm more worried about
laptop suspend regressions for any short term fixes.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Thursday, March 07, 2013 08:25:10 AM Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
> <[email protected]> wrote:
> >
> > It _shouldn't_ be an interruption unless the filesystem can't make
> > progress.
>
> So how can we tell? Calling "freezable_schedule()" if you're not ready
> to be frozen is not good. And nobody but the NFS code can know.
>
> You might want to introduce some counter that counts number of
> outstanding non-interruptible events, and only call the "freezable"
> version if that counter is zero.
>
> A better alternative might be to *never* call the freezable version.
> Because those freezable_*() things are really quite disgusting, and
> are wrong - they don't actually freeze the process, they say "I don't
> care if you freeze me while I sleep", and you might actually wake up
> *while* the system is being frozen. I think the whole concept is
> broken. Rafaei - comments?
Well, the only comment I have is that the source of these functions was
commit d310310c (Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to
block the freezer) and they were added because people were complaining that
they couldn't suspend while their NFS servers were not responding, IIRC.
I agree that they are not really nice.
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
On Thu, 7 Mar 2013 17:16:12 +0000
"Myklebust, Trond" <[email protected]> wrote:
> On Thu, 2013-03-07 at 09:03 -0800, Linus Torvalds wrote:
> > On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
> > <[email protected]> wrote:
> > >
> > > The problem there is that we get into the whole 'hard' vs 'soft' mount
> > > problem. We're supposed to guarantee data integrity for 'hard' mounts,
> > > so no funny business is allowed. OTOH, 'soft' mounts time out and return
> > > EIO to the application anyway, and so shouldn't be a problem.
> > >
> > > Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> > > integrity for all situations _except_ ENETDOWN/ENETUNREACH?
> >
> > I do think we are probably over-analyzing this. It's not like people
> > who want freezing to work usually use flaky NFS. There's really two
> > main groups:
> >
> > - the "freezer as a snapshot mechanism" that might use NFS because
> > they are in a server environment.
> >
> > - the "freeezer for suspend/resume on a laptop"
> >
> > The first one does use NFS, and cares about it, and probably would
> > prefer the freeze event to take longer and finish for all ongoing IO
> > operations. End result: just ditch the "freezable_schedule()"
> > entirely.
> >
> > The second one is unlikely to really use NFS anyway. End result:
> > ditching the freezable_schedule() is probably perfectly fine, even if
> > it would cause suspend failures if the network is being troublesome.
> >
> > So for now, why not just replace freezable_schedule() with plain
> > schedule() in the NFS code, and ignore it until somebody actually
> > complains about it, and then aim to try to do something more targeted
> > for that particular complaint?
>
> We _have_ had complaints about the laptop suspension problem; that was
> why Jeff introduced freezable_schedule() in the first place. We've never
> had complaints about any problems involvinf cgroup_freeze. This is why
> our focus tends to be on the former, and why I'm more worried about
> laptop suspend regressions for any short term fixes.
>
Right. I don't know of anyone using the cgroup_freeze stuff. OTOH, I've
seen *many* reports of people who complained because their machine
didn't suspend because of a busy NFS mount.
In fact, the problem is still not fully "solved" with the
freezable_schedule stuff we have so far anyway. Even after this
(admittedly crappy) solution went in, we still had reports of machines
that failed to suspend because other tasks were stuck in
wait_on_page_writeback() for NFS pages.
So, in principle I think we need to abandon the current approach. The
question is what should replace it.
--
Jeff Layton <[email protected]>
* Linus Torvalds <[email protected]> wrote:
> - the "freeezer for suspend/resume on a laptop"
>
> [...]
>
> The second one is unlikely to really use NFS anyway. [...]
<me raises a hand>
Incidentally I use NFS to a file server on my laptop, over wifi, and I close the lid
for the night. It's NFS mounted soft.
If it was mounted hard I wouldn't expect the lid close event to result
in a successful suspend if wifi went down - but with soft I would be
pretty sad in the morning if the battery drained because I closed the
lid after a power outage which took down both the wifi and the laptop
power supply.
Closing the lid while on battery is a pretty strong command from the user.
Arguably this is a special case on several levels, as on desktop
distros it's not at all easy to mount NFS shares (all the point & click
automation goes to Samba shares); I just wanted to mention it.
Thanks,
Ingo
On Wed, 6 Mar 2013 13:40:16 -0800
Tejun Heo <[email protected]> wrote:
> On Wed, Mar 06, 2013 at 01:36:36PM -0800, Tejun Heo wrote:
> > On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > > So I do agree that we probably have *too* many of the stupid "let's
> > > check if we can freeze", and I suspect that the NFS code should get
> > > rid of the "freezable_schedule()" that is causing this warning
> > > (because I also agree that you should *not* freeze while holding
> > > locks, because it really can cause deadlocks), but I do suspect that
> > > network filesystems do need to have a few places where they check for
> > > freezing on their own... Exactly because freezing isn't *quite* like a
> > > signal.
> >
> > Well, I don't really know much about nfs so I can't really tell, but
> > for most other cases, dealing with freezing like a signal should work
> > fine from what I've seen although I can't be sure before actually
> > trying. Trond, Bruce, can you guys please chime in?
>
> So, I think the question here would be, in nfs, how many of the
> current freezer check points would be difficult to conver to signal
> handling model after excluding the ones which are performed while
> holding some locks which we need to get rid of anyway?
>
I think we can do this, but it isn't trivial. Here's what I'd envision,
but there are still many details that would need to be worked out...
Basically what we'd need is a way to go into TASK_KILLABLE sleep, but
still allow the freezer to wake these processes up. I guess that likely
means we need a new sleep state (TASK_FREEZABLE?).
We'd also need a way to return an -ERESTARTSYS-like error (-EFREEZING?)
that tells the upper syscall handling layers to freeze the task and
then restart the syscall after waking back up. Maybe we don't need a
new error at all and -ERESTARTSYS is fine here? We also need to
consider the interaction with the audit code here, but that may be OK
with the overhaul that Al merged a few months ago.
Assuming we have those, then we need to fix up the NFS and RPC layers
to use this stuff:
1/ Prior to submitting a new RPC, we'd look and see whether "current"
is being frozen. If it is, then return -EFREEZING immediately without
doing anything.
2/ We might also need to retrofit certain stages in the RPC FSM to
return -EFREEZING too if it's a synchronous RPC and the task is being
frozen.
3/ When a task is waiting for an RPC reply on an async RPC, we'd need
to use this new sleep state. If the process wakes up because something wants
it to freeze, then have it go back to sleep for a short period of time
to try and wait for the reply (up to 15s or so?). If we get the reply,
great -- return to userland and freeze the task there. If the reply
never comes in, give up on it and just return -EFREEZE and hope for the
best.
We might have to make this latter behavior contingent on a new mount
option (maybe "-o slushy" like Trond recommended). The current "hard"
and "soft" semantics don't really fit this situation correctly.
Of course, this is all a lot of work, and not something we can shove
into the kernel for 3.9 at this point. In the meantime, while Mandeep's
warning is correctly pointing out a problem, I think we ought to back
it out until we can fix this properly.
We're already getting a ton of reports on the mailing lists and in the
fedora bug tracker for this warning. Part of the problem is the
verbiage -- "BUG" makes people think "Oops", but this is really just a
warning.
We should also note that this is a problem too in the CIFS code since
it uses a similar mechanism for allowing the kernel to suspend while
waiting on SMB replies.
--
Jeff Layton <[email protected]>
On Wed, 13 Mar 2013, Jeff Layton wrote:
> Of course, this is all a lot of work, and not something we can shove
> into the kernel for 3.9 at this point. In the meantime, while Mandeep's
> warning is correctly pointing out a problem, I think we ought to back
> it out until we can fix this properly.
Revert posted at:
http://marc.info/?l=linux-kernel&m=136468829626184&w=2
Andrew or Linus, care to pick this up for 3.9-rc?
- Paul