Subject: Re: nfs4_do_reclaim lockdep pop in v3.15.0-rc1
From: John Stultz
To: Trond Myklebust
Cc: Jeff Layton, Linus Torvalds, Linux NFS Mailing List,
    Linux Kernel mailing list, Ingo Molnar, Peter Zijlstra
Date: Mon, 2 Jun 2014 20:19:00 -0700
References: <20140602104948.4faf0bc2@tlielax.poochiereds.net>

On Mon, Jun 2, 2014 at 5:59 PM, Trond Myklebust wrote:
> On Mon, Jun 2, 2014 at 6:49 PM, John Stultz wrote:
>> On Mon, Jun 2, 2014 at 3:42 PM, Trond Myklebust wrote:
>>> The so_reclaim_seqcount only exists in order to tell the other threads
>>> that they may need to replay file open or file lock requests that have
>>> raced with state recovery (because those threads got scheduled out
>>> after their RPC calls ran, but before they managed to set up the
>>> tracking of the new state). It is basically an edge condition
>>> killer...
>>
>> Would swapping the acquisition order, so the seqcount is taken
>> before the so_lock at the top of nfs4_reclaim_open_state(), avoid this
>> without having to disable lockdep?
>>
>
> I can change the write seqcount to use raw_write_seqcount(), but that

So this doesn't address my suggestion to change the locking order...
is that solution not feasible?

> doesn't answer the question of why raw_seqcount_begin() is the _only_
> object out there with a "raw_" prefix, that doesn't explicitly disable
> lockdep checking.
>
> What justifies the inconsistency?

Here's the naming discussion...
https://lkml.org/lkml/2014/1/2/404

thanks
-john
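
As a minimal sketch (kernel-style C, not the actual fs/nfs/nfs4state.c code;
struct demo_state_owner and the demo_reclaim* functions are made-up stand-ins,
and the bodies only gesture at the open-state replay the real function does),
the pattern lockdep is complaining about and the reordering suggested above
look roughly like this:

    #include <linux/spinlock.h>
    #include <linux/seqlock.h>

    /* Made-up stand-in for the relevant fields of the NFSv4 state owner. */
    struct demo_state_owner {
            spinlock_t      so_lock;
            seqcount_t      so_reclaim_seqcount;
    };

    /* Pattern as flagged: so_lock first, then the seqcount write section. */
    static void demo_reclaim_current(struct demo_state_owner *sp)
    {
            spin_lock(&sp->so_lock);
            write_seqcount_begin(&sp->so_reclaim_seqcount);
            /*
             * ... drop so_lock, replay opens/locks, retake so_lock ...
             * Retaking so_lock while the seqcount write section is held is
             * what lockdep records as so_reclaim_seqcount -> so_lock, the
             * reverse of the so_lock -> so_reclaim_seqcount order above.
             */
            write_seqcount_end(&sp->so_reclaim_seqcount);
            spin_unlock(&sp->so_lock);
    }

    /* Suggested reordering: begin the seqcount write before taking so_lock. */
    static void demo_reclaim_reordered(struct demo_state_owner *sp)
    {
            write_seqcount_begin(&sp->so_reclaim_seqcount);
            spin_lock(&sp->so_lock);
            /* ... drop and retake so_lock as needed while reclaiming ... */
            spin_unlock(&sp->so_lock);
            write_seqcount_end(&sp->so_reclaim_seqcount);
    }

With the reordered variant, so_reclaim_seqcount is always taken before
so_lock, so dropping and retaking so_lock inside the write section no longer
creates the circular ordering that lockdep reports below.
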
On Mon, Jun 2, 2014 at 5:59 PM, Trond Myklebust wrote:
> On Mon, Jun 2, 2014 at 6:49 PM, John Stultz wrote:
>> On Mon, Jun 2, 2014 at 3:42 PM, Trond Myklebust wrote:
>>> On Mon, Jun 2, 2014 at 6:12 PM, John Stultz wrote:
>>>> On Mon, Jun 2, 2014 at 9:02 AM, Trond Myklebust wrote:
>>>>> On Mon, Jun 2, 2014 at 10:49 AM, Jeff Layton wrote:
>>>>>> I've been working on the patchset to break up the client_mutex in nfsd.
>>>>>> While doing some debugging, I had mounted my kernel git tree with
>>>>>> NFSv4.1, and was running crash on the vmlinux image in it.
>>>>>>
>>>>>> A little while later, I saw the following lockdep inversion pop.
>>>>>> Unfortunately, I couldn't get the whole log, but I think it's enough to
>>>>>> show that there's a potential problem?
>>>>>>
>>>>>> I've not had time to give it a hard look yet, but thought I'd post it
>>>>>> here in the hopes that it might look familiar to someone:
>>>>>>
>>>>>> [ 2581.104687] ======================================================
>>>>>> [ 2581.104716] [ INFO: possible circular locking dependency detected ]
>>>>>> [ 2581.104716] 3.15.0-rc1.jlayton.1+ #2 Tainted: G OE
>>>>>> [ 2581.104716] -------------------------------------------------------
>>>>>> [ 2581.104716] 2001:470:8:d63:/5622 is trying to acquire lock:
>>>>>> [ 2581.104716]  (&(&sp->so_lock)->rlock){+.+...}, at: [] nfs4_do_reclaim+0x5bd/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716] but task is already holding lock:
>>>>>> [ 2581.104716]  (&sp->so_reclaim_seqcount){+.+...}, at: [] nfs4_run_state_manager+0x7ee/0xc00 [nfsv4]
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716] which lock already depends on the new lock.
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716] the existing dependency chain (in reverse order) is:
>>>>>> [ 2581.104716]
>>>>>> -> #1 (&sp->so_reclaim_seqcount){+.+...}:
>>>>>> [ 2581.104716]   [] lock_acquire+0xa2/0x1d0
>>>>>> [ 2581.104716]   [] nfs4_do_reclaim+0x290/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]   [] nfs4_run_state_manager+0x7ee/0xc00 [nfsv4]
>>>>>> [ 2581.104716]   [] kthread+0xff/0x120
>>>>>> [ 2581.104716]   [] ret_from_fork+0x7c/0xb0
>>>>>> [ 2581.104716]
>>>>>> -> #0 (&(&sp->so_lock)->rlock){+.+...}:
>>>>>> [ 2581.104716]   [] __lock_acquire+0x1b8f/0x1ca0
>>>>>> [ 2581.104716]   [] lock_acquire+0xa2/0x1d0
>>>>>> [ 2581.104716]   [] _raw_spin_lock+0x3e/0x80
>>>>>> [ 2581.104716]   [] nfs4_do_reclaim+0x5bd/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]   [] nfs4_run_state_manager+0x7ee/0xc00 [nfsv4]
>>>>>> [ 2581.104716]   [] kthread+0xff/0x120
>>>>>> [ 2581.104716]   [] ret_from_fork+0x7c/0xb0
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716] other info that might help us debug this:
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716]  Possible unsafe locking scenario:
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716]        CPU0                    CPU1
>>>>>> [ 2581.104716]        ----                    ----
>>>>>> [ 2581.104716]   lock(&sp->so_reclaim_seqcount);
>>>>>> [ 2581.104716]                                lock(&(&sp->so_lock)->rlock);
>>>>>> [ 2581.104716]                                lock(&sp->so_reclaim_seqcount);
>>>>>> [ 2581.104716]   lock(&(&sp->so_lock)->rlock);
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716]  *** DEADLOCK ***
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716] 1 lock held by 2001:470:8:d63:/5622:
>>>>>> [ 2581.104716]  #0:  (&sp->so_reclaim_seqcount){+.+...}, at: [] nfs4_run_state_manager+0x7ee/0xc00 [nfsv4]
>>>>>> [ 2581.104716]
>>>>>> [ 2581.104716] stack backtrace:
>>>>>> [ 2581.104716] CPU: 2 PID: 5622 Comm: 2001:470:8:d63: Tainted: G OE 3.15.0-rc1.jlayton.1+ #2
>>>>>> [ 2581.104716] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>>>> [ 2581.104716]  0000000000000000 00000000d29e16c4 ffff8800d8d8fba8 ffffffff817d318e
>>>>>> [ 2581.104716]  ffffffff8262d5e0 ffff8800d8d8fbe8 ffffffff817ce525 ffff8800d8d8fc40
>>>>>> [ 2581.104716]  ffff8800362a8b98 ffff8800362a8b98 0000000000000001 ffff8800362a8000
>>>>>> [ 2581.104716] Call Trace:
>>>>>> [ 2581.104716]  [] dump_stack+0x4d/0x66
>>>>>> [ 2581.104716]  [] print_circular_bug+0x201/0x20f
>>>>>> [ 2581.104716]  [] __lock_acquire+0x1b8f/0x1ca0
>>>>>> [ 2581.104716]  [] ? debug_check_no_obj_freed+0x17e/0x270
>>>>>> [ 2581.104716]  [] lock_acquire+0xa2/0x1d0
>>>>>> [ 2581.104716]  [] ? nfs4_do_reclaim+0x5bd/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]  [] _raw_spin_lock+0x3e/0x80
>>>>>> [ 2581.104716]  [] ? nfs4_do_reclaim+0x5bd/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]  [] nfs4_do_reclaim+0x5bd/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]  [] ? nfs4_run_state_manager+0x7ee/0xc00 [nfsv4]
>>>>>> [ 2581.104716]  [] nfs4_run_state_manager+0x7ee/0xc00 [nfsv4]
>>>>>> [ 2581.104716]  [] ? nfs4_do_reclaim+0x7f0/0x7f0 [nfsv4]
>>>>>> [ 2581.104716]  [] kthread+0xff/0x120
>>>>>> [ 2581.104716]  [] ? insert_kthread_work+0x80/0x80
>>>>>> [ 2581.104716]  [] ret_from_fork+0x7c/0xb0
>>>>>> [ 2581.104716]  [] ? insert_kthread_work+0x80/0x80
>>>>>
>>>>> OK. So now that lockdep has been added to raw_seqcount_begin() (commit
>>>>> 1ca7d67cf5d5a), exactly what are we supposed to use when we DON'T want
>>>>> lockdep to "sanity check" our locking here?
>>>>
>>>> So raw_write_seqcount_* provides the lockdep-disabled methods.
>>>>
>>>>> As far as we're concerned, the above check is completely bogus, and
>>>>> there is no deadlock. At best it would be a livelock, and it would be
>>>>> because the server is rebooting over and over again (in which case the
>>>>> client behaviour of retrying is _correct_).
>>>>
>>>> I've not been able to totally trace the locking path there, but having
>>>> a seqcount write and spinlock ABBA deadlock seems problematic... at
>>>> least at first glance.
>>>>
>>>> So if I'm reading this right... nfs4_reclaim_open_state() takes a
>>>> spinlock on so_lock, then the write on the so_reclaim_seqcount, then
>>>> drops the so_lock and calls nfs4_put_open_state, which reacquires the
>>>> so_lock.
>>>>
>>>> And lockdep is worried there may be another thread which called into
>>>> nfs4_reclaim_open_state() and took the so_lock while it was
>>>> momentarily free, and is blocking waiting on the so_reclaim_seqcount.
>>>> This would cause the first thread's reacquisition of the so_lock to
>>>> potentially deadlock.
>>>>
>>>> And your point is that this isn't a concern since no other threads can
>>>> call nfs4_reclaim_open_state() or any other code path that acquires
>>>> those two locks in order? If you're going to disable the lockdep
>>>> checks here, you might want to make this restriction really clear in a
>>>> comment so no one accidentally breaks that rule.
>>>>
>>>
>>> It's a general rule in the NFSv4 client that the only thread that is
>>> allowed to call state recovery functions (after a server reboot or a
>>> long-lived network partition) is the state manager thread. That
>>> applies to _all_ state, not just open state.
>>>
>>> The so_reclaim_seqcount only exists in order to tell the other threads
>>> that they may need to replay file open or file lock requests that have
>>> raced with state recovery (because those threads got scheduled out
>>> after their RPC calls ran, but before they managed to set up the
>>> tracking of the new state). It is basically an edge condition
>>> killer...
>>
>> Would swapping the acquisition order, so the seqcount is taken
>> before the so_lock at the top of nfs4_reclaim_open_state(), avoid this
>> without having to disable lockdep?
>>
>
> I can change the write seqcount to use raw_write_seqcount(), but that
> doesn't answer the question of why raw_seqcount_begin() is the _only_
> object out there with a "raw_" prefix, that doesn't explicitly disable
> lockdep checking.
>
> What justifies the inconsistency?
>
> --
> Trond Myklebust
>
> Linux NFS client maintainer, PrimaryData
>
> trond.myklebust@primarydata.com
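
For completeness, a similarly hedged sketch of the raw_write_seqcount_*
option Trond mentions, with the state-manager-only rule written down as the
comment John asks for. Again, this is not the real fs/nfs/nfs4state.c code;
demo_reclaim_raw is a made-up stand-in reusing struct demo_state_owner from
the sketch near the top of this message:

    /* Reuses struct demo_state_owner (so_lock + so_reclaim_seqcount). */
    static void demo_reclaim_raw(struct demo_state_owner *sp)
    {
            spin_lock(&sp->so_lock);
            /*
             * Only the state manager thread runs state recovery, so no
             * other code path ever takes so_lock and so_reclaim_seqcount
             * together in the opposite order.  raw_write_seqcount_begin()
             * and raw_write_seqcount_end() skip the lockdep annotation
             * that write_seqcount_begin()/_end() add, so the inversion
             * report quoted above is silenced rather than reordered away.
             */
            raw_write_seqcount_begin(&sp->so_reclaim_seqcount);
            /* ... drop so_lock, replay opens/locks, retake so_lock ... */
            raw_write_seqcount_end(&sp->so_reclaim_seqcount);
            spin_unlock(&sp->so_lock);
    }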