Hi,
This series is my pending lockdep queue; it includes David's graph optimization
and my new scheduler runqueue annotation, amongst others.
The part that makes this an RFC is the latter few patches, which add the "lock
protection locks" thing Linus proposed a few days ago. I've not yet written a
test case that actively tests this functionality, but wanted to get this out so
that Jeremy can see if it works for him.
Enjoy..
--
Subject: mm: fix mm_take_all_locks() locking order
Lockdep spotted:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.27-rc1 #270
-------------------------------------------------------
qemu-kvm/2033 is trying to acquire lock:
(&inode->i_data.i_mmap_lock){----}, at: [<ffffffff802996cc>] mm_take_all_locks+0xc2/0xea
but task is already holding lock:
(&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&anon_vma->lock){----}:
[<ffffffff8025cd37>] __lock_acquire+0x11be/0x14d2
[<ffffffff8025d0a9>] lock_acquire+0x5e/0x7a
[<ffffffff804c655b>] _spin_lock+0x3b/0x47
[<ffffffff8029a2ef>] vma_adjust+0x200/0x444
[<ffffffff8029a662>] split_vma+0x12f/0x146
[<ffffffff8029bc60>] mprotect_fixup+0x13c/0x536
[<ffffffff8029c203>] sys_mprotect+0x1a9/0x21e
[<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
-> #0 (&inode->i_data.i_mmap_lock){----}:
[<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
[<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
[<ffffffff8025d515>] lock_release+0x127/0x14a
[<ffffffff804c6403>] _spin_unlock+0x1e/0x50
[<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
[<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
[<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
[<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
[<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
[<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
[<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
[<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
other info that might help us debug this:
5 locks held by qemu-kvm/2033:
#0: (&mm->mmap_sem){----}, at: [<ffffffff802a95d0>] do_mmu_notifier_register+0x55/0x112
#1: (mm_all_locks_mutex){--..}, at: [<ffffffff8029963e>] mm_take_all_locks+0x34/0xea
#2: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
#3: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
#4: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
stack backtrace:
Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270
Call Trace:
[<ffffffff8025b7c7>] print_circular_bug_tail+0xb8/0xc3
[<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
[<ffffffff80259bb1>] ? add_lock_to_list+0x7e/0xad
[<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
[<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
[<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
[<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
[<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
[<ffffffff8025b202>] ? trace_hardirqs_on_caller+0x4d/0x115
[<ffffffff802995d9>] ? mm_drop_all_locks+0x7f/0xb0
[<ffffffff8025d515>] lock_release+0x127/0x14a
[<ffffffff804c6403>] _spin_unlock+0x1e/0x50
[<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
[<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
[<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
[<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
[<ffffffff8033f9f2>] ? file_has_perm+0x83/0x8e
[<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
[<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
[<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
[<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
The locking hierarchy documented in mm/rmap.c confirms the existing dependency
order as the 'valid' one.
That said, I don't think there are any users of these two locks that don't
hold the mmap_sem, so the nesting is strictly OK; but since we already
have an established order, we might as well respect it.
Fix this by first taking all the mapping->i_mmap_lock instances and then
taking all the anon_vma->lock instances.
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/mmap.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -2358,11 +2358,17 @@ int mm_take_all_locks(struct mm_struct *
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (signal_pending(current))
goto out_unlock;
- if (vma->anon_vma)
- vm_lock_anon_vma(mm, vma->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_lock_mapping(mm, vma->vm_file->f_mapping);
}
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (signal_pending(current))
+ goto out_unlock;
+ if (vma->anon_vma)
+ vm_lock_anon_vma(mm, vma->anon_vma);
+ }
+
ret = 0;
out_unlock:
Subject: lockdep: annotate mm_take_all_locks()
The nesting is correct because mmap_sem is held; use the new nest_lock
annotation to express this.
This doesn't solve the MAX_LOCK_DEPTH issue, but 48 seems enough for
the current kvm usage, which does this early on in the life of an mm.
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/mmap.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -2273,14 +2273,14 @@ int install_special_mapping(struct mm_st
static DEFINE_MUTEX(mm_all_locks_mutex);
-static void vm_lock_anon_vma(struct anon_vma *anon_vma)
+static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
{
if (!test_bit(0, (unsigned long *) &anon_vma->head.next)) {
/*
* The LSB of head.next can't change from under us
* because we hold the mm_all_locks_mutex.
*/
- spin_lock(&anon_vma->lock);
+ spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);
/*
* We can safely modify head.next after taking the
* anon_vma->lock. If some other vma in this mm shares
@@ -2296,7 +2296,7 @@ static void vm_lock_anon_vma(struct anon
}
}
-static void vm_lock_mapping(struct address_space *mapping)
+static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
{
if (!test_bit(AS_MM_ALL_LOCKS, &mapping->flags)) {
/*
@@ -2310,7 +2310,7 @@ static void vm_lock_mapping(struct addre
*/
if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
BUG();
- spin_lock(&mapping->i_mmap_lock);
+ spin_lock_nest_lock(&mapping->i_mmap_lock, &mm->mmap_sem);
}
}
@@ -2359,9 +2359,9 @@ int mm_take_all_locks(struct mm_struct *
if (signal_pending(current))
goto out_unlock;
if (vma->anon_vma)
- vm_lock_anon_vma(vma->anon_vma);
+ vm_lock_anon_vma(mm, vma->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
- vm_lock_mapping(vma->vm_file->f_mapping);
+ vm_lock_mapping(mm, vma->vm_file->f_mapping);
}
ret = 0;
On Thu, 2008-08-07 at 13:14 +0100, Hugh Dickins wrote:
> On Thu, 7 Aug 2008, Peter Zijlstra wrote:
> >
> > Which the locking hierarchy in mm/rmap.c confirms as 'valid'.
> >
> > Although I don't think there are any users of these two locks that don't
> > hold the mmap_sem, therefore the nesting is strictly ok, but since we
> > already have an established order, we might as well respect it.
>
> Yes, I agree.
>
> > Fix this by first taking all the mapping->i_mmap_lock instances and then
> > take all anon_vma->lock instances.
>
> Okay. I'd have preferred taking anon_vma lock after i_mmap_lock
> each time around the loop, but imagine that's just as problematic
> for lockdep as the original.
I'm a little confused as to what you mean here, are you suggesting:
for_each_vma() {
if (file)
vm_lock_mapping();
if (anon)
vm_lock_anon();
}
?
That can still create the inverse lock order, because each vma is only of a
single type, so the lock order is set by the vma order, which can be
anything.
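To illustrate what I mean (a made-up userspace sketch, nothing kernel-specific):
whichever lock class gets taken first is dictated purely by the order of the
vmas, so different vma layouts record opposite dependencies between
i_mmap_lock and anon_vma->lock:

	/*
	 * Illustrative userspace model: each "vma" is either file backed
	 * (takes the mapping lock) or anonymous (takes the anon_vma lock).
	 * With the interleaved loop the lock order simply follows the vma
	 * order.
	 */
	#include <stdio.h>

	enum vma_type { VMA_FILE, VMA_ANON };

	static void take_all_locks(const char *name,
				   const enum vma_type *vma, int n)
	{
		int i;

		printf("%s takes:", name);
		for (i = 0; i < n; i++)
			printf(" %s", vma[i] == VMA_FILE ? "i_mmap_lock"
							 : "anon_vma->lock");
		printf("\n");
	}

	int main(void)
	{
		enum vma_type mm1[] = { VMA_FILE, VMA_ANON };	/* file vma first */
		enum vma_type mm2[] = { VMA_ANON, VMA_FILE };	/* anon vma first */

		take_all_locks("mm1", mm1, 2);	/* i_mmap_lock, then anon_vma->lock */
		take_all_locks("mm2", mm2, 2);	/* anon_vma->lock, then i_mmap_lock */
		return 0;
	}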
On Thu, 7 Aug 2008, Peter Zijlstra wrote:
>
> Which the locking hierarchy in mm/rmap.c confirms as 'valid'.
>
> Although I don't think there are any users of these two locks that don't
> hold the mmap_sem, therefore the nesting is strictly ok, but since we
> already have an established order, we might as well respect it.
Yes, I agree.
> Fix this by first taking all the mapping->i_mmap_lock instances and then
> take all anon_vma->lock instances.
Okay. I'd have preferred taking anon_vma lock after i_mmap_lock
each time around the loop, but imagine that's just as problematic
for lockdep as the original.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
> ---
> mm/mmap.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -2358,11 +2358,17 @@ int mm_take_all_locks(struct mm_struct *
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> if (signal_pending(current))
> goto out_unlock;
> - if (vma->anon_vma)
> - vm_lock_anon_vma(mm, vma->anon_vma);
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_lock_mapping(mm, vma->vm_file->f_mapping);
> }
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + if (signal_pending(current))
> + goto out_unlock;
> + if (vma->anon_vma)
> + vm_lock_anon_vma(mm, vma->anon_vma);
> + }
> +
> ret = 0;
>
> out_unlock:
On Thu, 7 Aug 2008, Peter Zijlstra wrote:
> On Thu, 2008-08-07 at 13:14 +0100, Hugh Dickins wrote:
> > On Thu, 7 Aug 2008, Peter Zijlstra wrote:
> > >
> > > Which the locking hierarchy in mm/rmap.c confirms as 'valid'.
> > >
> > > Although I don't think there are any users of these two locks that don't
> > > hold the mmap_sem, therefore the nesting is strictly ok, but since we
> > > already have an established order, we might as well respect it.
> >
> > Yes, I agree.
> >
> > > Fix this by first taking all the mapping->i_mmap_lock instances and then
> > > take all anon_vma->lock instances.
> >
> > Okay. I'd have preferred taking anon_vma lock after i_mmap_lock
> > each time around the loop, but imagine that's just as problematic
> > for lockdep as the original.
>
> I'm a little confused as to what you mean here, are you suggesting:
>
> for_each_vma() {
> if (file)
> vm_lock_mapping();
> if (anon)
> vm_lock_anon();
> }
>
> ?
Yes, I would have preferred that.
>
> That can still create the inverse lock order due to each vma being only
> of a single type, and therefore the lock order is set by the vma order,
> which can be anything.
Yes, I imagined it would be just as problematic for lockdep.
Hugh
On Thu, Aug 07, 2008 at 01:25:49PM +0200, Peter Zijlstra wrote:
> Index: linux-2.6/mm/mmap.c
> ===================================================================
> --- linux-2.6.orig/mm/mmap.c
> +++ linux-2.6/mm/mmap.c
> @@ -2358,11 +2358,17 @@ int mm_take_all_locks(struct mm_struct *
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> if (signal_pending(current))
> goto out_unlock;
> - if (vma->anon_vma)
> - vm_lock_anon_vma(mm, vma->anon_vma);
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_lock_mapping(mm, vma->vm_file->f_mapping);
> }
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + if (signal_pending(current))
> + goto out_unlock;
> + if (vma->anon_vma)
> + vm_lock_anon_vma(mm, vma->anon_vma);
> + }
> +
> ret = 0;
I'd suggest adding a comment to document that this is just to reduce the
number of false positives that lockdep emits, otherwise it'll be
optimized away again sooner or later, I guess. I'm fine either way,
even if this runs a bit slower. Note that I _strongly_ and fully support
this kind of lockdep change, because it accommodates check_noncircular,
which is a very useful feature of prove-locking and can find bugs that
would otherwise go unnoticed (even if it clearly emits false positives
at will, like above).
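Something along these lines is what I mean (just a sketch of the comment on
top of the two loops from the patch, the exact wording doesn't matter):

	/*
	 * Take the i_mmap_locks in one pass and the anon_vma locks in a
	 * second pass instead of per vma: this matches the established
	 * i_mmap_lock -> anon_vma->lock order and only exists to avoid
	 * lockdep false positives, the nesting itself is safe here.
	 */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (signal_pending(current))
			goto out_unlock;
		if (vma->vm_file && vma->vm_file->f_mapping)
			vm_lock_mapping(mm, vma->vm_file->f_mapping);
	}

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (signal_pending(current))
			goto out_unlock;
		if (vma->anon_vma)
			vm_lock_anon_vma(mm, vma->anon_vma);
	}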
As for 8/7, you know my opinion from somebody who's way beyond the
point: check_deadlock should be dropped, and we'd rather spend time
incorporating kdb/nlkd/whatever if sysrq/nmiwatchdog/kgdb aren't
already friendly enough for casual driver developers who may not be
able to read assembly or set up kgdb, to make a recursion deadlock
trivial to identify by other means (furthermore, if those developers
can't use sysrq/nmiwatchdog/kgdb/systemtap, a real debugger will help
them with many other things like single-stepping, so they don't have to
add printk all over the place to debug). So I really dislike 8/7, and
furthermore I doubt it works because with regular kvm I get 57 vmas
filebacked alone. Did you test 8/7? I didn't yet, but I can test if you
didn't. It's true that mmu notifier registration happens at a very early
stage of vm creation, but most libs are loaded by the dynamic linker
before it. In any case, instead of 8/7 I prefer my trylock patch over
trying to accommodate the useless check_deadlock (again, useless from
someone who's beyond the point; agree to disagree here).
Thanks for the help in cleaning up these lockdep bits!
On Thu, Aug 07, 2008 at 11:46:42PM +0200, Andrea Arcangeli wrote:
> furthermore I doubt it works because with regular kvm I get 57 vmas
> filebacked alone, did you test 8/7, I didn't yet but I can test if you
Correction: 57 is the number of vmas, but counting with something less
coarse than grep lib | wc -l, there are fewer inodes backing those vmas
than the nesting limit, so 8/7 is OK for now (as long as check_deadlock
exists, at least).
Thanks again!
On Thu, 2008-08-07 at 23:46 +0200, Andrea Arcangeli wrote:
> As for 8/7 you know my opinion from somebody who's way beyond the
> point: check_deadlock should be dropped
I'll try one more time; don't feel obliged to reply or
anything :-)
Suppose you have objects initialized from a single site:
struct my_obj *create_obj()
{
...
spin_lock_init(&obj->lock);
...
return obj;
}
So all these objects' locks are in a single class.
Now put these objects into two lists without a fixed order.
L1: A B C D
L2: B A D C
If you were to lock-couple your way through these lists, there is a
deadlock potential.
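In userspace terms the hazard looks like this (a made-up pthread sketch, not
kernel code; it may or may not actually hang on a given run, but the AB-BA
potential is there):

	#include <pthread.h>
	#include <stdio.h>

	struct my_obj {
		pthread_mutex_t lock;	/* one init site -> one lock class */
	};

	static struct my_obj A, B;
	static struct my_obj *list1[] = { &A, &B };	/* L1: A B */
	static struct my_obj *list2[] = { &B, &A };	/* L2: B A */

	static void *walk(void *arg)
	{
		struct my_obj **list = arg;

		/* lock-coupling: hold the current lock while taking the next */
		pthread_mutex_lock(&list[0]->lock);
		pthread_mutex_lock(&list[1]->lock);	/* can block forever */
		pthread_mutex_unlock(&list[1]->lock);
		pthread_mutex_unlock(&list[0]->lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_mutex_init(&A.lock, NULL);
		pthread_mutex_init(&B.lock, NULL);

		/* thread 1 takes A then B, thread 2 takes B then A */
		pthread_create(&t1, NULL, walk, list1);
		pthread_create(&t2, NULL, walk, list2);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);

		puts("got lucky this run, but the deadlock potential remains");
		return 0;
	}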
The check_noncircular stuff will not catch this, because it is all a
single class. The only thing we have that indicates you need to pay
attention here is check_deadlock.
Yes, check_deadlock is a tad too sensitive in the number of false
positives it emits, but without it there are gaping holes in 'proving'
lock correctness.
Currently, if you get 100% code coverage and lockdep doesn't shout,
you're good. If we were to drop check_deadlock, we couldn't say that
anymore.
* Peter Zijlstra <[email protected]> wrote:
> Hi,
>
> This series is my pending lockdep queue, it includes David's graph
> optimization, my new scheduler runqueue annotation amongst others.
>
> The part that makes this RFC are the latter few patches that add the
> "lock protection locks" thing Linus proposed a few days ago. I've not
> yet written a test case that actively tests this functionality, but
> wanted to get this out so that Jeremy can see if it works for him.
update: i'm now back from vacation and i've added these patches to
tip/core/locking, but -tip testing found a few problems with them, so i
won't push them upstream yet.
Ingo