2011-04-15 20:17:31

by Dave Jones

Subject: hugetlb locking bug.

Just hit this lockdep report..

Dave

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.39-rc3+ #3
-------------------------------------------------------
trinity/11299 is trying to acquire lock:
(&sb->s_type->i_mutex_key#18){+.+.+.}, at: [<ffffffff811ebfec>] hugetlbfs_file_mmap+0x7f/0x10d

but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff81109097>] sys_mmap_pgoff+0xf8/0x16a

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++++}:
[<ffffffff81086c59>] lock_acquire+0xd0/0xfb
[<ffffffff81100cd0>] might_fault+0x89/0xac
[<ffffffff8114159b>] filldir+0x6f/0xc7
[<ffffffff81150075>] dcache_readdir+0x67/0x204
[<ffffffff811417ed>] vfs_readdir+0x78/0xb1
[<ffffffff8114190c>] sys_getdents+0x7e/0xd1
[<ffffffff814c22c2>] system_call_fastpath+0x16/0x1b

-> #0 (&sb->s_type->i_mutex_key#18){+.+.+.}:
[<ffffffff810864e6>] __lock_acquire+0x99e/0xc81
[<ffffffff81086c59>] lock_acquire+0xd0/0xfb
[<ffffffff814b9d00>] __mutex_lock_common+0x4c/0x35b
[<ffffffff814ba0d3>] mutex_lock_nested+0x3e/0x43
[<ffffffff811ebfec>] hugetlbfs_file_mmap+0x7f/0x10d
[<ffffffff81108ac8>] mmap_region+0x26d/0x446
[<ffffffff81108f45>] do_mmap_pgoff+0x2a4/0x2fe
[<ffffffff811090b7>] sys_mmap_pgoff+0x118/0x16a
[<ffffffff8100c56c>] sys_mmap+0x22/0x24
[<ffffffff814c22c2>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by trinity/11299:
#0: (&mm->mmap_sem){++++++}, at: [<ffffffff81109097>] sys_mmap_pgoff+0xf8/0x16a

stack backtrace:
Pid: 11299, comm: trinity Not tainted 2.6.39-rc3+ #3
Call Trace:
[<ffffffff814b1126>] print_circular_bug+0xa6/0xb5
[<ffffffff810864e6>] __lock_acquire+0x99e/0xc81
[<ffffffff811ebfec>] ? hugetlbfs_file_mmap+0x7f/0x10d
[<ffffffff81086c59>] lock_acquire+0xd0/0xfb
[<ffffffff811ebfec>] ? hugetlbfs_file_mmap+0x7f/0x10d
[<ffffffff811ebfec>] ? hugetlbfs_file_mmap+0x7f/0x10d
[<ffffffff814b9d00>] __mutex_lock_common+0x4c/0x35b
[<ffffffff811ebfec>] ? hugetlbfs_file_mmap+0x7f/0x10d
[<ffffffff814b3ee2>] ? __slab_alloc+0xdb/0x3b7
[<ffffffff81087043>] ? trace_hardirqs_on_caller+0x10b/0x12f
[<ffffffff814b3eeb>] ? __slab_alloc+0xe4/0x3b7
[<ffffffff81108a12>] ? mmap_region+0x1b7/0x446
[<ffffffff814ba0d3>] mutex_lock_nested+0x3e/0x43
[<ffffffff811ebfec>] hugetlbfs_file_mmap+0x7f/0x10d
[<ffffffff81108ac8>] mmap_region+0x26d/0x446
[<ffffffff81108f45>] do_mmap_pgoff+0x2a4/0x2fe
[<ffffffff811090b7>] sys_mmap_pgoff+0x118/0x16a
[<ffffffff81087043>] ? trace_hardirqs_on_caller+0x10b/0x12f
[<ffffffff8100c56c>] sys_mmap+0x22/0x24
[<ffffffff814c22c2>] system_call_fastpath+0x16/0x1b


2011-04-15 20:51:43

by Linus Torvalds

Subject: Re: hugetlb locking bug.

Hmm. Adding the hugetlbfs/lockdep people to the cc.

This _looks_ like the same kind of false positive we've had with some
other cases: we take the i_mutex lock only for directory inodes (for
the readdir), and the mmap path takes i_mutex only for non-directory
inodes, so you can't actually get any real circular locking issue.

So yes, we do mix the order of mmap_sem and i_mutex, but it's
conceptually two "different" kinds of i_mutex that just happen to
share a name.

And I really thought we annotated it as such with different
"lockdep_set_class()" cases (ie the whole

lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);

for the S_ISDIR case in unlock_new_inode()).
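
For reference, the annotation in question looks roughly like this in
fs/inode.c (quoting from memory, so the details may be off):

void unlock_new_inode(struct inode *inode)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        if (S_ISDIR(inode->i_mode)) {
                struct file_system_type *type = inode->i_sb->s_type;

                /* Set new key only if filesystem hasn't already changed it */
                if (!lockdep_match_class(&inode->i_mutex, &type->i_mutex_key)) {
                        /* ensure nobody is actually holding i_mutex */
                        mutex_destroy(&inode->i_mutex);
                        mutex_init(&inode->i_mutex);
                        lockdep_set_class(&inode->i_mutex,
                                          &type->i_mutex_dir_key);
                }
        }
#endif
        WARN_ON(!(inode->i_state & I_NEW));
        inode->i_state &= ~I_NEW;
        /* ... wake up anybody waiting on I_NEW ... */
}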

Can somebody more alert than me see why this lockdep issue still
triggers with hugetlbfs?

Linus

On Fri, Apr 15, 2011 at 1:16 PM, Dave Jones <[email protected]> wrote:
> Just hit this lockdep report..
>
>        Dave
>
> [lockdep trace snipped]

2011-04-15 20:57:40

by Christoph Hellwig

Subject: Re: hugetlb locking bug.

On Fri, Apr 15, 2011 at 01:49:04PM -0700, Linus Torvalds wrote:
> And I really thought we annotated it as such with different
> "lockdep_set_class()" cases (ie the whole
>
> lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);
>
> for the S_ISDIR case in unlock_new_inode()).
>
> Can somebody more alert than me see why this lockdep issue still
> triggers with hugetlbfs?

Because it doesn't use iget() or unlock_new_inode(), but instead
calls new_inode() directly. It and other filesystems not using
unlock_new_inode() will need a local copy of that logic.
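
For hugetlbfs that would mean something like the following in
hugetlbfs_get_inode(), right after the new_inode() call (an untested
sketch, and the exact placement is from memory):

        inode = new_inode(sb);
        if (inode) {
                /*
                 * new_inode() leaves i_mutex in the default per-sb class;
                 * move directory inodes over to the dir key so lockdep can
                 * tell the readdir i_mutex apart from the regular-file one.
                 */
                if (S_ISDIR(mode))
                        lockdep_set_class(&inode->i_mutex,
                                          &sb->s_type->i_mutex_dir_key);
                /* ... rest of the inode setup ... */
        }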

2011-04-15 21:04:42

by Peter Zijlstra

Subject: Re: hugetlb locking bug.

On Fri, 2011-04-15 at 13:49 -0700, Linus Torvalds wrote:
> Hmm. Adding the hugetlbfs/lockdep people to the cc.
>
> This _looks_ like the same kind of false positive we've had with some
> other cases: we take the i_mutex lock only for directory inodes (for
> the readdir), and the mmap path takes i_mutex only for non-directory
> inodes, so you can't actually get any real circular locking issue.
>
> So yes, we do mix the order of mmap_sem and i_mutex, but it's
> conceptually two "different" kinds of i_mutex that just happen to
> share a name.
>
> And I really thought we annotated it as such with different
> "lockdep_set_class()" cases (ie the whole
>
> lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);
>
> for the S_ISDIR case in unlock_new_inode()).
>
> Can somebody more alert than me see why this lockdep issue still
> triggers with hugetlbfs?

afaict hugetlbfs doesn't actually end up calling unlock_new_inode(),
which does the whole S_ISDIR() lockdep annotation, but then I might
have gotten lost in the whole inode allocation dance.

2011-04-15 21:07:06

by Peter Zijlstra

Subject: Re: hugetlb locking bug.

On Fri, 2011-04-15 at 16:57 -0400, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 01:49:04PM -0700, Linus Torvalds wrote:
> > And I really thought we annotated it as such with different
> > "lockdep_set_class()" cases (ie the whole
> >
> > lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);
> >
> > for the S_ISDIR case in unlock_new_inode()).
> >
> > Can somebody more alert than me see why this lockdep issue still
> > triggers with hugetlbfs?
>
> Because it doesn't use iget() or unlock_new_inode(), but instead
> calls new_inode() directly. It and other filesystems not using
> unlock_new_inode() will need a local copy of that logic.

Is there a sane reason they do their own magic, and thus need a copy of
the logic, instead of using the generic code that already has it?

2011-04-15 21:13:58

by Christoph Hellwig

Subject: Re: hugetlb locking bug.

On Fri, Apr 15, 2011 at 11:09:26PM +0200, Peter Zijlstra wrote:
> Is there a sane reason they do their own magic, and thus need a copy of
> the logic, instead of using the generic code that already has it?

There is no need to use the inode hash for a purely in-memory
filesystem. The dcache tells us if an entry already exists,
so there is no need for an LRU list, hash, or other overhead.

2011-04-15 21:20:44

by Peter Zijlstra

Subject: Re: hugetlb locking bug.

On Fri, 2011-04-15 at 17:13 -0400, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 11:09:26PM +0200, Peter Zijlstra wrote:
> > Is there a sane reason they do their own magic, and thus need a copy of
> > the logic, instead of using the generic code that already has it?
>
> There is no need to use the inode hash for a purely in-memory
> filesystem. The dcache tells us if an entry already exists,
> so there is no need for an LRU list, hash, or other overhead.

OK, makes sense. Should we then make some common code to share between
these in-memory filesystems, so as to limit the number of copies of
this logic?

2011-04-15 21:21:09

by Linus Torvalds

Subject: Re: hugetlb locking bug.

On Fri, Apr 15, 2011 at 2:09 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2011-04-15 at 16:57 -0400, Christoph Hellwig wrote:
>>
>> Because it doesn't use iget() or unlock_new_inode(), but instead
>> calls new_inode() directly. It and other filesystems not using
>> unlock_new_inode() will need a local copy of that logic.
>
> Is there a sane reason they do their own magic, and thus need a copy of
> the logic, instead of using the generic code that already has it?

Hmm. That all seems to be just an oversight.

Does this trivial one-liner work?

(Warning: whitespace damage and TOTALLY UNTESTED)

Linus

---

fs/hugetlbfs/inode.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b9eeb1cd03ff..a14a6e03ec33 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -491,6 +491,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
                         inode->i_op = &page_symlink_inode_operations;
                         break;
                 }
+                unlock_new_inode(inode);
         }
         return inode;
 }

2011-04-15 21:26:25

by Christoph Hellwig

Subject: Re: hugetlb locking bug.

On Fri, Apr 15, 2011 at 02:19:01PM -0700, Linus Torvalds wrote:
> On Fri, Apr 15, 2011 at 2:09 PM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, 2011-04-15 at 16:57 -0400, Christoph Hellwig wrote:
> >>
> >> Because it doesn't use iget() or unlock_new_inode(), but instead
> >> calls new_inode() directly. It and other filesystems not using
> >> unlock_new_inode() will need a local copy of that logic.
> >
> > Is there a sane reason they do their own magic, and thus need a copy of
> > the logic, instead of using the generic code that already has it?
>
> Hmm. That all seems to be just an oversight.
>
> Does this trivial one-liner work?
>
> (Warning: whitespace damage and TOTALLY UNTESTED)

It'll get rid of the lockdep splat in favour of a WARN_ON, given that
inodes from new_inode() never have I_NEW set.
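
(The I_NEW flag is only set on the iget_locked()/iget5_locked() path;
new_inode() leaves i_state clear. Roughly, from fs/inode.c:

        /* get_new_inode_fast(), used by iget_locked(): */
        inode->i_state = I_NEW;

        /* new_inode(): */
        inode->i_state = 0;

so the WARN_ON in unlock_new_inode() would fire for every single
hugetlbfs inode.)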

2011-04-15 21:30:23

by Linus Torvalds

Subject: Re: hugetlb locking bug.

On Fri, Apr 15, 2011 at 2:19 PM, Linus Torvalds
<[email protected]> wrote:
>
> (Warning: whitespace damage and TOTALLY UNTESTED)

Gaah. That won't work. Or rather, it probably would work, but while
working it will spam the logs with that

WARN_ON(!(inode->i_state & I_NEW));

check in unlock_new_inode().

So the sane thing to do would apparently be one of:

(a) ignore the whole thing, and just accept the false lockdep warning.

which I'd be willing to do, but it might be hiding some real
ones, so we probably shouldn't.

(b) just remove that WARN_ON(), and use the one-liner I suggested

(c) extract the "set directory i_mutex key" logic into a new helper
function for the case of filesystems like hugetlbfs that don't want to
use unlock_new_inode() for one reason or another.
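
For (c), the helper would basically just be the S_ISDIR block lifted
out of unlock_new_inode(), something like this (again TOTALLY
UNTESTED, and the name is made up):

void lockdep_annotate_inode_mutex_key(struct inode *inode)
{
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        if (S_ISDIR(inode->i_mode)) {
                struct file_system_type *type = inode->i_sb->s_type;

                if (!lockdep_match_class(&inode->i_mutex, &type->i_mutex_key)) {
                        mutex_destroy(&inode->i_mutex);
                        mutex_init(&inode->i_mutex);
                        lockdep_set_class(&inode->i_mutex,
                                          &type->i_mutex_dir_key);
                }
        }
#endif
}

with unlock_new_inode() calling it, and hugetlbfs (and friends)
calling it by hand after new_inode().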

Personally, I don't have any really strong preferences and would
probably just go for (b) to keep the patch small and simple. Anybody?

Linus