2009-12-07 11:59:47

by Andi Kleen

[permalink] [raw]
Subject: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1


While booting 2.6.32-git1 on a NFS root box I got the following
lockdep warning early at boot. I haven't looked at details.

-Andi

VFS: Mounted root (nfs filesystem) on device 0:15.
VFS: Mounted root (nfs filesystem) on device 0:15.
Freeing unused kernel memory: 2376k freed

=======================================================
[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
udevinfo/2551 is trying to acquire lock:
(&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<ffffffff8118e0d3>] nfs_revalidate_mapping+0x7c/0xc5

but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff8100fc83>] sys_mmap+0x76/0xc7

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++++}:
[<ffffffff81071d0b>] __lock_acquire+0x1435/0x1774
[<ffffffff81072106>] lock_acquire+0xbc/0xd9
[<ffffffff810baa14>] might_fault+0x84/0xa4
[<ffffffff810ec340>] filldir+0x6a/0xcb
[<ffffffff81189bad>] nfs_do_filldir+0x39d/0x4ac
[<ffffffff8118a4e9>] nfs_readdir+0x82d/0x8ce
[<ffffffff810ec4bf>] vfs_readdir+0x6c/0xa1
[<ffffffff810ec632>] sys_getdents+0x7d/0xc9
[<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b

-> #0 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[<ffffffff81071a30>] __lock_acquire+0x115a/0x1774
[<ffffffff81072106>] lock_acquire+0xbc/0xd9
[<ffffffff8149210d>] mutex_lock_nested+0x68/0x2d2
[<ffffffff8118e0d3>] nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff8118bad9>] nfs_file_mmap+0x68/0x71
[<ffffffff810c29c0>] mmap_region+0x311/0x50f
[<ffffffff810c2f13>] do_mmap_pgoff+0x355/0x3b8
[<ffffffff8100fca1>] sys_mmap+0x94/0xc7
[<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by udevinfo/2551:
#0: (&mm->mmap_sem){++++++}, at: [<ffffffff8100fc83>] sys_mmap+0x76/0xc7

stack backtrace:
Pid: 2551, comm: udevinfo Not tainted 2.6.32-git1 #31
Call Trace:
[<ffffffff81070388>] print_circular_bug+0xb3/0xc2
[<ffffffff81071a30>] __lock_acquire+0x115a/0x1774
[<ffffffff810a401d>] ? find_get_pages_tag+0x0/0x12a
[<ffffffff81072106>] lock_acquire+0xbc/0xd9
[<ffffffff8118e0d3>] ? nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff8149210d>] mutex_lock_nested+0x68/0x2d2
[<ffffffff8118e0d3>] ? nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff8118e0d3>] ? nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff810c291d>] ? mmap_region+0x26e/0x50f
[<ffffffff8118e0d3>] nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff8118bad9>] nfs_file_mmap+0x68/0x71
[<ffffffff810c29c0>] mmap_region+0x311/0x50f
[<ffffffff810c2f13>] do_mmap_pgoff+0x355/0x3b8
[<ffffffff8100fca1>] sys_mmap+0x94/0xc7
[<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b

-Andi
--
[email protected] -- Speaking for myself only.


2009-12-07 12:19:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

>
> While booting 2.6.32-git1 on a NFS root box I got the following
> lockdep warning early at boot. I haven't looked at details.

It seems typical ABBA deadlock.

vfs_readdir [grab i_mutex]
nfs_readdir
nfs_do_filldir
filldir
copy_to_user
[page_fault] [grab mmap_sem]

sys_mmap [grab mmap_sem]
do_mmap_pgoff
mmap_region
nfs_file_mmap
nfs_revalidate_mapping
nfs_invalidate_mapping [grab i_mutex]

I guess recent lockdep improvement find old bug.



>
> -Andi
>
> VFS: Mounted root (nfs filesystem) on device 0:15.
> VFS: Mounted root (nfs filesystem) on device 0:15.
> Freeing unused kernel memory: 2376k freed
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> -------------------------------------------------------
> udevinfo/2551 is trying to acquire lock:
> (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<ffffffff8118e0d3>] nfs_revalidate_mapping+0x7c/0xc5
>
> but task is already holding lock:
> (&mm->mmap_sem){++++++}, at: [<ffffffff8100fc83>] sys_mmap+0x76/0xc7
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&mm->mmap_sem){++++++}:
> [<ffffffff81071d0b>] __lock_acquire+0x1435/0x1774
> [<ffffffff81072106>] lock_acquire+0xbc/0xd9
> [<ffffffff810baa14>] might_fault+0x84/0xa4
> [<ffffffff810ec340>] filldir+0x6a/0xcb
> [<ffffffff81189bad>] nfs_do_filldir+0x39d/0x4ac
> [<ffffffff8118a4e9>] nfs_readdir+0x82d/0x8ce
> [<ffffffff810ec4bf>] vfs_readdir+0x6c/0xa1
> [<ffffffff810ec632>] sys_getdents+0x7d/0xc9
> [<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b
>
> -> #0 (&sb->s_type->i_mutex_key#5){+.+.+.}:
> [<ffffffff81071a30>] __lock_acquire+0x115a/0x1774
> [<ffffffff81072106>] lock_acquire+0xbc/0xd9
> [<ffffffff8149210d>] mutex_lock_nested+0x68/0x2d2
> [<ffffffff8118e0d3>] nfs_revalidate_mapping+0x7c/0xc5
> [<ffffffff8118bad9>] nfs_file_mmap+0x68/0x71
> [<ffffffff810c29c0>] mmap_region+0x311/0x50f
> [<ffffffff810c2f13>] do_mmap_pgoff+0x355/0x3b8
> [<ffffffff8100fca1>] sys_mmap+0x94/0xc7
> [<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b
>
> other info that might help us debug this:
>
> 1 lock held by udevinfo/2551:
> #0: (&mm->mmap_sem){++++++}, at: [<ffffffff8100fc83>] sys_mmap+0x76/0xc7
>
> stack backtrace:
> Pid: 2551, comm: udevinfo Not tainted 2.6.32-git1 #31
> Call Trace:
> [<ffffffff81070388>] print_circular_bug+0xb3/0xc2
> [<ffffffff81071a30>] __lock_acquire+0x115a/0x1774
> [<ffffffff810a401d>] ? find_get_pages_tag+0x0/0x12a
> [<ffffffff81072106>] lock_acquire+0xbc/0xd9
> [<ffffffff8118e0d3>] ? nfs_revalidate_mapping+0x7c/0xc5
> [<ffffffff8149210d>] mutex_lock_nested+0x68/0x2d2
> [<ffffffff8118e0d3>] ? nfs_revalidate_mapping+0x7c/0xc5
> [<ffffffff8118e0d3>] ? nfs_revalidate_mapping+0x7c/0xc5
> [<ffffffff810c291d>] ? mmap_region+0x26e/0x50f
> [<ffffffff8118e0d3>] nfs_revalidate_mapping+0x7c/0xc5
> [<ffffffff8118bad9>] nfs_file_mmap+0x68/0x71
> [<ffffffff810c29c0>] mmap_region+0x311/0x50f
> [<ffffffff810c2f13>] do_mmap_pgoff+0x355/0x3b8
> [<ffffffff8100fca1>] sys_mmap+0x94/0xc7
> [<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b
>
> -Andi
> --
> [email protected] -- Speaking for myself only.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


2009-12-07 13:20:07

by Andi Kleen

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Mon, Dec 07, 2009 at 09:19:28PM +0900, KOSAKI Motohiro wrote:
> >
> > While booting 2.6.32-git1 on a NFS root box I got the following
> > lockdep warning early at boot. I haven't looked at details.
>
> It seems typical ABBA deadlock.
>
> vfs_readdir [grab i_mutex]
> nfs_readdir
> nfs_do_filldir
> filldir
> copy_to_user
> [page_fault] [grab mmap_sem]
>
> sys_mmap [grab mmap_sem]
> do_mmap_pgoff
> mmap_region
> nfs_file_mmap
> nfs_revalidate_mapping
> nfs_invalidate_mapping [grab i_mutex]
>
> I guess recent lockdep improvement find old bug.

Thanks for the analysis.

I guess should never do copy_*_user while holding i_mutex? There might
be lots of cases like that.

-Andi

2009-12-07 17:38:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

(cc to linux-fsdevel)

2009/12/7 Andi Kleen <[email protected]>:
> On Mon, Dec 07, 2009 at 09:19:28PM +0900, KOSAKI Motohiro wrote:
>> >
>> > While booting 2.6.32-git1 on a NFS root box I got the following
>> > lockdep warning early at boot. I haven't looked at details.
>>
>> It seems typical ABBA deadlock.
>>
>> ?vfs_readdir ? ? ? ? ? ? ? ? ? ? ? ? ?[grab i_mutex]
>> ? ?nfs_readdir
>> ? ? ?nfs_do_filldir
>> ? ? ? ?filldir
>> ? ? ? ? ?copy_to_user
>> ? ? ? ? ? ?[page_fault] ? ? ? ? ? ? ? ? ? ? ? [grab mmap_sem]
>>
>> ?sys_mmap ? ? ? ? ? ? ? ? ? ? ? ? ? ? [grab mmap_sem]
>> ? ?do_mmap_pgoff
>> ? ? ?mmap_region
>> ? ? ? ?nfs_file_mmap
>> ? ? ? ? ?nfs_revalidate_mapping
>> ? ? ? ? ? ?nfs_invalidate_mapping ? ? [grab i_mutex]
>>
>> I guess recent lockdep improvement find old bug.
>
> Thanks for the analysis.
>
> I guess should never do copy_*_user while holding i_mutex? There might
> be lots of cases like that.
>
> -Andi

I'm not sure exactly vfs rule. but at least mm/rmap.c explained
collect order is i_mutex -> mmap_sem

rmap.c
---------------------------------------------------------------------
* Lock ordering in mm:
*
* inode->i_mutex (while writing or truncating, not reading or faulting)
* inode->i_alloc_sem (vmtruncate_range)
* mm->mmap_sem
* page->flags PG_locked (lock_page)
* mapping->i_mmap_lock
* anon_vma->lock
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
* inode_lock (in set_page_dirty's __mark_inode_dirty)
* sb_lock (within inode_lock in fs/fs-writeback.c)
* mapping->tree_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
* within inode_lock in __sync_single_inode)
-------------------------------------------------------------------------------------------------


Plus, ext4 have following comment. it imply nfs mmap implementaion is wrong...

--------------------------------------------------------------------------------------
int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;
loff_t size;
unsigned long len;
int ret = -EINVAL;
void *fsdata;
struct file *file = vma->vm_file;
struct inode *inode = file->f_path.dentry->d_inode;
struct address_space *mapping = inode->i_mapping;

/*
* Get i_alloc_sem to stop truncates messing with the inode. We cannot
* get i_mutex because we are already holding mmap_sem.
*/

2009-12-15 22:21:44

by Al Viro

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Mon, Dec 07, 2009 at 02:20:09PM +0100, Andi Kleen wrote:
> > nfs_readdir
> > nfs_do_filldir
> > filldir
> > copy_to_user
> > [page_fault] [grab mmap_sem]
> >
> > sys_mmap [grab mmap_sem]
> > do_mmap_pgoff
> > mmap_region
> > nfs_file_mmap
> > nfs_revalidate_mapping
> > nfs_invalidate_mapping [grab i_mutex]
> >
> > I guess recent lockdep improvement find old bug.
>
> Thanks for the analysis.
>
> I guess should never do copy_*_user while holding i_mutex? There might
> be lots of cases like that.

No. mmap_sem inside i_mutex is the normal order; NFS mmap is doing the
wrong thing here. Note that readdir() vs. NFS (file-only, thankfully ;-)
mmap() is a non-issue; NFS mmap() vs. write() is much more interesting.

Again, a lot of mm/* code expects i_mutex, then mmap_sem order. It's not
just readdir().

2009-12-15 23:38:52

by Andi Kleen

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Tue, Dec 15, 2009 at 10:21:34PM +0000, Al Viro wrote:
> On Mon, Dec 07, 2009 at 02:20:09PM +0100, Andi Kleen wrote:
> > > nfs_readdir
> > > nfs_do_filldir
> > > filldir
> > > copy_to_user
> > > [page_fault] [grab mmap_sem]
> > >
> > > sys_mmap [grab mmap_sem]
> > > do_mmap_pgoff
> > > mmap_region
> > > nfs_file_mmap
> > > nfs_revalidate_mapping
> > > nfs_invalidate_mapping [grab i_mutex]
> > >
> > > I guess recent lockdep improvement find old bug.
> >
> > Thanks for the analysis.
> >
> > I guess should never do copy_*_user while holding i_mutex? There might
> > be lots of cases like that.
>
> No. mmap_sem inside i_mutex is the normal order; NFS mmap is doing the
> wrong thing here. Note that readdir() vs. NFS (file-only, thankfully ;-)
> mmap() is a non-issue; NFS mmap() vs. write() is much more interesting.

I see.

>
> Again, a lot of mm/* code expects i_mutex, then mmap_sem order. It's not
> just readdir().

I suppose an easy workaround would be to not revalidate in mmap,
because open should have already done that?

Very lightly tested RFC patch attached.

-Andi

---

NFS: don't revalidate in mmap

nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
hold and taking i_mutex inside mmap_sem is not allowed by the VFS.

So don't revalidate on mmap time and trust it has been already done.

Signed-off-by: Andi Kleen <[email protected]>

---
fs/nfs/file.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

Index: linux-2.6.32-ak/fs/nfs/file.c
===================================================================
--- linux-2.6.32-ak.orig/fs/nfs/file.c
+++ linux-2.6.32-ak/fs/nfs/file.c
@@ -297,14 +297,9 @@ nfs_file_mmap(struct file * file, struct
dprintk("NFS: mmap(%s/%s)\n",
dentry->d_parent->d_name.name, dentry->d_name.name);

- /* Note: generic_file_mmap() returns ENOSYS on nommu systems
- * so we call that before revalidating the mapping
- */
status = generic_file_mmap(file, vma);
- if (!status) {
+ if (!status)
vma->vm_ops = &nfs_file_vm_ops;
- status = nfs_revalidate_mapping(inode, file->f_mapping);
- }
return status;
}

--
[email protected] -- Speaking for myself only.

2009-12-15 23:54:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Wed, 2009-12-16 at 00:38 +0100, Andi Kleen wrote:
> I suppose an easy workaround would be to not revalidate in mmap,
> because open should have already done that?
>
> Very lightly tested RFC patch attached.
>
> -Andi
>
> ---
>
> NFS: don't revalidate in mmap
>
> nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
> hold and taking i_mutex inside mmap_sem is not allowed by the VFS.
>
> So don't revalidate on mmap time and trust it has been already done.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> fs/nfs/file.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> Index: linux-2.6.32-ak/fs/nfs/file.c
> ===================================================================
> --- linux-2.6.32-ak.orig/fs/nfs/file.c
> +++ linux-2.6.32-ak/fs/nfs/file.c
> @@ -297,14 +297,9 @@ nfs_file_mmap(struct file * file, struct
> dprintk("NFS: mmap(%s/%s)\n",
> dentry->d_parent->d_name.name, dentry->d_name.name);
>
> - /* Note: generic_file_mmap() returns ENOSYS on nommu systems
> - * so we call that before revalidating the mapping
> - */
> status = generic_file_mmap(file, vma);
> - if (!status) {
> + if (!status)
> vma->vm_ops = &nfs_file_vm_ops;
> - status = nfs_revalidate_mapping(inode, file->f_mapping);
> - }
> return status;
> }
>

If you want to work around the problem rather than going for something
like Peter's split up of the mmap() callback, then I'd suggest changing
to using nfs_revalidate_mapping_nolock() instead. The fact that we are
seeing these lock misordering warnings is proof that the call to
nfs_revalidate_mapping() is not always a no-op.

By not taking the i_mutex your call to invalidate_inode_pages2() can
potentially end up racing with another process that is writing to the
file, but that should be a rare occurrence. The effect will be that the
two processes can end up fighting to alternatively dirty and then clean
the pages...

Cheers
Trond

2009-12-16 00:06:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

> On Tue, Dec 15, 2009 at 10:21:34PM +0000, Al Viro wrote:
> > On Mon, Dec 07, 2009 at 02:20:09PM +0100, Andi Kleen wrote:
> > > > nfs_readdir
> > > > nfs_do_filldir
> > > > filldir
> > > > copy_to_user
> > > > [page_fault] [grab mmap_sem]
> > > >
> > > > sys_mmap [grab mmap_sem]
> > > > do_mmap_pgoff
> > > > mmap_region
> > > > nfs_file_mmap
> > > > nfs_revalidate_mapping
> > > > nfs_invalidate_mapping [grab i_mutex]
> > > >
> > > > I guess recent lockdep improvement find old bug.
> > >
> > > Thanks for the analysis.
> > >
> > > I guess should never do copy_*_user while holding i_mutex? There might
> > > be lots of cases like that.
> >
> > No. mmap_sem inside i_mutex is the normal order; NFS mmap is doing the
> > wrong thing here. Note that readdir() vs. NFS (file-only, thankfully ;-)
> > mmap() is a non-issue; NFS mmap() vs. write() is much more interesting.
>
> I see.
>
> >
> > Again, a lot of mm/* code expects i_mutex, then mmap_sem order. It's not
> > just readdir().
>
> I suppose an easy workaround would be to not revalidate in mmap,
> because open should have already done that?
>
> Very lightly tested RFC patch attached.
>
> -Andi
>
> ---
>
> NFS: don't revalidate in mmap
>
> nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
> hold and taking i_mutex inside mmap_sem is not allowed by the VFS.
>
> So don't revalidate on mmap time and trust it has been already done.
>
> Signed-off-by: Andi Kleen <[email protected]>

afaik, no revalidate mean "I don't mind the file was already removed
at server, I only care client handle".

What's happen if we mmap removed file?



2009-12-16 00:09:41

by Al Viro

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Tue, Dec 15, 2009 at 06:54:37PM -0500, Trond Myklebust wrote:

> > nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
> > hold and taking i_mutex inside mmap_sem is not allowed by the VFS.

VM, actually...

> If you want to work around the problem rather than going for something
> like Peter's split up of the mmap() callback, then I'd suggest changing
> to using nfs_revalidate_mapping_nolock() instead. The fact that we are
> seeing these lock misordering warnings is proof that the call to
> nfs_revalidate_mapping() is not always a no-op.
>
> By not taking the i_mutex your call to invalidate_inode_pages2() can
> potentially end up racing with another process that is writing to the
> file, but that should be a rare occurrence. The effect will be that the
> two processes can end up fighting to alternatively dirty and then clean
> the pages...

Um... The really interesting question is whether it's a false positive;
*can* we hit the deadlock here? getdents() is a red herring; write() and
truncate() are real candidates.

What happens if we have one thread do mmap() while another (sharing the
address space with it) does write() or truncate() on the same file?

2009-12-16 00:48:36

by Andi Kleen

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

> afaik, no revalidate mean "I don't mind the file was already removed
> at server, I only care client handle".
>
> What's happen if we mmap removed file?

I thought NFS only provided consistency around open/close/msync, not around
mmap? I've never been aware of mmap being a synchronization point.

-Andi
--
[email protected] -- Speaking for myself only.

2009-12-16 00:53:29

by Andi Kleen

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

> If you want to work around the problem rather than going for something

I am mostly interested in making the ugly warning on my systems
go away, preferably without breaking anything in the process.
Whatever works.

> like Peter's split up of the mmap() callback, then I'd suggest changing
> to using nfs_revalidate_mapping_nolock() instead. The fact that we are
> seeing these lock misordering warnings is proof that the call to
> nfs_revalidate_mapping() is not always a no-op.

I would say the interesting question is if there is really a expectation
that mmap does this kind of synchronization?

Why in mmap, not somewhere else?

-Andi

--
[email protected] -- Speaking for myself only.

2009-12-16 13:10:08

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Wed, 2009-12-16 at 01:53 +0100, Andi Kleen wrote:
> > If you want to work around the problem rather than going for something
>
> I am mostly interested in making the ugly warning on my systems
> go away, preferably without breaking anything in the process.
> Whatever works.
>
> > like Peter's split up of the mmap() callback, then I'd suggest changing
> > to using nfs_revalidate_mapping_nolock() instead. The fact that we are
> > seeing these lock misordering warnings is proof that the call to
> > nfs_revalidate_mapping() is not always a no-op.
>
> I would say the interesting question is if there is really a expectation
> that mmap does this kind of synchronization?

Usually people who set the 'noac' mount flag will expect these syscalls
to act as synchronisation points.

Typically, their applications will be using some kind of locking scheme
that does not require POSIX or BSD locks to be set. For instance, they
may synchronise by means of a token passed through a socket (common in
MPI iirc).

> Why in mmap, not somewhere else?

We do the same thing in the read() and write() syscalls.

Cheers
Trond

2009-12-16 13:16:40

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Wed, 2009-12-16 at 00:09 +0000, Al Viro wrote:
> On Tue, Dec 15, 2009 at 06:54:37PM -0500, Trond Myklebust wrote:
>
> > > nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
> > > hold and taking i_mutex inside mmap_sem is not allowed by the VFS.
>
> VM, actually...
>
> > If you want to work around the problem rather than going for something
> > like Peter's split up of the mmap() callback, then I'd suggest changing
> > to using nfs_revalidate_mapping_nolock() instead. The fact that we are
> > seeing these lock misordering warnings is proof that the call to
> > nfs_revalidate_mapping() is not always a no-op.
> >
> > By not taking the i_mutex your call to invalidate_inode_pages2() can
> > potentially end up racing with another process that is writing to the
> > file, but that should be a rare occurrence. The effect will be that the
> > two processes can end up fighting to alternatively dirty and then clean
> > the pages...
>
> Um... The really interesting question is whether it's a false positive;
> *can* we hit the deadlock here? getdents() is a red herring; write() and
> truncate() are real candidates.
>
> What happens if we have one thread do mmap() while another (sharing the
> address space with it) does write() or truncate() on the same file?

If the two threads are sharing a VM then it looks to me as if they can
potentially deadlock.

The scenario would be that the writing thread triggers a page fault
(through __get_user()) when holding the i_mutex, while the other thread
is trying to grab the i_mutex within the mmap() call.

Cheers
Trond

2009-12-16 15:57:27

by Andi Kleen

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

On Wed, Dec 16, 2009 at 08:09:51AM -0500, Trond Myklebust wrote:
> On Wed, 2009-12-16 at 01:53 +0100, Andi Kleen wrote:
> > > If you want to work around the problem rather than going for something
> >
> > I am mostly interested in making the ugly warning on my systems
> > go away, preferably without breaking anything in the process.
> > Whatever works.
> >
> > > like Peter's split up of the mmap() callback, then I'd suggest changing
> > > to using nfs_revalidate_mapping_nolock() instead. The fact that we are
> > > seeing these lock misordering warnings is proof that the call to
> > > nfs_revalidate_mapping() is not always a no-op.
> >
> > I would say the interesting question is if there is really a expectation
> > that mmap does this kind of synchronization?
>
> Usually people who set the 'noac' mount flag will expect these syscalls
> to act as synchronisation points.

It would be definitely a synchronization point if they deadlock in
mmap due to a ABBA race.

> Typically, their applications will be using some kind of locking scheme
> that does not require POSIX or BSD locks to be set. For instance, they
> may synchronise by means of a token passed through a socket (common in
> MPI iirc).

Still mmap seems like an odd synchronization point. Is not doing it
in it really likely to break anything?

> > Why in mmap, not somewhere else?
>
> We do the same thing in the read() and write() syscalls.

Ok I didn't fully understand your suggestion to use the _nolock
variant. Are you saying i_mutex is sometimes not needed?

I thought _nolock was only for the case i_mutex is already hold --
which is not the case here.

-Andi
--
[email protected] -- Speaking for myself only.

2009-12-23 16:32:11

by Andi Kleen

[permalink] [raw]
Subject: Re: NFS lockdep lock misordering mmap_sem<->i_mutex_key with 2.6.32-git1

> The scenario would be that the writing thread triggers a page fault
> (through __get_user()) when holding the i_mutex, while the other thread
> is trying to grab the i_mutex within the mmap() call.

Hi Trond,

What's the state of this bug? Are you going to accept my patch
or do you have a different one in the pipeline?

Thanks,

-Andi

--
[email protected] -- Speaking for myself only.