LinuxLists.cc - [GIT PULL] Please pull NFS client bugfixes....

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

> Hi Linus,
>
> Please pull from the "bugfixes" branch of the repository at
>
> git pull git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git bugfixes
>
> This will update the following files through the appended changesets.

How about the mmap lock order inversion patch?

That problem is still there and hits immediately after mounting root
with a lockdep warning on any lockdep enabled NFS root kernel.

-Andi

---

NFS: don't revalidate in mmap

nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
hold and taking i_mutex inside mmap_sem is not allowed by the VFS.

So don't revalidate on mmap time and trust it has been already done.

Signed-off-by: Andi Kleen <[email protected]>

---
fs/nfs/file.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

Index: linux-2.6.32-ak/fs/nfs/file.c
===================================================================
--- linux-2.6.32-ak.orig/fs/nfs/file.c
+++ linux-2.6.32-ak/fs/nfs/file.c
@@ -297,14 +297,9 @@ nfs_file_mmap(struct file * file, struct
dprintk("NFS: mmap(%s/%s)\n",
dentry->d_parent->d_name.name, dentry->d_name.name);

- /* Note: generic_file_mmap() returns ENOSYS on nommu systems
- * so we call that before revalidating the mapping
- */
status = generic_file_mmap(file, vma);
- if (!status) {
+ if (!status)
vma->vm_ops = &nfs_file_vm_ops;
- status = nfs_revalidate_mapping(inode, file->f_mapping);
- }
return status;
}

--
[email protected] -- Speaking for myself only.

2010-01-07 21:23:34

by Peter Staubach

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

Andi Kleen wrote:
>> Hi Linus,
>>
>> Please pull from the "bugfixes" branch of the repository at
>>
>> git pull git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git bugfixes
>>
>> This will update the following files through the appended changesets.
>
> How about the mmap lock order inversion patch?
>
> That problem is still there and hits immediately after mounting root
> with a lockdep warning on any lockdep enabled NFS root kernel.
>

It seems to me that NFS must do some sort of validation at
this point. This is one of the last points where the NFS
client can do validation for mmap'd files.

Simply crossing our fingers probably won't be sufficient. :-)

I think that the file system must have a way to do validation
and page invalidation during mmap. Perhaps we can reorder
things or add another entry point which is invoked after the
page mapping is set up which would give the file system to
check to ensure that the page cache is filled with valid data.

ps

> -Andi
>
> ---
>
> NFS: don't revalidate in mmap
>
> nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
> hold and taking i_mutex inside mmap_sem is not allowed by the VFS.
>
> So don't revalidate on mmap time and trust it has been already done.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> fs/nfs/file.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> Index: linux-2.6.32-ak/fs/nfs/file.c
> ===================================================================
> --- linux-2.6.32-ak.orig/fs/nfs/file.c
> +++ linux-2.6.32-ak/fs/nfs/file.c
> @@ -297,14 +297,9 @@ nfs_file_mmap(struct file * file, struct
> dprintk("NFS: mmap(%s/%s)\n",
> dentry->d_parent->d_name.name, dentry->d_name.name);
>
> - /* Note: generic_file_mmap() returns ENOSYS on nommu systems
> - * so we call that before revalidating the mapping
> - */
> status = generic_file_mmap(file, vma);
> - if (!status) {
> + if (!status)
> vma->vm_ops = &nfs_file_vm_ops;
> - status = nfs_revalidate_mapping(inode, file->f_mapping);
> - }
> return status;
> }
>
>
>

2010-01-07 21:35:10

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, Jan 07, 2010 at 04:23:22PM -0500, Peter Staubach wrote:
> It seems to me that NFS must do some sort of validation at
> this point. This is one of the last points where the NFS
> client can do validation for mmap'd files.

Typically there is a open directly before it. open seems
like a much more natural point to synchronize than mmap.

Also so far nobody could point me to a standard text or similar
who said mmap needs synchronization, just some vague beliefs.

And the fact is right now with the current Linux VFS and
NFS locking mmap can't do it reliably.

> I think that the file system must have a way to do validation
> and page invalidation during mmap. Perhaps we can reorder
> things or add another entry point which is invoked after the
> page mapping is set up which would give the file system to
> check to ensure that the page cache is filled with valid data.

If someone has a better patch feel free to post it, but ignoring
the problem like it has been done so far doesn't fly.

I know that my patch works for me at least.

-Andi

2010-01-07 21:54:27

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 2010-01-07 at 22:00 +0100, Andi Kleen wrote:
> > Hi Linus,
> >
> > Please pull from the "bugfixes" branch of the repository at
> >
> > git pull git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git bugfixes
> >
> > This will update the following files through the appended changesets.
>
> How about the mmap lock order inversion patch?
>
> That problem is still there and hits immediately after mounting root
> with a lockdep warning on any lockdep enabled NFS root kernel.

I'd like to see this fixed, but I do not want to jump onto a solution
that changes the behaviour of mmap() w.r.t. revalidation. The current
behaviour dates back at least to 2.3.x if not before.

That's why I'm working slowly on this.

Cheers
Trond

2010-01-07 23:51:53

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

> I'd like to see this fixed, but I do not want to jump onto a solution
> that changes the behaviour of mmap() w.r.t. revalidation. The current
> behaviour dates back at least to 2.3.x if not before.

So do you have a plan to fix it?

I don't think it'll be possible to do drastic changes in the
VFS for 2.6.33, and it seems preserving the current semantics
would need that.

> That's why I'm working slowly on this.

Delaying a fix to after 2.6.33 is not an option imho.

It hits everyone with LOCKDEP enabled who uses mmap over NFS.
That's new in 2.6.33, previously LOCKDEP didn't diagnose this.

I'll keep using my patch, but I suppose once we're going more
towards a release you'll get more reports of this.

-Andi
--
[email protected] -- Speaking for myself only.

2010-01-08 00:15:54

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Fri, 2010-01-08 at 00:51 +0100, Andi Kleen wrote:
> > I'd like to see this fixed, but I do not want to jump onto a solution
> > that changes the behaviour of mmap() w.r.t. revalidation. The current
> > behaviour dates back at least to 2.3.x if not before.
>
> So do you have a plan to fix it?

Yes. I want to pursue Peter Zijlstra's patches, which split up the mmap
function into a set of parts which require the mmap_sem, and other parts
which don't, and that adds a filesystem callback that allows for
revalidation to occur outside the mmap_sem.

> I don't think it'll be possible to do drastic changes in the
> VFS for 2.6.33, and it seems preserving the current semantics
> would need that.
>
> > That's why I'm working slowly on this.
>
> Delaying a fix to after 2.6.33 is not an option imho.
>
> It hits everyone with LOCKDEP enabled who uses mmap over NFS.
> That's new in 2.6.33, previously LOCKDEP didn't diagnose this.
>
> I'll keep using my patch, but I suppose once we're going more
> towards a release you'll get more reports of this.

Why should this particular issue require us to rush into a solution?
This has been there for literally _years_, and I've never heard of a
single incident in which a deadlock actually occurred. The only reason
why we've noticed it at all is because lockdep has started to whine.

I agree it should be fixed.

I don't agree that it is urgent enough to warrant kneejerk reactions in
2.6.33 which change long established behaviours that people are actually
relying on.

Trond

2010-01-08 00:35:08

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 7 Jan 2010, Trond Myklebust wrote:
>
> Yes. I want to pursue Peter Zijlstra's patches, which split up the mmap
> function into a set of parts which require the mmap_sem, and other parts
> which don't, and that adds a filesystem callback that allows for
> revalidation to occur outside the mmap_sem.

I'm sorry, but that just sounds STUPID.

Why?

Because it means that you can trivially take page faults before the thing
is validated (think threads).

If that is ok, then why do the revalidate at all? Just do the open/close
consistency and validate at open time, not mmap time.

Linus

2010-01-08 00:43:32

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, Jan 07, 2010 at 07:14:42PM -0500, Trond Myklebust wrote:
> On Fri, 2010-01-08 at 00:51 +0100, Andi Kleen wrote:
> > > I'd like to see this fixed, but I do not want to jump onto a solution
> > > that changes the behaviour of mmap() w.r.t. revalidation. The current
> > > behaviour dates back at least to 2.3.x if not before.
> >
> > So do you have a plan to fix it?
>
> Yes. I want to pursue Peter Zijlstra's patches, which split up the mmap
> function into a set of parts which require the mmap_sem, and other parts
> which don't, and that adds a filesystem callback that allows for
> revalidation to occur outside the mmap_sem.

Is that really a 2.6.33 solution?

It sounds intrusive.

>
> > I don't think it'll be possible to do drastic changes in the
> > VFS for 2.6.33, and it seems preserving the current semantics
> > would need that.
> >
> > > That's why I'm working slowly on this.
> >
> > Delaying a fix to after 2.6.33 is not an option imho.
> >
> > It hits everyone with LOCKDEP enabled who uses mmap over NFS.
> > That's new in 2.6.33, previously LOCKDEP didn't diagnose this.
> >
> > I'll keep using my patch, but I suppose once we're going more
> > towards a release you'll get more reports of this.
>
> Why should this particular issue require us to rush into a solution?

Because every lockdep user NFS+mmap gets spammed now.

That's like having an oops in the startup sequence. The system
might still boot, but users are not happy.

> This has been there for literally _years_, and I've never heard of a
> single incident in which a deadlock actually occurred. The only reason
> why we've noticed it at all is because lockdep has started to whine.

I bet there were deadlocks somewhere, probably just didn't get reported.

-Andi
--
[email protected] -- Speaking for myself only.

2010-01-08 00:45:18

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

> If that is ok, then why do the revalidate at all? Just do the open/close
> consistency and validate at open time, not mmap time.

That's exactly what my patch does :)

-Andi

---

NFS: don't revalidate in mmap v2

nfs_revalidate_mapping takes i_mutex, but mmap already has mmap_sem
and taking i_mutex inside mmap_sem is not allowed by the VFS.

So don't revalidate on mmap time. In theory users could
rely on it, but it's unlikely enough for mmap.

This fixes a lockdep warning that happens at every boot
on my nfsroot test system.

v2: Fix unused variable warning, add comment

Signed-off-by: Andi Kleen <[email protected]>

---
fs/nfs/file.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux/fs/nfs/file.c
===================================================================
--- linux.orig/fs/nfs/file.c
+++ linux/fs/nfs/file.c
@@ -291,20 +291,20 @@ static int
nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
{
struct dentry *dentry = file->f_path.dentry;
- struct inode *inode = dentry->d_inode;
int status;

dprintk("NFS: mmap(%s/%s)\n",
dentry->d_parent->d_name.name, dentry->d_name.name);

- /* Note: generic_file_mmap() returns ENOSYS on nommu systems
- * so we call that before revalidating the mapping
- */
status = generic_file_mmap(file, vma);
- if (!status) {
+ /*
+ * This used to be a synchronization point,
+ * but this causes lock order inversions with i_mutex / mmap_sem.
+ * It's unclear anyone relies on mmap being a synchronization
+ * point anyways, so just removed it here -AK
+ */
+ if (!status)
vma->vm_ops = &nfs_file_vm_ops;
- status = nfs_revalidate_mapping(inode, file->f_mapping);
- }
return status;
}

--
[email protected] -- Speaking for myself only.

2010-01-08 01:04:39

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Fri, 2010-01-08 at 01:45 +0100, Andi Kleen wrote:
> > If that is ok, then why do the revalidate at all? Just do the open/close
> > consistency and validate at open time, not mmap time.
>
> That's exactly what my patch does :)

Actually, it doesn't....

It does remove the validation at mmap time, but it adds nothing at open
time to ensure that we call nfs_revalidate_mapping().

...and as I said, it breaks 'noac'.

Trond

2010-01-08 01:04:49

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 2010-01-07 at 16:34 -0800, Linus Torvalds wrote:
>
> On Thu, 7 Jan 2010, Trond Myklebust wrote:
> >
> > Yes. I want to pursue Peter Zijlstra's patches, which split up the mmap
> > function into a set of parts which require the mmap_sem, and other parts
> > which don't, and that adds a filesystem callback that allows for
> > revalidation to occur outside the mmap_sem.
>
> I'm sorry, but that just sounds STUPID.
>
> Why?
>
> Because it means that you can trivially take page faults before the thing
> is validated (think threads).

Which would mean that another process/thread already has part of the
file mmapped on the same client. I'm not arguing that have to revalidate
in _that_ case.

> If that is ok, then why do the revalidate at all? Just do the open/close
> consistency and validate at open time, not mmap time.

That violates the expected semantics of the 'noac' mount option.

Trond

2010-01-08 01:12:51

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 7 Jan 2010, Trond Myklebust wrote:
> >
> > Because it means that you can trivially take page faults before the thing
> > is validated (think threads).
>
> Which would mean that another process/thread already has part of the
> file mmapped on the same client. I'm not arguing that have to revalidate
> in _that_ case.

No, I'm talking about the new mapping. Nothing else.

If the mmap'ing thread releases mmap_sem, and then does the revalidate,
then you can have

thread1 thread2
------- -------

mmap
map it in
release mmap_sem
page-fault the mapping before it got validated
->post_mmap()
revalidate outside mmap_sem

See? No "already part of the file mmapped" case at all. The exact mmap
that you just set up - without the revalidation having happened.

In fact, because of this kind of _fundamental_ race, I don't see why I
would ever accept any patches that add multiple mmap() down-calls at
different phases to the filesystem at the VFS layer.

A filesystem that depends on the different phases would be a fundamentally
buggy filesystem. Right now mmap is "atomic", and you can pre-populate (or
pre-verify, like NFS does) the mapping in the _knowledge_ that there are
no page faults that will populate it concurrently. Exactly because we hold
the mmap_sem for writing.

Linus

2010-01-08 01:22:39

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 2010-01-07 at 17:12 -0800, Linus Torvalds wrote:
>
> On Thu, 7 Jan 2010, Trond Myklebust wrote:
> > >
> > > Because it means that you can trivially take page faults before the thing
> > > is validated (think threads).
> >
> > Which would mean that another process/thread already has part of the
> > file mmapped on the same client. I'm not arguing that have to revalidate
> > in _that_ case.
>
> No, I'm talking about the new mapping. Nothing else.
>
> If the mmap'ing thread releases mmap_sem, and then does the revalidate,
> then you can have
>
> thread1 thread2
> ------- -------
>
> mmap
> map it in
> release mmap_sem
> page-fault the mapping before it got validated
> ->post_mmap()
> revalidate outside mmap_sem
>
> See? No "already part of the file mmapped" case at all. The exact mmap
> that you just set up - without the revalidation having happened.
>
> In fact, because of this kind of _fundamental_ race, I don't see why I
> would ever accept any patches that add multiple mmap() down-calls at
> different phases to the filesystem at the VFS layer.
>
> A filesystem that depends on the different phases would be a fundamentally
> buggy filesystem. Right now mmap is "atomic", and you can pre-populate (or
> pre-verify, like NFS does) the mapping in the _knowledge_ that there are
> no page faults that will populate it concurrently. Exactly because we hold
> the mmap_sem for writing.

I don't think anyone has been advocating doing the revalidation _after_
the call to mmap_region(). All I want is to be able to do it as part of
the mmap() syscall. It would be quite OK to add a ->pre_mmap() (which is
what I believe Peter's patches do).

All I want to ensure is that people who use non-posix-lock based
synchronisation can set the 'noac' flag, and be assured that if mmap()
is called _after_ they have grabbed their lock, then the page cache will
be duly revalidated (under the lock), and the fresh data will be made
available.

Trond

2010-01-08 01:22:45

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 7 Jan 2010, Linus Torvalds wrote:
>
> A filesystem that depends on the different phases would be a fundamentally
> buggy filesystem. Right now mmap is "atomic", and you can pre-populate (or
> pre-verify, like NFS does) the mapping in the _knowledge_ that there are
> no page faults that will populate it concurrently. Exactly because we hold
> the mmap_sem for writing.

Side note: I realize that a filesystem could add its own lock, but that
doesn't _help_ anything. In that case, it needs to take the lock while
under the mmap_sem, and release it later in the "post-mmap-sem" callback,
but not only is that ugly, it means that there is a mmap_sem -> newlock
dependency between the two, which means that any lock dependency of code
that runs under the new lock will not really have changed. You might as
well have run it under the mmap_lock to begin with.

That's why I claim it's buggy.

I'm sure you can do clever things for special cases, but it all sounds
like you really need to fix the lockdep chain some way.

If you absolutely have to do the revalidation at mmap() time, then the
lock chain needs to be broken somewhere else. Since this is a filesystem,
and mmap_sem is involved, I can only assume that the other part of the
chain is "readdir()" as usual.

See the whole sysfs discussion about that (look for sysfs and filldir on
the kernel list a week or two ago).

Linus

2010-01-08 01:26:52

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 2010-01-07 at 20:22 -0500, Trond Myklebust wrote:
> All I want to ensure is that people who use non-posix-lock based
> synchronisation can set the 'noac' flag, and be assured that if mmap()
> is called _after_ they have grabbed their lock, then the page cache will
> be duly revalidated (under the lock), and the fresh data will be made
> available.

BTW: This is the same guarantee that we offer in the case of standard
read() (and write()) when 'noac' is set.

2010-01-08 01:30:57

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 7 Jan 2010, Trond Myklebust wrote:
>
> I don't think anyone has been advocating doing the revalidation _after_
> the call to mmap_region(). All I want is to be able to do it as part of
> the mmap() syscall. It would be quite OK to add a ->pre_mmap() (which is
> what I believe Peter's patches do).

->pre_mmap is better, but not obviously so. We'd have to call ->pre_mmap()
so _long_ before the mmap that it might be that the mmap never happens at
all (due to errors happening later).

Sounds like that would work in your particular case, though.

However, I still suspect that the lock inversion problem can probably be
fixed without any of that at all. Maybe you can just break the chain
somewhere else. I've not actually seen the lockdep chain, so I don't know
the deails. Pointers?

Linus

2010-01-08 01:35:23

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 7 Jan 2010, Linus Torvalds wrote:
>
> However, I still suspect that the lock inversion problem can probably be
> fixed without any of that at all. Maybe you can just break the chain
> somewhere else. I've not actually seen the lockdep chain, so I don't know
> the deails. Pointers?

For example, if we're talking about readdir having a lock the other way
(NFS lock taken before the mmap_sem), it's entirely possible that there is
nothing to "fix" but some lockdep annotation.

You cannot mmap a directory (and you can't readdir a non-directory), so if
it's a per-inode NFS lock, then the simplest fix _might_ be to just put
directory locks in a different lockdep class from non-directory locks.
That might fix it.

Of course, if it's not a per-inode lock, that doesn't help.

And maybe I'm missing something entirely, and such games with lockdep
classes are pointless, and I'm a moron. Quite possible.

Linus

2010-01-08 02:01:53

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

On Thu, 7 Jan 2010, Linus Torvalds wrote:
>
> You cannot mmap a directory (and you can't readdir a non-directory), so if
> it's a per-inode NFS lock, then the simplest fix _might_ be to just put
> directory locks in a different lockdep class from non-directory locks.
> That might fix it.
>
> Of course, if it's not a per-inode lock, that doesn't help.

Hmm. I notice that we already do this in unlock_new_inode() for i_mutex.
And I googled for the problem, and it does indeed seem to be about the
normal i_mutex <-> mmap_sem thing with filldir-vs-mmap.

Is there perhaps some path where NFS creates new inodes without going
through the normal path? Like the root inode or something? So then you'd
have a root inode with i_mutex annotated as being in the same class as a
regular file, which completes the lockdep chain.

Or maybe there is something else I'm missing.

I'm adding Peter and Ingo to the Cc as the "lockdep guys". Maybe they see
what I am missing.

Linus

2010-01-08 05:19:23

[permalink] [raw]

Subject: Re: [GIT PULL] Please pull NFS client bugfixes....

Linus Torvalds <[email protected]> writes:

> I've not actually seen the lockdep chain, so I don't know
> the deails. Pointers?

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.33-rc1-git3-MCE6 #12
-------------------------------------------------------
udevinfo/2562 is trying to acquire lock:
(&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<ffffffff81185188>] nfs_revalidate_mapping+0x7c/0xc5

but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff810ae75b>] sys_mmap_pgoff+0xd3/0x1a0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){++++++}:
[<ffffffff81068c48>] __lock_acquire+0x1432/0x1771
[<ffffffff81069043>] lock_acquire+0xbc/0xd9
[<ffffffff810b1a90>] might_fault+0x84/0xa4
[<ffffffff810e3c78>] filldir+0x6a/0xcb
[<ffffffff81180d58>] nfs_do_filldir+0x39d/0x4ac
[<ffffffff8118167c>] nfs_readdir+0x815/0x8b6
[<ffffffff810e3df7>] vfs_readdir+0x6c/0xa1
[<ffffffff810e3f6a>] sys_getdents+0x7d/0xc9
[<ffffffff810029eb>] system_call_fastpath+0x16/0x1b

-> #0 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[<ffffffff8106896d>] __lock_acquire+0x1157/0x1771
[<ffffffff81069043>] lock_acquire+0xbc/0xd9
[<ffffffff8149472c>] mutex_lock_nested+0x68/0x2d2
[<ffffffff81185188>] nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff81182c0a>] nfs_file_mmap+0x68/0x71
[<ffffffff810b99bf>] mmap_region+0x311/0x530
[<ffffffff810b9e6e>] do_mmap_pgoff+0x290/0x2f3
[<ffffffff810ae77c>] sys_mmap_pgoff+0xf4/0x1a0
[<ffffffff81006ba2>] sys_mmap+0x1d/0x1f
[<ffffffff810029eb>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by udevinfo/2562:
#0: (&mm->mmap_sem){++++++}, at: [<ffffffff810ae75b>] sys_mmap_pgoff+0xd3/0x1a0

stack backtrace:
Pid: 2562, comm: udevinfo Not tainted 2.6.33-rc1-git3-MCE6 #12
Call Trace:
[<ffffffff810672c8>] print_circular_bug+0xb3/0xc2
[<ffffffff8106896d>] __lock_acquire+0x1157/0x1771
[<ffffffff8105b031>] ? sched_clock_local+0x1c/0x80
[<ffffffff81069043>] lock_acquire+0xbc/0xd9
[<ffffffff81185188>] ? nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff8149472c>] mutex_lock_nested+0x68/0x2d2
[<ffffffff81185188>] ? nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff8105b15e>] ? sched_clock_cpu+0xc9/0xce
[<ffffffff81185188>] ? nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff810b991c>] ? mmap_region+0x26e/0x530
[<ffffffff81185188>] nfs_revalidate_mapping+0x7c/0xc5
[<ffffffff81182c0a>] nfs_file_mmap+0x68/0x71
[<ffffffff810b99bf>] mmap_region+0x311/0x530
[<ffffffff810b9e6e>] do_mmap_pgoff+0x290/0x2f3
[<ffffffff810ae77c>] sys_mmap_pgoff+0xf4/0x1a0
[<ffffffff81495c86>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81006ba2>] sys_mmap+0x1d/0x1f
[<ffffffff810029eb>] system_call_fastpath+0x16/0x1b

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-09 01:06:49

[permalink] [raw]

Subject: [RFC PATCH 0/2] Fix up the NFS mmap code

How about something like the following. I chose to wrap the call to
do_mmap_pgoff() instead of making a special ->pre_mmap(), since that
seems more consistent with the way we handle ->read() and ->write().

I also left sys_uselib() and do_execve() to rely on revalidate at
open(), since executables and libraries really are not ever expected to
change while open.

Cheers
Trond

---

Trond Myklebust (2):
NFS: Fix a potential deadlock in nfs_file_mmap()
VFS: Add a mmap_file() callback to struct file_operations

fs/nfs/file.c | 28 ++++++++++++++++++++++------
fs/nfs/inode.c | 4 ++++
include/linux/fs.h | 5 +++++
mm/filemap.c | 23 +++++++++++++++++++++++
mm/mmap.c | 11 ++++++++---
mm/nommu.c | 11 ++++++++---
6 files changed, 70 insertions(+), 12 deletions(-)

--
Signature

2010-01-09 01:07:33

[permalink] [raw]

Subject: [RFC PATCH 1/2] VFS: Add a mmap_file() callback to struct file_operations

Add a helper function to allow the NFS filesystem to hook mmap() system
calls and do the necessary page cache revalidation before passing control
to the VM layer.

Signed-off-by: Trond Myklebust <[email protected]>
---

include/linux/fs.h | 5 +++++
mm/filemap.c | 23 +++++++++++++++++++++++
mm/mmap.c | 11 ++++++++---
mm/nommu.c | 11 ++++++++---
4 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9147ca8..5d66b16 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1504,6 +1504,9 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+ unsigned long (*mmap_pgoff)(struct file *, unsigned long,
+ unsigned long, unsigned long,
+ unsigned long, unsigned long);
};

struct inode_operations {
@@ -2191,6 +2194,8 @@ extern int set_blocksize(struct block_device *, int);
extern int sb_set_blocksize(struct super_block *, int);
extern int sb_min_blocksize(struct super_block *, int);

+extern unsigned long generic_file_mmap_pgoff(struct file *, unsigned long,
+ unsigned long, unsigned long, unsigned long, unsigned long);
extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
diff --git a/mm/filemap.c b/mm/filemap.c
index 96ac6b0..f7717b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1594,6 +1594,29 @@ const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
};

+/**
+ * generic_file_mmap_pgoff - generic filesystem mmap_pgoff routine
+ * @file: file to mmap
+ * @addr: memory address to map to
+ * @len: length of the mapping
+ * @prot: memory protection flags
+ * @flags: mapping type
+ * @pgoff: starting page offset
+ */
+unsigned long generic_file_mmap_pgoff(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long prot,
+ unsigned long flags, unsigned long pgoff)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long retval;
+
+ down_write(&mm->mmap_sem);
+ retval = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+ up_write(&mm->mmap_sem);
+ return retval;
+}
+EXPORT_SYMBOL(generic_file_mmap_pgoff);
+
/* This is used for a general mmap of a disk file */

int generic_file_mmap(struct file * file, struct vm_area_struct * vma)
diff --git a/mm/mmap.c b/mm/mmap.c
index ee22989..3931811 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1047,6 +1047,9 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, pgoff)
{
+ unsigned long (*func)(struct file *, unsigned long,
+ unsigned long, unsigned long,
+ unsigned long, unsigned long);
struct file *file = NULL;
unsigned long retval = -EBADF;

@@ -1073,9 +1076,11 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,

flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

- down_write(&current->mm->mmap_sem);
- retval = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
- up_write(&current->mm->mmap_sem);
+ if (file && file->f_op && file->f_op->mmap_pgoff)
+ func = file->f_op->mmap_pgoff;
+ else
+ func = generic_file_mmap_pgoff;
+ retval = func(file, addr, len, prot, flags, pgoff);

if (file)
fput(file);
diff --git a/mm/nommu.c b/mm/nommu.c
index 1777386..4e31cb2 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1407,6 +1407,9 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, pgoff)
{
+ unsigned long (*func)(struct file *, unsigned long,
+ unsigned long, unsigned long,
+ unsigned long, unsigned long);
struct file *file = NULL;
unsigned long retval = -EBADF;

@@ -1418,9 +1421,11 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,

flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

- down_write(&current->mm->mmap_sem);
- retval = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
- up_write(&current->mm->mmap_sem);
+ if (file && file->f_op && file->f_op->mmap_pgoff)
+ func = file->f_op->mmap_pgoff;
+ else
+ func = generic_file_mmap_pgoff;
+ retval = func(file, addr, len, prot, flags, pgoff);

if (file)
fput(file);

2010-01-09 01:07:42

[permalink] [raw]

Subject: [RFC PATCH 2/2] NFS: Fix a potential deadlock in nfs_file_mmap()

We cannot call nfs_invalidate_mapping() inside file->f_ops->mmap(), since
this would cause us to grab the inode->i_mutex while already holding the
current->mm->mmap_sem (thus causing a potential ABBA deadlock with the file
write code, which can grab those locks in the opposite order).

We can fix this situation for the mmap() system call by using the new
mmap_pgoff() callback, which is called prior to taking the
current->mm->mmap_sem mutex.

We also add ensure that open() invalidates the mapping if the inode data is
stale so that other users of mmap() (mainly the exec and uselib system
calls) get up to date data too.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/file.c | 28 ++++++++++++++++++++++------
fs/nfs/inode.c | 4 ++++
2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6b89132..723e254 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -42,6 +42,9 @@ static int nfs_file_open(struct inode *, struct file *);
static int nfs_file_release(struct inode *, struct file *);
static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin);
static int nfs_file_mmap(struct file *, struct vm_area_struct *);
+static unsigned long nfs_file_mmap_pgoff(struct file * file,
+ unsigned long addr, unsigned long len, unsigned long prot,
+ unsigned long flags, unsigned long pgoff);
static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos,
struct pipe_inode_info *pipe,
size_t count, unsigned int flags);
@@ -78,6 +81,7 @@ const struct file_operations nfs_file_operations = {
.splice_write = nfs_file_splice_write,
.check_flags = nfs_check_flags,
.setlease = nfs_setlease,
+ .mmap_pgoff = nfs_file_mmap_pgoff,
};

const struct inode_operations nfs_file_inode_operations = {
@@ -290,24 +294,36 @@ nfs_file_splice_read(struct file *filp, loff_t *ppos,
static int
nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
{
- struct dentry *dentry = file->f_path.dentry;
- struct inode *inode = dentry->d_inode;
int status;

dprintk("NFS: mmap(%s/%s)\n",
- dentry->d_parent->d_name.name, dentry->d_name.name);
+ file->f_path.dentry->d_parent->d_name.name,
+ file->f_path.dentry->d_name.name);

/* Note: generic_file_mmap() returns ENOSYS on nommu systems
* so we call that before revalidating the mapping
*/
status = generic_file_mmap(file, vma);
- if (!status) {
+ if (!status)
vma->vm_ops = &nfs_file_vm_ops;
- status = nfs_revalidate_mapping(inode, file->f_mapping);
- }
return status;
}

+static unsigned long
+nfs_file_mmap_pgoff(struct file * file, unsigned long addr,
+ unsigned long len, unsigned long prot,
+ unsigned long flags, unsigned long pgoff)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ int status;
+
+ status = nfs_revalidate_mapping(inode, file->f_mapping);
+ if (status < 0)
+ return status;
+
+ return generic_file_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+}
+
/*
* Flush any dirty pages for this process, and check for write errors.
* The return status from this call provides a reliable indication of
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index faa0918..90e1d67 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -56,6 +56,8 @@
static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;

static void nfs_invalidate_inode(struct inode *);
+static int nfs_invalidate_mapping(struct inode *inode,
+ struct address_space *mapping);
static int nfs_update_inode(struct inode *, struct nfs_fattr *);

static struct kmem_cache * nfs_inode_cachep;
@@ -694,6 +696,8 @@ int nfs_open(struct inode *inode, struct file *filp)
nfs_file_set_open_context(filp, ctx);
put_nfs_open_context(ctx);
nfs_fscache_set_inode_cookie(inode, filp);
+ if (NFS_I(inode)->cache_validity & NFS_INO_INVALID_DATA)
+ nfs_invalidate_mapping(inode, filp->f_mapping);
return 0;
}

2010-01-09 01:17:51

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Fri, 8 Jan 2010, Trond Myklebust wrote:
>
> How about something like the following. I chose to wrap the call to
> do_mmap_pgoff() instead of making a special ->pre_mmap(), since that
> seems more consistent with the way we handle ->read() and ->write().

I still don't think that you can ever do mmap _and_ readdir on the same
inode, so there's something wrong with the lockdep annotations.

See my earlier mail about putting directory inodes in a different class
from non-directory inodes, and the email after that that points out that
we already do:

- inode_init_always():
lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);

- unlock_new_inode() (for directories):
lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);

and I suspect that the reason why NFS triggers lockdep problems is that
NFS probably plays some game with inodes (presumably the root inode) that
ends up bypassing the normal new-inode handling.

In short - I really don't think this issue merits VFS-level (or VM,
whatever you want to call it) hooks. There's a bug there, but I don't
think you're actually fixing the right thing.

And _especially_ not the way you did it, where the filesystem can wrap the
whole complex do_mmap_pgoff. The NFS use you have doesn't seem too bad,
but anybody who tries to be clever and avoid "generic_mmap_do_pgoff()"
would immediately be a major disaster.

Linus

2010-01-09 01:38:35

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Fri, Jan 08, 2010 at 05:17:14PM -0800, Linus Torvalds wrote:
>
>
> On Fri, 8 Jan 2010, Trond Myklebust wrote:
> >
> > How about something like the following. I chose to wrap the call to
> > do_mmap_pgoff() instead of making a special ->pre_mmap(), since that
> > seems more consistent with the way we handle ->read() and ->write().
>
> I still don't think that you can ever do mmap _and_ readdir on the same
> inode, so there's something wrong with the lockdep annotations.

readdir() is certainly a red herring. write(), OTOH, is quite real.
And there we do i_mutex followed by pagefaults.

I *REALLY* dislike Trond's solution, though.

Could we please get a sane expalanation of the reasons why nfs mmap
wants i_mutex in the first place? Before we add yet another hook
from hell and complicate already overcomplicated area...

2010-01-09 01:46:40

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Sat, Jan 09, 2010 at 01:38:25AM +0000, Al Viro wrote:
> On Fri, Jan 08, 2010 at 05:17:14PM -0800, Linus Torvalds wrote:
> >
> >
> > On Fri, 8 Jan 2010, Trond Myklebust wrote:
> > >
> > > How about something like the following. I chose to wrap the call to
> > > do_mmap_pgoff() instead of making a special ->pre_mmap(), since that
> > > seems more consistent with the way we handle ->read() and ->write().
> >
> > I still don't think that you can ever do mmap _and_ readdir on the same
> > inode, so there's something wrong with the lockdep annotations.
>
> readdir() is certainly a red herring. write(), OTOH, is quite real.
> And there we do i_mutex followed by pagefaults.
>
> I *REALLY* dislike Trond's solution, though.
>
> Could we please get a sane expalanation of the reasons why nfs mmap
> wants i_mutex in the first place? Before we add yet another hook
> from hell and complicate already overcomplicated area...

PS: mmap/write deadlock is real, AFAICT - unless something very subtle
prevents it, we can get buggered if we have two threads with the same
VM, mmap 1.4Mb from floppy and do
thread A:
write(fd_on_nfs, buffer_mmaped_from_floppy, 1440 * 1024);
thread B:
mmap(..., fd_on_nfs, ...)

It's not even particulary narrow.

2010-01-09 01:54:23

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] NFS: Fix a potential deadlock in nfs_file_mmap()

On Fri, Jan 08, 2010 at 07:56:24PM -0500, Trond Myklebust wrote:
> We cannot call nfs_invalidate_mapping() inside file->f_ops->mmap(), since
> this would cause us to grab the inode->i_mutex while already holding the
> current->mm->mmap_sem (thus causing a potential ABBA deadlock with the file
> write code, which can grab those locks in the opposite order).
>
> We can fix this situation for the mmap() system call by using the new
> mmap_pgoff() callback, which is called prior to taking the
> current->mm->mmap_sem mutex.
>
> We also add ensure that open() invalidates the mapping if the inode data is
> stale so that other users of mmap() (mainly the exec and uselib system
> calls) get up to date data too.

> + status = nfs_revalidate_mapping(inode, file->f_mapping);
> + if (status < 0)
> + return status;
> +
> + return generic_file_mmap_pgoff(file, addr, len, prot, flags, pgoff);

This is completely bogus. Why do you need i_mutex for that and what
the <expletives> does that really prevent? You might wait for a _loong_
time waiting for that mmap_sem, so what is really going on there?

2010-01-09 01:58:15

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Sat, 9 Jan 2010, Al Viro wrote:
>
> readdir() is certainly a red herring.

That's the one that lockdep reports, though. I still don't see why. Afaik,
the only place where NFS gets an inode is nfs_fhget(), and that seems to
do things correctly.

> write(), OTOH, is quite real. And there we do i_mutex followed by
> pagefaults.

Ho humm..

> Could we please get a sane expalanation of the reasons why nfs mmap
> wants i_mutex in the first place? Before we add yet another hook
> from hell and complicate already overcomplicated area...

Yeah, agreed.

Linus

2010-01-09 02:11:32

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Fri, Jan 08, 2010 at 05:57:27PM -0800, Linus Torvalds wrote:
>
>
> On Sat, 9 Jan 2010, Al Viro wrote:
> >
> > readdir() is certainly a red herring.
>
> That's the one that lockdep reports, though. I still don't see why. Afaik,
> the only place where NFS gets an inode is nfs_fhget(), and that seems to
> do things correctly.

Well, sure - it steps on i_mutex-before-mmmap_sem first from ls somewhere and
records the ordering for posterity. Then NFS steps into mmap() (on a
different inode) and gets conflicting ordering.

It would be a false positive if rules for NFS *really* had been different
and it could safely grab i_mutex on NFS inodes inside mmap_sem. It can't.
The rules really are the same. And readdir is just the earliest case of
kernel stepping on mmap_sem while holding *some* i_mutex. write() is
another and there i_mutex can very well be the same as in case of mmap().

lockdep doesn't make a distinction (and really, how many paths reinforcing
the normal lock ordering would you record?), but if we'd given i_mutex of
NFS regular files a class of its own, we'd see a warning with nfs write
instead of readdir...

2010-01-09 02:22:38

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Sat, 9 Jan 2010, Al Viro wrote:
>
> Well, sure - it steps on i_mutex-before-mmmap_sem first from ls somewhere and
> records the ordering for posterity. Then NFS steps into mmap() (on a
> different inode) and gets conflicting ordering.

Look closer: the inodes for directories and for non-directories have
i_mutex in different lockdep classes.

So that "on a different inode" thing should have made it a non-issue,
since there is no actual chain back. There is "mmap_sem ->
i_mutex_regular_file" (for mmap) and there is "i_mutex_directory ->
mmap_sem" (for filldir), but that isn't an ABBA.

The problem _seems_ to be (if I read Andi's chain correctly) that a
directory hasn't gone through the i_mutex_dir_key change, so filldir ends
up being counted against the default i_mutex_key.

Linus

2010-01-09 02:30:42

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Fri, Jan 08, 2010 at 06:22:01PM -0800, Linus Torvalds wrote:
>
>
> On Sat, 9 Jan 2010, Al Viro wrote:
> >
> > Well, sure - it steps on i_mutex-before-mmmap_sem first from ls somewhere and
> > records the ordering for posterity. Then NFS steps into mmap() (on a
> > different inode) and gets conflicting ordering.
>
> Look closer: the inodes for directories and for non-directories have
> i_mutex in different lockdep classes.
>
> So that "on a different inode" thing should have made it a non-issue,
> since there is no actual chain back. There is "mmap_sem ->
> i_mutex_regular_file" (for mmap) and there is "i_mutex_directory ->
> mmap_sem" (for filldir), but that isn't an ABBA.
>
> The problem _seems_ to be (if I read Andi's chain correctly) that a
> directory hasn't gone through the i_mutex_dir_key change, so filldir ends
> up being counted against the default i_mutex_key.

Interesting... There's nfs_update_inode(), but it ought to scream bloody
murder on the type change (and its return value is ignore by nfs_fhget(),
BTW).

2010-01-09 02:40:14

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Sat, Jan 09, 2010 at 02:30:35AM +0000, Al Viro wrote:

> Interesting... There's nfs_update_inode(), but it ought to scream bloody
> murder on the type change (and its return value is ignore by nfs_fhget(),
> BTW).

Uh-oh... Just what will happen if we get FATTR_MODE without FATTR_TYPE,
BTW? AFAICS, the check for changed type won't trigger, but we will
get type bits in i_mode set to 0. Probably unrelated, but...

2010-01-09 02:43:59

[permalink] [raw]

Subject: Re: [RFC PATCH 0/2] Fix up the NFS mmap code

On Sat, Jan 09, 2010 at 02:40:05AM +0000, Al Viro wrote:
> On Sat, Jan 09, 2010 at 02:30:35AM +0000, Al Viro wrote:
>
> > Interesting... There's nfs_update_inode(), but it ought to scream bloody
> > murder on the type change (and its return value is ignore by nfs_fhget(),
> > BTW).
>
> Uh-oh... Just what will happen if we get FATTR_MODE without FATTR_TYPE,
> BTW? AFAICS, the check for changed type won't trigger, but we will
> get type bits in i_mode set to 0. Probably unrelated, but...

Definitely unrelated - we'll get i_mode buggered, but i_op remains set.

2010-01-10 02:00:35