LinuxLists.cc - [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

2009-04-16 14:28:20

Subject: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.

Signed-off-by: Alessio Igor Bogani <[email protected]>
---
fs/namespace.c | 16 +++++++++-------
fs/super.c | 9 ++++-----
2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6f54e4..fcebca1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -11,7 +11,7 @@
#include <linux/syscalls.h>
#include <linux/slab.h>
#include <linux/sched.h>
-#include <linux/smp_lock.h>
+#include <linux/mutex.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/acct.h>
@@ -51,6 +51,8 @@ static struct rw_semaphore namespace_sem;
struct kobject *fs_kobj;
EXPORT_SYMBOL_GPL(fs_kobj);

+DEFINE_MUTEX(mount_lock);
+
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
@@ -1073,9 +1075,9 @@ static int do_umount(struct vfsmount *mnt, int flags)
*/

if (flags & MNT_FORCE && sb->s_op->umount_begin) {
- lock_kernel();
+ mutex_lock(&mount_lock);
sb->s_op->umount_begin(sb);
- unlock_kernel();
+ mutex_unlock(&mount_lock);
}

/*
@@ -1094,9 +1096,9 @@ static int do_umount(struct vfsmount *mnt, int flags)
*/
down_write(&sb->s_umount);
if (!(sb->s_flags & MS_RDONLY)) {
- lock_kernel();
+ mutex_lock(&mount_lock);
retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
- unlock_kernel();
+ mutex_unlock(&mount_lock);
}
up_write(&sb->s_umount);
return retval;
@@ -2078,10 +2080,10 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
if (retval < 0)
goto out3;

- lock_kernel();
+ mutex_lock(&mount_lock);
retval = do_mount((char *)dev_page, dir_page, (char *)type_page,
flags, (void *)data_page);
- unlock_kernel();
+ mutex_unlock(&mount_lock);
free_page(data_page);

out3:
diff --git a/fs/super.c b/fs/super.c
index d9903a3..c503758 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -23,7 +23,6 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/init.h>
-#include <linux/smp_lock.h>
#include <linux/acct.h>
#include <linux/blkdev.h>
#include <linux/quotaops.h>
@@ -46,6 +45,8 @@
LIST_HEAD(super_blocks);
DEFINE_SPINLOCK(sb_lock);

+extern struct mutex mount_lock;
+
/**
* alloc_super - create new superblock
* @type: filesystem type superblock should belong to
@@ -684,13 +685,11 @@ static void do_emergency_remount(struct work_struct *work)
down_read(&sb->s_umount);
if (sb->s_root && sb->s_bdev && !(sb->s_flags & MS_RDONLY)) {
/*
- * ->remount_fs needs lock_kernel().
- *
* What lock protects sb->s_flags??
*/
- lock_kernel();
+ mutex_lock(&mount_lock);
do_remount_sb(sb, MS_RDONLY, NULL, 1);
- unlock_kernel();
+ mutex_unlock(&mount_lock);
}
drop_super(sb);
spin_lock(&sb_lock);
--
1.6.0.4

2009-04-16 14:36:44

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Thu, Apr 16, 2009 at 04:27:58PM +0200, Alessio Igor Bogani wrote:
> Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.

Could you try to explain what these actuall try to protect?

> unsigned long tmp = ((unsigned long)mnt / L1_CACHE_BYTES);
> @@ -1073,9 +1075,9 @@ static int do_umount(struct vfsmount *mnt, int flags)
> */
>
> if (flags & MNT_FORCE && sb->s_op->umount_begin) {
> - lock_kernel();
> + mutex_lock(&mount_lock);
> sb->s_op->umount_begin(sb);
> - unlock_kernel();
> + mutex_unlock(&mount_lock);

This is a very easy case, just move the lock into ->umount_begin. And
then ping the maintainers of the 5 instances actually making use of it -
I suspect none of them actually require it.

> @@ -1094,9 +1096,9 @@ static int do_umount(struct vfsmount *mnt, int flags)
> */
> down_write(&sb->s_umount);
> if (!(sb->s_flags & MS_RDONLY)) {
> - lock_kernel();
> + mutex_lock(&mount_lock);
> retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
> - unlock_kernel();
> + mutex_unlock(&mount_lock);

I suspect moving lock_kernel down into ->remount_fs is the much better
option. Will require some audit of do_remount_sb, though.

> - lock_kernel();
> + mutex_lock(&mount_lock);
> retval = do_mount((char *)dev_page, dir_page, (char *)type_page,
> flags, (void *)data_page);
> - unlock_kernel();
> + mutex_unlock(&mount_lock);

Again, much better to push it down and probably eliminate it completely
for all sane filesystems.

2009-04-16 16:07:28

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Alessio Igor Bogani <[email protected]> wrote:

> Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.
>
> Signed-off-by: Alessio Igor Bogani <[email protected]>
> ---
> fs/namespace.c | 16 +++++++++-------
> fs/super.c | 9 ++++-----
> 2 files changed, 13 insertions(+), 12 deletions(-)

Ok, this patch needs to be flamed^W commented on by Al.

Al: this patch is very likely broken as i cannot imagine you leaving
the BKL there just so. So lets accept that (and your NAK) as a given
and not get upset about it too much.

We are willing to fix any side effects and preconditions before this
can be done - and it would be nice if you could donate a few minutes
to this effort by enumerating those preconditions for us, so that we
can provide the real fixes. No-one knows this code better than you
so even if we could guess our way around to a certain degree, some
maintainer guidance and insight would be deeply appreciated.

I'm wondering how much the BKL use here is made necessary by the
sys_open() BKL use in device drivers. Jonathan has done extensive
work on the sys_open front (and there's more such work in
tip:core/kill-the-BKL) - perhaps that has largely paved the way for
this change?

There's also ioctl BKL use - is the BKL use here in sys_mount
necessiated by the (naked) BKL use in those handlers?

Thanks,

Ingo

2009-04-16 16:50:19

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Christoph Hellwig <[email protected]> wrote:

> On Thu, Apr 16, 2009 at 04:27:58PM +0200, Alessio Igor Bogani wrote:
> > Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.
>
> Could you try to explain what these actuall try to protect?

They dont really protect anything - the patch is wrong and
equivalent to a plain removal of the BKL.

The only case we found to ever matter in practice is NFS: it really
wants to get rid of the BKL in nfsd_get_sb(). So pushing down the
BKL lock into per filesystems and then removing it from NFS should
do the trick.

Would be nice to have some tentative Ack (or, a tentative
non-immediate-NAK) from Al before we go touch a lot of filesystems
though. Stupid dont-waste-human-effort considerations and stuff.

For us, the much simpler solution would be to drop the BKL in
nfsd_get_sb() and go on with life without to touch a dozen or so
filesystems. Alessio, mind trying that too, is it a solution for
your testcase?

Ingo

2009-04-16 16:58:59

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Ingo Molnar <[email protected]> wrote:

> * Alessio Igor Bogani <[email protected]> wrote:
>
> > Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.
> >
> > Signed-off-by: Alessio Igor Bogani <[email protected]>
> > ---
> > fs/namespace.c | 16 +++++++++-------
> > fs/super.c | 9 ++++-----
> > 2 files changed, 13 insertions(+), 12 deletions(-)
>
> Ok, this patch needs to be flamed^W commented on by Al.
>
> Al: this patch is very likely broken as i cannot imagine you
> leaving the BKL there just so. So lets accept that (and your NAK)
> as a given and not get upset about it too much.
>
> We are willing to fix any side effects and preconditions before
> this can be done [...]

this == remove the BKL. Not introduce a needless mutex like this
patch does.

Ingo

2009-04-16 17:02:16

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Thu, Apr 16, 2009 at 06:49:27PM +0200, Ingo Molnar wrote:
> They dont really protect anything - the patch is wrong and
> equivalent to a plain removal of the BKL.
>
> The only case we found to ever matter in practice is NFS: it really
> wants to get rid of the BKL in nfsd_get_sb(). So pushing down the
> BKL lock into per filesystems and then removing it from NFS should
> do the trick.
>
> Would be nice to have some tentative Ack (or, a tentative
> non-immediate-NAK) from Al before we go touch a lot of filesystems
> though. Stupid dont-waste-human-effort considerations and stuff.
>
> For us, the much simpler solution would be to drop the BKL in
> nfsd_get_sb() and go on with life without to touch a dozen or so
> filesystems. Alessio, mind trying that too, is it a solution for
> your testcase?

What about trying to attack it piece-mail? ->unmount_begin is really
easy. The only one that doesn't protect everything properly is
9p, but it doesn't protect the state variable deep down a few levels
of function calls at all.

->remount_fs should be easy enough to, we do have proper per-sb
protection here, but do_remount_sb will need a bit of an audit.
(and of course pushing lock_kernel down into the many instances and
leave the cleanup-work to the fs maintainers).

The actual mount path is more interesting as there are quite a few cases
there. As a first step you can take lock_kernel from outside do_mount
into the various do_foo calls inside it, and then work on those piece
by piece.

2009-04-16 17:14:24

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Christoph Hellwig <[email protected]> wrote:

> On Thu, Apr 16, 2009 at 06:49:27PM +0200, Ingo Molnar wrote:
>
> > They dont really protect anything - the patch is wrong and
> > equivalent to a plain removal of the BKL.
> >
> > The only case we found to ever matter in practice is NFS: it
> > really wants to get rid of the BKL in nfsd_get_sb(). So pushing
> > down the BKL lock into per filesystems and then removing it from
> > NFS should do the trick.
> >
> > Would be nice to have some tentative Ack (or, a tentative
> > non-immediate-NAK) from Al before we go touch a lot of
> > filesystems though. Stupid dont-waste-human-effort
> > considerations and stuff.
> >
> > For us, the much simpler solution would be to drop the BKL in
> > nfsd_get_sb() and go on with life without to touch a dozen or so
> > filesystems. Alessio, mind trying that too, is it a solution for
> > your testcase?
>
> What about trying to attack it piece-mail? ->unmount_begin is
> really easy. The only one that doesn't protect everything
> properly is 9p, but it doesn't protect the state variable deep
> down a few levels of function calls at all.
>
> ->remount_fs should be easy enough to, we do have proper per-sb
> protection here, but do_remount_sb will need a bit of an audit.
> (and of course pushing lock_kernel down into the many instances
> and leave the cleanup-work to the fs maintainers).
>
> The actual mount path is more interesting as there are quite a few
> cases there. As a first step you can take lock_kernel from
> outside do_mount into the various do_foo calls inside it, and then
> work on those piece by piece.

We'd be glad to - but only with full principle and workflow backing
of VFS-folks. This has been going on for more than a year - with
ancient commits in tip:core/kill-the-BKL. I cannot mix stuff into it
that gets eventual hostile treatment and NAKs a few months down the
line, should this be submitted upstream.

Ingo

2009-04-16 23:50:44

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Thu, Apr 16, 2009 at 04:27:58PM +0200, Alessio Igor Bogani wrote:
> Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.

NAK. Push BKL into instances if you care and deal with those one by one.
New and unwarranted system-wide exclusion around hell knows what code
(we *are* talking about the stuff that is very likely to do at least some
IO, dependent on fs internals, phase of moon and amount of farting gnats
in the vicinity) is right out.

2009-04-16 23:57:09

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Thu, Apr 16, 2009 at 06:06:45PM +0200, Ingo Molnar wrote:
>
> * Alessio Igor Bogani <[email protected]> wrote:
>
> > Replace ths BKL in sys_mount()/sys_umount() syscalls with a regular mutex.
> >
> > Signed-off-by: Alessio Igor Bogani <[email protected]>
> > ---
> > fs/namespace.c | 16 +++++++++-------
> > fs/super.c | 9 ++++-----
> > 2 files changed, 13 insertions(+), 12 deletions(-)
>
> Ok, this patch needs to be flamed^W commented on by Al.
>
> Al: this patch is very likely broken as i cannot imagine you leaving
> the BKL there just so. So lets accept that (and your NAK) as a given
> and not get upset about it too much.

NAK is certainly given. Dealing with force-umount is laughable - there
are so few instances, that we simply ought to take (all two? of) them
individually.

remount is potentially nastier, but then it *is* nasty. Again, it's only
per-fs stuff, so the obvious first step is taking BKL down into the instances.
It doesn't protect anything in VFS; all uses are fs internal, so that'll
take review of individual filesystems.

NOTE: do not assume that code in fs/foo/* is correct; "it doesn't take BKL
elsewhere" does _not_ mean that we don't have races. IOW, the same review
ought to look for such beasts and deal with them. Mere "oh, no BKL anywhere
in that fs" is not enough to discard the ->remount_fs() instance.

> I'm wondering how much the BKL use here is made necessary by the
> sys_open() BKL use in device drivers. Jonathan has done extensive
> work on the sys_open front (and there's more such work in
> tip:core/kill-the-BKL) - perhaps that has largely paved the way for
> this change?
>
> There's also ioctl BKL use - is the BKL use here in sys_mount
> necessiated by the (naked) BKL use in those handlers?

No. It's all about fs internals protected from themselves. ->remount_fs()
tends to be grotty shite, so nobody had enough courage to do proper review
and fixes. And it's not a place where we would really suffer from contention,
so it didn't get high priority that way either.

2009-04-17 00:02:23

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Al Viro <[email protected]> wrote:

> remount is potentially nastier, but then it *is* nasty. Again,
> it's only per-fs stuff, so the obvious first step is taking BKL
> down into the instances. It doesn't protect anything in VFS; all
> uses are fs internal, so that'll take review of individual
> filesystems.
>
> NOTE: do not assume that code in fs/foo/* is correct; "it doesn't
> take BKL elsewhere" does _not_ mean that we don't have races.
> IOW, the same review ought to look for such beasts and deal with
> them. Mere "oh, no BKL anywhere in that fs" is not enough to
> discard the ->remount_fs() instance.

what kind of races do you mean? Timing sensitive ones that are there
just are not easy to trigger with the BKL held?

Or actual locking interaction between that body of BKL code and all
other BKL using code?

Ingo

2009-04-17 00:05:39

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Thu, Apr 16, 2009 at 01:01:50PM -0400, Christoph Hellwig wrote:
> ->remount_fs should be easy enough to, we do have proper per-sb
> protection here, but do_remount_sb will need a bit of an audit.
> (and of course pushing lock_kernel down into the many instances and
> leave the cleanup-work to the fs maintainers).
>
> The actual mount path is more interesting as there are quite a few cases
> there. As a first step you can take lock_kernel from outside do_mount
> into the various do_foo calls inside it, and then work on those piece
> by piece.

The only place that might care is ->get_sb() (i.e. old ->read_super()).
And only for protection of fs-type-wide data structures inside the
fs/foo/* - anything in VFS doesn't give a damn (e.g. a realistic candidate
might be something that maintains a private list of all sb->s_fs_info for
this type and doesn't bother to do any locking, relying on BKL for all
manipulations).

->write_super() and ->put_super() are other candidates, for the same
reason. That's where BKL is generic_shutdown is coming from.

Note that while we do have other users of do_kern_mount(), they tend to
be limited to subset of fs types, so again, do not assume that "we use
do_kern_mount() without BKL anyway" means that we are safe on that path.

I'd suggest pushing that crap down into individual filesystems again.
They *ARE* serialized for given superblock, so we really are looking for
cross-fs-instance data structures.

2009-04-17 00:14:00

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 02:01:42AM +0200, Ingo Molnar wrote:
>
> * Al Viro <[email protected]> wrote:
>
> > remount is potentially nastier, but then it *is* nasty. Again,
> > it's only per-fs stuff, so the obvious first step is taking BKL
> > down into the instances. It doesn't protect anything in VFS; all
> > uses are fs internal, so that'll take review of individual
> > filesystems.
> >
> > NOTE: do not assume that code in fs/foo/* is correct; "it doesn't
> > take BKL elsewhere" does _not_ mean that we don't have races.
> > IOW, the same review ought to look for such beasts and deal with
> > them. Mere "oh, no BKL anywhere in that fs" is not enough to
> > discard the ->remount_fs() instance.
>
> what kind of races do you mean? Timing sensitive ones that are there
> just are not easy to trigger with the BKL held?
>
> Or actual locking interaction between that body of BKL code and all
> other BKL using code?

Old foo_read_super/foo_write_super/foo_put_super/foo_remount_fs for the
same foo. IOW, per-driver (and not per-fs - that's taken care of) data
structures. Arbitrary weird ones.

2009-04-17 00:28:19

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Al Viro <[email protected]> wrote:

> On Fri, Apr 17, 2009 at 02:01:42AM +0200, Ingo Molnar wrote:
> >
> > * Al Viro <[email protected]> wrote:
> >
> > > remount is potentially nastier, but then it *is* nasty. Again,
> > > it's only per-fs stuff, so the obvious first step is taking BKL
> > > down into the instances. It doesn't protect anything in VFS; all
> > > uses are fs internal, so that'll take review of individual
> > > filesystems.
> > >
> > > NOTE: do not assume that code in fs/foo/* is correct; "it doesn't
> > > take BKL elsewhere" does _not_ mean that we don't have races.
> > > IOW, the same review ought to look for such beasts and deal with
> > > them. Mere "oh, no BKL anywhere in that fs" is not enough to
> > > discard the ->remount_fs() instance.
> >
> > what kind of races do you mean? Timing sensitive ones that are there
> > just are not easy to trigger with the BKL held?
> >
> > Or actual locking interaction between that body of BKL code and
> > all other BKL using code?
>
> Old foo_read_super/foo_write_super/foo_put_super/foo_remount_fs
> for the same foo. IOW, per-driver (and not per-fs - that's taken
> care of) data structures. Arbitrary weird ones.

Could you give an example of such a weird interaction?

I fear, unless i'm misunderstanding your feedback, that you are
setting the purist's irrealistically high burden to get rid of the
BKL from the VFS here.

"Arbitrary weird ones" means all BKL using sites in the kernel - all
~800 ones - up to 800x800 == close to a million interactions to
check.

That's not manageable humanly, nor does it really matter: most BKL
assumptions have bit-rotted anyway already (by the sheer entropy of
kernel facilities growing new schedule() sites or moving them, over
a period of 10 years or more) and many BKL using drivers are broken
already or dont need it at all.

Thanks,

Ingo

2009-04-17 00:44:41

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 02:27:44AM +0200, Ingo Molnar wrote:

> Could you give an example of such a weird interaction?

If I could give you one at the moment, I'd sent patches regardless of BKL
crusades. I _have_ run into such crap before and it had been just that -
crap.

> I fear, unless i'm misunderstanding your feedback, that you are
> setting the purist's irrealistically high burden to get rid of the
> BKL from the VFS here.
>
> "Arbitrary weird ones" means all BKL using sites in the kernel - all
> ~800 ones - up to 800x800 == close to a million interactions to
> check.

Sigh... How about dumping that lovely strawman? I've exsoddingplicitly
told you that all such stuff is *within* *individual* *fs* *driver*.

Start with taking these guys down into the superblock methods in question.
Drop that junk on VFS-only side of things completely (mount --move,
mount --bind, etc.). Then we go looking for data structures that are
a) internal to fs driver
b) accessed by methods in question (in that fs driver)
c) are shared between different filesystems.

Analysis is on per-fs basis. And getting rid of these turds doesn't have
to happen in one patch.

2009-04-17 16:57:17

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Al Viro <[email protected]> wrote:

> > > Old
> > > foo_read_super/foo_write_super/foo_put_super/foo_remount_fs
> > > for the same foo. IOW, per-driver (and not per-fs - that's
> > > taken care of) data structures. Arbitrary weird ones.
[...]
> > I fear, unless i'm misunderstanding your feedback, that you are
> > setting the purist's irrealistically high burden to get rid of the
> > BKL from the VFS here.
> >
> > "Arbitrary weird ones" means all BKL using sites in the kernel - all
> > ~800 ones - up to 800x800 == close to a million interactions to
> > check.
>
> Sigh... How about dumping that lovely strawman? I've
> exsoddingplicitly told you that all such stuff is *within*
> *individual* *fs* *driver*.

Ah, i misunderstood: "per-driver (and not per-fs" to mean to include
all other BKL-using Linux drivers as well ;-)

The 'not per-fs' excluded the 'fs driver' meaning (to me).

Per fs analysis is of course a must-review if we weaken or change
locking.

> Start with taking these guys down into the superblock methods in question.
> Drop that junk on VFS-only side of things completely (mount --move,
> mount --bind, etc.). Then we go looking for data structures that are
> a) internal to fs driver
> b) accessed by methods in question (in that fs driver)
> c) are shared between different filesystems.
>
> Analysis is on per-fs basis. And getting rid of these turds
> doesn't have to happen in one patch.

Stupid question regarding c): wouldnt such data structures go via
the VFS - which you said was free of BKL constraints? Or are there
interconnected private data structures between certain types of
closely related filesystems that the VFS does not know about? (and
hence might have BKL assumptions)

Ingo

2009-04-17 17:05:27

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, 2009-04-17 at 18:56 +0200, Ingo Molnar wrote:

> Stupid question regarding c): wouldnt such data structures go via
> the VFS - which you said was free of BKL constraints? Or are there
> interconnected private data structures between certain types of
> closely related filesystems that the VFS does not know about? (and
> hence might have BKL assumptions)

The VFS is stuffed with ->private like pointers for filesystems to flesh
out, and I could well imagine some implicit serialization between the
various (4?) VFS hooks that are currently still under BKL.

Anyway, it seems quite clear that the first thing is to push the current
BKL usage down into the filesystems -- which should be somewhat
straight-forward.

After that it really comes down to picking off these filesystems one at
a time, which will really need a proper audit, there just ain't a proper
substitute for thinking ;-)

2009-04-17 17:28:30

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, 17 Apr 2009, Peter Zijlstra wrote:
>
> Anyway, it seems quite clear that the first thing is to push the current
> BKL usage down into the filesystems -- which should be somewhat
> straight-forward.

Yes, if somebody sends the obvious mechanical patch, we can apply that
easily. Then, most common filesystems can probably remove the BKL
trivially by maintainers that know that they don't do anything at all with
it.

Of course, right now we do hold the BKL over _multiple_ downcalls, so in
that sense it's not actually totally 100% correct and straightforward to
just move it down. Eg in the generic_shutdown_super() case we do

lock_kernel();
->write_super();
->put_super();
invalidate_inodes();
unlock_kernel();

and obviously if we split it up so that we push a lock_kernel() into both,
we end up unlocking in between. I doubt anything cares, but it's still a
technical difference.

There are similar issues with 'remount' holding the BKL over longer
sequences.

Btw, the superblock code really does seem to depend on lock_kernel. Those
"sb->s_flags" accesses are literally not protected by anything else afaik.

That said, I think that fs/locks.c is likely a much bigger issue. Very few
people care about any realtimeness of mount/unmount/remount. But file
locking? That is much more likely to be an issue.

And I know we had patches for file locking BKL-removal at some point. What
happened to them? Were they just broken, or forgotten, or waiting in Jon's
tree, or what?

Linus

2009-04-17 17:31:53

by Jonathan Corbet

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, 17 Apr 2009 10:21:06 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> And I know we had patches for file locking BKL-removal at some point. What
> happened to them? Were they just broken, or forgotten, or waiting in Jon's
> tree, or what?

Not in my tree, I've not seen them. I could maybe take a look at that,
I've just about recovered from my fasync experience...:)

jon

2009-04-17 17:34:46

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 06:56:43PM +0200, Ingo Molnar wrote:

> Stupid question regarding c): wouldnt such data structures go via
> the VFS - which you said was free of BKL constraints? Or are there
> interconnected private data structures between certain types of
> closely related filesystems that the VFS does not know about? (and
> hence might have BKL assumptions)

Shared between different instances of this fs type. E.g. if FAT had
its hashtable not per-superblock (as it is now) but system-wide (as it
used to be prior to [PATCH] FAT: the inode hash from per module to per sb),
it would be suspicious. In this particular case, it was OK from the
very beginning (I'd used a system-wide spinlock to protect it when it
had been introduced and per-superblock splitup made that spinlock per-sb
along with the hash table).

IOW, I wouldn't particulary worry about interactions between different
fs _types_ - it's interactions between the different fs _instances_
within the same driver that are likely sources of PITA.

ObOtherQuestion: no, I don't think that bdev open and bdev ioctls situation
is related to this group of BKL users. sb_set_blocksize() might, in principle,
want a memory barrier of some kind, but that's it.

2009-04-17 17:41:47

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 07:04:45PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-04-17 at 18:56 +0200, Ingo Molnar wrote:
>
> > Stupid question regarding c): wouldnt such data structures go via
> > the VFS - which you said was free of BKL constraints? Or are there
> > interconnected private data structures between certain types of
> > closely related filesystems that the VFS does not know about? (and
> > hence might have BKL assumptions)
>
> The VFS is stuffed with ->private like pointers for filesystems to flesh
> out, and I could well imagine some implicit serialization between the
> various (4?) VFS hooks that are currently still under BKL.

Explicit one, TYVM... Anyway, any fs that dares to use private data of
objects from another filesystem deserves everything it gets; I'm not
aware of any that would do that and I have no sympathy whatsoever to
any that would try.

That's not what I'd been talking about, though; what we want to watch out
for is sharing of data structures by some (or all) fs instances of given
type. I.e. internals of particular fs driver, not cross-driver problems.

2009-04-17 18:08:59

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 10:21:06AM -0700, Linus Torvalds wrote:

> Of course, right now we do hold the BKL over _multiple_ downcalls, so in
> that sense it's not actually totally 100% correct and straightforward to
> just move it down. Eg in the generic_shutdown_super() case we do
>
> lock_kernel();
> ->write_super();
> ->put_super();
> invalidate_inodes();
> unlock_kernel();
>
> and obviously if we split it up so that we push a lock_kernel() into both,
> we end up unlocking in between. I doubt anything cares, but it's still a
> technical difference.

No, that's OK. Anything that would expect on lack of blocking between
the callers of ->write_super() and ->put_super() is simply insane. Not
that other callers of ->write_super() had been under BKL, while we are
at it...

> There are similar issues with 'remount' holding the BKL over longer
> sequences.
>
> Btw, the superblock code really does seem to depend on lock_kernel. Those
> "sb->s_flags" accesses are literally not protected by anything else afaik.

Modifications in there *should* be protected by ->s_umount. Except that
emergency_remount() does down_read() instead of down_write(), for some
reason. And that fs going r/o on error very likely will not hold any
locks at all, BKL included.

Note that most of the readers really couldn't care less about protection.
Single-shot tests for one bit like "is this fs mounted noatime right now?"
are OK as is - we don't *care* if it races with remount and no way to
do anything about such race anyway.

Read-only is the main exception; we should be mostly OK since the per-vfsmount
r/o rework, but "I have an error and I'll go r/o now" stuff is still messy.

> That said, I think that fs/locks.c is likely a much bigger issue. Very few
> people care about any realtimeness of mount/unmount/remount. But file
> locking? That is much more likely to be an issue.

That is much more likely to require really non-trivial work, BTW. That code
is a *mess* and inventing sane locking for it will be painful.

2009-04-17 18:15:41

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, 17 Apr 2009, Jonathan Corbet wrote:

> On Fri, 17 Apr 2009 10:21:06 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
> > And I know we had patches for file locking BKL-removal at some point. What
> > happened to them? Were they just broken, or forgotten, or waiting in Jon's
> > tree, or what?
>
> Not in my tree, I've not seen them. I could maybe take a look at that,
> I've just about recovered from my fasync experience...:)

Hmm. It might also just be my fevered imagination. I'd like to say it was
Matthew Wilcox, but really, my mind is going.

Ahh. Bug google backs me up. As long as I have google, I can keep
Alzheimer's at bay: "Negative scalability by removal of lock_kernel()"
thread on lkml back in October 2000. After we had actually done the BKL
removal.

So we actually did apply it (in 2.4.0-test9, and then reverted it again
(in -test11, I think). Google for "file_lock_sem fs/locks.c" and see some
of the discussion. The end result was to go back to the BKL due to Apache
slowdowns.

But I seem to remember a later patch (in the last year or two) from Willy
too. Google doesn't help me, so that's probably just my fevered mind. But
I'm cc'ing Willy anyway.

Linus

2009-04-17 18:35:28

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

* Linus Torvalds <[email protected]> wrote:

> On Fri, 17 Apr 2009, Peter Zijlstra wrote:
> >
> > Anyway, it seems quite clear that the first thing is to push the current
> > BKL usage down into the filesystems -- which should be somewhat
> > straight-forward.
>
> Yes, if somebody sends the obvious mechanical patch, we can apply that
> easily. Then, most common filesystems can probably remove the BKL
> trivially by maintainers that know that they don't do anything at all with
> it.
>
> Of course, right now we do hold the BKL over _multiple_ downcalls, so in
> that sense it's not actually totally 100% correct and straightforward to
> just move it down. Eg in the generic_shutdown_super() case we do
>
> lock_kernel();
> ->write_super();
> ->put_super();
> invalidate_inodes();
> unlock_kernel();
>
> and obviously if we split it up so that we push a lock_kernel()
> into both, we end up unlocking in between. I doubt anything cares,
> but it's still a technical difference.
>
> There are similar issues with 'remount' holding the BKL over
> longer sequences.
>
> Btw, the superblock code really does seem to depend on
> lock_kernel. Those "sb->s_flags" accesses are literally not
> protected by anything else afaik.

The very narrow case we want to solve is this place in the NFS code
that calls schedule() with the BKL held:

[<00000000006d20ec>] rpc_wait_bit_killable+0x64/0x8c
[<00000000006f0620>] __wait_on_bit+0x64/0xc0
[<00000000006f06e4>] out_of_line_wait_on_bit+0x68/0x7c
[<00000000006d2938>] __rpc_execute+0x150/0x2b4
[<00000000006d2ac0>] rpc_execute+0x24/0x34
[<00000000006cc338>] rpc_run_task+0x64/0x74
[<00000000006cc474>] rpc_call_sync+0x58/0x7c
[<00000000005717b0>] nfs3_rpc_wrapper+0x24/0xa0
[<0000000000572024>] do_proc_get_root+0x6c/0x10c
[<00000000005720dc>] nfs3_proc_get_root+0x18/0x5c
[<000000000056401c>] nfs_get_root+0x34/0x17c
[<0000000000568adc>] nfs_get_sb+0x9ec/0xa7c
[<00000000004b7ec8>] vfs_kern_mount+0x44/0xa4
[<00000000004b7f84>] do_kern_mount+0x30/0xcc
[<00000000004cf300>] do_mount+0x7c8/0x80c
[<00000000004ed2a4>] compat_sys_mount+0x224/0x274
[<0000000000406154>] linux_sparc_syscall32+0x34/0x40

This creates circular locking if the BKL is a plain mutex and if
that mutex is dropped there (it's a too lowlevel place with many
locks held, so a re-acquire inverts the locking dependency).

I.e. the NFS code wants to drop the BKL at a high level, in
nfs_get_sb() - the NFS folks already confirmed that they have no
internal BKL dependencies. Preferably by never getting called with
the BKL held by the VFS layer.

Of course we could hack around it and add an unlock_kernel()
lock_kernel() pair into nfs_get_sb(), but we thought we'd be kernel
nice citizens and improve the general situation too :)

Ingo

2009-04-17 18:44:43

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 11:03:31AM -0700, Linus Torvalds wrote:
> Hmm. It might also just be my fevered imagination. I'd like to say it was
> Matthew Wilcox, but really, my mind is going.
>
> Ahh. Bug google backs me up. As long as I have google, I can keep
> Alzheimer's at bay: "Negative scalability by removal of lock_kernel()"
> thread on lkml back in October 2000. After we had actually done the BKL
> removal.
>
> So we actually did apply it (in 2.4.0-test9, and then reverted it again
> (in -test11, I think). Google for "file_lock_sem fs/locks.c" and see some
> of the discussion. The end result was to go back to the BKL due to Apache
> slowdowns.

That's some good ancient history ;-) It probably would make sense to
use a mutex rather than the BKL these days now that we spin on mutexes
if the other process is running. Plus, I don't think modern Apache uses
file locks any more.

There was another attempt to remove the BKL from locks.c by Dave Hansen
a few years later. That one foundered on the proposed locking scheme
being unadulterated crap; I seem to remember pointing out that it was
gathering O(n^2) locks ...

> But I seem to remember a later patch (in the last year or two) from Willy
> too. Google doesn't help me, so that's probably just my fevered mind. But
> I'm cc'ing Willy anyway.

Fortunately, this patch wasn't the product of a fevered anything. It
was in response to the performance regressions I introduced by
introducing the generic semaphores that were too fair.

It didn't deal with the really knotty problem which was the NFS server
still running under the BKL and relying on the BKL to prevent other
CPUs from messing with the list of locks.

Since the problem turned out to be the TTY layer and not the file
locking code, we just dropped the patch, but if we'd like to resurrect
it again, we need to talk to the NFS folks. Probably Bruce Fields is
the appropriate sucker.

2009-04-22 17:29:18

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [PATCH -tip] remove the BKL: Replace BKL in mount/umount syscalls with a mutex

On Fri, Apr 17, 2009 at 11:44:31AM -0700, Matthew Wilcox wrote:
> On Fri, Apr 17, 2009 at 11:03:31AM -0700, Linus Torvalds wrote:
> > Hmm. It might also just be my fevered imagination. I'd like to say it was
> > Matthew Wilcox, but really, my mind is going.
> >
> > Ahh. Bug google backs me up. As long as I have google, I can keep
> > Alzheimer's at bay: "Negative scalability by removal of lock_kernel()"
> > thread on lkml back in October 2000. After we had actually done the BKL
> > removal.
> >
> > So we actually did apply it (in 2.4.0-test9, and then reverted it again
> > (in -test11, I think). Google for "file_lock_sem fs/locks.c" and see some
> > of the discussion. The end result was to go back to the BKL due to Apache
> > slowdowns.
>
> That's some good ancient history ;-) It probably would make sense to
> use a mutex rather than the BKL these days now that we spin on mutexes
> if the other process is running. Plus, I don't think modern Apache uses
> file locks any more.
>
> There was another attempt to remove the BKL from locks.c by Dave Hansen
> a few years later. That one foundered on the proposed locking scheme
> being unadulterated crap; I seem to remember pointing out that it was
> gathering O(n^2) locks ...
>
> > But I seem to remember a later patch (in the last year or two) from Willy
> > too. Google doesn't help me, so that's probably just my fevered mind. But
> > I'm cc'ing Willy anyway.
>
> Fortunately, this patch wasn't the product of a fevered anything. It
> was in response to the performance regressions I introduced by
> introducing the generic semaphores that were too fair.
>
> It didn't deal with the really knotty problem which was the NFS server
> still running under the BKL and relying on the BKL to prevent other
> CPUs from messing with the list of locks.

It's only lockd that actually runs *entirely* under the BKL--and lockd
obviously has a close relationship with the locks.c code, so there's a
fear of (unknown) dependencies there.

Also, more concretely (and what you probably had in mind), there are a
couple places where the nfs client or server explicitly take the bkl
just to traverse the lock list.

> Since the problem turned out to be the TTY layer and not the file
> locking code, we just dropped the patch, but if we'd like to resurrect
> it again, we need to talk to the NFS folks. Probably Bruce Fields is
> the appropriate sucker.

I've been saying for a while I'd look into this, but keep getting
distracted, apologies.... I'll see if I can at least deal with the
obvious nfs client/server lock list traversals this time around.

--b.