2002-02-19 07:16:00

by Nakayama Shintaro

[permalink] [raw]
Subject: BKL removal from VFS

Hi,

I've found great BKL contention when running multiple postmark
benchmarks. Here is the postmark results with lock contention
sampled by lockmeter.

# of trans/sec spinlock contention
1CPU 1197 --
4CPU 1549 62.2% (BKL=61.9%)
8CPU 1607 77.2% (BKL=77.0%)

- kernel is 2.4.17
- postmark params(number=5000 transactions=500000)
- Xeon 550Mhz, 1G memory
- virtually no disk I/O wait
- concurrelty running 8 postmark processes

To improve postmark results, I removed BKL from contenting points
with modifications as follows.

- removed BKL from following operations
i_op->create
i_op->lookup
i_op->link
i_op->mknod
i_op->mkdir
i_op->unlink
i_op->rmdir
i_op->truncate
i_op->setattr
i_op->permission
i_op->symlink
f_op->readdir

removed lock_kernel/unlock_kernel
-> LOCK_KERNEL/UNLOCK_KERNEL

- fine-grained ext2 locks

global resource protecting lock
-----------------------------------------
dir operation i_sem, i_zombie
inode, block bitmap per-group lock
inode meta data per-group lock
ext2 super block super block lock

- changed VFS locking strategy
fs_lock protects fs global operations such as sync,
mount. fs_lock is read/write lock. Each entry into
filesystem read-locks fs_lock, while global operations
do write-locking.

- internaling BKL into some filesystems
rootfs
proc
driverfs
tmpfs (not shmfs)
binfmt_misc
ext3
nfs

inserted lock_kernel/unlock_kernel
-> fs_lock_kernel/fs_unlock_kernel

other filesystems are not modified, thus dangerous to
use currently (they have no protecting lock!)

By this modification, BKL contention was reduced greatly.

# of trans/sec spinlock contention
1CPU 2430 --
2CPU 4098 1.3%
4CPU 7529 3.2%
8CPU 11394 10.7%

- 2.5.1 with BKL removal patch
- postmark params(number=1000 transactions=200000)
- others same as above

I've attached patch against 2.5.1 and m-postmark script. This
patch is tested on 8-way SMP machine with postmark and dbench.
You can get patch and data from

http://www.tritech.co.jp/members/nakayama/linux/vfs-bkl-2.5.1.diff.gz
http://www.tritech.co.jp/members/nakayama/linux/results.tar.bz2

I'm not in the list, please cc any reply to me. Hope this helps
to improve Linux.

(This work is paid by Fujitsu)


Attachments:
vfs-bkl-2.5.1.diff.gz (29.82 kB)

2002-02-19 07:47:36

by Alexander Viro

[permalink] [raw]
Subject: Re: BKL removal from VFS



On Tue, 19 Feb 2002, Nakayama Shintaro wrote:

> I've found great BKL contention when running multiple postmark
> benchmarks. Here is the postmark results with lock contention
> sampled by lockmeter.

Conflicts with (and massively duplicates) patches that already went
into 2.5. Absolutely useless wrt mount() locking changes (except for
remount they can't race with filesystem code even in principle and
definitely don't need system-wide exclusion among themselves). Has
a nice DoS potential (on OOM). Too large and changes too many things
to be acceptable at one chunk even if none of the above would apply.
Consider it vetoed.

Seriously, just watching the changelogs would show that it has no chance
to be applied.

I hadn't checked for races, but e.g. ext2_readdir() losing BKL without
corresponding changes to lseek() looks very suspicious. I'm more than
sure that there's more - after doing that BKL-shifting in recent 2.5.
E.g. I'm pretty sure that you are screwing ->i_nlink checks.

2002-02-19 15:57:15

by Steve Lord

[permalink] [raw]
Subject: Re: BKL removal from VFS

On Tue, 2002-02-19 at 01:47, Alexander Viro wrote:
>
>
> On Tue, 19 Feb 2002, Nakayama Shintaro wrote:
>
> > I've found great BKL contention when running multiple postmark
> > benchmarks. Here is the postmark results with lock contention
> > sampled by lockmeter.
>
> Conflicts with (and massively duplicates) patches that already went
> into 2.5. Absolutely useless wrt mount() locking changes (except for
> remount they can't race with filesystem code even in principle and
> definitely don't need system-wide exclusion among themselves). Has
> a nice DoS potential (on OOM). Too large and changes too many things
> to be acceptable at one chunk even if none of the above would apply.
> Consider it vetoed.
>
> Seriously, just watching the changelogs would show that it has no chance
> to be applied.
>
> I hadn't checked for races, but e.g. ext2_readdir() losing BKL without
> corresponding changes to lseek() looks very suspicious. I'm more than
> sure that there's more - after doing that BKL-shifting in recent 2.5.
> E.g. I'm pretty sure that you are screwing ->i_nlink checks.
>

Al, I am not proposing this to go in, but what is your opinion on a
change like this? XFS does not need the BKL at all, so for some aim7
experiments on large systems this patch was used to bypass the BKL for
filesystems which state they can live without it:

This is against 2.4.17, so it is a bit dated, it is also not
comprehensive in terms of hitting all the BKL usage around
filesystem calls.

Steve

===========================================================================
Index: linux/fs/namei.c
===========================================================================

--- /usr/tmp/TmpDir.32271-0/linux/fs/namei.c_1.41 Sat Feb 2 14:34:21 2002
+++ linux/fs/namei.c Sat Feb 2 14:11:23 2002
@@ -26,6 +26,12 @@
#include <asm/namei.h>
#include <asm/uaccess.h>

+#define lock_kernel_optional(ip) \
+ if (!(ip->i_flags & S_NOBKL)) lock_kernel()
+
+#define unlock_kernel_optional(ip) \
+ if (!(ip->i_flags & S_NOBKL)) unlock_kernel()
+
#define ACC_MODE(x) ("\000\004\002\006"[(x)&O_ACCMODE])

/* [Feb-1997 T. Schoebel-Theuer]
@@ -199,9 +205,9 @@
{
if (inode->i_op && inode->i_op->permission) {
int retval;
- lock_kernel();
+ lock_kernel_optional(inode);
retval = inode->i_op->permission(inode, mask);
- unlock_kernel();
+ unlock_kernel_optional(inode);
return retval;
}
return vfs_permission(inode, mask);
@@ -298,9 +304,9 @@
struct dentry * dentry = d_alloc(parent, name);
result = ERR_PTR(-ENOMEM);
if (dentry) {
- lock_kernel();
+ lock_kernel_optional(dir);
result = dir->i_op->lookup(dir, dentry);
- unlock_kernel();
+ unlock_kernel_optional(dir);
if (result)
dput(dentry);
else
@@ -770,9 +776,9 @@
dentry = ERR_PTR(-ENOMEM);
if (!new)
goto out;
- lock_kernel();
+ lock_kernel_optional(inode);
dentry = inode->i_op->lookup(inode, new);
- unlock_kernel();
+ unlock_kernel_optional(inode);
if (!dentry)
dentry = new;
else
@@ -945,9 +951,9 @@
goto exit_lock;

DQUOT_INIT(dir);
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->create(dir, dentry, mode);
- unlock_kernel();
+ unlock_kernel_optional(dir);
exit_lock:
up(&dir->i_zombie);
if (!error)
@@ -1225,9 +1231,9 @@
goto exit_lock;

DQUOT_INIT(dir);
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->mknod(dir, dentry, mode, dev);
- unlock_kernel();
+ unlock_kernel_optional(dir);
exit_lock:
up(&dir->i_zombie);
if (!error)
@@ -1296,9 +1302,9 @@

DQUOT_INIT(dir);
mode &= (S_IRWXUGO|S_ISVTX);
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->mkdir(dir, dentry, mode);
- unlock_kernel();
+ unlock_kernel_optional(dir);

exit_lock:
up(&dir->i_zombie);
@@ -1387,9 +1393,9 @@
else if (d_mountpoint(dentry))
error = -EBUSY;
else {
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->rmdir(dir, dentry);
- unlock_kernel();
+ unlock_kernel_optional(dir);
if (!error)
dentry->d_inode->i_flags |= S_DEAD;
}
@@ -1458,9 +1464,9 @@
if (d_mountpoint(dentry))
error = -EBUSY;
else {
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->unlink(dir, dentry);
- unlock_kernel();
+ unlock_kernel_optional(dir);
if (!error)
d_delete(dentry);
}
@@ -1529,9 +1535,9 @@
goto exit_lock;

DQUOT_INIT(dir);
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->symlink(dir, dentry, oldname);
- unlock_kernel();
+ unlock_kernel_optional(dir);

exit_lock:
up(&dir->i_zombie);
@@ -1603,9 +1609,9 @@
goto exit_lock;

DQUOT_INIT(dir);
- lock_kernel();
+ lock_kernel_optional(dir);
error = dir->i_op->link(old_dentry, dir, new_dentry);
- unlock_kernel();
+ unlock_kernel_optional(dir);

exit_lock:
up(&dir->i_zombie);

===========================================================================
Index: linux/fs/xfs/xfs_iget.c
===========================================================================

--- /usr/tmp/TmpDir.32271-0/linux/fs/xfs/xfs_iget.c_1.149 Sat Feb 2 14:34:21 2002
+++ linux/fs/xfs/xfs_iget.c Sat Feb 2 13:59:04 2002
@@ -478,6 +478,7 @@


vp = LINVFS_GET_VN_ADDRESS(inode);
+ inode->i_flags |= S_NOBKL;
if (inode->i_state & I_NEW) {
vn_initialize(XFS_MTOVFS(mp), inode, 0);
error = xfs_iget_core(vp, mp, tp, ino,

===========================================================================
Index: linux/include/linux/fs.h
===========================================================================

--- /usr/tmp/TmpDir.32271-0/linux/include/linux/fs.h_1.137 Sat Feb 2 14:34:21 2002
+++ linux/include/linux/fs.h Sat Feb 2 13:57:20 2002
@@ -142,6 +142,7 @@
#define S_IMMUTABLE 16 /* Immutable file */
#define S_DEAD 32 /* removed, but still open directory */
#define S_NOQUOTA 64 /* Inode is not counted to quota */
+#define S_NOBKL 128 /* No big kernel lock required */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-02-19 16:12:19

by Alexander Viro

[permalink] [raw]
Subject: Re: BKL removal from VFS



On 19 Feb 2002, Steve Lord wrote:

> Al, I am not proposing this to go in, but what is your opinion on a
> change like this? XFS does not need the BKL at all, so for some aim7
> experiments on large systems this patch was used to bypass the BKL for
> filesystems which state they can live without it:

> +#define lock_kernel_optional(ip) \
> + if (!(ip->i_flags & S_NOBKL)) lock_kernel()
> +

Denied. No way in hell that (or similar) will ever go in. Locking must
be consistent, _period_. No provisions for "legacy drivers" and crap
like that - it's a standard policy in all kernel and that had been discussed
a lot of times.

_Please_, check 2.5. We already don't take BKL on majority of directory
operations. The rest will follow pretty soon.

In particular, in current Linus' tree there are 3 (three) instances of
lock_kernel() in fs/namei.c. Namely, ->permission() and two calls of
d_move(). The latter will go when ->d_parent mess is cleaned up. The
former will go as soon as we get to ->setattr()/->permission() cleanups -
hopefully in a week or so.

In general, such changes are done by global lock shifting - simultaneous
for all instances and being a trivial search-and-replace. Once the lock
is taken inside the method individual filesystems/drivers/etc. can
shrink the protected areas - in separate patches.

That's how it works - and that's how it had been done for most of the methods
already. Magic flags that make locking different for different instances
are Not Good. And not needed - see above for the usual way to do that stuff.

2002-02-19 16:22:03

by Steve Lord

[permalink] [raw]
Subject: Re: BKL removal from VFS

On Tue, 2002-02-19 at 10:11, Alexander Viro wrote:
>
>
> On 19 Feb 2002, Steve Lord wrote:
>
> > Al, I am not proposing this to go in, but what is your opinion on a
> > change like this? XFS does not need the BKL at all, so for some aim7
> > experiments on large systems this patch was used to bypass the BKL for
> > filesystems which state they can live without it:
>
> > +#define lock_kernel_optional(ip) \
> > + if (!(ip->i_flags & S_NOBKL)) lock_kernel()
> > +
>
> Denied. No way in hell that (or similar) will ever go in. Locking must
> be consistent, _period_. No provisions for "legacy drivers" and crap
> like that - it's a standard policy in all kernel and that had been discussed
> a lot of times.
>
> _Please_, check 2.5. We already don't take BKL on majority of directory
> operations. The rest will follow pretty soon.
>
> In particular, in current Linus' tree there are 3 (three) instances of
> lock_kernel() in fs/namei.c. Namely, ->permission() and two calls of
> d_move(). The latter will go when ->d_parent mess is cleaned up. The
> former will go as soon as we get to ->setattr()/->permission() cleanups -
> hopefully in a week or so.
>
> In general, such changes are done by global lock shifting - simultaneous
> for all instances and being a trivial search-and-replace. Once the lock
> is taken inside the method individual filesystems/drivers/etc. can
> shrink the protected areas - in separate patches.
>
> That's how it works - and that's how it had been done for most of the methods
> already. Magic flags that make locking different for different instances
> are Not Good. And not needed - see above for the usual way to do that stuff.

Whoa, light blue touch paper and stand back! Like I said I was not proposing
this to go into the kernel, just asking your opinion. Yes I am aware of the
changes going into locking in 2.5 and like the way things are going there,
XFS is ticking along quite happily in 2.5.5-pre1 here.

Steve

--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-02-19 16:48:37

by Alexander Viro

[permalink] [raw]
Subject: Re: BKL removal from VFS



On 19 Feb 2002, Steve Lord wrote:

> Whoa, light blue touch paper and stand back! Like I said I was not proposing
> this to go into the kernel, just asking your opinion.

You asked - I answered ;-)

BTW, check your use of ->d_parent - a lot of places implicitly assumes that
it can't change under you. Currently for a filesystem with ->rename() it's
true only if at least one of the following conditions is satisfied:
* you know that lock on parent is held (e.g. you are in ->lookup()
and its ilk and dentry is one you've got from caller). Notice that
down(&dentry->d_parent->d_inode->i_sem) is 100% wrong for any such fs.
* dcache_lock is held.
* BKL is held.
* you are called from cross-directory ->rename() (then no dentry
on that filesystem will changes its parent until you are done).

Surprisingly many places implicitly rely on BKL (i.e. have no
other protection and don't fsck up only because they are always called
under BKL). Hell, some places don't have _any_ protection - see 2.4.18-rc2
for fixes to such crap in dnotify-related code.