-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hey all -
There is currently no method in the superblock shutdown path to
determine if all inodes associated with the superblock are completely
quiesced. invalidate_inodes() will attempt to flush out inodes still
hanging around, but if inodes are in the iput() path, it's possible for
them to be removed from the inode list and in the ->delete_inode fs
method, which isn't protected by inode_lock any longer.
generic_shutdown_super() will happily call the ->put_super fs method,
destroying data structures still in use by the iput (->delete_inode) in
progress. That's where Oopsen come into play.
The unlink path will call the ->unlink fs method, release the path (thus
dropping the reference to the vfsmount, and then call iput. Since the
vfsmount reference is dropped back to 1, a umount will succeed, causing
the superblock to be cleaned up.
This doesn't trigger during read/write or if pwd is inside the
filesystem, since open files and working directories also cause an
incremented vfsmount->mnt_count.
I've triggered Oopsen on 2.6.5, 2.6.8, 2.6.9-rc2, and 2.6.9-rc3 using
reiserfs, ext2, and ext3. Presumably, most (all?) 2.6 should be susceptible.
Attached is a script I've used to reliably trigger this bug. I'll
continue to track this down, but figured I'd post a heads up so more
eyes would see it.
- -Jeff
- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFBXl7mLPWxlyuTD7IRAu3VAJ4r/sFAX6dOr6WMpLh6YykmhBxo7gCgoluP
dpoTUfCqGZIVWeFJ1rc8yOI=
=lj86
-----END PGP SIGNATURE-----
On Sat, Oct 02, 2004 at 03:55:18AM -0400, Jeff Mahoney wrote:
> generic_shutdown_super() will happily call the ->put_super fs method,
> destroying data structures still in use by the iput (->delete_inode) in
> progress. That's where Oopsen come into play.
>
> The unlink path will call the ->unlink fs method, release the path (thus
> dropping the reference to the vfsmount, and then call iput. Since the
> vfsmount reference is dropped back to 1, a umount will succeed, causing
> the superblock to be cleaned up.
Arrgh...
Bug is in the ->i_count hacks in sys_unlink(). 1001st proof that VFS has
no fscking business playing with inode refcount directly...
OK, quick and dirty fix follows. Note: all places that go to exit1: or exit:
will have NULL inode, so we are not leaking anything here and it is OK do that
iput() early; indeed, the goal of that kludge was to postpone the final
iput() past the unlocking the parent for the sake of contention if a wunch
of bankers is doing parallel unlink() on files in the same directory and
normally it would happen on dput() after vfs_unlink())
--- linux/fs/namei.c Mon Sep 13 01:32:00 2004
+++ linux/fs/namei.c.fix Sat Oct 2 05:48:21 2004
@@ -1825,13 +1825,12 @@
dput(dentry);
}
up(&nd.dentry->d_inode->i_sem);
+ if (inode)
+ iput(inode); /* truncate the inode here */
exit1:
path_release(&nd);
exit:
putname(name);
-
- if (inode)
- iput(inode); /* truncate the inode here */
return error;
slashes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
[email protected] wrote:
| On Sat, Oct 02, 2004 at 03:55:18AM -0400, Jeff Mahoney wrote:
|
|>generic_shutdown_super() will happily call the ->put_super fs method,
|>destroying data structures still in use by the iput (->delete_inode) in
|>progress. That's where Oopsen come into play.
|>
|>The unlink path will call the ->unlink fs method, release the path (thus
|>dropping the reference to the vfsmount, and then call iput. Since the
|>vfsmount reference is dropped back to 1, a umount will succeed, causing
|>the superblock to be cleaned up.
|
|
| Arrgh...
|
| Bug is in the ->i_count hacks in sys_unlink(). 1001st proof that VFS has
| no fscking business playing with inode refcount directly...
|
| OK, quick and dirty fix follows. Note: all places that go to exit1:
or exit:
| will have NULL inode, so we are not leaking anything here and it is OK
do that
| iput() early; indeed, the goal of that kludge was to postpone the final
| iput() past the unlocking the parent for the sake of contention if a wunch
| of bankers is doing parallel unlink() on files in the same directory and
| normally it would happen on dput() after vfs_unlink())
|
| --- linux/fs/namei.c Mon Sep 13 01:32:00 2004
| +++ linux/fs/namei.c.fix Sat Oct 2 05:48:21 2004
| @@ -1825,13 +1825,12 @@
| dput(dentry);
| }
| up(&nd.dentry->d_inode->i_sem);
| + if (inode)
| + iput(inode); /* truncate the inode here */
| exit1:
| path_release(&nd);
| exit:
| putname(name);
| -
| - if (inode)
| - iput(inode); /* truncate the inode here */
| return error;
|
| slashes:
Works as expected. Thanks!
- -Jeff
- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFBXs0lLPWxlyuTD7IRAivHAJ4l3GsDxWgQTrTHrMeR7C0CqpomiACgljzB
njBzAn2tP+P5pya1egf9TmU=
=JmTS
-----END PGP SIGNATURE-----
[email protected] wrote:
>
> the goal of that kludge was to postpone the final
> iput() past the unlocking the parent for the sake of contention if a wunch
> of bankers is doing parallel unlink() on files in the same directory
The problematic workloads are much simpler than that: simply a kernel
compile when there's a lot of writeout happening to the same disk.
Truncation of files in /tmp takes a long time because it has to wait for
in-flight writeout. If we do that synchronous I/O while holding i_sem,
nobody else can get at /tmp.
"When there is a continuous streaming write to the same disk, this patch
reduces the time for `make -j4 bzImage' from 370 seconds to 220."
Yeah, the tweak is not pretty, but that's a significant speedup.