2005-03-15 14:15:14

by Jeff Mahoney

[permalink] [raw]
Subject: [PATCH] blockdev: fix for racing mount/umount

This patch is another take at fixing the race between mount and umount
resetting the blocksize and causing buffer errors, infinite loops in
__getblk_slow, and possibly other undiscovered effects.

It adds possible flags to bd_claim such that the caller can request
exclusive access and/or wait until the device becomes available. Since bd_claim
already allows/denies access based on the holder, the BD_EXCL flag operates
under the assumption that all callers of bd_claim for a particular holder will
use the flag. This is currently true. BD_WAIT places the request on a wait
queue until access can be granted. It uses a global wait queue, which isn't a
contention point since bd_claim/bd_release both operate under the global
bdev_lock anyway.

Filesystems (via get_sb_bdev) now use BD_EXCL|BD_WAIT to ensure the previous
mount has completely shut down and closed the device before re-opening it
for a new mount.

It's ugly, and I'm open to suggestions, but it seems to be the only way to
stop this race reliably.

Signed-off-by: Jeff Mahoney <[email protected]>

diff -ruNpX dontdiff linux-2.6.11/fs/block_dev.c linux-2.6.11.bs/fs/block_dev.c
--- linux-2.6.11/fs/block_dev.c 2005-03-14 21:25:20.000000000 -0500
+++ linux-2.6.11.bs/fs/block_dev.c 2005-03-14 22:17:16.000000000 -0500
@@ -238,6 +238,7 @@ static int block_fsync(struct file *filp
*/

static __cacheline_aligned_in_smp DEFINE_SPINLOCK(bdev_lock);
+static DECLARE_WAIT_QUEUE_HEAD(bdev_wq);
static kmem_cache_t * bdev_cachep;

static struct inode *bdev_alloc_inode(struct super_block *sb)
@@ -443,20 +444,33 @@ void bd_forget(struct inode *inode)
spin_unlock(&bdev_lock);
}

-int bd_claim(struct block_device *bdev, void *holder)
+/*
+ * flags:
+ * BD_NONE: No special behavior.
+ * BD_EXCL: Must have sole access to device, even if holder is the same.
+ * This is really enforced by the holder always using BD_EXCL.
+ * BD_WAIT: Wait until access is available before returning.
+ */
+int __bd_claim(struct block_device *bdev, void *holder, int flags)
{
int res;
+ DEFINE_WAIT (wait);
+
+retry:
spin_lock(&bdev_lock);
+ prepare_to_wait (&bdev_wq, &wait, TASK_UNINTERRUPTIBLE);

/* first decide result */
- if (bdev->bd_holder == holder)
+ if (bdev->bd_holder == holder) {
res = 0; /* already a holder */
- else if (bdev->bd_holder != NULL)
+ if (flags & BD_EXCL)
+ res = -EBUSY;
+ } else if (bdev->bd_holder != NULL)
res = -EBUSY; /* held by someone else */
else if (bdev->bd_contains == bdev)
res = 0; /* is a whole device which isn't held */

- else if (bdev->bd_contains->bd_holder == bd_claim)
+ else if (bdev->bd_contains->bd_holder == __bd_claim)
res = 0; /* is a partition of a device that is being partitioned */
else if (bdev->bd_contains->bd_holder != NULL)
res = -EBUSY; /* is a partition of a held device */
@@ -470,15 +484,21 @@ int bd_claim(struct block_device *bdev,
* be set to bd_claim before being set to holder
*/
bdev->bd_contains->bd_holders ++;
- bdev->bd_contains->bd_holder = bd_claim;
+ bdev->bd_contains->bd_holder = __bd_claim;
bdev->bd_holders++;
bdev->bd_holder = holder;
+ } else if (flags & BD_WAIT) {
+ spin_unlock (&bdev_lock);
+ schedule();
+ goto retry;
}
+
+ finish_wait (&bdev_wq, &wait);
spin_unlock(&bdev_lock);
return res;
}

-EXPORT_SYMBOL(bd_claim);
+EXPORT_SYMBOL(__bd_claim);

void bd_release(struct block_device *bdev)
{
@@ -488,6 +508,7 @@ void bd_release(struct block_device *bde
if (!--bdev->bd_holders)
bdev->bd_holder = NULL;
spin_unlock(&bdev_lock);
+ wake_up_all (&bdev_wq);
}

EXPORT_SYMBOL(bd_release);
@@ -876,7 +897,8 @@ fail:
* Open the blockdevice described by the special file at @path, claim it
* for the @holder.
*/
-struct block_device *open_bdev_excl(const char *path, int flags, void *holder)
+struct block_device *__open_bdev_excl(const char *path, int flags,
+ void *holder, int bdflags)
{
struct block_device *bdev;
mode_t mode = FMODE_READ;
@@ -894,7 +916,7 @@ struct block_device *open_bdev_excl(cons
error = -EACCES;
if (!(flags & MS_RDONLY) && bdev_read_only(bdev))
goto blkdev_put;
- error = bd_claim(bdev, holder);
+ error = __bd_claim(bdev, holder, bdflags);
if (error)
goto blkdev_put;

@@ -905,7 +927,7 @@ blkdev_put:
return ERR_PTR(error);
}

-EXPORT_SYMBOL(open_bdev_excl);
+EXPORT_SYMBOL(__open_bdev_excl);

/**
* close_bdev_excl - release a blockdevice openen by open_bdev_excl()
diff -ruNpX dontdiff linux-2.6.11/fs/super.c linux-2.6.11.bs/fs/super.c
--- linux-2.6.11/fs/super.c 2005-03-14 21:25:20.000000000 -0500
+++ linux-2.6.11.bs/fs/super.c 2005-03-14 21:38:22.000000000 -0500
@@ -677,7 +677,7 @@ struct super_block *get_sb_bdev(struct f
struct super_block *s;
int error = 0;

- bdev = open_bdev_excl(dev_name, flags, fs_type);
+ bdev = __open_bdev_excl(dev_name, flags, fs_type, BD_EXCL|BD_WAIT);
if (IS_ERR(bdev))
return (struct super_block *)bdev;

diff -ruNpX dontdiff linux-2.6.11/include/linux/fs.h linux-2.6.11.bs/include/linux/fs.h
--- linux-2.6.11/include/linux/fs.h 2005-03-14 21:25:20.000000000 -0500
+++ linux-2.6.11.bs/include/linux/fs.h 2005-03-14 21:45:21.000000000 -0500
@@ -1320,7 +1320,14 @@ extern int blkdev_ioctl(struct inode *,
extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
extern int blkdev_get(struct block_device *, mode_t, unsigned);
extern int blkdev_put(struct block_device *);
-extern int bd_claim(struct block_device *, void *);
+#define BD_NONE 0x0
+#define BD_EXCL 0x1
+#define BD_WAIT 0x2
+extern int __bd_claim(struct block_device *, void *, int);
+static inline int bd_claim(struct block_device *bdev, void *holder)
+{
+ return __bd_claim(bdev, holder, 0);
+}
extern void bd_release(struct block_device *);

/* fs/char_dev.c */
@@ -1337,6 +1344,11 @@ extern int chrdev_open(struct inode *, s
extern const char *__bdevname(dev_t, char *buffer);
extern const char *bdevname(struct block_device *bdev, char *buffer);
extern struct block_device *lookup_bdev(const char *);
+extern struct block_device *__open_bdev_excl(const char *, int, void *, int);
+static inline struct block_device *open_bdev_excl(const char *path, int flags, void *holder)
+{
+ return __open_bdev_excl(path, flags, holder, BD_NONE);
+}
extern struct block_device *open_bdev_excl(const char *, int, void *);
extern void close_bdev_excl(struct block_device *);

--
Jeff Mahoney
SuSE Labs


2005-03-15 15:59:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] blockdev: fix for racing mount/umount



On Tue, 15 Mar 2005, Jeff Mahoney wrote:
>
> This patch is another take at fixing the race between mount and umount
> resetting the blocksize and causing buffer errors, infinite loops in
> __getblk_slow, and possibly other undiscovered effects.

Ok. I had to go back and look up the original problem, and having looked
at this a bit more, I wonder whether the real problem is not that we do
that silly "set blocksize back to the original one" at umount time in the
first place.

(It happens very indirectly, though the "->kill_sb()" fn pointer, which
ends up doing kill_block_super on a regular block device).

Maybe we should just get rid of it entirely? There's really no point to
it.

Instead, to make things repeatable, we'd always just set the blocksize to
its default value at the first open. We already do that anyway, don't we?

Wouldn't that approach also just fix things? And then the fix would
literally be to just remove the set_blocksize() call in kill_block_super.
At that point, we know that the only people who set the block size are
either
- a first opener
- somebody who got exclusive access (ie a filesystem that sets it at
mount-time)

(Yeah, it's a bit more complex than that one-liner, since somebody would
need to back me up on not being totally tripping on some 'shrooms. But Al
can probably do that trivially)

Or maybe I misunderstood the problem. Jeff?

Linus

2005-03-15 17:54:17

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] blockdev: fix for racing mount/umount

On Tue, Mar 15, 2005 at 08:00:52AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 15 Mar 2005, Jeff Mahoney wrote:
> >
> > This patch is another take at fixing the race between mount and umount
> > resetting the blocksize and causing buffer errors, infinite loops in
> > __getblk_slow, and possibly other undiscovered effects.
>
> Ok. I had to go back and look up the original problem, and having looked
> at this a bit more, I wonder whether the real problem is not that we do
> that silly "set blocksize back to the original one" at umount time in the
> first place.
>
> (It happens very indirectly, though the "->kill_sb()" fn pointer, which
> ends up doing kill_block_super on a regular block device).
>
> Maybe we should just get rid of it entirely? There's really no point to
> it.
>
> Instead, to make things repeatable, we'd always just set the blocksize to
> its default value at the first open. We already do that anyway, don't we?

Yes, but we could explicitly change it at some point before mount. I'm looking
through that stuff right now - been net.dead for several weeks, thanks to
fscking telco idiocy at exactly wrong time ;-/

Give me a couple of days, OK? Three weeks of l-k is a hell of a backlog ;-/

2005-03-15 20:58:29

by Jeff Mahoney

[permalink] [raw]
Subject: Re: [PATCH] blockdev: fix for racing mount/umount

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
>
> On Tue, 15 Mar 2005, Jeff Mahoney wrote:
>
>>This patch is another take at fixing the race between mount and umount
>>resetting the blocksize and causing buffer errors, infinite loops in
>>__getblk_slow, and possibly other undiscovered effects.
>
>
> Ok. I had to go back and look up the original problem, and having looked
> at this a bit more, I wonder whether the real problem is not that we do
> that silly "set blocksize back to the original one" at umount time in the
> first place.
>
> (It happens very indirectly, though the "->kill_sb()" fn pointer, which
> ends up doing kill_block_super on a regular block device).
>
> Maybe we should just get rid of it entirely? There's really no point to
> it.
>
> Instead, to make things repeatable, we'd always just set the blocksize to
> its default value at the first open. We already do that anyway, don't we?
>
> Wouldn't that approach also just fix things? And then the fix would
> literally be to just remove the set_blocksize() call in kill_block_super.
> At that point, we know that the only people who set the block size are
> either
> - a first opener
> - somebody who got exclusive access (ie a filesystem that sets it at
> mount-time)
>
> (Yeah, it's a bit more complex than that one-liner, since somebody would
> need to back me up on not being totally tripping on some 'shrooms. But Al
> can probably do that trivially)
>
> Or maybe I misunderstood the problem. Jeff?

*smacks head*

Yeah, I think this will fix the problem in a much cleaner way. I wanted
to try to preserve the old behavior, but you validly point out that it's
silly to do so. I don't believe that any opener of a block device has
any expectation that the block size be anything in particular. They
could open a block device that is mounted, or one that isn't. If they
need it a different size, they can explicitly set it. The set_blocksize
in kill_block_super is just being nice and returning it the default.

Here's a quick analysis:

Previous to the superblock being shut down, sget will return a
superblock which will then lead to get_sb_bdev returning -EBUSY as expected.

After the superblock being shut down, bd_release releases the holders.
This could allow another opener to open it exclusively, but then the new
mount would just return -EBUSY, which is safe. Otherwise, we'll take a
holder reference under bdev_lock.

Then, blkdev_put will sync and kill the blkdev under bdev->bd_sem if it
holds the sole opener reference. This is what usually will happen.

If it doesn't hold the only reference, blkdev_get in the new mount beat
us there and won't need to reset the blocksize. The bdev doesn't get
synced and killed either, though. This isn't that big a deal, since
generic_shutdown_super->fsync_super takes care of syncing the fs, but
the ->write_super call following it could leave the superblock unsynced
under certain circumstances[*]. So, rather than completely remove the
set_blocksize, I'd prefer to replace it with a sync_blockdev instead.

- -Jeff

[*]: It's a very specific corner case: Currently, since most filesystems
use a block size that is different than the device block size,
set_blocksize resetting the block size can be counted on to sync any
writes that occured during that final ->write_super call. If we remove
set_blocksize, then that final sync may not occur if there is another
opener (such as a process reading the block device).

Yes, it's a case of shooting-yourself-in-the-foot, but say it's a pen
drive and knowing that it's umounted, the user removes the device. The
final write is lost - because a read-only operation was in progress.
That doesn't seem right to me, and can be fixed pretty easily by
substituting a sync_blockdev for that set_blocksize rather than removing
it completely.

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFCN0yaLPWxlyuTD7IRAtp0AJ4xQDNjS+B6wq3sKu2t6c4tj6PHowCeJeqR
54cAg0bsuf1pMU0kSDJLy20=
=t5Vf
-----END PGP SIGNATURE-----

2005-03-17 19:48:46

by Jeff Mahoney

[permalink] [raw]
Subject: Re: [PATCH] blockdev: fix for racing mount/umount

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Linus Torvalds wrote:
>
> On Tue, 15 Mar 2005, Jeff Mahoney wrote:
>
>>This patch is another take at fixing the race between mount and umount
>>resetting the blocksize and causing buffer errors, infinite loops in
>>__getblk_slow, and possibly other undiscovered effects.
>
>
> Ok. I had to go back and look up the original problem, and having looked
> at this a bit more, I wonder whether the real problem is not that we do
> that silly "set blocksize back to the original one" at umount time in the
> first place.
>
> (It happens very indirectly, though the "->kill_sb()" fn pointer, which
> ends up doing kill_block_super on a regular block device).
>
> Maybe we should just get rid of it entirely? There's really no point to
> it.
>
> Instead, to make things repeatable, we'd always just set the blocksize to
> its default value at the first open. We already do that anyway, don't we?
>
> Wouldn't that approach also just fix things? And then the fix would
> literally be to just remove the set_blocksize() call in kill_block_super.
> At that point, we know that the only people who set the block size are
> either
> - a first opener
> - somebody who got exclusive access (ie a filesystem that sets it at
> mount-time)
>
> (Yeah, it's a bit more complex than that one-liner, since somebody would
> need to back me up on not being totally tripping on some 'shrooms. But Al
> can probably do that trivially)

I replaced the set_blocksize with a sync_blockdev as I mentioned in my
previous reply and then ran a 24+ hour test that used to fail within a
few minutes. It passed.

As for resetting the block size either before the mount or on final
close, I don't think it's necessary.

Every block filesystem either calculates block numbers using the current
blocksize or explicitly sets the block size itself. Further, the default
blocksize isn't related at all to some "ideal" blocksize for any
particular device, it's just 1k.

It seems to me that the only case restoring the block size seems to
cover is this:
1. process opens the block device non-exclusively, and may set the block
size
2. kernel mounts filesystem, resetting the blocksize to its need
[...]
3. kernel umounts filesystem, returning blocksize to previous value
4. userspace closes block device

The problem with this is that userspace has no expectation whatsoever
that the block size be anything in particular unless it has exclusive
access to the device. If the device isn't open with exclusive access,
the kernel can and will change the blocksize while the userspace process
has the device open. O_EXCL works on block devices (see blkdev_open),
and if it is opened with exclusive access, the kernel can't claim the
device to mount the filesystem.

Therefore, I propose that simply changing the set_blocksize to a
sync_blockdev is sufficient.

- -Jeff

- --
Jeff Mahoney
SuSE Labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFCOd/cLPWxlyuTD7IRAkbbAKCEop3QXJq0/TbllPeEw412wrPZ1gCeIoz5
5sVOoyorGhKs78WFdPLhRHs=
=CGT7
-----END PGP SIGNATURE-----