2004-06-08 19:22:52

by Chris Mason

[permalink] [raw]
Subject: [PATCH] writeback_inodes can race with unmount

There's a small window where the filesystem can be unmounted during
writeback_inodes. The end result is the iput done by sync_sb_inodes
could be done after the FS put_super and and the super has been removed
from all lists.

The fix is to hold the s_umount sem during sync_sb_inodes to make sure
the FS doesn't get unmounted.

Index: linux.t/fs/fs-writeback.c
===================================================================
--- linux.t.orig/fs/fs-writeback.c 2004-06-08 14:45:49.000000000 -0400
+++ linux.t/fs/fs-writeback.c 2004-06-08 14:47:58.000000000 -0400
@@ -360,9 +360,18 @@ restart:
sb = sb_entry(super_blocks.prev);
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
+ /* we're making our own get_super here */
sb->s_count++;
spin_unlock(&sb_lock);
- sync_sb_inodes(sb, wbc);
+ /* if we can't get the readlock, there's no sense in
+ * waiting around, most of the time the FS is going
+ * to be unmounted by the time it is released
+ */
+ if (down_read_trylock(&sb->s_umount)) {
+ if (sb->s_root)
+ sync_sb_inodes(sb, wbc);
+ up_read(&sb->s_umount);
+ }
spin_lock(&sb_lock);
if (__put_super(sb))
goto restart;



2004-06-08 19:55:38

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH] writeback_inodes can race with unmount



Chris Mason wrote:

>There's a small window where the filesystem can be unmounted during
>writeback_inodes. The end result is the iput done by sync_sb_inodes
>could be done after the FS put_super and and the super has been removed
>from all lists.
>
>

Why don't we have the same race in the sync() path as well? Moving the
locking to sync_sb_inodes() itself would fix it also.

>The fix is to hold the s_umount sem during sync_sb_inodes to make sure
>the FS doesn't get unmounted.
>
>Index: linux.t/fs/fs-writeback.c
>===================================================================
>--- linux.t.orig/fs/fs-writeback.c 2004-06-08 14:45:49.000000000 -0400
>+++ linux.t/fs/fs-writeback.c 2004-06-08 14:47:58.000000000 -0400
>@@ -360,9 +360,18 @@ restart:
> sb = sb_entry(super_blocks.prev);
> for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
> if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
>+ /* we're making our own get_super here */
> sb->s_count++;
> spin_unlock(&sb_lock);
>- sync_sb_inodes(sb, wbc);
>+ /* if we can't get the readlock, there's no sense in
>+ * waiting around, most of the time the FS is going
>+ * to be unmounted by the time it is released
>+ */
>+ if (down_read_trylock(&sb->s_umount)) {
>+ if (sb->s_root)
>+ sync_sb_inodes(sb, wbc);
>+ up_read(&sb->s_umount);
>+ }
> spin_lock(&sb_lock);
> if (__put_super(sb))
> goto restart;
>
>

--Mika


2004-06-08 20:18:19

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] writeback_inodes can race with unmount

On Tue, 2004-06-08 at 15:57, Mika Penttil? wrote:
> Chris Mason wrote:
>
> >There's a small window where the filesystem can be unmounted during
> >writeback_inodes. The end result is the iput done by sync_sb_inodes
> >could be done after the FS put_super and and the super has been removed
> >from all lists.
> >
> >
>
> Why don't we have the same race in the sync() path as well? Moving the
> locking to sync_sb_inodes() itself would fix it also.

In the sync() path we're already taking a read lock on s_umount sem.
Moving the locking into sync_sb_inodes would be tricky, it is sometimes
called with the write lock on s_umount_sem held and sometimes with a
read lock.

-chris


2004-06-08 21:53:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] writeback_inodes can race with unmount

Chris Mason <[email protected]> wrote:
>
> On Tue, 2004-06-08 at 15:57, Mika Penttil? wrote:
> > Chris Mason wrote:
> >
> > >There's a small window where the filesystem can be unmounted during
> > >writeback_inodes. The end result is the iput done by sync_sb_inodes
> > >could be done after the FS put_super and and the super has been removed
> > >from all lists.
> > >
> > >
> >
> > Why don't we have the same race in the sync() path as well? Moving the
> > locking to sync_sb_inodes() itself would fix it also.
>
> In the sync() path we're already taking a read lock on s_umount sem.
> Moving the locking into sync_sb_inodes would be tricky, it is sometimes
> called with the write lock on s_umount_sem held and sometimes with a
> read lock.
>

Plus we'd be dead if we had to do the above. If that read_trylock() fails
the sync() will forget to sync stuff.

And, contra your description, we'll fail the trylock relatively frequently
- when some other process is writing back this superblock.

But as this codepath is never used for data-integrity writeout things
should be OK. I'll slip a comment in there and maybe a
BUG_ON(wbc->sync_mode == WB_SYNC_ALL).

You didn't tell us where the window is? It's a bit weird that some other
process can come in and unmount the fs while there are non-zero-refcount
inodes floating about on the superblock.

2004-06-09 01:32:01

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] writeback_inodes can race with unmount

On Tue, 2004-06-08 at 17:56, Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:

> > > Why don't we have the same race in the sync() path as well? Moving the
> > > locking to sync_sb_inodes() itself would fix it also.
> >
> > In the sync() path we're already taking a read lock on s_umount sem.
> > Moving the locking into sync_sb_inodes would be tricky, it is sometimes
> > called with the write lock on s_umount_sem held and sometimes with a
> > read lock.
> >
>
> Plus we'd be dead if we had to do the above. If that read_trylock() fails
> the sync() will forget to sync stuff.
>
> And, contra your description, we'll fail the trylock relatively frequently
> - when some other process is writing back this superblock.

The write lock is taken much less frequently. Should be just on
mount/unmount no?

>
> But as this codepath is never used for data-integrity writeout things
> should be OK. I'll slip a comment in there and maybe a
> BUG_ON(wbc->sync_mode == WB_SYNC_ALL).
>
> You didn't tell us where the window is? It's a bit weird that some other
> process can come in and unmount the fs while there are non-zero-refcount
> inodes floating about on the superblock.

It's the iput in sync_sb_inodes(). The test workload is 8 parallel runs
of this loop (each run to a different /dev/xxx)

while(true) ; do
mkfs /dev/xxx
mount /dev/xxx /xxx
dd if=/dev/zero of=/xxx/foo bs=1MB count=80
rm /xxx/foo
umount /dev/xxx
done

So, each FS only has a single dirty inode, and that single inode gets
deleted during the run. With enough memory pressure, write calls
frequently go into throttling, increasing the chance for sync_sb_inodes
to be called for writeback at the same time as unmount.

Right before the iput in sync_sb_inodes, we've dropped both the sb_lock
and the inode_lock, nothing prevents the unmount from continuing. The
super block won't be freed because we've got it pinned, but the unmount
can proceed.

So, it looks like we get this:

CPU 0 CPU 1
writeback_inodes generic_shutdown_super
sync_sb_inodes
iget(inode)
spin_unlock(&inode_lock)
sop->put_super(sb)
iput(inode)
generic_delete_inode()




2004-06-09 03:23:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] writeback_inodes can race with unmount

Chris Mason <[email protected]> wrote:
>
> On Tue, 2004-06-08 at 17:56, Andrew Morton wrote:
> > Chris Mason <[email protected]> wrote:
>
> > > > Why don't we have the same race in the sync() path as well? Moving the
> > > > locking to sync_sb_inodes() itself would fix it also.
> > >
> > > In the sync() path we're already taking a read lock on s_umount sem.
> > > Moving the locking into sync_sb_inodes would be tricky, it is sometimes
> > > called with the write lock on s_umount_sem held and sometimes with a
> > > read lock.
> > >
> >
> > Plus we'd be dead if we had to do the above. If that read_trylock() fails
> > the sync() will forget to sync stuff.
> >
> > And, contra your description, we'll fail the trylock relatively frequently
> > - when some other process is writing back this superblock.
>
> The write lock is taken much less frequently. Should be just on
> mount/unmount no?

OK.

> So, it looks like we get this:
>
> CPU 0 CPU 1
> writeback_inodes generic_shutdown_super
> sync_sb_inodes
> iget(inode)
> spin_unlock(&inode_lock)
> sop->put_super(sb)
> iput(inode)
> generic_delete_inode()

We really shouldn't have got to ->put_super() when there are still live
inodes around. I'd say that the problem lies on the umount path. 2.4
might have the same problem, but it'll be much harder to hit.

Something like this? If so, we should probably lock the inode in
prune_icache() and remove iprune_sem.


--- 25/fs/inode.c~a 2004-06-08 20:06:41.905455400 -0700
+++ 25-akpm/fs/inode.c 2004-06-08 20:19:40.300121424 -0700
@@ -294,10 +294,11 @@ static void dispose_list(struct list_hea
/*
* Invalidate all inodes for a device.
*/
-static int invalidate_list(struct list_head *head, struct super_block * sb, struct list_head * dispose)
+static int invalidate_list(struct list_head *head, struct super_block *sb,
+ struct list_head *dispose)
{
struct list_head *next;
- int busy = 0, count = 0;
+ int ret = 0, count = 0;

next = head->next;
for (;;) {
@@ -311,6 +312,17 @@ static int invalidate_list(struct list_h
if (inode->i_sb != sb)
continue;
invalidate_inode_buffers(inode);
+ if (inode->i_state & I_LOCK) {
+ __iget(inode);
+ inodes_stat.nr_unused -= count;
+ count = 0;
+ spin_unlock(&inode_lock);
+ wait_on_inode(inode);
+ iput(inode);
+ ret = 2;
+ spin_lock(&inode_lock);
+ break;
+ }
if (!atomic_read(&inode->i_count)) {
hlist_del_init(&inode->i_hash);
list_move(&inode->i_list, dispose);
@@ -318,11 +330,11 @@ static int invalidate_list(struct list_h
count++;
continue;
}
- busy = 1;
+ ret = 1;
}
/* only unused inodes may be cached with i_count zero */
inodes_stat.nr_unused -= count;
- return busy;
+ return ret;
}

/*
@@ -343,21 +355,23 @@ static int invalidate_list(struct list_h
*/
int invalidate_inodes(struct super_block * sb)
{
- int busy;
+ int state;
LIST_HEAD(throw_away);

down(&iprune_sem);
spin_lock(&inode_lock);
- busy = invalidate_list(&inode_in_use, sb, &throw_away);
- busy |= invalidate_list(&inode_unused, sb, &throw_away);
- busy |= invalidate_list(&sb->s_dirty, sb, &throw_away);
- busy |= invalidate_list(&sb->s_io, sb, &throw_away);
+ do {
+ state = invalidate_list(&inode_in_use, sb, &throw_away);
+ state |= invalidate_list(&inode_unused, sb, &throw_away);
+ state |= invalidate_list(&sb->s_dirty, sb, &throw_away);
+ state |= invalidate_list(&sb->s_io, sb, &throw_away);
+ } while (state & 2);
spin_unlock(&inode_lock);

dispose_list(&throw_away);
up(&iprune_sem);

- return busy;
+ return state;
}

EXPORT_SYMBOL(invalidate_inodes);
_

2004-06-09 05:25:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] writeback_inodes can race with unmount

Andrew Morton <[email protected]> wrote:
>
> Something like this?

No, that's not going to work, is it - we don't hold I_LOCK across the
critical iput().

May have to do it your way, but the trylock is irksome...