2004-06-18 01:57:47

by Chris Mason

[permalink] [raw]
Subject: [PATCH RFC] __bd_forget should wait for inodes using the mapping

__bd_forget will change the mapping for filesystem inodes without
waiting to make sure no users of the block device address space are
using that mapping.

In the case of background writeout, it is possible for __bd_forget
to free the block device inode while mpage_writepages is still
looking through the mapping for dirty pages. This is because
each device node in the filesystem has a pointer to the block
device address space, and __bd_forget is used to reset those pointers
before the block device inode is freed.

There is no locking to make sure __bd_forget isn't running
at the same time as __writeback_single_inode is run on the
filesystem device node.

Here's an example patch that should fix things, Andi just found
a race where I wasn't holding onto the filesystem inode correctly,
so this rev got a last minute fix before I wander off for the night.

It's quite ugly, I'm hoping we can hash out something better.

Index: linux.t/fs/block_dev.c
===================================================================
--- linux.t.orig/fs/block_dev.c 2004-06-17 21:14:08.000000000 -0400
+++ linux.t/fs/block_dev.c 2004-06-17 21:46:46.203782616 -0400
@@ -24,6 +24,7 @@
#include <linux/uio.h>
#include <linux/namei.h>
#include <asm/uaccess.h>
+#include <linux/writeback.h>

struct bdev_inode {
struct block_device bdev;
@@ -258,11 +259,31 @@ static void init_once(void * foo, kmem_c
}
}

+/*
+ * we have to make sure that we don't free the block
+ * device inode and mapping while one of the inodes using
+ * it is in background writeback.
+ *
+ * The lock ordering required elsewhere is bdev_lock->inode_lock.
+ */
static inline void __bd_forget(struct inode *inode)
{
+ spin_lock(&inode_lock);
+ __iget(inode);
+ while (inode->i_state & I_LOCK) {
+ spin_unlock(&bdev_lock);
+ spin_unlock(&inode_lock);
+ __wait_on_inode(inode);
+ spin_lock(&bdev_lock);
+ spin_lock(&inode_lock);
+ }
list_del_init(&inode->i_devices);
inode->i_bdev = NULL;
inode->i_mapping = &inode->i_data;
+ spin_unlock(&inode_lock);
+ spin_unlock(&bdev_lock);
+ iput(inode);
+ spin_lock(&bdev_lock);
}

static void bdev_clear_inode(struct inode *inode)



2004-06-18 02:01:01

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Thu, 2004-06-17 at 21:54, Chris Mason wrote:

> Here's an example patch that should fix things, Andi just found
> a race where I wasn't holding onto the filesystem inode correctly,
> so this rev got a last minute fix before I wander off for the night.

Ugh, this oopses on boot. I'll provide a better example patch tomorrow
unless someone comes up with something clean over night ;-)

-chris


2004-06-18 02:10:45

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Thu, Jun 17, 2004 at 09:54:28PM -0400, Chris Mason wrote:
> __bd_forget will change the mapping for filesystem inodes without
> waiting to make sure no users of the block device address space are
> using that mapping.

Filesystem block device inodes have no business even looking at their
->i_mapping. Where do you need to do that?

2004-06-18 13:04:30

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Thu, 2004-06-17 at 22:10, [email protected]
wrote:
> On Thu, Jun 17, 2004 at 09:54:28PM -0400, Chris Mason wrote:
> > __bd_forget will change the mapping for filesystem inodes without
> > waiting to make sure no users of the block device address space are
> > using that mapping.
>
> Filesystem block device inodes have no business even looking at their
> ->i_mapping. Where do you need to do that?

sync_sb_inodes, the filesystem block device inode ends up on some dirty
list, and under memory pressure balance_dirty_pages_ratelimited will
trigger writeback on it.

There's nothing to write back of course, the real block device address
space has no dirty pages at all. But, writeback is looking through the
mapping and __bd_forget can't drop it until writeback has finished
checking it.

I've verified this really is happening ;-) The patch I sent is nasty but
I'm sure this is a real bug.

-chris


2004-06-18 14:20:21

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Thu, 2004-06-17 at 22:10, [email protected]
wrote:
> On Thu, Jun 17, 2004 at 09:54:28PM -0400, Chris Mason wrote:
> > __bd_forget will change the mapping for filesystem inodes without
> > waiting to make sure no users of the block device address space are
> > using that mapping.

Well, here's a new patch, I'm confident you'll like it even less. My
fixes so far will trigger iput on inode B during the clear_inode call of
inode A. Even if this is legal today, it seems doomed to cause bugs
later on.

Another option is to inc the reference count of the block device inode
during background write out. But, since we're racing with the block
device clear_inode call here, that will need to involve the bdev
spinlock somewhere along the way, and has lock ordering issues since the
block device code already takes inode_lock under bdev_lock.

-chris
---

__bd_forget will change the mapping for an inode without waiting to make
sure no users of the inode are using that mapping. In the case of
background writeout, it is possible for __bd_forget to free the
block device inode while mpage_writepages is still looking through the
mapping for dirty pages.

Index: linux.t/fs/block_dev.c
===================================================================
--- linux.t.orig/fs/block_dev.c 2004-06-17 21:14:08.000000000 -0400
+++ linux.t/fs/block_dev.c 2004-06-18 09:21:41.000000000 -0400
@@ -24,6 +24,7 @@
#include <linux/uio.h>
#include <linux/namei.h>
#include <asm/uaccess.h>
+#include <linux/writeback.h>

struct bdev_inode {
struct block_device bdev;
@@ -258,11 +259,48 @@ static void init_once(void * foo, kmem_c
}
}

+/*
+ * we have to make sure that we don't free the block
+ * device inode and mapping while one of the inodes using
+ * it is in background writeback.
+ *
+ * The lock ordering required elsewhere is bdev_lock->inode_lock.
+ */
static inline void __bd_forget(struct inode *inode)
{
+ int iget_done = 0;
+ spin_lock(&inode_lock);
+
+ /* when the inode is being freed, we could be racing
+ * with its clear_inode call. The only thing saving us
+ * is the bdev_lock, inode->i_bdev is still not null
+ * and the clear_inode call will need to get the bdev lock
+ * before it can zero it.
+ *
+ * don't __iget, wait on the inode, or iput in this case,
+ */
+ if (inode->i_state & I_FREEING)
+ goto clear_it;
+
+ __iget(inode);
+ iget_done = 1;
+ while (inode->i_state & I_LOCK) {
+ spin_unlock(&bdev_lock);
+ spin_unlock(&inode_lock);
+ __wait_on_inode(inode);
+ spin_lock(&bdev_lock);
+ spin_lock(&inode_lock);
+ }
+clear_it:
list_del_init(&inode->i_devices);
inode->i_bdev = NULL;
inode->i_mapping = &inode->i_data;
+ spin_unlock(&inode_lock);
+ if (iget_done) {
+ spin_unlock(&bdev_lock);
+ iput(inode);
+ spin_lock(&bdev_lock);
+ }
}

static void bdev_clear_inode(struct inode *inode)







2004-06-18 14:23:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, Jun 18, 2004 at 09:03:31AM -0400, Chris Mason wrote:
> sync_sb_inodes, the filesystem block device inode ends up on some dirty
> list, and under memory pressure balance_dirty_pages_ratelimited will
> trigger writeback on it.
>
> There's nothing to write back of course, the real block device address
> space has no dirty pages at all. But, writeback is looking through the
> mapping and __bd_forget can't drop it until writeback has finished
> checking it.

So WTF does writeback bother with that? _That_ is the real bug here -
the only kind of bdev inodes that can have accesses to ->i_mapping
is fs/block_dev.c-created stuff.

2004-06-18 14:47:18

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, 2004-06-18 at 10:22, [email protected]
wrote:
> On Fri, Jun 18, 2004 at 09:03:31AM -0400, Chris Mason wrote:
> > sync_sb_inodes, the filesystem block device inode ends up on some dirty
> > list, and under memory pressure balance_dirty_pages_ratelimited will
> > trigger writeback on it.
> >
> > There's nothing to write back of course, the real block device address
> > space has no dirty pages at all. But, writeback is looking through the
> > mapping and __bd_forget can't drop it until writeback has finished
> > checking it.
>
> So WTF does writeback bother with that? _That_ is the real bug here -
> the only kind of bdev inodes that can have accesses to ->i_mapping
> is fs/block_dev.c-created stuff.

During writeback, we need to answer the question: "are there dirty pages
attached to this inode", and the only way to answer it is via the
address space.

If bdev inodes don't want other inodes using their address space, they
shouldn't be setting the i_mapping on other inodes. Since they are, the
bdev code needs to be aware that someone else might be using it.

-chris



2004-06-18 15:17:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, Jun 18, 2004 at 10:47:11AM -0400, Chris Mason wrote:
> During writeback, we need to answer the question: "are there dirty pages
> attached to this inode", and the only way to answer it is via the
> address space.
>
> If bdev inodes don't want other inodes using their address space, they
> shouldn't be setting the i_mapping on other inodes. Since they are, the
> bdev code needs to be aware that someone else might be using it.

Scheduled for 2.7.1; for now users of ->i_mapping (the fewer of them remain,
the better) have to be aware of bdev.

And yes, ->i_mapping flips on "normal" bdev inodes will go away - we set
->f_mapping on open directly.

2004-06-18 15:43:31

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, 2004-06-18 at 11:15, [email protected]
wrote:
> On Fri, Jun 18, 2004 at 10:47:11AM -0400, Chris Mason wrote:
> > During writeback, we need to answer the question: "are there dirty pages
> > attached to this inode", and the only way to answer it is via the
> > address space.
> >
> > If bdev inodes don't want other inodes using their address space, they
> > shouldn't be setting the i_mapping on other inodes. Since they are, the
> > bdev code needs to be aware that someone else might be using it.
>
> Scheduled for 2.7.1; for now users of ->i_mapping (the fewer of them remain,
> the better) have to be aware of bdev.
>
> And yes, ->i_mapping flips on "normal" bdev inodes will go away - we set
> ->f_mapping on open directly.

Fair enough, I'll cook up some code to bump the inode->bdev->bd_inode
i_count in __sync_single_inode It won't be pretty either though, I'll
have to drop the inode_lock so that some function can take the bdev_lock
to safely use inode->i_bdev.

-chris


2004-06-18 15:52:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, Jun 18, 2004 at 11:41:43AM -0400, Chris Mason wrote:
> > And yes, ->i_mapping flips on "normal" bdev inodes will go away - we set
> > ->f_mapping on open directly.
>
> Fair enough, I'll cook up some code to bump the inode->bdev->bd_inode
> i_count in __sync_single_inode It won't be pretty either though, I'll
> have to drop the inode_lock so that some function can take the bdev_lock
> to safely use inode->i_bdev.

*Ugh*

You do realize that ->i_bdev is not promised to be there either? Could you
show the actual code that steps into this mess?

2004-06-18 16:16:25

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, 2004-06-18 at 11:43, [email protected]
wrote:
> On Fri, Jun 18, 2004 at 11:41:43AM -0400, Chris Mason wrote:
> > > And yes, ->i_mapping flips on "normal" bdev inodes will go away - we set
> > > ->f_mapping on open directly.
> >
> > Fair enough, I'll cook up some code to bump the inode->bdev->bd_inode
> > i_count in __sync_single_inode It won't be pretty either though, I'll
> > have to drop the inode_lock so that some function can take the bdev_lock
> > to safely use inode->i_bdev.
>
> *Ugh*
>
> You do realize that ->i_bdev is not promised to be there either? Could you
> show the actual code that steps into this mess?
>
Grin, it won't be pretty, i_bdev can't be trusted without the bdev lock,
and I'll need to check for I_FREEING on it to make sure it isn't in
clear_inode.

The sequence leading up to all of this looks something like this:

CPU 0: CPU 1:

pdflush finds umount /dev/sda1
FS inode for
/dev/sda1 in dirty list,
makes it's way down
to __sync_single_inode

mapping = inode->i_mapping
(this points bdev address
space)
kill_block_super
close_bdev_excl
blkdev_put
bdput(bdev)
iput(bdev->bd_inode)
...
clear_inode(bdev inode)
bdev_clear_inode
__bd_forget(inode for /dev/sda1)
...
destroy_inode(bdev inode)
...
do_writepages(mapping, wbc)

Since mapping on CPU0 still points to the bdev address space, and that
has been freed by the destroy inode, we run into problems.

Maybe the real bug is the FS inode should never have ended up in the
dirty list. This should all work fine if the bdev inode were the only
one to ever hit a dirty list.

-chris


2004-06-18 20:25:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

Chris Mason <[email protected]> wrote:
>
> Maybe the real bug is the FS inode should never have ended up in the
> dirty list.

It'd be interesting to find out where and why it is being dirtied (atime?),
but even if we can prevent that from happening, people can still do things
like chmod on it, so we're back in the same situation.

There's tight coupling between writing back the inode and writing back its
pages, and at times it has caused problems. It's not clear _why_ there
should be such a coupling but it's never been a sufficient problem to
justify ripping it all up.

> This should all work fine if the bdev inode were the only
> one to ever hit a dirty list.

Something like this?

--- 25/fs/fs-writeback.c~dont-writeback-fd-bdev-inodes Fri Jun 18 12:54:08 2004
+++ 25-akpm/fs/fs-writeback.c Fri Jun 18 13:25:16 2004
@@ -156,7 +156,8 @@ __sync_single_inode(struct inode *inode,
struct address_space *mapping = inode->i_mapping;
struct super_block *sb = inode->i_sb;
int wait = wbc->sync_mode == WB_SYNC_ALL;
- int ret;
+ int is_fs_bdev; /* Is a block-special node */
+ int ret = 0;

BUG_ON(inode->i_state & I_LOCK);

@@ -167,13 +168,23 @@ __sync_single_inode(struct inode *inode,

spin_unlock(&inode_lock);

- ret = do_writepages(mapping, wbc);
+ /*
+ * blockdev address_spaces are always written back via their internal
+ * inodes, not via their /dev/hdXX inodes, so use is_fs_bdev to skip
+ * them here. We still need to write back the inode itself.
+ * And we cannot touch ->i_mapping of /dev/hdXX inodes at all, because
+ * umount can change their ->i_mapping.
+ */
+ is_fs_bdev = S_ISBLK(inode->i_mode) && (sb != blockdev_superblock);
+
+ if (!is_fs_bdev)
+ ret = do_writepages(mapping, wbc);

/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
write_inode(inode, wait);

- if (wait) {
+ if (wait && !is_fs_bdev) {
int err = filemap_fdatawait(mapping);
if (ret == 0)
ret = err;
@@ -182,7 +193,7 @@ __sync_single_inode(struct inode *inode,
spin_lock(&inode_lock);
inode->i_state &= ~I_LOCK;
if (!(inode->i_state & I_FREEING)) {
- if (!(inode->i_state & I_DIRTY) &&
+ if (!(inode->i_state & I_DIRTY) && !is_fs_bdev &&
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
/*
* We didn't write back all the pages. nfs_writepages()
_

2004-06-18 20:48:40

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, 2004-06-18 at 16:26, Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:
> >
> > Maybe the real bug is the FS inode should never have ended up in the
> > dirty list.
>
> It'd be interesting to find out where and why it is being dirtied (atime?),
> but even if we can prevent that from happening, people can still do things
> like chmod on it, so we're back in the same situation.
>
> There's tight coupling between writing back the inode and writing back its
> pages, and at times it has caused problems. It's not clear _why_ there
> should be such a coupling but it's never been a sufficient problem to
> justify ripping it all up.
>
> > This should all work fine if the bdev inode were the only
> > one to ever hit a dirty list.
>
> Something like this?

[ skip writing block-special inodes ]

Hmmm, any risk in missing data integrity syncs because of this?

-chris


2004-06-18 21:28:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

Chris Mason <[email protected]> wrote:
>
> [ skip writing block-special inodes ]
>
> Hmmm, any risk in missing data integrity syncs because of this?

Need to think about that. sys_fsync(), sys_fdatasync() and sys_msync() go
direct to file->f_mapping and sys_sync() will sync the blockdev via its
kernel-internal inode. What does that leave?

2004-06-18 23:23:01

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

On Fri, 2004-06-18 at 17:27, Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:
> >
> > [ skip writing block-special inodes ]
> >
> > Hmmm, any risk in missing data integrity syncs because of this?
>
> Need to think about that. sys_fsync(), sys_fdatasync() and sys_msync() go
> direct to file->f_mapping and sys_sync() will sync the blockdev via its
> kernel-internal inode. What does that leave?

I was worried about O_SYNC, That actually looks safe though,
generic_osync_inode will first write the mapping via filemap_fdatawrite
(the mapping comes from f_mapping).

It doesn't really give me warm fuzzies, but looks safe enough. Al had a
slightly different plan, maybe with your patch we can push his larger
changes off a bit?

-chris


2004-06-18 23:37:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH RFC] __bd_forget should wait for inodes using the mapping

Chris Mason <[email protected]> wrote:
>
> On Fri, 2004-06-18 at 17:27, Andrew Morton wrote:
> > Chris Mason <[email protected]> wrote:
> > >
> > > [ skip writing block-special inodes ]
> > >
> > > Hmmm, any risk in missing data integrity syncs because of this?
> >
> > Need to think about that. sys_fsync(), sys_fdatasync() and sys_msync() go
> > direct to file->f_mapping and sys_sync() will sync the blockdev via its
> > kernel-internal inode. What does that leave?
>
> I was worried about O_SYNC, That actually looks safe though,
> generic_osync_inode will first write the mapping via filemap_fdatawrite
> (the mapping comes from f_mapping).
>
> It doesn't really give me warm fuzzies, but looks safe enough. Al had a
> slightly different plan, maybe with your patch we can push his larger
> changes off a bit?

>From a design POV the patch I sent isn't very nice, and does add code to a
warmpath. If there's some way in which we can defer the i_mapping switch
until all references have gone away, that would be better?