2006-03-09 16:23:30

by OGAWA Hirofumi

[permalink] [raw]
Subject: [PATCH] Fix a race condition between ->i_mapping and iput()

Hi,

This race became a cause of oops, and can reproduce by the following.

while true; do
dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
done


This race condition was between __sync_single_inode() and iput().

cpu0 (fs's inode) cpu1 (bdev's inode)
------------------------------------------------------------------------
close("/dev/hda2")
[...]
__sync_single_inode()
/* copy the bdev's ->i_mapping */
mapping = inode->i_mapping;

generic_forget_inode()
bdev_clear_inode()
/* restre the fs's ->i_mapping */
inode->i_mapping = &inode->i_data;
/* bdev's inode was freed */
destroy_inode(inode);

if (wait) {
/* dereference a freed bdev's mapping->host */
filemap_fdatawait(mapping); /* Oops */

Since __sync_signle_inode() is only taking a ref-count of fs's inode,
the another process can be close() and freeing the bdev's inode while
writing fs's inode. So, __sync_signle_inode() accesses the freed
->i_mapping, oops.

This patch takes ref-count of bdev's inode for fs's inode before
setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
So, if fs's inode is still living, bdev's inode shouldn't be freed.

This lifetime rule may be a poor, but very simple.

Umm... should we use an another rule to free it more early?
(e.g. if bdev's inode become I_FREEING, it should call bd_forget()
before releasing the inode_lock. And some place should call
igrab(->i_mapping->host->i_count) and iput())


What do you think, comment?

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

fs/block_dev.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
--- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2 2006-03-09 03:08:54.000000000 +0900
+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-09 03:08:54.000000000 +0900
@@ -441,13 +441,22 @@ static struct block_device *bd_acquire(s
spin_unlock(&bdev_lock);
bdev = bdget(inode->i_rdev);
if (bdev) {
+ struct block_device *old = NULL;
+
+ BUG_ON(inode->i_sb == blockdev_superblock);
spin_lock(&bdev_lock);
- if (inode->i_bdev)
+ if (inode->i_bdev) {
+ old = inode->i_bdev;
__bd_forget(inode);
+ }
+ atomic_inc(&bdev->bd_inode->i_count);
inode->i_bdev = bdev;
inode->i_mapping = bdev->bd_inode->i_mapping;
list_add(&inode->i_devices, &bdev->bd_inodes);
spin_unlock(&bdev_lock);
+
+ if (old)
+ iput(old->bd_inode);
}
return bdev;
}
@@ -456,10 +465,18 @@ static struct block_device *bd_acquire(s

void bd_forget(struct inode *inode)
{
+ struct block_device *old = NULL;
+
spin_lock(&bdev_lock);
- if (inode->i_bdev)
+ if (inode->i_bdev) {
+ if (inode->i_sb != blockdev_superblock)
+ old = inode->i_bdev;
__bd_forget(inode);
+ }
spin_unlock(&bdev_lock);
+
+ if (old)
+ iput(old->bd_inode);
}

int bd_claim(struct block_device *bdev, void *holder)
_


2006-03-10 04:30:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Fix a race condition between ->i_mapping and iput()

OGAWA Hirofumi <[email protected]> wrote:
>
> Hi,
>
> This race became a cause of oops, and can reproduce by the following.
>
> while true; do
> dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
> done
>
>
> This race condition was between __sync_single_inode() and iput().
>
> cpu0 (fs's inode) cpu1 (bdev's inode)
> ------------------------------------------------------------------------
> close("/dev/hda2")
> [...]
> __sync_single_inode()
> /* copy the bdev's ->i_mapping */
> mapping = inode->i_mapping;
>
> generic_forget_inode()
> bdev_clear_inode()
> /* restre the fs's ->i_mapping */
> inode->i_mapping = &inode->i_data;
> /* bdev's inode was freed */
> destroy_inode(inode);
>
> if (wait) {
> /* dereference a freed bdev's mapping->host */
> filemap_fdatawait(mapping); /* Oops */
>
> Since __sync_signle_inode() is only taking a ref-count of fs's inode,
> the another process can be close() and freeing the bdev's inode while
> writing fs's inode. So, __sync_signle_inode() accesses the freed
> ->i_mapping, oops.
>
> This patch takes ref-count of bdev's inode for fs's inode before
> setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
> So, if fs's inode is still living, bdev's inode shouldn't be freed.
>
> This lifetime rule may be a poor, but very simple.
>
> Umm... should we use an another rule to free it more early?
> (e.g. if bdev's inode become I_FREEING, it should call bd_forget()
> before releasing the inode_lock. And some place should call
> igrab(->i_mapping->host->i_count) and iput())
>
>
> What do you think, comment?

Maybe. This code seems relatively straightforward though.

It would be preferable to have a couple of comments in there explaining
what the new refcounting is there for.

>
>...
>
> void bd_forget(struct inode *inode)
> {
> + struct block_device *old = NULL;
> +
> spin_lock(&bdev_lock);
> - if (inode->i_bdev)
> + if (inode->i_bdev) {
> + if (inode->i_sb != blockdev_superblock)
> + old = inode->i_bdev;
> __bd_forget(inode);
> + }
> spin_unlock(&bdev_lock);
> +
> + if (old)
> + iput(old->bd_inode);
> }

We're missing an atomic_inc(i_count) here?

2006-03-10 15:31:06

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] Fix a race condition between ->i_mapping and iput()

Andrew Morton <[email protected]> writes:

>> void bd_forget(struct inode *inode)
>> {
>> + struct block_device *old = NULL;
>> +
>> spin_lock(&bdev_lock);
>> - if (inode->i_bdev)
>> + if (inode->i_bdev) {
>> + if (inode->i_sb != blockdev_superblock)
>> + old = inode->i_bdev;
>> __bd_forget(inode);
>> + }
>> spin_unlock(&bdev_lock);
>> +
>> + if (old)
>> + iput(old->bd_inode);
>> }
>
> We're missing an atomic_inc(i_count) here?

I think we don't need an atomic_inc(i_count) here. The inode of
argument has already I_FREEING, so we just call iput(bdev's inode).

But, sorry. My patch seems broken.

With that simple rule, the code became more simple. And comment was
added.

I'll test a following patch today.
--
OGAWA Hirofumi <[email protected]>

diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
--- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2 2006-03-10 22:54:09.000000000 +0900
+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-10 23:05:03.000000000 +0900
@@ -432,21 +432,32 @@ EXPORT_SYMBOL(bdput);
static struct block_device *bd_acquire(struct inode *inode)
{
struct block_device *bdev;
+
+ BUG_ON(inode->i_sb == blockdev_superblock);
spin_lock(&bdev_lock);
bdev = inode->i_bdev;
- if (bdev && igrab(bdev->bd_inode)) {
+ if (bdev) {
+ atomic_inc(&bdev->bd_inode->i_count);
spin_unlock(&bdev_lock);
return bdev;
}
spin_unlock(&bdev_lock);
+
bdev = bdget(inode->i_rdev);
if (bdev) {
spin_lock(&bdev_lock);
- if (inode->i_bdev)
- __bd_forget(inode);
- inode->i_bdev = bdev;
- inode->i_mapping = bdev->bd_inode->i_mapping;
- list_add(&inode->i_devices, &bdev->bd_inodes);
+ if (!inode->i_bdev) {
+ /*
+ * We take an additional bd_inode->i_count for inode,
+ * and it's released in clear_inode() of inode.
+ * So, we can access it via ->i_mapping always
+ * without igrab().
+ */
+ atomic_inc(&bdev->bd_inode->i_count);
+ inode->i_bdev = bdev;
+ inode->i_mapping = bdev->bd_inode->i_mapping;
+ list_add(&inode->i_devices, &bdev->bd_inodes);
+ }
spin_unlock(&bdev_lock);
}
return bdev;
@@ -456,10 +467,18 @@ static struct block_device *bd_acquire(s

void bd_forget(struct inode *inode)
{
+ struct block_device *old = NULL;
+
spin_lock(&bdev_lock);
- if (inode->i_bdev)
+ if (inode->i_bdev) {
+ if (inode->i_sb != blockdev_superblock)
+ old = inode->i_bdev;
__bd_forget(inode);
+ }
spin_unlock(&bdev_lock);
+
+ if (old)
+ iput(old->bd_inode);
}

int bd_claim(struct block_device *bdev, void *holder)
_