2006-03-11 16:00:45

by OGAWA Hirofumi

[permalink] [raw]
Subject: [PATCH] Fix a race condition between ->i_mapping and iput()

This race became a cause of oops, and can reproduce by the following.

while true; do
dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
done


This race condition was between __sync_single_inode() and iput().

cpu0 (fs's inode) cpu1 (bdev's inode)
------------------------------------------------------------------------
close("/dev/hda2")
[...]
__sync_single_inode()
/* copy the bdev's ->i_mapping */
mapping = inode->i_mapping;

generic_forget_inode()
bdev_clear_inode()
/* restre the fs's ->i_mapping */
inode->i_mapping = &inode->i_data;
/* bdev's inode was freed */
destroy_inode(inode);

if (wait) {
/* dereference a freed bdev's mapping->host */
filemap_fdatawait(mapping); /* Oops */

Since __sync_signle_inode() is only taking a ref-count of fs's inode,
the another process can be close() and freeing the bdev's inode while
writing fs's inode. So, __sync_signle_inode() accesses the freed
->i_mapping, oops.

This patch takes ref-count of bdev's inode for fs's inode before
setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
So, if fs's inode is still living, bdev's inode shouldn't be freed.

This lifetime rule may be a poor, but very simple.

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

fs/block_dev.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)

diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
--- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2 2006-03-11 16:19:07.000000000 +0900
+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-11 17:52:11.000000000 +0900
@@ -418,21 +418,31 @@ EXPORT_SYMBOL(bdput);
static struct block_device *bd_acquire(struct inode *inode)
{
struct block_device *bdev;
+
spin_lock(&bdev_lock);
bdev = inode->i_bdev;
- if (bdev && igrab(bdev->bd_inode)) {
+ if (bdev) {
+ atomic_inc(&bdev->bd_inode->i_count);
spin_unlock(&bdev_lock);
return bdev;
}
spin_unlock(&bdev_lock);
+
bdev = bdget(inode->i_rdev);
if (bdev) {
spin_lock(&bdev_lock);
- if (inode->i_bdev)
- __bd_forget(inode);
- inode->i_bdev = bdev;
- inode->i_mapping = bdev->bd_inode->i_mapping;
- list_add(&inode->i_devices, &bdev->bd_inodes);
+ if (!inode->i_bdev) {
+ /*
+ * We take an additional bd_inode->i_count for inode,
+ * and it's released in clear_inode() of inode.
+ * So, we can access it via ->i_mapping always
+ * without igrab().
+ */
+ atomic_inc(&bdev->bd_inode->i_count);
+ inode->i_bdev = bdev;
+ inode->i_mapping = bdev->bd_inode->i_mapping;
+ list_add(&inode->i_devices, &bdev->bd_inodes);
+ }
spin_unlock(&bdev_lock);
}
return bdev;
@@ -442,10 +452,18 @@ static struct block_device *bd_acquire(s

void bd_forget(struct inode *inode)
{
+ struct block_device *bdev = NULL;
+
spin_lock(&bdev_lock);
- if (inode->i_bdev)
+ if (inode->i_bdev) {
+ if (inode->i_sb != blockdev_superblock)
+ bdev = inode->i_bdev;
__bd_forget(inode);
+ }
spin_unlock(&bdev_lock);
+
+ if (bdev)
+ iput(bdev->bd_inode);
}

int bd_claim(struct block_device *bdev, void *holder)
_


2006-03-16 22:46:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Fix a race condition between ->i_mapping and iput()

OGAWA Hirofumi <[email protected]> wrote:
>
> This race became a cause of oops, and can reproduce by the following.
>
> while true; do
> dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
> done
>
>
> This race condition was between __sync_single_inode() and iput().
>
> cpu0 (fs's inode) cpu1 (bdev's inode)
> ------------------------------------------------------------------------
> close("/dev/hda2")
> [...]
> __sync_single_inode()
> /* copy the bdev's ->i_mapping */
> mapping = inode->i_mapping;
>
> generic_forget_inode()
> bdev_clear_inode()
> /* restre the fs's ->i_mapping */
> inode->i_mapping = &inode->i_data;
> /* bdev's inode was freed */
> destroy_inode(inode);
>
> if (wait) {
> /* dereference a freed bdev's mapping->host */
> filemap_fdatawait(mapping); /* Oops */
>
> Since __sync_signle_inode() is only taking a ref-count of fs's inode,
> the another process can be close() and freeing the bdev's inode while
> writing fs's inode. So, __sync_signle_inode() accesses the freed
> ->i_mapping, oops.
>
> This patch takes ref-count of bdev's inode for fs's inode before
> setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
> So, if fs's inode is still living, bdev's inode shouldn't be freed.
>

OK, so to rephrase:

Whenever /dev/sda's inode->i_mapping points at a kernel-internal blockdev
inode's i_data, we will hold a ref on that blockdev inode, to pin its
i_data. That sounds sane.


>
> diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
> --- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2 2006-03-11 16:19:07.000000000 +0900
> +++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-11 17:52:11.000000000 +0900
> @@ -418,21 +418,31 @@ EXPORT_SYMBOL(bdput);
> static struct block_device *bd_acquire(struct inode *inode)
> {
> struct block_device *bdev;
> +
> spin_lock(&bdev_lock);
> bdev = inode->i_bdev;
> - if (bdev && igrab(bdev->bd_inode)) {
> + if (bdev) {
> + atomic_inc(&bdev->bd_inode->i_count);

No longer checking (I_FREEING|I_WILL_FREE). Why was this change included?

> spin_unlock(&bdev_lock);
> return bdev;
> }
> spin_unlock(&bdev_lock);
> +
> bdev = bdget(inode->i_rdev);
> if (bdev) {
> spin_lock(&bdev_lock);
> - if (inode->i_bdev)
> - __bd_forget(inode);
> - inode->i_bdev = bdev;
> - inode->i_mapping = bdev->bd_inode->i_mapping;
> - list_add(&inode->i_devices, &bdev->bd_inodes);
> + if (!inode->i_bdev) {
> + /*
> + * We take an additional bd_inode->i_count for inode,
> + * and it's released in clear_inode() of inode.
> + * So, we can access it via ->i_mapping always
> + * without igrab().
> + */
> + atomic_inc(&bdev->bd_inode->i_count);
> + inode->i_bdev = bdev;
> + inode->i_mapping = bdev->bd_inode->i_mapping;
> + list_add(&inode->i_devices, &bdev->bd_inodes);
> + }

And why this change? The removal of the __bd_forget() and the changed
behaviour if (inode->i_bdev != NULL)?

2006-03-17 14:13:12

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] Fix a race condition between ->i_mapping and iput()

Andrew Morton <[email protected]> writes:

>> This patch takes ref-count of bdev's inode for fs's inode before
>> setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
>> So, if fs's inode is still living, bdev's inode shouldn't be freed.
>>
>
> OK, so to rephrase:
>
> Whenever /dev/sda's inode->i_mapping points at a kernel-internal blockdev
> inode's i_data, we will hold a ref on that blockdev inode, to pin its
> i_data. That sounds sane.

Yes, thanks.

This approach may have one issue. If user is using a on-memory fs as
/dev (i.e. udev), the inode is never freed, so blockdev's inode is
also never freed.

And if user is creating the many block special file, blockdev's inode
is also pinned additionally.

Probably we can call a igrab() for ->i_mapping instead of this approach,
however it's "run-time overhead vs memory space" trade-off.

>> diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
>> --- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2 2006-03-11 16:19:07.000000000 +0900
>> +++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-11 17:52:11.000000000 +0900
>> @@ -418,21 +418,31 @@ EXPORT_SYMBOL(bdput);
>> static struct block_device *bd_acquire(struct inode *inode)
>> {
>> struct block_device *bdev;
>> +
>> spin_lock(&bdev_lock);
>> bdev = inode->i_bdev;
>> - if (bdev && igrab(bdev->bd_inode)) {
>> + if (bdev) {
>> + atomic_inc(&bdev->bd_inode->i_count);
>
> No longer checking (I_FREEING|I_WILL_FREE). Why was this change included?

With this patch, inode->i_bdev is freed in clear_inode() after the
inode became the I_FREEING.

The caller is holding a inode's ref, and if the inode->i_bdev != NULL,
->i_bdev->bd_inode never have the I_FREEING/I_WILL_FREE. So we don't
need to check I_FREEING/I_WILL_FREE anymore.

>> - if (inode->i_bdev)
>> - __bd_forget(inode);
>> - inode->i_bdev = bdev;
>> - inode->i_mapping = bdev->bd_inode->i_mapping;
>> - list_add(&inode->i_devices, &bdev->bd_inodes);
>> + if (!inode->i_bdev) {
>> + /*
>> + * We take an additional bd_inode->i_count for inode,
>> + * and it's released in clear_inode() of inode.
>> + * So, we can access it via ->i_mapping always
>> + * without igrab().
>> + */
>> + atomic_inc(&bdev->bd_inode->i_count);
>> + inode->i_bdev = bdev;
>> + inode->i_mapping = bdev->bd_inode->i_mapping;
>> + list_add(&inode->i_devices, &bdev->bd_inodes);
>> + }
>
> And why this change? The removal of the __bd_forget() and the changed
> behaviour if (inode->i_bdev != NULL)?

With the above change, we don't need to care the case of igrab()
returns NULL anymore. Since inode->i_bdev->bd_inode never has the
I_FREEING, we don't need to call __bd_forget() for that case.

I think we need to address an only following simple race here.

cpu0 cpu1
-----------------------------------+-----------------------------------------
bdev = bdget(inode->i_rdev);
if (bdev) {
bdev = bdget(inode->i_rdev);
if (bdev) {
spin_lock(&bdev_lock);
if (!inode->i_bdev) {
[...]
inode->i_bdev = bdev;
spin_lock(&bdev_lock);

Thanks.
--
OGAWA Hirofumi <[email protected]>