Date: Tue, 3 Feb 2015 12:01:28 +0900
From: Minchan Kim
To: Sergey Senozhatsky
Cc: Andrew Morton, Jerome Marchand, Ganesh Mahendran, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] zram: fix umount-reset_store-mount race condition
Message-ID: <20150203030127.GA1541@blaptop>
In-Reply-To: <1422886120-16534-1-git-send-email-sergey.senozhatsky@gmail.com>

Hello Sergey,

On Mon, Feb 02, 2015 at 11:08:40PM +0900, Sergey Senozhatsky wrote:
> Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex
> to avoid the ->bd_holders race condition:
>
>         CPU0                            CPU1
> umount /* zram->init_done is true */
> reset_store()
> bdev->bd_holders == 0                   mount
> ...                                     zram_make_request()
> zram_reset_device()
>
> However, his solution required a considerable amount of code movement,
> which we can avoid.
>
> Apart from using bdev->bd_mutex in reset_store(), this patch also
> simplifies zram_reset_device().
>
> zram_reset_device() has a bool parameter reset_capacity which tells
> it whether the disk capacity and the disk itself should be reset.
> There are two zram_reset_device() callers:
> -- zram_exit() passes reset_capacity=false
> -- reset_store() passes reset_capacity=true
>
> So we can move the reset_capacity-sensitive work out of
> zram_reset_device() and perform it unconditionally in reset_store().
> This also lets us drop the reset_capacity parameter from
> zram_reset_device() and pass only the zram pointer.
>
> Reported-by: Ganesh Mahendran
> Signed-off-by: Sergey Senozhatsky

Looks good to me, but as I told you, we also need to handle processes
that open the block device without FMODE_EXCL. I can trigger the
following warning with the simple script below. In addition, after
applying your patch I added an msleep(2000) below
set_capacity(zram->disk, 0) to make the race window huge (kudos to
Ganesh!).

#!/bin/bash

echo $((60<<30)) > /sys/block/zram0/disksize

# Check that your kernel has CONFIG_SCHED_AUTOGROUP=y.
# I have no idea why it doesn't happen without setsid and CONFIG_SCHED_AUTOGROUP=y.
# Please let me know if someone knows the reason.
setsid dd if=/dev/zero of=/dev/zram0 &
sleep 1
setsid echo 1 > /sys/block/zram0/reset

[ 30.683449] zram0: detected capacity change from 0 to 64424509440
[ 33.736869] Buffer I/O error on dev zram0, logical block 180823, lost async page write
[ 33.738814] Buffer I/O error on dev zram0, logical block 180824, lost async page write
[ 33.740654] Buffer I/O error on dev zram0, logical block 180825, lost async page write
[ 33.742551] Buffer I/O error on dev zram0, logical block 180826, lost async page write
[ 33.744153] Buffer I/O error on dev zram0, logical block 180827, lost async page write
[ 33.745807] Buffer I/O error on dev zram0, logical block 180828, lost async page write
[ 33.747419] Buffer I/O error on dev zram0, logical block 180829, lost async page write
[ 33.749060] Buffer I/O error on dev zram0, logical block 180830, lost async page write
[ 33.750687] Buffer I/O error on dev zram0, logical block 180831, lost async page write
[ 33.752286] Buffer I/O error on dev zram0, logical block 180832, lost async page write
[ 33.811590] ------------[ cut here ]------------
[ 33.812038] WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
[ 33.812817] Modules linked in:
[ 33.813142] CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
[ 33.813837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 33.814525]  ffffffff81801a2d ffff880061e77db8 ffffffff815b848e 0000000000000001
[ 33.815196]  0000000000000000 ffff880061e77df8 ffffffff8104de2a 0000000000000000
[ 33.815867]  ffff88005da287f0 ffff88005da28680 ffff88005da28770 ffff88005da28698
[ 33.816536] Call Trace:
[ 33.816817]  [] dump_stack+0x45/0x57
[ 33.817304]  [] warn_slowpath_common+0x8a/0xc0
[ 33.817829]  [] warn_slowpath_null+0x1a/0x20
[ 33.818331]  [] __blkdev_put+0x1d7/0x210
[ 33.818797]  [] blkdev_put+0x50/0x130
[ 33.819244]  [] blkdev_close+0x25/0x30
[ 33.819723]  [] __fput+0xdf/0x1e0
[ 33.820140]  [] ____fput+0xe/0x10
[ 33.820576]  [] task_work_run+0xa7/0xe0
[ 33.821151]  [] do_notify_resume+0x49/0x60
[ 33.821721]  [] int_signal+0x12/0x17
[ 33.822228] ---[ end trace 274fbbc5664827d2 ]---

The warning comes from bdev_write_inode in the blkdev_put path:

static void bdev_write_inode(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	while (inode->i_state & I_DIRTY) {
		spin_unlock(&inode->i_lock);
		WARN_ON_ONCE(write_inode_now(inode, true));	<========= here
		spin_lock(&inode->i_lock);
	}
	spin_unlock(&inode->i_lock);
}

The reason is that the dd process hits I/O failures because the block
device suddenly disappears, so filemap_check_errors() called from
__writeback_single_inode() returns -EIO.

If we check bd_openers instead of bd_holders, we could address the
problem. When I looked at brd, it already uses bd_openers rather than
bd_holders, so although I'm not an expert on the block layer, that
seems to be the better choice.
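Just to illustrate what I mean, here is a rough, untested sketch (not a
tested patch): your locking stays exactly as in the hunk quoted below,
only the guard tests bd_openers instead of bd_holders. The buf parsing
and the dev_to_zram()/bdget_disk() setup are assumed to stay as in the
current driver:

static ssize_t reset_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t len)
{
	int ret;
	unsigned short do_reset;
	/* setup assumed unchanged from the current driver */
	struct zram *zram = dev_to_zram(dev);
	struct block_device *bdev = bdget_disk(zram->disk, 0);

	if (!bdev)
		return -ENOMEM;

	mutex_lock(&bdev->bd_mutex);
	/* Refuse to reset while *anyone* still has the device open */
	if (bdev->bd_openers) {
		ret = -EBUSY;
		goto out;
	}

	/* do_reset parsing assumed unchanged from the current driver */
	ret = kstrtou16(buf, 10, &do_reset);
	if (ret)
		goto out;

	if (!do_reset) {
		ret = -EINVAL;
		goto out;
	}

	/* Make sure all pending I/O is finished */
	fsync_bdev(bdev);
	zram_reset_device(zram);
	set_capacity(zram->disk, 0);

	mutex_unlock(&bdev->bd_mutex);
	revalidate_disk(zram->disk);
	bdput(bdev);

	return len;

out:
	mutex_unlock(&bdev->bd_mutex);
	bdput(bdev);
	return ret;
}

With that, the non-exclusive opener in the script above (the setsid dd
writer, which opens /dev/zram0 without FMODE_EXCL) would make the reset
fail with -EBUSY instead of racing with the writeback.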
> ---
>  drivers/block/zram/zram_drv.c | 23 +++++++++--------------
>  1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index aa5a4c5..a32069f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -717,7 +717,7 @@ static void zram_bio_discard(struct zram *zram, u32 index,
>  	}
>  }
>
> -static void zram_reset_device(struct zram *zram, bool reset_capacity)
> +static void zram_reset_device(struct zram *zram)
>  {
>  	down_write(&zram->init_lock);
>
> @@ -736,18 +736,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
>  	memset(&zram->stats, 0, sizeof(zram->stats));
>
>  	zram->disksize = 0;
> -	if (reset_capacity)
> -		set_capacity(zram->disk, 0);
> -
>  	up_write(&zram->init_lock);
> -
> -	/*
> -	 * Revalidate disk out of the init_lock to avoid lockdep splat.
> -	 * It's okay because disk's capacity is protected by init_lock
> -	 * so that revalidate_disk always sees up-to-date capacity.
> -	 */
> -	if (reset_capacity)
> -		revalidate_disk(zram->disk);
>  }
>
>  static ssize_t disksize_store(struct device *dev,
> @@ -820,6 +809,7 @@ static ssize_t reset_store(struct device *dev,
>  	if (!bdev)
>  		return -ENOMEM;
>
> +	mutex_lock(&bdev->bd_mutex);
>  	/* Do not reset an active device! */
>  	if (bdev->bd_holders) {
>  		ret = -EBUSY;
> @@ -837,12 +827,17 @@ static ssize_t reset_store(struct device *dev,
>
>  	/* Make sure all pending I/O is finished */
>  	fsync_bdev(bdev);
> +	zram_reset_device(zram);
> +	set_capacity(zram->disk, 0);
> +
> +	mutex_unlock(&bdev->bd_mutex);
> +	revalidate_disk(zram->disk);
>  	bdput(bdev);
>
> -	zram_reset_device(zram, true);
>  	return len;
>
>  out:
> +	mutex_unlock(&bdev->bd_mutex);
>  	bdput(bdev);
>  	return ret;
>  }
> @@ -1188,7 +1183,7 @@ static void __exit zram_exit(void)
>  	 * Shouldn't access zram->disk after destroy_device
>  	 * because destroy_device already released zram->disk.
>  	 */
> -	zram_reset_device(zram, false);
> +	zram_reset_device(zram);
>  	}
>
>  	unregister_blkdev(zram_major, "zram");
> --
> 2.3.0.rc2

-- 
Kind regards,
Minchan Kim