Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex
to avoid ->bd_holders race condition:
CPU0 CPU1
umount /* zram->init_done is true */
reset_store()
bdev->bd_holders == 0 mount
... zram_make_request()
zram_reset_device()
However, his solution required some considerable amount of code movement,
which we can avoid.
Apart from using bdev->bd_mutex in reset_store(), this patch also
simplifies zram_reset_device().
zram_reset_device() has a bool parameter reset_capacity which tells
it whether disk capacity and itself disk should be reset. There are
two zram_reset_device() callers:
-- zram_exit() passes reset_capacity=false
-- reset_store() passes reset_capacity=true
So we can move reset_capacity-sensitive work out of zram_reset_device()
and perform it unconditionally in reset_store(). This also lets us drop
reset_capacity parameter from zram_reset_device() and pass zram pointer
only.
Reported-by: Ganesh Mahendran <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
drivers/block/zram/zram_drv.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aa5a4c5..a32069f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -717,7 +717,7 @@ static void zram_bio_discard(struct zram *zram, u32 index,
}
}
-static void zram_reset_device(struct zram *zram, bool reset_capacity)
+static void zram_reset_device(struct zram *zram)
{
down_write(&zram->init_lock);
@@ -736,18 +736,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
memset(&zram->stats, 0, sizeof(zram->stats));
zram->disksize = 0;
- if (reset_capacity)
- set_capacity(zram->disk, 0);
-
up_write(&zram->init_lock);
-
- /*
- * Revalidate disk out of the init_lock to avoid lockdep splat.
- * It's okay because disk's capacity is protected by init_lock
- * so that revalidate_disk always sees up-to-date capacity.
- */
- if (reset_capacity)
- revalidate_disk(zram->disk);
}
static ssize_t disksize_store(struct device *dev,
@@ -820,6 +809,7 @@ static ssize_t reset_store(struct device *dev,
if (!bdev)
return -ENOMEM;
+ mutex_lock(&bdev->bd_mutex);
/* Do not reset an active device! */
if (bdev->bd_holders) {
ret = -EBUSY;
@@ -837,12 +827,17 @@ static ssize_t reset_store(struct device *dev,
/* Make sure all pending I/O is finished */
fsync_bdev(bdev);
+ zram_reset_device(zram);
+ set_capacity(zram->disk, 0);
+
+ mutex_unlock(&bdev->bd_mutex);
+ revalidate_disk(zram->disk);
bdput(bdev);
- zram_reset_device(zram, true);
return len;
out:
+ mutex_unlock(&bdev->bd_mutex);
bdput(bdev);
return ret;
}
@@ -1188,7 +1183,7 @@ static void __exit zram_exit(void)
* Shouldn't access zram->disk after destroy_device
* because destroy_device already released zram->disk.
*/
- zram_reset_device(zram, false);
+ zram_reset_device(zram);
}
unregister_blkdev(zram_major, "zram");
--
2.3.0.rc2
Hello Sergey,
On Mon, Feb 02, 2015 at 11:08:40PM +0900, Sergey Senozhatsky wrote:
> Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex
> to avoid ->bd_holders race condition:
>
> CPU0 CPU1
> umount /* zram->init_done is true */
> reset_store()
> bdev->bd_holders == 0 mount
> ... zram_make_request()
> zram_reset_device()
>
> However, his solution required some considerable amount of code movement,
> which we can avoid.
>
> Apart from using bdev->bd_mutex in reset_store(), this patch also
> simplifies zram_reset_device().
>
> zram_reset_device() has a bool parameter reset_capacity which tells
> it whether disk capacity and itself disk should be reset. There are
> two zram_reset_device() callers:
> -- zram_exit() passes reset_capacity=false
> -- reset_store() passes reset_capacity=true
>
> So we can move reset_capacity-sensitive work out of zram_reset_device()
> and perform it unconditionally in reset_store(). This also lets us drop
> reset_capacity parameter from zram_reset_device() and pass zram pointer
> only.
>
> Reported-by: Ganesh Mahendran <[email protected]>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
Looks good to me but as I told you, we need to check processes open
the block device as !FMODE_EXCL.
I can make following warning with below simple script.
In addition, I added msleep(2000) below set_capacity(zram->disk, 0)
after applying your patch to make window huge(Kudos to Ganesh!)
#!/bin/bash
echo $((60<<30)) > /sys/block/zram0/disksize
# Check your kernel turns on CONFIG_SCHED_AUTOGROUP=y
# I have no idea why it doesn't happen without setsid and CONFIG_SCHED_AUTOGROUP=y
# Please let me know it if someone can know the reason.
setsid dd if=/dev/zero of=/dev/zram0 &
sleep 1
setsid echo 1 > /sys/block/zram0/reset
[ 30.683449] zram0: detected capacity change from 0 to 64424509440
[ 33.736869] Buffer I/O error on dev zram0, logical block 180823, lost async page write
[ 33.738814] Buffer I/O error on dev zram0, logical block 180824, lost async page write
[ 33.740654] Buffer I/O error on dev zram0, logical block 180825, lost async page write
[ 33.742551] Buffer I/O error on dev zram0, logical block 180826, lost async page write
[ 33.744153] Buffer I/O error on dev zram0, logical block 180827, lost async page write
[ 33.745807] Buffer I/O error on dev zram0, logical block 180828, lost async page write
[ 33.747419] Buffer I/O error on dev zram0, logical block 180829, lost async page write
[ 33.749060] Buffer I/O error on dev zram0, logical block 180830, lost async page write
[ 33.750687] Buffer I/O error on dev zram0, logical block 180831, lost async page write
[ 33.752286] Buffer I/O error on dev zram0, logical block 180832, lost async page write
[ 33.811590] ------------[ cut here ]------------
[ 33.812038] WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
[ 33.812817] Modules linked in:
[ 33.813142] CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
[ 33.813837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 33.814525] ffffffff81801a2d ffff880061e77db8 ffffffff815b848e 0000000000000001
[ 33.815196] 0000000000000000 ffff880061e77df8 ffffffff8104de2a 0000000000000000
[ 33.815867] ffff88005da287f0 ffff88005da28680 ffff88005da28770 ffff88005da28698
[ 33.816536] Call Trace:
[ 33.816817] [<ffffffff815b848e>] dump_stack+0x45/0x57
[ 33.817304] [<ffffffff8104de2a>] warn_slowpath_common+0x8a/0xc0
[ 33.817829] [<ffffffff8104df1a>] warn_slowpath_null+0x1a/0x20
[ 33.818331] [<ffffffff811b60b7>] __blkdev_put+0x1d7/0x210
[ 33.818797] [<ffffffff811b69c0>] blkdev_put+0x50/0x130
[ 33.819244] [<ffffffff811b6b55>] blkdev_close+0x25/0x30
[ 33.819723] [<ffffffff8118079f>] __fput+0xdf/0x1e0
[ 33.820140] [<ffffffff811808ee>] ____fput+0xe/0x10
[ 33.820576] [<ffffffff81068e07>] task_work_run+0xa7/0xe0
[ 33.821151] [<ffffffff81002b89>] do_notify_resume+0x49/0x60
[ 33.821721] [<ffffffff815bf09d>] int_signal+0x12/0x17
[ 33.822228] ---[ end trace 274fbbc5664827d2 ]---
The warning comes from bdev_write_node in blkdev_put path.
tatic void bdev_write_inode(struct inode *inode)
{
spin_lock(&inode->i_lock);
while (inode->i_state & I_DIRTY) {
spin_unlock(&inode->i_lock);
WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
spin_lock(&inode->i_lock);
}
spin_unlock(&inode->i_lock);
}
The reason is dd process encounters I/O fails due to sudden block device
disappear so in filemap_check_errors in __writeback_single_inode returns
-EIO.
If we checks bd_openners instead of bd_holders, we could address the
problem. When I see the brd, it already have used it rather than
bd_holders so although I'm not a expert of block layer, it seems
to be better.
> ---
> drivers/block/zram/zram_drv.c | 23 +++++++++--------------
> 1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index aa5a4c5..a32069f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -717,7 +717,7 @@ static void zram_bio_discard(struct zram *zram, u32 index,
> }
> }
>
> -static void zram_reset_device(struct zram *zram, bool reset_capacity)
> +static void zram_reset_device(struct zram *zram)
> {
> down_write(&zram->init_lock);
>
> @@ -736,18 +736,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> memset(&zram->stats, 0, sizeof(zram->stats));
>
> zram->disksize = 0;
> - if (reset_capacity)
> - set_capacity(zram->disk, 0);
> -
> up_write(&zram->init_lock);
> -
> - /*
> - * Revalidate disk out of the init_lock to avoid lockdep splat.
> - * It's okay because disk's capacity is protected by init_lock
> - * so that revalidate_disk always sees up-to-date capacity.
> - */
> - if (reset_capacity)
> - revalidate_disk(zram->disk);
> }
>
> static ssize_t disksize_store(struct device *dev,
> @@ -820,6 +809,7 @@ static ssize_t reset_store(struct device *dev,
> if (!bdev)
> return -ENOMEM;
>
> + mutex_lock(&bdev->bd_mutex);
> /* Do not reset an active device! */
> if (bdev->bd_holders) {
> ret = -EBUSY;
> @@ -837,12 +827,17 @@ static ssize_t reset_store(struct device *dev,
>
> /* Make sure all pending I/O is finished */
> fsync_bdev(bdev);
> + zram_reset_device(zram);
> + set_capacity(zram->disk, 0);
> +
> + mutex_unlock(&bdev->bd_mutex);
> + revalidate_disk(zram->disk);
> bdput(bdev);
>
> - zram_reset_device(zram, true);
> return len;
>
> out:
> + mutex_unlock(&bdev->bd_mutex);
> bdput(bdev);
> return ret;
> }
> @@ -1188,7 +1183,7 @@ static void __exit zram_exit(void)
> * Shouldn't access zram->disk after destroy_device
> * because destroy_device already released zram->disk.
> */
> - zram_reset_device(zram, false);
> + zram_reset_device(zram);
> }
>
> unregister_blkdev(zram_major, "zram");
> --
> 2.3.0.rc2
>
--
Kind regards,
Minchan Kim
Hello, Sergey
2015-02-02 22:08 GMT+08:00 Sergey Senozhatsky <[email protected]>:
> Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex
> to avoid ->bd_holders race condition:
>
> CPU0 CPU1
> umount /* zram->init_done is true */
> reset_store()
> bdev->bd_holders == 0 mount
> ... zram_make_request()
> zram_reset_device()
>
> However, his solution required some considerable amount of code movement,
> which we can avoid.
>
> Apart from using bdev->bd_mutex in reset_store(), this patch also
> simplifies zram_reset_device().
>
> zram_reset_device() has a bool parameter reset_capacity which tells
> it whether disk capacity and itself disk should be reset. There are
> two zram_reset_device() callers:
> -- zram_exit() passes reset_capacity=false
> -- reset_store() passes reset_capacity=true
>
> So we can move reset_capacity-sensitive work out of zram_reset_device()
> and perform it unconditionally in reset_store(). This also lets us drop
> reset_capacity parameter from zram_reset_device() and pass zram pointer
> only.
>
> Reported-by: Ganesh Mahendran <[email protected]>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 23 +++++++++--------------
> 1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index aa5a4c5..a32069f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -717,7 +717,7 @@ static void zram_bio_discard(struct zram *zram, u32 index,
> }
> }
>
> -static void zram_reset_device(struct zram *zram, bool reset_capacity)
> +static void zram_reset_device(struct zram *zram)
> {
> down_write(&zram->init_lock);
>
> @@ -736,18 +736,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
> memset(&zram->stats, 0, sizeof(zram->stats));
>
> zram->disksize = 0;
> - if (reset_capacity)
> - set_capacity(zram->disk, 0);
How about keep this here? Protected by zram->init_lock.
set_capacity(zram->disk, 0);
Sorry for the late response.
> -
> up_write(&zram->init_lock);
> -
> - /*
> - * Revalidate disk out of the init_lock to avoid lockdep splat.
> - * It's okay because disk's capacity is protected by init_lock
> - * so that revalidate_disk always sees up-to-date capacity.
> - */
> - if (reset_capacity)
> - revalidate_disk(zram->disk);
> }
>
> static ssize_t disksize_store(struct device *dev,
> @@ -820,6 +809,7 @@ static ssize_t reset_store(struct device *dev,
> if (!bdev)
> return -ENOMEM;
>
> + mutex_lock(&bdev->bd_mutex);
> /* Do not reset an active device! */
> if (bdev->bd_holders) {
> ret = -EBUSY;
> @@ -837,12 +827,17 @@ static ssize_t reset_store(struct device *dev,
>
> /* Make sure all pending I/O is finished */
> fsync_bdev(bdev);
> + zram_reset_device(zram);
> + set_capacity(zram->disk, 0);
> +
> + mutex_unlock(&bdev->bd_mutex);
> + revalidate_disk(zram->disk);
> bdput(bdev);
>
> - zram_reset_device(zram, true);
> return len;
>
> out:
> + mutex_unlock(&bdev->bd_mutex);
> bdput(bdev);
> return ret;
> }
> @@ -1188,7 +1183,7 @@ static void __exit zram_exit(void)
> * Shouldn't access zram->disk after destroy_device
> * because destroy_device already released zram->disk.
> */
> - zram_reset_device(zram, false);
> + zram_reset_device(zram);
> }
>
> unregister_blkdev(zram_major, "zram");
> --
> 2.3.0.rc2
>
Hello,
On (02/03/15 22:05), Ganesh Mahendran wrote:
> > zram->disksize = 0;
> > - if (reset_capacity)
> > - set_capacity(zram->disk, 0);
>
> How about keep this here? Protected by zram->init_lock.
> set_capacity(zram->disk, 0);
why?
-ss
On (02/03/15 23:15), Sergey Senozhatsky wrote:
> > How about keep this here? Protected by zram->init_lock.
> > set_capacity(zram->disk, 0);
>
> why?
>
yeah, I see why. good catch.
hm, why do we perform destroy_device() before zram_reset_device() in
zram_exit()?
how about doing something like this (I don't want to return
that bool param back):
---
drivers/block/zram/zram_drv.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a32069f..386f7ed 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -736,6 +736,7 @@ static void zram_reset_device(struct zram *zram)
memset(&zram->stats, 0, sizeof(zram->stats));
zram->disksize = 0;
+ set_capacity(zram->disk, 0);
up_write(&zram->init_lock);
}
@@ -828,7 +829,6 @@ static ssize_t reset_store(struct device *dev,
/* Make sure all pending I/O is finished */
fsync_bdev(bdev);
zram_reset_device(zram);
- set_capacity(zram->disk, 0);
mutex_unlock(&bdev->bd_mutex);
revalidate_disk(zram->disk);
@@ -1178,12 +1178,8 @@ static void __exit zram_exit(void)
for (i = 0; i < num_devices; i++) {
zram = &zram_devices[i];
- destroy_device(zram);
- /*
- * Shouldn't access zram->disk after destroy_device
- * because destroy_device already released zram->disk.
- */
zram_reset_device(zram);
+ destroy_device(zram);
}
unregister_blkdev(zram_major, "zram");
On (02/03/15 23:52), Sergey Senozhatsky wrote:
> On (02/03/15 23:15), Sergey Senozhatsky wrote:
> > > How about keep this here? Protected by zram->init_lock.
> > > set_capacity(zram->disk, 0);
> >
> > why?
> >
> yeah, I see why. good catch.
>
> hm, why do we perform destroy_device() before zram_reset_device() in
> zram_exit()?
>
> how about doing something like this (I don't want to return
> that bool param back):
disregard the last one.
this is done to remove sysfs before we do reset, so we don't race module
unload with `echo 2G > /.../disksize', f.e.
well, several options:
1) move ->init_lock from zram_reset_device() to its callers.
iow, do
down_write(&zram->init_lock);
zram_reset_device(zram);
up_write(&zram->init_lock);
2) remove sysfs group separate, before zram_reset_device() in
zram_exit()
sysfs_remove_group()
zram_reset_device();
destroy_device();
3) return back bool reset_capacity to zram_reset_device(). but this one
is somewhat ungly. destroy() before reset() loks misleading, besides,
after destroy() in zram_reset_device() we
/*
* Shouldn't access zram->disk after destroy_device
* because destroy_device already released zram->disk.
*/
so we have garbaged ->disk pointer there, which is quite unsafe.
-ss
Hello,
On Wed, Feb 04, 2015 at 12:06:24AM +0900, Sergey Senozhatsky wrote:
> On (02/03/15 23:52), Sergey Senozhatsky wrote:
> > On (02/03/15 23:15), Sergey Senozhatsky wrote:
> > > > How about keep this here? Protected by zram->init_lock.
> > > > set_capacity(zram->disk, 0);
> > >
> > > why?
> > >
> > yeah, I see why. good catch.
> >
> > hm, why do we perform destroy_device() before zram_reset_device() in
> > zram_exit()?
> >
> > how about doing something like this (I don't want to return
> > that bool param back):
>
> disregard the last one.
>
>
> this is done to remove sysfs before we do reset, so we don't race module
> unload with `echo 2G > /.../disksize', f.e.
>
> well, several options:
>
> 1) move ->init_lock from zram_reset_device() to its callers.
> iow, do
>
> down_write(&zram->init_lock);
> zram_reset_device(zram);
> up_write(&zram->init_lock);
So, you mean this?
reset_store
down_write(&zram->init_lock);
zram_reset_device(zram);
set_capacity(zram->disk, 0);
up_write(&zram->init_lock);
If so, +1.
Hope you send a squash patch to Andrew.
>
> 2) remove sysfs group separate, before zram_reset_device() in
> zram_exit()
>
> sysfs_remove_group()
> zram_reset_device();
> destroy_device();
I want to keep sysfs creation/destory in zram create/destroy abstraction.
>
> 3) return back bool reset_capacity to zram_reset_device(). but this one
> is somewhat ungly. destroy() before reset() loks misleading, besides,
> after destroy() in zram_reset_device() we
> /*
> * Shouldn't access zram->disk after destroy_device
> * because destroy_device already released zram->disk.
> */
>
> so we have garbaged ->disk pointer there, which is quite unsafe.
Agree.
>
> -ss
--
Kind regards,
Minchan Kim
hello,
On (02/04/15 00:39), Minchan Kim wrote:
> So, you mean this?
>
> reset_store
>
> down_write(&zram->init_lock);
> zram_reset_device(zram);
> set_capacity(zram->disk, 0);
> up_write(&zram->init_lock);
>
>
> If so, +1.
> Hope you send a squash patch to Andrew.
yes, that's what I meant
+ zram_exit(void):
down_write(&zram->init_lock);
zram_reset_device(zram);
destroy_device(zram);
up_write(&zram->init_lock);
but I did something different and I think, a bit better than this.
sent a patch several minutes ago.
-ss