2019-08-14 10:33:51

by Martijn Coenen

[permalink] [raw]
Subject: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

Since Android Q, the creation and configuration of loop devices is in
the critical path of device boot. We found that the configuration of
loop devices is pretty slow, because many ioctl()'s involve freezing the
block queue, which in turn needs to wait for an RCU grace period. On
Android devices we've observed up to 60ms for the creation and
configuration of a single loop device; as we anticipate creating many
more in the future, we'd like to avoid this delay.

This allows LOOP_SET_BLOCK_SIZE to be called before the loop device has
been bound; since the block queue is not running at that point, we can
avoid the expensive freezing of the queue.

On a recent x86, this patch yields the following results:

===
Call LOOP_SET_BLOCK_SIZE on /dev/loop0 before being bound
===
~# time ./set_block_size

real 0m0.002s
user 0m0.000s
sys 0m0.002s

===
Call LOOP_SET_BLOCK_SIZE on /dev/loop0 after being bound
===
~# losetup /dev/loop0 fs.img
~# time ./set_block_size

real 0m0.008s
user 0m0.000s
sys 0m0.002s

Over many runs, this is a 4x improvement.

This is RFC because technically it is a change in behavior; before,
calling LOOP_SET_BLOCK_SIZE on an unbound device would return ENXIO, and
userspace programs that left it in their code despite the returned
error, would now suddenly see the requested value effectuated. I'm not
sure whether this is acceptable.

An alternative might be a CONFIG option to set the default block size to
another value than 512. Another alternative I considered is allowing the
block device to be created with a "frozen" queue, where we can manually
unfreeze the queue when all the configuration is done. This would be a
much larger code change, though.

Signed-off-by: Martijn Coenen <[email protected]>
---
drivers/block/loop.c | 42 +++++++++++++++++++++++-------------------
1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ab7ca5989097a..d4348a4fdd7a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -214,7 +214,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
* LO_FLAGS_READ_ONLY, both are set from kernel, and losetup
* will get updated by ioctl(LOOP_GET_STATUS)
*/
- blk_mq_freeze_queue(lo->lo_queue);
+ if (lo->lo_state == Lo_bound)
+ blk_mq_freeze_queue(lo->lo_queue);
lo->use_dio = use_dio;
if (use_dio) {
blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, lo->lo_queue);
@@ -223,7 +224,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue);
lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
}
- blk_mq_unfreeze_queue(lo->lo_queue);
+ if (lo->lo_state == Lo_bound)
+ blk_mq_unfreeze_queue(lo->lo_queue);
}

static int
@@ -621,6 +623,8 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)

static inline void loop_update_dio(struct loop_device *lo)
{
+ if (lo->lo_state != Lo_bound)
+ return;
__loop_update_dio(lo, io_is_direct(lo->lo_backing_file) |
lo->use_dio);
}
@@ -1510,27 +1514,26 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
{
int err = 0;

- if (lo->lo_state != Lo_bound)
- return -ENXIO;
-
if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
return -EINVAL;

- if (lo->lo_queue->limits.logical_block_size != arg) {
- sync_blockdev(lo->lo_device);
- kill_bdev(lo->lo_device);
- }
+ if (lo->lo_state == Lo_bound) {
+ if (lo->lo_queue->limits.logical_block_size != arg) {
+ sync_blockdev(lo->lo_device);
+ kill_bdev(lo->lo_device);
+ }

- blk_mq_freeze_queue(lo->lo_queue);
+ blk_mq_freeze_queue(lo->lo_queue);

- /* kill_bdev should have truncated all the pages */
- if (lo->lo_queue->limits.logical_block_size != arg &&
- lo->lo_device->bd_inode->i_mapping->nrpages) {
- err = -EAGAIN;
- pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n",
- __func__, lo->lo_number, lo->lo_file_name,
- lo->lo_device->bd_inode->i_mapping->nrpages);
- goto out_unfreeze;
+ /* kill_bdev should have truncated all the pages */
+ if (lo->lo_queue->limits.logical_block_size != arg &&
+ lo->lo_device->bd_inode->i_mapping->nrpages) {
+ err = -EAGAIN;
+ pr_warn("%s: loop%d (%s) has still dirty pages (nrpages=%lu)\n",
+ __func__, lo->lo_number, lo->lo_file_name,
+ lo->lo_device->bd_inode->i_mapping->nrpages);
+ goto out_unfreeze;
+ }
}

blk_queue_logical_block_size(lo->lo_queue, arg);
@@ -1538,7 +1541,8 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
blk_queue_io_min(lo->lo_queue, arg);
loop_update_dio(lo);
out_unfreeze:
- blk_mq_unfreeze_queue(lo->lo_queue);
+ if (lo->lo_state == Lo_bound)
+ blk_mq_unfreeze_queue(lo->lo_queue);

return err;
}
--
2.23.0.rc1.153.gdeed80330f-goog


2019-08-14 11:36:42

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Wed, Aug 14, 2019 at 12:32:44PM +0200, Martijn Coenen wrote:
> Since Android Q, the creation and configuration of loop devices is in
> the critical path of device boot. We found that the configuration of
> loop devices is pretty slow, because many ioctl()'s involve freezing the
> block queue, which in turn needs to wait for an RCU grace period. On
> Android devices we've observed up to 60ms for the creation and
> configuration of a single loop device; as we anticipate creating many
> more in the future, we'd like to avoid this delay.
>

Another candidate is to not switch to q_usage_counter's percpu mode
until loop becomes Lo_bound, and this way may be more clean.

Something like the following patch:

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a7461f482467..8791f9242583 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
*/
bdgrab(bdev);
mutex_unlock(&loop_ctl_mutex);
+
+ percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
+
if (partscan)
loop_reread_partitions(lo, bdev);
if (claimed_bdev)
@@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
lo->lo_state = Lo_unbound;
mutex_unlock(&loop_ctl_mutex);

+ percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
+
/*
* Need not hold loop_ctl_mutex to fput backing file.
* Calling fput holding loop_ctl_mutex triggers a circular
@@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
}
lo->lo_queue->queuedata = lo;

+ /*
+ * cheat block layer for not switching to q_usage_counter's
+ * percpu mode before loop becomes Lo_bound
+ */
+ blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
+
blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);

/*


thanks,
Ming

2019-08-14 11:40:11

by Martijn Coenen

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <[email protected]> wrote:
> Another candidate is to not switch to q_usage_counter's percpu mode
> until loop becomes Lo_bound, and this way may be more clean.

Thanks! I had considered this too, but thought it a bit risky to mess
with the init flag from the loop driver. Maybe we could delay
switching q_usage_counter to per-cpu mode in the block layer in
general, until the first request comes in?

This would also address our issues, and potentially be an even smaller change.

Martijn
>
> Something like the following patch:
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index a7461f482467..8791f9242583 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> */
> bdgrab(bdev);
> mutex_unlock(&loop_ctl_mutex);
> +
> + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
> +
> if (partscan)
> loop_reread_partitions(lo, bdev);
> if (claimed_bdev)
> @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> lo->lo_state = Lo_unbound;
> mutex_unlock(&loop_ctl_mutex);
>
> + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
> +
> /*
> * Need not hold loop_ctl_mutex to fput backing file.
> * Calling fput holding loop_ctl_mutex triggers a circular
> @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
> }
> lo->lo_queue->queuedata = lo;
>
> + /*
> + * cheat block layer for not switching to q_usage_counter's
> + * percpu mode before loop becomes Lo_bound
> + */
> + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
> +
> blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>
> /*
>
>
> thanks,
> Ming

2019-08-14 11:49:46

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Wed, Aug 14, 2019 at 01:38:25PM +0200, Martijn Coenen wrote:
> On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <[email protected]> wrote:
> > Another candidate is to not switch to q_usage_counter's percpu mode
> > until loop becomes Lo_bound, and this way may be more clean.
>
> Thanks! I had considered this too, but thought it a bit risky to mess
> with the init flag from the loop driver. Maybe we could delay

blk_queue_init_done() is only called in blk_queue_init_done() for
this purpose, so this approach should be fine, IMO.

> switching q_usage_counter to per-cpu mode in the block layer in
> general, until the first request comes in?

This approach may not help your issue on loop, IO request comes
just after disk is added, such as by systemd, or reading partition table.

However, loop only starts to handle IO really after it becomes 'Lo_bound'.

thanks,
Ming

2019-08-14 15:31:14

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On 8/14/19 3:32 AM, Martijn Coenen wrote:
> Since Android Q, the creation and configuration of loop devices is in
> the critical path of device boot. We found that the configuration of
> loop devices is pretty slow, because many ioctl()'s involve freezing the
> block queue, which in turn needs to wait for an RCU grace period. On
> Android devices we've observed up to 60ms for the creation and
> configuration of a single loop device; as we anticipate creating many
> more in the future, we'd like to avoid this delay.
>
> This allows LOOP_SET_BLOCK_SIZE to be called before the loop device has
> been bound; since the block queue is not running at that point, we can
> avoid the expensive freezing of the queue.
>
> On a recent x86, this patch yields the following results:
>
> ===
> Call LOOP_SET_BLOCK_SIZE on /dev/loop0 before being bound
> ===
> ~# time ./set_block_size
>
> real 0m0.002s
> user 0m0.000s
> sys 0m0.002s
>
> ===
> Call LOOP_SET_BLOCK_SIZE on /dev/loop0 after being bound
> ===
> ~# losetup /dev/loop0 fs.img
> ~# time ./set_block_size
>
> real 0m0.008s
> user 0m0.000s
> sys 0m0.002s
>
> Over many runs, this is a 4x improvement.
>
> This is RFC because technically it is a change in behavior; before,
> calling LOOP_SET_BLOCK_SIZE on an unbound device would return ENXIO, and
> userspace programs that left it in their code despite the returned
> error, would now suddenly see the requested value effectuated. I'm not
> sure whether this is acceptable.
>
> An alternative might be a CONFIG option to set the default block size to
> another value than 512. Another alternative I considered is allowing the
> block device to be created with a "frozen" queue, where we can manually
> unfreeze the queue when all the configuration is done. This would be a
> much larger code change, though.

Hi Martijn,

Is the loop driver used in Android Q to make a file on a filesystem
visible as a block device or rather to make a subset of a block device
visible as a block device? In the latter case, have you considered to
use the dm-linear driver instead? I expect that the overhead per I/O of
dm-linear will be lower than that of the loop driver.

Bart.

2019-08-15 15:40:40

by Martijn Coenen

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Wed, Aug 14, 2019 at 12:47 PM Ming Lei <[email protected]> wrote:
> blk_queue_init_done() is only called in blk_queue_init_done() for
> this purpose, so this approach should be fine, IMO.

I was thinking somebody might add more stuff to "init" in the future,
and then that new stuff would now no longer be executed for the loop
driver. The name "init" is pretty generic...but if that's not a
concern I'm happy with your proposal as well. There's one more
"freeze" I'd like to get rid of - we also call LOOP_SET_STATUS(64),
and there's a freeze in there because lo->transfer is modified. That
makes sense, but I was hoping we can make that freeze conditional on
whether lo->transfer would actually change value; if it stays the
same, I think freezing is not necessary.

>
> > switching q_usage_counter to per-cpu mode in the block layer in
> > general, until the first request comes in?
>
> This approach may not help your issue on loop, IO request comes
> just after disk is added, such as by systemd, or reading partition table.

That's a good point, I thought we could avoid the partition scan, but
on some Android devices we'll have max_part set, and there will be a
partition scan.

Thanks,
Martijn

>
> However, loop only starts to handle IO really after it becomes 'Lo_bound'.
>
> thanks,
> Ming

2019-08-15 16:42:55

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Thu, Aug 15, 2019 at 11:38 PM Martijn Coenen <[email protected]> wrote:
>
> On Wed, Aug 14, 2019 at 12:47 PM Ming Lei <[email protected]> wrote:
> > blk_queue_init_done() is only called in blk_queue_init_done() for
> > this purpose, so this approach should be fine, IMO.
>
> I was thinking somebody might add more stuff to "init" in the future,
> and then that new stuff would now no longer be executed for the loop
> driver. The name "init" is pretty generic...but if that's not a
> concern I'm happy with your proposal as well. There's one more
> "freeze" I'd like to get rid of - we also call LOOP_SET_STATUS(64),
> and there's a freeze in there because lo->transfer is modified. That
> makes sense, but I was hoping we can make that freeze conditional on
> whether lo->transfer would actually change value; if it stays the
> same, I think freezing is not necessary.

The queue freeze in SET_STATUS may not be avoided, not only
.transfer, there are also .lo_offset, .size, filename, dio and others.

If nothing will change, why does the userspace bother to send
SET_STATUS?


Thanks,
Ming Lei

2019-08-15 17:14:58

by Martijn Coenen

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Wed, Aug 14, 2019 at 4:29 PM Bart Van Assche <[email protected]> wrote:
> Hi Martijn,
>
> Is the loop driver used in Android Q to make a file on a filesystem
> visible as a block device or rather to make a subset of a block device
> visible as a block device? In the latter case, have you considered to
> use the dm-linear driver instead? I expect that the overhead per I/O of
> dm-linear will be lower than that of the loop driver.

Hi Bart,

In this case we're using the loop driver to make a file on the
filesystem visible as a block device (in the file is a filesystem we
want to mount), so unfortunately dm-linear is not an option.

Best,
Martijn

>
> Bart.

2019-08-15 21:52:11

by Martijn Coenen

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Thu, Aug 15, 2019 at 6:34 PM Ming Lei <[email protected]> wrote:
> If nothing will change, why does the userspace bother to send
> SET_STATUS?

We don't change transfer, but we do change the offset and sizelimit.
In our specific case, we know there won't be any I/O from userspace at
this point; so from that point of view the freeze wouldn't be
necessary. But I'm not sure how we can make loop aware of that in a
safe way. Ideally we'd just have a way of completely configuring a
loop device before starting the block request queue, but that seems
like a pretty big change.

Martijn

>
>
> Thanks,
> Ming Lei

2019-08-19 09:07:35

by Martijn Coenen

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Wed, Aug 14, 2019 at 1:34 PM Ming Lei <[email protected]> wrote:
> Another candidate is to not switch to q_usage_counter's percpu mode
> until loop becomes Lo_bound, and this way may be more clean.

I was thinking about this more; our current sequence is:

1) LOOP_SET_FD
2) LOOP_SET_STATUS64 // for lo_offset/lo_sizelimit
3) LOOP_SET_BLOCK_SIZE // to make sure block size is correct for DIO
4) LOOP_SET_DIRECT_IO // to enable DIO

I noticed that LOOP_SET_FD already tries to configure DIO if the
underlying file is opened with O_DIRECT; but in this case, it will
still fail, because the logical block size of the loop queue must
exceed the logical I/O size of the backing device. But the default
logical block size of mq is 512.

One idea to fix is to call blk_queue_logical_block_size() as part of
LOOP_SET_FD, to match the block size of the backing fs in case the
backing file is opened with O_DIRECT; you could argue that if the
backing file is opened with O_DIRECT, this is what the user wanted
anyway. This would allow us to get rid of the latter two ioctl's and
already save quite some time.

Thanks,
Martijn

>
> Something like the following patch:
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index a7461f482467..8791f9242583 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> */
> bdgrab(bdev);
> mutex_unlock(&loop_ctl_mutex);
> +
> + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
> +
> if (partscan)
> loop_reread_partitions(lo, bdev);
> if (claimed_bdev)
> @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> lo->lo_state = Lo_unbound;
> mutex_unlock(&loop_ctl_mutex);
>
> + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
> +
> /*
> * Need not hold loop_ctl_mutex to fput backing file.
> * Calling fput holding loop_ctl_mutex triggers a circular
> @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
> }
> lo->lo_queue->queuedata = lo;
>
> + /*
> + * cheat block layer for not switching to q_usage_counter's
> + * percpu mode before loop becomes Lo_bound
> + */
> + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
> +
> blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>
> /*
>
>
> thanks,
> Ming

2019-08-19 09:28:41

by Martijn Coenen

[permalink] [raw]
Subject: Re: [PATCH] RFC: loop: Avoid calling blk_mq_freeze_queue() when possible.

On Mon, Aug 19, 2019 at 11:06 AM Martijn Coenen <[email protected]> wrote:
> One idea to fix is to call blk_queue_logical_block_size() as part of
> LOOP_SET_FD, to match the block size of the backing fs in case the
> backing file is opened with O_DIRECT; you could argue that if the
> backing file is opened with O_DIRECT, this is what the user wanted
> anyway. This would allow us to get rid of the latter two ioctl's and
> already save quite some time.

Basically:

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ab7ca5989097a..ad3db72fbd729 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -994,6 +994,12 @@ static int loop_set_fd(struct loop_device *lo,
fmode_t mode,
if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
blk_queue_write_cache(lo->lo_queue, true, false);

+ if(io_is_direct(lo->lo_backing_file) && inode->i_sb->s_bdev) {
+ /* In case of direct I/O, match underlying block size */
+ blk_queue_logical_block_size(lo->lo_queue,
+ bdev_logical_block_size(inode->i_sb->s_bdev));
+ }
+
loop_update_rotational(lo);
loop_update_dio(lo);

>
> Thanks,
> Martijn
>
> >
> > Something like the following patch:
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index a7461f482467..8791f9242583 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -1015,6 +1015,9 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> > */
> > bdgrab(bdev);
> > mutex_unlock(&loop_ctl_mutex);
> > +
> > + percpu_ref_switch_to_percpu(&lo->lo_queue->q_usage_counter);
> > +
> > if (partscan)
> > loop_reread_partitions(lo, bdev);
> > if (claimed_bdev)
> > @@ -1171,6 +1174,8 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> > lo->lo_state = Lo_unbound;
> > mutex_unlock(&loop_ctl_mutex);
> >
> > + percpu_ref_switch_to_atomic(&lo->lo_queue->q_usage_counter, NULL);
> > +
> > /*
> > * Need not hold loop_ctl_mutex to fput backing file.
> > * Calling fput holding loop_ctl_mutex triggers a circular
> > @@ -2003,6 +2008,12 @@ static int loop_add(struct loop_device **l, int i)
> > }
> > lo->lo_queue->queuedata = lo;
> >
> > + /*
> > + * cheat block layer for not switching to q_usage_counter's
> > + * percpu mode before loop becomes Lo_bound
> > + */
> > + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, lo->lo_queue);
> > +
> > blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
> >
> > /*
> >
> >
> > thanks,
> > Ming