LinuxLists.cc - [PATCH v3 0/2] loop: Better discard for block devices

2019-03-27 22:29:50

Subject: [PATCH v3 0/2] loop: Better discard for block devices

This series addresses some errors seen when using the loop
device directly backed by a block device. The first change plumbs
out the correct error message, and the second change prevents the
error from occurring in many cases.

The errors look like this:
[ 90.880875] print_req_error: I/O error, dev loop5, sector 0

The errors occur when trying to do a discard or write zeroes operation
on a loop device backed by a block device that does not support discard.
First, the error itself is incorrectly reported as an I/O error, but is
actually EOPNOTSUPP. The first patch plumbs out EOPNOTSUPP to properly
report the error.
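
The two-stage plumbing described above can be sketched in userspace C. This is only a minimal model under stated assumptions, not the driver code: the enum and both function names are hypothetical stand-ins for the kernel's blk_status_t values and for the logic in loop_handle_cmd() and lo_complete_rq().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's blk_status_t values. */
enum model_blk_status {
    MODEL_BLK_STS_OK = 0,
    MODEL_BLK_STS_NOTSUPP,
    MODEL_BLK_STS_IOERR,
};

/*
 * Stage 1 (models the patched loop_handle_cmd()): keep -EOPNOTSUPP as is,
 * collapse every other failure to -EIO, and pass success through as 0.
 * Assumption of this model: results are 0 or a negative errno.
 */
static int model_cmd_ret(int ret)
{
    assert(ret <= 0);
    if (ret == -EOPNOTSUPP)
        return ret;
    return ret ? -EIO : 0;
}

/*
 * Stage 2 (models the patched lo_complete_rq()): translate the stored
 * result into a block-layer status, checking -EOPNOTSUPP first.
 */
static enum model_blk_status model_complete_status(int cmd_ret)
{
    if (cmd_ret == -EOPNOTSUPP)
        return MODEL_BLK_STS_NOTSUPP;
    if (cmd_ret < 0)
        return MODEL_BLK_STS_IOERR;
    return MODEL_BLK_STS_OK;
}
```

In this model, a handler result of -EOPNOTSUPP surfaces as the "not supported" status, while any other failure still surfaces as an I/O error.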

The second patch prevents these errors from occurring by mirroring the
discard capabilities of the underlying block device into the loop device.
Before this change, discard was always reported as being supported, and
the loop device simply turned around and issued a discard operation on the
backing device. After this change, backing block devices that do support
discard will continue to work as before, and continue to get all the
benefits of doing that. Backing devices that do not support discard will
fail earlier, avoiding hitting the loop device at all and ultimately
avoiding this error in the logs.
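
The mirroring idea can be illustrated with a small userspace sketch. The struct and function below are hypothetical and only model the four discard-related limits and the discard flag; they are not the kernel's struct queue_limits or its helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the discard-related queue state. */
struct model_limits {
    unsigned int max_discard_sectors;
    unsigned int max_write_zeroes_sectors;
    unsigned int discard_granularity;
    unsigned int discard_alignment;
    bool discard_flag; /* models QUEUE_FLAG_DISCARD */
};

/*
 * Copy the backing device's discard limits into the loop device, then
 * derive the discard flag from whether either limit is nonzero.
 */
static void model_mirror_discard(struct model_limits *loop,
                                 const struct model_limits *backing)
{
    assert(loop != NULL && backing != NULL);

    loop->max_discard_sectors = backing->max_discard_sectors;
    loop->max_write_zeroes_sectors = backing->max_write_zeroes_sectors;
    loop->discard_granularity = backing->discard_granularity;
    loop->discard_alignment = backing->discard_alignment;

    loop->discard_flag = loop->max_discard_sectors != 0 ||
                         loop->max_write_zeroes_sectors != 0;
}
```

With a backing device whose limits are all zero, the modelled loop device ends up with the flag cleared, so a discard would be rejected before reaching the loop device at all.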

I can also confirm that this fixes test block/003 in the blktests, when
running blktests on a loop device backed by a block device.

Changes in v3:
- Updated tags
- Updated commit description

Changes in v2:
- Unnested error if statement (Bart)

Evan Green (2):
  loop: Report EOPNOTSUPP properly
  loop: Better discard support for block devices

drivers/block/loop.c | 70 ++++++++++++++++++++++++++++++--------------
1 file changed, 48 insertions(+), 22 deletions(-)

--
2.20.1

2019-03-27 22:29:53

by Evan Green

Subject: [PATCH v3 1/2] loop: Report EOPNOTSUPP properly

Properly plumb out EOPNOTSUPP from loop driver operations, which may
get returned when, for instance, a discard operation is attempted but not
supported by the underlying block device. Before this change, everything
was reported in the log as an I/O error, which is scary and not
helpful in debugging.

Signed-off-by: Evan Green <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
---

Changes in v3:
- Updated tags

Changes in v2:
- Unnested error if statement (Bart)

drivers/block/loop.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bf1c61cab8eb..bbf21ebeccd3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -458,7 +458,9 @@ static void lo_complete_rq(struct request *rq)

 	if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
 	    req_op(rq) != REQ_OP_READ) {
-		if (cmd->ret < 0)
+		if (cmd->ret == -EOPNOTSUPP)
+			ret = BLK_STS_NOTSUPP;
+		else if (cmd->ret < 0)
 			ret = BLK_STS_IOERR;
 		goto end_io;
 	}
@@ -1892,7 +1894,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
  failed:
 	/* complete non-aio request */
 	if (!cmd->use_aio || ret) {
-		cmd->ret = ret ? -EIO : 0;
+		if (ret == -EOPNOTSUPP)
+			cmd->ret = ret;
+		else
+			cmd->ret = ret ? -EIO : 0;
 		blk_mq_complete_request(rq);
 	}
 }
--
2.20.1

2019-03-27 22:30:01

by Evan Green

Subject: [PATCH v3 2/2] loop: Better discard support for block devices

If the backing device for a loop device is a block device,
then mirror the discard properties of the underlying block
device into the loop device. This new change only applies to
loop devices backed directly by a block device, not loop
devices backed by regular files.

While in there, differentiate between REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES, which are different for block devices,
but which the loop device had just been lumping together, since
they're largely the same for files.

This change fixes blktest block/003, and removes an extraneous
error print in block/013 when testing on a loop device backed
by a block device that does not support discard.

Signed-off-by: Evan Green <[email protected]>
---

Changes in v3:
- Updated commit description

Changes in v2: None

drivers/block/loop.c | 61 +++++++++++++++++++++++++++++---------------
1 file changed, 41 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bbf21ebeccd3..e1edd004298a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
 	return ret;
 }

-static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
+static int lo_discard(struct loop_device *lo, struct request *rq,
+		int mode, loff_t pos)
 {
-	/*
-	 * We use punch hole to reclaim the free space used by the
-	 * image a.k.a. discard. However we do not support discard if
-	 * encryption is enabled, because it may give an attacker
-	 * useful information.
-	 */
 	struct file *file = lo->lo_backing_file;
-	int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+	struct request_queue *q = lo->lo_queue;
 	int ret;

-	if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
+	if (!blk_queue_discard(q)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	case REQ_OP_FLUSH:
 		return lo_req_flush(lo, rq);
 	case REQ_OP_DISCARD:
+		return lo_discard(lo, rq,
+			FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
+
 	case REQ_OP_WRITE_ZEROES:
-		return lo_discard(lo, rq, pos);
+		return lo_discard(lo, rq,
+			FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
+
 	case REQ_OP_WRITE:
 		if (lo->transfer)
 			return lo_write_transfer(lo, rq, pos);
@@ -854,6 +854,25 @@ static void loop_config_discard(struct loop_device *lo)
 	struct file *file = lo->lo_backing_file;
 	struct inode *inode = file->f_mapping->host;
 	struct request_queue *q = lo->lo_queue;
+	struct request_queue *backingq;
+
+	/*
+	 * If the backing device is a block device, mirror its discard
+	 * capabilities.
+	 */
+	if (S_ISBLK(inode->i_mode)) {
+		backingq = bdev_get_queue(inode->i_bdev);
+		blk_queue_max_discard_sectors(q,
+			backingq->limits.max_discard_sectors);
+
+		blk_queue_max_write_zeroes_sectors(q,
+			backingq->limits.max_write_zeroes_sectors);
+
+		q->limits.discard_granularity =
+			backingq->limits.discard_granularity;
+
+		q->limits.discard_alignment =
+			backingq->limits.discard_alignment;

 	/*
 	 * We use punch hole to reclaim the free space used by the
@@ -861,22 +880,24 @@ static void loop_config_discard(struct loop_device *lo)
 	 * encryption is enabled, because it may give an attacker
 	 * useful information.
 	 */
-	if ((!file->f_op->fallocate) ||
-	    lo->lo_encrypt_key_size) {
+	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
-		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
-		return;
-	}

-	q->limits.discard_granularity = inode->i_sb->s_blocksize;
-	q->limits.discard_alignment = 0;
+	} else {
+		q->limits.discard_granularity = inode->i_sb->s_blocksize;
+		q->limits.discard_alignment = 0;
+
+		blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+		blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+	}

-	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
-	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
-	blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
+		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	else
+		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 }

 static void loop_unprepare_queue(struct loop_device *lo)
--
2.20.1

2019-03-28 02:38:19

by Ming Lei

Subject: Re: [PATCH v3 2/2] loop: Better discard support for block devices

On Wed, Mar 27, 2019 at 03:28:41PM -0700, Evan Green wrote:
> If the backing device for a loop device is a block device,
> then mirror the discard properties of the underlying block
> device into the loop device. This new change only applies to
> loop devices backed directly by a block device, not loop
> devices backed by regular files.
>
> While in there, differentiate between REQ_OP_DISCARD and
> REQ_OP_WRITE_ZEROES, which are different for block devices,
> but which the loop device had just been lumping together, since
> they're largely the same for files.
>
> This change fixes blktest block/003, and removes an extraneous
> error print in block/013 when testing on a loop device backed
> by a block device that does not support discard.

I have seen this issue many times, and I believe it needs a fix.

>
> Signed-off-by: Evan Green <[email protected]>
> ---
>
> Changes in v3:
> - Updated commit description
>
> Changes in v2: None
>
> drivers/block/loop.c | 61 +++++++++++++++++++++++++++++---------------
> 1 file changed, 41 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index bbf21ebeccd3..e1edd004298a 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> return ret;
> }
>
> -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> +static int lo_discard(struct loop_device *lo, struct request *rq,
> + int mode, loff_t pos)
> {
> - /*
> - * We use punch hole to reclaim the free space used by the
> - * image a.k.a. discard. However we do not support discard if
> - * encryption is enabled, because it may give an attacker
> - * useful information.
> - */
> struct file *file = lo->lo_backing_file;
> - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> + struct request_queue *q = lo->lo_queue;
> int ret;
>
> - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> + if (!blk_queue_discard(q)) {
> ret = -EOPNOTSUPP;
> goto out;
> }
> @@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> case REQ_OP_FLUSH:
> return lo_req_flush(lo, rq);
> case REQ_OP_DISCARD:
> + return lo_discard(lo, rq,
> + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> +
> case REQ_OP_WRITE_ZEROES:
> - return lo_discard(lo, rq, pos);
> + return lo_discard(lo, rq,
> + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> +
> case REQ_OP_WRITE:
> if (lo->transfer)
> return lo_write_transfer(lo, rq, pos);
> @@ -854,6 +854,25 @@ static void loop_config_discard(struct loop_device *lo)
> struct file *file = lo->lo_backing_file;
> struct inode *inode = file->f_mapping->host;
> struct request_queue *q = lo->lo_queue;
> + struct request_queue *backingq;
> +
> + /*
> + * If the backing device is a block device, mirror its discard
> + * capabilities.
> + */
> + if (S_ISBLK(inode->i_mode)) {
> + backingq = bdev_get_queue(inode->i_bdev);
> + blk_queue_max_discard_sectors(q,
> + backingq->limits.max_discard_sectors);
> +
> + blk_queue_max_write_zeroes_sectors(q,
> + backingq->limits.max_write_zeroes_sectors);
> +
> + q->limits.discard_granularity =
> + backingq->limits.discard_granularity;
> +
> + q->limits.discard_alignment =
> + backingq->limits.discard_alignment;

Loop usually doesn't mirror the backing queue's limits, and I believe
it isn't necessary in this case either. I'm just wondering why the
following simple setting can't work?

	if (S_ISBLK(inode->i_mode)) {
		backingq = bdev_get_queue(inode->i_bdev);

		q->limits.discard_alignment = 0;
		if (!blk_queue_discard(backingq)) {
			q->limits.discard_granularity = 0;
			blk_queue_max_discard_sectors(q, 0);
			blk_queue_max_write_zeroes_sectors(q, 0);
			blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
		} else {
			q->limits.discard_granularity = inode->i_sb->s_blocksize;
			blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
			blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
			blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
		}
	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
		...
	}

I remember you mentioned that the above code doesn't work in some of your
tests, but you never explained the reason. However, it is supposed to work,
given that bio splitting respects the discard limits. Or is there a
bug in bio splitting on discard IO?

Thanks,
Ming
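
The point about splitting can be pictured with a simplified userspace sketch. It assumes a discard range is cut into pieces no larger than the advertised maximum and deliberately ignores granularity and alignment; the function name is hypothetical and this is not the block layer's actual splitting code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many pieces a discard of nr_sectors is cut into when no
 * piece may exceed max_sectors. Returns 0 when max_sectors is 0, which
 * models a queue that does not support discard at all.
 */
static uint64_t model_split_discard(uint64_t nr_sectors, uint32_t max_sectors)
{
    uint64_t pieces = 0;

    if (max_sectors == 0)
        return 0;

    while (nr_sectors > 0) {
        uint64_t len = nr_sectors < max_sectors ? nr_sectors : max_sectors;

        /* Every piece respects the advertised limit. */
        assert(len > 0 && len <= max_sectors);
        nr_sectors -= len;
        pieces++;
    }
    return pieces;
}
```

Under this model, a loop device advertising UINT_MAX >> 9 sectors never hands a larger piece to the layer below, whatever the size of the original request.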

2019-03-28 20:27:38

by Evan Green

Subject: Re: [PATCH v3 2/2] loop: Better discard support for block devices

On Wed, Mar 27, 2019 at 7:37 PM Ming Lei <[email protected]> wrote:
>
> On Wed, Mar 27, 2019 at 03:28:41PM -0700, Evan Green wrote:
...
> > @@ -854,6 +854,25 @@ static void loop_config_discard(struct loop_device *lo)
> > struct file *file = lo->lo_backing_file;
> > struct inode *inode = file->f_mapping->host;
> > struct request_queue *q = lo->lo_queue;
> > + struct request_queue *backingq;
> > +
> > + /*
> > + * If the backing device is a block device, mirror its discard
> > + * capabilities.
> > + */
> > + if (S_ISBLK(inode->i_mode)) {
> > + backingq = bdev_get_queue(inode->i_bdev);
> > + blk_queue_max_discard_sectors(q,
> > + backingq->limits.max_discard_sectors);
> > +
> > + blk_queue_max_write_zeroes_sectors(q,
> > + backingq->limits.max_write_zeroes_sectors);
> > +
> > + q->limits.discard_granularity =
> > + backingq->limits.discard_granularity;
> > +
> > + q->limits.discard_alignment =
> > + backingq->limits.discard_alignment;
>
> Loop usually doesn't mirror the backing queue's limits, and I believe
> it isn't necessary in this case either. I'm just wondering why the
> following simple setting can't work?
>
> if (S_ISBLK(inode->i_mode)) {
> backingq = bdev_get_queue(inode->i_bdev);
>
> q->limits.discard_alignment = 0;
> if (!blk_queue_discard(backingq)) {
> q->limits.discard_granularity = 0;
> blk_queue_max_discard_sectors(q, 0);
> blk_queue_max_write_zeroes_sectors(q, 0);
> blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> } else {
> q->limits.discard_granularity = inode->i_sb->s_blocksize;
> blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> }
> } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> ...
> }
>
> I remember you mentioned that the above code doesn't work in some of your
> tests, but you never explained the reason. However, it is supposed to work,
> given that bio splitting respects the discard limits. Or is there a
> bug in bio splitting on discard IO?

I've done some more digging, and I think I have an answer for you,
with some proposed changes to the patch.

My original answer was going to be that REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES are different. I have an NVMe device that
supports discard but does not support write_zeroes, so the loop device
should mirror those capabilities individually to most accurately reflect
the underlying block device. But then I noticed that this device still
prints the error log I was trying to get rid of when doing mkfs.ext4,
so my fix is incomplete.

The reason is that I have the following translation between REQ_OP_*
and FALLOC_FL_*:
REQ_OP_DISCARD ==> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
REQ_OP_WRITE_ZEROES ==> FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE

This makes sense for loop devices backed by regular files, and I think
is the right mapping. But for loop devices backed by block devices,
blkdev_fallocate() translates both of these sets of flags into
blkdev_issue_zeroout(), rather than blkdev_issue_discard() for
REQ_OP_DISCARD (since I wasn't setting FALLOC_FL_NO_HIDE_STALE).

I think this set of flags still makes sense for block devices, since
it keeps a consistent behavior for loop devices backed by files and
block devices (namely, that the discarded space is always zeroed).
However it means that for my NVMe that supports discard (never used)
but not write_zeroes (always tried), loop devices backed directly by
this NVMe should not set the discard flag.
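
The flag translation described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: the MODEL_* names are hypothetical (the flag values are chosen to match the FALLOC_FL_* constants in <linux/falloc.h>), and the dispatch function only approximates what blkdev_fallocate() did in kernels of that era; the handling of plain ZERO_RANGE without KEEP_SIZE is my assumption and is not stated in this thread.

```c
#include <assert.h>

/* Hypothetical names; values mirror FALLOC_FL_* from <linux/falloc.h>. */
#define MODEL_FL_KEEP_SIZE     0x01
#define MODEL_FL_PUNCH_HOLE    0x02
#define MODEL_FL_NO_HIDE_STALE 0x04
#define MODEL_FL_ZERO_RANGE    0x10

enum model_req_op { MODEL_REQ_DISCARD, MODEL_REQ_WRITE_ZEROES };
enum model_blk_op { MODEL_OP_UNSUPPORTED, MODEL_OP_ZEROOUT, MODEL_OP_DISCARD };

/* Loop-side mapping from request type to fallocate mode, as in the patch. */
static int model_loop_fallocate_mode(enum model_req_op op)
{
    assert(op == MODEL_REQ_DISCARD || op == MODEL_REQ_WRITE_ZEROES);
    if (op == MODEL_REQ_DISCARD)
        return MODEL_FL_PUNCH_HOLE | MODEL_FL_KEEP_SIZE;
    return MODEL_FL_ZERO_RANGE | MODEL_FL_KEEP_SIZE;
}

/* Approximate model of which operation a block device issues per mode. */
static enum model_blk_op model_blkdev_fallocate_op(int mode)
{
    switch (mode) {
    case MODEL_FL_ZERO_RANGE:
    case MODEL_FL_ZERO_RANGE | MODEL_FL_KEEP_SIZE:
    case MODEL_FL_PUNCH_HOLE | MODEL_FL_KEEP_SIZE:
        return MODEL_OP_ZEROOUT; /* zeroed, never a plain discard */
    case MODEL_FL_PUNCH_HOLE | MODEL_FL_KEEP_SIZE | MODEL_FL_NO_HIDE_STALE:
        return MODEL_OP_DISCARD; /* stale data may remain visible */
    default:
        return MODEL_OP_UNSUPPORTED;
    }
}
```

Composing the two functions shows that both request types end up as a zero-out on a block-backed loop device in this model, which is why only the backing device's write_zeroes capability matters there.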

So I think what I should actually have is this:

	if (S_ISBLK(inode->i_mode)) {
		backingq = bdev_get_queue(inode->i_bdev);
		/* Note the difference here. */
		blk_queue_max_discard_sectors(q,
			backingq->limits.max_write_zeroes_sectors);

		blk_queue_max_write_zeroes_sectors(q,
			backingq->limits.max_write_zeroes_sectors);
	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) { ... }
	...
	if (q->limits.max_write_zeroes_sectors)
		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
	else
		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);

I can confirm that this fixes the errors for my NVMe as well.

What do you think?
-Evan