LinuxLists.cc - [PATCH v6 0/2] loop: Better discard for block devices

2019-11-11 19:03:10

Subject: [PATCH v6 0/2] loop: Better discard for block devices

This series addresses some errors seen when using the loop
device directly backed by a block device. The first change plumbs
out the correct error message, and the second change prevents the
error from occurring in many cases.

The errors look like this:
[ 90.880875] print_req_error: I/O error, dev loop5, sector 0

The errors occur when a discard or write zeroes operation is attempted
on a loop device backed by a block device that does not support write zeroes.
First, the error itself is incorrectly reported as an I/O error, when it is
actually EOPNOTSUPP. The first patch plumbs out EOPNOTSUPP to properly
report the error.

The second patch prevents these errors from occurring by mirroring the
zeroing capabilities of the underlying block device into the loop device.
Before this change, discard was always reported as being supported, and
the loop device simply turned around and did an fallocate operation on the
backing device. After this change, backing block devices that do support
zeroing will continue to work as before, and continue to get all the
benefits of doing that. Backing devices that do not support zeroing will
fail earlier, avoiding hitting the loop device at all and ultimately
avoiding this error in the logs.
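
As an illustration only (this is not part of the series), the capability
decision described above can be modeled in plain userspace C. The struct and
function names below are hypothetical stand-ins for the kernel's queue limits
and for loop_config_discard(); the three branches follow the logic of patch 2
as posted in this thread.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the loop device's backing store. */
struct backing_info {
	bool is_block_device;	/* models S_ISBLK() on the backing inode */
	bool has_fallocate;	/* models file->f_op->fallocate != NULL */
	bool encrypted;		/* models lo_encrypt_key_size != 0 */
	unsigned int max_write_zeroes_sectors;	/* backing queue limit */
};

/* Hypothetical, simplified view of the loop queue limits. */
struct loop_limits {
	unsigned int max_discard_sectors;
	unsigned int max_write_zeroes_sectors;
	bool discard_enabled;	/* models QUEUE_FLAG_DISCARD */
};

static struct loop_limits model_loop_config_discard(const struct backing_info *b)
{
	struct loop_limits l = {0};

	if (b->is_block_device && !b->encrypted) {
		/* Mirror the backing device's write-zeroes limit into both. */
		l.max_discard_sectors = b->max_write_zeroes_sectors;
		l.max_write_zeroes_sectors = b->max_write_zeroes_sectors;
	} else if (!b->has_fallocate || b->encrypted) {
		/* No fallocate or encryption enabled: limits stay at zero. */
	} else {
		/* Regular file with fallocate support: same limits as before. */
		l.max_discard_sectors = UINT_MAX >> 9;
		l.max_write_zeroes_sectors = UINT_MAX >> 9;
	}

	/* Discard is advertised only when zeroing is actually possible. */
	l.discard_enabled = l.max_write_zeroes_sectors != 0;
	return l;
}
```

In this model, a block device reporting a write-zeroes limit of 0 ends up with
discard disabled, which corresponds to requests failing before they reach the
loop device.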

I can also confirm that this fixes test block/003 in the blktests, when
running blktests on a loop device backed by a block device.

Darrick, I see you've got a related change in linux-next. I'm not sure what
the status of that is, so I didn't base my latest spin on top of yours.

Changes in v6:
- Updated tags

Changes in v5:
- Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)

Changes in v4:
- Mirror blkdev's write_zeroes into loopdev's discard_sectors.

Changes in v3:
- Updated tags
- Updated commit description

Changes in v2:
- Unnested error if statement (Bart)

Evan Green (2):
loop: Report EOPNOTSUPP properly
loop: Better discard support for block devices

drivers/block/loop.c | 66 +++++++++++++++++++++++++++++---------------
1 file changed, 44 insertions(+), 22 deletions(-)

--
2.21.0

2019-11-11 19:04:55

by Evan Green

Subject: [PATCH v6 1/2] loop: Report EOPNOTSUPP properly

Properly plumb out EOPNOTSUPP from loop driver operations, which may
get returned when, for instance, a discard operation is attempted but not
supported by the underlying block device. Before this change, everything
was reported in the log as an I/O error, which is scary and not
helpful in debugging.

Signed-off-by: Evan Green <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Gwendal Grignou <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
---

Changes in v6:
- Updated tags

Changes in v5: None
Changes in v4: None
Changes in v3:
- Updated tags

Changes in v2:
- Unnested error if statement (Bart)

drivers/block/loop.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f6f77eaa7217..d749156a3d88 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -458,7 +458,9 @@ static void lo_complete_rq(struct request *rq)

 	if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
 	    req_op(rq) != REQ_OP_READ) {
-		if (cmd->ret < 0)
+		if (cmd->ret == -EOPNOTSUPP)
+			ret = BLK_STS_NOTSUPP;
+		else if (cmd->ret < 0)
 			ret = BLK_STS_IOERR;
 		goto end_io;
 	}
@@ -1940,7 +1942,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
  failed:
 	/* complete non-aio request */
 	if (!cmd->use_aio || ret) {
-		cmd->ret = ret ? -EIO : 0;
+		if (ret == -EOPNOTSUPP)
+			cmd->ret = ret;
+		else
+			cmd->ret = ret ? -EIO : 0;
 		blk_mq_complete_request(rq);
 	}
 }
--
2.21.0

2019-11-11 19:05:34

by Evan Green

Subject: [PATCH v6 2/2] loop: Better discard support for block devices

If the backing device for a loop device is a block device,
then mirror the "write zeroes" capabilities of the underlying
block device into the loop device. Copy this capability into both
max_write_zeroes_sectors and max_discard_sectors of the loop device.

The reason for this is that REQ_OP_DISCARD on a loop device translates
into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This
presents a consistent interface for loop devices (that discarded data
is zeroed), regardless of the backing device type of the loop device.
There should be no behavior change for loop devices backed by regular
files.

While in there, differentiate between REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES, which are different for block devices,
but which the loop device had just been lumping together, since
they're largely the same for files.
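
For readers following along, the distinction drawn here can be sketched in
userspace C. This is not the patch itself: the enum and function names are
hypothetical, and only the FALLOC_FL_* flag combinations are taken from the
diff below, where they are passed on to the backing file's fallocate
operation.

```c
#include <linux/falloc.h>

/* Hypothetical stand-ins for REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES. */
enum model_req_op {
	MODEL_OP_DISCARD,
	MODEL_OP_WRITE_ZEROES,
};

/* Pick the fallocate mode for a request type, or -1 if it has none. */
static int model_falloc_mode(enum model_req_op op)
{
	switch (op) {
	case MODEL_OP_DISCARD:
		/* Deallocate the range; reads of it return zeroes afterwards. */
		return FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
	case MODEL_OP_WRITE_ZEROES:
		/* Zero the range without necessarily deallocating it. */
		return FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;
	}
	return -1;
}
```

Both modes keep the file size unchanged; they differ only in whether the
range is punched out or zeroed in place.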

This change fixes blktest block/003, and removes an extraneous
error print in block/013 when testing on a loop device backed
by a block device that does not support discard.

Signed-off-by: Evan Green <[email protected]>
Reviewed-by: Gwendal Grignou <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
---

Changes in v6: None
Changes in v5:
- Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)

Changes in v4:
- Mirror blkdev's write_zeroes into loopdev's discard_sectors.

Changes in v3:
- Updated commit description

Changes in v2: None

drivers/block/loop.c | 57 ++++++++++++++++++++++++++++----------------
1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d749156a3d88..236f6deb0772 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
 	return ret;
 }

-static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
+static int lo_discard(struct loop_device *lo, struct request *rq,
+		int mode, loff_t pos)
 {
-	/*
-	 * We use punch hole to reclaim the free space used by the
-	 * image a.k.a. discard. However we do not support discard if
-	 * encryption is enabled, because it may give an attacker
-	 * useful information.
-	 */
 	struct file *file = lo->lo_backing_file;
-	int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+	struct request_queue *q = lo->lo_queue;
 	int ret;

-	if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
+	if (!blk_queue_discard(q)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	case REQ_OP_FLUSH:
 		return lo_req_flush(lo, rq);
 	case REQ_OP_DISCARD:
+		return lo_discard(lo, rq,
+			FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
+
 	case REQ_OP_WRITE_ZEROES:
-		return lo_discard(lo, rq, pos);
+		return lo_discard(lo, rq,
+			FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
+
 	case REQ_OP_WRITE:
 		if (lo->transfer)
 			return lo_write_transfer(lo, rq, pos);
@@ -854,6 +854,21 @@ static void loop_config_discard(struct loop_device *lo)
 	struct file *file = lo->lo_backing_file;
 	struct inode *inode = file->f_mapping->host;
 	struct request_queue *q = lo->lo_queue;
+	struct request_queue *backingq;
+
+	/*
+	 * If the backing device is a block device, mirror its zeroing
+	 * capability. REQ_OP_DISCARD translates to a zero-out even when backed
+	 * by block devices to keep consistent behavior with file-backed loop
+	 * devices.
+	 */
+	if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) {
+		backingq = bdev_get_queue(inode->i_bdev);
+		blk_queue_max_discard_sectors(q,
+			backingq->limits.max_write_zeroes_sectors);
+
+		blk_queue_max_write_zeroes_sectors(q,
+			backingq->limits.max_write_zeroes_sectors);

 	/*
 	 * We use punch hole to reclaim the free space used by the
@@ -861,22 +876,24 @@ static void loop_config_discard(struct loop_device *lo)
 	 * encryption is enabled, because it may give an attacker
 	 * useful information.
 	 */
-	if ((!file->f_op->fallocate) ||
-	    lo->lo_encrypt_key_size) {
+	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
-		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
-		return;
-	}

-	q->limits.discard_granularity = inode->i_sb->s_blocksize;
-	q->limits.discard_alignment = 0;
+	} else {
+		q->limits.discard_granularity = inode->i_sb->s_blocksize;
+		q->limits.discard_alignment = 0;
+
+		blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+		blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+	}

-	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
-	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
-	blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	if (q->limits.max_write_zeroes_sectors)
+		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	else
+		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 }

 static void loop_unprepare_queue(struct loop_device *lo)
--
2.21.0

2019-11-12 01:33:59

by Darrick J. Wong

Subject: Re: [PATCH v6 0/2] loop: Better discard for block devices

On Mon, Nov 11, 2019 at 10:50:28AM -0800, Evan Green wrote:
> This series addresses some errors seen when using the loop
> device directly backed by a block device. The first change plumbs
> out the correct error message, and the second change prevents the
> error from occurring in many cases.
>
> The errors look like this:
> [ 90.880875] print_req_error: I/O error, dev loop5, sector 0
>
> The errors occur when trying to do a discard or write zeroes operation
> on a loop device backed by a block device that does not support write zeroes.
> Firstly, the error itself is incorrectly reported as I/O error, but is
> actually EOPNOTSUPP. The first patch plumbs out EOPNOTSUPP to properly
> report the error.
>
> The second patch prevents these errors from occurring by mirroring the
> zeroing capabilities of the underlying block device into the loop device.
> Before this change, discard was always reported as being supported, and
> the loop device simply turns around and does an fallocate operation on the
> backing device. After this change, backing block devices that do support
> zeroing will continue to work as before, and continue to get all the
> benefits of doing that. Backing devices that do not support zeroing will
> fail earlier, avoiding hitting the loop device at all and ultimately
> avoiding this error in the logs.
>
> I can also confirm that this fixes test block/003 in the blktests, when
> running blktests on a loop device backed by a block device.
>
> Darrick, I see you've got a related change in linux-next. I'm not sure what
> the status of that is, so I didn't base my latest spin on top of yours.

AFAIK the patch you reference changes NOUNMAP requests to use
FALLOC_FL_ZERO_RANGE and is queued for 5.5, which means patch #2 will
clash with it. It sort of looks like patch #2 reimplements the patch
that Jens already pulled for 5.5, so you probably want to rebase this
series atop his for-next tree.... but you should really ask Jens.

--D

> Changes in v6:
> - Updated tags
>
> Changes in v5:
> - Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)
>
> Changes in v4:
> - Mirror blkdev's write_zeroes into loopdev's discard_sectors.
>
> Changes in v3:
> - Updated tags
> - Updated commit description
>
> Changes in v2:
> - Unnested error if statement (Bart)
>
> Evan Green (2):
> loop: Report EOPNOTSUPP properly
> loop: Better discard support for block devices
>
> drivers/block/loop.c | 66 +++++++++++++++++++++++++++++---------------
> 1 file changed, 44 insertions(+), 22 deletions(-)
>
> --
> 2.21.0
>

2019-11-12 01:41:07

by Darrick J. Wong

Subject: Re: [PATCH v6 2/2] loop: Better discard support for block devices

On Mon, Nov 11, 2019 at 10:50:30AM -0800, Evan Green wrote:
> If the backing device for a loop device is a block device,
> then mirror the "write zeroes" capabilities of the underlying
> block device into the loop device. Copy this capability into both
> max_write_zeroes_sectors and max_discard_sectors of the loop device.
>
> The reason for this is that REQ_OP_DISCARD on a loop device translates
> into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This
> presents a consistent interface for loop devices (that discarded data
> is zeroed), regardless of the backing device type of the loop device.
> There should be no behavior change for loop devices backed by regular
> files.
>
> While in there, differentiate between REQ_OP_DISCARD and
> REQ_OP_WRITE_ZEROES, which are different for block devices,
> but which the loop device had just been lumping together, since
> they're largely the same for files.
>
> This change fixes blktest block/003, and removes an extraneous
> error print in block/013 when testing on a loop device backed
> by a block device that does not support discard.
>
> Signed-off-by: Evan Green <[email protected]>
> Reviewed-by: Gwendal Grignou <[email protected]>
> Reviewed-by: Chaitanya Kulkarni <[email protected]>
> ---
>
> Changes in v6: None
> Changes in v5:
> - Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)
>
> Changes in v4:
> - Mirror blkdev's write_zeroes into loopdev's discard_sectors.
>
> Changes in v3:
> - Updated commit description
>
> Changes in v2: None
>
> drivers/block/loop.c | 57 ++++++++++++++++++++++++++++----------------
> 1 file changed, 37 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d749156a3d88..236f6deb0772 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> return ret;
> }
>
> -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> +static int lo_discard(struct loop_device *lo, struct request *rq,
> + int mode, loff_t pos)
> {
> - /*
> - * We use punch hole to reclaim the free space used by the
> - * image a.k.a. discard. However we do not support discard if
> - * encryption is enabled, because it may give an attacker
> - * useful information.
> - */
> struct file *file = lo->lo_backing_file;
> - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> + struct request_queue *q = lo->lo_queue;
> int ret;
>
> - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> + if (!blk_queue_discard(q)) {
> ret = -EOPNOTSUPP;
> goto out;
> }
> @@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> case REQ_OP_FLUSH:
> return lo_req_flush(lo, rq);
> case REQ_OP_DISCARD:
> + return lo_discard(lo, rq,
> + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> +
> case REQ_OP_WRITE_ZEROES:
> - return lo_discard(lo, rq, pos);
> + return lo_discard(lo, rq,
> + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);

Yes, this more or less reimplements what's already in -next...

> +
> case REQ_OP_WRITE:
> if (lo->transfer)
> return lo_write_transfer(lo, rq, pos);
> @@ -854,6 +854,21 @@ static void loop_config_discard(struct loop_device *lo)
> struct file *file = lo->lo_backing_file;
> struct inode *inode = file->f_mapping->host;
> struct request_queue *q = lo->lo_queue;
> + struct request_queue *backingq;
> +
> + /*
> + * If the backing device is a block device, mirror its zeroing
> + * capability. REQ_OP_DISCARD translates to a zero-out even when backed
> + * by block devices to keep consistent behavior with file-backed loop
> + * devices.
> + */
> + if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) {
> + backingq = bdev_get_queue(inode->i_bdev);

What happens if the inode is from a filesystem that can have multiple
backing devices (like btrfs)?

> + blk_queue_max_discard_sectors(q,
> + backingq->limits.max_write_zeroes_sectors);
> +
> + blk_queue_max_write_zeroes_sectors(q,
> + backingq->limits.max_write_zeroes_sectors);

Also, seeing as filesystems tend to implement PUNCH_HOLE and ZERO_RANGE
on their own independent of the hardware capabilities of the underlying
device, it doesn't make much sense to forward the blockdev limits to the
loop device.

(Put another way, XFS's ZERO_RANGE implementation can zero hundreds of
gigabytes at a time even if the underlying device is spinning rust.)

--D

>
> /*
> * We use punch hole to reclaim the free space used by the
> @@ -861,22 +876,24 @@ static void loop_config_discard(struct loop_device *lo)
> * encryption is enabled, because it may give an attacker
> * useful information.
> */
> - if ((!file->f_op->fallocate) ||
> - lo->lo_encrypt_key_size) {
> + } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> q->limits.discard_granularity = 0;
> q->limits.discard_alignment = 0;
> blk_queue_max_discard_sectors(q, 0);
> blk_queue_max_write_zeroes_sectors(q, 0);
> - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> - return;
> - }
>
> - q->limits.discard_granularity = inode->i_sb->s_blocksize;
> - q->limits.discard_alignment = 0;
> + } else {
> + q->limits.discard_granularity = inode->i_sb->s_blocksize;
> + q->limits.discard_alignment = 0;
> +
> + blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> + }
>
> - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> - blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> - blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> + if (q->limits.max_write_zeroes_sectors)
> + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> + else
> + blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> }
>
> static void loop_unprepare_queue(struct loop_device *lo)
> --
> 2.21.0
>

2019-11-12 08:33:48

by Christoph Hellwig

Subject: Re: [PATCH v6 1/2] loop: Report EOPNOTSUPP properly

On Mon, Nov 11, 2019 at 10:50:29AM -0800, Evan Green wrote:
> - if (cmd->ret < 0)
> + if (cmd->ret == -EOPNOTSUPP)
> + ret = BLK_STS_NOTSUPP;
> + else if (cmd->ret < 0)
> ret = BLK_STS_IOERR;

This really should use errno_to_blk_status. Same for the other hunk.
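
For context, a table-driven errno translation of the kind suggested here can
be sketched in userspace C. This is only a model under stated assumptions:
the enum, table, and function names are hypothetical, the table holds just
two entries, and the kernel's errno_to_blk_status() covers many more error
codes. The shape is what matters: look the errno up, and fall back to a
generic I/O error when it is not listed.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for BLK_STS_OK, BLK_STS_NOTSUPP, BLK_STS_IOERR. */
enum model_blk_status {
	MODEL_STS_OK,
	MODEL_STS_NOTSUPP,
	MODEL_STS_IOERR,
};

/* Deliberately tiny table; a real translation table has more rows. */
static const struct {
	int errnum;
	enum model_blk_status status;
} model_errors[] = {
	{ 0,           MODEL_STS_OK },
	{ -EOPNOTSUPP, MODEL_STS_NOTSUPP },
};

/* Translate a negative errno into a status, defaulting to an I/O error. */
static enum model_blk_status model_errno_to_status(int errnum)
{
	size_t i;

	for (i = 0; i < sizeof(model_errors) / sizeof(model_errors[0]); i++) {
		if (model_errors[i].errnum == errnum)
			return model_errors[i].status;
	}
	return MODEL_STS_IOERR;
}
```

Compared with the open-coded if/else in the hunk above, a lookup like this
means other error codes can be reported distinctly without touching the
callers.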

2019-11-12 17:27:09

by Evan Green

Subject: Re: [PATCH v6 2/2] loop: Better discard support for block devices

Thanks for replying and taking a look, Darrick. I didn't see your patch
in Jens' tree when I looked just before sending this series, but maybe I
missed it.

On Mon, Nov 11, 2019 at 5:37 PM Darrick J. Wong <[email protected]> wrote:
>
> On Mon, Nov 11, 2019 at 10:50:30AM -0800, Evan Green wrote:
> > If the backing device for a loop device is a block device,
> > then mirror the "write zeroes" capabilities of the underlying
> > block device into the loop device. Copy this capability into both
> > max_write_zeroes_sectors and max_discard_sectors of the loop device.
> >
> > The reason for this is that REQ_OP_DISCARD on a loop device translates
> > into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This
> > presents a consistent interface for loop devices (that discarded data
> > is zeroed), regardless of the backing device type of the loop device.
> > There should be no behavior change for loop devices backed by regular
> > files.
> >
> > While in there, differentiate between REQ_OP_DISCARD and
> > REQ_OP_WRITE_ZEROES, which are different for block devices,
> > but which the loop device had just been lumping together, since
> > they're largely the same for files.
> >
> > This change fixes blktest block/003, and removes an extraneous
> > error print in block/013 when testing on a loop device backed
> > by a block device that does not support discard.
> >
> > Signed-off-by: Evan Green <[email protected]>
> > Reviewed-by: Gwendal Grignou <[email protected]>
> > Reviewed-by: Chaitanya Kulkarni <[email protected]>
> > ---
> >
> > Changes in v6: None
> > Changes in v5:
> > - Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)
> >
> > Changes in v4:
> > - Mirror blkdev's write_zeroes into loopdev's discard_sectors.
> >
> > Changes in v3:
> > - Updated commit description
> >
> > Changes in v2: None
> >
> > drivers/block/loop.c | 57 ++++++++++++++++++++++++++++----------------
> > 1 file changed, 37 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index d749156a3d88..236f6deb0772 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > return ret;
> > }
> >
> > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > + int mode, loff_t pos)
> > {
> > - /*
> > - * We use punch hole to reclaim the free space used by the
> > - * image a.k.a. discard. However we do not support discard if
> > - * encryption is enabled, because it may give an attacker
> > - * useful information.
> > - */
> > struct file *file = lo->lo_backing_file;
> > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > + struct request_queue *q = lo->lo_queue;
> > int ret;
> >
> > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > + if (!blk_queue_discard(q)) {
> > ret = -EOPNOTSUPP;
> > goto out;
> > }
> > @@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > case REQ_OP_FLUSH:
> > return lo_req_flush(lo, rq);
> > case REQ_OP_DISCARD:
> > + return lo_discard(lo, rq,
> > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > +
> > case REQ_OP_WRITE_ZEROES:
> > - return lo_discard(lo, rq, pos);
> > + return lo_discard(lo, rq,
> > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
>
> Yes, this more or less reimplements what's already in -next...

Agreed, this part would disappear if I rebased on top of your patch.
This series has been around for a while, you see :)

>
> > +
> > case REQ_OP_WRITE:
> > if (lo->transfer)
> > return lo_write_transfer(lo, rq, pos);
> > @@ -854,6 +854,21 @@ static void loop_config_discard(struct loop_device *lo)
> > struct file *file = lo->lo_backing_file;
> > struct inode *inode = file->f_mapping->host;
> > struct request_queue *q = lo->lo_queue;
> > + struct request_queue *backingq;
> > +
> > + /*
> > + * If the backing device is a block device, mirror its zeroing
> > + * capability. REQ_OP_DISCARD translates to a zero-out even when backed
> > + * by block devices to keep consistent behavior with file-backed loop
> > + * devices.
> > + */
> > + if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) {
> > + backingq = bdev_get_queue(inode->i_bdev);
>
> What happens if the inode is from a filesystem that can have multiple
> backing devices (like btrfs)?

Then I would expect S_ISBLK(inode->i_mode) to be false. This path is
only for when you've created a loop device directly on top of a block
device (i.e. you pointed the loop device at /dev/sda). We use this in
our Chrome OS installer because it keeps the logic simple whether
you're installing to a real disk or a file image.
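
The same distinction can be checked from userspace. The sketch below is only
an illustration, and the helper name is made up here: it applies S_ISBLK() to
the result of stat(), so a node such as /dev/sda would be classified as a
block device, while a regular image file on any filesystem (btrfs included)
would not.

```c
#include <sys/stat.h>

/*
 * Returns 1 if path names a block device node, 0 if it names anything
 * else (regular file, directory, character device, ...), and -1 if
 * stat() fails.
 */
static int is_block_device(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return S_ISBLK(st.st_mode) ? 1 : 0;
}
```

An installer could use a check like this to tell in advance which of the two
loop backing cases it is dealing with.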

>
> > + blk_queue_max_discard_sectors(q,
> > + backingq->limits.max_write_zeroes_sectors);
> > +
> > + blk_queue_max_write_zeroes_sectors(q,
> > + backingq->limits.max_write_zeroes_sectors);
>
> Also, seeing as filesystems tend to implement PUNCH_HOLE and ZERO_RANGE
> on their own independent of the hardware capabilities of the underlying
> device, it doesn't make much sense to forward the blockdev limits to the
> loop device.
>
> (Put another way, XFS's ZERO_RANGE implementation can zero hundreds of
> gigabytes at a time even if the underlying device is a spinning rust.)

Hopefully my comment above addresses this too (there is no file system
in the scenario I'm coding for).

2019-11-12 19:11:20

by Evan Green

Subject: Re: [PATCH v6 1/2] loop: Report EOPNOTSUPP properly

On Tue, Nov 12, 2019 at 12:32 AM Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Nov 11, 2019 at 10:50:29AM -0800, Evan Green wrote:
> > - if (cmd->ret < 0)
> > + if (cmd->ret == -EOPNOTSUPP)
> > + ret = BLK_STS_NOTSUPP;
> > + else if (cmd->ret < 0)
> > ret = BLK_STS_IOERR;
>
> This really should use errno_to_blk_status. Same for the other hunk.

Seems reasonable, I can switch to that.

2019-11-13 00:44:34

by Darrick J. Wong

Subject: Re: [PATCH v6 2/2] loop: Better discard support for block devices

On Tue, Nov 12, 2019 at 09:22:51AM -0800, Evan Green wrote:
> Thanks for replying and taking a look Darrick. I didn't see your patch
> in Jens tree when I looked just before sending it, but maybe I missed
> it.
>
> On Mon, Nov 11, 2019 at 5:37 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Mon, Nov 11, 2019 at 10:50:30AM -0800, Evan Green wrote:
> > > If the backing device for a loop device is a block device,
> > > then mirror the "write zeroes" capabilities of the underlying
> > > block device into the loop device. Copy this capability into both
> > > max_write_zeroes_sectors and max_discard_sectors of the loop device.
> > >
> > > The reason for this is that REQ_OP_DISCARD on a loop device translates
> > > into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This
> > > presents a consistent interface for loop devices (that discarded data
> > > is zeroed), regardless of the backing device type of the loop device.
> > > There should be no behavior change for loop devices backed by regular
> > > files.
> > >
> > > While in there, differentiate between REQ_OP_DISCARD and
> > > REQ_OP_WRITE_ZEROES, which are different for block devices,
> > > but which the loop device had just been lumping together, since
> > > they're largely the same for files.
> > >
> > > This change fixes blktest block/003, and removes an extraneous
> > > error print in block/013 when testing on a loop device backed
> > > by a block device that does not support discard.
> > >
> > > Signed-off-by: Evan Green <[email protected]>
> > > Reviewed-by: Gwendal Grignou <[email protected]>
> > > Reviewed-by: Chaitanya Kulkarni <[email protected]>
> > > ---
> > >
> > > Changes in v6: None
> > > Changes in v5:
> > > - Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)
> > >
> > > Changes in v4:
> > > - Mirror blkdev's write_zeroes into loopdev's discard_sectors.
> > >
> > > Changes in v3:
> > > - Updated commit description
> > >
> > > Changes in v2: None
> > >
> > > drivers/block/loop.c | 57 ++++++++++++++++++++++++++++----------------
> > > 1 file changed, 37 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > index d749156a3d88..236f6deb0772 100644
> > > --- a/drivers/block/loop.c
> > > +++ b/drivers/block/loop.c
> > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > return ret;
> > > }
> > >
> > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > + int mode, loff_t pos)
> > > {
> > > - /*
> > > - * We use punch hole to reclaim the free space used by the
> > > - * image a.k.a. discard. However we do not support discard if
> > > - * encryption is enabled, because it may give an attacker
> > > - * useful information.
> > > - */
> > > struct file *file = lo->lo_backing_file;
> > > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > + struct request_queue *q = lo->lo_queue;
> > > int ret;
> > >
> > > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > + if (!blk_queue_discard(q)) {
> > > ret = -EOPNOTSUPP;
> > > goto out;
> > > }
> > > @@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > case REQ_OP_FLUSH:
> > > return lo_req_flush(lo, rq);
> > > case REQ_OP_DISCARD:
> > > + return lo_discard(lo, rq,
> > > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > +
> > > case REQ_OP_WRITE_ZEROES:
> > > - return lo_discard(lo, rq, pos);
> > > + return lo_discard(lo, rq,
> > > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> >
> > Yes, this more or less reimplements what's already in -next...
>
> Agree, this part would disappear if I rebased on top of your patch.
> This series has been around for awhile, you see :)

Oh. Didn't quite realize that. :/

> > > +
> > > case REQ_OP_WRITE:
> > > if (lo->transfer)
> > > return lo_write_transfer(lo, rq, pos);
> > > @@ -854,6 +854,21 @@ static void loop_config_discard(struct loop_device *lo)
> > > struct file *file = lo->lo_backing_file;
> > > struct inode *inode = file->f_mapping->host;
> > > struct request_queue *q = lo->lo_queue;
> > > + struct request_queue *backingq;
> > > +
> > > + /*
> > > + * If the backing device is a block device, mirror its zeroing
> > > + * capability. REQ_OP_DISCARD translates to a zero-out even when backed
> > > + * by block devices to keep consistent behavior with file-backed loop
> > > + * devices.
> > > + */
> > > + if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) {
> > > + backingq = bdev_get_queue(inode->i_bdev);
> >
> > What happens if the inode is from a filesystem that can have multiple
> > backing devices (like btrfs)?
>
> Then I would expect S_ISBLK(inode->i_mode) would not be true. This is
> only for when you've created a loop device directly on top of a block
> device (ie you pointed the loop device at /dev/sda). We use this in
> our Chrome OS installer because it makes the logic simple whether
> you're installing to a real disk or a file image.

Heh, doh, that's right. :)

Sorry, for some reason I misread that as "If the backing device of the
filesystem from which the inode came is a block device..."

Might I suggest rewording the first sentence of the comment to read "If
the loop device's backing device is itself a block device" for oafs like
me? :)

--D

> >
> > > + blk_queue_max_discard_sectors(q,
> > > + backingq->limits.max_write_zeroes_sectors);
> > > +
> > > + blk_queue_max_write_zeroes_sectors(q,
> > > + backingq->limits.max_write_zeroes_sectors);
> >
> > Also, seeing as filesystems tend to implement PUNCH_HOLE and ZERO_RANGE
> > on their own independent of the hardware capabilities of the underlying
> > device, it doesn't make much sense to forward the blockdev limits to the
> > loop device.
> >
> > (Put another way, XFS's ZERO_RANGE implementation can zero hundreds of
> > gigabytes at a time even if the underlying device is a spinning rust.)
>
> Hopefully my comment above addresses this too (there is no file system
> in the scenario I'm coding for).

2019-11-13 19:01:51

by Evan Green

Subject: Re: [PATCH v6 2/2] loop: Better discard support for block devices

On Tue, Nov 12, 2019 at 4:40 PM Darrick J. Wong <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 09:22:51AM -0800, Evan Green wrote:
> > Thanks for replying and taking a look Darrick. I didn't see your patch
> > in Jens tree when I looked just before sending it, but maybe I missed
> > it.
> >
> > On Mon, Nov 11, 2019 at 5:37 PM Darrick J. Wong <[email protected]> wrote:
> > >
> > > On Mon, Nov 11, 2019 at 10:50:30AM -0800, Evan Green wrote:
> > > > If the backing device for a loop device is a block device,
> > > > then mirror the "write zeroes" capabilities of the underlying
> > > > block device into the loop device. Copy this capability into both
> > > > max_write_zeroes_sectors and max_discard_sectors of the loop device.
> > > >
> > > > The reason for this is that REQ_OP_DISCARD on a loop device translates
> > > > into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This
> > > > presents a consistent interface for loop devices (that discarded data
> > > > is zeroed), regardless of the backing device type of the loop device.
> > > > There should be no behavior change for loop devices backed by regular
> > > > files.
> > > >
> > > > While in there, differentiate between REQ_OP_DISCARD and
> > > > REQ_OP_WRITE_ZEROES, which are different for block devices,
> > > > but which the loop device had just been lumping together, since
> > > > they're largely the same for files.
> > > >
> > > > This change fixes blktest block/003, and removes an extraneous
> > > > error print in block/013 when testing on a loop device backed
> > > > by a block device that does not support discard.
> > > >
> > > > Signed-off-by: Evan Green <[email protected]>
> > > > Reviewed-by: Gwendal Grignou <[email protected]>
> > > > Reviewed-by: Chaitanya Kulkarni <[email protected]>
> > > > ---
> > > >
> > > > Changes in v6: None
> > > > Changes in v5:
> > > > - Don't mirror discard if lo_encrypt_key_size is non-zero (Gwendal)
> > > >
> > > > Changes in v4:
> > > > - Mirror blkdev's write_zeroes into loopdev's discard_sectors.
> > > >
> > > > Changes in v3:
> > > > - Updated commit description
> > > >
> > > > Changes in v2: None
> > > >
> > > > drivers/block/loop.c | 57 ++++++++++++++++++++++++++++----------------
> > > > 1 file changed, 37 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > > index d749156a3d88..236f6deb0772 100644
> > > > --- a/drivers/block/loop.c
> > > > +++ b/drivers/block/loop.c
> > > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > > return ret;
> > > > }
> > > >
> > > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > > + int mode, loff_t pos)
> > > > {
> > > > - /*
> > > > - * We use punch hole to reclaim the free space used by the
> > > > - * image a.k.a. discard. However we do not support discard if
> > > > - * encryption is enabled, because it may give an attacker
> > > > - * useful information.
> > > > - */
> > > > struct file *file = lo->lo_backing_file;
> > > > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > > + struct request_queue *q = lo->lo_queue;
> > > > int ret;
> > > >
> > > > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > + if (!blk_queue_discard(q)) {
> > > > ret = -EOPNOTSUPP;
> > > > goto out;
> > > > }
> > > > @@ -599,8 +594,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > > case REQ_OP_FLUSH:
> > > > return lo_req_flush(lo, rq);
> > > > case REQ_OP_DISCARD:
> > > > + return lo_discard(lo, rq,
> > > > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > > +
> > > > case REQ_OP_WRITE_ZEROES:
> > > > - return lo_discard(lo, rq, pos);
> > > > + return lo_discard(lo, rq,
> > > > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > >
> > > Yes, this more or less reimplements what's already in -next...
> >
> > Agree, this part would disappear if I rebased on top of your patch.
> > This series has been around for a while, you see :)
>
> Oh. Didn't quite realize that. :/
>
> > > > +
> > > > case REQ_OP_WRITE:
> > > > if (lo->transfer)
> > > > return lo_write_transfer(lo, rq, pos);
> > > > @@ -854,6 +854,21 @@ static void loop_config_discard(struct loop_device *lo)
> > > > struct file *file = lo->lo_backing_file;
> > > > struct inode *inode = file->f_mapping->host;
> > > > struct request_queue *q = lo->lo_queue;
> > > > + struct request_queue *backingq;
> > > > +
> > > > + /*
> > > > + * If the backing device is a block device, mirror its zeroing
> > > > + * capability. REQ_OP_DISCARD translates to a zero-out even when backed
> > > > + * by block devices to keep consistent behavior with file-backed loop
> > > > + * devices.
> > > > + */
> > > > + if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) {
> > > > + backingq = bdev_get_queue(inode->i_bdev);
> > >
> > > What happens if the inode is from a filesystem that can have multiple
> > > backing devices (like btrfs)?
> >
> > Then I would expect S_ISBLK(inode->i_mode) would not be true. This is
> > only for when you've created a loop device directly on top of a block
> > device (i.e., you pointed the loop device at /dev/sda). We use this in
> > our Chrome OS installer because it makes the logic simple whether
> > you're installing to a real disk or a file image.
>
> Heh, doh, that's right. :)
>
> Sorry, for some reason I misread that as "If the backing device of the
> filesystem from which the inode came is a block device..."
>
> Might I suggest rewording the first sentence of the comment to read "If
> the loop device's backing device is itself a block device" for oafs like
> me? :)

Sure, I'll do that. Another spin coming shortly...

-Evan
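
The S_ISBLK() point above can be illustrated from user space. This is only a minimal sketch, not code from the patch: the helper name backing_is_blockdev is hypothetical, and it uses stat(2) where the kernel code inspects inode->i_mode directly. It shows that the test looks at the backing path itself, not at whatever device sits underneath a filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical user-space helper (not part of the patch): reports whether
 * the path handed to the loop device is itself a block device node.
 * Returns 1 for a block device, 0 for anything else, -1 if stat() fails.
 */
static int backing_is_blockdev(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return -1;
	/* Same macro the patch applies to inode->i_mode in the kernel. */
	return S_ISBLK(st.st_mode) ? 1 : 0;
}
```

For an image file stored on any filesystem, this returns 0 no matter how many devices back that filesystem. It returns 1 only when the path is a block device node such as /dev/sda (assuming that node exists).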

2019-11-14 23:23:17

by Evan Green

Subject: Re: [PATCH v6 1/2] loop: Report EOPNOTSUPP properly

On Tue, Nov 12, 2019 at 11:09 AM Evan Green <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 12:32 AM Christoph Hellwig <[email protected]> wrote:
> >
> > On Mon, Nov 11, 2019 at 10:50:29AM -0800, Evan Green wrote:
> > > - if (cmd->ret < 0)
> > > + if (cmd->ret == -EOPNOTSUPP)
> > > + ret = BLK_STS_NOTSUPP;
> > > + else if (cmd->ret < 0)
> > > ret = BLK_STS_IOERR;
> >
> > This really should use errno_to_blk_status. Same for the other hunk.
>
> Seems reasonable, I can switch to that.

Oh wait, the other hunk doesn't deal with blk_status_t at all. Before,
it just translated any errno into -EIO. Now it translates every errno
except -EOPNOTSUPP into -EIO.

So I'll change just the first hunk you pointed out.
-Evan
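
The difference between the two hunks can be sketched in plain C. This is an illustrative user-space model, not the kernel code. The enum and both function names are made up for the example, the enum values do not correspond to the real blk_status_t constants, and the second function is only a reading of the description above ("almost any errno to -EIO").

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for block status codes; names and values are illustrative only. */
enum sketch_status { SKETCH_STS_OK, SKETCH_STS_NOTSUPP, SKETCH_STS_IOERR };

/* Models the first hunk: an errno-style return becomes a status code. */
static enum sketch_status sketch_ret_to_status(int ret)
{
	if (ret == -EOPNOTSUPP)
		return SKETCH_STS_NOTSUPP;
	if (ret < 0)
		return SKETCH_STS_IOERR;
	return SKETCH_STS_OK;
}

/*
 * Models the other hunk as described: errno in, errno out. Every failure
 * collapses to -EIO except -EOPNOTSUPP, which passes through unchanged.
 */
static int sketch_filter_errno(int ret)
{
	if (ret == -EOPNOTSUPP)
		return ret;
	return ret ? -EIO : 0;
}

/* Small self-check showing how the two mappings differ in their types. */
static void sketch_self_check(void)
{
	assert(sketch_ret_to_status(-EOPNOTSUPP) == SKETCH_STS_NOTSUPP);
	assert(sketch_filter_errno(-EOPNOTSUPP) == -EOPNOTSUPP);
}
```

Only the first function produces a status value, so it is the only place where a helper like errno_to_blk_status would apply. The second stays in errno space throughout.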