LinuxLists.cc - [PATCH 0/2] loop: Better discard for block devices

2018-10-30 23:18:54

[permalink] [raw]

Subject: [PATCH 0/2] loop: Better discard for block devices

This series addresses some errors seen when using the loop
device directly backed by a block device. The first change plumbs
out the correct error message, and the second change prevents the
error from occurring in many cases.

Evan Green (2):
loop: Report EOPNOTSUPP properly
loop: Better discard support for block devices

drivers/block/loop.c | 75 ++++++++++++++++++++++++++++++++++++----------------
1 file changed, 52 insertions(+), 23 deletions(-)

--
2.16.4

2018-10-30 23:08:34

[permalink] [raw]

Subject: [PATCH 2/2] loop: Better discard support for block devices

If the backing device for a loop device is a block device,
then mirror the discard properties of the underlying block
device into the loop device. While in there, differentiate
between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
different for block devices, but which the loop device had
just been lumping together.

Signed-off-by: Evan Green <[email protected]>
---

drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 41 insertions(+), 20 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 28990fc94841a..176e65101c4ef 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
return ret;
}

-static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
+static int lo_discard(struct loop_device *lo, struct request *rq,
+ int mode, loff_t pos)
{
- /*
- * We use punch hole to reclaim the free space used by the
- * image a.k.a. discard. However we do not support discard if
- * encryption is enabled, because it may give an attacker
- * useful information.
- */
struct file *file = lo->lo_backing_file;
- int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+ struct request_queue *q = lo->lo_queue;
int ret;

- if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
+ if (!blk_queue_discard(q)) {
ret = -EOPNOTSUPP;
goto out;
}
@@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
case REQ_OP_FLUSH:
return lo_req_flush(lo, rq);
case REQ_OP_DISCARD:
+ return lo_discard(lo, rq,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
+
case REQ_OP_WRITE_ZEROES:
- return lo_discard(lo, rq, pos);
+ return lo_discard(lo, rq,
+ FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
+
case REQ_OP_WRITE:
if (lo->transfer)
return lo_write_transfer(lo, rq, pos);
@@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
struct file *file = lo->lo_backing_file;
struct inode *inode = file->f_mapping->host;
struct request_queue *q = lo->lo_queue;
+ struct request_queue *backingq;
+
+ /*
+ * If the backing device is a block device, mirror its discard
+ * capabilities.
+ */
+ if (S_ISBLK(inode->i_mode)) {
+ backingq = bdev_get_queue(inode->i_bdev);
+ blk_queue_max_discard_sectors(q,
+ backingq->limits.max_discard_sectors);
+
+ blk_queue_max_write_zeroes_sectors(q,
+ backingq->limits.max_write_zeroes_sectors);
+
+ q->limits.discard_granularity =
+ backingq->limits.discard_granularity;
+
+ q->limits.discard_alignment =
+ backingq->limits.discard_alignment;

/*
* We use punch hole to reclaim the free space used by the
@@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
* encryption is enabled, because it may give an attacker
* useful information.
*/
- if ((!file->f_op->fallocate) ||
- lo->lo_encrypt_key_size) {
+ } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
q->limits.discard_granularity = 0;
q->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(q, 0);
blk_queue_max_write_zeroes_sectors(q, 0);
- blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
- return;
- }

- q->limits.discard_granularity = inode->i_sb->s_blocksize;
- q->limits.discard_alignment = 0;
+ } else {
+ q->limits.discard_granularity = inode->i_sb->s_blocksize;
+ q->limits.discard_alignment = 0;
+
+ blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+ }

- blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
- blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
- blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+ if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
+ blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
}

static void loop_unprepare_queue(struct loop_device *lo)
--
2.16.4

2018-10-30 23:19:03

[permalink] [raw]

Subject: [PATCH 1/2] loop: Report EOPNOTSUPP properly

Properly plumb out EOPNOTSUPP from loop driver operations, which may
get returned when for instance a discard operation is attempted but not
supported by the underlying block device. Before this change, everything
was reported in the log as an I/O error, which is scary and not
helpful in debugging.

Signed-off-by: Evan Green <[email protected]>
---

drivers/block/loop.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index abad6d15f9563..28990fc94841a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -458,8 +458,13 @@ static void lo_complete_rq(struct request *rq)

if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
req_op(rq) != REQ_OP_READ) {
- if (cmd->ret < 0)
- ret = BLK_STS_IOERR;
+ if (cmd->ret < 0) {
+ if (cmd->ret == -EOPNOTSUPP)
+ ret = BLK_STS_NOTSUPP;
+ else
+ ret = BLK_STS_IOERR;
+ }
+
goto end_io;
}

@@ -1788,7 +1793,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
- cmd->ret = ret ? -EIO : 0;
+ if (ret == -EOPNOTSUPP)
+ cmd->ret = ret;
+ else
+ cmd->ret = ret ? -EIO : 0;
blk_mq_complete_request(rq);
}
}
--
2.16.4

2018-10-30 23:51:26

by Bart Van Assche

[permalink] [raw]

Subject: Re: [PATCH 0/2] loop: Better discard for block devices

On Tue, 2018-10-30 at 16:06 -0700, Evan Green wrote:
+AD4 This series addresses some errors seen when using the loop
+AD4 device directly backed by a block device. The first change plumbs
+AD4 out the correct error message, and the second change prevents the
+AD4 error from occurring in many cases.

Hi Evan,

Can you provide some information about the use case? Why do you think that
it would be useful to support backing a loop device by a block device? Why
to use the loop driver instead of dm-linear for this use case?

Thanks,

Bart.

2018-11-01 18:17:54

[permalink] [raw]

Subject: Re: [PATCH 0/2] loop: Better discard for block devices

On Tue, Oct 30, 2018 at 4:50 PM Bart Van Assche <[email protected]> wrote:
>
> On Tue, 2018-10-30 at 16:06 -0700, Evan Green wrote:
> > This series addresses some errors seen when using the loop
> > device directly backed by a block device. The first change plumbs
> > out the correct error message, and the second change prevents the
> > error from occurring in many cases.
>
> Hi Evan,
>
> Can you provide some information about the use case? Why do you think that
> it would be useful to support backing a loop device by a block device? Why
> to use the loop driver instead of dm-linear for this use case?
>

Hi Bart,
In our case, the Chrome OS installer uses the loop device to map
slices of the disk that will ultimately represent partitions [1]. I
believe it has been doing install this way for a very long time, and
has been working well. It actually continues to work, but on block
devices that don't support discard operations, things are a tiny bit
bumpy. This series is meant to smooth out those bumps. As far as I
knew this was a supported scenario.

-Evan
[1] https://chromium.googlesource.com/chromiumos/platform/installer/+/master/chromeos-install

2018-11-01 22:45:48

by Gwendal Grignou

[permalink] [raw]

Subject: Re: [PATCH 0/2] loop: Better discard for block devices

On Thu, Nov 1, 2018 at 11:15 AM Evan Green <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 4:50 PM Bart Van Assche <[email protected]> wrote:
> >
> > On Tue, 2018-10-30 at 16:06 -0700, Evan Green wrote:
> > > This series addresses some errors seen when using the loop
> > > device directly backed by a block device. The first change plumbs
> > > out the correct error message, and the second change prevents the
> > > error from occurring in many cases.
> >
> > Hi Evan,
> >
> > Can you provide some information about the use case? Why do you think that
> > it would be useful to support backing a loop device by a block device? Why
> > to use the loop driver instead of dm-linear for this use case?
> >
>
> Hi Bart,
> In our case, the Chrome OS installer uses the loop device to map
> slices of the disk that will ultimately represent partitions [1]. I
> believe it has been doing install this way for a very long time, and
> has been working well. It actually continues to work, but on block
> devices that don't support discard operations, things are a tiny bit
> bumpy. This series is meant to smooth out those bumps. As far as I
> knew this was a supported scenario.
>
> -Evan
> [1] https://chromium.googlesource.com/chromiumos/platform/installer/+/master/chromeos-install

The code has moved to
https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-install
but the idea is the same. We create a loop device to abstract the
persistent destination. The destination can be a block device or a
file. The later case is used for creating master images to be flashed
on memory chip before soldering on the production line.
It is handy when the final device is 4K block aligned but the builder
is using 512b block aligned device, we can mount a device over a file
that will behave like the real device we will flash the image on.

Gwendal.

2018-11-02 16:03:14

by Bart Van Assche

[permalink] [raw]

Subject: Re: [PATCH 0/2] loop: Better discard for block devices

On Thu, 2018-11-01 at 15:44 -0700, Gwendal Grignou wrote:
+AD4 On Thu, Nov 1, 2018 at 11:15 AM Evan Green +ADw-evgreen+AEA-chromium.org+AD4 wrote:
+AD4 +AD4
+AD4 +AD4 On Tue, Oct 30, 2018 at 4:50 PM Bart Van Assche +ADw-bvanassche+AEA-acm.org+AD4 wrote:
+AD4 +AD4 +AD4
+AD4 +AD4 +AD4 On Tue, 2018-10-30 at 16:06 -0700, Evan Green wrote:
+AD4 +AD4 +AD4 +AD4 This series addresses some errors seen when using the loop
+AD4 +AD4 +AD4 +AD4 device directly backed by a block device. The first change plumbs
+AD4 +AD4 +AD4 +AD4 out the correct error message, and the second change prevents the
+AD4 +AD4 +AD4 +AD4 error from occurring in many cases.
+AD4 +AD4 +AD4
+AD4 +AD4 +AD4 Hi Evan,
+AD4 +AD4 +AD4
+AD4 +AD4 +AD4 Can you provide some information about the use case? Why do you think that
+AD4 +AD4 +AD4 it would be useful to support backing a loop device by a block device? Why
+AD4 +AD4 +AD4 to use the loop driver instead of dm-linear for this use case?
+AD4 +AD4 +AD4
+AD4 +AD4
+AD4 +AD4 Hi Bart,
+AD4 +AD4 In our case, the Chrome OS installer uses the loop device to map
+AD4 +AD4 slices of the disk that will ultimately represent partitions +AFs-1+AF0. I
+AD4 +AD4 believe it has been doing install this way for a very long time, and
+AD4 +AD4 has been working well. It actually continues to work, but on block
+AD4 +AD4 devices that don't support discard operations, things are a tiny bit
+AD4 +AD4 bumpy. This series is meant to smooth out those bumps. As far as I
+AD4 +AD4 knew this was a supported scenario.
+AD4 +AD4
+AD4 +AD4 -Evan
+AD4 +AD4 +AFs-1+AF0 https://chromium.googlesource.com/chromiumos/platform/installer/+-/master/chromeos-install
+AD4
+AD4 The code has moved to
+AD4 https://chromium.googlesource.com/chromiumos/platform2/+-/master/installer/chromeos-install
+AD4 but the idea is the same. We create a loop device to abstract the
+AD4 persistent destination. The destination can be a block device or a
+AD4 file. The later case is used for creating master images to be flashed
+AD4 on memory chip before soldering on the production line.
+AD4 It is handy when the final device is 4K block aligned but the builder
+AD4 is using 512b block aligned device, we can mount a device over a file
+AD4 that will behave like the real device we will flash the image on.

Hi Evan and Gwendal,

Since this is a new use case for the loop driver you may want to add a test
for this use case to the blktests project. Many block layer contributors run
these tests to verify their own block layer changes. Contributing a blktests
test for this new use case will make it easier for others to verify that
their changes do not break your use case.

Bart.

2018-11-05 20:36:52

[permalink] [raw]

Subject: Re: [PATCH 0/2] loop: Better discard for block devices

On Fri, Nov 2, 2018 at 9:02 AM Bart Van Assche <[email protected]> wrote:
>
> On Thu, 2018-11-01 at 15:44 -0700, Gwendal Grignou wrote:
> > On Thu, Nov 1, 2018 at 11:15 AM Evan Green <[email protected]> wrote:
> > >
> > > On Tue, Oct 30, 2018 at 4:50 PM Bart Van Assche <[email protected]> wrote:
> > > >
> > > > On Tue, 2018-10-30 at 16:06 -0700, Evan Green wrote:
> > > > > This series addresses some errors seen when using the loop
> > > > > device directly backed by a block device. The first change plumbs
> > > > > out the correct error message, and the second change prevents the
> > > > > error from occurring in many cases.
> > > >
> > > > Hi Evan,
> > > >
> > > > Can you provide some information about the use case? Why do you think that
> > > > it would be useful to support backing a loop device by a block device? Why
> > > > to use the loop driver instead of dm-linear for this use case?
> > > >
> > >
> > > Hi Bart,
> > > In our case, the Chrome OS installer uses the loop device to map
> > > slices of the disk that will ultimately represent partitions [1]. I
> > > believe it has been doing install this way for a very long time, and
> > > has been working well. It actually continues to work, but on block
> > > devices that don't support discard operations, things are a tiny bit
> > > bumpy. This series is meant to smooth out those bumps. As far as I
> > > knew this was a supported scenario.
> > >
> > > -Evan
> > > [1] https://chromium.googlesource.com/chromiumos/platform/installer/+/master/chromeos-install
> >
> > The code has moved to
> > https://chromium.googlesource.com/chromiumos/platform2/+/master/installer/chromeos-install
> > but the idea is the same. We create a loop device to abstract the
> > persistent destination. The destination can be a block device or a
> > file. The later case is used for creating master images to be flashed
> > on memory chip before soldering on the production line.
> > It is handy when the final device is 4K block aligned but the builder
> > is using 512b block aligned device, we can mount a device over a file
> > that will behave like the real device we will flash the image on.
>
> Hi Evan and Gwendal,
>
> Since this is a new use case for the loop driver you may want to add a test
> for this use case to the blktests project. Many block layer contributors run
> these tests to verify their own block layer changes. Contributing a blktests
> test for this new use case will make it easier for others to verify that
> their changes do not break your use case.
>

Good idea. Thanks Bart.

> Bart.

2018-11-26 18:56:38

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Tue, Oct 30, 2018 at 4:06 PM Evan Green <[email protected]> wrote:
>
> If the backing device for a loop device is a block device,
> then mirror the discard properties of the underlying block
> device into the loop device. While in there, differentiate
> between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> different for block devices, but which the loop device had
> just been lumping together.
>
> Signed-off-by: Evan Green <[email protected]>

Any thoughts on this patch? This fixes issues for us when using a loop
device backed by a block device, where we see many logs like:

[ 372.767286] print_req_error: I/O error, dev loop5, sector 88125696

-Evan

2018-11-27 02:57:53

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Tue, Nov 27, 2018 at 2:55 AM Evan Green <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 4:06 PM Evan Green <[email protected]> wrote:
> >
> > If the backing device for a loop device is a block device,

This shouldn't be a very common use case wrt. loop.

> > then mirror the discard properties of the underlying block
> > device into the loop device. While in there, differentiate
> > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > different for block devices, but which the loop device had
> > just been lumping together.
> >
> > Signed-off-by: Evan Green <[email protected]>
>
> Any thoughts on this patch? This fixes issues for us when using a loop
> device backed by a block device, where we see many logs like:
>
> [ 372.767286] print_req_error: I/O error, dev loop5, sector 88125696

Seems not see any explanation about this IO error and the fix in your patch.
Could you describe it a bit more?

thanks,
Ming Lei

2018-11-27 23:35:35

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

Hi Ming,

On Mon, Nov 26, 2018 at 6:55 PM Ming Lei <[email protected]> wrote:
>
> On Tue, Nov 27, 2018 at 2:55 AM Evan Green <[email protected]> wrote:
> >
> > On Tue, Oct 30, 2018 at 4:06 PM Evan Green <[email protected]> wrote:
> > >
> > > If the backing device for a loop device is a block device,
>
> This shouldn't be a very common use case wrt. loop.

Yeah, I'm starting to gather that. Or maybe I'm just the first one to
mention it on the kernel lists ;) We've used this in our Chrome OS
installer, I believe for many years. Gwendal piped in with a few
reasons we do it this way on the cover letter, but in general I think
it allows us to have a unified set of functions to install to a file,
disk, or prepare an image that may have a different block size than
those on the running system.

>
> > > then mirror the discard properties of the underlying block
> > > device into the loop device. While in there, differentiate
> > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > different for block devices, but which the loop device had
> > > just been lumping together.
> > >
> > > Signed-off-by: Evan Green <[email protected]>
> >
> > Any thoughts on this patch? This fixes issues for us when using a loop
> > device backed by a block device, where we see many logs like:
> >
> > [ 372.767286] print_req_error: I/O error, dev loop5, sector 88125696
>
> Seems not see any explanation about this IO error and the fix in your patch.
> Could you describe it a bit more?

Sure, I probably should have included more context with the series.

The loop device always reports that it supports discard, by setting up
the max_discard_sectors and max_write_zeroes_sectors in the blk queue.
When the loop device gets a discard or write-zeroes request, it turns
around and calls fallocate on the underlying device with the
PUNCH_HOLE flag. This makes sense when you're backed by a file and
hoping to just deallocate the space, but may fail when you're backed
by a block device that doesn't support discard, or doesn't write
zeroes to discarded sectors. Weirdly, lo_discard already had some code
for preserving EOPNOTSUPP, but then later the error is smashed into
EIO. Patch 1 pipes out EOPNOTSUPP properly, so it doesn't get squashed
into EIO.

Patch 2 reflects the discard characteristics of the underlying device
into the loop device. That way, if you're backed by a file or a block
device that does support discard, everything works great, and user
mode can even see and use the correct discard and write zero
granularities. If you're backed by a block device that does not
support discard, this is exposed to user mode, which then usually
avoids calling fallocate, and doesn't feel betrayed that their
requests are unexpectedly failing.

-Evan

2018-11-28 01:07:51

[permalink] [raw]

Subject: Re: [PATCH 1/2] loop: Report EOPNOTSUPP properly

On Tue, Oct 30, 2018 at 04:06:23PM -0700, Evan Green wrote:
> Properly plumb out EOPNOTSUPP from loop driver operations, which may
> get returned when for instance a discard operation is attempted but not
> supported by the underlying block device. Before this change, everything
> was reported in the log as an I/O error, which is scary and not
> helpful in debugging.
>
> Signed-off-by: Evan Green <[email protected]>
> ---
>
> drivers/block/loop.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index abad6d15f9563..28990fc94841a 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -458,8 +458,13 @@ static void lo_complete_rq(struct request *rq)
>
> if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
> req_op(rq) != REQ_OP_READ) {
> - if (cmd->ret < 0)
> - ret = BLK_STS_IOERR;
> + if (cmd->ret < 0) {
> + if (cmd->ret == -EOPNOTSUPP)
> + ret = BLK_STS_NOTSUPP;
> + else
> + ret = BLK_STS_IOERR;
> + }
> +
> goto end_io;
> }
>
> @@ -1788,7 +1793,10 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
> failed:
> /* complete non-aio request */
> if (!cmd->use_aio || ret) {
> - cmd->ret = ret ? -EIO : 0;
> + if (ret == -EOPNOTSUPP)
> + cmd->ret = ret;
> + else
> + cmd->ret = ret ? -EIO : 0;
> blk_mq_complete_request(rq);
> }
> }
> --
> 2.16.4
>

Looks fine:

Reviewed-by: Ming Lei <[email protected]>

Thanks,
Ming

2018-11-28 01:29:05

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> If the backing device for a loop device is a block device,
> then mirror the discard properties of the underlying block
> device into the loop device. While in there, differentiate
> between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> different for block devices, but which the loop device had
> just been lumping together.
>
> Signed-off-by: Evan Green <[email protected]>
> ---
>
> drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 41 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 28990fc94841a..176e65101c4ef 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> return ret;
> }
>
> -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> +static int lo_discard(struct loop_device *lo, struct request *rq,
> + int mode, loff_t pos)
> {
> - /*
> - * We use punch hole to reclaim the free space used by the
> - * image a.k.a. discard. However we do not support discard if
> - * encryption is enabled, because it may give an attacker
> - * useful information.
> - */
> struct file *file = lo->lo_backing_file;
> - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> + struct request_queue *q = lo->lo_queue;
> int ret;
>
> - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> + if (!blk_queue_discard(q)) {
> ret = -EOPNOTSUPP;
> goto out;
> }
> @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> case REQ_OP_FLUSH:
> return lo_req_flush(lo, rq);
> case REQ_OP_DISCARD:
> + return lo_discard(lo, rq,
> + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> +
> case REQ_OP_WRITE_ZEROES:
> - return lo_discard(lo, rq, pos);
> + return lo_discard(lo, rq,
> + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> +
> case REQ_OP_WRITE:
> if (lo->transfer)
> return lo_write_transfer(lo, rq, pos);
> @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> struct file *file = lo->lo_backing_file;
> struct inode *inode = file->f_mapping->host;
> struct request_queue *q = lo->lo_queue;
> + struct request_queue *backingq;
> +
> + /*
> + * If the backing device is a block device, mirror its discard
> + * capabilities.
> + */
> + if (S_ISBLK(inode->i_mode)) {
> + backingq = bdev_get_queue(inode->i_bdev);
> + blk_queue_max_discard_sectors(q,
> + backingq->limits.max_discard_sectors);
> +
> + blk_queue_max_write_zeroes_sectors(q,
> + backingq->limits.max_write_zeroes_sectors);
> +
> + q->limits.discard_granularity =
> + backingq->limits.discard_granularity;
> +
> + q->limits.discard_alignment =
> + backingq->limits.discard_alignment;

I think it isn't necessary to mirror backing queue's discard/write_zeros
capabilities, given either fs of the underlying queue can deal with well.

>
> /*
> * We use punch hole to reclaim the free space used by the
> @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> * encryption is enabled, because it may give an attacker
> * useful information.
> */
> - if ((!file->f_op->fallocate) ||
> - lo->lo_encrypt_key_size) {
> + } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> q->limits.discard_granularity = 0;
> q->limits.discard_alignment = 0;
> blk_queue_max_discard_sectors(q, 0);
> blk_queue_max_write_zeroes_sectors(q, 0);
> - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> - return;
> - }
>
> - q->limits.discard_granularity = inode->i_sb->s_blocksize;
> - q->limits.discard_alignment = 0;
> + } else {
> + q->limits.discard_granularity = inode->i_sb->s_blocksize;
> + q->limits.discard_alignment = 0;
> +
> + blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> + }
>
> - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> - blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> - blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> + if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> + else
> + blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> }

Looks it should work just by mirroring backing queue's discard
capability to loop queue in case that the loop is backed by
block device, doesn't it? Meantime the unified discard limits &
write_zeros limits can be kept.

thanks,
Ming

2018-11-28 01:31:26

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Tue, Nov 27, 2018 at 03:34:04PM -0800, Evan Green wrote:
> Hi Ming,
>
> On Mon, Nov 26, 2018 at 6:55 PM Ming Lei <[email protected]> wrote:
> >
> > On Tue, Nov 27, 2018 at 2:55 AM Evan Green <[email protected]> wrote:
> > >
> > > On Tue, Oct 30, 2018 at 4:06 PM Evan Green <[email protected]> wrote:
> > > >
> > > > If the backing device for a loop device is a block device,
> >
> > This shouldn't be a very common use case wrt. loop.
>
> Yeah, I'm starting to gather that. Or maybe I'm just the first one to
> mention it on the kernel lists ;) We've used this in our Chrome OS
> installer, I believe for many years. Gwendal piped in with a few
> reasons we do it this way on the cover letter, but in general I think
> it allows us to have a unified set of functions to install to a file,
> disk, or prepare an image that may have a different block size than
> those on the running system.

OK, got it, it makes sense.

>
> >
> > > > then mirror the discard properties of the underlying block
> > > > device into the loop device. While in there, differentiate
> > > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > > different for block devices, but which the loop device had
> > > > just been lumping together.
> > > >
> > > > Signed-off-by: Evan Green <[email protected]>
> > >
> > > Any thoughts on this patch? This fixes issues for us when using a loop
> > > device backed by a block device, where we see many logs like:
> > >
> > > [ 372.767286] print_req_error: I/O error, dev loop5, sector 88125696
> >
> > Seems not see any explanation about this IO error and the fix in your patch.
> > Could you describe it a bit more?
>
> Sure, I probably should have included more context with the series.
>
> The loop device always reports that it supports discard, by setting up
> the max_discard_sectors and max_write_zeroes_sectors in the blk queue.
> When the loop device gets a discard or write-zeroes request, it turns
> around and calls fallocate on the underlying device with the
> PUNCH_HOLE flag. This makes sense when you're backed by a file and
> hoping to just deallocate the space, but may fail when you're backed
> by a block device that doesn't support discard, or doesn't write
> zeroes to discarded sectors. Weirdly, lo_discard already had some code
> for preserving EOPNOTSUPP, but then later the error is smashed into
> EIO. Patch 1 pipes out EOPNOTSUPP properly, so it doesn't get squashed
> into EIO.
>
> Patch 2 reflects the discard characteristics of the underlying device
> into the loop device. That way, if you're backed by a file or a block
> device that does support discard, everything works great, and user
> mode can even see and use the correct discard and write zero
> granularities. If you're backed by a block device that does not
> support discard, this is exposed to user mode, which then usually
> avoids calling fallocate, and doesn't feel betrayed that their
> requests are unexpectedly failing.

Thanks for your detailed explanation, and I think we need to fix it.

Thanks,
Ming

2018-12-04 22:21:45

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

Hi Ming,

On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <[email protected]> wrote:
>
> On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > If the backing device for a loop device is a block device,
> > then mirror the discard properties of the underlying block
> > device into the loop device. While in there, differentiate
> > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > different for block devices, but which the loop device had
> > just been lumping together.
> >
> > Signed-off-by: Evan Green <[email protected]>
> > ---
> >
> > drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 41 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 28990fc94841a..176e65101c4ef 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > return ret;
> > }
> >
> > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > + int mode, loff_t pos)
> > {
> > - /*
> > - * We use punch hole to reclaim the free space used by the
> > - * image a.k.a. discard. However we do not support discard if
> > - * encryption is enabled, because it may give an attacker
> > - * useful information.
> > - */
> > struct file *file = lo->lo_backing_file;
> > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > + struct request_queue *q = lo->lo_queue;
> > int ret;
> >
> > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > + if (!blk_queue_discard(q)) {
> > ret = -EOPNOTSUPP;
> > goto out;
> > }
> > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > case REQ_OP_FLUSH:
> > return lo_req_flush(lo, rq);
> > case REQ_OP_DISCARD:
> > + return lo_discard(lo, rq,
> > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > +
> > case REQ_OP_WRITE_ZEROES:
> > - return lo_discard(lo, rq, pos);
> > + return lo_discard(lo, rq,
> > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > +
> > case REQ_OP_WRITE:
> > if (lo->transfer)
> > return lo_write_transfer(lo, rq, pos);
> > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > struct file *file = lo->lo_backing_file;
> > struct inode *inode = file->f_mapping->host;
> > struct request_queue *q = lo->lo_queue;
> > + struct request_queue *backingq;
> > +
> > + /*
> > + * If the backing device is a block device, mirror its discard
> > + * capabilities.
> > + */
> > + if (S_ISBLK(inode->i_mode)) {
> > + backingq = bdev_get_queue(inode->i_bdev);
> > + blk_queue_max_discard_sectors(q,
> > + backingq->limits.max_discard_sectors);
> > +
> > + blk_queue_max_write_zeroes_sectors(q,
> > + backingq->limits.max_write_zeroes_sectors);
> > +
> > + q->limits.discard_granularity =
> > + backingq->limits.discard_granularity;
> > +
> > + q->limits.discard_alignment =
> > + backingq->limits.discard_alignment;
>
> I think it isn't necessary to mirror backing queue's discard/write_zeros
> capabilities, given either fs of the underlying queue can deal with well.
>
> >
> > /*
> > * We use punch hole to reclaim the free space used by the
> > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > * encryption is enabled, because it may give an attacker
> > * useful information.
> > */
> > - if ((!file->f_op->fallocate) ||
> > - lo->lo_encrypt_key_size) {
> > + } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > q->limits.discard_granularity = 0;
> > q->limits.discard_alignment = 0;
> > blk_queue_max_discard_sectors(q, 0);
> > blk_queue_max_write_zeroes_sectors(q, 0);
> > - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > - return;
> > - }
> >
> > - q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > - q->limits.discard_alignment = 0;
> > + } else {
> > + q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > + q->limits.discard_alignment = 0;
> > +
> > + blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > + }
> >
> > - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > - blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > - blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > + if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > + else
> > + blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > }
>
> Looks it should work just by mirroring backing queue's discard
> capability to loop queue in case that the loop is backed by
> block device, doesn't it? Meantime the unified discard limits &
> write_zeros limits can be kept.

I tested this out, and you're right that I could just flip the
QUEUE_FLAG_DISCARD based on whether its a block device, and leave
everything else alone, to completely disable discard support for loop
devices backed by block devices. This seems to work for programs like
mkfs.ext4, but still leaves problems for coreutils cp.

But is that really the right call? With this change, we're not only
able to use loop devices in this way, but we're able to use the
discard and zero functionality of the underlying block device by
simply passing it through. To me that seemed better than disabling all
discard support for block devices, which would severely slow us down
on some devices.
-Evan

2018-12-05 01:12:03

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Tue, Dec 04, 2018 at 02:19:46PM -0800, Evan Green wrote:
> Hi Ming,
>
> On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <[email protected]> wrote:
> >
> > On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > > If the backing device for a loop device is a block device,
> > > then mirror the discard properties of the underlying block
> > > device into the loop device. While in there, differentiate
> > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > different for block devices, but which the loop device had
> > > just been lumping together.
> > >
> > > Signed-off-by: Evan Green <[email protected]>
> > > ---
> > >
> > > drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 41 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > index 28990fc94841a..176e65101c4ef 100644
> > > --- a/drivers/block/loop.c
> > > +++ b/drivers/block/loop.c
> > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > return ret;
> > > }
> > >
> > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > + int mode, loff_t pos)
> > > {
> > > - /*
> > > - * We use punch hole to reclaim the free space used by the
> > > - * image a.k.a. discard. However we do not support discard if
> > > - * encryption is enabled, because it may give an attacker
> > > - * useful information.
> > > - */
> > > struct file *file = lo->lo_backing_file;
> > > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > + struct request_queue *q = lo->lo_queue;
> > > int ret;
> > >
> > > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > + if (!blk_queue_discard(q)) {
> > > ret = -EOPNOTSUPP;
> > > goto out;
> > > }
> > > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > case REQ_OP_FLUSH:
> > > return lo_req_flush(lo, rq);
> > > case REQ_OP_DISCARD:
> > > + return lo_discard(lo, rq,
> > > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > +
> > > case REQ_OP_WRITE_ZEROES:
> > > - return lo_discard(lo, rq, pos);
> > > + return lo_discard(lo, rq,
> > > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > > +
> > > case REQ_OP_WRITE:
> > > if (lo->transfer)
> > > return lo_write_transfer(lo, rq, pos);
> > > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > > struct file *file = lo->lo_backing_file;
> > > struct inode *inode = file->f_mapping->host;
> > > struct request_queue *q = lo->lo_queue;
> > > + struct request_queue *backingq;
> > > +
> > > + /*
> > > + * If the backing device is a block device, mirror its discard
> > > + * capabilities.
> > > + */
> > > + if (S_ISBLK(inode->i_mode)) {
> > > + backingq = bdev_get_queue(inode->i_bdev);
> > > + blk_queue_max_discard_sectors(q,
> > > + backingq->limits.max_discard_sectors);
> > > +
> > > + blk_queue_max_write_zeroes_sectors(q,
> > > + backingq->limits.max_write_zeroes_sectors);
> > > +
> > > + q->limits.discard_granularity =
> > > + backingq->limits.discard_granularity;
> > > +
> > > + q->limits.discard_alignment =
> > > + backingq->limits.discard_alignment;
> >
> > I think it isn't necessary to mirror backing queue's discard/write_zeros
> > capabilities, given either fs of the underlying queue can deal with well.
> >
> > >
> > > /*
> > > * We use punch hole to reclaim the free space used by the
> > > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > > * encryption is enabled, because it may give an attacker
> > > * useful information.
> > > */
> > > - if ((!file->f_op->fallocate) ||
> > > - lo->lo_encrypt_key_size) {
> > > + } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > q->limits.discard_granularity = 0;
> > > q->limits.discard_alignment = 0;
> > > blk_queue_max_discard_sectors(q, 0);
> > > blk_queue_max_write_zeroes_sectors(q, 0);
> > > - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > - return;
> > > - }
> > >
> > > - q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > - q->limits.discard_alignment = 0;
> > > + } else {
> > > + q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > + q->limits.discard_alignment = 0;
> > > +
> > > + blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > + }
> > >
> > > - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > - blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > - blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > + if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > > + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > + else
> > > + blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > }
> >
> > Looks it should work just by mirroring backing queue's discard
> > capability to loop queue in case that the loop is backed by
> > block device, doesn't it? Meantime the unified discard limits &
> > write_zeros limits can be kept.
>
> I tested this out, and you're right that I could just flip the
> QUEUE_FLAG_DISCARD based on whether its a block device, and leave

What I meant actually is to do the following discard config:

bool discard;
if (S_ISBLK(inode->i_mode)) {
struct request_queue *backingq = bdev_get_queue(inode->i_bdev);
discard = blk_queue_discard(backingq);
} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
discard = false;
else
discard = true;

if (discard) {
blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
} else {
blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
}

> everything else alone, to completely disable discard support for loop
> devices backed by block devices. This seems to work for programs like
> mkfs.ext4, but still leaves problems for coreutils cp.
>
> But is that really the right call? With this change, we're not only
> able to use loop devices in this way, but we're able to use the
> discard and zero functionality of the underlying block device by
> simply passing it through. To me that seemed better than disabling all
> discard support for block devices, which would severely slow us down
> on some devices.

I guess the above approach can do the same job with yours, but simpler.

thanks,
Ming

2018-12-05 19:38:56

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

Hi Ming,

On Tue, Dec 4, 2018 at 5:11 PM Ming Lei <[email protected]> wrote:
>
> On Tue, Dec 04, 2018 at 02:19:46PM -0800, Evan Green wrote:
> > Hi Ming,
> >
> > On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <[email protected]> wrote:
> > >
> > > On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > > > If the backing device for a loop device is a block device,
> > > > then mirror the discard properties of the underlying block
> > > > device into the loop device. While in there, differentiate
> > > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > > different for block devices, but which the loop device had
> > > > just been lumping together.
> > > >
> > > > Signed-off-by: Evan Green <[email protected]>
> > > > ---
> > > >
> > > > drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > > > 1 file changed, 41 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > > index 28990fc94841a..176e65101c4ef 100644
> > > > --- a/drivers/block/loop.c
> > > > +++ b/drivers/block/loop.c
> > > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > > return ret;
> > > > }
> > > >
> > > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > > + int mode, loff_t pos)
> > > > {
> > > > - /*
> > > > - * We use punch hole to reclaim the free space used by the
> > > > - * image a.k.a. discard. However we do not support discard if
> > > > - * encryption is enabled, because it may give an attacker
> > > > - * useful information.
> > > > - */
> > > > struct file *file = lo->lo_backing_file;
> > > > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > > + struct request_queue *q = lo->lo_queue;
> > > > int ret;
> > > >
> > > > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > + if (!blk_queue_discard(q)) {
> > > > ret = -EOPNOTSUPP;
> > > > goto out;
> > > > }
> > > > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > > case REQ_OP_FLUSH:
> > > > return lo_req_flush(lo, rq);
> > > > case REQ_OP_DISCARD:
> > > > + return lo_discard(lo, rq,
> > > > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > > +
> > > > case REQ_OP_WRITE_ZEROES:
> > > > - return lo_discard(lo, rq, pos);
> > > > + return lo_discard(lo, rq,
> > > > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > > > +
> > > > case REQ_OP_WRITE:
> > > > if (lo->transfer)
> > > > return lo_write_transfer(lo, rq, pos);
> > > > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > > > struct file *file = lo->lo_backing_file;
> > > > struct inode *inode = file->f_mapping->host;
> > > > struct request_queue *q = lo->lo_queue;
> > > > + struct request_queue *backingq;
> > > > +
> > > > + /*
> > > > + * If the backing device is a block device, mirror its discard
> > > > + * capabilities.
> > > > + */
> > > > + if (S_ISBLK(inode->i_mode)) {
> > > > + backingq = bdev_get_queue(inode->i_bdev);
> > > > + blk_queue_max_discard_sectors(q,
> > > > + backingq->limits.max_discard_sectors);
> > > > +
> > > > + blk_queue_max_write_zeroes_sectors(q,
> > > > + backingq->limits.max_write_zeroes_sectors);
> > > > +
> > > > + q->limits.discard_granularity =
> > > > + backingq->limits.discard_granularity;
> > > > +
> > > > + q->limits.discard_alignment =
> > > > + backingq->limits.discard_alignment;
> > >
> > > I think it isn't necessary to mirror backing queue's discard/write_zeros
> > > capabilities, given either fs of the underlying queue can deal with well.
> > >
> > > >
> > > > /*
> > > > * We use punch hole to reclaim the free space used by the
> > > > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > > > * encryption is enabled, because it may give an attacker
> > > > * useful information.
> > > > */
> > > > - if ((!file->f_op->fallocate) ||
> > > > - lo->lo_encrypt_key_size) {
> > > > + } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > q->limits.discard_granularity = 0;
> > > > q->limits.discard_alignment = 0;
> > > > blk_queue_max_discard_sectors(q, 0);
> > > > blk_queue_max_write_zeroes_sectors(q, 0);
> > > > - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > - return;
> > > > - }
> > > >
> > > > - q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > - q->limits.discard_alignment = 0;
> > > > + } else {
> > > > + q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > + q->limits.discard_alignment = 0;
> > > > +
> > > > + blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > + }
> > > >
> > > > - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > - blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > - blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > + if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > > > + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > + else
> > > > + blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > }
> > >
> > > Looks it should work just by mirroring backing queue's discard
> > > capability to loop queue in case that the loop is backed by
> > > block device, doesn't it? Meantime the unified discard limits &
> > > write_zeros limits can be kept.
> >
> > I tested this out, and you're right that I could just flip the
> > QUEUE_FLAG_DISCARD based on whether its a block device, and leave
>
> What I meant actually is to do the following discard config:
>
> bool discard;
> if (S_ISBLK(inode->i_mode)) {
> struct request_queue *backingq = bdev_get_queue(inode->i_bdev);
> discard = blk_queue_discard(backingq);
> } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
> discard = false;
> else
> discard = true;
>
> if (discard) {
> blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> } else {
> blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> }

Ah, I see. But I think it's useful to reflect max_discard_sectors,
max_write_zeroes_sectors, discard_granularity, and discard_alignment
from the block device to the loop device. With the exception of
discard_alignment, these parameters are visible via sysfs, so usermode
can actually use these to make more intelligent use of fallocate.
Without this part of it, I still see issues with GNU cp.

I'm not totally sure about discard_alignment, that seems to be useful
in cases of merging blk requests. So I can stop mirroring that one if
it's harmful or not helpful. But unless it's a nak, I'd really love to
keep most of the mirroring. In which case the bool doesn't do a whole
lot of simplifying.

>
> > everything else alone, to completely disable discard support for loop
> > devices backed by block devices. This seems to work for programs like
> > mkfs.ext4, but still leaves problems for coreutils cp.
> >
> > But is that really the right call? With this change, we're not only
> > able to use loop devices in this way, but we're able to use the
> > discard and zero functionality of the underlying block device by
> > simply passing it through. To me that seemed better than disabling all
> > discard support for block devices, which would severely slow us down
> > on some devices.
>
> I guess the above approach can do the same job with yours, but simpler.
>
> thanks,
> Ming

2018-12-06 00:24:40

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Wed, Dec 05, 2018 at 11:35:57AM -0800, Evan Green wrote:
> Hi Ming,
>
> On Tue, Dec 4, 2018 at 5:11 PM Ming Lei <[email protected]> wrote:
> >
> > On Tue, Dec 04, 2018 at 02:19:46PM -0800, Evan Green wrote:
> > > Hi Ming,
> > >
> > > On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > > > > If the backing device for a loop device is a block device,
> > > > > then mirror the discard properties of the underlying block
> > > > > device into the loop device. While in there, differentiate
> > > > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > > > different for block devices, but which the loop device had
> > > > > just been lumping together.
> > > > >
> > > > > Signed-off-by: Evan Green <[email protected]>
> > > > > ---
> > > > >
> > > > > drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > > > > 1 file changed, 41 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > > > index 28990fc94841a..176e65101c4ef 100644
> > > > > --- a/drivers/block/loop.c
> > > > > +++ b/drivers/block/loop.c
> > > > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > > > return ret;
> > > > > }
> > > > >
> > > > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > > > + int mode, loff_t pos)
> > > > > {
> > > > > - /*
> > > > > - * We use punch hole to reclaim the free space used by the
> > > > > - * image a.k.a. discard. However we do not support discard if
> > > > > - * encryption is enabled, because it may give an attacker
> > > > > - * useful information.
> > > > > - */
> > > > > struct file *file = lo->lo_backing_file;
> > > > > - int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > > > + struct request_queue *q = lo->lo_queue;
> > > > > int ret;
> > > > >
> > > > > - if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > > + if (!blk_queue_discard(q)) {
> > > > > ret = -EOPNOTSUPP;
> > > > > goto out;
> > > > > }
> > > > > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > > > case REQ_OP_FLUSH:
> > > > > return lo_req_flush(lo, rq);
> > > > > case REQ_OP_DISCARD:
> > > > > + return lo_discard(lo, rq,
> > > > > + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > > > +
> > > > > case REQ_OP_WRITE_ZEROES:
> > > > > - return lo_discard(lo, rq, pos);
> > > > > + return lo_discard(lo, rq,
> > > > > + FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > > > > +
> > > > > case REQ_OP_WRITE:
> > > > > if (lo->transfer)
> > > > > return lo_write_transfer(lo, rq, pos);
> > > > > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > > > > struct file *file = lo->lo_backing_file;
> > > > > struct inode *inode = file->f_mapping->host;
> > > > > struct request_queue *q = lo->lo_queue;
> > > > > + struct request_queue *backingq;
> > > > > +
> > > > > + /*
> > > > > + * If the backing device is a block device, mirror its discard
> > > > > + * capabilities.
> > > > > + */
> > > > > + if (S_ISBLK(inode->i_mode)) {
> > > > > + backingq = bdev_get_queue(inode->i_bdev);
> > > > > + blk_queue_max_discard_sectors(q,
> > > > > + backingq->limits.max_discard_sectors);
> > > > > +
> > > > > + blk_queue_max_write_zeroes_sectors(q,
> > > > > + backingq->limits.max_write_zeroes_sectors);
> > > > > +
> > > > > + q->limits.discard_granularity =
> > > > > + backingq->limits.discard_granularity;
> > > > > +
> > > > > + q->limits.discard_alignment =
> > > > > + backingq->limits.discard_alignment;
> > > >
> > > > I think it isn't necessary to mirror backing queue's discard/write_zeros
> > > > capabilities, given either fs of the underlying queue can deal with well.
> > > >
> > > > >
> > > > > /*
> > > > > * We use punch hole to reclaim the free space used by the
> > > > > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > > > > * encryption is enabled, because it may give an attacker
> > > > > * useful information.
> > > > > */
> > > > > - if ((!file->f_op->fallocate) ||
> > > > > - lo->lo_encrypt_key_size) {
> > > > > + } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > > q->limits.discard_granularity = 0;
> > > > > q->limits.discard_alignment = 0;
> > > > > blk_queue_max_discard_sectors(q, 0);
> > > > > blk_queue_max_write_zeroes_sectors(q, 0);
> > > > > - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > > - return;
> > > > > - }
> > > > >
> > > > > - q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > > - q->limits.discard_alignment = 0;
> > > > > + } else {
> > > > > + q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > > + q->limits.discard_alignment = 0;
> > > > > +
> > > > > + blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > > + blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > > + }
> > > > >
> > > > > - blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > > - blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > > - blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > > + if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > > > > + blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > > + else
> > > > > + blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > > }
> > > >
> > > > Looks it should work just by mirroring backing queue's discard
> > > > capability to loop queue in case that the loop is backed by
> > > > block device, doesn't it? Meantime the unified discard limits &
> > > > write_zeros limits can be kept.
> > >
> > > I tested this out, and you're right that I could just flip the
> > > QUEUE_FLAG_DISCARD based on whether its a block device, and leave
> >
> > What I meant actually is to do the following discard config:
> >
> > bool discard;
> > if (S_ISBLK(inode->i_mode)) {
> > struct request_queue *backingq = bdev_get_queue(inode->i_bdev);
> > discard = blk_queue_discard(backingq);
> > } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
> > discard = false;
> > else
> > discard = true;
> >
> > if (discard) {
> > blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > } else {
> > blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > }
>
> Ah, I see. But I think it's useful to reflect max_discard_sectors,
> max_write_zeroes_sectors, discard_granularity, and discard_alignment
> from the block device to the loop device. With the exception of
> discard_alignment, these parameters are visible via sysfs, so usermode
> can actually use these to make more intelligent use of fallocate.

Could you share us what the intelligent use of fallocate is?

The block layer code of blk_bio_discard_split() can deal with all the
magic limits.

The unified discard limits is simpler from implement view of loop, or
from userspace view.

> Without this part of it, I still see issues with GNU cp.

Could you investigate the cause of the GNU cp issue?

Thanks,
Ming

2018-12-06 03:16:10

by Martin K. Petersen

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

Evan,

> Ah, I see. But I think it's useful to reflect max_discard_sectors,
> max_write_zeroes_sectors, discard_granularity, and discard_alignment
> from the block device to the loop device. With the exception of
> discard_alignment, these parameters are visible via sysfs,

discard_alignment is visible in sysfs, just not in the queue directory
since alignment can be different on a per-partition basis. So there is
one discard_alignment at the root of each /sys/block/foo directory and
one for each partition subdirectory. This mirrors the alignment_offset
variable which indicates a partitions alignment wrt. the underlying
physical block size.

That said, there are not many devices that actually report a non-zero
discard alignment so it's not as useful as the device manufacturers
(that were looking for an implementation shortcut) envisioned.

> I'm not totally sure about discard_alignment, that seems to be useful
> in cases of merging blk requests. So I can stop mirroring that one if
> it's harmful or not helpful. But unless it's a nak, I'd really love to
> keep most of the mirroring. In which case the bool doesn't do a whole
> lot of simplifying.

I think it's fine to export these. The block device topology was
explicitly designed to be stackable like this.

--
Martin K. Petersen Oracle Linux Engineering

2018-12-10 17:34:30

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Wed, Dec 5, 2018 at 7:15 PM Martin K. Petersen
<[email protected]> wrote:
>
>
> Evan,
>
> > Ah, I see. But I think it's useful to reflect max_discard_sectors,
> > max_write_zeroes_sectors, discard_granularity, and discard_alignment
> > from the block device to the loop device. With the exception of
> > discard_alignment, these parameters are visible via sysfs,
>
> discard_alignment is visible in sysfs, just not in the queue directory
> since alignment can be different on a per-partition basis. So there is
> one discard_alignment at the root of each /sys/block/foo directory and
> one for each partition subdirectory. This mirrors the alignment_offset
> variable which indicates a partitions alignment wrt. the underlying
> physical block size.
>
> That said, there are not many devices that actually report a non-zero
> discard alignment so it's not as useful as the device manufacturers
> (that were looking for an implementation shortcut) envisioned.

Ah ok, thanks.

>
> > I'm not totally sure about discard_alignment, that seems to be useful
> > in cases of merging blk requests. So I can stop mirroring that one if
> > it's harmful or not helpful. But unless it's a nak, I'd really love to
> > keep most of the mirroring. In which case the bool doesn't do a whole
> > lot of simplifying.
>
> I think it's fine to export these. The block device topology was
> explicitly designed to be stackable like this.

Yeah, it seemed to fall in pretty naturally, which is why I was hoping
it might not be so controversial. Thanks Martin.
-Evan

2018-12-18 23:52:08

[permalink] [raw]

Subject: Re: [PATCH 2/2] loop: Better discard support for block devices

On Mon, Dec 10, 2018 at 9:31 AM Evan Green <[email protected]> wrote:
>
> On Wed, Dec 5, 2018 at 7:15 PM Martin K. Petersen
> <[email protected]> wrote:
> >
> >
> > Evan,
> >
> > > Ah, I see. But I think it's useful to reflect max_discard_sectors,
> > > max_write_zeroes_sectors, discard_granularity, and discard_alignment
> > > from the block device to the loop device. With the exception of
> > > discard_alignment, these parameters are visible via sysfs,
> >
> > discard_alignment is visible in sysfs, just not in the queue directory
> > since alignment can be different on a per-partition basis. So there is
> > one discard_alignment at the root of each /sys/block/foo directory and
> > one for each partition subdirectory. This mirrors the alignment_offset
> > variable which indicates a partitions alignment wrt. the underlying
> > physical block size.
> >
> > That said, there are not many devices that actually report a non-zero
> > discard alignment so it's not as useful as the device manufacturers
> > (that were looking for an implementation shortcut) envisioned.
>
> Ah ok, thanks.
>
> >
> > > I'm not totally sure about discard_alignment, that seems to be useful
> > > in cases of merging blk requests. So I can stop mirroring that one if
> > > it's harmful or not helpful. But unless it's a nak, I'd really love to
> > > keep most of the mirroring. In which case the bool doesn't do a whole
> > > lot of simplifying.
> >
> > I think it's fine to export these. The block device topology was
> > explicitly designed to be stackable like this.
>
> Yeah, it seemed to fall in pretty naturally, which is why I was hoping
> it might not be so controversial. Thanks Martin.

Any other thoughts about this series?

Ming, I did go back and experiment a little. The pseudocode you had
proposed earlier would work, and solve the error prints I was seeing.
But I still would like to keep the reflection of the block device
properties, since that allows discard to be used on block devices that
do support it.
-Evan