2023-05-30 15:31:21

by Mike Snitzer

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Tue, May 30 2023 at 10:55P -0400,
Joe Thornber <[email protected]> wrote:

> On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
>
> >
> > Also Joe, for your proposed dm-thinp design where you distinguish
> > between "provision" and "reserve": Would it make sense for REQ_META
> > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > more freedom to just reserve the length of blocks? (e.g. for XFS
> > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > reserve space to accommodate it).
> >
>
> My proposal only involves 'reserve'. Provisioning will be done as part of
> the usual io path.

OK, I think we'd do well to pin down the top-level block interfaces in
question. Because this patchset's block interface patch (2/5) header
says:

"This patch also adds the capability to call fallocate() in mode 0
on block devices, which will send REQ_OP_PROVISION to the block
device for the specified range,"

So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
user of XFS could then use fallocate() for user data -- which would
cause thinp's reserve to _not_ be used for critical metadata.
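
For reference, a minimal userspace sketch of that interface: with
patch 2/5 applied, a mode-0 fallocate() on a block device issues
REQ_OP_PROVISION for the given range (the device path and size below
are examples only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/mapper/thin-vol", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* mode 0: provision the first 1GiB of the device */
	if (fallocate(fd, 0, 0, 1ULL << 30) < 0)
		perror("fallocate");
	close(fd);
	return 0;
}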

The only way to distinguish the caller (between acting on behalf of
user data vs XFS metadata) would be REQ_META?

So should dm-thinp have a REQ_META-based distinction? Or just treat
all REQ_OP_PROVISION the same?

Mike


2023-06-02 19:03:34

by Sarthak Kukreti

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <[email protected]> wrote:
>
> On Tue, May 30 2023 at 10:55P -0400,
> Joe Thornber <[email protected]> wrote:
>
> > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
> >
> > >
> > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > between "provision" and "reserve": Would it make sense for REQ_META
> > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > reserve space to accommodate it).
> > >
> >
> > My proposal only involves 'reserve'. Provisioning will be done as part of
> > the usual io path.
>
> OK, I think we'd do well to pin down the top-level block interfaces in
> question. Because this patchset's block interface patch (2/5) header
> says:
>
> "This patch also adds the capability to call fallocate() in mode 0
> on block devices, which will send REQ_OP_PROVISION to the block
> device for the specified range,"
>
> So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> user of XFS could then use fallocate() for user data -- which would
> cause thinp's reserve to _not_ be used for critical metadata.
>
> The only way to distinguish the caller (between acting on behalf of
> user data vs XFS metadata) would be REQ_META?
>
> So should dm-thinp have a REQ_META-based distinction? Or just treat
> all REQ_OP_PROVISION the same?
>
I'm in favor of a REQ_META-based distinction. Does that imply that
REQ_META also needs to be passed through the block/filesystem stack
(e.g. REQ_OP_PROVISION + REQ_META on a loop device translates to a
fallocate(<insert meta flag name>) on the underlying file)?
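
To illustrate, a sketch only, loosely modelled on how
drivers/block/loop.c already maps REQ_OP_DISCARD to
fallocate(FALLOC_FL_PUNCH_HOLE) on the backing file; the
REQ_OP_PROVISION case and the meta-flag plumbing are assumptions, not
current kernel code:

static int lo_req_provision(struct loop_device *lo, struct request *rq,
			    loff_t pos)
{
	struct file *file = lo->lo_backing_file;
	int mode = 0;	/* plain preallocation */

	if (!file->f_op->fallocate)
		return -EOPNOTSUPP;
	/*
	 * If REQ_META were propagated down the stack, this is where it
	 * would map to a (hypothetical) fallocate mode flag so the
	 * lower filesystem could tell metadata reservations apart.
	 */
	return file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
}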

<bikeshed>
I think that might have applications beyond just provisioning:
currently, for stacked filesystems (e.g. filesystems residing in a file
on top of another filesystem), even if the upper filesystem issues
read/write requests with REQ_META | REQ_PRIO, these flags are lost in
translation at the loop device layer. A flag like the above would
allow the prioritization of stacked filesystem metadata requests.
</bikeshed>

Bringing the discussion back to this series for a bit, I'm still
waiting on feedback from the Block maintainers before sending out v8
(which, at the moment, only has a
s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/g). I believe from the conversation
most of the above is follow up work, but please let me know if you'd
prefer I add some of this to the current series!

Best
Sarthak

> Mike

2023-06-02 22:00:16

by Mike Snitzer

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Fri, Jun 02 2023 at 2:44P -0400,
Sarthak Kukreti <[email protected]> wrote:

> On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <[email protected]> wrote:
> >
> > On Tue, May 30 2023 at 10:55P -0400,
> > Joe Thornber <[email protected]> wrote:
> >
> > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
> > >
> > > >
> > > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > reserve space to accommodate it).
> > > >
> > >
> > > My proposal only involves 'reserve'. Provisioning will be done as part of
> > > the usual io path.
> >
> > OK, I think we'd do well to pin down the top-level block interfaces in
> > question. Because this patchset's block interface patch (2/5) header
> > says:
> >
> > "This patch also adds the capability to call fallocate() in mode 0
> > on block devices, which will send REQ_OP_PROVISION to the block
> > device for the specified range,"
> >
> > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > user of XFS could then use fallocate() for user data -- which would
> > cause thinp's reserve to _not_ be used for critical metadata.
> >
> > The only way to distinguish the caller (between acting on behalf of
> > user data vs XFS metadata) would be REQ_META?
> >
> > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > all REQ_OP_PROVISION the same?
> >
> I'm in favor of a REQ_META-based distinction. Does that imply that
> REQ_META also needs to be passed through the block/filesystem stack
> (e.g. REQ_OP_PROVISION + REQ_META on a loop device translates to a
> fallocate(<insert meta flag name>) on the underlying file)?

Unclear, I was thinking your REQ_UNSHARE (tied to fallocate) might be
a means to translate REQ_OP_PROVISION + REQ_META to fallocate and have
it perform the LBA-specific provisioning of Joe's design (referenced
below).

> <bikeshed>
> I think that might have applications beyond just provisioning:
> currently, for stacked filesystems (e.g. filesystems residing in a file
> on top of another filesystem), even if the upper filesystem issues
> read/write requests with REQ_META | REQ_PRIO, these flags are lost in
> translation at the loop device layer. A flag like the above would
> allow the prioritization of stacked filesystem metadata requests.
> </bikeshed>

Yes, it could prove useful.

> Bringing the discussion back to this series for a bit, I'm still
> waiting on feedback from the Block maintainers before sending out v8
> (which, at the moment, only has a
> s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/g). I believe from the conversation
> most of the above is follow up work, but please let me know if you'd
> prefer I add some of this to the current series!

I need a bit more time to work through various aspects of the broader
requirements and the resulting interfaces that fall out.

Joe's design is pretty compelling because it will properly handle
snapshot thin devices:
https://listman.redhat.com/archives/dm-devel/2023-May/054351.html

Here is my latest status:
- Focused on prototype for thinp block reservation (XFS metadata, XFS
delalloc, fallocate)
- Decided the "dynamic" (non-LBA-specific) reservation stuff (old
prototype code) is best left independent from Joe's design. So there
are 2 classes of thinp reservation.
- Forward-ported the old prototype code that Brian Foster, Joe
Thornber and I worked on years ago. It needs more careful review
(and very likely will need fixes from Brian and myself). The XFS
changes are pretty intrusive and likely up for serious debate (as
to whether we even care to handle reservations for user data).
- REQ_OP_PROVISION bios with REQ_META will use Joe's design;
otherwise data (XFS data and fallocate) will use "dynamic"
reservation.
- The "dynamic" name is due to the reservation being generic (non-LBA:
not in terms of an LBA range). It is also in-core only, so the
associated "dynamic_reserve_count" accounting is reset to 0 on every
activation.
- Fallocate may require stronger guarantees in the end (in which
case we'll add a REQ_UNSHARE flag that is selectable from the
fallocate interface)
- Will try to share common code, but still sorting out the high-level
interface(s)...

I'll try to get a git tree together early next week. It will be the
forward-ported "dynamic" prototype code and your latest v7 code with
some additional work to branch accordingly for each class of thinp
reservation. And I'll use your v7 code as a crude stub for Joe's
approach (branch taken if REQ_META set).
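
In code terms the branch would look something like this (the helper
names are placeholders, not actual dm-thin code):

	/* in process_provision_bio() */
	if (bio->bi_opf & REQ_META)
		provision_lba_range(tc, bio);	/* Joe's design: hard LBA range */
	else
		reserve_dynamic_space(tc, bio);	/* non-LBA "dynamic" reserve */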

Lastly, here are some additional TODOs I've noted in code earlier in
my review process:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 0d9301802609..43a6702f9efe 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1964,6 +1964,26 @@ static void process_provision_bio(struct thin_c *tc, struct bio *bio)
struct dm_cell_key key;
struct dm_thin_lookup_result lookup_result;

+ /*
+ * FIXME:
+ * Joe's elegant reservation design is detailed here:
+ * https://listman.redhat.com/archives/dm-devel/2023-May/054351.html
+ * - this design, with associated thinp metadata updates,
+ * is how provision bios should be handled.
+ *
+ * FIXME: add thin-pool flag "ignore_provision"
+ *
+ * FIXME: needs provision_passdown support
+ * (needs thinp flag "no_provision_passdown")
+ */
+
+ /*
+ * FIXME: require REQ_META (or REQ_UNSHARE?) to allow deeper
+ * provisioning code that follows? (so that thinp
+ * block _is_ fully provisioned upon return)
+ * (or just remove all below code entirely?)
+ */
+
/*
* If cell is already occupied, then the block is already
* being provisioned so we have nothing further to do here.


2023-06-03 00:54:59

by Dave Chinner

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <[email protected]> wrote:
> >
> > On Tue, May 30 2023 at 10:55P -0400,
> > Joe Thornber <[email protected]> wrote:
> >
> > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
> > >
> > > >
> > > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > reserve space to accommodate it).
> > > >
> > >
> > > My proposal only involves 'reserve'. Provisioning will be done as part of
> > > the usual io path.
> >
> > OK, I think we'd do well to pin down the top-level block interfaces in
> > question. Because this patchset's block interface patch (2/5) header
> > says:
> >
> > "This patch also adds the capability to call fallocate() in mode 0
> > on block devices, which will send REQ_OP_PROVISION to the block
> > device for the specified range,"
> >
> > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > user of XFS could then use fallocate() for user data -- which would
> > cause thinp's reserve to _not_ be used for critical metadata.

Mike, I think you might have misunderstood what I have been proposing.
Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
that's what I intended - the operation does not contain data at all.
It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
contains a range of sectors that need to be provisioned (or
discarded), and nothing else. The write IOs themselves are not
tagged with anything special at all.

i.e. The proposal I made does not use REQ_PROVISION anywhere in the
metadata/data IO path; provisioned regions are created by separate
operations and must be tracked by the underlying block device, which
then treats any write IO to those regions as "must not fail w/ ENOSPC"
IOs.
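
To make that concrete, by analogy with blkdev_issue_discard() -- a
sketch only, assuming the blkdev_issue_provision() helper from patch
2/5 keeps a discard-like signature:

	/* No data payload: the bio carries only a sector range. */
	err = blkdev_issue_provision(bdev, sector, nr_sects, GFP_KERNEL);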

There seems to be a lot of fear about user data requiring
provisioning. This is unfounded - provisioning is only needed for
explicitly provisioned space via fallocate(), not every byte of
user data written to the filesystem (the model Brian is proposing).

Excessive use of fallocate() is self-correcting - if users and/or
their applications provision too much, they are going to get ENOSPC
or have to pay more to expand the backing pool reserves they need.
But that's not a problem the block device should be trying to solve;
that's a problem for the sysadmin and/or bean counters to address.

> >
> > The only way to distinguish the caller (between acting on behalf of
> > user data vs XFS metadata) would be REQ_META?
> >
> > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > all REQ_OP_PROVISION the same?
> >
> I'm in favor of a REQ_META-based distinction.

Why? What *requirement* is driving the need for this distinction?

As the person who proposed this new REQ_OP_PROVISION architecture,
I'm dead set against it. Allowing the block device to provide a set of
poorly defined "conditional guarantees" policies instead of a
mechanism with a single ironclad guarantee defeats the entire
purpose of the proposal.

We have a requirement from the *kernel ABI* that *user data writes*
must not fail with ENOSPC after an fallocate() operation. That's
one of the high level policies we need to implement. The filesystem
is already capable of guaranteeing it won't give the user ENOSPC
after fallocate, we now need a guarantee from the filesystem's
backing store that it won't give ENOSPC, too.

The _other thing_ we need to implement is a method of guaranteeing
the filesystem won't shut down when the backing device goes ENOSPC
unexpectedly during metadata writeback. So we also need the backing
device to guarantee the regions we write metadata to won't give
ENOSPC.

That's the whole point of REQ_OP_PROVISION: from the layers above
the block device, there is -zero- difference between the guarantee
we need for user data writes to avoid ENOSPC and for metadata writes
to avoid ENOSPC. They are one and the same.

Hence if the block device is going to say "I support provisioning"
but then give different conditional guarantees according to the
*type of data* in the IO request, then it does not provide the
functionality the higher layers actually require from it.

Indeed, what type of data the IO contains is *context dependent*.
For example, sometimes we write metadata with user data IO, but
we still need provisioning guarantees as if it was issued as
metadata IO. This is the case for mkfs initialising the file system
by writing directly to the block device.

IOWs, filesystem metadata IO issued from kernel context would be
considered metadata IO, but from userspace it would be considered
normal user data IO and hence treated differently. But the reality
is that they both need the same provisioning guarantees to be
provided by the block device.

So how do userspace tools deal with this if the block device
requires REQ_META on user data IOs to do the right thing here? And
if we provide a mechanism to allow this, how do we prevent userspace
from always using it on writes to fallocate() provisioned space?

It's just not practical for the block device to add arbitrary
constraints based on the type of IO because we then have to add
mechanisms to userspace APIs to allow them to control the IO context
so the block device will do the right thing. Especially considering
we really only need one type of guarantee regardless of where the IO
originates from or what type of data the IO contains....

> Does that imply that
> REQ_META also needs to be passed through the block/filesystem stack
> (e.g. REQ_OP_PROVISION + REQ_META on a loop device translates to a
> fallocate(<insert meta flag name>) on the underlying file)?

This is exactly the same case as above: the loopback device does
user data IO to the backing file. Hence we have another situation
where metadata IO is issued to fallocate()d user data ranges as user
data ranges and so would be given a lesser guarantee that would lead
to upper filesystem failure. Both upper and lower filesystem data
and metadata need to be provided the same ENOSPC guarantees by their
backing stores....

The whole point of the REQ_OP_PROVISION proposal I made is that it
doesn't require any special handling in corner cases like this.
There are no cross-layer interactions needed to make everything work
correctly because the provisioning guarantee is not *data type
dependent*. The entire user IO path code remains untouched and
blissfully unaware of provisioned regions.

And, realistically, if we have to start handling complex corner
cases in the filesystem and IO path layers to make REQ_OP_PROVISION
work correctly because of arbitrary constraints imposed by the block
layer implementations, then we've failed miserably at the design and
architecture stage.

Keep in mind that every attempt made so far to address the problems
with block device ENOSPC errors has failed because of the complexity
of the corner cases that have arisen during design and/or
implementation. It's pretty ironic that now we have a proposal that
is remarkably simple, free of corner cases and has virtually no
cross-layer coupling at all, the first thing that people want to do
is add arbitrary implementation constraints that result in complex
cross-layer corner cases that now need to be handled....

Put simply: if we restrict REQ_OP_PROVISION guarantees to just
REQ_META writes (or any other specific type of write operation) then
it's simply not worth pursuing at the filesystem level because the
guarantees we actually need just aren't there and the complexity of
discovering and handling those corner cases just isn't worth the
effort.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-03 16:05:19

by Mike Snitzer

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Fri, Jun 02 2023 at 8:52P -0400,
Dave Chinner <[email protected]> wrote:

> On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <[email protected]> wrote:
> > >
> > > On Tue, May 30 2023 at 10:55P -0400,
> > > Joe Thornber <[email protected]> wrote:
> > >
> > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
> > > >
> > > > >
> > > > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > > reserve space to accommodate it).
> > > > >
> > > >
> > > > My proposal only involves 'reserve'. Provisioning will be done as part of
> > > > the usual io path.
> > >
> > > OK, I think we'd do well to pin down the top-level block interfaces in
> > > question. Because this patchset's block interface patch (2/5) header
> > > says:
> > >
> > > "This patch also adds the capability to call fallocate() in mode 0
> > > on block devices, which will send REQ_OP_PROVISION to the block
> > > device for the specified range,"
> > >
> > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > > user of XFS could then use fallocate() for user data -- which would
> > > cause thinp's reserve to _not_ be used for critical metadata.
>
> Mike, I think you might have misunderstood what I have been proposing.
> Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> that's what I intended - the operation does not contain data at all.
> It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
> contains a range of sectors that need to be provisioned (or
> discarded), and nothing else.

No, I understood that.

> The write IOs themselves are not tagged with anything special at all.

I know, but I've been looking at how to also handle the delalloc
usecase (and yes I know you feel it doesn't need handling, the issue
is XFS does deal nicely with ensuring it has space when it tracks its
allocations on "thick" storage -- so adding coordination between XFS
and dm-thin layers provides comparable safety... that safety is an
expected norm).

But rather than discuss in terms of data vs metadata, the distinction
is:
1) LBA range reservation (normal case, your proposal)
2) non-LBA reservation (absolute value, LBA range is known at a later stage)

But I'm clearly going off script for dwelling on wanting to handle
both.

My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
a crude simplification for branching between the 2 approaches.

And I understand I made you nervous by expanding the scope to a much
more muddled/shitty interface. ;)

We all just need to focus on your proposal and Joe's dm-thin
reservation design...

[Sarthak: FYI, this implies that it doesn't really make sense to add
dm-thinp support before Joe's design is implemented. Otherwise we'll
have 2 different responses to REQ_OP_PROVISION. The one that is
captured in your patchset isn't adequate to properly handle ensuring
an upper layer (like XFS) can depend on the space being available across
snapshot boundaries.]

> i.e. The proposal I made does not use REQ_PROVISION anywhere in the
> metadata/data IO path; provisioned regions are created by separate
> operations and must be tracked by the underlying block device, which
> then treats any write IO to those regions as "must not fail w/ ENOSPC"
> IOs.
>
> There seems to be a lot of fear about user data requiring
> provisioning. This is unfounded - provisioning is only needed for
> explicitly provisioned space via fallocate(), not every byte of
> user data written to the filesystem (the model Brian is proposing).

As I mentioned above, I was just trying to get XFS-on-thinp to
maintain parity with how XFS's delalloc accounting works on "thick"
storage.

But happy to put that to one side. Maintain focus like I mentioned
above. I'm happy we have momentum and agreement on this design now.
Rather than be content with that, I was mistakenly looking at other
aspects and in doing so introduced "noise" before we've implemented
what we all completely agree on: your and Joe's designs.

> Excessive use of fallocate() is self-correcting - if users and/or
> their applications provision too much, they are going to get ENOSPC
> or have to pay more to expand the backing pool reserves they need.
> But that's not a problem the block device should be trying to solve;
> that's a problem for the sysadmin and/or bean counters to address.
>
> > >
> > > The only way to distinguish the caller (between acting on behalf of
> > > user data vs XFS metadata) would be REQ_META?
> > >
> > > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > > all REQ_OP_PROVISION the same?
> > >
> > I'm in favor of a REQ_META-based distinction.
>
> Why? What *requirement* is driving the need for this distinction?

Think I answered that above, XFS delalloc accounting parity on thinp.

> As the person who proposed this new REQ_OP_PROVISION architecture,
> I'm dead set against it. Allowing the block device to provide a set of
> poorly defined "conditional guarantees" policies instead of a
> mechanism with a single ironclad guarantee defeats the entire
> purpose of the proposal.
>
> We have a requirement from the *kernel ABI* that *user data writes*
> must not fail with ENOSPC after an fallocate() operation. That's
> one of the high level policies we need to implement. The filesystem
> is already capable of guaranteeing it won't give the user ENOSPC
> after fallocate, we now need a guarantee from the filesystem's
> backing store that it won't give ENOSPC, too.

Yes, I was trying to navigate Joe's reluctance to even support
fallocate() for arbitrary user data. That's where the REQ_META vs
data distinction crept in for me. But as you say: using fallocate()
excessively is self-correcting.

> The _other thing_ we need to implement is a method of guaranteeing
> the filesystem won't shut down when the backing device goes ENOSPC
> unexpectedly during metadata writeback. So we also need the backing
> device to guarantee the regions we write metadata to won't give
> ENOSPC.

Yeap.

> That's the whole point of REQ_OP_PROVISION: from the layers above
> the block device, there is -zero- difference between the guarantee
> we need for user data writes to avoid ENOSPC and for metadata writes
> to avoid ENOSPC. They are one and the same.

I know. The difference comes from delalloc initially needing an
absolute value of reserve rather than a specific LBA range.

> Hence if the block device is going to say "I support provisioning"
> but then give different conditional guarantees according to the
> *type of data* in the IO request, then it does not provide the
> functionality the higher layers actually require from it.

I was going for relaxing the "dynamic" approach (Brian's) to be
best-effort -- and really only for XFS delalloc usecase. Every other
usecase would respect your and Joe's vision.

> Indeed, what type of data the IO contains is *context dependent*.
> For example, sometimes we write metadata with user data IO, but
> we still need provisioning guarantees as if it was issued as
> metadata IO. This is the case for mkfs initialising the file system
> by writing directly to the block device.

I'm aware.

> IOWs, filesystem metadata IO issued from kernel context would be
> considered metadata IO, but from userspace it would be considered
> normal user data IO and hence treated differently. But the reality
> is that they both need the same provisioning guarantees to be
> provided by the block device.

What I was looking at is making the fallocate interface able to
express: I need Dave's requirements (bog-standard actually) vs I need
non-LBA best effort.

> So how do userspace tools deal with this if the block device
> requires REQ_META on user data IOs to do the right thing here? And
> if we provide a mechanism to allow this, how do we prevent userspace
> from always using it on writes to fallocate() provisioned space?
>
> It's just not practical for the block device to add arbitrary
> constraints based on the type of IO because we then have to add
> mechanisms to userspace APIs to allow them to control the IO context
> so the block device will do the right thing. Especially considering
> we really only need one type of guarantee regardless of where the IO
> originates from or what type of data the IO contains....

If anything, my notion of conditionally requiring REQ_META (or some
fallocate-generated REQ_UNSHARE to the same effect) in order to
perform your approach to REQ_OP_PROVISION and honor fallocate()
requirements is a big problem. It would be much better to have a flag
to express "this reservation does not have an LBA range _yet_;
nevertheless, try to be mindful of this expected near-term block
allocation".
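
From userspace that could look like this (entirely hypothetical --
neither the flag name nor the value exists anywhere):

	#define FALLOC_FL_ANON_RESERVE	0x100	/* illustration only */

	/* reserve 'len' bytes of pool space, no LBA range attached yet */
	ret = fallocate(bdev_fd, FALLOC_FL_ANON_RESERVE, 0, len);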

But I'll stop inlining repetitive (similar but different) answers to
your concern now ;)

> > Does that imply that
> > REQ_META also needs to be passed through the block/filesystem stack
> > (e.g. REQ_OP_PROVISION + REQ_META on a loop device translates to a
> > fallocate(<insert meta flag name>) on the underlying file)?
>
> This is exactly the same case as above: the loopback device does
> user data IO to the backing file. Hence we have another situation
> where metadata IO is issued to fallocate()d user data ranges as user
> data ranges and so would be given a lesser guarantee that would lead
> to upper filesystem failure. Both upper and lower filesystem data
> and metadata need to be provided the same ENOSPC guarantees by their
> backing stores....
>
> The whole point of the REQ_OP_PROVISION proposal I made is that it
> doesn't require any special handling in corner cases like this.
> There are no cross-layer interactions needed to make everything work
> correctly because the provisioning guarantee is not *data type
> dependent*. The entire user IO path code remains untouched and
> blissfully unaware of provisioned regions.
>
> And, realistically, if we have to start handling complex corner
> cases in the filesystem and IO path layers to make REQ_OP_PROVISION
> work correctly because of arbitrary constraints imposed by the block
> layer implementations, then we've failed miserably at the design and
> architecture stage.
>
> Keep in mind that every attempt made so far to address the problems
> with block device ENOSPC errors has failed because of the complexity
> of the corner cases that have arisen during design and/or
> implementation. It's pretty ironic that now we have a proposal that
> is remarkably simple, free of corner cases and has virtually no
> cross-layer coupling at all, the first thing that people want to do
> is add arbitrary implementation constraints that result in complex
> cross-layer corner cases that now need to be handled....
>
> Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> REQ_META writes (or any other specific type of write operation) then
> it's simply not worth pursuing at the filesystem level because the
> guarantees we actually need just aren't there and the complexity of
> discovering and handling those corner cases just isn't worth the
> effort.

Here is where I get to say: I think you misunderstood me (but it was
my fault for not being absolutely clear: I'm very much on the same
page as you and Joe; and your visions need to just be implemented
ASAP).

I was taking your designs as a given, but looking further at: how do
we also handle the non-LBA (delalloc) usecase _before_ we include
REQ_OP_PROVISION in the kernel.

But I'm happy to let the delalloc case go (we can revisit addressing
it if/when needed).

Mike

2023-06-05 21:19:54

by Sarthak Kukreti

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <[email protected]> wrote:
>
> On Fri, Jun 02 2023 at 8:52P -0400,
> Dave Chinner <[email protected]> wrote:
>
> > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <[email protected]> wrote:
> > > >
> > > > On Tue, May 30 2023 at 10:55P -0400,
> > > > Joe Thornber <[email protected]> wrote:
> > > >
> > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > > > reserve space to accommodate it).
> > > > > >
> > > > >
> > > > > My proposal only involves 'reserve'. Provisioning will be done as part of
> > > > > the usual io path.
> > > >
> > > > OK, I think we'd do well to pin down the top-level block interfaces in
> > > > question. Because this patchset's block interface patch (2/5) header
> > > > says:
> > > >
> > > > "This patch also adds the capability to call fallocate() in mode 0
> > > > on block devices, which will send REQ_OP_PROVISION to the block
> > > > device for the specified range,"
> > > >
> > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > > > user of XFS could then use fallocate() for user data -- which would
> > > > cause thinp's reserve to _not_ be used for critical metadata.
> >
> > Mike, I think you might have misunderstood what I have been proposing.
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > that's what I intended - the operation does not contain data at all.
> > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
> > contains a range of sectors that need to be provisioned (or
> > discarded), and nothing else.
>
> No, I understood that.
>
> > The write IOs themselves are not tagged with anything special at all.
>
> I know, but I've been looking at how to also handle the delalloc
> usecase (and yes I know you feel it doesn't need handling, the issue
> is XFS does deal nicely with ensuring it has space when it tracks its
> allocations on "thick" storage -- so adding coordination between XFS
> and dm-thin layers provides comparable safety... that safety is an
> expected norm).
>
> But rather than discuss in terms of data vs metadata, the distinction
> is:
> 1) LBA range reservation (normal case, your proposal)
> 2) non-LBA reservation (absolute value, LBA range is known at a later stage)
>
> But I'm clearly going off script for dwelling on wanting to handle
> both.
>
> My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> a crude simplification for branching between the 2 approaches.
>
> And I understand I made you nervous by expanding the scope to a much
> more muddled/shitty interface. ;)
>
> We all just need to focus on your proposal and Joe's dm-thin
> reservation design...
>
> [Sarthak: FYI, this implies that it doesn't really make sense to add
> dm-thinp support before Joe's design is implemented. Otherwise we'll
> have 2 different responses to REQ_OP_PROVISION. The one that is
> captured in your patchset isn't adequate to properly handle ensuring
> an upper layer (like XFS) can depend on the space being available across
> snapshot boundaries.]
>
Ack. Would it be premature for the rest of the series to go through
(REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
targets)? I'd like to start using this as a reference to suggest
additions to the virtio-spec for virtio-blk support and start looking
at what an ext4 implementation would look like.

> > i.e. The proposal I made does not use REQ_PROVISION anywhere in the
> > metadata/data IO path; provisioned regions are created by separate
> > operations and must be tracked by the underlying block device, which
> > then treats any write IO to those regions as "must not fail w/ ENOSPC"
> > IOs.
> >
> > There seems to be a lot of fear about user data requiring
> > provisioning. This is unfounded - provisioning is only needed for
> > explicitly provisioned space via fallocate(), not every byte of
> > user data written to the filesystem (the model Brian is proposing).
>
> As I mentioned above, I was just trying to get XFS-on-thinp to
> maintain parity with how XFS's delalloc accounting works on "thick"
> storage.
>
> But happy to put that to one side. Maintain focus like I mentioned
> above. I'm happy we have momentum and agreement on this design now.
> Rather than be content with that, I was mistakenly looking at other
> aspects and in doing so introduced "noise" before we've implemented
> what we all completely agree on: your and Joe's designs.
>
> > Excessive use of fallocate() is self-correcting - if users and/or
> > their applications provision too much, they are going to get ENOSPC
> > or have to pay more to expand the backing pool reserves they need.
> > But that's not a problem the block device should be trying to solve;
> > that's a problem for the sysadmin and/or bean counters to address.
> >
> > > >
> > > > The only way to distinguish the caller (between acting on behalf of
> > > > user data vs XFS metadata) would be REQ_META?
> > > >
> > > > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > > > all REQ_OP_PROVISION the same?
> > > >
> > > I'm in favor of a REQ_META-based distinction.
> >
> > Why? What *requirement* is driving the need for this distinction?
>
> Think I answered that above, XFS delalloc accounting parity on thinp.
>
I actually had a few different use-cases in mind (apart from the user
data provisioning 'fear' that you pointed out): in essence, there are
cases where userspace would benefit from having more control over how
much space a snapshot takes:

1) In the original RFC patchset [1], I alluded to this being a
mechanism for preserving space for thin logical volumes. The use-case
I'd like to explore is delta-updatable read-only filesystems similar
to systemd system extensions [2]. In essence:
a) Preserve space for a 'base' thin logical volume that will contain a
read-only filesystem on over-the-air installation: for filesystems
like squashfs and erofs, pretty much the entire image is a compressed
file that I'd like to reserve space for before installation.
b) Before update, create a thin snapshot and preserve enough space to
ensure that a delta update will succeed (eg. block level diff of the
base image). Then, the update is guaranteed to have disk space to
succeed (similar to the A-B update guarantees on ChromeOS). On
success, we merge the snapshot and reserve an update snapshot for the
next possible update. On failure, we drop the snapshot.

2) The other idea I wanted to explore was rollback protection for
stateful filesystem features: in essence, if an update from kernel 4.x
to 5.y failed very quickly (due to unrelated reasons) and we enabled
some stateful filesystem features that are only supported on 5.y, we'd
be able to rollback to 4.x if we used short-lived snapshots (in the
ChromiumOS world, the lifetime of these snapshots would be < 10s per
boot).

On reflection, the metadata vs user data distinction was a means to an
end for me: I'd like to retain the capability to create thin
short-lived snapshots from userspace _regardless_ of the provisioned
ranges of a thin volume and the flexibility to manipulate the space
requirements on these snapshot volumes from userspace. This might
appear as "bean-counting" but if I have, e.g., a 4GB read-only compressed
filesystem and I know, a priori, an update will take at most 400M, I
shouldn't need to (momentarily) reserve another 4GB or add more disks
to complete the update.

[1] https://lkml.org/lkml/2022/9/15/785
[2] https://www.freedesktop.org/software/systemd/man/systemd-sysext.html

> > As the person who proposed this new REQ_OP_PROVISION architecture,
> > I'm dead set against it. Allowing the block device to provide a set of
> > poorly defined "conditional guarantees" policies instead of a
> > mechanism with a single ironclad guarantee defeats the entire
> > purpose of the proposal.
> >
> > We have a requirement from the *kernel ABI* that *user data writes*
> > must not fail with ENOSPC after an fallocate() operation. That's
> > one of the high level policies we need to implement. The filesystem
> > is already capable of guaranteeing it won't give the user ENOSPC
> > after fallocate, we now need a guarantee from the filesystem's
> > backing store that it won't give ENOSPC, too.
>
> Yes, I was trying to navigate Joe's reluctance to even support
> fallocate() for arbitrary user data. That's where the REQ_META vs
> data distinction crept in for me. But as you say: using fallocate()
> excessively is self-correcting.
>
> > The _other thing_ we need to implement is a method of guaranteeing
> > the filesystem won't shut down when the backing device goes ENOSPC
> > unexpectedly during metadata writeback. So we also need the backing
> > device to guarantee the regions we write metadata to won't give
> > ENOSPC.
>
> Yeap.
>
> > That's the whole point of REQ_OP_PROVISION: from the layers above
> > the block device, there is -zero- difference between the guarantee
> > we need for user data writes to avoid ENOSPC and for metadata writes
> > to avoid ENOSPC. They are one and the same.
>
> I know. The difference comes from delalloc initially needing an
> absolute value of reserve rather than a specific LBA range.
>
> > Hence if the block device is going to say "I support provisioning"
> > but then give different conditional guarantees according to the
> > *type of data* in the IO request, then it does not provide the
> > functionality the higher layers actually require from it.
>
> I was going for relaxing the "dynamic" approach (Brian's) to be
> best-effort -- and really only for XFS delalloc usecase. Every other
> usecase would respect your and Joe's vision.
>
> > Indeed, what type of data the IO contains is *context dependent*.
> > For example, sometimes we write metadata with user data IO, but
> > we still need provisioning guarantees as if it was issued as
> > metadata IO. This is the case for mkfs initialising the file system
> > by writing directly to the block device.
>
> I'm aware.
>
> > IOWs, filesystem metadata IO issued from kernel context would be
> > considered metadata IO, but from userspace it would be considered
> > normal user data IO and hence treated differently. But the reality
> > is that they both need the same provisioning guarantees to be
> > provided by the block device.
>
> What I was looking at is making the fallocate interface able to
> express: I need Dave's requirements (bog-standard actually) vs I need
> non-LBA best effort.
>
> > So how do userspace tools deal with this if the block device
> > requires REQ_META on user data IOs to do the right thing here? And
> > if we provide a mechanism to allow this, how do we prevent userspace
> > from always using it on writes to fallocate() provisioned space?
> >
> > It's just not practical for the block device to add arbitrary
> > constraints based on the type of IO because we then have to add
> > mechanisms to userspace APIs to allow them to control the IO context
> > so the block device will do the right thing. Especially considering
> > we really only need one type of guarantee regardless of where the IO
> > originates from or what type of data the IO contains....
>
> If anything, my notion of conditionally requiring REQ_META (or some
> fallocate-generated REQ_UNSHARE to the same effect) in order to
> perform your approach to REQ_OP_PROVISION and honor fallocate()
> requirements is a big problem. It would be much better to have a flag
> to express "this reservation does not have an LBA range _yet_;
> nevertheless, try to be mindful of this expected near-term block
> allocation".
>
> But I'll stop inlining repetitive (similar but different) answers to
> your concern now ;)
>
> > > Does that imply that
> > > REQ_META also needs to be passed through the block/filesystem stack
> > > (e.g. REQ_OP_PROVISION + REQ_META on a loop device translates to a
> > > fallocate(<insert meta flag name>) on the underlying file)?
> >
> > This is exactly the same case as above: the loopback device does
> > user data IO to the backing file. Hence we have another situation
> > where metadata IO is issued to fallocate()d user data ranges as user
> > data ranges and so would be given a lesser guarantee that would lead
> > to upper filesystem failure. Both upper and lower filesystem data
> > and metadata need to be provided the same ENOSPC guarantees by their
> > backing stores....
> >
> > The whole point of the REQ_OP_PROVISION proposal I made is that it
> > doesn't require any special handling in corner cases like this.
> > There are no cross-layer interactions needed to make everything work
> > correctly because the provisioning guarantee is not *data type
> > dependent*. The entire user IO path code remains untouched and
> > blissfully unaware of provisioned regions.
> >
> > And, realistically, if we have to start handling complex corner
> > cases in the filesystem and IO path layers to make REQ_OP_PROVISION
> > work correctly because of arbitrary constraints imposed by the block
> > layer implementations, then we've failed miserably at the design and
> > architecture stage.
> >
> > Keep in mind that every attempt made so far to address the problems
> > with block device ENOSPC errors has failed because of the complexity
> > of the corner cases that have arisen during design and/or
> > implementation. It's pretty ironic that now we have a proposal that
> > is remarkably simple, free of corner cases and has virtually no
> > cross-layer coupling at all, the first thing that people want to do
> > is add arbitrary implementation constraints that result in complex
> > cross-layer corner cases that now need to be handled....
> >
> > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > REQ_META writes (or any other specific type of write operation) then
> > it's simply not worth pursuing at the filesystem level because the
> > guarantees we actually need just aren't there and the complexity of
> > discovering and handling those corner cases just isn't worth the
> > effort.

Fair points, I certainly don't want to derail this conversation; I'd
be happy to see this work merged sooner rather than later. For
posterity, I'll distill what I said above into the following: I'd like
a capability for userspace to create thin snapshots that ignore the
thin volume's provisioned areas. IOW, an opt-in flag which makes
snapshots fall back to what they do today to provide flexibility to
userspace to decide the space requirements for the above mentioned
scenarios, and at the same time, not adding separate corner case
handling for filesystems. But to reiterate, my intent isn't to pile
this onto the work you, Mike and Joe have planned; just some insight
into why I'm in favor of ideas that reduce the snapshot size.

Best
Sarthak

>
> Here is where I get to say: I think you misunderstood me (but it was
> my fault for not being absolutely clear: I'm very much on the same
> page as you and Joe; and your visions need to just be implemented
> ASAP).
>
> I was taking your designs as a given, but looking further at: how do
> we also handle the non-LBA (delalloc) usecase _before_ we include
> REQ_OP_PROVISION in the kernel.
>
> But I'm happy to let the delalloc case go (we can revisit addressing
> it if/when needed).
>
> Mike

2023-06-07 02:46:46

by Dave Chinner

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> On Fri, Jun 02 2023 at 8:52P -0400,
> Dave Chinner <[email protected]> wrote:
>
> > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > > On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <[email protected]> wrote:
> > > >
> > > > On Tue, May 30 2023 at 10:55P -0400,
> > > > Joe Thornber <[email protected]> wrote:
> > > >
> > > > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > Also Joe, for your proposed dm-thinp design where you distinguish
> > > > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > > > LBA-specific hard request? Whereas REQ_PROVISION on its own provides
> > > > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > > > reserve space to accommodate it).
> > > > > >
> > > > >
> > > > > My proposal only involves 'reserve'. Provisioning will be done as part of
> > > > > the usual io path.
> > > >
> > > > OK, I think we'd do well to pin down the top-level block interfaces in
> > > > question. Because this patchset's block interface patch (2/5) header
> > > > says:
> > > >
> > > > "This patch also adds the capability to call fallocate() in mode 0
> > > > on block devices, which will send REQ_OP_PROVISION to the block
> > > > device for the specified range,"
> > > >
> > > > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > > > user of XFS could then use fallocate() for user data -- which would
> > > > cause thinp's reserve to _not_ be used for critical metadata.
> >
> > Mike, I think you might have misunderstood what I have been proposing.
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > that's what I intended - the operation does not contain data at all.
> > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
> > contains a range of sectors that need to be provisioned (or
> > discarded), and nothing else.
>
> No, I understood that.
>
> > The write IOs themselves are not tagged with anything special at all.
>
> I know, but I've been looking at how to also handle the delalloc
> usecase (and yes I know you feel it doesn't need handling, the issue
> is XFS does deal nicely with ensuring it has space when it tracks its
> allocations on "thick" storage

Oh, no it doesn't. It -works for most cases-, but that does not mean
it provides any guarantees at all. We can still get ENOSPC for user
data when delayed allocation reservations "run out".

This may be news to you, but the ephemeral XFS delayed allocation
space reservation is not accurate. It contains a "fudge factor"
called "indirect length". This is a "wet finger in the wind"
estimation of how much new metadata will need to be allocated to
index the physical allocations when they are made. It assumes large
data extents are allocated, which is good enough for most cases, but
it is no guarantee when there are no large data extents available to
allocate (e.g. near ENOSPC!).
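
The shape of that estimate is roughly the following (illustrative
only, this is not the actual xfs_bmap_worst_indlen() code):

static unsigned int worst_case_indlen(unsigned int data_blocks,
				      unsigned int recs_per_block)
{
	unsigned int indlen = 0;

	/* each bmap btree level needs blocks to index the level below */
	while (data_blocks > 1) {
		data_blocks = DIV_ROUND_UP(data_blocks, recs_per_block);
		indlen += data_blocks;
	}
	return indlen;
}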

And therein lies the fundamental problem with ephemeral range
reservations: at the time of reservation, we don't know how many
individual physical LBA ranges the reserved data range is actually
going to span.

As a result, XFS delalloc reservations are a "close-but-not-quite"
reservation backed by a global reserve pool that can be dipped into
if we run out of delalloc reservation. If the reserve pool is then
fully depleted before all delalloc conversion completes, we'll still
give ENOSPC. The pool is sized such that the vast majority of
workloads will complete delalloc conversion successfully before the
pool is depleted.

Hence XFS gives everyone the -appearance- that it deals nicely with
ENOSPC conditions, but it never provides a -guarantee- that any
accepted write will always succeed without ENOSPC.

IMO, using this "close-but-not-quite" reservation as the basis of
space requirements for other layers to provide "won't ENOSPC"
guarantees is fraught with problems. We already know that it is
insufficient in important corner cases at the filesystem level, and
we also know that lower layers trying to do ephemeral space
reservations will have exactly the same problems providing a
guarantee. And these are problems we've been unable to engineer
around in the past, so it is very unlikely we can engineer around
them now or in the future.

> -- so adding coordination between XFS
> and dm-thin layers provides comparable safety.. that safety is an
> expected norm).
>
> But rather than discuss in terms of data vs metadata, the distinction
> is:
> 1) LBA range reservation (normal case, your proposal)
> 2) non-LBA reservation (absolute value, LBA range is known at a later stage)
>
> But I'm clearly going off script for dwelling on wanting to handle
> both.

Right, because if we do 1) then we don't need 2). :)

> My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> a crude simplification for branching between the 2 approaches.
>
> And I understand I made you nervous by expanding the scope to a much
> more muddled/shitty interface. ;)

Nervous? No, I'm simply trying to make sure that everyone is on the
same page, i.e. that if we water down the guarantee that 1) relies
on, then it's not actually useful to filesystems at all.

> > It's just not practical for the block device to add arbitrary
> > constraints based on the type of IO because we then have to add
> > mechanisms to userspace APIs to allow them to control the IO context
> > so the block device will do the right thing. Especially considering
> > we really only need one type of guarantee regardless of where the IO
> > originates from or what type of data the IO contains....
>
> If anything, my notion of conditionally requiring REQ_META (or some
> fallocate-generated REQ_UNSHARE to the same effect) in order to
> perform your approach to REQ_OP_PROVISION and honor fallocate()
> requirements is a big problem. It would be much better to have a flag
> to express "this reservation does not have an LBA range _yet_;
> nevertheless, try to be mindful of this expected near-term block
> allocation".

And that's where all the complexity starts ;)

> > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > REQ_META writes (or any other specific type of write operation) then
> > it's simply not worth persuing at the filesystem level because the
> > guarantees we actually need just aren't there and the complexity of
> > discovering and handling those corner cases just isn't worth the
> > effort.
>
> Here is where I get to say: I think you misunderstood me (but it was
> my fault for not being absolutely clear: I'm very much on the same
> page as you and Joe; and your visions need to just be implemented
> ASAP).

OK, good that we've clarified the misunderstandings on both sides
quickly :)

> I was taking your designs as a given, but looking further at: how do
> we also handle the non-LBA (delalloc) usecase _before_ we include
> REQ_OP_PROVISION in the kernel.
>
> But I'm happy to let the delalloc case go (we can revisit addressing
> it if/when needed).

Again, I really don't think filesystem delalloc ranges ever need to
be covered by block device provisioning guarantees because the
filesystem itself provides no guarantees for unprovisioned writes.

I suspect that if, in future, we want to manage unprovisioned space
in different ways, we're better off taking this sort of approach:

https://lore.kernel.org/linux-xfs/[email protected]/

because using grow/shrink to manage the filesystem's unprovisioned
space is far, far simpler than trying to use dynamic, cross-layer
ephemeral reservations. Indeed, with the block device filesystem
shutdown path Christoph recently posted, we have a model for adding
in-kernel filesystem control interfaces for block devices...
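
For what it's worth, the control interfaces for that model largely
exist already. A userspace sketch using the current growfs ioctl --
'mntfd' is an fd on the mountpoint, and imaxpct should echo the
current value reported by XFS_IOC_FSGEOMETRY:

	struct xfs_growfs_data in = {
		.newblocks = newblocks,	/* new data size, in fs blocks */
		.imaxpct   = imaxpct,	/* pass through the existing value */
	};

	if (ioctl(mntfd, XFS_IOC_FSGROWFSDATA, &in) < 0)
		perror("growfs");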

There's something to be said for turning everything upside down
occasionally. :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-07 23:30:18

by Mike Snitzer

Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Mon, Jun 05 2023 at 5:14P -0400,
Sarthak Kukreti <[email protected]> wrote:

> On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <[email protected]> wrote:
> >
> > We all just need to focus on your proposal and Joe's dm-thin
> > reservation design...
> >
> > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > have 2 different responses to REQ_OP_PROVISION. The one that is
> > captured in your patchset isn't adequate to properly handle ensuring
> > an upper layer (like XFS) can depend on the space being available across
> > snapshot boundaries.]
> >
> Ack. Would it be premature for the rest of the series to go through
> (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> targets)? I'd like to start using this as a reference to suggest
> additions to the virtio-spec for virtio-blk support and start looking
> at what an ext4 implementation would look like.

Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
more work to provide the type of guarantee XFS requires across
snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
because it is best to just use dm-thin.

And FYI even your dm-thin patch will be the starting point for the
dm-thin support (we'll keep attribution to you for all the code in a
separate patch).

> Fair points, I certainly don't want to derail this conversation; I'd
> be happy to see this work merged sooner rather than later.

Once those dm target changes are dropped I think the rest of the
series is fine to go upstream now. Feel free to post a v8.

> For posterity, I'll distill what I said above into the following: I'd like
> a capability for userspace to create thin snapshots that ignore the
> thin volume's provisioned areas. IOW, an opt-in flag which makes
> snapshots fall back to what they do today to provide flexibility to
> userspace to decide the space requirements for the above mentioned
> scenarios, and at the same time, not adding separate corner case
> handling for filesystems. But to reiterate, my intent isn't to pile
> this onto the work you, Mike and Joe have planned; just some insight
> into why I'm in favor of ideas that reduce the snapshot size.

I think it'd be useful to ignore a thin device's reservation for
read-only snapshots. Adding the ability to create read-only thin
snapshots could make sense (later activations don't necessarily need
to impose read-only; doing so would require some additional work).

Mike

2023-06-08 00:24:13

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Tue, Jun 06 2023 at 10:01P -0400,
Dave Chinner <[email protected]> wrote:

> On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> > On Fri, Jun 02 2023 at 8:52P -0400,
> > Dave Chinner <[email protected]> wrote:
> >
> > > Mike, I think you might have misunderstood what I have been proposing.
> > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > > that's what I intended - the operation does not contain data at all.
> > > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
> > > contains a range of sectors that need to be provisioned (or
> > > discarded), and nothing else.
> >
> > No, I understood that.
> >
> > > The write IOs themselves are not tagged with anything special at all.
> >
> > I know, but I've been looking at how to also handle the delalloc
> > usecase (and yes I know you feel it doesn't need handling, the issue
> > is XFS does deal nicely with ensuring it has space when it tracks its
> > allocations on "thick" storage
>
> Oh, no it doesn't. It -works for most cases-, but that does not mean
> it provides any guarantees at all. We can still get ENOSPC for user
> data when delayed allocation reservations "run out".
>
> This may be news to you, but the ephemeral XFS delayed allocation
> space reservation is not accurate. It contains a "fudge factor"
> called "indirect length". This is a "wet finger in the wind"
> estimation of how much new metadata will need to be allocated to
> index the physical allocations when they are made. It assumes large
> data extents are allocated, which is good enough for most cases, but
> it is no guarantee when there are no large data extents available to
> allocate (e.g. near ENOSPC!).
>
> And therein lies the fundamental problem with ephemeral range
> reservations: at the time of reservation, we don't know how many
> individual physical LBA ranges the reserved data range is actually
> going to span.
>
> As a result, XFS delalloc reservations are a "close-but-not-quite"
> reservation backed by a global reserve pool that can be dipped into
> if we run out of delalloc reservation. If the reserve pool is then
> fully depleted before all delalloc conversion completes, we'll still
> give ENOSPC. The pool is sized such that the vast majority of
> workloads will complete delalloc conversion successfully before the
> pool is depleted.
>
> Hence XFS gives everyone the -appearance- that it deals nicely with
> ENOSPC conditions, but it never provides a -guarantee- that any
> accepted write will always succeed without ENOSPC.
>
> IMO, using this "close-but-not-quite" reservation as the basis of
> space requirements for other layers to provide "won't ENOSPC"
> guarantees is fraught with problems. We already know that it is
> insufficient in important corner cases at the filesystem level, and
> we also know that lower layers trying to do ephemeral space
> reservations will have exactly the same problems providing a
> guarantee. And these are problems we've been unable to engineer
> around in the past, so the likelihood we can engineer around them
> now or in the future is also very unlikely.

Thanks for clarifying. Wasn't aware of XFS delalloc's "wet finger in
the air" ;)

So do you think it reasonable to require applications to fallocate
their data files? I'm not sure users know to take that extra step.
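
FWIW the extra step asked of applications is small. A minimal sketch
using plain fallocate(2) with mode 0 (with this series, the same call
on a thin volume would also pin space in dm-thinp underneath):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return EXIT_FAILURE;
	}

	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/* mode 0: preallocate the first 1GiB of the file up front */
	if (fallocate(fd, 0, 0, 1ULL << 30) < 0) {
		perror("fallocate");
		close(fd);
		return EXIT_FAILURE;
	}

	close(fd);
	return EXIT_SUCCESS;
}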

> > -- so adding coordination between XFS
> > and dm-thin layers provides comparable safety... that safety is an
> > expected norm).
> >
> > But rather than discuss in terms of data vs metadata, the distinction
> > is:
> > 1) LBA range reservation (normal case, your proposal)
> > 2) non-LBA reservation (absolute value, LBA range is known at later stage)
> >
> > But I'm clearly going off script for dwelling on wanting to handle
> > both.
>
> Right, because if we do 1) then we don't need 2). :)

Sure.

> > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> > a crude simplification for branching between the 2 approaches.
> >
> > And I understand I made you nervous by expanding the scope to a much
> > more muddled/shitty interface. ;)
>
> Nervous? No, I'm simply trying to make sure that everyone is on the
> same page. i.e. that if we water down the guarantee that 1) relies
> on, then it's not actually useful to filesystems at all.

Yeah, makes sense.

> > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > > REQ_META writes (or any other specific type of write operation) then
> > > it's simply not worth pursuing at the filesystem level because the
> > > guarantees we actually need just aren't there and the complexity of
> > > discovering and handling those corner cases just isn't worth the
> > > effort.
> >
> > Here is where I get to say: I think you misunderstood me (but it was
> > my fault for not being absolutely clear: I'm very much on the same
> > page as you and Joe; and your visions need to just be implemented
> > ASAP).
>
> OK, good that we've clarified the misunderstandings on both sides
> quickly :)

Do you think you're OK to scope out, and/or implement, the XFS changes
if you use v7 of this patchset as the starting point? (v8 should just
be v7 minus the dm-thin.c and dm-snap.c changes). The thinp
support in v7 works well enough to allow XFS to issue REQ_OP_PROVISION
and/or fallocate (via mkfs.xfs) to dm-thin devices.

And Joe and I can make independent progress on the dm-thin.c changes
needed to ensure the REQ_OP_PROVISION guarantee you need.

Thanks,
Mike

2023-06-08 02:16:04

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives


Dave,

> Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's
> what I intended - the operation does not contain data at all. It's an
> operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it contains a
> range of sectors that need to be provisioned (or discarded), and
> nothing else.

Yep. That's also how SCSI defines it. The act of provisioning a block
range is done through an UNMAP command using a special flag (the
ANCHOR bit). All it does is pin down those LBAs so future writes to
them won't result in ENOSPC.
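
For illustration, an anchored UNMAP issued by hand through SG_IO
looks roughly like the sketch below (layout per SBC: opcode 0x42 with
ANCHOR as bit 0 of CDB byte 1; treat the details as illustrative, not
as a reference):

#include <sys/ioctl.h>
#include <scsi/sg.h>

static int unmap_anchor(int fd, unsigned long long lba, unsigned int nblocks)
{
	unsigned char cdb[10] = { 0x42, 0x01 /* ANCHOR */ };
	unsigned char param[24] = { 0 };	/* 8-byte header + one descriptor */
	unsigned char sense[32];
	struct sg_io_hdr hdr = { 0 };
	int i;

	param[1] = 22;			/* UNMAP data length (bytes after this field) */
	param[3] = 16;			/* block descriptor data length */
	for (i = 0; i < 8; i++)		/* descriptor: 8-byte LBA, big endian */
		param[8 + i] = lba >> (56 - 8 * i);
	for (i = 0; i < 4; i++)		/* 4-byte number of logical blocks */
		param[16 + i] = nblocks >> (24 - 8 * i);

	cdb[7] = sizeof(param) >> 8;	/* parameter list length */
	cdb[8] = sizeof(param) & 0xff;

	hdr.interface_id = 'S';
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxfer_direction = SG_DXFER_TO_DEV;
	hdr.dxfer_len = sizeof(param);
	hdr.dxferp = param;
	hdr.mx_sb_len = sizeof(sense);
	hdr.sbp = sense;
	hdr.timeout = 10000;		/* milliseconds */

	return ioctl(fd, SG_IO, &hdr);
}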

--
Martin K. Petersen Oracle Linux Engineering

2023-06-09 20:45:33

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Wed, Jun 07 2023 at 7:27P -0400,
Mike Snitzer <[email protected]> wrote:

> On Mon, Jun 05 2023 at 5:14P -0400,
> Sarthak Kukreti <[email protected]> wrote:
>
> > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <[email protected]> wrote:
> > >
> > > We all just need to focus on your proposal and Joe's dm-thin
> > > reservation design...
> > >
> > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > > have 2 different responses to REQ_OP_PROVISION. The one that is
> > > captured in your patchset isn't adequate to properly handle ensuring
> > > upper layers (like XFS) can depend on the space being available across
> > > snapshot boundaries.]
> > >
> > Ack. Would it be premature for the rest of the series to go through
> > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > targets)? I'd like to start using this as a reference to suggest
> > additions to the virtio-spec for virtio-blk support and start looking
> > at what an ext4 implementation would look like.
>
> Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
> more work to provide the type of guarantee XFS requires across
> snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> because it is best to just use dm-thin.
>
> And FYI even your dm-thin patch will be the starting point for the
> dm-thin support (we'll keep attribution to you for all the code in a
> separate patch).
>
> > Fair points, I certainly don't want to derail this conversation; I'd
> > be happy to see this work merged sooner rather than later.
>
> Once those dm target changes are dropped I think the rest of the
> series is fine to go upstream now. Feel free to post a v8.

FYI, I've made my latest code available in this
'dm-6.5-provision-support' branch (based on 'dm-6.5'):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support

It's what v8 should be plus the 2 dm-thin patches (that I don't think
should go upstream yet, but are theoretically useful for Dave and
Joe).

The "dm thin: complete interface for REQ_OP_PROVISION support" commit
establishes all the dm-thin interface I think is needed. The FIXME in
process_provision_bio() (and the patch header) cautions against upper
layers like XFS using this dm-thinp support quite yet.

Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
support initially doesn't provide the guarantee that XFS needs across
snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).

Mike


2023-06-09 22:12:57

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Fri, Jun 09, 2023 at 04:31:41PM -0400, Mike Snitzer wrote:
> On Wed, Jun 07 2023 at 7:27P -0400,
> Mike Snitzer <[email protected]> wrote:
>
> > On Mon, Jun 05 2023 at 5:14P -0400,
> > Sarthak Kukreti <[email protected]> wrote:
> >
> > > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <[email protected]> wrote:
> > > >
> > > > We all just need to focus on your proposal and Joe's dm-thin
> > > > reservation design...
> > > >
> > > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > > > have 2 different responses to REQ_OP_PROVISION. The one that is
> > > > captured in your patchset isn't adequate to properly handle ensuring
> > > > upper layers (like XFS) can depend on the space being available across
> > > > snapshot boundaries.]
> > > >
> > > Ack. Would it be premature for the rest of the series to go through
> > > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > > targets)? I'd like to start using this as a reference to suggest
> > > additions to the virtio-spec for virtio-blk support and start looking
> > > at what an ext4 implementation would look like.
> >
> > Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
> > more work to provide the type of guarantee XFS requires across
> > snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> > because it is best to just use dm-thin.
> >
> > And FYI even your dm-thin patch will be the starting point for the
> > dm-thin support (we'll keep attribution to you for all the code in a
> > separate patch).
> >
> > > Fair points, I certainly don't want to derail this conversation; I'd
> > > be happy to see this work merged sooner rather than later.
> >
> > Once those dm target changes are dropped I think the rest of the
> > series is fine to go upstream now. Feel free to post a v8.
>
> FYI, I've made my latest code available in this
> 'dm-6.5-provision-support' branch (based on 'dm-6.5'):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support
>
> It's what v8 should be plus the 2 dm-thin patches (that I don't think
> should go upstream yet, but are theoretically useful for Dave and
> Joe).
>
> The "dm thin: complete interface for REQ_OP_PROVISION support" commit
> establishes all the dm-thin interface I think is needed. The FIXME in
> process_provision_bio() (and the patch header) cautions against upper
> layers like XFS using this dm-thinp support quite yet.
>
> Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
> support initially doesn't provide the guarantee that XFS needs across
> snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).

Just tag it with EXPERIMENTAL on reception of the first
REQ_OP_PROVISION for the device (i.e. dump a log warning), like I'll
end up doing with XFS when it detects provisioning support at mount
time.

We do this all the time to allow merging new features before they
are fully production ready - EXPERIMENTAL means you can expect it to
mostly work, except when it doesn't, and you know that when it
breaks you get to report the bug, help triage it and as a bonus you
get to keep the broken bits!

$ git grep EXPERIMENTAL fs/xfs
fs/xfs/Kconfig: This feature is considered EXPERIMENTAL. Use with caution!
fs/xfs/Kconfig: This feature is considered EXPERIMENTAL. Use with caution!
fs/xfs/scrub/scrub.c: "EXPERIMENTAL online scrub feature in use. Use at your own risk!");
fs/xfs/xfs_fsops.c: "EXPERIMENTAL online shrink feature in use. Use at your own risk!");
fs/xfs/xfs_super.c: xfs_warn(mp, "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
fs/xfs/xfs_super.c: "EXPERIMENTAL Large extent counts feature in use. Use at your own risk!");
fs/xfs/xfs_xattr.c: "EXPERIMENTAL logged extended attributes feature in use. Use at your own risk!");
$

IOWs, I'll be adding a:

"EXPERIMENTAL block device provisioning in use. Use at your own risk!"

warning to XFS, and it won't get removed until both the XFS and
dm-thinp support is solid and production ready....
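
The dm-thinp side of that tagging could be as small as the sketch
below (names invented here; the real flag set lives in dm-thin.c):

/* warn once per pool, on the first REQ_OP_PROVISION it services */
static void warn_provision_experimental(struct pool *pool)
{
	if (!test_and_set_bit(POOL_PROVISION_SEEN, &pool->flags))
		DMWARN("EXPERIMENTAL REQ_OP_PROVISION in use. Use at your own risk!");
}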

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-10-07 01:30:44

by Sarthak Kukreti

[permalink] [raw]
Subject: Re: [PATCH v7 0/5] Introduce provisioning primitives

On Fri, Jun 9, 2023 at 1:31 PM Mike Snitzer <[email protected]> wrote:
>
> On Wed, Jun 07 2023 at 7:27P -0400,
> Mike Snitzer <[email protected]> wrote:
>
> > On Mon, Jun 05 2023 at 5:14P -0400,
> > Sarthak Kukreti <[email protected]> wrote:
> >
> > > On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <[email protected]> wrote:
> > > >
> > > > We all just need to focus on your proposal and Joe's dm-thin
> > > > reservation design...
> > > >
> > > > [Sarthak: FYI, this implies that it doesn't really make sense to add
> > > > dm-thinp support before Joe's design is implemented. Otherwise we'll
> > > > have 2 different responses to REQ_OP_PROVISION. The one that is
> > > > captured in your patchset isn't adequate to properly handle ensuring
> > > > upper layers (like XFS) can depend on the space being available across
> > > > snapshot boundaries.]
> > > >
> > > Ack. Would it be premature for the rest of the series to go through
> > > (REQ_OP_PROVISION + support for loop and non-dm-thinp device-mapper
> > > targets)? I'd like to start using this as a reference to suggest
> > > additions to the virtio-spec for virtio-blk support and start looking
> > > at what an ext4 implementation would look like.
> >
> > Please drop the dm-thin.c and dm-snap.c changes. dm-snap.c would need
> > more work to provide the type of guarantee XFS requires across
> > snapshot boundaries. I'm inclined to _not_ add dm-snap.c support
> > because it is best to just use dm-thin.
> >
> > And FYI even your dm-thin patch will be the starting point for the
> > dm-thin support (we'll keep attribution to you for all the code in a
> > separate patch).
> >
> > > Fair points, I certainly don't want to derail this conversation; I'd
> > > be happy to see this work merged sooner rather than later.
> >
> > Once those dm target changes are dropped I think the rest of the
> > series is fine to go upstream now. Feel free to post a v8.
>
> FYI, I've made my latest code available in this
> 'dm-6.5-provision-support' branch (based on 'dm-6.5'):
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.5-provision-support
>
> It's what v8 should be plus the 2 dm-thin patches (that I don't think
> should go upstream yet, but are theoretically useful for Dave and
> Joe).
>
Cheers! Apologies for dropping the ball on this, I just sent out v8
with the dm-thin patches dropped.


- Sarthak

> The "dm thin: complete interface for REQ_OP_PROVISION support" commit
> establishes all the dm-thin interface I think is needed. The FIXME in
> process_provision_bio() (and the patch header) cautions against upper
> layers like XFS using this dm-thinp support quite yet.
>
> Otherwise we'll have the issue where dm-thinp's REQ_OP_PROVISION
> support initially doesn't provide the guarantee that XFS needs across
> snapshots (which is: snapshots inherit all previous REQ_OP_PROVISION).
>
> Mike
>