Hi,
This is version 7 of the patch series to introduce block-level provisioning primitives [1]. The current series is rebased on: (2d1bcbc6cd70 Merge tag 'probes-fixes-v6.4-rc1'...).
Changelog:
v7:
- Fold up lo_req_provision() into lo_req_fallocate().
- Propagate error on failure to provision from the blkdev_issue_provision().
- Set 'max_provision_granularity' in thin_ctr.
- Fix positioning of 'max_provision_sectors' in pool_ctr.
- Add provision bios into process_prepared_mapping() to prevent the bio from being reissued to the underlying thinpool.
[1] https://lore.kernel.org/lkml/[email protected]/
Sarthak Kukreti (5):
block: Don't invalidate pagecache for invalid falloc modes
block: Introduce provisioning primitives
dm: Add block provisioning support
dm-thin: Add REQ_OP_PROVISION support
loop: Add support for provision requests
block/blk-core.c | 5 +++
block/blk-lib.c | 51 ++++++++++++++++++++++++
block/blk-merge.c | 18 +++++++++
block/blk-settings.c | 19 +++++++++
block/blk-sysfs.c | 9 +++++
block/bounce.c | 1 +
block/fops.c | 31 ++++++++++++---
drivers/block/loop.c | 34 ++++++++++++++--
drivers/md/dm-crypt.c | 4 +-
drivers/md/dm-linear.c | 1 +
drivers/md/dm-snap.c | 7 ++++
drivers/md/dm-table.c | 23 +++++++++++
drivers/md/dm-thin.c | 74 +++++++++++++++++++++++++++++++++--
drivers/md/dm.c | 6 +++
include/linux/bio.h | 6 ++-
include/linux/blk_types.h | 5 ++-
include/linux/blkdev.h | 16 ++++++++
include/linux/device-mapper.h | 17 ++++++++
18 files changed, 310 insertions(+), 17 deletions(-)
--
2.40.1.698.g37aff9b760-goog
On Thu, May 25, 2023 at 07:35:14PM -0700, Sarthak Kukreti wrote:
> On Thu, May 25, 2023 at 6:36 PM Dave Chinner <[email protected]> wrote:
> >
> > On Thu, May 25, 2023 at 03:47:21PM -0700, Sarthak Kukreti wrote:
> > > On Thu, May 25, 2023 at 9:00 AM Mike Snitzer <[email protected]> wrote:
> > > > On Thu, May 25 2023 at 7:39P -0400,
> > > > Dave Chinner <[email protected]> wrote:
> > > > > On Wed, May 24, 2023 at 04:02:49PM -0400, Mike Snitzer wrote:
> > > > > > On Tue, May 23 2023 at 8:40P -0400,
> > > > > > Dave Chinner <[email protected]> wrote:
> > > > > > > It's worth noting that XFS already has a coarse-grained
> > > > > > > implementation of preferred regions for metadata storage. It will
> > > > > > > currently not use those metadata-preferred regions for user data
> > > > > > > unless all the remaining user data space is full. Hence I'm pretty
> > > > > > > > sure that a pre-provisioning enhancement like this can be done
> > > > > > > entirely in-memory without requiring any new on-disk state to be
> > > > > > > added.
> > > > > > >
> > > > > > > > Sure, if we crash and remount, then we might choose a different LBA
> > > > > > > region for pre-provisioning. But that's not really a huge deal as we
> > > > > > > could also run an internal background post-mount fstrim operation to
> > > > > > > remove any unused pre-provisioning that was left over from when the
> > > > > > > system went down.
> > > > > >
> > > > > > This would be the FITRIM with extension you mention below? Which is a
> > > > > > filesystem interface detail?
> > > > >
> > > > > No. We might reuse some of the internal infrastructure we use to
> > > > > implement FITRIM, but that's about it. It's just something kinda
> > > > > like FITRIM but with different constraints determined by the
> > > > > filesystem rather than the user...
> > > > >
> > > > > > As it is, I'm not sure we'd even need it - a periodic userspace
> > > > > > FITRIM would achieve the same result, so leaked provisioned spaces
> > > > > would get cleaned up eventually without the filesystem having to do
> > > > > anything specific...
> > > > >
> > > > > > So dm-thinp would _not_ need to have new
> > > > > > > state that tracks "provisioned but unused" blocks?
> > > > >
> > > > > No idea - that's your domain. :)
> > > > >
> > > > > dm-snapshot, for certain, will need to track provisioned regions
> > > > > because it has to guarantee that overwrites to provisioned space in
> > > > > the origin device will always succeed. Hence it needs to know how
> > > > > much space breaking sharing in provisioned regions after a snapshot
> > > > > > has been taken will be required...
> > > >
> > > > dm-thinp offers its own much more scalable snapshot support (doesn't
> > > > use old dm-snapshot N-way copyout target).
> > > >
> > > > dm-snapshot isn't going to be modified to support this level of
> > > > hardening (dm-snapshot is basically in "maintenance only" now).
> >
> > Ah, of course. Sorry for the confusion, I was kinda using
> > dm-snapshot as shorthand for "dm-thinp + snapshots".
> >
> > > > But I understand your meaning: what you said is 100% applicable to
> > > > dm-thinp's snapshot implementation and needs to be accounted for in
> > > > thinp's metadata (inherent 'provisioned' flag).
> >
> > *nod*
> >
> > > A bit orthogonal: would dm-thinp need to differentiate between
> > > user-triggered provision requests (eg. from fallocate()) vs
> > > fs-triggered requests?
> >
> > Why? How is the guarantee the block device has to provide to
> > provisioned areas different for user vs filesystem internal
> > provisioned space?
> >
> After thinking this through, I stand corrected. I was primarily
> concerned with how this would balloon thin snapshot sizes if users
> potentially provision a large chunk of the filesystem but that's
> putting the cart way before the horse.
>
I think that's a legitimate concern. At some point to provide full
-ENOSPC protection the filesystem needs to provision space before it
writes to it, whether it be data or metadata, right? At what point does
that turn into a case where pretty much everything the fs wrote is
provisioned, and therefore a snapshot is just a full copy operation?
That might be Ok I guess, but if that's an eventuality then what's the
need to track provision state at dm-thin block level? Using some kind of
flag you mention below could be a good way to qualify which blocks you'd
want to copy vs. which to share on snapshot and perhaps mitigate that
problem.
> Best
> Sarthak
>
> > > I would lean towards user provisioned areas not
> > > getting dedup'd on snapshot creation,
> >
> > <twitch>
> >
> > Snapshotting is a clone operation, not a dedupe operation.
> >
> > Yes, the end result of both is that you have a block shared between
> > multiple indexes that needs COW on the next overwrite, but the two
> > operations that get to that point are very different...
> >
> > </pedantic mode disengaged>
> >
> > > but that would entail tracking
> > > the state of the original request and possibly a provision request
> > > flag (REQ_PROVISION_DEDUP_ON_SNAPSHOT) or an inverse flag
> > > (REQ_PROVISION_NODEDUP). Possibly too convoluted...
> >
> > Let's not try to add everyone's favourite pony to this interface
> > before we've even got it off the ground.
> >
> > It's the simple precision of the API, the lack of cross-layer
> > communication requirements and the ability to implement and optimise
> > the independent layers independently that makes this a very
> > appealing solution.
> >
> > We need to start with getting the simple stuff working and prove the
> > concept. Then once we can observe the behaviour of a working system
> > we can start working on optimising individual layers for efficiency
> > and performance....
> >
I think proving the concept may not necessarily require changes to
dm-thin at all. If you want to guarantee preexisting metadata block
writeability, just scan through and provision all metadata blocks at
mount time. Hit the log, AG bufs, IIRC XFS already has btree walking
code that can be used for btrees and associated metadata, etc. Maybe
online scrub has something even better to hook into temporarily for this
sort of thing?
Mount performance would obviously be bad, but that doesn't matter for
the purposes of a prototype. The goal should really be that once
mounted, you have established expected writeability invariants and have
the ability to test for reliable prevention of -ENOSPC errors from
dm-thin from that point forward. If that ultimately works, then refine
the ideal implementation from there and ask dm to do whatever
writeability tracking and whatnot.
FWIW, that may also help deal with things like the fact that xfs_repair
can basically relocate the entire set of filesystem metadata to
completely different ranges of free space, completely breaking any
writeability guarantees tracked by previous provisions of those ranges.
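To make the invariant above concrete, here is a minimal userspace sketch, not dm-thin code. It models a toy thin pool in which a provision request pins backing blocks up front, then checks that overwrites of the provisioned "metadata" range still succeed after ordinary data writes have exhausted the pool. All names (toy_thin, toy_provision, toy_write, the block counts) are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BLOCKS 8   /* physical blocks backing the toy pool (assumed) */
#define VIRT_BLOCKS 32  /* logical blocks exposed by the toy thin volume */

struct toy_thin {
	bool mapped[VIRT_BLOCKS];  /* logical block has backing space */
	size_t free_blocks;        /* unallocated physical blocks */
};

static void toy_init(struct toy_thin *t)
{
	for (size_t i = 0; i < VIRT_BLOCKS; i++)
		t->mapped[i] = false;
	t->free_blocks = POOL_BLOCKS;
}

/* Model of a provision request: pin space for [start, start + count). */
static int toy_provision(struct toy_thin *t, size_t start, size_t count)
{
	size_t needed = 0;

	if (start > VIRT_BLOCKS || count > VIRT_BLOCKS - start)
		return -EINVAL;
	for (size_t i = start; i < start + count; i++)
		if (!t->mapped[i])
			needed++;
	if (needed > t->free_blocks)
		return -ENOSPC;  /* all or nothing: no partial provisioning */
	for (size_t i = start; i < start + count; i++)
		t->mapped[i] = true;
	t->free_blocks -= needed;
	return 0;
}

/* Model of a write: unmapped blocks are allocated on demand. */
static int toy_write(struct toy_thin *t, size_t block)
{
	assert(t->free_blocks <= POOL_BLOCKS);
	if (block >= VIRT_BLOCKS)
		return -EINVAL;
	if (t->mapped[block])
		return 0;  /* already backed: cannot hit ENOSPC */
	if (t->free_blocks == 0)
		return -ENOSPC;
	t->mapped[block] = true;
	t->free_blocks--;
	return 0;
}

/*
 * Prototype-style check: provision "metadata" blocks 0..3 at "mount",
 * fill the rest of the pool with data writes, then confirm that every
 * metadata overwrite still succeeds. Returns 0 if the invariant holds.
 */
static int toy_check_invariant(void)
{
	struct toy_thin t;
	size_t b = 4;

	toy_init(&t);
	if (toy_provision(&t, 0, 4) != 0)
		return -1;
	while (toy_write(&t, b) == 0)  /* stops at the first failed write */
		b++;
	for (size_t i = 0; i < 4; i++)
		if (toy_write(&t, i) != 0)
			return -1;
	return 0;
}
```

The model deliberately ignores snapshots; with sharing in the picture, the provisioned range would also need space reserved for breaking sharing, which is the cost discussed earlier in this mail.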
Brian
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]
>
On Mon, Jun 05, 2023 at 02:14:44PM -0700, Sarthak Kukreti wrote:
> On Sat, Jun 3, 2023 at 8:57 AM Mike Snitzer <[email protected]> wrote:
> > On Fri, Jun 02 2023 at 8:52P -0400,
> > Dave Chinner <[email protected]> wrote:
> > > On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> > > > > The only way to distinguish the caller (between on-behalf of user data
> > > > > vs XFS metadata) would be REQ_META?
> > > > >
> > > > > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > > > > all REQ_OP_PROVISION the same?
> > > > >
> > > > I'm in favor of a REQ_META-based distinction.
> > >
> > > Why? What *requirement* is driving the need for this distinction?
> >
> > Think I answered that above, XFS delalloc accounting parity on thinp.
> >
> I actually had a few different use-cases in mind (apart from the user
> data provisioning 'fear' that you pointed out): in essence, there are
> cases where userspace would benefit from having more control over how
> much space a snapshot takes:
>
> 1) In the original RFC patchset [1], I alluded to this being a
> mechanism for pre-allocating space for preserving space for thin
> logical volumes. The use-case I'd like to explore is delta updatable
> read-only filesystems similar to systemd system extensions [2]: In
> essence:
> a) Preserve space for a 'base' thin logical volume that will contain a
> read-only filesystem on over-the-air installation: for filesystems
> like squashfs and erofs, pretty much the entire image is a compressed
> file that I'd like to reserve space for before installation.
> b) Before update, create a thin snapshot and preserve enough space to
> ensure that a delta update will succeed (eg. block level diff of the
> base image). Then, the update is guaranteed to have disk space to
> succeed (similar to the A-B update guarantees on ChromeOS). On
> success, we merge the snapshot and reserve an update snapshot for the
> next possible update. On failure, we drop the snapshot.
Sounds very similar to the functionality blksnap is supposed to
provide....
https://lore.kernel.org/linux-fsdevel/[email protected]/
> 2) The other idea I wanted to explore was rollback protection for
> stateful filesystem features: in essence, if an update from kernel 4.x
> to 5.y failed very quickly (due to unrelated reasons) and we enabled
> some stateful filesystem features that are only supported on 5.y, we'd
> be able to rollback to 4.x if we used short-lived snapshots (in the
> ChromiumOS world, the lifetime of these snapshots would be < 10s per
> boot).
Not sure that blksnap has a "roll origin back to read-only snapshot"
feature yet, but that's what you'd need for this. i.e. on success,
drop the snapshot. On failure, "roll origin back to snapshot and
reboot".
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jun 07, 2023 at 10:03:40PM -0400, Martin K. Petersen wrote:
>
> Dave,
>
> > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but that's
> > what I intended - the operation does not contain data at all. It's an
> > operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it contains a
> > range of sectors that need to be provisioned (or discarded), and
> > nothing else.
>
> Yep. That's also how SCSI defines it. The act of provisioning a block
> range is done through an UNMAP command using a special flag. All it does
> is pin down those LBAs so future writes to them won't result in ENOSPC.
*nod*
That I knew, and it's one of the reasons I'd like the filesystem <->
block layer provisioning model to head in this direction. i.e. we
don't have to do anything special to enable routing of provisioning
requests to hardware and/or remote block storage devices (e.g.
ceph-rbd, nbd, etc). Hence "external" devices can provide the same
guarantees as native software-only block device implementations
like dm-thinp can provide and everything gets just that little bit
better behaved...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Jun 07, 2023 at 07:50:25PM -0400, Mike Snitzer wrote:
> Do you think you're OK to scope out, and/or implement, the XFS changes
> if you use v7 of this patchset as the starting point? (v8 should just
> be v7 minus the dm-thin.c and dm-snap.c changes). The thinp
> support in v7 will work enough to allow XFS to issue REQ_OP_PROVISION
> and/or fallocate (via mkfs.xfs) to dm-thin devices.
Yup, XFS only needs blkdev_issue_provision() and
bdev_max_provision_sectors() to be present. The
initial XFS provisioning detection and fallocate() support is just
under 50 lines of new code...
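For illustration only, the userspace side of the fallocate() path mentioned above might look roughly like the sketch below. It assumes that a plain mode-0 fallocate() on the block device is what gets turned into a provision request, which this thread does not spell out. provision_range() is a hypothetical helper name and is not part of the patchset.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to back [offset, offset + len) of fd with real space.
 * Returns 0 on success or a negative errno value on failure.
 * Mode 0 is an assumption about how the series exposes provisioning.
 */
static int provision_range(int fd, off_t offset, off_t len)
{
	/* A negative offset or non-positive length is a caller bug. */
	assert(offset >= 0 && len > 0);

	if (fallocate(fd, 0, offset, len) == -1)
		return -errno;
	return 0;
}
```

A caller would open the thin device read-write (for example a hypothetical /dev/mapper/thin-vol) and call provision_range(fd, 0, len). On a kernel or device without provisioning support the call should fail with an error, most likely -EOPNOTSUPP, rather than silently succeed.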
Cheers,
Dave.
--
Dave Chinner
[email protected]