2021-03-04 10:35:23

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v6 0/4] block-layer interposer

Hi all.

I'm joyful to suggest the block-layer interposer (blk_interposer) v6.
blk_interposer allows to intercept bio requests, remap bio to another
devices or add new bios.

This series of patches adds the ability to use blk_interposer for
any dm-target using the DM_INTERPOSED_FLAG flag.

The first patch adds the function blk_mq_is_queue_frozen(). It allows to
assert a queue state.

The second patch is dedicated to blk_interposer itself, which provides
the ability to intercept bio.

The third one adds support for blk_interposer from the device mapper.

The fourth patch adds the DM_INTERPOSED_FLAG flag. When this flag is
applied with the ioctl DM_TABLE_LOAD_CMD, the underlying devices are
opened without the FMODE_EXCL flag and connected via blk_interposer.

Changes:
v6 - current patch set
* designed for 5.12;
* thanks to the new design of the bio structure in v5.12, it is
possible to perform interception not for the entire disk, but
for each block device;
* instead of the new ioctl DM_DEV_REMAP_CMD and the 'noexcl' option,
the DM_INTERPOSED_FLAG flag for the ioctl DM_TABLE_LOAD_CMD is
applied.

v5 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
* rebase for v5.11-rc7;
* patch set organization;
* fix defects in documentation;
* add some comments;
* change mutex names for better code readability;
* remove calling bd_unlink_disk_holder() for targets with non-exclusive
flag;
* change type for struct dm_remap_param from uint8_t to __u8.

v4 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
Mostly changes were made, due to Damien's comments:
* on the design of the code;
* by the patch set organization;
* bug with passing a wrong parameter to dm_get_device();
* description of the 'noexcl' parameter in the linear.rst.
Also added remap_and_filter.rst.

v3 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
In this version, I already suggested blk_interposer to apply to dm-linear.
Problems were solved:
* Interception of bio requests from a specific device on the disk, not
from the entire disk. To do this, we added the dm_interposed_dev
structure and an interval tree to store these structures.
* Implemented ioctl DM_DEV_REMAP_CMD. A patch with changes in the lvm2
project was sent to the team [email protected].
* Added the 'noexcl' option for dm-linear, which allows you to open
the underlying block-device without FMODE_EXCL mode.

v2 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
I tried to suggest blk_interposer without using it in device mapper,
but with the addition of a sample of its use. It was then that I learned
about the maintainers' attitudes towards the samples directory :).

v1 - https://lwn.net/ml/linux-block/[email protected]/
This Hannes's patch can be considered as a starting point, since this is
where the interception mechanism and the term blk_interposer itself
appeared. It became clear that blk_interposer can be useful for
device mapper.

before v1 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
I tried to offer a rather cumbersome blk-filter and a monster-like
blk-snap module for creating snapshots.

Thank you to everyone who was able to take the time to review
the previous versions.
I hope that this time I achieved the required quality.

Thanks,
Sergei.

Sergei Shtepa (4):
block: add blk_mq_is_queue_frozen()
block: add blk_interposer
dm: introduce dm-interposer
dm: add DM_INTERPOSED_FLAG

block/bio.c | 2 +
block/blk-core.c | 36 +++++
block/blk-mq.c | 12 ++
block/genhd.c | 93 ++++++++++++
drivers/md/Makefile | 2 +-
drivers/md/dm-core.h | 6 +
drivers/md/dm-interposer.c | 258 ++++++++++++++++++++++++++++++++++
drivers/md/dm-interposer.h | 40 ++++++
drivers/md/dm-ioctl.c | 9 ++
drivers/md/dm-table.c | 115 +++++++++++++--
drivers/md/dm.c | 38 +++--
include/linux/blk-mq.h | 1 +
include/linux/blk_types.h | 4 +
include/linux/blkdev.h | 17 +++
include/linux/device-mapper.h | 1 +
include/uapi/linux/dm-ioctl.h | 6 +
16 files changed, 618 insertions(+), 22 deletions(-)
create mode 100644 drivers/md/dm-interposer.c
create mode 100644 drivers/md/dm-interposer.h

--
2.20.1


2021-03-04 10:35:54

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v6 3/4] dm: introduce dm-interposer

dm-interposer.c/. h contains code for working with blk_interposer
and provides an API for interposer in device-mapper.

Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/md/Makefile | 2 +-
drivers/md/dm-interposer.c | 258 +++++++++++++++++++++++++++++++++++++
drivers/md/dm-interposer.h | 40 ++++++
3 files changed, 299 insertions(+), 1 deletion(-)
create mode 100644 drivers/md/dm-interposer.c
create mode 100644 drivers/md/dm-interposer.h

diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index ef7ddc27685c..bd5b38bee82e 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -5,7 +5,7 @@

dm-mod-y += dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o \
- dm-rq.o
+ dm-rq.o dm-interposer.o
dm-multipath-y += dm-path-selector.o dm-mpath.o
dm-historical-service-time-y += dm-ps-historical-service-time.o
dm-io-affinity-y += dm-ps-io-affinity.o
diff --git a/drivers/md/dm-interposer.c b/drivers/md/dm-interposer.c
new file mode 100644
index 000000000000..e5346db81def
--- /dev/null
+++ b/drivers/md/dm-interposer.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bio.h>
+#include <linux/rwsem.h>
+#include <linux/refcount.h>
+#include <linux/device-mapper.h>
+#include <linux/interval_tree_generic.h>
+
+#include "dm-core.h"
+#include "dm-interposer.h"
+
+#define DM_MSG_PREFIX "interposer"
+
+struct dm_interposer {
+ struct bdev_interposer blk_ip;
+
+ struct kref kref;
+ struct rw_semaphore ip_devs_lock;
+ struct rb_root_cached ip_devs_root; /* dm_interposed_dev tree, since there can be multiple
+ * interceptors for different ranges for a single
+ * block device. */
+};
+
+/*
+ * Interval tree for device mapper
+ */
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->last)
+INTERVAL_TREE_DEFINE(struct dm_rb_range, node, sector_t, _subtree_last,
+ START, LAST,, dm_rb);
+
+static DEFINE_MUTEX(dm_interposer_attach_lock);
+
+static void dm_submit_bio_interposer_fn(struct bio *bio)
+{
+ struct dm_interposer *ip;
+ unsigned int noio_flag = 0;
+ sector_t start;
+ sector_t last;
+ struct dm_rb_range *node;
+
+ ip = container_of(bio->bi_bdev->bd_interposer, struct dm_interposer, blk_ip);
+
+ start = bio->bi_iter.bi_sector;
+ if (bio_flagged(bio, BIO_REMAPPED))
+ start -= get_start_sect(bio->bi_bdev);
+ last = start + dm_sector_div_up(bio->bi_iter.bi_size, SECTOR_SIZE);
+
+ noio_flag = memalloc_noio_save();
+ down_read(&ip->ip_devs_lock);
+ node = dm_rb_iter_first(&ip->ip_devs_root, start, last);
+ while (node) {
+ struct dm_interposed_dev *ip_dev =
+ container_of(node, struct dm_interposed_dev, node);
+
+ atomic64_inc(&ip_dev->ip_cnt);
+ ip_dev->dm_interpose_bio(ip_dev, bio);
+
+ node = dm_rb_iter_next(node, start, last);
+ }
+ up_read(&ip->ip_devs_lock);
+ memalloc_noio_restore(noio_flag);
+}
+
+void dm_interposer_free(struct kref *kref)
+{
+ struct dm_interposer *ip = container_of(kref, struct dm_interposer, kref);
+
+ bdev_interposer_detach(&ip->blk_ip, dm_submit_bio_interposer_fn);
+
+ kfree(ip);
+}
+
+struct dm_interposer *dm_interposer_new(struct block_device *bdev)
+{
+ int ret = 0;
+ struct dm_interposer *ip;
+
+ ip = kzalloc(sizeof(struct dm_interposer), GFP_NOIO);
+ if (!ip)
+ return ERR_PTR(-ENOMEM);
+
+ kref_init(&ip->kref);
+ init_rwsem(&ip->ip_devs_lock);
+ ip->ip_devs_root = RB_ROOT_CACHED;
+
+ ret = bdev_interposer_attach(bdev, &ip->blk_ip, dm_submit_bio_interposer_fn);
+ if (ret) {
+ DMERR("Failed to attach bdev_interposer.");
+ kref_put(&ip->kref, dm_interposer_free);
+ return ERR_PTR(ret);
+ }
+
+ return ip;
+}
+
+static struct dm_interposer *dm_interposer_get(struct block_device *bdev)
+{
+ struct dm_interposer *ip;
+
+ if (!bdev_has_interposer(bdev))
+ return NULL;
+
+ if (bdev->bd_interposer->ip_submit_bio != dm_submit_bio_interposer_fn) {
+ DMERR("Block devices interposer slot already occupied.");
+ return ERR_PTR(-EBUSY);
+ }
+
+ ip = container_of(bdev->bd_interposer, struct dm_interposer, blk_ip);
+
+ kref_get(&ip->kref);
+ return ip;
+}
+
+static inline void dm_disk_freeze(struct gendisk *disk)
+{
+ blk_mq_freeze_queue(disk->queue);
+ blk_mq_quiesce_queue(disk->queue);
+}
+
+static inline void dm_disk_unfreeze(struct gendisk *disk)
+{
+ blk_mq_unquiesce_queue(disk->queue);
+ blk_mq_unfreeze_queue(disk->queue);
+}
+
+/**
+ * dm_interposer_dev_init - initialize interposed device
+ * @ip_dev: interposed device
+ * @ofs: offset from the beginning of the block device
+ * @len: the length of the part of the block device to which requests will be interposed
+ * @private: user purpose parameter
+ * @interpose_fn: interposing callback
+ *
+ * Initialize structure dm_interposed_dev.
+ * For interposing part of block device set ofs and len.
+ * For interposing whole device set ofs=0 and len=0.
+ */
+void dm_interposer_dev_init(struct dm_interposed_dev *ip_dev,
+ sector_t ofs, sector_t len,
+ void *private, dm_interpose_bio_t interpose_fn)
+{
+ ip_dev->node.start = ofs;
+ ip_dev->node.last = ofs + len - 1;
+ ip_dev->dm_interpose_bio = interpose_fn;
+ ip_dev->private = private;
+
+ atomic64_set(&ip_dev->ip_cnt, 0);
+}
+
+/**
+ * dm_interposer_dev_attach - attach interposed device to his block device
+ * @bdev: block device
+ * @ip_dev: interposed device
+ *
+ * Return error code.
+ */
+int dm_interposer_dev_attach(struct block_device *bdev, struct dm_interposed_dev *ip_dev)
+{
+ int ret = 0;
+ struct dm_interposer *ip = NULL;
+ unsigned int noio_flag = 0;
+
+ if (!ip_dev)
+ return -EINVAL;
+
+ dm_disk_freeze(bdev->bd_disk);
+ mutex_lock(&dm_interposer_attach_lock);
+ noio_flag = memalloc_noio_save();
+
+ ip = dm_interposer_get(bdev);
+ if (ip == NULL)
+ ip = dm_interposer_new(bdev);
+ if (IS_ERR(ip)) {
+ ret = PTR_ERR(ip);
+ goto out;
+ }
+
+ /* Attach dm_interposed_dev to dm_interposer */
+ down_write(&ip->ip_devs_lock);
+ do {
+ struct dm_rb_range *node;
+
+ /* checking that ip_dev already exists for this region */
+ node = dm_rb_iter_first(&ip->ip_devs_root, ip_dev->node.start, ip_dev->node.last);
+ if (node) {
+ DMERR("Block device in region [%llu,%llu] already have interposer.",
+ node->start, node->last);
+
+ ret = -EBUSY;
+ break;
+ }
+
+ /* insert ip_dev to ip tree */
+ dm_rb_insert(&ip_dev->node, &ip->ip_devs_root);
+ /* increment ip reference counter */
+ kref_get(&ip->kref);
+ } while (false);
+ up_write(&ip->ip_devs_lock);
+
+ kref_put(&ip->kref, dm_interposer_free);
+
+out:
+ memalloc_noio_restore(noio_flag);
+ mutex_unlock(&dm_interposer_attach_lock);
+ dm_disk_unfreeze(bdev->bd_disk);
+
+ return ret;
+}
+
+/**
+ * dm_interposer_detach_dev - detach interposed device from his block device
+ * @bdev: block device
+ * @ip_dev: interposed device
+ *
+ * Return error code.
+ */
+int dm_interposer_detach_dev(struct block_device *bdev, struct dm_interposed_dev *ip_dev)
+{
+ int ret = 0;
+ struct dm_interposer *ip = NULL;
+ unsigned int noio_flag = 0;
+
+ if (!ip_dev)
+ return -EINVAL;
+
+ dm_disk_freeze(bdev->bd_disk);
+ mutex_lock(&dm_interposer_attach_lock);
+ noio_flag = memalloc_noio_save();
+
+ ip = dm_interposer_get(bdev);
+ if (IS_ERR(ip)) {
+ ret = PTR_ERR(ip);
+ DMERR("Interposer not found.");
+ goto out;
+ }
+ if (unlikely(ip == NULL)) {
+ ret = -ENXIO;
+ DMERR("Interposer not found.");
+ goto out;
+ }
+
+ down_write(&ip->ip_devs_lock);
+ {
+ dm_rb_remove(&ip_dev->node, &ip->ip_devs_root);
+ /* the reference counter here cannot be zero */
+ kref_put(&ip->kref, dm_interposer_free);
+ }
+ up_write(&ip->ip_devs_lock);
+
+ /* detach and free interposer if it's not needed */
+ kref_put(&ip->kref, dm_interposer_free);
+out:
+ memalloc_noio_restore(noio_flag);
+ mutex_unlock(&dm_interposer_attach_lock);
+ dm_disk_unfreeze(bdev->bd_disk);
+
+ return ret;
+}
diff --git a/drivers/md/dm-interposer.h b/drivers/md/dm-interposer.h
new file mode 100644
index 000000000000..17a5411f6f00
--- /dev/null
+++ b/drivers/md/dm-interposer.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Device mapper's interposer.
+ */
+
+#include <linux/rbtree.h>
+
+struct dm_rb_range {
+ struct rb_node node;
+ sector_t start; /* start sector of rb node */
+ sector_t last; /* end sector of rb node */
+ sector_t _subtree_last; /* highest sector in subtree of rb node */
+};
+
+typedef void (*dm_interpose_bio_t) (struct dm_interposed_dev *ip_dev, struct bio *bio);
+
+struct dm_interposed_dev {
+ struct dm_rb_range node;
+ void *private;
+ dm_interpose_bio_t dm_interpose_bio;
+
+ atomic64_t ip_cnt; /* for debug purpose only */
+};
+
+/*
+ * Initialize structure dm_interposed_dev.
+ * For interposing part of block device set ofs and len.
+ * For interposing whole device set ofs=0 and len=0.
+ */
+void dm_interposer_dev_init(struct dm_interposed_dev *ip_dev,
+ sector_t ofs, sector_t len,
+ void *private, dm_interpose_bio_t interpose_fn);
+/*
+ * Attach interposer to his block device.
+ */
+int dm_interposer_dev_attach(struct block_device *bdev, struct dm_interposed_dev *ip_dev);
+/*
+ * Detach interposer from his block device.
+ */
+int dm_interposer_detach_dev(struct block_device *bdev, struct dm_interposed_dev *ip_dev);
--
2.20.1

2021-03-04 10:38:02

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v6 2/4] block: add blk_interposer

blk_interposer allows to intercept bio requests, remap bio
to another devices or add new bios.

Signed-off-by: Sergei Shtepa <[email protected]>
---
block/bio.c | 2 +
block/blk-core.c | 36 +++++++++++++++
block/genhd.c | 93 +++++++++++++++++++++++++++++++++++++++
include/linux/blk_types.h | 4 ++
include/linux/blkdev.h | 17 +++++++
5 files changed, 152 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index a1c4d2900c7a..0bfbf06475ee 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio_set_flag(bio, BIO_THROTTLED);
if (bio_flagged(bio_src, BIO_REMAPPED))
bio_set_flag(bio, BIO_REMAPPED);
+ if (bio_flagged(bio_src, BIO_INTERPOSED))
+ bio_set_flag(bio, BIO_INTERPOSED);
bio->bi_opf = bio_src->bi_opf;
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_write_hint = bio_src->bi_write_hint;
diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..e749507cadd3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1018,6 +1018,34 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
return ret;
}

+static blk_qc_t __submit_bio_interposed(struct bio *bio)
+{
+ struct bio_list bio_list[2] = { };
+ blk_qc_t ret = BLK_QC_T_NONE;
+
+ current->bio_list = bio_list;
+ if (likely(bio_queue_enter(bio) == 0)) {
+ struct block_device *bdev = bio->bi_bdev;
+
+ if (likely(bdev_has_interposer(bdev))) {
+ bio_set_flag(bio, BIO_INTERPOSED);
+ bdev->bd_interposer->ip_submit_bio(bio);
+ } else {
+ /* interposer was removed */
+ bio_list_add(&current->bio_list[0], bio);
+ }
+
+ blk_queue_exit(bdev->bd_disk->queue);
+ }
+ current->bio_list = NULL;
+
+ /* Resubmit remaining bios */
+ while ((bio = bio_list_pop(&bio_list[0])))
+ ret = submit_bio_noacct(bio);
+
+ return ret;
+}
+
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
* @bio: The bio describing the location in memory and on the device.
@@ -1043,6 +1071,14 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
return BLK_QC_T_NONE;
}

+ /*
+ * Checking the BIO_INTERPOSED flag is necessary so that the bio
+ * created by the bdev_interposer do not get to it for processing.
+ */
+ if (bdev_has_interposer(bio->bi_bdev) &&
+ !bio_flagged(bio, BIO_INTERPOSED))
+ return __submit_bio_interposed(bio);
+
if (!bio->bi_bdev->bd_disk->fops->submit_bio)
return __submit_bio_noacct_mq(bio);
return __submit_bio_noacct(bio);
diff --git a/block/genhd.c b/block/genhd.c
index fcc530164b5a..1ae8516643c8 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -30,6 +30,11 @@
static struct kobject *block_depr;

DECLARE_RWSEM(bdev_lookup_sem);
+/*
+ * Prevents different block-layer interposers from attaching or detaching
+ * to the block device at the same time.
+ */
+DEFINE_MUTEX(bdev_interposer_attach_lock);

/* for extended dynamic devt allocation, currently only one major is used */
#define NR_EXT_DEVT (1 << MINORBITS)
@@ -1941,3 +1946,91 @@ static void disk_release_events(struct gendisk *disk)
WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
kfree(disk->ev);
}
+
+/**
+ * bdev_interposer_attach - Attach interposer to block device
+ * @bdev: target block device
+ * @interposer: block device interposer
+ * @ip_submit_bio: hook for submit_bio()
+ *
+ * Returns:
+ * -EINVAL if @interposer is NULL.
+ * -EPERM if queue is not frozen.
+ * -EBUSY if the block device already has @interposer.
+ * -EALREADY if the block device already has @interposer with same callback.
+ * -ENODEV if the block device cannot be referenced.
+ *
+ * Disk must be frozen by blk_mq_freeze_queue().
+ */
+int bdev_interposer_attach(struct block_device *bdev, struct bdev_interposer *interposer,
+ const ip_submit_bio_t ip_submit_bio)
+{
+ int ret = 0;
+
+ if (WARN_ON(!interposer))
+ return -EINVAL;
+
+ if (!blk_mq_is_queue_frozen(bdev->bd_disk->queue))
+ return -EPERM;
+
+ mutex_lock(&bdev_interposer_attach_lock);
+ if (bdev_has_interposer(bdev)) {
+ if (bdev->bd_interposer->ip_submit_bio == ip_submit_bio)
+ ret = -EALREADY;
+ else
+ ret = -EBUSY;
+ goto out;
+ }
+
+ interposer->ip_submit_bio = ip_submit_bio;
+
+ interposer->bdev = bdgrab(bdev);
+ if (!interposer->bdev) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ bdev->bd_interposer = interposer;
+out:
+ mutex_unlock(&bdev_interposer_attach_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(bdev_interposer_attach);
+
+/**
+ * bdev_interposer_detach - Detach interposer from block device
+ * @interposer: block device interposer
+ * @ip_submit_bio: hook for submit_bio()
+ *
+ * Disk must be frozen by blk_mq_freeze_queue().
+ */
+void bdev_interposer_detach(struct bdev_interposer *interposer,
+ const ip_submit_bio_t ip_submit_bio)
+{
+ struct block_device *bdev;
+
+ if (WARN_ON(!interposer))
+ return;
+
+ mutex_lock(&bdev_interposer_attach_lock);
+
+ /* Check if the interposer is still active. */
+ bdev = interposer->bdev;
+ if (WARN_ON(!bdev))
+ goto out;
+
+ if (WARN_ON(!blk_mq_is_queue_frozen(bdev->bd_disk->queue)))
+ goto out;
+
+ /* Check if it is really our interposer. */
+ if (WARN_ON(bdev->bd_interposer->ip_submit_bio != ip_submit_bio))
+ goto out;
+
+ bdev->bd_interposer = NULL;
+ interposer->bdev = NULL;
+ bdput(bdev);
+out:
+ mutex_unlock(&bdev_interposer_attach_lock);
+}
+EXPORT_SYMBOL_GPL(bdev_interposer_detach);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..2b43f65bb356 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -19,6 +19,7 @@ struct io_context;
struct cgroup_subsys_state;
typedef void (bio_end_io_t) (struct bio *);
struct bio_crypt_ctx;
+struct bdev_interposer;

struct block_device {
sector_t bd_start_sect;
@@ -46,6 +47,7 @@ struct block_device {
spinlock_t bd_size_lock; /* for bd_inode->i_size updates */
struct gendisk * bd_disk;
struct backing_dev_info *bd_bdi;
+ struct bdev_interposer * bd_interposer;

/* The counter of freeze processes */
int bd_fsfreeze_count;
@@ -304,6 +306,8 @@ enum {
BIO_CGROUP_ACCT, /* has been accounted to a cgroup */
BIO_TRACKED, /* set if bio goes through the rq_qos path */
BIO_REMAPPED,
+ BIO_INTERPOSED, /* bio has been interposed and can be moved to
+ * a different block device */
BIO_FLAG_LAST
};

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c032cfe133c7..82f8515fa3c8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -2033,4 +2033,21 @@ int fsync_bdev(struct block_device *bdev);
int freeze_bdev(struct block_device *bdev);
int thaw_bdev(struct block_device *bdev);

+/*
+ * block layer interposers structure and functions
+ */
+typedef void (*ip_submit_bio_t) (struct bio *bio);
+
+struct bdev_interposer {
+ ip_submit_bio_t ip_submit_bio;
+ struct block_device *bdev;
+};
+
+#define bdev_has_interposer(bd) ((bd)->bd_interposer != NULL)
+
+int bdev_interposer_attach(struct block_device *bdev, struct bdev_interposer *interposer,
+ const ip_submit_bio_t ip_submit_bio);
+void bdev_interposer_detach(struct bdev_interposer *interposer,
+ const ip_submit_bio_t ip_submit_bio);
+
#endif /* _LINUX_BLKDEV_H */
--
2.20.1

2021-03-04 23:19:38

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v6 1/4] block: add blk_mq_is_queue_frozen()

blk_mq_is_queue_frozen() allow to assert that the queue is frozen.

Signed-off-by: Sergei Shtepa <[email protected]>
---
block/blk-mq.c | 12 ++++++++++++
include/linux/blk-mq.h | 1 +
2 files changed, 13 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..d5e7122789fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -161,6 +161,18 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
}
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);

+bool blk_mq_is_queue_frozen(struct request_queue *q)
+{
+ bool ret;
+
+ mutex_lock(&q->mq_freeze_lock);
+ ret = percpu_ref_is_dying(&q->q_usage_counter) && percpu_ref_is_zero(&q->q_usage_counter);
+ mutex_unlock(&q->mq_freeze_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blk_mq_is_queue_frozen);
+
/*
* Guarantee no request is in use, so we can change any data structure of
* the queue afterward.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..6f01971abf7b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -533,6 +533,7 @@ void blk_freeze_queue_start(struct request_queue *q);
void blk_mq_freeze_queue_wait(struct request_queue *q);
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
unsigned long timeout);
+bool blk_mq_is_queue_frozen(struct request_queue *q);

int blk_mq_map_queues(struct blk_mq_queue_map *qmap);
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
--
2.20.1

2021-03-04 23:19:55

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v6 4/4] dm: add DM_INTERPOSED_FLAG

DM_INTERPOSED_FLAG allow to create dm targets on "the fly".
Underlying block device opens without a flag FMODE_EXCL.
Dm target receives bio from the original device via
blk_interposer.

Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/md/dm-core.h | 6 ++
drivers/md/dm-ioctl.c | 9 +++
drivers/md/dm-table.c | 115 +++++++++++++++++++++++++++++++---
drivers/md/dm.c | 38 +++++++----
include/linux/device-mapper.h | 1 +
include/uapi/linux/dm-ioctl.h | 6 ++
6 files changed, 154 insertions(+), 21 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 5953ff2bd260..e5c845f9b1df 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -21,6 +21,8 @@

#define DM_RESERVED_MAX_IOS 1024

+struct dm_interposed_dev;
+
struct dm_kobject_holder {
struct kobject kobj;
struct completion completion;
@@ -114,6 +116,10 @@ struct mapped_device {
bool init_tio_pdu:1;

struct srcu_struct io_barrier;
+
+ /* for interposers logic */
+ bool is_interposed;
+ struct dm_interposed_dev *ip_dev;
};

void disable_discard(struct mapped_device *md);
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 5e306bba4375..2bcb316144a1 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
return mode;
}

+static inline bool get_interposer_flag(struct dm_ioctl *param)
+{
+ return (param->flags & DM_INTERPOSED_FLAG);
+}
+
static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
struct dm_target_spec **spec, char **target_params)
{
@@ -1338,6 +1343,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
if (!md)
return -ENXIO;

+ md->is_interposed = get_interposer_flag(param);
+
r = dm_table_create(&t, get_mode(param), param->target_count, md);
if (r)
goto err;
@@ -2098,6 +2105,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
if (r)
goto err_hash_remove;

+ md->is_interposed = get_interposer_flag(dmi);
+
/* add targets */
for (i = 0; i < dmi->target_count; i++) {
r = dm_table_add_target(t, spec_array[i]->target_type,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..0b2f9b66ade5 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -6,6 +6,7 @@
*/

#include "dm-core.h"
+#include "dm-interposer.h"

#include <linux/module.h>
#include <linux/vmalloc.h>
@@ -225,12 +226,13 @@ void dm_table_destroy(struct dm_table *t)
/*
* See if we've already got a device in the list.
*/
-static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
+static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)
{
struct dm_dev_internal *dd;

list_for_each_entry (dd, l, list)
- if (dd->dm_dev->bdev->bd_dev == dev)
+ if ((dd->dm_dev->bdev->bd_dev == dev) &&
+ (dd->dm_dev->is_interposed == is_interposed))
return dd;

return NULL;
@@ -358,6 +360,90 @@ dev_t dm_get_dev_t(const char *path)
}
EXPORT_SYMBOL_GPL(dm_get_dev_t);

+/*
+ * Redirect bio from interposed device to dm device
+ */
+static void dm_interpose_fn(struct dm_interposed_dev *ip_dev, struct bio *bio)
+{
+ struct mapped_device *md = ip_dev->private;
+
+ if (bio_flagged(bio, BIO_REMAPPED)) {
+ /*
+ * Since bio has already been remapped, we need to subtract
+ * the block device offset from the beginning of the disk.
+ */
+ bio->bi_iter.bi_sector -= get_start_sect(bio->bi_bdev);
+
+ bio_clear_flag(bio, BIO_REMAPPED);
+ }
+
+ /*
+ * Set acceptor device.
+ * It is quite convenient that device mapper creates
+ * one disk for one block device.
+ */
+ bio->bi_bdev = md->disk->part0;
+
+ /*
+ * Bio should be resubmitted.
+ * The bio will be checked again and placed in current->bio_list.
+ */
+ submit_bio_noacct(bio);
+}
+
+static int _interposer_dev_create(struct block_device *bdev, sector_t ofs, sector_t len,
+ struct mapped_device *md)
+{
+ int ret;
+
+ DMDEBUG("Create dm interposer.");
+
+ if (md->ip_dev) {
+ DMERR("The dm interposer device already in use.");
+ return -EALREADY;
+ }
+
+ if ((ofs + len) > bdev_nr_sectors(bdev)) {
+ DMERR("The specified range of sectors exceeds of the size of the block device.");
+ return -ERANGE;
+ }
+
+ md->ip_dev = kzalloc(sizeof(struct dm_interposed_dev), GFP_KERNEL);
+ if (!md->ip_dev)
+ return -ENOMEM;
+
+ if ((ofs == 0) && (len == 0))
+ DMDEBUG("Whole block device should be interposed.");
+
+ dm_interposer_dev_init(md->ip_dev,
+ ofs, len,
+ md, dm_interpose_fn);
+
+ ret = dm_interposer_dev_attach(bdev, md->ip_dev);
+ if (ret) {
+ DMERR("Cannot attach dm interposer device.");
+ kfree(md->ip_dev);
+ md->ip_dev = NULL;
+ }
+
+ return ret;
+}
+
+static void _interposer_dev_remove(struct block_device *bdev, struct mapped_device *md)
+{
+ if (!md->ip_dev)
+ return;
+
+ DMDEBUG("Remove dm interposer. %llu bios was interposed.",
+ atomic64_read(&md->ip_dev->ip_cnt));
+
+ if (dm_interposer_detach_dev(bdev, md->ip_dev))
+ DMERR("Failed to detach dm interposer.");
+
+ kfree(md->ip_dev);
+ md->ip_dev = NULL;
+}
+
/*
* Add a device to the list, or just increment the usage count if
* it's already present.
@@ -385,7 +471,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
return -ENODEV;
}

- dd = find_device(&t->devices, dev);
+ dd = find_device(&t->devices, dev, t->md->is_interposed);
if (!dd) {
dd = kmalloc(sizeof(*dd), GFP_KERNEL);
if (!dd)
@@ -398,15 +484,22 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,

refcount_set(&dd->count, 1);
list_add(&dd->list, &t->devices);
- goto out;
-
} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
r = upgrade_mode(dd, mode, t->md);
if (r)
return r;
+ refcount_inc(&dd->count);
}
- refcount_inc(&dd->count);
-out:
+
+ if (t->md->is_interposed) {
+ r = _interposer_dev_create(dd->dm_dev->bdev, ti->begin, ti->len, t->md);
+ if (r) {
+ dm_put_device(ti, dd->dm_dev);
+ DMERR("Failed to attach dm interposer.");
+ return r;
+ }
+ }
+
*result = dd->dm_dev;
return 0;
}
@@ -446,6 +539,7 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
{
int found = 0;
struct list_head *devices = &ti->table->devices;
+ struct mapped_device *md = ti->table->md;
struct dm_dev_internal *dd;

list_for_each_entry(dd, devices, list) {
@@ -456,11 +550,14 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d)
}
if (!found) {
DMWARN("%s: device %s not in table devices list",
- dm_device_name(ti->table->md), d->name);
+ dm_device_name(md), d->name);
return;
}
+ if (md->is_interposed)
+ _interposer_dev_remove(d->bdev, md);
+
if (refcount_dec_and_test(&dd->count)) {
- dm_put_table_device(ti->table->md, d);
+ dm_put_table_device(md, d);
list_del(&dd->list);
kfree(dd);
}
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..466bf70a66b0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -762,16 +762,24 @@ static int open_table_device(struct table_device *td, dev_t dev,

BUG_ON(td->dm_dev.bdev);

- bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
- if (IS_ERR(bdev))
- return PTR_ERR(bdev);
+ if (md->is_interposed) {

- r = bd_link_disk_holder(bdev, dm_disk(md));
- if (r) {
- blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
- return r;
+ bdev = blkdev_get_by_dev(dev, td->dm_dev.mode, NULL);
+ if (IS_ERR(bdev))
+ return PTR_ERR(bdev);
+ } else {
+ bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
+ if (IS_ERR(bdev))
+ return PTR_ERR(bdev);
+
+ r = bd_link_disk_holder(bdev, dm_disk(md));
+ if (r) {
+ blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
+ return r;
+ }
}

+ td->dm_dev.is_interposed = md->is_interposed;
td->dm_dev.bdev = bdev;
td->dm_dev.dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
return 0;
@@ -785,20 +793,26 @@ static void close_table_device(struct table_device *td, struct mapped_device *md
if (!td->dm_dev.bdev)
return;

- bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
- blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+ if (td->dm_dev.is_interposed)
+ blkdev_put(td->dm_dev.bdev, td->dm_dev.mode);
+ else {
+ bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
+ blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+ }
put_dax(td->dm_dev.dax_dev);
td->dm_dev.bdev = NULL;
td->dm_dev.dax_dev = NULL;
}

static struct table_device *find_table_device(struct list_head *l, dev_t dev,
- fmode_t mode)
+ fmode_t mode, bool is_interposed)
{
struct table_device *td;

list_for_each_entry(td, l, list)
- if (td->dm_dev.bdev->bd_dev == dev && td->dm_dev.mode == mode)
+ if (td->dm_dev.bdev->bd_dev == dev &&
+ td->dm_dev.mode == mode &&
+ td->dm_dev.is_interposed == is_interposed)
return td;

return NULL;
@@ -811,7 +825,7 @@ int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
struct table_device *td;

mutex_lock(&md->table_devices_lock);
- td = find_table_device(&md->table_devices, dev, mode);
+ td = find_table_device(&md->table_devices, dev, mode, md->is_interposed);
if (!td) {
td = kmalloc_node(sizeof(*td), GFP_KERNEL, md->numa_node_id);
if (!td) {
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..76a6dfb1cb29 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -159,6 +159,7 @@ struct dm_dev {
struct block_device *bdev;
struct dax_device *dax_dev;
fmode_t mode;
+ bool is_interposed;
char name[16];
};

diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index fcff6669137b..fc4d06bb3dbb 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -362,4 +362,10 @@ enum {
*/
#define DM_INTERNAL_SUSPEND_FLAG (1 << 18) /* Out */

+/*
+ * If set, the underlying device should open without FMODE_EXCL
+ * and attach mapped device via bdev_interposer.
+ */
+#define DM_INTERPOSED_FLAG (1 << 19) /* In */
+
#endif /* _LINUX_DM_IOCTL_H */
--
2.20.1

2021-03-09 17:22:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] block: add blk_mq_is_queue_frozen()

On Wed, Mar 03, 2021 at 03:30:15PM +0300, Sergei Shtepa wrote:
> +bool blk_mq_is_queue_frozen(struct request_queue *q)
> +{
> + bool ret;
> +
> + mutex_lock(&q->mq_freeze_lock);
> + ret = percpu_ref_is_dying(&q->q_usage_counter) && percpu_ref_is_zero(&q->q_usage_counter);

Please avoid the overly long line.

Also maybe frozen is a better name for the variable currently called
ret?

2021-03-09 17:29:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] block: add blk_interposer

> +static blk_qc_t __submit_bio_interposed(struct bio *bio)
> +{
> + struct bio_list bio_list[2] = { };
> + blk_qc_t ret = BLK_QC_T_NONE;
> +
> + current->bio_list = bio_list;
> + if (likely(bio_queue_enter(bio) == 0)) {
> + struct block_device *bdev = bio->bi_bdev;
> +
> + if (likely(bdev_has_interposer(bdev))) {
> + bio_set_flag(bio, BIO_INTERPOSED);
> + bdev->bd_interposer->ip_submit_bio(bio);
> + } else {
> + /* interposer was removed */
> + bio_list_add(&current->bio_list[0], bio);
> + }
> +
> + blk_queue_exit(bdev->bd_disk->queue);
> + }
> + current->bio_list = NULL;
> +
> + /* Resubmit remaining bios */
> + while ((bio = bio_list_pop(&bio_list[0])))
> + ret = submit_bio_noacct(bio);
> +
> + return ret;
> +}
> +
> /**
> * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> * @bio: The bio describing the location in memory and on the device.
> @@ -1043,6 +1071,14 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
> return BLK_QC_T_NONE;
> }
>
> + /*
> + * Checking the BIO_INTERPOSED flag is necessary so that the bio
> + * created by the bdev_interposer do not get to it for processing.
> + */
> + if (bdev_has_interposer(bio->bi_bdev) &&
> + !bio_flagged(bio, BIO_INTERPOSED))
> + return __submit_bio_interposed(bio);
> +
> if (!bio->bi_bdev->bd_disk->fops->submit_bio)
> return __submit_bio_noacct_mq(bio);
> return __submit_bio_noacct(bio);
> diff --git a/block/genhd.c b/block/genhd.c
> index fcc530164b5a..1ae8516643c8 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -30,6 +30,11 @@
> static struct kobject *block_depr;
>
> DECLARE_RWSEM(bdev_lookup_sem);
> +/*
> + * Prevents different block-layer interposers from attaching or detaching
> + * to the block device at the same time.
> + */
> +DEFINE_MUTEX(bdev_interposer_attach_lock);

This one can and should be marked static.

> +int bdev_interposer_attach(struct block_device *bdev, struct bdev_interposer *interposer,

Please avoid the overly long line.

> + int ret = 0;
> +
> + if (WARN_ON(!interposer))

WARN_ON_ONCE?

> + return -EINVAL;
> +
> + if (!blk_mq_is_queue_frozen(bdev->bd_disk->queue))
> + return -EPERM;

This probly should be a WARN_ON_ONCE() as well.

> +
> + mutex_lock(&bdev_interposer_attach_lock);
> + if (bdev_has_interposer(bdev)) {
> + if (bdev->bd_interposer->ip_submit_bio == ip_submit_bio)
> + ret = -EALREADY;
> + else
> + ret = -EBUSY;
> + goto out;
> + }

Do we really need the two different error codes here?

> +
> + interposer->ip_submit_bio = ip_submit_bio;

I'd rather let the caller initialize the field instead of passing the
submit function separately.

> +void bdev_interposer_detach(struct bdev_interposer *interposer,
> + const ip_submit_bio_t ip_submit_bio)
> +{

> + /* Check if it is really our interposer. */
> + if (WARN_ON(bdev->bd_interposer->ip_submit_bio != ip_submit_bio))
> + goto out;

I don't really see any need to pass ip_submit_bio just for this check.

> + struct bdev_interposer * bd_interposer;

The * goes just before the member name.

> +/*
> + * block layer interposers structure and functions
> + */
> +typedef void (*ip_submit_bio_t) (struct bio *bio);
> +
> +struct bdev_interposer {
> + ip_submit_bio_t ip_submit_bio;
> + struct block_device *bdev;

Do we need the ip_ prefix here? Also we probably don't really the
the typedef for the function pointer.

> +#define bdev_has_interposer(bd) ((bd)->bd_interposer != NULL)

And inline function would be nice here.

2021-03-09 17:39:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 4/4] dm: add DM_INTERPOSED_FLAG

On Wed, Mar 03, 2021 at 03:30:18PM +0300, Sergei Shtepa wrote:
> DM_INTERPOSED_FLAG allow to create dm targets on "the fly".
> Underlying block device opens without a flag FMODE_EXCL.
> Dm target receives bio from the original device via
> blk_interposer.
>
> Signed-off-by: Sergei Shtepa <[email protected]>
> ---
> drivers/md/dm-core.h | 6 ++
> drivers/md/dm-ioctl.c | 9 +++
> drivers/md/dm-table.c | 115 +++++++++++++++++++++++++++++++---
> drivers/md/dm.c | 38 +++++++----
> include/linux/device-mapper.h | 1 +
> include/uapi/linux/dm-ioctl.h | 6 ++
> 6 files changed, 154 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 5953ff2bd260..e5c845f9b1df 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -21,6 +21,8 @@
>
> #define DM_RESERVED_MAX_IOS 1024
>
> +struct dm_interposed_dev;
> +
> struct dm_kobject_holder {
> struct kobject kobj;
> struct completion completion;
> @@ -114,6 +116,10 @@ struct mapped_device {
> bool init_tio_pdu:1;
>
> struct srcu_struct io_barrier;
> +
> + /* for interposers logic */
> + bool is_interposed;
> + struct dm_interposed_dev *ip_dev;
> };
>
> void disable_discard(struct mapped_device *md);
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index 5e306bba4375..2bcb316144a1 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
> return mode;
> }
>
> +static inline bool get_interposer_flag(struct dm_ioctl *param)
> +{
> + return (param->flags & DM_INTERPOSED_FLAG);
> +}
> +
> static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
> struct dm_target_spec **spec, char **target_params)
> {
> @@ -1338,6 +1343,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
> if (!md)
> return -ENXIO;
>
> + md->is_interposed = get_interposer_flag(param);
> +
> r = dm_table_create(&t, get_mode(param), param->target_count, md);
> if (r)
> goto err;
> @@ -2098,6 +2105,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
> if (r)
> goto err_hash_remove;
>
> + md->is_interposed = get_interposer_flag(dmi);
> +
> /* add targets */
> for (i = 0; i < dmi->target_count; i++) {
> r = dm_table_add_target(t, spec_array[i]->target_type,
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 95391f78b8d5..0b2f9b66ade5 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -6,6 +6,7 @@
> */
>
> #include "dm-core.h"
> +#include "dm-interposer.h"
>
> #include <linux/module.h>
> #include <linux/vmalloc.h>
> @@ -225,12 +226,13 @@ void dm_table_destroy(struct dm_table *t)
> /*
> * See if we've already got a device in the list.
> */
> -static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
> +static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)
> {
> struct dm_dev_internal *dd;
>
> list_for_each_entry (dd, l, list)
> - if (dd->dm_dev->bdev->bd_dev == dev)
> + if ((dd->dm_dev->bdev->bd_dev == dev) &&
> + (dd->dm_dev->is_interposed == is_interposed))
> return dd;
>
> return NULL;
> @@ -358,6 +360,90 @@ dev_t dm_get_dev_t(const char *path)
> }
> EXPORT_SYMBOL_GPL(dm_get_dev_t);
>
> +/*
> + * Redirect bio from interposed device to dm device
> + */
> +static void dm_interpose_fn(struct dm_interposed_dev *ip_dev, struct bio *bio)
> +{
> + struct mapped_device *md = ip_dev->private;
> +
> + if (bio_flagged(bio, BIO_REMAPPED)) {
> + /*
> + * Since bio has already been remapped, we need to subtract
> + * the block device offset from the beginning of the disk.
> + */
> + bio->bi_iter.bi_sector -= get_start_sect(bio->bi_bdev);
> +
> + bio_clear_flag(bio, BIO_REMAPPED);
> + }

So instead of doing this shoudn't the imposer just always submit to the
whole device? But if we keep it, the logic in this funtion should go
into a block layer helper, passing a block device instead of the
dm_interposed_dev. This avoids having such fragile logic in drivers.

> + if ((ofs + len) > bdev_nr_sectors(bdev)) {
> + DMERR("The specified range of sectors exceeds of the size of the block device.");
> + return -ERANGE;
> + }
> +
> + md->ip_dev = kzalloc(sizeof(struct dm_interposed_dev), GFP_KERNEL);
> + if (!md->ip_dev)
> + return -ENOMEM;
> +
> + if ((ofs == 0) && (len == 0))

Lots of superflous inner braces.

2021-03-10 05:08:26

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] block: add blk_interposer

Thank you, Christoph, for the review.
I will correct all except two points.

The 03/09/2021 20:27, Christoph Hellwig wrote:
> > +static blk_qc_t __submit_bio_interposed(struct bio *bio)
> > +{
> > + struct bio_list bio_list[2] = { };
> > + blk_qc_t ret = BLK_QC_T_NONE;
> > +
> > + current->bio_list = bio_list;
> > + if (likely(bio_queue_enter(bio) == 0)) {
> > + struct block_device *bdev = bio->bi_bdev;
> > +
> > + if (likely(bdev_has_interposer(bdev))) {
> > + bio_set_flag(bio, BIO_INTERPOSED);
> > + bdev->bd_interposer->ip_submit_bio(bio);
> > + } else {
> > + /* interposer was removed */
> > + bio_list_add(&current->bio_list[0], bio);
> > + }
> > +
> > + blk_queue_exit(bdev->bd_disk->queue);
> > + }
> > + current->bio_list = NULL;
> > +
> > + /* Resubmit remaining bios */
> > + while ((bio = bio_list_pop(&bio_list[0])))
> > + ret = submit_bio_noacct(bio);
> > +
> > + return ret;
> > +}
> > +
> > /**
> > * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> > * @bio: The bio describing the location in memory and on the device.
> > @@ -1043,6 +1071,14 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
> > return BLK_QC_T_NONE;
> > }
> >
> > + /*
> > + * Checking the BIO_INTERPOSED flag is necessary so that the bio
> > + * created by the bdev_interposer do not get to it for processing.
> > + */
> > + if (bdev_has_interposer(bio->bi_bdev) &&
> > + !bio_flagged(bio, BIO_INTERPOSED))
> > + return __submit_bio_interposed(bio);
> > +
> > if (!bio->bi_bdev->bd_disk->fops->submit_bio)
> > return __submit_bio_noacct_mq(bio);
> > return __submit_bio_noacct(bio);
> > diff --git a/block/genhd.c b/block/genhd.c
> > index fcc530164b5a..1ae8516643c8 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -30,6 +30,11 @@
> > static struct kobject *block_depr;
> >
> > DECLARE_RWSEM(bdev_lookup_sem);
> > +/*
> > + * Prevents different block-layer interposers from attaching or detaching
> > + * to the block device at the same time.
> > + */
> > +DEFINE_MUTEX(bdev_interposer_attach_lock);
>
> This one can and should be marked static.
>
> > +int bdev_interposer_attach(struct block_device *bdev, struct bdev_interposer *interposer,
>
> Please avoid the overly long line.
>
> > + int ret = 0;
> > +
> > + if (WARN_ON(!interposer))
>
> WARN_ON_ONCE?

This function should be called quite rarely, and the absence of the interposer
parameter indicates that the function is being used incorrectly.
I would like to see this warning every time.

>
> > + return -EINVAL;
> > +
> > + if (!blk_mq_is_queue_frozen(bdev->bd_disk->queue))
> > + return -EPERM;
>
> This probly should be a WARN_ON_ONCE() as well.

I think it's better to apply WARN_ON here.

>
> > +
> > + mutex_lock(&bdev_interposer_attach_lock);
> > + if (bdev_has_interposer(bdev)) {
> > + if (bdev->bd_interposer->ip_submit_bio == ip_submit_bio)
> > + ret = -EALREADY;
> > + else
> > + ret = -EBUSY;
> > + goto out;
> > + }
>
> Do we really need the two different error codes here?

I think I need it. If we try to initialize the interposer again, the reason
for this error is most likely in the logic of the module itself.
If the interposer is occupied by someone else, then we need to let know
about it.

>
> > +
> > + interposer->ip_submit_bio = ip_submit_bio;
>
> I'd rather let the caller initialize the field instead of passing the
> submit function separately.

Yes, I think so. This will allow to keep only one parameter of the function.

>
> > +void bdev_interposer_detach(struct bdev_interposer *interposer,
> > + const ip_submit_bio_t ip_submit_bio)
> > +{
>
> > + /* Check if it is really our interposer. */
> > + if (WARN_ON(bdev->bd_interposer->ip_submit_bio != ip_submit_bio))
> > + goto out;
>
> I don't really see any need to pass ip_submit_bio just for this check.
>
> > + struct bdev_interposer * bd_interposer;
>
> The * goes just before the member name.
>
> > +/*
> > + * block layer interposers structure and functions
> > + */
> > +typedef void (*ip_submit_bio_t) (struct bio *bio);
> > +
> > +struct bdev_interposer {
> > + ip_submit_bio_t ip_submit_bio;
> > + struct block_device *bdev;
>
> Do we need the ip_ prefix here? Also we probably don't really the
> the typedef for the function pointer.

Ok. Maybe submit_bio_hook would be better? or submit_bio_interposer.

>
> > +#define bdev_has_interposer(bd) ((bd)->bd_interposer != NULL)
>
> And inline function would be nice here.

--
Sergei Shtepa
Veeam Software developer.

2021-03-10 05:29:52

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v6 4/4] dm: add DM_INTERPOSED_FLAG

The 03/09/2021 20:35, Christoph Hellwig wrote:
> On Wed, Mar 03, 2021 at 03:30:18PM +0300, Sergei Shtepa wrote:
> > DM_INTERPOSED_FLAG allow to create dm targets on "the fly".
> > Underlying block device opens without a flag FMODE_EXCL.
> > Dm target receives bio from the original device via
> > blk_interposer.
> >
> > Signed-off-by: Sergei Shtepa <[email protected]>
> > ---
> > drivers/md/dm-core.h | 6 ++
> > drivers/md/dm-ioctl.c | 9 +++
> > drivers/md/dm-table.c | 115 +++++++++++++++++++++++++++++++---
> > drivers/md/dm.c | 38 +++++++----
> > include/linux/device-mapper.h | 1 +
> > include/uapi/linux/dm-ioctl.h | 6 ++
> > 6 files changed, 154 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> > index 5953ff2bd260..e5c845f9b1df 100644
> > --- a/drivers/md/dm-core.h
> > +++ b/drivers/md/dm-core.h
> > @@ -21,6 +21,8 @@
> >
> > #define DM_RESERVED_MAX_IOS 1024
> >
> > +struct dm_interposed_dev;
> > +
> > struct dm_kobject_holder {
> > struct kobject kobj;
> > struct completion completion;
> > @@ -114,6 +116,10 @@ struct mapped_device {
> > bool init_tio_pdu:1;
> >
> > struct srcu_struct io_barrier;
> > +
> > + /* for interposers logic */
> > + bool is_interposed;
> > + struct dm_interposed_dev *ip_dev;
> > };
> >
> > void disable_discard(struct mapped_device *md);
> > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> > index 5e306bba4375..2bcb316144a1 100644
> > --- a/drivers/md/dm-ioctl.c
> > +++ b/drivers/md/dm-ioctl.c
> > @@ -1267,6 +1267,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
> > return mode;
> > }
> >
> > +static inline bool get_interposer_flag(struct dm_ioctl *param)
> > +{
> > + return (param->flags & DM_INTERPOSED_FLAG);
> > +}
> > +
> > static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
> > struct dm_target_spec **spec, char **target_params)
> > {
> > @@ -1338,6 +1343,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
> > if (!md)
> > return -ENXIO;
> >
> > + md->is_interposed = get_interposer_flag(param);
> > +
> > r = dm_table_create(&t, get_mode(param), param->target_count, md);
> > if (r)
> > goto err;
> > @@ -2098,6 +2105,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
> > if (r)
> > goto err_hash_remove;
> >
> > + md->is_interposed = get_interposer_flag(dmi);
> > +
> > /* add targets */
> > for (i = 0; i < dmi->target_count; i++) {
> > r = dm_table_add_target(t, spec_array[i]->target_type,
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index 95391f78b8d5..0b2f9b66ade5 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -6,6 +6,7 @@
> > */
> >
> > #include "dm-core.h"
> > +#include "dm-interposer.h"
> >
> > #include <linux/module.h>
> > #include <linux/vmalloc.h>
> > @@ -225,12 +226,13 @@ void dm_table_destroy(struct dm_table *t)
> > /*
> > * See if we've already got a device in the list.
> > */
> > -static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev)
> > +static struct dm_dev_internal *find_device(struct list_head *l, dev_t dev, bool is_interposed)
> > {
> > struct dm_dev_internal *dd;
> >
> > list_for_each_entry (dd, l, list)
> > - if (dd->dm_dev->bdev->bd_dev == dev)
> > + if ((dd->dm_dev->bdev->bd_dev == dev) &&
> > + (dd->dm_dev->is_interposed == is_interposed))
> > return dd;
> >
> > return NULL;
> > @@ -358,6 +360,90 @@ dev_t dm_get_dev_t(const char *path)
> > }
> > EXPORT_SYMBOL_GPL(dm_get_dev_t);
> >
> > +/*
> > + * Redirect bio from interposed device to dm device
> > + */
> > +static void dm_interpose_fn(struct dm_interposed_dev *ip_dev, struct bio *bio)
> > +{
> > + struct mapped_device *md = ip_dev->private;
> > +
> > + if (bio_flagged(bio, BIO_REMAPPED)) {
> > + /*
> > + * Since bio has already been remapped, we need to subtract
> > + * the block device offset from the beginning of the disk.
> > + */
> > + bio->bi_iter.bi_sector -= get_start_sect(bio->bi_bdev);
> > +
> > + bio_clear_flag(bio, BIO_REMAPPED);
> > + }
>
> So instead of doing this shoudn't the imposer just always submit to the
> whole device? But if we keep it, the logic in this funtion should go
> into a block layer helper, passing a block device instead of the
> dm_interposed_dev. This avoids having such fragile logic in drivers.

device-mapper allows to create devices of any size using only part of
the underlying device. Therefore, it is not possible to apply the
interposer to the whole block device.
Perhaps it makes sense to put the blk_partition_unremap() function in the
block layer? I'm not sure that's a good thing.

>
> > + if ((ofs + len) > bdev_nr_sectors(bdev)) {
> > + DMERR("The specified range of sectors exceeds of the size of the block device.");
> > + return -ERANGE;
> > + }
> > +
> > + md->ip_dev = kzalloc(sizeof(struct dm_interposed_dev), GFP_KERNEL);
> > + if (!md->ip_dev)
> > + return -ENOMEM;
> > +
> > + if ((ofs == 0) && (len == 0))
>
> Lots of superflous inner braces.

--
Sergei Shtepa
Veeam Software developer.

2021-03-10 10:06:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 2/4] block: add blk_interposer

On Wed, Mar 10, 2021 at 07:53:13AM +0300, Sergei Shtepa wrote:
> > Please avoid the overly long line.
> >
> > > + int ret = 0;
> > > +
> > > + if (WARN_ON(!interposer))
> >
> > WARN_ON_ONCE?
>
> This function should be called quite rarely, and the absence of the interposer
> parameter indicates that the function is being used incorrectly.
> I would like to see this warning every time.

Yes. Most kernel code would in fact just remove the check entirely
and let the kernel crash to indicate this. Maybe that is an even
better option for such a grave API usage mistake.

> > > +struct bdev_interposer {
> > > + ip_submit_bio_t ip_submit_bio;
> > > + struct block_device *bdev;
> >
> > Do we need the ip_ prefix here? Also we probably don't really the
> > the typedef for the function pointer.
>
> Ok. Maybe submit_bio_hook would be better? or submit_bio_interposer.

Or interpose_bio?

2021-03-10 12:37:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v6 4/4] dm: add DM_INTERPOSED_FLAG

On Wed, Mar 10, 2021 at 08:28:12AM +0300, Sergei Shtepa wrote:
> > So instead of doing this shoudn't the interposer just always submit to the
> > whole device? But if we keep it, the logic in this funtion should go
> > into a block layer helper, passing a block device instead of the

>
> device-mapper allows to create devices of any size using only part of
> the underlying device. Therefore, it is not possible to apply the
> interposer to the whole block device.
> Perhaps it makes sense to put the blk_partition_unremap() function in the
> block layer? I'm not sure that's a good thing.

I suspect the answer is to not remap bios that are going to be handled
by the interposer. In fact much of submit_bio_checks as-is is a bad
idea for interposed devices. I think what we need to do instead is to
pass an explicit bdev to submit_bio_checks and use that everywhere,
including in the subfunctions.

With that we might also be able to remove the separate interpose hook
and thus struct bdev_interposer entirely as now ->submit_bio of the
interposer could do all the work:

static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
{
struct block_device *orig_bdev = bio->bi_bdev, *interposer;
struct bio_list bio_list[2] = { };
blk_qc_t ret = BLK_QC_T_NONE;

if (current->bio_list) {
bio_list_add(&current->bio_list[0], bio);
return BLK_QC_T_NONE;
}

if (unlikely(bio_queue_enter(bio)))
return BLK_QC_T_NONE;

interposer = orig_bdev->bd_interposer;
if (unlikely(!interposer)) {
/* interposer was removed */
bio_list_add(&current->bio_list[0], bio);
goto queue_exit;
}
if (!submit_bio_checks(bio, interposer))
goto queue_exit;

bio_set_flag(bio, BIO_INTERPOSED);

current->bio_list = bio_list;
ret = interposer->bd_disk->fops->submit_bio(bio);
current->bio_list = NULL;

queue_exit:
blk_queue_exit(bdev->bd_disk->queue);

/* Resubmit remaining bios */
while ((bio = bio_list_pop(&bio_list[0])))
ret = submit_bio_noacct(bio);
return ret;
}

blk_qc_t submit_bio_noacct(struct bio *bio)
{
if (bio->bi_bdev->bd_interposer && !bio_flagged(bio, BIO_INTERPOSED)
return submit_bio_interposed(bio);

...
}

Note that both with this and your original code the interposer must
never resubmit I/O to itself. Is that actually the case for DM? I'm
trying to think of a good debug check for that, but right now I can't
think of something that doesn't cause any overhead for n

2021-03-11 10:58:55

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v6 4/4] dm: add DM_INTERPOSED_FLAG

The 03/10/2021 15:34, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 08:28:12AM +0300, Sergei Shtepa wrote:
> > > So instead of doing this shoudn't the interposer just always submit to the
> > > whole device? But if we keep it, the logic in this funtion should go
> > > into a block layer helper, passing a block device instead of the
>
> >
> > device-mapper allows to create devices of any size using only part of
> > the underlying device. Therefore, it is not possible to apply the
> > interposer to the whole block device.
> > Perhaps it makes sense to put the blk_partition_unremap() function in the
> > block layer? I'm not sure that's a good thing.
>
> I suspect the answer is to not remap bios that are going to be handled
> by the interposer. In fact much of submit_bio_checks as-is is a bad
> idea for interposed devices. I think what we need to do instead is to
> pass an explicit bdev to submit_bio_checks and use that everywhere,
> including in the subfunctions.
>
> With that we might also be able to remove the separate interpose hook
> and thus struct bdev_interposer entirely as now ->submit_bio of the
> interposer could do all the work:
>
> static noinline blk_qc_t submit_bio_interposed(struct bio *bio)
> {
> struct block_device *orig_bdev = bio->bi_bdev, *interposer;
> struct bio_list bio_list[2] = { };
> blk_qc_t ret = BLK_QC_T_NONE;
>
> if (current->bio_list) {
> bio_list_add(&current->bio_list[0], bio);
> return BLK_QC_T_NONE;
> }
>
> if (unlikely(bio_queue_enter(bio)))
> return BLK_QC_T_NONE;
>
> interposer = orig_bdev->bd_interposer;
> if (unlikely(!interposer)) {
> /* interposer was removed */
> bio_list_add(&current->bio_list[0], bio);
> goto queue_exit;
> }
> if (!submit_bio_checks(bio, interposer))
> goto queue_exit;
>
> bio_set_flag(bio, BIO_INTERPOSED);
>
> current->bio_list = bio_list;
> ret = interposer->bd_disk->fops->submit_bio(bio);
> current->bio_list = NULL;
>
> queue_exit:
> blk_queue_exit(bdev->bd_disk->queue);
>
> /* Resubmit remaining bios */
> while ((bio = bio_list_pop(&bio_list[0])))
> ret = submit_bio_noacct(bio);
> return ret;
> }
>
> blk_qc_t submit_bio_noacct(struct bio *bio)
> {
> if (bio->bi_bdev->bd_interposer && !bio_flagged(bio, BIO_INTERPOSED)
> return submit_bio_interposed(bio);
>
> ...
> }

Your point of view is very interesting. I like.
I will try to implement it and check how it works.

So far, I see the problem in that the interposer device has to intercept
all bio requests from the original device. It will not be possible to
implement an interception of some part. Device mapper can create its own
target for a part of the block device.

But maybe it's a good thing. First, there is little real benefit from
being able to intercept bio requests from a part of the block device.
In real use, this may not be necessary. Secondly, it will get rid of the
problem when part of the bio needs to be intercepted, and part does not.

I'd like to know Mike's opinion on this issue.

>
> Note that both with this and your original code the interposer must
> never resubmit I/O to itself. Is that actually the case for DM? I'm
> trying to think of a good debug check for that, but right now I can't
> think of something that doesn't cause any overhead for n

I believe that the BIO_INTERPOSED flag is quite good at solving this
problem. When cloning a bio, the flag is passed, which means that bio
cannot be called twice.


Thank you again.
Because of you, I will have to rewrite some code again ;)
But it's all for the best.
--
Sergei Shtepa
Veeam Software developer.