I think I'm ready to suggest the next version of block device interposer
(blk_interposer). It allows to redirect bio requests to other block
devices.
In this series of patches, I reviewed the process of attaching and
detaching device mapper via blk_interposer.
Now the dm-target is attached to the interposed block device when the
interposer dm-target is fully ready to accept requests, and the interposed
block device queue is locked, and the file system on it is frozen.
The detaching is also performed when the file system on the interposed
block device is in a frozen state, the queue is locked, and the interposer
dm-target is suspended.
To make it possible to lock the receipt of new bio requests without locking
the processing of bio requests that the interposer creates, I had to change
the submit_bio_noacct() function and add a lock. To minimize the impact of
locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
afraid it's impossible.
Checking the operation of the interposer, I did not limit myself to
a simple dm-linear. When I experimented with dm-era, I noticed that it
accepts two block devices. Since Mike was against changing the logic in
the dm-targets itself to support the interrupter, I decided to add the
[interpose] option to the block device path.
echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
dmsetup create dm-era --interpose
I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
we can assume that if the device cannot be opened with the FMODE_EXCL,
then it is considered an interposed device, but it seems to me that
algorithm is unsafe. I hope to get Mike's opinion on this.
I have successfully tried taking snapshots. But I ran into a problem
when I removed origin-target:
[ 49.031156] ------------[ cut here ]------------
[ 49.031180] kernel BUG at block/bio.c:1476!
[ 49.031198] invalid opcode: 0000 [#1] SMP NOPTI
[ 49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: G E 5.12.0-rc6-ip+ #52
[ 49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 49.031257] RIP: 0010:bio_split+0x74/0x80
[ 49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
[ 49.031322] RSP: 0018:ffff9a6100993ab0 EFLAGS: 00010246
[ 49.031337] RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8e26938f96d8
[ 49.031357] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff8e26937d1300
[ 49.031375] RBP: ffff8e2692ddc000 R08: 0000000000000000 R09: 0000000000000000
[ 49.031394] R10: ffff8e2692b1de00 R11: ffff8e2692b1de58 R12: ffff8e26937d1300
[ 49.031413] R13: ffff8e2692ddcd18 R14: ffff8e2691d22140 R15: ffff8e26937d1300
[ 49.031432] FS: 00007efffa6e7800(0000) GS:ffff8e269bc80000(0000) knlGS:0000000000000000
[ 49.031453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 49.031470] CR2: 00007efffa96cda0 CR3: 0000000114bd0000 CR4: 00000000000506e0
[ 49.031490] Call Trace:
[ 49.031501] dm_submit_bio+0x383/0x500 [dm_mod]
[ 49.031522] submit_bio_noacct+0x370/0x770
[ 49.031537] submit_bh_wbc+0x160/0x190
[ 49.031550] __sync_dirty_buffer+0x65/0x130
[ 49.031564] ext4_commit_super+0xbc/0x120 [ext4]
[ 49.031602] ext4_freeze+0x54/0x80 [ext4]
[ 49.031631] freeze_super+0xc8/0x160
[ 49.031643] freeze_bdev+0xb2/0xc0
[ 49.031654] lock_bdev_fs+0x1c/0x30 [dm_mod]
[ 49.031671] __dm_suspend+0x2b9/0x3b0 [dm_mod]
[ 49.032095] dm_suspend+0xed/0x160 [dm_mod]
[ 49.032496] ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
[ 49.032897] ? remove_all+0x30/0x30 [dm_mod]
[ 49.033299] dev_remove+0x4c/0x1c0 [dm_mod]
[ 49.033679] ctl_ioctl+0x1a5/0x470 [dm_mod]
[ 49.034067] dm_ctl_ioctl+0xa/0x10 [dm_mod]
[ 49.034432] __x64_sys_ioctl+0x83/0xb0
[ 49.034785] do_syscall_64+0x33/0x80
[ 49.035139] entry_SYSCALL_64_after_hwframe+0x44/0xae
When suspend is executed for origin-target before the interposer is
being detached, in the origin_map() function the value of the
o->split_binary variable is zero, since no snapshots were connected to it.
I think that if no snapshots are connected, then it does not make sense
to split the bio request into parts.
Changes summary for this patchset v7:
* The attaching and detaching to interposed device moved to
__dm_suspend() and __dm_resume() functions.
* Redesigned th submit_bio_noacct() function and added a lock for the
block device interposer.
* Adds [interpose] option to block device patch in dm table.
* Fix origin_map() then o->split_binary value is zero.
History:
v7 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
* the request interception mechanism. Now the interposer is
a block device that receives requests instead of the original device;
* code design fixes.
History:
v6 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
* designed for 5.12;
* thanks to the new design of the bio structure in v5.12, it is
possible to perform interception not for the entire disk, but
for each block device;
* instead of the new ioctl DM_DEV_REMAP_CMD and the 'noexcl' option,
the DM_INTERPOSED_FLAG flag for the ioctl DM_TABLE_LOAD_CMD is
applied.
v5 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
* rebase for v5.11-rc7;
* patch set organization;
* fix defects in documentation;
* add some comments;
* change mutex names for better code readability;
* remove calling bd_unlink_disk_holder() for targets with non-exclusive
flag;
* change type for struct dm_remap_param from uint8_t to __u8.
v4 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
Mostly changes were made, due to Damien's comments:
* on the design of the code;
* by the patch set organization;
* bug with passing a wrong parameter to dm_get_device();
* description of the 'noexcl' parameter in the linear.rst.
Also added remap_and_filter.rst.
v3 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
In this version, I already suggested blk_interposer to apply to dm-linear.
Problems were solved:
* Interception of bio requests from a specific device on the disk, not
from the entire disk. To do this, we added the dm_interposed_dev
structure and an interval tree to store these structures.
* Implemented ioctl DM_DEV_REMAP_CMD. A patch with changes in the lvm2
project was sent to the team [email protected].
* Added the 'noexcl' option for dm-linear, which allows you to open
the underlying block-device without FMODE_EXCL mode.
v2 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
I tried to suggest blk_interposer without using it in device mapper,
but with the addition of a sample of its use. It was then that I learned
about the maintainers' attitudes towards the samples directory :).
v1 - https://lwn.net/ml/linux-block/[email protected]/
This Hannes's patch can be considered as a starting point, since this is
where the interception mechanism and the term blk_interposer itself
appeared. It became clear that blk_interposer can be useful for
device mapper.
before v1 - https://patchwork.kernel.org/project/linux-block/cover/[email protected]/
I tried to offer a rather cumbersome blk-filter and a monster-like
blk-snap module for creating snapshots.
Thank you to everyone who was able to take the time to review
the previous versions.
I hope that this time I achieved the required quality.
Thanks,
Sergei.
Sergei Shtepa (4):
Adds blk_interposer. It allows to redirect bio requests to another
block device.
Adds the blk_interposers logic to __submit_bio_noacct().
Adds blk_interposer to md.
fix origin_map - don't split a bio for the origin device if it does
not have registered snapshots.
block/bio.c | 2 +
block/blk-core.c | 194 ++++++++++++++------------
block/genhd.c | 51 +++++++
drivers/md/dm-core.h | 1 +
drivers/md/dm-ioctl.c | 95 ++++++++++---
drivers/md/dm-snap.c | 15 +-
drivers/md/dm-table.c | 68 ++++++++-
drivers/md/dm.c | 254 ++++++++++++++++++++++++++++++----
drivers/md/dm.h | 8 +-
fs/block_dev.c | 3 +
include/linux/blk_types.h | 6 +
include/linux/blkdev.h | 32 +++++
include/linux/device-mapper.h | 1 +
include/uapi/linux/dm-ioctl.h | 6 +
14 files changed, 586 insertions(+), 150 deletions(-)
--
2.20.1
Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/md/dm-snap.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 11890db71f3f..81e8e3bb6d25 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2685,11 +2685,18 @@ static int origin_map(struct dm_target *ti, struct bio *bio)
if (bio_data_dir(bio) != WRITE)
return DM_MAPIO_REMAPPED;
- available_sectors = o->split_boundary -
- ((unsigned)bio->bi_iter.bi_sector & (o->split_boundary - 1));
+ /*
+ * If no snapshot is connected to origin, then split_boundary
+ * will be set to zero. In this case, we don't need to split a bio.
+ */
+ if (o->split_boundary) {
+ available_sectors = o->split_boundary -
+ ((unsigned int)bio->bi_iter.bi_sector &
+ (o->split_boundary - 1));
- if (bio_sectors(bio) > available_sectors)
- dm_accept_partial_bio(bio, available_sectors);
+ if (bio_sectors(bio) > available_sectors)
+ dm_accept_partial_bio(bio, available_sectors);
+ }
/* Only tell snapshots if this is a write */
return do_origin(o->dev, bio, true);
--
2.20.1
* The calling to blk_partition_remap() function has moved
from submit_bio_checks() to submit_bio_noacct().
* The __submit_bio() and __submit_bio_noacct_mq() functions
have been removed and their functionality moved to
submit_bio_noacct().
* Added locking of the block device queue using
the bd_interposer_lock.
Signed-off-by: Sergei Shtepa <[email protected]>
---
block/bio.c | 2 +
block/blk-core.c | 194 ++++++++++++++++++++++++++---------------------
2 files changed, 108 insertions(+), 88 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 50e579088aca..6fc9e8f395a6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -640,6 +640,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio_set_flag(bio, BIO_THROTTLED);
if (bio_flagged(bio_src, BIO_REMAPPED))
bio_set_flag(bio, BIO_REMAPPED);
+ if (bio_flagged(bio_src, BIO_INTERPOSED))
+ bio_set_flag(bio, BIO_INTERPOSED);
bio->bi_opf = bio_src->bi_opf;
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_write_hint = bio_src->bi_write_hint;
diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..a987daa76a79 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -735,26 +735,27 @@ static inline int bio_check_eod(struct bio *bio)
handle_bad_sector(bio, maxsector);
return -EIO;
}
+
+ if (unlikely(should_fail_request(bio->bi_bdev, bio->bi_iter.bi_size)))
+ return -EIO;
+
return 0;
}
/*
* Remap block n of partition p to block n+start(p) of the disk.
*/
-static int blk_partition_remap(struct bio *bio)
+static inline void blk_partition_remap(struct bio *bio)
{
- struct block_device *p = bio->bi_bdev;
+ struct block_device *bdev = bio->bi_bdev;
- if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
- return -EIO;
- if (bio_sectors(bio)) {
- bio->bi_iter.bi_sector += p->bd_start_sect;
- trace_block_bio_remap(bio, p->bd_dev,
+ if (bdev->bd_partno && bio_sectors(bio)) {
+ bio->bi_iter.bi_sector += bdev->bd_start_sect;
+ trace_block_bio_remap(bio, bdev->bd_dev,
bio->bi_iter.bi_sector -
- p->bd_start_sect);
+ bdev->bd_start_sect);
}
bio_set_flag(bio, BIO_REMAPPED);
- return 0;
}
/*
@@ -819,8 +820,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
if (!bio_flagged(bio, BIO_REMAPPED)) {
if (unlikely(bio_check_eod(bio)))
goto end_io;
- if (bdev->bd_partno && unlikely(blk_partition_remap(bio)))
- goto end_io;
}
/*
@@ -910,20 +909,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
return false;
}
-static blk_qc_t __submit_bio(struct bio *bio)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- blk_qc_t ret = BLK_QC_T_NONE;
-
- if (blk_crypto_bio_prep(&bio)) {
- if (!disk->fops->submit_bio)
- return blk_mq_submit_bio(bio);
- ret = disk->fops->submit_bio(bio);
- }
- blk_queue_exit(disk->queue);
- return ret;
-}
-
/*
* The loop in this function may be a bit non-obvious, and so deserves some
* explanation:
@@ -931,7 +916,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
* - Before entering the loop, bio->bi_next is NULL (as all callers ensure
* that), so we have a list with a single bio.
* - We pretend that we have just taken it off a longer list, so we assign
- * bio_list to a pointer to the bio_list_on_stack, thus initialising the
+ * bio_list to a pointer to the current->bio_list, thus initialising the
* bio_list of new bios to be added. ->submit_bio() may indeed add some more
* bios through a recursive call to submit_bio_noacct. If it did, we find a
* non-NULL value in bio_list and re-enter the loop from the top.
@@ -939,83 +924,75 @@ static blk_qc_t __submit_bio(struct bio *bio)
* pretending) and so remove it from bio_list, and call into ->submit_bio()
* again.
*
- * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
- * bio_list_on_stack[1] contains bios that were submitted before the current
+ * current->bio_list[0] contains bios submitted by the current ->submit_bio.
+ * current->bio_list[1] contains bios that were submitted before the current
* ->submit_bio_bio, but that haven't been processed yet.
*/
static blk_qc_t __submit_bio_noacct(struct bio *bio)
{
- struct bio_list bio_list_on_stack[2];
- blk_qc_t ret = BLK_QC_T_NONE;
-
- BUG_ON(bio->bi_next);
-
- bio_list_init(&bio_list_on_stack[0]);
- current->bio_list = bio_list_on_stack;
-
- do {
- struct request_queue *q = bio->bi_bdev->bd_disk->queue;
- struct bio_list lower, same;
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+ struct bio_list lower, same;
+ blk_qc_t ret;
- if (unlikely(bio_queue_enter(bio) != 0))
- continue;
+ if (!blk_crypto_bio_prep(&bio)) {
+ blk_queue_exit(disk->queue);
+ return BLK_QC_T_NONE;
+ }
- /*
- * Create a fresh bio_list for all subordinate requests.
- */
- bio_list_on_stack[1] = bio_list_on_stack[0];
- bio_list_init(&bio_list_on_stack[0]);
+ if (queue_is_mq(disk->queue))
+ return blk_mq_submit_bio(bio);
- ret = __submit_bio(bio);
+ /*
+ * Create a fresh bio_list for all subordinate requests.
+ */
+ current->bio_list[1] = current->bio_list[0];
+ bio_list_init(¤t->bio_list[0]);
- /*
- * Sort new bios into those for a lower level and those for the
- * same level.
- */
- bio_list_init(&lower);
- bio_list_init(&same);
- while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
- if (q == bio->bi_bdev->bd_disk->queue)
- bio_list_add(&same, bio);
- else
- bio_list_add(&lower, bio);
+ WARN_ON_ONCE(!disk->fops->submit_bio);
+ ret = disk->fops->submit_bio(bio);
+ blk_queue_exit(disk->queue);
+ /*
+ * Sort new bios into those for a lower level and those
+ * for the same level.
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(¤t->bio_list[0])) != NULL)
+ if (disk->queue == bio->bi_bdev->bd_disk->queue)
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
- /*
- * Now assemble so we handle the lowest level first.
- */
- bio_list_merge(&bio_list_on_stack[0], &lower);
- bio_list_merge(&bio_list_on_stack[0], &same);
- bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
- } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
+ /*
+ * Now assemble so we handle the lowest level first.
+ */
+ bio_list_merge(¤t->bio_list[0], &lower);
+ bio_list_merge(¤t->bio_list[0], &same);
+ bio_list_merge(¤t->bio_list[0], ¤t->bio_list[1]);
- current->bio_list = NULL;
return ret;
}
-static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
+static inline struct block_device *bio_interposer_lock(struct bio *bio)
{
- struct bio_list bio_list[2] = { };
- blk_qc_t ret = BLK_QC_T_NONE;
-
- current->bio_list = bio_list;
-
- do {
- struct gendisk *disk = bio->bi_bdev->bd_disk;
-
- if (unlikely(bio_queue_enter(bio) != 0))
- continue;
+ bool locked;
+ struct block_device *bdev = bio->bi_bdev;
- if (!blk_crypto_bio_prep(&bio)) {
- blk_queue_exit(disk->queue);
- ret = BLK_QC_T_NONE;
- continue;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ locked = percpu_down_read_trylock(&bdev->bd_interposer_lock);
+ if (unlikely(!locked)) {
+ bio_wouldblock_error(bio);
+ return NULL;
}
+ } else
+ percpu_down_read(&bdev->bd_interposer_lock);
- ret = blk_mq_submit_bio(bio);
- } while ((bio = bio_list_pop(&bio_list[0])));
+ return bdev;
+}
- current->bio_list = NULL;
- return ret;
+static inline void bio_interposer_unlock(struct block_device *locked_bdev)
+{
+ percpu_up_read(&locked_bdev->bd_interposer_lock);
}
/**
@@ -1029,6 +1006,10 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
*/
blk_qc_t submit_bio_noacct(struct bio *bio)
{
+ struct block_device *locked_bdev;
+ struct bio_list bio_list_on_stack[2] = { };
+ blk_qc_t ret = BLK_QC_T_NONE;
+
if (!submit_bio_checks(bio))
return BLK_QC_T_NONE;
@@ -1043,9 +1024,46 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
return BLK_QC_T_NONE;
}
- if (!bio->bi_bdev->bd_disk->fops->submit_bio)
- return __submit_bio_noacct_mq(bio);
- return __submit_bio_noacct(bio);
+ BUG_ON(bio->bi_next);
+
+ locked_bdev = bio_interposer_lock(bio);
+ if (!locked_bdev)
+ return BLK_QC_T_NONE;
+
+ current->bio_list = bio_list_on_stack;
+
+ do {
+ if (unlikely(bio_queue_enter(bio) != 0)) {
+ ret = BLK_QC_T_NONE;
+ continue;
+ }
+
+ if (!bio_flagged(bio, BIO_INTERPOSED) &&
+ bio->bi_bdev->bd_interposer) {
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
+
+ bio_set_dev(bio, bio->bi_bdev->bd_interposer);
+ bio_set_flag(bio, BIO_INTERPOSED);
+
+ bio_list_add(&bio_list_on_stack[0], bio);
+
+ blk_queue_exit(disk->queue);
+ ret = BLK_QC_T_NONE;
+ continue;
+ }
+
+ if (!bio_flagged(bio, BIO_REMAPPED))
+ blk_partition_remap(bio);
+
+ ret = __submit_bio_noacct(bio);
+
+ } while ((bio = bio_list_pop(&bio_list_on_stack[0])));
+
+ current->bio_list = NULL;
+
+ bio_interposer_unlock(locked_bdev);
+
+ return ret;
}
EXPORT_SYMBOL(submit_bio_noacct);
--
2.20.1
* The new flag DM_INTERPOSE_FLAG allows to specify that the dm target
will be attached using blk_interposer.
* The [interpose] option allows to specify which device will be
attached via the interposer.
* The connection and disconnection of the interrupter is performed in
the functions __dm_suspend() and __dm_resume(). The flag
DM_SUSPEND_DETACH_IP_FLAG was added for this purpose.
* dm_submit_bio() sets BIO_INTERPOSED for each bio from the interposer.
Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/md/dm-core.h | 1 +
drivers/md/dm-ioctl.c | 95 ++++++++++---
drivers/md/dm-table.c | 68 ++++++++-
drivers/md/dm.c | 254 ++++++++++++++++++++++++++++++----
drivers/md/dm.h | 8 +-
include/linux/device-mapper.h | 1 +
include/uapi/linux/dm-ioctl.h | 6 +
7 files changed, 375 insertions(+), 58 deletions(-)
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 5953ff2bd260..431b82461eae 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -112,6 +112,7 @@ struct mapped_device {
/* for blk-mq request-based DM support */
struct blk_mq_tag_set *tag_set;
bool init_tio_pdu:1;
+ bool interpose:1;
struct srcu_struct io_barrier;
};
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 1ca65b434f1f..7ec37526920b 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -294,11 +294,29 @@ static void dm_hash_remove_all(bool keep_open_devices, bool mark_deferred, bool
md = hc->md;
dm_get(md);
- if (keep_open_devices &&
- dm_lock_for_deletion(md, mark_deferred, only_deferred)) {
- dm_put(md);
- dev_skipped++;
- continue;
+ if (md->interpose) {
+ int r;
+
+ /*
+ * Interposer should be suspended and detached
+ * from the interposed block device.
+ */
+ r = dm_suspend(md, DM_SUSPEND_DETACH_IP_FLAG |
+ DM_SUSPEND_LOCKFS_FLAG);
+ if (r) {
+ DMERR("%s: unable to suspend and detach interposer",
+ dm_device_name(md));
+ dm_put(md);
+ dev_skipped++;
+ continue;
+ }
+ } else {
+ if (keep_open_devices &&
+ dm_lock_for_deletion(md, mark_deferred, only_deferred)) {
+ dm_put(md);
+ dev_skipped++;
+ continue;
+ }
}
t = __hash_remove(hc);
@@ -732,6 +750,9 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)
if (dm_test_deferred_remove_flag(md))
param->flags |= DM_DEFERRED_REMOVE;
+ if (dm_interposer_attached(md))
+ param->flags |= DM_INTERPOSE_FLAG;
+
param->dev = huge_encode_dev(disk_devt(disk));
/*
@@ -878,20 +899,37 @@ static int dev_remove(struct file *filp, struct dm_ioctl *param, size_t param_si
md = hc->md;
- /*
- * Ensure the device is not open and nothing further can open it.
- */
- r = dm_lock_for_deletion(md, !!(param->flags & DM_DEFERRED_REMOVE), false);
- if (r) {
- if (r == -EBUSY && param->flags & DM_DEFERRED_REMOVE) {
+ if (!md->interpose) {
+ /*
+ * Ensure the device is not open and nothing further can open it.
+ */
+ r = dm_lock_for_deletion(md, !!(param->flags & DM_DEFERRED_REMOVE), false);
+ if (r) {
+ if (r == -EBUSY && param->flags & DM_DEFERRED_REMOVE) {
+ up_write(&_hash_lock);
+ dm_put(md);
+ return 0;
+ }
+ DMDEBUG_LIMIT("unable to remove open device %s",
+ hc->name);
up_write(&_hash_lock);
dm_put(md);
- return 0;
+ return r;
+ }
+ } else {
+ /*
+ * Interposer should be suspended and detached from
+ * the interposed block device.
+ */
+ r = dm_suspend(md, DM_SUSPEND_DETACH_IP_FLAG |
+ DM_SUSPEND_LOCKFS_FLAG);
+ if (r) {
+ DMERR("%s: unable to suspend and detach interposer",
+ dm_device_name(md));
+ up_write(&_hash_lock);
+ dm_put(md);
+ return r;
}
- DMDEBUG_LIMIT("unable to remove open device %s", hc->name);
- up_write(&_hash_lock);
- dm_put(md);
- return r;
}
t = __hash_remove(hc);
@@ -1050,6 +1088,7 @@ static int do_resume(struct dm_ioctl *param)
md = hc->md;
+
new_map = hc->new_map;
hc->new_map = NULL;
param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
@@ -1063,8 +1102,14 @@ static int do_resume(struct dm_ioctl *param)
suspend_flags &= ~DM_SUSPEND_LOCKFS_FLAG;
if (param->flags & DM_NOFLUSH_FLAG)
suspend_flags |= DM_SUSPEND_NOFLUSH_FLAG;
- if (!dm_suspended_md(md))
- dm_suspend(md, suspend_flags);
+
+ if (md->interpose) {
+ if (!dm_suspended_md(md) || dm_interposer_attached(md))
+ dm_suspend(md, suspend_flags | DM_SUSPEND_DETACH_IP_FLAG);
+ } else {
+ if (!dm_suspended_md(md))
+ dm_suspend(md, suspend_flags);
+ }
old_map = dm_swap_table(md, new_map);
if (IS_ERR(old_map)) {
@@ -1267,6 +1312,11 @@ static inline fmode_t get_mode(struct dm_ioctl *param)
return mode;
}
+static inline bool get_interpose_flag(struct dm_ioctl *param)
+{
+ return (param->flags & DM_INTERPOSE_FLAG);
+}
+
static int next_target(struct dm_target_spec *last, uint32_t next, void *end,
struct dm_target_spec **spec, char **target_params)
{
@@ -1289,11 +1339,6 @@ static int populate_table(struct dm_table *table,
void *end = (void *) param + param_size;
char *target_params;
- if (!param->target_count) {
- DMWARN("populate_table: no targets specified");
- return -EINVAL;
- }
-
for (i = 0; i < param->target_count; i++) {
r = next_target(spec, next, end, &spec, &target_params);
@@ -1338,6 +1383,8 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
if (!md)
return -ENXIO;
+ md->interpose = get_interpose_flag(param);
+
r = dm_table_create(&t, get_mode(param), param->target_count, md);
if (r)
goto err;
@@ -2098,6 +2145,8 @@ int __init dm_early_create(struct dm_ioctl *dmi,
if (r)
goto err_hash_remove;
+ md->interpose = get_interpose_flag(dmi);
+
/* add targets */
for (i = 0; i < dmi->target_count; i++) {
r = dm_table_add_target(t, spec_array[i]->target_type,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index e5f0f1703c5d..23574c727f2b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -327,14 +327,14 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
* it is accessed concurrently.
*/
static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
- struct mapped_device *md)
+ bool interpose, struct mapped_device *md)
{
int r;
struct dm_dev *old_dev, *new_dev;
old_dev = dd->dm_dev;
- r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
+ r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, interpose,
dd->dm_dev->mode | new_mode, &new_dev);
if (r)
return r;
@@ -367,6 +367,8 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
{
int r;
dev_t dev;
+ size_t ofs = 0;
+ bool interpose = false;
unsigned int major, minor;
char dummy;
struct dm_dev_internal *dd;
@@ -374,13 +376,40 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
BUG_ON(!t);
- if (sscanf(path, "%u:%u%c", &major, &minor, &dummy) == 2) {
+ /*
+ * Extract extended options for device
+ */
+ if (path[0] == '[') {
+ const char *interpose_opt = "interpose";
+ size_t opt_pos = 1;
+ size_t opt_len;
+
+ /*
+ * Because only one option is supported yet, the parser
+ * can be simplest.
+ */
+ opt_len = strlen(interpose_opt);
+ if ((opt_pos + opt_len) < strlen(path) &&
+ memcmp(&path[opt_pos], interpose_opt, opt_len) == 0) {
+ interpose = true;
+
+ if (!t->md->interpose)
+ t->md->interpose = true;
+ } else {
+ DMERR("Invalid devices extended options %s", path);
+ return -EINVAL;
+ }
+
+ ofs = opt_pos + opt_len + 1;
+ }
+
+ if (sscanf(&path[ofs], "%u:%u%c", &major, &minor, &dummy) == 2) {
/* Extract the major/minor numbers */
dev = MKDEV(major, minor);
if (MAJOR(dev) != major || MINOR(dev) != minor)
return -EOVERFLOW;
} else {
- dev = dm_get_dev_t(path);
+ dev = dm_get_dev_t(&path[ofs]);
if (!dev)
return -ENODEV;
}
@@ -391,7 +420,8 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
if (!dd)
return -ENOMEM;
- if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
+ r = dm_get_table_device(t->md, dev, mode, interpose, &dd->dm_dev);
+ if (r) {
kfree(dd);
return r;
}
@@ -401,14 +431,40 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
goto out;
} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
- r = upgrade_mode(dd, mode, t->md);
+ r = upgrade_mode(dd, mode, interpose, t->md);
if (r)
return r;
}
refcount_inc(&dd->count);
out:
+ if (interpose) {
+ struct block_device *original = dd->dm_dev->bdev;
+ /*
+ * Interposer target should cover all underlying device
+ */
+ if (ti->begin != 0) {
+ DMERR("%s: target offset should be zero for dm interposer",
+ dm_device_name(t->md));
+ r = -EINVAL;
+ goto fail;
+ }
+ if (bdev_nr_sectors(original) != ti->len) {
+ DMERR("%s: interposer and interposed block device size should be equal",
+ dm_device_name(t->md));
+ r = -EINVAL;
+ goto fail;
+ }
+ }
+
*result = dd->dm_dev;
return 0;
+fail:
+ if (refcount_dec_and_test(&dd->count)) {
+ dm_put_table_device(t->md, dd->dm_dev);
+ list_del(&dd->list);
+ kfree(dd);
+ }
+ return r;
}
EXPORT_SYMBOL(dm_get_device);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3f3be9408afa..04142454c4ee 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -149,6 +149,7 @@ EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr);
#define DMF_DEFERRED_REMOVE 6
#define DMF_SUSPENDED_INTERNALLY 7
#define DMF_POST_SUSPENDING 8
+#define DMF_INTERPOSER_ATTACHED 9
#define DM_NUMA_NODE NUMA_NO_NODE
static int dm_numa_node = DM_NUMA_NODE;
@@ -757,18 +758,24 @@ static int open_table_device(struct table_device *td, dev_t dev,
struct mapped_device *md)
{
struct block_device *bdev;
-
+ fmode_t mode = td->dm_dev.mode;
+ void *holder = NULL;
int r;
BUG_ON(td->dm_dev.bdev);
- bdev = blkdev_get_by_dev(dev, td->dm_dev.mode | FMODE_EXCL, _dm_claim_ptr);
+ if (!td->dm_dev.interpose) {
+ mode |= FMODE_EXCL;
+ holder = _dm_claim_ptr;
+ }
+
+ bdev = blkdev_get_by_dev(dev, mode, holder);
if (IS_ERR(bdev))
return PTR_ERR(bdev);
r = bd_link_disk_holder(bdev, dm_disk(md));
if (r) {
- blkdev_put(bdev, td->dm_dev.mode | FMODE_EXCL);
+ blkdev_put(bdev, mode);
return r;
}
@@ -782,11 +789,16 @@ static int open_table_device(struct table_device *td, dev_t dev,
*/
static void close_table_device(struct table_device *td, struct mapped_device *md)
{
+ fmode_t mode = td->dm_dev.mode;
+
if (!td->dm_dev.bdev)
return;
bd_unlink_disk_holder(td->dm_dev.bdev, dm_disk(md));
- blkdev_put(td->dm_dev.bdev, td->dm_dev.mode | FMODE_EXCL);
+ if (!td->dm_dev.interpose)
+ mode |= FMODE_EXCL;
+ blkdev_put(td->dm_dev.bdev, mode);
+
put_dax(td->dm_dev.dax_dev);
td->dm_dev.bdev = NULL;
td->dm_dev.dax_dev = NULL;
@@ -805,7 +817,7 @@ static struct table_device *find_table_device(struct list_head *l, dev_t dev,
}
int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
- struct dm_dev **result)
+ bool interpose, struct dm_dev **result)
{
int r;
struct table_device *td;
@@ -821,6 +833,7 @@ int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
td->dm_dev.mode = mode;
td->dm_dev.bdev = NULL;
+ td->dm_dev.interpose = interpose;
if ((r = open_table_device(td, dev, md))) {
mutex_unlock(&md->table_devices_lock);
@@ -1496,13 +1509,12 @@ static int __send_empty_flush(struct clone_info *ci)
static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
sector_t sector, unsigned *len)
{
- struct bio *bio = ci->bio;
struct dm_target_io *tio;
int r;
tio = alloc_tio(ci, ti, 0, GFP_NOIO);
tio->len_ptr = len;
- r = clone_bio(tio, bio, sector, *len);
+ r = clone_bio(tio, ci->bio, sector, *len);
if (r < 0) {
free_tio(tio);
return r;
@@ -1696,6 +1708,13 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
goto out;
}
+ /*
+ * If md is an interposer, then we must set the BIO_INTERPOSE flag
+ * so that the request is not re-interposed.
+ */
+ if (md->interpose)
+ bio_set_flag(bio, BIO_INTERPOSED);
+
/* If suspended, queue this IO for later */
if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
if (bio->bi_opf & REQ_NOWAIT)
@@ -2410,7 +2429,8 @@ static void dm_queue_flush(struct mapped_device *md)
*/
struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
{
- struct dm_table *live_map = NULL, *map = ERR_PTR(-EINVAL);
+ struct dm_table *live_map = NULL;
+ struct dm_table *map = ERR_PTR(-EINVAL);
struct queue_limits limits;
int r;
@@ -2453,26 +2473,50 @@ struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
* Functions to lock and unlock any filesystem running on the
* device.
*/
-static int lock_fs(struct mapped_device *md)
+static int lock_bdev_fs(struct mapped_device *md, struct block_device *bdev)
{
int r;
WARN_ON(test_bit(DMF_FROZEN, &md->flags));
- r = freeze_bdev(md->disk->part0);
+ r = freeze_bdev(bdev);
if (!r)
set_bit(DMF_FROZEN, &md->flags);
return r;
}
-static void unlock_fs(struct mapped_device *md)
+static void unlock_bdev_fs(struct mapped_device *md, struct block_device *bdev)
{
if (!test_bit(DMF_FROZEN, &md->flags))
return;
- thaw_bdev(md->disk->part0);
+ thaw_bdev(bdev);
clear_bit(DMF_FROZEN, &md->flags);
}
+static inline int lock_fs(struct mapped_device *md)
+{
+ return lock_bdev_fs(md, md->disk->part0);
+}
+
+static inline void unlock_fs(struct mapped_device *md)
+{
+ unlock_bdev_fs(md, md->disk->part0);
+}
+
+static inline struct block_device *get_interposed_bdev(struct dm_table *t)
+{
+ struct dm_dev_internal *dd;
+
+ /*
+ * For interposer should be only one device in dm table
+ */
+ list_for_each_entry(dd, dm_table_get_devices(t), list)
+ if (dd->dm_dev->interpose)
+ return bdgrab(dd->dm_dev->bdev);
+
+ return NULL;
+}
+
/*
* @suspend_flags: DM_SUSPEND_LOCKFS_FLAG and/or DM_SUSPEND_NOFLUSH_FLAG
* @task_state: e.g. TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
@@ -2488,7 +2532,10 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
{
bool do_lockfs = suspend_flags & DM_SUSPEND_LOCKFS_FLAG;
bool noflush = suspend_flags & DM_SUSPEND_NOFLUSH_FLAG;
- int r;
+ bool detach_ip = suspend_flags & DM_SUSPEND_DETACH_IP_FLAG
+ && md->interpose;
+ struct block_device *original_bdev = NULL;
+ int r = 0;
lockdep_assert_held(&md->suspend_lock);
@@ -2507,18 +2554,50 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
*/
dm_table_presuspend_targets(map);
+ if (!md->interpose) {
+ /*
+ * Flush I/O to the device.
+ * Any I/O submitted after lock_fs() may not be flushed.
+ * noflush takes precedence over do_lockfs.
+ * (lock_fs() flushes I/Os and waits for them to complete.)
+ */
+ if (!noflush && do_lockfs)
+ r = lock_fs(md);
+ } else if (map) {
+ /*
+ * Interposer should not lock mapped device, but
+ * should freeze interposed device and lock it.
+ */
+ original_bdev = get_interposed_bdev(map);
+ if (!original_bdev) {
+ r = -EINVAL;
+ DMERR("%s: interposer cannot get interposed device from table",
+ dm_device_name(md));
+ goto presuspend_undo;
+ }
+
+ if (!noflush && do_lockfs) {
+ r = lock_bdev_fs(md, original_bdev);
+ if (r) {
+ DMERR("%s: interposer cannot freeze interposed device",
+ dm_device_name(md));
+ goto presuspend_undo;
+ }
+ }
+
+ bdev_interposer_lock(original_bdev);
+ }
/*
- * Flush I/O to the device.
- * Any I/O submitted after lock_fs() may not be flushed.
- * noflush takes precedence over do_lockfs.
- * (lock_fs() flushes I/Os and waits for them to complete.)
+ * If map is not initialized, then we cannot suspend
+ * interposed device
*/
- if (!noflush && do_lockfs) {
- r = lock_fs(md);
- if (r) {
- dm_table_presuspend_undo_targets(map);
- return r;
- }
+
+presuspend_undo:
+ if (r) {
+ if (original_bdev)
+ bdput(original_bdev);
+ dm_table_presuspend_undo_targets(map);
+ return r;
}
/*
@@ -2559,14 +2638,40 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
if (map)
synchronize_srcu(&md->io_barrier);
- /* were we interrupted ? */
- if (r < 0) {
+ if (r == 0) { /* the wait ended successfully */
+ if (md->interpose && original_bdev) {
+ if (detach_ip) {
+ bdev_interposer_detach(original_bdev);
+ clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+ }
+
+ bdev_interposer_unlock(original_bdev);
+
+ if (detach_ip) {
+ /*
+ * If th interposer is detached, then there is
+ * no reason in keeping the queue of the
+ * interposed device stopped.
+ */
+ unlock_bdev_fs(md, original_bdev);
+ }
+
+ bdput(original_bdev);
+ }
+ } else { /* were we interrupted ? */
dm_queue_flush(md);
if (dm_request_based(md))
dm_start_queue(md->queue);
- unlock_fs(md);
+ if (md->interpose && original_bdev) {
+ bdev_interposer_unlock(original_bdev);
+ unlock_bdev_fs(md, original_bdev);
+
+ bdput(original_bdev);
+ } else
+ unlock_fs(md);
+
dm_table_presuspend_undo_targets(map);
/* pushback list is already flushed, so skip flush */
}
@@ -2574,6 +2679,88 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
return r;
}
+int __dm_attach_interposer(struct mapped_device *md)
+{
+ int r;
+ struct dm_table *map;
+ struct block_device *original_bdev = NULL;
+
+ if (dm_interposer_attached(md))
+ return 0;
+
+ map = rcu_dereference_protected(md->map,
+ lockdep_is_held(&md->suspend_lock));
+ if (!map) {
+ DMERR("%s: interposers table is not initialized",
+ dm_device_name(md));
+ return -EINVAL;
+ }
+
+ original_bdev = get_interposed_bdev(map);
+ if (!original_bdev) {
+ DMERR("%s: interposer cannot get interposed device from table",
+ dm_device_name(md));
+ return -EINVAL;
+ }
+
+ bdev_interposer_lock(original_bdev);
+
+ r = bdev_interposer_attach(original_bdev, dm_disk(md)->part0);
+ if (r)
+ DMERR("%s: failed to attach interposer",
+ dm_device_name(md));
+ else
+ set_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+
+ bdev_interposer_unlock(original_bdev);
+
+ unlock_bdev_fs(md, original_bdev);
+
+ bdput(original_bdev);
+
+ return r;
+}
+
+int __dm_detach_interposer(struct mapped_device *md)
+{
+ struct dm_table *map = NULL;
+ struct block_device *original_bdev;
+
+ if (!dm_interposer_attached(md))
+ return 0;
+ /*
+ * If mapped device is suspended, but should be detached
+ * we just detach without freeze fs on interposed device.
+ */
+ map = rcu_dereference_protected(md->map,
+ lockdep_is_held(&md->suspend_lock));
+ if (!map) {
+ /*
+ * If table is not initialized then interposed device
+ * cannot be attached
+ */
+ DMERR("%s: table is not initialized for device",
+ dm_device_name(md));
+ return -EINVAL;
+ }
+
+ original_bdev = get_interposed_bdev(map);
+ if (!original_bdev) {
+ DMERR("%s: interposer cannot get interposed device from table",
+ dm_device_name(md));
+ return -EINVAL;
+ }
+
+ bdev_interposer_lock(original_bdev);
+
+ bdev_interposer_detach(original_bdev);
+ clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+
+ bdev_interposer_unlock(original_bdev);
+
+ bdput(original_bdev);
+ return 0;
+}
/*
* We need to be able to change a mapping table under a mounted
* filesystem. For example we might want to move some data in
@@ -2599,7 +2786,11 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
if (dm_suspended_md(md)) {
- r = -EINVAL;
+ if (suspend_flags & DM_SUSPEND_DETACH_IP_FLAG)
+ r = __dm_detach_interposer(md);
+ else
+ r = -EINVAL;
+
goto out_unlock;
}
@@ -2645,8 +2836,10 @@ static int __dm_resume(struct mapped_device *md, struct dm_table *map)
if (dm_request_based(md))
dm_start_queue(md->queue);
- unlock_fs(md);
+ if (md->interpose)
+ return __dm_attach_interposer(md);
+ unlock_fs(md);
return 0;
}
@@ -2880,6 +3073,11 @@ int dm_suspended_md(struct mapped_device *md)
return test_bit(DMF_SUSPENDED, &md->flags);
}
+int dm_interposer_attached(struct mapped_device *md)
+{
+ return test_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
+}
+
static int dm_post_suspending_md(struct mapped_device *md)
{
return test_bit(DMF_POST_SUSPENDING, &md->flags);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index b441ad772c18..35f71e48abd1 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -28,6 +28,7 @@
*/
#define DM_SUSPEND_LOCKFS_FLAG (1 << 0)
#define DM_SUSPEND_NOFLUSH_FLAG (1 << 1)
+#define DM_SUSPEND_DETACH_IP_FLAG (1 << 2)
/*
* Status feature flags
@@ -122,6 +123,11 @@ int dm_deleting_md(struct mapped_device *md);
*/
int dm_suspended_md(struct mapped_device *md);
+/*
+ * Is the interposer of this mapped_device is attached?
+ */
+int dm_interposer_attached(struct mapped_device *md);
+
/*
* Internal suspend and resume methods.
*/
@@ -180,7 +186,7 @@ int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only
int dm_cancel_deferred_remove(struct mapped_device *md);
int dm_request_based(struct mapped_device *md);
int dm_get_table_device(struct mapped_device *md, dev_t dev, fmode_t mode,
- struct dm_dev **result);
+ bool interpose, struct dm_dev **result);
void dm_put_table_device(struct mapped_device *md, struct dm_dev *d);
int dm_kobject_uevent(struct mapped_device *md, enum kobject_action action,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 5c641f930caf..3a7abb347702 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -159,6 +159,7 @@ struct dm_dev {
struct block_device *bdev;
struct dax_device *dax_dev;
fmode_t mode;
+ bool interpose;
char name[16];
};
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index fcff6669137b..73a5b712cd0d 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -362,4 +362,10 @@ enum {
*/
#define DM_INTERNAL_SUSPEND_FLAG (1 << 18) /* Out */
+/*
+ * If set, the underlying device should open without FMODE_EXCL
+ * and attach mapped device via bdev_interposer.
+ */
+#define DM_INTERPOSE_FLAG (1 << 19) /* In/Out */
+
#endif /* _LINUX_DM_IOCTL_H */
--
2.20.1
Hi Sergei,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on block/for-next]
[also build test WARNING on hch-configfs/for-next v5.12-rc6]
[cannot apply to dm/for-next next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Sergei-Shtepa/block-device-interposer/20210409-194943
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: arm-randconfig-r025-20210409 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project dd453a1389b6a7e6d9214b449d3c54981b1a89b6)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/0day-ci/linux/commit/df79fb333cb0a1263a1f03f54de425507e3c2238
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Sergei-Shtepa/block-device-interposer/20210409-194943
git checkout df79fb333cb0a1263a1f03f54de425507e3c2238
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> drivers/md/dm.c:2682:5: warning: no previous prototype for function '__dm_attach_interposer' [-Wmissing-prototypes]
int __dm_attach_interposer(struct mapped_device *md)
^
drivers/md/dm.c:2682:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int __dm_attach_interposer(struct mapped_device *md)
^
static
>> drivers/md/dm.c:2724:5: warning: no previous prototype for function '__dm_detach_interposer' [-Wmissing-prototypes]
int __dm_detach_interposer(struct mapped_device *md)
^
drivers/md/dm.c:2724:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int __dm_detach_interposer(struct mapped_device *md)
^
static
2 warnings generated.
vim +/__dm_attach_interposer +2682 drivers/md/dm.c
2681
> 2682 int __dm_attach_interposer(struct mapped_device *md)
2683 {
2684 int r;
2685 struct dm_table *map;
2686 struct block_device *original_bdev = NULL;
2687
2688 if (dm_interposer_attached(md))
2689 return 0;
2690
2691 map = rcu_dereference_protected(md->map,
2692 lockdep_is_held(&md->suspend_lock));
2693 if (!map) {
2694 DMERR("%s: interposers table is not initialized",
2695 dm_device_name(md));
2696 return -EINVAL;
2697 }
2698
2699 original_bdev = get_interposed_bdev(map);
2700 if (!original_bdev) {
2701 DMERR("%s: interposer cannot get interposed device from table",
2702 dm_device_name(md));
2703 return -EINVAL;
2704 }
2705
2706 bdev_interposer_lock(original_bdev);
2707
2708 r = bdev_interposer_attach(original_bdev, dm_disk(md)->part0);
2709 if (r)
2710 DMERR("%s: failed to attach interposer",
2711 dm_device_name(md));
2712 else
2713 set_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
2714
2715 bdev_interposer_unlock(original_bdev);
2716
2717 unlock_bdev_fs(md, original_bdev);
2718
2719 bdput(original_bdev);
2720
2721 return r;
2722 }
2723
> 2724 int __dm_detach_interposer(struct mapped_device *md)
2725 {
2726 struct dm_table *map = NULL;
2727 struct block_device *original_bdev;
2728
2729 if (!dm_interposer_attached(md))
2730 return 0;
2731 /*
2732 * If mapped device is suspended, but should be detached
2733 * we just detach without freeze fs on interposed device.
2734 */
2735 map = rcu_dereference_protected(md->map,
2736 lockdep_is_held(&md->suspend_lock));
2737 if (!map) {
2738 /*
2739 * If table is not initialized then interposed device
2740 * cannot be attached
2741 */
2742 DMERR("%s: table is not initialized for device",
2743 dm_device_name(md));
2744 return -EINVAL;
2745 }
2746
2747 original_bdev = get_interposed_bdev(map);
2748 if (!original_bdev) {
2749 DMERR("%s: interposer cannot get interposed device from table",
2750 dm_device_name(md));
2751 return -EINVAL;
2752 }
2753
2754 bdev_interposer_lock(original_bdev);
2755
2756 bdev_interposer_detach(original_bdev);
2757 clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
2758
2759 bdev_interposer_unlock(original_bdev);
2760
2761 bdput(original_bdev);
2762 return 0;
2763 }
2764 /*
2765 * We need to be able to change a mapping table under a mounted
2766 * filesystem. For example we might want to move some data in
2767 * the background. Before the table can be swapped with
2768 * dm_bind_table, dm_suspend must be called to flush any in
2769 * flight bios and ensure that any further io gets deferred.
2770 */
2771 /*
2772 * Suspend mechanism in request-based dm.
2773 *
2774 * 1. Flush all I/Os by lock_fs() if needed.
2775 * 2. Stop dispatching any I/O by stopping the request_queue.
2776 * 3. Wait for all in-flight I/Os to be completed or requeued.
2777 *
2778 * To abort suspend, start the request_queue.
2779 */
2780 int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
2781 {
2782 struct dm_table *map = NULL;
2783 int r = 0;
2784
2785 retry:
2786 mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
2787
2788 if (dm_suspended_md(md)) {
2789 if (suspend_flags & DM_SUSPEND_DETACH_IP_FLAG)
2790 r = __dm_detach_interposer(md);
2791 else
2792 r = -EINVAL;
2793
2794 goto out_unlock;
2795 }
2796
2797 if (dm_suspended_internally_md(md)) {
2798 /* already internally suspended, wait for internal resume */
2799 mutex_unlock(&md->suspend_lock);
2800 r = wait_on_bit(&md->flags, DMF_SUSPENDED_INTERNALLY, TASK_INTERRUPTIBLE);
2801 if (r)
2802 return r;
2803 goto retry;
2804 }
2805
2806 map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
2807
2808 r = __dm_suspend(md, map, suspend_flags, TASK_INTERRUPTIBLE, DMF_SUSPENDED);
2809 if (r)
2810 goto out_unlock;
2811
2812 set_bit(DMF_POST_SUSPENDING, &md->flags);
2813 dm_table_postsuspend_targets(map);
2814 clear_bit(DMF_POST_SUSPENDING, &md->flags);
2815
2816 out_unlock:
2817 mutex_unlock(&md->suspend_lock);
2818 return r;
2819 }
2820
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Sergei,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on block/for-next]
[also build test WARNING on hch-configfs/for-next v5.12-rc6]
[cannot apply to dm/for-next next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Sergei-Shtepa/block-device-interposer/20210409-194943
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-randconfig-m021-20210409 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# https://github.com/0day-ci/linux/commit/df79fb333cb0a1263a1f03f54de425507e3c2238
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Sergei-Shtepa/block-device-interposer/20210409-194943
git checkout df79fb333cb0a1263a1f03f54de425507e3c2238
# save the attached .config to linux build tree
make W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> drivers/md/dm.c:2682:5: warning: no previous prototype for '__dm_attach_interposer' [-Wmissing-prototypes]
2682 | int __dm_attach_interposer(struct mapped_device *md)
| ^~~~~~~~~~~~~~~~~~~~~~
>> drivers/md/dm.c:2724:5: warning: no previous prototype for '__dm_detach_interposer' [-Wmissing-prototypes]
2724 | int __dm_detach_interposer(struct mapped_device *md)
| ^~~~~~~~~~~~~~~~~~~~~~
vim +/__dm_attach_interposer +2682 drivers/md/dm.c
2681
> 2682 int __dm_attach_interposer(struct mapped_device *md)
2683 {
2684 int r;
2685 struct dm_table *map;
2686 struct block_device *original_bdev = NULL;
2687
2688 if (dm_interposer_attached(md))
2689 return 0;
2690
2691 map = rcu_dereference_protected(md->map,
2692 lockdep_is_held(&md->suspend_lock));
2693 if (!map) {
2694 DMERR("%s: interposers table is not initialized",
2695 dm_device_name(md));
2696 return -EINVAL;
2697 }
2698
2699 original_bdev = get_interposed_bdev(map);
2700 if (!original_bdev) {
2701 DMERR("%s: interposer cannot get interposed device from table",
2702 dm_device_name(md));
2703 return -EINVAL;
2704 }
2705
2706 bdev_interposer_lock(original_bdev);
2707
2708 r = bdev_interposer_attach(original_bdev, dm_disk(md)->part0);
2709 if (r)
2710 DMERR("%s: failed to attach interposer",
2711 dm_device_name(md));
2712 else
2713 set_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
2714
2715 bdev_interposer_unlock(original_bdev);
2716
2717 unlock_bdev_fs(md, original_bdev);
2718
2719 bdput(original_bdev);
2720
2721 return r;
2722 }
2723
> 2724 int __dm_detach_interposer(struct mapped_device *md)
2725 {
2726 struct dm_table *map = NULL;
2727 struct block_device *original_bdev;
2728
2729 if (!dm_interposer_attached(md))
2730 return 0;
2731 /*
2732 * If mapped device is suspended, but should be detached
2733 * we just detach without freeze fs on interposed device.
2734 */
2735 map = rcu_dereference_protected(md->map,
2736 lockdep_is_held(&md->suspend_lock));
2737 if (!map) {
2738 /*
2739 * If table is not initialized then interposed device
2740 * cannot be attached
2741 */
2742 DMERR("%s: table is not initialized for device",
2743 dm_device_name(md));
2744 return -EINVAL;
2745 }
2746
2747 original_bdev = get_interposed_bdev(map);
2748 if (!original_bdev) {
2749 DMERR("%s: interposer cannot get interposed device from table",
2750 dm_device_name(md));
2751 return -EINVAL;
2752 }
2753
2754 bdev_interposer_lock(original_bdev);
2755
2756 bdev_interposer_detach(original_bdev);
2757 clear_bit(DMF_INTERPOSER_ATTACHED, &md->flags);
2758
2759 bdev_interposer_unlock(original_bdev);
2760
2761 bdput(original_bdev);
2762 return 0;
2763 }
2764 /*
2765 * We need to be able to change a mapping table under a mounted
2766 * filesystem. For example we might want to move some data in
2767 * the background. Before the table can be swapped with
2768 * dm_bind_table, dm_suspend must be called to flush any in
2769 * flight bios and ensure that any further io gets deferred.
2770 */
2771 /*
2772 * Suspend mechanism in request-based dm.
2773 *
2774 * 1. Flush all I/Os by lock_fs() if needed.
2775 * 2. Stop dispatching any I/O by stopping the request_queue.
2776 * 3. Wait for all in-flight I/Os to be completed or requeued.
2777 *
2778 * To abort suspend, start the request_queue.
2779 */
2780 int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
2781 {
2782 struct dm_table *map = NULL;
2783 int r = 0;
2784
2785 retry:
2786 mutex_lock_nested(&md->suspend_lock, SINGLE_DEPTH_NESTING);
2787
2788 if (dm_suspended_md(md)) {
2789 if (suspend_flags & DM_SUSPEND_DETACH_IP_FLAG)
2790 r = __dm_detach_interposer(md);
2791 else
2792 r = -EINVAL;
2793
2794 goto out_unlock;
2795 }
2796
2797 if (dm_suspended_internally_md(md)) {
2798 /* already internally suspended, wait for internal resume */
2799 mutex_unlock(&md->suspend_lock);
2800 r = wait_on_bit(&md->flags, DMF_SUSPENDED_INTERNALLY, TASK_INTERRUPTIBLE);
2801 if (r)
2802 return r;
2803 goto retry;
2804 }
2805
2806 map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
2807
2808 r = __dm_suspend(md, map, suspend_flags, TASK_INTERRUPTIBLE, DMF_SUSPENDED);
2809 if (r)
2810 goto out_unlock;
2811
2812 set_bit(DMF_POST_SUSPENDING, &md->flags);
2813 dm_table_postsuspend_targets(map);
2814 clear_bit(DMF_POST_SUSPENDING, &md->flags);
2815
2816 out_unlock:
2817 mutex_unlock(&md->suspend_lock);
2818 return r;
2819 }
2820
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On Fri, Apr 09 2021 at 7:48am -0400,
Sergei Shtepa <[email protected]> wrote:
> I think I'm ready to suggest the next version of block device interposer
> (blk_interposer). It allows to redirect bio requests to other block
> devices.
>
> In this series of patches, I reviewed the process of attaching and
> detaching device mapper via blk_interposer.
>
> Now the dm-target is attached to the interposed block device when the
> interposer dm-target is fully ready to accept requests, and the interposed
> block device queue is locked, and the file system on it is frozen.
> The detaching is also performed when the file system on the interposed
> block device is in a frozen state, the queue is locked, and the interposer
> dm-target is suspended.
>
> To make it possible to lock the receipt of new bio requests without locking
> the processing of bio requests that the interposer creates, I had to change
> the submit_bio_noacct() function and add a lock. To minimize the impact of
> locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
> afraid it's impossible.
>
> Checking the operation of the interposer, I did not limit myself to
> a simple dm-linear. When I experimented with dm-era, I noticed that it
> accepts two block devices. Since Mike was against changing the logic in
> the dm-targets itself to support the interrupter, I decided to add the
> [interpose] option to the block device path.
>
> echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
> dmsetup create dm-era --interpose
>
> I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
> we can assume that if the device cannot be opened with the FMODE_EXCL,
> then it is considered an interposed device, but it seems to me that
> algorithm is unsafe. I hope to get Mike's opinion on this.
>
> I have successfully tried taking snapshots. But I ran into a problem
> when I removed origin-target:
> [ 49.031156] ------------[ cut here ]------------
> [ 49.031180] kernel BUG at block/bio.c:1476!
> [ 49.031198] invalid opcode: 0000 [#1] SMP NOPTI
> [ 49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: G E 5.12.0-rc6-ip+ #52
> [ 49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 49.031257] RIP: 0010:bio_split+0x74/0x80
> [ 49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
> [ 49.031322] RSP: 0018:ffff9a6100993ab0 EFLAGS: 00010246
> [ 49.031337] RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8e26938f96d8
> [ 49.031357] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff8e26937d1300
> [ 49.031375] RBP: ffff8e2692ddc000 R08: 0000000000000000 R09: 0000000000000000
> [ 49.031394] R10: ffff8e2692b1de00 R11: ffff8e2692b1de58 R12: ffff8e26937d1300
> [ 49.031413] R13: ffff8e2692ddcd18 R14: ffff8e2691d22140 R15: ffff8e26937d1300
> [ 49.031432] FS: 00007efffa6e7800(0000) GS:ffff8e269bc80000(0000) knlGS:0000000000000000
> [ 49.031453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 49.031470] CR2: 00007efffa96cda0 CR3: 0000000114bd0000 CR4: 00000000000506e0
> [ 49.031490] Call Trace:
> [ 49.031501] dm_submit_bio+0x383/0x500 [dm_mod]
> [ 49.031522] submit_bio_noacct+0x370/0x770
> [ 49.031537] submit_bh_wbc+0x160/0x190
> [ 49.031550] __sync_dirty_buffer+0x65/0x130
> [ 49.031564] ext4_commit_super+0xbc/0x120 [ext4]
> [ 49.031602] ext4_freeze+0x54/0x80 [ext4]
> [ 49.031631] freeze_super+0xc8/0x160
> [ 49.031643] freeze_bdev+0xb2/0xc0
> [ 49.031654] lock_bdev_fs+0x1c/0x30 [dm_mod]
> [ 49.031671] __dm_suspend+0x2b9/0x3b0 [dm_mod]
> [ 49.032095] dm_suspend+0xed/0x160 [dm_mod]
> [ 49.032496] ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
> [ 49.032897] ? remove_all+0x30/0x30 [dm_mod]
> [ 49.033299] dev_remove+0x4c/0x1c0 [dm_mod]
> [ 49.033679] ctl_ioctl+0x1a5/0x470 [dm_mod]
> [ 49.034067] dm_ctl_ioctl+0xa/0x10 [dm_mod]
> [ 49.034432] __x64_sys_ioctl+0x83/0xb0
> [ 49.034785] do_syscall_64+0x33/0x80
> [ 49.035139] entry_SYSCALL_64_after_hwframe+0x44/0xae
> When suspend is executed for origin-target before the interposer is
> being detached, in the origin_map() function the value of the
> o->split_binary variable is zero, since no snapshots were connected to it.
> I think that if no snapshots are connected, then it does not make sense
> to split the bio request into parts.
The dm-snapshot code requires careful order of operations. You say you
removed the origin target.. please show exactly what you did. Your 4th
patch shouldn't be tied to this patchset. Can be dealt with
independently.
> Changes summary for this patchset v7:
> * The attaching and detaching to interposed device moved to
> __dm_suspend() and __dm_resume() functions.
Why? Those hooks are inherently more constrained. And in the case of
resume, failure is not an option.
> * Redesigned th submit_bio_noacct() function and added a lock for the
> block device interposer.
> * Adds [interpose] option to block device patch in dm table.
I'm struggling to see why you need "[interpose]" (never mind that this
idea of device options is a new construct): what are the implications?
Are you saying that a table will have N devices with only a subset that
are interposed?
Just feels very awkward but I'll try to keep an open mind until I can
better understand.
> * Fix origin_map() then o->split_binary value is zero.
Overall this effort, while appreciated in general, is getting more and
more muddled -- you're having to sprinkle obscure code all over DM. And
your patch headers are severely lacking for a v8 patch
submission. Terse bullet points don't paint a very comprehensive
picture. Please detail how a user is expected to drive this (either in
patch headers and/or some Documentation file).
Mike
Hi Sergei,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on block/for-next]
[also build test WARNING on hch-configfs/for-next v5.12-rc6]
[cannot apply to dm/for-next next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Sergei-Shtepa/block-device-interposer/20210409-194943
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-s032-20210409 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.3-279-g6d5d9b42-dirty
# https://github.com/0day-ci/linux/commit/df79fb333cb0a1263a1f03f54de425507e3c2238
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Sergei-Shtepa/block-device-interposer/20210409-194943
git checkout df79fb333cb0a1263a1f03f54de425507e3c2238
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> drivers/md/dm-table.c:337:63: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected restricted fmode_t [usertype] mode @@ got bool [usertype] interpose @@
drivers/md/dm-table.c:337:63: sparse: expected restricted fmode_t [usertype] mode
drivers/md/dm-table.c:337:63: sparse: got bool [usertype] interpose
--
>> drivers/md/dm.c:2682:5: sparse: sparse: symbol '__dm_attach_interposer' was not declared. Should it be static?
>> drivers/md/dm.c:2724:5: sparse: sparse: symbol '__dm_detach_interposer' was not declared. Should it be static?
Please review and possibly fold the followup patch.
vim +337 drivers/md/dm-table.c
322
323 /*
324 * This upgrades the mode on an already open dm_dev, being
325 * careful to leave things as they were if we fail to reopen the
326 * device and not to touch the existing bdev field in case
327 * it is accessed concurrently.
328 */
329 static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
330 bool interpose, struct mapped_device *md)
331 {
332 int r;
333 struct dm_dev *old_dev, *new_dev;
334
335 old_dev = dd->dm_dev;
336
> 337 r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, interpose,
338 dd->dm_dev->mode | new_mode, &new_dev);
339 if (r)
340 return r;
341
342 dd->dm_dev = new_dev;
343 dm_put_table_device(md, old_dev);
344
345 return 0;
346 }
347
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: kernel test robot <[email protected]>
---
dm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 04142454c4eed..2a584c2103f3a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2679,7 +2679,7 @@ static int __dm_suspend(struct mapped_device *md, struct dm_table *map,
return r;
}
-int __dm_attach_interposer(struct mapped_device *md)
+static int __dm_attach_interposer(struct mapped_device *md)
{
int r;
struct dm_table *map;
@@ -2721,7 +2721,7 @@ int __dm_attach_interposer(struct mapped_device *md)
return r;
}
-int __dm_detach_interposer(struct mapped_device *md)
+static int __dm_detach_interposer(struct mapped_device *md)
{
struct dm_table *map = NULL;
struct block_device *original_bdev;
The 04/09/2021 18:23, Mike Snitzer wrote:
> On Fri, Apr 09 2021 at 7:48am -0400,
> Sergei Shtepa <[email protected]> wrote:
>
> > I think I'm ready to suggest the next version of block device interposer
> > (blk_interposer). It allows to redirect bio requests to other block
> > devices.
> >
> > In this series of patches, I reviewed the process of attaching and
> > detaching device mapper via blk_interposer.
> >
> > Now the dm-target is attached to the interposed block device when the
> > interposer dm-target is fully ready to accept requests, and the interposed
> > block device queue is locked, and the file system on it is frozen.
> > The detaching is also performed when the file system on the interposed
> > block device is in a frozen state, the queue is locked, and the interposer
> > dm-target is suspended.
> >
> > To make it possible to lock the receipt of new bio requests without locking
> > the processing of bio requests that the interposer creates, I had to change
> > the submit_bio_noacct() function and add a lock. To minimize the impact of
> > locking, I chose percpu_rw_sem. I tried to do without a new lock, but I'm
> > afraid it's impossible.
> >
> > Checking the operation of the interposer, I did not limit myself to
> > a simple dm-linear. When I experimented with dm-era, I noticed that it
> > accepts two block devices. Since Mike was against changing the logic in
> > the dm-targets itself to support the interrupter, I decided to add the
> > [interpose] option to the block device path.
> >
> > echo "0 ${DEV_SZ} era ${META} [interpose]${DEV} ${BLK_SZ}" | \
> > dmsetup create dm-era --interpose
> >
> > I believe this option can replace the DM_INTERPOSE_FLAG flag. Of course,
> > we can assume that if the device cannot be opened with the FMODE_EXCL,
> > then it is considered an interposed device, but it seems to me that
> > algorithm is unsafe. I hope to get Mike's opinion on this.
> >
> > I have successfully tried taking snapshots. But I ran into a problem
> > when I removed origin-target:
> > [ 49.031156] ------------[ cut here ]------------
> > [ 49.031180] kernel BUG at block/bio.c:1476!
> > [ 49.031198] invalid opcode: 0000 [#1] SMP NOPTI
> > [ 49.031213] CPU: 9 PID: 636 Comm: dmsetup Tainted: G E 5.12.0-rc6-ip+ #52
> > [ 49.031235] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > [ 49.031257] RIP: 0010:bio_split+0x74/0x80
> > [ 49.031273] Code: 89 c7 e8 5f 56 03 00 41 8b 74 24 28 48 89 ef e8 12 ea ff ff f6 45 15 01 74 08 66 41 81 4c 24 14 00 01 4c 89 e0 5b 5d 41 5c c3 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb ed 90 0f 1f 44 00 00 39 77 28 76 05
> > [ 49.031322] RSP: 0018:ffff9a6100993ab0 EFLAGS: 00010246
> > [ 49.031337] RAX: 0000000000000008 RBX: 0000000000000000 RCX: ffff8e26938f96d8
> > [ 49.031357] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff8e26937d1300
> > [ 49.031375] RBP: ffff8e2692ddc000 R08: 0000000000000000 R09: 0000000000000000
> > [ 49.031394] R10: ffff8e2692b1de00 R11: ffff8e2692b1de58 R12: ffff8e26937d1300
> > [ 49.031413] R13: ffff8e2692ddcd18 R14: ffff8e2691d22140 R15: ffff8e26937d1300
> > [ 49.031432] FS: 00007efffa6e7800(0000) GS:ffff8e269bc80000(0000) knlGS:0000000000000000
> > [ 49.031453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 49.031470] CR2: 00007efffa96cda0 CR3: 0000000114bd0000 CR4: 00000000000506e0
> > [ 49.031490] Call Trace:
> > [ 49.031501] dm_submit_bio+0x383/0x500 [dm_mod]
> > [ 49.031522] submit_bio_noacct+0x370/0x770
> > [ 49.031537] submit_bh_wbc+0x160/0x190
> > [ 49.031550] __sync_dirty_buffer+0x65/0x130
> > [ 49.031564] ext4_commit_super+0xbc/0x120 [ext4]
> > [ 49.031602] ext4_freeze+0x54/0x80 [ext4]
> > [ 49.031631] freeze_super+0xc8/0x160
> > [ 49.031643] freeze_bdev+0xb2/0xc0
> > [ 49.031654] lock_bdev_fs+0x1c/0x30 [dm_mod]
> > [ 49.031671] __dm_suspend+0x2b9/0x3b0 [dm_mod]
> > [ 49.032095] dm_suspend+0xed/0x160 [dm_mod]
> > [ 49.032496] ? __find_device_hash_cell+0x5b/0x2a0 [dm_mod]
> > [ 49.032897] ? remove_all+0x30/0x30 [dm_mod]
> > [ 49.033299] dev_remove+0x4c/0x1c0 [dm_mod]
> > [ 49.033679] ctl_ioctl+0x1a5/0x470 [dm_mod]
> > [ 49.034067] dm_ctl_ioctl+0xa/0x10 [dm_mod]
> > [ 49.034432] __x64_sys_ioctl+0x83/0xb0
> > [ 49.034785] do_syscall_64+0x33/0x80
> > [ 49.035139] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > When suspend is executed for origin-target before the interposer is
> > being detached, in the origin_map() function the value of the
> > o->split_binary variable is zero, since no snapshots were connected to it.
> > I think that if no snapshots are connected, then it does not make sense
> > to split the bio request into parts.
>
> The dm-snapshot code requires careful order of operations. You say you
> removed the origin target.. please show exactly what you did. Your 4th
> patch shouldn't be tied to this patchset. Can be dealt with
> independently.
To create a snapshot, the snapshot-origin from dm-snap must be connected to
the device. I do it like this:
DEV=/dev/nvme0n1p2
DEV_SZ=$(blockdev --getsz ${DEV})
echo "0 ${DEV_SZ} snapshot-origin [interpose]${DEV}" | \
dmsetup create dm-origin --interpose
Next, I create the snapshot itself and mount it:
META=/dev/nvme0n1p1
ORIGIN=/dev/mapper/dm-origin
echo "0 ${DEV_SZ} snapshot ${ORIGIN} ${META} N 8" | \
dmsetup create dm-snapshot
mount /dev/mapper/dm-snapshot /mnt/snap
Releasing the snapshot:
umount /mnt/snap
dmsetup remove dm-snapshot
Remove snapshot-origin:
dmsetup remove dm-origin
I think it's hard to make a mistake here, although the documentation describes
creating a snapshot using lvcreate, not dmsetup.
As for the fourth patch - I agree - this should be the next step, after the
idea of the interposer is accepted.
>
> > Changes summary for this patchset v7:
> > * The attaching and detaching to interposed device moved to
> > __dm_suspend() and __dm_resume() functions.
>
> Why? Those hooks are inherently more constrained. And in the case of
> resume, failure is not an option.
>
> > * Redesigned th submit_bio_noacct() function and added a lock for the
> > block device interposer.
> > * Adds [interpose] option to block device patch in dm table.
>
> I'm struggling to see why you need "[interpose]" (never mind that this
> idea of device options is a new construct): what are the implications?
> Are you saying that a table will have N devices with only a subset that
> are interposed?
I'm analyzing how dmsetup works with strace. I get something like this
diagram:
ioctl(3, DM_VERSION, ...
...
read(0, "0 14675935 snapshot-origin [inte"..., 4096) = 53
...
ioctl(3, DM_DEV_CREATE, ...
ioctl(3, DM_TABLE_LOAD, ...
ioctl(3, DM_DEV_SUSPEND, ...
ioctl DM_DEV_SUSPEND without the DM_SUSPEND_FLAG flag, which means that the
do_resume function is started. It turns out that only after the do_resume
DM target works, the target becomes ready to work.
Before that, we cannot attach the interposer, as the bio will not be
successfully processed by the interposer. It turns out that it is at the
final stage of initialization that we can safely connect the interposer.
Note that a special DMF_INTERPOSER_ATTACHED flag was provided, this allows
to connect the interposer only at the first resume.
When removing a DM target, there is a requirement that the DM target is closed
and not used by anyone. This ensures that no new bio requests to the DM target
will be received. But in the case of the interposer, this is not enough.
We need to lock the queue of the original block device and wait for all
previously created bio requests to complete. To do this, run dm_suspend() with
the DM_SUSPEND_DETACH_IP_FLAG flag.
If we look at the do_resume() function, it can finish with an error code for
various malfunctions and their processing is provided.
>
> Just feels very awkward but I'll try to keep an open mind until I can
> better understand.
Ok. Let's look at a simple example. We need to attach the dm-era using the
interposer. This target uses two devices ${DEV} and ${META}.
echo "0 `blockdev --getsz ${DEV}` era ${META} ${DEV} 128" | \
dmsetup create dm-era --interpose
The ${DEV} device needs to be attached using a interposer, while ${META} is
used to output the result of the module and must be opened in FMODE_EXCL mode.
It turns out that only the dm-target itself depends on which of the devices
can be connected via the interposer, and which can not. The [interpose] option
allows to explicitly specify this.
I don't really like the design with the [interpret] option either.
I think it's best to change the dm_get_device() call and explicitly specify
which device to open via the interposer. It depends on the DM target itself
whether it supports connection via the interposer and for which devices.
This would make the code more visual. But to do this, we will need to change
one line in each existing dm_target. You have already spoken out against
a similar decision. But maybe can you look at this solution again?
It will look something like this:
diff --git a/drivers/md/dm-era-target.c b/drivers/md/dm-era-target.c
index d9ac7372108c..461fd7656751 100644
--- a/drivers/md/dm-era-target.c
+++ b/drivers/md/dm-era-target.c
@@ -1455,14 +1455,16 @@ static int era_ctr(struct dm_target *ti, unsigned argc, char **argv)
era->ti = ti;
- r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, &era->metadata_dev);
+ r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, false, &era->metadata_dev);
if (r) {
ti->error = "Error opening metadata device";
era_destroy(era);
return -EINVAL;
}
- r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &era->origin_dev);
+ r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, ti->table->md->interpose, &era->origin_dev);
if (r) {
ti->error = "Error opening data device";
era_destroy(era);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index e5f0f1703c5d..dc08e9b0c2fc 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -327,14 +327,14 @@ static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
* it is accessed concurrently.
*/
static int upgrade_mode(struct dm_dev_internal *dd, fmode_t new_mode,
- struct mapped_device *md)
+ bool interpose, struct mapped_device *md)
{
int r;
struct dm_dev *old_dev, *new_dev;
old_dev = dd->dm_dev;
- r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev,
+ r = dm_get_table_device(md, dd->dm_dev->bdev->bd_dev, interpose,
dd->dm_dev->mode | new_mode, &new_dev);
if (r)
return r;
@@ -363,7 +363,7 @@ EXPORT_SYMBOL_GPL(dm_get_dev_t);
* it's already present.
*/
int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
- struct dm_dev **result)
+ bool interpose, struct dm_dev **result)
{
int r;
dev_t dev;
@@ -391,7 +391,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
if (!dd)
return -ENOMEM;
- if ((r = dm_get_table_device(t->md, dev, mode, &dd->dm_dev))) {
+ if ((r = dm_get_table_device(t->md, dev, mode, interpose, &dd->dm_dev))) {
kfree(dd);
return r;
}
@@ -401,7 +401,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
goto out;
} else if (dd->dm_dev->mode != (mode | dd->dm_dev->mode)) {
- r = upgrade_mode(dd, mode, t->md);
+ r = upgrade_mode(dd, mode, interpose, t->md);
if (r)
return r;
}
>
> > * Fix origin_map() then o->split_binary value is zero.
>
> Overall this effort, while appreciated in general, is getting more and
> more muddled -- you're having to sprinkle obscure code all over DM. And
> your patch headers are severely lacking for a v8 patch
> submission. Terse bullet points don't paint a very comprehensive
> picture. Please detail how a user is expected to drive this (either in
> patch headers and/or some Documentation file).
>
> Mike
>
Mike, thank you for appreciating my efforts. I'm getting deeper and deeper
into the DM code and trying to add new functionality to it in a harmonious way.
When discussing the previous patch, you were quite right to say that the
connection and disconnection of the interposer was not safe, although the code
looked simpler.
This time I tried to understand in detail the processes of creating and
removing DM targets. I tried to write the code as simple as possible and added
comments to make it as easy as possible to understand.
I think it would be great if you would indicate which code you found
"more muddled". I will be happy to rewrite it or give additional comments.
How a user is expected to drive this - I'll try to describe it in this email.
If this text suits you, I will create documentation based on it in the future.
At the current stage, I would not want to distract the people who are engaged
in checking the documentation, so as not to throw their work into the trash bin
as it happened with the documentation for the v4 patch.
===================
DM & blk_interposer
===================
Usually LVM should be used for new devices. The administrator have to create
logical volumes for the system partition when installing the operating system.
For a running system with partitioned disk space and mounted file systems,
it is quite difficult to reconfigure to logical volumes. As a result, all
the features that Device Mapper provides are not available for non-LVM systems.
This problem is partially solved by the interposer functionality, which uses
the kernel's blk_interposer.
Blk_interposer it allows to redirect bio requests from ordinal block devices
to DM target. It allows to attach interposer to original device "on the fly"
without stopping the execution of users programs.
Interposer for dm-flakey
========================
In a classic dm-flakey application, the /dev/sda1 device must be unmounted.
We have to create a new block device /dev/mapper/test and mount it. ::
echo "0 `blockdev --getsz /dev/sda1` flakey /dev/sda1 0 1 3" | \
dmsetup create test
mount /dev/mapper/test /mnt/test
The relationship diagram will look like this:
+-------------+
| file system |
+-------------+
||
\/
+------------------+
| /dev/mapper/test |
+------------------+
||
\/
+-------------+
| /dev/sda1 |
+-------------+
blk_interposer allows to connect the DM target to a device that is already
mounted. Adding the --interpose flag to the command::
echo "0 `blockdev --getsz /dev/sda1` flakey /dev/sda1 0 1 3" | \
dmsetup create test --interpose
Now the relationship diagram will look like this:
+-------------+
| file system |
+-------------+
||
\/
+----------------+
| blk_interposer |
+----------------+
||
\/
+--------------+
| /dev/mapper/ |
| test |
+--------------+
||
\/
+-------------+
| /dev/sda1 |
+-------------+
At the same time, we do not need to remount the file system. The new DM target
was added to the stack "on the fly" unnoticed by the user-space environment.
Interposer for dm-snap
======================
Suppose we have a file system mounted on the block device /dev/sda1::
+-------------+
| file system |
+-------------+
||
\/
+-------------+
| /dev/sda1 |
+-------------+
To create a snapshot of a block device, we need to connect the dm-snap to this
device. To do this, use the --interpose flag when creating snapshot-origin.
echo "0 `blockdev --getsz /dev/sda1` snapshot-origin /dev/sda1" | \
dmsetup create origin --interpose
In this case, thanks to blk_interposer, all bio requests from the file system
will be redirected to the new device /dev/mapper/origin.
Diagram ::
+-------------+
| file system |
+-------------+
||
\/
+----------------+
| blk_interposer |
+----------------+
||
\/
+--------------+
| /dev/mapper/ |
| origin |
+--------------+
||
\/
+-----------+
| /dev/sda1 |
+-----------+
To create a snapshot, just use the new device /dev/mapper/origin:
echo "0 `blockdev --getsz /dev/sda1` snapshot /dev/mapper/origin ${COW_DEVICE} N 8" | \
dmesetup create snapshot
Diagram::
+-------------+ +--------------+
| file system | | backup agent |
+-------------+ +--------------+
|| ||
\/ ||
+----------------+ ||
| blk_interposer | ||
+----------------+ ||
|| ||
\/ \/
+--------------+ +---------------+
| /dev/mapper/ | <=> | /dev/mapper/ |
| origin | | snapshot |
+--------------+ +---------------+
|| ||
\/ \/
+-----------+ +---------------+
| /dev/sda1 | | ${COW_DEVICE} |
+-----------+ +---------------+
The snapshot device on the /dev/mapper/snapshot device is now available for
mounting or backup.
--
Sergei Shtepa
Veeam Software developer.