2023-01-26 03:34:23

by Demi Marie Obenour

Subject: [RFC PATCH 0/7] Allow race-free block device handling

This work aims to allow userspace to create and destroy block devices
in a race-free and leak-free way, and to allow them to be exposed to
other Xen VMs via blkback without leaks or races. It’s marked as RFC
for a few reasons:

- The code has been only lightly tested. It might be unstable or
insecure.

- The DM_DEV_CREATE ioctl gains a new flag. Unknown flags were
previously ignored, so this could theoretically break buggy userspace
tools.

- I have no idea if I got the block device reference counting and
locking correct.

Demi Marie Obenour (7):
block: Support creating a struct file from a block device
Allow userspace to get an FD to a newly-created DM device
Implement diskseq checks in blkback
Increment diskseq when releasing a loop device
If autoclear is set, delete a no-longer-used loop device
Minor blkback cleanups
xen/blkback: Inform userspace that device has been opened

block/bdev.c | 77 +++++++++++--
block/genhd.c | 1 +
drivers/block/loop.c | 17 ++-
drivers/block/xen-blkback/blkback.c | 8 +-
drivers/block/xen-blkback/xenbus.c | 171 ++++++++++++++++++++++------
drivers/md/dm-ioctl.c | 67 +++++++++--
include/linux/blkdev.h | 5 +
include/uapi/linux/dm-ioctl.h | 16 ++-
8 files changed, 298 insertions(+), 64 deletions(-)

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


2023-01-26 03:34:25

by Demi Marie Obenour

Subject: [RFC PATCH 1/7] block: Support creating a struct file from a block device

The newly added blkdev_get_file() function allows kernel code to create
a struct file for any block device. The main use-case is for the
struct file to be exposed to userspace as a file descriptor. A future
patch will modify the DM_DEV_CREATE ioctl to allow userspace to
get a file descriptor to the newly created block device, avoiding nasty
race conditions.

Signed-off-by: Demi Marie Obenour <[email protected]>
---
block/bdev.c | 77 +++++++++++++++++++++++++++++++++++-------
include/linux/blkdev.h | 5 +++
2 files changed, 70 insertions(+), 12 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index edc110d90df4041e7d337976951bd0d17525f1f7..09cb5ef900ca9ad5b21250bb63e64cc2a79f9289 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -459,10 +459,33 @@ static struct file_system_type bd_type = {
struct super_block *blockdev_superblock __read_mostly;
EXPORT_SYMBOL_GPL(blockdev_superblock);

+static struct vfsmount *bd_mnt __read_mostly;
+
+struct file *
+blkdev_get_file(struct block_device *bdev, fmode_t flags, void *holder)
+{
+ struct inode *inode;
+ struct file *filp;
+ int ret;
+
+ ret = blkdev_do_open(bdev, flags, holder);
+ if (ret)
+ return ERR_PTR(ret);
+ inode = bdev->bd_inode;
+ filp = alloc_file_pseudo(inode, bd_mnt, "[block]", flags | O_CLOEXEC, &def_blk_fops);
+ if (IS_ERR(filp)) {
+ blkdev_put(bdev, flags);
+ } else {
+ filp->f_mapping = inode->i_mapping;
+ filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
+ }
+ return filp;
+}
+EXPORT_SYMBOL(blkdev_get_file);
+
void __init bdev_cache_init(void)
{
int err;
- static struct vfsmount *bd_mnt;

bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode),
0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
@@ -775,7 +798,7 @@ void blkdev_put_no_open(struct block_device *bdev)
*
* Use this interface ONLY if you really do not have anything better - i.e. when
* you are behind a truly sucky interface and all you are given is a device
- * number. Everything else should use blkdev_get_by_path().
+ * number. Everything else should use blkdev_get_by_path() or blkdev_do_open().
*
* CONTEXT:
* Might sleep.
@@ -785,9 +808,7 @@ void blkdev_put_no_open(struct block_device *bdev)
*/
struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
{
- bool unblock_events = true;
struct block_device *bdev;
- struct gendisk *disk;
int ret;

ret = devcgroup_check_permission(DEVCG_DEV_BLOCK,
@@ -800,18 +821,52 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
bdev = blkdev_get_no_open(dev);
if (!bdev)
return ERR_PTR(-ENXIO);
- disk = bdev->bd_disk;
+
+ ret = blkdev_do_open(bdev, mode, holder);
+ if (ret) {
+ blkdev_put_no_open(bdev);
+ return ERR_PTR(ret);
+ }
+
+ return bdev;
+}
+EXPORT_SYMBOL(blkdev_get_by_dev);
+
+/**
+ * blkdev_do_open - open a block device by device pointer
+ * @bdev: pointer to the device to open
+ * @mode: FMODE_* mask
+ * @holder: exclusive holder identifier
+ *
+ * Open the block device pointed to by @bdev. If @mode includes
+ * %FMODE_EXCL, the block device is opened with exclusive access. Specifying
+ * %FMODE_EXCL with a %NULL @holder is invalid. Exclusive opens may nest for
+ * the same @holder.
+ *
+ * Unlike blkdev_get_by_dev() and blkdev_get_by_path(), this function does not
+ * do any permission checks. The most common use-case is where the device
+ * was freshly created by userspace.
+ *
+ * CONTEXT:
+ * Might sleep.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int blkdev_do_open(struct block_device *bdev, fmode_t mode, void *holder) {
+ struct gendisk *disk = bdev->bd_disk;
+ int ret = -ENXIO;
+ bool unblock_events = true;

if (mode & FMODE_EXCL) {
ret = bd_prepare_to_claim(bdev, holder);
if (ret)
- goto put_blkdev;
+ return ret;
}

disk_block_events(disk);

mutex_lock(&disk->open_mutex);
- ret = -ENXIO;
if (!disk_live(disk))
goto abort_claiming;
if (!try_module_get(disk->fops->owner))
@@ -842,7 +897,7 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)

if (unblock_events)
disk_unblock_events(disk);
- return bdev;
+ return 0;
put_module:
module_put(disk->fops->owner);
abort_claiming:
@@ -850,11 +905,9 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
bd_abort_claiming(bdev, holder);
mutex_unlock(&disk->open_mutex);
disk_unblock_events(disk);
-put_blkdev:
- blkdev_put_no_open(bdev);
- return ERR_PTR(ret);
+ return ret;
}
-EXPORT_SYMBOL(blkdev_get_by_dev);
+EXPORT_SYMBOL(blkdev_do_open);

/**
* blkdev_get_by_path - open a block device by name
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 43d4e073b1115e4628a001081fbf08b296d342df..04635cb5ee29d22394a34c65eb34bea4e7847d8d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -325,6 +325,11 @@ typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,

void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model);

+struct file *
+blkdev_get_file(struct block_device *bdev, fmode_t flags, void *holder);
+
+int blkdev_do_open(struct block_device *bdev, fmode_t flags, void *holder);
+
#ifdef CONFIG_BLK_DEV_ZONED

#define BLK_ALL_ZONES ((unsigned int)-1)
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

2023-01-26 03:34:27

by Demi Marie Obenour

Subject: [RFC PATCH 2/7] Allow userspace to get an FD to a newly-created DM device

This allows creating a device-mapper device, opening it, and setting it
to be deleted when unused in a single atomic operation.
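
The two flag checks this patch adds at the top of dev_create() can be
tried out in userspace. The sketch below is an illustration only, not
kernel code: DM_FILE_DESCRIPTOR_FLAG uses the value proposed in this
RFC (it does not exist in any released kernel), DM_DEFERRED_REMOVE uses
the value from the existing uapi header, and check_create_flags() is a
hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

/* Value from the existing include/uapi/linux/dm-ioctl.h. */
#define DM_DEFERRED_REMOVE      (1u << 17)
/* Value proposed by this RFC; an assumption outside this patch series. */
#define DM_FILE_DESCRIPTOR_FLAG (1u << 20)

/* Hypothetical helper mirroring the two checks added to dev_create(). */
static int check_create_flags(uint32_t flags)
{
	/* Reject any bit above the highest flag known to this patch. */
	if (flags > (2 * DM_FILE_DESCRIPTOR_FLAG - 1))
		return -EINVAL;

	/*
	 * Deferred removal without a file descriptor would destroy the
	 * device before the ioctl returns, so refuse that combination.
	 */
	if ((flags & DM_DEFERRED_REMOVE) && !(flags & DM_FILE_DESCRIPTOR_FLAG))
		return -EINVAL;

	return 0;
}
```

With these checks, DM_DEFERRED_REMOVE alone is rejected, while
DM_DEFERRED_REMOVE | DM_FILE_DESCRIPTOR_FLAG is accepted and yields a
device that goes away once the returned descriptor is closed and the
device is otherwise unused.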

Signed-off-by: Demi Marie Obenour <[email protected]>
---
drivers/md/dm-ioctl.c | 67 +++++++++++++++++++++++++++++------
include/uapi/linux/dm-ioctl.h | 16 ++++++++-
2 files changed, 72 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 36fc6ae4737a05ab53ab67a8ccee525cb5fda082..05438dedcd17b7cac470fcc5a9721d67daad4bfb 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -853,9 +853,21 @@ static void __dev_status(struct mapped_device *md, struct dm_ioctl *param)

static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_size)
{
- int r, m = DM_ANY_MINOR;
+ int r, m = DM_ANY_MINOR, fd;
struct mapped_device *md;

+ /* Do not allow unknown flags */
+ if (param->flags > (2 * DM_FILE_DESCRIPTOR_FLAG - 1))
+ return -EINVAL;
+
+ /*
+ * Do not allow creating a device that would just be destroyed
+ * before the ioctl returns.
+ */
+ if ((param->flags & DM_DEFERRED_REMOVE) &&
+ !(param->flags & DM_FILE_DESCRIPTOR_FLAG))
+ return -EINVAL;
+
r = check_name(param->name);
if (r)
return r;
@@ -867,20 +879,55 @@ static int dev_create(struct file *filp, struct dm_ioctl *param, size_t param_si
if (r)
return r;

- r = dm_hash_insert(param->name, *param->uuid ? param->uuid : NULL, md);
- if (r) {
- dm_put(md);
- dm_destroy(md);
- return r;
- }
-
param->flags &= ~DM_INACTIVE_PRESENT_FLAG;

+ r = dm_hash_insert(param->name, *param->uuid ? param->uuid : NULL, md);
+ if (r)
+ goto out_put;
+
+ if (param->flags & DM_FILE_DESCRIPTOR_FLAG) {
+ struct block_device *bdev = dm_disk(md)->part0;
+ struct file *file;
+
+ fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
+ if (fd < 0) {
+ r = fd;
+ goto out_put;
+ }
+
+ file = blkdev_get_file(bdev, O_RDWR|O_CLOEXEC, NULL);
+ if (IS_ERR(file)) {
+ r = PTR_ERR(file);
+ goto out_put_fd;
+ }
+
+ /*
+ * Simulate opening the device. The other checks in
+ * dm_blk_open() are not necessary because we have a reference
+ * to the `struct md`.
+ */
+ atomic_inc(&md->open_count);
+ fd_install(fd, file);
+ param->file_descriptor = fd;
+ }
+
+ /*
+ * If userspace requests it, automatically delete the device
+ * when it is no longer used
+ */
+ if (param->flags & DM_DEFERRED_REMOVE)
+ set_bit(DMF_DEFERRED_REMOVE, &md->flags);
+
__dev_status(md, param);
-
dm_put(md);
-
return 0;
+
+out_put_fd:
+ put_unused_fd(fd);
+out_put:
+ dm_put(md);
+ dm_destroy(md);
+ return r;
}

/*
diff --git a/include/uapi/linux/dm-ioctl.h b/include/uapi/linux/dm-ioctl.h
index 7edf335778bae1cb206f6dd4d44e9cf7fb9da35c..30a6260ed7e06ff71fad1675dd4e7f9325d752a6 100644
--- a/include/uapi/linux/dm-ioctl.h
+++ b/include/uapi/linux/dm-ioctl.h
@@ -136,7 +136,13 @@ struct dm_ioctl {
* For output, the ioctls return the event number, not the cookie.
*/
__u32 event_nr; /* in/out */
- __u32 padding;
+
+ union {
+ /* Padding for named devices */
+ __u32 padding;
+ /* For anonymous devices, this is a file descriptor. */
+ __u32 file_descriptor;
+ };

__u64 dev; /* in/out */

@@ -382,4 +388,12 @@ enum {
*/
#define DM_IMA_MEASUREMENT_FLAG (1 << 19) /* In */

+/*
+ * If set in a DM_DEV_CREATE ioctl(), sets the file_descriptor field
+ * to a valid file descriptor. This can be combined with DM_DEFERRED_REMOVE
+ * to cause the device to be destroyed when the file descriptor is closed
+ * and is otherwise unused.
+ */
+#define DM_FILE_DESCRIPTOR_FLAG (1 << 20) /* In */
+
#endif /* _LINUX_DM_IOCTL_H */
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

2023-01-26 03:34:33

by Demi Marie Obenour

Subject: [RFC PATCH 3/7] Implement diskseq checks in blkback

From: Demi Marie Obenour <[email protected]>

This allows specifying a disk sequence number in XenStore. If it does
not match the disk sequence number of the underlying device, the device
will not be exported and a warning will be logged. Userspace can use
this to eliminate race conditions due to major/minor number reuse.
Older kernels will ignore this, so it is safe for userspace to set it
unconditionally.

This also makes physical-device parsing stricter. I do not believe this
will break any extant userspace tools.
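
For anyone who wants to check which strings the stricter parser
accepts, here is a rough userspace approximation of
read_physical_device(). It is a sketch only: struct physdev and
parse_physical_device() are hypothetical names, and glibc's sscanf()
may differ from the kernel's in corner cases (for example, sign
handling).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result type; not part of the patch. */
struct physdev {
	bool ok;                    /* false if the string was rejected */
	unsigned long long diskseq; /* 0 means no diskseq was given */
	unsigned int major, minor;
};

/* Accepts "diskseq@major:minor" or "major:minor", all in hex. */
static struct physdev parse_physical_device(const char *s)
{
	struct physdev pd = { .ok = false };
	char junk;

	assert(s != NULL);

	/* Reject control characters, spaces, DEL and non-ASCII bytes. */
	for (const unsigned char *p = (const unsigned char *)s; *p; p++)
		if (*p <= 0x20 || *p >= 0x7F)
			return pd;

	if (sscanf(s, "%16llx@%8x:%8x%c",
		   &pd.diskseq, &pd.major, &pd.minor, &junk) == 3) {
		/* New syntax: an explicit diskseq of 0 is invalid. */
		pd.ok = pd.diskseq != 0;
	} else if (sscanf(s, "%8x:%8x%c", &pd.major, &pd.minor, &junk) == 2) {
		/* Old syntax: no diskseq check requested. */
		pd.diskseq = 0;
		pd.ok = true;
	}
	return pd;
}
```

The trailing %c is what rejects garbage after the minor number: if it
matches anything, sscanf() returns one more than the expected count.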

Signed-off-by: Demi Marie Obenour <[email protected]>
---
drivers/block/xen-blkback/xenbus.c | 137 +++++++++++++++++++++--------
1 file changed, 100 insertions(+), 37 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 4807af1d58059394d7a992335dabaf2bc3901721..2c43bfc7ab5ba6954f11d4b949a5668660dbd290 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -24,6 +24,7 @@ struct backend_info {
struct xenbus_watch backend_watch;
unsigned major;
unsigned minor;
+ unsigned long long diskseq;
char *mode;
};

@@ -479,7 +480,7 @@ static void xen_vbd_free(struct xen_vbd *vbd)

static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
unsigned major, unsigned minor, int readonly,
- int cdrom)
+ bool cdrom, u64 diskseq)
{
struct xen_vbd *vbd;
struct block_device *bdev;
@@ -507,6 +508,25 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
xen_vbd_free(vbd);
return -ENOENT;
}
+
+ if (diskseq) {
+ struct gendisk *disk = bdev->bd_disk;
+ if (unlikely(disk == NULL)) {
+ pr_err("xen_vbd_create: device %08x has no gendisk\n",
+ vbd->pdevice);
+ xen_vbd_free(vbd);
+ return -EFAULT;
+ }
+
+ if (unlikely(disk->diskseq != diskseq)) {
+ pr_warn("xen_vbd_create: device %08x has incorrect sequence "
+ "number 0x%llx (expected 0x%llx)\n",
+ vbd->pdevice, disk->diskseq, diskseq);
+ xen_vbd_free(vbd);
+ return -ENODEV;
+ }
+ }
+
vbd->size = vbd_sz(vbd);

if (cdrom || disk_to_cdi(vbd->bdev->bd_disk))
@@ -690,6 +710,55 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
return err;
}

+static bool read_physical_device(struct xenbus_device *dev,
+ unsigned long long *diskseq,
+ unsigned *major, unsigned *minor)
+{
+ char *physical_device, *problem;
+ int i, physical_device_length;
+ char junk;
+
+ physical_device = xenbus_read(XBT_NIL, dev->nodename, "physical-device",
+ &physical_device_length);
+
+ if (IS_ERR(physical_device)) {
+ int err = PTR_ERR(physical_device);
+ /*
+ * Since this watch will fire once immediately after it is
+ * registered, we expect "does not exist" errors. Ignore
+ * them and wait for the hotplug scripts.
+ */
+ if (unlikely(!XENBUS_EXIST_ERR(err)))
+ xenbus_dev_fatal(dev, err, "reading physical-device");
+ return false;
+ }
+
+ for (i = 0; i < physical_device_length; ++i)
+ if (unlikely(physical_device[i] <= 0x20 || physical_device[i] >= 0x7F)) {
+ problem = "bad byte in physical-device";
+ goto fail;
+ }
+
+ if (sscanf(physical_device, "%16llx@%8x:%8x%c",
+ diskseq, major, minor, &junk) == 3) {
+ if (*diskseq == 0) {
+ problem = "diskseq 0 is invalid";
+ goto fail;
+ }
+ } else if (sscanf(physical_device, "%8x:%8x%c", major, minor, &junk) == 2) {
+ *diskseq = 0;
+ } else {
+ problem = "invalid physical-device";
+ goto fail;
+ }
+ kfree(physical_device);
+ return true;
+fail:
+ kfree(physical_device);
+ xenbus_dev_fatal(dev, -EINVAL, problem);
+ return false;
+}
+
/*
* Callback received when the hotplug scripts have placed the physical-device
* node. Read it and the mode node, and create a vbd. If the frontend is
@@ -707,28 +776,17 @@ static void backend_changed(struct xenbus_watch *watch,
int cdrom = 0;
unsigned long handle;
char *device_type;
+ unsigned long long diskseq;

pr_debug("%s %p %d\n", __func__, dev, dev->otherend_id);
-
- err = xenbus_scanf(XBT_NIL, dev->nodename, "physical-device", "%x:%x",
- &major, &minor);
- if (XENBUS_EXIST_ERR(err)) {
- /*
- * Since this watch will fire once immediately after it is
- * registered, we expect this. Ignore it, and wait for the
- * hotplug scripts.
- */
+ if (!read_physical_device(dev, &diskseq, &major, &minor))
return;
- }
- if (err != 2) {
- xenbus_dev_fatal(dev, err, "reading physical-device");
- return;
- }

- if (be->major | be->minor) {
- if (be->major != major || be->minor != minor)
- pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
- be->major, be->minor, major, minor);
+ if (be->major | be->minor | be->diskseq) {
+ if (be->major != major || be->minor != minor || be->diskseq != diskseq)
+ pr_warn("changing physical device (from %x:%x:%llx to %x:%x:%llx)"
+ " not supported.\n",
+ be->major, be->minor, be->diskseq, major, minor, diskseq);
return;
}

@@ -756,29 +814,34 @@ static void backend_changed(struct xenbus_watch *watch,

be->major = major;
be->minor = minor;
+ be->diskseq = diskseq;

err = xen_vbd_create(be->blkif, handle, major, minor,
- !strchr(be->mode, 'w'), cdrom);
-
- if (err)
- xenbus_dev_fatal(dev, err, "creating vbd structure");
- else {
- err = xenvbd_sysfs_addif(dev);
- if (err) {
- xen_vbd_free(&be->blkif->vbd);
- xenbus_dev_fatal(dev, err, "creating sysfs entries");
- }
- }
+ !strchr(be->mode, 'w'), cdrom, diskseq);

if (err) {
- kfree(be->mode);
- be->mode = NULL;
- be->major = 0;
- be->minor = 0;
- } else {
- /* We're potentially connected now */
- xen_update_blkif_status(be->blkif);
+ xenbus_dev_fatal(dev, err, "creating vbd structure");
+ goto fail;
}
+
+ err = xenvbd_sysfs_addif(dev);
+ if (err) {
+ xenbus_dev_fatal(dev, err, "creating sysfs entries");
+ goto free_vbd;
+ }
+
+ /* We're potentially connected now */
+ xen_update_blkif_status(be->blkif);
+ return;
+
+free_vbd:
+ xen_vbd_free(&be->blkif->vbd);
+fail:
+ kfree(be->mode);
+ be->mode = NULL;
+ be->major = 0;
+ be->minor = 0;
+ be->diskseq = 0;
}

/*
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

2023-01-26 03:34:36

by Demi Marie Obenour

Subject: [RFC PATCH 4/7] Increment diskseq when releasing a loop device

This ensures that userspace is aware that the device may now point to
something else.
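
To illustrate why bumping the number on release helps, here is a toy
userspace model. It is a sketch under stated assumptions, not the
kernel implementation: every toy_* name is hypothetical, and a C11
atomic stands in for the kernel's global diskseq counter.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's global diskseq counter. */
static _Atomic uint64_t toy_global_diskseq;

struct toy_disk {
	uint64_t diskseq;
};

/* Modeled on inc_diskseq(): take the next value of the global counter. */
static void toy_inc_diskseq(struct toy_disk *disk)
{
	disk->diskseq = atomic_fetch_add(&toy_global_diskseq, 1) + 1;
}

/* True if the disk still carries the number the caller recorded. */
static bool toy_same_backing(const struct toy_disk *disk, uint64_t seen)
{
	return disk->diskseq == seen;
}

/* Configure a device, record its number, release it, then re-check. */
static bool toy_release_is_detected(void)
{
	struct toy_disk loop0 = { 0 };
	uint64_t seen;

	toy_inc_diskseq(&loop0); /* stand-in for the device being configured */
	seen = loop0.diskseq;    /* userspace records the sequence number */
	toy_inc_diskseq(&loop0); /* release path, as changed by this patch */
	return !toy_same_backing(&loop0, seen);
}
```

A consumer that recorded the number before the release sees a mismatch
afterwards, even though the major and minor numbers are unchanged.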

Signed-off-by: Demi Marie Obenour <[email protected]>
---
block/genhd.c | 1 +
drivers/block/loop.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 23cf83b3331cdea5c916fbf01cb5b92aeb2f7cf8..5bf7664273c66d04b40730434f17f7b65fbfe101 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1490,3 +1490,4 @@ void inc_diskseq(struct gendisk *disk)
{
disk->diskseq = atomic64_inc_return(&diskseq);
}
+EXPORT_SYMBOL(inc_diskseq);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 1518a6423279bc890221e8184a8f2e420cb16715..f862b0ab1dce43b3617b1381be8e2de3aab828b1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1205,6 +1205,12 @@ static void __loop_clr_fd(struct loop_device *lo, bool release)
if (!part_shift)
set_bit(GD_SUPPRESS_PART_SCAN, &lo->lo_disk->state);
mutex_lock(&lo->lo_mutex);
+
+ /*
+ * Increment the disk sequence number, so that userspace knows this
+ * device now points to something else.
+ */
+ inc_diskseq(lo->lo_disk);
lo->lo_state = Lo_unbound;
mutex_unlock(&lo->lo_mutex);

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

2023-01-26 03:34:39

by Demi Marie Obenour

Subject: [RFC PATCH 6/7] Minor blkback cleanups

No functional change intended.

Signed-off-by: Demi Marie Obenour <[email protected]>
---
drivers/block/xen-blkback/blkback.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index a5cf7f1e871c7f9ff397ab8ff1d7b9e3db686659..8a49cbe81d8895f89371bdf50d1b445c088c9b6a 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1238,6 +1238,8 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
nseg = req->operation == BLKIF_OP_INDIRECT ?
req->u.indirect.nr_segments : req->u.rw.nr_segments;

+ BUILD_BUG_ON(offsetof(struct blkif_request, u.rw.id) != 8);
+ BUILD_BUG_ON(offsetof(struct blkif_request, u.indirect.id) != 8);
if (unlikely(nseg == 0 && operation_flags != REQ_PREFLUSH) ||
unlikely((req->operation != BLKIF_OP_INDIRECT) &&
(nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
@@ -1261,13 +1263,13 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
preq.sector_number = req->u.rw.sector_number;
for (i = 0; i < nseg; i++) {
pages[i]->gref = req->u.rw.seg[i].gref;
- seg[i].nsec = req->u.rw.seg[i].last_sect -
- req->u.rw.seg[i].first_sect + 1;
- seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
if ((req->u.rw.seg[i].last_sect >= (XEN_PAGE_SIZE >> 9)) ||
(req->u.rw.seg[i].last_sect <
req->u.rw.seg[i].first_sect))
goto fail_response;
+ seg[i].nsec = req->u.rw.seg[i].last_sect -
+ req->u.rw.seg[i].first_sect + 1;
+ seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
preq.nr_sects += seg[i].nsec;
}
} else {
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


2023-01-26 03:34:49

by Demi Marie Obenour

Subject: [RFC PATCH 7/7] xen/blkback: Inform userspace that device has been opened

This allows userspace to use block devices with delete-on-close
behavior, which is necessary to ensure virtual devices (such as loop or
device-mapper devices) are cleaned up automatically. Protocol details
are included in comments.

Signed-off-by: Demi Marie Obenour <[email protected]>
---
drivers/block/xen-blkback/xenbus.c | 34 ++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 2c43bfc7ab5ba6954f11d4b949a5668660dbd290..ca8dae05985038da490c5ac93364509913f6b4c7 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -3,6 +3,19 @@
Copyright (C) 2005 Rusty Russell <[email protected]>
Copyright (C) 2005 XenSource Ltd

+In addition to the XenStore nodes required by the Xen block device
+specification, this implementation of blkback uses a new XenStore
+node: "opened". blkback sets "opened" to "0" before the hotplug script
+is called. Once the device node has been opened, blkback sets "opened"
+to "1".
+
+"opened" is used exclusively by userspace. It serves two purposes:
+
+1. It tells userspace that diskseq@major:minor syntax for "physical-device" is
+ supported.
+2. It tells userspace that it can wait for "opened" to be set to 1. Once
+ "opened" is 1, blkback has a reference to the device, so userspace doesn't
+ need to keep one.

*/

@@ -698,6 +711,14 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
if (err)
pr_warn("%s write out 'max-ring-page-order' failed\n", __func__);

+ /*
+ * This informs userspace that the "opened" node will be set to "1" when
+ * the device has been opened successfully.
+ */
+ err = xenbus_write(XBT_NIL, dev->nodename, "opened", "0");
+ if (err)
+ goto fail;
+
err = xenbus_switch_state(dev, XenbusStateInitWait);
if (err)
goto fail;
@@ -824,6 +845,19 @@ static void backend_changed(struct xenbus_watch *watch,
goto fail;
}

+ /*
+ * Tell userspace that the device has been opened and that blkback has a
+ * reference to it. Userspace can then close the device or mark it as
+ * delete-on-close, knowing that blkback will keep the device open as
+ * long as necessary.
+ */
+ err = xenbus_write(XBT_NIL, dev->nodename, "opened", "1");
+ if (err) {
+ xenbus_dev_fatal(dev, err, "%s: notifying userspace device has been opened",
+ dev->nodename);
+ goto free_vbd;
+ }
+
err = xenvbd_sysfs_addif(dev);
if (err) {
xenbus_dev_fatal(dev, err, "creating sysfs entries");
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


2023-01-30 08:10:24

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Wed, Jan 25, 2023 at 10:33:53PM -0500, Demi Marie Obenour wrote:
> The newly added blkdev_get_file() function allows kernel code to create
> a struct file for any block device. The main use-case is for the
> struct file to be exposed to userspace as a file descriptor. A future
> patch will modify the DM_DEV_CREATE ioctl to allow userspace to
> get a file descriptor to the newly created block device, avoiding nasty
> race conditions.

NAK. Do not add weird side-way interfaces to the block layer.

2023-01-30 08:11:34

by Christoph Hellwig

Subject: Re: [RFC PATCH 4/7] Increment diskseq when releasing a loop device

On Wed, Jan 25, 2023 at 10:33:56PM -0500, Demi Marie Obenour wrote:
> This ensures that userspace is aware that the device may now point to
> something else.

The subject is wrong; this also does two things to two different
subsystems, none of which is mentioned in the subject.

2023-01-30 19:22:48

by Demi Marie Obenour

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Mon, Jan 30, 2023 at 12:08:23AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 25, 2023 at 10:33:53PM -0500, Demi Marie Obenour wrote:
> > The newly added blkdev_get_file() function allows kernel code to create
> > a struct file for any block device. The main use-case is for the
> > struct file to be exposed to userspace as a file descriptor. A future
> > patch will modify the DM_DEV_CREATE ioctl to allow userspace to
> > get a file descriptor to the newly created block device, avoiding nasty
> > race conditions.
>
> NAK. Do not add weird side-way interfaces to the block layer.

What do you recommend instead? This solves a real problem for
device-mapper users and I am not aware of a better solution.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


2023-01-31 08:58:21

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> What do you recommend instead? This solves a real problem for
> device-mapper users and I am not aware of a better solution.

You could start with explaining the problem and what other methods
you tried that failed. In the end it's not my job to fix your problem.
I generally gladly help, but this kind of attitude doesn't get very
far.

2023-01-31 16:28:46

by Demi Marie Obenour

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > What do you recommend instead? This solves a real problem for
> > device-mapper users and I am not aware of a better solution.
>
> You could start with explaining the problem and what other methods
> you tried that failed. In the end it's not my job to fix your problem.

I’m working on a “block not-script” (Xen block device hotplug script
written in C) for Qubes OS. The current hotplug script is a shell
script that takes a global lock, which serializes all invocations and
significantly slows down VM creation and destruction. My C program
avoids this problem.

One of the goals of the not-script is to never leak resources, even if
it dies with SIGKILL or is never called with the “remove” argument to
destroy the devices it created. Therefore, whenever possible, it relies
on automatic destruction of devices that are no longer used. I have
managed to make this work for loop devices, provided that the Xen
blkback driver is patched to accept a diskseq in the physical-device
Xenstore node. I have *not* managed to make this work for device-mapper
devices, however. One of the problems is that there is no way to
atomically create a device-mapper device and obtain a file descriptor to
it such that the device will be destroyed when no longer used. To solve
this problem, I added a new flag (DM_FILE_DESCRIPTOR_FLAG) that asks the
device-mapper driver to provide userspace a file descriptor for the
device that was just created. The uAPI will likely change in future
versions of the patch, but the general idea will not.

While it is easy to provide userspace with an FD to any struct file, it
is *not* easy to obtain a struct file for a given struct block_device.
I could have had device-mapper implement everything itself, but that
would have duplicated a large amount of code already in the block layer.
Instead, I decided to refactor the block layer to provide a function
that does exactly what was needed. The result was this patch. In the
future, I would like to add an ioctl for /dev/loop-control that creates
a loop device and returns a file descriptor to the loop device. I could
also see iSCSI supporting this, with the socket file descriptor being
passed in from userspace.

blkdev_do_open() does not solve any problem for me at this time.
Instead, it represents the code shared by blkdev_get_by_dev() and
blkdev_get_file(). I decided to export it because it could be of
independent use to others. In particular, it could potentially
simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.

I hope this is enough information. If it is not, feel free to ask for
more.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab


2023-02-01 07:46:54

by Christoph Hellwig

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> While it is easy to provide userspace with an FD to any struct file, it
> is *not* easy to obtain a struct file for a given struct block_device.
> I could have had device-mapper implement everything itself, but that
> would have duplicated a large amount of code already in the block layer.
> Instead, I decided to refactor the block layer to provide a function
> that does exactly what was needed. The result was this patch. In the
> future, I would like to add an ioctl for /dev/loop-control that creates
> a loop device and returns a file descriptor to the loop device. I could
> also see iSCSI supporting this, with the socket file descriptor being
> passed in from userspace.

And it is somewhat intentional that you can't. Block device inodes
have interesting life times and are never directly exposed to userspace
at all. They are internal, and only f_mapping of a file system inode
delegates to them for I/O. Your patch now magically exposes them to
userspace. And it then bypasses all pathname and inode permission
based access checks and auditing. So we can't just do it.

> blkdev_do_open() does not solve any problem for me at this time.
> Instead, it represents the code shared by blkdev_get_by_dev() and
> blkdev_get_file(). I decided to export it because it could be of
> independent use to others. In particular, it could potentially
> simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
> pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.

All these need to actually open the underlying device as they do I/O.
Doing I/O without opening the device is a no-go.

2023-02-01 16:19:14

by Demi Marie Obenour

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Tue, Jan 31, 2023 at 11:45:55PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> > While it is easy to provide userspace with an FD to any struct file, it
> > is *not* easy to obtain a struct file for a given struct block_device.
> > I could have had device-mapper implement everything itself, but that
> > would have duplicated a large amount of code already in the block layer.
> > Instead, I decided to refactor the block layer to provide a function
> > that does exactly what was needed. The result was this patch. In the
> > future, I would like to add an ioctl for /dev/loop-control that creates
> > a loop device and returns a file descriptor to the loop device. I could
> > also see iSCSI supporting this, with the socket file descriptor being
> > passed in from userspace.
>
> And it is somewhat intentional that you can't. Block device inodes
> have interesting life times and are never directly exposed to userspace
> at all. They are internal, and only f_mapping of a file system inode
> delegates to them for I/O. Your patch now magically exposes them to
> userspace.

The intention is that the file descriptor is equivalent to what one would
get by first creating the device and then opening it. If it is not,
that is a bug in one of my patches.

> And it then bypasses all pathname and inode permission
> based access checks and auditing. So we can't just do it.

Accessing /dev/mapper/control is already enough to panic the kernel, so
presumably only fully trusted userspace can make the ioctl to begin
with. Furthermore, this only allows a userspace process to get a file
descriptor to the device-mapper device it itself created.

> > blkdev_do_open() does not solve any problem for me at this time.
> > Instead, it represents the code shared by blkdev_get_by_dev() and
> > blkdev_get_file(). I decided to export it because it could be of
> > independent use to others. In particular, it could potentially
> > simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
> > pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.
>
> All these need to actually open the underlying device as they do I/O.
> Doing I/O without opening the device is a no-go.

blkdev_do_open() *does* open the device. If it doesn’t, that’s a bug.
In v2 I will add the same access control checks that blkdev_get_by_dev()
does. Is this sufficient?
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



2023-02-02 08:51:05

by Ming Lei

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > > What do you recommend instead? This solves a real problem for
> > > device-mapper users and I am not aware of a better solution.
> >
> > You could start with explaining the problem and what other methods
> > you tried that failed. In the end it's not my job to fix your problem.
>
> I’m working on a “block not-script” (Xen block device hotplug script
> written in C) for Qubes OS. The current hotplug script is a shell
> script that takes a global lock, which serializes all invocations and
> significantly slows down VM creation and destruction. My C program
> avoids this problem.
>
> One of the goals of the not-script is to never leak resources, even if
> it dies with SIGKILL or is never called with the “remove” argument to

If it dies, you can still start a new instance to clean up the leaked
devices, by running a simple daemon that monitors whether the not-script
is alive.

> destroy the devices it created. Therefore, whenever possible, it relies
> on automatic destruction of devices that are no longer used. I have

This automatic destruction of devices is supposed to be done in
userspace, because only userspace knows whether a device is needed,
and when.

So I am not sure this kind of work belongs in the kernel.


Thanks,
Ming


2023-02-02 16:51:29

by Mike Snitzer

Subject: Re: [RFC PATCH 0/7] Allow race-free block device handling

On Wed, Jan 25 2023 at 10:33P -0500,
Demi Marie Obenour <[email protected]> wrote:

> This work aims to allow userspace to create and destroy block devices
> in a race-free and leak-free way,

"race-free and leak-free way" implies there are both races and leaks in
existing code. You're making claims that are likely very specific to
your Xen use-case. Please explain more carefully.

> and to allow them to be exposed to
> other Xen VMs via blkback without leaks or races. It’s marked as RFC
> for a few reasons:
>
> - The code has been only lightly tested. It might be unstable or
> insecure.
>
> - The DM_DEV_CREATE ioctl gains a new flag. Unknown flags were
> previously ignored, so this could theoretically break buggy userspace
> tools.

Not seeing a reason that type of DM change is needed. If you feel
strongly about it send a separate patch and we can discuss it.

> - I have no idea if I got the block device reference counting and
> locking correct.

Your headers and justification for this line of work are really way too
terse. Please take the time to clearly make the case for your changes
in both the patch headers and code.

Mike

2023-02-02 17:25:16

by Demi Marie Obenour

Subject: Re: [RFC PATCH 1/7] block: Support creating a struct file from a block device

On Thu, Feb 02, 2023 at 04:49:54PM +0800, Ming Lei wrote:
> On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> > On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > > > What do you recommend instead? This solves a real problem for
> > > > device-mapper users and I am not aware of a better solution.
> > >
> > > You could start with explaining the problem and what other methods
> > > you tried that failed. In the end it's not my job to fix your problem.
> >
> > I’m working on a “block not-script” (Xen block device hotplug script
> > written in C) for Qubes OS. The current hotplug script is a shell
> > script that takes a global lock, which serializes all invocations and
> > significantly slows down VM creation and destruction. My C program
> > avoids this problem.
> >
> > One of the goals of the not-script is to never leak resources, even if
> > it dies with SIGKILL or is never called with the “remove” argument to
>
> If it dies, you can still start a new instance to clean up the leaked
> devices, by running a simple daemon that monitors whether the not-script
> is alive.

This requires userspace to maintain state that persists across process
restarts, and is also non-compositional. If there was a userspace
daemon that was responsible for all block device management in the
system, this would be more reasonable, but no such daemon exists.
Furthermore, the amount of code required in userspace dwarfs the amount
of code my patches add to the kernel, both in size and complexity.

> > destroy the devices it created. Therefore, whenever possible, it relies
> > on automatic destruction of devices that are no longer used. I have
>
> This automatic destruction of devices is supposed to be done in
> userspace, because only userspace knows whether a device is needed,
> and when.

In my use-case, the last reference to the device is held by the blkback
driver in the kernel. More generally, any case where a device is
created for a single purpose and should be destroyed when no longer
used will benefit from this. Encrypted swap devices are a simple
example, as they can be destroyed with a single “swapoff” command.
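
One existing kernel mechanism with this "destroy on last use" shape is the
loop driver's autoclear flag, which this series also touches. The sketch
below sets it through the LOOP_GET_STATUS64 and LOOP_SET_STATUS64 ioctls
from <linux/loop.h>; the helper name ex_set_autoclear is made up for
illustration, and the caller is assumed to hold an open file descriptor to
an already-configured loop device.

```c
#include <linux/loop.h>
#include <sys/ioctl.h>

/* Hypothetical helper: ask the loop driver to tear the device down once
 * its last user goes away. Returns 0 on success, -1 on failure (errno is
 * left as set by ioctl). */
static int ex_set_autoclear(int loop_fd)
{
    struct loop_info64 info;

    /* Read the current status so that other fields are preserved. */
    if (ioctl(loop_fd, LOOP_GET_STATUS64, &info) != 0)
        return -1;
    info.lo_flags |= LO_FLAGS_AUTOCLEAR;
    if (ioctl(loop_fd, LOOP_SET_STATUS64, &info) != 0)
        return -1;
    return 0;
}
```

After this call succeeds, no separate cleanup step is needed for the loop
device itself: the last close releases the backing file.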
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



2023-02-02 18:43:57

by Demi Marie Obenour

Subject: Re: [RFC PATCH 0/7] Allow race-free block device handling

On Thu, Feb 02, 2023 at 11:50:37AM -0500, Mike Snitzer wrote:
> On Wed, Jan 25 2023 at 10:33P -0500,
> Demi Marie Obenour <[email protected]> wrote:
>
> > This work aims to allow userspace to create and destroy block devices
> > in a race-free and leak-free way,
>
> "race-free and leak-free way" implies there are both races and leaks in
> existing code. You're making claims that are likely very specific to
> your Xen use-case. Please explain more carefully.

Will do in v2.

> > and to allow them to be exposed to
> > other Xen VMs via blkback without leaks or races. It’s marked as RFC
> > for a few reasons:
> >
> > - The code has been only lightly tested. It might be unstable or
> > insecure.
> >
> > - The DM_DEV_CREATE ioctl gains a new flag. Unknown flags were
> > previously ignored, so this could theoretically break buggy userspace
> > tools.
>
> Not seeing a reason that type of DM change is needed. If you feel
> strongly about it send a separate patch and we can discuss it.

Patch 2/7 is the diskseq change. v2 will contain a revised and tested
version with a greatly expanded commit message.

> > - I have no idea if I got the block device reference counting and
> > locking correct.
>
> Your headers and justification for this line of work are really way too
> terse. Please take the time to clearly make the case for your changes
> in both the patch headers and code.

I will expand the commit message in v2, but I am not sure what you want
me to add to the code comments. Would you mind explaining?
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



2023-02-02 19:57:23

by Mike Snitzer

Subject: Re: [RFC PATCH 0/7] Allow race-free block device handling

On Thu, Feb 02 2023 at 1:41P -0500,
Demi Marie Obenour <[email protected]> wrote:

> On Thu, Feb 02, 2023 at 11:50:37AM -0500, Mike Snitzer wrote:
> > On Wed, Jan 25 2023 at 10:33P -0500,
> > Demi Marie Obenour <[email protected]> wrote:
> >
> > > This work aims to allow userspace to create and destroy block devices
> > > in a race-free and leak-free way,
> >
> > "race-free and leak-free way" implies there are both races and leaks in
> > existing code. You're making claims that are likely very specific to
> > your Xen use-case. Please explain more carefully.
>
> Will do in v2.
>
> > > and to allow them to be exposed to
> > > other Xen VMs via blkback without leaks or races. It’s marked as RFC
> > > for a few reasons:
> > >
> > > - The code has been only lightly tested. It might be unstable or
> > > insecure.
> > >
> > > - The DM_DEV_CREATE ioctl gains a new flag. Unknown flags were
> > > previously ignored, so this could theoretically break buggy userspace
> > > tools.
> >
> > Not seeing a reason that type of DM change is needed. If you feel
> > strongly about it send a separate patch and we can discuss it.
>
> Patch 2/7 is the diskseq change. v2 will contain a revised and tested
> version with a greatly expanded commit message.

I'm aware that 2/7 is where you make the DM change to disallow unknown
flags; what I'm saying is I don't see a reason for that change.

Certainly doesn't look to be a requirement for everything else in that
patch.

So send a separate patch, but I'm inclined to _not_ accept it because
it does potentially break some userspace.

> > > - I have no idea if I got the block device reference counting and
> > > locking correct.
> >
> > Your headers and justification for this line of work are really way too
> > terse. Please take the time to clearly make the case for your changes
> > in both the patch headers and code.
>
> I will expand the commit message in v2, but I am not sure what you want
> me to add to the code comments. Would you mind explaining?

Nothing specific about code, was just a general reminder (based on how
terse the 2/7 header was).

Mike

2023-02-02 20:58:11

by Demi Marie Obenour

Subject: Re: [RFC PATCH 0/7] Allow race-free block device handling

On Thu, Feb 02, 2023 at 02:56:34PM -0500, Mike Snitzer wrote:
> On Thu, Feb 02 2023 at 1:41P -0500,
> Demi Marie Obenour <[email protected]> wrote:
>
> > On Thu, Feb 02, 2023 at 11:50:37AM -0500, Mike Snitzer wrote:
> > > On Wed, Jan 25 2023 at 10:33P -0500,
> > > Demi Marie Obenour <[email protected]> wrote:
> > >
> > > > This work aims to allow userspace to create and destroy block devices
> > > > in a race-free and leak-free way,
> > >
> > > "race-free and leak-free way" implies there are both races and leaks in
> > > existing code. You're making claims that are likely very specific to
> > > your Xen use-case. Please explain more carefully.
> >
> > Will do in v2.
> >
> > > > and to allow them to be exposed to
> > > > other Xen VMs via blkback without leaks or races. It’s marked as RFC
> > > > for a few reasons:
> > > >
> > > > - The code has been only lightly tested. It might be unstable or
> > > > insecure.
> > > >
> > > > - The DM_DEV_CREATE ioctl gains a new flag. Unknown flags were
> > > > previously ignored, so this could theoretically break buggy userspace
> > > > tools.
> > >
> > > Not seeing a reason that type of DM change is needed. If you feel
> > > strongly about it send a separate patch and we can discuss it.
> >
> > Patch 2/7 is the diskseq change. v2 will contain a revised and tested
> > version with a greatly expanded commit message.
>
> I'm aware that 2/7 is where you make the DM change to disallow unknown
> flags; what I'm saying is I don't see a reason for that change.

Thanks for the clarification.

> Certainly doesn't look to be a requirement for everything else in that
> patch.

Indeed it is not. I will make it a separate patch.

> So send a separate patch, but I'm inclined to _not_ accept it because
> it does potentially break some userspace.

Is it okay to add DM_FILE_DESCRIPTOR_FLAG (with the same meaning as in
2/7) _without_ rejecting unknown flags? The same patch would bump the
minor version number, so userspace would still be able to tell if the
kernel supported DM_FILE_DESCRIPTOR_FLAG. If you wanted, I could ignore
DM_FILE_DESCRIPTOR_FLAG unless the minor number passed by userspace is
sufficiently recent.

Another option would be to make userspace opt-in to strict parameter
checking by passing 5 as the major version instead of 4. Userspace
programs that passed 4 would get the old behavior, while userspace
programs that passed 5 would get strict parameter checking and be able
to use new features such as DM_FILE_DESCRIPTOR_FLAG.
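
A minimal sketch of this opt-in scheme follows. Everything in it is
hypothetical: the function name, the flag masks, and the numeric value
given to the new flag are placeholders for illustration and are not taken
from the patches or from include/uapi/linux/dm-ioctl.h.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder masks, for illustration only. */
#define EX_KNOWN_FLAGS          UINT32_C(0x0000ffff)
#define EX_FILE_DESCRIPTOR_FLAG UINT32_C(0x00010000)

/*
 * Decide which flags to honor based on the major version passed by
 * userspace. Returns 0 and stores the honored flags in *effective, or
 * -EINVAL if a strict caller (major >= 5) passed an unknown bit.
 */
static int ex_check_flags(uint32_t major, uint32_t flags, uint32_t *effective)
{
    const uint32_t all_known = EX_KNOWN_FLAGS | EX_FILE_DESCRIPTOR_FLAG;

    if (major >= 5) {
        /* Strict mode: unknown bits are an error. */
        if (flags & ~all_known)
            return -EINVAL;
        *effective = flags;
        return 0;
    }

    /* Legacy mode: unknown bits are silently ignored, and the new flag
     * is not available. */
    *effective = flags & EX_KNOWN_FLAGS;
    return 0;
}
```

With these placeholder values, a caller passing major version 4 and an
unknown bit gets the bit silently dropped, while the same bit with major
version 5 is rejected with -EINVAL.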

> > > > - I have no idea if I got the block device reference counting and
> > > > locking correct.
> > >
> > > Your headers and justification for this line of work are really way too
> > > terse. Please take the time to clearly make the case for your changes
> > > in both the patch headers and code.
> >
> > I will expand the commit message in v2, but I am not sure what you want
> > me to add to the code comments. Would you mind explaining?
>
> Nothing specific about code, was just a general reminder (based on how
> terse the 2/7 header was).
>
> Mike

Thanks for the feedback!
--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab

