From: Matteo Croce <[email protected]>
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems has
a very high latency. Block devices do not have exclusive owners in
userspace, any process can set one up (e.g. loop devices). Moreover, device
names can be reused (e.g. loop0 can be reused again and again). A userspace
process setting up a block device and watching for its events cannot thus
reliably tell whether an event relates to the device it just set up or
another earlier instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a uevent
with a UUID that does not match the device you are waiting for, you cannot
tell whether it's because the right uevent has not arrived yet, or it was
already sent and you missed it. So you cannot tell whether you should wait
for it or not.
Being able to set devices up in a namespace would solve the race conditions
too, but it can work only if being namespaced is feasible in the first
place. Many userspace processes need to set devices up for the root
namespace, so this solution cannot always work.
Changing the loop devices naming implementation to always use
monotonically increasing device numbers, instead of reusing the lowest
free number, would also solve the problem, but it would be very disruptive
to userspace and likely break many existing use cases. It would also be
quite awkward to use on long-running machines, as the loop device name
would quickly grow to many-digits length.
Furthermore, this problem does not affect only loop devices - partition
probing is asynchronous and very slow on busy systems. It is very easy to
enter races when using LO_FLAGS_PARTSCAN and watching for the partitions to
show up, as it can take a long time for the uevents to be delivered after
setting them up.
Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
This does not benefit only loop devices and block devices with multiple
partitions, but for example also removable media such as USB sticks or
cdroms/dvdroms/etc.
The first patch is the core one, the 2..4 expose the information in
different ways, and the last one makes the loop device generate a media
changed event upon attach, detach or reconfigure, so the sequence number
is increased.
If merged, this feature will immediately used by the userspace:
https://github.com/systemd/systemd/issues/17469#issuecomment-762919781
v4 -> v5:
- introduce a helper to raise media changed events
- use the new helper in loop instead of the full event code
- unexport inc_diskseq() which is only used by the block code now
- rebase on top of 5.14-rc1
v3 -> v4:
- rebased on top of 5.13
- hook the seqnum increase into the media change event
- make the loop device raise media change events
- merge 1/6 and 5/6
- move the uevent part of 1/6 into a separate one
- drop the now unneeded sysfs refactor
- change 'diskseq' to a global static variable
- add more comments
- refactor commit messages
v2 -> v3:
- rebased on top of 5.13-rc7
- resend because it appeared archived on patchwork
v1 -> v2:
- increase seqnum on media change
- increase on loop detach
Matteo Croce (6):
block: add disk sequence number
block: export the diskseq in uevents
block: add ioctl to read the disk sequence number
block: export diskseq in sysfs
block: add a helper to raise a media changed event
loop: raise media_change event
Documentation/ABI/testing/sysfs-block | 12 ++++++
block/disk-events.c | 62 +++++++++++++++++++++------
block/genhd.c | 43 +++++++++++++++++++
block/ioctl.c | 2 +
drivers/block/loop.c | 5 +++
include/linux/genhd.h | 3 ++
include/uapi/linux/fs.h | 1 +
7 files changed, 114 insertions(+), 14 deletions(-)
--
2.31.1
From: Matteo Croce <[email protected]>
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media change,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
---
block/disk-events.c | 3 +++
block/genhd.c | 24 ++++++++++++++++++++++++
include/linux/genhd.h | 2 ++
3 files changed, 29 insertions(+)
diff --git a/block/disk-events.c b/block/disk-events.c
index a75931ff5da4..04c52f3992ed 100644
--- a/block/disk-events.c
+++ b/block/disk-events.c
@@ -190,6 +190,9 @@ static void disk_check_events(struct disk_events *ev,
spin_unlock_irq(&ev->lock);
+ if (events & DISK_EVENT_MEDIA_CHANGE)
+ inc_diskseq(disk);
+
/*
* Tell userland about new events. Only the events listed in
* @disk->events are reported, and only if DISK_EVENT_FLAG_UEVENT
diff --git a/block/genhd.c b/block/genhd.c
index af4d2ab4a633..0be32dbe97bb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -29,6 +29,23 @@
static struct kobject *block_depr;
+/*
+ * Unique, monotonically increasing sequential number associated with block
+ * devices instances (i.e. incremented each time a device is attached).
+ * Associating uevents with block devices in userspace is difficult and racy:
+ * the uevent netlink socket is lossy, and on slow and overloaded systems has
+ * a very high latency.
+ * Block devices do not have exclusive owners in userspace, any process can set
+ * one up (e.g. loop devices). Moreover, device names can be reused (e.g. loop0
+ * can be reused again and again).
+ * A userspace process setting up a block device and watching for its events
+ * cannot thus reliably tell whether an event relates to the device it just set
+ * up or another earlier instance with the same name.
+ * This sequential number allows userspace processes to solve this problem, and
+ * uniquely associate an uevent to the lifetime to a device.
+ */
+static atomic64_t diskseq;
+
/* for extended dynamic devt allocation, currently only one major is used */
#define NR_EXT_DEVT (1 << MINORBITS)
static DEFINE_IDA(ext_devt_ida);
@@ -1263,6 +1280,8 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
disk_to_dev(disk)->class = &block_class;
disk_to_dev(disk)->type = &disk_type;
device_initialize(disk_to_dev(disk));
+ inc_diskseq(disk);
+
return disk;
out_destroy_part_tbl:
@@ -1363,3 +1382,8 @@ int bdev_read_only(struct block_device *bdev)
return bdev->bd_read_only || get_disk_ro(bdev->bd_disk);
}
EXPORT_SYMBOL(bdev_read_only);
+
+void inc_diskseq(struct gendisk *disk)
+{
+ disk->diskseq = atomic64_inc_return(&diskseq);
+}
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 13b34177cc85..140c028845af 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -172,6 +172,7 @@ struct gendisk {
int node_id;
struct badblocks *bb;
struct lockdep_map lockdep_map;
+ u64 diskseq;
};
/*
@@ -332,6 +333,7 @@ static inline void bd_unlink_disk_holder(struct block_device *bdev,
#endif /* CONFIG_SYSFS */
dev_t part_devt(struct gendisk *disk, u8 partno);
+void inc_diskseq(struct gendisk *disk);
dev_t blk_lookup_devt(const char *name, int partno);
void blk_request_module(dev_t devt);
#ifdef CONFIG_BLOCK
--
2.31.1
From: Matteo Croce <[email protected]>
Export the newly introduced diskseq in uevents:
$ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ
E: DEVNAME=/dev/loop0
E: DISKSEQ=1
E: DEVNAME=/dev/loop1
E: DISKSEQ=2
E: DEVNAME=/dev/loop2
E: DISKSEQ=3
E: DEVNAME=/dev/loop3
E: DISKSEQ=4
E: DEVNAME=/dev/loop4
E: DISKSEQ=5
E: DEVNAME=/dev/loop5
E: DISKSEQ=6
E: DEVNAME=/dev/loop6
E: DISKSEQ=7
E: DEVNAME=/dev/loop7
E: DISKSEQ=8
E: DEVNAME=/dev/nvme0n1
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p1
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p2
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p3
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p4
E: DISKSEQ=9
E: DEVNAME=/dev/nvme0n1p5
E: DISKSEQ=9
E: DEVNAME=/dev/sda
E: DISKSEQ=10
E: DEVNAME=/dev/sda1
E: DISKSEQ=10
E: DEVNAME=/dev/sda2
E: DISKSEQ=10
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
---
block/genhd.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/block/genhd.c b/block/genhd.c
index 0be32dbe97bb..3126f8afe3b8 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1101,8 +1101,17 @@ static void disk_release(struct device *dev)
blk_put_queue(disk->queue);
kfree(disk);
}
+
+static int block_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return add_uevent_var(env, "DISKSEQ=%llu", disk->diskseq);
+}
+
struct class block_class = {
.name = "block",
+ .dev_uevent = block_uevent,
};
static char *block_devnode(struct device *dev, umode_t *mode,
--
2.31.1
From: Matteo Croce <[email protected]>
Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number
from the genhd structure.
# ./getdiskseq /dev/loop*
/dev/loop0: 13
/dev/loop0p1: 13
/dev/loop0p2: 13
/dev/loop0p3: 13
/dev/loop1: 14
/dev/loop1p1: 14
/dev/loop1p2: 14
/dev/loop2: 5
/dev/loop3: 6
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
---
block/ioctl.c | 2 ++
include/uapi/linux/fs.h | 1 +
2 files changed, 3 insertions(+)
diff --git a/block/ioctl.c b/block/ioctl.c
index 24beec9ca9c9..0c3a4a53fa11 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -469,6 +469,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
BLKDEV_DISCARD_SECURE);
case BLKZEROOUT:
return blk_ioctl_zeroout(bdev, mode, arg);
+ case BLKGETDISKSEQ:
+ return put_u64(argp, bdev->bd_disk->diskseq);
case BLKREPORTZONE:
return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
case BLKRESETZONE:
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 4c32e97dcdf0..bdf7b404b3e7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -184,6 +184,7 @@ struct fsxattr {
#define BLKSECDISCARD _IO(0x12,125)
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
+#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
/*
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
--
2.31.1
From: Matteo Croce <[email protected]>
Refactor disk_check_events() and move some code into disk_event_uevent().
Then add disk_force_media_change(), a helper which will be used by
devices to force issuing a DISK_EVENT_MEDIA_CHANGE event.
Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
---
block/disk-events.c | 61 ++++++++++++++++++++++++++++++++-----------
include/linux/genhd.h | 1 +
2 files changed, 47 insertions(+), 15 deletions(-)
diff --git a/block/disk-events.c b/block/disk-events.c
index 04c52f3992ed..7445b8ff2775 100644
--- a/block/disk-events.c
+++ b/block/disk-events.c
@@ -163,15 +163,31 @@ void disk_flush_events(struct gendisk *disk, unsigned int mask)
spin_unlock_irq(&ev->lock);
}
+/*
+ * Tell userland about new events. Only the events listed in @disk->events are
+ * reported, and only if DISK_EVENT_FLAG_UEVENT is set. Otherwise, events are
+ * processed internally but never get reported to userland.
+ */
+static void disk_event_uevent(struct gendisk *disk, unsigned int events)
+{
+ char *envp[ARRAY_SIZE(disk_uevents) + 1] = { };
+ int nr_events = 0, i;
+
+ for (i = 0; i < ARRAY_SIZE(disk_uevents); i++)
+ if (events & disk->events & (1 << i))
+ envp[nr_events++] = disk_uevents[i];
+
+ if (nr_events)
+ kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
+}
+
static void disk_check_events(struct disk_events *ev,
unsigned int *clearing_ptr)
{
struct gendisk *disk = ev->disk;
- char *envp[ARRAY_SIZE(disk_uevents) + 1] = { };
unsigned int clearing = *clearing_ptr;
unsigned int events;
unsigned long intv;
- int nr_events = 0, i;
/* check events */
events = disk->fops->check_events(disk, clearing);
@@ -193,19 +209,8 @@ static void disk_check_events(struct disk_events *ev,
if (events & DISK_EVENT_MEDIA_CHANGE)
inc_diskseq(disk);
- /*
- * Tell userland about new events. Only the events listed in
- * @disk->events are reported, and only if DISK_EVENT_FLAG_UEVENT
- * is set. Otherwise, events are processed internally but never
- * get reported to userland.
- */
- for (i = 0; i < ARRAY_SIZE(disk_uevents); i++)
- if ((events & disk->events & (1 << i)) &&
- (disk->event_flags & DISK_EVENT_FLAG_UEVENT))
- envp[nr_events++] = disk_uevents[i];
-
- if (nr_events)
- kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
+ if (disk->event_flags & DISK_EVENT_FLAG_UEVENT)
+ disk_event_uevent(disk, events);
}
/**
@@ -284,6 +289,32 @@ bool bdev_check_media_change(struct block_device *bdev)
}
EXPORT_SYMBOL(bdev_check_media_change);
+/**
+ * disk_force_media_change - force a media change event
+ * @disk: the disk which will raise the event
+ * @events: the events to raise
+ *
+ * Generate uevents for the disk. If DISK_EVENT_MEDIA_CHANGE is present,
+ * attempt to free all dentries and inodes and invalidates all block
+ * device page cache entries in that case.
+ *
+ * Returns %true if DISK_EVENT_MEDIA_CHANGE was raised, or %false if not.
+ */
+bool disk_force_media_change(struct gendisk *disk, unsigned int events)
+{
+ disk_event_uevent(disk, events);
+
+ if (!(events & DISK_EVENT_MEDIA_CHANGE))
+ return false;
+
+ if (__invalidate_device(disk->part0, true))
+ pr_warn("VFS: busy inodes on changed media %s\n",
+ disk->disk_name);
+ set_bit(GD_NEED_PART_SCAN, &disk->state);
+ return true;
+}
+EXPORT_SYMBOL_GPL(disk_force_media_change);
+
/*
* Separate this part out so that a different pointer for clearing_ptr can be
* passed in for disk_clear_events.
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 140c028845af..849486de81c6 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -237,6 +237,7 @@ extern void disk_block_events(struct gendisk *disk);
extern void disk_unblock_events(struct gendisk *disk);
extern void disk_flush_events(struct gendisk *disk, unsigned int mask);
bool set_capacity_and_notify(struct gendisk *disk, sector_t size);
+bool disk_force_media_change(struct gendisk *disk, unsigned int events);
/* drivers/char/random.c */
extern void add_disk_randomness(struct gendisk *disk) __latent_entropy;
--
2.31.1
From: Matteo Croce <[email protected]>
Add a new sysfs handle to export the new diskseq value.
Place it in <sysfs>/block/<disk>/diskseq and document it.
$ grep . /sys/class/block/*/diskseq
/sys/class/block/loop0/diskseq:13
/sys/class/block/loop1/diskseq:14
/sys/class/block/loop2/diskseq:5
/sys/class/block/loop3/diskseq:6
/sys/class/block/ram0/diskseq:1
/sys/class/block/ram1/diskseq:2
/sys/class/block/vda/diskseq:7
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Matteo Croce <[email protected]>
---
Documentation/ABI/testing/sysfs-block | 12 ++++++++++++
block/genhd.c | 10 ++++++++++
2 files changed, 22 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index e34cdeeeb9d4..a0ed87386639 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -28,6 +28,18 @@ Description:
For more details refer Documentation/admin-guide/iostats.rst
+What: /sys/block/<disk>/diskseq
+Date: February 2021
+Contact: Matteo Croce <[email protected]>
+Description:
+ The /sys/block/<disk>/diskseq files reports the disk
+ sequence number, which is a monotonically increasing
+ number assigned to every drive.
+ Some devices, like the loop device, refresh such number
+ every time the backing file is changed.
+ The value type is 64 bit unsigned.
+
+
What: /sys/block/<disk>/<part>/stat
Date: February 2008
Contact: Jerome Marchand <[email protected]>
diff --git a/block/genhd.c b/block/genhd.c
index 3126f8afe3b8..9ac41500caf9 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -985,6 +985,14 @@ static ssize_t disk_discard_alignment_show(struct device *dev,
return sprintf(buf, "%d\n", queue_discard_alignment(disk->queue));
}
+static ssize_t diskseq_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return sprintf(buf, "%llu\n", disk->diskseq);
+}
+
static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -997,6 +1005,7 @@ static DEVICE_ATTR(capability, 0444, disk_capability_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
+static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
@@ -1042,6 +1051,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_events.attr,
&dev_attr_events_async.attr,
&dev_attr_events_poll_msecs.attr,
+ &dev_attr_diskseq.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
--
2.31.1
From: Matteo Croce <[email protected]>
Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach.
# udevadm monitor -up |grep -e DISK_MEDIA_CHANGE -e DEVNAME &
# losetup -f zero
[ 7.454235] loop0: detected capacity change from 0 to 16384
DISK_MEDIA_CHANGE=1
DEVNAME=/dev/loop0
DEVNAME=/dev/loop0
DEVNAME=/dev/loop0
# losetup -f zero
[ 10.205245] loop1: detected capacity change from 0 to 16384
DISK_MEDIA_CHANGE=1
DEVNAME=/dev/loop1
DEVNAME=/dev/loop1
DEVNAME=/dev/loop1
# losetup -f zero2
[ 13.532368] loop2: detected capacity change from 0 to 40960
DISK_MEDIA_CHANGE=1
DEVNAME=/dev/loop2
DEVNAME=/dev/loop2
# losetup -D
DEVNAME=/dev/loop1
DISK_MEDIA_CHANGE=1
DEVNAME=/dev/loop1
DEVNAME=/dev/loop2
DISK_MEDIA_CHANGE=1
DEVNAME=/dev/loop2
DEVNAME=/dev/loop0
DISK_MEDIA_CHANGE=1
DEVNAME=/dev/loop0
Signed-off-by: Matteo Croce <[email protected]>
---
drivers/block/loop.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f37b9e3d833c..f562609b6d53 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -731,6 +731,7 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
goto out_err;
/* and ... switch */
+ disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
blk_mq_freeze_queue(lo->lo_queue);
mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
lo->lo_backing_file = file;
@@ -1205,6 +1206,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
goto out_unlock;
}
+ disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);
INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
@@ -1349,6 +1351,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
lo_number = lo->lo_number;
+ disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
out_unlock:
mutex_unlock(&lo->lo_mutex);
if (partscan) {
@@ -2325,6 +2328,8 @@ static int loop_add(int i)
disk->fops = &lo_fops;
disk->private_data = lo;
disk->queue = lo->lo_queue;
+ disk->events = DISK_EVENT_MEDIA_CHANGE;
+ disk->event_flags = DISK_EVENT_FLAG_UEVENT;
sprintf(disk->disk_name, "loop%d", i);
add_disk(disk);
mutex_unlock(&loop_ctl_mutex);
--
2.31.1
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
On Tue, 2021-07-13 at 01:05 +0200, Matteo Croce wrote:
> From: Matteo Croce <[email protected]>
>
> Associating uevents with block devices in userspace is difficult and racy:
> the uevent netlink socket is lossy, and on slow and overloaded systems has
> a very high latency. Block devices do not have exclusive owners in
> userspace, any process can set one up (e.g. loop devices). Moreover, device
> names can be reused (e.g. loop0 can be reused again and again). A userspace
> process setting up a block device and watching for its events cannot thus
> reliably tell whether an event relates to the device it just set up or
> another earlier instance with the same name.
>
> Being able to set a UUID on a loop device would solve the race conditions.
> But it does not allow to derive orderings from uevents: if you see a uevent
> with a UUID that does not match the device you are waiting for, you cannot
> tell whether it's because the right uevent has not arrived yet, or it was
> already sent and you missed it. So you cannot tell whether you should wait
> for it or not.
>
> Being able to set devices up in a namespace would solve the race conditions
> too, but it can work only if being namespaced is feasible in the first
> place. Many userspace processes need to set devices up for the root
> namespace, so this solution cannot always work.
>
> Changing the loop devices naming implementation to always use
> monotonically increasing device numbers, instead of reusing the lowest
> free number, would also solve the problem, but it would be very disruptive
> to userspace and likely break many existing use cases. It would also be
> quite awkward to use on long-running machines, as the loop device name
> would quickly grow to many-digits length.
>
> Furthermore, this problem does not affect only loop devices - partition
> probing is asynchronous and very slow on busy systems. It is very easy to
> enter races when using LO_FLAGS_PARTSCAN and watching for the partitions to
> show up, as it can take a long time for the uevents to be delivered after
> setting them up.
>
> Associating a unique, monotonically increasing sequential number to the
> lifetime of each block device, which can be retrieved with an ioctl
> immediately upon setting it up, allows to solve the race conditions with
> uevents, and also allows userspace processes to know whether they should
> wait for the uevent they need or if it was dropped and thus they should
> move on.
>
> This does not benefit only loop devices and block devices with multiple
> partitions, but for example also removable media such as USB sticks or
> cdroms/dvdroms/etc.
>
> The first patch is the core one, the 2..4 expose the information in
> different ways, and the last one makes the loop device generate a media
> changed event upon attach, detach or reconfigure, so the sequence number
> is increased.
>
> If merged, this feature will immediately used by the userspace:
> https://github.com/systemd/systemd/issues/17469#issuecomment-762919781
>
> v4 -> v5:
> - introduce a helper to raise media changed events
> - use the new helper in loop instead of the full event code
> - unexport inc_diskseq() which is only used by the block code now
> - rebase on top of 5.14-rc1
>
> v3 -> v4:
> - rebased on top of 5.13
> - hook the seqnum increase into the media change event
> - make the loop device raise media change events
> - merge 1/6 and 5/6
> - move the uevent part of 1/6 into a separate one
> - drop the now unneeded sysfs refactor
> - change 'diskseq' to a global static variable
> - add more comments
> - refactor commit messages
>
> v2 -> v3:
> - rebased on top of 5.13-rc7
> - resend because it appeared archived on patchwork
>
> v1 -> v2:
> - increase seqnum on media change
> - increase on loop detach
>
> Matteo Croce (6):
> block: add disk sequence number
> block: export the diskseq in uevents
> block: add ioctl to read the disk sequence number
> block: export diskseq in sysfs
> block: add a helper to raise a media changed event
> loop: raise media_change event
>
> Documentation/ABI/testing/sysfs-block | 12 ++++++
> block/disk-events.c | 62 +++++++++++++++++++++------
> block/genhd.c | 43 +++++++++++++++++++
> block/ioctl.c | 2 +
> drivers/block/loop.c | 5 +++
> include/linux/genhd.h | 3 ++
> include/uapi/linux/fs.h | 1 +
> 7 files changed, 114 insertions(+), 14 deletions(-)
For the series:
Tested-by: Luca Boccassi <[email protected]>
I have implemented the basic systemd support for this (ioctl + uevent,
sysfs will be done later), and tested with this series on x86_64 and
Debian 11 userspace, everything seems to work great. Thanks Matteo!
Here's the implementation, in draft state until the kernel side is
merged:
https://github.com/systemd/systemd/pull/20257
--
Kind regards,
Luca Boccassi
On Tue, Jul 20, 2021 at 7:27 PM Luca Boccassi <[email protected]> wrote:
>
> On Tue, 2021-07-13 at 01:05 +0200, Matteo Croce wrote:
> > From: Matteo Croce <[email protected]>
> >
> > Associating uevents with block devices in userspace is difficult and racy:
> > the uevent netlink socket is lossy, and on slow and overloaded systems has
> > a very high latency. Block devices do not have exclusive owners in
> > userspace, any process can set one up (e.g. loop devices). Moreover, device
> > names can be reused (e.g. loop0 can be reused again and again). A userspace
> > process setting up a block device and watching for its events cannot thus
> > reliably tell whether an event relates to the device it just set up or
> > another earlier instance with the same name.
> >
> > Being able to set a UUID on a loop device would solve the race conditions.
> > But it does not allow to derive orderings from uevents: if you see a uevent
> > with a UUID that does not match the device you are waiting for, you cannot
> > tell whether it's because the right uevent has not arrived yet, or it was
> > already sent and you missed it. So you cannot tell whether you should wait
> > for it or not.
> >
> > Being able to set devices up in a namespace would solve the race conditions
> > too, but it can work only if being namespaced is feasible in the first
> > place. Many userspace processes need to set devices up for the root
> > namespace, so this solution cannot always work.
> >
> > Changing the loop devices naming implementation to always use
> > monotonically increasing device numbers, instead of reusing the lowest
> > free number, would also solve the problem, but it would be very disruptive
> > to userspace and likely break many existing use cases. It would also be
> > quite awkward to use on long-running machines, as the loop device name
> > would quickly grow to many-digits length.
> >
> > Furthermore, this problem does not affect only loop devices - partition
> > probing is asynchronous and very slow on busy systems. It is very easy to
> > enter races when using LO_FLAGS_PARTSCAN and watching for the partitions to
> > show up, as it can take a long time for the uevents to be delivered after
> > setting them up.
> >
> > Associating a unique, monotonically increasing sequential number to the
> > lifetime of each block device, which can be retrieved with an ioctl
> > immediately upon setting it up, allows to solve the race conditions with
> > uevents, and also allows userspace processes to know whether they should
> > wait for the uevent they need or if it was dropped and thus they should
> > move on.
> >
> > This does not benefit only loop devices and block devices with multiple
> > partitions, but for example also removable media such as USB sticks or
> > cdroms/dvdroms/etc.
> >
> > The first patch is the core one, the 2..4 expose the information in
> > different ways, and the last one makes the loop device generate a media
> > changed event upon attach, detach or reconfigure, so the sequence number
> > is increased.
> >
> > If merged, this feature will immediately used by the userspace:
> > https://github.com/systemd/systemd/issues/17469#issuecomment-762919781
> >
> > v4 -> v5:
> > - introduce a helper to raise media changed events
> > - use the new helper in loop instead of the full event code
> > - unexport inc_diskseq() which is only used by the block code now
> > - rebase on top of 5.14-rc1
> >
> > v3 -> v4:
> > - rebased on top of 5.13
> > - hook the seqnum increase into the media change event
> > - make the loop device raise media change events
> > - merge 1/6 and 5/6
> > - move the uevent part of 1/6 into a separate one
> > - drop the now unneeded sysfs refactor
> > - change 'diskseq' to a global static variable
> > - add more comments
> > - refactor commit messages
> >
> > v2 -> v3:
> > - rebased on top of 5.13-rc7
> > - resend because it appeared archived on patchwork
> >
> > v1 -> v2:
> > - increase seqnum on media change
> > - increase on loop detach
> >
> > Matteo Croce (6):
> > block: add disk sequence number
> > block: export the diskseq in uevents
> > block: add ioctl to read the disk sequence number
> > block: export diskseq in sysfs
> > block: add a helper to raise a media changed event
> > loop: raise media_change event
> >
> > Documentation/ABI/testing/sysfs-block | 12 ++++++
> > block/disk-events.c | 62 +++++++++++++++++++++------
> > block/genhd.c | 43 +++++++++++++++++++
> > block/ioctl.c | 2 +
> > drivers/block/loop.c | 5 +++
> > include/linux/genhd.h | 3 ++
> > include/uapi/linux/fs.h | 1 +
> > 7 files changed, 114 insertions(+), 14 deletions(-)
>
> For the series:
>
> Tested-by: Luca Boccassi <[email protected]>
>
> I have implemented the basic systemd support for this (ioctl + uevent,
> sysfs will be done later), and tested with this series on x86_64 and
> Debian 11 userspace, everything seems to work great. Thanks Matteo!
>
> Here's the implementation, in draft state until the kernel side is
> merged:
>
> https://github.com/systemd/systemd/pull/20257
>
Hi Jens,
Given that the whole series has been acked and tested, and the
userspace has a draft implementation for it, is there anything else we
can do here?
Regards,
--
per aspera ad upstream
On Di, 20.07.21 18:27, Luca Boccassi ([email protected]) wrote:
> Here's the implementation, in draft state until the kernel side is
> merged:
>
> https://github.com/systemd/systemd/pull/20257
I reviewed this systemd PR now. Looks excellent. See my comments on the PR.
Jens, anything we can do to get your blessing on the kernel patch set
and get this landed?
Thank you,
Lennart
On 7/12/21 5:05 PM, Matteo Croce wrote:
> From: Matteo Croce <[email protected]>
>
> Associating uevents with block devices in userspace is difficult and racy:
> the uevent netlink socket is lossy, and on slow and overloaded systems has
> a very high latency. Block devices do not have exclusive owners in
> userspace, any process can set one up (e.g. loop devices). Moreover, device
> names can be reused (e.g. loop0 can be reused again and again). A userspace
> process setting up a block device and watching for its events cannot thus
> reliably tell whether an event relates to the device it just set up or
> another earlier instance with the same name.
>
> Being able to set a UUID on a loop device would solve the race conditions.
> But it does not allow to derive orderings from uevents: if you see a uevent
> with a UUID that does not match the device you are waiting for, you cannot
> tell whether it's because the right uevent has not arrived yet, or it was
> already sent and you missed it. So you cannot tell whether you should wait
> for it or not.
>
> Being able to set devices up in a namespace would solve the race conditions
> too, but it can work only if being namespaced is feasible in the first
> place. Many userspace processes need to set devices up for the root
> namespace, so this solution cannot always work.
>
> Changing the loop devices naming implementation to always use
> monotonically increasing device numbers, instead of reusing the lowest
> free number, would also solve the problem, but it would be very disruptive
> to userspace and likely break many existing use cases. It would also be
> quite awkward to use on long-running machines, as the loop device name
> would quickly grow to many-digits length.
>
> Furthermore, this problem does not affect only loop devices - partition
> probing is asynchronous and very slow on busy systems. It is very easy to
> enter races when using LO_FLAGS_PARTSCAN and watching for the partitions to
> show up, as it can take a long time for the uevents to be delivered after
> setting them up.
>
> Associating a unique, monotonically increasing sequential number to the
> lifetime of each block device, which can be retrieved with an ioctl
> immediately upon setting it up, allows to solve the race conditions with
> uevents, and also allows userspace processes to know whether they should
> wait for the uevent they need or if it was dropped and thus they should
> move on.
>
> This does not benefit only loop devices and block devices with multiple
> partitions, but for example also removable media such as USB sticks or
> cdroms/dvdroms/etc.
>
> The first patch is the core one, the 2..4 expose the information in
> different ways, and the last one makes the loop device generate a media
> changed event upon attach, detach or reconfigure, so the sequence number
> is increased.
>
> If merged, this feature will immediately used by the userspace:
> https://github.com/systemd/systemd/issues/17469#issuecomment-762919781
Applied for 5.15, with #2 done manually since it didn't apply cleanly.
--
Jens Axboe