2008-11-24 03:57:52

by NeilBrown

[permalink] [raw]
Subject: [PATCH 0/2] RFC: allow md devices to disappear when not in use.

The following two patches - which are in -next if you need to see
their context - make two fairly significant changes to the way md
devices are (or can be) created and destroyed.

They include a smallish change to fs/block_dev.c so I'm cc:ing recent
modifiers of that file.

Currently, md devices spring into existence when a corresponding
device-special file is opened, and they remain around forever, or
until the module is unloaded.

The first of these patches causes md devices to disappear when they
are no longer in use. More precisely: after the last close when the
array is completely unconfigured.
This opens up a small race if one process is opening the device about
the same time that another process is closing it for the last time.
Closing this race requires the small change in fs/block_dev.c

The second patch allows md device to be created with a name rather
than a number. Sometimes it is more convenient to think about arrays
by name, and supporting this in the kernel is a useful idea (From Doug
Ledford).

If you write e.g. "md_foo" to "/sys/modules/md_mod/parameters/new_array"
then an md device was an arbitrary minor number will be created named
md_foo rather than e.g. md256. (the "md_" prefix is required).

Current tools seem to cope perfectly with md devices disappearing when
finished with. They are not likely to cope with "md_foo" device
naming, but that would only be used if a new tool requested it, so it
is up to that tool (mdadm) to not cause confusion.

Any review comments most welcome.

Thanks,
NeilBrown


---

NeilBrown (2):
Allow md devices to be created by name.
md: make devices disappear when they are no longer needed.


drivers/md/md.c | 175 ++++++++++++++++++++++++++++++++++++++-------
fs/block_dev.c | 14 ++++
include/linux/raid/md_k.h | 5 +
3 files changed, 165 insertions(+), 29 deletions(-)

--


2008-11-24 03:58:24

by NeilBrown

[permalink] [raw]
Subject: [PATCH 1/2] md: make devices disappear when they are no longer needed.

Currently md devices, once created, never disappear until the module
is unloaded. This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.

If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed. However it
is not possible to hook into the gendisk destruction process to enable
this.

So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed. However this has a
complication.
Between the call
__blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
__blkdev_get->md_open

there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.

Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created. We need to
ensure that this reference can not be used. i.e. the ->open must
fail.

So:
1/ in md_probe we set a flag in the mddev (hold_active) which
indicates that the array should be treated as active, even
though there are no references, and no appearance of activity.
This is cleared by md_release when the device is closed.
This ensure that the gendisk will survive between md_probe and
md_open.

2/ In md_open we check if the mddev we expect to open matches
the gendisk that we did open.
If there is a mismatch we return -ERESTARTSYS and modify
__blkdev_get to retry from the top in that case.
In the -ERESTARTSYS sys case we make sure to wait until
the old gendisk (that we succeeded in opening) is really gone so
we loop at most once.

If an inactive md device is opened and closed in quick succession, a
gendisk will be created and destroyed repeatedly. If this happens
often, udev can easily get a backlog of ADD/DELETE requests. However
this is not normally the case (and it does work, just slowly).
Normally the first open of an MD device will involve initialising it
(SET_ARRAY_INFO / NEW_DISK) which will cause it to appear active and
not be destroyed.

When an array is stopped, the gendisk will remain around until the
last open is closed, then it will disappear.

As an array can be stopped by writing to a sysfs attribute
echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects. This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().



Signed-off-by: NeilBrown <[email protected]>
---

drivers/md/md.c | 57 ++++++++++++++++++++++++++++++++++++---------
fs/block_dev.c | 14 +++++++++++
include/linux/raid/md_k.h | 4 +++
3 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index cad0825..5ebda55 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -214,16 +214,33 @@ static inline mddev_t *mddev_get(mddev_t *mddev)
return mddev;
}

+static void mddev_delayed_delete(struct work_struct *ws)
+{
+ mddev_t *mddev = container_of(ws, mddev_t, del_work);
+ kobject_del(&mddev->kobj);
+ kobject_put(&mddev->kobj);
+}
+
static void mddev_put(mddev_t *mddev)
{
if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
return;
- if (!mddev->raid_disks && list_empty(&mddev->disks)) {
+ if (!mddev->raid_disks && list_empty(&mddev->disks) &&
+ !mddev->hold_active) {
list_del(&mddev->all_mddevs);
- spin_unlock(&all_mddevs_lock);
- kobject_put(&mddev->kobj);
- } else
- spin_unlock(&all_mddevs_lock);
+ if (mddev->gendisk) {
+ /* we did a probe so need to clean up.
+ * Call schedule_work inside the spinlock
+ * so that flush_scheduled_work() after
+ * mddev_find will succeed in waiting for the
+ * work to be done.
+ */
+ INIT_WORK(&mddev->del_work, mddev_delayed_delete);
+ schedule_work(&mddev->del_work);
+ } else
+ kfree(mddev);
+ }
+ spin_unlock(&all_mddevs_lock);
}

static mddev_t * mddev_find(dev_t unit)
@@ -242,6 +259,7 @@ static mddev_t * mddev_find(dev_t unit)

if (new) {
list_add(&new->all_mddevs, &all_mddevs);
+ mddev->hold_active = UNTIL_CLOSE;
spin_unlock(&all_mddevs_lock);
return new;
}
@@ -3484,6 +3502,11 @@ static struct kobject *md_probe(dev_t dev, int *part, void *data)
if (!mddev)
return NULL;

+ /* wait for any previous instance if this device
+ * to be completed removed (mddev_delayed_delete).
+ */
+ flush_scheduled_work();
+
mutex_lock(&disks_mutex);
if (mddev->gendisk) {
mutex_unlock(&disks_mutex);
@@ -3520,7 +3543,7 @@ static struct kobject *md_probe(dev_t dev, int *part, void *data)
disk->private_data = mddev;
disk->queue = mddev->queue;
/* Allow extended partitions. This makes the
- * 'mdp' device redundant, but we can really
+ * 'mdp' device redundant, but we can't really
* remove it now.
*/
disk->flags |= GENHD_FL_EXT_DEVT;
@@ -3536,6 +3559,7 @@ static struct kobject *md_probe(dev_t dev, int *part, void *data)
kobject_uevent(&mddev->kobj, KOBJ_ADD);
mddev->sysfs_state = sysfs_get_dirent(mddev->kobj.sd, "array_state");
}
+ mddev_put(mddev);
return NULL;
}

@@ -5070,14 +5094,25 @@ static int md_open(struct block_device *bdev, fmode_t mode)
* Succeed if we can lock the mddev, which confirms that
* it isn't being stopped right now.
*/
- mddev_t *mddev = bdev->bd_disk->private_data;
+ mddev_t *mddev = mddev_find(bdev->bd_dev);
int err;

+ if (mddev->gendisk != bdev->bd_disk) {
+ /* we are racing with mddev_put which is discarding this
+ * bd_disk.
+ */
+ mddev_put(mddev);
+ /* Wait until bdev->bd_disk is definitely gone */
+ flush_scheduled_work();
+ /* Then retry the open from the top */
+ return -ERESTARTSYS;
+ }
+ BUG_ON(mddev != bdev->bd_disk->private_data);
+
if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
goto out;

err = 0;
- mddev_get(mddev);
atomic_inc(&mddev->openers);
mddev_unlock(mddev);

@@ -5092,6 +5127,7 @@ static int md_release(struct gendisk *disk, fmode_t mode)

BUG_ON(!mddev);
atomic_dec(&mddev->openers);
+ mddev->hold_active = 0;
mddev_put(mddev);

return 0;
@@ -6436,11 +6472,8 @@ static __exit void md_exit(void)
unregister_sysctl_table(raid_table_header);
remove_proc_entry("mdstat", NULL);
for_each_mddev(mddev, tmp) {
- struct gendisk *disk = mddev->gendisk;
- if (!disk)
- continue;
export_array(mddev);
- mddev_put(mddev);
+ mddev->hold_active = 0;
}
}

diff --git a/fs/block_dev.c b/fs/block_dev.c
index db831ef..30b88b9 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1004,6 +1004,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
}

lock_kernel();
+ restart:

ret = -ENXIO;
disk = get_gendisk(bdev->bd_dev, &partno);
@@ -1024,6 +1025,19 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)

if (disk->fops->open) {
ret = disk->fops->open(bdev, mode);
+ if (ret == -ERESTARTSYS) {
+ /* Lost a race with 'disk' being
+ * deleted, try again.
+ * See md.c
+ */
+ disk_put_part(bdev->bd_part);
+ bdev->bd_part = NULL;
+ module_put(disk->fops->owner);
+ put_disk(disk);
+ bdev->bd_disk = NULL;
+ mutex_unlock(&bdev->bd_mutex);
+ goto restart;
+ }
if (ret)
goto out_clear;
}
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 8f9a54c..7b6029b 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -137,6 +137,8 @@ struct mddev_s
struct gendisk *gendisk;

struct kobject kobj;
+ int hold_active;
+#define UNTIL_CLOSE 1

/* Superblock information */
int major_version,
@@ -246,6 +248,8 @@ struct mddev_s
*/
struct sysfs_dirent *sysfs_action; /* handle for 'sync_action' */

+ struct work_struct del_work; /* used for delayed sysfs removal */
+
spinlock_t write_lock;
wait_queue_head_t sb_wait; /* for waiting on superblock updates */
atomic_t pending_writes; /* number of active superblock writes */

2008-11-24 03:58:54

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/2] Allow md devices to be created by name.

Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.

Also, creating md devices by opening the device special file is a bit
awkward.

So this patch provides a new option for creating and naming devices.

Writing a name such as "md_home" to
/sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created. It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.

md devices that a created by an open are destroyed on the last
close when the device is inactive.
For names md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.

The name of the array must start 'md_' to avoid conflict with
other devices.

Signed-off-by: NeilBrown <[email protected]>
---

drivers/md/md.c | 122 ++++++++++++++++++++++++++++++++++++++-------
include/linux/raid/md_k.h | 1
2 files changed, 104 insertions(+), 19 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5ebda55..89700b5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -249,17 +249,51 @@ static mddev_t * mddev_find(dev_t unit)

retry:
spin_lock(&all_mddevs_lock);
- list_for_each_entry(mddev, &all_mddevs, all_mddevs)
- if (mddev->unit == unit) {
- mddev_get(mddev);
+
+ if (unit) {
+ list_for_each_entry(mddev, &all_mddevs, all_mddevs)
+ if (mddev->unit == unit) {
+ mddev_get(mddev);
+ spin_unlock(&all_mddevs_lock);
+ kfree(new);
+ return mddev;
+ }
+
+ if (new) {
+ list_add(&new->all_mddevs, &all_mddevs);
spin_unlock(&all_mddevs_lock);
- kfree(new);
- return mddev;
+ new->hold_active = UNTIL_CLOSE;
+ return new;
}
-
- if (new) {
+ } else if (new) {
+ /* find an unused unit number */
+ static int next_minor = 512;
+ int start = next_minor;
+ int is_free = 0;
+ int dev = 0;
+ while (!is_free) {
+ dev = MKDEV(MD_MAJOR, next_minor);
+ next_minor++;
+ if (next_minor > MINORMASK)
+ next_minor = 0;
+ if (next_minor == start) {
+ /* Oh dear, all in use. */
+ spin_unlock(&all_mddevs_lock);
+ kfree(new);
+ return NULL;
+ }
+
+ is_free = 1;
+ list_for_each_entry(mddev, &all_mddevs, all_mddevs)
+ if (mddev->unit == dev) {
+ is_free = 0;
+ break;
+ }
+ }
+ new->unit = dev;
+ new->md_minor = MINOR(dev);
+ new->hold_active = UNTIL_STOP;
list_add(&new->all_mddevs, &all_mddevs);
- mddev->hold_active = UNTIL_CLOSE;
spin_unlock(&all_mddevs_lock);
return new;
}
@@ -3489,18 +3523,22 @@ static struct kobj_type md_ktype = {

int mdp_major = 0;

-static struct kobject *md_probe(dev_t dev, int *part, void *data)
+static int md_alloc(dev_t dev, char *name)
{
static DEFINE_MUTEX(disks_mutex);
mddev_t *mddev = mddev_find(dev);
struct gendisk *disk;
- int partitioned = (MAJOR(dev) != MD_MAJOR);
- int shift = partitioned ? MdpMinorShift : 0;
- int unit = MINOR(dev) >> shift;
+ int partitioned;
+ int shift;
+ int unit;
int error;

if (!mddev)
- return NULL;
+ return -ENODEV;
+
+ partitioned = (MAJOR(mddev->unit) != MD_MAJOR);
+ shift = partitioned ? MdpMinorShift : 0;
+ unit = MINOR(mddev->unit) >> shift;

/* wait for any previous instance if this device
* to be completed removed (mddev_delayed_delete).
@@ -3511,14 +3549,29 @@ static struct kobject *md_probe(dev_t dev, int *part, void *data)
if (mddev->gendisk) {
mutex_unlock(&disks_mutex);
mddev_put(mddev);
- return NULL;
+ return -EEXIST;
+ }
+
+ if (name) {
+ /* Need to ensure that 'name' is not a duplicate.
+ */
+ mddev_t *mddev2;
+ spin_lock(&all_mddevs_lock);
+
+ list_for_each_entry(mddev2, &all_mddevs, all_mddevs)
+ if (mddev2->gendisk &&
+ strcmp(mddev2->gendisk->disk_name, name) == 0) {
+ spin_unlock(&all_mddevs_lock);
+ return -EEXIST;
+ }
+ spin_unlock(&all_mddevs_lock);
}

mddev->queue = blk_alloc_queue(GFP_KERNEL);
if (!mddev->queue) {
mutex_unlock(&disks_mutex);
mddev_put(mddev);
- return NULL;
+ return -ENOMEM;
}
/* Can be unlocked because the queue is new: no concurrency */
queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, mddev->queue);
@@ -3531,11 +3584,13 @@ static struct kobject *md_probe(dev_t dev, int *part, void *data)
blk_cleanup_queue(mddev->queue);
mddev->queue = NULL;
mddev_put(mddev);
- return NULL;
+ return -ENOMEM;
}
- disk->major = MAJOR(dev);
+ disk->major = MAJOR(mddev->unit);
disk->first_minor = unit << shift;
- if (partitioned)
+ if (name)
+ strcpy(disk->disk_name, name);
+ else if (partitioned)
sprintf(disk->disk_name, "md_d%d", unit);
else
sprintf(disk->disk_name, "md%d", unit);
@@ -3560,9 +3615,34 @@ static struct kobject *md_probe(dev_t dev, int *part, void *data)
mddev->sysfs_state = sysfs_get_dirent(mddev->kobj.sd, "array_state");
}
mddev_put(mddev);
+ return 0;
+}
+
+static struct kobject *md_probe(dev_t dev, int *part, void *data)
+{
+ md_alloc(dev, NULL);
return NULL;
}

+static int add_named_array(const char *val, struct kernel_param *kp)
+{
+ /* val must be "md_*" where * is not all digits.
+ * We allocate an array with a large free minor number, and
+ * set the name to val. val must not already be an active name.
+ */
+ int len = strlen(val);
+ char buf[DISK_NAME_LEN];
+
+ while (len && val[len-1] == '\n')
+ len--;
+ if (len >= DISK_NAME_LEN)
+ return -E2BIG;
+ strlcpy(buf, val, len+1);
+ if (strncmp(buf, "md_", 3) != 0)
+ return -EINVAL;
+ return md_alloc(0, buf);
+}
+
static void md_safemode_timeout(unsigned long data)
{
mddev_t *mddev = (mddev_t *) data;
@@ -4023,6 +4103,8 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
mddev->barriers_work = 0;
mddev->safemode = 0;
kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);
+ if (mddev->hold_active == UNTIL_STOP)
+ mddev->hold_active = 0;

} else if (mddev->pers)
printk(KERN_INFO "md: %s switched to read-only mode.\n",
@@ -5127,7 +5209,8 @@ static int md_release(struct gendisk *disk, fmode_t mode)

BUG_ON(!mddev);
atomic_dec(&mddev->openers);
- mddev->hold_active = 0;
+ if (mddev->hold_active == UNTIL_CLOSE)
+ mddev->hold_active = 0;
mddev_put(mddev);

return 0;
@@ -6498,6 +6581,7 @@ static int set_ro(const char *val, struct kernel_param *kp)
module_param_call(start_ro, set_ro, get_ro, NULL, S_IRUSR|S_IWUSR);
module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);

+module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);

EXPORT_SYMBOL(register_md_personality);
EXPORT_SYMBOL(unregister_md_personality);
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 7b6029b..19756d7 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -139,6 +139,7 @@ struct mddev_s
struct kobject kobj;
int hold_active;
#define UNTIL_CLOSE 1
+#define UNTIL_STOP 2

/* Superblock information */
int major_version,

2008-11-24 04:19:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

(cc'ing Greg)

NeilBrown wrote:
> Currently md devices, once created, never disappear until the module
> is unloaded. This is essentially because the gendisk holds a
> reference to the mddev, and the mddev holds a reference to the
> gendisk, this a circular reference.
>
> If we drop the reference from mddev to gendisk, then we need to ensure
> that the mddev is destroyed when the gendisk is destroyed. However it
> is not possible to hook into the gendisk destruction process to enable
> this.
>
> So we drop the reference from the gendisk to the mddev and destroy the
> gendisk when the mddev gets destroyed. However this has a
> complication.
> Between the call
> __blkdev_get->get_gendisk->kobj_lookup->md_probe
> and the call
> __blkdev_get->md_open
>
> there is no obvious way to hold a reference on the mddev any more, so
> unless something is done, it will disappear and gendisk will be
> destroyed prematurely.
>
> Also, once we decide to destroy the mddev, there will be an unlockable
> moment before the gendisk is unlinked (blk_unregister_region) during
> which a new reference to the gendisk can be created. We need to
> ensure that this reference can not be used. i.e. the ->open must
> fail.

Ah... I'm not really sure I'm following all of this correctly but would
it be possible to just add ->release to genhd and do regular reference
counting rather than this complex dancing? ->release was recently added
to cdev so it'll be nicely parallel.

Thanks.

--
tejun

2008-11-24 04:24:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Mon, Nov 24, 2008 at 02:55:30PM +1100, NeilBrown wrote:
> Between the call
> __blkdev_get->get_gendisk->kobj_lookup->md_probe
> and the call
> __blkdev_get->md_open
>
> there is no obvious way to hold a reference on the mddev any more, so
> unless something is done, it will disappear and gendisk will be
> destroyed prematurely.
>
> Also, once we decide to destroy the mddev, there will be an unlockable
> moment before the gendisk is unlinked (blk_unregister_region) during
> which a new reference to the gendisk can be created. We need to
> ensure that this reference can not be used. i.e. the ->open must
> fail.
>
> So:
> 1/ in md_probe we set a flag in the mddev (hold_active) which
> indicates that the array should be treated as active, even
> though there are no references, and no appearance of activity.
> This is cleared by md_release when the device is closed.
> This ensure that the gendisk will survive between md_probe and
> md_open.

That won't work. Note that you are not guaranteed that md_release() will be
called after md_probe(); there are failure exits in __blkdev_get() that do
not reach ->open() at all.

What lifetime rules do you really want? I never liked the tricks pulled
by md wrt gendisk lifetimes and that might be a good time to sort that
out for good...

What should happen to things like pending IO, etc. on array destruction?
AFAICS, that's the real question...

2008-11-24 04:48:08

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Monday November 24, [email protected] wrote:
> On Mon, Nov 24, 2008 at 02:55:30PM +1100, NeilBrown wrote:
> > Between the call
> > __blkdev_get->get_gendisk->kobj_lookup->md_probe
> > and the call
> > __blkdev_get->md_open
> >
> > there is no obvious way to hold a reference on the mddev any more, so
> > unless something is done, it will disappear and gendisk will be
> > destroyed prematurely.
> >
> > Also, once we decide to destroy the mddev, there will be an unlockable
> > moment before the gendisk is unlinked (blk_unregister_region) during
> > which a new reference to the gendisk can be created. We need to
> > ensure that this reference can not be used. i.e. the ->open must
> > fail.
> >
> > So:
> > 1/ in md_probe we set a flag in the mddev (hold_active) which
> > indicates that the array should be treated as active, even
> > though there are no references, and no appearance of activity.
> > This is cleared by md_release when the device is closed.
> > This ensure that the gendisk will survive between md_probe and
> > md_open.

Thanks for the reply.

>
> That won't work. Note that you are not guaranteed that md_release() will be
> called after md_probe(); there are failure exits in __blkdev_get() that do
> not reach ->open() at all.

I thought about those failure exits and concluded that they are the
sort the almost never happen in practice (I think -ENOMEM is the only
credible error) and the consequence is only that the gendisk will hang
around a until some future open/close, so it is no worse that the
current situation.
Resolving that would be nice but I didn't feel up to any major surgery.

>
> What lifetime rules do you really want? I never liked the tricks pulled
> by md wrt gendisk lifetimes and that might be a good time to sort that
> out for good...

I'm not sure what 'tricks' you are referring to. Can you elaborate?

I want the gendisk to appear as soon as it is needed (not because I
think that is necessarily a good idea, but it is legacy functionality that I
don't think we can easily discard). And I want them to disappear when
they contain no information and have nothing referring to them.

>
> What should happen to things like pending IO, etc. on array destruction?
> AFAICS, that's the real question...

Pending IO should not be a possibility thanks to the sync_blockdev call
in __blkdev_put.
During that last close, nothing can generate new IO, and any old IO
will be flushed (won't it?).

NeilBrown

2008-11-24 05:13:38

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Monday November 24, [email protected] wrote:
> (cc'ing Greg)
>
> NeilBrown wrote:
> > Currently md devices, once created, never disappear until the module
> > is unloaded. This is essentially because the gendisk holds a
> > reference to the mddev, and the mddev holds a reference to the
> > gendisk, this a circular reference.
> >
> > If we drop the reference from mddev to gendisk, then we need to ensure
> > that the mddev is destroyed when the gendisk is destroyed. However it
> > is not possible to hook into the gendisk destruction process to enable
> > this.
> >
> > So we drop the reference from the gendisk to the mddev and destroy the
> > gendisk when the mddev gets destroyed. However this has a
> > complication.
> > Between the call
> > __blkdev_get->get_gendisk->kobj_lookup->md_probe
> > and the call
> > __blkdev_get->md_open
> >
> > there is no obvious way to hold a reference on the mddev any more, so
> > unless something is done, it will disappear and gendisk will be
> > destroyed prematurely.
> >
> > Also, once we decide to destroy the mddev, there will be an unlockable
> > moment before the gendisk is unlinked (blk_unregister_region) during
> > which a new reference to the gendisk can be created. We need to
> > ensure that this reference can not be used. i.e. the ->open must
> > fail.
>
> Ah... I'm not really sure I'm following all of this correctly but would
> it be possible to just add ->release to genhd and do regular reference
> counting rather than this complex dancing? ->release was recently added
> to cdev so it'll be nicely parallel.

Maybe...

If genhd.c:disk_release called e.g.
disk->fops->final_put(disk)

then I could possibly link in to that to destroy the md state when the
gendisk finally disappears.

When I want to kill the gendisk I would call blk_unregister_region
directly (not through del_gendisk) to allow it to disappear.
If md_probe then gets called before the final_put, I'd need to
call blk_register_region again to re-install it.

I think that would work.

Would 'block_device_operations' be the right place for this
'final_put' or 'final_release' ??

Thanks.


NeilBrown

2008-11-24 05:35:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

(cc'ing Jens)

Neil Brown wrote:
> On Monday November 24, [email protected] wrote:
>> (cc'ing Greg)
>>
>> NeilBrown wrote:
>>> Currently md devices, once created, never disappear until the module
>>> is unloaded. This is essentially because the gendisk holds a
>>> reference to the mddev, and the mddev holds a reference to the
>>> gendisk, this a circular reference.
>>>
>>> If we drop the reference from mddev to gendisk, then we need to ensure
>>> that the mddev is destroyed when the gendisk is destroyed. However it
>>> is not possible to hook into the gendisk destruction process to enable
>>> this.
>>>
>>> So we drop the reference from the gendisk to the mddev and destroy the
>>> gendisk when the mddev gets destroyed. However this has a
>>> complication.
>>> Between the call
>>> __blkdev_get->get_gendisk->kobj_lookup->md_probe
>>> and the call
>>> __blkdev_get->md_open
>>>
>>> there is no obvious way to hold a reference on the mddev any more, so
>>> unless something is done, it will disappear and gendisk will be
>>> destroyed prematurely.
>>>
>>> Also, once we decide to destroy the mddev, there will be an unlockable
>>> moment before the gendisk is unlinked (blk_unregister_region) during
>>> which a new reference to the gendisk can be created. We need to
>>> ensure that this reference can not be used. i.e. the ->open must
>>> fail.
>> Ah... I'm not really sure I'm following all of this correctly but would
>> it be possible to just add ->release to genhd and do regular reference
>> counting rather than this complex dancing? ->release was recently added
>> to cdev so it'll be nicely parallel.
>
> Maybe...
>
> If genhd.c:disk_release called e.g.
> disk->fops->final_put(disk)
>
> then I could possibly link in to that to destroy the md state when the
> gendisk finally disappears.
>
> When I want to kill the gendisk I would call blk_unregister_region
> directly (not through del_gendisk) to allow it to disappear.
> If md_probe then gets called before the final_put, I'd need to
> call blk_register_region again to re-install it.
>
> I think that would work.
>
> Would 'block_device_operations' be the right place for this
> 'final_put' or 'final_release' ??

I suppose so. Maybe just void (*release)(struct gendisk *) but Jens is
the maintainer. Jens, what do you think?

Thanks.

--
tejun

2008-11-24 06:11:23

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Mon, November 24, 2008 4:34 pm, Tejun Heo wrote:
> (cc'ing Jens)
>
> Neil Brown wrote:
>> On Monday November 24, [email protected] wrote:
>>> (cc'ing Greg)
>>>
>>> NeilBrown wrote:
>>>> Currently md devices, once created, never disappear until the module
>>>> is unloaded. This is essentially because the gendisk holds a
>>>> reference to the mddev, and the mddev holds a reference to the
>>>> gendisk, this a circular reference.
>>>>
>>>> If we drop the reference from mddev to gendisk, then we need to ensure
>>>> that the mddev is destroyed when the gendisk is destroyed. However it
>>>> is not possible to hook into the gendisk destruction process to enable
>>>> this.
>>>>
>>>> So we drop the reference from the gendisk to the mddev and destroy the
>>>> gendisk when the mddev gets destroyed. However this has a
>>>> complication.
>>>> Between the call
>>>> __blkdev_get->get_gendisk->kobj_lookup->md_probe
>>>> and the call
>>>> __blkdev_get->md_open
>>>>
>>>> there is no obvious way to hold a reference on the mddev any more, so
>>>> unless something is done, it will disappear and gendisk will be
>>>> destroyed prematurely.
>>>>
>>>> Also, once we decide to destroy the mddev, there will be an unlockable
>>>> moment before the gendisk is unlinked (blk_unregister_region) during
>>>> which a new reference to the gendisk can be created. We need to
>>>> ensure that this reference can not be used. i.e. the ->open must
>>>> fail.
>>> Ah... I'm not really sure I'm following all of this correctly but would
>>> it be possible to just add ->release to genhd and do regular reference
>>> counting rather than this complex dancing? ->release was recently
>>> added
>>> to cdev so it'll be nicely parallel.
>>
>> Maybe...
>>
>> If genhd.c:disk_release called e.g.
>> disk->fops->final_put(disk)
>>
>> then I could possibly link in to that to destroy the md state when the
>> gendisk finally disappears.
>>
>> When I want to kill the gendisk I would call blk_unregister_region
>> directly (not through del_gendisk) to allow it to disappear.
>> If md_probe then gets called before the final_put, I'd need to
>> call blk_register_region again to re-install it.
>>
>> I think that would work.
>>
>> Would 'block_device_operations' be the right place for this
>> 'final_put' or 'final_release' ??
>
> I suppose so. Maybe just void (*release)(struct gendisk *) but Jens is
> the maintainer. Jens, what do you think?

Note that we already have 'release' in block_device_operations. It is
called on last close rather than last put.

NeilBrown

2008-11-24 06:13:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

NeilBrown wrote:
> Note that we already have 'release' in block_device_operations. It is
> called on last close rather than last put.

Aieee... right. How about release_disk() or disk_release()?

--
tejun

2008-11-24 06:24:30

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Mon, Nov 24, 2008 at 02:34:30PM +0900, Tejun Heo wrote:
> > Maybe...
> >
> > If genhd.c:disk_release called e.g.
> > disk->fops->final_put(disk)
> >
> > then I could possibly link in to that to destroy the md state when the
> > gendisk finally disappears.
> >
> > When I want to kill the gendisk I would call blk_unregister_region
> > directly (not through del_gendisk) to allow it to disappear.
> > If md_probe then gets called before the final_put, I'd need to
> > call blk_register_region again to re-install it.
> >
> > I think that would work.
> >
> > Would 'block_device_operations' be the right place for this
> > 'final_put' or 'final_release' ??
>
> I suppose so. Maybe just void (*release)(struct gendisk *) but Jens is
> the maintainer. Jens, what do you think?

First of all, release is already taken (with exactly that argument, BTW).
And doing that at freeing gendisk is a bad idea - md.ko might have been
long gone by the time you've got there, not to mention anything else...
IOW, it's too late; once the damn thing is not opened anymore (and nobody
is the middle of trying to open it), the module might be dead and gone,
so all uses of ->private_data and ->fops are illegal after that point.

2008-11-24 06:38:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Mon, Nov 24, 2008 at 03:47:51PM +1100, Neil Brown wrote:

> > What lifetime rules do you really want? I never liked the tricks pulled
> > by md wrt gendisk lifetimes and that might be a good time to sort that
> > out for good...
>
> I'm not sure what 'tricks' you are referring to. Can you elaborate?
>
> I want the gendisk to appear as soon as it is needed (not because I
> think that is necessarily a good idea, but it is legacy functionality that I
> don't think we can easily discard). And I want them to disappear when
> they contain no information and have nothing referring to them.

"Tricks" are about md_probe() and weird allocation time for these suckers.
But OK, legacy API is a good argument.

So you want the rules of the same nature as for module refcount? Then
the natural place to do that would be in failure exit of __blkdev_get()
and in normal path in blkdev_put() (for the final opener going away).

However, let's try to do it right - there's a *lot* of drivers where
we do no work in ->release() until it's the final one. It would be
nice to accomodate them as well while we are at it...

2008-11-24 06:56:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

Al Viro wrote:
>> I suppose so. Maybe just void (*release)(struct gendisk *) but Jens is
>> the maintainer. Jens, what do you think?
>
> First of all, release is already taken (with exactly that argument, BTW).

Right. :-(

> And doing that at freeing gendisk is a bad idea - md.ko might have been
> long gone by the time you've got there, not to mention anything else...
> IOW, it's too late; once the damn thing is not opened anymore (and nobody
> is the middle of trying to open it), the module might be dead and gone,
> so all uses of ->private_data and ->fops are illegal aftenr that point.

mddev holds module reference and till the gendisk is gone mddev won't
be gone, so as long as gendisk is around the respective mddev and
md.ko are around.

This is because creation and deletion of md devices are managed by
userland and the module is kept referenced as long as there are active
devices. The current block interface only verifies ops availability
before open and asks drivers to verify object validity separately in
->open, which makes sense if the intention is to allow rmmod while
devices are hot, but it also requires the low level driver to
implement index of all active devices and look them up atomically
during ->open, which sometimes is unnatural to implement as it's
unnecesary for any other purpose and can create subtle races as shown
by md.

I don't think it's necessary to allow modules to unload while devices
are hot. SCSI doesn't follow it (at least not at the block layer),
nor does md. It's seldom necessary to unload those modules in the
first place and when necessary active devices can be killed via sysfs
or mdadm (which BTW makes more sense to me - kill the users before
unloading the module). Without such requirement, both block and
character devices can stick with more regular reference counting like
the rest of the kernel.

Thanks.

--
tejun

2008-11-24 13:31:45

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Mon, Nov 24, 2008 at 03:56:12PM +0900, Tejun Heo wrote:

> mddev holds module reference and till the gendisk is gone mddev won't
> be gone, so as long as gendisk is around the respective mddev and
> md.ko are around.

It doesn't and it *could* *not* - you can't drop the final reference to
module from within that module, period.

> I don't think it's necessary to allow modules to unload while devices
> are hot.

gendisk may stay referenced past the point when everything got closed
and unregistered.

2008-11-24 14:05:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

Al Viro wrote:
> On Mon, Nov 24, 2008 at 03:56:12PM +0900, Tejun Heo wrote:
>
>> mddev holds module reference and till the gendisk is gone mddev won't
>> be gone, so as long as gendisk is around the respective mddev and
>> md.ko are around.
>
> It doesn't and it *could* *not* - you can't drop the final reference to
> module from within that module, period.

Yeap, right, I was confused. Referencing self doesn't make any sense.
Got confused with holding sub modules and the days when there was no
preemption inside kernel. I think there still might be some remnants
of those in SCSI but I need to look again.

>> I don't think it's necessary to allow modules to unload while devices
>> are hot.
>
> gendisk may stay referenced past the point when everything got closed
> and unregistered.

Can we then make gendisk hold owner module till it gets released? It
would be much nicer to write code to if we can keep the regular object
reference counting across module boundaries and being able to taking
down a module while devices are active isn't a too important
requirement. For vast majoerity (ide, scsi, md) one way or the other
doesn't even matter at all.

Thanks.

--
tejun

2008-11-24 14:26:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

Tejun Heo wrote:
> Can we then make gendisk hold owner module till it gets released? It
> would be much nicer to write code to if we can keep the regular object
> reference counting across module boundaries and being able to taking
> down a module while devices are active isn't a too important
> requirement. For vast majoerity (ide, scsi, md) one way or the other
> doesn't even matter at all.

If always holding reference is too much of a change, we can do

if (gendisk->fops->disk_release) {
__module_get(gendisk->fops->owner);
gendisk->fops->disk_release(gendisk);
module_put(gendisk->fops->owner);
}

So that both parties - drivers which can happily unregister devices
during exit and drivers which want to do reference counting across
module boundaries - can be happy.

Thanks.

--
tejun

2008-11-24 14:48:35

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Mon, Nov 24, 2008 at 11:26:20PM +0900, Tejun Heo wrote:
> Tejun Heo wrote:
> > Can we then make gendisk hold owner module till it gets released? It
> > would be much nicer to write code to if we can keep the regular object
> > reference counting across module boundaries and being able to taking
> > down a module while devices are active isn't a too important
> > requirement. For vast majoerity (ide, scsi, md) one way or the other
> > doesn't even matter at all.
>
> If always holding reference is too much of a change, we can do
>
> if (gendisk->fops->disk_release) {
> __module_get(gendisk->fops->owner);
> gendisk->fops->disk_release(gendisk);
> module_put(gendisk->fops->owner);
> }

With ->fops in already freed memory? Good idea, that...

Look, that's the wrong object *and* the wrong place; gendisk is separate
from driver-specific data structures for a good reason.

_IF_ you want to tie something to "opened or in process of being opened",
the right place is __blkdev_get()/blkdev_put() (BTW, note that get_gendisk()
is essentially a part of __blkdev_get()).

It's not about rmmod of module while some disk is opened; _that_ is impossible
as is. It's about e.g. modules like sd.ko that would be flat-out impossible
to unload at all.

Think for a minute: we do have a bunch of gendisks created by sd; they stay
around until we rmmod the damn thing and we really want it that way.
We can't have their existence pinning sd down. What we do is "module is
busy if any of those is opened or in the middle of opening".

The same goes for just about any block driver out there; md is weird for
one reason - it wants to have devices possible to open without any prior
event that would mark the beginning of availability (such as e.g. hardware
detection for SCSI, IDE, etc.). That is a side effect of bad API - there
*is* a moment for md that would be a good candidate for such event (array
being completely configured and available for normal IO), but that would
require an API for creating and configuring these suckers that wouldn't
start with "First you open the very device you are trying to create..."
Too bad - we are tied to lousy userland interface and can't get rid of
it. That is where all the PITA is coming from.

And the right way to deal with that is to have explicit boundaries for
"opened or in process of being opened"; we almost have them (probe and
final release), so the only point we are missing is on failure exit from
__blkdev_get()...

I really think that it's much saner than trying to change the lifetime
rules for gendisk, etc.

2008-11-24 16:08:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

Al Viro wrote:
> With ->fops in already freed memory? Good idea, that...

The existence can be tested on registration and remembered. Existence
of fops->disk_release signifies that the object and module are around
till the last object is released.

> Look, that's the wrong object *and* the wrong place; gendisk is separate
> from driver-specific data structures for a good reason.
>
> _IF_ you want to tie something to "opened or in process of being opened",
> the right place is __blkdev_get()/blkdev_put() (BTW, note that get_gendisk()
> is essentially a part of __blkdev_get()).
>
> It's not about rmmod of module while some disk is opened; _that_ is impossible
> as is. It's about e.g. modules like sd.ko that would be flat-out impossible
> to unload at all.
>
> Think for a minute: we do have a bunch of gendisks created by sd; they stay
> around until we rmmod the damn thing and we really want it that way.
> We can't have their existence pinning sd down. What we do is "module is
> busy if any of those is opened or in the middle of opening".

Devices can be killed from userland via sysfs for SCSI or mdadm for
md. It's true that such approach is less convenient for unloading but
if it can make usual cases easier, why not?

For both char and block devices, having ->open lookup and grab the
device were easy when each driver supported limited number of devices
and things were pretty much static, but it grows more awkward as
things become more dynamic. Having to have a lookup table just for
->open lookup isn't pretty at all and missing ->release can get quite
uncomfortable too.

> The same goes for just about any block driver out there; md is weird for
> one reason - it wants to have devices possible to open without any prior
> event that would mark the beginning of availability (such as e.g. hardware
> detection for SCSI, IDE, etc.). That is a side effect of bad API - there
> *is* a moment for md that would be a good candidate for such event (array
> being completely configured and available for normal IO), but that would
> require an API for creating and configuring these suckers that wouldn't
> start with "First you open the very device you are trying to create..."
> Too bad - we are tied to lousy userland interface and can't get rid of
> it. That is where all the PITA is coming from.
>
> And the right way to deal with that is to have explicit boundaries for
> "opened or in process of being opened"; we almost have them (probe and
> final release), so the only point we are missing is on failure exit from
> __blkdev_get()...
>
> I really think that it's much saner than trying to change the lifetime
> rules for gendisk, etc.

Well, I don't know. It seems like a lot of trouble just to allow
"rmmod something" without first killing the devices and as people are
now so used to reference counted objects and ->release, not having it
on cdev or gendisk is quite a PITA. (BTW, Greg, can you please drop
cdev->release patch for now, it's wrong as it currently stands).

Can you see any problem with caching ->disk_release existence on
registration and wrap __module_get/put() around its invocation? It
wouldn't change behavior of any existing drivers and md can use it if
it wants. Doing "mdadm --stop --scan" would be enough to unload the
module and md can do whatever forward or back reference it wants to do
to work out the weird userland interface.

Thanks.

--
tejun

2008-11-24 16:43:01

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Tue, Nov 25, 2008 at 01:08:09AM +0900, Tejun Heo wrote:

> Devices can be killed from userland via sysfs for SCSI or mdadm for
> md. It's true that such approach is less convenient for unloading but
> if it can make usual cases easier, why not?

_What_ usual cases?

> > And the right way to deal with that is to have explicit boundaries for
> > "opened or in process of being opened"; we almost have them (probe and
> > final release), so the only point we are missing is on failure exit from
> > __blkdev_get()...
> >
> > I really think that it's much saner than trying to change the lifetime
> > rules for gendisk, etc.
>
> Well, I don't know. It seems like a lot of trouble just to allow
> "rmmod something" without first killing the devices and as people are
> now so used to reference counted objects and ->release, not having it
> on cdev or gendisk is quite a PITA. (BTW, Greg, can you please drop
> cdev->release patch for now, it's wrong as it currently stands).

gendisks *ARE* reference counted, damnit. So are net_device and a lot
of other things. And no, it's not true that "struct net_device exists"
implies "the low-level objects that once might have been related to it
still exist" either.

> Can you see any problem with caching ->disk_release existence on
> registration and wrap __module_get/put() around its invocation? It
> wouldn't change behavior of any existing drivers and md can use it if
> it wants. Doing "mdadm --stop --scan" would be enough to unload the
> module and md can do whatever forward or back reference it wants to do
> to work out the weird userland interface.

Other than general ugliness and special-casing where none is really needed?
Special-casing as "very different life cycle if special method is present"...

If anything, we need to go in opposite direction - give the drivers a way
to say "my underlying object is gone, STFU and don't bother me with that
gendisk ever again; free it when you are done with it, but from now on
any access to it would better fail. Oh, and I might find a new device
in place of that any time now, so new open() would better get not fail
just something in VFS still has a reference to that gendisk".

Which is doable - note that we can unhash block_device, dissociate inodes
from it and let new open() DTRT. Earlier opened files will still have
a reference to address_space of original block_device (which is why we
have file->f_mapping instead of going through ->f_dentry->d_inode->i_mapping),
so we are fine.

2008-11-24 17:18:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

Al Viro wrote:
> On Tue, Nov 25, 2008 at 01:08:09AM +0900, Tejun Heo wrote:
>
>> Devices can be killed from userland via sysfs for SCSI or mdadm for
>> md. It's true that such approach is less convenient for unloading but
>> if it can make usual cases easier, why not?
>
> _What_ usual cases?

Following from gendisk or whatever high level object to lower level
object. open is most common but there are other places.

>> Well, I don't know. It seems like a lot of trouble just to allow
>> "rmmod something" without first killing the devices and as people are
>> now so used to reference counted objects and ->release, not having it
>> on cdev or gendisk is quite a PITA. (BTW, Greg, can you please drop
>> cdev->release patch for now, it's wrong as it currently stands).
>
> gendisks *ARE* reference counted, damnit. So are net_device and a lot
> of other things.

Yes it is (where did I say they weren't?) but different in that it is
a severing boundary for module reference and thus can't tell its
creator when it's dying.

> And no, it's not true that "struct net_device exists"
> implies "the low-level objects that once might have been related to it
> still exist" either.

No, it's not. If it were, we wouldn't be arguing. :-)

>> Can you see any problem with caching ->disk_release existence on
>> registration and wrap __module_get/put() around its invocation? It
>> wouldn't change behavior of any existing drivers and md can use it if
>> it wants. Doing "mdadm --stop --scan" would be enough to unload the
>> module and md can do whatever forward or back reference it wants to do
>> to work out the weird userland interface.
>
> Other than general ugliness and special-casing where none is really needed?
> Special-casing as "very different life cycle if special method is present"...

The duality is ugly alright and will probably need a big fat comment
to explain what's going on but the current situation isn't too pretty
(or at least convenient) either.

> If anything, we need to go in opposite direction - give the drivers a way
> to say "my underlying object is gone, STFU and don't bother me with that
> gendisk ever again; free it when you are done with it, but from now on
> any access to it would better fail. Oh, and I might find a new device
> in place of that any time now, so new open() would better get not fail
> just something in VFS still has a reference to that gendisk".
>
> Which is doable - note that we can unhash block_device, dissociate inodes
> from it and let new open() DTRT. Earlier opened files will still have
> a reference to address_space of original block_device (which is why we
> have file->f_mapping instead of going through ->f_dentry->d_inode->i_mapping),
> so we are fine.

Yes, that will be great too. I'm just unhappy with the current
situation which is in the middle of the two sane positions where
drivers have to worry about whether high level object is valid or not.
If it's either always valid or can be told to go away into oblivion
(proper severing, no ->release necessary at all), I won't have any
complaint.

Thanks.

--
tejun

2008-11-28 00:23:28

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/2] md: make devices disappear when they are no longer needed.

On Monday November 24, [email protected] wrote:
>
> I really think that it's much saner than trying to change the lifetime
> rules for gendisk, etc.

I'm not sure there is anything wrong with the lifetime rules for
gendisk....

In my mind a big issue is:
What should happen if "open" is racing with "destroy device" ?

For 'sd', if you open before the decision has been made to tear down
the data structures, the open succeeds but you start getting EIO on
any IO.
If you open after the decision, you get ENODEV (or ENXIO?).
If you wait a little longer and some other device gets hot-plugged,
then the open succeeds on a different physical device.

For 'md' the situation is quite different, due to unfortunate legacy
interface design.
An open should never fail with ENODEV.
It might return a device on which read/write always fails and only
various 'ioctls' work, or it might return a device which is fully
working.

So we need to close the window in which ENODEV can be returned.

I think the code I wrote is the best way to close that window. If we
detect that we are in the window during open, wait for the window to
close and retry, by returning ERESTARTSYS.


I've thought more about the possibility of the disk_release method
which is used to tear down the md data structures but I don't think it
would really answer the need.

A key point in time is when blk_unregister_region is called to
remove the mapping from minor number to the gendisk.
At any time before this, new references can be taken by an open.
At any time after this, any open will be directed through ->probe
which can do any necessary interlocking with ->open.

Currently we need to call blk_unregister_region before the final
put_disk otherwise a new reference could be taken while that put_disk
is running.
blk_unregister_region is normally called by unlink_gendisk called by
del_gendisk.
So we need to call del_gendisk before the final call to put_disk.
And once we have called del_gendisk we may as well clean up the md
specific stuff there and not bother with a ->disk_release method.

So I'm having trouble seeing what is wrong with my approach (other
than the fact it implements an unfortunate choice of user-space
interface, which we agree is a legacy requirement) or how anything
else could provide a better solution.

I can appreciate that there might be other issues with gendisk lifetime
(such as the fact that get_disk takes a reference to the module but
put_disk doesn't) but I suspect they are largely orthogonal to my
current issue.

Finally, I suspect that there is a problem with my patch as it stands
with respect to module reference counting. I'll have to explore to
be sure exactly what is happening, but I strongly suspect it is
wrong. Thank you for highlighting that issue for me.

NeilBrown