Hi,
I have recently been trying to find the cause a livelock occurring during MD
device open.
The livelock happens when a process tries to open an MD device for the
first time and another opens the same MD device and sends an invalid
ioctl:
Process 1 Process 2
--------- ---------
md_alloc()
mddev_find()
-> returns a new mddev with
hold_active == UNTIL_IOCTL
add_disk()
-> sends KOBJ_ADD uevent
(sees KOBJ_ADD uevent for device)
md_open()
md_ioctl(INVALID_IOCTL)
-> returns ENODEV and clears
mddev->hold_active
md_release()
md_put()
-> deletes the mddev as
hold_active is 0
md_open()
mddev_find()
-> returns a newly
allocated mddev with
mddev->gendisk == NULL
-> returns with ERESTARTSYS
(kernel restarts the open syscall)
As to how to fix this, I see two possibilities:
- don't set hold_active to 0 if err is -ENODEV in the abort_unlock
path in md_ioctl().
- check cmd parameter early in md_ioctl() and return -ENOTTY if the
cmd parameter is not a valid MD ioctl.
Please advise on the preferred way to fix this, I'll be glad to send a
patch for whatever is the preferred solution.
I have also a simple C program that I can send should you want to reproduce
the issue.
Regards,
--
Nicolas Schichan
Freebox SAS
On Tue, 14 Jan 2014 18:14:46 +0100 Nicolas Schichan <[email protected]>
wrote:
>
>
> Hi,
>
> I have recently been trying to find the cause a livelock occurring during MD
> device open.
>
> The livelock happens when a process tries to open an MD device for the
> first time and another opens the same MD device and sends an invalid
> ioctl:
>
> Process 1 Process 2
> --------- ---------
>
> md_alloc()
> mddev_find()
> -> returns a new mddev with
> hold_active == UNTIL_IOCTL
> add_disk()
> -> sends KOBJ_ADD uevent
>
> (sees KOBJ_ADD uevent for device)
> md_open()
> md_ioctl(INVALID_IOCTL)
> -> returns ENODEV and clears
> mddev->hold_active
> md_release()
> md_put()
> -> deletes the mddev as
> hold_active is 0
>
> md_open()
> mddev_find()
> -> returns a newly
> allocated mddev with
> mddev->gendisk == NULL
> -> returns with ERESTARTSYS
> (kernel restarts the open syscall)
>
>
> As to how to fix this, I see two possibilities:
>
> - don't set hold_active to 0 if err is -ENODEV in the abort_unlock
> path in md_ioctl().
>
> - check cmd parameter early in md_ioctl() and return -ENOTTY if the
> cmd parameter is not a valid MD ioctl.
>
> Please advise on the preferred way to fix this, I'll be glad to send a
> patch for whatever is the preferred solution.
>
> I have also a simple C program that I can send should you want to reproduce
> the issue.
>
> Regards,
>
That's a very small race you are consistently losing - if I understand
correctly.
In __blkdev_get:
restart:
ret = -ENXIO;
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
owner = disk->fops->owner;
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
The "get_gendisk" calls into md_alloc (via md_probe) and then add_disk(),
which generates a uevent which is handled by udev.
And before the above code gets to the mutex_lock_nexted(), the process run by
udev must have opened the device (executing all that code above and more) and
issued the ioctl.
I guess it is possible, but happening every time to produce a live-lock
suggests that the scheduler must be encouraging that behaviour. Presumably
this is a virtual machine with just one CPU ??
I suppose the best fix is, as you suggest, to avoid clearing hold_active for
invalid ioctls. It feels a bit like papering over a bug, but I think the
only way to really fix it is to add extra locking to the above code sequence
and I don't want to do that.
Of your two suggestions I much prefer the second. It will be more code, but
it is also more obviously correct. The current code is rather messy with
respect to invalid ioctl commands.
I would be happy to accept a patch which aborted md_ioctl if the cmd wasn't
one of those known to md.
Thanks,
NeilBrown
On 01/15/2014 02:57 AM, NeilBrown wrote:
[...]
> That's a very small race you are consistently losing - if I understand
> correctly.
>
> In __blkdev_get:
>
> restart:
>
> ret = -ENXIO;
> disk = get_gendisk(bdev->bd_dev, &partno);
> if (!disk)
> goto out;
> owner = disk->fops->owner;
>
> disk_block_events(disk);
> mutex_lock_nested(&bdev->bd_mutex, for_part);
>
>
> The "get_gendisk" calls into md_alloc (via md_probe) and then add_disk(),
> which generates a uevent which is handled by udev.
> And before the above code gets to the mutex_lock_nexted(), the process run by
> udev must have opened the device (executing all that code above and more) and
> issued the ioctl.
>
> I guess it is possible, but happening every time to produce a live-lock
> suggests that the scheduler must be encouraging that behaviour. Presumably
> this is a virtual machine with just one CPU ??
add_disk() will call schedule via call_usermode_helper() (for /sbin/hotplug),
the scheduler will, I think, almost always choose to schedule the "udev"
process instead of the process which did the first open to the md device, once
the usermode helper has run.
SysRq + 'w' consistently showed a process that had called schedule() from
md_alloc() -> add_disk() ... -> call_usermodehelper():
[ 57.932671] test D 805ac41c 0 748 745 0x00000000
[ 57.939075] [<805ac41c>] (__schedule+0x33c/0x3b0) from [<805aa8f0>]
(schedule_timeout+0x18/0x168)
[ 57.947984] [<805aa8f0>] (schedule_timeout+0x18/0x168) from [<805abfa4>]
(wait_for_common+0xf0/0x198)
[ 57.957245] [<805abfa4>] (wait_for_common+0xf0/0x198) from [<80029e30>]
(call_usermodehelper_exec+0xf8/0x15c)
[ 57.967205] [<80029e30>] (call_usermodehelper_exec+0xf8/0x15c) from
[<8028abd0>] (kobject_uevent_env+0x37c/0x3e8)
[ 57.977515] [<8028abd0>] (kobject_uevent_env+0x37c/0x3e8) from [<8027cff8>]
(add_disk+0x29c/0x400)
[ 57.986520] [<8027cff8>] (add_disk+0x29c/0x400) from [<803bbcac>]
(md_alloc+0x1cc/0x2cc)
[ 57.994645] [<803bbcac>] (md_alloc+0x1cc/0x2cc) from [<803bbe38>]
(md_probe+0xc/0x14)
[ 58.002511] [<803bbe38>] (md_probe+0xc/0x14) from [<802ede10>]
(kobj_lookup+0xd8/0x110)
[ 58.010550] [<802ede10>] (kobj_lookup+0xd8/0x110) from [<8027cacc>]
(get_gendisk+0x2c/0xe0)
[ 58.018942] [<8027cacc>] (get_gendisk+0x2c/0xe0) from [<800c0598>]
(__blkdev_get+0x28/0x364)
[ 58.027416] [<800c0598>] (__blkdev_get+0x28/0x364) from [<800c0ab0>]
(blkdev_get+0x1dc/0x318)
[ 58.035983] [<800c0ab0>] (blkdev_get+0x1dc/0x318) from [<800918bc>]
(do_dentry_open.isra.15+0x184/0x248)
[ 58.045510] [<800918bc>] (do_dentry_open.isra.15+0x184/0x248) from
[<800919a4>] (finish_open+0x24/0x38)
[ 58.054945] [<800919a4>] (finish_open+0x24/0x38) from [<8009fb40>]
(do_last+0x9a8/0xc08)
[ 58.063071] [<8009fb40>] (do_last+0x9a8/0xc08) from [<8009ffdc>]
(path_openat+0x23c/0x5c8)
[ 58.071371] [<8009ffdc>] (path_openat+0x23c/0x5c8) from [<800a0620>]
(do_filp_open+0x2c/0x78)
[ 58.079935] [<800a0620>] (do_filp_open+0x2c/0x78) from [<800929b4>]
(do_sys_open+0x124/0x1c4)
[ 58.088498] [<800929b4>] (do_sys_open+0x124/0x1c4) from [<80009060>]
(ret_fast_syscall+0x0/0x2c)
The system showing the livelock is a Marvell 88F6282 CPU (single core processor).
It took me some time start to suspect the problem was due to the md_ioctl()
code :)
> I suppose the best fix is, as you suggest, to avoid clearing hold_active for
> invalid ioctls. It feels a bit like papering over a bug, but I think the
> only way to really fix it is to add extra locking to the above code sequence
> and I don't want to do that.
>
> Of your two suggestions I much prefer the second. It will be more code, but
> it is also more obviously correct. The current code is rather messy with
> respect to invalid ioctl commands.
>
> I would be happy to accept a patch which aborted md_ioctl if the cmd wasn't
> one of those known to md.
I'll send a patch for that then.
Regards,
--
Nicolas Schichan
Freebox SAS
Verify that the cmd parameter passed to md_ioctl() is valid before
doing anything.
This fixes mddev->hold_active being set to 0 when an invalid ioctl
command is passed to md_ioctl() before the array has been configured.
Clearing mddev->hold_active in that case can lead to a livelock
situation when an invalid ioctl number is given to md_ioctl() by a
process when the mddev is currently being opened by another process:
Process 1 Process 2
--------- ---------
md_alloc()
mddev_find()
-> returns a new mddev with
hold_active == UNTIL_IOCTL
add_disk()
-> sends KOBJ_ADD uevent
(sees KOBJ_ADD uevent for device)
md_open()
md_ioctl(INVALID_IOCTL)
-> returns ENODEV and clears
mddev->hold_active
md_release()
md_put()
-> deletes the mddev as
hold_active is 0
md_open()
mddev_find()
-> returns a newly
allocated mddev with
mddev->gendisk == NULL
-> returns with ERESTARTSYS
(kernel restarts the open syscall)
Signed-off-by: Nicolas Schichan <[email protected]>
---
A couple of notes:
This patch is based on linux 3.13-rc8.
The following MD ioctl constants are defined in md_u.h but not used
anywhere else, so are not accepted as valid ioctl commands:
CLEAR_ARRAY
SET_DISK_INFO
WRITE_RAID_INFO
UNPROTECT_ARRAY
PROTECT_ARRAY
drivers/md/md.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 21f4d7f..941ac65 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6328,6 +6328,32 @@ static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
return 0;
}
+static inline bool md_ioctl_valid(unsigned int cmd)
+{
+ switch (cmd) {
+ case ADD_NEW_DISK:
+ case BLKROSET:
+ case GET_ARRAY_INFO:
+ case GET_BITMAP_FILE:
+ case GET_DISK_INFO:
+ case HOT_ADD_DISK:
+ case HOT_REMOVE_DISK:
+ case PRINT_RAID_DEBUG:
+ case RAID_AUTORUN:
+ case RAID_VERSION:
+ case RESTART_ARRAY_RW:
+ case RUN_ARRAY:
+ case SET_ARRAY_INFO:
+ case SET_BITMAP_FILE:
+ case SET_DISK_FAULTY:
+ case STOP_ARRAY:
+ case STOP_ARRAY_RO:
+ return true;
+ default:
+ return false;
+ }
+}
+
static int md_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
{
@@ -6336,6 +6362,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
struct mddev *mddev = NULL;
int ro;
+ if (!md_ioctl_valid(cmd))
+ return -ENOTTY;
+
switch (cmd) {
case RAID_VERSION:
case GET_ARRAY_INFO:
--
1.8.3.2
On Wed, 15 Jan 2014 16:58:52 +0100 Nicolas Schichan <[email protected]>
wrote:
> Verify that the cmd parameter passed to md_ioctl() is valid before
> doing anything.
>
> This fixes mddev->hold_active being set to 0 when an invalid ioctl
> command is passed to md_ioctl() before the array has been configured.
>
> Clearing mddev->hold_active in that case can lead to a livelock
> situation when an invalid ioctl number is given to md_ioctl() by a
> process when the mddev is currently being opened by another process:
>
> Process 1 Process 2
> --------- ---------
>
> md_alloc()
> mddev_find()
> -> returns a new mddev with
> hold_active == UNTIL_IOCTL
> add_disk()
> -> sends KOBJ_ADD uevent
>
> (sees KOBJ_ADD uevent for device)
> md_open()
> md_ioctl(INVALID_IOCTL)
> -> returns ENODEV and clears
> mddev->hold_active
> md_release()
> md_put()
> -> deletes the mddev as
> hold_active is 0
>
> md_open()
> mddev_find()
> -> returns a newly
> allocated mddev with
> mddev->gendisk == NULL
> -> returns with ERESTARTSYS
> (kernel restarts the open syscall)
>
> Signed-off-by: Nicolas Schichan <[email protected]>
> ---
>
> A couple of notes:
>
> This patch is based on linux 3.13-rc8.
>
> The following MD ioctl constants are defined in md_u.h but not used
> anywhere else, so are not accepted as valid ioctl commands:
>
> CLEAR_ARRAY
> SET_DISK_INFO
> WRITE_RAID_INFO
> UNPROTECT_ARRAY
> PROTECT_ARRAY
>
>
> drivers/md/md.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 21f4d7f..941ac65 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6328,6 +6328,32 @@ static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
> return 0;
> }
>
> +static inline bool md_ioctl_valid(unsigned int cmd)
> +{
> + switch (cmd) {
> + case ADD_NEW_DISK:
> + case BLKROSET:
> + case GET_ARRAY_INFO:
> + case GET_BITMAP_FILE:
> + case GET_DISK_INFO:
> + case HOT_ADD_DISK:
> + case HOT_REMOVE_DISK:
> + case PRINT_RAID_DEBUG:
> + case RAID_AUTORUN:
> + case RAID_VERSION:
> + case RESTART_ARRAY_RW:
> + case RUN_ARRAY:
> + case SET_ARRAY_INFO:
> + case SET_BITMAP_FILE:
> + case SET_DISK_FAULTY:
> + case STOP_ARRAY:
> + case STOP_ARRAY_RO:
> + return true;
> + default:
> + return false;
> + }
> +}
> +
> static int md_ioctl(struct block_device *bdev, fmode_t mode,
> unsigned int cmd, unsigned long arg)
> {
> @@ -6336,6 +6362,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> struct mddev *mddev = NULL;
> int ro;
>
> + if (!md_ioctl_valid(cmd))
> + return -ENOTTY;
> +
> switch (cmd) {
> case RAID_VERSION:
> case GET_ARRAY_INFO:
Patch applied - thanks!
NeilBrown