The soft lockup below happes at the boot time of the system using dm
multipath and automated elevator switching udev rules.
[ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
[ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
...
[ 356.127001] Call Trace:
[ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
[ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
[ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
[ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
[ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
[ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
[ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
[ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
[ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
[ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
[ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
[ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
This is caused by a race between md device initialization and sysfs knob
to switch the scheduler.
* multipathd:
SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
-> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
q->elevator = elevator_alloc(q, e); // not yet initialized
* sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
( mutex_lock(&q->sysfs_lock) here. )
-> elv_iosched_store -> elevator_change:
elevator_exit(old); // try to de-init uninitialized elevator and hang up
This patch adds acquisition of q->sysfs_lock in blk_init_allocated_queue().
This also adds the lock into elevator_change() to ensure locking from the
other path, as it is exposed function (and queue_attr_store will uses
__elevator_change() now, the non-locking version of elevator_change()).
Signed-off-by: Tomoki Sekiyama <[email protected]>
---
block/blk-core.c | 6 +++++-
block/elevator.c | 16 ++++++++++++++--
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 93a18d1..2323ec3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -739,9 +739,13 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
q->sg_reserved_size = INT_MAX;
+ /* Protect q->elevator from elevator_change */
+ mutex_lock(&q->sysfs_lock);
/* init elevator */
if (elevator_init(q, NULL))
- return NULL;
+ q = NULL;
+ mutex_unlock(&q->sysfs_lock);
+
return q;
}
EXPORT_SYMBOL(blk_init_allocated_queue);
diff --git a/block/elevator.c b/block/elevator.c
index 668394d..5232565 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -959,7 +959,7 @@ fail_init:
/*
* Switch this queue to the given IO scheduler.
*/
-int elevator_change(struct request_queue *q, const char *name)
+static int __elevator_change(struct request_queue *q, const char *name)
{
char elevator_name[ELV_NAME_MAX];
struct elevator_type *e;
@@ -981,6 +981,18 @@ int elevator_change(struct request_queue *q, const char *name)
return elevator_switch(q, e);
}
+
+int elevator_change(struct request_queue *q, const char *name)
+{
+ int ret;
+
+ /* Protect q->elevator from blk_init_allocated_queue() */
+ mutex_lock(&q->sysfs_lock);
+ ret = __elevator_change(q, name);
+ mutex_unlock(&q->sysfs_lock);
+
+ return ret;
+}
EXPORT_SYMBOL(elevator_change);
ssize_t elv_iosched_store(struct request_queue *q, const char *name,
@@ -991,7 +1003,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
if (!q->elevator)
return count;
- ret = elevator_change(q, name);
+ ret = __elevator_change(q, name);
if (!ret)
return count;
On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
> The soft lockup below happes at the boot time of the system using dm
> multipath and automated elevator switching udev rules.
>
> [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
> [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
> ...
> [ 356.127001] Call Trace:
> [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
> [ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
> [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
> [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
> [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
> [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
> [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
> [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
> [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
> [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
> [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
> [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
>
Tokomi,
As you noticed, there is a fedora bug open with similar signature. May
be this patch will fix that issue also.
https://bugzilla.redhat.com/show_bug.cgi?id=902012
> This is caused by a race between md device initialization and sysfs knob
> to switch the scheduler.
>
> * multipathd:
> SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
> -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
>
> q->elevator = elevator_alloc(q, e); // not yet initialized
>
> * sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
> SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
> ( mutex_lock(&q->sysfs_lock) here. )
> -> elv_iosched_store -> elevator_change:
>
> elevator_exit(old); // try to de-init uninitialized elevator and hang up
>
> This patch adds acquisition of q->sysfs_lock in blk_init_allocated_queue().
> This also adds the lock into elevator_change() to ensure locking from the
> other path, as it is exposed function (and queue_attr_store will uses
> __elevator_change() now, the non-locking version of elevator_change()).
I think introducing __elevator_change() is orthogonal to this problem.
May be keep that in a separate patch.
> block/blk-core.c | 6 +++++-
> block/elevator.c | 16 ++++++++++++++--
> 2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 93a18d1..2323ec3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -739,9 +739,13 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>
> q->sg_reserved_size = INT_MAX;
>
> + /* Protect q->elevator from elevator_change */
> + mutex_lock(&q->sysfs_lock);
> /* init elevator */
> if (elevator_init(q, NULL))
> - return NULL;
> + q = NULL;
> + mutex_unlock(&q->sysfs_lock);
> +
So core of the problem is, what's the locking semantics to make sure
that we are not trying to switch elevator while it is still initializing.
IOW, should we allow multiple parallel calls of elevator_init_fn() on a
queue and is it safe?
I would argue that it is easier to read and maintain the code if we
provide explicit locking around. So I like the idea of introducing
some locking around elevator_init().
Because we are racing against elevator switch path which takes
q->sysfs_lock, it makes sense to provide mutual exlusion using
q->sysfs_lock.
What I don't know is that can we take mutex in queue init path. Generally
drivers call it and do they expect that they can call this function
while holding a spin lock.
I am CCing Tejun also to the thread. He also might have some ideas here.
Thanks
Vivek
> return q;
> }
> EXPORT_SYMBOL(blk_init_allocated_queue);
> diff --git a/block/elevator.c b/block/elevator.c
> index 668394d..5232565 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -959,7 +959,7 @@ fail_init:
> /*
> * Switch this queue to the given IO scheduler.
> */
> -int elevator_change(struct request_queue *q, const char *name)
> +static int __elevator_change(struct request_queue *q, const char *name)
> {
> char elevator_name[ELV_NAME_MAX];
> struct elevator_type *e;
> @@ -981,6 +981,18 @@ int elevator_change(struct request_queue *q, const char *name)
>
> return elevator_switch(q, e);
> }
> +
> +int elevator_change(struct request_queue *q, const char *name)
> +{
> + int ret;
> +
> + /* Protect q->elevator from blk_init_allocated_queue() */
> + mutex_lock(&q->sysfs_lock);
> + ret = __elevator_change(q, name);
> + mutex_unlock(&q->sysfs_lock);
> +
> + return ret;
> +}
> EXPORT_SYMBOL(elevator_change);
>
> ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> @@ -991,7 +1003,7 @@ ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> if (!q->elevator)
> return count;
>
> - ret = elevator_change(q, name);
> + ret = __elevator_change(q, name);
> if (!ret)
> return count;
>
On Thu, Aug 29, 2013 at 02:33:10PM -0400, Vivek Goyal wrote:
> On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
> > The soft lockup below happes at the boot time of the system using dm
> > multipath and automated elevator switching udev rules.
> >
> > [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
> > [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
> > ...
> > [ 356.127001] Call Trace:
> > [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
> > [ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
> > [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
> > [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
> > [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
> > [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
> > [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
> > [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
> > [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
> > [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
> > [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
> > [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
> >
>
> Tokomi,
>
> As you noticed, there is a fedora bug open with similar signature. May
> be this patch will fix that issue also.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=902012
>
>
> > This is caused by a race between md device initialization and sysfs knob
> > to switch the scheduler.
> >
> > * multipathd:
> > SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
> > -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
> >
> > q->elevator = elevator_alloc(q, e); // not yet initialized
> >
> > * sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
> > SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
> > ( mutex_lock(&q->sysfs_lock) here. )
> > -> elv_iosched_store -> elevator_change:
> >
> > elevator_exit(old); // try to de-init uninitialized elevator and hang up
> >
If problem in this case is that we are trying to exit() the elevator
which has not been properly initialized, then we should not attach
the elevator to the queue yet.
In cfq_init_queue(), can we move following code towards the end of
function.
spin_lock_irq(q->queue_lock);
q->elevator = eq;
spin_unlock_irq(q->queue_lock);
So till elevator is initialized, we will not attach it to queue and
elevator_switch() will return as it will not find a valid elevator
on the queue.
elevator_change() {
if (!q->elevator)
return -ENXIO;
}
Thanks
Vivek
Hi vivek,
Thanks for your comments.
On 8/29/13 14:33 , "Vivek Goyal" <[email protected]> wrote:
>On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
>> The soft lockup below happes at the boot time of the system using dm
>> multipath and automated elevator switching udev rules.
>>
>> [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
>> [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>]
>>lock_timer_base.isra.35+0x1d/0x50
>> ...
>> [ 356.127001] Call Trace:
>> [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
>> [ 356.127001] [<ffffffff8118b08a>] ?
>>kmem_cache_alloc_node_trace+0x20a/0x230
>> [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
>> [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
>> [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
>> [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
>> [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
>> [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
>> [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
>> [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
>> [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
>> [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
>>
>
>Tokomi,
>
>As you noticed, there is a fedora bug open with similar signature. May
>be this patch will fix that issue also.
>
>https://bugzilla.redhat.com/show_bug.cgi?id=902012
>
>
>> This is caused by a race between md device initialization and sysfs knob
>> to switch the scheduler.
>>
>> * multipathd:
>> SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
>> -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
>>
>> q->elevator = elevator_alloc(q, e); // not yet initialized
>>
>>
>>* sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
>> SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
>> ( mutex_lock(&q->sysfs_lock) here. )
>> -> elv_iosched_store -> elevator_change:
>>
>>
>> elevator_exit(old); // try to de-init uninitialized elevator and hang
>>up
>>
>>
>>This patch adds acquisition of q->sysfs_lock in
>>blk_init_allocated_queue().
>> This also adds the lock into elevator_change() to ensure locking from
>>the
>> other path, as it is exposed function (and queue_attr_store will uses
>> __elevator_change() now, the non-locking version of elevator_change()).
>
>I think introducing __elevator_change() is orthogonal to this problem.
>May be keep that in a separate patch.
OK, I will split it into 2 patches.
>> block/blk-core.c | 6 +++++-
>> block/elevator.c | 16 ++++++++++++++--
>> 2 files changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 93a18d1..2323ec3 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -739,9 +739,13 @@ blk_init_allocated_queue(struct request_queue *q,
>>request_fn_proc *rfn,
>>
>> q->sg_reserved_size = INT_MAX;
>>
>> + /* Protect q->elevator from elevator_change */
>> + mutex_lock(&q->sysfs_lock);
>> /* init elevator */
>> if (elevator_init(q, NULL))
>> - return NULL;
>> + q = NULL;
>> + mutex_unlock(&q->sysfs_lock);
>> +
>
>So core of the problem is, what's the locking semantics to make sure
>that we are not trying to switch elevator while it is still initializing.
>IOW, should we allow multiple parallel calls of elevator_init_fn() on a
>queue and is it safe?
>
>I would argue that it is easier to read and maintain the code if we
>provide explicit locking around. So I like the idea of introducing
>some locking around elevator_init().
>
>Because we are racing against elevator switch path which takes
>q->sysfs_lock, it makes sense to provide mutual exlusion using
>q->sysfs_lock.
>
>What I don't know is that can we take mutex in queue init path. Generally
>drivers call it and do they expect that they can call this function
>while holding a spin lock.
As elevator_alloc() allocates memory with GFP_KERNEL, elevator_init() might
sleep. So it should be safe to use mutex here.
>I am CCing Tejun also to the thread. He also might have some ideas here.
>
>Thanks
>Vivek
Thanks,
Tomoki Sekiyama
On 8/29/13 14:43 , "Vivek Goyal" <[email protected]> wrote:
>On Thu, Aug 29, 2013 at 02:33:10PM -0400, Vivek Goyal wrote:
>> On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
>> > The soft lockup below happes at the boot time of the system using dm
>> > multipath and automated elevator switching udev rules.
>> >
>> > [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
>> > [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>]
>>lock_timer_base.isra.35+0x1d/0x50
>> > ...
>> > [ 356.127001] Call Trace:
>> > [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
>> > [ 356.127001] [<ffffffff8118b08a>] ?
>>kmem_cache_alloc_node_trace+0x20a/0x230
>> > [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
>> > [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
>> > [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
>> > [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
>> > [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
>> > [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
>> > [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
>> > [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
>> > [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
>> > [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
>> >
>>
>> Tokomi,
>>
>> As you noticed, there is a fedora bug open with similar signature. May
>> be this patch will fix that issue also.
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=902012
>>
>>
>> > This is caused by a race between md device initialization and sysfs
>>knob
>> > to switch the scheduler.
>> >
>> > * multipathd:
>> > SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
>> > -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
>> >
>> > q->elevator = elevator_alloc(q, e); // not yet initialized
>> >
>> > * sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
>> > SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
>> > ( mutex_lock(&q->sysfs_lock) here. )
>> > -> elv_iosched_store -> elevator_change:
>> >
>> > elevator_exit(old); // try to de-init uninitialized elevator and
>>hang up
>> >
>
>If problem in this case is that we are trying to exit() the elevator
>which has not been properly initialized, then we should not attach
>the elevator to the queue yet.
>
>In cfq_init_queue(), can we move following code towards the end of
>function.
>
> spin_lock_irq(q->queue_lock);
> q->elevator = eq;
> spin_unlock_irq(q->queue_lock);
>
>So till elevator is initialized, we will not attach it to queue and
>elevator_switch() will return as it will not find a valid elevator
>on the queue.
>
>
>elevator_change() {
> if (!q->elevator)
> return -ENXIO;
>}
>
>Thanks
>Vivek
I think it also works, though I prefer introducing explicit locking,
as you said, so that code won't break again in some future.
Thanks,
Tomoki
On Thu, Aug 29, 2013 at 07:28:02PM +0000, Tomoki Sekiyama wrote:
> Hi vivek,
>
> Thanks for your comments.
>
> On 8/29/13 14:33 , "Vivek Goyal" <[email protected]> wrote:
>
> >On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
> >> The soft lockup below happes at the boot time of the system using dm
> >> multipath and automated elevator switching udev rules.
> >>
> >> [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
> >> [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>]
> >>lock_timer_base.isra.35+0x1d/0x50
> >> ...
> >> [ 356.127001] Call Trace:
> >> [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
> >> [ 356.127001] [<ffffffff8118b08a>] ?
> >>kmem_cache_alloc_node_trace+0x20a/0x230
> >> [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
> >> [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
> >> [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
> >> [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
> >> [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
> >> [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
> >> [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
> >> [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
> >> [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
> >> [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
> >>
> >
> >Tokomi,
> >
> >As you noticed, there is a fedora bug open with similar signature. May
> >be this patch will fix that issue also.
> >
> >https://bugzilla.redhat.com/show_bug.cgi?id=902012
> >
> >
> >> This is caused by a race between md device initialization and sysfs knob
> >> to switch the scheduler.
> >>
> >> * multipathd:
> >> SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
> >> -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
> >>
> >> q->elevator = elevator_alloc(q, e); // not yet initialized
> >>
> >>
> >>* sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
> >> SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
> >> ( mutex_lock(&q->sysfs_lock) here. )
> >> -> elv_iosched_store -> elevator_change:
> >>
> >>
> >> elevator_exit(old); // try to de-init uninitialized elevator and hang
> >>up
> >>
> >>
> >>This patch adds acquisition of q->sysfs_lock in
> >>blk_init_allocated_queue().
> >> This also adds the lock into elevator_change() to ensure locking from
> >>the
> >> other path, as it is exposed function (and queue_attr_store will uses
> >> __elevator_change() now, the non-locking version of elevator_change()).
> >
> >I think introducing __elevator_change() is orthogonal to this problem.
> >May be keep that in a separate patch.
>
> OK, I will split it into 2 patches.
>
>
> >> block/blk-core.c | 6 +++++-
> >> block/elevator.c | 16 ++++++++++++++--
> >> 2 files changed, 19 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 93a18d1..2323ec3 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -739,9 +739,13 @@ blk_init_allocated_queue(struct request_queue *q,
> >>request_fn_proc *rfn,
> >>
> >> q->sg_reserved_size = INT_MAX;
> >>
> >> + /* Protect q->elevator from elevator_change */
> >> + mutex_lock(&q->sysfs_lock);
> >> /* init elevator */
> >> if (elevator_init(q, NULL))
> >> - return NULL;
> >> + q = NULL;
> >> + mutex_unlock(&q->sysfs_lock);
> >> +
> >
> >So core of the problem is, what's the locking semantics to make sure
> >that we are not trying to switch elevator while it is still initializing.
> >IOW, should we allow multiple parallel calls of elevator_init_fn() on a
> >queue and is it safe?
> >
> >I would argue that it is easier to read and maintain the code if we
> >provide explicit locking around. So I like the idea of introducing
> >some locking around elevator_init().
> >
> >Because we are racing against elevator switch path which takes
> >q->sysfs_lock, it makes sense to provide mutual exlusion using
> >q->sysfs_lock.
> >
> >What I don't know is that can we take mutex in queue init path. Generally
> >drivers call it and do they expect that they can call this function
> >while holding a spin lock.
>
> As elevator_alloc() allocates memory with GFP_KERNEL, elevator_init() might
> sleep. So it should be safe to use mutex here.
That's a good point. So it should be safe to add q->sysfs_lock. I would
say this patch sounds reasonable to me. Just document around
elevator_init_fn() that it should be called with q->sysfs_lock held to
provide mutual exclusion between elevator init and elevator switch paths.
Thanks
Vivek
On Thu, Aug 29, 2013 at 07:29:07PM +0000, Tomoki Sekiyama wrote:
> On 8/29/13 14:43 , "Vivek Goyal" <[email protected]> wrote:
> >On Thu, Aug 29, 2013 at 02:33:10PM -0400, Vivek Goyal wrote:
> >> On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
> >> > The soft lockup below happes at the boot time of the system using dm
> >> > multipath and automated elevator switching udev rules.
> >> >
> >> > [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
> >> > [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>]
> >>lock_timer_base.isra.35+0x1d/0x50
> >> > ...
> >> > [ 356.127001] Call Trace:
> >> > [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
> >> > [ 356.127001] [<ffffffff8118b08a>] ?
> >>kmem_cache_alloc_node_trace+0x20a/0x230
> >> > [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
> >> > [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
> >> > [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
> >> > [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
> >> > [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
> >> > [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
> >> > [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
> >> > [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
> >> > [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
> >> > [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
> >> >
> >>
> >> Tokomi,
> >>
> >> As you noticed, there is a fedora bug open with similar signature. May
> >> be this patch will fix that issue also.
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=902012
> >>
> >>
> >> > This is caused by a race between md device initialization and sysfs
> >>knob
> >> > to switch the scheduler.
> >> >
> >> > * multipathd:
> >> > SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
> >> > -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init:
> >> >
> >> > q->elevator = elevator_alloc(q, e); // not yet initialized
> >> >
> >> > * sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler'
> >> > SyS_write -> vfs_write -> sysfs_write_file -> queue_attr_store
> >> > ( mutex_lock(&q->sysfs_lock) here. )
> >> > -> elv_iosched_store -> elevator_change:
> >> >
> >> > elevator_exit(old); // try to de-init uninitialized elevator and
> >>hang up
> >> >
> >
> >If problem in this case is that we are trying to exit() the elevator
> >which has not been properly initialized, then we should not attach
> >the elevator to the queue yet.
> >
> >In cfq_init_queue(), can we move following code towards the end of
> >function.
> >
> > spin_lock_irq(q->queue_lock);
> > q->elevator = eq;
> > spin_unlock_irq(q->queue_lock);
> >
> >So till elevator is initialized, we will not attach it to queue and
> >elevator_switch() will return as it will not find a valid elevator
> >on the queue.
> >
> >
> >elevator_change() {
> > if (!q->elevator)
> > return -ENXIO;
> >}
> >
> >Thanks
> >Vivek
>
> I think it also works, though I prefer introducing explicit locking,
> as you said, so that code won't break again in some future.
I agree. Providing explicit locking and making sure only one elevator
can be initializing at a time on a queue and others wait till that
operation is complete, will make up the code more readable and less
bug suspecible.
Thanks
Vivek
On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
> The soft lockup below happes at the boot time of the system using dm
> multipath and automated elevator switching udev rules.
>
> [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
> [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
> ...
> [ 356.127001] Call Trace:
> [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
> [ 356.127001] [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
> [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
> [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
> [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
> [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
> [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
> [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
> [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
> [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
> [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
> [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
>
> This is caused by a race between md device initialization and sysfs knob
> to switch the scheduler.
I think we can also improve changelog a bit. So IIUC, softlockup
happens because one thread called del_timer_sync() on a timer which
was not even initilized. Timer initialization should have happened
in cfq_init_queue() using init_timer(). But before init_timer()
could be called, elevator switch path called del_timer_sync().
del_timer_sync() in turn calls lock_timer_base() which will loop
infinitely if timer->base == NULL. And because we have not called
init_timer() yet, I am assuming timer->base is null?
Is this right analysis? If yes, then this patch should most likely
fix following bz.
https://bugzilla.redhat.com/show_bug.cgi?id=902012
I had concluded that some how timer->base is NULL but could not understand
how come timer base is NULL when we have called init_timer() on it.
Thanks
Vivek
On 8/29/13 16:29 , "Vivek Goyal" <[email protected]> wrote:
>On Mon, Aug 26, 2013 at 09:45:15AM -0400, Tomoki Sekiyama wrote:
>> The soft lockup below happes at the boot time of the system using dm
>> multipath and automated elevator switching udev rules.
>>
>> [ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
>> [ 356.127001] RIP: 0010:[<ffffffff81072a7d>] [<ffffffff81072a7d>]
>>lock_timer_base.isra.35+0x1d/0x50
>> ...
>> [ 356.127001] Call Trace:
>> [ 356.127001] [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
>> [ 356.127001] [<ffffffff8118b08a>] ?
>>kmem_cache_alloc_node_trace+0x20a/0x230
>> [ 356.127001] [<ffffffff810738b2>] del_timer_sync+0x52/0x60
>> [ 356.127001] [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
>> [ 356.127001] [<ffffffff812c98df>] elevator_exit+0x2f/0x50
>> [ 356.127001] [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
>> [ 356.127001] [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
>> [ 356.127001] [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
>> [ 356.127001] [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
>> [ 356.127001] [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
>> [ 356.127001] [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
>> [ 356.127001] [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
>>
>> This is caused by a race between md device initialization and sysfs knob
>> to switch the scheduler.
>
>I think we can also improve changelog a bit. So IIUC, softlockup
>happens because one thread called del_timer_sync() on a timer which
>was not even initilized. Timer initialization should have happened
>in cfq_init_queue() using init_timer(). But before init_timer()
>could be called, elevator switch path called del_timer_sync().
>
>del_timer_sync() in turn calls lock_timer_base() which will loop
>infinitely if timer->base == NULL. And because we have not called
>init_timer() yet, I am assuming timer->base is null?
>
>Is this right analysis? If yes, then this patch should most likely
>fix following bz.
>
>https://bugzilla.redhat.com/show_bug.cgi?id=902012
I think your analysis is correct. If del_timer_sync() is called right after
cfqd is allocated (with __GFP_ZERO), timer->base == NULL. Otherwise it may
hit NULL pointer.
The other reason of NULL timer->base is that the timer is migrating in
__mod_timer(), but then the it must be set to non-NULL in a short time.
Maybe __mod_timer should use some illegal pointer value (like LIST_POISON1)
instead of NULL to represent the timer is migrating....
Actually, when I changed __mod_timer to timer_set_base(timer, 0xdeadbeaf),
made lock_timer_base wait while base == 0xdeadbeaf, and
added BUG_ON(!timer->base) to lock_timer_base, my system hits the BUG.
>I had concluded that some how timer->base is NULL but could not understand
>how come timer base is NULL when we have called init_timer() on it.
>
>Thanks
>Vivek
Thanks,
Tomoki