2012-10-25 09:42:54

by Junichi Nomura

Subject: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

[PATCH] dm: stay in blk_queue_bypass until queue becomes initialized

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
add_disk() eventually calls blk_queue_bypass_end().
This change triggers the following warning when multipath is used.

BUG: scheduling while atomic: multipath/2460/0x00000002
1 lock held by multipath/2460:
#0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
[<ffffffff810723ae>] __schedule_bug+0x6a/0x78
[<ffffffff81428ba2>] __schedule+0xb4/0x5e0
[<ffffffff814291e6>] schedule+0x64/0x66
[<ffffffff8142773a>] schedule_timeout+0x39/0xf8
[<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
[<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
[<ffffffff814289e3>] wait_for_common+0x9d/0xee
[<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
[<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
[<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
[<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
[<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
[<ffffffff8106fb18>] ? complete+0x21/0x53
[<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
[<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
[<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
[<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
[<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
[<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
[<ffffffff811d8c41>] elevator_init+0xe1/0x115
[<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
[<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
[<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
[<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
[<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
[<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
[<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
[<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
[<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
[<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
[<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

The warning means that during queue initialization blk_queue_bypass_start()
calls a sleeping function (synchronize_rcu()) while dm holds md->type_lock.

dm device initialization basically consists of the following 3 steps:
1. create ioctl: allocates the queue and calls add_disk()
2. table load ioctl: determines the device type and initializes the
   queue if request-based
3. resume ioctl: the device becomes functional

So it is better to have dm's queue stay in bypass mode until the
initialization completes in the table load ioctl.

The effect of additional blk_queue_bypass_start():

3.7-rc2 (plain)
# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
dmsetup remove a; done

real 0m15.434s
user 0m0.423s
sys 0m7.052s

3.7-rc2 (with this patch)
# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
dmsetup remove a; done

real 0m19.766s
user 0m0.442s
sys 0m6.861s

If this additional cost is not negligible, we need a variant of add_disk()
that does not end bypassing.

Signed-off-by: Jun'ichi Nomura <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Alasdair G Kergon <[email protected]>

---
drivers/md/dm.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 02db918..ad02761 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1869,6 +1869,8 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
add_disk(md->disk);
+ /* Until md type is determined, put the queue in bypass mode */
+ blk_queue_bypass_start(md->queue);
format_dev_t(md->name, MKDEV(_major, minor));

md->wq = alloc_workqueue("kdmflush",
@@ -2172,6 +2174,7 @@ static int dm_init_request_based_queue(struct mapped_device *md)
return 1;

/* Fully initialize the queue */
+ WARN_ON(!blk_queue_bypass(md->queue));
q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
if (!q)
return 0;
@@ -2198,6 +2201,7 @@ int dm_setup_md_queue(struct mapped_device *md)
return -EINVAL;
}

+ blk_queue_bypass_end(md->queue);
return 0;
}


2012-10-26 01:43:51

by Junichi Nomura

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On 10/25/12 18:41, Jun'ichi Nomura wrote:
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
...
> The warning means during queue initialization blk_queue_bypass_start()
> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
>
> dm device initialization basically includes the following 3 steps:
> 1. create ioctl, allocates queue and call add_disk()
> 2. table load ioctl, determines device type and initialize queue
> if request-based
> 3. resume ioctl, device becomes functional
>
> So it is better to have dm's queue stay in bypass mode until
> the initialization completes in table load ioctl.
>
> The effect of additional blk_queue_bypass_start():
>
> 3.7-rc2 (plain)
> # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
> dmsetup remove a; done
>
> real 0m15.434s
> user 0m0.423s
> sys 0m7.052s
>
> 3.7-rc2 (with this patch)
> # time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
> dmsetup remove a; done
> real 0m19.766s
> user 0m0.442s
> sys 0m6.861s
>
> If this additional cost is not negligible, we need a variant of add_disk()
> that does not end bypassing.

Or call blk_queue_bypass_start() before add_disk():

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ad02761..d14639b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1868,9 +1868,9 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->queue = md->queue;
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
- add_disk(md->disk);
/* Until md type is determined, put the queue in bypass mode */
blk_queue_bypass_start(md->queue);
+ add_disk(md->disk);
format_dev_t(md->name, MKDEV(_major, minor));

md->wq = alloc_workqueue("kdmflush",
---

If the patch is modified as above, we can fix the issue without
incurring the additional cost on dm device creation.

# time for n in $(seq 1000); do dmsetup create --noudevsync --notable a; \
dmsetup remove a; done

real 0m15.684s
user 0m0.404s
sys 0m7.181s

--
Jun'ichi Nomura, NEC Corporation

2012-10-26 20:21:19

by Vivek Goyal

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
>
> With 749fefe677 ("block: lift the initial queue bypass mode on
> blk_register_queue() instead of blk_init_allocated_queue()"),
> add_disk() eventually calls blk_queue_bypass_end().
> This change invokes the following warning when multipath is used.
>
> BUG: scheduling while atomic: multipath/2460/0x00000002
> 1 lock held by multipath/2460:
> #0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
> [...]
>
> The warning means during queue initialization blk_queue_bypass_start()
> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.

md->type_lock is a mutex, isn't it? I thought we are allowed to block
and schedule out under mutex?

add_disk() also calls disk_alloc_events(), which does kzalloc(GFP_KERNEL).
So we already have code which can block/wait under md->type_lock. I am
not sure why we should get this warning under a mutex.

Thanks
Vivek

2012-10-29 11:07:20

by Junichi Nomura

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On 10/27/12 05:21, Vivek Goyal wrote:
> On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
>> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
>>
>> With 749fefe677 ("block: lift the initial queue bypass mode on
>> blk_register_queue() instead of blk_init_allocated_queue()"),
>> add_disk() eventually calls blk_queue_bypass_end().
>> This change invokes the following warning when multipath is used.
>>
>> BUG: scheduling while atomic: multipath/2460/0x00000002
>> 1 lock held by multipath/2460:
>> #0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
>> [...]
>>
>> The warning means during queue initialization blk_queue_bypass_start()
>> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
>
> md->type_lock is a mutex, isn't it? I thought we are allowed to block
> and schedule out under mutex?

Hm, you are right. It's a mutex.
The warning occurs only if CONFIG_PREEMPT=y is turned on.

> add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> So we already have code which can block/wait under md->type_lock. I am
> not sure why should we get this warning under a mutex.

add_disk() is called without md->type_lock.

Call flow is like this:

dm_create
  alloc_dev
    blk_alloc_queue
    alloc_disk
    add_disk
      blk_queue_bypass_end [with 3.7-rc2]

table_load
  dm_lock_md_type [takes md->type_lock]
  dm_setup_md_queue
    blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
      elevator_init
        blkcg_activate_policy
          blk_queue_bypass_start <-- THIS triggers the warning
          blk_queue_bypass_end
    blk_queue_bypass_end [with 3.6]
  dm_unlock_md_type

With 3.6, the blk_queue_bypass_start() in blkcg_activate_policy() was a
nested call that did nothing. With 3.7-rc2, it becomes the initial call
and does the actual draining work.

--
Jun'ichi Nomura, NEC Corporation

2012-10-29 16:39:47

by Vivek Goyal

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On Mon, Oct 29, 2012 at 07:15:08PM +0900, Jun'ichi Nomura wrote:
> On 10/27/12 05:21, Vivek Goyal wrote:
> > On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> >> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> >>
> >> With 749fefe677 ("block: lift the initial queue bypass mode on
> >> blk_register_queue() instead of blk_init_allocated_queue()"),
> >> add_disk() eventually calls blk_queue_bypass_end().
> >> This change invokes the following warning when multipath is used.
> >>
> >> BUG: scheduling while atomic: multipath/2460/0x00000002
> >> 1 lock held by multipath/2460:
> >> #0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
> >> [...]
> >>
> >> The warning means during queue initialization blk_queue_bypass_start()
> >> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
> >
> > md->type_lock is a mutex, isn't it? I thought we are allowed to block
> > and schedule out under mutex?
>
> Hm, you are right. It's a mutex.
> The warning occurs only if I turned on CONFIG_PREEMPT=y.

Ok, so the question is what's wrong with calling synchronize_rcu() inside
a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney and
peterz.

>
> > add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> > So we already have code which can block/wait under md->type_lock. I am
> > not sure why should we get this warning under a mutex.
>
> add_disk() is called without md->type_lock.
>
> Call flow is like this:
>
> dm_create
>   alloc_dev
>     blk_alloc_queue
>     alloc_disk
>     add_disk
>       blk_queue_bypass_end [with 3.7-rc2]
>
> table_load
>   dm_lock_md_type [takes md->type_lock]
>   dm_setup_md_queue
>     blk_init_allocated_queue [when DM_TYPE_REQUEST_BASED]
>       elevator_init
>         blkcg_activate_policy
>           blk_queue_bypass_start <-- THIS triggers the warning
>           blk_queue_bypass_end
>     blk_queue_bypass_end [with 3.6]
>   dm_unlock_md_type
>
> blk_queue_bypass_start() in blkcg_activate_policy was nested call,
> that did nothing, with 3.6.
> With 3.7-rc2, it becomes the initial call and does
> actual draining stuff.

Ok. Once we know what's wrong, we should be able to figure out the
right solution. Artificially putting queue one level deep in bypass
to avoid calling synchronize_rcu() sounds bad.

Thanks
Vivek

>
> --
> Jun'ichi Nomura, NEC Corporation

2012-10-29 16:45:39

by Peter Zijlstra

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On Mon, 2012-10-29 at 12:38 -0400, Vivek Goyal wrote:
> Ok, so the question is what's wrong with calling synchronize_rcu() inside
> a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney and
> peterz.

int blkcg_activate_policy(struct request_queue *q,
{
	...

	preloaded = !radix_tree_preload(GFP_KERNEL);

	blk_queue_bypass_start(q);

where:

int radix_tree_preload(gfp_t gfp_mask)
{
	struct radix_tree_preload *rtp;
	struct radix_tree_node *node;
	int ret = -ENOMEM;

	preempt_disable();
	...

Seems obvious why it explodes..

2012-10-29 16:55:44

by Paul E. McKenney

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On Mon, Oct 29, 2012 at 12:38:45PM -0400, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 07:15:08PM +0900, Jun'ichi Nomura wrote:
> > On 10/27/12 05:21, Vivek Goyal wrote:
> > > On Thu, Oct 25, 2012 at 06:41:11PM +0900, Jun'ichi Nomura wrote:
> > >> [PATCH] dm: stay in blk_queue_bypass until queue becomes initialized
> > >>
> > >> With 749fefe677 ("block: lift the initial queue bypass mode on
> > >> blk_register_queue() instead of blk_init_allocated_queue()"),
> > >> add_disk() eventually calls blk_queue_bypass_end().
> > >> This change invokes the following warning when multipath is used.
> > >>
> > >> BUG: scheduling while atomic: multipath/2460/0x00000002

Looks like preemption has been disabled, nested two deep, based
on the /0x00000002, which dumps out preempt_count().

So is someone invoking this with preemption disabled?

> > >> [...]
> > >>
> > >> The warning means during queue initialization blk_queue_bypass_start()
> > >> calls sleeping function (synchronize_rcu) while dm holds md->type_lock.
> > >
> > > md->type_lock is a mutex, isn't it? I thought we are allowed to block
> > > and schedule out under mutex?
> >
> > Hm, you are right. It's a mutex.
> > The warning occurs only if I turned on CONFIG_PREEMPT=y.
>
> Ok, so the question is what's wrong with calling synchronize_rcu() inside
> a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney and
> peterz.

Nothing is wrong with calling synchronize_rcu() inside a mutex, this has
been used and does work. Last I tried it, anyway. ;-)

> > > add_disk() also calls disk_alloc_events() which does kzalloc(GFP_KERNEL).
> > > So we already have code which can block/wait under md->type_lock. I am
> > > not sure why should we get this warning under a mutex.
> >
> > add_disk() is called without md->type_lock.
> >
> > [...]
>
> Ok. Once we know what's wrong, we should be able to figure out the
> right solution. Artificially putting queue one level deep in bypass
> to avoid calling synchronize_rcu() sounds bad.

Also should be unnecessary. I suggest finding out where preemption is
disabled. Scattering checks for preempt_count()!=0 might help locate it.

Thanx, Paul

2012-10-29 17:14:03

by Vivek Goyal

Subject: Re: [PATCH 2/2] dm: stay in blk_queue_bypass until queue becomes initialized

On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-10-29 at 12:38 -0400, Vivek Goyal wrote:
> > Ok, so the question is what's wrong with calling synchronize_rcu() inside
> > a mutex with CONFIG_PREEMPT=y. I don't know. Ccing paul mckenney and
> > peterz.
>
> int blkcg_activate_policy(struct request_queue *q,
> {
> 	...
>
> 	preloaded = !radix_tree_preload(GFP_KERNEL);
>
> 	blk_queue_bypass_start(q);
>
> where:
>
> int radix_tree_preload(gfp_t gfp_mask)
> {
> 	...
> 	preempt_disable();
> 	...
>
> Seems obvious why it explodes..

Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
radix_tree_preload() should fix it. Junichi, can you please give it
a try?

Thanks
Vivek

2012-10-30 02:28:44

by Junichi Nomura

Subject: [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start

On 10/30/12 02:13, Vivek Goyal wrote:
> On Mon, Oct 29, 2012 at 05:45:15PM +0100, Peter Zijlstra wrote:
>> int radix_tree_preload(gfp_t gfp_mask)
>> {
>> struct radix_tree_preload *rtp;
>> struct radix_tree_node *node;
>> int ret = -ENOMEM;
>>
>> preempt_disable();
>>
>>
>> Seems obvious why it explodes..
>
> Oh right. Thanks Peter. So just calling blk_queue_bypass_start() before
> radix_tree_preload() should fix it. Junichi, can you please give it
> a try.

Thank you! It just works.
This is a revised patch.

With 749fefe677 ("block: lift the initial queue bypass mode on
blk_register_queue() instead of blk_init_allocated_queue()"),
the following warning appears when multipath is used with
CONFIG_PREEMPT=y.

This patch moves blk_queue_bypass_start() before radix_tree_preload()
to avoid the sleeping call while preemption is disabled.

BUG: scheduling while atomic: multipath/2460/0x00000002
1 lock held by multipath/2460:
#0: (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
Modules linked in: ...
Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
Call Trace:
[<ffffffff810723ae>] __schedule_bug+0x6a/0x78
[<ffffffff81428ba2>] __schedule+0xb4/0x5e0
[<ffffffff814291e6>] schedule+0x64/0x66
[<ffffffff8142773a>] schedule_timeout+0x39/0xf8
[<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
[<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
[<ffffffff814289e3>] wait_for_common+0x9d/0xee
[<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
[<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
[<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
[<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
[<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
[<ffffffff8106fb18>] ? complete+0x21/0x53
[<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
[<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
[<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
[<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
[<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
[<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
[<ffffffff811d8c41>] elevator_init+0xe1/0x115
[<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
[<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
[<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
[<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
[<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
[<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
[<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
[<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
[<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
[<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
[<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff814310d9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Jun'ichi Nomura <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Alasdair G Kergon <[email protected]>
---
block/blk-cgroup.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d0b7703..954f4be 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -791,10 +791,10 @@ int blkcg_activate_policy(struct request_queue *q,
if (!blkg)
return -ENOMEM;

- preloaded = !radix_tree_preload(GFP_KERNEL);
-
blk_queue_bypass_start(q);

+ preloaded = !radix_tree_preload(GFP_KERNEL);
+
/* make sure the root blkg exists and count the existing blkgs */
spin_lock_irq(q->queue_lock);

2012-10-30 13:22:41

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] blkcg: fix "scheduling while atomic" in blk_queue_bypass_start

On Tue, Oct 30, 2012 at 11:25:47AM +0900, Jun'ichi Nomura wrote:

[..]
> This patch moves blk_queue_bypass_start() before radix_tree_preload()
> to avoid the sleeping call while preemption is disabled.
>
> [...]
>
> Signed-off-by: Jun'ichi Nomura <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Alasdair G Kergon <[email protected]>
> ---

Thanks Junichi. This patch looks good to me.

Acked-by: Vivek Goyal <[email protected]>

Vivek

> [...]