2010-11-12 18:56:20

by Alok Kataria

Subject: (mem hotplug, pcpu_alloc) BUG: sleeping function called from invalid context at kernel/mutex.c:94

Hi,

We have seen the following might_sleep warning while hot-adding memory:

[ 142.339267] BUG: sleeping function called from invalid context at kernel/mutex.c:94
[ 142.339276] in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
[ 142.339283] Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
[ 142.339288] Call Trace:
[ 142.339305] [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
[ 142.339316] [<ffffffff81468245>] mutex_lock+0x24/0x50
[ 142.339326] [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
[ 142.339336] [<ffffffff81048888>] ? load_balance+0xbe/0x60e
[ 142.339343] [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
[ 142.339349] [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
[ 142.339356] [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
[ 142.339362] [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
[ 142.339373] [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
[ 142.339384] [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
[ 142.339395] [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
[ 142.339401] [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
[ 142.339407] [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
[ 142.339414] [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
[ 142.339420] [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
[ 142.339426] [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
[ 142.339434] [<ffffffff81065f29>] kthread+0x7f/0x87
[ 142.339443] [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
[ 142.339449] [<ffffffff81065eaa>] ? kthread+0x0/0x87
[ 142.339455] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
[ 142.340099] Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
[ 142.340108] Policy zone: Normal


This warning was seen on the FC14 kernel, but looking at current git,
the problem seems to exist in mainline too.
The problem is that pcpu_alloc expects to be called from non-atomic
context, because it grabs pcpu_alloc_mutex.
In the memory-hotplug case, though, we end up calling pcpu_alloc from
atomic context, while all CPUs are stopped.

void build_all_zonelists(void *data)
{
	set_zonelist_order();

	if (system_state == SYSTEM_BOOTING) {
		__build_all_zonelists(NULL);
		mminit_verify_zonelist();
		cpuset_init_current_mems_allowed();
	} else {
		/* we have to stop all cpus to guarantee there is no user
		   of zonelist */
		stop_machine(__build_all_zonelists, data, NULL);  <=========
		/* cpuset refresh routine should be here */
	}
	...
}

__build_all_zonelists eventually calls pcpu_alloc.
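
The call chain above can be modelled in userspace as a minimal sketch.
Every name prefixed with model_ is hypothetical and only stands in for
the kernel function it is named after; the real __might_sleep() checks
more state than a single flag, and the real stop_machine() does far
more than disable interrupts. The sketch only shows why a lock that may
sleep trips the check when it is reached from the stopped-machine
callback.

```c
/* Userspace model only: none of these are the real kernel functions. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

bool model_irqs_disabled;	/* stands in for irqs_disabled() */
int model_warnings;		/* number of "sleeping function" reports */

/* Model of __might_sleep(): report if sleeping is not allowed here. */
static void model_might_sleep(const char *where)
{
	if (model_irqs_disabled) {
		model_warnings++;
		fprintf(stderr,
			"BUG: sleeping function called from invalid context at %s\n",
			where);
	}
}

/* Model of mutex_lock(): a mutex may sleep, so it runs the check first. */
static void model_mutex_lock(void)
{
	model_might_sleep("model_mutex_lock");
}

/* Model of pcpu_alloc(): takes a mutex before doing any work. */
static void model_pcpu_alloc(void)
{
	model_mutex_lock();
	/* the allocation itself is left out of the model */
}

/* Model of __build_all_zonelists(): allocates only when given a zone. */
static int model_build_all_zonelists(void *data)
{
	if (data)
		model_pcpu_alloc();
	return 0;
}

/* Model of stop_machine(): run fn with interrupts marked as disabled. */
static int model_stop_machine(int (*fn)(void *), void *data)
{
	int ret;

	assert(fn != NULL);
	model_irqs_disabled = true;
	ret = fn(data);
	model_irqs_disabled = false;
	return ret;
}

/* Model of the hotplug path; returns how many warnings it raised. */
int model_hotplug_path(void)
{
	int zone = 0;	/* placeholder for the newly added struct zone */

	model_warnings = 0;
	model_stop_machine(model_build_all_zonelists, &zone);
	return model_warnings;
}
```

In this model, model_hotplug_path() reports one warning, while calling
model_build_all_zonelists() directly, as the boot path does, reports
none.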

I didn't dig through the history, so I am not sure when this
regression was introduced, but it could have come in with the new
percpu memory allocator.

--
Alok


2010-11-13 10:09:58

by Tejun Heo

Subject: Re: (mem hotplug, pcpu_alloc) BUG: sleeping function called from invalid context at kernel/mutex.c:94

Hello,

On 11/12/2010 07:56 PM, Alok Kataria wrote:
> We have seen following might_sleep warning while hot adding memory...
>
> [ 142.339267] BUG: sleeping function called from invalid context at kernel/mutex.c:94
> [ 142.339276] in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
> [ 142.339283] Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
> [ 142.339288] Call Trace:
> [ 142.339305] [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
> [ 142.339316] [<ffffffff81468245>] mutex_lock+0x24/0x50
> [ 142.339326] [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
> [ 142.339336] [<ffffffff81048888>] ? load_balance+0xbe/0x60e
> [ 142.339343] [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
> [ 142.339349] [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
> [ 142.339356] [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
> [ 142.339362] [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
> [ 142.339373] [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
> [ 142.339384] [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
> [ 142.339395] [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
> [ 142.339401] [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
> [ 142.339407] [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
> [ 142.339414] [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
> [ 142.339420] [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
> [ 142.339426] [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
> [ 142.339434] [<ffffffff81065f29>] kthread+0x7f/0x87
> [ 142.339443] [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
> [ 142.339449] [<ffffffff81065eaa>] ? kthread+0x0/0x87
> [ 142.339455] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
> [ 142.340099] Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
> [ 142.340108] Policy zone: Normal
>
>
> This warning was seen on the FC14 kernel, though looking at the current
> git, the problem seems to exist on mainline too.
> The problem is that pcpu_alloc expects that it is called from non-atomic
> context as a result it grabs the pcpu_alloc_mutex.
> In the memory-hotplug case though, we do end up calling pcpu_alloc from
> atomic context, while all cpus are stopped.
>
> void build_all_zonelists(void *data)
> {
> set_zonelist_order();
>
> if (system_state == SYSTEM_BOOTING) {
> __build_all_zonelists(NULL);
> mminit_verify_zonelist();
> cpuset_init_current_mems_allowed();
> } else {
> /* we have to stop all cpus to guarantee there is no user
> of zonelist */
> stop_machine(__build_all_zonelists, data, NULL); <=========
> /* cpuset refresh routine should be here */
> }
>
> __build_all_zonelists eventually calls pcpu_alloc.
>
> I didn't dive through the history, so am not sure when was this
> regression introduced, but could have regressed with the new pcpu memory
> allocator.

Meh... the percpu allocator required user context from the beginning.
The new allocator didn't change that.

Wouldn't it be possible to prepare the hotplug outside of cpu_stop and
use stop_machine() only to make the new resource available to the
system? In general, it's a very bad idea to allocate memory from inside
stop_machine(); the whole machine is stopped, after all. Unlike
removing a resource, adding a new one shouldn't be too difficult to do
without stop_machine() at all. Pekka, Christoph, any ideas?
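
The "prepare outside, publish inside" idea could look roughly like the
following userspace sketch. All sketch_ names are hypothetical and are
not kernel APIs: calloc() stands in for a sleeping allocator, and a
plain flag stands in for the stopped machine. The point is only the
ordering: everything that may sleep happens before the stop, and the
stopped callback does no allocation.

```c
/* Userspace sketch only: the sketch_ names are not kernel functions. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_zone {
	int *pageset;	/* stands in for the zone's per-cpu pageset */
};

bool sketch_machine_stopped;	/* true while the stopped callback runs */
int sketch_bad_allocs;		/* allocations attempted while stopped */

/* Allocation helper: records misuse if called while "stopped". */
static int *sketch_alloc_pageset(void)
{
	if (sketch_machine_stopped)
		sketch_bad_allocs++;
	return calloc(1, sizeof(int));
}

/* Step 1: may sleep, so it runs before the machine is stopped. */
static int sketch_prepare_zone(struct sketch_zone *zone)
{
	zone->pageset = sketch_alloc_pageset();
	return zone->pageset ? 0 : -1;
}

/* Step 2: runs while stopped; only publishes state, never allocates. */
static int sketch_publish(void *unused)
{
	(void)unused;
	/* rebuilding the zonelists would go here */
	return 0;
}

/* Stand-in for stop_machine(): run fn while the flag is set. */
static int sketch_stop_machine(int (*fn)(void *), void *data)
{
	int ret;

	assert(fn != NULL);
	sketch_machine_stopped = true;
	ret = fn(data);
	sketch_machine_stopped = false;
	return ret;
}

/*
 * Returns the number of allocations made while stopped (0 is the goal),
 * or -1 if the preparation step itself failed.
 */
int sketch_hot_add(struct sketch_zone *zone)
{
	sketch_bad_allocs = 0;
	if (sketch_prepare_zone(zone) != 0)
		return -1;
	sketch_stop_machine(sketch_publish, NULL);
	return sketch_bad_allocs;
}
```

A caller passes a zeroed struct sketch_zone to sketch_hot_add() and
frees zone->pageset afterwards. Failing early in the preparation step
also means the machine is never stopped for an operation that cannot
complete.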

Thanks.

--
tejun

2010-11-18 07:45:21

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH] fix build_all_zonelist where percpu_alloc is wrongly called under stop_machine_run (Was Re: (mem hotplug, pcpu_alloc) BUG: sleeping function called from invalid context at kernel/mutex.c:94)

On Sat, 13 Nov 2010 11:09:23 +0100
Tejun Heo <[email protected]> wrote:

> Meh... the percpu allocator required user context from the beginning.
> The new allocator didn't change that.
>
> Wouldn't it be possible to prepare hotplug outside of cpu_stop and use
> stop_machine() only to make it available to the system. In general,
> it's a very bad idea to allocate memory from inside stop_machine. The
> whole machine is stopped, after all. In general, it shouldn't be too
> difficult to add new resource without stop_machine too unlike removing
> one. Pekka, Christoph, any ideas?
>

Here is a fix. I would be glad if someone could test it.
==

At memory hotplug, build_all_zonelists() may be called under stop_machine_run().
In this function, setup_zone_pageset() is called. That is a bug, because it
performs a percpu allocation, which may sleep, under stop_machine_run().

Here is a report from Alok Kataria.

[ 142.339267] BUG: sleeping function called from invalid context at kernel/mutex.c:94
[ 142.339276] in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
[ 142.339283] Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
[ 142.339288] Call Trace:
[ 142.339305] [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
[ 142.339316] [<ffffffff81468245>] mutex_lock+0x24/0x50
[ 142.339326] [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
[ 142.339336] [<ffffffff81048888>] ? load_balance+0xbe/0x60e
[ 142.339343] [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
[ 142.339349] [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
[ 142.339356] [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
[ 142.339362] [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
[ 142.339373] [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
[ 142.339384] [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
[ 142.339395] [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
[ 142.339401] [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
[ 142.339407] [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
[ 142.339414] [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
[ 142.339420] [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
[ 142.339426] [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
[ 142.339434] [<ffffffff81065f29>] kthread+0x7f/0x87
[ 142.339443] [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
[ 142.339449] [<ffffffff81065eaa>] ? kthread+0x0/0x87
[ 142.339455] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
[ 142.340099] Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
[ 142.340108] Policy zone: Normal

This patch fixes the issue by moving setup_zone_pageset() out of
stop_machine_run(). There is no need for it to be called under
stop_machine_run().

Reported-by: Alok Kataria <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/page_alloc.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

Index: mmotm-1117/mm/page_alloc.c
===================================================================
--- mmotm-1117.orig/mm/page_alloc.c
+++ mmotm-1117/mm/page_alloc.c
@@ -3027,14 +3027,6 @@ static __init_refok int __build_all_zone
 		build_zonelist_cache(pgdat);
 	}
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-	/* Setup real pagesets for the new zone */
-	if (data) {
-		struct zone *zone = data;
-		setup_zone_pageset(zone);
-	}
-#endif
-
 	/*
 	 * Initialize the boot_pagesets that are going to be used
 	 * for bootstrapping processors. The real pagesets for
@@ -3083,7 +3075,13 @@ void build_all_zonelists(void *data)
 	} else {
 		/* we have to stop all cpus to guarantee there is no user
 		   of zonelist */
-		stop_machine(__build_all_zonelists, data, NULL);
+#ifdef CONFIG_MEMORY_HOTPLUG
+		if (data) {
+			struct zone *zone = (struct zone *)data;
+			setup_zone_pageset(zone);
+		}
+#endif
+		stop_machine(__build_all_zonelists, NULL, NULL);
 		/* cpuset refresh routine should be here */
 	}
 	vm_total_pages = nr_free_pagecache_pages();