2009-07-15 09:48:21

by KOSAKI Motohiro

Subject: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

Hi

On 2.6.31-rc3, the following test makes the kernel panic immediately.

numactl --interleave=all echo

The panic message is below. I don't think commit 58568d2a8 is a correct patch.

old behavior:
do_set_mempolicy
mpol_new
cpuset_update_task_memory_state
guarantee_online_mems
nodes_and(cs->mems_allowed, node_states[N_HIGH_MEMORY]);

but the new code doesn't consider N_HIGH_MEMORY. Then a userland program
passing a non-online node bit causes a crash, I guess.

Miao, what do you think?


========================================================
login: numactl[4506]: NaT consumption 17179869216 [1]
Modules linked in: binfmt_misc nls_iso8859_1 nls_cp437 dm_multipath scsi_dh fan sg processor button thermal container e100 mii dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod lpfc mptspi mptscsih mptbase ehci_hcd ohci_hcd uhci_hcd usbcore

Pid: 4506, CPU 1, comm: numactl
psr : 00001010085a6010 ifs : 8000000000000a1c ip : [<a0000001001c16b0>] Not tainted (2.6.31-rc2-g8b48a9f-dirty)
ip is at __alloc_pages_nodemask+0x130/0xc00
unat: 0000000000000000 pfs : 0000000000000a1c rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0021055a065a9555
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001001c16b0 b6 : a0000001001b2620 b7 : a00000010000c300
f6 : 1003e0000000000000002 f7 : 1003e00000000fffffffe
f8 : 1001580200800ffbfeffc f9 : 1003efffffffffffffc01
f10 : 10006acfffffffeeae330 f11 : 1003e0000000000000000
r1 : a0000001010c5840 r2 : e0000040cf590ea0 r3 : 0000000000000000
r8 : 0000000000000000 r9 : a000000100ec64d8 r10 : 0000000000080008
r11 : 0000000000000008 r12 : e0000040cf59fdc0 r13 : e0000040cf590000
r14 : 0000000000000000 r15 : 0000000000000002 r16 : 0000000000080008
r17 : 0000000000000000 r18 : 0000000000000002 r19 : 00000000003fffff
r20 : a000000100d45538 r21 : a000000100ec6180 r22 : 0000000000000000
r23 : 0000000000000000 r24 : 0000000000000070 r25 : 0000000000000028
r26 : a000000100e5ee28 r27 : a000000100e5ee20 r28 : 0000000000000001
r29 : e0000040cf590ea4 r30 : e0000040c011012c r31 : 0000000000000000

Call Trace:
[<a000000100019c00>] show_stack+0x80/0xa0
sp=e0000040cf59f810 bsp=e0000040cf591610
[<a00000010001a710>] show_regs+0xa90/0xac0
sp=e0000040cf59f9e0 bsp=e0000040cf5915b0
[<a000000100057360>] die+0x2c0/0x3e0
sp=e0000040cf59f9e0 bsp=e0000040cf591568
[<a0000001000574d0>] die_if_kernel+0x50/0x80
sp=e0000040cf59f9e0 bsp=e0000040cf591538
[<a0000001008c3990>] ia64_fault+0xf0/0x2080
sp=e0000040cf59f9e0 bsp=e0000040cf5914e8
[<a00000010008df20>] paravirt_leave_kernel+0x0/0x40
sp=e0000040cf59fbf0 bsp=e0000040cf5914e8
[<a0000001001c16b0>] __alloc_pages_nodemask+0x130/0xc00
sp=e0000040cf59fdc0 bsp=e0000040cf591408
[<a0000001002249b0>] alloc_page_interleave+0xb0/0x180
sp=e0000040cf59fde0 bsp=e0000040cf5913c8
[<a000000100225010>] alloc_page_vma+0x1d0/0x2e0
sp=e0000040cf59fdf0 bsp=e0000040cf591390
[<a0000001001eeea0>] handle_mm_fault+0xa20/0x15a0
sp=e0000040cf59fdf0 bsp=e0000040cf591308
[<a0000001001efca0>] __get_user_pages+0x280/0x9e0
sp=e0000040cf59fe00 bsp=e0000040cf591260
[<a0000001001f0460>] get_user_pages+0x60/0x80
sp=e0000040cf59fe10 bsp=e0000040cf591208
[<a00000010025bbe0>] get_arg_page+0xa0/0x220
sp=e0000040cf59fe10 bsp=e0000040cf5911d0
[<a00000010025c4a0>] copy_strings+0x3e0/0x6c0
sp=e0000040cf59fe20 bsp=e0000040cf591120
[<a00000010025c880>] copy_strings_kernel+0x100/0x180
sp=e0000040cf59fe20 bsp=e0000040cf5910e8
[<a000000100261790>] do_execve+0x6b0/0xba0
sp=e0000040cf59fe20 bsp=e0000040cf591080
[<a000000100017fc0>] sys_execve+0x60/0xc0
sp=e0000040cf59fe30 bsp=e0000040cf591048
[<a00000010000c030>] ia64_execve+0x30/0x160
sp=e0000040cf59fe30 bsp=e0000040cf590ff0
[<a00000010000c980>] ia64_ret_from_syscall+0x0/0x40
sp=e0000040cf59fe30 bsp=e0000040cf590ff0
[<a000000000012000>] __kernel_syscall_via_break+0x0/0x20
sp=e0000040cf5a0000 bsp=e0000040cf590ff0
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception
Rebooting in 1 seconds..


2009-07-15 17:31:12

by Lee Schermerhorn

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Wed, 2009-07-15 at 18:48 +0900, KOSAKI Motohiro wrote:
> Hi
>
> On 2.6.31-rc3, following test makes kernel panic immediately.
>
> numactl --interleave=all echo
>
> Panic message is below. I don't think commit 58568d2a8 is correct patch.
>
> old behavior:
> do_set_mempolicy
> mpol_new
> cpuset_update_task_memory_state
> guarantee_online_mems
> nodes_and(cs->mems_allowed, node_states[N_HIGH_MEMORY]);
>
> but new code doesn't consider N_HIGH_MEMORY. Then, the userland program
> passing non-online node bit makes crash, I guess.
>
> Miao, What do you think?

This looks similar to the problem I tried to fix in:

http://marc.info/?l=linux-mm&m=124140637722309&w=4

Miao pointed out that the patch needs more work to track hot plug of
nodes. I've not had time to get back to this.

Interestingly, on ia64, the top cpuset mems_allowed gets set to all
possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
with memory]. Maybe this is to support hot-plug?

Lee

<snip>

2009-07-16 01:00:22

by KOSAKI Motohiro

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

> On Wed, 2009-07-15 at 18:48 +0900, KOSAKI Motohiro wrote:
> > Hi
> >
> > On 2.6.31-rc3, following test makes kernel panic immediately.
> >
> > numactl --interleave=all echo
> >
> > Panic message is below. I don't think commit 58568d2a8 is correct patch.
> >
> > old behavior:
> > do_set_mempolicy
> > mpol_new
> > cpuset_update_task_memory_state
> > guarantee_online_mems
> > nodes_and(cs->mems_allowed, node_states[N_HIGH_MEMORY]);
> >
> > but new code doesn't consider N_HIGH_MEMORY. Then, the userland program
> > passing non-online node bit makes crash, I guess.
> >
> > Miao, What do you think?
>
> This looks similar to the problem I tried to fix in:
>
> http://marc.info/?l=linux-mm&m=124140637722309&w=4
>
> Miao pointed out that the patch needs more work to track hot plug of
> nodes. I've not had time to get back to this.
>
> Interestingly, on ia64, the top cpuset mems_allowed gets set to all
> possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
> with memory]. Maybe this is a to support hot-plug?

Maybe.

The init process's task->mems_allowed is initialized from node_possible_map.
If the system doesn't have memory hot-plug capability, node_possible_map
is equal to node_online_map.


-------------------------------------------------
@@ -867,6 +866,11 @@ static noinline int init_post(void)
static int __init kernel_init(void * unused)
{
lock_kernel();
+
+ /*
+ * init can allocate pages on any node
+ */
+ set_mems_allowed(node_possible_map);



2009-07-16 20:05:46

by David Rientjes

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Wed, 15 Jul 2009, Lee Schermerhorn wrote:

> Interestingly, on ia64, the top cpuset mems_allowed gets set to all
> possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
> with memory]. Maybe this is a to support hot-plug?
>

numactl --interleave=all simply passes a nodemask with all bits set, so if
cpuset_current_mems_allowed includes offline nodes from node_possible_map,
then mpol_set_nodemask() doesn't mask them off.

Seems like we could handle this strictly in mempolicies without worrying
about top_cpuset like in the following?
---
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -194,6 +194,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
{
nodemask_t cpuset_context_nmask;
+ nodemask_t mems_allowed;
int ret;

/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
@@ -201,20 +202,21 @@ static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
return 0;

VM_BUG_ON(!nodes);
+ nodes_and(mems_allowed, cpuset_current_mems_allowed,
+ node_states[N_HIGH_MEMORY]);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
nodes = NULL; /* explicit local allocation */
else {
if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&cpuset_context_nmask, nodes,
- &cpuset_current_mems_allowed);
+ &mems_allowed);
else
nodes_and(cpuset_context_nmask, *nodes,
- cpuset_current_mems_allowed);
+ mems_allowed);
if (mpol_store_user_nodemask(pol))
pol->w.user_nodemask = *nodes;
else
- pol->w.cpuset_mems_allowed =
- cpuset_current_mems_allowed;
+ pol->w.cpuset_mems_allowed = mems_allowed;
}

ret = mpol_ops[pol->mode].create(pol,

2009-07-17 00:04:52

by KOSAKI Motohiro

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

> On Wed, 15 Jul 2009, Lee Schermerhorn wrote:
>
> > Interestingly, on ia64, the top cpuset mems_allowed gets set to all
> > possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
> > with memory]. Maybe this is a to support hot-plug?
> >
>
> numactl --interleave=all simply passes a nodemask with all bits set, so if
> cpuset_current_mems_allowed includes offline nodes from node_possible_map,
> then mpol_set_nodemask() doesn't mask them off.
>
> Seems like we could handle this strictly in mempolicies without worrying
> about top_cpuset like in the following?

This patch seems like a band-aid; it will change memory-hotplug behavior.
Please imagine the following scenario:

1. numactl interleave=all process-A
2. memory hot-add

before 2.6.30:
-> process-A can use hot-added memory

your proposal patch:
-> process-A can't use hot-added memory




> ---
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -194,6 +194,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> {
> nodemask_t cpuset_context_nmask;
> + nodemask_t mems_allowed;
> int ret;
>
> /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
> @@ -201,20 +202,21 @@ static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> return 0;
>
> VM_BUG_ON(!nodes);
> + nodes_and(mems_allowed, cpuset_current_mems_allowed,
> + node_states[N_HIGH_MEMORY]);
> if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
> nodes = NULL; /* explicit local allocation */
> else {
> if (pol->flags & MPOL_F_RELATIVE_NODES)
> mpol_relative_nodemask(&cpuset_context_nmask, nodes,
> - &cpuset_current_mems_allowed);
> + &mems_allowed);
> else
> nodes_and(cpuset_context_nmask, *nodes,
> - cpuset_current_mems_allowed);
> + mems_allowed);
> if (mpol_store_user_nodemask(pol))
> pol->w.user_nodemask = *nodes;
> else
> - pol->w.cpuset_mems_allowed =
> - cpuset_current_mems_allowed;
> + pol->w.cpuset_mems_allowed = mems_allowed;
> }
>
> ret = mpol_ops[pol->mode].create(pol,


2009-07-17 00:59:45

by Kamezawa Hiroyuki

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Fri, 17 Jul 2009 09:04:46 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > On Wed, 15 Jul 2009, Lee Schermerhorn wrote:
> >
> > > Interestingly, on ia64, the top cpuset mems_allowed gets set to all
> > > possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
> > > with memory]. Maybe this is a to support hot-plug?
> > >
> >
> > numactl --interleave=all simply passes a nodemask with all bits set, so if
> > cpuset_current_mems_allowed includes offline nodes from node_possible_map,
> > then mpol_set_nodemask() doesn't mask them off.
> >
> > Seems like we could handle this strictly in mempolicies without worrying
> > about top_cpuset like in the following?
>
> This patch seems band-aid patch. it will change memory-hotplug behavior.
> Please imazine following scenario:
>
> 1. numactl interleave=all process-A
> 2. memory hot-add
>
> before 2.6.30:
> -> process-A can use hot-added memory
>
> your proposal patch:
> -> process-A can't use hot-added memory
>

IMHO, the application itself should be notified to change its mempolicy by a
hot-plug script on the host. While an application uses interleave, a newly
hot-added node is just noise. I think "how pages are interleaved" should not be
changed implicitly. Then, checking at set_mempolicy() seems sane. If notified,
the application can do page migration and rebuild its mapping in an ideal way.

BUT I don't like that init->mems_allowed contains N_POSSIBLE... it should be
initialized to N_HIGH_MEMORY, IMHO.

Thanks,
-Kame

2009-07-17 02:07:15

by KOSAKI Motohiro

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

> On Fri, 17 Jul 2009 09:04:46 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > > On Wed, 15 Jul 2009, Lee Schermerhorn wrote:
> > >
> > > > Interestingly, on ia64, the top cpuset mems_allowed gets set to all
> > > > possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
> > > > with memory]. Maybe this is a to support hot-plug?
> > > >
> > >
> > > numactl --interleave=all simply passes a nodemask with all bits set, so if
> > > cpuset_current_mems_allowed includes offline nodes from node_possible_map,
> > > then mpol_set_nodemask() doesn't mask them off.
> > >
> > > Seems like we could handle this strictly in mempolicies without worrying
> > > about top_cpuset like in the following?
> >
> > This patch seems band-aid patch. it will change memory-hotplug behavior.
> > Please imazine following scenario:
> >
> > 1. numactl interleave=all process-A
> > 2. memory hot-add
> >
> > before 2.6.30:
> > -> process-A can use hot-added memory
> >
> > your proposal patch:
> > -> process-A can't use hot-added memory
> >
>
> IMHO, the application itseld should be notifed to change its mempolicy by
> hot-plug script on the host. While an application uses interleave, a new node
> hot-added is just a noise. I think "How pages are interleaved" should not be
> changed implicitly. Then, checking at set_mempolicy() seems sane. If notified,
> application can do page migration and rebuild his mapping in ideal way.

Do you really want an ABI change?



> BUT I don't linke init->mem_allowed contains N_POSSIBLE...it should be initialized
> to N_HIGH_MEMORY, IMHO.



2009-07-17 02:41:04

by Kamezawa Hiroyuki

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Fri, 17 Jul 2009 11:07:09 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > On Fri, 17 Jul 2009 09:04:46 +0900 (JST)
> > KOSAKI Motohiro <[email protected]> wrote:
> >
> > > > On Wed, 15 Jul 2009, Lee Schermerhorn wrote:
> > > >
> > > > > Interestingly, on ia64, the top cpuset mems_allowed gets set to all
> > > > > possible nodes, while on x86_64, it gets set to on-line nodes [or nodes
> > > > > with memory]. Maybe this is a to support hot-plug?
> > > > >
> > > >
> > > > numactl --interleave=all simply passes a nodemask with all bits set, so if
> > > > cpuset_current_mems_allowed includes offline nodes from node_possible_map,
> > > > then mpol_set_nodemask() doesn't mask them off.
> > > >
> > > > Seems like we could handle this strictly in mempolicies without worrying
> > > > about top_cpuset like in the following?
> > >
> > > This patch seems band-aid patch. it will change memory-hotplug behavior.
> > > Please imazine following scenario:
> > >
> > > 1. numactl interleave=all process-A
> > > 2. memory hot-add
> > >
> > > before 2.6.30:
> > > -> process-A can use hot-added memory
> > >
> > > your proposal patch:
> > > -> process-A can't use hot-added memory
> > >
> >
> > IMHO, the application itseld should be notifed to change its mempolicy by
> > hot-plug script on the host. While an application uses interleave, a new node
> > hot-added is just a noise. I think "How pages are interleaved" should not be
> > changed implicitly. Then, checking at set_mempolicy() seems sane. If notified,
> > application can do page migration and rebuild his mapping in ideal way.
>
> Do you really want ABI change?
>
No ;_

Hmm, IIUC, the current handling of a mempolicy's nodemask is as below.
There are three masks:
- the system's N_HIGH_MEMORY
- the mask the user specified via mempolicy() (remembered only when MPOL_F_RELATIVE_NODES is set)
- the cpuset's one

And pol->v.nodes is just a _cache_ of the logical AND of the above.
Synchronization with cpusets is guaranteed by the cpuset's generation.
Synchronization with N_HIGH_MEMORY should be guaranteed by a memory hotplug
notifier, but this is not implemented yet.

Then, what I can tell here is...
- remember what the user requested (only when MPOL_F_RELATIVE_NODES?)
- add notifiers for memory hot-add (only when MPOL_F_RELATIVE_NODES?)
- add notifiers for memory hot-remove (both MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES?)

IMHO, for cpusets, not recalculating v.nodes when MPOL_F_STATIC_NODES is set is good.
But for N_HIGH_MEMORY, v.nodes should be recalculated even if MPOL_F_STATIC_NODES is set.

Then, I think the mask the user passed should be remembered even if MPOL_F_STATIC_NODES is
set, and v.nodes should work as a cache and be updated in an appropriate way.

Thanks,
-Kame

2009-07-17 09:01:55

by David Rientjes

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Fri, 17 Jul 2009, KOSAKI Motohiro wrote:

> This patch seems band-aid patch. it will change memory-hotplug behavior.
> Please imazine following scenario:
>
> 1. numactl interleave=all process-A
> 2. memory hot-add
>
> before 2.6.30:
> -> process-A can use hot-added memory
>

That apparently depends on the architecture since Lee said top_cpuset
reflects node_possible_map on ia64 and node_online_map on x86_64.

2009-07-17 09:09:39

by David Rientjes

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Fri, 17 Jul 2009, KAMEZAWA Hiroyuki wrote:

> IMHO, the application itseld should be notifed to change its mempolicy by
> hot-plug script on the host. While an application uses interleave, a new node
> hot-added is just a noise. I think "How pages are interleaved" should not be
> changed implicitly. Then, checking at set_mempolicy() seems sane. If notified,
> application can do page migration and rebuild his mapping in ideal way.
>

Agreed, it doesn't seem helpful to add a node to MPOL_INTERLEAVE for
memory hotplug; the same is probably true of MPOL_BIND as well since the
application will never want to unknowingly expand its set of allowed
nodes. There's no case where MPOL_PREFERRED would want to implicitly
switch to the added node.

I'm not convinced that we have to support hot-add in existing mempolicies
at all.

> BUT I don't linke init->mem_allowed contains N_POSSIBLE...it should be initialized
> to N_HIGH_MEMORY, IMHO.
>

It doesn't matter for the page allocator, since zonelists will never
include zones from nodes that aren't online, so the underlying bug here
does seem to be the behavior of ->mems_allowed, and we're probably only
triggering it via mempolicies. cpuset_track_online_nodes() should be
keeping top_cpuset's mems_allowed consistent with N_HIGH_MEMORY, and all
descendants must have a subset of top_cpuset's nodes, though.

2009-07-24 22:52:04

by David Rientjes

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Thu, 16 Jul 2009, David Rientjes wrote:

> numactl --interleave=all simply passes a nodemask with all bits set, so if
> cpuset_current_mems_allowed includes offline nodes from node_possible_map,
> then mpol_set_nodemask() doesn't mask them off.
>
> Seems like we could handle this strictly in mempolicies without worrying
> about top_cpuset like in the following?
> ---
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -194,6 +194,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> {
> nodemask_t cpuset_context_nmask;
> + nodemask_t mems_allowed;
> int ret;
>
> /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
> @@ -201,20 +202,21 @@ static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> return 0;
>
> VM_BUG_ON(!nodes);
> + nodes_and(mems_allowed, cpuset_current_mems_allowed,
> + node_states[N_HIGH_MEMORY]);
> if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
> nodes = NULL; /* explicit local allocation */
> else {
> if (pol->flags & MPOL_F_RELATIVE_NODES)
> mpol_relative_nodemask(&cpuset_context_nmask, nodes,
> - &cpuset_current_mems_allowed);
> + &mems_allowed);
> else
> nodes_and(cpuset_context_nmask, *nodes,
> - cpuset_current_mems_allowed);
> + mems_allowed);
> if (mpol_store_user_nodemask(pol))
> pol->w.user_nodemask = *nodes;
> else
> - pol->w.cpuset_mems_allowed =
> - cpuset_current_mems_allowed;
> + pol->w.cpuset_mems_allowed = mems_allowed;
> }
>
> ret = mpol_ops[pol->mode].create(pol,
>

Should this patch be added to 2.6.31-rc4 to prevent the kernel panic while
hotplug notifiers are being added to mempolicies?

2009-07-24 23:10:23

by Andrew Morton

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On Fri, 24 Jul 2009 15:51:51 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Thu, 16 Jul 2009, David Rientjes wrote:
>
> > numactl --interleave=all simply passes a nodemask with all bits set, so if
> > cpuset_current_mems_allowed includes offline nodes from node_possible_map,
> > then mpol_set_nodemask() doesn't mask them off.
> >
> > Seems like we could handle this strictly in mempolicies without worrying
> > about top_cpuset like in the following?
> > ---
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -194,6 +194,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> > static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> > {
> > nodemask_t cpuset_context_nmask;
> > + nodemask_t mems_allowed;
> > int ret;
> >
> > /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
> > @@ -201,20 +202,21 @@ static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> > return 0;
> >
> > VM_BUG_ON(!nodes);
> > + nodes_and(mems_allowed, cpuset_current_mems_allowed,
> > + node_states[N_HIGH_MEMORY]);
> > if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
> > nodes = NULL; /* explicit local allocation */
> > else {
> > if (pol->flags & MPOL_F_RELATIVE_NODES)
> > mpol_relative_nodemask(&cpuset_context_nmask, nodes,
> > - &cpuset_current_mems_allowed);
> > + &mems_allowed);
> > else
> > nodes_and(cpuset_context_nmask, *nodes,
> > - cpuset_current_mems_allowed);
> > + mems_allowed);
> > if (mpol_store_user_nodemask(pol))
> > pol->w.user_nodemask = *nodes;
> > else
> > - pol->w.cpuset_mems_allowed =
> > - cpuset_current_mems_allowed;
> > + pol->w.cpuset_mems_allowed = mems_allowed;
> > }
> >
> > ret = mpol_ops[pol->mode].create(pol,
> >
>
> Should this patch be added to 2.6.31-rc4 to prevent the kernel panic while
> hotplug notifiers are being added to mempolicies?

afaik we don't have a final patch for this. I asked Motohiro-san about
this and he's proposing that we revert the offending change (which one
was it?) if nothing gets fixed soon - the original author is on a
lengthy vacation.


If we _do_ have a patch then can we start again? Someone send out the patch
and let's take a look at it.

2009-07-25 01:33:45

by Kamezawa Hiroyuki

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

Andrew Morton wrote:
> On Fri, 24 Jul 2009 15:51:51 -0700 (PDT)
> David Rientjes <[email protected]> wrote:

> afaik we don't have a final patch for this. I asked Motohiro-san about
> this and he's proposing that we revert the offending change (which one
> was it?) if nothing gets fixed soon - the original author is on a
> lengthy vacation.
>
>
> If we _do_ have a patch then can we start again? Someone send out the patch
> and let's take a look at it.
Hmm, like this? (I cleaned up David's patch because we shouldn't have an
extra nodemask_t on the stack.)

Problems are:
- rebind() is maybe broken, but I have no good idea
(it seems to be broken in old kernels, too)
- only a user who has possible (offline) nodes in the SRAT can test this

==
From: KAMEZAWA Hiroyuki <[email protected]>

When setting a mempolicy's nodemask (or node id), we need to guarantee
the node id is online. But the cpuset's nodemask may contain not-online
(possible) nodes, which can cause an access to NODE_DATA(nid) of a
not-online node.

This patch fixes the mempolicy's nodemask to be a subset of the valid
nodes (N_HIGH_MEMORY).

But there are 2 cases for setting a policy's mask:
- new
- rebind
The difficult case is rebind. In this patch, if the relationship between
the new cpuset's nodemask and the policy's mask is invalid, just use the
cpuset's mask.

Based on David Rientjes's patch.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/mempolicy.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)

Index: mmotm-2.6.31-Jul16/mm/mempolicy.c
===================================================================
--- mmotm-2.6.31-Jul16.orig/mm/mempolicy.c
+++ mmotm-2.6.31-Jul16/mm/mempolicy.c
@@ -204,12 +204,22 @@ static int mpol_set_nodemask(struct memp
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
nodes = NULL; /* explicit local allocation */
else {
+ /*
+ * Here, we mask this new nodemask with N_HIGH_MEMORY.
+ * An issue is memory hotplug. Now, at hot-add, we don't
+ * update, this. This should be fixed. At hot-remove, we don't
+ * remove pgdat itself, then, we should update this but
+ * we'll never see terrible bugs. Leaving it as it is, now.
+ */
+ nodes_and(cpuset_context_mask, &cpuset_current_mems_allowed,
+ node_states[N_HIGH_MEMORY]);
+ /* should we call is_valid_nodemask() here ?*/
if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&cpuset_context_nmask, nodes,
- &cpuset_current_mems_allowed);
+ &cpuset_context_nmask);
else
nodes_and(cpuset_context_nmask, *nodes,
- cpuset_current_mems_allowed);
+ cpuset_context_nmask);
if (mpol_store_user_nodemask(pol))
pol->w.user_nodemask = *nodes;
else
@@ -290,7 +300,16 @@ static void mpol_rebind_nodemask(struct
*nodes);
pol->w.cpuset_mems_allowed = *nodes;
}
-
+ /*
+ * At rebind, passed *nodes is guaranteed to online, but..calculated
+ * nodemask can be empty or invalid. print WARNING and use cpuset's
+ * mask
+ */
+ if (nodes_empty(tmp) ||
+ (pol->mode == MPOL_BIND && !is_valid_nodemask(tmp))) {
+ tmp = *nodes;
+ printk("relation amoung cpuset/mempolicy goes bad.\n");
+ }
pol->v.nodes = tmp;
if (!node_isset(current->il_next, tmp)) {
current->il_next = next_node(current->il_next, tmp);






2009-07-25 02:35:59

by Kamezawa Hiroyuki

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

KAMEZAWA Hiroyuki wrote:
> Andrew Morton wrote:
>> On Fri, 24 Jul 2009 15:51:51 -0700 (PDT)
>> David Rientjes <[email protected]> wrote:
>
>> afaik we don't have a final patch for this. I asked Motohiro-san about
>> this and he's proposing that we revert the offending change (which one
>> was it?) if nothing gets fixed soon - the original author is on a
>> lengthy vacation.
>>
>>
>> If we _do_ have a patch then can we start again? Someone send out the patch
>> and let's take a look at it.
> Hmm, like this ? (cleaned up David's one because we shouldn't have
> extra nodemask_t on stack.)
>
> Problems are
> - rebind() is maybe broken but no good idea.
> (but it seems to be broken in old kernels
> - Who can test this is only a user who has possible node on SRAT.
>

> + /* should we call is_valid_nodemask() here ?*/
> if (pol->flags & MPOL_F_RELATIVE_NODES)
> mpol_relative_nodemask(&cpuset_context_nmask, nodes,
> - &cpuset_current_mems_allowed);
> + &cpuset_context_nmask);
Sorry, this part is buggy.

But to fix it, we would need an extra nodemask here, and this patch would
then allocate 3 nodemasks on the stack!
So here is a much easier fix, trusting cpuset more.

==
From: KAMEZAWA Hiroyuki <[email protected]>

task->mems_allowed is guaranteed to include only valid nodes when it is
set under cpuset, but at init, all possible nodes are included. Fix it.

And at cpuset rebind, the calculated result can be an invalid one.
In that case, trust the cpuset's mask.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
init/main.c | 4 ++--
mm/mempolicy.c | 7 ++++++-
2 files changed, 8 insertions(+), 3 deletions(-)

Index: mmotm-2.6.31-Jul16/init/main.c
===================================================================
--- mmotm-2.6.31-Jul16.orig/init/main.c
+++ mmotm-2.6.31-Jul16/init/main.c
@@ -855,9 +855,9 @@ static int __init kernel_init(void * unu
lock_kernel();

/*
- * init can allocate pages on any node
+ * init can allocate pages on any online node
*/
- set_mems_allowed(node_possible_map);
+ set_mems_allowed(node_states[N_HIGH_MEMORY]);
/*
* init can run on any cpu.
*/
Index: mmotm-2.6.31-Jul16/mm/mempolicy.c
===================================================================
--- mmotm-2.6.31-Jul16.orig/mm/mempolicy.c
+++ mmotm-2.6.31-Jul16/mm/mempolicy.c
@@ -290,7 +290,12 @@ static void mpol_rebind_nodemask(struct
*nodes);
pol->w.cpuset_mems_allowed = *nodes;
}
-
+ /*
+ * tmp can be invalid ...just use cpuset's one in that case.
+ */
+ if (nodes_empty(tmp) ||
+ ((pol->mode == MPOL_BIND) && !is_valid_nodemask(&tmp)))
+ tmp = *nodes;
pol->v.nodes = tmp;
if (!node_isset(current->il_next, tmp)) {
current->il_next = next_node(current->il_next, tmp);




2009-07-25 03:15:14

by Kamezawa Hiroyuki

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

KAMEZAWA Hiroyuki wrote:
> KAMEZAWA Hiroyuki wrote:
> Then, here is a much easier fix. for trusting cpuset more.
>
Just a memo about memory hotplug.

_Direct_ use of task->mems_allowed happens only in cpuset and mempolicy.
If no policy is used, it's not checked.
(See alloc_pages_current().)

Memory hotplug's notifier just updates top_cpuset's mems_allowed.
But it doesn't update each task's mask.
Then, a task's behavior is:

- tasks which don't use mempolicy will use all nodes, N_HIGH_MEMORY.
- tasks under a cpuset will be controlled by their own cpuset.
- tasks under a mempolicy will use their own policy,
but no new policy is recalculated, and so there is no new mask.

Now, even if all memory on a node is removed, the pgdat just remains.
Then, cpuset/mempolicy will never access a NODE_DATA(nid) that is NULL.

Thanks,
-Kame

2009-07-25 13:21:32

by KOSAKI Motohiro

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

On 2009/07/25 12:15, KAMEZAWA Hiroyuki <[email protected]> wrote:
> KAMEZAWA Hiroyuki wrote:
>> KAMEZAWA Hiroyuki wrote:
>> Then, here is a much easier fix. for trusting cpuset more.
>>
> just a memo about memory hotplug
>
> _Direct_ use of task->mems_allowed is only in cpuset and mempolicy.
> If no policy is used, it's not checked.
> (See alloc_pages_current())
>
> memory hotplug's notifier just updates top_cpuset's mems_allowed.
> But it doesn't update each task's ones.
> Then, task's bahavior is
>
> - tasks which don't use mempolicy will use all nodes, N_HIGH_MEMORY.
> - tasks under cpuset will be controlled under their own cpuset.
> - tasks under mempolicy will use their own policy.
> but no new policy is re-calculated and, then, no new mask.
>
> Now, even if all memory on nodes a removed, pgdat just remains.
> Then, cpuset/mempolicy will never access NODE_DATA(nid) which is NULL.

Umm...
I don't think this is optimal behavior, but if the hotplug guys agree
with this, I agree too.

Thanks.

2009-07-25 14:40:20

by Kamezawa Hiroyuki

Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAVE) causes kernel panic

KOSAKI Motohiro wrote:
> On 2009/07/25 12:15, KAMEZAWA Hiroyuki <[email protected]> wrote:
>> KAMEZAWA Hiroyuki wrote:
>>> KAMEZAWA Hiroyuki wrote:
>>> Then, here is a much easier fix. for trusting cpuset more.
>>>
>> just a memo about memory hotplug
>>
>> _Direct_ use of task->mems_allowed is only in cpuset and mempolicy.
>> If no policy is used, it's not checked.
>> (See alloc_pages_current())
>>
>> memory hotplug's notifier just updates top_cpuset's mems_allowed,
>> but it doesn't update each task's own.
>> Then, each task's behavior is:
>>
>> - tasks which don't use mempolicy will use all nodes, N_HIGH_MEMORY.
>> - tasks under a cpuset will be controlled by their own cpuset.
>> - tasks under a mempolicy will use their own policy,
>>   but no new policy is recalculated, and thus no new mask.
>>
>> Now, even if all memory on a node is removed, its pgdat just remains.
>> Then, cpuset/mempolicy will never access a NODE_DATA(nid) which is NULL.
>
> Umm..
> I don't think this is optimal behavior, but if the hotplug folks agree
> with this, I agree too.
>
This behavior itself is not very bad,
and the whole hotplug thing is just a side story of this bugfix.


To update the nodemask, the user's mask should be saved in the policy
even when the mask is not relative, and v.node should be recalculated
at the event. IIUC, rather than a per-policy update by notifier,
some new implementation for the policy will be necessary.

If you are referring to the fact that NODE_DATA(nid) is not removed
at node removal, I have no idea for now. The copied zonelist is a problem.


Thanks,
-Kame


2009-07-27 17:56:02

by David Rientjes

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Sat, 25 Jul 2009, KAMEZAWA Hiroyuki wrote:

> _Direct_ use of task->mems_allowed is only in cpuset and mempolicy.
> If no policy is used, it's not checked.
> (See alloc_pages_current())
>
> memory hotplug's notifier just updates top_cpuset's mems_allowed,
> but it doesn't update each task's own.

That's not true, cpuset_track_online_nodes() will call
scan_for_empty_cpusets() on top_cpuset, which works from the root to
leaves updating each cpuset's mems_allowed by intersecting it with
node_states[N_HIGH_MEMORY]. This is done as part of the MEM_OFFLINE
callback in the cpuset code, so N_HIGH_MEMORY represents the nodes still
online.

The nodemask for each task is updated to reflect the removal of a node and
it calls mpol_rebind_mm() with the new nodemask.

This is admittedly pretty late to be removing mems from cpusets (and
mempolicies) when the unplug has already happened. We should look at
doing the rebind for MEM_GOING_OFFLINE.

2009-07-27 18:00:59

by David Rientjes

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Sat, 25 Jul 2009, KAMEZAWA Hiroyuki wrote:

> This behavior itself is not very bad,
> and the whole hotplug thing is just a side story of this bugfix.
>

Right, the original problem that Lee reported doesn't appear to be caused
by hotplug.

> To update the nodemask, the user's mask should be saved in the policy
> even when the mask is not relative, and v.node should be recalculated
> at the event. IIUC, rather than a per-policy update by notifier,
> some new implementation for the policy will be necessary.
>

We don't need additional non-default mempolicy support for MEM_ONLINE.
It would be inappropriate to store the user nodemask and then hot-add new
nodes to mempolicies based on the given node ids when nothing is assumed
about their proximity. It's better left to userspace to update existing
mempolicies to use the newly added memory.

2009-07-28 00:00:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Mon, 27 Jul 2009 10:55:47 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Sat, 25 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
> > _Direct_ use of task->mems_allowed is only in cpuset and mempolicy.
> > If no policy is used, it's not checked.
> > (See alloc_pages_current())
> >
> > memory hotplug's notifier just updates top_cpuset's mems_allowed,
> > but it doesn't update each task's own.
>
> That's not true, cpuset_track_online_nodes() will call
> scan_for_empty_cpusets() on top_cpuset, which works from the root to
> leaves updating each cpuset's mems_allowed by intersecting it with
> node_states[N_HIGH_MEMORY]. This is done as part of the MEM_OFFLINE
> callback in the cpuset code, so N_HIGH_MEMORY represents the nodes still
> online.
>
yes.

> The nodemask for each task is updated to reflect the removal of a node and
> it calls mpol_rebind_mm() with the new nodemask.
>
yes, but _not_ updated at online.

> This is admittedly pretty late to be removing mems from cpusets (and
> mempolicies) when the unplug has already happened. We should look at
> doing the rebind for MEM_GOING_OFFLINE.
>
Hm.

What I felt reading cpuset/mempolicy again is that it's too complex ;)
The first question is why mems_allowed, which can be 1024 bytes when max_node=4096,
is copied per task....
And the mempolicy code uses too many nodemask_t on the stack.

I'll try some, today, including this bug-fix.

Thanks,
-Kame

2009-07-28 00:14:42

by David Rientjes

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > The nodemask for each task is updated to reflect the removal of a node and
> > it calls mpol_rebind_mm() with the new nodemask.
> >
> yes, but _not_ updated at online.
>

Well, I disagreed that we needed to alter any pre-existing mempolicies for
MEM_GOING_ONLINE or MEM_ONLINE since it may diverge from the original
intent of the policy. MPOL_PREFERRED certainly shouldn't change,
MPOL_INTERLEAVE would be unbalanced, and MPOL_BIND could diverge from
memory isolation or affinity requirements.

I'd be interested to hear any real world use cases for MEM_ONLINE updating
of mempolicies.

> What I felt reading cpuset/mempolicy again is that it's too complex ;)
> The first question is why mems_allowed, which can be 1024 bytes when max_node=4096,
> is copied per task....

The page allocator needs lockless access to mems_allowed.

2009-07-28 00:27:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Mon, 27 Jul 2009 17:14:32 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
> > > The nodemask for each task is updated to reflect the removal of a node and
> > > it calls mpol_rebind_mm() with the new nodemask.
> > >
> > yes, but _not_ updated at online.
> >
>
> Well, I disagreed that we needed to alter any pre-existing mempolicies for
> MEM_GOING_ONLINE or MEM_ONLINE since it may diverge from the original
> intent of the policy. MPOL_PREFERRED certainly shouldn't change,
> MPOL_INTERLEAVE would be unbalanced, and MPOL_BIND could diverge from
> memory isolation or affinity requirements.
>
> I'd be interested to hear any real world use cases for MEM_ONLINE updating
> of mempolicies.
>
Sorry, I was a bit confused. I thought I was talking about task->mems_allowed,
not each policy.

Because we don't update it, task->mems_allowed needs to be initialized as
N_POSSIBLE_NODES. Normally, it should be N_HIGH_MEMORY or
N_ONLINE_NODES, as my patch does.

> > What I felt reading cpuset/mempolicy again is that it's too complex ;)
> > The first question is why mems_allowed, which can be 1024 bytes when max_node=4096,
> > is copied per task....
>
> The page allocator needs lockless access to mems_allowed.
>
Hmm, ok, I'll take care of that.

Thanks,
-Kame

2009-07-28 00:38:40

by David Rientjes

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:

> Because we don't update it, task->mems_allowed needs to be initialized as
> N_POSSIBLE_NODES. Normally, it should be N_HIGH_MEMORY or
> N_ONLINE_NODES, as my patch does.
>

On MEM_OFFLINE, cpusets calls scan_for_empty_cpusets() which will
intersect each system cpuset's mems_allowed with N_HIGH_MEMORY. It then
calls update_tasks_nodemask() which will update task->mems_allowed for
each task assigned to those cpusets. This has a callback into the
mempolicy code to rebind the policy with the new mems.

So there's no apparent issue with memory hotplug in dealing with cpuset
mems, although I suggested that this be done for MEM_GOING_OFFLINE instead
of waiting until the mem is actually offline.

The problem originally reported here doesn't appear to have anything to do
with hotplug, it looks like it is the result of Lee's observation that
ia64 defaults top_cpuset's mems to N_POSSIBLE, which _should_ have been
updated by cpuset_init_smp(). So it makes me believe that N_HIGH_MEMORY
isn't actually ready by the time do_basic_setup() is called to be useful.

2009-07-28 00:56:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Mon, 27 Jul 2009 17:38:30 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
> > Because we don't update it, task->mems_allowed needs to be initialized as
> > N_POSSIBLE_NODES. Normally, it should be N_HIGH_MEMORY or
> > N_ONLINE_NODES, as my patch does.
> >
>

> On MEM_OFFLINE, cpusets calls scan_for_empty_cpusets() which will
> intersect each system cpuset's mems_allowed with N_HIGH_MEMORY. It then
> calls update_tasks_nodemask() which will update task->mems_allowed for
> each task assigned to those cpusets. This has a callback into the
> mempolicy code to rebind the policy with the new mems.
>
> So there's no apparent issue with memory hotplug in dealing with cpuset
> mems, although I suggested that this be done for MEM_GOING_OFFLINE instead
> of waiting until the mem is actually offline.
>
I _wrote_ that this is just a side story to the bug.
Online/offline isn't related to this bug.

> The problem originally reported here doesn't appear to have anything to do
> with hotplug, it looks like it is the result of Lee's observation that
> ia64 defaults top_cpuset's mems to N_POSSIBLE, which _should_ have been
> updated by cpuset_init_smp().
cpuset_init_smp() just updates the cpuset's mask.
init's task->mems_allowed is initialized independently from the cpuset's mask.

Could you give me a pointer to Lee's observation?

> So it makes me believe that N_HIGH_MEMORY
> isn't actually ready by the time do_basic_setup() is called to be useful.
>
N_HIGH_MEMORY should be ready when the zonelist is built. If not, it's a bug.


Thanks,
-Kame

2009-07-28 01:03:02

by David Rientjes

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:

> > The problem originally reported here doesn't appear to have anything to do
> > with hotplug, it looks like it is the result of Lee's observation that
> > ia64 defaults top_cpuset's mems to N_POSSIBLE, which _should_ have been
> > updated by cpuset_init_smp().
> cpuset_init_smp() just updates the cpuset's mask.
> init's task->mems_allowed is initialized independently from the cpuset's mask.
>

Presumably the bug is that N_HIGH_MEMORY is not a subset of N_ONLINE at
this point on ia64.

> Could you give me a pointer to Lee's observation?
>

http://marc.info/?l=linux-kernel&m=124767909310293

2009-07-28 01:13:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Mon, 27 Jul 2009 18:02:50 -0700 (PDT)
David Rientjes <[email protected]> wrote:

> On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:
>
> > > The problem originally reported here doesn't appear to have anything to do
> > > with hotplug, it looks like it is the result of Lee's observation that
> > > ia64 defaults top_cpuset's mems to N_POSSIBLE, which _should_ have been
> > > updated by cpuset_init_smp().
> > cpuset_init_smp() just updates the cpuset's mask.
> > init's task->mems_allowed is initialized independently from the cpuset's mask.
> >
>
> Presumably the bug is that N_HIGH_MEMORY is not a subset of N_ONLINE at
> this point on ia64.
>
N_HIGH_MEMORY must be a subset of N_ONLINE at any moment. Hmm,
I'll look into what happens in the ia64 world.

> > Could you give me a pointer to Lee's observation?
> >
>
> http://marc.info/?l=linux-kernel&m=124767909310293
>
Aha, my patch is very similar to Lee's.
And he said "hotplug is an issue"; yes, it is.


Thanks,
-Kame

2009-07-28 01:26:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUG] set_mempolicy(MPOL_INTERLEAV) cause kernel panic

On Tue, 28 Jul 2009 10:11:57 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 27 Jul 2009 18:02:50 -0700 (PDT)
> David Rientjes <[email protected]> wrote:
>
> > On Tue, 28 Jul 2009, KAMEZAWA Hiroyuki wrote:
> >
> > > > The problem originally reported here doesn't appear to have anything to do
> > > > with hotplug, it looks like it is the result of Lee's observation that
> > > > ia64 defaults top_cpuset's mems to N_POSSIBLE, which _should_ have been
> > > > updated by cpuset_init_smp().
> > > cpuset_init_smp() just updates the cpuset's mask.
> > > init's task->mems_allowed is initialized independently from the cpuset's mask.
> > >
> >
> > Presumably the bug is that N_HIGH_MEMORY is not a subset of N_ONLINE at
> > this point on ia64.
> >
> N_HIGH_MEMORY must be a subset of N_ONLINE at any moment. Hmm,
> I'll look into what happens in the ia64 world.
>

At a quick look, N_HIGH_MEMORY is set in these places during init. (I ignore hotplug for now.)

== before init==
60 #ifndef CONFIG_NUMA
61 [N_NORMAL_MEMORY] = { { [0] = 1UL } },
62 #ifdef CONFIG_HIGHMEM
63 [N_HIGH_MEMORY] = { { [0] = 1UL } },
64 #endif
==
3860 static unsigned long __init early_calculate_totalpages(void)
3861 {
....
3869 if (pages)
3870 node_set_state(early_node_map[i].nid, N_HIGH_MEMORY);
3871 }
==
4041 void __init free_area_init_nodes(unsigned long *max_zone_pfn)
4042 {
4105 if (pgdat->node_present_pages)
4106 node_set_state(nid, N_HIGH_MEMORY);
==


All of these are done during mem_init(). So if N_HIGH_MEMORY is not correct at
kernel_init(), it's a bug.

I think what Lee and Miao pointed out is just a hotplug problem.
Ok, I'll try a patch, but it'll take some hours.

Thanks,
-Kame

2009-07-28 07:20:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [BUGFIX] set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware

Tested on x86-64/fake NUMA and ia64/NUMA.
(That ia64 machine is the host the original bug report used.)

Maybe this is a bigger patch than expected, but NODEMASK_ALLOC() will be the way
to go anyway. (Even though CPUMASK_ALLOC is not used anywhere yet..)
Kosaki tested this on ia64 NUMA, thanks.

I'll consider a more fundamental fix to tsk->mems_allowed, but this patch
is enough as a fix for now, I think.

From: KAMEZAWA Hiroyuki <[email protected]>

mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and a policy's nodemask
should include only online nodes.
In the old behavior, this was guaranteed by frequent references into cpuset code.
Now, most of those are removed and mempolicy has to check it by itself.

To do the check, a few nodemask_t are needed for calculating the nodemask. But
a nodemask_t can be big, and it's not good to allocate them on the stack.

Now, cpumask_t has CPUMASK_ALLOC/FREE, an easy way to get a scratch area.
NODEMASK_ALLOC/FREE should exist as well.

Tested-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/nodemask.h | 31 +++++++++++++++++
mm/mempolicy.c | 82 ++++++++++++++++++++++++++++++++---------------
2 files changed, 87 insertions(+), 26 deletions(-)

Index: task-mems-allowed-fix/include/linux/nodemask.h
===================================================================
--- task-mems-allowed-fix.orig/include/linux/nodemask.h
+++ task-mems-allowed-fix/include/linux/nodemask.h
@@ -82,6 +82,13 @@
* to generate slightly worse code. So use a simple one-line #define
* for node_isset(), instead of wrapping an inline inside a macro, the
* way we do the other calls.
+ *
+ * NODEMASK_SCRATCH
+ * For doing above logical AND, OR, XOR, Remap, etc...the caller tend to be
+ * necessary to use temporal nodemask_t on stack. But if NODES_SHIFT is large,
+ * size of nodemask_t can be very big and not suitable for allocating in stack.
+ * NODEMASK_SCRATCH is a helper for such situaions. See below and CPUMASK_ALLOC
+ * also.
*/

#include <linux/kernel.h>
@@ -473,4 +480,28 @@ static inline int num_node_state(enum no
#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)

+/*
+ * For nodemask scrach area.(See CPUMASK_ALLOC() in cpumask.h)
+ */
+
+#if NODES_SHIFT > 8 /* nodemask_t > 64 bytes */
+#define NODEMASK_ALLOC(x, m) struct x *m = kmalloc(sizeof(*m), GFP_KERNEL)
+#define NODEMASK_FREE(m) kfree(m)
+#else
+#define NODEMASK_ALLOC(x, m) struct x _m, *m = &_m
+#define NODEMASK_FREE(m)
+#endif
+
+#define NODEMASK_POINTER(v, m) nodemask_t *v = &(m->v)
+
+/* A example struture for using NODEMASK_ALLOC, used in mempolicy. */
+struct nodemask_scratch {
+ nodemask_t mask1;
+ nodemask_t mask2;
+};
+
+#define NODEMASK_SCRATCH(x) NODEMASK_ALLOC(nodemask_scratch, x)
+#define NODEMASK_SCRATCH_FREE(x) NODEMASK_FREE(x)
+
+
#endif /* __LINUX_NODEMASK_H */
Index: task-mems-allowed-fix/mm/mempolicy.c
===================================================================
--- task-mems-allowed-fix.orig/mm/mempolicy.c
+++ task-mems-allowed-fix/mm/mempolicy.c
@@ -191,25 +191,27 @@ static int mpol_new_bind(struct mempolic
* Must be called holding task's alloc_lock to protect task's mems_allowed
* and mempolicy. May also be called holding the mmap_semaphore for write.
*/
-static int mpol_set_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
+static int mpol_set_nodemask(struct mempolicy *pol,
+ const nodemask_t *nodes, struct nodemask_scratch *nsc)
{
- nodemask_t cpuset_context_nmask;
int ret;

/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
if (pol == NULL)
return 0;
+ /* Check N_HIGH_MEMORY */
+ nodes_and(nsc->mask1,
+ cpuset_current_mems_allowed, node_states[N_HIGH_MEMORY]);

VM_BUG_ON(!nodes);
if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
nodes = NULL; /* explicit local allocation */
else {
if (pol->flags & MPOL_F_RELATIVE_NODES)
- mpol_relative_nodemask(&cpuset_context_nmask, nodes,
- &cpuset_current_mems_allowed);
+ mpol_relative_nodemask(&nsc->mask2, nodes,&nsc->mask1);
else
- nodes_and(cpuset_context_nmask, *nodes,
- cpuset_current_mems_allowed);
+ nodes_and(nsc->mask2, *nodes, nsc->mask1);
+
if (mpol_store_user_nodemask(pol))
pol->w.user_nodemask = *nodes;
else
@@ -217,8 +219,10 @@ static int mpol_set_nodemask(struct memp
cpuset_current_mems_allowed;
}

- ret = mpol_ops[pol->mode].create(pol,
- nodes ? &cpuset_context_nmask : NULL);
+ if (nodes)
+ ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
+ else
+ ret = mpol_ops[pol->mode].create(pol, NULL);
return ret;
}

@@ -620,12 +624,17 @@ static long do_set_mempolicy(unsigned sh
{
struct mempolicy *new, *old;
struct mm_struct *mm = current->mm;
+ NODEMASK_SCRATCH(scratch);
int ret;

- new = mpol_new(mode, flags, nodes);
- if (IS_ERR(new))
- return PTR_ERR(new);
+ if (!scratch)
+ return -ENOMEM;

+ new = mpol_new(mode, flags, nodes);
+ if (IS_ERR(new)) {
+ ret = PTR_ERR(new);
+ goto out;
+ }
/*
* prevent changing our mempolicy while show_numa_maps()
* is using it.
@@ -635,13 +644,13 @@ static long do_set_mempolicy(unsigned sh
if (mm)
down_write(&mm->mmap_sem);
task_lock(current);
- ret = mpol_set_nodemask(new, nodes);
+ ret = mpol_set_nodemask(new, nodes, scratch);
if (ret) {
task_unlock(current);
if (mm)
up_write(&mm->mmap_sem);
mpol_put(new);
- return ret;
+ goto out;
}
old = current->mempolicy;
current->mempolicy = new;
@@ -654,7 +663,10 @@ static long do_set_mempolicy(unsigned sh
up_write(&mm->mmap_sem);

mpol_put(old);
- return 0;
+ ret = 0;
+out:
+ NODEMASK_SCRATCH_FREE(scratch);
+ return ret;
}

/*
@@ -1014,10 +1026,17 @@ static long do_mbind(unsigned long start
if (err)
return err;
}
- down_write(&mm->mmap_sem);
- task_lock(current);
- err = mpol_set_nodemask(new, nmask);
- task_unlock(current);
+ {
+ NODEMASK_SCRATCH(scratch);
+ if (scratch) {
+ down_write(&mm->mmap_sem);
+ task_lock(current);
+ err = mpol_set_nodemask(new, nmask, scratch);
+ task_unlock(current);
+ } else
+ err = -ENOMEM;
+ NODEMASK_SCRATCH_FREE(scratch);
+ }
if (err) {
up_write(&mm->mmap_sem);
mpol_put(new);
@@ -1891,10 +1910,12 @@ restart:
* Install non-NULL @mpol in inode's shared policy rb-tree.
* On entry, the current task has a reference on a non-NULL @mpol.
* This must be released on exit.
+ * This is called at get_inode() calls and we can use GFP_KERNEL.
*/
void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
{
int ret;
+ NODEMASK_SCRATCH(scratch);

sp->root = RB_ROOT; /* empty tree == default mempolicy */
spin_lock_init(&sp->lock);
@@ -1902,19 +1923,22 @@ void mpol_shared_policy_init(struct shar
if (mpol) {
struct vm_area_struct pvma;
struct mempolicy *new;
-
+ if (!scratch)
+ return;
/* contextualize the tmpfs mount point mempolicy */
new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
if (IS_ERR(new)) {
mpol_put(mpol); /* drop our ref on sb mpol */
+ NODEMASK_SCRATCH_FREE(scratch);
return; /* no valid nodemask intersection */
}

task_lock(current);
- ret = mpol_set_nodemask(new, &mpol->w.user_nodemask);
+ ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
task_unlock(current);
mpol_put(mpol); /* drop our ref on sb mpol */
if (ret) {
+ NODEMASK_SCRATCH_FREE(scratch);
mpol_put(new);
return;
}
@@ -1925,6 +1949,7 @@ void mpol_shared_policy_init(struct shar
mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
mpol_put(new); /* drop initial ref */
}
+ NODEMASK_SCRATCH_FREE(scratch);
}

int mpol_set_shared_policy(struct shared_policy *info,
@@ -2140,13 +2165,18 @@ int mpol_parse_str(char *str, struct mem
err = 1;
else {
int ret;
-
- task_lock(current);
- ret = mpol_set_nodemask(new, &nodes);
- task_unlock(current);
- if (ret)
+ NODEMASK_SCRATCH(scratch);
+ if (scratch) {
+ task_lock(current);
+ ret = mpol_set_nodemask(new, &nodes, scratch);
+ task_unlock(current);
+ } else
+ ret = -ENOMEM;
+ NODEMASK_SCRATCH_FREE(scratch);
+ if (ret) {
err = 1;
- else if (no_context) {
+ mpol_put(new);
+ } else if (no_context) {
/* save for contextualization */
new->w.user_nodemask = nodes;
}

2009-07-28 08:52:15

by Andi Kleen

[permalink] [raw]
Subject: Re: [BUGFIX] set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware

KAMEZAWA Hiroyuki <[email protected]> writes:

> Tested on x86-64/fake NUMA and ia64/NUMA.
> (That ia64 machine is the host the original bug report used.)
>
> Maybe this is a bigger patch than expected, but NODEMASK_ALLOC() will be the way
> to go anyway. (Even though CPUMASK_ALLOC is not used anywhere yet..)
> Kosaki tested this on ia64 NUMA, thanks.

Note that low/high memory support in NUMA policy is only partial
anyways, e.g. the VMA based policy only supports a single zone. That
was by design choice and because NUMA has a lot of issues on 32bit due
to the limited address space and is not widely used.

So small fixes are ok but I wouldn't go to large effort to fix NUMA
policy on 32bit.

-Andi

--
[email protected] -- Speaking for myself only.

2009-07-28 10:04:16

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX] set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware

Andi Kleen wrote:
> KAMEZAWA Hiroyuki <[email protected]> writes:
>
>> Tested on x86-64/fake NUMA and ia64/NUMA.
>> (That ia64 machine is the host the original bug report used.)
>>
>> Maybe this is a bigger patch than expected, but NODEMASK_ALLOC() will be
>> the way to go anyway. (Even though CPUMASK_ALLOC is not used anywhere yet..)
>> Kosaki tested this on ia64 NUMA, thanks.
>
> Note that low/high memory support in NUMA policy is only partial
> anyways, e.g. the VMA based policy only supports a single zone. That
> was by design choice and because NUMA has a lot of issues on 32bit due
> to the limited address space and is not widely used.
>
> So small fixes are ok but I wouldn't go to large effort to fix NUMA
> policy on 32bit.
>
Ya, maybe you're referring to something related to policy_zone.
It's checked only in the bind code now (since before 2.6.30..).
The bug was accessing NODE_DATA(nid), which is NULL.
(Not every possible node has a pgdat.)

Thanks,
-Kame

2009-07-29 20:16:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUGFIX] set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware

On Tue, 28 Jul 2009 16:18:13 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Tested on x86-64/fake NUMA and ia64/NUMA.
> (That ia64 machine is the host the original bug report used.)

There's no description here of this bug.

Does this patch actually fix a bug? Seems not. Confusing.

> Maybe this is a bigger patch than expected, but NODEMASK_ALLOC() will be the way
> to go anyway. (Even though CPUMASK_ALLOC is not used anywhere yet..)
> Kosaki tested this on ia64 NUMA, thanks.
>
> I'll consider a more fundamental fix to tsk->mems_allowed, but this patch
> is enough as a fix for now, I think.
>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and a policy's nodemask
> should include only online nodes.
> In the old behavior, this was guaranteed by frequent references into cpuset code.
> Now, most of those are removed and mempolicy has to check it by itself.
>
> To do the check, a few nodemask_t are needed for calculating the nodemask. But
> a nodemask_t can be big, and it's not good to allocate them on the stack.
>
> Now, cpumask_t has CPUMASK_ALLOC/FREE, an easy way to get a scratch area.
> NODEMASK_ALLOC/FREE should exist as well.
>
> Tested-by: KOSAKI Motohiro <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/nodemask.h | 31 +++++++++++++++++
> mm/mempolicy.c | 82 ++++++++++++++++++++++++++++++++---------------
> 2 files changed, 87 insertions(+), 26 deletions(-)
>
> Index: task-mems-allowed-fix/include/linux/nodemask.h
> ===================================================================
> --- task-mems-allowed-fix.orig/include/linux/nodemask.h
> +++ task-mems-allowed-fix/include/linux/nodemask.h
> @@ -82,6 +82,13 @@
> * to generate slightly worse code. So use a simple one-line #define
> * for node_isset(), instead of wrapping an inline inside a macro, the
> * way we do the other calls.
> + *
> + * NODEMASK_SCRATCH
> + * For doing above logical AND, OR, XOR, Remap, etc...the caller tend to be
> + * necessary to use temporal nodemask_t on stack. But if NODES_SHIFT is large,
> + * size of nodemask_t can be very big and not suitable for allocating in stack.
> + * NODEMASK_SCRATCH is a helper for such situaions. See below and CPUMASK_ALLOC
> + * also.
> */
>
> #include <linux/kernel.h>
> @@ -473,4 +480,28 @@ static inline int num_node_state(enum no
> #define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
> #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
>
> +/*
> + * For nodemask scrach area.(See CPUMASK_ALLOC() in cpumask.h)
> + */
> +
> +#if NODES_SHIFT > 8 /* nodemask_t > 64 bytes */
> +#define NODEMASK_ALLOC(x, m) struct x *m = kmalloc(sizeof(*m), GFP_KERNEL)
> +#define NODEMASK_FREE(m) kfree(m)
> +#else
> +#define NODEMASK_ALLOC(x, m) struct x _m, *m = &_m
> +#define NODEMASK_FREE(m)
> +#endif
> +
> +#define NODEMASK_POINTER(v, m) nodemask_t *v = &(m->v)
> +
> +/* A example struture for using NODEMASK_ALLOC, used in mempolicy. */
> +struct nodemask_scratch {
> + nodemask_t mask1;
> + nodemask_t mask2;
> +};
> +
> +#define NODEMASK_SCRATCH(x) NODEMASK_ALLOC(nodemask_scratch, x)
> +#define NODEMASK_SCRATCH_FREE(x) NODEMASK_FREE(x)

Ick. Ho hum. OK. Such is life.

NODEMASK_POINTER() has no callers and is undocumented and unobvious.
Can I delete it?

> void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> {
> int ret;
> + NODEMASK_SCRATCH(scratch);
>
> sp->root = RB_ROOT; /* empty tree == default mempolicy */
> spin_lock_init(&sp->lock);
> @@ -1902,19 +1923,22 @@ void mpol_shared_policy_init(struct shar
> if (mpol) {
> struct vm_area_struct pvma;
> struct mempolicy *new;
> -
> + if (!scratch)
> + return;
> /* contextualize the tmpfs mount point mempolicy */
> new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
> if (IS_ERR(new)) {
> mpol_put(mpol); /* drop our ref on sb mpol */
> + NODEMASK_SCRATCH_FREE(scratch);
> return; /* no valid nodemask intersection */
> }
>
> task_lock(current);
> - ret = mpol_set_nodemask(new, &mpol->w.user_nodemask);
> + ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
> task_unlock(current);
> mpol_put(mpol); /* drop our ref on sb mpol */
> if (ret) {
> + NODEMASK_SCRATCH_FREE(scratch);
> mpol_put(new);
> return;
> }
> @@ -1925,6 +1949,7 @@ void mpol_shared_policy_init(struct shar
> mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
> mpol_put(new); /* drop initial ref */
> }
> + NODEMASK_SCRATCH_FREE(scratch);
> }

This function does an unneeded kmalloc/kfree if mpol==NULL.


How's this look?

diff -puN include/linux/nodemask.h~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix include/linux/nodemask.h
--- a/include/linux/nodemask.h~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix
+++ a/include/linux/nodemask.h
@@ -84,11 +84,10 @@
* way we do the other calls.
*
* NODEMASK_SCRATCH
- * For doing above logical AND, OR, XOR, Remap, etc...the caller tend to be
- * necessary to use temporal nodemask_t on stack. But if NODES_SHIFT is large,
- * size of nodemask_t can be very big and not suitable for allocating in stack.
- * NODEMASK_SCRATCH is a helper for such situaions. See below and CPUMASK_ALLOC
- * also.
+ * When doing above logical AND, OR, XOR, Remap operations the callers tend to
+ * need temporary nodemask_t's on the stack. But if NODES_SHIFT is large,
+ * nodemask_t's consume too much stack space. NODEMASK_SCRATCH is a helper
+ * for such situations. See below and CPUMASK_ALLOC also.
*/

#include <linux/kernel.h>
@@ -492,8 +491,6 @@ static inline int num_node_state(enum no
#define NODEMASK_FREE(m)
#endif

-#define NODEMASK_POINTER(v, m) nodemask_t *v = &(m->v)
-
/* A example struture for using NODEMASK_ALLOC, used in mempolicy. */
struct nodemask_scratch {
nodemask_t mask1;
diff -puN mm/mempolicy.c~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix mm/mempolicy.c
--- a/mm/mempolicy.c~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix
+++ a/mm/mempolicy.c
@@ -1915,7 +1915,6 @@ restart:
void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
{
int ret;
- NODEMASK_SCRATCH(scratch);

sp->root = RB_ROOT; /* empty tree == default mempolicy */
spin_lock_init(&sp->lock);
@@ -1923,6 +1922,8 @@ void mpol_shared_policy_init(struct shar
if (mpol) {
struct vm_area_struct pvma;
struct mempolicy *new;
+ NODEMASK_SCRATCH(scratch);
+
if (!scratch)
return;
/* contextualize the tmpfs mount point mempolicy */
@@ -1948,8 +1949,8 @@ void mpol_shared_policy_init(struct shar
pvma.vm_end = TASK_SIZE; /* policy covers entire file */
mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
mpol_put(new); /* drop initial ref */
+ NODEMASK_SCRATCH_FREE(scratch);
}
- NODEMASK_SCRATCH_FREE(scratch);
}

int mpol_set_shared_policy(struct shared_policy *info,
_

2009-07-30 00:08:18

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX] set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware

On Wed, 29 Jul 2009 13:16:00 -0700
Andrew Morton <[email protected]> wrote:

> On Tue, 28 Jul 2009 16:18:13 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > Tested on x86-64/fake NUMA and ia64/NUMA.
> > (That ia64 machine is the host the original bug report used.)
>
> There's no description here of this bug.
>
> Does this patch actually fix a bug? Seems not. Confusing.
>
Yes, it fixes a bug; maybe my description is not good.

The bug itself was originally reported here:
http://marc.info/?l=linux-kernel&m=124765131625716&w=2

Ok, let me explain what happened.



At first, init_task's mems_allowed is initialized as:
init_task->mems_allowed == node_states[N_POSSIBLE]

And the top_cpuset's mask is initialized as:
top_cpuset->mems_allowed = node_states[N_HIGH_MEMORY]

Before 2.6.29:
a policy's mems_allowed was initialized like this:

1. update task->mems_allowed from its cpuset's mems_allowed.
2. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

The task's mems_allowed was updated with reference to top_cpuset's,
and a cpuset's mems_allowed is always aware of N_HIGH_MEMORY.


In 2.6.30, after commit 58568d2a8215cb6f55caf2332017d7bdff954e1c,

a policy's mems_allowed is initialized like this:
1. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

Here, if the task is in top_cpuset, task->mems_allowed is never updated from
init's. Assume the user executes a command such as
# numactl --interleave=all ...

policy->mems_allowed = nodes_and(N_POSSIBLE, ALL_SET_MASK)

Then the policy's mems_allowed can include a possible node which has no pgdat.

MPOL_INTERLEAVE just scans the policy's nodemask and accesses
NODE_DATA(nid)->zonelist directly, even when NODE_DATA(nid) == NULL.


What we need, then, is to make policy->mems_allowed aware of N_HIGH_MEMORY.
This patch does that. But to do so, extra nodemasks have to live on the stack.
Because cpumask has the CPUMASK_ALLOC() interface for this, I added the same
thing for nodemasks.

This patch restores the old behavior, but I feel the fix itself is just a
band-aid. A fundamental fix has to take care of memory hotplug, and that takes
time. (task->mems_allowed should be N_HIGH_MEMORY, I think.)



> > Maybe this is a bigger patch than expected, but NODEMASK_ALLOC() will be
> > the way to go anyway (even if CPUMASK_ALLOC is not used anywhere yet..).
> > Kosaki tested this on ia64 NUMA, thanks.
> >
> > I'll consider a more fundamental fix for tsk->mems_allowed, but this patch
> > is enough as a fix for now, I think.
> >
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and a policy's
> > nodemask should include only online nodes.
> > In the old behavior this was guaranteed by frequent calls into cpuset code.
> > Now most of those are removed, and mempolicy has to check it by itself.
> >
> > To do that check, a few nodemask_t's are needed for calculating nodemasks.
> > But nodemask_t can be big, and it's not good to allocate them on the stack.
> >
> > cpumask_t already has CPUMASK_ALLOC/FREE as an easy way to get a scratch
> > area; NODEMASK_ALLOC/FREE should exist too.
> >
> > Tested-by: KOSAKI Motohiro <[email protected]>
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/nodemask.h | 31 +++++++++++++++++
> > mm/mempolicy.c | 82 ++++++++++++++++++++++++++++++++---------------
> > 2 files changed, 87 insertions(+), 26 deletions(-)
> >
> > Index: task-mems-allowed-fix/include/linux/nodemask.h
> > ===================================================================
> > --- task-mems-allowed-fix.orig/include/linux/nodemask.h
> > +++ task-mems-allowed-fix/include/linux/nodemask.h
> > @@ -82,6 +82,13 @@
> > * to generate slightly worse code. So use a simple one-line #define
> > * for node_isset(), instead of wrapping an inline inside a macro, the
> > * way we do the other calls.
> > + *
> > + * NODEMASK_SCRATCH
> > + * For doing above logical AND, OR, XOR, Remap, etc...the caller tend to be
> > + * necessary to use temporal nodemask_t on stack. But if NODES_SHIFT is large,
> > + * size of nodemask_t can be very big and not suitable for allocating in stack.
> > + * NODEMASK_SCRATCH is a helper for such situaions. See below and CPUMASK_ALLOC
> > + * also.
> > */
> >
> > #include <linux/kernel.h>
> > @@ -473,4 +480,28 @@ static inline int num_node_state(enum no
> > #define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
> > #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
> >
> > +/*
> > + * For nodemask scrach area.(See CPUMASK_ALLOC() in cpumask.h)
> > + */
> > +
> > +#if NODES_SHIFT > 8 /* nodemask_t > 64 bytes */
> > +#define NODEMASK_ALLOC(x, m) struct x *m = kmalloc(sizeof(*m), GFP_KERNEL)
> > +#define NODEMASK_FREE(m) kfree(m)
> > +#else
> > +#define NODEMASK_ALLOC(x, m) struct x _m, *m = &_m
> > +#define NODEMASK_FREE(m)
> > +#endif
> > +
> > +#define NODEMASK_POINTER(v, m) nodemask_t *v = &(m->v)
> > +
> > +/* A example struture for using NODEMASK_ALLOC, used in mempolicy. */
> > +struct nodemask_scratch {
> > + nodemask_t mask1;
> > + nodemask_t mask2;
> > +};
> > +
> > +#define NODEMASK_SCRATCH(x) NODEMASK_ALLOC(nodemask_scratch, x)
> > +#define NODEMASK_SCRATCH_FREE(x) NODEMASK_FREE(x)
>
> Ick. Ho hum. OK. Such is life.
>
> NODEMASK_POINTER() has no callers and is undocumented and unobvious.
> Can I delete it?
>
Yes. I added it just because CPUMASK_ALLOC() provides that.
(And I don't think it's necessary, either ;)


> > void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> > {
> > int ret;
> > + NODEMASK_SCRATCH(scratch);
> >
> > sp->root = RB_ROOT; /* empty tree == default mempolicy */
> > spin_lock_init(&sp->lock);
> > @@ -1902,19 +1923,22 @@ void mpol_shared_policy_init(struct shar
> > if (mpol) {
> > struct vm_area_struct pvma;
> > struct mempolicy *new;
> > -
> > + if (!scratch)
> > + return;
> > /* contextualize the tmpfs mount point mempolicy */
> > new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
> > if (IS_ERR(new)) {
> > mpol_put(mpol); /* drop our ref on sb mpol */
> > + NODEMASK_SCRATCH_FREE(scratch);
> > return; /* no valid nodemask intersection */
> > }
> >
> > task_lock(current);
> > - ret = mpol_set_nodemask(new, &mpol->w.user_nodemask);
> > + ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
> > task_unlock(current);
> > mpol_put(mpol); /* drop our ref on sb mpol */
> > if (ret) {
> > + NODEMASK_SCRATCH_FREE(scratch);
> > mpol_put(new);
> > return;
> > }
> > @@ -1925,6 +1949,7 @@ void mpol_shared_policy_init(struct shar
> > mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
> > mpol_put(new); /* drop initial ref */
> > }
> > + NODEMASK_SCRATCH_FREE(scratch);
> > }
>
> This function does an unneeded kmalloc/kfree if mpol==NULL.
>

Ah yes, the scope of the allocation is not good.


>
> How's this look?
>
> diff -puN include/linux/nodemask.h~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix include/linux/nodemask.h
> --- a/include/linux/nodemask.h~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix
> +++ a/include/linux/nodemask.h
> @@ -84,11 +84,10 @@
> * way we do the other calls.
> *
> * NODEMASK_SCRATCH
> - * For doing above logical AND, OR, XOR, Remap, etc...the caller tend to be
> - * necessary to use temporal nodemask_t on stack. But if NODES_SHIFT is large,
> - * size of nodemask_t can be very big and not suitable for allocating in stack.
> - * NODEMASK_SCRATCH is a helper for such situaions. See below and CPUMASK_ALLOC
> - * also.
> + * When doing above logical AND, OR, XOR, Remap operations the callers tend to
> + * need temporary nodemask_t's on the stack. But if NODES_SHIFT is large,
> + * nodemask_t's consume too much stack space. NODEMASK_SCRATCH is a helper
> + * for such situations. See below and CPUMASK_ALLOC also.
> */
>
> #include <linux/kernel.h>
> @@ -492,8 +491,6 @@ static inline int num_node_state(enum no
> #define NODEMASK_FREE(m)
> #endif
>
> -#define NODEMASK_POINTER(v, m) nodemask_t *v = &(m->v)
> -
> /* A example struture for using NODEMASK_ALLOC, used in mempolicy. */
> struct nodemask_scratch {
> nodemask_t mask1;
> diff -puN mm/mempolicy.c~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix mm/mempolicy.c
> --- a/mm/mempolicy.c~mm-make-set_mempolicympol_interleav-n_high_memory-aware-fix
> +++ a/mm/mempolicy.c
> @@ -1915,7 +1915,6 @@ restart:
> void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> {
> int ret;
> - NODEMASK_SCRATCH(scratch);
>
> sp->root = RB_ROOT; /* empty tree == default mempolicy */
> spin_lock_init(&sp->lock);
> @@ -1923,6 +1922,8 @@ void mpol_shared_policy_init(struct shar
> if (mpol) {
> struct vm_area_struct pvma;
> struct mempolicy *new;
> + NODEMASK_SCRATCH(scratch);
> +
> if (!scratch)
> return;
> /* contextualize the tmpfs mount point mempolicy */
> @@ -1948,8 +1949,8 @@ void mpol_shared_policy_init(struct shar
> pvma.vm_end = TASK_SIZE; /* policy covers entire file */
> mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
> mpol_put(new); /* drop initial ref */
> + NODEMASK_SCRATCH_FREE(scratch);
> }
> - NODEMASK_SCRATCH_FREE(scratch);
> }
>
> int mpol_set_shared_policy(struct shared_policy *info,
> _
>

Seems nice

Acked-by: KAMEZAWA Hiroyuki <[email protected]>


> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>