Hi,
The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
a regression uncovered by SGI. Basically it is allowing possible but not
online nodes in the task_struct.mems_allowed nodemask (which is contrary
to several comments still in kernel/cpuset.c), and that causes
cpuset_mem_spread_node() to return an offline node to slab, causing an
oops.
Easy to reproduce if you have a machine with !online nodes.
- mkdir /dev/cpuset
- mount cpuset -t cpuset /dev/cpuset
- echo 1 > /dev/cpuset/memory_spread_slab
kernel BUG at
/usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
bash[6885]: bugcheck! 0 [1]
Pid: 6885, CPU 5, comm: bash
psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
Tainted: G W (2.6.32-0.6.8-default)
ip is at ____cache_alloc_node+0x440/0x500
unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
rnat: 0000000000283d85 bsps: 0000000000000001 pr : 99596aaa69aa6999
ldrs: 0000000000000000 ccv : 0000000000000018 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010020cf00 b6 : a0000001004962c0 b7 : a000000100493240
f6 : 000000000000000000000 f7 : 000000000000000000000
f8 : 000000000000000000000 f9 : 000000000000000000000
f10 : 000000000000000000000 f11 : 000000000000000000000
r1 : a0000001015c6fc0 r2 : 000000000000e662 r3 : 000000000000fffe
r8 : 000000000000005c r9 : 0000000000000000 r10 : 0000000000004000
r11 : 0000000000000000 r12 : e000003c3904fcc0 r13 : e000003c39040000
r14 : 000000000000e662 r15 : a00000010138ed88 r16 : ffffffffffff65c8
r17 : a00000010138ed80 r18 : a0000001013c7ad0 r19 : a0000001013d3b60
r20 : e00001b03afdfe18 r21 : 0000000000000001 r22 : e0000130030365c8
r23 : e000013003040000 r24 : ffffffffffff0400 r25 : 00000000000068ef
r26 : 00000000000068ef r27 : a0000001029621d0 r28 : 00000000000068f0
r29 : 00000000000068f0 r30 : 00000000000068f0 r31 : 000000000000000a
Call Trace:
[<a000000100017a80>] show_stack+0x80/0xa0
[<a0000001000180e0>] show_regs+0x640/0x920
[<a000000100029a90>] die+0x190/0x2e0
[<a000000100029c30>] die_if_kernel+0x50/0x80
[<a000000100904af0>] ia64_bad_break+0x470/0x760
[<a00000010000cb60>] ia64_native_leave_kernel+0x0/0x270
[<a00000010020cf00>] ____cache_alloc_node+0x440/0x500
[<a00000010020ffa0>] kmem_cache_alloc+0x360/0x660
A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
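Something along these lines, maybe (untested sketch against 2.6.32; the bounded loop is just a guard in case mems_allowed somehow ends up with no online nodes, in which case we fall back to the local node):

	int cpuset_mem_spread_node(void)
	{
		int node = current->cpuset_mem_spread_rotor;
		int tries;

		for (tries = 0; tries < MAX_NUMNODES; tries++) {
			node = next_node(node, current->mems_allowed);
			if (node == MAX_NUMNODES)
				node = first_node(current->mems_allowed);
			/* skip possible-but-offline nodes */
			if (node < MAX_NUMNODES && node_online(node))
				break;
		}
		if (node >= MAX_NUMNODES || !node_online(node))
			node = numa_node_id();
		current->cpuset_mem_spread_rotor = node;
		return node;
	}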
However I'm a bit worried about 58568d2.
It is doing a lot of stuff. It is removing the callback_mutex from
around several seemingly unrelated places (e.g. from around
guarantee_online_cpus, which explicitly asks to be called with that
lock held), and other places, so I don't know how it is not racy
with hotplug.
Then it also says that the fastpath doesn't use any locking, so the
update-path first adds the newly allowed nodes, then removes the
newly prohibited nodes. Unfortunately there are no barriers apparent
(and none added), and cpumask/nodemask can be larger than one word,
so it seems there could be races.
It also seems like the exported cpuset_mems_allowed and
cpuset_cpus_allowed APIs are just broken wrt hotplug because the
hotplug lock is dropped before returning.
I'd just like to get opinions or comments from people who know the
code better before wading in too far myself. I'd be really keen on
making the locking simpler, using seqlocks for fastpaths, etc.
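For the mems_allowed fastpath, I'm imagining something roughly like this (sketch only; mems_seq would be a new seqcount_t field in task_struct, and writers would still serialise against each other, e.g. under task_lock()):

	/* update side; writers serialised externally: */
	static void cpuset_set_mems_allowed(struct task_struct *tsk,
					    const nodemask_t *newmems)
	{
		write_seqcount_begin(&tsk->mems_seq);
		tsk->mems_allowed = *newmems;
		write_seqcount_end(&tsk->mems_seq);
	}

	/* read side: lockless, retries only if it raced with an update */
	static nodemask_t cpuset_get_mems_allowed(struct task_struct *tsk)
	{
		nodemask_t mems;
		unsigned seq;

		do {
			seq = read_seqcount_begin(&tsk->mems_seq);
			mems = tsk->mems_allowed;
		} while (read_seqcount_retry(&tsk->mems_seq, seq));
		return mems;
	}

That way a reader always sees a complete mask without taking a lock in the common case.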
Thanks,
Nick
On Fri, 19 Feb 2010, Nick Piggin wrote:
> Hi,
>
> The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> a regression uncovered by SGI. Basically it is allowing possible but not
> online nodes in the task_struct.mems_allowed nodemask (which is contrary
> to several comments still in kernel/cpuset.c), and that causes
> cpuset_mem_spread_node() to return an offline node to slab, causing an
> oops.
>
> Easy to reproduce if you have a machine with !online nodes.
>
> - mkdir /dev/cpuset
> - mount cpuset -t cpuset /dev/cpuset
> - echo 1 > /dev/cpuset/memory_spread_slab
>
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> bash[6885]: bugcheck! 0 [1]
> Pid: 6885, CPU 5, comm: bash
> psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> Tainted: G W (2.6.32-0.6.8-default)
> ip is at ____cache_alloc_node+0x440/0x500
It seems like current->mems_allowed is not properly initialized, although
task_cs(current)->mems_allowed is set to node_states[N_HIGH_MEMORY]. See
below.
> A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
> However I'm a bit worried about 58568d2.
>
> It is doing a lot of stuff. It is removing the callback_mutex from
> around several seemingly unrelated places (e.g. from around
> guarantee_online_cpus, which explicitly asks to be called with that
> lock held), and other places, so I don't know how it is not racy
> with hotplug.
>
guarantee_online_cpus() truly does require callback_mutex, the
cgroup_scan_tasks() iterator locking can protect changes in the cgroup
hierarchy but it doesn't protect a store to cs->cpus_allowed or for
hotplug.
top_cpuset.cpus_allowed will always need to track cpu_active_map since
those are the schedulable cpus, it looks like that's initialized for SMP
and the cpu hotplug notifier does that correctly.
I'm not sure what the logic is doing in cpuset_attach() where cs is the
cpuset to attach to:
if (cs == &top_cpuset) {
cpumask_copy(cpus_attach, cpu_possible_mask);
to = node_possible_map;
}
cpus_attach is properly protected by cgroup_lock, but using
node_possible_map here will set task->mems_allowed to node_possible_map
when the cpuset does not have memory_migrate enabled. This is the source
of your oops, I think.
> Then it also says that the fastpath doesn't use any locking, so the
> update-path first adds the newly allowed nodes, then removes the
> newly prohibited nodes. Unfortunately there are no barriers apparent
> (and none added), and cpumask/nodemask can be larger than one word,
> so it seems there could be races.
>
We can remove the store to tsk->mems_allowed in cpuset_migrate_mm()
because cpuset_change_task_nodemask() already does it under
task_lock(tsk).
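Something like this, I think (untested sketch):

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 	struct task_struct *tsk = current;
 
-	tsk->mems_allowed = *to;
-
 	do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);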
cpuset_migrate_mm() looks to be subsequently updating the cpuset_attach()
nodemask when moving to top_cpuset so it doesn't get stuck with
node_possible_map, but that's not called unless memory_migrate is enabled.
> It also seems like the exported cpuset_mems_allowed and
> cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> hotplug lock is dropped before returning.
>
The usage of cpuset_cpus_allowed_locked() looks wrong in the scheduler, as
well: it can't hold callback_mutex since it is only declared at file scope
in the cpuset code.
On Thu, Feb 18, 2010 at 01:38:11PM -0800, David Rientjes wrote:
> On Fri, 19 Feb 2010, Nick Piggin wrote:
>
> > Hi,
> >
> > The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> > a regression uncovered by SGI. Basically it is allowing possible but not
> > online nodes in the task_struct.mems_allowed nodemask (which is contrary
> > to several comments still in kernel/cpuset.c), and that causes
> > cpuset_mem_spread_node() to return an offline node to slab, causing an
> > oops.
> >
> > Easy to reproduce if you have a machine with !online nodes.
> >
> > - mkdir /dev/cpuset
> > - mount cpuset -t cpuset /dev/cpuset
> > - echo 1 > /dev/cpuset/memory_spread_slab
> >
> > kernel BUG at
> > /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> > bash[6885]: bugcheck! 0 [1]
> > Pid: 6885, CPU 5, comm: bash
> > psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> > Tainted: G W (2.6.32-0.6.8-default)
> > ip is at ____cache_alloc_node+0x440/0x500
>
> It seems like current->mems_allowed is not properly initialized, although
> task_cs(current)->mems_allowed is set to node_states[N_HIGH_MEMORY]. See
> below.
>
> > A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
> > However I'm a bit worried about 58568d2.
> >
> > It is doing a lot of stuff. It is removing the callback_mutex from
> > around several seemingly unrelated places (e.g. from around
> > guarantee_online_cpus, which explicitly asks to be called with that
> > lock held), and other places, so I don't know how it is not racy
> > with hotplug.
> >
>
> guarantee_online_cpus() truly does require callback_mutex, the
> cgroup_scan_tasks() iterator locking can protect changes in the cgroup
> hierarchy but it doesn't protect a store to cs->cpus_allowed or for
> hotplug.
Right, but the callback_mutex was being removed by this patch.
>
> top_cpuset.cpus_allowed will always need to track cpu_active_map since
> those are the schedulable cpus, it looks like that's initialized for SMP
> and the cpu hotplug notifier does that correctly.
>
> I'm not sure what the logic is doing in cpuset_attach() where cs is the
> cpuset to attach to:
>
> if (cs == &top_cpuset) {
> cpumask_copy(cpus_attach, cpu_possible_mask);
> to = node_possible_map;
> }
>
> cpus_attach is properly protected by cgroup_lock, but using
> node_possible_map here will set task->mems_allowed to node_possible_map
> when the cpuset does not have memory_migrate enabled. This is the source
> of your oops, I think.
Could be, yes.
> > Then it also says that the fastpath doesn't use any locking, so the
> > update-path first adds the newly allowed nodes, then removes the
> > newly prohibited nodes. Unfortunately there are no barriers apparent
> > (and none added), and cpumask/nodemask can be larger than one word,
> > so it seems there could be races.
> >
>
> We can remove the store to tsk->mems_allowed in cpuset_migrate_mm()
> because cpuset_change_task_nodemask() already does it under
> task_lock(tsk).
But it doesn't matter if stores are done under lock, if the loads are
not. Masks can be multiple words, so there isn't any ordering between
reading half an old mask and half a new one, which results in an invalid
state. AFAIKS.
>
> cpuset_migrate_mm() looks to be subsequently updating the cpuset_attach()
> nodemask when moving to top_cpuset so it doesn't get stuck with
> node_possible_map, but that's not called unless memory_migrate is enabled.
>
> > It also seems like the exported cpuset_mems_allowed and
> > cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> > hotplug lock is dropped before returning.
> >
>
> The usage of cpuset_cpus_allowed_locked() looks wrong in the scheduler, as
> well: it can't hold callback_mutex since it is only declared at file scope
> in the cpuset code.
Well it is exported as cpuset_lock(). And the scheduler has it covered
in all cases by the looks except for select_task_rq, which is called
by wakeup code. We should stick WARN_ONs through the cpuset code for
mutexes not held when they should be.
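Something as simple as this at the top of guarantee_online_cpus() and friends would make the contract self-documenting (sketch; mutex_is_locked() only tells us that *somebody* holds the lock, but that is still enough to catch the unlocked wakeup path):

	WARN_ON_ONCE(!mutex_is_locked(&callback_mutex));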
On Fri, 19 Feb 2010, Nick Piggin wrote:
> Hi,
>
> The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> a regression uncovered by SGI. Basically it is allowing possible but not
> online nodes in the task_struct.mems_allowed nodemask (which is contrary
> to several comments still in kernel/cpuset.c), and that causes
> cpuset_mem_spread_node() to return an offline node to slab, causing an
> oops.
>
> Easy to reproduce if you have a machine with !online nodes.
>
> - mkdir /dev/cpuset
> - mount cpuset -t cpuset /dev/cpuset
> - echo 1 > /dev/cpuset/memory_spread_slab
>
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> bash[6885]: bugcheck! 0 [1]
> Pid: 6885, CPU 5, comm: bash
> psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> Tainted: G W (2.6.32-0.6.8-default)
> ip is at ____cache_alloc_node+0x440/0x500
>
> unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
> rnat: 0000000000283d85 bsps: 0000000000000001 pr : 99596aaa69aa6999
> ldrs: 0000000000000000 ccv : 0000000000000018 fpsr: 0009804c0270033f
> csd : 0000000000000000 ssd : 0000000000000000
> b0 : a00000010020cf00 b6 : a0000001004962c0 b7 : a000000100493240
> f6 : 000000000000000000000 f7 : 000000000000000000000
> f8 : 000000000000000000000 f9 : 000000000000000000000
> f10 : 000000000000000000000 f11 : 000000000000000000000
> r1 : a0000001015c6fc0 r2 : 000000000000e662 r3 : 000000000000fffe
> r8 : 000000000000005c r9 : 0000000000000000 r10 : 0000000000004000
> r11 : 0000000000000000 r12 : e000003c3904fcc0 r13 : e000003c39040000
> r14 : 000000000000e662 r15 : a00000010138ed88 r16 : ffffffffffff65c8
> r17 : a00000010138ed80 r18 : a0000001013c7ad0 r19 : a0000001013d3b60
> r20 : e00001b03afdfe18 r21 : 0000000000000001 r22 : e0000130030365c8
> r23 : e000013003040000 r24 : ffffffffffff0400 r25 : 00000000000068ef
> r26 : 00000000000068ef r27 : a0000001029621d0 r28 : 00000000000068f0
> r29 : 00000000000068f0 r30 : 00000000000068f0 r31 : 000000000000000a
>
> Call Trace:
> [<a000000100017a80>] show_stack+0x80/0xa0
> [<a0000001000180e0>] show_regs+0x640/0x920
> [<a000000100029a90>] die+0x190/0x2e0
> [<a000000100029c30>] die_if_kernel+0x50/0x80
> [<a000000100904af0>] ia64_bad_break+0x470/0x760
> [<a00000010000cb60>] ia64_native_leave_kernel+0x0/0x270
> [<a00000010020cf00>] ____cache_alloc_node+0x440/0x500
> [<a00000010020ffa0>] kmem_cache_alloc+0x360/0x660
>
> A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
> However I'm a bit worried about 58568d2.
Personally, I would rather revert it outright than apply a bandaid. 58568d2 didn't
introduce any new feature, so we can revert it without ABI breakage.
This test result suggests the patch author didn't test his own patch thoroughly enough.
thanks.
>
> It is doing a lot of stuff. It is removing the callback_mutex from
> around several seemingly unrelated places (e.g. from around
> guarantee_online_cpus, which explicitly asks to be called with that
> lock held), and other places, so I don't know how it is not racy
> with hotplug.
>
> Then it also says that the fastpath doesn't use any locking, so the
> update-path first adds the newly allowed nodes, then removes the
> newly prohibited nodes. Unfortunately there are no barriers apparent
> (and none added), and cpumask/nodemask can be larger than one word,
> so it seems there could be races.
>
> It also seems like the exported cpuset_mems_allowed and
> cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> hotplug lock is dropped before returning.
>
> I'd just like to get opinions or comments from people who know the
> code better before wading in too far myself. I'd be really keen on
> making the locking simpler, using seqlocks for fastpaths, etc.
>
> Thanks,
> Nick
>
On Fri, 19 Feb 2010 00:49:21 +1100
Nick Piggin <[email protected]> wrote:
> Hi,
>
> The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
> a regression uncovered by SGI. Basically it is allowing possible but not
> online nodes in the task_struct.mems_allowed nodemask (which is contrary
> to several comments still in kernel/cpuset.c), and that causes
> cpuset_mem_spread_node() to return an offline node to slab, causing an
> oops.
>
> Easy to reproduce if you have a machine with !online nodes.
>
> - mkdir /dev/cpuset
> - mount cpuset -t cpuset /dev/cpuset
> - echo 1 > /dev/cpuset/memory_spread_slab
>
> kernel BUG at
> /usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
> bash[6885]: bugcheck! 0 [1]
> Pid: 6885, CPU 5, comm: bash
> psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
> Tainted: G W (2.6.32-0.6.8-default)
> ip is at ____cache_alloc_node+0x440/0x500
>
> unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
> rnat: 0000000000283d85 bsps: 0000000000000001 pr : 99596aaa69aa6999
> ldrs: 0000000000000000 ccv : 0000000000000018 fpsr: 0009804c0270033f
> csd : 0000000000000000 ssd : 0000000000000000
> b0 : a00000010020cf00 b6 : a0000001004962c0 b7 : a000000100493240
> f6 : 000000000000000000000 f7 : 000000000000000000000
> f8 : 000000000000000000000 f9 : 000000000000000000000
> f10 : 000000000000000000000 f11 : 000000000000000000000
> r1 : a0000001015c6fc0 r2 : 000000000000e662 r3 : 000000000000fffe
> r8 : 000000000000005c r9 : 0000000000000000 r10 : 0000000000004000
> r11 : 0000000000000000 r12 : e000003c3904fcc0 r13 : e000003c39040000
> r14 : 000000000000e662 r15 : a00000010138ed88 r16 : ffffffffffff65c8
> r17 : a00000010138ed80 r18 : a0000001013c7ad0 r19 : a0000001013d3b60
> r20 : e00001b03afdfe18 r21 : 0000000000000001 r22 : e0000130030365c8
> r23 : e000013003040000 r24 : ffffffffffff0400 r25 : 00000000000068ef
> r26 : 00000000000068ef r27 : a0000001029621d0 r28 : 00000000000068f0
> r29 : 00000000000068f0 r30 : 00000000000068f0 r31 : 000000000000000a
>
> Call Trace:
> [<a000000100017a80>] show_stack+0x80/0xa0
> [<a0000001000180e0>] show_regs+0x640/0x920
> [<a000000100029a90>] die+0x190/0x2e0
> [<a000000100029c30>] die_if_kernel+0x50/0x80
> [<a000000100904af0>] ia64_bad_break+0x470/0x760
> [<a00000010000cb60>] ia64_native_leave_kernel+0x0/0x270
> [<a00000010020cf00>] ____cache_alloc_node+0x440/0x500
> [<a00000010020ffa0>] kmem_cache_alloc+0x360/0x660
>
> A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
I think that's good.
> However I'm a bit worried about 58568d2.
>
> It is doing a lot of stuff. It is removing the callback_mutex from
> around several seemingly unrelated places (e.g. from around
> guarantee_online_cpus, which explicitly asks to be called with that
> lock held), and other places, so I don't know how it is not racy
> with hotplug.
Because removing a pgdat is not implemented yet. It was verrrry difficult.
So, once a node becomes online, it never turns offline again.
But all pages on the node can be removed; just the zonelists are rebuilt.
(The zonelist rebuild uses stop_machine_run().)
>
> Then it also says that the fastpath doesn't use any locking, so the
> update-path first adds the newly allowed nodes, then removes the
> newly prohibited nodes. Unfortunately there are no barriers apparent
> (and none added), and cpumask/nodemask can be larger than one word,
> so it seems there could be races.
>
Maybe. IMHO, the "newly allowed nodes" and "newly prohibited nodes" come from
a user's command, so we don't need to guarantee strict correctness there.
Our only concern is "don't access an offlined node", right?
But as I wrote, a node once onlined will never be offlined,
so I think it's difficult to cause a panic.
My concern is the zonelist rather than the bitmap, but I have heard no panic
report about updates to it until now. (Maybe because "struct zone" is never
freed.)
> It also seems like the exported cpuset_mems_allowed and
> cpuset_cpus_allowed APIs are just broken wrt hotplug because the
> hotplug lock is dropped before returning.
>
As for cpus, they can disappear... so that should be fixed.
> I'd just like to get opinions or comments from people who know the
> code better before wading in too far myself. I'd be really keen on
> making the locking simpler, using seqlocks for fastpaths, etc.
>
Thanks,
-Kame
On Fri, 19 Feb 2010, KOSAKI Motohiro wrote:
> Personally, I would rather revert it outright than apply a bandaid. 58568d2 didn't
> introduce any new feature, so we can revert it without ABI breakage.
>
Revert a commit from more than six months ago when the fix is probably a
small patch in cpuset_attach()? I think we can do better than that.
This may not have introduced a new feature, but it was a worthwhile change
to avoid the old cpuset_update_task_memory_state() hooks in mempolicy,
page allocator, etc. code that could block on callback_mutex for iterating
the hierarchy.
On Fri, 19 Feb 2010, Nick Piggin wrote:
> > guarantee_online_cpus() truly does require callback_mutex, the
> > cgroup_scan_tasks() iterator locking can protect changes in the cgroup
> > hierarchy but it doesn't protect a store to cs->cpus_allowed or for
> > hotplug.
>
> Right, but the callback_mutex was being removed by this patch.
>
I was making the case for it to be readded :)
> > top_cpuset.cpus_allowed will always need to track cpu_active_map since
> > those are the schedulable cpus, it looks like that's initialized for SMP
> > and the cpu hotplug notifier does that correctly.
> >
> > I'm not sure what the logic is doing in cpuset_attach() where cs is the
> > cpuset to attach to:
> >
> > if (cs == &top_cpuset) {
> > cpumask_copy(cpus_attach, cpu_possible_mask);
> > to = node_possible_map;
> > }
> >
> > cpus_attach is properly protected by cgroup_lock, but using
> > node_possible_map here will set task->mems_allowed to node_possible_map
> > when the cpuset does not have memory_migrate enabled. This is the source
> > of your oops, I think.
>
> Could be, yes.
>
I'd be interested to see if you still get the same oops with the patch at
the end of this email that fixes this logic.
> But it doesn't matter if stores are done under lock, if the loads are
> not. Masks can be multiple words, so there isn't any ordering between
> reading half an old mask and half a new one, which results in an invalid
> state. AFAIKS.
>
It doesn't matter for MAX_NUMNODES > BITS_PER_LONG because
task->mems_allowed only gets updated via cpuset_change_task_nodemask()
where the added nodes are set and then the removed nodes are cleared. The
side effect of this lockless access to task->mems_allowed means we may
have a small race between
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
and
tsk->mems_allowed = *newmems;
but the penalty is that we get an allocation on a removed node, which
isn't a big deal, especially since it was previously allowed.
> Well it is exported as cpuset_lock(). And the scheduler has it covered
> in all cases by the looks except for select_task_rq, which is called
> by wakeup code. We should stick WARN_ONs through the cpuset code for
> mutexes not held when they should be.
>
A lot of the reliance on callback_mutex was removed because the strict
hierarchy walking and task membership is now guarded by cgroup_mutex
instead. Some of the comments in kernel/cpuset.c weren't updated so they
still say callback_mutex when in reality they mean cgroup_mutex.
---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1319,7 +1319,7 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}
-/* Protected by cgroup_lock */
+/* Protected by cgroup_mutex held on cpuset_attach() */
static cpumask_var_t cpus_attach;
/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
@@ -1390,8 +1390,12 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
struct cpuset *oldcs = cgroup_cs(oldcont);
if (cs == &top_cpuset) {
- cpumask_copy(cpus_attach, cpu_possible_mask);
- to = node_possible_map;
+ /*
+ * top_cpuset.cpus_allowed and top_cpuset.mems_allowed are
+ * protected by cgroup_lock which is already held here.
+ */
+ cpumask_copy(cpus_attach, top_cpuset.cpus_allowed);
+ to = top_cpuset.mems_allowed;
} else {
guarantee_online_cpus(cs, cpus_attach);
guarantee_online_mems(cs, &to);
I'm sorry for replying this late.
on 2010-2-19 18:06, David Rientjes wrote:
> On Fri, 19 Feb 2010, Nick Piggin wrote:
>
>>> guarantee_online_cpus() truly does require callback_mutex, the
>>> cgroup_scan_tasks() iterator locking can protect changes in the cgroup
>>> hierarchy but it doesn't protect a store to cs->cpus_allowed or for
>>> hotplug.
>>
>> Right, but the callback_mutex was being removed by this patch.
>>
>
> I was making the case for it to be readded :)
But cgroup_mutex is held when someone changes cs->cpus_allowed or does hotplug,
so I think callback_mutex is not necessary in this case.
>
>>> top_cpuset.cpus_allowed will always need to track cpu_active_map since
>>> those are the schedulable cpus, it looks like that's initialized for SMP
>>> and the cpu hotplug notifier does that correctly.
>>>
>>> I'm not sure what the logic is doing in cpuset_attach() where cs is the
>>> cpuset to attach to:
>>>
>>> if (cs == &top_cpuset) {
>>> cpumask_copy(cpus_attach, cpu_possible_mask);
>>> to = node_possible_map;
>>> }
>>>
>>> cpus_attach is properly protected by cgroup_lock, but using
>>> node_possible_map here will set task->mems_allowed to node_possible_map
>>> when the cpuset does not have memory_migrate enabled. This is the source
>>> of your oops, I think.
>>
>> Could be, yes.
>>
>
> I'd be interested to see if you still get the same oops with the patch at
> the end of this email that fixes this logic.
I think this patch can't fix this bug, because mems_allowed of tasks in the
top group is set to node_possible_map by default, not when the task is
attached.
I made a new patch at the end of this email to fix it, but I have no machine
to test it now. Who can test it for me?
---
diff --git a/init/main.c b/init/main.c
index 4cb47a1..512ba15 100644
--- a/init/main.c
+++ b/init/main.c
@@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
/*
* init can allocate pages on any node
*/
- set_mems_allowed(node_possible_map);
+ set_mems_allowed(node_states[N_HIGH_MEMORY]);
/*
* init can run on any cpu.
*/
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index ba401fa..e29b440 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -935,10 +935,12 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
struct task_struct *tsk = current;
tsk->mems_allowed = *to;
+ wmb();
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
+ wmb();
}
/*
@@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
if (cs == &top_cpuset) {
cpumask_copy(cpus_attach, cpu_possible_mask);
- to = node_possible_map;
} else {
guarantee_online_cpus(cs, cpus_attach);
- guarantee_online_mems(cs, &to);
}
+ guarantee_online_mems(cs, &to);
/* do per-task migration stuff possibly for each in the threadgroup */
cpuset_attach_task(tsk, &to, cs);
@@ -2090,15 +2091,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
static int cpuset_track_online_nodes(struct notifier_block *self,
unsigned long action, void *arg)
{
+ nodemask_t oldmems;
+
cgroup_lock();
switch (action) {
case MEM_ONLINE:
- case MEM_OFFLINE:
+ oldmems = top_cpuset.mems_allowed;
mutex_lock(&callback_mutex);
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
mutex_unlock(&callback_mutex);
- if (action == MEM_OFFLINE)
- scan_for_empty_cpusets(&top_cpuset);
+ update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
+ break;
+ case MEM_OFFLINE:
+ scan_for_empty_cpusets(&top_cpuset);
break;
default:
break;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fbb6222..84c7f99 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -219,7 +219,7 @@ int kthreadd(void *unused)
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
- set_mems_allowed(node_possible_map);
+ set_mems_allowed(node_states[N_HIGH_MEMORY]);
current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;
On Mon, Feb 22, 2010 at 07:53:39PM +0800, Miao Xie wrote:
> I'm sorry for replying this late.
No problem.
> on 2010-2-19 18:06, David Rientjes wrote:
> > On Fri, 19 Feb 2010, Nick Piggin wrote:
> >
> >>> guarantee_online_cpus() truly does require callback_mutex, the
> >>> cgroup_scan_tasks() iterator locking can protect changes in the cgroup
> >>> hierarchy but it doesn't protect a store to cs->cpus_allowed or for
> >>> hotplug.
> >>
> >> Right, but the callback_mutex was being removed by this patch.
> >>
> >
> > I was making the case for it to be readded :)
>
> But cgroup_mutex is held when someone changes cs->cpus_allowed or does hotplug,
> so I think callback_mutex is not necessary in this case.
So long as that's done consistently (and we should update the comments
too).
> >>> top_cpuset.cpus_allowed will always need to track cpu_active_map since
> >>> those are the schedulable cpus, it looks like that's initialized for SMP
> >>> and the cpu hotplug notifier does that correctly.
> >>>
> >>> I'm not sure what the logic is doing in cpuset_attach() where cs is the
> >>> cpuset to attach to:
> >>>
> >>> if (cs == &top_cpuset) {
> >>> cpumask_copy(cpus_attach, cpu_possible_mask);
> >>> to = node_possible_map;
> >>> }
> >>>
> >>> cpus_attach is properly protected by cgroup_lock, but using
> >>> node_possible_map here will set task->mems_allowed to node_possible_map
> >>> when the cpuset does not have memory_migrate enabled. This is the source
> >>> of your oops, I think.
> >>
> >> Could be, yes.
> >>
> >
> > I'd be interested to see if you still get the same oops with the patch at
> > the end of this email that fixes this logic.
>
> I think this patch can't fix this bug, because mems_allowed of tasks in the
> top group is set to node_possible_map by default, not when the task is
> attached.
>
> I made a new patch at the end of this email to fix it, but I have no machine
> to test it now. Who can test it for me?
I can test it for you, but I wonder about your barriers.
>
> ---
> diff --git a/init/main.c b/init/main.c
> index 4cb47a1..512ba15 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
> /*
> * init can allocate pages on any node
> */
> - set_mems_allowed(node_possible_map);
> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
> /*
> * init can run on any cpu.
> */
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index ba401fa..e29b440 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -935,10 +935,12 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> struct task_struct *tsk = current;
>
> tsk->mems_allowed = *to;
> + wmb();
>
> do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
>
> guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
> + wmb();
> }
>
> /*
You always need to comment barriers (and use smp_ variants unless you're
doing mmio).
> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>
> if (cs == &top_cpuset) {
> cpumask_copy(cpus_attach, cpu_possible_mask);
> - to = node_possible_map;
> } else {
> guarantee_online_cpus(cs, cpus_attach);
> - guarantee_online_mems(cs, &to);
> }
> + guarantee_online_mems(cs, &to);
>
> /* do per-task migration stuff possibly for each in the threadgroup */
> cpuset_attach_task(tsk, &to, cs);
> @@ -2090,15 +2091,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
> static int cpuset_track_online_nodes(struct notifier_block *self,
> unsigned long action, void *arg)
> {
> + nodemask_t oldmems;
> +
> cgroup_lock();
> switch (action) {
> case MEM_ONLINE:
> - case MEM_OFFLINE:
> + oldmems = top_cpuset.mems_allowed;
> mutex_lock(&callback_mutex);
> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
> mutex_unlock(&callback_mutex);
> - if (action == MEM_OFFLINE)
> - scan_for_empty_cpusets(&top_cpuset);
> + update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
> + break;
> + case MEM_OFFLINE:
> + scan_for_empty_cpusets(&top_cpuset);
> break;
> default:
> break;
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index fbb6222..84c7f99 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -219,7 +219,7 @@ int kthreadd(void *unused)
> set_task_comm(tsk, "kthreadd");
> ignore_signals(tsk);
> set_cpus_allowed_ptr(tsk, cpu_all_mask);
> - set_mems_allowed(node_possible_map);
> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
>
> current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;
>
On Fri, Feb 19, 2010 at 02:06:45AM -0800, David Rientjes wrote:
> On Fri, 19 Feb 2010, Nick Piggin wrote:
> > But it doesn't matter if stores are done under lock, if the loads are
> > not. Masks can be multiple words, so there isn't any ordering between
> > reading half an old mask and half a new one, which results in an invalid
> > state. AFAIKS.
> >
>
> It doesn't matter for MAX_NUMNODES > BITS_PER_LONG because
> task->mems_allowed only gets updated via cpuset_change_task_nodemask()
> where the added nodes are set and then the removed nodes are cleared. The
> side effect of this lockless access to task->mems_allowed means we may
> have a small race between
>
> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>
> and
>
> tsk->mems_allowed = *newmems;
>
> but the penalty is that we get an allocation on a removed node, which
> isn't a big deal, especially since it was previously allowed.
If you have a concurrent reader without any synchronisation, then what
stops it from loading a word of the mask before stores to add the new
nodes and then loading another word of the mask after the stores to
remove the old nodes? (which can give an empty mask).
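Concretely, suppose MAX_NUMNODES is 128 (two words), the old mask has bits only in word 1, and the new mask has bits only in word 0:

	updater (add, then remove)          reader (two plain loads)
	                                    load word 0 -> 0 (new bits not stored yet)
	word 0 |= new_bits;
	word 1 &= ~old_bits;
	                                    load word 1 -> 0 (old bits just cleared)

The reader ends up with an all-zero mask, a state that never existed on the updater's side.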
On Mon, 22 Feb 2010, Nick Piggin wrote:
> If you have a concurrent reader without any synchronisation, then what
> stops it from loading a word of the mask before stores to add the new
> nodes and then loading another word of the mask after the stores to
> remove the old nodes? (which can give an empty mask).
>
Currently nothing, so we'll need a variant for configurations where the
size of nodemask_t is larger than we can atomically store.
On Mon, 22 Feb 2010, Miao Xie wrote:
> >>> guarantee_online_cpus() truly does require callback_mutex, the
> >>> cgroup_scan_tasks() iterator locking can protect changes in the cgroup
> >>> hierarchy but it doesn't protect a store to cs->cpus_allowed or for
> >>> hotplug.
> >>
> >> Right, but the callback_mutex was being removed by this patch.
> >>
> >
> > I was making the case for it to be readded :)
>
> But cgroup_mutex is held when someone changes cs->cpus_allowed or does hotplug,
> so I think callback_mutex is not necessary in this case.
>
Then why is it taken in update_cpumask()?
> I think this patch can't fix this bug, because mems_allowed of tasks in the
> top group is set to node_possible_map by default, not when the task is
> attached.
>
Ok, I thought that all tasks get their ->attach() function called whenever
their cgroup is mounted.
> I made a new patch at the end of this email to fix it, but I have no machine
> to test it now. Who can test it for me?
>
> ---
> diff --git a/init/main.c b/init/main.c
> index 4cb47a1..512ba15 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
> /*
> * init can allocate pages on any node
> */
> - set_mems_allowed(node_possible_map);
> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
> /*
> * init can run on any cpu.
> */
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index ba401fa..e29b440 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -935,10 +935,12 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> struct task_struct *tsk = current;
>
> tsk->mems_allowed = *to;
> + wmb();
>
> do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
>
> guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
> + wmb();
> }
>
> /*
> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>
> if (cs == &top_cpuset) {
> cpumask_copy(cpus_attach, cpu_possible_mask);
> - to = node_possible_map;
> } else {
> guarantee_online_cpus(cs, cpus_attach);
> - guarantee_online_mems(cs, &to);
> }
> + guarantee_online_mems(cs, &to);
>
> /* do per-task migration stuff possibly for each in the threadgroup */
> cpuset_attach_task(tsk, &to, cs);
Do we need to set cpus_attach to cpu_possible_mask? Why won't
cpu_active_mask suffice?
> @@ -2090,15 +2091,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
> static int cpuset_track_online_nodes(struct notifier_block *self,
> unsigned long action, void *arg)
> {
> + nodemask_t oldmems;
Is it possible to use NODEMASK_ALLOC() instead?
> +
> cgroup_lock();
> switch (action) {
> case MEM_ONLINE:
> - case MEM_OFFLINE:
> + oldmems = top_cpuset.mems_allowed;
> mutex_lock(&callback_mutex);
> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
> mutex_unlock(&callback_mutex);
> - if (action == MEM_OFFLINE)
> - scan_for_empty_cpusets(&top_cpuset);
> + update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
> + break;
> + case MEM_OFFLINE:
> + scan_for_empty_cpusets(&top_cpuset);
> break;
> default:
> break;
Looks good.
on 2010-2-22 20:06, Nick Piggin wrote:
>>>>> guarantee_online_cpus() truly does require callback_mutex, the
>>>>> cgroup_scan_tasks() iterator locking can protect changes in the cgroup
>>>>> hierarchy but it doesn't protect a store to cs->cpus_allowed or for
>>>>> hotplug.
>>>>
>>>> Right, but the callback_mutex was being removed by this patch.
>>>>
>>>
>>> I was making the case for it to be readded :)
>>
>> But cgroup_mutex is held when someone changes cs->cpus_allowed or does hotplug,
>> so I think callback_mutex is not necessary in this case.
>
> So long as that's done consistently (and we should update the comments
> too).
I will update the comments.
>> ---
>> diff --git a/init/main.c b/init/main.c
>> index 4cb47a1..512ba15 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -846,7 +846,7 @@ static int __init kernel_init(void * unused)
>> /*
>> * init can allocate pages on any node
>> */
>> - set_mems_allowed(node_possible_map);
>> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
>> /*
>> * init can run on any cpu.
>> */
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index ba401fa..e29b440 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -935,10 +935,12 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>> struct task_struct *tsk = current;
>>
>> tsk->mems_allowed = *to;
>> + wmb();
>>
>> do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
>>
>> guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
>> + wmb();
>> }
>>
>> /*
>
> You always need to comment barriers (and use smp_ variants unless you're
> doing mmio).
It's my mistake.
I'll make a new patch and change it.
Thanks.
Miao
>
>
>> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>>
>> if (cs == &top_cpuset) {
>> cpumask_copy(cpus_attach, cpu_possible_mask);
>> - to = node_possible_map;
>> } else {
>> guarantee_online_cpus(cs, cpus_attach);
>> - guarantee_online_mems(cs, &to);
>> }
>> + guarantee_online_mems(cs, &to);
>>
>> /* do per-task migration stuff possibly for each in the threadgroup */
>> cpuset_attach_task(tsk, &to, cs);
>> @@ -2090,15 +2091,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
>> static int cpuset_track_online_nodes(struct notifier_block *self,
>> unsigned long action, void *arg)
>> {
>> + nodemask_t oldmems;
>> +
>> cgroup_lock();
>> switch (action) {
>> case MEM_ONLINE:
>> - case MEM_OFFLINE:
>> + oldmems = top_cpuset.mems_allowed;
>> mutex_lock(&callback_mutex);
>> top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
>> mutex_unlock(&callback_mutex);
>> - if (action == MEM_OFFLINE)
>> - scan_for_empty_cpusets(&top_cpuset);
>> + update_tasks_nodemask(&top_cpuset, &oldmems, NULL);
>> + break;
>> + case MEM_OFFLINE:
>> + scan_for_empty_cpusets(&top_cpuset);
>> break;
>> default:
>> break;
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index fbb6222..84c7f99 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -219,7 +219,7 @@ int kthreadd(void *unused)
>> set_task_comm(tsk, "kthreadd");
>> ignore_signals(tsk);
>> set_cpus_allowed_ptr(tsk, cpu_all_mask);
>> - set_mems_allowed(node_possible_map);
>> + set_mems_allowed(node_states[N_HIGH_MEMORY]);
>>
>> current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;
>>
>
>
>
on 2010-2-23 6:06, David Rientjes wrote:
>>>> Right, but the callback_mutex was being removed by this patch.
>>>>
>>>
>>> I was making the case for it to be readded :)
>>
>> But cgroup_mutex is held when someone changes cs->cpus_allowed or does hotplug,
>> so I think callback_mutex is not necessary in this case.
>>
>
> Then why is it taken in update_cpumask()?
When we read cs->cpus_allowed, we need only hold one of callback_mutex and cgroup_mutex.
If we want to change cs->cpus_allowed, we must hold both callback_mutex and cgroup_mutex.
>> /*
>> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
>>
>> if (cs == &top_cpuset) {
>> cpumask_copy(cpus_attach, cpu_possible_mask);
>> - to = node_possible_map;
>> } else {
>> guarantee_online_cpus(cs, cpus_attach);
>> - guarantee_online_mems(cs, &to);
>> }
>> + guarantee_online_mems(cs, &to);
>>
>> /* do per-task migration stuff possibly for each in the threadgroup */
>> cpuset_attach_task(tsk, &to, cs);
>
> Do we need to set cpus_attach to cpu_possible_mask? Why won't
> cpu_active_mask suffice?
If we set cpus_attach to cpu_possible_mask, we needn't do anything for tasks in the top_cpuset when
doing cpu hotplug. If not, we will have to update cpus_allowed of all tasks in the top_cpuset.
>
>> @@ -2090,15 +2091,19 @@ static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
>> static int cpuset_track_online_nodes(struct notifier_block *self,
>> unsigned long action, void *arg)
>> {
>> + nodemask_t oldmems;
>
> Is it possible to use NODEMASK_ALLOC() instead?
Yes. I will write another patch to fix it. (The same problem exists in the other functions.)
on 2010-2-23 6:00, David Rientjes wrote:
> On Mon, 22 Feb 2010, Nick Piggin wrote:
>
>> If you have a concurrent reader without any synchronisation, then what
>> stops it from loading a word of the mask before stores to add the new
>> nodes and then loading another word of the mask after the stores to
>> remove the old nodes? (which can give an empty mask).
>>
>
> Currently nothing, so we'll need a variant for configurations where the
> size of nodemask_t is larger than we can atomically store.
>
Sorry, could you explain what you're suggesting?
I think it is hard to fix this problem by adding a variant, because it is
hard to avoid loading a word of the mask before
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
and then loading another word of the mask after
tsk->mems_allowed = *newmems;
unless we use a lock.
Maybe we need a rw-lock to protect task->mems_allowed.
On Tue, 23 Feb 2010, Miao Xie wrote:
> Sorry, could you explain what you're suggesting?
> I think it is hard to fix this problem by adding a variant, because it is
> hard to avoid loading a word of the mask before
>
> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>
> and then loading another word of the mask after
>
> tsk->mems_allowed = *newmems;
>
> unless we use a lock.
>
> Maybe we need a rw-lock to protect task->mems_allowed.
>
I meant that we need to define synchronization only for configurations
that do not do atomic nodemask_t stores, it's otherwise unnecessary.
We'll need to load and store tsk->mems_allowed via a helper function that
is defined to take the rwlock for such configs and only read/write the
nodemask for others.
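Roughly like this (sketch only, names invented; a global rwlock is just for illustration, it could equally be per-task):

	#if MAX_NUMNODES > BITS_PER_LONG
	static DEFINE_RWLOCK(mems_allowed_lock);

	static inline nodemask_t read_mems_allowed(struct task_struct *tsk)
	{
		nodemask_t mems;

		read_lock(&mems_allowed_lock);
		mems = tsk->mems_allowed;
		read_unlock(&mems_allowed_lock);
		return mems;
	}
	#else
	static inline nodemask_t read_mems_allowed(struct task_struct *tsk)
	{
		/* single word: a plain load is already atomic */
		return tsk->mems_allowed;
	}
	#endif

with a matching write_mems_allowed() on the update side.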
On Tue, 23 Feb 2010, Miao Xie wrote:
> >> /*
> >> @@ -1391,11 +1393,10 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> >>
> >> if (cs == &top_cpuset) {
> >> cpumask_copy(cpus_attach, cpu_possible_mask);
> >> - to = node_possible_map;
> >> } else {
> >> guarantee_online_cpus(cs, cpus_attach);
> >> - guarantee_online_mems(cs, &to);
> >> }
> >> + guarantee_online_mems(cs, &to);
> >>
> >> /* do per-task migration stuff possibly for each in the threadgroup */
> >> cpuset_attach_task(tsk, &to, cs);
> >
> > Do we need to set cpus_attach to cpu_possible_mask? Why won't
> > cpu_active_mask suffice?
>
> If we set cpus_attach to cpu_possible_mask, we needn't do anything for tasks in the top_cpuset when
> doing cpu hotplug. If not, we will have to update cpus_allowed of all tasks in the top_cpuset.
>
Cpu hotplug sets top_cpuset's cpus_allowed to cpu_active_mask by default,
regardless of what was onlined or offlined. cpus_attach in the context of
your patch (in cpuset_attach()) passes cpu_possible_mask to
set_cpus_allowed_ptr() if the task is being attached to top_cpuset, my
question was why don't we pass cpu_active_mask instead? In other words, I
think we should do
cpumask_copy(cpus_attach, cpu_active_mask);
when attached to top_cpuset like my patch did.
on 2010-2-23 16:55, David Rientjes wrote:
> Cpu hotplug sets top_cpuset's cpus_allowed to cpu_active_mask by default,
> regardless of what was onlined or offlined. cpus_attach in the context of
> your patch (in cpuset_attach()) passes cpu_possible_mask to
> set_cpus_allowed_ptr() if the task is being attached to top_cpuset, my
> question was why don't we pass cpu_active_mask instead? In other words, I
> think we should do
>
> cpumask_copy(cpus_attach, cpu_active_mask);
>
> when attached to top_cpuset like my patch did.
If we pass cpu_active_mask to set_cpus_allowed_ptr(), task->cpus_allowed just contains
the online cpus. In this way, if we do cpu hotplug (such as onlining a cpu), we must
update cpus_allowed of all tasks in the top cpuset.
But if we pass cpu_possible_mask, we needn't update cpus_allowed of all tasks in the
top cpuset. And when the kernel looks for a cpu for a task to run on, it will use
cpu_active_mask to filter out offline cpus in task->cpus_allowed. Thus, it is safe.
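In other words, the effective mask at wakeup time is conceptually (illustration only, not the actual scheduler code):

	cpumask_t effective;

	cpumask_and(&effective, &tsk->cpus_allowed, cpu_active_mask);

so any offline cpus left in tsk->cpus_allowed are masked out before a runqueue is chosen.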
On Tue, 23 Feb 2010, Miao Xie wrote:
> > Cpu hotplug sets top_cpuset's cpus_allowed to cpu_active_mask by default,
> > regardless of what was onlined or offlined. cpus_attach in the context of
> > your patch (in cpuset_attach()) passes cpu_possible_mask to
> > set_cpus_allowed_ptr() if the task is being attached to top_cpuset, my
> > question was why don't we pass cpu_active_mask instead? In other words, I
> > think we should do
> >
> > cpumask_copy(cpus_attach, cpu_active_mask);
> >
> > when attached to top_cpuset like my patch did.
>
> If we pass cpu_active_mask to set_cpus_allowed_ptr(), task->cpus_allowed just contains
> the online cpus. In this way, if we do cpu hotplug (such as onlining a cpu), we must
> update cpus_allowed of all tasks in the top cpuset.
>
> But if we pass cpu_possible_mask, we needn't update cpus_allowed of all tasks in the
> top cpuset. And when the kernel looks for a cpu for a task to run on, it will use
> cpu_active_mask to filter out offline cpus in task->cpus_allowed. Thus, it is safe.
>
That is terribly inconsistent between top_cpuset and all descendants; all
other cpusets require that task->cpus_allowed be a subset of
cpu_online_mask, including those descendants that allow all cpus (and all
mems).
on 2010-2-24 6:31, David Rientjes wrote:
> On Tue, 23 Feb 2010, Miao Xie wrote:
>
>>> Cpu hotplug sets top_cpuset's cpus_allowed to cpu_active_mask by default,
>>> regardless of what was onlined or offlined. cpus_attach in the context of
>>> your patch (in cpuset_attach()) passes cpu_possible_mask to
>>> set_cpus_allowed_ptr() if the task is being attached to top_cpuset, my
>>> question was why don't we pass cpu_active_mask instead? In other words, I
>>> think we should do
>>>
>>> cpumask_copy(cpus_attach, cpu_active_mask);
>>>
>>> when attached to top_cpuset like my patch did.
>>
>> If we pass cpu_active_mask to set_cpus_allowed_ptr(), task->cpus_allowed just contains
>> the online cpus. In this way, if we do cpu hotplug (such as onlining a cpu), we must
>> update cpus_allowed of all tasks in the top cpuset.
>>
>> But if we pass cpu_possible_mask, we needn't update cpus_allowed of all tasks in the
>> top cpuset. And when the kernel looks for a cpu for a task to run on, it will use
>> cpu_active_mask to filter out offline cpus in task->cpus_allowed. Thus, it is safe.
>>
>
> That is terribly inconsistent between top_cpuset and all descendants; all
> other cpusets require that task->cpus_allowed be a subset of
> cpu_online_mask, including those descendants that allow all cpus (and all
> mems).
I think it is not a big deal because it is safe and doesn't cause any problem.
Besides that, task->cpus_allowed is initialized to cpu_possible_mask on a no-cpuset
kernel, so using cpu_possible_mask to initialize task->cpus_allowed is reasonable.
(top cpuset is a special cpuset, isn't it?)
on 2010-2-23 16:44, David Rientjes wrote:
> On Tue, 23 Feb 2010, Miao Xie wrote:
>
>> Sorry, could you explain what you're suggesting?
>> I think it is hard to fix this problem by adding a variant, because it is
>> hard to avoid loading a word of the mask before
>>
>> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>>
>> and then loading another word of the mask after
>>
>> tsk->mems_allowed = *newmems;
>>
>> unless we use a lock.
>>
>> Maybe we need a rw-lock to protect task->mems_allowed.
>>
>
> I meant that we need to define synchronization only for configurations
> that do not do atomic nodemask_t stores, it's otherwise unnecessary.
> We'll need to load and store tsk->mems_allowed via a helper function that
> is defined to take the rwlock for such configs and only read/write the
> nodemask for others.
>
By investigating, we found that it is hard to guarantee consistency between
mempolicy and mems_allowed, because mempolicy was designed to be self-updating:
it can only be changed by the task itself. Maybe we must change the implementation of mempolicy.
On Wed, 24 Feb 2010, Miao Xie wrote:
> >> Sorry, could you explain what you're suggesting?
> >> I think it is hard to fix this problem by adding a variant, because it is
> >> hard to avoid loading a word of the mask before
> >>
> >> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> >>
> >> and then loading another word of the mask after
> >>
> >> tsk->mems_allowed = *newmems;
> >>
> >> unless we use a lock.
> >>
> >> Maybe we need a rw-lock to protect task->mems_allowed.
> >>
> >
> > I meant that we need to define synchronization only for configurations
> > that do not do atomic nodemask_t stores, it's otherwise unnecessary.
> > We'll need to load and store tsk->mems_allowed via a helper function that
> > is defined to take the rwlock for such configs and only read/write the
> > nodemask for others.
> >
>
> By investigating, we found that it is hard to guarantee consistency between
> mempolicy and mems_allowed, because mempolicy was designed to be self-updating:
> it can only be changed by the task itself. Maybe we must change the implementation of mempolicy.
>
Before your change, cpuset nodemask changes were serialized on
manage_mutex which would, in turn, serialize the rebinding of each
attached task's mempolicy. update_nodemask() is now serialized on
cgroup_lock(), which also protects scan_for_empty_cpusets(), so the cpuset
code protects it adequately. If a concurrent mempolicy change from a
user's set_mempolicy() happens, however, it could introduce an
inconsistency between them.
If we protect current->mems_allowed with a rwlock or seqlock for configs
where MAX_NUMNODES > BITS_PER_LONG, then we can always guarantee that we
get the entire nodemask. The same problem is present for
current->cpus_allowed, however, with NR_CPUS > BITS_PER_LONG. We must be
able to safely dereference both masks without the chance of returning
nodes_empty() or cpus_empty().
On Wed, 24 Feb 2010, Miao Xie wrote:
> I think it is not a big deal because it is safe and doesn't cause any problem.
> Besides that, task->cpus_allowed is initialized to cpu_possible_mask on a no-cpuset
> kernel, so using cpu_possible_mask to initialize task->cpus_allowed is reasonable.
> (top cpuset is a special cpuset, isn't it?)
>
I'm surprised that I can create a descendant cpuset of top_cpuset that
cannot include all of its parent's cpus, and that the root cpuset's cpus
mask doesn't change when cpus are onlined/offlined.
on 2010-2-25 5:08, David Rientjes wrote:
> On Wed, 24 Feb 2010, Miao Xie wrote:
>
>> I think it is not a big deal because it is safe and doesn't cause any problem.
>> Besides that, task->cpus_allowed is initialized to cpu_possible_mask on a no-cpuset
>> kernel, so using cpu_possible_mask to initialize task->cpus_allowed is reasonable.
>> (top cpuset is a special cpuset, isn't it?)
>>
>
> I'm surprised that I can create a descendant cpuset of top_cpuset that
> cannot include all of its parent's cpus, and that the root cpuset's cpus
> mask doesn't change when cpus are onlined/offlined.
>
The top cpuset's cpus_allowed is consistent with cpu_online_mask because the kernel
changes it when doing cpu hotplug, so the problem you describe doesn't exist.
Just the cpus_allowed of all tasks in the top cpuset is initialized to cpu_possible_mask
in order to avoid updating them when doing cpu hotplug.