If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.
Here is a call trace:
[ 609.824017] Call Trace:
[ 609.824017] <IRQ>
[ 609.824017] [<ffffffff810b0721>] select_fallback_rq+0x71/0x190
[ 609.824017] [<ffffffff810b086e>] ? try_to_wake_up+0x2e/0x2f0
[ 609.824017] [<ffffffff810b0b0b>] try_to_wake_up+0x2cb/0x2f0
[ 609.824017] [<ffffffff8109da08>] ? __run_hrtimer+0x78/0x320
[ 609.824017] [<ffffffff810b0b85>] wake_up_process+0x15/0x20
[ 609.824017] [<ffffffff8109ce62>] hrtimer_wakeup+0x22/0x30
[ 609.824017] [<ffffffff8109da13>] __run_hrtimer+0x83/0x320
[ 609.824017] [<ffffffff8109ce40>] ? update_rmtp+0x80/0x80
[ 609.824017] [<ffffffff8109df56>] hrtimer_interrupt+0x106/0x280
[ 609.824017] [<ffffffff810a72c8>] ? sd_free_ctl_entry+0x68/0x70
[ 609.824017] [<ffffffff8167cf39>] smp_apic_timer_interrupt+0x69/0x99
[ 609.824017] [<ffffffff8167be2f>] apic_timer_interrupt+0x6f/0x80
There is an hrtimer process sleeping whose cpu has already been offlined.
When it is woken up, it tries to find another cpu to run on, and gets a -1 nid.
As a result, cpumask_of_node(-1) returns NULL, and causes a kernel panic.
This patch fixes the problem by checking whether the nid is -1.
If the nid is not -1, a cpu on the same node will be picked.
Otherwise, an online cpu on another node will be picked.
Signed-off-by: Tang Chen <[email protected]>
---
kernel/sched/core.c | 24 +++++++++++++++---------
1 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66b36ab..e76dce9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1263,18 +1263,24 @@ EXPORT_SYMBOL_GPL(kick_process);
*/
static int select_fallback_rq(int cpu, struct task_struct *p)
{
- const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
+ int nid = cpu_to_node(cpu);
+ const struct cpumask *nodemask = NULL;
enum { cpuset, possible, fail } state = cpuset;
int dest_cpu;
- /* Look for allowed, online CPU in same node. */
- for_each_cpu(dest_cpu, nodemask) {
- if (!cpu_online(dest_cpu))
- continue;
- if (!cpu_active(dest_cpu))
- continue;
- if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
- return dest_cpu;
+ /* If the cpu has been offlined, its nid was set to -1. */
+ if (nid != -1) {
+ nodemask = cpumask_of_node(nid);
+
+ /* Look for allowed, online CPU in same node. */
+ for_each_cpu(dest_cpu, nodemask) {
+ if (!cpu_online(dest_cpu))
+ continue;
+ if (!cpu_active(dest_cpu))
+ continue;
+ if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+ return dest_cpu;
+ }
}
for (;;) {
--
1.7.1
On Mon, 8 Oct 2012, Tang Chen wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 66b36ab..e76dce9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1263,18 +1263,24 @@ EXPORT_SYMBOL_GPL(kick_process);
> */
> static int select_fallback_rq(int cpu, struct task_struct *p)
> {
> - const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
> + int nid = cpu_to_node(cpu);
> + const struct cpumask *nodemask = NULL;
> enum { cpuset, possible, fail } state = cpuset;
> int dest_cpu;
>
> - /* Look for allowed, online CPU in same node. */
> - for_each_cpu(dest_cpu, nodemask) {
> - if (!cpu_online(dest_cpu))
> - continue;
> - if (!cpu_active(dest_cpu))
> - continue;
> - if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
> - return dest_cpu;
> + /* If the cpu has been offlined, its nid was set to -1. */
> + if (nid != -1) {
NUMA_NO_NODE.
Eek, the nid shouldn't be -1 yet, though, for cpu hotplug since this
should be called at CPU_DYING level and migrate_tasks() still sees a valid
cpu.
On x86, cpumask_of_node() is always guaranteed to return a valid cpumask
after boot so presumably this is a problem in some non-x86 arch code and
isn't actually a sched problem.
> + nodemask = cpumask_of_node(nid);
> +
> + /* Look for allowed, online CPU in same node. */
> + for_each_cpu(dest_cpu, nodemask) {
> + if (!cpu_online(dest_cpu))
> + continue;
> + if (!cpu_active(dest_cpu))
> + continue;
> + if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
> + return dest_cpu;
> + }
> }
>
> for (;;) {
At 10/09/2012 02:21 PM, David Rientjes Wrote:
> On Mon, 8 Oct 2012, Tang Chen wrote:
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 66b36ab..e76dce9 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1263,18 +1263,24 @@ EXPORT_SYMBOL_GPL(kick_process);
>> */
>> static int select_fallback_rq(int cpu, struct task_struct *p)
>> {
>> - const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));
>> + int nid = cpu_to_node(cpu);
>> + const struct cpumask *nodemask = NULL;
>> enum { cpuset, possible, fail } state = cpuset;
>> int dest_cpu;
>>
>> - /* Look for allowed, online CPU in same node. */
>> - for_each_cpu(dest_cpu, nodemask) {
>> - if (!cpu_online(dest_cpu))
>> - continue;
>> - if (!cpu_active(dest_cpu))
>> - continue;
>> - if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
>> - return dest_cpu;
>> + /* If the cpu has been offlined, its nid was set to -1. */
>> + if (nid != -1) {
>
> NUMA_NO_NODE.
>
> Eek, the nid shouldn't be -1 yet, though, for cpu hotplug since this
> should be called at CPU_DYING level and migrate_tasks() still sees a valid
> cpu.
The cpu's node is set when the cpu is hotplugged (not onlined), and it will
be cleared when the cpu is hot-removed (this patch is in the akpm tree):
https://lkml.org/lkml/2012/9/3/39
I guess the task is in a sleep state when the cpu is offlined, so it isn't
migrated to another cpu.
Thanks
Wen Congyang
>
> On x86, cpumask_of_node() is always guaranteed to return a valid cpumask
> after boot so presumably this is a problem in some non-x86 arch code and
> isn't actually a sched problem.
>
>> + nodemask = cpumask_of_node(nid);
>> +
>> + /* Look for allowed, online CPU in same node. */
>> + for_each_cpu(dest_cpu, nodemask) {
>> + if (!cpu_online(dest_cpu))
>> + continue;
>> + if (!cpu_active(dest_cpu))
>> + continue;
>> + if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
>> + return dest_cpu;
>> + }
>> }
>>
>> for (;;) {
>
Hi David,
Thanks for reviewing this patch. :)
On 10/09/2012 04:34 PM, Wen Congyang wrote:
> At 10/09/2012 02:21 PM, David Rientjes Wrote:
>> On Mon, 8 Oct 2012, Tang Chen wrote:
>>> + /* If the cpu has been offlined, its nid was set to -1. */
>>> + if (nid != -1) {
>>
>> NUMA_NO_NODE.
Yes, NUMA_NO_NODE is better. I'll improve it, thanks. :)
>>
>> Eek, the nid shouldn't be -1 yet, though, for cpu hotplug since this
>> should be called at CPU_DYING level and migrate_tasks() still sees a valid
>> cpu.
As Wen said below, the nid is now set to -1 when the cpu is hot-removed.
I reproduced this problem in this situation:
all cpus are online, and hot remove a system board directly, without
offlining any cpu.
As a result, the removed cpu's nid is set to -1, and this causes
problems.
>
> the cpu's node is set when the cpu is hotpluged(not online), and it will
> be cleared when the cpu is hotremoved(This patch is in akpm tree):
> https://lkml.org/lkml/2012/9/3/39
>
> I guess the task is in sleep state when the cpu is offlined, and it doesn't
> be migrated to another cpu.
>
> Thanks
> Wen Congyang
>
>>
>> On x86, cpumask_of_node() is always guaranteed to return a valid cpumask
>> after boot so presumably this is a problem in some non-x86 arch code and
>> isn't actually a sched problem.
BTW, my box is x86. I think that, for the reason I said above,
cpumask_of_node() will no longer guarantee anything, even on x86.
Thanks. :)
>>
>>> + nodemask = cpumask_of_node(nid);
>>> +
>>> + /* Look for allowed, online CPU in same node. */
>>> + for_each_cpu(dest_cpu, nodemask) {
>>> + if (!cpu_online(dest_cpu))
>>> + continue;
>>> + if (!cpu_active(dest_cpu))
>>> + continue;
>>> + if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
>>> + return dest_cpu;
>>> + }
>>> }
>>>
>>> for (;;) {
>>
>
On Tue, 9 Oct 2012, Tang Chen wrote:
> > > Eek, the nid shouldn't be -1 yet, though, for cpu hotplug since this
> > > should be called at CPU_DYING level and migrate_tasks() still sees a valid
> > > cpu.
>
> As Wen said below, nid is now set to -1 when cpu is hotremoved.
> I reproduce this problem in this situation:
>
> all cpus are online, and hot remove a system board directly, without
> offlining any cpu.
>
> As a result, the removed cpu's nid is set to -1, and this causes
> problems.
>
Let's add Andrew to the cc list then, because I'm nacking
cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in the -mm
tree for this reason.
We can only clear a cpu-to-node mapping when the cpu is completely
offline, not before or during the CPU_DYING stage. Kernel code, such as
the sched code that you are now trying to "fix", depends on this mapping
to work correctly; obviously no audit was done of cpu hotplug code
depending on it before the patch was proposed.
I say "fix" because even this workaround isn't a good solution since it
would be much better to pick another cpu on the same node as the offlining
cpu for the runqueue before falling back to the set of all allowed nodes.
We lose all NUMA affinity information with that patch. There's no reason
why we shouldn't know the node of a cpu that is being offlined.
So nack to cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch.
After it's removed because it's buggy, this "fix" will no longer be
necessary.
At 10/09/2012 06:04 PM, David Rientjes Wrote:
> On Tue, 9 Oct 2012, Tang Chen wrote:
>
>>>> Eek, the nid shouldn't be -1 yet, though, for cpu hotplug since this
>>>> should be called at CPU_DYING level and migrate_tasks() still sees a valid
>>>> cpu.
>>
>> As Wen said below, nid is now set to -1 when cpu is hotremoved.
>> I reproduce this problem in this situation:
>>
>> all cpus are online, and hot remove a system board directly, without
>> offlining any cpu.
>>
>> As a result, the removed cpu's nid is set to -1, and this causes
>> problems.
>>
>
> Let's add Andrew to the cc list then, because I'm nacking
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in the -mm
> tree for this reason.
>
> We can only clear a cpu-to-node mapping when the cpu is completely
> offline, not before or during the CPU_DYING stage. Kernel code, such as
I clear the cpu-to-node mapping when the cpu is hot-removed. If the cpu is online,
it will be offlined before clearing the cpu-to-node mapping.
Here is the code in drivers/acpi/processor_driver.c:
=============
static int acpi_processor_handle_eject(struct acpi_processor *pr)
{
if (cpu_online(pr->id))
cpu_down(pr->id); <========== cpu is offlined here.
arch_unregister_cpu(pr->id);
acpi_unmap_lsapic(pr->id); <========== I clear the mapping here.
return (0);
}
=============
The real problem is that we don't migrate this task to another cpu when
the cpu is offlined. I guess this task is not in the running state, and it
is not on the cpu's runqueue.
Thanks
Wen Congyang
> the sched code that you are now trying to "fix", depends on this mapping
> to work correctly; obviously no audit was done of cpu hotplug code
> depending on it before the patch was proposed.
>
> I say "fix" because even this workaround isn't a good solution since it
> would be much better to pick another cpu on the same node as the offlining
> cpu for the runqueue before falling back to the set of all allowed nodes.
> We lose all NUMA affinity information with that patch. There's no reason
> why we shouldn't know the node of a cpu that is being offlined.
>
> So nack to cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch.
> After it's removed because it's buggy, this "fix" will no longer be
> necessary.
>
This patch tries to fix a bug: the kernel panics after removing a node.
On Mon, 2012-10-08 at 10:59 +0800, Tang Chen wrote:
> If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
> return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
> find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.
Hurm, this is new, right? Who is changing all these semantics without
auditing the tree and informing all affected people?
On Tue, 9 Oct 2012, Wen Congyang wrote:
> I clear cpu-to-node mapping when the cpu is hotremoved. If the cpu is onlined,
> it will be offlined before clearing cpu-to-node mapping.
>
> Here is the code in drivers/acpi/processor_driver.c:
> =============
> static int acpi_processor_handle_eject(struct acpi_processor *pr)
> {
> if (cpu_online(pr->id))
> cpu_down(pr->id); <========== cpu is offlined here.
>
> arch_unregister_cpu(pr->id);
> acpi_unmap_lsapic(pr->id); <========== I clear the mapping here.
> return (0);
> }
> =============
>
> The real problem is that: we don't migrate this task to another cpu when
> the cpu is offlined. I guess this task is not in running state, and it
> is not in the cpu's runqueue.
>
ACPI is not the only way to offline a cpu, this all needs to be
standardized throughout the kernel for each arch so that generic code like
sched can use cpu_to_node() in a CPU_DYING callback. The only way to do
that is by having arch callbacks to set cpu-to-node to NUMA_NO_NODE only
at CPU_DEAD.
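
A minimal sketch of what such a CPU_DEAD-level callback could look like; this
is only an illustration, and arch_clear_cpu_to_node() is a hypothetical
per-arch hook, not an existing interface:
=============
#include <linux/cpu.h>
#include <linux/notifier.h>

/* hypothetical per-arch hook: forget the cpu's node once it is dead */
extern void arch_clear_cpu_to_node(int cpu);

static int numa_cpu_callback(struct notifier_block *nfb,
			     unsigned long action, void *hcpu)
{
	int cpu = (long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_DEAD:
		/*
		 * The cpu is fully dead here; CPU_DYING and migrate_tasks()
		 * ran earlier and still needed a valid cpu_to_node().
		 */
		arch_clear_cpu_to_node(cpu);
		break;
	}
	return NOTIFY_OK;
}
=============
Registered with register_cpu_notifier() from generic hotplug code, a callback
like this would keep cpu_to_node() valid through CPU_DYING, which is what the
scheduler relies on.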
> This patch tries to fix a bug: the kernel panics after removing a node.
>
That's much more subtle than having a NULL pointer dereference like Tang
reports during migration.
On Tue, 9 Oct 2012, Peter Zijlstra wrote:
> On Mon, 2012-10-08 at 10:59 +0800, Tang Chen wrote:
> > If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
> > return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
> > find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.
>
> Hurm,. this is new, right? Who is changing all these semantics without
> auditing the tree and informing all affected people?
>
I've nacked the patch that did it because I think it should be done from
the generic cpu hotplug code only at the CPU_DEAD level with a per-arch
callback to fixup whatever cpu-to-node mappings they maintain since
processes can reenter the scheduler at CPU_DYING.
The whole issue seems to be because alloc_{fair,rt}_sched_group() does an
iteration over all possible cpus (not all online cpus) and does
kzalloc_node() which references a now-offlined node. Changing it to -1
makes the slab code fall back to any online node.
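For reference, the pattern in question looks roughly like this (a simplified
sketch, not the actual sched code; the variable names are made up):
=============
/* walks *possible* cpus, so an offlined cpu's node is dereferenced too */
for_each_possible_cpu(i) {
	/* with the -mm patch, cpu_to_node(i) may now return -1 here */
	data = kzalloc_node(sizeof(*data), GFP_KERNEL, cpu_to_node(i));
	if (!data)
		goto err;
	/* ... */
}
=============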
What I think we need to do instead of hacking only the acpi code and not
standardizing this across the kernel is:
- reset cpu-to-node with a per-arch callback in generic cpu hotplug code
at CPU_DEAD, and
- do an iteration over all possible cpus for node hot-remove ensuring
there are no stale references.
On Tue, 2012-10-09 at 13:36 -0700, David Rientjes wrote:
> On Tue, 9 Oct 2012, Peter Zijlstra wrote:
>
> > On Mon, 2012-10-08 at 10:59 +0800, Tang Chen wrote:
> > > If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
> > > return -1. As a result, cpumask_of_node(nid) will return NULL. In this case,
> > > find_next_bit() in for_each_cpu will get a NULL pointer and cause panic.
> >
> > Hurm,. this is new, right? Who is changing all these semantics without
> > auditing the tree and informing all affected people?
> >
>
> I've nacked the patch that did it because I think it should be done from
> the generic cpu hotplug code only at the CPU_DEAD level with a per-arch
> callback to fixup whatever cpu-to-node mappings they maintain since
> processes can reenter the scheduler at CPU_DYING.
Well the code they were patching is in the wakeup path. As I think Tang
said, we leave !runnable tasks on whatever cpu they ran on last, even if
that cpu is offlined, we try and fix up state when we get a wakeup.
On wakeup, it tries to find a cpu to run on and will try a cpu of the
same node first.
Now if that node's entirely gone away, it appears the cpu_to_node() map
will not return a valid node number.
I think that's a change in behaviour, it didn't used to do that afaik.
Certainly this code hasn't changed in a while.
> The whole issue seems to be because alloc_{fair,rt}_sched_group() does an
> iteration over all possible cpus (not all online cpus) and does
> kzalloc_node() which references a now-offlined node. Changing it to -1
> makes the slab code fallback to any online node.
Right, that's because the rq structures are assumed always present. What
I cannot remember is why I'm not using per-cpu allocations there,
because that's exactly what it looks like it wants to be.
> What I think we need to do instead of hacking only the acpi code and not
> standardizing this across the kernel is:
Right, what I don't understand is wtf ACPI has to do with anything. We
have plenty of cpu hotplug code, and ACPI isn't involved in any of that last
time I checked.
> - reset cpu-to-node with a per-arch callback in generic cpu hotplug code
> at CPU_DEAD, and
>
> - do an iteration over all possible cpus for node hot-remove ensuring
> there are no stale references.
Why do we need to clear cpu-to-node maps? Are we going to change the
topology at runtime? What are you going to do with per-cpu stuff?
Per-cpu memory isn't freed on hotplug, so its node relation is static.
/me confused..
On Tue, 9 Oct 2012, Peter Zijlstra wrote:
> Well the code they were patching is in the wakeup path. As I think Tang
> said, we leave !runnable tasks on whatever cpu they ran on last, even if
> that cpu is offlined, we try and fix up state when we get a wakeup.
>
> On wakeup, it tries to find a cpu to run on and will try a cpu of the
> same node first.
>
> Now if that node's entirely gone away, it appears the cpu_to_node() map
> will not return a valid node number.
>
> I think that's a change in behaviour, it didn't used to do that afaik.
> Certainly this code hasn't change in a while.
>
If cpu_to_node() always returns a valid node id even if all cpus on the
node are offline, then the cpumask_of_node() implementation, which the
sched code is using, should either return an empty cpumask (if
node_to_cpumask_map[nid] isn't freed) or cpu_online_mask. The change in
behavior here occurred because
cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't
return a valid node id and forces it to return -1 so a kzalloc_node(...,
-1) falls back to allocating anywhere.
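To illustrate the shape of that (a sketch only, not the actual x86
implementation):
=============
/* sketch: degrade gracefully for an invalid nid instead of returning NULL */
static inline const struct cpumask *cpumask_of_node(int node)
{
	if (node == NUMA_NO_NODE)
		return cpu_online_mask;		/* or an empty cpumask */
	return node_to_cpumask_map[node];
}
=============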
But if you only need cpu_to_node() when waking up to find a runnable cpu
for this NUMA information, then I think you can just change the
kzalloc_node() in alloc_{fair,rt}_sched_group() to do
kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).
[ The changelog here is confusing because it's fixing a problem in
linux-next without saying so. ]
At 10/10/2012 07:27 AM, David Rientjes Wrote:
> On Tue, 9 Oct 2012, Peter Zijlstra wrote:
>
>> Well the code they were patching is in the wakeup path. As I think Tang
>> said, we leave !runnable tasks on whatever cpu they ran on last, even if
>> that cpu is offlined, we try and fix up state when we get a wakeup.
>>
>> On wakeup, it tries to find a cpu to run on and will try a cpu of the
>> same node first.
>>
>> Now if that node's entirely gone away, it appears the cpu_to_node() map
>> will not return a valid node number.
>>
>> I think that's a change in behaviour, it didn't used to do that afaik.
>> Certainly this code hasn't change in a while.
>>
>
> If cpu_to_node() always returns a valid node id even if all cpus on the
> node are offline, then the cpumask_of_node() implementation, which the
> sched code is using, should either return an empty cpumask (if
> node_to_cpumask_map[nid] isn't freed) or cpu_online_mask. The change in
> behavior here occurred because
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't
> return a valid node id and forces it to return -1 so a kzalloc_node(...,
> -1) fallsback to allocate anywhere.
>
> But if you only need cpu_to_node() when waking up to find a runnable cpu
> for this NUMA information, then I think you can just change the
> kzalloc_node() in alloc_{fair,rt}_sched_group() to do
> kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).
>
> [ The changelog here is confusing because it's fixing a problem in
> linux-next without saying so. ]
>
I don't agree with this way, because it only fixes the code which causes the
problem, and we can't say there are no similar problems elsewhere. That is
why I clear the cpu-to-node mapping.
What about the following solution:
1. clear the cpu-to-node mapping when the node is offlined
2. Tang's patch is still necessary because we leave !runnable tasks on
whatever cpu they ran on last. If the cpu's node is NUMA_NO_NODE, it means
the entire node is offlined, and we must migrate the task to another
node.
Thanks
Wen Congyang
At 10/10/2012 10:06 AM, Wen Congyang Wrote:
> At 10/10/2012 07:27 AM, David Rientjes Wrote:
>> On Tue, 9 Oct 2012, Peter Zijlstra wrote:
>>
>>> Well the code they were patching is in the wakeup path. As I think Tang
>>> said, we leave !runnable tasks on whatever cpu they ran on last, even if
>>> that cpu is offlined, we try and fix up state when we get a wakeup.
>>>
>>> On wakeup, it tries to find a cpu to run on and will try a cpu of the
>>> same node first.
>>>
>>> Now if that node's entirely gone away, it appears the cpu_to_node() map
>>> will not return a valid node number.
>>>
>>> I think that's a change in behaviour, it didn't used to do that afaik.
>>> Certainly this code hasn't change in a while.
>>>
>>
>> If cpu_to_node() always returns a valid node id even if all cpus on the
>> node are offline, then the cpumask_of_node() implementation, which the
>> sched code is using, should either return an empty cpumask (if
>> node_to_cpumask_map[nid] isn't freed) or cpu_online_mask. The change in
>> behavior here occurred because
>> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't
>> return a valid node id and forces it to return -1 so a kzalloc_node(...,
>> -1) fallsback to allocate anywhere.
>>
>> But if you only need cpu_to_node() when waking up to find a runnable cpu
>> for this NUMA information, then I think you can just change the
>> kzalloc_node() in alloc_{fair,rt}_sched_group() to do
>> kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).
>>
>> [ The changelog here is confusing because it's fixing a problem in
>> linux-next without saying so. ]
>>
>
> I don't agree with this way. Because it only fix the code which causes a
> problem, and we can't say there is no any similar problem. So it is
> why I clear the cpu-to-node mapping.
>
> What about the following solution:
> 1. clear the cpu-to-node mapping when the node is offlined
There is no interface to online/offline a node. We online a node only
when cpu/memory in it is onlined, and offline it when all cpu/memory in
this node is offlined (TODO).
So if we clear the mapping when the node is offlined, we may need to set
the cpu-to-node mapping again when the cpu is onlined. But we don't know the cpu's node.
Thanks
Wen Congyang
> 2. tang's patch is still necessary because we leave !runnable tasks on
> whatever cpu they ran on last. If cpu's node is NUMA_NO_NODE, it means
> the entire node is offlined, and we must migrate the task to the other
> node.
>
> Thanks
> Wen Congyang
On Tue, 2012-10-09 at 16:27 -0700, David Rientjes wrote:
> On Tue, 9 Oct 2012, Peter Zijlstra wrote:
>
> > Well the code they were patching is in the wakeup path. As I think Tang
> > said, we leave !runnable tasks on whatever cpu they ran on last, even if
> > that cpu is offlined, we try and fix up state when we get a wakeup.
> >
> > On wakeup, it tries to find a cpu to run on and will try a cpu of the
> > same node first.
> >
> > Now if that node's entirely gone away, it appears the cpu_to_node() map
> > will not return a valid node number.
> >
> > I think that's a change in behaviour, it didn't used to do that afaik.
> > Certainly this code hasn't change in a while.
> >
>
> If cpu_to_node() always returns a valid node id even if all cpus on the
> node are offline, then the cpumask_of_node() implementation, which the
> sched code is using, should either return an empty cpumask (if
> node_to_cpumask_map[nid] isn't freed) or cpu_online_mask. The change in
> behavior here occurred because
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't
> return a valid node id and forces it to return -1 so a kzalloc_node(...,
> -1) fallsback to allocate anywhere.
I think that's broken semantics.. so far the entire cpu<->node mapping
was invariant during hotplug. Changing that is going to be _very_
interesting and cannot be done lightly.
Because as I said, per-cpu memory is preserved over hotplug, and that
has numa affinity.
So for now, let me NACK that patch. You cannot go change stuff like
that.
>
> But if you only need cpu_to_node() when waking up to find a runnable cpu
> for this NUMA information, then I think you can just change the
> kzalloc_node() in alloc_{fair,rt}_sched_group() to do
> kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).
That's a confusing statement, the wakeup stuff and the
alloc_{fair,rt}_sched_group() stuff are unrelated, although both sites
might need fixing if we're going to go ahead with this.
At 10/10/2012 05:10 PM, Peter Zijlstra Wrote:
> On Tue, 2012-10-09 at 16:27 -0700, David Rientjes wrote:
>> On Tue, 9 Oct 2012, Peter Zijlstra wrote:
>>
>>> Well the code they were patching is in the wakeup path. As I think Tang
>>> said, we leave !runnable tasks on whatever cpu they ran on last, even if
>>> that cpu is offlined, we try and fix up state when we get a wakeup.
>>>
>>> On wakeup, it tries to find a cpu to run on and will try a cpu of the
>>> same node first.
>>>
>>> Now if that node's entirely gone away, it appears the cpu_to_node() map
>>> will not return a valid node number.
>>>
>>> I think that's a change in behaviour, it didn't used to do that afaik.
>>> Certainly this code hasn't change in a while.
>>>
>>
>> If cpu_to_node() always returns a valid node id even if all cpus on the
>> node are offline, then the cpumask_of_node() implementation, which the
>> sched code is using, should either return an empty cpumask (if
>> node_to_cpumask_map[nid] isn't freed) or cpu_online_mask. The change in
>> behavior here occurred because
>> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't
>> return a valid node id and forces it to return -1 so a kzalloc_node(...,
>> -1) fallsback to allocate anywhere.
>
> I think that's broken semantics.. so far the entire cpu<->node mapping
> was invariant during hotplug. Changing that is going to be _very_
> interesting and cannot be done lightly.
>
> Because as I said, per-cpu memory is preserved over hotplug, and that
> has numa affinity.
Hmm, if per-cpu memory is preserved, we can't offline and remove
this memory, so we can't offline the node.
But if the node is hot-added and per-cpu memory doesn't use the
memory on this node, we can hot-remove the cpu/memory on this node, and then
offline this node.
Before the cpu is hot-added, the cpu's node is -1. We set the cpu<->node mapping
when it is hot-added. So the entire cpu<->node mapping was not invariant
during hotplug.
That is why I try to clear it when the cpu is hot-removed.
As we need the mapping to migrate a task to a cpu on the same node first,
I think we can clear the mapping when the node is offlined.
Thanks
Wen Congyang
>
> So for now, let me NACK that patch. You cannot go change stuff like
> that.
>
>>
>> But if you only need cpu_to_node() when waking up to find a runnable cpu
>> for this NUMA information, then I think you can just change the
>> kzalloc_node() in alloc_{fair,rt}_sched_group() to do
>> kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).
>
> That's a confusing statement, the wakeup stuff and the
> alloc_{fair,rt}_sched_group() stuff are unrelated, although both sites
> might need fixing if we're going to go ahead with this.
>
On Wed, 2012-10-10 at 17:33 +0800, Wen Congyang wrote:
>
> Hmm, if per-cpu memory is preserved, and we can't offline and remove
> this memory. So we can't offline the node.
>
> But, if the node is hot added, and per-cpu memory doesn't use the
> memory on this node. We can hotremove cpu/memory on this node, and then
> offline this node.
>
> Before the cpu is hotadded, cpu's node is -1. We set cpu<->node mapping
> when it is hotadded. So the entire cpu<->node mapping was not invariant
> during hotplug.
>
> So it is why I try to clear it when the cpu is hot-removed.
>
> As we need the mapping to migrate a task to the cpu on the same node first,
> I think we can clear the mapping when the node is offlined.
Hmm maybe, but hardware that can hot-add is rare and nobody has it so
nobody cares ;-)
But by clearing cpu_to_node on every hotplug you change semantics for
all hardware and everybody gets to feel the pain.
I'm not saying you cannot change things, I'm only saying you should be
far more careful about it, not change it and wait for things to break.
Put in some effort to find things that might break and warn people --
sure, you'll always miss some, and that's ok.
On Wed, 2012-10-10 at 18:10 +0800, Wen Congyang wrote:
> I used ./scripts/get_maintainer.pl, and it didn't tell me that I should cc
> you when I posted that patch.
That script doesn't look at all usage sites of the code you modify, does
it?
You need to audit the entire tree for usage of the interfaces you change
and email all relevant people for whom you're planning to break stuff
-- preferably with a proposed fix.
That's manual work, no script will ever do this for you.
At 10/10/2012 05:51 PM, Peter Zijlstra Wrote:
> On Wed, 2012-10-10 at 17:33 +0800, Wen Congyang wrote:
>>
>> Hmm, if per-cpu memory is preserved, and we can't offline and remove
>> this memory. So we can't offline the node.
>>
>> But, if the node is hot added, and per-cpu memory doesn't use the
>> memory on this node. We can hotremove cpu/memory on this node, and then
>> offline this node.
>>
>> Before the cpu is hotadded, cpu's node is -1. We set cpu<->node mapping
>> when it is hotadded. So the entire cpu<->node mapping was not invariant
>> during hotplug.
>>
>> So it is why I try to clear it when the cpu is hot-removed.
>>
>> As we need the mapping to migrate a task to the cpu on the same node first,
>> I think we can clear the mapping when the node is offlined.
>
> Hmm maybe, but hardware that can hot-add is rare and nobody has it so
> nobody cares ;-)
Yes, nobody cares about it now. But we have such hardware, so I care about it now.
>
> But by clearing cpu_to_node on every hotplug you change semantics for
> all hardware and everybody gets to feel the pain.
>
> I'm not saying you cannot change things, I'm only saying you should be
> far more careful about it, not change it and wait for things to break.
> Put in some effort to find things that might break and warn people --
> sure, you'll always miss some, and that's ok.
I used ./scripts/get_maintainer.pl, and it didn't tell me that I should cc
you when I posted that patch.
On Wed, 10 Oct 2012, Peter Zijlstra wrote:
> > If cpu_to_node() always returns a valid node id even if all cpus on the
> > node are offline, then the cpumask_of_node() implementation, which the
> > sched code is using, should either return an empty cpumask (if
> > node_to_cpumask_map[nid] isn't freed) or cpu_online_mask. The change in
> > behavior here occurred because
> > cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch in -mm doesn't
> > return a valid node id and forces it to return -1 so a kzalloc_node(...,
> > -1) fallsback to allocate anywhere.
>
> I think that's broken semantics.. so far the entire cpu<->node mapping
> was invariant during hotplug. Changing that is going to be _very_
> interesting and cannot be done lightly.
>
> Because as I said, per-cpu memory is preserved over hotplug, and that
> has numa affinity.
>
> So for now, let me NACK that patch. You cannot go change stuff like
> that.
>
Agreed, that makes the nack-count up to 2 now. Andrew, please remove
cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch
cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch
from -mm.
> > But if you only need cpu_to_node() when waking up to find a runnable cpu
> > for this NUMA information, then I think you can just change the
> > kzalloc_node() in alloc_{fair,rt}_sched_group() to do
> > kzalloc(..., cpu_online(cpu) ? cpu_to_node(cpu) : NUMA_NO_NODE).
>
> That's a confusing statement, the wakeup stuff and the
> alloc_{fair,rt}_sched_group() stuff are unrelated, although both sites
> might need fixing if we're going to go ahead with this.
>
The alternative is for node hot-remove to do an iteration of all possible
cpus and set cpu-to-node to be NUMA_NO_NODE for all offlined cpus that map
to that node. If cpu_online() is true for any of those cpus, then
obviously it can't be offlined. We want to do this so that
kzalloc_node(..., cpu_to_node()) falls back to allocating from any node,
which it should, and because a subsequent node hot-add event that reuses
the same node id may not be the same node.
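Roughly along these lines (a sketch only; set_cpu_numa_node() is used here for
illustration, the actual hook may differ per arch):
=============
/* at node hot-remove, drop the mapping for every cpu left on that node */
static void unmap_cpus_on_node(int nid)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		if (cpu_to_node(cpu) != nid)
			continue;
		/* an online cpu would have prevented the node removal */
		WARN_ON(cpu_online(cpu));
		set_cpu_numa_node(cpu, NUMA_NO_NODE);
	}
}
=============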
On Wed, 10 Oct 2012 13:30:29 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> > So for now, let me NACK that patch. You cannot go change stuff like
> > that.
> >
>
> Agreed, that makes the nack-count up to 2 now. Andrew, please remove
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch
> from -mm.
Nope. It fixes a BUG() and so I'll be keeping it around until I see a
better fix. It's one of the ways in which I prevent things from falling
through cracks.
On Wed, 10 Oct 2012, Andrew Morton wrote:
> > > So for now, let me NACK that patch. You cannot go change stuff like
> > > that.
> > >
> >
> > Agreed, that makes the nack-count up to 2 now. Andrew, please remove
> > cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch
> > cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch
> > from -mm.
>
> Nope. It fixes a BUG() and so I'll be keeping it around until I see a
> better fix. It's one of the ways in which I prevent things from falling
> through cracks.
>
It fixes a BUG() that only affects users who are doing node hot-remove,
which is still radically under development and which nobody cares about except
those on the cc list, but it also introduces the NULL pointer dereference
that this patch is attempting to address. The "fix" that causes
this NULL pointer is clearly not the direction we want to go; I think we
have agreement at node hot-remove to iterate all possible cpus and map all
offline cpus with cpu_to_node(cpu) == node to NUMA_NO_NODE instead in the
generic hotplug code.
Regardless, this shouldn't be touching the acpi code which
cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch and
cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch do since
it makes the behavior inconsistent across interfaces and architectures.
On Wed, 10 Oct 2012, David Rientjes wrote:
> It fixes a BUG() that only affects users who are doing node hot-remove,
> which is still radically under development, and nobody cares about except
> those on the cc list, but it also introduces the NULL pointer dereference
> that is attempting to be addressed in this patch. The "fix" that causes
> this NULL pointer is clearly not the direction we want to go, I think we
> have agreement at node hot-remove to iterate all possible cpus and map all
> offline cpus with cpu_to_node(cpu) == node to NUMA_NO_NODE instead in the
> generic hotplug code.
>
> Regardless, this shouldn't be touching the acpi code which
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch and
> cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch do since
> it makes the behavior inconsistent across interfaces and architectures.
>
Ok, so it's been a week and these patches are still in -mm. This is what
I was afraid of: patches that both Peter and I nacked sitting in -mm,
allowing a NULL pointer dereference because no alternative patch exists yet
to fix the issue correctly.
Tang and Wen, are you intending on addressing these problems (i.e. not
touching the acpi code at all and rather clearing cpu-to-node mappings at
node hot-remove) as we've discussed or do I need to do it myself?
Thanks.
On 10/18/2012 08:52 AM, David Rientjes wrote:
> On Wed, 10 Oct 2012, David Rientjes wrote:
>
> Ok, so it's been a week and these patches are still in -mm. This is what
> I was afraid of: patches that both Peter and I nacked sitting in -mm and
> allow a NULL pointer dereference because no alternative patch exists yet
> to fix the issue correctly.
>
> Tang and Wen, are you intending on addressing these problems (i.e. not
> touching the acpi code at all and rather clearing cpu-to-node mappings at
> node hot-remove) as we've discussed or do I need to do it myself?
Hi David,
We are working on this problem. Since it is complicated, it really
takes us some time. Sorry for the delay. :)
Actually, we intend to clear cpu-to-node mappings when a whole node is
removed. But the node hot-plug code is still under development, so I
think Wen will send a fix patch soon. :)
Thanks.
>
> Thanks.
>
On Thu, 18 Oct 2012, Tang Chen wrote:
> We are working on this problem. Since it is complicated, it really
> takes us some time. Sorry for the delay. :)
>
> Actually, we intend to clear cpu-to-node mappings when a whole node is
> removed. But the node hot-plug code is still under development, so I
> think Wen will send a fix patch soon. :)
>
Ok, thanks for the update. I agree that we should be clearing the mapping
at node hot-remove since any cpu that would subsequently get onlined and
assume one of the previous cpu's ids is not guaranteed to have the same
affinity.
I'm just really hoping that we don't touch the acpi code and that we can
remove both cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch
and cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch from
-mm.
Thanks again!
On Wed, 2012-10-17 at 20:29 -0700, David Rientjes wrote:
>
> Ok, thanks for the update. I agree that we should be clearing the mapping
> at node hot-remove since any cpu that would subsequently get onlined and
> assume one of the previous cpu's ids is not guaranteed to have the same
> affinity.
Would this mean we have to remap (and memcpy) per-cpu memory on
node-plug?
> I'm just really hoping that we don't touch the acpi code and that we can
> remove both cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved.patch
> and cpu_hotplug-unmap-cpu2node-when-the-cpu-is-hotremoved-fix.patch from
> -mm.
Yeah, none of this should be anywhere near ACPI, it's got nothing to do
with ACPI. Furthermore it should be the same across all archs, not just
be weird and wonderful for x86.