LinuxLists.cc - [PATCH] power, sched: stop updating inside arch_update_cpu

2014-04-02 03:30:11

Subject: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

During the testing, we encounter below WARN followed by Oops:

WARNING: at kernel/sched/core.c:6218
...
NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
LR [c000000000101358] .build_sched_domains+0xec8/0x1200
PACATMSCRATCH [800000000000f032]
Call Trace:
[c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
PACATMSCRATCH [8000000000029032]
Call Trace:
[c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
[c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...

This was caused by that 'sd->groups == NULL' after building groups, which
was caused by the empty 'sd->span'.

The cpu's domain contain nothing because the cpu was assigned to wrong
node inside arch_update_cpu_topology() by calling update_lookup_table()
with the uninitialized param, in the case when there is nothing to be
update.

Thus we should stop the updating in such cases, this patch will achieve
this and fix the issue.

CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Nathan Fontenot <[email protected]>
CC: Stephen Rothwell <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Robert Jennings <[email protected]>
CC: Jesse Larrew <[email protected]>
CC: "Srivatsa S. Bhat" <[email protected]>
CC: Alistair Popple <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
arch/powerpc/mm/numa.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 30a42e2..6757690 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
cpu = cpu_last_thread_sibling(cpu);
}

+ /*
+ * The 'cpu_associativity_changes_mask' could be cleared if
+ * all the cpus it indicates won't change their node, in
+ * which case the 'updated_cpus' will be empty.
+ */
+ if (!cpumask_weight(&updated_cpus))
+ goto out;
+
stop_machine(update_cpu_topology, &updates[0], &updated_cpus);

/*
@@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
changed = 1;
}

+out:
kfree(updates);
return changed;
}
--
1.7.9.5

2014-04-03 08:51:21

by Srivatsa S. Bhat

[permalink] [raw]

Subject: Re: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

On 04/02/2014 08:59 AM, Michael wang wrote:
> During the testing, we encounter below WARN followed by Oops:
>
> WARNING: at kernel/sched/core.c:6218
> ...
> NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
> LR [c000000000101358] .build_sched_domains+0xec8/0x1200
> PACATMSCRATCH [800000000000f032]
> Call Trace:
> [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
> Oops: Kernel access of bad area, sig: 11 [#1]
> ...
> NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
> LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> PACATMSCRATCH [8000000000029032]
> Call Trace:
> [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
> [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
>
> This was caused by that 'sd->groups == NULL' after building groups, which
> was caused by the empty 'sd->span'.
>
> The cpu's domain contain nothing because the cpu was assigned to wrong
> node inside arch_update_cpu_topology() by calling update_lookup_table()
> with the uninitialized param, in the case when there is nothing to be
> update.
>

I think we need to verify this theory. Here are my thoughts:

If we get a topology update notification from the hypervisor, and then the
update happens to be nothing new (as in, new_node == old_node for the given
cpu), then updated_cpus mask will not have any cpus set in it (like you noted
below). In that scenario, the following sequence will take place:

Inside arch_cpu_update_topology(), the 'updates' has been kzalloc()'ed.
So updates[0] will have all its fields set to 0, such as ->cpu, ->new_nid etc.

When we invoke stop_machine() the first time like this:

stop_machine(update_cpu_topology, &updates[0], &updated_cpus);

stop-machine will notice that updated_cpus mask does not have any cpus set
in it, so it will nominate cpumask_first(cpu_online_mask) to run the
update_cpu_topology() function, which means that CPU0 will run it.

So, when CPU0 runs update_cpu_topology(), it will find that cpu == update->cpu
since both are 0, and will invoke unmap_cpu_from_node() and then calls
map_cpu_to_node(), with update->new_nid == 0.

Now, the interesting thing to note here is that, if CPU0's node was already
set as node0, *nothing* should go wrong, since its just a redundant update.
However, if CPU0's original node mapping was something different, or if
node0 doesn't even exist in the machine, then the system can crash.

Have you verified that CPU0's node mapping is different from node 0?
That is, boot the kernel with "numa=debug" in the kernel command line and
it will print out the cpu-to-node associativity during boot. That way you
can figure out what was the original associativity that was set. This will
confirm the theory that the hypervisor sent a redundant update, but because
of the weird pre-allocation using kzalloc that we do inside
arch_update_cpu_topology(), we wrongly updated CPU0's mapping as CPU0 <-> Node0.

Regards,
Srivatsa S. Bhat

> Thus we should stop the updating in such cases, this patch will achieve
> this and fix the issue.
>
> CC: Benjamin Herrenschmidt <[email protected]>
> CC: Paul Mackerras <[email protected]>
> CC: Nathan Fontenot <[email protected]>
> CC: Stephen Rothwell <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Robert Jennings <[email protected]>
> CC: Jesse Larrew <[email protected]>
> CC: "Srivatsa S. Bhat" <[email protected]>
> CC: Alistair Popple <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> arch/powerpc/mm/numa.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 30a42e2..6757690 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
> cpu = cpu_last_thread_sibling(cpu);
> }
>
> + /*
> + * The 'cpu_associativity_changes_mask' could be cleared if
> + * all the cpus it indicates won't change their node, in
> + * which case the 'updated_cpus' will be empty.
> + */
> + if (!cpumask_weight(&updated_cpus))
> + goto out;
> +
> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>
> /*
> @@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
> changed = 1;
> }
>
> +out:
> kfree(updates);
> return changed;
> }
>

2014-04-04 03:49:08

by Michael wang

[permalink] [raw]

Subject: Re: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

Hi, Srivatsa

Thanks for your reply :)

On 04/03/2014 04:50 PM, Srivatsa S. Bhat wrote:
[snip]
>
> Now, the interesting thing to note here is that, if CPU0's node was already
> set as node0, *nothing* should go wrong, since its just a redundant update.
> However, if CPU0's original node mapping was something different, or if
> node0 doesn't even exist in the machine, then the system can crash.

By printk I confirmed all cpus was belong to node 1 at very beginning,
and things become magically after the wrong updating...

>
> Have you verified that CPU0's node mapping is different from node 0?
> That is, boot the kernel with "numa=debug" in the kernel command line and
> it will print out the cpu-to-node associativity during boot. That way you
> can figure out what was the original associativity that was set. This will
> confirm the theory that the hypervisor sent a redundant update, but because
> of the weird pre-allocation using kzalloc that we do inside
> arch_update_cpu_topology(), we wrongly updated CPU0's mapping as CPU0 <-> Node0.

Associativity should changes, otherwise we won't continue the updating,
and empty updates[] was confirmed to show up inside
arch_update_cpu_topology().

What I can't make sure is whether this is legal, notify changes but no
changes happen sounds weird...however, even if it's legal, a check in
here still make sense IMHO.

Regards,
Michael Wang

>
>
> Regards,
> Srivatsa S. Bhat
>
>> Thus we should stop the updating in such cases, this patch will achieve
>> this and fix the issue.
>>
>> CC: Benjamin Herrenschmidt <[email protected]>
>> CC: Paul Mackerras <[email protected]>
>> CC: Nathan Fontenot <[email protected]>
>> CC: Stephen Rothwell <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: Robert Jennings <[email protected]>
>> CC: Jesse Larrew <[email protected]>
>> CC: "Srivatsa S. Bhat" <[email protected]>
>> CC: Alistair Popple <[email protected]>
>> Signed-off-by: Michael Wang <[email protected]>
>> ---
>> arch/powerpc/mm/numa.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 30a42e2..6757690 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
>> cpu = cpu_last_thread_sibling(cpu);
>> }
>>
>> + /*
>> + * The 'cpu_associativity_changes_mask' could be cleared if
>> + * all the cpus it indicates won't change their node, in
>> + * which case the 'updated_cpus' will be empty.
>> + */
>> + if (!cpumask_weight(&updated_cpus))
>> + goto out;
>> +
>> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>>
>> /*
>> @@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
>> changed = 1;
>> }
>>
>> +out:
>> kfree(updates);
>> return changed;
>> }
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2014-04-07 09:51:47

by Srivatsa S. Bhat

[permalink] [raw]

Subject: Re: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

Hi Michael,

On 04/04/2014 09:18 AM, Michael wang wrote:
> Hi, Srivatsa
>
> Thanks for your reply :)
>
> On 04/03/2014 04:50 PM, Srivatsa S. Bhat wrote:
> [snip]
>>
>> Now, the interesting thing to note here is that, if CPU0's node was already
>> set as node0, *nothing* should go wrong, since its just a redundant update.
>> However, if CPU0's original node mapping was something different, or if
>> node0 doesn't even exist in the machine, then the system can crash.
>
> By printk I confirmed all cpus was belong to node 1 at very beginning,
> and things become magically after the wrong updating...
>

Ok, thanks!

>>
>> Have you verified that CPU0's node mapping is different from node 0?
>> That is, boot the kernel with "numa=debug" in the kernel command line and
>> it will print out the cpu-to-node associativity during boot. That way you
>> can figure out what was the original associativity that was set. This will
>> confirm the theory that the hypervisor sent a redundant update, but because
>> of the weird pre-allocation using kzalloc that we do inside
>> arch_update_cpu_topology(), we wrongly updated CPU0's mapping as CPU0 <-> Node0.
>
> Associativity should changes, otherwise we won't continue the updating,
> and empty updates[] was confirmed to show up inside
> arch_update_cpu_topology().
>

Ah, ok, that makes it very clear. So, I agree that your patch is correct,
but I think the comment in your patch can be enhanced a bit. I'll suggest
something if I manage to come up with a better wording.

> What I can't make sure is whether this is legal, notify changes but no
> changes happen sounds weird...however, even if it's legal, a check in
> here still make sense IMHO.
>

That looks like a bug in the hypervisor/firmware. But the Linux kernel should
be able to handle such NULL updates without crashing. So yes, your patch makes
sense to me.

Thank you!

Regards,
Srivatsa S. Bhat

>>
>>> Thus we should stop the updating in such cases, this patch will achieve
>>> this and fix the issue.
>>>
>>> CC: Benjamin Herrenschmidt <[email protected]>
>>> CC: Paul Mackerras <[email protected]>
>>> CC: Nathan Fontenot <[email protected]>
>>> CC: Stephen Rothwell <[email protected]>
>>> CC: Andrew Morton <[email protected]>
>>> CC: Robert Jennings <[email protected]>
>>> CC: Jesse Larrew <[email protected]>
>>> CC: "Srivatsa S. Bhat" <[email protected]>
>>> CC: Alistair Popple <[email protected]>
>>> Signed-off-by: Michael Wang <[email protected]>
>>> ---
>>> arch/powerpc/mm/numa.c | 9 +++++++++
>>> 1 file changed, 9 insertions(+)
>>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 30a42e2..6757690 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
>>> cpu = cpu_last_thread_sibling(cpu);
>>> }
>>>
>>> + /*
>>> + * The 'cpu_associativity_changes_mask' could be cleared if
>>> + * all the cpus it indicates won't change their node, in
>>> + * which case the 'updated_cpus' will be empty.
>>> + */
>>> + if (!cpumask_weight(&updated_cpus))
>>> + goto out;
>>> +
>>> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>>>
>>> /*
>>> @@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
>>> changed = 1;
>>> }
>>>
>>> +out:
>>> kfree(updates);
>>> return changed;
>>> }
>>>
>>

2014-04-07 10:16:14

by Srivatsa S. Bhat

[permalink] [raw]

Subject: Re: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

Hi Michael,

On 04/02/2014 08:59 AM, Michael wang wrote:
> During the testing, we encounter below WARN followed by Oops:
>
> WARNING: at kernel/sched/core.c:6218
> ...
> NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
> LR [c000000000101358] .build_sched_domains+0xec8/0x1200
> PACATMSCRATCH [800000000000f032]
> Call Trace:
> [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
> Oops: Kernel access of bad area, sig: 11 [#1]
> ...
> NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
> LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> PACATMSCRATCH [8000000000029032]
> Call Trace:
> [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
> [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
>
> This was caused by that 'sd->groups == NULL' after building groups, which
> was caused by the empty 'sd->span'.
>
> The cpu's domain contain nothing because the cpu was assigned to wrong
> node inside arch_update_cpu_topology() by calling update_lookup_table()
> with the uninitialized param, in the case when there is nothing to be
> update.
>

Can you reword the above paragraph to something like this:

The cpu's domain contained nothing because the cpu was assigned to a wrong
node, due to the following unfortunate sequence of events:

1. The hypervisor sent a topology update to the guest OS, to notify changes
to the cpu-node mapping. However, the update was actually redundant - i.e.,
the "new" mapping was exactly the same as the old one.

2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
the 'for-loop' in arch_update_cpu_topology().

3. So we ended up calling stop-machine() with an empty cpumask list, which made
stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
the cpu to run the payload (the update_cpu_topology() function).

4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
finds update->cpu as well as update->new_nid to be 0. In other words, we
end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.

This causes the sched-domain rebuild code to break and crash the system.

> Thus we should stop the updating in such cases, this patch will achieve
> this and fix the issue.
>

We can reword this part as:

Fix this by skipping the topology update in cases where we find that
the topology has not actually changed in reality (ie., spurious updates).

> CC: Benjamin Herrenschmidt <[email protected]>
> CC: Paul Mackerras <[email protected]>
> CC: Nathan Fontenot <[email protected]>
> CC: Stephen Rothwell <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Robert Jennings <[email protected]>
> CC: Jesse Larrew <[email protected]>
> CC: "Srivatsa S. Bhat" <[email protected]>
> CC: Alistair Popple <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> arch/powerpc/mm/numa.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 30a42e2..6757690 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
> cpu = cpu_last_thread_sibling(cpu);
> }
>
> + /*
> + * The 'cpu_associativity_changes_mask' could be cleared if
> + * all the cpus it indicates won't change their node, in
> + * which case the 'updated_cpus' will be empty.
> + */

How about rewording the comment like this:

In cases where we have nothing to update (because the updates list
is too short or because the new topology is same as the old one),
skip invoking update_cpu_topology() via stop-machine(). This is
necessary (and not just a fast-path optimization) because stop-machine
can end up electing a random CPU to run update_cpu_topology(), and
thus trick us into setting up incorrect cpu-node mappings (since
'updates' is kzalloc()'ed).

Regards,
Srivatsa S. Bhat

> + if (!cpumask_weight(&updated_cpus))
> + goto out;
> +
> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>
> /*
> @@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
> changed = 1;
> }
>
> +out:
> kfree(updates);
> return changed;
> }
>

2014-04-08 02:39:43

by Michael wang

[permalink] [raw]

Subject: Re: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

Hi, Srivatsa

It's nice to have you confirmed the fix, and thanks for the well-writing
comments, will apply them and send out the new patch later :)

Regards,
Michael Wang

On 04/07/2014 06:15 PM, Srivatsa S. Bhat wrote:
> Hi Michael,
>
> On 04/02/2014 08:59 AM, Michael wang wrote:
>> During the testing, we encounter below WARN followed by Oops:
>>
>> WARNING: at kernel/sched/core.c:6218
>> ...
>> NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
>> LR [c000000000101358] .build_sched_domains+0xec8/0x1200
>> PACATMSCRATCH [800000000000f032]
>> Call Trace:
>> [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
>> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
>> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
>> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
>> ...
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> ...
>> NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
>> LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
>> PACATMSCRATCH [8000000000029032]
>> Call Trace:
>> [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
>> [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
>> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
>> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
>> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
>> ...
>>
>> This was caused by that 'sd->groups == NULL' after building groups, which
>> was caused by the empty 'sd->span'.
>>
>> The cpu's domain contain nothing because the cpu was assigned to wrong
>> node inside arch_update_cpu_topology() by calling update_lookup_table()
>> with the uninitialized param, in the case when there is nothing to be
>> update.
>>
>
> Can you reword the above paragraph to something like this:
>
> The cpu's domain contained nothing because the cpu was assigned to a wrong
> node, due to the following unfortunate sequence of events:
>
> 1. The hypervisor sent a topology update to the guest OS, to notify changes
> to the cpu-node mapping. However, the update was actually redundant - i.e.,
> the "new" mapping was exactly the same as the old one.
>
> 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
> the 'for-loop' in arch_update_cpu_topology().
>
> 3. So we ended up calling stop-machine() with an empty cpumask list, which made
> stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
> the cpu to run the payload (the update_cpu_topology() function).
>
> 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
> is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
> finds update->cpu as well as update->new_nid to be 0. In other words, we
> end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
>
> This causes the sched-domain rebuild code to break and crash the system.
>
>
>> Thus we should stop the updating in such cases, this patch will achieve
>> this and fix the issue.
>>
>
> We can reword this part as:
>
> Fix this by skipping the topology update in cases where we find that
> the topology has not actually changed in reality (ie., spurious updates).
>
>> CC: Benjamin Herrenschmidt <[email protected]>
>> CC: Paul Mackerras <[email protected]>
>> CC: Nathan Fontenot <[email protected]>
>> CC: Stephen Rothwell <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: Robert Jennings <[email protected]>
>> CC: Jesse Larrew <[email protected]>
>> CC: "Srivatsa S. Bhat" <[email protected]>
>> CC: Alistair Popple <[email protected]>
>> Signed-off-by: Michael Wang <[email protected]>
>> ---
>> arch/powerpc/mm/numa.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 30a42e2..6757690 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
>> cpu = cpu_last_thread_sibling(cpu);
>> }
>>
>> + /*
>> + * The 'cpu_associativity_changes_mask' could be cleared if
>> + * all the cpus it indicates won't change their node, in
>> + * which case the 'updated_cpus' will be empty.
>> + */
>
> How about rewording the comment like this:
>
> In cases where we have nothing to update (because the updates list
> is too short or because the new topology is same as the old one),
> skip invoking update_cpu_topology() via stop-machine(). This is
> necessary (and not just a fast-path optimization) because stop-machine
> can end up electing a random CPU to run update_cpu_topology(), and
> thus trick us into setting up incorrect cpu-node mappings (since
> 'updates' is kzalloc()'ed).
>
> Regards,
> Srivatsa S. Bhat
>
>> + if (!cpumask_weight(&updated_cpus))
>> + goto out;
>> +
>> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>>
>> /*
>> @@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
>> changed = 1;
>> }
>>
>> +out:
>> kfree(updates);
>> return changed;
>> }
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2014-04-08 03:19:48

by Michael wang

[permalink] [raw]

Subject: [PATCH v2] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

Since v1:
Edited the comment according to Srivatsa's suggestion.

During the testing, we encounter below WARN followed by Oops:

WARNING: at kernel/sched/core.c:6218
...
NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
LR [c000000000101358] .build_sched_domains+0xec8/0x1200
PACATMSCRATCH [800000000000f032]
Call Trace:
[c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
PACATMSCRATCH [8000000000029032]
Call Trace:
[c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
[c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...

This was caused by that 'sd->groups == NULL' after building groups, which
was caused by the empty 'sd->span'.

The cpu's domain contained nothing because the cpu was assigned to a wrong
node, due to the following unfortunate sequence of events:

1. The hypervisor sent a topology update to the guest OS, to notify changes
to the cpu-node mapping. However, the update was actually redundant - i.e.,
the "new" mapping was exactly the same as the old one.

2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
the 'for-loop' in arch_update_cpu_topology().

3. So we ended up calling stop-machine() with an empty cpumask list, which made
stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
the cpu to run the payload (the update_cpu_topology() function).

4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
finds update->cpu as well as update->new_nid to be 0. In other words, we
end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.

Along with the following wrong updating, it causes the sched-domain rebuild
code to break and crash the system.

Fix this by skipping the topology update in cases where we find that
the topology has not actually changed in reality (ie., spurious updates).

CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Nathan Fontenot <[email protected]>
CC: Stephen Rothwell <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Robert Jennings <[email protected]>
CC: Jesse Larrew <[email protected]>
CC: "Srivatsa S. Bhat" <[email protected]>
CC: Alistair Popple <[email protected]>
Suggested-by: "Srivatsa S. Bhat" <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
arch/powerpc/mm/numa.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 30a42e2..4ebbb9e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1591,6 +1591,20 @@ int arch_update_cpu_topology(void)
cpu = cpu_last_thread_sibling(cpu);
}

+ /*
+ * In cases where we have nothing to update (because the updates list
+ * is too short or because the new topology is same as the old one),
+ * skip invoking update_cpu_topology() via stop-machine(). This is
+ * necessary (and not just a fast-path optimization) since stop-machine
+ * can end up electing a random CPU to run update_cpu_topology(), and
+ * thus trick us into setting up incorrect cpu-node mappings (since
+ * 'updates' is kzalloc()'ed).
+ *
+ * And for the similar reason, we will skip all the following updating.
+ */
+ if (!cpumask_weight(&updated_cpus))
+ goto out;
+
stop_machine(update_cpu_topology, &updates[0], &updated_cpus);

/*
@@ -1612,6 +1626,7 @@ int arch_update_cpu_topology(void)
changed = 1;
}

+out:
kfree(updates);
return changed;
}
--
1.7.9.5

2014-04-08 08:25:22

by Srivatsa S. Bhat

[permalink] [raw]

Subject: Re: [PATCH v2] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

On 04/08/2014 08:49 AM, Michael wang wrote:
> Since v1:
> Edited the comment according to Srivatsa's suggestion.
>
> During the testing, we encounter below WARN followed by Oops:
>
> WARNING: at kernel/sched/core.c:6218
> ...
> NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
> LR [c000000000101358] .build_sched_domains+0xec8/0x1200
> PACATMSCRATCH [800000000000f032]
> Call Trace:
> [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
> Oops: Kernel access of bad area, sig: 11 [#1]
> ...
> NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
> LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> PACATMSCRATCH [8000000000029032]
> Call Trace:
> [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
> [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
>
> This was caused by that 'sd->groups == NULL' after building groups, which
> was caused by the empty 'sd->span'.
>
> The cpu's domain contained nothing because the cpu was assigned to a wrong
> node, due to the following unfortunate sequence of events:
>
> 1. The hypervisor sent a topology update to the guest OS, to notify changes
> to the cpu-node mapping. However, the update was actually redundant - i.e.,
> the "new" mapping was exactly the same as the old one.
>
> 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
> the 'for-loop' in arch_update_cpu_topology().
>
> 3. So we ended up calling stop-machine() with an empty cpumask list, which made
> stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
> the cpu to run the payload (the update_cpu_topology() function).
>
> 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
> is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
> finds update->cpu as well as update->new_nid to be 0. In other words, we
> end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
>
> Along with the following wrong updating, it causes the sched-domain rebuild
> code to break and crash the system.
>
> Fix this by skipping the topology update in cases where we find that
> the topology has not actually changed in reality (ie., spurious updates).
>
> CC: Benjamin Herrenschmidt <[email protected]>
> CC: Paul Mackerras <[email protected]>
> CC: Nathan Fontenot <[email protected]>
> CC: Stephen Rothwell <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Robert Jennings <[email protected]>
> CC: Jesse Larrew <[email protected]>
> CC: "Srivatsa S. Bhat" <[email protected]>
> CC: Alistair Popple <[email protected]>
> Suggested-by: "Srivatsa S. Bhat" <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---

Reviewed-by: Srivatsa S. Bhat <[email protected]>

Regards,
Srivatsa S. Bhat

> arch/powerpc/mm/numa.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 30a42e2..4ebbb9e 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1591,6 +1591,20 @@ int arch_update_cpu_topology(void)
> cpu = cpu_last_thread_sibling(cpu);
> }
>
> + /*
> + * In cases where we have nothing to update (because the updates list
> + * is too short or because the new topology is same as the old one),
> + * skip invoking update_cpu_topology() via stop-machine(). This is
> + * necessary (and not just a fast-path optimization) since stop-machine
> + * can end up electing a random CPU to run update_cpu_topology(), and
> + * thus trick us into setting up incorrect cpu-node mappings (since
> + * 'updates' is kzalloc()'ed).
> + *
> + * And for the similar reason, we will skip all the following updating.
> + */
> + if (!cpumask_weight(&updated_cpus))
> + goto out;
> +
> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>
> /*
> @@ -1612,6 +1626,7 @@ int arch_update_cpu_topology(void)
> changed = 1;
> }
>
> +out:
> kfree(updates);
> return changed;
> }
>

2014-04-16 03:15:10

by Michael wang

[permalink] [raw]

Subject: Re: [PATCH v2] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update

On 04/08/2014 11:19 AM, Michael wang wrote:
> Since v1:
> Edited the comment according to Srivatsa's suggestion.
>
> During the testing, we encounter below WARN followed by Oops:

Is there any more comments on this issue? Should we apply this fix?

Regards,
Michael Wang

>
> WARNING: at kernel/sched/core.c:6218
> ...
> NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
> LR [c000000000101358] .build_sched_domains+0xec8/0x1200
> PACATMSCRATCH [800000000000f032]
> Call Trace:
> [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
> Oops: Kernel access of bad area, sig: 11 [#1]
> ...
> NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
> LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> PACATMSCRATCH [8000000000029032]
> Call Trace:
> [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
> [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
> ...
>
> This was caused by that 'sd->groups == NULL' after building groups, which
> was caused by the empty 'sd->span'.
>
> The cpu's domain contained nothing because the cpu was assigned to a wrong
> node, due to the following unfortunate sequence of events:
>
> 1. The hypervisor sent a topology update to the guest OS, to notify changes
> to the cpu-node mapping. However, the update was actually redundant - i.e.,
> the "new" mapping was exactly the same as the old one.
>
> 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
> the 'for-loop' in arch_update_cpu_topology().
>
> 3. So we ended up calling stop-machine() with an empty cpumask list, which made
> stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
> the cpu to run the payload (the update_cpu_topology() function).
>
> 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
> is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
> finds update->cpu as well as update->new_nid to be 0. In other words, we
> end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
>
> Along with the following wrong updating, it causes the sched-domain rebuild
> code to break and crash the system.
>
> Fix this by skipping the topology update in cases where we find that
> the topology has not actually changed in reality (ie., spurious updates).
>
> CC: Benjamin Herrenschmidt <[email protected]>
> CC: Paul Mackerras <[email protected]>
> CC: Nathan Fontenot <[email protected]>
> CC: Stephen Rothwell <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Robert Jennings <[email protected]>
> CC: Jesse Larrew <[email protected]>
> CC: "Srivatsa S. Bhat" <[email protected]>
> CC: Alistair Popple <[email protected]>
> Suggested-by: "Srivatsa S. Bhat" <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> arch/powerpc/mm/numa.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 30a42e2..4ebbb9e 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1591,6 +1591,20 @@ int arch_update_cpu_topology(void)
> cpu = cpu_last_thread_sibling(cpu);
> }
>
> + /*
> + * In cases where we have nothing to update (because the updates list
> + * is too short or because the new topology is same as the old one),
> + * skip invoking update_cpu_topology() via stop-machine(). This is
> + * necessary (and not just a fast-path optimization) since stop-machine
> + * can end up electing a random CPU to run update_cpu_topology(), and
> + * thus trick us into setting up incorrect cpu-node mappings (since
> + * 'updates' is kzalloc()'ed).
> + *
> + * And for the similar reason, we will skip all the following updating.
> + */
> + if (!cpumask_weight(&updated_cpus))
> + goto out;
> +
> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>
> /*
> @@ -1612,6 +1626,7 @@ int arch_update_cpu_topology(void)
> changed = 1;
> }
>
> +out:
> kfree(updates);
> return changed;
> }
>