2017-07-27 18:15:55

by Michael Bringmann

[permalink] [raw]
Subject: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask


On NUMA systems with dynamic processors, the content of the cpumask
may change over time. As new processors are added via DLPAR operations,
workqueues are created for them. Depending upon the order in which CPUs
are added/removed, we may run into problems with the content of the
cpumask used by the workqueues. This patch deals with situations where
the online cpumask for a node is a proper superset of possible cpumask
for the node. It also deals with edge cases where the order in which
CPUs are removed/added from the online cpumask may leave the set for a
node empty, and require execution by CPUs on another node.

In these and other cases, the patch attempts to ensure that a valid,
usable cpumask is used to set up newly created pools for workqueues.
This patch provides a fix for NUMA systems which can add/subtract
processors dynamically. The patch is expected to be an intermediate
one while developers search for any underlying issues.

[With additions to the patch provided by Tejun Hao <[email protected]>]

Signed-off-by: Michael Bringmann <[email protected]>
---
Changes in V6:
-- Update descriptive text
---
kernel/workqueue.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c74bf39..6b6d540 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3577,6 +3577,13 @@ static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,

/* yeap, return possible CPUs in @node that @attrs wants */
cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
+
+ if (cpumask_empty(cpumask)) {
+ pr_warn_once("WARNING: workqueue cpumask: onl intersect > "
+ "possible intersect\n");
+ return false;
+ }
+
return !cpumask_equal(cpumask, attrs->cpumask);

use_dfl:


2017-07-27 18:31:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask

On Thu, Jul 27, 2017 at 01:15:48PM -0500, Michael Bringmann wrote:
>
> On NUMA systems with dynamic processors, the content of the cpumask
> may change over time. As new processors are added via DLPAR operations,
> workqueues are created for them. Depending upon the order in which CPUs
> are added/removed, we may run into problems with the content of the
> cpumask used by the workqueues. This patch deals with situations where
> the online cpumask for a node is a proper superset of possible cpumask
> for the node. It also deals with edge cases where the order in which
> CPUs are removed/added from the online cpumask may leave the set for a
> node empty, and require execution by CPUs on another node.
>
> In these and other cases, the patch attempts to ensure that a valid,
> usable cpumask is used to set up newly created pools for workqueues.
> This patch provides a fix for NUMA systems which can add/subtract
> processors dynamically. The patch is expected to be an intermediate
> one while developers search for any underlying issues.

Please start with describing what the underlying problem is - CPU <->
NUMA node mapping change on powerpc. The mapping shouldn't change,
not just for workqueue, but because we don't have any kind of
synchronization around the mapping throughout allocation paths. And
then, please describe how this patch can at least prevent immediate
crashes in a lot of cases.

Thanks.

--
tejun

2017-07-27 18:42:21

by Nathan Fontenot

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask

On 07/27/2017 01:15 PM, Michael Bringmann wrote:
>
> On NUMA systems with dynamic processors, the content of the cpumask
> may change over time. As new processors are added via DLPAR operations,
> workqueues are created for them. Depending upon the order in which CPUs
> are added/removed, we may run into problems with the content of the
> cpumask used by the workqueues. This patch deals with situations where
> the online cpumask for a node is a proper superset of possible cpumask
> for the node. It also deals with edge cases where the order in which
> CPUs are removed/added from the online cpumask may leave the set for a
> node empty, and require execution by CPUs on another node.
>
> In these and other cases, the patch attempts to ensure that a valid,
> usable cpumask is used to set up newly created pools for workqueues.
> This patch provides a fix for NUMA systems which can add/subtract
> processors dynamically. The patch is expected to be an intermediate
> one while developers search for any underlying issues.
>
> [With additions to the patch provided by Tejun Hao <[email protected]>]
>
> Signed-off-by: Michael Bringmann <[email protected]>
> ---
> Changes in V6:
> -- Update descriptive text
> ---
> kernel/workqueue.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c74bf39..6b6d540 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3577,6 +3577,13 @@ static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
>
> /* yeap, return possible CPUs in @node that @attrs wants */
> cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> +
> + if (cpumask_empty(cpumask)) {
> + pr_warn_once("WARNING: workqueue cpumask: onl intersect > "
^^^
This message doesn't seem right, or I am missing something, "onl"?

-Nathan

> + "possible intersect\n");
> + return false;
> + }
> +
> return !cpumask_equal(cpumask, attrs->cpumask);
>
> use_dfl:
>

2017-07-27 19:08:01

by Michael Bringmann

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask



On 07/27/2017 01:31 PM, Tejun Heo wrote:
> On Thu, Jul 27, 2017 at 01:15:48PM -0500, Michael Bringmann wrote:
>>
>> On NUMA systems with dynamic processors, the content of the cpumask
>> may change over time. As new processors are added via DLPAR operations,
>> workqueues are created for them. Depending upon the order in which CPUs
>> are added/removed, we may run into problems with the content of the
>> cpumask used by the workqueues. This patch deals with situations where
>> the online cpumask for a node is a proper superset of possible cpumask
>> for the node. It also deals with edge cases where the order in which
>> CPUs are removed/added from the online cpumask may leave the set for a
>> node empty, and require execution by CPUs on another node.
>>
>> In these and other cases, the patch attempts to ensure that a valid,
>> usable cpumask is used to set up newly created pools for workqueues.
>> This patch provides a fix for NUMA systems which can add/subtract
>> processors dynamically. The patch is expected to be an intermediate
>> one while developers search for any underlying issues.
>
> Please start with describing what the underlying problem is - CPU <->
> NUMA node mapping change on powerpc. The mapping shouldn't change,
> not just for workqueue, but because we don't have any kind of
> synchronization around the mapping throughout allocation paths. And
> then, please describe how this patch can at least prevent immediate
> crashes in a lot of cases.

How about this:

The problem lies with the ordering of events with respect to the order in
which we add (or remove) CPUs to NUMA systems, and make use of that knowledge.
The CPUs present are assigned to nodes, and workqueues and their infrastructure
are created to use the CPUs in a node. Workqueues are created at boot time
and updated or created as CPUs are added or removed. However, there is little
or no synchronization or ordering of these events, and the data structures
mapping CPUs to nodes may not be updated before the workqueue infrastructure
is built for a node. Thus we have the possibility of an invalid CPU mask
attribute being attached to a newly created workqueue before the CPUs have
been properly registered and published to a node.

This patch attempts to provide a partial ordering of events within workqueue
by delaying the use of newly calculated CPU masks as the value for a workqueue
attribute until they have valid content. Instead the workqueue code must delay
creating new workqueues until this function succeeds, or it can use a previously
calculated cpumask attribute that is known to be valid.

This patch attempts to ensure that a valid, usable cpumask is used to set up
newly created pools for workqueues. This patch provides a fix for NUMA systems
which can add/subtract processors dynamically. The patch is expected to be an
intermediate one while developers find and correct any underlying issues.

>
> Thanks.
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]

2017-07-27 19:12:57

by Michael Bringmann

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask


>>
>> /* yeap, return possible CPUs in @node that @attrs wants */
>> cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>> +
>> + if (cpumask_empty(cpumask)) {
>> + pr_warn_once("WARNING: workqueue cpumask: onl intersect > "
> ^^^
> This message doesn't seem right, or I am missing something, "onl"?

Shorthand for 'online' in long message.

>
> -Nathan
>
>> + "possible intersect\n");
>> + return false;
>> + }
>> +
>> return !cpumask_equal(cpumask, attrs->cpumask);
>>
>> use_dfl:
>>
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]

2017-07-27 19:24:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask

Hello, Michael.

On Thu, Jul 27, 2017 at 02:07:53PM -0500, Michael Bringmann wrote:
> The problem lies with the ordering of events with respect to the order in
> which we add (or remove) CPUs to NUMA systems, and make use of that knowledge.

Isn't the root cause that the upper layers including workqueue expect
cpu <-> node mapping to be static but powerpc doesn't follow that? I
don't get why ordering matters here.

> The CPUs present are assigned to nodes, and workqueues and their infrastructure
> are created to use the CPUs in a node. Workqueues are created at boot time
> and updated or created as CPUs are added or removed. However, there is little
> or no synchronization or ordering of these events, and the data structures

What I meant was that there's no synchronization construct protection
cpu <-> node mapping. If arch code changes it during hot plug, it's
changing it underneath anybody who might be using that association.

> mapping CPUs to nodes may not be updated before the workqueue infrastructure
> is built for a node. Thus we have the possibility of an invalid CPU mask
> attribute being attached to a newly created workqueue before the CPUs have
> been properly registered and published to a node.
>
> This patch attempts to provide a partial ordering of events within workqueue
> by delaying the use of newly calculated CPU masks as the value for a workqueue
> attribute until they have valid content. Instead the workqueue code must delay
> creating new workqueues until this function succeeds, or it can use a previously
> calculated cpumask attribute that is known to be valid.
>
> This patch attempts to ensure that a valid, usable cpumask is used to set up
> newly created pools for workqueues. This patch provides a fix for NUMA systems
> which can add/subtract processors dynamically. The patch is expected to be an
> intermediate one while developers find and correct any underlying issues.

And what this patch does is adding a bandaid so that we at least don't
crash immediately when this condition triggers until the arch code can
be fixed properly.

Thanks.

--
tejun

2017-07-27 19:24:33

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask

On Thu, Jul 27, 2017 at 02:12:52PM -0500, Michael Bringmann wrote:
>
> >>
> >> /* yeap, return possible CPUs in @node that @attrs wants */
> >> cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> >> +
> >> + if (cpumask_empty(cpumask)) {
> >> + pr_warn_once("WARNING: workqueue cpumask: onl intersect > "
> > ^^^
> > This message doesn't seem right, or I am missing something, "onl"?
>
> Shorthand for 'online' in long message.

Let's please spell it out.

Thanks.

--
tejun

2017-07-27 20:15:52

by Michael Bringmann

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask

Revised text:

There is an underlying assumption in many layers / modules of the Linux
system that CPU <-> node mapping is static. This is despite the presence
of features like NUMA and 'hotplug' that support the dynamic addition/
removal of fundamental system resources like CPUs and memory. PowerPC
systems, however, do provide extensive features for the dynamic change
of resources available to a system.

Currently, there is little or no synchronization protection around the
updating of the CPU <-> node mapping, and the export/update of this
information for other layers / modules. In systems which can change
this mapping during 'hotplug', like PowerPC, the information is changing
underneath all layers that might reference it.

This patch attempts to ensure that a valid, usable cpumask attribute is
used by the workqueue infrastructure when setting up new resource pools.
It prevents a crash that has been observed when an 'empty' cpumask is
passed along to the worker/task scheduling code. It is intended as an
intermediate fix until a more fundamental review and correction of the
issue can be done.

On 07/27/2017 02:24 PM, Tejun Heo wrote:
> Hello, Michael.
>
> On Thu, Jul 27, 2017 at 02:07:53PM -0500, Michael Bringmann wrote:
>> The problem lies with the ordering of events with respect to the order in
>> which we add (or remove) CPUs to NUMA systems, and make use of that knowledge.
>
> Isn't the root cause that the upper layers including workqueue expect
> cpu <-> node mapping to be static but powerpc doesn't follow that? I
> don't get why ordering matters here.
>
>> The CPUs present are assigned to nodes, and workqueues and their infrastructure
>> are created to use the CPUs in a node. Workqueues are created at boot time
>> and updated or created as CPUs are added or removed. However, there is little
>> or no synchronization or ordering of these events, and the data structures
>
> What I meant was that there's no synchronization construct protection
> cpu <-> node mapping. If arch code changes it during hot plug, it's
> changing it underneath anybody who might be using that association.
>
>> mapping CPUs to nodes may not be updated before the workqueue infrastructure
>> is built for a node. Thus we have the possibility of an invalid CPU mask
>> attribute being attached to a newly created workqueue before the CPUs have
>> been properly registered and published to a node.
>>
>> This patch attempts to provide a partial ordering of events within workqueue
>> by delaying the use of newly calculated CPU masks as the value for a workqueue
>> attribute until they have valid content. Instead the workqueue code must delay
>> creating new workqueues until this function succeeds, or it can use a previously
>> calculated cpumask attribute that is known to be valid.
>>
>> This patch attempts to ensure that a valid, usable cpumask is used to set up
>> newly created pools for workqueues. This patch provides a fix for NUMA systems
>> which can add/subtract processors dynamically. The patch is expected to be an
>> intermediate one while developers find and correct any underlying issues.
>
> And what this patch does is adding a bandaid so that we at least don't
> crash immediately when this condition triggers until the arch code can
> be fixed properly.
>
> Thanks.
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
[email protected]

2017-07-27 20:46:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v6] workqueue: Fix edge cases for calc of pool's cpumask

Hello, Michael.

On Thu, Jul 27, 2017 at 03:15:47PM -0500, Michael Bringmann wrote:
> There is an underlying assumption in many layers / modules of the Linux
> system that CPU <-> node mapping is static. This is despite the presence
> of features like NUMA and 'hotplug' that support the dynamic addition/
> removal of fundamental system resources like CPUs and memory. PowerPC
> systems, however, do provide extensive features for the dynamic change
> of resources available to a system.

The text can go as-is but keeping cpu <-> node mapping static is a
trade-off rather than upper layers missing out something. Making cpu
<-> node mapping dynamic means adding complexities and overhead to a
lot hotter paths including memory allocation. It's just a better
trade off to keep the mapping static from arch side even if that means
we have to use more complex mapping in arch and/or allocate more
possible cpus than strictly necessary.

> Currently, there is little or no synchronization protection around the
> updating of the CPU <-> node mapping, and the export/update of this
> information for other layers / modules. In systems which can change
> this mapping during 'hotplug', like PowerPC, the information is changing
> underneath all layers that might reference it.
>
> This patch attempts to ensure that a valid, usable cpumask attribute is
> used by the workqueue infrastructure when setting up new resource pools.
> It prevents a crash that has been observed when an 'empty' cpumask is
> passed along to the worker/task scheduling code. It is intended as an
> intermediate fix until a more fundamental review and correction of the
> issue can be done.

Thanks a lot for your patience! Much appreciated.

--
tejun