2017-08-21 13:49:57

by Laurent Vivier

Subject: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
is built.

Unfortunately, on powerpc, the firmware is only able to provide the
node of a CPU if the CPU is present. So, for a possible but not
present CPU, the CPU id is known but the node id is not, and all the
unplugged CPUs are attached to node 0.

When a CPU is hotplugged, there can be an inconsistency between the
node id known by the workqueue and the real node id of the CPU.

This patch adds a helper, called from the hotplug path, that adds the
hotplugged CPU to the node entry in the wq_numa_possible_cpumask array.

Signed-off-by: Laurent Vivier <[email protected]>
---
 arch/powerpc/kernel/smp.c |  1 +
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        | 21 +++++++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d3320562c70..abc533146514 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -964,6 +964,7 @@ void start_secondary(void *unused)
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+	wq_numa_add_possible_cpu(cpu);
 
 	smp_wmb();
 	notify_cpu_starting(cpu);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9dc0482..4ef030dae22c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -644,6 +644,7 @@ int workqueue_online_cpu(unsigned int cpu);
 int workqueue_offline_cpu(unsigned int cpu);
 #endif
 
+void wq_numa_add_possible_cpu(unsigned int cpu);
 int __init workqueue_init_early(void);
 int __init workqueue_init(void);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0c3a96..d1a99e25d5da 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5486,6 +5486,27 @@ static inline void wq_watchdog_init(void) { }
 
 #endif	/* CONFIG_WQ_WATCHDOG */
 
+void wq_numa_add_possible_cpu(unsigned int cpu)
+{
+	int node;
+
+	if (num_possible_nodes() <= 1)
+		return;
+
+	if (wq_disable_numa)
+		return;
+
+	if (!wq_numa_enabled)
+		return;
+
+	node = cpu_to_node(cpu);
+	if (node == NUMA_NO_NODE)
+		return;
+
+	cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+EXPORT_SYMBOL_GPL(wq_numa_add_possible_cpu);
+
 static void __init wq_numa_init(void)
 {
 	cpumask_var_t *tbl;
--
2.13.5


2017-08-21 13:49:59

by Laurent Vivier

Subject: [PATCH 2/2] blk-mq: don't use WORK_CPU_UNBOUND

cpumask is the list of CPUs that were present when the queue was built.
If a new CPU is hotplugged, this list is not updated, and when the
scheduler asks for a CPU id, blk_mq_hctx_next_cpu() can return
WORK_CPU_UNBOUND. In that case __blk_mq_run_hw_queue() can be executed
by the new CPU (which is not present in cpumask) and trips the
following warning:

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
	cpu_online(hctx->next_cpu));

To fix this problem, this patch modifies blk_mq_hctx_next_cpu()
to only return a CPU id present in cpumask.

Signed-off-by: Laurent Vivier <[email protected]>
---
block/blk-mq.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..bdac1e654814 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1126,9 +1126,6 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
-	if (hctx->queue->nr_hw_queues == 1)
-		return WORK_CPU_UNBOUND;
-
 	if (--hctx->next_cpu_batch <= 0) {
 		int next_cpu;

--
2.13.5

2017-08-21 14:48:41

by Tejun Heo

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
> is built.
>
> Unfortunately, on powerpc, the firmware is only able to provide the
> node of a CPU if the CPU is present. So, for a possible but not
> present CPU, the CPU id is known but the node id is not, and all the
> unplugged CPUs are attached to node 0.

This is something powerpc needs to fix. Workqueue isn't the only one
making this assumption. mm as a whole assumes that CPU <-> node
mapping is stable regardless of hotplug events. Powerpc people know
about the issue and AFAIK are working on it.

Thanks.

--
tejun

2017-08-21 14:49:03

by Tejun Heo

Subject: Re: [PATCH 2/2] blk-mq: don't use WORK_CPU_UNBOUND

On Mon, Aug 21, 2017 at 03:49:51PM +0200, Laurent Vivier wrote:
> cpumask is the list of CPUs that were present when the queue was built.
> If a new CPU is hotplugged, this list is not updated, and when the
> scheduler asks for a CPU id, blk_mq_hctx_next_cpu() can return
> WORK_CPU_UNBOUND. In that case __blk_mq_run_hw_queue() can be executed
> by the new CPU (which is not present in cpumask) and trips the
> following warning:
>
> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> 	cpu_online(hctx->next_cpu));
>
> To fix this problem, this patch modifies blk_mq_hctx_next_cpu()
> to only return a CPU id present in cpumask.

So, this one isn't needed either.

Thanks.

--
tejun

2017-08-22 01:41:45

by Michael Ellerman

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

Tejun Heo <[email protected]> writes:

> On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
>> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
>> is built.
>>
>> Unfortunately, on powerpc, the firmware is only able to provide the
>> node of a CPU if the CPU is present. So, for a possible but not
>> present CPU, the CPU id is known but the node id is not, and all the
>> unplugged CPUs are attached to node 0.
>
> This is something powerpc needs to fix.

There is no way for us to fix it.

At boot, for possible but not present CPUs, we have no way of knowing
the CPU <-> node mapping, firmware simply doesn't tell us.

> Workqueue isn't the only one making this assumption. mm as a whole
> assumes that CPU <-> node mapping is stable regardless of hotplug
> events.

At least in this case I don't think the mapping changes; it's just that
we don't know the mapping at boot.

Currently we have to report possible but not present CPUs as belonging
to node 0, because otherwise we trip this helpful piece of code:

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		if (WARN_ON(node == NUMA_NO_NODE)) {
			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
			/* happens iff arch is bonkers, let's just proceed */
			return;
		}

But if we remove that, we could then accurately report NUMA_NO_NODE at
boot, and then update the mapping when the CPU is hotplugged.
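
Something like this, maybe (completely untested, just to illustrate the
idea - the validation loop and the later loop that fills in tbl[] are
collapsed into one):

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		if (node == NUMA_NO_NODE)
			continue;	/* updated when the CPU is hotplugged */
		cpumask_set_cpu(cpu, tbl[node]);
	}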

cheers

2017-08-22 16:54:44

by Tejun Heo

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

Hello, Michael.

On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
> > This is something powerpc needs to fix.
>
> There is no way for us to fix it.

I don't think that's true. The CPU id used in kernel doesn't have to
match the physical one and arch code should be able to pre-map CPU IDs
to nodes and use the matching one when hotplugging CPUs. I'm not
saying that's the best way to solve the problem tho. It could be that
the best way forward is making cpu <-> node mapping dynamic and
properly synchronized. However, please note that that does mean we
mess up node affinity for things like per-cpu memory which are
allocated before the cpu comes up, so there are some inherent benefits
to keeping the mapping static even if that involves indirection.

> > Workqueue isn't the only one making this assumption. mm as a whole
> > assumes that CPU <-> node mapping is stable regardless of hotplug
> > events.
>
> At least in this case I don't think the mapping changes; it's just that
> we don't know the mapping at boot.
>
> Currently we have to report possible but not present CPUs as belonging
> to node 0, because otherwise we trip this helpful piece of code:
>
> 	for_each_possible_cpu(cpu) {
> 		node = cpu_to_node(cpu);
> 		if (WARN_ON(node == NUMA_NO_NODE)) {
> 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 			/* happens iff arch is bonkers, let's just proceed */
> 			return;
> 		}
>
> But if we remove that, we could then accurately report NUMA_NO_NODE at
> boot, and then update the mapping when the CPU is hotplugged.

If you think that making this dynamic is the right way to go, I have
no objection but we should be doing this properly instead of patching
up what seems to be crashing right now. What synchronization and
notification mechanisms do we need to make cpu <-> node mapping
dynamic? Do we need any synchronization in memory allocation paths?
If not, why would it be safe?

Thanks.

--
tejun

2017-08-23 11:00:46

by Michael Ellerman

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

Hi Tejun,

Tejun Heo <[email protected]> writes:
> Hello, Michael.
>
> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>> > This is something powerpc needs to fix.
>>
>> There is no way for us to fix it.
>
> I don't think that's true. The CPU id used in kernel doesn't have to
> match the physical one and arch code should be able to pre-map CPU IDs
> to nodes and use the matching one when hotplugging CPUs. I'm not
> saying that's the best way to solve the problem tho.

We already virtualise the CPU numbers, but not the node IDs. And it's
the node IDs that are really the problem.

So yeah I guess we might be able to make that work, but I'd have to
think about it a bit more.

> It could be that the best way forward is making cpu <-> node mapping
> dynamic and properly synchronized.

We don't need it to be dynamic (at least for this bug).

Laurent is booting Qemu with a fixed CPU <-> Node mapping; it's just
that because some CPUs aren't present at boot we don't know what the
node mapping is. (Correct me if I'm wrong, Laurent.)

So all we need is:
- the workqueue code to cope with CPUs that are possible but not online
having NUMA_NO_NODE to begin with.
- a way to update the workqueue cpumask when the CPU comes online.


Which seems reasonable to me?

cheers

2017-08-23 11:17:42

by Laurent Vivier

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

On 23/08/2017 13:00, Michael Ellerman wrote:
> Hi Tejun,
>
> Tejun Heo <[email protected]> writes:
>> Hello, Michael.
>>
>> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>>>> This is something powerpc needs to fix.
>>>
>>> There is no way for us to fix it.
>>
>> I don't think that's true. The CPU id used in kernel doesn't have to
>> match the physical one and arch code should be able to pre-map CPU IDs
>> to nodes and use the matching one when hotplugging CPUs. I'm not
>> saying that's the best way to solve the problem tho.
>
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.
>
> So yeah I guess we might be able to make that work, but I'd have to
> think about it a bit more.
>
>> It could be that the best way forward is making cpu <-> node mapping
>> dynamic and properly synchronized.
>
> We don't need it to be dynamic (at least for this bug).
>
> Laurent is booting Qemu with a fixed CPU <-> Node mapping; it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong, Laurent.)

You're correct.

Qemu is started with:

-numa node,cpus=0-1 -numa node,cpus=2-3 \
-smp 2,maxcpus=4,cores=4,threads=1,sockets=1

Which means we have 2 nodes with cpu ids 0 and 1 on node 0, and cpu ids
2 and 3 on node 1, but at boot only 2 CPUs are present.

The problem I'm trying to fix with this series shows up when we hotplug
a third CPU, with id 2, on node 1.

Thanks,
Laurent

2017-08-23 13:26:49

by Tejun Heo

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

Hello, Michael.

On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
> > I don't think that's true. The CPU id used in kernel doesn't have to
> > match the physical one and arch code should be able to pre-map CPU IDs
> > to nodes and use the matching one when hotplugging CPUs. I'm not
> > saying that's the best way to solve the problem tho.
>
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.

Yeah, it just needs to match up new cpus to the cpu ids assigned to
the right node.

> > It could be that the best way forward is making cpu <-> node mapping
> > dynamic and properly synchronized.
>
> We don't need it to be dynamic (at least for this bug).

The node mapping for that cpu id changes *dynamically* while the
system is running and that can race with node-affinity sensitive
operations such as memory allocations.

> Laurent is booting Qemu with a fixed CPU <-> Node mapping; it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong, Laurent.)
>
> So all we need is:
> - the workqueue code to cope with CPUs that are possible but not online
> having NUMA_NO_NODE to begin with.
> - a way to update the workqueue cpumask when the CPU comes online.
>
> Which seems reasonable to me?

Please take a step back and think through the problem again. You
can't bandaid it this way.

Thanks.

--
tejun

2017-08-24 12:10:36

by Laurent Vivier

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

On 23/08/2017 15:26, Tejun Heo wrote:
> Hello, Michael.
>
> On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
>>> I don't think that's true. The CPU id used in kernel doesn't have to
>>> match the physical one and arch code should be able to pre-map CPU IDs
>>> to nodes and use the matching one when hotplugging CPUs. I'm not
>>> saying that's the best way to solve the problem tho.
>>
>> We already virtualise the CPU numbers, but not the node IDs. And it's
>> the node IDs that are really the problem.
>
> Yeah, it just needs to match up new cpus to the cpu ids assigned to
> the right node.

We are not able to assign the cpu ids to the right node before the CPU
is present, because the firmware doesn't provide the CPU <-> node id
mapping before that.

>>> It could be that the best way forward is making cpu <-> node mapping
>>> dynamic and properly synchronized.
>>
>> We don't need it to be dynamic (at least for this bug).
>
> The node mapping for that cpu id changes *dynamically* while the
> system is running and that can race with node-affinity sensitive
> operations such as memory allocations.

Memory is mapped to the node through its own firmware entry, so I don't
think a cpu id change can affect memory affinity, and before we know the
node id of a CPU, that CPU is not present and thus can't use memory.

>> Laurent is booting Qemu with a fixed CPU <-> Node mapping; it's just
>> that because some CPUs aren't present at boot we don't know what the
>> node mapping is. (Correct me if I'm wrong, Laurent.)
>>
>> So all we need is:
>> - the workqueue code to cope with CPUs that are possible but not online
>> having NUMA_NO_NODE to begin with.
>> - a way to update the workqueue cpumask when the CPU comes online.
>>
>> Which seems reasonable to me?
>
> Please take a step back and think through the problem again. You
> can't bandaid it this way.

Could you give some ideas or proposals?
As the firmware doesn't provide the information before the CPU is
actually plugged in, I really don't know how to manage this problem.

Thanks,
Laurent

2017-08-24 13:51:35

by Tejun Heo

Subject: Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

Hello, Laurent.

On Thu, Aug 24, 2017 at 02:10:31PM +0200, Laurent Vivier wrote:
> > Yeah, it just needs to match up new cpus to the cpu ids assigned to
> > the right node.
>
> We are not able to assign the cpu ids to the right node before the CPU
> is present, because the firmware doesn't provide the CPU <-> node id
> mapping before that.

What I meant was to assign the matching CPU ID when the CPU becomes
present - i.e. keep CPU IDs available for the different nodes and
assign one to the new CPU according to its node mapping when it
actually comes up. Please note that I'm not saying this is the way to
go, just that it is a solvable problem from the arch code.

> > The node mapping for that cpu id changes *dynamically* while the
> > system is running and that can race with node-affinity sensitive
> > operations such as memory allocations.
>
> Memory is mapped to the node through its own firmware entry, so I don't
> think a cpu id change can affect memory affinity, and before we know the
> node id of a CPU, that CPU is not present and thus can't use memory.

The latter part isn't true. For example, percpu memory gets allocated
for all possible CPUs according to their node affinity, so the memory
node association change which happens when the CPU comes up for the
first time can race against such allocations. I don't know whether
that's actually problematic but we don't have *any* synchronization
around it. If you think it's safe to have such races, please explain
why that is.
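
To make the ordering concrete, in pseudo-code (heavily simplified, not
the actual percpu allocator; pcpu_area is made up):

	/* boot: CPU not present yet, powerpc reports node 0 for it */
	for_each_possible_cpu(cpu)
		pcpu_area[cpu] = alloc_pages_node(cpu_to_node(cpu),
						  GFP_KERNEL, order);

	/* hotplug: cpu_to_node(cpu) now returns the real node, but
	 * nothing goes back and fixes up the allocation above */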

> > Please take a step back and think through the problem again. You
> > can't bandaid it this way.
>
> Could you give some ideas or proposals?
> As the firmware doesn't provide the information before the CPU is
> actually plugged in, I really don't know how to manage this problem.

There are two possible approaches, I think.

1. Make physical cpu -> logical cpu mapping indirect so that the
   kernel's cpu ID assignment is always on the right numa node. This
   may mean that the kernel has to keep more possible CPUs around than
   necessary, but it does have the benefit that all memory allocations
   are affine to the right node (rough sketch below).

2. Make cpu <-> node mapping properly dynamic. Identify what sort of
   synchronization we'd need around the mapping changing dynamically.
   Note that we might not need much, but it'll most likely need some.
   Build synchronization and notification infrastructure around it.
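
For #1, the arch code could do something along these lines (hand-wavy
and untested, all names made up):

	/*
	 * Logical cpu id -> node, fixed once at boot by reserving a
	 * block of ids for each possible node - this is why the kernel
	 * may need to keep more possible CPUs around than necessary.
	 */
	static int logical_cpu_node[NR_CPUS];

	static int pick_logical_cpu_id(int nid)
	{
		int cpu;

		/* hand out a spare logical id already mapped to @nid */
		for_each_possible_cpu(cpu)
			if (!cpu_present(cpu) && logical_cpu_node[cpu] == nid)
				return cpu;
		return -ENOSPC;
	}

The hotplug path would then register the incoming physical cpu under
the returned logical id, so cpu_to_node() never changes for a given
logical cpu.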

Thanks.

--
tejun