2019-11-04 10:29:42

by Zhangshaokun

Subject: [PATCH v2] lib: optimize cpumask_local_spread()

From: yuqi jin <[email protected]>

On a multi-processor NUMA system, an I/O device may serve CPUs that
span multiple NUMA nodes. When we run out of CPUs on the device's
local NUMA node, it is better to fall back to the node closest to the
local node instead of choosing any online cpu immediately.

The current code only considers the local NUMA node and does not
compute the distances between NUMA nodes for the non-local case.
Let's optimize it and find the nearest node through the NUMA
distance. Performance is better when the nearest node is returned
rather than a random one.

When the Parameter Server workload is tested using a NIC device on a
Huawei Kunpeng 920 SoC:
Without the patch, the performance is 220,000 (22W) QPS;
With this patch, the performance improves to 260,000 (26W) QPS.

Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: yuqi jin <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
---
lib/cpumask.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 81 insertions(+), 12 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..15d8940f32a8 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -192,18 +192,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
}
#endif

-/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
- * @i: index number
- * @node: local numa_node
- *
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
- *
- * It's not very efficient, but useful for setup.
- */
-unsigned int cpumask_local_spread(unsigned int i, int node)
+static void calc_node_distance(int *node_dist, int node)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++)
+ node_dist[i] = node_distance(node, i);
+}
+
+static int find_nearest_node(int *node_dist, bool *used)
+{
+ int i, min_dist = node_dist[0], node_id = -1;
+
+ /* Choose the first unused node to compare */
+ for (i = 0; i < nr_node_ids; i++) {
+ if (used[i] == 0) {
+ min_dist = node_dist[i];
+ node_id = i;
+ break;
+ }
+ }
+
+ /* Compare and return the nearest node */
+ for (i = 0; i < nr_node_ids; i++) {
+ if (node_dist[i] < min_dist && used[i] == 0) {
+ min_dist = node_dist[i];
+ node_id = i;
+ }
+ }
+
+ return node_id;
+}
+
+static unsigned int __cpumask_local_spread(unsigned int i, int node)
{
int cpu;

@@ -231,4 +252,52 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
BUG();
}
+
+/**
+ * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * @i: index number
+ * @node: local numa_node
+ *
+ * This function selects an online CPU according to a numa aware policy;
+ * local cpus are returned first, followed by the nearest non-local ones,
+ * then it wraps around.
+ *
+ * It's not very efficient, but useful for setup.
+ */
+unsigned int cpumask_local_spread(unsigned int i, int node)
+{
+ int node_dist[MAX_NUMNODES] = {0};
+ bool used[MAX_NUMNODES] = {0};
+ int cpu, j, id;
+
+ /* Wrap: we always want a cpu. */
+ i %= num_online_cpus();
+
+ if (node == NUMA_NO_NODE) {
+ for_each_cpu(cpu, cpu_online_mask)
+ if (i-- == 0)
+ return cpu;
+ } else {
+ if (nr_node_ids > MAX_NUMNODES)
+ return __cpumask_local_spread(i, node);
+
+ calc_node_distance(node_dist, node);
+ for (j = 0; j < nr_node_ids; j++) {
+ id = find_nearest_node(node_dist, used);
+ if (id < 0)
+ break;
+
+ for_each_cpu_and(cpu, cpumask_of_node(id),
+ cpu_online_mask)
+ if (i-- == 0)
+ return cpu;
+ used[id] = 1;
+ }
+
+ for_each_cpu(cpu, cpu_online_mask)
+ if (i-- == 0)
+ return cpu;
+ }
+ BUG();
+}
EXPORT_SYMBOL(cpumask_local_spread);
--
2.7.4


2019-11-05 07:04:04

by Michal Hocko

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> From: yuqi jin <[email protected]>
>
> In the multi-processor and NUMA system, I/O device may have many numa
> nodes belonging to multiple cpus. When we get a local numa, it is
> better to find the node closest to the local numa node, instead
> of choosing any online cpu immediately.
>
> For the current code, it only considers the local NUMA node and it
> doesn't compute the distances between different NUMA nodes for the
> non-local NUMA nodes. Let's optimize it and find the nearest node
> through NUMA distance. The performance will be better if it return
> the nearest node than the random node.

Numbers please

[...]
> +/**
> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> + * @i: index number
> + * @node: local numa_node
> + *
> + * This function selects an online CPU according to a numa aware policy;
> + * local cpus are returned first, followed by the nearest non-local ones,
> + * then it wraps around.
> + *
> + * It's not very efficient, but useful for setup.
> + */
> +unsigned int cpumask_local_spread(unsigned int i, int node)
> +{
> + int node_dist[MAX_NUMNODES] = {0};
> + bool used[MAX_NUMNODES] = {0};

Ugh. This might be a lot of stack space. Some distro kernels use large
NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
node_dist).

> + int cpu, j, id;
--
Michal Hocko
SUSE Labs

2019-11-06 01:35:09

by Andrew Morton

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <[email protected]> wrote:

> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> > From: yuqi jin <[email protected]>
> >
> > In the multi-processor and NUMA system, I/O device may have many numa
> > nodes belonging to multiple cpus. When we get a local numa, it is
> > better to find the node closest to the local numa node, instead
> > of choosing any online cpu immediately.
> >
> > For the current code, it only considers the local NUMA node and it
> > doesn't compute the distances between different NUMA nodes for the
> > non-local NUMA nodes. Let's optimize it and find the nearest node
> > through NUMA distance. The performance will be better if it return
> > the nearest node than the random node.
>
> Numbers please

The changelog had

: When Parameter Server workload is tested using NIC device on Huawei
: Kunpeng 920 SoC:
: Without the patch, the performance is 22W QPS;
: Added this patch, the performance become better and it is 26W QPS.

> [...]
> > +/**
> > + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> > + * @i: index number
> > + * @node: local numa_node
> > + *
> > + * This function selects an online CPU according to a numa aware policy;
> > + * local cpus are returned first, followed by the nearest non-local ones,
> > + * then it wraps around.
> > + *
> > + * It's not very efficient, but useful for setup.
> > + */
> > +unsigned int cpumask_local_spread(unsigned int i, int node)
> > +{
> > + int node_dist[MAX_NUMNODES] = {0};
> > + bool used[MAX_NUMNODES] = {0};
>
> Ugh. This might be a lot of stack space. Some distro kernels use large
> NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
> node_dist).

Yes, that's big. From a quick peek I suspect we could get by using an
array of unsigned shorts here but that might be fragile over time even
if it works now?

Perhaps we could make it a statically allocated array and protect the
entire thing with a spin_lock_irqsave()? It's not a frequently called
function.

2019-11-06 02:54:59

by Zhangshaokun

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

Hi Andrew,

On 2019/11/6 9:33, Andrew Morton wrote:
> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <[email protected]> wrote:
>
>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
>>> From: yuqi jin <[email protected]>
>>>
>>> In the multi-processor and NUMA system, I/O device may have many numa
>>> nodes belonging to multiple cpus. When we get a local numa, it is
>>> better to find the node closest to the local numa node, instead
>>> of choosing any online cpu immediately.
>>>
>>> For the current code, it only considers the local NUMA node and it
>>> doesn't compute the distances between different NUMA nodes for the
>>> non-local NUMA nodes. Let's optimize it and find the nearest node
>>> through NUMA distance. The performance will be better if it return
>>> the nearest node than the random node.
>>
>> Numbers please
>
> The changelog had
>
> : When Parameter Server workload is tested using NIC device on Huawei
> : Kunpeng 920 SoC:
> : Without the patch, the performance is 22W QPS;
> : Added this patch, the performance become better and it is 26W QPS.
>
>> [...]
>>> +/**
>>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>> + * @i: index number
>>> + * @node: local numa_node
>>> + *
>>> + * This function selects an online CPU according to a numa aware policy;
>>> + * local cpus are returned first, followed by the nearest non-local ones,
>>> + * then it wraps around.
>>> + *
>>> + * It's not very efficient, but useful for setup.
>>> + */
>>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>>> +{
>>> + int node_dist[MAX_NUMNODES] = {0};
>>> + bool used[MAX_NUMNODES] = {0};
>>
>> Ugh. This might be a lot of stack space. Some distro kernels use large
>> NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
>> node_dist).
>
> Yes, that's big. From a quick peek I suspect we could get by using an
> array of unsigned shorts here but that might be fragile over time even
> if it works now?
>

Yes, how about defining another macro with a value of 128 (I'm not sure
it is big enough for the actual need)?

--->8
unsigned int cpumask_local_spread(unsigned int i, int node)
{
- int node_dist[MAX_NUMNODES] = {0};
- bool used[MAX_NUMNODES] = {0};
+ #define NUMA_NODE_NR 128
+ int node_dist[NUMA_NODE_NR] = {0};
+ bool used[NUMA_NODE_NR] = {0};
int cpu, j, id;

/* Wrap: we always want a cpu. */
@@ -278,7 +279,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
if (i-- == 0)
return cpu;
} else {
- if (nr_node_ids > MAX_NUMNODES)
+ if (nr_node_ids > NUMA_NODE_NR)
return __cpumask_local_spread(i, node);

calc_node_distance(node_dist, node);

> Perhaps we could make it a statically allocated array and protect the
> entire thing with a spin_lock_irqsave()? It's not a frequently called

It's another way to solve this issue. I'm not sure which one you and Michal prefer. ;-)

Thanks,
Shaokun

> function.

2019-11-06 07:18:50

by Michal Hocko

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

On Tue 05-11-19 17:33:59, Andrew Morton wrote:
> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> > > From: yuqi jin <[email protected]>
> > >
> > > In the multi-processor and NUMA system, I/O device may have many numa
> > > nodes belonging to multiple cpus. When we get a local numa, it is
> > > better to find the node closest to the local numa node, instead
> > > of choosing any online cpu immediately.
> > >
> > > For the current code, it only considers the local NUMA node and it
> > > doesn't compute the distances between different NUMA nodes for the
> > > non-local NUMA nodes. Let's optimize it and find the nearest node
> > > through NUMA distance. The performance will be better if it return
> > > the nearest node than the random node.
> >
> > Numbers please
>
> The changelog had
>
> : When Parameter Server workload is tested using NIC device on Huawei
> : Kunpeng 920 SoC:
> : Without the patch, the performance is 22W QPS;
> : Added this patch, the performance become better and it is 26W QPS.

Maybe it is just me but this doesn't really tell me a lot. What is
Parameter Server workload? What do I do to replicate those numbers? Is
this really specific to the Kunpeng 920 server? What is the usual
variance of the performance numbers?

> > [...]
> > > +/**
> > > + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> > > + * @i: index number
> > > + * @node: local numa_node
> > > + *
> > > + * This function selects an online CPU according to a numa aware policy;
> > > + * local cpus are returned first, followed by the nearest non-local ones,
> > > + * then it wraps around.
> > > + *
> > > + * It's not very efficient, but useful for setup.
> > > + */
> > > +unsigned int cpumask_local_spread(unsigned int i, int node)
> > > +{
> > > + int node_dist[MAX_NUMNODES] = {0};
> > > + bool used[MAX_NUMNODES] = {0};
> >
> > Ugh. This might be a lot of stack space. Some distro kernels use large
> > NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
> > node_dist).
>
> Yes, that's big. From a quick peek I suspect we could get by using an
> array of unsigned shorts here but that might be fragile over time even
> if it works now?

Whatever data type we use it will be still quite large to be on the
stack.

> Perhaps we could make it a statically allocated array and protect the
> entire thing with a spin_lock_irqsave()? It's not a frequently called
> function.

This is what I was suggesting in previous review feedback.

--
Michal Hocko
SUSE Labs

2019-11-06 08:05:28

by Zhangshaokun

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/6 15:17, Michal Hocko wrote:
> On Tue 05-11-19 17:33:59, Andrew Morton wrote:
>> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <[email protected]> wrote:
>>
>>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
>>>> From: yuqi jin <[email protected]>
>>>>
>>>> In the multi-processor and NUMA system, I/O device may have many numa
>>>> nodes belonging to multiple cpus. When we get a local numa, it is
>>>> better to find the node closest to the local numa node, instead
>>>> of choosing any online cpu immediately.
>>>>
>>>> For the current code, it only considers the local NUMA node and it
>>>> doesn't compute the distances between different NUMA nodes for the
>>>> non-local NUMA nodes. Let's optimize it and find the nearest node
>>>> through NUMA distance. The performance will be better if it return
>>>> the nearest node than the random node.
>>>
>>> Numbers please
>>
>> The changelog had
>>
>> : When Parameter Server workload is tested using NIC device on Huawei
>> : Kunpeng 920 SoC:
>> : Without the patch, the performance is 22W QPS;
>> : Added this patch, the performance become better and it is 26W QPS.
>
> Maybe it is just me but this doesn't really tell me a lot. What is
> Parameter Server workload? What do I do to replicate those numbers? Is

I will give it a better description in the next version. Since it returns
the nearest non-local node rather than a random one, it should be
harmless to others, right?

> this really specific to the Kunpeng 920 server? What is the usual
> variance of the performance numbers?
>
>>> [...]
>>>> +/**
>>>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>>> + * @i: index number
>>>> + * @node: local numa_node
>>>> + *
>>>> + * This function selects an online CPU according to a numa aware policy;
>>>> + * local cpus are returned first, followed by the nearest non-local ones,
>>>> + * then it wraps around.
>>>> + *
>>>> + * It's not very efficient, but useful for setup.
>>>> + */
>>>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>>>> +{
>>>> + int node_dist[MAX_NUMNODES] = {0};
>>>> + bool used[MAX_NUMNODES] = {0};
>>>
>>> Ugh. This might be a lot of stack space. Some distro kernels use large
>>> NODE_SHIFT (e.g 10 so this would be 4kB of stack space just for the
>>> node_dist).
>>
>> Yes, that's big. From a quick peek I suspect we could get by using an
>> array of unsigned shorts here but that might be fragile over time even
>> if it works now?
>
> Whatever data type we use it will be still quite large to be on the
> stack.
>
>> Perhaps we could make it a statically allocated array and protect the
>> entire thing with a spin_lock_irqsave()? It's not a frequently called
>> function.
>
> This is what I was suggesting in previous review feedback.

Ok, will do it in next version.

Thanks,
Shaokun


2019-11-06 09:23:13

by Michal Hocko

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

On Wed 06-11-19 16:02:29, Shaokun Zhang wrote:
> Hi Michal,
>
> On 2019/11/6 15:17, Michal Hocko wrote:
> > On Tue 05-11-19 17:33:59, Andrew Morton wrote:
> >> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <[email protected]> wrote:
> >>
> >>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
> >>>> From: yuqi jin <[email protected]>
> >>>>
> >>>> In the multi-processor and NUMA system, I/O device may have many numa
> >>>> nodes belonging to multiple cpus. When we get a local numa, it is
> >>>> better to find the node closest to the local numa node, instead
> >>>> of choosing any online cpu immediately.
> >>>>
> >>>> For the current code, it only considers the local NUMA node and it
> >>>> doesn't compute the distances between different NUMA nodes for the
> >>>> non-local NUMA nodes. Let's optimize it and find the nearest node
> >>>> through NUMA distance. The performance will be better if it return
> >>>> the nearest node than the random node.
> >>>
> >>> Numbers please
> >>
> >> The changelog had
> >>
> >> : When Parameter Server workload is tested using NIC device on Huawei
> >> : Kunpeng 920 SoC:
> >> : Without the patch, the performance is 22W QPS;
> >> : Added this patch, the performance become better and it is 26W QPS.
> >
> > Maybe it is just me but this doesn't really tell me a lot. What is
> > Parameter Server workload? What do I do to replicate those numbers? Is
>
> I will give it better description on it in next version. Since it returns
> the nearest node from the non-local node than the random one, no harmless
> to others, Right?

Well, I am not really familiar with consumers of this API to understand
the full consequences and that is why I keep asking. From a very
highlevel POV prefering CPUs on the same NUMA domain sounds like a
reasonable thing to do.
--
Michal Hocko
SUSE Labs

2019-11-07 01:49:59

by Zhangshaokun

Subject: Re: [PATCH v2] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/6 17:22, Michal Hocko wrote:
> On Wed 06-11-19 16:02:29, Shaokun Zhang wrote:
>> Hi Michal,
>>
>> On 2019/11/6 15:17, Michal Hocko wrote:
>>> On Tue 05-11-19 17:33:59, Andrew Morton wrote:
>>>> On Tue, 5 Nov 2019 08:01:41 +0100 Michal Hocko <[email protected]> wrote:
>>>>
>>>>> On Mon 04-11-19 18:27:48, Shaokun Zhang wrote:
>>>>>> From: yuqi jin <[email protected]>
>>>>>>
>>>>>> In the multi-processor and NUMA system, I/O device may have many numa
>>>>>> nodes belonging to multiple cpus. When we get a local numa, it is
>>>>>> better to find the node closest to the local numa node, instead
>>>>>> of choosing any online cpu immediately.
>>>>>>
>>>>>> For the current code, it only considers the local NUMA node and it
>>>>>> doesn't compute the distances between different NUMA nodes for the
>>>>>> non-local NUMA nodes. Let's optimize it and find the nearest node
>>>>>> through NUMA distance. The performance will be better if it return
>>>>>> the nearest node than the random node.
>>>>>
>>>>> Numbers please
>>>>
>>>> The changelog had
>>>>
>>>> : When Parameter Server workload is tested using NIC device on Huawei
>>>> : Kunpeng 920 SoC:
>>>> : Without the patch, the performance is 22W QPS;
>>>> : Added this patch, the performance become better and it is 26W QPS.
>>>
>>> Maybe it is just me but this doesn't really tell me a lot. What is
>>> Parameter Server workload? What do I do to replicate those numbers? Is
>>
>> I will give it better description on it in next version. Since it returns
>> the nearest node from the non-local node than the random one, no harmless
>> to others, Right?
>
> Well, I am not really familiar with consumers of this API to understand
> the full consequences and that is why I keep asking. From a very

Good job. Thanks to your and Andrew's nice comments; at the beginning I
was not sure how to fix this issue correctly, and it has become better now.

> highlevel POV prefering CPUs on the same NUMA domain sounds like a
> reasonable thing to do.

Thanks :-)
