2019-11-07 01:45:55

by Zhangshaokun

Subject: [PATCH v3] lib: optimize cpumask_local_spread()

From: yuqi jin <[email protected]>

In a multi-processor NUMA system, an I/O driver needs to find the cpu cores
to which its IRQs shall be bound. When the cpu cores in the local numa node
have all been used, it is better to find the node closest to the local numa
node, instead of choosing any online cpu immediately.

On the Huawei Kunpeng 920 server, there are 4 NUMA nodes (0-3) in the 2-cpu
system (cpu0 and cpu1). We ran a PS (parameter server) business test; the
behavior of the service is that the client initiates a request through
the network card and the server responds to the request after calculation.
When two PS processes run on node2 and node3 separately and the
network card is located on node2, which belongs to cpu1, the performance
of node2 (26W QPS) and node3 (22W QPS) differed.
It is better to bind the NIC queues to the cpu1 cores in turn, so that
XPS is also properly initialized, but cpumask_local_spread only considers
the local node: when the number of NIC queues exceeds the number of cores
in the local node, it falls back to any online core directly. So when the
PS process on node3 sends a calculated request, the performance is not as
good as on node2. Since the NIC and other I/O devices initialize their
interrupt binding this way, once the cores of the local node are used up
it is reasonable to fall back to the node closest to it.

Let's optimize it and use the NUMA distance to find the nearest node among
the non-local NUMA nodes. The performance is better when the nearest node is
returned instead of a random one.

After this patch, the performance of node3 is the same as node2,
that is 26W QPS, when the network card is still on node2. Since it will
return the closest non-local NUMA node rather than a random node, it at
least does no harm to others.

Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: yuqi jin <[email protected]>
Signed-off-by: Shaokun Zhang <[email protected]>
---
ChangeLog from v2:
1. Make the variables static and protect them with a spinlock;
2. Give more explanation of the test and performance;
lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 90 insertions(+), 12 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 0cb672eb107c..b98a2256bc5a 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <linux/memblock.h>
#include <linux/numa.h>
+#include <linux/spinlock.h>

/**
* cpumask_next - get the next cpu in a cpumask
@@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
}
#endif

-/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
- * @i: index number
- * @node: local numa_node
- *
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
- *
- * It's not very efficient, but useful for setup.
- */
-unsigned int cpumask_local_spread(unsigned int i, int node)
+static void calc_node_distance(int *node_dist, int node)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++)
+ node_dist[i] = node_distance(node, i);
+}
+
+static int find_nearest_node(int *node_dist, bool *used)
+{
+ int i, min_dist = node_dist[0], node_id = -1;
+
+ /* Choose the first unused node to compare */
+ for (i = 0; i < nr_node_ids; i++) {
+ if (used[i] == 0) {
+ min_dist = node_dist[i];
+ node_id = i;
+ break;
+ }
+ }
+
+ /* Compare and return the nearest node */
+ for (i = 0; i < nr_node_ids; i++) {
+ if (node_dist[i] < min_dist && used[i] == 0) {
+ min_dist = node_dist[i];
+ node_id = i;
+ }
+ }
+
+ return node_id;
+}
+
+static unsigned int __cpumask_local_spread(unsigned int i, int node)
{
int cpu;

@@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
BUG();
}
+
+static DEFINE_SPINLOCK(spread_lock);
+/**
+ * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * @i: index number
+ * @node: local numa_node
+ *
+ * This function selects an online CPU according to a numa aware policy;
+ * local cpus are returned first, followed by the nearest non-local ones,
+ * then it wraps around.
+ *
+ * It's not very efficient, but useful for setup.
+ */
+unsigned int cpumask_local_spread(unsigned int i, int node)
+{
+ static int node_dist[MAX_NUMNODES];
+ static bool used[MAX_NUMNODES];
+ unsigned long flags;
+ int cpu, j, id;
+
+ /* Wrap: we always want a cpu. */
+ i %= num_online_cpus();
+
+ if (node == NUMA_NO_NODE) {
+ for_each_cpu(cpu, cpu_online_mask)
+ if (i-- == 0)
+ return cpu;
+ } else {
+ if (nr_node_ids > MAX_NUMNODES)
+ return __cpumask_local_spread(i, node);
+
+ spin_lock_irqsave(&spread_lock, flags);
+ memset(used, 0, nr_node_ids * sizeof(bool));
+ calc_node_distance(node_dist, node);
+ for (j = 0; j < nr_node_ids; j++) {
+ id = find_nearest_node(node_dist, used);
+ if (id < 0)
+ break;
+
+ for_each_cpu_and(cpu, cpumask_of_node(id),
+ cpu_online_mask)
+ if (i-- == 0) {
+ spin_unlock_irqrestore(&spread_lock,
+ flags);
+ return cpu;
+ }
+ used[id] = 1;
+ }
+ spin_unlock_irqrestore(&spread_lock, flags);
+
+ for_each_cpu(cpu, cpu_online_mask)
+ if (i-- == 0)
+ return cpu;
+ }
+ BUG();
+}
EXPORT_SYMBOL(cpumask_local_spread);
--
2.7.4


2019-11-08 03:51:12

by Andrew Morton

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

On Thu, 7 Nov 2019 09:44:08 +0800 Shaokun Zhang <[email protected]> wrote:

> In the multi-processors and NUMA system, I/O driver will find cpu cores
> that which shall be bound IRQ. When cpu cores in the local numa have
> been used, it is better to find the node closest to the local numa node,
> instead of choosing any online cpu immediately.
>
> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu
> system(0 - 1). We perform PS (parameter server) business test, the
> behavior of the service is that the client initiates a request through
> the network card, the server responds to the request after calculation.
> When two PS processes run on node2 and node3 separately and the
> network card is located on 'node2' which is in cpu1, the performance
> of node2 (26W QPS) and node3 (22W QPS) was different.
> It is better that the NIC queues are bound to the cpu1 cores in turn,
> then XPS will also be properly initialized, while cpumask_local_spread
> only considers the local node. When the number of NIC queues exceeds
> the number of cores in the local node, it returns to the online core
> directly. So when PS runs on node3 sending a calculated request,
> the performance is not as good as the node2. It is considered that
> the NIC and other I/O devices shall initialize the interrupt binding,
> if the cores of the local node are used up, it is reasonable to return
> the node closest to it.
>
> Let's optimize it and find the nearest node through NUMA distance for the
> non-local NUMA nodes. The performance will be better if it return the
> nearest node than the random node.
>
> After this patch, the performance of the node3 is the same as node2
> that is 26W QPS when the network card is still in 'node2'. Since it will
> return the closest non-local NUMA code rather than random node, it is no
> harm to others at least.

This is a little nicer:

--- a/lib/cpumask.c~lib-optimize-cpumask_local_spread-v3-fix
+++ a/lib/cpumask.c
@@ -254,7 +254,6 @@ static unsigned int __cpumask_local_spre
BUG();
}

-static DEFINE_SPINLOCK(spread_lock);
/**
* cpumask_local_spread - select the i'th cpu with local numa cpu's first
* @i: index number
@@ -270,6 +269,7 @@ unsigned int cpumask_local_spread(unsign
{
static int node_dist[MAX_NUMNODES];
static bool used[MAX_NUMNODES];
+ static DEFINE_SPINLOCK(spread_lock);
unsigned long flags;
int cpu, j, id;

_

2019-11-08 05:51:42

by Zhangshaokun

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

Hi Andrew,

On 2019/11/8 11:49, Andrew Morton wrote:
> On Thu, 7 Nov 2019 09:44:08 +0800 Shaokun Zhang <[email protected]> wrote:
>
>> In the multi-processors and NUMA system, I/O driver will find cpu cores
>> that which shall be bound IRQ. When cpu cores in the local numa have
>> been used, it is better to find the node closest to the local numa node,
>> instead of choosing any online cpu immediately.
>>
>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu
>> system(0 - 1). We perform PS (parameter server) business test, the
>> behavior of the service is that the client initiates a request through
>> the network card, the server responds to the request after calculation.
>> When two PS processes run on node2 and node3 separately and the
>> network card is located on 'node2' which is in cpu1, the performance
>> of node2 (26W QPS) and node3 (22W QPS) was different.
>> It is better that the NIC queues are bound to the cpu1 cores in turn,
>> then XPS will also be properly initialized, while cpumask_local_spread
>> only considers the local node. When the number of NIC queues exceeds
>> the number of cores in the local node, it returns to the online core
>> directly. So when PS runs on node3 sending a calculated request,
>> the performance is not as good as the node2. It is considered that
>> the NIC and other I/O devices shall initialize the interrupt binding,
>> if the cores of the local node are used up, it is reasonable to return
>> the node closest to it.
>>
>> Let's optimize it and find the nearest node through NUMA distance for the
>> non-local NUMA nodes. The performance will be better if it return the
>> nearest node than the random node.
>>
>> After this patch, the performance of the node3 is the same as node2
>> that is 26W QPS when the network card is still in 'node2'. Since it will
>> return the closest non-local NUMA code rather than random node, it is no
>> harm to others at least.
>
> This is a little nicer:
>
> --- a/lib/cpumask.c~lib-optimize-cpumask_local_spread-v3-fix
> +++ a/lib/cpumask.c
> @@ -254,7 +254,6 @@ static unsigned int __cpumask_local_spre
> BUG();
> }
>
> -static DEFINE_SPINLOCK(spread_lock);
> /**
> * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> * @i: index number
> @@ -270,6 +269,7 @@ unsigned int cpumask_local_spread(unsign
> {
> static int node_dist[MAX_NUMNODES];
> static bool used[MAX_NUMNODES];
> + static DEFINE_SPINLOCK(spread_lock);

Good catch, thanks for fixing it.

Shaokun.

> unsigned long flags;
> int cpu, j, id;
>
> _
>
>
> .
>

2019-11-08 10:32:07

by Michal Hocko

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

This changelog looks better, thanks! I still have some questions though.
Btw. cpumask_local_spread is used by the networking code but I do not
see net guys involved (Cc netdev)

On Thu 07-11-19 09:44:08, Shaokun Zhang wrote:
> From: yuqi jin <[email protected]>
>
> In the multi-processors and NUMA system, I/O driver will find cpu cores
> that which shall be bound IRQ. When cpu cores in the local numa have
> been used, it is better to find the node closest to the local numa node,
> instead of choosing any online cpu immediately.
>
> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu
> system(0 - 1).

Please send a topology of this server (numactl -H).

> We perform PS (parameter server) business test, the
> behavior of the service is that the client initiates a request through
> the network card, the server responds to the request after calculation.

Is the benchmark publicly available?

> When two PS processes run on node2 and node3 separately and the
> network card is located on 'node2' which is in cpu1, the performance
> of node2 (26W QPS) and node3 (22W QPS) was different.
> It is better that the NIC queues are bound to the cpu1 cores in turn,
> then XPS will also be properly initialized, while cpumask_local_spread
> only considers the local node. When the number of NIC queues exceeds
> the number of cores in the local node, it returns to the online core
> directly. So when PS runs on node3 sending a calculated request,
> the performance is not as good as the node2. It is considered that
> the NIC and other I/O devices shall initialize the interrupt binding,
> if the cores of the local node are used up, it is reasonable to return
> the node closest to it.

Can you post cpu affinities before and after this patch?

> Let's optimize it and find the nearest node through NUMA distance for the
> non-local NUMA nodes. The performance will be better if it return the
> nearest node than the random node.
>
> After this patch, the performance of the node3 is the same as node2
> that is 26W QPS when the network card is still in 'node2'. Since it will
> return the closest non-local NUMA code rather than random node, it is no
> harm to others at least.

It would also be nice to explain why the current implementation hasn't
taken the path you have chosen. Was it simplicity or is there a more
fundamental reason? Is there any risk that existing users would regress?
Preferring cpus from the local socket, which is what you aim for, sounds
like a logical thing to do, so I am wondering why this hasn't been
considered.

> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Paul Burton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Signed-off-by: yuqi jin <[email protected]>
> Signed-off-by: Shaokun Zhang <[email protected]>
> ---
> ChangeLog from v2:
> 1. Change the variables as static and use spinlock to protect;
> 2. Give more explantation on test and performance;
> lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 90 insertions(+), 12 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 0cb672eb107c..b98a2256bc5a 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
> #include <linux/export.h>
> #include <linux/memblock.h>
> #include <linux/numa.h>
> +#include <linux/spinlock.h>
>
> /**
> * cpumask_next - get the next cpu in a cpumask
> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> }
> #endif
>
> -/**
> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> - * @i: index number
> - * @node: local numa_node
> - *
> - * This function selects an online CPU according to a numa aware policy;
> - * local cpus are returned first, followed by non-local ones, then it
> - * wraps around.
> - *
> - * It's not very efficient, but useful for setup.
> - */
> -unsigned int cpumask_local_spread(unsigned int i, int node)
> +static void calc_node_distance(int *node_dist, int node)
> +{
> + int i;
> +
> + for (i = 0; i < nr_node_ids; i++)
> + node_dist[i] = node_distance(node, i);
> +}
> +
> +static int find_nearest_node(int *node_dist, bool *used)
> +{
> + int i, min_dist = node_dist[0], node_id = -1;
> +
> + /* Choose the first unused node to compare */
> + for (i = 0; i < nr_node_ids; i++) {
> + if (used[i] == 0) {
> + min_dist = node_dist[i];
> + node_id = i;
> + break;
> + }
> + }
> +
> + /* Compare and return the nearest node */
> + for (i = 0; i < nr_node_ids; i++) {
> + if (node_dist[i] < min_dist && used[i] == 0) {
> + min_dist = node_dist[i];
> + node_id = i;
> + }
> + }
> +
> + return node_id;
> +}
> +
> +static unsigned int __cpumask_local_spread(unsigned int i, int node)
> {
> int cpu;
>
> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> }
> BUG();
> }
> +
> +static DEFINE_SPINLOCK(spread_lock);
> +/**
> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> + * @i: index number
> + * @node: local numa_node
> + *
> + * This function selects an online CPU according to a numa aware policy;
> + * local cpus are returned first, followed by the nearest non-local ones,
> + * then it wraps around.
> + *
> + * It's not very efficient, but useful for setup.
> + */
> +unsigned int cpumask_local_spread(unsigned int i, int node)
> +{
> + static int node_dist[MAX_NUMNODES];
> + static bool used[MAX_NUMNODES];
> + unsigned long flags;
> + int cpu, j, id;
> +
> + /* Wrap: we always want a cpu. */
> + i %= num_online_cpus();
> +
> + if (node == NUMA_NO_NODE) {
> + for_each_cpu(cpu, cpu_online_mask)
> + if (i-- == 0)
> + return cpu;
> + } else {
> + if (nr_node_ids > MAX_NUMNODES)
> + return __cpumask_local_spread(i, node);
> +
> + spin_lock_irqsave(&spread_lock, flags);
> + memset(used, 0, nr_node_ids * sizeof(bool));
> + calc_node_distance(node_dist, node);
> + for (j = 0; j < nr_node_ids; j++) {
> + id = find_nearest_node(node_dist, used);
> + if (id < 0)
> + break;
> +
> + for_each_cpu_and(cpu, cpumask_of_node(id),
> + cpu_online_mask)
> + if (i-- == 0) {
> + spin_unlock_irqrestore(&spread_lock,
> + flags);
> + return cpu;
> + }
> + used[id] = 1;
> + }
> + spin_unlock_irqrestore(&spread_lock, flags);
> +
> + for_each_cpu(cpu, cpu_online_mask)
> + if (i-- == 0)
> + return cpu;
> + }
> + BUG();
> +}
> EXPORT_SYMBOL(cpumask_local_spread);
> --
> 2.7.4

--
Michal Hocko
SUSE Labs

2019-11-11 02:04:16

by Zhangshaokun

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/8 18:31, Michal Hocko wrote:
> This changelog looks better, thanks! I still have some questions though.
> Btw. cpumask_local_spread is used by the networking code but I do not
> see net guys involved (Cc netdev)

Oh, I forgot to involve the net guys, sorry.

>
> On Thu 07-11-19 09:44:08, Shaokun Zhang wrote:
>> From: yuqi jin <[email protected]>
>>
>> In the multi-processors and NUMA system, I/O driver will find cpu cores
>> that which shall be bound IRQ. When cpu cores in the local numa have
>> been used, it is better to find the node closest to the local numa node,
>> instead of choosing any online cpu immediately.
>>
>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu
>> system(0 - 1).
>
> Please send a topology of this server (numactl -H).
>

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 63379 MB
node 0 free: 61899 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64509 MB
node 1 free: 63942 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 64509 MB
node 2 free: 63056 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 63997 MB
node 3 free: 63420 MB
node distances:
node 0 1 2 3
0: 10 16 32 33
1: 16 10 25 32
2: 32 25 10 16
3: 33 32 16 10

>> We perform PS (parameter server) business test, the
>> behavior of the service is that the client initiates a request through
>> the network card, the server responds to the request after calculation.
>
> Is the benchmark any ublicly available?
>

Sorry, the PS that we test is not open, but I think redis behaves the same
as PS at the macro level. When there are 24 redis servers on each of node2
and node3, if the 24-47 irqs and xps of the NIC are not bound to node3, the
redis servers on node3 will not perform well.

>> When two PS processes run on node2 and node3 separately and the
>> network card is located on 'node2' which is in cpu1, the performance
>> of node2 (26W QPS) and node3 (22W QPS) was different.
>> It is better that the NIC queues are bound to the cpu1 cores in turn,
>> then XPS will also be properly initialized, while cpumask_local_spread
>> only considers the local node. When the number of NIC queues exceeds
>> the number of cores in the local node, it returns to the online core
>> directly. So when PS runs on node3 sending a calculated request,
>> the performance is not as good as the node2. It is considered that
>> the NIC and other I/O devices shall initialize the interrupt binding,
>> if the cores of the local node are used up, it is reasonable to return
>> the node closest to it.
>
> Can you post cpu affinities before and after this patch?
>

Before this patch
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:~ # cat /proc/irq/345/smp_affinity #IRQ0
00000000,00010000,00000000
Euler:~ # cat /proc/irq/369/smp_affinity #IRQ24
00000000,00000000,00000001
Euler:~ # cat /proc/irq/393/smp_affinity #IRQ48
00000000,00000000,01000000
Euler:~ # cat /sys/class/net/eth7/queues/tx-0/xps_cpus
00000000,00010000,00000000
Euler:~ # cat /sys/class/net/eth7/queues/tx-24/xps_cpus
00000000,00000000,00000001
Euler:~ # cat /sys/class/net/eth7/queues/tx-48/xps_cpus
00000000,00000000,01000000

After this patch
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat /proc/irq/345/smp_affinity
00000000,00010000,00000000
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat /proc/irq/369/smp_affinity
00000100,00000000,00000000
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat /proc/irq/393/smp_affinity
00000000,00000000,01000000
Euler:/sys/bus/pci # cat /sys/class/net/eth7/queues/tx-0/xps_cpus
00000000,00010000,00000000
Euler:/sys/bus/pci # cat /sys/class/net/eth7/queues/tx-24/xps_cpus
00000100,00000000,00000000
Euler:/sys/bus/pci # cat /sys/class/net/eth7/queues/tx-48/xps_cpus
00000000,00000000,01000000

>> Let's optimize it and find the nearest node through NUMA distance for the
>> non-local NUMA nodes. The performance will be better if it return the
>> nearest node than the random node.
>>
>> After this patch, the performance of the node3 is the same as node2
>> that is 26W QPS when the network card is still in 'node2'. Since it will
>> return the closest non-local NUMA code rather than random node, it is no
>> harm to others at least.
>
> It would be also nice to explain why the current implementation hasn't
> taken the path your have chosen. Was it a simplicity or is there a more
> fundamental reason? Is there any risk that existing users would regress?

It seems there is no interface like cpumask_of_socket, so we use NUMA
distances to solve this issue. This also handles the case where the I/O
device has more irqs than the sockets have cores. On the other hand, after
the system has started the user can manually change the irq and xps
bindings, but a reasonable default initialization is a much better experience.
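
To make the chosen ordering concrete, here is a minimal, untested userspace
sketch of the same nearest-node walk, with the distance matrix from the
numactl output above hard-coded (the kernel patch queries node_distance()
instead); for node 2 it visits the nodes in the order 2, 3, 1, 0.

/*
 * Untested userspace sketch of the node-selection order added by the
 * patch.  The distance table is hard-coded from the numactl output
 * above; the kernel code queries node_distance() at run time instead.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 32, 33 },
	{ 16, 10, 25, 32 },
	{ 32, 25, 10, 16 },
	{ 33, 32, 16, 10 },
};

/* Mirrors find_nearest_node(): pick the closest node not yet used. */
static int nearest_unused(const int *node_dist, const bool *used)
{
	int i, best = -1;

	for (i = 0; i < NR_NODES; i++) {
		if (used[i])
			continue;
		if (best < 0 || node_dist[i] < node_dist[best])
			best = i;
	}
	return best;
}

int main(void)
{
	bool used[NR_NODES] = { false };
	int local = 2, j;		/* the NIC sits on node 2 above */

	printf("spread order for node %d:", local);
	for (j = 0; j < NR_NODES; j++) {
		int id = nearest_unused(dist[local], used);

		if (id < 0)
			break;
		printf(" %d", id);
		used[id] = true;
	}
	printf("\n");	/* prints: spread order for node 2: 2 3 1 0 */
	return 0;
}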

> Preferring cpus from the local socket which is what you aim for sounds

Right.

> like a logical thing to do so I am wondering why this hasn't been
> considered.

An interesting question. When I met this issue, I tried to find the socket ID
in the kernel, but @physical_package_id is architecture and platform dependent
per Documentation/admin-guide/cputopology.rst.
I also checked "Core(s) per socket" from the lscpu command, which is calculated
in sys-utils/lscpu.c.
It seems that NUMA distance is a more general way to solve this problem.

Thanks,
Shaokun

>> Cc: Andrew Morton <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Cc: Paul Burton <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Cc: Anshuman Khandual <[email protected]>
>> Signed-off-by: yuqi jin <[email protected]>
>> Signed-off-by: Shaokun Zhang <[email protected]>
>> ---
>> ChangeLog from v2:
>> 1. Change the variables as static and use spinlock to protect;
>> 2. Give more explantation on test and performance;
>> lib/cpumask.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 90 insertions(+), 12 deletions(-)
>>
>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>> index 0cb672eb107c..b98a2256bc5a 100644
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -6,6 +6,7 @@
>> #include <linux/export.h>
>> #include <linux/memblock.h>
>> #include <linux/numa.h>
>> +#include <linux/spinlock.h>
>>
>> /**
>> * cpumask_next - get the next cpu in a cpumask
>> @@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>> }
>> #endif
>>
>> -/**
>> - * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>> - * @i: index number
>> - * @node: local numa_node
>> - *
>> - * This function selects an online CPU according to a numa aware policy;
>> - * local cpus are returned first, followed by non-local ones, then it
>> - * wraps around.
>> - *
>> - * It's not very efficient, but useful for setup.
>> - */
>> -unsigned int cpumask_local_spread(unsigned int i, int node)
>> +static void calc_node_distance(int *node_dist, int node)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < nr_node_ids; i++)
>> + node_dist[i] = node_distance(node, i);
>> +}
>> +
>> +static int find_nearest_node(int *node_dist, bool *used)
>> +{
>> + int i, min_dist = node_dist[0], node_id = -1;
>> +
>> + /* Choose the first unused node to compare */
>> + for (i = 0; i < nr_node_ids; i++) {
>> + if (used[i] == 0) {
>> + min_dist = node_dist[i];
>> + node_id = i;
>> + break;
>> + }
>> + }
>> +
>> + /* Compare and return the nearest node */
>> + for (i = 0; i < nr_node_ids; i++) {
>> + if (node_dist[i] < min_dist && used[i] == 0) {
>> + min_dist = node_dist[i];
>> + node_id = i;
>> + }
>> + }
>> +
>> + return node_id;
>> +}
>> +
>> +static unsigned int __cpumask_local_spread(unsigned int i, int node)
>> {
>> int cpu;
>>
>> @@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>> }
>> BUG();
>> }
>> +
>> +static DEFINE_SPINLOCK(spread_lock);
>> +/**
>> + * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>> + * @i: index number
>> + * @node: local numa_node
>> + *
>> + * This function selects an online CPU according to a numa aware policy;
>> + * local cpus are returned first, followed by the nearest non-local ones,
>> + * then it wraps around.
>> + *
>> + * It's not very efficient, but useful for setup.
>> + */
>> +unsigned int cpumask_local_spread(unsigned int i, int node)
>> +{
>> + static int node_dist[MAX_NUMNODES];
>> + static bool used[MAX_NUMNODES];
>> + unsigned long flags;
>> + int cpu, j, id;
>> +
>> + /* Wrap: we always want a cpu. */
>> + i %= num_online_cpus();
>> +
>> + if (node == NUMA_NO_NODE) {
>> + for_each_cpu(cpu, cpu_online_mask)
>> + if (i-- == 0)
>> + return cpu;
>> + } else {
>> + if (nr_node_ids > MAX_NUMNODES)
>> + return __cpumask_local_spread(i, node);
>> +
>> + spin_lock_irqsave(&spread_lock, flags);
>> + memset(used, 0, nr_node_ids * sizeof(bool));
>> + calc_node_distance(node_dist, node);
>> + for (j = 0; j < nr_node_ids; j++) {
>> + id = find_nearest_node(node_dist, used);
>> + if (id < 0)
>> + break;
>> +
>> + for_each_cpu_and(cpu, cpumask_of_node(id),
>> + cpu_online_mask)
>> + if (i-- == 0) {
>> + spin_unlock_irqrestore(&spread_lock,
>> + flags);
>> + return cpu;
>> + }
>> + used[id] = 1;
>> + }
>> + spin_unlock_irqrestore(&spread_lock, flags);
>> +
>> + for_each_cpu(cpu, cpu_online_mask)
>> + if (i-- == 0)
>> + return cpu;
>> + }
>> + BUG();
>> +}
>> EXPORT_SYMBOL(cpumask_local_spread);
>> --
>> 2.7.4
>

2019-11-12 11:58:56

by Michal Hocko

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

On Mon 11-11-19 10:02:37, Shaokun Zhang wrote:
> Hi Michal,
>
> On 2019/11/8 18:31, Michal Hocko wrote:
> > This changelog looks better, thanks! I still have some questions though.
> > Btw. cpumask_local_spread is used by the networking code but I do not
> > see net guys involved (Cc netdev)
>
> Oh, I forgot to involve the net guys, sorry.
>
> >
> > On Thu 07-11-19 09:44:08, Shaokun Zhang wrote:
> >> From: yuqi jin <[email protected]>
> >>
> >> In the multi-processors and NUMA system, I/O driver will find cpu cores
> >> that which shall be bound IRQ. When cpu cores in the local numa have
> >> been used, it is better to find the node closest to the local numa node,
> >> instead of choosing any online cpu immediately.
> >>
> >> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu
> >> system(0 - 1).
> >
> > Please send a topology of this server (numactl -H).
> >
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 0 size: 63379 MB
> node 0 free: 61899 MB
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 1 size: 64509 MB
> node 1 free: 63942 MB
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 2 size: 64509 MB
> node 2 free: 63056 MB
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> node 3 size: 63997 MB
> node 3 free: 63420 MB
> node distances:
> node 0 1 2 3
> 0: 10 16 32 33
> 1: 16 10 25 32
> 2: 32 25 10 16
> 3: 33 32 16 10
>
> >> We perform PS (parameter server) business test, the
> >> behavior of the service is that the client initiates a request through
> >> the network card, the server responds to the request after calculation.
> >
> > Is the benchmark any ublicly available?
> >
>
> Sorry, the PS which we test is not open, but I think redis is the same as PS
> on the macro level. When there are both 24 redis servers on node2 and node3.
> if the 24-47 irqs and xps of NIC are not bound to node3, the redis servers
> on node3 will not performance good.

Are there any other benchmarks showing improvements?

> >> When two PS processes run on node2 and node3 separately and the
> >> network card is located on 'node2' which is in cpu1, the performance
> >> of node2 (26W QPS) and node3 (22W QPS) was different.
> >> It is better that the NIC queues are bound to the cpu1 cores in turn,
> >> then XPS will also be properly initialized, while cpumask_local_spread
> >> only considers the local node. When the number of NIC queues exceeds
> >> the number of cores in the local node, it returns to the online core
> >> directly. So when PS runs on node3 sending a calculated request,
> >> the performance is not as good as the node2. It is considered that
> >> the NIC and other I/O devices shall initialize the interrupt binding,
> >> if the cores of the local node are used up, it is reasonable to return
> >> the node closest to it.
> >
> > Can you post cpu affinities before and after this patch?
> >
>
> Before this patch
> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
> 2
> Euler:~ # cat /proc/irq/345/smp_affinity #IRQ0
> 00000000,00010000,00000000

This representation is awkward to parse. Could you add smp_affinity_list
please? It would save quite some head scratching.
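
For anyone decoding the bitmap form by hand, a rough, untested userspace
sketch (not part of the thread) that turns an smp_affinity mask into the
list that smp_affinity_list prints; e.g. 00000000,00010000,00000000 decodes
to CPU 48.

/*
 * Untested userspace sketch, not from the thread: decode an
 * smp_affinity style mask (comma-separated 32-bit hex words, most
 * significant word first) into the CPU list smp_affinity_list shows.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *mask = argc > 1 ? argv[1] : "00000000,00010000,00000000";
	char buf[256];
	char *words[32];
	int nwords = 0, w, b;

	strncpy(buf, mask, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	/* Split on ',' keeping the original most-significant-first order. */
	for (char *p = strtok(buf, ","); p && nwords < 32; p = strtok(NULL, ","))
		words[nwords++] = p;

	printf("%s ->", mask);
	for (w = 0; w < nwords; w++) {
		unsigned long bits = strtoul(words[w], NULL, 16);
		int base = (nwords - 1 - w) * 32;	/* word 0 is the MSW */

		for (b = 0; b < 32; b++)
			if (bits & (1UL << b))
				printf(" %d", base + b);
	}
	printf("\n");	/* "00000000,00010000,00000000 -> 48" */
	return 0;
}
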
--
Michal Hocko
SUSE Labs

2019-11-13 02:47:42

by Zhangshaokun

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/12 19:56, Michal Hocko wrote:
> On Mon 11-11-19 10:02:37, Shaokun Zhang wrote:
>> Hi Michal,
>>
>> On 2019/11/8 18:31, Michal Hocko wrote:
>>> This changelog looks better, thanks! I still have some questions though.
>>> Btw. cpumask_local_spread is used by the networking code but I do not
>>> see net guys involved (Cc netdev)
>>
>> Oh, I forgot to involve the net guys, sorry.
>>
>>>
>>> On Thu 07-11-19 09:44:08, Shaokun Zhang wrote:
>>>> From: yuqi jin <[email protected]>
>>>>
>>>> In the multi-processors and NUMA system, I/O driver will find cpu cores
>>>> that which shall be bound IRQ. When cpu cores in the local numa have
>>>> been used, it is better to find the node closest to the local numa node,
>>>> instead of choosing any online cpu immediately.
>>>>
>>>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 -3) in the 2-cpu
>>>> system(0 - 1).
>>>
>>> Please send a topology of this server (numactl -H).
>>>
>>
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>> node 0 size: 63379 MB
>> node 0 free: 61899 MB
>> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 1 size: 64509 MB
>> node 1 free: 63942 MB
>> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>> node 2 size: 64509 MB
>> node 2 free: 63056 MB
>> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
>> node 3 size: 63997 MB
>> node 3 free: 63420 MB
>> node distances:
>> node 0 1 2 3
>> 0: 10 16 32 33
>> 1: 16 10 25 32
>> 2: 32 25 10 16
>> 3: 33 32 16 10
>>
>>>> We perform PS (parameter server) business test, the
>>>> behavior of the service is that the client initiates a request through
>>>> the network card, the server responds to the request after calculation.
>>>
>>> Is the benchmark any ublicly available?
>>>
>>
>> Sorry, the PS which we test is not open, but I think redis is the same as PS
>> on the macro level. When there are both 24 redis servers on node2 and node3.
>> if the 24-47 irqs and xps of NIC are not bound to node3, the redis servers
>> on node3 will not performance good.
>
> Are there any other benchmarks showing improvements?
>

Sorry, I don't have it. The issue is clear and the patch is helpful for the
actual Parameter Server and Redis test.

>>>> When two PS processes run on node2 and node3 separately and the
>>>> network card is located on 'node2' which is in cpu1, the performance
>>>> of node2 (26W QPS) and node3 (22W QPS) was different.
>>>> It is better that the NIC queues are bound to the cpu1 cores in turn,
>>>> then XPS will also be properly initialized, while cpumask_local_spread
>>>> only considers the local node. When the number of NIC queues exceeds
>>>> the number of cores in the local node, it returns to the online core
>>>> directly. So when PS runs on node3 sending a calculated request,
>>>> the performance is not as good as the node2. It is considered that
>>>> the NIC and other I/O devices shall initialize the interrupt binding,
>>>> if the cores of the local node are used up, it is reasonable to return
>>>> the node closest to it.
>>>
>>> Can you post cpu affinities before and after this patch?
>>>
>>
>> Before this patch
>> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
>> 2
>> Euler:~ # cat /proc/irq/345/smp_affinity #IRQ0
>> 00000000,00010000,00000000
>
> This representation is awkward to parse. Could you add smp_affinity_list
> please? It would save quite some head scratching.
>

before patch
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
48
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
0
Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
24
Euler:/sys/bus/pci #

after patch
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
48
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
72
Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
24
Euler:/sys/bus/pci #

Thanks,
Shaokun


2019-11-14 14:44:34

by Michal Hocko

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

On Wed 13-11-19 10:46:05, Shaokun Zhang wrote:
[...]
> >> available: 4 nodes (0-3)
> >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> >> node 0 size: 63379 MB
> >> node 0 free: 61899 MB
> >> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> >> node 1 size: 64509 MB
> >> node 1 free: 63942 MB
> >> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> >> node 2 size: 64509 MB
> >> node 2 free: 63056 MB
> >> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
> >> node 3 size: 63997 MB
> >> node 3 free: 63420 MB
> >> node distances:
> >> node 0 1 2 3
> >> 0: 10 16 32 33
> >> 1: 16 10 25 32
> >> 2: 32 25 10 16
> >> 3: 33 32 16 10
[...]
> before patch
> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
> 2
> Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
> 48

node 2

> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
> 0

node 0

> Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
> 24

node 1

> Euler:/sys/bus/pci #
>
> after patch
> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
> 2
> Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
> 48

node 2

> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
> 72

node 3

> Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
> 24

node 1

So, a few more questions. The only difference seems to be IRQ369
moving from node 0 to node 3, and since the device is affine to node 2
that makes some sense because node 3 is closer. So far so good.
I still have a large gap in the whole picture, namely why those
other IRQs are not using any of the existing CPUs on node 2.
Could you explain that please?

Btw. this all should be in the changelog.
--
Michal Hocko
SUSE Labs

2019-11-15 09:11:25

by Zhangshaokun

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/14 22:43, Michal Hocko wrote:
> On Wed 13-11-19 10:46:05, Shaokun Zhang wrote:
> [...]
>>>> available: 4 nodes (0-3)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
>>>> node 0 size: 63379 MB
>>>> node 0 free: 61899 MB
>>>> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>>>> node 1 size: 64509 MB
>>>> node 1 free: 63942 MB
>>>> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
>>>> node 2 size: 64509 MB
>>>> node 2 free: 63056 MB
>>>> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
>>>> node 3 size: 63997 MB
>>>> node 3 free: 63420 MB
>>>> node distances:
>>>> node 0 1 2 3
>>>> 0: 10 16 32 33
>>>> 1: 16 10 25 32
>>>> 2: 32 25 10 16
>>>> 3: 33 32 16 10
> [...]
>> before patch
>> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
>> 2
>> Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
>> 48
>
> node 2
>
>> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
>> 0
>
> node 0
>
>> Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
>> 24
>
> node 1
>
>> Euler:/sys/bus/pci #
>>
>> after patch
>> Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
>> 2
>> Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
>> 48
>
> node 2
>
>> Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
>> 72
>
> node 3
>
>> Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
>> 24
>
> node 1
>
> So few more questions. The only difference seems to be IRQ369
> moving from 0 to 3 and having the device affinity to node 2
> makes some sense because node 3 is closer. So far so good.

Right, it is what we want.

> I still have a large gap to get the whole picture. Namely why those
> other IRQs are not using any of the existing CPUs on the node 2.
> Could you explain that please?
>

Oh, my mistake. In the previous instance I didn't list all IRQs and
just chose one IRQ from each NUMA node; you can see that the IRQ
numbers are not consecutive :-).
IRQs 345 to 368 are bound to CPU cores in NUMA node2,
each IRQ corresponding to one core.

Euler:/sys/bus/pci # cat /proc/irq/346/smp_affinity_list
49

The others are similar.
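
For context, a hedged sketch (not taken from any real driver) of how a NIC
driver typically produces this one-IRQ-per-core mapping by calling
cpumask_local_spread() per queue; the irq_vec[] array and nr_queues count
are hypothetical.

/*
 * Hedged sketch, not taken from any driver in this thread: how a NIC
 * driver typically derives the one-IRQ-per-core mapping above by asking
 * cpumask_local_spread() for the i'th CPU near its NUMA node.  The
 * irq_vec[] array and nr_queues count are hypothetical driver fields.
 */
#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/interrupt.h>

static void example_set_queue_affinity(struct device *dev,
				       const unsigned int *irq_vec,
				       unsigned int nr_queues)
{
	int node = dev_to_node(dev);	/* node 2 for the NIC above */
	unsigned int i;

	for (i = 0; i < nr_queues; i++) {
		unsigned int cpu = cpumask_local_spread(i, node);

		/*
		 * With the patch, on the box above queues 0-23 land on
		 * node 2 and queues 24-47 on node 3, the nearest node.
		 */
		irq_set_affinity_hint(irq_vec[i], cpumask_of(cpu));
	}
}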

> Btw. this all should be in the changelog.

Ok, I will follow it in future.

Thanks,
Shaokun

>

2019-11-15 13:37:53

by Michal Hocko

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

On Fri 15-11-19 17:09:13, Shaokun Zhang wrote:
[...]
> Oh, my mistake, for the previous instance, I don't list all IRQs and
> just choose one IRQ from one NUMA node. You can see that the IRQ
> number is not consistent :-).
> IRQ from 345 to 368 will be bound to CPU cores which are in NUMA node2
> and each IRQ is corresponding to one core.
>

This is quite confusing then. I would suggest providing all IRQs used by
the device, with their specific node affinity, to see the difference in the
setup.

--
Michal Hocko
SUSE Labs

2019-11-21 08:17:43

by Zhangshaokun

Subject: Re: [PATCH v3] lib: optimize cpumask_local_spread()

Hi Michal,

On 2019/11/15 21:36, Michal Hocko wrote:
> On Fri 15-11-19 17:09:13, Shaokun Zhang wrote:
> [...]
>> Oh, my mistake, for the previous instance, I don't list all IRQs and
>> just choose one IRQ from one NUMA node. You can see that the IRQ
>> number is not consistent :-).
>> IRQ from 345 to 368 will be bound to CPU cores which are in NUMA node2
>> and each IRQ is corresponding to one core.
>>
>
> This is quite confusing then. I would suggest providing all IRQ used for

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 63379 MB
node 0 free: 61899 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64509 MB
node 1 free: 63942 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 64509 MB
node 2 free: 63056 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 63997 MB
node 3 free: 63420 MB
node distances:
node 0 1 2 3
0: 10 16 32 33
1: 16 10 25 32
2: 32 25 10 16
3: 33 32 16 10

Before the patch:
/* I/O device is located on NUMA node2 */
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2
Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
48
Euler:/sys/bus/pci # cat /proc/irq/346/smp_affinity_list
49
.
.
.
Euler:/sys/bus/pci # cat /proc/irq/367/smp_affinity_list
70
Euler:/sys/bus/pci # cat /proc/irq/368/smp_affinity_list
71

/* here we expect the device's irqs 24 to 47 to be bound to node3 */
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
0
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
1
.
.
.
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
22
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
23

Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
24
Euler:/sys/bus/pci # cat /proc/irq/394/smp_affinity_list
25
.
.
.
/* There are 64 irqs in total on the eth I/O device. */
Euler:/sys/bus/pci # cat /proc/irq/407/smp_affinity_list
38
Euler:/sys/bus/pci # cat /proc/irq/408/smp_affinity_list
39
Euler:/sys/bus/pci #

After the patch:
Euler:/sys/bus/pci/devices/0000:7d:00.2 # cat numa_node
2

Euler:/sys/bus/pci # cat /proc/irq/345/smp_affinity_list
48
Euler:/sys/bus/pci # cat /proc/irq/346/smp_affinity_list
49
.
.
.
Euler:/sys/bus/pci # cat /proc/irq/367/smp_affinity_list
70
Euler:/sys/bus/pci # cat /proc/irq/368/smp_affinity_list
71

Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
72
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
73
.
.
.
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
94
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
95
/* When the cores of the socket are used up, we choose the node closest to node 2;
   before this patch, choosing the same node here was just a coincidence. */
Euler:/sys/bus/pci # cat /proc/irq/393/smp_affinity_list
24
Euler:/sys/bus/pci # cat /proc/irq/394/smp_affinity_list
25
.
.
.
/* There are 64 irqs in total on the eth I/O device. */
Euler:/sys/bus/pci # cat /proc/irq/407/smp_affinity_list
38
Euler:/sys/bus/pci # cat /proc/irq/408/smp_affinity_list
39
Euler:/sys/bus/pci #

Thanks,
Shaokun

> the device with the specific node affinity to see the difference in the
> setup.
>