2024-01-09 10:51:50

by Souradeep Chakrabarti

[permalink] [raw]
Subject: [PATCH 0/4 net-next] net: mana: Assigning IRQ affinity on HT cores

This patch set introduces a new helper function, irq_setup(),
to optimize IRQ distribution for MANA network devices.
The patch set makes the driver work ~15% faster than
with cpumask_local_spread().

Souradeep Chakrabarti (1):
net: mana: Assigning IRQ affinity on HT cores

Yury Norov (3):
cpumask: add cpumask_weight_andnot()
cpumask: define cleanup function for cpumasks
net: mana: add a function to spread IRQs per CPUs

.../net/ethernet/microsoft/mana/gdma_main.c | 87 ++++++++++++++++---
include/linux/bitmap.h | 12 +++
include/linux/cpumask.h | 16 ++++
lib/bitmap.c | 7 ++
4 files changed, 112 insertions(+), 10 deletions(-)

--
2.34.1



2024-01-09 10:52:12

by Souradeep Chakrabarti

[permalink] [raw]
Subject: [PATCH 1/4 net-next] cpumask: add cpumask_weight_andnot()

From: Yury Norov <[email protected]>

Similarly to cpumask_weight_and(), cpumask_weight_andnot() is a handy
helper that helps to avoid creating an intermediate mask just to
calculate the number of bits that are set in the 1st given mask and
clear in the 2nd one.

Signed-off-by: Yury Norov <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
---
include/linux/bitmap.h | 12 ++++++++++++
include/linux/cpumask.h | 13 +++++++++++++
lib/bitmap.c | 7 +++++++
3 files changed, 32 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 99451431e4d6..5814e9ee40ba 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -54,6 +54,7 @@ struct device;
* bitmap_full(src, nbits) Are all bits set in *src?
* bitmap_weight(src, nbits) Hamming Weight: number set bits
* bitmap_weight_and(src1, src2, nbits) Hamming Weight of and'ed bitmap
+ * bitmap_weight_andnot(src1, src2, nbits) Hamming Weight of andnot'ed bitmap
* bitmap_set(dst, pos, nbits) Set specified bit area
* bitmap_clear(dst, pos, nbits) Clear specified bit area
* bitmap_find_next_zero_area(buf, len, pos, n, mask) Find bit free area
@@ -169,6 +170,8 @@ bool __bitmap_subset(const unsigned long *bitmap1,
unsigned int __bitmap_weight(const unsigned long *bitmap, unsigned int nbits);
unsigned int __bitmap_weight_and(const unsigned long *bitmap1,
const unsigned long *bitmap2, unsigned int nbits);
+unsigned int __bitmap_weight_andnot(const unsigned long *bitmap1,
+ const unsigned long *bitmap2, unsigned int nbits);
void __bitmap_set(unsigned long *map, unsigned int start, int len);
void __bitmap_clear(unsigned long *map, unsigned int start, int len);

@@ -425,6 +428,15 @@ unsigned long bitmap_weight_and(const unsigned long *src1,
return __bitmap_weight_and(src1, src2, nbits);
}

+static __always_inline
+unsigned long bitmap_weight_andnot(const unsigned long *src1,
+ const unsigned long *src2, unsigned int nbits)
+{
+ if (small_const_nbits(nbits))
+ return hweight_long(*src1 & ~(*src2) & BITMAP_LAST_WORD_MASK(nbits));
+ return __bitmap_weight_andnot(src1, src2, nbits);
+}
+
static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
unsigned int nbits)
{
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cfb545841a2c..228c23eb36d2 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -719,6 +719,19 @@ static inline unsigned int cpumask_weight_and(const struct cpumask *srcp1,
return bitmap_weight_and(cpumask_bits(srcp1), cpumask_bits(srcp2), small_cpumask_bits);
}

+/**
+ * cpumask_weight_andnot - Count of bits in (*srcp1 & ~*srcp2)
+ * @srcp1: the cpumask to count bits (< nr_cpu_ids) in.
+ * @srcp2: the cpumask to exclude bits (< nr_cpu_ids) of.
+ *
+ * Return: count of bits set in *srcp1 and clear in *srcp2
+ */
+static inline unsigned int cpumask_weight_andnot(const struct cpumask *srcp1,
+ const struct cpumask *srcp2)
+{
+ return bitmap_weight_andnot(cpumask_bits(srcp1), cpumask_bits(srcp2), small_cpumask_bits);
+}
+
/**
* cpumask_shift_right - *dstp = *srcp >> n
* @dstp: the cpumask result
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 09522af227f1..b97692854966 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -348,6 +348,13 @@ unsigned int __bitmap_weight_and(const unsigned long *bitmap1,
}
EXPORT_SYMBOL(__bitmap_weight_and);

+unsigned int __bitmap_weight_andnot(const unsigned long *bitmap1,
+ const unsigned long *bitmap2, unsigned int bits)
+{
+ return BITMAP_WEIGHT(bitmap1[idx] & ~bitmap2[idx], bits);
+}
+EXPORT_SYMBOL(__bitmap_weight_andnot);
+
void __bitmap_set(unsigned long *map, unsigned int start, int len)
{
unsigned long *p = map + BIT_WORD(start);
--
2.34.1


2024-01-09 10:52:38

by Souradeep Chakrabarti

[permalink] [raw]
Subject: [PATCH 2/4 net-next] cpumask: define cleanup function for cpumasks

From: Yury Norov <[email protected]>

Now we can simplify code that allocates cpumasks for local needs.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 228c23eb36d2..1c29947db848 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -7,6 +7,7 @@
* set of CPUs in a system, one bit position per CPU number. In general,
* only nr_cpu_ids (<= NR_CPUS) bits are valid.
*/
+#include <linux/cleanup.h>
#include <linux/kernel.h>
#include <linux/threads.h>
#include <linux/bitmap.h>
@@ -990,6 +991,8 @@ static inline bool cpumask_available(cpumask_var_t mask)
}
#endif /* CONFIG_CPUMASK_OFFSTACK */

+DEFINE_FREE(free_cpumask_var, struct cpumask *, if (_T) free_cpumask_var(_T));
+
/* It's common to want to use cpu_all_mask in struct member initializers,
* so it has to refer to an address rather than a pointer. */
extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
--
2.34.1


2024-01-09 10:53:06

by Souradeep Chakrabarti

[permalink] [raw]
Subject: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Yury Norov <[email protected]>

Souradeep found that the driver performs faster if IRQs are
spread across CPUs with the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node             0                   1
Core         0       1         2        3
CPU         0   1   2   3     4   5   6   7

The most performant IRQ distribution based on the above topology
and heuristics may look like this:

IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3
2       1       0       0-1
3       1       1       2-3
4       2       2       4-5
5       2       3       6-7
6       2       2       4-5
7       2       3       6-7

The irq_setup() routine introduced in this patch leverages the
for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
as described above.

According to [1], for the NUMA-aware but sibling-ignorant IRQ
distribution based on cpumask_local_spread(), performance test results
look like this:

/ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: ##### Totals: #####
08:06:28 INFO: test duration :60.00 seconds
08:06:28 INFO: total bytes :630292053310
08:06:28 INFO: throughput :84.04Gbps
08:06:28 INFO: retrans segs :4
08:06:28 INFO: cpu cores :192
08:06:28 INFO: cpu speed :3799.725MHz
08:06:28 INFO: user :0.05%
08:06:28 INFO: system :1.60%
08:06:28 INFO: idle :96.41%
08:06:28 INFO: iowait :0.00%
08:06:28 INFO: softirq :1.94%
08:06:28 INFO: cycles/byte :2.50
08:06:28 INFO: cpu busy (all) :534.41%

For NUMA- and sibling-aware IRQ distribution, the same test works
15% faster:

/ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: ##### Totals: #####
08:09:56 INFO: test duration :60.00 seconds
08:09:56 INFO: total bytes :741966608384
08:09:56 INFO: throughput :98.93Gbps
08:09:56 INFO: retrans segs :6
08:09:56 INFO: cpu cores :192
08:09:56 INFO: cpu speed :3799.791MHz
08:09:56 INFO: user :0.06%
08:09:56 INFO: system :1.81%
08:09:56 INFO: idle :96.18%
08:09:56 INFO: iowait :0.00%
08:09:56 INFO: softirq :1.95%
08:09:56 INFO: cycles/byte :2.25
08:09:56 INFO: cpu busy (all) :569.22%

[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/

Signed-off-by: Yury Norov <[email protected]>
Co-developed-by: Souradeep Chakrabarti <[email protected]>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 29 +++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 6367de0c2c2e..6a967d6be01e 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1243,6 +1243,35 @@ void mana_gd_free_res_map(struct gdma_resource *r)
r->size = 0;
}

+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
+{
+ const struct cpumask *next, *prev = cpu_none_mask;
+ cpumask_var_t cpus __free(free_cpumask_var);
+ int cpu, weight;
+
+ if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+ return -ENOMEM;
+
+ rcu_read_lock();
+ for_each_numa_hop_mask(next, node) {
+ weight = cpumask_weight_andnot(next, prev);
+ while (weight > 0) {
+ cpumask_andnot(cpus, next, prev);
+ for_each_cpu(cpu, cpus) {
+ if (len-- == 0)
+ goto done;
+ irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
+ cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
+ --weight;
+ }
+ }
+ prev = next;
+ }
+done:
+ rcu_read_unlock();
+ return 0;
+}
+
static int mana_gd_setup_irqs(struct pci_dev *pdev)
{
unsigned int max_queues_per_port = num_online_cpus();
--
2.34.1


2024-01-09 10:53:31

by Souradeep Chakrabarti

[permalink] [raw]
Subject: [PATCH 4/4 net-next] net: mana: Assigning IRQ affinity on HT cores

The existing MANA design assigns an IRQ to every CPU, including sibling
hyper-threads. This may cause multiple IRQs to be active simultaneously
in the same core and may reduce network performance.

Improve performance by assigning IRQs to non-sibling CPUs in the local
NUMA node. The performance improvement we are getting using ntttcp with
the following patch is around 15 percent against the existing design,
and approximately 11 percent when trying to assign one IRQ to each core
across NUMA nodes, if enough cores are present.

Signed-off-by: Souradeep Chakrabarti <[email protected]>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 58 +++++++++++++++----
1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 6a967d6be01e..6715d6939bc7 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1274,13 +1274,16 @@ static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int no

static int mana_gd_setup_irqs(struct pci_dev *pdev)
{
- unsigned int max_queues_per_port = num_online_cpus();
struct gdma_context *gc = pci_get_drvdata(pdev);
+ unsigned int max_queues_per_port;
struct gdma_irq_context *gic;
unsigned int max_irqs, cpu;
- int nvec, irq;
+ int start_irq_index = 1;
+ int nvec, *irqs, irq;
int err, i = 0, j;

+ cpus_read_lock();
+ max_queues_per_port = num_online_cpus();
if (max_queues_per_port > MANA_MAX_NUM_QUEUES)
max_queues_per_port = MANA_MAX_NUM_QUEUES;

@@ -1288,8 +1291,18 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
max_irqs = max_queues_per_port + 1;

nvec = pci_alloc_irq_vectors(pdev, 2, max_irqs, PCI_IRQ_MSIX);
- if (nvec < 0)
+ if (nvec < 0) {
+ cpus_read_unlock();
return nvec;
+ }
+ if (nvec <= num_online_cpus())
+ start_irq_index = 0;
+
+ irqs = kmalloc_array((nvec - start_irq_index), sizeof(int), GFP_KERNEL);
+ if (!irqs) {
+ err = -ENOMEM;
+ goto free_irq_vector;
+ }

gc->irq_contexts = kcalloc(nvec, sizeof(struct gdma_irq_context),
GFP_KERNEL);
@@ -1316,21 +1329,44 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
goto free_irq;
}

- err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_irq;
-
- cpu = cpumask_local_spread(i, gc->numa_node);
- irq_set_affinity_and_hint(irq, cpumask_of(cpu));
+ if (!i) {
+ err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+ if (err)
+ goto free_irq;
+
+ /* If the number of IRQs is one more than the number of online
+ * CPUs, then IRQ0 (the hwc IRQ) and IRQ1 need to be assigned
+ * to the same CPU.
+ * Otherwise, different CPUs are used for IRQ0 and IRQ1.
+ * Also, cpumask_local_spread() is used instead of
+ * cpumask_first() for the node, because the node can be
+ * memory-only.
+ */
+ if (start_irq_index) {
+ cpu = cpumask_local_spread(i, gc->numa_node);
+ irq_set_affinity_and_hint(irq, cpumask_of(cpu));
+ } else {
+ irqs[start_irq_index] = irq;
+ }
+ } else {
+ irqs[i - start_irq_index] = irq;
+ err = request_irq(irqs[i - start_irq_index], mana_gd_intr, 0,
+ gic->name, gic);
+ if (err)
+ goto free_irq;
+ }
}

+ err = irq_setup(irqs, (nvec - start_irq_index), gc->numa_node);
+ if (err)
+ goto free_irq;
err = mana_gd_alloc_res_map(nvec, &gc->msix_resource);
if (err)
goto free_irq;

gc->max_num_msix = nvec;
gc->num_msix_usable = nvec;
-
+ cpus_read_unlock();
return 0;

free_irq:
@@ -1343,8 +1379,10 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
}

kfree(gc->irq_contexts);
+ kfree(irqs);
gc->irq_contexts = NULL;
free_irq_vector:
+ cpus_read_unlock();
pci_free_irq_vectors(pdev);
return err;
}
--
2.34.1


2024-01-09 11:57:58

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH 0/4 net-next] net: mana: Assigning IRQ affinity on HT cores

On Tue, 2024-01-09 at 02:51 -0800, Souradeep Chakrabarti wrote:
> This patch set introduces a new helper function irq_setup(),
> to optimize IRQ distribution for MANA network devices.
> The patch set makes the driver working 15% faster than
> with cpumask_local_spread().
>
> Souradeep Chakrabarti (1):
> net: mana: Assigning IRQ affinity on HT cores
>
> Yury Norov (3):
> cpumask: add cpumask_weight_andnot()
> cpumask: define cleanup function for cpumasks
> net: mana: add a function to spread IRQs per CPUs
>
> .../net/ethernet/microsoft/mana/gdma_main.c | 87 ++++++++++++++++---
> include/linux/bitmap.h | 12 +++
> include/linux/cpumask.h | 16 ++++
> lib/bitmap.c | 7 ++
> 4 files changed, 112 insertions(+), 10 deletions(-)

net-next is closed for the merge window period, please repost when it
re-opens in ~2 weeks.

Cheers,

Paolo


2024-01-09 19:23:03

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Souradeep Chakrabarti <[email protected]> Sent: Tuesday, January 9, 2024 2:51 AM
>
> From: Yury Norov <[email protected]>
>
> Souradeep investigated that the driver performs faster if IRQs are
> spread on CPUs with the following heuristics:
>
> 1. No more than one IRQ per CPU, if possible;
> 2. NUMA locality is the second priority;
> 3. Sibling dislocality is the last priority.
>
> Let's consider this topology:
>
> Node             0                   1
> Core         0       1         2        3
> CPU         0   1   2   3     4   5   6   7
>
> The most performant IRQ distribution based on the above topology
> and heuristics may look like this:
>
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0-1
> 1       1       1       2-3
> 2       1       0       0-1
> 3       1       1       2-3
> 4       2       2       4-5
> 5       2       3       6-7
> 6       2       2       4-5
> 7       2       3       6-7

I didn't pay attention to the detailed discussion of this issue
over the past 2 to 3 weeks during the holidays in the U.S., but
the above doesn't align with the original problem as I understood
it. I thought the original problem was to avoid putting IRQs on
both hyper-threads in the same core, and that the perf
improvements are based on that configuration. At least that's
what the commit message for Patch 4/4 in this series says.

The above chart results in 8 IRQs being assigned to the 8 CPUs,
probably with 1 IRQ per CPU. At least on x86, if the affinity
mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
should balance the IRQ assignments between the CPUs in the mask.
So the original problem is still present because both hyper-threads
in a core are likely to have an IRQ assigned.

Of course, this example has 8 IRQs and 8 CPUs, so assigning an
IRQ to every hyper-thread may be the only choice. If that's the
case, maybe this just isn't a good example to illustrate the
original problem and solution. But even with a better example
where the # of IRQs is <= half the # of CPUs in a NUMA node,
I don't think the code below accomplishes the original intent.

Maybe I've missed something along the way in getting to this
version of the patch. Please feel free to set me straight. :-)

Michael

>
> The irq_setup() routine introduced in this patch leverages the
> for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
> as described above.
>
> According to [1], for NUMA-aware but sibling-ignorant IRQ distribution
> based on cpumask_local_spread() performance test results look like this:
>
> /ntttcp -r -m 16
> NTTTCP for Linux 1.4.0
> ---------------------------------------------------------
> 08:05:20 INFO: 17 threads created
> 08:05:28 INFO: Network activity progressing...
> 08:06:28 INFO: Test run completed.
> 08:06:28 INFO: Test cycle finished.
> 08:06:28 INFO: ##### Totals: #####
> 08:06:28 INFO: test duration :60.00 seconds
> 08:06:28 INFO: total bytes :630292053310
> 08:06:28 INFO: throughput :84.04Gbps
> 08:06:28 INFO: retrans segs :4
> 08:06:28 INFO: cpu cores :192
> 08:06:28 INFO: cpu speed :3799.725MHz
> 08:06:28 INFO: user :0.05%
> 08:06:28 INFO: system :1.60%
> 08:06:28 INFO: idle :96.41%
> 08:06:28 INFO: iowait :0.00%
> 08:06:28 INFO: softirq :1.94%
> 08:06:28 INFO: cycles/byte :2.50
> 08:06:28 INFO: cpu busy (all) :534.41%
>
> For NUMA- and sibling-aware IRQ distribution, the same test works
> 15% faster:
>
> /ntttcp -r -m 16
> NTTTCP for Linux 1.4.0
> ---------------------------------------------------------
> 08:08:51 INFO: 17 threads created
> 08:08:56 INFO: Network activity progressing...
> 08:09:56 INFO: Test run completed.
> 08:09:56 INFO: Test cycle finished.
> 08:09:56 INFO: ##### Totals: #####
> 08:09:56 INFO: test duration :60.00 seconds
> 08:09:56 INFO: total bytes :741966608384
> 08:09:56 INFO: throughput :98.93Gbps
> 08:09:56 INFO: retrans segs :6
> 08:09:56 INFO: cpu cores :192
> 08:09:56 INFO: cpu speed :3799.791MHz
> 08:09:56 INFO: user :0.06%
> 08:09:56 INFO: system :1.81%
> 08:09:56 INFO: idle :96.18%
> 08:09:56 INFO: iowait :0.00%
> 08:09:56 INFO: softirq :1.95%
> 08:09:56 INFO: cycles/byte :2.25
> 08:09:56 INFO: cpu busy (all) :569.22%
>
> [1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
>
> Signed-off-by: Yury Norov <[email protected]>
> Co-developed-by: Souradeep Chakrabarti
> <[email protected]>
> ---
> .../net/ethernet/microsoft/mana/gdma_main.c | 29
> +++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 6367de0c2c2e..6a967d6be01e 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1243,6 +1243,35 @@ void mana_gd_free_res_map(struct gdma_resource *r)
> r->size = 0;
> }
>
> +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> +{
> + const struct cpumask *next, *prev = cpu_none_mask;
> + cpumask_var_t cpus __free(free_cpumask_var);
> + int cpu, weight;
> +
> + if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> + return -ENOMEM;
> +
> + rcu_read_lock();
> + for_each_numa_hop_mask(next, node) {
> + weight = cpumask_weight_andnot(next, prev);
> + while (weight > 0) {
> + cpumask_andnot(cpus, next, prev);
> + for_each_cpu(cpu, cpus) {
> + if (len-- == 0)
> + goto done;
> + irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> + cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> + --weight;
> + }
> + }
> + prev = next;
> + }
> +done:
> + rcu_read_unlock();
> + return 0;
> +}
> +
> static int mana_gd_setup_irqs(struct pci_dev *pdev)
> {
> unsigned int max_queues_per_port = num_online_cpus();
> --
> 2.34.1
>


2024-01-09 20:20:53

by Haiyang Zhang

[permalink] [raw]
Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs



> -----Original Message-----
> From: Michael Kelley <[email protected]>
> Sent: Tuesday, January 9, 2024 2:23 PM
> To: Souradeep Chakrabarti <[email protected]>; KY Srinivasan
> <[email protected]>; Haiyang Zhang <[email protected]>;
> [email protected]; Dexuan Cui <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; Long Li <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]
> Cc: Souradeep Chakrabarti <[email protected]>; Paul Rosswurm
> <[email protected]>
> Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per
> CPUs
>
>
> From: Souradeep Chakrabarti <[email protected]> Sent:
> Tuesday, January 9, 2024 2:51 AM
> >
> > From: Yury Norov <[email protected]>
> >
> > Souradeep investigated that the driver performs faster if IRQs are
> > spread on CPUs with the following heuristics:
> >
> > 1. No more than one IRQ per CPU, if possible;
> > 2. NUMA locality is the second priority;
> > 3. Sibling dislocality is the last priority.
> >
> > Let's consider this topology:
> >
> > Node 0 1
> > Core 0 1 2 3
> > CPU 0 1 2 3 4 5 6 7
> >
> > The most performant IRQ distribution based on the above topology
> > and heuristics may look like this:
> >
> > IRQ Nodes Cores CPUs
> > 0 1 0 0-1
> > 1 1 1 2-3
> > 2 1 0 0-1
> > 3 1 1 2-3
> > 4 2 2 4-5
> > 5 2 3 6-7
> > 6 2 2 4-5
> > 7 2 3 6-7
>
> I didn't pay attention to the detailed discussion of this issue
> over the past 2 to 3 weeks during the holidays in the U.S., but
> the above doesn't align with the original problem as I understood
> it. I thought the original problem was to avoid putting IRQs on
> both hyper-threads in the same core, and that the perf
> improvements are based on that configuration. At least that's
> what the commit message for Patch 4/4 in this series says.
>
> The above chart results in 8 IRQs being assigned to the 8 CPUs,
> probably with 1 IRQ per CPU. At least on x86, if the affinity
> mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> should balance the IRQ assignments between the CPUs in the mask.
> So the original problem is still present because both hyper-threads
> in a core are likely to have an IRQ assigned.
>
> Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> IRQ to every hyper-thread may be the only choice. If that's the
> case, maybe this just isn't a good example to illustrate the
> original problem and solution. But even with a better example
> where the # of IRQs is <= half the # of CPUs in a NUMA node,
> I don't think the code below accomplishes the original intent.
>
> Maybe I've missed something along the way in getting to this
> version of the patch. Please feel free to set me straight. :-)
>
> Michael

I have the same question as Michael. Also, I'm asking Souradeep
in another channel: so, the algorithm still uses up all of the current
NUMA node before moving on to the next NUMA node, right?

Except each IRQ is affinitized to 2 CPUs.
For example, a system with 2 IRQs:
IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3

Is this performing better than the algorithm in earlier patches? like below:
IRQ     Nodes   Cores   CPUs
0       1       0       0
1       1       1       2

Thanks,
- Haiyang


2024-01-09 23:29:22

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

Hi Michael,

So, I'm just a guy who helped to formulate the heuristics in an
itemized form, and implement them using the existing kernel API.
I have no access to MANA machines and I ran no performance tests
myself.

On Tue, Jan 09, 2024 at 07:22:38PM +0000, Michael Kelley wrote:
> From: Souradeep Chakrabarti <[email protected]> Sent: Tuesday, January 9, 2024 2:51 AM
> >
> > From: Yury Norov <[email protected]>
> >
> > Souradeep investigated that the driver performs faster if IRQs are
> > spread on CPUs with the following heuristics:
> >
> > 1. No more than one IRQ per CPU, if possible;
> > 2. NUMA locality is the second priority;
> > 3. Sibling dislocality is the last priority.
> >
> > Let's consider this topology:
> >
> > Node 0 1
> > Core 0 1 2 3
> > CPU 0 1 2 3 4 5 6 7
> >
> > The most performant IRQ distribution based on the above topology
> > and heuristics may look like this:
> >
> > IRQ Nodes Cores CPUs
> > 0 1 0 0-1
> > 1 1 1 2-3
> > 2 1 0 0-1
> > 3 1 1 2-3
> > 4 2 2 4-5
> > 5 2 3 6-7
> > 6 2 2 4-5
> > 7 2 3 6-7
>
> I didn't pay attention to the detailed discussion of this issue
> over the past 2 to 3 weeks during the holidays in the U.S., but
> the above doesn't align with the original problem as I understood
> it. I thought the original problem was to avoid putting IRQs on
> both hyper-threads in the same core, and that the perf
> improvements are based on that configuration. At least that's
> what the commit message for Patch 4/4 in this series says.

Yes, and the original distribution suggested by Souradeep looks very
similar:

IRQ     Nodes   Cores   CPUs
0       1       0       0
1       1       1       2
2       1       0       1
3       1       1       3
4       2       2       4
5       2       3       6
6       2       2       5
7       2       3       7

I just added a bit more flexibility, so that kernel may pick any
sibling for the IRQ. As I understand, both approaches have similar
performance. Probably my fine-tune added another half-percent...

Souradeep, can you please share the exact numbers on this?

> The above chart results in 8 IRQs being assigned to the 8 CPUs,
> probably with 1 IRQ per CPU. At least on x86, if the affinity
> mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> should balance the IRQ assignments between the CPUs in the mask.
> So the original problem is still present because both hyper-threads
> in a core are likely to have an IRQ assigned.

That's what I think: if the topology makes us put IRQs in the
same sibling group, the best thing we can do is to rely on existing
balancing mechanisms in the hope that they will do their job well.

> Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> IRQ to every hyper-thread may be the only choice. If that's the
> case, maybe this just isn't a good example to illustrate the
> original problem and solution.

Yeah... This example illustrates the order of IRQ distribution.
I really doubt that if we distribute IRQs like in the above example,
there would be any difference in performance. But I think it's quite
a good illustration. I could write the title for the table like this:

The order of IRQ distribution for the best performance
based on [...] may look like this.

> But even with a better example
> where the # of IRQs is <= half the # of CPUs in a NUMA node,
> I don't think the code below accomplishes the original intent.
>
> Maybe I've missed something along the way in getting to this
> version of the patch. Please feel free to set me straight. :-)

Hmm. So if the number of IRQs is the half # of CPUs in the nodes,
which is 2 in the example above, the distribution will look like
this:

IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3

And each IRQ belongs to a different sibling group. This follows
the rules above.

I think of it like we assign an IRQ to a group of 2 CPUs, so from
the heuristic #1 perspective, each CPU is assigned with 1/2 of the
IRQ.

If I add one more IRQ, then according to the heuristics, NUMA locality
trumps sibling dislocality, so we'd assign the IRQ to the same node, on
any core. My algorithm assigns it to core #0:

2       1       0       0-1

This doubles # of IRQs for the CPUs 0 and 1: from 1/2 to 1.

The next IRQ should be assigned to the same node again, and we've got
the only choice:


3       1       1       2-3

Starting from IRQ #5, node #1 is full - each CPU is assigned
exactly one IRQ, and heuristic #1 makes us switch to the other
node; and then do the same thing:

4       2       2       4-5
5       2       3       6-7
6       2       2       4-5
7       2       3       6-7

So I think the algorithm is correct... Really hope the above makes
sense. :) If so, I can add it to the commit message for patch #3.

Nevertheless... Souradeep, in addition to the performance numbers, can
you share your topology and actual IRQ distribution that gains 15%? I
think it should be added to the patch #4 commit message.

Thanks,
Yury

2024-01-10 00:28:31

by Michael Kelley

[permalink] [raw]
Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Yury Norov <[email protected]> Sent: Tuesday, January 9, 2024 3:29 PM
>
> Hi Michael,
>
> So, I'm just a guy who helped to formulate the heuristics in an
> itemized form, and implement them using the existing kernel API.
> I have no access to MANA machines and I ran no performance tests
> myself.

Agreed. :-) Given the heritage of the patch, I should have clarified
that my question was directed to Souradeep. Regardless, your work
on the cpumask manipulation made everything better and clearer.

>
> On Tue, Jan 09, 2024 at 07:22:38PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <[email protected]> Sent:
> Tuesday, January 9, 2024 2:51 AM
> > >
> > > From: Yury Norov <[email protected]>
> > >
> > > Souradeep investigated that the driver performs faster if IRQs are
> > > spread on CPUs with the following heuristics:
> > >
> > > 1. No more than one IRQ per CPU, if possible;
> > > 2. NUMA locality is the second priority;
> > > 3. Sibling dislocality is the last priority.
> > >
> > > Let's consider this topology:
> > >
> > > Node 0 1
> > > Core 0 1 2 3
> > > CPU 0 1 2 3 4 5 6 7
> > >
> > > The most performant IRQ distribution based on the above topology
> > > and heuristics may look like this:
> > >
> > > IRQ Nodes Cores CPUs
> > > 0 1 0 0-1
> > > 1 1 1 2-3
> > > 2 1 0 0-1
> > > 3 1 1 2-3
> > > 4 2 2 4-5
> > > 5 2 3 6-7
> > > 6 2 2 4-5
> > > 7 2 3 6-7
> >
> > I didn't pay attention to the detailed discussion of this issue
> > over the past 2 to 3 weeks during the holidays in the U.S., but
> > the above doesn't align with the original problem as I understood
> > it. I thought the original problem was to avoid putting IRQs on
> > both hyper-threads in the same core, and that the perf
> > improvements are based on that configuration. At least that's
> > what the commit message for Patch 4/4 in this series says.
>
> Yes, and the original distribution suggested by Souradeep looks very
> similar:
>
> IRQ Nodes Cores CPUs
> 0 1 0 0
> 1 1 1 2
> 2 1 0 1
> 3 1 1 3
> 4 2 2 4
> 5 2 3 6
> 6 2 2 5
> 7 2 3 7
>
> I just added a bit more flexibility, so that kernel may pick any
> sibling for the IRQ. As I understand, both approaches have similar
> performance. Probably my fine-tune added another half-percent...
>
> Souradeep, can you please share the exact numbers on this?
>
> > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > probably with 1 IRQ per CPU. At least on x86, if the affinity
> > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > should balance the IRQ assignments between the CPUs in the mask.
> > So the original problem is still present because both hyper-threads
> > in a core are likely to have an IRQ assigned.
>
> That's what I think: if the topology makes us put IRQs in the
> same sibling group, the best thing we can do is to rely on existing
> balancing mechanisms in the hope that they will do their job well.
>
> > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > IRQ to every hyper-thread may be the only choice. If that's the
> > case, maybe this just isn't a good example to illustrate the
> > original problem and solution.
>
> Yeah... This example illustrates the order of IRQ distribution.
> I really doubt that if we distribute IRQs like in the above example,
> there would be any difference in performance. But I think it's quite
> a good illustration. I could write the title for the table like this:
>
> The order of IRQ distribution for the best performance
> based on [...] may look like this.
>
> > But even with a better example
> > where the # of IRQs is <= half the # of CPUs in a NUMA node,
> > I don't think the code below accomplishes the original intent.
> >
> > Maybe I've missed something along the way in getting to this
> > version of the patch. Please feel free to set me straight. :-)
>
> Hmm. So if the number of IRQs is half the # of CPUs in the node,
> which is 2 in the example above, the distribution will look like
> this:
>
> IRQ Nodes Cores CPUs
> 0 1 0 0-1
> 1 1 1 2-3
>
> And each IRQ belongs to a different sibling group. This follows
> the rules above.
>
> I think of it like we assign an IRQ to a group of 2 CPUs, so from
> the heuristic #1 perspective, each CPU is assigned with 1/2 of the
> IRQ.
>
> If I add one more IRQ, then according to the heuristics, NUMA locality
> trumps sibling dislocality, so we'd assign the IRQ to the same node on any
> core. My algorithm assigns it to the core #0:
>
> 2 1 0 0-1
>
> This doubles # of IRQs for the CPUs 0 and 1: from 1/2 to 1.
>
> The next IRQ should be assigned to the same node again, and we've got
> the only choice:
>
>
> 3 1 1 2-3
>
> Starting from IRQ #4, node #1 is full - each CPU is assigned
> exactly one IRQ, and heuristic #1 makes us switch to the other
> node; and then we do the same thing:
>
> 4 2 2 4-5
> 5 2 3 6-7
> 6 2 2 4-5
> 7 2 3 6-7
>
> So I think the algorithm is correct... Really hope the above makes
> sense. :) If so, I can add it to the commit message for patch #3.

Thinking about it further, I agree with you. If we want NUMA
locality to trump avoiding hyper-threads in the same core, then
I'm good with the algorithm. If I think of the "weight" variable
in your function as the "number of IRQs to assign to CPUs in
this NUMA hop", then it makes sense to decrement it by 1
each time irq_set_affinity_and_hint() is called. I was confused
by likely removing multiple cpus from the "cpus" cpumask
juxtaposed with decrementing "weight" by only 1, and by my
preconception that to get the perf benefit we wanted to avoid
hyper-threads in the same core.
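To make the hop/weight reasoning above concrete, here is a hypothetical userspace model of the heuristic in Python (a sketch only, not the kernel irq_setup(); the nested-list topology encoding and the function name are assumptions for illustration):

```python
def spread_irqs(nodes, nirqs):
    """Model of the discussed heuristic: exhaust NUMA hops in order,
    alternate cores within a hop, and let each affinity mask cover
    both siblings of a core. Purely illustrative, not kernel code.

    nodes: list of NUMA nodes; each node is a list of cores;
           each core is a list of sibling CPU ids.
    Returns one affinity mask (list of CPU ids) per IRQ.
    """
    masks = []
    irq = 0
    for node in nodes:                    # NUMA locality first
        # "weight" = number of IRQs this hop absorbs before every
        # CPU in the hop carries one IRQ (heuristic #1).
        weight = sum(len(core) for core in node)
        while weight > 0 and irq < nirqs:
            for core in node:             # sibling dislocality last
                if weight == 0 or irq == nirqs:
                    break
                masks.append(list(core))  # any sibling may take the IRQ
                irq += 1
                weight -= 1
        if irq == nirqs:
            return masks
    return masks

# The 2-node, 4-core, 8-CPU example topology from the thread:
nodes = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```

For 8 IRQs this reproduces the table quoted above (IRQ 0 -> CPUs 0-1, IRQ 1 -> 2-3, IRQ 2 -> 0-1, ...), and for 2 IRQs it yields one IRQ per sibling group.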

>
> Nevertheless... Souradeep, in addition to the performance numbers, can
> you share your topology and actual IRQ distribution that gains 15%? I
> think it should be added to the patch #4 commit message.

Yes -- this is the key thing for me. What is the configuration that
shows the 15% performance gain? Patch 3/4 and Patch 4/4 in the
series need to be consistent in describing when there's a performance
benefit and when there's no significant difference. In Patch 4/4,
the mana driver creates IRQs equal to the # of CPUs, up to
MANA_MAX_NUM_QUEUES, which is 64. So the new algorithm
still assigns IRQs to both hyper-threads in cores in the local NUMA
node (unless the node is bigger than 64 CPUs, which I don't think
happens in Azure today). For the first hop from the local NUMA
node, IRQs might get assigned to only one hyper-thread in a core
if the total CPU count in the VM is more than 64.

Michael

2024-01-10 09:08:37

by Souradeep Chakrabarti

Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

On Tue, Jan 09, 2024 at 08:20:31PM +0000, Haiyang Zhang wrote:
>
>
> > -----Original Message-----
> > From: Michael Kelley <[email protected]>
> > Sent: Tuesday, January 9, 2024 2:23 PM
> > To: Souradeep Chakrabarti <[email protected]>; KY Srinivasan
> > <[email protected]>; Haiyang Zhang <[email protected]>;
> > [email protected]; Dexuan Cui <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > [email protected]; Long Li <[email protected]>; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; linux-
> > [email protected]
> > Cc: Souradeep Chakrabarti <[email protected]>; Paul Rosswurm
> > <[email protected]>
> > Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per
> > CPUs
> >
> >
> > From: Souradeep Chakrabarti <[email protected]> Sent:
> > Tuesday, January 9, 2024 2:51 AM
> > >
> > > From: Yury Norov <[email protected]>
> > >
> > > Souradeep investigated that the driver performs faster if IRQs are
> > > spread on CPUs with the following heuristics:
> > >
> > > 1. No more than one IRQ per CPU, if possible;
> > > 2. NUMA locality is the second priority;
> > > 3. Sibling dislocality is the last priority.
> > >
> > > Let's consider this topology:
> > >
> > > Node 0 1
> > > Core 0 1 2 3
> > > CPU 0 1 2 3 4 5 6 7
> > >
> > > The most performant IRQ distribution based on the above topology
> > > and heuristics may look like this:
> > >
> > > IRQ Nodes Cores CPUs
> > > 0 1 0 0-1
> > > 1 1 1 2-3
> > > 2 1 0 0-1
> > > 3 1 1 2-3
> > > 4 2 2 4-5
> > > 5 2 3 6-7
> > > 6 2 2 4-5
> > > 7 2 3 6-7
> >
> > I didn't pay attention to the detailed discussion of this issue
> > over the past 2 to 3 weeks during the holidays in the U.S., but
> > the above doesn't align with the original problem as I understood
> > it. I thought the original problem was to avoid putting IRQs on
> > both hyper-threads in the same core, and that the perf
> > improvements are based on that configuration. At least that's
> > what the commit message for Patch 4/4 in this series says.
> >
> > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > probably with 1 IRQ per CPU. At least on x86, if the affinity
> > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > should balance the IRQ assignments between the CPUs in the mask.
> > So the original problem is still present because both hyper-threads
> > in a core are likely to have an IRQ assigned.
> >
> > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > IRQ to every hyper-thread may be the only choice. If that's the
> > case, maybe this just isn't a good example to illustrate the
> > original problem and solution. But even with a better example
> > where the # of IRQs is <= half the # of CPUs in a NUMA node,
> > I don't think the code below accomplishes the original intent.
> >
> > Maybe I've missed something along the way in getting to this
> > version of the patch. Please feel free to set me straight. :-)
> >
> > Michael
>
> I have the same question as Michael. Also, I'm asking Souradeep
> in another channel: so, the algorithm still uses up all of the current
> NUMA node before moving on to the next NUMA node, right?
>
> Except each IRQ is affinitized to 2 CPUs.
> For example, a system with 2 IRQs:
> IRQ Nodes Cores CPUs
> 0 1 0 0-1
> 1 1 1 2-3
>
> Is this performing better than the algorithm in earlier patches? like below:
> IRQ Nodes Cores CPUs
> 0 1 0 0
> 1 1 1 2
>
The details of this approach have been shared by Yury later in this thread.
The main intention with this approach is that the kernel may pick any
sibling for the IRQ.
> Thanks,
> - Haiyang

2024-01-10 09:11:39

by Souradeep Chakrabarti

Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

On Tue, Jan 09, 2024 at 03:28:59PM -0800, Yury Norov wrote:
> Hi Michael,
>
> So, I'm just a guy who helped to formulate the heuristics in an
> itemized form, and implement them using the existing kernel API.
> I have no access to MANA machines and I ran no performance tests
> myself.
>
> On Tue, Jan 09, 2024 at 07:22:38PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <[email protected]> Sent: Tuesday, January 9, 2024 2:51 AM
> > >
> > > From: Yury Norov <[email protected]>
> > >
> > > Souradeep investigated that the driver performs faster if IRQs are
> > > spread on CPUs with the following heuristics:
> > >
> > > 1. No more than one IRQ per CPU, if possible;
> > > 2. NUMA locality is the second priority;
> > > 3. Sibling dislocality is the last priority.
> > >
> > > Let's consider this topology:
> > >
> > > Node 0 1
> > > Core 0 1 2 3
> > > CPU 0 1 2 3 4 5 6 7
> > >
> > > The most performant IRQ distribution based on the above topology
> > > and heuristics may look like this:
> > >
> > > IRQ Nodes Cores CPUs
> > > 0 1 0 0-1
> > > 1 1 1 2-3
> > > 2 1 0 0-1
> > > 3 1 1 2-3
> > > 4 2 2 4-5
> > > 5 2 3 6-7
> > > 6 2 2 4-5
> > > 7 2 3 6-7
> >
> > I didn't pay attention to the detailed discussion of this issue
> > over the past 2 to 3 weeks during the holidays in the U.S., but
> > the above doesn't align with the original problem as I understood
> > it. I thought the original problem was to avoid putting IRQs on
> > both hyper-threads in the same core, and that the perf
> > improvements are based on that configuration. At least that's
> > what the commit message for Patch 4/4 in this series says.
>
> Yes, and the original distribution suggested by Souradeep looks very
> similar:
>
> IRQ Nodes Cores CPUs
> 0 1 0 0
> 1 1 1 2
> 2 1 0 1
> 3 1 1 3
> 4 2 2 4
> 5 2 3 6
> 6 2 2 5
> 7 2 3 7
>
> I just added a bit more flexibility, so that kernel may pick any
> sibling for the IRQ. As I understand, both approaches have similar
> performance. Probably my fine-tune added another half-percent...
>
> Souradeep, can you please share the exact numbers on this?
>
> > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > probably with 1 IRQ per CPU. At least on x86, if the affinity
> > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > should balance the IRQ assignments between the CPUs in the mask.
> > So the original problem is still present because both hyper-threads
> > in a core are likely to have an IRQ assigned.
>
> That's what I think: if the topology forces us to put IRQs in the
> same sibling group, the best thing we can do is to rely on existing
> balancing mechanisms in the hope that they will do their job well.
>
> > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > IRQ to every hyper-thread may be the only choice. If that's the
> > case, maybe this just isn't a good example to illustrate the
> > original problem and solution.
>
> Yeah... This example illustrates the order of IRQ distribution.
> I really doubt that if we distribute IRQs like in the above example,
> there would be any difference in performance. But I think it's quite
> a good illustration. I could write the title for the table like this:
>
> The order of IRQ distribution for the best performance
> based on [...] may look like this.
>
> > But even with a better example
> > where the # of IRQs is <= half the # of CPUs in a NUMA node,
> > I don't think the code below accomplishes the original intent.
> >
> > Maybe I've missed something along the way in getting to this
> > version of the patch. Please feel free to set me straight. :-)
>
> Hmm. So if the number of IRQs is half the # of CPUs in the node,
> which is 2 in the example above, the distribution will look like
> this:
>
> IRQ Nodes Cores CPUs
> 0 1 0 0-1
> 1 1 1 2-3
>
> And each IRQ belongs to a different sibling group. This follows
> the rules above.
>
> I think of it like we assign an IRQ to a group of 2 CPUs, so from
> the heuristic #1 perspective, each CPU is assigned with 1/2 of the
> IRQ.
>
> If I add one more IRQ, then according to the heuristics, NUMA locality
> trumps sibling dislocality, so we'd assign the IRQ to the same node on any
> core. My algorithm assigns it to the core #0:
>
> 2 1 0 0-1
>
> This doubles # of IRQs for the CPUs 0 and 1: from 1/2 to 1.
>
> The next IRQ should be assigned to the same node again, and we've got
> the only choice:
>
>
> 3 1 1 2-3
>
> Starting from IRQ #4, node #1 is full - each CPU is assigned
> exactly one IRQ, and heuristic #1 makes us switch to the other
> node; and then we do the same thing:
>
> 4 2 2 4-5
> 5 2 3 6-7
> 6 2 2 4-5
> 7 2 3 6-7
>
> So I think the algorithm is correct... Really hope the above makes
> sense. :) If so, I can add it to the commit message for patch #3.
>
> Nevertheless... Souradeep, in addition to the performance numbers, can
> you share your topology and actual IRQ distribution that gains 15%? I
> think it should be added to the patch #4 commit message.
Sure, I will add my topology to the patch #4 commit message. Thanks for the suggestion.
>
> Thanks,
> Yury

2024-01-11 06:13:39

by Souradeep Chakrabarti

Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

On Wed, Jan 10, 2024 at 12:27:54AM +0000, Michael Kelley wrote:
> From: Yury Norov <[email protected]> Sent: Tuesday, January 9, 2024 3:29 PM
> >
> > Hi Michael,
> >
> > So, I'm just a guy who helped to formulate the heuristics in an
> > itemized form, and implement them using the existing kernel API.
> > I have no access to MANA machines and I ran no performance tests
> > myself.
>
> Agreed. :-) Given the heritage of the patch, I should have clarified
> that my question was directed to Souradeep. Regardless, your work
> on the cpumask manipulation made everything better and clearer.
>
> >
> > On Tue, Jan 09, 2024 at 07:22:38PM +0000, Michael Kelley wrote:
> > > From: Souradeep Chakrabarti <[email protected]> Sent:
> > Tuesday, January 9, 2024 2:51 AM
> > > >
> > > > From: Yury Norov <[email protected]>
> > > >
> > > > Souradeep investigated that the driver performs faster if IRQs are
> > > > spread on CPUs with the following heuristics:
> > > >
> > > > 1. No more than one IRQ per CPU, if possible;
> > > > 2. NUMA locality is the second priority;
> > > > 3. Sibling dislocality is the last priority.
> > > >
> > > > Let's consider this topology:
> > > >
> > > > Node 0 1
> > > > Core 0 1 2 3
> > > > CPU 0 1 2 3 4 5 6 7
> > > >
> > > > The most performant IRQ distribution based on the above topology
> > > > and heuristics may look like this:
> > > >
> > > > IRQ Nodes Cores CPUs
> > > > 0 1 0 0-1
> > > > 1 1 1 2-3
> > > > 2 1 0 0-1
> > > > 3 1 1 2-3
> > > > 4 2 2 4-5
> > > > 5 2 3 6-7
> > > > 6 2 2 4-5
> > > > 7 2 3 6-7
> > >
> > > I didn't pay attention to the detailed discussion of this issue
> > > over the past 2 to 3 weeks during the holidays in the U.S., but
> > > the above doesn't align with the original problem as I understood
> > > it. I thought the original problem was to avoid putting IRQs on
> > > both hyper-threads in the same core, and that the perf
> > > improvements are based on that configuration. At least that's
> > > what the commit message for Patch 4/4 in this series says.
> >
> > Yes, and the original distribution suggested by Souradeep looks very
> > similar:
> >
> > IRQ Nodes Cores CPUs
> > 0 1 0 0
> > 1 1 1 2
> > 2 1 0 1
> > 3 1 1 3
> > 4 2 2 4
> > 5 2 3 6
> > 6 2 2 5
> > 7 2 3 7
> >
> > I just added a bit more flexibility, so that kernel may pick any
> > sibling for the IRQ. As I understand, both approaches have similar
> > performance. Probably my fine-tune added another half-percent...
> >
> > Souradeep, can you please share the exact numbers on this?
> >
> > > The above chart results in 8 IRQs being assigned to the 8 CPUs,
> > > probably with 1 IRQ per CPU. At least on x86, if the affinity
> > > mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> > > should balance the IRQ assignments between the CPUs in the mask.
> > > So the original problem is still present because both hyper-threads
> > > in a core are likely to have an IRQ assigned.
> >
> > That's what I think: if the topology forces us to put IRQs in the
> > same sibling group, the best thing we can do is to rely on existing
> > balancing mechanisms in the hope that they will do their job well.
> >
> > > Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> > > IRQ to every hyper-thread may be the only choice. If that's the
> > > case, maybe this just isn't a good example to illustrate the
> > > original problem and solution.
> >
> > Yeah... This example illustrates the order of IRQ distribution.
> > I really doubt that if we distribute IRQs like in the above example,
> > there would be any difference in performance. But I think it's quite
> > a good illustration. I could write the title for the table like this:
> >
> > The order of IRQ distribution for the best performance
> > based on [...] may look like this.
> >
> > > But even with a better example
> > > where the # of IRQs is <= half the # of CPUs in a NUMA node,
> > > I don't think the code below accomplishes the original intent.
> > >
> > > Maybe I've missed something along the way in getting to this
> > > version of the patch. Please feel free to set me straight. :-)
> >
> > Hmm. So if the number of IRQs is half the # of CPUs in the node,
> > which is 2 in the example above, the distribution will look like
> > this:
> >
> > IRQ Nodes Cores CPUs
> > 0 1 0 0-1
> > 1 1 1 2-3
> >
> > And each IRQ belongs to a different sibling group. This follows
> > the rules above.
> >
> > I think of it like we assign an IRQ to a group of 2 CPUs, so from
> > the heuristic #1 perspective, each CPU is assigned with 1/2 of the
> > IRQ.
> >
> > If I add one more IRQ, then according to the heuristics, NUMA locality
> > trumps sibling dislocality, so we'd assign the IRQ to the same node on any
> > core. My algorithm assigns it to the core #0:
> >
> > 2 1 0 0-1
> >
> > This doubles # of IRQs for the CPUs 0 and 1: from 1/2 to 1.
> >
> > The next IRQ should be assigned to the same node again, and we've got
> > the only choice:
> >
> >
> > 3 1 1 2-3
> >
> > Starting from IRQ #4, node #1 is full - each CPU is assigned
> > exactly one IRQ, and heuristic #1 makes us switch to the other
> > node; and then we do the same thing:
> >
> > 4 2 2 4-5
> > 5 2 3 6-7
> > 6 2 2 4-5
> > 7 2 3 6-7
> >
> > So I think the algorithm is correct... Really hope the above makes
> > sense. :) If so, I can add it to the commit message for patch #3.
>
> Thinking about it further, I agree with you. If we want NUMA
> locality to trump avoiding hyper-threads in the same core, then
> I'm good with the algorithm. If I think of the "weight" variable
> in your function as the "number of IRQs to assign to CPUs in
> this NUMA hop", then it makes sense to decrement it by 1
> each time irq_set_affinity_and_hint() is called. I was confused
> by likely removing multiple cpus from the "cpus" cpumask
> juxtaposed with decrementing "weight" by only 1, and by my
> preconception that to get the perf benefit we wanted to avoid
> hyper-threads in the same core.
>
> >
> > Nevertheless... Souradeep, in addition to the performance numbers, can
> > you share your topology and actual IRQ distribution that gains 15%? I
> > think it should be added to the patch #4 commit message.
>
> Yes -- this is the key thing for me. What is the configuration that
> shows the 15% performance gain? Patch 3/4 and Patch 4/4 in the
> series need to be consistent in describing when there's a performance
> benefit and when there's no significant difference. In Patch 4/4,
> the mana driver creates IRQs equal to the # of CPUs, up to
> MANA_MAX_NUM_QUEUES, which is 64. So the new algorithm
> still assigns IRQs to both hyper-threads in cores in the local NUMA
> node (unless the node is bigger than 64 CPUs, which I don't think
> happens in Azure today). For the first hop from the local NUMA
> node, IRQs might get assigned to only one hyper-thread in a core
> if the total CPU count in the VM is more than 64.
The test topology used to check the performance of
cpu_local_spread() versus the new approach is:
Case 1
IRQ Nodes Cores CPUs
0 1 0 0-1
1 1 1 2-3
2 1 2 4-5
3 1 3 6-7

and with existing cpu_local_spread()
Case 2
IRQ Nodes Cores CPUs
0 1 0 0
1 1 0 1
2 1 1 2
3 1 1 3

In total, 4 channels were used, set up via ethtool.
Case 1 with ntttcp gave 15 percent better performance than
case 2. During the test, irqbalance was disabled as well.

Also you are right: with a 64-CPU system this approach will spread
the IRQs like cpu_local_spread(), but in the future we will offer
MANA nodes with more than 64 CPUs. There this new design will
give better performance.

I will add these performance benefit details in the commit message of the
next version.
>
> Michael

2024-01-12 16:38:39

by Michael Kelley

Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Souradeep Chakrabarti <[email protected]> Sent: Wednesday, January 10, 2024 10:13 PM
>
> The test topology used to check the performance of
> cpu_local_spread() versus the new approach is:
> Case 1
> IRQ Nodes Cores CPUs
> 0 1 0 0-1
> 1 1 1 2-3
> 2 1 2 4-5
> 3 1 3 6-7
>
> and with existing cpu_local_spread()
> Case 2
> IRQ Nodes Cores CPUs
> 0 1 0 0
> 1 1 0 1
> 2 1 1 2
> 3 1 1 3
>
> In total, 4 channels were used, set up via ethtool.
> Case 1 with ntttcp gave 15 percent better performance than
> case 2. During the test, irqbalance was disabled as well.
>
> Also you are right: with a 64-CPU system this approach will spread
> the IRQs like cpu_local_spread(), but in the future we will offer
> MANA nodes with more than 64 CPUs. There this new design will
> give better performance.
>
> I will add these performance benefit details in the commit message of the
> next version.

Here are my concerns:

1. The most commonly used VMs these days have 64 or fewer
vCPUs and won't see any performance benefit.

2. Larger VMs probably won't see the full 15% benefit because
all vCPUs in the local NUMA node will be assigned IRQs. For
example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
16 IRQs will be spread out on the 48 CPUs in NUMA node 1
in a way that avoids sharing a core. But overall that means
that 75% of the IRQs will still be sharing a core and
presumably not see any perf benefit.
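The arithmetic behind this concern can be checked in a few lines (hypothetical Python; the figures are taken from the 96-vCPU example above):

```python
# Example above: 96 vCPUs, 2 NUMA nodes, 2-way SMT, and 64 IRQs
# (one per CPU, capped at MANA_MAX_NUM_QUEUES = 64).
vcpus, numa_nodes, nirqs = 96, 2, 64

node_cpus = vcpus // numa_nodes      # 48 vCPUs per NUMA node
node0_irqs = min(nirqs, node_cpus)   # the local node fills up first: 48
node1_irqs = nirqs - node0_irqs      # 16 IRQs spill to the next hop

# Node 0: 48 IRQs on 24 cores -> every one of them shares a core.
# Node 1: 16 IRQs over 24 cores, one per core -> no sharing.
sharing_ratio = node0_irqs / nirqs   # 48/64 = 0.75
```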

3. Your experiment was on a relatively small scale: 4 IRQs
spread across 2 cores vs. across 4 cores. Have you run any
experiments on VMs with 128 vCPUs (for example) where
most of the IRQs are not sharing a core? I'm wondering if
the results with 4 IRQs really scale up to 64 IRQs. A lot can
be different in a VM with 64 cores and 2 NUMA nodes vs.
4 cores in a single node.

4. The new algorithm prefers assigning to all vCPUs in
each NUMA hop over assigning to separate cores. Are there
experiments showing that is the right tradeoff? What
are the results if assigning to separate cores is preferred?

My bottom line: The new algorithm adds complexity. We
should be really clear on where it adds performance benefit
and where it doesn't, and be convinced that the benefit
is worth the extra complexity.

Michael

2024-01-12 18:31:07

by Haiyang Zhang

Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs



> -----Original Message-----
> From: Michael Kelley <[email protected]>
> Sent: Friday, January 12, 2024 11:37 AM
> To: Souradeep Chakrabarti <[email protected]>
> Cc: Yury Norov <[email protected]>; KY Srinivasan <[email protected]>;
> Haiyang Zhang <[email protected]>; [email protected]; Dexuan Cui
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; Long Li <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]; Souradeep Chakrabarti <[email protected]>;
> Paul Rosswurm <[email protected]>
> Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread
> IRQs per CPUs
>
>
> From: Souradeep Chakrabarti <[email protected]> Sent:
> Wednesday, January 10, 2024 10:13 PM
> >
> > The test topology used to check the performance of
> > cpu_local_spread() versus the new approach is:
> > Case 1
> > IRQ Nodes Cores CPUs
> > 0 1 0 0-1
> > 1 1 1 2-3
> > 2 1 2 4-5
> > 3 1 3 6-7
> >
> > and with existing cpu_local_spread()
> > Case 2
> > IRQ Nodes Cores CPUs
> > 0 1 0 0
> > 1 1 0 1
> > 2 1 1 2
> > 3 1 1 3
> >
> > In total, 4 channels were used, set up via ethtool.
> > Case 1 with ntttcp gave 15 percent better performance than
> > case 2. During the test, irqbalance was disabled as well.
> >
> > Also you are right: with a 64-CPU system this approach will spread
> > the IRQs like cpu_local_spread(), but in the future we will offer
> > MANA nodes with more than 64 CPUs. There this new design will
> > give better performance.
> >
> > I will add these performance benefit details in the commit message of the
> > next version.
>
> Here are my concerns:
>
> 1. The most commonly used VMs these days have 64 or fewer
> vCPUs and won't see any performance benefit.
>
> 2. Larger VMs probably won't see the full 15% benefit because
> all vCPUs in the local NUMA node will be assigned IRQs. For
> example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> in a way that avoids sharing a core. But overall that means
> that 75% of the IRQs will still be sharing a core and
> presumably not see any perf benefit.
>
> 3. Your experiment was on a relatively small scale: 4 IRQs
> spread across 2 cores vs. across 4 cores. Have you run any
> experiments on VMs with 128 vCPUs (for example) where
> most of the IRQs are not sharing a core? I'm wondering if
> the results with 4 IRQs really scale up to 64 IRQs. A lot can
> be different in a VM with 64 cores and 2 NUMA nodes vs.
> 4 cores in a single node.
>
> 4. The new algorithm prefers assigning to all vCPUs in
> each NUMA hop over assigning to separate cores. Are there
> experiments showing that is the right tradeoff? What
> are the results if assigning to separate cores is preferred?

I remember in a customer case, putting the IRQs on the same
NUMA node has better perf. But I agree, this should be re-tested
on MANA nic.

- Haiyang



2024-01-13 06:30:57

by Souradeep Chakrabarti

Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
>
>
> > -----Original Message-----
> > From: Michael Kelley <[email protected]>
> > Sent: Friday, January 12, 2024 11:37 AM
> > To: Souradeep Chakrabarti <[email protected]>
> > Cc: Yury Norov <[email protected]>; KY Srinivasan <[email protected]>;
> > Haiyang Zhang <[email protected]>; [email protected]; Dexuan Cui
> > <[email protected]>; [email protected]; [email protected];
> > [email protected]; [email protected]; Long Li <[email protected]>;
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; linux-
> > [email protected]; Souradeep Chakrabarti <[email protected]>;
> > Paul Rosswurm <[email protected]>
> > Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread
> > IRQs per CPUs
> >
> >
> > From: Souradeep Chakrabarti <[email protected]> Sent:
> > Wednesday, January 10, 2024 10:13 PM
> > >
> > > The test topology used to check the performance of
> > > cpu_local_spread() versus the new approach is:
> > > Case 1
> > > IRQ Nodes Cores CPUs
> > > 0 1 0 0-1
> > > 1 1 1 2-3
> > > 2 1 2 4-5
> > > 3 1 3 6-7
> > >
> > > and with existing cpu_local_spread()
> > > Case 2
> > > IRQ Nodes Cores CPUs
> > > 0 1 0 0
> > > 1 1 0 1
> > > 2 1 1 2
> > > 3 1 1 3
> > >
> > > In total, 4 channels were used, set up via ethtool.
> > > Case 1 with ntttcp gave 15 percent better performance than
> > > case 2. During the test, irqbalance was disabled as well.
> > >
> > > Also you are right: with a 64-CPU system this approach will spread
> > > the IRQs like cpu_local_spread(), but in the future we will offer
> > > MANA nodes with more than 64 CPUs. There this new design will
> > > give better performance.
> > >
> > > I will add these performance benefit details in the commit message of the
> > > next version.
> >
> > Here are my concerns:
> >
> > 1. The most commonly used VMs these days have 64 or fewer
> > vCPUs and won't see any performance benefit.
> >
> > 2. Larger VMs probably won't see the full 15% benefit because
> > all vCPUs in the local NUMA node will be assigned IRQs. For
> > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > in a way that avoids sharing a core. But overall that means
> > that 75% of the IRQs will still be sharing a core and
> > presumably not see any perf benefit.
> >
> > 3. Your experiment was on a relatively small scale: 4 IRQs
> > spread across 2 cores vs. across 4 cores. Have you run any
> > experiments on VMs with 128 vCPUs (for example) where
> > most of the IRQs are not sharing a core? I'm wondering if
> > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > 4 cores in a single node.
> >
> > 4. The new algorithm prefers assigning to all vCPUs in
> > each NUMA hop over assigning to separate cores. Are there
> > experiments showing that is the right tradeoff? What
> > are the results if assigning to separate cores is preferred?
>
> I remember in a customer case, putting the IRQs on the same
> NUMA node has better perf. But I agree, this should be re-tested
> on MANA nic.
1) and 2) The change will not decrease the existing performance, but systems
with a high number of CPUs will benefit from it.

3) The result has shown around 6 percent improvement.

4) The test result has shown around 10 percent difference when IRQs are spread
on multiple NUMA nodes.
>
> - Haiyang
>

2024-01-13 16:20:51

by Michael Kelley

Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

From: Souradeep Chakrabarti <[email protected]> Sent: Friday, January 12, 2024 10:31 PM

> On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
> >
> > > -----Original Message-----
> > From: Michael Kelley <[email protected]> Sent: Friday, January 12, 2024 11:37 AM
> > >
> > > From: Souradeep Chakrabarti <[email protected]> Sent:
> > > Wednesday, January 10, 2024 10:13 PM
> > > >
> > > > The test topology used to check the performance of
> > > > cpu_local_spread() versus the new approach is:
> > > > Case 1
> > > > IRQ Nodes Cores CPUs
> > > > 0 1 0 0-1
> > > > 1 1 1 2-3
> > > > 2 1 2 4-5
> > > > 3 1 3 6-7
> > > >
> > > > and with existing cpu_local_spread()
> > > > Case 2
> > > > IRQ Nodes Cores CPUs
> > > > 0 1 0 0
> > > > 1 1 0 1
> > > > 2 1 1 2
> > > > 3 1 1 3
> > > >
> > > > In total, 4 channels were used, set up via ethtool.
> > > > Case 1 with ntttcp gave 15 percent better performance than
> > > > case 2. During the test, irqbalance was disabled as well.
> > > >
> > > > Also you are right: with a 64-CPU system this approach will spread
> > > > the IRQs like cpu_local_spread(), but in the future we will offer
> > > > MANA nodes with more than 64 CPUs. There this new design will
> > > > give better performance.
> > > >
> > > > I will add these performance benefit details in the commit message of the
> > > > next version.
> > >
> > > Here are my concerns:
> > >
> > > 1. The most commonly used VMs these days have 64 or fewer
> > > vCPUs and won't see any performance benefit.
> > >
> > > 2. Larger VMs probably won't see the full 15% benefit because
> > > all vCPUs in the local NUMA node will be assigned IRQs. For
> > > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > > vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> > > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > > in a way that avoids sharing a core. But overall the means
> > > that 75% of the IRQs will still be sharing a core and
> > > presumably not see any perf benefit.
> > >
> > > 3. Your experiment was on a relatively small scale: 4 IRQs
> > > spread across 2 cores vs. across 4 cores. Have you run any
> > > experiments on VMs with 128 vCPUs (for example) where
> > > most of the IRQs are not sharing a core? I'm wondering if
> > > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > > 4 cores in a single node.
> > >
> > > 4. The new algorithm prefers assigning to all vCPUs in
> > > each NUMA hop over assigning to separate cores. Are there
> > > experiments showing that is the right tradeoff? What
> > > are the results if assigning to separate cores is preferred?
> >
> > I remember in a customer case, putting the IRQs on the same
> > NUMA node has better perf. But I agree, this should be re-tested
> > on MANA nic.
>
> 1) and 2) The change will not decrease the existing performance, but for
> system with high number of CPU, will be benefited after this.
>
> 3) The result has shown around 6 percent improvement.
>
> 4)The test result has shown around 10 percent difference when IRQs are
> spread on multiple numa nodes.

OK, this looks pretty good. Make clear in the commit messages what
the tradeoffs are, and what the real-world benefits are expected to be.
Some future developer who wants to understand why IRQs are assigned
this way will thank you. :-)

Michael

2024-01-13 19:12:09

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

On Sat, Jan 13, 2024 at 04:20:31PM +0000, Michael Kelley wrote:
> From: Souradeep Chakrabarti <[email protected]> Sent: Friday, January 12, 2024 10:31 PM
>
> > On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
> > >
> > > > -----Original Message-----
> > > From: Michael Kelley <[email protected]> Sent: Friday, January 12, 2024 11:37 AM
> > > >
> > > > From: Souradeep Chakrabarti <[email protected]> Sent:
> > > > Wednesday, January 10, 2024 10:13 PM
> > > > >
> > > > > The test topology was used to check the performance between
> > > > > cpu_local_spread() and the new approach is :
> > > > > Case 1
> > > > > IRQ Nodes Cores CPUs
> > > > > 0 1 0 0-1
> > > > > 1 1 1 2-3
> > > > > 2 1 2 4-5
> > > > > 3 1 3 6-7
> > > > >
> > > > > and with existing cpu_local_spread()
> > > > > Case 2
> > > > > IRQ Nodes Cores CPUs
> > > > > 0 1 0 0
> > > > > 1 1 0 1
> > > > > 2 1 1 2
> > > > > 3 1 1 3
> > > > >
> > > > > Total 4 channels were used, which was set up by ethtool.
> > > > > case 1 with ntttcp has given 15 percent better performance, than
> > > > > case 2. During the test irqbalance was disabled as well.
> > > > >
> > > > > Also you are right, with 64CPU system this approach will spread
> > > > > the irqs like the cpu_local_spread() but in the future we will offer
> > > > > MANA nodes, with more than 64 CPUs. There it this new design will
> > > > > give better performance.
> > > > >
> > > > > I will add this performance benefit details in commit message of
> > > > > next version.
> > > >
> > > > Here are my concerns:
> > > >
> > > > 1. The most commonly used VMs these days have 64 or fewer
> > > > vCPUs and won't see any performance benefit.
> > > >
> > > > 2. Larger VMs probably won't see the full 15% benefit because
> > > > all vCPUs in the local NUMA node will be assigned IRQs. For
> > > > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > > > vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> > > > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > > > in a way that avoids sharing a core. But overall the means
> > > > that 75% of the IRQs will still be sharing a core and
> > > > presumably not see any perf benefit.
> > > >
> > > > 3. Your experiment was on a relatively small scale: 4 IRQs
> > > > spread across 2 cores vs. across 4 cores. Have you run any
> > > > experiments on VMs with 128 vCPUs (for example) where
> > > > most of the IRQs are not sharing a core? I'm wondering if
> > > > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > > > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > > > 4 cores in a single node.
> > > >
> > > > 4. The new algorithm prefers assigning to all vCPUs in
> > > > each NUMA hop over assigning to separate cores. Are there
> > > > experiments showing that is the right tradeoff? What
> > > > are the results if assigning to separate cores is preferred?
> > >
> > > I remember in a customer case, putting the IRQs on the same
> > > NUMA node has better perf. But I agree, this should be re-tested
> > > on MANA nic.
> >
> > 1) and 2) The change will not decrease the existing performance, but for
> > system with high number of CPU, will be benefited after this.
> >
> > 3) The result has shown around 6 percent improvement.
> >
> > 4)The test result has shown around 10 percent difference when IRQs are
> > spread on multiple numa nodes.
>
> OK, this looks pretty good. Make clear in the commit messages what
> the tradeoffs are, and what the real-world benefits are expected to be.
> Some future developer who wants to understand why IRQs are assigned
> this way will thank you. :-)

I agree with Michael, this needs to be spoken aloud.

From the above, is it correct that the best performance is achieved
when the # of IRQs is half the number of CPUs in the 1st node, because
this configuration allows spreading IRQs across cores in the most
optimal way? And if we have more or less than that, it hurts
performance, at least for MANA networking?

So, the B|A performance chart may look like this, right?

irq nodes cores cpus perf
0 1 | 1 0 | 0 0 | 0-1 0%
1 1 | 1 0 | 1 1 | 2-3 +5%
2 1 | 1 1 | 2 2 | 4-5 +10%
3 1 | 1 1 | 3 3 | 6-7 +15%
4 1 | 1 0 | 4 3 | 0-1 +12%
... | | |
7 1 | 1 1 | 7 3 | 6-7 0%
...
15 2 | 2 3 | 3 15 | 14-15 0%

Souradeep, can you please confirm that my understanding is correct?

In v5, can you add a table like the above with real performance
numbers for your driver? I think that it would help people to
configure their VMs better when networking is a bottleneck.

Thanks,
Yury

2024-01-16 06:14:04

by Souradeep Chakrabarti

[permalink] [raw]
Subject: Re: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

On Sat, Jan 13, 2024 at 11:11:50AM -0800, Yury Norov wrote:
> On Sat, Jan 13, 2024 at 04:20:31PM +0000, Michael Kelley wrote:
> > From: Souradeep Chakrabarti <[email protected]> Sent: Friday, January 12, 2024 10:31 PM
> >
> > > On Fri, Jan 12, 2024 at 06:30:44PM +0000, Haiyang Zhang wrote:
> > > >
> > > > > -----Original Message-----
> > > > From: Michael Kelley <[email protected]> Sent: Friday, January 12, 2024 11:37 AM
> > > > >
> > > > > From: Souradeep Chakrabarti <[email protected]> Sent:
> > > > > Wednesday, January 10, 2024 10:13 PM
> > > > > >
> > > > > > The test topology was used to check the performance between
> > > > > > cpu_local_spread() and the new approach is :
> > > > > > Case 1
> > > > > > IRQ Nodes Cores CPUs
> > > > > > 0 1 0 0-1
> > > > > > 1 1 1 2-3
> > > > > > 2 1 2 4-5
> > > > > > 3 1 3 6-7
> > > > > >
> > > > > > and with existing cpu_local_spread()
> > > > > > Case 2
> > > > > > IRQ Nodes Cores CPUs
> > > > > > 0 1 0 0
> > > > > > 1 1 0 1
> > > > > > 2 1 1 2
> > > > > > 3 1 1 3
> > > > > >
> > > > > > Total 4 channels were used, which was set up by ethtool.
> > > > > > case 1 with ntttcp has given 15 percent better performance, than
> > > > > > case 2. During the test irqbalance was disabled as well.
> > > > > >
> > > > > > Also you are right, with 64CPU system this approach will spread
> > > > > > the irqs like the cpu_local_spread() but in the future we will offer
> > > > > > MANA nodes, with more than 64 CPUs. There it this new design will
> > > > > > give better performance.
> > > > > >
> > > > > > I will add this performance benefit details in commit message of
> > > > > > next version.
> > > > >
> > > > > Here are my concerns:
> > > > >
> > > > > 1. The most commonly used VMs these days have 64 or fewer
> > > > > vCPUs and won't see any performance benefit.
> > > > >
> > > > > 2. Larger VMs probably won't see the full 15% benefit because
> > > > > all vCPUs in the local NUMA node will be assigned IRQs. For
> > > > > example, in a VM with 96 vCPUs and 2 NUMA nodes, all 48
> > > > > vCPUs in NUMA node 0 will all be assigned IRQs. The remaining
> > > > > 16 IRQs will be spread out on the 48 CPUs in NUMA node 1
> > > > > in a way that avoids sharing a core. But overall the means
> > > > > that 75% of the IRQs will still be sharing a core and
> > > > > presumably not see any perf benefit.
> > > > >
> > > > > 3. Your experiment was on a relatively small scale: 4 IRQs
> > > > > spread across 2 cores vs. across 4 cores. Have you run any
> > > > > experiments on VMs with 128 vCPUs (for example) where
> > > > > most of the IRQs are not sharing a core? I'm wondering if
> > > > > the results with 4 IRQs really scale up to 64 IRQs. A lot can
> > > > > be different in a VM with 64 cores and 2 NUMA nodes vs.
> > > > > 4 cores in a single node.
> > > > >
> > > > > 4. The new algorithm prefers assigning to all vCPUs in
> > > > > each NUMA hop over assigning to separate cores. Are there
> > > > > experiments showing that is the right tradeoff? What
> > > > > are the results if assigning to separate cores is preferred?
> > > >
> > > > I remember in a customer case, putting the IRQs on the same
> > > > NUMA node has better perf. But I agree, this should be re-tested
> > > > on MANA nic.
> > >
> > > 1) and 2) The change will not decrease the existing performance, but for
> > > system with high number of CPU, will be benefited after this.
> > >
> > > 3) The result has shown around 6 percent improvement.
> > >
> > > 4)The test result has shown around 10 percent difference when IRQs are
> > > spread on multiple numa nodes.
> >
> > OK, this looks pretty good. Make clear in the commit messages what
> > the tradeoffs are, and what the real-world benefits are expected to be.
> > Some future developer who wants to understand why IRQs are assigned
> > this way will thank you. :-)
>
> I agree with Michael, this needs to be spoken aloud.
>
> From the above, is it correct that the best performance is achieved
> when the # of IRQs is half the number of CPUs in the 1st node, because
> this configuration allows to spread IRQs across cores the most optimal
> way? And if we have more or less than that, it hurts performance, at
> least for MANA networking?
It does not decrease the performance compared to the current
cpu_local_spread(), but optimum performance comes when a node has
double the number of CPUs as IRQs (considering SMT==2).

If the number of CPUs is the same as the number of IRQs
(that is, num of CPUs <= 64), then we see the same performance as the
existing design with cpu_local_spread().

If a node has more than 64 CPUs, then we get better performance than
cpu_local_spread().
>
> So, the B|A performance chart may look like this, right?
>
> irq nodes cores cpus perf
> 0 1 | 1 0 | 0 0 | 0-1 0%
> 1 1 | 1 0 | 1 1 | 2-3 +5%
> 2 1 | 1 1 | 2 2 | 4-5 +10%
> 3 1 | 1 1 | 3 3 | 6-7 +15%
> 4 1 | 1 0 | 4 3 | 0-1 +12%
> ... | | |
> 7 1 | 1 1 | 7 3 | 6-7 0%
> ...
> 15 2 | 2 3 | 3 15 | 14-15 0%
>
> Souradeep, can you please confirm that my understanding is correct?
>
> In v5, can you add a table like the above with real performance
> numbers for your driver? I think that it would help people to
> configure their VMs better when networking is a bottleneck.
>
I will share a chart in the next version of patch 3.
Thanks for the suggestion.
> Thanks,
> Yury