2022-11-11 04:02:06

by Yury Norov

[permalink] [raw]
Subject: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality

cpumask_local_spread() currently checks local node for presence of i'th
CPU, and then if it finds nothing makes a flat search among all non-local
CPUs. We can do it better by checking CPUs per NUMA hops.

This series is inspired by Valentin Schneider's "net/mlx5e: Improve remote
NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

According to Valentin's measurements, for mlx5e:

Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.

This series makes cpumask_local_spread() traverse CPUs based on NUMA
distance as well, and I expect a comparable improvement for its
users, as in Valentin's case.

I tested it on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() traversal for each node and offset looks
like this:

node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3

Yury Norov (4):
lib/find: introduce find_nth_and_andnot_bit
cpumask: introduce cpumask_nth_and_andnot
sched: add sched_numa_find_nth_cpu()
cpumask: improve on cpumask_local_spread() locality

include/linux/cpumask.h | 20 +++++++++++++++++++
include/linux/find.h | 33 +++++++++++++++++++++++++++++++
include/linux/topology.h | 8 ++++++++
kernel/sched/topology.c | 42 ++++++++++++++++++++++++++++++++++++++++
lib/cpumask.c | 12 ++----------
lib/find_bit.c | 9 +++++++++
6 files changed, 114 insertions(+), 10 deletions(-)

--
2.34.1



2022-11-11 04:02:26

by Yury Norov

[permalink] [raw]
Subject: [PATCH 2/4] cpumask: introduce cpumask_nth_and_andnot()

Introduce cpumask_nth_and_andnot() based on find_nth_and_andnot_bit().
It will be used in the following patch to traverse cpumasks without storing
an intermediate result in a temporary cpumask.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/cpumask.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c2aa0aa26b45..debfa2261569 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -391,6 +391,26 @@ unsigned int cpumask_nth_andnot(unsigned int cpu, const struct cpumask *srcp1,
nr_cpumask_bits, cpumask_check(cpu));
}

+/**
+ * cpumask_nth_and_andnot - get the Nth cpu set in 1st and 2nd cpumask, and clear in 3rd.
+ * @cpu: the N'th cpu to find, starting from 0
+ * @srcp1: the first cpumask pointer
+ * @srcp2: the second cpumask pointer
+ * @srcp3: the third cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if such cpu doesn't exist.
+ */
+static inline
+unsigned int cpumask_nth_and_andnot(unsigned int cpu, const struct cpumask *srcp1,
+ const struct cpumask *srcp2,
+ const struct cpumask *srcp3)
+{
+ return find_nth_and_andnot_bit(cpumask_bits(srcp1),
+ cpumask_bits(srcp2),
+ cpumask_bits(srcp3),
+ nr_cpumask_bits, cpumask_check(cpu));
+}
+
#define CPU_BITS_NONE \
{ \
[0 ... BITS_TO_LONGS(NR_CPUS)-1] = 0UL \
--
2.34.1


2022-11-11 04:02:34

by Yury Norov

[permalink] [raw]
Subject: [PATCH 1/4] lib/find: introduce find_nth_and_andnot_bit

The function will be used in the following patches to implement in-place
bitmap traversal without storing intermediate results in temporary bitmaps.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/find.h | 33 +++++++++++++++++++++++++++++++++
lib/find_bit.c | 9 +++++++++
2 files changed, 42 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index ccaf61a0f5fd..7a97b492b447 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -22,6 +22,9 @@ unsigned long __find_nth_and_bit(const unsigned long *addr1, const unsigned long
unsigned long size, unsigned long n);
unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long size, unsigned long n);
+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1, const unsigned long *addr2,
+ const unsigned long *addr3, unsigned long size,
+ unsigned long n);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
@@ -255,6 +258,36 @@ unsigned long find_nth_andnot_bit(const unsigned long *addr1, const unsigned lon
return __find_nth_andnot_bit(addr1, addr2, size, n);
}

+/**
+ * find_nth_and_andnot_bit - find N'th set bit in 2 memory regions,
+ * excluding those set in 3rd region
+ * @addr1: The 1st address to start the search at
+ * @addr2: The 2nd address to start the search at
+ * @addr3: The 3rd address to start the search at
+ * @size: The maximum number of bits to search
+ * @n: The index of the set bit to find, counting from 0
+ *
+ * Returns the bit number of the N'th set bit.
+ * If no such bit exists, returns @size.
+ */
+static inline
+unsigned long find_nth_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size, unsigned long n)
+{
+ if (n >= size)
+ return size;
+
+ if (small_const_nbits(size)) {
+ unsigned long val = *addr1 & *addr2 & (~*addr3) & GENMASK(size - 1, 0);
+
+ return val ? fns(val, n) : size;
+ }
+
+ return __find_nth_and_andnot_bit(addr1, addr2, addr3, size, n);
+}
+
#ifndef find_first_and_bit
/**
* find_first_and_bit - find the first set bit in both memory regions
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18bc0a7ac8ee..c10920e66788 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -155,6 +155,15 @@ unsigned long __find_nth_andnot_bit(const unsigned long *addr1, const unsigned l
}
EXPORT_SYMBOL(__find_nth_andnot_bit);

+unsigned long __find_nth_and_andnot_bit(const unsigned long *addr1,
+ const unsigned long *addr2,
+ const unsigned long *addr3,
+ unsigned long size, unsigned long n)
+{
+ return FIND_NTH_BIT(addr1[idx] & addr2[idx] & ~addr3[idx], size, n);
+}
+EXPORT_SYMBOL(__find_nth_and_andnot_bit);
+
#ifndef find_next_and_bit
unsigned long _find_next_and_bit(const unsigned long *addr1, const unsigned long *addr2,
unsigned long nbits, unsigned long start)
--
2.34.1


2022-11-11 04:02:49

by Yury Norov

[permalink] [raw]
Subject: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()

The function finds the Nth set CPU in a given cpumask, starting from a given
node.

Leveraging the fact that each hop in sched_domains_numa_masks includes the
same or greater number of CPUs than the previous one, we can use binary
search on hops instead of a linear walk, which makes the overall complexity
O(log n) in terms of the number of cpumask_weight() calls.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/topology.h | 8 ++++++++
kernel/sched/topology.c | 42 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 4564faafd0e1..63048ac3207c 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
return cpumask_of_node(cpu_to_node(cpu));
}

+#ifdef CONFIG_NUMA
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
+#else
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+ return cpumask_nth(cpu, cpus);
+}
+#endif /* CONFIG_NUMA */

#endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..c8f56287de46 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2067,6 +2067,48 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
return found;
}

+/*
+ * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth closest
+ * cpu to @node from @cpus.
+ * cpus: cpumask to find a cpu from
+ * cpu: Nth cpu to find
+ *
+ * returns: cpu, or >= nr_cpu_ids when nothing found.
+ */
+int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
+{
+ unsigned int first = 0, mid, last = sched_domains_numa_levels;
+ struct cpumask ***masks;
+ int w, ret = nr_cpu_ids;
+
+ rcu_read_lock();
+ masks = rcu_dereference(sched_domains_numa_masks);
+ if (!masks)
+ goto out;
+
+ while (last >= first) {
+ mid = (last + first) / 2;
+
+ if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
+ first = mid + 1;
+ continue;
+ }
+
+ w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
+ if (w <= cpu)
+ break;
+
+ last = mid - 1;
+ }
+
+ ret = (mid == 0) ?
+ cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
+ cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
+out:
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
#endif /* CONFIG_NUMA */

static int __sdt_alloc(const struct cpumask *cpu_map)
--
2.34.1


2022-11-11 04:03:10

by Yury Norov

[permalink] [raw]
Subject: [PATCH 4/4] cpumask: improve on cpumask_local_spread() locality

Switch cpumask_local_spread() to newly added sched_numa_find_nth_cpu(),
which takes into account distances to each node in the system.

For the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

The new cpumask_local_spread() traverses cpus for each node like this:

node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15
node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7
node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5
node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3

Signed-off-by: Yury Norov <[email protected]>
---
lib/cpumask.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index c7c392514fd3..255974cd6734 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -110,7 +110,7 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
#endif

/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * cpumask_local_spread - select the i'th cpu based on NUMA distances
* @i: index number
* @node: local numa_node
*
@@ -132,15 +132,7 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
if (cpu < nr_cpu_ids)
return cpu;
} else {
- /* NUMA first. */
- cpu = cpumask_nth_and(i, cpu_online_mask, cpumask_of_node(node));
- if (cpu < nr_cpu_ids)
- return cpu;
-
- i -= cpumask_weight_and(cpu_online_mask, cpumask_of_node(node));
-
- /* Skip NUMA nodes, done above. */
- cpu = cpumask_nth_andnot(i, cpu_online_mask, cpumask_of_node(node));
+ cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);
if (cpu < nr_cpu_ids)
return cpu;
}
--
2.34.1


2022-11-11 04:15:56

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()

On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> The function finds Nth set CPU in a given cpumask starting from a given
> node.
>
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of linear walk, which makes the overall complexity
> of O(log n) in terms of number of cpumask_weight() calls.
>
> Signed-off-by: Yury Norov <[email protected]>
> ---
> include/linux/topology.h | 8 ++++++++
> kernel/sched/topology.c | 42 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 50 insertions(+)
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 4564faafd0e1..63048ac3207c 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -245,5 +245,13 @@ static inline const struct cpumask *cpu_cpu_mask(int cpu)
> return cpumask_of_node(cpu_to_node(cpu));
> }
>
> +#ifdef CONFIG_NUMA
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node);
> +#else
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)

Ah, this should be static of course.

> +{
> + return cpumask_nth(cpu, cpus);
> +}
> +#endif /* CONFIG_NUMA */
>
> #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8739c2a5a54e..c8f56287de46 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2067,6 +2067,48 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> return found;
> }
>
> +/*
> + * sched_numa_find_nth_cpu() - given the NUMA topology, find the Nth next cpu
> + * closest to @cpu from @cpumask.
> + * cpumask: cpumask to find a cpu from
> + * cpu: Nth cpu to find
> + *
> + * returns: cpu, or >= nr_cpu_ids when nothing found.
> + */
> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> + unsigned int first = 0, mid, last = sched_domains_numa_levels;
> + struct cpumask ***masks;
> + int w, ret = nr_cpu_ids;
> +
> + rcu_read_lock();
> + masks = rcu_dereference(sched_domains_numa_masks);
> + if (!masks)
> + goto out;
> +
> + while (last >= first) {
> + mid = (last + first) / 2;
> +
> + if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> + first = mid + 1;
> + continue;
> + }
> +
> + w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
> + if (w <= cpu)
> + break;
> +
> + last = mid - 1;
> + }
> +
> + ret = (mid == 0) ?
> + cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> + cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
> #endif /* CONFIG_NUMA */
>
> static int __sdt_alloc(const struct cpumask *cpu_map)
> --
> 2.34.1

2022-11-11 11:50:51

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()

On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> The function finds Nth set CPU in a given cpumask starting from a given
> node.
>
> Leveraging the fact that each hop in sched_domains_numa_masks includes the
> same or greater number of CPUs than the previous one, we can use binary
> search on hops instead of linear walk, which makes the overall complexity
> of O(log n) in terms of number of cpumask_weight() calls.

...

> +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> +{
> + unsigned int first = 0, mid, last = sched_domains_numa_levels;
> + struct cpumask ***masks;

*** ?
Hmm... Do we really need such deep indirection?

> + int w, ret = nr_cpu_ids;
> +
> + rcu_read_lock();
> + masks = rcu_dereference(sched_domains_numa_masks);
> + if (!masks)
> + goto out;
> +
> + while (last >= first) {
> + mid = (last + first) / 2;
> +
> + if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> + first = mid + 1;
> + continue;
> + }
> +
> + w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);

See below.

> + if (w <= cpu)
> + break;
> +
> + last = mid - 1;
> + }

We have lib/bsearch.h. I haven't really looked deeply into the above, but my
gut feeling is that it might be useful here. Can you check that?

> + ret = (mid == 0) ?
> + cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> + cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);

You can also shorten this by inversing the conditional:

ret = mid ? ...not 0... : ...for 0...;

> +out:

out_unlock: ?

> + rcu_read_unlock();
> + return ret;
> +}

--
With Best Regards,
Andy Shevchenko



2022-11-11 16:33:58

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality

On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:
> cpumask_local_spread() currently checks local node for presence of i'th
> CPU, and then if it finds nothing makes a flat search among all non-local
> CPUs. We can do it better by checking CPUs per NUMA hops.

Nice.

> This series is inspired by Valentin Schneider's "net/mlx5e: Improve remote
> NUMA preferences used for the IRQ affinity hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>
> According to Valentin's measurements, for mlx5e:
>
> Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> ~30% less cpu util on TX.
>
> This patch makes cpumask_local_spread() traversing CPUs based on NUMA
> distance, just as well, and I expect comparable improvement for its
> users, as in Valentin's case.
>
> I tested it on my VM with the following NUMA configuration:

nit: the authorship is a bit more complicated, it'd be good to mention
Tariq. Both for the code and attribution of the testing / measurements.

2022-11-11 17:09:47

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()

On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> > The function finds Nth set CPU in a given cpumask starting from a given
> > node.
> >
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of linear walk, which makes the overall complexity
> > of O(log n) in terms of number of cpumask_weight() calls.
>
> ...
>
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > + unsigned int first = 0, mid, last = sched_domains_numa_levels;
> > + struct cpumask ***masks;
>
> *** ?
> Hmm... Do we really need such deep indirection?

It's a 2D array of pointers, so - yes.

> > + int w, ret = nr_cpu_ids;
> > +
> > + rcu_read_lock();
> > + masks = rcu_dereference(sched_domains_numa_masks);
> > + if (!masks)
> > + goto out;
> > +
> > + while (last >= first) {
> > + mid = (last + first) / 2;
> > +
> > + if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> > + first = mid + 1;
> > + continue;
> > + }
> > +
> > + w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
>
> See below.
>
> > + if (w <= cpu)
> > + break;
> > +
> > + last = mid - 1;
> > + }
>
> We have lib/bsearch.h. I haven't really looked deeply into the above, but my
> gut feeling is that it might be useful here. Can you check that?

Yes we do. I tried it, and it didn't work, because the node arrays are
allocated dynamically, and the distance between different pairs of hops
for a given node is not constant, which is a requirement for
bsearch.

However, the distance between hop pointers in the 1st-level array should be
constant, so we can try feeding bsearch with that. I'll experiment with
bsearch some more.

> > + ret = (mid == 0) ?
> > + cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> > + cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
>
> You can also shorten this by inversing the conditional:
>
> ret = mid ? ...not 0... : ...for 0...;

Yep, why not.

> > +out:
>
> out_unlock: ?

Do you think it's better?

> > + rcu_read_unlock();
> > + return ret;
> > +}
>
> --
> With Best Regards,
> Andy Shevchenko
>

2022-11-11 18:24:34

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 3/4] sched: add sched_numa_find_nth_cpu()

On Fri, Nov 11, 2022 at 09:07:15AM -0800, Yury Norov wrote:
> On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> > On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:

...

> > > +out:
> >
> > out_unlock: ?
>
> Do you think it's better?

Yes. It shows what will happen at goto.

So when one reads the "goto out;" it's something like "return ret;".
But "goto out_unlock;" immediately pictures "unlock; return ret;".

P.S. That's basically the way how we name labels.

> > > + rcu_read_unlock();
> > > + return ret;

--
With Best Regards,
Andy Shevchenko



2022-11-13 08:03:27

by Tariq Toukan

[permalink] [raw]
Subject: Re: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality



On 11/11/2022 6:47 PM, Yury Norov wrote:
>
> On Fri, Nov 11, 2022, 10:25 AM Jakub Kicinski <[email protected]> wrote:
>
> On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:
> > cpumask_local_spread() currently checks local node for presence of i'th
> > CPU, and then if it finds nothing makes a flat search among all non-local
> > CPUs. We can do it better by checking CPUs per NUMA hops.
>
> Nice.
>

Thanks for your series.
This improves them all, with no changes required to the network device
drivers.

> > This series is inspired by Valentin Schneider's "net/mlx5e: Improve remote
> > NUMA preferences used for the IRQ affinity hints"
> >
> > https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/


Find my very first version here, including the perf testing results:
https://patchwork.kernel.org/project/netdevbpf/list/?series=660413&state=*


> >
> > According to Valentin's measurements, for mlx5e:
> >
> >       Bottleneck in RX side is released, reached linerate (~1.8x speedup).
> >       ~30% less cpu util on TX.
> >
> > This patch makes cpumask_local_spread() traversing CPUs based on NUMA
> > distance, just as well, and I expect comparable improvement for its
> > users, as in Valentin's case.
> >

Right.

> > I tested it on my VM with the following NUMA configuration:
>
> nit: the authorship is a bit more complicated, it'd be good to mention
> Tariq. Both for the code and attribution of the testing / measurements.
>
>
> Sure. Tariq and Valentin please send your tags as appropriate.
>

I wonder what fits best here?

As the contribution is based upon previous work that I developed, then
probably:
Signed-off-by: Tariq Toukan <[email protected]>

Thanks,
Tariq

2022-11-13 12:37:10

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 0/4] cpumask: improve on cpumask_local_spread() locality

On Sun, Nov 13, 2022 at 09:37:59AM +0200, Tariq Toukan wrote:
> On 11/11/2022 6:47 PM, Yury Norov wrote:
> > On Fri, Nov 11, 2022, 10:25 AM Jakub Kicinski <[email protected]> wrote:
> > On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:

...

> > Sure. Tariq and Valentine please send your tags as appropriate.
>
> I wonder what fits best here?
>
> As the contribution is based upon previous work that I developed, then
> probably:
> Signed-off-by: Tariq Toukan <[email protected]>

Then it probably means that one of you (Yury or you) should also have a
Co-developed-by.

--
With Best Regards,
Andy Shevchenko