2018-03-08 10:55:28

by Ming Lei

[permalink] [raw]
Subject: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi,

This patchset tries to spread among online CPUs as far as possible, so
that we can avoid to allocate too less irq vectors with online CPUs
mapped.

For example, in a 8cores system, 4 cpu cores(4~7) are offline/non present,
on a device with 4 queues:

1) before this patchset
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

2) after this patchset
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Without this patchset, only two vectors(39, 40) can be active, but there
can be 4 active irq vectors after applying this patchset.

One disadvantage is that CPUs from different NUMA node can be mapped to
one same irq vector. Given generally one CPU should be enough to handle
one irq vector, it shouldn't be a big deal. Especailly more vectors have
to be allocated, otherwise performance can be hurt in current
assignment.

V3:
- fix one compile warning reported by kbuild test robot

V2:
- address coments from Christoph
- mark irq_build_affinity_masks as static
- move constification of get_nodes_in_cpumask's parameter into one
prep patch
- add Reviewed-by tag

Thanks
Ming

Ming Lei (4):
genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask
genirq/affinity: move actual irq vector spread into one helper
genirq/affinity: support to do irq vectors spread starting from any
vector
genirq/affinity: irq vector spread among online CPUs as far as
possible

kernel/irq/affinity.c | 145 ++++++++++++++++++++++++++++++++------------------
1 file changed, 94 insertions(+), 51 deletions(-)

--
2.9.5



2018-03-08 10:56:01

by Ming Lei

[permalink] [raw]
Subject: [PATCH V3 2/4] genirq/affinity: move actual irq vector spread into one helper

No functional change, just prepare for converting to 2-stage
irq vector spread.

Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
kernel/irq/affinity.c | 97 +++++++++++++++++++++++++++++----------------------
1 file changed, 55 insertions(+), 42 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4b1c4763212d..e119e86bed48 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,50 +94,19 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
return nodes;
}

-/**
- * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
- * @nvecs: The total number of vectors
- * @affd: Description of the affinity requirements
- *
- * Returns the masks pointer or NULL if allocation failed.
- */
-struct cpumask *
-irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+ cpumask_var_t *node_to_cpumask,
+ const struct cpumask *cpu_mask,
+ struct cpumask *nmsk,
+ struct cpumask *masks)
{
- int n, nodes, cpus_per_vec, extra_vecs, curvec;
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
+ int curvec = affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
- struct cpumask *masks;
- cpumask_var_t nmsk, *node_to_cpumask;
-
- /*
- * If there aren't any vectors left after applying the pre/post
- * vectors don't bother with assigning affinity.
- */
- if (!affv)
- return NULL;
-
- if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
- return NULL;
-
- masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
- if (!masks)
- goto out;
+ int n, nodes, cpus_per_vec, extra_vecs;

- node_to_cpumask = alloc_node_to_cpumask();
- if (!node_to_cpumask)
- goto out;
-
- /* Fill out vectors at the beginning that don't need affinity */
- for (curvec = 0; curvec < affd->pre_vectors; curvec++)
- cpumask_copy(masks + curvec, irq_default_affinity);
-
- /* Stabilize the cpumasks */
- get_online_cpus();
- build_node_to_cpumask(node_to_cpumask);
- nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
- &nodemsk);
+ nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

/*
* If the number of nodes in the mask is greater than or equal the
@@ -150,7 +119,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (++curvec == last_affv)
break;
}
- goto done;
+ goto out;
}

for_each_node_mask(n, nodemsk) {
@@ -160,7 +129,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;

/* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
+ cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);

/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -186,7 +155,51 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
--nodes;
}

-done:
+out:
+ return curvec - affd->pre_vectors;
+}
+
+/**
+ * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
+ * @nvecs: The total number of vectors
+ * @affd: Description of the affinity requirements
+ *
+ * Returns the masks pointer or NULL if allocation failed.
+ */
+struct cpumask *
+irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+{
+ int curvec;
+ struct cpumask *masks;
+ cpumask_var_t nmsk, *node_to_cpumask;
+
+ /*
+ * If there aren't any vectors left after applying the pre/post
+ * vectors don't bother with assigning affinity.
+ */
+ if (nvecs == affd->pre_vectors + affd->post_vectors)
+ return NULL;
+
+ if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+ return NULL;
+
+ masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ goto out;
+
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
+ goto out;
+
+ /* Fill out vectors at the beginning that don't need affinity */
+ for (curvec = 0; curvec < affd->pre_vectors; curvec++)
+ cpumask_copy(masks + curvec, irq_default_affinity);
+
+ /* Stabilize the cpumasks */
+ get_online_cpus();
+ build_node_to_cpumask(node_to_cpumask);
+ curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+ cpu_possible_mask, nmsk, masks);
put_online_cpus();

/* Fill out vectors at the end that don't need affinity */
--
2.9.5


2018-03-08 10:56:16

by Ming Lei

[permalink] [raw]
Subject: [PATCH V3 3/4] genirq/affinity: support to do irq vectors spread starting from any vector

Now two parameters(start_vec, affv) are introduced to irq_build_affinity_masks(),
then this helper can build the affinity of each irq vector starting from
the irq vector of 'start_vec', and handle at most 'affv' vectors.

This way is required to do 2-stages irq vectors spread among all
possible CPUs.

Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
kernel/irq/affinity.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e119e86bed48..616f040c5d02 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
return nodes;
}

-static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+static int irq_build_affinity_masks(const struct irq_affinity *affd,
+ int start_vec, int affv,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
struct cpumask *nmsk,
struct cpumask *masks)
{
- int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
- int curvec = affd->pre_vectors;
+ int curvec = start_vec;
nodemask_t nodemsk = NODE_MASK_NONE;
- int n, nodes, cpus_per_vec, extra_vecs;
+ int n, nodes, cpus_per_vec, extra_vecs, done = 0;

nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

@@ -116,8 +116,10 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
node_to_cpumask[n]);
- if (++curvec == last_affv)
+ if (++done == affv)
break;
+ if (++curvec == last_affv)
+ curvec = affd->pre_vectors;
}
goto out;
}
@@ -150,13 +152,16 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}

- if (curvec >= last_affv)
+ done += v;
+ if (done >= affv)
break;
+ if (curvec >= last_affv)
+ curvec = affd->pre_vectors;
--nodes;
}

out:
- return curvec - affd->pre_vectors;
+ return done;
}

/**
@@ -169,6 +174,7 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
struct cpumask *
irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
{
+ int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int curvec;
struct cpumask *masks;
cpumask_var_t nmsk, *node_to_cpumask;
@@ -198,7 +204,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
- curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+ curvec += irq_build_affinity_masks(affd, curvec, affv,
+ node_to_cpumask,
cpu_possible_mask, nmsk, masks);
put_online_cpus();

--
2.9.5


2018-03-08 10:56:52

by Ming Lei

[permalink] [raw]
Subject: [PATCH V3 1/4] genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask

The following patches will introduce two stage irq spread for improving
irq spread on all possible CPUs.

No funtional change.

Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
kernel/irq/affinity.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..4b1c4763212d 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
}
}

-static cpumask_var_t *alloc_node_to_possible_cpumask(void)
+static cpumask_var_t *alloc_node_to_cpumask(void)
{
cpumask_var_t *masks;
int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
return NULL;
}

-static void free_node_to_possible_cpumask(cpumask_var_t *masks)
+static void free_node_to_cpumask(cpumask_var_t *masks)
{
int node;

@@ -71,7 +71,7 @@ static void free_node_to_possible_cpumask(cpumask_var_t *masks)
kfree(masks);
}

-static void build_node_to_possible_cpumask(cpumask_var_t *masks)
+static void build_node_to_cpumask(cpumask_var_t *masks)
{
int cpu;

@@ -79,14 +79,14 @@ static void build_node_to_possible_cpumask(cpumask_var_t *masks)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
}

-static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
{
int n, nodes = 0;

/* Calculate the number of nodes in the supplied affinity mask */
for_each_node(n) {
- if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
+ if (cpumask_intersects(mask, node_to_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks;
- cpumask_var_t nmsk, *node_to_possible_cpumask;
+ cpumask_var_t nmsk, *node_to_cpumask;

/*
* If there aren't any vectors left after applying the pre/post
@@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (!masks)
goto out;

- node_to_possible_cpumask = alloc_node_to_possible_cpumask();
- if (!node_to_possible_cpumask)
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
goto out;

/* Fill out vectors at the beginning that don't need affinity */
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)

/* Stabilize the cpumasks */
get_online_cpus();
- build_node_to_possible_cpumask(node_to_possible_cpumask);
- nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
+ build_node_to_cpumask(node_to_cpumask);
+ nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
&nodemsk);

/*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
- node_to_possible_cpumask[n]);
+ node_to_cpumask[n]);
if (++curvec == last_affv)
break;
}
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;

/* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
+ cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);

/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Fill out vectors at the end that don't need affinity */
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
- free_node_to_possible_cpumask(node_to_possible_cpumask);
+ free_node_to_cpumask(node_to_cpumask);
out:
free_cpumask_var(nmsk);
return masks;
--
2.9.5


2018-03-08 10:57:53

by Ming Lei

[permalink] [raw]
Subject: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
may cause irq vector assigned to all offline CPUs, and this kind of
assignment may cause much less irq vectors mapped to online CPUs, and
performance may get hurt.

For example, in a 8 cores system, 0~3 online, 4~8 offline/not present,
see 'lscpu':

[ming@box]$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
NUMA node(s): 2
...
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s):
...

For example, one device has 4 queues:

1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3

2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

3) after applying this patch against V4.15+:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

This patch tries to do irq vector spread among online CPUs as far as
possible by 2 stages spread.

The above assignment 3) isn't the optimal result from NUMA view, but it
returns more irq vectors with online CPU mapped, given in reality one CPU
should be enough to handle one irq vector, so it is better to do this way.

Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reported-by: Laurence Oberman <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
kernel/irq/affinity.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 616f040c5d02..253c5bf85d18 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -106,6 +106,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
nodemask_t nodemsk = NODE_MASK_NONE;
int n, nodes, cpus_per_vec, extra_vecs, done = 0;

+ if (!cpumask_weight(cpu_mask))
+ return 0;
+
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

/*
@@ -175,9 +178,9 @@ struct cpumask *
irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
{
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
- int curvec;
+ int curvec, vecs_offline, vecs_online;
struct cpumask *masks;
- cpumask_var_t nmsk, *node_to_cpumask;
+ cpumask_var_t nmsk, cpu_mask, *node_to_cpumask;

/*
* If there aren't any vectors left after applying the pre/post
@@ -193,9 +196,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (!masks)
goto out;

+ if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+ goto out;
+
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
- goto out;
+ goto out_free_cpu_mask;

/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
@@ -204,15 +210,32 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
- curvec += irq_build_affinity_masks(affd, curvec, affv,
- node_to_cpumask,
- cpu_possible_mask, nmsk, masks);
+ /* spread on online CPUs starting from the vector of affd->pre_vectors */
+ vecs_online = irq_build_affinity_masks(affd, curvec, affv,
+ node_to_cpumask,
+ cpu_online_mask, nmsk, masks);
+
+ /* spread on offline CPUs starting from the next vector to be handled */
+ if (vecs_online >= affv)
+ curvec = affd->pre_vectors;
+ else
+ curvec = affd->pre_vectors + vecs_online;
+ cpumask_andnot(cpu_mask, cpu_possible_mask, cpu_online_mask);
+ vecs_offline = irq_build_affinity_masks(affd, curvec, affv,
+ node_to_cpumask,
+ cpu_mask, nmsk, masks);
put_online_cpus();

/* Fill out vectors at the end that don't need affinity */
+ if (vecs_online + vecs_offline >= affv)
+ curvec = affv + affd->pre_vectors;
+ else
+ curvec = affd->pre_vectors + vecs_online + vecs_offline;
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
free_node_to_cpumask(node_to_cpumask);
+out_free_cpu_mask:
+ free_cpumask_var(cpu_mask);
out:
free_cpumask_var(nmsk);
return masks;
--
2.9.5


2018-03-08 13:21:16

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Thu, 2018-03-08 at 18:53 +0800, Ming Lei wrote:
> Hi,
>
> This patchset tries to spread among online CPUs as far as possible, so
> that we can avoid to allocate too less irq vectors with online CPUs
> mapped.
>
> For example, in a 8cores system, 4 cpu cores(4~7) are offline/non present,
> on a device with 4 queues:
>
> 1) before this patchset
> irq 39, cpu list 0-2
> irq 40, cpu list 3-4,6
> irq 41, cpu list 5
> irq 42, cpu list 7
>
> 2) after this patchset
> irq 39, cpu list 0,4
> irq 40, cpu list 1,6
> irq 41, cpu list 2,5
> irq 42, cpu list 3,7
>
> Without this patchset, only two vectors(39, 40) can be active, but there
> can be 4 active irq vectors after applying this patchset.

Tested-by: Artem Bityutskiy <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

Ming,

this patchset fixes the v4.16-rcX regression that I reported few weeks
ago. I applied it and verified that Dell R640 server that I mentioned
in the bug report boots up and the disk works.

So this is not just an improvement, it also includes a bugfix.

Thanks!

2018-03-08 13:26:59

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Thu, 2018-03-08 at 15:18 +0200, Artem Bityutskiy wrote:
> Tested-by: Artem Bityutskiy <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]

And for completeness:
Linux-Regression-ID: lr#15a115

2018-03-08 13:36:49

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Thu, Mar 08, 2018 at 03:18:33PM +0200, Artem Bityutskiy wrote:
> On Thu, 2018-03-08 at 18:53 +0800, Ming Lei wrote:
> > Hi,
> >
> > This patchset tries to spread among online CPUs as far as possible, so
> > that we can avoid to allocate too less irq vectors with online CPUs
> > mapped.
> >
> > For example, in a 8cores system, 4 cpu cores(4~7) are offline/non present,
> > on a device with 4 queues:
> >
> > 1) before this patchset
> > irq 39, cpu list 0-2
> > irq 40, cpu list 3-4,6
> > irq 41, cpu list 5
> > irq 42, cpu list 7
> >
> > 2) after this patchset
> > irq 39, cpu list 0,4
> > irq 40, cpu list 1,6
> > irq 41, cpu list 2,5
> > irq 42, cpu list 3,7
> >
> > Without this patchset, only two vectors(39, 40) can be active, but there
> > can be 4 active irq vectors after applying this patchset.
>
> Tested-by: Artem Bityutskiy <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]

Hi Artem,

Thanks for your test!

>
> Ming,
>
> this patchset fixes the v4.16-rcX regression that I reported few weeks
> ago. I applied it and verified that Dell R640 server that I mentioned
> in the bug report boots up and the disk works.
>
> So this is not just an improvement, it also includes a bugfix.

Actually, it isn't a real fix, the real one is in the following two:

0c20244d458e scsi: megaraid_sas: fix selection of reply queue
ed6d043be8cd scsi: hpsa: fix selection of reply queue

This patchset can't guarantee that all IRQ vectors are assigned by one
online CPU, for example, in a quad-socket system, if only one processor
is present, then some of vectors are still assigned by all offline CPUs,
and it is a valid case, but still may cause io hang if drivers(hpsa, megaraid_sas)
select reply queue in current way.

Thanks,
Ming

2018-03-08 23:21:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Thu, 8 Mar 2018, Ming Lei wrote:
> Actually, it isn't a real fix, the real one is in the following two:
>
> 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> ed6d043be8cd scsi: hpsa: fix selection of reply queue

Where are these commits? Neither Linus tree not -next know anything about
them....

> This patchset can't guarantee that all IRQ vectors are assigned by one
> online CPU, for example, in a quad-socket system, if only one processor
> is present, then some of vectors are still assigned by all offline CPUs,
> and it is a valid case, but still may cause io hang if drivers(hpsa,
> megaraid_sas) select reply queue in current way.

So my understanding is that these irq patches are enhancements and not bug
fixes. I'll queue them for 4.17 then.

Thanks,

tglx

2018-03-09 01:26:30

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Thomas,

On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> On Thu, 8 Mar 2018, Ming Lei wrote:
> > Actually, it isn't a real fix, the real one is in the following two:
> >
> > 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > ed6d043be8cd scsi: hpsa: fix selection of reply queue
>
> Where are these commits? Neither Linus tree not -next know anything about
> them....

Both aren't merged yet, but they should land V4.16, IMO.

>
> > This patchset can't guarantee that all IRQ vectors are assigned by one
> > online CPU, for example, in a quad-socket system, if only one processor
> > is present, then some of vectors are still assigned by all offline CPUs,
> > and it is a valid case, but still may cause io hang if drivers(hpsa,
> > megaraid_sas) select reply queue in current way.
>
> So my understanding is that these irq patches are enhancements and not bug
> fixes. I'll queue them for 4.17 then.

Wrt. this IO hang issue, these patches shouldn't be bug fix, but they may
fix performance regression[1] for some systems caused by 84676c1f21 ("genirq/affinity:
assign vectors to all possible CPUs").

[1] https://marc.info/?l=linux-block&m=152050347831149&w=2

Thanks,
Ming

2018-03-09 07:01:42

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, 2018-03-09 at 09:24 +0800, Ming Lei wrote:
> Hi Thomas,
>
> On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > Actually, it isn't a real fix, the real one is in the following
> > > two:
> > >
> > > 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > > ed6d043be8cd scsi: hpsa: fix selection of reply queue
> >
> > Where are these commits? Neither Linus tree not -next know anything
> > about
> > them....
>
> Both aren't merged yet, but they should land V4.16, IMO.

Is it a secret where they are? If not, could you please give ma a
pointer and I'll give them a test.

2018-03-09 07:34:45

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, Mar 09, 2018 at 09:00:08AM +0200, Artem Bityutskiy wrote:
> On Fri, 2018-03-09 at 09:24 +0800, Ming Lei wrote:
> > Hi Thomas,
> >
> > On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > > Actually, it isn't a real fix, the real one is in the following
> > > > two:
> > > >
> > > > 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > > > ed6d043be8cd scsi: hpsa: fix selection of reply queue
> > >
> > > Where are these commits? Neither Linus tree not -next know anything
> > > about
> > > them....
> >
> > Both aren't merged yet, but they should land V4.16, IMO.
>
> Is it a secret where they are? If not, could you please give ma a
> pointer and I'll give them a test.

https://marc.info/?l=linux-block&m=152056636717380&w=2

Thanks,
Ming

2018-03-09 10:10:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, 9 Mar 2018, Ming Lei wrote:
> On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > Actually, it isn't a real fix, the real one is in the following two:
> > >
> > > 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > > ed6d043be8cd scsi: hpsa: fix selection of reply queue
> >
> > Where are these commits? Neither Linus tree not -next know anything about
> > them....
>
> Both aren't merged yet, but they should land V4.16, IMO.
>
> >
> > > This patchset can't guarantee that all IRQ vectors are assigned by one
> > > online CPU, for example, in a quad-socket system, if only one processor
> > > is present, then some of vectors are still assigned by all offline CPUs,
> > > and it is a valid case, but still may cause io hang if drivers(hpsa,
> > > megaraid_sas) select reply queue in current way.
> >
> > So my understanding is that these irq patches are enhancements and not bug
> > fixes. I'll queue them for 4.17 then.
>
> Wrt. this IO hang issue, these patches shouldn't be bug fix, but they may
> fix performance regression[1] for some systems caused by 84676c1f21 ("genirq/affinity:
> assign vectors to all possible CPUs").
>
> [1] https://marc.info/?l=linux-block&m=152050347831149&w=2

Hmm. The patches are rather large for urgent and evtl. backporting. Is
there a simpler way to address that performance issue?

Thanks,

tglx

2018-03-09 12:10:06

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, Mar 09, 2018 at 11:08:54AM +0100, Thomas Gleixner wrote:
> On Fri, 9 Mar 2018, Ming Lei wrote:
> > On Fri, Mar 09, 2018 at 12:20:09AM +0100, Thomas Gleixner wrote:
> > > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > > Actually, it isn't a real fix, the real one is in the following two:
> > > >
> > > > 0c20244d458e scsi: megaraid_sas: fix selection of reply queue
> > > > ed6d043be8cd scsi: hpsa: fix selection of reply queue
> > >
> > > Where are these commits? Neither Linus tree not -next know anything about
> > > them....
> >
> > Both aren't merged yet, but they should land V4.16, IMO.
> >
> > >
> > > > This patchset can't guarantee that all IRQ vectors are assigned by one
> > > > online CPU, for example, in a quad-socket system, if only one processor
> > > > is present, then some of vectors are still assigned by all offline CPUs,
> > > > and it is a valid case, but still may cause io hang if drivers(hpsa,
> > > > megaraid_sas) select reply queue in current way.
> > >
> > > So my understanding is that these irq patches are enhancements and not bug
> > > fixes. I'll queue them for 4.17 then.
> >
> > Wrt. this IO hang issue, these patches shouldn't be bug fix, but they may
> > fix performance regression[1] for some systems caused by 84676c1f21 ("genirq/affinity:
> > assign vectors to all possible CPUs").
> >
> > [1] https://marc.info/?l=linux-block&m=152050347831149&w=2
>
> Hmm. The patches are rather large for urgent and evtl. backporting. Is
> there a simpler way to address that performance issue?

Not thought of a simpler solution. The problem is that number of active msix vector
is decreased a lot by commit 84676c1f21.

However, if someone wants to backport, this patchset can be applied cleanly, no
any conflict.

Thanks,
Ming

2018-03-09 15:11:12

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, 9 Mar 2018, Ming Lei wrote:
> On Fri, Mar 09, 2018 at 11:08:54AM +0100, Thomas Gleixner wrote:
> > > > So my understanding is that these irq patches are enhancements and not bug
> > > > fixes. I'll queue them for 4.17 then.
> > >
> > > Wrt. this IO hang issue, these patches shouldn't be bug fix, but they may
> > > fix performance regression[1] for some systems caused by 84676c1f21 ("genirq/affinity:
> > > assign vectors to all possible CPUs").
> > >
> > > [1] https://marc.info/?l=linux-block&m=152050347831149&w=2
> >
> > Hmm. The patches are rather large for urgent and evtl. backporting. Is
> > there a simpler way to address that performance issue?
>
> Not thought of a simpler solution. The problem is that number of active msix vector
> is decreased a lot by commit 84676c1f21.

It's reduced in cases where the number of possible CPUs is way larger than
the number of online CPUs.

Now, if you look at the number of present CPUs on such systems it's
probably the same as the number of online CPUs.

It only differs on machines which support physical hotplug, but that's not
the normal case. Those systems are more special and less wide spread.

So the obvious simple fix for this regression issue is to spread out the
vectors accross present CPUs and not accross possible CPUs.

I'm not sure if there is a clear indicator whether physcial hotplug is
supported or not, but the ACPI folks (x86) and architecture maintainers
should be able to answer that question. I have a machine which says:

smpboot: Allowing 128 CPUs, 96 hotplug CPUs

There is definitely no way to hotplug anything on that machine and sure the
existing spread algorithm will waste vectors to no end.

Sure then there is virt, which can pretend to have a gazillion of possible
hotpluggable CPUs, but virt is an insanity on its own. Though someone might
come up with reasonable heuristics for that as well.

Thoughts?

Thanks,

tglx









2018-03-13 03:13:05

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Thomas,

At 03/09/2018 11:08 PM, Thomas Gleixner wrote:
[...]
>
> I'm not sure if there is a clear indicator whether physcial hotplug is
> supported or not, but the ACPI folks (x86) and architecture maintainers
+cc Rafael

> should be able to answer that question. I have a machine which says:
>
> smpboot: Allowing 128 CPUs, 96 hotplug CPUs
>
> There is definitely no way to hotplug anything on that machine and sure the

AFAIK, in ACPI based dynamic reconfiguration, there is no clear
indicator. In theory, If the ACPI tables have the hotpluggable
CPU resources, the OS can support physical hotplug.

For your machine, Did your CPUs support multi-threading, but not enable
it?

And, sometimes we should not trust the number of possible CPUs. I also
met the situation that BIOS told to ACPI that it could support physical
CPUs hotplug, But actually, there was no hardware slots in the machine.
the ACPI tables like user inputs which should be validated when we use.

> existing spread algorithm will waste vectors to no end.
>
> Sure then there is virt, which can pretend to have a gazillion of possible
> hotpluggable CPUs, but virt is an insanity on its own. Though someone might
> come up with reasonable heuristics for that as well.
>
> Thoughts?

Do we have to map the vectors to CPU statically? Can we map them when
we hotplug/enable the possible CPU?

Thanks,

dou



2018-03-13 07:39:53

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Tue, 2018-03-13 at 11:11 +0800, Dou Liyang wrote:
> I also
> met the situation that BIOS told to ACPI that it could support
> physical
> CPUs hotplug, But actually, there was no hardware slots in the
> machine.
> the ACPI tables like user inputs which should be validated when we
> use.

This is exactly what happens on Skylake Xeon systems. When I check
dmesg or this file:

/sys/devices/system/cpu/possible

on 2S (two socket) and 4S (four socket) systems, I see the same number
432.

This number comes from ACPI MADT. I will speculate (did not see myself)
that 8S systems will report the same number as well, because of the
Skylake-SP (Scalable Platform) architecture.

Number 432 is good for 8S systems, but it is way too large for 2S and
4S systems - 4x or 2x larger than the theoretical maximum.

I do not know why BIOSes have to report unrealistically high numbers, I
am just sharing my observation.

So yes, Linux kernel's possible CPU count knowledge may be too large.
If we use that number to evenly spread IRQ vectors among the CPUs, we
end up with wasted vectors, and even bugs, as I observe on a 2S
Skylake.

Artem.

2018-03-13 08:37:16

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Tue, Mar 13, 2018 at 09:38:41AM +0200, Artem Bityutskiy wrote:
> On Tue, 2018-03-13 at 11:11 +0800, Dou Liyang wrote:
> > I also
> > met the situation that BIOS told to ACPI that it could support
> > physical
> > CPUs hotplug, But actually, there was no hardware slots in the
> > machine.
> > the ACPI tables like user inputs which should be validated when we
> > use.
>
> This is exactly what happens on Skylake Xeon systems. When I check
> dmesg or this file:
>
> /sys/devices/system/cpu/possible
>
> on 2S (two socket) and 4S (four socket) systems, I see the same number
> 432.
>
> This number comes from ACPI MADT. I will speculate (did not see myself)
> that 8S systems will report the same number as well, because of the
> Skylake-SP (Scalable Platform) architecture.
>
> Number 432 is good for 8S systems, but it is way too large for 2S and
> 4S systems - 4x or 2x larger than the theoretical maximum.
>
> I do not know why BIOSes have to report unrealistically high numbers, I
> am just sharing my observation.
>
> So yes, Linux kernel's possible CPU count knowledge may be too large.
> If we use that number to evenly spread IRQ vectors among the CPUs, we
> end up with wasted vectors, and even bugs, as I observe on a 2S
> Skylake.

Then looks this issue need to fix by making possible CPU count accurate
because there are other resources allocated according to num_possible_cpus(),
such as percpu variables.

Thanks,
Ming

2018-03-13 08:40:57

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Tue, 2018-03-13 at 16:35 +0800, Ming Lei wrote:
> Then looks this issue need to fix by making possible CPU count
> accurate
> because there are other resources allocated according to
> num_possible_cpus(),
> such as percpu variables.

Short term the regression should be fixed. It is already v4.16-rc6, we
have little time left.

Longer term, yeah, I agree. Kernel's notion of possible CPU count
should be realistic.

Artem.

2018-03-13 09:26:34

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Tue, Mar 13, 2018 at 4:11 AM, Dou Liyang <[email protected]> wrote:
> Hi Thomas,
>
> At 03/09/2018 11:08 PM, Thomas Gleixner wrote:
> [...]
>>
>>
>> I'm not sure if there is a clear indicator whether physcial hotplug is
>> supported or not, but the ACPI folks (x86) and architecture maintainers
>
> +cc Rafael
>
>> should be able to answer that question. I have a machine which says:
>>
>> smpboot: Allowing 128 CPUs, 96 hotplug CPUs
>>
>> There is definitely no way to hotplug anything on that machine and sure
>> the
>
>
> AFAIK, in ACPI based dynamic reconfiguration, there is no clear
> indicator. In theory, If the ACPI tables have the hotpluggable
> CPU resources, the OS can support physical hotplug.

In order for the ACPI-based CPU hotplug (I mean physical, not just the
software offline/online we do in the kernel) to work, there have to be
objects in the ACPI namespace corresponding to all of the processors
in question.

If they are not present, there is no way to signal insertion and eject
the processors safely.

2018-03-13 09:36:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy <[email protected]> wrote:
> On Tue, 2018-03-13 at 16:35 +0800, Ming Lei wrote:
>> Then looks this issue need to fix by making possible CPU count
>> accurate
>> because there are other resources allocated according to
>> num_possible_cpus(),
>> such as percpu variables.
>
> Short term the regression should be fixed. It is already v4.16-rc6, we
> have little time left.

Right.

> Longer term, yeah, I agree. Kernel's notion of possible CPU count
> should be realistic.

I agree.

Moreover, there are not too many systems where physical CPU hotplug
actually works in practice AFAICS, so IMO we should default to "no
physical CPU hotplug" and only change that default in special cases
(which may be hard to figure out, but that's a different matter).

What platform firmware tells us may be completely off.

2018-03-14 03:30:51

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi All,

At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
> On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy <[email protected]> wrote:
>> On Tue, 2018-03-13 at 16:35 +0800, Ming Lei wrote:
>>> Then looks this issue need to fix by making possible CPU count
>>> accurate
>>> because there are other resources allocated according to
>>> num_possible_cpus(),
>>> such as percpu variables.
>>
>> Short term the regression should be fixed. It is already v4.16-rc6, we
>> have little time left.
>
> Right.
>
>> Longer term, yeah, I agree. Kernel's notion of possible CPU count
>> should be realistic.
>

I did a patch for that, Artem, could you help me to test it.

----------------------->8-------------------------

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 449d86d39965..878abfa0ce30 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -671,6 +671,18 @@ static acpi_status __init
acpi_processor_ids_walk(acpi_handle handle,

}

+static void __init acpi_refill_possible_map(void)
+{
+ int i;
+
+ reset_cpu_possible_mask();
+
+ for (i = 0; i < nr_unique_ids; i++)
+ set_cpu_possible(i, true);
+
+ pr_info("Allowing %d possible CPUs\n", nr_unique_ids);
+}
+
static void __init acpi_processor_check_duplicates(void)
{
/* check the correctness for all processors in ACPI namespace */
@@ -680,6 +692,9 @@ static void __init acpi_processor_check_duplicates(void)
NULL, NULL, NULL);
acpi_get_devices(ACPI_PROCESSOR_DEVICE_HID,
acpi_processor_ids_walk,
NULL, NULL);
+
+ /* make possible CPU count more realistic */
+ acpi_refill_possible_map();
}

bool acpi_duplicate_processor_id(int proc_id)

------------------------------------------------------------------
> I agree.
>
> Moreover, there are not too many systems where physical CPU hotplug
> actually works in practice AFAICS, so IMO we should default to "no
> physical CPU hotplug" and only change that default in special cases
> (which may be hard to figure out, but that's a different matter).
>

Yes, I think so.



> What platform firmware tells us may be completely off.
>
Rafeal,

Sorry, I am not sure what you mean :-) . Did you mean no platform
firmware can tell us whether physcial CPU hotplug is supported or not?

My colleagues also told to me that there is no way in OS to know whether
it is supported or not.

Thanks
dou
>
>



2018-03-14 03:32:19

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Rafael,

Thank you so much for your reply.

At 03/13/2018 05:25 PM, Rafael J. Wysocki wrote:
> On Tue, Mar 13, 2018 at 4:11 AM, Dou Liyang <[email protected]> wrote:
>> Hi Thomas,
>>
>> At 03/09/2018 11:08 PM, Thomas Gleixner wrote:
>> [...]
>>>
>>>
>>> I'm not sure if there is a clear indicator whether physcial hotplug is
>>> supported or not, but the ACPI folks (x86) and architecture maintainers
>>
>> +cc Rafael
>>
>>> should be able to answer that question. I have a machine which says:
>>>
>>> smpboot: Allowing 128 CPUs, 96 hotplug CPUs
>>>
>>> There is definitely no way to hotplug anything on that machine and sure
>>> the
>>
>>
>> AFAIK, in ACPI based dynamic reconfiguration, there is no clear
>> indicator. In theory, If the ACPI tables have the hotpluggable
>> CPU resources, the OS can support physical hotplug.
>
> In order for the ACPI-based CPU hotplug (I mean physical, not just the
> software offline/online we do in the kernel) to work, there have to be
> objects in the ACPI namespace corresponding to all of the processors
> in question.
>
> If they are not present, there is no way to signal insertion and eject
> the processors safely.

Yes, I see.

Thanks
dou

>
>
>



2018-03-14 04:12:44

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Artem,

At 03/14/2018 11:29 AM, Dou Liyang wrote:
> Hi All,
>
> At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
>> On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy
>> <[email protected]> wrote:
>>> On Tue, 2018-03-13 at 16:35 +0800, Ming Lei wrote:
>>>> Then looks this issue need to fix by making possible CPU count
>>>> accurate
>>>> because there are other resources allocated according to
>>>> num_possible_cpus(),
>>>> such as percpu variables.
>>>
>>> Short term the regression should be fixed. It is already v4.16-rc6, we
>>> have little time left.
>>
>> Right.
>>
>>> Longer term, yeah, I agree. Kernel's notion of possible CPU count
>>> should be realistic.
>>
>
> I did a patch for that, Artem, could you help me to test it.
>

I didn't consider the nr_cpu_ids before. please ignore the old patch and
try the following RFC patch.

Thanks
dou

--------------------------->8-------------------------------------

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 449d86d39965..96d568408515 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -671,6 +671,23 @@ static acpi_status __init
acpi_processor_ids_walk(acpi_handle handle,

}

+static void __init acpi_refill_possible_map(void)
+{
+ unsigned int cpu, nr = 0;
+
+ if (nr_cpu_ids <= nr_unique_ids)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ if (nr >= nr_unique_ids)
+ set_cpu_possible(cpu, false);
+ nr++;
+ }
+
+ nr_cpu_ids = nr_unique_ids;
+ pr_info("Allowing %d possible CPUs\n", nr_cpu_ids);
+}
+
static void __init acpi_processor_check_duplicates(void)
{
/* check the correctness for all processors in ACPI namespace */
@@ -680,6 +697,9 @@ static void __init acpi_processor_check_duplicates(void)
NULL, NULL, NULL);
acpi_get_devices(ACPI_PROCESSOR_DEVICE_HID,
acpi_processor_ids_walk,
NULL, NULL);
+
+ /* make possible CPU count more realistic */
+ acpi_refill_possible_map();
}

bool acpi_duplicate_processor_id(int proc_id)




2018-03-14 09:09:30

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, 2018-03-14 at 12:11 +0800, Dou Liyang wrote:
> > At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
> > > On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy
> > > > Longer term, yeah, I agree. Kernel's notion of possible CPU
> > > > count
> > > > should be realistic.
> >
> > I did a patch for that, Artem, could you help me to test it.
> >
>
> I didn't consider the nr_cpu_ids before. please ignore the old patch
> and
> try the following RFC patch.

Sure I can help with testing a patch, could we please:

1. Start a new thread for this
2. Include ACPI forum/folks

Thanks,
Artem.

2018-03-14 09:50:15

by Dou Liyang

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Artern,

At 03/14/2018 05:07 PM, Artem Bityutskiy wrote:
> On Wed, 2018-03-14 at 12:11 +0800, Dou Liyang wrote:
>>> At 03/13/2018 05:35 PM, Rafael J. Wysocki wrote:
>>>> On Tue, Mar 13, 2018 at 9:39 AM, Artem Bityutskiy
>>>>> Longer term, yeah, I agree. Kernel's notion of possible CPU
>>>>> count
>>>>> should be realistic.
>>>
>>> I did a patch for that, Artem, could you help me to test it.
>>>
>>
>> I didn't consider the nr_cpu_ids before. please ignore the old patch
>> and
>> try the following RFC patch.
>
> Sure I can help with testing a patch, could we please:
>
> 1. Start a new thread for this
> 2. Include ACPI forum/folks
>

OK, I will do that right now.

Thanks,
dou

> Thanks,
> Artem.
>
>
>



2018-03-26 08:41:30

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Lo! Your friendly Linux regression tracker here ;-)

On 08.03.2018 14:18, Artem Bityutskiy wrote:
> On Thu, 2018-03-08 at 18:53 +0800, Ming Lei wrote:
>> This patchset tries to spread among online CPUs as far as possible, so
>> that we can avoid to allocate too less irq vectors with online CPUs
>> mapped.
> […]
> Tested-by: Artem Bityutskiy <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
>
> this patchset fixes the v4.16-rcX regression that I reported few weeks
> ago. I applied it and verified that Dell R640 server that I mentioned
> in the bug report boots up and the disk works.

Artem (or anyone else), what's the status here? I have this on my list
of regressions, but it looks like there wasn't any progress in the past
week. Or was it discussed somewhere else or even fixed in the meantime
and I missed it? Ciao, Thorsten

2018-03-28 06:17:14

by Artem Bityutskiy

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Mon, 2018-03-26 at 10:39 +0200, Thorsten Leemhuis wrote:
> Lo! Your friendly Linux regression tracker here ;-)
>
> On 08.03.2018 14:18, Artem Bityutskiy wrote:
> > On Thu, 2018-03-08 at 18:53 +0800, Ming Lei wrote:
> > > This patchset tries to spread among online CPUs as far as possible, so
> > > that we can avoid to allocate too less irq vectors with online CPUs
> > > mapped.
> >
> > […]
> > Tested-by: Artem Bityutskiy <[email protected]>
> > Link: https://lkml.kernel.org/r/[email protected]
> >
> > this patchset fixes the v4.16-rcX regression that I reported few weeks
> > ago. I applied it and verified that Dell R640 server that I mentioned
> > in the bug report boots up and the disk works.
>
> Artem (or anyone else), what's the status here? I have this on my list
> of regressions, but it looks like there wasn't any progress in the past
> week. Or was it discussed somewhere else or even fixed in the meantime
> and I missed it? Ciao, Thorsten

Hi, it is not fixed in upstream.

I got an e-mail from James that the fixes are in his tree in the
"fixes" branch. There is no word about when it will be merged. There is
also no stable tag.



2018-03-30 03:18:28

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Thomas,

On Fri, Mar 09, 2018 at 04:08:19PM +0100, Thomas Gleixner wrote:
> On Fri, 9 Mar 2018, Ming Lei wrote:
> > On Fri, Mar 09, 2018 at 11:08:54AM +0100, Thomas Gleixner wrote:
> > > > > So my understanding is that these irq patches are enhancements and not bug
> > > > > fixes. I'll queue them for 4.17 then.
> > > >
> > > > Wrt. this IO hang issue, these patches shouldn't be bug fix, but they may
> > > > fix performance regression[1] for some systems caused by 84676c1f21 ("genirq/affinity:
> > > > assign vectors to all possible CPUs").
> > > >
> > > > [1] https://marc.info/?l=linux-block&m=152050347831149&w=2
> > >
> > > Hmm. The patches are rather large for urgent and evtl. backporting. Is
> > > there a simpler way to address that performance issue?
> >
> > Not thought of a simpler solution. The problem is that number of active msix vector
> > is decreased a lot by commit 84676c1f21.
>
> It's reduced in cases where the number of possible CPUs is way larger than
> the number of online CPUs.
>
> Now, if you look at the number of present CPUs on such systems it's
> probably the same as the number of online CPUs.
>
> It only differs on machines which support physical hotplug, but that's not
> the normal case. Those systems are more special and less wide spread.
>
> So the obvious simple fix for this regression issue is to spread out the
> vectors accross present CPUs and not accross possible CPUs.
>
> I'm not sure if there is a clear indicator whether physcial hotplug is
> supported or not, but the ACPI folks (x86) and architecture maintainers
> should be able to answer that question. I have a machine which says:
>
> smpboot: Allowing 128 CPUs, 96 hotplug CPUs
>
> There is definitely no way to hotplug anything on that machine and sure the
> existing spread algorithm will waste vectors to no end.

percpu variable may waste space too if the possible cpu number is
provided not accurately from ACPI.

>
> Sure then there is virt, which can pretend to have a gazillion of possible
> hotpluggable CPUs, but virt is an insanity on its own. Though someone might
> come up with reasonable heuristics for that as well.

There are also IBM s390, in which physical CPU hotplug is one normal use
case.

Looks not see any other solution posted out for virt, and it may cause
complicated queue dependency issue by re-introducing CPU hotplug
handler for blk-mq.

>
> Thoughts?

Given this patchset doesn't have effect on normal machines without
supporting physical CPU hotplug, it can fix performance regression on
machines which might support physical CPU hotplug(cpu_present_mask !=
cpu_possible_mask) with some extra memory allocation cost.

So is there any chance to make it in v4.17?

Thanks,
Ming

2018-04-03 12:56:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 0/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Ming,

On Fri, 30 Mar 2018, Ming Lei wrote:
> On Fri, Mar 09, 2018 at 04:08:19PM +0100, Thomas Gleixner wrote:
> > Thoughts?
>
> Given this patchset doesn't have effect on normal machines without
> supporting physical CPU hotplug, it can fix performance regression on
> machines which might support physical CPU hotplug(cpu_present_mask !=
> cpu_possible_mask) with some extra memory allocation cost.
>
> So is there any chance to make it in v4.17?

Sorry, that thing fell through the cracks. I'll queue it now and try to do
a pull request late in the merge window.

Thanks,

tglx



2018-04-03 13:34:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Thu, 8 Mar 2018, Ming Lei wrote:
> 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> irq 39, cpu list 0
> irq 40, cpu list 1
> irq 41, cpu list 2
> irq 42, cpu list 3
>
> 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> irq 39, cpu list 0-2
> irq 40, cpu list 3-4,6
> irq 41, cpu list 5
> irq 42, cpu list 7
>
> 3) after applying this patch against V4.15+:
> irq 39, cpu list 0,4
> irq 40, cpu list 1,6
> irq 41, cpu list 2,5
> irq 42, cpu list 3,7

That's more or less window dressing. If the device is already in use when
the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
because the effective affinity of interrupts on X86 (and other
architectures) is always a single CPU.

So this only might move interrupts to the hotplugged CPUs when the device
is initialized after CPU hotplug and the actual vector allocation moves an
interrupt out to the higher numbered CPUs if they have less vectors
allocated than the lower numbered ones.

Probably not an issue for the majority of machines where ACPI stupidly
claims that extra CPUs are possible while there is absolute no support for
physical hotplug.

Though in scenarios where "physical" hotplug is possible, e.g. virt, this
might come surprising.

Thanks,

tglx

2018-04-03 16:01:59

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Tue, Apr 03, 2018 at 03:32:21PM +0200, Thomas Gleixner wrote:
> On Thu, 8 Mar 2018, Ming Lei wrote:
> > 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > irq 39, cpu list 0
> > irq 40, cpu list 1
> > irq 41, cpu list 2
> > irq 42, cpu list 3
> >
> > 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > irq 39, cpu list 0-2
> > irq 40, cpu list 3-4,6
> > irq 41, cpu list 5
> > irq 42, cpu list 7
> >
> > 3) after applying this patch against V4.15+:
> > irq 39, cpu list 0,4
> > irq 40, cpu list 1,6
> > irq 41, cpu list 2,5
> > irq 42, cpu list 3,7
>
> That's more or less window dressing. If the device is already in use when
> the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
> because the effective affinity of interrupts on X86 (and other
> architectures) is always a single CPU.
>
> So this only might move interrupts to the hotplugged CPUs when the device
> is initialized after CPU hotplug and the actual vector allocation moves an
> interrupt out to the higher numbered CPUs if they have less vectors
> allocated than the lower numbered ones.

It works for blk-mq devices, such as NVMe.

Now NVMe driver creates num_possible_cpus() hw queues, and each
hw queue is assigned one msix irq vector.

Storage is Client/Server model, that means the interrupt is only
delivered to CPU after one IO request is submitted to hw queue and
it is completed by this hw queue.

When CPUs is hotplugged, and there will be IO submitted from these
CPUs, then finally IOs complete and irq events are generated from
hw queues, and notify these submission CPU by IRQ finally.

Thanks,
Ming

2018-04-04 08:26:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, 4 Apr 2018, Ming Lei wrote:
> On Tue, Apr 03, 2018 at 03:32:21PM +0200, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > irq 39, cpu list 0
> > > irq 40, cpu list 1
> > > irq 41, cpu list 2
> > > irq 42, cpu list 3
> > >
> > > 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > irq 39, cpu list 0-2
> > > irq 40, cpu list 3-4,6
> > > irq 41, cpu list 5
> > > irq 42, cpu list 7
> > >
> > > 3) after applying this patch against V4.15+:
> > > irq 39, cpu list 0,4
> > > irq 40, cpu list 1,6
> > > irq 41, cpu list 2,5
> > > irq 42, cpu list 3,7
> >
> > That's more or less window dressing. If the device is already in use when
> > the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
> > because the effective affinity of interrupts on X86 (and other
> > architectures) is always a single CPU.
> >
> > So this only might move interrupts to the hotplugged CPUs when the device
> > is initialized after CPU hotplug and the actual vector allocation moves an
> > interrupt out to the higher numbered CPUs if they have less vectors
> > allocated than the lower numbered ones.
>
> It works for blk-mq devices, such as NVMe.
>
> Now NVMe driver creates num_possible_cpus() hw queues, and each
> hw queue is assigned one msix irq vector.
>
> Storage is Client/Server model, that means the interrupt is only
> delivered to CPU after one IO request is submitted to hw queue and
> it is completed by this hw queue.
>
> When CPUs is hotplugged, and there will be IO submitted from these
> CPUs, then finally IOs complete and irq events are generated from
> hw queues, and notify these submission CPU by IRQ finally.

I'm aware how that hw-queue stuff works. But that only works if the
spreading algorithm makes the interrupts affine to offline/not-present CPUs
when the block device is initialized.

In the example above:

> > > irq 39, cpu list 0,4
> > > irq 40, cpu list 1,6
> > > irq 41, cpu list 2,5
> > > irq 42, cpu list 3,7

and assumed that at driver init time only CPU 0-3 are online then the
hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.

So the extra assignment to CPU 4-7 in the affinity mask has no effect
whatsoever and even if the spreading result is 'perfect' it just looks
perfect as it is not making any difference versus the original result:

> > > irq 39, cpu list 0
> > > irq 40, cpu list 1
> > > irq 41, cpu list 2
> > > irq 42, cpu list 3

Thanks,

tglx



2018-04-04 12:47:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, 4 Apr 2018, Thomas Gleixner wrote:
> I'm aware how that hw-queue stuff works. But that only works if the
> spreading algorithm makes the interrupts affine to offline/not-present CPUs
> when the block device is initialized.
>
> In the example above:
>
> > > > irq 39, cpu list 0,4
> > > > irq 40, cpu list 1,6
> > > > irq 41, cpu list 2,5
> > > > irq 42, cpu list 3,7
>
> and assumed that at driver init time only CPU 0-3 are online then the
> hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
>
> So the extra assignment to CPU 4-7 in the affinity mask has no effect
> whatsoever and even if the spreading result is 'perfect' it just looks
> perfect as it is not making any difference versus the original result:
>
> > > > irq 39, cpu list 0
> > > > irq 40, cpu list 1
> > > > irq 41, cpu list 2
> > > > irq 42, cpu list 3

And looking deeper into the changes, I think that the first spreading step
has to use cpu_present_mask and not cpu_online_mask.

Assume the following scenario:

Machine with 8 present CPUs is booted, the 4 last CPUs are
unplugged. Device with 4 queues is initialized.

The resulting spread is going to be exactly your example:

irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
unplugged. Using cpu_present_mask the spread would be:

irq 39, cpu list 0,1
irq 40, cpu list 2,3
irq 41, cpu list 4,5
irq 42, cpu list 6,7

while on a machine where CPU 4-7 are NOT present, but advertised as
possible the spread would be:

irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Hmm?

Thanks,

tglx




2018-04-04 15:10:03

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> On Wed, 4 Apr 2018, Ming Lei wrote:
> > On Tue, Apr 03, 2018 at 03:32:21PM +0200, Thomas Gleixner wrote:
> > > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > > 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > > irq 39, cpu list 0
> > > > irq 40, cpu list 1
> > > > irq 41, cpu list 2
> > > > irq 42, cpu list 3
> > > >
> > > > 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > > irq 39, cpu list 0-2
> > > > irq 40, cpu list 3-4,6
> > > > irq 41, cpu list 5
> > > > irq 42, cpu list 7
> > > >
> > > > 3) after applying this patch against V4.15+:
> > > > irq 39, cpu list 0,4
> > > > irq 40, cpu list 1,6
> > > > irq 41, cpu list 2,5
> > > > irq 42, cpu list 3,7
> > >
> > > That's more or less window dressing. If the device is already in use when
> > > the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
> > > because the effective affinity of interrupts on X86 (and other
> > > architectures) is always a single CPU.
> > >
> > > So this only might move interrupts to the hotplugged CPUs when the device
> > > is initialized after CPU hotplug and the actual vector allocation moves an
> > > interrupt out to the higher numbered CPUs if they have less vectors
> > > allocated than the lower numbered ones.
> >
> > It works for blk-mq devices, such as NVMe.
> >
> > Now NVMe driver creates num_possible_cpus() hw queues, and each
> > hw queue is assigned one msix irq vector.
> >
> > Storage is Client/Server model, that means the interrupt is only
> > delivered to CPU after one IO request is submitted to hw queue and
> > it is completed by this hw queue.
> >
> > When CPUs is hotplugged, and there will be IO submitted from these
> > CPUs, then finally IOs complete and irq events are generated from
> > hw queues, and notify these submission CPU by IRQ finally.
>
> I'm aware how that hw-queue stuff works. But that only works if the
> spreading algorithm makes the interrupts affine to offline/not-present CPUs
> when the block device is initialized.
>
> In the example above:
>
> > > > irq 39, cpu list 0,4
> > > > irq 40, cpu list 1,6
> > > > irq 41, cpu list 2,5
> > > > irq 42, cpu list 3,7
>
> and assumed that at driver init time only CPU 0-3 are online then the
> hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.

Indeed, and I just tested this case, and found that no interrupts are
delivered to CPU 4-7.

In theory, the affinity has been assigned to these irq vectors, and
programmed to interrupt controller, I understand it should work.

Could you explain it a bit why interrupts aren't delivered to CPU 4-7?


Thanks,
Ming

2018-04-04 15:21:53

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, Apr 04, 2018 at 02:45:18PM +0200, Thomas Gleixner wrote:
> On Wed, 4 Apr 2018, Thomas Gleixner wrote:
> > I'm aware how that hw-queue stuff works. But that only works if the
> > spreading algorithm makes the interrupts affine to offline/not-present CPUs
> > when the block device is initialized.
> >
> > In the example above:
> >
> > > > > irq 39, cpu list 0,4
> > > > > irq 40, cpu list 1,6
> > > > > irq 41, cpu list 2,5
> > > > > irq 42, cpu list 3,7
> >
> > and assumed that at driver init time only CPU 0-3 are online then the
> > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
> >
> > So the extra assignment to CPU 4-7 in the affinity mask has no effect
> > whatsoever and even if the spreading result is 'perfect' it just looks
> > perfect as it is not making any difference versus the original result:
> >
> > > > > irq 39, cpu list 0
> > > > > irq 40, cpu list 1
> > > > > irq 41, cpu list 2
> > > > > irq 42, cpu list 3
>
> And looking deeper into the changes, I think that the first spreading step
> has to use cpu_present_mask and not cpu_online_mask.
>
> Assume the following scenario:
>
> Machine with 8 present CPUs is booted, the 4 last CPUs are
> unplugged. Device with 4 queues is initialized.
>
> The resulting spread is going to be exactly your example:
>
> irq 39, cpu list 0,4
> irq 40, cpu list 1,6
> irq 41, cpu list 2,5
> irq 42, cpu list 3,7
>
> Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
> interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
> unplugged. Using cpu_present_mask the spread would be:
>
> irq 39, cpu list 0,1
> irq 40, cpu list 2,3
> irq 41, cpu list 4,5
> irq 42, cpu list 6,7

Given physical CPU hotplug isn't common, this way will make only irq 39
and irq 40 active most of times, so performance regression is caused just
as Kashyap reported.

>
> while on a machine where CPU 4-7 are NOT present, but advertised as
> possible the spread would be:
>
> irq 39, cpu list 0,4
> irq 40, cpu list 1,6
> irq 41, cpu list 2,5
> irq 42, cpu list 3,7

I think this way is still better, since performance regression can be
avoided, and there is at least one CPU for covering one irq vector,
in reality, it is often enough.

As I mentioned in another email, I still don't understand why interrupts
can't be delivered to CPU 4~7 after these CPUs become present & online.
Seems in theory, interrupts should be delivered to these CPUs since
affinity info has been programmed to interrupt controller already.

Or do we still need CPU hotplug handler for device driver to tell device
the CPU hotplug change for delivering interrupts to new added CPUs?


Thanks,
Ming

2018-04-04 19:39:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> > In the example above:
> >
> > > > > irq 39, cpu list 0,4
> > > > > irq 40, cpu list 1,6
> > > > > irq 41, cpu list 2,5
> > > > > irq 42, cpu list 3,7
> >
> > and assumed that at driver init time only CPU 0-3 are online then the
> > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
>
> Indeed, and I just tested this case, and found that no interrupts are
> delivered to CPU 4-7.
>
> In theory, the affinity has been assigned to these irq vectors, and
> programmed to interrupt controller, I understand it should work.
>
> Could you explain it a bit why interrupts aren't delivered to CPU 4-7?

As I explained before:

"If the device is already in use when the offline CPUs get hot plugged, then
the interrupts still stay on cpu 0-3 because the effective affinity of
interrupts on X86 (and other architectures) is always a single CPU."

IOW. If you set the affinity mask so it contains more than one CPU then the
kernel selects a single CPU as target. The selected CPU must be online and
if there is more than one online CPU in the mask then the kernel picks the
one which has the least number of interrupts targeted at it. This selected
CPU target is programmed into the corresponding interrupt chip
(IOAPIC/MSI/MSIX....) and it stays that way until the selected target CPU
goes offline or the affinity mask changes.

The reasons why we use single target delivery on X86 are:

1) Not all X86 systems support multi target delivery

2) If a system supports multi target delivery then the interrupt is
preferrably delivered to the CPU with the lowest APIC ID (which
usually corresponds to the lowest CPU number) due to hardware magic
and only a very small percentage of interrupts are delivered to the
other CPUs in the multi target set. So the benefit is rather dubious
and extensive performance testing did not show any significant
difference.

3) The management of multi targets on the software side is painful as
the same low level vector number has to be allocated on all possible
target CPUs. That's making a lot of things including hotplug more
complex for very little - if at all - benefit.

So at some point we ripped out the multi target support on X86 and moved
everything to single target delivery mode.

Other architectures never supported multi target delivery either due to
hardware restrictions or for similar reasons why X86 dropped it. There
might be a few architectures which support it, but I have no overview at
the moment.

The information is in procfs

# cat /proc/irq/9/smp_affinity_list
0-3
# cat /proc/irq/9/effective_affinity_list
1

# cat /proc/irq/10/smp_affinity_list
0-3
# cat /proc/irq/10/effective_affinity_list
2

smp_affinity[_list] is the affinity which is set either by the kernel or by
writing to /proc/irq/$N/smp_affinity[_list]

effective_affinity[_list] is the affinity which is effective, i.e. the
single target CPU to which the interrupt is affine at this point.

As you can see in the above examples the target CPU is selected from the
given possible target set and the internal spreading of the low level x86
vector allocation code picks a CPU which has the lowest number of
interrupts targeted at it.

Let's assume for the example below

# cat /proc/irq/10/smp_affinity_list
0-3
# cat /proc/irq/10/effective_affinity_list
2

that CPU 3 was offline when the device was initialized. So there was no way
to select it and when CPU 3 comes online there is no reason to change the
affinity of that interrupt, at least not from the kernel POV. Actually we
don't even have a mechanism to do so automagically.

If I offline CPU 2 after onlining CPU 3 then the kernel has to move the
interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
lowest number of interrupts targeted at it.

Now this is a bit different if you use affinity managed interrupts like
NVME and other devices do.

Many of these devices create one queue per possible CPU, so the spreading
is simple; One interrupt per possible cpu. Pretty boring.

When the device has less queues than possible CPUs, then stuff gets more
interesting. The queues and therefore the interrupts must be targeted at
multiple CPUs. There is some logic which spreads them over the numa nodes
and takes siblings into account when Hyperthreading is enabled.

In both cases the managed interrupts are handled over CPU soft
hotplug/unplug:

1) If a CPU is soft unplugged and an interrupt is targeted at the CPU
then the interrupt is either moved to a still online CPU in the
affinity mask or if the outgoing CPU is the last one in the affinity
mask it is shut down.

2) If a CPU is soft plugged then the interrupts are scanned and the ones
which are managed and shutdown checked whether the affinity mask
contains the upcoming CPU. If that's the case then the interrupt is
started up and can deliver interrupts for the corresponding queue.

If an interupt is managed and already started, then nothing happens
and the effective affinity is untouched even if the upcoming CPU is in
the affinity set.

Lets briefly talk about the 3 cpu masks:

1) cpus_possible_mask:

The CPUs which are possible on a system.

2) cpus_present_mask:

The CPUs which are present on a system. Present means phsyically
present. Physical hotplug sets or removes CPUs from that mask,

"Physical" hotplug is used in virtualization as well.

3) cpus_online_mask:

The CPUs which are soft onlined. If a present CPU is not soft onlined
then its cleared in the online mask, but still set in the present
mask.

Now back to my suggestion in the other end of this thread, that we should
use cpus_present_mask instead of cpus_online_mask.

The reason why I suggested this is that we have to differentiate between
soft plugging and phsycial plugging of CPUs.

If CPUs are in the present mask, i.e. phsyically available, but not in the
online mask, then it's trivial to plug them soft by writing to the
corresponding online file in sysfs. CPU soft plugging is used for power
management nowadays, so the scenario I described in the other mail is not
completely unrealistic.

In case of physical hotplug it's a different story. Neither the kernel nor
user space can plug a CPU phsyically. It needs interaction by the operator,
i.e. in the real world by inserting/removing hardware or in the
virtualization space by changing the current CPU allocation. So here the
present mask wont help when the number of queues is less than the number of
possible CPUs and an initially not present CPU gets 'physically' plugged
in.

To make things worse we have the unfortunate case of qualiteee BIOS/ACPI
tables which claim that there are more possible CPUs than present CPUs on
systems which cannot support phsyical hotplug due to lack of hardware
support. Unfortunately there is no simple way to figure out whether a
system supports physical hotplug or not, so we cannot make an informed
decision here. But we can look at the present mask which tells us how many
CPUs are physically available. In a regular boot up the present mask and
the online mask are identical, so there is no difference.

For the physical hotplug case - real or virtual - neither of the spreading
algorithms is ideal. Solving this needs more thought as it would require to
recalculate the spreading once the physically plugged CPUs become
available/online.

Hope that clarifies the internals.

Thanks,

tglx



2018-04-05 10:13:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 02:45:18PM +0200, Thomas Gleixner wrote:
> > Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
> > interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
> > unplugged. Using cpu_present_mask the spread would be:
> >
> > irq 39, cpu list 0,1
> > irq 40, cpu list 2,3
> > irq 41, cpu list 4,5
> > irq 42, cpu list 6,7
>
> Given physical CPU hotplug isn't common, this way will make only irq 39
> and irq 40 active most of times, so performance regression is caused just
> as Kashyap reported.

That is only true, if CPU 4-7 are in the present mask at boot time. I
seriously doubt that this is the case for Kashyaps scenario. Grrr, if you
would have included him into the Reported-by: tags then I could have asked
him myself.

In the physical hotplug case, the physcially (or virtually) not available
CPUs are not in the present mask. They are solely in the possible mask.

The above is about soft hotplug where the CPUs are physically there and
therefore in the present mask and can be onlined without interaction from
the outside (mechanical or virt config).

If nobody objects, I'll make that change and queue the stuff tomorrow
morning so it can brew a few days in next before I send it off to Linus.

Thanks,

tglx

2018-04-06 09:15:25

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

Hi Thomas,

On Wed, Apr 04, 2018 at 09:38:26PM +0200, Thomas Gleixner wrote:
> On Wed, 4 Apr 2018, Ming Lei wrote:
> > On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> > > In the example above:
> > >
> > > > > > irq 39, cpu list 0,4
> > > > > > irq 40, cpu list 1,6
> > > > > > irq 41, cpu list 2,5
> > > > > > irq 42, cpu list 3,7
> > >
> > > and assumed that at driver init time only CPU 0-3 are online then the
> > > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
> >
> > Indeed, and I just tested this case, and found that no interrupts are
> > delivered to CPU 4-7.
> >
> > In theory, the affinity has been assigned to these irq vectors, and
> > programmed to interrupt controller, I understand it should work.
> >
> > Could you explain it a bit why interrupts aren't delivered to CPU 4-7?
>
> As I explained before:
>
> "If the device is already in use when the offline CPUs get hot plugged, then
> the interrupts still stay on cpu 0-3 because the effective affinity of
> interrupts on X86 (and other architectures) is always a single CPU."
>
> IOW. If you set the affinity mask so it contains more than one CPU then the
> kernel selects a single CPU as target. The selected CPU must be online and
> if there is more than one online CPU in the mask then the kernel picks the
> one which has the least number of interrupts targeted at it. This selected
> CPU target is programmed into the corresponding interrupt chip
> (IOAPIC/MSI/MSIX....) and it stays that way until the selected target CPU
> goes offline or the affinity mask changes.
>
> The reasons why we use single target delivery on X86 are:
>
> 1) Not all X86 systems support multi target delivery
>
> 2) If a system supports multi target delivery then the interrupt is
> preferrably delivered to the CPU with the lowest APIC ID (which
> usually corresponds to the lowest CPU number) due to hardware magic
> and only a very small percentage of interrupts are delivered to the
> other CPUs in the multi target set. So the benefit is rather dubious
> and extensive performance testing did not show any significant
> difference.
>
> 3) The management of multi targets on the software side is painful as
> the same low level vector number has to be allocated on all possible
> target CPUs. That's making a lot of things including hotplug more
> complex for very little - if at all - benefit.
>
> So at some point we ripped out the multi target support on X86 and moved
> everything to single target delivery mode.
>
> Other architectures never supported multi target delivery either due to
> hardware restrictions or for similar reasons why X86 dropped it. There
> might be a few architectures which support it, but I have no overview at
> the moment.
>
> The information is in procfs
>
> # cat /proc/irq/9/smp_affinity_list
> 0-3
> # cat /proc/irq/9/effective_affinity_list
> 1
>
> # cat /proc/irq/10/smp_affinity_list
> 0-3
> # cat /proc/irq/10/effective_affinity_list
> 2
>
> smp_affinity[_list] is the affinity which is set either by the kernel or by
> writing to /proc/irq/$N/smp_affinity[_list]
>
> effective_affinity[_list] is the affinity which is effective, i.e. the
> single target CPU to which the interrupt is affine at this point.
>
> As you can see in the above examples the target CPU is selected from the
> given possible target set and the internal spreading of the low level x86
> vector allocation code picks a CPU which has the lowest number of
> interrupts targeted at it.
>
> Let's assume for the example below
>
> # cat /proc/irq/10/smp_affinity_list
> 0-3
> # cat /proc/irq/10/effective_affinity_list
> 2
>
> that CPU 3 was offline when the device was initialized. So there was no way
> to select it and when CPU 3 comes online there is no reason to change the
> affinity of that interrupt, at least not from the kernel POV. Actually we
> don't even have a mechanism to do so automagically.
>
> If I offline CPU 2 after onlining CPU 3 then the kernel has to move the
> interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
> lowest number of interrupts targeted at it.
>
> Now this is a bit different if you use affinity managed interrupts like
> NVME and other devices do.
>
> Many of these devices create one queue per possible CPU, so the spreading
> is simple; One interrupt per possible cpu. Pretty boring.
>
> When the device has less queues than possible CPUs, then stuff gets more
> interesting. The queues and therefore the interrupts must be targeted at
> multiple CPUs. There is some logic which spreads them over the numa nodes
> and takes siblings into account when Hyperthreading is enabled.
>
> In both cases the managed interrupts are handled over CPU soft
> hotplug/unplug:
>
> 1) If a CPU is soft unplugged and an interrupt is targeted at the CPU
> then the interrupt is either moved to a still online CPU in the
> affinity mask or if the outgoing CPU is the last one in the affinity
> mask it is shut down.
>
> 2) If a CPU is soft plugged then the interrupts are scanned and the ones
> which are managed and shutdown checked whether the affinity mask
> contains the upcoming CPU. If that's the case then the interrupt is
> started up and can deliver interrupts for the corresponding queue.
>
> If an interupt is managed and already started, then nothing happens
> and the effective affinity is untouched even if the upcoming CPU is in
> the affinity set.
>
> Lets briefly talk about the 3 cpu masks:
>
> 1) cpus_possible_mask:
>
> The CPUs which are possible on a system.
>
> 2) cpus_present_mask:
>
> The CPUs which are present on a system. Present means phsyically
> present. Physical hotplug sets or removes CPUs from that mask,
>
> "Physical" hotplug is used in virtualization as well.
>
> 3) cpus_online_mask:
>
> The CPUs which are soft onlined. If a present CPU is not soft onlined
> then its cleared in the online mask, but still set in the present
> mask.

Great thanks for your so detailed explanation.

>
> Now back to my suggestion in the other end of this thread, that we should
> use cpus_present_mask instead of cpus_online_mask.
>
> The reason why I suggested this is that we have to differentiate between
> soft plugging and phsycial plugging of CPUs.
>
> If CPUs are in the present mask, i.e. phsyically available, but not in the
> online mask, then it's trivial to plug them soft by writing to the
> corresponding online file in sysfs. CPU soft plugging is used for power
> management nowadays, so the scenario I described in the other mail is not
> completely unrealistic.

OK, got it, and this scenario can be emulated easily by offlining CPU
before loading device driver.

I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
And it should work fine for Kashyap's case in normal cases.

It might not work fine when there are lots of offline CPUs before device
initialization, in which less active irq vectors will be assigned. But
given driver is usually loaded during kernel booting, at that time generally
all CPUs are online, so looks it shouldn't be one issue to consider.

>
> In case of physical hotplug it's a different story. Neither the kernel nor
> user space can plug a CPU phsyically. It needs interaction by the operator,
> i.e. in the real world by inserting/removing hardware or in the
> virtualization space by changing the current CPU allocation. So here the
> present mask wont help when the number of queues is less than the number of
> possible CPUs and an initially not present CPU gets 'physically' plugged
> in.
>
> To make things worse we have the unfortunate case of qualiteee BIOS/ACPI
> tables which claim that there are more possible CPUs than present CPUs on
> systems which cannot support phsyical hotplug due to lack of hardware
> support. Unfortunately there is no simple way to figure out whether a
> system supports physical hotplug or not, so we cannot make an informed
> decision here. But we can look at the present mask which tells us how many
> CPUs are physically available. In a regular boot up the present mask and
> the online mask are identical, so there is no difference.
>
> For the physical hotplug case - real or virtual - neither of the spreading
> algorithms is ideal. Solving this needs more thought as it would require to
> recalculate the spreading once the physically plugged CPUs become
> available/online.

I agree, that may be an another improvement in this field.

>
> Hope that clarifies the internals.

Sure, it does, thanks again for your clarification!

Thanks,
Ming

2018-04-06 09:47:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, 6 Apr 2018, Ming Lei wrote:
>
> I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> And it should work fine for Kashyap's case in normal cases.

No need to resend. I've changed it already and will push it out after
lunch.

Thanks,

tglx

2018-04-06 21:54:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, 6 Apr 2018, Thomas Gleixner wrote:

> On Fri, 6 Apr 2018, Ming Lei wrote:
> >
> > I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> > And it should work fine for Kashyap's case in normal cases.
>
> No need to resend. I've changed it already and will push it out after
> lunch.

No. Lunch did not last that long :)

I pushed out the lot to

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

Please double check the modifications I did. The first related commit fixes
an existing error handling bug.

Thanks,

tglx

Subject: [tip:irq/core] genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask

Commit-ID: 47778f33dcba7feb92031643b37e477892f82b62
Gitweb: https://git.kernel.org/tip/47778f33dcba7feb92031643b37e477892f82b62
Author: Ming Lei <[email protected]>
AuthorDate: Thu, 8 Mar 2018 18:53:55 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 6 Apr 2018 12:19:50 +0200

genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask

The following patches will introduce two stage irq spreading for improving
irq spread on all possible CPUs.

No functional change.

Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Laurence Oberman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/irq/affinity.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index e0665549af59..272c968d9ef1 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
}
}

-static cpumask_var_t *alloc_node_to_possible_cpumask(void)
+static cpumask_var_t *alloc_node_to_cpumask(void)
{
cpumask_var_t *masks;
int node;
@@ -62,7 +62,7 @@ out_unwind:
return NULL;
}

-static void free_node_to_possible_cpumask(cpumask_var_t *masks)
+static void free_node_to_cpumask(cpumask_var_t *masks)
{
int node;

@@ -71,7 +71,7 @@ static void free_node_to_possible_cpumask(cpumask_var_t *masks)
kfree(masks);
}

-static void build_node_to_possible_cpumask(cpumask_var_t *masks)
+static void build_node_to_cpumask(cpumask_var_t *masks)
{
int cpu;

@@ -79,14 +79,14 @@ static void build_node_to_possible_cpumask(cpumask_var_t *masks)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
}

-static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
{
int n, nodes = 0;

/* Calculate the number of nodes in the supplied affinity mask */
for_each_node(n) {
- if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
+ if (cpumask_intersects(mask, node_to_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks = NULL;
- cpumask_var_t nmsk, *node_to_possible_cpumask;
+ cpumask_var_t nmsk, *node_to_cpumask;

/*
* If there aren't any vectors left after applying the pre/post
@@ -121,8 +121,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
return NULL;

- node_to_possible_cpumask = alloc_node_to_possible_cpumask();
- if (!node_to_possible_cpumask)
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
goto outcpumsk;

masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)

/* Stabilize the cpumasks */
get_online_cpus();
- build_node_to_possible_cpumask(node_to_possible_cpumask);
- nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
+ build_node_to_cpumask(node_to_cpumask);
+ nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
&nodemsk);

/*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
- node_to_possible_cpumask[n]);
+ node_to_cpumask[n]);
if (++curvec == last_affv)
break;
}
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;

/* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
+ cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);

/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -193,7 +193,7 @@ done:
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
outnodemsk:
- free_node_to_possible_cpumask(node_to_possible_cpumask);
+ free_node_to_cpumask(node_to_cpumask);
outcpumsk:
free_cpumask_var(nmsk);
return masks;

Subject: [tip:irq/core] genirq/affinity: Allow irq spreading from a given starting point

Commit-ID: 1a2d0914e23aab386f5d5acb689777e24151c2c8
Gitweb: https://git.kernel.org/tip/1a2d0914e23aab386f5d5acb689777e24151c2c8
Author: Ming Lei <[email protected]>
AuthorDate: Thu, 8 Mar 2018 18:53:57 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 6 Apr 2018 12:19:51 +0200

genirq/affinity: Allow irq spreading from a given starting point

To support two stage irq vector spreading, it's required to add a starting
point to the spreading function. No functional change, just preparatory
work for the actual two stage change.

[ tglx: Renamed variables, tidied up the code and massaged changelog ]

Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Laurence Oberman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/irq/affinity.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a9c36904500c..213695a27ddb 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
return nodes;
}

-static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+static int irq_build_affinity_masks(const struct irq_affinity *affd,
+ int startvec, int numvecs,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
struct cpumask *nmsk,
struct cpumask *masks)
{
- int affv = nvecs - affd->pre_vectors - affd->post_vectors;
- int last_affv = affv + affd->pre_vectors;
- int curvec = affd->pre_vectors;
+ int n, nodes, cpus_per_vec, extra_vecs, done = 0;
+ int last_affv = affd->pre_vectors + numvecs;
+ int curvec = startvec;
nodemask_t nodemsk = NODE_MASK_NONE;
- int n, nodes, cpus_per_vec, extra_vecs;

nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

@@ -112,12 +112,13 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
* If the number of nodes in the mask is greater than or equal the
* number of vectors we just spread the vectors across the nodes.
*/
- if (affv <= nodes) {
+ if (numvecs <= nodes) {
for_each_node_mask(n, nodemsk) {
- cpumask_copy(masks + curvec,
- node_to_cpumask[n]);
- if (++curvec == last_affv)
+ cpumask_copy(masks + curvec, node_to_cpumask[n]);
+ if (++done == numvecs)
break;
+ if (++curvec == last_affv)
+ curvec = affd->pre_vectors;
}
goto out;
}
@@ -126,7 +127,7 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
int ncpus, v, vecs_to_assign, vecs_per_node;

/* Spread the vectors per node */
- vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
+ vecs_per_node = (numvecs - (curvec - affd->pre_vectors)) / nodes;

/* Get the cpus on this node which are in the mask */
cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);
@@ -150,13 +151,16 @@ static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}

- if (curvec >= last_affv)
+ done += v;
+ if (done >= numvecs)
break;
+ if (curvec >= last_affv)
+ curvec = affd->pre_vectors;
--nodes;
}

out:
- return curvec - affd->pre_vectors;
+ return done;
}

/**
@@ -169,9 +173,9 @@ out:
struct cpumask *
irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
{
+ int curvec, affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
cpumask_var_t nmsk, *node_to_cpumask;
struct cpumask *masks = NULL;
- int curvec;

/*
* If there aren't any vectors left after applying the pre/post
@@ -198,8 +202,9 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
- curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
- cpu_possible_mask, nmsk, masks);
+ curvec += irq_build_affinity_masks(affd, curvec, affvecs,
+ node_to_cpumask, cpu_possible_mask,
+ nmsk, masks);
put_online_cpus();

/* Fill out vectors at the end that don't need affinity */

Subject: [tip:irq/core] genirq/affinity: Spread irq vectors among present CPUs as far as possible

Commit-ID: d3056812e7dfe6bf4f8ad9e397a9116dd5d32d15
Gitweb: https://git.kernel.org/tip/d3056812e7dfe6bf4f8ad9e397a9116dd5d32d15
Author: Ming Lei <[email protected]>
AuthorDate: Thu, 8 Mar 2018 18:53:58 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 6 Apr 2018 12:19:51 +0200

genirq/affinity: Spread irq vectors among present CPUs as far as possible

Commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
tried to spread the interrupts accross all possible CPUs to make sure that
in case of phsyical hotplug (e.g. virtualization) the CPUs which get
plugged in after the device was initialized are targeted by a hardware
queue and the corresponding interrupt.

This has a downside in cases where the ACPI tables claim that there are
more possible CPUs than present CPUs and the number of interrupts to spread
out is smaller than the number of possible CPUs. These bogus ACPI tables
are unfortunately not uncommon.

In such a case the vector spreading algorithm assigns interrupts to CPUs
which can never be utilized and as a consequence these interrupts are
unused instead of being mapped to present CPUs. As a result the performance
of the device is suboptimal.

To fix this spread the interrupt vectors in two stages:

1) Spread as many interrupts as possible among the present CPUs

2) Spread the remaining vectors among non present CPUs

On a 8 core system, where CPU 0-3 are present and CPU 4-7 are not present,
for a device with 4 queues the resulting interrupt affinity is:

1) Before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3

2) With 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

3) With the refined vector spread applied:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

On a 8 core system, where all CPUs are present the resulting interrupt
affinity for the 4 queues is:

irq 39, cpu list 0,1
irq 40, cpu list 2,3
irq 41, cpu list 4,5
irq 42, cpu list 6,7

This is independent of the number of CPUs which are online at the point of
initialization because in such a system the offline CPUs can be easily
onlined afterwards, while in non-present CPUs need to be plugged physically
or virtually which requires external interaction.

The downside of this approach is that in case of physical hotplug the
interrupt vector spreading might be suboptimal when CPUs 4-7 are physically
plugged. Suboptimal from a NUMA point of view and due to the single target
nature of interrupt affinities the later plugged CPUs might not be targeted
by interrupts at all.

Though, physical hotplug systems are not the common case while the broken
ACPI table disease is wide spread. So it's preferred to have as many
interrupts as possible utilized at the point where the device is
initialized.

Block multi-queue devices like NVME create a hardware queue per possible
CPU, so the goal of commit 84676c1f21 to assign one interrupt vector per
possible CPU is still achieved even with physical/virtual hotplug.

[ tglx: Changed from online to present CPUs for the first spreading stage,
renamed variables for readability sake, added comments and massaged
changelog ]

Reported-by: Laurence Oberman <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/irq/affinity.c | 43 +++++++++++++++++++++++++++++++++++++------
1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 213695a27ddb..f4f29b9d90ee 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -106,6 +106,9 @@ static int irq_build_affinity_masks(const struct irq_affinity *affd,
int curvec = startvec;
nodemask_t nodemsk = NODE_MASK_NONE;

+ if (!cpumask_weight(cpu_mask))
+ return 0;
+
nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

/*
@@ -173,8 +176,9 @@ out:
struct cpumask *
irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
{
- int curvec, affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
- cpumask_var_t nmsk, *node_to_cpumask;
+ int affvecs = nvecs - affd->pre_vectors - affd->post_vectors;
+ int curvec, usedvecs;
+ cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
struct cpumask *masks = NULL;

/*
@@ -187,9 +191,12 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
return NULL;

+ if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+ goto outcpumsk;
+
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
- goto outcpumsk;
+ goto outnpresmsk;

masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
if (!masks)
@@ -202,16 +209,40 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
- curvec += irq_build_affinity_masks(affd, curvec, affvecs,
- node_to_cpumask, cpu_possible_mask,
- nmsk, masks);
+
+ /* Spread on present CPUs starting from affd->pre_vectors */
+ usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
+ node_to_cpumask, cpu_present_mask,
+ nmsk, masks);
+
+ /*
+ * Spread on non present CPUs starting from the next vector to be
+ * handled. If the spreading of present CPUs already exhausted the
+ * vector space, assign the non present CPUs to the already spread
+ * out vectors.
+ */
+ if (usedvecs >= affvecs)
+ curvec = affd->pre_vectors;
+ else
+ curvec = affd->pre_vectors + usedvecs;
+ cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+ usedvecs += irq_build_affinity_masks(affd, curvec, affvecs,
+ node_to_cpumask, npresmsk,
+ nmsk, masks);
put_online_cpus();

/* Fill out vectors at the end that don't need affinity */
+ if (usedvecs >= affvecs)
+ curvec = affd->pre_vectors + affvecs;
+ else
+ curvec = affd->pre_vectors + usedvecs;
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
+
outnodemsk:
free_node_to_cpumask(node_to_cpumask);
+outnpresmsk:
+ free_cpumask_var(npresmsk);
outcpumsk:
free_cpumask_var(nmsk);
return masks;

Subject: [tip:irq/core] genirq/affinity: Move actual irq vector spreading into a helper function

Commit-ID: b3e6aaa8d94d618e685c4df08bef991a4fb43923
Gitweb: https://git.kernel.org/tip/b3e6aaa8d94d618e685c4df08bef991a4fb43923
Author: Ming Lei <[email protected]>
AuthorDate: Thu, 8 Mar 2018 18:53:56 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 6 Apr 2018 12:19:51 +0200

genirq/affinity: Move actual irq vector spreading into a helper function

No functional change, just prepare for converting to 2-stage irq vector
spreading.

Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Laurence Oberman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
kernel/irq/affinity.c | 97 +++++++++++++++++++++++++++++----------------------
1 file changed, 55 insertions(+), 42 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 272c968d9ef1..a9c36904500c 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,50 +94,19 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
return nodes;
}

-/**
- * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
- * @nvecs: The total number of vectors
- * @affd: Description of the affinity requirements
- *
- * Returns the masks pointer or NULL if allocation failed.
- */
-struct cpumask *
-irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+static int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+ cpumask_var_t *node_to_cpumask,
+ const struct cpumask *cpu_mask,
+ struct cpumask *nmsk,
+ struct cpumask *masks)
{
- int n, nodes, cpus_per_vec, extra_vecs, curvec;
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
+ int curvec = affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
- struct cpumask *masks = NULL;
- cpumask_var_t nmsk, *node_to_cpumask;
+ int n, nodes, cpus_per_vec, extra_vecs;

- /*
- * If there aren't any vectors left after applying the pre/post
- * vectors don't bother with assigning affinity.
- */
- if (!affv)
- return NULL;
-
- if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
- return NULL;
-
- node_to_cpumask = alloc_node_to_cpumask();
- if (!node_to_cpumask)
- goto outcpumsk;
-
- masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
- if (!masks)
- goto outnodemsk;
-
- /* Fill out vectors at the beginning that don't need affinity */
- for (curvec = 0; curvec < affd->pre_vectors; curvec++)
- cpumask_copy(masks + curvec, irq_default_affinity);
-
- /* Stabilize the cpumasks */
- get_online_cpus();
- build_node_to_cpumask(node_to_cpumask);
- nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
- &nodemsk);
+ nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);

/*
* If the number of nodes in the mask is greater than or equal the
@@ -150,7 +119,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
if (++curvec == last_affv)
break;
}
- goto done;
+ goto out;
}

for_each_node_mask(n, nodemsk) {
@@ -160,7 +129,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;

/* Get the cpus on this node which are in the mask */
- cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
+ cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]);

/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -186,7 +155,51 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
--nodes;
}

-done:
+out:
+ return curvec - affd->pre_vectors;
+}
+
+/**
+ * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
+ * @nvecs: The total number of vectors
+ * @affd: Description of the affinity requirements
+ *
+ * Returns the masks pointer or NULL if allocation failed.
+ */
+struct cpumask *
+irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
+{
+ cpumask_var_t nmsk, *node_to_cpumask;
+ struct cpumask *masks = NULL;
+ int curvec;
+
+ /*
+ * If there aren't any vectors left after applying the pre/post
+ * vectors don't bother with assigning affinity.
+ */
+ if (nvecs == affd->pre_vectors + affd->post_vectors)
+ return NULL;
+
+ if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+ return NULL;
+
+ node_to_cpumask = alloc_node_to_cpumask();
+ if (!node_to_cpumask)
+ goto outcpumsk;
+
+ masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
+ if (!masks)
+ goto outnodemsk;
+
+ /* Fill out vectors at the beginning that don't need affinity */
+ for (curvec = 0; curvec < affd->pre_vectors; curvec++)
+ cpumask_copy(masks + curvec, irq_default_affinity);
+
+ /* Stabilize the cpumasks */
+ get_online_cpus();
+ build_node_to_cpumask(node_to_cpumask);
+ curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+ cpu_possible_mask, nmsk, masks);
put_online_cpus();

/* Fill out vectors at the end that don't need affinity */

2018-04-08 03:23:07

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

On Fri, Apr 06, 2018 at 11:49:47PM +0200, Thomas Gleixner wrote:
> On Fri, 6 Apr 2018, Thomas Gleixner wrote:
>
> > On Fri, 6 Apr 2018, Ming Lei wrote:
> > >
> > > I will post V4 soon by using cpu_present_mask in the 1st stage irq spread.
> > > And it should work fine for Kashyap's case in normal cases.
> >
> > No need to resend. I've changed it already and will push it out after
> > lunch.
>
> No. Lunch did not last that long :)
>
> I pushed out the lot to
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
>
> Please double check the modifications I did. The first related commit fixes
> an existing error handling bug.

I think your modification is better, especially about adding comment in
irq_create_affinity_masks().

I also testes these patches again, and they just work fine.

Thanks,
Ming