2020-06-25 23:49:34

by Nitesh Narayan Lal

[permalink] [raw]
Subject: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

From: Alex Belits <[email protected]>

The current implementation of cpumask_local_spread() does not respect the
isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
it will return it to the caller for pinning of its IRQ threads. Having
these unwanted IRQ threads on an isolated CPU adds up to a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits <[email protected]>
Signed-off-by: Nitesh Narayan Lal <[email protected]>
---
lib/cpumask.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..85da6ab4fbb5 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <linux/memblock.h>
#include <linux/numa.h>
+#include <linux/sched/isolation.h>

/**
* cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
*/
unsigned int cpumask_local_spread(unsigned int i, int node)
{
- int cpu;
+ int cpu, hk_flags;
+ const struct cpumask *mask;

+ hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+ mask = housekeeping_cpumask(hk_flags);
/* Wrap: we always want a cpu. */
- i %= num_online_cpus();
+ i %= cpumask_weight(mask);

if (node == NUMA_NO_NODE) {
- for_each_cpu(cpu, cpu_online_mask)
+ for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+ }
} else {
/* NUMA first. */
- for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+ for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+ }

- for_each_cpu(cpu, cpu_online_mask) {
+ for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;
--
2.18.4


2020-06-29 19:14:59

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
> From: Alex Belits <[email protected]>
>
> The current implementation of cpumask_local_spread() does not respect the
> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> it will return it to the caller for pinning of its IRQ threads. Having
> these unwanted IRQ threads on an isolated CPU adds up to a latency
> overhead.
>
> Restrict the CPUs that are returned for spreading IRQs only to the
> available housekeeping CPUs.
>
> Signed-off-by: Alex Belits <[email protected]>
> Signed-off-by: Nitesh Narayan Lal <[email protected]>

Hi Peter,

I just realized that Yuqi jin's patch [1] that modifies cpumask_local_spread is
lying in linux-next.
Should I do a re-post by re-basing the patches on the top of linux-next?

[1]
https://lore.kernel.org/lkml/[email protected]/

> ---
> lib/cpumask.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index fb22fb266f93..85da6ab4fbb5 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
> #include <linux/export.h>
> #include <linux/memblock.h>
> #include <linux/numa.h>
> +#include <linux/sched/isolation.h>
>
> /**
> * cpumask_next - get the next cpu in a cpumask
> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> */
> unsigned int cpumask_local_spread(unsigned int i, int node)
> {
> - int cpu;
> + int cpu, hk_flags;
> + const struct cpumask *mask;
>
> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> + mask = housekeeping_cpumask(hk_flags);
> /* Wrap: we always want a cpu. */
> - i %= num_online_cpus();
> + i %= cpumask_weight(mask);
>
> if (node == NUMA_NO_NODE) {
> - for_each_cpu(cpu, cpu_online_mask)
> + for_each_cpu(cpu, mask) {
> if (i-- == 0)
> return cpu;
> + }
> } else {
> /* NUMA first. */
> - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> if (i-- == 0)
> return cpu;
> + }
>
> - for_each_cpu(cpu, cpu_online_mask) {
> + for_each_cpu(cpu, mask) {
> /* Skip NUMA nodes, done above. */
> if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> continue;
--
Nitesh


Attachments:
signature.asc (849.00 B)
OpenPGP digital signature

2020-07-01 00:35:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Mon, 29 Jun 2020 12:11:25 -0400 Nitesh Narayan Lal <[email protected]> wrote:

>
> On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
> > From: Alex Belits <[email protected]>
> >
> > The current implementation of cpumask_local_spread() does not respect the
> > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > it will return it to the caller for pinning of its IRQ threads. Having
> > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > overhead.
> >
> > Restrict the CPUs that are returned for spreading IRQs only to the
> > available housekeeping CPUs.
> >
> > Signed-off-by: Alex Belits <[email protected]>
> > Signed-off-by: Nitesh Narayan Lal <[email protected]>
>
> Hi Peter,
>
> I just realized that Yuqi jin's patch [1] that modifies cpumask_local_spread is
> lying in linux-next.
> Should I do a re-post by re-basing the patches on the top of linux-next?
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/

This patch has had some review difficulties and has been pending for
quite some time. I suggest you base your work on mainline and that we
ask Yuqi jin to rebase on that, if I don't feel confident doing it,

2020-07-01 00:49:06

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 6/30/20 8:32 PM, Andrew Morton wrote:
> On Mon, 29 Jun 2020 12:11:25 -0400 Nitesh Narayan Lal <[email protected]> wrote:
>
>> On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
>>> From: Alex Belits <[email protected]>
>>>
>>> The current implementation of cpumask_local_spread() does not respect the
>>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>>> it will return it to the caller for pinning of its IRQ threads. Having
>>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>>> overhead.
>>>
>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>> available housekeeping CPUs.
>>>
>>> Signed-off-by: Alex Belits <[email protected]>
>>> Signed-off-by: Nitesh Narayan Lal <[email protected]>
>> Hi Peter,
>>
>> I just realized that Yuqi jin's patch [1] that modifies cpumask_local_spread is
>> lying in linux-next.
>> Should I do a re-post by re-basing the patches on the top of linux-next?
>>
>> [1]
>> https://lore.kernel.org/lkml/[email protected]/
> This patch has had some review difficulties and has been pending for
> quite some time. I suggest you base your work on mainline and that we
> ask Yuqi jin to rebase on that, if I don't feel confident doing it,
>

I see, in that case, it should be fine to go ahead with this patch-set as I
already based this on top of the latest master before posting.

--
Thanks
Nitesh


Attachments:
signature.asc (849.00 B)
OpenPGP digital signature

2020-07-09 08:49:38

by tip-bot2 for Mike Travis

[permalink] [raw]
Subject: [tip: sched/core] lib: Restrict cpumask_local_spread to houskeeping CPUs

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 1abdfe706a579a702799fce465bceb9fb01d407c
Gitweb: https://git.kernel.org/tip/1abdfe706a579a702799fce465bceb9fb01d407c
Author: Alex Belits <[email protected]>
AuthorDate: Thu, 25 Jun 2020 18:34:41 -04:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

lib: Restrict cpumask_local_spread to houskeeping CPUs

The current implementation of cpumask_local_spread() does not respect the
isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
it will return it to the caller for pinning of its IRQ threads. Having
these unwanted IRQ threads on an isolated CPU adds up to a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits <[email protected]>
Signed-off-by: Nitesh Narayan Lal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
lib/cpumask.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb2..85da6ab 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <linux/memblock.h>
#include <linux/numa.h>
+#include <linux/sched/isolation.h>

/**
* cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
*/
unsigned int cpumask_local_spread(unsigned int i, int node)
{
- int cpu;
+ int cpu, hk_flags;
+ const struct cpumask *mask;

+ hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+ mask = housekeeping_cpumask(hk_flags);
/* Wrap: we always want a cpu. */
- i %= num_online_cpus();
+ i %= cpumask_weight(mask);

if (node == NUMA_NO_NODE) {
- for_each_cpu(cpu, cpu_online_mask)
+ for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+ }
} else {
/* NUMA first. */
- for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+ for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+ }

- for_each_cpu(cpu, cpu_online_mask) {
+ for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;

2021-01-27 23:55:28

by Robin Murphy

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

Hi,

On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> From: Alex Belits <[email protected]>
>
> The current implementation of cpumask_local_spread() does not respect the
> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> it will return it to the caller for pinning of its IRQ threads. Having
> these unwanted IRQ threads on an isolated CPU adds up to a latency
> overhead.
>
> Restrict the CPUs that are returned for spreading IRQs only to the
> available housekeeping CPUs.
>
> Signed-off-by: Alex Belits <[email protected]>
> Signed-off-by: Nitesh Narayan Lal <[email protected]>
> ---
> lib/cpumask.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index fb22fb266f93..85da6ab4fbb5 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
> #include <linux/export.h>
> #include <linux/memblock.h>
> #include <linux/numa.h>
> +#include <linux/sched/isolation.h>
>
> /**
> * cpumask_next - get the next cpu in a cpumask
> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> */
> unsigned int cpumask_local_spread(unsigned int i, int node)
> {
> - int cpu;
> + int cpu, hk_flags;
> + const struct cpumask *mask;
>
> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> + mask = housekeeping_cpumask(hk_flags);

AFAICS, this generally resolves to something based on cpu_possible_mask
rather than cpu_online_mask as before, so could now potentially return
an offline CPU. Was that an intentional change?

I was just looking at the current code since I had the rare presence of
mind to check if something suitable already existed before I start
open-coding "any online CPU, but local node preferred" logic for
handling IRQ affinity in a driver - cpumask_local_spread() appears to be
almost what I want (if a bit more heavyweight), if only it would
actually guarantee an online CPU as the kerneldoc claims :(

Robin.

> /* Wrap: we always want a cpu. */
> - i %= num_online_cpus();
> + i %= cpumask_weight(mask);
>
> if (node == NUMA_NO_NODE) {
> - for_each_cpu(cpu, cpu_online_mask)
> + for_each_cpu(cpu, mask) {
> if (i-- == 0)
> return cpu;
> + }
> } else {
> /* NUMA first. */
> - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> if (i-- == 0)
> return cpu;
> + }
>
> - for_each_cpu(cpu, cpu_online_mask) {
> + for_each_cpu(cpu, mask) {
> /* Skip NUMA nodes, done above. */
> if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> continue;
>

2021-01-27 23:55:37

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
> Hi,
>
> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > From: Alex Belits <[email protected]>
> >
> > The current implementation of cpumask_local_spread() does not respect the
> > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > it will return it to the caller for pinning of its IRQ threads. Having
> > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > overhead.
> >
> > Restrict the CPUs that are returned for spreading IRQs only to the
> > available housekeeping CPUs.
> >
> > Signed-off-by: Alex Belits <[email protected]>
> > Signed-off-by: Nitesh Narayan Lal <[email protected]>
> > ---
> > lib/cpumask.c | 16 +++++++++++-----
> > 1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > index fb22fb266f93..85da6ab4fbb5 100644
> > --- a/lib/cpumask.c
> > +++ b/lib/cpumask.c
> > @@ -6,6 +6,7 @@
> > #include <linux/export.h>
> > #include <linux/memblock.h>
> > #include <linux/numa.h>
> > +#include <linux/sched/isolation.h>
> > /**
> > * cpumask_next - get the next cpu in a cpumask
> > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> > */
> > unsigned int cpumask_local_spread(unsigned int i, int node)
> > {
> > - int cpu;
> > + int cpu, hk_flags;
> > + const struct cpumask *mask;
> > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > + mask = housekeeping_cpumask(hk_flags);
>
> AFAICS, this generally resolves to something based on cpu_possible_mask
> rather than cpu_online_mask as before, so could now potentially return an
> offline CPU. Was that an intentional change?

Robin,

AFAICS online CPUs should be filtered.

> I was just looking at the current code since I had the rare presence of mind
> to check if something suitable already existed before I start open-coding
> "any online CPU, but local node preferred" logic for handling IRQ affinity
> in a driver - cpumask_local_spread() appears to be almost what I want (if a
> bit more heavyweight), if only it would actually guarantee an online CPU as
> the kerneldoc claims :(
>
> Robin.
>
> > /* Wrap: we always want a cpu. */
> > - i %= num_online_cpus();
> > + i %= cpumask_weight(mask);
> > if (node == NUMA_NO_NODE) {
> > - for_each_cpu(cpu, cpu_online_mask)
> > + for_each_cpu(cpu, mask) {
> > if (i-- == 0)
> > return cpu;
> > + }
> > } else {
> > /* NUMA first. */
> > - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> > + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > if (i-- == 0)
> > return cpu;
> > + }
> > - for_each_cpu(cpu, cpu_online_mask) {
> > + for_each_cpu(cpu, mask) {
> > /* Skip NUMA nodes, done above. */
> > if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> > continue;
> >

2021-01-27 23:56:39

by Robin Murphy

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On 2021-01-27 12:19, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>> Hi,
>>
>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>> From: Alex Belits <[email protected]>
>>>
>>> The current implementation of cpumask_local_spread() does not respect the
>>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>>> it will return it to the caller for pinning of its IRQ threads. Having
>>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>>> overhead.
>>>
>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>> available housekeeping CPUs.
>>>
>>> Signed-off-by: Alex Belits <[email protected]>
>>> Signed-off-by: Nitesh Narayan Lal <[email protected]>
>>> ---
>>> lib/cpumask.c | 16 +++++++++++-----
>>> 1 file changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>> index fb22fb266f93..85da6ab4fbb5 100644
>>> --- a/lib/cpumask.c
>>> +++ b/lib/cpumask.c
>>> @@ -6,6 +6,7 @@
>>> #include <linux/export.h>
>>> #include <linux/memblock.h>
>>> #include <linux/numa.h>
>>> +#include <linux/sched/isolation.h>
>>> /**
>>> * cpumask_next - get the next cpu in a cpumask
>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>> */
>>> unsigned int cpumask_local_spread(unsigned int i, int node)
>>> {
>>> - int cpu;
>>> + int cpu, hk_flags;
>>> + const struct cpumask *mask;
>>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>> + mask = housekeeping_cpumask(hk_flags);
>>
>> AFAICS, this generally resolves to something based on cpu_possible_mask
>> rather than cpu_online_mask as before, so could now potentially return an
>> offline CPU. Was that an intentional change?
>
> Robin,
>
> AFAICS online CPUs should be filtered.

Apologies if I'm being thick, but can you explain how? In the case of
isolation being disabled or compiled out, housekeeping_cpumask() is
literally just "return cpu_possible_mask;". If we then iterate over that
with for_each_cpu() and just return the i'th possible CPU (e.g. in the
NUMA_NO_NODE case), what guarantees that CPU is actually online?

Robin.

>> I was just looking at the current code since I had the rare presence of mind
>> to check if something suitable already existed before I start open-coding
>> "any online CPU, but local node preferred" logic for handling IRQ affinity
>> in a driver - cpumask_local_spread() appears to be almost what I want (if a
>> bit more heavyweight), if only it would actually guarantee an online CPU as
>> the kerneldoc claims :(
>>
>> Robin.
>>
>>> /* Wrap: we always want a cpu. */
>>> - i %= num_online_cpus();
>>> + i %= cpumask_weight(mask);
>>> if (node == NUMA_NO_NODE) {
>>> - for_each_cpu(cpu, cpu_online_mask)
>>> + for_each_cpu(cpu, mask) {
>>> if (i-- == 0)
>>> return cpu;
>>> + }
>>> } else {
>>> /* NUMA first. */
>>> - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
>>> + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>> if (i-- == 0)
>>> return cpu;
>>> + }
>>> - for_each_cpu(cpu, cpu_online_mask) {
>>> + for_each_cpu(cpu, mask) {
>>> /* Skip NUMA nodes, done above. */
>>> if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>> continue;
>>>
>

2021-01-27 23:58:49

by Robin Murphy

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On 2021-01-27 13:09, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
>> On 2021-01-27 12:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>>>> Hi,
>>>>
>>>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>>>> From: Alex Belits <[email protected]>
>>>>>
>>>>> The current implementation of cpumask_local_spread() does not respect the
>>>>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>>>>> it will return it to the caller for pinning of its IRQ threads. Having
>>>>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>>>>> overhead.
>>>>>
>>>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>>>> available housekeeping CPUs.
>>>>>
>>>>> Signed-off-by: Alex Belits <[email protected]>
>>>>> Signed-off-by: Nitesh Narayan Lal <[email protected]>
>>>>> ---
>>>>> lib/cpumask.c | 16 +++++++++++-----
>>>>> 1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>>>> index fb22fb266f93..85da6ab4fbb5 100644
>>>>> --- a/lib/cpumask.c
>>>>> +++ b/lib/cpumask.c
>>>>> @@ -6,6 +6,7 @@
>>>>> #include <linux/export.h>
>>>>> #include <linux/memblock.h>
>>>>> #include <linux/numa.h>
>>>>> +#include <linux/sched/isolation.h>
>>>>> /**
>>>>> * cpumask_next - get the next cpu in a cpumask
>>>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>>>> */
>>>>> unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>> {
>>>>> - int cpu;
>>>>> + int cpu, hk_flags;
>>>>> + const struct cpumask *mask;
>>>>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> + mask = housekeeping_cpumask(hk_flags);
>>>>
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>>
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>>
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>>
>> Robin.
>
> Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.

True, if someone calls from a racy context then there's not much we can
do to ensure that any CPU *remains* online after we initially observed
it to be, but when it's called from somewhere safe like a cpuhp offline
handler, then picking from cpu_online_mask *did* always do the right
thing (by my interpretation), whereas picking from
housekeeping_cpumask() might not.

This is why I decided to ask rather than just send a patch to fix what I
think might be a bug - I have no objection if this *is* intended
behaviour, other than suggesting we amend the "...selects an online
CPU..." comment if that aspect was never meant to be relied upon.

Thanks,
Robin.

>
> cpumask_local_spread() should probably be disabling CPU hotplug.
>
> Thomas?
>
>>
>>>> I was just looking at the current code since I had the rare presence of mind
>>>> to check if something suitable already existed before I start open-coding
>>>> "any online CPU, but local node preferred" logic for handling IRQ affinity
>>>> in a driver - cpumask_local_spread() appears to be almost what I want (if a
>>>> bit more heavyweight), if only it would actually guarantee an online CPU as
>>>> the kerneldoc claims :(
>>>>
>>>> Robin.
>>>>
>>>>> /* Wrap: we always want a cpu. */
>>>>> - i %= num_online_cpus();
>>>>> + i %= cpumask_weight(mask);
>>>>> if (node == NUMA_NO_NODE) {
>>>>> - for_each_cpu(cpu, cpu_online_mask)
>>>>> + for_each_cpu(cpu, mask) {
>>>>> if (i-- == 0)
>>>>> return cpu;
>>>>> + }
>>>>> } else {
>>>>> /* NUMA first. */
>>>>> - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
>>>>> + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>>>> if (i-- == 0)
>>>>> return cpu;
>>>>> + }
>>>>> - for_each_cpu(cpu, cpu_online_mask) {
>>>>> + for_each_cpu(cpu, mask) {
>>>>> /* Skip NUMA nodes, done above. */
>>>>> if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>>>> continue;
>>>>>
>>>
>

2021-01-27 23:59:57

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
> On 2021-01-27 12:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
> > > Hi,
> > >
> > > On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > > > From: Alex Belits <[email protected]>
> > > >
> > > > The current implementation of cpumask_local_spread() does not respect the
> > > > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > > > it will return it to the caller for pinning of its IRQ threads. Having
> > > > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > > > overhead.
> > > >
> > > > Restrict the CPUs that are returned for spreading IRQs only to the
> > > > available housekeeping CPUs.
> > > >
> > > > Signed-off-by: Alex Belits <[email protected]>
> > > > Signed-off-by: Nitesh Narayan Lal <[email protected]>
> > > > ---
> > > > lib/cpumask.c | 16 +++++++++++-----
> > > > 1 file changed, 11 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > > > index fb22fb266f93..85da6ab4fbb5 100644
> > > > --- a/lib/cpumask.c
> > > > +++ b/lib/cpumask.c
> > > > @@ -6,6 +6,7 @@
> > > > #include <linux/export.h>
> > > > #include <linux/memblock.h>
> > > > #include <linux/numa.h>
> > > > +#include <linux/sched/isolation.h>
> > > > /**
> > > > * cpumask_next - get the next cpu in a cpumask
> > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> > > > */
> > > > unsigned int cpumask_local_spread(unsigned int i, int node)
> > > > {
> > > > - int cpu;
> > > > + int cpu, hk_flags;
> > > > + const struct cpumask *mask;
> > > > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > + mask = housekeeping_cpumask(hk_flags);
> > >
> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> > > rather than cpu_online_mask as before, so could now potentially return an
> > > offline CPU. Was that an intentional change?
> >
> > Robin,
> >
> > AFAICS online CPUs should be filtered.
>
> Apologies if I'm being thick, but can you explain how? In the case of
> isolation being disabled or compiled out, housekeeping_cpumask() is
> literally just "return cpu_possible_mask;". If we then iterate over that
> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>
> Robin.

Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
as well.

cpumask_local_spread() should probably be disabling CPU hotplug.

Thomas?

>
> > > I was just looking at the current code since I had the rare presence of mind
> > > to check if something suitable already existed before I start open-coding
> > > "any online CPU, but local node preferred" logic for handling IRQ affinity
> > > in a driver - cpumask_local_spread() appears to be almost what I want (if a
> > > bit more heavyweight), if only it would actually guarantee an online CPU as
> > > the kerneldoc claims :(
> > >
> > > Robin.
> > >
> > > > /* Wrap: we always want a cpu. */
> > > > - i %= num_online_cpus();
> > > > + i %= cpumask_weight(mask);
> > > > if (node == NUMA_NO_NODE) {
> > > > - for_each_cpu(cpu, cpu_online_mask)
> > > > + for_each_cpu(cpu, mask) {
> > > > if (i-- == 0)
> > > > return cpu;
> > > > + }
> > > > } else {
> > > > /* NUMA first. */
> > > > - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> > > > + for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > > > if (i-- == 0)
> > > > return cpu;
> > > > + }
> > > > - for_each_cpu(cpu, cpu_online_mask) {
> > > > + for_each_cpu(cpu, mask) {
> > > > /* Skip NUMA nodes, done above. */
> > > > if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> > > > continue;
> > > >
> >

2021-01-28 00:01:48

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 1/27/21 8:09 AM, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
>> On 2021-01-27 12:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>>>> Hi,
>>>>
>>>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>>>> From: Alex Belits <[email protected]>
>>>>>
>>>>> The current implementation of cpumask_local_spread() does not respect the
>>>>> isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
>>>>> it will return it to the caller for pinning of its IRQ threads. Having
>>>>> these unwanted IRQ threads on an isolated CPU adds up to a latency
>>>>> overhead.
>>>>>
>>>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>>>> available housekeeping CPUs.
>>>>>
>>>>> Signed-off-by: Alex Belits <[email protected]>
>>>>> Signed-off-by: Nitesh Narayan Lal <[email protected]>
>>>>> ---
>>>>> lib/cpumask.c | 16 +++++++++++-----
>>>>> 1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>>>> index fb22fb266f93..85da6ab4fbb5 100644
>>>>> --- a/lib/cpumask.c
>>>>> +++ b/lib/cpumask.c
>>>>> @@ -6,6 +6,7 @@
>>>>> #include <linux/export.h>
>>>>> #include <linux/memblock.h>
>>>>> #include <linux/numa.h>
>>>>> +#include <linux/sched/isolation.h>
>>>>> /**
>>>>> * cpumask_next - get the next cpu in a cpumask
>>>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>>>> */
>>>>> unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>> {
>>>>> - int cpu;
>>>>> + int cpu, hk_flags;
>>>>> + const struct cpumask *mask;
>>>>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> + mask = housekeeping_cpumask(hk_flags);
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>>
>> Robin.
> Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.

Marcelo, before the commit cpumask_local_spread, was in fact, relying on
cpu_online_mask as Robin mentioned.
The problem here is with housekeeping_cpumask which always relied on the
cpu_possible_mask.

>
> cpumask_local_spread() should probably be disabling CPU hotplug.


Yes and this should also be done at several other places in the drivers
which don't take CPU hotplug into account eg. at the time of vector
allocation.


--
Thanks
Nitesh

2021-01-28 16:00:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
>> > > > /**
>> > > > * cpumask_next - get the next cpu in a cpumask
>> > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>> > > > */
>> > > > unsigned int cpumask_local_spread(unsigned int i, int node)
>> > > > {
>> > > > - int cpu;
>> > > > + int cpu, hk_flags;
>> > > > + const struct cpumask *mask;
>> > > > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> > > > + mask = housekeeping_cpumask(hk_flags);
>> > >
>> > > AFAICS, this generally resolves to something based on cpu_possible_mask
>> > > rather than cpu_online_mask as before, so could now potentially return an
>> > > offline CPU. Was that an intentional change?
>> >
>> > Robin,
>> >
>> > AFAICS online CPUs should be filtered.
>>
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>>
>> Robin.
>
> Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.
>
> cpumask_local_spread() should probably be disabling CPU hotplug.

It can't unless all callers are from preemtible code.

Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
over the kernel is just wrong, really.

As I explained several times before there are very valid reasons for
having queues and interrupts on isolated CPUs. Just optimizing for the
usecases some people care about is not making anything better.

Thanks,

tglx

2021-01-28 16:06:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>> > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> > + mask = housekeeping_cpumask(hk_flags);
>>
>> AFAICS, this generally resolves to something based on cpu_possible_mask
>> rather than cpu_online_mask as before, so could now potentially return an
>> offline CPU. Was that an intentional change?
>
> Robin,
>
> AFAICS online CPUs should be filtered.

The whole pile wants to be reverted. It's simply broken in several ways.

Thanks,

tglx

2021-01-28 17:04:07

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
> >> > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > + mask = housekeeping_cpumask(hk_flags);
> >>
> >> AFAICS, this generally resolves to something based on cpu_possible_mask
> >> rather than cpu_online_mask as before, so could now potentially return an
> >> offline CPU. Was that an intentional change?
> >
> > Robin,
> >
> > AFAICS online CPUs should be filtered.
>
> The whole pile wants to be reverted. It's simply broken in several ways.

I was asking for your comments on interaction with CPU hotplug :-)
Anyway...

So housekeeping_cpumask has multiple meanings. In this case:

HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ

domain
Isolate from the general SMP balancing and scheduling
algorithms. Note that performing domain isolation this way
is irreversible: it's not possible to bring back a CPU to
the domains once isolated through isolcpus. It's strongly
advised to use cpusets instead to disable scheduler load
balancing through the "cpuset.sched_load_balance" file.
It offers a much more flexible interface where CPUs can
move in and out of an isolated set anytime.

You can move a process onto or off an "isolated" CPU via
the CPU affinity syscalls or cpuset.
<cpu number> begins at 0 and the maximum value is
"number of CPUs in system - 1".

managed_irq

Isolate from being targeted by managed interrupts
which have an interrupt mask containing isolated
CPUs. The affinity of managed interrupts is
handled by the kernel and cannot be changed via
the /proc/irq/* interfaces.

This isolation is best effort and only effective
if the automatically assigned interrupt mask of a
device queue contains isolated and housekeeping
CPUs. If housekeeping CPUs are online then such
interrupts are directed to the housekeeping CPU
so that IO submitted on the housekeeping CPU
cannot disturb the isolated CPU.

If a queue's affinity mask contains only isolated
CPUs then this parameter has no effect on the
interrupt routing decision, though interrupts are
only delivered when tasks running on those
isolated CPUs submit IO. IO submitted on
housekeeping CPUs has no influence on those
queues.

So as long as the meaning of the flags are respected, seems
alright.

Nitesh, is there anything preventing this from being fixed
in userspace ? (as Thomas suggested previously).



2021-01-28 17:06:15

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Thu, Jan 28, 2021 at 04:56:07PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
> >> > > > /**
> >> > > > * cpumask_next - get the next cpu in a cpumask
> >> > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> >> > > > */
> >> > > > unsigned int cpumask_local_spread(unsigned int i, int node)
> >> > > > {
> >> > > > - int cpu;
> >> > > > + int cpu, hk_flags;
> >> > > > + const struct cpumask *mask;
> >> > > > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > > > + mask = housekeeping_cpumask(hk_flags);
> >> > >
> >> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> >> > > rather than cpu_online_mask as before, so could now potentially return an
> >> > > offline CPU. Was that an intentional change?
> >> >
> >> > Robin,
> >> >
> >> > AFAICS online CPUs should be filtered.
> >>
> >> Apologies if I'm being thick, but can you explain how? In the case of
> >> isolation being disabled or compiled out, housekeeping_cpumask() is
> >> literally just "return cpu_possible_mask;". If we then iterate over that
> >> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> >> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> >>
> >> Robin.
> >
> > Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> > as well.
> >
> > cpumask_local_spread() should probably be disabling CPU hotplug.
>
> It can't unless all callers are from preemtible code.
>
> Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> over the kernel is just wrong, really.
>
> As I explained several times before there are very valid reasons for
> having queues and interrupts on isolated CPUs. Just optimizing for the
> usecases some people care about is not making anything better.

And that is right.

2021-01-28 17:43:36

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 1/28/21 11:59 AM, Marcelo Tosatti wrote:
> On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
>> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>>>>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> + mask = housekeeping_cpumask(hk_flags);
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>> The whole pile wants to be reverted. It's simply broken in several ways.
> I was asking for your comments on interaction with CPU hotplug :-)
> Anyway...
>
> So housekeeping_cpumask has multiple meanings. In this case:
>
> HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ
>
> domain
> Isolate from the general SMP balancing and scheduling
> algorithms. Note that performing domain isolation this way
> is irreversible: it's not possible to bring back a CPU to
> the domains once isolated through isolcpus. It's strongly
> advised to use cpusets instead to disable scheduler load
> balancing through the "cpuset.sched_load_balance" file.
> It offers a much more flexible interface where CPUs can
> move in and out of an isolated set anytime.
>
> You can move a process onto or off an "isolated" CPU via
> the CPU affinity syscalls or cpuset.
> <cpu number> begins at 0 and the maximum value is
> "number of CPUs in system - 1".
>
> managed_irq
>
> Isolate from being targeted by managed interrupts
> which have an interrupt mask containing isolated
> CPUs. The affinity of managed interrupts is
> handled by the kernel and cannot be changed via
> the /proc/irq/* interfaces.
>
> This isolation is best effort and only effective
> if the automatically assigned interrupt mask of a
> device queue contains isolated and housekeeping
> CPUs. If housekeeping CPUs are online then such
> interrupts are directed to the housekeeping CPU
> so that IO submitted on the housekeeping CPU
> cannot disturb the isolated CPU.
>
> If a queue's affinity mask contains only isolated
> CPUs then this parameter has no effect on the
> interrupt routing decision, though interrupts are
> only delivered when tasks running on those
> isolated CPUs submit IO. IO submitted on
> housekeeping CPUs has no influence on those
> queues.
>
> So as long as the meaning of the flags are respected, seems
> alright.
>
> Nitesh, is there anything preventing this from being fixed
> in userspace ? (as Thomas suggested previously).

I think it should be doable atleast for most of the devices.
However, I do wonder if there is a better way of fixing this generically
from the kernel?

Currently, as Thomas mentioned housekeeping_cpumask() is used at different
locations just to fix the issue corresponding to that component or driver.

--
Thanks
Nitesh

2021-01-28 20:10:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
>> The whole pile wants to be reverted. It's simply broken in several ways.
>
> I was asking for your comments on interaction with CPU hotplug :-)

Which I answered in an seperate mail :)

> So housekeeping_cpumask has multiple meanings. In this case:

...

> So as long as the meaning of the flags are respected, seems
> alright.

Yes. Stuff like the managed interrupts preference for housekeeping CPUs
when a affinity mask spawns housekeeping and isolated is perfectly
fine. It's well thought out and has no limitations.

> Nitesh, is there anything preventing this from being fixed
> in userspace ? (as Thomas suggested previously).

Everything with is not managed can be steered by user space.

Thanks,

tglx

2021-02-04 18:25:07

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >> The whole pile wants to be reverted. It's simply broken in several ways.
> >
> > I was asking for your comments on interaction with CPU hotplug :-)
>
> Which I answered in an seperate mail :)
>
> > So housekeeping_cpumask has multiple meanings. In this case:
>
> ...
>
> > So as long as the meaning of the flags are respected, seems
> > alright.
>
> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> when a affinity mask spawns housekeeping and isolated is perfectly
> fine. It's well thought out and has no limitations.
>
> > Nitesh, is there anything preventing this from being fixed
> > in userspace ? (as Thomas suggested previously).
>
> Everything with is not managed can be steered by user space.

Yes, but it seems to be racy (that is, there is a window where the
interrupt can be delivered to an isolated CPU).

ethtool ->
xgbe_set_channels ->
xgbe_full_restart_dev ->
xgbe_alloc_memory ->
xgbe_alloc_channels ->
cpumask_local_spread

Also ifconfig eth0 down / ifconfig eth0 up leads
to cpumask_spread_local.

How about adding a new flag for isolcpus instead?

2021-02-04 18:53:54

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
>> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
>>>> The whole pile wants to be reverted. It's simply broken in several ways.
>>> I was asking for your comments on interaction with CPU hotplug :-)
>> Which I answered in an seperate mail :)
>>
>>> So housekeeping_cpumask has multiple meanings. In this case:
>> ...
>>
>>> So as long as the meaning of the flags are respected, seems
>>> alright.
>> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
>> when a affinity mask spawns housekeeping and isolated is perfectly
>> fine. It's well thought out and has no limitations.
>>
>>> Nitesh, is there anything preventing this from being fixed
>>> in userspace ? (as Thomas suggested previously).
>> Everything with is not managed can be steered by user space.
> Yes, but it seems to be racy (that is, there is a window where the
> interrupt can be delivered to an isolated CPU).
>
> ethtool ->
> xgbe_set_channels ->
> xgbe_full_restart_dev ->
> xgbe_alloc_memory ->
> xgbe_alloc_channels ->
> cpumask_local_spread
>
> Also ifconfig eth0 down / ifconfig eth0 up leads
> to cpumask_spread_local.

There's always that possibility.

We have to ensure that we move the IRQs by a tuned daemon or some other
userspace script every time there is a net-dev change (eg. device comes up,
creates VFs, etc).

> How about adding a new flag for isolcpus instead?
>

Do you mean a flag based on which we can switch the affinity mask to
housekeeping for all the devices at the time of IRQ distribution?

--
Thanks
Nitesh

2021-02-04 19:13:52

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:
>
> On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> > On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> >> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >>>> The whole pile wants to be reverted. It's simply broken in several ways.
> >>> I was asking for your comments on interaction with CPU hotplug :-)
> >> Which I answered in an seperate mail :)
> >>
> >>> So housekeeping_cpumask has multiple meanings. In this case:
> >> ...
> >>
> >>> So as long as the meaning of the flags are respected, seems
> >>> alright.
> >> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> >> when a affinity mask spawns housekeeping and isolated is perfectly
> >> fine. It's well thought out and has no limitations.
> >>
> >>> Nitesh, is there anything preventing this from being fixed
> >>> in userspace ? (as Thomas suggested previously).
> >> Everything with is not managed can be steered by user space.
> > Yes, but it seems to be racy (that is, there is a window where the
> > interrupt can be delivered to an isolated CPU).
> >
> > ethtool ->
> > xgbe_set_channels ->
> > xgbe_full_restart_dev ->
> > xgbe_alloc_memory ->
> > xgbe_alloc_channels ->
> > cpumask_local_spread
> >
> > Also ifconfig eth0 down / ifconfig eth0 up leads
> > to cpumask_spread_local.
>
> There's always that possibility.

Then there is a window where isolation can be broken.

> We have to ensure that we move the IRQs by a tuned daemon or some other
> userspace script every time there is a net-dev change (eg. device comes up,
> creates VFs, etc).

Again, race window open can result in interrupt to isolated CPU.

> > How about adding a new flag for isolcpus instead?
> >
>
> Do you mean a flag based on which we can switch the affinity mask to
> housekeeping for all the devices at the time of IRQ distribution?

Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.


2021-02-05 01:40:19

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
> On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:

[...]

>>>>> Nitesh, is there anything preventing this from being fixed
>>>>> in userspace ? (as Thomas suggested previously).
>>>> Everything with is not managed can be steered by user space.
>>> Yes, but it seems to be racy (that is, there is a window where the
>>> interrupt can be delivered to an isolated CPU).
>>>
>>> ethtool ->
>>> xgbe_set_channels ->
>>> xgbe_full_restart_dev ->
>>> xgbe_alloc_memory ->
>>> xgbe_alloc_channels ->
>>> cpumask_local_spread
>>>
>>> Also ifconfig eth0 down / ifconfig eth0 up leads
>>> to cpumask_spread_local.
>> There's always that possibility.
> Then there is a window where isolation can be broken.
>
>> We have to ensure that we move the IRQs by a tuned daemon or some other
>> userspace script every time there is a net-dev change (eg. device comes up,
>> creates VFs, etc).
> Again, race window open can result in interrupt to isolated CPU.


Yes, and if I am not mistaken then that always has been a problem with
these userspace solutions.

>
>>> How about adding a new flag for isolcpus instead?
>>>
>> Do you mean a flag based on which we can switch the affinity mask to
>> housekeeping for all the devices at the time of IRQ distribution?
> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>
>

Does sounds like a nice idea to explore, lets see what Thomas thinks about it.

--
Thanks
Nitesh

2021-02-05 20:00:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Fri, Jan 29 2021 at 09:35, Nitesh Narayan Lal wrote:
> On 1/29/21 9:23 AM, Marcelo Tosatti wrote:
>>> I am not sure about the PCI patch as I don't think we can control that from
>>> the userspace or maybe I am wrong?
>> You mean "lib: Restrict cpumask_local_spread to housekeeping CPUs" ?
>
> No, "PCI: Restrict probe functions to housekeeping CPUs".

That part is fine because it just moves the probing to a work queue on a
housekeeping CPU. But that has nothing to do with the interrupt
spreading library.

Thanks,

tglx

2021-02-06 03:42:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>> How about adding a new flag for isolcpus instead?
>>>>
>>> Do you mean a flag based on which we can switch the affinity mask to
>>> housekeeping for all the devices at the time of IRQ distribution?
>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>
> Does sounds like a nice idea to explore, lets see what Thomas thinks about it.

I just read back up on that whole discussion and stared into the usage
sites a bit.

There are a couple of issues here in a larger picture. Looking at it
from the device side first:

The spreading is done for non-managed queues/interrupts which makes them
movable by user space. So it could be argued from both sides that the
damage done by allowing the full online mask or by allowing only the
house keeping mask can be fixed up by user space.

But that's the trivial part of the problem. The real problem is CPU
hotplug and offline CPUs and the way how interrupts are set up for their
initial affinity.

As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict
cpumask_local_spread to houskeeping CPUs") is broken as it can return
offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case.

The original code is racy vs. hotplug unless the callers block hotplug.

Let's look at all the callers and what they do with it.

cptvf_set_irq_affinity() affinity hint
safexcel_request_ring_irq() affinity hint
mv_cesa_probe() affinity hint
bnxt_request_irq() affinity hint
nicvf_set_irq_affinity() affinity hint
cxgb4_set_msix_aff() affinity hint
enic_init_affinity_hint(() affinity hint
iavf_request_traffic_irqs() affinity hint
ionic_alloc_qcq_interrupt() affinity hint
efx_set_interrupt_affinity() affinity hint
i40e_vsi_request_irq_msix() affinity hint

be_evt_queues_create() affinity hint, queue affinity
hns3_nic_set_cpumask() affinity hint, queue affinity
mlx4_en_init_affinity_hint() affinity hint, queue affinity
mlx4_en_create_tx_ring() affinity hint, queue affinity
set_comp_irq_affinity_hint() affinity hint, queue affinity
i40e_config_xps_tx_ring() affinity hint, queue affinity

hclge_configure affinity_hint, queue affinity, workqueue selection

ixgbe_alloc_q_vector() node selection, affinity hint, queue affinity

All of them do not care about disabling hotplug. Taking cpu_read_lock()
inside of that spread function would not solve anything because once the
lock is dropped the CPU can go away.

There are 3 classes of this:

1) Does not matter: affinity hint

2) Might fail to set up the network queue when the selected CPU
is offline.

3) Broken: The hclge driver which uses the cpu to schedule work on
that cpu. That's broken, but unfortunately neither the workqueue
code nor the timer code will ever notice. The work just wont be
scheduled until the CPU comes online again which might be never.

But looking at the above I really have to ask the question what the
commit in question is actually trying to solve.

AFAICT, nothing at all. Why?

1) The majority of the drivers sets the hint __after_ requesting the
interrupt

2) Even if set _before_ requesting the interrupt it does not solve
anything because it's a hint and the interrupt core code does
not care about it at all. It provides the storage and the procfs
interface nothing else.

So how does that prevent the interrupt subsystem from assigning an
interrupt to an isolated CPU? Not at all.

Interrupts which are freshly allocated get the default interrupt
affinity mask, which is either set on the command line or via /proc. The
affinity of the interrupt can be changed after it has been populated in
/proc.

When the interrupt is requested then one of the online CPUs in it's
affinity mask is chosen.

X86 is special here because this also requires that there are free
vectors on one of the online CPUs in the mask. If the CPUs in the
affinity mask run out of vectors then it will grab a vector from some
other CPU which might be an isolated CPU.

When the affinity mask of the interrupt at the time when it is actually
requested contains an isolated CPU then nothing prevents the kernel from
steering it at an isolated CPU. But that has absolutely nothing to do
with that spreading thingy.

The only difference which this change makes is the fact that the
affinity hint changes. Nothing else.

This whole blurb about it might break isolation when an interrupt is
requested is just nonsensical, really.

If the default affinity mask is not correctly set up before devices are
initialized then it's not going to be cured by changing that spread
function. If the user space irq balancer ignores the isolation mask and
blindly moves stuff to the affinity hint, then this monstrosity needs to
be fixed.

So I'm going to revert this commit because it _IS_ broken _AND_ useless
and does not solve anything it claims to solve.

Thanks,

tglx

2021-02-06 03:55:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

On Fri, Feb 05 2021 at 23:23, Thomas Gleixner wrote:
> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>> How about adding a new flag for isolcpus instead?
>>>>>
>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>> housekeeping for all the devices at the time of IRQ distribution?
>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>
>> Does sounds like a nice idea to explore, lets see what Thomas thinks about it.

<.SNIP.>

> So I'm going to revert this commit because it _IS_ broken _AND_ useless
> and does not solve anything it claims to solve.

And no. HK_FLAG_IRQ_SPREAD is not going to solve anything either.

Thanks,

tglx

2021-02-07 00:51:38

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 2/5/21 5:23 PM, Thomas Gleixner wrote:
> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>> How about adding a new flag for isolcpus instead?
>>>>>
>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>> housekeeping for all the devices at the time of IRQ distribution?
>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>> Does sounds like a nice idea to explore, lets see what Thomas thinks about it.
> I just read back up on that whole discussion and stared into the usage
> sites a bit.
>
> There are a couple of issues here in a larger picture. Looking at it
> from the device side first:
>
> The spreading is done for non-managed queues/interrupts which makes them
> movable by user space. So it could be argued from both sides that the
> damage done by allowing the full online mask or by allowing only the
> house keeping mask can be fixed up by user space.
>
> But that's the trivial part of the problem. The real problem is CPU
> hotplug and offline CPUs and the way how interrupts are set up for their
> initial affinity.
>
> As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict
> cpumask_local_spread to houskeeping CPUs") is broken as it can return
> offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case.

A quick question here, is there any reason why we don't have cpu_online_mask
instead of cpu_possible_mask in the housekeeping_cpumask()?
(not for this particular patch but in general)

>
> The original code is racy vs. hotplug unless the callers block hotplug.
>
> Let's look at all the callers and what they do with it.
>
> cptvf_set_irq_affinity() affinity hint
> safexcel_request_ring_irq() affinity hint
> mv_cesa_probe() affinity hint
> bnxt_request_irq() affinity hint
> nicvf_set_irq_affinity() affinity hint
> cxgb4_set_msix_aff() affinity hint
> enic_init_affinity_hint(() affinity hint
> iavf_request_traffic_irqs() affinity hint
> ionic_alloc_qcq_interrupt() affinity hint
> efx_set_interrupt_affinity() affinity hint
> i40e_vsi_request_irq_msix() affinity hint
>
> be_evt_queues_create() affinity hint, queue affinity
> hns3_nic_set_cpumask() affinity hint, queue affinity
> mlx4_en_init_affinity_hint() affinity hint, queue affinity
> mlx4_en_create_tx_ring() affinity hint, queue affinity
> set_comp_irq_affinity_hint() affinity hint, queue affinity
> i40e_config_xps_tx_ring() affinity hint, queue affinity
>
> hclge_configure affinity_hint, queue affinity, workqueue selection
>
> ixgbe_alloc_q_vector() node selection, affinity hint, queue affinity
>
> All of them do not care about disabling hotplug. Taking cpu_read_lock()
> inside of that spread function would not solve anything because once the
> lock is dropped the CPU can go away.
>
> There are 3 classes of this:
>
> 1) Does not matter: affinity hint
>
> 2) Might fail to set up the network queue when the selected CPU
> is offline.
>
> 3) Broken: The hclge driver which uses the cpu to schedule work on
> that cpu. That's broken, but unfortunately neither the workqueue
> code nor the timer code will ever notice. The work just wont be
> scheduled until the CPU comes online again which might be never.

Agreed.

> But looking at the above I really have to ask the question what the
> commit in question is actually trying to solve.
>
> AFAICT, nothing at all. Why?
>
> 1) The majority of the drivers sets the hint __after_ requesting the
> interrupt
>
> 2) Even if set _before_ requesting the interrupt it does not solve
> anything because it's a hint and the interrupt core code does
> not care about it at all. It provides the storage and the procfs
> interface nothing else.
>
> So how does that prevent the interrupt subsystem from assigning an
> interrupt to an isolated CPU? Not at all.
>
> Interrupts which are freshly allocated get the default interrupt
> affinity mask, which is either set on the command line or via /proc. The
> affinity of the interrupt can be changed after it has been populated in
> /proc.
>
> When the interrupt is requested then one of the online CPUs in it's
> affinity mask is chosen.
>
> X86 is special here because this also requires that there are free
> vectors on one of the online CPUs in the mask. If the CPUs in the
> affinity mask run out of vectors then it will grab a vector from some
> other CPU which might be an isolated CPU.
>
> When the affinity mask of the interrupt at the time when it is actually
> requested contains an isolated CPU then nothing prevents the kernel from
> steering it at an isolated CPU. But that has absolutely nothing to do
> with that spreading thingy.
>
> The only difference which this change makes is the fact that the
> affinity hint changes. Nothing else.
>

Thanks for the detailed explanation.

Before I posted this patch, I was doing some debugging on a setup where I
was observing some latency issues due to the iavf IRQs that were pinned on
the isolated CPUs.

Based on some initial traces I had this impression that the affinity hint
or cpumask_local_spread was somehow playing a role in deciding the affinity
mask of these IRQs. Although, that does look incorrect after going through
your explanation.
For some reason, with a kernel that had this patch when I tried creating
VFs iavf IRQs always ended up on the HK CPUs.

The reasoning for the above is still not very clear to me. I will investigate
this further to properly understand this behavior.

--
Nitesh

2021-02-11 23:20:19

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>
>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>> Does sounds like a nice idea to explore, lets see what Thomas thinks about it.

<snip>

>>> When the affinity mask of the interrupt at the time when it is actually
>>> requested contains an isolated CPU then nothing prevents the kernel from
>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>> with that spreading thingy.
>>>
>>> The only difference which this change makes is the fact that the
>>> affinity hint changes. Nothing else.
>>>
> Thanks for the detailed explanation.
>
> Before I posted this patch, I was doing some debugging on a setup where I
> was observing some latency issues due to the iavf IRQs that were pinned on
> the isolated CPUs.
>
> Based on some initial traces I had this impression that the affinity hint
> or cpumask_local_spread was somehow playing a role in deciding the affinity
> mask of these IRQs. Although, that does look incorrect after going through
> your explanation.
> For some reason, with a kernel that had this patch when I tried creating
> VFs iavf IRQs always ended up on the HK CPUs.
>
> The reasoning for the above is still not very clear to me. I will investigate
> this further to properly understand this behavior.
>
>

After a little more digging, I found out why cpumask_local_spread change
affects the general/initial smp_affinity for certain device IRQs.

After the introduction of the commit:

    e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()

For all the drivers that set hint, initial affinity is set based on the
CPU retrieved from cpumask_local_spread. So in an environment where
irqbalance is disabled, these device IRQs remain on the CPUs that are
picked from cpumask_local_spread even though they are isolated. I think
the commit message of the reverted patch should have covered this as
well.

--
Thanks
Nitesh

2021-03-04 19:08:26

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 2/11/21 10:55 AM, Nitesh Narayan Lal wrote:
> On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
>> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>>
>>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>>> Does sounds like a nice idea to explore, lets see what Thomas thinks about it.
> <snip>
>
>>>> When the affinity mask of the interrupt at the time when it is actually
>>>> requested contains an isolated CPU then nothing prevents the kernel from
>>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>>> with that spreading thingy.
>>>>
>>>> The only difference which this change makes is the fact that the
>>>> affinity hint changes. Nothing else.
>>>>
>> Thanks for the detailed explanation.
>>
>> Before I posted this patch, I was doing some debugging on a setup where I
>> was observing some latency issues due to the iavf IRQs that were pinned on
>> the isolated CPUs.
>>
>> Based on some initial traces I had this impression that the affinity hint
>> or cpumask_local_spread was somehow playing a role in deciding the affinity
>> mask of these IRQs. Although, that does look incorrect after going through
>> your explanation.
>> For some reason, with a kernel that had this patch when I tried creating
>> VFs iavf IRQs always ended up on the HK CPUs.
>>
>> The reasoning for the above is still not very clear to me. I will investigate
>> this further to properly understand this behavior.
>>
>>
> After a little more digging, I found out why cpumask_local_spread change
> affects the general/initial smp_affinity for certain device IRQs.
>
> After the introduction of the commit:
>
>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>

Continuing the conversation about the above commit and adding Jesse.
I was trying to understand the problem that the commit message explains
"The default behavior of the kernel is somewhat undesirable as all
requested interrupts end up on CPU0 after registration.", I have also been
trying to reproduce this behavior without the patch but I failed in doing
so, maybe because I am missing something here.

@Jesse Can you please explain? FWIU IRQ affinity should be decided based on
the default affinity mask.

The problem with the commit is that when we overwrite the affinity mask
based on the hinting mask we completely ignore the default SMP affinity
mask. If we do want to overwrite the affinity based on the hint mask we
should atleast consider the default SMP affinity.

--
Thanks
Nitesh

2021-04-07 10:26:50

by Jesse Brandeburg

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

Continuing a thread from a bit ago...

Nitesh Narayan Lal wrote:

> > After a little more digging, I found out why cpumask_local_spread change
> > affects the general/initial smp_affinity for certain device IRQs.
> >
> > After the introduction of the commit:
> >
> > ??? e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
> >
>
> Continuing the conversation about the above commit and adding Jesse.
> I was trying to understand the problem that the commit message explains
> "The default behavior of the kernel is somewhat undesirable as all
> requested interrupts end up on CPU0 after registration.", I have also been
> trying to reproduce this behavior without the patch but I failed in doing
> so, maybe because I am missing something here.
>
> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
> the default affinity mask.

The original issue as seen, was that if you rmmod/insmod a driver
*without* irqbalance running, the default irq mask is -1, which means
any CPU. The older kernels (this issue was patched in 2014) used to use
that affinity mask, but the value programmed into all the interrupt
registers "actual affinity" would end up delivering all interrupts to
CPU0, and if the machine was under traffic load incoming when the
driver loaded, CPU0 would start to poll among all the different netdev
queues, all on CPU0.

The above then leads to the condition that the device is stuck polling
even if the affinity gets updated from user space, and the polling will
continue until traffic stops.

> The problem with the commit is that when we overwrite the affinity mask
> based on the hinting mask we completely ignore the default SMP affinity
> mask. If we do want to overwrite the affinity based on the hint mask we
> should atleast?consider the default SMP affinity.

Maybe the right thing is to fix which CPUs are passed in as the valid
mask, or make sure the kernel cross checks that what the driver asks
for is a "valid CPU"?

2021-04-07 21:14:09

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 4/6/21 1:22 PM, Jesse Brandeburg wrote:
> Continuing a thread from a bit ago...
>
> Nitesh Narayan Lal wrote:
>
>>> After a little more digging, I found out why cpumask_local_spread change
>>> affects the general/initial smp_affinity for certain device IRQs.
>>>
>>> After the introduction of the commit:
>>>
>>>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>
>> Continuing the conversation about the above commit and adding Jesse.
>> I was trying to understand the problem that the commit message explains
>> "The default behavior of the kernel is somewhat undesirable as all
>> requested interrupts end up on CPU0 after registration.", I have also been
>> trying to reproduce this behavior without the patch but I failed in doing
>> so, maybe because I am missing something here.
>>
>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>> the default affinity mask.

Thanks, Jesse for responding.

> The original issue as seen, was that if you rmmod/insmod a driver
> *without* irqbalance running, the default irq mask is -1, which means
> any CPU. The older kernels (this issue was patched in 2014) used to use
> that affinity mask, but the value programmed into all the interrupt
> registers "actual affinity" would end up delivering all interrupts to
> CPU0,

So does that mean the affinity mask for the IRQs was different wrt where
the IRQs were actually delivered?
Or, the affinity mask itself for the IRQs after rmmod, insmod was changed
to 0 instead of -1?

I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
mask before removing the kernel module and after doing rmmod+insmod
and didn't find any difference.

> and if the machine was under traffic load incoming when the
> driver loaded, CPU0 would start to poll among all the different netdev
> queues, all on CPU0.
>
> The above then leads to the condition that the device is stuck polling
> even if the affinity gets updated from user space, and the polling will
> continue until traffic stops.
>
>> The problem with the commit is that when we overwrite the affinity mask
>> based on the hinting mask we completely ignore the default SMP affinity
>> mask. If we do want to overwrite the affinity based on the hint mask we
>> should atleast consider the default SMP affinity.

For the issue where the IRQs don't follow the default_smp_affinity mask
because of this patch, the following are the steps by which it can be easily
reproduced with the latest linux kernel:

# Kernel
5.12.0-rc6+

# Other pramaeters in the cmdline
isolcpus=2-39,44-79 nohz=on nohz_full=2-39,44-79
rcu_nocbs=2-39,44-79

# cat /proc/irq/default_smp_affinity
0000,00000f00,00000003 [Corresponds to HK CPUs - 0, 1, 40, 41, 42 and 43]

# Create VFs and check IRQ affinity mask

/proc/irq/1423/iavf-ens1f1v3-TxRx-3
3
/proc/irq/1424/iavf-0000:3b:0b.0:mbx
0
40
42
/proc/irq/1425/iavf-ens1f1v8-TxRx-0
0
/proc/irq/1426/iavf-ens1f1v8-TxRx-1
1
/proc/irq/1427/iavf-ens1f1v8-TxRx-2
2
/proc/irq/1428/iavf-ens1f1v8-TxRx-3
3
...
/proc/irq/1475/iavf-ens1f1v15-TxRx-0
0
/proc/irq/1476/iavf-ens1f1v15-TxRx-1
1
/proc/irq/1477/iavf-ens1f1v15-TxRx-2
2
/proc/irq/1478/iavf-ens1f1v15-TxRx-3
3
/proc/irq/1479/iavf-0000:3b:0a.0:mbx
0
40
42
...
/proc/irq/240/iavf-ens1f1v3-TxRx-0
0
/proc/irq/248/iavf-ens1f1v3-TxRx-1
1
/proc/irq/249/iavf-ens1f1v3-TxRx-2
2


Trace dump:
----------
..
11551082:  NetworkManager-1734  [040]  8167.465719: vector_activate:    
            irq=1478 is_managed=0 can_reserve=1 reserve=0
11551090:  NetworkManager-1734  [040]  8167.465720: vector_alloc:
            irq=1478 vector=65 reserved=1 ret=0
11551093:  NetworkManager-1734  [040]  8167.465721: vector_update:      
            irq=1478 vector=65 cpu=42 prev_vector=0 prev_cpu=0
11551097:  NetworkManager-1734  [040]  8167.465721: vector_config:      
            irq=1478 vector=65 cpu=42 apicdest=0x00000200
11551357:  NetworkManager-1734  [040]  8167.465768: vector_alloc:        
            irq=1478 vector=46 reserved=0 ret=0

11551360:  NetworkManager-1734  [040]  8167.465769: vector_update:      
            irq=1478 vector=46 cpu=3 prev_vector=65 prev_cpu=42

11551364:  NetworkManager-1734  [040]  8167.465770: vector_config:      
            irq=1478 vector=46 cpu=3 apicdest=0x00040100
..

As we can see in the above trace the initial affinity for the IRQ 1478 was
correctly set as per the default_smp_affinity mask which includes CPU 42,
however, later on, it is updated with CPU3 which is returned from
cpumask_local_spread().

> Maybe the right thing is to fix which CPUs are passed in as the valid
> mask, or make sure the kernel cross checks that what the driver asks
> for is a "valid CPU"?
>

Sure, if we can still reproduce the problem that your patch was fixing then
maybe we can consider adding a new API like cpumask_local_spread_irq in
which we should consider deafult_smp_affinity mask as well before returning
the CPU.

--
Thanks
Nitesh

2021-04-08 18:51:47

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs


On 4/7/21 11:18 AM, Nitesh Narayan Lal wrote:
> On 4/6/21 1:22 PM, Jesse Brandeburg wrote:
>> Continuing a thread from a bit ago...
>>
>> Nitesh Narayan Lal wrote:
>>
>>>> After a little more digging, I found out why cpumask_local_spread change
>>>> affects the general/initial smp_affinity for certain device IRQs.
>>>>
>>>> After the introduction of the commit:
>>>>
>>>>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>>
>>> Continuing the conversation about the above commit and adding Jesse.
>>> I was trying to understand the problem that the commit message explains
>>> "The default behavior of the kernel is somewhat undesirable as all
>>> requested interrupts end up on CPU0 after registration.", I have also been
>>> trying to reproduce this behavior without the patch but I failed in doing
>>> so, maybe because I am missing something here.
>>>
>>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>>> the default affinity mask.
> Thanks, Jesse for responding.
>
>> The original issue as seen, was that if you rmmod/insmod a driver
>> *without* irqbalance running, the default irq mask is -1, which means
>> any CPU. The older kernels (this issue was patched in 2014) used to use
>> that affinity mask, but the value programmed into all the interrupt
>> registers "actual affinity" would end up delivering all interrupts to
>> CPU0,
> So does that mean the affinity mask for the IRQs was different wrt where
> the IRQs were actually delivered?
> Or, the affinity mask itself for the IRQs after rmmod, insmod was changed
> to 0 instead of -1?
>
> I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
> mask before removing the kernel module and after doing rmmod+insmod
> and didn't find any difference.
>
>> and if the machine was under traffic load incoming when the
>> driver loaded, CPU0 would start to poll among all the different netdev
>> queues, all on CPU0.
>>
>> The above then leads to the condition that the device is stuck polling
>> even if the affinity gets updated from user space, and the polling will
>> continue until traffic stops.
>>
>>> The problem with the commit is that when we overwrite the affinity mask
>>> based on the hinting mask we completely ignore the default SMP affinity
>>> mask. If we do want to overwrite the affinity based on the hint mask we
>>> should atleast consider the default SMP affinity.
> For the issue where the IRQs don't follow the default_smp_affinity mask
> because of this patch, the following are the steps by which it can be easily
> reproduced with the latest linux kernel:
>
> # Kernel
> 5.12.0-rc6+
>
> # Other pramaeters in the cmdline
> isolcpus=2-39,44-79 nohz=on nohz_full=2-39,44-79
> rcu_nocbs=2-39,44-79
>
> # cat /proc/irq/default_smp_affinity
> 0000,00000f00,00000003 [Corresponds to HK CPUs - 0, 1, 40, 41, 42 and 43]
>
> # Create VFs and check IRQ affinity mask
>
> /proc/irq/1423/iavf-ens1f1v3-TxRx-3
> 3
> /proc/irq/1424/iavf-0000:3b:0b.0:mbx
> 0
> 40
> 42
> /proc/irq/1425/iavf-ens1f1v8-TxRx-0
> 0
> /proc/irq/1426/iavf-ens1f1v8-TxRx-1
> 1
> /proc/irq/1427/iavf-ens1f1v8-TxRx-2
> 2
> /proc/irq/1428/iavf-ens1f1v8-TxRx-3
> 3
> ...
> /proc/irq/1475/iavf-ens1f1v15-TxRx-0
> 0
> /proc/irq/1476/iavf-ens1f1v15-TxRx-1
> 1
> /proc/irq/1477/iavf-ens1f1v15-TxRx-2
> 2
> /proc/irq/1478/iavf-ens1f1v15-TxRx-3
> 3
> /proc/irq/1479/iavf-0000:3b:0a.0:mbx
> 0
> 40
> 42
> ...
> /proc/irq/240/iavf-ens1f1v3-TxRx-0
> 0
> /proc/irq/248/iavf-ens1f1v3-TxRx-1
> 1
> /proc/irq/249/iavf-ens1f1v3-TxRx-2
> 2
>
>
> Trace dump:
> ----------
> ..
> 11551082:  NetworkManager-1734  [040]  8167.465719: vector_activate:    
>             irq=1478 is_managed=0 can_reserve=1 reserve=0
> 11551090:  NetworkManager-1734  [040]  8167.465720: vector_alloc:
>             irq=1478 vector=65 reserved=1 ret=0
> 11551093:  NetworkManager-1734  [040]  8167.465721: vector_update:      
>             irq=1478 vector=65 cpu=42 prev_vector=0 prev_cpu=0
> 11551097:  NetworkManager-1734  [040]  8167.465721: vector_config:      
>             irq=1478 vector=65 cpu=42 apicdest=0x00000200
> 11551357:  NetworkManager-1734  [040]  8167.465768: vector_alloc:        
>             irq=1478 vector=46 reserved=0 ret=0
>
> 11551360:  NetworkManager-1734  [040]  8167.465769: vector_update:      
>             irq=1478 vector=46 cpu=3 prev_vector=65 prev_cpu=42
>
> 11551364:  NetworkManager-1734  [040]  8167.465770: vector_config:      
>             irq=1478 vector=46 cpu=3 apicdest=0x00040100
> ..
>
> As we can see in the above trace the initial affinity for the IRQ 1478 was
> correctly set as per the default_smp_affinity mask which includes CPU 42,
> however, later on, it is updated with CPU3 which is returned from
> cpumask_local_spread().
>
>> Maybe the right thing is to fix which CPUs are passed in as the valid
>> mask, or make sure the kernel cross checks that what the driver asks
>> for is a "valid CPU"?
>>
> Sure, if we can still reproduce the problem that your patch was fixing then
> maybe we can consider adding a new API like cpumask_local_spread_irq in
> which we should consider deafult_smp_affinity mask as well before returning
> the CPU.
>

Didn't realize that netdev ml was not included, so adding that.

--
Nitesh