2020-09-28 18:40:26

by Nitesh Narayan Lal

[permalink] [raw]
Subject: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs

This is a follow-up posting for "[PATCH v3 0/4] isolation: limit msix vectors
to housekeeping CPUs".

Issue
=====
With the current implementation device drivers while creating their MSIX        
vectors only take num_online_cpus() into consideration which works quite well  
for a non-RT environment, but in an RT environment that has a large number of  
isolated CPUs and very few housekeeping CPUs this could lead to a problem.    
The problem will be triggered when something like tuned will try to move all    
the IRQs from isolated CPUs to the limited number of housekeeping CPUs to      
prevent interruptions for a latency-sensitive workload that will be running
on the isolated CPUs. This failure is caused because of the per CPU vector        
limitation.                                                                    


Proposed Fix
============
In this patch-set, the following changes are proposed:
- A generic API housekeeping_num_online_cpus() which is meant to return the
online housekeeping CPUs based on the hk_flag passed by the caller.
- i40e: Specifically for the i40e driver the num_online_cpus() used in
  i40e_init_msix() to calculate numbers msix vectors is replaced with the
above defined API that returns the online housekeeping CPUs that are meant
to handle managed IRQ jobs.
- pci_alloc_irq_vector(): With the help of housekeeping_num_online_cpus() the
max_vecs passed in pci_alloc_irq_vector() is restricted only to the online
  housekeeping CPUs (designated for managed IRQ jobs) strictly in an RT
environment. However, if the min_vecs exceeds the online housekeeping CPUs,
max_vecs is limited based on the min_vecs instead.


Future Work
===========

- In the previous upstream discussion [1], it was decided that it would be
better if we can have a generic framework that can be consumed by all the
drivers to fix this kind of issue. However, it will be a long term work,
and since there are RT workloads that are getting impacted by the reported
issue. We agreed upon the proposed per-device approach for now.


Testing
=======
Functionality:
- To test that the issue is resolved with i40e change I added a tracepoint
  in i40e_init_msix() to find the number of CPUs derived for vector creation
  with and without tuned's realtime-virtual-host profile. As per expectation
  with the profile applied I was only getting the number of housekeeping CPUs
  and all available CPUs without it. Another way to verify is by checking
the number of IRQs that get created corresponding to a impacted device.
Similarly did a few more tests with different modes eg with only nohz_full,
isolcpus etc.

Performance:
- To analyze the performance impact I have targetted the change introduced in
  pci_alloc_irq_vectors() and compared the results against a vanilla kernel
  (5.9.0-rc3) results.

  Setup Information:
  + I had a couple of 24-core machines connected back to back via a couple of
    mlx5 NICs and I analyzed the average bitrate for server-client TCP and
UDP transmission via iperf.
  + To minimize the Bitrate variation of iperf TCP and UDP stream test I have
    applied the tuned's network-throughput profile and disabled HT.
 Test Information:
  + For the environment that had no isolated CPUs:
    I have tested with single stream and 24 streams (same as that of online
    CPUs).
  + For the environment that had 20 isolated CPUs:
    I have tested with single stream, 4 streams (same as that the number of
    housekeeping) and 24 streams (same as that of online CPUs).

 Results:
  # UDP Stream Test:
  + There was no degradation observed in UDP stream tests in both
environments. (With isolated CPUs and without isolated CPUs after the
introduction of the patches).
  # TCP Stream Test - No isolated CPUs:
  + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
  + Multiple Stream (4)  - Average degradation of around 5-6%
  + Multiple Stream (24) - Average degradation of around 2-3%
  + Single Stream        - Even on a vanilla kernel the Bitrate observed
for a TCP single stream test seem to vary
significantly across different runs (eg. the %
variation between the best and the worst case on
a vanilla kernel was around 8-10%). A similar
variation was observed with the kernel that
included my patches. No additional degradation
was observed.

If there are any suggestions for more performance evaluation, I would
be happy to discuss/perform them.


Changes from v3[2]:
==================
- Moved the logic to limit the max_vecs from pci_alloc_irq_vectors() to
pci_alloc_irq_vectors_affinity() as that's the exported interface and
drivers using this API also need to be fixed (suggestion from Bjorn Helgaas).

Changes from v2[3]:
==================
- Renamed hk_num_online_cpus() with housekeeping_num_online_cpus() to keep
the naming convention consistent (based on a suggestion from Peter
Zijlstra and Frederic Weisbecker).
- Added an argument "enum hk_flags" to the housekeeping_num_online_cpus() API
to make it more usable in different use-cases (based on a suggestion from
Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
from Bjorn Helgaas).
- Modified patch commit messages and comment based on Bjorn Helgaas's
suggestion.

Changes from v1[4]:
==================
Patch1:                                                                      
- Replaced num_houskeeeping_cpus() with hk_num_online_cpus() and started
using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
of online housekeeping CPUs. This is based on Frederic Weisbecker's
suggestion.          
- Since the hk_num_online_cpus() is self-explanatory, got rid of            
  the comment that was added previously.                                    
Patch2:                                                                      
- Added a new patch that is meant to enable managed IRQ isolation for
nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.             
Patch4 (PCI):                                                                
- For cases where the min_vecs exceeds the online housekeeping CPUs, instead
of skipping modification to max_vecs, started restricting it based on the
min_vecs. This is based on a suggestion from Marcelo Tosatti.                                                                    


[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lore.kernel.org/lkml/[email protected]/


Nitesh Narayan Lal (4):
sched/isolation: API to get number of housekeeping CPUs
sched/isolation: Extend nohz_full to isolate managed IRQs
i40e: Limit msix vectors to housekeeping CPUs
PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
drivers/pci/msi.c | 18 ++++++++++++++++++
include/linux/sched/isolation.h | 9 +++++++++
kernel/sched/isolation.c | 2 +-
4 files changed, 30 insertions(+), 2 deletions(-)

--



2020-09-28 18:41:09

by Nitesh Narayan Lal

[permalink] [raw]
Subject: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

Extend nohz_full feature set to include isolation from managed IRQS. This
is required specifically for setups that only uses nohz_full and still
requires isolation for maintaining lower latency for the listed CPUs.

Suggested-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Nitesh Narayan Lal <[email protected]>
---
kernel/sched/isolation.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..9df9598a9e39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
unsigned int flags;

flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
- HK_FLAG_MISC | HK_FLAG_KTHREAD;
+ HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;

return housekeeping_setup(str, flags);
}
--
2.18.2

2020-10-01 15:51:57

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs

On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
> Nitesh Narayan Lal (4):
> sched/isolation: API to get number of housekeeping CPUs
> sched/isolation: Extend nohz_full to isolate managed IRQs
> i40e: Limit msix vectors to housekeeping CPUs
> PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
>
> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
> drivers/pci/msi.c | 18 ++++++++++++++++++
> include/linux/sched/isolation.h | 9 +++++++++
> kernel/sched/isolation.c | 2 +-
> 4 files changed, 30 insertions(+), 2 deletions(-)

Acked-by: Frederic Weisbecker <[email protected]>

Peter, if you're ok with the set, I guess this should go through
the scheduler tree?

Thanks.

2020-10-08 21:56:49

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs


On 10/1/20 11:49 AM, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
>> Nitesh Narayan Lal (4):
>> sched/isolation: API to get number of housekeeping CPUs
>> sched/isolation: Extend nohz_full to isolate managed IRQs
>> i40e: Limit msix vectors to housekeeping CPUs
>> PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
>>
>> drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
>> drivers/pci/msi.c | 18 ++++++++++++++++++
>> include/linux/sched/isolation.h | 9 +++++++++
>> kernel/sched/isolation.c | 2 +-
>> 4 files changed, 30 insertions(+), 2 deletions(-)
> Acked-by: Frederic Weisbecker <[email protected]>
>
> Peter, if you're ok with the set, I guess this should go through
> the scheduler tree?
>
> Thanks.

Hi Peter,

I wanted follow up to check if you have any concerns/suggestions that I
should address in this patch-set before this can be picked?

--
Thanks
Nitesh


Attachments:
signature.asc (849.00 B)
OpenPGP digital signature

2020-10-23 15:59:46

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
> > Extend nohz_full feature set to include isolation from managed IRQS. This
>
> So you say it's for managed-irqs, the feature is actually called
> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.
>
> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
> fine and don't need help at at...
>
> > is required specifically for setups that only uses nohz_full and still
> > requires isolation for maintaining lower latency for the listed CPUs.
> >
> > Suggested-by: Frederic Weisbecker <[email protected]>

Ah and yes there is this tag :-p

So that's my bad, I really thought this thing was about managed IRQ.
The problem is that I can't find a single documentation about them so I'm
too clueless on that matter.

Thanks.

> > Signed-off-by: Nitesh Narayan Lal <[email protected]>
> > ---
> > kernel/sched/isolation.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 5a6ea03f9882..9df9598a9e39 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
> > unsigned int flags;
> >
> > flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
> > - HK_FLAG_MISC | HK_FLAG_KTHREAD;
> > + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
> >
> > return housekeeping_setup(str, flags);
> > }
> > --
> > 2.18.2
> >

2020-10-23 18:13:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
> Extend nohz_full feature set to include isolation from managed IRQS. This

So you say it's for managed-irqs, the feature is actually called
MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.

Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
fine and don't need help at at...

> is required specifically for setups that only uses nohz_full and still
> requires isolation for maintaining lower latency for the listed CPUs.
>
> Suggested-by: Frederic Weisbecker <[email protected]>
> Signed-off-by: Nitesh Narayan Lal <[email protected]>
> ---
> kernel/sched/isolation.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 5a6ea03f9882..9df9598a9e39 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
> unsigned int flags;
>
> flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
> - HK_FLAG_MISC | HK_FLAG_KTHREAD;
> + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>
> return housekeeping_setup(str, flags);
> }
> --
> 2.18.2
>

2020-10-23 18:16:44

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs


On 10/23/20 9:29 AM, Frederic Weisbecker wrote:
> On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
>>> Extend nohz_full feature set to include isolation from managed IRQS. This
>> So you say it's for managed-irqs, the feature is actually called
>> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.
>>
>> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
>> fine and don't need help at at...
>>
>>> is required specifically for setups that only uses nohz_full and still
>>> requires isolation for maintaining lower latency for the listed CPUs.
>>>
>>> Suggested-by: Frederic Weisbecker <[email protected]>
> Ah and yes there is this tag :-p
>
> So that's my bad, I really thought this thing was about managed IRQ.
> The problem is that I can't find a single documentation about them so I'm
> too clueless on that matter.

I am also confused with this terminology.
So my bad for not taking care of this.

--
Thanks
Nitesh


Attachments:
signature.asc (849.00 B)
OpenPGP digital signature

2020-10-23 18:17:12

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs


On 10/23/20 9:25 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
>> Extend nohz_full feature set to include isolation from managed IRQS. This
> So you say it's for managed-irqs, the feature is actually called
> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.

Ah my bad! I should replace the managed IRQS with MANAGED_IRQ.
I can send another version with this fixed.

>
> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
> fine and don't need help at at...

Since the introduction of
"genirq, sched/isolation: Isolate from handling managed interrupts"

Within irq_do_set_affinity(), it is ensured that for managed intrrupts as
well, the isolated CPUs are removed from the affinity mask.

Hence, IMHO before this change managed interrupts were affecting the
isolated CPUs.

My intent of having this change is to basically allow isolation for
nohz_full CPUs even when we don't have something like isolcpus.
Does that make sense?


>
>> is required specifically for setups that only uses nohz_full and still
>> requires isolation for maintaining lower latency for the listed CPUs.
>>
>> Suggested-by: Frederic Weisbecker <[email protected]>
>> Signed-off-by: Nitesh Narayan Lal <[email protected]>
>> ---
>> kernel/sched/isolation.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 5a6ea03f9882..9df9598a9e39 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
>> unsigned int flags;
>>
>> flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
>> - HK_FLAG_MISC | HK_FLAG_KTHREAD;
>> + HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>>
>> return housekeeping_setup(str, flags);
>> }
>> --
>> 2.18.2
>>
--
Nitesh


Attachments:
signature.asc (849.00 B)
OpenPGP digital signature