2020-09-25 18:28:50

by Nitesh Narayan Lal

Subject: [PATCH v3 0/4] isolation: limit msix vectors to housekeeping CPUs

This is a follow-up posting for "[PATCH v2 0/4] isolation: limit msix vectors
based on housekeeping CPUs".

Issue
=====
With the current implementation, device drivers only take num_online_cpus()
into consideration when creating their MSIX vectors. This works quite well in
a non-RT environment, but in an RT environment with a large number of isolated
CPUs and very few housekeeping CPUs it can lead to a problem. The problem is
triggered when something like tuned tries to move all the IRQs from the
isolated CPUs to the limited number of housekeeping CPUs, to prevent
interruptions for a latency-sensitive workload running on the isolated CPUs.
The move fails because of the per-CPU vector limit.


Proposed Fix
============
In this patch-set, the following changes are proposed:
- A generic API, housekeeping_num_online_cpus(), that returns the number of
  online housekeeping CPUs based on the hk_flag passed by the caller.
- i40e: In i40e_init_msix(), the num_online_cpus() used to calculate the
  number of MSIX vectors is replaced with the above API, so that only the
  online housekeeping CPUs meant to handle managed-IRQ jobs are counted
  (see the usage sketch below).
- pci_alloc_irq_vectors(): With the help of housekeeping_num_online_cpus(),
  the max_vecs passed to pci_alloc_irq_vectors() is restricted to the number
  of online housekeeping CPUs (designated for managed-IRQ jobs), but only in
  an RT environment. If min_vecs exceeds the number of online housekeeping
  CPUs, max_vecs is instead limited based on min_vecs.
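
For illustration, the driver-side usage boils down to something like the
sketch below. This is only an outline: the helper name and the vecs_wanted
parameter are placeholders, not the actual i40e code (the real change in this
series is in i40e_init_msix()).

	/*
	 * Illustrative only: clamp a driver's desired MSIX vector count to
	 * the online CPUs designated for managed-IRQ housekeeping.
	 */
	static unsigned int foo_msix_vec_count(unsigned int vecs_wanted)
	{
		unsigned int hk_cpus;

		hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
		return min_t(unsigned int, vecs_wanted, hk_cpus);
	}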


Future Work
===========

- In the previous upstream discussion [1], it was decided that it would be
  better to have a generic framework that all drivers can consume to fix this
  kind of issue. However, that is long-term work, and since there are RT
  workloads that are being impacted by the reported issue right now, we agreed
  on the proposed per-device approach for the time being.


Testing
=======
Functionality:
- To verify that the issue is resolved by the i40e change, I added a
  tracepoint in i40e_init_msix() to report the number of CPUs used for vector
  creation, with and without tuned's realtime-virtual-host profile. As
  expected, with the profile applied I only got the number of housekeeping
  CPUs, and all online CPUs without it. Another way to verify is to check the
  number of IRQs created for an impacted device. I similarly ran a few more
  tests in different modes, e.g. with only nohz_full, isolcpus, etc.

Performance:
- To analyze the performance impact, I targeted the change introduced in
  pci_alloc_irq_vectors() and compared the results against a vanilla kernel
  (5.9.0-rc3).

  Setup Information:
  + Two 24-core machines connected back to back via a pair of mlx5 NICs;
    I analyzed the average bitrate for server-client TCP and UDP transmission
    via iperf.
  + To minimize the bitrate variation of the iperf TCP and UDP stream tests,
    I applied tuned's network-throughput profile and disabled HT.

  Test Information:
  + For the environment that had no isolated CPUs:
    I tested with a single stream and with 24 streams (same as the number of
    online CPUs).
  + For the environment that had 20 isolated CPUs:
    I tested with a single stream, 4 streams (same as the number of
    housekeeping CPUs), and 24 streams (same as the number of online CPUs).

  Results:
  # UDP Stream Test:
  + No degradation was observed in the UDP stream tests in either environment
    (with or without isolated CPUs) after the introduction of the patches.
  # TCP Stream Test - No isolated CPUs:
  + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
  + Multiple Stream (4)  - Average degradation of around 5-6%
  + Multiple Stream (24) - Average degradation of around 2-3%
  + Single Stream        - Even on a vanilla kernel, the bitrate observed for
                           a TCP single-stream test varies significantly
                           across runs (e.g. the variation between the best
                           and the worst case on a vanilla kernel was around
                           8-10%). A similar variation was observed with the
                           kernel that included my patches; no additional
                           degradation was observed.

If there are any suggestions for more performance evaluation, I would
be happy to discuss/perform them.


Changes from v2[2]:
===================
- Renamed hk_num_online_cpus() to housekeeping_num_online_cpus() to keep the
  naming convention consistent (based on a suggestion from Peter Zijlstra and
  Frederic Weisbecker).
- Added an "enum hk_flags" argument to the housekeeping_num_online_cpus() API
  to make it usable in different use-cases (based on a suggestion from
  Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
  from Bjorn Helgaas).
- Modified patch commit messages and comments based on Bjorn Helgaas's
  suggestions.

Changes from v1[3]:
===================
Patch1:
- Replaced num_housekeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
  of online housekeeping CPUs. This is based on Frederic Weisbecker's
  suggestion.
- Since hk_num_online_cpus() is self-explanatory, got rid of the comment that
  was added previously.
Patch2:
- Added a new patch that is meant to enable managed IRQ isolation for
  nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.
Patch4 (PCI):
- For cases where min_vecs exceeds the number of online housekeeping CPUs,
  instead of skipping the modification of max_vecs, started restricting it
  based on min_vecs. This is based on a suggestion from Marcelo Tosatti.


[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/


Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 include/linux/pci.h                         | 17 +++++++++++++++++
 include/linux/sched/isolation.h             |  9 +++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 29 insertions(+), 2 deletions(-)

--



2020-09-25 18:31:26

by Nitesh Narayan Lal

Subject: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal <[email protected]>
---
include/linux/pci.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 835530605c0d..a7b10240b778 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/resource_ext.h>
+#include <linux/sched/isolation.h>
 #include <uapi/linux/pci.h>
 
 #include <linux/pci_ids.h>
@@ -1797,6 +1798,22 @@ static inline int
 pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 		      unsigned int max_vecs, unsigned int flags)
 {
+	unsigned int hk_cpus;
+
+	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+	/*
+	 * If we have isolated CPUs for use by real-time tasks, to keep the
+	 * latency overhead to a minimum, device-specific IRQ vectors are moved
+	 * to the housekeeping CPUs from the userspace by changing their
+	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
+	 * running out of IRQ vectors.
+	 */
+	if (hk_cpus < num_online_cpus()) {
+		if (hk_cpus < min_vecs)
+			max_vecs = min_vecs;
+		else if (hk_cpus < max_vecs)
+			max_vecs = hk_cpus;
+	}
 	return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
 					      NULL);
 }
--
2.18.2

2020-09-25 18:31:54

by Nitesh Narayan Lal

Subject: [PATCH v3 1/4] sched/isolation: API to get number of housekeeping CPUs

Introduce a new API housekeeping_num_online_cpus(), that can be used to
retrieve the number of online housekeeping CPUs based on the housekeeping
flag passed by the caller.

Some of the consumers of this API are device drivers that previously relied
only on num_online_cpus() to determine the number of MSIX vectors to create.
In real-time environments, to minimize interruptions to the isolated CPUs, all
device-specific IRQ vectors are often moved to the housekeeping CPUs, and
having excess vectors could cause the housekeeping CPUs to run out of IRQ
vectors.

Signed-off-by: Nitesh Narayan Lal <[email protected]>
---
include/linux/sched/isolation.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..e021b1846c1d 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -57,4 +57,13 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
 	return true;
 }
 
+static inline unsigned int housekeeping_num_online_cpus(enum hk_flags flags)
+{
+#ifdef CONFIG_CPU_ISOLATION
+	if (static_branch_unlikely(&housekeeping_overridden))
+		return cpumask_weight(housekeeping_cpumask(flags));
+#endif
+	return num_online_cpus();
+}
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
--
2.18.2

2020-09-25 18:32:08

by Nitesh Narayan Lal

Subject: [PATCH v3 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs

Extend the nohz_full feature set to include isolation from managed IRQs.
This is required specifically for setups that only use nohz_full and still
require isolation to maintain lower latency for the listed CPUs.

Suggested-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Nitesh Narayan Lal <[email protected]>
---
kernel/sched/isolation.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..9df9598a9e39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
 	unsigned int flags;
 
 	flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
-		HK_FLAG_MISC | HK_FLAG_KTHREAD;
+		HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
 
 	return housekeeping_setup(str, flags);
 }
--
2.18.2

2020-09-25 20:34:59

by Bjorn Helgaas

Subject: Re: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

On Fri, Sep 25, 2020 at 02:26:54PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
>
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
>
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
>
> Signed-off-by: Nitesh Narayan Lal <[email protected]>
> ---
> include/linux/pci.h | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 835530605c0d..a7b10240b778 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -38,6 +38,7 @@
> #include <linux/interrupt.h>
> #include <linux/io.h>
> #include <linux/resource_ext.h>
> +#include <linux/sched/isolation.h>
> #include <uapi/linux/pci.h>
>
> #include <linux/pci_ids.h>
> @@ -1797,6 +1798,22 @@ static inline int
> pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> unsigned int max_vecs, unsigned int flags)
> {
> + unsigned int hk_cpus;
> +
> + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);

Add blank line here before the block comment.

> + /*
> + * If we have isolated CPUs for use by real-time tasks, to keep the
> + * latency overhead to a minimum, device-specific IRQ vectors are moved
> + * to the housekeeping CPUs from the userspace by changing their
> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> + * running out of IRQ vectors.
> + */
> + if (hk_cpus < num_online_cpus()) {
> + if (hk_cpus < min_vecs)
> + max_vecs = min_vecs;
> + else if (hk_cpus < max_vecs)
> + max_vecs = hk_cpus;
> + }

It seems like you'd want to do this inside
pci_alloc_irq_vectors_affinity() since that's an exported interface,
and drivers that use it will bypass the limiting you're doing here.

> return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
> NULL);
> }
> --
> 2.18.2
>

2020-09-25 21:42:14

by Nitesh Narayan Lal

Subject: Re: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs


On 9/25/20 4:23 PM, Bjorn Helgaas wrote:
> On Fri, Sep 25, 2020 at 02:26:54PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal <[email protected]>
>> ---
>> include/linux/pci.h | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 835530605c0d..a7b10240b778 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -38,6 +38,7 @@
>> #include <linux/interrupt.h>
>> #include <linux/io.h>
>> #include <linux/resource_ext.h>
>> +#include <linux/sched/isolation.h>
>> #include <uapi/linux/pci.h>
>>
>> #include <linux/pci_ids.h>
>> @@ -1797,6 +1798,22 @@ static inline int
>> pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>> unsigned int max_vecs, unsigned int flags)
>> {
>> + unsigned int hk_cpus;
>> +
>> + hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> Add blank line here before the block comment.
>
>> + /*
>> + * If we have isolated CPUs for use by real-time tasks, to keep the
>> + * latency overhead to a minimum, device-specific IRQ vectors are moved
>> + * to the housekeeping CPUs from the userspace by changing their
>> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> + * running out of IRQ vectors.
>> + */
>> + if (hk_cpus < num_online_cpus()) {
>> + if (hk_cpus < min_vecs)
>> + max_vecs = min_vecs;
>> + else if (hk_cpus < max_vecs)
>> + max_vecs = hk_cpus;
>> + }
> It seems like you'd want to do this inside
> pci_alloc_irq_vectors_affinity() since that's an exported interface,
> and drivers that use it will bypass the limiting you're doing here.

Good point, a few drivers do call pci_alloc_irq_vectors_affinity() directly.
I took a quick look, and it seems I may also have to take the pre and the
post vectors into consideration.

>> return pci_alloc_irq_vectors_affinity(dev, min_vecs, max_vecs, flags,
>> NULL);
>> }
>> --
>> 2.18.2
>>
--
Nitesh



2020-09-25 23:20:37

by Nitesh Narayan Lal

Subject: Re: [PATCH v3 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs


On 9/25/20 5:38 PM, Nitesh Narayan Lal wrote:
> On 9/25/20 4:23 PM, Bjorn Helgaas wrote:

[...]

>>> + /*
>>> + * If we have isolated CPUs for use by real-time tasks, to keep the
>>> + * latency overhead to a minimum, device-specific IRQ vectors are moved
>>> + * to the housekeeping CPUs from the userspace by changing their
>>> + * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>>> + * running out of IRQ vectors.
>>> + */
>>> + if (hk_cpus < num_online_cpus()) {
>>> + if (hk_cpus < min_vecs)
>>> + max_vecs = min_vecs;
>>> + else if (hk_cpus < max_vecs)
>>> + max_vecs = hk_cpus;
>>> + }
>> It seems like you'd want to do this inside
>> pci_alloc_irq_vectors_affinity() since that's an exported interface,
>> and drivers that use it will bypass the limiting you're doing here.
> Good point, few drivers directly use this.
> I took a quick look and it seems I may also have to take the pre and the
> post vectors into consideration.
>

It seems my initial interpretation was incorrect: the reserved vectors
(pre + post) should always be less than min_vecs.
So, FWIU, we should be fine limiting max_vecs.

I can make this change and do a repost.
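
Roughly, the idea would be to hoist the same clamping into
pci_alloc_irq_vectors_affinity() itself in drivers/pci/msi.c, before max_vecs
is used, along the lines of the untested sketch below (the exact placement in
that function still needs to be worked out):

	unsigned int hk_cpus;

	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);

	/*
	 * Same rationale as in the header: if the IRQs end up being moved to
	 * the housekeeping CPUs from userspace, don't request more vectors
	 * than those CPUs can absorb, while still honouring the driver's
	 * minimum.
	 */
	if (hk_cpus < num_online_cpus()) {
		if (hk_cpus < min_vecs)
			max_vecs = min_vecs;
		else if (hk_cpus < max_vecs)
			max_vecs = hk_cpus;
	}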

Do you have any other concerns with this patch or with any of the other
patches?

[...]

--
Nitesh

