2015-12-23 10:18:18

by Daniel J Blueman

Subject: [PATCH 1/2] PCI: Add mechanism to find topologically near cores

Some devices (e.g. ixgbe) make assumptions about device-to-core locality when
specifying interrupt locality hints and allocate starting from core 0.
Moreover, interrupts may not be routable to distant NUMA nodes due to the
8-bit APIC ID space limitation.

Provide a mechanism drivers can use to find cores with reasonable locality
to a device; use the existing precedent of RECLAIM_DISTANCE (30) as the
distance cutoff, wrapping the offset around the set of eligible cores.

Signed-off-by: Daniel J Blueman <[email protected]>
---
drivers/pci/pci.c | 15 +++++++++++++++
include/linux/pci.h | 1 +
2 files changed, 16 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 314db8c..d5535d1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4833,6 +4833,22 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
}
EXPORT_SYMBOL(pci_fixup_cardbus);

+int cpu_near_dev(const struct pci_dev *pdev, unsigned offset)
+{
+	/* Start search from node device is on for optimal locality */
+	int localnode = pcibus_to_node(pdev->bus);
+	int cpu = cpumask_first(cpumask_of_node(localnode));
+
+	while (offset--) {
+		do {
+			cpu = (cpu + 1) % nr_cpu_ids;
+		} while (!cpu_online(cpu) || node_distance(cpu_to_node(cpu),
+				localnode) > RECLAIM_DISTANCE);
+	}
+
+	return cpu;
+}
+
static int __init pci_setup(char *str)
{
while (str) {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6ae25aa..f7491bd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -842,6 +842,7 @@ void pci_stop_root_bus(struct pci_bus *bus);
void pci_remove_root_bus(struct pci_bus *bus);
void pci_setup_cardbus(struct pci_bus *bus);
void pci_sort_breadthfirst(void);
+int cpu_near_dev(const struct pci_dev *pdev, unsigned offset);
#define dev_is_pci(d) ((d)->bus == &pci_bus_type)
#define dev_is_pf(d) ((dev_is_pci(d) ? to_pci_dev(d)->is_physfn : false))
#define dev_num_vf(d) ((dev_is_pci(d) ? pci_num_vf(to_pci_dev(d)) : 0))
--
2.5.0
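
The search loop in cpu_near_dev() can be modelled in user space to see how the offset wraps. What follows is a minimal sketch, not the kernel code: the topology tables (`cpu_node`, `cpu_is_online`, `distance`) are made up, and `first_cpu_of_node()` and `model_cpu_near_node()` are hypothetical stand-ins for cpumask_first(cpumask_of_node()) and for the patched function.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS          8
#define NR_NODES         4
#define RECLAIM_DISTANCE 30

/* Hypothetical topology: two CPUs per node, CPU 3 offline. */
static const int cpu_node[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };
static const bool cpu_is_online[NR_CPUS] = {
	true, true, true, false, true, true, true, true
};
/* Nodes 0/1 and nodes 2/3 are close to each other, the pairs are far apart. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Stand-in for cpumask_first(cpumask_of_node(node)); -1 if the node is empty. */
static int first_cpu_of_node(int node)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_node[cpu] == node)
			return cpu;
	return -1;
}

/*
 * Same loop shape as the patch: step forward 'offset' times, skipping CPUs
 * that are offline or further away than RECLAIM_DISTANCE, wrapping at NR_CPUS.
 * Like the original loop, this assumes at least one online CPU is within the
 * cutoff; otherwise a non-zero offset would never terminate.
 */
int model_cpu_near_node(int localnode, unsigned offset)
{
	int cpu = first_cpu_of_node(localnode);

	assert(cpu >= 0);
	while (offset--) {
		do {
			cpu = (cpu + 1) % NR_CPUS;
		} while (!cpu_is_online[cpu] ||
			 distance[cpu_node[cpu]][localnode] > RECLAIM_DISTANCE);
	}
	return cpu;
}
```

With this made-up topology, offsets 0 to 3 on node 2 select CPUs 4 to 7, and offset 4 wraps back to CPU 4 without ever selecting a CPU on nodes 0 or 1, whose distance of 40 exceeds the cutoff.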


2015-12-23 10:02:55

by Daniel J Blueman

Subject: [PATCH 2/2] ixgbe: Use core to device locality interface

Rather than assuming cores starting from 0 are local to the Ethernet
device, use the newly introduced interface to find near cores.

Not only does this improve performance by spreading interrupts across near
NUMA nodes, it also avoids assigning cores on distant NUMA nodes, which aren't
reachable by device interrupts due to the 8-bit APIC ID limitation.

With Numascale NumaConnect2 systems with Intel ixgbe cards on
non-primary PCI domains, all ixgbe NICs would previously revector
interrupts to cores 0 to 63 (cores 0 to 47 would be considered
near the primary PCI domain). Now, cores 48 to 95 are used, increasing
performance and addressing interrupt delivery issues:

do_IRQ: 79.180 No irq handler for vector (irq -1)
do_IRQ: 78.42 No irq handler for vector (irq -1)
do_IRQ: 71.172 No irq handler for vector (irq -1)
do_IRQ: 70.236 No irq handler for vector (irq -1)
do_IRQ: 69.109 No irq handler for vector (irq -1)
do_IRQ: 68.189 No irq handler for vector (irq -1)
do_IRQ: 72.92 No irq handler for vector (irq -1)
do_IRQ: 73.235 No irq handler for vector (irq -1)
do_IRQ: 66.185 No irq handler for vector (irq -1)
do_IRQ: 67.62 No irq handler for vector (irq -1)
do_IRQ: 197 callbacks suppressed

Signed-off-by: Daniel J Blueman <[email protected]>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f3168bc..12c4ce1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -817,10 +817,8 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
if ((tcs <= 1) && !(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)) {
u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
if (rss_i > 1 && adapter->atr_sample_rate) {
- if (cpu_online(v_idx)) {
- cpu = v_idx;
- node = cpu_to_node(cpu);
- }
+ cpu = cpu_near_dev(adapter->pdev, v_idx);
+ node = cpu_to_node(cpu);
}
}

--
2.5.0
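
The effect described in the commit message can be illustrated with a reduced model of the vector-to-core mapping before and after the change. This is only a sketch under simplifying assumptions that are not taken from the patch: all cores are online, and the cores near the NIC are exactly 48 to 95. `NR_CPUS`, `NEAR_FIRST`, `NEAR_COUNT` and both function names are hypothetical.

```c
#include <assert.h>

#define NR_CPUS    96 /* assumed system size for this illustration */
#define NEAR_FIRST 48 /* assumed first core near the NIC */
#define NEAR_COUNT 48 /* assumed number of near cores (48..95) */

/* Old behaviour: the vector index is used directly as the core number. */
int old_vector_to_cpu(unsigned v_idx)
{
	assert(v_idx < NR_CPUS);
	return (int)v_idx;
}

/*
 * New behaviour, reduced: start at the first near core and advance v_idx
 * steps within the near set, wrapping at its end.
 */
int new_vector_to_cpu(unsigned v_idx)
{
	return NEAR_FIRST + (int)(v_idx % NEAR_COUNT);
}
```

Under these assumptions, 64 queue vectors used to land on cores 0 to 63; in the model they land on cores 48 to 95, with vectors 48 to 63 wrapping back to cores 48 to 63.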

2015-12-23 10:41:03

by kernel test robot

Subject: Re: [PATCH 1/2] PCI: Add mechanism to find topologically near cores

Hi Daniel,

[auto build test WARNING on pci/next]
[also build test WARNING on v4.4-rc6 next-20151223]

url: https://github.com/0day-ci/linux/commits/Daniel-J-Blueman/PCI-Add-mechanism-to-find-topologically-near-cores/20151223-181947
base: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: s390-allyesconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=s390

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.

All warnings (new ones prefixed by >>):

In file included from include/linux/smp.h:12:0,
from arch/s390/include/asm/spinlock.h:12,
from include/linux/spinlock.h:87,
from include/linux/rcupdate.h:38,
from include/linux/idr.h:18,
from include/linux/kernfs.h:14,
from include/linux/sysfs.h:15,
from include/linux/kobject.h:21,
from include/linux/of.h:21,
from drivers/pci/pci.c:13:
drivers/pci/pci.c: In function 'cpu_near_dev':
>> include/linux/cpumask.h:174:9: warning: array subscript is below array bounds [-Warray-bounds]
return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
^

vim +174 include/linux/cpumask.h

da91309e Amir Vadai 2014-06-09 158
2d3854a3 Rusty Russell 2008-11-05 159 #define for_each_cpu(cpu, mask) \
2d3854a3 Rusty Russell 2008-11-05 160 for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
8bd93a2c Paul E. McKenney 2010-02-22 161 #define for_each_cpu_not(cpu, mask) \
8bd93a2c Paul E. McKenney 2010-02-22 162 for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
2d3854a3 Rusty Russell 2008-11-05 163 #define for_each_cpu_and(cpu, mask, and) \
2d3854a3 Rusty Russell 2008-11-05 164 for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask, (void)and)
2d3854a3 Rusty Russell 2008-11-05 165 #else
2d3854a3 Rusty Russell 2008-11-05 166 /**
2d3854a3 Rusty Russell 2008-11-05 167 * cpumask_first - get the first cpu in a cpumask
2d3854a3 Rusty Russell 2008-11-05 168 * @srcp: the cpumask pointer
2d3854a3 Rusty Russell 2008-11-05 169 *
2d3854a3 Rusty Russell 2008-11-05 170 * Returns >= nr_cpu_ids if no cpus set.
2d3854a3 Rusty Russell 2008-11-05 171 */
2d3854a3 Rusty Russell 2008-11-05 172 static inline unsigned int cpumask_first(const struct cpumask *srcp)
2d3854a3 Rusty Russell 2008-11-05 173 {
2d3854a3 Rusty Russell 2008-11-05 @174 return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
2d3854a3 Rusty Russell 2008-11-05 175 }
2d3854a3 Rusty Russell 2008-11-05 176
2d3854a3 Rusty Russell 2008-11-05 177 /**
2d3854a3 Rusty Russell 2008-11-05 178 * cpumask_next - get the next cpu in a cpumask
2d3854a3 Rusty Russell 2008-11-05 179 * @n: the cpu prior to the place to search (ie. return will be > @n)
2d3854a3 Rusty Russell 2008-11-05 180 * @srcp: the cpumask pointer
2d3854a3 Rusty Russell 2008-11-05 181 *
2d3854a3 Rusty Russell 2008-11-05 182 * Returns >= nr_cpu_ids if no further cpus set.

:::::: The code at line 174 was first introduced by commit
:::::: 2d3854a37e8b767a51aba38ed6d22817b0631e33 cpumask: introduce new API, without changing anything

:::::: TO: Rusty Russell <[email protected]>
:::::: CC: Ingo Molnar <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
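
One plausible reading of this warning, which the thread does not confirm, is that pcibus_to_node() can evaluate to -1 (NUMA_NO_NODE) in some configurations, so the node-to-cpumask lookup is indexed with a negative value. The user-space sketch below shows the kind of guard that would avoid that. The mask tables and the names `first_bit()` and `first_near_cpu()` are hypothetical, and intersecting with the online mask is a choice made here that the patch itself does not make.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define NR_NODES     2
#define NR_CPUS      8

/* Hypothetical per-node CPU masks: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const unsigned node_cpumask[NR_NODES] = { 0x0fu, 0xf0u };
/* Hypothetical online mask: every CPU except CPU 0. */
static const unsigned online_mask = 0xfeu;

/* Lowest set bit, or NR_CPUS if the mask is empty (mirrors cpumask_first()). */
static int first_bit(unsigned mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/*
 * Validate the node before using it as an array index; fall back to the
 * first online CPU when the device has no usable node.
 */
int first_near_cpu(int node)
{
	if (node < 0 || node >= NR_NODES)
		return first_bit(online_mask);
	return first_bit(node_cpumask[node] & online_mask);
}
```

The fallback policy (first online CPU) is arbitrary; the point is only that the node value is checked before it is used as an index.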



2015-12-23 11:16:50

by kernel test robot

Subject: Re: [PATCH 2/2] ixgbe: Use core to device locality interface

Hi Daniel,

[auto build test ERROR on pci/next]
[also build test ERROR on v4.4-rc6 next-20151223]

url: https://github.com/0day-ci/linux/commits/Daniel-J-Blueman/PCI-Add-mechanism-to-find-topologically-near-cores/20151223-181947
base: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: i386-randconfig-a0-12222034 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

>> ERROR: "cpu_near_dev" [drivers/net/ethernet/intel/ixgbe/ixgbe.ko] undefined!
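
This link error is what one would expect when a loadable module references a function that is defined in built-in code but not exported: patch 1/2 adds cpu_near_dev() to drivers/pci/pci.c without an EXPORT_SYMBOL() line. A possible fix, shown as an untested fragment rather than a reviewed change, would be to export the symbol next to its definition. Whether EXPORT_SYMBOL or EXPORT_SYMBOL_GPL is appropriate is left open here.

```diff
 	return cpu;
 }
+EXPORT_SYMBOL(cpu_near_dev);
```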

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2015-12-23 15:46:44

by Bjorn Helgaas

Subject: Re: [PATCH 1/2] PCI: Add mechanism to find topologically near cores

Hi Daniel,

On Wed, Dec 23, 2015 at 06:01:40PM +0800, Daniel J Blueman wrote:
> Some devices (e.g. ixgbe) make assumptions about device-to-core locality when
> specifying interrupt locality hints and allocate starting from core 0.
> Moreover, interrupts may not be routable to distant NUMA nodes due to the
> 8-bit APIC ID space limitation.

The APIC ID issue is the primary problem you're trying to solve, but
this patch doesn't solve it directly because it doesn't look at
anything related to the APIC ID domain. Anything you do here is a
guess that might work better, but it still won't necessarily work in
all cases.

Also, can you add a note about how this relates to the "call driver
probe function on node where device is attached" functionality in
pci_call_probe()? It seems like that should be enough to keep us from
always allocating interrupts on core 0. Maybe that's broken, or maybe
ixgbe isn't taking advantage of that?

> Provide a mechanism drivers can use to find cores with reasonable locality
> to a device; use the existing precedent of RECLAIM_DISTANCE (30) as the
> distance cutoff, wrapping the offset around the set of eligible cores.

I don't think it's a benefit to reuse RECLAIM_DISTANCE, because that
is a different concept that doesn't seem directly related to what
you're doing. The name and the existing uses are related to memory
zone reclaiming, so I know I would be confused to see it also used for
IRQ assignment.

> Signed-off-by: Daniel J Blueman <[email protected]>
> ---
> drivers/pci/pci.c | 15 +++++++++++++++
> include/linux/pci.h | 1 +
> 2 files changed, 16 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 314db8c..d5535d1 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4833,6 +4833,22 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
> }
> EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +int cpu_near_dev(const struct pci_dev *pdev, unsigned offset)

If this becomes a PCI interface, please add a "pci_" prefix to the
name.

The distance concept also applies to non-PCI devices, so ideally I
think a "pci_nearby_cpu()" or similar interface would be a wrapper
around a more generic interface that takes a "struct device *".

I don't really understand what the "offset" parameter is for. It
looks like you're using it to spread across CPUs on a node, but that
seems like something that should be done internally to this interface.
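
The layering suggested here can be sketched in user space as follows. This is only an illustration of the shape of such an interface, not a proposal from the thread: the reduced `struct device` and `struct pci_dev`, the `cpu_node` table, the per-node cursor `next_slot`, and the name `dev_nearby_cpu()` are all hypothetical, and the sketch only considers CPUs on the device's own node rather than applying a distance cutoff.

```c
#include <assert.h>

#define NR_CPUS  8
#define NR_NODES 2

struct device  { int numa_node; };     /* hypothetical, heavily reduced */
struct pci_dev { struct device dev; }; /* hypothetical, heavily reduced */

/* Made-up topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Per-node round-robin cursor, so callers need no 'offset' argument. */
static unsigned next_slot[NR_NODES];

/* Generic helper: each call returns the next CPU on the device's node. */
int dev_nearby_cpu(const struct device *dev)
{
	int node = dev->numa_node;
	unsigned count = 0, seen = 0, pick;
	int cpu;

	assert(node >= 0 && node < NR_NODES);
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_node[cpu] == node)
			count++;
	assert(count > 0);

	/* Pick the (cursor mod count)-th CPU of this node, then advance. */
	pick = next_slot[node]++ % count;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_node[cpu] == node && seen++ == pick)
			return cpu;
	return -1; /* not reached while count > 0 */
}

/* PCI-prefixed wrapper around the generic helper. */
int pci_nearby_cpu(const struct pci_dev *pdev)
{
	return dev_nearby_cpu(&pdev->dev);
}
```

A driver would then call pci_nearby_cpu(pdev) once per vector and receive successive cores on the device's node, with the spreading state kept inside the interface. A real implementation would also need locking around the cursor and a policy for offline CPUs, both of which are omitted here.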

> +{
> + /* Start search from node device is on for optimal locality */
> + int localnode = pcibus_to_node(pdev->bus);
> + int cpu = cpumask_first(cpumask_of_node(localnode));
> +
> + while (offset--) {
> + do {
> + cpu = (cpu + 1) % nr_cpu_ids;
> + } while (!cpu_online(cpu) || node_distance(cpu_to_node(cpu),
> + localnode) > RECLAIM_DISTANCE);
> + }
> +
> + return cpu;
> +}
> +
> static int __init pci_setup(char *str)
> {
> while (str) {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 6ae25aa..f7491bd 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -842,6 +842,7 @@ void pci_stop_root_bus(struct pci_bus *bus);
> void pci_remove_root_bus(struct pci_bus *bus);
> void pci_setup_cardbus(struct pci_bus *bus);
> void pci_sort_breadthfirst(void);
> +int cpu_near_dev(const struct pci_dev *pdev, unsigned offset);
> #define dev_is_pci(d) ((d)->bus == &pci_bus_type)
> #define dev_is_pf(d) ((dev_is_pci(d) ? to_pci_dev(d)->is_physfn : false))
> #define dev_num_vf(d) ((dev_is_pci(d) ? pci_num_vf(to_pci_dev(d)) : 0))
> --
> 2.5.0