2024-02-02 06:56:04

by Chen Yu

Subject: Managed interrupt spreading based on NUMA locality

Dear experts,

Recently we have been evaluating some multi-queue NIC device drivers,
with the goal of switching them from conventional interrupts to managed
interrupts.

With managed interrupts, the interrupts do not have to be migrated
during CPU offline when the last CPU in their affinity mask goes offline.

This saves vector space and helps hibernation take every nonboot CPU
offline. Otherwise, an error like the following occurs:
[48175.409994] CPU 239 has 165 vectors, 36 available. Cannot disable CPU

However, after switching to managed interrupts, a question arises about
how the interrupts are spread among NUMA nodes.

If device d1 is attached to node n1, can d1's managed interrupts be
allocated on the CPUs of n1 first? That way, the driver of d1 can
allocate its buffers on n1, and with d1's managed interrupts delivered
to CPUs on n1, the DMA->DRAM->CPU->net_rx_action() path would be
NUMA friendly.
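
For the memory side of that path, here is a minimal sketch of what the
driver could already do today; the function name is made up for
illustration, while dev_to_node()/kzalloc_node() are the existing
kernel helpers:

#include <linux/pci.h>
#include <linux/slab.h>

/*
 * Hypothetical per-queue setup for device d1: place the RX queue state
 * on the node the device is attached to, so the memory touched along
 * DMA->DRAM->CPU->net_rx_action() is local to that node.
 * kzalloc_node() falls back to a default node if dev_to_node() returns
 * NUMA_NO_NODE.
 */
static void *d1_alloc_rx_queue(struct pci_dev *pdev, size_t size)
{
	return kzalloc_node(size, GFP_KERNEL, dev_to_node(&pdev->dev));
}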

Question:
Does it make sense to make the interrupt spreading aware of NUMA
locality, or is there an existing mechanism to do this? The driver
could provide the preferred NUMA node in struct irq_affinity->node and
pass it to the managed interrupt spreading logic, so that the
interrupts are spread within that node.
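
To make the idea concrete, here is a sketch of the intended driver-side
usage. The "node" member of struct irq_affinity is the proposed
extension and does not exist today; the rest is the existing PCI/MSI-X
API:

#include <linux/pci.h>
#include <linux/interrupt.h>

/* Sketch only: ".node" is the proposed new member of struct irq_affinity. */
static int d1_setup_irqs(struct pci_dev *pdev, unsigned int max_vecs)
{
	struct irq_affinity affd = {
		.pre_vectors	= 1,	/* e.g. one non-managed admin vector */
		.node		= dev_to_node(&pdev->dev), /* proposed: preferred node */
	};

	/*
	 * The managed vectors would then be spread only among the CPUs
	 * of that node, e.g. via group_cpus_evenly(numgrps, node) as in
	 * the diff below.
	 */
	return pci_alloc_irq_vectors_affinity(pdev, 1, max_vecs,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}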

Thanks in advance.

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index aa3f6815bb12..836e9d374c19 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -344,7 +344,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
* We guarantee in the resulted grouping that all CPUs are covered, and
* no same CPU is assigned to multiple groups
*/
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int numgrps, int node)
{
unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
cpumask_var_t *node_to_cpumask;
@@ -370,9 +370,14 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
cpus_read_lock();
build_node_to_cpumask(node_to_cpumask);

+ if (node != NUMA_NO_NODE)
+ cpumask_and(npresmsk, cpu_present_mask, node_to_cpumask[node]);
+ else
+ cpumask_copy(npresmsk, cpu_present_mask);
+
/* grouping present CPUs first */
ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
- cpu_present_mask, nmsk, masks);
+ npresmsk, nmsk, masks);
if (ret < 0)
goto fail_build_affinity;
nr_present = ret;
--
2.25.1