2002-12-20 09:00:37

by Kamble, Nitin A

Subject: [PATCH] IRQ distribution in the 2.5.52 kernel

Hello All,

We have been looking at the performance impact of the IRQ routing in the 2.5.52 Linux kernel. This email includes some of our findings about the way interrupts are moved in the 2.5.52 kernel, along with a discussion of, and a patch for, a new implementation. Let me know what you think at [email protected]

Current implementation:
======================
We have found that the existing implementation works well on IA32 SMP systems under a light interrupt load, but it does not work as well under heavy interrupt load on these SMP systems. The observations are:

* The interrupt load of each IRQ is balanced across CPUs independently of the load of the other IRQs, and the current implementation moves IRQs randomly. This works well when the interrupt load is light, but an imbalance shows up as soon as there are multiple heavy interrupt sources: frequently several heavily loaded IRQs get moved to a single CPU while other CPUs stay very lightly loaded. To achieve a good interrupt load balance, it is important to consider the load of all the interrupts together.
This can be illustrated with an example of 4 CPUs and 4 heavy interrupt sources. With the existing random-movement approach, the chance of these heavy interrupt sources all landing on separate CPUs is (4/4)*(3/4)*(2/4)*(1/4) = 3/32. That means 29/32, or about 90.6%, of the time some CPUs are very lightly loaded while others carry multiple heavy interrupts. This interrupt load imbalance costs performance. In the case of 2 CPUs and 2 heavily loaded interrupt sources, the imbalance happens 1/2 = 50% of the time. The issue becomes more and more severe as the number of heavy interrupt sources grows. (A short program after this list of observations illustrates the arithmetic.)

* Another interesting observation: the imbalance cannot be seen in /proc/interrupts, because /proc/interrupts shows only the cumulative count of interrupts delivered to each CPU. If the interrupt load is imbalanced but the imbalance is rotated among the CPUs continuously, /proc/interrupts will still show the interrupt load being spread very evenly across processors. At the frequency (HZ/50) at which IRQs are currently moved across CPUs, it is not possible to see any momentary interrupt load imbalance this way. (The sampler after this list shows how to look at per-interval deltas rather than cumulative totals.)

* We have also found that in certain cases static IRQ binding performs better than the existing kernel distribution of the interrupt load. The reason is that in an already well-balanced interrupt load situation, the interrupts still get moved frequently and unnecessarily across CPUs. This adds extra overhead, and it also gives up the benefit of CPU cache warmth.
This came out of performance measurements done on a 4-way HT (8 logical processors) Pentium 4 Xeon system running 8 copies of netperf. The 4 NICs in the system, each taking a different IRQ, generated a sizable interrupt load with the help of the connected clients.

The observed netperf throughput numbers (transactions/sec) are:

IRQs nicely manually bound to CPUs: 56.20K
The current kernel implementation of IRQ movement: 50.05K
-----------------------
The static binding of IRQs has performed 12.28% better than the current IRQ movement implemented in the kernel.

* The current implementation does not distinguish HT (Hyper-Threading(tm)) siblings from separate CPUs. It would be beneficial to balance the interrupt load across processor packages first, and then among the logical CPUs inside each package.
For example, with 2 heavy interrupt sources and 2 processor packages (4 logical CPUs), assigning the two heavy interrupt sources to different processor packages is better, since they then use the execution resources of different packages.
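
For illustration only (not part of the patch), the following small user-space program computes the chance that n heavy interrupt sources, each placed on one of n CPUs independently and at random, all end up on different CPUs; the file name is arbitrary:

/* place_prob.c - chance that n heavy IRQs, each placed on one of n
 * CPUs at random, all land on different CPUs: n!/n^n.
 * Build: gcc -O2 -o place_prob place_prob.c
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int i, n = (argc > 1) ? atoi(argv[1]) : 4;
	double p = 1.0;

	for (i = 0; i < n; i++)		/* (n/n) * (n-1)/n * ... * 1/n */
		p *= (double)(n - i) / n;
	printf("%d CPUs, %d heavy IRQs: all on separate CPUs %.1f%%, "
	       "some CPU doubled up %.1f%%\n",
	       n, n, 100.0 * p, 100.0 * (1.0 - p));
	return 0;
}

With no argument it uses n = 4 and prints roughly 9.4% / 90.6%; with an argument of 2 it prints the 50/50 split mentioned above.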
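
To look at deltas over an interval rather than lifetime totals, a small user-space sampler such as the one below can be used (again illustrative, not part of the patch; the file name and array limits are arbitrary). Note that with the current code moving IRQs every HZ/50 ticks, even short sampling intervals average the rotation out, which is exactly the point of the observation above; the sampler is mainly useful for checking the distribution when IRQs stay put, as with static binding or the new implementation.

/* irqdelta.c - print per-CPU /proc/interrupts deltas over an interval.
 * Build: gcc -O2 -o irqdelta irqdelta.c
 * Usage: ./irqdelta [seconds]
 * Assumes the usual /proc/interrupts layout: a header line with one
 * CPUn label per processor, then "IRQ:" lines with one count per CPU.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 32
#define MAX_IRQS 256

static int read_counts(unsigned long counts[MAX_IRQS][MAX_CPUS], int *ncpus)
{
	char line[1024], *p;
	FILE *f = fopen("/proc/interrupts", "r");
	int irq, cpu;

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {	/* header: CPU0 CPU1 ... */
		fclose(f);
		return -1;
	}
	*ncpus = 0;
	for (p = strstr(line, "CPU"); p; p = strstr(p + 3, "CPU"))
		(*ncpus)++;
	if (*ncpus > MAX_CPUS)
		*ncpus = MAX_CPUS;
	memset(counts, 0, sizeof(unsigned long) * MAX_IRQS * MAX_CPUS);
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, " %d:", &irq) != 1 || irq < 0 || irq >= MAX_IRQS)
			continue;	/* skips NMI:, LOC:, ERR:, ... */
		p = strchr(line, ':') + 1;
		for (cpu = 0; cpu < *ncpus; cpu++)
			counts[irq][cpu] = strtoul(p, &p, 10);
	}
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	static unsigned long before[MAX_IRQS][MAX_CPUS], after[MAX_IRQS][MAX_CPUS];
	int ncpus, irq, cpu, secs = (argc > 1) ? atoi(argv[1]) : 5;

	if (read_counts(before, &ncpus))
		return 1;
	sleep(secs);
	if (read_counts(after, &ncpus))
		return 1;
	for (irq = 0; irq < MAX_IRQS; irq++) {
		unsigned long total = 0;
		for (cpu = 0; cpu < ncpus; cpu++)
			total += after[irq][cpu] - before[irq][cpu];
		if (!total)
			continue;	/* no activity on this IRQ */
		printf("irq %3d:", irq);
		for (cpu = 0; cpu < ncpus; cpu++)
			printf(" %8lu", after[irq][cpu] - before[irq][cpu]);
		printf("\n");
	}
	return 0;
}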



New revised implementation:
==========================
We have also been working on a new implementation; the following points are its main focus.

* At any moment, heavily loaded IRQs are distributed to different CPUs to achieve as much balance as possible. (A condensed sketch of this selection heuristic appears after this list.)

* Lightly loaded interrupt sources are left out of the load balancing, as they do not cause considerable imbalance.

* When the heavy interrupt sources are balanced, they are not moved around. This also helps in keeping the CPU caches warm.

* It has been made HT aware. While distributing the load, the load on the processor package that the logical CPUs belong to is also considered.

* When there are only a few (fewer than num_cpus) heavy interrupt sources, it is not possible to balance them evenly. In that case the existing code has been reused to move the interrupts around, with the randomness of the original code removed.

* The time interval for redistribution has been made flexible. It varies as the system interrupt load changes.

* A new kernel_thread is introduced to do the load-balancing calculations for all the interrupt sources. It keeps the irq_balance_mask entries ready for the interrupt handlers, keeping the overhead in the interrupt-handling path to a minimum.

* It allows IRQ distribution to be disabled from the boot-loader command line (the noirqbalance option), should anybody want to do that for any reason.

* The algorithm also takes into account any static binding of interrupts to CPUs that the user imposes through the /proc/irq/{n}/smp_affinity interface.
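
As a condensed illustration of the selection heuristic mentioned in the first point of this list, the sketch below mirrors the core of do_irq_balance() from the patch in user space: it finds the most and least loaded CPUs, takes half their difference as the imbalance, and picks the IRQ on the heavy CPU whose recent activity comes closest to that imbalance without exceeding it. The sample load numbers are invented, and the HT-sibling and smp_affinity handling of the real code are omitted:

/* balance_sketch.c - toy model of the selection heuristic in
 * do_irq_balance(): find the most and least loaded CPUs, take half
 * their difference as the imbalance, and move the IRQ on the heavy
 * CPU whose recent activity comes closest to that imbalance without
 * exceeding it.
 * Build: gcc -O2 -o balance_sketch balance_sketch.c
 */
#include <stdio.h>

#define NCPUS 4
#define NIRQS 8

/* invented per-CPU, per-IRQ activity counts for the last interval */
static unsigned long delta[NCPUS][NIRQS] = {
	{ 0, 9000,    0, 8000, 0, 0, 0, 40 },	/* cpu 0: two heavy sources */
	{ 0,    0,    0,    0, 0, 0, 0, 10 },	/* cpu 1: nearly idle       */
	{ 0,    0, 7000,    0, 0, 0, 0,  0 },	/* cpu 2: one heavy source  */
	{ 0,    0,    0,    0, 0, 0, 0, 20 },	/* cpu 3: nearly idle       */
};

int main(void)
{
	unsigned long load[NCPUS] = { 0 };
	unsigned long imbalance, best = 0;
	int cpu, irq, max_loaded = 0, min_loaded = 0, selected = -1;

	for (cpu = 0; cpu < NCPUS; cpu++)
		for (irq = 0; irq < NIRQS; irq++)
			load[cpu] += delta[cpu][irq];

	for (cpu = 1; cpu < NCPUS; cpu++) {
		if (load[cpu] > load[max_loaded])
			max_loaded = cpu;
		if (load[cpu] < load[min_loaded])
			min_loaded = cpu;
	}
	imbalance = (load[max_loaded] - load[min_loaded]) / 2;
	if (imbalance < load[max_loaded] / 8) {
		printf("imbalance too trivial, nothing moved\n");
		return 0;
	}
	/* closest to the imbalance without going over */
	for (irq = 0; irq < NIRQS; irq++) {
		if (delta[max_loaded][irq] > imbalance)
			continue;
		if (delta[max_loaded][irq] > best) {
			best = delta[max_loaded][irq];
			selected = irq;
		}
	}
	if (selected >= 0)
		printf("move irq %d (load %lu) from cpu %d to cpu %d\n",
		       selected, best, max_loaded, min_loaded);
	return 0;
}

Run as-is, it reports moving irq 3 (load 8000) from cpu 0 to cpu 1, i.e. it splits the two heavy sources sharing cpu 0 instead of rotating them at random.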


Throughput numbers with the netperf setup for the new implementation:

Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
---------------------
A performance improvement of 11.9% is observed on the P4 Xeon system.

The new IRQ balance implementation also shows a small performance improvement on P6 (Pentium II, III) systems.

On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
---------------------
Here a performance improvement of about 2% is observed on the P6 system.


Thanks,
Nitin

diff -Naru 2.5.52/Documentation/kernel-parameters.txt kirqb/Documentation/kernel-parameters.txt
--- 2.5.52/Documentation/kernel-parameters.txt Tue Dec 17 15:35:57 2002
+++ kirqb/Documentation/kernel-parameters.txt Tue Dec 17 15:37:29 2002
@@ -352,6 +352,8 @@

hugepages= [HW,IA-32] Maximal number of HugeTLB pages

+ noirqbalance [IA-32,SMP,KNL] Disable kernel irq balancing
+
i8042_direct [HW] Non-translated mode
i8042_dumbkbd
i8042_noaux
diff -Naru 2.5.52/arch/i386/kernel/io_apic.c kirqb/arch/i386/kernel/io_apic.c
--- 2.5.52/arch/i386/kernel/io_apic.c Tue Dec 17 15:35:26 2002
+++ kirqb/arch/i386/kernel/io_apic.c Fri Dec 20 01:23:15 2002
@@ -206,19 +206,37 @@
spin_unlock_irqrestore(&ioapic_lock, flags);
}

-#if CONFIG_SMP
+#if defined(CONFIG_SMP)
+# include <asm/processor.h> /* kernel_thread() */
+# include <linux/kernel_stat.h> /* kstat */
+# include <linux/slab.h> /* kmalloc() */
+# include <linux/timer.h> /* time_after() */
+
+# if CONFIG_BALANCED_IRQ_DEBUG
+# define TDprintk(x...) do { printk("<%ld:%s:%d>: ", jiffies, __FILE__, __LINE__); printk(x); } while (0)
+# define Dprintk(x...) do { TDprintk(x); } while (0)
+# else
+# define TDprintk(x...)
+# define Dprintk(x...)
+# endif

-typedef struct {
- unsigned int cpu;
- unsigned long timestamp;
-} ____cacheline_aligned irq_balance_t;
-
-static irq_balance_t irq_balance[NR_IRQS] __cacheline_aligned
- = { [ 0 ... NR_IRQS-1 ] = { 0, 0 } };
+# define MIN(a,b) (((a) < (b)) ? (a) : (b))
+# define MAX(a,b) (((a) > (b)) ? (a) : (b))

extern unsigned long irq_affinity [NR_IRQS];
-
-#endif
+unsigned long __cacheline_aligned irq_balance_mask [NR_IRQS];
+static int irqbalance_disabled __initdata = 0;
+static int physical_balance = 0;
+
+struct irq_cpu_info {
+ unsigned long * last_irq;
+ unsigned long * irq_delta;
+ unsigned long irq;
+} irq_cpu_data[NR_CPUS];
+
+#define CPU_IRQ(cpu) (irq_cpu_data[cpu].irq)
+#define LAST_CPU_IRQ(cpu,irq) (irq_cpu_data[cpu].last_irq[irq])
+#define IRQ_DELTA(cpu,irq) (irq_cpu_data[cpu].irq_delta[irq])

#define IDLE_ENOUGH(cpu,now) \
(idle_cpu(cpu) && ((now) - irq_stat[(cpu)].idle_timestamp > 1))
@@ -226,10 +244,224 @@
#define IRQ_ALLOWED(cpu,allowed_mask) \
((1 << cpu) & (allowed_mask))

-#if CONFIG_SMP
+#define CPU_TO_PACKAGEINDEX(i) \
+ ((physical_balance && i > cpu_sibling_map[i]) ? cpu_sibling_map[i] : i)
+
+#define MAX_BALANCED_IRQ_INTERVAL (5*HZ)
+#define MIN_BALANCED_IRQ_INTERVAL (HZ/2)
+#define BALANCED_IRQ_MORE_DELTA (HZ/10)
+#define BALANCED_IRQ_LESS_DELTA (HZ)
+
+unsigned long balanced_irq_interval = MAX_BALANCED_IRQ_INTERVAL;
+
+static inline void balance_irq(int cpu, int irq);
+
+static inline void rotate_irqs_among_cpus(unsigned long useful_load_threshold)
+{
+ int i, j;
+ Dprintk("Rotating IRQs among CPUs.\n");
+ for (i = 0; i < NR_CPUS; i++) {
+ for (j = 0; cpu_online(i) && (j < NR_IRQS); j++) {
+ if (!irq_desc[j].action)
+ continue;
+ /* Is it a significant load ? */
+ if (IRQ_DELTA(CPU_TO_PACKAGEINDEX(i),j) < useful_load_threshold)
+ continue;
+ balance_irq(i, j);
+ }
+ }
+ balanced_irq_interval = MAX(MIN_BALANCED_IRQ_INTERVAL,
+ balanced_irq_interval - BALANCED_IRQ_LESS_DELTA);
+ return;
+}
+
+static void do_irq_balance(void)
+{
+ int i, j;
+ unsigned long max_cpu_irq = 0, min_cpu_irq = (~0);
+ unsigned long move_this_load = 0;
+ int max_loaded = 0, min_loaded = 0;
+ unsigned long useful_load_threshold = balanced_irq_interval + 10;
+ int selected_irq;
+ int tmp_loaded, first_attempt = 1;
+ unsigned long tmp_cpu_irq;
+ unsigned long imbalance = 0;
+ unsigned long allowed_mask;
+ unsigned long target_cpu_mask;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ int package_index;
+ CPU_IRQ(i) = 0;
+ if (!cpu_online(i))
+ continue;
+ package_index = CPU_TO_PACKAGEINDEX(i);
+ for (j = 0; j < NR_IRQS; j++) {
+ unsigned long value_now, delta;
+ /* Is this an active IRQ? */
+ if (!irq_desc[j].action)
+ continue;
+ if (package_index == i)
+ IRQ_DELTA(package_index,j) = 0;
+ /* Determine the total count per processor per IRQ */
+ value_now = (unsigned long) kstat_cpu(i).irqs[j];
+
+ /* Determine the activity per processor per IRQ */
+ delta = value_now - LAST_CPU_IRQ(i,j);
+
+ /* Update last_cpu_irq[][] for the next time */
+ LAST_CPU_IRQ(i,j) = value_now;
+
+ /* Ignore IRQs whose rate is less than the clock */
+ if (delta < useful_load_threshold)
+ continue;
+ /* update the load for the processor or package total */
+ IRQ_DELTA(package_index,j) += delta;
+
+ /* Keep track of the higher numbered sibling as well */
+ if (i != package_index)
+ CPU_IRQ(i) += delta;
+ /*
+ * We have sibling A and sibling B in the package
+ *
+ * cpu_irq[A] = load for cpu A + load for cpu B
+ * cpu_irq[B] = load for cpu B
+ */
+ CPU_IRQ(package_index) += delta;
+ }
+ }
+ /* Find the least loaded processor package */
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ if (physical_balance && i > cpu_sibling_map[i])
+ continue;
+ if (min_cpu_irq > CPU_IRQ(i)) {
+ min_cpu_irq = CPU_IRQ(i);
+ min_loaded = i;
+ }
+ }
+ max_cpu_irq = ULONG_MAX;
+
+tryanothercpu:
+ /* Look for heaviest loaded processor.
+ * We may come back to get the next heaviest loaded processor.
+ * Skip processors with trivial loads.
+ */
+ tmp_cpu_irq = 0;
+ tmp_loaded = -1;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ if (physical_balance && i > cpu_sibling_map[i])
+ continue;
+ if (max_cpu_irq <= CPU_IRQ(i))
+ continue;
+ if (tmp_cpu_irq < CPU_IRQ(i)) {
+ tmp_cpu_irq = CPU_IRQ(i);
+ tmp_loaded = i;
+ }
+ }
+
+ if (tmp_loaded == -1) {
+ /* In the case of a small number of heavy interrupt sources,
+ * some of the cpus end up loaded too much. We use Ingo's original
+ * approach to rotate them around.
+ */
+ if (!first_attempt && imbalance >= useful_load_threshold) {
+ rotate_irqs_among_cpus(useful_load_threshold);
+ return;
+ }
+ goto not_worth_the_effort;
+ }
+
+ first_attempt = 0; /* heaviest search */
+ max_cpu_irq = tmp_cpu_irq; /* load */
+ max_loaded = tmp_loaded; /* processor */
+ imbalance = (max_cpu_irq - min_cpu_irq) / 2;
+
+ Dprintk("max_loaded cpu = %d\n", max_loaded);
+ Dprintk("min_loaded cpu = %d\n", min_loaded);
+ Dprintk("max_cpu_irq load = %ld\n", max_cpu_irq);
+ Dprintk("min_cpu_irq load = %ld\n", min_cpu_irq);
+ Dprintk("load imbalance = %lu\n", imbalance);
+
+ /* if imbalance is less than approx 10% of max load, then
+ * observe diminishing returns action. - quit
+ */
+ if (imbalance < (max_cpu_irq >> 3)) {
+ Dprintk("Imbalance too trivial\n");
+ goto not_worth_the_effort;
+ }
+
+tryanotherirq:
+ /* if we select an IRQ to move that can't go where we want, then
+ * see if there is another one to try.
+ */
+ move_this_load = 0;
+ selected_irq = -1;
+ for (j = 0; j < NR_IRQS; j++) {
+ /* Is this an active IRQ? */
+ if (!irq_desc[j].action)
+ continue;
+ if (imbalance <= IRQ_DELTA(max_loaded,j))
+ continue;
+ /* Try to find the IRQ that is closest to the imbalance
+ * without going over.
+ */
+ if (move_this_load < IRQ_DELTA(max_loaded,j)) {
+ move_this_load = IRQ_DELTA(max_loaded,j);
+ selected_irq = j;
+ }
+ }
+ if (selected_irq == -1) {
+ goto tryanothercpu;
+ }

-#define IRQ_BALANCE_INTERVAL (HZ/50)
+ imbalance = move_this_load;

+ /* For the physical_balance case, we accumulated both load
+ * values in one of the siblings' cpu_irq[],
+ * to use the same code for physical and logical processors
+ * as much as possible.
+ *
+ * NOTE: the cpu_irq[] array holds the sum of the load for
+ * sibling A and sibling B in the slot for the lowest numbered
+ * sibling (A), _AND_ the load for sibling B in the slot for
+ * the higher numbered sibling.
+ *
+ * We seek the least loaded sibling by making the comparison
+ * (A+B)/2 vs B
+ */
+ if (physical_balance && (CPU_IRQ(min_loaded) >> 1) > CPU_IRQ(cpu_sibling_map[min_loaded]))
+ min_loaded = cpu_sibling_map[min_loaded];
+
+ allowed_mask = cpu_online_map & irq_affinity[selected_irq];
+ target_cpu_mask = 1 << min_loaded;
+
+ if (target_cpu_mask & allowed_mask) {
+ irq_desc_t *desc = irq_desc + selected_irq;
+ Dprintk("irq = %d moved to cpu = %d\n", selected_irq, min_loaded);
+ /* mark for change destination */
+ spin_lock(&desc->lock);
+ irq_balance_mask[selected_irq] = target_cpu_mask;
+ spin_unlock(&desc->lock);
+ /* Since we made a change, come back sooner to
+ * check for more variation.
+ */
+ balanced_irq_interval = MAX(MIN_BALANCED_IRQ_INTERVAL,
+ balanced_irq_interval - BALANCED_IRQ_LESS_DELTA);
+ return;
+ }
+ goto tryanotherirq;
+
+not_worth_the_effort:
+ /* if we did not find an IRQ to move, then adjust the time interval upward */
+ balanced_irq_interval = MIN(MAX_BALANCED_IRQ_INTERVAL,
+ balanced_irq_interval + BALANCED_IRQ_MORE_DELTA);
+ Dprintk("IRQ worth rotating not found\n");
+ return;
+}
+
static unsigned long move(int curr_cpu, unsigned long allowed_mask, unsigned long now, int direction)
{
int search_idle = 1;
@@ -256,34 +488,112 @@
return cpu;
}

-static inline void balance_irq(int irq)
+static inline void balance_irq (int cpu, int irq)
{
- irq_balance_t *entry = irq_balance + irq;
unsigned long now = jiffies;
-
+ unsigned long allowed_mask;
+ unsigned int new_cpu;
+
if (clustered_apic_mode)
return;

- if (unlikely(time_after(now, entry->timestamp + IRQ_BALANCE_INTERVAL))) {
- unsigned long allowed_mask;
- unsigned int new_cpu;
- int random_number;
-
- rdtscl(random_number);
- random_number &= 1;
-
- allowed_mask = cpu_online_map & irq_affinity[irq];
- entry->timestamp = now;
- new_cpu = move(entry->cpu, allowed_mask, now, random_number);
- if (entry->cpu != new_cpu) {
- entry->cpu = new_cpu;
- set_ioapic_affinity(irq, 1 << new_cpu);
+ allowed_mask = cpu_online_map & irq_affinity[irq];
+ new_cpu = move(cpu, allowed_mask, now, 1);
+ if (cpu != new_cpu) {
+ irq_desc_t *desc = irq_desc + irq;
+ spin_lock(&desc->lock);
+ irq_balance_mask[irq] = 1 << new_cpu;
+ spin_unlock(&desc->lock);
+ }
+}
+
+int balanced_irq(void *unused)
+{
+ int i;
+ unsigned long prev_balance_time = jiffies;
+ long time_remaining = balanced_irq_interval;
+ daemonize();
+ sigfillset(&current->blocked);
+ sprintf(current->comm, "balanced_irq");
+
+ /* push everything to CPU 0 to give us a starting point. */
+ for (i = 0 ; i < NR_IRQS ; i++)
+ irq_balance_mask[i] = 1 << 0;
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ time_remaining = schedule_timeout(time_remaining);
+ if (time_after(jiffies, prev_balance_time+balanced_irq_interval)) {
+ Dprintk("balanced_irq: calling do_irq_balance() %lu\n", jiffies);
+ do_irq_balance();
+ prev_balance_time = jiffies;
+ time_remaining = balanced_irq_interval;
}
+ }
+}
+
+static int __init balanced_irq_init(void)
+{
+ int i;
+ struct cpuinfo_x86 *c;
+ c = &boot_cpu_data;
+ if (irqbalance_disabled)
+ return 0;
+ /* Enable physical balance only if more than
+ * one physical processor package is present */
+ if (smp_num_siblings > 1 && cpu_online_map >> 2)
+ physical_balance = 1;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ irq_cpu_data[i].irq_delta = kmalloc(sizeof(unsigned long) * NR_IRQS, GFP_KERNEL);
+ irq_cpu_data[i].last_irq = kmalloc(sizeof(unsigned long) * NR_IRQS, GFP_KERNEL);
+ if (irq_cpu_data[i].irq_delta == NULL || irq_cpu_data[i].last_irq == NULL) {
+ printk(KERN_ERR "balanced_irq_init: out of memory");
+ goto failed;
+ }
+ memset(irq_cpu_data[i].irq_delta,0,sizeof(unsigned long) * NR_IRQS);
+ memset(irq_cpu_data[i].last_irq,0,sizeof(unsigned long) * NR_IRQS);
+ }
+
+ printk(KERN_INFO "Starting balanced_irq\n");
+ if (kernel_thread(balanced_irq, NULL, CLONE_KERNEL) >= 0)
+ return 0;
+ else
+ printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq");
+failed:
+ for (i = 0; i < NR_CPUS; i++) {
+ if (irq_cpu_data[i].irq_delta)
+ kfree(irq_cpu_data[i].irq_delta);
+ if (irq_cpu_data[i].last_irq)
+ kfree(irq_cpu_data[i].last_irq);
}
+ return 0;
}
-#else /* !SMP */
-static inline void balance_irq(int irq) { }
-#endif
+
+static int __init irqbalance_disable(char *str)
+{
+ irqbalance_disabled = 1;
+ return 0;
+}
+
+__setup("noirqbalance", irqbalance_disable);
+
+static void set_ioapic_affinity (unsigned int irq, unsigned long mask);
+
+static inline void move_irq(int irq)
+{
+ /* note - we hold the desc->lock */
+ if (unlikely(irq_balance_mask[irq])) {
+ set_ioapic_affinity(irq, irq_balance_mask[irq]);
+ irq_balance_mask[irq] = 0;
+ }
+}
+
+__initcall(balanced_irq_init);
+
+#endif /* defined(CONFIG_SMP) */
+

/*
* support for broken MP BIOSs, enables hand-redirection of PIRQ0-7 to
@@ -1308,7 +1618,7 @@
*/
static void ack_edge_ioapic_irq(unsigned int irq)
{
- balance_irq(irq);
+ move_irq(irq);
if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_DISABLED))
== (IRQ_PENDING | IRQ_DISABLED))
mask_IO_APIC_irq(irq);
@@ -1348,7 +1658,7 @@
unsigned long v;
int i;

- balance_irq(irq);
+ move_irq(irq);
/*
* It appears there is an erratum which affects at least version 0x11
* of I/O APIC (that's the 82093AA and cores integrated into various



Attachments:
kirqb_2.5.52.ZIP (4.38 kB)

2002-12-20 11:07:50

by William Lee Irwin III

Subject: Re: [PATCH] IRQ distribution in the 2.5.52 kernel

On Fri, Dec 20, 2002 at 01:08:18AM -0800, Kamble, Nitin A wrote:
> -static inline void balance_irq(int irq)
> +static inline void balance_irq (int cpu, int irq)
> {
> - irq_balance_t *entry = irq_balance + irq;
> unsigned long now = jiffies;
> -
> + unsigned long allowed_mask;
> + unsigned int new_cpu;
> +
> if (clustered_apic_mode)
> return;

This can actually be done for clustered_apic_mode, just not as the
IO-APIC is currently programmed. It needs to either:

(1) develop awareness of multiple APIC buses and not attempt to perform
physical mode interrupt delivery to non-existent destinations or
overflow destination bitmasks to cpus not physically addressable
from the local cluster

or

(2) the IO-APIC must be programmed for clustered hierarchical destinations
in clustered setups, which probably isn't that hot an idea, as the
IO-APICs in such setups usually have some affinity to the locally
addressable physical destinations

These are both a PITA, but I thought I'd just sort of fling the issues
into open discussion since something touching this code hit the list.

There's some complexity in these schemes so unless you feel brave and/or
interested there's no need for you to run off and implement them etc.

No criticism of or flaws in your patch implied.


Thanks,
Bill