2003-05-06 18:49:50

by Keith Mannthey

Subject: [RFC][Patch] fix for irq_affinity_write_proc v2.5

Hello,
irq_affinity_write_proc currently directly calls set_ioapic_affinity
which writes to the ioapic. This undermines the work done by kirqd by
writing a cpu mask directly to the ioapic. I propose the following patch
to tie the /proc affinity writes into the same code path as kirqd.
Kirqd will enforce the affinity requested by the user.
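For reference, setting affinity from userspace is just a write of a hex
cpu mask to /proc/irq/N/smp_affinity; here is a minimal sketch (the irq
number 19 and the mask value are arbitrary examples, not part of the
patch):

/* Minimal userspace sketch: write a cpu mask to /proc/irq/N/smp_affinity.
 * The irq number (19) and the mask ("2" = cpu 1 only) are arbitrary
 * examples. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/proc/irq/19/smp_affinity";
	const char *mask = "2\n";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, mask, strlen(mask)) < 0)
		perror("write");
	close(fd);
	return 0;
}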

Keith Mannthey


diff -urN linux-2.5.68/arch/i386/kernel/irq.c linux-2.5.68-procfix/arch/i386/kernel/irq.c
--- linux-2.5.68/arch/i386/kernel/irq.c Sat Apr 19 19:48:50 2003
+++ linux-2.5.68-procfix/arch/i386/kernel/irq.c Thu May 8 13:47:38 2003
@@ -871,8 +871,11 @@
return -EINVAL;

irq_affinity[irq] = new_value;
- irq_desc[irq].handler->set_affinity(irq, new_value);
-
+ if (irqbalance_disabled)
+ irq_desc[irq].handler->set_affinity(irq, new_value);
+ else
+ do_irq_balance();
+
return full_count;
}



2003-05-06 20:41:08

by Alan

Subject: Re: [RFC][Patch] fix for irq_affinity_write_proc v2.5

On Maw, 2003-05-06 at 20:03, Keith Mannthey wrote:
> Hello,
> irq_affinity_write_proc currently directly calls set_ioapic_affinity
> which writes to the ioapic. This undermines the work done by kirqd by
> writing a cpu mask directly to the ioapic. I propose the following patch
> to tie the /proc affinity writes into the same code path as kirqd.
> Kirqd will enforce the affinity requested by the user.

Why should the kernel be enforcing policy here. You have to be root to
do this, and root should have the ability to configure apparently stupid
things because they may find them useful.


2003-05-06 21:44:44

by Keith Mannthey

Subject: Re: [RFC][Patch] fix for irq_affinity_write_proc v2.5

> Why should the kernel be enforcing policy here. You have to be root to
> do this, and root should have the ability to configure apparently stupid
> things because they may find them useful.

What the kernel does now seems incorrect.
With kirqd running (irqbalance_disabled == 0) the kernel writes an
arbitrary cpu mask to the ioapic. The kirqd thread only maps the irq to
a single cpu at a time; mapping an irq to multiple cpus breaks this
model. A root user could destroy what kirqd is doing for them by
setting affinity via /proc, which writes to the ioapic directly. Let
kirqd manage the affinity. (NOTE: I suspect that do_irq_balance might
not set affinity until the irq needs to be balanced and am looking for
a better thing to do.)
Also, if a user writes an arbitrary cpu map to the ioapic, the next
time the irq needs to be balanced by kirqd their value will be
overwritten. The affinity defined in the irq_affinity array will still
be valid, but the mapping the user wrote to the ioapic will be
destroyed.
This seems to be a lose-lose situation. If users want to map irqs to
multiple cpus at the ioapic level, they should use their root
privileges to boot with noirqbalance. If kirqd is running they should
let it do most of the work.

Keith


2003-05-07 02:14:59

by William Lee Irwin III

Subject: Re: [RFC][Patch] fix for irq_affinity_write_proc v2.5

On Maw, 2003-05-06 at 20:03, Keith Mannthey wrote:
>> irq_affinity_write_proc currently directly calls set_ioapic_affinity
>> which writes to the ioapic. This undermines the work done by kirqd by
>> writing a cpu mask directly to the ioapic. I propose the following patch
>> to tie the /proc affinity writes into the same code path as kirqd.
>> Kirqd will enforce the affinity requested by the user.

On Tue, May 06, 2003 at 08:54:35PM +0100, Alan Cox wrote:
> Why should the kernel be enforcing policy here. You have to be root to
> do this, and root should have the ability to configure apparently stupid
> things because they may find them useful.

It's basically not working as specified for clustered hierarchical, and
in truth the specification can never be met. As it stands most calls to
it are lethal on such systems, especially those using physical destmod.

I'd prefer to have it redesigned for some validity checking and error
returns as on such systems the impossible destinations serve no purpose
but raising MCE's and/or deadlocking the box.


-- wli

2003-05-07 05:58:37

by Martin J. Bligh

Subject: Re: [RFC][Patch] fix for irq_affinity_write_proc v2.5

>> irq_affinity_write_proc currently directly calls set_ioapic_affinity
>> which writes to the ioapic. This undermines the work done by kirqd by
>> writing a cpu mask directly to the ioapic. I propose the following patch
>> to tie the /proc affinity writes into the same code path as kirqd.
>> Kirqd will enforce the affinity requested by the user.
>
> Why should the kernel be enforcing policy here. You have to be root to
> do this, and root should have the ability to configure apparently stupid
> things because they may find them useful.

Whilst in general, I'd agree, in this case it makes no sense - there are
two masks set: the irqbalance mask, and the apicid mask. If irqbalance
is on, setting the apicid mask doesn't really do anything, because
irqbalance is going to reset it dynamically very shortly afterwards
anyway. It's hard to imagine how that might be useful ;-)

It might conceivably be useful to disable irqbalance dynamically, if that's
what you mean, but that shouldn't be the default behaviour, I think.
Keith already has a nice simple patch to make it a config option ... doing it
dynamically is a different project, IMHO, and one of questionable utility
to anyone not capable of coding it themselves ;-)

M.

2003-05-07 13:02:22

by Alan

Subject: Re: [RFC][Patch] fix for irq_affinity_write_proc v2.5

On Mer, 2003-05-07 at 03:27, William Lee Irwin III wrote:
> It's basically not working as specified for clustered hierarchical, and
> in truth the specification can never be met. As it stands most calls to
> it are lethal on such systems, especially those using physical destmod.
>
> I'd prefer to have it redesigned for some validity checking and error
> returns as on such systems the impossible destinations serve no purpose
> but raising MCE's and/or deadlocking the box.

In which case I agree it does indeed make sense

2003-05-07 17:59:35

by Keith Mannthey

Subject: Re: [RFC][Patch] fix for irq_affinity_write_proc v2.5

> > I'd prefer to have it redesigned for some validity checking and error
> > returns as on such systems the impossible destinations serve no purpose
> > but raising MCE's and/or deadlocking the box.
>
> In which case I agree it does indeed make sense

As a matter of fact I have a patch that attempts to bring order to
set_ioapic_affinity, the cause of the underlying problem for clustered
apics. I am not a big fan of the /proc interface because the values the
user sets and sees may or may not be the values written to the apic.
With kirqd running the user has no idea which cpu the irq is set to, so
the user can be misinformed via the /proc interface. The following patch
doesn't deal with this issue however. Maybe a later one will.
In the current kernel it is unclear whether the value being
passed to set_ioapic_affinity is a cpu mask or a valid apicid. In
irq_affinity_write_proc the kernel passes on a cpu mask, but the kirqd
thread passes on logical apicids. In flat apic mode this is not an
issue because a cpu mask represents the apic value. However, in
clustered apic mode the cpu mask is very different from the logical
apicid.
This is an attempt to do the right thing for clustered apics. I
clarify that the value being passed to set_ioapic_affinity is a cpu mask,
not an apicid. set_ioapic_affinity will do the conversion to logical
apicids via a new function defined in the sub-arch called
cpu_mask_to_apicid. Since many cpu masks don't map to valid apicids in
clustered apic mode, TARGET_CPUS is used as a default value when such a
situation occurs. I think this is a good step in making irq_affinity
clustered apic safe.
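
To illustrate the conversion, here is a standalone userspace sketch of
the same logic. The cpu_to_logical_apicid() mapping below (four cpus
per cluster, one bit per cpu in the low nibble) and the TARGET_CPUS
value are made-up stand-ins for the real sub-arch definitions:

/* Standalone sketch of cpu_mask_to_apicid(); the mapping and TARGET_CPUS
 * below are hypothetical stand-ins for the sub-arch definitions. */
#include <stdio.h>
#include <strings.h>		/* ffs() */

#define TARGET_CPUS	0x01	/* made-up default destination */

/* Hypothetical clustered mapping: 4 cpus per cluster, cluster id in the
 * high nibble, one bit per cpu in the low nibble. */
static unsigned int cpu_to_logical_apicid(int cpu)
{
	return ((cpu / 4) << 4) | (1 << (cpu % 4));
}

#define apicid_cluster(apicid)	((apicid) & 0xF0)

static unsigned int cpu_mask_to_apicid(unsigned long cpumask)
{
	int cpus_found = 0, num_bits_set = 0;
	int cpu, apicid;

	for (cpu = 0; cpu < 32; cpu++)
		if (cpumask & (1UL << cpu))
			num_bits_set++;
	if (num_bits_set == 32)
		return 0xFF;			/* broadcast to all */

	cpu = ffs(cpumask) - 1;
	apicid = cpu_to_logical_apicid(cpu);
	while (cpus_found < num_bits_set) {
		if (cpumask & (1UL << cpu)) {
			int new_apicid = cpu_to_logical_apicid(cpu);
			if (apicid_cluster(apicid) != apicid_cluster(new_apicid))
				return TARGET_CPUS;	/* mask spans clusters */
			apicid |= new_apicid;
			cpus_found++;
		}
		cpu++;
	}
	return apicid;
}

int main(void)
{
	/* cpus 0 and 1 share a cluster -> combined logical id 0x03 */
	printf("mask 0x03 -> apicid 0x%02x\n", cpu_mask_to_apicid(0x03));
	/* cpus 0 and 4 are in different clusters -> falls back to TARGET_CPUS */
	printf("mask 0x11 -> apicid 0x%02x\n", cpu_mask_to_apicid(0x11));
	return 0;
}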

Keith

diff -urN linux-2.5.68/arch/i386/kernel/io_apic.c linux-2.5.68-fix_irq_affinity/arch/i386/kernel/io_apic.c
--- linux-2.5.68/arch/i386/kernel/io_apic.c Sat Apr 19 19:49:09 2003
+++ linux-2.5.68-fix_irq_affinity/arch/i386/kernel/io_apic.c Tue May 6 23:10:18 2003
@@ -240,22 +240,22 @@
clear_IO_APIC_pin(apic, pin);
}

-static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
+static void set_ioapic_affinity (unsigned int irq, unsigned long cpu_mask)
{
unsigned long flags;
int pin;
struct irq_pin_list *entry = irq_2_pin + irq;
-
- /*
- * Only the first 8 bits are valid.
- */
- mask = mask << 24;
+ unsigned int apicid_value;
+
+ apicid_value = cpu_mask_to_apicid(cpu_mask);
+ /* Prepare to do the io_apic_write */
+ apicid_value = apicid_value << 24;
spin_lock_irqsave(&ioapic_lock, flags);
for (;;) {
pin = entry->pin;
if (pin == -1)
break;
- io_apic_write(entry->apic, 0x10 + 1 + pin*2, mask);
+ io_apic_write(entry->apic, 0x10 + 1 + pin*2, apicid_value);
if (!entry->next)
break;
entry = irq_2_pin + entry->next;
@@ -279,7 +279,7 @@

extern unsigned long irq_affinity[NR_IRQS];

-static int __cacheline_aligned pending_irq_balance_apicid[NR_IRQS];
+static int __cacheline_aligned pending_irq_balance_cpumask[NR_IRQS];
static int irqbalance_disabled = NO_BALANCE_IRQ;
static int physical_balance = 0;

@@ -352,7 +352,7 @@
unsigned long flags;

spin_lock_irqsave(&desc->lock, flags);
- pending_irq_balance_apicid[irq]=cpu_to_logical_apicid(new_cpu);
+ pending_irq_balance_cpumask[irq] = 1 << new_cpu;
spin_unlock_irqrestore(&desc->lock, flags);
}
}
@@ -549,8 +549,7 @@
selected_irq, min_loaded);
/* mark for change destination */
spin_lock_irqsave(&desc->lock, flags);
- pending_irq_balance_apicid[selected_irq] =
- cpu_to_logical_apicid(min_loaded);
+ pending_irq_balance_cpumask[selected_irq] = 1 << min_loaded;
spin_unlock_irqrestore(&desc->lock, flags);
/* Since we made a change, come back sooner to
* check for more variation.
@@ -582,7 +581,7 @@

/* push everything to CPU 0 to give us a starting point. */
for (i = 0 ; i < NR_IRQS ; i++)
- pending_irq_balance_apicid[i] = cpu_to_logical_apicid(0);
+ pending_irq_balance_cpumask[i] = 1;

repeat:
set_current_state(TASK_INTERRUPTIBLE);
@@ -659,9 +658,9 @@
static inline void move_irq(int irq)
{
/* note - we hold the desc->lock */
- if (unlikely(pending_irq_balance_apicid[irq])) {
- set_ioapic_affinity(irq, pending_irq_balance_apicid[irq]);
- pending_irq_balance_apicid[irq] = 0;
+ if (unlikely(pending_irq_balance_cpumask[irq])) {
+ set_ioapic_affinity(irq, pending_irq_balance_cpumask[irq]);
+ pending_irq_balance_cpumask[irq] = 0;
}
}

diff -urN linux-2.5.68/include/asm-i386/mach-bigsmp/mach_apic.h linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-bigsmp/mach_apic.h
--- linux-2.5.68/include/asm-i386/mach-bigsmp/mach_apic.h Sat Apr 19 19:51:08 2003
+++ linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-bigsmp/mach_apic.h Tue May 6 22:28:37 2003
@@ -27,7 +27,7 @@
#define APIC_BROADCAST_ID (0x0f)
#define check_apicid_used(bitmap, apicid) (0)
#define check_apicid_present(bit) (phys_cpu_present_map & (1 << bit))
-
+#define apicid_cluster(apicid) (apicid & 0xF0)
static inline unsigned long calculate_ldr(unsigned long old)
{
unsigned long id;
@@ -115,4 +115,37 @@
return (1);
}

+static inline unsigned int cpu_mask_to_apicid (unsigned long cpumask)
+{
+ int num_bits_set;
+ int cpus_found = 0;
+ int cpu;
+ int apicid;
+
+ num_bits_set = hweight32(cpumask);
+ /* Return id to all */
+ if (num_bits_set == 32)
+ return (int) 0xFF;
+ /*
+ * The cpus in the mask must all be on the same apicid cluster. If they
+ * are not, return the default value of TARGET_CPUS.
+ */
+ cpu = ffs(cpumask)-1;
+ apicid = cpu_to_logical_apicid(cpu);
+ while (cpus_found < num_bits_set) {
+ if (cpumask & (1 << cpu)) {
+ int new_apicid = cpu_to_logical_apicid(cpu);
+ if (apicid_cluster(apicid) !=
+ apicid_cluster(new_apicid)){
+ printk ("%s Not a valid mask! \n",__FUNCTION__);
+ return TARGET_CPUS;
+ }
+ apicid = apicid | new_apicid;
+ cpus_found++;
+ }
+ cpu++;
+ }
+ return apicid;
+}
+
#endif /* __ASM_MACH_APIC_H */
diff -urN linux-2.5.68/include/asm-i386/mach-default/mach_apic.h linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-default/mach_apic.h
--- linux-2.5.68/include/asm-i386/mach-default/mach_apic.h Sat Apr 19 19:51:19 2003
+++ linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-default/mach_apic.h Tue May 6 22:14:57 2003
@@ -99,4 +99,9 @@
return test_bit(boot_cpu_physical_apicid, &phys_cpu_present_map);
}

+static inline unsigned int cpu_mask_to_apicid (unsigned long cpumask)
+{
+ return cpumask;
+}
+
#endif /* __ASM_MACH_APIC_H */
diff -urN linux-2.5.68/include/asm-i386/mach-numaq/mach_apic.h linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-numaq/mach_apic.h
--- linux-2.5.68/include/asm-i386/mach-numaq/mach_apic.h Sat Apr 19 19:49:17 2003
+++ linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-numaq/mach_apic.h Tue May 6 22:51:02 2003
@@ -14,6 +14,7 @@
#define APIC_BROADCAST_ID 0x0F
#define check_apicid_used(bitmap, apicid) ((bitmap) & (1 << (apicid)))
#define check_apicid_present(bit) (phys_cpu_present_map & (1 << bit))
+#define apicid_cluster(apicid) (apicid & 0xF0)

static inline int apic_id_registered(void)
{
@@ -103,4 +104,37 @@
return (1);
}

+static inline unsigned int cpu_mask_to_apicid (unsigned long cpumask)
+{
+ int num_bits_set;
+ int cpus_found = 0;
+ int cpu;
+ int apicid;
+
+ num_bits_set = hweight32(cpumask);
+ /* Return id to all */
+ if (num_bits_set == 32)
+ return (int) 0xFF;
+ /*
+ * The cpus in the mask must all be on the same apicid cluster. If they
+ * are not, return the default value of TARGET_CPUS.
+ */
+ cpu = ffs(cpumask)-1;
+ apicid = cpu_to_logical_apicid(cpu);
+ while (cpus_found < num_bits_set) {
+ if (cpumask & (1 << cpu)) {
+ int new_apicid = cpu_to_logical_apicid(cpu);
+ if (apicid_cluster(apicid) !=
+ apicid_cluster(new_apicid)){
+ printk ("%s Not a valid mask! \n",__FUNCTION__);
+ return TARGET_CPUS;
+ }
+ apicid = apicid | new_apicid;
+ cpus_found++;
+ }
+ cpu++;
+ }
+ return apicid;
+}
+
#endif /* __ASM_MACH_APIC_H */
diff -urN linux-2.5.68/include/asm-i386/mach-summit/mach_apic.h linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-summit/mach_apic.h
--- linux-2.5.68/include/asm-i386/mach-summit/mach_apic.h Sat Apr 19 19:50:06 2003
+++ linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-summit/mach_apic.h Tue May 6 22:12:04 2003
@@ -23,7 +23,7 @@

/* we don't use the phys_cpu_present_map to indicate apicid presence */
#define check_apicid_present(bit) (x86_summit ? 1 : (phys_cpu_present_map & (1 << bit)))
-
+#define apicid_cluster(apicid) (apicid & 0xF0)
extern u8 bios_cpu_apicid[];

static inline void init_apic_ldr(void)
@@ -113,4 +113,37 @@
return test_bit(boot_cpu_physical_apicid, &phys_cpu_present_map);
}

+static inline unsigned int cpu_mask_to_apicid (unsigned long cpumask)
+{
+ int num_bits_set;
+ int cpus_found = 0;
+ int cpu;
+ int apicid;
+
+ num_bits_set = hweight32(cpumask);
+ /* Return id to all */
+ if (num_bits_set == 32)
+ return (int) 0xFF;
+ /*
+ * The cpus in the mask must all be on the same apicid cluster. If they
+ * are not, return the default value of TARGET_CPUS.
+ */
+ cpu = ffs(cpumask)-1;
+ apicid = cpu_to_logical_apicid(cpu);
+ while (cpus_found < num_bits_set) {
+ if (cpumask & (1 << cpu)) {
+ int new_apicid = cpu_to_logical_apicid(cpu);
+ if (apicid_cluster(apicid) !=
+ apicid_cluster(new_apicid)){
+ printk ("%s Not a valid mask! \n",__FUNCTION__);
+ return TARGET_CPUS;
+ }
+ apicid = apicid | new_apicid;
+ cpus_found++;
+ }
+ cpu++;
+ }
+ return apicid;
+}
+
#endif /* __ASM_MACH_APIC_H */
diff -urN linux-2.5.68/include/asm-i386/mach-visws/mach_apic.h linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-visws/mach_apic.h
--- linux-2.5.68/include/asm-i386/mach-visws/mach_apic.h Sat Apr 19 19:48:49 2003
+++ linux-2.5.68-fix_irq_affinity/include/asm-i386/mach-visws/mach_apic.h Tue May 6 22:16:52 2003
@@ -77,4 +77,8 @@
return test_bit(boot_cpu_physical_apicid, &phys_cpu_present_map);
}

+static inline unsigned int cpu_mask_to_apicid (unsigned long cpumask)
+{
+ return cpumask;
+}
#endif /* __ASM_MACH_APIC_H */
diff -urN linux-2.5.68/irq_proc_patch linux-2.5.68-fix_irq_affinity/irq_proc_patch
--- linux-2.5.68/irq_proc_patch Wed Dec 31 16:00:00 1969
+++ linux-2.5.68-fix_irq_affinity/irq_proc_patch Fri May 2 14:01:47 2003
@@ -0,0 +1,16 @@
+--- ../linux-2.5.68/arch/i386/kernel/irq.c Sat Apr 19 19:48:50 2003
++++ arch/i386/kernel/irq.c Fri May 2 14:01:12 2003
+@@ -871,8 +871,11 @@
+ return -EINVAL;
+
+ irq_affinity[irq] = new_value;
+- irq_desc[irq].handler->set_affinity(irq, new_value);
+-
++ if (irqbalance_disabled) {
++ irq_desc[irq].handler->set_affinity(irq, new_value);
++ } else {
++ do_irq_balance();
++ }
+ return full_count;
+ }
+