2003-05-24 03:23:38

by James Cleverdon

Subject: [PATCH][2.5] provisional 32-way x445 patches

Here's what it took to get 2.5.68 going on a 32-way x445. It contains a
kludge in setup_ioapic_ids_from_mps() that needs to be virtualized.

Note the code in init_apic_ldr() that breaks the connection between logical
and physical APIC IDs. That's there because the BIOS folks reserve the right
to make the physical numbering scheme anything they want at > 16-way. I
suspect that other vendors will have the same problems (P4s only latch 6 bits
of physical APIC ID on reset) and will come to entirely different solutions.
Best to make a clean break and accept any scheme.
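
The init_apic_ldr() change in the patch below can be modelled in user space roughly as follows. This is only a sketch of the idea: NR_CPUS = 32, BAD_APICID = 0xFF, and the helper names reset_logical_apicids() and summit_logical_apicid() are assumptions made for illustration, not kernel code. The logical ID keeps the cluster nibble of the physical ID and takes the next free one-hot bit within that cluster, so the low bits of the physical ID no longer matter.

```c
#include <stdint.h>
#include <string.h>

#define NR_CPUS                 32      /* assumed for this model */
#define BAD_APICID              0xFFu   /* assumed "no ID assigned" marker */
#define XAPIC_DEST_CLUSTER_MASK 0xF0u   /* upper nibble selects the cluster */

/* User-space stand-in for the kernel's cpu -> logical APIC ID table. */
static uint8_t cpu_2_logical_apicid[NR_CPUS];

/* Mark every slot as unassigned. */
static void reset_logical_apicids(void)
{
	memset(cpu_2_logical_apicid, BAD_APICID, sizeof(cpu_2_logical_apicid));
}

/*
 * Pick a logical ID for the CPU with physical ID phys_id: keep its
 * cluster nibble, count the CPUs already registered in that cluster,
 * and use the count as the position of the one-hot CPU bit.  The
 * count is clamped to 3, mirroring the patch's "Kludging" path.
 */
static uint8_t summit_logical_apicid(uint8_t phys_id)
{
	uint8_t cluster = phys_id & XAPIC_DEST_CLUSTER_MASK;
	int count = 0;

	for (int i = 0; i < NR_CPUS; i++) {
		uint8_t lid = cpu_2_logical_apicid[i];

		if (lid == BAD_APICID)
			continue;
		if ((lid & XAPIC_DEST_CLUSTER_MASK) == cluster)
			count++;
	}
	if (count > 3)
		count = 3;
	return (uint8_t)(cluster | (1u << count));
}
```

For comparison, the removed xapic_phys_to_log_apicid() macro would map physical ID 0x13 to 0x18, because it derived the CPU bit from the low two bits of the physical ID. The model above gives 0x11 for the first CPU seen in cluster 0x10, whatever its low bits are.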

Martin, I'm taking a couple of extra days off past Memorial Day, so we'll have
to discuss hiding the use of x86_summit later next week.


diff -pru 2.5.68/arch/i386/kernel/io_apic.c t68/arch/i386/kernel/io_apic.c
--- 2.5.68/arch/i386/kernel/io_apic.c 2003-04-19 22:49:09.000000000 -0400
+++ t68/arch/i386/kernel/io_apic.c 2003-05-07 11:12:49.000000000 -0400
@@ -1105,11 +1105,12 @@ static inline int IO_APIC_irq_trigger(in
return 0;
}

-int irq_vector[NR_IRQS] = { FIRST_DEVICE_VECTOR , 0 };
+u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 };

static int __init assign_irq_vector(int irq)
{
static int current_vector = FIRST_DEVICE_VECTOR, offset = 0;
+ BUG_ON(irq >= sizeof(irq_vector)/sizeof(*irq_vector));
if (IO_APIC_VECTOR(irq) > 0)
return IO_APIC_VECTOR(irq);
next:
@@ -1592,6 +1593,11 @@ static void __init setup_ioapic_ids_from
mp_ioapics[apic].mpc_apicid = reg_00.ID;
}

+ /* Temp kludge. Anyway, the BIOS sets the IDs correctly on Summit boxen. */
+ { extern int x86_summit;
+ if (x86_summit)
+ continue;
+ }
/*
* Sanity check, is the ID really free? Every APIC in a
* system must have a unique ID or we get lots of nice
diff -pru 2.5.68/include/asm-i386/hw_irq.h t68/include/asm-i386/hw_irq.h
--- 2.5.68/include/asm-i386/hw_irq.h 2003-04-19 22:51:13.000000000 -0400
+++ t68/include/asm-i386/hw_irq.h 2003-05-07 11:01:13.000000000 -0400
@@ -24,8 +24,13 @@
* Interrupt entry/exit code at both C and assembly level
*/

-extern int irq_vector[NR_IRQS];
-#define IO_APIC_VECTOR(irq) irq_vector[irq]
+/* The upper limit of irq_vector's size is 16 + sum(num_RTEs_in_IO_APICs).
+ * On 32-way x445s this is already 266 without any I/O expansion boxes.
+ * This should eventually be dynamically allocated.
+ */
+#define NR_IRQ_VECTORS (4 * NR_IRQS)
+extern u8 irq_vector[NR_IRQ_VECTORS];
+#define IO_APIC_VECTOR(irq) (int)(irq_vector[irq])

extern void (*interrupt[NR_IRQS])(void);

diff -pru 2.5.68/include/asm-i386/mach-summit/mach_apic.h t68/include/asm-i386/mach-summit/mach_apic.h
--- 2.5.68/include/asm-i386/mach-summit/mach_apic.h 2003-04-19 22:50:06.000000000 -0400
+++ t68/include/asm-i386/mach-summit/mach_apic.h 2003-05-07 11:05:20.000000000 -0400
@@ -9,9 +9,6 @@ extern int x86_summit;
#define XAPIC_DEST_CPUS_MASK 0x0Fu
#define XAPIC_DEST_CLUSTER_MASK 0xF0u

-#define xapic_phys_to_log_apicid(phys_apic) ( (1ul << ((phys_apic) & 0x3)) |\
- ((phys_apic) & XAPIC_DEST_CLUSTER_MASK) )
-
#define APIC_DFR_VALUE (x86_summit ? APIC_DFR_CLUSTER : APIC_DFR_FLAT)
#define TARGET_CPUS (x86_summit ? XAPIC_DEST_CPUS_MASK : cpu_online_map)

@@ -25,14 +22,31 @@ extern int x86_summit;
#define check_apicid_present(bit) (x86_summit ? 1 : (phys_cpu_present_map & (1 << bit)))

extern u8 bios_cpu_apicid[];
+extern volatile u8 cpu_2_logical_apicid[];

static inline void init_apic_ldr(void)
{
unsigned long val, id;

- if (x86_summit)
- id = xapic_phys_to_log_apicid(hard_smp_processor_id());
- else
+ if (x86_summit) {
+ int i, count;
+ u8 lid;
+ u8 my_id = (u8)hard_smp_processor_id();
+ u8 my_cluster = my_id & XAPIC_DEST_CLUSTER_MASK;
+
+ for (count = 0, i = NR_CPUS; --i >= 0; ) {
+ lid = cpu_2_logical_apicid[i];
+ if (lid == BAD_APICID)
+ continue;
+ if ((lid & 0xF0) == my_cluster)
+ ++count; /* got one */
+ }
+ if (count > 3) {
+ printk("init_apic_ldr: Found %d CPUs in APIC cluster 0x%X! Kludging CPU 0x%02X...\n", count, my_cluster, my_id);
+ count = 3;
+ }
+ id = my_cluster | (1UL << count);
+ } else
id = 1UL << smp_processor_id();
apic_write_around(APIC_DFR, APIC_DFR_VALUE);
val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
@@ -62,7 +76,6 @@ static inline int apicid_to_node(int log
}

/* Mapping from cpu number to logical apicid */
-extern volatile u8 cpu_2_logical_apicid[];
static inline int cpu_to_logical_apicid(int cpu)
{
return (int)cpu_2_logical_apicid[cpu];


--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com


2003-05-24 18:44:03

by Zwane Mwaikambo

Subject: Re: [PATCH][2.5] provisional 32-way x445 patches

On Fri, 23 May 2003, James Cleverdon wrote:

> -int irq_vector[NR_IRQS] = { FIRST_DEVICE_VECTOR , 0 };
> +u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 };
>
> static int __init assign_irq_vector(int irq)
> {
> static int current_vector = FIRST_DEVICE_VECTOR, offset = 0;
> + BUG_ON(irq >= sizeof(irq_vector)/sizeof(*irq_vector));

Can't you just skip that one (-ENOSPC)? It would oops on 32way NUMAQ. Why
don't we just fix this properly now? It looks like we might end up just
piling workarounds on top of kludges. The Intel guys were kind enough to
crank out a vector-based IRQ handling patch, and it's just what we need to
purge NR_IRQS misuse.
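
The "-ENOSPC instead of BUG_ON()" suggestion might look something like the sketch below. It is a user-space illustration only: record_irq_vector() is a hypothetical helper, not a function in the kernel or in the patch, and NR_IRQS = 224 is the figure quoted elsewhere in this thread. The point is simply that an out-of-range IRQ is refused with an error code rather than crashing the machine.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NR_IRQS        224              /* figure quoted in this thread */
#define NR_IRQ_VECTORS (4 * NR_IRQS)    /* as defined by the patch */

/* User-space stand-in for the kernel's irq_vector[] table. */
static uint8_t irq_vector[NR_IRQ_VECTORS];

/*
 * Hypothetical helper: store the vector for an IRQ and return it,
 * or return -ENOSPC when the IRQ number does not fit in the table.
 */
static int record_irq_vector(int irq, uint8_t vector)
{
	if (irq < 0 ||
	    (size_t)irq >= sizeof(irq_vector) / sizeof(irq_vector[0]))
		return -ENOSPC;

	irq_vector[irq] = vector;
	return vector;
}
```

A caller would then have to decide what to do with the IRQs that were refused, for example leaving those interrupt sources unrouted.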

> */
>
> -extern int irq_vector[NR_IRQS];
> -#define IO_APIC_VECTOR(irq) irq_vector[irq]
> +/* The upper limit of irq_vector's size is 16 + sum(num_RTEs_in_IO_APICs).
> + * On 32-way x445s this is already 266 without any I/O expansion boxes.
> + * This should eventually be dynamically allocated.
> + */
> +#define NR_IRQ_VECTORS (4 * NR_IRQS)
> +extern u8 irq_vector[NR_IRQ_VECTORS];
> +#define IO_APIC_VECTOR(irq) (int)(irq_vector[irq])

This just makes the relationship between NR_IRQS and NR_IRQ_VECTORS even
more confusing. If you have one IDT, NR_IRQ_VECTORS is ~190.

Zwane
--
function.linuxpower.ca

2003-05-24 21:00:55

by Zwane Mwaikambo

Subject: Re: [PATCH][2.5] provisional 32-way x445 patches

On Sat, 24 May 2003, Zwane Mwaikambo wrote:

> Can't you just skip that one (-ENOSPC)? It would oops on 32way NUMAQ. Why

crackpipe alert!! i didn't notice your s/int/u8/ change, regardless how
are you handling irqs > 256? The code in 2.5.68 simply overwrites previous
vector entries in your IDT when it runs out.

Zwane
--
function.linuxpower.ca

2003-06-03 01:22:59

by James Cleverdon

Subject: Re: [PATCH][2.5] provisional 32-way x445 patches

On Saturday 24 May 2003 02:03 pm, Zwane Mwaikambo wrote:
> On Sat, 24 May 2003, Zwane Mwaikambo wrote:
> > Can't you just skip that one (-ENOSPC)? It would oops on 32way NUMAQ. Why
>
> crackpipe alert!! i didn't notice your s/int/u8/ change, regardless how
> are you handling irqs > 256? The code in 2.5.68 simply overwrites previous
> vector entries in your IDT when it runs out.
>
> Zwane

Back from holiday.

This kludge doesn't change any IDT behavior, so it is vulnerable to vector
exhaustion too. It just deals with large systems that have large I/O APICs.
Since we are indexing irq_vectors by the sum of all available I/O APIC RTEs
and not checking for overflow, we can get into trouble.

Some numbers:

* A 32-way x445 is made up of four 8-way chassis hooked together by
scalability cables.

* Each Summit chassis has 2 I/O APICs with 50 RTEs per. The BIOS guys are
trying to help out by using some hardware to only use one I/O APIC for all
but the boot chassis.

* Each RXE100 PCI expansion box contains one or two I/O APICs with 50 RTEs
each. Every chassis can have one RXE100.

Even without PCI expansion boxes, 5 * 50 == 250 which is > 224. The kernel
overflows irq_vectors and dies.

Since the value stuffed into irq_vectors is 0x31 to 0xF8, it easily fits into
a byte. As a quick kludge, I changed the type of irq_vectors and quadrupled
the number. With 896 elements in the array, the system survived and ran.
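
The arithmetic above can be checked with a small sketch. The function names and the LEGACY_IRQS constant are made up for illustration. The 16 comes from the "16 + sum(num_RTEs_in_IO_APICs)" comment in the patch, and the I/O APIC counts follow the description above: two in the boot chassis, one in every other chassis, plus whatever the RXE100 boxes add.

```c
#define RTES_PER_IOAPIC 50              /* per I/O APIC, as stated above */
#define NR_IRQS         224             /* old irq_vector[] size */
#define NR_IRQ_VECTORS  (4 * NR_IRQS)   /* new size from the patch */
#define LEGACY_IRQS     16              /* the "16 +" in the patch comment */

/*
 * I/O APICs visible to the kernel: two in the boot chassis, one in
 * each remaining chassis, plus any in RXE100 expansion boxes.
 */
static int x445_ioapic_count(int chassis, int rxe100_ioapics)
{
	return 2 + (chassis - 1) + rxe100_ioapics;
}

/* Total redirection table entries across all visible I/O APICs. */
static int x445_rte_count(int chassis, int rxe100_ioapics)
{
	return x445_ioapic_count(chassis, rxe100_ioapics) * RTES_PER_IOAPIC;
}

/* Upper limit on irq_vector[] indices, per the patch comment. */
static int irq_vector_upper_bound(int chassis, int rxe100_ioapics)
{
	return LEGACY_IRQS + x445_rte_count(chassis, rxe100_ioapics);
}
```

With four chassis and no expansion boxes this gives 250 RTEs and an upper bound of 266, both above 224. With four RXE100s carrying two I/O APICs each, the bound becomes 666, which still fits in 896.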

For a real fix, irq_vectors should be dynamically allocated. But then, I
should port the dynamic MAX_MP_BUSSES patch from 2.4 to 2.5 anyway....

--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

2003-06-03 02:41:20

by Zwane Mwaikambo

Subject: Re: [PATCH][2.5] provisional 32-way x445 patches

On Mon, 2 Jun 2003, James Cleverdon wrote:

> Back from holiday.
>
> This kludge doesn't change any IDT behavior, so it is vulnerable to vector
> exhaustion too. It just deals with large systems that have large I/O APICs.
> Since we are indexing irq_vectors by the sum of all available I/O APIC RTEs
> and not checking for overflow, we can get into trouble.

Yeah, I get the same with 2.5.70 on a 32-way NUMAQ: basically we just keep
going regardless of what NR_IRQS is and then write garbage all over the
IDT. There appears to be something wrong with my patch, as it gets
ignored each time I send it.

> Some numbers:
>
> * A 32-way x445 is made up of four 8-way chassis hooked together by
> scalability cables.
>
> * Each Summit chassis has 2 I/O APICs with 50 RTEs per. The BIOS guys are
> trying to help out by using some hardware to only use one I/O APIC for all
> but the boot chassis.
>
> * Each RXE100 PCI expansion box contains one or two I/O APICs with 50 RTEs
> each. Every chassis can have one RXE100.
>
> Even without PCI expansion boxes, 5 * 50 == 250 which is > 224. The kernel
> overflows irq_vectors and dies.
>
> Since the value stuffed into irq_vectors is 0x31 to 0xF8, it easily fits into
> a byte. As a quick kludge, I changed the type of irq_vectors and quadrupled
> the number. With 896 elements in the array, the system survived and ran.

Are you implying that the large array stopped your box from booting?

> For a real fix, irq_vectors should be dynamically allocated. But then, I
> should port the dynamic MAX_MP_BUSSES patch from 2.4 to 2.5 anyway....

Hmm, dynamic irq_vectors sounds good. We could start off with a static
amount and then dynamically allocate once we start reaching the silly
numbers. Dynamic allocation for all might get interesting at early boot.
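
One way to picture the "static first, dynamic later" idea is the user-space sketch below. Everything in it is hypothetical: struct irq_vector_table, irq_table_init(), and irq_table_set() are illustrative names, and calloc()/free() stand in for whatever allocator the kernel would actually have available at that point in boot. The table lives in a fixed array until an IRQ number no longer fits, then moves to a heap buffer that doubles in size as needed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STATIC_IRQ_VECTORS 224  /* assumed initial (static) capacity */

/* Hypothetical table: starts in fixed[], moves to the heap on demand. */
struct irq_vector_table {
	uint8_t *vec;                       /* current backing store */
	size_t len;                         /* number of usable entries */
	uint8_t fixed[STATIC_IRQ_VECTORS];  /* static storage used first */
};

/* Point the table at its built-in storage and clear it. */
static void irq_table_init(struct irq_vector_table *t)
{
	memset(t->fixed, 0, sizeof(t->fixed));
	t->vec = t->fixed;
	t->len = STATIC_IRQ_VECTORS;
}

/* Store a vector, growing the table if irq is out of range.
 * Returns 0 on success, -1 if the allocation fails. */
static int irq_table_set(struct irq_vector_table *t, size_t irq,
			 uint8_t vector)
{
	if (irq >= t->len) {
		size_t new_len = t->len;
		uint8_t *p;

		while (new_len <= irq)
			new_len *= 2;
		p = calloc(new_len, 1);
		if (!p)
			return -1;
		memcpy(p, t->vec, t->len);
		if (t->vec != t->fixed)
			free(t->vec);   /* only heap buffers are freed */
		t->vec = p;
		t->len = new_len;
	}
	t->vec[irq] = vector;
	return 0;
}
```

Small systems never leave the static array, which sidesteps the early-boot allocation concern for the common case.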

This is the patch I use right now to boot 2.5.70 on 32-way NUMAQ without
disabling IOAPICs; however, I do drop the overflow interrupts.

http://function.linuxpower.ca/patches/misc/patch-fix-8quad-mainline3

I also have another patch to just increase NR_IRQS. This was used on the
same 8quad and allowed full functionality of all the devices up to and
including node7.

http://www.osdl.org/projects/numaqhwspprt/results/patch-numaq-highirq7

http://www.osdl.org/projects/numaqhwspprt/results/data/dmesg-32way-8quad-2.5.70-640irqs

But the static arrays aren't all that nice. Do you plan on starting on the
dynamic allocation of NR_IRQS-sized arrays sometime soon?

Thanks,
Zwane
--
function.linuxpower.ca