I suggest changing the way IRQs are handed out to PCI devices.
Currently, each I/O APIC pin gets associated with an IRQ, whether or
not the pin is used. It is expected that each pin can potentially be
engaged by a device inserted into the corresponding PCI slot. However,
this imposes a severe limitation on systems whose designs employ many
I/O APICs while using only a couple of lines on each, such as those
built on the P64H2 chipset. It is used in the ES7000, and currently
there is no way to boot the system with more than 9 I/O APICs. The
simple change below makes it possible to boot a system with, say, 64
(or more) I/O APICs, each providing 1 slot, which is otherwise
impossible because of the IRQ gaps created for unused lines on each
I/O APIC. It does not resolve the problem of the number of devices
exceeding the number of possible IRQs, but it eases the pressure on
IRQs on any large system with a potentially large number of devices. I
only implemented this for the ACPI boot, since if the system is this
big and using newer chipsets it is probably (better be!) an ACPI-based
system :). The change is completely "mechanical" and does not alter
any internal structures or the interrupt model/implementation. The
patch works for both the i386 and x86_64 archs. It works with MSIs
just fine, and should not interfere with implementations like shared
vectors when they get worked out and incorporated.
To illustrate, below is the interrupt distribution for a 2-cell ES7000
with 20 I/O APICs and an Ethernet card in the last slot. The card
should be eth1, but it was not configured because its IRQ exceeded the
allowable number (it actually turned out huge: 480!):
zorro-tb2:~ # cat /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
  0:   65716   30012   30007   30002   30009   30010   30010   30010  IO-APIC-edge   timer
  4:     373       0     725     280       0       0       0       0  IO-APIC-edge   serial
  8:       0       0       0       0       0       0       0       0  IO-APIC-edge   rtc
  9:       0       0       0       0       0       0       0       0  IO-APIC-level  acpi
 14:      39       3       0       0       0       0       0       0  IO-APIC-edge   ide0
 16:     108      13       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb1
 18:       0       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb3
 19:      15       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb2
 23:       3       0       0       0       0       0       0       0  IO-APIC-level  ehci_hcd:usb4
 96:    4240     397      18       0       0       0       0       0  IO-APIC-level  aic7xxx
 97:      15       0       0       0       0       0       0       0  IO-APIC-level  aic7xxx
192:     847       0       0       0       0       0       0       0  IO-APIC-level  eth0
NMI:       0       0       0       0       0       0       0       0
LOC:  273423  274528  272829  274228  274092  273761  273827  273694
ERR:       7
MIS:       0
Even though the system doesn't have that many devices, some don't get
enabled, purely because of the IRQ numbering model.
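As a rough illustration of where numbers like 480 come from, here is a
minimal sketch. It assumes 24 pins per I/O APIC and an IRQ limit of
224; neither figure is stated in this mail, and the helper names are
made up for the example. Under those assumptions the listing above is
at least consistent with dense per-pin numbering (96 = 4*24,
192 = 8*24, 480 = 20*24), and only 9 complete I/O APICs fit below the
limit.

```c
/*
 * Assumptions for this sketch only (not taken from the mail):
 *   - every I/O APIC exposes 24 pins
 *   - the kernel can hand out at most 224 IRQ numbers
 */
#define PINS_PER_IOAPIC  24
#define NR_IRQS_ASSUMED  224

/* First IRQ number of I/O APIC `index` when every pin gets a number. */
unsigned int dense_gsi_base(unsigned int index)
{
	return index * PINS_PER_IOAPIC;
}

/* Number of I/O APICs whose pins all fit below the assumed IRQ limit. */
unsigned int max_ioapics_dense(void)
{
	return NR_IRQS_ASSUMED / PINS_PER_IOAPIC;
}
```

With these assumed values, max_ioapics_dense() evaluates to 9 and
dense_gsi_base(20) to 480.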
This is the IRQ picture after the patch was applied:
zorro-tb2:~ # cat /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
  0:   44169   10004   10004   10001   10004   10003   10004    6135  IO-APIC-edge   timer
  4:     345       0       0       0       0     244       0       0  IO-APIC-edge   serial
  8:       0       0       0       0       0       0       0       0  IO-APIC-edge   rtc
  9:       0       0       0       0       0       0       0       0  IO-APIC-level  acpi
 14:      39       0       3       0       0       0       0       0  IO-APIC-edge   ide0
 17:    4425       0       9       0       0       0       0       0  IO-APIC-level  aic7xxx
 18:      15       0       0       0       0       0       0       0  IO-APIC-level  aic7xxx, uhci_hcd:usb3
 21:     231       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb1
 22:      26       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb2
 23:       3       0       0       0       0       0       0       0  IO-APIC-level  ehci_hcd:usb4
 24:     348       0       0       0       0       0       0       0  IO-APIC-level  eth0
 25:       6     192       0       0       0       0       0       0  IO-APIC-level  eth1
NMI:       0       0       0       0       0       0       0       0
LOC:  107981  107636  108899  108698  108489  108326  108331  108254
ERR:       7
MIS:       0
Not only do we see the card in the last I/O APIC, but we are also not
even close to using up the available IRQs, since we didn't waste any.
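The renumbering idea can be modelled outside the kernel. The following
is a minimal user-space sketch, not the patch itself: pack_gsi() and
FIRST_PCI_IRQ are hypothetical names, the sketch keys on the GSI number
(anything below 16 is left alone) where the patch keys on the trigger
type, and it rejects GSIs that do not fit the table.

```c
#include <stdint.h>

#define MAX_GSI_NUM    4096  /* same table size as in the patch */
#define FIRST_PCI_IRQ  16    /* sketch assumption: 0-15 stay with legacy devices */

/* gsi_to_irq[gsi] == 0 means "no IRQ handed out for this GSI yet". */
static int gsi_to_irq[MAX_GSI_NUM];
static int next_pci_irq = FIRST_PCI_IRQ;

/*
 * Return the packed IRQ for a GSI, handing out the next free number
 * the first time a GSI is seen. Returns -1 if the GSI does not fit
 * into the table.
 */
int pack_gsi(uint32_t gsi)
{
	if (gsi < FIRST_PCI_IRQ)
		return (int)gsi;          /* legacy range keeps its number */
	if (gsi >= MAX_GSI_NUM)
		return -1;                /* outside the mapping table */
	if (gsi_to_irq[gsi] == 0)
		gsi_to_irq[gsi] = next_pci_irq++;
	return gsi_to_irq[gsi];
}
```

Registering GSIs 96, 97 and 480 in that order yields IRQs 16, 17 and
18, and asking for GSI 96 again returns 16.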
Signed-off-by: Natalie Protasevich <[email protected]>
---
diff -puN arch/x86_64/kernel/mpparse.c~irq-pack-x86_64 arch/x86_64/kernel/mpparse.c
--- linux-2.6.12-rc4-mm2/arch/x86_64/kernel/mpparse.c~irq-pack-x86_64 2005-05-18 15:32:19.369637392 -0700
+++ linux-2.6.12-rc4-mm2-root/arch/x86_64/kernel/mpparse.c 2005-05-19 02:36:07.017914536 -0700
@@ -903,11 +903,20 @@ void __init mp_config_acpi_legacy_irqs (
return;
}
+#define MAX_GSI_NUM 4096
+
int mp_register_gsi(u32 gsi, int edge_level, int active_high_low)
{
int ioapic = -1;
int ioapic_pin = 0;
int idx, bit = 0;
+ static int pci_irq = 16;
+ /*
+ * Mapping between Global System Interrupts, which
+ * represent all possible interrupts, to the IRQs
+ * assigned to actual devices.
+ */
+ static int gsi_to_irq[MAX_GSI_NUM];
if (acpi_irq_model != ACPI_IRQ_MODEL_IOAPIC)
return gsi;
@@ -942,11 +951,21 @@ int mp_register_gsi(u32 gsi, int edge_le
if ((1<<bit) & mp_ioapic_routing[ioapic].pin_programmed[idx]) {
Dprintk(KERN_DEBUG "Pin %d-%d already programmed\n",
mp_ioapic_routing[ioapic].apic_id, ioapic_pin);
- return gsi;
+ return gsi_to_irq[gsi];
}
mp_ioapic_routing[ioapic].pin_programmed[idx] |= (1<<bit);
+ if (edge_level) {
+ /*
+ * For PCI devices assign IRQs in order, avoiding gaps
+ * due to unused I/O APIC pins.
+ */
+ int irq = gsi;
+ gsi = pci_irq++;
+ gsi_to_irq[irq] = gsi;
+ }
+
io_apic_set_pci_routing(ioapic, ioapic_pin, gsi,
edge_level == ACPI_EDGE_SENSITIVE ? 0 : 1,
active_high_low == ACPI_ACTIVE_HIGH ? 0 : 1);
_
Hi Natalie,
have you taken a look at the Vector Sharing Patch posted by Kaneshige for IA64?
Cheers,
ashok
On Thu, May 19, 2005 at 04:06:13AM -0700, [email protected] wrote:
>
> I suggest changing the way IRQs are handed out to PCI devices.
> Currently, each I/O APIC pin gets associated with an IRQ, whether or
> not the pin is used. It is expected that each pin can potentially be
> engaged by a device inserted into the corresponding PCI slot. However,
> this imposes a severe limitation on systems whose designs employ many
> I/O APICs while using only a couple of lines on each, such as those
> built on the P64H2 chipset. It is used in the ES7000, and currently
> there is no way to boot the system with more than 9 I/O APICs. The
> simple change below makes it possible to boot a system with, say, 64
> (or more) I/O APICs, each providing 1 slot, which is otherwise
> impossible because of the IRQ gaps created for unused lines on each
> I/O APIC. It does not resolve the problem of the number of devices
> exceeding the number of possible IRQs, but it eases the pressure on
> IRQs on any large system with a potentially large number of devices. I
> only implemented this for the ACPI boot, since if the system is this
> big and
>.. deleted...
>
> Hi Natalie,
>
> have you taken a look at the Vector Sharing Patch posted by
> Kaneshige for IA64?
>
> Cheers,
> ashok
Ashok,
I did initial testing of Zwane's IA-32 vector sharing patch, which
worked beautifully for the test case I mentioned. Here is the IRQ
snapshot on the same system booted as IA-32 with Zwane's patch applied:
zorro-tb2:~ # cat /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
  0:   21287   95050       0       0       0       0       0       0  IO-APIC-edge   timer
  4:      35       0       0       0       0       0       0       0  IO-APIC-edge   serial
  8:       2       0       0       0       0       0       0       0  IO-APIC-edge   rtc
  9:       1       0       0       0       0       0       0       0  IO-APIC-level  acpi
 14:      42       0       0       0       0       0       0       0  IO-APIC-edge   ide0
 16:     166       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb1
 18:       0       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb3
 19:      13       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb2
 22:       3       0       0       0       0       0       0       0  IO-APIC-level  ohci1394
 23:       3       0       0       0       0       0       0       0  IO-APIC-level  ehci_hcd:usb4
 96:    4531       0       0       0       0       0       0       0  IO-APIC-level  aic7xxx
 97:      15       0       0       0       0       0       0       0  IO-APIC-level  aic7xxx
192:     319       0       0       0       0       0       0       0  IO-APIC-level  eth0
480:     197       0       0       0       0       0       0       0  IO-APIC-level  eth1
NMI:       0       0       0       0       0       0       0       0
LOC:  112387  113275  108500  112878  113213  113158  113272  113249
ERR:       0
MIS:       0
After I applied my patch on top of his patch, the picture became:
zorro-tb2:~ # cat /proc/interrupts
        CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
  0:   21339   82235       0       0       0       0       0       0  IO-APIC-edge   timer
  4:      13       0       0       0       0       0       0       0  IO-APIC-edge   serial
  8:       2       0       0       0       0       0       0       0  IO-APIC-edge   rtc
  9:       1       0       0       0       0       0       0       0  IO-APIC-level  acpi
 14:      42       0       0       0       0       0       0       0  IO-APIC-edge   ide0
 16:       0       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb3
 17:    4533       0       0       0       0       0       0       0  IO-APIC-level  aic7xxx
 18:      15       0       0       0       0       0       0       0  IO-APIC-level  aic7xxx
 21:     172       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb1
 22:      13       0       0       0       0       0       0       0  IO-APIC-level  uhci_hcd:usb2
 23:       3       0       0       0       0       0       0       0  IO-APIC-level  ehci_hcd:usb4
 24:       3       0       0       0       0       0       0       0  IO-APIC-level  ohci1394
 25:     252       0       0       0       0       0       0       0  IO-APIC-level  eth0
 26:     115       0       0       0       0       0       0       0  IO-APIC-level  eth1
NMI:       0       0       0       0       0       0       0       0
LOC:   99762  100508   95782  100288  100309  100517  100550  100349
ERR:       0
MIS:       0
So there is no conflict between the two, and once vector sharing is
fully implemented, tested, and incorporated into the source (and into
x86_64), we will still have IRQ numbers utilized better per node.
Thanks,
--Natalie
I actually tested the code I'm offering with Zwane's IA-32 vector
sharing patch. By the way, I tested hi
Hi Natalie,
On Thu, 19 May 2005 [email protected] wrote:
> I suggest changing the way IRQs are handed out to PCI devices.
> Currently, each I/O APIC pin gets associated with an IRQ, whether or
> not the pin is used. It is expected that each pin can potentially be
> engaged by a device inserted into the corresponding PCI slot. However,
> this imposes a severe limitation on systems whose designs employ many
> I/O APICs while using only a couple of lines on each, such as those
> built on the P64H2 chipset. It is used in the ES7000, and currently
> there is no way to boot the system with more than 9 I/O APICs. The
> simple change below makes it possible to boot a system with, say, 64
> (or more) I/O APICs, each providing 1 slot, which is otherwise
> impossible because of the IRQ gaps created for unused lines on each
> I/O APIC. It does not resolve the problem of the number of devices
> exceeding the number of possible IRQs, but it eases the pressure on
> IRQs on any large system with a potentially large number of devices. I
> only implemented this for the ACPI boot, since if the system is this
> big and
Can you determine the number of slots in use?
> using newer chipsets it is probably (better be!) an ACPI-based system
> :). The change is completely "mechanical" and does not alter any
> internal structures or the interrupt model/implementation. The patch
> works for both the i386 and x86_64 archs. It works with MSIs just fine,
> and should not interfere with implementations like shared vectors when
> they get worked out and incorporated.
Well, we ran into similar problems on older MPS systems (NUMAQ), but
those don't really matter right now anyway. So I think fixing this for
ACPI is fine.
But I like your patch =)
Thanks,
Zwane
>
> Hi Natalie,
>
> On Thu, 19 May 2005 [email protected] wrote:
>
> > I suggest changing the way IRQs are handed out to PCI devices.
> > Currently, each I/O APIC pin gets associated with an IRQ, whether or
> > not the pin is used. It is expected that each pin can potentially be
> > engaged by a device inserted into the corresponding PCI slot. However,
> > this imposes a severe limitation on systems whose designs employ many
> > I/O APICs while using only a couple of lines on each, such as those
> > built on the P64H2 chipset. It is used in the ES7000, and currently
> > there is no way to boot the system with more than 9 I/O APICs. The
> > simple change below makes it possible to boot a system with, say, 64
> > (or more) I/O APICs, each providing 1 slot, which is otherwise
> > impossible because of the IRQ gaps created for unused lines on each
> > I/O APIC. It does not resolve the problem of the number of devices
> > exceeding the number of possible IRQs, but it eases the pressure on
> > IRQs on any large system with a potentially large number of devices. I
> > only implemented this for the ACPI boot, since if the system is this
> > big and
>
> Can you determine number of slots in use?
I think it is possible, but then it would probably be back to something
like the "pci=routeirq" philosophy.
> > using newer chipsets it is probably (better be!) an ACPI-based system
> > :). The change is completely "mechanical" and does not alter any
> > internal structures or the interrupt model/implementation. The patch
> > works for both the i386 and x86_64 archs. It works with MSIs just fine,
> > and should not interfere with implementations like shared vectors when
> > they get worked out and incorporated.
>
> Well, we ran into similar problems on older MPS systems (NUMAQ), but
> those don't really matter right now anyway. So I think fixing this for
> ACPI is fine.
ACPI is well organized and easily manipulated; that's why I stopped
right there. I tried adjusting the MPS case. It is possible, but then I
thought no one would need it anyway. It would take multiple changes and
won't be pretty, but if you insist I can work it out :)
> But I like your patch =)
:)
> Thanks,
> Zwane
>
>
On Thu, May 19, 2005 at 04:06:13AM -0700, [email protected] wrote:
>
>
> I suggest changing the way IRQs are handed out to PCI devices. Currently, each I/O APIC pin gets associated with an IRQ, whether or not the pin is used. It is expected that each pin can potentially be engaged by a device inserted into the corresponding PCI slot. However, this imposes a severe limitation on systems whose designs employ many I/O APICs while using only a couple of lines on each, such as those built on the P64H2 chipset. It is used in the ES7000, and currently there is no way to boot the system with more than 9 I/O APICs. The simple change below makes it possible to boot a system with, say, 64 (or more) I/O APICs, each providing 1 slot, which is otherwise impossible because of the IRQ gaps created for unused lines on each I/O APIC. It does not resolve the problem of the number of devices exceeding the number of possible IRQs, but it eases the pressure on IRQs on any large system with a potentially large number of devices. I only implemented this for the ACPI boot, since if the system is this big and
> using newer chipsets it is probably (better be!) an ACPI-based system :). The change is completely "mechanical" and does not alter any internal structures or the interrupt model/implementation. The patch works for both the i386 and x86_64 archs. It works with MSIs just fine, and should not interfere with implementations like shared vectors when they get worked out and incorporated.
>
>
> To illustrate, below is the interrupt distribution for a 2-cell ES7000 with 20 I/O APICs and an Ethernet card in the last slot. The card should be eth1, but it was not configured because its IRQ exceeded the allowable number (it actually turned out huge: 480!):
>
> zorro-tb2:~ # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 0: 65716 30012 30007 30002 30009 30010 30010 30010 IO-APIC-edge timer
> 4: 373 0 725 280 0 0 0 0 IO-APIC-edge serial
> 8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
> 9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
> 14: 39 3 0 0 0 0 0 0 IO-APIC-edge ide0
> 16: 108 13 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
> 18: 0 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb3
> 19: 15 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
> 23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
> 96: 4240 397 18 0 0 0 0 0 IO-APIC-level aic7xxx
> 97: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx
> 192: 847 0 0 0 0 0 0 0 IO-APIC-level eth0
> NMI: 0 0 0 0 0 0 0 0
> LOC: 273423 274528 272829 274228 274092 273761 273827 273694
> ERR: 7
> MIS: 0
>
> Even though the system doesn't have that many devices, some don't get enabled, purely because of the IRQ numbering model.
>
> This is the IRQ picture after the patch was applied:
>
> zorro-tb2:~ # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 0: 44169 10004 10004 10001 10004 10003 10004 6135 IO-APIC-edge timer
> 4: 345 0 0 0 0 244 0 0 IO-APIC-edge serial
> 8: 0 0 0 0 0 0 0 0 IO-APIC-edge rtc
> 9: 0 0 0 0 0 0 0 0 IO-APIC-level acpi
> 14: 39 0 3 0 0 0 0 0 IO-APIC-edge ide0
> 17: 4425 0 9 0 0 0 0 0 IO-APIC-level aic7xxx
> 18: 15 0 0 0 0 0 0 0 IO-APIC-level aic7xxx, uhci_hcd:usb3
> 21: 231 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb1
> 22: 26 0 0 0 0 0 0 0 IO-APIC-level uhci_hcd:usb2
> 23: 3 0 0 0 0 0 0 0 IO-APIC-level ehci_hcd:usb4
> 24: 348 0 0 0 0 0 0 0 IO-APIC-level eth0
> 25: 6 192 0 0 0 0 0 0 IO-APIC-level eth1
> NMI: 0 0 0 0 0 0 0 0
> LOC: 107981 107636 108899 108698 108489 108326 108331 108254
> ERR: 7
> MIS: 0
>
> Not only do we see the card in the last I/O APIC, but we are also not even close to using up the available IRQs, since we didn't waste any.
Thanks. Looks good to me and makes a lot of sense.
Eventually the non-ACPI case will need to be fixed too. At least
on some distributions there are "failsafe" boot loader entries
that disable ACPI, and users tend to use them occasionally
and get unhappy when they don't work.
-Andi
P.S.: Could you please line wrap your emails at 80 chars/line? That
would make them easier to quote.
On Friday 20 May 2005 7:45 am, Ashok Raj wrote:
> have you taken a look at the Vector Sharing Patch posted by Kaneshige for IA64?
Vector sharing has a performance cost, so we should avoid it when
we can.
I think you should bounds-check the gsi_to_irq[] references. When
you finally get a machine with GSI values larger than MAX_GSI_NUM,
things will start failing mysteriously as you corrupt things after
the gsi_to_irq[] array.
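A minimal sketch of what such a check might look like, written as
stand-alone C rather than as a change to mp_register_gsi(). The helper
names gsi_map_store() and gsi_map_lookup() are hypothetical, and falling
back to the unpacked GSI number for out-of-range values is just one
possible policy:

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_GSI_NUM 4096  /* table size used in the patch */

static int gsi_to_irq[MAX_GSI_NUM];

/*
 * Record gsi -> irq. Returns 0 on success and -1 (with a message)
 * when the GSI would index past the end of the table.
 */
int gsi_map_store(uint32_t gsi, int irq)
{
	if (gsi >= MAX_GSI_NUM) {
		fprintf(stderr, "GSI %u is too high for the mapping table\n",
			(unsigned int)gsi);
		return -1;
	}
	gsi_to_irq[gsi] = irq;
	return 0;
}

/*
 * Look up the IRQ for a GSI. Out-of-range GSIs are never read from
 * the table; the unpacked number is returned instead.
 */
int gsi_map_lookup(uint32_t gsi)
{
	if (gsi >= MAX_GSI_NUM)
		return (int)gsi;
	return gsi_to_irq[gsi];
}
```

Both accessors compare against MAX_GSI_NUM before touching the array,
so a GSI of 4096 or more can neither be written nor read through it.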
> On Friday 20 May 2005 7:45 am, Ashok Raj wrote:
> > have you taken a look at the Vector Sharing Patch posted by
> > Kaneshige for IA64?
>
> Vector sharing has a performance cost, so we should avoid it
> when we can.
>
> I think you should bounds-check the gsi_to_irq[] references.
> When you finally get a machine with GSI values larger than
> MAX_GSI_NUM, things will start failing mysteriously as you
> corrupt things after the gsi_to_irq[] array.
>
Yes, indeed, I will do that. In the next round I will submit the ACPI
cases for i386 and x86_64 with the correction above, and then start
working on the MPS cases.
Thanks,
--Natalie