2008-07-29 16:10:29

by Dhaval Giani

[permalink] [raw]
Subject: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Hi Ingo, Thomas,

Hit this on 2.6.27-rc1

(The kernel bug is at line 356 (Its 357 as I applied a debug patch to
print out the irq) (Full dmesg and .config attached)

Thanks,

------------[ cut here ]------------
kernel BUG at arch/x86/kernel/io_apic_64.c:357!
invalid opcode: 0000 [1] SMP
CPU 24
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.27-rc1-autokern1 #1
RIP: 0010:[<ffffffff8021bb2e>] [<ffffffff8021bb2e>] add_pin_to_irq+0x8e/0xa0
RSP: 0018:ffff88032e4b9b30 EFLAGS: 00010216
RAX: 00000000000000f0 RBX: 00000000000000f0 RCX: 0000000000000000
RDX: 000000000000afaf RSI: 0000000000000046 RDI: ffffffff80738234
RBP: 0000000000000006 R08: 0000000000000000 R09: ffff8800280992c0
R10: 0000000000000000 R11: ffffffff80372060 R12: 0000000000000018
R13: 0000000000000001 R14: 0000000000000018 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880bfe733540(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88032e4b8000, task ffff880bfe4ca050)
Stack: 00000000000000f0 0000000000000006 0000000000000001 ffffffff8021bbbe
00000000000000f0 0000000000000001 0000000000000000 ffff88032e4b9c0c
00000000000000f0 ffffffff80218d81 00000000000000f0 0000000000000000
Call Trace:
[<ffffffff8021bbbe>] ? io_apic_set_pci_routing+0x7e/0xa0
[<ffffffff80218d81>] ? mp_register_gsi+0xb1/0xd0
[<ffffffff80218e0c>] ? acpi_register_gsi+0x6c/0x70
[<ffffffff80392f30>] ? acpi_pci_irq_enable+0x178/0x260
[<ffffffff80392cdd>] ? acpi_pci_allocate_irq+0x0/0x4c
[<ffffffff80370657>] ? pci_enable_resources+0x27/0x160
[<ffffffff8036be6a>] ? do_pci_enable_device+0x4a/0x70
[<ffffffff8036bee1>] ? __pci_enable_device_flags+0x51/0x60
[<ffffffff804e01e8>] ? tg3_init_one+0x58/0x1640
[<ffffffff8022a980>] ? default_wake_function+0x0/0x10
[<ffffffff8022efd8>] ? set_cpus_allowed_ptr+0xe8/0x110
[<ffffffff8036e28f>] ? pci_device_probe+0xdf/0x130
[<ffffffff803c2416>] ? driver_probe_device+0x96/0x1a0
[<ffffffff803c25a9>] ? __driver_attach+0x89/0x90
[<ffffffff803c2520>] ? __driver_attach+0x0/0x90
[<ffffffff803c1a8d>] ? bus_for_each_dev+0x4d/0x80
[<ffffffff8028e168>] ? kmem_cache_alloc+0xc8/0xf0
[<ffffffff803c1f7e>] ? bus_add_driver+0xae/0x220
[<ffffffff803c2836>] ? driver_register+0x56/0x130
[<ffffffff8036e548>] ? __pci_register_driver+0x68/0xb0
[<ffffffff806cf8e0>] ? tg3_init+0x0/0x20
[<ffffffff806b15b1>] ? do_one_initcall+0x41/0x180
[<ffffffff802d8a08>] ? create_proc_entry+0x58/0xa0
[<ffffffff80261cc4>] ? register_irq_proc+0xd4/0xf0
[<ffffffff806b1b53>] ? kernel_init+0x133/0x190
[<ffffffff8020c529>] ? child_rip+0xa/0x11
[<ffffffff806b1a20>] ? kernel_init+0x0/0x190
[<ffffffff8020c51f>] ? child_rip+0x0/0x11


Code: 89 05 2b 54 42 00 7f 27 48 0f bf c1 48 8d 14 00 48 c1 e0 03 48 29 d0 48 8d 90 c0 5e 73 80 66 89 2a 66 44 89 62 02 5b 5d 41 5c c3 <0f> 0b eb fe 48 c7 c7 c8 db 5c 80 31 c0 e8 60 7e 01 00 48 83 ec
RIP [<ffffffff8021bb2e>] add_pin_to_irq+0x8e/0xa0
RSP <ffff88032e4b9b30>
--
regards,
Dhaval


Attachments:
(No filename) (3.12 kB)
dotconfig (34.73 kB)
console.log (47.97 kB)
Download all attachments

2008-07-29 18:35:27

by Yinghai Lu

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]> wrote:
> Hi Ingo, Thomas,
>
> Hit this on 2.6.27-rc1
>
> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
> print out the irq) (Full dmesg and .config attached)

can you boot with "debug apic=verbose pci=routeirq"?

YH

2008-07-29 19:20:15

by Yinghai Lu

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

On Tue, Jul 29, 2008 at 11:35 AM, Yinghai Lu <[email protected]> wrote:
> On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]> wrote:
>> Hi Ingo, Thomas,
>>
>> Hit this on 2.6.27-rc1
>>
>> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
>> print out the irq) (Full dmesg and .config attached)
>
> can you boot with "debug apic=verbose pci=routeirq"?

please try attached patch

YH


Attachments:
(No filename) (428.00 B)
nr_irqs.patch (1.40 kB)
Download all attachments

2008-07-29 20:22:16

by Dhaval Giani

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

On Tue, Jul 29, 2008 at 12:20:01PM -0700, Yinghai Lu wrote:
> On Tue, Jul 29, 2008 at 11:35 AM, Yinghai Lu <[email protected]> wrote:
> > On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]> wrote:
> >> Hi Ingo, Thomas,
> >>
> >> Hit this on 2.6.27-rc1
> >>
> >> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
> >> print out the irq) (Full dmesg and .config attached)
> >
> > can you boot with "debug apic=verbose pci=routeirq"?
>
> please try attached patch
>

Thanks. The patch gets my system to boot.

--
regards,
Dhaval

2008-07-29 20:23:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

"Yinghai Lu" <[email protected]> writes:

> On Tue, Jul 29, 2008 at 11:35 AM, Yinghai Lu <[email protected]> wrote:
>> On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]>
> wrote:
>>> Hi Ingo, Thomas,
>>>
>>> Hit this on 2.6.27-rc1
>>>
>>> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
>>> print out the irq) (Full dmesg and .config attached)
>>
>> can you boot with "debug apic=verbose pci=routeirq"?
>
> please try attached patch

Ugh. Yuck bleh nasty gag.

Yes please try the YH's patch that should fix the worst of the
problem.

YH I think you have hit the root cause of the bug with NR_IRQS being
defined a ridiculously low value on x86_64. At first glance your
patch looks reasonable, I had to stop and look why it was needed. At
second glance it looks like it doesn't go far enough.

The commit that unified interrupt vector defines appears substantially
fumbled. YH your description of why your patch is needed is not
especially useful.

To be perfectly clear. The maximum number of interrupts we can handle is:
NR_CPUS*NR_VECTORS. Defining NR_IRQS to a lower value is essentially a
hack to allows us to use less memory as we seldom push that limit.

Further the only valid definition of NR_IRQ_VECTORS is NR_IRQS
as we index it by irq. Which really means now that we are unifying
the code we should simply kill NR_IRQ_VECTORS.

I expect there is a lot more in that mess that can be cleaned up.
Right now irq_vectors.h reads like nonsense to me. I will see if
I can find some time soon to look into what we should be doing.

Increasing the size of the next field in the irq_pin_list is a good
idea. Frankly I think we never use the next field (especially on
x86_64 and should just remove it) but I have not been brave enough
to try.

commit 9b7dc567d03d74a1fbae84e88949b6a60d922d82
Author: Thomas Gleixner <[email protected]>
Date: Fri May 2 20:10:09 2008 +0200

x86: unify interrupt vector defines

The interrupt vector defines are copied 4 times around with minimal
differences. Move them all into asm-x86/irq_vectors.h

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

> [PATCH] x86: 64bit support more than 256 irq
>
> because 64bit allow same vector for different cpu to serve different irq
>
> also change next in irq_pin_list from short to next. because for 4096 NR_IRQS
> is 2^(5+12)+224.
>
> need to create that array dynamically later
>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> arch/x86/kernel/io_apic_64.c | 3 ++-
> include/asm-x86/irq_vectors.h | 6 +++++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/asm-x86/irq_vectors.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/irq_vectors.h
> +++ linux-2.6/include/asm-x86/irq_vectors.h
> @@ -113,9 +113,13 @@
>
> # if defined(CONFIG_X86_IO_APIC) || defined(CONFIG_PARAVIRT) ||
> defined(CONFIG_X86_VISWS)/include/asm-x86/irq_vectors.h
>
> +#ifdef CONFIG_X86_64
> +# define NR_IRQS (32 * NR_CPUS + 224)
> +#else
> # define NR_IRQS 224
> +#endif
>
> -# if (224 >= 32 * NR_CPUS)
> +# if (NR_IRQS >= 32 * NR_CPUS)
> # define NR_IRQ_VECTORS NR_IRQS
> # else
> # define NR_IRQ_VECTORS (32 * NR_CPUS)
> Index: linux-2.6/arch/x86/kernel/io_apic_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/io_apic_64.c
> +++ linux-2.6/arch/x86/kernel/io_apic_64.c
> @@ -140,7 +140,8 @@ DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BU
> */
>
> static struct irq_pin_list {
> - short apic, pin, next;
> + short apic, pin;
> + int next;
> } irq_2_pin[PIN_MAP_SIZE];
>
> struct io_apic {

2008-07-29 20:37:47

by Yinghai Lu

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

On Tue, Jul 29, 2008 at 1:14 PM, Eric W. Biederman
<[email protected]> wrote:
> "Yinghai Lu" <[email protected]> writes:
>
>> On Tue, Jul 29, 2008 at 11:35 AM, Yinghai Lu <[email protected]> wrote:
>>> On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]>
>> wrote:
>>>> Hi Ingo, Thomas,
>>>>
>>>> Hit this on 2.6.27-rc1
>>>>
>>>> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
>>>> print out the irq) (Full dmesg and .config attached)
>>>
>>> can you boot with "debug apic=verbose pci=routeirq"?
>>
>> please try attached patch
>
> Ugh. Yuck bleh nasty gag.
>
> Yes please try the YH's patch that should fix the worst of the
> problem.
>
> YH I think you have hit the root cause of the bug with NR_IRQS being
> defined a ridiculously low value on x86_64. At first glance your
> patch looks reasonable, I had to stop and look why it was needed. At
> second glance it looks like it doesn't go far enough.

x3950 has 8 ioapic
ACPI: IOAPIC (id[0x0f] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 15, version 0, address 0xfec00000, GSI 0-35
ACPI: IOAPIC (id[0x0e] address[0xfec01000] gsi_base[36])
IOAPIC[1]: apic_id 14, version 0, address 0xfec01000, GSI 36-71
ACPI: IOAPIC (id[0x0d] address[0xfec02000] gsi_base[72])
IOAPIC[2]: apic_id 13, version 0, address 0xfec02000, GSI 72-107
ACPI: IOAPIC (id[0x0c] address[0xfec03000] gsi_base[108])
IOAPIC[3]: apic_id 12, version 0, address 0xfec03000, GSI 108-143
ACPI: IOAPIC (id[0x0b] address[0xfec04000] gsi_base[144])
IOAPIC[4]: apic_id 11, version 0, address 0xfec04000, GSI 144-179
ACPI: IOAPIC (id[0x0a] address[0xfec05000] gsi_base[180])
IOAPIC[5]: apic_id 10, version 0, address 0xfec05000, GSI 180-215
ACPI: IOAPIC (id[0x09] address[0xfec06000] gsi_base[216])
IOAPIC[6]: apic_id 9, version 0, address 0xfec06000, GSI 216-251
ACPI: IOAPIC (id[0x08] address[0xfec07000] gsi_base[252])
IOAPIC[7]: apic_id 8, version 0, address 0xfec07000, GSI 252-287

that is crazy. I only played that kind of layout in SimNow.

>
> The commit that unified interrupt vector defines appears substantially
> fumbled. YH your description of why your patch is needed is not
> especially useful.
>
> To be perfectly clear. The maximum number of interrupts we can handle is:
> NR_CPUS*NR_VECTORS. Defining NR_IRQS to a lower value is essentially a
> hack to allows us to use less memory as we seldom push that limit.
>
> Further the only valid definition of NR_IRQ_VECTORS is NR_IRQS
> as we index it by irq. Which really means now that we are unifying
> the code we should simply kill NR_IRQ_VECTORS.
>
> I expect there is a lot more in that mess that can be cleaned up.
> Right now irq_vectors.h reads like nonsense to me. I will see if
> I can find some time soon to look into what we should be doing.

good.
wonder if you can make 32bit support same vector on different cpu
support different irq.
or not needed.
also we should dump the irq balance on 32 bit kernel.

YH

2008-07-29 22:17:19

by Mike Travis

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Eric W. Biederman wrote:
> "Yinghai Lu" <[email protected]> writes:
>
>> On Tue, Jul 29, 2008 at 11:35 AM, Yinghai Lu <[email protected]> wrote:
>>> On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]>
>> wrote:
>>>> Hi Ingo, Thomas,
>>>>
>>>> Hit this on 2.6.27-rc1
>>>>
>>>> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
>>>> print out the irq) (Full dmesg and .config attached)
>>> can you boot with "debug apic=verbose pci=routeirq"?
>> please try attached patch

I didn't follow this from the start but one reason why NR_IRQS based on
NR_CPUS is a bad idea, is the huge (nearly 300Mb) increase in memory usage
(that's mostly wasted.) I believe there's another patch coming real soon
now to make irq allocations dynamic. (I had also hoped to look closer at
your irq abstraction patch you sent a while back. Does that also address
this issue?)

But this would be a show stopper for SGI being able to ship systems if the
distros do not want to waste this much memory and won't set NR_CPUS=4096.

Thanks,
Mike

====== Data (-l 500)

1 - 4k-defconfig
2 - 4k-defconfig-tmp (has this patch)

.1. .2. ..final..
258048 +150994944 151252992 +58514% irq_desc(.data.cacheline_aligned)
231168 +135266304 135497472 +58514% irq_cfg(.data.read_mostly)
3584 +2097152 2100736 +58514% irq_lists(.bss)
2688 +2098048 2100736 +78052% irq_2_pin(.bss)
1792 +1048576 1050368 +58514% irq_timer_state(.bss)
968 +524288 525256 +54161% per_cpu__kstat(.data.percpu)
498248 +292029312 292527560 +58611% Totals

====== Sections (-l 500)

1 - 4k-defconfig
2 - 4k-defconfig-tmp

.1. .2. ..final..
10379571 +292028228 302407799 +2813% Total
1118924 +5242880 6361804 +468% .bss
742016 +150994944 151736960 +20349% .data.cacheline_aligned
274416 +135266304 135540720 +49292% .data.read_mostly
44928 +524288 569216 +1166% .data.percpu
12559855 +584056644 596616499 +4650% Totals

====== Text/Data ()

1 - 4k-defconfig
2 - 4k-defconfig-tmp

.1. .2. ..final..
1118208 +5242880 6361088 +468% BssSize
1232896 +524288 1757184 +42% InitSize
45056 +524288 569344 +1163% PerCPU
1077248 +286261248 287338496 +26573% OtherSize
3473408 +292552704 296026112 +8422% Totals

>
> Ugh. Yuck bleh nasty gag.
>
> Yes please try the YH's patch that should fix the worst of the
> problem.
>
> YH I think you have hit the root cause of the bug with NR_IRQS being
> defined a ridiculously low value on x86_64. At first glance your
> patch looks reasonable, I had to stop and look why it was needed. At
> second glance it looks like it doesn't go far enough.
>
> The commit that unified interrupt vector defines appears substantially
> fumbled. YH your description of why your patch is needed is not
> especially useful.
>
> To be perfectly clear. The maximum number of interrupts we can handle is:
> NR_CPUS*NR_VECTORS. Defining NR_IRQS to a lower value is essentially a
> hack to allows us to use less memory as we seldom push that limit.
>
> Further the only valid definition of NR_IRQ_VECTORS is NR_IRQS
> as we index it by irq. Which really means now that we are unifying
> the code we should simply kill NR_IRQ_VECTORS.
>
> I expect there is a lot more in that mess that can be cleaned up.
> Right now irq_vectors.h reads like nonsense to me. I will see if
> I can find some time soon to look into what we should be doing.
>
> Increasing the size of the next field in the irq_pin_list is a good
> idea. Frankly I think we never use the next field (especially on
> x86_64 and should just remove it) but I have not been brave enough
> to try.
>
> commit 9b7dc567d03d74a1fbae84e88949b6a60d922d82
> Author: Thomas Gleixner <[email protected]>
> Date: Fri May 2 20:10:09 2008 +0200
>
> x86: unify interrupt vector defines
>
> The interrupt vector defines are copied 4 times around with minimal
> differences. Move them all into asm-x86/irq_vectors.h
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>> [PATCH] x86: 64bit support more than 256 irq
>>
>> because 64bit allow same vector for different cpu to serve different irq
>>
>> also change next in irq_pin_list from short to next. because for 4096 NR_IRQS
>> is 2^(5+12)+224.
>>
>> need to create that array dynamically later
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>>
>> ---
>> arch/x86/kernel/io_apic_64.c | 3 ++-
>> include/asm-x86/irq_vectors.h | 6 +++++-
>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> Index: linux-2.6/include/asm-x86/irq_vectors.h
>> ===================================================================
>> --- linux-2.6.orig/include/asm-x86/irq_vectors.h
>> +++ linux-2.6/include/asm-x86/irq_vectors.h
>> @@ -113,9 +113,13 @@
>>
>> # if defined(CONFIG_X86_IO_APIC) || defined(CONFIG_PARAVIRT) ||
>> defined(CONFIG_X86_VISWS)/include/asm-x86/irq_vectors.h
>>
>> +#ifdef CONFIG_X86_64
>> +# define NR_IRQS (32 * NR_CPUS + 224)
>> +#else
>> # define NR_IRQS 224
>> +#endif
>>
>> -# if (224 >= 32 * NR_CPUS)
>> +# if (NR_IRQS >= 32 * NR_CPUS)
>> # define NR_IRQ_VECTORS NR_IRQS
>> # else
>> # define NR_IRQ_VECTORS (32 * NR_CPUS)
>> Index: linux-2.6/arch/x86/kernel/io_apic_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/io_apic_64.c
>> +++ linux-2.6/arch/x86/kernel/io_apic_64.c
>> @@ -140,7 +140,8 @@ DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BU
>> */
>>
>> static struct irq_pin_list {
>> - short apic, pin, next;
>> + short apic, pin;
>> + int next;
>> } irq_2_pin[PIN_MAP_SIZE];
>>
>> struct io_apic {

2008-07-29 22:21:18

by Yinghai Lu

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

On Tue, Jul 29, 2008 at 3:17 PM, Mike Travis <[email protected]> wrote:
> Eric W. Biederman wrote:
>> "Yinghai Lu" <[email protected]> writes:
>>
>>> On Tue, Jul 29, 2008 at 11:35 AM, Yinghai Lu <[email protected]> wrote:
>>>> On Tue, Jul 29, 2008 at 9:09 AM, Dhaval Giani <[email protected]>
>>> wrote:
>>>>> Hi Ingo, Thomas,
>>>>>
>>>>> Hit this on 2.6.27-rc1
>>>>>
>>>>> (The kernel bug is at line 356 (Its 357 as I applied a debug patch to
>>>>> print out the irq) (Full dmesg and .config attached)
>>>> can you boot with "debug apic=verbose pci=routeirq"?
>>> please try attached patch
>
> I didn't follow this from the start but one reason why NR_IRQS based on
> NR_CPUS is a bad idea, is the huge (nearly 300Mb) increase in memory usage
> (that's mostly wasted.) I believe there's another patch coming real soon
> now to make irq allocations dynamic. (I had also hoped to look closer at
> your irq abstraction patch you sent a while back. Does that also address
> this issue?)

Yes.

>
> But this would be a show stopper for SGI being able to ship systems if the
> distros do not want to waste this much memory and won't set NR_CPUS=4096.

wonder if nr_irqs need to be probed dynamically too.

YH

2008-07-29 23:23:08

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

"Yinghai Lu" <[email protected]> writes:

>> But this would be a show stopper for SGI being able to ship systems if the
>> distros do not want to waste this much memory and won't set NR_CPUS=4096.
>
> wonder if nr_irqs need to be probed dynamically too.

NR_IRQS simply needs to die.

Eric

2008-07-29 23:23:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Mike Travis <[email protected]> writes:

> I didn't follow this from the start but one reason why NR_IRQS based on
> NR_CPUS is a bad idea, is the huge (nearly 300Mb) increase in memory usage
> (that's mostly wasted.) I believe there's another patch coming real soon
> now to make irq allocations dynamic. (I had also hoped to look closer at
> your irq abstraction patch you sent a while back. Does that also address
> this issue?)

The patch I sent out earlier is one of the key patches needed for killing
NR_IRQS usage in generic code. Which is part of what we need to make this
dynamic.

In systems where the I/O is well balanced with the compute the typical
usage is usually within 16 irqs per core, and at worst 32. That is an old
rule of thumb observation and that makes for reasonable allocations.

I don't have a problem at all with your code that updated the heuristic to
be based on the NR_IOAPICS.

My problem is with Thomas's patch that totally threw out all of our tuned
heuristics and made NR_IRQS=256. Which is ludicrous. Even on 32bit systems
there are cases where 1024 irq sources needed to be supported.

Which is what NR_IRQ_VECTORS is. I goofed slightly in my comments.
irq_vector only needs to be NR_IRQS in size. I think ACPI still needs
NR_IRQ_VECTORS to know how many GSI the kernel can support. The fact they
are not mapped 1-1 right now in the 32bit kernel is unfortunate.

> But this would be a show stopper for SGI being able to ship systems if the
> distros do not want to waste this much memory and won't set NR_CPUS=4096.

Yes. We absolutely need to dynamically allocate the irq data structures. Then
we can use the irq numbers sparsely and not have problems.

I just have problems with the code setting NR_IRQS at 256 when we have single
potentially common hardware devices talking about having that many irqs on
a single device.

We really need to be able to scale to an unreasonable number of IRQs when we
have the hardware plugged into the system that will use them. Just like
we need to scale to an unreasonable number of cpus when you plug them into
a system.

I expect irqs to actually grow faster then cpus while all of the devices are
learning how to accommodate hardware virtualization. It would not surprise
me in the slightest if I can plug in the right hardware and exceed
NR_CPUS*32 irqs in an sgi machine in the next year or so.

The only problem with NR_IRQS=NR_CPUS*32 is that we pay the price on lower
end machines when we compile to support a higher cpu count.

Eric

2008-07-29 23:42:32

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Eric W. Biederman wrote:
> Mike Travis <[email protected]> writes:
>
>
>> I didn't follow this from the start but one reason why NR_IRQS based on
>> NR_CPUS is a bad idea, is the huge (nearly 300Mb) increase in memory usage
>> (that's mostly wasted.) I believe there's another patch coming real soon
>> now to make irq allocations dynamic. (I had also hoped to look closer at
>> your irq abstraction patch you sent a while back. Does that also address
>> this issue?)
>>
>
> The patch I sent out earlier is one of the key patches needed for killing
> NR_IRQS usage in generic code. Which is part of what we need to make this
> dynamic.
>
> In systems where the I/O is well balanced with the compute the typical
> usage is usually within 16 irqs per core, and at worst 32. That is an old
> rule of thumb observation and that makes for reasonable allocations.
>
> I don't have a problem at all with your code that updated the heuristic to
> be based on the NR_IOAPICS.
>
> My problem is with Thomas's patch that totally threw out all of our tuned
> heuristics and made NR_IRQS=256. Which is ludicrous. Even on 32bit systems
> there are cases where 1024 irq sources needed to be supported.
>
> Which is what NR_IRQ_VECTORS is. I goofed slightly in my comments.
> irq_vector only needs to be NR_IRQS in size. I think ACPI still needs
> NR_IRQ_VECTORS to know how many GSI the kernel can support. The fact they
> are not mapped 1-1 right now in the 32bit kernel is unfortunate.
>

I'm still interested in making Xen's event channel-based interrupts fit
better into the rest of the interrupt handling scheme. In particular,
event channels map very closely to the x86-64 notion of a vector.
There's 1024 of them per domain, and each is bound to a cpu. At the
moment, I map them to irqs, which means that I need to allocate around
5-6 irqs per cpu, which makes everything very cluttered. I'd like to
map event channels to vectors, and then map vectors to (irq,cpu) tuples.

From what I've seen this is exactly how x86-64 currently has things set
up, and I'm interested in making sure that 32-bit does the same thing.

I'm also interested in having vectors being sourced from multiple
interrupt controllers. So, some vectors would be sourced from APICs,
and other are sourced from event channels. This would be useful for Xen
domains which have direct access to hardware (ie, the dom0 control
domain in the short term, and disaggregated driver domains later on),
and fully emulated domains which have paravirtual drivers.

I haven't studied the current code to see if this notion already exists
or not.

While the APIC interrupt model is the most architecturally important for
the x86 platform, I'd like to make sure we don't build in the assumption
that it's the *only* interrupt model.

J

2008-07-30 00:03:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Jeremy Fitzhardinge <[email protected]> writes:

> I'm still interested in making Xen's event channel-based interrupts fit better
> into the rest of the interrupt handling scheme. In particular, event channels
> map very closely to the x86-64 notion of a vector. There's 1024 of them per
> domain, and each is bound to a cpu. At the moment, I map them to irqs, which
> means that I need to allocate around 5-6 irqs per cpu, which makes everything
> very cluttered. I'd like to map event channels to vectors, and then map vectors
> to (irq,cpu) tuples.

Uh.... I'm not certain this applies.

> From what I've seen this is exactly how x86-64 currently has things set up, and
> I'm interested in making sure that 32-bit does the same thing.

Yes. x86_32 needs work to get cleaned up.

The architecture on x86_64 is as follows.

We have interrupt sources: GSIs in the case of acpi.
We have linux interupts: something with an irq number.

Vectors are an internal implementation detail.

I don't know if your event channels more closely resemble interrupt sources or internal
implementation details. If they are an implementation detail that interrupt sources
just flow through we should hide them like we do vectors. If event channels actually are
the sources of interrupts we should do something different.

> I'm also interested in having vectors being sourced from multiple interrupt
> controllers. So, some vectors would be sourced from APICs, and other are
> sourced from event channels. This would be useful for Xen domains which have
> direct access to hardware (ie, the dom0 control domain in the short term, and
> disaggregated driver domains later on), and fully emulated domains which have
> paravirtual drivers.

Generally easy except for the disparate methods of catching interrupts.

> I haven't studied the current code to see if this notion already exists or not.
>
> While the APIC interrupt model is the most architecturally important for the x86
> platform, I'd like to make sure we don't build in the assumption that it's the
> *only* interrupt model.

Well with iommus starting to show up in our irq paths it looks we are going to get
a lot of diversity.

Eric

2008-07-30 00:20:37

by Alan

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

On Tue, 29 Jul 2008 16:12:50 -0700
[email protected] (Eric W. Biederman) wrote:

> "Yinghai Lu" <[email protected]> writes:
>
> >> But this would be a show stopper for SGI being able to ship systems if the
> >> distros do not want to waste this much memory and won't set NR_CPUS=4096.
> >
> > wonder if nr_irqs need to be probed dynamically too.
>
> NR_IRQS simply needs to die.

Agreed - it would also be nice to bite the bullet at the same time and
stop using dev->irq as an integer in an increasingly fake and imaginary
numberspace and instead make it an object with a description and other
properties. That would make "no IRQ" NULL, let us display meaningful IRQ
strings and the like as well as ensuring nobody tries to abuse NR_IRQ and
hardcoded similar knowledge for arrays

Alan

2008-07-30 00:51:15

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Eric W. Biederman wrote:
>> I'm still interested in making Xen's event channel-based interrupts fit better
>> into the rest of the interrupt handling scheme. In particular, event channels
>> map very closely to the x86-64 notion of a vector. There's 1024 of them per
>> domain, and each is bound to a cpu. At the moment, I map them to irqs, which
>> means that I need to allocate around 5-6 irqs per cpu, which makes everything
>> very cluttered. I'd like to map event channels to vectors, and then map vectors
>> to (irq,cpu) tuples.
>>
>
> Uh.... I'm not certain this applies.
>

No, but, hey, it's a hook.

>> From what I've seen this is exactly how x86-64 currently has things set up, and
>> I'm interested in making sure that 32-bit does the same thing.
>>
>
> Yes. x86_32 needs work to get cleaned up.
>
> The architecture on x86_64 is as follows.
>
> We have interrupt sources: GSIs in the case of acpi.
> We have linux interupts: something with an irq number.
>
> Vectors are an internal implementation detail.
>
> I don't know if your event channels more closely resemble interrupt sources or internal
> implementation details. If they are an implementation detail that interrupt sources
> just flow through we should hide them like we do vectors. If event channels actually are
> the sources of interrupts we should do something different.
>

They're an interrupt source, I guess. They need to be behind some layer
of indirection because they can be reassigned at arbitrary times (like
suspend/resume, or if the backend driver just decided to disconnect
itself), and so they need to get rebound to at least the same irq.

>> I'm also interested in having vectors being sourced from multiple interrupt
>> controllers. So, some vectors would be sourced from APICs, and other are
>> sourced from event channels. This would be useful for Xen domains which have
>> direct access to hardware (ie, the dom0 control domain in the short term, and
>> disaggregated driver domains later on), and fully emulated domains which have
>> paravirtual drivers.
>>
>
> Generally easy except for the disparate methods of catching interrupts.
>

Catching in what sense? I assume the interrupt gets raised in some
source-specific way, and then passed into a generic layer where it
eventually gets matched with an appropriate handler. I'm sure there's
some subtlety I'm missing.

>> I haven't studied the current code to see if this notion already exists or not.
>>
>> While the APIC interrupt model is the most architecturally important for the x86
>> platform, I'd like to make sure we don't build in the assumption that it's the
>> *only* interrupt model.
>>
>
> Well with iommus starting to show up in our irq paths it looks we are going to get
> a lot of diversity.
>
> Eric
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-07-30 01:33:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Alan Cox <[email protected]> writes:

> On Tue, 29 Jul 2008 16:12:50 -0700
> [email protected] (Eric W. Biederman) wrote:
>
>> "Yinghai Lu" <[email protected]> writes:
>>
>> >> But this would be a show stopper for SGI being able to ship systems if the
>> >> distros do not want to waste this much memory and won't set NR_CPUS=4096.
>> >
>> > wonder if nr_irqs need to be probed dynamically too.
>>
>> NR_IRQS simply needs to die.
>
> Agreed - it would also be nice to bite the bullet at the same time and
> stop using dev->irq as an integer in an increasingly fake and imaginary
> numberspace and instead make it an object with a description and other
> properties. That would make "no IRQ" NULL, let us display meaningful IRQ
> strings and the like as well as ensuring nobody tries to abuse NR_IRQ and
> hardcoded similar knowledge for arrays

Unfortunately that is extremely expensive and hard. Plus with NR_IRQs
going away the irq number space becomes sparse enough that the irq
numbers can become stable between boots, and as meaningful as they
have ever been. At least for userspace an irq number is part of the
ABI so we will need one for the forseeable future.

We really need to change architecture specific code/generic irq code
first, preserving the number based interface to the drivers. Then if
it looks useful change the drivers.

I would like nothing better then to change fields from "int irq" to
"struct irq_desc *irq" but unless we want to do a 2.7 we need a migration
path. So we can change drivers slowly. Which means providing both
interfaces in parallel for a while.

So it makes most sense to me to change everything beneath interfaces still
taking irq numbers (because that is where we get the performance and the
scalability improvements). Then push forward with driver level changes.

Eric

2008-07-30 01:43:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Jeremy Fitzhardinge <[email protected]> writes:

>> Generally easy except for the disparate methods of catching interrupts.
>>
>
> Catching in what sense? I assume the interrupt gets raised in some
> source-specific way, and then passed into a generic layer where it eventually
> gets matched with an appropriate handler. I'm sure there's some subtlety I'm
> missing.

Sorry. Probably too much context in my head.
The model I use is irq sources throw interrupts and then the cpus catch them.
Sometimes those in flight irqs go through several transformations.

Given that the event channels that logical irq are bound to change over time
I would say they appear to be not irq sources. Those are the physical
lines coming out of hardware devices (if physical), and the equivalent
parts of the hardware when the irqs are sent message based.

So it does sound like the event channels function much like vectors. Which
are the token thrown from ioapics to cpus to tell them which irq has happened
but have nothing to do with it.

The sources and the linux irq numbers should be stable if you don't unplug
anything. The rest are implementation details the architecture should hide.

In that case I don't see any reason we could not be receiving irqs with the
cpus catching vectors and with events showing up in event channels. Having
both running at the same time is a little odd but doable.

Eric

2008-08-01 17:48:32

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: kernel BUG at arch/x86/kernel/io_apic_64.c:357!

Eric W. Biederman wrote:
> Sorry. Probably too much context in my head.
> The model I use is irq sources throw interrupts and then the cpus catch them.
> Sometimes those in flight irqs go through several transformations.
>

OK, that's more or less my mental model too.

> Given that the event channels that logical irq are bound to change over time
> I would say they appear to be not irq sources. Those are the physical
> lines coming out of hardware devices (if physical), and the equivalent
> parts of the hardware when the irqs are sent message based.
>
> So it does sound like the event channels function much like vectors. Which
> are the token thrown from ioapics to cpus to tell them which irq has happened
> but have nothing to do with it.
>

Yes and no. The event channels are similar to the actual interrupt
wires coming out of a PCI card, but they generally connect to virtual
devices, and so are a lot more dynamic than physical devices. The event
channel remapping scenario I described is pretty much exactly analogous
to having a hotplug PCI device being pulled and then replugged into a
different slot with a different interrupt line.

> The sources and the linux irq numbers should be stable if you don't unplug
> anything. The rest are implementation details the architecture should hide.
>

Yeah, plugging can happen as a quite regular thing.

> In that case I don't see any reason we could not be receiving irqs with the
> cpus catching vectors and with events showing up in event channels. Having
> both running at the same time is a little odd but doable.
>

Yes, it's a bit odd, but there are two distinct configurations where
it's a useful thing to have, where the system is running in a hybrid
Xen/native environment.

J