Subject: x86: fix race in create_irq_nr on irq_desc

Race in create_irq_nr():

- Thread 1 loops through and calls irq_to_desc_alloc_node with new=0x66.

- Thread 2 has exited the loop with irq=0x66 and calls dynamic_irq_init(0x66)
setting desc->chip_data = NULL

- Thread 1 then dereferences NULL via desc_new->chip_data->vector

Fix by moving holding vector_lock until after the dynamic_irq_init().

BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
IP: [<ffffffff8101df32>] create_irq_nr+0x62/0x100
PGD 23dc24067 PUD 23dc72067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:08:00.0/net/eth2/type
CPU 12
Modules linked in: i2c_i801 igb(+) iTCO_wdt ixgbe(+) ioatdma(+) e1000e mptctl mdio usb_storage iTCO_vendor_support dca ses button sg pcspkr enclosure container ac usbhid uhci_hcd ehci_hcd usbcore sd_mod edd fan processor ide_pci_generic ide_core ata_generic ata_piix libata lpfc scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas megaraid_sas scsi_mod thermal thermal_sys
Supported: Yes
Pid: 1684, comm: modprobe Not tainted 2.6.32.3-0.3-default #1 PRIMERGY RX300 S5
RIP: 0010:[<ffffffff8101df32>] [<ffffffff8101df32>] create_irq_nr+0x62/0x100
RSP: 0018:ffff88013ce0fc18 EFLAGS: 00010086
RAX: ffff88023e11ee00 RBX: 0000000000000066 RCX: 00000000000000c2
RDX: 00000000000000c2 RSI: 00000000ffffffff RDI: 0000000000000066
RBP: 0000000000000000 R08: ffffffff81767a85 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffff
R13: 0000000000000206 R14: ffff88013f381000 R15: 0000000000000080
FS: 00007f16d181e700(0000) GS:ffff880143d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000088 CR3: 000000023d26c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 1684, threadinfo ffff88013ce0e000, task ffff88013d080340)
Stack:
0000000000000001 0000000000000000 ffff88023d2d8740 0000000000000064
<0> 0000000000000007 ffffffff8101f2ce 0000000900000009 ffff88013f381810
<0> ffffffff3f381000 0000000000000048 0000000000000009 ffff88013f381000
Call Trace:
[<ffffffff8101f2ce>] arch_setup_msi_irqs+0xce/0x190
[<ffffffff812574b9>] msix_capability_init+0x189/0x2f0
[<ffffffffa032b4a4>] igb_set_interrupt_capability+0xe4/0x1e0 [igb]
[<ffffffffa033634e>] igb_probe+0x3de/0xd15 [igb]
[<ffffffff8124d212>] local_pci_probe+0x12/0x20
[<ffffffff8124d4c0>] __pci_device_probe+0xe0/0xf0
[<ffffffff8124e3d3>] pci_device_probe+0x33/0x60
[<ffffffff812e72f7>] really_probe+0x77/0x230
[<ffffffff812e751a>] driver_probe_device+0x6a/0xc0
[<ffffffff812e7603>] __driver_attach+0x93/0xa0
[<ffffffff812e6928>] bus_for_each_dev+0x58/0x80
[<ffffffff812e6115>] bus_add_driver+0x195/0x2f0
[<ffffffff812e7919>] driver_register+0x79/0x170
[<ffffffff8124e648>] __pci_register_driver+0x58/0xe0
[<ffffffff810001e5>] do_one_initcall+0x35/0x190
[<ffffffff8108af34>] sys_init_module+0xe4/0x270
[<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b
[<00007f16d13b234a>] 0x7f16d13b234a
Code: 2e 0f 1f 84 00 00 00 00 00 83 c3 01 39 1d e7 e2 9f 00 76 7d 44 89 e6 89 df e8 2b 2a 3d 00 48 85 c0 0f 84 8a 00 00 00 48 8b 68 40 <80> bd 88 00 00 00 00 75 d5 44 89 e6 48 89 c7 e8 6a 5c 09 00 49
RIP [<ffffffff8101df32>] create_irq_nr+0x62/0x100
RSP <ffff88013ce0fc18>
CR2: 0000000000000088

Signed-off-by: Brandon Philips <[email protected]>

---
arch/x86/kernel/apic/io_apic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6.32-SLE11-SP1/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.32-SLE11-SP1.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6.32-SLE11-SP1/arch/x86/kernel/apic/io_apic.c
@@ -3212,7 +3212,6 @@ unsigned int create_irq_nr(unsigned int
irq = new;
break;
}
- spin_unlock_irqrestore(&vector_lock, flags);

if (irq > 0) {
dynamic_irq_init(irq);
@@ -3220,6 +3219,8 @@ unsigned int create_irq_nr(unsigned int
if (desc_new)
desc_new->chip_data = cfg_new;
}
+ spin_unlock_irqrestore(&vector_lock, flags);
+
return irq;
}


2010-02-03 10:25:26

by Yinghai Lu

[permalink] [raw]
Subject: Re: x86: fix race in create_irq_nr on irq_desc

On 02/02/2010 07:31 PM, Brandon Philips wrote:
> Race in create_irq_nr():
>
> - Thread 1 loops through and calls irq_to_desc_alloc_node with new=0x66.
>
> - Thread 2 has exited the loop with irq=0x66 and calls dynamic_irq_init(0x66)
> setting desc->chip_data = NULL
>
> - Thread 1 then dereferences NULL via desc_new->chip_data->vector

two threads get same irq?

YH

2010-02-03 10:33:46

by Yinghai Lu

[permalink] [raw]
Subject: Re: x86: fix race in create_irq_nr on irq_desc

On 02/02/2010 07:31 PM, Brandon Philips wrote:
> Race in create_irq_nr():
>
> - Thread 1 loops through and calls irq_to_desc_alloc_node with new=0x66.
>
> - Thread 2 has exited the loop with irq=0x66 and calls dynamic_irq_init(0x66)
> setting desc->chip_data = NULL
>
> - Thread 1 then dereferences NULL via desc_new->chip_data->vector
>
> Fix by moving holding vector_lock until after the dynamic_irq_init().
>
>
> Index: linux-2.6.32-SLE11-SP1/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6.32-SLE11-SP1.orig/arch/x86/kernel/apic/io_apic.c
> +++ linux-2.6.32-SLE11-SP1/arch/x86/kernel/apic/io_apic.c

can you check if

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=37ef2a3029fde884808ff1b369677abc7dd9a79a

fix your problem with 2.6.32?

>From 37ef2a3029fde884808ff1b369677abc7dd9a79a Mon Sep 17 00:00:00 2001
From: Yinghai Lu <[email protected]>
Date: Sat, 21 Nov 2009 00:23:37 -0800
Subject: [PATCH] x86: Re-get cfg_new in case reuse/move irq_desc

When irq_desc is moved, we need to make sure to use the right cfg_new.

Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index ff23719..085e60e 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3186,6 +3186,7 @@ unsigned int create_irq_nr(unsigned int irq_want, int node)
continue;

desc_new = move_irq_desc(desc_new, node);
+ cfg_new = desc_new->chip_data;

if (__assign_irq_vector(new, cfg_new, apic->target_cpus()) == 0)
irq = new;
--
1.6.6.1

Subject: Re: x86: fix race in create_irq_nr on irq_desc

On 02:20 Wed 03 Feb 2010, Yinghai Lu wrote:
> On 02/02/2010 07:31 PM, Brandon Philips wrote:
> > Race in create_irq_nr():
> >
> > - Thread 1 loops through and calls irq_to_desc_alloc_node with new=0x66.
> >
> > - Thread 2 has exited the loop with irq=0x66 and calls dynamic_irq_init(0x66)
> > setting desc->chip_data = NULL
> >
> > - Thread 1 then dereferences NULL via desc_new->chip_data->vector
>
> two threads get same irq?

This race happened when two drivers were setting up MSI-X at the same
time via pci_enable_msix(). See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function. So, let me rewrite my
explanation using this example:

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

Does that make sense? It is sort of a rare thing to reproduce- took
40+ reboots.

Thanks,

Brandon

2010-02-03 19:32:36

by Yinghai Lu

[permalink] [raw]
Subject: Re: x86: fix race in create_irq_nr on irq_desc

On 02/03/2010 09:42 AM, Brandon Philips wrote:
> On 02:20 Wed 03 Feb 2010, Yinghai Lu wrote:
>> On 02/02/2010 07:31 PM, Brandon Philips wrote:
>>> Race in create_irq_nr():
>>>
>>> - Thread 1 loops through and calls irq_to_desc_alloc_node with new=0x66.
>>>
>>> - Thread 2 has exited the loop with irq=0x66 and calls dynamic_irq_init(0x66)
>>> setting desc->chip_data = NULL
>>>
>>> - Thread 1 then dereferences NULL via desc_new->chip_data->vector
>>
>> two threads get same irq?
>
> This race happened when two drivers were setting up MSI-X at the same
> time via pci_enable_msix(). See this dmesg excerpt:
>
> [ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
> [ 85.170611] alloc irq_desc for 99 on node -1
> [ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
> [ 85.170614] alloc kstat_irqs on node -1
> [ 85.170616] alloc irq_2_iommu on node -1
> [ 85.170617] alloc irq_desc for 100 on node -1
> [ 85.170619] alloc kstat_irqs on node -1
> [ 85.170621] alloc irq_2_iommu on node -1
> [ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
> [ 85.170626] alloc irq_desc for 101 on node -1
> [ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
> [ 85.170630] alloc kstat_irqs on node -1
> [ 85.170631] alloc irq_2_iommu on node -1
> [ 85.170635] alloc irq_desc for 102 on node -1
> [ 85.170636] alloc kstat_irqs on node -1
> [ 85.170639] alloc irq_2_iommu on node -1
> [ 85.170646] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000088
>
> As you can see igb and ixgbe are both alternating on create_irq_nr()
> via pci_enable_msix() in their probe function. So, let me rewrite my
> explanation using this example:
>
> ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
> choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
> calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
> NULL via dynamic_irq_init().
>
> igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
> via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:
>
> cfg_new = irq_desc_ptrs[102]->chip_data;
> if (cfg_new->vector != 0)
> continue;
>
> This hits the NULL deref.
>

please try following patch in addition to

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=37ef2a3029fde884808ff1b369677abc7dd9a79a

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 7edafc7..14099ba 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3280,12 +3280,9 @@ unsigned int create_irq_nr(unsigned int irq_want, int node)
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

diff --git a/include/linux/irq.h b/include/linux/irq.h
index d13492d..cd6b870 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -400,6 +400,7 @@ static inline int irq_has_action(unsigned int irq)

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);

/* Set/get chip/data for an IRQ: */
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index ecc3fa2..370dbc4 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -22,7 +22,7 @@
* dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +41,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -54,6 +55,16 @@ void dynamic_irq_init(unsigned int irq)
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
/**
* dynamic_irq_cleanup - cleanup a dynamically allocated irq
* @irq: irq number to initialize

Subject: Re: x86: fix race in create_irq_nr on irq_desc

On 11:31 Wed 03 Feb 2010, Yinghai Lu wrote:
> On 02/03/2010 09:42 AM, Brandon Philips wrote:
> > On 02:20 Wed 03 Feb 2010, Yinghai Lu wrote:
> >> On 02/02/2010 07:31 PM, Brandon Philips wrote:
> >>> Race in create_irq_nr():
> >>>
> >>> - Thread 1 loops through and calls irq_to_desc_alloc_node with new=0x66.
> >>>
> >>> - Thread 2 has exited the loop with irq=0x66 and calls dynamic_irq_init(0x66)
> >>> setting desc->chip_data = NULL
> >>>
> >>> - Thread 1 then dereferences NULL via desc_new->chip_data->vector
> >>
> >> two threads get same irq?
> >
> > This race happened when two drivers were setting up MSI-X at the same
> > time via pci_enable_msix(). See this dmesg excerpt:
> >
> > [ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
> > [ 85.170611] alloc irq_desc for 99 on node -1
> > [ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
> > [ 85.170614] alloc kstat_irqs on node -1
> > [ 85.170616] alloc irq_2_iommu on node -1
> > [ 85.170617] alloc irq_desc for 100 on node -1
> > [ 85.170619] alloc kstat_irqs on node -1
> > [ 85.170621] alloc irq_2_iommu on node -1
> > [ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
> > [ 85.170626] alloc irq_desc for 101 on node -1
> > [ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
> > [ 85.170630] alloc kstat_irqs on node -1
> > [ 85.170631] alloc irq_2_iommu on node -1
> > [ 85.170635] alloc irq_desc for 102 on node -1
> > [ 85.170636] alloc kstat_irqs on node -1
> > [ 85.170639] alloc irq_2_iommu on node -1
> > [ 85.170646] BUG: unable to handle kernel NULL pointer dereference
> > at 0000000000000088
> >
> > As you can see igb and ixgbe are both alternating on create_irq_nr()
> > via pci_enable_msix() in their probe function. So, let me rewrite my
> > explanation using this example:
> >
> > ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
> > choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
> > calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
> > NULL via dynamic_irq_init().
> >
> > igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
> > via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:
> >
> > cfg_new = irq_desc_ptrs[102]->chip_data;
> > if (cfg_new->vector != 0)
> > continue;
> >
> > This hits the NULL deref.
> >
>
> please try following patch in addition to
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=37ef2a3029fde884808ff1b369677abc7dd9a79a

How is this commit related to this bug? The NULL deref I am hitting is
from this bit in create_irq_nr():

if (cfg_new->vector != 0)
continue;

Which comes before the assignment of cfg_new. I don't see how it is
related. Plus, node == -1 in this case so move_irq_desc() is a no-op.

> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 7edafc7..14099ba 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -3280,12 +3280,9 @@ unsigned int create_irq_nr(unsigned int irq_want, int node)
> }
> spin_unlock_irqrestore(&vector_lock, flags);
>
> - if (irq > 0) {
> - dynamic_irq_init(irq);
> - /* restore it, in case dynamic_irq_init clear it */
> - if (desc_new)
> - desc_new->chip_data = cfg_new;
> - }
> + if (irq > 0)
> + dynamic_irq_init_keep_chip_data(irq);
> +
> return irq;
> }

That would solve it too but I don't think it is a great
solution. Keeping the vector_lock until we are completely done setting
up the irq is more straightforward and won't cost much time at all.

I am hesitant to have it tested since it is a really small race
window, reproducing took 40+ reboots initially and looks technically
correct.

Thanks,

Brandon

2010-02-05 08:47:15

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH] x86: keep chip_data in create_irq_nr



Brodon found:
race happened when two drivers were setting up MSI-X at the same
time via pci_enable_msix(). See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

so let remove the save and restore code.

just don't clear it in that path

Reported-and-analyzed-by: Brandon Philips <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: [email protected]

---
arch/x86/kernel/apic/io_apic.c | 9 +++------
include/linux/irq.h | 1 +
kernel/irq/chip.c | 15 +++++++++++++--
3 files changed, 17 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3280,12 +3280,9 @@ unsigned int create_irq_nr(unsigned int
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -400,6 +400,7 @@ static inline int irq_has_action(unsigne

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);

/* Set/get chip/data for an IRQ: */
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -22,7 +22,7 @@
* dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +41,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -54,6 +55,16 @@ void dynamic_irq_init(unsigned int irq)
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
/**
* dynamic_irq_cleanup - cleanup a dynamically allocated irq
* @irq: irq number to initialize

Subject: Re: [PATCH] x86: keep chip_data in create_irq_nr

On 00:45 Fri 05 Feb 2010, Yinghai Lu wrote:
> Brodon found:
> race happened when two drivers were setting up MSI-X at the same
> time via pci_enable_msix(). See this dmesg excerpt:
>
> [ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
> [ 85.170611] alloc irq_desc for 99 on node -1
> [ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
> [ 85.170614] alloc kstat_irqs on node -1
> [ 85.170616] alloc irq_2_iommu on node -1
> [ 85.170617] alloc irq_desc for 100 on node -1
> [ 85.170619] alloc kstat_irqs on node -1
> [ 85.170621] alloc irq_2_iommu on node -1
> [ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
> [ 85.170626] alloc irq_desc for 101 on node -1
> [ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
> [ 85.170630] alloc kstat_irqs on node -1
> [ 85.170631] alloc irq_2_iommu on node -1
> [ 85.170635] alloc irq_desc for 102 on node -1
> [ 85.170636] alloc kstat_irqs on node -1
> [ 85.170639] alloc irq_2_iommu on node -1
> [ 85.170646] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000088
>
> As you can see igb and ixgbe are both alternating on create_irq_nr()
> via pci_enable_msix() in their probe function.
>
> ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
> choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
> calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
> NULL via dynamic_irq_init().
>
> igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
> via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:
>
> cfg_new = irq_desc_ptrs[102]->chip_data;
> if (cfg_new->vector != 0)
> continue;
>
> This hits the NULL deref.
>
> so let remove the save and restore code.
> just don't clear it in that path
>
> Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
> +++ linux-2.6/arch/x86/kernel/apic/io_apic.c
> @@ -3280,12 +3280,9 @@ unsigned int create_irq_nr(unsigned int
> }
> spin_unlock_irqrestore(&vector_lock, flags);
>
> - if (irq > 0) {
> - dynamic_irq_init(irq);
> - /* restore it, in case dynamic_irq_init clear it */
> - if (desc_new)
> - desc_new->chip_data = cfg_new;
> - }
> + if (irq > 0)
> + dynamic_irq_init_keep_chip_data(irq);
> +
> return irq;
> }

Nearly every function in kernel/irq/chip.c takes the desc->lock when
manipulating the fields of the irq_desc including chip_data. Should
create_irq_nr() do the same when getting the chip_data field?

I am just a bit confused on what protects the chip_data field now.

Actually, while looking at your patch there is a related race in
destroy_irq() that I just noticed. This race could happen via
pci_disable_msix() in a driver or in the number of error paths that
call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

It could race with create_irq_nr() in the same way in the irq destroy
path.

So, I will reply after this with a combined patch fixing this
potential race along with the minor things below.

Cheers,

Brandon

>
> Index: linux-2.6/include/linux/irq.h
> ===================================================================
> --- linux-2.6.orig/include/linux/irq.h
> +++ linux-2.6/include/linux/irq.h
> @@ -400,6 +400,7 @@ static inline int irq_has_action(unsigne
>
> /* Dynamic irq helper functions */
> extern void dynamic_irq_init(unsigned int irq);
> +void dynamic_irq_init_keep_chip_data(unsigned int irq);
> extern void dynamic_irq_cleanup(unsigned int irq);

Missing extern?

> /* Set/get chip/data for an IRQ: */
> Index: linux-2.6/kernel/irq/chip.c
> ===================================================================
> --- linux-2.6.orig/kernel/irq/chip.c
> +++ linux-2.6/kernel/irq/chip.c
> @@ -22,7 +22,7 @@
> * dynamic_irq_init - initialize a dynamically allocated irq
> * @irq: irq number to initialize

Update kerndoc?

> +static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
> {
> struct irq_desc *desc;
> unsigned long flags;
> @@ -41,7 +41,8 @@ void dynamic_irq_init(unsigned int irq)
> desc->depth = 1;
> desc->msi_desc = NULL;
> desc->handler_data = NULL;
> - desc->chip_data = NULL;
> + if (!keep_chip_data)
> + desc->chip_data = NULL;
> desc->action = NULL;
> desc->irq_count = 0;
> desc->irqs_unhandled = 0;
> @@ -54,6 +55,16 @@ void dynamic_irq_init(unsigned int irq)
> raw_spin_unlock_irqrestore(&desc->lock, flags);
> }
>
> +void dynamic_irq_init(unsigned int irq)
> +{
> + dynamic_irq_init_x(irq, false);
> +}
> +
> +void dynamic_irq_init_keep_chip_data(unsigned int irq)
> +{
> + dynamic_irq_init_x(irq, true);
> +}
> +
> /**
> * dynamic_irq_cleanup - cleanup a dynamically allocated irq
> * @irq: irq number to initialize

Subject: [PATCH] x86: keep chip_data in create_irq_nr and destroy_irq

When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race. See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() and take the desc->lock when checking the irq_cfg.

Reported-and-analyzed-by: Brandon Philips <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Brandon Phiilps <[email protected]>
Cc: [email protected]

---
arch/x86/kernel/apic/io_apic.c | 14 +++--------
include/linux/irq.h | 2 +
kernel/irq/chip.c | 52 +++++++++++++++++++++++++++++++++--------
3 files changed, 49 insertions(+), 19 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3228,12 +3228,9 @@ unsigned int create_irq_nr(unsigned int
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

@@ -3260,10 +3257,7 @@ void destroy_irq(unsigned int irq)

/* store it, in case dynamic_irq_cleanup clear it */
desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -400,7 +400,9 @@ static inline int irq_has_action(unsigne

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+extern void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);
+extern void dynamic_irq_cleanup_keep_chip_data(unsigned int irq);

/* Set/get chip/data for an IRQ: */
extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -18,11 +18,7 @@

#include "internals.h"

-/**
- * dynamic_irq_init - initialize a dynamically allocated irq
- * @irq: irq number to initialize
- */
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +37,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -55,10 +52,26 @@ void dynamic_irq_init(unsigned int irq)
}

/**
- * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_cleanup(unsigned int irq)
+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+/**
+ * dynamic_irq_init_keep_chip_data - initialize a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
+static void dynamic_irq_cleanup_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc = irq_to_desc(irq);
unsigned long flags;
@@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
}
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->handle_irq = handle_bad_irq;
desc->chip = &no_irq_chip;
desc->name = NULL;
@@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+/**
+ * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ */
+void dynamic_irq_cleanup(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+/**
+ * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+

/**
* set_irq_chip - set the irq chip for an irq

2010-02-05 21:43:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: keep chip_data in create_irq_nr

On 02/05/2010 01:05 PM, Brandon Philips wrote:
>>
>> Index: linux-2.6/include/linux/irq.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/irq.h
>> +++ linux-2.6/include/linux/irq.h
>> @@ -400,6 +400,7 @@ static inline int irq_has_action(unsigne
>>
>> /* Dynamic irq helper functions */
>> extern void dynamic_irq_init(unsigned int irq);
>> +void dynamic_irq_init_keep_chip_data(unsigned int irq);
>> extern void dynamic_irq_cleanup(unsigned int irq);
>
> Missing extern?
>

Quite the opposite -- function prototypes don't need "extern", and
putting "extern" on them is generally considered bad style.

-hpa

2010-02-05 22:46:32

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] x86: keep chip_data in create_irq_nr and destroy_irq

On 02/05/2010 01:09 PM, Brandon Philips wrote:
> When two drivers are setting up MSI-X at the same time via
> pci_enable_msix() there is a race. See this dmesg excerpt:
>
> [ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
> [ 85.170611] alloc irq_desc for 99 on node -1
> [ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
> [ 85.170614] alloc kstat_irqs on node -1
> [ 85.170616] alloc irq_2_iommu on node -1
> [ 85.170617] alloc irq_desc for 100 on node -1
> [ 85.170619] alloc kstat_irqs on node -1
> [ 85.170621] alloc irq_2_iommu on node -1
> [ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
> [ 85.170626] alloc irq_desc for 101 on node -1
> [ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
> [ 85.170630] alloc kstat_irqs on node -1
> [ 85.170631] alloc irq_2_iommu on node -1
> [ 85.170635] alloc irq_desc for 102 on node -1
> [ 85.170636] alloc kstat_irqs on node -1
> [ 85.170639] alloc irq_2_iommu on node -1
> [ 85.170646] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000088
>
> As you can see igb and ixgbe are both alternating on create_irq_nr()
> via pci_enable_msix() in their probe function.
>
> ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
> choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
> calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
> NULL via dynamic_irq_init().
>
> igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
> via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:
>
> cfg_new = irq_desc_ptrs[102]->chip_data;
> if (cfg_new->vector != 0)
> continue;
>
> This hits the NULL deref.
>
> Another possible race exists via pci_disable_msix() in a driver or in
> the number of error paths that call free_msi_irqs():
>
> destroy_irq()
> dynamic_irq_cleanup() which sets desc->chip_data = NULL
> ...race window...
> desc->chip_data = cfg;
>
> Remove the save and restore code for cfg in create_irq_nr() and
> destroy_irq() and take the desc->lock when checking the irq_cfg.
>
> Reported-and-analyzed-by: Brandon Philips <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
> Signed-off-by: Brandon Phiilps <[email protected]>
> Cc: [email protected]
>
> ---
> arch/x86/kernel/apic/io_apic.c | 14 +++--------
> include/linux/irq.h | 2 +
> kernel/irq/chip.c | 52 +++++++++++++++++++++++++++++++++--------
> 3 files changed, 49 insertions(+), 19 deletions(-)
>
> Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
> +++ linux-2.6/arch/x86/kernel/apic/io_apic.c
> @@ -3228,12 +3228,9 @@ unsigned int create_irq_nr(unsigned int
> }
> spin_unlock_irqrestore(&vector_lock, flags);
>
> - if (irq > 0) {
> - dynamic_irq_init(irq);
> - /* restore it, in case dynamic_irq_init clear it */
> - if (desc_new)
> - desc_new->chip_data = cfg_new;
> - }
> + if (irq > 0)
> + dynamic_irq_init_keep_chip_data(irq);
> +
> return irq;
> }
>
> @@ -3260,10 +3257,7 @@ void destroy_irq(unsigned int irq)
>
> /* store it, in case dynamic_irq_cleanup clear it */
> desc = irq_to_desc(irq);
> - cfg = desc->chip_data;
> - dynamic_irq_cleanup(irq);
> - /* connect back irq_cfg */
> - desc->chip_data = cfg;
> + dynamic_irq_cleanup_keep_chip_data(irq);
>
> free_irte(irq);
> spin_lock_irqsave(&vector_lock, flags);
> Index: linux-2.6/include/linux/irq.h
> ===================================================================
> --- linux-2.6.orig/include/linux/irq.h
> +++ linux-2.6/include/linux/irq.h
> @@ -400,7 +400,9 @@ static inline int irq_has_action(unsigne
>
> /* Dynamic irq helper functions */
> extern void dynamic_irq_init(unsigned int irq);
> +extern void dynamic_irq_init_keep_chip_data(unsigned int irq);
> extern void dynamic_irq_cleanup(unsigned int irq);
> +extern void dynamic_irq_cleanup_keep_chip_data(unsigned int irq);
>
> /* Set/get chip/data for an IRQ: */
> extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
> Index: linux-2.6/kernel/irq/chip.c
> ===================================================================
> --- linux-2.6.orig/kernel/irq/chip.c
> +++ linux-2.6/kernel/irq/chip.c
> @@ -18,11 +18,7 @@
>
> #include "internals.h"
>
> -/**
> - * dynamic_irq_init - initialize a dynamically allocated irq
> - * @irq: irq number to initialize
> - */
> -void dynamic_irq_init(unsigned int irq)
> +static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
> {
> struct irq_desc *desc;
> unsigned long flags;
> @@ -41,7 +37,8 @@ void dynamic_irq_init(unsigned int irq)
> desc->depth = 1;
> desc->msi_desc = NULL;
> desc->handler_data = NULL;
> - desc->chip_data = NULL;
> + if (!keep_chip_data)
> + desc->chip_data = NULL;
> desc->action = NULL;
> desc->irq_count = 0;
> desc->irqs_unhandled = 0;
> @@ -55,10 +52,26 @@ void dynamic_irq_init(unsigned int irq)
> }
>
> /**
> - * dynamic_irq_cleanup - cleanup a dynamically allocated irq
> + * dynamic_irq_init - initialize a dynamically allocated irq
> * @irq: irq number to initialize
> */
> -void dynamic_irq_cleanup(unsigned int irq)
> +void dynamic_irq_init(unsigned int irq)
> +{
> + dynamic_irq_init_x(irq, false);
> +}
> +
> +/**
> + * dynamic_irq_init_keep_chip_data - initialize a dynamically allocated irq
> + * @irq: irq number to initialize
> + *
> + * does not set irq_to_desc(irq)->chip_data to NULL
> + */
> +void dynamic_irq_init_keep_chip_data(unsigned int irq)
> +{
> + dynamic_irq_init_x(irq, true);
> +}
> +
> +static void dynamic_irq_cleanup_x(unsigned int irq, bool keep_chip_data)
> {
> struct irq_desc *desc = irq_to_desc(irq);
> unsigned long flags;
> @@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
> }
> desc->msi_desc = NULL;
> desc->handler_data = NULL;
> - desc->chip_data = NULL;
> + if (!keep_chip_data)
> + desc->chip_data = NULL;
> desc->handle_irq = handle_bad_irq;
> desc->chip = &no_irq_chip;
> desc->name = NULL;
> @@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
> raw_spin_unlock_irqrestore(&desc->lock, flags);
> }
>
> +/**
> + * dynamic_irq_cleanup - cleanup a dynamically allocated irq
> + * @irq: irq number to initialize
> + */
> +void dynamic_irq_cleanup(unsigned int irq)
> +{
> + dynamic_irq_init_x(irq, false);

should be dynamic_irq_cleanup_x here.



> +}
> +
> +/**
> + * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
> + * @irq: irq number to initialize
> + *
> + * does not set irq_to_desc(irq)->chip_data to NULL
> + */
> +void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
> +{
> + dynamic_irq_init_x(irq, true);

should be dynamic_irq_cleanup_x

> +}
> +
>
> /**
> * set_irq_chip - set the irq chip for an irq

YH

Subject: Re: [PATCH] x86: keep chip_data in create_irq_nr and destroy_irq

On 14:44 Fri 05 Feb 2010, Yinghai Lu wrote:
> On 02/05/2010 01:09 PM, Brandon Philips wrote:
> > @@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
> > }
> > desc->msi_desc = NULL;
> > desc->handler_data = NULL;
> > - desc->chip_data = NULL;
> > + if (!keep_chip_data)
> > + desc->chip_data = NULL;
> > desc->handle_irq = handle_bad_irq;
> > desc->chip = &no_irq_chip;
> > desc->name = NULL;
> > @@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
> > raw_spin_unlock_irqrestore(&desc->lock, flags);
> > }
> >
> > +/**
> > + * dynamic_irq_cleanup - cleanup a dynamically allocated irq
> > + * @irq: irq number to initialize
> > + */
> > +void dynamic_irq_cleanup(unsigned int irq)
> > +{
> > + dynamic_irq_init_x(irq, false);
>
> should be dynamic_irq_cleanup_x here.
>
> > +}
> > +
> > +/**
> > + * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
> > + * @irq: irq number to initialize
> > + *
> > + * does not set irq_to_desc(irq)->chip_data to NULL
> > + */
> > +void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
> > +{
> > + dynamic_irq_init_x(irq, true);
>
> should be dynamic_irq_cleanup_x

Oops, right. I will fix this up along with the externs as hpa
suggested and send again.

What are your thoughts on locking? Does it look OK as is?

Thanks,

Brandon

2010-02-06 00:08:39

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] x86: keep chip_data in create_irq_nr and destroy_irq

On 02/05/2010 02:55 PM, Brandon Philips wrote:
> On 14:44 Fri 05 Feb 2010, Yinghai Lu wrote:
>> On 02/05/2010 01:09 PM, Brandon Philips wrote:
>>> @@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
>>> }
>>> desc->msi_desc = NULL;
>>> desc->handler_data = NULL;
>>> - desc->chip_data = NULL;
>>> + if (!keep_chip_data)
>>> + desc->chip_data = NULL;
>>> desc->handle_irq = handle_bad_irq;
>>> desc->chip = &no_irq_chip;
>>> desc->name = NULL;
>>> @@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
>>> raw_spin_unlock_irqrestore(&desc->lock, flags);
>>> }
>>>
>>> +/**
>>> + * dynamic_irq_cleanup - cleanup a dynamically allocated irq
>>> + * @irq: irq number to initialize
>>> + */
>>> +void dynamic_irq_cleanup(unsigned int irq)
>>> +{
>>> + dynamic_irq_init_x(irq, false);
>>
>> should be dynamic_irq_cleanup_x here.
>>
>>> +}
>>> +
>>> +/**
>>> + * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
>>> + * @irq: irq number to initialize
>>> + *
>>> + * does not set irq_to_desc(irq)->chip_data to NULL
>>> + */
>>> +void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
>>> +{
>>> + dynamic_irq_init_x(irq, true);
>>
>> should be dynamic_irq_cleanup_x
>
> Oops, right. I will fix this up along with the externs as hpa
> suggested and send again.

>
> What are your thoughts on locking? Does it look OK as is?
>

ok to me.

YH

Subject: [PATCH v2] x86: keep chip_data in create_irq_nr and destroy_irq

When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race. See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() to avoid the NULL.

Reported-and-analyzed-by: Brandon Philips <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Brandon Phiilps <[email protected]>
Cc: [email protected]

---
arch/x86/kernel/apic/io_apic.c | 14 +++--------
include/linux/irq.h | 2 +
kernel/irq/chip.c | 52 +++++++++++++++++++++++++++++++++--------
3 files changed, 49 insertions(+), 19 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3228,12 +3228,9 @@ unsigned int create_irq_nr(unsigned int
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

@@ -3260,10 +3257,7 @@ void destroy_irq(unsigned int irq)

/* store it, in case dynamic_irq_cleanup clear it */
desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -400,7 +400,9 @@ static inline int irq_has_action(unsigne

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq);

/* Set/get chip/data for an IRQ: */
extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -18,11 +18,7 @@

#include "internals.h"

-/**
- * dynamic_irq_init - initialize a dynamically allocated irq
- * @irq: irq number to initialize
- */
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +37,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -55,10 +52,26 @@ void dynamic_irq_init(unsigned int irq)
}

/**
- * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_cleanup(unsigned int irq)
+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+/**
+ * dynamic_irq_init_keep_chip_data - initialize a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
+static void dynamic_irq_cleanup_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc = irq_to_desc(irq);
unsigned long flags;
@@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
}
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->handle_irq = handle_bad_irq;
desc->chip = &no_irq_chip;
desc->name = NULL;
@@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+/**
+ * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ */
+void dynamic_irq_cleanup(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, false);
+}
+
+/**
+ * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, true);
+}
+

/**
* set_irq_chip - set the irq chip for an irq

Subject: Re: [PATCH v3] x86: keep chip_data in create_irq_nr and destroy_irq

Version 3: Forgot to refresh patch so destroy_irq() had uninitialized
cfg as param to __clear_irq_vector(). Fixed now.

When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race. See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() and take the desc->lock when checking the irq_cfg.

Reported-and-analyzed-by: Brandon Philips <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Brandon Phiilps <[email protected]>
Cc: [email protected]

---
arch/x86/kernel/apic/io_apic.c | 17 +++----------
include/linux/irq.h | 2 +
kernel/irq/chip.c | 52 +++++++++++++++++++++++++++++++++--------
3 files changed, 50 insertions(+), 21 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3228,12 +3228,9 @@ unsigned int create_irq_nr(unsigned int
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

@@ -3255,19 +3252,15 @@ int create_irq(void)
void destroy_irq(unsigned int irq)
{
unsigned long flags;
- struct irq_cfg *cfg;
struct irq_desc *desc;

/* store it, in case dynamic_irq_cleanup clear it */
desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
- __clear_irq_vector(irq, cfg);
+ __clear_irq_vector(irq, desc->chip_data);
spin_unlock_irqrestore(&vector_lock, flags);
}

Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -400,7 +400,9 @@ static inline int irq_has_action(unsigne

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq);

/* Set/get chip/data for an IRQ: */
extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -18,11 +18,7 @@

#include "internals.h"

-/**
- * dynamic_irq_init - initialize a dynamically allocated irq
- * @irq: irq number to initialize
- */
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +37,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -55,10 +52,26 @@ void dynamic_irq_init(unsigned int irq)
}

/**
- * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_cleanup(unsigned int irq)
+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+/**
+ * dynamic_irq_init_keep_chip_data - initialize a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
+static void dynamic_irq_cleanup_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc = irq_to_desc(irq);
unsigned long flags;
@@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
}
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->handle_irq = handle_bad_irq;
desc->chip = &no_irq_chip;
desc->name = NULL;
@@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+/**
+ * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ */
+void dynamic_irq_cleanup(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, false);
+}
+
+/**
+ * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, true);
+}
+

/**
* set_irq_chip - set the irq chip for an irq

2010-02-06 07:20:41

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v3] x86: keep chip_data in create_irq_nr and destroy_irq

On 02/05/2010 10:42 PM, Brandon Philips wrote:
> Version 3: Forgot to refresh patch so destroy_irq() had uninitialized
> cfg as param to __clear_irq_vector(). Fixed now.

> @@ -3255,19 +3252,15 @@ int create_irq(void)
> void destroy_irq(unsigned int irq)
> {
> unsigned long flags;
> - struct irq_cfg *cfg;
> struct irq_desc *desc;
>
> /* store it, in case dynamic_irq_cleanup clear it */
> desc = irq_to_desc(irq);
> - cfg = desc->chip_data;
> - dynamic_irq_cleanup(irq);
> - /* connect back irq_cfg */
> - desc->chip_data = cfg;
> + dynamic_irq_cleanup_keep_chip_data(irq);
>
> free_irte(irq);
> spin_lock_irqsave(&vector_lock, flags);
> - __clear_irq_vector(irq, cfg);
> + __clear_irq_vector(irq, desc->chip_data);
> spin_unlock_irqrestore(&vector_lock, flags);
> }

==>
@@ -3308,17 +3305,12 @@ void destroy_irq(unsigned int irq)
{
unsigned long flags;
struct irq_cfg *cfg;
- struct irq_desc *desc;

- /* store it, in case dynamic_irq_cleanup clear it */
- desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
+ cfg = irq_to_desc(irq)->chip_data;
__clear_irq_vector(irq, cfg);
spin_unlock_irqrestore(&vector_lock, flags);
}


Yinghai

Subject: Re: [PATCH v3] x86: keep chip_data in create_irq_nr and destroy_irq

On 23:16 Fri 05 Feb 2010, Yinghai Lu wrote:
> On 02/05/2010 10:42 PM, Brandon Philips wrote:
> @@ -3308,17 +3305,12 @@ void destroy_irq(unsigned int irq)
> {
> unsigned long flags;
> struct irq_cfg *cfg;
> - struct irq_desc *desc;
>
> - /* store it, in case dynamic_irq_cleanup clear it */
> - desc = irq_to_desc(irq);
> - cfg = desc->chip_data;
> - dynamic_irq_cleanup(irq);
> - /* connect back irq_cfg */
> - desc->chip_data = cfg;
> + dynamic_irq_cleanup_keep_chip_data(irq);
>
> free_irte(irq);
> spin_lock_irqsave(&vector_lock, flags);
> + cfg = irq_to_desc(irq)->chip_data;
> __clear_irq_vector(irq, cfg);

Or even two lines better!

void destroy_irq(unsigned int irq)
{
unsigned long flags;
- struct irq_cfg *cfg;
- struct irq_desc *desc;

- /* store it, in case dynamic_irq_cleanup clear it */
- desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
- __clear_irq_vector(irq, cfg);
+ __clear_irq_vector(irq, get_irq_chip_data(irq));
spin_unlock_irqrestore(&vector_lock, flags);
}


Sigh... I guess I will send the patch again :D

Cheers,

Brandon

Subject: Re: [PATCH v4] x86: keep chip_data in create_irq_nr and destroy_irq

Version 4: use get_irq_chip_data() in destroy_irq() to get rid of some
local vars.

When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race. See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() and take the desc->lock when checking the irq_cfg.

Reported-and-analyzed-by: Brandon Philips <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Brandon Phiilps <[email protected]>
Cc: [email protected]

---
arch/x86/kernel/apic/io_apic.c | 20 +++------------
include/linux/irq.h | 2 +
kernel/irq/chip.c | 52 +++++++++++++++++++++++++++++++++--------
3 files changed, 50 insertions(+), 24 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic/io_apic.c
+++ linux-2.6/arch/x86/kernel/apic/io_apic.c
@@ -3228,12 +3228,9 @@ unsigned int create_irq_nr(unsigned int
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

@@ -3255,19 +3252,12 @@ int create_irq(void)
void destroy_irq(unsigned int irq)
{
unsigned long flags;
- struct irq_cfg *cfg;
- struct irq_desc *desc;

- /* store it, in case dynamic_irq_cleanup clear it */
- desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
- __clear_irq_vector(irq, cfg);
+ __clear_irq_vector(irq, get_irq_chip_data(irq));
spin_unlock_irqrestore(&vector_lock, flags);
}

Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -400,7 +400,9 @@ static inline int irq_has_action(unsigne

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq);

/* Set/get chip/data for an IRQ: */
extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -18,11 +18,7 @@

#include "internals.h"

-/**
- * dynamic_irq_init - initialize a dynamically allocated irq
- * @irq: irq number to initialize
- */
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +37,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -55,10 +52,26 @@ void dynamic_irq_init(unsigned int irq)
}

/**
- * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_cleanup(unsigned int irq)
+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+/**
+ * dynamic_irq_init_keep_chip_data - initialize a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
+static void dynamic_irq_cleanup_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc = irq_to_desc(irq);
unsigned long flags;
@@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int ir
}
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->handle_irq = handle_bad_irq;
desc->chip = &no_irq_chip;
desc->name = NULL;
@@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int ir
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+/**
+ * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ */
+void dynamic_irq_cleanup(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, false);
+}
+
+/**
+ * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, true);
+}
+

/**
* set_irq_chip - set the irq chip for an irq

Subject: [tip:x86/urgent] x86, irq: Keep chip_data in create_irq_nr and destroy_irq

Commit-ID: eb5b3794062824ba12d883901eea49ea89d0a678
Gitweb: http://git.kernel.org/tip/eb5b3794062824ba12d883901eea49ea89d0a678
Author: Brandon Philips <[email protected]>
AuthorDate: Sun, 7 Feb 2010 13:02:50 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Thu, 18 Feb 2010 21:53:15 -0800

x86, irq: Keep chip_data in create_irq_nr and destroy_irq

Version 4: use get_irq_chip_data() in destroy_irq() to get rid of some
local vars.

When two drivers are setting up MSI-X at the same time via
pci_enable_msix() there is a race. See this dmesg excerpt:

[ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X
[ 85.170611] alloc irq_desc for 99 on node -1
[ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X
[ 85.170614] alloc kstat_irqs on node -1
[ 85.170616] alloc irq_2_iommu on node -1
[ 85.170617] alloc irq_desc for 100 on node -1
[ 85.170619] alloc kstat_irqs on node -1
[ 85.170621] alloc irq_2_iommu on node -1
[ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X
[ 85.170626] alloc irq_desc for 101 on node -1
[ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X
[ 85.170630] alloc kstat_irqs on node -1
[ 85.170631] alloc irq_2_iommu on node -1
[ 85.170635] alloc irq_desc for 102 on node -1
[ 85.170636] alloc kstat_irqs on node -1
[ 85.170639] alloc irq_2_iommu on node -1
[ 85.170646] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000088

As you can see igb and ixgbe are both alternating on create_irq_nr()
via pci_enable_msix() in their probe function.

ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe
choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and
calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data =
NULL via dynamic_irq_init().

igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[]
via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this:

cfg_new = irq_desc_ptrs[102]->chip_data;
if (cfg_new->vector != 0)
continue;

This hits the NULL deref.

Another possible race exists via pci_disable_msix() in a driver or in
the number of error paths that call free_msi_irqs():

destroy_irq()
dynamic_irq_cleanup() which sets desc->chip_data = NULL
...race window...
desc->chip_data = cfg;

Remove the save and restore code for cfg in create_irq_nr() and
destroy_irq() and take the desc->lock when checking the irq_cfg.

Reported-and-analyzed-by: Brandon Philips <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Brandon Phiilps <[email protected]>
Cc: [email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 20 ++++-----------
include/linux/irq.h | 2 +
kernel/irq/chip.c | 52 +++++++++++++++++++++++++++++++++-------
3 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 5e4cce2..e93a76b 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3278,12 +3278,9 @@ unsigned int create_irq_nr(unsigned int irq_want, int node)
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq > 0) {
- dynamic_irq_init(irq);
- /* restore it, in case dynamic_irq_init clear it */
- if (desc_new)
- desc_new->chip_data = cfg_new;
- }
+ if (irq > 0)
+ dynamic_irq_init_keep_chip_data(irq);
+
return irq;
}

@@ -3305,19 +3302,12 @@ int create_irq(void)
void destroy_irq(unsigned int irq)
{
unsigned long flags;
- struct irq_cfg *cfg;
- struct irq_desc *desc;

- /* store it, in case dynamic_irq_cleanup clear it */
- desc = irq_to_desc(irq);
- cfg = desc->chip_data;
- dynamic_irq_cleanup(irq);
- /* connect back irq_cfg */
- desc->chip_data = cfg;
+ dynamic_irq_cleanup_keep_chip_data(irq);

free_irte(irq);
spin_lock_irqsave(&vector_lock, flags);
- __clear_irq_vector(irq, cfg);
+ __clear_irq_vector(irq, get_irq_chip_data(irq));
spin_unlock_irqrestore(&vector_lock, flags);
}

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 451481c..4d9b26e 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -400,7 +400,9 @@ static inline int irq_has_action(unsigned int irq)

/* Dynamic irq helper functions */
extern void dynamic_irq_init(unsigned int irq);
+void dynamic_irq_init_keep_chip_data(unsigned int irq);
extern void dynamic_irq_cleanup(unsigned int irq);
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq);

/* Set/get chip/data for an IRQ: */
extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index ecc3fa2..d70394f 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -18,11 +18,7 @@

#include "internals.h"

-/**
- * dynamic_irq_init - initialize a dynamically allocated irq
- * @irq: irq number to initialize
- */
-void dynamic_irq_init(unsigned int irq)
+static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc;
unsigned long flags;
@@ -41,7 +37,8 @@ void dynamic_irq_init(unsigned int irq)
desc->depth = 1;
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->action = NULL;
desc->irq_count = 0;
desc->irqs_unhandled = 0;
@@ -55,10 +52,26 @@ void dynamic_irq_init(unsigned int irq)
}

/**
- * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * dynamic_irq_init - initialize a dynamically allocated irq
* @irq: irq number to initialize
*/
-void dynamic_irq_cleanup(unsigned int irq)
+void dynamic_irq_init(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, false);
+}
+
+/**
+ * dynamic_irq_init_keep_chip_data - initialize a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_init_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_init_x(irq, true);
+}
+
+static void dynamic_irq_cleanup_x(unsigned int irq, bool keep_chip_data)
{
struct irq_desc *desc = irq_to_desc(irq);
unsigned long flags;
@@ -77,7 +90,8 @@ void dynamic_irq_cleanup(unsigned int irq)
}
desc->msi_desc = NULL;
desc->handler_data = NULL;
- desc->chip_data = NULL;
+ if (!keep_chip_data)
+ desc->chip_data = NULL;
desc->handle_irq = handle_bad_irq;
desc->chip = &no_irq_chip;
desc->name = NULL;
@@ -85,6 +99,26 @@ void dynamic_irq_cleanup(unsigned int irq)
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

+/**
+ * dynamic_irq_cleanup - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ */
+void dynamic_irq_cleanup(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, false);
+}
+
+/**
+ * dynamic_irq_cleanup_keep_chip_data - cleanup a dynamically allocated irq
+ * @irq: irq number to initialize
+ *
+ * does not set irq_to_desc(irq)->chip_data to NULL
+ */
+void dynamic_irq_cleanup_keep_chip_data(unsigned int irq)
+{
+ dynamic_irq_cleanup_x(irq, true);
+}
+

/**
* set_irq_chip - set the irq chip for an irq

2010-02-26 10:26:56

by Ingo Molnar

[permalink] [raw]
Subject: [tip:x86/irq] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

Commit-ID: f6a929eb62ef43cae16ba6d4e2c64b4e430fe7a4
Gitweb: http://git.kernel.org/tip/f6a929eb62ef43cae16ba6d4e2c64b4e430fe7a4
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 26 Feb 2010 11:17:16 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 26 Feb 2010 11:17:16 +0100

x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

Merge commit aef55d4922 mis-merged io_apic.c so we lost the
arch_probe_nr_irqs() method.

This caused subtle boot breakages (udev confusion likely
due to missing drivers) with certain configs.

Cc: H. Peter Anvin <[email protected]>
Cc: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 527390c..c13ab8f 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3874,6 +3874,28 @@ void __init probe_nr_irqs_gsi(void)
printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi);
}

+#ifdef CONFIG_SPARSE_IRQ
+int __init arch_probe_nr_irqs(void)
+{
+ int nr;
+
+ if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
+ nr_irqs = NR_VECTORS * nr_cpu_ids;
+
+ nr = nr_irqs_gsi + 8 * nr_cpu_ids;
+#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
+ /*
+ * for MSI and HT dyn irq
+ */
+ nr += nr_irqs_gsi * 16;
+#endif
+ if (nr < nr_irqs)
+ nr_irqs = nr;
+
+ return 0;
+}
+#endif
+
static int __io_apic_set_pci_routing(struct device *dev, int irq,
struct io_apic_irq_attr *irq_attr)
{

2010-02-26 18:20:36

by Yinghai Lu

[permalink] [raw]
Subject: Re: [tip:x86/irq] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

On 02/26/2010 02:26 AM, tip-bot for Ingo Molnar wrote:
> Commit-ID: f6a929eb62ef43cae16ba6d4e2c64b4e430fe7a4
> Gitweb: http://git.kernel.org/tip/f6a929eb62ef43cae16ba6d4e2c64b4e430fe7a4
> Author: Ingo Molnar <[email protected]>
> AuthorDate: Fri, 26 Feb 2010 11:17:16 +0100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Fri, 26 Feb 2010 11:17:16 +0100
>
> x86: apic: Fix mismerge, add arch_probe_nr_irqs() again
>
> Merge commit aef55d4922 mis-merged io_apic.c so we lost the
> arch_probe_nr_irqs() method.
>
> This caused subtle boot breakages (udev confusion likely
> due to missing drivers) with certain configs.
>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/kernel/apic/io_apic.c | 22 ++++++++++++++++++++++
> 1 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 527390c..c13ab8f 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -3874,6 +3874,28 @@ void __init probe_nr_irqs_gsi(void)
> printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi);
> }
>
> +#ifdef CONFIG_SPARSE_IRQ
> +int __init arch_probe_nr_irqs(void)
> +{
> + int nr;
> +
> + if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
> + nr_irqs = NR_VECTORS * nr_cpu_ids;
> +
> + nr = nr_irqs_gsi + 8 * nr_cpu_ids;
> +#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
> + /*
> + * for MSI and HT dyn irq
> + */
> + nr += nr_irqs_gsi * 16;
> +#endif
> + if (nr < nr_irqs)
> + nr_irqs = nr;
> +
> + return 0;
> +}
> +#endif
> +
> static int __io_apic_set_pci_routing(struct device *dev, int irq,
> struct io_apic_irq_attr *irq_attr)
> {

for x86, with radix tree based irq_to_desc(),
removing arch_probe_nr_irqs is intentional. so we get more irq that could be used.

wonder if the udev for some of your test system have irq number limitation?

YH

2010-02-27 09:11:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/irq] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again


* Yinghai Lu <[email protected]> wrote:

> On 02/26/2010 02:26 AM, tip-bot for Ingo Molnar wrote:
> > Commit-ID: f6a929eb62ef43cae16ba6d4e2c64b4e430fe7a4
> > Gitweb: http://git.kernel.org/tip/f6a929eb62ef43cae16ba6d4e2c64b4e430fe7a4
> > Author: Ingo Molnar <[email protected]>
> > AuthorDate: Fri, 26 Feb 2010 11:17:16 +0100
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Fri, 26 Feb 2010 11:17:16 +0100
> >
> > x86: apic: Fix mismerge, add arch_probe_nr_irqs() again
> >
> > Merge commit aef55d4922 mis-merged io_apic.c so we lost the
> > arch_probe_nr_irqs() method.
> >
> > This caused subtle boot breakages (udev confusion likely
> > due to missing drivers) with certain configs.
> >
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Yinghai Lu <[email protected]>
> > LKML-Reference: <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
> > ---
> > arch/x86/kernel/apic/io_apic.c | 22 ++++++++++++++++++++++
> > 1 files changed, 22 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> > index 527390c..c13ab8f 100644
> > --- a/arch/x86/kernel/apic/io_apic.c
> > +++ b/arch/x86/kernel/apic/io_apic.c
> > @@ -3874,6 +3874,28 @@ void __init probe_nr_irqs_gsi(void)
> > printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi);
> > }
> >
> > +#ifdef CONFIG_SPARSE_IRQ
> > +int __init arch_probe_nr_irqs(void)
> > +{
> > + int nr;
> > +
> > + if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
> > + nr_irqs = NR_VECTORS * nr_cpu_ids;
> > +
> > + nr = nr_irqs_gsi + 8 * nr_cpu_ids;
> > +#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
> > + /*
> > + * for MSI and HT dyn irq
> > + */
> > + nr += nr_irqs_gsi * 16;
> > +#endif
> > + if (nr < nr_irqs)
> > + nr_irqs = nr;
> > +
> > + return 0;
> > +}
> > +#endif
> > +
> > static int __io_apic_set_pci_routing(struct device *dev, int irq,
> > struct io_apic_irq_attr *irq_attr)
> > {
>
> for x86, with radix tree based irq_to_desc(),
> removing arch_probe_nr_irqs is intentional. so we get more irq that could be used.
>
> wonder if the udev for some of your test system have irq number limitation?

was ancient udev: udev-095-17.fc6.

Ingo

2010-02-27 09:38:06

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/irq] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

Ingo Molnar <[email protected]> writes:

> * Yinghai Lu <[email protected]> wrote:
>
>>
>> for x86, with radix tree based irq_to_desc(),
>> removing arch_probe_nr_irqs is intentional. so we get more irq that could be used.
>>
>> wonder if the udev for some of your test system have irq number limitation?
>
> was ancient udev: udev-095-17.fc6.

Something doesn't add up. Nowhere in the udev source is there a
single mention of irq.

gsi have fixed interrupt numbers so that would not change.

The dynamic irqs are allocated starting from the high gsi
and working up.

The irq numbers that get allocated should not have changed,
unless this was actually a bug fix in this configuration.

The other possibility is that somehow arch_probe_nr_irqs()
was returning a number greater than NR_IRQS.

Ingo do you have any idea what NR_IRQS or nr_irqs were/are on
that failing machine?

Eric

2010-02-27 09:53:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/irq] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again


* Eric W. Biederman <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > * Yinghai Lu <[email protected]> wrote:
> >
> >>
> >> for x86, with radix tree based irq_to_desc(),
> >> removing arch_probe_nr_irqs is intentional. so we get more irq that could be used.
> >>
> >> wonder if the udev for some of your test system have irq number limitation?
> >
> > was ancient udev: udev-095-17.fc6.
>
> Something doesn't add up. Nowhere in the udev source is there a
> single mention of irq.
>
> gsi have fixed interrupt numbers so that would not change.
>
> The dynamic irqs are allocated starting from the high gsi
> and working up.
>
> The irq numbers that get allocated should not have changed,
> unless this was actually a bug fix in this configuration.
>
> The other possibility is that somehow arch_probe_nr_irqs()
> was returning a number greater than NR_IRQS.
>
> Ingo do you have any idea what NR_IRQS or nr_irqs were/are on
> that failing machine?

Sorry, not - and the merge window doesnt leave much time to revisit the
problem right now.

But the failures were very real and 100% caused by this: they resulted in
non-existent /dev/sda* nodes and resulting fsck failure by rc.

Thanks,

Ingo

2010-02-27 10:12:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [tip:x86/irq] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

Ingo Molnar <[email protected]> writes:

>> Ingo do you have any idea what NR_IRQS or nr_irqs were/are on
>> that failing machine?
>
> Sorry, not - and the merge window doesnt leave much time to revisit the
> problem right now.
>
> But the failures were very real and 100% caused by this: they resulted in
> non-existent /dev/sda* nodes and resulting fsck failure by rc.

I have looked it over a second time and I have convinced myself
that arch_probe_nr_irqs will in the worst case reduce nr_irqs,
and never increase it beyond NR_IRQS. So this revert (keeping
arch_probe_nr_irqs) is safe.

It makes little sense that a larger nr_irqs would be a problem,
but clearly there are assumptions somewhere that we still need
to remove.

Eric

2010-02-27 12:57:31

by Ingo Molnar

[permalink] [raw]
Subject: [tip:x86/apic] x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

Commit-ID: 21c2fd9970cc8e457058f84016450a2e9876125e
Gitweb: http://git.kernel.org/tip/21c2fd9970cc8e457058f84016450a2e9876125e
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 26 Feb 2010 11:17:16 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Sat, 27 Feb 2010 12:49:56 +0100

x86: apic: Fix mismerge, add arch_probe_nr_irqs() again

Merge commit aef55d4922 mis-merged io_apic.c so we lost the
arch_probe_nr_irqs() method.

This caused subtle boot breakages (udev confusion likely
due to missing drivers) with certain configs.

Cc: H. Peter Anvin <[email protected]>
Cc: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 9795898..72ac2a3 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3876,6 +3876,28 @@ void __init probe_nr_irqs_gsi(void)
printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi);
}

+#ifdef CONFIG_SPARSE_IRQ
+int __init arch_probe_nr_irqs(void)
+{
+ int nr;
+
+ if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
+ nr_irqs = NR_VECTORS * nr_cpu_ids;
+
+ nr = nr_irqs_gsi + 8 * nr_cpu_ids;
+#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
+ /*
+ * for MSI and HT dyn irq
+ */
+ nr += nr_irqs_gsi * 16;
+#endif
+ if (nr < nr_irqs)
+ nr_irqs = nr;
+
+ return 0;
+}
+#endif
+
static int __io_apic_set_pci_routing(struct device *dev, int irq,
struct io_apic_irq_attr *irq_attr)
{