2009-06-02 19:32:26

by Gary Hade

[permalink] [raw]
Subject: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption


Impact: Eliminates a race that can leave the system in an
unusable state

During rapid offlining of multiple CPUs there is a chance
that an IRQ affinity move destination CPU will be offlined
before the IRQ affinity move initiated during the offlining
of a previous CPU completes. This can happen when the device
is not very active and thus fails to generate the IRQ that is
needed to complete the IRQ affinity move before the move
destination CPU is offlined. When this happens there is an
-EBUSY return from __assign_irq_vector() during the offlining
of the IRQ move destination CPU which prevents initiation of
a new IRQ affinity move operation to an online CPU. This
leaves the IRQ affinity set to an offlined CPU.

I have been able to reproduce the problem on some of our
systems using the following script. When the system is idle
the problem often reproduces during the first CPU offlining
sequence.

#!/bin/sh

SYS_CPU_DIR=/sys/devices/system/cpu
VICTIM_IRQ=25
IRQ_MASK=f0

iteration=0
while true; do
echo $iteration
echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity
for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
echo 0 > $cpudir/online
done
for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
echo 1 > $cpudir/online
done
iteration=`expr $iteration + 1`
done

The proposed fix takes advantage of the fact that when all
CPUs in the old domain are offline there is nothing to be done
by send_cleanup_vector() during the affinity move completion.
So, we simply avoid setting cfg->move_in_progress preventing
the above mentioned -EBUSY return from __assign_irq_vector().
This allows initiation of a new IRQ affinity move to a CPU
that is not going offline.

Successfully tested with Ingo's linux-2.6-tip (32 and 64-bit
builds) on the IBM x460, x3550 M2, x3850, and x3950 M2.

v2: modified to integrate with Yinghai Lu's
"x86/irq: remove leftover code from NUMA_MIGRATE_IRQ_DESC"
patch which modified intersecting lines. Only comment
changes were affected. The actual change to the code
is the same.

Signed-off-by: Gary Hade <[email protected]>

---
arch/x86/kernel/apic/io_apic.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6-tip/arch/x86/kernel/apic/io_apic.c
===================================================================
--- linux-2.6-tip.orig/arch/x86/kernel/apic/io_apic.c 2009-05-14 14:06:30.000000000 -0700
+++ linux-2.6-tip/arch/x86/kernel/apic/io_apic.c 2009-05-14 14:09:42.000000000 -0700
@@ -1218,8 +1218,11 @@ next:
current_vector = vector;
current_offset = offset;
if (old_vector) {
- cfg->move_in_progress = 1;
cpumask_copy(cfg->old_domain, cfg->domain);
+ if (cpumask_intersects(cfg->old_domain,
+ cpu_online_mask)) {
+ cfg->move_in_progress = 1;
+ }
}
for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
per_cpu(vector_irq, new_cpu)[vector] = irq;


2009-06-03 11:55:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption

Gary Hade <[email protected]> writes:

> Impact: Eliminates a race that can leave the system in an
> unusable state
>
> During rapid offlining of multiple CPUs there is a chance
> that an IRQ affinity move destination CPU will be offlined
> before the IRQ affinity move initiated during the offlining
> of a previous CPU completes. This can happen when the device
> is not very active and thus fails to generate the IRQ that is
> needed to complete the IRQ affinity move before the move
> destination CPU is offlined. When this happens there is an
> -EBUSY return from __assign_irq_vector() during the offlining
> of the IRQ move destination CPU which prevents initiation of
> a new IRQ affinity move operation to an online CPU. This
> leaves the IRQ affinity set to an offlined CPU.
>
> I have been able to reproduce the problem on some of our
> systems using the following script. When the system is idle
> the problem often reproduces during the first CPU offlining
> sequence.

Nacked-by: "Eric W. Biederman" <[email protected]>

fixup_irqs() is broken for allowing such a thing.

> #!/bin/sh
>
> SYS_CPU_DIR=/sys/devices/system/cpu
> VICTIM_IRQ=25
> IRQ_MASK=f0
>
> iteration=0
> while true; do
> echo $iteration
> echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity
> for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
> echo 0 > $cpudir/online
> done
> for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
> echo 1 > $cpudir/online
> done
> iteration=`expr $iteration + 1`
> done
>
> The proposed fix takes advantage of the fact that when all
> CPUs in the old domain are offline there is nothing to be done
> by send_cleanup_vector() during the affinity move completion.
> So, we simply avoid setting cfg->move_in_progress preventing
> the above mentioned -EBUSY return from __assign_irq_vector().
> This allows initiation of a new IRQ affinity move to a CPU
> that is not going offline.
>
> Successfully tested with Ingo's linux-2.6-tip (32 and 64-bit
> builds) on the IBM x460, x3550 M2, x3850, and x3950 M2.
>
> v2: modified to integrate with Yinghai Lu's
> "x86/irq: remove leftover code from NUMA_MIGRATE_IRQ_DESC"
> patch which modified intersecting lines. Only comment
> changes were affected. The actual change to the code
> is the same.
>
> Signed-off-by: Gary Hade <[email protected]>
>
> ---
> arch/x86/kernel/apic/io_apic.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-tip/arch/x86/kernel/apic/io_apic.c
> ===================================================================
> --- linux-2.6-tip.orig/arch/x86/kernel/apic/io_apic.c 2009-05-14 14:06:30.000000000 -0700
> +++ linux-2.6-tip/arch/x86/kernel/apic/io_apic.c 2009-05-14 14:09:42.000000000 -0700
> @@ -1218,8 +1218,11 @@ next:
> current_vector = vector;
> current_offset = offset;
> if (old_vector) {
> - cfg->move_in_progress = 1;
> cpumask_copy(cfg->old_domain, cfg->domain);
> + if (cpumask_intersects(cfg->old_domain,
> + cpu_online_mask)) {
> + cfg->move_in_progress = 1;
> + }
> }
> for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
> per_cpu(vector_irq, new_cpu)[vector] = irq;

2009-06-04 16:39:12

by Gary Hade

[permalink] [raw]
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption

On Wed, Jun 03, 2009 at 04:55:26AM -0700, Eric W. Biederman wrote:
> Gary Hade <[email protected]> writes:
>
> > Impact: Eliminates a race that can leave the system in an
> > unusable state
> >
> > During rapid offlining of multiple CPUs there is a chance
> > that an IRQ affinity move destination CPU will be offlined
> > before the IRQ affinity move initiated during the offlining
> > of a previous CPU completes. This can happen when the device
> > is not very active and thus fails to generate the IRQ that is
> > needed to complete the IRQ affinity move before the move
> > destination CPU is offlined. When this happens there is an
> > -EBUSY return from __assign_irq_vector() during the offlining
> > of the IRQ move destination CPU which prevents initiation of
> > a new IRQ affinity move operation to an online CPU. This
> > leaves the IRQ affinity set to an offlined CPU.
> >
> > I have been able to reproduce the problem on some of our
> > systems using the following script. When the system is idle
> > the problem often reproduces during the first CPU offlining
> > sequence.
>
> Nacked-by: "Eric W. Biederman" <[email protected]>
>
> fixup_irqs() is broken for allowing such a thing.

When fixup_irqs() calls the set_affinity function:
...
if (desc->chip->set_affinity)
desc->chip->set_affinity(irq, affinity);
...
it receives no feedback so it obviously expects the set_affinity
function or it's called functions to do the right thing by preventing
or correctly handling any problems that should arise. In the case of
this bug there is obviously a problem happening during the set_affinity
function call that needs to be resolved and/or properly handled.

When you made your "x86_64 irq: Safely cleanup an irq after moving it."
changes (re: http://lkml.org/lkml/2007/2/23/92) you added the check
to __assign_irq_vector() that causes it to return -EBUSY if the
migration of the IRQ is still in progress:
+ if ((cfg->move_in_progress) || cfg->move_cleanup_count)
+ return -EBUSY;
+
However, you did not add any code to other functions on the
call stack to properly deal with this error. When doing this
you may have assumed (as I may have also assumed) that the underlying
code was solid enough that the handling was not needed. Unfortunately,
you apparently did not anticipate the case where an idle or relatively
idle device may not generate the IRQ needed to complete the move
before the CPU that is still handling that IRQ is offlined.

My fix only addresses the issue that caused the -EBUSY return
and subsequent mess. It does not address the omitted handling
for this error condition. If we were to add the handling to
fixup_irq() and the arch and non-arch specific functions above
it on the call stack as you may be suggesting, it would be quite
involved because of all the things that would need to be undone.

I am not certain that my fix plugs the very last hole that could
cause the -EBUSY return from __assign_irq_vector() so maybe we
should at least add a warning or BUG_ON to make the unhandled
error more obvious in the future. I would be happy to provide
this via a separate patch.

Thanks,
Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc

2009-06-07 09:54:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption


* Gary Hade <[email protected]> wrote:

> On Wed, Jun 03, 2009 at 04:55:26AM -0700, Eric W. Biederman wrote:
> > Gary Hade <[email protected]> writes:
> >
> > > Impact: Eliminates a race that can leave the system in an
> > > unusable state
> > >
> > > During rapid offlining of multiple CPUs there is a chance
> > > that an IRQ affinity move destination CPU will be offlined
> > > before the IRQ affinity move initiated during the offlining
> > > of a previous CPU completes. This can happen when the device
> > > is not very active and thus fails to generate the IRQ that is
> > > needed to complete the IRQ affinity move before the move
> > > destination CPU is offlined. When this happens there is an
> > > -EBUSY return from __assign_irq_vector() during the offlining
> > > of the IRQ move destination CPU which prevents initiation of
> > > a new IRQ affinity move operation to an online CPU. This
> > > leaves the IRQ affinity set to an offlined CPU.
> > >
> > > I have been able to reproduce the problem on some of our
> > > systems using the following script. When the system is idle
> > > the problem often reproduces during the first CPU offlining
> > > sequence.
> >
> > Nacked-by: "Eric W. Biederman" <[email protected]>
> >
> > fixup_irqs() is broken for allowing such a thing.
>
> When fixup_irqs() calls the set_affinity function:
> ...
> if (desc->chip->set_affinity)
> desc->chip->set_affinity(irq, affinity);
> ...
> it receives no feedback so it obviously expects the set_affinity
> function or it's called functions to do the right thing by preventing
> or correctly handling any problems that should arise. In the case of
> this bug there is obviously a problem happening during the set_affinity
> function call that needs to be resolved and/or properly handled.
>
> When you made your "x86_64 irq: Safely cleanup an irq after moving it."
> changes (re: http://lkml.org/lkml/2007/2/23/92) you added the check
> to __assign_irq_vector() that causes it to return -EBUSY if the
> migration of the IRQ is still in progress:
> + if ((cfg->move_in_progress) || cfg->move_cleanup_count)
> + return -EBUSY;
> +
> However, you did not add any code to other functions on the
> call stack to properly deal with this error. When doing this
> you may have assumed (as I may have also assumed) that the underlying
> code was solid enough that the handling was not needed. Unfortunately,
> you apparently did not anticipate the case where an idle or relatively
> idle device may not generate the IRQ needed to complete the move
> before the CPU that is still handling that IRQ is offlined.
>
> My fix only addresses the issue that caused the -EBUSY return
> and subsequent mess. It does not address the omitted handling
> for this error condition. If we were to add the handling to
> fixup_irq() and the arch and non-arch specific functions above
> it on the call stack as you may be suggesting, it would be quite
> involved because of all the things that would need to be undone.
>
> I am not certain that my fix plugs the very last hole that could
> cause the -EBUSY return from __assign_irq_vector() so maybe we
> should at least add a warning or BUG_ON to make the unhandled
> error more obvious in the future. I would be happy to provide
> this via a separate patch.

A WARN_ON_ONCE() patch would certainly be nice, as a reminder and as
a prodder-tool.

Ingo

2009-06-08 18:12:48

by Gary Hade

[permalink] [raw]
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption

On Sun, Jun 07, 2009 at 11:54:03AM +0200, Ingo Molnar wrote:
>
> * Gary Hade <[email protected]> wrote:
>
> > On Wed, Jun 03, 2009 at 04:55:26AM -0700, Eric W. Biederman wrote:
> > > Gary Hade <[email protected]> writes:
> > >
> > > > Impact: Eliminates a race that can leave the system in an
> > > > unusable state
> > > >
> > > > During rapid offlining of multiple CPUs there is a chance
> > > > that an IRQ affinity move destination CPU will be offlined
> > > > before the IRQ affinity move initiated during the offlining
> > > > of a previous CPU completes. This can happen when the device
> > > > is not very active and thus fails to generate the IRQ that is
> > > > needed to complete the IRQ affinity move before the move
> > > > destination CPU is offlined. When this happens there is an
> > > > -EBUSY return from __assign_irq_vector() during the offlining
> > > > of the IRQ move destination CPU which prevents initiation of
> > > > a new IRQ affinity move operation to an online CPU. This
> > > > leaves the IRQ affinity set to an offlined CPU.
> > > >
> > > > I have been able to reproduce the problem on some of our
> > > > systems using the following script. When the system is idle
> > > > the problem often reproduces during the first CPU offlining
> > > > sequence.
> > >
> > > Nacked-by: "Eric W. Biederman" <[email protected]>
> > >
> > > fixup_irqs() is broken for allowing such a thing.
> >
> > When fixup_irqs() calls the set_affinity function:
> > ...
> > if (desc->chip->set_affinity)
> > desc->chip->set_affinity(irq, affinity);
> > ...
> > it receives no feedback so it obviously expects the set_affinity
> > function or it's called functions to do the right thing by preventing
> > or correctly handling any problems that should arise. In the case of
> > this bug there is obviously a problem happening during the set_affinity
> > function call that needs to be resolved and/or properly handled.
> >
> > When you made your "x86_64 irq: Safely cleanup an irq after moving it."
> > changes (re: http://lkml.org/lkml/2007/2/23/92) you added the check
> > to __assign_irq_vector() that causes it to return -EBUSY if the
> > migration of the IRQ is still in progress:
> > + if ((cfg->move_in_progress) || cfg->move_cleanup_count)
> > + return -EBUSY;
> > +
> > However, you did not add any code to other functions on the
> > call stack to properly deal with this error. When doing this
> > you may have assumed (as I may have also assumed) that the underlying
> > code was solid enough that the handling was not needed. Unfortunately,
> > you apparently did not anticipate the case where an idle or relatively
> > idle device may not generate the IRQ needed to complete the move
> > before the CPU that is still handling that IRQ is offlined.
> >
> > My fix only addresses the issue that caused the -EBUSY return
> > and subsequent mess. It does not address the omitted handling
> > for this error condition. If we were to add the handling to
> > fixup_irq() and the arch and non-arch specific functions above
> > it on the call stack as you may be suggesting, it would be quite
> > involved because of all the things that would need to be undone.
> >
> > I am not certain that my fix plugs the very last hole that could
> > cause the -EBUSY return from __assign_irq_vector() so maybe we
> > should at least add a warning or BUG_ON to make the unhandled
> > error more obvious in the future. I would be happy to provide
> > this via a separate patch.
>
> A WARN_ON_ONCE() patch would certainly be nice, as a reminder and as
> a prodder-tool.

Done. re: http://lkml.org/lkml/2009/6/8/354

In the case of the bug we are discussing here, the added warning
looks like:
WARNING: at arch/x86/kernel/apic/io_apic.c:1334 __assign_irq_vector+0x53/0x262()
Hardware name: IBM x3850-[88641RY]-
Modules linked in: radeon drm ipv6 af_packet microcode fuse loop dm_mod rtc_cmos rtc_core i2c_piix4 tg3 e1000e s2io sr_mod libphy rtc_lib i2c_core joydev button pcspkr cdrom sg usbhid hid sd_mod crc_t10dif ohci_hcd ehci_hcd aic94xx libsas scsi_transport_sas usbcore edd ext3 mbcache jbd fan ide_pci_generic serverworks ide_core ata_generic thermal processor thermal_sys hwmon pata_serverworks libata scsi_mod
Pid: 5328, comm: kstop/5 Not tainted 2.6.30-rc8-gh-5-default #5
Call Trace:
[<ffffffff8021cf2a>] ? __assign_irq_vector+0x53/0x262
[<ffffffff8023ee59>] warn_slowpath_common+0x77/0xa4
[<ffffffff8023ee95>] warn_slowpath_null+0xf/0x11
[<ffffffff8021cf2a>] __assign_irq_vector+0x53/0x262
[<ffffffff8021d16a>] assign_irq_vector+0x31/0x4d
[<ffffffff8021d1c6>] set_desc_affinity+0x40/0x86
[<ffffffff8021d53b>] set_ioapic_affinity_irq_desc+0x3b/0x135
[<ffffffff8021d651>] set_ioapic_affinity_irq+0x1c/0x20
[<ffffffff8020de46>] fixup_irqs+0xd5/0x163
[<ffffffff8021b1c2>] cpu_disable_common+0x19f/0x1ae
[<ffffffff8021b200>] native_cpu_disable+0x2f/0x37
[<ffffffff80491878>] take_cpu_down+0x12/0x38
[<ffffffff80274d56>] stop_cpu+0x87/0xc7
[<ffffffff8024e55f>] worker_thread+0x172/0x20c
[<ffffffff80274ccf>] ? stop_cpu+0x0/0xc7
[<ffffffff802520b8>] ? autoremove_wake_function+0x0/0x38
[<ffffffff8024e3ed>] ? worker_thread+0x0/0x20c
[<ffffffff8024e3ed>] ? worker_thread+0x0/0x20c
[<ffffffff80251ce8>] kthread+0x56/0x83
[<ffffffff8020ca3a>] child_rip+0xa/0x20
[<ffffffff80251c92>] ? kthread+0x0/0x83
[<ffffffff8020ca30>] ? child_rip+0x0/0x20

Thanks,
Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc

2009-06-08 19:19:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption

Ingo Molnar <[email protected]> writes:

> * Gary Hade <[email protected]> wrote:
>
>> On Wed, Jun 03, 2009 at 04:55:26AM -0700, Eric W. Biederman wrote:
>> > Gary Hade <[email protected]> writes:
>> >
>> > > Impact: Eliminates a race that can leave the system in an
>> > > unusable state
>> > >
>> > > During rapid offlining of multiple CPUs there is a chance
>> > > that an IRQ affinity move destination CPU will be offlined
>> > > before the IRQ affinity move initiated during the offlining
>> > > of a previous CPU completes. This can happen when the device
>> > > is not very active and thus fails to generate the IRQ that is
>> > > needed to complete the IRQ affinity move before the move
>> > > destination CPU is offlined. When this happens there is an
>> > > -EBUSY return from __assign_irq_vector() during the offlining
>> > > of the IRQ move destination CPU which prevents initiation of
>> > > a new IRQ affinity move operation to an online CPU. This
>> > > leaves the IRQ affinity set to an offlined CPU.
>> > >
>> > > I have been able to reproduce the problem on some of our
>> > > systems using the following script. When the system is idle
>> > > the problem often reproduces during the first CPU offlining
>> > > sequence.
>> >
>> > Nacked-by: "Eric W. Biederman" <[email protected]>
>> >
>> > fixup_irqs() is broken for allowing such a thing.
>>
>> When fixup_irqs() calls the set_affinity function:
>> ...
>> if (desc->chip->set_affinity)
>> desc->chip->set_affinity(irq, affinity);
>> ...
>> it receives no feedback so it obviously expects the set_affinity
>> function or it's called functions to do the right thing by preventing
>> or correctly handling any problems that should arise. In the case of
>> this bug there is obviously a problem happening during the set_affinity
>> function call that needs to be resolved and/or properly handled.
>>
>> When you made your "x86_64 irq: Safely cleanup an irq after moving it."
>> changes (re: http://lkml.org/lkml/2007/2/23/92) you added the check
>> to __assign_irq_vector() that causes it to return -EBUSY if the
>> migration of the IRQ is still in progress:
>> + if ((cfg->move_in_progress) || cfg->move_cleanup_count)
>> + return -EBUSY;
>> +
>> However, you did not add any code to other functions on the
>> call stack to properly deal with this error. When doing this
>> you may have assumed (as I may have also assumed) that the underlying
>> code was solid enough that the handling was not needed. Unfortunately,
>> you apparently did not anticipate the case where an idle or relatively
>> idle device may not generate the IRQ needed to complete the move
>> before the CPU that is still handling that IRQ is offlined.
>>
>> My fix only addresses the issue that caused the -EBUSY return
>> and subsequent mess. It does not address the omitted handling
>> for this error condition. If we were to add the handling to
>> fixup_irq() and the arch and non-arch specific functions above
>> it on the call stack as you may be suggesting, it would be quite
>> involved because of all the things that would need to be undone.
>>
>> I am not certain that my fix plugs the very last hole that could
>> cause the -EBUSY return from __assign_irq_vector() so maybe we
>> should at least add a warning or BUG_ON to make the unhandled
>> error more obvious in the future. I would be happy to provide
>> this via a separate patch.
>
> A WARN_ON_ONCE() patch would certainly be nice, as a reminder and as
> a prodder-tool.

In fixup_irqs such a warning would be reasonable. In assign_irq_vector
it makes no sense.

I just read through the code. Anything that assumes assign_irq_vector
will always succeed is BROKEN. We can not guarantee it. There are
also memory allocation failures and the fundamental problem that we
may have more irqs than can fit on a single cpu.

Furthermore while we require at least two irqs to complete a irq migration
I don't believe we can avoid returning -EBUSY there.

I really really hate these patches that come out of assuming that fixup_irqs
is or ever was working and reasonable.

Eric

2009-06-09 20:47:17

by Gary Hade

[permalink] [raw]
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining triggered "inactive" device IRQ interrruption

On Mon, Jun 08, 2009 at 12:19:30PM -0700, Eric W. Biederman wrote:
> Ingo Molnar <[email protected]> writes:
>
> > * Gary Hade <[email protected]> wrote:
> >
> >> On Wed, Jun 03, 2009 at 04:55:26AM -0700, Eric W. Biederman wrote:
> >> > Gary Hade <[email protected]> writes:
> >> >
> >> > > Impact: Eliminates a race that can leave the system in an
> >> > > unusable state
> >> > >
> >> > > During rapid offlining of multiple CPUs there is a chance
> >> > > that an IRQ affinity move destination CPU will be offlined
> >> > > before the IRQ affinity move initiated during the offlining
> >> > > of a previous CPU completes. This can happen when the device
> >> > > is not very active and thus fails to generate the IRQ that is
> >> > > needed to complete the IRQ affinity move before the move
> >> > > destination CPU is offlined. When this happens there is an
> >> > > -EBUSY return from __assign_irq_vector() during the offlining
> >> > > of the IRQ move destination CPU which prevents initiation of
> >> > > a new IRQ affinity move operation to an online CPU. This
> >> > > leaves the IRQ affinity set to an offlined CPU.
> >> > >
> >> > > I have been able to reproduce the problem on some of our
> >> > > systems using the following script. When the system is idle
> >> > > the problem often reproduces during the first CPU offlining
> >> > > sequence.
> >> >
> >> > Nacked-by: "Eric W. Biederman" <[email protected]>
> >> >
> >> > fixup_irqs() is broken for allowing such a thing.
> >>
> >> When fixup_irqs() calls the set_affinity function:
> >> ...
> >> if (desc->chip->set_affinity)
> >> desc->chip->set_affinity(irq, affinity);
> >> ...
> >> it receives no feedback so it obviously expects the set_affinity
> >> function or it's called functions to do the right thing by preventing
> >> or correctly handling any problems that should arise. In the case of
> >> this bug there is obviously a problem happening during the set_affinity
> >> function call that needs to be resolved and/or properly handled.
> >>
> >> When you made your "x86_64 irq: Safely cleanup an irq after moving it."
> >> changes (re: http://lkml.org/lkml/2007/2/23/92) you added the check
> >> to __assign_irq_vector() that causes it to return -EBUSY if the
> >> migration of the IRQ is still in progress:
> >> + if ((cfg->move_in_progress) || cfg->move_cleanup_count)
> >> + return -EBUSY;
> >> +
> >> However, you did not add any code to other functions on the
> >> call stack to properly deal with this error. When doing this
> >> you may have assumed (as I may have also assumed) that the underlying
> >> code was solid enough that the handling was not needed. Unfortunately,
> >> you apparently did not anticipate the case where an idle or relatively
> >> idle device may not generate the IRQ needed to complete the move
> >> before the CPU that is still handling that IRQ is offlined.
> >>
> >> My fix only addresses the issue that caused the -EBUSY return
> >> and subsequent mess. It does not address the omitted handling
> >> for this error condition. If we were to add the handling to
> >> fixup_irq() and the arch and non-arch specific functions above
> >> it on the call stack as you may be suggesting, it would be quite
> >> involved because of all the things that would need to be undone.
> >>
> >> I am not certain that my fix plugs the very last hole that could
> >> cause the -EBUSY return from __assign_irq_vector() so maybe we
> >> should at least add a warning or BUG_ON to make the unhandled
> >> error more obvious in the future. I would be happy to provide
> >> this via a separate patch.
> >
> > A WARN_ON_ONCE() patch would certainly be nice, as a reminder and as
> > a prodder-tool.
>
> In fixup_irqs such a warning would be reasonable. In assign_irq_vector
> it makes no sense.
>
> I just read through the code. Anything that assumes assign_irq_vector
> will always succeed is BROKEN. We can not guarantee it. There are
> also memory allocation failures and the fundamental problem that we
> may have more irqs than can fit on a single cpu.

Yes, I am certain that there are other bugs lurking that haven't
yet manifested into real and serious failures. I am simply
proposing that we fix one very serious bug that has.

>
> Furthermore while we require at least two irqs to complete a irq migration
> I don't believe we can avoid returning -EBUSY there.

I didn't see anything in send_cleanup_vector() such as a memory
allocation failure (already handled there) that, in the absense
of a yet to be discoverd bug, should prevent cfg->move_in_progress
from getting zeroed. Assuming a memory allocation failure in
send_cleanup_vector() that brings cfg->move_cleanup_count into
the picture, I didn't see anything, in the absense of a yet to
be discovered bug, in smp_irq_move_cleanup_interrupt() that I
thought would prevent cfg->move_cleanup_count from getting
decremented to zero.

If you are still uncomfortable with this, the WARN_ON_ONCE
could be limited to the instances where the CPU is being offlined
i.e. the case that is known to be so very destructive.

Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
[email protected]
http://www.ibm.com/linux/ltc