From: ebiederm@xmission.com (Eric W. Biederman)
To: Gary Hade <garyhade@us.ibm.com>
Cc: mingo@elte.hu, mingo@redhat.com, tglx@linutronix.de, hpa@zytor.com,
    x86@kernel.org, linux-kernel@vger.kernel.org, lcm@us.ibm.com
Subject: Re: [PATCH 2/3] [BUGFIX] x86/x86_64: fix CPU offlining triggered
    inactive device IRQ interruption
Date: Fri, 10 Apr 2009 15:02:07 -0700
In-Reply-To: <20090410200919.GA7242@us.ibm.com> (Gary Hade's message of
    "Fri, 10 Apr 2009 13:09:19 -0700")
References: <20090408210735.GD11159@us.ibm.com>
    <20090410200919.GA7242@us.ibm.com>

Gary Hade <garyhade@us.ibm.com> writes:

> On Thu, Apr 09, 2009 at 06:29:10PM -0700, Eric W. Biederman wrote:
>> Gary Hade writes:
>>
>> > Impact: Eliminates a race that can leave the system in an
>> > unusable state.
>> >
>> > During rapid offlining of multiple CPUs there is a chance
>> > that an IRQ affinity move destination CPU will be offlined
>> > before the IRQ affinity move initiated during the offlining
>> > of a previous CPU completes.  This can happen when the device
>> > is not very active and thus fails to generate the IRQ that is
>> > needed to complete the IRQ affinity move before the move
>> > destination CPU is offlined.  When this happens there is an
>> > -EBUSY return from __assign_irq_vector() during the offlining
>> > of the IRQ move destination CPU which prevents initiation of
>> > a new IRQ affinity move operation to an online CPU.  This
>> > leaves the IRQ affinity set to an offlined CPU.
>> >
>> > I have been able to reproduce the problem on some of our
>> > systems using the following script.  When the system is idle
>> > the problem often reproduces during the first CPU offlining
>> > sequence.
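(Gary's script itself is not reproduced in the quote above.  Purely as
an illustration of the kind of stress test being described, an
equivalent loop in C against the standard sysfs cpu hotplug files
might look like the following; the iteration count and cpu range are
arbitrary.)

/*
 * Hypothetical stand-in for the test script mentioned above:
 * repeatedly offline and online a range of cpus through sysfs.
 */
#include <stdio.h>
#include <unistd.h>

static int cpu_set_online(int cpu, int online)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/online", cpu);
    f = fopen(path, "w");
    if (!f)
        return -1;    /* cpu0 typically has no online file */
    fprintf(f, "%d\n", online);
    return fclose(f);
}

int main(void)
{
    int iter, cpu;

    for (iter = 0; iter < 100; iter++) {
        /* Offline cpus 1..7 in quick succession... */
        for (cpu = 1; cpu < 8; cpu++)
            cpu_set_online(cpu, 0);
        sleep(1);
        /* ...then bring them all back. */
        for (cpu = 1; cpu < 8; cpu++)
            cpu_set_online(cpu, 1);
        sleep(1);
    }
    return 0;
}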
>> You appear to be focusing on the IBM x460 and x3850.
>
> True.  I have also observed IRQ interruptions on an IBM x3950 M2
> which I believe, but am not certain, were due to the other
> "I/O redirection table register write with Remote IRR bit set"
> caused problem.
>
> I intend to do more testing on the x3950 M2 and other
> IBM System x servers but I unfortunately do not currently
> have access to any Intel based non-IBM MP servers.  I was
> hoping that my testing request might at least get some
> others interested in running the simple test script on their
> systems and reporting their results.  Have you perhaps tried
> the test on any of the Intel based MP systems that you have
> access to?
>
>> Can you describe what kind of interrupt setup you are running?
>
> Being somewhat of an ioapic neophyte I am not exactly sure
> what you are asking for here.  This is the ioapic information
> logged during boot, if that helps at all.
>
> x3850:
> ACPI: IOAPIC (id[0x0f] address[0xfec00000] gsi_base[0])
> IOAPIC[0]: apic_id 15, version 0, address 0xfec00000, GSI 0-35
> ACPI: IOAPIC (id[0x0e] address[0xfec01000] gsi_base[36])
> IOAPIC[1]: apic_id 14, version 0, address 0xfec01000, GSI 36-71
>
> x460:
> ACPI: IOAPIC (id[0x0f] address[0xfec00000] gsi_base[0])
> IOAPIC[0]: apic_id 15, version 17, address 0xfec00000, GSI 0-35
> ACPI: IOAPIC (id[0x0e] address[0xfec01000] gsi_base[36])
> IOAPIC[1]: apic_id 14, version 17, address 0xfec01000, GSI 36-71
> ACPI: IOAPIC (id[0x0d] address[0xfec02000] gsi_base[72])
> IOAPIC[2]: apic_id 13, version 17, address 0xfec02000, GSI 72-107
> ACPI: IOAPIC (id[0x0c] address[0xfec03000] gsi_base[108])
> IOAPIC[3]: apic_id 12, version 17, address 0xfec03000, GSI 108-143

Sorry.  My real question is which mode you are running the ioapics in.

>> You may be the first person to actually hit the problems with cpu
>> offlining and irq migration that have theoretically been present
>> for a long time.
>
> Your "Safely cleanup an irq after moving it" changes have been
> present in mainline for quite some time so I have been thinking
> about this as well.
>
> I can certainly understand why it may not be very likely
> for users to see the "I/O redirection table register write
> with Remote IRR bit set" caused problem.  It has actually
> been fairly difficult to reproduce.  I very much doubt that
> there are many users out there who would be continuously offlining
> and onlining all the offlineable CPUs from a script or program on
> a heavily loaded system.  IMO, this would _not_ be a very common
> usage scenario.  The test script that I provided usually performs
> many CPU offline/online iterations before the problem is triggered.
> A much more likely usage scenario, for which there is already code
> in ack_apic_level() to avoid the problem, would be IRQ affinity
> adjustments requested from user level (e.g. by the irqbalance
> daemon) on a very active system.
>
> It is less clear to me why users have not been reporting the
> idle system race but I suspect that
> - script or program driven offlining of multiple CPUs
>   may not be very common
> - the actual affinity on an idle system is usually set to
>   cpu0, which is always online
>
> I am glad you are looking at this since I know it involves code
> that you should be quite familiar with.  Thanks!

The class of irq and the mode in which we run it make a big
difference.  For MSIs, and on machines with iommus that remap irqs,
we should be able to migrate irqs safely at any time.  The same goes
for lowest priority delivery mode when all cpus are set up to receive
an irq on the same vector: changing its affinity should be safe.
This works for up to 8 cpus, and I expect it is the common case.
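(For concreteness: a user-level affinity change of the kind discussed
above goes through /proc/irq/<N>/smp_affinity, which takes a hex cpu
mask.  A minimal sketch follows; the helper name is invented, and the
irq number and mask are supplied by the caller.)

/*
 * Set the affinity of an irq from user space.  Writing a hex cpu
 * mask to /proc/irq/<N>/smp_affinity is what triggers the
 * kernel-side IRQ_MOVE_PENDING logic discussed below.
 */
#include <stdio.h>
#include <stdlib.h>

static int set_affinity_mask(int irq, unsigned long mask)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%lx\n", mask);    /* e.g. 0x3 = cpu0 and cpu1 */
    return fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <irq> <hexmask>\n", argv[0]);
        return 1;
    }
    return set_affinity_mask(atoi(argv[1]),
                             strtoul(argv[2], NULL, 16)) ? 1 : 0;
}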
For the rest of the cases, where we have to change the vector number,
I do not know of a safe way to migrate an irq, short of turning off
the devices generating the interrupts (and thus stopping them at the
source).  The only reason I did not make the irq migration at cpu
shutdown time depend on CONFIG_BROKEN was that it is used by the
suspend code on laptops.

Alternatively, if you have ioapics that support having their irqs
acknowledged with a register, like ia64 has, I think there are some
additional options.  So depending on what hardware you have it might
be possible to implement this safely.

It looks like some additional bugs have slipped in since last I
looked.  set_irq_affinity does this:

#ifdef CONFIG_GENERIC_PENDING_IRQ
	if (desc->status & IRQ_MOVE_PCNTXT || desc->status & IRQ_DISABLED) {
		/* Move the irq immediately, from process context. */
		cpumask_copy(desc->affinity, cpumask);
		desc->chip->set_affinity(irq, cpumask);
	} else {
		/* Defer the move until the next interrupt arrives. */
		desc->status |= IRQ_MOVE_PENDING;
		cpumask_copy(desc->pending_mask, cpumask);
	}
#else

That IRQ_DISABLED case is a software state and as such it has nothing
to do with how safe it is to move an irq in process context.

Do any of your device drivers call disable_irq()?

Eric
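(The danger in that IRQ_DISABLED test is easiest to see in a
hypothetical driver fragment.  struct my_dev and my_dev_stop() below
are invented for illustration; disable_irq() and enable_irq() are the
real kernel APIs.  disable_irq() only marks the irq disabled in
software, so the ioapic can still have an interrupt in flight when
the affinity change is applied immediately.)

/* Hypothetical driver fragment; not taken from any real driver. */
#include <linux/interrupt.h>

struct my_dev {				/* invented for illustration */
	int irq;
};

static void my_dev_stop(struct my_dev *dev);	/* invented helper */

static void my_dev_quiesce(struct my_dev *dev)
{
	/* Sets IRQ_DISABLED in desc->status: purely software state. */
	disable_irq(dev->irq);

	/*
	 * If set_irq_affinity() runs at this point, the IRQ_DISABLED
	 * test above makes it reprogram the irq immediately from
	 * process context, even though the hardware may still have an
	 * interrupt in flight.
	 */
	my_dev_stop(dev);

	enable_irq(dev->irq);
}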