From: ebiederm@xmission.com (Eric W. Biederman)
Date: Wed, 03 Jun 2009 05:27:13 -0700
To: Gary Hade <garyhade@us.ibm.com>
Cc: mingo@elte.hu, mingo@redhat.com, linux-kernel@vger.kernel.org,
    tglx@linutronix.de, hpa@zytor.com, x86@kernel.org, yinghai@kernel.org,
    lcm@us.ibm.com
Subject: Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offlining
    triggered "active" device IRQ interruption
In-Reply-To: <20090602193216.GC7282@us.ibm.com> (Gary Hade's message of
    "Tue, 2 Jun 2009 12:32:16 -0700")

Gary Hade <garyhade@us.ibm.com> writes:

> Impact: Eliminates an issue that can leave the system in an
> unusable state.
>
> This patch addresses an issue where device-generated IRQs
> are no longer seen by the kernel following IRQ affinity
> migration while the device is generating IRQs at a high rate.
>
> I have been able to consistently reproduce the problem on
> some of our systems by running the following script (VICTIM_IRQ
> specifies the IRQ for the aic94xx device) while a single instance
> of the command
>
>   # while true; do find / -exec file {} \;; done
>
> is keeping the filesystem activity and IRQ rate reasonably high.

To be 100% clear: if masking the irq and then checking whether it was
already pending were enough to safely migrate irqs in process context,
then that is how we would always do it.

I have been down that road and done some extensive testing in the past.
I found hardware bugs in both AMD and Intel ioapics that make your code
demonstrably unsafe.  I was challenged by some of the software guys from
Intel, and eventually they came back and told me they had talked with
their hardware engineers and that I was correct.

So no.  This code is totally and severely broken and we should not do it.

You are introducing complexity and heuristics to paper over the fact
that fixup_irqs() is fundamentally broken.  Sure, you might tweak things
so they work a little more often.

> The root cause is a known issue already addressed for some
> code paths [e.g. ack_apic_level() and the now obsolete
> migrate_irq_remapped_level_desc()] where the ioapic can
> misbehave when the I/O redirection table register is written
> while the Remote IRR bit is set.

No.  The reason we do this is not because of the IRR, although that
certainly does not help.
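For reference, the existing code migrates a level-triggered irq from the
irq's own handler, with the pin masked at the ioapic, so the redirection
table entry is never rewritten while the pin can fire.  What follows is
only a simplified sketch of that idea; the helper names approximate the
io_apic.c code of this era rather than the exact in-tree interfaces:

  /*
   * Sketch: migrate a level-triggered irq from its own ack path.
   * Masking the pin first guarantees the ioapic will not attempt
   * delivery while the redirection table entry is rewritten.
   */
  static void ack_level_irq(unsigned int irq)
  {
          int moving = irq_move_pending(irq);     /* illustrative helper */

          if (moving)
                  mask_ioapic_irq(irq);           /* mask the pin first */

          ack_APIC_irq();                         /* EOI the local APIC */

          if (moving) {
                  /*
                   * The pin is masked and the EOI has gone out, so
                   * rewriting the redirection table entry is safe now.
                   */
                  move_masked_irq(irq);
                  unmask_ioapic_irq(irq);
          }
  }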
We do this because it is not, in general, safe to do complicated
reprogramming of the ioapic while the hardware may send an irq; you can
lock up the hardware state machine, etc.  If the workaround were as
simple as you propose, a delayed-work variant, or one that busy-waits
until the irq handler completes, would have been written and used long
ago.

So my reaction to this horrible afterthought is NO NO NO NO NO NO NO NO
NO NO NO NO NO NO NO NO NO NO NO NO NO NO PLEASE NO.

Eric