Hi Linus,
This pair of patches fixes a long-standing bug whose most noticeable
symptom was that the Xen blkfront driver would hang up very occasionally
(sometimes never, sometimes after weeks or months of uptime).
We worked out the root cause was that it was incorrectly treating Xen
events as level rather than edge triggered interrupts, which works fine
unless you're handling one interrupt, the interrupt gets migrated to
another cpu and then re-raised. This ends up losing the interrupt
because the edge-triggering of the second interrupt is lost.
The other change changes IPI and VIRQ event sources to use
handle_percpu_irq, because treating them as level is also wrong, and
they're actually inherently percpu events, much like LAPIC vectors.
I'd like to get this fix into the current kernel and into stable sooner
rather than later.
Thanks,
J
The following changes since commit 76be97c1fc945db08aae1f1b746012662d643e97:
Linux 2.6.36-rc2 (2010-08-22 17:43:29 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git upstream/core
Jeremy Fitzhardinge (2):
xen: use percpu interrupts for IPIs and VIRQs
xen: handle events as edge-triggered
drivers/xen/events.c | 21 ++++++++++++++++-----
1 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index 72f91bf..13365ba 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -112,6 +112,7 @@ static inline unsigned long *cpu_evtchn_mask(int cpu)
#define VALID_EVTCHN(chn) ((chn) != 0)
static struct irq_chip xen_dynamic_chip;
+static struct irq_chip xen_percpu_chip;
/* Constructor for packed IRQ information. */
static struct irq_info mk_unbound_info(void)
@@ -377,7 +378,7 @@ int bind_evtchn_to_irq(unsigned int evtchn)
irq = find_unbound_irq();
set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
- handle_level_irq, "event");
+ handle_edge_irq, "event");
evtchn_to_irq[evtchn] = irq;
irq_info[irq] = mk_evtchn_info(evtchn);
@@ -403,8 +404,8 @@ static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu)
if (irq < 0)
goto out;
- set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
- handle_level_irq, "ipi");
+ set_irq_chip_and_handler_name(irq, &xen_percpu_chip,
+ handle_percpu_irq, "ipi");
bind_ipi.vcpu = cpu;
if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_ipi,
@@ -444,8 +445,8 @@ static int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
irq = find_unbound_irq();
- set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
- handle_level_irq, "virq");
+ set_irq_chip_and_handler_name(irq, &xen_percpu_chip,
+ handle_percpu_irq, "virq");
evtchn_to_irq[evtchn] = irq;
irq_info[irq] = mk_virq_info(evtchn, virq);
@@ -964,6 +965,16 @@ static struct irq_chip xen_dynamic_chip __read_mostly = {
.retrigger = retrigger_dynirq,
};
+static struct irq_chip xen_percpu_chip __read_mostly = {
+ .name = "xen-percpu",
+
+ .disable = disable_dynirq,
+ .mask = disable_dynirq,
+ .unmask = enable_dynirq,
+
+ .ack = ack_dynirq,
+};
+
int xen_set_callback_via(uint64_t via)
{
struct xen_hvm_param a;
>>> On 24.08.10 at 23:35, Jeremy Fitzhardinge <[email protected]> wrote:
> We worked out the root cause was that it was incorrectly treating Xen
> events as level rather than edge triggered interrupts, which works fine
> unless you're handling one interrupt, the interrupt gets migrated to
> another cpu and then re-raised. This ends up losing the interrupt
> because the edge-triggering of the second interrupt is lost.
While this description would seem plausible at the first glance, it
doesn't match up with unmask_evtchn() already taking care of
exactly this case. Or are you implicitly saying that this code is
broken in some way (if so, how, and shouldn't it then be that
code that needs fixing, or removing if you want to stay with the
edge handling)?
I do however agree that using handle_level_irq() is problematic
(see http://lists.xensource.com/archives/html/xen-devel/2010-04/msg01178.html),
but as said there I think using the fasteoi logic is preferable. No
matter whether using edge or level, the ->end() method will
never be called (whereas fasteoi calls ->eoi(), which would
just need to be vectored to the same function as ->end()).
Without end_pirq() ever called, you can't let Xen know of
bad PIRQs (so that it can disable them instead of continuing
to call the [now shortcut] handler in the owning domain).
> The other change changes IPI and VIRQ event sources to use
> handle_percpu_irq, because treating them as level is also wrong, and
> they're actually inherently percpu events, much like LAPIC vectors.
This doesn't seem right for the general VIRQ case: global ones
should not be disallowed migration between CPUs. Since in your
model the requestor has to pass IRQF_PERCPU anyway,
shouldn't you make the selection of the handler dependent
upon this flag?
Jan
On Wed, 2010-08-25 at 03:52 -0400, Jan Beulich wrote:
> >>> On 24.08.10 at 23:35, Jeremy Fitzhardinge <[email protected]> wrote:
> > We worked out the root cause was that it was incorrectly treating Xen
> > events as level rather than edge triggered interrupts, which works fine
> > unless you're handling one interrupt, the interrupt gets migrated to
> > another cpu and then re-raised. This ends up losing the interrupt
> > because the edge-triggering of the second interrupt is lost.
>
> While this description would seem plausible at the first glance, it
> doesn't match up with unmask_evtchn() already taking care of
> exactly this case. Or are you implicitly saying that this code is
> broken in some way (if so, how, and shouldn't it then be that
> code that needs fixing, or removing if you want to stay with the
> edge handling)?
Not broken, but a different problem. The unmask 'resend' only catches
the edge lost if the event was raised while it was still masked. But
level irq doesn't have to save PENDING state. In the Xen event migration
case the edge isn't lost, but the upcall will drop the invocation when
the handler is found inprogress on the previous cpu.
> I do however agree that using handle_level_irq() is problematic
> (see http://lists.xensource.com/archives/html/xen-devel/2010-04/msg01178.html),
> but as said there I think using the fasteoi logic is preferable. No
> matter whether using edge or level, the ->end() method will
> never be called (whereas fasteoi calls ->eoi(), which would
> just need to be vectored to the same function as ->end()).
> Without end_pirq() ever called, you can't let Xen know of
> bad PIRQs (so that it can disable them instead of continuing
> to call the [now shortcut] handler in the owning domain).
Not an opinion, just confused: Isn't all that dealt with in
chip->disable?
Daniel
>>> On 25.08.10 at 12:04, Daniel Stodden <[email protected]> wrote:
> On Wed, 2010-08-25 at 03:52 -0400, Jan Beulich wrote:
>> >>> On 24.08.10 at 23:35, Jeremy Fitzhardinge <[email protected]> wrote:
>> > We worked out the root cause was that it was incorrectly treating Xen
>> > events as level rather than edge triggered interrupts, which works fine
>> > unless you're handling one interrupt, the interrupt gets migrated to
>> > another cpu and then re-raised. This ends up losing the interrupt
>> > because the edge-triggering of the second interrupt is lost.
>>
>> While this description would seem plausible at the first glance, it
>> doesn't match up with unmask_evtchn() already taking care of
>> exactly this case. Or are you implicitly saying that this code is
>> broken in some way (if so, how, and shouldn't it then be that
>> code that needs fixing, or removing if you want to stay with the
>> edge handling)?
>
> Not broken, but a different problem. The unmask 'resend' only catches
> the edge lost if the event was raised while it was still masked. But
> level irq doesn't have to save PENDING state. In the Xen event migration
> case the edge isn't lost, but the upcall will drop the invocation when
> the handler is found inprogress on the previous cpu.
Hmm, indeed. But that problem must have existed in all post-2.6.18
kernels then... And that shouldn't be a problem with fasteoi, as that
one calls ->eoi() even when INPROGRESS was set (other than level,
which calls unmask only when it wasn't set).
>> I do however agree that using handle_level_irq() is problematic
>> (see
> http://lists.xensource.com/archives/html/xen-devel/2010-04/msg01178.html),
>> but as said there I think using the fasteoi logic is preferable. No
>> matter whether using edge or level, the ->end() method will
>> never be called (whereas fasteoi calls ->eoi(), which would
>> just need to be vectored to the same function as ->end()).
>> Without end_pirq() ever called, you can't let Xen know of
>> bad PIRQs (so that it can disable them instead of continuing
>> to call the [now shortcut] handler in the owning domain).
>
> Not an opinion, just confused: Isn't all that dealt with in
> chip->disable?
With disable_pirq() being empty (at least in the branches I
looked at)?
Jan
On 08/25/2010 12:52 AM, Jan Beulich wrote:
> >>> On 24.08.10 at 23:35, Jeremy Fitzhardinge <[email protected]> wrote:
>> We worked out the root cause was that it was incorrectly treating Xen
>> events as level rather than edge triggered interrupts, which works fine
>> unless you're handling one interrupt, the interrupt gets migrated to
>> another cpu and then re-raised. This ends up losing the interrupt
>> because the edge-triggering of the second interrupt is lost.
> While this description would seem plausible at the first glance, it
> doesn't match up with unmask_evtchn() already taking care of
> exactly this case. Or are you implicitly saying that this code is
> broken in some way (if so, how, and shouldn't it then be that
> code that needs fixing, or removing if you want to stay with the
> edge handling)?
>
> I do however agree that using handle_level_irq() is problematic
> (see http://lists.xensource.com/archives/html/xen-devel/2010-04/msg01178.html),
> but as said there I think using the fasteoi logic is preferable. No
> matter whether using edge or level, the ->end() method will
> never be called (whereas fasteoi calls ->eoi(), which would
> just need to be vectored to the same function as ->end()).
> Without end_pirq() ever called, you can't let Xen know of
> bad PIRQs (so that it can disable them instead of continuing
> to call the [now shortcut] handler in the owning domain).
Note that this patch is specifically for upstream Xen, which doesn't
have any pirq support in it at present.
However, I did consider using fasteoi, but I couldn't see how to make
it work. The problem is that it only does a single call into the
irq_chip for EOI after calling the interrupt handler, but there is no
call beforehand to ack the interrupt (which means clear the event flag
in our case). This leads to a race where an event can be lost after the
interrupt handler has returned, but before the event flag has been
cleared (because Xen won't set pending or call the upcall function if
the event is already set). I guess I could pre-clear the event in the
upcall function, but I'm not sure that's any better.
In the dom0 kernels, I followed the example of the IOAPIC irq_chip
implementation, which does the hardware EOI in the ack call at the start
of handle_edge_irq, can did the EOI hypercall (when necessary) there. I
welcome a reviewer's eye on this though.
>> The other change changes IPI and VIRQ event sources to use
>> handle_percpu_irq, because treating them as level is also wrong, and
>> they're actually inherently percpu events, much like LAPIC vectors.
> This doesn't seem right for the general VIRQ case: global ones
> should not be disallowed migration between CPUs. Since in your
> model the requestor has to pass IRQF_PERCPU anyway,
> shouldn't you make the selection of the handler dependent
> upon this flag?
I was thinking specifically of the timer, debug and console virqs. The
last is the only one which could conceivably be non-percpu, but in
practice I think it would be a bad idea to put it on anything other than
cpu0. What other global virqs are there? Nothing that's high-frequency
enough to be worth migrating?
But yes, if necessary, I could make it depend on the IRQF_PERCPU flag.
J