2023-06-08 12:32:46

by Gowans, James

Subject: [PATCH 0/3] Resend GIC-v3 LPIs on concurrent invoke

If interrupts do not have a global active state, it is possible for the
next interrupt to arrive on a new CPU if an affinity change happens
while the original CPU is still running the handler. This specifically
impacts GIC-v3.

In this series, generic functionality is added to handle_fasteoi_irq()
to support resending the interrupt when this race happens, and that
functionality is then enabled specifically for GIC-v3, which is impacted
by this issue. GIC-v3 uses the generic handle_fasteoi_irq() flow
handler, hence that is where the functionality is added.
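
For a sense of scale, opting a controller in is tiny; a rough sketch of
the idea behind patch 3/3 (paraphrased, not the exact hunk - virq + i
stands for each Linux IRQ allocated in the ITS irqdomain alloc path):

        /* Mark each allocated LPI so a concurrent arrival is resent, not dropped */
        irqd_set_resend_when_in_progress(irq_get_irq_data(virq + i));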

The series also adds a bit more detail to the IRQD flag docs, to help
future readers know when and why the flags should be used and what they
mean.

== Testing: ==

TL;DR: Run a VM using QEMU on an EC2 R6g.metal host with an ENA device
passed through using VFIO, and bounce IRQ affinity between two CPUs.
Before this change an interrupt can get lost and the device stalls;
after this change the interrupt is not lost.

=== Details: ===

Intentionally slow down the IRQ injection a bit, to turn this from a
rare race condition into something which can easily be flushed out in
testing:

@@ -763,6 +764,7 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi)
         raw_spin_lock_irqsave(&irq->irq_lock, flags);
         irq->pending_latch = true;
         vgic_queue_irq_unlock(kvm, irq, flags);
+        udelay(10);
 
         return 0;
 }

Also sprinkle a print to make it clear when the race described here is
hit:

@@ -698,6 +698,7 @@ void handle_fasteoi_irq(struct irq_desc *desc)
          * handling the previous one - it may need to be resent.
          */
         if (!irq_may_run(desc)) {
+                printk("!irq_may_run %i\n", desc->irq_data.irq);
                 if (irqd_needs_resend_when_in_progress(&desc->irq_data))
                         desc->istate |= IRQS_PENDING;
                 goto out;

Launch QEMU in your favourite way, with an ENA device passed through via
VFIO. The device needs to be re-bound to the vfio-pci driver before
this; see the sketch below.
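
A minimal re-bind sketch, assuming the device sits at PCI address
0000:04:00.0 (matching the QEMU command below):

# Detach the device from its host driver and hand it to vfio-pci
echo 0000:04:00.0 > /sys/bus/pci/devices/0000:04:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:04:00.0/driver_override
echo 0000:04:00.0 > /sys/bus/pci/drivers_probe

Then launch QEMU: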

qemu-system-aarch64 -enable-kvm -machine virt,gic_version=3 -device vfio-pci,host=04:00.0 ...

In the VM, generate network traffic to get interrupts flowing:

ping -f -i 0.001 10.0.3.1 > /dev/null

On the host, bounce the interrupt's affinity between CPU 0 and CPU 1
(affinity masks 1 and 2) to flush out the race:

while true; do
        echo 1 > /proc/irq/71/smp_affinity; sleep 0.01;
        echo 2 > /proc/irq/71/smp_affinity; sleep 0.01;
done

In the host dmesg, the printk shows that the race is being hit:

[ 102.215801] !irq_may_run 71
[ 105.426413] !irq_may_run 71
[ 105.586462] !irq_may_run 71

Before this change, an interrupt is lost; this manifests as a watchdog
timeout in the guest's device driver:

[ 35.124441] ena 0000:00:02.0 enp0s2: Found a Tx that wasn't completed on time,...
...
[ 37.124459] ------------[ cut here ]------------
[ 37.124791] NETDEV WATCHDOG: enp0s2 (ena): transmit queue 0 timed out

After this change, even though the !irq_may_run print is still shown
(indicating that the race is still hit), the driver no longer times out
because the interrupt now gets resent when the race occurs.
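
As an additional check, with CONFIG_GENERIC_IRQ_DEBUGFS enabled the new
flag can be seen directly in the irq debugfs output once set (assuming
debugfs is mounted and IRQ 71 as above):

grep RESEND /sys/kernel/debug/irq/irqs/71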

James Gowans (3):
genirq: Expand doc for PENDING and REPLAY flags
genirq: fasteoi supports resend on concurrent invoke
irqchip/gic-v3-its: Enable RESEND_WHEN_IN_PROGRESS for LPIs

drivers/irqchip/irq-gic-v3-its.c | 2 ++
include/linux/irq.h | 13 +++++++++++++
kernel/irq/chip.c | 16 +++++++++++++++-
kernel/irq/debugfs.c | 2 ++
kernel/irq/internals.h | 7 +++++--
5 files changed, 37 insertions(+), 3 deletions(-)


base-commit: 5f63595ebd82f56a2dd36ca013dd7f5ff2e2416a
--
2.25.1



2023-06-08 12:33:25

by Gowans, James

Subject: [PATCH 2/3] genirq: fasteoi supports resend on concurrent invoke

Update the generic handle_fasteoi_irq() to cater for the case when the
next interrupt comes in while the previous handler is still running.
Currently, when that happens, the irq_may_run() early out causes the
next IRQ to be lost. Support marking the interrupt as pending when that
happens, so that the interrupt can be resent before
handle_fasteoi_irq() returns. This is inspired by handle_edge_irq().

Generally it should not be possible for the next interrupt to arrive
while the previous handler is still running: the CPU will not preempt an
interrupt with another from the same source or same priority. However,
there is a race: if the interrupt affinity is changed while the previous
handler is running, then the next interrupt can arrive at a different
CPU while the previous handler is still running. In that case there will
be a concurrent invoke and the early out will be taken.

For example:

           CPU 0             |           CPU 1
-----------------------------|-----------------------------
interrupt start              |
handle_fasteoi_irq           | set_affinity(CPU 1)
handler                      |
...                          | interrupt start
...                          | handle_fasteoi_irq -> early out
handle_fasteoi_irq return    | interrupt end
interrupt end                |

This can happen on interrupt controllers which do not have a global
active state; the next commit will enable the flag for known impacted
controllers.

Implementation notes:

It is believed that it's NOT necessary to mask the interrupt in
handle_fasteoi_irq() the way that handle_edge_irq() does. This is
because handle_edge_irq() caters for controllers which are too simple to
gate interrupts from the same source, so the kernel explicitly masks the
interrupt if it re-occurs [0].
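
For comparison, handle_edge_irq()'s concurrent-invoke path does roughly
the following (paraphrased from kernel/irq/chip.c, not the exact code):

        if (!irq_may_run(desc)) {
                desc->istate |= IRQS_PENDING;   /* remember the interrupt... */
                mask_ack_irq(desc);             /* ...and mask it until it is replayed */
                goto out_unlock;
        }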

The resend-on-concurrent-invoke logic is gated by a flag which the
interrupt controller can set if it is susceptible to this problem. It is
not desirable to resend unconditionally: a wakeup source, for example,
has no need to be resent.

[0] https://lore.kernel.org/all/[email protected]/

Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: James Gowans <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: KarimAllah Raslan <[email protected]>
Cc: Yipeng Zou <[email protected]>
Cc: Zhang Jianhua <[email protected]>
---
include/linux/irq.h | 13 +++++++++++++
kernel/irq/chip.c | 16 +++++++++++++++-
kernel/irq/debugfs.c | 2 ++
3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index b1b28affb32a..cb77da1ac4c6 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -223,6 +223,8 @@ struct irq_data {
  *                                irq_chip::irq_set_affinity() when deactivated.
  * IRQD_IRQ_ENABLED_ON_SUSPEND  - Interrupt is enabled on suspend by irq pm if
  *                                irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
+ * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in progress in which
+ *                                case it must be resent at the next available opportunity.
  */
 enum {
         IRQD_TRIGGER_MASK               = 0xf,
@@ -249,6 +251,7 @@ enum {
         IRQD_HANDLE_ENFORCE_IRQCTX      = (1 << 28),
         IRQD_AFFINITY_ON_ACTIVATE       = (1 << 29),
         IRQD_IRQ_ENABLED_ON_SUSPEND     = (1 << 30),
+        IRQD_RESEND_WHEN_IN_PROGRESS    = (1U << 31),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -448,6 +451,16 @@ static inline bool irqd_affinity_on_activate(struct irq_data *d)
         return __irqd_to_state(d) & IRQD_AFFINITY_ON_ACTIVATE;
 }
 
+static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
+{
+        __irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
+static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
+{
+        return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..57cd8f475302 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -692,8 +692,16 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
         raw_spin_lock(&desc->lock);
 
-        if (!irq_may_run(desc))
+        /*
+         * When an affinity change races with IRQ handling, the next interrupt
+         * can arrive on the new CPU before the original CPU has completed
+         * handling the previous one - it may need to be resent.
+         */
+        if (!irq_may_run(desc)) {
+                if (irqd_needs_resend_when_in_progress(&desc->irq_data))
+                        desc->istate |= IRQS_PENDING;
                 goto out;
+        }
 
         desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
 
@@ -715,6 +723,12 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
         cond_unmask_eoi_irq(desc, chip);
 
+        /*
+         * When the race described above happens this will resend the interrupt.
+         */
+        if (unlikely(desc->istate & IRQS_PENDING))
+                check_irq_resend(desc, false);
+
         raw_spin_unlock(&desc->lock);
         return;
 out:
diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c
index bbcaac64038e..5971a66be034 100644
--- a/kernel/irq/debugfs.c
+++ b/kernel/irq/debugfs.c
@@ -133,6 +133,8 @@ static const struct irq_bit_descr irqdata_states[] = {
         BIT_MASK_DESCR(IRQD_HANDLE_ENFORCE_IRQCTX),
 
         BIT_MASK_DESCR(IRQD_IRQ_ENABLED_ON_SUSPEND),
+
+        BIT_MASK_DESCR(IRQD_RESEND_WHEN_IN_PROGRESS),
 };
 
 static const struct irq_bit_descr irqdesc_states[] = {
--
2.25.1


2023-06-16 08:53:09

by Gowans, James

Subject: Re: [PATCH 0/3] Resend GIC-v3 LPIs on concurrent invoke

Hi Marc and Thomas,
Just a ping on this series; it would be great to get any further
feedback, or to get this merged.

Thanks!
James

On Thu, 2023-06-08 at 14:00 +0200, James Gowans wrote:
> If interrupts do not have a global active state, it is possible for the
> next interrupt to arrive on a new CPU if an affinity change happens
> while the original CPU is still running the handler. This specifically
> impacts GIC-v3.
>
> [...]

2023-06-16 11:39:04

by tip-bot2 for James Gowans

Subject: [irqchip: irq/irqchip-next] genirq: Allow fasteoi handler to resend interrupts on concurrent handling

The following commit has been merged into the irq/irqchip-next branch of irqchip:

Commit-ID: 9c15eeb5362c48dd27d51bd72e8873341fa9383c
Gitweb: https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms/9c15eeb5362c48dd27d51bd72e8873341fa9383c
Author: James Gowans <[email protected]>
AuthorDate: Thu, 08 Jun 2023 14:00:20 +02:00
Committer: Marc Zyngier <[email protected]>
CommitterDate: Fri, 16 Jun 2023 12:22:35 +01:00

genirq: Allow fasteoi handler to resend interrupts on concurrent handling

There is a class of interrupt controllers out there that, once they
have signalled a given interrupt number, will still signal incoming
instances of the *same* interrupt despite the original interrupt
not having been EOIed yet.

As long as the new interrupt reaches the *same* CPU, nothing bad
happens, as that CPU still has its interrupts globally disabled,
and we will only take the new interrupt once the interrupt has
been EOIed.

However, things become more "interesting" if an affinity change comes
in while the interrupt is being handled. More specifically, while
the per-irq lock is being dropped. This results in the affinity change
taking place immediately. At this point, there is nothing that prevents
the interrupt from firing on the new target CPU. We end up with the
interrupt running concurrently on two CPUs, which isn't a good thing.

And that's where things become worse: the new CPU notices that the
interrupt handling is in progress (irq_may_run() returns false), and
*drops the interrupt on the floor*.

The whole race looks like this:

           CPU 0             |           CPU 1
-----------------------------|-----------------------------
interrupt start              |
handle_fasteoi_irq           | set_affinity(CPU 1)
handler                      |
...                          | interrupt start
...                          | handle_fasteoi_irq -> early out
handle_fasteoi_irq return    | interrupt end
interrupt end                |

If the interrupt was an edge, too bad. The interrupt is lost, and
the system will eventually die one way or another. Not great.

A way to avoid this situation is to detect this problem at the point
we handle the interrupt on the new target. Instead of dropping the
interrupt, use the resend mechanism to force it to be replayed.

Also, in order to limit the impact of this workaround to the pathetic
architectures that require it, gate it behind a new irq flag aptly
named IRQD_RESEND_WHEN_IN_PROGRESS.

Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: James Gowans <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: KarimAllah Raslan <[email protected]>
Cc: Yipeng Zou <[email protected]>
Cc: Zhang Jianhua <[email protected]>
[maz: reworded commit message]
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/irq.h | 13 +++++++++++++
kernel/irq/chip.c | 16 +++++++++++++++-
kernel/irq/debugfs.c | 2 ++
3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index d9c86db..d8a6fdc 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -223,6 +223,8 @@ struct irq_data {
  *                                irq_chip::irq_set_affinity() when deactivated.
  * IRQD_IRQ_ENABLED_ON_SUSPEND  - Interrupt is enabled on suspend by irq pm if
  *                                irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
+ * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in progress in which
+ *                                case it must be resent at the next available opportunity.
  */
 enum {
         IRQD_TRIGGER_MASK               = 0xf,
@@ -249,6 +251,7 @@ enum {
         IRQD_HANDLE_ENFORCE_IRQCTX      = BIT(28),
         IRQD_AFFINITY_ON_ACTIVATE       = BIT(29),
         IRQD_IRQ_ENABLED_ON_SUSPEND     = BIT(30),
+        IRQD_RESEND_WHEN_IN_PROGRESS    = BIT(31),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -448,6 +451,16 @@ static inline bool irqd_affinity_on_activate(struct irq_data *d)
         return __irqd_to_state(d) & IRQD_AFFINITY_ON_ACTIVATE;
 }
 
+static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
+{
+        __irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
+static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
+{
+        return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc8..57cd8f4 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -692,8 +692,16 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
         raw_spin_lock(&desc->lock);
 
-        if (!irq_may_run(desc))
+        /*
+         * When an affinity change races with IRQ handling, the next interrupt
+         * can arrive on the new CPU before the original CPU has completed
+         * handling the previous one - it may need to be resent.
+         */
+        if (!irq_may_run(desc)) {
+                if (irqd_needs_resend_when_in_progress(&desc->irq_data))
+                        desc->istate |= IRQS_PENDING;
                 goto out;
+        }
 
         desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
 
@@ -715,6 +723,12 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
         cond_unmask_eoi_irq(desc, chip);
 
+        /*
+         * When the race described above happens this will resend the interrupt.
+         */
+        if (unlikely(desc->istate & IRQS_PENDING))
+                check_irq_resend(desc, false);
+
         raw_spin_unlock(&desc->lock);
         return;
 out:
diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c
index bbcaac6..5971a66 100644
--- a/kernel/irq/debugfs.c
+++ b/kernel/irq/debugfs.c
@@ -133,6 +133,8 @@ static const struct irq_bit_descr irqdata_states[] = {
         BIT_MASK_DESCR(IRQD_HANDLE_ENFORCE_IRQCTX),
 
         BIT_MASK_DESCR(IRQD_IRQ_ENABLED_ON_SUSPEND),
+
+        BIT_MASK_DESCR(IRQD_RESEND_WHEN_IN_PROGRESS),
 };
 
 static const struct irq_bit_descr irqdesc_states[] = {

2023-06-16 12:35:20

by Marc Zyngier

Subject: Re: [PATCH 0/3] Resend GIC-v3 LPIs on concurrent invoke

On Fri, 16 Jun 2023 09:32:30 +0100,
"Gowans, James" <[email protected]> wrote:
>
> Hi Marc and Thomas,
> Just a ping on this series; it would be great to get any further
> feedback, or to get this merged.

Just did, after converting everything to BIT() and massaging the
commit messages to my own liking.

M.

--
Without deviation from the norm, progress is not possible.

2023-06-16 13:08:13

by Gowans, James

Subject: Re: [PATCH 0/3] Resend GIC-v3 LPIs on concurrent invoke

On Fri, 2023-06-16 at 12:30 +0100, Marc Zyngier wrote:
> On Fri, 16 Jun 2023 09:32:30 +0100,
> "Gowans, James" <[email protected]> wrote:
> > Hi Marc and Thomas,
> > Just a ping on this series; it would be great to get any further
> > feedback, or to get this merged.
>
> Just did, after converting everything to BIT() and massaging the
> commit messages to my own liking.

The commit message improvements are fantastic! Thanks for merging it
Marc.

Cheers,
JG