The existing KVM mechanism for forwarding level-triggered interrupts
using a resample eventfd doesn't work quite correctly for interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of the hardirq), not earlier. The existing KVM code doesn't take that
into account, which results in erroneous extra interrupts in the guest,
caused by a premature re-assert of an unacknowledged IRQ by the host.
This patch series fixes this issue (for now on x86 only) by checking
whether the interrupt is unmasked when we receive the irq ack (EOI)
and, if it is masked, postponing the resamplefd notification until the
guest unmasks it.
Patches 1 and 2 extend the existing support for irq mask notifiers in
KVM, which is a prerequisite needed for KVM irqfd to use mask notifiers
to know when an interrupt is masked or unmasked.
Patch 3 implements the actual fix: postponing resamplefd notify in irqfd
until the irq is unmasked.
Patches 4 and 5 just do some optional renaming for consistency, as we
are now using irq mask notifiers in irqfd along with irq ack notifiers.
Please see individual patches for more details.
v2:
- Fixed compilation failure on non-x86: mask_notifier_list moved from
x86 "struct kvm_arch" to generic "struct kvm".
- kvm_fire_mask_notifiers() also moved from x86 to generic code, even
though it is not called on other architectures for now.
- Instead of kvm_irq_is_masked() implemented
kvm_register_and_fire_irq_mask_notifier() to fix potential race
when reading the initial IRQ mask state.
- Renamed for clarity:
- irqfd_resampler_mask() -> irqfd_resampler_mask_notify()
- kvm_irq_has_notifier() -> kvm_irq_has_ack_notifier()
- resampler->notifier -> resampler->ack_notifier
- Reorganized code in irqfd_resampler_ack() and
irqfd_resampler_mask_notify() to make it easier to follow.
- Don't follow unwanted "return type on separate line" style for
irqfd_resampler_mask_notify().
Dmytro Maluka (5):
KVM: x86: Move irq mask notifiers from x86 to generic KVM
KVM: x86: Add kvm_register_and_fire_irq_mask_notifier()
KVM: irqfd: Postpone resamplefd notify for oneshot interrupts
KVM: irqfd: Rename resampler->notifier
KVM: Rename kvm_irq_has_notifier()
arch/x86/include/asm/kvm_host.h | 17 +---
arch/x86/kvm/i8259.c | 6 ++
arch/x86/kvm/ioapic.c | 8 +-
arch/x86/kvm/ioapic.h | 1 +
arch/x86/kvm/irq_comm.c | 74 +++++++++++------
arch/x86/kvm/x86.c | 1 -
include/linux/kvm_host.h | 21 ++++-
include/linux/kvm_irqfd.h | 16 +++-
virt/kvm/eventfd.c | 136 ++++++++++++++++++++++++++++----
virt/kvm/kvm_main.c | 1 +
10 files changed, 221 insertions(+), 60 deletions(-)
--
2.37.1.559.g78731f0fdb-goog
The existing KVM mechanism for forwarding level-triggered interrupts
using a resample eventfd doesn't work quite correctly for interrupts
that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT).
Such an interrupt is acked to the device in its threaded irq handler,
i.e. later than it is acked to the interrupt controller (EOI at the end
of the hardirq), not earlier.
Linux keeps such an interrupt masked until its threaded handler
finishes, to prevent the EOI from re-asserting an unacknowledged
interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we don't check that the interrupt is still masked in the guest at the
moment of EOI. The resamplefd is notified regardless, so vfio
prematurely unmasks the host physical IRQ, and a new (unwanted)
physical interrupt is generated in the host and queued for injection
into the guest.
The fact that the virtual IRQ is still masked doesn't prevent this new
physical IRQ from being propagated to the guest, because:
1. It is not guaranteed that the vIRQ will remain masked by the time
vfio signals the trigger eventfd.
2. KVM marks this IRQ as pending (e.g. setting its bit in the virtual
IRR register of IOAPIC on x86), so after the vIRQ is unmasked, this
new pending interrupt is injected by KVM to the guest anyway.
At least 2 user-visible issues caused by these extra erroneous pending
interrupts for oneshot irqs in the guest have been observed:
1. System suspend aborted due to a pending wakeup interrupt from
ChromeOS EC (drivers/platform/chrome/cros_ec.c).
2. Annoying "invalid report id data" errors from ELAN0000 touchpad
(drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
every time the touchpad is touched.
This patch fixes the issue on x86 by checking whether the interrupt is
unmasked when we receive the irq ack (EOI) and, if it is masked,
postponing the resamplefd notification until the guest unmasks it.
It doesn't fix the issue for other archs yet, since it relies on the
KVM irq mask notifier functionality, which currently works only on x86.
On other archs mask notifiers can be registered but are never called,
so resampler->masked is always false and the behavior is the same as
before this patch.
Link: https://lore.kernel.org/kvm/[email protected]/
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Dmytro Maluka <[email protected]>
---
include/linux/kvm_irqfd.h | 14 ++++++++++
virt/kvm/eventfd.c | 56 +++++++++++++++++++++++++++++++++++----
2 files changed, 65 insertions(+), 5 deletions(-)
diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index dac047abdba7..01754a1abb9e 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -19,6 +19,16 @@
* resamplefd. All resamplers on the same gsi are de-asserted
* together, so we don't need to track the state of each individual
* user. We can also therefore share the same irq source ID.
+ *
+ * A special case is when the interrupt is still masked at the moment
+ * an irq ack is received. That likely means that the interrupt has
+ * been acknowledged to the interrupt controller but not acknowledged
+ * to the device yet, e.g. it might be a Linux guest's threaded
+ * oneshot interrupt (IRQF_ONESHOT). In this case notifying through
+ * resamplefd is postponed until the guest unmasks the interrupt,
+ * which is detected through the irq mask notifier. This prevents
+ * erroneous extra interrupts caused by premature re-assert of an
+ * unacknowledged interrupt by the resamplefd listener.
*/
struct kvm_kernel_irqfd_resampler {
struct kvm *kvm;
@@ -28,6 +38,10 @@ struct kvm_kernel_irqfd_resampler {
*/
struct list_head list;
struct kvm_irq_ack_notifier notifier;
+ struct kvm_irq_mask_notifier mask_notifier;
+ bool masked;
+ bool pending;
+ spinlock_t lock;
/*
* Entry in list of kvm->irqfd.resampler_list. Use for sharing
* resamplers among irqfds on the same gsi.
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 3007d956b626..f98dcce3959c 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -67,6 +67,7 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
struct kvm *kvm;
struct kvm_kernel_irqfd *irqfd;
int idx;
+ bool notify = true;
resampler = container_of(kian,
struct kvm_kernel_irqfd_resampler, notifier);
@@ -75,13 +76,52 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
resampler->notifier.gsi, 0, false);
- idx = srcu_read_lock(&kvm->irq_srcu);
+ spin_lock(&resampler->lock);
+ if (resampler->masked) {
+ notify = false;
+ resampler->pending = true;
+ }
+ spin_unlock(&resampler->lock);
+
+ if (notify) {
+ idx = srcu_read_lock(&kvm->irq_srcu);
- list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
- srcu_read_lock_held(&kvm->irq_srcu))
- eventfd_signal(irqfd->resamplefd, 1);
+ list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
+ srcu_read_lock_held(&kvm->irq_srcu))
+ eventfd_signal(irqfd->resamplefd, 1);
- srcu_read_unlock(&kvm->irq_srcu, idx);
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+ }
+}
+
+static void irqfd_resampler_mask_notify(struct kvm_irq_mask_notifier *kimn,
+ bool masked)
+{
+ struct kvm_kernel_irqfd_resampler *resampler;
+ struct kvm *kvm;
+ struct kvm_kernel_irqfd *irqfd;
+ int idx;
+ bool notify;
+
+ resampler = container_of(kimn,
+ struct kvm_kernel_irqfd_resampler, mask_notifier);
+ kvm = resampler->kvm;
+
+ spin_lock(&resampler->lock);
+ notify = !masked && resampler->pending;
+ resampler->masked = masked;
+ resampler->pending = false;
+ spin_unlock(&resampler->lock);
+
+ if (notify) {
+ idx = srcu_read_lock(&kvm->irq_srcu);
+
+ list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
+ srcu_read_lock_held(&kvm->irq_srcu))
+ eventfd_signal(irqfd->resamplefd, 1);
+
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+ }
}
static void
@@ -98,6 +138,8 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
if (list_empty(&resampler->list)) {
list_del(&resampler->link);
kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
+ kvm_unregister_irq_mask_notifier(kvm, resampler->mask_notifier.irq,
+ &resampler->mask_notifier);
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
resampler->notifier.gsi, 0, false);
kfree(resampler);
@@ -367,9 +409,13 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
INIT_LIST_HEAD(&resampler->list);
resampler->notifier.gsi = irqfd->gsi;
resampler->notifier.irq_acked = irqfd_resampler_ack;
+ resampler->mask_notifier.func = irqfd_resampler_mask_notify;
+ spin_lock_init(&resampler->lock);
INIT_LIST_HEAD(&resampler->link);
list_add(&resampler->link, &kvm->irqfds.resampler_list);
+ kvm_register_and_fire_irq_mask_notifier(kvm, irqfd->gsi,
+ &resampler->mask_notifier);
kvm_register_irq_ack_notifier(kvm,
&resampler->notifier);
irqfd->resampler = resampler;
--
2.37.1.559.g78731f0fdb-goog
Since the resampler irqfd now uses a mask notifier along with an ack
notifier, rename resampler->notifier to resampler->ack_notifier for
clarity.
Signed-off-by: Dmytro Maluka <[email protected]>
---
include/linux/kvm_irqfd.h | 2 +-
virt/kvm/eventfd.c | 16 ++++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 01754a1abb9e..4df9e6bbd7db 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -37,7 +37,7 @@ struct kvm_kernel_irqfd_resampler {
* RCU list modified under kvm->irqfds.resampler_lock
*/
struct list_head list;
- struct kvm_irq_ack_notifier notifier;
+ struct kvm_irq_ack_notifier ack_notifier;
struct kvm_irq_mask_notifier mask_notifier;
bool masked;
bool pending;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index f98dcce3959c..72de942dbb9c 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -70,11 +70,11 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
bool notify = true;
resampler = container_of(kian,
- struct kvm_kernel_irqfd_resampler, notifier);
+ struct kvm_kernel_irqfd_resampler, ack_notifier);
kvm = resampler->kvm;
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
- resampler->notifier.gsi, 0, false);
+ resampler->ack_notifier.gsi, 0, false);
spin_lock(&resampler->lock);
if (resampler->masked) {
@@ -137,11 +137,11 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
if (list_empty(&resampler->list)) {
list_del(&resampler->link);
- kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
+ kvm_unregister_irq_ack_notifier(kvm, &resampler->ack_notifier);
kvm_unregister_irq_mask_notifier(kvm, resampler->mask_notifier.irq,
&resampler->mask_notifier);
kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
- resampler->notifier.gsi, 0, false);
+ resampler->ack_notifier.gsi, 0, false);
kfree(resampler);
}
@@ -390,7 +390,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
list_for_each_entry(resampler,
&kvm->irqfds.resampler_list, link) {
- if (resampler->notifier.gsi == irqfd->gsi) {
+ if (resampler->ack_notifier.gsi == irqfd->gsi) {
irqfd->resampler = resampler;
break;
}
@@ -407,8 +407,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
resampler->kvm = kvm;
INIT_LIST_HEAD(&resampler->list);
- resampler->notifier.gsi = irqfd->gsi;
- resampler->notifier.irq_acked = irqfd_resampler_ack;
+ resampler->ack_notifier.gsi = irqfd->gsi;
+ resampler->ack_notifier.irq_acked = irqfd_resampler_ack;
resampler->mask_notifier.func = irqfd_resampler_mask_notify;
spin_lock_init(&resampler->lock);
INIT_LIST_HEAD(&resampler->link);
@@ -417,7 +417,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
kvm_register_and_fire_irq_mask_notifier(kvm, irqfd->gsi,
&resampler->mask_notifier);
kvm_register_irq_ack_notifier(kvm,
- &resampler->notifier);
+ &resampler->ack_notifier);
irqfd->resampler = resampler;
}
--
2.37.1.559.g78731f0fdb-goog
Now that irq mask notifiers are available in generic KVM along with
irq ack notifiers, rename kvm_irq_has_notifier() to
kvm_irq_has_ack_notifier() to make it clear which notifier it refers
to.
Signed-off-by: Dmytro Maluka <[email protected]>
---
arch/x86/kvm/ioapic.c | 2 +-
include/linux/kvm_host.h | 2 +-
virt/kvm/eventfd.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index fab11de1f885..20d758ac2234 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -291,7 +291,7 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, ulong *ioapic_handled_vectors)
for (index = 0; index < IOAPIC_NUM_PINS; index++) {
e = &ioapic->redirtbl[index];
if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
- kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
+ kvm_irq_has_ack_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
index == RTC_GSI) {
u16 dm = kvm_lapic_irq_dest_mode(!!e->fields.dest_mode);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 55233eb18eb4..ba18276691e1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1601,7 +1601,7 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id,
int level, bool line_status);
-bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
+bool kvm_irq_has_ack_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
void kvm_notify_acked_gsi(struct kvm *kvm, int gsi);
void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
void kvm_register_irq_ack_notifier(struct kvm *kvm,
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 72de942dbb9c..4dd7b6f2da69 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -502,7 +502,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
return ret;
}
-bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
+bool kvm_irq_has_ack_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
struct kvm_irq_ack_notifier *kian;
int gsi, idx;
@@ -521,7 +521,7 @@ bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
return false;
}
-EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
+EXPORT_SYMBOL_GPL(kvm_irq_has_ack_notifier);
void kvm_notify_acked_gsi(struct kvm *kvm, int gsi)
{
--
2.37.1.559.g78731f0fdb-goog
Currently irq mask notifiers are used only internally in the x86 code
for PIT emulation. However, they are not really arch-specific. We are
going to use them in the generic irqfd code, for postponing the
resampler irqfd notification until the interrupt is unmasked. So move
the implementation of mask notifiers to the generic code, to allow
irqfd to register its mask notifiers.
Note that firing mask notifiers via kvm_fire_mask_notifiers() is still
implemented for x86 only, so registering mask notifiers on other
architectures has no effect for now.
Signed-off-by: Dmytro Maluka <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 16 ----------------
arch/x86/kvm/irq_comm.c | 33 ---------------------------------
arch/x86/kvm/x86.c | 1 -
include/linux/kvm_host.h | 15 +++++++++++++++
virt/kvm/eventfd.c | 33 +++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 1 +
6 files changed, 49 insertions(+), 50 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9217bd6cf0d1..dc76617f11c1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1198,9 +1198,6 @@ struct kvm_arch {
struct kvm_xen_hvm_config xen_hvm_config;
- /* reads protected by irq_srcu, writes by irq_lock */
- struct hlist_head mask_notifier_list;
-
struct kvm_hv hyperv;
struct kvm_xen xen;
@@ -1688,19 +1685,6 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
const void *val, int bytes);
-struct kvm_irq_mask_notifier {
- void (*func)(struct kvm_irq_mask_notifier *kimn, bool masked);
- int irq;
- struct hlist_node link;
-};
-
-void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
- struct kvm_irq_mask_notifier *kimn);
-void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
- struct kvm_irq_mask_notifier *kimn);
-void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
- bool mask);
-
extern bool tdp_enabled;
u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 0687162c4f22..f27e4c9c403e 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -234,39 +234,6 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
mutex_unlock(&kvm->irq_lock);
}
-void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
- struct kvm_irq_mask_notifier *kimn)
-{
- mutex_lock(&kvm->irq_lock);
- kimn->irq = irq;
- hlist_add_head_rcu(&kimn->link, &kvm->arch.mask_notifier_list);
- mutex_unlock(&kvm->irq_lock);
-}
-
-void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
- struct kvm_irq_mask_notifier *kimn)
-{
- mutex_lock(&kvm->irq_lock);
- hlist_del_rcu(&kimn->link);
- mutex_unlock(&kvm->irq_lock);
- synchronize_srcu(&kvm->irq_srcu);
-}
-
-void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
- bool mask)
-{
- struct kvm_irq_mask_notifier *kimn;
- int idx, gsi;
-
- idx = srcu_read_lock(&kvm->irq_srcu);
- gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
- if (gsi != -1)
- hlist_for_each_entry_rcu(kimn, &kvm->arch.mask_notifier_list, link)
- if (kimn->irq == gsi)
- kimn->func(kimn, mask);
- srcu_read_unlock(&kvm->irq_srcu, idx);
-}
-
bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
{
return irqchip_in_kernel(kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5fa335a4ea7..a0a776f5c42f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11818,7 +11818,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (ret)
goto out_page_track;
- INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
atomic_set(&kvm->arch.noncoherent_dma_count, 0);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 90a45ef7203b..dd5f14e31996 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -760,7 +760,10 @@ struct kvm {
struct kvm_irq_routing_table __rcu *irq_routing;
#endif
#ifdef CONFIG_HAVE_KVM_IRQFD
+ /* reads protected by irq_srcu, writes by irq_lock */
struct hlist_head irq_ack_notifier_list;
+ /* reads protected by irq_srcu, writes by irq_lock */
+ struct hlist_head irq_mask_notifier_list;
#endif
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
@@ -1581,6 +1584,12 @@ struct kvm_irq_ack_notifier {
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
};
+struct kvm_irq_mask_notifier {
+ void (*func)(struct kvm_irq_mask_notifier *kimn, bool masked);
+ int irq;
+ struct hlist_node link;
+};
+
int kvm_irq_map_gsi(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *entries, int gsi);
int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, unsigned pin);
@@ -1599,6 +1608,12 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
+void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn);
+void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn);
+void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
+ bool mask);
int kvm_request_irq_source_id(struct kvm *kvm);
void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2a3ed401ce46..39403d9fbdcc 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -518,6 +518,39 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
synchronize_srcu(&kvm->irq_srcu);
kvm_arch_post_irq_ack_notifier_list_update(kvm);
}
+
+void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ mutex_lock(&kvm->irq_lock);
+ kimn->irq = irq;
+ hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
+ mutex_unlock(&kvm->irq_lock);
+}
+
+void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ mutex_lock(&kvm->irq_lock);
+ hlist_del_rcu(&kimn->link);
+ mutex_unlock(&kvm->irq_lock);
+ synchronize_srcu(&kvm->irq_srcu);
+}
+
+void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
+ bool mask)
+{
+ struct kvm_irq_mask_notifier *kimn;
+ int idx, gsi;
+
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
+ if (gsi != -1)
+ hlist_for_each_entry_rcu(kimn, &kvm->irq_mask_notifier_list, link)
+ if (kimn->irq == gsi)
+ kimn->func(kimn, mask);
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+}
#endif
void
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a49df8988cd6..5ca7fb0b8257 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1144,6 +1144,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
#ifdef CONFIG_HAVE_KVM_IRQFD
INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
+ INIT_HLIST_HEAD(&kvm->irq_mask_notifier_list);
#endif
r = kvm_init_mmu_notifier(kvm);
--
2.37.1.559.g78731f0fdb-goog
In order to implement postponing resamplefd notification until an
interrupt is unmasked, we need not only to track changes of the
interrupt mask state (which is already possible with
kvm_register_irq_mask_notifier()) but also to know its initial
mask state before any mask notifier has fired.
Moreover, we need to do this initial check of the IRQ mask state in a
race-free way, to ensure that we will not miss any further mask or
unmask events after we check the initial mask state.
So implement kvm_register_and_fire_irq_mask_notifier(), which
atomically registers an IRQ mask notifier and calls it with the
current mask value of the IRQ. It does so using the same locking order
as when calling notifiers normally via kvm_fire_mask_notifiers(), to
prevent deadlocks.
Its implementation needs to be arch-specific since it relies on
arch-specific synchronization (e.g. ioapic->lock and pic->lock on x86,
or a per-IRQ lock on ARM vGIC) for serializing our initial reading of
the IRQ mask state with a pending change of this mask state.
For now implement it for x86 only, and for other archs add a weak
dummy implementation which doesn't actually call the notifier (other
archs don't currently implement calling notifiers normally via
kvm_fire_mask_notifiers() either, so registering mask notifiers has no
effect on those archs anyway).
Signed-off-by: Dmytro Maluka <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/i8259.c | 6 ++++
arch/x86/kvm/ioapic.c | 6 ++++
arch/x86/kvm/ioapic.h | 1 +
arch/x86/kvm/irq_comm.c | 57 +++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 4 +++
virt/kvm/eventfd.c | 31 ++++++++++++++++--
7 files changed, 104 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc76617f11c1..cf0571ed2968 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1834,6 +1834,7 @@ static inline int __kvm_irq_line_state(unsigned long *irq_state,
int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level);
void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
+bool kvm_pic_irq_is_masked(struct kvm_pic *s, int irq);
void kvm_inject_nmi(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index e1bb6218bb96..1eb3127f6047 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -211,6 +211,12 @@ void kvm_pic_clear_all(struct kvm_pic *s, int irq_source_id)
pic_unlock(s);
}
+/* Called with s->lock held. */
+bool kvm_pic_irq_is_masked(struct kvm_pic *s, int irq)
+{
+ return !!(s->pics[irq >> 3].imr & (1 << irq));
+}
+
/*
* acknowledge interrupt 'irq'
*/
diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 765943d7cfa5..fab11de1f885 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -478,6 +478,12 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int irq_source_id)
spin_unlock(&ioapic->lock);
}
+/* Called with ioapic->lock held. */
+bool kvm_ioapic_irq_is_masked(struct kvm_ioapic *ioapic, int irq)
+{
+ return !!ioapic->redirtbl[irq].fields.mask;
+}
+
static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
{
int i;
diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
index 539333ac4b38..fe1f51319992 100644
--- a/arch/x86/kvm/ioapic.h
+++ b/arch/x86/kvm/ioapic.h
@@ -114,6 +114,7 @@ void kvm_ioapic_destroy(struct kvm *kvm);
int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id,
int level, bool line_status);
void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int irq_source_id);
+bool kvm_ioapic_irq_is_masked(struct kvm_ioapic *ioapic, int irq);
void kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
void kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index f27e4c9c403e..4bd4218821a2 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -234,6 +234,63 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
mutex_unlock(&kvm->irq_lock);
}
+void kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ struct kvm_pic *pic = kvm->arch.vpic;
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+ struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
+ struct kvm_kernel_irq_routing_entry *pic_e = NULL, *ioapic_e = NULL;
+ int idx, i, n;
+ bool masked;
+
+ mutex_lock(&kvm->irq_lock);
+
+ /*
+ * Not possible to detect if the guest uses the PIC or the
+ * IOAPIC. So assume the interrupt to be unmasked iff it is
+ * unmasked in at least one of both.
+ */
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ n = kvm_irq_map_gsi(kvm, entries, irq);
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+
+ for (i = 0; i < n; i++) {
+ if (entries[i].type != KVM_IRQ_ROUTING_IRQCHIP)
+ continue;
+
+ switch (entries[i].irqchip.irqchip) {
+ case KVM_IRQCHIP_PIC_MASTER:
+ case KVM_IRQCHIP_PIC_SLAVE:
+ pic_e = &entries[i];
+ break;
+ case KVM_IRQCHIP_IOAPIC:
+ ioapic_e = &entries[i];
+ break;
+ default:
+ break;
+ }
+ }
+
+ if (pic_e)
+ spin_lock(&pic->lock);
+ if (ioapic_e)
+ spin_lock(&ioapic->lock);
+
+ __kvm_register_irq_mask_notifier(kvm, irq, kimn);
+
+ masked = (!pic_e || kvm_pic_irq_is_masked(pic, pic_e->irqchip.pin)) &&
+ (!ioapic_e || kvm_ioapic_irq_is_masked(ioapic, ioapic_e->irqchip.pin));
+ kimn->func(kimn, masked);
+
+ if (ioapic_e)
+ spin_unlock(&ioapic->lock);
+ if (pic_e)
+ spin_unlock(&pic->lock);
+
+ mutex_unlock(&kvm->irq_lock);
+}
+
bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
{
return irqchip_in_kernel(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dd5f14e31996..55233eb18eb4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1608,8 +1608,12 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
+void __kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn);
void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
struct kvm_irq_mask_notifier *kimn);
+void kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn);
void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
struct kvm_irq_mask_notifier *kimn);
void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 39403d9fbdcc..3007d956b626 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -519,15 +519,42 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
kvm_arch_post_irq_ack_notifier_list_update(kvm);
}
+void __kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ kimn->irq = irq;
+ hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
+}
+
void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
struct kvm_irq_mask_notifier *kimn)
{
mutex_lock(&kvm->irq_lock);
- kimn->irq = irq;
- hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
+ __kvm_register_irq_mask_notifier(kvm, irq, kimn);
mutex_unlock(&kvm->irq_lock);
}
+/*
+ * kvm_register_and_fire_irq_mask_notifier() registers the notifier and
+ * immediately calls it with the current mask value of the IRQ. It does
+ * that atomically, so that we will find out the initial mask state of
+ * the IRQ and will not miss any further mask or unmask events. It does
+ * that using the same locking order as when calling notifier normally
+ * via kvm_fire_mask_notifiers(), to prevent deadlocks.
+ *
+ * Implementation is arch-specific since it relies on arch-specific
+ * (irqchip-specific) synchronization. Below is a weak dummy
+ * implementation for archs not implementing it yet, as those archs
+ * don't implement calling notifiers normally via
+ * kvm_fire_mask_notifiers() either, i.e. registering mask notifiers
+ * has no effect on those archs anyway.
+ */
+void __weak kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ kvm_register_irq_mask_notifier(kvm, irq, kimn);
+}
+
void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
struct kvm_irq_mask_notifier *kimn)
{
--
2.37.1.559.g78731f0fdb-goog
>
> The existing KVM mechanism for forwarding of level-triggered interrupts using
> resample eventfd doesn't work quite correctly in the case of interrupts that are
> handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an
> interrupt is acked to the device in its threaded irq handler, i.e. later than it is
> acked to the interrupt controller (EOI at the end of hardirq), not earlier. The
> existing KVM code doesn't take that into account, which results in erroneous
> extra interrupts in the guest caused by premature re-assert of an
> unacknowledged IRQ by the host.
Interesting... How does it behave on the native side?
On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>
>> The existing KVM mechanism for forwarding of level-triggered interrupts using
>> resample eventfd doesn't work quite correctly in the case of interrupts that are
>> handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an
>> interrupt is acked to the device in its threaded irq handler, i.e. later than it is
>> acked to the interrupt controller (EOI at the end of hardirq), not earlier. The
>> existing KVM code doesn't take that into account, which results in erroneous
>> extra interrupts in the guest caused by premature re-assert of an
>> unacknowledged IRQ by the host.
>
> Interesting... How does it behave on the native side?
On the native side it behaves correctly, since Linux masks such a
oneshot interrupt at the beginning of hardirq, so that the EOI at the
end of hardirq doesn't result in its immediate re-assert, and then
unmasks it later, after its threaded irq handler completes.

In handle_fasteoi_irq():

	if (desc->istate & IRQS_ONESHOT)
		mask_irq(desc);

	handle_irq_event(desc);

	cond_unmask_eoi_irq(desc, chip);

and later in unmask_threaded_irq():

	unmask_irq(desc);
I also mentioned that in patch #3 description:
"Linux keeps such interrupt masked until its threaded handler finishes,
to prevent the EOI from re-asserting an unacknowledged interrupt.
However, with KVM + vfio (or whatever is listening on the resamplefd)
we don't check that the interrupt is still masked in the guest at the
moment of EOI. Resamplefd is notified regardless, so vfio prematurely
unmasks the host physical IRQ, thus a new (unwanted) physical interrupt
is generated in the host and queued for injection to the guest."
>
> -----Original Message-----
> From: Dmytro Maluka <[email protected]>
> Sent: Tuesday, August 9, 2022 12:24 AM
> To: Dong, Eddie <[email protected]>; Christopherson,, Sean
> <[email protected]>; Paolo Bonzini <[email protected]>;
> [email protected]
> Cc: Thomas Gleixner <[email protected]>; Ingo Molnar <[email protected]>;
> Borislav Petkov <[email protected]>; Dave Hansen <[email protected]>;
> [email protected]; H. Peter Anvin <[email protected]>; linux-
> [email protected]; Eric Auger <[email protected]>; Alex
> Williamson <[email protected]>; Liu, Rong L <[email protected]>;
> Zhenyu Wang <[email protected]>; Tomasz Nowicki
> <[email protected]>; Grzegorz Jaszczyk <[email protected]>;
> [email protected]; Dmitry Torokhov <[email protected]>
> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>
> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>
> >> The existing KVM mechanism for forwarding of level-triggered
> >> interrupts using resample eventfd doesn't work quite correctly in the
> >> case of interrupts that are handled in a Linux guest as oneshot
> >> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >> in its threaded irq handler, i.e. later than it is acked to the
> >> interrupt controller (EOI at the end of hardirq), not earlier. The
> >> existing KVM code doesn't take that into account, which results in
> >> erroneous extra interrupts in the guest caused by premature re-assert of an
> unacknowledged IRQ by the host.
> >
> > Interesting... How does it behave on the native side?
>
> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> immediate re-assert, and then unmasks it later, after its threaded irq handler
> completes.
>
> In handle_fasteoi_irq():
>
> if (desc->istate & IRQS_ONESHOT)
> mask_irq(desc);
>
> handle_irq_event(desc);
>
> cond_unmask_eoi_irq(desc, chip);
>
>
> and later in unmask_threaded_irq():
>
> unmask_irq(desc);
>
> I also mentioned that in patch #3 description:
> "Linux keeps such interrupt masked until its threaded handler finishes, to
> prevent the EOI from re-asserting an unacknowledged interrupt.
That makes sense. Can you include the full story in the cover letter too?
> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> check that the interrupt is still masked in the guest at the moment of EOI.
> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> and queued for injection to the guest."
>
Emulation of level-triggered IRQs is a pain point ☹
I read this as: we need to emulate the "level" of the IRQ pin (connecting the device to the IRQ chip, i.e. the IOAPIC here).
Technically, the guest can change the polarity of the vIOAPIC, which will lead to a new virtual IRQ even without a host-side interrupt.
The "pending" field of kvm_kernel_irqfd_resampler in patch 3 then means more an event than an interrupt level.
Hi Dmytro,
On 8/5/22 21:39, Dmytro Maluka wrote:
> The existing KVM mechanism for forwarding of level-triggered interrupts
> using resample eventfd doesn't work quite correctly in the case of
> interrupts that are handled in a Linux guest as oneshot interrupts
> (IRQF_ONESHOT). Such an interrupt is acked to the device in its
> threaded irq handler, i.e. later than it is acked to the interrupt
> controller (EOI at the end of hardirq), not earlier.
>
> Linux keeps such interrupt masked until its threaded handler finishes,
> to prevent the EOI from re-asserting an unacknowledged interrupt.
> However, with KVM + vfio (or whatever is listening on the resamplefd)
> we don't check that the interrupt is still masked in the guest at the
> moment of EOI. Resamplefd is notified regardless, so vfio prematurely
> unmasks the host physical IRQ, thus a new (unwanted) physical interrupt
> is generated in the host and queued for injection to the guest.
>
> The fact that the virtual IRQ is still masked doesn't prevent this new
> physical IRQ from being propagated to the guest, because:
>
> 1. It is not guaranteed that the vIRQ will remain masked by the time
> when vfio signals the trigger eventfd.
> 2. KVM marks this IRQ as pending (e.g. setting its bit in the virtual
> IRR register of IOAPIC on x86), so after the vIRQ is unmasked, this
> new pending interrupt is injected by KVM to the guest anyway.
>
> There are at least 2 observed user-visible issues caused by those
> extra erroneous pending interrupts for oneshot irqs in the guest:
>
> 1. System suspend aborted due to a pending wakeup interrupt from
> ChromeOS EC (drivers/platform/chrome/cros_ec.c).
> 2. Annoying "invalid report id data" errors from ELAN0000 touchpad
> (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
> every time the touchpad is touched.
>
> This patch fixes the issue on x86 by checking if the interrupt is
> unmasked when we receive irq ack (EOI) and, in case if it's masked,
> postponing resamplefd notify until the guest unmasks it.
>
> It doesn't fix the issue for other archs yet, since it relies on KVM
> irq mask notifiers functionality which currently works only on x86.
> On other archs we can register mask notifiers but they are never called.
> So on other archs resampler->masked is always false, so the behavior is
> the same as before this patch.
>
> Link: https://lore.kernel.org/kvm/[email protected]/
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Dmytro Maluka <[email protected]>
> ---
> include/linux/kvm_irqfd.h | 14 ++++++++++
> virt/kvm/eventfd.c | 56 +++++++++++++++++++++++++++++++++++----
> 2 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
> index dac047abdba7..01754a1abb9e 100644
> --- a/include/linux/kvm_irqfd.h
> +++ b/include/linux/kvm_irqfd.h
> @@ -19,6 +19,16 @@
> * resamplefd. All resamplers on the same gsi are de-asserted
> * together, so we don't need to track the state of each individual
> * user. We can also therefore share the same irq source ID.
> + *
> + * A special case is when the interrupt is still masked at the moment
> + * an irq ack is received. That likely means that the interrupt has
> + * been acknowledged to the interrupt controller but not acknowledged
> + * to the device yet, e.g. it might be a Linux guest's threaded
> + * oneshot interrupt (IRQF_ONESHOT). In this case notifying through
> + * resamplefd is postponed until the guest unmasks the interrupt,
> + * which is detected through the irq mask notifier. This prevents
> + * erroneous extra interrupts caused by premature re-assert of an
> + * unacknowledged interrupt by the resamplefd listener.
> */
> struct kvm_kernel_irqfd_resampler {
> struct kvm *kvm;
> @@ -28,6 +38,10 @@ struct kvm_kernel_irqfd_resampler {
> */
> struct list_head list;
> struct kvm_irq_ack_notifier notifier;
> + struct kvm_irq_mask_notifier mask_notifier;
> + bool masked;
> + bool pending;
> + spinlock_t lock;
> /*
> * Entry in list of kvm->irqfd.resampler_list. Use for sharing
> * resamplers among irqfds on the same gsi.
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 3007d956b626..f98dcce3959c 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -67,6 +67,7 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
> struct kvm *kvm;
> struct kvm_kernel_irqfd *irqfd;
> int idx;
> + bool notify = true;
>
> resampler = container_of(kian,
> struct kvm_kernel_irqfd_resampler, notifier);
> @@ -75,13 +76,52 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
> kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
> resampler->notifier.gsi, 0, false);
>
> - idx = srcu_read_lock(&kvm->irq_srcu);
> + spin_lock(&resampler->lock);
> + if (resampler->masked) {
> + notify = false;
> + resampler->pending = true;
> + }
> + spin_unlock(&resampler->lock);
> +
> + if (notify) {
> + idx = srcu_read_lock(&kvm->irq_srcu);
>
> - list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
> - srcu_read_lock_held(&kvm->irq_srcu))
> - eventfd_signal(irqfd->resamplefd, 1);
> + list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
> + srcu_read_lock_held(&kvm->irq_srcu))
> + eventfd_signal(irqfd->resamplefd, 1);
nit: you may introduce a helper for the above code, as the code is duplicated.
>
> - srcu_read_unlock(&kvm->irq_srcu, idx);
> + srcu_read_unlock(&kvm->irq_srcu, idx);
> + }
> +}
> +
> +static void irqfd_resampler_mask_notify(struct kvm_irq_mask_notifier *kimn,
> + bool masked)
> +{
> + struct kvm_kernel_irqfd_resampler *resampler;
> + struct kvm *kvm;
> + struct kvm_kernel_irqfd *irqfd;
> + int idx;
> + bool notify;
> +
> + resampler = container_of(kimn,
> + struct kvm_kernel_irqfd_resampler, mask_notifier);
> + kvm = resampler->kvm;
> +
> + spin_lock(&resampler->lock);
> + notify = !masked && resampler->pending;
> + resampler->masked = masked;
> + resampler->pending = false;
> + spin_unlock(&resampler->lock);
> +
> + if (notify) {
> + idx = srcu_read_lock(&kvm->irq_srcu);
> +
> + list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
> + srcu_read_lock_held(&kvm->irq_srcu))
> + eventfd_signal(irqfd->resamplefd, 1);
> +
> + srcu_read_unlock(&kvm->irq_srcu, idx);
> + }
> }
>
> static void
> @@ -98,6 +138,8 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
> if (list_empty(&resampler->list)) {
> list_del(&resampler->link);
> kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
> + kvm_unregister_irq_mask_notifier(kvm, resampler->mask_notifier.irq,
> + &resampler->mask_notifier);
> kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
> resampler->notifier.gsi, 0, false);
> kfree(resampler);
> @@ -367,9 +409,13 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> INIT_LIST_HEAD(&resampler->list);
> resampler->notifier.gsi = irqfd->gsi;
> resampler->notifier.irq_acked = irqfd_resampler_ack;
> + resampler->mask_notifier.func = irqfd_resampler_mask_notify;
> + spin_lock_init(&resampler->lock);
> INIT_LIST_HEAD(&resampler->link);
>
> list_add(&resampler->link, &kvm->irqfds.resampler_list);
> + kvm_register_and_fire_irq_mask_notifier(kvm, irqfd->gsi,
> + &resampler->mask_notifier);
> kvm_register_irq_ack_notifier(kvm,
> &resampler->notifier);
> irqfd->resampler = resampler;
Adding Marc in CC
Thanks
Eric
On 8/5/22 21:39, Dmytro Maluka wrote:
> Since resampler irqfd is now using a mask notifier along with an ack
> notifier, rename resampler->notifier to resampler->ack_notifier for
> clarity.
>
> Signed-off-by: Dmytro Maluka <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Eric
> ---
> include/linux/kvm_irqfd.h | 2 +-
> virt/kvm/eventfd.c | 16 ++++++++--------
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
> index 01754a1abb9e..4df9e6bbd7db 100644
> --- a/include/linux/kvm_irqfd.h
> +++ b/include/linux/kvm_irqfd.h
> @@ -37,7 +37,7 @@ struct kvm_kernel_irqfd_resampler {
> * RCU list modified under kvm->irqfds.resampler_lock
> */
> struct list_head list;
> - struct kvm_irq_ack_notifier notifier;
> + struct kvm_irq_ack_notifier ack_notifier;
> struct kvm_irq_mask_notifier mask_notifier;
> bool masked;
> bool pending;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index f98dcce3959c..72de942dbb9c 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -70,11 +70,11 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
> bool notify = true;
>
> resampler = container_of(kian,
> - struct kvm_kernel_irqfd_resampler, notifier);
> + struct kvm_kernel_irqfd_resampler, ack_notifier);
> kvm = resampler->kvm;
>
> kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
> - resampler->notifier.gsi, 0, false);
> + resampler->ack_notifier.gsi, 0, false);
>
> spin_lock(&resampler->lock);
> if (resampler->masked) {
> @@ -137,11 +137,11 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>
> if (list_empty(&resampler->list)) {
> list_del(&resampler->link);
> - kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
> + kvm_unregister_irq_ack_notifier(kvm, &resampler->ack_notifier);
> kvm_unregister_irq_mask_notifier(kvm, resampler->mask_notifier.irq,
> &resampler->mask_notifier);
> kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
> - resampler->notifier.gsi, 0, false);
> + resampler->ack_notifier.gsi, 0, false);
> kfree(resampler);
> }
>
> @@ -390,7 +390,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>
> list_for_each_entry(resampler,
> &kvm->irqfds.resampler_list, link) {
> - if (resampler->notifier.gsi == irqfd->gsi) {
> + if (resampler->ack_notifier.gsi == irqfd->gsi) {
> irqfd->resampler = resampler;
> break;
> }
> @@ -407,8 +407,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>
> resampler->kvm = kvm;
> INIT_LIST_HEAD(&resampler->list);
> - resampler->notifier.gsi = irqfd->gsi;
> - resampler->notifier.irq_acked = irqfd_resampler_ack;
> + resampler->ack_notifier.gsi = irqfd->gsi;
> + resampler->ack_notifier.irq_acked = irqfd_resampler_ack;
> resampler->mask_notifier.func = irqfd_resampler_mask_notify;
> spin_lock_init(&resampler->lock);
> INIT_LIST_HEAD(&resampler->link);
> @@ -417,7 +417,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> kvm_register_and_fire_irq_mask_notifier(kvm, irqfd->gsi,
> &resampler->mask_notifier);
> kvm_register_irq_ack_notifier(kvm,
> - &resampler->notifier);
> + &resampler->ack_notifier);
> irqfd->resampler = resampler;
> }
>
Hi Dmytro,
On 8/5/22 21:39, Dmytro Maluka wrote:
> Currently irq mask notifiers are used only internally in the x86 code
> for PIT emulation. However they are not really arch specific. We are
> going to use them in the generic irqfd code, for postponing resampler
> irqfd notification until the interrupt is unmasked. So move the
> implementation of mask notifiers to the generic code, to allow irqfd to
> register its mask notifiers.
>
> Note that calling mask notifiers via calling kvm_fire_mask_notifiers()
> is still implemented for x86 only, so registering mask notifiers on
> other architectures will have no effect for now.
>
> Signed-off-by: Dmytro Maluka <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Eric
> ---
> arch/x86/include/asm/kvm_host.h | 16 ----------------
> arch/x86/kvm/irq_comm.c | 33 ---------------------------------
> arch/x86/kvm/x86.c | 1 -
> include/linux/kvm_host.h | 15 +++++++++++++++
> virt/kvm/eventfd.c | 33 +++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 1 +
> 6 files changed, 49 insertions(+), 50 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9217bd6cf0d1..dc76617f11c1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1198,9 +1198,6 @@ struct kvm_arch {
>
> struct kvm_xen_hvm_config xen_hvm_config;
>
> - /* reads protected by irq_srcu, writes by irq_lock */
> - struct hlist_head mask_notifier_list;
> -
> struct kvm_hv hyperv;
> struct kvm_xen xen;
>
> @@ -1688,19 +1685,6 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
> int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
> const void *val, int bytes);
>
> -struct kvm_irq_mask_notifier {
> - void (*func)(struct kvm_irq_mask_notifier *kimn, bool masked);
> - int irq;
> - struct hlist_node link;
> -};
> -
> -void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> - struct kvm_irq_mask_notifier *kimn);
> -void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
> - struct kvm_irq_mask_notifier *kimn);
> -void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> - bool mask);
> -
> extern bool tdp_enabled;
>
> u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 0687162c4f22..f27e4c9c403e 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -234,39 +234,6 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
> mutex_unlock(&kvm->irq_lock);
> }
>
> -void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> - struct kvm_irq_mask_notifier *kimn)
> -{
> - mutex_lock(&kvm->irq_lock);
> - kimn->irq = irq;
> - hlist_add_head_rcu(&kimn->link, &kvm->arch.mask_notifier_list);
> - mutex_unlock(&kvm->irq_lock);
> -}
> -
> -void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
> - struct kvm_irq_mask_notifier *kimn)
> -{
> - mutex_lock(&kvm->irq_lock);
> - hlist_del_rcu(&kimn->link);
> - mutex_unlock(&kvm->irq_lock);
> - synchronize_srcu(&kvm->irq_srcu);
> -}
> -
> -void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> - bool mask)
> -{
> - struct kvm_irq_mask_notifier *kimn;
> - int idx, gsi;
> -
> - idx = srcu_read_lock(&kvm->irq_srcu);
> - gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
> - if (gsi != -1)
> - hlist_for_each_entry_rcu(kimn, &kvm->arch.mask_notifier_list, link)
> - if (kimn->irq == gsi)
> - kimn->func(kimn, mask);
> - srcu_read_unlock(&kvm->irq_srcu, idx);
> -}
> -
> bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
> {
> return irqchip_in_kernel(kvm);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e5fa335a4ea7..a0a776f5c42f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11818,7 +11818,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> if (ret)
> goto out_page_track;
>
> - INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
> INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
> atomic_set(&kvm->arch.noncoherent_dma_count, 0);
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 90a45ef7203b..dd5f14e31996 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -760,7 +760,10 @@ struct kvm {
> struct kvm_irq_routing_table __rcu *irq_routing;
> #endif
> #ifdef CONFIG_HAVE_KVM_IRQFD
> + /* reads protected by irq_srcu, writes by irq_lock */
> struct hlist_head irq_ack_notifier_list;
> + /* reads protected by irq_srcu, writes by irq_lock */
> + struct hlist_head irq_mask_notifier_list;
> #endif
>
> #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> @@ -1581,6 +1584,12 @@ struct kvm_irq_ack_notifier {
> void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
> };
>
> +struct kvm_irq_mask_notifier {
> + void (*func)(struct kvm_irq_mask_notifier *kimn, bool masked);
> + int irq;
> + struct hlist_node link;
> +};
> +
> int kvm_irq_map_gsi(struct kvm *kvm,
> struct kvm_kernel_irq_routing_entry *entries, int gsi);
> int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, unsigned pin);
> @@ -1599,6 +1608,12 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
> struct kvm_irq_ack_notifier *kian);
> void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> struct kvm_irq_ack_notifier *kian);
> +void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn);
> +void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn);
> +void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> + bool mask);
> int kvm_request_irq_source_id(struct kvm *kvm);
> void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 2a3ed401ce46..39403d9fbdcc 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -518,6 +518,39 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> synchronize_srcu(&kvm->irq_srcu);
> kvm_arch_post_irq_ack_notifier_list_update(kvm);
> }
> +
> +void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn)
> +{
> + mutex_lock(&kvm->irq_lock);
> + kimn->irq = irq;
> + hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
> + mutex_unlock(&kvm->irq_lock);
> +}
> +
> +void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn)
> +{
> + mutex_lock(&kvm->irq_lock);
> + hlist_del_rcu(&kimn->link);
> + mutex_unlock(&kvm->irq_lock);
> + synchronize_srcu(&kvm->irq_srcu);
> +}
> +
> +void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> + bool mask)
> +{
> + struct kvm_irq_mask_notifier *kimn;
> + int idx, gsi;
> +
> + idx = srcu_read_lock(&kvm->irq_srcu);
> + gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
> + if (gsi != -1)
> + hlist_for_each_entry_rcu(kimn, &kvm->irq_mask_notifier_list, link)
> + if (kimn->irq == gsi)
> + kimn->func(kimn, mask);
> + srcu_read_unlock(&kvm->irq_srcu, idx);
> +}
> #endif
>
> void
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a49df8988cd6..5ca7fb0b8257 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1144,6 +1144,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>
> #ifdef CONFIG_HAVE_KVM_IRQFD
> INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
> + INIT_HLIST_HEAD(&kvm->irq_mask_notifier_list);
> #endif
>
> r = kvm_init_mmu_notifier(kvm);
Hi Dmytro,
On 8/5/22 21:39, Dmytro Maluka wrote:
> In order to implement postponing resamplefd notification until an
> interrupt is unmasked, we need not only to track changes of the
> interrupt mask state (which is already possible with
> kvm_register_irq_mask_notifier()) but also to know its initial
> mask state before any mask notifier has fired.
>
> Moreover, we need to do this initial check of the IRQ mask state in a
> race-free way, to ensure that we will not miss any further mask or
> unmask events after we check the initial mask state.
>
> So implement kvm_register_and_fire_irq_mask_notifier() which atomically
> registers an IRQ mask notifier and calls it with the current mask value
> of the IRQ. It does that using the same locking order as when calling
> notifier normally via kvm_fire_mask_notifiers(), to prevent deadlocks.
>
> Its implementation needs to be arch-specific since it relies on
> arch-specific synchronization (e.g. ioapic->lock and pic->lock on x86,
> or a per-IRQ lock on ARM vGIC) for serializing our initial reading of
> the IRQ mask state with a pending change of this mask state.
>
> For now implement it for x86 only, and for other archs add a weak dummy
> implementation which doesn't really call the notifier (as other archs
> don't currently implement calling notifiers normally via
> kvm_fire_mask_notifiers() either, i.e. registering mask notifiers has no
> effect on those archs anyway).
>
> Signed-off-by: Dmytro Maluka <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]/
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/i8259.c | 6 ++++
> arch/x86/kvm/ioapic.c | 6 ++++
> arch/x86/kvm/ioapic.h | 1 +
> arch/x86/kvm/irq_comm.c | 57 +++++++++++++++++++++++++++++++++
> include/linux/kvm_host.h | 4 +++
> virt/kvm/eventfd.c | 31 ++++++++++++++++--
> 7 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dc76617f11c1..cf0571ed2968 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1834,6 +1834,7 @@ static inline int __kvm_irq_line_state(unsigned long *irq_state,
>
> int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level);
> void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
> +bool kvm_pic_irq_is_masked(struct kvm_pic *s, int irq);
>
> void kvm_inject_nmi(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index e1bb6218bb96..1eb3127f6047 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -211,6 +211,12 @@ void kvm_pic_clear_all(struct kvm_pic *s, int irq_source_id)
> pic_unlock(s);
> }
>
> +/* Called with s->lock held. */
> +bool kvm_pic_irq_is_masked(struct kvm_pic *s, int irq)
> +{
> + return !!(s->pics[irq >> 3].imr & (1 << irq));
> +}
> +
> /*
> * acknowledge interrupt 'irq'
> */
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 765943d7cfa5..fab11de1f885 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -478,6 +478,12 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int irq_source_id)
> spin_unlock(&ioapic->lock);
> }
>
> +/* Called with ioapic->lock held. */
> +bool kvm_ioapic_irq_is_masked(struct kvm_ioapic *ioapic, int irq)
> +{
> + return !!ioapic->redirtbl[irq].fields.mask;
> +}
> +
> static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
> {
> int i;
> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
> index 539333ac4b38..fe1f51319992 100644
> --- a/arch/x86/kvm/ioapic.h
> +++ b/arch/x86/kvm/ioapic.h
> @@ -114,6 +114,7 @@ void kvm_ioapic_destroy(struct kvm *kvm);
> int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id,
> int level, bool line_status);
> void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int irq_source_id);
> +bool kvm_ioapic_irq_is_masked(struct kvm_ioapic *ioapic, int irq);
> void kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> void kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index f27e4c9c403e..4bd4218821a2 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -234,6 +234,63 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
> mutex_unlock(&kvm->irq_lock);
> }
>
> +void kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn)
> +{
> + struct kvm_pic *pic = kvm->arch.vpic;
> + struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> + struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
> + struct kvm_kernel_irq_routing_entry *pic_e = NULL, *ioapic_e = NULL;
> + int idx, i, n;
> + bool masked;
> +
> + mutex_lock(&kvm->irq_lock);
> +
> + /*
> + * Not possible to detect if the guest uses the PIC or the
> + * IOAPIC. So assume the interrupt to be unmasked iff it is
> + * unmasked in at least one of both.
> + */
> + idx = srcu_read_lock(&kvm->irq_srcu);
> + n = kvm_irq_map_gsi(kvm, entries, irq);
> + srcu_read_unlock(&kvm->irq_srcu, idx);
> +
> + for (i = 0; i < n; i++) {
> + if (entries[i].type != KVM_IRQ_ROUTING_IRQCHIP)
> + continue;
> +
> + switch (entries[i].irqchip.irqchip) {
> + case KVM_IRQCHIP_PIC_MASTER:
> + case KVM_IRQCHIP_PIC_SLAVE:
> + pic_e = &entries[i];
> + break;
> + case KVM_IRQCHIP_IOAPIC:
> + ioapic_e = &entries[i];
> + break;
> + default:
> + break;
> + }
> + }
> +
> + if (pic_e)
> + spin_lock(&pic->lock);
> + if (ioapic_e)
> + spin_lock(&ioapic->lock);
> +
> + __kvm_register_irq_mask_notifier(kvm, irq, kimn);
> +
> + masked = (!pic_e || kvm_pic_irq_is_masked(pic, pic_e->irqchip.pin)) &&
> + (!ioapic_e || kvm_ioapic_irq_is_masked(ioapic, ioapic_e->irqchip.pin));
Looks a bit cryptic to me. Don't you want pic_e && masked on pic ||
ioapic_e && masked on ioapic?
> + kimn->func(kimn, masked);
> +
> + if (ioapic_e)
> + spin_unlock(&ioapic->lock);
> + if (pic_e)
> + spin_unlock(&pic->lock);
> +
> + mutex_unlock(&kvm->irq_lock);
> +}
> +
> bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
> {
> return irqchip_in_kernel(kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index dd5f14e31996..55233eb18eb4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1608,8 +1608,12 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
> struct kvm_irq_ack_notifier *kian);
> void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> struct kvm_irq_ack_notifier *kian);
> +void __kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn);
> void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> struct kvm_irq_mask_notifier *kimn);
> +void kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn);
> void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
> struct kvm_irq_mask_notifier *kimn);
> void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 39403d9fbdcc..3007d956b626 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -519,15 +519,42 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> kvm_arch_post_irq_ack_notifier_list_update(kvm);
> }
>
> +void __kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn)
> +{
> + kimn->irq = irq;
> + hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
> +}
> +
> void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
> struct kvm_irq_mask_notifier *kimn)
> {
> mutex_lock(&kvm->irq_lock);
> - kimn->irq = irq;
> - hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
> + __kvm_register_irq_mask_notifier(kvm, irq, kimn);
> mutex_unlock(&kvm->irq_lock);
> }
>
> +/*
> + * kvm_register_and_fire_irq_mask_notifier() registers the notifier and
> + * immediately calls it with the current mask value of the IRQ. It does
> + * that atomically, so that we will find out the initial mask state of
> + * the IRQ and will not miss any further mask or unmask events. It does
> + * that using the same locking order as when calling notifier normally
> + * via kvm_fire_mask_notifiers(), to prevent deadlocks.
you may document somewhere that it must be called before
kvm_register_irq_ack_notifier()
> + *
> + * Implementation is arch-specific since it relies on arch-specific
> + * (irqchip-specific) synchronization. Below is a weak dummy
> + * implementation for archs not implementing it yet, as those archs
> + * don't implement calling notifiers normally via
> + * kvm_fire_mask_notifiers() either, i.e. registering mask notifiers
> + * has no effect on those archs anyway.
I would advise you to put Marc in the loop for the whole series (adding
him in CC).
Thanks
Eric
> + */
> +void __weak kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
> + struct kvm_irq_mask_notifier *kimn)
> +{
> + kvm_register_irq_mask_notifier(kvm, irq, kimn);
> +}
> +
> void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
> struct kvm_irq_mask_notifier *kimn)
> {
On 8/9/22 10:01 PM, Dong, Eddie wrote:
>
>
>> -----Original Message-----
>> From: Dmytro Maluka <[email protected]>
>> Sent: Tuesday, August 9, 2022 12:24 AM
>> To: Dong, Eddie <[email protected]>; Christopherson,, Sean
>> <[email protected]>; Paolo Bonzini <[email protected]>;
>> [email protected]
>> Cc: Thomas Gleixner <[email protected]>; Ingo Molnar <[email protected]>;
>> Borislav Petkov <[email protected]>; Dave Hansen <[email protected]>;
>> [email protected]; H. Peter Anvin <[email protected]>; linux-
>> [email protected]; Eric Auger <[email protected]>; Alex
>> Williamson <[email protected]>; Liu, Rong L <[email protected]>;
>> Zhenyu Wang <[email protected]>; Tomasz Nowicki
>> <[email protected]>; Grzegorz Jaszczyk <[email protected]>;
>> [email protected]; Dmitry Torokhov <[email protected]>
>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>
>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>
>>>> The existing KVM mechanism for forwarding of level-triggered
>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>> existing KVM code doesn't take that into account, which results in
>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>> unacknowledged IRQ by the host.
>>>
>>> Interesting... How does it behave on the native side?
>>
>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>> completes.
>>
>> In handle_fasteoi_irq():
>>
>> if (desc->istate & IRQS_ONESHOT)
>> mask_irq(desc);
>>
>> handle_irq_event(desc);
>>
>> cond_unmask_eoi_irq(desc, chip);
>>
>>
>> and later in unmask_threaded_irq():
>>
>> unmask_irq(desc);
>>
>> I also mentioned that in patch #3 description:
>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>> prevent the EOI from re-asserting an unacknowledged interrupt.
>
> That makes sense. Can you include the full story in the cover letter too?
Ok, I will.
>
>
>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>> check that the interrupt is still masked in the guest at the moment of EOI.
>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>> and queued for injection to the guest."
>>
>
> Emulation of level triggered IRQ is a pain point ☹
> I read we need to emulate the "level" of the IRQ pin (connecting from device to IRQchip, i.e. ioapic here).
> Technically, the guest can change the polarity of vIOAPIC, which will lead to a new virtual IRQ
> even w/o host side interrupt.
Thanks, interesting point. Do you mean that this behavior (a new vIRQ as
a result of polarity change) may already happen with the existing KVM code?
It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC
polarity bit, in particular it doesn't handle change of the polarity by
the guest (i.e. doesn't update the virtual IRR register, and so on), so
it shouldn't result in a new interrupt.
Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there
seems to be an assumption that KVM interprets the IRQ level value as
active (asserted) vs inactive (deasserted) rather than high vs low, i.e.
the polarity doesn't matter to KVM.
So, since both sides (KVM emulating the IOAPIC, and vfio/whatever
emulating an external interrupt source) seem to operate on a level of
abstraction of "asserted" vs "de-asserted" interrupt state regardless of
the polarity, and that seems not a bug but a feature, it seems that we
don't need to emulate the IRQ level as such. Or am I missing something?
OTOH, I guess this means that the existing KVM's emulation of
level-triggered interrupts is somewhat limited (a guest may legitimately
expect an interrupt fired as a result of polarity change, and that case
is not supported by KVM). But that is rather out of scope of the oneshot
interrupts issue addressed by this patchset.
> The "pending" field of kvm_kernel_irqfd_resampler in patch 3 denotes an event rather than an interrupt level.
>
>
>>>
>>>>
>>>> This patch series fixes this issue (for now on x86 only) by checking
>>>> if the interrupt is unmasked when we receive irq ack (EOI) and, in
>>>> case if it's masked, postponing resamplefd notify until the guest unmasks it.
>>>>
>>>> Patches 1 and 2 extend the existing support for irq mask notifiers in
>>>> KVM, which is a prerequisite needed for KVM irqfd to use mask
>>>> notifiers to know when an interrupt is masked or unmasked.
>>>>
>>>> Patch 3 implements the actual fix: postponing resamplefd notify in
>>>> irqfd until the irq is unmasked.
>>>>
>>>> Patches 4 and 5 just do some optional renaming for consistency, as we
>>>> are now using irq mask notifiers in irqfd along with irq ack notifiers.
>>>>
>>>> Please see individual patches for more details.
>>>>
>>>> v2:
>>>> - Fixed compilation failure on non-x86: mask_notifier_list moved from
>>>> x86 "struct kvm_arch" to generic "struct kvm".
>>>> - kvm_fire_mask_notifiers() also moved from x86 to generic code, even
>>>> though it is not called on other architectures for now.
>>>> - Instead of kvm_irq_is_masked() implemented
>>>> kvm_register_and_fire_irq_mask_notifier() to fix potential race
>>>> when reading the initial IRQ mask state.
>>>> - Renamed for clarity:
>>>> - irqfd_resampler_mask() -> irqfd_resampler_mask_notify()
>>>> - kvm_irq_has_notifier() -> kvm_irq_has_ack_notifier()
>>>> - resampler->notifier -> resampler->ack_notifier
>>>> - Reorganized code in irqfd_resampler_ack() and
>>>> irqfd_resampler_mask_notify() to make it easier to follow.
>>>> - Don't follow unwanted "return type on separate line" style for
>>>> irqfd_resampler_mask_notify().
>>>>
>>>> Dmytro Maluka (5):
>>>> KVM: x86: Move irq mask notifiers from x86 to generic KVM
>>>> KVM: x86: Add kvm_register_and_fire_irq_mask_notifier()
>>>> KVM: irqfd: Postpone resamplefd notify for oneshot interrupts
>>>> KVM: irqfd: Rename resampler->notifier
>>>> KVM: Rename kvm_irq_has_notifier()
>>>>
>>>> arch/x86/include/asm/kvm_host.h | 17 +---
>>>> arch/x86/kvm/i8259.c | 6 ++
>>>> arch/x86/kvm/ioapic.c | 8 +-
>>>> arch/x86/kvm/ioapic.h | 1 +
>>>> arch/x86/kvm/irq_comm.c | 74 +++++++++++------
>>>> arch/x86/kvm/x86.c | 1 -
>>>> include/linux/kvm_host.h | 21 ++++-
>>>> include/linux/kvm_irqfd.h | 16 +++-
>>>> virt/kvm/eventfd.c | 136 ++++++++++++++++++++++++++++----
>>>> virt/kvm/kvm_main.c | 1 +
>>>> 10 files changed, 221 insertions(+), 60 deletions(-)
>>>>
>>>> --
>>>> 2.37.1.559.g78731f0fdb-goog
>>>
On 8/9/22 10:45 PM, Eric Auger wrote:
> Hi Dmytro,
>
> On 8/5/22 21:39, Dmytro Maluka wrote:
>> The existing KVM mechanism for forwarding of level-triggered interrupts
>> using resample eventfd doesn't work quite correctly in the case of
>> interrupts that are handled in a Linux guest as oneshot interrupts
>> (IRQF_ONESHOT). Such an interrupt is acked to the device in its
>> threaded irq handler, i.e. later than it is acked to the interrupt
>> controller (EOI at the end of hardirq), not earlier.
>>
>> Linux keeps such interrupt masked until its threaded handler finishes,
>> to prevent the EOI from re-asserting an unacknowledged interrupt.
>> However, with KVM + vfio (or whatever is listening on the resamplefd)
>> we don't check that the interrupt is still masked in the guest at the
>> moment of EOI. Resamplefd is notified regardless, so vfio prematurely
>> unmasks the host physical IRQ, thus a new (unwanted) physical interrupt
>> is generated in the host and queued for injection to the guest.
>>
>> The fact that the virtual IRQ is still masked doesn't prevent this new
>> physical IRQ from being propagated to the guest, because:
>>
>> 1. It is not guaranteed that the vIRQ will remain masked by the time
>> when vfio signals the trigger eventfd.
>> 2. KVM marks this IRQ as pending (e.g. setting its bit in the virtual
>> IRR register of IOAPIC on x86), so after the vIRQ is unmasked, this
>> new pending interrupt is injected by KVM to the guest anyway.
>>
>> There are observed at least 2 user-visible issues caused by those
>> extra erroneous pending interrupts for oneshot irq in the guest:
>>
>> 1. System suspend aborted due to a pending wakeup interrupt from
>> ChromeOS EC (drivers/platform/chrome/cros_ec.c).
>> 2. Annoying "invalid report id data" errors from ELAN0000 touchpad
>> (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
>> every time the touchpad is touched.
>>
>> This patch fixes the issue on x86 by checking if the interrupt is
>> unmasked when we receive irq ack (EOI) and, in case if it's masked,
>> postponing resamplefd notify until the guest unmasks it.
>>
>> It doesn't fix the issue for other archs yet, since it relies on KVM
>> irq mask notifiers functionality which currently works only on x86.
>> On other archs we can register mask notifiers but they are never called.
>> So on other archs resampler->masked is always false, so the behavior is
>> the same as before this patch.
>>
>> Link: https://lore.kernel.org/kvm/[email protected]/
>> Suggested-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Dmytro Maluka <[email protected]>
>> ---
>> include/linux/kvm_irqfd.h | 14 ++++++++++
>> virt/kvm/eventfd.c | 56 +++++++++++++++++++++++++++++++++++----
>> 2 files changed, 65 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
>> index dac047abdba7..01754a1abb9e 100644
>> --- a/include/linux/kvm_irqfd.h
>> +++ b/include/linux/kvm_irqfd.h
>> @@ -19,6 +19,16 @@
>> * resamplefd. All resamplers on the same gsi are de-asserted
>> * together, so we don't need to track the state of each individual
>> * user. We can also therefore share the same irq source ID.
>> + *
>> + * A special case is when the interrupt is still masked at the moment
>> + * an irq ack is received. That likely means that the interrupt has
>> + * been acknowledged to the interrupt controller but not acknowledged
>> + * to the device yet, e.g. it might be a Linux guest's threaded
>> + * oneshot interrupt (IRQF_ONESHOT). In this case notifying through
>> + * resamplefd is postponed until the guest unmasks the interrupt,
>> + * which is detected through the irq mask notifier. This prevents
>> + * erroneous extra interrupts caused by premature re-assert of an
>> + * unacknowledged interrupt by the resamplefd listener.
>> */
>> struct kvm_kernel_irqfd_resampler {
>> struct kvm *kvm;
>> @@ -28,6 +38,10 @@ struct kvm_kernel_irqfd_resampler {
>> */
>> struct list_head list;
>> struct kvm_irq_ack_notifier notifier;
>> + struct kvm_irq_mask_notifier mask_notifier;
>> + bool masked;
>> + bool pending;
>> + spinlock_t lock;
>> /*
>> * Entry in list of kvm->irqfd.resampler_list. Use for sharing
>> * resamplers among irqfds on the same gsi.
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 3007d956b626..f98dcce3959c 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -67,6 +67,7 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
>> struct kvm *kvm;
>> struct kvm_kernel_irqfd *irqfd;
>> int idx;
>> + bool notify = true;
>>
>> resampler = container_of(kian,
>> struct kvm_kernel_irqfd_resampler, notifier);
>> @@ -75,13 +76,52 @@ irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
>> kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
>> resampler->notifier.gsi, 0, false);
>>
>> - idx = srcu_read_lock(&kvm->irq_srcu);
>> + spin_lock(&resampler->lock);
>> + if (resampler->masked) {
>> + notify = false;
>> + resampler->pending = true;
>> + }
>> + spin_unlock(&resampler->lock);
>> +
>> + if (notify) {
>> + idx = srcu_read_lock(&kvm->irq_srcu);
>>
>> - list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
>> - srcu_read_lock_held(&kvm->irq_srcu))
>> - eventfd_signal(irqfd->resamplefd, 1);
>> + list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
>> + srcu_read_lock_held(&kvm->irq_srcu))
>> + eventfd_signal(irqfd->resamplefd, 1);
> nit: you may introduce a helper for above code as the code is duplicated.
Ack.
>>
>> - srcu_read_unlock(&kvm->irq_srcu, idx);
>> + srcu_read_unlock(&kvm->irq_srcu, idx);
>> + }
>> +}
>> +
>> +static void irqfd_resampler_mask_notify(struct kvm_irq_mask_notifier *kimn,
>> + bool masked)
>> +{
>> + struct kvm_kernel_irqfd_resampler *resampler;
>> + struct kvm *kvm;
>> + struct kvm_kernel_irqfd *irqfd;
>> + int idx;
>> + bool notify;
>> +
>> + resampler = container_of(kimn,
>> + struct kvm_kernel_irqfd_resampler, mask_notifier);
>> + kvm = resampler->kvm;
>> +
>> + spin_lock(&resampler->lock);
>> + notify = !masked && resampler->pending;
>> + resampler->masked = masked;
>> + resampler->pending = false;
>> + spin_unlock(&resampler->lock);
>> +
>> + if (notify) {
>> + idx = srcu_read_lock(&kvm->irq_srcu);
>> +
>> + list_for_each_entry_srcu(irqfd, &resampler->list, resampler_link,
>> + srcu_read_lock_held(&kvm->irq_srcu))
>> + eventfd_signal(irqfd->resamplefd, 1);
>> +
>> + srcu_read_unlock(&kvm->irq_srcu, idx);
>> + }
>> }
>>
>> static void
>> @@ -98,6 +138,8 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>> if (list_empty(&resampler->list)) {
>> list_del(&resampler->link);
>> kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
>> + kvm_unregister_irq_mask_notifier(kvm, resampler->mask_notifier.irq,
>> + &resampler->mask_notifier);
>> kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
>> resampler->notifier.gsi, 0, false);
>> kfree(resampler);
>> @@ -367,9 +409,13 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>> INIT_LIST_HEAD(&resampler->list);
>> resampler->notifier.gsi = irqfd->gsi;
>> resampler->notifier.irq_acked = irqfd_resampler_ack;
>> + resampler->mask_notifier.func = irqfd_resampler_mask_notify;
>> + spin_lock_init(&resampler->lock);
>> INIT_LIST_HEAD(&resampler->link);
>>
>> list_add(&resampler->link, &kvm->irqfds.resampler_list);
>> + kvm_register_and_fire_irq_mask_notifier(kvm, irqfd->gsi,
>> + &resampler->mask_notifier);
>> kvm_register_irq_ack_notifier(kvm,
>> &resampler->notifier);
>> irqfd->resampler = resampler;
> Adding Marc in CC
>
> Thanks
>
> Eric
>
Hi Eric,
On 8/9/22 10:43 PM, Eric Auger wrote:
> Hi Dmytro,
>
> On 8/5/22 21:39, Dmytro Maluka wrote:
>> In order to implement postponing resamplefd notification until an
>> interrupt is unmasked, we need not only to track changes of the
>> interrupt mask state (which is already possible with
>> kvm_register_irq_mask_notifier()) but also to know its initial
>> mask state before any mask notifier has fired.
>>
>> Moreover, we need to do this initial check of the IRQ mask state in a
>> race-free way, to ensure that we will not miss any further mask or
>> unmask events after we check the initial mask state.
>>
>> So implement kvm_register_and_fire_irq_mask_notifier() which atomically
>> registers an IRQ mask notifier and calls it with the current mask value
>> of the IRQ. It does that using the same locking order as when calling
>> notifier normally via kvm_fire_mask_notifiers(), to prevent deadlocks.
>>
>> Its implementation needs to be arch-specific since it relies on
>> arch-specific synchronization (e.g. ioapic->lock and pic->lock on x86,
>> or a per-IRQ lock on ARM vGIC) for serializing our initial reading of
>> the IRQ mask state with a pending change of this mask state.
>>
>> For now implement it for x86 only, and for other archs add a weak dummy
>> implementation which doesn't really call the notifier (as other archs
>> don't currently implement calling notifiers normally via
>> kvm_fire_mask_notifiers() either, i.e. registering mask notifiers has no
>> effect on those archs anyway).
>>
>> Signed-off-by: Dmytro Maluka <[email protected]>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/kvm/i8259.c | 6 ++++
>> arch/x86/kvm/ioapic.c | 6 ++++
>> arch/x86/kvm/ioapic.h | 1 +
>> arch/x86/kvm/irq_comm.c | 57 +++++++++++++++++++++++++++++++++
>> include/linux/kvm_host.h | 4 +++
>> virt/kvm/eventfd.c | 31 ++++++++++++++++--
>> 7 files changed, 104 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index dc76617f11c1..cf0571ed2968 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1834,6 +1834,7 @@ static inline int __kvm_irq_line_state(unsigned long *irq_state,
>>
>> int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int level);
>> void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
>> +bool kvm_pic_irq_is_masked(struct kvm_pic *s, int irq);
>>
>> void kvm_inject_nmi(struct kvm_vcpu *vcpu);
>>
>> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
>> index e1bb6218bb96..1eb3127f6047 100644
>> --- a/arch/x86/kvm/i8259.c
>> +++ b/arch/x86/kvm/i8259.c
>> @@ -211,6 +211,12 @@ void kvm_pic_clear_all(struct kvm_pic *s, int irq_source_id)
>> pic_unlock(s);
>> }
>>
>> +/* Called with s->lock held. */
>> +bool kvm_pic_irq_is_masked(struct kvm_pic *s, int irq)
>> +{
>> +	return !!(s->pics[irq >> 3].imr & (1 << (irq & 7)));
>> +}
>> +
>> /*
>> * acknowledge interrupt 'irq'
>> */
>> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>> index 765943d7cfa5..fab11de1f885 100644
>> --- a/arch/x86/kvm/ioapic.c
>> +++ b/arch/x86/kvm/ioapic.c
>> @@ -478,6 +478,12 @@ void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int irq_source_id)
>> spin_unlock(&ioapic->lock);
>> }
>>
>> +/* Called with ioapic->lock held. */
>> +bool kvm_ioapic_irq_is_masked(struct kvm_ioapic *ioapic, int irq)
>> +{
>> + return !!ioapic->redirtbl[irq].fields.mask;
>> +}
>> +
>> static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
>> {
>> int i;
>> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
>> index 539333ac4b38..fe1f51319992 100644
>> --- a/arch/x86/kvm/ioapic.h
>> +++ b/arch/x86/kvm/ioapic.h
>> @@ -114,6 +114,7 @@ void kvm_ioapic_destroy(struct kvm *kvm);
>> int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id,
>> int level, bool line_status);
>> void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int irq_source_id);
>> +bool kvm_ioapic_irq_is_masked(struct kvm_ioapic *ioapic, int irq);
>> void kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>> void kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>> void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
>> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
>> index f27e4c9c403e..4bd4218821a2 100644
>> --- a/arch/x86/kvm/irq_comm.c
>> +++ b/arch/x86/kvm/irq_comm.c
>> @@ -234,6 +234,63 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
>> mutex_unlock(&kvm->irq_lock);
>> }
>>
>> +void kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
>> + struct kvm_irq_mask_notifier *kimn)
>> +{
>> + struct kvm_pic *pic = kvm->arch.vpic;
>> + struct kvm_ioapic *ioapic = kvm->arch.vioapic;
>> + struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
>> + struct kvm_kernel_irq_routing_entry *pic_e = NULL, *ioapic_e = NULL;
>> + int idx, i, n;
>> + bool masked;
>> +
>> + mutex_lock(&kvm->irq_lock);
>> +
>> + /*
>> + * Not possible to detect if the guest uses the PIC or the
>> + * IOAPIC. So assume the interrupt to be unmasked iff it is
>> + * unmasked in at least one of both.
>> + */
>> + idx = srcu_read_lock(&kvm->irq_srcu);
>> + n = kvm_irq_map_gsi(kvm, entries, irq);
>> + srcu_read_unlock(&kvm->irq_srcu, idx);
>> +
>> + for (i = 0; i < n; i++) {
>> + if (entries[i].type != KVM_IRQ_ROUTING_IRQCHIP)
>> + continue;
>> +
>> + switch (entries[i].irqchip.irqchip) {
>> + case KVM_IRQCHIP_PIC_MASTER:
>> + case KVM_IRQCHIP_PIC_SLAVE:
>> + pic_e = &entries[i];
>> + break;
>> + case KVM_IRQCHIP_IOAPIC:
>> + ioapic_e = &entries[i];
>> + break;
>> + default:
>> + break;
>> + }
>> + }
>> +
>> + if (pic_e)
>> + spin_lock(&pic->lock);
>> + if (ioapic_e)
>> + spin_lock(&ioapic->lock);
>> +
>> + __kvm_register_irq_mask_notifier(kvm, irq, kimn);
>> +
>> + masked = (!pic_e || kvm_pic_irq_is_masked(pic, pic_e->irqchip.pin)) &&
>> + (!ioapic_e || kvm_ioapic_irq_is_masked(ioapic, ioapic_e->irqchip.pin));
> Looks a bit cryptic to me. Don't you want pic_e && masked on pic ||
> ioapic_e && masked on ioapic?
That would be quite different: it would mean "masked on at least one of
the two", while I want "masked on both (if both are in use)".
>
>> + kimn->func(kimn, masked);
>> +
>> + if (ioapic_e)
>> + spin_unlock(&ioapic->lock);
>> + if (pic_e)
>> + spin_unlock(&pic->lock);
>> +
>> + mutex_unlock(&kvm->irq_lock);
>> +}
>> +
>> bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
>> {
>> return irqchip_in_kernel(kvm);
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index dd5f14e31996..55233eb18eb4 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1608,8 +1608,12 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
>> struct kvm_irq_ack_notifier *kian);
>> void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>> struct kvm_irq_ack_notifier *kian);
>> +void __kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
>> + struct kvm_irq_mask_notifier *kimn);
>> void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
>> struct kvm_irq_mask_notifier *kimn);
>> +void kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
>> + struct kvm_irq_mask_notifier *kimn);
>> void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
>> struct kvm_irq_mask_notifier *kimn);
>> void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 39403d9fbdcc..3007d956b626 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -519,15 +519,42 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>> kvm_arch_post_irq_ack_notifier_list_update(kvm);
>> }
>>
>> +void __kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
>> + struct kvm_irq_mask_notifier *kimn)
>> +{
>> + kimn->irq = irq;
>> + hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
>> +}
>> +
>> void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
>> struct kvm_irq_mask_notifier *kimn)
>> {
>> mutex_lock(&kvm->irq_lock);
>> - kimn->irq = irq;
>> - hlist_add_head_rcu(&kimn->link, &kvm->irq_mask_notifier_list);
>> + __kvm_register_irq_mask_notifier(kvm, irq, kimn);
>> mutex_unlock(&kvm->irq_lock);
>> }
>>
>> +/*
>> + * kvm_register_and_fire_irq_mask_notifier() registers the notifier and
>> + * immediately calls it with the current mask value of the IRQ. It does
>> + * that atomically, so that we will find out the initial mask state of
>> + * the IRQ and will not miss any further mask or unmask events. It does
>> + * that using the same locking order as when calling notifier normally
>> + * via kvm_fire_mask_notifiers(), to prevent deadlocks.
> you may document somewhere that it must be called before
>
> kvm_register_irq_ack_notifier()
Actually I think it would still be ok to call it after
kvm_register_irq_ack_notifier(), not necessarily before. We could then
miss a mask notification between kvm_register_irq_ack_notifier() and
kvm_register_and_fire_irq_mask_notifier(), but it's ok since
kvm_register_and_fire_irq_mask_notifier() would then immediately send a
new notification with the up-to-date mask value.
>
>> + *
>> + * Implementation is arch-specific since it relies on arch-specific
>> + * (irqchip-specific) synchronization. Below is a weak dummy
>> + * implementation for archs not implementing it yet, as those archs
>> + * don't implement calling notifiers normally via
>> + * kvm_fire_mask_notifiers() either, i.e. registering mask notifiers
>> + * has no effect on those archs anyway.
> I would advise you to put Marc in the loop for the whole series (adding
> him in CC).
Ok.
>
> Thanks
>
> Eric
>> + */
>> +void __weak kvm_register_and_fire_irq_mask_notifier(struct kvm *kvm, int irq,
>> + struct kvm_irq_mask_notifier *kimn)
>> +{
>> + kvm_register_irq_mask_notifier(kvm, irq, kimn);
>> +}
>> +
>> void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
>> struct kvm_irq_mask_notifier *kimn)
>> {
>
On Wed, 10 Aug 2022 00:30:29 +0100,
Dmytro Maluka <[email protected]> wrote:
>
> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> >
> >
> >> -----Original Message-----
> >> From: Dmytro Maluka <[email protected]>
> >> Sent: Tuesday, August 9, 2022 12:24 AM
> >> To: Dong, Eddie <[email protected]>; Christopherson,, Sean
> >> <[email protected]>; Paolo Bonzini <[email protected]>;
> >> [email protected]
> >> Cc: Thomas Gleixner <[email protected]>; Ingo Molnar <[email protected]>;
> >> Borislav Petkov <[email protected]>; Dave Hansen <[email protected]>;
> >> [email protected]; H. Peter Anvin <[email protected]>; linux-
> >> [email protected]; Eric Auger <[email protected]>; Alex
> >> Williamson <[email protected]>; Liu, Rong L <[email protected]>;
> >> Zhenyu Wang <[email protected]>; Tomasz Nowicki
> >> <[email protected]>; Grzegorz Jaszczyk <[email protected]>;
> >> [email protected]; Dmitry Torokhov <[email protected]>
> >> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>
> >> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>
> >>>> The existing KVM mechanism for forwarding of level-triggered
> >>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>> existing KVM code doesn't take that into account, which results in
> >>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >> unacknowledged IRQ by the host.
> >>>
> >>> Interesting... How does it behave on the native side?
> >>
> >> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >> completes.
> >>
> >> In handle_fasteoi_irq():
> >>
> >> if (desc->istate & IRQS_ONESHOT)
> >> mask_irq(desc);
> >>
> >> handle_irq_event(desc);
> >>
> >> cond_unmask_eoi_irq(desc, chip);
> >>
> >>
> >> and later in unmask_threaded_irq():
> >>
> >> unmask_irq(desc);
> >>
> >> I also mentioned that in patch #3 description:
> >> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >> prevent the EOI from re-asserting an unacknowledged interrupt.
> >
> > That makes sense. Can you include the full story in cover letter too?
>
> Ok, I will.
>
> >
> >
> >> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >> check that the interrupt is still masked in the guest at the moment of EOI.
> >> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >> and queued for injection to the guest."
Sorry to barge in pretty late in the conversation (just been Cc'd on
this), but why shouldn't the resamplefd be notified? If there has been
an EOI, a new level must be made visible to the guest interrupt
controller, no matter what the state of the interrupt masking is.
Whether this new level is actually *presented* to a vCPU is another
matter entirely, and is arguably a problem for the interrupt
controller emulation.
For example on arm64, we expect to be able to read the pending state
of an interrupt from the guest irrespective of the masking state of
that interrupt. Any change to the interrupt flow should preserve this.
Thankfully, we don't have the polarity issue (there is no such thing
in the GIC architecture) and we only deal with pending/not-pending.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
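[Editor's note: the failure mode discussed above can be condensed into a toy state machine. This is an illustrative C sketch, not actual KVM code; all names are invented, and `postpone_resample` models the fix proposed in this series (postponing the resamplefd notify for a vIRQ that is still masked at vEOI until the guest unmasks it).]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one level-triggered IRQ forwarded with a resample eventfd. */
struct model {
    bool device_line;        /* is the device still asserting the line? */
    bool virr;               /* pending bit in the virtual irqchip */
    bool guest_masked;       /* vIRQ masked by the guest (IRQS_ONESHOT) */
    bool postpone_resample;  /* the fix proposed in this series */
    bool resample_postponed;
    int injected;            /* interrupts actually delivered to a vCPU */
};

/* vfio's resamplefd handler unmasks the host IRQ; if the device still
 * asserts the line, a new host interrupt fires and is queued for the guest. */
void resamplefd_notify(struct model *m)
{
    if (m->device_line)
        m->virr = true;
}

void try_inject(struct model *m)
{
    if (m->virr && !m->guest_masked) {
        m->virr = false;
        m->injected++;
    }
}

void guest_eoi(struct model *m)
{
    if (m->guest_masked && m->postpone_resample)
        m->resample_postponed = true;   /* proposed: wait for unmask */
    else
        resamplefd_notify(m);           /* current KVM behavior */
}

void guest_unmask(struct model *m)
{
    m->guest_masked = false;
    if (m->resample_postponed) {
        m->resample_postponed = false;
        resamplefd_notify(m);           /* device already acked: no re-fire */
    }
    try_inject(m);
}

/* One oneshot interrupt, start to finish; returns injections seen. */
int run(bool fixed)
{
    struct model m = { .postpone_resample = fixed };

    m.device_line = true;   /* device raises the line */
    m.virr = true;          /* trigger eventfd -> vIRQ pending */
    try_inject(&m);         /* guest takes the hardirq */
    m.guest_masked = true;  /* IRQS_ONESHOT: mask_irq() at hardirq entry */
    guest_eoi(&m);          /* EOI at the end of the guest hardirq */
    m.device_line = false;  /* threaded handler acks the device */
    guest_unmask(&m);       /* unmask_threaded_irq() */
    return m.injected;
}
```

Tracing it through, `run(false)` (current behavior) delivers 2 interrupts for a single device event, while `run(true)` (postponed resample) delivers 1.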
Hi Marc,
On 8/10/22 08:51, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 00:30:29 +0100,
> Dmytro Maluka <[email protected]> wrote:
>> [...]
> Sorry to barge in pretty late in the conversation (just been Cc'd on
> this), but why shouldn't the resamplefd be notified? If there has been
yeah sorry to get you involved here ;-)
> an EOI, a new level must be made visible to the guest interrupt
> controller, no matter what the state of the interrupt masking is.
>
> Whether this new level is actually *presented* to a vCPU is another
> matter entirely, and is arguably a problem for the interrupt
> controller emulation.
FWIU on guest EOI the physical line is still asserted so the pIRQ is
immediately re-sampled by the interrupt controller (because the
resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
(although it is masked at guest level). When the guest actually unmasks
the vIRQ we do not get a chance to re-evaluate the physical line level.
When running native, when EOI is sent, the physical line is still
asserted but the IRQ is masked. When unmasking, the line is de-asserted.
Thanks
Eric
On Tue, 09 Aug 2022 21:45:25 +0100,
Eric Auger <[email protected]> wrote:
>
> Hi Dmytro,
>
> On 8/5/22 21:39, Dmytro Maluka wrote:
> > The existing KVM mechanism for forwarding of level-triggered interrupts
> > using resample eventfd doesn't work quite correctly in the case of
> > interrupts that are handled in a Linux guest as oneshot interrupts
> > (IRQF_ONESHOT). Such an interrupt is acked to the device in its
> > threaded irq handler, i.e. later than it is acked to the interrupt
> > controller (EOI at the end of hardirq), not earlier.
> >
> > Linux keeps such interrupt masked until its threaded handler finishes,
> > to prevent the EOI from re-asserting an unacknowledged interrupt.
> > However, with KVM + vfio (or whatever is listening on the resamplefd)
> > we don't check that the interrupt is still masked in the guest at the
> > moment of EOI. Resamplefd is notified regardless, so vfio prematurely
> > unmasks the host physical IRQ, thus a new (unwanted) physical interrupt
> > is generated in the host and queued for injection to the guest.
> >
> > The fact that the virtual IRQ is still masked doesn't prevent this new
> > physical IRQ from being propagated to the guest, because:
> >
> > 1. It is not guaranteed that the vIRQ will remain masked by the time
> > when vfio signals the trigger eventfd.
> > 2. KVM marks this IRQ as pending (e.g. setting its bit in the virtual
> > IRR register of IOAPIC on x86), so after the vIRQ is unmasked, this
> > new pending interrupt is injected by KVM to the guest anyway.
> >
> > There are observed at least 2 user-visible issues caused by those
> > extra erroneous pending interrupts for oneshot irq in the guest:
> >
> > 1. System suspend aborted due to a pending wakeup interrupt from
> > ChromeOS EC (drivers/platform/chrome/cros_ec.c).
> > 2. Annoying "invalid report id data" errors from ELAN0000 touchpad
> > (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg
> > every time the touchpad is touched.
> >
> > This patch fixes the issue on x86 by checking if the interrupt is
> > unmasked when we receive irq ack (EOI) and, in case if it's masked,
> > postponing resamplefd notify until the guest unmasks it.
> >
> > It doesn't fix the issue for other archs yet, since it relies on KVM
> > irq mask notifiers functionality which currently works only on x86.
> > On other archs we can register mask notifiers but they are never called.
> > So on other archs resampler->masked is always false, so the behavior is
> > the same as before this patch.
The core issue seems to be that you would like to be able to retire an
interrupt from what has been queued into the guest by a previous
resampling (because the line has effectively dropped in the meantime).
On arm64, it would be easy enough to sample the pending state of the
physical line and adjust the state of the virtual interrupt
accordingly. This would at least have the advantage of preserving the
illusion of an interrupt being directly routed to the guest and its
pending state being preserved between EOI and unmask.
It isn't perfect either though as, assuming the guest can ack the
interrupt on the device without exiting, the line would still appear
as pending until the next exit, possibly the unmask.
M.
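[Editor's note: the retire-on-unmask idea described above can be sketched as follows. This is a hypothetical helper, not kernel code; `phys_line_asserted` stands in for an arch hook such as the GIC's in-kernel pending-state API on arm64, which other architectures may not have.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * On guest unmask, re-read the physical line and retire the stale
 * virtual pending state if the line has dropped between EOI and unmask.
 * Caveat from the discussion: if the guest acks the device without
 * exiting, the vIRQ still appears pending until the next exit, because
 * this check only runs when we get a chance to sample the line.
 */
bool should_inject_on_unmask(bool phys_line_asserted, bool *virr_pending)
{
    if (*virr_pending && !phys_line_asserted)
        *virr_pending = false;  /* line dropped: retire the interrupt */
    return *virr_pending;
}
```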
On Wed, 10 Aug 2022 09:12:18 +0100,
Eric Auger <[email protected]> wrote:
>
> Hi Marc,
>
> On 8/10/22 08:51, Marc Zyngier wrote:
> > [...]
> > Sorry to barge in pretty late in the conversation (just been Cc'd on
> > this), but why shouldn't the resamplefd be notified? If there has been
> yeah sorry to get you involved here ;-)
No problem!
> > an EOI, a new level must be made visible to the guest interrupt
> > controller, no matter what the state of the interrupt masking is.
> >
> > Whether this new level is actually *presented* to a vCPU is another
> > matter entirely, and is arguably a problem for the interrupt
> > controller emulation.
>
> FWIU on guest EOI the physical line is still asserted so the pIRQ is
> immediately re-sampled by the interrupt controller (because the
> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
> (although it is masked at guest level). When the guest actually unmasks
> the vIRQ we do not get a chance to re-evaluate the physical line level.
Indeed, and maybe this is what should be fixed instead of moving the
resampling point around (I was suggesting something along these lines
in [1]).
We already do this on arm64 for the timer, and it should be easy
enough to generalise to any interrupt backed by the GIC (there is an
in-kernel API to sample the pending state). No idea how that translates
to other architectures though.
M.
[1] https://lore.kernel.org/r/[email protected]
Hi Marc,
On 8/10/22 8:51 AM, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 00:30:29 +0100,
> Dmytro Maluka <[email protected]> wrote:
>> [...]
>
> Sorry to barge in pretty late in the conversation (just been Cc'd on
> this), but why shouldn't the resamplefd be notified? If there has been
> an EOI, a new level must be made visible to the guest interrupt
> controller, no matter what the state of the interrupt masking is.
>
> Whether this new level is actually *presented* to a vCPU is another
> matter entirely, and is arguably a problem for the interrupt
> controller emulation.
>
> For example on arm64, we expect to be able to read the pending state
> of an interrupt from the guest irrespective of the masking state of
> that interrupt. Any change to the interrupt flow should preserve this.
I'd like to understand the problem better, so could you please give some
examples of cases where it is required/useful/desirable to read the
correct pending state of a guest interrupt?
On 8/10/22 7:17 PM, Dong, Eddie wrote:
>>>
>>>
>>>> However, with KVM + vfio (or whatever is listening on the resamplefd)
>>>> we don't check that the interrupt is still masked in the guest at the moment
>> of EOI.
>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the
>>>> host physical IRQ, thus a new (unwanted) physical interrupt is
>>>> generated in the host and queued for injection to the guest."
>>>>
>>>
> >>> Emulation of level-triggered IRQs is a pain point ☹ I read we need to
>>> emulate the "level" of the IRQ pin (connecting from device to IRQchip, i.e.
>> ioapic here).
>>> Technically, the guest can change the polarity of vIOAPIC, which will
>>> lead to a new virtual IRQ even w/o host side interrupt.
>>
>> Thanks, interesting point. Do you mean that this behavior (a new vIRQ as a
>> result of polarity change) may already happen with the existing KVM code?
>>
>> It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC polarity
>> bit, in particular it doesn't handle change of the polarity by the guest (i.e.
>> doesn't update the virtual IRR register, and so on), so it shouldn't result in a
>> new interrupt.
>
> Correct, KVM doesn't handle polarity now. Probably because commercial OSes are
> unlikely to change polarity.
>
>>
>> Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there seems to
> >> be an assumption that KVM interprets the IRQ level value as active (asserted)
>> vs inactive (deasserted) rather than high vs low, i.e.
>
> Asserted/deasserted vs. high/low is the same to me, though asserted/deasserted hints more at an event rather than a state.
>
>> the polarity doesn't matter to KVM.
>>
>> So, since both sides (KVM emulating the IOAPIC, and vfio/whatever emulating
>> an external interrupt source) seem to operate on a level of abstraction of
>> "asserted" vs "de-asserted" interrupt state regardless of the polarity, and that
>> seems not a bug but a feature, it seems that we don't need to emulate the IRQ
>> level as such. Or am I missing something?
>
> YES, I know current KVM doesn't handle it. Whether we should support it is another story which I cannot speak for.
> Paolo and Alex are the right people to decide.
> The reason I mention this is that the complexity of adding a pending event vs. supporting an interrupt pin state is the same.
> I am wondering if we need to revisit it or not. Behavior closer to real hardware helps us avoid potential issues IMO, but I am fine with either choice.
I guess that would imply revisiting KVM irqfd interface, since its
design is based rather on events than states, even for level-triggered
interrupts:
- trigger event (from vfio to KVM) to assert an IRQ
- resample event (from KVM to vfio) to de-assert an IRQ
>
>>
>> OTOH, I guess this means that the existing KVM's emulation of level-triggered
>> interrupts is somewhat limited (a guest may legitimately expect an interrupt
>> fired as a result of polarity change, and that case is not supported by KVM). But
>> that is rather out of scope of the oneshot interrupts issue addressed by this
>> patchset.
>
> Agree.
> I didn't know any commercial OSes change polarity either. But I know the Xen hypervisor uses polarity under certain conditions.
> One day, we may see the issue when running Xen as an L1 hypervisor. But this is not the current worry.
>
>
>>
>>> "pending" field of kvm_kernel_irqfd_resampler in patch 3 means more an
>> event rather than an interrupt level.
>
> I know. I am fine either.
>
> Thanks Eddie
>
>>>
>>>
Hi Marc,
On 8/10/22 3:01 PM, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 09:12:18 +0100,
> Eric Auger <[email protected]> wrote:
>> [...]
>>
>> FWIU on guest EOI the physical line is still asserted so the pIRQ is
>> immediately re-sampled by the interrupt controller (because the
>> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
>> (although it is masked at guest level). When the guest actually unmasks
>> the vIRQ we do not get a chance to re-evaluate the physical line level.
>
> Indeed, and maybe this is what should be fixed instead of moving the
> resampling point around (I was suggesting something along these lines
> in [1]).
>
> We already do this on arm64 for the timer, and it should be easy
> enough it generalise to any interrupt backed by the GIC (there is an
> in-kernel API to sample the pending state). No idea how that translate
> for other architectures though.
Actually I'm now thinking about changing the behavior implemented in my
patchset, which is:
1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
that no new physical IRQ is generated, and the vIRQ is not set as
pending.
2. After this vIRQ is unmasked by the guest, notify resamplefd.
to the following one:
1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
but also remember this vIRQ as, let's call it, "pending oneshot".
2. A new physical IRQ is immediately generated, so the vIRQ is
properly set as pending.
3. After the vIRQ is unmasked by the guest, check and find out that
it is not just pending but also "pending oneshot", so don't
deliver it to a vCPU. Instead, immediately notify resamplefd once
again.
In other words, don't avoid extra physical interrupts in the host
(rather, use those extra interrupts for properly updating the pending
state of the vIRQ) but avoid propagating those extra interrupts to the
guest.
Does this sound reasonable to you?
Your suggestion to sample the pending state of the physical IRQ sounds
interesting too. But as you said, it's yet to be checked how feasible it
would be on architectures other than arm64. Also it assumes that the IRQ
in question is a forwarded physical interrupt, while I can imagine that
KVM's resamplefd could in principle also be useful for implementing
purely emulated interrupts.
Do you see any advantages of sampling the physical IRQ pending state
over remembering the "pending oneshot" state as described above?
>
> M.
>
> [1] https://lore.kernel.org/r/[email protected]
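[Editor's note: the alternative "pending oneshot" flow proposed above can be sketched as follows. Illustrative C with invented names, not actual KVM code; the three steps map to the numbered list in the mail.]

```c
#include <assert.h>
#include <stdbool.h>

struct virq {
    bool masked;             /* vIRQ masked by the guest */
    bool pending;            /* e.g. vIRR bit in the virtual irqchip */
    bool pending_oneshot;    /* EOI arrived while the vIRQ was masked */
    int resample_notifies;   /* resamplefd signals seen by vfio */
    int injected;            /* interrupts delivered to a vCPU */
};

void veoi(struct virq *v)
{
    v->resample_notifies++;          /* step 1: notify resamplefd as usual */
    if (v->masked)
        v->pending_oneshot = true;   /* ...but remember "pending oneshot" */
}

void host_irq(struct virq *v)
{
    v->pending = true;               /* step 2: new physical IRQ recorded */
}

void vunmask(struct virq *v)
{
    v->masked = false;
    if (v->pending && v->pending_oneshot) {
        /* step 3: don't deliver to a vCPU; notify resamplefd again */
        v->pending = false;
        v->pending_oneshot = false;
        v->resample_notifies++;
    } else if (v->pending) {
        v->pending = false;
        v->injected++;
    }
}
```

The extra physical interrupt still updates the pending state of the vIRQ, but is retired silently on unmask instead of being propagated to the guest.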
>
> On 8/10/22 7:17 PM, Dong, Eddie wrote:
> [...]
>
> I guess that would imply revisiting KVM irqfd interface, since its design is based
> rather on events than states, even for level-triggered
> interrupts:
We can read these as 2 different kinds of events: an IRQ fire/no-fire event, and a state change event (for consumers to maintain internal assert/deassert state).
If we switch from the former to the latter, do we need to change the interface?
Probably needs Paolo and Alex to give a clear direction, given that the ARM64 side seems to have a similar state concept too.
Thanks Dmytro!
Eddie
>
> - trigger event (from vfio to KVM) to assert an IRQ
> - resample event (from KVM to vfio) to de-assert an IRQ
>
> >
> >>
> >> OTOH, I guess this means that the existing KVM's emulation of
> >> level-triggered interrupts is somewhat limited (a guest may
> >> legitimately expect an interrupt fired as a result of polarity
> >> change, and that case is not supported by KVM). But that is rather
> >> out of scope of the oneshot interrupts issue addressed by this patchset.
> >
> > Agree.
> > I didn't know any commercial OSes change polarity either. But I know Xen
> hypervisor uses polarity under certain condition.
> > One day, we may see the issue when running Xen as a L1 hypervisor. But this
> is not the current worry.
> >
> >
> >>
> >>> "pending" field of kvm_kernel_irqfd_resampler in patch 3 means more
> >>> an
> >> event rather than an interrupt level.
> >
> > I know. I am fine either way.
> >
> > Thanks Eddie
> >
> >>>
> >>>
> >
> >
> >> However, with KVM + vfio (or whatever is listening on the resamplefd)
> >> we don't check that the interrupt is still masked in the guest at the moment
> of EOI.
> >> Resamplefd is notified regardless, so vfio prematurely unmasks the
> >> host physical IRQ, thus a new (unwanted) physical interrupt is
> >> generated in the host and queued for injection to the guest."
> >>
> >
> > Emulation of level-triggered IRQs is a pain point ☹ My reading is that we need to
> > emulate the "level" of the IRQ pin (connecting the device to the IRQchip, i.e. the
> ioapic here).
> > Technically, the guest can change the polarity of vIOAPIC, which will
> > lead to a new virtual IRQ even w/o host side interrupt.
>
> Thanks, interesting point. Do you mean that this behavior (a new vIRQ as a
> result of polarity change) may already happen with the existing KVM code?
>
> It doesn't seem so to me. AFAICT, KVM completely ignores the vIOAPIC polarity
> bit, in particular it doesn't handle change of the polarity by the guest (i.e.
> doesn't update the virtual IRR register, and so on), so it shouldn't result in a
> new interrupt.
Correct, KVM doesn't handle polarity now. Probably because commercial OSes are
unlikely to change polarity.
>
> Since commit 100943c54e09 ("kvm: x86: ignore ioapic polarity") there seems to
> be an assumption that KVM interprets the IRQ level value as active (asserted)
> vs inactive (deasserted) rather than high vs low, i.e.
Asserted/deasserted vs. high/low is the same to me, though asserted/deasserted hints more at an event rather than a state.
> the polarity doesn't matter to KVM.
>
> So, since both sides (KVM emulating the IOAPIC, and vfio/whatever emulating
> an external interrupt source) seem to operate on a level of abstraction of
> "asserted" vs "de-asserted" interrupt state regardless of the polarity, and that
> seems not a bug but a feature, it seems that we don't need to emulate the IRQ
> level as such. Or am I missing something?
YES, I know current KVM doesn't handle it. Whether we should support it is another story which I cannot speak for.
Paolo and Alex are the right people to decide :)
The reason I mention this is that the complexity of adding a pending event vs. supporting an interrupt pin state is the same.
I am wondering if we need to revisit it or not. Behavior closer to real hardware helps us avoid potential issues IMO, but I am fine with either choice.
>
> OTOH, I guess this means that the existing KVM's emulation of level-triggered
> interrupts is somewhat limited (a guest may legitimately expect an interrupt
> fired as a result of polarity change, and that case is not supported by KVM). But
> that is rather out of scope of the oneshot interrupts issue addressed by this
> patchset.
Agree.
I don't know of any commercial OSes that change polarity either. But I know the Xen hypervisor uses polarity under certain conditions.
One day we may see the issue when running Xen as an L1 hypervisor. But this is not the current worry.
>
> > "pending" field of kvm_kernel_irqfd_resampler in patch 3 means more an
> event rather than an interrupt level.
I know. I am fine either way.
Thanks Eddie
> >
> >
On 8/10/22 19:02, Dmytro Maluka wrote:
> 1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
> but also remember this vIRQ as, let's call it, "pending oneshot".
>
> 2. A new physical IRQ is immediately generated, so the vIRQ is
> properly set as pending.
>
> 3. After the vIRQ is unmasked by the guest, check and find out that
> it is not just pending but also "pending oneshot", so don't
> deliver it to a vCPU. Instead, immediately notify resamplefd once
> again.
>
> In other words, don't avoid extra physical interrupts in the host
> (rather, use those extra interrupts for properly updating the pending
> state of the vIRQ) but avoid propagating those extra interrupts to the
> guest.
>
> Does this sound reasonable to you?
Yeah, this makes sense and it lets the resamplefd set the "pending"
status in the vGIC. It still has the issue that the interrupt can
remain pending in the guest for longer than it's pending on the host,
but that can't be fixed?
Paolo
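Dmytro's three-step scheme above can be sketched as a toy state machine. This is plain C with invented names ("pending oneshot" flag, stubbed notification counters standing in for KVM/vfio interactions); it is a model of the proposed flow, not actual kernel code:

```c
#include <stdbool.h>

/* Toy model of the proposed flow; all names here are invented for
 * illustration and are not KVM code. */
struct virq {
    bool masked;            /* guest has the vIRQ masked */
    bool pending;           /* vIRQ pending in the virtual irqchip */
    bool pending_oneshot;   /* vEOI arrived while the vIRQ was masked */
    int  resample_notifies; /* resamplefd signals (vfio unmasks the host IRQ) */
    int  delivered;         /* interrupts actually injected into a vCPU */
};

/* Step 1: vEOI always notifies resamplefd; remember "pending oneshot"
 * if the vIRQ is still masked in the guest. */
static void veoi(struct virq *v)
{
    v->resample_notifies++;
    if (v->masked)
        v->pending_oneshot = true;
}

/* Step 2: the device line is still asserted, so a new physical IRQ
 * arrives and properly sets the vIRQ pending. */
static void host_irq(struct virq *v)
{
    v->pending = true;
}

/* Step 3: on guest unmask, a "pending oneshot" vIRQ is resampled once
 * more instead of being injected into a vCPU. */
static void guest_unmask(struct virq *v)
{
    v->masked = false;
    if (!v->pending)
        return;
    v->pending = false;
    if (v->pending_oneshot) {
        v->pending_oneshot = false;
        v->resample_notifies++; /* re-check the line, don't inject */
    } else {
        v->delivered++;
    }
}
```

The key property, matching the proposal, is that the extra host interrupt updates the vIRQ pending state but is never propagated to a vCPU.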
On Wed, 10 Aug 2022 18:06:53 +0100,
Dmytro Maluka <[email protected]> wrote:
>
> Hi Marc,
>
> On 8/10/22 8:51 AM, Marc Zyngier wrote:
> > On Wed, 10 Aug 2022 00:30:29 +0100,
> > Dmytro Maluka <[email protected]> wrote:
> >>
> >> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Dmytro Maluka <[email protected]>
> >>>> Sent: Tuesday, August 9, 2022 12:24 AM
> >>>> To: Dong, Eddie <[email protected]>; Christopherson,, Sean
> >>>> <[email protected]>; Paolo Bonzini <[email protected]>;
> >>>> [email protected]
> >>>> Cc: Thomas Gleixner <[email protected]>; Ingo Molnar <[email protected]>;
> >>>> Borislav Petkov <[email protected]>; Dave Hansen <[email protected]>;
> >>>> [email protected]; H. Peter Anvin <[email protected]>; linux-
> >>>> [email protected]; Eric Auger <[email protected]>; Alex
> >>>> Williamson <[email protected]>; Liu, Rong L <[email protected]>;
> >>>> Zhenyu Wang <[email protected]>; Tomasz Nowicki
> >>>> <[email protected]>; Grzegorz Jaszczyk <[email protected]>;
> >>>> [email protected]; Dmitry Torokhov <[email protected]>
> >>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>>>
> >>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>>>
> >>>>>> The existing KVM mechanism for forwarding of level-triggered
> >>>>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>>>> existing KVM code doesn't take that into account, which results in
> >>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >>>> unacknowledged IRQ by the host.
> >>>>>
> >>>>> Interesting... How does it behave on the native side?
> >>>>
> >>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >>>> completes.
> >>>>
> >>>> In handle_fasteoi_irq():
> >>>>
> >>>> if (desc->istate & IRQS_ONESHOT)
> >>>> mask_irq(desc);
> >>>>
> >>>> handle_irq_event(desc);
> >>>>
> >>>> cond_unmask_eoi_irq(desc, chip);
> >>>>
> >>>>
> >>>> and later in unmask_threaded_irq():
> >>>>
> >>>> unmask_irq(desc);
> >>>>
> >>>> I also mentioned that in patch #3 description:
> >>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >>>> prevent the EOI from re-asserting an unacknowledged interrupt.
> >>>
> >>> That makes sense. Can you include the full story in cover letter too?
> >>
> >> Ok, I will.
> >>
> >>>
> >>>
> >>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >>>> check that the interrupt is still masked in the guest at the moment of EOI.
> >>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >>>> and queued for injection to the guest."
> >
> > Sorry to barge in pretty late in the conversation (just been Cc'd on
> > this), but why shouldn't the resamplefd be notified? If there has been
> > an EOI, a new level must be made visible to the guest interrupt
> > controller, no matter what the state of the interrupt masking is.
> >
> > Whether this new level is actually *presented* to a vCPU is another
> > matter entirely, and is arguably a problem for the interrupt
> > controller emulation.
> >
> > For example on arm64, we expect to be able to read the pending state
> > of an interrupt from the guest irrespective of the masking state of
> > that interrupt. Any change to the interrupt flow should preserve this.
>
> I'd like to understand the problem better, so could you please give some
> examples of cases where it is required/useful/desirable to read the
> correct pending state of a guest interrupt?
I'm not sure I understand the question. It is *always* desirable to
present the correct information to the guest.
For example, a guest could periodically poll the pending interrupt
registers and only enable interrupts that are pending. Is it a good
idea? No. Is it expected to work? Absolutely.
And yes, we go out of our way to make sure these things actually work,
because one day or another, you'll find a guest that does exactly
that.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
Hi Dmytro,
On Wed, 10 Aug 2022 18:02:29 +0100,
Dmytro Maluka <[email protected]> wrote:
>
> Hi Marc,
>
> On 8/10/22 3:01 PM, Marc Zyngier wrote:
> > On Wed, 10 Aug 2022 09:12:18 +0100,
> > Eric Auger <[email protected]> wrote:
> >>
> >> Hi Marc,
> >>
> >> On 8/10/22 08:51, Marc Zyngier wrote:
> >>> On Wed, 10 Aug 2022 00:30:29 +0100,
> >>> Dmytro Maluka <[email protected]> wrote:
> >>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dmytro Maluka <[email protected]>
> >>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
> >>>>>>
> >>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
> >>>>>>>> The existing KVM mechanism for forwarding of level-triggered
> >>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
> >>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
> >>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
> >>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
> >>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
> >>>>>>>> existing KVM code doesn't take that into account, which results in
> >>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
> >>>>>> unacknowledged IRQ by the host.
> >>>>>>> Interesting... How does it behave on the native side?
> >>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
> >>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
> >>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
> >>>>>> completes.
> >>>>>>
> >>>>>> In handle_fasteoi_irq():
> >>>>>>
> >>>>>> if (desc->istate & IRQS_ONESHOT)
> >>>>>> mask_irq(desc);
> >>>>>>
> >>>>>> handle_irq_event(desc);
> >>>>>>
> >>>>>> cond_unmask_eoi_irq(desc, chip);
> >>>>>>
> >>>>>>
> >>>>>> and later in unmask_threaded_irq():
> >>>>>>
> >>>>>> unmask_irq(desc);
> >>>>>>
> >>>>>> I also mentioned that in patch #3 description:
> >>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
> >>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
> >>>>> That makes sense. Can you include the full story in cover letter too?
> >>>> Ok, I will.
> >>>>
> >>>>>
> >>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
> >>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
> >>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
> >>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
> >>>>>> and queued for injection to the guest."
> >>> Sorry to barge in pretty late in the conversation (just been Cc'd on
> >>> this), but why shouldn't the resamplefd be notified? If there has been
> >> yeah sorry to get you involved here ;-)
> >
> > No problem!
> >
> >>> an EOI, a new level must be made visible to the guest interrupt
> >>> controller, no matter what the state of the interrupt masking is.
> >>>
> >>> Whether this new level is actually *presented* to a vCPU is another
> >>> matter entirely, and is arguably a problem for the interrupt
> >>> controller emulation.
> >>
> >> FWIU on guest EOI the physical line is still asserted so the pIRQ is
> >> immediately re-sampled by the interrupt controller (because the
> >> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
> >> (although it is masked at guest level). When the guest actually unmasks
> >> the vIRQ we do not get a chance to re-evaluate the physical line level.
> >
> > Indeed, and maybe this is what should be fixed instead of moving the
> > resampling point around (I was suggesting something along these lines
> > in [1]).
> >
> > We already do this on arm64 for the timer, and it should be easy
> > enough to generalise to any interrupt backed by the GIC (there is an
> > in-kernel API to sample the pending state). No idea how that translates
> > for other architectures though.
>
> Actually I'm now thinking about changing the behavior implemented in my
> patchset, which is:
>
> 1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
> that no new physical IRQ is generated, and the vIRQ is not set as
> pending.
>
> 2. After this vIRQ is unmasked by the guest, notify resamplefd.
>
> to the following one:
>
> 1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
> but also remember this vIRQ as, let's call it, "pending oneshot".
>
> 2. A new physical IRQ is immediately generated, so the vIRQ is
> properly set as pending.
>
> 3. After the vIRQ is unmasked by the guest, check and find out that
> it is not just pending but also "pending oneshot", so don't
> deliver it to a vCPU. Instead, immediately notify resamplefd once
> again.
>
> In other words, don't avoid extra physical interrupts in the host
> (rather, use those extra interrupts for properly updating the pending
> state of the vIRQ) but avoid propagating those extra interrupts to the
> guest.
>
> Does this sound reasonable to you?
It does. I'm a bit concerned about the extra state (more state, more
problems...), but let's see the implementation.
> Your suggestion to sample the pending state of the physical IRQ sounds
> interesting too. But as you said, it's yet to be checked how feasible it
> would be on architectures other than arm64. Also it assumes that the IRQ
> in question is a forwarded physical interrupt, while I can imagine that
> KVM's resamplefd could in principle also be useful for implementing
> purely emulated interrupts.
No, there is no requirement for this being a forwarded interrupt. The
vgic code does that for forwarded interrupts, but the core code could
do that too if the information is available (irq_get_irqchip_state()
was introduced for this exact purpose).
> Do you see any advantages of sampling the physical IRQ pending state
> over remembering the "pending oneshot" state as described above?
The advantage is to not maintain some extra state, as this is usually
a source of problem, but to get to the source (the HW pending state).
It also solves the "pending in the vgic but not pending in the HW"
problem, as reading the pending state causes an exit (the register is
emulated), and as part of the exit handling we already perform the
resample. We just need to extend this to check the HW state, and
correct the pending state if required, making sure that the emulation
will return an accurate view.
M.
--
Without deviation from the norm, progress is not possible.
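Marc's suggestion of getting to the source via irq_get_irqchip_state() could look roughly like the sketch below. The kernel API is `int irq_get_irqchip_state(unsigned int irq, enum irqchip_irq_state which, bool *state)`; here it is replaced by a stub so the fragment is self-contained, and the surrounding logic is only a guess at what a generic resample-on-exit check might look like:

```c
#include <stdbool.h>

/* Stub standing in for the kernel's
 * irq_get_irqchip_state(irq, IRQCHIP_STATE_PENDING, &state);
 * a real implementation would query the physical interrupt controller. */
static bool hw_line_asserted;
static int get_hw_pending(unsigned int irq, bool *state)
{
    (void)irq;
    *state = hw_line_asserted;
    return 0; /* 0 on success, like the kernel API */
}

/* On a guest exit caused by reading the pending register (or by an
 * unmask), correct the emulated pending bit from the HW line state,
 * so the emulation always returns an accurate view. */
static bool resample_vpending(unsigned int hwirq, bool vpending)
{
    bool hw;

    if (get_hw_pending(hwirq, &hw) == 0)
        return hw;   /* trust the physical line level */
    return vpending; /* no HW state available: keep the emulated state */
}
```

This also illustrates why no extra "pending oneshot" bookkeeping is needed in this variant: the hardware line itself is the state.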
On 8/11/22 14:35, Marc Zyngier wrote:
> On Wed, 10 Aug 2022 18:06:53 +0100,
> Dmytro Maluka <[email protected]> wrote:
>>
>> Hi Marc,
>>
>> On 8/10/22 8:51 AM, Marc Zyngier wrote:
>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>> Dmytro Maluka <[email protected]> wrote:
>>>>
>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dmytro Maluka <[email protected]>
>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>
>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>>
>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>> unacknowledged IRQ by the host.
>>>>>>>
>>>>>>> Interesting... How does it behave on the native side?
>>>>>>
>>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>> completes.
>>>>>>
>>>>>> In handle_fasteoi_irq():
>>>>>>
>>>>>> if (desc->istate & IRQS_ONESHOT)
>>>>>> mask_irq(desc);
>>>>>>
>>>>>> handle_irq_event(desc);
>>>>>>
>>>>>> cond_unmask_eoi_irq(desc, chip);
>>>>>>
>>>>>>
>>>>>> and later in unmask_threaded_irq():
>>>>>>
>>>>>> unmask_irq(desc);
>>>>>>
>>>>>> I also mentioned that in patch #3 description:
>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>>
>>>>> That makes sense. Can you include the full story in cover letter too?
>>>>
>>>> Ok, I will.
>>>>
>>>>>
>>>>>
>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>> and queued for injection to the guest."
>>>
>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>> this), but why shouldn't the resamplefd be notified? If there has been
>>> an EOI, a new level must be made visible to the guest interrupt
>>> controller, no matter what the state of the interrupt masking is.
>>>
>>> Whether this new level is actually *presented* to a vCPU is another
>>> matter entirely, and is arguably a problem for the interrupt
>>> controller emulation.
>>>
>>> For example on arm64, we expect to be able to read the pending state
>>> of an interrupt from the guest irrespective of the masking state of
>>> that interrupt. Any change to the interrupt flow should preserve this.
>>
>> I'd like to understand the problem better, so could you please give some
>> examples of cases where it is required/useful/desirable to read the
>> correct pending state of a guest interrupt?
>
> I'm not sure I understand the question. It is *always* desirable to
> present the correct information to the guest.
>
> For example, a guest could periodically poll the pending interrupt
> registers and only enable interrupts that are pending. Is it a good
> idea? No. Is it expected to work? Absolutely.
>
> And yes, we go out of our way to make sure these things actually work,
> because one day or another, you'll find a guest that does exactly
> that.
Ah indeed, thanks. Somehow I was thinking only about using this
information internally in KVM or perhaps presenting it to the host
userspace via some ioctl. Whereas indeed, the guest itself may well read
those registers and rely on this information.
>
> Thanks,
>
> M.
>
Hi Marc,
On 8/11/22 14:21, Marc Zyngier wrote:
> Hi Dmytro,
>
> On Wed, 10 Aug 2022 18:02:29 +0100,
> Dmytro Maluka <[email protected]> wrote:
>>
>> Hi Marc,
>>
>> On 8/10/22 3:01 PM, Marc Zyngier wrote:
>>> On Wed, 10 Aug 2022 09:12:18 +0100,
>>> Eric Auger <[email protected]> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>> On 8/10/22 08:51, Marc Zyngier wrote:
>>>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>>>> Dmytro Maluka <[email protected]> wrote:
>>>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Dmytro Maluka <[email protected]>
>>>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>>>
>>>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>>>> unacknowledged IRQ by the host.
>>>>>>>>> Interesting... How does it behave on the native side?
>>>>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>>>> completes.
>>>>>>>>
>>>>>>>> In handle_fasteoi_irq():
>>>>>>>>
>>>>>>>> if (desc->istate & IRQS_ONESHOT)
>>>>>>>> mask_irq(desc);
>>>>>>>>
>>>>>>>> handle_irq_event(desc);
>>>>>>>>
>>>>>>>> cond_unmask_eoi_irq(desc, chip);
>>>>>>>>
>>>>>>>>
>>>>>>>> and later in unmask_threaded_irq():
>>>>>>>>
>>>>>>>> unmask_irq(desc);
>>>>>>>>
>>>>>>>> I also mentioned that in patch #3 description:
>>>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>>>> That makes sense. Can you include the full story in cover letter too?
>>>>>> Ok, I will.
>>>>>>
>>>>>>>
>>>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>>>> and queued for injection to the guest."
>>>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>>>> this), but why shouldn't the resamplefd be notified? If there has been
>>>> yeah sorry to get you involved here ;-)
>>>
>>> No problem!
>>>
>>>>> an EOI, a new level must be made visible to the guest interrupt
>>>>> controller, no matter what the state of the interrupt masking is.
>>>>>
>>>>> Whether this new level is actually *presented* to a vCPU is another
>>>>> matter entirely, and is arguably a problem for the interrupt
>>>>> controller emulation.
>>>>
>>>> FWIU on guest EOI the physical line is still asserted so the pIRQ is
>>>> immediately re-sampled by the interrupt controller (because the
>>>> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
>>>> (although it is masked at guest level). When the guest actually unmasks
>>>> the vIRQ we do not get a chance to re-evaluate the physical line level.
>>>
>>> Indeed, and maybe this is what should be fixed instead of moving the
>>> resampling point around (I was suggesting something along these lines
>>> in [1]).
>>>
>>> We already do this on arm64 for the timer, and it should be easy
>>> enough to generalise to any interrupt backed by the GIC (there is an
>>> in-kernel API to sample the pending state). No idea how that translates
>>> for other architectures though.
>>
>> Actually I'm now thinking about changing the behavior implemented in my
>> patchset, which is:
>>
>> 1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
>> that no new physical IRQ is generated, and the vIRQ is not set as
>> pending.
>>
>> 2. After this vIRQ is unmasked by the guest, notify resamplefd.
>>
>> to the following one:
>>
>> 1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
>> but also remember this vIRQ as, let's call it, "pending oneshot".
>>
>> 2. A new physical IRQ is immediately generated, so the vIRQ is
>> properly set as pending.
>>
>> 3. After the vIRQ is unmasked by the guest, check and find out that
>> it is not just pending but also "pending oneshot", so don't
>> deliver it to a vCPU. Instead, immediately notify resamplefd once
>> again.
>>
>> In other words, don't avoid extra physical interrupts in the host
>> (rather, use those extra interrupts for properly updating the pending
>> state of the vIRQ) but avoid propagating those extra interrupts to the
>> guest.
>>
>> Does this sound reasonable to you?
>
> It does. I'm a bit concerned about the extra state (more state, more
> problems...), but let's see the implementation.
>
>> Your suggestion to sample the pending state of the physical IRQ sounds
>> interesting too. But as you said, it's yet to be checked how feasible it
>> would be on architectures other than arm64. Also it assumes that the IRQ
>> in question is a forwarded physical interrupt, while I can imagine that
>> KVM's resamplefd could in principle also be useful for implementing
>> purely emulated interrupts.
>
> No, there is no requirement for this being a forwarded interrupt. The
> vgic code does that for forwarded interrupts, but the core code could
> do that too if the information is available (irq_get_irqchip_state()
> was introduced for this exact purpose).
I meant "forwarding" in a generic sense, not vgic specific. I.e. the
forwarding itself may be done generically by software, e.g. by vfio, but
the source is in any case a physical HW interrupt.
Whereas I have in mind also cases where an irqfd user injects purely
virtual interrupts, not coming from HW. I don't know any particular use
case for that, but irqfd doesn't seem to prohibit such use cases. So I
was thinking that maybe it's better to keep it this way, i.e. not depend
on reading physical HW state in KVM. Or am I trying to be too generic here?
>
>> Do you see any advantages of sampling the physical IRQ pending state
>> over remembering the "pending oneshot" state as described above?
>
> The advantage is to not maintain some extra state, as this is usually
> a source of problem, but to get to the source (the HW pending state).
>
> It also solves the "pending in the vgic but not pending in the HW"
> problem, as reading the pending state causes an exit (the register is
> emulated), and as part of the exit handling we already perform the
> resample. We just need to extend this to check the HW state, and
> correct the pending state if required, making sure that the emulation
> will return an accurate view.
BTW, it seems that besides this "pending in the guest but not in the
host" issue, we also already have an opposite issue ("pending in the
host but not in the guest"): upon the guest EOI, we unconditionally
deassert the vIRQ in irqfd_resampler_ack() before notifying resamplefd,
even if the pIRQ is still asserted. So there is a time window (before
the new pIRQ trigger event makes it to KVM) when the IRQ may be pending
in the host but not in the guest.
Am I right that this is actually an issue, and that sampling the
physical state could help with this issue too?
>
> M.
>
Hi Paolo and Dmytro,
> -----Original Message-----
> From: Paolo Bonzini <[email protected]>
> Sent: Wednesday, August 10, 2022 11:48 PM
> To: Dmytro Maluka <[email protected]>; Marc Zyngier
> <[email protected]>; [email protected]
> Cc: Dong, Eddie <[email protected]>; Christopherson,, Sean
> <[email protected]>; [email protected]; Thomas Gleixner
> <[email protected]>; Ingo Molnar <[email protected]>; Borislav
> Petkov <[email protected]>; Dave Hansen <[email protected]>;
> [email protected]; H. Peter Anvin <[email protected]>; linux-
> [email protected]; Alex Williamson <[email protected]>;
> Liu, Rong L <[email protected]>; Zhenyu Wang
> <[email protected]>; Tomasz Nowicki <[email protected]>;
> Grzegorz Jaszczyk <[email protected]>; [email protected];
> Dmitry Torokhov <[email protected]>
> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>
> On 8/10/22 19:02, Dmytro Maluka wrote:
> > 1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
> > but also remember this vIRQ as, let's call it, "pending oneshot".
> >
This is the part that always confuses me. In the x86 case, for a level-triggered
interrupt, even if it is not oneshot, there is still an "unmask", and the unmask
happens in the same sequence as for a oneshot interrupt; just the timing is different.
So are you going to differentiate oneshot from "normal" level-triggered
interrupts or not? And is there any situation where vEOI happens for an unmasked
vIRQ?
> > 2. A new physical IRQ is immediately generated, so the vIRQ is
> > properly set as pending.
> >
I am not sure this is always the case. For example, a device may not raise a
new interrupt until it is notified that reading is done, e.g. by the device
driver writing to a register when it finishes reading the data. So how do you
handle this situation?
> > 3. After the vIRQ is unmasked by the guest, check and find out that
> > it is not just pending but also "pending oneshot", so don't
> > deliver it to a vCPU. Instead, immediately notify resamplefd once
> > again.
> >
Does this mean changing the vfio code as well? That seems to be the case: vfio
seems to keep its own internal "state" of whether the irq is enabled or not.
Thanks,
Rong
> > In other words, don't avoid extra physical interrupts in the host
> > (rather, use those extra interrupts for properly updating the pending
> > state of the vIRQ) but avoid propagating those extra interrupts to the
> > guest.
> >
> > Does this sound reasonable to you?
>
> Yeah, this makes sense and it lets the resamplefd set the "pending"
> status in the vGIC. It still has the issue that the interrupt can
> remain pending in the guest for longer than it's pending on the host,
> but that can't be fixed?
>
> Paolo
On 8/11/22 3:54 PM, Dmytro Maluka wrote:
> Hi Marc,
>
> On 8/11/22 14:21, Marc Zyngier wrote:
>> Hi Dmytro,
>>
>> On Wed, 10 Aug 2022 18:02:29 +0100,
>> Dmytro Maluka <[email protected]> wrote:
>>>
>>> Hi Marc,
>>>
>>> On 8/10/22 3:01 PM, Marc Zyngier wrote:
>>>> On Wed, 10 Aug 2022 09:12:18 +0100,
>>>> Eric Auger <[email protected]> wrote:
>>>>>
>>>>> Hi Marc,
>>>>>
>>>>> On 8/10/22 08:51, Marc Zyngier wrote:
>>>>>> On Wed, 10 Aug 2022 00:30:29 +0100,
>>>>>> Dmytro Maluka <[email protected]> wrote:
>>>>>>> On 8/9/22 10:01 PM, Dong, Eddie wrote:
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Dmytro Maluka <[email protected]>
>>>>>>>>> Sent: Tuesday, August 9, 2022 12:24 AM
>>>>>>>>> To: Dong, Eddie <[email protected]>; Christopherson,, Sean
>>>>>>>>> <[email protected]>; Paolo Bonzini <[email protected]>;
>>>>>>>>> [email protected]
>>>>>>>>> Cc: Thomas Gleixner <[email protected]>; Ingo Molnar <[email protected]>;
>>>>>>>>> Borislav Petkov <[email protected]>; Dave Hansen <[email protected]>;
>>>>>>>>> [email protected]; H. Peter Anvin <[email protected]>; linux-
>>>>>>>>> [email protected]; Eric Auger <[email protected]>; Alex
>>>>>>>>> Williamson <[email protected]>; Liu, Rong L <[email protected]>;
>>>>>>>>> Zhenyu Wang <[email protected]>; Tomasz Nowicki
>>>>>>>>> <[email protected]>; Grzegorz Jaszczyk <[email protected]>;
>>>>>>>>> [email protected]; Dmitry Torokhov <[email protected]>
>>>>>>>>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>>>>>>>>
>>>>>>>>> On 8/9/22 1:26 AM, Dong, Eddie wrote:
>>>>>>>>>>> The existing KVM mechanism for forwarding of level-triggered
>>>>>>>>>>> interrupts using resample eventfd doesn't work quite correctly in the
>>>>>>>>>>> case of interrupts that are handled in a Linux guest as oneshot
>>>>>>>>>>> interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device
>>>>>>>>>>> in its threaded irq handler, i.e. later than it is acked to the
>>>>>>>>>>> interrupt controller (EOI at the end of hardirq), not earlier. The
>>>>>>>>>>> existing KVM code doesn't take that into account, which results in
>>>>>>>>>>> erroneous extra interrupts in the guest caused by premature re-assert of an
>>>>>>>>> unacknowledged IRQ by the host.
>>>>>>>>>> Interesting... How does it behave on the native side?
>>>>>>>>> In native it behaves correctly, since Linux masks such a oneshot interrupt at the
>>>>>>>>> beginning of hardirq, so that the EOI at the end of hardirq doesn't result in its
>>>>>>>>> immediate re-assert, and then unmasks it later, after its threaded irq handler
>>>>>>>>> completes.
>>>>>>>>>
>>>>>>>>> In handle_fasteoi_irq():
>>>>>>>>>
>>>>>>>>> if (desc->istate & IRQS_ONESHOT)
>>>>>>>>> mask_irq(desc);
>>>>>>>>>
>>>>>>>>> handle_irq_event(desc);
>>>>>>>>>
>>>>>>>>> cond_unmask_eoi_irq(desc, chip);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> and later in unmask_threaded_irq():
>>>>>>>>>
>>>>>>>>> unmask_irq(desc);
>>>>>>>>>
>>>>>>>>> I also mentioned that in patch #3 description:
>>>>>>>>> "Linux keeps such interrupt masked until its threaded handler finishes, to
>>>>>>>>> prevent the EOI from re-asserting an unacknowledged interrupt.
>>>>>>>> That makes sense. Can you include the full story in the cover letter too?
>>>>>>> Ok, I will.
>>>>>>>
>>>>>>>>
>>>>>>>>> However, with KVM + vfio (or whatever is listening on the resamplefd) we don't
>>>>>>>>> check that the interrupt is still masked in the guest at the moment of EOI.
>>>>>>>>> Resamplefd is notified regardless, so vfio prematurely unmasks the host
>>>>>>>>> physical IRQ, thus a new (unwanted) physical interrupt is generated in the host
>>>>>>>>> and queued for injection to the guest."
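The native oneshot sequence described above, and how the premature resample breaks it, can be sketched as a small stand-alone model (illustrative only, not kernel code; the struct and function names here are invented):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a level-triggered line: the device holds the line high
 * until the driver acks it, and EOI/unmask re-deliver the interrupt
 * whenever the line is still high and not masked. */
struct irq_model {
	bool line_asserted;	/* device keeps the line high until acked */
	bool masked;		/* mask bit at the interrupt controller */
	int delivered;		/* interrupts that reached the handler */
};

static void eoi(struct irq_model *irq)
{
	if (irq->line_asserted && !irq->masked)
		irq->delivered++;	/* level IRQ immediately re-asserts */
}

static void unmask(struct irq_model *irq)
{
	irq->masked = false;
	if (irq->line_asserted)
		irq->delivered++;
}

/* Native Linux oneshot (IRQF_ONESHOT) flow: mask in hardirq, EOI,
 * ack the device in the threaded handler, then unmask. */
static void handle_oneshot(struct irq_model *irq)
{
	irq->delivered++;		/* hardirq entry */
	irq->masked = true;		/* mask_irq() for IRQS_ONESHOT */
	eoi(irq);			/* EOI while masked: no re-assert */
	irq->line_asserted = false;	/* threaded handler acks the device */
	unmask(irq);			/* unmask_threaded_irq() */
}

/* The broken forwarded case: EOI (resamplefd notify -> vfio unmask)
 * happens while the device is still asserting the line. */
static void handle_premature_eoi(struct irq_model *irq)
{
	irq->delivered++;
	eoi(irq);			/* spurious extra interrupt here */
	irq->line_asserted = false;
}
```

Running both flows against an asserted line shows one delivery for the native oneshot path and a spurious second delivery for the premature-EOI path, which is exactly the extra guest interrupt the series fixes.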
>>>>>> Sorry to barge in pretty late in the conversation (just been Cc'd on
>>>>>> this), but why shouldn't the resamplefd be notified? If there has been
>>>>> yeah sorry to get you involved here ;-)
>>>>
>>>> No problem!
>>>>
>>>>>> an EOI, a new level must be made visible to the guest interrupt
>>>>>> controller, no matter what the state of the interrupt masking is.
>>>>>>
>>>>>> Whether this new level is actually *presented* to a vCPU is another
>>>>>> matter entirely, and is arguably a problem for the interrupt
>>>>>> controller emulation.
>>>>>
>>>>> FWIU on guest EOI the physical line is still asserted so the pIRQ is
>>>>> immediately re-sampled by the interrupt controller (because the
>>>>> resamplefd unmasked the physical IRQ) and recorded as a guest IRQ
>>>>> (although it is masked at guest level). When the guest actually unmasks
>>>>> the vIRQ we do not get a chance to re-evaluate the physical line level.
>>>>
>>>> Indeed, and maybe this is what should be fixed instead of moving the
>>>> resampling point around (I was suggesting something along these lines
>>>> in [1]).
>>>>
>>>> We already do this on arm64 for the timer, and it should be easy
>>>> enough to generalise to any interrupt backed by the GIC (there is an
>>>> in-kernel API to sample the pending state). No idea how that translates
>>>> to other architectures though.
>>>
>>> Actually I'm now thinking about changing the behavior implemented in my
>>> patchset, which is:
>>>
>>> 1. If vEOI happens for a masked vIRQ, don't notify resamplefd, so
>>> that no new physical IRQ is generated, and the vIRQ is not set as
>>> pending.
>>>
>>> 2. After this vIRQ is unmasked by the guest, notify resamplefd.
>>>
>>> to the following one:
>>>
>>> 1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
>>> but also remember this vIRQ as, let's call it, "pending oneshot".
>>>
>>> 2. A new physical IRQ is immediately generated, so the vIRQ is
>>> properly set as pending.
>>>
>>> 3. After the vIRQ is unmasked by the guest, check and find out that
>>> it is not just pending but also "pending oneshot", so don't
>>> deliver it to a vCPU. Instead, immediately notify resamplefd once
>>> again.
>>>
>>> In other words, don't avoid extra physical interrupts in the host
>>> (rather, use those extra interrupts for properly updating the pending
>>> state of the vIRQ) but avoid propagating those extra interrupts to the
>>> guest.
>>>
>>> Does this sound reasonable to you?
>>
>> It does. I'm a bit concerned about the extra state (more state, more
>> problems...), but let's see the implementation.
>>
>>> Your suggestion to sample the pending state of the physical IRQ sounds
>>> interesting too. But as you said, it's yet to be checked how feasible it
>>> would be on architectures other than arm64. Also it assumes that the IRQ
>>> in question is a forwarded physical interrupt, while I can imagine that
>>> KVM's resamplefd could in principle also be useful for implementing
>>> purely emulated interrupts.
>>
>> No, there is no requirement for this being a forwarded interrupt. The
>> vgic code does that for forwarded interrupts, but the core code could
>> do that too if the information is available (irq_get_irqchip_state()
>> was introduced for this exact purpose).
>
> I meant "forwarding" in a generic sense, not vgic specific. I.e. the
> forwarding itself may be done generically by software, e.g. by vfio, but
> the source is in any case a physical HW interrupt.
>
> Whereas I also have in mind cases where an irqfd user injects purely
> virtual interrupts, not coming from HW. I don't know any particular use
> case for that, but irqfd doesn't seem to prohibit such use cases. So I
> was thinking that maybe it's better to keep it this way, i.e. not depend
> on reading physical HW state in KVM. Or am I trying to be too generic here?
Aren't virtio interrupts, for example, a real existing use case for that
(KVM irqfd used for completely virtual interrupts not coming from HW)?
Perhaps this itself is not a problem. I guess that for sampling the physical
IRQ state we'd need to extend the KVM_IRQFD ioctl interface anyway, to
provide KVM with at least the host IRQ number. We could then also tell KVM
that there is no host IRQ at all, i.e. that the given IRQ is a purely virtual
interrupt like virtio, so that KVM would not use physical sampling in that
case.
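One purely hypothetical shape for such an extension (KVM_IRQFD_FLAG_HOST_IRQ, host_irq and struct kvm_irqfd_ext are invented here, not existing uapi) could carve the new field out of the 16 pad bytes that the real struct kvm_irqfd reserves, keeping the 32-byte layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KVM_IRQFD_FLAG_DEASSIGN	(1 << 0)	/* existing flag */
#define KVM_IRQFD_FLAG_RESAMPLE	(1 << 1)	/* existing flag */
#define KVM_IRQFD_FLAG_HOST_IRQ	(1 << 2)	/* invented for this sketch */

/* Sketch only: mirrors the layout of the real struct kvm_irqfd
 * (fd, gsi, flags, resamplefd, 16 pad bytes) with one pad word
 * repurposed as the host IRQ number. The flag left unset would mean
 * "purely virtual interrupt (e.g. virtio), don't sample HW state". */
struct kvm_irqfd_ext {
	uint32_t fd;
	uint32_t gsi;
	uint32_t flags;
	uint32_t resamplefd;
	uint32_t host_irq;
	uint8_t pad[12];
};
```

Since old userspace zeroes the pad, the kernel could treat host_irq == 0 with the flag unset as the backward-compatible "no host IRQ information" case.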
>
>>
>>> Do you see any advantages of sampling the physical IRQ pending state
>>> over remembering the "pending oneshot" state as described above?
>>
>> The advantage is to not maintain some extra state, as this is usually
>> a source of problem, but to get to the source (the HW pending state).
>>
>> It also solves the "pending in the vgic but not pending in the HW"
>> problem, as reading the pending state causes an exit (the register is
>> emulated), and as part of the exit handling we already perform the
>> resample. We just need to extend this to check the HW state, and
>> correct the pending state if required, making sure that the emulation
>> will return an accurate view.
>
> BTW, it seems that besides this "pending in the guest but not in the
> host" issue, we also already have an opposite issue ("pending in the
> host but not in the guest"): upon the guest EOI, we unconditionally
> deassert the vIRQ in irqfd_resampler_ack() before notifying resamplefd,
> even if the pIRQ is still asserted. So there is a time window (before
> the new pIRQ trigger event makes it to KVM) when the IRQ may be pending
> in the host but not in the guest.
>
> Am I right that this is actually an issue, and that sampling the
> physical state could help with this issue too?
So perhaps there is another possibility (to avoid this particular issue,
as we don't implement physical sampling yet): postpone notifying
resamplefd as in this patch series, but also postpone changing the
vIRQ state from pending to not pending. I'm not sure it's a better
option though.
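For reference, the three-step "pending oneshot" scheme proposed earlier in this mail can be modelled stand-alone like this (all names invented, not actual KVM code):

```c
#include <assert.h>
#include <stdbool.h>

struct virq {
	bool masked;		/* masked at the guest interrupt controller */
	bool pending;
	bool pending_oneshot;	/* the extra state proposed above */
	int resample_notifies;	/* times resamplefd was signalled */
	int injected;		/* times the vIRQ was delivered to a vCPU */
};

/* Step 1: on vEOI notify resamplefd as usual, but remember the vIRQ
 * if the guest still has it masked. */
static void guest_eoi(struct virq *v)
{
	v->resample_notifies++;
	if (v->masked)
		v->pending_oneshot = true;
}

/* Step 2: the new physical IRQ arrives and marks the vIRQ pending. */
static void host_irq_fires(struct virq *v)
{
	v->pending = true;
}

/* Step 3: on guest unmask, a "pending oneshot" vIRQ is not injected
 * into a vCPU; resamplefd is notified again instead. */
static void guest_unmask(struct virq *v)
{
	v->masked = false;
	if (!v->pending)
		return;
	v->pending = false;
	if (v->pending_oneshot) {
		v->pending_oneshot = false;
		v->resample_notifies++;
	} else {
		v->injected++;
	}
}
```

In the oneshot scenario (vEOI while masked, then a new physical IRQ, then guest unmask) this model ends with two resamplefd notifications and zero injections, i.e. the extra host interrupt updates the pending state without reaching the guest.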
Hi Rong,
On 8/12/22 12:40 AM, Liu, Rong L wrote:
> Hi Paolo and Dmytro,
>
>> -----Original Message-----
>> From: Paolo Bonzini <[email protected]>
>> Sent: Wednesday, August 10, 2022 11:48 PM
>> To: Dmytro Maluka <[email protected]>; Marc Zyngier
>> <[email protected]>; [email protected]
>> Cc: Dong, Eddie <[email protected]>; Christopherson,, Sean
>> <[email protected]>; [email protected]; Thomas Gleixner
>> <[email protected]>; Ingo Molnar <[email protected]>; Borislav
>> Petkov <[email protected]>; Dave Hansen <[email protected]>;
>> [email protected]; H. Peter Anvin <[email protected]>; linux-
>> [email protected]; Alex Williamson <[email protected]>;
>> Liu, Rong L <[email protected]>; Zhenyu Wang
>> <[email protected]>; Tomasz Nowicki <[email protected]>;
>> Grzegorz Jaszczyk <[email protected]>; [email protected];
>> Dmitry Torokhov <[email protected]>
>> Subject: Re: [PATCH v2 0/5] KVM: Fix oneshot interrupts forwarding
>>
>> On 8/10/22 19:02, Dmytro Maluka wrote:
>>> 1. If vEOI happens for a masked vIRQ, notify resamplefd as usual,
>>> but also remember this vIRQ as, let's call it, "pending oneshot".
>>>
>
> This is the part that always confuses me. In the x86 case, for a level-triggered
> interrupt, even if it is not oneshot, there is still an "unmask", and the unmask
> happens in the same sequence as for a oneshot interrupt; only the timing is different.
> So are you going to differentiate oneshot from "normal" level-triggered
> interrupts or not? And is there any situation in which vEOI happens for an unmasked
> vIRQ?
We were already talking about this in [1] and before. It still seems to me
that your statement is wrong: with the x86 ioapic, "normal"
level-triggered interrupts normally stay unmasked all the time, and only
EOI is used for interrupt completion. To double-check, I once traced
KVM's ioapic_write_indirect() and confirmed that it is not called when a
Linux guest is handling a "normal" level-triggered interrupt.
However, even if you were right and an EOI for a normal interrupt were
always followed by an unmask, this proposal would still work correctly.
>
>>> 2. A new physical IRQ is immediately generated, so the vIRQ is
>>> properly set as pending.
>>>
>
> I am not sure this is always the case. For example, a device may not raise a
> new interrupt until it is notified of "done reading", e.g. by the device driver
> writing to a register when it finishes reading the data. So how do you handle
> this situation?
Right, the device will not raise new interrupts, but it also will not
lower the currently pending interrupt until "done reading". Precisely
for this reason the host will receive a new interrupt immediately after
vfio unmasks the physical IRQ.
It's also possible that the driver notifies "done reading" early enough
that the device lowers the interrupt before vfio unmasks it, so no new
physical interrupt is generated. That is fine too: it means the
physical IRQ is no longer pending, so we don't need to notify KVM to
set the virtual IRQ status to "pending".
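Both outcomes follow from the level-triggered semantics alone, as in this toy model (invented names, not vfio code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: whether vfio's unmask re-triggers an interrupt depends
 * only on whether the device is still asserting the line at that
 * moment (i.e. whether the driver has signalled "done reading" yet). */
struct pline {
	bool asserted;	/* device holds the line until "done reading" */
	bool masked;
	int fired;
};

static void vfio_unmask(struct pline *p)
{
	p->masked = false;
	if (p->asserted)
		p->fired++;	/* line still high: new interrupt at once */
}
```

If the driver acks late, the unmask fires a new interrupt that updates the pending state; if it acks early, nothing fires and nothing needs updating.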
>
>>> 3. After the vIRQ is unmasked by the guest, check and find out that
>>> it is not just pending but also "pending oneshot", so don't
>>> deliver it to a vCPU. Instead, immediately notify resamplefd once
>>> again.
>>>
>
> Does this mean changing the vfio code as well? That seems to be the case: vfio
> seems to keep its own internal "state" of whether the irq is enabled or not.
I don't quite get why it would require changing vfio. Could you
elaborate?
[1] https://lore.kernel.org/kvm/[email protected]/
Thanks,
Dmytro
>
> Thanks,
>
> Rong
>>> In other words, don't avoid extra physical interrupts in the host
>>> (rather, use those extra interrupts for properly updating the pending
>>> state of the vIRQ) but avoid propagating those extra interrupts to the
>>> guest.
>>>
>>> Does this sound reasonable to you?
>>
>> Yeah, this makes sense and it lets the resamplefd set the "pending"
>> status in the vGIC. It still has the issue that the interrupt can
>> remain pending in the guest for longer than it's pending on the host,
>> but that can't be fixed?
>>
>> Paolo
>