Besides 3 cleanup patches, this series consists of two major changes.
The first introduces an interrupt sharing notifier to the genirq
subsystem. It fires when an interrupt line is about to be used by more
than one driver, or when the second-to-last user calls free_irq.
The second major change makes use of this interface in KVM's PCI pass-
through subsystem. KVM has to keep the interrupt source disabled while
calling into the guest to handle the event. This can be done at device
or line level. The former is required to share the interrupt line, the
latter is an order of magnitude faster (see patch 3 for details).
Besides KVM's pass-through support, further users of the IRQ notifier
could be VFIO (not yet mainline) and uio_pci_generic, which have to
resolve the same conflict.
Jan Kiszka (5):
genirq: Pass descriptor to __free_irq
genirq: Introduce interrupt sharing notifier
KVM: Split up MSI-X assigned device IRQ handler
KVM: Clean up unneeded void pointer casts
KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
Documentation/kvm/api.txt | 25 +++
arch/x86/kvm/x86.c | 1 +
include/linux/interrupt.h | 21 +++
include/linux/irqdesc.h | 9 +
include/linux/kvm.h | 6 +
include/linux/kvm_host.h | 4 +
kernel/irq/irqdesc.c | 6 +
kernel/irq/manage.c | 174 ++++++++++++++++++--
virt/kvm/assigned-dev.c | 420 +++++++++++++++++++++++++++++++++++++++------
9 files changed, 606 insertions(+), 60 deletions(-)
From: Jan Kiszka <[email protected]>
One caller of __free_irq already has to resolve and check the irq
descriptor. So this small cleanup consistently pushes irq_to_desc and
the NULL check to the call sites to avoid redundant work.
Signed-off-by: Jan Kiszka <[email protected]>
---
kernel/irq/manage.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5f92acc..6341765 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -879,17 +879,14 @@ EXPORT_SYMBOL_GPL(setup_irq);
* Internal function to unregister an irqaction - used to free
* regular and special interrupts that are part of the architecture.
*/
-static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
+static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id)
{
- struct irq_desc *desc = irq_to_desc(irq);
struct irqaction *action, **action_ptr;
+ unsigned int irq = desc->irq_data.irq;
unsigned long flags;
WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
- if (!desc)
- return NULL;
-
raw_spin_lock_irqsave(&desc->lock, flags);
/*
@@ -977,7 +974,12 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
*/
void remove_irq(unsigned int irq, struct irqaction *act)
{
- __free_irq(irq, act->dev_id);
+ struct irq_desc *desc = irq_to_desc(irq);
+
+ if (!desc)
+ return;
+
+ __free_irq(desc, act->dev_id);
}
EXPORT_SYMBOL_GPL(remove_irq);
@@ -1003,7 +1005,7 @@ void free_irq(unsigned int irq, void *dev_id)
return;
chip_bus_lock(desc);
- kfree(__free_irq(irq, dev_id));
+ kfree(__free_irq(desc, dev_id));
chip_bus_sync_unlock(desc);
}
EXPORT_SYMBOL(free_irq);
--
1.7.1
From: Jan Kiszka <[email protected]>
PCI 2.3 allows IRQ sources to be disabled generically at device level.
This enables us to share the IRQs of such devices on the host side when
passing them to a guest.
However, disabling IRQs via the PCI config space is more costly than
masking the line via disable_irq. Therefore, we register an IRQ sharing
notifier and switch between line-level and device-level disabling on
demand.
This feature is optional; user space has to request it explicitly, as it
also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
way, we can avoid unmasking the interrupt and signaling it if the guest
has masked it via the PCI config space.
Signed-off-by: Jan Kiszka <[email protected]>
---
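For illustration (not part of the patch), the device-level mask update that pci_2_3_set_irq_mask performs below can be modeled as a pure function on the combined command/status dword. The constants mirror PCI_COMMAND_INTX_DISABLE and PCI_STATUS_INTERRUPT from pci_regs.h; the function name is made up for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* command bit 10 */
#define PCI_STATUS_INTERRUPT     0x008 /* status bit 3 */

/*
 * User-space model of the update: given the dword read from PCI_COMMAND
 * (command in the low half, status in the high half), decide whether the
 * mask may be changed and compute the new command word. With check_status
 * set, masking only proceeds if our device asserted the IRQ, and unmasking
 * bails out if the next IRQ is already pending.
 */
static bool intx_mask_update(uint32_t cmd_status, bool mask,
			     bool check_status, uint16_t *newcmd)
{
	if (check_status) {
		bool irq_pending = (cmd_status >> 16) & PCI_STATUS_INTERRUPT;

		if (mask != irq_pending)
			return false;
	}
	*newcmd = (uint16_t)cmd_status & ~PCI_COMMAND_INTX_DISABLE;
	if (mask)
		*newcmd |= PCI_COMMAND_INTX_DISABLE;
	return true;
}
```

In the patch, the write back to PCI_COMMAND is additionally skipped when the computed command equals the original, and the whole sequence runs under pci_block_user_cfg_access.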
Documentation/kvm/api.txt | 25 +++
arch/x86/kvm/x86.c | 1 +
include/linux/kvm.h | 6 +
include/linux/kvm_host.h | 4 +
virt/kvm/assigned-dev.c | 382 +++++++++++++++++++++++++++++++++++++++++----
5 files changed, 385 insertions(+), 33 deletions(-)
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..5e164db 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
/* Depends on KVM_CAP_IOMMU */
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
+assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view of the INTx mask; see KVM_ASSIGN_SET_INTX_MASK for details.
4.48 KVM_DEASSIGN_PCI_DEVICE
@@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
};
+4.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Informs the kernel about the guest's view of the INTx mask. As long as the
+guest masks the legacy INTx, the kernel will refrain from unmasking it at
+hardware level and will not assert the guest's IRQ line. User space is still
+responsible for applying this state to the assigned device's real config space.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
5. The kvm_run structure
Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 410d2d1..c77c129 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1969,6 +1969,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+ case KVM_CAP_PCI_2_3:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..3cadb42 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo {
#define KVM_CAP_PPC_GET_PVINFO 57
#define KVM_CAP_PPC_IRQ_LEVEL 58
#define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_PCI_2_3 60
#ifdef KVM_CAP_IRQ_ROUTING
@@ -677,6 +678,9 @@ struct kvm_clock_data {
#define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
/* Available with KVM_CAP_PPC_GET_PVINFO */
#define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_PCI_2_3 */
+#define KVM_ASSIGN_SET_INTX_MASK _IOW(KVMIO, 0xa2, \
+ struct kvm_assigned_pci_dev)
/*
* ioctls for vcpu fds
@@ -742,6 +746,8 @@ struct kvm_clock_data {
#define KVM_SET_XCRS _IOW(KVMIO, 0xa7, struct kvm_xcrs)
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
struct kvm_assigned_pci_dev {
__u32 assigned_dev_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index da0794f..6849568 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -18,6 +18,7 @@
#include <linux/msi.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
+#include <linux/notifier.h>
#include <asm/signal.h>
#include <linux/kvm.h>
@@ -485,6 +486,7 @@ struct kvm_assigned_dev_kernel {
unsigned int entries_nr;
int host_irq;
bool host_irq_disabled;
+ bool pci_2_3_masking;
struct msix_entry *host_msix_entries;
int guest_irq;
struct msix_entry *guest_msix_entries;
@@ -494,7 +496,9 @@ struct kvm_assigned_dev_kernel {
struct pci_dev *dev;
struct kvm *kvm;
spinlock_t intx_lock;
+ spinlock_t intx_mask_lock;
char irq_name[32];
+ struct notifier_block sh_notifier;
};
struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index c6114d3..633bab3 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,22 +55,124 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
return index;
}
-static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
+static bool
+pci_2_3_set_irq_mask(struct pci_dev *dev, bool mask, bool check_status)
+{
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd;
+ bool mask_updated = true;
+
+ /*
+ * We do a single dword read to retrieve both command and status.
+ * Document assumptions that make this possible.
+ */
+ BUILD_BUG_ON(PCI_COMMAND % 4);
+ BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
+
+ pci_block_user_cfg_access(dev);
+
+ /*
+ * Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs.
+ */
+ pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
+
+ if (check_status) {
+ bool irq_pending =
+ (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
+
+ /*
+ * Check interrupt status register to see whether our device
+ * triggered the interrupt (when masking) or the next IRQ is
+ * already pending (when unmasking).
+ */
+ if (mask != irq_pending) {
+ mask_updated = false;
+ goto done;
+ }
+ }
+
+ origcmd = cmd_status_dword;
+ newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
+ if (mask)
+ newcmd |= PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(dev, PCI_COMMAND, newcmd);
+
+done:
+ pci_unblock_user_cfg_access(dev);
+ return mask_updated;
+}
+
+static void pci_2_3_irq_mask(struct pci_dev *dev)
+{
+ pci_2_3_set_irq_mask(dev, true, false);
+}
+
+static bool pci_2_3_irq_check_and_mask(struct pci_dev *dev)
+{
+ return pci_2_3_set_irq_mask(dev, true, true);
+}
+
+static void pci_2_3_irq_unmask(struct pci_dev *dev)
+{
+ pci_2_3_set_irq_mask(dev, false, false);
+}
+
+static bool pci_2_3_irq_check_and_unmask(struct pci_dev *dev)
+{
+ return pci_2_3_set_irq_mask(dev, false, true);
+}
+
+static irqreturn_t kvm_assigned_dev_intr_intx(int irq, void *dev_id)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+ int ret;
+
+ spin_lock(&assigned_dev->intx_lock);
+ if (pci_2_3_irq_check_and_mask(assigned_dev->dev)) {
+ assigned_dev->host_irq_disabled = true;
+ ret = IRQ_WAKE_THREAD;
+ } else
+ ret = IRQ_NONE;
+ spin_unlock(&assigned_dev->intx_lock);
+
+ return ret;
+}
+
+static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id)
{
struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
- spin_lock(&assigned_dev->intx_lock);
+ if (!assigned_dev->pci_2_3_masking) {
+ spin_lock_irq(&assigned_dev->intx_lock);
disable_irq_nosync(irq);
assigned_dev->host_irq_disabled = true;
- spin_unlock(&assigned_dev->intx_lock);
+ spin_unlock_irq(&assigned_dev->intx_lock);
}
+ spin_lock(&assigned_dev->intx_mask_lock);
+ if (!(assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX))
+ kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+ assigned_dev->guest_irq, 1);
+ spin_unlock(&assigned_dev->intx_mask_lock);
+
+ return IRQ_HANDLED;
+}
+
+#ifdef __KVM_HAVE_MSI
+static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+
kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
assigned_dev->guest_irq, 1);
return IRQ_HANDLED;
}
+#endif
#ifdef __KVM_HAVE_MSIX
static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
@@ -102,15 +204,114 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
- /* The guest irq may be shared so this ack may be
- * from another device.
- */
- spin_lock(&dev->intx_lock);
- if (dev->host_irq_disabled) {
- enable_irq(dev->host_irq);
- dev->host_irq_disabled = false;
+ if (likely(!(dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX)))
+ return;
+
+ spin_lock(&dev->intx_mask_lock);
+
+ if (!(dev->flags & KVM_DEV_ASSIGN_MASK_INTX)) {
+ bool reassert = false;
+
+ spin_lock_irq(&dev->intx_lock);
+ /*
+ * The guest IRQ may be shared so this ack can come from an
+ * IRQ for another guest device.
+ */
+ if (dev->host_irq_disabled) {
+ if (!dev->pci_2_3_masking)
+ enable_irq(dev->host_irq);
+ else if (!pci_2_3_irq_check_and_unmask(dev->dev))
+ reassert = true;
+ dev->host_irq_disabled = reassert;
+ }
+ spin_unlock_irq(&dev->intx_lock);
+
+ if (reassert)
+ kvm_set_irq(dev->kvm, dev->irq_source_id,
+ dev->guest_irq, 1);
}
- spin_unlock(&dev->intx_lock);
+
+ spin_unlock(&dev->intx_mask_lock);
+}
+
+static int notify_host_irq(struct notifier_block *nb, unsigned long event,
+ void *perror)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev =
+ container_of(nb, struct kvm_assigned_dev_kernel, sh_notifier);
+ int *perr = perror;
+
+ switch (event) {
+ case IRQN_SHARED:
+ /* will be reenabled by request_threaded_irq */
+ disable_irq(assigned_dev->host_irq);
+
+ /* synchronize with kvm_assigned_dev_ack_irq */
+ spin_lock_irq(&assigned_dev->intx_lock);
+ if (assigned_dev->host_irq_disabled) {
+ /* reduce disabling depth here as we switch the type */
+ enable_irq(assigned_dev->host_irq);
+ assigned_dev->host_irq_disabled = false;
+ }
+ spin_unlock_irq(&assigned_dev->intx_lock);
+
+ free_irq(assigned_dev->host_irq, assigned_dev);
+ /* fall through */
+ case IRQN_SETUP_USED:
+ assigned_dev->pci_2_3_masking = true;
+ *perr = request_threaded_irq(assigned_dev->host_irq,
+ kvm_assigned_dev_intr_intx,
+ kvm_assigned_dev_thread_intx,
+ IRQF_SHARED,
+ assigned_dev->irq_name,
+ assigned_dev);
+ if (*perr)
+ return NOTIFY_BAD;
+
+ spin_lock_irq(&assigned_dev->intx_lock);
+ pci_2_3_irq_unmask(assigned_dev->dev);
+ spin_unlock_irq(&assigned_dev->intx_lock);
+ break;
+ case IRQN_EXCLUSIVE:
+ /* will be reenabled by request_threaded_irq */
+ disable_irq(assigned_dev->host_irq);
+
+ /* synchronize with kvm_assigned_dev_ack_irq */
+ spin_lock_irq(&assigned_dev->intx_lock);
+ if (assigned_dev->host_irq_disabled) {
+ /* re-enable at PCI level as we switch the type */
+ pci_2_3_irq_unmask(assigned_dev->dev);
+ assigned_dev->host_irq_disabled = false;
+ }
+ spin_unlock_irq(&assigned_dev->intx_lock);
+
+ free_irq(assigned_dev->host_irq, assigned_dev);
+ /* fall through */
+ case IRQN_SETUP_UNUSED:
+ assigned_dev->pci_2_3_masking = false;
+ *perr = request_threaded_irq(assigned_dev->host_irq, NULL,
+ kvm_assigned_dev_thread_intx,
+ IRQF_ONESHOT,
+ assigned_dev->irq_name,
+ assigned_dev);
+ if (*perr)
+ return NOTIFY_BAD;
+ break;
+ case IRQN_SHUTDOWN:
+ if (assigned_dev->pci_2_3_masking) {
+ spin_lock_irq(&assigned_dev->intx_lock);
+ pci_2_3_irq_mask(assigned_dev->dev);
+ /* prevent re-enabling by kvm_assigned_dev_ack_irq */
+ assigned_dev->host_irq_disabled = false;
+ spin_unlock_irq(&assigned_dev->intx_lock);
+ synchronize_irq(assigned_dev->host_irq);
+ } else
+ disable_irq(assigned_dev->host_irq);
+
+ free_irq(assigned_dev->host_irq, assigned_dev);
+ break;
+ }
+ return NOTIFY_OK;
}
static void deassign_guest_irq(struct kvm *kvm,
@@ -132,6 +333,8 @@ static void deassign_guest_irq(struct kvm *kvm,
static void deassign_host_irq(struct kvm *kvm,
struct kvm_assigned_dev_kernel *assigned_dev)
{
+ int unused;
+
/*
* We disable irq here to prevent further events.
*
@@ -155,15 +358,16 @@ static void deassign_host_irq(struct kvm *kvm,
kfree(assigned_dev->host_msix_entries);
kfree(assigned_dev->guest_msix_entries);
pci_disable_msix(assigned_dev->dev);
- } else {
- /* Deal with MSI and INTx */
- disable_irq(assigned_dev->host_irq);
-
+ } else if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) {
free_irq(assigned_dev->host_irq, assigned_dev);
-
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
- pci_disable_msi(assigned_dev->dev);
- }
+ pci_disable_msi(assigned_dev->dev);
+ } else if (assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3)
+ unregister_irq_sharing_notifier(assigned_dev->host_irq,
+ &assigned_dev->sh_notifier);
+ else
+ /* We have no notifier registered, so fire manually. */
+ notify_host_irq(&assigned_dev->sh_notifier, IRQN_SHUTDOWN,
+ &unused);
assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_HOST_MASK);
}
@@ -231,15 +435,28 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
static int assigned_device_enable_host_intx(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev)
{
+ int unused;
+
dev->host_irq = dev->dev->irq;
- /* Even though this is PCI, we don't want to use shared
- * interrupts. Sharing host devices with guest-assigned devices
- * on the same interrupt line is not a happy situation: there
- * are going to be long delays in accepting, acking, etc.
+
+ /*
+ * We can only share the IRQ line with other host devices if we are
+ * able to disable the IRQ source at device-level - independently of
+ * the guest driver. Otherwise host devices may suffer from unbounded
+ * IRQ latencies when the guest keeps the line asserted.
+ * If PCI 2.3 support is available, we can install a sharing notifier
+ * and apply the required disabling pattern on demand.
*/
- if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
- IRQF_ONESHOT, dev->irq_name, dev))
- return -EIO;
+ if (dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+ dev->sh_notifier.notifier_call = notify_host_irq;
+ if (register_irq_sharing_notifier(dev->host_irq,
+ &dev->sh_notifier))
+ return -EIO;
+ } else
+ /* Fire the notify manually for exclusive use. */
+ if (notify_host_irq(&dev->sh_notifier, IRQN_SETUP_UNUSED,
+ &unused) != NOTIFY_OK)
+ return -EIO;
return 0;
}
@@ -256,8 +473,9 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
}
dev->host_irq = dev->dev->irq;
- if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
- 0, dev->irq_name, dev)) {
+ if (request_threaded_irq(dev->host_irq, NULL,
+ kvm_assigned_dev_thread_msi, 0,
+ dev->irq_name, dev)) {
pci_disable_msi(dev->dev);
return -EIO;
}
@@ -296,7 +514,6 @@ err:
pci_disable_msix(dev->dev);
return r;
}
-
#endif
static int assigned_device_enable_guest_intx(struct kvm *kvm,
@@ -315,7 +532,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
return 0;
}
#endif
@@ -327,7 +543,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
return 0;
}
#endif
@@ -361,6 +576,7 @@ static int assign_host_irq(struct kvm *kvm,
default:
r = -EINVAL;
}
+ dev->host_irq_disabled = false;
if (!r)
dev->irq_requested_type |= host_irq_type;
@@ -461,6 +677,7 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
{
int r = -ENODEV;
struct kvm_assigned_dev_kernel *match;
+ unsigned long irq_type;
mutex_lock(&kvm->lock);
@@ -469,12 +686,51 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
if (!match)
goto out;
- r = kvm_deassign_irq(kvm, match, assigned_irq->flags);
+ irq_type = assigned_irq->flags & (KVM_DEV_IRQ_HOST_MASK |
+ KVM_DEV_IRQ_GUEST_MASK);
+ r = kvm_deassign_irq(kvm, match, irq_type);
out:
mutex_unlock(&kvm->lock);
return r;
}
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2.
+ */
+static bool pci_2_3_supported(struct pci_dev *pdev)
+{
+ u16 orig, new;
+
+ pci_block_user_cfg_access(pdev);
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+
+ pci_unblock_user_cfg_access(pdev);
+
+ /*
+ * There's no way to protect against
+ * hardware bugs or detect them reliably, but as long as we know
+ * what the value should be, let's go ahead and check it.
+ */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ return false;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support disabling "
+ "interrupts, IRQ sharing impossible.\n");
+ return false;
+ }
+ return true;
+}
+
static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
struct kvm_assigned_pci_dev *assigned_dev)
{
@@ -523,6 +779,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
pci_reset_function(dev);
pci_save_state(dev);
+ if (!pci_2_3_supported(dev))
+ assigned_dev->flags &= ~KVM_DEV_ASSIGN_PCI_2_3;
+
match->assigned_dev_id = assigned_dev->assigned_dev_id;
match->host_segnr = assigned_dev->segnr;
match->host_busnr = assigned_dev->busnr;
@@ -530,6 +789,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
match->flags = assigned_dev->flags;
match->dev = dev;
spin_lock_init(&match->intx_lock);
+ spin_lock_init(&match->intx_mask_lock);
match->irq_source_id = -1;
match->kvm = kvm;
match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
@@ -676,6 +936,53 @@ msix_entry_out:
}
#endif
+static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
+ struct kvm_assigned_pci_dev *assigned_dev)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *match;
+
+ mutex_lock(&kvm->lock);
+
+ match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_dev->assigned_dev_id);
+ if (!match) {
+ r = -ENODEV;
+ goto out;
+ }
+
+ spin_lock(&match->intx_mask_lock);
+
+ match->flags &= ~KVM_DEV_ASSIGN_MASK_INTX;
+ match->flags |= assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX;
+
+ if (assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX) {
+ kvm_set_irq(match->kvm, match->irq_source_id,
+ match->guest_irq, 0);
+ /*
+ * Masking at hardware-level is performed on demand, i.e. when
+ * an IRQ actually arrives at the host.
+ */
+ } else {
+ /*
+ * Unmask the IRQ line. It may have been masked meanwhile if
+ * we aren't using PCI 2.3 INTx masking on the host side.
+ */
+ spin_lock_irq(&match->intx_lock);
+ if (match->host_irq_disabled) {
+ enable_irq(match->host_irq);
+ match->host_irq_disabled = false;
+ }
+ spin_unlock_irq(&match->intx_lock);
+ }
+
+ spin_unlock(&match->intx_mask_lock);
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
unsigned long arg)
{
@@ -783,6 +1090,15 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
break;
}
#endif
+ case KVM_ASSIGN_SET_INTX_MASK: {
+ struct kvm_assigned_pci_dev assigned_dev;
+
+ r = -EFAULT;
+ if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
+ goto out;
+ r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
+ break;
+ }
default:
r = -ENOTTY;
break;
--
1.7.1
From: Jan Kiszka <[email protected]>
Signed-off-by: Jan Kiszka <[email protected]>
---
virt/kvm/assigned-dev.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 3f8a745..c6114d3 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -149,7 +149,7 @@ static void deassign_host_irq(struct kvm *kvm,
for (i = 0; i < assigned_dev->entries_nr; i++)
free_irq(assigned_dev->host_msix_entries[i].vector,
- (void *)assigned_dev);
+ assigned_dev);
assigned_dev->entries_nr = 0;
kfree(assigned_dev->host_msix_entries);
@@ -159,7 +159,7 @@ static void deassign_host_irq(struct kvm *kvm,
/* Deal with MSI and INTx */
disable_irq(assigned_dev->host_irq);
- free_irq(assigned_dev->host_irq, (void *)assigned_dev);
+ free_irq(assigned_dev->host_irq, assigned_dev);
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
pci_disable_msi(assigned_dev->dev);
@@ -238,7 +238,7 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
* are going to be long delays in accepting, acking, etc.
*/
if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
- IRQF_ONESHOT, dev->irq_name, (void *)dev))
+ IRQF_ONESHOT, dev->irq_name, dev))
return -EIO;
return 0;
}
@@ -257,7 +257,7 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
dev->host_irq = dev->dev->irq;
if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
- 0, dev->irq_name, (void *)dev)) {
+ 0, dev->irq_name, dev)) {
pci_disable_msi(dev->dev);
return -EIO;
}
@@ -284,7 +284,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
for (i = 0; i < dev->entries_nr; i++) {
r = request_threaded_irq(dev->host_msix_entries[i].vector,
NULL, kvm_assigned_dev_thread_msix,
- 0, dev->irq_name, (void *)dev);
+ 0, dev->irq_name, dev);
if (r)
goto err;
}
@@ -292,7 +292,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
return 0;
err:
for (i -= 1; i >= 0; i--)
- free_irq(dev->host_msix_entries[i].vector, (void *)dev);
+ free_irq(dev->host_msix_entries[i].vector, dev);
pci_disable_msix(dev->dev);
return r;
}
--
1.7.1
From: Jan Kiszka <[email protected]>
The threaded IRQ handler for MSI-X has almost nothing in common with the
INTx/MSI handler. Move its code into a dedicated handler.
Signed-off-by: Jan Kiszka <[email protected]>
---
virt/kvm/assigned-dev.c | 32 +++++++++++++++++++-------------
1 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index ae72ae6..3f8a745 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -58,8 +58,6 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
{
struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
- u32 vector;
- int index;
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
spin_lock(&assigned_dev->intx_lock);
@@ -68,20 +66,28 @@ static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
spin_unlock(&assigned_dev->intx_lock);
}
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
- index = find_index_from_host_irq(assigned_dev, irq);
- if (index >= 0) {
- vector = assigned_dev->
- guest_msix_entries[index].vector;
- kvm_set_irq(assigned_dev->kvm,
- assigned_dev->irq_source_id, vector, 1);
- }
- } else
+ kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+ assigned_dev->guest_irq, 1);
+
+ return IRQ_HANDLED;
+}
+
+#ifdef __KVM_HAVE_MSIX
+static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+ int index = find_index_from_host_irq(assigned_dev, irq);
+ u32 vector;
+
+ if (index >= 0) {
+ vector = assigned_dev->guest_msix_entries[index].vector;
kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
- assigned_dev->guest_irq, 1);
+ vector, 1);
+ }
return IRQ_HANDLED;
}
+#endif
/* Ack the irq line for an assigned device */
static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
@@ -277,7 +283,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
for (i = 0; i < dev->entries_nr; i++) {
r = request_threaded_irq(dev->host_msix_entries[i].vector,
- NULL, kvm_assigned_dev_thread,
+ NULL, kvm_assigned_dev_thread_msix,
0, dev->irq_name, (void *)dev);
if (r)
goto err;
--
1.7.1
From: Jan Kiszka <[email protected]>
This enables drivers to switch their interrupt handling between
exclusive and shared mode.
While switching between exclusive and shared handling is generally of no
benefit, at least one use case profits from adapting the handling mode
at runtime:
Generic stub drivers that hand out PCI devices for unprivileged use in
virtualization or user-space device driver scenarios. They need to mask
the interrupt until the unprivileged driver has had a chance to process
the event. As generic IRQ masking at device level (via the PCI config
space) is at least ten times slower on x86 than disabling at interrupt
controller level, we want to apply the costly method only when there is
a real need.
The sharing notifier provides both the required information about the
number of interrupt handlers and a properly synchronized execution
context for running interrupt de-/registration and mode-switch
procedures. The notifier is called with IRQN_SETUP_USED or
IRQN_SETUP_UNUSED, respectively, during registration to allow setup
according to the current interrupt use. When the number of users is one
on entry to request_threaded_irq or on exit from free_irq, IRQN_SHARED
or IRQN_EXCLUSIVE, respectively, is signaled. Deregistration is signaled
as IRQN_SHUTDOWN to the removed notifier.
Signed-off-by: Jan Kiszka <[email protected]>
---
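As a user-space sketch (the IRQN_* names come from the enum added to interrupt.h; the helper name and the IRQN_NONE placeholder are made up here, and the kernel drives this from desc->users), the rule for which event an already-registered notifier sees can be written as:

```c
#include <assert.h>

/* Subset of the sharing-notifier events; IRQN_NONE is illustrative only. */
enum irqn_event { IRQN_NONE = -1, IRQN_SHARED = 0, IRQN_EXCLUSIVE };

/*
 * A line becomes shared when a second handler is requested, and exclusive
 * again when the second-to-last handler is freed; all other user-count
 * transitions are silent for an already-registered notifier.
 */
static enum irqn_event sharing_event(unsigned int users_before, int requesting)
{
	if (requesting)
		return users_before == 1 ? IRQN_SHARED : IRQN_NONE;
	return users_before == 2 ? IRQN_EXCLUSIVE : IRQN_NONE;
}
```

The IRQN_SETUP_USED/IRQN_SETUP_UNUSED and IRQN_SHUTDOWN events are not modeled here; they fire only on notifier registration and removal, not on user-count changes.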
include/linux/interrupt.h | 21 ++++++
include/linux/irqdesc.h | 9 +++
kernel/irq/irqdesc.c | 6 ++
kernel/irq/manage.c | 158 +++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 190 insertions(+), 4 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..9ebb98f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -96,6 +96,23 @@ enum {
IRQC_IS_NESTED,
};
+/*
+ * These values are passed to the sharing notifier as second argument.
+ *
+ * IRQN_SHARED - interrupt line is shared by two or more devices
+ * IRQN_EXCLUSIVE - interrupt line is used by one device only
+ * IRQN_SETUP_USED - initial notification for interrupt line already in use
+ * IRQN_SETUP_UNUSED - initial notification for yet unused interrupt line
+ * IRQN_SHUTDOWN - shutdown notification on notifier unregistration
+ */
+enum {
+ IRQN_SHARED = 0,
+ IRQN_EXCLUSIVE,
+ IRQN_SETUP_USED,
+ IRQN_SETUP_UNUSED,
+ IRQN_SHUTDOWN,
+};
+
typedef irqreturn_t (*irq_handler_t)(int, void *);
/**
@@ -176,6 +193,10 @@ static inline void exit_irq_thread(void) { }
extern void free_irq(unsigned int, void *);
+int register_irq_sharing_notifier(unsigned int irq, struct notifier_block *nb);
+int unregister_irq_sharing_notifier(unsigned int irq,
+ struct notifier_block *nb);
+
struct device;
extern int __must_check
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 979c68c..eefffef 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -29,6 +29,10 @@ struct timer_rand_state;
* @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
* @dir: /proc/irq/ procfs entry
* @name: flow handler name for /proc/interrupts output
+ * @users: number of registered users (>1: irq is shared)
+ * @sh_lock: mutex to synchronize sharing notifications
+ * @sh_lock_holder: holder of sh_lock, allows recursion
+ * @sh_notifier: fired when the sharing state changes
*/
struct irq_desc {
@@ -80,6 +84,11 @@ struct irq_desc {
struct proc_dir_entry *dir;
#endif
const char *name;
+
+ unsigned int users;
+ struct mutex sh_lock;
+ struct task_struct *sh_lock_holder;
+ struct raw_notifier_head sh_notifier;
} ____cacheline_internodealigned_in_smp;
#ifndef CONFIG_SPARSE_IRQ
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 9988d03..8c5399d 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -83,6 +83,9 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node)
desc->irq_count = 0;
desc->irqs_unhandled = 0;
desc->name = NULL;
+ desc->users = 0;
+ desc->sh_lock_holder = NULL;
+ RAW_INIT_NOTIFIER_HEAD(&desc->sh_notifier);
memset(desc->kstat_irqs, 0, nr_cpu_ids * sizeof(*(desc->kstat_irqs)));
desc_smp_init(desc, node);
}
@@ -144,6 +147,8 @@ static struct irq_desc *alloc_desc(int irq, int node)
raw_spin_lock_init(&desc->lock);
lockdep_set_class(&desc->lock, &irq_desc_lock_class);
+ mutex_init(&desc->sh_lock);
+
desc_set_defaults(irq, desc, node);
return desc;
@@ -254,6 +259,7 @@ int __init early_irq_init(void)
alloc_masks(desc + i, GFP_KERNEL, node);
desc_smp_init(desc + i, node);
lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
+ mutex_init(&desc[i].sh_lock);
}
return arch_early_irq_init();
}
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6341765..a19d621 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -414,7 +414,12 @@ int can_request_irq(unsigned int irq, unsigned long irqflags)
raw_spin_lock_irqsave(&desc->lock, flags);
action = desc->action;
if (action)
- if (irqflags & action->flags & IRQF_SHARED)
+ /*
+ * A registered sharing notifier indicates the ability to turn
+ * the irq into a sharable one on demand.
+ */
+ if (irqflags & action->flags & IRQF_SHARED ||
+ desc->sh_notifier.head)
action = NULL;
raw_spin_unlock_irqrestore(&desc->lock, flags);
@@ -870,8 +875,12 @@ out_thread:
int setup_irq(unsigned int irq, struct irqaction *act)
{
struct irq_desc *desc = irq_to_desc(irq);
+ int ret;
- return __setup_irq(irq, desc, act);
+ ret = __setup_irq(irq, desc, act);
+ if (!ret)
+ desc->users++;
+ return ret;
}
EXPORT_SYMBOL_GPL(setup_irq);
@@ -980,9 +989,37 @@ void remove_irq(unsigned int irq, struct irqaction *act)
return;
__free_irq(desc, act->dev_id);
+ desc->users--;
}
EXPORT_SYMBOL_GPL(remove_irq);
+/*
+ * Internal helper that informs listeners about potential sharing state
+ * changes after free_irq() or a failed request_irq().
+ */
+static void notify_free_irq(struct irq_desc *desc)
+{
+ int ret = 0;
+
+ if (desc->sh_lock_holder != current) {
+ mutex_lock(&desc->sh_lock);
+ if (--desc->users == 1) {
+ desc->sh_lock_holder = current;
+ raw_notifier_call_chain(&desc->sh_notifier,
+ IRQN_EXCLUSIVE, &ret);
+ desc->sh_lock_holder = NULL;
+ }
+ mutex_unlock(&desc->sh_lock);
+ } else if (--desc->users == 1)
+ raw_notifier_call_chain(&desc->sh_notifier, IRQN_EXCLUSIVE,
+ &ret);
+
+ if (ret)
+ printk(KERN_ERR
+ "IRQ sharing callback returned %d on IRQN_EXCLUSIVE\n",
+ ret);
+}
+
/**
* free_irq - free an interrupt allocated with request_irq
* @irq: Interrupt line to free
@@ -1007,6 +1044,8 @@ void free_irq(unsigned int irq, void *dev_id)
chip_bus_lock(desc);
kfree(__free_irq(desc, dev_id));
chip_bus_sync_unlock(desc);
+
+ notify_free_irq(desc);
}
EXPORT_SYMBOL(free_irq);
@@ -1059,7 +1098,7 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
{
struct irqaction *action;
struct irq_desc *desc;
- int retval;
+ int retval = 0;
/*
* Sanity-check: shared interrupts must pass in a real dev-ID,
@@ -1093,12 +1132,37 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
action->name = devname;
action->dev_id = dev_id;
+ if (desc->sh_lock_holder != current) {
+ mutex_lock(&desc->sh_lock);
+ if (desc->users == 1) {
+ desc->sh_lock_holder = current;
+ raw_notifier_call_chain(&desc->sh_notifier,
+ IRQN_SHARED, &retval);
+ desc->sh_lock_holder = NULL;
+ }
+ desc->users++;
+ mutex_unlock(&desc->sh_lock);
+ } else {
+ if (desc->users == 1)
+ raw_notifier_call_chain(&desc->sh_notifier,
+ IRQN_SHARED, &retval);
+ desc->users++;
+ }
+ if (retval) {
+ printk(KERN_ERR
+ "IRQ sharing callback returned %d on IRQN_SHARED\n",
+ retval);
+ return retval;
+ }
+
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, action);
chip_bus_sync_unlock(desc);
- if (retval)
+ if (retval) {
+ notify_free_irq(desc);
kfree(action);
+ }
#ifdef CONFIG_DEBUG_SHIRQ
if (!retval && (irqflags & IRQF_SHARED)) {
@@ -1159,3 +1223,89 @@ int request_any_context_irq(unsigned int irq, irq_handler_t handler,
return !ret ? IRQC_IS_HARDIRQ : ret;
}
EXPORT_SYMBOL_GPL(request_any_context_irq);
+
+/**
+ * register_irq_sharing_notifier - register notifier for sharing state
+ * changes
+ * @irq: Interrupt line to register on
+ * @nb: Notifier block
+ *
+ * This registers a notifier to be called whenever the specified interrupt
+ * line is about to be shared or unshared. The callback is invoked with
+ * IRQN_SHARED when the second user of the line calls a request_irq
+ * service. IRQN_EXCLUSIVE is signaled when only one user remains for the
+ * interrupt line after a free_irq invocation.
+ *
+ * The callback is supposed to re-register the interrupt handler it
+ * manages to allow sharing or to optimize for non-shared operation. It
+ * can safely call request_[threaded_]irq and free_irq without triggering
+ * a recursive notification for itself. It has to report errors by
+ * returning NOTIFY_BAD and writing the error code to the int which the
+ * third argument of the callback points to.
+ *
+ * Before registration, the notifier is fired once for the passed callback
+ * to report the current usage of the interrupt, i.e. IRQN_SETUP_USED or
+ * IRQN_SETUP_UNUSED. If this initial call fails, no registration is
+ * performed.
+ *
+ * On failure, it returns a negative value. On success, it returns 0.
+ */
+int register_irq_sharing_notifier(unsigned int irq, struct notifier_block *nb)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ unsigned long event;
+ int ret = 0;
+
+ if (!desc)
+ return -EINVAL;
+
+ mutex_lock(&desc->sh_lock);
+ desc->sh_lock_holder = current;
+
+ event = desc->users ? IRQN_SETUP_USED : IRQN_SETUP_UNUSED;
+ nb->notifier_call(nb, event, &ret);
+ if (!ret)
+ raw_notifier_chain_register(&desc->sh_notifier, nb);
+
+ desc->sh_lock_holder = NULL;
+ mutex_unlock(&desc->sh_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(register_irq_sharing_notifier);
+
+/**
+ * unregister_irq_sharing_notifier - unregister notifier for sharing state
+ * changes
+ * @irq: Interrupt line to unregister from
+ * @nb: Notifier block
+ *
+ * Unregisters the notifier previously registered with the specified
+ * interrupt line via register_irq_sharing_notifier. A final IRQN_SHUTDOWN
+ * is sent to the callback of the unregistered notifier to allow resource
+ * cleanup in a synchronized context.
+ *
+ * On failure, it returns a negative value. On success, it returns 0.
+ */
+int unregister_irq_sharing_notifier(unsigned int irq,
+ struct notifier_block *nb)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ int ret;
+
+ if (!desc)
+ return -EINVAL;
+
+ mutex_lock(&desc->sh_lock);
+ desc->sh_lock_holder = current;
+
+ ret = raw_notifier_chain_unregister(&desc->sh_notifier, nb);
+ if (!ret)
+ nb->notifier_call(nb, IRQN_SHUTDOWN, &ret);
+
+ desc->sh_lock_holder = NULL;
+ mutex_unlock(&desc->sh_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(unregister_irq_sharing_notifier);
--
1.7.1
On Sat, 4 Dec 2010, Jan Kiszka wrote:
> Besides 3 cleanup patches, this series consists of two major changes.
> The first introduces an interrupt sharing notifier to the genirq
> subsystem. It fires when an interrupt line is about to be used by more
> than one driver or the last but one user called free_irq.
>
> The second major change makes use of this interface in KVM's PCI pass-
> through subsystem. KVM has to keep the interrupt source disabled while
> calling into the guest to handle the event. This can be done at device
> or line level. The former is required to share the interrupt line, the
> latter is an order of magnitude faster (see patch 3 for details).
>
> Beside pass-through support of KVM, further users of the IRQ notifier
> could become VFIO (not yet mainline) and uio_pci_generic which have to
> resolve the same conflict.
Hmm. You basically want to have the following functionality:
If interrupt is shared, then you want to keep the current behaviour:
disable at line level (IRQF_ONESHOT)
run handler thread (PCI level masking)
reenable at line level in irq_finalize_oneshot()
reenable at PCI level when guest is done
If interrupt is not shared:
disable at line level (IRQF_ONESHOT)
run handler thread (no PCI level masking)
no reenable at line level
reenable at line level when guest is done
I think the whole notifier approach including the extra irq handlers
plus the requirement to call disable_irq_nosync() from the non shared
handler is overkill. We can be more clever.
The genirq code knows whether you have one or more handler
registered. So we can add IRQF_ONESHOT_UNMASK_SHARED and add a status
field to irq_data (which I was going to do anyway for other
reasons). In that status field you get a bit which says IRQ_MASK_DEVICE.
So with IRQF_ONESHOT_UNMASK_SHARED == 0 we keep the current behaviour.
If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 1 we keep the
current behaviour.
If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 0 then
irq_finalize_oneshot() simply marks the interrupt disabled (equivalent
to disable_irq_nosync()) and returns.
Now in your primary irq handler you simply check the IRQ_MASK_DEVICE
status flag and decide whether you need to mask at PCI level or not.
Your threaded handler gets the same information via IRQ_MASK_DEVICE so
it can issue the appropriate user space notification depending on that
flag.
This works fully transparent across adding and removing handlers. On
request_irq/free_irq we update the IRQ_MASK_DEVICE flag with the
following logic:
nr_actions IRQF_ONESHOT_UNMASK_SHARED IRQ_MASK_DEVICE
1 0 1
1 1 0
>1 don't care 1
If interrupts are in flight across request/free then this change
takes effect when the next interrupt comes in.
No notifiers, no disable_irq_nosync() magic, less and simpler code.
Thoughts ?
tglx
Am 04.12.2010 11:37, Thomas Gleixner wrote:
> On Sat, 4 Dec 2010, Jan Kiszka wrote:
>
>> Besides 3 cleanup patches, this series consists of two major changes.
>> The first introduces an interrupt sharing notifier to the genirq
>> subsystem. It fires when an interrupt line is about to be used by more
>> than one driver or the last but one user called free_irq.
>>
>> The second major change makes use of this interface in KVM's PCI pass-
>> through subsystem. KVM has to keep the interrupt source disabled while
>> calling into the guest to handle the event. This can be done at device
>> or line level. The former is required to share the interrupt line, the
>> latter is an order of magnitude faster (see patch 3 for details).
>>
>> Beside pass-through support of KVM, further users of the IRQ notifier
>> could become VFIO (not yet mainline) and uio_pci_generic which have to
>> resolve the same conflict.
>
> Hmm. You basically want to have the following functionality:
>
> If interrupt is shared, then you want to keep the current behaviour:
>
> disable at line level (IRQF_ONESHOT)
> run handler thread (PCI level masking)
> reenable at line level in irq_finalize_oneshot()
> reenable at PCI level when guest is done
If the interrupt is shared, we must mask at PCI level inside the primary
handler as we also have to support non-threaded users of the same line.
So we always have a transition line-level -> device-level
masking in a primary handler.
>
> If interrupt is not shared:
>
> disable at line level (IRQF_ONESHOT)
> run handler thread (no PCI level masking)
> no reenable at line level
> reenable at line level when guest is done
We only use threaded handlers so far for hairy lock-dependency reasons
and as certain injection complexity is too much guest-controllable. We
hope to move most injection cases (ie. any IRQ directed to a single VCPU
/ user space task) directly into a primary handler in the future to
reduce the latency. So both threaded and non-threaded cases should be
addressable by any approach.
>
> I think the whole notifier approach including the extra irq handlers
> plus the requirement to call disable_irq_nosync() from the non shared
> handler is overkill. We can be more clever.
Always welcome.
>
> The genirq code knows whether you have one or more handler
> registered. So we can add IRQF_ONESHOT_UNMASK_SHARED and add a status
> field to irq_data (which I was going to do anyway for other
> reasons). In that status field you get a bit which says IRQ_MASK_DEVICE.
>
> So with IRQF_ONESHOT_UNMASK_SHARED == 0 we keep the current behaviour.
>
> If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 1 we keep the
> current behaviour.
>
> If IRQF_ONESHOT_UNMASK_SHARED == 1 and IRQ_MASK_DEVICE == 0 then
> irq_finalize_oneshot() simply marks the interrupt disabled (equivalent
> to disable_irq_nosync()) and returns.
That sounds good.
>
> Now in your primary irq handler you simply check the IRQ_MASK_DEVICE
> status flag and decide whether you need to mask at PCI level or not.
>
> Your threaded handler gets the same information via IRQ_MASK_DEVICE so
> it can issue the appropriate user space notification depending on that
> flag.
>
> This works fully transparent across adding and removing handlers. On
> request_irq/free_irq we update the IRQ_MASK_DEVICE flag with the
> following logic:
>
> nr_actions IRQF_ONESHOT_UNMASK_SHARED IRQ_MASK_DEVICE
> 1 0 1
> 1 1 0
> >1 don't care 1
>
> If interrupts are in flight across request/free then this change
> takes effect when the next interrupt comes in.
For registration, that might be too late. We need to synchronize on
in-flight interrupts for that line and then ensure that it gets enabled
independently of the registered user. That user may have applied
outdated information, thus would block the line for too long if user
space decides to do something else.
I think this will be the trickier part of the otherwise nice & simple
concept.
>
> No notifiers, no disable_irq_nosync() magic, less and simpler code.
>
> Thoughts ?
Pulling the effect of disable_irq_nosync into genirq also for threaded
handling would remove the need for re-registering handlers. That's good.
But we need to think about the hand-over scenarios again. Maybe
solvable, but non-trivial.
Jan
Jan,
On Sat, 4 Dec 2010, Jan Kiszka wrote:
> Am 04.12.2010 11:37, Thomas Gleixner wrote:
> > On Sat, 4 Dec 2010, Jan Kiszka wrote:
> > If interrupt is shared, then you want to keep the current behaviour:
> >
> > disable at line level (IRQF_ONESHOT)
> > run handler thread (PCI level masking)
> > reenable at line level in irq_finalize_oneshot()
> > reenable at PCI level when guest is done
>
> If the interrupt is shared, we must mask at PCI level inside the primary
> handler as we also have to support non-threaded users of the same line.
> So we always have a transition line-level -> device-level
> masking in a primary handler.
Sorry that left out the hard irq part. Of course it needs to do the
PCI level masking in the primary one.
> reduce the latency. So both threaded and non-threaded cases should be
> addressable by any approach.
The oneshot magic should work on non threaded cases as well. Needs
some modifications, but nothing complex.
> > If interrupts are in flight across request/free then this change
> > takes effect when the next interrupt comes in.
>
> For registration, that might be too late. We need to synchronize on
> in-flight interrupts for that line and then ensure that it gets enabled
> independently of the registered user. That user may have applied
> outdated information, thus would block the line for too long if user
> space decides to do something else.
No, that's really just a corner case when going from one to two
handlers and I don't think it matters much. If you setup a new driver
it's not really important whether that first thing comes in a few ms
later.
Also there is a pretty simple solution for this: The core code knows,
that there is an ONESHOT interrupt in flight, so it simply can call
the primary handler of that device with the appropriate flag set
(maybe an additional one to indicate the transition) and let that deal
with it. Needs some thought vs. locking and races, but that shouldn't
be hard.
> Pulling the effect of disable_irq_nosync into genirq also for threaded
> handling would remove the need for re-registering handlers. That's good.
> But we need to think about the hand-over scenarios again. Maybe
> solvable, but non-trivial.
See above.
Thanks,
tglx
Am 04.12.2010 15:41, Thomas Gleixner wrote:
> Jan,
>
> On Sat, 4 Dec 2010, Jan Kiszka wrote:
>
>> Am 04.12.2010 11:37, Thomas Gleixner wrote:
>>> On Sat, 4 Dec 2010, Jan Kiszka wrote:
>>> If interrupt is shared, then you want to keep the current behaviour:
>>>
>>> disable at line level (IRQF_ONESHOT)
>>> run handler thread (PCI level masking)
>>> reenable at line level in irq_finalize_oneshot()
>>> reenable at PCI level when guest is done
>>
>> If the interrupt is shared, we must mask at PCI level inside the primary
>> handler as we also have to support non-threaded users of the same line.
>> So we always have a transition line-level -> device-level
>> masking in a primary handler.
>
> Sorry that left out the hard irq part. Of course it needs to do the
> PCI level masking in the primary one.
>
>> reduce the latency. So both threaded and non-threaded cases should be
>> addressable by any approach.
>
> The oneshot magic should work on non threaded cases as well. Needs
> some modifications, but nothing complex.
>
>>> If interrupts are in flight across request/free then this change
>>> takes effect when the next interrupt comes in.
>>
>> For registration, that might be too late. We need to synchronize on
>> in-flight interrupts for that line and then ensure that it gets enabled
>> independently of the registered user. That user may have applied
>> outdated information, thus would block the line for too long if user
>> space decides to do something else.
>
> No, that's really just a corner case when going from one to two
> handlers and I don't think it matters much. If you setup a new driver
> it's not really important whether that first thing comes in a few ms
> later.
The worst case remains infinite (user space never signals end of interrupt).
>
> Also there is a pretty simple solution for this: The core code knows,
> that there is an ONESHOT interrupt in flight, so it simply can call
It doesn't synchronize the tail part against the masking in the
handler(s), that's driver business.
> the primary handler of that device with the appropriate flag set
> (maybe an additional one to indicate the transition) and let that deal
> with it. Needs some thought vs. locking and races, but that shouldn't
> be hard.
Yes, I thought about this kind of transition (re-invoking the existing
handler) already. We do need notification of the switch (at least for
exclusive->shared) as only the driver can migrate the masking for
in-flight interrupts.
Jan
On Sat, 4 Dec 2010, Jan Kiszka wrote:
> Am 04.12.2010 15:41, Thomas Gleixner wrote:
> > Also there is a pretty simple solution for this: The core code knows,
> > that there is an ONESHOT interrupt in flight, so it simply can call
>
> It doesn't synchronize the tail part against the masking in the
> handler(s), that's driver business.
Right, but the core knows from the irq state, that the line is masked
and due to the ONESHOT or whatever feature it knows that it needs to
call the handler.
The other way round shared -> exclusive does not matter at all.
> > the primary handler of that device with the appropriate flag set
> > (maybe an additional one to indicate the transition) and let that deal
> > with it. Needs some thought vs. locking and races, but that shouldn't
> > be hard.
>
> Yes, I thought about this kind of transition (re-invoking the existing
> handler) already. We do need notification of the switch (at least for
> exclusive->shared) as only the driver can migrate the masking for
> in-flight interrupts.
Right. It needs to set the device level mask in that case. As the
interrupt handler already has the code to do that it's the most
obvious function to call.
Thanks,
tglx
On 12/04/2010 01:37 AM, Jan Kiszka wrote:
> From: Jan Kiszka<[email protected]>
>
> PCI 2.3 allows to generically disable IRQ sources at device level. This
> enables us to share IRQs of such devices on the host side when passing
> them to a guest.
>
> However, IRQ disabling via the PCI config space is more costly than
> masking the line via disable_irq. Therefore we register an IRQ sharing
> notifier and switch between line and device level disabling on demand.
>
> This feature is optional, user space has to request it explicitly as it
> also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
> way, we can avoid unmasking the interrupt and signaling it if the guest
> masked it via the PCI config space.
>
>
> diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> index e1a9297..5e164db 100644
> --- a/Documentation/kvm/api.txt
> +++ b/Documentation/kvm/api.txt
> @@ -1112,6 +1112,14 @@ following flags are specified:
>
> /* Depends on KVM_CAP_IOMMU */
> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1<< 0)
> +/* The following two depend on KVM_CAP_PCI_2_3 */
> +#define KVM_DEV_ASSIGN_PCI_2_3 (1<< 1)
> +#define KVM_DEV_ASSIGN_MASK_INTX (1<< 2)
> +
> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
> +via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
> +assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>
> 4.48 KVM_DEASSIGN_PCI_DEVICE
>
> @@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
> __u16 padding[3];
> };
>
> +4.54 KVM_ASSIGN_SET_INTX_MASK
> +
> +Capability: KVM_CAP_PCI_2_3
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_pci_dev (in)
> +Returns: 0 on success, -1 on error
> +
> +Informs the kernel about the guest's view on the INTx mask. As long as the
> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
> +hardware level and will not assert the guest's IRQ line. User space is still
> +responsible for applying this state to the assigned device's real config space.
What's the protocol for doing this? I suppose userspace has to disable
interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK,
unmasked), enable interrupts?
Isn't there a race window between the two operations?
Maybe we should give the kernel full ownership of that bit.
> +
> +See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
> +by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
> +evaluated.
> +
--
error compiling committee.c: too many arguments to function
Am 06.12.2010 17:21, Avi Kivity wrote:
> On 12/04/2010 01:37 AM, Jan Kiszka wrote:
>> From: Jan Kiszka<[email protected]>
>>
>> PCI 2.3 allows to generically disable IRQ sources at device level. This
>> enables us to share IRQs of such devices on the host side when passing
>> them to a guest.
>>
>> However, IRQ disabling via the PCI config space is more costly than
>> masking the line via disable_irq. Therefore we register an IRQ sharing
>> notifier and switch between line and device level disabling on demand.
>>
>> This feature is optional, user space has to request it explicitly as it
>> also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
>> way, we can avoid unmasking the interrupt and signaling it if the guest
>> masked it via the PCI config space.
>>
>>
>> diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
>> index e1a9297..5e164db 100644
>> --- a/Documentation/kvm/api.txt
>> +++ b/Documentation/kvm/api.txt
>> @@ -1112,6 +1112,14 @@ following flags are specified:
>>
>> /* Depends on KVM_CAP_IOMMU */
>> #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1<< 0)
>> +/* The following two depend on KVM_CAP_PCI_2_3 */
>> +#define KVM_DEV_ASSIGN_PCI_2_3 (1<< 1)
>> +#define KVM_DEV_ASSIGN_MASK_INTX (1<< 2)
>> +
>> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
>> +via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
>> +assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
>> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>>
>> 4.48 KVM_DEASSIGN_PCI_DEVICE
>>
>> @@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
>> __u16 padding[3];
>> };
>>
>> +4.54 KVM_ASSIGN_SET_INTX_MASK
>> +
>> +Capability: KVM_CAP_PCI_2_3
>> +Architectures: x86
>> +Type: vm ioctl
>> +Parameters: struct kvm_assigned_pci_dev (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +Informs the kernel about the guest's view on the INTx mask. As long as the
>> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
>> +hardware level and will not assert the guest's IRQ line. User space is still
>> +responsible for applying this state to the assigned device's real config space.
>
> What's the protocol for doing this? I suppose userspace has to disable
> interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK,
> unmasked), enable interrupts?
Userspace just has to synchronize against itself - what it already does:
qemu_mutex, and masking/unmasking is synchronous w.r.t. the executing
VCPU. Otherwise, masking/unmasking is naturally racy, also in Real Life.
The guest resolves the remaining races.
>
> Isn't there a race window between the two operations?
>
> Maybe we should give the kernel full ownership of that bit.
I think this is what VFIO does and is surely cleaner than this approach.
But it's not possible with the existing interface (sysfs + KVM ioctls) -
or can you restrict the sysfs access to the config space in such details?
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On 12/06/2010 06:34 PM, Jan Kiszka wrote:
> >
> > What's the protocol for doing this? I suppose userspace has to disable
> > interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK,
> > unmasked), enable interrupts?
>
> Userspace just has to synchronize against itself - what it already does:
> qemu_mutex, and masking/unmasking is synchronous w.r.t. the executing
> VCPU. Otherwise, masking/unmasking is naturally racy, also in Real Life.
> The guest resolves the remaining races.
I meant when qemu sets INTX_MASK and the kernel clears it immediately
afterwards because the two are not synchronized. I guess that won't
happen in practice because playing with INTX_MASK is very rare.
> >
> > Isn't there a race window between the two operations?
> >
> > Maybe we should give the kernel full ownership of that bit.
>
> I think this is what VFIO does and is surely cleaner than this approach.
> But it's not possible with the existing interface (sysfs + KVM ioctls) -
> or can you restrict the sysfs access to the config space in such details?
I'm sure you can, not sure it's worth it. Can the situation be
exploited? what if userspace lies?
--
error compiling committee.c: too many arguments to function
Am 06.12.2010 17:40, Avi Kivity wrote:
> On 12/06/2010 06:34 PM, Jan Kiszka wrote:
>>>
>>> What's the protocol for doing this? I suppose userspace has to disable
>>> interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK,
>>> unmasked), enable interrupts?
>>
>> Userspace just has to synchronize against itself - what it already does:
>> qemu_mutex, and masking/unmasking is synchronous w.r.t. the executing
>> VCPU. Otherwise, masking/unmasking is naturally racy, also in Real Life.
>> The guest resolves the remaining races.
>
> I meant when qemu sets INTX_MASK and the kernel clears it immediately
> afterwards because the two are not synchronized. I guess that won't
> happen in practice because playing with INTX_MASK is very rare.
Ah, there is indeed a race, and the qemu-kvm patches I did not post yet
(to wait for the kernel interface to settle) actually suffer from it:
userspace needs to set the kernel mask before writing the config space
(it's the other way around ATM). This prevents the kernel from overwriting
what userspace just wrote out. We always suffer from the race the other
way around, see below.
>
>
>>>
>>> Isn't there a race window between the two operations?
>>>
>>> Maybe we should give the kernel full ownership of that bit.
>>
>> I think this is what VFIO does and is surely cleaner than this approach.
>> But it's not possible with the existing interface (sysfs + KVM ioctls) -
>> or can you restrict the sysfs access to the config space in such details?
>
> I'm sure you can, not sure it's worth it. Can the situation be
> exploited? what if userspace lies?
That's also the above scenario inverted: Userspace can mask or unmask at
any time. If it unmasks a yet unhandled, thus still raised, interrupt, it will
trigger another one. The kernel will catch it and mask it again. That
can repeat forever with the frequency userspace is able to run its
unmasking code. Not nice, but nothing to leverage for a DoS.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On 12/06/2010 06:46 PM, Jan Kiszka wrote:
> Am 06.12.2010 17:40, Avi Kivity wrote:
> > On 12/06/2010 06:34 PM, Jan Kiszka wrote:
> >>>
> >>> What's the protocol for doing this? I suppose userspace has to disable
> >>> interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK,
> >>> unmasked), enable interrupts?
> >>
> >> Userspace just has to synchronize against itself - what it already does:
> >> qemu_mutex, and masking/unmasking is synchronous w.r.t. the executing
> >> VCPU. Otherwise, masking/unmasking is naturally racy, also in Real Life.
> >> The guest resolves the remaining races.
> >
> > I meant when qemu sets INTX_MASK and the kernel clears it immediately
> > afterwards because the two are not synchronized. I guess that won't
> > happen in practice because playing with INTX_MASK is very rare.
>
> Ah, there is indeed a race, and the qemu-kvm patches I did not post yet
> (to wait for the kernel interface to settle) actually suffer from it:
> userspace needs to set the kernel mask before writing the config space
> (it's the other way around ATM). This prevents the kernel from overwriting
> what userspace just wrote out. We always suffer from the race the other
> way around, see below.
Please document the protocol. Is this always the right order?
Shouldn't it be reversed when unmasking? I admit I'm confused about this.
> >>
> >> I think this is what VFIO does and is surely cleaner than this approach.
> >> But it's not possible with the existing interface (sysfs + KVM ioctls) -
> >> or can you restrict the sysfs access to the config space in such details?
> >
> > I'm sure you can, not sure it's worth it. Can the situation be
> > exploited? what if userspace lies?
>
> That's also the above scenario inverted: Userspace can mask or unmask at
> any time. If it unmasks a yet unhandled, thus still raised, interrupt, it will
> trigger another one. The kernel will catch it and mask it again. That
> can repeat forever with the frequency userspace is able to run its
> unmasking code. Not nice, but nothing to leverage for a DoS.
Ok (I think).
--
error compiling committee.c: too many arguments to function
Am 06.12.2010 18:01, Avi Kivity wrote:
> On 12/06/2010 06:46 PM, Jan Kiszka wrote:
>> Am 06.12.2010 17:40, Avi Kivity wrote:
>>> On 12/06/2010 06:34 PM, Jan Kiszka wrote:
>>>>>
>>>>> What's the protocol for doing this? I suppose userspace has to disable
>>>>> interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK,
>>>>> unmasked), enable interrupts?
>>>>
>>>> Userspace just has to synchronize against itself - what it already does:
>>>> qemu_mutex, and masking/unmasking is synchronous w.r.t. the executing
>>>> VCPU. Otherwise, masking/unmasking is naturally racy, also in Real Life.
>>>> The guest resolves the remaining races.
>>>
>>> I meant when qemu sets INTX_MASK and the kernel clears it immediately
>>> afterwards because the two are not synchronized. I guess that won't
>>> happen in practice because playing with INTX_MASK is very rare.
>>
>> Ah, there is indeed a race, and the qemu-kvm patches I did not post yet
>> (to wait for the kernel interface to settle) actually suffer from it:
>> userspace needs to set the kernel mask before writing the config space
>> (it's the other way around ATM). This prevents the kernel from overwriting
>> what userspace just wrote out. We always suffer from the race the other
>> way around, see below.
>
> Please document the protocol.
Yes, will do.
> Is this always the right order?
> Shouldn't it be reversed when unmasking? I admit I'm confused about this.
The kernel only notes down userspace's view on the mask, it doesn't
modify the hardware (unless it is in line-disabling mode, but then there
is no conflict anyway). Therefore, if userspace first unmasks and an
interrupts arrives before this was reported to the kernel, the device
interrupt will remain masked and no event is delivered to userspace.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux