This addresses the review comments of the previous round:
- renamed irq_data::status to drv_status
- moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
- fixed signature of get_irq_status (irq is now unsigned int)
- converted register_lock into a global one
- fixed critical white space breakage (that I just left in to check if
anyone is actually reading the code, of course...)
Note: The KVM patch still depends on
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/64515
Thanks for all comments!
Final but critical question: Who will pick up which bits?
Jan Kiszka (4):
genirq: Introduce driver-readable IRQ status word
genirq: Inform handler about line sharing state
genirq: Add support for IRQF_COND_ONESHOT
KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
Documentation/kvm/api.txt | 27 ++++
arch/x86/kvm/x86.c | 1 +
include/linux/interrupt.h | 15 ++
include/linux/irq.h | 2 +
include/linux/kvm.h | 6 +
include/linux/kvm_host.h | 10 ++-
kernel/irq/manage.c | 77 ++++++++++-
virt/kvm/assigned-dev.c | 336 ++++++++++++++++++++++++++++++++++++++++-----
8 files changed, 436 insertions(+), 38 deletions(-)
From: Jan Kiszka <[email protected]>
This associates a status word with every IRQ descriptor. Drivers can obtain
its content via get_irq_status(irq). The first use case will be propagating
the interrupt sharing state.
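As an illustration only (not part of this patch: "my_dev" and its helpers
are made up, and IRQS_SHARED is only introduced by a later patch of this
series), a driver's threaded handler could consume the snapshot like this:

static irqreturn_t my_dev_thread_fn(int irq, void *dev_id)
{
	struct my_dev *dev = dev_id;

	/* the status word is an unserialized snapshot, treat it as a hint */
	if (get_irq_status(irq) & IRQS_SHARED)
		my_dev_mask_at_device_level(dev);	/* made-up helper */

	/* ... service the device ... */

	return IRQ_HANDLED;
}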
Signed-off-by: Jan Kiszka <[email protected]>
---
include/linux/interrupt.h | 2 ++
include/linux/irq.h | 2 ++
kernel/irq/manage.c | 15 +++++++++++++++
3 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..4c1aa72 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -126,6 +126,8 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+extern unsigned long get_irq_status(unsigned int irq);
+
#ifdef CONFIG_GENERIC_HARDIRQS
extern int __must_check
request_threaded_irq(unsigned int irq, irq_handler_t handler,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index abde252..8bdb421 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -96,6 +96,7 @@ struct msi_desc;
* methods, to allow shared chip implementations
* @msi_desc: MSI descriptor
* @affinity: IRQ affinity on SMP
+ * @drv_status: driver-readable status flags (IRQS_*)
*
* The fields here need to overlay the ones in irq_desc until we
* cleaned up the direct references and switched everything over to
@@ -111,6 +112,7 @@ struct irq_data {
#ifdef CONFIG_SMP
cpumask_var_t affinity;
#endif
+ unsigned long drv_status;
};
/**
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5f92acc..2ea0d30 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1157,3 +1157,18 @@ int request_any_context_irq(unsigned int irq, irq_handler_t handler,
return !ret ? IRQC_IS_HARDIRQ : ret;
}
EXPORT_SYMBOL_GPL(request_any_context_irq);
+
+/**
+ * get_irq_status - read interrupt line status word
+ * @irq: Interrupt line of the status word
+ *
+ * This returns the current content of the status word associated with
+ * the given interrupt line. See IRQS_* flags for details.
+ */
+unsigned long get_irq_status(unsigned int irq)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+
+ return desc ? desc->irq_data.drv_status : 0;
+}
+EXPORT_SYMBOL_GPL(get_irq_status);
--
1.7.1
From: Jan Kiszka <[email protected]>
Provide an adaptive version of IRQF_ONESHOT: If the line is exclusively used,
IRQF_COND_ONESHOT provides the same semantics as IRQF_ONESHOT. If it is
shared, the line will be unmasked directly after the hardirq handler, just as
if IRQF_COND_ONESHOT was not provided.
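A usage sketch (handler names are made up, error handling omitted):

	/*
	 * Runs in oneshot mode while we are the only handler on the line,
	 * but degrades to immediate unmasking after the hardirq once a
	 * second handler gets registered.
	 */
	int err = request_threaded_irq(irq, my_quick_check_handler,
				       my_thread_fn,
				       IRQF_SHARED | IRQF_COND_ONESHOT,
				       "my-dev", dev);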
Signed-off-by: Jan Kiszka <[email protected]>
---
include/linux/interrupt.h | 3 +++
kernel/irq/manage.c | 19 ++++++++++++++++---
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 12e5fc0..bbb16f4 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -56,6 +56,8 @@
* irq line disabled until the threaded handler has been run.
* IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
* IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
+ * IRQF_COND_ONESHOT - If the line is not shared, keep the interrupt disabled
+ *                 after the hardirq handler has finished.
*
*/
#define IRQF_DISABLED 0x00000020
@@ -69,6 +71,7 @@
#define IRQF_ONESHOT 0x00002000
#define IRQF_NO_SUSPEND 0x00004000
#define IRQF_ADAPTIVE 0x00008000
+#define IRQF_COND_ONESHOT 0x00010000
#define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2dd4eef..9a73633 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -583,7 +583,7 @@ static int irq_thread(void *data)
struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, };
struct irqaction *action = data;
struct irq_desc *desc = irq_to_desc(action->irq);
- int wake, oneshot = desc->status & IRQ_ONESHOT;
+ int wake, oneshot;
sched_setscheduler(current, SCHED_FIFO, &param);
current->irqaction = action;
@@ -606,6 +606,7 @@ static int irq_thread(void *data)
desc->status |= IRQ_PENDING;
raw_spin_unlock_irq(&desc->lock);
} else {
+ oneshot = desc->status & IRQ_ONESHOT;
raw_spin_unlock_irq(&desc->lock);
action->thread_fn(action->irq, action->dev_id);
@@ -759,6 +760,15 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
shared = 1;
desc->irq_data.drv_status |= IRQS_SHARED;
+ desc->status &= ~IRQ_ONESHOT;
+
+ /* Unmask if the interrupt was masked due to oneshot mode. */
+ if ((desc->status &
+ (IRQ_INPROGRESS | IRQ_DISABLED | IRQ_MASKED)) ==
+ IRQ_MASKED) {
+ desc->irq_data.chip->irq_unmask(&desc->irq_data);
+ desc->status &= ~IRQ_MASKED;
+ }
}
if (!shared) {
@@ -783,7 +793,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING | IRQ_ONESHOT |
IRQ_INPROGRESS | IRQ_SPURIOUS_DISABLED);
- if (new->flags & IRQF_ONESHOT)
+ if (new->flags & (IRQF_ONESHOT | IRQF_COND_ONESHOT))
desc->status |= IRQ_ONESHOT;
if (!(desc->status & IRQ_NOAUTOEN)) {
@@ -934,8 +944,11 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
desc->irq_data.chip->irq_shutdown(&desc->irq_data);
else
desc->irq_data.chip->irq_disable(&desc->irq_data);
- } else if (!desc->action->next)
+ } else if (!desc->action->next) {
single_handler = true;
+ if (desc->action->flags & IRQF_COND_ONESHOT)
+ desc->status |= IRQ_ONESHOT;
+ }
#ifdef CONFIG_SMP
/* make sure affinity_hint is cleaned up */
--
1.7.1
From: Jan Kiszka <[email protected]>
PCI 2.3 allows IRQ sources to be disabled generically at the device level.
This enables us to share the IRQs of such devices on the host side when
passing them to a guest.
However, IRQ disabling via the PCI config space is more costly than masking
the line via disable_irq. Therefore we register the IRQ in adaptive mode and
switch between line-level and device-level disabling on demand.
This feature is optional; user space has to request it explicitly as it also
has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That way, we can
avoid unmasking the interrupt and signaling it if the guest masked it via the
PCI config space.
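For illustration, user space could drive this as follows (sketch only, based
on the interfaces added below; error handling omitted):

	struct kvm_assigned_pci_dev adev = {
		.assigned_dev_id = dev_id,
		.flags = KVM_DEV_ASSIGN_ENABLE_IOMMU | KVM_DEV_ASSIGN_PCI_2_3,
		/* segnr/busnr/devfn filled in as usual */
	};

	ioctl(vm_fd, KVM_ASSIGN_PCI_DEVICE, &adev);

	/* guest sets PCI_COMMAND_INTX_DISABLE: inform the kernel first... */
	adev.flags = KVM_DEV_ASSIGN_MASK_INTX;
	ioctl(vm_fd, KVM_ASSIGN_SET_INTX_MASK, &adev);
	/* ...then apply the bit to the device's real config space */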
Signed-off-by: Jan Kiszka <[email protected]>
---
Documentation/kvm/api.txt | 27 ++++
arch/x86/kvm/x86.c | 1 +
include/linux/kvm.h | 6 +
include/linux/kvm_host.h | 10 ++-
virt/kvm/assigned-dev.c | 336 ++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 346 insertions(+), 34 deletions(-)
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..1c34e25 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
/* Depends on KVM_CAP_IOMMU */
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
+assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view of the INTx mask; see KVM_ASSIGN_SET_INTX_MASK for details.
4.48 KVM_DEASSIGN_PCI_DEVICE
@@ -1263,6 +1271,25 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
};
+4.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Informs the kernel about the guest's view of the INTx mask. As long as the
+guest masks the legacy INTx, the kernel will refrain from unmasking it at
+hardware level and will not assert the guest's IRQ line. User space is still
+responsible for applying this state to the assigned device's real config space.
+To prevent the kernel from overwriting the state user space wants to set,
+KVM_ASSIGN_SET_INTX_MASK has to be called prior to updating the config space.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
5. The kvm_run structure
Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed373ba..8775a54 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1965,6 +1965,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+ case KVM_CAP_PCI_2_3:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..3cadb42 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo {
#define KVM_CAP_PPC_GET_PVINFO 57
#define KVM_CAP_PPC_IRQ_LEVEL 58
#define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_PCI_2_3 60
#ifdef KVM_CAP_IRQ_ROUTING
@@ -677,6 +678,9 @@ struct kvm_clock_data {
#define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
/* Available with KVM_CAP_PPC_GET_PVINFO */
#define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_PCI_2_3 */
+#define KVM_ASSIGN_SET_INTX_MASK _IOW(KVMIO, 0xa2, \
+ struct kvm_assigned_pci_dev)
/*
* ioctls for vcpu fds
@@ -742,6 +746,8 @@ struct kvm_clock_data {
#define KVM_SET_XCRS _IOW(KVMIO, 0xa7, struct kvm_xcrs)
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
struct kvm_assigned_pci_dev {
__u32 assigned_dev_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ac4e83a..4f95070 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -477,6 +477,12 @@ struct kvm_irq_ack_notifier {
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
};
+enum kvm_intx_state {
+ KVM_INTX_ENABLED,
+ KVM_INTX_LINE_DISABLED,
+ KVM_INTX_DEVICE_DISABLED,
+};
+
struct kvm_assigned_dev_kernel {
struct kvm_irq_ack_notifier ack_notifier;
struct list_head list;
@@ -486,7 +492,7 @@ struct kvm_assigned_dev_kernel {
int host_devfn;
unsigned int entries_nr;
int host_irq;
- bool host_irq_disabled;
+ unsigned long last_irq_status;
struct msix_entry *host_msix_entries;
int guest_irq;
struct msix_entry *guest_msix_entries;
@@ -496,6 +502,8 @@ struct kvm_assigned_dev_kernel {
struct pci_dev *dev;
struct kvm *kvm;
spinlock_t intx_lock;
+ spinlock_t intx_mask_lock;
+ enum kvm_intx_state intx_state;
char irq_name[32];
};
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index c6114d3..b91a2db 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,22 +55,141 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
return index;
}
-static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
+static bool
+pci_2_3_set_irq_mask(struct pci_dev *dev, bool mask, bool check_status)
+{
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd;
+ bool mask_updated = true;
+
+ /*
+ * We do a single dword read to retrieve both command and status.
+ * Document assumptions that make this possible.
+ */
+ BUILD_BUG_ON(PCI_COMMAND % 4);
+ BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
+
+ pci_block_user_cfg_access(dev);
+
+ /*
+ * Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs.
+ */
+ pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
+
+ if (check_status) {
+ bool irq_pending =
+ (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
+
+ /*
+ * Check interrupt status register to see whether our device
+ * triggered the interrupt (when masking) or the next IRQ is
+ * already pending (when unmasking).
+ */
+ if (mask != irq_pending) {
+ mask_updated = false;
+ goto done;
+ }
+ }
+
+ origcmd = cmd_status_dword;
+ newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
+ if (mask)
+ newcmd |= PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(dev, PCI_COMMAND, newcmd);
+
+done:
+ pci_unblock_user_cfg_access(dev);
+ return mask_updated;
+}
+
+static void pci_2_3_irq_mask(struct pci_dev *dev)
+{
+ pci_2_3_set_irq_mask(dev, true, false);
+}
+
+static bool pci_2_3_irq_check_and_mask(struct pci_dev *dev)
+{
+ return pci_2_3_set_irq_mask(dev, true, true);
+}
+
+static void pci_2_3_irq_unmask(struct pci_dev *dev)
+{
+ pci_2_3_set_irq_mask(dev, false, false);
+}
+
+static bool pci_2_3_irq_check_and_unmask(struct pci_dev *dev)
+{
+ return pci_2_3_set_irq_mask(dev, false, true);
+}
+
+static irqreturn_t kvm_assigned_dev_intr_intx(int irq, void *dev_id)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+ unsigned long irq_status = get_irq_status(irq);
+ int ret;
+
+ assigned_dev->last_irq_status = irq_status;
+
+ if (!(irq_status & (IRQS_SHARED | IRQS_MAKE_SHAREABLE)))
+ return IRQ_WAKE_THREAD;
+
+ spin_lock(&assigned_dev->intx_lock);
+
+ if (irq_status & IRQS_MAKE_SHAREABLE) {
+ if (assigned_dev->intx_state == KVM_INTX_LINE_DISABLED) {
+ pci_2_3_irq_mask(assigned_dev->dev);
+ enable_irq(irq);
+ assigned_dev->intx_state = KVM_INTX_DEVICE_DISABLED;
+ }
+ ret = IRQ_HANDLED;
+ } else if (pci_2_3_irq_check_and_mask(assigned_dev->dev)) {
+ assigned_dev->intx_state = KVM_INTX_DEVICE_DISABLED;
+ ret = IRQ_WAKE_THREAD;
+ } else
+ ret = IRQ_NONE;
+
+ spin_unlock(&assigned_dev->intx_lock);
+
+ return ret;
+}
+
+static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id)
{
struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
- spin_lock(&assigned_dev->intx_lock);
- disable_irq_nosync(irq);
- assigned_dev->host_irq_disabled = true;
- spin_unlock(&assigned_dev->intx_lock);
+ if (!(assigned_dev->last_irq_status & IRQS_SHARED)) {
+ spin_lock_irq(&assigned_dev->intx_lock);
+ if (assigned_dev->intx_state == KVM_INTX_ENABLED) {
+ disable_irq_nosync(irq);
+ assigned_dev->intx_state = KVM_INTX_LINE_DISABLED;
+ }
+ spin_unlock_irq(&assigned_dev->intx_lock);
}
+ spin_lock(&assigned_dev->intx_mask_lock);
+ if (!(assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX))
+ kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+ assigned_dev->guest_irq, 1);
+ spin_unlock(&assigned_dev->intx_mask_lock);
+
+ return IRQ_HANDLED;
+}
+
+#ifdef __KVM_HAVE_MSI
+static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
+{
+ struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+
kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
assigned_dev->guest_irq, 1);
return IRQ_HANDLED;
}
+#endif
#ifdef __KVM_HAVE_MSIX
static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
@@ -102,15 +221,36 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
- /* The guest irq may be shared so this ack may be
- * from another device.
- */
- spin_lock(&dev->intx_lock);
- if (dev->host_irq_disabled) {
- enable_irq(dev->host_irq);
- dev->host_irq_disabled = false;
+ if (likely(!(dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX)))
+ return;
+
+ spin_lock(&dev->intx_mask_lock);
+
+ if (!(dev->flags & KVM_DEV_ASSIGN_MASK_INTX)) {
+ bool reassert = false;
+
+ spin_lock_irq(&dev->intx_lock);
+ /*
+ * The guest IRQ may be shared so this ack can come from an
+ * IRQ for another guest device.
+ */
+ if (dev->intx_state == KVM_INTX_LINE_DISABLED) {
+ enable_irq(dev->host_irq);
+ dev->intx_state = KVM_INTX_ENABLED;
+ } else if (dev->intx_state == KVM_INTX_DEVICE_DISABLED) {
+ if (pci_2_3_irq_check_and_unmask(dev->dev))
+ dev->intx_state = KVM_INTX_ENABLED;
+ else
+ reassert = true;
+ }
+ spin_unlock_irq(&dev->intx_lock);
+
+ if (reassert)
+ kvm_set_irq(dev->kvm, dev->irq_source_id,
+ dev->guest_irq, 1);
}
- spin_unlock(&dev->intx_lock);
+
+ spin_unlock(&dev->intx_mask_lock);
}
static void deassign_guest_irq(struct kvm *kvm,
@@ -155,14 +295,21 @@ static void deassign_host_irq(struct kvm *kvm,
kfree(assigned_dev->host_msix_entries);
kfree(assigned_dev->guest_msix_entries);
pci_disable_msix(assigned_dev->dev);
+ } else if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) {
+ free_irq(assigned_dev->host_irq, assigned_dev);
+ pci_disable_msi(assigned_dev->dev);
} else {
- /* Deal with MSI and INTx */
- disable_irq(assigned_dev->host_irq);
+ if (assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+ spin_lock_irq(&assigned_dev->intx_lock);
+ pci_2_3_irq_mask(assigned_dev->dev);
+ /* prevent re-enabling by kvm_assigned_dev_ack_irq */
+ assigned_dev->intx_state = KVM_INTX_ENABLED;
+ spin_unlock_irq(&assigned_dev->intx_lock);
+ synchronize_irq(assigned_dev->host_irq);
+ } else
+ disable_irq(assigned_dev->host_irq);
free_irq(assigned_dev->host_irq, assigned_dev);
-
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
- pci_disable_msi(assigned_dev->dev);
}
assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_HOST_MASK);
@@ -231,16 +378,41 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
static int assigned_device_enable_host_intx(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev)
{
+ irq_handler_t handler;
+ unsigned long flags;
+ int err;
+
dev->host_irq = dev->dev->irq;
- /* Even though this is PCI, we don't want to use shared
- * interrupts. Sharing host devices with guest-assigned devices
- * on the same interrupt line is not a happy situation: there
- * are going to be long delays in accepting, acking, etc.
+ dev->intx_state = KVM_INTX_ENABLED;
+ dev->last_irq_status = 0;
+
+ /*
+ * We can only share the IRQ line with other host devices if we are
+ * able to disable the IRQ source at device-level - independently of
+ * the guest driver. Otherwise host devices may suffer from unbounded
+ * IRQ latencies when the guest keeps the line asserted.
+ * If PCI 2.3 support is available, we can install a sharing notifier
+ * and apply the required disabling pattern on demand.
*/
- if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
- IRQF_ONESHOT, dev->irq_name, dev))
- return -EIO;
- return 0;
+ if (dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+ handler = kvm_assigned_dev_intr_intx;
+ flags = IRQF_SHARED | IRQF_ADAPTIVE | IRQF_COND_ONESHOT;
+ } else {
+ handler = NULL;
+ flags = IRQF_ONESHOT;
+ }
+
+ err = request_threaded_irq(dev->host_irq, handler,
+ kvm_assigned_dev_thread_intx, flags,
+ dev->irq_name, dev);
+
+ if (!err && dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+ spin_lock_irq(&dev->intx_lock);
+ pci_2_3_irq_unmask(dev->dev);
+ spin_unlock_irq(&dev->intx_lock);
+ }
+
+ return err;
}
#ifdef __KVM_HAVE_MSI
@@ -256,8 +428,9 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
}
dev->host_irq = dev->dev->irq;
- if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
- 0, dev->irq_name, dev)) {
+ if (request_threaded_irq(dev->host_irq, NULL,
+ kvm_assigned_dev_thread_msi, 0,
+ dev->irq_name, dev)) {
pci_disable_msi(dev->dev);
return -EIO;
}
@@ -296,7 +469,6 @@ err:
pci_disable_msix(dev->dev);
return r;
}
-
#endif
static int assigned_device_enable_guest_intx(struct kvm *kvm,
@@ -315,7 +487,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
return 0;
}
#endif
@@ -327,7 +498,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
return 0;
}
#endif
@@ -461,6 +631,7 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
{
int r = -ENODEV;
struct kvm_assigned_dev_kernel *match;
+ unsigned long irq_type;
mutex_lock(&kvm->lock);
@@ -469,12 +640,51 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
if (!match)
goto out;
- r = kvm_deassign_irq(kvm, match, assigned_irq->flags);
+ irq_type = assigned_irq->flags & (KVM_DEV_IRQ_HOST_MASK |
+ KVM_DEV_IRQ_GUEST_MASK);
+ r = kvm_deassign_irq(kvm, match, irq_type);
out:
mutex_unlock(&kvm->lock);
return r;
}
+/*
+ * Verify that the device supports the Interrupt Disable bit in the command
+ * register, per PCI 2.3, by flipping this bit and reading it back: this bit
+ * was read-only in PCI 2.2.
+ */
+static bool pci_2_3_supported(struct pci_dev *pdev)
+{
+ u16 orig, new;
+
+ pci_block_user_cfg_access(pdev);
+
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+
+ pci_unblock_user_cfg_access(pdev);
+
+ /*
+ * There's no way to protect against hardware bugs or detect them
+ * reliably, but as long as we know what the value should be, let's
+ * go ahead and check it.
+ */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ return false;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support disabling "
+ "interrupts, IRQ sharing impossible.\n");
+ return false;
+ }
+ return true;
+}
+
static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
struct kvm_assigned_pci_dev *assigned_dev)
{
@@ -523,6 +733,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
pci_reset_function(dev);
pci_save_state(dev);
+ if (!pci_2_3_supported(dev))
+ assigned_dev->flags &= ~KVM_DEV_ASSIGN_PCI_2_3;
+
match->assigned_dev_id = assigned_dev->assigned_dev_id;
match->host_segnr = assigned_dev->segnr;
match->host_busnr = assigned_dev->busnr;
@@ -530,6 +743,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
match->flags = assigned_dev->flags;
match->dev = dev;
spin_lock_init(&match->intx_lock);
+ spin_lock_init(&match->intx_mask_lock);
match->irq_source_id = -1;
match->kvm = kvm;
match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
@@ -676,6 +890,53 @@ msix_entry_out:
}
#endif
+static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
+ struct kvm_assigned_pci_dev *assigned_dev)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *match;
+
+ mutex_lock(&kvm->lock);
+
+ match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ assigned_dev->assigned_dev_id);
+ if (!match) {
+ r = -ENODEV;
+ goto out;
+ }
+
+ spin_lock(&match->intx_mask_lock);
+
+ match->flags &= ~KVM_DEV_ASSIGN_MASK_INTX;
+ match->flags |= assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX;
+
+ if (assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX) {
+ kvm_set_irq(match->kvm, match->irq_source_id,
+ match->guest_irq, 0);
+ /*
+ * Masking at hardware-level is performed on demand, i.e. when
+ * an IRQ actually arrives at the host.
+ */
+ } else {
+ /*
+ * Unmask the IRQ line. It may have been masked meanwhile if
+ * we aren't using PCI 2.3 INTx masking on the host side.
+ */
+ spin_lock_irq(&match->intx_lock);
+ if (match->intx_state == KVM_INTX_LINE_DISABLED) {
+ enable_irq(match->host_irq);
+ match->intx_state = KVM_INTX_ENABLED;
+ }
+ spin_unlock_irq(&match->intx_lock);
+ }
+
+ spin_unlock(&match->intx_mask_lock);
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
unsigned long arg)
{
@@ -783,6 +1044,15 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
break;
}
#endif
+ case KVM_ASSIGN_SET_INTX_MASK: {
+ struct kvm_assigned_pci_dev assigned_dev;
+
+ r = -EFAULT;
+ if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
+ goto out;
+ r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
+ break;
+ }
default:
r = -ENOTTY;
break;
--
1.7.1
From: Jan Kiszka <[email protected]>
This enables interrupt handlers to retrieve the current line sharing state via
the new interrupt status word so that they can adapt to it.
The switch from shared to exclusive is generally uncritical and can thus be
performed on demand. However, preparing a line for shared mode may require
preparatory steps by the currently registered handler. It can therefore
request an ahead-of-time notification via IRQF_ADAPTIVE. The notification
consists of an exceptional handler invocation with IRQS_MAKE_SHAREABLE set in
the status word.
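For illustration, the primary handler of an IRQF_ADAPTIVE user could be
structured like this (sketch only; the device helpers are made up, the
pattern follows the KVM patch of this series):

static irqreturn_t my_adaptive_handler(int irq, void *dev_id)
{
	unsigned long status = get_irq_status(irq);

	if (status & IRQS_MAKE_SHAREABLE) {
		/* prepare for sharing: mask at device level from now on */
		my_dev_mask(dev_id);			/* made-up helper */
		return IRQ_HANDLED;
	}
	if (!(status & IRQS_SHARED))
		return IRQ_WAKE_THREAD;
	if (!my_dev_irq_pending(dev_id))		/* made-up helper */
		return IRQ_NONE;	/* not our device */
	my_dev_mask(dev_id);
	return IRQ_WAKE_THREAD;
}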
Signed-off-by: Jan Kiszka <[email protected]>
---
include/linux/interrupt.h | 10 +++++++++
kernel/irq/manage.c | 47 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 4c1aa72..12e5fc0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -55,6 +55,7 @@
* Used by threaded interrupts which need to keep the
* irq line disabled until the threaded handler has been run.
* IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
+ * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
*
*/
#define IRQF_DISABLED 0x00000020
@@ -67,6 +68,7 @@
#define IRQF_IRQPOLL 0x00001000
#define IRQF_ONESHOT 0x00002000
#define IRQF_NO_SUSPEND 0x00004000
+#define IRQF_ADAPTIVE 0x00008000
#define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND)
@@ -126,6 +128,14 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+/*
+ * Driver-readable IRQ line status flags:
+ * IRQS_SHARED - line is shared between multiple handlers
+ * IRQS_MAKE_SHAREABLE - in the process of making an exclusive line shareable
+ */
+#define IRQS_SHARED 0x00000001
+#define IRQS_MAKE_SHAREABLE 0x00000002
+
extern unsigned long get_irq_status(unsigned int irq);
#ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2ea0d30..2dd4eef 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -14,9 +14,12 @@
#include <linux/interrupt.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/mutex.h>
#include "internals.h"
+static DEFINE_MUTEX(register_lock);
+
/**
* synchronize_irq - wait for pending IRQ handlers (on other CPUs)
* @irq: interrupt number to wait for
@@ -754,6 +757,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
old = *old_ptr;
} while (old);
shared = 1;
+
+ desc->irq_data.drv_status |= IRQS_SHARED;
}
if (!shared) {
@@ -883,6 +888,7 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
{
struct irq_desc *desc = irq_to_desc(irq);
struct irqaction *action, **action_ptr;
+ bool single_handler = false;
unsigned long flags;
WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
@@ -928,7 +934,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
desc->irq_data.chip->irq_shutdown(&desc->irq_data);
else
desc->irq_data.chip->irq_disable(&desc->irq_data);
- }
+ } else if (!desc->action->next)
+ single_handler = true;
#ifdef CONFIG_SMP
/* make sure affinity_hint is cleaned up */
@@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
/* Make sure it's not being used on another CPU: */
synchronize_irq(irq);
+ if (single_handler)
+ desc->irq_data.drv_status &= ~IRQS_SHARED;
+
#ifdef CONFIG_DEBUG_SHIRQ
/*
* It's a shared IRQ -- the driver ought to be prepared for an IRQ
@@ -1002,9 +1012,13 @@ void free_irq(unsigned int irq, void *dev_id)
if (!desc)
return;
+ mutex_lock(&register_lock);
+
chip_bus_lock(desc);
kfree(__free_irq(irq, dev_id));
chip_bus_sync_unlock(desc);
+
+ mutex_unlock(&register_lock);
}
EXPORT_SYMBOL(free_irq);
@@ -1055,7 +1069,7 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
irq_handler_t thread_fn, unsigned long irqflags,
const char *devname, void *dev_id)
{
- struct irqaction *action;
+ struct irqaction *action, *old_action;
struct irq_desc *desc;
int retval;
@@ -1091,12 +1105,39 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
action->name = devname;
action->dev_id = dev_id;
+ mutex_lock(&register_lock);
+
+ old_action = desc->action;
+ if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
+ !(desc->irq_data.drv_status & IRQS_SHARED)) {
+ /*
+ * Signal the old handler that it has to switch to shareable
+ * handling mode. Disable the line to avoid any conflict with
+ * a real IRQ.
+ */
+ disable_irq(irq);
+ local_irq_disable();
+
+ desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;
+ old_action->handler(irq, old_action->dev_id);
+ desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;
+
+ local_irq_enable();
+ enable_irq(irq);
+
+ }
+
chip_bus_lock(desc);
retval = __setup_irq(irq, desc, action);
chip_bus_sync_unlock(desc);
- if (retval)
+ if (retval) {
+ if (desc->action && !desc->action->next)
+ desc->irq_data.drv_status &= ~IRQS_SHARED;
kfree(action);
+ }
+
+ mutex_unlock(&register_lock);
#ifdef CONFIG_DEBUG_SHIRQ
if (!retval && (irqflags & IRQF_SHARED)) {
--
1.7.1
On 12/14/2010 12:59 AM, Jan Kiszka wrote:
> Final but critical question: Who will pick up which bits?
>
The procedure which has served us well in the past is that tip picks up
the irq stuff and sticks it in a fast-forward-only branch; kvm merges
the branch and applies the kvm bits on top.
--
error compiling committee.c: too many arguments to function
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> +/**
> + * get_irq_status - read interrupt line status word
> + * @irq: Interrupt line of the status word
> + *
> + * This returns the current content of the status word associated with
> + * the given interrupt line. See IRQS_* flags for details.
> + */
> +unsigned long get_irq_status(unsigned int irq)
> +{
> + struct irq_desc *desc = irq_to_desc(irq);
> +
> + return desc ? desc->irq_data.drv_status : 0;
> +}
> +EXPORT_SYMBOL_GPL(get_irq_status);
We should document that this is a snapshot and in no way serialized
against modifications of drv_status. I'll fix up the kernel doc.
Thanks,
tglx
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> /* Make sure it's not being used on another CPU: */
> synchronize_irq(irq);
>
> + if (single_handler)
> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> +
What's the reason to clear this flag outside of the desc->lock held
region. I need this status for other purposes as well, where I
definitely need serialization.
> + mutex_lock(&register_lock);
> +
> + old_action = desc->action;
> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
> + /*
> + * Signal the old handler that it has to switch to shareable
> + * handling mode. Disable the line to avoid any conflict with
> + * a real IRQ.
> + */
> + disable_irq(irq);
> + local_irq_disable();
> +
> + desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;
Unserialized access as well. Will think about it.
> + old_action->handler(irq, old_action->dev_id);
> + desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;
Thanks,
tglx
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> From: Jan Kiszka <[email protected]>
> chip_bus_lock(desc);
> retval = __setup_irq(irq, desc, action);
> chip_bus_sync_unlock(desc);
>
> - if (retval)
> + if (retval) {
> + if (desc->action && !desc->action->next)
> + desc->irq_data.drv_status &= ~IRQS_SHARED;
This is redundant. IRQS_SHARED gets set in a code path where all
checks are done already.
To make that more obvious we can set it right before
raw_spin_unlock_irqrestore(&desc->lock, flags);
conditionally on (shared).
That way we can also move the kfree out of the mutex locked section.
Thanks,
tglx
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> This addresses the review comments of the previous round:
> - renamed irq_data::status to drv_status
> - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
> - fixed signature of get_irq_status (irq is now unsigned int)
> - converted register_lock into a global one
> - fixed critical white space breakage (that I just left in to check if
> anyone is actually reading the code, of course...)
Just for the record, you either missed or introduced some new white
space noise :)
On 14.12.2010 21:54, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>> /* Make sure it's not being used on another CPU: */
>> synchronize_irq(irq);
>>
>> + if (single_handler)
>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
>> +
>
> What's the reason to clear this flag outside of the desc->lock held
> region.
We need to synchronize the irq first before clearing the flag.
The problematic scenario behind this: An IRQ started in shared mode,
thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
before calling into the threaded handler. And that handler may now think
that the line is still masked as IRQS_SHARED is set.
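Drawn out as a timeline (illustration only, assuming the flag were cleared
before synchronize_irq()):

CPU 0 (free_irq)			CPU 1
					hardirq runs, sees SHARED,
					line is unmasked after the hardirq
clear IRQS_SHARED
					threaded handler runs, sees !SHARED,
					wrongly assumes the line is masked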
> I need this status for other purposes as well, where I
> definitely need serialization.
Well, two options: wrap all bit manipulations with desc->lock
acquisition/release or turn drv_status into an atomic. I don't know what
your plans with drv_status are, so...
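The atomic variant could use the bit operations directly, e.g. (sketch;
IRQS_SHARED_BIT would be a hypothetical bit-number counterpart of the
current mask):

	set_bit(IRQS_SHARED_BIT, &desc->irq_data.drv_status);
	...
	clear_bit(IRQS_SHARED_BIT, &desc->irq_data.drv_status);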
>
>> + mutex_lock(&register_lock);
>> +
>> + old_action = desc->action;
>> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
>> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
>> + /*
>> + * Signal the old handler that it has to switch to shareable
>> + * handling mode. Disable the line to avoid any conflict with
>> + * a real IRQ.
>> + */
>> + disable_irq(irq);
>> + local_irq_disable();
>> +
>> + desc->irq_data.drv_status |= IRQS_SHARED | IRQS_MAKE_SHAREABLE;
>
> Unserialized access as well. Will think about it.
>
>> + old_action->handler(irq, old_action->dev_id);
>> + desc->irq_data.drv_status &= ~IRQS_MAKE_SHAREABLE;
>
> Thanks,
>
> tglx
Jan
On 14.12.2010 22:46, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> From: Jan Kiszka <[email protected]>
>> chip_bus_lock(desc);
>> retval = __setup_irq(irq, desc, action);
>> chip_bus_sync_unlock(desc);
>>
>> - if (retval)
>> + if (retval) {
>> + if (desc->action && !desc->action->next)
>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
>
> This is redundant. IRQS_SHARED gets set in a code path where all
> checks are done already.
Nope, it's also set before entry of __setup_irq in case we call an
IRQF_ADAPTIVE handler.
We need to set it that early as we may race with IRQ events for the
already registered handler happening between the sharing notification
and the actual registration of the second handler.
Jan
On 14.12.2010 21:47, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> +/**
>> + * get_irq_status - read interrupt line status word
>> + * @irq: Interrupt line of the status word
>> + *
>> + * This returns the current content of the status word associated with
>> + * the given interrupt line. See IRQS_* flags for details.
>> + */
>> +unsigned long get_irq_status(unsigned int irq)
>> +{
>> + struct irq_desc *desc = irq_to_desc(irq);
>> +
>> + return desc ? desc->irq_data.drv_status : 0;
>> +}
>> +EXPORT_SYMBOL_GPL(get_irq_status);
>
> We should document that this is a snapshot and in no way serialized
> against modifications of drv_status. I'll fix up the kernel doc.
Yeah, I think I had some hint on this in the previous version but
apparently dropped it for this round.
Thanks,
Jan
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> On 14.12.2010 22:46, Thomas Gleixner wrote:
> > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> >> From: Jan Kiszka <[email protected]>
> >> chip_bus_lock(desc);
> >> retval = __setup_irq(irq, desc, action);
> >> chip_bus_sync_unlock(desc);
> >>
> >> - if (retval)
> >> + if (retval) {
> >> + if (desc->action && !desc->action->next)
> >> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> >
> > This is redundant. IRQS_SHARED gets set in a code path where all
> > checks are done already.
>
> Nope, it's also set before entry of __setup_irq in case we call an
> IRQF_ADAPTIVE handler.
>
> We need to set it that early as we may race with IRQ events for the
> already registered handler happening between the sharing notification
> and the actual registration of the second handler.
Hmm, ok. Though the MAKE_SHAREABLE flag should be sufficient to do the
notification.
Thanks,
tglx
On 15.12.2010 09:05, Thomas Gleixner wrote:
> On Wed, 15 Dec 2010, Jan Kiszka wrote:
>
>> On 14.12.2010 22:46, Thomas Gleixner wrote:
>>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>>>> From: Jan Kiszka <[email protected]>
>>>> chip_bus_lock(desc);
>>>> retval = __setup_irq(irq, desc, action);
>>>> chip_bus_sync_unlock(desc);
>>>>
>>>> - if (retval)
>>>> + if (retval) {
>>>> + if (desc->action && !desc->action->next)
>>>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
>>>
>>> This is redundant. IRQS_SHARED gets set in a code path where all
>>> checks are done already.
>>
>> Nope, it's also set before entry of __setup_irq in case we call an
>> IRQF_ADAPTIVE handler.
>>
>> We need to set it that early as we may race with IRQ events for the
>> already registered handler happening between the sharing notification
>> and the actual registration of the second handler.
>
> Hmm, ok. Though the MAKE_SHAREABLE flag should be sufficient to do the
> notification.
For notification, yes. But we need SHARED once we reenable the line
after the notification.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> On 15.12.2010 09:05, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> >
> >> On 14.12.2010 22:46, Thomas Gleixner wrote:
> >>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
> >>>> From: Jan Kiszka <[email protected]>
> >>>> chip_bus_lock(desc);
> >>>> retval = __setup_irq(irq, desc, action);
> >>>> chip_bus_sync_unlock(desc);
> >>>>
> >>>> - if (retval)
> >>>> + if (retval) {
> >>>> + if (desc->action && !desc->action->next)
> >>>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> >>>
> >>> This is redundant. IRQS_SHARED gets set in a code path where all
> >>> checks are done already.
> >>
> >> Nope, it's also set before entry of __setup_irq in case we call an
> >> IRQF_ADAPTIVE handler.
> >>
> >> We need to set it that early as we may race with IRQ events for the
> >> already registered handler happening between the sharing notification
> >> and the actual registration of the second handler.
> >
> > Hmm, ok. Though the MAKE_SHAREABLE flag should be sufficient to do the
> > notification.
>
> For notification, yes. But we need SHARED once we reenable the line
> after the notification.
Darn. Will think more about it.
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> On 14.12.2010 21:54, Thomas Gleixner wrote:
> > On Mon, 13 Dec 2010, Jan Kiszka wrote:
> >> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> >> /* Make sure it's not being used on another CPU: */
> >> synchronize_irq(irq);
> >>
> >> + if (single_handler)
> >> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> >> +
> >
> > What's the reason to clear this flag outside of the desc->lock held
> > region.
>
> We need to synchronize the irq first before clearing the flag.
>
> The problematic scenario behind this: An IRQ started in shared mode,
> thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
> before calling into the threaded handler. And that handler may now think
> that the line is still masked as IRQS_SHARED is set.
That should read "not set" I guess. Hmm, needs more thoughts :(
> > I need this status for other purposes as well, where I
> > definitely need serialization.
>
> Well, two options: wrap all bit manipulations with desc->lock
> acquisition/release or turn drv_status into an atomic. I don't know what
> your plans with drv_status are, so...
Some bits for irq migration and other stuff, which allows us to avoid
fiddling with irqdesc in the drivers.
Thanks,
tglx
On 15.12.2010 14:04, Thomas Gleixner wrote:
> On Wed, 15 Dec 2010, Jan Kiszka wrote:
>> On 14.12.2010 21:54, Thomas Gleixner wrote:
>>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>>>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>>>> /* Make sure it's not being used on another CPU: */
>>>> synchronize_irq(irq);
>>>>
>>>> + if (single_handler)
>>>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
>>>> +
>>>
>>> What's the reason to clear this flag outside of the desc->lock held
>>> region.
>>
>> We need to synchronize the irq first before clearing the flag.
>>
>> The problematic scenario behind this: An IRQ started in shared mode,
>> thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
>> before calling into the threaded handler. And that handler may now think
>> that the line is still masked as IRQS_SHARED is set.
>
> That should read "not set" I guess.
Can't remember who wrote this, but that guy might have been too tired
for clear sentences: Yes, of course, we could run into troubles, if
IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
and threaded handler.
> Hmm, needs more thoughts :(
Be warned, might be painful.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> On 15.12.2010 14:04, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> >> On 14.12.2010 21:54, Thomas Gleixner wrote:
> >>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
> >>>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> >>>> /* Make sure it's not being used on another CPU: */
> >>>> synchronize_irq(irq);
> >>>>
> >>>> + if (single_handler)
> >>>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> >>>> +
> >>>
> >>> What's the reason to clear this flag outside of the desc->lock held
> >>> region.
> >>
> >> We need to synchronize the irq first before clearing the flag.
> >>
> >> The problematic scenario behind this: An IRQ started in shared mode,
> >> thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
> >> before calling into the threaded handler. And that handler may now think
> >> that the line is still masked as IRQS_SHARED is set.
> >
> > That should read "not set" I guess.
>
> Can't remember who wrote this, but that guy might have been too tired
> for clear sentences: Yes, of course, we could run into troubles, if
> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
> and threaded handler.
Right.
As a side note, the current implementation requires that you look up
irq_data.drv_status for every invocation of the handler or keep a
reference to irq_data.drv_status somewhere locally, which I don't like
either.
I have an evil and nasty idea how to avoid that; I need to see how ugly
it gets. Worst case we need to go back to that notification thing
which I really wanted to avoid in the first place.
Though I like the register_mutex idea which came out of this a lot, as
it allows us to reduce the desc->lock held and interrupt disabled
regions quite nicely.
/me goes back to stare at the code
> > Hmm, needs more thoughts :(
>
> Be warned, might be painful.
Bah, my brain became pain resistant when I started hacking that code.
Thanks,
tglx
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> On 15.12.2010 14:04, Thomas Gleixner wrote:
> > On Wed, 15 Dec 2010, Jan Kiszka wrote:
> >> On 14.12.2010 21:54, Thomas Gleixner wrote:
> >>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
> >>>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
> >>>> /* Make sure it's not being used on another CPU: */
> >>>> synchronize_irq(irq);
> >>>>
> >>>> + if (single_handler)
> >>>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
> >>>> +
> >>>
> >>> What's the reason to clear this flag outside of the desc->lock held
> >>> region.
> >>
> >> We need to synchronize the irq first before clearing the flag.
> >>
> >> The problematic scenario behind this: An IRQ started in shared mode,
> >> thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
> >> before calling into the threaded handler. And that handler may now think
> >> that the line is still masked as IRQS_SHARED is set.
> >
> > That should read "not set" I guess.
>
> Can't remember who wrote this, but that guy might have been too tired
> for clear sentences: Yes, of course, we could run into troubles, if
> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
> and threaded handler.
>
> > Hmm, needs more thoughts :(
>
> Be warned, might be painful.
Talking about headache. Your solution above does not prevent that
scenario.
CPU 0					CPU 1
synchronize_irq();
					hard irq comes in sees shared and unmasks
clear IRQS_SHARED
					thread handler runs and sees !SHARED
Same scenario, just moved by a few lines :)
Thanks,
tglx
On 15.12.2010 16:41, Thomas Gleixner wrote:
> On Wed, 15 Dec 2010, Jan Kiszka wrote:
>
>> On 15.12.2010 14:04, Thomas Gleixner wrote:
>>> On Wed, 15 Dec 2010, Jan Kiszka wrote:
>>>> On 14.12.2010 21:54, Thomas Gleixner wrote:
>>>>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>>>>>> @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>>>>>> /* Make sure it's not being used on another CPU: */
>>>>>> synchronize_irq(irq);
>>>>>>
>>>>>> + if (single_handler)
>>>>>> + desc->irq_data.drv_status &= ~IRQS_SHARED;
>>>>>> +
>>>>>
>>>>> What's the reason to clear this flag outside of the desc->lock held
>>>>> region.
>>>>
>>>> We need to synchronize the irq first before clearing the flag.
>>>>
>>>> The problematic scenario behind this: An IRQ started in shared mode,
>>>> thus the line was unmasked after the hardirq. Now we clear IRQS_SHARED
>>>> before calling into the threaded handler. And that handler may now think
>>>> that the line is still masked as IRQS_SHARED is set.
>>>
>>> That should read "not set" I guess.
>>
>> Can't remember who wrote this, but that guy might have been too tired
>> for clear sentences: Yes, of course, we could run into troubles, if
>> IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard
>> and threaded handler.
>>
>>> Hmm, needs more thoughts :(
>>
>> Be warned, might be painful.
>
> Talking about headache. Your solution above does not prevent that
> scenario.
>
> CPU 0				CPU 1
>
> synchronize_irq();
>					hard irq comes in sees shared and unmasks
Nope, IRQ_ONESHOT is already cleared at that point.
> clear IRQS_SHARED
>					thread handler runs and sees !SHARED
>
> Same scenario, just moved by a few lines :)
The same, just the other way around - and mostly harmless, I hope. :)
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On Wed, 15 Dec 2010, Jan Kiszka wrote:
> On 15.12.2010 16:41, Thomas Gleixner wrote:
> > Talking about headache. Your solution above does not prevent that
> > scenario.
> >
> > CPU 0				CPU 1
> >
> > synchronize_irq();
> >					hard irq comes in sees shared and unmasks
>
> Nope, IRQ_ONESHOT is already cleared at that point.
Errm ? It's set. Could you please stop increasing my confusion ? :)
On Mon, 13 Dec 2010, Jan Kiszka wrote:
> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
> + /*
> + * Signal the old handler that it has to switch to shareable
> + * handling mode. Disable the line to avoid any conflict with
> + * a real IRQ.
> + */
> + disable_irq(irq);
This is weird, really. I thought you wanted to avoid waiting for the
threaded handler to finish if it's on the fly. So this should be
disable_irq_nosync() or did you change your mind ?
Thanks,
tglx
On 16.12.2010 14:13, Thomas Gleixner wrote:
> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
>> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
>> + /*
>> + * Signal the old handler that it has to switch to shareable
>> + * handling mode. Disable the line to avoid any conflict with
>> + * a real IRQ.
>> + */
>> + disable_irq(irq);
>
> This is weird, really. I thought you wanted to avoid waiting for the
> threaded handler to finish if it's on the fly. So this should be
> disable_irq_nosync() or did you change your mind ?
No, I did not. I wanted to avoid setting MAKE_SHAREABLE while there
might be another IRQ in flight. The handler that is called due to a real
IRQ might misinterpret MAKE_SHAREABLE as "there is no real event" and
perform the wrong steps (at least the current implementation for KVM would).
However, I will rebase my patch over your series now and try to re-think
this. The question is what could go wrong if we do not guarantee that
MAKE_SHAREABLE and ordinary IRQ will always be distinguishable. If there
is really nothing, specifically for the KVM scenario, we could even drop
the disable/enable_irq. That would also be nicer when thinking about
potential delays of the already registered handler during this
transitional phase.
Jan
On 16.12.2010 21:26, Jan Kiszka wrote:
> On 16.12.2010 14:13, Thomas Gleixner wrote:
>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>>> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
>>> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
>>> + /*
>>> + * Signal the old handler that it has to switch to shareable
>>> + * handling mode. Disable the line to avoid any conflict with
>>> + * a real IRQ.
>>> + */
>>> + disable_irq(irq);
>>
>> This is weird, really. I thought you wanted to avoid waiting for the
>> threaded handler to finish if it's on the fly. So this should be
>> disable_irq_nosync() or did you change your mind ?
>
> No, I did not. I wanted to avoid setting MAKE_SHAREABLE while there
> might be another IRQ in flight. The handler that is called due to a real
> IRQ might misinterpret MAKE_SHAREABLE as "there is no real event" and
> perform the wrong steps (at least the current implementation for KVM would).
Actually, the requirement we have to fulfill here is to prevent the hardirq
handler from seeing !SHARED while the threaded one reads "SHARED". To
achieve this without disabling the line, I'm still searching for a way
to couple the sharing state of associated hard and threaded handler runs
- but I think there is no reliable association, is there?
Jan
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> On 16.12.2010 21:26, Jan Kiszka wrote:
> > On 16.12.2010 14:13, Thomas Gleixner wrote:
> >> On Mon, 13 Dec 2010, Jan Kiszka wrote:
> >>> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
> >>> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
> >>> + /*
> >>> + * Signal the old handler that it has to switch to shareable
> >>> + * handling mode. Disable the line to avoid any conflict with
> >>> + * a real IRQ.
> >>> + */
> >>> + disable_irq(irq);
> >>
> >> This is weird, really. I thought you wanted to avoid waiting for the
> >> threaded handler to finish if it's on the fly. So this should be
> >> disable_irq_nosync() or did you change your mind ?
> >
> > No, I did not. I wanted to avoid setting MAKE_SHAREABLE while there
> > might be another IRQ in flight. The handler that is called due to a real
> > IRQ might misinterpret MAKE_SHAREABLE as "there is no real event" and
> > perform the wrong steps (at least the current implementation for KVM would).
>
> Actually, the requirement we have to fulfill here is to prevent the hardirq
> handler from seeing !SHARED while the threaded one reads "SHARED". To
> achieve this without disabling the line, I'm still searching for a way
> to couple the sharing state of associated hard and threaded handler runs
> - but I think there is no reliable association, is there?
Unfortunately not. So the only way to solve that is disabling the
interrupt which makes sure that all handlers have completed.
OTOH, if we have to disable anyway, then we could simply keep it
disabled across the installation of a new handler. That would make the
notification business go away, wouldn't it ?
Thanks,
tglx
On 17.12.2010 11:23, Thomas Gleixner wrote:
> On Fri, 17 Dec 2010, Jan Kiszka wrote:
>> On 16.12.2010 21:26, Jan Kiszka wrote:
>>> On 16.12.2010 14:13, Thomas Gleixner wrote:
>>>> On Mon, 13 Dec 2010, Jan Kiszka wrote:
>>>>> + if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
>>>>> + !(desc->irq_data.drv_status & IRQS_SHARED)) {
>>>>> + /*
>>>>> + * Signal the old handler that it has to switch to shareable
>>>>> + * handling mode. Disable the line to avoid any conflict with
>>>>> + * a real IRQ.
>>>>> + */
>>>>> + disable_irq(irq);
>>>>
>>>> This is weird, really. I thought you wanted to avoid waiting for the
>>>> threaded handler to finish if it's on the fly. So this should be
>>>> disable_irq_nosync() or did you change your mind ?
>>>
>>> No, I did not. I wanted to avoid setting MAKE_SHAREABLE while there
>>> might be another IRQ in flight. The handler that is called due to a real
>>> IRQ might misinterpret MAKE_SHAREABLE as "there is no real event" and
>>> perform the wrong steps (at least the current implementation for KVM would).
>>
>> Actually, the requirement we have to fulfill here is to prevent the hardirq
>> handler from seeing !SHARED while the threaded one reads "SHARED". To
>> achieve this without disabling the line, I'm still searching for a way
>> to couple the sharing state of associated hard and threaded handler runs
>> - but I think there is no reliable association, is there?
>
> Unfortunately not. So the only way to solve that is disabling the
> interrupt which makes sure that all handlers have completed.
Hmm, what a pity.
>
> OTOH, if we have to disable anyway, then we could simply keep it
> disabled across the installation of a new handler. That would make the
> notification business go away, wouldn't it ?
No, the notification is still necessary in case the registered handler
keeps the line off after returning from both hard and threaded handler.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> On 17.12.2010 11:23, Thomas Gleixner wrote:
> > OTOH, if we have to disable anyway, then we could simply keep it
> > disabled across the installation of a new handler. That would make the
> > notification business go away, wouldn't it ?
>
> No, the notification is still necessary in case the registered handler
> keeps the line off after returning from both hard and threaded handler.
And how should that happen? If it is in oneshot mode then the line is
reenabled when the thread handler returns.
Thanks,
tglx
On 17.12.2010 11:41, Thomas Gleixner wrote:
> On Fri, 17 Dec 2010, Jan Kiszka wrote:
>> On 17.12.2010 11:23, Thomas Gleixner wrote:
>>> OTOH, if we have to disable anyway, then we could simply keep it
>>> disabled across the installation of a new handler. That would make the
>>> notification business go away, wouldn't it ?
>>
>> No, the notification is still necessary in case the registered handler
>> keeps the line off after returning from both hard and threaded handler.
>
> And how should that happen? If it is in oneshot mode then the line is
> reenabled when the thread handler returns.
disable_irq_nosync is called by the handler before returning. And it's
the handler's job to revert this, properly synchronizing it internally.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> On 17.12.2010 11:41, Thomas Gleixner wrote:
> > On Fri, 17 Dec 2010, Jan Kiszka wrote:
> >> On 17.12.2010 11:23, Thomas Gleixner wrote:
> >>> OTOH, if we have to disable anyway, then we could simply keep it
> >>> disabled across the installation of a new handler. That would make the
> >>> notification business go away, wouldn't it ?
> >>
> >> No, the notification is still necessary in case the registered handler
> >> keeps the line off after returning from both hard and threaded handler.
> >
> > And how should that happen? If it is in oneshot mode then the line is
> > reenabled when the thread handler returns.
>
> disable_irq_nosync is called by the handler before returning. And it's
> the handler's job to revert this, properly synchronizing it internally.
disable_irq_nosync() is really the worst thing to do. That's simply
not going to work without a lot of fugliness.
What about the following:
primary_handler(....)
{
	if (!shared)
		return IRQ_WAKE_THREAD;

	spin_lock(dev->irq_lock);

	if (from_my_device() || dev->irq_thread_waiting) {
		mask_dev();
		dev->masked = true;
		ret = IRQ_WAKE_THREAD;
	} else
		ret = IRQ_NONE;

	spin_unlock(dev->irq_lock);
	return ret;
}

check_timeout()
{
	if (dev->irq_active && wait_longer())
		return IRQ_WAKE_THREAD;
	return IRQ_HANDLED;
}

unmask_dev_if_necessary()
{
	if (dev->masked && dev->irq_active)
		unmask_dev();
}

threaded_handler(....)
{
	if (!dev->irq_thread_waiting) {
		spin_lock_irq(dev->irq_lock);
		wake_user = do_magic_stuff_with_the_dev();
		dev->irq_thread_waiting = wake_user;
		spin_unlock_irq(dev->irq_lock);
		if (wake_user)
			wake_up(user);
	}

	if (!dev->irq_thread_waiting) {
		spin_lock_irq(dev->irq_lock);
		unmask_dev_if_necessary();
		spin_unlock_irq(dev->irq_lock);
		return IRQ_HANDLED;
	}

	/*
	 * Wait for user space to complete. The timeout avoids
	 * starvation of the irq line when something goes wrong.
	 */
	timedout = !wait_for_completion_timeout(dev->irq_compl,
						SENSIBLE_TIMEOUT);

	spin_lock_irq(dev->irq_lock);
	if (timedout) {
		mask_dev();
		dev->masked = true;
		/*
		 * Leave dev->irq_thread_waiting untouched and let
		 * the core code reschedule us when check_timeout
		 * decides it's worth waiting. In any case we leave
		 * the device masked at the device level, so we don't
		 * cause an interrupt storm.
		 */
		ret = check_timeout();
	} else {
		unmask_dev_if_necessary();
		dev->irq_thread_waiting = false;
		ret = IRQ_HANDLED;
	}
	spin_unlock_irq(dev->irq_lock);
	return ret;
}

userspace_complete()
{
	complete(dev->irq_compl);
}
Your approach with disable_irq_nosync() is completely flawed, simply
because you try to pretend that your interrupt handler is done while it
is not done at all. The threaded interrupt handler is done when user
space completes. Everything else is just hacking around the problem and
creates all those nasty transitional problems.
The above code does not have them at all. The threaded handler does not
care at all about the dev_id shared state encoding and the state
transitions. It merely cares about the device-internal status
dev->masked. Everything else is handled by the genirq code and the
little status check in the primary handler.
Neither does the user space completion care about it. It just completes
and is completely oblivious to the irq line state/mode. And really, the
user space part should not care at all. It can set some status before
calling complete(), but that's driver-specific stuff and has nothing to
do with the irq handling magic.
It requires a few not too intrusive modifications to the genirq code:
- Rescheduling the thread handler on IRQ_WAKE_THREAD (see the sketch
  after this list)
- Changing the oneshot finalizing a bit
- Adding the status manipulations for request/free_irq
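To illustrate the first point, here is a rough sketch against the
irq_thread() loop in kernel/irq/manage.c - purely illustrative:
irq_wait_for_interrupt() and IRQTF_RUNTHREAD exist there today, while
the IRQ_WAKE_THREAD handling is the assumed new part:

	/* inside irq_thread(), illustrative only */
	while (!irq_wait_for_interrupt(action)) {
		irqreturn_t ret;

		ret = action->thread_fn(action->irq, action->dev_id);

		/*
		 * Assumed extension: IRQ_WAKE_THREAD from the thread
		 * handler means "run me again" - mark the thread
		 * runnable again instead of going back to sleep.
		 */
		if (ret == IRQ_WAKE_THREAD)
			set_bit(IRQTF_RUNTHREAD, &action->thread_flags);
	}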
Now you might argue that the timeout is ugly, but I don't think it's
ugly at all. You need it anyway in case user space fails completely.
And coming back after 100ms to let the genirq code handle a
disable_irq() or synchronize_irq() is a reasonable request; it's the
error/corner case after all. If there is code which installs/removes an
interrupt handler on the same line every 5ms, then this code becomes
rightfully blocked out for 100ms or so.
Thanks,
tglx
On 17.12.2010 16:25, Thomas Gleixner wrote:
> On Fri, 17 Dec 2010, Jan Kiszka wrote:
>
>> On 17.12.2010 11:41, Thomas Gleixner wrote:
>>> On Fri, 17 Dec 2010, Jan Kiszka wrote:
>>>> On 17.12.2010 11:23, Thomas Gleixner wrote:
>>>>> OTOH, if we have to disable anyway, then we could simply keep it
>>>>> disabled across the installation of a new handler. That would make the
>>>>> notification business go away, wouldn't it ?
>>>>
>>>> No, the notification is still necessary in case the registered handler
>>>> keeps the line off after returning from both hard and threaded handler.
>>>
>>> And how should that happen? If it is in oneshot mode then the line is
>>> reenabled when the thread handler returns.
>>
>> disable_irq_nosync is called by the handler before returning. And it's
>> the handler's job to revert this, properly synchronizing it internally.
>
> disable_irq_nosync() is really the worst thing to do. That's simply
> not going to work without a lot of fugliness.
>
> What about the following:
>
> [...]
>
> Your approach with disable_irq_nosync() is completely flawed, simply
> because you try to pretend that your interrupt handler is done while it
> is not done at all. The threaded interrupt handler is done when user
> space completes. Everything else is just hacking around the problem and
> creates all those nasty transitional problems.
disable_irq_nosync is the pattern currently used in KVM; it's nothing
new, in fact.

The approach looks interesting but requires separate code for
non-PCI-2.3 devices, i.e. when we have no means to mask at the device
level.
Further drawbacks - unless I missed something at first glance:

- prevents any future optimizations that would work without IRQ thread
  ping-pong (i.e. once we allow guest IRQ injection from hardirq context
  for selected but typical setups)
- two additional, though light-weight, context switches on each
  interrupt completion
- continuous polling if user space decides to leave the interrupt
  unhandled (e.g. because the virtual IRQ line is masked)

Maybe the latter can be solved in a nicer way, but I don't think we can
avoid the first two. I'm not saying yet that they are killing this
approach; we just need to assess their relevance.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
On Fri, 17 Dec 2010, Jan Kiszka wrote:
> On 17.12.2010 16:25, Thomas Gleixner wrote:
> > Your approach with disable_irq_nosync() is completely flawed, simply
> > because you try to pretend that your interrupt handler is done while
> > it is not done at all. The threaded interrupt handler is done when
> > user space completes. Everything else is just hacking around the
> > problem and creates all those nasty transitional problems.
>
> disable_irq_nosync is the pattern currently used in KVM; it's nothing
> new, in fact.
That does not make it less flawed :)
> The approach looks interesting but requires separate code for
> non-PCI-2.3 devices, i.e. when we have no means to mask at the device
> level.
Why? You can have the same code, you just can't request COND_ONESHOT
handlers for it, so it needs unshared ONESHOT or it won't work at all,
no matter which approach you choose. No device-level mask, no sharing;
it's that simple.
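In code, the registration choice might look like this - a hedged
sketch, where IRQF_COND_ONESHOT is the flag proposed in this series
and dev_can_mask() is a made-up helper for PCI 2.3 masking support:

	unsigned long flags;
	int ret;

	if (dev_can_mask(dev))
		/* maskable at device level: the line may be shared */
		flags = IRQF_SHARED | IRQF_COND_ONESHOT;
	else
		/* no device-level mask: exclusive oneshot only */
		flags = IRQF_ONESHOT;

	ret = request_threaded_irq(irq, primary_handler, threaded_handler,
				   flags, "assigned-dev", dev);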
> Further drawbacks - unless I missed something at first glance:
>
> - prevents any future optimizations that would work without IRQ thread
>   ping-pong (i.e. once we allow guest IRQ injection from hardirq context
>   for selected but typical setups)
> - two additional, though light-weight, context switches on each
>   interrupt completion
The drawback of these two points is way less than the horror which you
need to introduce to do the whole async handler disable, userspace
enable dance. Robust and simple solutions are really preferred over
complex and fragile horror which has a questionable runtime benefit.
> - continuous polling if user space decides to leave the interrupt
> unhandled (e.g. because the virtual IRQ line is masked)
That should be a solvable problem.
Thanks,
tglx
On Fri, Dec 17, 2010 at 05:32:46PM +0100, Thomas Gleixner wrote:
> On Fri, 17 Dec 2010, Jan Kiszka wrote:
> > On 17.12.2010 16:25, Thomas Gleixner wrote:
> > > Your approach with disable_irq_nosync() is completely flawed, simply
> > > because you try to pretend that your interrupt handler is done while
> > > it is not done at all. The threaded interrupt handler is done when
> > > user space completes. Everything else is just hacking around the
> > > problem and creates all those nasty transitional problems.
> >
> > disable_irq_nosync is the pattern currently used in KVM; it's nothing
> > new, in fact.
>
> That does not make it less flawed :)
>
> > The approach looks interesting but requires separate code for
> > non-PCI-2.3 devices, i.e. when we have no means to mask at the device
> > level.
>
> Why? You can have the same code, you just can't request COND_ONESHOT
> handlers for it, so it needs unshared ONESHOT or it won't work at all,
> no matter which approach you choose. No device-level mask, no sharing;
> it's that simple.
>
> > Further drawbacks - unless I missed something at first glance:
> >
> > - prevents any future optimizations that would work without IRQ thread
> >   ping-pong (i.e. once we allow guest IRQ injection from hardirq context
> >   for selected but typical setups)
> > - two additional, though light-weight, context switches on each
> >   interrupt completion
>
> The drawback of these two points is way less than the horror which you
> need to introduce to do the whole async handler disable, userspace
> enable dance. Robust and simple solutions are really preferred over
> complex and fragile horror which has a questionable runtime benefit.
I'd like to note that the overhead of involving the scheduler in
interrupt injection for an assigned device should be easily measurable:
just make the MSI handlers threaded and see what the result is.
In the case of emulated devices, when we had an extra thread involved
in MSI handling, the vcpu thread and the interrupt injection thread
were competing for the cpu, with strange fluctuations in performance as
a result (i.e. sometimes we would get good speed because threading
would introduce a kind of interrupt coalescing, sometimes we would see
huge latencies).
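Such a measurement could be as simple as the following sketch - the
handler names are placeholders and kvm_inject_msi() stands in for the
actual injection path:

static irqreturn_t kvm_msi_hardirq(int irq, void *dev_id)
{
	/* defer all work to the thread to expose the scheduler cost */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t kvm_msi_thread(int irq, void *dev_id)
{
	kvm_inject_msi(dev_id);		/* placeholder injection call */
	return IRQ_HANDLED;
}

	/* instead of the current hardirq-only request_irq() */
	ret = request_threaded_irq(irq, kvm_msi_hardirq, kvm_msi_thread,
				   0, "kvm-msi-threaded", dev);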
> > - continuous polling if user space decides to leave the interrupt
> > unhandled (e.g. because the virtual IRQ line is masked)
>
> That should be a solvable problem.
>
> Thanks,
>
> tglx