LinuxLists.cc - [PATCH 0/4] kvm: level triggered irqfd support

2012-06-22 22:15:20

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 0/4] kvm: level triggered irqfd support

Here's a refined series based on my previous RFC. I now have this
working through VFIO-based device assignment in Qemu. I've split
up the patches and tried to generalize the interface as one of
potentially a couple ways we might support level irqfds. I also
tried to document the requirements around re-asserting interrupts
and expectations for avoiding races.

The last patch in the series is just an FYI/RFC of something I
expect we're also going to need. We want to take advantage of KVM
irqchip, but not be dependent on it. That means we'll want to come
up with a way for Qemu to notify drivers of an EOI. We won't want
that interface to break when when enabling the in-kernel irqchip.
The needs are just different enough that I don't see how to combine
them together. Thanks,

Alex

---

Alex Williamson (4):
[RFC] kvm: eoi_eventfd
kvm: Extend irqfd to support level interrupts
kvm: Add missing KVM_IRQFD API documentation
kvm: Pass kvm_irqfd to functions

Documentation/virtual/kvm/api.txt | 46 ++++++++
arch/x86/kvm/x86.c | 2
include/linux/kvm.h | 18 +++
include/linux/kvm_host.h | 11 ++
virt/kvm/eventfd.c | 201 ++++++++++++++++++++++++++++++++++---
virt/kvm/kvm_main.c | 11 ++
6 files changed, 268 insertions(+), 21 deletions(-)

2012-06-22 22:15:39

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 1/4] kvm: Pass kvm_irqfd to functions

Prune this down to just the struct kvm_irqfd so we can avoid
changing function definition for every flag for field we use.

Signed-off-by: Alex Williamson <[email protected]>
---

include/linux/kvm_host.h | 4 ++--
virt/kvm/eventfd.c | 20 ++++++++++----------
virt/kvm/kvm_main.c | 2 +-
3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27ac8a4..ae3b426 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -824,7 +824,7 @@ static inline void kvm_free_irq_routing(struct kvm *kvm) {}
#ifdef CONFIG_HAVE_KVM_EVENTFD

void kvm_eventfd_init(struct kvm *kvm);
-int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags);
+int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
void kvm_irqfd_release(struct kvm *kvm);
void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
@@ -833,7 +833,7 @@ int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);

static inline void kvm_eventfd_init(struct kvm *kvm) {}

-static inline int kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
{
return -EINVAL;
}
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index f59c1e8..c307c24 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -198,7 +198,7 @@ static void irqfd_update(struct kvm *kvm, struct _irqfd *irqfd,
}

static int
-kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
+kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
{
struct kvm_irq_routing_table *irq_rt;
struct _irqfd *irqfd, *tmp;
@@ -212,12 +212,12 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
return -ENOMEM;

irqfd->kvm = kvm;
- irqfd->gsi = gsi;
+ irqfd->gsi = args->gsi;
INIT_LIST_HEAD(&irqfd->list);
INIT_WORK(&irqfd->inject, irqfd_inject);
INIT_WORK(&irqfd->shutdown, irqfd_shutdown);

- file = eventfd_fget(fd);
+ file = eventfd_fget(args->fd);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto fail;
@@ -298,19 +298,19 @@ kvm_eventfd_init(struct kvm *kvm)
* shutdown any irqfd's that match fd+gsi
*/
static int
-kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
+kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
{
struct _irqfd *irqfd, *tmp;
struct eventfd_ctx *eventfd;

- eventfd = eventfd_ctx_fdget(fd);
+ eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);

spin_lock_irq(&kvm->irqfds.lock);

list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
- if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
+ if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
/*
* This rcu_assign_pointer is needed for when
* another thread calls kvm_irq_routing_update before
@@ -338,12 +338,12 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
}

int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
{
- if (flags & KVM_IRQFD_FLAG_DEASSIGN)
- return kvm_irqfd_deassign(kvm, fd, gsi);
+ if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
+ return kvm_irqfd_deassign(kvm, args);

- return kvm_irqfd_assign(kvm, fd, gsi);
+ return kvm_irqfd_assign(kvm, args);
}

/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 02cb440..b4ad14cc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2059,7 +2059,7 @@ static long kvm_vm_ioctl(struct file *filp,
r = -EFAULT;
if (copy_from_user(&data, argp, sizeof data))
goto out;
- r = kvm_irqfd(kvm, data.fd, data.gsi, data.flags);
+ r = kvm_irqfd(kvm, &data);
break;
}
case KVM_IOEVENTFD: {

2012-06-22 22:16:03

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 2/4] kvm: Add missing KVM_IRQFD API documentation

Signed-off-by: Alex Williamson <[email protected]>
---

Documentation/virtual/kvm/api.txt | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 310fe50..9b4cb2b 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1965,6 +1965,23 @@ return the hash table order in the parameter. (If the guest is using
the virtualized real-mode area (VRMA) facility, the kernel will
re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)

+4.76 KVM_IRQFD
+
+Capability: KVM_CAP_IRQFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_irqfd (in)
+Returns: 0 on success, -1 on error
+
+Allows setting an eventfd to directly trigger a guest interrupt
+kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
+kvm_irqfd.gsi specifies the irqchip pin toggled by this event. By
+default this interface only supports edge triggered interrupts,
+meaning the specified gsi is asserted and immediately de-asserted
+when the eventfd is triggered. Specifying KVM_IRQFD_FLAG_DEASSIGN
+removes the previously set irqfd matching both kvm_irqfd.fd and
+kvm_irqfd.gsi.
+

5. The kvm_run structure
------------------------

2012-06-22 22:16:27

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

KVM_IRQFD currently only supports edge triggered interrupts,
asserting then immediately deasserting an interrupt. There are a
couple ways we can emulate level triggered interrupts using
discrete events depending on the usage model we expect from drivers.
This patch implements a level emulation model useful for external
assigned device drivers, like VFIO. The irqfd is used to assert
the interrupt. When the guest issues an EOI for the interrupt, the
level is automatically deasserted and the irqfd user is notified via
an eventfd. This is therefore the LEVEL_EOI extension to KVM_IRQFD.
To do this, we need to allocate a new irq source ID for the interrupt
so we don't get interference from userspace.

Signed-off-by: Alex Williamson <[email protected]>
---

Documentation/virtual/kvm/api.txt | 15 ++++++
arch/x86/kvm/x86.c | 1
include/linux/kvm.h | 6 ++
virt/kvm/eventfd.c | 94 ++++++++++++++++++++++++++++++++++---
4 files changed, 107 insertions(+), 9 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 9b4cb2b..2f8a0aa 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1982,6 +1982,21 @@ when the eventfd is triggered. Specifying KVM_IRQFD_FLAG_DEASSIGN
removes the previously set irqfd matching both kvm_irqfd.fd and
kvm_irqfd.gsi.

+With KVM_CAP_IRQFD_LEVEL_EOI KVM_IRQFD is able to support a level
+triggered interrupt model where the irqchip pin (kvm_irqfd.gsi) is
+asserted from the kvm_irqfd.fd eventfd and remain asserted until the
+guest issues an EOI for the irqchip pin. The level interrupt is
+then de-asserted and the caller is notified via the eventfd specified
+by kvm_irqfd.fd2. Note that users of this interface are responsible
+for re-asserting the interrupt if their device still requires service
+after receiving the EOI notification. Additionally, users must not
+re-assert an interrupt until after receiving an EOI. When available,
+this feature is enabled using the KVM_IRQFD_FLAG_LEVEL_EOI flag.
+De-assigning an irqfd setup using this flag should include both
+KVM_IRQFD_FLAG_DEASSIGN and KVM_IRQFD_FLAG_LEVEL_EOI and will be
+matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
+De-assigning automatically de-asserts the interrupt line setup through
+this interface.

5. The kvm_run structure
------------------------
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a01a424..20a51fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_GET_TSC_KHZ:
case KVM_CAP_PCI_2_3:
case KVM_CAP_KVMCLOCK_CTRL:
+ case KVM_CAP_IRQFD_LEVEL_EOI:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..a916186 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_GET_SMMU_INFO 78
#define KVM_CAP_S390_COW 79
#define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_LEVEL_EOI 81

#ifdef KVM_CAP_IRQ_ROUTING

@@ -683,12 +684,15 @@ struct kvm_xen_hvm_config {
#endif

#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_IRQFD_LEVEL_EOI */
+#define KVM_IRQFD_FLAG_LEVEL_EOI (1 << 1)

struct kvm_irqfd {
__u32 fd;
__u32 gsi;
__u32 flags;
- __u8 pad[20];
+ __u32 fd2;
+ __u8 pad[16];
};

struct kvm_clock_data {
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c307c24..2bc7275 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -49,9 +49,13 @@ struct _irqfd {
wait_queue_t wait;
/* Update side is protected by irqfds.lock */
struct kvm_kernel_irq_routing_entry __rcu *irq_entry;
- /* Used for level IRQ fast-path */
+ /* Used for IRQ fast-path */
int gsi;
struct work_struct inject;
+ /* Used for level EOI path */
+ int irq_source_id;
+ struct eventfd_ctx *eoi_eventfd;
+ struct kvm_irq_ack_notifier notifier;
/* Used for setup/shutdown */
struct eventfd_ctx *eventfd;
struct list_head list;
@@ -62,7 +66,7 @@ struct _irqfd {
static struct workqueue_struct *irqfd_cleanup_wq;

static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject_edge(struct work_struct *work)
{
struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
struct kvm *kvm = irqfd->kvm;
@@ -71,6 +75,23 @@ irqfd_inject(struct work_struct *work)
kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
}

+static void
+irqfd_inject_level(struct work_struct *work)
+{
+ struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+ kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
+}
+
+static void
+irqfd_ack_level(struct kvm_irq_ack_notifier *notifier)
+{
+ struct _irqfd *irqfd = container_of(notifier, struct _irqfd, notifier);
+
+ kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 0);
+ eventfd_signal(irqfd->eoi_eventfd, 1);
+}
+
/*
* Race-free decouple logic (ordering is critical)
*/
@@ -96,6 +117,14 @@ irqfd_shutdown(struct work_struct *work)
* It is now safe to release the object's resources
*/
eventfd_ctx_put(irqfd->eventfd);
+
+ if (irqfd->eoi_eventfd) {
+ kvm_unregister_irq_ack_notifier(irqfd->kvm, &irqfd->notifier);
+ eventfd_ctx_put(irqfd->eoi_eventfd);
+ kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 0);
+ kvm_free_irq_source_id(irqfd->kvm, irqfd->irq_source_id);
+ }
+
kfree(irqfd);
}

@@ -203,8 +232,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
struct kvm_irq_routing_table *irq_rt;
struct _irqfd *irqfd, *tmp;
struct file *file = NULL;
- struct eventfd_ctx *eventfd = NULL;
- int ret;
+ struct eventfd_ctx *eventfd = NULL, *eoi_eventfd = NULL;
+ int ret, irq_source_id = -1;
unsigned int events;

irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
@@ -214,7 +243,30 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
irqfd->kvm = kvm;
irqfd->gsi = args->gsi;
INIT_LIST_HEAD(&irqfd->list);
- INIT_WORK(&irqfd->inject, irqfd_inject);
+
+ if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
+ irq_source_id = kvm_request_irq_source_id(kvm);
+ if (irq_source_id < 0) {
+ ret = irq_source_id;
+ goto fail;
+ }
+
+ eoi_eventfd = eventfd_ctx_fdget(args->fd2);
+ if (IS_ERR(eoi_eventfd)) {
+ ret = PTR_ERR(eoi_eventfd);
+ goto fail;
+ }
+
+ irqfd->irq_source_id = irq_source_id;
+ irqfd->eoi_eventfd = eoi_eventfd;
+ irqfd->notifier.gsi = args->gsi;
+ irqfd->notifier.irq_acked = irqfd_ack_level;
+ kvm_register_irq_ack_notifier(kvm, &irqfd->notifier);
+
+ INIT_WORK(&irqfd->inject, irqfd_inject_level);
+ } else
+ INIT_WORK(&irqfd->inject, irqfd_inject_edge);
+
INIT_WORK(&irqfd->shutdown, irqfd_shutdown);

file = eventfd_fget(args->fd);
@@ -231,6 +283,11 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)

irqfd->eventfd = eventfd;

+ if (eventfd == eoi_eventfd) {
+ ret = -EINVAL;
+ goto fail;
+ }
+
/*
* Install our own custom wake-up handling so we are notified via
* a callback whenever someone signals the underlying eventfd
@@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)

ret = 0;
list_for_each_entry(tmp, &kvm->irqfds.items, list) {
- if (irqfd->eventfd != tmp->eventfd)
+ if (irqfd->eventfd != tmp->eventfd &&
+ irqfd->eventfd != tmp->eoi_eventfd)
continue;
/* This fd is used for another irq already. */
ret = -EBUSY;
@@ -282,6 +340,14 @@ fail:
if (!IS_ERR(file))
fput(file);

+ if (eoi_eventfd && !IS_ERR(eoi_eventfd)) {
+ kvm_unregister_irq_ack_notifier(kvm, &irqfd->notifier);
+ eventfd_ctx_put(eoi_eventfd);
+ }
+
+ if (irq_source_id >= 0)
+ kvm_free_irq_source_id(kvm, irq_source_id);
+
kfree(irqfd);
return ret;
}
@@ -301,16 +367,26 @@ static int
kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
{
struct _irqfd *irqfd, *tmp;
- struct eventfd_ctx *eventfd;
+ struct eventfd_ctx *eventfd, *eoi_eventfd = NULL;

eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);

+ if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
+ eoi_eventfd = eventfd_ctx_fdget(args->fd2);
+ if (IS_ERR(eoi_eventfd)) {
+ eventfd_ctx_put(eventfd);
+ return PTR_ERR(eoi_eventfd);
+ }
+ }
+
spin_lock_irq(&kvm->irqfds.lock);

list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
- if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
+ if (irqfd->eventfd == eventfd &&
+ irqfd->gsi == args->gsi &&
+ irqfd->eoi_eventfd == eoi_eventfd) {
/*
* This rcu_assign_pointer is needed for when
* another thread calls kvm_irq_routing_update before
@@ -326,6 +402,8 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)

spin_unlock_irq(&kvm->irqfds.lock);
eventfd_ctx_put(eventfd);
+ if (eoi_eventfd)
+ eventfd_ctx_put(eoi_eventfd);

/*
* Block until we know all outstanding shutdown jobs have completed

2012-06-22 22:17:01

by Alex Williamson

[permalink] [raw]

Subject: [PATCH 4/4][RFC] kvm: eoi_eventfd

I think we're probably also going to need something like this.
When running in non-accelerated qemu, we're going to have to
create some kind of EOI notifier for drivers. VFIO can make
additional improvements when running on KVM so it will probably
make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
want to have a generic EOI notifier in qemu that just stops
working when kvm-ioapic is enabled. This is just a simple way
to register an eventfd using the existing KVM ack notifier.
I tried combining the ack notifier of the LEVEL_EOI interface
into this one, but it didn't work out well. The code complexity
goes up a lot.

Signed-off-by: Alex Williamson <[email protected]>
---

Documentation/virtual/kvm/api.txt | 14 ++++++
arch/x86/kvm/x86.c | 1
include/linux/kvm.h | 12 +++++
include/linux/kvm_host.h | 7 +++
virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 9 ++++
6 files changed, 132 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 2f8a0aa..69b1747 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
De-assigning automatically de-asserts the interrupt line setup through
this interface.

+4.77 KVM_EOI_EVENTFD
+
+Capability: KVM_CAP_EOI_EVENTFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_eoi_eventfd (in)
+Returns: 0 on success, -1 on error
+
+This interface allows userspace to be notified through an eventfd for
+EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
+the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
+use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
+and gsi.
+
5. The kvm_run structure
------------------------

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 20a51fe..00118b5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PCI_2_3:
case KVM_CAP_KVMCLOCK_CTRL:
case KVM_CAP_IRQFD_LEVEL_EOI:
+ case KVM_CAP_EOI_EVENTFD:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a916186..a8a7fa3 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_S390_COW 79
#define KVM_CAP_PPC_ALLOC_HTAB 80
#define KVM_CAP_IRQFD_LEVEL_EOI 81
+#define KVM_CAP_EOI_EVENTFD 82

#ifdef KVM_CAP_IRQ_ROUTING

@@ -755,6 +756,15 @@ struct kvm_msi {
__u8 pad[16];
};

+#define KVM_EOI_EVENTFD_FLAG_DEASSIGN (1 << 0)
+
+struct kvm_eoi_eventfd {
+ __u32 fd;
+ __u32 gsi;
+ __u32 flags;
+ __u8 pad[20];
+};
+
/*
* ioctls for VM fds
*/
@@ -908,6 +918,8 @@ struct kvm_s390_ucas_mapping {
#define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg)
/* VM is being stopped by host */
#define KVM_KVMCLOCK_CTRL _IO(KVMIO, 0xad)
+/* Available with KVM_CAP_EOI_EVENTFD */
+#define KVM_EOI_EVENTFD _IOW(KVMIO, 0xae, struct kvm_eoi_eventfd)

#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae3b426..97fbe21 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -285,6 +285,7 @@ struct kvm {
struct list_head items;
} irqfds;
struct list_head ioeventfds;
+ struct list_head eoi_eventfds;
#endif
struct kvm_vm_stat stat;
struct kvm_arch arch;
@@ -828,6 +829,7 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
void kvm_irqfd_release(struct kvm *kvm);
void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
+int kvm_eoi_eventfd(struct kvm *kvm, struct kvm_eoi_eventfd *args);

#else

@@ -853,6 +855,11 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
return -ENOSYS;
}

+static inline int kvm_eoi_eventfd(struct kvm *kvm, struct kvm_eoi_eventfd *args)
+{
+ return -ENOSYS;
+}
+
#endif /* CONFIG_HAVE_KVM_EVENTFD */

#ifdef CONFIG_KVM_APIC_ARCHITECTURE
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2bc7275..a01e377 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -358,6 +358,7 @@ kvm_eventfd_init(struct kvm *kvm)
spin_lock_init(&kvm->irqfds.lock);
INIT_LIST_HEAD(&kvm->irqfds.items);
INIT_LIST_HEAD(&kvm->ioeventfds);
+ INIT_LIST_HEAD(&kvm->eoi_eventfds);
}

/*
@@ -733,3 +734,91 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)

return kvm_assign_ioeventfd(kvm, args);
}
+
+/*
+ * --------------------------------------------------------------------
+ * eoi_eventfd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
+ *
+ * userspace can register GSIs with an eventfd for receiving
+ * notification when an EOI occurs.
+ * --------------------------------------------------------------------
+ */
+
+struct _eoi_eventfd {
+ struct kvm *kvm;
+ struct eventfd_ctx *eventfd;
+ struct kvm_irq_ack_notifier notifier;
+ struct list_head list;
+};
+
+static void kvm_eoi_eventfd_acked(struct kvm_irq_ack_notifier *notifier)
+{
+ struct _eoi_eventfd *eoifd;
+
+ eoifd = container_of(notifier, struct _eoi_eventfd, notifier);
+
+ eventfd_signal(eoifd->eventfd, 1);
+}
+
+static int kvm_assign_eoi_eventfd(struct kvm *kvm, struct kvm_eoi_eventfd *args)
+{
+ struct eventfd_ctx *eventfd;
+ struct _eoi_eventfd *eoifd;
+
+ eventfd = eventfd_ctx_fdget(args->fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
+ eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
+ if (!eoifd) {
+ eventfd_ctx_put(eventfd);
+ return -ENOMEM;
+ }
+
+ INIT_LIST_HEAD(&eoifd->list);
+ eoifd->kvm = kvm;
+ eoifd->eventfd = eventfd;
+ eoifd->notifier.gsi = args->gsi;
+ eoifd->notifier.irq_acked = kvm_eoi_eventfd_acked;
+
+ list_add_tail(&eoifd->list, &kvm->eoi_eventfds);
+ kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
+
+ return 0;
+}
+static int kvm_deassign_eoi_eventfd(struct kvm *kvm,
+ struct kvm_eoi_eventfd *args)
+{
+ struct eventfd_ctx *eventfd;
+ struct _eoi_eventfd *eoifd, *tmp;
+ int ret = -ENOENT;
+
+ eventfd = eventfd_ctx_fdget(args->fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
+ list_for_each_entry_safe(eoifd, tmp, &kvm->eoi_eventfds, list) {
+ if (eoifd->eventfd != eventfd ||
+ eoifd->notifier.gsi != args->gsi)
+ continue;
+
+ kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
+ eventfd_ctx_put(eoifd->eventfd);
+ list_del(&eoifd->list);
+ kfree(eoifd);
+ ret = 0;
+ break;
+ }
+
+ eventfd_ctx_put(eventfd);
+
+ return ret;
+}
+
+int kvm_eoi_eventfd(struct kvm *kvm, struct kvm_eoi_eventfd *args)
+{
+ if (args->flags & KVM_EOI_EVENTFD_FLAG_DEASSIGN)
+ return kvm_deassign_eoi_eventfd(kvm, args);
+
+ return kvm_assign_eoi_eventfd(kvm, args);
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b4ad14cc..20508c9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2093,6 +2093,15 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
#endif
+ case KVM_EOI_EVENTFD: {
+ struct kvm_eoi_eventfd data;
+
+ r = -EFAULT;
+ if (copy_from_user(&data, argp, sizeof data))
+ goto out;
+ r = kvm_eoi_eventfd(kvm, &data);
+ break;
+ }
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)

2012-06-24 08:24:56

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> I think we're probably also going to need something like this.
> When running in non-accelerated qemu, we're going to have to
> create some kind of EOI notifier for drivers. VFIO can make
> additional improvements when running on KVM so it will probably
> make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> want to have a generic EOI notifier in qemu that just stops
> working when kvm-ioapic is enabled. This is just a simple way
> to register an eventfd using the existing KVM ack notifier.
> I tried combining the ack notifier of the LEVEL_EOI interface
> into this one, but it didn't work out well. The code complexity
> goes up a lot.
>
> Signed-off-by: Alex Williamson <[email protected]>
> ---
>
> Documentation/virtual/kvm/api.txt | 14 ++++++
> arch/x86/kvm/x86.c | 1
> include/linux/kvm.h | 12 +++++
> include/linux/kvm_host.h | 7 +++
> virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 9 ++++
> 6 files changed, 132 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 2f8a0aa..69b1747 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> De-assigning automatically de-asserts the interrupt line setup through
> this interface.
>
> +4.77 KVM_EOI_EVENTFD
> +
> +Capability: KVM_CAP_EOI_EVENTFD
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_eoi_eventfd (in)
> +Returns: 0 on success, -1 on error
> +
> +This interface allows userspace to be notified through an eventfd for
> +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> +and gsi.
> +
> 5. The kvm_run structure
> ------------------------
>

The patch looks like it only works for ioapic IRQs. I think it's a good
idea to explicitly limit this to level interrupts, straight away:
there's no reason for userspace to need an exit on an edge interrupt.

I also suggest we put LEVEL somewhere in the name.

With this eventfd, do we also need the fd2 parameter in the irqfd
structure? It seems to be used for the same thing ...

--
MST

2012-06-24 08:28:59

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Fri, Jun 22, 2012 at 04:16:17PM -0600, Alex Williamson wrote:
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 2ce09aa..a916186 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> #define KVM_CAP_PPC_GET_SMMU_INFO 78
> #define KVM_CAP_S390_COW 79
> #define KVM_CAP_PPC_ALLOC_HTAB 80
> +#define KVM_CAP_IRQFD_LEVEL_EOI 81
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -683,12 +684,15 @@ struct kvm_xen_hvm_config {
> #endif
>
> #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> +/* Available with KVM_CAP_IRQFD_LEVEL_EOI */
> +#define KVM_IRQFD_FLAG_LEVEL_EOI (1 << 1)
>
> struct kvm_irqfd {
> __u32 fd;
> __u32 gsi;
> __u32 flags;
> - __u8 pad[20];
> + __u32 fd2;
> + __u8 pad[16];
> };
>
> struct kvm_clock_data {
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index c307c24..2bc7275 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -49,9 +49,13 @@ struct _irqfd {
> wait_queue_t wait;
> /* Update side is protected by irqfds.lock */
> struct kvm_kernel_irq_routing_entry __rcu *irq_entry;
> - /* Used for level IRQ fast-path */
> + /* Used for IRQ fast-path */
> int gsi;
> struct work_struct inject;
> + /* Used for level EOI path */
> + int irq_source_id;
> + struct eventfd_ctx *eoi_eventfd;
> + struct kvm_irq_ack_notifier notifier;
> /* Used for setup/shutdown */
> struct eventfd_ctx *eventfd;
> struct list_head list;
> @@ -62,7 +66,7 @@ struct _irqfd {
> static struct workqueue_struct *irqfd_cleanup_wq;
>
> static void
> -irqfd_inject(struct work_struct *work)
> +irqfd_inject_edge(struct work_struct *work)
> {
> struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> struct kvm *kvm = irqfd->kvm;
> @@ -71,6 +75,23 @@ irqfd_inject(struct work_struct *work)
> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> }
>
> +static void
> +irqfd_inject_level(struct work_struct *work)
> +{
> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> +
> + kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
> +}
> +

I think that we can actually do this for all interrupts
unconditionally. We used to need to clear level for edge, but nomore.
Once you do this, can't this patch be replaced with 4/4 completely?

--
MST

2012-06-24 08:34:34

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 2/4] kvm: Add missing KVM_IRQFD API documentation

On Fri, Jun 22, 2012 at 04:15:50PM -0600, Alex Williamson wrote:
> Signed-off-by: Alex Williamson <[email protected]>
> ---
>
> Documentation/virtual/kvm/api.txt | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 310fe50..9b4cb2b 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1965,6 +1965,23 @@ return the hash table order in the parameter. (If the guest is using
> the virtualized real-mode area (VRMA) facility, the kernel will
> re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
>
> +4.76 KVM_IRQFD
> +
> +Capability: KVM_CAP_IRQFD
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_irqfd (in)
> +Returns: 0 on success, -1 on error
> +
> +Allows setting an eventfd to directly trigger a guest interrupt
> +kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
> +kvm_irqfd.gsi specifies the irqchip pin toggled by this event. By
> +default this interface only supports edge triggered interrupts,
> +meaning the specified gsi is asserted and immediately de-asserted
> +when the eventfd is triggered.

That's a bit confusing. This assert/deassert only has effect for level.
Do we or do we not want to maintain this assert/deassert behaviour
for level irqfds? If yes we shouldn't say it's unsupported if
no we should not document it.

Gleb, Avi, any thoughts?

> Specifying KVM_IRQFD_FLAG_DEASSIGN
> +removes the previously set irqfd matching both kvm_irqfd.fd and
> +kvm_irqfd.gsi.
> +
>
> 5. The kvm_run structure
> ------------------------

2012-06-24 10:30:01

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On 06/23/2012 01:16 AM, Alex Williamson wrote:
> KVM_IRQFD currently only supports edge triggered interrupts,
> asserting then immediately deasserting an interrupt. There are a
> couple ways we can emulate level triggered interrupts using
> discrete events depending on the usage model we expect from drivers.
> This patch implements a level emulation model useful for external
> assigned device drivers, like VFIO. The irqfd is used to assert
> the interrupt. When the guest issues an EOI for the interrupt, the
> level is automatically deasserted and the irqfd user is notified via
> an eventfd. This is therefore the LEVEL_EOI extension to KVM_IRQFD.
> To do this, we need to allocate a new irq source ID for the interrupt
> so we don't get interference from userspace.
>
>
> +With KVM_CAP_IRQFD_LEVEL_EOI KVM_IRQFD is able to support a level
> +triggered interrupt model where the irqchip pin (kvm_irqfd.gsi) is
> +asserted from the kvm_irqfd.fd eventfd and remain asserted until the
> +guest issues an EOI for the irqchip pin. The level interrupt is
> +then de-asserted and the caller is notified via the eventfd specified
> +by kvm_irqfd.fd2. Note that users of this interface are responsible
> +for re-asserting the interrupt if their device still requires service
> +after receiving the EOI notification. Additionally, users must not
> +re-assert an interrupt until after receiving an EOI.

What happens if this is violated?

> When available,
> +this feature is enabled using the KVM_IRQFD_FLAG_LEVEL_EOI flag.
> +De-assigning an irqfd setup using this flag should include both
> +KVM_IRQFD_FLAG_DEASSIGN and KVM_IRQFD_FLAG_LEVEL_EOI and will be
> +matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> +De-assigning automatically de-asserts the interrupt line setup through
> +this interface.
>
> @@ -203,8 +232,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> struct kvm_irq_routing_table *irq_rt;
> struct _irqfd *irqfd, *tmp;
> struct file *file = NULL;
> - struct eventfd_ctx *eventfd = NULL;
> - int ret;
> + struct eventfd_ctx *eventfd = NULL, *eoi_eventfd = NULL;
> + int ret, irq_source_id = -1;
> unsigned int events;
>
> irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> @@ -214,7 +243,30 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> irqfd->kvm = kvm;
> irqfd->gsi = args->gsi;
> INIT_LIST_HEAD(&irqfd->list);
> - INIT_WORK(&irqfd->inject, irqfd_inject);
> +
> + if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
> + irq_source_id = kvm_request_irq_source_id(kvm);
> + if (irq_source_id < 0) {
> + ret = irq_source_id;
> + goto fail;

'file' is NULL at this point, and fput() doesn't test for NULL.

> + }
> +
> + eoi_eventfd = eventfd_ctx_fdget(args->fd2);
> + if (IS_ERR(eoi_eventfd)) {
> + ret = PTR_ERR(eoi_eventfd);
> + goto fail;

Same.

> + }
> +
> + irqfd->irq_source_id = irq_source_id;
> + irqfd->eoi_eventfd = eoi_eventfd;
> + irqfd->notifier.gsi = args->gsi;
> + irqfd->notifier.irq_acked = irqfd_ack_level;
> + kvm_register_irq_ack_notifier(kvm, &irqfd->notifier);
> +
> + INIT_WORK(&irqfd->inject, irqfd_inject_level);
> + } else
> + INIT_WORK(&irqfd->inject, irqfd_inject_edge);
> +
> INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
>
> file = eventfd_fget(args->fd);

> @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>
> ret = 0;
> list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> - if (irqfd->eventfd != tmp->eventfd)
> + if (irqfd->eventfd != tmp->eventfd &&
> + irqfd->eventfd != tmp->eoi_eventfd)
> continue;

So we allow duplicate irqfd with differing eoifd (or edge-triggered and
level-triggered irqfd on the same context).

(why the check in the first place? just so we can have a reliable
deassign or is it avoiding a deeper problem?)

> /* This fd is used for another irq already. */
> ret = -EBUSY;
> @@ -282,6 +340,14 @@ fail:
> if (!IS_ERR(file))
> fput(file);
>
> + if (eoi_eventfd && !IS_ERR(eoi_eventfd)) {
> + kvm_unregister_irq_ack_notifier(kvm, &irqfd->notifier);
> + eventfd_ctx_put(eoi_eventfd);
> + }
> +
> + if (irq_source_id >= 0)
> + kvm_free_irq_source_id(kvm, irq_source_id);
> +
> kfree(irqfd);
> return ret;
> }
> @@ -301,16 +367,26 @@ static int
> kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
> {
> struct _irqfd *irqfd, *tmp;
> - struct eventfd_ctx *eventfd;
> + struct eventfd_ctx *eventfd, *eoi_eventfd = NULL;
>
> eventfd = eventfd_ctx_fdget(args->fd);
> if (IS_ERR(eventfd))
> return PTR_ERR(eventfd);
>
> + if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
> + eoi_eventfd = eventfd_ctx_fdget(args->fd2);
> + if (IS_ERR(eoi_eventfd)) {
> + eventfd_ctx_put(eventfd);
> + return PTR_ERR(eoi_eventfd);
> + }
> + }
> +
> spin_lock_irq(&kvm->irqfds.lock);
>
> list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> - if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
> + if (irqfd->eventfd == eventfd &&
> + irqfd->gsi == args->gsi &&
> + irqfd->eoi_eventfd == eoi_eventfd) {
> /*
> * This rcu_assign_pointer is needed for when
> * another thread calls kvm_irq_routing_update before
> @@ -326,6 +402,8 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
>
> spin_unlock_irq(&kvm->irqfds.lock);
> eventfd_ctx_put(eventfd);
> + if (eoi_eventfd)
> + eventfd_ctx_put(eoi_eventfd);
>
> /*
> * Block until we know all outstanding shutdown jobs have completed

I see there is no check on undefined ->flags that you had to relax.
This means that there could be buggy code out there that sets LEVEL_EOI
and will now get an unexpected error. I doubt it's the case, but please
add a patch (in front of the series) that adds the check.

Xen had/has a hack for doing this in a different way, based on ioapic
polarity. When the host takes an interrupt, they reverse the polarity
on that ioapic pin, so they get interrupts on both assertion and
deassertion. This is more general and more correct, but waaaaaaaaaaay
more intrusive and won't play well with shared host interrupts. But
let's at least consider it.

--
error compiling committee.c: too many arguments to function

2012-06-24 12:57:05

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On 06/23/2012 01:16 AM, Alex Williamson wrote:
> I think we're probably also going to need something like this.
> When running in non-accelerated qemu, we're going to have to
> create some kind of EOI notifier for drivers. VFIO can make
> additional improvements when running on KVM so it will probably
> make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> want to have a generic EOI notifier in qemu that just stops
> working when kvm-ioapic is enabled.

Why?

--
error compiling committee.c: too many arguments to function

2012-06-24 14:48:05

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Sun, 2012-06-24 at 11:24 +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> > I think we're probably also going to need something like this.
> > When running in non-accelerated qemu, we're going to have to
> > create some kind of EOI notifier for drivers. VFIO can make
> > additional improvements when running on KVM so it will probably
> > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > want to have a generic EOI notifier in qemu that just stops
> > working when kvm-ioapic is enabled. This is just a simple way
> > to register an eventfd using the existing KVM ack notifier.
> > I tried combining the ack notifier of the LEVEL_EOI interface
> > into this one, but it didn't work out well. The code complexity
> > goes up a lot.
> >
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> >
> > Documentation/virtual/kvm/api.txt | 14 ++++++
> > arch/x86/kvm/x86.c | 1
> > include/linux/kvm.h | 12 +++++
> > include/linux/kvm_host.h | 7 +++
> > virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> > virt/kvm/kvm_main.c | 9 ++++
> > 6 files changed, 132 insertions(+)
> >
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 2f8a0aa..69b1747 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > De-assigning automatically de-asserts the interrupt line setup through
> > this interface.
> >
> > +4.77 KVM_EOI_EVENTFD
> > +
> > +Capability: KVM_CAP_EOI_EVENTFD
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_eoi_eventfd (in)
> > +Returns: 0 on success, -1 on error
> > +
> > +This interface allows userspace to be notified through an eventfd for
> > +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> > +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> > +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> > +and gsi.
> > +
> > 5. The kvm_run structure
> > ------------------------
> >
>
> The patch looks like it only works for ioapic IRQs. I think it's a good
> idea to explicitly limit this to level interrupts, straight away:
> there's no reason for userspace to need an exit on an edge interrupt.

Are you suggesting documentation or code to prevent users from binding
to an edge interrupt?

> I also suggest we put LEVEL somewhere in the name.

Ok

> With this eventfd, do we also need the fd2 parameter in the irqfd
> structure? It seems to be used for the same thing ...

Yes we still need fd2, this does not replace KVM_IRQFD_LEVEL_EOI. The
models we need to support are:

1) assert from userspace (IRQ_LINE), eoi to userspace (EOI_EVENTFD),
de-assert from userspace (IRQ_LINE)

2a) assert from vfio (IRQFD.fd), de-assert in kvm, notify vfio
(IRQFD.fd2)

or

2b) assert from vfio (IRQFD.fd), eoi to vfio (EOI_EVENTFD), de-assert
from vfio (IRQFD.fd2)

This series enables 1) and 2a). 2b) has some pros and cons. The pro is
that vfio could test the device to see if an interrupt is pending and
not de-assert if there is still an interrupt pending. The con is that a
standard level interrupt cycle takes 3 transactions instead of 2.

Any case of triggering the interrupt externally via an irqfd requires
that the irqfd uses it's own irq_source_id so that it doesn't interfere
with KVM_USERSPACE_IRQ_SOURCE_ID. So we can't mix IRQ_LINE and IRQFD
unless we want to completely expose a new irq source id allocation
mechanism and include it in all the interfaces (KVM_GET_IRQ_SOURCE_ID,
KVM_PUT_IRQ_SOURCE_ID, pass a source id to IRQFD and IRQ_LINE, etc).
Thanks,

Alex

2012-06-24 14:51:04

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, 2012-06-24 at 11:28 +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 22, 2012 at 04:16:17PM -0600, Alex Williamson wrote:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 2ce09aa..a916186 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > #define KVM_CAP_S390_COW 79
> > #define KVM_CAP_PPC_ALLOC_HTAB 80
> > +#define KVM_CAP_IRQFD_LEVEL_EOI 81
> >
> > #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -683,12 +684,15 @@ struct kvm_xen_hvm_config {
> > #endif
> >
> > #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > +/* Available with KVM_CAP_IRQFD_LEVEL_EOI */
> > +#define KVM_IRQFD_FLAG_LEVEL_EOI (1 << 1)
> >
> > struct kvm_irqfd {
> > __u32 fd;
> > __u32 gsi;
> > __u32 flags;
> > - __u8 pad[20];
> > + __u32 fd2;
> > + __u8 pad[16];
> > };
> >
> > struct kvm_clock_data {
> > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > index c307c24..2bc7275 100644
> > --- a/virt/kvm/eventfd.c
> > +++ b/virt/kvm/eventfd.c
> > @@ -49,9 +49,13 @@ struct _irqfd {
> > wait_queue_t wait;
> > /* Update side is protected by irqfds.lock */
> > struct kvm_kernel_irq_routing_entry __rcu *irq_entry;
> > - /* Used for level IRQ fast-path */
> > + /* Used for IRQ fast-path */
> > int gsi;
> > struct work_struct inject;
> > + /* Used for level EOI path */
> > + int irq_source_id;
> > + struct eventfd_ctx *eoi_eventfd;
> > + struct kvm_irq_ack_notifier notifier;
> > /* Used for setup/shutdown */
> > struct eventfd_ctx *eventfd;
> > struct list_head list;
> > @@ -62,7 +66,7 @@ struct _irqfd {
> > static struct workqueue_struct *irqfd_cleanup_wq;
> >
> > static void
> > -irqfd_inject(struct work_struct *work)
> > +irqfd_inject_edge(struct work_struct *work)
> > {
> > struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > struct kvm *kvm = irqfd->kvm;
> > @@ -71,6 +75,23 @@ irqfd_inject(struct work_struct *work)
> > kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > }
> >
> > +static void
> > +irqfd_inject_level(struct work_struct *work)
> > +{
> > + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > +
> > + kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
> > +}
> > +
>
> I think that we can actually do this for all interrupts
> unconditionally. We used to need to clear level for edge, but nomore.
> Once you do this, can't this patch be replaced with 4/4 completely?

Nope, as described in my reply there, we can't mix IRQFD and IRQ_LINE
for a level interrupt because we need a separate source id and have no
way to specify it with IRQ_LINE. Thanks,

Alex

2012-06-24 14:56:24

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 2/4] kvm: Add missing KVM_IRQFD API documentation

On Sun, 2012-06-24 at 11:34 +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 22, 2012 at 04:15:50PM -0600, Alex Williamson wrote:
> > Signed-off-by: Alex Williamson <[email protected]>
> > ---
> >
> > Documentation/virtual/kvm/api.txt | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 310fe50..9b4cb2b 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1965,6 +1965,23 @@ return the hash table order in the parameter. (If the guest is using
> > the virtualized real-mode area (VRMA) facility, the kernel will
> > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> >
> > +4.76 KVM_IRQFD
> > +
> > +Capability: KVM_CAP_IRQFD
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_irqfd (in)
> > +Returns: 0 on success, -1 on error
> > +
> > +Allows setting an eventfd to directly trigger a guest interrupt
> > +kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
> > +kvm_irqfd.gsi specifies the irqchip pin toggled by this event. By
> > +default this interface only supports edge triggered interrupts,
> > +meaning the specified gsi is asserted and immediately de-asserted
> > +when the eventfd is triggered.
>
> That's a bit confusing. This assert/deassert only has effect for level.
> Do we or do we not want to maintain this assert/deassert behaviour
> for level irqfds? If yes we shouldn't say it's unsupported if
> no we should not document it.
>
> Gleb, Avi, any thoughts?

AIUI the assert/deassert manages to work for level (I actually had vfio
injecting these for a while and it seemed to work), but I don't think it
was ever designed for level interrupts and attempting to do that is
unsupported. Thanks,

Alex

> > Specifying KVM_IRQFD_FLAG_DEASSIGN
> > +removes the previously set irqfd matching both kvm_irqfd.fd and
> > +kvm_irqfd.gsi.
> > +
> >
> > 5. The kvm_run structure
> > ------------------------

2012-06-24 15:02:43

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Sun, 2012-06-24 at 15:56 +0300, Avi Kivity wrote:
> On 06/23/2012 01:16 AM, Alex Williamson wrote:
> > I think we're probably also going to need something like this.
> > When running in non-accelerated qemu, we're going to have to
> > create some kind of EOI notifier for drivers. VFIO can make
> > additional improvements when running on KVM so it will probably
> > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > want to have a generic EOI notifier in qemu that just stops
> > working when kvm-ioapic is enabled.
>
> Why?

Hmm, I must be missing something or not describing it correctly, because
it seems obvious. If we create a dependency in qemu of needing to know
when an eoi occurs and notifier a driver and have no way to fulfill that
dependency when running on kvm... that'd be bad, right? I don't want to
assume that every consumer of such an interface would prefer to make use
of an irqfd. Not sure if that answers your question though. Thanks,

Alex

2012-06-24 15:18:45

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, 2012-06-24 at 13:29 +0300, Avi Kivity wrote:
> On 06/23/2012 01:16 AM, Alex Williamson wrote:
> > KVM_IRQFD currently only supports edge triggered interrupts,
> > asserting then immediately deasserting an interrupt. There are a
> > couple ways we can emulate level triggered interrupts using
> > discrete events depending on the usage model we expect from drivers.
> > This patch implements a level emulation model useful for external
> > assigned device drivers, like VFIO. The irqfd is used to assert
> > the interrupt. When the guest issues an EOI for the interrupt, the
> > level is automatically deasserted and the irqfd user is notified via
> > an eventfd. This is therefore the LEVEL_EOI extension to KVM_IRQFD.
> > To do this, we need to allocate a new irq source ID for the interrupt
> > so we don't get interference from userspace.
> >
> >
> > +With KVM_CAP_IRQFD_LEVEL_EOI KVM_IRQFD is able to support a level
> > +triggered interrupt model where the irqchip pin (kvm_irqfd.gsi) is
> > +asserted from the kvm_irqfd.fd eventfd and remain asserted until the
> > +guest issues an EOI for the irqchip pin. The level interrupt is
> > +then de-asserted and the caller is notified via the eventfd specified
> > +by kvm_irqfd.fd2. Note that users of this interface are responsible
> > +for re-asserting the interrupt if their device still requires service
> > +after receiving the EOI notification. Additionally, users must not
> > +re-assert an interrupt until after receiving an EOI.
>
> What happens if this is violated?

Hmm, perhaps nothing. The only race I see is re-asserting in the gap
between de-asserting the guest and sending the EOI. At worst that would
cause a spurious interrupt, so probably no big deal.

> > When available,
> > +this feature is enabled using the KVM_IRQFD_FLAG_LEVEL_EOI flag.
> > +De-assigning an irqfd setup using this flag should include both
> > +KVM_IRQFD_FLAG_DEASSIGN and KVM_IRQFD_FLAG_LEVEL_EOI and will be
> > +matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > +De-assigning automatically de-asserts the interrupt line setup through
> > +this interface.
> >
> > @@ -203,8 +232,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > struct kvm_irq_routing_table *irq_rt;
> > struct _irqfd *irqfd, *tmp;
> > struct file *file = NULL;
> > - struct eventfd_ctx *eventfd = NULL;
> > - int ret;
> > + struct eventfd_ctx *eventfd = NULL, *eoi_eventfd = NULL;
> > + int ret, irq_source_id = -1;
> > unsigned int events;
> >
> > irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> > @@ -214,7 +243,30 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > irqfd->kvm = kvm;
> > irqfd->gsi = args->gsi;
> > INIT_LIST_HEAD(&irqfd->list);
> > - INIT_WORK(&irqfd->inject, irqfd_inject);
> > +
> > + if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
> > + irq_source_id = kvm_request_irq_source_id(kvm);
> > + if (irq_source_id < 0) {
> > + ret = irq_source_id;
> > + goto fail;
>
> 'file' is NULL at this point, and fput() doesn't test for NULL.

Good catch. I was looking for an excuse to move the existing code to
eventfd_ctx_fdget() and avoid the 2 step process it uses now.

> > + }
> > +
> > + eoi_eventfd = eventfd_ctx_fdget(args->fd2);
> > + if (IS_ERR(eoi_eventfd)) {
> > + ret = PTR_ERR(eoi_eventfd);
> > + goto fail;
>
> Same.

Yep

> > + }
> > +
> > + irqfd->irq_source_id = irq_source_id;
> > + irqfd->eoi_eventfd = eoi_eventfd;
> > + irqfd->notifier.gsi = args->gsi;
> > + irqfd->notifier.irq_acked = irqfd_ack_level;
> > + kvm_register_irq_ack_notifier(kvm, &irqfd->notifier);
> > +
> > + INIT_WORK(&irqfd->inject, irqfd_inject_level);
> > + } else
> > + INIT_WORK(&irqfd->inject, irqfd_inject_edge);
> > +
> > INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> >
> > file = eventfd_fget(args->fd);
>
> > @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> >
> > ret = 0;
> > list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > - if (irqfd->eventfd != tmp->eventfd)
> > + if (irqfd->eventfd != tmp->eventfd &&
> > + irqfd->eventfd != tmp->eoi_eventfd)
> > continue;
>
> So we allow duplicate irqfd with differing eoifd (or edge-triggered and
> level-triggered irqfd on the same context).
>
> (why the check in the first place? just so we can have a reliable
> deassign or is it avoiding a deeper problem?)

I really wasn't sure to what extent we wanted to prevent duplicates. My
guess was that we don't want to have an irqfd trigger more than one
thing. That seems to be what the current code does. I don't see any
problems with multiple irqfds triggering the same eventfd though. I
only added a test that a new irqfd can't be triggered by an existing
eoi_eventfd as that could make a nasty loop.

> > /* This fd is used for another irq already. */
> > ret = -EBUSY;
> > @@ -282,6 +340,14 @@ fail:
> > if (!IS_ERR(file))
> > fput(file);
> >
> > + if (eoi_eventfd && !IS_ERR(eoi_eventfd)) {
> > + kvm_unregister_irq_ack_notifier(kvm, &irqfd->notifier);
> > + eventfd_ctx_put(eoi_eventfd);
> > + }
> > +
> > + if (irq_source_id >= 0)
> > + kvm_free_irq_source_id(kvm, irq_source_id);
> > +
> > kfree(irqfd);
> > return ret;
> > }
> > @@ -301,16 +367,26 @@ static int
> > kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
> > {
> > struct _irqfd *irqfd, *tmp;
> > - struct eventfd_ctx *eventfd;
> > + struct eventfd_ctx *eventfd, *eoi_eventfd = NULL;
> >
> > eventfd = eventfd_ctx_fdget(args->fd);
> > if (IS_ERR(eventfd))
> > return PTR_ERR(eventfd);
> >
> > + if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
> > + eoi_eventfd = eventfd_ctx_fdget(args->fd2);
> > + if (IS_ERR(eoi_eventfd)) {
> > + eventfd_ctx_put(eventfd);
> > + return PTR_ERR(eoi_eventfd);
> > + }
> > + }
> > +
> > spin_lock_irq(&kvm->irqfds.lock);
> >
> > list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> > - if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi) {
> > + if (irqfd->eventfd == eventfd &&
> > + irqfd->gsi == args->gsi &&
> > + irqfd->eoi_eventfd == eoi_eventfd) {
> > /*
> > * This rcu_assign_pointer is needed for when
> > * another thread calls kvm_irq_routing_update before
> > @@ -326,6 +402,8 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
> >
> > spin_unlock_irq(&kvm->irqfds.lock);
> > eventfd_ctx_put(eventfd);
> > + if (eoi_eventfd)
> > + eventfd_ctx_put(eoi_eventfd);
> >
> > /*
> > * Block until we know all outstanding shutdown jobs have completed
>
> I see there is no check on undefined ->flags that you had to relax.
> This means that there could be buggy code out there that sets LEVEL_EOI
> and will now get an unexpected error. I doubt it's the case, but please
> add a patch (in front of the series) that adds the check.

Ok.

> Xen had/has a hack for doing this in a different way, based on ioapic
> polarity. When the host takes an interrupt, they reverse the polarity
> on that ioapic pin, so they get interrupts on both assertion and
> deassertion. This is more general and more correct, but waaaaaaaaaaay
> more intrusive and won't play well with shared host interrupts. But
> let's at least consider it.

Thanks, I'll look for that code.

Alex

2012-06-24 15:40:50

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Sun, Jun 24, 2012 at 08:47:57AM -0600, Alex Williamson wrote:
> On Sun, 2012-06-24 at 11:24 +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> > > I think we're probably also going to need something like this.
> > > When running in non-accelerated qemu, we're going to have to
> > > create some kind of EOI notifier for drivers. VFIO can make
> > > additional improvements when running on KVM so it will probably
> > > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > > want to have a generic EOI notifier in qemu that just stops
> > > working when kvm-ioapic is enabled. This is just a simple way
> > > to register an eventfd using the existing KVM ack notifier.
> > > I tried combining the ack notifier of the LEVEL_EOI interface
> > > into this one, but it didn't work out well. The code complexity
> > > goes up a lot.
> > >
> > > Signed-off-by: Alex Williamson <[email protected]>
> > > ---
> > >
> > > Documentation/virtual/kvm/api.txt | 14 ++++++
> > > arch/x86/kvm/x86.c | 1
> > > include/linux/kvm.h | 12 +++++
> > > include/linux/kvm_host.h | 7 +++
> > > virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> > > virt/kvm/kvm_main.c | 9 ++++
> > > 6 files changed, 132 insertions(+)
> > >
> > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > index 2f8a0aa..69b1747 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > > De-assigning automatically de-asserts the interrupt line setup through
> > > this interface.
> > >
> > > +4.77 KVM_EOI_EVENTFD
> > > +
> > > +Capability: KVM_CAP_EOI_EVENTFD
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_eoi_eventfd (in)
> > > +Returns: 0 on success, -1 on error
> > > +
> > > +This interface allows userspace to be notified through an eventfd for
> > > +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> > > +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> > > +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> > > +and gsi.
> > > +
> > > 5. The kvm_run structure
> > > ------------------------
> > >
> >
> > The patch looks like it only works for ioapic IRQs. I think it's a good
> > idea to explicitly limit this to level interrupts, straight away:
> > there's no reason for userspace to need an exit on an edge interrupt.
>
> Are you suggesting documentation or code to prevent users from binding
> to an edge interrupt?

Yes.

> > I also suggest we put LEVEL somewhere in the name.
>
> Ok
>
> > With this eventfd, do we also need the fd2 parameter in the irqfd
> > structure? It seems to be used for the same thing ...
>
> Yes we still need fd2, this does not replace KVM_IRQFD_LEVEL_EOI. The
> models we need to support are:
>
> 1) assert from userspace (IRQ_LINE), eoi to userspace (EOI_EVENTFD),
> de-assert from userspace (IRQ_LINE)
>
> 2a) assert from vfio (IRQFD.fd), de-assert in kvm, notify vfio
> (IRQFD.fd2)
> or
>
> 2b) assert from vfio (IRQFD.fd), eoi to vfio (EOI_EVENTFD), de-assert
> from vfio (IRQFD.fd2)

Or maybe deassert in kvm, notify vfio using eventfd?

>
> This series enables 1) and 2a). 2b) has some pros and cons. The pro is
> that vfio could test the device to see if an interrupt is pending and
> not de-assert if there is still an interrupt pending. The con is that a
> standard level interrupt cycle takes 3 transactions instead of 2.
>
> Any case of triggering the interrupt externally via an irqfd requires
> that the irqfd uses it's own irq_source_id so that it doesn't interfere
> with KVM_USERSPACE_IRQ_SOURCE_ID. So we can't mix IRQ_LINE and IRQFD
> unless we want to completely expose a new irq source id allocation
> mechanism and include it in all the interfaces (KVM_GET_IRQ_SOURCE_ID,
> KVM_PUT_IRQ_SOURCE_ID, pass a source id to IRQFD and IRQ_LINE, etc).
> Thanks,
>
> Alex

I don't follow this last paragraph. What are you trying to say?
To support IRQ sharing for level irqs, each irqfd for level IRQ
must have a distinct source id. But IRQ_LINE works fine without,
and one source ID for all irq_line users seems enough.

--
MST

2012-06-24 15:45:54

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, Jun 24, 2012 at 08:50:56AM -0600, Alex Williamson wrote:
> On Sun, 2012-06-24 at 11:28 +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 22, 2012 at 04:16:17PM -0600, Alex Williamson wrote:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 2ce09aa..a916186 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > > #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > > #define KVM_CAP_S390_COW 79
> > > #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > +#define KVM_CAP_IRQFD_LEVEL_EOI 81
> > >
> > > #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > @@ -683,12 +684,15 @@ struct kvm_xen_hvm_config {
> > > #endif
> > >
> > > #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > +/* Available with KVM_CAP_IRQFD_LEVEL_EOI */
> > > +#define KVM_IRQFD_FLAG_LEVEL_EOI (1 << 1)
> > >
> > > struct kvm_irqfd {
> > > __u32 fd;
> > > __u32 gsi;
> > > __u32 flags;
> > > - __u8 pad[20];
> > > + __u32 fd2;
> > > + __u8 pad[16];
> > > };
> > >
> > > struct kvm_clock_data {
> > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > index c307c24..2bc7275 100644
> > > --- a/virt/kvm/eventfd.c
> > > +++ b/virt/kvm/eventfd.c
> > > @@ -49,9 +49,13 @@ struct _irqfd {
> > > wait_queue_t wait;
> > > /* Update side is protected by irqfds.lock */
> > > struct kvm_kernel_irq_routing_entry __rcu *irq_entry;
> > > - /* Used for level IRQ fast-path */
> > > + /* Used for IRQ fast-path */
> > > int gsi;
> > > struct work_struct inject;
> > > + /* Used for level EOI path */
> > > + int irq_source_id;
> > > + struct eventfd_ctx *eoi_eventfd;
> > > + struct kvm_irq_ack_notifier notifier;
> > > /* Used for setup/shutdown */
> > > struct eventfd_ctx *eventfd;
> > > struct list_head list;
> > > @@ -62,7 +66,7 @@ struct _irqfd {
> > > static struct workqueue_struct *irqfd_cleanup_wq;
> > >
> > > static void
> > > -irqfd_inject(struct work_struct *work)
> > > +irqfd_inject_edge(struct work_struct *work)
> > > {
> > > struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > struct kvm *kvm = irqfd->kvm;
> > > @@ -71,6 +75,23 @@ irqfd_inject(struct work_struct *work)
> > > kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > }
> > >
> > > +static void
> > > +irqfd_inject_level(struct work_struct *work)
> > > +{
> > > + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > +
> > > + kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
> > > +}
> > > +
> >
> > I think that we can actually do this for all interrupts
> > unconditionally. We used to need to clear level for edge, but nomore.
> > Once you do this, can't this patch be replaced with 4/4 completely?
>
> Nope, as described in my reply there, we can't mix IRQFD and IRQ_LINE
> for a level interrupt because we need a separate source id and have no
> way to specify it with IRQ_LINE. Thanks,
>
> Alex

OK, so the reason to set a new flag here is to assign
a unique source ID to it? OK, so just store the source
id in irqfd, rest of the code can be shared.
Maybe we can make all irqfds use unique source IDs?
Or is that too expensive for some reason?

--
MST

2012-06-24 15:46:45

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 2/4] kvm: Add missing KVM_IRQFD API documentation

On Sun, Jun 24, 2012 at 08:56:17AM -0600, Alex Williamson wrote:
> On Sun, 2012-06-24 at 11:34 +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 22, 2012 at 04:15:50PM -0600, Alex Williamson wrote:
> > > Signed-off-by: Alex Williamson <[email protected]>
> > > ---
> > >
> > > Documentation/virtual/kvm/api.txt | 17 +++++++++++++++++
> > > 1 file changed, 17 insertions(+)
> > >
> > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > index 310fe50..9b4cb2b 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -1965,6 +1965,23 @@ return the hash table order in the parameter. (If the guest is using
> > > the virtualized real-mode area (VRMA) facility, the kernel will
> > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > >
> > > +4.76 KVM_IRQFD
> > > +
> > > +Capability: KVM_CAP_IRQFD
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_irqfd (in)
> > > +Returns: 0 on success, -1 on error
> > > +
> > > +Allows setting an eventfd to directly trigger a guest interrupt
> > > +kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
> > > +kvm_irqfd.gsi specifies the irqchip pin toggled by this event. By
> > > +default this interface only supports edge triggered interrupts,
> > > +meaning the specified gsi is asserted and immediately de-asserted
> > > +when the eventfd is triggered.
> >
> > That's a bit confusing. This assert/deassert only has effect for level.
> > Do we or do we not want to maintain this assert/deassert behaviour
> > for level irqfds? If yes we shouldn't say it's unsupported if
> > no we should not document it.
> >
> > Gleb, Avi, any thoughts?
>
> AIUI the assert/deassert manages to work for level (I actually had vfio
> injecting these for a while and it seemed to work), but I don't think it
> was ever designed for level interrupts and attempting to do that is
> unsupported. Thanks,
>
> Alex

So don't document assert/deassert then.

> > > Specifying KVM_IRQFD_FLAG_DEASSIGN
> > > +removes the previously set irqfd matching both kvm_irqfd.fd and
> > > +kvm_irqfd.gsi.
> > > +
> > >
> > > 5. The kvm_run structure
> > > ------------------------
>
>

2012-06-24 15:49:38

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, Jun 24, 2012 at 09:18:38AM -0600, Alex Williamson wrote:
> > > @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > >
> > > ret = 0;
> > > list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > - if (irqfd->eventfd != tmp->eventfd)
> > > + if (irqfd->eventfd != tmp->eventfd &&
> > > + irqfd->eventfd != tmp->eoi_eventfd)
> > > continue;
> >
> > So we allow duplicate irqfd with differing eoifd (or edge-triggered and
> > level-triggered irqfd on the same context).
> >
> > (why the check in the first place? just so we can have a reliable
> > deassign or is it avoiding a deeper problem?)
>
> I really wasn't sure to what extent we wanted to prevent duplicates. My
> guess was that we don't want to have an irqfd trigger more than one
> thing. That seems to be what the current code does. I don't see any
> problems with multiple irqfds triggering the same eventfd though. I
> only added a test that a new irqfd can't be triggered by an existing
> eoi_eventfd as that could make a nasty loop.

How would that make a loop? You can have the same thing
with e.g. ioeventfd - why isn't it a problem there?

--
MST

2012-06-24 21:50:20

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Sun, 2012-06-24 at 18:40 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 24, 2012 at 08:47:57AM -0600, Alex Williamson wrote:
> > On Sun, 2012-06-24 at 11:24 +0300, Michael S. Tsirkin wrote:
> > > On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> > > > I think we're probably also going to need something like this.
> > > > When running in non-accelerated qemu, we're going to have to
> > > > create some kind of EOI notifier for drivers. VFIO can make
> > > > additional improvements when running on KVM so it will probably
> > > > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > > > want to have a generic EOI notifier in qemu that just stops
> > > > working when kvm-ioapic is enabled. This is just a simple way
> > > > to register an eventfd using the existing KVM ack notifier.
> > > > I tried combining the ack notifier of the LEVEL_EOI interface
> > > > into this one, but it didn't work out well. The code complexity
> > > > goes up a lot.
> > > >
> > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > ---
> > > >
> > > > Documentation/virtual/kvm/api.txt | 14 ++++++
> > > > arch/x86/kvm/x86.c | 1
> > > > include/linux/kvm.h | 12 +++++
> > > > include/linux/kvm_host.h | 7 +++
> > > > virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> > > > virt/kvm/kvm_main.c | 9 ++++
> > > > 6 files changed, 132 insertions(+)
> > > >
> > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > index 2f8a0aa..69b1747 100644
> > > > --- a/Documentation/virtual/kvm/api.txt
> > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > > > De-assigning automatically de-asserts the interrupt line setup through
> > > > this interface.
> > > >
> > > > +4.77 KVM_EOI_EVENTFD
> > > > +
> > > > +Capability: KVM_CAP_EOI_EVENTFD
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_eoi_eventfd (in)
> > > > +Returns: 0 on success, -1 on error
> > > > +
> > > > +This interface allows userspace to be notified through an eventfd for
> > > > +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> > > > +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> > > > +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> > > > +and gsi.
> > > > +
> > > > 5. The kvm_run structure
> > > > ------------------------
> > > >
> > >
> > > The patch looks like it only works for ioapic IRQs. I think it's a good
> > > idea to explicitly limit this to level interrupts, straight away:
> > > there's no reason for userspace to need an exit on an edge interrupt.
> >
> > Are you suggesting documentation or code to prevent users from binding
> > to an edge interrupt?
>
> Yes.

Ok, that was failed attempt at clarification. I'll assume documentation
is sufficient.

> > > I also suggest we put LEVEL somewhere in the name.
> >
> > Ok
> >
> > > With this eventfd, do we also need the fd2 parameter in the irqfd
> > > structure? It seems to be used for the same thing ...
> >
> > Yes we still need fd2, this does not replace KVM_IRQFD_LEVEL_EOI. The
> > models we need to support are:
> >
> > 1) assert from userspace (IRQ_LINE), eoi to userspace (EOI_EVENTFD),
> > de-assert from userspace (IRQ_LINE)
> >
> > 2a) assert from vfio (IRQFD.fd), de-assert in kvm, notify vfio
> > (IRQFD.fd2)
> > or
> >
> > 2b) assert from vfio (IRQFD.fd), eoi to vfio (EOI_EVENTFD), de-assert
> > from vfio (IRQFD.fd2)
>
> Or maybe deassert in kvm, notify vfio using eventfd?

Um, how is that different from 2a? Are you suggesting EOI_EVENTFD does
a de-assert? My first implementation of this whole thing looked like
this:

irq_source_id = ioctl(vmfd, KVM_GET_IRQ_SOURCE_ID)

struct kvm_irqfd irqfd = {
.fd = fd,
.gsi = gsi,
.flags = KVM_IRQFD_FLAG_LEVEL | KVM_IRQFD_FLAG_SOURCE_ID,
.irq_source_id = irq_source_id,
};

ioctl(vmfd, KVM_IRQFD, &irqfd);

struct kvm_eoi_eventfd eoifd = {
.fd = fd2,
.gsi = gsi,
.flags = KVM_EOI_EVENTFD_FLAG_SOURCE_ID,
.irq_source_id = irq_source_id,
};

ioctl(vmfd, KVM_EOI_EVENTFD, &eoifd);

What I didn't like about this was that the latter two ioctls are too
interdependent on each other. For instance, ordering of de-assignment
is potentially important to avoid leaving the interrupt asserted. The
combined KVM_IRQFD_LEVEL_EOI completely manages that. Also, if qemu
wants to make use of this, they do not want the de-assert on EOI and
want to make use of the fixed userspace source id, so we end up with a
source id flag and a de-assert flag. Eventually I decided that using a
new source id is really only required when using an irqfd, so we could
keep the source id internal and simplify the interface by pulling it all
into KVM_IRQFD.

> > This series enables 1) and 2a). 2b) has some pros and cons. The pro is
> > that vfio could test the device to see if an interrupt is pending and
> > not de-assert if there is still an interrupt pending. The con is that a
> > standard level interrupt cycle takes 3 transactions instead of 2.
> >
> > Any case of triggering the interrupt externally via an irqfd requires
> > that the irqfd uses it's own irq_source_id so that it doesn't interfere
> > with KVM_USERSPACE_IRQ_SOURCE_ID. So we can't mix IRQ_LINE and IRQFD
> > unless we want to completely expose a new irq source id allocation
> > mechanism and include it in all the interfaces (KVM_GET_IRQ_SOURCE_ID,
> > KVM_PUT_IRQ_SOURCE_ID, pass a source id to IRQFD and IRQ_LINE, etc).
> > Thanks,
> >
> > Alex
>
> I don't follow this last paragraph. What are you trying to say?
> To support IRQ sharing for level irqs, each irqfd for level IRQ
> must have a distinct source id. But IRQ_LINE works fine without,
> and one source ID for all irq_line users seems enough.

Ok, assume we have irqfd support as implemented in 3/4, but instead of
allocating a new irq_source_id, we just use KVM_USERSPACE_IRQ_SOURCE_ID.
The irqfd is triggered and asserts the interrupt line. The guest begins
handling the interrupt, first probing an emulated device, but it has
nothing to do so moves on to the assigned device on the same interrupt
line. Qemu then decides that emulated device really does have something
to do and assert the interrupt line. Nothing happens because the guest
is already handling that interrupt. Guest finishes and writes an eoi.
We're done, so we deassert. Qemu's assert gets lost.

If we use a new irq_source_id, the guest sees the logical OR of all the
source ids for that interrupt line, so even though we de-assert, the
guest receives another interrupt because qemu's interrupt line status is
still visible. Qemu is able to share a single irq_source_id for all
it's devices because the various components in the set_irq path are able
to multiplex multiple emulated interrupt sources themselves.

As above, I argue that the only time you need a new source id is if
you're injecting an interrupt from outside of the userspace model, which
pretty much means irqfd. Thanks,

Alex

2012-06-24 21:52:54

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, 2012-06-24 at 18:45 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 24, 2012 at 08:50:56AM -0600, Alex Williamson wrote:
> > On Sun, 2012-06-24 at 11:28 +0300, Michael S. Tsirkin wrote:
> > > On Fri, Jun 22, 2012 at 04:16:17PM -0600, Alex Williamson wrote:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 2ce09aa..a916186 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
> > > > #define KVM_CAP_PPC_GET_SMMU_INFO 78
> > > > #define KVM_CAP_S390_COW 79
> > > > #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > > +#define KVM_CAP_IRQFD_LEVEL_EOI 81
> > > >
> > > > #ifdef KVM_CAP_IRQ_ROUTING
> > > >
> > > > @@ -683,12 +684,15 @@ struct kvm_xen_hvm_config {
> > > > #endif
> > > >
> > > > #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> > > > +/* Available with KVM_CAP_IRQFD_LEVEL_EOI */
> > > > +#define KVM_IRQFD_FLAG_LEVEL_EOI (1 << 1)
> > > >
> > > > struct kvm_irqfd {
> > > > __u32 fd;
> > > > __u32 gsi;
> > > > __u32 flags;
> > > > - __u8 pad[20];
> > > > + __u32 fd2;
> > > > + __u8 pad[16];
> > > > };
> > > >
> > > > struct kvm_clock_data {
> > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > index c307c24..2bc7275 100644
> > > > --- a/virt/kvm/eventfd.c
> > > > +++ b/virt/kvm/eventfd.c
> > > > @@ -49,9 +49,13 @@ struct _irqfd {
> > > > wait_queue_t wait;
> > > > /* Update side is protected by irqfds.lock */
> > > > struct kvm_kernel_irq_routing_entry __rcu *irq_entry;
> > > > - /* Used for level IRQ fast-path */
> > > > + /* Used for IRQ fast-path */
> > > > int gsi;
> > > > struct work_struct inject;
> > > > + /* Used for level EOI path */
> > > > + int irq_source_id;
> > > > + struct eventfd_ctx *eoi_eventfd;
> > > > + struct kvm_irq_ack_notifier notifier;
> > > > /* Used for setup/shutdown */
> > > > struct eventfd_ctx *eventfd;
> > > > struct list_head list;
> > > > @@ -62,7 +66,7 @@ struct _irqfd {
> > > > static struct workqueue_struct *irqfd_cleanup_wq;
> > > >
> > > > static void
> > > > -irqfd_inject(struct work_struct *work)
> > > > +irqfd_inject_edge(struct work_struct *work)
> > > > {
> > > > struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > struct kvm *kvm = irqfd->kvm;
> > > > @@ -71,6 +75,23 @@ irqfd_inject(struct work_struct *work)
> > > > kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > }
> > > >
> > > > +static void
> > > > +irqfd_inject_level(struct work_struct *work)
> > > > +{
> > > > + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> > > > +
> > > > + kvm_set_irq(irqfd->kvm, irqfd->irq_source_id, irqfd->gsi, 1);
> > > > +}
> > > > +
> > >
> > > I think that we can actually do this for all interrupts
> > > unconditionally. We used to need to clear level for edge, but nomore.
> > > Once you do this, can't this patch be replaced with 4/4 completely?
> >
> > Nope, as described in my reply there, we can't mix IRQFD and IRQ_LINE
> > for a level interrupt because we need a separate source id and have no
> > way to specify it with IRQ_LINE. Thanks,
> >
> > Alex
>
> OK, so the reason to set a new flag here is to assign
> a unique source ID to it? OK, so just store the source
> id in irqfd, rest of the code can be shared.
> Maybe we can make all irqfds use unique source IDs?
> Or is that too expensive for some reason?

It adds a lot of complexity when irqfd is the only interface that needs
a new source id. We've split the discussion into two threads though,
please let's continue it in 4/4 where we're further along. Thanks,

Alex

2012-06-24 21:59:33

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, 2012-06-24 at 18:49 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 24, 2012 at 09:18:38AM -0600, Alex Williamson wrote:
> > > > @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > > >
> > > > ret = 0;
> > > > list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > > - if (irqfd->eventfd != tmp->eventfd)
> > > > + if (irqfd->eventfd != tmp->eventfd &&
> > > > + irqfd->eventfd != tmp->eoi_eventfd)
> > > > continue;
> > >
> > > So we allow duplicate irqfd with differing eoifd (or edge-triggered and
> > > level-triggered irqfd on the same context).
> > >
> > > (why the check in the first place? just so we can have a reliable
> > > deassign or is it avoiding a deeper problem?)
> >
> > I really wasn't sure to what extent we wanted to prevent duplicates. My
> > guess was that we don't want to have an irqfd trigger more than one
> > thing. That seems to be what the current code does. I don't see any
> > problems with multiple irqfds triggering the same eventfd though. I
> > only added a test that a new irqfd can't be triggered by an existing
> > eoi_eventfd as that could make a nasty loop.
>
> How would that make a loop? You can have the same thing
> with e.g. ioeventfd - why isn't it a problem there?

eoi_eventfd1 -> irqfd2 [eoi] eoi_eventfd2 -> irqfd1 [eoi] eoi_eventfd1 ->...

Yes, in reality we'd need to search fds from all the interfaces and come
up with some grossly complicated truth table of what's allowed and
what's not. The original code didn't go to that kind of extreme, so I
just added something that seemed like a reasonable case
we'd want to prevent. Thanks,

Alex

2012-06-24 22:35:04

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Sun, Jun 24, 2012 at 03:50:12PM -0600, Alex Williamson wrote:
> On Sun, 2012-06-24 at 18:40 +0300, Michael S. Tsirkin wrote:
> > On Sun, Jun 24, 2012 at 08:47:57AM -0600, Alex Williamson wrote:
> > > On Sun, 2012-06-24 at 11:24 +0300, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> > > > > I think we're probably also going to need something like this.
> > > > > When running in non-accelerated qemu, we're going to have to
> > > > > create some kind of EOI notifier for drivers. VFIO can make
> > > > > additional improvements when running on KVM so it will probably
> > > > > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > > > > want to have a generic EOI notifier in qemu that just stops
> > > > > working when kvm-ioapic is enabled. This is just a simple way
> > > > > to register an eventfd using the existing KVM ack notifier.
> > > > > I tried combining the ack notifier of the LEVEL_EOI interface
> > > > > into this one, but it didn't work out well. The code complexity
> > > > > goes up a lot.
> > > > >
> > > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > > ---
> > > > >
> > > > > Documentation/virtual/kvm/api.txt | 14 ++++++
> > > > > arch/x86/kvm/x86.c | 1
> > > > > include/linux/kvm.h | 12 +++++
> > > > > include/linux/kvm_host.h | 7 +++
> > > > > virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> > > > > virt/kvm/kvm_main.c | 9 ++++
> > > > > 6 files changed, 132 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > index 2f8a0aa..69b1747 100644
> > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > > > > De-assigning automatically de-asserts the interrupt line setup through
> > > > > this interface.
> > > > >
> > > > > +4.77 KVM_EOI_EVENTFD
> > > > > +
> > > > > +Capability: KVM_CAP_EOI_EVENTFD
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_eoi_eventfd (in)
> > > > > +Returns: 0 on success, -1 on error
> > > > > +
> > > > > +This interface allows userspace to be notified through an eventfd for
> > > > > +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> > > > > +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> > > > > +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> > > > > +and gsi.
> > > > > +
> > > > > 5. The kvm_run structure
> > > > > ------------------------
> > > > >
> > > >
> > > > The patch looks like it only works for ioapic IRQs. I think it's a good
> > > > idea to explicitly limit this to level interrupts, straight away:
> > > > there's no reason for userspace to need an exit on an edge interrupt.
> > >
> > > Are you suggesting documentation or code to prevent users from binding
> > > to an edge interrupt?
> >
> > Yes.
>
> Ok, that was failed attempt at clarification. I'll assume documentation
> is sufficient.
>
> > > > I also suggest we put LEVEL somewhere in the name.
> > >
> > > Ok
> > >
> > > > With this eventfd, do we also need the fd2 parameter in the irqfd
> > > > structure? It seems to be used for the same thing ...
> > >
> > > Yes we still need fd2, this does not replace KVM_IRQFD_LEVEL_EOI. The
> > > models we need to support are:
> > >
> > > 1) assert from userspace (IRQ_LINE), eoi to userspace (EOI_EVENTFD),
> > > de-assert from userspace (IRQ_LINE)
> > >
> > > 2a) assert from vfio (IRQFD.fd), de-assert in kvm, notify vfio
> > > (IRQFD.fd2)
> > > or
> > >
> > > 2b) assert from vfio (IRQFD.fd), eoi to vfio (EOI_EVENTFD), de-assert
> > > from vfio (IRQFD.fd2)
> >
> > Or maybe deassert in kvm, notify vfio using eventfd?
>
> Um, how is that different from 2a? Are you suggesting EOI_EVENTFD does
> a de-assert?

Yes.

> My first implementation of this whole thing looked like
> this:
>
> irq_source_id = ioctl(vmfd, KVM_GET_IRQ_SOURCE_ID)
>
> struct kvm_irqfd irqfd = {
> .fd = fd,
> .gsi = gsi,
> .flags = KVM_IRQFD_FLAG_LEVEL | KVM_IRQFD_FLAG_SOURCE_ID,

FLAG_LEVEL might not be needed.

> .irq_source_id = irq_source_id,
> };
>
> ioctl(vmfd, KVM_IRQFD, &irqfd);
>
> struct kvm_eoi_eventfd eoifd = {
> .fd = fd2,
> .gsi = gsi,
> .flags = KVM_EOI_EVENTFD_FLAG_SOURCE_ID,
> .irq_source_id = irq_source_id,
> };
>
> ioctl(vmfd, KVM_EOI_EVENTFD, &eoifd);
>
> What I didn't like about this was that the latter two ioctls are too
> interdependent on each other. For instance, ordering of de-assignment
> is potentially important to avoid leaving the interrupt asserted. The
> combined KVM_IRQFD_LEVEL_EOI completely manages that. Also, if qemu
> wants to make use of this, they do not want the de-assert on EOI

IMO it should not matter. Most likely level is already 0 by the time
EOI is invoked. eventfds are asynchronous so you do not want
level to stay asserted until some action.
If qemu does not want the deassert on EOI, it should just
avoid using the eventfd interface.

> and
> want to make use of the fixed userspace source id, so we end up with a
> source id flag and a de-assert flag. Eventually I decided that using a
> new source id is really only required when using an irqfd, so we could
> keep the source id internal and simplify the interface by pulling it all
> into KVM_IRQFD.

I see, thanks for the clarification. But then we end up with
EOI_EVENTFD for userspace that duplicates some of the functionality.
I'm not sure how to do it best. How about passing in a special
fd value to IRQFD so that only fd2 is valid?

> > > This series enables 1) and 2a). 2b) has some pros and cons. The pro is
> > > that vfio could test the device to see if an interrupt is pending and
> > > not de-assert if there is still an interrupt pending. The con is that a
> > > standard level interrupt cycle takes 3 transactions instead of 2.
> > >
> > > Any case of triggering the interrupt externally via an irqfd requires
> > > that the irqfd uses it's own irq_source_id so that it doesn't interfere
> > > with KVM_USERSPACE_IRQ_SOURCE_ID. So we can't mix IRQ_LINE and IRQFD
> > > unless we want to completely expose a new irq source id allocation
> > > mechanism and include it in all the interfaces (KVM_GET_IRQ_SOURCE_ID,
> > > KVM_PUT_IRQ_SOURCE_ID, pass a source id to IRQFD and IRQ_LINE, etc).
> > > Thanks,
> > >
> > > Alex
> >
> > I don't follow this last paragraph. What are you trying to say?
> > To support IRQ sharing for level irqs, each irqfd for level IRQ
> > must have a distinct source id. But IRQ_LINE works fine without,
> > and one source ID for all irq_line users seems enough.
>
> Ok, assume we have irqfd support as implemented in 3/4, but instead of
> allocating a new irq_source_id, we just use KVM_USERSPACE_IRQ_SOURCE_ID.
> The irqfd is triggered and asserts the interrupt line. The guest begins
> handling the interrupt, first probing an emulated device, but it has
> nothing to do so moves on to the assigned device on the same interrupt
> line. Qemu then decides that emulated device really does have something
> to do and assert the interrupt line. Nothing happens because the guest
> is already handling that interrupt. Guest finishes and writes an eoi.
> We're done, so we deassert. Qemu's assert gets lost.
>
> If we use a new irq_source_id, the guest sees the logical OR of all the
> source ids for that interrupt line, so even though we de-assert, the
> guest receives another interrupt because qemu's interrupt line status is
> still visible. Qemu is able to share a single irq_source_id for all
> it's devices because the various components in the set_irq path are able
> to multiplex multiple emulated interrupt sources themselves.
>
> As above, I argue that the only time you need a new source id is if
> you're injecting an interrupt from outside of the userspace model, which
> pretty much means irqfd.

This last paragraph is true IMO.

> Thanks,
>
> Alex

2012-06-24 23:02:54

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, Jun 24, 2012 at 03:59:27PM -0600, Alex Williamson wrote:
> On Sun, 2012-06-24 at 18:49 +0300, Michael S. Tsirkin wrote:
> > On Sun, Jun 24, 2012 at 09:18:38AM -0600, Alex Williamson wrote:
> > > > > @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > > > >
> > > > > ret = 0;
> > > > > list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > > > - if (irqfd->eventfd != tmp->eventfd)
> > > > > + if (irqfd->eventfd != tmp->eventfd &&
> > > > > + irqfd->eventfd != tmp->eoi_eventfd)
> > > > > continue;
> > > >
> > > > So we allow duplicate irqfd with differing eoifd (or edge-triggered and
> > > > level-triggered irqfd on the same context).
> > > >
> > > > (why the check in the first place? just so we can have a reliable
> > > > deassign or is it avoiding a deeper problem?)
> > >
> > > I really wasn't sure to what extent we wanted to prevent duplicates. My
> > > guess was that we don't want to have an irqfd trigger more than one
> > > thing. That seems to be what the current code does. I don't see any
> > > problems with multiple irqfds triggering the same eventfd though. I
> > > only added a test that a new irqfd can't be triggered by an existing
> > > eoi_eventfd as that could make a nasty loop.
> >
> > How would that make a loop? You can have the same thing
> > with e.g. ioeventfd - why isn't it a problem there?
>
> eoi_eventfd1 -> irqfd2 [eoi] eoi_eventfd2 -> irqfd1 [eoi] eoi_eventfd1 ->...

Sorry I don't understand.
What does this [eoi] mean? How is eoi eventfd different from ioeventfd?

>
> Yes, in reality we'd need to search fds from all the interfaces and come
> up with some grossly complicated truth table of what's allowed and
> what's not. The original code didn't go to that kind of extreme, so I
> just added something that seemed like a reasonable case
> we'd want to prevent. Thanks,
>
> Alex

2012-06-25 16:09:42

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Mon, 2012-06-25 at 01:35 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 24, 2012 at 03:50:12PM -0600, Alex Williamson wrote:
> > On Sun, 2012-06-24 at 18:40 +0300, Michael S. Tsirkin wrote:
> > > On Sun, Jun 24, 2012 at 08:47:57AM -0600, Alex Williamson wrote:
> > > > On Sun, 2012-06-24 at 11:24 +0300, Michael S. Tsirkin wrote:
> > > > > On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> > > > > > I think we're probably also going to need something like this.
> > > > > > When running in non-accelerated qemu, we're going to have to
> > > > > > create some kind of EOI notifier for drivers. VFIO can make
> > > > > > additional improvements when running on KVM so it will probably
> > > > > > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > > > > > want to have a generic EOI notifier in qemu that just stops
> > > > > > working when kvm-ioapic is enabled. This is just a simple way
> > > > > > to register an eventfd using the existing KVM ack notifier.
> > > > > > I tried combining the ack notifier of the LEVEL_EOI interface
> > > > > > into this one, but it didn't work out well. The code complexity
> > > > > > goes up a lot.
> > > > > >
> > > > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > > > ---
> > > > > >
> > > > > > Documentation/virtual/kvm/api.txt | 14 ++++++
> > > > > > arch/x86/kvm/x86.c | 1
> > > > > > include/linux/kvm.h | 12 +++++
> > > > > > include/linux/kvm_host.h | 7 +++
> > > > > > virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> > > > > > virt/kvm/kvm_main.c | 9 ++++
> > > > > > 6 files changed, 132 insertions(+)
> > > > > >
> > > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > > index 2f8a0aa..69b1747 100644
> > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > > > > > De-assigning automatically de-asserts the interrupt line setup through
> > > > > > this interface.
> > > > > >
> > > > > > +4.77 KVM_EOI_EVENTFD
> > > > > > +
> > > > > > +Capability: KVM_CAP_EOI_EVENTFD
> > > > > > +Architectures: x86
> > > > > > +Type: vm ioctl
> > > > > > +Parameters: struct kvm_eoi_eventfd (in)
> > > > > > +Returns: 0 on success, -1 on error
> > > > > > +
> > > > > > +This interface allows userspace to be notified through an eventfd for
> > > > > > +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> > > > > > +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> > > > > > +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> > > > > > +and gsi.
> > > > > > +
> > > > > > 5. The kvm_run structure
> > > > > > ------------------------
> > > > > >
> > > > >
> > > > > The patch looks like it only works for ioapic IRQs. I think it's a good
> > > > > idea to explicitly limit this to level interrupts, straight away:
> > > > > there's no reason for userspace to need an exit on an edge interrupt.
> > > >
> > > > Are you suggesting documentation or code to prevent users from binding
> > > > to an edge interrupt?
> > >
> > > Yes.
> >
> > Ok, that was failed attempt at clarification. I'll assume documentation
> > is sufficient.
> >
> > > > > I also suggest we put LEVEL somewhere in the name.
> > > >
> > > > Ok
> > > >
> > > > > With this eventfd, do we also need the fd2 parameter in the irqfd
> > > > > structure? It seems to be used for the same thing ...
> > > >
> > > > Yes we still need fd2, this does not replace KVM_IRQFD_LEVEL_EOI. The
> > > > models we need to support are:
> > > >
> > > > 1) assert from userspace (IRQ_LINE), eoi to userspace (EOI_EVENTFD),
> > > > de-assert from userspace (IRQ_LINE)
> > > >
> > > > 2a) assert from vfio (IRQFD.fd), de-assert in kvm, notify vfio
> > > > (IRQFD.fd2)
> > > > or
> > > >
> > > > 2b) assert from vfio (IRQFD.fd), eoi to vfio (EOI_EVENTFD), de-assert
> > > > from vfio (IRQFD.fd2)
> > >
> > > Or maybe deassert in kvm, notify vfio using eventfd?
> >
> > Um, how is that different from 2a? Are you suggesting EOI_EVENTFD does
> > a de-assert?
>
> Yes.
>
> > My first implementation of this whole thing looked like
> > this:
> >
> > irq_source_id = ioctl(vmfd, KVM_GET_IRQ_SOURCE_ID)
> >
> > struct kvm_irqfd irqfd = {
> > .fd = fd,
> > .gsi = gsi,
> > .flags = KVM_IRQFD_FLAG_LEVEL | KVM_IRQFD_FLAG_SOURCE_ID,
>
> FLAG_LEVEL might not be needed.

Yep, would be nice to get some confirmation on this. If we can just do
a single kvm_set_irq(,,,1) w/o 0, I'll add a patch in the series to make
that change.

> > .irq_source_id = irq_source_id,
> > };
> >
> > ioctl(vmfd, KVM_IRQFD, &irqfd);
> >
> > struct kvm_eoi_eventfd eoifd = {
> > .fd = fd2,
> > .gsi = gsi,
> > .flags = KVM_EOI_EVENTFD_FLAG_SOURCE_ID,
> > .irq_source_id = irq_source_id,
> > };
> >
> > ioctl(vmfd, KVM_EOI_EVENTFD, &eoifd);
> >
> > What I didn't like about this was that the latter two ioctls are too
> > interdependent on each other. For instance, ordering of de-assignment
> > is potentially important to avoid leaving the interrupt asserted. The
> > combined KVM_IRQFD_LEVEL_EOI completely manages that. Also, if qemu
> > wants to make use of this, they do not want the de-assert on EOI
>
> IMO it should not matter. Most likely level is already 0 by the time
> EOI is invoked. eventfds are asynchronous so you do not want
> level to stay asserted until some action.
> If qemu does not want the deassert on EOI, it should just
> avoid using the eventfd interface.

Well, that prevents us from even notifying qemu about an EOI when
irqchip is enabled without introducing other behavioral changes. That
seems bad.

> > and
> > want to make use of the fixed userspace source id, so we end up with a
> > source id flag and a de-assert flag. Eventually I decided that using a
> > new source id is really only required when using an irqfd, so we could
> > keep the source id internal and simplify the interface by pulling it all
> > into KVM_IRQFD.
>
>
> I see, thanks for the clarification. But then we end up with
> EOI_EVENTFD for userspace that duplicates some of the functionality.
> I'm not sure how to do it best. How about passing in a special
> fd value to IRQFD so that only fd2 is valid?

I can try it. I did try to implement implement the code so that the EOI
portion or LEVEL_EOI used the EOI_EVENTFD code directly, but it got
overly complicated for the effectively trivial code savings. Should we
go ahead and try to implement the full gamut of what we could do too? I
think it would look something like this:

#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
#define KVM_IRQFD_LEVEL_EOI_FD (1 << 1)
#define KVM_IRQFD_LEVEL_DEASSERT_ON_EOI (1 << 2)
#define KVM_IRQFD_LEVEL_DEASSERT_FD (1 << 3)

struct kvm_irqfd {
__u32 fd; /* interrupt assert (irqfd) */
__u32 gsi;
__u32 flags;
__u32 fd2; /* EOI notification (eventfd) */
__u32 fd3; /* interrupt de-assert (irqfd) */
__u8 pad[12];
};

I'm a little nervous that the shutdown routine might be a mess with two
irqfds, but we'll see. We could assume kvm_irqfd.fd = ~0 is reserved
and allows us to skip irqfd setup, but I think we might be better off
using a flag with the opposite polarity, perhaps KVM_IRQFD_NO_ASSERT.
It's becoming quite a complicated ioctl though and you usually seem
opposed to multiplexing ioctls for so many options. Thanks,

Alex

> > > > This series enables 1) and 2a). 2b) has some pros and cons. The pro is
> > > > that vfio could test the device to see if an interrupt is pending and
> > > > not de-assert if there is still an interrupt pending. The con is that a
> > > > standard level interrupt cycle takes 3 transactions instead of 2.
> > > >
> > > > Any case of triggering the interrupt externally via an irqfd requires
> > > > that the irqfd uses it's own irq_source_id so that it doesn't interfere
> > > > with KVM_USERSPACE_IRQ_SOURCE_ID. So we can't mix IRQ_LINE and IRQFD
> > > > unless we want to completely expose a new irq source id allocation
> > > > mechanism and include it in all the interfaces (KVM_GET_IRQ_SOURCE_ID,
> > > > KVM_PUT_IRQ_SOURCE_ID, pass a source id to IRQFD and IRQ_LINE, etc).
> > > > Thanks,
> > > >
> > > > Alex
> > >
> > > I don't follow this last paragraph. What are you trying to say?
> > > To support IRQ sharing for level irqs, each irqfd for level IRQ
> > > must have a distinct source id. But IRQ_LINE works fine without,
> > > and one source ID for all irq_line users seems enough.
> >
> > Ok, assume we have irqfd support as implemented in 3/4, but instead of
> > allocating a new irq_source_id, we just use KVM_USERSPACE_IRQ_SOURCE_ID.
> > The irqfd is triggered and asserts the interrupt line. The guest begins
> > handling the interrupt, first probing an emulated device, but it has
> > nothing to do so moves on to the assigned device on the same interrupt
> > line. Qemu then decides that emulated device really does have something
> > to do and assert the interrupt line. Nothing happens because the guest
> > is already handling that interrupt. Guest finishes and writes an eoi.
> > We're done, so we deassert. Qemu's assert gets lost.
> >
> > If we use a new irq_source_id, the guest sees the logical OR of all the
> > source ids for that interrupt line, so even though we de-assert, the
> > guest receives another interrupt because qemu's interrupt line status is
> > still visible. Qemu is able to share a single irq_source_id for all
> > it's devices because the various components in the set_irq path are able
> > to multiplex multiple emulated interrupt sources themselves.
> >
> > As above, I argue that the only time you need a new source id is if
> > you're injecting an interrupt from outside of the userspace model, which
> > pretty much means irqfd.
>
> This last paragraph is true IMO.
>
> > Thanks,
> >
> > Alex

2012-06-25 16:17:21

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Mon, 2012-06-25 at 02:02 +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 24, 2012 at 03:59:27PM -0600, Alex Williamson wrote:
> > On Sun, 2012-06-24 at 18:49 +0300, Michael S. Tsirkin wrote:
> > > On Sun, Jun 24, 2012 at 09:18:38AM -0600, Alex Williamson wrote:
> > > > > > @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > > > > >
> > > > > > ret = 0;
> > > > > > list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > > > > - if (irqfd->eventfd != tmp->eventfd)
> > > > > > + if (irqfd->eventfd != tmp->eventfd &&
> > > > > > + irqfd->eventfd != tmp->eoi_eventfd)
> > > > > > continue;
> > > > >
> > > > > So we allow duplicate irqfd with differing eoifd (or edge-triggered and
> > > > > level-triggered irqfd on the same context).
> > > > >
> > > > > (why the check in the first place? just so we can have a reliable
> > > > > deassign or is it avoiding a deeper problem?)
> > > >
> > > > I really wasn't sure to what extent we wanted to prevent duplicates. My
> > > > guess was that we don't want to have an irqfd trigger more than one
> > > > thing. That seems to be what the current code does. I don't see any
> > > > problems with multiple irqfds triggering the same eventfd though. I
> > > > only added a test that a new irqfd can't be triggered by an existing
> > > > eoi_eventfd as that could make a nasty loop.
> > >
> > > How would that make a loop? You can have the same thing
> > > with e.g. ioeventfd - why isn't it a problem there?
> >
> > eoi_eventfd1 -> irqfd2 [eoi] eoi_eventfd2 -> irqfd1 [eoi] eoi_eventfd1 ->...
>
> Sorry I don't understand.
> What does this [eoi] mean? How is eoi eventfd different from ioeventfd?

[eoi] is simply the guest doing an EOI write. There's some interaction
required from the guest which could rate limit the loop. I'm not really
following your comparison to an ioeventfd. If the question is can you
create loops with an ioeventfd, I imagine the answer is probably so. I
certainly don't plan on adding code to test every fd in use by a vm and
validate it can't do crazy things, so if even this minor sanity test
could prevent useful things, I'll happily remove it. Thanks,

Alex

2012-06-25 19:29:54

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Sun, 2012-06-24 at 09:18 -0600, Alex Williamson wrote:
> On Sun, 2012-06-24 at 13:29 +0300, Avi Kivity wrote:
> > On 06/23/2012 01:16 AM, Alex Williamson wrote:
> > > KVM_IRQFD currently only supports edge triggered interrupts,
> > > asserting then immediately deasserting an interrupt. There are a
> > > couple ways we can emulate level triggered interrupts using
> > > discrete events depending on the usage model we expect from drivers.
> > > This patch implements a level emulation model useful for external
> > > assigned device drivers, like VFIO. The irqfd is used to assert
> > > the interrupt. When the guest issues an EOI for the interrupt, the
> > > level is automatically deasserted and the irqfd user is notified via
> > > an eventfd. This is therefore the LEVEL_EOI extension to KVM_IRQFD.
> > > To do this, we need to allocate a new irq source ID for the interrupt
> > > so we don't get interference from userspace.
> > >
> > >
> > > +With KVM_CAP_IRQFD_LEVEL_EOI KVM_IRQFD is able to support a level
> > > +triggered interrupt model where the irqchip pin (kvm_irqfd.gsi) is
> > > +asserted from the kvm_irqfd.fd eventfd and remain asserted until the
> > > +guest issues an EOI for the irqchip pin. The level interrupt is
> > > +then de-asserted and the caller is notified via the eventfd specified
> > > +by kvm_irqfd.fd2. Note that users of this interface are responsible
> > > +for re-asserting the interrupt if their device still requires service
> > > +after receiving the EOI notification. Additionally, users must not
> > > +re-assert an interrupt until after receiving an EOI.
> >
> > What happens if this is violated?
>
> Hmm, perhaps nothing. The only race I see is re-asserting in the gap
> between de-asserting the guest and sending the EOI. At worst that would
> cause a spurious interrupt, so probably no big deal.
>
> > > When available,
> > > +this feature is enabled using the KVM_IRQFD_FLAG_LEVEL_EOI flag.
> > > +De-assigning an irqfd setup using this flag should include both
> > > +KVM_IRQFD_FLAG_DEASSIGN and KVM_IRQFD_FLAG_LEVEL_EOI and will be
> > > +matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > > +De-assigning automatically de-asserts the interrupt line setup through
> > > +this interface.
> > >
> > > @@ -203,8 +232,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > > struct kvm_irq_routing_table *irq_rt;
> > > struct _irqfd *irqfd, *tmp;
> > > struct file *file = NULL;
> > > - struct eventfd_ctx *eventfd = NULL;
> > > - int ret;
> > > + struct eventfd_ctx *eventfd = NULL, *eoi_eventfd = NULL;
> > > + int ret, irq_source_id = -1;
> > > unsigned int events;
> > >
> > > irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> > > @@ -214,7 +243,30 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > > irqfd->kvm = kvm;
> > > irqfd->gsi = args->gsi;
> > > INIT_LIST_HEAD(&irqfd->list);
> > > - INIT_WORK(&irqfd->inject, irqfd_inject);
> > > +
> > > + if (args->flags & KVM_IRQFD_FLAG_LEVEL_EOI) {
> > > + irq_source_id = kvm_request_irq_source_id(kvm);
> > > + if (irq_source_id < 0) {
> > > + ret = irq_source_id;
> > > + goto fail;
> >
> > 'file' is NULL at this point, and fput() doesn't test for NULL.
>
> Good catch. I was looking for an excuse to move the existing code to
> eventfd_ctx_fdget() and avoid the 2 step process it uses now.

Well, we need file later, so a !NULL test will fix it.

...
>
> > Xen had/has a hack for doing this in a different way, based on ioapic
> > polarity. When the host takes an interrupt, they reverse the polarity
> > on that ioapic pin, so they get interrupts on both assertion and
> > deassertion. This is more general and more correct, but waaaaaaaaaaay
> > more intrusive and won't play well with shared host interrupts. But
> > let's at least consider it.
>
> Thanks, I'll look for that code.

I can't find the code for it, pointers welcome. I have a hard time
thinking this is practical for legacy interrupts though. It sounds more
like an x86 party trick that minimally breaks shared host interrupts,
but more likely is intrusive (hey, why's that PCI driver asking for an
active high interrupt, obviously wrong, BUG) and probably breaks on most
non-x86 platforms. I just can't imagine the cost/benefit works out for
it with legacy interrupts. Thanks,

Alex

2012-06-25 20:12:07

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Mon, Jun 25, 2012 at 10:09:37AM -0600, Alex Williamson wrote:
> On Mon, 2012-06-25 at 01:35 +0300, Michael S. Tsirkin wrote:
> > On Sun, Jun 24, 2012 at 03:50:12PM -0600, Alex Williamson wrote:
> > > On Sun, 2012-06-24 at 18:40 +0300, Michael S. Tsirkin wrote:
> > > > On Sun, Jun 24, 2012 at 08:47:57AM -0600, Alex Williamson wrote:
> > > > > On Sun, 2012-06-24 at 11:24 +0300, Michael S. Tsirkin wrote:
> > > > > > On Fri, Jun 22, 2012 at 04:16:53PM -0600, Alex Williamson wrote:
> > > > > > > I think we're probably also going to need something like this.
> > > > > > > When running in non-accelerated qemu, we're going to have to
> > > > > > > create some kind of EOI notifier for drivers. VFIO can make
> > > > > > > additional improvements when running on KVM so it will probably
> > > > > > > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> > > > > > > want to have a generic EOI notifier in qemu that just stops
> > > > > > > working when kvm-ioapic is enabled. This is just a simple way
> > > > > > > to register an eventfd using the existing KVM ack notifier.
> > > > > > > I tried combining the ack notifier of the LEVEL_EOI interface
> > > > > > > into this one, but it didn't work out well. The code complexity
> > > > > > > goes up a lot.
> > > > > > >
> > > > > > > Signed-off-by: Alex Williamson <[email protected]>
> > > > > > > ---
> > > > > > >
> > > > > > > Documentation/virtual/kvm/api.txt | 14 ++++++
> > > > > > > arch/x86/kvm/x86.c | 1
> > > > > > > include/linux/kvm.h | 12 +++++
> > > > > > > include/linux/kvm_host.h | 7 +++
> > > > > > > virt/kvm/eventfd.c | 89 +++++++++++++++++++++++++++++++++++++
> > > > > > > virt/kvm/kvm_main.c | 9 ++++
> > > > > > > 6 files changed, 132 insertions(+)
> > > > > > >
> > > > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > > > index 2f8a0aa..69b1747 100644
> > > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > > @@ -1998,6 +1998,20 @@ matched using kvm_irqfd.fd, kvm_irqfd.gsi, and kvm_irqfd.fd2.
> > > > > > > De-assigning automatically de-asserts the interrupt line setup through
> > > > > > > this interface.
> > > > > > >
> > > > > > > +4.77 KVM_EOI_EVENTFD
> > > > > > > +
> > > > > > > +Capability: KVM_CAP_EOI_EVENTFD
> > > > > > > +Architectures: x86
> > > > > > > +Type: vm ioctl
> > > > > > > +Parameters: struct kvm_eoi_eventfd (in)
> > > > > > > +Returns: 0 on success, -1 on error
> > > > > > > +
> > > > > > > +This interface allows userspace to be notified through an eventfd for
> > > > > > > +EOI writes to the in-kernel irqchip. kvm_eoi_eventfd.fd specifies
> > > > > > > +the eventfd to signal on EOI to kvm_eoi_eventfd.gsi. To disable,
> > > > > > > +use KVM_EOI_EVENTFD_FLAG_DEASSIGN and specify both the original fd
> > > > > > > +and gsi.
> > > > > > > +
> > > > > > > 5. The kvm_run structure
> > > > > > > ------------------------
> > > > > > >
> > > > > >
> > > > > > The patch looks like it only works for ioapic IRQs. I think it's a good
> > > > > > idea to explicitly limit this to level interrupts, straight away:
> > > > > > there's no reason for userspace to need an exit on an edge interrupt.
> > > > >
> > > > > Are you suggesting documentation or code to prevent users from binding
> > > > > to an edge interrupt?
> > > >
> > > > Yes.
> > >
> > > Ok, that was failed attempt at clarification. I'll assume documentation
> > > is sufficient.
> > >
> > > > > > I also suggest we put LEVEL somewhere in the name.
> > > > >
> > > > > Ok
> > > > >
> > > > > > With this eventfd, do we also need the fd2 parameter in the irqfd
> > > > > > structure? It seems to be used for the same thing ...
> > > > >
> > > > > Yes we still need fd2, this does not replace KVM_IRQFD_LEVEL_EOI. The
> > > > > models we need to support are:
> > > > >
> > > > > 1) assert from userspace (IRQ_LINE), eoi to userspace (EOI_EVENTFD),
> > > > > de-assert from userspace (IRQ_LINE)
> > > > >
> > > > > 2a) assert from vfio (IRQFD.fd), de-assert in kvm, notify vfio
> > > > > (IRQFD.fd2)
> > > > > or
> > > > >
> > > > > 2b) assert from vfio (IRQFD.fd), eoi to vfio (EOI_EVENTFD), de-assert
> > > > > from vfio (IRQFD.fd2)
> > > >
> > > > Or maybe deassert in kvm, notify vfio using eventfd?
> > >
> > > Um, how is that different from 2a? Are you suggesting EOI_EVENTFD does
> > > a de-assert?
> >
> > Yes.
> >
> > > My first implementation of this whole thing looked like
> > > this:
> > >
> > > irq_source_id = ioctl(vmfd, KVM_GET_IRQ_SOURCE_ID)
> > >
> > > struct kvm_irqfd irqfd = {
> > > .fd = fd,
> > > .gsi = gsi,
> > > .flags = KVM_IRQFD_FLAG_LEVEL | KVM_IRQFD_FLAG_SOURCE_ID,
> >
> > FLAG_LEVEL might not be needed.
>
> Yep, would be nice to get some confirmation on this. If we can just do
> a single kvm_set_irq(,,,1) w/o 0, I'll add a patch in the series to make
> that change.

Avi, Marcelo, any comments? There are unlikely to be
any apps using this undocumented hack to send level
interrupts woth irqfd - ok to remove?

> > > .irq_source_id = irq_source_id,
> > > };
> > >
> > > ioctl(vmfd, KVM_IRQFD, &irqfd);
> > >
> > > struct kvm_eoi_eventfd eoifd = {
> > > .fd = fd2,
> > > .gsi = gsi,
> > > .flags = KVM_EOI_EVENTFD_FLAG_SOURCE_ID,
> > > .irq_source_id = irq_source_id,
> > > };
> > >
> > > ioctl(vmfd, KVM_EOI_EVENTFD, &eoifd);
> > >
> > > What I didn't like about this was that the latter two ioctls are too
> > > interdependent on each other. For instance, ordering of de-assignment
> > > is potentially important to avoid leaving the interrupt asserted. The
> > > combined KVM_IRQFD_LEVEL_EOI completely manages that. Also, if qemu
> > > wants to make use of this, they do not want the de-assert on EOI
> >
> > IMO it should not matter. Most likely level is already 0 by the time
> > EOI is invoked. eventfds are asynchronous so you do not want
> > level to stay asserted until some action.
> > If qemu does not want the deassert on EOI, it should just
> > avoid using the eventfd interface.
>
> Well, that prevents us from even notifying qemu about an EOI when
> irqchip is enabled without introducing other behavioral changes. That
> seems bad.

Frankly, using EOIs is a hack. It seems better to hide it
from userspace.

> > > and
> > > want to make use of the fixed userspace source id, so we end up with a
> > > source id flag and a de-assert flag. Eventually I decided that using a
> > > new source id is really only required when using an irqfd, so we could
> > > keep the source id internal and simplify the interface by pulling it all
> > > into KVM_IRQFD.
> >
> >
> > I see, thanks for the clarification. But then we end up with
> > EOI_EVENTFD for userspace that duplicates some of the functionality.
> > I'm not sure how to do it best. How about passing in a special
> > fd value to IRQFD so that only fd2 is valid?
>
> I can try it. I did try to implement implement the code so that the EOI
> portion or LEVEL_EOI used the EOI_EVENTFD code directly, but it got
> overly complicated for the effectively trivial code savings. Should we
> go ahead and try to implement the full gamut of what we could do too? I
> think it would look something like this:
>
> #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> #define KVM_IRQFD_LEVEL_EOI_FD (1 << 1)
> #define KVM_IRQFD_LEVEL_DEASSERT_ON_EOI (1 << 2)
> #define KVM_IRQFD_LEVEL_DEASSERT_FD (1 << 3)
>
> struct kvm_irqfd {
> __u32 fd; /* interrupt assert (irqfd) */
> __u32 gsi;
> __u32 flags;
> __u32 fd2; /* EOI notification (eventfd) */

Call this eoifd then?

> __u32 fd3; /* interrupt de-assert (irqfd) */
> __u8 pad[12];
> };
>
> I'm a little nervous that the shutdown routine might be a mess with two
> irqfds, but we'll see. We could assume kvm_irqfd.fd = ~0 is reserved
> and allows us to skip irqfd setup, but I think we might be better off
> using a flag with the opposite polarity, perhaps KVM_IRQFD_NO_ASSERT.

Maybe drop DEASSERT_FD? No one is likely to use it short term so it
will bitrot.

> It's becoming quite a complicated ioctl though and you usually seem
> opposed to multiplexing ioctls for so many options. Thanks,
>
> Alex

I don't mind really.

> > > > > This series enables 1) and 2a). 2b) has some pros and cons. The pro is
> > > > > that vfio could test the device to see if an interrupt is pending and
> > > > > not de-assert if there is still an interrupt pending. The con is that a
> > > > > standard level interrupt cycle takes 3 transactions instead of 2.
> > > > >
> > > > > Any case of triggering the interrupt externally via an irqfd requires
> > > > > that the irqfd uses it's own irq_source_id so that it doesn't interfere
> > > > > with KVM_USERSPACE_IRQ_SOURCE_ID. So we can't mix IRQ_LINE and IRQFD
> > > > > unless we want to completely expose a new irq source id allocation
> > > > > mechanism and include it in all the interfaces (KVM_GET_IRQ_SOURCE_ID,
> > > > > KVM_PUT_IRQ_SOURCE_ID, pass a source id to IRQFD and IRQ_LINE, etc).
> > > > > Thanks,
> > > > >
> > > > > Alex
> > > >
> > > > I don't follow this last paragraph. What are you trying to say?
> > > > To support IRQ sharing for level irqs, each irqfd for level IRQ
> > > > must have a distinct source id. But IRQ_LINE works fine without,
> > > > and one source ID for all irq_line users seems enough.
> > >
> > > Ok, assume we have irqfd support as implemented in 3/4, but instead of
> > > allocating a new irq_source_id, we just use KVM_USERSPACE_IRQ_SOURCE_ID.
> > > The irqfd is triggered and asserts the interrupt line. The guest begins
> > > handling the interrupt, first probing an emulated device, but it has
> > > nothing to do so moves on to the assigned device on the same interrupt
> > > line. Qemu then decides that emulated device really does have something
> > > to do and assert the interrupt line. Nothing happens because the guest
> > > is already handling that interrupt. Guest finishes and writes an eoi.
> > > We're done, so we deassert. Qemu's assert gets lost.
> > >
> > > If we use a new irq_source_id, the guest sees the logical OR of all the
> > > source ids for that interrupt line, so even though we de-assert, the
> > > guest receives another interrupt because qemu's interrupt line status is
> > > still visible. Qemu is able to share a single irq_source_id for all
> > > it's devices because the various components in the set_irq path are able
> > > to multiplex multiple emulated interrupt sources themselves.
> > >
> > > As above, I argue that the only time you need a new source id is if
> > > you're injecting an interrupt from outside of the userspace model, which
> > > pretty much means irqfd.
> >
> > This last paragraph is true IMO.
> >
> > > Thanks,
> > >
> > > Alex
>
>

2012-06-25 20:13:53

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH 3/4] kvm: Extend irqfd to support level interrupts

On Mon, Jun 25, 2012 at 10:17:14AM -0600, Alex Williamson wrote:
> On Mon, 2012-06-25 at 02:02 +0300, Michael S. Tsirkin wrote:
> > On Sun, Jun 24, 2012 at 03:59:27PM -0600, Alex Williamson wrote:
> > > On Sun, 2012-06-24 at 18:49 +0300, Michael S. Tsirkin wrote:
> > > > On Sun, Jun 24, 2012 at 09:18:38AM -0600, Alex Williamson wrote:
> > > > > > > @@ -242,7 +299,8 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
> > > > > > >
> > > > > > > ret = 0;
> > > > > > > list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > > > > > - if (irqfd->eventfd != tmp->eventfd)
> > > > > > > + if (irqfd->eventfd != tmp->eventfd &&
> > > > > > > + irqfd->eventfd != tmp->eoi_eventfd)
> > > > > > > continue;
> > > > > >
> > > > > > So we allow duplicate irqfd with differing eoifd (or edge-triggered and
> > > > > > level-triggered irqfd on the same context).
> > > > > >
> > > > > > (why the check in the first place? just so we can have a reliable
> > > > > > deassign or is it avoiding a deeper problem?)
> > > > >
> > > > > I really wasn't sure to what extent we wanted to prevent duplicates. My
> > > > > guess was that we don't want to have an irqfd trigger more than one
> > > > > thing. That seems to be what the current code does. I don't see any
> > > > > problems with multiple irqfds triggering the same eventfd though. I
> > > > > only added a test that a new irqfd can't be triggered by an existing
> > > > > eoi_eventfd as that could make a nasty loop.
> > > >
> > > > How would that make a loop? You can have the same thing
> > > > with e.g. ioeventfd - why isn't it a problem there?
> > >
> > > eoi_eventfd1 -> irqfd2 [eoi] eoi_eventfd2 -> irqfd1 [eoi] eoi_eventfd1 ->...
> >
> > Sorry I don't understand.
> > What does this [eoi] mean? How is eoi eventfd different from ioeventfd?
>
> [eoi] is simply the guest doing an EOI write.

So you trigger irq, later guest does an eoi etc. Why is this a loop?

> There's some interaction
> required from the guest which could rate limit the loop. I'm not really
> following your comparison to an ioeventfd. If the question is can you
> create loops with an ioeventfd, I imagine the answer is probably so.

I think the answer is no.

> I
> certainly don't plan on adding code to test every fd in use by a vm and
> validate it can't do crazy things, so if even this minor sanity test
> could prevent useful things, I'll happily remove it. Thanks,
>
> Alex

2012-06-28 16:28:03

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On 06/24/2012 06:02 PM, Alex Williamson wrote:
> On Sun, 2012-06-24 at 15:56 +0300, Avi Kivity wrote:
>> On 06/23/2012 01:16 AM, Alex Williamson wrote:
>> > I think we're probably also going to need something like this.
>> > When running in non-accelerated qemu, we're going to have to
>> > create some kind of EOI notifier for drivers. VFIO can make
>> > additional improvements when running on KVM so it will probably
>> > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
>> > want to have a generic EOI notifier in qemu that just stops
>> > working when kvm-ioapic is enabled.
>>
>> Why?
>
> Hmm, I must be missing something or not describing it correctly, because
> it seems obvious.

I have not exhausted this quarter's quota of stupid questions yet.

> If we create a dependency in qemu of needing to know
> when an eoi occurs and notifier a driver and have no way to fulfill that
> dependency when running on kvm... that'd be bad, right? I don't want to
> assume that every consumer of such an interface would prefer to make use
> of an irqfd. Not sure if that answers your question though. Thanks,

I meant, what scenario do you have in mind where we want the EOI
notifier while running with kvm-irqchip enabled? Perhaps I phrased my
question a bit too tersely.

--
error compiling committee.c: too many arguments to function

2012-06-28 17:21:30

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

On Thu, 2012-06-28 at 19:27 +0300, Avi Kivity wrote:
> On 06/24/2012 06:02 PM, Alex Williamson wrote:
> > On Sun, 2012-06-24 at 15:56 +0300, Avi Kivity wrote:
> >> On 06/23/2012 01:16 AM, Alex Williamson wrote:
> >> > I think we're probably also going to need something like this.
> >> > When running in non-accelerated qemu, we're going to have to
> >> > create some kind of EOI notifier for drivers. VFIO can make
> >> > additional improvements when running on KVM so it will probably
> >> > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> >> > want to have a generic EOI notifier in qemu that just stops
> >> > working when kvm-ioapic is enabled.
> >>
> >> Why?
> >
> > Hmm, I must be missing something or not describing it correctly, because
> > it seems obvious.
>
> I have not exhausted this quarter's quota of stupid questions yet.

;)

> > If we create a dependency in qemu of needing to know
> > when an eoi occurs and notifier a driver and have no way to fulfill that
> > dependency when running on kvm... that'd be bad, right? I don't want to
> > assume that every consumer of such an interface would prefer to make use
> > of an irqfd. Not sure if that answers your question though. Thanks,
>
> I meant, what scenario do you have in mind where we want the EOI
> notifier while running with kvm-irqchip enabled? Perhaps I phrased my
> question a bit too tersely.

Aha. Well, the v2 series really pulls eoifd in as more of a primary
player and is now the only way to get EOI notification regardless of
whether it's bound to a separate level irq source or used by qemu for
the common userspace source. I don't know of any users that currently
need EOI notification for the common userspace source id, but depending
on where we land with whether KVM_IRQFD supports level interrupts, it
may be vfio that needs it.

Alex