2009-10-26 16:22:03

by Gregory Haskins

Subject: [KVM PATCH v3 0/3] irqfd enhancements, and irq_routing fixes

(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that
we have lockless interrupt injection. For more details, please see
the patch headers.

These patches pass checkpatch and are fully tested. Please consider
them for merging. Patch 1/3 fixes an issue that may exist upstream
and should be considered for a more timely push. Patches 2/3 and
3/3 are enhancements only, so there is no urgency to push them to
mainline until a suitable merge window presents itself.

Kind Regards,
-Greg

[ Change log:

v3:
*) Added patch 1/3 as a fix for a race condition
*) Minor cleanup to 2/3 to ensure that all shared vectors conform
to a unified locking model.

v2:
*) dropped the original cleanup, which relied on the user registering
MSI-based GSIs and could crash at runtime otherwise. Instead, we now
check at registration time whether the GSI supports lockless
operation and dynamically adapt, using either the original
deferred path for lock-based injections or the direct path for
lockless ones.

v1:
*) original release
]

---

Gregory Haskins (3):
KVM: Directly inject interrupts if they support lockless operation
KVM: export lockless GSI attribute
KVM: fix race in irq_routing logic


include/linux/kvm_host.h | 8 ++++
virt/kvm/eventfd.c | 31 +++++++++++++++--
virt/kvm/irq_comm.c | 85 ++++++++++++++++++++++++++++++++++------------
virt/kvm/kvm_main.c | 1 +
4 files changed, 98 insertions(+), 27 deletions(-)

--
Signature


2009-10-26 16:22:21

by Gregory Haskins

Subject: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

The current code suffers from the following race condition:

thread-1                                   thread-2
-----------------------------------------------------------

kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

                                           kvm_set_irq_routing() {
                                              mutex_lock();
                                              irq_rt = table;
                                              rcu_assign_pointer();
                                              mutex_unlock();
                                              synchronize_rcu();

                                              kfree(irq_rt);

   irq_rt->entry->set(); /* bad */

-------------------------------------------------------------

The bug occurs because the pointer is accessed outside of the read-side
critical section. There are two basic patterns we can use to fix this bug:

1) Switch to sleepable RCU (SRCU) and encompass the ->set() access
within the read-side critical section (RSCS),

OR

2) Add reference counting to the irq_rt structure, and simply acquire
the reference from within the RSCS.

This patch implements solution (1).
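
For reference, the resulting read-side shape (a condensed sketch of the
hunks below, not new code) is:

    int idx;

    idx = srcu_read_lock(&kvm->irq_routing.srcu);
    irq_rt = rcu_dereference(kvm->irq_routing.table);
    /* the reader may sleep here, so ->set() can run inside the RSCS */
    r = e->set(e, kvm, irq_source_id, level);
    srcu_read_unlock(&kvm->irq_routing.srcu, idx);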

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_host.h | 6 +++++-
virt/kvm/irq_comm.c | 50 +++++++++++++++++++++++++++-------------------
virt/kvm/kvm_main.c | 1 +
3 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd5a616..1fe135d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -185,7 +185,10 @@ struct kvm {

struct mutex irq_lock;
#ifdef CONFIG_HAVE_KVM_IRQCHIP
- struct kvm_irq_routing_table *irq_routing;
+ struct {
+ struct srcu_struct srcu;
+ struct kvm_irq_routing_table *table;
+ } irq_routing;
struct hlist_head mask_notifier_list;
struct hlist_head irq_ack_notifier_list;
#endif
@@ -541,6 +544,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *entries,
unsigned nr,
unsigned flags);
+void kvm_init_irq_routing(struct kvm *kvm);
void kvm_free_irq_routing(struct kvm *kvm);

#else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 00c68d2..db2553f 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -144,10 +144,11 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
*/
int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
{
- struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
- int ret = -1, i = 0;
+ struct kvm_kernel_irq_routing_entry *e;
+ int ret = -1;
struct kvm_irq_routing_table *irq_rt;
struct hlist_node *n;
+ int idx;

trace_kvm_set_irq(irq, level, irq_source_id);

@@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
* IOAPIC. So set the bit in both. The guest will ignore
* writes to the unused one.
*/
- rcu_read_lock();
- irq_rt = rcu_dereference(kvm->irq_routing);
+ idx = srcu_read_lock(&kvm->irq_routing.srcu);
+ irq_rt = rcu_dereference(kvm->irq_routing.table);
if (irq < irq_rt->nr_rt_entries)
- hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
- irq_set[i++] = *e;
- rcu_read_unlock();
+ hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+ int r;

- while(i--) {
- int r;
- r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
- if (r < 0)
- continue;
+ r = e->set(e, kvm, irq_source_id, level);
+ if (r < 0)
+ continue;

- ret = r + ((ret < 0) ? 0 : ret);
- }
+ ret = r + ((ret < 0) ? 0 : ret);
+ }
+ srcu_read_unlock(&kvm->irq_routing.srcu, idx);

return ret;
}
@@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
struct kvm_irq_ack_notifier *kian;
struct hlist_node *n;
int gsi;
+ int idx;

trace_kvm_ack_irq(irqchip, pin);

- rcu_read_lock();
- gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+ idx = srcu_read_lock(&kvm->irq_routing.srcu);
+ gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
if (gsi != -1)
hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
link)
if (kian->gsi == gsi)
kian->irq_acked(kian);
- rcu_read_unlock();
+ srcu_read_unlock(&kvm->irq_routing.srcu, idx);
}

void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -287,11 +287,19 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
rcu_read_unlock();
}

+void kvm_init_irq_routing(struct kvm *kvm)
+{
+ init_srcu_struct(&kvm->irq_routing.srcu);
+}
+
void kvm_free_irq_routing(struct kvm *kvm)
{
/* Called only during vm destruction. Nobody can use the pointer
at this stage */
- kfree(kvm->irq_routing);
+ synchronize_srcu(&kvm->irq_routing.srcu);
+ cleanup_srcu_struct(&kvm->irq_routing.srcu);
+
+ kfree(kvm->irq_routing.table);
}

static int setup_routing_entry(struct kvm_irq_routing_table *rt,
@@ -396,10 +404,10 @@ int kvm_set_irq_routing(struct kvm *kvm,
}

mutex_lock(&kvm->irq_lock);
- old = kvm->irq_routing;
- rcu_assign_pointer(kvm->irq_routing, new);
+ old = kvm->irq_routing.table;
+ rcu_assign_pointer(kvm->irq_routing.table, new);
mutex_unlock(&kvm->irq_lock);
- synchronize_rcu();
+ synchronize_srcu(&kvm->irq_routing.srcu);

new = old;
r = 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cac69c4..ba94159 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -388,6 +388,7 @@ static struct kvm *kvm_create_vm(void)
atomic_inc(&kvm->mm->mm_count);
spin_lock_init(&kvm->mmu_lock);
spin_lock_init(&kvm->requests_lock);
+ kvm_init_irq_routing(kvm);
kvm_io_bus_init(&kvm->pio_bus);
kvm_eventfd_init(kvm);
mutex_init(&kvm->lock);

2009-10-26 16:22:39

by Gregory Haskins

Subject: [KVM PATCH v3 2/3] KVM: export lockless GSI attribute

Certain GSIs support lockless injection, but we have no way to detect
which ones at the GSI level. Knowledge of this attribute will be
useful later in the series so that we can optimize irqfd injection
paths for cases where we know the code will not sleep. Therefore,
we provide an API to query a specific GSI.
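
As a sketch of the intended call pattern (the helper names here are
hypothetical; the return convention matches the implementation below):

    ret = kvm_irq_check_lockless(kvm, gsi);
    if (ret < 0)
        return ret;           /* no route established for this GSI */
    else if (ret)
        inject_directly();    /* hypothetical: every destination is lockless */
    else
        defer_to_workqueue(); /* hypothetical: at least one locked destination */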

Signed-off-by: Gregory Haskins <[email protected]>
---

include/linux/kvm_host.h | 2 ++
virt/kvm/irq_comm.c | 35 ++++++++++++++++++++++++++++++++++-
2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1fe135d..01151a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_memory_slot {
struct kvm_kernel_irq_routing_entry {
u32 gsi;
u32 type;
+ bool lockless;
int (*set)(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level);
union {
@@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
unsigned long *deliver_bitmask);
#endif
int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
void kvm_register_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index db2553f..a7fd487 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
return ret;
}

+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
+{
+ struct kvm_kernel_irq_routing_entry *e;
+ struct kvm_irq_routing_table *irq_rt;
+ struct hlist_node *n;
+ int ret = -ENOENT;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->irq_routing.srcu);
+ irq_rt = rcu_dereference(kvm->irq_routing.table);
+ if (irq < irq_rt->nr_rt_entries)
+ hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+ if (!e->lockless) {
+ /*
+ * all destinations need to be lockless to
+ * declare that the GSI as a whole is also
+ * lockless
+ */
+ ret = 0;
+ break;
+ }
+
+ ret = 1;
+ }
+ srcu_read_unlock(&kvm->irq_routing.srcu, idx);
+
+ return ret;
+}
+
void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
{
struct kvm_irq_ack_notifier *kian;
@@ -310,18 +339,22 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
int delta;
struct kvm_kernel_irq_routing_entry *ei;
struct hlist_node *n;
+ bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;

/*
* Do not allow GSI to be mapped to the same irqchip more than once.
* Allow only one to one mapping between GSI and MSI.
+ * Do not allow mixed lockless vs locked variants to coexist.
*/
hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
if (ei->type == KVM_IRQ_ROUTING_MSI ||
- ue->u.irqchip.irqchip == ei->irqchip.irqchip)
+ ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
+ ei->lockless != lockless)
return r;

e->gsi = ue->gsi;
e->type = ue->type;
+ e->lockless = lockless;
switch (ue->type) {
case KVM_IRQ_ROUTING_IRQCHIP:
delta = 0;

2009-10-26 16:22:17

by Gregory Haskins

Subject: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

IRQFD currently uses a deferred workqueue item to execute the injection
operation. It was originally designed this way because kvm_set_irq()
required the caller to hold the irq_lock mutex, and the eventfd callback
is invoked from within a non-preemptible critical section.

With the advent of lockless injection support for certain GSIs, the
deferment mechanism is no longer technically needed in all cases.
Since context switching to the workqueue is a source of interrupt
latency, let's switch to a direct method whenever possible. Fortunately
for us, the most common use of irqfd (MSI-based GSIs) readily supports
lockless injection.
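
Concretely (a sketch of the dispatch below; error handling elided):

    /* at assign time, pick the execution method once: */
    if (kvm_irq_check_lockless(kvm, gsi) > 0)
        irqfd->execute = &irqfd_inject;   /* direct: call kvm_set_irq() inline */
    else
        irqfd->execute = &irqfd_schedule; /* deferred: bounce through workqueue */

    /* later, from the non-preemptible eventfd wakeup callback: */
    if (flags & POLLIN)
        irqfd->execute(irqfd);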

Signed-off-by: Gregory Haskins <[email protected]>
---

virt/kvm/eventfd.c | 31 +++++++++++++++++++++++++++----
1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..e6cc958 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -51,20 +51,34 @@ struct _irqfd {
wait_queue_t wait;
struct work_struct inject;
struct work_struct shutdown;
+ void (*execute)(struct _irqfd *);
};

static struct workqueue_struct *irqfd_cleanup_wq;

static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
{
- struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
struct kvm *kvm = irqfd->kvm;

kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
}

+static void
+irqfd_deferred_inject(struct work_struct *work)
+{
+ struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+ irqfd_inject(irqfd);
+}
+
+static void
+irqfd_schedule(struct _irqfd *irqfd)
+{
+ schedule_work(&irqfd->inject);
+}
+
/*
* Race-free decouple logic (ordering is critical)
*/
@@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)

if (flags & POLLIN)
/* An event has been signaled, inject an interrupt */
- schedule_work(&irqfd->inject);
+ irqfd->execute(irqfd);

if (flags & POLLHUP) {
/* The eventfd is closing, detach from KVM */
@@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
irqfd->kvm = kvm;
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
- INIT_WORK(&irqfd->inject, irqfd_inject);
+ INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
INIT_WORK(&irqfd->shutdown, irqfd_shutdown);

file = eventfd_fget(fd);
@@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
list_add_tail(&irqfd->list, &kvm->irqfds.items);
spin_unlock_irq(&kvm->irqfds.lock);

+ ret = kvm_irq_check_lockless(kvm, gsi);
+ if (ret < 0)
+ goto fail;
+
+ if (ret)
+ irqfd->execute = &irqfd_inject;
+ else
+ irqfd->execute = &irqfd_schedule;
+
/*
* Check if there was an event already pending on the eventfd
* before we registered, and trigger it as if we didn't miss it.

2009-10-27 03:46:10

by Paul E. McKenney

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> The current code suffers from the following race condition:
>
> thread-1 thread-2
> -----------------------------------------------------------
>
> kvm_set_irq() {
> rcu_read_lock()
> irq_rt = rcu_dereference(table);
> rcu_read_unlock();
>
> kvm_set_irq_routing() {
> mutex_lock();
> irq_rt = table;
> rcu_assign_pointer();
> mutex_unlock();
> synchronize_rcu();
>
> kfree(irq_rt);
>
> irq_rt->entry->set(); /* bad */
>
> -------------------------------------------------------------
>
> Because the pointer is accessed outside of the read-side critical
> section. There are two basic patterns we can use to fix this bug:
>
> 1) Switch to sleeping-rcu and encompass the ->set() access within the
> read-side critical section,
>
> OR
>
> 2) Add reference counting to the irq_rt structure, and simply acquire
> the reference from within the RSCS.
>
> This patch implements solution (1).

Looks like a good transformation! A few questions interspersed below.

> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> include/linux/kvm_host.h | 6 +++++-
> virt/kvm/irq_comm.c | 50 +++++++++++++++++++++++++++-------------------
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 35 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index bd5a616..1fe135d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -185,7 +185,10 @@ struct kvm {
>
> struct mutex irq_lock;
> #ifdef CONFIG_HAVE_KVM_IRQCHIP
> - struct kvm_irq_routing_table *irq_routing;
> + struct {
> + struct srcu_struct srcu;

Each structure has its own SRCU domain. This is OK, but just asking
if that is the intent. It does look like the SRCU primitives are
passed a pointer to the correct structure, and that the return value
from srcu_read_lock() gets passed into the matching srcu_read_unlock()
like it needs to be, so that is good.

> + struct kvm_irq_routing_table *table;
> + } irq_routing;
> struct hlist_head mask_notifier_list;
> struct hlist_head irq_ack_notifier_list;
> #endif

[ . . . ]

> @@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> * IOAPIC. So set the bit in both. The guest will ignore
> * writes to the unused one.
> */
> - rcu_read_lock();
> - irq_rt = rcu_dereference(kvm->irq_routing);
> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> + irq_rt = rcu_dereference(kvm->irq_routing.table);
> if (irq < irq_rt->nr_rt_entries)
> - hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> - irq_set[i++] = *e;
> - rcu_read_unlock();
> + hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {

What prevents the above list from changing while we are traversing it?
(Yes, presumably whatever was preventing it from changing before this
patch, but what?)

Mostly kvm->lock is held, but not always. And if kvm->lock were held
all the time, there would be no point in using SRCU. ;-)

> + int r;
>
> - while(i--) {
> - int r;
> - r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
> - if (r < 0)
> - continue;
> + r = e->set(e, kvm, irq_source_id, level);
> + if (r < 0)
> + continue;
>
> - ret = r + ((ret < 0) ? 0 : ret);
> - }
> + ret = r + ((ret < 0) ? 0 : ret);
> + }
> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
>
> return ret;
> }
> @@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> struct kvm_irq_ack_notifier *kian;
> struct hlist_node *n;
> int gsi;
> + int idx;
>
> trace_kvm_ack_irq(irqchip, pin);
>
> - rcu_read_lock();
> - gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> + gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
> if (gsi != -1)
> hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
> link)

And same question here -- what keeps the above list from changing while
we are traversing it?

Thanx, Paul

2009-10-27 06:45:45

by Gleb Natapov

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> The current code suffers from the following race condition:
>
> thread-1 thread-2
> -----------------------------------------------------------
>
> kvm_set_irq() {
> rcu_read_lock()
> irq_rt = rcu_dereference(table);
> rcu_read_unlock();
>
> kvm_set_irq_routing() {
> mutex_lock();
> irq_rt = table;
> rcu_assign_pointer();
> mutex_unlock();
> synchronize_rcu();
>
> kfree(irq_rt);
>
> irq_rt->entry->set(); /* bad */
>
This is not what happens. irq_rt is never accessed outside the read-side
critical section. Data is copied from irq_rt onto the stack, and it is
this copy that is accessed outside the critical section.

> -------------------------------------------------------------
>
> Because the pointer is accessed outside of the read-side critical
> section. There are two basic patterns we can use to fix this bug:
>
> 1) Switch to sleeping-rcu and encompass the ->set() access within the
> read-side critical section,
>
> OR
>
> 2) Add reference counting to the irq_rt structure, and simply acquire
> the reference from within the RSCS.
>
> This patch implements solution (1).
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> include/linux/kvm_host.h | 6 +++++-
> virt/kvm/irq_comm.c | 50 +++++++++++++++++++++++++++-------------------
> virt/kvm/kvm_main.c | 1 +
> 3 files changed, 35 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index bd5a616..1fe135d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -185,7 +185,10 @@ struct kvm {
>
> struct mutex irq_lock;
> #ifdef CONFIG_HAVE_KVM_IRQCHIP
> - struct kvm_irq_routing_table *irq_routing;
> + struct {
> + struct srcu_struct srcu;
> + struct kvm_irq_routing_table *table;
> + } irq_routing;
> struct hlist_head mask_notifier_list;
> struct hlist_head irq_ack_notifier_list;
> #endif
> @@ -541,6 +544,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
> const struct kvm_irq_routing_entry *entries,
> unsigned nr,
> unsigned flags);
> +void kvm_init_irq_routing(struct kvm *kvm);
> void kvm_free_irq_routing(struct kvm *kvm);
>
> #else
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 00c68d2..db2553f 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -144,10 +144,11 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> */
> int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> {
> - struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
> - int ret = -1, i = 0;
> + struct kvm_kernel_irq_routing_entry *e;
> + int ret = -1;
> struct kvm_irq_routing_table *irq_rt;
> struct hlist_node *n;
> + int idx;
>
> trace_kvm_set_irq(irq, level, irq_source_id);
>
> @@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> * IOAPIC. So set the bit in both. The guest will ignore
> * writes to the unused one.
> */
> - rcu_read_lock();
> - irq_rt = rcu_dereference(kvm->irq_routing);
> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> + irq_rt = rcu_dereference(kvm->irq_routing.table);
> if (irq < irq_rt->nr_rt_entries)
> - hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> - irq_set[i++] = *e;
> - rcu_read_unlock();
> + hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
> + int r;
>
> - while(i--) {
> - int r;
> - r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
> - if (r < 0)
> - continue;
> + r = e->set(e, kvm, irq_source_id, level);
> + if (r < 0)
> + continue;
>
> - ret = r + ((ret < 0) ? 0 : ret);
> - }
> + ret = r + ((ret < 0) ? 0 : ret);
> + }
> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
>
> return ret;
> }
> @@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> struct kvm_irq_ack_notifier *kian;
> struct hlist_node *n;
> int gsi;
> + int idx;
>
> trace_kvm_ack_irq(irqchip, pin);
>
> - rcu_read_lock();
> - gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> + gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
> if (gsi != -1)
> hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
> link)
> if (kian->gsi == gsi)
> kian->irq_acked(kian);
> - rcu_read_unlock();
> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
> }
>
> void kvm_register_irq_ack_notifier(struct kvm *kvm,
> @@ -287,11 +287,19 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
> rcu_read_unlock();
> }
>
> +void kvm_init_irq_routing(struct kvm *kvm)
> +{
> + init_srcu_struct(&kvm->irq_routing.srcu);
> +}
> +
> void kvm_free_irq_routing(struct kvm *kvm)
> {
> /* Called only during vm destruction. Nobody can use the pointer
> at this stage */
> - kfree(kvm->irq_routing);
> + synchronize_srcu(&kvm->irq_routing.srcu);
> + cleanup_srcu_struct(&kvm->irq_routing.srcu);
> +
> + kfree(kvm->irq_routing.table);
> }
>
> static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> @@ -396,10 +404,10 @@ int kvm_set_irq_routing(struct kvm *kvm,
> }
>
> mutex_lock(&kvm->irq_lock);
> - old = kvm->irq_routing;
> - rcu_assign_pointer(kvm->irq_routing, new);
> + old = kvm->irq_routing.table;
> + rcu_assign_pointer(kvm->irq_routing.table, new);
> mutex_unlock(&kvm->irq_lock);
> - synchronize_rcu();
> + synchronize_srcu(&kvm->irq_routing.srcu);
>
> new = old;
> r = 0;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cac69c4..ba94159 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -388,6 +388,7 @@ static struct kvm *kvm_create_vm(void)
> atomic_inc(&kvm->mm->mm_count);
> spin_lock_init(&kvm->mmu_lock);
> spin_lock_init(&kvm->requests_lock);
> + kvm_init_irq_routing(kvm);
> kvm_io_bus_init(&kvm->pio_bus);
> kvm_eventfd_init(kvm);
> mutex_init(&kvm->lock);
>

--
Gleb.

2009-10-27 13:34:47

by Gregory Haskins

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Hi Paul,

Paul E. McKenney wrote:
> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
>> The current code suffers from the following race condition:
>>
>> thread-1 thread-2
>> -----------------------------------------------------------
>>
>> kvm_set_irq() {
>> rcu_read_lock()
>> irq_rt = rcu_dereference(table);
>> rcu_read_unlock();
>>
>> kvm_set_irq_routing() {
>> mutex_lock();
>> irq_rt = table;
>> rcu_assign_pointer();
>> mutex_unlock();
>> synchronize_rcu();
>>
>> kfree(irq_rt);
>>
>> irq_rt->entry->set(); /* bad */
>>
>> -------------------------------------------------------------
>>
>> Because the pointer is accessed outside of the read-side critical
>> section. There are two basic patterns we can use to fix this bug:
>>
>> 1) Switch to sleeping-rcu and encompass the ->set() access within the
>> read-side critical section,
>>
>> OR
>>
>> 2) Add reference counting to the irq_rt structure, and simply acquire
>> the reference from within the RSCS.
>>
>> This patch implements solution (1).
>
> Looks like a good transformation! A few questions interspersed below.

Thanks for the review. I would have CC'd you but I figured I pestered
you enough with my RCU reviews in the past, and didn't want to annoy you ;)

I will be sure to CC you in the future, unless you ask otherwise.

>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> include/linux/kvm_host.h | 6 +++++-
>> virt/kvm/irq_comm.c | 50 +++++++++++++++++++++++++++-------------------
>> virt/kvm/kvm_main.c | 1 +
>> 3 files changed, 35 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index bd5a616..1fe135d 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -185,7 +185,10 @@ struct kvm {
>>
>> struct mutex irq_lock;
>> #ifdef CONFIG_HAVE_KVM_IRQCHIP
>> - struct kvm_irq_routing_table *irq_routing;
>> + struct {
>> + struct srcu_struct srcu;
>
> Each structure has its own SRCU domain. This is OK, but just asking
> if that is the intent. It does look like the SRCU primitives are
> passed a pointer to the correct structure, and that the return value
> from srcu_read_lock() gets passed into the matching srcu_read_unlock()
> like it needs to be, so that is good.

Yeah, it was intentional. Technically the table is per-guest, and thus
the locking is too, which is the desired/intentional granularity.

On that note, I tried to denote that kvm->irq_routing.srcu and
kvm->irq_routing.table were related to one another, but then went ahead
and modified the hunks that touched kvm->irq_ack_notifier_list, too. In
retrospect, this was probably a mistake. I should leave the rcu usage
outside of ->irq_routing.table alone.

>
>> + struct kvm_irq_routing_table *table;
>> + } irq_routing;
>> struct hlist_head mask_notifier_list;
>> struct hlist_head irq_ack_notifier_list;
>> #endif
>
> [ . . . ]
>
>> @@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>> * IOAPIC. So set the bit in both. The guest will ignore
>> * writes to the unused one.
>> */
>> - rcu_read_lock();
>> - irq_rt = rcu_dereference(kvm->irq_routing);
>> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
>> + irq_rt = rcu_dereference(kvm->irq_routing.table);
>> if (irq < irq_rt->nr_rt_entries)
>> - hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
>> - irq_set[i++] = *e;
>> - rcu_read_unlock();
>> + hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
>
> What prevents the above list from changing while we are traversing it?
> (Yes, presumably whatever was preventing it from changing before this
> patch, but what?)
>
> Mostly kvm->lock is held, but not always. And if kvm->lock were held
> all the time, there would be no point in using SRCU. ;-)

This is protected by kvm->irq_lock within kvm_set_irq_routing().
Entries are added to a copy of the list, and the top-level table pointer
is swapped (via rcu_assign_pointer(), as it should be) while holding the
lock. Finally, we synchronize with the RSCS before deleting the old
copy. It looks to me like the original author got this part right, so I
didn't modify it outside of converting to SRCU.
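
In other words, the publish side boils down to this (paraphrasing
kvm_set_irq_routing() after this patch):

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing.table;
    rcu_assign_pointer(kvm->irq_routing.table, new);
    mutex_unlock(&kvm->irq_lock);
    synchronize_srcu(&kvm->irq_routing.srcu); /* drain all readers */

    kfree(old); /* no RSCS can still reference the old table */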

>
>> + int r;
>>
>> - while(i--) {
>> - int r;
>> - r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
>> - if (r < 0)
>> - continue;
>> + r = e->set(e, kvm, irq_source_id, level);
>> + if (r < 0)
>> + continue;
>>
>> - ret = r + ((ret < 0) ? 0 : ret);
>> - }
>> + ret = r + ((ret < 0) ? 0 : ret);
>> + }
>> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
>>
>> return ret;
>> }
>> @@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>> struct kvm_irq_ack_notifier *kian;
>> struct hlist_node *n;
>> int gsi;
>> + int idx;
>>
>> trace_kvm_ack_irq(irqchip, pin);
>>
>> - rcu_read_lock();
>> - gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
>> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
>> + gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
>> if (gsi != -1)
>> hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
>> link)
>
> And same question here -- what keeps the above list from changing while
> we are traversing it?

This is also protected via the kvm->irq_lock in
kvm_register_irq_ack_notifier(). Though as mentioned above, I should
probably drop the non irq_routing.table hunks, so this will go away.
But I think its correct either way.

Thanks Paul,
-Greg



2009-10-27 13:39:09

by Gregory Haskins

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Gleb Natapov wrote:
> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
>> The current code suffers from the following race condition:
>>
>> thread-1 thread-2
>> -----------------------------------------------------------
>>
>> kvm_set_irq() {
>> rcu_read_lock()
>> irq_rt = rcu_dereference(table);
>> rcu_read_unlock();
>>
>> kvm_set_irq_routing() {
>> mutex_lock();
>> irq_rt = table;
>> rcu_assign_pointer();
>> mutex_unlock();
>> synchronize_rcu();
>>
>> kfree(irq_rt);
>>
>> irq_rt->entry->set(); /* bad */
>>
> This is not what happens. irq_rt is never accessed outside read-side
> critical section.

Sorry, I was generalizing to keep the comments short. I figured it
would be clear what I was actually saying, but realize in retrospect
that I was a little ambiguous.

Yes, irq_rt is not accessed outside the RSCS. However, the entry
pointers stored in the irq_rt->map are, and this is equally problematic
afaict.

In this particular case we seem to never delete entries at run-time once
they are established. Therefore, while perhaps sloppy, it's technically
safe to leave them unprotected from this perspective. The issue is more
related to shutdown since a kvm_set_irq() caller could be within the
aforementioned race-region and call entry->set() after the guest is
gone. Or did I miss something?

> Data is copied from irq_rt onto the stack and this copy is accessed
> outside critical section.

As mentioned above, I do not believe this really protects us. And even
if it did, the copy is just a work-around to avoid sleeping within the
standard RCU RSCS, which is what SRCU is designed for. So rather than
inventing an awkward two-phased stack based solution, it's better to
reuse the provided tools, IMO.

To flip it around: Is there any reason why an SRCU would not work here,
and thus we were forced to use something like the stack-copy approach?

Kind Regards,
-Greg



2009-10-27 14:00:34

by Gregory Haskins

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Gregory Haskins wrote:
> Gleb Natapov wrote:
>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
>>> The current code suffers from the following race condition:
>>>
>>> thread-1 thread-2
>>> -----------------------------------------------------------
>>>
>>> kvm_set_irq() {
>>> rcu_read_lock()
>>> irq_rt = rcu_dereference(table);
>>> rcu_read_unlock();
>>>
>>> kvm_set_irq_routing() {
>>> mutex_lock();
>>> irq_rt = table;
>>> rcu_assign_pointer();
>>> mutex_unlock();
>>> synchronize_rcu();
>>>
>>> kfree(irq_rt);
>>>
>>> irq_rt->entry->set(); /* bad */
>>>
>> This is not what happens. irq_rt is never accessed outside read-side
>> critical section.
>
> Sorry, I was generalizing to keep the comments short. I figured it
> would be clear what I was actually saying, but realize in retrospect
> that I was a little ambiguous.

Here is a revised problem statement:

thread-1                                   thread-2
-----------------------------------------------------------

kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   entry_cache = get_entries(irq_rt);
   rcu_read_unlock();

                                           invalidate_entries(irq_rt);

   for_each_entry(entry_cache)
      entry->set(); /* bad */

-------------------------------------------------------------


"invalidate_entries()" may be any operation that deletes an entry at
run-time (doesn't exist today), or as the guest is shutting down. As
far as I can tell, the current code does not protect us from either
condition, and my proposed patch protects us from both. Did I miss
anything?

HTH
-Greg




2009-10-27 14:02:54

by Gleb Natapov

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
> Gleb Natapov wrote:
> > On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> >> The current code suffers from the following race condition:
> >>
> >> thread-1 thread-2
> >> -----------------------------------------------------------
> >>
> >> kvm_set_irq() {
> >> rcu_read_lock()
> >> irq_rt = rcu_dereference(table);
> >> rcu_read_unlock();
> >>
> >> kvm_set_irq_routing() {
> >> mutex_lock();
> >> irq_rt = table;
> >> rcu_assign_pointer();
> >> mutex_unlock();
> >> synchronize_rcu();
> >>
> >> kfree(irq_rt);
> >>
> >> irq_rt->entry->set(); /* bad */
> >>
> > This is not what happens. irq_rt is never accessed outside read-side
> > critical section.
>
> Sorry, I was generalizing to keep the comments short. I figured it
> would be clear what I was actually saying, but realize in retrospect
> that I was a little ambiguous.
>
A little is an understatement :) There is no /* bad */ line in the code!

> Yes, irq_rt is not accessed outside the RSCS. However, the entry
> pointers stored in the irq_rt->map are, and this is equally problematic
> afaict.
The pointer is in text and can't disappear without kvm_set_irq()
disappearing too.

>
> In this particular case we seem to never delete entries at run-time once
> they are established. Therefore, while perhaps sloppy, its technically
> safe to leave them unprotected from this perspective. The issue is more
> related to shutdown since a kvm_set_irq() caller could be within the
> aforementioned race-region and call entry->set() after the guest is
> gone. Or did I miss something?
>
The caller of kvm_set_irq() should hold a reference to the kvm instance,
so it can't disappear while you are inside kvm_set_irq(). RCU protects
only kvm->irq_routing, not the kvm structure itself.

> > Data is copied from irq_rt onto the stack and this copy is accessed
> > outside critical section.
>
> As mentioned above, I do not believe this really protect us. And even
I don't see proof that it doesn't, so I assume it does.

> if it did, the copy is just a work-around to avoid sleeping within the
It is not a work-around. There were two solutions to the problem: one is
to call ->set() outside the RCU critical section, the other is to use
SRCU. I decided to use the first one. This way the code is much simpler,
and I remember asking Paul what the disadvantages of using SRCU were;
there was something.

> standard RCU RSCS, which is what SRCU is designed for. So rather than
> inventing an awkward two-phased stack based solution, it's better to
> reuse the provided tools, IMO.
>
> To flip it around: Is there any reason why an SRCU would not work here,
> and thus we were forced to use something like the stack-copy approach?
>
If SRCU has no disadvantage comparing to RCU why not use it always? :)

--
Gleb.

2009-10-27 14:05:32

by Gleb Natapov

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote:
> Gregory Haskins wrote:
> > Gleb Natapov wrote:
> >> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> >>> The current code suffers from the following race condition:
> >>>
> >>> thread-1 thread-2
> >>> -----------------------------------------------------------
> >>>
> >>> kvm_set_irq() {
> >>> rcu_read_lock()
> >>> irq_rt = rcu_dereference(table);
> >>> rcu_read_unlock();
> >>>
> >>> kvm_set_irq_routing() {
> >>> mutex_lock();
> >>> irq_rt = table;
> >>> rcu_assign_pointer();
> >>> mutex_unlock();
> >>> synchronize_rcu();
> >>>
> >>> kfree(irq_rt);
> >>>
> >>> irq_rt->entry->set(); /* bad */
> >>>
> >> This is not what happens. irq_rt is never accessed outside read-side
> >> critical section.
> >
> > Sorry, I was generalizing to keep the comments short. I figured it
> > would be clear what I was actually saying, but realize in retrospect
> > that I was a little ambiguous.
>
> Here is a revised problem statement
>
> thread-1 thread-2
> -----------------------------------------------------------
>
> kvm_set_irq() {
> rcu_read_lock()
> irq_rt = rcu_dereference(table);
> entry_cache = get_entries(irq_rt);
> rcu_read_unlock();
>
> invalidate_entries(irq_rt);
>
> for_each_entry(entry_cache)
> entry->set(); /* bad */
>
> -------------------------------------------------------------
>
>
> "invalidate_entries()" may be any operation that deletes an entry at
> run-time (doesn't exist today), or as the guest is shutting down. As
> far as I can tell, the current code does not protect us from either
> condition, and my proposed patch protects us from both. Did I miss
> anything?
>
Yes. What happened to irq_rt is completely irrelevant at the point you
marked /* bad */.

--
Gleb.

2009-10-27 14:47:51

by Gregory Haskins

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Gleb Natapov wrote:
> On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
>> Gleb Natapov wrote:
>>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
>>>> The current code suffers from the following race condition:
>>>>
>>>> thread-1 thread-2
>>>> -----------------------------------------------------------
>>>>
>>>> kvm_set_irq() {
>>>> rcu_read_lock()
>>>> irq_rt = rcu_dereference(table);
>>>> rcu_read_unlock();
>>>>
>>>> kvm_set_irq_routing() {
>>>> mutex_lock();
>>>> irq_rt = table;
>>>> rcu_assign_pointer();
>>>> mutex_unlock();
>>>> synchronize_rcu();
>>>>
>>>> kfree(irq_rt);
>>>>
>>>> irq_rt->entry->set(); /* bad */
>>>>
>>> This is not what happens. irq_rt is never accessed outside read-side
>>> critical section.
>> Sorry, I was generalizing to keep the comments short. I figured it
>> would be clear what I was actually saying, but realize in retrospect
>> that I was a little ambiguous.
>>
> A little is underestimation :) There is not /* bad */ line in the code!

Sorry, that was my own highlighting, not trying to reflect actual code.

>
>> Yes, irq_rt is not accessed outside the RSCS. However, the entry
>> pointers stored in the irq_rt->map are, and this is equally problematic
>> afaict.
> The pointer is in text and can't disappear without kvm_set_irq()
> disappearing too.

No, the entry* pointer is .text _AND_ .data, and is subject to standard
synchronization rules like most other objects.

Unless I am misreading the code, the entry* pointers point into heap
memory that is part of the irq_rt allocation. Therefore, the
"kfree(irq_rt)" I mention above effectively invalidates the entire set
of entry* pointers that you are caching, and is thus an issue.
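
To illustrate (paraphrasing the allocation in kvm_set_irq_routing() as I
read it upstream; the exact expression may differ): the table and its
entries come from a single block, so freeing one frees the other:

    new = kzalloc(sizeof(*new)
                  + nr_rt_entries * sizeof(struct hlist_head)
                  + nr * sizeof(struct kvm_kernel_irq_routing_entry),
                  GFP_KERNEL);
    ...
    kfree(new); /* every entry* that pointed into this block now dangles */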

>
>> In this particular case we seem to never delete entries at run-time once
>> they are established. Therefore, while perhaps sloppy, its technically
>> safe to leave them unprotected from this perspective.

Note: I was wrong in this statement. I forgot that it's not safe at
run-time either since the entry objects are part of irq_rt.

>> The issue is more
>> related to shutdown since a kvm_set_irq() caller could be within the
>> aforementioned race-region and call entry->set() after the guest is
>> gone. Or did I miss something?
>>
> The caller of kvm_set_irq() should hold reference to kvm instance, so it
> can't disappear while you are inside kvm_set_irq(). RCU protects only
> kvm->irq_routing not kvm structure itself.

Agreed, but this has nothing to do with protecting the entry* pointers.

>
>>> Data is copied from irq_rt onto the stack and this copy is accessed
>>> outside critical section.
>> As mentioned above, I do not believe this really protect us. And even
> I don't see the prove it doesn't, so I assume it does.

What would you like to see beyond what I've already provided you? I can
show how the entry pointers are allocated as part of the irq_rt, and I
can show how the irq_rt (via entry->set) access is racy against
invalidation.

>
>> if it did, the copy is just a work-around to avoid sleeping within the
> It is not a work-around. There was two solutions to the problem one is
> to call ->set() outside rcu critical section

This is broken afaict without taking additional precautions, such as a
reference count on the irq_rt structure, but I mentioned this alternate
solution in my header.

> another is to use SRCU. I
> decided to use the first one. This way the code is much simpler

"simpler" is debatable, but ok. SRCU is an established pattern
available in the upstream kernel, so I don't think its particularly
complicated or controversial to use.

> and I remember asking Paul what are the disadvantages of using SRCU and there
> was something.
>

The disadvantages to my knowledge are as follows:

1) rcu_read_lock is something like 4x faster than srcu_read_lock(), but
we are talking about nanoseconds on modern hardware (I think Paul quoted
me 10ns vs 45ns on his rig). I don't think either overhead is something
to be concerned about in this case.

2) standard rcu supports deferred synchronization (call_rcu()), as well
as barriers (synchronize_rcu()), whereas SRCU only supports barriers
(synchronize_srcu()). We only use the barrier type in this code path,
so that is not an issue.

3) SRCU requires explicit initialization/cleanup, whereas regular RCU
does not. Trivially solved in my patch since KVM has plenty of
init/cleanup hook points.

>> standard RCU RSCS, which is what SRCU is designed for. So rather than
>> inventing an awkward two-phased stack based solution, it's better to
>> reuse the provided tools, IMO.
>>
>> To flip it around: Is there any reason why an SRCU would not work here,
>> and thus we were forced to use something like the stack-copy approach?
>>
> If SRCU has no disadvantage comparing to RCU why not use it always? :)

No one is debating that SRCU has some disadvantages compared to RCU,
but it should also be noted that RCU has disadvantages as well (for
instance, you can't sleep within the RSCS except in preemptible-based
configurations).

The differences between them are really not the issue. The bottom line
is that the upstream KVM irq_routing code is broken afaict with the
application of RCU alone.

IMO: Its not the tool for the job: At least, not when used alone. You
either need RCU + reference count (which has more overhead than SRCU due
to the atomic ops), or SRCU. There may perhaps be other variations on
this theme, as well, and I am not married to SRCU as the solution, per
se. But it is *a* solution that I believe works, and IMO its the
best/cleanest/simplest one at our disposal.



2009-10-27 14:49:30

by Paul E. McKenney

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 04:02:37PM +0200, Gleb Natapov wrote:
> On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:

[ . . . ]

> > standard RCU RSCS, which is what SRCU is designed for. So rather than
> > inventing an awkward two-phased stack based solution, it's better to
> > reuse the provided tools, IMO.
> >
> > To flip it around: Is there any reason why an SRCU would not work here,
> > and thus we were forced to use something like the stack-copy approach?
> >
> If SRCU has no disadvantage comparing to RCU why not use it always? :)

The disadvantages of SRCU compared to RCU include the following:

1. SRCU requires that the return value of srcu_read_lock()
be fed into srcu_read_unlock(). This is usually not a problem,
but can be painful if there are multiple levels of function
call separating the two.

2. SRCU's grace periods are about 4x slower than those of RCU.
And they also don't scale all that well with extremely large
numbers of CPUs (but this can be fixed when/if it becomes a
real problem).

3. SRCU's read-side primitives are also significantly slower than
those of RCU.

4. SRCU does not have a call_srcu(). One could be provided, but
its semantics would be a bit strange due to the need to limit
the number of callbacks, given that general blocking is
permitted in SRCU read-side critical sections. (And it would
take some doing to convince me to supply one!)

5. The current SRCU has no reasonable way to implement read-side
priority boosting, as there is no record of which task
is read-holding which SRCU.

Hey, you asked! ;-)

Thanx, Paul

2009-10-27 14:50:47

by Gregory Haskins

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Gleb Natapov wrote:
> On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote:
>> Gregory Haskins wrote:
>>> Gleb Natapov wrote:
>>>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
>>>>> The current code suffers from the following race condition:
>>>>>
>>>>> thread-1 thread-2
>>>>> -----------------------------------------------------------
>>>>>
>>>>> kvm_set_irq() {
>>>>> rcu_read_lock()
>>>>> irq_rt = rcu_dereference(table);
>>>>> rcu_read_unlock();
>>>>>
>>>>> kvm_set_irq_routing() {
>>>>> mutex_lock();
>>>>> irq_rt = table;
>>>>> rcu_assign_pointer();
>>>>> mutex_unlock();
>>>>> synchronize_rcu();
>>>>>
>>>>> kfree(irq_rt);
>>>>>
>>>>> irq_rt->entry->set(); /* bad */
>>>>>
>>>> This is not what happens. irq_rt is never accessed outside read-side
>>>> critical section.
>>> Sorry, I was generalizing to keep the comments short. I figured it
>>> would be clear what I was actually saying, but realize in retrospect
>>> that I was a little ambiguous.
>> Here is a revised problem statement
>>
>> thread-1 thread-2
>> -----------------------------------------------------------
>>
>> kvm_set_irq() {
>> rcu_read_lock()
>> irq_rt = rcu_dereference(table);
>> entry_cache = get_entries(irq_rt);
>> rcu_read_unlock();
>>
>> invalidate_entries(irq_rt);
>>
>> for_each_entry(entry_cache)
>> entry->set(); /* bad */
>>
>> -------------------------------------------------------------
>>
>>
>> "invalidate_entries()" may be any operation that deletes an entry at
>> run-time (doesn't exist today), or as the guest is shutting down. As
>> far as I can tell, the current code does not protect us from either
>> condition, and my proposed patch protects us from both. Did I miss
>> anything?
>>
> Yes. What happened to irq_rt is completely irrelevant at the point you
> marked /* bad */.

kfree() happened to irq_rt, and thus to the objects behind the pointers
in entry_cache at the point I marked /* bad */.

That certainly isn't /* good */ ;)

-Greg



2009-10-27 15:02:25

by Gregory Haskins

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Thanks for this, Paul.

Some questions and statements below.

Paul E. McKenney wrote:
> On Tue, Oct 27, 2009 at 04:02:37PM +0200, Gleb Natapov wrote:
>> On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
>
> [ . . . ]
>
>>> standard RCU RSCS, which is what SRCU is designed for. So rather than
>>> inventing an awkward two-phased stack based solution, it's better to
>>> reuse the provided tools, IMO.
>>>
>>> To flip it around: Is there any reason why an SRCU would not work here,
>>> and thus we were forced to use something like the stack-copy approach?
>>>
>> If SRCU has no disadvantage comparing to RCU why not use it always? :)
>
> The disadvantages of SRCU compared to RCU include the following:
>
> 1. SRCU requires that the return value of srcu_read_lock()
> be fed into srcu_read_unlock(). This is usually not a problem,
> but can be painful if there are multiple levels of function
> call separating the two.

Right, and this is simple/neat w.r.t. its usage in irq_routing, so no
problem there.

>
> 2. SRCU's grace periods are about 4x slower than those of RCU.
> And they also don't scale all that well with extremely large
> numbers of CPUs (but this can be fixed when/if it becomes a
> real problem).

The irq_routing update path is extremely infrequent, so this should not
be an issue.

>
> 3. SRCU's read-side primitives are also significantly slower than
> those of RCU.
>

Are the 10ns vs 45ns numbers that I mentioned in my last reply the
proper ballpark? How do these compare to an atomic-op, say an
uncontended spinlock on modern hardware? The assumption is that
srcu_read_lock() should be significantly cheaper than a read-lock(). If
it's not, then we might as well use something else, I suppose. Then
again, if it weren't, I guess you probably wouldn't have bothered to
invent it in the first place ;)

> 4. SRCU does not have a call_srcu(). One could be provided, but
> its semantics would be a bit strange due to the need to limit
> the number of callbacks, given that general blocking is
> permitted in SRCU read-side critical sections. (And it would
> take some doing to convince me to supply an SRCU!)

This is not an issue in our design.

>
> 5. The current SRCU has no reasonable way to implement read-side
> priority boosting, as there is no record of which task
> is read-holding which SRCU.

Given the infrequency of the update path, I do not see this as a problem.

Kind Regards,
-Greg



2009-10-27 15:05:09

by Gleb Natapov

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 10:50:45AM -0400, Gregory Haskins wrote:
> Gleb Natapov wrote:
> > On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote:
> >> Gregory Haskins wrote:
> >>> Gleb Natapov wrote:
> >>>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> >>>>> The current code suffers from the following race condition:
> >>>>>
> >>>>> thread-1 thread-2
> >>>>> -----------------------------------------------------------
> >>>>>
> >>>>> kvm_set_irq() {
> >>>>> rcu_read_lock()
> >>>>> irq_rt = rcu_dereference(table);
> >>>>> rcu_read_unlock();
> >>>>>
> >>>>> kvm_set_irq_routing() {
> >>>>> mutex_lock();
> >>>>> irq_rt = table;
> >>>>> rcu_assign_pointer();
> >>>>> mutex_unlock();
> >>>>> synchronize_rcu();
> >>>>>
> >>>>> kfree(irq_rt);
> >>>>>
> >>>>> irq_rt->entry->set(); /* bad */
> >>>>>
> >>>> This is not what happens. irq_rt is never accessed outside read-side
> >>>> critical section.
> >>> Sorry, I was generalizing to keep the comments short. I figured it
> >>> would be clear what I was actually saying, but realize in retrospect
> >>> that I was a little ambiguous.
> >> Here is a revised problem statement
> >>
> >> thread-1 thread-2
> >> -----------------------------------------------------------
> >>
> >> kvm_set_irq() {
> >> rcu_read_lock()
> >> irq_rt = rcu_dereference(table);
> >> entry_cache = get_entries(irq_rt);
> >> rcu_read_unlock();
> >>
> >> invalidate_entries(irq_rt);
> >>
> >> for_each_entry(entry_cache)
> >> entry->set(); /* bad */
> >>
> >> -------------------------------------------------------------
> >>
> >>
> >> "invalidate_entries()" may be any operation that deletes an entry at
> >> run-time (doesn't exist today), or as the guest is shutting down. As
> >> far as I can tell, the current code does not protect us from either
> >> condition, and my proposed patch protects us from both. Did I miss
> >> anything?
> >>
> > Yes. What happened to irq_rt is completely irrelevant at the point you
> > marked /* bad */.
>
> kfree() happened to irq_rt, and thus to the objects behind the pointers
> in entry_cache at the point I marked /* bad */.
The entire entry is cached, not a pointer to the entry, so a later
kfree() does not affect it.

>
> That certainly isn't /* good */ ;)
>
It looks like we are looking at different code :)

--
Gleb.

2009-10-27 15:30:31

by Gleb Natapov

Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 10:47:49AM -0400, Gregory Haskins wrote:
> Gleb Natapov wrote:
> > On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
> >> Gleb Natapov wrote:
> >>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> >>>> The current code suffers from the following race condition:
> >>>>
> >>>> thread-1 thread-2
> >>>> -----------------------------------------------------------
> >>>>
> >>>> kvm_set_irq() {
> >>>> rcu_read_lock()
> >>>> irq_rt = rcu_dereference(table);
> >>>> rcu_read_unlock();
> >>>>
> >>>> kvm_set_irq_routing() {
> >>>> mutex_lock();
> >>>> irq_rt = table;
> >>>> rcu_assign_pointer();
> >>>> mutex_unlock();
> >>>> synchronize_rcu();
> >>>>
> >>>> kfree(irq_rt);
> >>>>
> >>>> irq_rt->entry->set(); /* bad */
> >>>>
> >>> This is not what happens. irq_rt is never accessed outside read-side
> >>> critical section.
> >> Sorry, I was generalizing to keep the comments short. I figured it
> >> would be clear what I was actually saying, but realize in retrospect
> >> that I was a little ambiguous.
> >>
> > A little is underestimation :) There is not /* bad */ line in the code!
>
> Sorry, that was my own highlighting, not trying to reflect actual code.
>
> >
> >> Yes, irq_rt is not accessed outside the RSCS. However, the entry
> >> pointers stored in the irq_rt->map are, and this is equally problematic
> >> afaict.
> > The pointer is in text and can't disappear without kvm_set_irq()
> > disappearing too.
>
> No, the entry* pointer is .text _AND_ .data, and is subject to standard
> synchronization rules like most other objects.
>
> Unless I am misreading the code, the entry* pointers point to heap
> within the irq_rt pointer. Therefore, the "kfree(irq_rt)" I mention
> above effectively invalidates the entire set of entry* pointers that you
> are caching, and is thus an issue.
>
I think you are missing that the content of the entry is copied, not a
pointer to the entry:
irq_set[i++] = *e;
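
To spell out the difference (illustrative only):

    /* what the code does: copy the whole struct onto the stack;
     * the copy survives a later kfree(irq_rt) */
    irq_set[i++] = *e;

    /* what would actually be broken: caching only the pointer,
     * which dangles once irq_rt is freed (hypothetical array name) */
    entry_ptrs[i++] = e;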
> >
> >> In this particular case we seem to never delete entries at run-time once
> >> they are established. Therefore, while perhaps sloppy, its technically
> >> safe to leave them unprotected from this perspective.
>
> Note: I was wrong in this statement. I forgot that it's not safe at
> run-time either since the entry objects are part of irq_rt.
>
> >> The issue is more
> >> related to shutdown since a kvm_set_irq() caller could be within the
> >> aforementioned race-region and call entry->set() after the guest is
> >> gone. Or did I miss something?
> >>
> > The caller of kvm_set_irq() should hold reference to kvm instance, so it
> > can't disappear while you are inside kvm_set_irq(). RCU protects only
> > kvm->irq_routing not kvm structure itself.
>
> Agreed, but this has nothing to do with protecting the entry* pointers.
>
They are not used outside the critical section.

> >
> >>> Data is copied from irq_rt onto the stack and this copy is accessed
> >>> outside critical section.
> >> As mentioned above, I do not believe this really protect us. And even
> > I don't see the prove it doesn't, so I assume it does.
>
> What would you like to see beyond what I've already provided you? I can
> show how the entry pointers are allocated as part of the irq_rt, and I
> can show how the irq_rt (via entry->set) access is racy against
> invalidation.
>
> >
> >> if it did, the copy is just a work-around to avoid sleeping within the
> > It is not a work-around. There was two solutions to the problem one is
> > to call ->set() outside rcu critical section
>
> This is broken afaict without taking additional precautions, such as a
> reference count on the irq_rt structure, but I mentioned this alternate
> solution in my header.
>
> > another is to use SRCU. I
> > decided to use the first one. This way the code is much simpler
>
> "simpler" is debatable, but ok. SRCU is an established pattern
> available in the upstream kernel, so I don't think its particularly
> complicated or controversial to use.
>
> > and I remember asking Paul what are the disadvantages of using SRCU and there
> > was something.
> >
>
> The disadvantages to my knowledge are as follows:
>
> 1) rcu_read_lock is something like 4x faster than srcu_read_lock(), but
> we are talking about nanoseconds on modern hardware (I think Paul quoted
> me 10ns vs 45ns on his rig). I don't think either overhead is something
> to be concerned about in this case.
>
If we can avoid it, why not? Nanoseconds tend to add up.

> 2) standard rcu supports deferred synchronization (call_rcu()), as well
> as barriers (synchronize_rcu()), whereas SRCU only supports barriers
> (synchronize_srcu()). We only use the barrier type in this code path,
> so that is not an issue.
Agree.

>
> 3) SRCU requires explicit initialization/cleanup, whereas regular RCU
> does not. Trivially solved in my patch since KVM has plenty of
> init/cleanup hook points.
>
No problem here too.

> >> standard RCU RSCS, which is what SRCU is designed for. So rather than
> >> inventing an awkward two-phased stack based solution, it's better to
> >> reuse the provided tools, IMO.
> >>
> >> To flip it around: Is there any reason why an SRCU would not work here,
> >> and thus we were forced to use something like the stack-copy approach?
> >>
> > If SRCU has no disadvantage comparing to RCU why not use it always? :)
>
> No one is debating that SRCU has some disadvantages compared to RCU, but it
> should also be noted that RCU has disadvantages as well (for instance,
> you can't sleep within the RSCS except in preemptible-based configurations).
>
> The differences between them are really not the issue. The bottom line
> is that upstream KVM irq_routing code is broken afaict with the
> application of RCU alone.
>
> IMO: It's not the tool for the job; at least, not when used alone. You
> either need RCU + reference count (which has more overhead than SRCU due
> to the atomic ops), or SRCU. There may perhaps be other variations on
> this theme, as well, and I am not married to SRCU as the solution, per
> se. But it is *a* solution that I believe works, and IMO it's the
> best/cleanest/simplest one at our disposal.
>



--
Gleb.

2009-10-27 15:42:14

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Gleb Natapov wrote:
> On Tue, Oct 27, 2009 at 10:50:45AM -0400, Gregory Haskins wrote:
>> Gleb Natapov wrote:
>>> On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote:
>>>> Gregory Haskins wrote:
>>>>> Gleb Natapov wrote:
>>>>>> On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
>>>>>>> The current code suffers from the following race condition:
>>>>>>>
>>>>>>> thread-1 thread-2
>>>>>>> -----------------------------------------------------------
>>>>>>>
>>>>>>> kvm_set_irq() {
>>>>>>> rcu_read_lock()
>>>>>>> irq_rt = rcu_dereference(table);
>>>>>>> rcu_read_unlock();
>>>>>>>
>>>>>>> kvm_set_irq_routing() {
>>>>>>> mutex_lock();
>>>>>>> irq_rt = table;
>>>>>>> rcu_assign_pointer();
>>>>>>> mutex_unlock();
>>>>>>> synchronize_rcu();
>>>>>>>
>>>>>>> kfree(irq_rt);
>>>>>>>
>>>>>>> irq_rt->entry->set(); /* bad */
>>>>>>>
>>>>>> This is not what happens. irq_rt is never accessed outside read-side
>>>>>> critical section.
>>>>> Sorry, I was generalizing to keep the comments short. I figured it
>>>>> would be clear what I was actually saying, but realize in retrospect
>>>>> that I was a little ambiguous.
>>>> Here is a revised problem statement
>>>>
>>>> thread-1 thread-2
>>>> -----------------------------------------------------------
>>>>
>>>> kvm_set_irq() {
>>>> rcu_read_lock()
>>>> irq_rt = rcu_dereference(table);
>>>> entry_cache = get_entries(irq_rt);
>>>> rcu_read_unlock();
>>>>
>>>> invalidate_entries(irq_rt);
>>>>
>>>> for_each_entry(entry_cache)
>>>> entry->set(); /* bad */
>>>>
>>>> -------------------------------------------------------------
>>>>
>>>>
>>>> "invalidate_entries()" may be any operation that deletes an entry at
>>>> run-time (doesn't exist today), or as the guest is shutting down. As
>>>> far as I can tell, the current code does not protect us from either
>>>> condition, and my proposed patch protects us from both. Did I miss
>>>> anything?
>>>>
>>> Yes. What happened to irq_rt is completely irrelevant at the point you
>>> marked /* bad */.
>> kfree() happened to irq_rt, and thus to the objects behind the pointers
>> in entry_cache at the point I marked /* bad */.
> The entire entry is cached, not a pointer to an entry! kfree() of irq_rt
> therefore does not invalidate the cached copies.

<light bulb goes off>

Ah, I see. I missed the detail that it was a structure copy, not a
pointer copy.

My bad. You are right, and I am wrong. I retract the 1/3 patch.

Kind Regards,
-Greg


Attachments:
signature.asc (267.00 B)
OpenPGP digital signature

2009-10-27 17:33:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 11:02:23AM -0400, Gregory Haskins wrote:
> Thanks for this, Paul.
>
> Some questions and statements below.
>
> Paul E. McKenney wrote:
> > On Tue, Oct 27, 2009 at 04:02:37PM +0200, Gleb Natapov wrote:
> >> On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote:
> >
> > [ . . . ]
> >
> >>> standard RCU RSCS, which is what SRCU is designed for. So rather than
> >>> inventing an awkward two-phased stack based solution, it's better to
> >>> reuse the provided tools, IMO.
> >>>
> >>> To flip it around: Is there any reason why an SRCU would not work here,
> >>> and thus we were forced to use something like the stack-copy approach?
> >>>
> >> If SRCU has no disadvantage comparing to RCU why not use it always? :)
> >
> > The disadvantages of SRCU compared to RCU include the following:
> >
> > 1. SRCU requires that the return value of srcu_read_lock()
> > be fed into srcu_read_unlock(). This is usually not a problem,
> > but can be painful if there are multiple levels of function
> > call separating the two.
>
> Right, and this is simple/neat w.r.t. its usage in irq_routing, so no
> problem there.

Fair enough!

> >
> > 2. SRCU's grace periods are about 4x slower than those of RCU.
> > And they also don't scale all that well with extremely large
> > numbers of CPUs (but this can be fixed when/if it becomes a
> > real problem).
>
> The irq_routing update path is extremely infrequent, so this should not
> be an issue.

Sounds good!

> > 3. SRCU's read-side primitives are also significantly slower than
> > those of RCU.
>
> Are the 10ns vs 45ns numbers that I mentioned in my last reply the
> proper ballpark? How do these compare to an atomic-op, say an
> uncontended spinlock on modern hardware? The assumption is that
> srcu_read_lock() should be significantly cheaper than a read-lock(). If
> it's not, then we might as well use something else, I suppose. But if
> it weren't, I guess you probably wouldn't have bothered to invent it in the
> first place ;)

SRCU read-side critical sections should indeed be quite a bit cheaper than
uncontended spinlock, particularly if the spinlock was last released by
some other CPU. There are those who insist that uncontended spinlocks
and atomic operations will soon be free, but I will believe this when
I see it. ;-)

> > 4. SRCU does not have a call_srcu(). One could be provided, but
> > its semantics would be a bit strange due to the need to limit
> > the number of callbacks, given that general blocking is
> > permitted in SRCU read-side critical sections. (And it would
> > take some doing to convince me to supply a call_srcu()!)
>
> This is not an issue in our design.

Very good!

> > 5. The current SRCU has no reasonable way to implement read-side
> > priority boosting, as there is no record of which task
> > is read-holding which SRCU.
>
> Given the infrequency of the update path, I do not see this as a problem.

Sounds like you have it covered, then!

Thanx, Paul

2009-10-27 16:53:10

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

Gleb Natapov wrote:
>>
>> 1) rcu_read_lock is something like 4x faster than srcu_read_lock(), but
>> we are talking about nanoseconds on modern hardware (I think Paul quoted
>> me 10ns vs 45ns on his rig). I don't think either overhead is something
>> to be concerned about in this case.
>>
> If we can avoid why not? Nanoseconds tend to add up.
>

BTW: I didn't mean to imply that we should be cavalier in adding
overhead. My point was that overhead above and beyond a plain RCU pointer
acquisition is sometimes _necessary_ to prevent a race, and 35ns is
not _terrible_ IMO.

I was suggesting we solve this by switching to SRCU, but an alternative
is copying the structure (permitted when the objects are immutable), which
seems to work in this particular case. It should be noted that the copy
has its own unquantified overhead beyond basic RCU as well, so it's not
truly free (I'd guess it's <= the cost of the switch to SRCU without
copies, though).

IOW: sync hurts, but it's sometimes a necessary evil ;)

Kind Regards,
-Greg


Attachments:
signature.asc (267.00 B)
OpenPGP digital signature

2009-10-27 17:01:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic

On Tue, Oct 27, 2009 at 09:34:41AM -0400, Gregory Haskins wrote:
> Hi Paul,
>
> Paul E. McKenney wrote:
> > On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote:
> >> The current code suffers from the following race condition:
> >>
> >> thread-1 thread-2
> >> -----------------------------------------------------------
> >>
> >> kvm_set_irq() {
> >> rcu_read_lock()
> >> irq_rt = rcu_dereference(table);
> >> rcu_read_unlock();
> >>
> >> kvm_set_irq_routing() {
> >> mutex_lock();
> >> irq_rt = table;
> >> rcu_assign_pointer();
> >> mutex_unlock();
> >> synchronize_rcu();
> >>
> >> kfree(irq_rt);
> >>
> >> irq_rt->entry->set(); /* bad */
> >>
> >> -------------------------------------------------------------
> >>
> >> Because the pointer is accessed outside of the read-side critical
> >> section. There are two basic patterns we can use to fix this bug:
> >>
> >> 1) Switch to sleeping-rcu and encompass the ->set() access within the
> >> read-side critical section,
> >>
> >> OR
> >>
> >> 2) Add reference counting to the irq_rt structure, and simply acquire
> >> the reference from within the RSCS.
> >>
> >> This patch implements solution (1).
> >
> > Looks like a good transformation! A few questions interspersed below.
>
> Thanks for the review. I would have CC'd you but I figured I pestered
> you enough with my RCU reviews in the past, and didn't want to annoy you ;)
>
> I will be sure to CC you in the future, unless you ask otherwise.

No problem either way. ;-)

Thanx, Paul

> >> Signed-off-by: Gregory Haskins <[email protected]>
> >> ---
> >>
> >> include/linux/kvm_host.h | 6 +++++-
> >> virt/kvm/irq_comm.c | 50 +++++++++++++++++++++++++++-------------------
> >> virt/kvm/kvm_main.c | 1 +
> >> 3 files changed, 35 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index bd5a616..1fe135d 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -185,7 +185,10 @@ struct kvm {
> >>
> >> struct mutex irq_lock;
> >> #ifdef CONFIG_HAVE_KVM_IRQCHIP
> >> - struct kvm_irq_routing_table *irq_routing;
> >> + struct {
> >> + struct srcu_struct srcu;
> >
> > Each structure has its own SRCU domain. This is OK, but just asking
> > if that is the intent. It does look like the SRCU primitives are
> > passed a pointer to the correct structure, and that the return value
> > from srcu_read_lock() gets passed into the matching srcu_read_unlock()
> > like it needs to be, so that is good.
>
> Yeah, it was intentional. Technically the table is per-guest, and thus
> the locking is too, which is the desired/intentional granularity.
>
> On that note, I tried to denote that kvm->irq_routing.srcu and
> kvm->irq_routing.table were related to one another, but then went ahead
> and modified the hunks that touched kvm->irq_ack_notifier_list, too. In
> retrospect, this was probably a mistake. I should leave the rcu usage
> outside of ->irq_routing.table alone.
>
> >
> >> + struct kvm_irq_routing_table *table;
> >> + } irq_routing;
> >> struct hlist_head mask_notifier_list;
> >> struct hlist_head irq_ack_notifier_list;
> >> #endif
> >
> > [ . . . ]
> >
> >> @@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> >> * IOAPIC. So set the bit in both. The guest will ignore
> >> * writes to the unused one.
> >> */
> >> - rcu_read_lock();
> >> - irq_rt = rcu_dereference(kvm->irq_routing);
> >> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> >> + irq_rt = rcu_dereference(kvm->irq_routing.table);
> >> if (irq < irq_rt->nr_rt_entries)
> >> - hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
> >> - irq_set[i++] = *e;
> >> - rcu_read_unlock();
> >> + hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
> >
> > What prevents the above list from changing while we are traversing it?
> > (Yes, presumably whatever was preventing it from changing before this
> > patch, but what?)
> >
> > Mostly kvm->lock is held, but not always. And if kvm->lock were held
> > all the time, there would be no point in using SRCU. ;-)
>
> This is protected by kvm->irq_lock within kvm_set_irq_routing().
> Entries are added to a copy of the list, and the top-level table pointer
> is swapped (via rcu_assign_pointer(), as it should be) while holding the
> lock. Finally, we synchronize with the RSCS before deleting the old
> copy. It looks to me like the original author got this part right, so I
> didn't modify it outside of converting to SRCU.
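> 
> In outline, the update side follows the classic RCU publish pattern
> (an illustrative sketch of the pattern, not the literal patch):
> 
> 	static int update_routing(struct kvm *kvm,
> 				  struct kvm_irq_routing_table *new)
> 	{
> 		struct kvm_irq_routing_table *old;
> 
> 		mutex_lock(&kvm->irq_lock);
> 		old = kvm->irq_routing.table;
> 		rcu_assign_pointer(kvm->irq_routing.table, new);
> 		mutex_unlock(&kvm->irq_lock);
> 
> 		synchronize_srcu(&kvm->irq_routing.srcu); /* drain readers */
> 		kfree(old);				  /* now safe */
> 		return 0;
> 	}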
>
> >
> >> + int r;
> >>
> >> - while(i--) {
> >> - int r;
> >> - r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
> >> - if (r < 0)
> >> - continue;
> >> + r = e->set(e, kvm, irq_source_id, level);
> >> + if (r < 0)
> >> + continue;
> >>
> >> - ret = r + ((ret < 0) ? 0 : ret);
> >> - }
> >> + ret = r + ((ret < 0) ? 0 : ret);
> >> + }
> >> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
> >>
> >> return ret;
> >> }
> >> @@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> >> struct kvm_irq_ack_notifier *kian;
> >> struct hlist_node *n;
> >> int gsi;
> >> + int idx;
> >>
> >> trace_kvm_ack_irq(irqchip, pin);
> >>
> >> - rcu_read_lock();
> >> - gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> >> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> >> + gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
> >> if (gsi != -1)
> >> hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
> >> link)
> >
> > And same question here -- what keeps the above list from changing while
> > we are traversing it?
>
> This is also protected via the kvm->irq_lock in
> kvm_register_irq_ack_notifier(). Though as mentioned above, I should
> probably drop the non irq_routing.table hunks, so this will go away.
> But I think its correct either way.
>
> Thanks Paul,
> -Greg
>

2009-10-27 17:47:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote:
> IRQFD currently uses a deferred workqueue item to execute the injection
> operation. It was originally designed this way because kvm_set_irq()
> required the caller to hold the irq_lock mutex, and the eventfd callback
> is invoked from within a non-preemptible critical section.
>
> With the advent of lockless injection support for certain GSIs, the
> deferment mechanism is no longer technically needed in all cases.
> Since context switching to the workqueue is a source of interrupt
> latency, let's switch to a direct method whenever possible. Fortunately
> for us, the most common use of irqfd (MSI-based GSIs) readily supports
> lockless injection.
>
> Signed-off-by: Gregory Haskins <[email protected]>

This is a useful optimization I think.
Some comments below.

> ---
>
> virt/kvm/eventfd.c | 31 +++++++++++++++++++++++++++----
> 1 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 30f70fd..e6cc958 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -51,20 +51,34 @@ struct _irqfd {
> wait_queue_t wait;
> struct work_struct inject;
> struct work_struct shutdown;
> + void (*execute)(struct _irqfd *);
> };
>
> static struct workqueue_struct *irqfd_cleanup_wq;
>
> static void
> -irqfd_inject(struct work_struct *work)
> +irqfd_inject(struct _irqfd *irqfd)
> {
> - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> struct kvm *kvm = irqfd->kvm;
>
> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> }
>
> +static void
> +irqfd_deferred_inject(struct work_struct *work)
> +{
> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> +
> + irqfd_inject(irqfd);
> +}
> +
> +static void
> +irqfd_schedule(struct _irqfd *irqfd)
> +{
> + schedule_work(&irqfd->inject);
> +}
> +
> /*
> * Race-free decouple logic (ordering is critical)
> */
> @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>
> if (flags & POLLIN)
> /* An event has been signaled, inject an interrupt */
> - schedule_work(&irqfd->inject);
> + irqfd->execute(irqfd);
>
> if (flags & POLLHUP) {
> /* The eventfd is closing, detach from KVM */
> @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> irqfd->kvm = kvm;
> irqfd->gsi = gsi;
> INIT_LIST_HEAD(&irqfd->list);
> - INIT_WORK(&irqfd->inject, irqfd_inject);
> + INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
> INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
>
> file = eventfd_fget(fd);
> @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> list_add_tail(&irqfd->list, &kvm->irqfds.items);
> spin_unlock_irq(&kvm->irqfds.lock);
>
> + ret = kvm_irq_check_lockless(kvm, gsi);
> + if (ret < 0)
> + goto fail;
> +
> + if (ret)
> + irqfd->execute = &irqfd_inject;
> + else
> + irqfd->execute = &irqfd_schedule;
> +

Can't gsi get converted from lockless to non-lockless
after it's checked (by the routing ioctl)? Kernel will crash then.

How about, each time we get an event from the eventfd, we implement
kvm_irqfd_toggle_lockless, which does a single scan, and returns a
true/false status (and I really mean toggle, let's not do set 1 / set 0
as well) telling us whether interrupts could be delivered in a lockless
manner?

Then we can either just ignore the error (no one uses eventfd this way),
or handle the mostly irrelevant level-triggered case by means of the workqueue,
like we did previously.
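
Roughly like this (a hypothetical sketch; kvm_irqfd_toggle_lockless()
does not exist and the names are made up):

	/*
	 * One routing scan per event: raise and lower the line in a
	 * single pass, and report whether that could be done without
	 * sleeping.
	 */
	bool kvm_irqfd_toggle_lockless(struct kvm *kvm, u32 gsi);

	static void
	irqfd_event(struct _irqfd *irqfd)
	{
		if (!kvm_irqfd_toggle_lockless(irqfd->kvm, irqfd->gsi))
			/* rare level-triggered case: defer to workqueue */
			schedule_work(&irqfd->inject);
	}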


> /*
> * Check if there was an event already pending on the eventfd
> * before we registered, and trigger it as if we didn't miss it.
>

2009-10-27 18:54:42

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

Michael S. Tsirkin wrote:
> On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote:
>> IRQFD currently uses a deferred workqueue item to execute the injection
>> operation. It was originally designed this way because kvm_set_irq()
>> required the caller to hold the irq_lock mutex, and the eventfd callback
>> is invoked from within a non-preemptible critical section.
>>
>> With the advent of lockless injection support for certain GSIs, the
>> deferment mechanism is no longer technically needed in all cases.
>> Since context switching to the workqueue is a source of interrupt
>> latency, let's switch to a direct method whenever possible. Fortunately
>> for us, the most common use of irqfd (MSI-based GSIs) readily supports
>> lockless injection.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>
> This is a useful optimization I think.
> Some comments below.
>
>> ---
>>
>> virt/kvm/eventfd.c | 31 +++++++++++++++++++++++++++----
>> 1 files changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 30f70fd..e6cc958 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -51,20 +51,34 @@ struct _irqfd {
>> wait_queue_t wait;
>> struct work_struct inject;
>> struct work_struct shutdown;
>> + void (*execute)(struct _irqfd *);
>> };
>>
>> static struct workqueue_struct *irqfd_cleanup_wq;
>>
>> static void
>> -irqfd_inject(struct work_struct *work)
>> +irqfd_inject(struct _irqfd *irqfd)
>> {
>> - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>> struct kvm *kvm = irqfd->kvm;
>>
>> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>> }
>>
>> +static void
>> +irqfd_deferred_inject(struct work_struct *work)
>> +{
>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>> +
>> + irqfd_inject(irqfd);
>> +}
>> +
>> +static void
>> +irqfd_schedule(struct _irqfd *irqfd)
>> +{
>> + schedule_work(&irqfd->inject);
>> +}
>> +
>> /*
>> * Race-free decouple logic (ordering is critical)
>> */
>> @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>
>> if (flags & POLLIN)
>> /* An event has been signaled, inject an interrupt */
>> - schedule_work(&irqfd->inject);
>> + irqfd->execute(irqfd);
>>
>> if (flags & POLLHUP) {
>> /* The eventfd is closing, detach from KVM */
>> @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>> irqfd->kvm = kvm;
>> irqfd->gsi = gsi;
>> INIT_LIST_HEAD(&irqfd->list);
>> - INIT_WORK(&irqfd->inject, irqfd_inject);
>> + INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
>> INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
>>
>> file = eventfd_fget(fd);
>> @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>> list_add_tail(&irqfd->list, &kvm->irqfds.items);
>> spin_unlock_irq(&kvm->irqfds.lock);
>>
>> + ret = kvm_irq_check_lockless(kvm, gsi);
>> + if (ret < 0)
>> + goto fail;
>> +
>> + if (ret)
>> + irqfd->execute = &irqfd_inject;
>> + else
>> + irqfd->execute = &irqfd_schedule;
>> +
>
> Can't gsi get converted from lockless to non-lockless
> after it's checked (by the routing ioctl)?

I think I protect against this in patch 2/3 by ensuring that any vectors
that are added have to conform to the same locking rules. The code
doesn't support deleting routes, so we really only need to make sure
that new routes do not change the locking attribute.

> Kernel will crash then.
>
> How about, each time we get an event from the eventfd, we implement
> kvm_irqfd_toggle_lockless, which does a single scan, and returns a
> true/false status (and I really mean toggle, let's not do set 1 / set 0
> as well) telling us whether interrupts could be delivered in a lockless
> manner?

I am not sure I like this idea in general given that I believe I already
handle the error case you are concerned with.

However, the concept of providing a "toggle" option so we can avoid
scanning the list twice is a good one. That can be done as a new patch
series, but it would be a nice addition.

Thanks Michael,
-Greg


Attachments:
signature.asc (267.00 B)
OpenPGP digital signature

2009-10-28 07:38:28

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

On Tue, Oct 27, 2009 at 02:54:40PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote:
> >> IRQFD currently uses a deferred workqueue item to execute the injection
> >> operation. It was originally designed this way because kvm_set_irq()
> >> required the caller to hold the irq_lock mutex, and the eventfd callback
> >> is invoked from within a non-preemptible critical section.
> >>
> >> With the advent of lockless injection support for certain GSIs, the
> >> deferment mechanism is no longer technically needed in all cases.
> >> Since context switching to the workqueue is a source of interrupt
> >> latency, let's switch to a direct method whenever possible. Fortunately
> >> for us, the most common use of irqfd (MSI-based GSIs) readily supports
> >> lockless injection.
> >>
> >> Signed-off-by: Gregory Haskins <[email protected]>
> >
> > This is a useful optimization I think.
> > Some comments below.
> >
> >> ---
> >>
> >> virt/kvm/eventfd.c | 31 +++++++++++++++++++++++++++----
> >> 1 files changed, 27 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> index 30f70fd..e6cc958 100644
> >> --- a/virt/kvm/eventfd.c
> >> +++ b/virt/kvm/eventfd.c
> >> @@ -51,20 +51,34 @@ struct _irqfd {
> >> wait_queue_t wait;
> >> struct work_struct inject;
> >> struct work_struct shutdown;
> >> + void (*execute)(struct _irqfd *);
> >> };
> >>
> >> static struct workqueue_struct *irqfd_cleanup_wq;
> >>
> >> static void
> >> -irqfd_inject(struct work_struct *work)
> >> +irqfd_inject(struct _irqfd *irqfd)
> >> {
> >> - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> >> struct kvm *kvm = irqfd->kvm;
> >>
> >> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >> }
> >>
> >> +static void
> >> +irqfd_deferred_inject(struct work_struct *work)
> >> +{
> >> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> >> +
> >> + irqfd_inject(irqfd);
> >> +}
> >> +
> >> +static void
> >> +irqfd_schedule(struct _irqfd *irqfd)
> >> +{
> >> + schedule_work(&irqfd->inject);
> >> +}
> >> +
> >> /*
> >> * Race-free decouple logic (ordering is critical)
> >> */
> >> @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>
> >> if (flags & POLLIN)
> >> /* An event has been signaled, inject an interrupt */
> >> - schedule_work(&irqfd->inject);
> >> + irqfd->execute(irqfd);
> >>
> >> if (flags & POLLHUP) {
> >> /* The eventfd is closing, detach from KVM */
> >> @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> >> irqfd->kvm = kvm;
> >> irqfd->gsi = gsi;
> >> INIT_LIST_HEAD(&irqfd->list);
> >> - INIT_WORK(&irqfd->inject, irqfd_inject);
> >> + INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
> >> INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> >>
> >> file = eventfd_fget(fd);
> >> @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> >> list_add_tail(&irqfd->list, &kvm->irqfds.items);
> >> spin_unlock_irq(&kvm->irqfds.lock);
> >>
> >> + ret = kvm_irq_check_lockless(kvm, gsi);
> >> + if (ret < 0)
> >> + goto fail;
> >> +
> >> + if (ret)
> >> + irqfd->execute = &irqfd_inject;
> >> + else
> >> + irqfd->execute = &irqfd_schedule;
> >> +
> >
> > Can't gsi get converted from lockless to non-lockless
> > after it's checked (by the routing ioctl)?
>
> I think I protect against this in patch 2/3 by ensuring that any vectors
> that are added have to conform to the same locking rules. The code
> doesn't support deleting routes, so we really only need to make sure
> that new routes do not change the locking attribute.

What I am referring to is when userspace calls KVM_SET_GSI_ROUTING.
I don't see how your patch helps here: can't a GSI formerly
used for MSI become unused, and then be reused for non-MSI?
If it can, that's a problem I think, because I think userspace currently
does this sometimes.
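
Concretely, the sequence I worry about looks like this (the gsi number
is arbitrary, purely illustrative):

	/*
	 * 1. KVM_SET_GSI_ROUTING maps gsi 30 as KVM_IRQ_ROUTING_MSI
	 * 2. KVM_IRQFD attaches an eventfd to gsi 30; the check says
	 *    lockless, so the direct injection path is chosen
	 * 3. KVM_SET_GSI_ROUTING runs again, now mapping gsi 30 to an
	 *    irqchip pin (non-MSI)
	 * 4. the next eventfd signal still takes the direct path and
	 *    may call a set() handler that sleeps, from atomic context
	 */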

> > Kernel will crash then.
> >
> > How about, each time we get an event from the eventfd, we implement
> > kvm_irqfd_toggle_lockless, which does a single scan, and returns a
> > true/false status (and I really mean toggle, let's not do set 1 / set 0
> > as well) telling us whether interrupts could be delivered in a lockless
> > manner?
>
> I am not sure I like this idea in general given that I believe I already
> handle the error case you are concerned with.
>
> However, the concept of providing a "toggle" option so we can avoid
> scanning the list twice is a good one. That can be done as a new patch
> series, but it would be a nice addition.
>
> Thanks Michael,
> -Greg
>

2009-10-28 07:49:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM PATCH v3 2/3] KVM: export lockless GSI attribute

On Mon, Oct 26, 2009 at 12:22:03PM -0400, Gregory Haskins wrote:
> Certain GSIs support lockless injection, but we have no way to detect
> which ones at the GSI level. Knowledge of this attribute will be
> useful later in the series so that we can optimize irqfd injection
> paths for cases where we know the code will not sleep. Therefore,
> we provide an API to query a specific GSI.
>
> Signed-off-by: Gregory Haskins <[email protected]>
> ---
>
> include/linux/kvm_host.h | 2 ++
> virt/kvm/irq_comm.c | 35 ++++++++++++++++++++++++++++++++++-
> 2 files changed, 36 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1fe135d..01151a6 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -119,6 +119,7 @@ struct kvm_memory_slot {
> struct kvm_kernel_irq_routing_entry {
> u32 gsi;
> u32 type;
> + bool lockless;

So lockless is the same as type == MSI from below?
If the idea is to make it extensible for the future,
let's just add an inline function, we don't need a field for this.
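
Something like this, say (name invented for illustration):

	static inline bool
	kvm_entry_is_lockless(const struct kvm_kernel_irq_routing_entry *e)
	{
		/* today only MSI routes can be injected without sleeping */
		return e->type == KVM_IRQ_ROUTING_MSI;
	}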

> int (*set)(struct kvm_kernel_irq_routing_entry *e,
> struct kvm *kvm, int irq_source_id, int level);
> union {
> @@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
> unsigned long *deliver_bitmask);
> #endif
> int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
> +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
> void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
> void kvm_register_irq_ack_notifier(struct kvm *kvm,
> struct kvm_irq_ack_notifier *kian);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index db2553f..a7fd487 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> return ret;
> }
>
> +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
> +{
> + struct kvm_kernel_irq_routing_entry *e;
> + struct kvm_irq_routing_table *irq_rt;
> + struct hlist_node *n;
> + int ret = -ENOENT;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
> + irq_rt = rcu_dereference(kvm->irq_routing.table);
> + if (irq < irq_rt->nr_rt_entries)
> + hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
> + if (!e->lockless) {
> + /*
> + * all destinations need to be lockless to
> + * declare that the GSI as a whole is also
> + * lockless
> + */
> + ret = 0;
> + break;
> + }
> +
> + ret = 1;
> + }
> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
> +
> + return ret;
> +}
> +
> void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> {
> struct kvm_irq_ack_notifier *kian;
> @@ -310,18 +339,22 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> int delta;
> struct kvm_kernel_irq_routing_entry *ei;
> struct hlist_node *n;
> + bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;
>
> /*
> * Do not allow GSI to be mapped to the same irqchip more than once.
> * Allow only one to one mapping between GSI and MSI.
> + * Do not allow mixed lockless vs locked variants to coexist.

Userspace has no idea which entries are lockless and which are not:
this is an implementation detail - so it might not be able to avoid
illegal combinations.
Since this is called on an ioctl, can the rule be formulated in a way
that makes sense for userspace?

> */
> hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
> if (ei->type == KVM_IRQ_ROUTING_MSI ||
> - ue->u.irqchip.irqchip == ei->irqchip.irqchip)
> + ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
> + ei->lockless != lockless)

So this check seems like it does nothing, because lockless is the same as
MSI, and MSI is always 1:1? Intentional?

> return r;
>
> e->gsi = ue->gsi;
> e->type = ue->type;
> + e->lockless = lockless;
> switch (ue->type) {
> case KVM_IRQ_ROUTING_IRQCHIP:
> delta = 0;
>

2009-10-28 13:20:47

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation

Michael S. Tsirkin wrote:
> On Tue, Oct 27, 2009 at 02:54:40PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote:
>>>> IRQFD currently uses a deferred workqueue item to execute the injection
>>>> operation. It was originally designed this way because kvm_set_irq()
>>>> required the caller to hold the irq_lock mutex, and the eventfd callback
>>>> is invoked from within a non-preemptible critical section.
>>>>
>>>> With the advent of lockless injection support for certain GSIs, the
>>>> deferment mechanism is no longer technically needed in all cases.
>>>> Since context switching to the workqueue is a source of interrupt
>>>> latency, let's switch to a direct method whenever possible. Fortunately
>>>> for us, the most common use of irqfd (MSI-based GSIs) readily supports
>>>> lockless injection.
>>>>
>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>> This is a useful optimization I think.
>>> Some comments below.
>>>
>>>> ---
>>>>
>>>> virt/kvm/eventfd.c | 31 +++++++++++++++++++++++++++----
>>>> 1 files changed, 27 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>>>> index 30f70fd..e6cc958 100644
>>>> --- a/virt/kvm/eventfd.c
>>>> +++ b/virt/kvm/eventfd.c
>>>> @@ -51,20 +51,34 @@ struct _irqfd {
>>>> wait_queue_t wait;
>>>> struct work_struct inject;
>>>> struct work_struct shutdown;
>>>> + void (*execute)(struct _irqfd *);
>>>> };
>>>>
>>>> static struct workqueue_struct *irqfd_cleanup_wq;
>>>>
>>>> static void
>>>> -irqfd_inject(struct work_struct *work)
>>>> +irqfd_inject(struct _irqfd *irqfd)
>>>> {
>>>> - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>>>> struct kvm *kvm = irqfd->kvm;
>>>>
>>>> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> }
>>>>
>>>> +static void
>>>> +irqfd_deferred_inject(struct work_struct *work)
>>>> +{
>>>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>>>> +
>>>> + irqfd_inject(irqfd);
>>>> +}
>>>> +
>>>> +static void
>>>> +irqfd_schedule(struct _irqfd *irqfd)
>>>> +{
>>>> + schedule_work(&irqfd->inject);
>>>> +}
>>>> +
>>>> /*
>>>> * Race-free decouple logic (ordering is critical)
>>>> */
>>>> @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>>
>>>> if (flags & POLLIN)
>>>> /* An event has been signaled, inject an interrupt */
>>>> - schedule_work(&irqfd->inject);
>>>> + irqfd->execute(irqfd);
>>>>
>>>> if (flags & POLLHUP) {
>>>> /* The eventfd is closing, detach from KVM */
>>>> @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>>>> irqfd->kvm = kvm;
>>>> irqfd->gsi = gsi;
>>>> INIT_LIST_HEAD(&irqfd->list);
>>>> - INIT_WORK(&irqfd->inject, irqfd_inject);
>>>> + INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
>>>> INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
>>>>
>>>> file = eventfd_fget(fd);
>>>> @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>>>> list_add_tail(&irqfd->list, &kvm->irqfds.items);
>>>> spin_unlock_irq(&kvm->irqfds.lock);
>>>>
>>>> + ret = kvm_irq_check_lockless(kvm, gsi);
>>>> + if (ret < 0)
>>>> + goto fail;
>>>> +
>>>> + if (ret)
>>>> + irqfd->execute = &irqfd_inject;
>>>> + else
>>>> + irqfd->execute = &irqfd_schedule;
>>>> +
>>> Can't gsi get converted from lockless to non-lockless
>>> after it's checked (by the routing ioctl)?
>> I think I protect against this in patch 2/3 by ensuring that any vectors
>> that are added have to conform to the same locking rules. The code
>> doesn't support deleting routes, so we really only need to make sure
>> that new routes do not change the locking attribute.
>
> What I am referring to is when userspace calls KVM_SET_GSI_ROUTING.
> I don't see how your patch helps here: can't a GSI formerly
> used for MSI become unused, and then be reused for non-MSI?
> If it can, that's a problem I think, because I think userspace currently
> does this sometimes.

I see your point. I was thinking vectors could only be added, not
deleted, but I see upon further inspection that is not the case.

>
>>> Kernel will crash then.
>>>
>>> How about, each time we get an event from the eventfd, we implement
>>> kvm_irqfd_toggle_lockless, which does a single scan, and returns a
>>> true/false status (and I really mean toggle, let's not do set 1 / set 0
>>> as well) telling us whether interrupts could be delivered in a lockless
>>> manner?
>> I am not sure I like this idea in general given that I believe I already
>> handle the error case you are concerned with.
>>
>> However, the concept of providing a "toggle" option so we can avoid
>> scanning the list twice is a good one. That can be done as a new patch
>> series, but it would be a nice addition.
>>
>> Thanks Michael,
>> -Greg
>>
>
>



Attachments:
signature.asc (267.00 B)
OpenPGP digital signature

2009-10-28 13:24:36

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM PATCH v3 2/3] KVM: export lockless GSI attribute

Michael S. Tsirkin wrote:
> On Mon, Oct 26, 2009 at 12:22:03PM -0400, Gregory Haskins wrote:
>> Certain GSIs support lockless injection, but we have no way to detect
>> which ones at the GSI level. Knowledge of this attribute will be
>> useful later in the series so that we can optimize irqfd injection
>> paths for cases where we know the code will not sleep. Therefore,
>> we provide an API to query a specific GSI.
>>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> ---
>>
>> include/linux/kvm_host.h | 2 ++
>> virt/kvm/irq_comm.c | 35 ++++++++++++++++++++++++++++++++++-
>> 2 files changed, 36 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 1fe135d..01151a6 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -119,6 +119,7 @@ struct kvm_memory_slot {
>> struct kvm_kernel_irq_routing_entry {
>> u32 gsi;
>> u32 type;
>> + bool lockless;
>
> So lockless is the same as type == MSI from below?

Yep, today anyway.

> If the idea is to make it extensible for the future,
> let's just add an inline function, we don't need a field for this.
>

This makes sense.

>> int (*set)(struct kvm_kernel_irq_routing_entry *e,
>> struct kvm *kvm, int irq_source_id, int level);
>> union {
>> @@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
>> unsigned long *deliver_bitmask);
>> #endif
>> int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
>> +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
>> void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
>> void kvm_register_irq_ack_notifier(struct kvm *kvm,
>> struct kvm_irq_ack_notifier *kian);
>> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
>> index db2553f..a7fd487 100644
>> --- a/virt/kvm/irq_comm.c
>> +++ b/virt/kvm/irq_comm.c
>> @@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>> return ret;
>> }
>>
>> +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
>> +{
>> + struct kvm_kernel_irq_routing_entry *e;
>> + struct kvm_irq_routing_table *irq_rt;
>> + struct hlist_node *n;
>> + int ret = -ENOENT;
>> + int idx;
>> +
>> + idx = srcu_read_lock(&kvm->irq_routing.srcu);
>> + irq_rt = rcu_dereference(kvm->irq_routing.table);
>> + if (irq < irq_rt->nr_rt_entries)
>> + hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
>> + if (!e->lockless) {
>> + /*
>> + * all destinations need to be lockless to
>> + * declare that the GSI as a whole is also
>> + * lockless
>> + */
>> + ret = 0;
>> + break;
>> + }
>> +
>> + ret = 1;
>> + }
>> + srcu_read_unlock(&kvm->irq_routing.srcu, idx);
>> +
>> + return ret;
>> +}
>> +
>> void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>> {
>> struct kvm_irq_ack_notifier *kian;
>> @@ -310,18 +339,22 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
>> int delta;
>> struct kvm_kernel_irq_routing_entry *ei;
>> struct hlist_node *n;
>> + bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;
>>
>> /*
>> * Do not allow GSI to be mapped to the same irqchip more than once.
>> * Allow only one to one mapping between GSI and MSI.
>> + * Do not allow mixed lockless vs locked variants to coexist.
>
> Userspace has no idea which entries are lockless and which are not:
> this is an implementation detail - so it might not be able to avoid
> illegal combinations.
> Since this is called on an ioctl, can the rule be formulated in a way
> that makes sense for userspace?
>

I'm not sure.

>> */
>> hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
>> if (ei->type == KVM_IRQ_ROUTING_MSI ||
>> - ue->u.irqchip.irqchip == ei->irqchip.irqchip)
>> + ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
>> + ei->lockless != lockless)
>
> So this check seems like it does nothing, because lockless is the same as
> MSI, and MSI is always 1:1? Intentional?
>

Yeah, it was more to guard against (and document) the dependency, in case the
1:1 with MSI ever changes in the future.

Kind Regards,
-Greg


Attachments:
signature.asc (267.00 B)
OpenPGP digital signature