2009-06-02 15:15:51

by Gregory Haskins

[permalink] [raw]
Subject: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

(Applies to kvm.git/master:25deed73)

Please see the header for 2/2 for a description. This patch series has
been fully tested and appears to be working correctly. I have it as an RFC
for now because it needs Davide's official submission/SOB for patch 1/2, and
it should get some eyeballs/acks on my SRCU usage before going in.

I will submit the updated irqfd userspace which eschews the deassign() verb
since we can now just use the close(fd) method alone. I will also address
the userspace review comments from Avi.

---

Davide Libenzi (1):
eventfd: send POLLHUP on f_ops->release

Gregory Haskins (1):
kvm: use POLLHUP to close an irqfd instead of an explicit ioctl


fs/eventfd.c | 10 +++
include/linux/kvm.h | 2 -
virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
virt/kvm/kvm_main.c | 3 +
4 files changed, 90 insertions(+), 102 deletions(-)

--
Signature


2009-06-02 15:16:04

by Gregory Haskins

[permalink] [raw]
Subject: [KVM-RFC PATCH 1/2] eventfd: send POLLHUP on f_ops->release

From: Davide Libenzi <[email protected]>

Return-path: <[email protected]>
Received: from novell.com ([::ffff:130.57.1.153])
by soto.provo.novell.com with ESMTP; Wed, 27 May 2009 13:35:04 -0600
Received: from x35.xmailserver.org not authenticated [64.71.152.41]
by novell.com with M+ Extreme Email Engine 2008.3.release
via secured & encrypted transport (TLS);
Wed, 27 May 2009 13:35:04 -0600
X-MailFrom: [email protected]
X-AuthUser: [email protected]
Received: from makko.or.mcafeemobile.com
by x35.xmailserver.org with [XMail 1.26 ESMTP Server]
id <S2EBEB0> for <[email protected]> from <[email protected]>;
Wed, 27 May 2009 15:34:01 -0400
Date: Wed, 27 May 2009 12:28:57 -0700 (PDT)
From: Davide Libenzi <[email protected]>
X-X-Sender: [email protected]
To: "Michael S. Tsirkin" <[email protected]>
cc: Gregory Haskins <[email protected]>, [email protected],
Linux Kernel Mailing List <[email protected]>, [email protected],
[email protected]
Subject: Re: [KVM PATCH v10] kvm: add support for irqfd
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]> <[email protected]> <[email protected]>
User-Agent: Alpine 1.10 (DEB 962 2008-03-14)
X-GPG-FINGRPRINT: CFAE 5BEE FD36 F65E E640 56FE 0974 BF23 270F 474E
X-GPG-PUBLIC_KEY: http://www.xmailserver.org/davidel.asc
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Wed, 27 May 2009, Michael S. Tsirkin wrote:

> On Wed, May 27, 2009 at 10:06:50AM -0400, Gregory Haskins wrote:
> > Michael S. Tsirkin wrote:
> > > On Wed, May 20, 2009 at 10:30:49AM -0400, Gregory Haskins wrote:
> > >
> > >> +static int
> > >> +kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> > >> +{
> > >> + struct _irqfd *irqfd;
> > >> + struct file *file = NULL;
> > >> + int ret;
> > >> +
> > >> + irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> > >> + if (!irqfd)
> > >> + return -ENOMEM;
> > >> +
> > >> + irqfd->kvm = kvm;
> > >> + irqfd->gsi = gsi;
> > >> + INIT_LIST_HEAD(&irqfd->list);
> > >> + INIT_WORK(&irqfd->work, irqfd_inject);
> > >> +
> > >> + /*
> > >> + * Embed the file* lifetime in the irqfd.
> > >> + */
> > >> + file = fget(fd);
> > >> + if (IS_ERR(file)) {
> > >> + ret = PTR_ERR(file);
> > >> + goto fail;
> > >> + }
> > >>
> > >
> > > So we get a reference to a file, and unless the user is nice to us, it
> > > will only be dropped when kvm char device file is closed?
> > > I think this will deadlock if the fd in question is the open kvm char device.
> > >
> > >
> > >
> > Hmm...I hadn't considered this possibility, though I am not sure if it
> > would cause a deadlock in the pattern you suggest. It seems more like
> > it would result in, at worst, an extra reference to itself (and thus a
> > leak) rather than a deadlock...
> >
> > I digress. In either case, perhaps I should s/fget/eventfd_fget to at
> > least limit the type of fd to eventfd. I was trying to be "slick" by
> > not needing the eventfd_fget() exported, but I am going to need to
> > export it later anyway for iosignalfd, so its probably a moot point.
> >
> > Thanks Michael,
> > -Greg
> >
>
> This only works as long as eventfd does not do fget on some fd as well.
> Which it does not do now, and may never do - but we create a fragile
> system this way.
>
> I think it's really wrong, fundamentally, to keep a reference to a
> file until another file is closed, unless you are code under fs/.
> We will get nasty circular references sooner or later.
>
> Isn't the real reason we use fd to be able to support the same interface
> on top of both kvm and lguest?
> And if so, wouldn't some kind of bus be a better solution?

Another solution, that I proposed in the past, is having irqfd hold no
references to the eventfd. It's just register (holding an eventfd-get())
for events (in the way that currently does), notice the POLLHUP, unchain
from it, and propagate the eventfd-close event to whatever the irqfd logic
is supposed to.



- Davide

Signed-off-by: Gregory Haskins <[email protected]>
---

fs/eventfd.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3f0e197..72f5f8d 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -61,7 +61,15 @@ EXPORT_SYMBOL_GPL(eventfd_signal);

static int eventfd_release(struct inode *inode, struct file *file)
{
- kfree(file->private_data);
+ struct eventfd_ctx *ctx = file->private_data;
+
+ /*
+ * No need to hold the lock here, since we are on the file cleanup
+ * path and the ones still attached to the wait queue will be
+ * serialized by wake_up_locked_poll().
+ */
+ wake_up_locked_poll(&ctx->wqh, POLLHUP);
+ kfree(ctx);
return 0;
}

2009-06-02 15:16:21

by Gregory Haskins

[permalink] [raw]
Subject: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

Assigning an irqfd object to a kvm object creates a relationship that we
currently manage by having the kvm oject acquire/hold a file* reference to
the underlying eventfd. The lifetime of these objects is properly maintained
by decoupling the two objects whenever the irqfd is closed or kvm is closed,
whichever comes first.

However, the irqfd "close" method is less than ideal since it requires two
system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
close(eventfd)). This dual-call approach was utilized because there was no
notification mechanism on the eventfd side at the time irqfd was implemented.

Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
eventfd is about to close. So we eliminate the IRQFD_DEASSIGN ioctl (*)
vector in favor of sensing the desassign automatically when the fd is closed.
The resulting code is slightly more complex as a result since we need to
allow either side to sever the relationship independently. We utilize SRCU
to guarantee stable concurrent access to the KVM pointer without adding
additional atomic operations in the fast path.

At minimum, this design should be acked by both Davide and Paul (cc'd).

(*) The irqfd patch does not exist in any released tree, so the understanding
is that we can alter the irqfd specific ABI without taking the normal
precautions, such as CAP bits.

Signed-off-by: Gregory Haskins <[email protected]>
CC: Davide Libenzi <[email protected]>
CC: Michael S. Tsirkin <[email protected]>
CC: Paul E. McKenney <[email protected]>
---

include/linux/kvm.h | 2 -
virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
virt/kvm/kvm_main.c | 3 +
3 files changed, 81 insertions(+), 101 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 632a856..29b62cc 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -482,8 +482,6 @@ struct kvm_x86_mce {
};
#endif

-#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
-
struct kvm_irqfd {
__u32 fd;
__u32 gsi;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index f3f2ea1..6ed62e2 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -37,26 +37,63 @@
* --------------------------------------------------------------------
*/
struct _irqfd {
+ struct mutex lock;
+ struct srcu_struct srcu;
struct kvm *kvm;
int gsi;
- struct file *file;
struct list_head list;
poll_table pt;
wait_queue_head_t *wqh;
wait_queue_t wait;
- struct work_struct work;
+ struct work_struct inject;
};

static void
irqfd_inject(struct work_struct *work)
{
- struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
- struct kvm *kvm = irqfd->kvm;
+ struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+ struct kvm *kvm;
+ int idx;
+
+ idx = srcu_read_lock(&irqfd->srcu);
+
+ kvm = rcu_dereference(irqfd->kvm);
+ if (kvm) {
+ mutex_lock(&kvm->lock);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+ mutex_unlock(&kvm->lock);
+ }
+
+ srcu_read_unlock(&irqfd->srcu, idx);
+}
+
+static void
+irqfd_disconnect(struct _irqfd *irqfd)
+{
+ struct kvm *kvm;
+
+ mutex_lock(&irqfd->lock);
+
+ kvm = rcu_dereference(irqfd->kvm);
+ rcu_assign_pointer(irqfd->kvm, NULL);
+
+ mutex_unlock(&irqfd->lock);
+
+ if (!kvm)
+ return;

mutex_lock(&kvm->lock);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+ list_del(&irqfd->list);
mutex_unlock(&kvm->lock);
+
+ /*
+ * It is important to not drop the kvm reference until the next grace
+ * period because there might be lockless references in flight up
+ * until then
+ */
+ synchronize_srcu(&irqfd->srcu);
+ kvm_put_kvm(kvm);
}

static int
@@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);

- /*
- * The wake_up is called with interrupts disabled. Therefore we need
- * to defer the IRQ injection until later since we need to acquire the
- * kvm->lock to do so.
- */
- schedule_work(&irqfd->work);
+ switch ((unsigned long)key) {
+ case POLLIN:
+ /*
+ * The POLLIN wake_up is called with interrupts disabled.
+ * Therefore we need to defer the IRQ injection until later
+ * since we need to acquire the kvm->lock to do so.
+ */
+ schedule_work(&irqfd->inject);
+ break;
+ case POLLHUP:
+ /*
+ * The POLLHUP is called unlocked, so it theoretically should
+ * be safe to remove ourselves from the wqh
+ */
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ flush_work(&irqfd->inject);
+ irqfd_disconnect(irqfd);
+
+ cleanup_srcu_struct(&irqfd->srcu);
+ kfree(irqfd);
+ break;
+ }

return 0;
}
@@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
add_wait_queue(wqh, &irqfd->wait);
}

-static int
-kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
{
struct _irqfd *irqfd;
struct file *file = NULL;
@@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
if (!irqfd)
return -ENOMEM;

+ mutex_init(&irqfd->lock);
+ init_srcu_struct(&irqfd->srcu);
irqfd->kvm = kvm;
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
- INIT_WORK(&irqfd->work, irqfd_inject);
+ INIT_WORK(&irqfd->inject, irqfd_inject);

/*
* Embed the file* lifetime in the irqfd.
@@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
if (ret < 0)
goto fail;

- irqfd->file = file;
+ kvm_get_kvm(kvm);

mutex_lock(&kvm->lock);
list_add_tail(&irqfd->list, &kvm->irqfds);
mutex_unlock(&kvm->lock);

+ /*
+ * do not drop the file until the irqfd is fully initialized, otherwise
+ * we might race against the POLLHUP
+ */
+ fput(file);
+
return 0;

fail:
@@ -139,97 +200,17 @@ fail:
return ret;
}

-static void
-irqfd_release(struct _irqfd *irqfd)
-{
- /*
- * The ordering is important. We must remove ourselves from the wqh
- * first to ensure no more event callbacks are issued, and then flush
- * any previously scheduled work prior to freeing the memory
- */
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
-
- flush_work(&irqfd->work);
-
- fput(irqfd->file);
- kfree(irqfd);
-}
-
-static struct _irqfd *
-irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
-{
- struct _irqfd *irqfd;
-
- mutex_lock(&kvm->lock);
-
- /*
- * linear search isn't brilliant, but this should be an infrequent
- * slow-path operation, and the list should not grow very large
- */
- list_for_each_entry(irqfd, &kvm->irqfds, list) {
- if (irqfd->file != file || irqfd->gsi != gsi)
- continue;
-
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
-
- return irqfd;
- }
-
- mutex_unlock(&kvm->lock);
-
- return NULL;
-}
-
-static int
-kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
-{
- struct _irqfd *irqfd;
- struct file *file;
- int count = 0;
-
- file = fget(fd);
- if (IS_ERR(file))
- return PTR_ERR(file);
-
- while ((irqfd = irqfd_remove(kvm, file, gsi))) {
- /*
- * We remove the item from the list under the lock, but we
- * free it outside the lock to avoid deadlocking with the
- * flush_work and the work_item taking the lock
- */
- irqfd_release(irqfd);
- count++;
- }
-
- fput(file);
-
- return count ? count : -ENOENT;
-}
-
void
kvm_irqfd_init(struct kvm *kvm)
{
INIT_LIST_HEAD(&kvm->irqfds);
}

-int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
-{
- if (flags & KVM_IRQFD_FLAG_DEASSIGN)
- return kvm_deassign_irqfd(kvm, fd, gsi);
-
- return kvm_assign_irqfd(kvm, fd, gsi);
-}
-
void
kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;

- /* don't bother with the lock..we are shutting down */
- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
- list_del(&irqfd->list);
- irqfd_release(irqfd);
- }
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
+ irqfd_disconnect(irqfd);
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 902fed9..a9f62bb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
spin_lock(&kvm_lock);
list_del(&kvm->vm_list);
spin_unlock(&kvm_lock);
- kvm_irqfd_release(kvm);
kvm_free_irq_routing(kvm);
kvm_io_bus_destroy(&kvm->pio_bus);
kvm_io_bus_destroy(&kvm->mmio_bus);
@@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
{
struct kvm *kvm = filp->private_data;

+ kvm_irqfd_release(kvm);
+
kvm_put_kvm(kvm);
return 0;
}

2009-06-02 16:05:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
> (Applies to kvm.git/master:25deed73)
>
> Please see the header for 2/2 for a description. This patch series has
> been fully tested and appears to be working correctly. I have it as an RFC
> for now because it needs Davide's official submission/SOB for patch 1/2, and
> it should get some eyeballs/acks on my SRCU usage before going in.
>
> I will submit the updated irqfd userspace which eschews the deassign() verb
> since we can now just use the close(fd) method alone. I will also address
> the userspace review comments from Avi.


We are not killing the deassign though, do we?
It's good to have that option e.g. for when we pass
the fd to another process.

--
MST

2009-06-02 16:14:33

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
>
>> (Applies to kvm.git/master:25deed73)
>>
>> Please see the header for 2/2 for a description. This patch series has
>> been fully tested and appears to be working correctly. I have it as an RFC
>> for now because it needs Davide's official submission/SOB for patch 1/2, and
>> it should get some eyeballs/acks on my SRCU usage before going in.
>>
>> I will submit the updated irqfd userspace which eschews the deassign() verb
>> since we can now just use the close(fd) method alone. I will also address
>> the userspace review comments from Avi.
>>
>
>
> We are not killing the deassign though, do we?
>

Yes, it is not needed any more now that we have proper
release-notification from eventfd.

> It's good to have that option e.g. for when we pass
> the fd to another process.
>

Passing the fd to another app should up the underlying file reference
count. If the producer app wants to "deassign" it simply calls
close(fd) (as opposed to today where it calls DEASSIGN+close), but the
reference count will allow the consuming app to leave the eventfd's file
open. Or am I misunderstanding you?

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-02 16:21:24

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

On Tue, Jun 02, 2009 at 12:14:15PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
> >
> >> (Applies to kvm.git/master:25deed73)
> >>
> >> Please see the header for 2/2 for a description. This patch series has
> >> been fully tested and appears to be working correctly. I have it as an RFC
> >> for now because it needs Davide's official submission/SOB for patch 1/2, and
> >> it should get some eyeballs/acks on my SRCU usage before going in.
> >>
> >> I will submit the updated irqfd userspace which eschews the deassign() verb
> >> since we can now just use the close(fd) method alone. I will also address
> >> the userspace review comments from Avi.
> >>
> >
> >
> > We are not killing the deassign though, do we?
> >
>
> Yes, it is not needed any more now that we have proper
> release-notification from eventfd.
>
> > It's good to have that option e.g. for when we pass
> > the fd to another process.
> >
>
> Passing the fd to another app should up the underlying file reference
> count. If the producer app wants to "deassign" it simply calls
> close(fd) (as opposed to today where it calls DEASSIGN+close), but the
> reference count will allow the consuming app to leave the eventfd's file
> open. Or am I misunderstanding you?
>
> -Greg
>
>

I think we want to keep supporting the deassign ioctl. This, even though
close overlaps with it functionally somewhat.

This allows qemu to pass eventfd to another process/device, and then
block/unblock interrupts as seen by that process by
assigning/deassigning irq to it. This is much easier and lightweight
than asking another process to close the fd and passing another fd
later.

--
MST

2009-06-02 16:34:34

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 12:14:15PM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> (Applies to kvm.git/master:25deed73)
>>>>
>>>> Please see the header for 2/2 for a description. This patch series has
>>>> been fully tested and appears to be working correctly. I have it as an RFC
>>>> for now because it needs Davide's official submission/SOB for patch 1/2, and
>>>> it should get some eyeballs/acks on my SRCU usage before going in.
>>>>
>>>> I will submit the updated irqfd userspace which eschews the deassign() verb
>>>> since we can now just use the close(fd) method alone. I will also address
>>>> the userspace review comments from Avi.
>>>>
>>>>
>>> We are not killing the deassign though, do we?
>>>
>>>
>> Yes, it is not needed any more now that we have proper
>> release-notification from eventfd.
>>
>>
>>> It's good to have that option e.g. for when we pass
>>> the fd to another process.
>>>
>>>
>> Passing the fd to another app should up the underlying file reference
>> count. If the producer app wants to "deassign" it simply calls
>> close(fd) (as opposed to today where it calls DEASSIGN+close), but the
>> reference count will allow the consuming app to leave the eventfd's file
>> open. Or am I misunderstanding you?
>>
>> -Greg
>>
>>
>>
>
> I think we want to keep supporting the deassign ioctl. This, even though
> close overlaps with it functionally somewhat.
>
> This allows qemu to pass eventfd to another process/device, and then
> block/unblock interrupts as seen by that process by
> assigning/deassigning irq to it. This is much easier and lightweight
> than asking another process to close the fd and passing another fd
> later.
>
>
Perhaps, but if that is the case we should just ignore this series and
continue with the DEASSIGN+close methodology since it already provides
that separation. Trying to do a hybrid is just messy.

But in any case, I think that approach is flawed. DEASSIGN shouldn't be
used as a mask in my opinion, and we shouldn't be reassigning a
channel's meaning under the covers like that. If this is in fact a
valid use case, we should have a separate "GSI_MASK" type operation that
is independent of irqfd. Likewise, we really should pass a new fd if
the gsi-routing is changing. Today there is a tight coupling of
fd-to-gsi, and I think that makes sense to continue this association.

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-02 17:00:58

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

On Tue, Jun 02, 2009 at 12:34:12PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Tue, Jun 02, 2009 at 12:14:15PM -0400, Gregory Haskins wrote:
> >
> >> Michael S. Tsirkin wrote:
> >>
> >>> On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
> >>>
> >>>
> >>>> (Applies to kvm.git/master:25deed73)
> >>>>
> >>>> Please see the header for 2/2 for a description. This patch series has
> >>>> been fully tested and appears to be working correctly. I have it as an RFC
> >>>> for now because it needs Davide's official submission/SOB for patch 1/2, and
> >>>> it should get some eyeballs/acks on my SRCU usage before going in.
> >>>>
> >>>> I will submit the updated irqfd userspace which eschews the deassign() verb
> >>>> since we can now just use the close(fd) method alone. I will also address
> >>>> the userspace review comments from Avi.
> >>>>
> >>>>
> >>> We are not killing the deassign though, do we?
> >>>
> >>>
> >> Yes, it is not needed any more now that we have proper
> >> release-notification from eventfd.
> >>
> >>
> >>> It's good to have that option e.g. for when we pass
> >>> the fd to another process.
> >>>
> >>>
> >> Passing the fd to another app should up the underlying file reference
> >> count. If the producer app wants to "deassign" it simply calls
> >> close(fd) (as opposed to today where it calls DEASSIGN+close), but the
> >> reference count will allow the consuming app to leave the eventfd's file
> >> open. Or am I misunderstanding you?
> >>
> >> -Greg
> >>
> >>
> >>
> >
> > I think we want to keep supporting the deassign ioctl. This, even though
> > close overlaps with it functionally somewhat.
> >
> > This allows qemu to pass eventfd to another process/device, and then
> > block/unblock interrupts as seen by that process by
> > assigning/deassigning irq to it. This is much easier and lightweight
> > than asking another process to close the fd and passing another fd
> > later.
> >
> >
> Perhaps, but if that is the case we should just ignore this series and
> continue with the DEASSIGN+close methodology since it already provides
> that separation. Trying to do a hybrid is just messy.

As I see it, it's the least evil.

One-way ioctl operations on file descriptors are messier still. What's
another example of an ioctl that can't be undone without closing the fd?
And having close not clean up the state unless you do an ioctl first is
very messy IMO - I don't think you'll find any such examples in kernel.

> But in any case, I think that approach is flawed. DEASSIGN shouldn't be
> used as a mask in my opinion, and we shouldn't be reassigning a
> channel's meaning under the covers like that. If this is in fact a
> valid use case, we should have a separate "GSI_MASK" type operation that
> is independent of irqfd.
> Likewise, we really should pass a new fd if
> the gsi-routing is changing. Today there is a tight coupling of
> fd-to-gsi, and I think that makes sense to continue this association.
>
> -Greg
>

I'm not arguing that this use-case is not theoretical. Just that if you
don't create the fd to connect to GSI, you shouln't ask the user to
destroy it to disconnect. Who knows what else this eventfd descriptor
can be used for?

--
MST

2009-06-02 17:03:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

On Tue, Jun 02, 2009 at 07:59:49PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 12:34:12PM -0400, Gregory Haskins wrote:
> > Michael S. Tsirkin wrote:
> > > On Tue, Jun 02, 2009 at 12:14:15PM -0400, Gregory Haskins wrote:
> > >
> > >> Michael S. Tsirkin wrote:
> > >>
> > >>> On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
> > >>>
> > >>>
> > >>>> (Applies to kvm.git/master:25deed73)
> > >>>>
> > >>>> Please see the header for 2/2 for a description. This patch series has
> > >>>> been fully tested and appears to be working correctly. I have it as an RFC
> > >>>> for now because it needs Davide's official submission/SOB for patch 1/2, and
> > >>>> it should get some eyeballs/acks on my SRCU usage before going in.
> > >>>>
> > >>>> I will submit the updated irqfd userspace which eschews the deassign() verb
> > >>>> since we can now just use the close(fd) method alone. I will also address
> > >>>> the userspace review comments from Avi.
> > >>>>
> > >>>>
> > >>> We are not killing the deassign though, do we?
> > >>>
> > >>>
> > >> Yes, it is not needed any more now that we have proper
> > >> release-notification from eventfd.
> > >>
> > >>
> > >>> It's good to have that option e.g. for when we pass
> > >>> the fd to another process.
> > >>>
> > >>>
> > >> Passing the fd to another app should up the underlying file reference
> > >> count. If the producer app wants to "deassign" it simply calls
> > >> close(fd) (as opposed to today where it calls DEASSIGN+close), but the
> > >> reference count will allow the consuming app to leave the eventfd's file
> > >> open. Or am I misunderstanding you?
> > >>
> > >> -Greg
> > >>
> > >>
> > >>
> > >
> > > I think we want to keep supporting the deassign ioctl. This, even though
> > > close overlaps with it functionally somewhat.
> > >
> > > This allows qemu to pass eventfd to another process/device, and then
> > > block/unblock interrupts as seen by that process by
> > > assigning/deassigning irq to it. This is much easier and lightweight
> > > than asking another process to close the fd and passing another fd
> > > later.
> > >
> > >
> > Perhaps, but if that is the case we should just ignore this series and
> > continue with the DEASSIGN+close methodology since it already provides
> > that separation. Trying to do a hybrid is just messy.
>
> As I see it, it's the least evil.
>
> One-way ioctl operations on file descriptors are messier still. What's
> another example of an ioctl that can't be undone without closing the fd?
> And having close not clean up the state unless you do an ioctl first is
> very messy IMO - I don't think you'll find any such examples in kernel.
>
> > But in any case, I think that approach is flawed. DEASSIGN shouldn't be
> > used as a mask in my opinion, and we shouldn't be reassigning a
> > channel's meaning under the covers like that. If this is in fact a
> > valid use case, we should have a separate "GSI_MASK" type operation that
> > is independent of irqfd.
> > Likewise, we really should pass a new fd if
> > the gsi-routing is changing. Today there is a tight coupling of
> > fd-to-gsi, and I think that makes sense to continue this association.
> >
> > -Greg
> >
>
> I'm not arguing that this use-case is not theoretical. Just that if you
> don't create the fd to connect to GSI, you shouln't ask the user to
> destroy it to disconnect. Who knows what else this eventfd descriptor
> can be used for?

As a follow-up, here's another example: imagine an application that
handles interrupt events from a thread by blocking on eventfd. To wake
up this thread, we could reuse the same eventfd just by writing there. I
might want to do this even after I don't get any interrupts anymore.

--
MST

2009-06-02 17:22:20

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

On Tue, 2 Jun 2009, Gregory Haskins wrote:

> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> {
> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>
> - /*
> - * The wake_up is called with interrupts disabled. Therefore we need
> - * to defer the IRQ injection until later since we need to acquire the
> - * kvm->lock to do so.
> - */
> - schedule_work(&irqfd->work);
> + switch ((unsigned long)key) {
> + case POLLIN:
> + /*
> + * The POLLIN wake_up is called with interrupts disabled.
> + * Therefore we need to defer the IRQ injection until later
> + * since we need to acquire the kvm->lock to do so.
> + */
> + schedule_work(&irqfd->inject);
> + break;
> + case POLLHUP:
> + /*
> + * The POLLHUP is called unlocked, so it theoretically should
> + * be safe to remove ourselves from the wqh
> + */
> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> + flush_work(&irqfd->inject);
> + irqfd_disconnect(irqfd);
> +
> + cleanup_srcu_struct(&irqfd->srcu);
> + kfree(irqfd);
> + break;
> + }

Since "key" is an event *bitmap*, you better be treating it as such. Do
not assume what eventfd delivers today (event bitmaps with only one bit
set). So this better be like:

if (key & POLLIN) {
...
}
if (key & POLLHUP) {
...
}


- Davide

2009-06-02 17:41:28

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 12:34:12PM -0400, Gregory Haskins wrote:
>
>> Michael S. Tsirkin wrote:
>>
>>> On Tue, Jun 02, 2009 at 12:14:15PM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> Michael S. Tsirkin wrote:
>>>>
>>>>
>>>>> On Tue, Jun 02, 2009 at 11:15:28AM -0400, Gregory Haskins wrote:
>>>>>
>>>>>
>>>>>
>>>>>> (Applies to kvm.git/master:25deed73)
>>>>>>
>>>>>> Please see the header for 2/2 for a description. This patch series has
>>>>>> been fully tested and appears to be working correctly. I have it as an RFC
>>>>>> for now because it needs Davide's official submission/SOB for patch 1/2, and
>>>>>> it should get some eyeballs/acks on my SRCU usage before going in.
>>>>>>
>>>>>> I will submit the updated irqfd userspace which eschews the deassign() verb
>>>>>> since we can now just use the close(fd) method alone. I will also address
>>>>>> the userspace review comments from Avi.
>>>>>>
>>>>>>
>>>>>>
>>>>> We are not killing the deassign though, do we?
>>>>>
>>>>>
>>>>>
>>>> Yes, it is not needed any more now that we have proper
>>>> release-notification from eventfd.
>>>>
>>>>
>>>>
>>>>> It's good to have that option e.g. for when we pass
>>>>> the fd to another process.
>>>>>
>>>>>
>>>>>
>>>> Passing the fd to another app should up the underlying file reference
>>>> count. If the producer app wants to "deassign" it simply calls
>>>> close(fd) (as opposed to today where it calls DEASSIGN+close), but the
>>>> reference count will allow the consuming app to leave the eventfd's file
>>>> open. Or am I misunderstanding you?
>>>>
>>>> -Greg
>>>>
>>>>
>>>>
>>>>
>>> I think we want to keep supporting the deassign ioctl. This, even though
>>> close overlaps with it functionally somewhat.
>>>
>>> This allows qemu to pass eventfd to another process/device, and then
>>> block/unblock interrupts as seen by that process by
>>> assigning/deassigning irq to it. This is much easier and lightweight
>>> than asking another process to close the fd and passing another fd
>>> later.
>>>
>>>
>>>
>> Perhaps, but if that is the case we should just ignore this series and
>> continue with the DEASSIGN+close methodology since it already provides
>> that separation. Trying to do a hybrid is just messy.
>>
>
> As I see it, it's the least evil.
>

Which? Leaving the code as is, or a hybrid?
> One-way ioctl operations on file descriptors are messier still. What's
> another example of an ioctl that can't be undone without closing the fd?
>
-ENOPARSE

> And having close not clean up the state unless you do an ioctl first is
> very messy IMO - I don't think you'll find any such examples in kernel.
>
>

I agree, and that is why I am advocating this POLLHUP solution. It was
only this other way to begin with because the technology didn't exist
until Davide showed me the light.

Problem with your request is that I already looked into what is
essentially a bi-directional reference problem (for a different reason)
when I started the POLLHUP series. Its messy to do this in a way that
doesn't negatively impact the fast path (introducing locking, etc) or
make my head explode making sure it doesn't race. Afaict, we would need
to solve this problem to do what you are proposing (patches welcome).

If this hybrid decoupled-deassign + unified-close is indeed an important
feature set, I suggest that we still consider this POLLHUP series for
inclusion, and then someone can re-introduce DEASSIGN support in the
future as a CAP bit extension. That way we at least get the desirable
close() properties that we both seem in favor of, and get this advanced
use case when we need it (and can figure out the locking design).


>> But in any case, I think that approach is flawed. DEASSIGN shouldn't be
>> used as a mask in my opinion, and we shouldn't be reassigning a
>> channel's meaning under the covers like that. If this is in fact a
>> valid use case, we should have a separate "GSI_MASK" type operation that
>> is independent of irqfd.
>> Likewise, we really should pass a new fd if
>> the gsi-routing is changing. Today there is a tight coupling of
>> fd-to-gsi, and I think that makes sense to continue this association.
>>
>> -Greg
>>
>>
>
> I'm not arguing that this use-case is not theoretical. Just that if you
> don't create the fd to connect to GSI, you shouln't ask the user to
> destroy it to disconnect.

Well, thats just it. Today, you *do* create the eventfd to bundle with
the gsi (take a look at my userspace patches..I posted some new ones
today). The eventfd is returned after you specify the GSI via
kvm_irqfd(). Thats why I am arguing that it is natural for close() to
terminate the assignment. To me, this is consistent with other
interfaces that return an fd (socket(), open(), etc).

That said, if we are going to support your proposal going forward, we
should probably change libkvm::kvm_irqfd() to take the fd as a
parameter, instead of returning it.


> Who knows what else this eventfd descriptor
> can be used for?
>

Perhaps, but you are exceeding the original design specifications of
irqfd as it is, so we can't really predict what limitations it will have
for other esoteric uses either. While I think the masking use case is
bogus, I guess I don't specifically object to what you are proposing
w.r.t. gsi remapping. Fwiw, this use-case you are presenting would have
been more useful if proposed during the 10+ revisions of design review
on irqfd instead of now, but se la vie.

-Greg



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-02 17:42:20

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

Davide Libenzi wrote:
> On Tue, 2 Jun 2009, Gregory Haskins wrote:
>
>
>> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> {
>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>
>> - /*
>> - * The wake_up is called with interrupts disabled. Therefore we need
>> - * to defer the IRQ injection until later since we need to acquire the
>> - * kvm->lock to do so.
>> - */
>> - schedule_work(&irqfd->work);
>> + switch ((unsigned long)key) {
>> + case POLLIN:
>> + /*
>> + * The POLLIN wake_up is called with interrupts disabled.
>> + * Therefore we need to defer the IRQ injection until later
>> + * since we need to acquire the kvm->lock to do so.
>> + */
>> + schedule_work(&irqfd->inject);
>> + break;
>> + case POLLHUP:
>> + /*
>> + * The POLLHUP is called unlocked, so it theoretically should
>> + * be safe to remove ourselves from the wqh
>> + */
>> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> + flush_work(&irqfd->inject);
>> + irqfd_disconnect(irqfd);
>> +
>> + cleanup_srcu_struct(&irqfd->srcu);
>> + kfree(irqfd);
>> + break;
>> + }
>>
>
> Since "key" is an event *bitmap*, you better be treating it as such. Do
> not assume what eventfd delivers today (event bitmaps with only one bit
> set). So this better be like:
>
> if (key & POLLIN) {
> ...
> }
> if (key & POLLHUP) {
> ...
> }
>
>
> - Davide
>
>
>
Indeed. Thanks for catching that!

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-02 18:03:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

On Tue, Jun 02, 2009 at 11:15:38AM -0400, Gregory Haskins wrote:
> Assigning an irqfd object to a kvm object creates a relationship that we
> currently manage by having the kvm oject acquire/hold a file* reference to
> the underlying eventfd. The lifetime of these objects is properly maintained
> by decoupling the two objects whenever the irqfd is closed or kvm is closed,
> whichever comes first.
>
> However, the irqfd "close" method is less than ideal since it requires two
> system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
> close(eventfd)). This dual-call approach was utilized because there was no
> notification mechanism on the eventfd side at the time irqfd was implemented.
>
> Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
> eventfd is about to close. So we eliminate the IRQFD_DEASSIGN ioctl (*)
> vector in favor of sensing the desassign automatically when the fd is closed.
> The resulting code is slightly more complex as a result since we need to
> allow either side to sever the relationship independently. We utilize SRCU
> to guarantee stable concurrent access to the KVM pointer without adding
> additional atomic operations in the fast path.
>
> At minimum, this design should be acked by both Davide and Paul (cc'd).
>
> (*) The irqfd patch does not exist in any released tree, so the understanding
> is that we can alter the irqfd specific ABI without taking the normal
> precautions, such as CAP bits.

A few questions and suggestions interspersed below.

Thanx, Paul

> Signed-off-by: Gregory Haskins <[email protected]>
> CC: Davide Libenzi <[email protected]>
> CC: Michael S. Tsirkin <[email protected]>
> CC: Paul E. McKenney <[email protected]>
> ---
>
> include/linux/kvm.h | 2 -
> virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
> virt/kvm/kvm_main.c | 3 +
> 3 files changed, 81 insertions(+), 101 deletions(-)
>
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 632a856..29b62cc 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -482,8 +482,6 @@ struct kvm_x86_mce {
> };
> #endif
>
> -#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> -
> struct kvm_irqfd {
> __u32 fd;
> __u32 gsi;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index f3f2ea1..6ed62e2 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -37,26 +37,63 @@
> * --------------------------------------------------------------------
> */
> struct _irqfd {
> + struct mutex lock;
> + struct srcu_struct srcu;
> struct kvm *kvm;
> int gsi;
> - struct file *file;
> struct list_head list;
> poll_table pt;
> wait_queue_head_t *wqh;
> wait_queue_t wait;
> - struct work_struct work;
> + struct work_struct inject;
> };
>
> static void
> irqfd_inject(struct work_struct *work)
> {
> - struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
> - struct kvm *kvm = irqfd->kvm;
> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> + struct kvm *kvm;
> + int idx;
> +
> + idx = srcu_read_lock(&irqfd->srcu);
> +
> + kvm = rcu_dereference(irqfd->kvm);
> + if (kvm) {
> + mutex_lock(&kvm->lock);
> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> + mutex_unlock(&kvm->lock);
> + }
> +
> + srcu_read_unlock(&irqfd->srcu, idx);
> +}
> +
> +static void
> +irqfd_disconnect(struct _irqfd *irqfd)
> +{
> + struct kvm *kvm;
> +
> + mutex_lock(&irqfd->lock);
> +
> + kvm = rcu_dereference(irqfd->kvm);
> + rcu_assign_pointer(irqfd->kvm, NULL);
> +
> + mutex_unlock(&irqfd->lock);
> +
> + if (!kvm)
> + return;
>
> mutex_lock(&kvm->lock);
> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> + list_del(&irqfd->list);
> mutex_unlock(&kvm->lock);
> +
> + /*
> + * It is important to not drop the kvm reference until the next grace
> + * period because there might be lockless references in flight up
> + * until then
> + */

The lockless references are all harmless even if carried out after the
kvm structure has been removed? Or does there need to be a ->deleted
field that allows the lockless references to ignore a kvm structure that
has already been deleted?

On the other hand, if it really is somehow OK for kvm_set_irq() to be
called on an already-deleted kvm structure, then this code would be OK
as is.

> + synchronize_srcu(&irqfd->srcu);
> + kvm_put_kvm(kvm);
> }
>
> static int
> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> {
> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>
> - /*
> - * The wake_up is called with interrupts disabled. Therefore we need
> - * to defer the IRQ injection until later since we need to acquire the
> - * kvm->lock to do so.
> - */
> - schedule_work(&irqfd->work);
> + switch ((unsigned long)key) {
> + case POLLIN:
> + /*
> + * The POLLIN wake_up is called with interrupts disabled.
> + * Therefore we need to defer the IRQ injection until later
> + * since we need to acquire the kvm->lock to do so.
> + */
> + schedule_work(&irqfd->inject);
> + break;
> + case POLLHUP:
> + /*
> + * The POLLHUP is called unlocked, so it theoretically should
> + * be safe to remove ourselves from the wqh
> + */
> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> + flush_work(&irqfd->inject);
> + irqfd_disconnect(irqfd);

Good. The fact that irqfd_disconnect() does a synchronize_srcu()
prevents any readers from trying to do an SRCU operation on an already
cleaned-up srcu_struct, so this does look safe to me.

> + cleanup_srcu_struct(&irqfd->srcu);
> + kfree(irqfd);
> + break;
> + }
>
> return 0;
> }
> @@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> add_wait_queue(wqh, &irqfd->wait);
> }
>
> -static int
> -kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> +int
> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> {
> struct _irqfd *irqfd;
> struct file *file = NULL;
> @@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> if (!irqfd)
> return -ENOMEM;
>
> + mutex_init(&irqfd->lock);
> + init_srcu_struct(&irqfd->srcu);
> irqfd->kvm = kvm;
> irqfd->gsi = gsi;
> INIT_LIST_HEAD(&irqfd->list);
> - INIT_WORK(&irqfd->work, irqfd_inject);
> + INIT_WORK(&irqfd->inject, irqfd_inject);
>
> /*
> * Embed the file* lifetime in the irqfd.
> @@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> if (ret < 0)
> goto fail;
>
> - irqfd->file = file;
> + kvm_get_kvm(kvm);
>
> mutex_lock(&kvm->lock);
> list_add_tail(&irqfd->list, &kvm->irqfds);

Doesn't the above need to be list_add_tail_rcu()? Unless I am confused,
this is the point at which the new SRCU-protected structure is first
made public. If so, you really do need list_add_tail_rcu() to make sure
that concurrent readers don't see pre-initialized values in the structure.

> mutex_unlock(&kvm->lock);
>
> + /*
> + * do not drop the file until the irqfd is fully initialized, otherwise
> + * we might race against the POLLHUP
> + */
> + fput(file);
> +
> return 0;
>
> fail:
> @@ -139,97 +200,17 @@ fail:
> return ret;
> }
>
> -static void
> -irqfd_release(struct _irqfd *irqfd)
> -{
> - /*
> - * The ordering is important. We must remove ourselves from the wqh
> - * first to ensure no more event callbacks are issued, and then flush
> - * any previously scheduled work prior to freeing the memory
> - */
> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
> -
> - flush_work(&irqfd->work);
> -
> - fput(irqfd->file);
> - kfree(irqfd);
> -}
> -
> -static struct _irqfd *
> -irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
> -{
> - struct _irqfd *irqfd;
> -
> - mutex_lock(&kvm->lock);
> -
> - /*
> - * linear search isn't brilliant, but this should be an infrequent
> - * slow-path operation, and the list should not grow very large
> - */
> - list_for_each_entry(irqfd, &kvm->irqfds, list) {
> - if (irqfd->file != file || irqfd->gsi != gsi)
> - continue;
> -
> - list_del(&irqfd->list);
> - mutex_unlock(&kvm->lock);
> -
> - return irqfd;
> - }
> -
> - mutex_unlock(&kvm->lock);
> -
> - return NULL;
> -}
> -
> -static int
> -kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
> -{
> - struct _irqfd *irqfd;
> - struct file *file;
> - int count = 0;
> -
> - file = fget(fd);
> - if (IS_ERR(file))
> - return PTR_ERR(file);
> -
> - while ((irqfd = irqfd_remove(kvm, file, gsi))) {
> - /*
> - * We remove the item from the list under the lock, but we
> - * free it outside the lock to avoid deadlocking with the
> - * flush_work and the work_item taking the lock
> - */
> - irqfd_release(irqfd);
> - count++;
> - }
> -
> - fput(file);
> -
> - return count ? count : -ENOENT;
> -}
> -
> void
> kvm_irqfd_init(struct kvm *kvm)
> {
> INIT_LIST_HEAD(&kvm->irqfds);
> }
>
> -int
> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> -{
> - if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> - return kvm_deassign_irqfd(kvm, fd, gsi);
> -
> - return kvm_assign_irqfd(kvm, fd, gsi);
> -}
> -
> void
> kvm_irqfd_release(struct kvm *kvm)
> {
> struct _irqfd *irqfd, *tmp;
>
> - /* don't bother with the lock..we are shutting down */
> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> - list_del(&irqfd->list);
> - irqfd_release(irqfd);
> - }
> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
> + irqfd_disconnect(irqfd);
> }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 902fed9..a9f62bb 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
> spin_lock(&kvm_lock);
> list_del(&kvm->vm_list);
> spin_unlock(&kvm_lock);
> - kvm_irqfd_release(kvm);
> kvm_free_irq_routing(kvm);
> kvm_io_bus_destroy(&kvm->pio_bus);
> kvm_io_bus_destroy(&kvm->mmio_bus);
> @@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> {
> struct kvm *kvm = filp->private_data;
>
> + kvm_irqfd_release(kvm);
> +
> kvm_put_kvm(kvm);
> return 0;
> }
>

2009-06-02 18:23:44

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

Paul E. McKenney wrote:
> On Tue, Jun 02, 2009 at 11:15:38AM -0400, Gregory Haskins wrote:
>
>> Assigning an irqfd object to a kvm object creates a relationship that we
>> currently manage by having the kvm oject acquire/hold a file* reference to
>> the underlying eventfd. The lifetime of these objects is properly maintained
>> by decoupling the two objects whenever the irqfd is closed or kvm is closed,
>> whichever comes first.
>>
>> However, the irqfd "close" method is less than ideal since it requires two
>> system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
>> close(eventfd)). This dual-call approach was utilized because there was no
>> notification mechanism on the eventfd side at the time irqfd was implemented.
>>
>> Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
>> eventfd is about to close. So we eliminate the IRQFD_DEASSIGN ioctl (*)
>> vector in favor of sensing the desassign automatically when the fd is closed.
>> The resulting code is slightly more complex as a result since we need to
>> allow either side to sever the relationship independently. We utilize SRCU
>> to guarantee stable concurrent access to the KVM pointer without adding
>> additional atomic operations in the fast path.
>>
>> At minimum, this design should be acked by both Davide and Paul (cc'd).
>>
>> (*) The irqfd patch does not exist in any released tree, so the understanding
>> is that we can alter the irqfd specific ABI without taking the normal
>> precautions, such as CAP bits.
>>
>
> A few questions and suggestions interspersed below.
>
> Thanx, Paul
>

Thanks for the review, Paul.

(FYI: This isn't quite what I was asking you about on IRC yesterday, but
it's related...and the SRCU portion of the conversation *did* inspire me
here. Just note that the stuff I was asking about is still forthcoming)

>
>> Signed-off-by: Gregory Haskins <[email protected]>
>> CC: Davide Libenzi <[email protected]>
>> CC: Michael S. Tsirkin <[email protected]>
>> CC: Paul E. McKenney <[email protected]>
>> ---
>>
>> include/linux/kvm.h | 2 -
>> virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
>> virt/kvm/kvm_main.c | 3 +
>> 3 files changed, 81 insertions(+), 101 deletions(-)
>>
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 632a856..29b62cc 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -482,8 +482,6 @@ struct kvm_x86_mce {
>> };
>> #endif
>>
>> -#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
>> -
>> struct kvm_irqfd {
>> __u32 fd;
>> __u32 gsi;
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index f3f2ea1..6ed62e2 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -37,26 +37,63 @@
>> * --------------------------------------------------------------------
>> */
>> struct _irqfd {
>> + struct mutex lock;
>> + struct srcu_struct srcu;
>> struct kvm *kvm;
>> int gsi;
>> - struct file *file;
>> struct list_head list;
>> poll_table pt;
>> wait_queue_head_t *wqh;
>> wait_queue_t wait;
>> - struct work_struct work;
>> + struct work_struct inject;
>> };
>>
>> static void
>> irqfd_inject(struct work_struct *work)
>> {
>> - struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
>> - struct kvm *kvm = irqfd->kvm;
>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>> + struct kvm *kvm;
>> + int idx;
>> +
>> + idx = srcu_read_lock(&irqfd->srcu);
>> +
>> + kvm = rcu_dereference(irqfd->kvm);
>> + if (kvm) {
>> + mutex_lock(&kvm->lock);
>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>> + mutex_unlock(&kvm->lock);
>> + }
>> +
>> + srcu_read_unlock(&irqfd->srcu, idx);
>> +}
>>

[1]

>> +
>> +static void
>> +irqfd_disconnect(struct _irqfd *irqfd)
>> +{
>> + struct kvm *kvm;
>> +
>> + mutex_lock(&irqfd->lock);
>> +
>> + kvm = rcu_dereference(irqfd->kvm);
>> + rcu_assign_pointer(irqfd->kvm, NULL);
>> +
>> + mutex_unlock(&irqfd->lock);
>>

[2]

>> +
>> + if (!kvm)
>> + return;
>>
>> mutex_lock(&kvm->lock);
>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>> + list_del(&irqfd->list);
>> mutex_unlock(&kvm->lock);
>> +
>> + /*
>> + * It is important to not drop the kvm reference until the next grace
>> + * period because there might be lockless references in flight up
>> + * until then
>> + */
>>
>
> The lockless references are all harmless even if carried out after the
> kvm structure has been removed?

No, but all ([1]) references to my knowledge occur within an srcu
read-side CS, and we only drop the object reference ([3]) outside of
that CS by virtue of the synchronize_srcu() barrier below. The one
notable exception is [2], which I don't declare as a read-side CS since
I hold the mutex during the swap.

OTOH, this is really my _intention_, not _reality_ per se. ;) E.g. I
may have completely flubbed up the design, so I'm glad you are looking
at it.

> Or does there need to be a ->deleted
> field that allows the lockless references to ignore a kvm structure that
> has already been deleted?
>

I guess you could say that the "rcu_assign_pointer(NULL)" is my
"deleted" field. The reader-side code in question should check for that
condition before proceeding.

> On the other hand, if it really is somehow OK for kvm_set_irq() to be
> called on an already-deleted kvm structure, then this code would be OK
> as is.
>

Definitely not, so if you think that can happen please raise the flag.
>
>> + synchronize_srcu(&irqfd->srcu);
>> + kvm_put_kvm(kvm);
>>

[3]

>> }
>>
>> static int
>> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> {
>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>
>> - /*
>> - * The wake_up is called with interrupts disabled. Therefore we need
>> - * to defer the IRQ injection until later since we need to acquire the
>> - * kvm->lock to do so.
>> - */
>> - schedule_work(&irqfd->work);
>> + switch ((unsigned long)key) {
>> + case POLLIN:
>> + /*
>> + * The POLLIN wake_up is called with interrupts disabled.
>> + * Therefore we need to defer the IRQ injection until later
>> + * since we need to acquire the kvm->lock to do so.
>> + */
>> + schedule_work(&irqfd->inject);
>> + break;
>> + case POLLHUP:
>> + /*
>> + * The POLLHUP is called unlocked, so it theoretically should
>> + * be safe to remove ourselves from the wqh
>> + */
>> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> + flush_work(&irqfd->inject);
>> + irqfd_disconnect(irqfd);
>>
>
> Good. The fact that irqfd_disconnect() does a synchronize_srcu()
> prevents any readers from trying to do an SRCU operation on an already
> cleaned-up srcu_struct, so this does look safe to me.
>

As an additional data point, we can guarantee that we will never be
called again since we synchronously unhook from the wqh and kvm->irqfds
list, and the POLLHUP is called from f_ops->release().

>
>> + cleanup_srcu_struct(&irqfd->srcu);
>> + kfree(irqfd);
>> + break;
>> + }
>>
>> return 0;
>> }
>> @@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>> add_wait_queue(wqh, &irqfd->wait);
>> }
>>
>> -static int
>> -kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>> +int
>> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> {
>> struct _irqfd *irqfd;
>> struct file *file = NULL;
>> @@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>> if (!irqfd)
>> return -ENOMEM;
>>
>> + mutex_init(&irqfd->lock);
>> + init_srcu_struct(&irqfd->srcu);
>> irqfd->kvm = kvm;
>> irqfd->gsi = gsi;
>> INIT_LIST_HEAD(&irqfd->list);
>> - INIT_WORK(&irqfd->work, irqfd_inject);
>> + INIT_WORK(&irqfd->inject, irqfd_inject);
>>
>> /*
>> * Embed the file* lifetime in the irqfd.
>> @@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>> if (ret < 0)
>> goto fail;
>>
>> - irqfd->file = file;
>> + kvm_get_kvm(kvm);
>>
>> mutex_lock(&kvm->lock);
>> list_add_tail(&irqfd->list, &kvm->irqfds);
>>
>
> Doesn't the above need to be list_add_tail_rcu()? Unless I am confused,
> this is the point at which the new SRCU-protected structure is first
> made public. If so, you really do need list_add_tail_rcu() to make sure
> that concurrent readers don't see pre-initialized values in the structure.
>

I *think* this is ok as a traditional list, because the only paths that
touch this list are guarded by the kvm->lock mutex. Let me know if you
see otherwise or if that is not enough.
>
>> mutex_unlock(&kvm->lock);
>>
>> + /*
>> + * do not drop the file until the irqfd is fully initialized, otherwise
>> + * we might race against the POLLHUP
>> + */
>> + fput(file);
>> +
>> return 0;
>>
>> fail:
>> @@ -139,97 +200,17 @@ fail:
>> return ret;
>> }
>>
>> -static void
>> -irqfd_release(struct _irqfd *irqfd)
>> -{
>> - /*
>> - * The ordering is important. We must remove ourselves from the wqh
>> - * first to ensure no more event callbacks are issued, and then flush
>> - * any previously scheduled work prior to freeing the memory
>> - */
>> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> -
>> - flush_work(&irqfd->work);
>> -
>> - fput(irqfd->file);
>> - kfree(irqfd);
>> -}
>> -
>> -static struct _irqfd *
>> -irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
>> -{
>> - struct _irqfd *irqfd;
>> -
>> - mutex_lock(&kvm->lock);
>> -
>> - /*
>> - * linear search isn't brilliant, but this should be an infrequent
>> - * slow-path operation, and the list should not grow very large
>> - */
>> - list_for_each_entry(irqfd, &kvm->irqfds, list) {
>> - if (irqfd->file != file || irqfd->gsi != gsi)
>> - continue;
>> -
>> - list_del(&irqfd->list);
>> - mutex_unlock(&kvm->lock);
>> -
>> - return irqfd;
>> - }
>> -
>> - mutex_unlock(&kvm->lock);
>> -
>> - return NULL;
>> -}
>> -
>> -static int
>> -kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
>> -{
>> - struct _irqfd *irqfd;
>> - struct file *file;
>> - int count = 0;
>> -
>> - file = fget(fd);
>> - if (IS_ERR(file))
>> - return PTR_ERR(file);
>> -
>> - while ((irqfd = irqfd_remove(kvm, file, gsi))) {
>> - /*
>> - * We remove the item from the list under the lock, but we
>> - * free it outside the lock to avoid deadlocking with the
>> - * flush_work and the work_item taking the lock
>> - */
>> - irqfd_release(irqfd);
>> - count++;
>> - }
>> -
>> - fput(file);
>> -
>> - return count ? count : -ENOENT;
>> -}
>> -
>> void
>> kvm_irqfd_init(struct kvm *kvm)
>> {
>> INIT_LIST_HEAD(&kvm->irqfds);
>> }
>>
>> -int
>> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> -{
>> - if (flags & KVM_IRQFD_FLAG_DEASSIGN)
>> - return kvm_deassign_irqfd(kvm, fd, gsi);
>> -
>> - return kvm_assign_irqfd(kvm, fd, gsi);
>> -}
>> -
>> void
>> kvm_irqfd_release(struct kvm *kvm)
>> {
>> struct _irqfd *irqfd, *tmp;
>>
>> - /* don't bother with the lock..we are shutting down */
>> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
>> - list_del(&irqfd->list);
>> - irqfd_release(irqfd);
>> - }
>> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
>> + irqfd_disconnect(irqfd);
>> }
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 902fed9..a9f62bb 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
>> spin_lock(&kvm_lock);
>> list_del(&kvm->vm_list);
>> spin_unlock(&kvm_lock);
>> - kvm_irqfd_release(kvm);
>> kvm_free_irq_routing(kvm);
>> kvm_io_bus_destroy(&kvm->pio_bus);
>> kvm_io_bus_destroy(&kvm->mmio_bus);
>> @@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
>> {
>> struct kvm *kvm = filp->private_data;
>>
>> + kvm_irqfd_release(kvm);
>> +
>> kvm_put_kvm(kvm);
>> return 0;
>> }
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-02 22:02:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

On Tue, Jun 02, 2009 at 02:23:14PM -0400, Gregory Haskins wrote:
> Paul E. McKenney wrote:
> > On Tue, Jun 02, 2009 at 11:15:38AM -0400, Gregory Haskins wrote:
> >
> >> Assigning an irqfd object to a kvm object creates a relationship that we
> >> currently manage by having the kvm oject acquire/hold a file* reference to
> >> the underlying eventfd. The lifetime of these objects is properly maintained
> >> by decoupling the two objects whenever the irqfd is closed or kvm is closed,
> >> whichever comes first.
> >>
> >> However, the irqfd "close" method is less than ideal since it requires two
> >> system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
> >> close(eventfd)). This dual-call approach was utilized because there was no
> >> notification mechanism on the eventfd side at the time irqfd was implemented.
> >>
> >> Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
> >> eventfd is about to close. So we eliminate the IRQFD_DEASSIGN ioctl (*)
> >> vector in favor of sensing the desassign automatically when the fd is closed.
> >> The resulting code is slightly more complex as a result since we need to
> >> allow either side to sever the relationship independently. We utilize SRCU
> >> to guarantee stable concurrent access to the KVM pointer without adding
> >> additional atomic operations in the fast path.
> >>
> >> At minimum, this design should be acked by both Davide and Paul (cc'd).
> >>
> >> (*) The irqfd patch does not exist in any released tree, so the understanding
> >> is that we can alter the irqfd specific ABI without taking the normal
> >> precautions, such as CAP bits.
> >>
> >
> > A few questions and suggestions interspersed below.
> >
> > Thanx, Paul
> >
>
> Thanks for the review, Paul.

Some questions, clarifications, and mea culpas below.

Thanx, Paul

> (FYI: This isn't quite what I was asking you about on IRC yesterday, but
> it's related...and the SRCU portion of the conversation *did* inspire me
> here. Just note that the stuff I was asking about is still forthcoming)

;-)

> >> Signed-off-by: Gregory Haskins <[email protected]>
> >> CC: Davide Libenzi <[email protected]>
> >> CC: Michael S. Tsirkin <[email protected]>
> >> CC: Paul E. McKenney <[email protected]>
> >> ---
> >>
> >> include/linux/kvm.h | 2 -
> >> virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
> >> virt/kvm/kvm_main.c | 3 +
> >> 3 files changed, 81 insertions(+), 101 deletions(-)
> >>
> >> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> >> index 632a856..29b62cc 100644
> >> --- a/include/linux/kvm.h
> >> +++ b/include/linux/kvm.h
> >> @@ -482,8 +482,6 @@ struct kvm_x86_mce {
> >> };
> >> #endif
> >>
> >> -#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> >> -
> >> struct kvm_irqfd {
> >> __u32 fd;
> >> __u32 gsi;
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> index f3f2ea1..6ed62e2 100644
> >> --- a/virt/kvm/eventfd.c
> >> +++ b/virt/kvm/eventfd.c
> >> @@ -37,26 +37,63 @@
> >> * --------------------------------------------------------------------
> >> */
> >> struct _irqfd {
> >> + struct mutex lock;
> >> + struct srcu_struct srcu;
> >> struct kvm *kvm;
> >> int gsi;
> >> - struct file *file;
> >> struct list_head list;
> >> poll_table pt;
> >> wait_queue_head_t *wqh;
> >> wait_queue_t wait;
> >> - struct work_struct work;
> >> + struct work_struct inject;
> >> };
> >>
> >> static void
> >> irqfd_inject(struct work_struct *work)
> >> {
> >> - struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
> >> - struct kvm *kvm = irqfd->kvm;
> >> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> >> + struct kvm *kvm;
> >> + int idx;
> >> +
> >> + idx = srcu_read_lock(&irqfd->srcu);
> >> +
> >> + kvm = rcu_dereference(irqfd->kvm);
> >> + if (kvm) {
> >> + mutex_lock(&kvm->lock);
> >> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >> + mutex_unlock(&kvm->lock);
> >> + }
> >> +
> >> + srcu_read_unlock(&irqfd->srcu, idx);
> >> +}
> >>
>
> [1]
>
> >> +
> >> +static void
> >> +irqfd_disconnect(struct _irqfd *irqfd)
> >> +{
> >> + struct kvm *kvm;
> >> +
> >> + mutex_lock(&irqfd->lock);
> >> +
> >> + kvm = rcu_dereference(irqfd->kvm);
> >> + rcu_assign_pointer(irqfd->kvm, NULL);
> >> +
> >> + mutex_unlock(&irqfd->lock);
> >>
>
> [2]
>
> >> +
> >> + if (!kvm)
> >> + return;
> >>
> >> mutex_lock(&kvm->lock);
> >> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >> + list_del(&irqfd->list);
> >> mutex_unlock(&kvm->lock);
> >> +
> >> + /*
> >> + * It is important to not drop the kvm reference until the next grace
> >> + * period because there might be lockless references in flight up
> >> + * until then
> >> + */
> >>
> >
> > The lockless references are all harmless even if carried out after the
> > kvm structure has been removed?
>
> No, but all ([1]) references to my knowledge occur within an srcu
> read-side CS, and we only drop the object reference ([3]) outside of
> that CS by virtue of the synchronize_srcu() barrier below. The one
> notable exception is [2], which I don't declare as a read-side CS since
> I hold the mutex during the swap.
>
> OTOH, this is really my _intention_, not _reality_ per se. ;) E.g. I
> may have completely flubbed up the design, so I'm glad you are looking
> at it.

Looks good in general -- my question is about the window of time
between when the object is removed from the list (and might still have
readers referencing it) and when the object is freed (and, courtesy of
synchronize_srcu(), can no longer be referenced by readers).

In other words, the following sequence of events:

o CPU 0 picks up a pointer to the object.

o CPU 1 removes that same object from the list, and invokes
synchronize_srcu(), which cannot return yet due to CPU 0
being in an SRCU read-side critical section.

o CPU 0 acquires the lock and invokes the pair of kvm_set_irq()
calls, releases the lock and exits the SRCU read-side critical
section.

o CPU 1's synchronize_srcu() can now return, and does.

o CPU 1 frees the object.

I honestly don't know enough about KVM to say whether or not this is a
problem, but thought I should ask. ;-)

> > Or does there need to be a ->deleted
> > field that allows the lockless references to ignore a kvm structure that
> > has already been deleted?
>
> I guess you could say that the "rcu_assign_pointer(NULL)" is my
> "deleted" field. The reader-side code in question should check for that
> condition before proceeding.

Fair enough! But please keep in mind that the pointer could be assigned
to NULL just after we dereference it, especially if we are interrupted
or preempted or something. Or is the idea to re-check the pointer under
some lock?

> > On the other hand, if it really is somehow OK for kvm_set_irq() to be
> > called on an already-deleted kvm structure, then this code would be OK
> > as is.
>
> Definitely not, so if you think that can happen please raise the flag.

Apologies, I was being a bit sloppy. Instead of "already-deleted",
I should have said something like "already removed from the list but
not yet freed".

> >> + synchronize_srcu(&irqfd->srcu);
> >> + kvm_put_kvm(kvm);
> >>
>
> [3]
>
> >> }
> >>
> >> static int
> >> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >> {
> >> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> >>
> >> - /*
> >> - * The wake_up is called with interrupts disabled. Therefore we need
> >> - * to defer the IRQ injection until later since we need to acquire the
> >> - * kvm->lock to do so.
> >> - */
> >> - schedule_work(&irqfd->work);
> >> + switch ((unsigned long)key) {
> >> + case POLLIN:
> >> + /*
> >> + * The POLLIN wake_up is called with interrupts disabled.
> >> + * Therefore we need to defer the IRQ injection until later
> >> + * since we need to acquire the kvm->lock to do so.
> >> + */
> >> + schedule_work(&irqfd->inject);
> >> + break;
> >> + case POLLHUP:
> >> + /*
> >> + * The POLLHUP is called unlocked, so it theoretically should
> >> + * be safe to remove ourselves from the wqh
> >> + */
> >> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> + flush_work(&irqfd->inject);
> >> + irqfd_disconnect(irqfd);
> >>
> >
> > Good. The fact that irqfd_disconnect() does a synchronize_srcu()
> > prevents any readers from trying to do an SRCU operation on an already
> > cleaned-up srcu_struct, so this does look safe to me.
>
> As an additional data point, we can guarantee that we will never be
> called again since we synchronously unhook from the wqh and kvm->irqfds
> list, and the POLLHUP is called from f_ops->release().

So the window is short, but still exists, correct?

> >> + cleanup_srcu_struct(&irqfd->srcu);
> >> + kfree(irqfd);
> >> + break;
> >> + }
> >>
> >> return 0;
> >> }
> >> @@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> >> add_wait_queue(wqh, &irqfd->wait);
> >> }
> >>
> >> -static int
> >> -kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >> +int
> >> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >> {
> >> struct _irqfd *irqfd;
> >> struct file *file = NULL;
> >> @@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >> if (!irqfd)
> >> return -ENOMEM;
> >>
> >> + mutex_init(&irqfd->lock);
> >> + init_srcu_struct(&irqfd->srcu);
> >> irqfd->kvm = kvm;
> >> irqfd->gsi = gsi;
> >> INIT_LIST_HEAD(&irqfd->list);
> >> - INIT_WORK(&irqfd->work, irqfd_inject);
> >> + INIT_WORK(&irqfd->inject, irqfd_inject);
> >>
> >> /*
> >> * Embed the file* lifetime in the irqfd.
> >> @@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >> if (ret < 0)
> >> goto fail;
> >>
> >> - irqfd->file = file;
> >> + kvm_get_kvm(kvm);
> >>
> >> mutex_lock(&kvm->lock);
> >> list_add_tail(&irqfd->list, &kvm->irqfds);
> >>
> >
> > Doesn't the above need to be list_add_tail_rcu()? Unless I am confused,
> > this is the point at which the new SRCU-protected structure is first
> > made public. If so, you really do need list_add_tail_rcu() to make sure
> > that concurrent readers don't see pre-initialized values in the structure.
>
> I *think* this is ok as a traditional list, because the only paths that
> touch this list are guarded by the kvm->lock mutex. Let me know if you
> see otherwise or if that is not enough.

My mistake, you are using rcu_assign_pointer() and rcu_dereference()
instead of the list primitives. Never mind!!!

> >> mutex_unlock(&kvm->lock);
> >>
> >> + /*
> >> + * do not drop the file until the irqfd is fully initialized, otherwise
> >> + * we might race against the POLLHUP
> >> + */
> >> + fput(file);
> >> +
> >> return 0;
> >>
> >> fail:
> >> @@ -139,97 +200,17 @@ fail:
> >> return ret;
> >> }
> >>
> >> -static void
> >> -irqfd_release(struct _irqfd *irqfd)
> >> -{
> >> - /*
> >> - * The ordering is important. We must remove ourselves from the wqh
> >> - * first to ensure no more event callbacks are issued, and then flush
> >> - * any previously scheduled work prior to freeing the memory
> >> - */
> >> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> -
> >> - flush_work(&irqfd->work);
> >> -
> >> - fput(irqfd->file);
> >> - kfree(irqfd);
> >> -}
> >> -
> >> -static struct _irqfd *
> >> -irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
> >> -{
> >> - struct _irqfd *irqfd;
> >> -
> >> - mutex_lock(&kvm->lock);
> >> -
> >> - /*
> >> - * linear search isn't brilliant, but this should be an infrequent
> >> - * slow-path operation, and the list should not grow very large
> >> - */
> >> - list_for_each_entry(irqfd, &kvm->irqfds, list) {
> >> - if (irqfd->file != file || irqfd->gsi != gsi)
> >> - continue;
> >> -
> >> - list_del(&irqfd->list);
> >> - mutex_unlock(&kvm->lock);
> >> -
> >> - return irqfd;
> >> - }
> >> -
> >> - mutex_unlock(&kvm->lock);
> >> -
> >> - return NULL;
> >> -}
> >> -
> >> -static int
> >> -kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
> >> -{
> >> - struct _irqfd *irqfd;
> >> - struct file *file;
> >> - int count = 0;
> >> -
> >> - file = fget(fd);
> >> - if (IS_ERR(file))
> >> - return PTR_ERR(file);
> >> -
> >> - while ((irqfd = irqfd_remove(kvm, file, gsi))) {
> >> - /*
> >> - * We remove the item from the list under the lock, but we
> >> - * free it outside the lock to avoid deadlocking with the
> >> - * flush_work and the work_item taking the lock
> >> - */
> >> - irqfd_release(irqfd);
> >> - count++;
> >> - }
> >> -
> >> - fput(file);
> >> -
> >> - return count ? count : -ENOENT;
> >> -}
> >> -
> >> void
> >> kvm_irqfd_init(struct kvm *kvm)
> >> {
> >> INIT_LIST_HEAD(&kvm->irqfds);
> >> }
> >>
> >> -int
> >> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >> -{
> >> - if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> >> - return kvm_deassign_irqfd(kvm, fd, gsi);
> >> -
> >> - return kvm_assign_irqfd(kvm, fd, gsi);
> >> -}
> >> -
> >> void
> >> kvm_irqfd_release(struct kvm *kvm)
> >> {
> >> struct _irqfd *irqfd, *tmp;
> >>
> >> - /* don't bother with the lock..we are shutting down */
> >> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> >> - list_del(&irqfd->list);
> >> - irqfd_release(irqfd);
> >> - }
> >> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
> >> + irqfd_disconnect(irqfd);
> >> }
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 902fed9..a9f62bb 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >> spin_lock(&kvm_lock);
> >> list_del(&kvm->vm_list);
> >> spin_unlock(&kvm_lock);
> >> - kvm_irqfd_release(kvm);
> >> kvm_free_irq_routing(kvm);
> >> kvm_io_bus_destroy(&kvm->pio_bus);
> >> kvm_io_bus_destroy(&kvm->mmio_bus);
> >> @@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> >> {
> >> struct kvm *kvm = filp->private_data;
> >>
> >> + kvm_irqfd_release(kvm);
> >> +
> >> kvm_put_kvm(kvm);
> >> return 0;
> >> }
> >>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
>

2009-06-03 01:53:58

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

Paul E. McKenney wrote:
> On Tue, Jun 02, 2009 at 02:23:14PM -0400, Gregory Haskins wrote:
>
>> Paul E. McKenney wrote:
>>
>>> On Tue, Jun 02, 2009 at 11:15:38AM -0400, Gregory Haskins wrote:
>>>
>>>
>>>> Assigning an irqfd object to a kvm object creates a relationship that we
>>>> currently manage by having the kvm oject acquire/hold a file* reference to
>>>> the underlying eventfd. The lifetime of these objects is properly maintained
>>>> by decoupling the two objects whenever the irqfd is closed or kvm is closed,
>>>> whichever comes first.
>>>>
>>>> However, the irqfd "close" method is less than ideal since it requires two
>>>> system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
>>>> close(eventfd)). This dual-call approach was utilized because there was no
>>>> notification mechanism on the eventfd side at the time irqfd was implemented.
>>>>
>>>> Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
>>>> eventfd is about to close. So we eliminate the IRQFD_DEASSIGN ioctl (*)
>>>> vector in favor of sensing the desassign automatically when the fd is closed.
>>>> The resulting code is slightly more complex as a result since we need to
>>>> allow either side to sever the relationship independently. We utilize SRCU
>>>> to guarantee stable concurrent access to the KVM pointer without adding
>>>> additional atomic operations in the fast path.
>>>>
>>>> At minimum, this design should be acked by both Davide and Paul (cc'd).
>>>>
>>>> (*) The irqfd patch does not exist in any released tree, so the understanding
>>>> is that we can alter the irqfd specific ABI without taking the normal
>>>> precautions, such as CAP bits.
>>>>
>>>>
>>> A few questions and suggestions interspersed below.
>>>
>>> Thanx, Paul
>>>
>>>
>> Thanks for the review, Paul.
>>
>
> Some questions, clarifications, and mea culpas below.
>
> Thanx, Paul
>
>
>> (FYI: This isn't quite what I was asking you about on IRC yesterday, but
>> it's related...and the SRCU portion of the conversation *did* inspire me
>> here. Just note that the stuff I was asking about is still forthcoming)
>>
>
> ;-)
>
>
>>>> Signed-off-by: Gregory Haskins <[email protected]>
>>>> CC: Davide Libenzi <[email protected]>
>>>> CC: Michael S. Tsirkin <[email protected]>
>>>> CC: Paul E. McKenney <[email protected]>
>>>> ---
>>>>
>>>> include/linux/kvm.h | 2 -
>>>> virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
>>>> virt/kvm/kvm_main.c | 3 +
>>>> 3 files changed, 81 insertions(+), 101 deletions(-)
>>>>
>>>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>>>> index 632a856..29b62cc 100644
>>>> --- a/include/linux/kvm.h
>>>> +++ b/include/linux/kvm.h
>>>> @@ -482,8 +482,6 @@ struct kvm_x86_mce {
>>>> };
>>>> #endif
>>>>
>>>> -#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
>>>> -
>>>> struct kvm_irqfd {
>>>> __u32 fd;
>>>> __u32 gsi;
>>>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>>>> index f3f2ea1..6ed62e2 100644
>>>> --- a/virt/kvm/eventfd.c
>>>> +++ b/virt/kvm/eventfd.c
>>>> @@ -37,26 +37,63 @@
>>>> * --------------------------------------------------------------------
>>>> */
>>>> struct _irqfd {
>>>> + struct mutex lock;
>>>> + struct srcu_struct srcu;
>>>> struct kvm *kvm;
>>>> int gsi;
>>>> - struct file *file;
>>>> struct list_head list;
>>>> poll_table pt;
>>>> wait_queue_head_t *wqh;
>>>> wait_queue_t wait;
>>>> - struct work_struct work;
>>>> + struct work_struct inject;
>>>> };
>>>>
>>>> static void
>>>> irqfd_inject(struct work_struct *work)
>>>> {
>>>> - struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
>>>> - struct kvm *kvm = irqfd->kvm;
>>>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>>>> + struct kvm *kvm;
>>>> + int idx;
>>>> +
>>>> + idx = srcu_read_lock(&irqfd->srcu);
>>>> +
>>>> + kvm = rcu_dereference(irqfd->kvm);
>>>>

[4]

>>>> + if (kvm) {
>>>> + mutex_lock(&kvm->lock);
>>>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> + mutex_unlock(&kvm->lock);
>>>> + }
>>>> +
>>>> + srcu_read_unlock(&irqfd->srcu, idx);
>>>> +}
>>>>
>>>>
>> [1]
>>
>>
>>>> +
>>>> +static void
>>>> +irqfd_disconnect(struct _irqfd *irqfd)
>>>> +{
>>>> + struct kvm *kvm;
>>>> +
>>>> + mutex_lock(&irqfd->lock);
>>>> +
>>>> + kvm = rcu_dereference(irqfd->kvm);
>>>> + rcu_assign_pointer(irqfd->kvm, NULL);
>>>> +
>>>> + mutex_unlock(&irqfd->lock);
>>>>
>>>>
>> [2]
>>
>>
>>>> +
>>>> + if (!kvm)
>>>> + return;
>>>>
>>>> mutex_lock(&kvm->lock);
>>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> + list_del(&irqfd->list);
>>>> mutex_unlock(&kvm->lock);
>>>> +
>>>> + /*
>>>> + * It is important to not drop the kvm reference until the next grace
>>>> + * period because there might be lockless references in flight up
>>>> + * until then
>>>> + */
>>>>
>>>>
>>> The lockless references are all harmless even if carried out after the
>>> kvm structure has been removed?
>>>
>> No, but all ([1]) references to my knowledge occur within an srcu
>> read-side CS, and we only drop the object reference ([3]) outside of
>> that CS by virtue of the synchronize_srcu() barrier below. The one
>> notable exception is [2], which I don't declare as a read-side CS since
>> I hold the mutex during the swap.
>>
>> OTOH, this is really my _intention_, not _reality_ per se. ;) E.g. I
>> may have completely flubbed up the design, so I'm glad you are looking
>> at it.
>>
>
> Looks good in general -- my question is about the window of time
> between when the object is removed from the list (and might still have
> readers referencing it) and when the object is freed (and, courtesy of
> synchronize_srcu(), can no longer be referenced by readers).
>
> In other words, the following sequence of events:
>
> o CPU 0 picks up a pointer to the object.
>
> o CPU 1 removes that same object from the list, and invokes
> synchronize_srcu(), which cannot return yet due to CPU 0
> being in an SRCU read-side critical section.
>
> o CPU 0 acquires the lock and invokes the pair of kvm_set_irq()
> calls, releases the lock and exits the SRCU read-side critical
> section.
>
> o CPU 1's synchronize_srcu() can now return, and does.
>
> o CPU 1 frees the object.
>
> I honestly don't know enough about KVM to say whether or not this is a
> problem, but thought I should ask. ;-)
>

Right, ok. What you outline is consistent with my expectations. That
is, I need to make sure that it is not possible to have any concurrent
code call kvm_put_cpu between [4] and [1] against the same pointer. It
is, of course, ok if some other code path enters irqfd_inject() _after_
[2] because I would have already swapped the pointer with NULL and it
will simply bail out on the conditional right after [4].
>
>>> Or does there need to be a ->deleted
>>> field that allows the lockless references to ignore a kvm structure that
>>> has already been deleted?
>>>
>> I guess you could say that the "rcu_assign_pointer(NULL)" is my
>> "deleted" field. The reader-side code in question should check for that
>> condition before proceeding.
>>
>
> Fair enough! But please keep in mind that the pointer could be assigned
> to NULL just after we dereference it, especially if we are interrupted
> or preempted or something.

Right, and that should be ok IIUC as long as I can be guaranteed to
never call kvm_put_kvm() before the dereferencer calls
srcu_read_unlock(). I currently believe this guarantee is provided by
the synchronize_srcu() at [3], but please straighten me out if that is a
naive assumption.

> Or is the idea to re-check the pointer under some lock?
>
>
I do not currently believe I need to worry about that case, but as
always, straighten me out if that is wrong. ;)

>>> On the other hand, if it really is somehow OK for kvm_set_irq() to be
>>> called on an already-deleted kvm structure, then this code would be OK
>>> as is.
>>>
>> Definitely not, so if you think that can happen please raise the flag.
>>
>
> Apologies, I was being a bit sloppy. Instead of "already-deleted",
> I should have said something like "already removed from the list but
> not yet freed".
>

Ah, ok. The answer in that case would be "yes". It's ok to call
kvm_set_irq() while the irqfd->kvm pointer is NULL, but it is not ok to
call it after or during kvm_put_kvm() has been invoked. Technically the
first safe point is right after the last mutex_unlock(&kvm->lock)
completes (right before [1]), and is officially annotated with the
subsequent srcu_read_unlock().
>
>>>> + synchronize_srcu(&irqfd->srcu);
>>>> + kvm_put_kvm(kvm);
>>>>
>>>>
>> [3]
>>
>>
>>>> }
>>>>
>>>> static int
>>>> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>> {
>>>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>>>
>>>> - /*
>>>> - * The wake_up is called with interrupts disabled. Therefore we need
>>>> - * to defer the IRQ injection until later since we need to acquire the
>>>> - * kvm->lock to do so.
>>>> - */
>>>> - schedule_work(&irqfd->work);
>>>> + switch ((unsigned long)key) {
>>>> + case POLLIN:
>>>> + /*
>>>> + * The POLLIN wake_up is called with interrupts disabled.
>>>> + * Therefore we need to defer the IRQ injection until later
>>>> + * since we need to acquire the kvm->lock to do so.
>>>> + */
>>>> + schedule_work(&irqfd->inject);
>>>> + break;
>>>> + case POLLHUP:
>>>> + /*
>>>> + * The POLLHUP is called unlocked, so it theoretically should
>>>> + * be safe to remove ourselves from the wqh
>>>> + */
>>>> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> + flush_work(&irqfd->inject);
>>>> + irqfd_disconnect(irqfd);
>>>>
>>>>
>>> Good. The fact that irqfd_disconnect() does a synchronize_srcu()
>>> prevents any readers from trying to do an SRCU operation on an already
>>> cleaned-up srcu_struct, so this does look safe to me.
>>>
>> As an additional data point, we can guarantee that we will never be
>> called again since we synchronously unhook from the wqh and kvm->irqfds
>> list, and the POLLHUP is called from f_ops->release().
>>
>
> So the window is short, but still exists, correct?
>

Can you elaborate?


>
>>>> + cleanup_srcu_struct(&irqfd->srcu);
>>>> + kfree(irqfd);
>>>> + break;
>>>> + }
>>>>
>>>> return 0;
>>>> }
>>>> @@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>>>> add_wait_queue(wqh, &irqfd->wait);
>>>> }
>>>>
>>>> -static int
>>>> -kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>>>> +int
>>>> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>>> {
>>>> struct _irqfd *irqfd;
>>>> struct file *file = NULL;
>>>> @@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>>>> if (!irqfd)
>>>> return -ENOMEM;
>>>>
>>>> + mutex_init(&irqfd->lock);
>>>> + init_srcu_struct(&irqfd->srcu);
>>>> irqfd->kvm = kvm;
>>>> irqfd->gsi = gsi;
>>>> INIT_LIST_HEAD(&irqfd->list);
>>>> - INIT_WORK(&irqfd->work, irqfd_inject);
>>>> + INIT_WORK(&irqfd->inject, irqfd_inject);
>>>>
>>>> /*
>>>> * Embed the file* lifetime in the irqfd.
>>>> @@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
>>>> if (ret < 0)
>>>> goto fail;
>>>>
>>>> - irqfd->file = file;
>>>> + kvm_get_kvm(kvm);
>>>>
>>>> mutex_lock(&kvm->lock);
>>>> list_add_tail(&irqfd->list, &kvm->irqfds);
>>>>
>>>>
>>> Doesn't the above need to be list_add_tail_rcu()? Unless I am confused,
>>> this is the point at which the new SRCU-protected structure is first
>>> made public. If so, you really do need list_add_tail_rcu() to make sure
>>> that concurrent readers don't see pre-initialized values in the structure.
>>>
>> I *think* this is ok as a traditional list, because the only paths that
>> touch this list are guarded by the kvm->lock mutex. Let me know if you
>> see otherwise or if that is not enough.
>>
>
> My mistake, you are using rcu_assign_pointer() and rcu_dereference()
> instead of the list primitives. Never mind!!!
>

Yeah, and note that we actually have two types of objects and their
references floating around:

*) We have "struct irqfd" which can be thought of as an extension of
eventfd. It holds exactly one (or zero) references to kvm via the
irqfd->kvm pointer, and as you note above I use rcu_XX() macros and srcu
to manage it.

*) Conversely, we have "struct kvm" which may have a 1:N relationship
with many irqfds, which I manage with a standard list at kvm->irqfds
protected by kvm->lock.

So the code that uses the rcu_dereference/rcu_assign_pointer is actually
different than the code mentioned above that is manipulating the
kvm->irqfds list with list_add_tail(). The latter isn't directly RCU
related and is why you see the non-rcu variants of the list functions in
use.

That said, if you still see a hole in that approach, do not be shy in
pointing it out ;)

Thanks again for taking to time to go over all this, Paul. I know you
are very busy, and its very much appreciated!

-Greg

>
>>>> mutex_unlock(&kvm->lock);
>>>>
>>>> + /*
>>>> + * do not drop the file until the irqfd is fully initialized, otherwise
>>>> + * we might race against the POLLHUP
>>>> + */
>>>> + fput(file);
>>>> +
>>>> return 0;
>>>>
>>>> fail:
>>>> @@ -139,97 +200,17 @@ fail:
>>>> return ret;
>>>> }
>>>>
>>>> -static void
>>>> -irqfd_release(struct _irqfd *irqfd)
>>>> -{
>>>> - /*
>>>> - * The ordering is important. We must remove ourselves from the wqh
>>>> - * first to ensure no more event callbacks are issued, and then flush
>>>> - * any previously scheduled work prior to freeing the memory
>>>> - */
>>>> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> -
>>>> - flush_work(&irqfd->work);
>>>> -
>>>> - fput(irqfd->file);
>>>> - kfree(irqfd);
>>>> -}
>>>> -
>>>> -static struct _irqfd *
>>>> -irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
>>>> -{
>>>> - struct _irqfd *irqfd;
>>>> -
>>>> - mutex_lock(&kvm->lock);
>>>> -
>>>> - /*
>>>> - * linear search isn't brilliant, but this should be an infrequent
>>>> - * slow-path operation, and the list should not grow very large
>>>> - */
>>>> - list_for_each_entry(irqfd, &kvm->irqfds, list) {
>>>> - if (irqfd->file != file || irqfd->gsi != gsi)
>>>> - continue;
>>>> -
>>>> - list_del(&irqfd->list);
>>>> - mutex_unlock(&kvm->lock);
>>>> -
>>>> - return irqfd;
>>>> - }
>>>> -
>>>> - mutex_unlock(&kvm->lock);
>>>> -
>>>> - return NULL;
>>>> -}
>>>> -
>>>> -static int
>>>> -kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
>>>> -{
>>>> - struct _irqfd *irqfd;
>>>> - struct file *file;
>>>> - int count = 0;
>>>> -
>>>> - file = fget(fd);
>>>> - if (IS_ERR(file))
>>>> - return PTR_ERR(file);
>>>> -
>>>> - while ((irqfd = irqfd_remove(kvm, file, gsi))) {
>>>> - /*
>>>> - * We remove the item from the list under the lock, but we
>>>> - * free it outside the lock to avoid deadlocking with the
>>>> - * flush_work and the work_item taking the lock
>>>> - */
>>>> - irqfd_release(irqfd);
>>>> - count++;
>>>> - }
>>>> -
>>>> - fput(file);
>>>> -
>>>> - return count ? count : -ENOENT;
>>>> -}
>>>> -
>>>> void
>>>> kvm_irqfd_init(struct kvm *kvm)
>>>> {
>>>> INIT_LIST_HEAD(&kvm->irqfds);
>>>> }
>>>>
>>>> -int
>>>> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>>> -{
>>>> - if (flags & KVM_IRQFD_FLAG_DEASSIGN)
>>>> - return kvm_deassign_irqfd(kvm, fd, gsi);
>>>> -
>>>> - return kvm_assign_irqfd(kvm, fd, gsi);
>>>> -}
>>>> -
>>>> void
>>>> kvm_irqfd_release(struct kvm *kvm)
>>>> {
>>>> struct _irqfd *irqfd, *tmp;
>>>>
>>>> - /* don't bother with the lock..we are shutting down */
>>>> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
>>>> - list_del(&irqfd->list);
>>>> - irqfd_release(irqfd);
>>>> - }
>>>> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
>>>> + irqfd_disconnect(irqfd);
>>>> }
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 902fed9..a9f62bb 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
>>>> spin_lock(&kvm_lock);
>>>> list_del(&kvm->vm_list);
>>>> spin_unlock(&kvm_lock);
>>>> - kvm_irqfd_release(kvm);
>>>> kvm_free_irq_routing(kvm);
>>>> kvm_io_bus_destroy(&kvm->pio_bus);
>>>> kvm_io_bus_destroy(&kvm->mmio_bus);
>>>> @@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
>>>> {
>>>> struct kvm *kvm = filp->private_data;
>>>>
>>>> + kvm_irqfd_release(kvm);
>>>> +
>>>> kvm_put_kvm(kvm);
>>>> return 0;
>>>> }
>>>>
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
>
>
>




Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-03 06:41:03

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

On Tue, Jun 02, 2009 at 01:41:05PM -0400, Gregory Haskins wrote:
> > And having close not clean up the state unless you do an ioctl first is
> > very messy IMO - I don't think you'll find any such examples in kernel.
> >
> >
>
> I agree, and that is why I am advocating this POLLHUP solution. It was
> only this other way to begin with because the technology didn't exist
> until Davide showed me the light.
>
> Problem with your request is that I already looked into what is
> essentially a bi-directional reference problem (for a different reason)
> when I started the POLLHUP series. Its messy to do this in a way that
> doesn't negatively impact the fast path (introducing locking, etc) or
> make my head explode making sure it doesn't race. Afaict, we would need
> to solve this problem to do what you are proposing (patches welcome).
>
> If this hybrid decoupled-deassign + unified-close is indeed an important
> feature set, I suggest that we still consider this POLLHUP series for
> inclusion, and then someone can re-introduce DEASSIGN support in the
> future as a CAP bit extension. That way we at least get the desirable
> close() properties that we both seem in favor of, and get this advanced
> use case when we need it (and can figure out the locking design).
>

FWIW, I took a look and yes, it is non-trivial.
I concur, we can always add the deassign ioctl later.



--
MST

2009-06-03 11:34:37

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 01:41:05PM -0400, Gregory Haskins wrote:
>
>>> And having close not clean up the state unless you do an ioctl first is
>>> very messy IMO - I don't think you'll find any such examples in kernel.
>>>
>>>
>>>
>> I agree, and that is why I am advocating this POLLHUP solution. It was
>> only this other way to begin with because the technology didn't exist
>> until Davide showed me the light.
>>
>> Problem with your request is that I already looked into what is
>> essentially a bi-directional reference problem (for a different reason)
>> when I started the POLLHUP series. Its messy to do this in a way that
>> doesn't negatively impact the fast path (introducing locking, etc) or
>> make my head explode making sure it doesn't race. Afaict, we would need
>> to solve this problem to do what you are proposing (patches welcome).
>>
>> If this hybrid decoupled-deassign + unified-close is indeed an important
>> feature set, I suggest that we still consider this POLLHUP series for
>> inclusion, and then someone can re-introduce DEASSIGN support in the
>> future as a CAP bit extension. That way we at least get the desirable
>> close() properties that we both seem in favor of, and get this advanced
>> use case when we need it (and can figure out the locking design).
>>
>>
>
> FWIW, I took a look and yes, it is non-trivial.
> I concur, we can always add the deassign ioctl later.
>
>
>
>

Sounds good, Michael. Thanks!

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-03 15:04:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

On Tue, Jun 02, 2009 at 09:53:29PM -0400, Gregory Haskins wrote:
> Paul E. McKenney wrote:
> > On Tue, Jun 02, 2009 at 02:23:14PM -0400, Gregory Haskins wrote:
> >
> >> Paul E. McKenney wrote:
> >>
> >>> On Tue, Jun 02, 2009 at 11:15:38AM -0400, Gregory Haskins wrote:
> >>>
> >>>
> >>>> Assigning an irqfd object to a kvm object creates a relationship that we
> >>>> currently manage by having the kvm oject acquire/hold a file* reference to
> >>>> the underlying eventfd. The lifetime of these objects is properly maintained
> >>>> by decoupling the two objects whenever the irqfd is closed or kvm is closed,
> >>>> whichever comes first.
> >>>>
> >>>> However, the irqfd "close" method is less than ideal since it requires two
> >>>> system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
> >>>> close(eventfd)). This dual-call approach was utilized because there was no
> >>>> notification mechanism on the eventfd side at the time irqfd was implemented.
> >>>>
> >>>> Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
> >>>> eventfd is about to close. So we eliminate the IRQFD_DEASSIGN ioctl (*)
> >>>> vector in favor of sensing the desassign automatically when the fd is closed.
> >>>> The resulting code is slightly more complex as a result since we need to
> >>>> allow either side to sever the relationship independently. We utilize SRCU
> >>>> to guarantee stable concurrent access to the KVM pointer without adding
> >>>> additional atomic operations in the fast path.
> >>>>
> >>>> At minimum, this design should be acked by both Davide and Paul (cc'd).
> >>>>
> >>>> (*) The irqfd patch does not exist in any released tree, so the understanding
> >>>> is that we can alter the irqfd specific ABI without taking the normal
> >>>> precautions, such as CAP bits.
> >>>>
> >>>>
> >>> A few questions and suggestions interspersed below.
> >>>
> >>> Thanx, Paul
> >>>
> >>>
> >> Thanks for the review, Paul.
> >>
> >
> > Some questions, clarifications, and mea culpas below.
> >
> > Thanx, Paul
> >
> >
> >> (FYI: This isn't quite what I was asking you about on IRC yesterday, but
> >> it's related...and the SRCU portion of the conversation *did* inspire me
> >> here. Just note that the stuff I was asking about is still forthcoming)
> >>
> >
> > ;-)
> >
> >
> >>>> Signed-off-by: Gregory Haskins <[email protected]>
> >>>> CC: Davide Libenzi <[email protected]>
> >>>> CC: Michael S. Tsirkin <[email protected]>
> >>>> CC: Paul E. McKenney <[email protected]>
> >>>> ---
> >>>>
> >>>> include/linux/kvm.h | 2 -
> >>>> virt/kvm/eventfd.c | 177 +++++++++++++++++++++++----------------------------
> >>>> virt/kvm/kvm_main.c | 3 +
> >>>> 3 files changed, 81 insertions(+), 101 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> >>>> index 632a856..29b62cc 100644
> >>>> --- a/include/linux/kvm.h
> >>>> +++ b/include/linux/kvm.h
> >>>> @@ -482,8 +482,6 @@ struct kvm_x86_mce {
> >>>> };
> >>>> #endif
> >>>>
> >>>> -#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> >>>> -
> >>>> struct kvm_irqfd {
> >>>> __u32 fd;
> >>>> __u32 gsi;
> >>>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >>>> index f3f2ea1..6ed62e2 100644
> >>>> --- a/virt/kvm/eventfd.c
> >>>> +++ b/virt/kvm/eventfd.c
> >>>> @@ -37,26 +37,63 @@
> >>>> * --------------------------------------------------------------------
> >>>> */
> >>>> struct _irqfd {
> >>>> + struct mutex lock;
> >>>> + struct srcu_struct srcu;
> >>>> struct kvm *kvm;
> >>>> int gsi;
> >>>> - struct file *file;
> >>>> struct list_head list;
> >>>> poll_table pt;
> >>>> wait_queue_head_t *wqh;
> >>>> wait_queue_t wait;
> >>>> - struct work_struct work;
> >>>> + struct work_struct inject;
> >>>> };
> >>>>
> >>>> static void
> >>>> irqfd_inject(struct work_struct *work)
> >>>> {
> >>>> - struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
> >>>> - struct kvm *kvm = irqfd->kvm;
> >>>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> >>>> + struct kvm *kvm;
> >>>> + int idx;
> >>>> +
> >>>> + idx = srcu_read_lock(&irqfd->srcu);
> >>>> +
> >>>> + kvm = rcu_dereference(irqfd->kvm);
> >>>>
>
> [4]
>
> >>>> + if (kvm) {
> >>>> + mutex_lock(&kvm->lock);
> >>>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >>>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >>>> + mutex_unlock(&kvm->lock);
> >>>> + }
> >>>> +
> >>>> + srcu_read_unlock(&irqfd->srcu, idx);
> >>>> +}
> >>>>
> >>>>
> >> [1]
> >>
> >>
> >>>> +
> >>>> +static void
> >>>> +irqfd_disconnect(struct _irqfd *irqfd)
> >>>> +{
> >>>> + struct kvm *kvm;
> >>>> +
> >>>> + mutex_lock(&irqfd->lock);
> >>>> +
> >>>> + kvm = rcu_dereference(irqfd->kvm);
> >>>> + rcu_assign_pointer(irqfd->kvm, NULL);
> >>>> +
> >>>> + mutex_unlock(&irqfd->lock);
> >>>>
> >>>>
> >> [2]
> >>
> >>
> >>>> +
> >>>> + if (!kvm)
> >>>> + return;
> >>>>
> >>>> mutex_lock(&kvm->lock);
> >>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >>>> + list_del(&irqfd->list);
> >>>> mutex_unlock(&kvm->lock);
> >>>> +
> >>>> + /*
> >>>> + * It is important to not drop the kvm reference until the next grace
> >>>> + * period because there might be lockless references in flight up
> >>>> + * until then
> >>>> + */
> >>>>
> >>>>
> >>> The lockless references are all harmless even if carried out after the
> >>> kvm structure has been removed?
> >>>
> >> No, but all ([1]) references to my knowledge occur within an srcu
> >> read-side CS, and we only drop the object reference ([3]) outside of
> >> that CS by virtue of the synchronize_srcu() barrier below. The one
> >> notable exception is [2], which I don't declare as a read-side CS since
> >> I hold the mutex during the swap.
> >>
> >> OTOH, this is really my _intention_, not _reality_ per se. ;) E.g. I
> >> may have completely flubbed up the design, so I'm glad you are looking
> >> at it.
> >>
> >
> > Looks good in general -- my question is about the window of time
> > between when the object is removed from the list (and might still have
> > readers referencing it) and when the object is freed (and, courtesy of
> > synchronize_srcu(), can no longer be referenced by readers).
> >
> > In other words, the following sequence of events:
> >
> > o CPU 0 picks up a pointer to the object.
> >
> > o CPU 1 removes that same object from the list, and invokes
> > synchronize_srcu(), which cannot return yet due to CPU 0
> > being in an SRCU read-side critical section.
> >
> > o CPU 0 acquires the lock and invokes the pair of kvm_set_irq()
> > calls, releases the lock and exits the SRCU read-side critical
> > section.
> >
> > o CPU 1's synchronize_srcu() can now return, and does.
> >
> > o CPU 1 frees the object.
> >
> > I honestly don't know enough about KVM to say whether or not this is a
> > problem, but thought I should ask. ;-)
>
> Right, ok. What you outline is consistent with my expectations. That
> is, I need to make sure that it is not possible to have any concurrent
> code call kvm_put_cpu between [4] and [1] against the same pointer. It
> is, of course, ok if some other code path enters irqfd_inject() _after_
> [2] because I would have already swapped the pointer with NULL and it
> will simply bail out on the conditional right after [4].

Yep, that is what the combination of srcu_read_lock(), srcu_read_unlock(),
and synchronize_srcu() will do.

> >>> Or does there need to be a ->deleted
> >>> field that allows the lockless references to ignore a kvm structure that
> >>> has already been deleted?
> >>>
> >> I guess you could say that the "rcu_assign_pointer(NULL)" is my
> >> "deleted" field. The reader-side code in question should check for that
> >> condition before proceeding.
> >>
> >
> > Fair enough! But please keep in mind that the pointer could be assigned
> > to NULL just after we dereference it, especially if we are interrupted
> > or preempted or something.
>
> Right, and that should be ok IIUC as long as I can be guaranteed to
> never call kvm_put_kvm() before the dereferencer calls
> srcu_read_unlock(). I currently believe this guarantee is provided by
> the synchronize_srcu() at [3], but please straighten me out if that is a
> naive assumption.

You are correct, that is indeed the guarantee provided by
synchronize_srcu().

> > Or is the idea to re-check the pointer under some lock?
> >
> I do not currently believe I need to worry about that case, but as
> always, straighten me out if that is wrong. ;)

Given what you say below, you should be OK.

> >>> On the other hand, if it really is somehow OK for kvm_set_irq() to be
> >>> called on an already-deleted kvm structure, then this code would be OK
> >>> as is.
> >>>
> >> Definitely not, so if you think that can happen please raise the flag.
> >
> > Apologies, I was being a bit sloppy. Instead of "already-deleted",
> > I should have said something like "already removed from the list but
> > not yet freed".
>
> Ah, ok. The answer in that case would be "yes". It's ok to call
> kvm_set_irq() while the irqfd->kvm pointer is NULL, but it is not ok to
> call it after or during kvm_put_kvm() has been invoked. Technically the
> first safe point is right after the last mutex_unlock(&kvm->lock)
> completes (right before [1]), and is officially annotated with the
> subsequent srcu_read_unlock().

Good enough!

> >>>> + synchronize_srcu(&irqfd->srcu);
> >>>> + kvm_put_kvm(kvm);
> >>>>
> >>>>
> >> [3]
> >>
> >>
> >>>> }
> >>>>
> >>>> static int
> >>>> @@ -64,12 +101,28 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>>> {
> >>>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> >>>>
> >>>> - /*
> >>>> - * The wake_up is called with interrupts disabled. Therefore we need
> >>>> - * to defer the IRQ injection until later since we need to acquire the
> >>>> - * kvm->lock to do so.
> >>>> - */
> >>>> - schedule_work(&irqfd->work);
> >>>> + switch ((unsigned long)key) {
> >>>> + case POLLIN:
> >>>> + /*
> >>>> + * The POLLIN wake_up is called with interrupts disabled.
> >>>> + * Therefore we need to defer the IRQ injection until later
> >>>> + * since we need to acquire the kvm->lock to do so.
> >>>> + */
> >>>> + schedule_work(&irqfd->inject);
> >>>> + break;
> >>>> + case POLLHUP:
> >>>> + /*
> >>>> + * The POLLHUP is called unlocked, so it theoretically should
> >>>> + * be safe to remove ourselves from the wqh
> >>>> + */
> >>>> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >>>> + flush_work(&irqfd->inject);
> >>>> + irqfd_disconnect(irqfd);
> >>>>
> >>>>
> >>> Good. The fact that irqfd_disconnect() does a synchronize_srcu()
> >>> prevents any readers from trying to do an SRCU operation on an already
> >>> cleaned-up srcu_struct, so this does look safe to me.
> >>>
> >> As an additional data point, we can guarantee that we will never be
> >> called again since we synchronously unhook from the wqh and kvm->irqfds
> >> list, and the POLLHUP is called from f_ops->release().
> >
> > So the window is short, but still exists, correct?
>
> Can you elaborate?

This is the same window we discussed above. It is the window that
allows an object to be removed from the list between the time that a
reader obtains a reference to the object and the time that this readers
actually uses this reference.

> >>>> + cleanup_srcu_struct(&irqfd->srcu);
> >>>> + kfree(irqfd);
> >>>> + break;
> >>>> + }
> >>>>
> >>>> return 0;
> >>>> }
> >>>> @@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> >>>> add_wait_queue(wqh, &irqfd->wait);
> >>>> }
> >>>>
> >>>> -static int
> >>>> -kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >>>> +int
> >>>> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>>> {
> >>>> struct _irqfd *irqfd;
> >>>> struct file *file = NULL;
> >>>> @@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >>>> if (!irqfd)
> >>>> return -ENOMEM;
> >>>>
> >>>> + mutex_init(&irqfd->lock);
> >>>> + init_srcu_struct(&irqfd->srcu);
> >>>> irqfd->kvm = kvm;
> >>>> irqfd->gsi = gsi;
> >>>> INIT_LIST_HEAD(&irqfd->list);
> >>>> - INIT_WORK(&irqfd->work, irqfd_inject);
> >>>> + INIT_WORK(&irqfd->inject, irqfd_inject);
> >>>>
> >>>> /*
> >>>> * Embed the file* lifetime in the irqfd.
> >>>> @@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
> >>>> if (ret < 0)
> >>>> goto fail;
> >>>>
> >>>> - irqfd->file = file;
> >>>> + kvm_get_kvm(kvm);
> >>>>
> >>>> mutex_lock(&kvm->lock);
> >>>> list_add_tail(&irqfd->list, &kvm->irqfds);
> >>>>
> >>>>
> >>> Doesn't the above need to be list_add_tail_rcu()? Unless I am confused,
> >>> this is the point at which the new SRCU-protected structure is first
> >>> made public. If so, you really do need list_add_tail_rcu() to make sure
> >>> that concurrent readers don't see pre-initialized values in the structure.
> >>>
> >> I *think* this is ok as a traditional list, because the only paths that
> >> touch this list are guarded by the kvm->lock mutex. Let me know if you
> >> see otherwise or if that is not enough.
> >
> > My mistake, you are using rcu_assign_pointer() and rcu_dereference()
> > instead of the list primitives. Never mind!!!
>
> Yeah, and note that we actually have two types of objects and their
> references floating around:
>
> *) We have "struct irqfd" which can be thought of as an extension of
> eventfd. It holds exactly one (or zero) references to kvm via the
> irqfd->kvm pointer, and as you note above I use rcu_XX() macros and srcu
> to manage it.
>
> *) Conversely, we have "struct kvm" which may have a 1:N relationship
> with many irqfds, which I manage with a standard list at kvm->irqfds
> protected by kvm->lock.

This combination should work fine. You do acquire the lock within the
SRCU read-side critical section, as required. It looks to me that you
avoid freeing the underlying "struct kvm" until after a grace period
past removal from the list, again, as required.

Thanx, Paul

> So the code that uses the rcu_dereference/rcu_assign_pointer is actually
> different than the code mentioned above that is manipulating the
> kvm->irqfds list with list_add_tail(). The latter isn't directly RCU
> related and is why you see the non-rcu variants of the list functions in
> use.
>
> That said, if you still see a hole in that approach, do not be shy in
> pointing it out ;)
>
> Thanks again for taking to time to go over all this, Paul. I know you
> are very busy, and its very much appreciated!
>
> -Greg
>
> >
> >>>> mutex_unlock(&kvm->lock);
> >>>>
> >>>> + /*
> >>>> + * do not drop the file until the irqfd is fully initialized, otherwise
> >>>> + * we might race against the POLLHUP
> >>>> + */
> >>>> + fput(file);
> >>>> +
> >>>> return 0;
> >>>>
> >>>> fail:
> >>>> @@ -139,97 +200,17 @@ fail:
> >>>> return ret;
> >>>> }
> >>>>
> >>>> -static void
> >>>> -irqfd_release(struct _irqfd *irqfd)
> >>>> -{
> >>>> - /*
> >>>> - * The ordering is important. We must remove ourselves from the wqh
> >>>> - * first to ensure no more event callbacks are issued, and then flush
> >>>> - * any previously scheduled work prior to freeing the memory
> >>>> - */
> >>>> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >>>> -
> >>>> - flush_work(&irqfd->work);
> >>>> -
> >>>> - fput(irqfd->file);
> >>>> - kfree(irqfd);
> >>>> -}
> >>>> -
> >>>> -static struct _irqfd *
> >>>> -irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
> >>>> -{
> >>>> - struct _irqfd *irqfd;
> >>>> -
> >>>> - mutex_lock(&kvm->lock);
> >>>> -
> >>>> - /*
> >>>> - * linear search isn't brilliant, but this should be an infrequent
> >>>> - * slow-path operation, and the list should not grow very large
> >>>> - */
> >>>> - list_for_each_entry(irqfd, &kvm->irqfds, list) {
> >>>> - if (irqfd->file != file || irqfd->gsi != gsi)
> >>>> - continue;
> >>>> -
> >>>> - list_del(&irqfd->list);
> >>>> - mutex_unlock(&kvm->lock);
> >>>> -
> >>>> - return irqfd;
> >>>> - }
> >>>> -
> >>>> - mutex_unlock(&kvm->lock);
> >>>> -
> >>>> - return NULL;
> >>>> -}
> >>>> -
> >>>> -static int
> >>>> -kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
> >>>> -{
> >>>> - struct _irqfd *irqfd;
> >>>> - struct file *file;
> >>>> - int count = 0;
> >>>> -
> >>>> - file = fget(fd);
> >>>> - if (IS_ERR(file))
> >>>> - return PTR_ERR(file);
> >>>> -
> >>>> - while ((irqfd = irqfd_remove(kvm, file, gsi))) {
> >>>> - /*
> >>>> - * We remove the item from the list under the lock, but we
> >>>> - * free it outside the lock to avoid deadlocking with the
> >>>> - * flush_work and the work_item taking the lock
> >>>> - */
> >>>> - irqfd_release(irqfd);
> >>>> - count++;
> >>>> - }
> >>>> -
> >>>> - fput(file);
> >>>> -
> >>>> - return count ? count : -ENOENT;
> >>>> -}
> >>>> -
> >>>> void
> >>>> kvm_irqfd_init(struct kvm *kvm)
> >>>> {
> >>>> INIT_LIST_HEAD(&kvm->irqfds);
> >>>> }
> >>>>
> >>>> -int
> >>>> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>>> -{
> >>>> - if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> >>>> - return kvm_deassign_irqfd(kvm, fd, gsi);
> >>>> -
> >>>> - return kvm_assign_irqfd(kvm, fd, gsi);
> >>>> -}
> >>>> -
> >>>> void
> >>>> kvm_irqfd_release(struct kvm *kvm)
> >>>> {
> >>>> struct _irqfd *irqfd, *tmp;
> >>>>
> >>>> - /* don't bother with the lock..we are shutting down */
> >>>> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> >>>> - list_del(&irqfd->list);
> >>>> - irqfd_release(irqfd);
> >>>> - }
> >>>> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
> >>>> + irqfd_disconnect(irqfd);
> >>>> }
> >>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >>>> index 902fed9..a9f62bb 100644
> >>>> --- a/virt/kvm/kvm_main.c
> >>>> +++ b/virt/kvm/kvm_main.c
> >>>> @@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >>>> spin_lock(&kvm_lock);
> >>>> list_del(&kvm->vm_list);
> >>>> spin_unlock(&kvm_lock);
> >>>> - kvm_irqfd_release(kvm);
> >>>> kvm_free_irq_routing(kvm);
> >>>> kvm_io_bus_destroy(&kvm->pio_bus);
> >>>> kvm_io_bus_destroy(&kvm->mmio_bus);
> >>>> @@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> >>>> {
> >>>> struct kvm *kvm = filp->private_data;
> >>>>
> >>>> + kvm_irqfd_release(kvm);
> >>>> +
> >>>> kvm_put_kvm(kvm);
> >>>> return 0;
> >>>> }
> >>>>
> >>>>
> >>>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> the body of a message to [email protected]
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>
> >>>
> >>
> >
> >
> >
>
>
>

2009-06-03 17:27:48

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl


Thanks again for the review, Paul. IIUC, you think the design is ok as
it is.

Davide,
In light of this, would you like to submit patch 1/2 formally with
your SOB at your earliest convenience? Or would you prefer that I
submit it and you can simply ack it? Either is fine with me.

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-03 17:30:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

On Wed, 3 Jun 2009, Gregory Haskins wrote:

>
> Thanks again for the review, Paul. IIUC, you think the design is ok as
> it is.
>
> Davide,
> In light of this, would you like to submit patch 1/2 formally with
> your SOB at your earliest convenience? Or would you prefer that I
> submit it and you can simply ack it? Either is fine with me.

Will do later today...



- Davide

2009-06-04 10:25:31

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2009 at 01:41:05PM -0400, Gregory Haskins wrote:
>
>>> And having close not clean up the state unless you do an ioctl first is
>>> very messy IMO - I don't think you'll find any such examples in kernel.
>>>
>>>
>>>
>> I agree, and that is why I am advocating this POLLHUP solution. It was
>> only this other way to begin with because the technology didn't exist
>> until Davide showed me the light.
>>
>> Problem with your request is that I already looked into what is
>> essentially a bi-directional reference problem (for a different reason)
>> when I started the POLLHUP series. Its messy to do this in a way that
>> doesn't negatively impact the fast path (introducing locking, etc) or
>> make my head explode making sure it doesn't race. Afaict, we would need
>> to solve this problem to do what you are proposing (patches welcome).
>>
>> If this hybrid decoupled-deassign + unified-close is indeed an important
>> feature set, I suggest that we still consider this POLLHUP series for
>> inclusion, and then someone can re-introduce DEASSIGN support in the
>> future as a CAP bit extension. That way we at least get the desirable
>> close() properties that we both seem in favor of, and get this advanced
>> use case when we need it (and can figure out the locking design).
>>
>>
>
> FWIW, I took a look and yes, it is non-trivial.
> I concur, we can always add the deassign ioctl later.
>

I agree that deassign is needed for reasons of symmetry, and that it can
be added later.

--
error compiling committee.c: too many arguments to function

2009-06-04 11:44:21

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Avi Kivity wrote:
> Michael S. Tsirkin wrote:
>> On Tue, Jun 02, 2009 at 01:41:05PM -0400, Gregory Haskins wrote:
>>
>>>> And having close not clean up the state unless you do an ioctl
>>>> first is
>>>> very messy IMO - I don't think you'll find any such examples in
>>>> kernel.
>>>>
>>>>
>>> I agree, and that is why I am advocating this POLLHUP solution. It was
>>> only this other way to begin with because the technology didn't exist
>>> until Davide showed me the light.
>>>
>>> Problem with your request is that I already looked into what is
>>> essentially a bi-directional reference problem (for a different reason)
>>> when I started the POLLHUP series. Its messy to do this in a way that
>>> doesn't negatively impact the fast path (introducing locking, etc) or
>>> make my head explode making sure it doesn't race. Afaict, we would
>>> need
>>> to solve this problem to do what you are proposing (patches welcome).
>>>
>>> If this hybrid decoupled-deassign + unified-close is indeed an
>>> important
>>> feature set, I suggest that we still consider this POLLHUP series for
>>> inclusion, and then someone can re-introduce DEASSIGN support in the
>>> future as a CAP bit extension. That way we at least get the desirable
>>> close() properties that we both seem in favor of, and get this advanced
>>> use case when we need it (and can figure out the locking design).
>>>
>>>
>>
>> FWIW, I took a look and yes, it is non-trivial.
>> I concur, we can always add the deassign ioctl later.
>>
>
> I agree that deassign is needed for reasons of symmetry, and that it
> can be added later.
>
Cool.

FYI: Davide's patch has been accepted into -mm (Andrew CC'd). I am not
sure of the protocol here, but I assume this means you can now safely
pull it from -mm into kvm.git so the prerequisite for 2/2 is properly met.

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-04 11:50:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Gregory Haskins wrote:
>> I agree that deassign is needed for reasons of symmetry, and that it
>> can be added later.
>>
>>
> Cool.
>
> FYI: Davide's patch has been accepted into -mm (Andrew CC'd). I am not
> sure of the protocol here, but I assume this means you can now safely
> pull it from -mm into kvm.git so the prerequisite for 2/2 is properly met.
>

I'm not sure either.

But I think I saw a "Thanks for catching that" for 2/2?

--
error compiling committee.c: too many arguments to function

2009-06-04 11:53:25

by Gregory Haskins

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> I agree that deassign is needed for reasons of symmetry, and that it
>>> can be added later.
>>>
>>>
>> Cool.
>>
>> FYI: Davide's patch has been accepted into -mm (Andrew CC'd). I am not
>> sure of the protocol here, but I assume this means you can now safely
>> pull it from -mm into kvm.git so the prerequisite for 2/2 is properly
>> met.
>>
>
> I'm not sure either.
>
> But I think I saw a "Thanks for catching that" for 2/2?
>
Ah, right! I queued that fix up eons ago after David's feedback and
forgot that it was there waiting for me ;)

Since Paul ok'd (I think?) the srcu design, and the only other feedback
was the key-bitmap thing from Davide, I will go ahead and push a v2 with
just that one fix (unless there is any other feedback?)

-Greg


Attachments:
signature.asc (266.00 B)
OpenPGP digital signature

2009-06-04 12:03:18

by Avi Kivity

[permalink] [raw]
Subject: Re: [KVM-RFC PATCH 0/2] irqfd: use POLLHUP notification for close()

Gregory Haskins wrote:
> Since Paul ok'd (I think?) the srcu design, and the only other feedback
> was the key-bitmap thing from Davide, I will go ahead and push a v2 with
> just that one fix (unless there is any other feedback?)
>

I'll do a detailed review on your next posting. When I see a long
thread I go hide under the bed, where there is no Internet access.

--
error compiling committee.c: too many arguments to function