2012-08-06 10:17:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On 07/24/2012 11:43 PM, Alex Williamson wrote:
> This new ioctl enables an eventfd to be triggered when an EOI is
> written for a specified irqchip pin. The first user of this will
> be external device assignment through VFIO, using a level irqfd
> for asserting a PCI INTx interrupt and this interface for de-assert
> and notification once the interrupt is serviced.
>
> Here we make use of the reference counting of the _irq_source
> object allowing us to share it with an irqfd and cleanup regardless
> of the release order.

The name is slightly misleading. eoifd doesn't trigger on EOI (which is
an APIC->IOAPIC interface) but rather when an interrupt controller
resamples an input line. This happens for the IOAPIC when an EOI is
received for a vector that is configured for level interrupts and not
masked, or similarly for a PIC (but this is not triggered by an APIC EOI).

It's not a huge difference, but let's document it.

>
> +4.77 KVM_EOIFD
> +
> +Capability: KVM_CAP_EOIFD
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_eoifd (in)
> +Returns: 0 on success, < 0 on error
> +
> +KVM_EOIFD allows userspace to receive interrupt EOI notification
> +through an eventfd. kvm_eoifd.fd specifies the eventfd used for
> +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> +once assigned. KVM_EOIFD also requires additional bits set in
> +kvm_eoifd.flags to bind to the proper interrupt line. The
> +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> +and is a key from a level triggered interrupt (configured from
> +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound
> +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key
> +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a
> +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> +KVM_EOIFD_FLAG_LEVEL_IRQFD.

Why do we need to couple eoifd and irqfd?



--
error compiling committee.c: too many arguments to function


2012-08-06 10:38:26

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On 08/06/2012 01:17 PM, Avi Kivity wrote:
>
>>
>> +4.77 KVM_EOIFD
>> +
>> +Capability: KVM_CAP_EOIFD
>> +Architectures: x86
>> +Type: vm ioctl
>> +Parameters: struct kvm_eoifd (in)
>> +Returns: 0 on success, < 0 on error
>> +
>> +KVM_EOIFD allows userspace to receive interrupt EOI notification
>> +through an eventfd. kvm_eoifd.fd specifies the eventfd used for
>> +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
>> +once assigned. KVM_EOIFD also requires additional bits set in
>> +kvm_eoifd.flags to bind to the proper interrupt line. The
>> +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
>> +and is a key from a level triggered interrupt (configured from
>> +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound
>> +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key
>> +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
>> +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a
>> +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
>> +KVM_EOIFD_FLAG_LEVEL_IRQFD.
>
> Why do we need to couple eoifd and irqfd?

Oh, it's to auto-deassert the line.

Regarding the implementation, instead of a linked list, would an array
of counters parallel to the bitmap make it simpler?

--
error compiling committee.c: too many arguments to function

2012-08-06 10:40:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On 08/06/2012 01:38 PM, Avi Kivity wrote:

> Regarding the implementation, instead of a linked list, would an array
> of counters parallel to the bitmap make it simpler?

Or even, replace the bitmap with an array of counters.

--
error compiling committee.c: too many arguments to function

2012-08-09 19:26:19

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> On 08/06/2012 01:38 PM, Avi Kivity wrote:
>
> > Regarding the implementation, instead of a linked list, would an array
> > of counters parallel to the bitmap make it simpler?
>
> Or even, replace the bitmap with an array of counters.

I'm not sure a counter array is what we're really after. That gives us
reference counting for the irq source IDs, but not the key->gsi lookup.
It also highlights another issue, that we have a limited set of source
IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
for the shared userspace ID and another for the PIT. How happy are we
going to be with a limit of 62 level interrupts in use at one time?

It's arguably a reasonable number since the most virtualization friendly
devices (sr-iov VFs) don't even support this kind of interrupt. It's
also very wasteful allocating an entire source ID for a single GSI
within that source ID. PCI supports interrupts A, B, C, and D, which,
in the most optimal config, each go to different GSIs. So we could
theoretically be more efficient in our use and allocation of irq source
IDs if we tracked use by the source ID, gsi pair.

That probably makes it less practical to replace anything at the top
level with a counter array. The key that we pass back is currently the
actual source ID, but we don't specify what it is, so we could split it
and have it encode a 16bit source ID plus 16 bit GSI. It could also be
an idr entry.

Michael, would the interface be more acceptable to you if we added
separate ioctls to allocate and free some representation of an irq
source ID, gsi pair? For instance, an ioctl might return an idr entry
for an irq source ID/gsi object which would then be passed as a
parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
representing the source id/gsi isn't magically freed on it's own. This
would also allow us to deassign/close one end and reconfigure it later.
Thanks,

Alex

2012-08-12 07:52:23

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Mon, Aug 06, 2012 at 01:17:12PM +0300, Avi Kivity wrote:
> On 07/24/2012 11:43 PM, Alex Williamson wrote:
> > This new ioctl enables an eventfd to be triggered when an EOI is
> > written for a specified irqchip pin. The first user of this will
> > be external device assignment through VFIO, using a level irqfd
> > for asserting a PCI INTx interrupt and this interface for de-assert
> > and notification once the interrupt is serviced.
> >
> > Here we make use of the reference counting of the _irq_source
> > object allowing us to share it with an irqfd and cleanup regardless
> > of the release order.
>
> The name is slightly misleading. eoifd doesn't trigger on EOI (which is
> an APIC->IOAPIC interface) but rather when an interrupt controller
> resamples an input line. This happens for the IOAPIC when an EOI is
> received for a vector that is configured for level interrupts and not
> masked, or similarly for a PIC (but this is not triggered by an APIC EOI).
>
> It's not a huge difference, but let's document it.

In fact, when we really need to notify userspace is after
we auto-deassert an interrupt: userspace does not
need an EOI notification as such.


> >
> > +4.77 KVM_EOIFD
> > +
> > +Capability: KVM_CAP_EOIFD
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_eoifd (in)
> > +Returns: 0 on success, < 0 on error
> > +
> > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > +through an eventfd. kvm_eoifd.fd specifies the eventfd used for
> > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > +once assigned. KVM_EOIFD also requires additional bits set in
> > +kvm_eoifd.flags to bind to the proper interrupt line. The
> > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > +and is a key from a level triggered interrupt (configured from
> > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound
> > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key
> > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a
> > +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> > +KVM_EOIFD_FLAG_LEVEL_IRQFD.
>
> Why do we need to couple eoifd and irqfd?
>
>
>
> --
> error compiling committee.c: too many arguments to function

2012-08-12 08:36:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On 08/09/2012 10:26 PM, Alex Williamson wrote:
> On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
>> On 08/06/2012 01:38 PM, Avi Kivity wrote:
>>
>> > Regarding the implementation, instead of a linked list, would an array
>> > of counters parallel to the bitmap make it simpler?
>>
>> Or even, replace the bitmap with an array of counters.
>
> I'm not sure a counter array is what we're really after. That gives us
> reference counting for the irq source IDs, but not the key->gsi lookup.

You can look up the gsi while registering the eoifd, so it's accessible
as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
while the eoifd is still active, but is this a problem?


> It also highlights another issue, that we have a limited set of source
> IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> for the shared userspace ID and another for the PIT. How happy are we
> going to be with a limit of 62 level interrupts in use at one time?

When we start being unhappy we can increase that number. On the other
hand more locks and lists makes me unhappy now.

>
> It's arguably a reasonable number since the most virtualization friendly
> devices (sr-iov VFs) don't even support this kind of interrupt. It's
> also very wasteful allocating an entire source ID for a single GSI
> within that source ID. PCI supports interrupts A, B, C, and D, which,
> in the most optimal config, each go to different GSIs. So we could
> theoretically be more efficient in our use and allocation of irq source
> IDs if we tracked use by the source ID, gsi pair.

There are, in one userspace, just three gsis available for PCI links, so
you're compressing the source id space by 3.

> That probably makes it less practical to replace anything at the top
> level with a counter array. The key that we pass back is currently the
> actual source ID, but we don't specify what it is, so we could split it
> and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> an idr entry.

We can fix those kinds of problems by adding another layer of
indirection. But I doubt they will be needed. I don't see people
assigning 60 legacy devices to one guest.

> Michael, would the interface be more acceptable to you if we added
> separate ioctls to allocate and free some representation of an irq
> source ID, gsi pair? For instance, an ioctl might return an idr entry
> for an irq source ID/gsi object which would then be passed as a
> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> representing the source id/gsi isn't magically freed on it's own. This
> would also allow us to deassign/close one end and reconfigure it later.
> Thanks,

Another option is to push the responsibility for allocating IDs for the
association to userspace. Let userspace both create the irqfd and the
eoifd with the same ID, the kernel matches them at registration time and
copies the gsi/sourceid from the first to the second eventfd.

--
error compiling committee.c: too many arguments to function

2012-08-12 09:32:37

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Thu, Aug 09, 2012 at 01:26:15PM -0600, Alex Williamson wrote:
> On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > On 08/06/2012 01:38 PM, Avi Kivity wrote:
> >
> > > Regarding the implementation, instead of a linked list, would an array
> > > of counters parallel to the bitmap make it simpler?
> >
> > Or even, replace the bitmap with an array of counters.
>
> I'm not sure a counter array is what we're really after. That gives us
> reference counting for the irq source IDs, but not the key->gsi lookup.
> It also highlights another issue, that we have a limited set of source
> IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> for the shared userspace ID and another for the PIT. How happy are we
> going to be with a limit of 62 level interrupts in use at one time?
>
> It's arguably a reasonable number since the most virtualization friendly
> devices (sr-iov VFs) don't even support this kind of interrupt. It's
> also very wasteful allocating an entire source ID for a single GSI
> within that source ID. PCI supports interrupts A, B, C, and D, which,
> in the most optimal config, each go to different GSIs. So we could
> theoretically be more efficient in our use and allocation of irq source
> IDs if we tracked use by the source ID, gsi pair.
>
> That probably makes it less practical to replace anything at the top
> level with a counter array. The key that we pass back is currently the
> actual source ID, but we don't specify what it is, so we could split it
> and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> an idr entry.
>
> Michael, would the interface be more acceptable to you if we added
> separate ioctls to allocate and free some representation of an irq
> source ID, gsi pair? For instance, an ioctl might return an idr entry
> for an irq source ID/gsi object which would then be passed as a
> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> representing the source id/gsi isn't magically freed on it's own. This
> would also allow us to deassign/close one end and reconfigure it later.
> Thanks,
>
> Alex

It's acceptable to me either way. I was only pointing out that as
designed, the interface looks simple at first but then you find out some
subtle limitations which are implementation driven. This gives
an overall feeling the abstraction is too low level.

If we compare to the existing irqfd, isn't the difference
simply that irqfd deasserts immediately ATM, while we
want to delay this until later?

If yes, then along the lines that you proposed, and combining with my
idea of tracking deasserts, how do you like the following:

/* Keep line asserted until guest has handled the interrupt. */
#define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
/* Notify after line is deasserted. */
#define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)

struct kvm_irqfd {
__u32 fd;
__u32 gsi;
__u32 flags;
/* eventfd to notify when line is deasserted */
__u32 deassert_eventfd;
__u8 pad[16];
};

now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
effective for level interrupts.

Notes about lifetime of objects:
- closing deassert_eventfd does nothing (we can keep
reference to it from irqfd so no need for
complex polling/flushing scheme)
- closing irqfd or deasserting dis-associates
deassert_eventfd automatically
- source id is internal to irqfd and goes away with it

it looks harder to misuse and fits what we want to do nicely,
and needs less code to implement.

Avi, what do you think?

--
MST

2012-08-13 21:23:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Sun, 2012-08-12 at 12:33 +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 09, 2012 at 01:26:15PM -0600, Alex Williamson wrote:
> > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > >
> > > > Regarding the implementation, instead of a linked list, would an array
> > > > of counters parallel to the bitmap make it simpler?
> > >
> > > Or even, replace the bitmap with an array of counters.
> >
> > I'm not sure a counter array is what we're really after. That gives us
> > reference counting for the irq source IDs, but not the key->gsi lookup.
> > It also highlights another issue, that we have a limited set of source
> > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > for the shared userspace ID and another for the PIT. How happy are we
> > going to be with a limit of 62 level interrupts in use at one time?
> >
> > It's arguably a reasonable number since the most virtualization friendly
> > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > also very wasteful allocating an entire source ID for a single GSI
> > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > in the most optimal config, each go to different GSIs. So we could
> > theoretically be more efficient in our use and allocation of irq source
> > IDs if we tracked use by the source ID, gsi pair.
> >
> > That probably makes it less practical to replace anything at the top
> > level with a counter array. The key that we pass back is currently the
> > actual source ID, but we don't specify what it is, so we could split it
> > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > an idr entry.
> >
> > Michael, would the interface be more acceptable to you if we added
> > separate ioctls to allocate and free some representation of an irq
> > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > for an irq source ID/gsi object which would then be passed as a
> > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > representing the source id/gsi isn't magically freed on it's own. This
> > would also allow us to deassign/close one end and reconfigure it later.
> > Thanks,
> >
> > Alex
>
> It's acceptable to me either way. I was only pointing out that as
> designed, the interface looks simple at first but then you find out some
> subtle limitations which are implementation driven. This gives
> an overall feeling the abstraction is too low level.
>
> If we compare to the existing irqfd, isn't the difference
> simply that irqfd deasserts immediately ATM, while we
> want to delay this until later?
>
> If yes, then along the lines that you proposed, and combining with my
> idea of tracking deasserts, how do you like the following:
>
> /* Keep line asserted until guest has handled the interrupt. */
> #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> /* Notify after line is deasserted. */
> #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
>
> struct kvm_irqfd {
> __u32 fd;
> __u32 gsi;
> __u32 flags;
> /* eventfd to notify when line is deasserted */
> __u32 deassert_eventfd;
> __u8 pad[16];
> };
>
> now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> effective for level interrupts.
>
> Notes about lifetime of objects:
> - closing deassert_eventfd does nothing (we can keep
> reference to it from irqfd so no need for
> complex polling/flushing scheme)
> - closing irqfd or deasserting dis-associates
> deassert_eventfd automatically
> - source id is internal to irqfd and goes away with it
>
> it looks harder to misuse and fits what we want to do nicely,
> and needs less code to implement.

This is effectively what I meant when I suggested we either need to a)
pull eoifd into irqfd or b) implement them as modular components. I
chose to implement b) because I think that non-irqfd related ack
notification to userspace will be useful and a) does not provide that.
So this interface enables exactly the use case for device assignment and
no more. I feel like this is the start of an ioctl that will be quickly
deprecated, but if that's the direction we want to go, I'll write the
code. Thanks,

Alex

2012-08-13 21:34:06

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> >>
> >> > Regarding the implementation, instead of a linked list, would an array
> >> > of counters parallel to the bitmap make it simpler?
> >>
> >> Or even, replace the bitmap with an array of counters.
> >
> > I'm not sure a counter array is what we're really after. That gives us
> > reference counting for the irq source IDs, but not the key->gsi lookup.
>
> You can look up the gsi while registering the eoifd, so it's accessible
> as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> while the eoifd is still active, but is this a problem?

In my opinion, no, but Michael disagrees.

> > It also highlights another issue, that we have a limited set of source
> > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > for the shared userspace ID and another for the PIT. How happy are we
> > going to be with a limit of 62 level interrupts in use at one time?
>
> When we start being unhappy we can increase that number. On the other
> hand more locks and lists makes me unhappy now.

Yep, good point. My latest version removes the source ID object lock
and list (and objects). I still have a lock and list for the ack
notification, but it's hard not to unless we combine them into one
mega-irqfd ioctl as Michael suggests.

> > It's arguably a reasonable number since the most virtualization friendly
> > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > also very wasteful allocating an entire source ID for a single GSI
> > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > in the most optimal config, each go to different GSIs. So we could
> > theoretically be more efficient in our use and allocation of irq source
> > IDs if we tracked use by the source ID, gsi pair.
>
> There are, in one userspace, just three gsis available for PCI links, so
> you're compressing the source id space by 3.

I imagine there's a way to put each PCI interrupt pin on a GSI, but
still only 4, not a great expansion of source ID space. I like
Michael's idea of re-using source IDs if we run out better.

> > That probably makes it less practical to replace anything at the top
> > level with a counter array. The key that we pass back is currently the
> > actual source ID, but we don't specify what it is, so we could split it
> > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > an idr entry.
>
> We can fix those kinds of problems by adding another layer of
> indirection. But I doubt they will be needed. I don't see people
> assigning 60 legacy devices to one guest.

Yep, we can ignore it for now and put it in the hands of userspace to
re-use IDs if needed.

> > Michael, would the interface be more acceptable to you if we added
> > separate ioctls to allocate and free some representation of an irq
> > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > for an irq source ID/gsi object which would then be passed as a
> > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > representing the source id/gsi isn't magically freed on it's own. This
> > would also allow us to deassign/close one end and reconfigure it later.
> > Thanks,
>
> Another option is to push the responsibility for allocating IDs for the
> association to userspace. Let userspace both create the irqfd and the
> eoifd with the same ID, the kernel matches them at registration time and
> copies the gsi/sourceid from the first to the second eventfd.

Aside from the copying gsi/sourceid bit, you've just described my latest
attempt at this series. Specifying both a sourceid and gsi also allows
userspace to make better use of the sourceid address space (use more
than one gsi if userspace wants the complexity of managing them).
Thanks,

Alex

2012-08-13 21:59:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Mon, Aug 13, 2012 at 03:23:24PM -0600, Alex Williamson wrote:
> On Sun, 2012-08-12 at 12:33 +0300, Michael S. Tsirkin wrote:
> > On Thu, Aug 09, 2012 at 01:26:15PM -0600, Alex Williamson wrote:
> > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > > On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > > >
> > > > > Regarding the implementation, instead of a linked list, would an array
> > > > > of counters parallel to the bitmap make it simpler?
> > > >
> > > > Or even, replace the bitmap with an array of counters.
> > >
> > > I'm not sure a counter array is what we're really after. That gives us
> > > reference counting for the irq source IDs, but not the key->gsi lookup.
> > > It also highlights another issue, that we have a limited set of source
> > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > for the shared userspace ID and another for the PIT. How happy are we
> > > going to be with a limit of 62 level interrupts in use at one time?
> > >
> > > It's arguably a reasonable number since the most virtualization friendly
> > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > also very wasteful allocating an entire source ID for a single GSI
> > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > in the most optimal config, each go to different GSIs. So we could
> > > theoretically be more efficient in our use and allocation of irq source
> > > IDs if we tracked use by the source ID, gsi pair.
> > >
> > > That probably makes it less practical to replace anything at the top
> > > level with a counter array. The key that we pass back is currently the
> > > actual source ID, but we don't specify what it is, so we could split it
> > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > an idr entry.
> > >
> > > Michael, would the interface be more acceptable to you if we added
> > > separate ioctls to allocate and free some representation of an irq
> > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > for an irq source ID/gsi object which would then be passed as a
> > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > representing the source id/gsi isn't magically freed on it's own. This
> > > would also allow us to deassign/close one end and reconfigure it later.
> > > Thanks,
> > >
> > > Alex
> >
> > It's acceptable to me either way. I was only pointing out that as
> > designed, the interface looks simple at first but then you find out some
> > subtle limitations which are implementation driven. This gives
> > an overall feeling the abstraction is too low level.
> >
> > If we compare to the existing irqfd, isn't the difference
> > simply that irqfd deasserts immediately ATM, while we
> > want to delay this until later?
> >
> > If yes, then along the lines that you proposed, and combining with my
> > idea of tracking deasserts, how do you like the following:
> >
> > /* Keep line asserted until guest has handled the interrupt. */
> > #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> > /* Notify after line is deasserted. */
> > #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
> >
> > struct kvm_irqfd {
> > __u32 fd;
> > __u32 gsi;
> > __u32 flags;
> > /* eventfd to notify when line is deasserted */
> > __u32 deassert_eventfd;
> > __u8 pad[16];
> > };
> >
> > now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> > effective for level interrupts.
> >
> > Notes about lifetime of objects:
> > - closing deassert_eventfd does nothing (we can keep
> > reference to it from irqfd so no need for
> > complex polling/flushing scheme)
> > - closing irqfd or deasserting dis-associates
> > deassert_eventfd automatically
> > - source id is internal to irqfd and goes away with it
> >
> > it looks harder to misuse and fits what we want to do nicely,
> > and needs less code to implement.
>
> This is effectively what I meant when I suggested we either need to a)
> pull eoifd into irqfd or b) implement them as modular components. I
> chose to implement b) because I think that non-irqfd related ack
> notification to userspace will be useful and a) does not provide that.
> So this interface enables exactly the use case for device assignment and
> no more. I feel like this is the start of an ioctl that will be quickly
> deprecated, but if that's the direction we want to go, I'll write the
> code. Thanks,
>
> Alex

Sorry I wrote this before I knew we really do not need
the deassert on ack at all, existing irqfd is fine for level.


--
MST

2012-08-13 22:10:46

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Mon, Aug 13, 2012 at 03:34:01PM -0600, Alex Williamson wrote:
> On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> > On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > >>
> > >> > Regarding the implementation, instead of a linked list, would an array
> > >> > of counters parallel to the bitmap make it simpler?
> > >>
> > >> Or even, replace the bitmap with an array of counters.
> > >
> > > I'm not sure a counter array is what we're really after. That gives us
> > > reference counting for the irq source IDs, but not the key->gsi lookup.
> >
> > You can look up the gsi while registering the eoifd, so it's accessible
> > as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> > while the eoifd is still active, but is this a problem?
>
> In my opinion, no, but Michael disagrees.
>
> > > It also highlights another issue, that we have a limited set of source
> > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > for the shared userspace ID and another for the PIT. How happy are we
> > > going to be with a limit of 62 level interrupts in use at one time?
> >
> > When we start being unhappy we can increase that number. On the other
> > hand more locks and lists makes me unhappy now.
>
> Yep, good point. My latest version removes the source ID object lock
> and list (and objects). I still have a lock and list for the ack
> notification, but it's hard not to unless we combine them into one
> mega-irqfd ioctl as Michael suggests.
>
> > > It's arguably a reasonable number since the most virtualization friendly
> > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > also very wasteful allocating an entire source ID for a single GSI
> > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > in the most optimal config, each go to different GSIs. So we could
> > > theoretically be more efficient in our use and allocation of irq source
> > > IDs if we tracked use by the source ID, gsi pair.
> >
> > There are, in one userspace, just three gsis available for PCI links, so
> > you're compressing the source id space by 3.
>
> I imagine there's a way to put each PCI interrupt pin on a GSI, but
> still only 4, not a great expansion of source ID space. I like
> Michael's idea of re-using source IDs if we run out better.
>
> > > That probably makes it less practical to replace anything at the top
> > > level with a counter array. The key that we pass back is currently the
> > > actual source ID, but we don't specify what it is, so we could split it
> > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > an idr entry.
> >
> > We can fix those kinds of problems by adding another layer of
> > indirection. But I doubt they will be needed. I don't see people
> > assigning 60 legacy devices to one guest.
>
> Yep, we can ignore it for now and put it in the hands of userspace to
> re-use IDs if needed.
>
> > > Michael, would the interface be more acceptable to you if we added
> > > separate ioctls to allocate and free some representation of an irq
> > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > for an irq source ID/gsi object which would then be passed as a
> > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > representing the source id/gsi isn't magically freed on it's own. This
> > > would also allow us to deassign/close one end and reconfigure it later.
> > > Thanks,
> >
> > Another option is to push the responsibility for allocating IDs for the
> > association to userspace. Let userspace both create the irqfd and the
> > eoifd with the same ID, the kernel matches them at registration time and
> > copies the gsi/sourceid from the first to the second eventfd.
>
> Aside from the copying gsi/sourceid bit, you've just described my latest
> attempt at this series. Specifying both a sourceid and gsi also allows
> userspace to make better use of the sourceid address space (use more
> than one gsi if userspace wants the complexity of managing them).
> Thanks,
>
> Alex

Turns out per device source ID is a bug copied from existing
device assignment. I am amazed we did not notice before.
There we have small # of devices so it's not a problem but there's no
reason just not to have a source ID for all irqfds.
So the problem goes away, and there is no limit on # of level irqfds,
and no need to manage IDs in userspace at all.
You can still have cookies in userspace if you like but do not map them
to source IDs.

--
MST

2012-08-13 22:41:11

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Tue, 2012-08-14 at 01:06 +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 13, 2012 at 03:34:01PM -0600, Alex Williamson wrote:
> > On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> > > On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > > >>
> > > >> > Regarding the implementation, instead of a linked list, would an array
> > > >> > of counters parallel to the bitmap make it simpler?
> > > >>
> > > >> Or even, replace the bitmap with an array of counters.
> > > >
> > > > I'm not sure a counter array is what we're really after. That gives us
> > > > reference counting for the irq source IDs, but not the key->gsi lookup.
> > >
> > > You can look up the gsi while registering the eoifd, so it's accessible
> > > as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> > > while the eoifd is still active, but is this a problem?
> >
> > In my opinion, no, but Michael disagrees.
> >
> > > > It also highlights another issue, that we have a limited set of source
> > > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > > for the shared userspace ID and another for the PIT. How happy are we
> > > > going to be with a limit of 62 level interrupts in use at one time?
> > >
> > > When we start being unhappy we can increase that number. On the other
> > > hand more locks and lists makes me unhappy now.
> >
> > Yep, good point. My latest version removes the source ID object lock
> > and list (and objects). I still have a lock and list for the ack
> > notification, but it's hard not to unless we combine them into one
> > mega-irqfd ioctl as Michael suggests.
> >
> > > > It's arguably a reasonable number since the most virtualization friendly
> > > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > > also very wasteful allocating an entire source ID for a single GSI
> > > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > > in the most optimal config, each go to different GSIs. So we could
> > > > theoretically be more efficient in our use and allocation of irq source
> > > > IDs if we tracked use by the source ID, gsi pair.
> > >
> > > There are, in one userspace, just three gsis available for PCI links, so
> > > you're compressing the source id space by 3.
> >
> > I imagine there's a way to put each PCI interrupt pin on a GSI, but
> > still only 4, not a great expansion of source ID space. I like
> > Michael's idea of re-using source IDs if we run out better.
> >
> > > > That probably makes it less practical to replace anything at the top
> > > > level with a counter array. The key that we pass back is currently the
> > > > actual source ID, but we don't specify what it is, so we could split it
> > > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > > an idr entry.
> > >
> > > We can fix those kinds of problems by adding another layer of
> > > indirection. But I doubt they will be needed. I don't see people
> > > assigning 60 legacy devices to one guest.
> >
> > Yep, we can ignore it for now and put it in the hands of userspace to
> > re-use IDs if needed.
> >
> > > > Michael, would the interface be more acceptable to you if we added
> > > > separate ioctls to allocate and free some representation of an irq
> > > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > > for an irq source ID/gsi object which would then be passed as a
> > > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > > representing the source id/gsi isn't magically freed on it's own. This
> > > > would also allow us to deassign/close one end and reconfigure it later.
> > > > Thanks,
> > >
> > > Another option is to push the responsibility for allocating IDs for the
> > > association to userspace. Let userspace both create the irqfd and the
> > > eoifd with the same ID, the kernel matches them at registration time and
> > > copies the gsi/sourceid from the first to the second eventfd.
> >
> > Aside from the copying gsi/sourceid bit, you've just described my latest
> > attempt at this series. Specifying both a sourceid and gsi also allows
> > userspace to make better use of the sourceid address space (use more
> > than one gsi if userspace wants the complexity of managing them).
> > Thanks,
> >
> > Alex
>
> Turns out per device source ID is a bug copied from existing
> device assignment. I am amazed we did not notice before.
> There we have small # of devices so it's not a problem but there's no
> reason just not to have a source ID for all irqfds.
> So the problem goes away, and there is no limit on # of level irqfds,
> and no need to manage IDs in userspace at all.
> You can still have cookies in userspace if you like but do not map them
> to source IDs.

IMHO it's not a bug, it's an implementation decision. They could be
shared, but that doesn't make it wrong to not share them. Given that we
have 32 memory slots, the only way you could hit this would be to have a
lot of really slow devices that don't direct-map any BARs. A reason to
not have the same source id for everything is that I think we can do ack
notification filtering more easily using separate source ids (as is done
in the first patch of the v8 series). As the code is today, I agree,
there's probably no advantage to using multiple source IDs. Thanks,

Alex

2012-08-13 22:59:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Mon, Aug 13, 2012 at 04:41:05PM -0600, Alex Williamson wrote:
> On Tue, 2012-08-14 at 01:06 +0300, Michael S. Tsirkin wrote:
> > On Mon, Aug 13, 2012 at 03:34:01PM -0600, Alex Williamson wrote:
> > > On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> > > > On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > > > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > > >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > > > >>
> > > > >> > Regarding the implementation, instead of a linked list, would an array
> > > > >> > of counters parallel to the bitmap make it simpler?
> > > > >>
> > > > >> Or even, replace the bitmap with an array of counters.
> > > > >
> > > > > I'm not sure a counter array is what we're really after. That gives us
> > > > > reference counting for the irq source IDs, but not the key->gsi lookup.
> > > >
> > > > You can look up the gsi while registering the eoifd, so it's accessible
> > > > as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> > > > while the eoifd is still active, but is this a problem?
> > >
> > > In my opinion, no, but Michael disagrees.
> > >
> > > > > It also highlights another issue, that we have a limited set of source
> > > > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > > > for the shared userspace ID and another for the PIT. How happy are we
> > > > > going to be with a limit of 62 level interrupts in use at one time?
> > > >
> > > > When we start being unhappy we can increase that number. On the other
> > > > hand more locks and lists makes me unhappy now.
> > >
> > > Yep, good point. My latest version removes the source ID object lock
> > > and list (and objects). I still have a lock and list for the ack
> > > notification, but it's hard not to unless we combine them into one
> > > mega-irqfd ioctl as Michael suggests.
> > >
> > > > > It's arguably a reasonable number since the most virtualization friendly
> > > > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > > > also very wasteful allocating an entire source ID for a single GSI
> > > > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > > > in the most optimal config, each go to different GSIs. So we could
> > > > > theoretically be more efficient in our use and allocation of irq source
> > > > > IDs if we tracked use by the source ID, gsi pair.
> > > >
> > > > There are, in one userspace, just three gsis available for PCI links, so
> > > > you're compressing the source id space by 3.
> > >
> > > I imagine there's a way to put each PCI interrupt pin on a GSI, but
> > > still only 4, not a great expansion of source ID space. I like
> > > Michael's idea of re-using source IDs if we run out better.
> > >
> > > > > That probably makes it less practical to replace anything at the top
> > > > > level with a counter array. The key that we pass back is currently the
> > > > > actual source ID, but we don't specify what it is, so we could split it
> > > > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > > > an idr entry.
> > > >
> > > > We can fix those kinds of problems by adding another layer of
> > > > indirection. But I doubt they will be needed. I don't see people
> > > > assigning 60 legacy devices to one guest.
> > >
> > > Yep, we can ignore it for now and put it in the hands of userspace to
> > > re-use IDs if needed.
> > >
> > > > > Michael, would the interface be more acceptable to you if we added
> > > > > separate ioctls to allocate and free some representation of an irq
> > > > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > > > for an irq source ID/gsi object which would then be passed as a
> > > > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > > > representing the source id/gsi isn't magically freed on it's own. This
> > > > > would also allow us to deassign/close one end and reconfigure it later.
> > > > > Thanks,
> > > >
> > > > Another option is to push the responsibility for allocating IDs for the
> > > > association to userspace. Let userspace both create the irqfd and the
> > > > eoifd with the same ID, the kernel matches them at registration time and
> > > > copies the gsi/sourceid from the first to the second eventfd.
> > >
> > > Aside from the copying gsi/sourceid bit, you've just described my latest
> > > attempt at this series. Specifying both a sourceid and gsi also allows
> > > userspace to make better use of the sourceid address space (use more
> > > than one gsi if userspace wants the complexity of managing them).
> > > Thanks,
> > >
> > > Alex
> >
> > Turns out per device source ID is a bug copied from existing
> > device assignment. I am amazed we did not notice before.
> > There we have small # of devices so it's not a problem but there's no
> > reason just not to have a source ID for all irqfds.
> > So the problem goes away, and there is no limit on # of level irqfds,
> > and no need to manage IDs in userspace at all.
> > You can still have cookies in userspace if you like but do not map them
> > to source IDs.
>
> IMHO it's not a bug, it's an implementation decision. They could be
> shared, but that doesn't make it wrong to not share them. Given that we
> have 32 memory slots, the only way you could hit this would be to have a
> lot of really slow devices that don't direct-map any BARs. A reason to
> not have the same source id for everything is that I think we can do ack
> notification filtering more easily using separate source ids (as is done
> in the first patch of the v8 series).

Just a thought: can filtering read and clear the irqfd counter?

> As the code is today, I agree,
> there's probably no advantage to using multiple source IDs. Thanks,
>
> Alex

I think one point worth addressing is, Gleb wanted
to get eoifd without irqfd at all and that works for
timer interrupt.

--
MST

2012-08-14 03:09:48

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Tue, 2012-08-14 at 02:00 +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 13, 2012 at 04:41:05PM -0600, Alex Williamson wrote:
> > On Tue, 2012-08-14 at 01:06 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Aug 13, 2012 at 03:34:01PM -0600, Alex Williamson wrote:
> > > > On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> > > > > On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > > > > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > > > >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > > > > >>
> > > > > >> > Regarding the implementation, instead of a linked list, would an array
> > > > > >> > of counters parallel to the bitmap make it simpler?
> > > > > >>
> > > > > >> Or even, replace the bitmap with an array of counters.
> > > > > >
> > > > > > I'm not sure a counter array is what we're really after. That gives us
> > > > > > reference counting for the irq source IDs, but not the key->gsi lookup.
> > > > >
> > > > > You can look up the gsi while registering the eoifd, so it's accessible
> > > > > as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> > > > > while the eoifd is still active, but is this a problem?
> > > >
> > > > In my opinion, no, but Michael disagrees.
> > > >
> > > > > > It also highlights another issue, that we have a limited set of source
> > > > > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > > > > for the shared userspace ID and another for the PIT. How happy are we
> > > > > > going to be with a limit of 62 level interrupts in use at one time?
> > > > >
> > > > > When we start being unhappy we can increase that number. On the other
> > > > > hand more locks and lists makes me unhappy now.
> > > >
> > > > Yep, good point. My latest version removes the source ID object lock
> > > > and list (and objects). I still have a lock and list for the ack
> > > > notification, but it's hard not to unless we combine them into one
> > > > mega-irqfd ioctl as Michael suggests.
> > > >
> > > > > > It's arguably a reasonable number since the most virtualization friendly
> > > > > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > > > > also very wasteful allocating an entire source ID for a single GSI
> > > > > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > > > > in the most optimal config, each go to different GSIs. So we could
> > > > > > theoretically be more efficient in our use and allocation of irq source
> > > > > > IDs if we tracked use by the source ID, gsi pair.
> > > > >
> > > > > There are, in one userspace, just three gsis available for PCI links, so
> > > > > you're compressing the source id space by 3.
> > > >
> > > > I imagine there's a way to put each PCI interrupt pin on a GSI, but
> > > > still only 4, not a great expansion of source ID space. I like
> > > > Michael's idea of re-using source IDs if we run out better.
> > > >
> > > > > > That probably makes it less practical to replace anything at the top
> > > > > > level with a counter array. The key that we pass back is currently the
> > > > > > actual source ID, but we don't specify what it is, so we could split it
> > > > > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > > > > an idr entry.
> > > > >
> > > > > We can fix those kinds of problems by adding another layer of
> > > > > indirection. But I doubt they will be needed. I don't see people
> > > > > assigning 60 legacy devices to one guest.
> > > >
> > > > Yep, we can ignore it for now and put it in the hands of userspace to
> > > > re-use IDs if needed.
> > > >
> > > > > > Michael, would the interface be more acceptable to you if we added
> > > > > > separate ioctls to allocate and free some representation of an irq
> > > > > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > > > > for an irq source ID/gsi object which would then be passed as a
> > > > > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > > > > representing the source id/gsi isn't magically freed on it's own. This
> > > > > > would also allow us to deassign/close one end and reconfigure it later.
> > > > > > Thanks,
> > > > >
> > > > > Another option is to push the responsibility for allocating IDs for the
> > > > > association to userspace. Let userspace both create the irqfd and the
> > > > > eoifd with the same ID, the kernel matches them at registration time and
> > > > > copies the gsi/sourceid from the first to the second eventfd.
> > > >
> > > > Aside from the copying gsi/sourceid bit, you've just described my latest
> > > > attempt at this series. Specifying both a sourceid and gsi also allows
> > > > userspace to make better use of the sourceid address space (use more
> > > > than one gsi if userspace wants the complexity of managing them).
> > > > Thanks,
> > > >
> > > > Alex
> > >
> > > Turns out per device source ID is a bug copied from existing
> > > device assignment. I am amazed we did not notice before.
> > > There we have small # of devices so it's not a problem but there's no
> > > reason just not to have a source ID for all irqfds.
> > > So the problem goes away, and there is no limit on # of level irqfds,
> > > and no need to manage IDs in userspace at all.
> > > You can still have cookies in userspace if you like but do not map them
> > > to source IDs.
> >
> > IMHO it's not a bug, it's an implementation decision. They could be
> > shared, but that doesn't make it wrong to not share them. Given that we
> > have 32 memory slots, the only way you could hit this would be to have a
> > lot of really slow devices that don't direct-map any BARs. A reason to
> > not have the same source id for everything is that I think we can do ack
> > notification filtering more easily using separate source ids (as is done
> > in the first patch of the v8 series).
>
> Just a thought: can filtering read and clear the irqfd counter?

Sorry, what's "the irqfd counter"? The eventfd counter? As I have it
in the patch series, the filtering happens where the irq ack notifier
calls the individual notifier callbacks. That's not irqfd/eventfd
specific, so it doesn't have access to the eventfd counter there.
Taking the filtering into the into the actual callbacks seems to require
locking or maybe your proposed test and clear interface (which still
requires locking).

> > As the code is today, I agree,
> > there's probably no advantage to using multiple source IDs. Thanks,
> >
> > Alex
>
> I think one point worth addressing is, Gleb wanted
> to get eoifd without irqfd at all and that works for
> timer interrupt.

Right, that's what I'm referring to with the modular components vs
pulling eoifd into irqfd. One gives us interfaces that can easily be
extended or already supports a more generic eoifd, the other gives us a
very specific use case and we'll have to come up with something else for
non-irqfd related eois. Thanks,

Alex

2012-08-14 08:34:51

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Mon, Aug 13, 2012 at 09:09:43PM -0600, Alex Williamson wrote:
> On Tue, 2012-08-14 at 02:00 +0300, Michael S. Tsirkin wrote:
> > On Mon, Aug 13, 2012 at 04:41:05PM -0600, Alex Williamson wrote:
> > > On Tue, 2012-08-14 at 01:06 +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Aug 13, 2012 at 03:34:01PM -0600, Alex Williamson wrote:
> > > > > On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> > > > > > On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > > > > > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > > > > >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > > > > > >>
> > > > > > >> > Regarding the implementation, instead of a linked list, would an array
> > > > > > >> > of counters parallel to the bitmap make it simpler?
> > > > > > >>
> > > > > > >> Or even, replace the bitmap with an array of counters.
> > > > > > >
> > > > > > > I'm not sure a counter array is what we're really after. That gives us
> > > > > > > reference counting for the irq source IDs, but not the key->gsi lookup.
> > > > > >
> > > > > > You can look up the gsi while registering the eoifd, so it's accessible
> > > > > > as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> > > > > > while the eoifd is still active, but is this a problem?
> > > > >
> > > > > In my opinion, no, but Michael disagrees.
> > > > >
> > > > > > > It also highlights another issue, that we have a limited set of source
> > > > > > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > > > > > for the shared userspace ID and another for the PIT. How happy are we
> > > > > > > going to be with a limit of 62 level interrupts in use at one time?
> > > > > >
> > > > > > When we start being unhappy we can increase that number. On the other
> > > > > > hand more locks and lists makes me unhappy now.
> > > > >
> > > > > Yep, good point. My latest version removes the source ID object lock
> > > > > and list (and objects). I still have a lock and list for the ack
> > > > > notification, but it's hard not to unless we combine them into one
> > > > > mega-irqfd ioctl as Michael suggests.
> > > > >
> > > > > > > It's arguably a reasonable number since the most virtualization friendly
> > > > > > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > > > > > also very wasteful allocating an entire source ID for a single GSI
> > > > > > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > > > > > in the most optimal config, each go to different GSIs. So we could
> > > > > > > theoretically be more efficient in our use and allocation of irq source
> > > > > > > IDs if we tracked use by the source ID, gsi pair.
> > > > > >
> > > > > > There are, in one userspace, just three gsis available for PCI links, so
> > > > > > you're compressing the source id space by 3.
> > > > >
> > > > > I imagine there's a way to put each PCI interrupt pin on a GSI, but
> > > > > still only 4, not a great expansion of source ID space. I like
> > > > > Michael's idea of re-using source IDs if we run out better.
> > > > >
> > > > > > > That probably makes it less practical to replace anything at the top
> > > > > > > level with a counter array. The key that we pass back is currently the
> > > > > > > actual source ID, but we don't specify what it is, so we could split it
> > > > > > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > > > > > an idr entry.
> > > > > >
> > > > > > We can fix those kinds of problems by adding another layer of
> > > > > > indirection. But I doubt they will be needed. I don't see people
> > > > > > assigning 60 legacy devices to one guest.
> > > > >
> > > > > Yep, we can ignore it for now and put it in the hands of userspace to
> > > > > re-use IDs if needed.
> > > > >
> > > > > > > Michael, would the interface be more acceptable to you if we added
> > > > > > > separate ioctls to allocate and free some representation of an irq
> > > > > > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > > > > > for an irq source ID/gsi object which would then be passed as a
> > > > > > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > > > > > representing the source id/gsi isn't magically freed on it's own. This
> > > > > > > would also allow us to deassign/close one end and reconfigure it later.
> > > > > > > Thanks,
> > > > > >
> > > > > > Another option is to push the responsibility for allocating IDs for the
> > > > > > association to userspace. Let userspace both create the irqfd and the
> > > > > > eoifd with the same ID, the kernel matches them at registration time and
> > > > > > copies the gsi/sourceid from the first to the second eventfd.
> > > > >
> > > > > Aside from the copying gsi/sourceid bit, you've just described my latest
> > > > > attempt at this series. Specifying both a sourceid and gsi also allows
> > > > > userspace to make better use of the sourceid address space (use more
> > > > > than one gsi if userspace wants the complexity of managing them).
> > > > > Thanks,
> > > > >
> > > > > Alex
> > > >
> > > > Turns out per device source ID is a bug copied from existing
> > > > device assignment. I am amazed we did not notice before.
> > > > There we have small # of devices so it's not a problem but there's no
> > > > reason just not to have a source ID for all irqfds.
> > > > So the problem goes away, and there is no limit on # of level irqfds,
> > > > and no need to manage IDs in userspace at all.
> > > > You can still have cookies in userspace if you like but do not map them
> > > > to source IDs.
> > >
> > > IMHO it's not a bug, it's an implementation decision. They could be
> > > shared, but that doesn't make it wrong to not share them. Given that we
> > > have 32 memory slots, the only way you could hit this would be to have a
> > > lot of really slow devices that don't direct-map any BARs. A reason to
> > > not have the same source id for everything is that I think we can do ack
> > > notification filtering more easily using separate source ids (as is done
> > > in the first patch of the v8 series).
> >
> > Just a thought: can filtering read and clear the irqfd counter?
>
> Sorry, what's "the irqfd counter"? The eventfd counter? As I have it
> in the patch series, the filtering happens where the irq ack notifier
> calls the individual notifier callbacks. That's not irqfd/eventfd
> specific, so it doesn't have access to the eventfd counter there.
> Taking the filtering into the into the actual callbacks seems to require
> locking or maybe your proposed test and clear interface (which still
> requires locking).
>
> > > As the code is today, I agree,
> > > there's probably no advantage to using multiple source IDs. Thanks,
> > >
> > > Alex
> >
> > I think one point worth addressing is, Gleb wanted
> > to get eoifd without irqfd at all and that works for
> > timer interrupt.
>
> Right, that's what I'm referring to with the modular components vs
> pulling eoifd into irqfd. One gives us interfaces that can easily be
> extended or already supports a more generic eoifd, the other gives us a
> very specific use case and we'll have to come up with something else for
> non-irqfd related eois. Thanks,
>
> Alex

Yes that is fine but previous versions tied eoifd to irqfd
so were not useful alone anyway. Will look at v8.

--
MST

2012-08-14 12:41:48

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On 08/12/2012 12:33 PM, Michael S. Tsirkin wrote:
>>
>> Michael, would the interface be more acceptable to you if we added
>> separate ioctls to allocate and free some representation of an irq
>> source ID, gsi pair? For instance, an ioctl might return an idr entry
>> for an irq source ID/gsi object which would then be passed as a
>> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
>> representing the source id/gsi isn't magically freed on it's own. This
>> would also allow us to deassign/close one end and reconfigure it later.
>> Thanks,
>>
>> Alex
>
> It's acceptable to me either way. I was only pointing out that as
> designed, the interface looks simple at first but then you find out some
> subtle limitations which are implementation driven. This gives
> an overall feeling the abstraction is too low level.
>
> If we compare to the existing irqfd, isn't the difference
> simply that irqfd deasserts immediately ATM, while we
> want to delay this until later?
>
> If yes, then along the lines that you proposed, and combining with my
> idea of tracking deasserts, how do you like the following:
>
> /* Keep line asserted until guest has handled the interrupt. */
> #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> /* Notify after line is deasserted. */
> #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
>
> struct kvm_irqfd {
> __u32 fd;
> __u32 gsi;
> __u32 flags;
> /* eventfd to notify when line is deasserted */
> __u32 deassert_eventfd;
> __u8 pad[16];
> };
>
> now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> effective for level interrupts.
>
> Notes about lifetime of objects:
> - closing deassert_eventfd does nothing (we can keep
> reference to it from irqfd so no need for
> complex polling/flushing scheme)
> - closing irqfd or deasserting dis-associates
> deassert_eventfd automatically
> - source id is internal to irqfd and goes away with it
>
> it looks harder to misuse and fits what we want to do nicely,
> and needs less code to implement.
>
> Avi, what do you think?

I think given all the complexity in the separate ioctl approach that
this makes sense. There are no lifetime issues or code to match the two
eventfds. Alex, would this API simplify the code?

Yet another option was raised in the past, and that was exiling ioapic
and pic to userspace. This moves the entire issue to userspace. The
cost is a new interface that implements the APIC bus (betweem APIC and
IOAPIC) and the INTACK sequence (between APIC and PIC), and potential
for performance regressions due to the PIC, IOAPIC, and PIT being in
userspace. We would still have to keep the IOAPIC/PIC in the kernel,
but no new features would be added.

However, this is a huge job. We could discuss this to death too but I
have the feeling the end result will be to choose the shorter path --
adding irqackfd/deassertfd/whateverwecallitfd.


--
error compiling committee.c: too many arguments to function

2012-08-14 14:49:45

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Tue, Aug 14, 2012 at 03:35:54PM +0300, Avi Kivity wrote:
> On 08/12/2012 12:33 PM, Michael S. Tsirkin wrote:
> >>
> >> Michael, would the interface be more acceptable to you if we added
> >> separate ioctls to allocate and free some representation of an irq
> >> source ID, gsi pair? For instance, an ioctl might return an idr entry
> >> for an irq source ID/gsi object which would then be passed as a
> >> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> >> representing the source id/gsi isn't magically freed on it's own. This
> >> would also allow us to deassign/close one end and reconfigure it later.
> >> Thanks,
> >>
> >> Alex
> >
> > It's acceptable to me either way. I was only pointing out that as
> > designed, the interface looks simple at first but then you find out some
> > subtle limitations which are implementation driven. This gives
> > an overall feeling the abstraction is too low level.
> >
> > If we compare to the existing irqfd, isn't the difference
> > simply that irqfd deasserts immediately ATM, while we
> > want to delay this until later?
> >
> > If yes, then along the lines that you proposed, and combining with my
> > idea of tracking deasserts, how do you like the following:
> >
> > /* Keep line asserted until guest has handled the interrupt. */
> > #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> > /* Notify after line is deasserted. */
> > #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
> >
> > struct kvm_irqfd {
> > __u32 fd;
> > __u32 gsi;
> > __u32 flags;
> > /* eventfd to notify when line is deasserted */
> > __u32 deassert_eventfd;
> > __u8 pad[16];
> > };
> >
> > now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> > effective for level interrupts.
> >
> > Notes about lifetime of objects:
> > - closing deassert_eventfd does nothing (we can keep
> > reference to it from irqfd so no need for
> > complex polling/flushing scheme)
> > - closing irqfd or deasserting dis-associates
> > deassert_eventfd automatically
> > - source id is internal to irqfd and goes away with it
> >
> > it looks harder to misuse and fits what we want to do nicely,
> > and needs less code to implement.
> >
> > Avi, what do you think?
>
> I think given all the complexity in the separate ioctl approach that
> this makes sense. There are no lifetime issues or code to match the two
> eventfds.

OK, it's fine with me too then. Pls disregard my earlier proposal to
deassert immediately; Gleb showed me it does not work.

> Alex, would this API simplify the code?
>
> Yet another option was raised in the past, and that was exiling ioapic
> and pic to userspace. This moves the entire issue to userspace. The
> cost is a new interface that implements the APIC bus (betweem APIC and
> IOAPIC) and the INTACK sequence (between APIC and PIC), and potential
> for performance regressions due to the PIC, IOAPIC, and PIT being in
> userspace. We would still have to keep the IOAPIC/PIC in the kernel,
> but no new features would be added.
>
> However, this is a huge job. We could discuss this to death too but I
> have the feeling the end result will be to choose the shorter path --
> adding irqackfd/deassertfd/whateverwecallitfd.
>
>
> --
> error compiling committee.c: too many arguments to function

2012-08-14 21:28:25

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Tue, 2012-08-14 at 11:35 +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 13, 2012 at 09:09:43PM -0600, Alex Williamson wrote:
> > On Tue, 2012-08-14 at 02:00 +0300, Michael S. Tsirkin wrote:
> > > On Mon, Aug 13, 2012 at 04:41:05PM -0600, Alex Williamson wrote:
> > > > On Tue, 2012-08-14 at 01:06 +0300, Michael S. Tsirkin wrote:
> > > > > On Mon, Aug 13, 2012 at 03:34:01PM -0600, Alex Williamson wrote:
> > > > > > On Sun, 2012-08-12 at 11:36 +0300, Avi Kivity wrote:
> > > > > > > On 08/09/2012 10:26 PM, Alex Williamson wrote:
> > > > > > > > On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> > > > > > > >> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> > > > > > > >>
> > > > > > > >> > Regarding the implementation, instead of a linked list, would an array
> > > > > > > >> > of counters parallel to the bitmap make it simpler?
> > > > > > > >>
> > > > > > > >> Or even, replace the bitmap with an array of counters.
> > > > > > > >
> > > > > > > > I'm not sure a counter array is what we're really after. That gives us
> > > > > > > > reference counting for the irq source IDs, but not the key->gsi lookup.
> > > > > > >
> > > > > > > You can look up the gsi while registering the eoifd, so it's accessible
> > > > > > > as eoifd->gsi instead of eoifd->source->gsi. The irqfd can go away
> > > > > > > while the eoifd is still active, but is this a problem?
> > > > > >
> > > > > > In my opinion, no, but Michael disagrees.
> > > > > >
> > > > > > > > It also highlights another issue, that we have a limited set of source
> > > > > > > > IDs. Looks like we have BITS_PER_LONG IDs, with two already used, one
> > > > > > > > for the shared userspace ID and another for the PIT. How happy are we
> > > > > > > > going to be with a limit of 62 level interrupts in use at one time?
> > > > > > >
> > > > > > > When we start being unhappy we can increase that number. On the other
> > > > > > > hand more locks and lists makes me unhappy now.
> > > > > >
> > > > > > Yep, good point. My latest version removes the source ID object lock
> > > > > > and list (and objects). I still have a lock and list for the ack
> > > > > > notification, but it's hard not to unless we combine them into one
> > > > > > mega-irqfd ioctl as Michael suggests.
> > > > > >
> > > > > > > > It's arguably a reasonable number since the most virtualization friendly
> > > > > > > > devices (sr-iov VFs) don't even support this kind of interrupt. It's
> > > > > > > > also very wasteful allocating an entire source ID for a single GSI
> > > > > > > > within that source ID. PCI supports interrupts A, B, C, and D, which,
> > > > > > > > in the most optimal config, each go to different GSIs. So we could
> > > > > > > > theoretically be more efficient in our use and allocation of irq source
> > > > > > > > IDs if we tracked use by the source ID, gsi pair.
> > > > > > >
> > > > > > > There are, in one userspace, just three gsis available for PCI links, so
> > > > > > > you're compressing the source id space by 3.
> > > > > >
> > > > > > I imagine there's a way to put each PCI interrupt pin on a GSI, but
> > > > > > still only 4, not a great expansion of source ID space. I like
> > > > > > Michael's idea of re-using source IDs if we run out better.
> > > > > >
> > > > > > > > That probably makes it less practical to replace anything at the top
> > > > > > > > level with a counter array. The key that we pass back is currently the
> > > > > > > > actual source ID, but we don't specify what it is, so we could split it
> > > > > > > > and have it encode a 16bit source ID plus 16 bit GSI. It could also be
> > > > > > > > an idr entry.
> > > > > > >
> > > > > > > We can fix those kinds of problems by adding another layer of
> > > > > > > indirection. But I doubt they will be needed. I don't see people
> > > > > > > assigning 60 legacy devices to one guest.
> > > > > >
> > > > > > Yep, we can ignore it for now and put it in the hands of userspace to
> > > > > > re-use IDs if needed.
> > > > > >
> > > > > > > > Michael, would the interface be more acceptable to you if we added
> > > > > > > > separate ioctls to allocate and free some representation of an irq
> > > > > > > > source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > > > > > > for an irq source ID/gsi object which would then be passed as a
> > > > > > > > parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > > > > > > representing the source id/gsi isn't magically freed on it's own. This
> > > > > > > > would also allow us to deassign/close one end and reconfigure it later.
> > > > > > > > Thanks,
> > > > > > >
> > > > > > > Another option is to push the responsibility for allocating IDs for the
> > > > > > > association to userspace. Let userspace both create the irqfd and the
> > > > > > > eoifd with the same ID, the kernel matches them at registration time and
> > > > > > > copies the gsi/sourceid from the first to the second eventfd.
> > > > > >
> > > > > > Aside from the copying gsi/sourceid bit, you've just described my latest
> > > > > > attempt at this series. Specifying both a sourceid and gsi also allows
> > > > > > userspace to make better use of the sourceid address space (use more
> > > > > > than one gsi if userspace wants the complexity of managing them).
> > > > > > Thanks,
> > > > > >
> > > > > > Alex
> > > > >
> > > > > Turns out per device source ID is a bug copied from existing
> > > > > device assignment. I am amazed we did not notice before.
> > > > > There we have small # of devices so it's not a problem but there's no
> > > > > reason just not to have a source ID for all irqfds.
> > > > > So the problem goes away, and there is no limit on # of level irqfds,
> > > > > and no need to manage IDs in userspace at all.
> > > > > You can still have cookies in userspace if you like but do not map them
> > > > > to source IDs.
> > > >
> > > > IMHO it's not a bug, it's an implementation decision. They could be
> > > > shared, but that doesn't make it wrong to not share them. Given that we
> > > > have 32 memory slots, the only way you could hit this would be to have a
> > > > lot of really slow devices that don't direct-map any BARs. A reason to
> > > > not have the same source id for everything is that I think we can do ack
> > > > notification filtering more easily using separate source ids (as is done
> > > > in the first patch of the v8 series).
> > >
> > > Just a thought: can filtering read and clear the irqfd counter?
> >
> > Sorry, what's "the irqfd counter"? The eventfd counter? As I have it
> > in the patch series, the filtering happens where the irq ack notifier
> > calls the individual notifier callbacks. That's not irqfd/eventfd
> > specific, so it doesn't have access to the eventfd counter there.
> > Taking the filtering into the into the actual callbacks seems to require
> > locking or maybe your proposed test and clear interface (which still
> > requires locking).
> >
> > > > As the code is today, I agree,
> > > > there's probably no advantage to using multiple source IDs. Thanks,
> > > >
> > > > Alex
> > >
> > > I think one point worth addressing is, Gleb wanted
> > > to get eoifd without irqfd at all and that works for
> > > timer interrupt.
> >
> > Right, that's what I'm referring to with the modular components vs
> > pulling eoifd into irqfd. One gives us interfaces that can easily be
> > extended or already supports a more generic eoifd, the other gives us a
> > very specific use case and we'll have to come up with something else for
> > non-irqfd related eois. Thanks,
> >
> > Alex
>
> Yes that is fine but previous versions tied eoifd to irqfd
> so were not useful alone anyway. Will look at v8.

This is because earlier feedback rejected creating a version of the
ioctl that had no users. It was only a matter of adding a flag to
indicate kvm_eoifd.key was actually a gsi and some trivial code changes
to enable such an interface. v8 builds the interface in the other
direction, so I left the notify-only, untied version as the base.
Thanks,

Alex

2012-08-14 22:01:21

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Tue, 2012-08-14 at 15:35 +0300, Avi Kivity wrote:
> On 08/12/2012 12:33 PM, Michael S. Tsirkin wrote:
> >>
> >> Michael, would the interface be more acceptable to you if we added
> >> separate ioctls to allocate and free some representation of an irq
> >> source ID, gsi pair? For instance, an ioctl might return an idr entry
> >> for an irq source ID/gsi object which would then be passed as a
> >> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> >> representing the source id/gsi isn't magically freed on it's own. This
> >> would also allow us to deassign/close one end and reconfigure it later.
> >> Thanks,
> >>
> >> Alex
> >
> > It's acceptable to me either way. I was only pointing out that as
> > designed, the interface looks simple at first but then you find out some
> > subtle limitations which are implementation driven. This gives
> > an overall feeling the abstraction is too low level.
> >
> > If we compare to the existing irqfd, isn't the difference
> > simply that irqfd deasserts immediately ATM, while we
> > want to delay this until later?
> >
> > If yes, then along the lines that you proposed, and combining with my
> > idea of tracking deasserts, how do you like the following:
> >
> > /* Keep line asserted until guest has handled the interrupt. */
> > #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> > /* Notify after line is deasserted. */
> > #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
> >
> > struct kvm_irqfd {
> > __u32 fd;
> > __u32 gsi;
> > __u32 flags;
> > /* eventfd to notify when line is deasserted */
> > __u32 deassert_eventfd;
> > __u8 pad[16];
> > };
> >
> > now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> > effective for level interrupts.
> >
> > Notes about lifetime of objects:
> > - closing deassert_eventfd does nothing (we can keep
> > reference to it from irqfd so no need for
> > complex polling/flushing scheme)
> > - closing irqfd or deasserting dis-associates
> > deassert_eventfd automatically
> > - source id is internal to irqfd and goes away with it
> >
> > it looks harder to misuse and fits what we want to do nicely,
> > and needs less code to implement.
> >
> > Avi, what do you think?
>
> I think given all the complexity in the separate ioctl approach that
> this makes sense. There are no lifetime issues or code to match the two
> eventfds. Alex, would this API simplify the code?

It does though I'm concerned that it's a very specific solution that
only addresses this problem. Generic userspace eoi/ack is not
addressed. The latest version using separate ioctls does a lot of
simplification by exposing irq sourceids. The bulk of the code there is
duplicating what irqfd does just so we can catch the POLLHUP for
cleanup. If there was an easier way to do that, we don't care about
POLLIN/POLLOUT, much of the code could be removed. Alternatively we
could make some common infrastructure to simplify both irqfd and
irq_ackfd, but how to frame the helpers isn't easy.

> Yet another option was raised in the past, and that was exiling ioapic
> and pic to userspace. This moves the entire issue to userspace. The
> cost is a new interface that implements the APIC bus (betweem APIC and
> IOAPIC) and the INTACK sequence (between APIC and PIC), and potential
> for performance regressions due to the PIC, IOAPIC, and PIT being in
> userspace. We would still have to keep the IOAPIC/PIC in the kernel,
> but no new features would be added.

Doesn't this assure a performance regression or are we assuming anywhere
we care about performance we're using MSI? Thanks,

Alex

2012-08-14 23:03:17

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Tue, Aug 14, 2012 at 04:01:15PM -0600, Alex Williamson wrote:
> On Tue, 2012-08-14 at 15:35 +0300, Avi Kivity wrote:
> > On 08/12/2012 12:33 PM, Michael S. Tsirkin wrote:
> > >>
> > >> Michael, would the interface be more acceptable to you if we added
> > >> separate ioctls to allocate and free some representation of an irq
> > >> source ID, gsi pair? For instance, an ioctl might return an idr entry
> > >> for an irq source ID/gsi object which would then be passed as a
> > >> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > >> representing the source id/gsi isn't magically freed on it's own. This
> > >> would also allow us to deassign/close one end and reconfigure it later.
> > >> Thanks,
> > >>
> > >> Alex
> > >
> > > It's acceptable to me either way. I was only pointing out that as
> > > designed, the interface looks simple at first but then you find out some
> > > subtle limitations which are implementation driven. This gives
> > > an overall feeling the abstraction is too low level.
> > >
> > > If we compare to the existing irqfd, isn't the difference
> > > simply that irqfd deasserts immediately ATM, while we
> > > want to delay this until later?
> > >
> > > If yes, then along the lines that you proposed, and combining with my
> > > idea of tracking deasserts, how do you like the following:
> > >
> > > /* Keep line asserted until guest has handled the interrupt. */
> > > #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> > > /* Notify after line is deasserted. */
> > > #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
> > >
> > > struct kvm_irqfd {
> > > __u32 fd;
> > > __u32 gsi;
> > > __u32 flags;
> > > /* eventfd to notify when line is deasserted */
> > > __u32 deassert_eventfd;
> > > __u8 pad[16];
> > > };
> > >
> > > now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> > > effective for level interrupts.
> > >
> > > Notes about lifetime of objects:
> > > - closing deassert_eventfd does nothing (we can keep
> > > reference to it from irqfd so no need for
> > > complex polling/flushing scheme)
> > > - closing irqfd or deasserting dis-associates
> > > deassert_eventfd automatically
> > > - source id is internal to irqfd and goes away with it
> > >
> > > it looks harder to misuse and fits what we want to do nicely,
> > > and needs less code to implement.
> > >
> > > Avi, what do you think?
> >
> > I think given all the complexity in the separate ioctl approach that
> > this makes sense. There are no lifetime issues or code to match the two
> > eventfds. Alex, would this API simplify the code?
>
> It does though I'm concerned that it's a very specific solution that
> only addresses this problem. Generic userspace eoi/ack is not
> addressed. The latest version using separate ioctls does a lot of
> simplification by exposing irq sourceids. The bulk of the code there is
> duplicating what irqfd does just so we can catch the POLLHUP for
> cleanup. If there was an easier way to do that, we don't care about
> POLLIN/POLLOUT, much of the code could be removed. Alternatively we
> could make some common infrastructure to simplify both irqfd and
> irq_ackfd, but how to frame the helpers isn't easy.

There is way easier with a single ioctl. Don't you see?

As ack_eventfd pointer becomes part of the irqfd structure now, you
simply drop the reference together with irqfd.
In other words you do not care that ack eventfd goes
away anymore. So no need for POLLHUP handlers, no
separate DEASSERT that can race with that, etc.

So all this code just goes away, and it goes away completely, together
with managing source IDs (source ID comes an internal optimization to
avoid spurious EOIs, so no need to expose it to userspace anymore).

So all we are left with is minimal:
1. change irqfds to use a separate source id (can do this
unconditionally for all irqfds)
2. check deassert on ack, if set register ack notifier
3. in ack notifier check deassert eventfd, if set signal it
4. (optionally) add a flag in irqfd, set on assert, test and clear
on deassert, and only signal eventfd if it was set

on top of that we could try to do
5. allocate some more source IDs and if they are free try to use them as
an optimization to avoid atomics


> > Yet another option was raised in the past, and that was exiling ioapic
> > and pic to userspace. This moves the entire issue to userspace. The
> > cost is a new interface that implements the APIC bus (betweem APIC and
> > IOAPIC) and the INTACK sequence (between APIC and PIC), and potential
> > for performance regressions due to the PIC, IOAPIC, and PIT being in
> > userspace. We would still have to keep the IOAPIC/PIC in the kernel,
> > but no new features would be added.
>
> Doesn't this assure a performance regression or are we assuming anywhere
> we care about performance we're using MSI? Thanks,
>
> Alex

2012-08-14 23:26:32

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On Wed, 2012-08-15 at 02:04 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 14, 2012 at 04:01:15PM -0600, Alex Williamson wrote:
> > On Tue, 2012-08-14 at 15:35 +0300, Avi Kivity wrote:
> > > On 08/12/2012 12:33 PM, Michael S. Tsirkin wrote:
> > > >>
> > > >> Michael, would the interface be more acceptable to you if we added
> > > >> separate ioctls to allocate and free some representation of an irq
> > > >> source ID, gsi pair? For instance, an ioctl might return an idr entry
> > > >> for an irq source ID/gsi object which would then be passed as a
> > > >> parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
> > > >> representing the source id/gsi isn't magically freed on it's own. This
> > > >> would also allow us to deassign/close one end and reconfigure it later.
> > > >> Thanks,
> > > >>
> > > >> Alex
> > > >
> > > > It's acceptable to me either way. I was only pointing out that as
> > > > designed, the interface looks simple at first but then you find out some
> > > > subtle limitations which are implementation driven. This gives
> > > > an overall feeling the abstraction is too low level.
> > > >
> > > > If we compare to the existing irqfd, isn't the difference
> > > > simply that irqfd deasserts immediately ATM, while we
> > > > want to delay this until later?
> > > >
> > > > If yes, then along the lines that you proposed, and combining with my
> > > > idea of tracking deasserts, how do you like the following:
> > > >
> > > > /* Keep line asserted until guest has handled the interrupt. */
> > > > #define KVM_IRQFD_FLAG_DEASSERT_ON_ACK (1 << 1)
> > > > /* Notify after line is deasserted. */
> > > > #define KVM_IRQFD_FLAG_DEASSERT_EVENTFD (2 << 1)
> > > >
> > > > struct kvm_irqfd {
> > > > __u32 fd;
> > > > __u32 gsi;
> > > > __u32 flags;
> > > > /* eventfd to notify when line is deasserted */
> > > > __u32 deassert_eventfd;
> > > > __u8 pad[16];
> > > > };
> > > >
> > > > now the only limitation is that KVM_IRQFD_FLAG_DEASSERT_ON_ACK is only
> > > > effective for level interrupts.
> > > >
> > > > Notes about lifetime of objects:
> > > > - closing deassert_eventfd does nothing (we can keep
> > > > reference to it from irqfd so no need for
> > > > complex polling/flushing scheme)
> > > > - closing irqfd or deasserting dis-associates
> > > > deassert_eventfd automatically
> > > > - source id is internal to irqfd and goes away with it
> > > >
> > > > it looks harder to misuse and fits what we want to do nicely,
> > > > and needs less code to implement.
> > > >
> > > > Avi, what do you think?
> > >
> > > I think given all the complexity in the separate ioctl approach that
> > > this makes sense. There are no lifetime issues or code to match the two
> > > eventfds. Alex, would this API simplify the code?
> >
> > It does though I'm concerned that it's a very specific solution that
> > only addresses this problem. Generic userspace eoi/ack is not
> > addressed. The latest version using separate ioctls does a lot of
> > simplification by exposing irq sourceids. The bulk of the code there is
> > duplicating what irqfd does just so we can catch the POLLHUP for
> > cleanup. If there was an easier way to do that, we don't care about
> > POLLIN/POLLOUT, much of the code could be removed. Alternatively we
> > could make some common infrastructure to simplify both irqfd and
> > irq_ackfd, but how to frame the helpers isn't easy.
>
> There is way easier with a single ioctl. Don't you see?
>
> As ack_eventfd pointer becomes part of the irqfd structure now, you
> simply drop the reference together with irqfd.
> In other words you do not care that ack eventfd goes
> away anymore. So no need for POLLHUP handlers, no
> separate DEASSERT that can race with that, etc.
>
> So all this code just goes away, and it goes away completely, together
> with managing source IDs (source ID comes an internal optimization to
> avoid spurious EOIs, so no need to expose it to userspace anymore).
>
> So all we are left with is minimal:
> 1. change irqfds to use a separate source id (can do this
> unconditionally for all irqfds)
> 2. check deassert on ack, if set register ack notifier
> 3. in ack notifier check deassert eventfd, if set signal it
> 4. (optionally) add a flag in irqfd, set on assert, test and clear
> on deassert, and only signal eventfd if it was set
>
> on top of that we could try to do
> 5. allocate some more source IDs and if they are free try to use them as
> an optimization to avoid atomics

Yes, I understand. It's simple, it's also very specific to this
problem, and doesn't address generic ack notification. All of which
I've noted before and I continue to note that v8 offers simplifications
while retaining flexibility. Least amount of code doesn't really buy us
much if we end up needing to invent new interfaces down the road because
we've created such a specific solution here. Thanks,

Alex

2012-08-15 13:09:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

On 08/15/2012 02:26 AM, Alex Williamson wrote:
>
> Yes, I understand. It's simple, it's also very specific to this
> problem, and doesn't address generic ack notification. All of which
> I've noted before and I continue to note that v8 offers simplifications
> while retaining flexibility. Least amount of code doesn't really buy us
> much if we end up needing to invent new interfaces down the road because
> we've created such a specific solution here. Thanks,
>

One side of the coin is trying to create one generic interface instead
of multiple specific interfaces. The other side is that by providing a
generic interface, you sometimes expose internal implementation details,
or you constrain future development in order to preserve that interface.
If the generic interface is not actually exploited, you get pain for no
gain.

This tradeoff is different for every feature. Right now I'm leaning
towards specialized interfaces here, because we expose quite a lot of
low-level detail. However I'll review v8 soon and see.

--
error compiling committee.c: too many arguments to function