2012-02-05 13:14:24

by Avi Kivity

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/03/2012 12:13 AM, Rob Earhart wrote:
> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <[email protected]
> <mailto:[email protected]>> wrote:
>
> The kvm api has been accumulating cruft for several years now.
> This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API
> now, I'm
> writing this as a thought experiment to see where a from-scratch
> API can
> take us. Of course, if we do implement this, the new and old APIs
> will
> have to be supported side by side for several years.
>
> Syscalls
> --------
> kvm currently uses the much-loved ioctl() system call as its entry
> point. While this made it easy to add kvm to the kernel
> unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu
> mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a
> thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes. We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need
> several
> - syscalls into modules are harder and rarer than into core kernel
> code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick
> it up
> from current.
>
>
> <snipped>
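
For reference, the ioctl() entry point under discussion boils down to roughly
the flow below. This is a compressed sketch of the existing interface only;
guest memory setup, register state, and all error handling are omitted.

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);       /* VM file descriptor   */
    int cpu = ioctl(vm, KVM_CREATE_VCPU, 0);      /* vcpu file descriptor */

    /* The run structure is shared with the kernel via mmap() on the vcpu fd. */
    int sz = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                               MAP_SHARED, cpu, 0);

    for (;;) {
        ioctl(cpu, KVM_RUN, 0);                   /* enter the guest      */
        if (run->exit_reason == KVM_EXIT_HLT)     /* dispatch exits here  */
            break;
    }
    return 0;
}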
>
> I like the ioctl() interface. If the overhead matters in your hot path,

I can't say that it's a pressing problem, but it's not negligible.

> I suspect you're doing it wrong;

What am I doing wrong?

> use irq fds & ioevent fds. You might fix the semantic mismatch by
> having a notion of a "current process's VM" and "current thread's
> VCPU", and just use the one /dev/kvm filedescriptor.
>
> Or you could go the other way, and break the connection between VMs
> and processes / VCPUs and threads: I don't know how easy it is to do
> it in Linux, but a VCPU might be backed by a kernel thread, operated
> on via ioctl()s, indicating that they've exited the guest by having
> their descriptors become readable (and either use read() or mmap() to
> pull off the reason why the VCPU exited).

That breaks the ability to renice vcpu threads (unless you want the user
to renice kernel threads).

> This would allow for a variety of different programming styles for the
> VMM--I'm a fan of CSP model myself, but that's hard to do with the
> current API.

Just convert the synchronous API to an RPC over a pipe, in the vcpu
thread, and you have the asynchronous model you asked for.
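
A minimal sketch of that conversion, assuming one pipe pair per vcpu; the
KVM_RUN call and the kvm_run fields are the existing API, while exit_msg,
vcpu_chan and the resume handshake are invented for illustration.

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct exit_msg {                     /* invented wire format */
    uint32_t exit_reason;             /* copied from kvm_run->exit_reason */
    uint64_t mmio_addr;               /* valid for KVM_EXIT_MMIO */
};

struct vcpu_chan {
    int vcpu_fd;                      /* from KVM_CREATE_VCPU */
    struct kvm_run *run;              /* the mmap()ed run structure */
    int to_vmm;                       /* write end: vcpu thread -> main loop */
    int from_vmm;                     /* read end: main loop -> vcpu thread */
};

/* Started with pthread_create(); the main loop just poll()s the pipes of
 * all vcpus and replies when it wants a vcpu to re-enter the guest. */
static void *vcpu_thread(void *arg)
{
    struct vcpu_chan *c = arg;
    char resume;

    for (;;) {
        ioctl(c->vcpu_fd, KVM_RUN, 0);            /* synchronous exit */

        struct exit_msg m = {
            .exit_reason = c->run->exit_reason,
            .mmio_addr   = c->run->exit_reason == KVM_EXIT_MMIO ?
                           c->run->mmio.phys_addr : 0,
        };
        write(c->to_vmm, &m, sizeof(m));          /* "RPC request"      */
        read(c->from_vmm, &resume, 1);            /* wait for the reply */
    }
    return NULL;
}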

>
> It'd be nice to be able to kick a VCPU out of the guest without
> messing around with signals. One possibility would be to tie it to an
> eventfd;

We have to support signals in any case; supporting more mechanisms just
increases complexity.

> another might be to add a pseudo-register to indicate whether the VCPU
> is explicitly suspended. (Combined with the decoupling idea, you'd
> want another pseudo-register to indicate whether the VMM is implicitly
> suspended due to an intercept; a single "runnable" bit is racy if both
> the VMM and VCPU are setting it.)
>
> ioevent fds are definitely useful. It might be cute if they could
> synchronously set the VIRTIO_USED_F_NOTIFY bit - the guest could do
> this itself, but that'd require giving the guest write access to the
> used side of the virtio queue, and I kind of like the idea that it
> doesn't need write access there. Then again, I don't have any perf
> data to back up the need for this.
>

I'd hate to tie ioeventfds into virtio specifics; they're a general
mechanism. Especially if the guest can do it itself.

--
error compiling committee.c: too many arguments to function


2012-02-06 17:41:08

by Rob Earhart

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On Sun, Feb 5, 2012 at 5:14 AM, Avi Kivity <[email protected]> wrote:
> On 02/03/2012 12:13 AM, Rob Earhart wrote:
>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> The kvm api has been accumulating cruft for several years now.
>> This is
>> due to feature creep, fixing mistakes, experience gained by the
>> maintainers and developers on how to do things, ports to new
>> architectures, and simply as a side effect of a code base that is
>> developed slowly and incrementally.
>>
>> While I don't think we can justify a complete revamp of the API
>> now, I'm
>> writing this as a thought experiment to see where a from-scratch
>> API can
>> take us. Of course, if we do implement this, the new and old APIs
>> will
>> have to be supported side by side for several years.
>>
>> Syscalls
>> --------
>> kvm currently uses the much-loved ioctl() system call as its entry
>> point. While this made it easy to add kvm to the kernel
>> unintrusively,
>> it does have downsides:
>>
>> - overhead in the entry path, for the ioctl dispatch path and vcpu
>> mutex
>> (low but measurable)
>> - semantic mismatch: kvm really wants a vcpu to be tied to a
>> thread, and
>> a vm to be tied to an mm_struct, but the current API ties them to file
>> descriptors, which can move between threads and processes. We check
>> that they don't, but we don't want to.
>>
>> Moving to syscalls avoids these problems, but introduces new ones:
>>
>> - adding new syscalls is generally frowned upon, and kvm will need
>> several
>> - syscalls into modules are harder and rarer than into core kernel
>> code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
>>
>> Syscalls that operate on the entire guest will pick it up implicitly
>> from the mm_struct, and syscalls that operate on a vcpu will pick
>> it up
>> from current.
>>
>>
>> <snipped>
>>
>> I like the ioctl() interface. If the overhead matters in your hot path,
>
> I can't say that it's a pressing problem, but it's not negligible.
>
>> I suspect you're doing it wrong;
>
> What am I doing wrong?

"You the vmm" not "you the KVM maintainer" :-)

To be a little more precise: If a VCPU thread is going all the way out
to host usermode in its hot path, that's probably a performance
problem regardless of how fast you make the transitions between host
user and host kernel.

That's why ioctl() doesn't bother me. I think it'd be more useful to
focus on mechanisms which don't require the VCPU thread to exit at all
in its hot paths, so the overhead of the ioctl() really becomes lost
in the noise. irq fds and ioevent fds are great for that, and I
really like your MMIO-over-socketpair idea.
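
A sketch of the ioeventfd side for anyone following along. The KVM_IOEVENTFD
call is the existing mechanism; the doorbell port, the access width, and the
idea of a dedicated I/O thread blocking on the eventfd are illustrative
assumptions.

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int register_doorbell(int vm_fd, uint64_t pio_addr)
{
    int efd = eventfd(0, 0);

    struct kvm_ioeventfd ioevent = {
        .addr  = pio_addr,             /* guest I/O port of the doorbell */
        .len   = 2,                    /* a 16-bit outw() from the guest */
        .fd    = efd,
        .flags = KVM_IOEVENTFD_FLAG_PIO,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &ioevent);

    /* A VMM I/O thread can now block on read(efd, ...) and service the
     * device without the vcpu thread ever returning to userspace. */
    return efd;
}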

>> use irq fds & ioevent fds. ?You might fix the semantic mismatch by
>> having a notion of a "current process's VM" and "current thread's
>> VCPU", and just use the one /dev/kvm filedescriptor.
>>
>> Or you could go the other way, and break the connection between VMs
>> and processes / VCPUs and threads: I don't know how easy it is to do
>> it in Linux, but a VCPU might be backed by a kernel thread, operated
>> on via ioctl()s, indicating that they've exited the guest by having
>> their descriptors become readable (and either use read() or mmap() to
>> pull off the reason why the VCPU exited).
>
> That breaks the ability to renice vcpu threads (unless you want the user
> renice kernel threads).

I think it'd be fine to have an ioctl()/syscall() to do it. But I
don't know how well that'd compose with other tools people might use
for managing priorities.

>> This would allow for a variety of different programming styles for the
>> VMM--I'm a fan of CSP model myself, but that's hard to do with the
>> current API.
>
> Just convert the synchronous API to an RPC over a pipe, in the vcpu
> thread, and you have the asynchronous model you asked for.

Yup. But you still get multiple threads in your process. It's not a
disaster, though.

)Rob

2012-02-06 19:12:01

by Anthony Liguori

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/06/2012 11:41 AM, Rob Earhart wrote:
> On Sun, Feb 5, 2012 at 5:14 AM, Avi Kivity<[email protected]> wrote:
>> On 02/03/2012 12:13 AM, Rob Earhart wrote:
>>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> The kvm api has been accumulating cruft for several years now.
>>> This is
>>> due to feature creep, fixing mistakes, experience gained by the
>>> maintainers and developers on how to do things, ports to new
>>> architectures, and simply as a side effect of a code base that is
>>> developed slowly and incrementally.
>>>
>>> While I don't think we can justify a complete revamp of the API
>>> now, I'm
>>> writing this as a thought experiment to see where a from-scratch
>>> API can
>>> take us. Of course, if we do implement this, the new and old APIs
>>> will
>>> have to be supported side by side for several years.
>>>
>>> Syscalls
>>> --------
>>> kvm currently uses the much-loved ioctl() system call as its entry
>>> point. While this made it easy to add kvm to the kernel
>>> unintrusively,
>>> it does have downsides:
>>>
>>> - overhead in the entry path, for the ioctl dispatch path and vcpu
>>> mutex
>>> (low but measurable)
>>> - semantic mismatch: kvm really wants a vcpu to be tied to a
>>> thread, and
>>> a vm to be tied to an mm_struct, but the current API ties them to file
>>> descriptors, which can move between threads and processes. We check
>>> that they don't, but we don't want to.
>>>
>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>
>>> - adding new syscalls is generally frowned upon, and kvm will need
>>> several
>>> - syscalls into modules are harder and rarer than into core kernel
>>> code
>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>> mm_struct
>>>
>>> Syscalls that operate on the entire guest will pick it up implicitly
>>> from the mm_struct, and syscalls that operate on a vcpu will pick
>>> it up
>>> from current.
>>>
>>>
>>> <snipped>
>>>
>>> I like the ioctl() interface. If the overhead matters in your hot path,
>>
>> I can't say that it's a pressing problem, but it's not negligible.
>>
>>> I suspect you're doing it wrong;
>>
>> What am I doing wrong?
>
> "You the vmm" not "you the KVM maintainer" :-)
>
> To be a little more precise: If a VCPU thread is going all the way out
> to host usermode in its hot path, that's probably a performance
> problem regardless of how fast you make the transitions between host
> user and host kernel.
>
> That's why ioctl() doesn't bother me. I think it'd be more useful to
> focus on mechanisms which don't require the VCPU thread to exit at all
> in its hot paths, so the overhead of the ioctl() really becomes lost
> in the noise. irq fds and ioevent fds are great for that, and I
> really like your MMIO-over-socketpair idea.

I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
kthread to sleep while it waits for the other end to process it. This is
effectively equivalent to a heavy weight exit. The difference in cost is
dropping to userspace, which is really negligible these days (< 100 cycles).

There is some fast-path trickery to avoid heavy weight exits, but this presents
the same basic problem of having to put all the device model stuff in the kernel.

ioeventfd to userspace is almost certainly worse for performance. As Avi
mentioned, you can emulate this behavior yourself in userspace if so inclined.
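
A rough sketch of what emulating it in userspace could look like: handle the
synchronous KVM_RUN exit, then kick a worker thread through an eventfd. The
worker, the queueing scheme, and the mmio decoding are simplified assumptions.

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

void vcpu_loop(int vcpu_fd, struct kvm_run *run, int worker_efd)
{
    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);

        if (run->exit_reason == KVM_EXIT_MMIO && run->mmio.is_write) {
            /* Queue the write somewhere the worker can see it (omitted),
             * then notify it; this costs KVM_RUN's return plus one cheap
             * write() syscall, which is the comparison being made here. */
            uint64_t one = 1;
            write(worker_efd, &one, sizeof(one));
        }
    }
}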

Regards,

Anthony Liguori

2012-02-07 12:01:53

by Avi Kivity

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/06/2012 07:41 PM, Rob Earhart wrote:
> >>
> >> I like the ioctl() interface. If the overhead matters in your hot path,
> >
> > I can't say that it's a pressing problem, but it's not negligible.
> >
> >> I suspect you're doing it wrong;
> >
> > What am I doing wrong?
>
> "You the vmm" not "you the KVM maintainer" :-)
>
> To be a little more precise: If a VCPU thread is going all the way out
> to host usermode in its hot path, that's probably a performance
> problem regardless of how fast you make the transitions between host
> user and host kernel.

Why?

> That's why ioctl() doesn't bother me. I think it'd be more useful to
> focus on mechanisms which don't require the VCPU thread to exit at all
> in its hot paths, so the overhead of the ioctl() really becomes lost
> in the noise. irq fds and ioevent fds are great for that, and I
> really like your MMIO-over-socketpair idea.

I like them too, but they're not suitable for all cases.

An ioeventfd, or unordered write-over-mmio-socketpair can take one of
two paths:

- waking up an idle mmio service thread on a different core, involving
a double context switch on that remote core
- scheduling the idle mmio service thread on the current core,
involving both a double context switch and a heavyweight exit

An ordered write-over-mmio-socketpair, or a read-over-mmio-socketpair,
can also take one of two paths:
- waking up an idle mmio service thread on a different core, involving
a double context switch on that remote core, and also involving two
context switches on the current core (while we wait for a reply); if the
current core schedules a user task we might also have a heavyweight exit
- scheduling the idle mmio service thread on the current core,
involving both a double context switch and a heavyweight exit

As you can see, the actual work is greater for threaded io handlers than
for the synchronous ones. The real advantage is that you can perform more
work in parallel if you have the spare cores (not a given in
consolidation environments) and if you actually have a lot of work to do
(like virtio-net in a throughput load). It doesn't quite fit a "read
hpet register" load.



>
> >> This would allow for a variety of different programming styles for the
> >> VMM--I'm a fan of CSP model myself, but that's hard to do with the
> >> current API.
> >
> > Just convert the synchronous API to an RPC over a pipe, in the vcpu
> > thread, and you have the asynchronous model you asked for.
>
> Yup. But you still get multiple threads in your process. It's not a
> disaster, though.
>

You have multiple threads anyway, even if it's the kernel that creates them.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2012-02-07 12:04:06

by Avi Kivity

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>
> I'm not so sure. ioeventfds and a future mmio-over-socketpair have to
> put the kthread to sleep while it waits for the other end to process
> it. This is effectively equivalent to a heavy weight exit. The
> difference in cost is dropping to userspace which is really neglible
> these days (< 100 cycles).

On what machine did you measure these wonderful numbers?

But I agree a heavyweight exit is probably faster than a double context
switch on a remote core.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2012-02-07 15:17:40

by Anthony Liguori

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/07/2012 06:03 AM, Avi Kivity wrote:
> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>
>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
>> kthread to sleep while it waits for the other end to process it. This is
>> effectively equivalent to a heavy weight exit. The difference in cost is
>> dropping to userspace which is really neglible these days (< 100 cycles).
>
> On what machine did you measure these wonderful numbers?

A syscall is what I mean by "dropping to userspace", not the cost of a heavy
weight exit. I think a heavy weight exit is still around a few thousand cycles.

Any Nehalem-class or better processor should have a syscall cost of around that,
unless I'm wildly mistaken.

>
> But I agree a heavyweight exit is probably faster than a double context switch
> on a remote core.

I meant, if you already need to take a heavyweight exit (and you do, to schedule
something else on the core), then the only additional cost is taking a syscall
return to userspace *first* before scheduling another process. That overhead is
pretty low.
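
For anyone who wants to sanity-check these numbers on their own hardware, a
crude way is to time a cheap syscall in a tight loop with the TSC. This
measures only the user/kernel round trip, not a KVM heavyweight exit, and the
iteration count is arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
    const int iters = 1000000;
    uint64_t start = __rdtsc();

    for (int i = 0; i < iters; i++)
        getppid();                      /* cheap syscall, not cached by libc */

    uint64_t cycles = __rdtsc() - start;
    printf("~%lu cycles per syscall round trip\n",
           (unsigned long)(cycles / iters));
    return 0;
}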

Regards,

Anthony Liguori

>
>

2012-02-07 16:02:56

by Avi Kivity

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/07/2012 05:17 PM, Anthony Liguori wrote:
> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>
>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>> to put the
>>> kthread to sleep while it waits for the other end to process it.
>>> This is
>>> effectively equivalent to a heavy weight exit. The difference in
>>> cost is
>>> dropping to userspace which is really neglible these days (< 100
>>> cycles).
>>
>> On what machine did you measure these wonderful numbers?
>
> A syscall is what I mean by "dropping to userspace", not the cost of a
> heavy weight exit.

Ah. But then ioeventfd has that as well, unless the other end is in the
kernel too.

> I think a heavy weight exit is still around a few thousand cycles.
>
> Any nehalem class or better processor should have a syscall cost of
> around that unless I'm wildly mistaken.
>

That's what I remember too.

>>
>> But I agree a heavyweight exit is probably faster than a double
>> context switch
>> on a remote core.
>
> I meant, if you already need to take a heavyweight exit (and you do to
> schedule something else on the core), than the only additional cost is
> taking a syscall return to userspace *first* before scheduling another
> process. That overhead is pretty low.

Yeah.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2012-02-07 16:19:03

by Jan Kiszka

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 2012-02-07 17:02, Avi Kivity wrote:
> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>
>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>>> to put the
>>>> kthread to sleep while it waits for the other end to process it.
>>>> This is
>>>> effectively equivalent to a heavy weight exit. The difference in
>>>> cost is
>>>> dropping to userspace which is really neglible these days (< 100
>>>> cycles).
>>>
>>> On what machine did you measure these wonderful numbers?
>>
>> A syscall is what I mean by "dropping to userspace", not the cost of a
>> heavy weight exit.
>
> Ah. But then ioeventfd has that as well, unless the other end is in the
> kernel too.
>
>> I think a heavy weight exit is still around a few thousand cycles.
>>
>> Any nehalem class or better processor should have a syscall cost of
>> around that unless I'm wildly mistaken.
>>
>
> That's what I remember too.
>
>>>
>>> But I agree a heavyweight exit is probably faster than a double
>>> context switch
>>> on a remote core.
>>
>> I meant, if you already need to take a heavyweight exit (and you do to
>> schedule something else on the core), than the only additional cost is
>> taking a syscall return to userspace *first* before scheduling another
>> process. That overhead is pretty low.
>
> Yeah.
>

Isn't there another level in between just scheduling and full syscall
return if the user return notifier has some real work to do?

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

2012-02-07 16:20:04

by Anthony Liguori

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/07/2012 10:02 AM, Avi Kivity wrote:
> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>
>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
>>>> kthread to sleep while it waits for the other end to process it. This is
>>>> effectively equivalent to a heavy weight exit. The difference in cost is
>>>> dropping to userspace which is really neglible these days (< 100 cycles).
>>>
>>> On what machine did you measure these wonderful numbers?
>>
>> A syscall is what I mean by "dropping to userspace", not the cost of a heavy
>> weight exit.
>
> Ah. But then ioeventfd has that as well, unless the other end is in the kernel too.

Yes, that was my point exactly :-)

ioeventfd/mmio-over-socketpair to a different thread is not faster than a
synchronous KVM_RUN + writing to an eventfd in userspace, modulo a couple of
cheap syscalls.

The exception is when the other end is in the kernel and there are magic
optimizations (as there are today with ioeventfd).

Regards,

Anthony Liguori

>
>> I think a heavy weight exit is still around a few thousand cycles.
>>
>> Any nehalem class or better processor should have a syscall cost of around
>> that unless I'm wildly mistaken.
>>
>
> That's what I remember too.
>
>>>
>>> But I agree a heavyweight exit is probably faster than a double context switch
>>> on a remote core.
>>
>> I meant, if you already need to take a heavyweight exit (and you do to
>> schedule something else on the core), than the only additional cost is taking
>> a syscall return to userspace *first* before scheduling another process. That
>> overhead is pretty low.
>
> Yeah.
>

2012-02-07 16:21:34

by Anthony Liguori

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/07/2012 10:18 AM, Jan Kiszka wrote:
> On 2012-02-07 17:02, Avi Kivity wrote:
>> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>>
>>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>>>> to put the
>>>>> kthread to sleep while it waits for the other end to process it.
>>>>> This is
>>>>> effectively equivalent to a heavy weight exit. The difference in
>>>>> cost is
>>>>> dropping to userspace which is really neglible these days (< 100
>>>>> cycles).
>>>>
>>>> On what machine did you measure these wonderful numbers?
>>>
>>> A syscall is what I mean by "dropping to userspace", not the cost of a
>>> heavy weight exit.
>>
>> Ah. But then ioeventfd has that as well, unless the other end is in the
>> kernel too.
>>
>>> I think a heavy weight exit is still around a few thousand cycles.
>>>
>>> Any nehalem class or better processor should have a syscall cost of
>>> around that unless I'm wildly mistaken.
>>>
>>
>> That's what I remember too.
>>
>>>>
>>>> But I agree a heavyweight exit is probably faster than a double
>>>> context switch
>>>> on a remote core.
>>>
>>> I meant, if you already need to take a heavyweight exit (and you do to
>>> schedule something else on the core), than the only additional cost is
>>> taking a syscall return to userspace *first* before scheduling another
>>> process. That overhead is pretty low.
>>
>> Yeah.
>>
>
> Isn't there another level in between just scheduling and full syscall
> return if the user return notifier has some real work to do?

Depends on whether you're scheduling a kthread or a userspace process, no? If
you're eventually going to end up in userspace, you have to do the full heavy
weight exit.

If you're scheduling to a kthread, it's better to do the type of trickery that
ioeventfd does and just turn it into a function call.

Regards,

Anthony Liguori

>
> Jan
>

2012-02-07 16:29:33

by Jan Kiszka

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 2012-02-07 17:21, Anthony Liguori wrote:
> On 02/07/2012 10:18 AM, Jan Kiszka wrote:
>> On 2012-02-07 17:02, Avi Kivity wrote:
>>> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>>>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>>>
>>>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>>>>> to put the
>>>>>> kthread to sleep while it waits for the other end to process it.
>>>>>> This is
>>>>>> effectively equivalent to a heavy weight exit. The difference in
>>>>>> cost is
>>>>>> dropping to userspace which is really neglible these days (< 100
>>>>>> cycles).
>>>>>
>>>>> On what machine did you measure these wonderful numbers?
>>>>
>>>> A syscall is what I mean by "dropping to userspace", not the cost of a
>>>> heavy weight exit.
>>>
>>> Ah. But then ioeventfd has that as well, unless the other end is in the
>>> kernel too.
>>>
>>>> I think a heavy weight exit is still around a few thousand cycles.
>>>>
>>>> Any nehalem class or better processor should have a syscall cost of
>>>> around that unless I'm wildly mistaken.
>>>>
>>>
>>> That's what I remember too.
>>>
>>>>>
>>>>> But I agree a heavyweight exit is probably faster than a double
>>>>> context switch
>>>>> on a remote core.
>>>>
>>>> I meant, if you already need to take a heavyweight exit (and you do to
>>>> schedule something else on the core), than the only additional cost is
>>>> taking a syscall return to userspace *first* before scheduling another
>>>> process. That overhead is pretty low.
>>>
>>> Yeah.
>>>
>>
>> Isn't there another level in between just scheduling and full syscall
>> return if the user return notifier has some real work to do?
>
> Depends on whether you're scheduling a kthread or a userspace process, no? If

Kthreads can't return, of course. User space threads /may/ do so. And
then there needs to be a difference between host and guest in the
tracked MSRs. I seem to recall it's a question of another few hundred
cycles.

Jan

> you're eventually going to end up in userspace, you have to do the full heavy
> weight exit.
>
> If you're scheduling to a kthread, it's better to do the type of trickery that
> ioeventfd does and just turn it into a function call.
>
> Regards,
>
> Anthony Liguori
>
>>
>> Jan
>>
>

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

2012-02-15 13:41:54

by Avi Kivity

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/07/2012 06:29 PM, Jan Kiszka wrote:
> >>>
> >>
> >> Isn't there another level in between just scheduling and full syscall
> >> return if the user return notifier has some real work to do?
> >
> > Depends on whether you're scheduling a kthread or a userspace process, no? If
>
> Kthreads can't return, of course. User space threads /may/ do so. And
> then there needs to be a differences between host and guest in the
> tracked MSRs.

Right. Until we randomize kernel virtual addresses (what happened to
that?) and then there will always be a difference, even if you run the
same kernel in the host and guest.

> I think to recall it's a question of another few hundred
> cycles.

Right.
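
For reference, the user return notifier being discussed is the small hook in
include/linux/user-return-notifier.h that KVM uses to defer restoring host
MSRs until a task really heads back to userspace. A sketch of its shape; the
callback body is illustrative only.

#include <linux/user-return-notifier.h>

static void restore_host_msrs(struct user_return_notifier *urn)
{
    /* wrmsr() the host values for the MSRs that differ from the guest's;
     * if host and guest values happen to match there is nothing to do,
     * which is the "few hundred cycles" distinction discussed above. */
}

static struct user_return_notifier kvm_urn = {
    .on_user_return = restore_host_msrs,
};

static void arm_notifier(void)
{
    user_return_notifier_register(&kvm_urn);   /* fires on return to user */
}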

--
error compiling committee.c: too many arguments to function

2012-02-15 13:47:24

by Avi Kivity

Subject: Re: [Qemu-devel] [RFC] Next gen kvm api

On 02/07/2012 06:19 PM, Anthony Liguori wrote:
>> Ah. But then ioeventfd has that as well, unless the other end is in
>> the kernel too.
>
>
> Yes, that was my point exactly :-)
>
> ioeventfd/mmio-over-socketpair to adifferent thread is not faster than
> a synchronous KVM_RUN + writing to an eventfd in userspace modulo a
> couple of cheap syscalls.
>
> The exception is when the other end is in the kernel and there is
> magic optimizations (like there is today with ioeventfd).

vhost seems to schedule a workqueue item unconditionally.

irqfd does have magic optimizations to avoid an extra schedule.
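
A sketch of the irqfd path mentioned here. The GSI number and the caller are
arbitrary examples, but KVM_IRQFD is the existing hook that lets an eventfd
write turn directly into an interrupt injection.

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int wire_irqfd(int vm_fd, uint32_t gsi)
{
    int efd = eventfd(0, 0);

    struct kvm_irqfd irqfd = {
        .fd  = efd,
        .gsi = gsi,              /* guest interrupt line to raise */
    };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);

    /* Anything that can write to efd (vhost, a userspace I/O thread)
     * can now inject the interrupt without going through the VMM. */
    uint64_t one = 1;
    write(efd, &one, sizeof(one));
    return efd;
}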

--
error compiling committee.c: too many arguments to function