The kvm api has been accumulating cruft for several years now. This is
due to feature creep, fixing mistakes, experience gained by the
maintainers and developers on how to do things, ports to new
architectures, and simply as a side effect of a code base that is
developed slowly and incrementally.
While I don't think we can justify a complete revamp of the API now, I'm
writing this as a thought experiment to see where a from-scratch API can
take us. Of course, if we do implement this, the new and old APIs will
have to be supported side by side for several years.
Syscalls
--------
kvm currently uses the much-loved ioctl() system call as its entry
point. While this made it easy to add kvm to the kernel unintrusively,
it does have downsides:
- overhead in the entry path, for the ioctl dispatch path and vcpu mutex
(low but measurable)
- semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
a vm to be tied to an mm_struct, but the current API ties them to file
descriptors, which can move between threads and processes. We check
that they don't, but we don't want to.
Moving to syscalls avoids these problems, but introduces new ones:
- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct
Syscalls that operate on the entire guest will pick it up implicitly
from the mm_struct, and syscalls that operate on a vcpu will pick it up
from current.
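A minimal sketch of how that might look on the kernel side (the syscall names, the kvm_vcpu field in task_struct and the kvm field in mm_struct are all hypothetical, they are what this proposal would add, not existing code):

#include <linux/syscalls.h>
#include <linux/sched.h>

/* Hypothetical sketch only: neither these syscalls nor these fields exist. */
SYSCALL_DEFINE1(kvm_enter_guest, void __user *, shared)
{
        struct kvm_vcpu *vcpu = current->kvm_vcpu;  /* vcpu tied to this thread */

        if (!vcpu)
                return -EINVAL;
        return kvm_arch_enter_guest(vcpu, shared);  /* hypothetical helper */
}

SYSCALL_DEFINE1(kvm_set_memory_map, struct kvm_memory_map __user *, map)
{
        struct kvm *kvm = current->mm->kvm;         /* vm tied to this mm_struct */

        if (!kvm)
                return -EINVAL;
        return kvm_arch_set_memory_map(kvm, map);   /* hypothetical helper */
}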
State accessors
---------------
Currently vcpu state is read and written by a bunch of ioctls that
access register sets that were added (or discovered) over the years.
Some state is stored in the vcpu mmap area. These will be replaced by a
pair of syscalls that read or write the entire state, or a subset of the
state, in a tag/value format. A register will be described by a tuple:
set: the register set to which it belongs; either a real set (GPR,
x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
eflags/rip/IDT/interrupt shadow/pending exception/etc.)
number: register number within a set
size: for self-description, and to allow expanding registers like
SSE->AVX or eax->rax
attributes: read-write, read-only, read-only for guest but read-write
for host
value
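As a rough sketch, the tuple could be encoded like this (every name below is invented for illustration; this is not a settled ABI):

#include <linux/types.h>

/* Sketch of the tag/value encoding described above; not a real interface. */
enum kvm_reg_set {
        KVM_REG_SET_GPR,
        KVM_REG_SET_X87,
        KVM_REG_SET_SSE_AVX,
        KVM_REG_SET_SEGMENT,
        KVM_REG_SET_CPUID,
        KVM_REG_SET_MSR,
        KVM_REG_SET_SPECIAL,    /* eflags/rip/IDT/interrupt shadow/... */
};

#define KVM_REG_ATTR_RW         0x0     /* read-write */
#define KVM_REG_ATTR_RO         0x1     /* read-only */
#define KVM_REG_ATTR_HOST_RW    0x2     /* read-only for guest, read-write for host */

struct kvm_reg {
        __u16 set;              /* enum kvm_reg_set */
        __u16 attributes;       /* KVM_REG_ATTR_* */
        __u32 number;           /* register number within the set */
        __u32 size;             /* bytes; allows SSE->AVX or eax->rax growth */
        __u8  value[];          /* 'size' bytes of register contents follow */
};

The two state syscalls would then scatter/gather an array of such records between userspace and the kernel.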
Device model
------------
Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host. The API allows emulating the local
APICs in userspace.
The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace. Note: this may cause a regression for older guests
that don't support MSI or kvmclock. Device assignment will be done
using VFIO, that is, without direct kvm involvement.
Local APICs will be mandatory, but it will be possible to hide them from
the guest. This means that it will no longer be possible to emulate an
APIC in userspace, but it will be possible to virtualize an APIC-less
core - userspace will play with the LINT0/LINT1 inputs (configured as
EXTINT and NMI) to queue interrupts and NMIs.
Communication between the local APIC and the IOAPIC/PIC will be done
over a socketpair, emulating the APIC bus protocol.
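For illustration only, each message on that socketpair might carry roughly what the real APIC bus carries; nothing like the following format has actually been specified:

#include <linux/types.h>

/* Purely illustrative framing for the proposed APIC-bus socketpair. */
struct apic_bus_msg {
        __u8  delivery_mode;    /* fixed, lowest priority, NMI, EXTINT, EOI, ... */
        __u8  vector;           /* interrupt vector, where applicable */
        __u8  dest_mode;        /* physical or logical destination */
        __u8  trigger_mode;     /* edge or level */
        __u32 dest;             /* destination APIC ID / logical destination */
};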
Ioeventfd/irqfd
---------------
As the ioeventfd/irqfd mechanism has been quite successful, it will be
retained, and perhaps supplemented with a way to assign an mmio region
to a socketpair carrying transactions. This allows a device model to be
implemented out-of-process. The socketpair can also be used to
implement a replacement for coalesced mmio, by not waiting for responses
on write transactions when enabled. Synchronization of coalesced mmio
will be implemented in the kernel, not userspace as now: when a
non-coalesced mmio is needed, the kernel will first flush the coalesced
mmio queue(s).
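As a sketch of what the device model at the other end of the socketpair might read and answer (the framing below is an assumption; versioning and flow control are left open):

#include <linux/types.h>

/* Illustrative transaction format for mmio carried over a socketpair. */
struct kvm_mmio_txn {
        __u64 gpa;              /* guest physical address of the access */
        __u32 len;              /* access size in bytes */
        __u8  is_write;
        __u8  no_reply;         /* posted write: don't wait for a response */
        __u8  pad[2];
        __u8  data[8];          /* write payload, or space for the read reply */
};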
Guest memory management
-----------------------
Instead of managing each memory slot individually, a single API will be
provided that replaces the entire guest physical memory map atomically.
This matches the implementation (using RCU) and plugs holes in the
current API, where you lose the dirty log in the window between the last
call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
that removes the slot.
Slot-based dirty logging will be replaced by range-based and work-based
dirty logging; that is "what pages are dirty in this range, which may be
smaller than a slot" and "don't return more than N pages".
We may want to place the log in user memory instead of kernel memory, to
reduce pinned memory and increase flexibility.
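The atomic "replace the whole map" call described above might take a single descriptor of the complete guest physical layout, along these lines (names and layout invented for illustration):

#include <linux/types.h>

/* Sketch of an atomic "set entire memory map" interface; not a real API. */
struct kvm_mem_region {
        __u64 guest_phys_addr;
        __u64 size;
        __u64 userspace_addr;
        __u32 flags;            /* e.g. dirty logging enabled, read-only */
        __u32 pad;
};

struct kvm_memory_map {
        __u32 nr_regions;
        __u32 pad;
        struct kvm_mem_region regions[];  /* replaces all previous slots at once */
};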
vcpu fd mmap area
-----------------
Currently we mmap() a few pages of the vcpu fd for fast user/kernel
communications. This will be replaced by a more orthodox pointer
parameter to sys_kvm_enter_guest(), that will be accessed using
get_user() and put_user(). This is slower than the current situation,
but better for things like strace.
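A sketch of what the kernel side might look like (the structure layout and helper names are assumptions, not a defined interface):

#include <linux/uaccess.h>

/* The per-run shared structure becomes plain user memory, touched with
 * get_user()/put_user() instead of being mmap()ed into userspace.
 */
struct kvm_run_params {
        __u64 request_interrupt_window;
        __u64 exit_reason;
        /* ... */
};

static long do_kvm_enter_guest(struct kvm_vcpu *vcpu,
                               struct kvm_run_params __user *params)
{
        u64 req_irq_window, exit_reason;

        if (get_user(req_irq_window, &params->request_interrupt_window))
                return -EFAULT;

        exit_reason = run_vcpu(vcpu, req_irq_window);   /* hypothetical */

        if (put_user(exit_reason, &params->exit_reason))
                return -EFAULT;
        return 0;
}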
--
error compiling committee.c: too many arguments to function
On 02/02/2012 10:09 AM, Avi Kivity wrote:
> The kvm api has been accumulating cruft for several years now. This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API now, I'm
> writing this as a thought experiment to see where a from-scratch API can
> take us. Of course, if we do implement this, the new and old APIs will
> have to be supported side by side for several years.
>
> Syscalls
> --------
> kvm currently uses the much-loved ioctl() system call as its entry
> point. While this made it easy to add kvm to the kernel unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes. We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick it up
> from current.
This seems like the natural progression.
> State accessors
> ---------------
> Currently vcpu state is read and written by a bunch of ioctls that
> access register sets that were added (or discovered) along the years.
> Some state is stored in the vcpu mmap area. These will be replaced by a
> pair of syscalls that read or write the entire state, or a subset of the
> state, in a tag/value format. A register will be described by a tuple:
>
> set: the register set to which it belongs; either a real set (GPR,
> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
> number: register number within a set
> size: for self-description, and to allow expanding registers like
> SSE->AVX or eax->rax
> attributes: read-write, read-only, read-only for guest but read-write
> for host
> value
I very much like the idea of being able to read one register at a time, as often
that's all you need.
>
> Device model
> ------------
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host. The API allows emulating the local
> APICs in userspace.
>
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace.
I'm a big fan of this.
> Note: this may cause a regression for older guests
> that don't support MSI or kvmclock. Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
>
> Local APICs will be mandatory, but it will be possible to hide them from
> the guest. This means that it will no longer be possible to emulate an
> APIC in userspace, but it will be possible to virtualize an APIC-less
> core - userspace will play with the LINT0/LINT1 inputs (configured as
> EXITINT and NMI) to queue interrupts and NMIs.
I think this makes sense. An interesting consequence of this is that it's no
longer necessary to associate the VCPU context with an MMIO/PIO operation. I'm
not sure if there's an obvious benefit to that but it's interesting nonetheless.
> The communications between the local APIC and the IOAPIC/PIC will be
> done over a socketpair, emulating the APIC bus protocol.
>
> Ioeventfd/irqfd
> ---------------
> As the ioeventfd/irqfd mechanism has been quite successful, it will be
> retained, and perhaps supplemented with a way to assign an mmio region
> to a socketpair carrying transactions. This allows a device model to be
> implemented out-of-process. The socketpair can also be used to
> implement a replacement for coalesced mmio, by not waiting for responses
> on write transactions when enabled. Synchronization of coalesced mmio
> will be implemented in the kernel, not userspace as now: when a
> non-coalesced mmio is needed, the kernel will first flush the coalesced
> mmio queue(s).
>
> Guest memory management
> -----------------------
> Instead of managing each memory slot individually, a single API will be
> provided that replaces the entire guest physical memory map atomically.
> This matches the implementation (using RCU) and plugs holes in the
> current API, where you lose the dirty log in the window between the last
> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> that removes the slot.
>
> Slot-based dirty logging will be replaced by range-based and work-based
> dirty logging; that is "what pages are dirty in this range, which may be
> smaller than a slot" and "don't return more than N pages".
>
> We may want to place the log in user memory instead of kernel memory, to
> reduce pinned memory and increase flexibility.
Since we really only support 64-bit hosts, what about just pointing the kernel
at an address/size pair and relying on userspace to mmap() the range appropriately?
> vcpu fd mmap area
> -----------------
> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
> communications. This will be replaced by a more orthodox pointer
> parameter to sys_kvm_enter_guest(), that will be accessed using
> get_user() and put_user(). This is slower than the current situation,
> but better for things like strace.
Looks pretty interesting overall.
Regards,
Anthony Liguori
>
On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <[email protected]> wrote:
[...]
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.
How would the ability to use sys_kvm_* be regulated?
On 02/03/2012 12:07 PM, Eric Northup wrote:
> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<[email protected]> wrote:
> [...]
>>
>> Moving to syscalls avoids these problems, but introduces new ones:
>>
>> - adding new syscalls is generally frowned upon, and kvm will need several
>> - syscalls into modules are harder and rarer than into core kernel code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
> - Lost a good place to put access control (permissions on /dev/kvm)
> for which user-mode processes can use KVM.
>
> How would the ability to use sys_kvm_* be regulated?
Why should it be regulated?
It's not a finite or privileged resource.
Regards,
Anthony Liguori
>
Hope to get comments from live migration developers,
Anthony Liguori <[email protected]> wrote:
> > Guest memory management
> > -----------------------
> > Instead of managing each memory slot individually, a single API will be
> > provided that replaces the entire guest physical memory map atomically.
> > This matches the implementation (using RCU) and plugs holes in the
> > current API, where you lose the dirty log in the window between the last
> > call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> > that removes the slot.
> >
> > Slot-based dirty logging will be replaced by range-based and work-based
> > dirty logging; that is "what pages are dirty in this range, which may be
> > smaller than a slot" and "don't return more than N pages".
> >
> > We may want to place the log in user memory instead of kernel memory, to
> > reduce pinned memory and increase flexibility.
>
> Since we really only support 64-bit hosts, what about just pointing the kernel
> at a address/size pair and rely on userspace to mmap() the range appropriately?
>
Seems reasonable but the real problem is not how to set up the memory:
the problem is how to set a bit in user-space.
We need two things:
- introducing set_bit_user()
- changing mmu_lock from spin_lock to mutex_lock
(mark_page_dirty() can be called with mmu_lock held)
The former is straightforward and I sent a patch last year.
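(For illustration, set_bit_user() could be little more than a read-modify-write through the existing uaccess helpers; this is only a sketch of the idea, not the patch itself, and a real implementation would want a single atomic instruction with an exception fixup.)

/* Sketch only: not atomic against other writers of the same word. */
static int set_bit_user(unsigned long nr, unsigned long __user *addr)
{
        unsigned long __user *p = addr + nr / BITS_PER_LONG;
        unsigned long val;

        if (get_user(val, p))
                return -EFAULT;
        val |= 1UL << (nr % BITS_PER_LONG);
        if (put_user(val, p))
                return -EFAULT;
        return 0;
}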
The latter needs a fundamental change: I heard (from Avi) that we can
change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
So I was planning to restart this work when Peter's
"mm: Preemptibility"
http://lkml.org/lkml/2011/4/1/141
gets finished.
But even if we cannot achieve "without pinned memory", we may also want
to let user-space know how many pages are getting dirty.
For example, think about the last step of live migration. We stop the
guest and send the remaining pages. For this we do not need to
write-protect them any more; we just want to know which ones are dirty.
If user-space can read the bitmap, it does not need to do GET_DIRTY_LOG
because the guest is already stopped, so we can reduce the downtime.
Is this correct?
So I think we can do this in two steps:
1. just move the bitmap to user-space (and pin it)
2. un-pin it when the time comes
I can start 1 after "srcu-less dirty logging" gets finished.
Takuya
On 02/03/2012 04:09 AM, Anthony Liguori wrote:
>
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock. Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>>
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest. This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
>
> I think this makes sense. An interesting consequence of this is that
> it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation. I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.
It doesn't follow (at least from the above), and it isn't allowed in
some situations (like PIO invoking synchronous SMI). So we'll have to
retain synchronous PIO/MMIO (but we can allow relaxing this for
socketpair mmio).
>
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.
>>
>> Ioeventfd/irqfd
>> ---------------
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>> to a socketpair carrying transactions. This allows a device model to be
>> implemented out-of-process. The socketpair can also be used to
>> implement a replacement for coalesced mmio, by not waiting for responses
>> on write transactions when enabled. Synchronization of coalesced mmio
>> will be implemented in the kernel, not userspace as now: when a
>> non-coalesced mmio is needed, the kernel will first flush the coalesced
>> mmio queue(s).
>>
>> Guest memory management
>> -----------------------
>> Instead of managing each memory slot individually, a single API will be
>> provided that replaces the entire guest physical memory map atomically.
>> This matches the implementation (using RCU) and plugs holes in the
>> current API, where you lose the dirty log in the window between the last
>> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> that removes the slot.
>>
>> Slot-based dirty logging will be replaced by range-based and work-based
>> dirty logging; that is "what pages are dirty in this range, which may be
>> smaller than a slot" and "don't return more than N pages".
>>
>> We may want to place the log in user memory instead of kernel memory, to
>> reduce pinned memory and increase flexibility.
>
> Since we really only support 64-bit hosts,
We don't (Red Hat does, but that's a distro choice). Non-x86 also needs
32-bit.
> what about just pointing the kernel at a address/size pair and rely on
> userspace to mmap() the range appropriately?
The "one large slot" approach. Even if we ignore the 32-bit issue, we
still need some per-slot information, like per-slot dirty logging. It's
also hard to create aliases this way (BIOS at 0xe0000 and 0xfffe0000) or
to move memory around (framebuffer BAR).
>
>> vcpu fd mmap area
>> -----------------
>> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> communications. This will be replaced by a more orthodox pointer
>> parameter to sys_kvm_enter_guest(), that will be accessed using
>> get_user() and put_user(). This is slower than the current situation,
>> but better for things like strace.
>
> Look pretty interesting overall.
I'll get an actual API description for the next round.
--
error compiling committee.c: too many arguments to function
On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> Device model
> ------------
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host. The API allows emulating the local
> APICs in userspace.
>
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace. Note: this may cause a regression for older guests
> that don't support MSI or kvmclock. Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
>
So are we officially saying that KVM is only for modern guest
virtualization? Also, my not-so-old host kernel uses MSI only for the NIC;
SATA and USB are using the IOAPIC (though this is probably more HW-related
than kernel-version-related).
--
Gleb.
On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > Device model
> > ------------
> > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > PCI devices assigned from the host. The API allows emulating the local
> > APICs in userspace.
> >
> > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > them to userspace. Note: this may cause a regression for older guests
> > that don't support MSI or kvmclock. Device assignment will be done
> > using VFIO, that is, without direct kvm involvement.
> >
> So are we officially saying that KVM is only for modern guest
> virtualization?
No, but older guests may have reduced performance in some workloads
(e.g. RHEL4 gettimeofday() intensive workloads).
> Also my not so old host kernel uses MSI only for NIC.
> SATA and USB are using IOAPIC (though this is probably more HW related
> than kernel version related).
For devices emulated in userspace, it doesn't matter where the IOAPIC
is. It only matters for kernel provided devices (PIT, assigned devices,
vhost-net).
--
error compiling committee.c: too many arguments to function
On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > Device model
> > > ------------
> > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > PCI devices assigned from the host. The API allows emulating the local
> > > APICs in userspace.
> > >
> > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > them to userspace. Note: this may cause a regression for older guests
> > > that don't support MSI or kvmclock. Device assignment will be done
> > > using VFIO, that is, without direct kvm involvement.
> > >
> > So are we officially saying that KVM is only for modern guest
> > virtualization?
>
> No, but older guests may have reduced performance in some workloads
> (e.g. RHEL4 gettimeofday() intensive workloads).
>
Reduced performance is what I mean. Obviously old guests will continue working.
> > Also my not so old host kernel uses MSI only for NIC.
> > SATA and USB are using IOAPIC (though this is probably more HW related
> > than kernel version related).
>
> For devices emulated in userspace, it doesn't matter where the IOAPIC
> is. It only matters for kernel provided devices (PIT, assigned devices,
> vhost-net).
>
What about EOI, which will have to do an additional exit to userspace for each
interrupt delivered?
--
Gleb.
On 02/05/2012 11:51 AM, Gleb Natapov wrote:
> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> > On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > > Device model
> > > > ------------
> > > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > > PCI devices assigned from the host. The API allows emulating the local
> > > > APICs in userspace.
> > > >
> > > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > > them to userspace. Note: this may cause a regression for older guests
> > > > that don't support MSI or kvmclock. Device assignment will be done
> > > > using VFIO, that is, without direct kvm involvement.
> > > >
> > > So are we officially saying that KVM is only for modern guest
> > > virtualization?
> >
> > No, but older guests may have reduced performance in some workloads
> > (e.g. RHEL4 gettimeofday() intensive workloads).
> >
> Reduced performance is what I mean. Obviously old guests will continue working.
I'm not happy about it either.
> > > Also my not so old host kernel uses MSI only for NIC.
> > > SATA and USB are using IOAPIC (though this is probably more HW related
> > > than kernel version related).
> >
> > For devices emulated in userspace, it doesn't matter where the IOAPIC
> > is. It only matters for kernel provided devices (PIT, assigned devices,
> > vhost-net).
> >
> What about EOI that will have to do additional exit to userspace for each
> interrupt delivered?
I think the ioapic EOI is asynchronous wrt the core, yes? So the vcpu
can just post the EOI broadcast on the apic-bus socketpair, waking up
the thread handling the ioapic, and continue running. This trades off
vcpu latency for using more host resources.
--
error compiling committee.c: too many arguments to function
On Sun, Feb 05, 2012 at 11:56:21AM +0200, Avi Kivity wrote:
> On 02/05/2012 11:51 AM, Gleb Natapov wrote:
> > On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> > > On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > > > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > > > Device model
> > > > > ------------
> > > > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > > > PCI devices assigned from the host. The API allows emulating the local
> > > > > APICs in userspace.
> > > > >
> > > > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > > > them to userspace. Note: this may cause a regression for older guests
> > > > > that don't support MSI or kvmclock. Device assignment will be done
> > > > > using VFIO, that is, without direct kvm involvement.
> > > > >
> > > > So are we officially saying that KVM is only for modern guest
> > > > virtualization?
> > >
> > > No, but older guests may have reduced performance in some workloads
> > > (e.g. RHEL4 gettimeofday() intensive workloads).
> > >
> > Reduced performance is what I mean. Obviously old guests will continue working.
>
> I'm not happy about it either.
>
It is not only about old guests either. In RHEL we pretend not to
support HPET because when some guests detect it they access its mmio
frequently in certain workloads. For Linux guests we can avoid that by
using kvmclock. For Windows guests I hope we will have enlightenment
timers + RTC, but what about other guests? *BSD? How often do they
access the HPET when it is available? We will probably have to move
HPET into the kernel if we want to make it usable.
So what is the criterion for a device to be emulated in userspace vs. kernelspace
in the new API? Never? What about vhost-net then? Only if a device works in MSI
mode? This may work for the HPET case, but it looks like an artificial limitation,
since the problem with HPET is not interrupt latency but mmio space access.
And BTW, what about enlightenment timers for Windows? Are we going to
implement them in userspace or the kernel?
> > > > Also my not so old host kernel uses MSI only for NIC.
> > > > SATA and USB are using IOAPIC (though this is probably more HW related
> > > > than kernel version related).
> > >
> > > For devices emulated in userspace, it doesn't matter where the IOAPIC
> > > is. It only matters for kernel provided devices (PIT, assigned devices,
> > > vhost-net).
> > >
> > What about EOI that will have to do additional exit to userspace for each
> > interrupt delivered?
>
> I think the ioapic EOI is asynchronous wrt the core, yes? So the vcpu
Probably; I do not see what problem async EOI may cause.
> can just post the EOI broadcast on the apic-bus socketpair, waking up
> the thread handling the ioapic, and continue running. This trades off
> vcpu latency for using more host resources.
>
Sounds good. This will increase IOAPIC interrupt latency, though, since the next
interrupt (same GSI) can't be delivered until the EOI is processed.
--
Gleb.
On 02/05/2012 12:58 PM, Gleb Natapov wrote:
> > > >
> > > Reduced performance is what I mean. Obviously old guests will continue working.
> >
> > I'm not happy about it either.
> >
> It is not only about old guests either. In RHEL we pretend to not
> support HPET because when some guests detect it they are accessing
> its mmio frequently for certain workloads. For Linux guests we can
> avoid that by using kvmclock. For Windows guests I hope we will have
> enlightenment timers + RTC, but what about other guests? *BSD? How often
> they access HPET when it is available? We will probably have to move
> HPET into the kernel if we want to make it usable.
If we have to, we'll do it.
> So what is the criteria for device to be emulated in userspace vs kernelspace
> in new API? Never? What about vhost-net then? Only if a device works in MSI
> mode? This may work for HPET case, but looks like artificial limitation
> since the problem with HPET is not interrupt latency, but mmio space
> access.
The criterion is: only if it's absolutely necessary.
> And BTW, what about enlightenment timers for Windows? Are we going to
> implement them in userspace or kernel?
The kernel.
--
error compiling committee.c: too many arguments to function
On 02/05/2012 03:51 AM, Gleb Natapov wrote:
> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>> Device model
>>>> ------------
>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>> PCI devices assigned from the host. The API allows emulating the local
>>>> APICs in userspace.
>>>>
>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>> them to userspace. Note: this may cause a regression for older guests
>>>> that don't support MSI or kvmclock. Device assignment will be done
>>>> using VFIO, that is, without direct kvm involvement.
>>>>
>>> So are we officially saying that KVM is only for modern guest
>>> virtualization?
>>
>> No, but older guests may have reduced performance in some workloads
>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>
> Reduced performance is what I mean. Obviously old guests will continue working.
An interesting solution to this problem would be an in-kernel device VM.
Most of the time, the hot register is just one register within a more complex
device. The reads are often side-effect free and trivially computed from some
device state + host time.
If userspace had a way to upload bytecode to the kernel that was executed for a
PIO operation, it could either pass the operation to userspace or handle it
within the kernel when possible, without taking a heavyweight exit.
If the bytecode can access variables in a shared memory area, it could be pretty
efficient to work with.
This means that the kernel never has to deal with specific in-kernel devices, but
userspace can accelerate as many of its devices as it sees fit.
This could replace ioeventfd as a mechanism (which would allow clearing the
notify flag before writing to an eventfd).
We could potentially just use BPF for this.
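Just to make the shape of the interface concrete, the registration side might look something like this (entirely hypothetical; neither the structure nor any such ioctl exists):

#include <linux/types.h>

/* Hypothetical: attach a bytecode filter to an I/O range. */
struct kvm_io_filter {
        __u64 addr;             /* PIO port or MMIO guest physical address */
        __u32 len;              /* size of the range */
        __u32 flags;            /* PIO vs. MMIO, read/write */
        __u64 scratch;          /* userspace address of shared scratch/state area */
        __u32 scratch_size;
        __u32 prog_len;         /* number of bytecode instructions */
        __u64 prog;             /* userspace address of the bytecode program */
};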
Regards,
Anthony Liguori
On 02/05/2012 06:36 PM, Anthony Liguori wrote:
> On 02/05/2012 03:51 AM, Gleb Natapov wrote:
>> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>>> Device model
>>>>> ------------
>>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>>> PCI devices assigned from the host. The API allows emulating the
>>>>> local
>>>>> APICs in userspace.
>>>>>
>>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>>> them to userspace. Note: this may cause a regression for older
>>>>> guests
>>>>> that don't support MSI or kvmclock. Device assignment will be done
>>>>> using VFIO, that is, without direct kvm involvement.
>>>>>
>>>> So are we officially saying that KVM is only for modern guest
>>>> virtualization?
>>>
>>> No, but older guests may have reduced performance in some workloads
>>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>>
>> Reduced performance is what I mean. Obviously old guests will
>> continue working.
>
> An interesting solution to this problem would be an in-kernel device VM.
It's interesting, yes, but has a very high barrier to implementation.
>
> Most of the time, the hot register is just one register within a more
> complex device. The reads are often side-effect free and trivially
> computed from some device state + host time.
Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
There are also interactions with other devices (for example the
apic/ioapic interaction via the apic bus).
>
> If userspace had a way to upload bytecode to the kernel that was
> executed for a PIO operation, it could either pass the operation to
> userspace or handle it within the kernel when possible without taking
> a heavy weight exit.
>
> If the bytecode can access variables in a shared memory area, it could
> be pretty efficient to work with.
>
> This means that the kernel never has to deal with specific in-kernel
> devices but that userspace can accelerator as many of its devices as
> it sees fit.
I would really love to have this, but the problem is that we'd need a
general-purpose bytecode VM with bindings to some kernel APIs. The
bytecode VM, if made general enough to host more complicated devices,
would likely be much larger than the actual code we have in the kernel now.
>
> This could replace ioeventfd as a mechanism (which would allow
> clearing the notify flag before writing to an eventfd).
>
> We could potentially just use BPF for this.
BPF generally just computes a predicate. We could overload the scratch
area for storing internal state and for read results, though (and have
an "mmio scratch register" for reading the time).
--
error compiling committee.c: too many arguments to function
On 02/06/2012 03:34 AM, Avi Kivity wrote:
> On 02/05/2012 06:36 PM, Anthony Liguori wrote:
>> On 02/05/2012 03:51 AM, Gleb Natapov wrote:
>>> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>>>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>>>> Device model
>>>>>> ------------
>>>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>>>> PCI devices assigned from the host. The API allows emulating the
>>>>>> local
>>>>>> APICs in userspace.
>>>>>>
>>>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>>>> them to userspace. Note: this may cause a regression for older
>>>>>> guests
>>>>>> that don't support MSI or kvmclock. Device assignment will be done
>>>>>> using VFIO, that is, without direct kvm involvement.
>>>>>>
>>>>> So are we officially saying that KVM is only for modern guest
>>>>> virtualization?
>>>>
>>>> No, but older guests may have reduced performance in some workloads
>>>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>>>
>>> Reduced performance is what I mean. Obviously old guests will
>>> continue working.
>>
>> An interesting solution to this problem would be an in-kernel device VM.
>
> It's interesting, yes, but has a very high barrier to implementation.
>
>>
>> Most of the time, the hot register is just one register within a more
>> complex device. The reads are often side-effect free and trivially
>> computed from some device state + host time.
>
> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
> There are also interactions with other devices (for example the
> apic/ioapic interaction via the apic bus).
Hrm, maybe I'm missing it, but the path that would be hot is:
if (!status_latched && !count_latched) {
        value = kpit_elapsed();
        // manipulate count based on mode
        // mask value depending on read_state
}
This path is side-effect free, and applies relatively simple math to a time counter.
The idea would be to allow the filter to not handle an I/O request depending on
existing state. Anything that modifies state (like reading the latched count)
would drop to userspace.
>
>>
>> If userspace had a way to upload bytecode to the kernel that was
>> executed for a PIO operation, it could either pass the operation to
>> userspace or handle it within the kernel when possible without taking
>> a heavy weight exit.
>>
>> If the bytecode can access variables in a shared memory area, it could
>> be pretty efficient to work with.
>>
>> This means that the kernel never has to deal with specific in-kernel
>> devices but that userspace can accelerator as many of its devices as
>> it sees fit.
>
> I would really love to have this, but the problem is that we'd need a
> general purpose bytecode VM with binding to some kernel APIs. The
> bytecode VM, if made general enough to host more complicated devices,
> would likely be much larger than the actual code we have in the kernel now.
I think the question is whether BPF is good enough as it stands. I'm not really
sure. I agree that inventing a new bytecode VM is probably not worth it.
>>
>> This could replace ioeventfd as a mechanism (which would allow
>> clearing the notify flag before writing to an eventfd).
>>
>> We could potentially just use BPF for this.
>
> BPF generally just computes a predicate.
Can it modify a packet in place? I think a predicate is about right (can this
io operation be handled in the kernel or not), but the question is whether
there's a way to produce an output as a side effect.
> We could overload the scratch
> area for storing internal state and for read results, though (and have
> an "mmio scratch register" for reading the time).
Right.
Regards,
Anthony Liguori
On 02/06/2012 03:33 PM, Anthony Liguori wrote:
>> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
>> There are also interactions with other devices (for example the
>> apic/ioapic interaction via the apic bus).
>
>
> Hrm, maybe I'm missing it, but the path that would be hot is:
>
> if (!status_latched && !count_latched) {
> value = kpit_elapsed()
> // manipulate count based on mode
> // mask value depending on read_state
> }
>
> This path is side-effect free, and applies relatively simple math to a
> time counter.
Do guests always read an unlatched counter? Doesn't seem reasonable
since they can't get a stable count this way.
>
> The idea would be to allow the filter to not handle an I/O request
> depending on existing state. Anything that's modifies state (like
> reading the latch counter) would drop to userspace.
This restricts us to a subset of the device which is at the mercy of the
guest.
>
>>
>>>
>>> If userspace had a way to upload bytecode to the kernel that was
>>> executed for a PIO operation, it could either pass the operation to
>>> userspace or handle it within the kernel when possible without taking
>>> a heavy weight exit.
>>>
>>> If the bytecode can access variables in a shared memory area, it could
>>> be pretty efficient to work with.
>>>
>>> This means that the kernel never has to deal with specific in-kernel
>>> devices but that userspace can accelerator as many of its devices as
>>> it sees fit.
>>
>> I would really love to have this, but the problem is that we'd need a
>> general purpose bytecode VM with binding to some kernel APIs. The
>> bytecode VM, if made general enough to host more complicated devices,
>> would likely be much larger than the actual code we have in the
>> kernel now.
>
> I think the question is whether BPF is good enough as it stands. I'm
> not really sure.
I think not. It doesn't have 64-bit muldiv, required for hpet, for example.
> I agree that inventing a new bytecode VM is probably not worth it.
>
>>>
>>> This could replace ioeventfd as a mechanism (which would allow
>>> clearing the notify flag before writing to an eventfd).
>>>
>>> We could potentially just use BPF for this.
>>
>> BPF generally just computes a predicate.
>
> Can it modify a packet in place? I think a predicate is about right
> (can this io operation be handled in the kernel or not) but the
> question is whether there's a way produce an output as a side effect.
You can use the scratch area, and say that it's persistent. But the VM
itself isn't rich enough.
>
>> We could overload the scratch
>> area for storing internal state and for read results, though (and have
>> an "mmio scratch register" for reading the time).
>
> Right.
>
We could define mmio registers for muldiv64, and for communicating over
the APIC bus. But then the device model for BPF ends up more
complicated than the kernel devices we have put together.
--
error compiling committee.c: too many arguments to function
On 02/06/2012 07:54 AM, Avi Kivity wrote:
> On 02/06/2012 03:33 PM, Anthony Liguori wrote:
>>> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
>>> There are also interactions with other devices (for example the
>>> apic/ioapic interaction via the apic bus).
>>
>>
>> Hrm, maybe I'm missing it, but the path that would be hot is:
>>
>> if (!status_latched && !count_latched) {
>> value = kpit_elapsed()
>> // manipulate count based on mode
>> // mask value depending on read_state
>> }
>>
>> This path is side-effect free, and applies relatively simple math to a
>> time counter.
>
> Do guests always read an unlatched counter? Doesn't seem reasonable
> since they can't get a stable count this way.
Perhaps. You could have the latching done by writing to persistent scratch
memory, but then locking becomes an issue.
>> The idea would be to allow the filter to not handle an I/O request
>> depending on existing state. Anything that's modifies state (like
>> reading the latch counter) would drop to userspace.
>
> This restricts us to a subset of the device which is at the mercy of the
> guest.
Yes, but it provides an elegant, flexible way to handle the fast path
generically without presenting additional security concerns.
A similar, albeit more complex and less elegant, approach would be to make use
of something like the vtpm optimization to reflect certain exits back into
code injected into the guest. But this has the disadvantage of being very
x86-centric, and it's not clear whether you can avoid double exits, which would
hurt the slow paths.
> We could define mmio registers for muldiv64, and for communicating over
> the APIC bus. But then the device model for BPF ends up more
> complicated than the kernel devices we have put together.
Maybe what we really need is NaCl for kernel space :-D
Regards,
Anthony Liguori
On 02/06/2012 04:00 PM, Anthony Liguori wrote:
>> Do guests always read an unlatched counter? Doesn't seem reasonable
>> since they can't get a stable count this way.
>
>
> Perhaps. You could have the latching done by writing to persisted
> scratch memory but then locking becomes an issue.
Oh, you'd certainly serialize the entire device.
>
>>> The idea would be to allow the filter to not handle an I/O request
>>> depending on existing state. Anything that's modifies state (like
>>> reading the latch counter) would drop to userspace.
>>
>> This restricts us to a subset of the device which is at the mercy of the
>> guest.
>
> Yes, but it provides an elegant solution to having a flexible way to
> do things in the fast path in a generic way without presenting
> additional security concerns.
>
> A similar, albeit more complex and less elegant, approach would be to
> make use of something like the vtpm optimization to reflect certain
> exits back into injected code into the guest. But this has the
> disadvantage of being very x86-centric and it's not clear if you can
> avoid double exits which would hurt the slow paths.
It's also hard to communicate with the rest of the host kernel (say for
timers). You can't ensure that any piece of memory will be virtually
mapped, and with the correct permissions too.
>
>> We could define mmio registers for muldiv64, and for communicating over
>> the APIC bus. But then the device model for BPF ends up more
>> complicated than the kernel devices we have put together.
>
> Maybe what we really need is NaCL for kernel space :-D
NaCl or bytecode, doesn't matter. But we do need bindings to other
kernel and kvm services.
--
error compiling committee.c: too many arguments to function
On 02/03/2012 04:52 PM, Anthony Liguori wrote:
> On 02/03/2012 12:07 PM, Eric Northup wrote:
>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<[email protected]> wrote:
>> [...]
>>>
>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>
>>> - adding new syscalls is generally frowned upon, and kvm will need
>>> several
>>> - syscalls into modules are harder and rarer than into core kernel code
>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>> mm_struct
>> - Lost a good place to put access control (permissions on /dev/kvm)
>> for which user-mode processes can use KVM.
>>
>> How would the ability to use sys_kvm_* be regulated?
>
> Why should it be regulated?
>
> It's not a finite or privileged resource.
You're exposing a large, complex kernel subsystem that does very
low-level things with the hardware. It's a potential source of exploits
(from bugs in KVM or in hardware). I can see people wanting to be
selective with access because of that.
And sometimes it is a finite resource. I don't know how x86 does it,
but on at least some powerpc hardware we have a finite, relatively small
number of hardware partition IDs.
-Scott
On 03.02.2012, at 03:09, Anthony Liguori wrote:
> On 02/02/2012 10:09 AM, Avi Kivity wrote:
>> The kvm api has been accumulating cruft for several years now. This is
>> due to feature creep, fixing mistakes, experience gained by the
>> maintainers and developers on how to do things, ports to new
>> architectures, and simply as a side effect of a code base that is
>> developed slowly and incrementally.
>>
>> While I don't think we can justify a complete revamp of the API now, I'm
>> writing this as a thought experiment to see where a from-scratch API can
>> take us. Of course, if we do implement this, the new and old APIs will
>> have to be supported side by side for several years.
>>
>> Syscalls
>> --------
>> kvm currently uses the much-loved ioctl() system call as its entry
>> point. While this made it easy to add kvm to the kernel unintrusively,
>> it does have downsides:
>>
>> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
>> (low but measurable)
>> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
>> a vm to be tied to an mm_struct, but the current API ties them to file
>> descriptors, which can move between threads and processes. We check
>> that they don't, but we don't want to.
>>
>> Moving to syscalls avoids these problems, but introduces new ones:
>>
>> - adding new syscalls is generally frowned upon, and kvm will need several
>> - syscalls into modules are harder and rarer than into core kernel code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
>>
>> Syscalls that operate on the entire guest will pick it up implicitly
>> from the mm_struct, and syscalls that operate on a vcpu will pick it up
>> from current.
>
> This seems like the natural progression.
I don't like the idea too much. On s390 and ppc we can set another vcpu's interrupt status. How would that work in this model?
I really do like the ioctl model btw. It's easily extensible and easy to understand.
I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just moving really fast. So having an interface that allows for easy extension is a must-have.
>
>> State accessors
>> ---------------
>> Currently vcpu state is read and written by a bunch of ioctls that
>> access register sets that were added (or discovered) along the years.
>> Some state is stored in the vcpu mmap area. These will be replaced by a
>> pair of syscalls that read or write the entire state, or a subset of the
>> state, in a tag/value format. A register will be described by a tuple:
>>
>> set: the register set to which it belongs; either a real set (GPR,
>> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>> number: register number within a set
>> size: for self-description, and to allow expanding registers like
>> SSE->AVX or eax->rax
>> attributes: read-write, read-only, read-only for guest but read-write
>> for host
>> value
>
> I do like the idea a lot of being able to read one register at a time as often times that's all you need.
The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
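For reference, ONE_REG today is a per-vcpu ioctl taking an id/address pair, used roughly like this from userspace (a minimal sketch; real register ids are the per-architecture encodings from the uapi headers, so the id here is the caller's choice):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Read a single register through the existing ONE_REG interface. */
static int read_one_reg(int vcpu_fd, uint64_t reg_id, uint64_t *value)
{
        struct kvm_one_reg reg = {
                .id   = reg_id,
                .addr = (uintptr_t)value,   /* kernel writes the value here */
        };

        return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}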
>
>>
>> Device model
>> ------------
>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>> PCI devices assigned from the host. The API allows emulating the local
>> APICs in userspace.
>>
>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>> them to userspace.
>
> I'm a big fan of this.
>
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock. Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>>
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest. This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
>
> I think this makes sense. An interesting consequence of this is that it's no longer necessary to associate the VCPU context with an MMIO/PIO operation. I'm not sure if there's an obvious benefit to that but it's interesting nonetheless.
>
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.
What is keeping us from moving there today?
>>
>> Ioeventfd/irqfd
>> ---------------
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>> to a socketpair carrying transactions. This allows a device model to be
>> implemented out-of-process. The socketpair can also be used to
>> implement a replacement for coalesced mmio, by not waiting for responses
>> on write transactions when enabled. Synchronization of coalesced mmio
>> will be implemented in the kernel, not userspace as now: when a
>> non-coalesced mmio is needed, the kernel will first flush the coalesced
>> mmio queue(s).
I would vote for completely deprecating coalesced MMIO. It is a generic framework that nobody except for VGA really needs. Better to make something that accelerates the read and write paths using more specific knowledge of the interface.
One thing I'm thinking of here is IDE. There's no need to do a PIO callback into user space for all the status ports. We only really care about a callback on a write to register 7 (the command register). All the others are basically registers that the kernel could just read and write from shared memory.
I'm sure the VGA text stuff could use similar acceleration with well-known interfaces.
To me, coalesced mmio has proven that it's generalization where it doesn't belong.
>>
>> Guest memory management
>> -----------------------
>> Instead of managing each memory slot individually, a single API will be
>> provided that replaces the entire guest physical memory map atomically.
>> This matches the implementation (using RCU) and plugs holes in the
>> current API, where you lose the dirty log in the window between the last
>> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> that removes the slot.
So we render the actual slot logic invisible? That's a very good idea.
>>
>> Slot-based dirty logging will be replaced by range-based and work-based
>> dirty logging; that is "what pages are dirty in this range, which may be
>> smaller than a slot" and "don't return more than N pages".
>>
>> We may want to place the log in user memory instead of kernel memory, to
>> reduce pinned memory and increase flexibility.
>
> Since we really only support 64-bit hosts, what about just pointing the kernel at a address/size pair and rely on userspace to mmap() the range appropriately?
That's basically what he suggested, no?
>
>> vcpu fd mmap area
>> -----------------
>> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> communications. This will be replaced by a more orthodox pointer
>> parameter to sys_kvm_enter_guest(), that will be accessed using
>> get_user() and put_user(). This is slower than the current situation,
>> but better for things like strace.
I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
> Look pretty interesting overall.
Yeah, I agree with most ideas, except for the syscall one. Everything else can easily be implemented on top of the current model.
Alex
On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
> > On 02/03/2012 12:07 PM, Eric Northup wrote:
> >> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<[email protected]> wrote:
> >> [...]
> >>>
> >>> Moving to syscalls avoids these problems, but introduces new ones:
> >>>
> >>> - adding new syscalls is generally frowned upon, and kvm will need
> >>> several
> >>> - syscalls into modules are harder and rarer than into core kernel code
> >>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> >>> mm_struct
> >> - Lost a good place to put access control (permissions on /dev/kvm)
> >> for which user-mode processes can use KVM.
> >>
> >> How would the ability to use sys_kvm_* be regulated?
> >
> > Why should it be regulated?
> >
> > It's not a finite or privileged resource.
>
> You're exposing a large, complex kernel subsystem that does very
> low-level things with the hardware. It's a potential source of exploits
> (from bugs in KVM or in hardware). I can see people wanting to be
> selective with access because of that.
Exactly.
In a perfect world I'd agree with Anthony, but in reality I think
sysadmins are quite happy that they can prevent some users from using
KVM.
You could presumably achieve something similar with capabilities or
whatever, but a node in /dev is much simpler.
cheers
On 07.02.2012, at 07:58, Michael Ellerman wrote:
> On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
>> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>>> On 02/03/2012 12:07 PM, Eric Northup wrote:
>>>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<[email protected]> wrote:
>>>> [...]
>>>>>
>>>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>>>
>>>>> - adding new syscalls is generally frowned upon, and kvm will need
>>>>> several
>>>>> - syscalls into modules are harder and rarer than into core kernel code
>>>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>>>> mm_struct
>>>> - Lost a good place to put access control (permissions on /dev/kvm)
>>>> for which user-mode processes can use KVM.
>>>>
>>>> How would the ability to use sys_kvm_* be regulated?
>>>
>>> Why should it be regulated?
>>>
>>> It's not a finite or privileged resource.
>>
>> You're exposing a large, complex kernel subsystem that does very
>> low-level things with the hardware. It's a potential source of exploits
>> (from bugs in KVM or in hardware). I can see people wanting to be
>> selective with access because of that.
>
> Exactly.
>
> In a perfect world I'd agree with Anthony, but in reality I think
> sysadmins are quite happy that they can prevent some users from using
> KVM.
>
> You could presumably achieve something similar with capabilities or
> whatever, but a node in /dev is much simpler.
Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd.
But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us.
Alex
On 02/07/2012 03:08 AM, Alexander Graf wrote:
> I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
It would be a "vm-wide syscall". You can also do that on x86 (through
KVM_IRQ_LINE).
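For reference, the existing x86 path is a vm-wide ioctl, used roughly like this from userspace (a minimal sketch; the GSI number is arbitrary):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Raise or lower a guest interrupt line via the vm-wide KVM_IRQ_LINE ioctl. */
static int set_irq_line(int vm_fd, unsigned int gsi, int active)
{
        struct kvm_irq_level irq = {
                .irq   = gsi,
                .level = active ? 1 : 0,
        };

        return ioctl(vm_fd, KVM_IRQ_LINE, &irq);
}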
>
> I really do like the ioctl model btw. It's easily extensible and easy to understand.
>
> I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really very moving. So having an interface that allows for easy extension is a must-have.
Good point. If we ever go through with it, it will only be after we see
the interface has stabilized.
>
> >
> >> State accessors
> >> ---------------
> >> Currently vcpu state is read and written by a bunch of ioctls that
> >> access register sets that were added (or discovered) along the years.
> >> Some state is stored in the vcpu mmap area. These will be replaced by a
> >> pair of syscalls that read or write the entire state, or a subset of the
> >> state, in a tag/value format. A register will be described by a tuple:
> >>
> >> set: the register set to which it belongs; either a real set (GPR,
> >> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
> >> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
> >> number: register number within a set
> >> size: for self-description, and to allow expanding registers like
> >> SSE->AVX or eax->rax
> >> attributes: read-write, read-only, read-only for guest but read-write
> >> for host
> >> value
> >
> > I do like the idea a lot of being able to read one register at a time as often times that's all you need.
>
> The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
This is more like MANY_REG, where you scatter/gather a list of registers
in userspace to the kernel or vice versa.
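(A minimal sketch of how that might be encoded: struct kvm_one_reg is the existing ONE_REG layout; the kvm_reg_entry/kvm_reg_list structures and their field names are hypothetical, simply mapping the set/number/size/attributes/value tuple quoted above onto a scatter/gather list.)

#include <linux/types.h>

/* Existing ONE_REG encoding: one 64-bit id plus a userspace pointer. */
struct kvm_one_reg {
    __u64 id;    /* encodes arch, register set, size and number */
    __u64 addr;  /* userspace address of the value */
};

/* Hypothetical MANY_REG-style list, transferred in a single call. */
struct kvm_reg_entry {
    __u32 set;         /* GPR, x87, SSE/AVX, segment, MSR, or a "fake" set */
    __u32 number;      /* register number within the set */
    __u32 size;        /* bytes, for self-description (eax->rax, SSE->AVX) */
    __u32 attributes;  /* read-write, read-only, read-only for guest */
    __u64 value_addr;  /* userspace address of the value buffer */
};

struct kvm_reg_list {
    __u32 nregs;
    __u32 flags;
    struct kvm_reg_entry entries[];  /* nregs entries follow */
};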
>
> >> The communications between the local APIC and the IOAPIC/PIC will be
> >> done over a socketpair, emulating the APIC bus protocol.
>
> What is keeping us from moving there today?
The biggest problem with this proposal is that what we have today works
reasonably well. Nothing is keeping us from moving there, except the
fear of performance regressions and lack of strong motivation.
>
> >>
> >> Ioeventfd/irqfd
> >> ---------------
> >> As the ioeventfd/irqfd mechanism has been quite successful, it will be
> >> retained, and perhaps supplemented with a way to assign an mmio region
> >> to a socketpair carrying transactions. This allows a device model to be
> >> implemented out-of-process. The socketpair can also be used to
> >> implement a replacement for coalesced mmio, by not waiting for responses
> >> on write transactions when enabled. Synchronization of coalesced mmio
> >> will be implemented in the kernel, not userspace as now: when a
> >> non-coalesced mmio is needed, the kernel will first flush the coalesced
> >> mmio queue(s).
>
> I would vote for completely deprecating coalesced MMIO. It is a generic framework that nobody except for VGA really needs.
It's actually used by e1000 too; I don't remember what the performance
benefits are. Of course, few people use e1000.
> Better make something that accelerates read and write paths thanks to more specific knowledge of the interface.
>
> One thing I'm thinking of here is IDE. There's no need to PIO callback into user space for all the status ports. We only really care about a callback on write to 7 (cmd). All the others are basically registers that the kernel could just read and write from shared memory.
>
> I'm sure the VGA text stuff could use similar acceleration with well-known interfaces.
This goes back to the discussion about a kernel bytecode vm for
accelerating mmio. The problem is that we need something really general.
> To me, coalesced mmio has proven that it's a generalization where it doesn't belong.
But you want to generalize it even more?
There's no way a patch with 'VGA' in it would be accepted.
>
> >>
> >> Guest memory management
> >> -----------------------
> >> Instead of managing each memory slot individually, a single API will be
> >> provided that replaces the entire guest physical memory map atomically.
> >> This matches the implementation (using RCU) and plugs holes in the
> >> current API, where you lose the dirty log in the window between the last
> >> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> >> that removes the slot.
>
> So we render the actual slot logic invisible? That's a very good idea.
No, slots still exist. Only the API is "replace slot list" instead of
"add slot" and "remove slot".
>
> >>
> >> Slot-based dirty logging will be replaced by range-based and work-based
> >> dirty logging; that is "what pages are dirty in this range, which may be
> >> smaller than a slot" and "don't return more than N pages".
> >>
> >> We may want to place the log in user memory instead of kernel memory, to
> >> reduce pinned memory and increase flexibility.
> >
> > Since we really only support 64-bit hosts, what about just pointing the kernel at an address/size pair and relying on userspace to mmap() the range appropriately?
>
> That's basically what he suggested, no?
No.
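(To make the distinction concrete, a hypothetical sketch of the range- and work-based query described above; every name here is illustrative, not an existing ioctl.)

#include <linux/types.h>

struct kvm_dirty_log_range {
    __u64 first_gpa;   /* start of the range to query */
    __u64 length;      /* may be smaller than a slot */
    __u32 max_pages;   /* "don't return more than N pages" */
    __u32 ndirty;      /* out: number of gfns actually returned */
    __u64 dirty_gfns;  /* userspace address of a __u64 array to fill */
};
/* The kernel writes dirty gfns directly into userspace memory, so the
 * log itself needs no pinned kernel allocation. */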
> >
> >> vcpu fd mmap area
> >> -----------------
> >> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
> >> communications. This will be replaced by a more orthodox pointer
> >> parameter to sys_kvm_enter_guest(), that will be accessed using
> >> get_user() and put_user(). This is slower than the current situation,
> >> but better for things like strace.
>
> I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
Something really critical should be handled in the kernel. Care to
provide examples?
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 02/06/2012 01:46 PM, Scott Wood wrote:
> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>> On 02/03/2012 12:07 PM, Eric Northup wrote:
>>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<[email protected]> wrote:
>>> [...]
>>>>
>>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>>
>>>> - adding new syscalls is generally frowned upon, and kvm will need
>>>> several
>>>> - syscalls into modules are harder and rarer than into core kernel code
>>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>>> mm_struct
>>> - Lost a good place to put access control (permissions on /dev/kvm)
>>> for which user-mode processes can use KVM.
>>>
>>> How would the ability to use sys_kvm_* be regulated?
>>
>> Why should it be regulated?
>>
>> It's not a finite or privileged resource.
>
> You're exposing a large, complex kernel subsystem that does very
> low-level things with the hardware.
As does the rest of the kernel.
> It's a potential source of exploits
> (from bugs in KVM or in hardware). I can see people wanting to be
> selective with access because of that.
As is true of the rest of the kernel.
If you want finer grain access control, that's exactly why we have things like
LSM and SELinux. You can add the appropriate LSM hooks into the KVM
infrastructure and setup default SELinux policies appropriately.
> And sometimes it is a finite resource. I don't know how x86 does it,
> but on at least some powerpc hardware we have a finite, relatively small
> number of hardware partition IDs.
But presumably this is per-core, right? And they're recycled, right? IOW,
there isn't a limit of number of guests <= number of hardware partition IDs.
It just impacts performance.
Regards,
Anthony Liguori
>
> -Scott
>
>
On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>
>> It's a potential source of exploits
>> (from bugs in KVM or in hardware). I can see people wanting to be
>> selective with access because of that.
>
> As is true of the rest of the kernel.
>
> If you want finer grain access control, that's exactly why we have
> things like LSM and SELinux. You can add the appropriate LSM hooks
> into the KVM infrastructure and setup default SELinux policies
> appropriately.
LSMs protect objects, not syscalls. There isn't an object to protect
here (except the fake /dev/kvm object).
In theory, kvm is exactly the same as other syscalls, but in practice,
it is used by only very few user programs, so there may be many
unexercised paths.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 02/07/2012 06:40 AM, Avi Kivity wrote:
> On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>>
>>> It's a potential source of exploits
>>> (from bugs in KVM or in hardware). I can see people wanting to be
>>> selective with access because of that.
>>
>> As is true of the rest of the kernel.
>>
>> If you want finer grain access control, that's exactly why we have things like
>> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
>> infrastructure and setup default SELinux policies appropriately.
>
> LSMs protect objects, not syscalls. There isn't an object to protect here
> (except the fake /dev/kvm object).
A VM can be an object.
Regards,
Anthony Liguori
> In theory, kvm is exactly the same as other syscalls, but in practice, it is
> used by only very few user programs, so there may be many unexercised paths.
>
On 07.02.2012, at 13:24, Avi Kivity wrote:
> On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
>
> It would be a "vm-wide syscall". You can also do that on x86 (through KVM_IRQ_LINE).
>
>>
>> I really do like the ioctl model btw. It's easily extensible and easy to understand.
>>
>> I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really fast moving. So having an interface that allows for easy extension is a must-have.
>
> Good point. If we ever go through with it, it will only be after we see the interface has stabilized.
Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that's stabilizing to a point where we don't find major ABI issues anymore.
>
>>
>> >
>> >> State accessors
>> >> ---------------
>> >> Currently vcpu state is read and written by a bunch of ioctls that
>> >> access register sets that were added (or discovered) along the years.
>> >> Some state is stored in the vcpu mmap area. These will be replaced by a
>> >> pair of syscalls that read or write the entire state, or a subset of the
>> >> state, in a tag/value format. A register will be described by a tuple:
>> >>
>> >> set: the register set to which it belongs; either a real set (GPR,
>> >> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> >> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>> >> number: register number within a set
>> >> size: for self-description, and to allow expanding registers like
>> >> SSE->AVX or eax->rax
>> >> attributes: read-write, read-only, read-only for guest but read-write
>> >> for host
>> >> value
>> >
>> > I do like the idea a lot of being able to read one register at a time as often times that's all you need.
>>
>> The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
>
> This is more like MANY_REG, where you scatter/gather a list of registers in userspace to the kernel or vice versa.
Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to give every register a unique identifier that can be used to access it. Taking that logic to an array is trivial.
>
>>
>> >> The communications between the local APIC and the IOAPIC/PIC will be
>> >> done over a socketpair, emulating the APIC bus protocol.
>>
>> What is keeping us from moving there today?
>
> The biggest problem with this proposal is that what we have today works reasonably well. Nothing is keeping us from moving there, except the fear of performance regressions and lack of strong motivation.
So why bring it up in the "next-gen" api discussion?
>
>>
>> >>
>> >> Ioeventfd/irqfd
>> >> ---------------
>> >> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> >> retained, and perhaps supplemented with a way to assign an mmio region
>> >> to a socketpair carrying transactions. This allows a device model to be
>> >> implemented out-of-process. The socketpair can also be used to
>> >> implement a replacement for coalesced mmio, by not waiting for responses
>> >> on write transactions when enabled. Synchronization of coalesced mmio
>> >> will be implemented in the kernel, not userspace as now: when a
>> >> non-coalesced mmio is needed, the kernel will first flush the coalesced
>> >> mmio queue(s).
>>
>> I would vote for completely deprecating coalesced MMIO. It is a generic framework that nobody except for VGA really needs.
>
> It's actually used by e1000 too, don't remember what the performance benefits are. Of course, few people use e1000.
And for e1000 it's only used for nvram which actually could benefit from a more clever "this is backed by ram" logic. Coalesced mmio is not a great fit here.
>
>> Better make something that accelerates read and write paths thanks to more specific knowledge of the interface.
>>
>> One thing I'm thinking of here is IDE. There's no need to PIO callback into user space for all the status ports. We only really care about a callback on write to 7 (cmd). All the others are basically registers that the kernel could just read and write from shared memory.
>>
>> I'm sure the VGA text stuff could use similar acceleration with well-known interfaces.
>
> This goes back to the discussion about a kernel bytecode vm for accelerating mmio. The problem is that we need something really general.
>
>> To me, coalesced mmio has proven that it's a generalization where it doesn't belong.
>
> But you want to generalize it even more?
>
> There's no way a patch with 'VGA' in it would be accepted.
Why not? I think the natural step forward is hybrid acceleration. Take a minimal subset of device emulation into kernel land, keep the rest in user space. Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
Good candidates for in-kernel acceleration are:
- HPET
- VGA
- IDE
I'm not sure how easy it would be to only partially accelerate the hot paths of the IO-APIC. I'm not too familiar with its details.
We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
>
>>
>> >>
>> >> Guest memory management
>> >> -----------------------
>> >> Instead of managing each memory slot individually, a single API will be
>> >> provided that replaces the entire guest physical memory map atomically.
>> >> This matches the implementation (using RCU) and plugs holes in the
>> >> current API, where you lose the dirty log in the window between the last
>> >> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> >> that removes the slot.
>>
>> So we render the actual slot logic invisible? That's a very good idea.
>
> No, slots still exist. Only the API is "replace slot list" instead of "add slot" and "remove slot".
Why? On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page-table-like here. That only works when the internal slot structure is hidden from user space though.
>
>>
>> >>
>> >> Slot-based dirty logging will be replaced by range-based and work-based
>> >> dirty logging; that is "what pages are dirty in this range, which may be
>> >> smaller than a slot" and "don't return more than N pages".
>> >>
>> >> We may want to place the log in user memory instead of kernel memory, to
>> >> reduce pinned memory and increase flexibility.
>> >
>> > Since we really only support 64-bit hosts, what about just pointing the kernel at an address/size pair and relying on userspace to mmap() the range appropriately?
>>
>> That's basically what he suggested, no?
>
>
> No.
>
>> >
>> >> vcpu fd mmap area
>> >> -----------------
>> >> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> >> communications. This will be replaced by a more orthodox pointer
>> >> parameter to sys_kvm_enter_guest(), that will be accessed using
>> >> get_user() and put_user(). This is slower than the current situation,
>> >> but better for things like strace.
>>
>> I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
>
> Something really critical should be handled in the kernel. Care to provide examples?
Just look at the s390 patches Christian posted recently. I think that's a very nice direction to walk towards.
For permanently mapped space, the hybrid stuff above could fall into that category. We could however do it through copy_from/to_user with a user space pointer.
So maybe you're right - the mmap'ed space isn't all that important. Having kernel space write into user space memory is however.
Alex
On 02/07/2012 02:51 PM, Alexander Graf wrote:
> On 07.02.2012, at 13:24, Avi Kivity wrote:
>
> > On 02/07/2012 03:08 AM, Alexander Graf wrote:
> >> I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
> >
> > It would be a "vm-wide syscall". You can also do that on x86 (through KVM_IRQ_LINE).
> >
> >>
> >> I really do like the ioctl model btw. It's easily extensible and easy to understand.
> >>
> >> I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really fast moving. So having an interface that allows for easy extension is a must-have.
> >
> > Good point. If we ever go through with it, it will only be after we see the interface has stabilized.
>
> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
I would expect that newer archs have fewer constraints, not more.
> The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
The trick is to get the ABI to be flexible, like a generalized ABI for
state. But it's true that it's really hard to nail it down.
> >>
> >> The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
> >
> > This is more like MANY_REG, where you scatter/gather a list of registers in userspace to the kernel or vice versa.
>
> Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to give every register a unique identifier that can be used to access it. Taking that logic to an array is trivial.
Definitely easy to extend.
> >
> >>
> >> >> The communications between the local APIC and the IOAPIC/PIC will be
> >> >> done over a socketpair, emulating the APIC bus protocol.
> >>
> >> What is keeping us from moving there today?
> >
> > The biggest problem with this proposal is that what we have today works reasonably well. Nothing is keeping us from moving there, except the fear of performance regressions and lack of strong motivation.
>
> So why bring it up in the "next-gen" api discussion?
One reason is to try to shape future changes to the current ABI in the
same direction. Another is that maybe someone will convince us that it
is needed.
> >
> > There's no way a patch with 'VGA' in it would be accepted.
>
> Why not? I think the natural step forward is hybrid acceleration. Take a minimal subset of device emulation into kernel land, keep the rest in user space.
When a device is fully in the kernel, we have a good specification of
the ABI: it just implements the spec, and the ABI provides the interface
from the device to the rest of the world. Partially accelerated devices
means a much greater effort in specifying exactly what it does. It's
also vulnerable to changes in how the guest uses the device.
> Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
vhost-net was a massive effort, I hope we don't have to replicate it.
>
> Good candidates for in-kernel acceleration are:
>
> - HPET
Yes
> - VGA
> - IDE
Why? There are perfectly good replacements for these (qxl, virtio-blk,
virtio-scsi).
> I'm not sure how easy it would be to only partially accelerate the hot paths of the IO-APIC. I'm not too familiar with its details.
Pretty hard.
>
> We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
Pointer to the qemu code?
> The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
Like I mentioned, I see that as a good thing.
> >
> > No, slots still exist. Only the API is "replace slot list" instead of "add slot" and "remove slot".
>
> Why?
Physical memory is discontiguous, and includes aliases (two gpas
referencing the same backing page). How else would you describe it?
> On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
We can certainly convert the slots to a tree internally. I'm doing the
same thing for qemu now, maybe we can do it for kvm too. No need to
involve the ABI at all.
Slot searching is quite fast since there's a small number of slots, and
we sort the larger ones to be in the front, so positive lookups are
fast. We cache negative lookups in the shadow page tables (an spte can
be either "not mapped", "mapped to RAM", or "not mapped and known to be
mmio") so we rarely need to walk the entire list.
> That only works when the internal slot structure is hidden from user space though.
Why?
>
> >> I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
> >
> > Something really critical should be handled in the kernel. Care to provide examples?
>
> Just look at the s390 patches Christian posted recently.
Which ones?
> I think that's a very nice direction to walk towards.
> For permanently mapped space, the hybrid stuff above could fall into that category. We could however do it through copy_from/to_user with a user space pointer.
>
> So maybe you're right - the mmap'ed space isn't all that important. Having kernel space write into user space memory is however.
>
>
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 02/07/2012 02:51 PM, Anthony Liguori wrote:
> On 02/07/2012 06:40 AM, Avi Kivity wrote:
>> On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>>>
>>>> It's a potential source of exploits
>>>> (from bugs in KVM or in hardware). I can see people wanting to be
>>>> selective with access because of that.
>>>
>>> As is true of the rest of the kernel.
>>>
>>> If you want finer grain access control, that's exactly why we have
>>> things like
>>> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
>>> infrastructure and setup default SELinux policies appropriately.
>>
>> LSMs protect objects, not syscalls. There isn't an object to protect
>> here
>> (except the fake /dev/kvm object).
>
> A VM can be an object.
>
Not really, it's not accessible in a namespace. How would you label it?
Maybe we can reuse the process label/context (not sure what the right
term is for a process).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 07.02.2012, at 14:16, Avi Kivity wrote:
> On 02/07/2012 02:51 PM, Alexander Graf wrote:
>> On 07.02.2012, at 13:24, Avi Kivity wrote:
>>
>> > On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> >> I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
>> >
>> > It would be a "vm-wide syscall". You can also do that on x86 (through KVM_IRQ_LINE).
>> >
>> >>
>> >> I really do like the ioctl model btw. It's easily extensible and easy to understand.
>> >>
>> >> I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really fast moving. So having an interface that allows for easy extension is a must-have.
>> >
>> > Good point. If we ever go through with it, it will only be after we see the interface has stabilized.
>>
>> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
>
> I would expect that newer archs have fewer constraints, not more.
Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before?
I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
And what if MIPS comes along? I hear they also work on hw accelerated virtualization.
>
>> The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
>
> The trick is to get the ABI to be flexible, like a generalized ABI for state. But it's true that it's really hard to nail it down.
Yup, and I think what we have today is a pretty good approach to this. I'm trying to mostly add "generalized" ioctls whenever I see that something can be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are extensible with a reasonably stable ABI. Even without syscalls.
>
>
>> >>
>> >> The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
>> >
>> > This is more like MANY_REG, where you scatter/gather a list of registers in userspace to the kernel or vice versa.
>>
>> Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to give every register a unique identifier that can be used to access it. Taking that logic to an array is trivial.
>
> Definitely easy to extend.
>
>
>> >
>> >>
>> >> >> The communications between the local APIC and the IOAPIC/PIC will be
>> >> >> done over a socketpair, emulating the APIC bus protocol.
>> >>
>> >> What is keeping us from moving there today?
>> >
>> > The biggest problem with this proposal is that what we have today works reasonably well. Nothing is keeping us from moving there, except the fear of performance regressions and lack of strong motivation.
>>
>> So why bring it up in the "next-gen" api discussion?
>
> One reason is to try to shape future changes to the current ABI in the same direction. Another is that maybe someone will convince us that it is needed.
>
>> >
>> > There's no way a patch with 'VGA' in it would be accepted.
>>
>> Why not? I think the natural step forward is hybrid acceleration. Take a minimal subset of device emulation into kernel land, keep the rest in user space.
>
>
> When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device.
Why? For the HPET timer register for example, we could have a simple MMIO hook that says
on_read:
return read_current_time() - shared_page.offset;
on_write:
handle_in_user_space();
For IDE, it would be as simple as
register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
for (i = 1; i < 7; i++) {
register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
}
and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
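(None of the hook functions above exist today; as a hedged sketch under that assumption, the registration and kernel-side fast path might look roughly like this.)

enum pio_size { SIZE_BYTE = 1, SIZE_WORD = 2, SIZE_LONG = 4 };

struct pio_hook {
    void *backing;      /* byte(s) in shared memory backing the port */
    unsigned int size;
};

/* Hypothetical registration API, as used in the snippet above. */
int register_pio_hook_ptr_r(unsigned int port, enum pio_size size, void *backing);
int register_pio_hook_ptr_w(unsigned int port, enum pio_size size, void *backing);

/* Kernel-side fast path: serve registered ports from the backing bytes,
 * fall back to a normal heavyweight exit for everything else (e.g. the
 * IDE command register at offset 7).  memcpy as in <linux/string.h>. */
static int handle_pio_read(struct pio_hook *hook, void *val)
{
    if (!hook)
        return 1;   /* exit to userspace */
    memcpy(val, hook->backing, hook->size);
    return 0;       /* handled in the kernel */
}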
>
>> Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
>
> vhost-net was a massive effort, I hope we don't have to replicate it.
Was it harder than the in-kernel io-apic?
>
>>
>> Good candidates for in-kernel acceleration are:
>>
>> - HPET
>
> Yes
>
>> - VGA
>> - IDE
>
> Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Same for virtio.
Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest. KVM's strength has always been its close resemblance to hardware.
>
>> I'm not sure how easy it would be to only partially accelerate the hot paths of the IO-APIC. I'm not too familiar with its details.
>
> Pretty hard.
>
>>
>> We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
>
> Pointer to the qemu code?
hw/openpic.c
>
>> The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
>
> Like I mentioned, I see that as a good thing.
I don't. And we don't do it for hypercall handling on book3s hv either for example. There we have a 3 level handling system. Very hot path hypercalls get handled in real mode. Reasonably hot path hypercalls get handled in kernel space. Everything else goes to user land.
>
>> >
>> > No, slots still exist. Only the API is "replace slot list" instead of "add slot" and "remove slot".
>>
>> Why?
>
> Physical memory is discontiguous, and includes aliases (two gpas referencing the same backing page). How else would you describe it.
>
>> On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
>
> We can certainly convert the slots to a tree internally. I'm doing the same thing for qemu now, maybe we can do it for kvm too. No need to involve the ABI at all.
Hrm, true.
> Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast. We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
>
>> That only works when the internal slot structure is hidden from user space though.
>
> Why?
Because if user space thinks in terms of slots while in reality it's a tree, the two views don't match. If you decouple the external view from the internal view, it works again.
>
>>
>> >> I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
>> >
>> > Something really critical should be handled in the kernel. Care to provide examples?
>>
>> Just look at the s390 patches Christian posted recently.
>
> Which ones?
http://www.mail-archive.com/[email protected]/msg66155.html
Alex
On 02/07/2012 03:40 PM, Alexander Graf wrote:
> >>
> >> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
> >
> > I would expect that newer archs have fewer constraints, not more.
>
> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before?
That's not what I mean by constraints. It's easy to accommodate
different register layouts. Constraints (for me) are like requiring
gang scheduling. But you introduced the subject - what did you mean?
Let's take for example the software-controlled TLB on some ppc. It's
tempting to call them all "registers" and use the register interface to
access them. Is it workable?
Or let's look at SMM on x86. To implement it memory slots need an
additional attribute "SMM/non-SMM/either". These sort of things, if you
don't think of them beforehand, break your interface.
>
> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
>
> And what if MIPS comes along? I hear they also work on hw accelerated virtualization.
If it's just a matter of different register names and sizes, no
problem. From what I've seen of v8, it doesn't introduce new weirdnesses.
>
> >
> >> The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
> >
> > The trick is to get the ABI to be flexible, like a generalized ABI for state. But it's true that it's really hard to nail it down.
>
> Yup, and I think what we have today is a pretty good approach to this. I'm trying to mostly add "generalized" ioctls whenever I see that something can be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are extensible with a reasonably stable ABI. Even without syscalls.
Syscalls are orthogonal to that - they're to avoid the fget_light() and
to tighten the vcpu/thread and vm/process relationship.
> , keep the rest in user space.
> >
> >
> > When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device.
>
> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>
> on_read:
> return read_current_time() - shared_page.offset;
> on_write:
> handle_in_user_space();
It works for the really simple cases, yes, but if the guest wants to set
up one-shot timers, it fails. Also look at the PIT which latches on read.
>
> For IDE, it would be as simple as
>
> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
> for (i = 1; i < 7; i++) {
> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> }
>
> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
Just use virtio.
>
> >
> >> Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
> >
> > vhost-net was a massive effort, I hope we don't have to replicate it.
>
> Was it harder than the in-kernel io-apic?
Much, much harder.
>
> >
> >>
> >> Good candidates for in-kernel acceleration are:
> >>
> >> - HPET
> >
> > Yes
> >
> >> - VGA
> >> - IDE
> >
> > Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
>
> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Same for virtio.
>
> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
Rest easy, there's no chance of that. But if a guest is important
enough, virtio drivers will get written. IDE has no chance in hell of
approaching virtio-blk performance, no matter how much effort we put
into it.
> KVM's strength has always been its close resemblance to hardware.
This will remain. But we can't optimize everything.
> >
> >>
> >> We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
> >
> > Pointer to the qemu code?
>
> hw/openpic.c
I see what you mean.
>
> >
> >> The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
> >
> > Like I mentioned, I see that as a good thing.
>
> I don't. And we don't do it for hypercall handling on book3s hv either for example. There we have a 3 level handling system. Very hot path hypercalls get handled in real mode. Reasonably hot path hypercalls get handled in kernel space. Everything else goes to user land.
Well, the MPIC thing really supports your point.
> >
> >> >
> >> > No, slots still exist. Only the API is "replace slot list" instead of "add slot" and "remove slot".
> >>
> >> Why?
> >
> > Physical memory is discontiguous, and includes aliases (two gpas referencing the same backing page). How else would you describe it.
> >
> >> On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
> >
> > We can certainly convert the slots to a tree internally. I'm doing the same thing for qemu now, maybe we can do it for kvm too. No need to involve the ABI at all.
>
> Hrm, true.
>
> > Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast. We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
>
> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
For x86 that's not a problem, since once you map a page, it stays mapped
(on modern hardware).
>
> >
> >> That only works when the internal slot structure is hidden from user space though.
> >
> > Why?
>
> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
Userspace needs to provide a function hva = f(gpa). Why does it matter
how the function is spelled out? Slots happen to be a concise
representation. Transform the function all you like in the kernel, as
long as you preserve all the mappings.
>
> >
> >>
> >> >> I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
> >> >
> >> > Something really critical should be handled in the kernel. Care to provide examples?
> >>
> >> Just look at the s390 patches Christian posted recently.
> >
> > Which ones?
>
> http://www.mail-archive.com/[email protected]/msg66155.html
>
Yeah - s390 is always different. On the current interface synchronous
registers are easy, so why not. But I wonder if it's really critical.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 07.02.2012, at 15:21, Avi Kivity wrote:
> On 02/07/2012 03:40 PM, Alexander Graf wrote:
>> >>
>> >> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
>> >
>> > I would expect that newer archs have fewer constraints, not more.
>>
>> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before?
>
> That's not what I mean by constraints. It's easy to accommodate different register layouts. Constraints (for me) are like requiring gang scheduling. But you introduced the subject - what did you mean?
New extensions to architectures give us new challenges. Newer booke for example implements page tables in parallel to soft TLBs. We need to model that. My point was more that I can't predict the future :).
> Let's take for example the software-controlled TLB on some ppc. It's tempting to call them all "registers" and use the register interface to access them. Is it workable?
Workable, yes. Fast? No. Right now we share them between kernel and user space to have very fast access to them. That way we don't have to sync anything at all.
> Or let's look at SMM on x86. To implement it memory slots need an additional attribute "SMM/non-SMM/either". These sort of things, if you don't think of them beforehand, break your interface.
Yup. And we will never think of all the cases.
>
>>
>> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
>>
>> And what if MIPS comes along? I hear they also work on hw accelerated virtualization.
>
> If it's just a matter of different register names and sizes, no problem. From what I've seen of v8, it doesn't introduce new weirdnesses.
I haven't seen anything real yet, since the spec isn't out. So far only generic architecture documentation is available.
>
>>
>> >
>> >> The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
>> >
>> > The trick is to get the ABI to be flexible, like a generalized ABI for state. But it's true that it's really hard to nail it down.
>>
>> Yup, and I think what we have today is a pretty good approach to this. I'm trying to mostly add "generalized" ioctls whenever I see that something can be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are extensible with a reasonably stable ABI. Even without syscalls.
>
> Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship.
How about keeping the ioctl interface but moving vcpu_run to a syscall then? That should really be the only thing that belongs in the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either
a) have wrappers around register accesses, so it can directly ask for specific registers that it needs
or
b) keep everything that would be requested by the register synchronization in shared memory
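(A rough sketch of the split described in (a)/(b) above, assuming a hypothetical sys_kvm_vcpu_run: setup stays on the fd, only the hot path becomes a syscall that finds the vcpu from the current thread.)

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Today: the run loop goes through the vcpu fd on every entry. */
static int vcpu_run_today(int vcpu_fd)
{
    return ioctl(vcpu_fd, KVM_RUN, 0);
}

/* Hypothetical fast path: the vcpu is looked up from 'current', so no fd
 * lookup or ioctl dispatch; 'shared' replaces the mmap()ed kvm_run area
 * (option b) or is NULL if userspace asks for specific registers instead
 * (option a).  Sketch only, not an existing syscall. */
long sys_kvm_vcpu_run(void *shared);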
>
>> , keep the rest in user space.
>> >
>> >
>> > When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device.
>>
>> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>>
>> on_read:
>> return read_current_time() - shared_page.offset;
>> on_write:
>> handle_in_user_space();
>
> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.
I don't understand. Why would anything fail here? Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it.
> Also look at the PIT which latches on read.
>
>>
>> For IDE, it would be as simple as
>>
>> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
>> for (i = 1; i < 7; i++) {
>> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>> }
>>
>> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
>
>
> Just use virtio.
Just use xenbus. Seriously, this is not an answer.
>
>>
>> >
>> >> Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
>> >
>> > vhost-net was a massive effort, I hope we don't have to replicate it.
>>
>> Was it harder than the in-kernel io-apic?
>
> Much, much harder.
>
>>
>> >
>> >>
>> >> Good candidates for in-kernel acceleration are:
>> >>
>> >> - HPET
>> >
>> > Yes
>> >
>> >> - VGA
>> >> - IDE
>> >
>> > Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
>>
>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Same for virtio.
>>
>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
>
> Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
>
>> KVM's strength has always been its close resemblance to hardware.
>
> This will remain. But we can't optimize everything.
That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
>
>> >
>> >>
>> >> We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
>> >
>> > Pointer to the qemu code?
>>
>> hw/openpic.c
>
> I see what you mean.
>
>>
>> >
>> >> The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
>> >
>> > Like I mentioned, I see that as a good thing.
>>
>> I don't. And we don't do it for hypercall handling on book3s hv either for example. There we have a 3 level handling system. Very hot path hypercalls get handled in real mode. Reasonably hot path hypercalls get handled in kernel space. Everything else goes to user land.
>
> Well, the MPIC thing really supports your point.
I'm sure we'll find more examples :)
>
>> >
>> >> >
>> >> > No, slots still exist. Only the API is "replace slot list" instead of "add slot" and "remove slot".
>> >>
>> >> Why?
>> >
>> > Physical memory is discontiguous, and includes aliases (two gpas referencing the same backing page). How else would you describe it.
>> >
>> >> On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
>> >
>> > We can certainly convert the slots to a tree internally. I'm doing the same thing for qemu now, maybe we can do it for kvm too. No need to involve the ABI at all.
>>
>> Hrm, true.
>>
>> > Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast. We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
>>
>> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
>> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
>
> For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware).
Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
>
>>
>> >
>> >> That only works when the internal slot structure is hidden from user space though.
>> >
>> > Why?
>>
>> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
>
> Userspace needs to provide a function hva = f(gpa). Why does it matter how the function is spelled out? Slots happen to be a concise representation. Transform the function all you like in the kernel, as long as you preserve all the mappings.
I think we're talking about the same thing really.
>
>>
>> >
>> >>
>> >> >> I would actually rather like to see the amount of page sharing between kernel and user space increased, not decreased. I don't care if I can throw strace on KVM. I want speed.
>> >> >
>> >> > Something really critical should be handled in the kernel. Care to provide examples?
>> >>
>> >> Just look at the s390 patches Christian posted recently.
>> >
>> > Which ones?
>>
>> http://www.mail-archive.com/[email protected]/msg66155.html
>>
>
> Yeah - s390 is always different. On the current interface synchronous registers are easy, so why not. But I wonder if it's really critical.
It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
Alex
On 02/07/2012 07:18 AM, Avi Kivity wrote:
> On 02/07/2012 02:51 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:40 AM, Avi Kivity wrote:
>>> On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>>>>
>>>>> It's a potential source of exploits
>>>>> (from bugs in KVM or in hardware). I can see people wanting to be
>>>>> selective with access because of that.
>>>>
>>>> As is true of the rest of the kernel.
>>>>
>>>> If you want finer grain access control, that's exactly why we have things like
>>>> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
>>>> infrastructure and setup default SELinux policies appropriately.
>>>
>>> LSMs protect objects, not syscalls. There isn't an object to protect here
>>> (except the fake /dev/kvm object).
>>
>> A VM can be an object.
>>
>
> Not really, it's not accessible in a namespace. How would you label it?
Labels can originate from userspace, IIUC, so I think it's possible for QEMU (or
whatever the userspace is) to set the label for the VM while it's creating it.
I think this is how most of the labeling for X and things of that nature works.
Maybe Chris can set me straight.
> Maybe we can reuse the process label/context (not sure what the right term is
> for a process).
Regards,
Anthony Liguori
>
On 02/07/2012 07:40 AM, Alexander Graf wrote:
>
> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>
> on_read:
> return read_current_time() - shared_page.offset;
> on_write:
> handle_in_user_space();
>
> For IDE, it would be as simple as
>
> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
> for (i = 1; i < 7; i++) {
> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> }
You can't easily serialize updates to that address with the kernel since two
threads are likely going to be accessing it at the same time. That either means
an expensive sync operation or a reliance on atomic instructions.
But not all architectures offer non-word sized atomic instructions so it gets
fairly nasty in practice.
Regards,
Anthony Liguori
On 07.02.2012, at 16:23, Anthony Liguori wrote:
> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>>
>> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>>
>> on_read:
>> return read_current_time() - shared_page.offset;
>> on_write:
>> handle_in_user_space();
>>
>> For IDE, it would be as simple as
>>
>> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
>> for (i = 1; i < 7; i++) {
>> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>> }
>
> You can't easily serialize updates to that address with the kernel since two threads are likely going to be accessing it at the same time. That either means an expensive sync operation or a reliance on atomic instructions.
Yes. Essentially we want a mutex for them.
> But not all architectures offer non-word sized atomic instructions so it gets fairly nasty in practice.
Well, we can always require fields to be word sized.
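(A minimal sketch of the word-sized rule using GCC/Clang atomic builtins; the struct and field names are illustrative.)

#include <stdint.h>

/* Shared between the kernel accelerator and userspace; fields are only
 * ever accessed as whole words, so plain word-sized atomics suffice. */
struct shared_regs {
    uint32_t status;
};

static inline uint32_t shared_status_read(const struct shared_regs *s)
{
    return __atomic_load_n(&s->status, __ATOMIC_ACQUIRE);
}

static inline void shared_status_write(struct shared_regs *s, uint32_t val)
{
    __atomic_store_n(&s->status, val, __ATOMIC_RELEASE);
}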
Alex
* Anthony Liguori ([email protected]) wrote:
> On 02/07/2012 07:18 AM, Avi Kivity wrote:
> >On 02/07/2012 02:51 PM, Anthony Liguori wrote:
> >>On 02/07/2012 06:40 AM, Avi Kivity wrote:
> >>>On 02/07/2012 02:28 PM, Anthony Liguori wrote:
> >>>>
> >>>>>It's a potential source of exploits
> >>>>>(from bugs in KVM or in hardware). I can see people wanting to be
> >>>>>selective with access because of that.
> >>>>
> >>>>As is true of the rest of the kernel.
> >>>>
> >>>>If you want finer grain access control, that's exactly why we have things like
> >>>>LSM and SELinux. You can add the appropriate LSM hooks into the KVM
> >>>>infrastructure and setup default SELinux policies appropriately.
> >>>
> >>>LSMs protect objects, not syscalls. There isn't an object to protect here
> >>>(except the fake /dev/kvm object).
> >>
> >>A VM can be an object.
> >
> >Not really, it's not accessible in a namespace. How would you label it?
A VM, vcpu, etc are all objects. The labelling can be implicit based on
the security context of the process creating the object. You could create
simplistic rules such as a process may have the ability KVM__VM_CREATE
(this is roughly analogous to the PROC__EXECMEM policy control that
allows some processes to create executable writable memory mappings, or
SHM__CREATE for a process that can create a shared memory segment).
Adding some label mgmt to the object (add ->security and some callbacks to
do ->alloc/init/free), and then checks on the object itself would allow
for finer grained protection. If there was any VM lookup (although the
original example explicitly ties a process to a vm and a thread to a
vcpu) the finer grained check would certainly be useful to verify that
the process can access the VM.
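(Sketched under the assumptions above; none of these hooks exist in the kernel today, they merely mirror the shape of existing LSM object hooks.)

/* Hypothetical LSM hooks for a kvm object, following the usual
 * alloc/label/check/free pattern. */
struct kvm;                                      /* gains a 'void *security' blob */

int security_kvm_vm_create(void);                /* may current create a VM? (KVM__VM_CREATE) */
int security_kvm_vm_alloc(struct kvm *kvm);      /* allocate and label kvm->security */
void security_kvm_vm_free(struct kvm *kvm);      /* release the blob */
int security_kvm_vm_access(struct kvm *kvm);     /* finer-grained check on VM lookup */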
> Labels can originate from userspace, IIUC, so I think it's possible for QEMU
> (or whatever the userspace is) to set the label for the VM while it's
> creating it. I think this is how most of the labeling for X and things of
> that nature works.
For X, the policy enforcement is done in the X server. There is
assistance from the kernel for doing policy server queries (can foo do
bar?), but it's up to the X server to actually care enough to ask and
then fail a request that doesn't comply. I'm not sure that's the model
here.
thanks,
-chris
On Mon, 06 Feb 2012 11:34:01 +0200, Avi Kivity <[email protected]> wrote:
> On 02/05/2012 06:36 PM, Anthony Liguori wrote:
> > If userspace had a way to upload bytecode to the kernel that was
> > executed for a PIO operation, it could either pass the operation to
> > userspace or handle it within the kernel when possible without taking
> > a heavy weight exit.
> >
> > If the bytecode can access variables in a shared memory area, it could
> > be pretty efficient to work with.
> >
> > This means that the kernel never has to deal with specific in-kernel
> > devices but that userspace can accelerator as many of its devices as
> > it sees fit.
>
> I would really love to have this, but the problem is that we'd need a
> general purpose bytecode VM with binding to some kernel APIs. The
> bytecode VM, if made general enough to host more complicated devices,
> would likely be much larger than the actual code we have in the kernel now.
We have the ability to upload bytecode into the kernel already. It's in
a great bytecode interpreted by the CPU itself.
If every user were emulating different machines, an LPF-style approach
would make sense. Are they? Or should we write those helpers once, in C,
and provide that for them?
Cheers,
Rusty.
On 02/07/2012 06:28 AM, Anthony Liguori wrote:
> On 02/06/2012 01:46 PM, Scott Wood wrote:
>> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>>> On 02/03/2012 12:07 PM, Eric Northup wrote:
>>>> How would the ability to use sys_kvm_* be regulated?
>>>
>>> Why should it be regulated?
>>>
>>> It's not a finite or privileged resource.
>>
>> You're exposing a large, complex kernel subsystem that does very
>> low-level things with the hardware.
>
> As does the rest of the kernel.
Just because other parts of the kernel made this mistake (e.g.
networking) doesn't mean that KVM should as well.
> If you want finer grain access control, that's exactly why we have
> things like LSM and SELinux. You can add the appropriate LSM hooks into
> the KVM infrastructure and setup default SELinux policies appropriately.
Needing to use such bandaids is more complicated (or at least less
familiar to many) than setting permissions on a filesystem object.
>> And sometimes it is a finite resource. I don't know how x86 does it,
>> but on at least some powerpc hardware we have a finite, relatively small
>> number of hardware partition IDs.
>
> But presumably this is per-core, right?
Not currently.
I can't speak for the IBM stuff, but our hardware is designed with the
idea that a partition has a permanent system-wide LPID (partition ID).
We *may* be able to do dynamic LPID on e500mc, but it is likely to be a
problem in the future with things like LPID-based direct-to-guest
interrupt delivery. There's also a question of prioritizing effort --
there's enough other stuff that needs work first.
> And they're recycled, right?
Not currently (other than when a guest is destroyed, of course).
What are the advantages of getting rid of the file descriptor that
warrant this? What is performance sensitive enough that an fd lookup is
unacceptable but the other overhead of going out to qemu is fine?
Is that fd lookup any heavier than "appropriate LSM hooks"?
If the fd overhead really is a problem, perhaps the fd could be retained
for setup operations, and omitted only on calls that require a vcpu to
have been already set up on the current thread?
-Scott
> If the fd overhead really is a problem, perhaps the fd could be retained
> for setup operations, and omitted only on calls that require a vcpu to
> have been already set up on the current thread?
Quite frankly I'd like to have an fd because it means you've got a
meaningful way of ensuring that id reuse problems go away. You open a
given id and keep a handle to it; if the id gets reused then your handle
will be tied to the old one so you can fail the requests.
Without an fd it's near impossible to get this right. The Unix/Linux
model is open an object, use it, close it. I see no reason not to do that.
Also the LSM hooks apply to file objects mostly, so it's a natural fit on
top *IF* you choose to use them.
Finally you can pass file handles around between processes - try doing that
any other way 8)
Alan
> > register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
> > for (i = 1; i< 7; i++) {
> > register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> > register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> > }
>
> You can't easily serialize updates to that address with the kernel since two
> threads are likely going to be accessing it at the same time. That either means
> an expensive sync operation or a reliance on atomic instructions.
Who cares?
If your API is right this isn't a problem (and for IDE, if you guess that it
won't happen, you will win 99.999% of the time).
In fact with IDE you can do even better in many cases, because you'll get a
single rep outsw you can trap and shortcut.
> But not all architectures offer non-word sized atomic instructions so it gets
> fairly nasty in practice.
That's their problem. We don't screw up the fast paths because some
hardware vendor screwed up that bit of their implementation. That's
*their* problem, not everyone else's.
So on x86 IDE should be about 10 outb traps that can be predicted, a rep
outsw which can be shortcut and a completion set of inb/inw ops that can
be predicted.
You should hit userspace about once per IDE operation. Fix the hot paths
with good design and the noise doesn't matter.
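For reference, the sequence being counted above roughly matches a classic
single-sector PIO write on the primary ATA channel. The sketch below is only
meant to show where the outb traps, the rep outsw and the status reads fall;
it assumes the standard 0x1F0-0x1F7 task-file ports and the glibc <sys/io.h>
outb(value, port) / outsw(port, addr, count) helpers, and is a conceptual
guest-side sketch rather than runnable userspace code.

/* One 28-bit LBA PIO sector write: a handful of predictable outb()s,
 * one rep outsw of the 512-byte sector, then status polling with inb(). */
#include <sys/io.h>
#include <stdint.h>

static void ide_pio_write_sector(uint32_t lba, const uint16_t *buf)
{
    outb(0xE0 | ((lba >> 24) & 0x0F), 0x1F6);   /* drive/head, LBA mode */
    outb(1,                  0x1F2);            /* sector count = 1 */
    outb(lba & 0xFF,         0x1F3);            /* LBA low */
    outb((lba >> 8) & 0xFF,  0x1F4);            /* LBA mid */
    outb((lba >> 16) & 0xFF, 0x1F5);            /* LBA high */
    outb(0x30,               0x1F7);            /* WRITE SECTORS command */

    while (!(inb(0x1F7) & 0x08))                /* wait for DRQ */
        ;
    outsw(0x1F0, buf, 256);                     /* 512 bytes as one rep outsw */
    while (inb(0x1F7) & 0x80)                   /* wait for BSY to clear */
        ;
}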
Alan
Anthony Liguori wrote:
> >The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> >them to userspace.
>
> I'm a big fan of this.
I agree with getting rid of unnecessary emulations.
(Why were those things emulated in the first place?)
But it would be good to retain some way to "plug in" device emulations
in the kernel, separate from the KVM core with a well-defined API boundary.
Then it wouldn't matter to the KVM core whether there's PIT emulation
or whatever; that would just be a separate module - perhaps even with
its own /dev device, and maybe not tightly bound to KVM.
> >Note: this may cause a regression for older guests that don't
> >support MSI or kvmclock. Device assignment will be done using
> >VFIO, that is, without direct kvm involvement.
I don't like the sound of regressions.
I tend to think of a VM as something that needs to have consistent
behaviour over a long time, for keeping working systems running for
years despite changing hardware, or reviving old systems to test
software and make patches for things in long-term maintenance etc.
But I haven't noticed problems from upgrading kernelspace-KVM yet,
only upgrading the userspace parts. If a kernel upgrade is risky,
that makes upgrading host kernels difficult and "all or nothing" for
all the guests within.
However it looks like you mean only the performance characteristics
will change because of moving things back to userspace?
> >Local APICs will be mandatory, but it will be possible to hide them from
> >the guest. This means that it will no longer be possible to emulate an
> >APIC in userspace, but it will be possible to virtualize an APIC-less
> >core - userspace will play with the LINT0/LINT1 inputs (configured as
> >EXITINT and NMI) to queue interrupts and NMIs.
>
> I think this makes sense. An interesting consequence of this is
> that it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation. I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.
Would that be useful for using VCPUs to run sandboxed userspace code
with the ability to trap and control the whole environment (as opposed to
guest OSes, or ptrace which is rather incomplete and unsuitable for
sandboxing code meant for other OSes)?
Thanks,
-- Jamie
Avi Kivity <[email protected]> wrote:
> > > Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast. We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
> >
> > Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> > We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
>
> For x86 that's not a problem, since once you map a page, it stays mapped
> (on modern hardware).
>
I was once thinking about how to search a slot reasonably fast for every case,
even when we do not have the mmio-spte cache.
One possible way I thought up was to sort slots according to their base_gfn.
Then the problem would become: "find the first slot whose base_gfn + npages
is greater than this gfn."
Since we can do binary search, the search cost is O(log(# of slots)).
But I guess that most of the time was wasted on reading many memslots just to
know their base_gfn and npages.
So the most practically effective thing is to make a separate array which holds
just their base_gfn. This will make the task a simple, and cache-friendly,
search on an integer array: probably faster than using a *-tree data structure.
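A minimal sketch of that search, using simplified stand-in structures rather
than the real kvm_memory_slot layout; the fixed array size is an arbitrary
example.

/* Slots sorted by base_gfn, with a parallel base_gfn array so the binary
 * search only touches a small, cache-friendly integer array. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

struct slot {
    gfn_t base_gfn;
    uint64_t npages;
    /* hva, flags, ... */
};

struct slot_array {
    size_t nslots;
    gfn_t base[64];            /* base[i] == slots[i].base_gfn */
    struct slot slots[64];     /* sorted by base_gfn, non-overlapping */
};

/* Find the slot containing gfn, or NULL if it is unmapped. */
static struct slot *find_slot(struct slot_array *a, gfn_t gfn)
{
    size_t lo = 0, hi = a->nslots;

    while (lo < hi) {              /* first index with base_gfn > gfn */
        size_t mid = lo + (hi - lo) / 2;
        if (a->base[mid] <= gfn)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;               /* gfn is below the lowest slot */
    struct slot *s = &a->slots[lo - 1];
    return (gfn < s->base_gfn + s->npages) ? s : NULL;
}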
If needed, we should make cmp_memslot() architecture specific in the end?
Takuya
On 02/07/2012 04:39 PM, Alexander Graf wrote:
> >
> > Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship.
>
> How about keeping the ioctl interface but moving vcpu_run to a syscall then?
I dislike half-and-half interfaces even more. And it's not like the
fget_light() is really painful - it's just that I see it occasionally in
perf top so it annoys me.
> That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either
>
> a) have wrappers around register accesses, so it can directly ask for specific registers that it needs
> or
> b) keep everything that would be requested by the register synchronization in shared memory
Always-synced shared memory is a liability, since newer hardware might
introduce on-chip caches for that state, making synchronization
expensive. Or we may choose to keep some of the registers loaded, if we
have a way to trap on their use from userspace - for example we can
return to userspace with the guest fpu loaded, and trap if userspace
tries to use it.
Is an extra syscall for copying TLB entries to user space prohibitively
expensive?
> >
> >> , keep the rest in user space.
> >> >
> >> >
> >> > When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device.
> >>
> >> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
> >>
> >> on_read:
> >> return read_current_time() - shared_page.offset;
> >> on_write:
> >> handle_in_user_space();
> >
> > It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.
>
> I don't understand. Why would anything fail here?
It fails to provide a benefit; I didn't mean it causes guest failures.
You also have to make sure the kernel part and the user part use exactly
the same time bases.
> Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it.
Yeah.
>
> > Also look at the PIT which latches on read.
> >
> >>
> >> For IDE, it would be as simple as
> >>
> >> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
> >> for (i = 1; i< 7; i++) {
> >> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> >> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> >> }
> >>
> >> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
> >
> >
> > Just use virtio.
>
> Just use xenbus. Seriously, this is not an answer.
Why not? We invested effort in making it as fast as possible, and in
writing the drivers. IDE will never, ever, get anything close to virtio
performance, even if we put all of it in the kernel.
However, after these examples, I'm more open to partial acceleration
now. I won't ever like it though.
> >> >
> >> >> - VGA
> >> >> - IDE
> >> >
> >> > Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
> >>
> >> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp.
3rd party drivers are a way of life for Windows users; and the
incremental benefits of IDE acceleration are still far behind virtio.
> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers.
Cirrus or vesa should be okay for them, I don't see what we could do for
them in the kernel, or why.
> Same for virtio.
> >>
> >> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
> >
> > Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
>
> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
For linear loads, so should we, perhaps with greater cpu utilization.
If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits
shouldn't matter.
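Spelling the arithmetic out (just a sanity check of the same numbers):

\frac{64\ \mathrm{KB}}{128\ \mathrm{MB/s}} = 0.5\ \mathrm{ms\ per\ transaction},
\qquad
\frac{30\ \mu\mathrm{s}}{500\ \mu\mathrm{s}} = 6\%\ \text{of the transaction time}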
> >
> >> KVM's strength has always been its close resemblance to hardware.
> >
> > This will remain. But we can't optimize everything.
>
> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
We should make sure that we don't default to IDE. Qemu has no knowledge
of the guest, so it can't default to virtio, but higher level tools can
and should.
> >>
> >> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> >> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
> >
> > For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware).
>
> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
Well the real reason is we have an extra bit reported by page faults
that we can control. Can't you set up a hashed pte that is configured
in a way that it will fault, no matter what type of access the guest
does, and see it in your page fault handler?
I'm guessing guest kernel ptes don't get evicted often.
> >
> >>
> >> >
> >> >> That only works when the internal slot structure is hidden from user space though.
> >> >
> >> > Why?
> >>
> >> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
> >
> > Userspace needs to provide a function hva = f(gpa). Why does it matter how the function is spelled out? Slots happen to be a concise representation. Transform the function all you like in the kernel, as long as you preserve all the mappings.
>
> I think we're talking about the same thing really.
So what's your objection to slots?
> >> http://www.mail-archive.com/[email protected]/msg66155.html
> >>
> >
> > Yeah - s390 is always different. On the current interface synchronous registers are easy, so why not. But I wonder if it's really critical.
>
> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
It's also dangerous wrt future hardware, as noted above.
--
error compiling committee.c: too many arguments to function
On 15.02.2012, at 12:18, Avi Kivity wrote:
> On 02/07/2012 04:39 PM, Alexander Graf wrote:
>>>
>>> Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship.
>>
>> How about keeping the ioctl interface but moving vcpu_run to a syscall then?
>
> I dislike half-and-half interfaces even more. And it's not like the
> fget_light() is really painful - it's just that I see it occasionally in
> perf top so it annoys me.
>
>> That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either
>>
>> a) have wrappers around register accesses, so it can directly ask for specific registers that it needs
>> or
>> b) keep everything that would be requested by the register synchronization in shared memory
>
> Always-synced shared memory is a liability, since newer hardware might
> introduce on-chip caches for that state, making synchronization
> expensive. Or we may choose to keep some of the registers loaded, if we
> have a way to trap on their use from userspace - for example we can
> return to userspace with the guest fpu loaded, and trap if userspace
> tries to use it.
>
> Is an extra syscall for copying TLB entries to user space prohibitively
> expensive?
The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, on the order of multiple kentries. Every entry is a struct of 24 bytes.
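As a rough worked number (4096 entries is just an assumed value for "multiple
kentries"):

4096 \times 24\ \mathrm{B} = 98304\ \mathrm{B} \approx 96\ \mathrm{KiB}\ \text{copied per full synchronization}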
>
>>>
>>>> , keep the rest in user space.
>>>>>
>>>>>
>>>>> When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device.
>>>>
>>>> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>>>>
>>>> on_read:
>>>> return read_current_time() - shared_page.offset;
>>>> on_write:
>>>> handle_in_user_space();
>>>
>>> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.
>>
>> I don't understand. Why would anything fail here?
>
> It fails to provide a benefit, I didn't mean it causes guest failures.
>
> You also have to make sure the kernel part and the user part use exactly
> the same time bases.
Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
>
>> Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it.
>
> Yeah.
>
>>
>>> Also look at the PIT which latches on read.
>>>
>>>>
>>>> For IDE, it would be as simple as
>>>>
>>>> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>>> for (i = 1; i< 7; i++) {
>>>> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>>> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>>> }
>>>>
>>>> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
>>>
>>>
>>> Just use virtio.
>>
>> Just use xenbus. Seriously, this is not an answer.
>
> Why not? We invested effort in making it as fast as possible, and in
> writing the drivers. IDE will never, ever, get anything close to virtio
> performance, even if we put all of it in the kernel.
>
> However, after these examples, I'm more open to partial acceleration
> now. I won't ever like it though.
>
>>>>>
>>>>>> - VGA
>>>>>> - IDE
>>>>>
>>>>> Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
>>>>
>>>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp.
>
> 3rd party drivers are a way of life for Windows users; and the
> incremental benefits of IDE acceleration are still far behind virtio.
The typical way of life for Windows users is all-included drivers, which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones.
It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.
And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices.
>
>> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers.
>
> Cirrus or vesa should be okay for them, I don't see what we could do for
> them in the kernel, or why.
That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>
>> Same for virtio.
>>>>
>>>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
>>>
>>> Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
>>
>> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
>
> For linear loads, so should we, perhaps with greater cpu utilization.
>
> If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
> means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits
> shouldn't matter.
*shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;).
>
>>>
>>>> KVM's strength has always been its close resemblance to hardware.
>>>
>>> This will remain. But we can't optimize everything.
>>
>> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
>
> We should make sure that we don't default to IDE. Qemu has no knowledge
> of the guest, so it can't default to virtio, but higher level tools can
> and should.
You can only default to virtio on recent Linux. Windows, BSD, etc. don't include drivers, so you can't assume it works. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(.
>
>>>>
>>>> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
>>>> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
>>>
>>> For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware).
>>
>> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
>
> Well the real reason is we have an extra bit reported by page faults
> that we can control. Can't you set up a hashed pte that is configured
> in a way that it will fault, no matter what type of access the guest
> does, and see it in your page fault handler?
I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed.
So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
But it's certainly an interesting idea.
> I'm guessing guest kernel ptes don't get evicted often.
Yeah, depends on the model you're running on ;). It's not the most common thing though, I agree.
>
>>>
>>>>
>>>>>
>>>>>> That only works when the internal slot structure is hidden from user space though.
>>>>>
>>>>> Why?
>>>>
>>>> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
>>>
>>> Userspace needs to provide a function hva = f(gpa). Why does it matter how the function is spelled out? Slots happen to be a concise representation. Transform the function all you like in the kernel, as long as you preserve all the mappings.
>>
>> I think we're talking about the same thing really.
>
> So what's your objection to slots?
I was merely saying that having slots internally keeps us from speeding things up. I don't mind the external interface though.
>
>>>> http://www.mail-archive.com/[email protected]/msg66155.html
>>>>
>>>
>>> Yeah - s390 is always different. On the current interface synchronous registers are easy, so why not. But I wonder if it's really critical.
>>
>> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
>
> It's also dangerous wrt future hardware, as noted above.
Yes and no. I see the capability system as two things in one:
1) indicate features we learn later
2) indicate missing features in our current model
So if a new model comes out that can't do something, just scratch off the CAP and be good ;). If somehow you ended up with multiple bits in a single CAP, remove the CAP, create a new one with the subset, set that for the new hardware.
We will have the same situation when we get nested TLBs for booke. We just unlearn a CAP then. User space needs to cope with its unavailability anyways.
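For illustration, the userspace side of that is just a capability probe with a
fallback. This is a minimal sketch using the real KVM_CHECK_EXTENSION ioctl on
the /dev/kvm system fd, with KVM_CAP_SW_TLB (the capability behind the e500
TLB-sharing setup) as the example; the surrounding function is invented.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns nonzero if the shared-TLB path can be used; otherwise userspace
 * copes without it and falls back to the plain get/set ioctls. */
static int can_share_tlb(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0)
        return 0;
    int has = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SW_TLB);
    close(kvm_fd);
    return has > 0;
}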
Alex
On 02/15/2012 01:57 PM, Alexander Graf wrote:
> >
> > Is an extra syscall for copying TLB entries to user space prohibitively
> > expensive?
>
> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
You don't need to copy the entire TLB, just the way that maps the
address you're interested in.
btw, why are you interested in virtual addresses in userspace at all?
> >>>
> >>> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.
> >>
> >> I don't understand. Why would anything fail here?
> >
> > It fails to provide a benefit, I didn't mean it causes guest failures.
> >
> > You also have to make sure the kernel part and the user part use exactly
> > the same time bases.
>
> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
Depends on how much the alignment relies on guest knowledge. I guess
with a simple device like HPET, it's simple, but with a complex device,
different guests (or different versions of the same guest) could drive
it very differently.
> >>>>
> >>>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp.
> >
> > 3rd party drivers are a way of life for Windows users; and the
> > incremental benefits of IDE acceleration are still far behind virtio.
>
> The typical way of life for Windows users is all-included drivers, which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones.
>
> It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.
>
> And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices.
Ok.
> >
> >> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers.
> >
> > Cirrus or vesa should be okay for them, I don't see what we could do for
> > them in the kernel, or why.
>
> That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>
> >
> >> Same for virtio.
> >>>>
> >>>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
> >>>
> >>> Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
> >>
> >> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
> >
> > For linear loads, so should we, perhaps with greater cpu utilization.
> >
> > If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
> > means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits
> > shouldn't matter.
>
> *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;).
One thing that's different is that virtio offloads itself to a thread
very quickly, while IDE does a lot of work in vcpu thread context.
> >
> >>>
> >>>> KVM's strength has always been its close resemblance to hardware.
> >>>
> >>> This will remain. But we can't optimize everything.
> >>
> >> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
> >
> > We should make sure that we don't default to IDE. Qemu has no knowledge
> > of the guest, so it can't default to virtio, but higher level tools can
> > and should.
>
> You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it working. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(.
The all-knowing management tool can provide a virtio driver disk, or
even slip-stream the driver into the installation CD.
>
> >> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
> >
> > Well the real reason is we have an extra bit reported by page faults
> > that we can control. Can't you set up a hashed pte that is configured
> > in a way that it will fault, no matter what type of access the guest
> > does, and see it in your page fault handler?
>
> I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed.
>
> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
COWs usually happen from guest userspace, while mmio is usually from the
guest kernel, so you can switch on that, maybe.
> >>
> >> I think we're talking about the same thing really.
> >
> > So what's your objection to slots?
>
> I was merely saying that having slots internally keeps us from speeding things up. I don't mind the external interface though.
Ah, but it doesn't. We can sort them, convert them to a radix tree,
basically do anything with them.
>
> >
> >>>> http://www.mail-archive.com/[email protected]/msg66155.html
> >>>>
> >>>
> >>> Yeah - s390 is always different. On the current interface synchronous registers are easy, so why not. But I wonder if it's really critical.
> >>
> >> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
> >
> > It's also dangerous wrt future hardware, as noted above.
>
> Yes and no. I see the capability system as two things in one:
>
> 1) indicate features we learn later
> 2) indicate missing features in our current model
>
> So if a new model comes out that can't do something, just scratch off the CAP and be good ;). If somehow you ended up with multiple bits in a single CAP, remove the CAP, create a new one with the subset, set that for the new hardware.
>
> We will have the same situation when we get nested TLBs for booke. We just unlearn a CAP then. User space needs to cope with its unavailability anyways.
>
At least qemu tends to assume a certain baseline and won't run without
it. We also need to make sure that the feature is available in some
other way (non-shared memory), which means duplication to begin with.
--
error compiling committee.c: too many arguments to function
On 02/12/2012 09:10 AM, Takuya Yoshikawa wrote:
> Avi Kivity <[email protected]> wrote:
>
> > > > Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast. We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
> > >
> > > Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> > > We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
> >
> > For x86 that's not a problem, since once you map a page, it stays mapped
> > (on modern hardware).
> >
>
> I was once thinking about how to search a slot reasonably fast for every case,
> even when we do not have mmio-spte cache.
>
> One possible way I thought up was to sort slots according to their base_gfn.
> Then the problem would become: "find the first slot whose base_gfn + npages
> is greater than this gfn."
>
> Since we can do binary search, the search cost is O(log(# of slots)).
>
> But I guess that most of the time was wasted on reading many memslots just to
> know their base_gfn and npages.
>
> So the most practically effective thing is to make a separate array which holds
> just their base_gfn. This will make the task a simple, and cache friendly,
> search on an integer array: probably faster than using *-tree data structure.
This assumes that there is equal probability of matching any slot. But
that's not true: even if you have hundreds of slots, the probability is
much greater for the two main memory slots, or if you're playing with
the framebuffer, the framebuffer slot. Everything else is loaded
quickly into shadow and forgotten.
> If needed, we should make cmp_memslot() architecture specific in the end?
We could, but why is it needed? This logic holds for all architectures.
--
error compiling committee.c: too many arguments to function
On 02/07/2012 05:23 PM, Anthony Liguori wrote:
> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>>
>> Why? For the HPET timer register for example, we could have a simple
>> MMIO hook that says
>>
>> on_read:
>> return read_current_time() - shared_page.offset;
>> on_write:
>> handle_in_user_space();
>>
>> For IDE, it would be as simple as
>>
>> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>> for (i = 1; i< 7; i++) {
>> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>> }
>
> You can't easily serialize updates to that address with the kernel
> since two threads are likely going to be accessing it at the same
> time. That either means an expensive sync operation or a reliance on
> atomic instructions.
>
> But not all architectures offer non-word sized atomic instructions so
> it gets fairly nasty in practice.
>
I doubt that any guest accesses IDE registers from two threads in
parallel. The guest will have some lock, so we could have a lock as
well and be assured that there will never be contention.
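A minimal sketch of that idea; the names are invented, and in the kernel this
would of course be a spinlock rather than a pthread mutex. The point is only
that the lock around the hooked register update is effectively uncontended
because the guest already serializes its own accesses.

#include <pthread.h>
#include <stdint.h>

struct ide_pio_state {
    pthread_mutex_t lock;      /* uncontended in practice: the guest holds
                                * its own lock around IDE register access */
    uint8_t cmd[8];            /* mirrors s->cmd[] from the example above */
};

static void pio_write_byte(struct ide_pio_state *s, unsigned reg, uint8_t val)
{
    pthread_mutex_lock(&s->lock);
    s->cmd[reg] = val;         /* the hooked register update */
    pthread_mutex_unlock(&s->lock);
}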
--
error compiling committee.c: too many arguments to function
On 15.02.2012, at 14:29, Avi Kivity wrote:
> On 02/15/2012 01:57 PM, Alexander Graf wrote:
>>>
>>> Is an extra syscall for copying TLB entries to user space prohibitively
>>> expensive?
>>
>> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
>
> You don't need to copy the entire TLB, just the way that maps the
> address you're interested in.
Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(.
> btw, why are you interested in virtual addresses in userspace at all?
We need them for gdb and monitor introspection.
>
>>>>>
>>>>> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.
>>>>
>>>> I don't understand. Why would anything fail here?
>>>
>>> It fails to provide a benefit, I didn't mean it causes guest failures.
>>>
>>> You also have to make sure the kernel part and the user part use exactly
>>> the same time bases.
>>
>> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
>
> Depends on how much the alignment relies on guest knowledge. I guess
> with a simple device like HPET, it's simple, but with a complex device,
> different guests (or different versions of the same guest) could drive
> it very differently.
Right. But accelerating simple devices > not accelerating any devices. No? :)
>
>>>>>>
>>>>>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp.
>>>
>>> 3rd party drivers are a way of life for Windows users; and the
>>> incremental benefits of IDE acceleration are still far behind virtio.
>>
>> The typical way of life for Windows users is all-included drivers, which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones.
>>
>> It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.
>>
>> And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices.
>
> Ok.
>
>>>
>>>> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers.
>>>
>>> Cirrus or vesa should be okay for them, I don't see what we could do for
>>> them in the kernel, or why.
>>
>> That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>>
>>>
>>>> Same for virtio.
>>>>>>
>>>>>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
>>>>>
>>>>> Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
>>>>
>>>> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
>>>
>>> For linear loads, so should we, perhaps with greater cpu utilization.
>>>
>>> If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
>>> means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits
>>> shouldn't matter.
>>
>> *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;).
>
> One thing that's different is that virtio offloads itself to a thread
> very quickly, while IDE does a lot of work in vcpu thread context.
So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.
>
>>>
>>>>>
>>>>>> KVM's strength has always been its close resemblance to hardware.
>>>>>
>>>>> This will remain. But we can't optimize everything.
>>>>
>>>> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
>>>
>>> We should make sure that we don't default to IDE. Qemu has no knowledge
>>> of the guest, so it can't default to virtio, but higher level tools can
>>> and should.
>>
>> You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it working. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(.
>
> The all-knowing management tool can provide a virtio driver disk, or
> even slip-stream the driver into the installation CD.
One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that the management tool doesn't know (yet).
>
>
>>
>>>> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
>>>
>>> Well the real reason is we have an extra bit reported by page faults
>>> that we can control. Can't you set up a hashed pte that is configured
>>> in a way that it will fault, no matter what type of access the guest
>>> does, and see it in your page fault handler?
>>
>> I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed.
>>
>> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
>
> COWs usually happen from guest userspace, while mmio is usually from the
> guest kernel, so you can switch on that, maybe.
Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).
>
>>>>
>>>> I think we're talking about the same thing really.
>>>
>>> So what's your objection to slots?
>>
>> I was merely saying that having slots internally keeps us from speeding things up. I don't mind the external interface though.
>
> Ah, but it doesn't. We can sort them, convert them to a radix tree,
> basically do anything with them.
That's perfectly fine then :).
>
>>
>>>
>>>>>> http://www.mail-archive.com/[email protected]/msg66155.html
>>>>>>
>>>>>
>>>>> Yeah - s390 is always different. On the current interface synchronous registers are easy, so why not. But I wonder if it's really critical.
>>>>
>>>> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
>>>
>>> It's also dangerous wrt future hardware, as noted above.
>>
>> Yes and no. I see the capability system as two things in one:
>>
>> 1) indicate features we learn later
>> 2) indicate missing features in our current model
>>
>> So if a new model comes out that can't do something, just scratch off the CAP and be good ;). If somehow you ended up with multiple bits in a single CAP, remove the CAP, create a new one with the subset, set that for the new hardware.
>>
>> We will have the same situation when we get nested TLBs for booke. We just unlearn a CAP then. User space needs to cope with its unavailability anyways.
>>
>
> At least qemu tends to assume a certain baseline and won't run without
> it. We also need to make sure that the feature is available in some
> other way (non-shared memory), which means duplication to begin with.
Yes, but that's the nature of accelerating things in other layers. If we move registers from ioctl get/set to shared pages, we need to keep the ioctls around. We also need to keep the ioctl access functions in qemu around. Unless we move up the baseline, but then we'd kill our backwards compatibility, which isn't all that great of an idea.
So yes, that's exactly what happens. And it's good that it does :). Gives us the chance to roll back when we realized we did something stupid.
Alex
On 02/07/2012 08:12 PM, Rusty Russell wrote:
> > I would really love to have this, but the problem is that we'd need a
> > general purpose bytecode VM with binding to some kernel APIs. The
> > bytecode VM, if made general enough to host more complicated devices,
> > would likely be much larger than the actual code we have in the kernel now.
>
> We have the ability to upload bytecode into the kernel already. It's in
> a great bytecode interpreted by the CPU itself.
Unfortunately it's inflexible (has to come with the kernel) and open to
security vulnerabilities.
> If every user were emulating different machines, LPF this would make
> sense. Are they?
They aren't.
> Or should we write those helpers once, in C, and
> provide that for them.
There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
them are quite complicated. However implementing them in bytecode
amounts to exposing a stable kernel ABI, since they use such a vast
range of kernel services.
--
error compiling committee.c: too many arguments to function
On 02/15/2012 03:37 PM, Alexander Graf wrote:
> On 15.02.2012, at 14:29, Avi Kivity wrote:
>
> > On 02/15/2012 01:57 PM, Alexander Graf wrote:
> >>>
> >>> Is an extra syscall for copying TLB entries to user space prohibitively
> >>> expensive?
> >>
> >> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
> >
> > You don't need to copy the entire TLB, just the way that maps the
> > address you're interested in.
>
> Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(.
Well, the scatter/gather registers I proposed will give you just one
register or all of them.
> > btw, why are you interested in virtual addresses in userspace at all?
>
> We need them for gdb and monitor introspection.
Hardly fast paths that justify shared memory. I should be much harder
on you.
> >>
> >> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
> >
> > Depends on how much the alignment relies on guest knowledge. I guess
> > with a simple device like HPET, it's simple, but with a complex device,
> > different guests (or different versions of the same guest) could drive
> > it very differently.
>
> Right. But accelerating simple devices > not accelerating any devices. No? :)
Yes. But introducing bugs and vulns < not introducing them. It's a
tradeoff. Even an unexploited vulnerability can be a lot more pain,
just because you need to update your entire cluster, than a simple
device that is accelerated for a guest which has maybe 3% utilization.
Performance is just one parameter we optimize for. It's easy to overdo
it because it's an easily measurable and sexy parameter, but it's a mistake.
> >
> > One thing that's different is that virtio offloads itself to a thread
> > very quickly, while IDE does a lot of work in vcpu thread context.
>
> So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.
Simply making qemu issue the request from a thread would be way better.
Something like socketpair mmio, configured for not waiting for the
writes to be seen (posted writes) will also help by buffering writes in
the socket buffer.
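A minimal sketch of that "socketpair mmio, posted writes" idea: the vcpu side
queues a small transaction record and returns immediately, and the device
model drains the other end in its own thread. The record layout and function
names are invented for the example.

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

struct mmio_txn {
    uint64_t addr;
    uint32_t len;
    uint32_t is_write;
    uint8_t  data[8];
};

int make_mmio_channel(int sv[2])
{
    /* SEQPACKET preserves record boundaries: one send == one transaction */
    return socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv);
}

/* vcpu side: a posted write, buffered in the socket, no reply awaited */
int post_mmio_write(int fd, uint64_t addr, const void *val, uint32_t len)
{
    struct mmio_txn t = { .addr = addr, .len = len > 8 ? 8 : len, .is_write = 1 };
    memcpy(t.data, val, t.len);
    return send(fd, &t, sizeof(t), MSG_DONTWAIT) == (ssize_t)sizeof(t) ? 0 : -1;
}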
> >
> > The all-knowing management tool can provide a virtio driver disk, or
> > even slip-stream the driver into the installation CD.
>
> One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that the management tool doesn't know (yet).
That is true, but we have to leave some work for the management guys.
>
> >> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
> >
> > COWs usually happen from guest userspace, while mmio is usually from the
> > guest kernel, so you can switch on that, maybe.
>
> Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).
Or nested virt...
--
error compiling committee.c: too many arguments to function
On 15.02.2012, at 14:57, Avi Kivity wrote:
> On 02/15/2012 03:37 PM, Alexander Graf wrote:
>> On 15.02.2012, at 14:29, Avi Kivity wrote:
>>
>>> On 02/15/2012 01:57 PM, Alexander Graf wrote:
>>>>>
>>>>> Is an extra syscall for copying TLB entries to user space prohibitively
>>>>> expensive?
>>>>
>>>> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
>>>
>>> You don't need to copy the entire TLB, just the way that maps the
>>> address you're interested in.
>>
>> Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(.
>
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.
One register is hardly any use. We need either all ways of the respective address, to do a full-fledged lookup, or all of the entries. By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86. On x86 you also have shared memory for page tables; it's just guest visible, hence in guest memory. The concept is the same.
>
>>> btw, why are you interested in virtual addresses in userspace at all?
>>
>> We need them for gdb and monitor introspection.
>
> Hardly fast paths that justify shared memory. I should be much harder
> on you.
It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code. There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
>
>>>>
>>>> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
>>>
>>> Depends on how much the alignment relies on guest knowledge. I guess
>>> with a simple device like HPET, it's simple, but with a complex device,
>>> different guests (or different versions of the same guest) could drive
>>> it very differently.
>>
>> Right. But accelerating simple devices > not accelerating any devices. No? :)
>
> Yes. But introducing bugs and vulns < not introducing them. It's a
> tradeoff. Even an unexploited vulnerability can be a lot more pain,
> just because you need to update your entire cluster, than a simple
> device that is accelerated for a guest which has maybe 3% utilization.
> Performance is just one parameter we optimize for. It's easy to overdo
> it because it's an easily measurable and sexy parameter, but it's a mistake.
Yeah, I agree. That's why I was trying to get AHCI to be the default storage adapter for a while, because I think the same. However, Anthony believes that XP/w2k3 is still a major chunk of the guests running on QEMU, so we can't do that :(.
I'm mostly trying to think of ways to accelerate the obvious low hanging fruits, without overengineering any interfaces.
>
>>>
>>> One thing that's different is that virtio offloads itself to a thread
>>> very quickly, while IDE does a lot of work in vcpu thread context.
>>
>> So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.
>
> Simply making qemu issue the request from a thread would be way better.
> Something like socketpair mmio, configured for not waiting for the
> writes to be seen (posted writes) will also help by buffering writes in
> the socket buffer.
Yup, nice idea. That only works when all parts of a device are actually implemented through the same socket though. Otherwise you could run out of order. So if you have a PCI device with a PIO and an MMIO BAR region, they would both have to be handled through the same socket.
>
>>>
>>> The all-knowing management tool can provide a virtio driver disk, or
>>> even slip-stream the driver into the installation CD.
>>
>> One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that the management tool doesn't know (yet).
>
> That is true, but we have to leave some work for the management guys.
The easier the management stack is, the happier I am ;).
>
>>
>>>> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
>>>
>>> COWs usually happen from guest userspace, while mmio is usually from the
>>> guest kernel, so you can switch on that, maybe.
>>
>> Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).
>
> Or nested virt...
Nested virt on ppc with device assignment? And here I thought I was the crazy one of the two of us :)
Alex
On 02/15/2012 05:57 AM, Alexander Graf wrote:
>
> On 15.02.2012, at 12:18, Avi Kivity wrote:
>
>> Well the real reason is we have an extra bit reported by page faults
>> that we can control. Can't you set up a hashed pte that is configured
>> in a way that it will fault, no matter what type of access the guest
>> does, and see it in your page fault handler?
>
> I might be able to synthesize a PTE that is !readable and might throw
> a permission exception instead of a miss exception. I might be able
> to synthesize something similar for booke. I don't however get any
> indication on why things failed.
On booke with ISA 2.06 hypervisor extensions, there's MAS8[VF] that will
trigger a DSI that gets sent to the hypervisor even if normal DSIs go
directly to the guest. You'll still need to zero out the execute
permission bits.
For other booke, you could use one of the user bits in MAS3 (along with
zeroing out all the permission bits), which you could get to by doing a
tlbsx.
-Scott
On 02/15/2012 07:39 AM, Avi Kivity wrote:
> On 02/07/2012 08:12 PM, Rusty Russell wrote:
>>> I would really love to have this, but the problem is that we'd need a
>>> general purpose bytecode VM with binding to some kernel APIs. The
>>> bytecode VM, if made general enough to host more complicated devices,
>>> would likely be much larger than the actual code we have in the kernel now.
>>
>> We have the ability to upload bytecode into the kernel already. It's in
>> a great bytecode interpreted by the CPU itself.
>
> Unfortunately it's inflexible (has to come with the kernel) and open to
> security vulnerabilities.
I wonder if there's any reasonable way to run device emulation within the
context of the guest. Could we effectively do something like SMM?
For a given set of traps, reflect back into the guest quickly changing the
visibility of the VGA region. It may require installing a new CR3 but maybe that
wouldn't be so bad with VPIDs.
Then you could implement the PIT as guest firmware using kvmclock as the time base.
Once you're back in the guest, you could install the old CR3. Perhaps just hide
a portion of the physical address space with the e820.
Regards,
Anthony Liguori
>> If every user were emulating different machines, LPF this would make
>> sense. Are they?
>
> They aren't.
>
>> Or should we write those helpers once, in C, and
>> provide that for them.
>
> There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> them are quite complicated. However implementing them in bytecode
> amounts to exposing a stable kernel ABI, since they use such a vast
> range of kernel services.
>
On Tuesday 07 February 2012, Alexander Graf wrote:
> On 07.02.2012, at 07:58, Michael Ellerman wrote:
>
> > On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> >> You're exposing a large, complex kernel subsystem that does very
> >> low-level things with the hardware. It's a potential source of exploits
> >> (from bugs in KVM or in hardware). I can see people wanting to be
> >> selective with access because of that.
> >
> > Exactly.
> >
> > In a perfect world I'd agree with Anthony, but in reality I think
> > sysadmins are quite happy that they can prevent some users from using
> > KVM.
> >
> > You could presumably achieve something similar with capabilities or
> > whatever, but a node in /dev is much simpler.
>
> Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd.
>
> But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us.
>
ioctl is good for hardware devices and stuff that you want to enumerate
and/or control permissions on. For something like KVM that is really a
core kernel service, a syscall makes much more sense.
I would certainly never mix the two concepts: If you use a chardev to get
a file descriptor, use ioctl to do operations on it, and if you use a
syscall to get the file descriptor then use other syscalls to do operations
on it.
I don't really have a good recommendation whether or not to change from an
ioctl based interface to syscall for KVM now. On the one hand I believe it
would be significantly cleaner, on the other hand we cannot remove the
chardev interface any more since there are many existing users.
Arnd
On Tuesday 07 February 2012, Alexander Graf wrote:
> >>
> >> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
> >
> > I would expect that newer archs have less constraints, not more.
>
> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a
> bunch of registers to 64-bit. So what if we laid out stuff wrong before?
>
> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
>
I have not seen the source but I'm pretty sure that v7 and v8 look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).
Arnd
On Wed, 2012-02-15 at 22:21 +0000, Arnd Bergmann wrote:
> On Tuesday 07 February 2012, Alexander Graf wrote:
> > On 07.02.2012, at 07:58, Michael Ellerman wrote:
> >
> > > On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> > >> You're exposing a large, complex kernel subsystem that does very
> > >> low-level things with the hardware. It's a potential source of exploits
> > >> (from bugs in KVM or in hardware). I can see people wanting to be
> > >> selective with access because of that.
> > >
> > > Exactly.
> > >
> > > In a perfect world I'd agree with Anthony, but in reality I think
> > > sysadmins are quite happy that they can prevent some users from using
> > > KVM.
> > >
> > > You could presumably achieve something similar with capabilities or
> > > whatever, but a node in /dev is much simpler.
> >
> > Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd.
> >
> > But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us.
> >
>
> ioctl is good for hardware devices and stuff that you want to enumerate
> and/or control permissions on. For something like KVM that is really a
> core kernel service, a syscall makes much more sense.
Yeah maybe. That distinction is at least in part just historical.
The first problem I see with using a syscall is that you don't need one
syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
multiplexed syscall like epoll_ctl() - or probably several
(vm/vcpu/etc).
Secondly you still need a handle/context for those syscalls, and I think
the most sane thing to use for that is an fd.
At that point you've basically reinvented ioctl :)
I also think it is an advantage that you have a node in /dev for
permissions. I know other "core kernel" interfaces don't use a /dev
node, but arguably that is their loss.
> I would certainly never mix the two concepts: If you use a chardev to get
> a file descriptor, use ioctl to do operations on it, and if you use a
> syscall to get the file descriptor then use other syscalls to do operations
> on it.
Sure, we use a syscall to get the fd (open) and then other syscalls to
do operations on it, ioctl and kvm_vcpu_run. ;)
But seriously, I guess that makes sense. Though it's a bit of a pity
because if you want a syscall for any of it, eg. vcpu_run(), then you
have to basically reinvent ioctl for all the other little operations.
cheers
On Wed, 15 Feb 2012 15:39:41 +0200, Avi Kivity <[email protected]> wrote:
> On 02/07/2012 08:12 PM, Rusty Russell wrote:
> > > I would really love to have this, but the problem is that we'd need a
> > > general purpose bytecode VM with binding to some kernel APIs. The
> > > bytecode VM, if made general enough to host more complicated devices,
> > > would likely be much larger than the actual code we have in the kernel now.
> >
> > We have the ability to upload bytecode into the kernel already. It's in
> > a great bytecode interpreted by the CPU itself.
>
> Unfortunately it's inflexible (has to come with the kernel) and open to
> security vulnerabilities.
It doesn't have to come with the kernel, but it does require privs. And
while the bytecode itself might be invulnerable, the services it will
call will be, so it's not clear it'll be a win, given the reduced
auditability.
The grass is not really greener, and getting there involves many fences.
> > If every user were emulating different machines, like LPF, this would make
> > sense. Are they?
>
> They aren't.
>
> > Or should we write those helpers once, in C, and
> > provide that for them.
>
> There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> them are quite complicated. However implementing them in bytecode
> amounts to exposing a stable kernel ABI, since they use such a vast
> range of kernel services.
We could think about regularizing and enumerating the various in-kernel
helpers, and give userspace a generic mechanism for wiring them up.
That would surely be the first step towards bytecode anyway.
But the current device assignment ioctls make me think that this
wouldn't be simple or neat.
Cheers,
Rusty.
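To make the "regularize and enumerate the in-kernel helpers" idea concrete, here is a purely hypothetical sketch of what such a generic wiring interface could look like from userspace. None of these ioctls or structs exist in KVM; every name and number is invented.

#include <stdint.h>
#include <sys/ioctl.h>

struct kvm_helper_desc {
        char     name[32];      /* e.g. "pit", "ioapic", "kvmclock" */
        uint32_t version;
        uint32_t max_instances;
};

struct kvm_helper_wire {
        uint32_t helper_id;     /* index obtained from enumeration */
        uint32_t instance;
        uint64_t mmio_base;     /* where its registers appear to the guest */
        uint32_t irq_line;      /* how its interrupt output is routed */
        uint32_t flags;
};

#define KVM_ENUM_HELPERS  _IOR('k', 0xe0, struct kvm_helper_desc)  /* made up */
#define KVM_WIRE_HELPER   _IOW('k', 0xe1, struct kvm_helper_wire)  /* made up */

The point of the sketch is only that one enumerate/instantiate/wire interface would replace the per-device ad-hoc ioctls that make the current device assignment API look messy.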
On Wed, Feb 15, 2012 at 03:59:33PM -0600, Anthony Liguori wrote:
> On 02/15/2012 07:39 AM, Avi Kivity wrote:
> >On 02/07/2012 08:12 PM, Rusty Russell wrote:
> >>>I would really love to have this, but the problem is that we'd need a
> >>>general purpose bytecode VM with binding to some kernel APIs. The
> >>>bytecode VM, if made general enough to host more complicated devices,
> >>>would likely be much larger than the actual code we have in the kernel now.
> >>
> >>We have the ability to upload bytecode into the kernel already. It's in
> >>a great bytecode interpreted by the CPU itself.
> >
> >Unfortunately it's inflexible (has to come with the kernel) and open to
> >security vulnerabilities.
>
> I wonder if there's any reasonable way to run device emulation
> within the context of the guest. Could we effectively do something
> like SMM?
>
> For a given set of traps, reflect back into the guest quickly
> changing the visibility of the VGA region. It may require installing
> a new CR3 but maybe that wouldn't be so bad with VPIDs.
>
What will it buy us? Surely not speed. Entering a guest is not much
(if at all) faster than exiting to userspace, and any non-trivial
operation will require an exit to userspace anyway, so we just added one
more guest entry/exit operation on the way to userspace.
> Then you could implement the PIT as guest firmware using kvmclock as the time base.
>
> Once you're back in the guest, you could install the old CR3.
> Perhaps just hide a portion of the physical address space with the
> e820.
>
> Regards,
>
> Anthony Liguori
>
> >>If every user were emulating different machines, like LPF, this would make
> >>sense. Are they?
> >
> >They aren't.
> >
> >>Or should we write those helpers once, in C, and
> >>provide that for them.
> >
> >There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> >stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> >them are quite complicated. However implementing them in bytecode
> >amounts to exposing a stable kernel ABI, since they use such a vast
> >range of kernel services.
> >
--
Gleb.
On 02/16/2012 12:21 AM, Arnd Bergmann wrote:
> ioctl is good for hardware devices and stuff that you want to enumerate
> and/or control permissions on. For something like KVM that is really a
> core kernel service, a syscall makes much more sense.
>
> I would certainly never mix the two concepts: If you use a chardev to get
> a file descriptor, use ioctl to do operations on it, and if you use a
> syscall to get the file descriptor then use other syscalls to do operations
> on it.
>
> I don't really have a good recommendation whether or not to change from an
> ioctl based interface to syscall for KVM now. On the one hand I believe it
> would be significantly cleaner, on the other hand we cannot remove the
> chardev interface any more since there are many existing users.
>
This sums up my feelings exactly. Moving to syscalls would be an
improvement, but not so much an improvement as to warrant the thrashing
and the pain from having to maintain the old interface for a long while.
--
error compiling committee.c: too many arguments to function
On 02/16/2012 02:57 AM, Gleb Natapov wrote:
> On Wed, Feb 15, 2012 at 03:59:33PM -0600, Anthony Liguori wrote:
>> On 02/15/2012 07:39 AM, Avi Kivity wrote:
>>> On 02/07/2012 08:12 PM, Rusty Russell wrote:
>>>>> I would really love to have this, but the problem is that we'd need a
>>>>> general purpose bytecode VM with binding to some kernel APIs. The
>>>>> bytecode VM, if made general enough to host more complicated devices,
>>>>> would likely be much larger than the actual code we have in the kernel now.
>>>>
>>>> We have the ability to upload bytecode into the kernel already. It's in
>>>> a great bytecode interpreted by the CPU itself.
>>>
>>> Unfortunately it's inflexible (has to come with the kernel) and open to
>>> security vulnerabilities.
>>
>> I wonder if there's any reasonable way to run device emulation
>> within the context of the guest. Could we effectively do something
>> like SMM?
>>
>> For a given set of traps, reflect back into the guest quickly
>> changing the visibility of the VGA region. It may require installing
>> a new CR3 but maybe that wouldn't be so bad with VPIDs.
>>
> What will it buy us? Surely not speed. Entering a guest is not much
> (if at all) faster than exiting to userspace and any non trivial
> operation will require exit to userspace anyway,
You can emulate the PIT/RTC entirely within the guest using kvmclock which
doesn't require an additional exit to get the current time base.
So instead of:
1) guest -> host kernel
2) host kernel -> userspace
3) implement logic using rdtscp via VDSO
4) userspace -> host kernel
5) host kernel -> guest
You go:
1) guest -> host kernel
2) host kernel -> guest (with special CR3)
3) implement logic using rdtscp + kvmclock page
4) change CR3 within guest and RETI to VMEXIT source RIP
Same basic concept as PS/2 emulation with SMM.
Regards,
Anthony Liguori
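The "implement logic using rdtscp + kvmclock page" step relies on the guest being able to read kvmclock directly. A minimal sketch of that read, with the pvclock field layout as I recall it (treat the exact offsets and names as assumptions and note that the real code also uses memory barriers):

#include <stdint.h>
#include <x86intrin.h>

struct pvclock_vcpu_time_info {
        uint32_t version;            /* odd while the host is updating the page */
        uint32_t pad0;
        uint64_t tsc_timestamp;
        uint64_t system_time;        /* nanoseconds at tsc_timestamp */
        uint32_t tsc_to_system_mul;
        int8_t   tsc_shift;
        uint8_t  flags;
        uint8_t  pad[2];
};

static uint64_t kvmclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti)
{
        uint32_t version;
        uint64_t delta, ns;

        do {
                while ((version = ti->version) & 1)
                        ;                             /* host update in progress */
                delta = __rdtsc() - ti->tsc_timestamp;
                if (ti->tsc_shift >= 0)
                        delta <<= ti->tsc_shift;
                else
                        delta >>= -ti->tsc_shift;
                ns = ti->system_time +
                     (uint64_t)(((unsigned __int128)delta *
                                 ti->tsc_to_system_mul) >> 32);
        } while (ti->version != version);             /* torn read; retry */

        return ns;
}

With something like this as the time base, the reflected handler never needs to leave the guest to answer a PIT/RTC read.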
On 02/15/2012 04:08 PM, Alexander Graf wrote:
> >
> > Well, the scatter/gather registers I proposed will give you just one
> > register or all of them.
>
> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them.
I should have said, just one register, or all of them, or anything in
between.
> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
Sharing the data structures is not needed. Simply synchronize them before
lookup, like we do for ordinary registers.
> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
on every exit. And you're risking the same thing if your hardware gets
cleverer.
> >
> >>> btw, why are you interested in virtual addresses in userspace at all?
> >>
> >> We need them for gdb and monitor introspection.
> >
> > Hardly fast paths that justify shared memory. I should be much harder
> > on you.
>
> It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code.
It's too magical, fitting a random version of a random userspace
component. Now you can't change this tcg code (and still keep the magic).
Some complexity is part of keeping software as separate components.
> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
We have the same issue with registers. There we call
cpu_synchronize_state() before every access. No magic, but we get to
reuse the code just the same.
> >
> >>>
> >>> One thing that's different is that virtio offloads itself to a thread
> >>> very quickly, while IDE does a lot of work in vcpu thread context.
> >>
> >> So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.
> >
> > Simply making qemu issue the request from a thread would be way better.
> > Something like socketpair mmio, configured for not waiting for the
> > writes to be seen (posted writes) will also help by buffering writes in
> > the socket buffer.
>
> Yup, nice idea. That only works when all parts of a device are actually implemented through the same socket though.
Right, but that's not an issue.
> Otherwise you could run out of order. So if you have a PCI device with a PIO and an MMIO BAR region, they would both have to be handled through the same socket.
I'm more worried about interactions between hotplug and a device, and
between people issuing unrelated PCI reads to flush writes (not sure
what the hardware semantics are there). It's easy to get this wrong.
> >>>
> >>> COWs usually happen from guest userspace, while mmio is usually from the
> >>> guest kernel, so you can switch on that, maybe.
> >>
> >> Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).
> >
> > Or nested virt...
>
> Nested virt on ppc with device assignment? And here I thought I was the crazy one of the two of us :)
I don't mind being crazy on somebody else's arch.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
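To make the "socketpair mmio" idea above more concrete, here is a hypothetical sketch of a transaction format and the out-of-process device side. Nothing here is an existing KVM ABI; the struct and the loop are invented for illustration of posted writes versus replied reads.

#include <stdint.h>
#include <unistd.h>

struct mmio_txn {
        uint64_t addr;        /* guest physical address */
        uint32_t len;         /* access size: 1, 2, 4 or 8 */
        uint8_t  is_write;
        uint8_t  want_reply;  /* 0 for posted writes, 1 for reads */
        uint8_t  pad[2];
        uint64_t data;        /* write payload, or read result in the reply */
};

/* Out-of-process device model: drain transactions from its end of the
 * socketpair; posted writes are simply consumed, reads are answered. */
static void device_loop(int sock)
{
        struct mmio_txn t;

        while (read(sock, &t, sizeof(t)) == sizeof(t)) {
                if (t.is_write) {
                        /* apply the write to device state; no reply is
                         * sent for a posted write */
                        continue;
                }
                t.data = 0;   /* device register value would go here */
                write(sock, &t, sizeof(t));
        }
}

Because the kernel only has to buffer the record in the socket, a posted write never blocks the vcpu on the device process.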
On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> >
> > ioctl is good for hardware devices and stuff that you want to enumerate
> > and/or control permissions on. For something like KVM that is really a
> > core kernel service, a syscall makes much more sense.
>
> Yeah maybe. That distinction is at least in part just historical.
>
> The first problem I see with using a syscall is that you don't need one
> syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> multiplexed syscall like epoll_ctl() - or probably several
> (vm/vcpu/etc).
No. Many of our ioctls are for state save/restore - we reduce that to
two. Many others are due to the with/without irqchip support - we slash
that as well. The device assignment stuff is relegated to vfio.
I still have to draw up a concrete proposal, but I think we'll end up
with 10-15.
>
> Secondly you still need a handle/context for those syscalls, and I think
> the most sane thing to use for that is an fd.
The context is the process (for vm-wide calls) and thread (for vcpu
local calls).
>
> At that point you've basically reinvented ioctl :)
>
> I also think it is an advantage that you have a node in /dev for
> permissions. I know other "core kernel" interfaces don't use a /dev
> node, but arguably that is their loss.
Have to agree with that. Theoretically we don't need permissions for
/dev/kvm, but in practice we do.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
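For a rough feel of the "10-15 syscalls" estimate, here is a hypothetical sketch of such a reduced surface. None of these syscalls exist and all names and signatures are invented; the only grounded assumptions are the ones stated in the thread: state save/restore collapses to two calls, the vm context is the process and the vcpu context is the thread, so no fd argument appears.

#include <stddef.h>
#include <stdint.h>

long kvm_create_vm(unsigned long flags);          /* binds a vm to the caller's mm */
long kvm_create_vcpu(unsigned long flags);        /* binds a vcpu to this thread */
long kvm_vcpu_run(void);
long kvm_get_state(void *buf, size_t len);        /* the two calls that replace */
long kvm_set_state(const void *buf, size_t len);  /* the save/restore ioctl zoo */
long kvm_set_memory_region(uint64_t guest_phys, void *host_addr,
                           uint64_t size, unsigned long flags);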
On 16.02.2012, at 20:24, Avi Kivity wrote:
> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>
>>> Well, the scatter/gather registers I proposed will give you just one
>>> register or all of them.
>>
>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them.
>
> I should have said, just one register, or all of them, or anything in
> between.
>
>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>
> Sharing the data structures is not need. Simply synchronize them before
> lookup, like we do for ordinary registers.
Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>
>> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
>
> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> on every exit. And you're risking the same thing if your hardware gets
> cleverer.
Yes, we do. When that day comes, we forget the CAP and do it another way. Which way we will find out by the time that day of more clever hardware comes :).
>
>>>
>>>>> btw, why are you interested in virtual addresses in userspace at all?
>>>>
>>>> We need them for gdb and monitor introspection.
>>>
>>> Hardly fast paths that justify shared memory. I should be much harder
>>> on you.
>>
>> It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code.
>
> It's too magical, fitting a random version of a random userspace
> component. Now you can't change this tcg code (and still keep the magic).
>
> Some complexity is part of keeping software as separate components.
Why? If another user space wants to use this, they can
a) do the slow copy path
or
b) simply use our struct definitions
The whole copy thing really only makes sense when you have existing code in user space that you don't want to touch, but want to easily bolt KVM onto it. If KVM is part of your whole design, then integrating things makes a lot more sense.
>
>> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
>
> We have the same issue with registers. There we call
> cpu_synchronize_state() before every access. No magic, but we get to
> reuse the code just the same.
Yes, and for those few bytes it's ok to do so - most of the time. On s390, even those get shared by now. And it makes sense to do so - if we synchronize it every time anyways, why not do so implicitly?
Alex
On 02/16/2012 04:46 PM, Anthony Liguori wrote:
>> What will it buy us? Surely not speed. Entering a guest is not much
>> (if at all) faster than exiting to userspace and any non trivial
>> operation will require exit to userspace anyway,
>
>
> You can emulate the PIT/RTC entirely within the guest using kvmclock
> which doesn't require an additional exit to get the current time base.
>
> So instead of:
>
> 1) guest -> host kernel
> 2) host kernel -> userspace
> 3) implement logic using rdtscp via VDSO
> 4) userspace -> host kernel
> 5) host kernel -> guest
>
> You go:
>
> 1) guest -> host kernel
> 2) host kernel -> guest (with special CR3)
> 3) implement logic using rdtscp + kvmclock page
> 4) change CR3 within guest and RETI to VMEXIT source RIP
>
> Same basic concept as PS/2 emulation with SMM.
Interesting, but unimplementable in practice. SMM requires a VMEXIT for
RSM, and anything non-SMM wants a virtual address mapping (and some RAM)
which you can't get without guest cooperation. There are other
complications like an NMI interrupting hypervisor-provided code and
finding unexpected addresses on its stack (SMM at least blocks NMIs).
Tangentially related, Intel introduced a VMFUNC that allows you to
change the guest's physical memory map to a pre-set alternative provided
by the host, without a VMEXIT. Seems similar to SMM but requires guest
cooperation. I guess it's for unintrusive virus scanners and the like.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
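For reference, the VMFUNC Avi mentions is invoked from the guest roughly as below: function 0 is EPTP switching, which swaps the guest-physical map to a host-provided alternative without a VM exit. The register convention shown (EAX selects the function, ECX indexes the host's EPTP list) is my recollection of the Intel definition; confirm against the SDM before relying on it.

static inline void vmfunc_switch_eptp(unsigned int eptp_index)
{
        asm volatile(".byte 0x0f, 0x01, 0xd4"   /* VMFUNC */
                     :
                     : "a" (0),                 /* EAX = 0: EPTP switching */
                       "c" (eptp_index)         /* ECX = index into the EPTP list */
                     : "memory");
}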
On 02/16/2012 09:34 PM, Alexander Graf wrote:
> On 16.02.2012, at 20:24, Avi Kivity wrote:
>
> > On 02/15/2012 04:08 PM, Alexander Graf wrote:
> >>>
> >>> Well, the scatter/gather registers I proposed will give you just one
> >>> register or all of them.
> >>
> >> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them.
> >
> > I should have said, just one register, or all of them, or anything in
> > between.
> >
> >> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
> >
> > Sharing the data structures is not need. Simply synchronize them before
> > lookup, like we do for ordinary registers.
>
> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
A TLB way is a few dozen bytes, no?
> >
> >> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
> >
> > But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> > on every exit. And you're risking the same thing if your hardware gets
> > cleverer.
>
> Yes, we do. When that day comes, we forget the CAP and do it another way. Which way we will find out by the time that day of more clever hardware comes :).
Or we try to be less clever unless we have a really compelling reason.
qemu monitor and gdb support aren't compelling reasons to optimize.
> >
> > It's too magical, fitting a random version of a random userspace
> > component. Now you can't change this tcg code (and still keep the magic).
> >
> > Some complexity is part of keeping software as separate components.
>
> Why? If another user space wants to use this, they can
>
> a) do the slow copy path
> or
> b) simply use our struct definitions
>
> The whole copy thing really only makes sense when you have existing code in user space that you don't want to touch, but easily add on KVM to it. If KVM is part of your whole design, then integrating things makes a lot more sense.
Yeah, I guess.
>
> >
> >> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
> >
> > We have the same issue with registers. There we call
> > cpu_synchronize_state() before every access. No magic, but we get to
> > reuse the code just the same.
>
> Yes, and for those few bytes it's ok to do so - most of the time. On s390, even those get shared by now. And it makes sense to do so - if we synchronize it every time anyways, why not do so implicitly?
>
At least on x86, we synchronize only rarely.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 02/16/2012 01:38 PM, Avi Kivity wrote:
> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>>>
>>>>> Well, the scatter/gather registers I proposed will give you just one
>>>>> register or all of them.
>>>>
>>>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them.
>>>
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>>
>>>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>
>>> Sharing the data structures is not need. Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>>
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>
> A TLB way is a few dozen bytes, no?
I think you mean a TLB set... but the TLB (or part of it) may be fully
associative.
On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
a set of TLB0, and all 64 entries in TLB1. So 1632 bytes total.
Then we'd need to deal with tracking whether we synchronized one or more
specific sets, or everything (for migration or debug TLB dump). The
request to synchronize would have to come from within the QEMU MMU code,
since that's the point where we know what to ask for (unless we
duplicate the logic elsewhere). I'm not sure that reusing the standard
QEMU MMU code for individual debug address translation is really
simplifying things...
And yes, we do have fancier hardware coming fairly soon for which this
breaks (TLB0 entries can be loaded without host involvement, as long as
there's a translation from guest physical to physical in a separate
hardware table). It'd be reasonable to ignore TLB0 for migration (treat
it as invalidated), but not for debug since that may be where the
translation we're interested in resides.
-Scott
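The 1632-byte figure follows from a 24-byte record per TLB entry. The shared layout is, roughly and from memory (check the kvm headers rather than trusting the names here), one set of MAS register values per entry:

#include <stdint.h>

struct booke206_tlb_entry {
        uint32_t mas8;          /* TGS, VF, TLPID */
        uint32_t mas1;          /* V, IPROT, TID, TS, TSIZE */
        uint64_t mas2;          /* EPN and WIMGE attributes */
        uint64_t mas7_3;        /* RPN (MAS7 in the high bits), U0-U3, permissions */
};                              /* 4 + 4 + 8 + 8 = 24 bytes */

/* 4 ways of one TLB0 set plus all 64 TLB1 entries:
 * (4 + 64) * 24 = 1632 bytes, as above. */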
On Thu, 2012-02-16 at 21:28 +0200, Avi Kivity wrote:
> On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > >
> > > ioctl is good for hardware devices and stuff that you want to enumerate
> > > and/or control permissions on. For something like KVM that is really a
> > > core kernel service, a syscall makes much more sense.
> >
> > Yeah maybe. That distinction is at least in part just historical.
> >
> > The first problem I see with using a syscall is that you don't need one
> > syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> > multiplexed syscall like epoll_ctl() - or probably several
> > (vm/vcpu/etc).
>
> No. Many of our ioctls are for state save/restore - we reduce that to
> two. Many others are due to the with/without irqchip support - we slash
> that as well. The device assignment stuff is relegated to vfio.
>
> I still have to draw up a concrete proposal, but I think we'll end up
> with 10-15.
That's true, you certainly could reduce it, though by how much I'm not
sure. On powerpc I'm working on moving the irq controller emulation into
the kernel, and some associated firmware emulation, so that's at least
one new ioctl. And there will always be more, whatever scheme you have
must be easily extensible - ie. not requiring new syscalls for each new
weird platform.
> > Secondly you still need a handle/context for those syscalls, and I think
> > the most sane thing to use for that is an fd.
>
> The context is the process (for vm-wide calls) and thread (for vcpu
> local calls).
Yeah OK I forgot you'd mentioned that. But isn't that change basically
orthogonal to how you get into the kernel? ie. we could have the
kvm/vcpu pointers in mm_struct/task_struct today?
I guess it wouldn't win you much though because you still have the fd
and ioctl overhead as well.
cheers
On 16.02.2012, at 20:38, Avi Kivity wrote:
> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>>>
>>>>> Well, the scatter/gather registers I proposed will give you just one
>>>>> register or all of them.
>>>>
>>>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them.
>>>
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>>
>>>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>
>>> Sharing the data structures is not need. Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>>
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>
> A TLB way is a few dozen bytes, no?
>
>>>
>>>> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
>>>
>>> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
>>> on every exit. And you're risking the same thing if your hardware gets
>>> cleverer.
>>
>> Yes, we do. When that day comes, we forget the CAP and do it another way. Which way we will find out by the time that day of more clever hardware comes :).
>
> Or we try to be less clever unless we have a really compelling reason.
> qemu monitor and gdb support aren't compelling reasons to optimize.
The goal here was simplicity with a grain of performance concerns.
So what would you be envisioning? Should we make all of the MMU walker code in target-ppc KVM-aware, so it fetches that single way it actually cares about on demand from the kernel? That is pretty intrusive and goes against the general "fit in nicely" principle of how KVM integrates today.
Also, we need to store the guest TLB somewhere. With this model, we can just store it in user space memory, so we keep only a single copy around, reducing memory footprint. If we had to copy it, we would need more than a single copy.
>
>>>
>>> It's too magical, fitting a random version of a random userspace
>>> component. Now you can't change this tcg code (and still keep the magic).
>>>
>>> Some complexity is part of keeping software as separate components.
>>
>> Why? If another user space wants to use this, they can
>>
>> a) do the slow copy path
>> or
>> b) simply use our struct definitions
>>
>> The whole copy thing really only makes sense when you have existing code in user space that you don't want to touch, but easily add on KVM to it. If KVM is part of your whole design, then integrating things makes a lot more sense.
>
> Yeah, I guess.
>
>>
>>>
>>>> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
>>>
>>> We have the same issue with registers. There we call
>>> cpu_synchronize_state() before every access. No magic, but we get to
>>> reuse the code just the same.
>>
>> Yes, and for those few bytes it's ok to do so - most of the time. On s390, even those get shared by now. And it makes sense to do so - if we synchronize it every time anyways, why not do so implicitly?
>>
>
> At least on x86, we synchronize only rarely.
Yeah, on s390 we only know which registers actually contain the information we need for traps / hypercalls when in user space, since that's where the decoding happens. So we better have all GPRs available to read from and write to.
Alex
On 16.02.2012, at 21:41, Scott Wood wrote:
> On 02/16/2012 01:38 PM, Avi Kivity wrote:
>> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>>
>>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>>>>
>>>>>> Well, the scatter/gather registers I proposed will give you just one
>>>>>> register or all of them.
>>>>>
>>>>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them.
>>>>
>>>> I should have said, just one register, or all of them, or anything in
>>>> between.
>>>>
>>>>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>>
>>>> Sharing the data structures is not need. Simply synchronize them before
>>>> lookup, like we do for ordinary registers.
>>>
>>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>>
>> A TLB way is a few dozen bytes, no?
>
> I think you mean a TLB set... but the TLB (or part of it) may be fully
> associative.
>
> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1. So 1632 bytes total.
>
> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump). The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere). I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
>
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table). It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.
Could we maybe add an ioctl that forces kvm to read out the current tlb0 contents and push them to memory? How slow would that be?
Alex
On 02/16/2012 06:23 PM, Alexander Graf wrote:
> On 16.02.2012, at 21:41, Scott Wood wrote:
>> And yes, we do have fancier hardware coming fairly soon for which this
>> breaks (TLB0 entries can be loaded without host involvement, as long as
>> there's a translation from guest physical to physical in a separate
>> hardware table). It'd be reasonable to ignore TLB0 for migration (treat
>> it as invalidated), but not for debug since that may be where the
>> translation we're interested in resides.
>
> Could we maybe add an ioctl that forces kvm to read out the current tlb0 contents and push them to memory? How slow would that be?
Yes, I was thinking something like that. We'd just have to remove (make
conditional on MMU type) the statement that this is synchronized
implicitly on return from vcpu_run.
Performance shouldn't be a problem -- we'd only need to sync once and
then can do all the repeated debug accesses we want. So should be no
need to mess around with partial sync.
-Scott
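From the userspace side, the flow Scott agrees to would look roughly like the sketch below: sync once before a batch of debug lookups instead of implicitly on every return from vcpu_run. KVM_SYNC_GUEST_TLB and the lookup helper are invented names, not existing interfaces.

#include <stdint.h>
#include <sys/ioctl.h>

#define KVM_SYNC_GUEST_TLB  _IO('k', 0xf0)   /* made up */

/* hypothetical helper provided elsewhere by the debugger code */
int lookup_in_shared_tlb(const void *shared_tlb, uint64_t ea, uint64_t *pa);

static int debug_translate(int vcpu_fd, const void *shared_tlb,
                           uint64_t ea, uint64_t *pa)
{
        /* Pull the live TLB (including any hardware-loaded TLB0 entries on
         * newer cores) into the shared array exactly once... */
        if (ioctl(vcpu_fd, KVM_SYNC_GUEST_TLB, 0) < 0)
                return -1;

        /* ...then do as many debug lookups as we like against the shared
         * copy, with no further trips into the kernel. */
        return lookup_in_shared_tlb(shared_tlb, ea, pa);
}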
On 02/16/2012 10:41 PM, Scott Wood wrote:
> >>> Sharing the data structures is not need. Simply synchronize them before
> >>> lookup, like we do for ordinary registers.
> >>
> >> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> >
> > A TLB way is a few dozen bytes, no?
>
> I think you mean a TLB set...
Yes, thanks.
> but the TLB (or part of it) may be fully
> associative.
A fully associative TLB has to be very small.
> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1. So 1632 bytes total.
Syncing this every time you need a translation (for gdb or the monitor)
is trivial in terms of performance.
> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump). The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere). I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
>
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table). It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.
>
So with this new hardware, the always-sync API breaks.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 02/17/2012 02:19 AM, Alexander Graf wrote:
> >
> > Or we try to be less clever unless we have a really compelling reason.
> > qemu monitor and gdb support aren't compelling reasons to optimize.
>
> The goal here was simplicity with a grain of performance concerns.
>
Shared memory is simple in one way, but in other ways it is more
complicated since it takes away the kernel's freedom in how it manages
the data, how it's laid out, and whether it can lazify things or not.
> So what would you be envisioning? Should we make all of the MMU walker code in target-ppc KVM aware so it fetches that single way it actually cares about on demand from the kernel? That is pretty intrusive and goes against the general nicely fitting in principle of how KVM integrates today.
First, it's trivial: when you access a set you call
cpu_synchronize_tlb(set), just like you access the registers when
you want them.
Second, and more important, how a random version of qemu works is
totally immaterial to the kvm userspace interface. qemu could change in
15 different ways and so could the kernel, and other users exist.
Fitting into qemu's current model is not a goal (if qemu happens to have
a good model, use it by all means; and clashing with qemu is likely an
indication that something is wrong -- but the two projects need to be
decoupled).
> Also, we need to store the guest TLB somewhere. With this model, we can just store it in user space memory, so we keep only a single copy around, reducing memory footprint. If we had to copy it, we would need more than a single copy.
That's the whole point. You could store it on the cpu hardware, if the
cpu allows it. Forcing it into always-synchronized shared memory takes
that ability away from you.
>
> >
> > At least on x86, we synchronize only rarely.
>
> Yeah, on s390 we only know which registers actually contain the information we need for traps / hypercalls when in user space, since that's where the decoding happens. So we better have all GPRs available to read from and write to.
>
Ok.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 02/17/2012 02:09 AM, Michael Ellerman wrote:
> On Thu, 2012-02-16 at 21:28 +0200, Avi Kivity wrote:
> > On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > > >
> > > > ioctl is good for hardware devices and stuff that you want to enumerate
> > > > and/or control permissions on. For something like KVM that is really a
> > > > core kernel service, a syscall makes much more sense.
> > >
> > > Yeah maybe. That distinction is at least in part just historical.
> > >
> > > The first problem I see with using a syscall is that you don't need one
> > > syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> > > multiplexed syscall like epoll_ctl() - or probably several
> > > (vm/vcpu/etc).
> >
> > No. Many of our ioctls are for state save/restore - we reduce that to
> > two. Many others are due to the with/without irqchip support - we slash
> > that as well. The device assignment stuff is relegated to vfio.
> >
> > I still have to draw up a concrete proposal, but I think we'll end up
> > with 10-15.
>
> That's true, you certainly could reduce it, though by how much I'm not
> sure. On powerpc I'm working on moving the irq controller emulation into
> the kernel, and some associated firmware emulation, so that's at least
> one new ioctl. And there will always be more, whatever scheme you have
> must be easily extensible - ie. not requiring new syscalls for each new
> weird platform.
Most of it falls into read/write state, which is covered by two
syscalls. There's probably need for configuration (wiring etc.); we
could call that pseudo-state with fake registers but I don't like that
very much.
> > > Secondly you still need a handle/context for those syscalls, and I think
> > > the most sane thing to use for that is an fd.
> >
> > The context is the process (for vm-wide calls) and thread (for vcpu
> > local calls).
>
> Yeah OK I forgot you'd mentioned that. But isn't that change basically
> orthogonal to how you get into the kernel? ie. we could have the
> kvm/vcpu pointers in mm_struct/task_struct today?
>
> I guess it wouldn't win you much though because you still have the fd
> and ioctl overhead as well.
>
Yes. I also dislike bypassing ioctl semantics (though we already do
that by requiring vcpus to stay on the same thread and vms on the same
process).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 18.02.2012, at 11:00, Avi Kivity <[email protected]> wrote:
> On 02/17/2012 02:19 AM, Alexander Graf wrote:
>>>
>>> Or we try to be less clever unless we have a really compelling reason.
>>> qemu monitor and gdb support aren't compelling reasons to optimize.
>>
>> The goal here was simplicity with a grain of performance concerns.
>>
>
> Shared memory is simple in one way, but in other ways it is more
> complicated since it takes away the kernel's freedom in how it manages
> the data, how it's laid out, and whether it can lazify things or not.
Yes and no. Shared memory is a means of transferring data. Whether it's implemented by copying internally or by implicit synchronization is orthogonal to that.
With the interface as is, we can now, on newer CPUs (which need changes to user space to work anyway), take the current interface and add a new CAP + ioctl that allows us to force-flush the TLB into the shared buffer. That way we maintain backwards compatibility, memory savings, no in-kernel vmalloc clutter etc. on all CPUs, but get the checkpoint to actually have useful contents for new CPUs.
I don't see the problem really. The data is the architected layout of the TLB. It contains all the data that can possibly make up a TLB entry according to the booke spec. If we wanted to copy different data, we'd need a different ioctl too.
>
>> So what would you be envisioning? Should we make all of the MMU walker code in target-ppc KVM aware so it fetches that single way it actually cares about on demand from the kernel? That is pretty intrusive and goes against the general nicely fitting in principle of how KVM integrates today.
>
> First, it's trivial, when you access a set you call
> cpu_synchronize_tlb(set), just like how you access the registers when
> you want them.
Yes, which is reasonably intrusive and going to be necessary with LRAT.
>
> Second, and more important, how a random version of qemu works is
> totally immaterial to the kvm userspace interface. qemu could change in
> 15 different ways and so could the kernel, and other users exist.
> Fitting into qemu's current model is not a goal (if qemu happens to have
> a good model, use it by all means; and clashing with qemu is likely an
> indication the something is wrong -- but the two projects need to be
> decoupled).
Sure. In fact, in this case, the two were developed together. QEMU didn't have support for this specific TLB type, so we combined the development efforts. This way any new user space has a very easy time to implement it too, because we didn't model the KVM parts after QEMU, but the QEMU parts after KVM.
I still think it holds true that the KVM interface is very easy to plug into any random emulation project. And to achieve that, the interface should be as unintrusive as possible wrt its requirements. The one we have seemed to fit that pretty well. Sure, we need a special flush command for newer CPUs, but at least we don't always have to copy. We only copy when we need to.
>
>> Also, we need to store the guest TLB somewhere. With this model, we can just store it in user space memory, so we keep only a single copy around, reducing memory footprint. If we had to copy it, we would need more than a single copy.
>
> That's the whole point. You could store it on the cpu hardware, if the
> cpu allows it. Forcing it into always-synchronized shared memory takes
> that ability away from you.
Yup. So the correct comment to make would be "don't make the shared TLB always synchronized", which I agree with today. I still think that the whole idea of handing KVM user space memory to work on is great. It reduces vmalloc footprint, it reduces copying, and it keeps data in one place, reducing chances to mess up.
Having it defined to always be in sync was a mistake, but one we can easily fix. That's why the CAP and ioctl interfaces are so awesome ;). I strongly believe that I can't predict the future. So designing an interface that holds stable for the next 10 years is close to impossible. With an easily extensible interface however, it becomes almost trivial to fix earlier mess-ups ;).
Alex
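The CAP + ioctl pattern Alex is leaning on looks like this from userspace. KVM_CHECK_EXTENSION is the real probing mechanism; the capability number and the flush ioctl below are invented stand-ins for the new force-flush hook.

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define KVM_CAP_TLB_FLUSH_SYNC  200                    /* made up */
#define KVM_FLUSH_GUEST_TLB     _IO(KVMIO, 0xf1)       /* made up */

static int sync_tlb_if_needed(int kvm_fd, int vcpu_fd)
{
        /* Old kernels / old CPUs: the shared buffer is already in sync,
         * so there is nothing to do.  Newer ones advertise the CAP and
         * expect an explicit flush into the shared buffer. */
        if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_TLB_FLUSH_SYNC) > 0)
                return ioctl(vcpu_fd, KVM_FLUSH_GUEST_TLB, 0);
        return 0;
}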
On Sat, 2012-02-04 at 11:08 +0900, Takuya Yoshikawa wrote:
> The latter needs a fundamental change: I heard (from Avi) that we can
> change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
>
> So I was planning to restart this work when Peter's
> "mm: Preemptibility"
> http://lkml.org/lkml/2011/4/1/141
> gets finished.
That got merged a while ago:
# git describe --contains d16dfc550f5326a4000f3322582a7c05dec91d7a --match "v*"
v3.0-rc1~275
While I still need to get back to unifying mmu_gather across
architectures, the whole thing is currently preemptible.