2013-04-04 11:50:48

by Michael S. Tsirkin

Subject: [PATCH RFC] kvm: add PV MMIO EVENTFD

With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Add an interface for userspace to specify this per-address; we can
use this e.g. for virtio.
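
For illustration, a minimal userspace sketch of what assigning such an
eventfd could look like with the flag added below (vm_fd and the
guest-physical kick address are assumptions, not part of the patch):

#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* needs the KVM_IOEVENTFD_FLAG_PV_MMIO definition below */

static int assign_pv_mmio_eventfd(int vm_fd, uint64_t gpa)
{
	struct kvm_ioeventfd ioev;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;

	memset(&ioev, 0, sizeof(ioev));
	ioev.addr  = gpa;	/* unique guest-physical kick address */
	ioev.len   = 2;		/* length the guest promises to use */
	ioev.fd    = efd;
	ioev.flags = KVM_IOEVENTFD_FLAG_PV_MMIO; /* no PIO, no DATAMATCH */

	return ioctl(vm_fd, KVM_IOEVENTFD, &ioev);
}

With the patch below this registers the eventfd on the regular MMIO bus
and additionally on the new KVM_PV_MMIO_BUS, so an EPT misconfig exit on
that address can signal the eventfd without going through emulation.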

The implementation adds a separate bus internally. This serves two
purposes:
- minimize overhead for old userspace that does not use PV MMIO
- minimize disruption in other code (since we don't know the length,
devices on the MMIO bus only get a valid address in write; this
way we don't need to touch all devices to teach them to handle
an invalid length)

At the moment, this optimization is only supported for EPT on x86 and
silently ignored for NPT and MMU, so everything works correctly but
slowly.

TODO: NPT, MMU and non x86 architectures.

The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
arch/x86/kvm/vmx.c | 4 ++++
arch/x86/kvm/x86.c | 1 +
include/linux/kvm_host.h | 1 +
include/uapi/linux/kvm.h | 9 +++++++++
virt/kvm/eventfd.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
virt/kvm/kvm_main.c | 1 +
6 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..cdaac9b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5127,6 +5127,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa_t gpa;

gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+ if (!kvm_io_bus_write(vcpu->kvm, KVM_PV_MMIO_BUS, gpa, 0, NULL)) {
+ skip_emulated_instruction(vcpu);
+ return 1;
+ }

ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == 1))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f19ac0a..b9223d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2483,6 +2483,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_IRQFD:
case KVM_CAP_IOEVENTFD:
+ case KVM_CAP_IOEVENTFD_PV_MMIO:
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..35b74cd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -149,6 +149,7 @@ struct kvm_io_bus {
enum kvm_bus {
KVM_MMIO_BUS,
KVM_PIO_BUS,
+ KVM_PV_MMIO_BUS,
KVM_NR_BUSES
};

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3c56ba3..61783ee 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -449,11 +449,19 @@ enum {
kvm_ioeventfd_flag_nr_datamatch,
kvm_ioeventfd_flag_nr_pio,
kvm_ioeventfd_flag_nr_deassign,
+ kvm_ioeventfd_flag_nr_pv_mmio,
kvm_ioeventfd_flag_nr_max,
};

#define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch)
#define KVM_IOEVENTFD_FLAG_PIO (1 << kvm_ioeventfd_flag_nr_pio)
+/*
+ * PV_MMIO - Guest can promise us that all accesses touching this address
+ * are writes of specified length, starting at the specified address.
+ * If not - it's a Guest bug.
+ * Can not be used together with either PIO or DATAMATCH.
+ */
+#define KVM_IOEVENTFD_FLAG_PV_MMIO (1 << kvm_ioeventfd_flag_nr_pv_mmio)
#define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)

#define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
@@ -665,6 +673,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_EPR 86
#define KVM_CAP_ARM_PSCI 87
#define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_IOEVENTFD_PV_MMIO 89

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 93e5b05..1b7619e 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -579,6 +579,7 @@ struct _ioeventfd {
struct kvm_io_device dev;
u8 bus_idx;
bool wildcard;
+ bool pvmmio;
};

static inline struct _ioeventfd *
@@ -600,7 +601,15 @@ ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int len, const void *val)
{
u64 _val;

- if (!(addr == p->addr && len == p->length))
+ if (addr != p->addr)
+ /* address must be precise for a hit */
+ return false;
+
+ if (p->pvmmio)
+ /* pvmmio only looks at the address, so always a hit */
+ return true;
+
+ if (len != p->length)
/* address-range must be precise for a hit */
return false;

@@ -671,9 +680,11 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)

list_for_each_entry(_p, &kvm->ioeventfds, list)
if (_p->bus_idx == p->bus_idx &&
- _p->addr == p->addr && _p->length == p->length &&
- (_p->wildcard || p->wildcard ||
- _p->datamatch == p->datamatch))
+ _p->addr == p->addr &&
+ (_p->pvmmio || p->pvmmio ||
+ (_p->length == p->length &&
+ (_p->wildcard || p->wildcard ||
+ _p->datamatch == p->datamatch))))
return true;

return false;
@@ -707,6 +718,12 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
return -EINVAL;

+ /* PV MMIO can't be combined with PIO or DATAMATCH */
+ if (args->flags & KVM_IOEVENTFD_FLAG_PV_MMIO &&
+ args->flags & (KVM_IOEVENTFD_FLAG_PIO |
+ KVM_IOEVENTFD_FLAG_DATAMATCH))
+ return -EINVAL;
+
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
@@ -722,6 +739,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
p->bus_idx = bus_idx;
p->length = args->len;
p->eventfd = eventfd;
+ p->pvmmio = args->flags & KVM_IOEVENTFD_FLAG_PV_MMIO;

/* The datamatch feature is optional, otherwise this is a wildcard */
if (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH)
@@ -729,6 +747,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
else
p->wildcard = true;

+
mutex_lock(&kvm->slots_lock);

/* Verify that there isn't a match already */
@@ -744,12 +763,24 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
if (ret < 0)
goto unlock_fail;

+ /* PV MMIO is also put on a separate bus, for faster lookups.
+ * Length is ignored for PV MMIO bus. */
+ if (p->pvmmio) {
+ ret = kvm_io_bus_register_dev(kvm, KVM_PV_MMIO_BUS,
+ p->addr, 0, &p->dev);
+ if (ret < 0)
+ goto register_fail;
+ }
+
list_add_tail(&p->list, &kvm->ioeventfds);

mutex_unlock(&kvm->slots_lock);

return 0;

+register_fail:
+ kvm_io_bus_unregister_dev(kvm, bus_idx,
+ &p->dev);
unlock_fail:
mutex_unlock(&kvm->slots_lock);

@@ -776,19 +807,25 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
mutex_lock(&kvm->slots_lock);

list_for_each_entry_safe(p, tmp, &kvm->ioeventfds, list) {
+ bool pvmmio = args->flags & KVM_IOEVENTFD_FLAG_PV_MMIO;
bool wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);

if (p->bus_idx != bus_idx ||
p->eventfd != eventfd ||
p->addr != args->addr ||
p->length != args->len ||
- p->wildcard != wildcard)
+ p->wildcard != wildcard ||
+ p->pvmmio != pvmmio)
continue;

if (!p->wildcard && p->datamatch != args->datamatch)
continue;

kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
+ if (pvmmio) {
+ kvm_io_bus_unregister_dev(kvm, KVM_PV_MMIO_BUS,
+ &p->dev);
+ }
ioeventfd_release(p);
ret = 0;
break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index adc68fe..74c5eb5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2709,6 +2709,7 @@ int kvm_io_bus_write(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,

return -EOPNOTSUPP;
}
+EXPORT_SYMBOL_GPL(kvm_io_bus_write);

/* kvm_io_bus_read - called under kvm->slots_lock */
int kvm_io_bus_read(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
--
MST


2013-04-04 11:57:43

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:

> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can look
> up the eventfd directly, bypassing emulation.
>
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
>
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
> devices on the MMIO bus only get a valid address in write, this
> way we don't need to touch all devices to teach them handle
> an dinvalid length)
>
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
>
> TODO: NPT, MMU and non x86 architectures.
>
> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> pre-review and suggestions.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>

This still uses page fault intercepts which are orders of magnitude slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?

That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.


Alex

2013-04-04 12:04:43

by Michael S. Tsirkin

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> >
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> >
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> >
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> >
> > TODO: NPT, MMU and non x86 architectures.
> >
> > The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> > pre-review and suggestions.
> >
> > Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> This still uses page fault intercepts which are orders of magnitudes
> slower than hypercalls.

Not really. Here's a test:
compare vmcall to portio:

vmcall 1519
...
outl_to_kernel 1745

compare portio to mmio:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So not orders of magnitude.

> Why don't you just create a PV MMIO hypercall
> that the guest can use to invoke MMIO accesses towards the host based
> on physical addresses with explicit length encodings?
> That way you simplify and speed up all code paths, exceeding the speed
> of PIO exits even. It should also be quite easily portable, as all
> other platforms have hypercalls available as well.
>
>
> Alex

I sent such a patch, but maintainers seem reluctant to add hypercalls.
Gleb, could you comment please?

A fast way to do MMIO is probably useful in any case ...

> Alex

2013-04-04 12:09:36

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> >
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> >
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> >
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> >
> > TODO: NPT, MMU and non x86 architectures.
> >
> > The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> > pre-review and suggestions.
> >
> > Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>
It is slower, but not an order of magnitude slower. It becomes faster
with newer HW.

> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>
We are trying to avoid PV as much as possible (well this is also PV,
but not guest visible). We haven't replaced PIO with hypercall for the
same reason. My hope is that future HW will provide us with instruction
decode for basic mov instruction at which point this optimisation can be
dropped. And hypercall has its own set of problems with Windows guests.
When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
hypercalls. Mixing KVM hypercalls and Hyper-V requires some tricks. It
may also affect WHQLing Windows drivers since driver will talk to HW
bypassing Windows interfaces.

--
Gleb.

2013-04-04 12:09:59

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>>
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>>
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>>
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>>
>>> TODO: NPT, MMU and non x86 architectures.
>>>
>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>>
>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>
>> This still uses page fault intercepts which are orders of magnitudes
>> slower than hypercalls.
>
> Not really. Here's a test:
> compare vmcall to portio:
>
> vmcall 1519
> ...
> outl_to_kernel 1745
>
> compare portio to mmio:
>
> mmio-wildcard-eventfd:pci-mem 3529
> mmio-pv-eventfd:pci-mem 1878
> portio-wildcard-eventfd:pci-io 1846
>
> So not orders of magnitude.

https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf

Check out page 41. Higher is better (number is number of loop cycles in a second). My test system was an AMD Istanbul based box.

>
>> Why don't you just create a PV MMIO hypercall
>> that the guest can use to invoke MMIO accesses towards the host based
>> on physical addresses with explicit length encodings?
>> That way you simplify and speed up all code paths, exceeding the speed
>> of PIO exits even. It should also be quite easily portable, as all
>> other platforms have hypercalls available as well.
>>
>>
>> Alex
>
> I sent such a patch, but maintainers seem reluctant to add hypercalls.
> Gleb, could you comment please?
>
> A fast way to do MMIO is probably useful in any case ...

Yes, but at least according to my numbers optimizing anything that is not hcalls is a waste of time.


Alex

2013-04-04 12:20:38

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
>
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>>
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>>
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>>
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>>
> >>> TODO: NPT, MMU and non x86 architectures.
> >>>
> >>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>
> >> This still uses page fault intercepts which are orders of magnitudes
> >> slower than hypercalls.
> >
> > Not really. Here's a test:
> > compare vmcall to portio:
> >
> > vmcall 1519
> > ...
> > outl_to_kernel 1745
> >
> > compare portio to mmio:
> >
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> >
> > So not orders of magnitude.
>
> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
>
> Check out page 41. Higher is better (number is number of loop cycles in a second). My test system was an AMD Istanbul based box.
>
Have you bypassed instruction emulation in your testing?

> >
> >> Why don't you just create a PV MMIO hypercall
> >> that the guest can use to invoke MMIO accesses towards the host based
> >> on physical addresses with explicit length encodings?
> >> That way you simplify and speed up all code paths, exceeding the speed
> >> of PIO exits even. It should also be quite easily portable, as all
> >> other platforms have hypercalls available as well.
> >>
> >>
> >> Alex
> >
> > I sent such a patch, but maintainers seem reluctant to add hypercalls.
> > Gleb, could you comment please?
> >
> > A fast way to do MMIO is probably useful in any case ...
>
> Yes, but at least according to my numbers optimizing anything that is not hcalls is a waste of time.
>
>
> Alex

--
Gleb.

2013-04-04 12:22:00

by Michael S. Tsirkin

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
>
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>>
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>>
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>>
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>>
> >>> TODO: NPT, MMU and non x86 architectures.
> >>>
> >>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>
> >> This still uses page fault intercepts which are orders of magnitudes
> >> slower than hypercalls.
> >
> > Not really. Here's a test:
> > compare vmcall to portio:
> >
> > vmcall 1519
> > ...
> > outl_to_kernel 1745
> >
> > compare portio to mmio:
> >
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> >
> > So not orders of magnitude.
>
> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
>
> Check out page 41. Higher is better (number is number of loop cycles in a second). My test system was an AMD Istanbul based box.

Wow 2x difference. The difference seems to be much smaller now. Newer
hardware? Newer software?

> >
> >> Why don't you just create a PV MMIO hypercall
> >> that the guest can use to invoke MMIO accesses towards the host based
> >> on physical addresses with explicit length encodings?
> >> That way you simplify and speed up all code paths, exceeding the speed
> >> of PIO exits even. It should also be quite easily portable, as all
> >> other platforms have hypercalls available as well.
> >>
> >>
> >> Alex
> >
> > I sent such a patch, but maintainers seem reluctant to add hypercalls.
> > Gleb, could you comment please?
> >
> > A fast way to do MMIO is probably useful in any case ...
>
> Yes, but at least according to my numbers optimizing anything that is not hcalls is a waste of time.
>
>
> Alex

This was the implementation: 'kvm_para: add mmio word store hypercall'.
Again, I don't mind.
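
For reference, a rough guest-side sketch of that idea (the hypercall
number and argument layout here are made up for illustration; they are
not from that patch, which is not included in this thread):

#include <linux/types.h>
#include <linux/kvm_para.h>

#define KVM_HC_MMIO_STORE	100	/* hypothetical, not an allocated number */

static inline long pv_mmio_store_word(phys_addr_t gpa, u16 val)
{
	/* arguments: guest-physical address, access length, value to store */
	return kvm_hypercall3(KVM_HC_MMIO_STORE, gpa, sizeof(val), val);
}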

--
MST

2013-04-04 12:22:14

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:08, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>>
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>>
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>>
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>>
>>> TODO: NPT, MMU and non x86 architectures.
>>>
>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>>
>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>
>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>
> It is slower, but not an order of magnitude slower. It become faster
> with newer HW.
>
>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>
> We are trying to avoid PV as much as possible (well this is also PV,
> but not guest visible). We haven't replaced PIO with hypercall for the
> same reason. My hope is that future HW will provide us with instruction
> decode for basic mov instruction at which point this optimisation can be
> dropped.

The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, we can drop the capability we expose to the guest.

> And hypercall has its own set of problems with Windows guests.
> When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> hypercalls. Mixing KVM hypercalls and Hyper-V requires some tricks. It

Can't we simply register a hypercall ID range with Microsoft?

> may also affect WHQLing Windows drivers since driver will talk to HW
> bypassing Windows interfaces.

Then the WHQL'ed driver doesn't support the PV MMIO hcall?


Alex

2013-04-04 12:22:42

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:19, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
>>
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>
>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>> up the eventfd directly, bypassing emulation.
>>>>>
>>>>> Add an interface for userspace to specify this per-address, we can
>>>>> use this e.g. for virtio.
>>>>>
>>>>> The implementation adds a separate bus internally. This serves two
>>>>> purposes:
>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>> - minimize disruption in other code (since we don't know the length,
>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>> way we don't need to touch all devices to teach them handle
>>>>> an dinvalid length)
>>>>>
>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>> slowly.
>>>>>
>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>
>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>> pre-review and suggestions.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>
>>>> This still uses page fault intercepts which are orders of magnitudes
>>>> slower than hypercalls.
>>>
>>> Not really. Here's a test:
>>> compare vmcall to portio:
>>>
>>> vmcall 1519
>>> ...
>>> outl_to_kernel 1745
>>>
>>> compare portio to mmio:
>>>
>>> mmio-wildcard-eventfd:pci-mem 3529
>>> mmio-pv-eventfd:pci-mem 1878
>>> portio-wildcard-eventfd:pci-io 1846
>>>
>>> So not orders of magnitude.
>>
>> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
>>
>> Check out page 41. Higher is better (number is number of loop cycles in a second). My test system was an AMD Istanbul based box.
>>
> Have you bypassed instruction emulation in your testing?

PIO doesn't do instruction emulation.


Alex

2013-04-04 12:32:13

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:08, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>>
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>>
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>>
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>>
>>> TODO: NPT, MMU and non x86 architectures.
>>>
>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>>
>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>
>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>
> It is slower, but not an order of magnitude slower. It become faster
> with newer HW.
>
>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>
> We are trying to avoid PV as much as possible (well this is also PV,
> but not guest visible

Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.

+/*
+ * PV_MMIO - Guest can promise us that all accesses touching this address
+ * are writes of specified length, starting at the specified address.
+ * If not - it's a Guest bug.
+ * Can not be used together with either PIO or DATAMATCH.
+ */


Alex

2013-04-04 12:35:49

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>>
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>>
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>>
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>>
> >>> TODO: NPT, MMU and non x86 architectures.
> >>>
> >>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>
> >> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> >
> >> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible). We haven't replaced PIO with hypercall for the
> > same reason. My hope is that future HW will provide us with instruction
> > decode for basic mov instruction at which point this optimisation can be
> > dropped.
>
> The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, we can drop the capability we expose to the guest.
>
Disabling it on newer HW is easy, but it is not that simple to get rid of the guest code.

> > And hypercall has its own set of problems with Windows guests.
> > When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> > hypercalls. Mixing KVM hypercalls and Hyper-V requires some tricks. It
>
> Can't we simply register a hypercall ID range with Microsoft?
Doubt it. This is not only about the range, though. The calling convention
is completely different.

>
> > may also affect WHQLing Windows drivers since driver will talk to HW
> > bypassing Windows interfaces.
>
> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
>
All Windows drivers have to be WHQL'ed; saying that is like saying that
we do not care about Windows guest virtio speed.

--
Gleb.

2013-04-04 12:39:01

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>>
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>>
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>>
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>>
> >>> TODO: NPT, MMU and non x86 architectures.
> >>>
> >>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>
> >> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> >
> >> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
>
> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
>
QEMU sets it.

> +/*
> + * PV_MMIO - Guest can promise us that all accesses touching this address
> + * are writes of specified length, starting at the specified address.
> + * If not - it's a Guest bug.
> + * Can not be used together with either PIO or DATAMATCH.
> + */
>
The virtio spec will state that access to a kick register needs to be of a
specific length. This is a reasonable thing for HW to ask.
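
A guest-side sketch of such a fixed-width kick (the device structure and
notify_addr field are illustrative, not from any existing driver):

#include <linux/io.h>
#include <linux/types.h>

struct pv_mmio_dev {
	void __iomem *notify_addr;	/* ioremapped kick register */
};

static void pv_mmio_kick(struct pv_mmio_dev *dev, u16 queue_index)
{
	/* always a 16-bit write at the same address, as promised to the host */
	iowrite16(queue_index, dev->notify_addr);
}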

--
Gleb.

2013-04-04 12:39:13

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:34, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>
>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>> up the eventfd directly, bypassing emulation.
>>>>>
>>>>> Add an interface for userspace to specify this per-address, we can
>>>>> use this e.g. for virtio.
>>>>>
>>>>> The implementation adds a separate bus internally. This serves two
>>>>> purposes:
>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>> - minimize disruption in other code (since we don't know the length,
>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>> way we don't need to touch all devices to teach them handle
>>>>> an dinvalid length)
>>>>>
>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>> slowly.
>>>>>
>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>
>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>> pre-review and suggestions.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>
>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>>>
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>>
>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>>>
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible). We haven't replaced PIO with hypercall for the
>>> same reason. My hope is that future HW will provide us with instruction
>>> decode for basic mov instruction at which point this optimisation can be
>>> dropped.
>>
>> The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, we can drop the capability we expose to the guest.
>>
> Disable it on newer HW is easy, but it is not that simple to get rid of the guest code.

You need to have guest code that runs with non-PV hcalls anyways, since you want to be able to run on older KVM versions. So what's the problem?

>
>>> And hypercall has its own set of problems with Windows guests.
>>> When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
>>> hypercalls. Mixing KVM hypercalls and Hyper-V requires some tricks. It
>>
>> Can't we simply register a hypercall ID range with Microsoft?
> Doubt it. This is not only about the rang though. The calling convention
> is completely different.

There is no calling convention for KVM hypercalls anymore, or is there?
Either way, we would of course just adhere to the MS calling convention. We're only talking about register values here...

>
>>
>>> may also affect WHQLing Windows drivers since driver will talk to HW
>>> bypassing Windows interfaces.
>>
>> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
>>
> All Windows drivers have to be WHQL'ed, saying that is like saying that
> we do not care about Windows guest virtio speed.

I don't mind if a Linux guest is faster than a Windows guest, if that's what you're asking.


Alex

2013-04-04 12:39:55

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:38, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>
>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>> up the eventfd directly, bypassing emulation.
>>>>>
>>>>> Add an interface for userspace to specify this per-address, we can
>>>>> use this e.g. for virtio.
>>>>>
>>>>> The implementation adds a separate bus internally. This serves two
>>>>> purposes:
>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>> - minimize disruption in other code (since we don't know the length,
>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>> way we don't need to touch all devices to teach them handle
>>>>> an dinvalid length)
>>>>>
>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>> slowly.
>>>>>
>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>
>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>> pre-review and suggestions.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>
>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>>>
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>>
>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>>>
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible
>>
>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
>>
> QEMU sets it.

How does QEMU know?

>
>> +/*
>> + * PV_MMIO - Guest can promise us that all accesses touching this address
>> + * are writes of specified length, starting at the specified address.
>> + * If not - it's a Guest bug.
>> + * Can not be used together with either PIO or DATAMATCH.
>> + */
>>
> Virtio spec will state that access to a kick register needs to be of
> specific length. This is reasonable thing for HW to ask.

This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.


Alex

2013-04-04 12:46:08

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>
> >>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>> up the eventfd directly, bypassing emulation.
> >>>>>
> >>>>> Add an interface for userspace to specify this per-address, we can
> >>>>> use this e.g. for virtio.
> >>>>>
> >>>>> The implementation adds a separate bus internally. This serves two
> >>>>> purposes:
> >>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>> - minimize disruption in other code (since we don't know the length,
> >>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>> way we don't need to touch all devices to teach them handle
> >>>>> an dinvalid length)
> >>>>>
> >>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>> slowly.
> >>>>>
> >>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>
> >>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>> pre-review and suggestions.
> >>>>>
> >>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>
> >>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>>
> >>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >>
> >> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> >>
> > QEMU sets it.
>
> How does QEMU know?
>
Knows what? When to create such eventfd? virtio device knows.

> >
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >>
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
>
> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
>
There is no virtio spec yet that has a kick register in MMIO. The spec is in
the works AFAIK. Actually PIO will not be deprecated, and my suggestion
is to move to MMIO only when the PIO address space is exhausted. For PCI it
will be never; for PCI-e it will be after ~16 devices (each device sits behind
a bridge whose I/O window takes at least 4K of the 64K port space).

--
Gleb.

2013-04-04 12:49:43

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:45, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>>>
>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>>>
>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>>>> up the eventfd directly, bypassing emulation.
>>>>>>>
>>>>>>> Add an interface for userspace to specify this per-address, we can
>>>>>>> use this e.g. for virtio.
>>>>>>>
>>>>>>> The implementation adds a separate bus internally. This serves two
>>>>>>> purposes:
>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>>>> - minimize disruption in other code (since we don't know the length,
>>>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>>>> way we don't need to touch all devices to teach them handle
>>>>>>> an dinvalid length)
>>>>>>>
>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>>>> slowly.
>>>>>>>
>>>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>>>
>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>>>> pre-review and suggestions.
>>>>>>>
>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>>>
>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>>>>>
>>>>> It is slower, but not an order of magnitude slower. It become faster
>>>>> with newer HW.
>>>>>
>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>>>>>
>>>>> We are trying to avoid PV as much as possible (well this is also PV,
>>>>> but not guest visible
>>>>
>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
>>>>
>>> QEMU sets it.
>>
>> How does QEMU know?
>>
> Knows what? When to create such eventfd? virtio device knows.

Where does it know from?

>
>>>
>>>> +/*
>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
>>>> + * are writes of specified length, starting at the specified address.
>>>> + * If not - it's a Guest bug.
>>>> + * Can not be used together with either PIO or DATAMATCH.
>>>> + */
>>>>
>>> Virtio spec will state that access to a kick register needs to be of
>>> specific length. This is reasonable thing for HW to ask.
>>
>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
>>
> There is not virtio spec that has kick register in MMIO. The spec is in
> the works AFAIK. Actually PIO will not be deprecated and my suggestion

So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?

> is to move to MMIO only when PIO address space is exhausted. For PCI it
> will be never, for PCI-e it will be after ~16 devices.

Ok, let's go back a step here. Are you actually able to measure any difference in performance with this patch applied and without, when going through MMIO kicks?


Alex

2013-04-04 12:57:50

by Gleb Natapov

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:45, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>>
> >>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>>
> >>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>>
> >>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>>
> >>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>> use this e.g. for virtio.
> >>>>>>>
> >>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>> purposes:
> >>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>> an dinvalid length)
> >>>>>>>
> >>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>> slowly.
> >>>>>>>
> >>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>>
> >>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>>>> pre-review and suggestions.
> >>>>>>>
> >>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>>>
> >>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>>>
> >>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>> with newer HW.
> >>>>>
> >>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>>>
> >>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>> but not guest visible
> >>>>
> >>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> >>>>
> >>> QEMU sets it.
> >>
> >> How does QEMU know?
> >>
> > Knows what? When to create such eventfd? virtio device knows.
>
> Where does it know from?
>
It does it always.

> >
> >>>
> >>>> +/*
> >>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >>>> + * are writes of specified length, starting at the specified address.
> >>>> + * If not - it's a Guest bug.
> >>>> + * Can not be used together with either PIO or DATAMATCH.
> >>>> + */
> >>>>
> >>> Virtio spec will state that access to a kick register needs to be of
> >>> specific length. This is reasonable thing for HW to ask.
> >>
> >> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
> >>
> > There is not virtio spec that has kick register in MMIO. The spec is in
> > the works AFAIK. Actually PIO will not be deprecated and my suggestion
>
> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
>
Guest will indicate nothing. New driver will use MMIO if PIO is bar is
not configured. All driver will not work for virtio devices with MMIO
bar, but not PIO bar.

> > is to move to MMIO only when PIO address space is exhausted. For PCI it
> > will be never, for PCI-e it will be after ~16 devices.
>
> Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
>
>
That's the question for MST. I think he did only micro benchmarks till
now and he already posted his result here:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So the patch speeds up MMIO by almost 100% and it is almost the same as PIO.

--
Gleb.

2013-04-04 13:06:49

by Alexander Graf

Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:56, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:45, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>>>>
>>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>>>>>
>>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>>>>>
>>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>>>>>> up the eventfd directly, bypassing emulation.
>>>>>>>>>
>>>>>>>>> Add an interface for userspace to specify this per-address, we can
>>>>>>>>> use this e.g. for virtio.
>>>>>>>>>
>>>>>>>>> The implementation adds a separate bus internally. This serves two
>>>>>>>>> purposes:
>>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>>>>>> - minimize disruption in other code (since we don't know the length,
>>>>>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>>>>>> way we don't need to touch all devices to teach them handle
>>>>>>>>> an dinvalid length)
>>>>>>>>>
>>>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>>>>>> slowly.
>>>>>>>>>
>>>>>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>>>>>
>>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>>>>>> pre-review and suggestions.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>>>>>
>>>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>>>>>>>
>>>>>>> It is slower, but not an order of magnitude slower. It become faster
>>>>>>> with newer HW.
>>>>>>>
>>>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>>>>>>>
>>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
>>>>>>> but not guest visible
>>>>>>
>>>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
>>>>>>
>>>>> QEMU sets it.
>>>>
>>>> How does QEMU know?
>>>>
>>> Knows what? When to create such eventfd? virtio device knows.
>>
>> Where does it know from?
>>
> It does it always.
>
>>>
>>>>>
>>>>>> +/*
>>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
>>>>>> + * are writes of specified length, starting at the specified address.
>>>>>> + * If not - it's a Guest bug.
>>>>>> + * Can not be used together with either PIO or DATAMATCH.
>>>>>> + */
>>>>>>
>>>>> Virtio spec will state that access to a kick register needs to be of
>>>>> specific length. This is reasonable thing for HW to ask.
>>>>
>>>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
>>>>
>>> There is not virtio spec that has kick register in MMIO. The spec is in
>>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
>>
>> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
>>
> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> not configured. All driver will not work for virtio devices with MMIO
> bar, but not PIO bar.

I can't parse that, sorry :).

>
>>> is to move to MMIO only when PIO address space is exhausted. For PCI it
>>> will be never, for PCI-e it will be after ~16 devices.
>>
>> Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
>>
>>
> That's the question for MST. I think he did only micro benchmarks till
> now and he already posted his result here:
>
> mmio-wildcard-eventfd:pci-mem 3529
> mmio-pv-eventfd:pci-mem 1878
> portio-wildcard-eventfd:pci-io 1846
>
> So the patch speedup mmio by almost 100% and it is almost the same as PIO.

Those numbers don't align at all with what I measured.

MST, could you please do a real world latency benchmark with virtio-net and

* normal ioeventfd
* mmio-pv eventfd
* hcall eventfd

to give us some idea of how much performance we would gain from each approach? Throughput should be completely unaffected anyway, since virtio just coalesces kicks internally.

I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower here, while it was only a few percent slower on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?


Alex

2013-04-04 13:15:59

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:56, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >>>>
> >>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>>>>
> >>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>>>>
> >>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>>>>
> >>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>>>>
> >>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>>>>
> >>>>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>>>> use this e.g. for virtio.
> >>>>>>>>>
> >>>>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>>>> purposes:
> >>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>>>> an dinvalid length)
> >>>>>>>>>
> >>>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>>>> slowly.
> >>>>>>>>>
> >>>>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>>>>
> >>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>>>>>> pre-review and suggestions.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>>>>>
> >>>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>>>>>
> >>>>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>>>> with newer HW.
> >>>>>>>
> >>>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>>>>>
> >>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>>>> but not guest visible
> >>>>>>
> >>>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> >>>>>>
> >>>>> QEMU sets it.
> >>>>
> >>>> How does QEMU know?
> >>>>
> >>> Knows what? When to create such eventfd? virtio device knows.
> >>
> >> Where does it know from?
> >>
> > It does it always.
> >
> >>>
> >>>>>
> >>>>>> +/*
> >>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >>>>>> + * are writes of specified length, starting at the specified address.
> >>>>>> + * If not - it's a Guest bug.
> >>>>>> + * Can not be used together with either PIO or DATAMATCH.
> >>>>>> + */
> >>>>>>
> >>>>> Virtio spec will state that access to a kick register needs to be of
> >>>>> specific length. This is reasonable thing for HW to ask.
> >>>>
> >>>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
> >>>>
> >>> There is not virtio spec that has kick register in MMIO. The spec is in
> >>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> >>
> >> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
> >>
> > Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> > not configured. All driver will not work for virtio devices with MMIO
> > bar, but not PIO bar.
>
> I can't parse that, sorry :).
>
I am sure MST can explain it better, but I'll try one more time.
The device will have two BARs containing the kick register: one is PIO, the
other is MMIO. An old driver works only with PIO; a new one supports both.
MMIO is used only when PIO space is exhausted. So an old driver will not be
able to drive a new virtio device that has no PIO BAR configured.

> >
> >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> >>> will be never, for PCI-e it will be after ~16 devices.
> >>
> >> Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
> >>
> >>
> > That's the question for MST. I think he did only micro benchmarks till
> > now and he already posted his result here:
> >
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> >
> > So the patch speedup mmio by almost 100% and it is almost the same as PIO.
>
> Those numbers don't align at all with what I measured.
I am trying to run the vmexit test on AMD now, but something does not work
there. Next week I'll fix it and see how AMD differs, but on Intel those are
the numbers.

>
> MST, could you please do a real world latency benchmark with virtio-net and
>
> * normal ioeventfd
> * mmio-pv eventfd
> * hcall eventfd
>
> to give us some idea how much performance we would gain from each approach? Thoughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
>
> I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower, while it was only a few percent on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?
>
>
> Alex

--
Gleb.

2013-04-04 13:58:16

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>>
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>>
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>>
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>>
> >>> TODO: NPT, MMU and non x86 architectures.
> >>>
> >>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>
> >> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> >
> >> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible). We haven't replaced PIO with hypercall for the
> > same reason. My hope is that future HW will provide us with instruction
> > decode for basic mov instruction at which point this optimisation can be
> > dropped.
>
> The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, we can drop the capability we expose to the guest.

Yes, but unlike a hypercall, this optimization does not need special code
in the guest. You can use standard OS interfaces for memory access.

> > And hypercall has its own set of problems with Windows guests.
> > When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> > hypercalls. Mixing KVM hypercalls and Hyper-V requires some tricks. It
>
> Can't we simply register a hypercall ID range with Microsoft?
>
> > may also affect WHQLing Windows drivers since driver will talk to HW
> > bypassing Windows interfaces.
>
> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
>
>
> Alex


--
MST

2013-04-04 14:03:05

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 14:58, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>
>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>> up the eventfd directly, bypassing emulation.
>>>>>
>>>>> Add an interface for userspace to specify this per-address, we can
>>>>> use this e.g. for virtio.
>>>>>
>>>>> The implementation adds a separate bus internally. This serves two
>>>>> purposes:
>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>> - minimize disruption in other code (since we don't know the length,
>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>> way we don't need to touch all devices to teach them handle
>>>>> an dinvalid length)
>>>>>
>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>> slowly.
>>>>>
>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>
>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>> pre-review and suggestions.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>
>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>>>
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>>
>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>>>
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible). We haven't replaced PIO with hypercall for the
>>> same reason. My hope is that future HW will provide us with instruction
>>> decode for basic mov instruction at which point this optimisation can be
>>> dropped.
>>
>> The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, we can drop the capability we expose to the guest.
>
> Yes but unlike a hypercall this optimization does not need special code
> in the guest. You can use standard OS interfaces for memory access.

Yes, but let's try to understand the room for optimization we're talking about here. The "normal" MMIO case looks far slower in MST's benchmarks than it did in mine. So maybe we're really just looking at a bug here.

Also, if hcalls again take only 50% of the time of a fast MMIO callback, it's certainly worth checking how much room for improvement we're really wasting.


Alex

2013-04-04 14:03:18

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>>
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>>
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>>
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>>
> >>> TODO: NPT, MMU and non x86 architectures.
> >>>
> >>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>
> >> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> >
> >> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
>
> Also, how is this not guest visible?

It's visible, but it does not require special code in the guest, only in
QEMU. The guest can use the standard iowrite32/16/8 accessors.

> Who sets
> KVM_IOEVENTFD_FLAG_PV_MMIO?

It's set via an ioctl, so QEMU does this.

> The comment above its definition indicates
> that the guest does so, so it is guest visible.
>
> +/*
> + * PV_MMIO - Guest can promise us that all accesses touching this address
> + * are writes of specified length, starting at the specified address.
> + * If not - it's a Guest bug.
> + * Can not be used together with either PIO or DATAMATCH.
> + */
>
>
> Alex

The requirement is to only access a specific address with a single aligned
instruction. For example, only aligned single-word accesses are allowed.

This is standard practice with many devices; in fact, the virtio spec
already requires this for notifications.
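
To make the constraint concrete, here is a minimal guest-side sketch (not
from the patch or this thread) of such a notification; vq_kick and
notify_base are hypothetical names, and the only point illustrated is the
single aligned, fixed-width write the PV_MMIO contract assumes:

#include <linux/io.h>
#include <linux/types.h>

/*
 * Hypothetical kick helper: one aligned 16-bit store to the device's
 * notification address, with no read-modify-write and no split access.
 * notify_base is assumed to be an ioremap()ed pointer into the MMIO BAR.
 */
static void vq_kick(void __iomem *notify_base, u16 queue_index)
{
        iowrite16(queue_index, notify_base);
}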


--
MST

2013-04-04 14:06:25

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>
> >>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>> up the eventfd directly, bypassing emulation.
> >>>>>
> >>>>> Add an interface for userspace to specify this per-address, we can
> >>>>> use this e.g. for virtio.
> >>>>>
> >>>>> The implementation adds a separate bus internally. This serves two
> >>>>> purposes:
> >>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>> - minimize disruption in other code (since we don't know the length,
> >>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>> way we don't need to touch all devices to teach them handle
> >>>>> an dinvalid length)
> >>>>>
> >>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>> slowly.
> >>>>>
> >>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>
> >>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>> pre-review and suggestions.
> >>>>>
> >>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>
> >>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>>
> >>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >>
> >> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> >>
> > QEMU sets it.
>
> How does QEMU know?
>
> >
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >>
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
>
> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
>
>
> Alex

Using MMIO is a spec change, yes.
What Gleb is referring to is that the optimization itself is not guest
visible: if you have hardware that can quickly decode MMIO, you can
drop the PV ioctl from QEMU without changing guests.


--
MST

2013-04-04 14:12:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 03:56:49PM +0300, Gleb Natapov wrote:
> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >
> > On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >
> > > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> > >>
> > >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> > >>
> > >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> > >>>>
> > >>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> > >>>>
> > >>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> > >>>>>>
> > >>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> > >>>>>>
> > >>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> > >>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> > >>>>>>> know the address from the VMCS so if the address is unique, we can look
> > >>>>>>> up the eventfd directly, bypassing emulation.
> > >>>>>>>
> > >>>>>>> Add an interface for userspace to specify this per-address, we can
> > >>>>>>> use this e.g. for virtio.
> > >>>>>>>
> > >>>>>>> The implementation adds a separate bus internally. This serves two
> > >>>>>>> purposes:
> > >>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> > >>>>>>> - minimize disruption in other code (since we don't know the length,
> > >>>>>>> devices on the MMIO bus only get a valid address in write, this
> > >>>>>>> way we don't need to touch all devices to teach them handle
> > >>>>>>> an dinvalid length)
> > >>>>>>>
> > >>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> > >>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> > >>>>>>> slowly.
> > >>>>>>>
> > >>>>>>> TODO: NPT, MMU and non x86 architectures.
> > >>>>>>>
> > >>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> > >>>>>>> pre-review and suggestions.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> > >>>>>>
> > >>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> > >>>>>>
> > >>>>> It is slower, but not an order of magnitude slower. It become faster
> > >>>>> with newer HW.
> > >>>>>
> > >>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> > >>>>>>
> > >>>>> We are trying to avoid PV as much as possible (well this is also PV,
> > >>>>> but not guest visible
> > >>>>
> > >>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> > >>>>
> > >>> QEMU sets it.
> > >>
> > >> How does QEMU know?
> > >>
> > > Knows what? When to create such eventfd? virtio device knows.
> >
> > Where does it know from?
> >
> It does it always.
>
> > >
> > >>>
> > >>>> +/*
> > >>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
> > >>>> + * are writes of specified length, starting at the specified address.
> > >>>> + * If not - it's a Guest bug.
> > >>>> + * Can not be used together with either PIO or DATAMATCH.
> > >>>> + */
> > >>>>
> > >>> Virtio spec will state that access to a kick register needs to be of
> > >>> specific length. This is reasonable thing for HW to ask.
> > >>
> > >> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
> > >>
> > > There is not virtio spec that has kick register in MMIO. The spec is in
> > > the works AFAIK. Actually PIO will not be deprecated and my suggestion
> >
> > So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
> >
> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> not configured. All driver will not work for virtio devices with MMIO
> bar, but not PIO bar.
>
> > > is to move to MMIO only when PIO address space is exhausted. For PCI it
> > > will be never, for PCI-e it will be after ~16 devices.
> >
> > Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
> >
> >
> That's the question for MST. I think he did only micro benchmarks till
> now and he already posted his result here:
>
> mmio-wildcard-eventfd:pci-mem 3529
> mmio-pv-eventfd:pci-mem 1878
> portio-wildcard-eventfd:pci-io 1846
>
> So the patch speedup mmio by almost 100% and it is almost the same as PIO.

Exactly. I sent patches for kvm unittest so you can try it yourself.
At the moment you need a box with EPT to try this.

> --
> Gleb.

2013-04-04 14:34:05

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:56, Gleb Natapov wrote:
>
> > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >>>>
> >>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>>>>
> >>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>>>>
> >>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>>>>
> >>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>>>>
> >>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>>>>
> >>>>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>>>> use this e.g. for virtio.
> >>>>>>>>>
> >>>>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>>>> purposes:
> >>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>>>> an dinvalid length)
> >>>>>>>>>
> >>>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>>>> slowly.
> >>>>>>>>>
> >>>>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>>>>
> >>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>>>>>> pre-review and suggestions.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>>>>>
> >>>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>>>>>
> >>>>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>>>> with newer HW.
> >>>>>>>
> >>>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>>>>>
> >>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>>>> but not guest visible
> >>>>>>
> >>>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> >>>>>>
> >>>>> QEMU sets it.
> >>>>
> >>>> How does QEMU know?
> >>>>
> >>> Knows what? When to create such eventfd? virtio device knows.
> >>
> >> Where does it know from?
> >>
> > It does it always.
> >
> >>>
> >>>>>
> >>>>>> +/*
> >>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >>>>>> + * are writes of specified length, starting at the specified address.
> >>>>>> + * If not - it's a Guest bug.
> >>>>>> + * Can not be used together with either PIO or DATAMATCH.
> >>>>>> + */
> >>>>>>
> >>>>> Virtio spec will state that access to a kick register needs to be of
> >>>>> specific length. This is reasonable thing for HW to ask.
> >>>>
> >>>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
> >>>>
> >>> There is not virtio spec that has kick register in MMIO. The spec is in
> >>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> >>
> >> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
> >>
> > Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> > not configured. All driver will not work for virtio devices with MMIO
> > bar, but not PIO bar.
>
> I can't parse that, sorry :).

It's simple. The driver does iowrite16 or whatever is appropriate for the OS.
QEMU tells KVM which address the driver uses, to make exits faster. This is no
different from how eventfd works today. For example, if exits to QEMU suddenly
became very cheap we could remove eventfd completely.
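
As a rough sketch (not part of the posted patches), userspace could register
the kick address like this, assuming the RFC's KVM_IOEVENTFD_FLAG_PV_MMIO is
available in the headers and vm_fd is an already-created KVM VM file
descriptor; register_pv_mmio_kick, notify_gpa and notify_len are hypothetical
names:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Register an eventfd that KVM signals on guest writes to notify_gpa. */
static int register_pv_mmio_kick(int vm_fd, uint64_t notify_gpa,
                                 uint32_t notify_len)
{
        struct kvm_ioeventfd args;
        int efd = eventfd(0, 0);

        if (efd < 0)
                return -1;

        memset(&args, 0, sizeof(args));
        args.addr  = notify_gpa;                 /* guest-physical kick address */
        args.len   = notify_len;                 /* e.g. 2 for a 16-bit kick register */
        args.fd    = efd;
        args.flags = KVM_IOEVENTFD_FLAG_PV_MMIO; /* RFC flag: no PIO, no DATAMATCH */

        if (ioctl(vm_fd, KVM_IOEVENTFD, &args) < 0) {
                close(efd);
                return -1;
        }
        return efd;     /* poll/read this fd to receive kicks */
}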

> >
> >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> >>> will be never, for PCI-e it will be after ~16 devices.
> >>
> >> Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
> >>
> >>
> > That's the question for MST. I think he did only micro benchmarks till
> > now and he already posted his result here:
> >
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> >
> > So the patch speedup mmio by almost 100% and it is almost the same as PIO.
>
> Those numbers don't align at all with what I measured.

Yep. But why?
Could be different hardware. My laptop is an i7; what did you measure on?
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
stepping : 7
microcode : 0x28
cpu MHz : 2801.000
cache size : 4096 KB

Or it could be different software; this is on top of 3.9.0-rc5. What
did you try?

> MST, could you please do a real world latency benchmark with virtio-net and
>
> * normal ioeventfd
> * mmio-pv eventfd
> * hcall eventfd

I can't do this right away, sorry. For MMIO we are discussing the new
layout on the virtio mailing list, and both the guest and QEMU need a patch
for this too. My hcall patches are stale and would have to be brought up to
date.


> to give us some idea how much performance we would gain from each approach? Thoughput should be completely unaffected anyway, since virtio just coalesces kicks internally.

Latency is dominated by the scheduling latency.
This means virtio-net is not the best benchmark.

> I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower, while it was only a few percent on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?
>
>
> Alex


It's the TSC delta divided by the number of iterations. kvm unittest reports
this value; here's what it does (I removed some dead code):

#define GOAL (1ull << 30)

do {
        iterations *= 2;
        t1 = rdtsc();

        for (i = 0; i < iterations; ++i)
                func();
        t2 = rdtsc();
} while ((t2 - t1) < GOAL);
printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));


--
MST

2013-04-04 14:41:02

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 04:02:57PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 14:58, Michael S. Tsirkin wrote:
>
> > On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>
> >>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>> up the eventfd directly, bypassing emulation.
> >>>>>
> >>>>> Add an interface for userspace to specify this per-address, we can
> >>>>> use this e.g. for virtio.
> >>>>>
> >>>>> The implementation adds a separate bus internally. This serves two
> >>>>> purposes:
> >>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>> - minimize disruption in other code (since we don't know the length,
> >>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>> way we don't need to touch all devices to teach them handle
> >>>>> an dinvalid length)
> >>>>>
> >>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>> slowly.
> >>>>>
> >>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>
> >>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>> pre-review and suggestions.
> >>>>>
> >>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>
> >>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>>
> >>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible). We haven't replaced PIO with hypercall for the
> >>> same reason. My hope is that future HW will provide us with instruction
> >>> decode for basic mov instruction at which point this optimisation can be
> >>> dropped.
> >>
> >> The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, we can drop the capability we expose to the guest.
> >
> > Yes but unlike a hypercall this optimization does not need special code
> > in the guest. You can use standard OS interfaces for memory access.
>
> Yes, but let's try to understand the room for optimization we're
> talking about here. The "normal" MMIO case seems excessively slower in
> MST's benchmarks than it did in mine. So maybe we're really just
> looking at a bug here.

Could be. I posted the code (kvm, qemu and the test), so please review it
and try to spot a bug.

> Also, if hcalls are again only 50% of a fast MMIO callback, it's
> certainly worth checking out what room for improvement we're really
> wasting.
>
>
> Alex

Take a look at 'kvm: pci PORT IO MMIO and PV MMIO speed tests'.
Try running the test on your hardware and see what happens.
Or post the test you used; I can try it on my box if you like.

--
MST

2013-04-04 15:27:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 04:14:57PM +0300, Gleb Natapov wrote:
> On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> >
> > On 04.04.2013, at 14:56, Gleb Natapov wrote:
> >
> > > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> > >>
> > >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> > >>
> > >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> > >>>>
> > >>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> > >>>>
> > >>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> > >>>>>>
> > >>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> > >>>>>>
> > >>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> > >>>>>>>>
> > >>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> > >>>>>>>>
> > >>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> > >>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> > >>>>>>>>> know the address from the VMCS so if the address is unique, we can look
> > >>>>>>>>> up the eventfd directly, bypassing emulation.
> > >>>>>>>>>
> > >>>>>>>>> Add an interface for userspace to specify this per-address, we can
> > >>>>>>>>> use this e.g. for virtio.
> > >>>>>>>>>
> > >>>>>>>>> The implementation adds a separate bus internally. This serves two
> > >>>>>>>>> purposes:
> > >>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> > >>>>>>>>> - minimize disruption in other code (since we don't know the length,
> > >>>>>>>>> devices on the MMIO bus only get a valid address in write, this
> > >>>>>>>>> way we don't need to touch all devices to teach them handle
> > >>>>>>>>> an dinvalid length)
> > >>>>>>>>>
> > >>>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> > >>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> > >>>>>>>>> slowly.
> > >>>>>>>>>
> > >>>>>>>>> TODO: NPT, MMU and non x86 architectures.
> > >>>>>>>>>
> > >>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> > >>>>>>>>> pre-review and suggestions.
> > >>>>>>>>>
> > >>>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> > >>>>>>>>
> > >>>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> > >>>>>>>>
> > >>>>>>> It is slower, but not an order of magnitude slower. It become faster
> > >>>>>>> with newer HW.
> > >>>>>>>
> > >>>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> > >>>>>>>>
> > >>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
> > >>>>>>> but not guest visible
> > >>>>>>
> > >>>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> > >>>>>>
> > >>>>> QEMU sets it.
> > >>>>
> > >>>> How does QEMU know?
> > >>>>
> > >>> Knows what? When to create such eventfd? virtio device knows.
> > >>
> > >> Where does it know from?
> > >>
> > > It does it always.
> > >
> > >>>
> > >>>>>
> > >>>>>> +/*
> > >>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
> > >>>>>> + * are writes of specified length, starting at the specified address.
> > >>>>>> + * If not - it's a Guest bug.
> > >>>>>> + * Can not be used together with either PIO or DATAMATCH.
> > >>>>>> + */
> > >>>>>>
> > >>>>> Virtio spec will state that access to a kick register needs to be of
> > >>>>> specific length. This is reasonable thing for HW to ask.
> > >>>>
> > >>>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
> > >>>>
> > >>> There is not virtio spec that has kick register in MMIO. The spec is in
> > >>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> > >>
> > >> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
> > >>
> > > Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> > > not configured. All driver will not work for virtio devices with MMIO
> > > bar, but not PIO bar.
> >
> > I can't parse that, sorry :).
> >
> I am sure MST can explain it better, but I'll try one more time.
> Device will have two BARs with kick register one is PIO another is MMIO.
> Old driver works only with PIO new one support both. MMIO is used only
> when PIO space is exhausted. So old driver will not be able to drive new
> virtio device that have no PIO bar configured.

Right, I think this was the latest proposal by Rusty.

The discussion about the new layout is taking place on the virtio mailing list.
See thread 'virtio_pci: use separate notification offsets for each vq'
started by Rusty.


> > >
> > >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> > >>> will be never, for PCI-e it will be after ~16 devices.
> > >>
> > >> Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
> > >>
> > >>
> > > That's the question for MST. I think he did only micro benchmarks till
> > > now and he already posted his result here:
> > >
> > > mmio-wildcard-eventfd:pci-mem 3529
> > > mmio-pv-eventfd:pci-mem 1878
> > > portio-wildcard-eventfd:pci-io 1846
> > >
> > > So the patch speedup mmio by almost 100% and it is almost the same as PIO.
> >
> > Those numbers don't align at all with what I measured.
> I am trying to run vmexit test on AMD now, but something does not work
> there. Next week I'll fix it and see how AMD differs, bit on Intel those are the
> numbers.

Right. Also, next week I need to implement the optimization for NPT.

> >
> > MST, could you please do a real world latency benchmark with virtio-net and
> >
> > * normal ioeventfd
> > * mmio-pv eventfd
> > * hcall eventfd
> >
> > to give us some idea how much performance we would gain from each approach? Thoughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
> >
> > I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower, while it was only a few percent on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?
> >
> >
> > Alex
>
> --
> Gleb.

2013-04-04 15:36:47

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD


On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
>>
>> On 04.04.2013, at 14:56, Gleb Natapov wrote:
>>
>>> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
>>>>
>>>> On 04.04.2013, at 14:45, Gleb Natapov wrote:
>>>>
>>>>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>>>>>>
>>>>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>>>>>>>>
>>>>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>>>>>>>>>>
>>>>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
>>>>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
>>>>>>>>>>> know the address from the VMCS so if the address is unique, we can look
>>>>>>>>>>> up the eventfd directly, bypassing emulation.
>>>>>>>>>>>
>>>>>>>>>>> Add an interface for userspace to specify this per-address, we can
>>>>>>>>>>> use this e.g. for virtio.
>>>>>>>>>>>
>>>>>>>>>>> The implementation adds a separate bus internally. This serves two
>>>>>>>>>>> purposes:
>>>>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
>>>>>>>>>>> - minimize disruption in other code (since we don't know the length,
>>>>>>>>>>> devices on the MMIO bus only get a valid address in write, this
>>>>>>>>>>> way we don't need to touch all devices to teach them handle
>>>>>>>>>>> an dinvalid length)
>>>>>>>>>>>
>>>>>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
>>>>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
>>>>>>>>>>> slowly.
>>>>>>>>>>>
>>>>>>>>>>> TODO: NPT, MMU and non x86 architectures.
>>>>>>>>>>>
>>>>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
>>>>>>>>>>> pre-review and suggestions.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>>>>>>>>>>
>>>>>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
>>>>>>>>>>
>>>>>>>>> It is slower, but not an order of magnitude slower. It become faster
>>>>>>>>> with newer HW.
>>>>>>>>>
>>>>>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
>>>>>>>>>>
>>>>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
>>>>>>>>> but not guest visible
>>>>>>>>
>>>>>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
>>>>>>>>
>>>>>>> QEMU sets it.
>>>>>>
>>>>>> How does QEMU know?
>>>>>>
>>>>> Knows what? When to create such eventfd? virtio device knows.
>>>>
>>>> Where does it know from?
>>>>
>>> It does it always.
>>>
>>>>>
>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
>>>>>>>> + * are writes of specified length, starting at the specified address.
>>>>>>>> + * If not - it's a Guest bug.
>>>>>>>> + * Can not be used together with either PIO or DATAMATCH.
>>>>>>>> + */
>>>>>>>>
>>>>>>> Virtio spec will state that access to a kick register needs to be of
>>>>>>> specific length. This is reasonable thing for HW to ask.
>>>>>>
>>>>>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
>>>>>>
>>>>> There is not virtio spec that has kick register in MMIO. The spec is in
>>>>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
>>>>
>>>> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
>>>>
>>> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
>>> not configured. All driver will not work for virtio devices with MMIO
>>> bar, but not PIO bar.
>>
>> I can't parse that, sorry :).
>
> It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
> QEMU tells KVM which address driver uses, to make exits faster. This is not
> different from how eventfd works. For example if exits to QEMU suddenly become
> very cheap we can remove eventfd completely.
>
>>>
>>>>> is to move to MMIO only when PIO address space is exhausted. For PCI it
>>>>> will be never, for PCI-e it will be after ~16 devices.
>>>>
>>>> Ok, let's go back a step here. Are you actually able to measure any speed in performance with this patch applied and without when going through MMIO kicks?
>>>>
>>>>
>>> That's the question for MST. I think he did only micro benchmarks till
>>> now and he already posted his result here:
>>>
>>> mmio-wildcard-eventfd:pci-mem 3529
>>> mmio-pv-eventfd:pci-mem 1878
>>> portio-wildcard-eventfd:pci-io 1846
>>>
>>> So the patch speedup mmio by almost 100% and it is almost the same as PIO.
>>
>> Those numbers don't align at all with what I measured.
>
> Yep. But why?
> Could be a different hardware. My laptop is i7, what did you measure on?
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 42
> model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
> stepping : 7
> microcode : 0x28
> cpu MHz : 2801.000
> cache size : 4096 KB

processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 8
model name : Six-Core AMD Opteron(tm) Processor 8435
stepping : 0
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 8
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save pausefilter
bogomips : 5199.87
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

>
> Or could be different software, this is on top of 3.9.0-rc5, what
> did you try?

3.0 plus kvm-kmod of whatever was current back in autumn :).

>
>> MST, could you please do a real world latency benchmark with virtio-net and
>>
>> * normal ioeventfd
>> * mmio-pv eventfd
>> * hcall eventfd
>
> I can't do this right away, sorry. For MMIO we are discussing the new
> layout on the virtio mailing list, guest and qemu need a patch for this
> too. My hcall patches are stale and would have to be brought up to
> date.
>
>
>> to give us some idea how much performance we would gain from each approach? Thoughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
>
> Latency is dominated by the scheduling latency.
> This means virtio-net is not the best benchmark.

So what is a good benchmark? Is there any difference in speed at all? I strongly doubt it. One of virtio's main points is to reduce the number of kicks.

>
>> I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower, while it was only a few percent on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?
>>
>>
>> Alex
>
>
> It's the TSC divided by number of iterations. kvm unittest this value, here's
> what it does (removed some dead code):
>
> #define GOAL (1ull << 30)
>
> do {
> iterations *= 2;
> t1 = rdtsc();
>
> for (i = 0; i < iterations; ++i)
> func();
> t2 = rdtsc();
> } while ((t2 - t1) < GOAL);
> printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));

So it's the number of cycles per run.

That means that, translated, my numbers are:

MMIO: 4307
PIO: 3658
HCALL: 1756

MMIO - PIO = 649

which aligns roughly with your PV MMIO callback.

My MMIO benchmark was to poke the LAPIC version register. That does go through instruction emulation, no?


Alex

2013-04-04 16:34:43

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
> >
> > #define GOAL (1ull << 30)
> >
> > do {
> > iterations *= 2;
> > t1 = rdtsc();
> >
> > for (i = 0; i < iterations; ++i)
> > func();
> > t2 = rdtsc();
> > } while ((t2 - t1) < GOAL);
> > printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
>
> So it's the number of cycles per run.
>
> That means translated my numbers are:
>
> MMIO: 4307
> PIO: 3658
> HCALL: 1756
>
> MMIO - PIO = 649
>
> which aligns roughly with your PV MMIO callback.
>
> My MMIO benchmark was to poke the LAPIC version register. That does go through instruction emulation, no?
>
It should, and PIO 'in' or string I/O also goes through the emulator.

--
Gleb.

2013-04-04 16:36:32

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
>
> On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:
>
> > On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> >>
> >> On 04.04.2013, at 14:56, Gleb Natapov wrote:
> >>
> >>> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >>>>
> >>>> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >>>>
> >>>>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >>>>>>
> >>>>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >>>>>>
> >>>>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>>>>>>
> >>>>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>>>>>>
> >>>>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>>>>>>
> >>>>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>>>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>>>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>>>>>>
> >>>>>>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>>>>>> use this e.g. for virtio.
> >>>>>>>>>>>
> >>>>>>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>>>>>> purposes:
> >>>>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>>>>>> an dinvalid length)
> >>>>>>>>>>>
> >>>>>>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>>>>>> slowly.
> >>>>>>>>>>>
> >>>>>>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>>>>>>
> >>>>>>>>>>> The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
> >>>>>>>>>>> pre-review and suggestions.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <[email protected]>
> >>>>>>>>>>
> >>>>>>>>>> This still uses page fault intercepts which are orders of magnitudes slower than hypercalls. Why don't you just create a PV MMIO hypercall that the guest can use to invoke MMIO accesses towards the host based on physical addresses with explicit length encodings?
> >>>>>>>>>>
> >>>>>>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>>>>>> with newer HW.
> >>>>>>>>>
> >>>>>>>>>> That way you simplify and speed up all code paths, exceeding the speed of PIO exits even. It should also be quite easily portable, as all other platforms have hypercalls available as well.
> >>>>>>>>>>
> >>>>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>>>>>> but not guest visible
> >>>>>>>>
> >>>>>>>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates that the guest does so, so it is guest visible.
> >>>>>>>>
> >>>>>>> QEMU sets it.
> >>>>>>
> >>>>>> How does QEMU know?
> >>>>>>
> >>>>> Knows what? When to create such eventfd? virtio device knows.
> >>>>
> >>>> Where does it know from?
> >>>>
> >>> It does it always.
> >>>
> >>>>>
> >>>>>>>
> >>>>>>>> +/*
> >>>>>>>> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >>>>>>>> + * are writes of specified length, starting at the specified address.
> >>>>>>>> + * If not - it's a Guest bug.
> >>>>>>>> + * Can not be used together with either PIO or DATAMATCH.
> >>>>>>>> + */
> >>>>>>>>
> >>>>>>> Virtio spec will state that access to a kick register needs to be of
> >>>>>>> specific length. This is reasonable thing for HW to ask.
> >>>>>>
> >>>>>> This is a spec change. So the guest would have to indicate that it adheres to a newer spec. Thus it's a guest visible change.
> >>>>>>
> >>>>> There is not virtio spec that has kick register in MMIO. The spec is in
> >>>>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> >>>>
> >>>> So the guest would indicate that it supports a newer revision of the spec (in your case, that it supports MMIO). How is that any different from exposing that it supports a PV MMIO hcall?
> >>>>
> >>> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> >>> not configured. All driver will not work for virtio devices with MMIO
> >>> bar, but not PIO bar.
> >>
> >> I can't parse that, sorry :).
> >
> > It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
> > QEMU tells KVM which address driver uses, to make exits faster. This is not
> > different from how eventfd works. For example if exits to QEMU suddenly become
> > very cheap we can remove eventfd completely.
> >
> >>>
> >>>>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> >>>>> will be never, for PCI-e it will be after ~16 devices.
> >>>>
> >>>> Ok, let's go back a step here. Are you actually able to measure any speedup in performance with this patch applied versus without, when going through MMIO kicks?
> >>>>
> >>>>
> >>> That's the question for MST. I think he did only micro benchmarks till
> >>> now and he already posted his result here:
> >>>
> >>> mmio-wildcard-eventfd:pci-mem 3529
> >>> mmio-pv-eventfd:pci-mem 1878
> >>> portio-wildcard-eventfd:pci-io 1846
> >>>
> >>> So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.
> >>
> >> Those numbers don't align at all with what I measured.
> >
> > Yep. But why?
> > Could be a different hardware. My laptop is i7, what did you measure on?
> > processor : 0
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 42
> > model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
> > stepping : 7
> > microcode : 0x28
> > cpu MHz : 2801.000
> > cache size : 4096 KB
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 8
> model name : Six-Core AMD Opteron(tm) Processor 8435
> stepping : 0
> cpu MHz : 800.000
> cache size : 512 KB
> physical id : 0
> siblings : 6
> core id : 0
> cpu cores : 6
> apicid : 8
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save pausefilter
> bogomips : 5199.87
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate

Hmm, svm code seems less optimized for MMIO, but PIO
is almost identical. Gleb says the unittest is broken
on AMD so I'll wait until it's fixed to test.

Did you do PIO reads by chance?

> >
> > Or could be different software, this is on top of 3.9.0-rc5, what
> > did you try?
>
> 3.0 plus kvm-kmod of whatever was current back in autumn :).
>
> >
> >> MST, could you please do a real world latency benchmark with virtio-net and
> >>
> >> * normal ioeventfd
> >> * mmio-pv eventfd
> >> * hcall eventfd
> >
> > I can't do this right away, sorry. For MMIO we are discussing the new
> > layout on the virtio mailing list, guest and qemu need a patch for this
> > too. My hcall patches are stale and would have to be brought up to
> > date.
> >
> >
> >> to give us some idea how much performance we would gain from each approach? Throughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
> >
> > Latency is dominated by the scheduling latency.
> > This means virtio-net is not the best benchmark.
>
> So what is a good benchmark?

E.g. a ping-pong stress test will do, but you need to look at CPU utilization;
that's what is affected, not latency.

> Is there any difference in speed at all? I strongly doubt it. One of virtio's main points is to reduce the number of kicks.

For this stage of the project I think microbenchmarks are more appropriate.
Doubling the price of an exit is likely to be measurable; 30 cycles likely
not ...

> >
> >> I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower, while it was only a few percent on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?
> >>
> >>
> >> Alex
> >
> >
> > It's the TSC delta divided by the number of iterations. kvm unittest reports this value; here's
> > what it does (removed some dead code):
> >
> > #define GOAL (1ull << 30)
> >
> > do {
> > iterations *= 2;
> > t1 = rdtsc();
> >
> > for (i = 0; i < iterations; ++i)
> > func();
> > t2 = rdtsc();
> > } while ((t2 - t1) < GOAL);
> > printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
>
> So it's the number of cycles per run.
>
> That means translated my numbers are:
>
> MMIO: 4307
> PIO: 3658
> HCALL: 1756
>
> MMIO - PIO = 649
>
> which aligns roughly with your PV MMIO callback.
>
> My MMIO benchmark was to poke the LAPIC version register. That does go through instruction emulation, no?
>
>
> Alex

Why wouldn't it?

--
MST

2013-04-04 16:40:37

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 06:36:30PM +0300, Michael S. Tsirkin wrote:
> > processor : 0
> > vendor_id : AuthenticAMD
> > cpu family : 16
> > model : 8
> > model name : Six-Core AMD Opteron(tm) Processor 8435
> > stepping : 0
> > cpu MHz : 800.000
> > cache size : 512 KB
> > physical id : 0
> > siblings : 6
> > core id : 0
> > cpu cores : 6
> > apicid : 8
> > initial apicid : 0
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 5
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save pausefilter
> > bogomips : 5199.87
> > TLB size : 1024 4K pages
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm stc 100mhzsteps hwpstate
>
> Hmm, svm code seems less optimized for MMIO, but PIO
> is almost identical. Gleb says the unittest is broken
> on AMD so I'll wait until it's fixed to test.
>
It's not that the unittest is broken, it's that my environment is broken :)

> Did you do PIO reads by chance?
>
> > >
> > > Or could be different software, this is on top of 3.9.0-rc5, what
> > > did you try?
> >
> > 3.0 plus kvm-kmod of whatever was current back in autumn :).
> >
> > >
> > >> MST, could you please do a real world latency benchmark with virtio-net and
> > >>
> > >> * normal ioeventfd
> > >> * mmio-pv eventfd
> > >> * hcall eventfd
> > >
> > > I can't do this right away, sorry. For MMIO we are discussing the new
> > > layout on the virtio mailing list, guest and qemu need a patch for this
> > > too. My hcall patches are stale and would have to be brought up to
> > > date.
> > >
> > >
> > >> to give us some idea how much performance we would gain from each approach? Throughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
> > >
> > > Latency is dominated by the scheduling latency.
> > > This means virtio-net is not the best benchmark.
> >
> > So what is a good benchmark?
>
> E.g. a ping-pong stress test will do, but you need to look at CPU utilization;
> that's what is affected, not latency.
>
> > Is there any difference in speed at all? I strongly doubt it. One of virtio's main points is to reduce the number of kicks.
>
> For this stage of the project I think microbenchmarks are more appropriate.
> Doubling the price of an exit is likely to be measurable; 30 cycles likely
> not ...
>
> > >
> > >> I'm also slightly puzzled why the wildcard eventfd mechanism is so significantly slower, while it was only a few percent on my test system. What are the numbers you're listing above? Cycles? How many cycles do you execute in a second?
> > >>
> > >>
> > >> Alex
> > >
> > >
> > > It's the TSC delta divided by the number of iterations. kvm unittest reports this value; here's
> > > what it does (removed some dead code):
> > >
> > > #define GOAL (1ull << 30)
> > >
> > > do {
> > > iterations *= 2;
> > > t1 = rdtsc();
> > >
> > > for (i = 0; i < iterations; ++i)
> > > func();
> > > t2 = rdtsc();
> > > } while ((t2 - t1) < GOAL);
> > > printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> >
> > So it's the number of cycles per run.
> >
> > That means translated my numbers are:
> >
> > MMIO: 4307
> > PIO: 3658
> > HCALL: 1756
> >
> > MMIO - PIO = 649
> >
> > which aligns roughly with your PV MMIO callback.
> >
> > My MMIO benchmark was to poke the LAPIC version register. That does go through instruction emulation, no?
> >
> >
> > Alex
>
> Why wouldn't it?
>
Intel decodes accesses to the APIC page, but we use that decoding only for fast EOI.

--
Gleb.

2013-04-04 23:32:04

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

[...]

>> >> to give us some idea how much performance we would gain from each approach? Throughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
>> >
>> > Latency is dominated by the scheduling latency.
>> > This means virtio-net is not the best benchmark.
>>
>> So what is a good benchmark?
>
> E.g. a ping-pong stress test will do, but you need to look at CPU utilization;
> that's what is affected, not latency.
>
>> Is there any difference in speed at all? I strongly doubt it. One of virtio's main points is to reduce the number of kicks.
>
> For this stage of the project I think microbenchmarks are more appropriate.
> Doubling the price of an exit is likely to be measurable; 30 cycles likely
> not ...
>
I don't quite understand this point here. If we don't have anything
real-world where we can measure a decent difference, then why are we
doing this? I would agree with Alex that the three test scenarios
proposed by him should be tried out before adding this complexity,
measured in CPU utilization or latency as you wish.

FWIW, ARM always uses MMIO and provides hardware decoding of all sane
(i.e. no user register writeback) instructions, but the hypercall vs. mmio
numbers look like this:

hvc: 4,917
mmio_kernel: 6,248

But I doubt that an hvc wrapper around mmio decoding would take care
of all of this difference, because the mmio operation needs to do other
work not related to emulating the instruction in software, work which
you'd have to do for an hvc anyway (populate the kvm_mmio structure etc.)

-Christoffer

2013-04-07 09:26:16

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 04:32:01PM -0700, Christoffer Dall wrote:
> [...]
>
> >> >> to give us some idea how much performance we would gain from each approach? Throughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
> >> >
> >> > Latency is dominated by the scheduling latency.
> >> > This means virtio-net is not the best benchmark.
> >>
> >> So what is a good benchmark?
> >
> > E.g. a ping-pong stress test will do, but you need to look at CPU utilization;
> > that's what is affected, not latency.
> >
> >> Is there any difference in speed at all? I strongly doubt it. One of virtio's main points is to reduce the number of kicks.
> >
> > For this stage of the project I think microbenchmarks are more appropriate.
> > Doubling the price of an exit is likely to be measurable; 30 cycles likely
> > not ...
> >
> I don't quite understand this point here. If we don't have anything
> real-world where we can measure a decent difference, then why are we
> doing this? I would agree with Alex that the three test scenarios
> proposed by him should be tried out before adding this complexity,
> measured in CPU utilization or latency as you wish.

Sure, plan to do real world benchmarks for PV MMIO versus PIO as well.
I don't see why I should bother implementing hypercalls given that the
kvm maintainer says they won't be merged.

> FWIW, ARM always uses MMIO and provides hardware decoding of all sane
> (i.e. no user register writeback) instructions, but the hypercall vs. mmio
> numbers look like this:
>
> hvc: 4,917
> mmio_kernel: 6,248

So 20% difference? That's not far from what happens on my intel laptop:
vmcall 1519
outl_to_kernel 1745
10% difference here.
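
For reference, the two micro-tests compared here are roughly of the following
shape. This is a sketch only: TEST_PORT is a placeholder and the register
handling may differ from the real vmexit test.

#define TEST_PORT 0x5000		/* placeholder port, not from the thread */

static inline void outl(unsigned short port, unsigned int val)
{
	asm volatile("outl %0, %1" : : "a"(val), "Nd"(port));
}

/* vmcall: one hypercall exit; KVM just rejects an unknown hypercall number. */
static void vmcall_exit(void)
{
	unsigned long nr = -1ul;

	asm volatile("vmcall" : "+a"(nr) : : "memory");
}

/* outl_to_kernel: one PIO write exit, the kind of access used for kicks. */
static void outl_exit(void)
{
	outl(TEST_PORT, 0);
}

Both are fed to the same rdtsc loop, so the 1519 vs 1745 above is essentially
the difference in the exit/entry and dispatch paths.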

>
> But I doubt that an hvc wrapper around mmio decoding would take care
> of all of this difference, because the mmio operation needs to do other
> work not related to emulating the instruction in software, work which
> you'd have to do for an hvc anyway (populate the kvm_mmio structure etc.)
>
> -Christoffer

Instead of speculating, someone with relevant hardware
could just try this, but kvm unittest doesn't seem to have arm support
at the moment. Anyone working on this?

--
MST

2013-04-07 09:31:36

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Thu, Apr 04, 2013 at 04:14:57PM +0300, Gleb Natapov wrote:
> > >
> > >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> > >>> will be never, for PCI-e it will be after ~16 devices.
> > >>
> > >> Ok, let's go back a step here. Are you actually able to measure any speedup in performance with this patch applied versus without, when going through MMIO kicks?
> > >>
> > >>
> > > That's the question for MST. I think he did only micro benchmarks till
> > > now and he already posted his result here:
> > >
> > > mmio-wildcard-eventfd:pci-mem 3529
> > > mmio-pv-eventfd:pci-mem 1878
> > > portio-wildcard-eventfd:pci-io 1846
> > >
> > > So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.
> >
> > Those numbers don't align at all with what I measured.
> I am trying to run the vmexit test on AMD now, but something does not work
> there. Next week I'll fix it and see how AMD differs, but on Intel those are the
> numbers.
>
The numbers are:
vmcall 1921
inl_from_kernel 4227
outl_to_kernel 2345

outl is specifically optimized to not go through the emulator since it
is used for virtio kick. mmio-pv-eventfd is the same kind of
optimization but for mmio.
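
To make the comparison concrete: both kinds of kick eventfd are registered from
userspace with the same KVM_IOEVENTFD ioctl, only the flags differ. Below is a
minimal sketch; the addresses and the 2-byte length are placeholders, error
handling is omitted, and KVM_IOEVENTFD_FLAG_PV_MMIO only exists with this RFC
applied.

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder port / guest-physical addresses for a virtio notify register. */
#define NOTIFY_PIO_ADDR		0xc050
#define NOTIFY_MMIO_ADDR	0xfe001000ULL

/* Wildcard PIO kick: any write to the port signals the eventfd. */
static int assign_pio_kick(int vm_fd)
{
	struct kvm_ioeventfd io = {
		.addr	= NOTIFY_PIO_ADDR,
		.len	= 2,			/* 16-bit queue index write */
		.fd	= eventfd(0, 0),
		.flags	= KVM_IOEVENTFD_FLAG_PIO,
	};

	return ioctl(vm_fd, KVM_IOEVENTFD, &io);
}

/*
 * PV MMIO kick as proposed in this RFC: userspace promises the guest only
 * ever does plain writes to this address, so the EPT misconfig handler can
 * signal the eventfd directly instead of going through the emulator.
 */
static int assign_pv_mmio_kick(int vm_fd)
{
	struct kvm_ioeventfd io = {
		.addr	= NOTIFY_MMIO_ADDR,
		.len	= 2,
		.fd	= eventfd(0, 0),
		.flags	= KVM_IOEVENTFD_FLAG_PV_MMIO,
	};

	return ioctl(vm_fd, KVM_IOEVENTFD, &io);
}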

--
Gleb.

2013-04-07 09:44:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Sun, Apr 07, 2013 at 12:30:38PM +0300, Gleb Natapov wrote:
> On Thu, Apr 04, 2013 at 04:14:57PM +0300, Gleb Natapov wrote:
> > > >
> > > >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> > > >>> will be never, for PCI-e it will be after ~16 devices.
> > > >>
> > > >> Ok, let's go back a step here. Are you actually able to measure any speedup in performance with this patch applied versus without, when going through MMIO kicks?
> > > >>
> > > >>
> > > > That's the question for MST. I think he did only micro benchmarks till
> > > > now and he already posted his result here:
> > > >
> > > > mmio-wildcard-eventfd:pci-mem 3529
> > > > mmio-pv-eventfd:pci-mem 1878
> > > > portio-wildcard-eventfd:pci-io 1846
> > > >
> > > > So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.
> > >
> > > Those numbers don't align at all with what I measured.
> > I am trying to run the vmexit test on AMD now, but something does not work
> > there. Next week I'll fix it and see how AMD differs, but on Intel those are the
> > numbers.
> >
> The numbers are:
> vmcall 1921
> inl_from_kernel 4227
> outl_to_kernel 2345
>
> outl is specifically optimized to not go through the emulator since it
> is used for virtio kick. mmio-pv-eventfd is the same kind of
> optimization but for mmio.
>
> --
> Gleb.


Hmm so on AMD it's more like 20% overhead, like ARM.

--
MST

2013-04-07 21:25:21

by Christoffer Dall

[permalink] [raw]
Subject: Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

On Sun, Apr 7, 2013 at 12:41 AM, Michael S. Tsirkin <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 04:32:01PM -0700, Christoffer Dall wrote:
>> [...]
>>
>> >> >> to give us some idea how much performance we would gain from each approach? Throughput should be completely unaffected anyway, since virtio just coalesces kicks internally.
>> >> >
>> >> > Latency is dominated by the scheduling latency.
>> >> > This means virtio-net is not the best benchmark.
>> >>
>> >> So what is a good benchmark?
>> >
>> > E.g. a ping-pong stress test will do, but you need to look at CPU utilization;
>> > that's what is affected, not latency.
>> >
>> >> Is there any difference in speed at all? I strongly doubt it. One of virtio's main points is to reduce the number of kicks.
>> >
>> > For this stage of the project I think microbenchmarks are more appropriate.
>> > Doubling the price of an exit is likely to be measurable; 30 cycles likely
>> > not ...
>> >
>> I don't quite understand this point here. If we don't have anything
>> real-world where we can measure a decent difference, then why are we
>> doing this? I would agree with Alex that the three test scenarios
>> proposed by him should be tried out before adding this complexity,
>> measured in CPU utilization or latency as you wish.
>
> Sure, plan to do real world benchmarks for PV MMIO versus PIO as well.
> I don't see why I should bother implementing hypercalls given that the
> kvm maintainer says they won't be merged.
>

The implementation effort to simply measure the hypercall performance
should be minimal, no? If we can measure a true difference in
performance, I'm sure we can revisit the issue of what will be merged
and what won't be, but until we have those numbers it's all
speculation.

>> FWIW, ARM always uses MMIO and provides hardware decoding of all sane
>> (i.e. no user register writeback) instructions, but the hypercall vs. mmio
>> numbers look like this:
>>
>> hvc: 4,917
>> mmio_kernel: 6,248
>
> So 20% difference? That's not far from what happens on my intel laptop:
> vmcall 1519
> outl_to_kernel 1745
> 10% difference here.
>
>>
>> But I doubt that an hvc wrapper around mmio decoding would take care
>> of all of this difference, because the mmio operation needs to do other
>> work not related to emulating the instruction in software, work which
>> you'd have to do for an hvc anyway (populate the kvm_mmio structure etc.)
>>
>
> Instead of speculating, someone with relevant hardware
> could just try this, but kvm unittest doesn't seem to have arm support
> at the moment. Anyone working on this?
>
We have a branch called kvm-selftest that replicates much of the
functionality, which is what I run to get these measurements. I can
port it over to unittest at some point, but I'm not actively working on
that.

I can measure it, but we have bigger fish to fry on the ARM side right
now, so it'll be a while until I get to that.