LinuxLists.cc - [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
> Here is the version 2.
>
> ChangeLog since V1: Mostly changes based on Avi's suggestions.
> 1) Use an id to identify the perf_event between host and guest;
> 2) Change lots of code to deal with a malicious guest os;
> 3) Add a perf_event number limitation per guest os instance;
> 4) Support guest os on the top of another guest os scenario. But
> I didn't test it yet as there is no environment. The design is to
> add 2 pointers in struct perf_event. One is used by host and the
> other is used by guest.
> 5) Fix the bug to support 'perf stat'. The key is to sync count data
> back to the guest when the guest tries to disable the perf_event at
> host side.
> 6) Add a clear ABI of PV perf.
>
>

Please use meaningful subject lines for individual patches.

> I don't implement live migration feature.
>
> Avi,
> Is live migration necessary on pv perf support?
>

Yes.

> --- linux-2.6_tip0620/Documentation/kvm/paravirt-perf.txt 1970-01-01 08:00:00.000000000 +0800
> +++ linux-2.6_tip0620perfkvm/Documentation/kvm/paravirt-perf.txt 2010-06-21 15:21:39.312999849 +0800
> @@ -0,0 +1,133 @@
> +The x86 kvm paravirt perf event interface
> +===================================
> +
> +This paravirt interface is responsible for supporting guest os perf event
> +collections. If guest os supports this interface, users could run command
> +perf in guest os directly.
> +
> +Design
> +========
> +
> +Guest os calls a series of hypercalls to communicate with host kernel to
> +create/enable/disable/close perf events. Host kernel notifies guest os
> +by injecting an NMI to guest os when an event overflows. Guest os needs to
> +go through all its active events to check if they overflow, and output
> +performance statistics if they do.
> +
> +ABI
> +=====
> +
> +1) Detect if host kernel supports paravirt perf interface:
> +#define KVM_FEATURE_PV_PERF 4
> +Host kernel defines above cpuid bit. Guest os calls cpuid to check if host
> +os returns this bit. If it does, it means host kernel supports paravirt perf
> +interface.
> +
> +2) Open a new event at host side:
> +kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, param_addr_low32bit,
> +param_addr_high32bit);
> +
> +#define KVM_PERF_OP 3
> +/* Operations for KVM_PERF_OP */
> +#define KVM_PERF_OP_OPEN 1
> +#define KVM_PERF_OP_CLOSE 2
> +#define KVM_PERF_OP_ENABLE 3
> +#define KVM_PERF_OP_DISABLE 4
> +#define KVM_PERF_OP_READ 5
>

> +/*
> + * guest_perf_attr is used when guest calls hypercall to
> + * open a new perf_event at host side. Mostly, it's a copy of
> + * perf_event_attr and deletes something not used by host kernel.
> + */
> +struct guest_perf_attr {
> + __u32 type;
>

Need padding here, otherwise the structure is different on 32-bit and
64-bit guests.

> + __u64 config;
> + __u64 sample_period;
> + __u64 sample_type;
> + __u64 read_format;
> + __u64 flags;
>

and here.

> + __u32 bp_type;
> + __u64 bp_addr;
> + __u64 bp_len;
>

Do we actually support breakpoints on the guest? Note the hardware
breakpoints are also usable by the guest, so if the host uses them, we
won't be able to emulate them correctly. We can let the guest do
breakpoint perf monitoring itself and drop this feature.

> +};
>

What about documentation for individual fields? Esp. type, config, and
flags, but also the others.
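
To illustrate the padding remarks above, here is a minimal sketch of a layout that comes out identical on 32-bit and 64-bit guests. The struct name guest_perf_attr_padded and the pad0 member are hypothetical and not part of the posted patch, and the breakpoint fields are left out, following the suggestion to drop that feature. Without pad0, an i386 guest would place config at offset 4 (64-bit members are only 4-byte aligned inside structs there), while an x86-64 guest would place it at offset 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical padded variant of guest_perf_attr (names are assumptions).
 * Every 64-bit member sits at an 8-byte offset, so 32-bit and 64-bit
 * compilers agree on the layout.
 */
struct guest_perf_attr_padded {
	uint32_t type;
	uint32_t pad0;          /* explicit padding after the 32-bit member */
	uint64_t config;
	uint64_t sample_period;
	uint64_t sample_type;
	uint64_t read_format;
	uint64_t flags;
};

/* Compile-time checks that pin the layout down as an ABI. */
static_assert(offsetof(struct guest_perf_attr_padded, config) == 8,
	      "config must start at offset 8");
static_assert(sizeof(struct guest_perf_attr_padded) == 48,
	      "structure size must not depend on the compiler ABI");
```

The static_assert lines make a layout change fail at build time instead of silently breaking the guest/host contract.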

> +/*
> + * data communication area about perf_event between
> + * Host kernel and guest kernel
> + */
> +struct guest_perf_event {
> + u64 count;
> + atomic_t overflows;
>

Please use __u64 and __u32, assume guests don't have Linux internal
types (though of course the first guest _is_ Linux).

Add padding to 64-bit.

> +};
> +struct guest_perf_event_param {
> + __u64 attr_addr;
> + __u64 guest_event_addr;
> + /* In case there is an alignment issue, we put id as the last one */
> + int id;
>

Add explicit padding to be sure.

Also makes sense to add a flags field for future expansion.

> +};
> +
> +param_addr_low32bit and param_addr_high32bit compose a u64 integer which is
> +the physical address of parameter struct guest_perf_event_param.
> +struct guest_perf_event_param consists of 3 members. attr_addr has the
> +physical address of parameter struct guest_perf_attr. guest_event_addr has the
> +physical address of a parameter whose type is struct guest_perf_event, which
> +has to be aligned to 4 bytes.
> +Guest os needs to allocate an exclusive id per event in this guest os instance, and save it to
> +guest_perf_event_param->id. Later on, the id is the only way to tell host
> +kernel which event guest os wants host kernel to operate on.
>

Need a way to expose the maximum number of events available to the
guest. I suggest exposing it in cpuid, and requiring 0 <= id < MAX_EVENTS.
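
As a rough sketch of the two points above, the helpers below split a 64-bit guest physical address into the two 32-bit hypercall arguments, recompose it on the other side, and apply the suggested 0 <= id < MAX_EVENTS check. The names and the value of PV_PERF_MAX_EVENTS are assumptions made for illustration; in the proposal the real limit would be reported through cpuid.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit; the thread suggests exposing the real one via cpuid. */
#define PV_PERF_MAX_EVENTS 64

/* Guest side: split a 64-bit physical address into two 32-bit arguments. */
static inline uint32_t addr_low32(uint64_t gpa)
{
	return (uint32_t)gpa;
}

static inline uint32_t addr_high32(uint64_t gpa)
{
	return (uint32_t)(gpa >> 32);
}

/* Host side: rebuild the 64-bit address from the two halves. */
static inline uint64_t addr_compose(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* Host side: reject any id outside 0 <= id < PV_PERF_MAX_EVENTS. */
static inline bool event_id_valid(int64_t id)
{
	return id >= 0 && id < PV_PERF_MAX_EVENTS;
}
```

A bounded id lets the host index a fixed-size per-guest table directly, which also caps how many host events a malicious guest can create.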

> +guest_perf_event->count saves the latest count of the event.
> +guest_perf_event->overflows means how many times this event has overflowed
> +since guest os last processed it. Host kernel just increments guest_perf_event->overflows
> +when the event overflows. Guest kernel should use an atomic_cmpxchg to reset
> +guest_perf_event->overflows to 0 in case there is a race between its reset by
> +guest os and host kernel data update.
>

Is overflows really needed? Since the guest can use NMI to read the
counter, it should have the highest possible priority, and thus it
shouldn't see any overflow unless it configured the threshold really low.

If we drop overflow, we can use the RDPMC instruction instead of
KVM_PERF_OP_READ. This allows the guest to allow userspace to read a
counter, or prevent userspace from reading the counter, by setting cr4.pce.
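
For readers following the overflows discussion, here is a minimal user-space model of the handshake the patch text describes: the host increments a shared counter, and the guest claims the pending overflows and resets the counter with a compare-and-exchange so that no increment is lost. It uses C11 atomics rather than the kernel's atomic_t, and the struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of the shared area described in the patch text. */
struct pv_event_shared {
	uint64_t count;
	_Atomic uint32_t overflows;
};

/* Host side: record one more overflow. */
static inline void host_note_overflow(struct pv_event_shared *ev)
{
	atomic_fetch_add(&ev->overflows, 1);
}

/*
 * Guest side: take all pending overflows and reset the field to 0.
 * If the host increments between the load and the exchange, the
 * compare-exchange fails, refreshes 'seen', and the loop retries.
 */
static inline uint32_t guest_take_overflows(struct pv_event_shared *ev)
{
	uint32_t seen = atomic_load(&ev->overflows);

	while (!atomic_compare_exchange_weak(&ev->overflows, &seen, 0))
		;
	return seen;
}
```

A plain store of 0 would drop any overflow the host recorded after the guest read the value, which is the race the patch text refers to.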

> +Host kernel saves count and overflow update information into guest_perf_event
> +pointed to by guest_perf_event_param->guest_event_addr.
> +
> +After host kernel creates the event, this event is in disabled mode.
> +
> +This hypercall3 returns 0 when host kernel creates the event successfully, or
> +another value if it fails.
> +
> +3) Enable event at host side:
> +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_ENABLE, id);
> +
> +Parameter id means the event id allocated by guest os. Guest os needs to call this
> +hypercall to enable the event at host side. Then, host side will really start
> +to collect statistics by this event.
> +
> +This hypercall2 returns 0 if host kernel succeeds, or another value if it fails.
> +
> +
> +4) Disable event at host side:
> +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_DISABLE, id);
> +
> +Parameter id means the event id allocated by guest os. Guest os needs to call this
> +hypercall to disable the event at host side. Then, host side will stop
> +statistics collection initiated by the event.
> +
> +This hypercall2 returns 0 if host kernel succeeds, or another value if it fails.
> +
> +
> +5) Close event at host side:
> +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_CLOSE, id);
> +it will close and delete the event at host side.
>

What about using MSRs to configure the counter like real hardware? That
takes care of live migration, since we already migrate MSRs. At the end
of the migration userspace will read all config and counter data from
the source and transfer it to the destination. This should work with
existing userspace since we query the MSR index list from the host.

> +
> +8) NMI notification from host kernel:
> +When an event overflows at host side, host kernel injects an NMI to guest os.
> +Guest os has to check all its active events in guest os NMI handler.
>

Item 8) -> 6) :)

Should be via the performance counter LVT. Since we lack infrastructure
for this at the moment, direct NMI delivery is fine. I'm working on
that infrastructure now.

> +
> +
> +Usage flow at guest side
> +=============
> +1) Guest os registers an NMI handler to prepare to process all active event
> +overflows.
> +2) Guest os calls hypercall3(..., KVM_PERF_OP_OPEN, ...) to create an event at
> +host side.
> +3) Guest os calls hypercall2 (..., KVM_PERF_OP_ENABLE, ...) to enable the
> +event.
> +4) Guest os calls hypercall2 (..., KVM_PERF_OP_DISABLE, ...) to disable the
> +event.
> +5) Guest os could repeat 3) and 4).
> +6) Guest os calls hypercall2 (..., KVM_PERF_OP_CLOSE, ...) to close the event.
> +
> +
>
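
The usage flow above can be modelled as a small state machine. This is only a sketch: mock_host_perf_op is a hypothetical stand-in for the host side of the hypercall, the operation numbers are the ones from the patch text, and the error rules (for example, rejecting ENABLE on an event that was never opened) are assumptions, not behaviour confirmed by the patch.

```c
#include <stdint.h>

/* Operation numbers as given in the patch text. */
#define KVM_PERF_OP_OPEN    1
#define KVM_PERF_OP_CLOSE   2
#define KVM_PERF_OP_ENABLE  3
#define KVM_PERF_OP_DISABLE 4

enum ev_state { EV_CLOSED, EV_DISABLED, EV_ENABLED };

/*
 * Hypothetical host-side model for a single event.
 * Returns 0 on success and -1 on failure, mirroring the
 * "0 or other value" convention in the patch text.
 */
static inline int mock_host_perf_op(enum ev_state *st, int op)
{
	switch (op) {
	case KVM_PERF_OP_OPEN:
		if (*st != EV_CLOSED)
			return -1;
		*st = EV_DISABLED;      /* a new event starts out disabled */
		return 0;
	case KVM_PERF_OP_ENABLE:
		if (*st == EV_CLOSED)
			return -1;
		*st = EV_ENABLED;
		return 0;
	case KVM_PERF_OP_DISABLE:
		if (*st == EV_CLOSED)
			return -1;
		*st = EV_DISABLED;
		return 0;
	case KVM_PERF_OP_CLOSE:
		if (*st == EV_CLOSED)
			return -1;
		*st = EV_CLOSED;
		return 0;
	default:
		return -1;              /* unknown operation */
	}
}
```

Steps 3) and 4) of the flow map to repeated ENABLE and DISABLE calls on the same event before the final CLOSE.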

How does OP_READ work? simply update the guest_perf structure?

--
error compiling committee.c: too many arguments to function

2010-06-22 01:49:09

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
> On 06/21/2010 12:31 PM, Zhang, Yanmin wrote:
> > Here is the version 2.
> >
> > ChangeLog since V1: Mostly changes based on Avi's suggestions.
> > 1) Use an id to identify the perf_event between host and guest;
> > 2) Change lots of code to deal with a malicious guest os;
> > 3) Add a perf_event number limitation per guest os instance;
> > 4) Support guest os on the top of another guest os scenario. But
> > I didn't test it yet as there is no environment. The design is to
> > add 2 pointers in struct perf_event. One is used by host and the
> > other is used by guest.
> > 5) Fix the bug to support 'perf stat'. The key is to sync count data
> > back to the guest when the guest tries to disable the perf_event at
> > host side.
> > 6) Add a clear ABI of PV perf.
> >
> >
>
> Please use meaningful subject lines for individual patches.
Yes, I should. I rushed to send the patches out yesterday afternoon as I needed
to catch the company shuttle back home.
>
> > I don't implement live migration feature.
> >
> > Avi,
> > Is live migration necessary on pv perf support?
> >
>
> Yes.
Ok. With the PV perf interface, host perf saves all counter info into the perf_event
structure. To support live migration, we need to save all host perf_event structures,
or at least perf_event->count and perf_event->attr, and then recreate the host perf_event
after migration.

I checked the qemu-kvm code and it seems most of live migration is about saving cpu state.
So it seems hard for the perf pv interface to fit into the current live migration. Any suggestions?

>
> > --- linux-2.6_tip0620/Documentation/kvm/paravirt-perf.txt 1970-01-01 08:00:00.000000000 +0800
> > +++ linux-2.6_tip0620perfkvm/Documentation/kvm/paravirt-perf.txt 2010-06-21 15:21:39.312999849 +0800
> > @@ -0,0 +1,133 @@
> > +The x86 kvm paravirt perf event interface
> > +===================================
> > +
> > +This paravirt interface is responsible for supporting guest os perf event
> > +collections. If guest os supports this interface, users could run command
> > +perf in guest os directly.
> > +
> > +Design
> > +========
> > +
> > +Guest os calls a series of hypercalls to communicate with host kernel to
> > +create/enable/disable/close perf events. Host kernel notifies guest os
> > +by injecting an NMI to guest os when an event overflows. Guest os needs to
> > +go through all its active events to check if they overflow, and output
> > +performance statistics if they do.
> > +
> > +ABI
> > +=====
> > +
> > +1) Detect if host kernel supports paravirt perf interface:
> > +#define KVM_FEATURE_PV_PERF 4
> > +Host kernel defines above cpuid bit. Guest os calls cpuid to check if host
> > +os returns this bit. If it does, it means host kernel supports paravirt perf
> > +interface.
> > +
> > +2) Open a new event at host side:
> > +kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, param_addr_low32bit,
> > +param_addr_high32bit);
> > +
> > +#define KVM_PERF_OP 3
> > +/* Operations for KVM_PERF_OP */
> > +#define KVM_PERF_OP_OPEN 1
> > +#define KVM_PERF_OP_CLOSE 2
> > +#define KVM_PERF_OP_ENABLE 3
> > +#define KVM_PERF_OP_DISABLE 4
> > +#define KVM_PERF_OP_READ 5
> >
>
> > +/*
> > + * guest_perf_attr is used when guest calls hypercall to
> > + * open a new perf_event at host side. Mostly, it's a copy of
> > + * perf_event_attr and deletes something not used by host kernel.
> > + */
> > +struct guest_perf_attr {
> > + __u32 type;
> >
>
> Need padding here, otherwise the structure is different on 32-bit and
> 64-bit guests.
Ok. I will change it.

>
> > + __u64 config;
> > + __u64 sample_period;
> > + __u64 sample_type;
> > + __u64 read_format;
> > + __u64 flags;
> >
>
> and here.
I will rearrange the whole structure.

>
> > + __u32 bp_type;
> > + __u64 bp_addr;
> > + __u64 bp_len;
> >
>
> Do we actually support breakpoints on the guest? Note the hardware
> breakpoints are also usable by the guest, so if the host uses them, we
> won't be able to emulate them correctly.
> We can let the guest do
> breakpoint perf monitoring itself and drop this feature.
Ok, I will disable breakpoint feature of pv interface.

>
> > +};
> >
>
> What about documentation for individual fields? Esp. type, config, and
> flags, but also the others.
They are really perf implementation specific. Even the perf_event definition
has no documentation other than code comments. I will add a simple explanation around
the new structure definition.

>
> > +/*
> > + * data communication area about perf_event between
> > + * Host kernel and guest kernel
> > + */
> > +struct guest_perf_event {
> > + u64 count;
> > + atomic_t overflows;
> >
>
> Please use __u64 and __u32, assume guests don't have Linux internal
> types (though of course the first guest _is_ Linux).
This structure is used by both host and guest. In case there is a race
condition, guest os has to use atomic_cmpxchg to reset it to 0. I could
change its type to __u32, but guest kernel should still call atomic_cmpxchg to
reset it.

>
> Add padding to 64-bit.
Ok.

>
> > +};
> > +struct guest_perf_event_param {
> > + __u64 attr_addr;
> > + __u64 guest_event_addr;
> > + /* In case there is an alignment issue, we put id as the last one */
> > + int id;
> >
>
> Add explicit padding to be sure.
Ok.

>
> Also makes sense to add a flags field for future expansion.
Ok. So it could also work as something like version info.

>
> > +};
> > +
> > +param_addr_low32bit and param_addr_high32bit compose a u64 integer which is
> > +the physical address of parameter struct guest_perf_event_param.
> > +struct guest_perf_event_param consists of 3 members. attr_addr has the
> > +physical address of parameter struct guest_perf_attr. guest_event_addr has the
> > +physical address of a parameter whose type is struct guest_perf_event, which
> > +has to be aligned to 4 bytes.
> > +Guest os needs to allocate an exclusive id per event in this guest os instance, and save it to
> > +guest_perf_event_param->id. Later on, the id is the only way to tell host
> > +kernel which event guest os wants host kernel to operate on.
> >
>
> Need a way to expose the maximum number of events available to the
> guest. I suggest exposing it in cpuid, and requiring 0 <= id < MAX_EVENTS.
Ok.

>
> > +guest_perf_event->count saves the latest count of the event.
> > +guest_perf_event->overflows means how many times this event has overflowed
> > +since guest os last processed it. Host kernel just increments guest_perf_event->overflows
> > +when the event overflows. Guest kernel should use an atomic_cmpxchg to reset
> > +guest_perf_event->overflows to 0 in case there is a race between its reset by
> > +guest os and host kernel data update.
> >
>
> Is overflows really needed?
Theoretically, we can remove it. But it could simplify the implementation and lets us
touch the generic perf code as little as possible.

> Since the guest can use NMI to read the
> counter, it should have the highest possible priority, and thus it
> shouldn't see any overflow unless it configured the threshold really low.
>
> If we drop overflow, we can use the RDPMC instruction instead of
> KVM_PERF_OP_READ. This allows the guest to allow userspace to read a
> counter, or prevent userspace from reading the counter, by setting cr4.pce.
1) para virt perf interface is to hide PMU hardware in host os. Guest os shouldn't
access PMU hardware directly. We could expose PMU hardware to guest os directly, but
that would be another guest os PMU support method. It shouldn't be a part of para virt
interface.
2) Consider the scenario below: a PMU counter overflows and the NMI causes a guest os vmexit to
host kernel. Host kernel schedules the vcpu thread to another physical cpu before
it vmenters the guest os again. So later on, guest os would RDPMC the counter on another
cpu.

So I think above discussion is around how to expose PMU hardware to guest os. I will
also check this method after the para virt interface is done.

>
> > +Host kernel saves count and overflow update information into guest_perf_event
> > +pointed to by guest_perf_event_param->guest_event_addr.
> > +
> > +After host kernel creates the event, this event is in disabled mode.
> > +
> > +This hypercall3 returns 0 when host kernel creates the event successfully, or
> > +another value if it fails.
> > +
> > +3) Enable event at host side:
> > +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_ENABLE, id);
> > +
> > +Parameter id means the event id allocated by guest os. Guest os needs to call this
> > +hypercall to enable the event at host side. Then, host side will really start
> > +to collect statistics by this event.
> > +
> > +This hypercall2 returns 0 if host kernel succeeds, or another value if it fails.
> > +
> > +
> > +4) Disable event at host side:
> > +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_DISABLE, id);
> > +
> > +Parameter id means the event id allocated by guest os. Guest os needs to call this
> > +hypercall to disable the event at host side. Then, host side will stop
> > +statistics collection initiated by the event.
> > +
> > +This hypercall2 returns 0 if host kernel succeeds, or another value if it fails.
> > +
> > +
> > +5) Close event at host side:
> > +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_CLOSE, id);
> > +it will close and delete the event at host side.
> >
>
> What about using MSRs to configure the counter like real hardware? That
> takes care of live migration, since we already migrate MSRs. At the end
> of the migration userspace will read all config and counter data from
> the source and transfer it to the destination. This should work with
> existing userspace since we query the MSR index list from the host.
Yes, but it will belong to the method that exposes PMU hardware to guest os directly.

>
>
> > +
> > +8) NMI notification from host kernel:
> > +When an event overflows at host side, host kernel injects an NMI to guest os.
> > +Guest os has to check all its active events in guest os NMI handler.
> >
>
> Item 8) -> 6) :)
Sorry. Originally, I added start and stop callbacks to the para virt PMU. I deleted
them as they just duplicate enable and disable, but forgot to update the item
numbering.

>
> Should be via the performance counter LVT. Since we lack infrastructure
> for this at the moment, direct NMI delivery is fine. I'm working on
> that infrastructure now.
That's a good idea.

>
> > +
> > +
> > +Usage flow at guest side
> > +=============
> > +1) Guest os registers an NMI handler to prepare to process all active event
> > +overflows.
> > +2) Guest os calls hypercall3(..., KVM_PERF_OP_OPEN, ...) to create an event at
> > +host side.
> > +3) Guest os calls hypercall2 (..., KVM_PERF_OP_ENABLE, ...) to enable the
> > +event.
> > +4) Guest os calls hypercall2 (..., KVM_PERF_OP_DISABLE, ...) to disable the
> > +event.
> > +5) Guest os could repeat 3) and 4).
> > +6) Guest os calls hypercall2 (..., KVM_PERF_OP_CLOSE, ...) to close the event.
> > +
> > +
> >
>
> How does OP_READ work? simply update the guest_perf structure?
Host kernel updates guest os perf->event->guest_perf_shadow->count. Then, guest os
copies it to its perf_event->count.

2010-06-22 07:15:10

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/10 03:49, Zhang, Yanmin wrote:
> On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
>> Since the guest can use NMI to read the
>> counter, it should have the highest possible priority, and thus it
>> shouldn't see any overflow unless it configured the threshold really low.
>>
>> If we drop overflow, we can use the RDPMC instruction instead of
>> KVM_PERF_OP_READ. This allows the guest to allow userspace to read a
>> counter, or prevent userspace from reading the counter, by setting cr4.pce.
> 1) para virt perf interface is to hide PMU hardware in host os. Guest os shouldn't
> access PMU hardware directly. We could expose PMU hardware to guest os directly, but
> that would be another guest os PMU support method. It shouldn't be a part of para virt
> interface.
> 2) Consider the scenario below: a PMU counter overflows and the NMI causes a guest os vmexit to
> host kernel. Host kernel schedules the vcpu thread to another physical cpu before
> it vmenters the guest os again. So later on, guest os would RDPMC the counter on another
> cpu.
>
> So I think above discussion is around how to expose PMU hardware to guest os. I will
> also check this method after the para virt interface is done.

You should be able to expose the counters as read-only to the guest. KVM
allows you to specify whether or not a guest has read, write or
read/write access. If you allowed read access of the counters that would
save a fair number of hypercalls.

Question is if it is safe to drop overflow support?

Cheers,
Jes

2010-06-22 07:47:35

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
> On 06/22/10 03:49, Zhang, Yanmin wrote:
> > On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
> >> Since the guest can use NMI to read the
> >> counter, it should have the highest possible priority, and thus it
> >> shouldn't see any overflow unless it configured the threshold really low.
> >>
> >> If we drop overflow, we can use the RDPMC instruction instead of
> >> KVM_PERF_OP_READ. This allows the guest to allow userspace to read a
> >> counter, or prevent userspace from reading the counter, by setting cr4.pce.
> > 1) para virt perf interface is to hide PMU hardware in host os. Guest os shouldn't
> > access PMU hardware directly. We could expose PMU hardware to guest os directly, but
> > that would be another guest os PMU support method. It shouldn't be a part of para virt
> > interface.
> > 2) Consider the scenario below: a PMU counter overflows and the NMI causes a guest os vmexit to
> > host kernel. Host kernel schedules the vcpu thread to another physical cpu before
> > it vmenters the guest os again. So later on, guest os would RDPMC the counter on another
> > cpu.
> >
> > So I think above discussion is around how to expose PMU hardware to guest os. I will
> > also check this method after the para virt interface is done.
>
> You should be able to expose the counters as read-only to the guest. KVM
> allows you to specify whether or not a guest has read, write or
> read/write access. If you allowed read access of the counters that would
> save a fair number of hypercalls.
Thanks. KVM is good at register access permission configuration. But things are not
that simple if we consider a real running environment. The host kernel might schedule
the guest os vcpu thread to other cpus, or other non-kvm processes might preempt the vcpu
thread on this cpu.

To support the capability you describe, we would eventually have to implement the direct
exposition of PMU hardware to guest os.

>
> Question is if it is safe to drop overflow support?
Not safe. One of the PMU hardware design objectives is to use an interrupt or NMI to notify
software when an event counter overflows. Without overflow support, software needs to poll
the PMU registers in a loop. That is not good and consumes more cpu resources.

Besides the para virt perf interface, I'm also considering the direct exposition
of PMU hardware to guest os. But that will be another very different implementation. We
should not combine it with pv interface. Perhaps our target is to implement both, so
unmodified guest os could get support on perf statistics.

Yanmin

2010-06-22 07:55:26

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
> Besides the para virt perf interface, I'm also considering the direct exposition
> of PMU hardware to guest os.

NAK NAK NAK NAK, we've been over that, it's not going to happen, full
stop!

Use MSR read/write traps and host perf to emulate the hardware. In some
cases we could allow the reads without trap but that's a later
optimization.

2010-06-22 07:59:20

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/10 09:47, Zhang, Yanmin wrote:
> On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
>> On 06/22/10 03:49, Zhang, Yanmin wrote:
>>> On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
>>> So I think above discussion is around how to expose PMU hardware to guest os. I will
>>> also check this method after the para virt interface is done.
>>
>> You should be able to expose the counters as read-only to the guest. KVM
>> allows you to specify whether or not a guest has read, write or
>> read/write access. If you allowed read access of the counters that would
>> save a fair number of hypercalls.
> Thanks. KVM is good at register access permission configuration. But things are not
> that simple if we consider a real running environment. The host kernel might schedule
> the guest os vcpu thread to other cpus, or other non-kvm processes might preempt the vcpu
> thread on this cpu.
>
> To support the capability you describe, we would eventually have to implement the direct
> exposition of PMU hardware to guest os.

If the guest is rescheduled to another CPU, or you get a preemption, you
have a VMEXIT. The vcpu thread will not migrate while it is running, so
you can handle it while the VMEXIT is being serviced.

Exposing the counters read-only would save a lot of overhead for sure.

>> Question is if it is safe to drop overflow support?
> Not safe. One of the PMU hardware design objectives is to use an interrupt or NMI to notify
> software when an event counter overflows. Without overflow support, software needs to poll
> the PMU registers in a loop. That is not good and consumes more cpu resources.

Here is an idea, how about having the overflow NMI in the host trigger a
flag that causes the PMU register read to trap and get special handling?
That way you could propagate the overflow back down to the guest.

> Besides the para virt perf interface, I'm also considering the direct exposition
> of PMU hardware to guest os. But that will be another very different implementation. We
> should not combine it with pv interface. Perhaps our target is to implement both, so
> unmodified guest os could get support on perf statistics.

That was what I was looking at initially, but it got stalled. I think it
will make sense to build it on top of the infrastructure you have
already posted, so once that settles it will definitely be easier to do.

Cheers,
Jes

2010-06-22 08:00:48

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/10 09:55, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
>> Besides the para virt perf interface, I'm also considering the direct exposition
>> of PMU hardware to guest os.
>
> NAK NAK NAK NAK, we've been over that, it's not going to happen, full
> stop!
>
> Use MSR read/write traps and host perf to emulate the hardware. In some
> cases we could allow the reads without trap but that's a later
> optimization.

I believe what's meant here is a PMU compatible interface which is
partially emulated. Not a handover of the PMU.

Jes

2010-06-22 08:59:56

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/2010 04:49 AM, Zhang, Yanmin wrote:
>
>>> Is live migration necessary on pv perf support?
>>>
>>>
>> Yes.
>>
> Ok. With the PV perf interface, host perf saves all counter info into the perf_event
> structure. To support live migration, we need to save all host perf_event structures,
> or at least perf_event->count and perf_event->attr, and then recreate the host perf_event
> after migration.
>

Much better to save the guest structure (which is an ABI, and doesn't
change between kernels).

> I checked the qemu-kvm code and it seems most of live migration is about saving cpu state.
> So it seems hard for the perf pv interface to fit into the current live migration. Any suggestions?
>

Make it part of the cpu state then. If you encode the interface as
MSRs, it comes for free (including migration of the counter values). If
not, save the parameters to OP_OPEN and enable/disable state, as well as
the counters.

But using MSRs will be much more natural. Almost by definition they
encode state, instead of hypercalls, which work to maintain state which
isn't clearly specified.
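
To make the "it comes for free" argument concrete, here is a toy model of migrating state that is held in MSRs: userspace walks a known list of MSR indices, reads each value from the source vcpu, and writes it to the destination. Everything here is hypothetical, including the index values, the struct, and the helper names; the sketch does not use the real KVM ioctl interface.

```c
#include <stddef.h>
#include <stdint.h>

#define PV_PERF_NR_MSRS 4

/* Hypothetical virtual MSR indices, made up for this sketch. */
static const uint32_t pv_perf_msr_index[PV_PERF_NR_MSRS] = {
	0x4b564d80, 0x4b564d81, 0x4b564d82, 0x4b564d83,
};

/* Toy per-vcpu state: one value per virtual MSR. */
struct vcpu_msrs {
	uint64_t value[PV_PERF_NR_MSRS];
};

/* Map an MSR index to its slot, or -1 if the index is unknown. */
static inline int msr_slot(uint32_t index)
{
	for (size_t i = 0; i < PV_PERF_NR_MSRS; i++)
		if (pv_perf_msr_index[i] == index)
			return (int)i;
	return -1;
}

static inline int vcpu_set_msr(struct vcpu_msrs *v, uint32_t index,
			       uint64_t data)
{
	int slot = msr_slot(index);

	if (slot < 0)
		return -1;
	v->value[slot] = data;
	return 0;
}

static inline int vcpu_get_msr(const struct vcpu_msrs *v, uint32_t index,
			       uint64_t *data)
{
	int slot = msr_slot(index);

	if (slot < 0)
		return -1;
	*data = v->value[slot];
	return 0;
}

/* Migration step: copy every listed MSR from source to destination. */
static inline int migrate_msrs(const struct vcpu_msrs *src,
			       struct vcpu_msrs *dst)
{
	for (size_t i = 0; i < PV_PERF_NR_MSRS; i++) {
		uint64_t data;

		if (vcpu_get_msr(src, pv_perf_msr_index[i], &data) ||
		    vcpu_set_msr(dst, pv_perf_msr_index[i], data))
			return -1;
	}
	return 0;
}
```

The migration loop knows nothing about perf; it only needs the index list, which is why state encoded as MSRs travels without perf-specific save and restore code.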

>>>
>>>
>> What about documentation for individual fields? Esp. type, config, and
>> flags, but also the others.
>>
> They are really perf implementation specific. Even the perf_event definition
> has no documentation other than code comments. I will add a simple explanation around
> the new structure definition.
>

Ok. Please drop anything we don't support and document what we do.
Note that if the perf implementation changes, we will need to convert
between the kvm ABI and the new implementation.

>>> +guest_perf_event->count saves the latest count of the event.
>>> +guest_perf_event->overflows means how many times this event has overflowed
>>> +since guest os last processed it. Host kernel just increments guest_perf_event->overflows
>>> +when the event overflows. Guest kernel should use an atomic_cmpxchg to reset
>>> +guest_perf_event->overflows to 0 in case there is a race between its reset by
>>> +guest os and host kernel data update.
>>>
>>>
>> Is overflows really needed?
>>
> Theoretically, we can remove it. But it could simplify the implementation and lets us
> touch the generic perf code as little as possible.
>

Since real hardware doesn't provide overflows, guest software is
prepared to handle it. So if removing it simplifies the host, it's an
improvement.

>> Since the guest can use NMI to read the
>> counter, it should have the highest possible priority, and thus it
>> shouldn't see any overflow unless it configured the threshold really low.
>>
>> If we drop overflow, we can use the RDPMC instruction instead of
>> KVM_PERF_OP_READ. This allows the guest to allow userspace to read a
>> counter, or prevent userspace from reading the counter, by setting cr4.pce.
>>
> 1) para virt perf interface is to hide PMU hardware in host os. Guest os shouldn't
> access PMU hardware directly. We could expose PMU hardware to guest os directly, but
> that would be another guest os PMU support method. It shouldn't be a part of para virt
> interface.
>

RDPMC will be trapped by the host, so it won't access the real PMU.
It's a convenient shorthand for 'read a counter designated by this index'.

(similarly, without EPT 'mov cr3' doesn't affect the real cr3 but only
the virtual cr3).

> 2) Consider the scenario below: a PMU counter overflows and the NMI causes a guest os vmexit to
> host kernel. Host kernel schedules the vcpu thread to another physical cpu before
> it vmenters the guest os again. So later on, guest os would RDPMC the counter on another
> cpu.
>

Again, RDPMC will access the paravirt counter, not the hardware counter.
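
A minimal sketch of what "RDPMC will access the paravirt counter" could look like on the host side: the trapped instruction is handled by looking up a per-vcpu virtual counter by index, so the result does not depend on which physical cpu the vcpu thread currently runs on. The struct, the function name, and the counter limit are assumptions for illustration only.

```c
#include <stdint.h>

#define PV_NR_COUNTERS 4        /* hypothetical number of virtual counters */

/* Hypothetical per-vcpu virtual PMU state kept by the host. */
struct pv_pmu {
	uint64_t counter[PV_NR_COUNTERS];
};

/*
 * Emulate a trapped RDPMC: 'index' selects a virtual counter.
 * Returns 0 and stores the value on success. Returns -1 for an
 * out-of-range index, which a real host might turn into a fault
 * injected into the guest.
 */
static inline int emulate_rdpmc(const struct pv_pmu *pmu, uint32_t index,
				uint64_t *val)
{
	if (index >= PV_NR_COUNTERS)
		return -1;
	*val = pmu->counter[index];
	return 0;
}
```

Because the value comes from per-vcpu state rather than from the hardware register, rescheduling the vcpu thread to another physical cpu does not change what the guest reads.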

> So I think above discussion is around how to expose PMU hardware to guest os. I will
> also check this method after the para virt interface is done.
>
>
>>
>>> +Host kernel saves count and overflow update information into guest_perf_event
>>> +pointed to by guest_perf_event_param->guest_event_addr.
>>> +
>>> +After host kernel creates the event, this event is in disabled mode.
>>> +
>>> +This hypercall3 returns 0 when host kernel creates the event successfully, or
>>> +another value if it fails.
>>> +
>>> +3) Enable event at host side:
>>> +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_ENABLE, id);
>>> +
>>> +Parameter id means the event id allocated by guest os. Guest os need call this
>>> +hypercall to enable the event at host side. Then, host side will really start
>>> +to collect statistics by this event.
>>> +
>>> +This hypercall2 returns 0 if the host kernel succeeds, or another value if it fails.
>>> +
>>> +
>>> +4) Disable event at host side:
>>> +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_DISABLE, id);
>>> +
>>> +Parameter id means the event id allocated by guest os. Guest os need call this
>>> +hypercall to disable the event at host side. Then, host side will stop
>>> +statistics collection initiated by the event.
>>> +
>>> +This hypercall2 returns 0 if the host kernel succeeds, or another value if it fails.
>>> +
>>> +
>>> +5) Close event at host side:
>>> +kvm_hypercall2(KVM_PERF_OP, KVM_PERF_OP_CLOSE, id);
>>> +it will close and delete the event at host side.
>>>
>>>
>> What about using MSRs to configure the counter like real hardware? That
>> takes care of live migration, since we already migrate MSRs. At the end
>> of the migration userspace will read all config and counter data from
>> the source and transfer it to the destination. This should work with
>> existing userspace since we query the MSR index list from the host.
>>
> Yes, but it will belong to the method that exposes PMU hardware to guest os directly.
>

I'm suggesting to use virtual MSRs defined by you. Those MSRs will
encode the guest_perf_attr structure. Since we already copy MSRs on
live migration, we will have live migration support, and reset will also
work. Look at kvmclock for an example of a virtual MSR.
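
A minimal sketch of what I mean, assuming three made-up MSR indexes and a
made-up attr layout (struct pv_perf_attr stands in for guest_perf_attr,
whose real fields may differ; the pv_* helpers are hypothetical too). Each
virtual MSR is just a window onto one field, so a read returns whatever was
last written:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical indexes, picked for illustration only. */
#define MSR_PV_PERF_TYPE    0x4b564e00u
#define MSR_PV_PERF_CONFIG  0x4b564e01u
#define MSR_PV_PERF_PERIOD  0x4b564e02u

/* Stand-in for guest_perf_attr; the field list is an assumption. */
struct pv_perf_attr {
    uint64_t type;
    uint64_t config;
    uint64_t sample_period;
};

/* Map a virtual MSR index to the attr field it exposes, or NULL. */
static uint64_t *pv_attr_field(struct pv_perf_attr *attr, uint32_t msr)
{
    switch (msr) {
    case MSR_PV_PERF_TYPE:   return &attr->type;
    case MSR_PV_PERF_CONFIG: return &attr->config;
    case MSR_PV_PERF_PERIOD: return &attr->sample_period;
    default:                 return NULL;
    }
}

/* Trapped WRMSR: returns 0, or -1 if the host should inject #GP. */
int pv_wrmsr(struct pv_perf_attr *attr, uint32_t msr, uint64_t data)
{
    uint64_t *field = pv_attr_field(attr, msr);

    if (!field)
        return -1;
    *field = data;
    return 0;
}

/* Trapped RDMSR: returns 0, or -1 if the host should inject #GP. */
int pv_rdmsr(struct pv_perf_attr *attr, uint32_t msr, uint64_t *data)
{
    uint64_t *field = pv_attr_field(attr, msr);

    if (!field)
        return -1;
    *data = *field;
    return 0;
}
```

Reset then amounts to writing the initial values back through the same
write path.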

--

error compiling committee.c: too many arguments to function

2010-06-22 09:28:47

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 10:00 +0200, Jes Sorensen wrote:
> On 06/22/10 09:55, Peter Zijlstra wrote:
> > On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
> >> Besides the para virt perf interface, I'm also considering the direct exposition
> >> of PMU hardware to guest os.
> >
> > NAK NAK NAK NAK, we've been over that, its not going to happen, full
> > stop!
> >
> > Use MSR read/write traps and host perf to emulate the hardware. In some
> > cases we could allow the reads without trap but that's a later
> > optimization.
>
> I believe whats meant here is a PMU compatible interface which is
> partially emulated. Not a handover of the PMU.
Right. We need to capture all writes to the PMU MSRs and allow the guest os to read the MSRs directly.

Yanmin

2010-06-22 09:31:27

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 17:29 +0800, Zhang, Yanmin wrote:
> On Tue, 2010-06-22 at 10:00 +0200, Jes Sorensen wrote:
> > On 06/22/10 09:55, Peter Zijlstra wrote:
> > > On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
> > >> Besides the para virt perf interface, I'm also considering the direct exposition
> > >> of PMU hardware to guest os.
> > >
> > > NAK NAK NAK NAK, we've been over that, its not going to happen, full
> > > stop!
> > >
> > > Use MSR read/write traps and host perf to emulate the hardware. In some
> > > cases we could allow the reads without trap but that's a later
> > > optimization.
> >
> > I believe whats meant here is a PMU compatible interface which is
> > partially emulated. Not a handover of the PMU.
> Right. We need to capture all writes to the PMU MSRs and allow the guest os to read the MSRs directly.

The latter is not possible; only in a subset of cases can you allow
that read.

2010-06-22 09:39:44

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/10 11:31, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 17:29 +0800, Zhang, Yanmin wrote:
>> On Tue, 2010-06-22 at 10:00 +0200, Jes Sorensen wrote:
>>> On 06/22/10 09:55, Peter Zijlstra wrote:
>>>> On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
>>>>> Besides the para virt perf interface, I'm also considering the direct exposition
>>>>> of PMU hardware to guest os.
>>>>
>>>> NAK NAK NAK NAK, we've been over that, its not going to happen, full
>>>> stop!
>>>>
>>>> Use MSR read/write traps and host perf to emulate the hardware. In some
>>>> cases we could allow the reads without trap but that's a later
>>>> optimization.
>>>
>>> I believe whats meant here is a PMU compatible interface which is
>>> partially emulated. Not a handover of the PMU.
>> Right. We need to capture all writes to the PMU MSRs and allow the guest os to read the MSRs directly.
>
> The latter is not possible; only in a subset of cases can you allow
> that read.

Avi's suggestion of using virtual MSRs makes a ton of sense for this
though, and it makes it possible to switch direct access on/off for the
cases where direct access is possible, and go emulated when it isn't.

Cheers,
Jes

2010-06-22 09:47:08

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 11:39 +0200, Jes Sorensen wrote:
> On 06/22/10 11:31, Peter Zijlstra wrote:
> > On Tue, 2010-06-22 at 17:29 +0800, Zhang, Yanmin wrote:
> >> On Tue, 2010-06-22 at 10:00 +0200, Jes Sorensen wrote:
> >>> On 06/22/10 09:55, Peter Zijlstra wrote:
> >>>> On Tue, 2010-06-22 at 15:47 +0800, Zhang, Yanmin wrote:
> >>>>> Besides the para virt perf interface, I'm also considering the direct exposition
> >>>>> of PMU hardware to guest os.
> >>>>
> >>>> NAK NAK NAK NAK, we've been over that, its not going to happen, full
> >>>> stop!
> >>>>
> >>>> Use MSR read/write traps and host perf to emulate the hardware. In some
> >>>> cases we could allow the reads without trap but that's a later
> >>>> optimization.
> >>>
> >>> I believe whats meant here is a PMU compatible interface which is
> >>> partially emulated. Not a handover of the PMU.
> >> Right. We need to capture all writes to the PMU MSRs and allow the guest os to read the MSRs directly.
> >
> > The latter is not possible; only in a subset of cases can you allow
> > that read.
>
> Avi's suggestion of using virtual MSRs makes a ton of sense for this
> though, and it makes it possible to switch direct access on/off for the
> cases where direct access is possible, and go emulated when it isn't.

/me has no clue what virtual MSRs are, but yeah, that sounds about
right. Anyway, the generic case is full trap and emulate; get that
working first, then try to be smart and avoid some traps.

2010-06-22 09:54:19

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/2010 12:46 PM, Peter Zijlstra wrote:
>
>> Avi's suggestion of using virtual MSRs makes a ton of sense for this
>> though, and it makes it possible to switch direct access on/off for the
>> cases where direct access is possible, and go emulated when it isn't.
>>
> /me has no clue what virtual MSRs are,

MSRs that are not defined by the hardware, but instead by the hypervisor.

> but yeah, that sounds about
> right. Anyway, the generic case is full trap and emulate; get that
> working first, then try to be smart and avoid some traps.
>

I doubt we can avoid traps for the paravirt PMU since the counter
indexes will not match.

When emulating the hardware PMU we can be clever at times and allow
RDPMC not to trap.

--
error compiling committee.c: too many arguments to function

2010-06-22 10:02:40

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 12:53 +0300, Avi Kivity wrote:

> > /me has no clue what virtual MSRs are,
>
> MSRs that are not defined by the hardware, but instead by the
> hypervisor.
>
Uhm, but the PMU MSRs are all defined by the hardware, if you move the
PMU MSRs around nothing will work.. *confusion*

> When emulating the hardware PMU we can be clever at times and allow
> RDPMC not to trap.

Sure, not disagreeing with that, still the generic case is to trap, so
let's first get that to work and then try and be smart :-)

2010-06-22 10:06:24

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/2010 01:02 PM, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 12:53 +0300, Avi Kivity wrote:
>
>
>
>>> /me has no clue what virtual MSRs are,
>>>
>> MSRs that are not defined by the hardware, but instead by the
>> hypervisor.
>>
>>
> Uhm, but the PMU MSRs are all defined by the hardware, if you move the
> PMU MSRs around nothing will work.. *confusion*
>

You have a set of MSRs for real hardware (actually several sets)
discoverable by cpuid bits. You have another set of MSRs, using other
indexes, discoverable by more CPUID bits.

The new MSR indexes will always #GP on real hardware, but will be
trapped and serviced by kvm. In effect kvm will pretend to have a
hardware-like PMU but done according to its own specifications.

>> When emulating the hardware PMU we can be clever at times and allow
>> RDPMC not to trap.
>>
> Sure, not disagreeing with that, still the generic case is to trap, so
> let's first get that to work and then try and be smart :-)
>

That's what we're doing here.

--
error compiling committee.c: too many arguments to function

2010-06-22 10:10:49

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 13:06 +0300, Avi Kivity wrote:

> You have a set of MSRs for real hardware (actually several sets)
> discoverable by cpuid bits. You have another set of MSRs, using other
> indexes, discoverable by more CPUID bits.
>
> The new MSR indexes will always #GP on real hardware, but will be
> trapped and serviced by kvm. In effect kvm will pretend to have a
> hardware-like PMU but done according to its own specifications.

So what's the point? I thought the whole MSR interface thing was purely
to let other-o$ play with the PMU, but if you move it around like that
and make it KVM specific, nobody will find it...

2010-06-22 11:01:30

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On 06/22/2010 01:10 PM, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 13:06 +0300, Avi Kivity wrote:
>
>
>> You have a set of MSRs for real hardware (actually several sets)
>> discoverable by cpuid bits. You have another set of MSRs, using other
>> indexes, discoverable by more CPUID bits.
>>
>> The new MSR indexes will always #GP on real hardware, but will be
>> trapped and serviced by kvm. In effect kvm will pretend to have a
>> hardware-like PMU but done according to its own specifications.
>>
> So what's the point?

We already have infrastructure for save/restore around MSRs. They are
state-based (as opposed to function-based hypercalls), so it's easy to
live migrate by copying the MSR values.
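
Roughly, and only as a sketch under assumptions: PVPMU_MSR_BASE, struct
pvpmu_state and the pvpmu_* helpers below are hypothetical names, not
existing kvm code. Because every piece of configuration can be read back
through the same index it was written to, migration reduces to one loop
over the MSR list on the source and another on the destination:

```c
#include <stddef.h>
#include <stdint.h>

#define PVPMU_MSR_BASE 0x4b564e00u   /* hypothetical index range */
#define PVPMU_NR_MSRS  3

/* Per-vcpu paravirt PMU state, one 64-bit value per virtual MSR. */
struct pvpmu_state {
    uint64_t msr[PVPMU_NR_MSRS];
};

/* One (index, value) pair as carried in the migration stream. */
struct pvpmu_msr_entry {
    uint32_t index;
    uint64_t data;
};

/* Translate an MSR index to an array slot, or -1 if out of range. */
static int pvpmu_slot(uint32_t index)
{
    if (index < PVPMU_MSR_BASE || index >= PVPMU_MSR_BASE + PVPMU_NR_MSRS)
        return -1;
    return (int)(index - PVPMU_MSR_BASE);
}

int pvpmu_get_msr(const struct pvpmu_state *s, uint32_t index, uint64_t *data)
{
    int slot = pvpmu_slot(index);

    if (slot < 0)
        return -1;
    *data = s->msr[slot];
    return 0;
}

int pvpmu_set_msr(struct pvpmu_state *s, uint32_t index, uint64_t data)
{
    int slot = pvpmu_slot(index);

    if (slot < 0)
        return -1;
    s->msr[slot] = data;
    return 0;
}

/* Source side: snapshot every MSR; 'out' must hold PVPMU_NR_MSRS entries. */
size_t pvpmu_save(const struct pvpmu_state *src, struct pvpmu_msr_entry *out)
{
    size_t i;

    for (i = 0; i < PVPMU_NR_MSRS; i++) {
        out[i].index = PVPMU_MSR_BASE + (uint32_t)i;
        pvpmu_get_msr(src, out[i].index, &out[i].data);
    }
    return PVPMU_NR_MSRS;
}

/* Destination side: replay the snapshot through the normal write path. */
int pvpmu_restore(struct pvpmu_state *dst, const struct pvpmu_msr_entry *in,
                  size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (pvpmu_set_msr(dst, in[i].index, in[i].data) < 0)
            return -1;
    return 0;
}
```

With function-based hypercalls there is no equivalent read-back, so the
host would have to record separately what each call did in order to
recreate it on the destination.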

> I thought the whole MSR interface thing was purely
> to let other-o$ play with the PMU, but if you move it around like that
> and make it KVM specific, nobody will find it...
>

Other-os support will be achieved by emulating an existing interface.

--
error compiling committee.c: too many arguments to function

2010-06-23 01:13:26

Subject: Re: [PATCH V2 1/5] ara virt interface of perf to support kvm guest os statistics collection in guest os

On Tue, 2010-06-22 at 09:58 +0200, Jes Sorensen wrote:
> On 06/22/10 09:47, Zhang, Yanmin wrote:
> > On Tue, 2010-06-22 at 09:14 +0200, Jes Sorensen wrote:
> >> On 06/22/10 03:49, Zhang, Yanmin wrote:
> >>> On Mon, 2010-06-21 at 14:45 +0300, Avi Kivity wrote:
> >>> So I think above discussion is around how to expose PMU hardware to guest os. I will
> >>> also check this method after the para virt interface is done.
> >>
> >> You should be able to expose the counters as read-only to the guest. KVM
> >> allows you to specify whether or not a guest has read, write or
> >> read/write access. If you allowed read access of the counters that would
> >> save a fair bit of hypercalls.
> > Thanks. KVM is good in register access permission configuration. But things are not so
> > simple like that if we consider real running environment. Host kernel might schedule
> > guest os vcpu thread to other cpus, or other non-kvm processes might preempt the vcpu
> > thread on this cpu.
> >
> > To support such capability you said, we have to implement the direct exposition of PMU
> > hardware to guest os eventually.
>
> If the guest is rescheduled to another CPU, or you get a preemption, you
> have a VMEXIT. The vcpu thread will not migrate while it is running, so
> you can handle it while the VMEXIT is being serviced.
>
> Exposing the counters read-only would save a lot of overhead for sure.
> >> Question is if it is safe to drop overflow support?
> > Not safe. One of PMU hardware design objectives is to use interrupt or NMI to notify
> > software when an event counter overflows. Without overflow support, software needs to
> > poll the PMU registers in a loop. That is not good and consumes more cpu resources.
>
> Here is an idea, how about having the overflow NMI in the host trigger a
> flag that causes the PMU register read to trap and get special handling?
> That way you could propagate the overflow back down to the guest.
That doesn't resolve the issue that guest os software has to poll the registers.

2010-06-23 08:15:54