2020-08-12 05:14:01

by Like Xu

[permalink] [raw]
Subject: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

To emulate PMC counter for guest, KVM would create an
event on the host with 'exclude_guest=0, exclude_hv=0'
which simply makes no sense and is utterly broken.

To keep perf semantics consistent, any event created by
pmc_reprogram_counter() should both set exclude_hv and
exclude_host in the KVM context.

Message-ID: <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 67741d2a0308..6a30763a10d7 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -108,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_host = 1,
.exclude_user = exclude_user,
.exclude_kernel = exclude_kernel,
+ .exclude_hv = 1,
.config = config,
};

--
2.21.3


2020-08-12 10:26:47

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On 12/08/20 07:07, Like Xu wrote:
> To emulate PMC counter for guest, KVM would create an
> event on the host with 'exclude_guest=0, exclude_hv=0'
> which simply makes no sense and is utterly broken.
>
> To keep perf semantics consistent, any event created by
> pmc_reprogram_counter() should both set exclude_hv and
> exclude_host in the KVM context.
>
> Message-ID: <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/kvm/pmu.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 67741d2a0308..6a30763a10d7 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -108,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
> .exclude_host = 1,
> .exclude_user = exclude_user,
> .exclude_kernel = exclude_kernel,
> + .exclude_hv = 1,
> .config = config,
> };
>
>

x86 does not have a hypervisor privilege level, so it never uses
exclude_hv; exclude_host already excludes all root mode activity for
both ring0 and ring3.

Paolo

2020-08-12 11:14:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On Wed, Aug 12, 2020 at 12:25:43PM +0200, Paolo Bonzini wrote:
> On 12/08/20 07:07, Like Xu wrote:
> > To emulate PMC counter for guest, KVM would create an
> > event on the host with 'exclude_guest=0, exclude_hv=0'
> > which simply makes no sense and is utterly broken.
> >
> > To keep perf semantics consistent, any event created by
> > pmc_reprogram_counter() should both set exclude_hv and
> > exclude_host in the KVM context.
> >
> > Message-ID: <[email protected]>
> > Signed-off-by: Like Xu <[email protected]>
> > ---
> > arch/x86/kvm/pmu.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 67741d2a0308..6a30763a10d7 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -108,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
> > .exclude_host = 1,
> > .exclude_user = exclude_user,
> > .exclude_kernel = exclude_kernel,
> > + .exclude_hv = 1,
> > .config = config,
> > };
> >
> >
>
> x86 does not have a hypervisor privilege level, so it never uses

Arguably it does when Xen, but I don't think we support that, so *phew*.

> exclude_hv; exclude_host already excludes all root mode activity for
> both ring0 and ring3.

Right, but we want to tighten the permission checks and not excluding_hv
is just sloppy.

That said; the exclude_host / exclude_guest thing is a giant trainwreck
and I'm really not sure what to do about it.

The thing is, we very much do not want to allow unpriv user to be able
to create: exclude_host=1, exclude_guest=0 counters (they currently
can).

So we really want to add:

if ((!exclude_host || !exclude_guest || !exclude_hv) && !perf_allow_kernel())
return -EACCESS;

But the problem is, they were added late, so lots of userspace will not
be setting those fields (or might not have even known about them), so we
got to somehow deal with:

exclude_host == exclude_guest == 0

And because of that, we now also have genius code like
(intel_set_masks, amd_core_hw_config has even 'funnier' code):

if (event->attr.exclude_host)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
if (event->attr.exclude_guest)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);

Which is just confusing and bad, that really should've been:

if (!event->attr.exclude_host)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
if (!event->attr.exclude_guest)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);

:-(

Also, exclude_host is really poorly defined:

https://lkml.kernel.org/r/[email protected]

"Suppose we have nested virt:

L0-hv
|
G0/L1-hv
|
G1

And we're running in G0, then:

- 'exclude_hv' would exclude L0 events
- 'exclude_host' would ... exclude L1-hv events?
- 'exclude_guest' would ... exclude G1 events?

Then the next question is, if G0 is a host, does the L1-hv run in
G0 userspace or G0 kernel space?

I was assuming G0 userspace would not include anything L1 (kvm is a
kernel module after all), but what do I know."

The way it is implemented, you basically have to always set
exclude_host=0, even if there is no virt at all and you want to measure
your own userspace thing -- which is just weird.

Meanwhile ARM64 couldn't quite figure out what it was all supposed to be
either and also implemented something -- and i've not tried to
understand what exactly, but hopefully compatible enough that we're not
in an even worse corner.



So on the one hand we're now leaking all sorts due to lack of permission
checks, on the other hand we can't fix it because we're not allowed to
break userspace :-(


I suppose the 'best' option at this point is something like:

/*
* comment that explains the trainwreck.
*/
if (!exclude_host && !exclude_guest)
exclude_guest = 1;

if ((!exclude_hv || !exclude_guest) && !perf_allow_kernel())
return -EPERM;

But that takes away the possibility of actually having:
'exclude_host=0, exclude_guest=0' to create an event that measures both,
which also sucks.


2020-08-12 11:34:05

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On 12/08/20 13:11, [email protected] wrote:
>> x86 does not have a hypervisor privilege level, so it never uses
>
> Arguably it does when Xen, but I don't think we support that, so *phew*.

Yeah, I suppose you could imagine having paravirtualized perf counters
where the Xen privileged domain could ask Xen to run perf counters on
itself.

>> exclude_hv; exclude_host already excludes all root mode activity for
>> both ring0 and ring3.
>
> Right, but we want to tighten the permission checks and not excluding_hv
> is just sloppy.

I would just document that it's ignored as it doesn't make sense. ARM64
does that too, for new processors where the kernel is not itself split
between supervisor and hypervisor privilege levels.

There are people that are trying to run Linux-based firmware and have
SMM handlers as part of the kernel. Perhaps they could use exclude_hv
to exclude the SMM handlers from perf (if including them is possible at
all).

> The thing is, we very much do not want to allow unpriv user to be able
> to create: exclude_host=1, exclude_guest=0 counters (they currently
> can).

That would be the case of an unprivileged user that wants to measure
performance of its guests. It's a scenario that makes a lot of sense,
are you worried about side channels? Can perf-events on guests leak
more about the host than perf-events on a random userspace program?

> Also, exclude_host is really poorly defined:
>
> https://lkml.kernel.org/r/[email protected]
>
> "Suppose we have nested virt:
>
> L0-hv
> |
> G0/L1-hv
> |
> G1
>
> And we're running in G0, then:
>
> - 'exclude_hv' would exclude L0 events
> - 'exclude_host' would ... exclude L1-hv events?
> - 'exclude_guest' would ... exclude G1 events?

From the point of view of G0, L0 *does not exist at all*. You just
cannot see L0 events if you're running in G0.

exclude_host/exclude_guest are the right definition.

> Then the next question is, if G0 is a host, does the L1-hv run in
> G0 userspace or G0 kernel space?

It's mostly kernel, but sometimes you're interested in events from QEMU
or whoever else has opened /dev/kvm. In that case you care about G0
userspace too.

> The way it is implemented, you basically have to always set
> exclude_host=0, even if there is no virt at all and you want to measure
> your own userspace thing -- which is just weird.

I understand regretting having exclude_guest that way; include_guest
(defaulting to 0!) would have made more sense. But defaulting to
exclude_host==0 makes sense: if there is no virt at all, memset(0) does
the right thing so it does not seem weird to me.

> I suppose the 'best' option at this point is something like:
>
> /*
> * comment that explains the trainwreck.
> */
> if (!exclude_host && !exclude_guest)
> exclude_guest = 1;
>
> if ((!exclude_hv || !exclude_guest) && !perf_allow_kernel())
> return -EPERM;
>
> But that takes away the possibility of actually having:
> 'exclude_host=0, exclude_guest=0' to create an event that measures both,
> which also sucks.

In fact both of the above "if"s suck. :(

Paolo

2020-08-12 13:00:31

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On 2020/8/12 19:32, Paolo Bonzini wrote:
> On 12/08/20 13:11, [email protected] wrote:
>>> x86 does not have a hypervisor privilege level, so it never uses
>> Arguably it does when Xen, but I don't think we support that, so *phew*.
> Yeah, I suppose you could imagine having paravirtualized perf counters
> where the Xen privileged domain could ask Xen to run perf counters on
> itself.
>
>>> exclude_hv; exclude_host already excludes all root mode activity for
>>> both ring0 and ring3.
>> Right, but we want to tighten the permission checks and not excluding_hv
>> is just sloppy.
> I would just document that it's ignored as it doesn't make sense. ARM64
> does that too, for new processors where the kernel is not itself split
> between supervisor and hypervisor privilege levels.
>
> There are people that are trying to run Linux-based firmware and have
> SMM handlers as part of the kernel. Perhaps they could use exclude_hv
> to exclude the SMM handlers from perf (if including them is possible at
> all).
Hi Paolo,

My proposal is to define:
the "hypervisor privilege levels" events in the KVM/x86 context as
all the host kernel events plus /dev/kvm user space events.

If we add ".exclude_hv = 1" in the pmc_reprogram_counter(),
do you see any side effect to cover the above usages?

The fact that exclude_hv has never been used in x86 does help
the generic perf code to handle permission checks in a more concise way.

Thanks,
Like Xu
>> The thing is, we very much do not want to allow unpriv user to be able
>> to create: exclude_host=1, exclude_guest=0 counters (they currently
>> can).
> That would be the case of an unprivileged user that wants to measure
> performance of its guests. It's a scenario that makes a lot of sense,
> are you worried about side channels? Can perf-events on guests leak
> more about the host than perf-events on a random userspace program?
>
>> Also, exclude_host is really poorly defined:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> "Suppose we have nested virt:
>>
>> L0-hv
>> |
>> G0/L1-hv
>> |
>> G1
>>
>> And we're running in G0, then:
>>
>> - 'exclude_hv' would exclude L0 events
>> - 'exclude_host' would ... exclude L1-hv events?
>> - 'exclude_guest' would ... exclude G1 events?
> From the point of view of G0, L0 *does not exist at all*. You just
> cannot see L0 events if you're running in G0.
>
> exclude_host/exclude_guest are the right definition.
>
>> Then the next question is, if G0 is a host, does the L1-hv run in
>> G0 userspace or G0 kernel space?
> It's mostly kernel, but sometimes you're interested in events from QEMU
> or whoever else has opened /dev/kvm. In that case you care about G0
> userspace too.
>
>> The way it is implemented, you basically have to always set
>> exclude_host=0, even if there is no virt at all and you want to measure
>> your own userspace thing -- which is just weird.
> I understand regretting having exclude_guest that way; include_guest
> (defaulting to 0!) would have made more sense. But defaulting to
> exclude_host==0 makes sense: if there is no virt at all, memset(0) does
> the right thing so it does not seem weird to me.
>
>> I suppose the 'best' option at this point is something like:
>>
>> /*
>> * comment that explains the trainwreck.
>> */
>> if (!exclude_host && !exclude_guest)
>> exclude_guest = 1;
>>
>> if ((!exclude_hv || !exclude_guest) && !perf_allow_kernel())
>> return -EPERM;
>>
>> But that takes away the possibility of actually having:
>> 'exclude_host=0, exclude_guest=0' to create an event that measures both,
>> which also sucks.
> In fact both of the above "if"s suck. :(
>
> Paolo
>

2020-08-12 13:07:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On 12/08/20 14:56, Xu, Like wrote:
>
> My proposal is to define:
> the "hypervisor privilege levels" events in the KVM/x86 context as
> all the host kernel events plus /dev/kvm user space events.

What are "/dev/kvm user space events"? In any case, this patch should
be included only in the series that adds exclude_hv support in arch/x86.

Paolo

> If we add ".exclude_hv = 1" in the pmc_reprogram_counter(),
> do you see any side effect to cover the above usages?
>
> The fact that exclude_hv has never been used in x86 does help
> the generic perf code to handle permission checks in a more concise way.

2020-08-12 13:15:18

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On 2020/8/12 21:04, Paolo Bonzini wrote:
> On 12/08/20 14:56, Xu, Like wrote:
>> My proposal is to define:
>> the "hypervisor privilege levels" events in the KVM/x86 context as
>> all the host kernel events plus /dev/kvm user space events.
> What are "/dev/kvm user space events"? In any case, this patch should
> be included only in the series that adds exclude_hv support in arch/x86.
The exclude_kernel events from the QEMU or whoever else has opened /dev/kvm.

Do you see any (patches) gap if we map
the exclude_host events into exclude_hv events naturally ?

Thanks,
Like Xu
> Paolo
>
>> If we add ".exclude_hv = 1" in the pmc_reprogram_counter(),
>> do you see any side effect to cover the above usages?
>>
>> The fact that exclude_hv has never been used in x86 does help
>> the generic perf code to handle permission checks in a more concise way.

2020-08-12 13:33:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On Wed, Aug 12, 2020 at 01:32:58PM +0200, Paolo Bonzini wrote:
> On 12/08/20 13:11, [email protected] wrote:
> > Right, but we want to tighten the permission checks and not excluding_hv
> > is just sloppy.
>
> I would just document that it's ignored as it doesn't make sense. ARM64
> does that too, for new processors where the kernel is not itself split
> between supervisor and hypervisor privilege levels.

This isn't about x86, I want these checks in generic code. We have the
flag, it needs checking.

unpriv users have no busniess getting anything from a possible hv.

> > The thing is, we very much do not want to allow unpriv user to be able
> > to create: exclude_host=1, exclude_guest=0 counters (they currently
> > can).
>
> That would be the case of an unprivileged user that wants to measure
> performance of its guests. It's a scenario that makes a lot of sense,
> are you worried about side channels? Can perf-events on guests leak
> more about the host than perf-events on a random userspace program?

An unpriv user can run guests?

> > Also, exclude_host is really poorly defined:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > "Suppose we have nested virt:
> >
> > L0-hv
> > |
> > G0/L1-hv
> > |
> > G1
> >
> > And we're running in G0, then:
> >
> > - 'exclude_hv' would exclude L0 events
> > - 'exclude_host' would ... exclude L1-hv events?
> > - 'exclude_guest' would ... exclude G1 events?
>
> From the point of view of G0, L0 *does not exist at all*. You just
> cannot see L0 events if you're running in G0.

On x86, probably, in general, I'm not at all sure, we have that
exclude_hv flag after all.

> exclude_host/exclude_guest are the right definition.

For what? I still think exclude_host is absolute shit. If you set it,
you'll not get anything even without virt.

Run a native linux kernel, no kvm loaded, create a counter with
exclude_host=1 and you'll get nothing, that's just really confusing IMO.
There is no host, so excluding it should not affect anything.

> > Then the next question is, if G0 is a host, does the L1-hv run in
> > G0 userspace or G0 kernel space?
>
> It's mostly kernel, but sometimes you're interested in events from QEMU
> or whoever else has opened /dev/kvm. In that case you care about G0
> userspace too.

I really don't think userspace helpers should be consideed part of
the host, but whatever.

> > The way it is implemented, you basically have to always set
> > exclude_host=0, even if there is no virt at all and you want to measure
> > your own userspace thing -- which is just weird.
>
> I understand regretting having exclude_guest that way; include_guest
> (defaulting to 0!) would have made more sense. But defaulting to
> exclude_host==0 makes sense: if there is no virt at all, memset(0) does
> the right thing so it does not seem weird to me.

Sure, but having exclude_host affect anything outside of kvm is still
dodgy as heck.

> > I suppose the 'best' option at this point is something like:
> >
> > /*
> > * comment that explains the trainwreck.
> > */
> > if (!exclude_host && !exclude_guest)
> > exclude_guest = 1;
> >
> > if ((!exclude_hv || !exclude_guest) && !perf_allow_kernel())
> > return -EPERM;
> >
> > But that takes away the possibility of actually having:
> > 'exclude_host=0, exclude_guest=0' to create an event that measures both,
> > which also sucks.
>
> In fact both of the above "if"s suck. :(

If, as you seem to imply above, that unpriv users can create guests,
then maybe so, but if I look at /dev/kvm it seems to have 0660
permissions and thus really requires privileges.

2020-08-12 13:52:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event

On 12/08/20 15:31, [email protected] wrote:
> This isn't about x86, I want these checks in generic code. We have the
> flag, it needs checking.
>
> unpriv users have no busniess getting anything from a possible hv.

Ah ok if it's generic that sounds good.

>> That would be the case of an unprivileged user that wants to measure
>> performance of its guests.
>
> An unpriv user can run guests?

Sure, on most distros /dev/kvm is either 0666 or 0660, usually with
group kvm if it's 0660. To run a guest you might have to be in group
kvm, but it does not require either root or CAP_SOMETHING.

>>> Also, exclude_host is really poorly defined:
>>>
>>> https://lkml.kernel.org/r/[email protected]
>>>
>>> "Suppose we have nested virt:
>>>
>>> L0-hv
>>> |
>>> G0/L1-hv
>>> |
>>> G1
>>>
>>> And we're running in G0, then:
>>>
>>> - 'exclude_hv' would exclude L0 events
>>> - 'exclude_host' would ... exclude L1-hv events?
>>> - 'exclude_guest' would ... exclude G1 events?
>>
>> From the point of view of G0, L0 *does not exist at all*. You just
>> cannot see L0 events if you're running in G0.
>
> On x86, probably, in general, I'm not at all sure, we have that
> exclude_hv flag after all.

No, and you can quote me on that: exclude_hv is *not* about excluding
the hypervisor from a guest. It's about excluding the part of _your_
kernel which runs in a "more privileged" level (EL2 on ARM, HV on POWER).

>> exclude_host/exclude_guest are the right definition.
>
> For what?

I meant in the nested virt case you drew above.

> I still think exclude_host is absolute shit. If you set it,
> you'll not get anything even without virt.

If you dislike the name you can change it to only_guest. Anybody who
does not do virt just leaves it zero and is happy. Anybody who does not
do virt and sets it gets what they expect (or deserve). But this
definition is the same as exclude_host, and it's the correct one.

> If, as you seem to imply above, that unpriv users can create guests,
> then maybe so, but if I look at /dev/kvm it seems to have 0660
> permissions and thus really requires privileges.

Since you can be non-root and you don't need any capability either, it
doesn't require what the kernel considers to be privilege.

Paolo