2019-06-26 09:44:27

by Wanpeng Li

Subject: cputime takes cstate into consideration

Hi all,

After exposing mwait/monitor to a kvm guest, the guest can put the
physical cpu into a deeper cstate through the mwait instruction. However,
the top command on the host still observes 100% cpu utilization, since the
qemu process is running even while a guest that has the power management
capability executes mwait. We can in fact observe with powertop on the
host that the physical cpu has already entered a deeper cstate. Could we
take cstate into consideration when accounting cputime etc?

Regards,
Wanpeng Li
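
For readers following along: the 100% figure comes from how tools such as
top turn /proc/stat tick counters into a busy fraction. The sketch below is
only an illustration of that arithmetic, not top's actual code. The helper
names and the sample lines are made up for this note. It assumes the usual
ten-column cpu line, where guest time is already included in user time.

```python
def parse_cpu_line(line):
    """Split one 'cpuN ...' line of /proc/stat into named tick counters."""
    names = ("user", "nice", "system", "idle", "iowait",
             "irq", "softirq", "steal", "guest", "guest_nice")
    fields = line.split()
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))


def utilization(prev, cur):
    """Busy fraction between two samples of the same cpu line."""
    # guest and guest_nice are already contained in user and nice,
    # so they are left out of the total to avoid counting them twice.
    counted = ("user", "nice", "system", "idle", "iowait",
               "irq", "softirq", "steal")
    delta = {k: cur[k] - prev[k] for k in counted}
    total = sum(delta.values())
    if total == 0:
        return 0.0
    return 1.0 - (delta["idle"] + delta["iowait"]) / total


# Made-up samples: 100 ticks elapse, all charged as guest (and so user)
# time, no matter whether the guest was computing or sitting in mwait.
before = parse_cpu_line("cpu3 1000 0 50 200 0 0 0 0 900 0")
after = parse_cpu_line("cpu3 1100 0 50 200 0 0 0 0 1000 0")
print(utilization(before, after))
```

Nothing in these counters says whether the physical cpu was in a deeper
cstate during the interval, which is the gap the question is about.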


2019-06-26 10:14:42

by Peter Zijlstra

Subject: Re: cputime takes cstate into consideration

On Wed, Jun 26, 2019 at 05:43:55PM +0800, Wanpeng Li wrote:
> Hi all,
>
> After exposing mwait/monitor into kvm guest, the guest can make
> physical cpu enter deeper cstate through mwait instruction, however,
> the top command on host still observe 100% cpu utilization since qemu
> process is running even though guest who has the power management
> capability executes mwait. Actually we can observe the physical cpu
> has already enter deeper cstate by powertop on host. Could we take
> cstate into consideration when accounting cputime etc?

Either we account runtime on the CPU itself, in which case it will not
be in a C state due to actually running an interrupt that does
accounting, or we do it remote (NOHZ_FULL case) and there is no way to
know what C state, if any, that CPU is in.


2019-06-26 10:34:12

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Wanpeng Li wrote:
> After exposing mwait/monitor into kvm guest, the guest can make
> physical cpu enter deeper cstate through mwait instruction, however,
> the top command on host still observe 100% cpu utilization since qemu
> process is running even though guest who has the power management
> capability executes mwait. Actually we can observe the physical cpu
> has already enter deeper cstate by powertop on host. Could we take
> cstate into consideration when accounting cputime etc?

If MWAIT can be used inside the guest then the host cannot distinguish
between the guest executing and the guest being stuck in mwait.

It'd need to poll the power monitoring MSRs on every occasion where the
accounting happens.

This completely falls apart when you have a zero exit guest (think
NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
the per CPU MSRs.

I assume a lot of people will be happy about all that :)

Thanks,

tglx
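
A minimal sketch of the arithmetic such polling would involve, assuming two
samples of the TSC and of one C-state residency counter taken on the same
CPU, and assuming the residency counter ticks at the TSC rate. The function
names are hypothetical, and the MSR reads themselves (the expensive part
discussed above) are not shown.

```python
MASK64 = (1 << 64) - 1


def counter_delta(prev, cur):
    """Difference of two 64-bit counter reads, tolerating wraparound."""
    return (cur - prev) & MASK64


def idle_fraction(prev, cur):
    """Fraction of the interval spent in the sampled C state.

    prev and cur are (tsc, residency) pairs read on the same CPU.
    """
    tsc = counter_delta(prev[0], cur[0])
    resid = counter_delta(prev[1], cur[1])
    if tsc == 0:
        return 0.0
    # Clamp in case the two counters were not read at the same instant.
    return min(resid / tsc, 1.0)


# Made-up counter values: 2,000,000 TSC ticks elapsed, 1,500,000 of them
# spent in the sampled C state.
print(idle_fraction((1_000_000, 400_000), (3_000_000, 1_900_000)))
```

The calculation is trivial. The cost is in getting the samples, since they
have to be read on the CPU in question.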

2019-06-26 14:55:45

by Konrad Rzeszutek Wilk

Subject: Re: cputime takes cstate into consideration

On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > After exposing mwait/monitor into kvm guest, the guest can make
> > physical cpu enter deeper cstate through mwait instruction, however,
> > the top command on host still observe 100% cpu utilization since qemu
> > process is running even though guest who has the power management
> > capability executes mwait. Actually we can observe the physical cpu
> > has already enter deeper cstate by powertop on host. Could we take
> > cstate into consideration when accounting cputime etc?
>
> If MWAIT can be used inside the guest then the host cannot distinguish
> between execution and stuck in mwait.
>
> It'd need to poll the power monitoring MSRs on every occasion where the
> accounting happens.
>
> This completely falls apart when you have zero exit guest. (think
> NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> the per CPU MSRs.
>
> I assume a lot of people will be happy about all that :)

There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
counters (in the host) to sample the guest and construct a better
accounting idea of what the guest does. That way the dashboard
from the host would not show 100% CPU utilization.

But the patches that Marcelo posted (" cpuidle-haltpoll driver") in
"solves" the problem for Linux. That is the guest wants awesome latency and
one way was to expose MWAIT to the guest, or just tweak the guest to do the
idling a bit different.

Marcelo's patches are all good for Linux, but Windows is still an issue.

Ankur, would you be OK sharing some of your ideas?
>
> Thanks,
>
> tglx
>

2019-06-26 16:17:38

by Peter Zijlstra

Subject: Re: cputime takes cstate into consideration

On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > > After exposing mwait/monitor into kvm guest, the guest can make
> > > physical cpu enter deeper cstate through mwait instruction, however,
> > > the top command on host still observe 100% cpu utilization since qemu
> > > process is running even though guest who has the power management
> > > capability executes mwait. Actually we can observe the physical cpu
> > > has already enter deeper cstate by powertop on host. Could we take
> > > cstate into consideration when accounting cputime etc?
> >
> > If MWAIT can be used inside the guest then the host cannot distinguish
> > between execution and stuck in mwait.
> >
> > It'd need to poll the power monitoring MSRs on every occasion where the
> > accounting happens.
> >
> > This completely falls apart when you have zero exit guest. (think
> > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > the per CPU MSRs.
> >
> > I assume a lot of people will be happy about all that :)
>
> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> counters (in the host) to sample the guest and construct a better
> accounting idea of what the guest does. That way the dashboard
> from the host would not show 100% CPU utilization.

But then you generate extra noise and vmexits on those cpus, just to get
this accounting sorted, which sounds like a bad trade.

2019-06-26 18:32:29

by Konrad Rzeszutek Wilk

Subject: Re: cputime takes cstate into consideration

On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> > > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > > > After exposing mwait/monitor into kvm guest, the guest can make
> > > > physical cpu enter deeper cstate through mwait instruction, however,
> > > > the top command on host still observe 100% cpu utilization since qemu
> > > > process is running even though guest who has the power management
> > > > capability executes mwait. Actually we can observe the physical cpu
> > > > has already enter deeper cstate by powertop on host. Could we take
> > > > cstate into consideration when accounting cputime etc?
> > >
> > > If MWAIT can be used inside the guest then the host cannot distinguish
> > > between execution and stuck in mwait.
> > >
> > > It'd need to poll the power monitoring MSRs on every occasion where the
> > > accounting happens.
> > >
> > > This completely falls apart when you have zero exit guest. (think
> > > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > > the per CPU MSRs.
> > >
> > > I assume a lot of people will be happy about all that :)
> >
> > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > counters (in the host) to sample the guest and construct a better
> > accounting idea of what the guest does. That way the dashboard
> > from the host would not show 100% CPU utilization.
>
> But then you generate extra noise and vmexits on those cpus, just to get
> this accounting sorted, which sounds like a bad trade.

Considering that the CPUs aren't doing anything, if you send the IPIs
"only" 100 times a second, say, the cost would be tiny but would give you
a big benefit in properly accounting the guests.

But perhaps there are other ways too to "snoop" if a guest is sitting on
an MWAIT?

2019-06-26 18:41:45

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > > counters (in the host) to sample the guest and construct a better
> > > accounting idea of what the guest does. That way the dashboard
> > > from the host would not show 100% CPU utilization.
> >
> > But then you generate extra noise and vmexits on those cpus, just to get
> > this accounting sorted, which sounds like a bad trade.
>
> Considering that the CPUs aren't doing anything and if you do say the
> IPIs "only" 100/second - that would be so small but give you a big benefit
> in properly accounting the guests.

The host doesn't know what the guest CPUs are doing. And if you have a full
zero exit setup and the guest is computing stuff or doing that network
offloading thing then they will notice the 100/s vmexits and complain.

> But perhaps there are other ways too to "snoop" if a guest is sitting on
> an MWAIT?

No idea.

Thanks,

tglx


2019-06-26 18:56:30

by KarimAllah Ahmed

Subject: Re: cputime takes cstate into consideration

On Wed, 2019-06-26 at 20:41 +0200, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Konrad Rzeszutek Wilk wrote:
> >
> > On Wed, Jun 26, 2019 at 06:16:08PM +0200, Peter Zijlstra wrote:
> > >
> > > On Wed, Jun 26, 2019 at 10:54:13AM -0400, Konrad Rzeszutek Wilk wrote:
> > > >
> > > > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > > > counters (in the host) to sample the guest and construct a better
> > > > accounting idea of what the guest does. That way the dashboard
> > > > from the host would not show 100% CPU utilization.
> > >
> > > But then you generate extra noise and vmexits on those cpus, just to get
> > > this accounting sorted, which sounds like a bad trade.
> >
> > Considering that the CPUs aren't doing anything and if you do say the
> > IPIs "only" 100/second - that would be so small but give you a big benefit
> > in properly accounting the guests.
>
> The host doesn't know what the guest CPUs are doing. And if you have a full
> zero exit setup and the guest is computing stuff or doing that network
> offloading thing then they will notice the 100/s vmexits and complain.

If the host is completely in nohz_full mode and the pCPU is dedicated to a
single vCPU/task (and the guest is 100% CPU bound and never exits), you would
still be ticking in the host once every second for housekeeping, right?
Wouldn't updating the mwait-time once a second be enough here?

>
> >
> > But perhaps there are other ways too to "snoop" if a guest is sitting on
> > an MWAIT?
>
> No idea.
>
> Thanks,
>
> tglx
>
>



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Ralf Herbrich
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


2019-06-26 19:00:00

by KarimAllah Ahmed

Subject: Re: cputime takes cstate into consideration

On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 26, 2019 at 12:33:30PM +0200, Thomas Gleixner wrote:
> >
> > On Wed, 26 Jun 2019, Wanpeng Li wrote:
> > >
> > > After exposing mwait/monitor into kvm guest, the guest can make
> > > physical cpu enter deeper cstate through mwait instruction, however,
> > > the top command on host still observe 100% cpu utilization since qemu
> > > process is running even though guest who has the power management
> > > capability executes mwait. Actually we can observe the physical cpu
> > > has already enter deeper cstate by powertop on host. Could we take
> > > cstate into consideration when accounting cputime etc?
> >
> > If MWAIT can be used inside the guest then the host cannot distinguish
> > between execution and stuck in mwait.
> >
> > It'd need to poll the power monitoring MSRs on every occasion where the
> > accounting happens.
> >
> > This completely falls apart when you have zero exit guest. (think
> > NOHZ_FULL). Then you'd have to bring the guest out with an IPI to access
> > the per CPU MSRs.
> >
> > I assume a lot of people will be happy about all that :)
>
> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> counters (in the host) to sample the guest and construct a better
> accounting idea of what the guest does. That way the dashboard
> from the host would not show 100% CPU utilization.

You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF 
MSRs for that. (sorry I got distracted and forgot to send the patch)
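
As a rough sketch of the MPERF variant (not the patch mentioned above,
which is not part of this thread): assuming MPERF only advances while the
CPU is in C0 and ticks at the TSC rate, the ratio of the two deltas
approximates the busy fraction, and the guest time charged by the host can
be split accordingly. The function names and numbers are illustrative only.

```python
def c0_fraction(d_mperf, d_tsc):
    """Approximate share of the interval spent in C0 (not idle)."""
    if d_tsc <= 0:
        return 0.0
    return min(d_mperf / d_tsc, 1.0)


def split_guest_time(guest_ns, d_mperf, d_tsc):
    """Split charged guest time into (running_ns, idle_in_guest_ns)."""
    running = int(guest_ns * c0_fraction(d_mperf, d_tsc))
    return running, guest_ns - running


# Made-up deltas: one second charged to the guest, but MPERF advanced for
# only a quarter of the TSC ticks in that interval.
print(split_guest_time(1_000_000_000, 500_000_000, 2_000_000_000))
```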

>
> But the patches that Marcelo posted (" cpuidle-haltpoll driver") in
> "solves" the problem for Linux. That is the guest wants awesome latency and
> one way was to expose MWAIT to the guest, or just tweak the guest to do the
> idling a bit different.
>
> Marcelo patches are all good for Linux, but Windows is still an issue.
>
> Ankur, would you be OK sharing some of your ideas?
> >
> >
> > Thanks,
> >
> > tglx
> >





2019-06-26 19:19:59

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 20:41 +0200, Thomas Gleixner wrote:
> > The host doesn't know what the guest CPUs are doing. And if you have a full
> > zero exit setup and the guest is computing stuff or doing that network
> > offloading thing then they will notice the 100/s vmexits and complain.
>
> If the host is completely in no_full_hz mode and the pCPU is dedicated to a 
> single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> still be ticking in the host once every second for housekeeping, right? Would 
> not updating the mwait-time once a second be enough here?

It may be that it 'still' does that, but the goal is to fix that by doing
remote accounting. I think Frederic is pretty close to that.

Then your 'let's do accounting' on the housekeeping tick falls apart.

And even with that tick every second, the nohz full people take every
shortcut to go back into the guest ASAP. Doing a dozen MSR reads will
surely not find many enthusiastic supporters.

Thanks,

tglx

2019-06-26 19:22:50

by Peter Zijlstra

Subject: Re: cputime takes cstate into consideration

On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:

> If the host is completely in no_full_hz mode and the pCPU is dedicated to a
> single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> still be ticking in the host once every second for housekeeping, right? Would
> not updating the mwait-time once a second be enough here?

People are trying very hard to get rid of that remnant tick. Let's not
add dependencies to it.

IMO this is a really stupid issue, 100% time is correct if the guest
does idle in pinned vcpu mode.

2019-06-26 19:24:47

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > counters (in the host) to sample the guest and construct a better
> > accounting idea of what the guest does. That way the dashboard
> > from the host would not show 100% CPU utilization.
>
> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF 
> MSRs for that. (sorry I got distracted and forgot to send the patch)

Sure, but then you conflict with the other people who fight tooth and nail
over every single performance counter.

Thanks,

tglx

2019-06-26 19:28:45

by KarimAllah Ahmed

Subject: Re: cputime takes cstate into consideration

On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
>
> >
> > If the host is completely in no_full_hz mode and the pCPU is dedicated to a 
> > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > still be ticking in the host once every second for housekeeping, right? Would 
> > not updating the mwait-time once a second be enough here?
>
> People are trying very hard to get rid of that remnant tick. Lets not
> add dependencies to it.
>
> IMO this is a really stupid issue, 100% time is correct if the guest
> does idle in pinned vcpu mode.

One use case for proper accounting (obviously for a slightly relaxed definition
of *proper*) is *external* monitoring of CPU utilization for a scaling group
(i.e. more VMs will be launched when you reach a certain CPU utilization).
These external monitoring tools need to account CPU utilization properly.





2019-06-26 19:31:17

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Peter Zijlstra wrote:

> On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
>
> > If the host is completely in no_full_hz mode and the pCPU is dedicated to a
> > single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> > still be ticking in the host once every second for housekeeping, right? Would
> > not updating the mwait-time once a second be enough here?
>
> People are trying very hard to get rid of that remnant tick. Lets not
> add dependencies to it.
>
> IMO this is a really stupid issue, 100% time is correct if the guest
> does idle in pinned vcpu mode.

Correct. We are going to see the same issue with UMWAIT/UMONITOR. If the
timeout is set long enough by the admin, then a task can stay in user mode
UMWAIT for a very long time. And we're going to account that as user time.

That's not any different with a guest.

You might go there and establish a shared page with the guest where the
guest drops its internal accounting information. For trusted guests that
might be a good approximation. For untrusted ones not so much, but then you
just have to say, you occupy the CPU 100% in guest mode. If you idle there,
that's not my problem.

Thanks,

tglx
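
To make the shared page idea concrete, here is a small sketch under stated
assumptions. The record layout (a sequence counter followed by cumulative
guest idle nanoseconds), the function names and the clamping policy are all
invented for illustration. Nothing like this interface is proposed in the
thread.

```python
import struct

# Hypothetical record at offset 0 of the shared page, little-endian:
# sequence counter, then cumulative guest idle time in nanoseconds.
RECORD = struct.Struct("<QQ")


def guest_publish(page, seq, idle_ns):
    """Guest side: store a new idle total and return the new sequence."""
    # An odd sequence marks the record as being updated.
    RECORD.pack_into(page, 0, seq + 1, idle_ns)
    # An even sequence marks it as stable again.
    RECORD.pack_into(page, 0, seq + 2, idle_ns)
    return seq + 2


def host_read_idle(page):
    """Host side: return the idle total, or None if an update is in flight."""
    seq, idle_ns = RECORD.unpack_from(page, 0)
    if seq & 1:
        return None
    return idle_ns


def busy_time(charged_ns, idle_prev, idle_cur):
    """Guest time minus reported idle time, never trusting the guest fully."""
    reported = idle_cur - idle_prev
    # Clamp so a buggy or hostile guest cannot report negative idle time
    # or more idle time than the host charged to it.
    reported = max(0, min(reported, charged_ns))
    return charged_ns - reported


page = bytearray(4096)
seq = guest_publish(page, 0, 600_000_000)
idle = host_read_idle(page)
print(busy_time(1_000_000_000, 0, idle))
```

The clamp only bounds the damage. An untrusted guest can still claim any
idle fraction it likes, which is why the approximation is only good for
trusted guests.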

2019-06-26 19:33:25

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> >
> > >
> > > If the host is completely in no_full_hz mode and the pCPU is dedicated to a 
> > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would 
> > > still be ticking in the host once every second for housekeeping, right? Would 
> > > not updating the mwait-time once a second be enough here?
> >
> > People are trying very hard to get rid of that remnant tick. Lets not
> > add dependencies to it.
> >
> > IMO this is a really stupid issue, 100% time is correct if the guest
> > does idle in pinned vcpu mode.
>
> One use case for proper accounting (obviously for a slightly relaxed definition 
> or *proper*) is *external* monitoring of CPU utilization for scaling group
> (i.e. more VMs will be launched when you reach a certain CPU utilization).
> These external monitoring tools needs to account CPU utilization properly.

Then you need a trusted cooperative guest and that can give you the
information. If it doesn't, then either do not give it MWAIT or the scheme
does not work.

If you can afford to waste performance counters for that, you can do that
from user space.

There are lots of options, but the kernel won't choose one because it's
guaranteed to be the wrong choice for most scenarios.

Thanks,

tglx

2019-06-26 20:04:16

by Peter Zijlstra

Subject: Re: cputime takes cstate into consideration

On Wed, Jun 26, 2019 at 07:27:35PM +0000, Raslan, KarimAllah wrote:
> On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> >
> > >
> > > If the host is completely in no_full_hz mode and the pCPU is dedicated to a
> > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> > > still be ticking in the host once every second for housekeeping, right? Would
> > > not updating the mwait-time once a second be enough here?
> >
> > People are trying very hard to get rid of that remnant tick. Lets not
> > add dependencies to it.
> >
> > IMO this is a really stupid issue, 100% time is correct if the guest
> > does idle in pinned vcpu mode.
>
> One use case for proper accounting (obviously for a slightly relaxed definition
> or *proper*) is *external* monitoring of CPU utilization for scaling group
> (i.e. more VMs will be launched when you reach a certain CPU utilization).
> These external monitoring tools needs to account CPU utilization properly.

That's utter nonsense; what's the point of exposing mwait to guests if
you're not doing vcpu pinning? For overloaded guests mwait makes no
sense whatsoever.

2019-06-26 20:10:24

by Thomas Gleixner

Subject: Re: cputime takes cstate into consideration

On Wed, 26 Jun 2019, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 07:27:35PM +0000, Raslan, KarimAllah wrote:
> > On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> > >
> > > >
> > > > If the host is completely in no_full_hz mode and the pCPU is dedicated to a
> > > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> > > > still be ticking in the host once every second for housekeeping, right? Would
> > > > not updating the mwait-time once a second be enough here?
> > >
> > > People are trying very hard to get rid of that remnant tick. Lets not
> > > add dependencies to it.
> > >
> > > IMO this is a really stupid issue, 100% time is correct if the guest
> > > does idle in pinned vcpu mode.
> >
> > One use case for proper accounting (obviously for a slightly relaxed definition
> > or *proper*) is *external* monitoring of CPU utilization for scaling group
> > (i.e. more VMs will be launched when you reach a certain CPU utilization).
> > These external monitoring tools needs to account CPU utilization properly.
>
> That's utter nonsense; what's the point of exposing mwait to guests if
> you're not doing vcpu pinning. For overloaded guests mwait makes no
> sense what so ever.

I think you misunderstood. The guests are pinned. What they can do today is
monitor the guests' utilization time through mwait/vmexit. If that goes over
a certain threshold they can automatically launch more VMs to spread the
load.

With MWAIT in the guest this is gone...

Thanks,

tglx
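
A toy model of that scaling decision, to show what is lost. It assumes a
pinned vCPU on a dedicated pCPU, and that without MWAIT in the guest an
idle vCPU exits to the host, so guest-mode time tracks real work. The
function names and the 0.8 threshold are made up for this note.

```python
def observed_utilization(busy_ns, interval_ns, mwait_in_guest):
    """Utilization the host can derive from guest-mode time alone."""
    if mwait_in_guest:
        # An idle guest never exits, so the whole interval is guest time.
        return 1.0
    # Idle periods exit to the host, so guest time equals real work.
    return busy_ns / interval_ns


def should_scale_out(utilization, threshold=0.8):
    """Launch more VMs once utilization reaches the threshold."""
    return utilization >= threshold


# A guest that is busy for 0.2s out of every second.
busy, interval = 200_000_000, 1_000_000_000
print(should_scale_out(observed_utilization(busy, interval, False)))
print(should_scale_out(observed_utilization(busy, interval, True)))
```

With idle exits the mostly idle guest stays below the threshold. With MWAIT
exposed the same guest looks fully loaded to the host and would trigger a
scale-out.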

2019-07-09 02:02:54

by Ankur Arora

Subject: Re: cputime takes cstate into consideration

On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
>> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
>>> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
>>> counters (in the host) to sample the guest and construct a better
>>> accounting idea of what the guest does. That way the dashboard
>>> from the host would not show 100% CPU utilization.
>>
>> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
>> MSRs for that. (sorry I got distracted and forgot to send the patch)
>
> Sure, but then you conflict with the other people who fight tooth and nail
> over every single performance counter.
How about using Intel PT PwrEvt extensions? This should allow us to
precisely track idle residency via just MWAIT and TSC packets. Should
be pretty cheap too. It's post Cascade Lake though.

Ankur

>
> Thanks,
>
> tglx
>

2019-07-09 02:08:57

by Wanpeng Li

Subject: Re: cputime takes cstate into consideration

also Cc Frederic,
On Tue, 9 Jul 2019 at 10:00, Ankur Arora wrote:
>
> On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
> > On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> >> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> >>> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> >>> counters (in the host) to sample the guest and construct a better
> >>> accounting idea of what the guest does. That way the dashboard
> >>> from the host would not show 100% CPU utilization.
> >>
> >> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
> >> MSRs for that. (sorry I got distracted and forgot to send the patch)
> >
> > Sure, but then you conflict with the other people who fight tooth and nail
> > over every single performance counter.
> How about using Intel PT PwrEvt extensions? This should allow us to
> precisely track idle residency via just MWAIT and TSC packets. Should
> be pretty cheap too. It's post Cascade Lake though.
>
> Ankur
>
> >
> > Thanks,
> >
> > tglx
> >
>

2019-07-09 12:39:27

by Peter Zijlstra

Subject: Re: cputime takes cstate into consideration

On Mon, Jul 08, 2019 at 07:00:08PM -0700, Ankur Arora wrote:
> On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
> > On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
> > > On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> > > > There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
> > > > counters (in the host) to sample the guest and construct a better
> > > > accounting idea of what the guest does. That way the dashboard
> > > > from the host would not show 100% CPU utilization.
> > >
> > > You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
> > > MSRs for that. (sorry I got distracted and forgot to send the patch)
> >
> > Sure, but then you conflict with the other people who fight tooth and nail
> > over every single performance counter.
> How about using Intel PT PwrEvt extensions? This should allow us to
> precisely track idle residency via just MWAIT and TSC packets. Should
> be pretty cheap too. It's post Cascade Lake though.

That would fully claim PT just for this stupid accounting thing and be
completely Intel specific.

Just stop this madness already.

2019-07-09 19:00:13

by Ankur Arora

Subject: Re: cputime takes cstate into consideration

On 7/9/19 5:38 AM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 07:00:08PM -0700, Ankur Arora wrote:
>> On 2019-06-26 12:23 p.m., Thomas Gleixner wrote:
>>> On Wed, 26 Jun 2019, Raslan, KarimAllah wrote:
>>>> On Wed, 2019-06-26 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
>>>>> There were some ideas that Ankur (CC-ed) mentioned to me of using the perf
>>>>> counters (in the host) to sample the guest and construct a better
>>>>> accounting idea of what the guest does. That way the dashboard
>>>>> from the host would not show 100% CPU utilization.
>>>>
>>>> You can either use the UNHALTED cycles perf-counter or you can use MPERF/APERF
>>>> MSRs for that. (sorry I got distracted and forgot to send the patch)
>>>
>>> Sure, but then you conflict with the other people who fight tooth and nail
>>> over every single performance counter.
>> How about using Intel PT PwrEvt extensions? This should allow us to
>> precisely track idle residency via just MWAIT and TSC packets. Should
>> be pretty cheap too. It's post Cascade Lake though.
>
> That would fully claim PT just for this stupid accounting thing and be
> completely Intel specific.
>
> Just stop this madness already.
I see the point about just accruing guest time (in mwait or not) as
guest CPU time.
But, to take this madness a little further, I'm not sure I see why it
fully claims PT. AFAICS, we should be able to enable PwrEvt and whatever
else simultaneously.

Ankur

>

2019-12-10 00:45:23

by Wanpeng Li

Subject: Re: cputime takes cstate into consideration

On Thu, 27 Jun 2019 at 03:27, Raslan, KarimAllah wrote:
>
> On Wed, 2019-06-26 at 21:21 +0200, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 06:55:36PM +0000, Raslan, KarimAllah wrote:
> >
> > >
> > > If the host is completely in no_full_hz mode and the pCPU is dedicated to a
> > > single vCPU/task (and the guest is 100% CPU bound and never exits), you would
> > > still be ticking in the host once every second for housekeeping, right? Would
> > > not updating the mwait-time once a second be enough here?
> >
> > People are trying very hard to get rid of that remnant tick. Lets not
> > add dependencies to it.
> >
> > IMO this is a really stupid issue, 100% time is correct if the guest
> > does idle in pinned vcpu mode.
>
> One use case for proper accounting (obviously for a slightly relaxed definition
> or *proper*) is *external* monitoring of CPU utilization for scaling group
> (i.e. more VMs will be launched when you reach a certain CPU utilization).
> These external monitoring tools needs to account CPU utilization properly.

Apart from cputime accounting, the other Gordian knot is that the qemu
main loop, libvirt, kthreads etc. can't be offloaded to other hardware
such as a smart NIC; these will contend with vCPUs even while MWAIT/HLT
instructions are executing in the guest. There is an HLT activity state
in the VMCS, which indicates that the logical processor is inactive
because it executed the HLT instruction. However, SDM 24.4.2 notes that
although execution of the MWAIT instruction may put a logical processor
into an inactive state, this VMCS field never reflects that state.

Wanpeng