2013-03-05 20:32:39

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

Sorry for the delay in the response. I did not see the email
right away.

On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > 2013/2/5 Michael Wolf <[email protected]>:
> > > In the case of where you have a system that is running in a
> > > capped or overcommitted environment the user may see steal time
> > > being reported in accounting tools such as top or vmstat. This can
> > > cause confusion for the end user.
> >
> > Sorry, I'm no expert in this area. But I don't really understand what
> > is confusing for the end user here.
>
> I suppose that what is wanted is to subtract stolen time due to 'known
> reasons' from steal time reporting. 'Known reasons' being, for example,
> hard caps. So a vcpu executing instructions with no halt, but limited to
> 80% of available bandwidth, would not have 20% of stolen time reported.

Yes, exactly. The end user often did not set up the guest and is
not aware of the capping. The end user is only aware of the performance
level that they were told they would get with the guest.

>
> But yes, a description of the scenario that is being dealt with, with
> details, is important.

I will add more detail to the description next time I submit the
patches. How about something like, "In a cloud environment the user of a
kvm guest is not aware of the underlying hardware or how many other
guests are running on it. The end user is only aware of a level of
performance that they should see." Or does that just muddy the picture
more?
>
> > > To ease the confusion this patch set
> > > adds the idea of consigned (expected steal) time. The host will separate
> > > the consigned time from the steal time. The steal time will only be altered
> > > if hard limits (cfs bandwidth control) is used. The period and the quota used
> > > to separate the consigned time (expected steal) from the steal time are taken
> > > from the cfs bandwidth control settings. Any other steal time accruing during
> > > that period will show as the traditional steal time.
> >
> > I'm also a bit confused here. steal time will then only account the
> > cpu time lost due to quotas from cfs bandwidth control? Also what do
> > you exactly mean by "expected steal time"? Is it steal time due to
> > overcommitting minus scheduler quotas?
> >
> > Thanks.
>

Thanks
Mike Wolf


2013-03-06 01:59:00

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> Sorry for the delay in the response. I did not see the email
> right away.
>
> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > 2013/2/5 Michael Wolf <[email protected]>:
> > > > In the case of where you have a system that is running in a
> > > > capped or overcommitted environment the user may see steal time
> > > > being reported in accounting tools such as top or vmstat. This can
> > > > cause confusion for the end user.
> > >
> > > Sorry, I'm no expert in this area. But I don't really understand what
> > > is confusing for the end user here.
> >
> > I suppose that what is wanted is to subtract stolen time due to 'known
> > reasons' from steal time reporting. 'Known reasons' being, for example,
> > hard caps. So a vcpu executing instructions with no halt, but limited to
> > 80% of available bandwidth, would not have 20% of stolen time reported.
>
> Yes exactly and the end user many times did not set up the guest and is
> not aware of the capping. The end user is only aware of the performance
> level that they were told they would get with the guest.
> > But yes, a description of the scenario that is being dealt with, with
> > details, is important.
>
> I will add more detail to the description next time I submit the
> patches. How about something like,"In a cloud environment the user of a
> kvm guest is not aware of the underlying hardware or how many other
> guests are running on it. The end user is only aware of a level of
> performance that they should see." or does that just muddy the picture
> more??

So what the feature aims for is to report stolen time relative to hard
capping. That is: stolen time should be counted as time stolen from
the guest _beyond_ hard capping. Yes?

Probably don't need to report new data to the guest for that.

2013-03-06 08:13:10

by Glauber Costa

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
> On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
>> Sorry for the delay in the response. I did not see the email
>> right away.
>>
>> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
>>> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
>>>> 2013/2/5 Michael Wolf <[email protected]>:
>>>>> In the case of where you have a system that is running in a
>>>>> capped or overcommitted environment the user may see steal time
>>>>> being reported in accounting tools such as top or vmstat. This can
>>>>> cause confusion for the end user.
>>>>
>>>> Sorry, I'm no expert in this area. But I don't really understand what
>>>> is confusing for the end user here.
>>>
>>> I suppose that what is wanted is to subtract stolen time due to 'known
>>> reasons' from steal time reporting. 'Known reasons' being, for example,
>>> hard caps. So a vcpu executing instructions with no halt, but limited to
>>> 80% of available bandwidth, would not have 20% of stolen time reported.
>>
>> Yes exactly and the end user many times did not set up the guest and is
>> not aware of the capping. The end user is only aware of the performance
>> level that they were told they would get with the guest.
>>> But yes, a description of the scenario that is being dealt with, with
>>> details, is important.
>>
>> I will add more detail to the description next time I submit the
>> patches. How about something like,"In a cloud environment the user of a
>> kvm guest is not aware of the underlying hardware or how many other
>> guests are running on it. The end user is only aware of a level of
>> performance that they should see." or does that just muddy the picture
>> more??
>
> So the feature aims for is to report stolen time relative to hard
> capping. That is: stolen time should be counted as time stolen from
> the guest _beyond_ hard capping. Yes?
>
> Probably don't need to report new data to the guest for that.
>
If we take into account that one second always has one second, I believe
that you can just subtract the consigned time from the steal time the
host passes to the guest.

During each second, the numbers won't sum up to 100. The delta to 100 is
the consigned time, if anyone cares.

Adopting this would simplify this a lot. All you need to do, really, is
to get your calculation right from the bandwidth given by the cpu
controller. Subtract it in the host, and voila.
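
A minimal sketch of that subtraction, for illustration only. The function
name, the 80 ms quota, the 100 ms period and the per-second figures are all
assumptions; none of this is code from the patch set.

```python
def guest_view(user_ms, idle_ms, run_delay_ms, quota_ms, period_ms, elapsed_ms=1000):
    """Percentages the guest would see if the host subtracted the
    expected (consigned) wait from the steal time before reporting it."""
    periods = elapsed_ms / period_ms
    # Wait that the cap alone can explain over the whole interval.
    allowance_ms = periods * (period_ms - quota_ms)
    steal_ms = max(0.0, run_delay_ms - allowance_ms)
    reported = {"user": user_ms, "idle": idle_ms, "steal": steal_ms}
    pct = {name: 100.0 * ms / elapsed_ms for name, ms in reported.items()}
    # The numbers no longer sum to 100; the gap is the consigned time.
    pct["unaccounted"] = 100.0 - sum(pct.values())
    return pct


# A vcpu capped at 80% that ran flat out for one second:
print(guest_view(user_ms=800, idle_ms=0, run_delay_ms=200, quota_ms=80, period_ms=100))
```

Here the guest would see 80% user, 0% steal, and a 20% gap to 100.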

2013-03-06 13:34:57

by Frederic Weisbecker

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013/3/5 Michael Wolf <[email protected]>:
> Sorry for the delay in the response. I did not see the email
> right away.
>
> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
>> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
>> > 2013/2/5 Michael Wolf <[email protected]>:
>> > > In the case of where you have a system that is running in a
>> > > capped or overcommitted environment the user may see steal time
>> > > being reported in accounting tools such as top or vmstat. This can
>> > > cause confusion for the end user.
>> >
>> > Sorry, I'm no expert in this area. But I don't really understand what
>> > is confusing for the end user here.
>>
>> I suppose that what is wanted is to subtract stolen time due to 'known
>> reasons' from steal time reporting. 'Known reasons' being, for example,
>> hard caps. So a vcpu executing instructions with no halt, but limited to
>> 80% of available bandwidth, would not have 20% of stolen time reported.
>
> Yes exactly and the end user many times did not set up the guest and is
> not aware of the capping. The end user is only aware of the performance
> level that they were told they would get with the guest.
>
>>
>> But yes, a description of the scenario that is being dealt with, with
>> details, is important.
>
> I will add more detail to the description next time I submit the
> patches. How about something like,"In a cloud environment the user of a
> kvm guest is not aware of the underlying hardware or how many other
> guests are running on it. The end user is only aware of a level of
> performance that they should see." or does that just muddy the picture
> more??

That alone is probably not enough. But yeah, make sure you clearly
state the difference between expected (caps, sched bandwidth...) and
unexpected (overcommitting, competing load...) stolen time. Then add a
practical example as you made above that explains why it matters to
make that distinction and why you want to report it.

2013-03-06 16:28:44

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > Sorry for the delay in the response. I did not see the email
> > right away.
> >
> > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > 2013/2/5 Michael Wolf <[email protected]>:
> > > > > In the case of where you have a system that is running in a
> > > > > capped or overcommitted environment the user may see steal time
> > > > > being reported in accounting tools such as top or vmstat. This can
> > > > > cause confusion for the end user.
> > > >
> > > > Sorry, I'm no expert in this area. But I don't really understand what
> > > > is confusing for the end user here.
> > >
> > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > reasons' from steal time reporting. 'Known reasons' being, for example,
> > > hard caps. So a vcpu executing instructions with no halt, but limited to
> > > 80% of available bandwidth, would not have 20% of stolen time reported.
> >
> > Yes exactly and the end user many times did not set up the guest and is
> > not aware of the capping. The end user is only aware of the performance
> > level that they were told they would get with the guest.
> > > But yes, a description of the scenario that is being dealt with, with
> > > details, is important.
> >
> > I will add more detail to the description next time I submit the
> > patches. How about something like,"In a cloud environment the user of a
> > kvm guest is not aware of the underlying hardware or how many other
> > guests are running on it. The end user is only aware of a level of
> > performance that they should see." or does that just muddy the picture
> > more??
>
> So the feature aims for is to report stolen time relative to hard
> capping. That is: stolen time should be counted as time stolen from
> the guest _beyond_ hard capping. Yes?
Yes, that is the goal.
>
> Probably don't need to report new data to the guest for that.
Not sure I understand what you are saying here. Do you mean that I don't
need to report the expected steal from the guest? If I don't do that,
then I'm not reporting all of the time, and I am changing /proc/stat in a
bigger way than adding another category. Also I thought I would need to
provide the consigned time and the steal time for debugging purposes.
Maybe I'm missing your point.
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2013-03-06 16:32:49

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, 2013-03-06 at 12:13 +0400, Glauber Costa wrote:
> On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
> > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> >> Sorry for the delay in the response. I did not see the email
> >> right away.
> >>
> >> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> >>> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> >>>> 2013/2/5 Michael Wolf <[email protected]>:
> >>>>> In the case of where you have a system that is running in a
> >>>>> capped or overcommitted environment the user may see steal time
> >>>>> being reported in accounting tools such as top or vmstat. This can
> >>>>> cause confusion for the end user.
> >>>>
> >>>> Sorry, I'm no expert in this area. But I don't really understand what
> >>>> is confusing for the end user here.
> >>>
> >>> I suppose that what is wanted is to subtract stolen time due to 'known
> >>> reasons' from steal time reporting. 'Known reasons' being, for example,
> >>> hard caps. So a vcpu executing instructions with no halt, but limited to
> >>> 80% of available bandwidth, would not have 20% of stolen time reported.
> >>
> >> Yes exactly and the end user many times did not set up the guest and is
> >> not aware of the capping. The end user is only aware of the performance
> >> level that they were told they would get with the guest.
> >>> But yes, a description of the scenario that is being dealt with, with
> >>> details, is important.
> >>
> >> I will add more detail to the description next time I submit the
> >> patches. How about something like,"In a cloud environment the user of a
> >> kvm guest is not aware of the underlying hardware or how many other
> >> guests are running on it. The end user is only aware of a level of
> >> performance that they should see." or does that just muddy the picture
> >> more??
> >
> > So the feature aims for is to report stolen time relative to hard
> > capping. That is: stolen time should be counted as time stolen from
> > the guest _beyond_ hard capping. Yes?
> >
> > Probably don't need to report new data to the guest for that.
> >
> If we take into account that 1 second always have one second, I believe
> that you can just subtract the consigned time from the steal time the
> host passes to the guest.
>
> During each second, the numbers won't sum up to 100. The delta to 100 is
> the consigned time, if anyone cares.
>
> Adopting this would simplify this a lot. All you need to do, really, is
> to get your calculation right from the bandwidth given by the cpu
> controller. Subtract it in the host, and voila.

I looked at doing that once but was told that I was changing the
interface in an unacceptable way, because now I was not reporting all of
the elapsed time. I agree it would make things simpler.
>
>

2013-03-06 16:34:41

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, 2013-03-06 at 14:34 +0100, Frederic Weisbecker wrote:
> 2013/3/5 Michael Wolf <[email protected]>:
> > Sorry for the delay in the response. I did not see the email
> > right away.
> >
> > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> >> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> >> > 2013/2/5 Michael Wolf <[email protected]>:
> >> > > In the case of where you have a system that is running in a
> >> > > capped or overcommitted environment the user may see steal time
> >> > > being reported in accounting tools such as top or vmstat. This can
> >> > > cause confusion for the end user.
> >> >
> >> > Sorry, I'm no expert in this area. But I don't really understand what
> >> > is confusing for the end user here.
> >>
> >> I suppose that what is wanted is to subtract stolen time due to 'known
> >> reasons' from steal time reporting. 'Known reasons' being, for example,
> >> hard caps. So a vcpu executing instructions with no halt, but limited to
> >> 80% of available bandwidth, would not have 20% of stolen time reported.
> >
> > Yes exactly and the end user many times did not set up the guest and is
> > not aware of the capping. The end user is only aware of the performance
> > level that they were told they would get with the guest.
> >
> >>
> >> But yes, a description of the scenario that is being dealt with, with
> >> details, is important.
> >
> > I will add more detail to the description next time I submit the
> > patches. How about something like,"In a cloud environment the user of a
> > kvm guest is not aware of the underlying hardware or how many other
> > guests are running on it. The end user is only aware of a level of
> > performance that they should see." or does that just muddy the picture
> > more??
>
> That alone is probably not enough. But yeah, make sure you clearly
> state the difference between expected (caps, sched bandwidth...) and
> unexpected (overcommitting, competing load...) stolen time. Then add a
> practical example as you made above that explains why it matters to
> make that distinction and why you want to report it.
>

Ok, I understand what you are requesting. I will make sure to add it to
the description the next time I submit the patches.

2013-03-07 02:31:36

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote:
> On Wed, 2013-03-06 at 12:13 +0400, Glauber Costa wrote:
> > On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
> > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > >> Sorry for the delay in the response. I did not see the email
> > >> right away.
> > >>
> > >> On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > >>> On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > >>>> 2013/2/5 Michael Wolf <[email protected]>:
> > >>>>> In the case of where you have a system that is running in a
> > >>>>> capped or overcommitted environment the user may see steal time
> > >>>>> being reported in accounting tools such as top or vmstat. This can
> > >>>>> cause confusion for the end user.
> > >>>>
> > >>>> Sorry, I'm no expert in this area. But I don't really understand what
> > >>>> is confusing for the end user here.
> > >>>
> > >>> I suppose that what is wanted is to subtract stolen time due to 'known
> > >>> reasons' from steal time reporting. 'Known reasons' being, for example,
> > >>> hard caps. So a vcpu executing instructions with no halt, but limited to
> > >>> 80% of available bandwidth, would not have 20% of stolen time reported.
> > >>
> > >> Yes exactly and the end user many times did not set up the guest and is
> > >> not aware of the capping. The end user is only aware of the performance
> > >> level that they were told they would get with the guest.
> > >>> But yes, a description of the scenario that is being dealt with, with
> > >>> details, is important.
> > >>
> > >> I will add more detail to the description next time I submit the
> > >> patches. How about something like,"In a cloud environment the user of a
> > >> kvm guest is not aware of the underlying hardware or how many other
> > >> guests are running on it. The end user is only aware of a level of
> > >> performance that they should see." or does that just muddy the picture
> > >> more??
> > >
> > > So the feature aims for is to report stolen time relative to hard
> > > capping. That is: stolen time should be counted as time stolen from
> > > the guest _beyond_ hard capping. Yes?
> > >
> > > Probably don't need to report new data to the guest for that.
> > >
> > If we take into account that 1 second always have one second, I believe
> > that you can just subtract the consigned time from the steal time the
> > host passes to the guest.
> >
> > During each second, the numbers won't sum up to 100. The delta to 100 is
> > the consigned time, if anyone cares.
> >
> > Adopting this would simplify this a lot. All you need to do, really, is
> > to get your calculation right from the bandwidth given by the cpu
> > controller. Subtract it in the host, and voila.
>
> I looked at doing that once but was told that I was changing the
> interface in an unacceptable way, because now I was not reporting all of
> the elapsed time. I agree it would make things simpler.

Pointer to that claim, please?

2013-03-07 02:31:57

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
> On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > > Sorry for the delay in the response. I did not see the email
> > > right away.
> > >
> > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > > 2013/2/5 Michael Wolf <[email protected]>:
> > > > > > In the case of where you have a system that is running in a
> > > > > > capped or overcommitted environment the user may see steal time
> > > > > > being reported in accounting tools such as top or vmstat. This can
> > > > > > cause confusion for the end user.
> > > > >
> > > > > Sorry, I'm no expert in this area. But I don't really understand what
> > > > > is confusing for the end user here.
> > > >
> > > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > > reasons' from steal time reporting. 'Known reasons' being, for example,
> > > > hard caps. So a vcpu executing instructions with no halt, but limited to
> > > > 80% of available bandwidth, would not have 20% of stolen time reported.
> > >
> > > Yes exactly and the end user many times did not set up the guest and is
> > > not aware of the capping. The end user is only aware of the performance
> > > level that they were told they would get with the guest.
> > > > But yes, a description of the scenario that is being dealt with, with
> > > > details, is important.
> > >
> > > I will add more detail to the description next time I submit the
> > > patches. How about something like,"In a cloud environment the user of a
> > > kvm guest is not aware of the underlying hardware or how many other
> > > guests are running on it. The end user is only aware of a level of
> > > performance that they should see." or does that just muddy the picture
> > > more??
> >
> > So the feature aims for is to report stolen time relative to hard
> > capping. That is: stolen time should be counted as time stolen from
> > the guest _beyond_ hard capping. Yes?
> Yes, that is the goal.
> >
> > Probably don't need to report new data to the guest for that.
> Not sure I understand what you are saying here. Do you mean that I don't
> need to report the expected steal from the guest? If I don't do that
> then I'm not reporting all of the time and changing /proc/stat in a
> bigger way than adding another category. Also I thought I would need to
> provide the consigned time and the steal time for debugging purposes.
> Maybe I'm missing your point.....

OK so the usefulness of steal time comes from the ability to measure
CPU cycles that the guest is being deprived of, relative to some unit
(implicitly the CPU frequency presented to the VM). That way, it becomes
easier to properly allocate resources.

From top man page:
st : time stolen from this vm by the hypervisor

Not only is it a problem for the lender, it is also confusing for the
user, who has to subtract the hard capping from the reported steal time
himself.


The problem with the algorithm in the patchset is the following
(practical example):

- Hard capping set to 80% of available CPU.
- vcpu does not exceed its threshold, say workload with 40%
  CPU utilization.
- Under this scenario it is possible for vcpu to be deprived
  of cycles (because out of the 40% that workload uses, only 30% of
  actual CPU time are being provided).
- The algorithm in this patchset will not report any stolen time
  because it assumes 20% of stolen time reported via 'run_delay'
  is fixed at all times (which is false), therefore any valid
  stolen time below 20% will not be reported.

Makes sense?
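
The same example as arithmetic, in a small sketch. This models the
behaviour described above (a fixed allowance of 100% minus the cap is always
subtracted from run_delay); it is an assumed simplification, not the code in
the patchset, and the function name is made up.

```python
def reported_steal_pct(cap_pct, run_delay_pct):
    """Fixed-allowance model: (100 - cap) percent of run_delay is always
    assumed to be expected, and only the excess is reported as steal."""
    assumed_expected = 100.0 - cap_pct
    return max(0.0, run_delay_pct - assumed_expected)


cap = 80.0       # hard cap
wanted = 40.0    # CPU the workload tries to use, well below the cap
provided = 30.0  # CPU it actually receives
run_delay = wanted - provided  # 10% spent runnable but not running

# The 10% was not caused by the cap, yet it falls inside the assumed 20%:
print(reported_steal_pct(cap, run_delay))  # 0.0
```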

Not sure what the concrete way to report stolen time relative to hard
capping is (probably easier inside the scheduler, where run_delay is
calculated).

Reporting the hard capping to the guest is a good idea (which saves the
user from having to measure it themselves), but better done separately
via a new field.

2013-03-07 03:12:01

by Paul Mackerras

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, Mar 06, 2013 at 09:52:16PM -0300, Marcelo Tosatti wrote:
> On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote:
> > I looked at doing that once but was told that I was changing the
> > interface in an unacceptable way, because now I was not reporting all of
> > the elapsed time. I agree it would make things simpler.
>
> Pointer to that claim, please?

Back in about 2004 or 2005 or so I was looking at changing how user
and system times were calculated (in the context of trying to find a
better way to report resources used by a thread in an SMT processor).
I found that utilities such as top expected the deltas in the
/proc/stat numbers to add up to elapsed time, and would report strange
and inconsistent results if that wasn't the case. Unfortunately at
this distance I don't recall the exact details. I don't know whether
the expectation that the deltas in the /proc/stat numbers over a
period of time add up to the elapsed real time is documented anywhere,
but I wouldn't be at all surprised if some programs depend on it, so
it's better to maintain that property.
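
To make that property concrete, here is a rough sketch of the check a tool
might do. The two sample lines are invented, and the field list assumes the
usual layout of the aggregate "cpu" line; guest and guest_nice are left out
because that time is already counted in user and nice.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def delta_total(before, after):
    """Sum of the per-category deltas between two samples, in ticks."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    return sum(a[f] - b[f] for f in FIELDS)


# Invented samples taken one second apart on a one-cpu guest, USER_HZ = 100:
BEFORE = "cpu  1000 0 500 8000 100 0 50 350 0 0"
AFTER = "cpu  1080 0 510 8000 100 0 50 360 0 0"

print(delta_total(BEFORE, AFTER))  # 100 ticks, i.e. the full elapsed second
```

If steal were reduced in the host without moving the difference into some
other field, this total would come out below 100 for the same interval.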

Paul.

2013-03-07 20:24:02

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Thu, 2013-03-07 at 14:11 +1100, Paul Mackerras wrote:
> On Wed, Mar 06, 2013 at 09:52:16PM -0300, Marcelo Tosatti wrote:
> > On Wed, Mar 06, 2013 at 10:29:12AM -0600, Michael Wolf wrote:
> > > I looked at doing that once but was told that I was changing the
> > > interface in an unacceptable way, because now I was not reporting all of
> > > the elapsed time. I agree it would make things simpler.
> >
> > Pointer to that claim, please?
>
> Back in about 2004 or 2005 or so I was looking at changing how user
> and system times were calculated (in the context of trying to find a
> better way to report resources used by a thread in an SMT processor).
> I found that utilities such as top expected the deltas in the
> /proc/stat numbers to add up to elapsed time, and would report strange
> and inconsistent results if that wasn't the case. Unfortunately at
> this distance I don't recall the exact details. I don't know whether
> the expectation that the deltas in the /proc/stat numbers over a
> period of time add up to the elapsed real time is documented anywhere,
> but I wouldn't be at all surprised if some programs depend on it, so
> it's better to maintain that property.

I will have to look at this again. When looking at the cpu data where
steal time is reported there isn't a problem today. I will have to run
it and see if there is anything incorrect with the time being reported
for the individual processes.

My real concern here was whether, in changing the /proc/stat interface, I
am going to break private tools that look at that information. When I've
looked at vmstat and top they report the cpu information fine, but I may
end up creating problems for home grown scripts and tools.

2013-03-07 21:10:13

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, 2013-03-06 at 23:30 -0300, Marcelo Tosatti wrote:
> On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
> > On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > > > Sorry for the delay in the response. I did not see the email
> > > > right away.
> > > >
> > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > > > 2013/2/5 Michael Wolf <[email protected]>:
> > > > > > > In the case of where you have a system that is running in a
> > > > > > > capped or overcommitted environment the user may see steal time
> > > > > > > being reported in accounting tools such as top or vmstat. This can
> > > > > > > cause confusion for the end user.
> > > > > >
> > > > > > Sorry, I'm no expert in this area. But I don't really understand what
> > > > > > is confusing for the end user here.
> > > > >
> > > > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > > > reasons' from steal time reporting. 'Known reasons' being, for example,
> > > > > hard caps. So a vcpu executing instructions with no halt, but limited to
> > > > > 80% of available bandwidth, would not have 20% of stolen time reported.
> > > >
> > > > Yes exactly and the end user many times did not set up the guest and is
> > > > not aware of the capping. The end user is only aware of the performance
> > > > level that they were told they would get with the guest.
> > > > > But yes, a description of the scenario that is being dealt with, with
> > > > > details, is important.
> > > >
> > > > I will add more detail to the description next time I submit the
> > > > patches. How about something like,"In a cloud environment the user of a
> > > > kvm guest is not aware of the underlying hardware or how many other
> > > > guests are running on it. The end user is only aware of a level of
> > > > performance that they should see." or does that just muddy the picture
> > > > more??
> > >
> > > So the feature aims for is to report stolen time relative to hard
> > > capping. That is: stolen time should be counted as time stolen from
> > > the guest _beyond_ hard capping. Yes?
> > Yes, that is the goal.
> > >
> > > Probably don't need to report new data to the guest for that.
> > Not sure I understand what you are saying here. Do you mean that I don't
> > need to report the expected steal from the guest? If I don't do that
> > then I'm not reporting all of the time and changing /proc/stat in a
> > bigger way than adding another category. Also I thought I would need to
> > provide the consigned time and the steal time for debugging purposes.
> > Maybe I'm missing your point.....
>
> OK so the usefulness of steal time comes from the ability to measure
> CPU cycles that the guest is being deprived of, relative to some unit
> (implicitly the CPU frequency presented to the VM). That way, it becomes
> easier to properly allocate resources.
>
> From top man page:
> st : time stolen from this vm by the hypervisor
>
> Not only its a problem for the lender, it is also confusing for the user
> (who has to subtract from the reported value himself), the hardcapping
> from reported steal time.
>
>
> The problem with the algorithm in the patchset is the following
> (practical example):
>
> - Hard capping set to 80% of available CPU.
> - vcpu does not exceed its threshold, say workload with 40%
> CPU utilization.
> - Under this scenario it is possible for vcpu to be deprived
> of cycles (because out of the 40% that workload uses, only 30% of
> actual CPU time are being provided).
> - The algorithm in this patchset will not report any stolen time
> because it assumes 20% of stolen time reported via 'run_delay'
> is fixed at all times (which is false), therefore any valid
> stolen time below 20% will not be reported.
>
> Makes sense?

I understand the scenario. I will have to go back and look at the
CFS bandwidth code and run some tests. The question I have to look at is
how everything is reported in your scenario above.

This will depend on how the cfs bandwidth is configured, whether there
are uncapped processes on the system, and how cpu intensive they are.

I will run some tests and report back.

>
> Not sure what the concrete way to report stolen time relative to hard
> capping is (probably easier inside the scheduler, where run_delay is
> calculated).
>
> Reporting the hard capping to the guest is a good idea (which saves the
> user from having to measure it themselves), but better done separately
> via new field.

I looked at doing something like this. If bandwidth controls are
configured there is a throttled flag. So in effect if the throttled
flag is set, don't add the time spent on the runqueue. But this will
fail to work in some cases.

For example:

- Set up CFS bandwidth controls so that the group gets 50% of the
  processor.
- Have 1 physical cpu.
- Have 2 guests, each with 1 vcpu.
- Have each guest running to its full entitlement.

So in this case each guest will have time on the runqueue but neither
will ever be throttled since they will not exceed their quota in the
defined period. So now just trying to do this in the scheduler doesn't
work because you cannot rely on the throttled flag. In either case the
time is accumulated as time on the runqueue.

This is why my patchset had included a timer. It was basically
mimicking the bandwidth controller by using a timer set to the same
period. So in a given period of time a fixed quota of time on the
runqueue can be expected. If the amount of time on the runqueue exceeds
the expected, then report it.
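
As a rough illustration of the timer approach described above, the
following Python sketch splits the runqueue wait seen in one period into
consigned and steal time. It is a simplified model, not the patch code:
the name split_run_delay, the nanosecond units, and the assumption that
the expected wait per period equals the period minus the quota are all
hypothetical.

```python
def split_run_delay(run_delay_ns, period_ns, quota_ns):
    """Split one period's runqueue wait into (consigned_ns, steal_ns).

    Assumption (not from the patch): the wait that the cap alone can
    explain is the part of the period outside the quota.
    """
    expected_ns = max(period_ns - quota_ns, 0)
    consigned_ns = min(run_delay_ns, expected_ns)
    steal_ns = run_delay_ns - consigned_ns
    return consigned_ns, steal_ns


PERIOD_NS = 100_000_000   # 100 ms bandwidth period (illustrative value)
QUOTA_NS = 50_000_000     # 50 ms quota, i.e. a 50% cap

# Two guests capped at 50% on one physical cpu, both running flat out:
# each one waits on the runqueue for about half of every period.
consigned, steal = split_run_delay(50_000_000, PERIOD_NS, QUOTA_NS)

# A period in which the guest waited longer than the cap explains.
consigned_over, steal_over = split_run_delay(70_000_000, PERIOD_NS, QUOTA_NS)
```

In the two-guest case the whole wait lands in the consigned bucket and
no steal is reported, even though neither guest was ever throttled.
Only the wait beyond the expected amount shows up as steal.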


2013-03-07 21:15:27

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Wed, 2013-03-06 at 23:30 -0300, Marcelo Tosatti wrote:
> On Wed, Mar 06, 2013 at 10:27:13AM -0600, Michael Wolf wrote:
> > On Tue, 2013-03-05 at 22:41 -0300, Marcelo Tosatti wrote:
> > > On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
> > > > Sorry for the delay in the response. I did not see the email
> > > > right away.
> > > >
> > > > On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
> > > > > On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
> > > > > > 2013/2/5 Michael Wolf <[email protected]>:
> > > > > > > In the case of where you have a system that is running in a
> > > > > > > capped or overcommitted environment the user may see steal time
> > > > > > > being reported in accounting tools such as top or vmstat. This can
> > > > > > > cause confusion for the end user.
> > > > > >
> > > > > > Sorry, I'm no expert in this area. But I don't really understand what
> > > > > > is confusing for the end user here.
> > > > >
> > > > > I suppose that what is wanted is to subtract stolen time due to 'known
> > > > > reasons' from steal time reporting. 'Known reasons' being, for example,
> > > > > hard caps. So a vcpu executing instructions with no halt, but limited to
> > > > > 80% of available bandwidth, would not have 20% of stolen time reported.
> > > >
> > > > Yes exactly and the end user many times did not set up the guest and is
> > > > not aware of the capping. The end user is only aware of the performance
> > > > level that they were told they would get with the guest.
> > > > > But yes, a description of the scenario that is being dealt with, with
> > > > > details, is important.
> > > >
> > > > I will add more detail to the description next time I submit the
> > > > patches. How about something like,"In a cloud environment the user of a
> > > > kvm guest is not aware of the underlying hardware or how many other
> > > > guests are running on it. The end user is only aware of a level of
> > > > performance that they should see." or does that just muddy the picture
> > > > more??
> > >
> > > So what the feature aims for is to report stolen time relative to hard
> > > capping. That is: stolen time should be counted as time stolen from
> > > the guest _beyond_ hard capping. Yes?
> > Yes, that is the goal.
> > >
> > > Probably don't need to report new data to the guest for that.
> > Not sure I understand what you are saying here. Do you mean that I don't
> > need to report the expected steal from the guest? If I don't do that
> > then I'm not reporting all of the time and changing /proc/stat in a
> > > bigger way than adding another category. Also I thought I would need to
> > provide the consigned time and the steal time for debugging purposes.
> > Maybe I'm missing your point.....
>
> OK so the usefulness of steal time comes from the ability to measure
> CPU cycles that the guest is being deprived of, relative to some unit
> (implicitly the CPU frequency presented to the VM). That way, it becomes
> easier to properly allocate resources.
>
> From top man page:
> st : time stolen from this vm by the hypervisor
>
> Not only is it a problem for the lender, it is also confusing for the user,
> who has to subtract the hard capping from the reported steal time
> himself.
>
>
> The problem with the algorithm in the patchset is the following
> (practical example):
>
> - Hard capping set to 80% of available CPU.
> - vcpu does not exceed its threshold, say workload with 40%
> CPU utilization.
> - Under this scenario it is possible for vcpu to be deprived
> of cycles (because out of the 40% that workload uses, only 30% of
> actual CPU time are being provided).
> - The algorithm in this patchset will not report any stolen time
> because it assumes 20% of stolen time reported via 'run_delay'
> is fixed at all times (which is false), therefore any valid
> stolen time below 20% will not be reported.
>
> Makes sense?
>
> Not sure what the concrete way to report stolen time relative to hard
> capping is (probably easier inside the scheduler, where run_delay is
> calculated).
>
> Reporting the hard capping to the guest is a good idea (which saves the
> user from having to measure it themselves), but better done separately
> via new field.

I didn't respond to this in the previous response. I'm not sure I'm
following you here. I thought this is what I was doing by having a
consigned (expected steal) field added to the /proc/stat output. Are you
looking for something else or a better naming convention?


2013-03-07 21:26:10

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote:
> >
> > Makes sense?
> >
> > Not sure what the concrete way to report stolen time relative to hard
> > capping is (probably easier inside the scheduler, where run_delay is
> > calculated).
> >
> > Reporting the hard capping to the guest is a good idea (which saves the
> > user from having to measure it themselves), but better done separately
> > via new field.
>
> I didn't respond to this in the previous response. I'm not sure I'm
> following you here. I thought this is what I was doing by having a
> consigned (expected steal) field added to the /proc/stat output. Are you
> looking for something else or a better naming convention?

Expected steal is not a good measure to use (because as mentioned in the
previous email there is no expected steal over a fixed period of time).

It is fine to report 'maximum percentage of underlying physical CPU'
(what percentage of the physical CPU time guest VM is allowed to make
use of).

And then steal time is relative to maximum percentage of underlying
physical CPU time allowed.

2013-03-07 22:34:31

by Michael Wolf

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Thu, 2013-03-07 at 18:25 -0300, Marcelo Tosatti wrote:
> On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote:
> > >
> > > Makes sense?
> > >
> > > Not sure what the concrete way to report stolen time relative to hard
> > > capping is (probably easier inside the scheduler, where run_delay is
> > > calculated).
> > >
> > > Reporting the hard capping to the guest is a good idea (which saves the
> > > user from having to measure it themselves), but better done separately
> > > via new field.
> >
> > I didn't respond to this in the previous response. I'm not sure I'm
> > following you here. I thought this is what I was doing by having a
> > consigned (expected steal) field added to the /proc/stat output. Are you
> > looking for something else or a better naming convention?
>
> Expected steal is not a good measure to use (because as mentioned in the
> previous email there is no expected steal over a fixed period of time).
>
> It is fine to report 'maximum percentage of underlying physical CPU'
> (what percentage of the physical CPU time guest VM is allowed to make
> use of).
>
> And then steal time is relative to maximum percentage of underlying
> physical CPU time allowed.
>

So last August I had sent out an RFC set of patches to do this. That
patchset was meant to handle the general overcommit case as well as the
capping case by having qemu pass a percentage to the host that would
then be passed onto the guest and used to adjust the steal time.
Here is the link to the discussion
http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html

As you will see there Avi didn't like the idea of a percentage down in
the guest, among other reasons he was concerned about migration. Also
in the email thread you will see that Anthony Liguori was opposed to the
idea of just changing the steal time, he wanted it split out.

What Glauber has suggested and I am working on implementing is taking
out the timer and adding a last read field in the host. So in the host
I can determine the total time that has passed and compute a percentage
and apply that percentage to the steal time while the info is still on
the host. Then pass the steal and consigned time to the guest.

Does that address your concerns?
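
One possible reading of the "last read field" idea, written as a small
Python model rather than kernel code. Everything here is an assumption
made for illustration: the class name HostStealSplitter, the update()
method, and the choice to treat the uncapped fraction of the elapsed
time as the expected wait. The actual patches may compute this
differently.

```python
class HostStealSplitter:
    """Toy host-side model: split run_delay growth since the last read."""

    def __init__(self, quota_ns, period_ns, now_ns=0, run_delay_ns=0):
        self.quota_ns = quota_ns
        self.period_ns = period_ns
        self.last_read_ns = now_ns            # the "last read" field
        self.last_run_delay_ns = run_delay_ns
        self.consigned_ns = 0
        self.steal_ns = 0

    def update(self, now_ns, run_delay_ns):
        """Account the run_delay accumulated since the previous read."""
        elapsed_ns = now_ns - self.last_read_ns
        delta_ns = run_delay_ns - self.last_run_delay_ns
        # Assumed: the cap explains (period - quota) / period of elapsed time.
        expected_ns = elapsed_ns * (self.period_ns - self.quota_ns) // self.period_ns
        consigned_ns = min(delta_ns, expected_ns)
        self.consigned_ns += consigned_ns
        self.steal_ns += delta_ns - consigned_ns
        self.last_read_ns = now_ns
        self.last_run_delay_ns = run_delay_ns
        return self.consigned_ns, self.steal_ns


splitter = HostStealSplitter(quota_ns=50_000_000, period_ns=100_000_000)
first = splitter.update(now_ns=1_000_000_000, run_delay_ns=600_000_000)
second = splitter.update(now_ns=2_000_000_000, run_delay_ns=900_000_000)
```

No timer is needed in this model because the expected wait is derived
from the time elapsed between reads instead of from a fixed period
tick.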

2013-03-08 02:04:46

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Thu, Mar 07, 2013 at 04:34:16PM -0600, Michael Wolf wrote:
> On Thu, 2013-03-07 at 18:25 -0300, Marcelo Tosatti wrote:
> > On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote:
> > > >
> > > > Makes sense?
> > > >
> > > > Not sure what the concrete way to report stolen time relative to hard
> > > > capping is (probably easier inside the scheduler, where run_delay is
> > > > calculated).
> > > >
> > > > Reporting the hard capping to the guest is a good idea (which saves the
> > > > user from having to measure it themselves), but better done separately
> > > > via new field.
> > >
> > > I didn't respond to this in the previous response. I'm not sure I'm
> > > following you here. I thought this is what I was doing by having a
> > > consigned (expected steal) field added to the /proc/stat output. Are you
> > > looking for something else or a better naming convention?
> >
> > Expected steal is not a good measure to use (because as mentioned in the
> > previous email there is no expected steal over a fixed period of time).
> >
> > It is fine to report 'maximum percentage of underlying physical CPU'
> > (what percentage of the physical CPU time guest VM is allowed to make
> > use of).
> >
> > And then steal time is relative to maximum percentage of underlying
> > physical CPU time allowed.
> >
>
> So last August I had sent out an RFC set of patches to do this. That
> patchset was meant to handle the general overcommit case as well as the
> capping case by having qemu pass a percentage to the host that would
> then be passed onto the guest and used to adjust the steal time.
> Here is the link to the discussion
> http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html
>
> As you will see there Avi didn't like the idea of a percentage down in
> the guest, among other reasons he was concerned about migration. Also
> in the email thread you will see that Anthony Liguori was opposed to the
> idea of just changing the steal time, he wanted it split out.
>
> What Glauber has suggested and I am working on implementing is taking
> out the timer and adding a last read field in the host. So in the host
> I can determine the total time that has passed and compute a percentage
> and apply that percentage to the steal time while the info is still on
> the host. Then pass the steal and consigned time to the guest.
>
> Does that address your concerns?

I am not asking about passing percentage down the host - just pointing
out a counter example to the correctness of the current algorithm.

I cannot see how you can report proper steal time value relative to
hard cap without having that number calculated in the scheduler. IOW,
"run_delay" must be split in two: you want to differentiate whether run
delay was due to hard cap exhaustion or due to other reasons. Without
that, steal time reporting is incorrect (as the example details). Now
the question is, how to do that separation.
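
To picture what splitting run_delay in two could look like, here is a
minimal Python sketch. It is not scheduler code, and it does not answer
the open question of how the scheduler would know the reason for a
wait: the RunDelay class and the quota_exhausted flag are hypothetical
placeholders for whatever signal would be used.

```python
from dataclasses import dataclass


@dataclass
class RunDelay:
    """Runqueue wait kept in two buckets instead of one counter."""

    capped_ns: int = 0   # waited because the hard cap quota was used up
    other_ns: int = 0    # waited for any other reason, e.g. contention

    def account(self, wait_ns, quota_exhausted):
        # quota_exhausted is a placeholder; how to derive it is undecided.
        if quota_exhausted:
            self.capped_ns += wait_ns
        else:
            self.other_ns += wait_ns

    @property
    def total_ns(self):
        # Equals what a single run_delay counter would have held.
        return self.capped_ns + self.other_ns


rd = RunDelay()
rd.account(20_000_000, quota_exhausted=True)    # cap-induced wait
rd.account(10_000_000, quota_exhausted=False)   # wait behind other tasks
```

With such a split, only other_ns would be reported as steal, and
capped_ns could be exposed separately.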

2013-03-08 02:21:32

by Marcelo Tosatti

Subject: Re: [PATCH 0/4] Alter steal-time reporting in the guest

On Thu, Mar 07, 2013 at 10:54:37PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 07, 2013 at 04:34:16PM -0600, Michael Wolf wrote:
> > On Thu, 2013-03-07 at 18:25 -0300, Marcelo Tosatti wrote:
> > > On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote:
> > > > >
> > > > > Makes sense?
> > > > >
> > > > > Not sure what the concrete way to report stolen time relative to hard
> > > > > capping is (probably easier inside the scheduler, where run_delay is
> > > > > calculated).
> > > > >
> > > > > Reporting the hard capping to the guest is a good idea (which saves the
> > > > > user from having to measure it themselves), but better done separately
> > > > > via new field.
> > > >
> > > > I didn't respond to this in the previous response. I'm not sure I'm
> > > > following you here. I thought this is what I was doing by having a
> > > > consigned (expected steal) field added to the /proc/stat output. Are you
> > > > looking for something else or a better naming convention?
> > >
> > > Expected steal is not a good measure to use (because as mentioned in the
> > > previous email there is no expected steal over a fixed period of time).
> > >
> > > It is fine to report 'maximum percentage of underlying physical CPU'
> > > (what percentage of the physical CPU time guest VM is allowed to make
> > > use of).
> > >
> > > And then steal time is relative to maximum percentage of underlying
> > > physical CPU time allowed.
> > >
> >
> > So last August I had sent out an RFC set of patches to do this. That
> > patchset was meant to handle the general overcommit case as well as the
> > capping case by having qemu pass a percentage to the host that would
> > then be passed onto the guest and used to adjust the steal time.
> > Here is the link to the discussion
> > http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html
> >
> > As you will see there Avi didn't like the idea of a percentage down in
> > the guest, among other reasons he was concerned about migration.

OK.

> > Also in the email thread you will see that Anthony Liguori was
> > opposed to the idea of just changing the steal time, he wanted it
> > split out.

"What I had previously suggested was splitting entitlement loss out of
steal time and reporting it as a separate metric (but not reporting a
fixed notion of entitlement).

You're missing the entitlement loss bit above. But you need to call
out entitlement loss in order to report idle time correctly.

I think changing steal time (as this patch does) is wrong.

Regards,

Anthony Liguori"

This is what is suggested below. What you mentioned earlier

"So in this case each guest will have time on the runqueue but neither
will ever be throttled since they will not exceed their quota in the
defined period. So now just trying to do this in the scheduler doesn't
work because you cannot rely on the throttled flag. In either case the
time is accumulated as time on the runqueue.

This is why my patchset had included a timer. It was basically
mimicking the bandwidth controller by using a timer set to the same
period. So in a given period of time a fixed quota of time on the
runqueue can be expected. If the amount of time on the runqueue exceeds
the expected, then report it."

Understood, but it's problematic: it is possible for a vcpu to be
deprived of cycles even if it did not exceed its quota. Did you
investigate whether it's possible to split run_delay?
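
The numbers from the earlier example (80% cap, 40% workload, 30%
actually provided) can be plugged into a toy model of the
fixed-expectation approach. The function reported_steal and the
percent-of-period units are assumptions made for this illustration,
not code from the patchset.

```python
def reported_steal(run_delay_pct, cap_pct):
    """Steal a fixed-expectation model reports, in percent of a period.

    Assumed model: (100 - cap_pct) of every period is treated as
    expected wait, and only wait beyond that counts as steal.
    """
    expected_pct = 100 - cap_pct
    return max(run_delay_pct - expected_pct, 0)


CAP_PCT = 80       # hard cap
WANTED_PCT = 40    # what the workload tries to use
GRANTED_PCT = 30   # what the host actually provides

# The vcpu never reaches its cap, so none of its wait is cap-induced.
run_delay_pct = WANTED_PCT - GRANTED_PCT
true_steal_pct = run_delay_pct
model_steal_pct = reported_steal(run_delay_pct, CAP_PCT)

print(f"real deprivation: {true_steal_pct}%, reported steal: {model_steal_pct}%")
```

The 10% the vcpu was really deprived of is swallowed by the assumed
20% of expected wait, so the model reports no steal at all.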


> > What Glauber has suggested and I am working on implementing is taking
> > out the timer and adding a last read field in the host. So in the host
> > I can determine the total time that has passed and compute a percentage
> > and apply that percentage to the steal time while the info is still on
> > the host. Then pass the steal and consigned time to the guest.

Or maybe I missed why the suggestion above is immune to this problem?

> >
> > Does that address your concerns?
>
> I am not asking about passing percentage down the host - just pointing
> out a counter example to the correctness of the current algorithm.
>
> I cannot see how you can report proper steal time value relative to
> hard cap without having that number calculated in the scheduler. IOW,
> "run_delay" must be split in two: you want to differentiate whether run
> delay was due to hard cap exhaustion or due to other reasons. Without
> that, steal time reporting is incorrect (as the example details). Now
> the question is, how to do that separation.