2014-06-03 15:41:09

by Peter Zijlstra

Subject: Re: [PATCH v2 08/11] sched: get CPU's activity statistic

On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> The rq runnable_avg_{sum, period} give a very long term view of the cpu
> utilization (I will use the term utilization instead of activity as I
> think that is what we are talking about here). IMHO, it is too slow to
> be used as basis for load balancing decisions. I think that was also
> agreed upon in the last discussion related to this topic [1].
>
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. In other words, the cpu might have been
> fully utilized for 345 ms before it is considered fully utilized.
> Periodic load-balancing happens much more frequently than that.

Like said earlier the 94% mark is actually hit much sooner, but yes,
likely still too slow.

50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
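
As a rough illustration of those figures (an editorial sketch, not kernel code): assuming the usual geometric decay with y^32 = 0.5 and one period per ms, a sum that starts from 0 on a fully busy cpu reaches a fraction 1 - y^n of its maximum after n periods. The names below are made up for the example.

```python
# Ramp-up of a geometrically decayed sum, assuming y**32 == 0.5.
HALF_LIFE_PERIODS = 32                 # assumed half-life, in 1 ms periods
Y = 0.5 ** (1.0 / HALF_LIFE_PERIODS)   # per-period decay factor


def ramp_fraction(periods: int) -> float:
    """Fraction of the maximum reached after `periods` fully busy periods,
    starting from a sum of zero."""
    return 1.0 - Y ** periods


for n in (32, 64, 96, 128, 345):
    print(f"{n:3d} ms -> {ramp_fraction(n) * 100:.2f}%")
```

This gives 50%, 75%, 87.5% and 93.75% at 32, 64, 96 and 128 ms, and a little over 99.9% at 345 ms.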

> Also, if load-balancing actually moves tasks around it may take quite a
> while before runnable_avg_sum actually reflects this change. The next
> periodic load-balance is likely to happen before runnable_avg_sum has
> reflected the result of the previous periodic load-balance.
>
> To avoid these problems, we need to base utilization on a metric which
> is updated instantaneously when we add/remove tasks to a cpu (or at least
> fast enough that we don't see the above problems).

So the per-task-load-tracking stuff already does that. It updates the
per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().

And keeping an unweighted per-cpu variant isn't that much more work.

> In the previous
> discussion [1] it was suggested to use a sum of unweighted task
> runnable_avg_{sum,period} ratios instead. That is, an unweighted
> equivalent to weighted_cpuload(). That isn't a perfect solution either.
> It is fine as long as the cpus are not fully utilized, but when they are
> we need to use weighted_cpuload() to preserve smp_nice. What to do
> around the tipping point needs more thought, but I think that is
> currently the best proposal for a solution for task and cpu utilization.

I'm not too worried about the tipping point, per task runnable figures
of an overloaded cpu are higher, so migration between an overloaded cpu
and an underloaded cpu is going to be tricky no matter what we do.

> rq runnable_avg_sum is useful for decisions where we need a longer term
> view of the cpu utilization, but I don't see how we can use it as a cpu
> utilization metric for load-balancing decisions at wakeup or
> periodically.

So keeping one with a faster decay would add extra per-task storage. But
would be possible..



2014-06-03 17:16:41

by Morten Rasmussen

Subject: Re: [PATCH v2 08/11] sched: get CPU's activity statistic

On Tue, Jun 03, 2014 at 04:40:58PM +0100, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> > The rq runnable_avg_{sum, period} give a very long term view of the cpu
> > utilization (I will use the term utilization instead of activity as I
> > think that is what we are talking about here). IMHO, it is too slow to
> > be used as basis for load balancing decisions. I think that was also
> > agreed upon in the last discussion related to this topic [1].
> >
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
>
> Like said earlier the 94% mark is actually hit much sooner, but yes,
> likely still too slow.
>
> 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..

Agreed.

>
> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
> >
> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or at least
> > fast enough that we don't see the above problems).
>
> So the per-task-load-tracking stuff already does that. It updates the
> per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().

I think there is some confusion here. There are two per-cpu load metrics
that track differently.

The cfs.runnable_load_avg is basically the sum of the load contributions
of the tasks on the cfs rq. The sum gets updated whenever tasks are
{en,de}queued by adding/subtracting the load contribution of the task
being added/removed. That is the one you are referring to.

The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is
tracking whether the cpu has something to do or not. It doesn't matter
how many tasks are runnable or what their load is. It is updated in
update_rq_runnable_avg(). It increases when rq->nr_running > 0 and
decays if not. It also takes time spent running rt tasks into account in
idle_{enter, exit}_fair(). So if you remove tasks from the rq, this
metric will start decaying and eventually get to 0, unlike the
cfs.runnable_load_avg where the task load contribution is subtracted every
time a task is removed. The rq runnable_avg_sum is the one being used in
this patch set.
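
To make the difference concrete, here is a toy model of the two signals (only a sketch: the class, its fields and the 512 contributions are invented for illustration, and the decay assumes y^32 = 0.5 with 1 ms periods; it is not how the kernel code is structured). The load sum changes at {en,de}queue time, while the busy signal only changes as time passes.

```python
# Toy model of the two per-cpu signals; not kernel code.
Y = 0.5 ** (1.0 / 32)  # assumed decay factor: y**32 == 0.5


class ToyRq:
    def __init__(self):
        self.nr_running = 0
        self.runnable_load_avg = 0.0  # sum of the queued tasks' contributions
        self.busy_sum = 0.0           # decaying "cpu has something to do" sum

    def enqueue(self, contrib):
        # Updated immediately when a task is added.
        self.nr_running += 1
        self.runnable_load_avg += contrib

    def dequeue(self, contrib):
        # Updated immediately when a task is removed.
        self.nr_running -= 1
        self.runnable_load_avg -= contrib

    def tick(self):
        # One period: decay the old sum, add 1 if anything is runnable.
        busy = 1.0 if self.nr_running > 0 else 0.0
        self.busy_sum = self.busy_sum * Y + busy

    def busy_fraction(self):
        # Normalize against the maximum of the series, 1 / (1 - Y).
        return self.busy_sum * (1.0 - Y)


rq = ToyRq()
rq.enqueue(512.0)
rq.enqueue(512.0)
for _ in range(345):
    rq.tick()
rq.dequeue(512.0)
rq.dequeue(512.0)
print(rq.runnable_load_avg, round(rq.busy_fraction(), 3))  # right after dequeue
for _ in range(32):
    rq.tick()
print(rq.runnable_load_avg, round(rq.busy_fraction(), 3))  # 32 idle periods later
```

In this model the load sum is back at 0 as soon as both tasks are dequeued, while the busy signal is still close to 1 at that point and only drops to about one half after 32 idle periods.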

Ben, pjt, please correct me if I'm wrong.

> And keeping an unweighted per-cpu variant isn't that much more work.

Agreed.

>
> > In the previous
> > discussion [1] it was suggested to use a sum of unweighted task
> > runnable_avg_{sum,period} ratios instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> > It is fine as long as the cpus are not fully utilized, but when they are
> > we need to use weighted_cpuload() to preserve smp_nice. What to do
> > around the tipping point needs more thought, but I think that is
> > currently the best proposal for a solution for task and cpu utilization.
>
> I'm not too worried about the tipping point, per task runnable figures
> of an overloaded cpu are higher, so migration between an overloaded cpu
> and an underloaded cpu is going to be tricky no matter what we do.

Yes, agreed. I just got the impression that you were concerned about
smp_nice last time we discussed this.

> > rq runnable_avg_sum is useful for decisions where we need a longer term
> > view of the cpu utilization, but I don't see how we can use it as a cpu
> > utilization metric for load-balancing decisions at wakeup or
> > periodically.
>
> So keeping one with a faster decay would add extra per-task storage. But
> would be possible..

I have had that thought when we discussed potential replacements for
cpu_load[]. It will require some messing around with the nicely
optimized load tracking maths if we want to have load tracking with a
different y-coefficient.

2014-06-03 17:37:53

by Peter Zijlstra

Subject: Re: [PATCH v2 08/11] sched: get CPU's activity statistic

On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
> > I'm not too worried about the tipping point, per task runnable figures
> > of an overloaded cpu are higher, so migration between an overloaded cpu
> > and an underloaded cpu is going to be tricky no matter what we do.
>
> Yes, agreed. I just got the impression that you were concerned about
> smp_nice last time we discussed this.

Well, yes, we need to keep that working, but the exact details around the
tipping point are near impossible to get right, so I'm not too bothered
there.

> > > rq runnable_avg_sum is useful for decisions where we need a longer term
> > > view of the cpu utilization, but I don't see how we can use it as a cpu
> > > utilization metric for load-balancing decisions at wakeup or
> > > periodically.
> >
> > So keeping one with a faster decay would add extra per-task storage. But
> > would be possible..
>
> I have had that thought when we discussed potential replacements for
> cpu_load[]. It will require some messing around with the nicely
> optimized load tracking maths if we want to have load tracking with a
> different y-coefficient.

My initial thought was a y=0.5, which is really >>=1. But yes, if we
want something else that'll get messy real fast methinks.
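
To illustrate the y=0.5 idea (a sketch only; the 1024 per-period contribution and the function name are assumptions for the example): with a half-life of one period the decay is a plain right shift, and the sum saturates within about a dozen periods.

```python
CONTRIB = 1024  # assumed contribution of one fully busy period


def update(total: int, busy: bool) -> int:
    """Advance one period with y = 0.5: halve the old sum (>>= 1),
    then add the contribution of the new period if the cpu was busy."""
    total >>= 1
    if busy:
        total += CONTRIB
    return total


total = 0
history = []
for _ in range(12):
    total = update(total, busy=True)
    history.append(total)
print(history)
```

Starting from 0 on a busy cpu, the sum goes 1024, 1536, 1792, ... and settles at 2047, so the 50% mark is reached after one period instead of 32.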

2014-06-03 17:39:45

by Peter Zijlstra

Subject: Re: [PATCH v2 08/11] sched: get CPU's activity statistic

On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
> > So the per-task-load-tracking stuff already does that. It updates the
> > per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().
>
> I think there is some confusion here. There are two per-cpu load metrics
> that track differently.
>
> The cfs.runnable_load_avg is basically the sum of the load contributions
> of the tasks on the cfs rq. The sum gets updated whenever tasks are
> {en,de}queued by adding/subtracting the load contribution of the task
> being added/removed. That is the one you are referring to.
>
> The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is
> tracking whether the cpu has something to do or not. It doesn't matter
> how many tasks are runnable or what their load is. It is updated in
> update_rq_runnable_avg(). It increases when rq->nr_running > 0 and
> decays if not. It also takes time spent running rt tasks into account in
> idle_{enter, exit}_fair(). So if you remove tasks from the rq, this
> metric will start decaying and eventually get to 0, unlike the
> cfs.runnable_load_avg where the task load contribution is subtracted every
> time a task is removed. The rq runnable_avg_sum is the one being used in
> this patch set.
>
> Ben, pjt, please correct me if I'm wrong.

Argh, ok I completely missed that. I think the cfs.runnable_load_avg is
the sane number; I'm not entirely sure what rq->avg.runnable_avg is
good for, it seems a weird metric on first consideration.

Will have to ponder that a bit more.

2014-06-04 07:15:33

by Yuyang Du

Subject: Re: [PATCH v2 08/11] sched: get CPU's activity statistic

> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
>
> Like said earlier the 94% mark is actually hit much sooner, but yes,
> likely still too slow.
>
> 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
>
> > In the previous
> > discussion [1] it was suggested to use a sum of unweighted task
> > runnable_avg_{sum,period} ratios instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> > It is fine as long as the cpus are not fully utilized, but when they are
> > we need to use weighted_cpuload() to preserve smp_nice. What to do
> > around the tipping point needs more thought, but I think that is
> > currently the best proposal for a solution for task and cpu utilization.
>
> I'm not too worried about the tipping point, per task runnable figures
> of an overloaded cpu are higher, so migration between an overloaded cpu
> and an underloaded cpu is going to be tricky no matter what we do.
>
Hi,

Can I join this discussion late?

As I understand it, you are talking about a metric for cpu activity. The
issues with runnable_avg_sum are its sluggishness in reflecting the latest
changes, and the need for unweighted load averages.

You might be aware of my recent proposal for CPU ConCurrency (CC). It is 1) an
average of nr_running, or 2) nr_running-weighted CPU utilization. So it is a
combination of CPU utilization and run queue (both factored in naturally). It
meets the needs you talked about, I think. You can take it as a candidate, or at
least we can talk about it?
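
A minimal sketch of what the two descriptions could mean, assuming a plain time-weighted average over sampled intervals (this is only an illustration with invented names and data, not the CC patch itself): the average of nr_running equals the cpu utilization multiplied by the mean nr_running while busy.

```python
def average_nr_running(samples):
    """Time-weighted average of nr_running.
    `samples` is a list of (duration_ms, nr_running) intervals."""
    total = sum(d for d, _ in samples)
    if total == 0:
        return 0.0
    return sum(d * n for d, n in samples) / total


def utilization(samples):
    """Fraction of the time with at least one runnable task."""
    total = sum(d for d, _ in samples)
    if total == 0:
        return 0.0
    return sum(d for d, n in samples if n > 0) / total


def mean_nr_running_when_busy(samples):
    """Average nr_running over the busy intervals only."""
    busy = sum(d for d, n in samples if n > 0)
    if busy == 0:
        return 0.0
    return sum(d * n for d, n in samples if n > 0) / busy


# Hypothetical trace: 10 ms idle, 20 ms with one task, 10 ms with three.
trace = [(10, 0), (20, 1), (10, 3)]
print(average_nr_running(trace))
print(utilization(trace) * mean_nr_running_when_busy(trace))
```

For this trace both expressions come out at 1.25 (up to floating-point rounding), with a utilization of 0.75.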

Thanks,
Yuyang