When a task is migrated from CPU_A to CPU_B, the scheduler removes the
task's load/util from the task's cfs_rq and adds them to the cfs_rq it
migrates to. But if the kernel enables CONFIG_FAIR_GROUP_SCHED, this
cfs_rq is not necessarily the same as the cpu's root cfs_rq. As a result,
after the task has migrated to CPU_B, CPU_A still holds the task's stale
load/util values; on the other hand, CPU_B does not yet reflect the new
load/util introduced by the task.
So this patch applies the task's load/util to the cpu's root cfs_rq, so
that the cpu's cfs_rq really reflects the task's migration.
Signed-off-by: Leo Yan <[email protected]>
---
kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..10ca1a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,12 +2825,24 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
struct sched_avg *sa = &cfs_rq->avg;
+ struct sched_avg *cpu_sa = NULL;
int decayed, removed = 0;
+ int cpu = cpu_of(rq_of(cfs_rq));
+
+ if (&cpu_rq(cpu)->cfs != cfs_rq)
+ cpu_sa = &cpu_rq(cpu)->cfs.avg;
if (atomic_long_read(&cfs_rq->removed_load_avg)) {
s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
sa->load_avg = max_t(long, sa->load_avg - r, 0);
sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
+
+ if (cpu_sa) {
+ cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0);
+ cpu_sa->load_sum = max_t(s64,
+ cpu_sa->load_sum - r * LOAD_AVG_MAX, 0);
+ }
+
removed = 1;
}
@@ -2838,6 +2850,12 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
sa->util_avg = max_t(long, sa->util_avg - r, 0);
sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
+
+ if (cpu_sa) {
+ cpu_sa->util_avg = max_t(long, cpu_sa->util_avg - r, 0);
+ cpu_sa->util_sum = max_t(s64,
+ cpu_sa->util_sum - r * LOAD_AVG_MAX, 0);
+ }
}
decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
@@ -2896,6 +2914,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ int cpu = cpu_of(rq_of(cfs_rq));
+
if (!sched_feat(ATTACH_AGE_LOAD))
goto skip_aging;
@@ -2919,6 +2939,13 @@ skip_aging:
cfs_rq->avg.load_sum += se->avg.load_sum;
cfs_rq->avg.util_avg += se->avg.util_avg;
cfs_rq->avg.util_sum += se->avg.util_sum;
+
+ if (&cpu_rq(cpu)->cfs != cfs_rq) {
+ cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg;
+ cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum;
+ cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg;
+ cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum;
+ }
}
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
--
1.9.1
On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
> When task is migrated from CPU_A to CPU_B, scheduler will decrease
> the task's load/util from the task's cfs_rq and also add them into
> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this
> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is
> migrated to CPU_B, then CPU_A still have task's stale value for
> load/util; on the other hand CPU_B also cannot reflect new load/util
> which introduced by the task.
>
> So this patch is to operate the task's load/util to cpu's cfs_rq, so
> finally cpu's cfs_rq can really reflect task's migration.
Sorry but that is unintelligible.
What is the problem? Why do we care? How did you fix it? and at what
cost?
On 04/01/2016 12:49 PM, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
>> When task is migrated from CPU_A to CPU_B, scheduler will decrease
>> the task's load/util from the task's cfs_rq and also add them into
>> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this
>> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is
>> migrated to CPU_B, then CPU_A still have task's stale value for
>> load/util; on the other hand CPU_B also cannot reflect new load/util
>> which introduced by the task.
>>
>> So this patch is to operate the task's load/util to cpu's cfs_rq, so
>> finally cpu's cfs_rq can really reflect task's migration.
>
> Sorry but that is unintelligible.
>
> What is the problem? Why do we care? How did you fix it? and at what
> cost?
I think I follow - Leo please correct me if I mangle your intentions.
It's an issue that Morten and Dietmar had mentioned to me as well.
Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
task group other than the root. The task migrates from one CPU to
another. The cfs_rq.avg structures on the src and dest CPUs
corresponding to the group containing the task are updated immediately
via remove_entity_load_avg()/update_cfs_rq_load_avg() and
attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
on the src and dest are not immediately updated. The root cfs_rq.avg
must catch up over time with PELT.
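To make that catch-up concrete, here is a small standalone Python sketch (not kernel code; the 40% duty cycle and the 1024-based units are illustrative assumptions) of how the root cfs_rq's util_avg only converges geometrically after the group cfs_rq has been updated instantly:

```python
# Hypothetical numeric sketch (not kernel code) of how the root cfs_rq's
# util_avg catches up after a migration, while the group cfs_rq is updated
# at attach time. The 40% duty cycle and 1024-based units are illustrative.
Y = 0.5 ** (1 / 32)  # PELT decays by y per 1 ms period, with y**32 == 0.5

def pelt_step(avg, running_frac):
    """One 1 ms PELT step: decay the old signal, add this period's share."""
    return avg * Y + 1024 * running_frac * (1 - Y)

task_util = 410          # ~40% duty cycle, in 1024-based units
group_util = task_util   # group cfs_rq on the dest CPU: updated immediately

root_util = 0.0          # root cfs_rq: starts without the task's contribution
for ms in range(100):
    root_util = pelt_step(root_util, running_frac=task_util / 1024)

# After 100 ms the root signal is close to, but still below, the group value.
print(round(group_util), round(root_util))
```

The gap between the two printed values is what schedutil would see right after the migration.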
As to why we care, there's at least one issue which may or may not be
Leo's - the root cfs_rq is the one whose avg structure we read from to
determine the frequency with schedutil. If you have a cpu-bound task in
a non-root cgroup which periodically migrates among CPUs on an otherwise
idle system, I believe each time it migrates the frequency would drop to
fmin and have to ramp up again with PELT.
Leo, I noticed you did not modify detach_entity_load_avg(). I think
this would be needed to avoid the task's stats being double-counted for
a while after switched_from_fair() or task_move_group_fair().
On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> On 04/01/2016 12:49 PM, Peter Zijlstra wrote:
> > On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
> >> When task is migrated from CPU_A to CPU_B, scheduler will decrease
> >> the task's load/util from the task's cfs_rq and also add them into
> >> migrated cfs_rq. But if kernel enables CONFIG_FAIR_GROUP_SCHED then this
> >> cfs_rq is not the same one with cpu's cfs_rq. As a result, after task is
> >> migrated to CPU_B, then CPU_A still have task's stale value for
> >> load/util; on the other hand CPU_B also cannot reflect new load/util
> >> which introduced by the task.
> >>
> >> So this patch is to operate the task's load/util to cpu's cfs_rq, so
> >> finally cpu's cfs_rq can really reflect task's migration.
> >
> > Sorry but that is unintelligible.
> >
> > What is the problem? Why do we care? How did you fix it? and at what
> > cost?
Sorry. I should have taken the time to write the commit message properly.
> I think I follow - Leo please correct me if I mangle your intentions.
> It's an issue that Morten and Dietmar had mentioned to me as well.
>
> Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
> task group other than the root. The task migrates from one CPU to
> another. The cfs_rq.avg structures on the src and dest CPUs
> corresponding to the group containing the task are updated immediately
> via remove_entity_load_avg()/update_cfs_rq_load_avg() and
> attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
> on the src and dest are not immediately updated. The root cfs_rq.avg
> must catch up over time with PELT.
>
> As to why we care, there's at least one issue which may or may not be
> Leo's - the root cfs_rq is the one whose avg structure we read from to
> determine the frequency with schedutil. If you have a cpu-bound task in
> a non-root cgroup which periodically migrates among CPUs on an otherwise
> idle system, I believe each time it migrates the frequency would drop to
> fmin and have to ramp up again with PELT.
Steve, thanks for the explanation; I totally agree. My initial purpose was
not from schedutil's perspective, I just did some basic analysis of
utilization. My case was: fix the cpu to its maximum frequency of 1.2GHz,
then launch a periodic task (period: 100ms, duty cycle: 40%) and limit its
affinity to only two CPUs. I observed the same result you describe.
After applying this patch, I get a much better result for the CPU's
utilization after the task's migration. Please see the parsed results for
the CPUs' utilization:
http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png
> Leo I noticed you did not modify detach_entity_load_average(). I think
> this would be needed to avoid the task's stats being double counted for
> a while after switched_from_fair() or task_move_group_fair().
Will add it.
Thanks,
Leo Yan
On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > I think I follow - Leo please correct me if I mangle your intentions.
> > It's an issue that Morten and Dietmar had mentioned to me as well.
Yes. We have been working on this issue for a while without getting to a
nice solution yet.
> > Assume CONFIG_FAIR_GROUP_SCHED is enabled and a task is running in a
> > task group other than the root. The task migrates from one CPU to
> > another. The cfs_rq.avg structures on the src and dest CPUs
> > corresponding to the group containing the task are updated immediately
> > via remove_entity_load_avg()/update_cfs_rq_load_avg() and
> > attach_entity_load_avg(). But the cfs_rq.avg corresponding to the root
> > on the src and dest are not immediately updated. The root cfs_rq.avg
> > must catch up over time with PELT.
Yes. The problem is that it is only the cfs_rq.avg of the cfs_rq where
the task is enqueued/dequeued that gets immediately updated. If the cfs_rq
is a group cfs_rq, its group entity's se.avg doesn't get updated immediately.
It has to adapt over time at the pace defined by the geometric series.
The impact of a task migration therefore doesn't trickle through to the
root cfs_rq.avg. This behaviour is one of the fundamental changes Yuyang
introduced with his rewrite of PELT.
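The pace of that geometric series can be sketched numerically (a standalone Python illustration with an assumed stale contribution of 400 units, not kernel code): with PELT's per-1ms decay factor y chosen so that y^32 = 0.5, a stale contribution halves every 32 ms:

```python
# Illustrative decay of a stale contribution after a migration (not kernel
# code). PELT uses a per-1ms decay factor y with y**32 == 0.5, i.e. a 32 ms
# half-life, so the old value trickles away rather than vanishing at once.
Y = 0.5 ** (1 / 32)

stale_util = 400.0  # util the migrated task's group entity still shows
for ms in (32, 64, 96):
    print(ms, round(stale_util * Y ** ms))
```

Each printed line is one half-life: the stale signal drops to 200, 100, then 50.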
> > As to why we care, there's at least one issue which may or may not be
> > Leo's - the root cfs_rq is the one whose avg structure we read from to
> > determine the frequency with schedutil. If you have a cpu-bound task in
> > a non-root cgroup which periodically migrates among CPUs on an otherwise
> > idle system, I believe each time it migrates the frequency would drop to
> > fmin and have to ramp up again with PELT.
It makes any scheduling decision based on utilization difficult if fair
group scheduling is used, as cpu_util() doesn't give an up-to-date
picture of the utilization caused by tasks in task groups.
For the energy-aware scheduling patches and patches we have in the
pipeline for improving capacity awareness in the scheduler we rely on
cpu_util().
> Steve, thanks for explanation and totally agree. My initial purpose is
> not from schedutil's perspective, but just did some basic analysis for
> utilization. So my case is: fixed cpu to maximum frequency 1.2GHz, then
> launch a periodic task (period: 100ms, duty cycle: 40%) and limit its
> affinity to only two CPUs. So observed the same result like you said.
>
> After applied this patch, I can get much better result for the CPU's
> utilization after task's migration. Please see the parsed result for
> CPU's utilization:
> http://people.linaro.org/~leo.yan/eas_profiling/cpu1_util_update_cfs_rq_avg.png
> http://people.linaro.org/~leo.yan/eas_profiling/cpu2_util_update_cfs_rq_avg.png
>
> > Leo I noticed you did not modify detach_entity_load_average(). I think
> > this would be needed to avoid the task's stats being double counted for
> > a while after switched_from_fair() or task_move_group_fair().
I'm afraid that the solution to the problem is more complicated than that
:-(
You are adding/removing a contribution from the root cfs_rq.avg which
isn't part of the signal in the first place. The root cfs_rq.avg only
contains the sum of the load/util of the sched_entities on the cfs_rq.
If you remove the contribution of the tasks from there you may end up
double-accounting for the task migration: once due to your patch, and then
again slowly over time as the group sched_entity starts reflecting that
the task has migrated. Furthermore, for group scheduling to make sense
it has to be the task_h_load() you add/remove otherwise the group
weighting is completely lost. Or am I completely misreading your patch?
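The double-accounting concern can be sketched with a standalone Python toy model (the 400-unit contribution is an assumed illustrative number, not kernel code; the real code clamps at zero with max_t(), so in practice the effect shows up as under-counting rather than a negative value):

```python
# Toy model of the double-accounting concern (illustrative numbers only).
Y = 0.5 ** (1 / 32)        # PELT per-1ms decay factor, y**32 == 0.5

group_se_util = 400.0      # group entity's contribution as seen by the root
patch_adjustment = -400.0  # the patch removes the task's util at once

# After the migration the group entity's own signal also decays toward 0,
# and the root tracks that decline, removing the same contribution again.
for ms in range(200):
    group_se_util *= Y

root_util = group_se_util + patch_adjustment
print(round(root_util))    # far below zero: the removal was counted twice
```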
I don't think the slow response time for _load_ is necessarily a big
problem. Otherwise we would have had people complaining already about
group scheduling being broken. It is, however, a problem for all the
initiatives that build on utilization.
We have to either make the updates trickle through the group hierarchy
for utilization, which is difficult without making a lot of changes to
the current code structure, or introduce a new avg structure at root level
which contains the sum of task utilization for all _tasks_ (not groups)
in the group hierarchy and maintain it separately.
None of those two are particularly pretty. Better suggestions?
On Sat, Apr 02, 2016 at 12:38:37AM +0800, Leo Yan wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fe30e6..10ca1a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2825,12 +2825,24 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> {
> struct sched_avg *sa = &cfs_rq->avg;
> + struct sched_avg *cpu_sa = NULL;
> int decayed, removed = 0;
> + int cpu = cpu_of(rq_of(cfs_rq));
> +
> + if (&cpu_rq(cpu)->cfs != cfs_rq)
> + cpu_sa = &cpu_rq(cpu)->cfs.avg;
>
> if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> sa->load_avg = max_t(long, sa->load_avg - r, 0);
> sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> +
> + if (cpu_sa) {
> + cpu_sa->load_avg = max_t(long, cpu_sa->load_avg - r, 0);
> + cpu_sa->load_sum = max_t(s64,
> + cpu_sa->load_sum - r * LOAD_AVG_MAX, 0);
I think this is not quite right. 'r' is the load contribution removed
from a group cfs_rq. Unless the task is the only task in the group and
the group shares are default, this value is different from the load
contribution of the task at root level (task_h_load()).
And how about nested groups? I think you may end up removing the
contribution of a task multiple times from the root cfs_rq if it is in a
nested group.
You will need to either redesign group scheduling so you get the load to
trickle down automatically and with the right contribution, or maintain
a new sum at the root bypassing the existing design. I'm not sure if the
latter makes sense for load though. It would make more sense for
utilization only.
> @@ -2919,6 +2939,13 @@ skip_aging:
> cfs_rq->avg.load_sum += se->avg.load_sum;
> cfs_rq->avg.util_avg += se->avg.util_avg;
> cfs_rq->avg.util_sum += se->avg.util_sum;
> +
> + if (&cpu_rq(cpu)->cfs != cfs_rq) {
> + cpu_rq(cpu)->cfs.avg.load_avg += se->avg.load_avg;
> + cpu_rq(cpu)->cfs.avg.load_sum += se->avg.load_sum;
> + cpu_rq(cpu)->cfs.avg.util_avg += se->avg.util_avg;
> + cpu_rq(cpu)->cfs.avg.util_sum += se->avg.util_sum;
> + }
Same problem as above. You should be adding the right amount of task
contribution here. Something similar to task_h_load(). The problem with
using task_h_load() is that it is not updated immediately either, so
maintaining a stable signal based on that might be difficult.
On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > I think I follow - Leo please correct me if I mangle your intentions.
> > > It's an issue that Morten and Dietmar had mentioned to me as well.
>
> Yes. We have been working on this issue for a while without getting to a
> nice solution yet.
So do you want a "flat hierarchy" for util_avg - just do util_avg for the
rq and tasks respectively? It seems that is what you want, and it is even easier?
Thanks,
Yuyang
On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > I think I follow - Leo please correct me if I mangle your intentions.
> > > It's an issue that Morten and Dietmar had mentioned to me as well.
>
> Yes. We have been working on this issue for a while without getting to a
> nice solution yet.
Good to know this. This patch is mainly for discussion purposes.
[...]
> > > Leo I noticed you did not modify detach_entity_load_average(). I think
> > > this would be needed to avoid the task's stats being double counted for
> > > a while after switched_from_fair() or task_move_group_fair().
>
> I'm afraid that the solution to problem is more complicated than that
> :-(
>
> You are adding/removing a contribution from the root cfs_rq.avg which
> isn't part of the signal in the first place. The root cfs_rq.avg only
> contains the sum of the load/util of the sched_entities on the cfs_rq.
> If you remove the contribution of the tasks from there you may end up
> double-accounting for the task migration. Once due to you patch and then
> again slowly over time as the group sched_entity starts reflecting that
> the task has migrated. Furthermore, for group scheduling to make sense
> it has to be the task_h_load() you add/remove otherwise the group
> weighting is completely lost. Or am I completely misreading your patch?
One thing I want to confirm first: though CFS maintains the task group
hierarchy, it updates the task group's cfs_rq.avg and the root cfs_rq.avg
independently, rather than accounting for them across the hierarchy.
So currently CFS decreases the group's cfs_rq.avg on a task's migration,
but it doesn't iterate up the task group hierarchy to the root cfs_rq.avg.
I don't understand the second accounting you mentioned with "then again
slowly over time as the group sched_entity starts reflecting that the
task has migrated."
Another question: does cfs_rq.avg _ONLY_ reflect historic behavior and
not present behavior, so even after the task has migrated we still need
to decay it slowly? Or is this different between load and util?
> I don't think the slow response time for _load_ is necessarily a big
> problem. Otherwise we would have had people complaining already about
> group scheduling being broken. It is however a problem for all the
> initiatives that built on utilization.
Or maybe we need to separate utilization and load; these two signals
have different semantics and purposes.
Thanks,
Leo Yan
On Tue, Apr 05, 2016 at 02:30:03AM +0800, Yuyang Du wrote:
> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > > I think I follow - Leo please correct me if I mangle your intentions.
> > > > It's an issue that Morten and Dietmar had mentioned to me as well.
> >
> > Yes. We have been working on this issue for a while without getting to a
> > nice solution yet.
>
> So do you want a "flat hirarchy" for util_avg - just do util_avg for
> rq and task respectively? Seems it is what you want, and it is even easier?
Pretty much, yes. I can't think of a good reason why we need the
utilization of groups as long as we have the task utilization and the
sum of those for the root cfs_rq.
I'm not saying it can't be implemented, just saying that it will make
utilization tracking for groups redundant and possibly duplicate or hack
some of the existing code to implement the new root utilization sum.
On Tue, Apr 05, 2016 at 08:51:13AM +0100, Morten Rasmussen wrote:
> On Tue, Apr 05, 2016 at 02:30:03AM +0800, Yuyang Du wrote:
> > On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> > > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > > > I think I follow - Leo please correct me if I mangle your intentions.
> > > > > It's an issue that Morten and Dietmar had mentioned to me as well.
> > >
> > > Yes. We have been working on this issue for a while without getting to a
> > > nice solution yet.
> >
> > So do you want a "flat hirarchy" for util_avg - just do util_avg for
> > rq and task respectively? Seems it is what you want, and it is even easier?
>
> Pretty much, yes. I can't think of a good reason why we need the
> utilization of groups as long as we have the task utilization and the
> sum of those for the root cfs_rq.
Sounds good to me too.
> I'm not saying it can't be implemented, just saying that it will make
> utilization tracking for groups redundant and possibly duplicate or hack
> some the existing code to implement the new root utilization sum.
An initial evaluation of the implementation: it looks much easier to do
(at least) than the current one. Let's wait for a day or two; if there is
no objection, then let's do it.
On Tue, Apr 05, 2016 at 02:56:44PM +0800, Leo Yan wrote:
> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
> > On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
> > > On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
> > > > I think I follow - Leo please correct me if I mangle your intentions.
> > > > It's an issue that Morten and Dietmar had mentioned to me as well.
> >
> > Yes. We have been working on this issue for a while without getting to a
> > nice solution yet.
>
> Good to know this. This patch is mainly for discussion purpose.
>
> [...]
>
> > > > Leo I noticed you did not modify detach_entity_load_average(). I think
> > > > this would be needed to avoid the task's stats being double counted for
> > > > a while after switched_from_fair() or task_move_group_fair().
> >
> > I'm afraid that the solution to problem is more complicated than that
> > :-(
> >
> > You are adding/removing a contribution from the root cfs_rq.avg which
> > isn't part of the signal in the first place. The root cfs_rq.avg only
> > contains the sum of the load/util of the sched_entities on the cfs_rq.
> > If you remove the contribution of the tasks from there you may end up
> > double-accounting for the task migration. Once due to you patch and then
> > again slowly over time as the group sched_entity starts reflecting that
> > the task has migrated. Furthermore, for group scheduling to make sense
> > it has to be the task_h_load() you add/remove otherwise the group
> > weighting is completely lost. Or am I completely misreading your patch?
>
> Here have one thing want to confirm firstly: though CFS has maintained
> task group's hierarchy, but between task group's cfs_rq.avg and root
> cfs_rq.avg, CFS updates these signals independently rather than accouting
> them by crossing the hierarchy.
>
> So currently CFS decreases the group's cfs_rq.avg for task's migration,
> but it don't iterate task group's hierarchy to root cfs_rq.avg. I
> don't understand your meantioned the second accounting by "then again
> slowly over time as the group sched_entity starts reflecting that the
> task has migrated."
The problem is that there is no direct link between a group sched_entity's
se->avg and se->my_q.avg. The latter is the sum of the PELT load/util of
the sched_entities (tasks or nested groups) on the group cfs_rq, while the
former is the PELT load/util of the group entity, which is not based on
the cfs_rq sum but basically just tracks whether that group entity has
been running/runnable or not, weighted by the group load code, which
updates the weight occasionally.
In other words, we do go up/down the hierarchy when tasks migrate, but
we only update the se->my_q.avg (cfs_rq), not the se->avg which is the
load of the group seen by the parent cfs_rq. So the immediate update of
the group cfs_rq.avg where the task sched_entity is enqueued/dequeued
doesn't trickle through the hierarchy instantaneously.
> Another question is: does cfs_rq.avg _ONLY_ signal historic behavior but
> not present behavior? so even the task has been migrated we still need
> decay it slowly? Or this will be different between load and util?
cfs_rq.avg is instantaneously updated on task migration as it is the sum
of the PELT contributions of the sched_entities associated with that
cfs_rq. The group se->avg is not a sum; it behaves just as if it were a
task with a variable load_weight determined by the group weighting code,
but is otherwise identical. No adding/removing of contributions when
tasks migrate.
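The structural distinction can be sketched as two tiny standalone Python classes (purely illustrative names and numbers, not kernel structures):

```python
# Illustrative sketch (not kernel code) of the difference between a group
# cfs_rq's avg (a sum, patched on attach/detach) and the group entity's own
# avg (tracked like a task signal, untouched by migrations).

class CfsRqAvg:                # sum of child contributions
    def __init__(self):
        self.util_avg = 0

class GroupSeAvg:              # tracked like a task; only converges over time
    def __init__(self, util):
        self.util_avg = util

group_cfs_rq = CfsRqAvg()
group_se = GroupSeAvg(util=400)

group_cfs_rq.util_avg += 400   # task enqueued: cfs_rq sum updated at once
group_cfs_rq.util_avg -= 400   # task migrates away: sum updated at once again

# group_se.util_avg is untouched by the migration; the parent (root) cfs_rq
# therefore still sees the old 400 until the series decays it.
print(group_cfs_rq.util_avg, group_se.util_avg)
```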
>
> > I don't think the slow response time for _load_ is necessarily a big
> > problem. Otherwise we would have had people complaining already about
> > group scheduling being broken. It is however a problem for all the
> > initiatives that built on utilization.
>
> Or maybe we need seperate utilization and load, these two signals
> have different semantics and purpose.
I think that is up for discussion. People might have different views on
the semantics of utilization. I see them as very similar in the
non-group scheduling case, one is based on running time and not priority
weighted, the other is based on runnable time and has priority
weighting. Otherwise they are the same.
However, in the group scheduling case, I think they should behave
somewhat differently. Load is priority-scaled and is designed to ensure
fair scheduling when the system is fully utilized, whereas utilization
provides a metric that estimates the actual busy time of the cpus. Group
load is scaled such that it is capped no matter how much actual cpu time
the group gets across the system. I don't think it makes sense to do the
same for utilization as it would not represent the actual compute
demand. It should be treated as a 'flat hierarchy' as Yuyang mentions in
his reply, so the sum at the root cfs_rq is a proper estimate of the
utilization of the cpu regardless of whether tasks are grouped or not.
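A minimal standalone Python sketch of such a 'flat hierarchy' for utilization (hypothetical class and task utils, not a proposed kernel API): the root keeps a plain sum of task utilization, so grouping no longer delays migration updates:

```python
# Minimal sketch (hypothetical names) of a flat root-level utilization sum:
# only *task* utilization is aggregated, groups are bypassed entirely.

class FlatRootUtil:
    def __init__(self):
        self.util_avg = 0

    def attach_task(self, task_util):
        self.util_avg += task_util

    def detach_task(self, task_util):
        self.util_avg = max(self.util_avg - task_util, 0)  # clamp like max_t()

root = FlatRootUtil()
root.attach_task(300)   # task in a nested group A/B
root.attach_task(200)   # task in the root group
root.detach_task(300)   # migration away: the root reflects it immediately
print(root.util_avg)
```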
Hi Yuyang,
On 05/04/16 01:15, Yuyang Du wrote:
> On Tue, Apr 05, 2016 at 08:51:13AM +0100, Morten Rasmussen wrote:
>> On Tue, Apr 05, 2016 at 02:30:03AM +0800, Yuyang Du wrote:
>>> On Mon, Apr 04, 2016 at 09:48:23AM +0100, Morten Rasmussen wrote:
>>>> On Sat, Apr 02, 2016 at 03:11:54PM +0800, Leo Yan wrote:
>>>>> On Fri, Apr 01, 2016 at 03:28:49PM -0700, Steve Muckle wrote:
>>>>>> I think I follow - Leo please correct me if I mangle your intentions.
>>>>>> It's an issue that Morten and Dietmar had mentioned to me as well.
>>>>
>>>> Yes. We have been working on this issue for a while without getting to a
>>>> nice solution yet.
>>>
>>> So do you want a "flat hirarchy" for util_avg - just do util_avg for
>>> rq and task respectively? Seems it is what you want, and it is even easier?
>>
>> Pretty much, yes. I can't think of a good reason why we need the
>> utilization of groups as long as we have the task utilization and the
>> sum of those for the root cfs_rq.
>
> Sound good to me too.
>
>> I'm not saying it can't be implemented, just saying that it will make
>> utilization tracking for groups redundant and possibly duplicate or hack
>> some the existing code to implement the new root utilization sum.
>
> A initial evaluation of the implementation: it looks much easier to do (at
> least) than the current. Lets wait for a day or two, if no objection, then
> lets do it.
>
I have been playing with this patch to achieve this 'flat hierarchy for
util_avg' after I gave up on implementing the propagation down the
cfs_rq/se hierarchy for task groups. The patch has been created w/o your
'sched/fair: Initiate a new task's util avg to a bounded value', which
recently went into tip/sched/core.
-- >8 --
Subject: [PATCH] sched/fair: Aggregate task utilization only on the root
cfs_rq
Cpu utilization is defined as the sched_avg.util_avg signal of the root
cfs_rq of a cpu, capped by the cpu's (original) capacity.
With the current pelt version, the utilization of a task enqueued/dequeued
on/from a cfs_rq, representing a task group other than the root task group
on a cpu, is not immediately propagated down to the root cfs_rq of that
cpu.
This makes decisions based on cpu_util() for scheduling or cpu frequency
settings less accurate in case tasks are running in task groups.
This patch aggregates the task utilization only on the root cfs_rq,
essentially bypassing cfs_rq's and se's representing task groups
(&rq_of(cfs_rq)->cfs != cfs_rq and !entity_is_task(se)).
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 42 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33130529e9b5..51d675715776 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,8 +682,10 @@ void init_entity_runnable_average(struct sched_entity *se)
sa->period_contrib = 1023;
sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
- sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
- sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+ if (entity_is_task(se)) {
+ sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
+ sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+ }
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
@@ -2651,6 +2653,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
u32 contrib;
unsigned int delta_w, scaled_delta_w, decayed = 0;
unsigned long scale_freq, scale_cpu;
+ int update_util = 0;
+
+ if (cfs_rq) {
+ if (&rq_of(cfs_rq)->cfs == cfs_rq)
+ update_util = 1;
+ } else {
+ if (entity_is_task(container_of(sa, struct sched_entity, avg)))
+ update_util = 1;
+ }
delta = now - sa->last_update_time;
/*
@@ -2696,7 +2707,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
weight * scaled_delta_w;
}
}
- if (running)
+ if (update_util && running)
sa->util_sum += scaled_delta_w * scale_cpu;
delta -= delta_w;
@@ -2720,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
if (cfs_rq)
cfs_rq->runnable_load_sum += weight * contrib;
}
- if (running)
+ if (update_util && running)
sa->util_sum += contrib * scale_cpu;
}
@@ -2731,7 +2742,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
if (cfs_rq)
cfs_rq->runnable_load_sum += weight * scaled_delta;
}
- if (running)
+ if (update_util && running)
sa->util_sum += scaled_delta * scale_cpu;
sa->period_contrib += delta;
@@ -2742,7 +2753,8 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
cfs_rq->runnable_load_avg =
div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
}
- sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
+ if (update_util)
+ sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
}
return decayed;
@@ -2834,7 +2846,8 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
removed = 1;
}
- if (atomic_long_read(&cfs_rq->removed_util_avg)) {
+ if ((&rq_of(cfs_rq)->cfs == cfs_rq) &&
+ atomic_long_read(&cfs_rq->removed_util_avg)) {
long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
sa->util_avg = max_t(long, sa->util_avg - r, 0);
sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
@@ -2893,8 +2906,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
se->avg.last_update_time = cfs_rq->avg.last_update_time;
cfs_rq->avg.load_avg += se->avg.load_avg;
cfs_rq->avg.load_sum += se->avg.load_sum;
- cfs_rq->avg.util_avg += se->avg.util_avg;
- cfs_rq->avg.util_sum += se->avg.util_sum;
+
+ if (!entity_is_task(se))
+ return;
+
+ rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
+ rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
}
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -2905,8 +2922,14 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
- cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
- cfs_rq->avg.util_sum = max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
+
+ if (!entity_is_task(se))
+ return;
+
+ rq_of(cfs_rq)->cfs.avg.util_avg =
+ max_t(long, rq_of(cfs_rq)->cfs.avg.util_avg - se->avg.util_avg, 0);
+ rq_of(cfs_rq)->cfs.avg.util_sum =
+ max_t(s32, rq_of(cfs_rq)->cfs.avg.util_sum - se->avg.util_sum, 0);
}
/* Add the load generated by se into cfs_rq's load average */
@@ -2989,7 +3012,11 @@ void remove_entity_load_avg(struct sched_entity *se)
__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
- atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
+
+ if (!entity_is_task(se))
+ return;
+
+ atomic_long_add(se->avg.util_avg, &rq_of(cfs_rq)->cfs.removed_util_avg);
}
static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
@@ -8268,7 +8295,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
#endif
#ifdef CONFIG_SMP
atomic_long_set(&cfs_rq->removed_load_avg, 0);
- atomic_long_set(&cfs_rq->removed_util_avg, 0);
+
+ if (&rq_of(cfs_rq)->cfs == cfs_rq)
+ atomic_long_set(&cfs_rq->removed_util_avg, 0);
#endif
}
--
1.9.1
On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
> @@ -2893,8 +2906,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> se->avg.last_update_time = cfs_rq->avg.last_update_time;
> cfs_rq->avg.load_avg += se->avg.load_avg;
> cfs_rq->avg.load_sum += se->avg.load_sum;
> - cfs_rq->avg.util_avg += se->avg.util_avg;
> - cfs_rq->avg.util_sum += se->avg.util_sum;
> +
> + if (!entity_is_task(se))
> + return;
> +
> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
To me it seems that you cannot be sure that the rq_of(cfs_rq)->cfs.avg
time stamp is aligned with se->avg time stamp, which is necessary before
you can add/subtract two geometric series without introducing an error.
attach_entity_load_avg() is called (through a couple of other functions)
from the for_each_sched_entity() loop in enqueue_task_fair() which works
its way towards the root cfs_rq, i.e. rq_of(cfs_rq)->cfs. So in the loop
iteration where you attach the task sched_entity, we haven't yet visited
and updated rq_of(cfs_rq)->cfs.avg.
If you just add the task contribution and discover later that there is a
time delta when you update rq_of(cfs_rq)->cfs.avg, you end up decaying
the task contribution, which was already up-to-date, and its util
contribution to rq_of(cfs_rq)->cfs.avg ends up smaller than it
should be.
Am I missing something?
On 6 April 2016 at 10:37, Morten Rasmussen <[email protected]> wrote:
> On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
>> @@ -2893,8 +2906,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>> se->avg.last_update_time = cfs_rq->avg.last_update_time;
>> cfs_rq->avg.load_avg += se->avg.load_avg;
>> cfs_rq->avg.load_sum += se->avg.load_sum;
>> - cfs_rq->avg.util_avg += se->avg.util_avg;
>> - cfs_rq->avg.util_sum += se->avg.util_sum;
>> +
>> + if (!entity_is_task(se))
>> + return;
>> +
>> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>
> To me it seems that you cannot be sure that the rq_of(cfs_rq)->cfs.avg
> time stamp is aligned with se->avg time stamp, which is necessary before
> you can add/subtract two geometric series without introducing an error.
>
> attach_entity_load_avg() is called (through a couple of other functions)
> from the for_each_sched_entity() loop in enqueue_task_fair() which works
> its way towards the root cfs_rq, i.e. rq_of(cfs_rq)->cfs. So in the loop
> iteration where you attach the task sched_entity, we haven't yet visited
> and updated rq_of(cfs_rq)->cfs.avg.
>
> If you just add the task contribution and discover later that there is a
> time delta when you update rq_of(cfs_rq)->cfs.avg you end up decaying
> the task contribution which was already up-to-date and its util
> contribution to rq_of(cfs_rq)->cfs.avg ends up being smaller than it
> should be.
>
> Am I missing something?
Yes I agree that se->avg and rq_of(cfs_rq)->cfs.avg have to be aligned
on the same time stamp before adding or removing se.
On 06/04/16 09:37, Morten Rasmussen wrote:
> On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
>> @@ -2893,8 +2906,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>> se->avg.last_update_time = cfs_rq->avg.last_update_time;
>> cfs_rq->avg.load_avg += se->avg.load_avg;
>> cfs_rq->avg.load_sum += se->avg.load_sum;
>> - cfs_rq->avg.util_avg += se->avg.util_avg;
>> - cfs_rq->avg.util_sum += se->avg.util_sum;
>> +
>> + if (!entity_is_task(se))
>> + return;
>> +
>> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>
> To me it seems that you cannot be sure that the rq_of(cfs_rq)->cfs.avg
> time stamp is aligned with se->avg time stamp, which is necessary before
> you can add/subtract two geometric series without introducing an error.
>
> attach_entity_load_avg() is called (through a couple of other functions)
> from the for_each_sched_entity() loop in enqueue_task_fair() which works
> its way towards the root cfs_rq, i.e. rq_of(cfs_rq)->cfs. So in the loop
> iteration where you attach the task sched_entity, we haven't yet visited
> and updated rq_of(cfs_rq)->cfs.avg.
>
> If you just add the task contribution and discover later that there is a
> time delta when you update rq_of(cfs_rq)->cfs.avg you end up decaying
> the task contribution which was already up-to-date and its util
> contribution to rq_of(cfs_rq)->cfs.avg ends up being smaller than it
> should be.
>
> Am I missing something?
No, that's definitely wrong in the current patch. I could defer the attach into
the update_cfs_rq_load_avg() call for the root cfs_rq if
'&rq_of(cfs_rq)->cfs != cfs_rq' in attach_entity_load_avg():
Something like this (only lightly tested!):
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 51d675715776..d31d9cd453a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2856,6 +2856,16 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+ if (cfs_rq->added_util_avg) {
+ sa->util_avg += cfs_rq->added_util_avg;
+ cfs_rq->added_util_avg = 0;
+ }
+
+ if (cfs_rq->added_util_sum) {
+ sa->util_sum += cfs_rq->added_util_sum;
+ cfs_rq->added_util_sum = 0;
+ }
+
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->load_last_update_time_copy = sa->last_update_time;
@@ -2910,8 +2920,13 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
if (!entity_is_task(se))
return;
- rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
- rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
+ if (&rq_of(cfs_rq)->cfs == cfs_rq) {
+ rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
+ rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
+ } else {
+ rq_of(cfs_rq)->cfs.added_util_avg = se->avg.util_avg;
+ rq_of(cfs_rq)->cfs.added_util_sum = se->avg.util_sum;
+ }
}
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -4344,8 +4359,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running++;
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_throttled(cfs_rq)) {
+ rq_of(cfs_rq)->cfs.added_util_avg = 0;
+ rq_of(cfs_rq)->cfs.added_util_sum = 0;
break;
+ }
update_load_avg(se, 1);
update_cfs_shares(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2ff5a2bd6df..f0ea3a7eaf07 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -391,6 +391,8 @@ struct cfs_rq {
unsigned long tg_load_avg_contrib;
#endif
atomic_long_t removed_load_avg, removed_util_avg;
+ u32 added_util_sum;
+ unsigned long added_util_avg;
#ifndef CONFIG_64BIT
u64 load_last_update_time_copy;
But attach_entity_load_avg() is not only called in enqueue_entity_load_avg() for migrated
tasks but also in attach_task_cfs_rq(), which is called from switched_to_fair() and
task_move_group_fair(), where we can't assume that a call to update_cfs_rq_load_avg()
follows the attach, like it does after enqueue_entity_load_avg() in
enqueue_task_fair():
for_each_sched_entity(se)
enqueue_entity()
enqueue_entity_load_avg()
update_cfs_rq_load_avg(now, cfs_rq)
if (migrated) attach_entity_load_avg()
for_each_sched_entity(se)
update_load_avg()
update_cfs_rq_load_avg(now, cfs_rq)
Not sure if we can just update the root cfs_rq to se->avg.last_update_time before we add
se->avg.util_[avg/sum] to rq_of(cfs_rq)->cfs.avg.util_[avg/sum] in attach_entity_load_avg()?
cfs_rq throttling has to be considered as well ...
Hi Dietmar,
On 6 April 2016 at 20:53, Dietmar Eggemann <[email protected]> wrote:
> On 06/04/16 09:37, Morten Rasmussen wrote:
>> On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
>>> @@ -2893,8 +2906,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>> se->avg.last_update_time = cfs_rq->avg.last_update_time;
>>> cfs_rq->avg.load_avg += se->avg.load_avg;
>>> cfs_rq->avg.load_sum += se->avg.load_sum;
>>> - cfs_rq->avg.util_avg += se->avg.util_avg;
>>> - cfs_rq->avg.util_sum += se->avg.util_sum;
>>> +
>>> + if (!entity_is_task(se))
>>> + return;
>>> +
>>> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>>> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>>
>> To me it seems that you cannot be sure that the rq_of(cfs_rq)->cfs.avg
>> time stamp is aligned with se->avg time stamp, which is necessary before
>> you can add/subtract two geometric series without introducing an error.
>>
>> attach_entity_load_avg() is called (through a couple of other functions)
>> from the for_each_sched_entity() loop in enqueue_task_fair() which works
>> its way towards the root cfs_rq, i.e. rq_of(cfs_rq)->cfs. So in the loop
>> iteration where you attach the task sched_entity, we haven't yet visited
>> and updated rq_of(cfs_rq)->cfs.avg.
>>
>> If you just add the task contribution and discover later that there is a
>> time delta when you update rq_of(cfs_rq)->cfs.avg you end up decaying
>> the task contribution which was already up-to-date and its util
>> contribution to rq_of(cfs_rq)->cfs.avg ends up being smaller than it
>> should be.
>>
>> Am I missing something?
>
> No that's definitely wrong in the current patch. I could defer the attach into
> the update_cfs_rq_load_avg() call for the root cfs_rq if
> '&rq_of(cfs_rq)->cfs != cfs_rq' in attach_entity_load_avg():
>
> Something like this (only lightly tested!):
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 51d675715776..d31d9cd453a1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2856,6 +2856,16 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
>
> + if (cfs_rq->added_util_avg) {
> + sa->util_avg += cfs_rq->added_util_avg;
> + cfs_rq->added_util_avg = 0;
> + }
> +
> + if (cfs_rq->added_util_sum) {
> + sa->util_sum += cfs_rq->added_util_sum;
> + cfs_rq->added_util_sum = 0;
> + }
> +
> #ifndef CONFIG_64BIT
> smp_wmb();
> cfs_rq->load_last_update_time_copy = sa->last_update_time;
> @@ -2910,8 +2920,13 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> if (!entity_is_task(se))
> return;
>
> - rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
> - rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
> + if (&rq_of(cfs_rq)->cfs == cfs_rq) {
> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
> + } else {
> + rq_of(cfs_rq)->cfs.added_util_avg = se->avg.util_avg;
> + rq_of(cfs_rq)->cfs.added_util_sum = se->avg.util_sum;
> + }
> }
Don't you also need similar thing for the detach ?
>
> static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -4344,8 +4359,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> cfs_rq = cfs_rq_of(se);
> cfs_rq->h_nr_running++;
>
> - if (cfs_rq_throttled(cfs_rq))
> + if (cfs_rq_throttled(cfs_rq)) {
> + rq_of(cfs_rq)->cfs.added_util_avg = 0;
> + rq_of(cfs_rq)->cfs.added_util_sum = 0;
> break;
> + }
>
> update_load_avg(se, 1);
> update_cfs_shares(cfs_rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b2ff5a2bd6df..f0ea3a7eaf07 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -391,6 +391,8 @@ struct cfs_rq {
> unsigned long tg_load_avg_contrib;
> #endif
> atomic_long_t removed_load_avg, removed_util_avg;
> + u32 added_util_sum;
> + unsigned long added_util_avg;
> #ifndef CONFIG_64BIT
> u64 load_last_update_time_copy;
>
> But attach_entity_load_avg() is not only called in enqueue_entity_load_avg() for migrated
> tasks but also in attach_task_cfs_rq() which is called from switched_to_fair() and
> task_move_group_fair() where we can't assume that after the enqueue_entity_load_avg() a
> call to update_cfs_rq_load_avg() follows like in
>
> enqueue_task_fair():
>
> for_each_sched_entity(se)
> enqueue_entity()
> enqueue_entity_load_avg()
> update_cfs_rq_load_avg(now, cfs_rq)
> if (migrated) attach_entity_load_avg()
>
> for_each_sched_entity(se)
> update_load_avg()
> update_cfs_rq_load_avg(now, cfs_rq)
>
>
> Not sure if we can just update the root cfs_rq to se->avg.last_update_time before we add
> se->avg.util_[avg/sum] to rq_of(cfs_rq)->cfs.avg.util_[avg/sum] in attach_entity_load_avg()?
>
> cfs_rq throttling has to be considered as well ...
IIUC this new proposal, the utilization of a task will be accounted in
the utilization of the root cfs_rq via the
tsk->se->cfs_rq->tg->se[cpu]->... chain down to the root cfs_rq. Then you
also directly add the utilization of the newly enqueued task to the root
cfs_rq.
Hi Vincent,
On 04/07/2016 02:04 PM, Vincent Guittot wrote:
> Hi Dietmar,
>
> On 6 April 2016 at 20:53, Dietmar Eggemann <[email protected]> wrote:
>> On 06/04/16 09:37, Morten Rasmussen wrote:
>>> On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
[...]
>> @@ -2910,8 +2920,13 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>> if (!entity_is_task(se))
>> return;
>>
>> - rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>> - rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>> + if (&rq_of(cfs_rq)->cfs == cfs_rq) {
>> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>> + } else {
>> + rq_of(cfs_rq)->cfs.added_util_avg = se->avg.util_avg;
>> + rq_of(cfs_rq)->cfs.added_util_sum = se->avg.util_sum;
>> + }
>> }
>
> Don't you also need similar thing for the detach ?
Maybe? I ran workloads in tg's and checked last_update_time of cfs_rq
and &rq_of(cfs_rq)->cfs, and they were always in sync. That's obviously
only the call-stack 'task_move_group_fair() -> detach_task_cfs_rq() ->
detach_entity_load_avg()' and not the one starting from
switched_from_fair().
[...]
>> But attach_entity_load_avg() is not only called in enqueue_entity_load_avg() for migrated
>> tasks but also in attach_task_cfs_rq() which is called from switched_to_fair() and
>> task_move_group_fair() where we can't assume that after the enqueue_entity_load_avg() a
>> call to update_cfs_rq_load_avg() follows like in
>>
>> enqueue_task_fair():
>>
>> for_each_sched_entity(se)
>> enqueue_entity()
>> enqueue_entity_load_avg()
>> update_cfs_rq_load_avg(now, cfs_rq)
>> if (migrated) attach_entity_load_avg()
>>
>> for_each_sched_entity(se)
>> update_load_avg()
>> update_cfs_rq_load_avg(now, cfs_rq)
>>
>>
>> Not sure if we can just update the root cfs_rq to se->avg.last_update_time before we add
>> se->avg.util_[avg/sum] to rq_of(cfs_rq)->cfs.avg.util_[avg/sum] in attach_entity_load_avg()?
>>
>> cfs_rq throttling has to be considered as well ...
>
> IIUC this new proposal, the utilization of a task will be accounted on
> the utilization of the root cfs_rq thanks to
> tsk->se->cfs_rq->tg->se[cpu]->... down to the root cfs_rq. Then, you
> directly add the utilization of the newly enqueued task in the root
> cfs_rq.
Not sure if you're referring to this, but in __update_load_avg() I
suppress the utilization update for se's w/ !entity_is_task(se) and
cfs_rq's w/ &rq_of(cfs_rq)->cfs != cfs_rq, thus preventing the first case.
On 7 April 2016 at 22:30, Dietmar Eggemann <[email protected]> wrote:
> Hi Vincent,
>
> On 04/07/2016 02:04 PM, Vincent Guittot wrote:
>>
>> Hi Dietmar,
>>
>> On 6 April 2016 at 20:53, Dietmar Eggemann <[email protected]>
>> wrote:
>>>
>>> On 06/04/16 09:37, Morten Rasmussen wrote:
>>>>
>>>> On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
>
>
> [...]
>
>>> @@ -2910,8 +2920,13 @@ static void attach_entity_load_avg(struct cfs_rq
>>> *cfs_rq, struct sched_entity *s
>>> if (!entity_is_task(se))
>>> return;
>>>
>>> - rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>>> - rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>>> + if (&rq_of(cfs_rq)->cfs == cfs_rq) {
>>> + rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>>> + rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>>> + } else {
>>> + rq_of(cfs_rq)->cfs.added_util_avg = se->avg.util_avg;
>>> + rq_of(cfs_rq)->cfs.added_util_sum = se->avg.util_sum;
>>> + }
>>> }
>>
>>
>> Don't you also need similar thing for the detach ?
>
>
> Maybe? I ran workloads in tg's and checked last_update_time of cfs_rq
> and &rq_of(cfs_rq)->cfs and they always were in sync. That's obviously
> only the call-stack 'task_move_group_fair() -> detach_task_cfs_rq() ->
> detach_entity_load_avg()' and not the one starting from
> switched_from_fair().
>
> [...]
>
>>> But attach_entity_load_avg() is not only called in
>>> enqueue_entity_load_avg() for migrated
>>> tasks but also in attach_task_cfs_rq() which is called from
>>> switched_to_fair() and
>>> task_move_group_fair() where we can't assume that after the
>>> enqueue_entity_load_avg() a
>>> call to update_cfs_rq_load_avg() follows like in
>>>
>>> enqueue_task_fair():
>>>
>>> for_each_sched_entity(se)
>>> enqueue_entity()
>>> enqueue_entity_load_avg()
>>> update_cfs_rq_load_avg(now, cfs_rq)
>>> if (migrated) attach_entity_load_avg()
>>>
>>> for_each_sched_entity(se)
>>> update_load_avg()
>>> update_cfs_rq_load_avg(now, cfs_rq)
>>>
>>>
>>> Not sure if we can just update the root cfs_rq to
>>> se->avg.last_update_time before we add
>>> se->avg.util_[avg/sum] to rq_of(cfs_rq)->cfs.avg.util_[avg/sum] in
>>> attach_entity_load_avg()?
>>>
>>> cfs_rq throttling has to be considered as well ...
>>
>>
>> IIUC this new proposal, the utilization of a task will be accounted on
>> the utilization of the root cfs_rq thanks to
>> tsk->se->cfs_rq->tg->se[cpu]->... down to the root cfs_rq. Then, you
>> directly add the utilization of the newly enqueued task in the root
>> cfs_rq.
>
>
> Not sure if you're referring to this, but in __update_load_avg() I
> suppress the utilization update for se's w/ !entity_is_task(se) and
> cfs_rq's w/ &rq_of(cfs_rq)->cfs != cfs_rq so preventing the first case.
OK, so you still need part of the previous patch; I thought you had
skipped it as it was wrong.