Schedutil is not properly updated when the first FAIR task wakes up on a
CPU and when a RQ is (un)throttled. This is mainly due to the current
integration strategy, which relies on updates being triggered implicitly
each time a cfs_rq's utilization is updated.
Those updates are currently provided (mainly) via
cfs_rq_util_change()
which is used in:
- update_cfs_rq_load_avg()
when the utilization of a cfs_rq is updated
- {attach,detach}_entity_load_avg()
This is done based on the idea that "we should callback schedutil
frequently enough" to properly update the CPU frequency at every
utilization change.
Since this recent schedutil update:
commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
we use additional RQ information to properly account for FAIR tasks
utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
in sugov_aggregate_util() to sum up the cfs_rq's utilization.
However, cfs_rq::h_nr_running is usually updated as:
enqueue_entity()
...
update_load_avg()
...
cfs_rq_util_change ==> trigger schedutil update
...
cfs_rq->h_nr_running += number_of_tasks
both in enqueue_task_fair() as well as in unthrottle_cfs_rq().
A similar pattern is used also in dequeue_task_fair() and
throttle_cfs_rq() to remove tasks.
This means that we are likely to see a zero cfs_rq utilization when we
enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when
instead, for example, we are throttling all the FAIR tasks of a CPU.
While the second issue is less important, since we are less likely to
reduce frequency when CPU utilization decreases, the first issue can
instead impact performance. Indeed, we potentially introduce an undesired
latency between a task enqueue on a CPU and its frequency increase.
Another possible unwanted side effect is the iowait boosting of a CPU
when we enqueue a task into a throttled cfs_rq.
Moreover, the current schedutil integration has these other downsides:
- schedutil updates are triggered by RQ's load updates, which makes
sense in general but does not allow us to know exactly which other
RQ-related information has been updated (e.g. h_nr_running).
- increasing the chances to update schedutil does not always translate
into providing the most accurate information for a proper frequency
selection, thus we can skip some updates.
- we don't know exactly at which point a schedutil update is triggered,
and thus at which point a frequency change is potentially started,
because the update is a side effect of cfs_rq_util_change() instead of
an explicit call from the most suitable call path.
- cfs_rq_util_change() is mainly a wrapper function for an already
existing "public API", cpufreq_update_util(), to ensure we actually
update schedutil only when we are updating a root RQ. Thus, especially
when task groups are in use, most of the calls to this wrapper
function are really not required.
- the usage of a wrapper function is not completely consistent across
fair.c, since we sometimes still need additional explicit calls to
cpufreq_update_util(), for example to support the IOWAIT boost flag in
the wakeup path
- it makes it hard to integrate new features, since it can require
changing other function prototypes just to pass in an additional flag,
as happened for example here:
commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
All the above considered, let's try to make schedutil updates more
explicit in fair.c by:
- removing the cfs_rq_util_change() wrapper function and using the
cpufreq_update_util() public API only when the root cfs_rq is updated
- removing indirect and side-effect (sometimes not required) schedutil
updates when the cfs_rq utilization is updated
- calling cpufreq_update_util() explicitly in the few call sites where it
really makes sense and all the required information has been updated
By doing so, this patch mainly removes code and adds explicit calls to
schedutil only when we:
- {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
- (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
- task_tick_fair() to update the utilization of the root cfs_rq
All the other code paths, currently _indirectly_ covered by a call to
update_load_avg(), are also covered by the above three calls.
Some already imply enqueue/dequeue calls:
- switch_{to,from}_fair()
- sched_move_task()
or are followed by enqueue/dequeue calls:
- cpu_cgroup_fork() and
post_init_entity_util_avg():
are used at wake_up_new_task() time and thus already followed by an
enqueue_task_fair()
- migrate_task_rq_fair():
updates the removed utilization but not the actual cfs_rq
utilization, which is updated by a following sched event
This new proposal also allows us to better aggregate the schedutil
related flags, which are required only at enqueue_task_fair() time.
Indeed, the IOWAIT and MIGRATION flags are now requested only when a
task is actually visible at the root cfs_rq level.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
The SCHED_CPUFREQ_MIGRATION flag, recently introduced by:
ea14b57e8a18 sched/cpufreq: Provide migration hint
is maintained although there are no actual usages of this hint in
mainline so far... do we really need it?
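As an aside, to visualize the ordering issue described in the changelog,
here is a small self-contained toy model (plain userspace C, not kernel
code, all names made up) of a governor callback which samples a runnable
counter that is incremented only after the callback has fired:

#include <stdio.h>

/* Toy stand-ins for rq state: a runnable counter and a utilization value. */
struct toy_rq {
        unsigned int h_nr_running;
        unsigned long util;
};

/* Mimics the governor aggregation: ignore util if nothing looks runnable. */
static unsigned long governor_aggregate(struct toy_rq *rq)
{
        return rq->h_nr_running ? rq->util : 0;
}

/* Mimics cfs_rq_util_change()/cpufreq_update_util(): print a fake request. */
static void governor_update(struct toy_rq *rq, const char *when)
{
        printf("%-32s: frequency request based on util = %lu\n",
               when, governor_aggregate(rq));
}

/* Current ordering: the callback fires before the counter is incremented. */
static void enqueue_current(struct toy_rq *rq, unsigned long task_util)
{
        rq->util += task_util;                  /* update_load_avg() */
        governor_update(rq, "callback inside the update path");
        rq->h_nr_running += 1;                  /* too late for the callback */
}

/* Proposed ordering: one explicit call once all the information is in sync. */
static void enqueue_proposed(struct toy_rq *rq, unsigned long task_util)
{
        rq->util += task_util;
        rq->h_nr_running += 1;
        governor_update(rq, "explicit call at the end");
}

int main(void)
{
        struct toy_rq a = { 0, 0 }, b = { 0, 0 };

        enqueue_current(&a, 300);       /* reports util = 0 (stale counter) */
        enqueue_proposed(&b, 300);      /* reports util = 300               */
        return 0;
}

With the current ordering the first enqueue is aggregated as zero
utilization; with the proposed ordering the counter is already in sync.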
---
kernel/sched/fair.c | 84 ++++++++++++++++++++++++-----------------------------
1 file changed, 38 insertions(+), 46 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c58d2f..e726f91f0089 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
* For !fair tasks do:
*
update_cfs_rq_load_avg(now, cfs_rq);
- attach_entity_load_avg(cfs_rq, se, 0);
+ attach_entity_load_avg(cfs_rq, se);
switched_from_fair(rq, p);
*
* such that the next switched_to_fair() has the
@@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */
-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
-{
- struct rq *rq = rq_of(cfs_rq);
-
- if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
- /*
- * There are a few boundary cases this might miss but it should
- * get called often enough that that should (hopefully) not be
- * a real problem.
- *
- * It will not get called when we go idle, because the idle
- * thread is a different class (!fair), nor will the utilization
- * number include things like RT tasks.
- *
- * As is, the util number is not freq-invariant (we'd have to
- * implement arch_scale_freq_capacity() for that).
- *
- * See cpu_util().
- */
- cpufreq_update_util(rq, flags);
- }
-}
-
#ifdef CONFIG_SMP
/*
* Approximate:
@@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif
- if (decayed)
- cfs_rq_util_change(cfs_rq, 0);
-
return decayed;
}
@@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
* Must call update_cfs_rq_load_avg() before this, since we rely on
* cfs_rq->avg.last_update_time being current.
*/
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
@@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
- cfs_rq_util_change(cfs_rq, flags);
}
/**
@@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
- cfs_rq_util_change(cfs_rq, 0);
}
/*
@@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
*
* IOW we're enqueueing a task on a new CPU.
*/
- attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
+ attach_entity_load_avg(cfs_rq, se);
update_tg_load_avg(cfs_rq, 0);
} else if (decayed && (flags & UPDATE_TG))
@@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
{
- cfs_rq_util_change(cfs_rq, 0);
}
static inline void remove_entity_load_avg(struct sched_entity *se) {}
static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
+attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void
detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
@@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue = 0;
}
- if (!se)
+ /* The tasks are no more visible from the root cfs_rq */
+ if (!se) {
sub_nr_running(rq, task_delta);
+ cpufreq_update_util(rq, 0);
+ }
cfs_rq->throttled = 1;
cfs_rq->throttled_clock = rq_clock(rq);
@@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
}
- if (!se)
+ /* The tasks are now visible from the root cfs_rq */
+ if (!se) {
add_nr_running(rq, task_delta);
+ cpufreq_update_util(rq, 0);
+ }
/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
@@ -5356,14 +5333,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
- /*
- * If in_iowait is set, the code below may not trigger any cpufreq
- * utilization updates, so do it here explicitly with the IOWAIT flag
- * passed.
- */
- if (p->in_iowait)
- cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
for_each_sched_entity(se) {
if (se->on_rq)
break;
@@ -5394,9 +5363,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}
- if (!se)
+ /* The task is visible from the root cfs_rq */
+ if (!se) {
+ unsigned int flags = 0;
+
add_nr_running(rq, 1);
+ if (p->in_iowait)
+ flags |= SCHED_CPUFREQ_IOWAIT;
+
+ /*
+ * !last_update_time means we've passed through
+ * migrate_task_rq_fair() indicating we migrated.
+ *
+ * IOW we're enqueueing a task on a new CPU.
+ */
+ if (!p->se.avg.last_update_time)
+ flags |= SCHED_CPUFREQ_MIGRATION;
+
+ cpufreq_update_util(rq, flags);
+ }
+
util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
}
@@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}
- if (!se)
+ /* The task is no more visible from the root cfs_rq */
+ if (!se) {
sub_nr_running(rq, 1);
+ cpufreq_update_util(rq, 0);
+ }
util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
@@ -9950,6 +9940,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ cpufreq_update_util(rq, 0);
}
/*
@@ -10087,7 +10079,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
/* Synchronize entity with its cfs_rq */
update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
- attach_entity_load_avg(cfs_rq, se, 0);
+ attach_entity_load_avg(cfs_rq, se);
update_tg_load_avg(cfs_rq, false);
propagate_entity_cfs_rq(se);
}
--
2.15.1
On Fri, Apr 6, 2018 at 10:28 AM, Patrick Bellasi
<[email protected]> wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a
> CPU and when a RQ is (un)throttled. This is mainly due to the current
> integration strategy, which relies on updates being triggered implicitly
> each time a cfs_rq's utilization is updated.
To me that's not implicit, but is explicitly done when the util is
updated. It seems that's the logical place..
>
> Those updates are currently provided (mainly) via
> cfs_rq_util_change()
> which is used in:
> - update_cfs_rq_load_avg()
> when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
> This is done based on the idea that "we should callback schedutil
> frequently enough" to properly update the CPU frequency at every
This I also didn't get; it's not called "frequently enough" but at the
right time (when the util is updated).
> utilization change.
>
> Since this recent schedutil update:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks
> utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
> in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>
> However, cfs_rq::h_nr_running is usually updated as:
>
> enqueue_entity()
> ...
> update_load_avg()
> ...
> cfs_rq_util_change ==> trigger schedutil update
> ...
> cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq().
> A similar pattern is used also in dequeue_task_fair() and
> throttle_cfs_rq() to remove tasks.
Maybe this should be fixed then? It seems broken to begin with...
>
> This means that we are likely to see a zero cfs_rq utilization when we
> enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when
> instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to
Actually I wanted behavior like the second issue, because that means
the frequency will not be dropped if the CPU is about to idle, which is
similar to a patch I sent a long time ago (skip freq update if CPU is
about to idle).
> reduce frequency when CPU utilization decreases, the first issue can
> instead impact performance. Indeed, we potentially introduce a not desired
This issue would be fixed by just fixing the h_nr_running bug right?
> latency between a task enqueue on a CPU and its frequency increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU
> when we enqueue a task into a throttled cfs_rq.
Probably very very rare.
> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ's load updates, which makes
> sense in general but it does not allow to know exactly which other RQ
> related information has been updated (e.g. h_nr_running).
It seems broken that all information that schedutil needs isn't
updated _before_ the cpufreq update request, so that should be fixed
instead?
>
> - increasing the chances to update schedutil does not always correspond
> to provide the most accurate information for a proper frequency
> selection, thus we can skip some updates.
Again IMO it's just updated at the right time already, not just
frequently enough...
>
> - we don't know exactly at which point a schedutil update is triggered,
> and thus potentially a frequency change started, and that's because
> the update is a side effect of cfs_rq_util_changed instead of an
> explicit call from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already
> existing "public API", cpufreq_update_util(), to ensure we actually
> update schedutil only when we are updating a root RQ. Thus, especially
> when task groups are in use, most of the calls to this wrapper
> function are really not required.
>
> - the usage of a wrapper function is not completely consistent across
> fair.c, since we still need sometimes additional explicit calls to
> cpufreq_update_util(), for example to support the IOWAIT boot flag in
> the wakeup path
>
> - it makes it hard to integrate new features since it could require to
> change other function prototypes just to pass in an additional flag,
> as it happened for example here:
>
> commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more
> explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function to use the
> cpufreq_update_util() public API only when root cfs_rq is updated
>
> - remove indirect and side-effect (sometimes not required) schedutil
> updates when the cfs_rq utilization is updated
>
> - call cpufreq_update_util() explicitly in the few call sites where it
> really makes sense and all the required information has been updated
>
> By doing so, this patch mainly removes code and adds explicit calls to
> schedutil only when we:
> - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> - task_tick_fair() to update the utilization of the root cfs_rq
About the "update for root" thingy, we're already doing rq->cfs ==
cfs_rq in cfs_rq_util_change so its already covered?
Also, I feel that if you can shorten the commit message a little and
only include the best reasons for this patch, that'll be nice, so it's
easier to review.
>
> All the other code paths, currently _indirectly_ covered by a call to
> update_load_avg(), are also covered by the above three calls.
I would rather we do it in as few places as possible (when util is
updated for root) than spreading it around and making it "explicit". I
hope I didn't miss something but I feel it's explicit enough already...
thanks!
- Joel
Hi Patrick
On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a
> CPU and when a RQ is (un)throttled. This is mainly due to the current
> integration strategy, which relies on updates being triggered implicitly
> each time a cfs_rq's utilization is updated.
>
> Those updates are currently provided (mainly) via
> cfs_rq_util_change()
> which is used in:
> - update_cfs_rq_load_avg()
> when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
> This is done based on the idea that "we should callback schedutil
> frequently enough" to properly update the CPU frequency at every
> utilization change.
>
> Since this recent schedutil update:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks
> utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
> in sugov_aggregate_util() to sum up the cfs_rq's utilization.
Isn't the use of cfs_rq::h_nr_running the root cause of the problem?
I can now see a lot of frequency changes on my hikey with this new
condition in sugov_aggregate_util().
With an rt-app use case that creates a periodic cfs task, I see a lot
of frequency changes instead of staying at the same frequency.
Peter,
what was your goal with adding the condition "if
(rq->cfs.h_nr_running)" for the aggregation of CFS utilization?
Thanks
Vincent
>
> However, cfs_rq::h_nr_running is usually updated as:
>
> enqueue_entity()
> ...
> update_load_avg()
> ...
> cfs_rq_util_change ==> trigger schedutil update
> ...
> cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq().
> A similar pattern is used also in dequeue_task_fair() and
> throttle_cfs_rq() to remove tasks.
>
> This means that we are likely to see a zero cfs_rq utilization when we
> enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when
> instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to
> reduce frequency when CPU utilization decreases, the first issue can
> instead impact performance. Indeed, we potentially introduce a not desired
> latency between a task enqueue on a CPU and its frequency increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU
> when we enqueue a task into a throttled cfs_rq.
>
> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ's load updates, which makes
> sense in general but it does not allow to know exactly which other RQ
> related information has been updated (e.g. h_nr_running).
>
> - increasing the chances to update schedutil does not always correspond
> to provide the most accurate information for a proper frequency
> selection, thus we can skip some updates.
>
> - we don't know exactly at which point a schedutil update is triggered,
> and thus potentially a frequency change started, and that's because
> the update is a side effect of cfs_rq_util_changed instead of an
> explicit call from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already
> existing "public API", cpufreq_update_util(), to ensure we actually
> update schedutil only when we are updating a root RQ. Thus, especially
> when task groups are in use, most of the calls to this wrapper
> function are really not required.
>
> - the usage of a wrapper function is not completely consistent across
> fair.c, since we still need sometimes additional explicit calls to
> cpufreq_update_util(), for example to support the IOWAIT boot flag in
> the wakeup path
>
> - it makes it hard to integrate new features since it could require to
> change other function prototypes just to pass in an additional flag,
> as it happened for example here:
>
> commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more
> explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function to use the
> cpufreq_update_util() public API only when root cfs_rq is updated
>
> - remove indirect and side-effect (sometimes not required) schedutil
> updates when the cfs_rq utilization is updated
>
> - call cpufreq_update_util() explicitly in the few call sites where it
> really makes sense and all the required information has been updated
>
> By doing so, this patch mainly removes code and adds explicit calls to
> schedutil only when we:
> - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> - task_tick_fair() to update the utilization of the root cfs_rq
>
> All the other code paths, currently _indirectly_ covered by a call to
> update_load_avg(), are also covered by the above three calls.
> Some already imply enqueue/dequeue calls:
> - switch_{to,from}_fair()
> - sched_move_task()
> or are followed by enqueue/dequeue calls:
> - cpu_cgroup_fork() and
> post_init_entity_util_avg():
> are used at wakeup_new_task() time and thus already followed by an
> enqueue_task_fair()
> - migrate_task_rq_fair():
> updates the removed utilization but not the actual cfs_rq
> utilization, which is updated by a following sched event
>
> This new proposal allows also to better aggregate schedutil related
> flags, which are required only at enqueue_task_fair() time.
> Indeed, IOWAIT and MIGRATION flags are now requested only when a task is
> actually visible at the root cfs_rq level.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
>
> The SCHED_CPUFREQ_MIGRATION flags, recently introduced by:
>
> ea14b57e8a18 sched/cpufreq: Provide migration hint
>
> is maintained although there are not actual usages so far in mainline
> for this hint... do we really need it?
> ---
> kernel/sched/fair.c | 84 ++++++++++++++++++++++++-----------------------------
> 1 file changed, 38 insertions(+), 46 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c58d2f..e726f91f0089 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> * For !fair tasks do:
> *
> update_cfs_rq_load_avg(now, cfs_rq);
> - attach_entity_load_avg(cfs_rq, se, 0);
> + attach_entity_load_avg(cfs_rq, se);
> switched_from_fair(rq, p);
> *
> * such that the next switched_to_fair() has the
> @@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> }
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> - struct rq *rq = rq_of(cfs_rq);
> -
> - if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
> - /*
> - * There are a few boundary cases this might miss but it should
> - * get called often enough that that should (hopefully) not be
> - * a real problem.
> - *
> - * It will not get called when we go idle, because the idle
> - * thread is a different class (!fair), nor will the utilization
> - * number include things like RT tasks.
> - *
> - * As is, the util number is not freq-invariant (we'd have to
> - * implement arch_scale_freq_capacity() for that).
> - *
> - * See cpu_util().
> - */
> - cpufreq_update_util(rq, flags);
> - }
> -}
> -
> #ifdef CONFIG_SMP
> /*
> * Approximate:
> @@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> cfs_rq->load_last_update_time_copy = sa->last_update_time;
> #endif
>
> - if (decayed)
> - cfs_rq_util_change(cfs_rq, 0);
> -
> return decayed;
> }
>
> @@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> * Must call update_cfs_rq_load_avg() before this, since we rely on
> * cfs_rq->avg.last_update_time being current.
> */
> -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>
> @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, flags);
> }
>
> /**
> @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, 0);
> }
>
> /*
> @@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> *
> * IOW we're enqueueing a task on a new CPU.
> */
> - attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
> + attach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq, 0);
>
> } else if (decayed && (flags & UPDATE_TG))
> @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> {
> - cfs_rq_util_change(cfs_rq, 0);
> }
>
> static inline void remove_entity_load_avg(struct sched_entity *se) {}
>
> static inline void
> -attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
> +attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
> static inline void
> detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
>
> @@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> dequeue = 0;
> }
>
> - if (!se)
> + /* The tasks are no more visible from the root cfs_rq */
> + if (!se) {
> sub_nr_running(rq, task_delta);
> + cpufreq_update_util(rq, 0);
> + }
>
> cfs_rq->throttled = 1;
> cfs_rq->throttled_clock = rq_clock(rq);
> @@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> break;
> }
>
> - if (!se)
> + /* The tasks are now visible from the root cfs_rq */
> + if (!se) {
> add_nr_running(rq, task_delta);
> + cpufreq_update_util(rq, 0);
> + }
>
> /* Determine whether we need to wake up potentially idle CPU: */
> if (rq->curr == rq->idle && rq->cfs.nr_running)
> @@ -5356,14 +5333,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
>
> - /*
> - * If in_iowait is set, the code below may not trigger any cpufreq
> - * utilization updates, so do it here explicitly with the IOWAIT flag
> - * passed.
> - */
> - if (p->in_iowait)
> - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
> for_each_sched_entity(se) {
> if (se->on_rq)
> break;
> @@ -5394,9 +5363,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> - if (!se)
> + /* The task is visible from the root cfs_rq */
> + if (!se) {
> + unsigned int flags = 0;
> +
> add_nr_running(rq, 1);
>
> + if (p->in_iowait)
> + flags |= SCHED_CPUFREQ_IOWAIT;
> +
> + /*
> + * !last_update_time means we've passed through
> + * migrate_task_rq_fair() indicating we migrated.
> + *
> + * IOW we're enqueueing a task on a new CPU.
> + */
> + if (!p->se.avg.last_update_time)
> + flags |= SCHED_CPUFREQ_MIGRATION;
> +
> + cpufreq_update_util(rq, flags);
> + }
> +
> util_est_enqueue(&rq->cfs, p);
> hrtick_update(rq);
> }
> @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> - if (!se)
> + /* The task is no more visible from the root cfs_rq */
> + if (!se) {
> sub_nr_running(rq, 1);
> + cpufreq_update_util(rq, 0);
> + }
>
> util_est_dequeue(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
> @@ -9950,6 +9940,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
> +
> + cpufreq_update_util(rq, 0);
> }
>
> /*
> @@ -10087,7 +10079,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>
> /* Synchronize entity with its cfs_rq */
> update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
> - attach_entity_load_avg(cfs_rq, se, 0);
> + attach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq, false);
> propagate_entity_cfs_rq(se);
> }
> --
> 2.15.1
>
Hi Vincent,
On 09-Apr 10:51, Vincent Guittot wrote:
> Hi Patrick
>
> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
> > Schedutil is not properly updated when the first FAIR task wakes up on a
> > CPU and when a RQ is (un)throttled. This is mainly due to the current
> > integration strategy, which relies on updates being triggered implicitly
> > each time a cfs_rq's utilization is updated.
> >
> > Those updates are currently provided (mainly) via
> > cfs_rq_util_change()
> > which is used in:
> > - update_cfs_rq_load_avg()
> > when the utilization of a cfs_rq is updated
> > - {attach,detach}_entity_load_avg()
> > This is done based on the idea that "we should callback schedutil
> > frequently enough" to properly update the CPU frequency at every
> > utilization change.
> >
> > Since this recent schedutil update:
> >
> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
> >
> > we use additional RQ information to properly account for FAIR tasks
> > utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
> > in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>
> Isn't the use of cfs_rq::h_nr_running, the root cause of the problem ?
Not really...
> I can now see a lot a frequency changes on my hikey with this new
> condition in sugov_aggregate_util().
> With a rt-app UC that creates a periodic cfs task, I have a lot of
> frequency changes instead of staying at the same frequency
I don't remember a similar behavior... but I'll check more carefully.
> Peter,
> what was your goal with adding the condition "if
> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
The original intent was to get rid of the sched class flags, used from
within schedutil to track which class has runnable tasks. The reason
was to solve some misalignment between scheduler class status and
schedutil status.
The solution, initially suggested by Viresh, and finally proposed by
Peter, was to exploit RQ knowledge directly from within schedutil.
The problem is that schedutil updates now depend on two pieces of
information: utilization changes and the number of RT and CFS runnable
tasks.
Thus, using cfs_rq::h_nr_running is not the problem... it's actually
part of a much cleaner solution than the code we used to have.
The problem, IMO, is that we now depend on other information which
needs to be in sync before calling schedutil... and the patch I
proposed is meant to make it less likely that the required information
is not aligned (also in the future).
--
#include <best/regards.h>
Patrick Bellasi
Hi Joel,
On 06-Apr 16:48, Joel Fernandes wrote:
> On Fri, Apr 6, 2018 at 10:28 AM, Patrick Bellasi
> <[email protected]> wrote:
> > Schedutil is not properly updated when the first FAIR task wakes up on a
> > CPU and when a RQ is (un)throttled. This is mainly due to the current
> > integration strategy, which relies on updates being triggered implicitly
> > each time a cfs_rq's utilization is updated.
>
> To me that's not implicit, but is explicitly done when the util is
> updated. It seems that's the logical place..
I'm not arguing the place is not logical, it makes perfect sense.
IMO it just makes it more difficult to track dependencies like the one
we have now: when the root cfs_rq utilization is updated, h_nr_running
is not always aligned.
Moreover, when task groups are in use, we make many calls to a wrapper
function which just bails out, since we are updating a non-root cfs_rq.
Other reasons are also detailed in the changelog.
I've noticed that we already have pretty well defined call sites in
fair.c where we update the core's nr_running counter. These are also
the exact and only places where the root cfs_rq utilization changes,
apart from the tick.
What I'm proposing here is meant not only to fix the current issue,
but also to possibly find a code organization which makes it less
likely to miss dependencies in future code updates too.
To me it looks cleaner and, while I still have to look at the code in
more detail, I think this can be a first step towards factoring all
schedutil updates (for FAIR and RT) into just core.c
[...]
> > This means that we are likely to see a zero cfs_rq utilization when we
> > enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when
> > instead, for example, we are throttling all the FAIR tasks of a CPU.
> >
> > While the second issue is less important, since we are less likely to
>
> Actually I wanted behavior like the second issue, because that means
> frequency will not be dropped if CPU is about to idle which is similar
> to a patch I sent long time ago (skip freq update if CPU about to
> idle).
I think that's a slightly different case, since here a cfs_rq is
intentionally throttled and thus we want to stop the tasks and
potentially drop the frequency.
> > reduce frequency when CPU utilization decreases, the first issue can
> > instead impact performance. Indeed, we potentially introduce a not desired
>
> This issue would be fixed by just fixing the h_nr_running bug right?
Sure, something like this:
---8<---
index 0951d1c58d2f..e9a31258d345 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5368,6 +5368,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_nr_running++;
enqueue_entity(cfs_rq, se, flags);
/*
@@ -5377,8 +5378,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* post the final h_nr_running increment below.
*/
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_throttled(cfs_rq)) {
+ cfs_rq->h_nr_running--;
break;
- cfs_rq->h_nr_running++;
+ }
flags = ENQUEUE_WAKEUP;
}
---8<---
can probably easily fix one of the problems.
But I have not checked what it implies to increment h_nr_running
before calling enqueue_entity() and cfs_rq_throttled().
Still, we miss the (IMO) opportunity to make a more suitable single and
explicit function call only when needed, which is also just a few code
lines below in the same function, as proposed by this patch.
> > latency between a task enqueue on a CPU and its frequency increase.
> >
> > Another possible unwanted side effect is the iowait boosting of a CPU
> > when we enqueue a task into a throttled cfs_rq.
>
> Probably very very rare.
Still possible by code... and when you notice it, you can only regard
it as unwanted behavior.
> > Moreover, the current schedutil integration has these other downsides:
> >
> > - schedutil updates are triggered by RQ's load updates, which makes
> > sense in general but it does not allow to know exactly which other RQ
> > related information has been updated (e.g. h_nr_running).
>
> It seems broken that all information that schedutil needs isn't
> updated _before_ the cpufreq update request, so that should be fixed
> instead?
That's what I'm trying to do here but, instead of moving some
operations earlier, I'm proposing to postpone what can be done later.
Schedutil updates can be done right before returning to the core
scheduler, at which time we should be pretty sure all CFS related
info has been properly updated.
[...]
> > All the above considered, let's try to make schedutil updates more
> > explicit in fair.c by:
> >
> > - removing the cfs_rq_util_change() wrapper function to use the
> > cpufreq_update_util() public API only when root cfs_rq is updated
> >
> > - remove indirect and side-effect (sometimes not required) schedutil
> > updates when the cfs_rq utilization is updated
> >
> > - call cpufreq_update_util() explicitly in the few call sites where it
> > really makes sense and all the required information has been updated
> >
> > By doing so, this patch mainly removes code and adds explicit calls to
> > schedutil only when we:
> > - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> > - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> > - task_tick_fair() to update the utilization of the root cfs_rq
>
> About the "update for root" thingy, we're already doing rq->cfs ==
> cfs_rq in cfs_rq_util_change so its already covered?
>
> Also, I feel if you can shorten the commit message a little and only
> include the best reasons for this patch that'll be nice so its easier
> to review.
Sorry for that; since (as demonstrated by your reply) I was expecting
quite a few comments on this change, I just wanted to detail the
reasons and motivations behind it.
> > All the other code paths, currently _indirectly_ covered by a call to
> > update_load_avg(), are also covered by the above three calls.
>
> I would rather we do it in as few places as possible (when util is
> updated for root) than spreading it around and making it "explicit". I
> hope I didn't miss something but I feel its explicit enough already...
Well, again, I'm not saying that the current solution cannot possibly
work... just that it makes it easy to possibly break some dependencies.
Moreover, it maybe makes it less easy to see that, perhaps, we can
factor most of the schedutil updates into just core.c
> thanks!
Thanks for your time.
--
#include <best/regards.h>
Patrick Bellasi
On 10 April 2018 at 13:04, Patrick Bellasi <[email protected]> wrote:
> Hi Vincent,
>
> On 09-Apr 10:51, Vincent Guittot wrote:
>> Hi Patrick
>>
>> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
>> > Schedutil is not properly updated when the first FAIR task wakes up on a
>> > CPU and when a RQ is (un)throttled. This is mainly due to the current
>> > integration strategy, which relies on updates being triggered implicitly
>> > each time a cfs_rq's utilization is updated.
>> >
>> > Those updates are currently provided (mainly) via
>> > cfs_rq_util_change()
>> > which is used in:
>> > - update_cfs_rq_load_avg()
>> > when the utilization of a cfs_rq is updated
>> > - {attach,detach}_entity_load_avg()
>> > This is done based on the idea that "we should callback schedutil
>> > frequently enough" to properly update the CPU frequency at every
>> > utilization change.
>> >
>> > Since this recent schedutil update:
>> >
>> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>> >
>> > we use additional RQ information to properly account for FAIR tasks
>> > utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
>> > in sugov_aggregate_util() to sum up the cfs_rq's utilization.
>>
>> Isn't the use of cfs_rq::h_nr_running, the root cause of the problem ?
>
> Not really...
>
>> I can now see a lot a frequency changes on my hikey with this new
>> condition in sugov_aggregate_util().
>> With a rt-app UC that creates a periodic cfs task, I have a lot of
>> frequency changes instead of staying at the same frequency
>
> I don't remember a similar behavior... but I'll check better.
I discovered this behavior quite recently while preparing for OSPM
>
>> Peter,
>> what was your goal with adding the condition "if
>> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
>
> The original intent was to get rid of sched class flags, used to track
> which class has tasks runnable from within schedutil. The reason was
> to solve some misalignment between scheduler class status and
> schedutil status.
This was mainly for RT tasks but it was not the case for cfs tasks
before commit 8f111bc357aa
>
> The solution, initially suggested by Viresh, and finally proposed by
> Peter was to exploit RQ knowledges directly from within schedutil.
>
> The problem is that now schedutil updated depends on two information:
> utilization changes and number of RT and CFS runnable tasks.
>
> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> part of a much more clean solution of the code we used to have.
So there are 2 problems there:
- using cfs_rq::h_nr_running when aggregating cfs utilization, which
generates a lot of frequency drops
- making sure that the nr_running counters are up-to-date when used in
schedutil
>
> The problem, IMO is that we now depend on other information which
> needs to be in sync before calling schedutil... and the patch I
> proposed is meant to make it less likely that all the information
> required are not aligned (also in the future).
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
> }
> @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> - if (!se)
> + /* The task is no more visible from the root cfs_rq */
> + if (!se) {
> sub_nr_running(rq, 1);
> + cpufreq_update_util(rq, 0);
The call to cpufreq_update_util() should be done after util_est_dequeue()
> + }
>
> util_est_dequeue(&rq->cfs, p, task_sleep);
> hrtick_update(rq);
On 11-Apr 09:57, Vincent Guittot wrote:
> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
>
> > }
> > @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > update_cfs_group(se);
> > }
> >
> > - if (!se)
> > + /* The task is no more visible from the root cfs_rq */
> > + if (!se) {
> > sub_nr_running(rq, 1);
> > + cpufreq_update_util(rq, 0);
>
> call to cpufreq_update_util() should be done after util_est_dequeue()
Yeah... good point, looks like I should have noticed it :)
That's another compelling example of why updating schedutil as a side
effect of update_load_avg() does not allow us to easily track when it's
the best time to trigger an update.
UtilEst is now a factor which could impact OPP selection for CFS tasks,
and thus we should update schedutil at the very end of the function.
> > + }
> >
> > util_est_dequeue(&rq->cfs, p, task_sleep);
> > hrtick_update(rq);
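IOW, something along these lines for the tail of dequeue_task_fair()
(just a sketch of the intended ordering, untested):

        /* The task is no more visible from the root cfs_rq */
        if (!se)
                sub_nr_running(rq, 1);

        util_est_dequeue(&rq->cfs, p, task_sleep);

        /* Update schedutil only once util_est is up to date as well */
        if (!se)
                cpufreq_update_util(rq, 0);

        hrtick_update(rq);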
--
#include <best/regards.h>
Patrick Bellasi
On 11-Apr 08:57, Vincent Guittot wrote:
> On 10 April 2018 at 13:04, Patrick Bellasi <[email protected]> wrote:
> > On 09-Apr 10:51, Vincent Guittot wrote:
> >> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
> >> Peter,
> >> what was your goal with adding the condition "if
> >> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
> >
> > The original intent was to get rid of sched class flags, used to track
> > which class has tasks runnable from within schedutil. The reason was
> > to solve some misalignment between scheduler class status and
> > schedutil status.
>
> This was mainly for RT tasks but it was not the case for cfs task
> before commit 8f111bc357aa
True, but with his solution Peter has actually come up with a unified
interface which is now (and, IMO, can remain) based just on RUNNABLE
counters for each class.
> > The solution, initially suggested by Viresh, and finally proposed by
> > Peter was to exploit RQ knowledges directly from within schedutil.
> >
> > The problem is that now schedutil updated depends on two information:
> > utilization changes and number of RT and CFS runnable tasks.
> >
> > Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> > part of a much more clean solution of the code we used to have.
>
> So there are 2 problems there:
> - using cfs_rq::h_nr_running when aggregating cfs utilization which
> generates a lot of frequency drop
You mean because we now completely disregard the blocked utilization
when a CPU is idle, right?
Given how PELT works and the recent support for updating IDLE CPUs, we
should probably always add the contribution of the CFS class.
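IOW, the aggregation would always sum in the CFS contribution,
something like (just a sketch):

        util = sg_cpu->util_dl;
        util += sg_cpu->util_cfs;       /* blocked utilization included */

without special-casing on cfs_rq::h_nr_running.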
> - making sure that the nr-running are up-to-date when used in sched_util
Right... but, if we always add the cfs_rq utilization (to always
account for blocked utilization), we no longer have this last
dependency, do we?
We still have to account for the util_est dependency.
Should I add a patch to this series to disregard cfs_rq::h_nr_running
from schedutil as you suggested?
> > The problem, IMO is that we now depend on other information which
> > needs to be in sync before calling schedutil... and the patch I
> > proposed is meant to make it less likely that all the information
> > required are not aligned (also in the future).
> >
> > --
> > #include <best/regards.h>
> >
> > Patrick Bellasi
--
#include <best/regards.h>
Patrick Bellasi
On 11 April 2018 at 12:15, Patrick Bellasi <[email protected]> wrote:
> On 11-Apr 08:57, Vincent Guittot wrote:
>> On 10 April 2018 at 13:04, Patrick Bellasi <[email protected]> wrote:
>> > On 09-Apr 10:51, Vincent Guittot wrote:
>> >> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
>> >> Peter,
>> >> what was your goal with adding the condition "if
>> >> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
>> >
>> > The original intent was to get rid of sched class flags, used to track
>> > which class has tasks runnable from within schedutil. The reason was
>> > to solve some misalignment between scheduler class status and
>> > schedutil status.
>>
>> This was mainly for RT tasks but it was not the case for cfs task
>> before commit 8f111bc357aa
>
> True, but with his solution Peter has actually come up with a unified
> interface which is now (and can be IMO) based just on RUNNABLE
> counters for each class.
But do we really want to only take care of runnable counters for all classes?
>
>> > The solution, initially suggested by Viresh, and finally proposed by
>> > Peter was to exploit RQ knowledges directly from within schedutil.
>> >
>> > The problem is that now schedutil updated depends on two information:
>> > utilization changes and number of RT and CFS runnable tasks.
>> >
>> > Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>> > part of a much more clean solution of the code we used to have.
>>
>> So there are 2 problems there:
>> - using cfs_rq::h_nr_running when aggregating cfs utilization which
>> generates a lot of frequency drop
>
> You mean because we now completely disregard the blocked utilization
> where a CPU is idle, right?
yes
>
> Given how PELT works and the recent support for IDLE CPUs updated, we
> should probably always add contributions for the CFS class.
>
>> - making sure that the nr-running are up-to-date when used in sched_util
>
> Right... but, if we always add the cfs_rq (to always account for
> blocked utilization), we don't have anymore this last dependency,
> isn't it?
yes
>
> We still have to account for the util_est dependency.
>
> Should I add a patch to this series to disregard cfs_rq::h_nr_running
> from schedutil as you suggested?
It's probably better to have a separate patch as these are 2 different topics:
- when updating cfs_rq::h_nr_running and when calling cpufreq_update_util
- should we use runnable or running utilization for CFS
Vincent,
>
>> > The problem, IMO is that we now depend on other information which
>> > needs to be in sync before calling schedutil... and the patch I
>> > proposed is meant to make it less likely that all the information
>> > required are not aligned (also in the future).
>> >
>> > --
>> > #include <best/regards.h>
>> >
>> > Patrick Bellasi
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 11-Apr 13:56, Vincent Guittot wrote:
> On 11 April 2018 at 12:15, Patrick Bellasi <[email protected]> wrote:
> > On 11-Apr 08:57, Vincent Guittot wrote:
> >> On 10 April 2018 at 13:04, Patrick Bellasi <[email protected]> wrote:
> >> > On 09-Apr 10:51, Vincent Guittot wrote:
> >> >> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
> >> >> Peter,
> >> >> what was your goal with adding the condition "if
> >> >> (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
> >> >
> >> > The original intent was to get rid of sched class flags, used to track
> >> > which class has tasks runnable from within schedutil. The reason was
> >> > to solve some misalignment between scheduler class status and
> >> > schedutil status.
> >>
> >> This was mainly for RT tasks but it was not the case for cfs task
> >> before commit 8f111bc357aa
> >
> > True, but with his solution Peter has actually come up with a unified
> > interface which is now (and can be IMO) based just on RUNNABLE
> > counters for each class.
>
> But do we really want to only take care of runnable counter for all class ?
Perhaps, once we have PELT RT support with your patches, we can
consider blocked utilization also for those tasks...
However, we can also argue that a policy where we trigger updates
based on RUNNABLE counters, and then leave it up to the schedutil
policy to decide for how long to ignore a frequency drop (using a
step-down holding timer similar to what we already have), can also be
a possible solution.
I also kind-of see a possible interesting per-task tuning of such a
policy. Meaning that, for example, for certain tasks we may want to use
a longer down-scale throttling time, which can instead be shorter if
only "background" tasks are currently active.
> >> > The solution, initially suggested by Viresh, and finally proposed by
> >> > Peter was to exploit RQ knowledges directly from within schedutil.
> >> >
> >> > The problem is that now schedutil updated depends on two information:
> >> > utilization changes and number of RT and CFS runnable tasks.
> >> >
> >> > Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> >> > part of a much more clean solution of the code we used to have.
> >>
> >> So there are 2 problems there:
> >> - using cfs_rq::h_nr_running when aggregating cfs utilization which
> >> generates a lot of frequency drop
> >
> > You mean because we now completely disregard the blocked utilization
> > where a CPU is idle, right?
>
> yes
>
> >
> > Given how PELT works and the recent support for IDLE CPUs updated, we
> > should probably always add contributions for the CFS class.
> >
> >> - making sure that the nr-running are up-to-date when used in sched_util
> >
> > Right... but, if we always add the cfs_rq (to always account for
> > blocked utilization), we don't have anymore this last dependency,
> > isn't it?
>
> yes
>
> >
> > We still have to account for the util_est dependency.
> >
> > Should I add a patch to this series to disregard cfs_rq::h_nr_running
> > from schedutil as you suggested?
>
> It's probably better to have a separate patch as these are 2 different topics
> - when updating cfs_rq::h_nr_running and when calling cpufreq_update_util
> - should we use runnable or running utilization for CFS
Yes, well... since OSPM is just next week, we can also have a better
discussion there and decide by then.
What is true so far is that using RUNNABLE is a change with respect to
the previous behaviors which unfortunately went unnoticed so far.
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Apr 06, 2018 at 06:28:35PM +0100, Patrick Bellasi wrote:
> is maintained although there are not actual usages so far in mainline
> for this hint... do we really need it?
There were intel_pstate patches that I expected to have shown up
already, and I meant to have a look at sugov, except I got side-tracked
again :/
On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> On 09-Apr 10:51, Vincent Guittot wrote:
> > Peter,
> > what was your goal with adding the condition "if
> > (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
>
> The original intent was to get rid of sched class flags, used to track
> which class has tasks runnable from within schedutil. The reason was
> to solve some misalignment between scheduler class status and
> schedutil status.
>
> The solution, initially suggested by Viresh, and finally proposed by
> Peter was to exploit RQ knowledges directly from within schedutil.
>
> The problem is that now schedutil updated depends on two information:
> utilization changes and number of RT and CFS runnable tasks.
>
> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> part of a much more clean solution of the code we used to have.
>
> The problem, IMO is that we now depend on other information which
> needs to be in sync before calling schedutil... and the patch I
> proposed is meant to make it less likely that all the information
> required are not aligned (also in the future).
Specifically, the h_nr_running test was to get rid of
if (delta_ns > TICK_NSEC) {
j_sg_cpu->iowait_boost = 0;
j_sg_cpu->iowait_boost_pending = false;
- j_sg_cpu->util_cfs = 0;
^^^^^^^^^^^^^^^^^^^^^^^ that..
- if (j_sg_cpu->util_dl == 0)
- continue;
}
because that felt rather arbitrary.
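(For reference, the CFS leg of the aggregation now reduces to roughly
the following; paraphrased, not an exact copy of the code:)

        util = sg_cpu->util_dl;
        if (rq->cfs.h_nr_running)
                util += sg_cpu->util_cfs;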
On 11 April 2018 at 17:14, Peter Zijlstra <[email protected]> wrote:
> On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>> On 09-Apr 10:51, Vincent Guittot wrote:
>
>> > Peter,
>> > what was your goal with adding the condition "if
>> > (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
>>
>> The original intent was to get rid of sched class flags, used to track
>> which class has tasks runnable from within schedutil. The reason was
>> to solve some misalignment between scheduler class status and
>> schedutil status.
>>
>> The solution, initially suggested by Viresh, and finally proposed by
>> Peter was to exploit RQ knowledges directly from within schedutil.
>>
>> The problem is that now schedutil updated depends on two information:
>> utilization changes and number of RT and CFS runnable tasks.
>>
>> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>> part of a much more clean solution of the code we used to have.
>>
>> The problem, IMO is that we now depend on other information which
>> needs to be in sync before calling schedutil... and the patch I
>> proposed is meant to make it less likely that all the information
>> required are not aligned (also in the future).
>
> Specifically, the h_nr_running test was get rid of
>
> if (delta_ns > TICK_NSEC) {
> j_sg_cpu->iowait_boost = 0;
> j_sg_cpu->iowait_boost_pending = false;
> - j_sg_cpu->util_cfs = 0;
>
> ^^^^^^^^^^^^^^^^^^^^^^^ that..
>
> - if (j_sg_cpu->util_dl == 0)
> - continue;
> }
>
>
> because that felt rather arbitrary.
Yes, I agree.
With the patch that updates blocked idle load, we should no longer have
the problem of blocked utilization and can get rid of the code above
and the h_nr_running test.
On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
> On 11 April 2018 at 17:14, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> >> On 09-Apr 10:51, Vincent Guittot wrote:
> >
> >> > Peter,
> >> > what was your goal with adding the condition "if
> >> > (rq->cfs.h_nr_running)" for the aggragation of CFS utilization
> >>
> >> The original intent was to get rid of sched class flags, used to track
> >> which class has tasks runnable from within schedutil. The reason was
> >> to solve some misalignment between scheduler class status and
> >> schedutil status.
> >>
> >> The solution, initially suggested by Viresh, and finally proposed by
> >> Peter was to exploit RQ knowledges directly from within schedutil.
> >>
> >> The problem is that now schedutil updated depends on two information:
> >> utilization changes and number of RT and CFS runnable tasks.
> >>
> >> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> >> part of a much more clean solution of the code we used to have.
> >>
> >> The problem, IMO, is that we now depend on other information which
> >> needs to be in sync before calling schedutil... and the patch I
> >> proposed is meant to make it less likely that all the required
> >> information is not aligned (also in the future).
> >
> > Specifically, the h_nr_running test was added to get rid of
> >
> > if (delta_ns > TICK_NSEC) {
> > j_sg_cpu->iowait_boost = 0;
> > j_sg_cpu->iowait_boost_pending = false;
> > - j_sg_cpu->util_cfs = 0;
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^ that..
> >
> > - if (j_sg_cpu->util_dl == 0)
> > - continue;
> > }
> >
> >
> > because that felt rather arbitrary.
>
> Yes, I agree.
>
> With the patch that updates blocked idle load, we should not have the
> problem of blocked utilization anymore and can get rid of the code above
> and the h_nr_running test.
Yes, these patches predate those, but indeed, now that we age the
blocked load consistently it should no longer be required.
Of course, you still have that weird regression report against those
patches... :-)
On 11 April 2018 at 17:37, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
>> On 11 April 2018 at 17:14, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
>> >> On 09-Apr 10:51, Vincent Guittot wrote:
>> >
>> >> > Peter,
>> >> > what was your goal with adding the condition "if
>> >> > (rq->cfs.h_nr_running)" for the aggregation of CFS utilization
>> >>
>> >> The original intent was to get rid of sched class flags, used to track
>> >> which class has tasks runnable from within schedutil. The reason was
>> >> to solve some misalignment between scheduler class status and
>> >> schedutil status.
>> >>
>> >> The solution, initially suggested by Viresh, and finally proposed by
>> >> Peter, was to exploit RQ knowledge directly from within schedutil.
>> >>
>> >> The problem is that schedutil updates now depend on two pieces of
>> >> information: utilization changes and the number of RT and CFS runnable tasks.
>> >>
>> >> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>> >> part of a much cleaner solution than the code we used to have.
>> >>
>> >> The problem, IMO, is that we now depend on other information which
>> >> needs to be in sync before calling schedutil... and the patch I
>> >> proposed is meant to make it less likely that all the required
>> >> information is not aligned (also in the future).
>> >
>> > Specifically, the h_nr_running test was added to get rid of
>> >
>> > if (delta_ns > TICK_NSEC) {
>> > j_sg_cpu->iowait_boost = 0;
>> > j_sg_cpu->iowait_boost_pending = false;
>> > - j_sg_cpu->util_cfs = 0;
>> >
>> > ^^^^^^^^^^^^^^^^^^^^^^^ that..
>> >
>> > - if (j_sg_cpu->util_dl == 0)
>> > - continue;
>> > }
>> >
>> >
>> > because that felt rather arbitrary.
>>
>> Yes, I agree.
>>
>> With the patch that updates blocked idle load, we should not have the
>> problem of blocked utilization anymore and can get rid of the code above
>> and the h_nr_running test.
>
> Yes, these patches predate those, but indeed, now that we age the
> blocked load consistently it should no longer be required.
>
> Of course, you still have that weird regression report against those
> patches... :-)
Yes, and to be honest I don't have any clue about the root cause :-(
Heiner mentioned that it's much better in the latest linux-next, but I
haven't seen any changes related to the code of those patches.
On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
> Yes, and to be honest I don't have any clue about the root cause :-(
> Heiner mentioned that it's much better in the latest linux-next, but I
> haven't seen any changes related to the code of those patches.
Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
next that are most likely to have affected this are rjw's
cpuidle-vs-nohz patches. The common denominator being nohz.
Now I think rjw's patches will ensure we enter nohz _less_, since they avoid
stopping the tick when we expect to go idle for a short period only.
So if your patch makes nohz go wobbly, going nohz less will make that
better.
Of course, I've no actual clue as to what that patch (it's the last one
in the series, right?:
31e77c93e432 ("sched/fair: Update blocked load when newly idle")
) does that is so offensive to that one machine. You never did manage to
reproduce, right?
Could it be that for some reason the nohz balancer now takes a very long
time to run?
Could something like the following happen (and this is really flaky
thinking here):
last CPU goes idle, we enter idle_balance(), that kicks ilb, ilb runs,
which somehow again triggers idle_balance and around we go?
I'm not immediately seeing how that could happen, but if we do something
daft like that we can tie up the CPU for a while, mostly with IRQs
disabled, and that would be visible as that latency he sees.
On 11 April 2018 at 18:00, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
>> Yes, and to be honest I don't have any clue about the root cause :-(
>> Heiner mentioned that it's much better in the latest linux-next, but I
>> haven't seen any changes related to the code of those patches.
>
> Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
> next that are most likely to have affected this are rjw's
> cpuidle-vs-nohz patches. The common denominator being nohz.
>
> Now I think rjw's patches will ensure we enter nohz _less_, since they avoid
> stopping the tick when we expect to go idle for a short period only.
>
> So if your patch makes nohz go wobbly, going nohz less will make that
> better.
>
> Of course, I've no actual clue as to what that patch (it's the last one
> in the series, right?:
>
> 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>
> ) does that is so offensive to that one machine. You never did manage to
> reproduce, right?
yes
>
> Could it be that for some reason the nohz balancer now takes a very long
> time to run?
Heiner mentions that it was a relatively slow Celeron and he uses the
ondemand governor. So I was about to ask him to use the performance
governor to see if it can be because the CPU runs slowly and takes too much
time to enter idle.
>
> Could something like the following happen (and this is really flaky
> thinking here):
>
> last CPU goes idle, we enter idle_balance(), that kicks ilb, ilb runs,
> which somehow again triggers idle_balance and around we go?
>
> I'm not immediately seeing how that could happen, but if we do something
> daft like that we can tie up the CPU for a while, mostly with IRQs
> disabled, and that would be visible as that latency he sees.
>
>
On Wed, Apr 11, 2018 at 06:10:47PM +0200, Vincent Guittot wrote:
> > Could it be that for some reason the nohz balancer now takes a very long
> > time to run?
>
> Heiner mentions that it was a relatively slow Celeron and he uses the
> ondemand governor. So I was about to ask him to use the performance
> governor to see if it can be because the CPU runs slowly and takes too much
> time to enter idle.
Maybe also hand him a patch with some trace_printk()s in and ask him to
send /debug/tracing/trace after it happens or somesuch. Maybe we can
find a clue that way.
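For example, an instrumentation patch could bracket the nohz idle-balance work like this (illustration only, not a patch from this thread; where exactly these lines would sit in the idle-balance path is an assumption):

        /* Illustration: time the nohz idle balance with trace_printk() */
        u64 ilb_start = sched_clock();

        trace_printk("nohz ilb start: cpu=%d\n", smp_processor_id());

        /* ... existing nohz idle-balance work ... */

        trace_printk("nohz ilb end: cpu=%d took=%llu ns\n",
                     smp_processor_id(), sched_clock() - ilb_start);

The resulting timestamps end up in the trace buffer, i.e. the /debug/tracing/trace file mentioned above, which is what would be collected after the stall is observed.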
On 11 April 2018 at 18:15, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 11, 2018 at 06:10:47PM +0200, Vincent Guittot wrote:
>> > Could it be that for some reason the nohz balancer now takes a very long
>> > time to run?
>>
>> Heiner mentions that it was a relatively slow Celeron and he uses the
>> ondemand governor. So I was about to ask him to use the performance
>> governor to see if it can be because the CPU runs slowly and takes too much
>> time to enter idle.
>
> Maybe also hand him a patch with some trace_printk()s in and ask him to
> send /debug/tracing/trace after it happens or somesuch. Maybe we can
> find a clue that way.
yes
Hi Vincent,
On Wed, Apr 11, 2018 at 4:56 AM, Vincent Guittot
<[email protected]> wrote:
> On 11 April 2018 at 12:15, Patrick Bellasi <[email protected]> wrote:
>> On 11-Apr 08:57, Vincent Guittot wrote:
>>> On 10 April 2018 at 13:04, Patrick Bellasi <[email protected]> wrote:
>>> > On 09-Apr 10:51, Vincent Guittot wrote:
>>> >> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
>>> >> Peter,
>>> >> what was your goal with adding the condition "if
>>> >> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization
>>> >
>>> > The original intent was to get rid of sched class flags, used to track
>>> > which class has tasks runnable from within schedutil. The reason was
>>> > to solve some misalignment between scheduler class status and
>>> > schedutil status.
>>>
>>> This was mainly for RT tasks but it was not the case for CFS tasks
>>> before commit 8f111bc357aa
>>
>> True, but with his solution Peter has actually come up with a unified
>> interface which is now (and can be IMO) based just on RUNNABLE
>> counters for each class.
>
> But do we really want to only take care of the runnable counters for all classes?
>
>>
>>> > The solution, initially suggested by Viresh, and finally proposed by
>>> > Peter, was to exploit RQ knowledge directly from within schedutil.
>>> >
>>> > The problem is that schedutil updates now depend on two pieces of
>>> > information: utilization changes and the number of RT and CFS runnable tasks.
>>> >
>>> > Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>>> > part of a much cleaner solution than the code we used to have.
>>>
>>> So there are 2 problems there:
>>> - using cfs_rq::h_nr_running when aggregating cfs utilization which
>>> generates a lot of frequency drop
>>
>> You mean because we now completely disregard the blocked utilization
>> where a CPU is idle, right?
>
> yes
>
>>
>> Given how PELT works and the recent support for updating IDLE CPUs, we
>> should probably always add contributions for the CFS class.
>>
>>> - making sure that the nr_running counters are up to date when used in schedutil
>>
>> Right... but, if we always add the cfs_rq (to always account for
>> blocked utilization), we don't have this last dependency anymore,
>> do we?
>
> yes
>
>>
>> We still have to account for the util_est dependency.
>>
>> Should I add a patch to this series to disregard cfs_rq::h_nr_running
>> from schedutil as you suggested?
>
> It's probably better to have a separate patch as these are 2 different topics
> - when updating cfs_rq::h_nr_running and when calling cpufreq_update_util
> - should we use runnable or running utilization for CFS
By runnable you don't mean sched_avg::load_avg, right? I got a bit
confused, since runnable means load_avg and running means util_avg.
But I didn't follow here why we are talking about load_avg for
schedutil at all. I am guessing that by "runnable" you mean h_nr_running != 0.
That aside, the "running util" is what was used to drive the CFS
util before Peter's patch (8f111bc357a). That was accounting for the
blocked and decaying utilization, but that patch changed the behavior.
It seems logical that we should just use that and not check h_nr_running
for CFS, so we don't miss the decayed utilization. What is the use
of checking h_nr_running or the state of the runqueue for CFS? I am sure
I am missing something here. :-(
thanks!
- Joel
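As a terminology note, the PELT signals being contrasted here live in struct sched_avg; roughly (field names as in that era's kernel, comments are a paraphrase rather than the kernel's own documentation):

struct sched_avg {
        /* ... sums and period bookkeeping ... */
        unsigned long   load_avg;   /* "runnable": load-weighted time spent runnable (running or waiting on a rq) */
        unsigned long   util_avg;   /* "running": time spent actually executing on a CPU */
};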
Hi Joel,
On 11 April 2018 at 23:34, Joel Fernandes <[email protected]> wrote:
> Hi Vincent,
>
> On Wed, Apr 11, 2018 at 4:56 AM, Vincent Guittot
> <[email protected]> wrote:
>> On 11 April 2018 at 12:15, Patrick Bellasi <[email protected]> wrote:
>>> On 11-Apr 08:57, Vincent Guittot wrote:
>>>> On 10 April 2018 at 13:04, Patrick Bellasi <[email protected]> wrote:
>>>> > On 09-Apr 10:51, Vincent Guittot wrote:
>>>> >> On 6 April 2018 at 19:28, Patrick Bellasi <[email protected]> wrote:
>>>> >> Peter,
>>>> >> what was your goal with adding the condition "if
>>>> >> (rq->cfs.h_nr_running)" for the aggregation of CFS utilization
>>>> >
>>>> > The original intent was to get rid of sched class flags, used to track
>>>> > which class has tasks runnable from within schedutil. The reason was
>>>> > to solve some misalignment between scheduler class status and
>>>> > schedutil status.
>>>>
>>>> This was mainly for RT tasks but it was not the case for CFS tasks
>>>> before commit 8f111bc357aa
>>>
>>> True, but with his solution Peter has actually come up with a unified
>>> interface which is now (and can be IMO) based just on RUNNABLE
>>> counters for each class.
>>
>> But do we really want to only take care of the runnable counters for all classes?
>>
>>>
>>>> > The solution, initially suggested by Viresh, and finally proposed by
>>>> > Peter, was to exploit RQ knowledge directly from within schedutil.
>>>> >
>>>> > The problem is that schedutil updates now depend on two pieces of
>>>> > information: utilization changes and the number of RT and CFS runnable tasks.
>>>> >
>>>> > Thus, using cfs_rq::h_nr_running is not the problem... it's actually
>>>> > part of a much cleaner solution than the code we used to have.
>>>>
>>>> So there are 2 problems there:
>>>> - using cfs_rq::h_nr_running when aggregating cfs utilization which
>>>> generates a lot of frequency drop
>>>
>>> You mean because we now completely disregard the blocked utilization
>>> where a CPU is idle, right?
>>
>> yes
>>
>>>
>>> Given how PELT works and the recent support for updating IDLE CPUs, we
>>> should probably always add contributions for the CFS class.
>>>
>>>> - making sure that the nr_running counters are up to date when used in schedutil
>>>
>>> Right... but, if we always add the cfs_rq (to always account for
>>> blocked utilization), we don't have this last dependency anymore,
>>> do we?
>>
>> yes
>>
>>>
>>> We still have to account for the util_est dependency.
>>>
>>> Should I add a patch to this series to disregard cfs_rq::h_nr_running
>>> from schedutil as you suggested?
>>
>> It's probably better to have a separate patch as these are 2 different topics
>> - when updating cfs_rq::h_nr_running and when calling cpufreq_update_util
>> - should we use runnable or running utilization for CFS
>
> By runnable you don't mean sched_avg::load_avg, right? I got a bit
> confused, since runnable means load_avg and running means util_avg.
Sorry for the confusion. By runnable utilization, I meant taking into
account the number of running tasks (cfs_rq::h_nr_running), as is
done by commit 8f111bc357a.
> But I didn't follow here why we are talking about load_avg for
> schedutil at all. I am guessing by "runnable" you mean h_nr_running !=
> 0.
yes
>
> That aside, the "running util" is what was used to drive the CFS
> util before Peter's patch (8f111bc357a). That was accounting for the
> blocked and decaying utilization, but that patch changed the behavior.
> It seems logical that we should just use that and not check h_nr_running
> for CFS, so we don't miss the decayed utilization. What is the use
> of checking h_nr_running or the state of the runqueue for CFS? I am sure
> I am missing something here. :-(
As Peter mentioned, the change in commit 8f111bc357a was to remove
the test that was arbitrarily clearing the util_avg of a CPU that had
not been updated for more than a tick.
But with the update of blocked idle load, we no longer need to handle the
case of stalled load/utilization.
>
> thanks!
>
> - Joel
On Thu, Apr 12, 2018 at 12:01 AM, Vincent Guittot
<[email protected]> wrote:
>>
>> That aside, the "running util" is what was used to drive the CFS
>> util before Peter's patch (8f111bc357a). That was accounting for the
>> blocked and decaying utilization, but that patch changed the behavior.
>> It seems logical that we should just use that and not check h_nr_running
>> for CFS, so we don't miss the decayed utilization. What is the use
>> of checking h_nr_running or the state of the runqueue for CFS? I am sure
>> I am missing something here. :-(
>
> As Peter mentioned, the change in commit 8f111bc357a was to remove
> the test that was arbitrarily clearing the util_avg of a CPU that had
> not been updated for more than a tick.
>
> But with the update of blocked idle load, we no longer need to handle the
> case of stalled load/utilization.
Thanks a lot for the clarification. It makes sense now.
- Joel
On 11-Apr 17:37, Peter Zijlstra wrote:
> On Wed, Apr 11, 2018 at 05:29:01PM +0200, Vincent Guittot wrote:
> > On 11 April 2018 at 17:14, Peter Zijlstra <[email protected]> wrote:
> > > On Tue, Apr 10, 2018 at 12:04:12PM +0100, Patrick Bellasi wrote:
> > >> On 09-Apr 10:51, Vincent Guittot wrote:
> > >
> > >> > Peter,
> > >> > what was your goal with adding the condition "if
> > >> > (rq->cfs.h_nr_running)" for the aggregation of CFS utilization
> > >>
> > >> The original intent was to get rid of sched class flags, used to track
> > >> which class has tasks runnable from within schedutil. The reason was
> > >> to solve some misalignment between scheduler class status and
> > >> schedutil status.
> > >>
> > >> The solution, initially suggested by Viresh, and finally proposed by
> > >> Peter, was to exploit RQ knowledge directly from within schedutil.
> > >>
> > >> The problem is that schedutil updates now depend on two pieces of
> > >> information: utilization changes and the number of RT and CFS runnable tasks.
> > >>
> > >> Thus, using cfs_rq::h_nr_running is not the problem... it's actually
> > >> part of a much cleaner solution than the code we used to have.
> > >>
> > >> The problem, IMO, is that we now depend on other information which
> > >> needs to be in sync before calling schedutil... and the patch I
> > >> proposed is meant to make it less likely that all the required
> > >> information is not aligned (also in the future).
> > >
> > > Specifically, the h_nr_running test was added to get rid of
> > >
> > > if (delta_ns > TICK_NSEC) {
> > > j_sg_cpu->iowait_boost = 0;
> > > j_sg_cpu->iowait_boost_pending = false;
> > > - j_sg_cpu->util_cfs = 0;
> > >
> > > ^^^^^^^^^^^^^^^^^^^^^^^ that..
> > >
> > > - if (j_sg_cpu->util_dl == 0)
> > > - continue;
> > > }
> > >
> > >
> > > because that felt rather arbitrary.
> >
> > Yes, I agree.
> >
> > With the patch that updates blocked idle load, we should not have the
> > problem of blocked utilization anymore and can get rid of the code above
> > and the h_nr_running test.
>
> Yes, these patches predate those, but indeed, now that we age the
> blocked load consistently it should no longer be required.
After this discussion, I think there is a general consensus about
always adding sg_cpu->util_cfs in cpufreq_schedutil.c::sugov_aggregate_util.
Is that right?
For the rest, what this patch proposes is a code reorganization which
is no longer required to fix this specific issue but is still
required to fix the other issue reported by Vincent: i.e., util_est is
not updated before schedutil.
Thus, I would propose to still keep this refactoring, but in the
context of a different patch that specifically fixes the util_est case.
If there are no major complaints, I'll post a new series in the next
few days.
--
#include <best/regards.h>
Patrick Bellasi
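For concreteness, the consensus Patrick describes (always adding sg_cpu->util_cfs) would amount to something like the following in sugov_aggregate_util(); a sketch of the direction only, not the final patch:

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
        struct rq *rq = cpu_rq(sg_cpu->cpu);
        unsigned long util = sg_cpu->util_dl;

        if (rq->rt.rt_nr_running)
                return sg_cpu->max;

        /*
         * Always add the CFS contribution, so the blocked/decaying
         * utilization of an idle CPU is no longer discarded.
         */
        util += sg_cpu->util_cfs;

        return min(util, sg_cpu->max);
}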
On Thu, Apr 26, 2018 at 12:15:33PM +0100, Patrick Bellasi wrote:
> > Yes, these patches predate those, but indeed, now that we age the
> > blocked load consistently it should no longer be required.
>
> After this discussion, I think there is a general consensus about
> always adding sg_cpu->util_cfs in cpufreq_schedutil.c::sugov_aggregate_util.
>
> Is that right?
Yes, I think so. I've been waiting to see the problem with the blocked
load aging patches sorted out, though, because if we had to revert those
we'd be back to needing the current stuff again.
Luckily, it appears Vincent found something there, so fingers crossed.
> For the rest, what this patch proposes is a code reorganization which
> is no longer required to fix this specific issue but is still
> required to fix the other issue reported by Vincent: i.e., util_est is
> not updated before schedutil.
>
> Thus, I would propose to still keep this refactoring, but in the
> context of a different patch that specifically fixes the util_est case.
>
> If there are no major complaints, I'll post a new series in the next
> few days.
Fair enough..