2018-10-30 16:10:50

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH] sched/fair: util_est: fix cpu_util_wake for execl

A ~10% regression has been reported for UnixBench's execl throughput
test:

Message-ID: <20180402032000.GD3101@yexl-desktop>
Message-ID: <[email protected]>

That test is pretty simple, it does a "recursive" execve syscall on the
same binary. Starting from the syscall, this sequence is possible:

do_execve()
do_execveat_common()
__do_execve_file()
sched_exec()
select_task_rq_fair() <==| Task already enqueued
find_idlest_cpu()
find_idlest_group()
capacity_spare_wake() <==| Functions not called from
cpu_util_wake() | the wakeup path

which means we can end up calling cpu_util_wake() not only from the
"wakeup path", as its name would suggest. Indeed, the task doing an
execve syscall is already enqueued on the CPU we want to get the
cpu_util_wake for.

The estimated utilization for a CPU computed in cpu_util_wake() was
encoded under the assumption that function can be called only from the
wakeup path. If instead the task is already enqueued, we end up with a
utilization which does not remove the current task's contribution form
the estimated utilization of the CPU.
This will wrongly assume a reduced spare capacity on the current CPU and
increase the chances to migrate the task on execve.

The regression is tracked down to:

commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

because in that patch we turn on by default the UTIL_EST sched feature.
However, the real issue is introduced by:

commit f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths)

Let's fix this by ensuring to always discount the task estimated
utilization from the CPU's estimated utilization when the task is also
the current one. The same benchmark of the bug report, executed on a
dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
reports these "Execl Throughput" figures (higher the better):

mainline : 48136.5 lps
mainline+fix : 55376.5 lps

which correspond to a 15% speedup.

Moreover, since {cpu_util,capacity_spare}_wake() are not really only
used from the wakeup path, let's remove this ambiguity by using a better
matching name: {cpu_util,capacity_spare}_without().
Since we are at that, let's also improve the existing documentation.

Signed-off-by: Patrick Bellasi <[email protected]>
Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths)
Reported-by: Aaron Lu <[email protected]>
Reported-by: Ye Xiaolong <[email protected]>
Tested-by: Aaron Lu <[email protected]>
Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 908c9cdae2f0..bdc0be267621 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5672,11 +5672,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return target;
}

-static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
+static unsigned long cpu_util_without(int cpu, struct task_struct *p);

-static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+static unsigned long capacity_spare_without(int cpu, struct task_struct *p)
{
- return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
+ return max_t(long, capacity_of(cpu) - cpu_util_without(cpu, p), 0);
}

/*
@@ -5736,7 +5736,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);

- spare_cap = capacity_spare_wake(i, p);
+ spare_cap = capacity_spare_without(i, p);

if (spare_cap > max_spare_cap)
max_spare_cap = spare_cap;
@@ -5887,8 +5887,8 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
return prev_cpu;

/*
- * We need task's util for capacity_spare_wake, sync it up to prev_cpu's
- * last_update_time.
+ * We need task's util for capacity_spare_without, sync it up to
+ * prev_cpu's last_update_time.
*/
if (!(sd_flag & SD_BALANCE_FORK))
sync_entity_load_avg(&p->se);
@@ -6214,10 +6214,19 @@ static inline unsigned long cpu_util(int cpu)
}

/*
- * cpu_util_wake: Compute CPU utilization with any contributions from
- * the waking task p removed.
+ * cpu_util_without: compute cpu utilization without any contributions from *p
+ * @cpu: the CPU which utilization is requested
+ * @p: the task which utilization should be discounted
+ *
+ * The utilization of a CPU is defined by the utilization of tasks currently
+ * enqueued on that CPU as well as tasks which are currently sleeping after an
+ * execution on that CPU.
+ *
+ * This method returns the utilization of the specified CPU by discounting the
+ * utilization of the specified task, whenever the task is currently
+ * contributing to the CPU utilization.
*/
-static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_without(int cpu, struct task_struct *p)
{
struct cfs_rq *cfs_rq;
unsigned int util;
@@ -6238,14 +6247,14 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
* a) if *p is the only task sleeping on this CPU, then:
* cpu_util (== task_util) > util_est (== 0)
* and thus we return:
- * cpu_util_wake = (cpu_util - task_util) = 0
+ * cpu_util_without = (cpu_util - task_util) = 0
*
* b) if other tasks are SLEEPING on this CPU, which is now exiting
* IDLE, then:
* cpu_util >= task_util
* cpu_util > util_est (== 0)
* and thus we discount *p's blocked utilization to return:
- * cpu_util_wake = (cpu_util - task_util) >= 0
+ * cpu_util_without = (cpu_util - task_util) >= 0
*
* c) if other tasks are RUNNABLE on that CPU and
* util_est > cpu_util
@@ -6258,8 +6267,17 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
* covered by the following code when estimated utilization is
* enabled.
*/
- if (sched_feat(UTIL_EST))
- util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
+ if (sched_feat(UTIL_EST)) {
+ unsigned int estimated =
+ READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+ if (unlikely(current == p || task_on_rq_queued(p))) {
+ estimated -= min_t(unsigned int, estimated,
+ (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ }
+
+ util = max(util, estimated);
+ }

/*
* Utilization (estimated) can exceed the CPU capacity, thus let's
--
2.18.0



2018-10-31 18:47:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: util_est: fix cpu_util_wake for execl

On Tue, Oct 30, 2018 at 04:09:47PM +0000, Patrick Bellasi wrote:

> Let's fix this by ensuring to always discount the task estimated
> utilization from the CPU's estimated utilization when the task is also
> the current one. The same benchmark of the bug report, executed on a
> dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
> reports these "Execl Throughput" figures (higher the better):

Before this we have:

/* Discount task's blocked util from CPU's util */
util -= min_t(unsigned int, util, task_util(p));

at the very least that comment is now inaccurate, since @p might not be
blocked.

> @@ -6258,8 +6267,17 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> * covered by the following code when estimated utilization is
> * enabled.
> */
> - if (sched_feat(UTIL_EST))
> - util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
> + if (sched_feat(UTIL_EST)) {
> + unsigned int estimated =
> + READ_ONCE(cfs_rq->avg.util_est.enqueued);
> +
> + if (unlikely(current == p || task_on_rq_queued(p))) {

I'm confused by the need for 'current == p', afaict task_on_rq_queued(p)
is sufficient -- we've already established task_cpu(p) == cpu earlier.

> + estimated -= min_t(unsigned int, estimated,
> + (_task_util_est(p) | UTIL_AVG_UNCHANGED));
> + }
> +
> + util = max(util, estimated);
> + }

Also, I think it is about time we find a suitable name for:

#define xxx(_var, _val) do { \
typeof(_var) var = (_var); \
typeof(_var) val = (_val); \
typeof(_var) res = var - val; \
if (res > var) \
res = 0; \
(_var) = res; \
} while (0)

Which is basically sub_positive() but without the READ_ONCE/WRITE_ONCE
stuff. We do that:

var -= min_t(typeof(var), var, val);

pattern _all_ over.

2018-11-02 09:55:42

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: util_est: fix cpu_util_wake for execl

On 31-Oct 19:45, Peter Zijlstra wrote:
> On Tue, Oct 30, 2018 at 04:09:47PM +0000, Patrick Bellasi wrote:
>
> > Let's fix this by ensuring to always discount the task estimated
> > utilization from the CPU's estimated utilization when the task is also
> > the current one. The same benchmark of the bug report, executed on a
> > dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
> > reports these "Execl Throughput" figures (higher the better):
>
> Before this we have:
>
> /* Discount task's blocked util from CPU's util */
> util -= min_t(unsigned int, util, task_util(p));
>
> at the very least that comment is now inaccurate, since @p might not be
> blocked.

Right... will fix this too.

> > @@ -6258,8 +6267,17 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> > * covered by the following code when estimated utilization is
> > * enabled.
> > */
> > - if (sched_feat(UTIL_EST))
> > - util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
> > + if (sched_feat(UTIL_EST)) {
> > + unsigned int estimated =
> > + READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +
> > + if (unlikely(current == p || task_on_rq_queued(p))) {
>
> I'm confused by the need for 'current == p', afaict task_on_rq_queued(p)
> is sufficient -- we've already established task_cpu(p) == cpu earlier.

Mmm... you right, I've got confused by the fact that current is
removed from the RBTree, but we keep tracking it as:

on_rq = TASK_ON_RQ_QUEUED

... unless, select_task_rq_fair() races with LB's:

detach_task()
p->on_rq = TASK_ON_RQ_MIGRATING;
-----------------------------------A
deactivate_task() \
dequeue_task() +- RaceTime
util_est_dequeue() /
-----------------------------------B
set_task_cpu()
migrate_task_rq{_fair}()
detach_entity_cfs_rq()

where, in [A..B] we will still avoid to discount *p's estimated
utilization. :/

Do you think we can live with that for the time being, maybe by just
adding a comment, or should we try to close that too ?

Eventually, the (current == p) check, maybe moved to the right of the
OR condition above, should certainly close the race window for the
specific UnixBench's execl case. Assuming for example the execl is
executed by a misfit task which is target of an active load balance...


> > + estimated -= min_t(unsigned int, estimated,
> > + (_task_util_est(p) | UTIL_AVG_UNCHANGED));
> > + }
> > +
> > + util = max(util, estimated);
> > + }
>
> Also, I think it is about time we find a suitable name for:
>
> #define xxx(_var, _val) do { \

remove_contrib(_var, _val) ?

> typeof(_var) var = (_var); \
> typeof(_var) val = (_val); \
> typeof(_var) res = var - val; \
> if (res > var) \
> res = 0; \
> (_var) = res; \
> } while (0)
>
> Which is basically sub_positive() but without the READ_ONCE/WRITE_ONCE
> stuff.

Perhaps there are still some paths in where sub_positive() can be
recycled... will look better into that and see what we can do on that
polishing side. However, I'll keep all that in a different patch.

> We do that:
>
> var -= min_t(typeof(var), var, val);
>
> pattern _all_ over.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi