2018-11-05 14:55:02

by Patrick Bellasi

Subject: [PATCH v2 0/3] util_est regression fixup and cleanups

This is a respin of:

https://lore.kernel.org/lkml/[email protected]/

rebased on v4.20-rc1. It addresses Peter's comments and adds a couple
of additional cleanup patches on top.

Tests on a 40 CPU Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz system
still report the ~10-15% Execl Throughput improvement after applying
the first patch. Those benefits disappear if we remove the
additional test on "current == p" which Peter was asking about.

I guess the race condition described in the new inline comment I've now
added could be the reason the additional test is required, but I have
not really verified that guess.
I've just kept both conditions but swapped them, since we are more
likely to call cpu_util_without() with a task which is eventually marked
as task_on_rq_queued().

The second patch is pretty simple, while the last one implements what
Peter suggested in the previous review. I did not use something similar
to sub_positive(), as suggested by Peter, because in my tests that
implementation seemed to negatively affect the Execl Throughput
results, reducing the speedup we get with the proposed version.

Best Patrick

Patrick Bellasi (3):
sched/fair: util_est: fix cpu_util_wake for execl
sched/fair: util_est: mask UTIL_AVG_UNCHANGED usages
sched/fair: add lsub_positive and use it consistently

kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++-----------
1 file changed, 64 insertions(+), 21 deletions(-)

--
2.18.0



2018-11-05 14:55:09

by Patrick Bellasi

Subject: [PATCH v2 1/3] sched/fair: util_est: fix cpu_util_wake for execl

A ~10% regression has been reported for UnixBench's execl throughput
test:

Message-ID: <20180402032000.GD3101@yexl-desktop>
Message-ID: <[email protected]>

That test is pretty simple: it does a "recursive" execve syscall on the
same binary. Starting from the syscall, this sequence is possible:

  do_execve()
    do_execveat_common()
      __do_execve_file()
        sched_exec()
          select_task_rq_fair()        <==| Task already enqueued
            find_idlest_cpu()
              find_idlest_group()
                capacity_spare_wake()  <==| Functions not called from
                  cpu_util_wake()         | the wakeup path

which means we can end up calling cpu_util_wake() not only from the
"wakeup path", as its name would suggest. Indeed, the task doing an
execve syscall is already enqueued on the CPU we want to get the
cpu_util_wake() for.

The estimated utilization for a CPU computed in cpu_util_wake() was
encoded under the assumption that the function can be called only from
the wakeup path. If instead the task is already enqueued, we end up with
a utilization which does not remove the current task's contribution from
the estimated utilization of the CPU.
This wrongly assumes a reduced spare capacity on the current CPU and
increases the chances of migrating the task on execve.
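
To make the effect concrete, here is a small worked example with made-up
numbers (illustration only, not taken from the benchmark), for a CPU
where the execl task p is current and still enqueued:

  capacity_of(cpu)                       = 1024
  cfs_rq util_avg (including p)          =  300
  cfs_rq util_est.enqueued (including p) =  300
  task_util(p) == _task_util_est(p)      =  300

  util_avg after discounting task_util(p):  300 - 300 = 0

  before: util = max(0, 300)       = 300  ->  spare =  724
  after : util = max(0, 300 - 300) =   0  ->  spare = 1024

With p's estimated utilization not discounted, the CPU looks busier than
it really is and find_idlest_group() is more likely to move the task to
another CPU on execve.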

The regression is tracked down to:

commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates")

because in that patch we turn on by default the UTIL_EST sched feature.
However, the real issue is introduced by:

commit f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths")

Let's fix this by ensuring we always discount the task's estimated
utilization from the CPU's estimated utilization when the task is also
the current one. The same benchmark from the bug report, executed on a
dual socket 40 CPU Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
reports these "Execl Throughput" figures (the higher the better):

mainline : 48136.5 lps
mainline+fix : 55376.5 lps

which corresponds to a 15% speedup.

Moreover, since {cpu_util,capacity_spare}_wake() are not really used
only from the wakeup path, let's remove this ambiguity by using better
matching names: {cpu_util,capacity_spare}_without().
While at it, let's also improve the existing documentation.

Signed-off-by: Patrick Bellasi <[email protected]>
Fixes: f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths")
Reported-by: Aaron Lu <[email protected]>
Reported-by: Ye Xiaolong <[email protected]>
Tested-by: Aaron Lu <[email protected]>
Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++----------
1 file changed, 48 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee271bb661cc..473a9cc559e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5674,11 +5674,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return target;
}

-static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
+static unsigned long cpu_util_without(int cpu, struct task_struct *p);

-static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+static unsigned long capacity_spare_without(int cpu, struct task_struct *p)
{
- return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
+ return max_t(long, capacity_of(cpu) - cpu_util_without(cpu, p), 0);
}

/*
@@ -5738,7 +5738,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);

- spare_cap = capacity_spare_wake(i, p);
+ spare_cap = capacity_spare_without(i, p);

if (spare_cap > max_spare_cap)
max_spare_cap = spare_cap;
@@ -5889,8 +5889,8 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
return prev_cpu;

/*
- * We need task's util for capacity_spare_wake, sync it up to prev_cpu's
- * last_update_time.
+ * We need task's util for capacity_spare_without, sync it up to
+ * prev_cpu's last_update_time.
*/
if (!(sd_flag & SD_BALANCE_FORK))
sync_entity_load_avg(&p->se);
@@ -6216,10 +6216,19 @@ static inline unsigned long cpu_util(int cpu)
}

/*
- * cpu_util_wake: Compute CPU utilization with any contributions from
- * the waking task p removed.
+ * cpu_util_without: compute cpu utilization without any contributions from *p
+ * @cpu: the CPU which utilization is requested
+ * @p: the task which utilization should be discounted
+ *
+ * The utilization of a CPU is defined by the utilization of tasks currently
+ * enqueued on that CPU as well as tasks which are currently sleeping after an
+ * execution on that CPU.
+ *
+ * This method returns the utilization of the specified CPU by discounting the
+ * utilization of the specified task, whenever the task is currently
+ * contributing to the CPU utilization.
*/
-static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_without(int cpu, struct task_struct *p)
{
struct cfs_rq *cfs_rq;
unsigned int util;
@@ -6231,7 +6240,7 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
cfs_rq = &cpu_rq(cpu)->cfs;
util = READ_ONCE(cfs_rq->avg.util_avg);

- /* Discount task's blocked util from CPU's util */
+ /* Discount task's util from CPU's util */
util -= min_t(unsigned int, util, task_util(p));

/*
@@ -6240,14 +6249,14 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
* a) if *p is the only task sleeping on this CPU, then:
* cpu_util (== task_util) > util_est (== 0)
* and thus we return:
- * cpu_util_wake = (cpu_util - task_util) = 0
+ * cpu_util_without = (cpu_util - task_util) = 0
*
* b) if other tasks are SLEEPING on this CPU, which is now exiting
* IDLE, then:
* cpu_util >= task_util
* cpu_util > util_est (== 0)
* and thus we discount *p's blocked utilization to return:
- * cpu_util_wake = (cpu_util - task_util) >= 0
+ * cpu_util_without = (cpu_util - task_util) >= 0
*
* c) if other tasks are RUNNABLE on that CPU and
* util_est > cpu_util
@@ -6260,8 +6269,33 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
* covered by the following code when estimated utilization is
* enabled.
*/
- if (sched_feat(UTIL_EST))
- util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
+ if (sched_feat(UTIL_EST)) {
+ unsigned int estimated =
+ READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+ /*
+ * Despite the following checks we still have a small window
+ * for a possible race, when an execl's select_task_rq_fair()
+ * races with LB's detach_task():
+ *
+ * detach_task()
+ * p->on_rq = TASK_ON_RQ_MIGRATING;
+ * ---------------------------------- A
+ * deactivate_task() \
+ * dequeue_task() + RaceTime
+ * util_est_dequeue() /
+ * ---------------------------------- B
+ *
+ * The additional check on "current == p" is required to
+ * properly fix the execl regression and it helps in further
+ * reducing the chances for the above race.
+ */
+ if (unlikely(task_on_rq_queued(p) || current == p)) {
+ estimated -= min_t(unsigned int, estimated,
+ (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ }
+ util = max(util, estimated);
+ }

/*
* Utilization (estimated) can exceed the CPU capacity, thus let's
--
2.18.0


2018-11-05 14:55:22

by Patrick Bellasi

Subject: [PATCH v2 2/3] sched/fair: util_est: mask UTIL_AVG_UNCHANGED usages

_task_util_est() is mainly used to add/remove the task's contribution
to/from the rq's estimated utilization at task enqueue/dequeue time.
In both cases we ensure the UTIL_AVG_UNCHANGED flag is set, to keep
consistency between enqueue and dequeue time while still being
transparent to update_load_avg() calls which will eventually reset the
flag.

Let's move the flag forcing into _task_util_est() itself, so that we
can simplify the calling code by hiding that estimated utilization
implementation detail in one of its internal functions.

This will also affect the "public" API task_util_est(), but we know that
the flag will (at most) impact just the LSB of the estimated
utilization, thus it's certainly acceptable.
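
For illustration, a minimal user-space sketch of why the impact is
limited to the LSB; it assumes UTIL_AVG_UNCHANGED is the 0x1 flag stored
in the LSB of util_est (values below are made up):

  /* Illustration only: UTIL_AVG_UNCHANGED redefined locally as 0x1 */
  #include <stdio.h>

  #define UTIL_AVG_UNCHANGED 0x1

  int main(void)
  {
          unsigned long ewma = 416, enqueued = 352;
          unsigned long plain  = ewma > enqueued ? ewma : enqueued;
          unsigned long masked = plain | UTIL_AVG_UNCHANGED;

          /* ORing the LSB flag changes the value by at most 1 */
          printf("plain=%lu masked=%lu delta=%lu\n",
                 plain, masked, masked - plain);
          return 0;
  }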

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
kernel/sched/fair.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 473a9cc559e8..aeb37fe4dbb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3604,7 +3604,7 @@ static inline unsigned long _task_util_est(struct task_struct *p)
{
struct util_est ue = READ_ONCE(p->se.avg.util_est);

- return max(ue.ewma, ue.enqueued);
+ return (max(ue.ewma, ue.enqueued) | UTIL_AVG_UNCHANGED);
}

static inline unsigned long task_util_est(struct task_struct *p)
@@ -3622,7 +3622,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,

/* Update root cfs_rq's estimated utilization */
enqueued = cfs_rq->avg.util_est.enqueued;
- enqueued += (_task_util_est(p) | UTIL_AVG_UNCHANGED);
+ enqueued += _task_util_est(p);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

@@ -3650,8 +3650,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

/* Update root cfs_rq's estimated utilization */
ue.enqueued = cfs_rq->avg.util_est.enqueued;
- ue.enqueued -= min_t(unsigned int, ue.enqueued,
- (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

/*
@@ -6292,7 +6291,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
*/
if (unlikely(task_on_rq_queued(p) || current == p)) {
estimated -= min_t(unsigned int, estimated,
- (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ _task_util_est(p));
}
util = max(util, estimated);
}
--
2.18.0


2018-11-05 14:55:28

by Patrick Bellasi

Subject: [PATCH v2 3/3] sched/fair: add lsub_positive and use it consistently

The following pattern:

var -= min_t(typeof(var), var, val);

is used multiple times in fair.c.

The existing sub_positive() already capture that pattern but it adds
also explicit load-sotre to properly support lockless observations.
In other cases, the patter above is used to update local, and/or not
concurrently accessed, variables.

Let's add a simpler version of sub_positive, targeted to local variables
updates, which gives the same readability benefits at calling sites
without enforcing {READ,WRITE}_ONCE barriers.
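
For a quick sanity check of the clamping semantics, here is a minimal
user-space sketch; min_t() is re-defined locally just for the example
and is not the kernel implementation:

  #include <stdio.h>

  #define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

  #define lsub_positive(_ptr, _val) do {                  \
          typeof(_ptr) ptr = (_ptr);                      \
          *ptr -= min_t(typeof(*ptr), *ptr, _val);        \
  } while (0)

  int main(void)
  {
          unsigned int util = 100;

          lsub_positive(&util, 30);       /* 100 - 30 -> 70 */
          printf("%u\n", util);
          lsub_positive(&util, 500);      /* clamps at 0, no underflow */
          printf("%u\n", util);
          return 0;
  }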

Signed-off-by: Patrick Bellasi <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
kernel/sched/fair.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aeb37fe4dbb1..d50c739127d6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2734,6 +2734,17 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
WRITE_ONCE(*ptr, res); \
} while (0)

+/*
+ * Remove and clamp on negative, from a local variable.
+ *
+ * A variant of sub_positive which do not use explicit load-store
+ * and thus optimized for local variable updates.
+ */
+#define lsub_positive(_ptr, _val) do { \
+ typeof(_ptr) ptr = (_ptr); \
+ *ptr -= min_t(typeof(*ptr), *ptr, _val); \
+} while (0)
+
#ifdef CONFIG_SMP
static inline void
enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -4639,7 +4650,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
cfs_b->distribute_running = 0;
throttled = !list_empty(&cfs_b->throttled_cfs_rq);

- cfs_b->runtime -= min(runtime, cfs_b->runtime);
+ lsub_positive(&cfs_b->runtime, runtime);
}

/*
@@ -4773,7 +4784,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)

raw_spin_lock(&cfs_b->lock);
if (expires == cfs_b->runtime_expires)
- cfs_b->runtime -= min(runtime, cfs_b->runtime);
+ lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock(&cfs_b->lock);
}
@@ -6240,7 +6251,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
util = READ_ONCE(cfs_rq->avg.util_avg);

/* Discount task's util from CPU's util */
- util -= min_t(unsigned int, util, task_util(p));
+ lsub_positive(&util, task_util(p));

/*
* Covered cases:
@@ -6289,10 +6300,9 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
* properly fix the execl regression and it helps in further
* reducing the chances for the above race.
*/
- if (unlikely(task_on_rq_queued(p) || current == p)) {
- estimated -= min_t(unsigned int, estimated,
- _task_util_est(p));
- }
+ if (unlikely(task_on_rq_queued(p) || current == p))
+ lsub_positive(&estimated, _task_util_est(p));
+
util = max(util, estimated);
}

--
2.18.0


2018-11-12 04:17:42

by Ingo Molnar

Subject: Re: [PATCH v2 3/3] sched/fair: add lsub_positive and use it consistently


* Patrick Bellasi <[email protected]> wrote:

> The following pattern:
>
> var -= min_t(typeof(var), var, val);
>
> is used multiple times in fair.c.
>
> The existing sub_positive() already capture that pattern but it adds
> also explicit load-sotre to properly support lockless observations.
> In other cases, the patter above is used to update local, and/or not
> concurrently accessed, variables.

> Let's add a simpler version of sub_positive, targeted to local variables
> updates, which gives the same readability benefits at calling sites
> without enforcing {READ,WRITE}_ONCE barriers.

> +/*
> + * Remove and clamp on negative, from a local variable.
> + *
> + * A variant of sub_positive which do not use explicit load-store
> + * and thus optimized for local variable updates.

I fixed up the two typos ('load-sotre', 'patter'), and fixed eight
grammar mistakes (!) in the changelog and in the code comments, but
*please* read the changelogs and code you are writing, this is scheduler
code after all ...

( Please also use the fn() notation in changelogs consistently in the
future: first you use 'sub_positive()' correctly then it becomes
'sub_positive'. )

Anyway, the fix looks sane so I've applied it to sched/urgent.

Thanks,

Ingo

Subject: [tip:sched/core] sched/fair: Mask UTIL_AVG_UNCHANGED usages

Commit-ID: 92a801e5d5b7a893881c1676b15dd246727ccd16
Gitweb: https://git.kernel.org/tip/92a801e5d5b7a893881c1676b15dd246727ccd16
Author: Patrick Bellasi <[email protected]>
AuthorDate: Mon, 5 Nov 2018 14:53:59 +0000
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 12 Nov 2018 06:17:52 +0100

sched/fair: Mask UTIL_AVG_UNCHANGED usages

_task_util_est() is mainly used to add/remove the task's contribution
to/from the rq's estimated utilization at task enqueue/dequeue time.
In both cases we ensure the UTIL_AVG_UNCHANGED flag is set, to keep
consistency between enqueue and dequeue time while still being
transparent to update_load_avg() calls which will eventually reset the
flag.

Let's move the flag forcing into _task_util_est() itself, so that we
can simplify the calling code by hiding that estimated utilization
implementation detail in one of its internal functions.

This will also affect the "public" API task_util_est(), but we know that
the flag will (at most) impact just the LSB of the estimated
utilization, thus it's certainly acceptable.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1ccf1ddd37a..28ee60cabba1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3604,7 +3604,7 @@ static inline unsigned long _task_util_est(struct task_struct *p)
{
struct util_est ue = READ_ONCE(p->se.avg.util_est);

- return max(ue.ewma, ue.enqueued);
+ return (max(ue.ewma, ue.enqueued) | UTIL_AVG_UNCHANGED);
}

static inline unsigned long task_util_est(struct task_struct *p)
@@ -3622,7 +3622,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,

/* Update root cfs_rq's estimated utilization */
enqueued = cfs_rq->avg.util_est.enqueued;
- enqueued += (_task_util_est(p) | UTIL_AVG_UNCHANGED);
+ enqueued += _task_util_est(p);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

@@ -3650,8 +3650,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

/* Update root cfs_rq's estimated utilization */
ue.enqueued = cfs_rq->avg.util_est.enqueued;
- ue.enqueued -= min_t(unsigned int, ue.enqueued,
- (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ ue.enqueued -= min_t(unsigned int, ue.enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

/*
@@ -6292,7 +6291,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
*/
if (unlikely(task_on_rq_queued(p) || current == p)) {
estimated -= min_t(unsigned int, estimated,
- (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ _task_util_est(p));
}
util = max(util, estimated);
}